This project contains the source code for the paper 'Towards Demystifying Dimensions of Source Code Embeddings' (arXiv, ACM DL), accepted at the RL+SE&PL'20 workshop, co-located with the ESEC/FSE'20 conference.
├── HandcraftedExtractor/ # extractor to extract handcrafted embeddings of Java programs.
├── images/ # some figures and tables from the paper for README.
├── report/ # scripts to generate figures and tables.
├── svm-handcrafted/ # scripts to dump SVMlight format data and train models.
├── temp/ # temporary working files; can be skipped.
- Download the Java-large dataset.
- Extract methods from class files (i.e., JavaMethodExtractor).
- Apply DuplicateCodeDetector to remove near-duplicate methods.
- Select the ten most frequent methods (i.e., topN).
- Run HandcraftedExtractor to extract handcrafted embeddings.
- Check code2vec for path-based code vectors.
- Check naive_embeddings for sequence-based naive baselines.
- Check complexity_embeddings for code complexity metrics.
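The topN selection step above can be sketched as follows. This is a minimal illustration, not code from the released scripts; the input list and helper name are hypothetical:

```python
from collections import Counter

def select_top_n(method_names, n=10):
    """Return the n most frequent method names with their counts."""
    counts = Counter(method_names)
    return counts.most_common(n)

# Hypothetical names as they might come out of JavaMethodExtractor.
names = ["equals", "toString", "equals", "hashCode", "equals", "toString"]
print(select_top_n(names, n=2))  # → [('equals', 3), ('toString', 2)]
```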
- Download the source code/binaries of SVMlight.
- Run data_light.py to dump data in SVMlight format.
- Run model_light.py to train and evaluate SVMlight models.
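data_light.py dumps data in SVMlight's text format: one example per line, a label followed by ascending `index:value` pairs with zero-valued features omitted. A minimal encoder sketch (the function name and example inputs are illustrative, not taken from data_light.py):

```python
def to_svmlight_line(label, features):
    """Encode one example as an SVMlight line: '<label> <idx>:<val> ...'.

    `features` maps 1-based feature indices to values; indices are
    emitted in ascending order and zero values are skipped.
    """
    parts = [str(label)]
    for idx in sorted(features):
        if features[idx] != 0:
            parts.append(f"{idx}:{features[idx]}")
    return " ".join(parts)

# Hypothetical: class label 3 with two non-zero handcrafted features.
print(to_svmlight_line(3, {2: 1, 7: 0, 15: 4}))  # → "3 2:1 15:4"
```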
- Observation 1: The presence of a feature, rather than its number of occurrences, can be used to recognize a method. On average, binary vectors increased the F1-score by up to 1% compared to numeric vectors.
- Observation 2: Code complexity features, alongside the method-only features, help recognize a method better. On average, the additional code complexity features increased the F1-score by up to 2.2% compared to the method-only features.
- Observation 3: The handcrafted features significantly outperform the sequence of characters (by 59.65%) and the sequence of tokens (by 26.85%) for predicting method names.
- Observation 4: The handcrafted features, despite a much smaller feature set, achieve results highly comparable to the higher-dimensional embeddings of a deep neural model such as code2vec.
- Observation 5: Compared to the handcrafted features, information gain is more evenly distributed across the dimensions of the code2vec embeddings. Moreover, the code2vec embeddings are more resilient to the removal of dimensions with low information gain than the handcrafted features.
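Observation 1 contrasts numeric (count) vectors with binary (presence) vectors. The binarization itself is straightforward; a minimal sketch with illustrative data (not taken from the repository's scripts):

```python
def binarize(counts):
    """Map a numeric count vector to a binary presence vector:
    1 if the feature occurs at least once, else 0."""
    return [1 if c > 0 else 0 for c in counts]

# Illustrative occurrence counts for five handcrafted features.
numeric = [0, 3, 1, 0, 7]
print(binarize(numeric))  # → [0, 1, 1, 0, 1]
```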
Towards Demystifying Dimensions of Source Code Embeddings
@inproceedings{rabin2020demystifying,
author = {Rabin, Md Rafiqul Islam and Mukherjee, Arjun and Gnawali, Omprakash and Alipour, Mohammad Amin},
title = {Towards Demystifying Dimensions of Source Code Embeddings},
year = {2020},
isbn = {9781450381253},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://dl.acm.org/doi/10.1145/3416506.3423580},
doi = {10.1145/3416506.3423580},
booktitle = {Proceedings of the 1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages},
pages = {29–38},
numpages = {10},
keywords = {Models of Code, Source Code Embeddings, Interpretability, Source Code Representation},
location = {Virtual, USA},
series = {RL+SE\&PL 2020}
}