This project contains the source code for the paper 'Towards Demystifying Dimensions of Source Code Embeddings' (arXiv, ACM DL), accepted at the RL+SE&PL'20 workshop, co-located with the ESEC/FSE'20 conference.
├── HandcraftedExtractor/ # extractor to extract handcrafted embeddings of Java programs.
├── images/ # some figures and tables from the paper for README.
├── report/ # scripts to generate figures and tables.
├── svm-handcrafted/ # scripts to dump SVMlight format data and train models.
├── temp/ # temporary working files; can be skipped.
- Download the Java-large dataset.
- Extract methods from class files (i.e., JavaMethodExtractor).
- Apply DuplicateCodeDetector to remove near-duplicate methods.
- Select the ten most frequent methods (i.e., topN).
- Run HandcraftedExtractor to extract handcrafted embeddings.
- Check code2vec for path-based code vectors.
- Check naive_embeddings for sequence-based naive baselines.
- Check complexity_embeddings for code complexity metrics.
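The topN selection step above can be sketched as follows. This is a minimal illustration, not code from the released scripts; the input list and helper name are hypothetical:

```python
from collections import Counter

def select_top_n(method_names, n=10):
    """Return the n most frequent method names with their counts."""
    counts = Counter(method_names)
    return counts.most_common(n)

# Hypothetical names as they might come out of JavaMethodExtractor.
names = ["equals", "toString", "equals", "hashCode", "equals", "toString"]
print(select_top_n(names, n=2))  # → [('equals', 3), ('toString', 2)]
```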
- Download the source code/binaries of SVMlight.
- Run data_light.py to dump data in SVMlight format.
- Run model_light.py to train and evaluate SVMlight models.
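data_light.py dumps data in SVMlight's text format: one example per line, a label followed by ascending `index:value` pairs with zero-valued features omitted. A minimal encoder sketch (the function name and example inputs are illustrative, not taken from data_light.py):

```python
def to_svmlight_line(label, features):
    """Encode one example as an SVMlight line: '<label> <idx>:<val> ...'.

    `features` maps 1-based feature indices to values; indices are
    emitted in ascending order and zero values are skipped.
    """
    parts = [str(label)]
    for idx in sorted(features):
        if features[idx] != 0:
            parts.append(f"{idx}:{features[idx]}")
    return " ".join(parts)

# Hypothetical: class label 3 with two non-zero handcrafted features.
print(to_svmlight_line(3, {2: 1, 7: 0, 15: 4}))  # → "3 2:1 15:4"
```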
- Observation 1: The presence of a feature, rather than its number of occurrences, can be used to recognize a method. On average, binary vectors increased the F1-score by up to 1% compared to numeric vectors.
- Observation 2: Code complexity features, alongside the method-only features, help recognize a method better. On average, the additional code complexity features increased the F1-score by up to 2.2% compared to the method-only features.
- Observation 3: The handcrafted features significantly outperform the sequence of characters (by 59.65%) and the sequence of tokens (by 26.85%) for predicting method names.
- Observation 4: The handcrafted features, despite a much smaller feature set, achieve results highly comparable to the higher-dimensional embeddings of a deep neural model such as code2vec.
- Observation 5: Compared to the handcrafted features, information gain is more evenly distributed across the dimensions of the code2vec embeddings. Moreover, the code2vec embeddings are more resilient to the removal of dimensions with low information gain than the handcrafted features.
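Observation 1 contrasts numeric (count) vectors with binary (presence) vectors. The binarization itself is straightforward; a minimal sketch with illustrative data (not taken from the repository's scripts):

```python
def binarize(counts):
    """Map a numeric count vector to a binary presence vector:
    1 if the feature occurs at least once, else 0."""
    return [1 if c > 0 else 0 for c in counts]

# Illustrative occurrence counts for five handcrafted features.
numeric = [0, 3, 1, 0, 7]
print(binarize(numeric))  # → [0, 1, 1, 0, 1]
```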
Towards Demystifying Dimensions of Source Code Embeddings
@inproceedings{rabin2020demystifying,
author = {Rabin, Md Rafiqul Islam and Mukherjee, Arjun and Gnawali, Omprakash and Alipour, Mohammad Amin},
title = {Towards Demystifying Dimensions of Source Code Embeddings},
year = {2020},
isbn = {9781450381253},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://dl.acm.org/doi/10.1145/3416506.3423580},
doi = {10.1145/3416506.3423580},
booktitle = {Proceedings of the 1st ACM SIGSOFT International Workshop on Representation Learning for Software Engineering and Program Languages},
pages = {29–38},
numpages = {10},
keywords = {Models of Code, Source Code Embeddings, Interpretability, Source Code Representation},
location = {Virtual, USA},
series = {RL+SE\&PL 2020}
}