
Deep Learning based Defect Prediction Model for Source Code

This project implements a line-level defect prediction model for software source code, built from scratch. Line-level defect classifiers predict which lines of code are likely to be buggy.

The data used for this project has been scraped from multiple GitHub repositories and organized into dataframes with the following four columns (an illustrative example follows the list):

  1. instance: the line under test
  2. context before: all lines of the enclosing function that appear before the line under test
  3. context after: all lines of the enclosing function that appear after the line under test
  4. is buggy: the label of the line under test; 0 means the line is not buggy, 1 means it is buggy
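To make the expected layout concrete, here is a minimal sketch of such a dataframe. The column names, storage format, and example row are illustrative assumptions, not taken from the actual scraped dataset.

```python
import pandas as pd

# Illustrative only: one row per line under test, with its surrounding
# function context and a binary bugginess label.
df = pd.DataFrame({
    "instance":       ["return a / b"],
    "context_before": ["def divide(a, b):"],
    "context_after":  [""],
    "is_buggy":       [1],   # 1 = buggy (e.g. no zero-division guard), 0 = clean
})

print(df.head())
```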

We use Bidirectional Encoder Representations from Transformers (BERT) to better capture the context of the tokens in the instance. BERT is a pretrained Transformer encoder stack that applies bidirectional (or rather, non-directional) training of the Transformer, a popular attention-based model, and has been immensely popular for encoding both source code and natural language text. BERT was published in 2018 by Devlin et al. at Google AI (link: https://arxiv.org/pdf/1810.04805.pdf) and was inspired by the Transformer model first proposed in the paper Attention Is All You Need (link: https://arxiv.org/pdf/1706.03762.pdf). Hugging Face is an open-source community that makes many NLP architectures, including BERT, available through its Transformers library. We use the Hugging Face Transformers library together with the PyTorch framework to apply BERT in this project (link: https://huggingface.co/transformers/model_doc/bert.html).
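As a quick illustration of the library usage described above (a sketch, not the project's actual preprocessing code), the Hugging Face tokenizer turns a (context, instance) pair into the tensors BERT consumes. The "bert-base-uncased" checkpoint and the example strings are assumptions for illustration; the project trains its own smaller BERT, described below.

```python
from transformers import BertTokenizerFast

# Sketch only: a standard checkpoint is used here just to obtain a tokenizer.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

context_before = "def divide(a, b):"
instance = "return a / b"

# Encoding a text pair yields input_ids, token_type_ids and attention_mask.
encoded = tokenizer(context_before, instance,
                    padding="max_length", max_length=32,
                    truncation=True, return_tensors="pt")
print(encoded["input_ids"].shape)    # (1, 32)
print(encoded["token_type_ids"][0])  # 0s for the first segment, 1s for the second
print(encoded["attention_mask"][0])  # 1 for real tokens, 0 for padding
```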

The BERT model we use has 4 attention heads and 2 transformer layers (blocks). Along with the preprocessed tokens, we also make use of the following inputs (a configuration sketch follows the list):

  1. positional embeddings (0, 1, ..., 1000): convey the index of each token in the sequence
  2. token types (0, 1, 2): convey which segment a token belongs to: 0 if the token lies in the context before, 1 if it belongs to the line under test (the instance), 2 if it lies in the context after
  3. attention mask (0/1): conveys which positions are real tokens and which are padding
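The sketch below shows how such a small BERT could be configured and fed with these inputs using the Hugging Face Transformers library. The number of layers (2), attention heads (4), position range (0..1000), and token-type count (3) follow the description above; the hidden size, sequence length, and use of BertForSequenceClassification are assumptions, not necessarily the project's exact setup.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

# A small BERT matching the description above: 2 transformer layers,
# 4 attention heads. Hidden/intermediate sizes are illustrative assumptions.
config = BertConfig(
    vocab_size=30522,
    hidden_size=128,
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=512,
    max_position_embeddings=1001,  # positions 0..1000
    type_vocab_size=3,             # token types 0, 1, 2
    num_labels=2,                  # buggy vs. not buggy
)
model = BertForSequenceClassification(config)

batch, seq_len = 2, 64
input_ids      = torch.randint(0, config.vocab_size, (batch, seq_len))
token_type_ids = torch.randint(0, 3, (batch, seq_len))          # segment of each token
attention_mask = torch.ones(batch, seq_len, dtype=torch.long)   # 1 = real token, 0 = padding
position_ids   = torch.arange(seq_len).unsqueeze(0).expand(batch, seq_len)

logits = model(input_ids=input_ids,
               token_type_ids=token_type_ids,
               attention_mask=attention_mask,
               position_ids=position_ids).logits
print(logits.shape)  # (2, 2): one buggy/not-buggy score pair per example
```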
