IntentClassification

A machine learning model for text-based multi-label intent classification of code review questions

Problem description

Code review is a systematic examination of source code written by a person or a group of people, carried out to find bugs, offer suggestions, seek clarifications, and thereby evaluate the quality of the code. In this process, one or more presenters face questions or suggestions from a panel. The dataset used here emerged from a case study that investigated the communicative intention of developers’ questions during code review. The challenge is to extract enough information from these questions to infer their intention, and to build a model that predicts that intention with reasonable accuracy.

About the dataset

The dataset comes from an exploratory case study carried out for the research paper “Communicative Intention in Code Review Questions”. The study covered 399 Android code reviews, from which 499 questions were extracted and recorded in an Excel sheet. The sheet has the following 4 fields:

- inline-comment-id
- #Comment
- Question : question/statement raised by the reviewer
- Final Label : intention of the question/statement (target variable)

The intentions are broadly classified into 5 categories, with further subdivisions summing up to 9 and 2 additional subdivisions, making a total of 14. Only the broad classification is considered here, so the attribute Final Label is converted to contain only the initial level of classification, with 5 categories: Suggestions, Requests, Attitudes and emotions, Hypothetical Scenario and Rhetorical Question. The converted category is stored in a new attribute, New Label. The accompanying figure illustrates this hierarchy.
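The conversion from Final Label to New Label can be sketched as a simple lookup. This is only an illustration: the five top-level names come from this README, but the fine-grained label strings in FINE_TO_BROAD are hypothetical placeholders, and the function name to_new_label is an assumption, not the notebook's code.

```python
# Top-level categories named in this README.
TOP_LEVEL = (
    "Suggestions",
    "Requests",
    "Attitudes and emotions",
    "Hypothetical Scenario",
    "Rhetorical Question",
)

# HYPOTHETICAL fine-grained labels, invented for illustration only.
# Replace the keys with the values that actually occur in Final Label.
FINE_TO_BROAD = {
    "Request for action": "Requests",
    "Request for information": "Requests",
    "Criticism": "Attitudes and emotions",
}


def to_new_label(final_label):
    """Map a Final Label value to its top-level category."""
    label = final_label.strip()
    # A value that already is a top-level category is kept (case-insensitive).
    for top in TOP_LEVEL:
        if label.lower() == top.lower():
            return top
    if label in FINE_TO_BROAD:
        return FINE_TO_BROAD[label]
    raise ValueError(f"No top-level category known for label: {final_label!r}")
```

Raising an error for unmapped values, rather than guessing, makes it easy to spot labels that the mapping does not cover yet.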

Feature extraction

The attribute Question is taken as the input variable and New Label as the output variable. The distribution of the output variable (number of samples per label) is shown in the accompanying figure.

Since the input variable is text rather than numbers, it has to be preprocessed before it can be used to train the model. The output variable can be fed to the model either directly or after encoding it into numerical values with an encoder such as sklearn's LabelEncoder. Here, the labels are fed directly.
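For reference, the optional encoding step could look like the sketch below. It is not used in this project (the labels are fed directly); the sample label list is made up from the category names above.

```python
from sklearn.preprocessing import LabelEncoder

# A few example values of New Label (illustrative, not rows from the dataset).
labels = ["Requests", "Suggestions", "Requests", "Rhetorical Question"]

encoder = LabelEncoder()
# Classes are sorted alphabetically and numbered from 0.
encoded = encoder.fit_transform(labels)

# inverse_transform maps the integers back to the original strings.
decoded = encoder.inverse_transform(encoded)
```

The sklearn classifiers used here accept string labels as they are, which is why the encoding can be skipped.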

Text preprocessing

The preprocessing of the text has been done using two approaches:

Approach 1 :
- Expanding the contractions (e.g. haven’t --> have not)
- Conversion of text to a word sequence (includes conversion to lower case)
- Removal of numbers and punctuation
Approach 2 :
- Conversion to lower case
- Removal of numbers and punctuation
- Tokenisation : converts the sentence into a list of words
- Lemmatisation : converts each word to its dictionary base form (lemma)

Each approach is applied to the input variable, and the result is stored in a new attribute, Question Words.
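The two approaches can be sketched with the standard library alone. This is a simplified stand-in, not the notebook's code: the small CONTRACTIONS dictionary replaces the contractions package, str.lower().split() replaces the keras word-sequence helper, and the lemmatize argument is a hook where a real lemmatiser (for example nltk's WordNetLemmatizer().lemmatize) would be plugged in. All function names are assumptions.

```python
import re
import string

# Tiny stand-in for the `contractions` package; extend as needed.
CONTRACTIONS = {
    "haven't": "have not",
    "don't": "do not",
    "isn't": "is not",
    "can't": "cannot",
}
_CONTRACTION_RE = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in CONTRACTIONS) + r")\b",
    re.IGNORECASE,
)
# Characters to delete: ASCII punctuation, digits and curly single quotes.
_DROP = str.maketrans("", "", string.punctuation + string.digits + "\u2018\u2019")


def expand_contractions(text):
    """Replace known contractions, treating curly apostrophes as straight."""
    text = text.replace("\u2019", "'")
    return _CONTRACTION_RE.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def approach_1(text):
    """Expand contractions, split into lower-case words, drop numbers/punctuation."""
    tokens = expand_contractions(text).lower().split()
    cleaned = (tok.translate(_DROP) for tok in tokens)
    return [tok for tok in cleaned if tok]


def approach_2(text, lemmatize=lambda word: word):
    """Lower-case, drop numbers/punctuation, tokenise, then lemmatise each word.

    The default `lemmatize` leaves words unchanged; pass a real lemmatiser
    to get base forms.
    """
    tokens = text.lower().translate(_DROP).split()
    return [lemmatize(tok) for tok in tokens]
```

For example, approach_1("Haven’t you tested this 2 times?") gives ['have', 'not', 'you', 'tested', 'this', 'times'].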

To improve the accuracy of the model, new columns derived from the attribute “Question” are introduced as candidate features:

- excl_marks : count of exclamation marks
- qstn_marks : count of question marks
- puncts : count of punctuation marks like .,:;
- symbols : count of other symbols like *&$%=/
- word_count : number of words

Of these, word_count and excl_marks are included as features, in addition to the preprocessed text column Question Words. They were chosen because they were judged to contribute the most to the emotion in a statement/question, compared to the other symbols. You can experiment with training the model on the other features as well, but here the training vector consists of only three attributes: Question Words, excl_marks and word_count.
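A minimal sketch of how these counts could be computed for one question is given here. The exact character sets for puncts and symbols are assumptions taken from the examples in the list, and the function name count_features is hypothetical.

```python
PUNCTS = ".,:;"      # assumed set, taken from the examples in the list
SYMBOLS = "*&$%=/"   # assumed set, taken from the examples in the list


def count_features(question):
    """Return the five count-based features for a single question string."""
    return {
        "excl_marks": question.count("!"),
        "qstn_marks": question.count("?"),
        "puncts": sum(question.count(ch) for ch in PUNCTS),
        "symbols": sum(question.count(ch) for ch in SYMBOLS),
        # Words are taken to be whitespace-separated chunks of the raw text.
        "word_count": len(question.split()),
    }
```

The counts have to be taken from the raw question, before preprocessing, since the punctuation is removed from Question Words.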

Packages used

To convert Question Words to numerical vectors, the TF-IDF vectoriser from sklearn is used; it assigns each word a numerical weight that reflects its importance in a question relative to the whole dataset. For text preprocessing, the packages nltk, keras, contractions, string (for removing punctuation) and re (regular expressions) are used. For training, testing and accuracy analysis of the model, sklearn is used.
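One possible way to build the three-attribute training vector is to vectorise the text with TfidfVectorizer and append the two numeric columns with scipy's hstack. This is a sketch under assumptions: the three sample rows are invented, and the notebook may combine the columns differently.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented sample rows: preprocessed tokens plus the two numeric features.
question_words = [
    ["why", "not", "use", "a", "helper"],
    ["please", "rename", "this", "variable"],
    ["is", "this", "really", "needed"],
]
excl_marks = [0, 0, 2]
word_count = [5, 4, 4]

# TfidfVectorizer expects strings, so the token lists are joined again.
# Its default token pattern ignores one-character tokens such as "a".
docs = [" ".join(tokens) for tokens in question_words]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Append excl_marks and word_count as two extra columns.
extra = csr_matrix(np.column_stack([excl_marks, word_count]), dtype=float)
features = hstack([tfidf, extra]).tocsr()
```

The resulting matrix has one row per question and one column per vocabulary word, plus the two numeric columns.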

Model fitting and goodness

For each approach, three different classifiers are fitted: Linear SVM, Random Forest Classifier and Logistic Regression. Among these, Linear SVM performed the best in both approaches.
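The comparison of the three classifiers could be set up roughly as follows. The questions and labels are a toy example invented for illustration (they are not rows from the dataset), the hyperparameters are assumptions, and only the text feature is used to keep the sketch short.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only.
train_texts = [
    "could you rename this variable",
    "please add a test for this case",
    "why not use the existing helper",
    "maybe extract this into a method",
    "could you update the commit message",
    "how about using a constant here",
]
train_labels = [
    "Requests", "Requests", "Suggestions",
    "Suggestions", "Requests", "Suggestions",
]
test_texts = ["could you add a comment", "why not use a constant"]

classifiers = {
    "Linear SVM": LinearSVC(),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

predictions = {}
for name, clf in classifiers.items():
    # Each pipeline vectorises the text with TF-IDF, then fits the classifier.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    predictions[name] = list(model.predict(test_texts))
```

On the real data, the questions would first be split into training and test sets (for example with sklearn's train_test_split) so that the accuracy of each classifier is measured on questions it has not seen.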

NOTE : If errors are raised while importing modules from nltk, it might be because the corresponding resources have to be downloaded explicitly. In that case, try the following code in your Jupyter notebook:

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
