## A simple model based on logistic regression for determining if the comments are useful or not.

Import libraries and parse CSV data --> include only the comment strings and their labels (non-information)

In [31]:
import unicodedata
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("./../data/train_set_0520.csv", usecols=['type', 'comment', 'non-information'])

comments= data['comment'].apply(str)
comments = pd.concat([comments, data['type']], axis=1, keys=['comment', 'type'])
values = data['non-information'].values
values = np.where(values == 'yes', 1, 0)
print(comments)

                                                comment     type
0     @implNote taken from {@link com.sun.javafx.sce...  Javadoc
1     icon.setToolTipText(printedViewModel.getLocali...     Line
2     Synchronize changes of the underlying date val...     Line
3     Ask if the user really wants to close the give...  Javadoc
4                                    css: information *    Block
...                                                 ...      ...
1306  icon.setToolTipText(qualityViewModel.getLocali...     Line
1307  icon.setToolTipText(rankViewModel.getLocalizat...     Line
1308                                   Init preferences     Line
1309         TODO: reflective access, should be removed     Line
1310  for (Entry<String, SortType> entries : prefere...    Block

[1311 rows x 2 columns]


Split the comments into train (used for training the model) and test data (used for evaluating the model).

In [32]:
from sklearn.model_selection import train_test_split
comments_train, comments_test, y_train, y_test = train_test_split(comments, values, test_size=0.25, random_state=1000)
comments_train

Unnamed: 0,comment,type
431,Opens a shared database.,Javadoc
213,Executes a callable task that provides a retur...,Javadoc
730,Copies the selected entries and formats them w...,Javadoc
1005,Auto-generated method stub,Line
130,"Keep track of changes made to the columns, lik...",Javadoc
...,...,...
769,Test if the field is legally set.,Line
350,"Initializes the components, the layout, the da...",Javadoc
1275,panel.getUndoManager().addEdit(new UndoableMov...,Line
71,TODO: Remove old entry. Or... add it to a list...,Line


Preprocess data:
1. remove all special characters
2. TODO: expand contradictions (don't = do not etc)
3. remove special characters
4. stemming --> put the word into its most basic form
5. lemmatisation --> removes the word's affixes to get to the basic form of the word

Observations: Removing stopwords decreased accuracy

In [33]:
from helpers.data_preprocessing import DataProcesser
dp = DataProcesser()

comments = comments_train['comment'].tolist()
comments = list(map(dp.preprocess, comments))
comments_train['comment'] = comments_train['comment'].map(dp.preprocess)


Vectorise data - map a numerical value to each word

This is based on the Bag-of-Words Model. It ignores the order of words and focuses only on their frequency.

In [34]:
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(comments_train)

x_train = vectorizer.transform(comments_train)
x_test  = vectorizer.transform(comments_test)


Create a logistic regression model and fit the training data into it.

In [28]:
from sklearn.linear_model import LogisticRegression
from sklearn.utils.fixes import loguniform
#param_grid = {'C': loguniform(1e0, 1e3)}

classifier = LogisticRegression()

LogisticRegression()

Evaluate the model based on the following properties:
* <b>accuracy</b> - How often a data point is classified correctly?
The number of true positives and true negatives divided by the number of true positives, true negatives, false positives, and false negatives
* <b>precision</b> - What proportion of positive identifications was actually correct?
The number of true positives divided by the number of true positives and false positives
* <b>recall</b> - What proportion of actual positives was identified correctly?
The number of true positives divided by the number of true positives and false negatives
* <b>F1 score</b>- The F1 Score is the 2*((precision*recall)/(precision+recall)).
 conveys the balance between the precision and the recall.

In [37]:
from sklearn.model_selection import cross_val_predict, cross_validate

y_pred = cross_val_predict(classifier, x_train, y_train)
#lassifier.predict(x_test)

# Model Evaluation metrics
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

# Attempted to optimise hyperparameters - didn't work

# from sklearn.model_selection import GridSearchCV
# clf = LogisticRegression()
# grid_values = {'C':[0.001,.009,0.01,.09,1,5,10,25]}
# grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')
# grid_clf_acc.fit(x_train, y_train)
#
# #Predict values based on new parameters
# y_pred_acc = grid_clf_acc.predict(x_test)
#
# # New Model Evaluation metrics
# print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
# print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
# print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
# print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))

ValueError: Found input variables with inconsistent numbers of samples: [2, 983]

Random forest classifier

In [35]:
randomForestClassifier = RandomForestClassifier(max_depth=2, random_state=0)
randomForestClassifier.fit(x_train, y_train)

y_pred = randomForestClassifier.predict(x_test)

# Model Evaluation metrics
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

ValueError: Found input variables with inconsistent numbers of samples: [2, 983]