## A simple logistic regression model for deciding whether a comment is useful

Import libraries and parse the CSV data --> load only the `type`, `comment`, and `non-information` (label) columns

In [4]:
import unicodedata
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("./../data/train_set_0520.csv", usecols=['type', 'comment', 'non-information'])

# Encode the label as an integer: 1 for 'yes' (non-information), 0 otherwise
values = data['non-information'].values
values = np.where(values == 'yes', 1, 0)

Preprocess data:
1. remove all special characters
2. TODO: expand contractions (don't = do not etc.)
3. stemming --> strips the word's affixes to leave a crude root (e.g. "gener" for "generated")
4. lemmatisation --> maps the word to its dictionary form (lemma)

Observation: removing stopwords decreased accuracy.
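
The special-character removal and contraction expansion steps could look roughly like the sketch below. It uses only the standard library and is not the `DataProcesser.preprocess` implementation, which this notebook does not show; the function names and the tiny contraction table are assumptions made for illustration. Stemming and lemmatisation are left out because they need an NLP library.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny contraction table; a real one would be much larger.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "isn't": "is not", "it's": "it is"}


def expand_contractions(text):
    """Replace each known contraction with its expanded form."""
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return text


def remove_special_characters(text):
    """Strip accents, then replace anything that is not a letter, digit or whitespace with a space."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-zA-Z0-9\s]", " ", ascii_text)


def preprocess_sketch(text):
    """Lowercase, expand contractions, remove special characters, collapse whitespace."""
    text = expand_contractions(text.lower())
    text = remove_special_characters(text)
    return " ".join(text.split())


print(preprocess_sketch("Don't remove the entry -- it's needed!"))
```

Contractions are expanded before special characters are removed, because the apostrophe that identifies a contraction is itself a special character.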

In [5]:
from helpers.data_preprocessing import DataProcesser
dp = DataProcesser()

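# Cast every comment to str (missing values are read as NaN floats), then preprocess
data['comment'] = data['comment'].apply(str)
data['comment'] = data['comment'].apply(dp.preprocess)
# Character length of each preprocessed comment
length_data = data['comment'].str.len()
comments = data['comment']
length_data.head()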

0    18
1    40
2    69
3    51
4     9
Name: comment, dtype: int64

Split the comments into train (used for training the model) and test data (used for evaluating the model).

In [6]:
from sklearn.model_selection import train_test_split
comments_train, comments_test, y_train, y_test = train_test_split(comments, values, test_size=0.25, random_state=1000)
comments_train


431                                  open a share databas
213     execut a callabl task that provid a valu after...
730     copi the select entri and mat them with the se...
1005                               auto gener method stub
130     keep track of chang made to the column like re...
                              ...                        
769                           test the field is legal set
350     initi the compon the layout the data structur ...
1275    panel getunmanag addedit unablemovegroup group...
71      to remov old en or add it to a list of entri t...
599     need to toggl a twice to make sure everyth is ...
Name: comment, Length: 983, dtype: object

Vectorise data - map each comment to a vector of word counts

This is based on the Bag-of-Words model. It ignores the order of words and records only how often each word occurs.
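
To make the idea concrete, here is a minimal pure-Python sketch of a bag-of-words transform. It is an illustration, not how `CountVectorizer` is implemented: the helper names are made up, and it splits on whitespace only, whereas `CountVectorizer`'s default tokenizer also skips one-letter tokens such as "a".

```python
from collections import Counter


def fit_vocabulary(documents):
    """Map each distinct word to a column index, in alphabetical order."""
    words = sorted({word for doc in documents for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def transform(documents, vocabulary):
    """Turn each document into a list of word counts; words outside the vocabulary are dropped."""
    rows = []
    for doc in documents:
        counts = Counter(doc.split())
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


train = ["open a share databas", "auto gener method stub"]
test = ["open open the databas"]

# The vocabulary is learned from the training comments only
vocabulary = fit_vocabulary(train)
print(vocabulary)
print(transform(test, vocabulary))
```

Because the vocabulary is fixed at fit time, every transformed comment has the same number of columns, and a word seen only in the test data (here "the") contributes nothing.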

In [10]:
#TODO add length after count vectoriser --> separate into files
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
vectorizer.fit(comments_train)


x_train = vectorizer.transform(comments_train)
x_test  = vectorizer.transform(comments_test)

x_test

<328x921 sparse matrix of type '<class 'numpy.int64'>'
	with 3403 stored elements in Compressed Sparse Row format>

Create a logistic regression model and fit it to the training data.

In [11]:
from sklearn.linear_model import LogisticRegression
# scipy.stats provides loguniform directly; it is only needed for the commented-out search space below
from scipy.stats import loguniform
#param_grid = {'C': loguniform(1e0, 1e3)}

classifier = LogisticRegression()
classifier.fit(x_train, y_train)

LogisticRegression()

Evaluate the model based on the following properties:
* <b>accuracy</b> - How often is a data point classified correctly?
The number of true positives and true negatives divided by the total number of predictions (true positives, true negatives, false positives, and false negatives).
* <b>precision</b> - What proportion of positive identifications was actually correct?
The number of true positives divided by the sum of true positives and false positives.
* <b>recall</b> - What proportion of actual positives was identified correctly?
The number of true positives divided by the sum of true positives and false negatives.
* <b>F1 score</b> - The harmonic mean of precision and recall: 2*((precision*recall)/(precision+recall)).
It conveys the balance between precision and recall.
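
The four definitions can be checked by hand on a small example. The sketch below is a plain-Python illustration with invented labels; the function name is hypothetical, and reporting a ratio with a zero denominator as 0.0 is a convention assumed here.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Assumed convention: a ratio with a zero denominator is reported as 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Invented labels: 2 true positives, 4 true negatives, 1 false positive, 1 false negative
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

Here accuracy is 6/8, while precision, recall and F1 are all 2/3, which shows how accuracy can look better than the other three when negatives dominate.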

In [12]:
from sklearn.model_selection import cross_val_predict, cross_validate

# Predict labels for the held-out test comments
y_pred = classifier.predict(x_test)

# Model Evaluation metrics
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

# Attempted to optimise the regularisation strength C with a grid search - it did not work

# from sklearn.model_selection import GridSearchCV
# clf = LogisticRegression()
# grid_values = {'C':[0.001,.009,0.01,.09,1,5,10,25]}
# grid_clf_acc = GridSearchCV(clf, param_grid = grid_values,scoring = 'recall')
# grid_clf_acc.fit(x_train, y_train)
#
# #Predict values based on new parameters
# y_pred_acc = grid_clf_acc.predict(x_test)
#
# # New Model Evaluation metrics
# print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred_acc)))
# print('Precision Score : ' + str(precision_score(y_test,y_pred_acc)))
# print('Recall Score : ' + str(recall_score(y_test,y_pred_acc)))
# print('F1 Score : ' + str(f1_score(y_test,y_pred_acc)))

ValueError: X has 921 features per sample; expecting 1579
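
The `ValueError` above says the classifier expects 1579 features while `x_test` has 921 columns. That pattern suggests the classifier and the vectorizer were fitted in different runs, which would fit the out-of-order cell numbers, though the notebook does not confirm it. One way to keep the two in step is to wrap them in a scikit-learn `Pipeline`, sketched here on toy comments invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data, invented for illustration only (1 = non-information, 0 = informative)
toy_comments = [
    "auto gener method stub",
    "todo auto gener constructor stub",
    "keep track of chang made to the column",
    "execut a callabl task that provid a valu",
]
toy_labels = [1, 1, 0, 0]

# The vectorizer and the classifier are always fitted together,
# so the classifier's feature count matches the vocabulary size
model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", LogisticRegression()),
])
model.fit(toy_comments, toy_labels)

n_vocabulary = len(model.named_steps["vectorizer"].vocabulary_)
n_coefficients = model.named_steps["classifier"].coef_.shape[1]
print(n_vocabulary == n_coefficients)

# Raw strings go in; the pipeline vectorises them with the fitted vocabulary
print(model.predict(["auto gener getter stub", "track chang to the valu"]))
```

A pipeline like this can also be passed to `GridSearchCV` in place of a bare classifier, so the vectorizer is refitted inside each cross-validation fold.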

Random forest classifier

In [7]:
randomForestClassifier = RandomForestClassifier(max_depth=2, random_state=0)
randomForestClassifier.fit(x_train, y_train)

y_pred = randomForestClassifier.predict(x_test)

# Model Evaluation metrics
from sklearn.metrics import accuracy_score,recall_score,precision_score,f1_score
print('Accuracy Score : ' + str(accuracy_score(y_test,y_pred)))
print('Precision Score : ' + str(precision_score(y_test,y_pred)))
print('Recall Score : ' + str(recall_score(y_test,y_pred)))
print('F1 Score : ' + str(f1_score(y_test,y_pred)))

Accuracy Score : 0.7042682926829268
Precision Score : 0.0
Recall Score : 0.0
F1 Score : 0.0


  _warn_prf(average, modifier, msg_start, len(result))
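
A precision and recall of exactly 0.0 mean the model produced no true positives, and the truncated `_warn_prf` line is consistent with scikit-learn's warning for a model that predicts no positive labels at all. If that is the case here (an assumption, since the predictions are not shown), the accuracy is simply the share of negative labels in the test set. The sketch below checks that reading against the reported accuracy; the number of negatives is inferred from the accuracy and the 328-row test set, not read from the data.

```python
n_test = 328                              # rows in the test split
reported_accuracy = 0.7042682926829268    # from the output above

# Assumption: every test comment is predicted as class 0
n_negative = round(reported_accuracy * n_test)   # inferred, not read from the data
y_true = [0] * n_negative + [1] * (n_test - n_negative)
y_pred = [0] * n_test

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
true_positives = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)

print(n_negative, correct / n_test, true_positives)
```

Under this assumption the random forest with `max_depth=2` behaves like a majority-class baseline, so its accuracy says little about how well it separates the two classes.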
