# Coursework 1: Train a Sentiment Analysis Classifier
In this coursework, the task is to train a sentiment analysis classifier for movie reviews. The notebook below applies machine-learning-based NLP techniques to a given text corpus to train the model. First, the dataset is read with Pandas.

In [1]:
# load data and take a quick look
import pandas as pd
raw_data = pd.read_csv('coursework1_train.csv')
raw_data.head()

Unnamed: 0.1,Unnamed: 0,text,sentiment
0,0,Enjoy the opening credits. They're the best th...,neg
1,1,"Well, the Sci-Fi channel keeps churning these ...",neg
2,2,It takes guts to make a movie on Gandhi in Ind...,pos
3,3,The Nest is really just another 'nature run am...,neg
4,4,Waco: Rules of Engagement does a very good job...,pos


The dataset has 40000 rows and 3 columns; the last column records whether the corresponding review is negative or positive.

In [2]:
raw_data.shape

(40000, 3)

Let's look at a sample of the review text.

In [3]:
raw_data.text

0        Enjoy the opening credits. They're the best th...
1        Well, the Sci-Fi channel keeps churning these ...
2        It takes guts to make a movie on Gandhi in Ind...
3        The Nest is really just another 'nature run am...
4        Waco: Rules of Engagement does a very good job...
                               ...                        
39995    This was a Hindi movie. Hindi=Horrible. reason...
39996    I'm really tempted to reward "The Case of the ...
39997    Poor Jane Austen. This dog of a production doe...
39998    Subtle, delicate ,touching.<br /><br />A young...
39999    OK, Anatomie is not a reinvention of the Horro...
Name: text, Length: 40000, dtype: object

In [4]:
# check the size of the data and its class distribution
all_text = raw_data['text'].tolist()
all_lables = raw_data['sentiment'].tolist()

print('entry num', len(all_text))
print('num of pos entries', len([l for l in all_lables if l=='pos']))
print('num of neg entries', len([l for l in all_lables if l=='neg']))

entry num 40000
num of pos entries 20000
num of neg entries 20000


Next, the text is pre-processed so that it can be analysed. The following packages are imported for the pre-processing/data-cleaning step.

Stopwords such as “the”, “a”, “an” and “in”, which contribute little to the meaning of a sentence, are removed using NLTK's stopword list, and word_tokenize splits each review into individual words. Stemming was preferred over lemmatization because it ran faster and cost little accuracy.

In [5]:
# text cleaning and preprocessing:
# lower-case each review, keep letters only, tokenize, and drop English stopwords

import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer


# use a separate name so the imported nltk `stopwords` module is not overwritten
stop_words = set(stopwords.words('english'))

def clean_txt(txt):
    corpus = []
    for t in txt:
        # convert to lower case, then replace every non-alphabetic character
        # (digits and punctuation included) with a space
        review = t.lower()
        review = re.sub('[^a-zA-Z]', ' ', review)

        # tokenize, remove stopwords, and join the remaining words back into one string
        w_txt = word_tokenize(review)
        words = [ww for ww in w_txt if ww not in stop_words]
        detok = ' '.join(words)
        corpus.append(detok)

    return corpus

clean = clean_txt(raw_data.text)
clean

['enjoy opening credits best thing second rate inoffensive time killer features passable performances likes eric roberts martin kove main part however goes newcomer tommy lee thomas looks bit diminutive kind action nevertheless occasionally manages project banty rooster kind belligerence first time see bare chested sweaty engaged favorite beefcake activity chopping wood seven scenes without shirt including one hanged wrists zapped electricity la mel gibson lethal weapon could use better script however since manner exposes truth corruption violence inside prison never convincing also talk millions dollars apparently tied investigation never explained pluses though sending john woodrow undercover john wilson amusing play presidential name co star jody ross nolan shows promise inmate early proceedings shown hanged wrists getting punched burly guard one final note movie low budget painfully responsible lack extras despite impressive size prison seems hold inmates note cast credits end help
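The raw reviews also contain HTML line breaks such as `<br />` (see row 39998 in the listing above). Replacing non-letters with spaces leaves the tag name behind as a stray `br` token. The following is a minimal sketch, not the notebook's method, of stripping tags before the letter filter; the function name `strip_html_and_clean`, the tiny stop-word set, and the example sentence are assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word set (an assumption; the notebook uses NLTK's English list).
TOY_STOP_WORDS = {"the", "a", "an", "in", "is", "of", "and"}

def strip_html_and_clean(text, stop_words=TOY_STOP_WORDS):
    """Hypothetical helper: remove HTML tags, keep letters only, drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)            # drop tags such as <br />
    text = re.sub(r"[^a-z]+", " ", text.lower())    # keep lower-case letters only
    return " ".join(w for w in text.split() if w not in stop_words)

# Made-up sentence in the style of the reviews.
example = "Subtle, delicate ,touching.<br /><br />A young man in the city."
cleaned = strip_html_and_clean(example)
print(cleaned)
```

Because the tags are removed first, no `br` token reaches the vectorizer.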

The data is split using train_test_split from scikit-learn's model_selection module. I chose this approach because I was already familiar with it.

In [6]:
# data split.
# Feel free to use different ratios or strategies to split the data.
#train_text = all_text[:35000]
#train_labels = all_lables[:35000]
#test_text = all_text[35000:]
#test_labels = all_lables[35000:]

from sklearn.model_selection import train_test_split

train_text, test_text, train_labels, test_labels = train_test_split(clean,raw_data.sentiment, random_state=220)
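
By default `train_test_split` holds out 25% of the rows for testing, so the 40000 reviews are divided into 30000 training and 10000 test reviews. A small sketch on toy data, which additionally passes `stratify` (an option not used in the cell above) to keep the class ratio the same in both parts; the toy review names and labels are made up:

```python
from sklearn.model_selection import train_test_split

# Eight placeholder "reviews" with perfectly balanced labels (toy data).
toy_text = ["review %d" % i for i in range(8)]
toy_labels = ["pos", "neg"] * 4

tr_text, te_text, tr_labels, te_labels = train_test_split(
    toy_text,
    toy_labels,
    test_size=0.25,        # the default, written out explicitly
    stratify=toy_labels,   # preserve the pos/neg ratio in both parts
    random_state=220,
)

print(len(tr_text), len(te_text))
print(sorted(te_labels))
```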

Sentiment analysis classifier: several machine learning algorithms are tried and tested here, and the one with the best results is chosen for the model.

The TF-IDF vectorisation produces a matrix whose weights indicate how important each keyword is to a review, which is why it suits sentiment analysis.
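
To make the weighting concrete, here is a hand-rolled sketch on three toy documents. It uses the smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation, which is what scikit-learn documents as the `TfidfVectorizer` default; the documents and the helper name `tfidf` are assumptions for illustration.

```python
import math
from collections import Counter

# Three already-tokenized toy documents.
toy_docs = [["good", "movie"], ["bad", "movie"], ["great", "movie"]]

n_docs = len(toy_docs)
# Document frequency: in how many documents each term occurs.
doc_freq = Counter(term for doc in toy_docs for term in set(doc))

# Smoothed idf: terms that occur in every document get the lowest weight.
idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}

def tfidf(doc):
    """Hypothetical helper: term count times idf, scaled to unit L2 length."""
    counts = Counter(doc)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

first = tfidf(toy_docs[0])
print(idf["movie"], idf["good"])
print(first)
```

The word "movie" appears in every document and so receives a lower weight than "good", which appears in only one.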

Below, three algorithms in addition to logistic regression (LR) are applied: Random Forest, KNeighbors, and Linear Discriminant Analysis. In terms of accuracy, LR and Linear Discriminant Analysis are quite similar, with LR slightly ahead (0.8658 vs 0.8616). LR was also the fastest to produce a result.

After setting some parameters for the Random Forest classifier, its accuracy increased. Grid search selected 5 neighbors as the best setting for the KNeighbors classifier, although its accuracy remained low.
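
The grid search itself is not shown in the notebook. Below is a minimal sketch of how such a search might look with scikit-learn's `GridSearchCV`, run on synthetic data; the candidate values, the 3-fold setting, and the toy data are all assumptions, not the settings behind the result quoted above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data standing in for the TF-IDF vectors (assumption).
X_toy, y_toy = make_classification(n_samples=60, n_features=8, random_state=0)

# Candidate neighbour counts to compare (illustrative values only).
param_grid = {"n_neighbors": [1, 3, 5, 7]}

search = GridSearchCV(
    KNeighborsClassifier(metric="euclidean"),
    param_grid,
    cv=3,                 # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```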

In [7]:
# training: tf-idf + logistic regression
# you should explore different representations and algorithms.
from sklearn.feature_extraction.text import TfidfVectorizer
max_feature_num = 1000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
test_vecs = TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_text)

# train model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(train_vecs, train_labels)

# test model
test_pred = clf.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

acc 0.8658
precision 0.865912122520385
rec 0.8658490081570779
f1 0.8657974018376996
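
A test set is usually vectorized with the vocabulary and IDF weights learned from the training text, by calling `transform` on the already fitted vectorizer. Calling `fit_transform` on a new vectorizer re-estimates the IDF weights from the test text, even when the training vocabulary is passed in. A small sketch on toy sentences (all data here is made up) that shows the difference:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

toy_train = ["good film", "bad film", "good acting", "bad plot"]
toy_test = ["good film", "good film", "good plot"]

# Fit vocabulary and idf weights on the training text only.
vec = TfidfVectorizer()
train_matrix = vec.fit_transform(toy_train)

# Reuse the fitted vectorizer: same columns, same idf weights.
test_matrix = vec.transform(toy_test)

# A second vectorizer with the same vocabulary, but fitted on the test text,
# ends up with different idf weights.
refit = TfidfVectorizer(vocabulary=vec.vocabulary_).fit(toy_test)

print(train_matrix.shape, test_matrix.shape)
print(np.allclose(vec.idf_, refit.idf_))
```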


In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
max_feature_num = 1000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
test_vecs = TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_text)

from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators = 200, criterion = 'gini', random_state=220) 

clf.fit(train_vecs, train_labels)

test_pred = clf.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

acc 0.8352
precision 0.835235326531
rec 0.8352286752681708
f1 0.8351998945279325


In [9]:
from sklearn.feature_extraction.text import TfidfVectorizer
max_feature_num = 1000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
test_vecs = TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_text)

from sklearn.neighbors import KNeighborsClassifier


clf = KNeighborsClassifier(n_neighbors=5, metric= 'euclidean')

clf.fit(train_vecs, train_labels)

test_pred = clf.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

acc 0.7143
precision 0.7143429819899956
rec 0.7143310898945814
f1 0.7142989686192767


In [10]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

from sklearn.feature_extraction.text import TfidfVectorizer
max_feature_num = 1000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
test_vecs = TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_text)

clf = LinearDiscriminantAnalysis()
clf.fit(train_vecs.toarray(), train_labels)

# the model was fitted on a dense array, so predict on a dense array as well
test_pred = clf.predict(test_vecs.toarray())
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

acc 0.8616
precision 0.8617885427359441
rec 0.8616629260909248
f1 0.8615932180676853
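
The four experiments share the same vectorize, fit, and score steps, so they can also be written as a single loop. This is a sketch under stated assumptions: the toy sentences and the reduced `n_estimators` and `n_neighbors` values are illustrative only, and the scores it prints say nothing about the real dataset.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up miniature corpus.
toy_train_text = ["great film", "loved it", "wonderful acting",
                  "awful film", "hated it", "terrible acting"]
toy_train_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
toy_test_text = ["great acting", "awful acting"]
toy_test_labels = ["pos", "neg"]

# Vectorize once; dense arrays are used so that every model accepts them.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(toy_train_text).toarray()
X_test = vectorizer.transform(toy_test_text).toarray()

models = {
    "logistic regression": LogisticRegression(),
    "random forest": RandomForestClassifier(n_estimators=10, random_state=220),
    "k-neighbors": KNeighborsClassifier(n_neighbors=3, metric="euclidean"),
    "lda": LinearDiscriminantAnalysis(),
}

# Fit each model on the same features and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, toy_train_labels)
    scores[name] = accuracy_score(toy_test_labels, model.predict(X_test))

for name, score in scores.items():
    print(name, score)
```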
