## Text Classification

### Objective
The goal is to predict whether a movie review is positive or negative. 
Features are extracted with the bag-of-words technique, and two classifiers, Naive Bayes and a support vector machine (SVM), are trained and used for prediction.

### Data

The dataset used for this text analysis was obtained from https://www.cs.cornell.edu/people/pabo/movie-review-data/
The dataset contains 1000 positive and 1000 negative movie reviews.

In [1]:
#Import required packages
import zipfile
import numpy as np
import pandas as pd
import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer 
from sklearn import svm

### Read file  
Each review is stored as a separate text file within a zip file. We read every file and store its content as one element of a list.

In [2]:
#Read review from zip file
pos_zip_file= zipfile.ZipFile("DATA/pos1.zip")
#Each file has a review, so read each file as a list element
pos_review=[pos_zip_file.open(f).read() for f in pos_zip_file.namelist()]
label=["Positive"]*len(pos_review)

In [3]:
neg_zip_file= zipfile.ZipFile("DATA/neg1.zip")
neg_review=[neg_zip_file.open(f).read() for f in neg_zip_file.namelist()]
label=["Negative"]*len(neg_review)
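The reading step can be tried without the real archives. The following is a minimal sketch that builds a tiny in-memory zip file (its file names and contents are made up for illustration) and reads it the same way. The helper `read_reviews` and the UTF-8 decoding are assumptions, not part of the original notebook; decoding matters on Python 3, where `read()` returns bytes rather than text.

```python
import io
import zipfile

def read_reviews(zip_source, encoding="utf-8"):
    """Return the decoded text of every file in the archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_source) as archive:
        return [archive.read(name).decode(encoding, errors="replace")
                for name in archive.namelist()
                if not name.endswith("/")]  # skip directory entries, if any

# Build a small in-memory archive standing in for DATA/pos1.zip (made-up contents)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("pos/cv000.txt", "a moving and well acted film")
    archive.writestr("pos/cv001.txt", "one of the best movies of the year")

sample_reviews = read_reviews(buffer)
sample_labels = ["Positive"] * len(sample_reviews)
print(sample_reviews)
```

Each list element is one review as text, in the order returned by `namelist()`.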

Let's combine the negative and positive reviews and create labels.

In [4]:
#Combine positive and negative reviews
reviews = pos_review + neg_review
#Labels for review
label=["Positive"]*len(pos_review) + ["Negative"]*len(neg_review)


### Normalize text
Some constructs of natural language are not useful for predicting the label, so we normalize the text to keep only the features we are interested in. 

The following steps are applied, in this order, to normalize the movie reviews:
- Convert text to lower case
- Replace non-letter characters with a space
- Split the text into words
- Stem the words
- Remove words shorter than three characters
- Remove stop words

Because stemming runs before stop-word removal, a stemmed stop word such as "thi" (from "this") no longer matches the stop list and is kept.

In [5]:
#Function to normalize text
def normalize_text(comment):
    #Convert to lowercase
    comment=comment.lower()
    #Replace non-letter characters with a space
    comment=re.sub("[^a-zA-Z]", " ",comment)
    #Split into words
    words = nltk.word_tokenize(comment)
    #Stem words
    porter = nltk.PorterStemmer()
    words=[porter.stem(w) for w in words]
    #Keep only words that are 3 or more characters long
    words=[w for w in words if len(w) > 2]
    #Remove stop words (build the set once instead of once per word)
    stop_words=set(nltk.corpus.stopwords.words('english'))
    words=[w for w in words if w not in stop_words]
    return " ".join(words)
    
print "Text: This is a sample text with some words"
print "Normalized text:",normalize_text("This is a sample text with some words")

Text: This is a sample text with some words
Normalized text: thi sampl text word


In [6]:
#Normalize all reviews
norm_reviews=[normalize_text(s) for s in reviews]

### Feature Extraction - Bag of words Technique
To predict whether a review is positive or negative, we need to extract features from the review text. We use the bag-of-words technique: we count how often each word occurs in a review and use those counts as features for prediction.

In [7]:
#Create bag of words
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 500) #Use 500 most frequent words
bag_of_words = vectorizer.fit_transform(norm_reviews).toarray()
print "Bag of words array:",bag_of_words.shape
#Get word feature
feat_names=vectorizer.get_feature_names()

Bag of words array: (2000, 500)
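To make the shape of this array concrete, here is a small dependency-free sketch of the same idea on three toy documents. The helper `count_words` is hypothetical and only approximates what `CountVectorizer` does (whitespace tokenization, top words by corpus frequency, vocabulary in alphabetical order); it is not the library's implementation.

```python
from collections import Counter

def count_words(docs, max_features):
    """Return (vocabulary, count matrix) for whitespace-tokenized documents."""
    doc_counts = [Counter(doc.split()) for doc in docs]
    # Corpus-wide totals decide which words make it into the vocabulary
    totals = Counter()
    for counts in doc_counts:
        totals.update(counts)
    vocab = sorted(word for word, _ in totals.most_common(max_features))
    # One row per document, one column per vocabulary word
    matrix = [[counts[word] for word in vocab] for counts in doc_counts]
    return vocab, matrix

toy_docs = ["good film good act", "bad film bad plot bad", "good film"]
toy_vocab, toy_matrix = count_words(toy_docs, max_features=3)
print(toy_vocab)
print(toy_matrix)
```

With `max_features=3` the rarer words "act" and "plot" are dropped, just as the notebook keeps only the most frequent words of the corpus.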


#### Split data into test and training set
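We use the first 800 positive and first 800 negative reviews for training, and the last 200 positive and last 200 negative reviews for testing. This works because `reviews` holds the 1000 positive reviews followed by the 1000 negative ones.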

In [8]:
test=np.vstack((bag_of_words[800:1000,],bag_of_words[1800:,]))
test_label=label[800:1000]+label[1800:]
train=np.vstack((bag_of_words[:800,], bag_of_words[1000:1800,]))
train_label=label[:800]+label[1000:1800]
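The index arithmetic of this split can be checked on a smaller example. The sketch below uses a hypothetical helper, `split_blocks`, and plain Python lists; it assumes, as in the notebook, that the data is ordered as one block per class.

```python
def split_blocks(rows, labels, n_per_class, n_train):
    """Split data ordered as [class A block, class B block] into train and test.

    The first n_train items of each block go to training, the rest to testing.
    """
    def pick(seq, indices):
        return [seq[i] for i in indices]

    train_idx = list(range(n_train)) + list(range(n_per_class, n_per_class + n_train))
    test_idx = (list(range(n_train, n_per_class))
                + list(range(n_per_class + n_train, 2 * n_per_class)))
    return (pick(rows, train_idx), pick(labels, train_idx),
            pick(rows, test_idx), pick(labels, test_idx))

# Toy data: 5 "positive" rows followed by 5 "negative" rows
toy_rows = [[i] for i in range(10)]
toy_labels = ["Positive"] * 5 + ["Negative"] * 5
tr_x, tr_y, te_x, te_y = split_blocks(toy_rows, toy_labels, n_per_class=5, n_train=4)
print(te_x)
print(te_y)
```

Calling it with `n_per_class=1000` and `n_train=800` would select the same rows as the slices above.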

### NLTK Naive Bayes
Let's try the NLTK Naive Bayes classifier. Since it expects nominal feature values in a list of dictionaries, we convert each word frequency to word presence (a boolean indicating whether the word occurs in the review) and build a list of (feature dictionary, label) pairs.

In [9]:
#Convert to word presence dictionary
train_dict=[(dict(zip(feat_names,row)),train_label[index]) 
                   for index,row in enumerate(train.astype(bool))]
classifier = nltk.NaiveBayesClassifier.train(train_dict)

test_dict=[(dict(zip(feat_names,row)),test_label[index]) for index,row in enumerate(test.astype(bool))]
print "Training Accuracy:",nltk.classify.accuracy(classifier,train_dict)
print "Test Accuracy:",nltk.classify.accuracy(classifier,test_dict)


Training Accuracy: 0.808125
Test Accuracy: 0.775
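For reference, a single training example passed to the classifier looks like the pair built below. This is a small sketch with a made-up three-word vocabulary; `to_presence_features` is a hypothetical helper that does the same job as `dict(zip(feat_names,row))` on a boolean row.

```python
def to_presence_features(feature_names, count_row):
    """Map each vocabulary word to True/False depending on whether it occurs."""
    return {name: count > 0 for name, count in zip(feature_names, count_row)}

toy_names = ["bad", "film", "good"]          # made-up vocabulary
toy_example = (to_presence_features(toy_names, [0, 1, 2]), "Positive")
print(toy_example)
```

The count of 2 for "good" collapses to `True`, so this representation discards how often a word occurs.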


#### Words that indicate the Review type
We can inspect the classifier's most informative features to see which words indicate positive or negative reviews. As the list below shows, the stems "wast" (i.e. waste), "worst" and "stupid" indicate a negative review.

In [10]:
classifier.show_most_informative_features(10)

Most Informative Features
                    wast = True           Negati : Positi =      4.7 : 1.0
                   worst = True           Negati : Positi =      4.3 : 1.0
                  stupid = True           Negati : Positi =      4.3 : 1.0
                 portray = True           Positi : Negati =      2.9 : 1.0
                    bore = True           Negati : Positi =      2.9 : 1.0
                   oscar = True           Positi : Negati =      2.4 : 1.0
                  suppos = True           Negati : Positi =      2.2 : 1.0
                 sometim = True           Positi : Negati =      2.2 : 1.0
                 perfect = True           Positi : Negati =      2.2 : 1.0
                   touch = True           Positi : Negati =      2.0 : 1.0
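Each ratio in this list compares how likely a feature value is under one label versus the other. The sketch below shows one way such a ratio could be computed, assuming additive smoothing with 0.5 per feature value (to our understanding this matches NLTK's default estimator, but treat that as an assumption). The counts are hypothetical and are not taken from the dataset.

```python
def smoothed_prob(count, total, bins=2, gamma=0.5):
    """Estimate P(feature value | label) with additive smoothing."""
    return (count + gamma) / (total + gamma * bins)

# Hypothetical counts: the word appears in 100 of 800 negative training
# reviews and in 20 of 800 positive ones
p_neg = smoothed_prob(100, 800)
p_pos = smoothed_prob(20, 800)
ratio = p_neg / p_pos
print("Negative : Positive = %.1f : 1.0" % ratio)
```

With these made-up counts the ratio is about 4.9, meaning the word's presence is roughly five times as likely in a negative review as in a positive one.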


### SVM Classifier
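Let's try a support vector machine (SVM) classifier, since SVMs work well on linearly separable problems and text classification is often linearly separable. Note that `svm.SVC()` with default arguments uses an RBF kernel rather than a linear one; a linear kernel can be requested with `kernel="linear"`. For the SVM classifier we use word frequencies.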

In [11]:
svm_classifer=svm.SVC().fit(train,train_label)    
print "Training Accuracy:",svm_classifer.score(train,train_label)
print "Testing Accuracy:",svm_classifer.score(test,test_label)

Training Accuracy: 0.938125
Testing Accuracy: 0.8


The SVM classifier using word frequencies has a high training accuracy (0.94), and its testing accuracy (0.80) is also better than that of the previous classifier (0.775), which only used word presence. 
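To illustrate what linear separability means for word counts, the sketch below scores a review as a weighted sum of its counts plus a bias and takes the sign. The vocabulary, weights and bias are hypothetical values chosen for illustration; they are not learned from the data and this is not how `SVC` is implemented internally.

```python
def linear_decision(weights, bias, counts):
    """Return the signed score w . x + b for one bag-of-words row."""
    return sum(w * x for w, x in zip(weights, counts)) + bias

# Hypothetical weights for the vocabulary ["bore", "perfect", "wast"]
toy_weights = [-1.0, 1.5, -2.0]
toy_bias = 0.25

toy_predictions = []
for row in ([0, 2, 0], [1, 0, 1]):
    score = linear_decision(toy_weights, toy_bias, row)
    toy_predictions.append("Positive" if score > 0 else "Negative")
print(toy_predictions)
```

A dataset is linearly separable when some choice of weights and bias puts every positive review on one side of zero and every negative review on the other.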

### Effect of adding more features
The earlier models used the 500 most frequent words. Let's try the models with more features, say the 1000 most frequent words.

In [12]:
#Create bag of words
vectorizer = CountVectorizer(analyzer = "word",   
                             tokenizer = None,    
                             preprocessor = None, 
                             stop_words = None,   
                             max_features = 1000) #Use 1000 most frequent words
bag_of_words = vectorizer.fit_transform(norm_reviews).toarray()
print "Bag of words array:",bag_of_words.shape
#Get word feature
feat_names=vectorizer.get_feature_names()

Bag of words array: (2000, 1000)


In [13]:
#Test and Train Split
test=np.vstack((bag_of_words[800:1000,],bag_of_words[1800:,]))
test_label=label[800:1000]+label[1800:]
train=np.vstack((bag_of_words[:800,], bag_of_words[1000:1800,]))
train_label=label[:800]+label[1000:1800]

#NLTK NB
#Convert to word presence dictionary
train_dict=[(dict(zip(feat_names,row)),train_label[index]) 
                   for index,row in enumerate(train.astype(bool))]
classifier = nltk.NaiveBayesClassifier.train(train_dict)

test_dict=[(dict(zip(feat_names,row)),test_label[index]) for index,row in enumerate(test.astype(bool))]
print "Training Accuracy:",nltk.classify.accuracy(classifier,train_dict)
print "Test Accuracy:",nltk.classify.accuracy(classifier,test_dict)

Training Accuracy: 0.824375
Test Accuracy: 0.7875


For the Naive Bayes classifier, both training and testing accuracies have improved, but only marginally. Let's check the SVM.

In [14]:
svm_classifer=svm.SVC().fit(train,train_label)    
print "Training Accuracy:",svm_classifer.score(train,train_label)
print "Testing Accuracy:",svm_classifer.score(test,test_label)

Training Accuracy: 0.935625
Testing Accuracy: 0.815


For the SVM classifier, the training accuracy dropped slightly while the testing accuracy improved slightly.

### Summary
For this movie review dataset, we could predict whether a review is positive or negative with roughly 80% test accuracy using the bag-of-words technique. Both classifiers performed well, with the support vector machine (0.815 test accuracy with 1000 features) slightly ahead of Naive Bayes (0.7875). 