# Text Classification

(Using CountVectorizer from scikit-learn- http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

In this notebook we are going to perfom semantic data analysis of movie reviews provided by imdb. The sentiments are classified as Positive(1) or Negative(0) describing whether the review is positive or negative. The dataset we are using is obtained from kaggle(https://www.kaggle.com/c/word2vec-nlp-tutorial/download/labeledTrainData.tsv.zip). The dataset contains 3 columns and 25000 rows. We will be using 20000 entries for training our model and another 5000 for testing.

The first step is to import the dependencies that we will be using later.

1. numpy- for handling multidimensional arrays and scientific calculations
2. pandas- for data manipulation and handling in a structured manner
3. train_test_split- separating training and testing data to check performance of model for new inputs
4. BeautifulSoup- to remove all the html markups in review text
5. re- regular expression to remove everything except alphabets
6. nltk- natural language toolkit used in removing stopwords
7. CountVectorizer- to extract features from string data and convert them to vectors
8. svm- support vector machine to train the model and predict the output
9. shuffle- to shuffle the data

In [61]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from bs4 import BeautifulSoup
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import svm
from sklearn.utils import shuffle

The second step is to read the data in a pandas DataFrame.

In [5]:
data = pd.read_csv('/Users/kr_subham/Desktop/GitHub/TextClassification/reviews_imdb.tsv', delimiter='\t')
data.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


The data contains 3 columns- id, sentiment and review. Since id is of no use to us, we will drop that column from our DataFrame.

In [6]:
data.drop('id', axis=1, inplace=True)
data.head()

Unnamed: 0,sentiment,review
0,1,With all this stuff going down at the moment w...
1,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,0,The film starts with a manager (Nicholas Bell)...
3,0,It must be assumed that those who praised this...
4,1,Superbly trashy and wondrously unpretentious 8...


Now, it is better to shuffle the data. It will help us generalize the model by removing any patterns in data.

In [7]:
data = shuffle(data)
data.head()

Unnamed: 0,sentiment,review
947,1,Not a bad word to say about this film really. ...
5258,0,"Highly implausible, unbelievable, and incohere..."
624,1,"Being a fan of the first Lion King, I was defi..."
20339,1,I'm sure this is a show no one is that familia...
3957,1,Minor Spoilers<br /><br />Alison Parker (Crist...


In [9]:
data.shape

(25000, 2)

The data contains 25000 rows and 2 columns, as seen above.

The third step is to preprocess the data. We will be creating several functions along the way which will help in processing the data to be fed in the Support Vector Machine algorithm. This preprocessing will also help us in imporving the performance.

In [10]:
data['review'][19]

"Most people, especially young people, may not understand this film. It looks like a story of loss, when it is actually a story about being alone. Some people may never feel loneliness at this level.<br /><br />Cheadles character Johnson reflected the total opposite of Sandlers character Fineman. Where Johnson felt trapped by his blessings, Fineman was trying to forget his life in the same perspective. Jada is a wonderful additive to the cast and Sandler pulls tears. Cheadle had the comic role and was a great supporter for Sandler.<br /><br />I see Oscars somewhere here. A very fine film. If you have ever lost and felt alone, this film will assure you that you're not alone.<br /><br />Jerry"

First we need to remove all the html tags from the review. For this we are using BeautifulSoup.

In [15]:
no_html = []
def remove_html(raw_review):
    flag = BeautifulSoup(raw_review, 'html5lib')
    return flag.get_text()

for i in data['review']:
    no_html.append(remove_html(i))
    
no_html[19]

"Possibly not, but it is awful. Even the fantastic cast cant save it. OK, I admit it started off quite funny but it seemed to plummet downhill as soon as they jumped those girls in the Generals house. Bill Murray turned from being a quick witted, humorous guy into an arsehole who was shouting things at people in the street that just weren't funny, its like he was trying too hard to be funny. His character stole a weapon (an RV? come on...) and ends up being a national hero after invading another country and killing god knows how many soldiers, for a laugh. One good point is that this film shows the inadequacy and incompetence of the US Army and shows how arrogant and imbecilic they really are, albeit unintentionally. I actually felt disgusted that this kind of propaganda crap could really be released."

In the above segment, we created an empty list which will hold the reviews after removing the markups. remove_html function takes any review as an input and returns the review without markup. The for-loop iterates through all the reviews in our data. The list no_html is a list of list containing all the reviews without markup and each review from the data is a separate list.

Next, we need to remove punctuations, numbers etc because they are not much useful. We are using re for this purpose.

In [16]:
text_only = []
def keep_text(raw_review):
    flag = re.sub('[^a-zA-Z]', ' ', raw_review)
    return flag

for i in no_html:
    text_only.append(keep_text(i))
    
text_only[19]

'Possibly not  but it is awful  Even the fantastic cast cant save it  OK  I admit it started off quite funny but it seemed to plummet downhill as soon as they jumped those girls in the Generals house  Bill Murray turned from being a quick witted  humorous guy into an arsehole who was shouting things at people in the street that just weren t funny  its like he was trying too hard to be funny  His character stole a weapon  an RV  come on     and ends up being a national hero after invading another country and killing god knows how many soldiers  for a laugh  One good point is that this film shows the inadequacy and incompetence of the US Army and shows how arrogant and imbecilic they really are  albeit unintentionally  I actually felt disgusted that this kind of propaganda crap could really be released '

text_only is a list that contains reviews without any punctuations or numbers. These are replaced by a blank space, as seen above. The elements in text_only are separated by commas.

Now, convert all the text to lowercase.

In [17]:
lower_text = []
def lowercase(raw_review):
    return raw_review.lower()

for i in text_only:
    lower_text.append(lowercase(i))
    
lower_text[19]

'possibly not  but it is awful  even the fantastic cast cant save it  ok  i admit it started off quite funny but it seemed to plummet downhill as soon as they jumped those girls in the generals house  bill murray turned from being a quick witted  humorous guy into an arsehole who was shouting things at people in the street that just weren t funny  its like he was trying too hard to be funny  his character stole a weapon  an rv  come on     and ends up being a national hero after invading another country and killing god knows how many soldiers  for a laugh  one good point is that this film shows the inadequacy and incompetence of the us army and shows how arrogant and imbecilic they really are  albeit unintentionally  i actually felt disgusted that this kind of propaganda crap could really be released '

Lastly, we need to tokenize the sentences into a list of individual words and then remove stopwords from that list.
We will use the split function to create word tokens.

In [34]:
tokenize_words = []
def tokens(raw_review):
    return raw_review.split()

for i in lower_text:
    tokenize_words.append(tokens(i))
    
tokenize_words[19][:5]

['possibly', 'not', 'but', 'it', 'is']

tokenize_words is a list of lists containing tokens(including stopwords) for each sentence in the review(lower_text). The first 5 tokens of the 19th entry is shown above.

Next, we will use nltk to remove stopwords. nltk contains a large vocabulary of predefined stopwords. We will compare each token in our list to the words in nltk corpus and remove the stopwords.

In [33]:
no_stopwords = []
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))
def remove_stopwords(raw_review):
    raw_review = [w for w in raw_review if not w in stopwords ]
    return raw_review

for i in tokenize_words:
    no_stopwords.append(remove_stopwords(i))
    
no_stopwords[19][:5]

['possibly', 'awful', 'even', 'fantastic', 'cast']

This finishes our preprocessing. we will now join the remaining words from each review to form sentences and then save those sentences along with their sentiment in a new clean dataframe(This step is not mandatory).

In [21]:
sentences = []
for i in no_stopwords:
    sentences.append(' '.join(word for word in i))
    
dict_review = {'review':[i for i in sentences], 'sentiment':[j for j in data['sentiment']]}
data_clean = pd.DataFrame(dict_review, columns=['review', 'sentiment'])
data_clean.head()

Unnamed: 0,review,sentiment
0,bad word say film really initially impressed g...,1
1,highly implausible unbelievable incoherent spa...,0
2,fan first lion king definitely looking forward...,1
3,sure show one familiar might think good almost...,1
4,minor spoilersalison parker cristina raines su...,1


data_clean is a new DataFrame that contains clean reviews.

Dividing the cleaned data in train and test sets

In [44]:
train, test = train_test_split(data_clean, test_size=0.2)

In [46]:
countVect = CountVectorizer(max_features=8000)
x_train_counts = countVect.fit_transform(train['review'])
x_test_counts = countVect.fit_transform(test['review'])
print(x_train_counts.shape)
print(x_test_counts.shape)

(20000, 8000)
(5000, 8000)


This shows we have 20000 entries(vectors) for train data and 5000 for test data with 8000 features.

Lastly, we will use support vector machine to fit our model and predict the output. We will display accuracy of test data to see how well the model performs.

In [47]:
clf_svm = svm.SVC(kernel='rbf', C=10.0)
clf_svm.fit(x_train_counts, train['sentiment'])

SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [51]:
test_accuracy = clf_svm.score(x_test_counts, test['sentiment'])
print('Test accuracy using SVM: ', test_accuracy)

Test accuracy using SVM:  0.6208


The accuracy on test data using support vector machines is 0.6208(or 62.08%). This shows the model performs fairly well on new data as well. The performance can be improved by changing several performance and experimenting new techniques.