In [9]:
import pandas as pd
import numpy as np
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import matplotlib.pyplot as plt
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
from sklearn.neural_network import BernoulliRBM
from timeit import default_timer as timer


Dataset: IMDB text reviews with positive/negative sentiment

I chose this data set because I wanted to run a UCCA parser over the reviews to get semantic annotations of the text, using a model that would have been learned from a corpus of English Treebank reviews from the UCCA resources. I wanted to see whether feeding these semantic features into correlative models produces different results from the usual approach, which pairs a purely correlative model with a non-semantic text representation such as a bag of words.

This data was produced for a paper published in 2011, and no other information is readily available about how the data was collected.

http://ai.stanford.edu/~amaas/data/sentiment/

~~Research question: How does the performance of classifiers for positive/negative sentiment analysis on reviews change when the representation of the text is changed to add more features for each word? How would a hybrid model perform when additional features from semantic annotations are added to the bag of words model?~~

~~I would predict that a hybrid model with more features would improve the classifier's accuracy, but if it performs at the same level as models that only use a bag of words for features, then that would mean that the information from semantic annotations is not compatible with simple correlative methods.~~

Research question: How does the performance of classifiers for positive/negative sentiment analysis on reviews change when the text representation keeps or removes stop words?

An important difference between semantic annotations of text and bag of words models is that stop words are very important for showing the grammatical structure of text, but they are typically removed in bag of words models. So another interesting question is how models perform with and without stop words. Perhaps neural nets or n-grams can capture some of this grammatical structure by making connections between stop words and other words, so I would expect these methods to perform better than the simple bag of words model.
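To make the point about stop words and n-grams concrete, here is a minimal sketch in plain Python. The stop list and the example sentence are made up for illustration, and the tokenizer is a simplified stand-in (lowercase, tokens of two or more word characters), not the one used in the cells below.

```python
import re

# Hypothetical mini stop list, for illustration only
STOP_WORDS = {"this", "was", "not", "the", "it", "is"}

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n):
    """Return all n-grams of the token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

review = "This was not a good movie"  # made-up example sentence
tokens = tokenize(review)
without_stops = [t for t in tokens if t not in STOP_WORDS]

print(tokens)             # unigrams with stop words kept
print(without_stops)      # unigrams after stop word removal
print(ngrams(tokens, 2))  # bigrams built from the unfiltered tokens
```

With the stop words removed, only "good" and "movie" remain, so the negation is lost. The bigram "not good" keeps it, which is the kind of structure I would hope an n-gram model or a neural net could pick up.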

A neural net or a naive Bayes classifier is well suited to this problem because both methods can handle the large number of features typical of NLP, and a neural net can additionally learn associations between stop words and other words.

Initial challenges: I was not able to try the UCCA annotations because I had difficulty setting up the C++ dev tools required by the parser package. I was also running into memory problems in Jupyter Notebook, so I figured that adding more features would only have made that worse.

To get around some of the memory problems, I decided to work with only a small fraction (3,500) of the 50,000 rows of text data. I also set the minimum document frequency of terms to 5 for the count vectorizer, which further limits the number of features.
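As a rough illustration of how document-frequency cutoffs shrink the vocabulary, here is a small pure-Python sketch. It mimics what I understand `min_df` (given as a count) and `max_df` (given as a fraction) to do in `CountVectorizer`, but `build_vocabulary`, the tokenizer, and the four toy documents are my own simplifications, not scikit-learn code.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def build_vocabulary(docs, min_df=1, max_df=1.0):
    """Keep terms found in at least `min_df` documents and in at most
    a `max_df` fraction of all documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so each term counts once per document
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(docs)
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )

# Toy corpus, made up for illustration
docs = [
    "the movie was great",
    "the movie was awful",
    "the acting was great",
    "the plot was thin",
]

vocab = build_vocabulary(docs, min_df=2, max_df=0.5)
print(vocab)
```

Here "the" and "was" are dropped for appearing in every document, and the words that appear only once are dropped by the minimum count, leaving just "great" and "movie".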

In [2]:
review_text = pd.read_csv("IMDB Dataset.csv")
print(review_text)  # print the DataFrame itself; `review_text.head` without parentheses prints the bound method


                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


In [3]:
text_array = np.array(review_text['review'][:3500])
print(text_array[:5])

["One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked. They are right, as this is exactly what happened with me.<br /><br />The first thing that struck me about Oz was its brutality and unflinching scenes of violence, which set in right from the word GO. Trust me, this is not a show for the faint hearted or timid. This show pulls no punches with regards to drugs, sex or violence. Its is hardcore, in the classic use of the word.<br /><br />It is called OZ as that is the nickname given to the Oswald Maximum Security State Penitentary. It focuses mainly on Emerald City, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. Em City is home to many..Aryans, Muslims, gangstas, Latinos, Christians, Italians, Irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />I would say the main appeal of the show is due to the f

In [4]:
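from sklearn.model_selection import train_test_split

# Split row indices rather than the features themselves, so the same
# train/test split can be reused with every vectorized version of the text.
x_train, x_test, y_train, y_test  = train_test_split(
        np.arange(3500), 
        np.array(review_text['sentiment'][:3500]),
        train_size=0.80, 
        random_state=3)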

In [5]:
print(x_train[:5])

[ 124 2997  824  166 1127]


In [5]:
vectorizer = CountVectorizer(max_df = 0.4, min_df = 5)
counts = vectorizer.fit_transform(text_array)
print(counts)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(np.shape(counts))

  (0, 5535)	2
  (0, 6596)	1
  (0, 3658)	1
  (0, 4961)	1
  (0, 302)	1
  (0, 8591)	2
  (0, 5596)	6
  (0, 2741)	2
  (0, 4651)	3
  (0, 3831)	1
  (0, 6636)	2
  (0, 2812)	1
  (0, 8651)	2
  (0, 3625)	1
  (0, 4910)	4
  (0, 3112)	2
  (0, 7973)	1
  (0, 7607)	2
  (0, 4243)	2
  (0, 1099)	1
  (0, 6849)	1
  (0, 8482)	4
  (0, 8664)	1
  (0, 6998)	1
  (0, 8762)	2
  :	:
  (3499, 408)	1
  (3499, 1543)	1
  (3499, 54)	1
  (3499, 3951)	1
  (3499, 8098)	1
  (3499, 2439)	1
  (3499, 5083)	1
  (3499, 4225)	1
  (3499, 5739)	1
  (3499, 4590)	1
  (3499, 2529)	2
  (3499, 7924)	1
  (3499, 4164)	1
  (3499, 1005)	1
  (3499, 5217)	1
  (3499, 765)	1
  (3499, 6154)	1
  (3499, 131)	2
  (3499, 6655)	1
  (3499, 6578)	1
  (3499, 5534)	1
  (3499, 7555)	1
  (3499, 733)	1
  (3499, 1714)	1
  (3499, 4025)	1
(3500, 8859)


In [11]:
count_array = counts.toarray()
print(count_array[:10])

[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
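Since memory was a problem, it is worth estimating what the dense copy costs. The sketch below is plain arithmetic on the 3500 x 8859 shape printed earlier, assuming 8-byte integer counts (which I believe is the `CountVectorizer` default). The sparse estimate uses a made-up figure of about 100 distinct terms per review, which I did not measure, and assumes 4-byte indices.

```python
def dense_bytes(n_rows, n_cols, itemsize=8):
    """Size of a dense array: one stored value per cell."""
    return n_rows * n_cols * itemsize

def csr_bytes(n_rows, nnz, data_itemsize=8, index_itemsize=4):
    """Approximate size of a CSR sparse matrix: one value and one column
    index per nonzero entry, plus one row pointer per row (and one extra)."""
    return nnz * (data_itemsize + index_itemsize) + (n_rows + 1) * index_itemsize

n_rows, n_cols = 3500, 8859   # shape of `counts` printed earlier
assumed_nnz = n_rows * 100    # ASSUMPTION: ~100 distinct terms per review, not measured

print(f"dense:            {dense_bytes(n_rows, n_cols) / 1e6:.1f} MB")
print(f"sparse (assumed): {csr_bytes(n_rows, assumed_nnz) / 1e6:.1f} MB")
```

The later cells pass the sparse `stop_counts` matrix straight into `fit` and `score`, so it seems likely the same would work for `counts` and the `toarray()` step could be skipped.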


In [8]:
#Trying Multi layer perceptron models

In [9]:
from timeit import default_timer as timer
start = timer()
print("Fitting MLPClassifier to the data...")
m = MLPClassifier(hidden_layer_sizes= (50,),
                 activation="relu",
                 alpha=0.001,
                 tol=1e-6,
                 max_iter=100000).fit(count_array[x_train],y_train)
end = timer()
print(f"Done. Took {end-start:.2f} seconds")
#test_preds = m.predict(count_array[x_test])
print(m.score(count_array[x_test], y_test))

Fitting MLPClassifier to the data...
Done. Took 329.99 seconds
0.8142857142857143


In [10]:
start = timer()
print("Fitting MLPClassifier to the data...")
m4 = MLPClassifier(hidden_layer_sizes= (50,),
                 activation="logistic",
                 alpha=0.001,
                 tol=1e-6,
                 max_iter=100000).fit(count_array[x_train],y_train)
end = timer()
print(f"Done. Took {end-start:.2f} seconds")
#test_preds = m4.predict(count_array[x_test])
print(m4.score(count_array[x_test], y_test))

Fitting MLPClassifier to the data...
Done. Took 491.90 seconds
0.8085714285714286


In [6]:
# Read the stop word list, one word per line
with open("longstop.txt", "r") as stopfile:
    stopwords = stopfile.read()
#print(stopwords)
stoplist = stopwords.split("\n")
print(stoplist)

['', 'a', 'able', 'about', 'above', 'abst', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'added', 'adj', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'again', 'against', 'ah', 'all', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'apparently', 'approximately', 'are', 'aren', 'arent', 'arise', 'around', 'as', 'aside', 'ask', 'asking', 'at', 'auth', 'available', 'away', 'awfully', 'b', 'back', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'between', 'beyond', 'biol', 'both', 'brief', 'briefly', 'but', 'by', 'c', 'ca', 'came', 'can', 'cannot', "can't", 'cause', 'causes', 'certain', 'certainly', 'co', 'com', 'come', 'comes', '

In [12]:
len(stoplist)

668
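The printed list starts with an empty string and includes entries such as "can't", which a tokenizer that splits on punctuation would never produce as a single token. A possible cleanup, sketched here in plain Python, is to run every stop word through the same tokenization as the reviews. The `tokenize` function is my simplified stand-in for the vectorizer's default behavior (lowercase, tokens of two or more word characters), and `normalize_stop_words` is a hypothetical helper, not part of scikit-learn.

```python
import re

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def normalize_stop_words(raw_lines):
    """Drop blank entries and split each stop word the same way the
    documents are split, so every entry can actually match a token."""
    cleaned = set()
    for line in raw_lines:
        cleaned.update(tokenize(line))
    return sorted(cleaned)

# A few entries taken from the printed list above
raw = ["", "a", "able", "can't", "aren", "arent"]
print(normalize_stop_words(raw))
```

Under these assumptions the empty string and the single letter "a" disappear (the tokenizer never emits them anyway), and "can't" is reduced to "can". Note that this changes the length of the list, so the count of 668 would no longer apply.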

In [7]:
stop_vectorizer = CountVectorizer(max_df = 0.4, min_df = 5, stop_words = stoplist)
stop_counts = stop_vectorizer.fit_transform(text_array)
print(stop_counts)
print(stop_vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(np.shape(stop_counts))



  (0, 6283)	1
  (0, 4751)	1
  (0, 8158)	2
  (0, 5325)	6
  (0, 2638)	2
  (0, 4456)	3
  (0, 3672)	1
  (0, 2698)	1
  (0, 3480)	1
  (0, 7586)	1
  (0, 7243)	2
  (0, 1031)	1
  (0, 6528)	1
  (0, 8052)	4
  (0, 6667)	1
  (0, 8300)	2
  (0, 7807)	1
  (0, 2834)	1
  (0, 3547)	1
  (0, 5871)	1
  (0, 5877)	1
  (0, 2389)	1
  (0, 6679)	1
  (0, 3491)	1
  (0, 1409)	1
  :	:
  (3499, 6429)	1
  (3499, 385)	1
  (3499, 1467)	1
  (3499, 54)	1
  (3499, 7698)	1
  (3499, 2348)	1
  (3499, 4869)	1
  (3499, 4052)	1
  (3499, 5461)	1
  (3499, 4399)	1
  (3499, 2436)	2
  (3499, 7550)	1
  (3499, 3992)	1
  (3499, 939)	1
  (3499, 4993)	1
  (3499, 718)	1
  (3499, 5861)	1
  (3499, 131)	2
  (3499, 6341)	1
  (3499, 6265)	1
  (3499, 5273)	1
  (3499, 7193)	1
  (3499, 686)	1
  (3499, 1635)	1
  (3499, 3856)	1
(3500, 8387)


In [14]:
start = timer()
print("Fitting MLPClassifier to the data...")
m2 = MLPClassifier(hidden_layer_sizes= (50,),
                 activation="relu",
                 alpha=0.001,
                 tol=1e-6,
                 max_iter=100000).fit(stop_counts[x_train],y_train)
end = timer()
print(f"Done. Took {end-start:.2f} seconds")
#test_preds = m2.predict(stop_counts[x_test])
print(m2.score(stop_counts[x_test], y_test))

Fitting MLPClassifier to the data...
Done. Took 196.35 seconds
0.8128571428571428


In [15]:
start = timer()
print("Fitting MLPClassifier to the data...")
m3 = MLPClassifier(hidden_layer_sizes= (50,),
                 activation="logistic",
                 alpha=0.001,
                 tol=1e-6,
                 max_iter=100000).fit(stop_counts[x_train],y_train)
end = timer()
print(f"Done. Took {end-start:.2f} seconds")
#test_preds = m3.predict(stop_counts[x_test])
print(m3.score(stop_counts[x_test], y_test))

Fitting MLPClassifier to the data...
Done. Took 256.99 seconds
0.8142857142857143


In [16]:
from sklearn.naive_bayes import MultinomialNB

In [17]:
nb_full = MultinomialNB()
nb_full.fit(count_array[x_train], y_train)
nb_full.score(count_array[x_test], y_test)

0.8242857142857143

In [18]:
nb_stop = MultinomialNB()
nb_stop.fit(stop_counts[x_train], y_train)
nb_stop.score(stop_counts[x_test], y_test)

0.83
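Before reading too much into the differences between these six scores, it helps to ask how much noise a 700-review test set carries. The sketch below converts each accuracy printed above back into a count of correct predictions and attaches a rough 95% interval from the binomial standard error sqrt(p(1-p)/n). This is a normal approximation that treats test reviews as independent and ignores run-to-run variation from the MLP's random initialization, so it is only a sanity check.

```python
import math

n_test = 700  # 20% of the 3,500 reviews used above

# Test accuracies copied from the cell outputs above
scores = {
    "MLP relu, all words": 0.8142857142857143,
    "MLP logistic, all words": 0.8085714285714286,
    "MLP relu, stop words removed": 0.8128571428571428,
    "MLP logistic, stop words removed": 0.8142857142857143,
    "NB, all words": 0.8242857142857143,
    "NB, stop words removed": 0.83,
}

def standard_error(p, n):
    """Binomial standard error of an accuracy p measured on n examples."""
    return math.sqrt(p * (1 - p) / n)

for name, acc in scores.items():
    correct = round(acc * n_test)
    half_width = 1.96 * standard_error(acc, n_test)
    print(f"{name}: {correct}/{n_test} correct, {acc:.3f} +/- {half_width:.3f}")
```

The half-width comes out just under 0.03 for every model, while the best and worst scores differ by only about 0.02 (581 versus 566 correct reviews). On this evidence alone I cannot say that removing stop words helps or hurts.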

In [12]:
# Test the number of units in a single hidden layer

accuracy_scores_full = []
accuracy_scores_stop = []
layers = [1,5,10,15,20,25,30,35,40,45]  # hidden layer widths, not layer counts
start = timer()
for i in layers:
    print('starting' + str(i))
    m_full = MLPClassifier(hidden_layer_sizes= (i,),
                     activation="relu",
                     alpha=0.001,
                     tol=1e-6,
                     max_iter=100000).fit(count_array[x_train],y_train)
    accuracy_scores_full.append(m_full.score(count_array[x_test], y_test))
    
    m_stop = MLPClassifier(hidden_layer_sizes= (i,),
                 activation="relu",
                 alpha=0.001,
                 tol=1e-6,
                 max_iter=100000).fit(stop_counts[x_train],y_train)
    accuracy_scores_stop.append(m_stop.score(stop_counts[x_test], y_test))
    
end = timer()
print(f"Done. Took {end-start:.2f} seconds")

    

starting1
starting5
starting10
starting15
starting20




starting25
starting30
starting35
starting40
starting45
Done. Took 129471.32 seconds
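The sweep reports only a single total time, which makes it hard to tell which hidden layer sizes were the expensive ones. One option is a small timing helper that records the elapsed time per setting. This is a sketch using only the standard library: `timed` and `timings` are names I made up, and the loop body is a stand-in workload rather than the actual `fit` and `score` calls.

```python
from contextlib import contextmanager
from timeit import default_timer as timer

@contextmanager
def timed(label, log):
    """Time the enclosed block, store the seconds in `log[label]`, and print them."""
    start = timer()
    try:
        yield
    finally:
        elapsed = timer() - start
        log[label] = elapsed
        print(f"{label}: {elapsed:.2f} seconds")

timings = {}
for size in [1, 5, 10]:
    with timed(f"hidden units = {size}", timings):
        # Stand-in workload; in the notebook this would be the two
        # MLPClassifier fit/score calls for this hidden layer size.
        sum(i * i for i in range(10_000 * size))
```

Because the time is recorded in a `finally` block, a setting that raises an error partway through still leaves its elapsed time in `timings`.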
