### LING1340 Todo 9
Name: **Daniel Zheng**

Email: **daniel.zheng@pitt.edu**

#### Dataset
Large Movie Review Dataset from [here](http://ai.stanford.edu/~amaas/data/sentiment/).

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. *The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).*


This is a modified version of my previous homework that tries some other classifiers.

In [3]:
# useful libraries
import numpy as np
import re, glob


In [4]:
# loading training data
def load(filepath):
    files = glob.glob(filepath)
    raw = []
    for file in files:
        with open(file, encoding='utf-8') as f:
            raw.append(f.read())
    return raw
train_neg_raw = load('/home/dan/Documents/ling1340/data/aclImdb/train/neg/*')
train_pos_raw = load('/home/dan/Documents/ling1340/data/aclImdb/train/pos/*')
test_neg_raw = load('/home/dan/Documents/ling1340/data/aclImdb/test/neg/*')
test_pos_raw = load('/home/dan/Documents/ling1340/data/aclImdb/test/pos/*')

In [5]:
print('Negative training sample:', np.random.choice(train_neg_raw, 1))

Negative training sample: [ 'First let me say that I am not a Dukes fan, but after this movie the series looked like Law and Order. The worst thing was the casting of Roscoe and Boss Hogg. Burt Reynolds is not Boss Hogg, and even worse was M.C. Gainey as Roscoe, If they ever watched the show Roscoe was not a hard ass cop. He was more a Barney Fife than the role he played in this movie.<br /><br />The movie is loaded with the usual errors, cars getting torn up, and continues like nothing happened. The worst example of this is when the the General gets together with Billy Prickett, and the General is ran into a dirt hill obviously slowing to a near stop, but goes on to win the race.']


In [6]:
print('Positive training sample: ', np.random.choice(train_pos_raw, 1))

Positive training sample:  [ '**SPOILERS** Highly charge police drama about a serial killer loose in and around the small town of Riverside Wisconsin. Who\'s being tracked down by the local police using policewoman Gina Pulasky, Helen Hunt,as an undercover decoy to catch him. <br /><br />Nothing new in this made for TV movie that you haven\'t seen before but the depth of the acting and screenplay is unusually good and brings out a lot about not only the killer but the policewoman\'s, as well as her fellow policeman lover, state of mind.<br /><br />Having been put under psychiatric care after shooting an armed and unstable assailant, who attacked her partner with a rifle. Officer Palusky is given the task to go undercover to get close to murder suspect Kayle Timler, Steven Webber. After he was positively identified by the little girl Sahsa, Kim Kluznick,who saw him not far from where little Timmy Curtis was found stabbed, 18 times, to death the next day.<br /><br />Getting a job at the 

### Description of Data
The data is split into `train` and `test` sets. Each contains a `neg` and a `pos` folder with 12,500 reviews apiece. The `train` folder also contains an `unsup` folder with 50,000 examples for unsupervised learning.
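As a quick sanity check on the layout described above, the files in each folder can be counted. This is a minimal sketch: `count_reviews` is a hypothetical helper (not part of this notebook), `'aclImdb'` is a placeholder path, and it assumes the reviews are stored as `.txt` files.

```python
from pathlib import Path

def count_reviews(root):
    """Count .txt files in each split/label folder that exists under `root`."""
    counts = {}
    for split in ('train', 'test'):
        for label in ('neg', 'pos', 'unsup'):
            folder = Path(root) / split / label
            if folder.is_dir():
                counts[f'{split}/{label}'] = sum(1 for _ in folder.glob('*.txt'))
    return counts

# 'aclImdb' is a placeholder; point it at the extracted dataset folder.
# If the folder does not exist, this prints an empty dict.
print(count_reviews('aclImdb'))
```

Folders that are missing are skipped rather than reported as zero, so an empty result usually means the path is wrong.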

### Processing
A lot of the data contains HTML `<br />` tags, which have to be cleaned up. I weight terms using term frequency-inverse document frequency (tf-idf) and train a variety of classifiers.
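One possible way to handle the tags before tokenizing is to replace each one with a space. This is only a sketch: `BR_TAG` and `strip_br` are hypothetical names that are not part of this notebook, and it assumes `<br>` variants are the only markup in the reviews.

```python
import re

# Matches <br>, <br/>, and <br /> in any letter case.
BR_TAG = re.compile(r'<br\s*/?>', flags=re.IGNORECASE)

def strip_br(text):
    """Replace each <br> tag with a space, then collapse runs of whitespace."""
    without_tags = BR_TAG.sub(' ', text)
    return ' '.join(without_tags.split())

sample = 'He was more a Barney Fife.<br /><br />The movie is loaded with the usual errors.'
print(strip_br(sample))
```

Replacing the tag with a space (rather than deleting it) keeps the words on either side from being fused together.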

### Expected problems with approach
Much of each review is background information about the movie that probably won't help with learning sentiment. Although tf-idf de-emphasizes very common words like "is", it may weight rare background terms even more heavily than common reviewing terms like "terrible" or "fantastic". Since each background term should appear in only a few reviews, though, it shouldn't make a huge difference.
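The weighting concern above can be made concrete with a toy corpus. The sketch below assumes the smoothed idf formula that scikit-learn uses by default, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The four documents and the helper `smoothed_idf` are made up for illustration.

```python
import math

# Toy corpus (invented for illustration, not taken from the dataset).
docs = [
    "the acting is terrible",
    "the plot is terrible and reynolds is miscast",
    "the film is fantastic",
    "the ending is fantastic",
]

def smoothed_idf(term, documents):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, with df counted per document."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1

idf = {term: smoothed_idf(term, docs) for term in ("is", "terrible", "reynolds")}
for term, weight in idf.items():
    print(f"{term:10s} idf = {weight:.3f}")
```

Here "is" appears in every document and gets the minimum weight of 1, while the one-off name "reynolds" gets a higher idf than the sentiment word "terrible". The name only contributes to the single review that contains it, however, which is why its effect on the classifier should be limited.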

In [7]:
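# clean up data: keep only tokens made entirely of letters, digits, and common punctuation
def clean(raw_input):
    # split on whitespace and drop any token containing a character outside the
    # allowed set, e.g. tokens fused with <br /> tags; inside the brackets, '-/
    # is a range covering ' ( ) * + , - . /
    cleaned = [' '.join([x for x in review.split() if re.sub('[a-zA-Z0-9_.,!"\'-/]', '', x) == '']) for review in raw_input]
    return cleaned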


In [9]:
# clean all four splits (takes a while)
train_neg = clean(train_neg_raw)
train_pos = clean(train_pos_raw)
test_neg = clean(test_neg_raw)
test_pos = clean(test_pos_raw)

In [10]:
# example of what the filtered data looks like
print(train_neg[:2])
print('total negative training samples:', len(train_neg))

['"Memoirs of a Geisha" is a visually stunning melodrama that seems more like a camp, drag queen satire than anything to do with real first half of the film defensively keeps insisting that geishas are neither prostitutes nor concubines, that they are the embodiment of traditional Japanese beauty. But other than one breathtaking dance, the rest of the movie degenerates into "Pretty Baby" in Storyville territory, or at least Vashti and Esther in the Purim story, as all the women\'s efforts at art and artifice are about entertaining much, much older, drunken boorish men. Maybe it is Japanese culture that is being prostituted, and not just to the American louts after World War it\'s the strain of speaking in English, but Ziyi Zhang shows barely little of the great flare she demonstrated in "House of Flying Daggers (Shi mian mai fu)" and "Hero (Ying xiong)." Michelle Yeoh occasionally gets to project a glimmer of her assured performance in "Crouching Tiger, Hidden Dragon (Wo hu cang long).

In [11]:
print(train_pos[:2])
print('total positive training samples:', len(train_pos))

['"Elvira, Mistress Of The Dark" is a sort of "Harper Valley P.T.A." with touches of the supernatural. Elvira (Cassandra Peterson) walks off her job as television horror movie hostess after the new station\'s owner gets fresh with her. She\'s now relying on a Las Vegas show to carry her through, but learns she needs to come up with more money to get the show started. Things look hopeless to raise that money until she receives notice of her aunt\'s death, which then takes Elvira to Massachusetts for the reading of the will. A house in need of repairs, a dog, and a cookbook are all that is left to her by her aunt, and again it seems Elvira is having trouble coming up with the money for the Las Vegas show. The adults of the small and narrow minded town make things worse by making things more difficult for Elvira. Only the local hunk (Daniel Greene), and a group of teenagers will befriend her. Elvira\'s Uncle "Vinnie" (W. Morgan Sheppard), presses to make a deal with Elvira for the cookboo

In [12]:
# word frequency counts for each class
from collections import Counter
neg_word_counts = Counter()
pos_word_counts = Counter()
# count each class in its own loop so that neither list is truncated
# if the two classes ever differ in size
for neg in train_neg:
    neg_word_counts.update(word.strip('.,?!"\'').lower() for word in neg.split())
for pos in train_pos:
    pos_word_counts.update(word.strip('.,?!"\'').lower() for word in pos.split())

In [13]:
print('100 most common negative words:',neg_word_counts.most_common(100))

100 most common negative words: [('the', 157957), ('a', 78161), ('and', 72124), ('of', 68541), ('to', 68384), ('is', 49610), ('in', 42569), ('this', 38937), ('i', 37661), ('it', 37591), ('that', 34877), ('was', 26097), ('movie', 23129), ('for', 21374), ('but', 20784), ('with', 20470), ('as', 19801), ('film', 17447), ('on', 16623), ('not', 15692), ('have', 15082), ('you', 14669), ('are', 14506), ('be', 14292), ('his', 12048), ('one', 12025), ('at', 11963), ('he', 11876), ('they', 11687), ('all', 11291), ('like', 10670), ('so', 10619), ('just', 10414), ('by', 10321), ('an', 10129), ('or', 9649), ('from', 9526), ('who', 9160), ('about', 8898), ('if', 8691), ('out', 8491), ('some', 8122), ("it's", 8084), ('there', 8073), ('her', 7713), ('no', 7674), ('has', 7551), ('even', 7377), ('what', 7373), ('good', 7032), ('bad', 6952), ('would', 6828), ('only', 6636), ('more', 6560), ('when', 6520), ('up', 6330), ('really', 6135), ('had', 6096), ('were', 5964), ('my', 5716), ('time', 5662), ('very',

In [14]:
print('100 most common positive words:',pos_word_counts.most_common(100))

100 most common positive words: [('the', 167590), ('and', 87674), ('a', 82322), ('of', 76368), ('to', 66210), ('is', 56859), ('in', 48939), ('it', 37489), ('that', 33782), ('i', 33488), ('this', 33409), ('as', 25406), ('with', 22869), ('for', 21958), ('was', 21785), ('but', 19853), ('film', 19278), ('movie', 17901), ('his', 17083), ('on', 16466), ('are', 14716), ('he', 14428), ('you', 14324), ('not', 13813), ('one', 12682), ('have', 12514), ('be', 12201), ('by', 11810), ('all', 11177), ('an', 11159), ('at', 10997), ('from', 10587), ('who', 10570), ('her', 10209), ('has', 9134), ('they', 9033), ('so', 8590), ('like', 8506), ('very', 8185), ('about', 8178), ("it's", 8029), ('out', 7651), ('more', 7364), ('good', 7293), ('some', 7265), ('or', 7261), ('when', 7197), ('what', 7045), ('just', 7005), ('she', 6835), ('if', 6663), ('story', 6381), ('there', 6230), ('my', 6206), ('great', 6188), ('their', 6076), ('time', 5863), ('which', 5860), ('up', 5738), ('see', 5737), ('can', 5503), ('reall

### Frequency counts
Looking at the processed data above, words like "is" and "was" are very common, as expected. The negative set also has negative words like "awful", "terrible", and "stupid", while the positive set has words like "perfect", "excellent", and "beautiful". This is a good sign! After applying tf-idf, it should be pretty easy for a classifier to determine sentiment.
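Because function words dominate both lists, it can help to rank words by how much more often they occur in one class than in the other. The sketch below uses made-up toy counters rather than the real `neg_word_counts` and `pos_word_counts`. `distinctive_words` is a hypothetical helper, and the add-one smoothing and `min_count` cutoff are assumptions chosen for illustration.

```python
from collections import Counter

def distinctive_words(target, other, min_count=2, top_n=3):
    """Rank words by how much more often they occur in `target` than in `other`."""
    target_total = sum(target.values())
    other_total = sum(other.values())
    ratios = {}
    for word, count in target.items():
        if count < min_count:
            continue  # skip very rare words, whose ratios are noisy
        # add-one smoothing avoids division by zero for words unseen in `other`
        target_rate = (count + 1) / target_total
        other_rate = (other[word] + 1) / other_total
        ratios[word] = target_rate / other_rate
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]

# Toy counters (invented for illustration).
toy_neg = Counter("the movie was awful the plot was awful and stupid".split())
toy_pos = Counter("the movie was perfect the cast was perfect and beautiful".split())
print(distinctive_words(toy_neg, toy_pos))
print(distinctive_words(toy_pos, toy_neg))
```

On the real counters, a call like `distinctive_words(neg_word_counts, pos_word_counts, min_count=100, top_n=20)` would be one way to surface sentiment-bearing words without scanning past "the" and "a".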

### Creating labels
Train and test labels are assigned using `0` as the negative class and `1` as the positive class.

In [15]:
train = train_neg + train_pos # concatenate for vectorizing
train_labels = [0]*len(train_neg) + [1]*len(train_pos) # labels
test = test_neg + test_pos
test_labels = [0]*len(test_neg) + [1]*len(test_pos)

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer
# should be the same as CountVectorizer combined with TfidfTransformer
tfidf = TfidfVectorizer()
train_vectors = tfidf.fit_transform(train)
# already fit to training set, so just transform
test_vectors = tfidf.transform(test)

In [17]:
print(train_vectors.shape)
print(test_vectors.shape)

(25000, 72996)
(25000, 72996)


In [18]:
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB().fit(train_vectors, train_labels)

In [19]:
predicted = classifier.predict(test_vectors)
print('Multinomial NB: ', np.mean(predicted == test_labels)*100, '% accuracy')

Multinomial NB:  82.8 % accuracy


### Results
82.8% accuracy is pretty good! Definitely better than expected. Just for fun, I put together my own test set of movie review strings to see how it performs.

In [20]:
custom_test = ["This movie was the worst. I hate it.", "Terrible acting. Negative, bland, uninteresting.", 
               "This movie was great, I really enjoyed the acting!", 
               "Amazing storyline, hilarious characters, and a shocking ending.", 
               "The vague plot was ridiculously boring, and put me to sleep."]
custom_labels = [0,0,1,1,0]
custom_test_vectors = tfidf.transform(clean(custom_test))

In [21]:
custom_predictions = classifier.predict(custom_test_vectors)
print(np.mean(custom_predictions == custom_labels)*100, '% accuracy on custom test set')
print(custom_predictions)

100.0 % accuracy on custom test set
[0 0 1 1 0]


### Conclusions
This code does a few things:

1. Reads in movie review data so that each review is one string in a list.
2. Preprocesses each review, dropping any whitespace-separated token that contains a character outside a small set of letters, digits, and punctuation (this removes the `<br />` tags).
3. Creates train and test tf-idf vectors.
4. Fits a naive Bayes classifier to the training vectors.
5. Evaluates on the test data.

So it looks like tf-idf with a multinomial naive Bayes classifier can pretty reliably guess the binary sentiment of a movie review. Next we'll try some other classifiers.
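To show what the naive Bayes step is doing, here is a minimal pure-Python sketch of multinomial naive Bayes with Laplace smoothing. It is not scikit-learn's implementation: it works on raw word counts rather than tf-idf weights, and the toy documents and the names `train_nb` and `predict_nb` are made up for illustration.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0):
    """Fit per-class log priors and Laplace-smoothed log word likelihoods."""
    vocab = {word for doc in docs for word in doc.split()}
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        class_docs = [doc for doc, label in zip(docs, labels) if label == c]
        counts = Counter(word for doc in class_docs for word in doc.split())
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior[c] = math.log(len(class_docs) / len(docs))
        log_likelihood[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_likelihood

def predict_nb(doc, log_prior, log_likelihood):
    """Return the class with the highest log score; unseen words are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_likelihood[c][w] for w in doc.split() if w in log_likelihood[c]
        )
    return max(scores, key=scores.get)

# Toy training data (invented): 0 = negative, 1 = positive.
toy_docs = ["terrible boring plot", "awful terrible acting",
            "great acting", "amazing great story"]
toy_labels = [0, 0, 1, 1]
prior, likelihood = train_nb(toy_docs, toy_labels)
print(predict_nb("terrible story", prior, likelihood))
print(predict_nb("great plot", prior, likelihood))
```

Each word contributes independently to the score, so a review is classified by whichever class makes its words most likely overall. Smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.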

In [22]:
from sklearn.naive_bayes import BernoulliNB
bernoulli_clf = BernoulliNB().fit(train_vectors, train_labels)
print('Bernoulli NB: ', np.mean(bernoulli_clf.predict(test_vectors) == test_labels)*100, '% test accuracy')

Bernoulli NB:  82.128 % test accuracy


In [23]:
from sklearn.ensemble import RandomForestClassifier
# tuned params a bit by hand
random_forest_clf = RandomForestClassifier(
    n_estimators=100, max_depth=80, criterion="gini", random_state=0
).fit(train_vectors, train_labels)
print('Random Forest: ', np.mean(random_forest_clf.predict(test_vectors) == test_labels)*100, '% test accuracy')

Random Forest:  83.568 % test accuracy


In [25]:
from sklearn.ensemble import GradientBoostingClassifier
# didn't bother optimizing this by hand
gb_clf = GradientBoostingClassifier().fit(train_vectors, train_labels)
print('Gradient Boosting: ', np.mean(gb_clf.predict(test_vectors) == test_labels)*100, '% test accuracy')

Gradient Boosting:  81.024 % test accuracy


In [28]:
num, vec_size = train_vectors.shape

In [49]:
from keras.models import Model
from keras.layers import Input, Dense, Dropout

# use functional api for sparse input matrix
inputs = Input(shape=(vec_size,), sparse=True)

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(500, activation='relu')(inputs)
x = Dropout(0.3)(x)
x = Dense(128, activation='relu')(x)
x = Dropout(0.5)(x)
x = Dense(64, activation='relu')(x)
#x = Dropout(0.5)(x)
predictions = Dense(1, activation='sigmoid')(x)
model = Model(inputs=inputs, outputs=predictions)
model.summary()  # prints the summary itself and returns None, so no print() is needed

model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])
# train
model.fit(train_vectors, train_labels, epochs=2, batch_size=1000)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_11 (InputLayer)        (None, 72996)             0         
_________________________________________________________________
dense_41 (Dense)             (None, 500)               36498500  
_________________________________________________________________
dropout_19 (Dropout)         (None, 500)               0         
_________________________________________________________________
dense_42 (Dense)             (None, 128)               64128     
_________________________________________________________________
dropout_20 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_43 (Dense)             (None, 64)                8256      
_________________________________________________________________
dense_44 (Dense)             (None, 1)                 65        
Total para

<keras.callbacks.History at 0x7f2ce1076cc0>

In [50]:
score = model.evaluate(test_vectors, test_labels, batch_size=1000, verbose=1)
print('Neural Network: ', score[1]*100, '% test accuracy')

Neural Network:  87.2239995003 % test accuracy


With some optimization, these classifiers would probably all perform similarly, though I'd expect the NN to do better with larger datasets. In this case the neural net did much better, but only because I tuned its parameters by hand and used mostly defaults for the other classifiers. I would have liked to try some kind of recurrent model with LSTMs, since that seems to be what state-of-the-art text classification models usually use these days.

In [44]:
score = model.evaluate(custom_test_vectors, custom_labels, batch_size=1000, verbose=1)
print('Neural Network: ', score[1]*100, '% test accuracy')

Neural Network:  100.0 % test accuracy


In [52]:
model.predict(tfidf.transform(['bad']))

array([[ 0.02084433]], dtype=float32)

In [53]:
model.predict(tfidf.transform(['good']))

array([[ 0.9377352]], dtype=float32)

In [55]:
model.predict(tfidf.transform(['terrible bad']))

array([[ 0.00937826]], dtype=float32)

In [62]:
model.predict(tfidf.transform(['terrible amazing']))

array([[ 0.39154819]], dtype=float32)

In [57]:
model.predict(tfidf.transform(['great amazing']))

array([[ 0.99096328]], dtype=float32)

It's interesting to see the raw predictions for some short phrases! Since I defined 1 as positive and 0 as negative, outputs near 0 mean negative and outputs near 1 mean positive. "Terrible" and "amazing" are near opposites, so it makes sense that the model was less certain about the sentiment of "terrible amazing".
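The raw sigmoid outputs can be turned into labels and a rough certainty score. The sketch below copies the predictions from the cells above (rounded to four decimals). The 0.5 decision threshold and the `certainty` measure are assumptions for illustration, and `to_label` and `certainty` are hypothetical helpers.

```python
# Sigmoid outputs copied from the cells above, rounded to 4 decimals.
scores = {
    'bad': 0.0208,
    'good': 0.9377,
    'terrible bad': 0.0094,
    'terrible amazing': 0.3915,
    'great amazing': 0.9910,
}

def to_label(p, threshold=0.5):
    """Map a sigmoid output to 1 (positive) or 0 (negative); threshold is assumed."""
    return 1 if p >= threshold else 0

def certainty(p):
    """Distance from the 0.5 decision boundary, rescaled to the range [0, 1]."""
    return abs(p - 0.5) * 2

for phrase, p in scores.items():
    print(f'{phrase!r}: label={to_label(p)}, certainty={certainty(p):.2f}')

# The phrase whose score sits closest to the decision boundary.
least_certain = min(scores, key=lambda phrase: certainty(scores[phrase]))
print('least certain:', least_certain)
```

Under this threshold "terrible amazing" is still labeled negative, but it is by far the closest of the five phrases to the decision boundary.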