# Sentiment Analysis on Rotten Tomatoes Review Data

We convert our unstructured data, namely reviews collected from Rotten Tomatoes, into sentiment scores. To do this, we train a model on the Large Movie Review Dataset v1.0 <http://ai.stanford.edu/~amaas/data/sentiment/> [3].
  
Full credit goes to Aaron Kub for his articles on applying sentiment analysis to the Large Movie Review Dataset v1.0 [1,2]. His code is adapted here to train a model that classifies the sentiment of a review. The model is then used to classify our collection of Rotten Tomatoes reviews and to output aggregate sentiment scores for each movie.


**References**  
[1] https://towardsdatascience.com/sentiment-analysis-with-python-part-1-5ce197074184  
[2] https://towardsdatascience.com/sentiment-analysis-with-python-part-2-4f71e7bde59a  
[3] Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

In [1]:
import os
import re
import glob
import joblib
import numpy as np
import pandas as pd

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split



### Download and Extract the Data
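Download the data from <http://ai.stanford.edu/~amaas/data/sentiment/> and extract the archive (`aclImdb_v1.tar.gz`). Some tools need two passes, first the `.gz` layer and then the `.tar` layer. Place the resulting `aclImdb` folder in the same directory as this notebook.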
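The archive can also be unpacked from Python in a single step. The sketch below assumes the download is saved as `aclImdb_v1.tar.gz` next to the notebook; `extract_archive` is a hypothetical helper, not part of the original notebook.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, dest_dir="."):
    """Unpack a .tar or .tar.gz archive into dest_dir and return its top-level names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression, so the gzip and tar layers
    # are handled in one pass.
    with tarfile.open(archive_path, mode="r:*") as archive:
        names = archive.getnames()
        archive.extractall(path=dest)
    # Report the top-level entries that were created (for example "aclImdb").
    return sorted({name.split("/")[0] for name in names})


if __name__ == "__main__":
    archive_file = Path("aclImdb_v1.tar.gz")  # assumed file name and location
    if archive_file.exists():
        print(extract_archive(archive_file))
```

Only unpack archives from a source you trust, since `extractall` writes whatever paths the archive contains.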

### Read the data

In [22]:
# train_pos = glob.glob('./aclImdb/train/pos/*.txt')
# train_neg = glob.glob('./aclImdb/train/neg/*.txt')
# train_files = train_pos + train_neg

# test_pos = glob.glob('./aclImdb/test/pos/*.txt')
# test_neg = glob.glob('./aclImdb/test/neg/*.txt')
# test_files = test_pos + test_neg

# reviews_train = []
# for f in train_files:
#     try:
#         with open(f, encoding="utf8") as f2:
#             first_line = f2.readline().strip()
#             reviews_train.append(first_line.strip())
#     except:
#         print(f)
        
# reviews_test = []
# for f in test_files:
#     try:
#         with open(f, encoding="utf8") as f2:
#             first_line = f2.readline().strip()
#             reviews_test.append(first_line.strip())
#     except:
#         print(f)

### Save the data in pkl format to reduce read time next time

In [None]:
# import joblib
# joblib.dump(reviews_train, "train.pkl")
# joblib.dump(reviews_test, "test.pkl")

### Load the data (previously saved as pkl)

In [23]:
reviews_train = joblib.load("train.pkl")
reviews_test = joblib.load("test.pkl")
    
# Labels: in both lists the first 12,500 reviews are positive (1) and the rest negative (0)
target = [1 if i < 12500 else 0 for i in range(25000)]
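The `target` list relies on ordering: the file lists were built as `pos + neg`, so each list holds the positive reviews first, followed by the negative ones. A sketch that derives each label from the folder name instead, so it does not depend on that ordering, could look like this. `load_labeled_reviews` is a hypothetical helper, and the directory layout is the one used in the reading cell above.

```python
import glob
import os


def load_labeled_reviews(split_dir):
    """Return (reviews, labels) for a directory containing pos/ and neg/ subfolders."""
    reviews, labels = [], []
    for folder, label in (("pos", 1), ("neg", 0)):
        # Sort the paths so the result is reproducible across runs.
        for path in sorted(glob.glob(os.path.join(split_dir, folder, "*.txt"))):
            with open(path, encoding="utf8") as handle:
                # Like the notebook, keep only the first line of each file.
                reviews.append(handle.readline().strip())
            labels.append(label)
    return reviews, labels


if __name__ == "__main__" and os.path.isdir("./aclImdb/train"):
    train_reviews, train_labels = load_labeled_reviews("./aclImdb/train")
    print(len(train_reviews), sum(train_labels))
```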

### Clean the review data
- Lowercase the text, remove punctuation and digits, and replace HTML line breaks, hyphens, and slashes with a space.

In [9]:
import re

# Raw strings keep the regex escapes intact and avoid invalid-escape warnings
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
NO_SPACE = ""
SPACE = " "

def preprocess_reviews(reviews):
    
    reviews = [REPLACE_NO_SPACE.sub(NO_SPACE, line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(SPACE, line) for line in reviews]
    
    return reviews

reviews_train_clean = preprocess_reviews(reviews_train)
reviews_test_clean = preprocess_reviews(reviews_test)
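To see what the cleaning step does to a single review, here is a small self-contained sketch. It repeats the two patterns from the cell above (as raw strings) and applies them to a made-up sentence; `clean` is a hypothetical single-string version of `preprocess_reviews`.

```python
import re

# Same patterns as in the cleaning cell, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"(\.)|(\;)|(\:)|(\!)|(\?)|(\,)|(\")|(\()|(\))|(\[)|(\])|(\d+)")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")


def clean(text):
    """Lowercase, delete punctuation and digits, then turn breaks, '-' and '/' into spaces."""
    text = REPLACE_NO_SPACE.sub("", text.lower())
    return REPLACE_WITH_SPACE.sub(" ", text)


sample = "This movie was GREAT!<br /><br />A must-see, 10/10."  # made-up review
cleaned = clean(sample)
print(cleaned.split())
# ['this', 'movie', 'was', 'great', 'a', 'must', 'see']
```

Apostrophes appear in neither pattern, so contractions such as "it's" pass through unchanged.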

### Sentiment Model
- We use the final model proposed by Aaron Kub in part 2 of his article series [2]: a linear SVM trained on binary 1- to 3-gram features.


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

# Binary 1- to 3-gram features; only a handful of stop words are removed
stop_words = ['in', 'of', 'at', 'a', 'the']
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)
X = ngram_vectorizer.fit_transform(reviews_train_clean)
X_test = ngram_vectorizer.transform(reviews_test_clean)

# The final model is trained on the full training set, so no validation split is made here
final = LinearSVC(C=0.01)
final.fit(X, target)

# The test set has the same layout as the training set (positives first, then
# negatives), so the same target list serves as its labels
print("Final Accuracy: %s" % accuracy_score(target, final.predict(X_test)))
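To make the feature representation concrete, the following pure-Python sketch mimics what `CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)` extracts from one review. It is an illustration under stated assumptions, not scikit-learn's implementation: `extract_ngrams` is a hypothetical helper, and the token pattern is copied from scikit-learn's documented default, which keeps only tokens of two or more word characters.

```python
import re

STOP_WORDS = {"in", "of", "at", "a", "the"}
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")  # scikit-learn's default token pattern


def extract_ngrams(text, n_min=1, n_max=3):
    """Return the set of word n-grams present in text (binary presence, no counts)."""
    # Stop words are dropped before the n-grams are formed.
    tokens = [t for t in TOKEN_PATTERN.findall(text.lower()) if t not in STOP_WORDS]
    ngrams = set()
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            ngrams.add(" ".join(tokens[start:start + n]))
    return ngrams


features = extract_ngrams("one of the best films in years")
print(sorted(features))
# ['best', 'best films', 'best films years', 'films', 'films years',
#  'one', 'one best', 'one best films', 'years']
```

Because the stop words are removed first, "one of the best" contributes the bigram "one best" rather than "one of".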


### Save the model in pkl

In [32]:
joblib.dump(final, "sentiment_model.pkl")

['sentiment_model.pkl']

### Top positive features and negative features

In [33]:
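# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0 onwards
feature_to_coef = {
    word: coef for word, coef in zip(
        ngram_vectorizer.get_feature_names_out(), final.coef_[0]
    )
}

# Ten n-grams with the largest positive coefficients
for best_positive in sorted(
    feature_to_coef.items(),
    key=lambda x: x[1],
    reverse=True)[:10]:
    print(best_positive)

print("\n\n")
# Ten n-grams with the most negative coefficients
for best_negative in sorted(
    feature_to_coef.items(),
    key=lambda x: x[1])[:10]:
    print(best_negative)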

('excellent', 0.27778915159678536)
('perfect', 0.21944130971718084)
('wonderful', 0.20303176870579795)
('great', 0.19436777630747099)
('amazing', 0.1925965525252106)
('superb', 0.17824207494793734)
('enjoyed', 0.17514059273124252)
('must see', 0.17499908843269293)
('enjoyable', 0.1688088747135193)
('favorite', 0.1683839906705165)



('worst', -0.3990599590725215)
('awful', -0.30139933561879434)
('waste', -0.2964901976169115)
('boring', -0.27275377562767644)
('bad', -0.24096709832080834)
('terrible', -0.2393709257471478)
('disappointment', -0.23196698267614316)
('poorly', -0.22648068843833707)
('poor', -0.22509660757588298)
('dull', -0.21996306004150176)
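These coefficients are the weights of a linear decision function: with binary features, a review's score is the sum of the coefficients of the n-grams it contains plus an intercept, and the predicted label is 1 when that sum is positive. The sketch below works through this with a few coefficients from the output above, rounded to four decimals. It assumes an intercept of 0 and ignores all other features, so the numbers are illustrative only; `linear_score` is a hypothetical helper.

```python
# A few coefficients copied (rounded) from the output above.
COEFS = {
    "excellent": 0.2778,
    "great": 0.1944,
    "enjoyed": 0.1751,
    "worst": -0.3991,
    "waste": -0.2965,
    "boring": -0.2728,
}
INTERCEPT = 0.0  # assumption: the fitted intercept is not shown in the notebook


def linear_score(present_features):
    """Sum the coefficients of the features that occur in a review."""
    return INTERCEPT + sum(COEFS.get(f, 0.0) for f in present_features)


for feats in (["excellent", "enjoyed"], ["great", "boring", "waste"]):
    score = linear_score(feats)
    label = 1 if score > 0 else 0
    print(feats, round(score, 4), label)
```

In the second case a single positive word is outweighed by two negative ones, which is how a mixed review can still be labelled negative.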


### Load the Rotten Tomatoes Review Data

In [39]:
review_df = pd.read_csv("hive_movie_reviews_semisep.csv", sep = ";")

### Pre-process the Data

In [41]:
reviews_list = preprocess_reviews(review_df["review"].astype(str))
X_reviews = ngram_vectorizer.transform(reviews_list)

### Predict the sentiment score for each review

In [42]:
# predict() returns the class label for each review: 1 = positive, 0 = negative
sentiment_score = final.predict(X_reviews)

In [43]:
review_df["sentiment_score"] = sentiment_score

### Example of predicted sentiment score

In [93]:
gun_shy = review_df.loc[review_df.url_id == "https://www.rottentomatoes.com/m/gun_shy_2017"]

# Print the first three reviews for this movie
for review in gun_shy["review"].head(3):
    print(review)

gun_shy

Gun Shy is unlikely to leave much of an impression, even on those who've followed Banderas's interestingly hit-and-miss career.
It's loud and dumb and irritating and forgettable.
This lackluster comic thriller never matches the over-the-top enthusiasm of its star.


| | review | url_id | sentiment_score |
|---|---|---|---|
| 131995 | Gun Shy is unlikely to leave much of an impres... | https://www.rottentomatoes.com/m/gun_shy_2017 | 0 |
| 131996 | It's loud and dumb and irritating and forgetta... | https://www.rottentomatoes.com/m/gun_shy_2017 | 0 |
| 131997 | This lackluster comic thriller never matches t... | https://www.rottentomatoes.com/m/gun_shy_2017 | 1 |
| 131998 | Gun Shy somehow manages to come across as bein... | https://www.rottentomatoes.com/m/gun_shy_2017 | 0 |
| 131999 | Random moments are not nearly enough to recomm... | https://www.rottentomatoes.com/m/gun_shy_2017 | 0 |
| 132000 | There's desperation for laughs, and then there... | https://www.rottentomatoes.com/m/gun_shy_2017 | 0 |
| 132001 | ... witless dialogue and a plot that has no id... | https://www.rottentomatoes.com/m/gun_shy_2017 | 0 |
| 132002 | The biggest problem for "Gun Shy" isn't its ri... | https://www.rottentomatoes.com/m/gun_shy_2017 | 0 |
| 132003 | Antonio Banderas hams it up in this dumber tha... | https://www.rottentomatoes.com/m/gun_shy_2017 | 0 |


### Aggregate the sentiment score for each movie (url_id) by:
- ss_mean: mean of sentiment score (ss)
- ss_median: median of ss
- ss_p25: 25th percentile of ss
- ss_p75: 75th percentile of ss
- ss_std: standard deviation of ss
- ss_count: Total number of reviews for that movie

In [97]:
# Group once and reuse the grouped series for every statistic
grouped = review_df.groupby("url_id")["sentiment_score"]

ss_mean = grouped.mean()
ss_median = grouped.median()
ss_p25 = grouped.apply(lambda x: np.percentile(x, 25))
ss_p75 = grouped.apply(lambda x: np.percentile(x, 75))
ss_std = grouped.std()
ss_count = grouped.count()

ss_df = pd.concat([ss_mean, ss_median, ss_p25, ss_p75, ss_std, ss_count], axis = 1)
ss_df.columns = ["ss_mean", "ss_median", "ss_p25", "ss_p75", "ss_std", "ss_count"]
ss_df.head()


| url_id | ss_mean | ss_median | ss_p25 | ss_p75 | ss_std | ss_count |
|---|---|---|---|---|---|---|
| https://www.rottentomatoes.com/m/10002519-breaking_point | 1.0 | 1.0 | 1.0 | 1.0 | 0.0 | 6 |
| https://www.rottentomatoes.com/m/1000_times_good_night | 0.818182 | 1.0 | 1.0 | 1.0 | 0.389249 | 55 |
| https://www.rottentomatoes.com/m/10011489-bananas | 0.666667 | 1.0 | 0.25 | 1.0 | 0.516398 | 6 |
| https://www.rottentomatoes.com/m/1001_grams | 0.909091 | 1.0 | 1.0 | 1.0 | 0.294245 | 22 |
| https://www.rottentomatoes.com/m/1003757-cat_people | 0.913043 | 1.0 | 1.0 | 1.0 | 0.284885 | 46 |
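As a check on how these statistics behave for binary labels, the pure-Python sketch below recomputes them for a hypothetical movie with six reviews, four of them positive. `percentile` is a hypothetical helper that reimplements the linear interpolation `np.percentile` uses by default. The results agree with the `10011489-bananas` row above, which has an `ss_count` of 6 and an `ss_mean` of 0.666667.

```python
import math
import statistics


def percentile(values, q):
    """Linear-interpolation percentile (the default method of np.percentile)."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


scores = [1, 1, 0, 1, 0, 1]  # hypothetical labels: four positive, two negative

summary = {
    "ss_mean": statistics.mean(scores),
    "ss_median": statistics.median(scores),
    "ss_p25": percentile(scores, 25),
    "ss_p75": percentile(scores, 75),
    "ss_std": statistics.stdev(scores),  # sample standard deviation, as in pandas
    "ss_count": len(scores),
}
print(summary)
```

Because each label is 0 or 1, `ss_mean` is the fraction of reviews predicted positive.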


In [98]:
ss_df.to_csv("reviews_5000_sentiment_score.csv")
ss_df.shape