# Application

This notebook is a demonstration of how to apply the sentiment analysis model I have built.  I have imported 5732 tweets from the past week, all containing the hashtag #FirstMan (in reference to a movie that was recently released).  The tweets will be processed as follows:

- They will be concatenated to the original movie review dataframe on the 'review' column
- They will be preprocessed by the review_cleaner function 
- Doc2Vec will train its models on the entire new corpus (movie reviews + tweets) in order to build a new vector space
- A gradient boost classifier will then train on these new vectors and will output a value of 'positive' or 'negative' when assessing the new tweets

### Import Modules

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint

import re
import pickle

import requests
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.metrics import classification_report
from bs4 import BeautifulSoup
from nltk.tokenize import WordPunctTokenizer

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

from collections import OrderedDict
import multiprocessing

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

import warnings
warnings.filterwarnings('ignore')
from sklearn import utils

np.random.seed(42)

### Import tweets_df

In [2]:
with open('../Data/tweets_df', 'rb') as f:
    tweets_df = pickle.load(f)

In [3]:
tweets_df.head()

Unnamed: 0,user,date,text
0,eaglesrising99,2018-10-18 03:49:26,Last chance! FOLLOW &amp; RT for your chance t...
1,AdiWriter,2018-10-18 03:48:37,#DamienChazelle has given us his third excelle...
2,Raulimartin,2018-10-18 03:48:00,"Daily Box Office Top 7 for Tuesday, October 16..."
3,dannysullivan,2018-10-18 03:47:38,Saw @FirstManMovie &amp; even though I knew ev...
4,KeaneMexico,2018-10-18 03:46:18,Just been to see #FirstMan. Brilliant film. #R...


# DataFrame Concatenation

### Read Data into Pandas Table

**Perform all operations that were already done to original dataset.**

In [4]:
df = pd.read_csv('../Data/imdb_master.csv', encoding = "ISO-8859-1")
df.drop(columns='Unnamed: 0', axis=1, inplace=True)

In [5]:
df_train = df[df['type'] == 'train']
df_train.reset_index(inplace=True)
df_train.drop(columns=['index'], inplace=True)

In [6]:
df_train = df_train[df_train['label'] != 'unsup']

In [7]:
df_train.drop_duplicates('review', inplace=True)

In [8]:
df_train.shape

(24904, 4)

### Concatenate tweets_df to df_train

In [9]:
tweets_df = tweets_df.rename(columns = {'text':'review'})

In [10]:
df_train.shape

(24904, 4)

In [12]:
tweets_df.shape

(5732, 3)

In [13]:
df_train = pd.merge(df_train, tweets_df, how='outer', on='review')

In [14]:
df_train.head()

Unnamed: 0,type,review,label,file,user,date
0,train,Story of a man who has unnatural feelings for ...,neg,0_3.txt,,NaT
1,train,Airport '77 starts as a brand new luxury 747 p...,neg,10000_4.txt,,NaT
2,train,This film lacked something I couldn't put my f...,neg,10001_4.txt,,NaT
3,train,"Sorry everyone,,, I know this is supposed to b...",neg,10002_1.txt,,NaT
4,train,When I was little my parents took me along to ...,neg,10003_1.txt,,NaT


### Drop Rows

In [15]:
df_train.drop(columns=['type', 'file', 'user', 'date', 'user', 'date'], inplace=True)

In [16]:
df_train.head()

Unnamed: 0,review,label
0,Story of a man who has unnatural feelings for ...,neg
1,Airport '77 starts as a brand new luxury 747 p...,neg
2,This film lacked something I couldn't put my f...,neg
3,"Sorry everyone,,, I know this is supposed to b...",neg
4,When I was little my parents took me along to ...,neg


# Text Preprocessing

In [17]:
url = 'https://raw.githubusercontent.com/stanfordnlp/CoreNLP/master/data/edu/stanford/nlp/patterns/surface/stopwords.txt'
res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml')

In [18]:
standford_nlp = soup.text.split('\n')
stop_word_list = standford_nlp + list(ENGLISH_STOP_WORDS) + ['film', 'films', 'movie', 'movies']

In [19]:
tok = WordPunctTokenizer()

negations_dict = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dict.keys()) + r')\b')
emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                           "]+", flags=re.UNICODE)

def review_cleaner(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    stripped  = re.sub(r'@[A-Za-z0-9_]+', '', souped)
    stripped = re.sub(r'https?://[^ ]+', '', stripped)
    stripped = re.sub(r'www.[^ ]+', '', stripped)
    lower_case = text.lower()
    neg_dict = neg_pattern.sub(lambda x: negations_dict[x.group()], lower_case)
    letters_only = re.sub('[^a-zA-Z]', " ", neg_dict)
    no_emoji = re.sub(emoji_pattern, '', letters_only)
    # During the letters_only process two lines above, it has created unnecessay white spaces,
    # I will tokenize and join together to remove unneccessary white spaces
    words = [x for x  in tok.tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()

### Map Function

In [20]:
df_train['review'] = df_train['review'].map(review_cleaner)

In [21]:
df_train.head()

Unnamed: 0,review,label
0,story of man who has unnatural feelings for pi...,neg
1,airport starts as brand new luxury plane is lo...,neg
2,this film lacked something could not put my fi...,neg
3,sorry everyone know this is supposed to be an ...,neg
4,when was little my parents took me along to th...,neg


# Doc2Vec

In [22]:
tagged_documents = []
for indx, doc in enumerate(df_train["review"].values):
    tagged_documents.append(TaggedDocument([x for x in doc.split()], [indx]))

In [23]:
cores = multiprocessing.cpu_count()
vec_size = 150

model_dbow = Doc2Vec(dm=0, dbow_words=1, vector_size=vec_size, negative=5, hs=0, min_count=2, sample=0, 
             workers=cores)

model_dm_mean = Doc2Vec(dm=1, dm_mean=1, vector_size=vec_size, window=10, negative=5, hs=0, min_count=2, sample=0, 
                workers=cores, alpha=0.05, comment='alpha=0.05')

model_dm_concat = Doc2Vec(dm=1, dm_concat=1, vector_size=vec_size, window=5, negative=5, hs=0, min_count=2, sample=0, 
                  workers=cores)

In [24]:
models = [(model_dbow, 'model_dbow'), (model_dm_mean, 'model_dm_mean'), (model_dm_concat, 'model_dm_concat')]

for model in models:
    model[0].build_vocab(tagged_documents)
    print("%s vocabulary scanned & state initialized" % model[0])
    
models_by_name = OrderedDict((str(model[1]), model[0]) for model in models)

Doc2Vec(dbow+w,d150,n5,w5,mc2,t4) vocabulary scanned & state initialized
Doc2Vec("alpha=0.05",dm/m,d150,n5,w10,mc2,t4) vocabulary scanned & state initialized
Doc2Vec(dm/c,d150,n5,w5,mc2,t4) vocabulary scanned & state initialized


In [25]:
for model in models:
    for epoch in range(30):
        print('Epoch: {0}'.format(epoch), 'Model: %s' % (model[0]))
        model[0].train(utils.shuffle(tagged_documents), total_examples=len(tagged_documents), epochs=1)
        model[0].alpha -= 0.002
        model[0].min_alpha = model[0].alpha

Epoch: 0 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 1 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 2 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 3 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 4 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 5 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 6 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 7 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 8 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 9 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 10 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 11 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 12 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 13 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 14 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 15 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 16 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 17 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 18 Model: Doc2Vec(dbow+w,d150,n5,w5,mc2,t4)
Epoch: 19 Model: Doc2Vec(dbow+w,d150,n5,w

### Define X and y

In [26]:
X = {}

for model in models:
    X[model[1]] = np.zeros((df_train.shape[0], vec_size))
    for i in range(df_train.shape[0]):
        X[model[1]][i] = model[0].docvecs[i]  

In [27]:
y = df_train['label'].values

# Modeling

In [28]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

import matplotlib.pyplot as plt
%matplotlib inline

### Gradient Boost

In [29]:
grad_model = {}

In [30]:
def grad_model_func(X,y):
    grad = GradientBoostingClassifier()
    pipe = Pipeline([
        ('grad', grad)
    ])
    
    params = {
        'grad__n_estimators': [1000],
        'grad__max_features': ['log2']
    }
    
    X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42)
    gs_d2v_grad = GridSearchCV(pipe, param_grid=params, cv=2)
    gs_d2v_grad.fit(X_train, y_train)
    print('train score:', gs_d2v_grad.score(X_train, y_train))
    print('test score:', gs_d2v_grad.score(X_test, y_test))
    print('best score:', gs_d2v_grad.best_score_)
    print('best params:', gs_d2v_grad.best_params_)
    print('---')
    return gs_d2v_grad

In [31]:
for model in models:
    print(model)
    m = grad_model_func(X[model[1]][:24904,:], y[:24904])
    grad_model[model[1]+'_trained'] = m

(<gensim.models.doc2vec.Doc2Vec object at 0x1a367245c0>, 'model_dbow')
train score: 0.9557233108469858
test score: 0.8721490523610665
best score: 0.8663668486990042
best params: {'grad__max_features': 'log2', 'grad__n_estimators': 1000}
---
(<gensim.models.doc2vec.Doc2Vec object at 0x1a36724518>, 'model_dm_mean')
train score: 0.9405717956954706
test score: 0.8475746867973016
best score: 0.8380447585394581
best params: {'grad__max_features': 'log2', 'grad__n_estimators': 1000}
---
(<gensim.models.doc2vec.Doc2Vec object at 0x1a367246d8>, 'model_dm_concat')
train score: 0.8231609380019274
test score: 0.6553164150337295
best score: 0.6226041332048399
best params: {'grad__max_features': 'log2', 'grad__n_estimators': 1000}
---


Create new data frame with all new rows that have import tweets

In [62]:
df_new = df_train.iloc[24904:].reset_index(drop=True)
df_new.head()

Unnamed: 0,review,label,predicted,textblob
0,last chance follow amp rt for your chance to w...,,pos,
1,damienchazelle has given us his third excellen...,,pos,
2,daily box office top for tuesday october astar...,,pos,
3,saw firstmanmovie amp even though knew everyth...,,pos,
4,just been to see firstman brilliant film richa...,,pos,


Create a new column 'predicted' that will contain a prediction of positive or negative for each tweet.

In [63]:
df_train['predicted'] = grad_model['model_dbow_trained'].predict(X[models[0][1]])

Create a count of all predicted positive and negative tweets analyzed by my model.

In [65]:
pos_count = 0
neg_count = 0

for score in df_train.loc[24904:, 'predicted']:
    if score ==  'pos':
        pos_count += 1
    if score == 'neg':
        neg_count += 1
        
print("Positive count = {}".format(pos_count))
print("Negative count = {}".format(neg_count))

Positive count = 4816
Negative count = 916


# TextBlob and VADER Sentiment

### TextBlob

Apply TextBlob.sentiment to all columns with tweets and designate output to new column (TextBlob.sentiment is scored on a scale of -1 to 1.)

In [38]:
df_new['textblob'] = df_new['review'].apply(lambda x: TextBlob(x).polarity)

In [39]:
df_new.head()

Unnamed: 0,review,label,predicted,textblob
0,last chance follow amp rt for your chance to w...,,pos,0.4
1,damienchazelle has given us his third excellen...,,pos,0.5
2,daily box office top for tuesday october astar...,,pos,0.25
3,saw firstmanmovie amp even though knew everyth...,,pos,0.5
4,just been to see firstman brilliant film richa...,,pos,0.9


Create a count of all positive, neutral and negative tweets analyzed by TextBlob

In [49]:
pos_count = 0
neg_count = 0
neutral_count = 0

for score in df_train['textblob']:
    if score > 0:
        pos_count += 1


for score in df_train['textblob']:
    if score < 0:
        neg_count += 1


for score in df_train['textblob']:
    if score == 0:
        neutral_count += 1

print("Positive count = {}".format(pos_count))
print("Negative count = {}".format(neg_count))
print("Neutral count = {}".format(neutral_count))

Positive count = 3715
Negative count = 717
Neutral count = 1300


### Vader Sentiment

Apply SentimentIntensityAnalyzer.polarity_scores to all columns with tweets and designate the ouput to a new column (Vader sentiment is scored on a scale of -4 to 4.)

In [41]:
analyzer = SentimentIntensityAnalyzer()

In [42]:
vs = analyzer.polarity_scores

In [47]:
df_new['vader_sentiment'] = df_new['review'].apply(lambda x: vs(x))

In [48]:
pos_count = 0
neg_count = 0
neutral_count = 0

for dictionary in df_new.loc[:, 'vader_sentiment']:
    if dictionary['compound'] > 0:
        pos_count += 1
    elif dictionary['compound'] < 0:
        neg_count += 1
    else:
        neutral_count += 1 

print("Positive count = {}".format(pos_count))
print("Negative count = {}".format(neg_count))
print("Neutral count = {}".format(neutral_count))

Positive count = 3321
Negative count = 1007
Neutral count = 1404


### Sentiment Count df

In [67]:
i = ['my_model', 'textblob', 'vader_sentiment']
s = {'positive': [4816, 3715, 3321], 'negative': [916, 717, 1007], 'neutral': ['XXX', 1300, 1404]}

pd.DataFrame(data=s, index=i)

Unnamed: 0,positive,negative,neutral
my_model,4816,916,XXX
textblob,3715,717,1300
vader_sentiment,3321,1007,1404
