## Part 1: Using the TextBlob Sentiment Analyzer

### Import the movie review data as a data frame and ensure that the data is loaded properly.

In [258]:
#importing libraries
import numpy as np 
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
import nltk
import re
import string
from nltk.stem import WordNetLemmatizer
import warnings
warnings.filterwarnings("ignore")

In [259]:
# import tsv file as dataframe
df = pd.read_csv('labeledTrainData.tsv',sep = '\t') 
df

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...
...,...,...,...
24995,3453_3,0,It seems like more consideration has gone into...
24996,5064_1,0,I don't believe they made this film. Completel...
24997,10905_3,0,"Guy is a loser. Can't get girls, needs to buil..."
24998,10194_3,0,This 30 minute documentary Buñuel made in the ...


### How many of each positive and negative reviews are there?

In [260]:
# Sentiment - Sentiment of the review; 1 for positive reviews and 0 for negative reviews
pos_neg_reviews = df.groupby('sentiment').count() 
pos_neg_reviews

Unnamed: 0_level_0,id,review
sentiment,Unnamed: 1_level_1,Unnamed: 2_level_1
0,12500,12500
1,12500,12500


### Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

In [261]:
# Installation of TextBlob in system
! pip install -U textblob
# Download the corpora TextBlob relies on (practical NLP work typically uses large bodies of linguistic data, or corpora)
! python -m textblob.download_corpora

Finished.


[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\kadams\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\kadams\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\kadams\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\kadams\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     C:\Users\kadams\AppData\Roaming\nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\kadams\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to

In [262]:
# import TextBlob
from textblob import TextBlob

In [263]:
# Compute TextBlob polarity and subjectivity for each review; the positive/negative label is derived from polarity in the next cell
df[['polarity', 'subjectivity']] = df['review'].apply(lambda Text: pd.Series(TextBlob(Text).sentiment))
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818


In [264]:
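# Create textblob_score column: 1 if polarity > 0, otherwise 0
# Note: a polarity of exactly 0 is labelled negative here, although the prompt counts it as positive
df['textblob_score'] = df['polarity'].apply(lambda x: 1 if x > 0 else 0)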
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0
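
The prompt counts a polarity of exactly zero as positive, while the cell above uses a strict `> 0` test. The following is a minimal, library-free sketch of how the two rules differ. The first five polarity values are copied from the table above; the sixth (`0.0`) is a hypothetical review added only to show the boundary case, and both function names are illustrative.

```python
# Polarity values from the first five rows above, plus one
# hypothetical review whose polarity is exactly zero.
polarities = [0.001277, 0.256349, -0.053941, 0.134753, -0.024842, 0.0]

def label_strict(polarity):
    """Rule used in the notebook cell: positive only if polarity > 0."""
    return 1 if polarity > 0 else 0

def label_inclusive(polarity):
    """Rule stated in the prompt: positive if polarity >= 0."""
    return 1 if polarity >= 0 else 0

strict = [label_strict(p) for p in polarities]
inclusive = [label_inclusive(p) for p in polarities]

# Collect the polarities on which the two rules disagree.
disagreements = [p for p, s, i in zip(polarities, strict, inclusive) if s != i]

print(strict)
print(inclusive)
print(disagreements)
```

Only reviews with a polarity of exactly zero are affected, so the two rules should give nearly the same accuracy unless many reviews score exactly zero.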


### Check the accuracy of this model. Is this model better than random guessing?

In [265]:
# Create new accuracy column
accuracy = np.where(df['textblob_score'] == df['sentiment'], 1, 0)
df["accuracy"] = accuracy
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score,accuracy
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,1
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,1
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,0
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0


In [266]:
df["accuracy"].mean() # mean of the accuracy column is the share of correct labels; compare it with random guessing (50/50)

0.68528

In [267]:
# The model is better than random guessing (about 68.5% vs. 50%)
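
One way to quantify "better than random guessing" is a one-sample z-score for a proportion. The sketch below assumes the 25,000 reviews can be treated as independent trials and that a random guesser is right half the time on this balanced data set; the variable names are illustrative and not part of the notebook.

```python
import math

n_reviews = 25000            # rows in the data frame
observed_accuracy = 0.68528  # mean of the accuracy column reported above
chance_accuracy = 0.5        # expected accuracy of a coin flip on balanced classes

# Standard error of a proportion under the chance hypothesis.
standard_error = math.sqrt(chance_accuracy * (1 - chance_accuracy) / n_reviews)

# Number of standard errors separating the observed accuracy from chance.
z_score = (observed_accuracy - chance_accuracy) / standard_error

# Number of correctly labelled reviews implied by the reported accuracy.
n_correct = round(observed_accuracy * n_reviews)

print(n_correct)
print(round(z_score, 1))
```

A z-score this far from zero means the gap to 50% is very unlikely to be a sampling accident, although it says nothing about whether 68.5% is good enough for a given use.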

### For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

### Use VADER to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

In [268]:
# Install the standalone vaderSentiment package and import its analyzer (the NLTK import two cells below rebinds the same name and is the one actually used)
# https://gist.github.com/TosinJayeola/f8b46373ceba5fcb4d948536b1ed8d77
!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer



In [269]:
# Download the VADER lexicon used by NLTK's analyzer
import nltk
nltk.download('vader_lexicon')

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\kadams\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


True

In [270]:
# Define two helpers: one returns the NLTK VADER compound score, the other turns that score into a label
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sent_i = SentimentIntensityAnalyzer()

def vadar_sentiment(text):
    """ Calculate and return the NLTK VADER (lexicon method) compound sentiment score """
    return sent_i.polarity_scores(text)['compound']

# create new column for the VADER compound sentiment score
df['vadar compound'] = df['review'].apply(vadar_sentiment)

def categorise_sentiment(sentiment, neg_threshold=-0.05, pos_threshold=0.05):
    """ Categorise the compound score as negative (0) when it falls below
        neg_threshold; positive and neutral scores are both labelled 1 """
    if sentiment < neg_threshold:
        label = 0
    elif sentiment > pos_threshold:
        label = 1
    else:
        label = 1  # neutral band is grouped with positive
    return label

# new column with the VADER sentiment label based on the compound score
df['vadar sentiment'] = df['vadar compound'].apply(categorise_sentiment)

In [271]:
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score,accuracy,vadar compound,vadar sentiment
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,1,-0.8278,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,1,0.9819,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,1,-0.9883,0
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,0,-0.2189,0
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0,0.796,1


In [272]:
# Create textblob_score_vader column (the VADER label, despite the name): 1 if the compound score > 0, otherwise 0
df['textblob_score_vader'] = df['vadar compound'].apply(lambda x: 1 if x > 0 else 0)
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score,accuracy,vadar compound,vadar sentiment,textblob_score_vader
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,1,-0.8278,0,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,1,0.9819,1,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,1,-0.9883,0,0
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,0,-0.2189,0,0
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0,0.796,1,1
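
The notebook now holds two VADER labels: `vadar sentiment`, which treats only scores below -0.05 as negative, and `textblob_score_vader`, which treats only scores above 0 as positive. The sketch below restates both decision boundaries as plain functions returning integers (the function names are illustrative). The first five compound scores come from the table above; the last two are hypothetical near-zero scores added to show where the rules part ways.

```python
def label_with_band(compound, neg_threshold=-0.05):
    """Same boundary as categorise_sentiment: only scores below neg_threshold are 0."""
    return 0 if compound < neg_threshold else 1

def label_with_zero_cutoff(compound):
    """Same boundary as the lambda in the cell above: only scores above zero are 1."""
    return 1 if compound > 0 else 0

# Five compound scores from the table, then two hypothetical near-zero scores.
compounds = [-0.8278, 0.9819, -0.9883, -0.2189, 0.796, -0.02, 0.0]

band = [label_with_band(c) for c in compounds]
cutoff = [label_with_zero_cutoff(c) for c in compounds]

# Scores on which the two labelling rules disagree.
differing = [c for c, b, z in zip(compounds, band, cutoff) if b != z]

print(band)
print(cutoff)
print(differing)
```

The two columns agree on all five displayed rows and can only differ for compound scores between -0.05 and 0 inclusive.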


### Check the accuracy of this model. Is this model better than random guessing?

In [273]:
# Create a VADER accuracy column: 1 where the VADER label matches the true sentiment, otherwise 0
accuracy_vadar = np.where(df['textblob_score_vader'] == df['sentiment'], 1, 0)
df["accuracy_vader"] = accuracy_vadar
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score,accuracy,vadar compound,vadar sentiment,textblob_score_vader,accuracy_vader
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,0.606746,1,1,-0.8278,0,0,0
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,0.531111,1,1,0.9819,1,1,1
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,0.562933,0,1,-0.9883,0,0,1
3,3630_4,0,It must be assumed that those who praised this...,0.134753,0.492901,1,0,-0.2189,0,0,1
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...,-0.024842,0.459818,0,0,0.796,1,1,1


In [274]:
# Calculate the mean of the accuracy_vader column
df["accuracy_vader"].mean() # share of correct labels; compare it with random guessing (50/50)

0.69364

In [275]:
# The model is better than random guessing (about 69.4% vs. 50% for random guessing)
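
A single accuracy figure hides which kind of error each analyzer makes. As a rough illustration, the sketch below tallies confusion-matrix counts using only the five rows displayed above, so the numbers are not representative of the full 25,000 reviews. The helper names `confusion_counts` and `accuracy_from_counts` are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            counts["tp"] += 1
        elif truth == 0 and pred == 0:
            counts["tn"] += 1
        elif truth == 0 and pred == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts

def accuracy_from_counts(counts):
    """Share of correct predictions among all predictions."""
    return (counts["tp"] + counts["tn"]) / sum(counts.values())

# Values copied from the first five rows of the table above.
sentiment = [1, 1, 0, 0, 1]
textblob_labels = [1, 1, 0, 1, 0]   # textblob_score column
vader_labels = [0, 1, 0, 0, 1]      # textblob_score_vader column

textblob_counts = confusion_counts(sentiment, textblob_labels)
vader_counts = confusion_counts(sentiment, vader_labels)

print(textblob_counts, accuracy_from_counts(textblob_counts))
print(vader_counts, accuracy_from_counts(vader_counts))
```

Applied to the full columns, the same tally would show whether each analyzer leans toward false positives or false negatives.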

## Part 2: Prepping Text for a Custom Model

### If you want to run your own model to classify text, it needs to be in proper form to do so. The following steps will outline a procedure to do this on the movie reviews text.

### Convert all text to lowercase letters.

In [295]:
df['review'] = df['review'].str.lower() # use str.lower to convert all text to lowercase letters
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score,accuracy,vadar compound,vadar sentiment,textblob_score_vader,accuracy_vader
0,5814_8,1,stuff going moment mj ive started listening mu...,0.001277,0.606746,1,1,-0.8278,0,0,0
1,2381_9,1,classic war worlds timothy hines entertaining ...,0.256349,0.531111,1,1,0.9819,1,1,1
2,7759_3,0,film starts manager nicholas bell giving welco...,-0.053941,0.562933,0,1,-0.9883,0,0,1
3,3630_4,0,must assumed praised film greatest filmed oper...,0.134753,0.492901,1,0,-0.2189,0,0,1
4,9495_8,1,superbly trashy wondrously unpretentious 80s e...,-0.024842,0.459818,0,0,0.796,1,1,1


### Remove punctuation and special characters from the text.

In [296]:
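# pg 98
df['review'] = df['review'].str.replace(r'[^\w\s]+', '', regex=True) # remove every character that is not a word character or whitespace; regex=True is required because pandas 2.0 and later treat the pattern as a literal string by default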
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score,accuracy,vadar compound,vadar sentiment,textblob_score_vader,accuracy_vader
0,5814_8,1,stuff going moment mj ive started listening mu...,0.001277,0.606746,1,1,-0.8278,0,0,0
1,2381_9,1,classic war worlds timothy hines entertaining ...,0.256349,0.531111,1,1,0.9819,1,1,1
2,7759_3,0,film starts manager nicholas bell giving welco...,-0.053941,0.562933,0,1,-0.9883,0,0,1
3,3630_4,0,must assumed praised film greatest filmed oper...,0.134753,0.492901,1,0,-0.2189,0,0,1
4,9495_8,1,superbly trashy wondrously unpretentious 80s e...,-0.024842,0.459818,0,0,0.796,1,1,1
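
The pattern `[^\w\s]+` matches runs of characters that are neither word characters nor whitespace. The sketch below applies the same pattern with the standard `re` module to a hypothetical sample string; `strip_punctuation` is an illustrative helper, and the whitespace-collapsing step is an addition the notebook does not perform.

```python
import re

def strip_punctuation(text):
    """Lowercase the text, delete punctuation, and collapse runs of whitespace."""
    lowered = text.lower()
    # Same pattern as the notebook cell: anything that is not \w or \s is removed.
    no_punct = re.sub(r"[^\w\s]+", "", lowered)
    # Removing a symbol between two spaces leaves a double space; normalise it.
    return " ".join(no_punct.split())

sample = "Superbly trashy & wondrously unpretentious 80's film!"  # hypothetical input
cleaned = strip_punctuation(sample)
print(cleaned)

# \w also covers digits and underscores, so those survive.
print(strip_punctuation("snake_case, it's 100%"))
```

Because apostrophes are deleted rather than replaced by a space, contractions such as "it's" become "its".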


### Remove stop words.

In [297]:
# pg 99
# Import library
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kadams\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [298]:
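# Load the English stop word list from NLTK
from nltk.corpus import stopwords
stop = set(stopwords.words('english')) # a set makes the membership test in the next cell fast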

In [299]:
# Exclude stopwords with Python's list comprehension and pandas.DataFrame.apply.
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop)]))
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score,accuracy,vadar compound,vadar sentiment,textblob_score_vader,accuracy_vader
0,5814_8,1,stuff going moment mj ive started listening mu...,0.001277,0.606746,1,1,-0.8278,0,0,0
1,2381_9,1,classic war worlds timothy hines entertaining ...,0.256349,0.531111,1,1,0.9819,1,1,1
2,7759_3,0,film starts manager nicholas bell giving welco...,-0.053941,0.562933,0,1,-0.9883,0,0,1
3,3630_4,0,must assumed praised film greatest filmed oper...,0.134753,0.492901,1,0,-0.2189,0,0,1
4,9495_8,1,superbly trashy wondrously unpretentious 80s e...,-0.024842,0.459818,0,0,0.796,1,1,1
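
Row 0 above still contains "ive": the notebook strips punctuation before removing stop words, so a contraction that has lost its apostrophe no longer matches a stop word list that spells it with one. The sketch below illustrates this ordering effect with a small, hypothetical stop word set (not the NLTK list) and a hypothetical sentence.

```python
# Hypothetical stand-in for a stop word list; contractions keep their apostrophe.
STOP_WORDS = {"the", "is", "a", "i", "don't"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token found in stop_words."""
    return " ".join(word for word in text.split() if word not in stop_words)

sample = "i don't think the film is a classic"

# Stop words removed while apostrophes are still present: "don't" is matched.
stop_words_first = remove_stop_words(sample)

# Apostrophes removed first (the notebook's order): "dont" slips through.
punctuation_first = remove_stop_words(sample.replace("'", ""))

print(stop_words_first)
print(punctuation_first)
```

Whether leftover tokens such as "dont" matter depends on the downstream model; swapping the two steps is one possible way to avoid them.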


### Apply NLTK’s PorterStemmer.

In [300]:
# pg 100: Stemming reduces a word to its stem by identifying and removing affixes (e.g. gerunds) keeping the root meaning of the word
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

In [301]:
porter_stemmer = PorterStemmer() #create porter_stemmer variable

In [302]:
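# https://stackoverflow.com/questions/43795310/apply-porters-stemmer-to-a-pandas-column-for-each-word
# Tokenize each review on spaces, dropping empty strings
# Note: filter() returns a lazy, single-use iterator, so this column displays as <filter object ...>
# and is exhausted once the stemming cell below has consumed it
df['review_tokenized']=df['review'].apply(lambda x : filter(None,x.split(" ")))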

In [303]:
# Apply stemmer to the above tokenized column as follows
df['review_stemmed']=df['review_tokenized'].apply(lambda x : [porter_stemmer.stem(y) for y in x])

In [304]:
# Go back to review in sentence format
df['review_stemmed_sentence']=df['review_stemmed'].apply(lambda x : " ".join(x))

In [305]:
df.head(5)

Unnamed: 0,id,sentiment,review,polarity,subjectivity,textblob_score,accuracy,vadar compound,vadar sentiment,textblob_score_vader,accuracy_vader,review_tokenized,review_stemmed,review_stemmed_sentence
0,5814_8,1,stuff going moment mj ive started listening mu...,0.001277,0.606746,1,1,-0.8278,0,0,0,<filter object at 0x0000018D60C7D640>,"[stuff, go, moment, mj, ive, start, listen, mu...",stuff go moment mj ive start listen music watc...
1,2381_9,1,classic war worlds timothy hines entertaining ...,0.256349,0.531111,1,1,0.9819,1,1,1,<filter object at 0x0000018D60C7DE80>,"[classic, war, world, timothi, hine, entertain...",classic war world timothi hine entertain film ...
2,7759_3,0,film starts manager nicholas bell giving welco...,-0.053941,0.562933,0,1,-0.9883,0,0,1,<filter object at 0x0000018D60C7D580>,"[film, start, manag, nichola, bell, give, welc...",film start manag nichola bell give welcom inve...
3,3630_4,0,must assumed praised film greatest filmed oper...,0.134753,0.492901,1,0,-0.2189,0,0,1,<filter object at 0x0000018D60C7D610>,"[must, assum, prais, film, greatest, film, ope...",must assum prais film greatest film opera ever...
4,9495_8,1,superbly trashy wondrously unpretentious 80s e...,-0.024842,0.459818,0,0,0.796,1,1,1,<filter object at 0x0000018D60C7D7C0>,"[superbl, trashi, wondrous, unpretenti, 80, ex...",superbl trashi wondrous unpretenti 80 exploit ...
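
The `review_tokenized` column shows `<filter object ...>` because `filter()` returns a lazy iterator rather than a list. The sketch below, which uses a hypothetical sample string, shows that such an iterator can be consumed only once, and that `str.split()` with no argument yields a reusable list with the same tokens.

```python
text = "film  starts manager"  # hypothetical sample with a double space

# Splitting on a single space leaves an empty string, which filter(None, ...) drops.
tokens = filter(None, text.split(" "))

first_pass = list(tokens)   # consumes the iterator
second_pass = list(tokens)  # nothing is left to iterate over

# str.split() without arguments splits on any run of whitespace and returns a list.
reusable_tokens = text.split()

print(first_pass)
print(second_pass)
print(reusable_tokens)
```

In the notebook this is harmless because the stemming cell reads each iterator exactly once, but re-running that cell without re-tokenizing would produce empty lists.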


### Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). 

In [312]:
# pg 104: Create bag-of-words matrix
from sklearn.feature_extraction.text import CountVectorizer
count = CountVectorizer()
bag_of_words = count.fit_transform(df['review_stemmed_sentence'])

### Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

In [308]:
bag_of_words.shape # .shape gives the dimensions of the sparse matrix: 25000 rows (one per review, the same as the original df dataframe) by one column per unique token

(25000, 92532)
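
To make the shape concrete, here is a library-free sketch of a bag-of-words matrix built with `collections.Counter`. The three documents are the leading stemmed tokens of rows 2, 3, and 1 in the table above. It splits on whitespace only, so it is a simplified stand-in for `CountVectorizer` rather than a reproduction of its tokenization rules.

```python
from collections import Counter

# Leading stemmed tokens of three reviews from the table above.
docs = [
    "film start manag nichola bell give",
    "must assum prais film greatest film opera",
    "classic war world timothi hine entertain film",
]

# One column per unique token, in alphabetical order.
vocabulary = sorted({token for doc in docs for token in doc.split()})

# One row per document; each entry is how often that token occurs in the document.
matrix = []
for doc in docs:
    counts = Counter(doc.split())
    matrix.append([counts[term] for term in vocabulary])

shape = (len(matrix), len(vocabulary))
film_column = [row[vocabulary.index("film")] for row in matrix]

print(shape)
print(film_column)
```

The row count equals the number of documents and the column count equals the vocabulary size, which is why the full matrix has 25,000 rows and 92,532 columns.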

### Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). 

In [313]:
# Use scikit-learn's TfidfVectorizer to build the tf-idf sparse matrix with fit_transform
from sklearn.feature_extraction.text import TfidfVectorizer
v = TfidfVectorizer()
x = v.fit_transform(df['review_stemmed_sentence'])

### Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [314]:
x.shape #these are the same dimensions as the bag-of-words matrix

(25000, 92532)
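
The tf-idf matrix has the same shape as the bag-of-words matrix because it reweights the same counts. The sketch below computes the weights by hand for three short stemmed snippets taken from the table above. It assumes the smoothed formula scikit-learn documents for its default settings, idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation of each row, and it splits on whitespace only, so treat it as an approximation of `TfidfVectorizer` rather than a drop-in replacement.

```python
import math
from collections import Counter

docs = [
    "film start manag nichola bell give",
    "must assum prais film greatest film opera",
    "classic war world timothi hine entertain film",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)
vocabulary = sorted({token for tokens in tokenized for token in tokens})

# Document frequency: in how many documents each token appears.
doc_freq = Counter(token for tokens in tokenized for token in set(tokens))

def idf(term):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(tokens):
    """Raw count times idf for every vocabulary term, scaled to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf(term) for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_row(tokens) for tokens in tokenized]

film = vocabulary.index("film")
bell = vocabulary.index("bell")
print((len(tfidf), len(vocabulary)))
print(tfidf[0][film] < tfidf[0][bell])
```

In the first snippet "film" and "bell" each occur once, yet "film" receives the smaller weight because it appears in all three snippets.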

## Additional Comments

### The bag-of-words and tf-idf matrices are stored as sparse matrices because most entries are zero.

### Each row in the bag-of-words/tf-idf matrices corresponds to a movie review.

### The columns in the bag-of-words/tf-idf matrices correspond to unique words appearing in the movie reviews.

### Entries in the bag-of-words matrix are the number of times a word appears in a review.

### Entries in the tf-idf matrix are weights: a word's count in a review, scaled down for words that appear in many reviews.

### The bag-of-words/tf-idf matrices are possible feature (input) matrices for model building.

### We will revisit this preprocessed text data to build a custom model in the future.