#### Megan Sisson

Part 1: Using the TextBlob Sentiment Analyzer

1. Import the movie review data as a data frame and ensure that the data is loaded properly.

2. How many of each positive and negative reviews are there?

3. Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

4. Check the accuracy of this model. Is this model better than random guessing?

5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

In [1]:
#import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns



In [2]:
# Import the movie review data as a data frame and ensure that the data is loaded properly.
df = pd.read_csv('/Users/mksis/Documents/Data Science/DSC550 Data Mining/Data Sets/labeledTrainData.tsv', sep='\t')
df.head(3)

#Labeled Train Data imported as dataframe.

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...


In [3]:
# 2. How many of each positive and negative reviews are there?
df.groupby('sentiment').count()

# There are 12,500 positive reviews [1]
# There are 12,500 negative reviews [0]

sentiment,id,review
0,12500,12500
1,12500,12500


In [4]:
#Install textblob if not already installed
#pip install textblob

In [5]:
from textblob import TextBlob

In [6]:
#3. Use TextBlob to classify each movie review as positive or negative. 
#Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

#Return the TextBlob polarity of a review. If the review is not a string, return None.
def sentiment_calc(review):
    try:
        return TextBlob(review).sentiment.polarity  #polarity score in [-1, 1]
    except TypeError:  #TextBlob raises TypeError for non-string input
        return None
    
df['TextBlob_sentiment'] = df['review'].apply(sentiment_calc)

df.head(3)

#>= 0 Positive Sentiment
# <0 Negative Sentiment



Unnamed: 0,id,sentiment,review,TextBlob_sentiment
0,5814_8,1,With all this stuff going down at the moment w...,0.001277
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941


In [7]:
# Creating new classification column.
# If the TextBlob sentiment value is greater than or equal to 0, then it is a Positive sentiment
# If the TextBlob sentiment value is less than 0, then it is a Negative sentiment
df.loc[df['TextBlob_sentiment'] >= 0, 'classification'] = 'Positive'
df.loc[df['TextBlob_sentiment'] < 0, 'classification'] = 'Negative'

df.head(3)

Unnamed: 0,id,sentiment,review,TextBlob_sentiment,classification
0,5814_8,1,With all this stuff going down at the moment w...,0.001277,Positive
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi...",0.256349,Positive
2,7759_3,0,The film starts with a manager (Nicholas Bell)...,-0.053941,Negative


In [8]:
#4. Check the accuracy of this model. Is this model better than random guessing?
df.groupby('classification').count()

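#TextBlob classified 5,983 reviews as negative and 19,017 as positive.
#The 'sentiment' labels are split 12,500/12,500, so at least 6,517 negative reviews were classified as positive.
#These counts alone do not give the accuracy; that requires comparing 'classification' with 'sentiment' row by row.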

classification,id,sentiment,review,TextBlob_sentiment
Negative,5983,5983,5983,5983
Positive,19017,19017,19017,19017
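
The counts above describe the predictions, but step 4 asks for accuracy, which means comparing each prediction with its label. The following is a minimal sketch of that comparison, assuming the `>= 0` threshold from step 3 and the 0/1 coding of the `sentiment` column. The function name `accuracy_from_polarity` and the sample values are hypothetical, and plain Python lists are used so the sketch runs without the data set.

```python
def accuracy_from_polarity(polarities, labels, threshold=0.0):
    """Share of reviews whose thresholded polarity matches the 0/1 label."""
    if len(polarities) != len(labels):
        raise ValueError("polarities and labels must have the same length")
    if not labels:
        raise ValueError("at least one review is required")
    # Polarity >= threshold counts as positive (1), otherwise negative (0)
    predictions = [1 if p >= threshold else 0 for p in polarities]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return correct / len(labels)

# Hypothetical polarity scores and true labels (not taken from the data set)
sample_polarities = [0.25, -0.05, 0.10, -0.30]
sample_labels = [1, 0, 0, 0]
sample_accuracy = accuracy_from_polarity(sample_polarities, sample_labels)
print(sample_accuracy)
```

On the notebook's data frame, the same comparison could be written as `((df['TextBlob_sentiment'] >= 0).astype(int) == df['sentiment']).mean()`. Because the labels are balanced at 12,500 each, random guessing is expected to be right about half the time, so a value above 0.5 would indicate that the model does better than guessing.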


Part 2: Prepping Text for a Custom Model

1. Convert all text to lowercase letters.

2. Remove punctuation and special characters from the text.

3. Remove stop words.

4. Apply NLTK’s PorterStemmer.

5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [9]:
#1. Convert all text to lowercase letters.

df['review'] = df['review'].apply(str.lower)
df.head(3)

Unnamed: 0,id,sentiment,review,TextBlob_sentiment,classification
0,5814_8,1,with all this stuff going down at the moment w...,0.001277,Positive
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi...",0.256349,Positive
2,7759_3,0,the film starts with a manager (nicholas bell)...,-0.053941,Negative


In [10]:
#2. Remove punctuation and special characters from the text.

#regex=True is required so the pattern is treated as a regular expression and not as literal text
df['review'] = df['review'].str.replace(r'[^\w\s]', '', regex=True)
df.head(3)


Unnamed: 0,id,sentiment,review,TextBlob_sentiment,classification
0,5814_8,1,with all this stuff going down at the moment w...,0.001277,Positive
1,2381_9,1,the classic war of the worlds by timothy hines...,0.256349,Positive
2,7759_3,0,the film starts with a manager nicholas bell g...,-0.053941,Negative
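
To see what the pattern `[^\w\s]` removes, here is a small standalone sketch using Python's `re` module in place of the pandas string method. The helper name `strip_punctuation` and the sample sentence are hypothetical.

```python
import re

# Matches any single character that is neither a word character nor whitespace
PUNCTUATION = re.compile(r"[^\w\s]")

def strip_punctuation(text):
    """Lowercase the text, then delete punctuation and special characters."""
    return PUNCTUATION.sub("", text.lower())

# Hypothetical sample sentence (not taken from the data set)
cleaned = strip_punctuation("Don't miss it: a well-made, 80's-style film!")
print(cleaned)
```

Note the side effect: apostrophes and hyphens are deleted without inserting a space, so contractions and hyphenated words are merged into single tokens. This is consistent with tokens such as `ive` in the notebook's later output.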


In [11]:
#3. Remove stop words.
# install nltk if needed 
#!pip install nltk





In [12]:
import nltk
nltk.download('stopwords')

[nltk_data] Error loading stopwords: <urlopen error [Errno 11001]
[nltk_data]     getaddrinfo failed>


False

In [13]:
from nltk.corpus import stopwords

In [14]:
#removing stop words from the dataframe.

stop_words = set(stopwords.words('english')) #stopwords from nltk.corpus; the file id is lowercase, and a set makes lookups fast

#removing stopwords from the 'review' column
df['no_stopwords'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))
df.head(3)

Unnamed: 0,id,sentiment,review,TextBlob_sentiment,classification,no_stopwords
0,5814_8,1,with all this stuff going down at the moment w...,0.001277,Positive,stuff going moment mj ive started listening mu...
1,2381_9,1,the classic war of the worlds by timothy hines...,0.256349,Positive,classic war worlds timothy hines entertaining ...
2,7759_3,0,the film starts with a manager nicholas bell g...,-0.053941,Negative,film starts manager nicholas bell giving welco...
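
The filtering step can be illustrated without NLTK. This is a sketch only: `SAMPLE_STOP_WORDS` is a tiny hypothetical stop list standing in for NLTK's much longer English list, and `remove_stop_words` is an illustrative name.

```python
# Tiny hypothetical stop list; the notebook uses NLTK's English list instead
SAMPLE_STOP_WORDS = frozenset({"the", "a", "of", "with", "at", "this", "all", "is"})

def remove_stop_words(text, stop_words=SAMPLE_STOP_WORDS):
    """Keep only whitespace-separated tokens that are not in the stop list."""
    return " ".join(word for word in text.split() if word not in stop_words)

filtered = remove_stop_words("the film starts with a manager")
print(filtered)
```

Membership tests against a set take roughly constant time, which matters when the same check is repeated for every word of 25,000 reviews.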


In [15]:
#4. Apply NLTK’s PorterStemmer.

from nltk.stem import PorterStemmer

ps = PorterStemmer()

#function to stem each individual word in the reviews.
def stem_review(review):
    tokens = review.split()
    stemmed_tokens = [ps.stem(token) for token in tokens]
    return ' '.join(stemmed_tokens)

df['stem'] = df['no_stopwords'].apply(stem_review) #stemming the words in the 'no_stopwords' column

df.head()

Unnamed: 0,id,sentiment,review,TextBlob_sentiment,classification,no_stopwords,stem
0,5814_8,1,with all this stuff going down at the moment w...,0.001277,Positive,stuff going moment mj ive started listening mu...,stuff go moment mj ive start listen music watc...
1,2381_9,1,the classic war of the worlds by timothy hines...,0.256349,Positive,classic war worlds timothy hines entertaining ...,classic war world timothi hine entertain film ...
2,7759_3,0,the film starts with a manager nicholas bell g...,-0.053941,Negative,film starts manager nicholas bell giving welco...,film start manag nichola bell give welcom inve...
3,3630_4,0,it must be assumed that those who praised this...,0.134753,Positive,must assumed praised film greatest filmed oper...,must assum prais film greatest film opera ever...
4,9495_8,1,superbly trashy and wondrously unpretentious 8...,-0.024842,Negative,superbly trashy wondrously unpretentious 80s e...,superbl trashi wondrous unpretenti 80 exploit ...


In [16]:
#5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review 
#(see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). 
#Display the dimensions of your bag-of-words matrix. 
#The number of rows in this matrix should be the same as the number of rows in your original data frame.

from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer()
bag_words = count.fit_transform(df['stem']) #counting the vectors for the words in the 'stem' column

#Dimensions: (number of reviews, number of distinct stemmed terms)
bag_words.shape

(25000, 92532)
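
As a rough illustration of what a bag-of-words matrix holds, the sketch below builds one by hand for three hypothetical stemmed reviews. It is a simplification and not how `CountVectorizer` is implemented: it splits on whitespace only, whereas `CountVectorizer` by default keeps only tokens of two or more characters and returns a sparse matrix. The function name `bag_of_words` is illustrative.

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count row per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # a missing term counts as 0
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Hypothetical stemmed reviews (not taken from the data set)
docs = ["film start manag", "classic war world film", "film film opera"]
vocab, matrix = bag_of_words(docs)
print(vocab)
print(matrix)
print((len(matrix), len(vocab)))  # (rows, columns), analogous to bag_words.shape
```

Each row corresponds to one review and each column to one vocabulary term, which is why the row count matches the number of rows in the data frame.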

In [17]:
#6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews 
#(see section 6.9 in the Machine Learning with Python Cookbook). 
#Display the dimensions of your tf-idf matrix. 
#These dimensions should be the same as your bag-of-words matrix.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
result = tfidf.fit_transform(df['stem'])

result.shape

(25000, 92532)
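
The tf-idf matrix has the same shape as the bag-of-words matrix because it reweights the same counts over the same vocabulary. The sketch below shows the weighting by hand, assuming scikit-learn's default settings: smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The function name `tfidf_rows` and the sample reviews are hypothetical, and tokenization is simplified to whitespace splitting.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return a sorted vocabulary and one L2-normalized tf-idf row per document."""
    tokenized = [doc.split() for doc in documents]
    vocabulary = sorted({t for tokens in tokenized for t in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows

# Hypothetical stemmed reviews (not taken from the data set)
sample_docs = ["film start manag", "classic war world film", "film film opera"]
terms, weights = tfidf_rows(sample_docs)
print((len(weights), len(terms)))
```

In this sample, `film` appears in every review, so within the first review it receives a lower weight than `manag`, which appears in only one.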