# Week 3 Sentiment Analysis and Processing Text: Exercise 3.2

## 1) Using the TextBlob Sentiment Analyzer:

***Instructions)***

1) Import the movie review data as a data frame and ensure that the data is loaded properly.
2) How many of each positive and negative reviews are there?
3) Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.
4) Check the accuracy of this model. Is this model better than random guessing?
5) For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

***Answer)***

**1) Opening and Reading the data**

In [1]:
import pandas as pd

# importing and reading the data
file_path = 'C:/Users/ivan2/gitLocal/DSC550-WINTER2023/Week3-word2vec-nlp-tutorial/labeledTrainData.tsv/labeledTrainData.tsv'

lbd_train_data = pd.read_csv(file_path, sep='\t')
lbd_train_data.head(5)


| | id | sentiment | review |
|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... |


In [2]:
lbd_train_data.shape

(25000, 3)

**2) Counting the positive and negative reviews**

There are 12,500 positive reviews and 12,500 negative reviews, so the data set is balanced.

In [3]:
# Review sentiment value counts
train_sentiment_counts = lbd_train_data['sentiment'].value_counts()
train_sentiment_counts

1    12500
0    12500
Name: sentiment, dtype: int64

**3 & 4) TextBlob**

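TextBlob is a Python library for common NLP tasks on text data. Here it is used for sentiment analysis: each review receives a polarity score between -1 (most negative) and 1 (most positive), and a score greater than or equal to zero is treated as a positive review.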

In [4]:
from textblob import TextBlob

# Function to classify sentiment using TextBlob
def classify_sentiment(text):
    analysis = TextBlob(text)
    return 1 if analysis.sentiment.polarity >= 0 else 0

# Apply the function to classify each review
lbd_train_data['predicted_sentiment'] = lbd_train_data['review'].apply(classify_sentiment)

# Calculate the accuracy
accuracy = (lbd_train_data['predicted_sentiment'] == lbd_train_data['sentiment']).mean()
accuracy

0.68524

After applying TextBlob we get an accuracy of about 68.5%. Because the data set is balanced (12,500 reviews per class), random guessing in this binary classification would be right about 50% of the time, so the TextBlob analyzer is better than random guessing.
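To put a number on "better than random guessing", the observed accuracy can be compared with the spread expected from a coin-flip guesser. The sketch below is a rough check only; it assumes each review is guessed independently with a 50% chance of being right and uses the normal approximation to the binomial distribution. The variable names are illustrative.

```python
import math

n_reviews = 25000            # rows in the data frame
observed_accuracy = 0.68524  # accuracy reported for TextBlob above
chance_accuracy = 0.5        # assumed accuracy of a coin-flip guesser

# Standard deviation of a coin-flip guesser's accuracy over n_reviews reviews
standard_error = math.sqrt(chance_accuracy * (1 - chance_accuracy) / n_reviews)

# How many standard errors the observed accuracy lies above chance
z_score = (observed_accuracy - chance_accuracy) / standard_error

print(round(standard_error, 5))  # 0.00316
print(round(z_score, 1))         # 58.6
```

A gap of this many standard errors is far larger than sampling noise could plausibly produce under these assumptions.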

**5) Extra Credit**

## 2) Prepping Text for a Custom Model:

***Instructions)***

If you want to run your own model to classify text, it needs to be in proper form to do so. The following steps will outline a procedure to do this on the movie reviews text.

1) Convert all text to lowercase letters.
2) Remove punctuation and special characters from the text.
3) Remove stop words.
4) Apply NLTK’s PorterStemmer.
5) Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.
6) Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

**0) Removing HTML**

This step is not part of the instructions, but the reviews contain HTML tags such as `<br />`, so it is worth removing them before any further processing.

In [5]:
# Checking for html tags
print(lbd_train_data["review"][0])

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally star

In [6]:
# Import BeautifulSoup to remove HTML tags
from bs4 import BeautifulSoup

# Define a function to apply BeautifulSoup and get text
def remove_html(text):
    return BeautifulSoup(text, "html.parser").get_text()

# Apply the BeautifulSoup function directly to the review column
lbd_train_data['review'] = lbd_train_data['review'].apply(remove_html)
print(lbd_train_data['review'][0])

With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.The actual feature film bit when it finally starts is only on for 20 min
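In the output above, the text on either side of each removed `<br />` tag is fused together (for example `m'kay.Visually`), which later produces tokens that are not real words. One possible way to avoid this, shown here only as a sketch on a made-up sample string, is to pass a separator to `get_text()` and then collapse repeated whitespace. This is an alternative to the cell above, not what was run to produce the results in this notebook.

```python
from bs4 import BeautifulSoup

# Short made-up sample in the style of the reviews
sample = "drugs are bad m'kay.<br /><br />Visually impressive"

# Default behaviour: text fragments are joined with nothing between them
joined = BeautifulSoup(sample, "html.parser").get_text()

# With a separator, a space is placed between the text fragments
spaced = BeautifulSoup(sample, "html.parser").get_text(separator=" ")

# Collapse any runs of whitespace into single spaces
normalized = " ".join(spaced.split())

print(joined)      # drugs are bad m'kay.Visually impressive
print(normalized)  # drugs are bad m'kay. Visually impressive
```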

**1) Convert all text to lowercase**

In [7]:
# Convert all text to lower case in the data frame
lbd_train_data['review'] = lbd_train_data['review'].str.lower()

**2) Remove Punctuation and Special Characters**

In [8]:
import re

# Replace punctuation and special characters with an empty string.
# The regular expression r'[^\w\s]' matches any character that is neither a word character
# (letter, digit, or underscore) nor whitespace. A raw string avoids invalid-escape warnings.
lbd_train_data['review'] = lbd_train_data['review'].str.replace(r'[^\w\s]', '', regex=True)

print(lbd_train_data['review'][0])

with all this stuff going down at the moment with mj ive started listening to his music watching the odd documentary here and there watched the wiz and watched moonwalker again maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent moonwalker is part biography part feature film which i remember going to see at the cinema when it was originally released some of it has subtle messages about mjs feeling towards the press and also the obvious message of drugs are bad mkayvisually impressive but of course this is all about michael jackson so unless you remotely like mj in anyway then you are going to hate this and find it boring some may call mj an egotist for consenting to the making of this movie but mj and most of his fans would say that he made it for the fans which if true is really nice of himthe actual feature film bit when it finally starts is only on for 20 minutes or so e
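As a small illustration of what the pattern does and does not remove, the sketch below applies the same regular expression to a made-up string with the standard `re` module. Note that `\w` also matches the underscore, so underscores survive; the second pattern is one possible variant that removes them as well.

```python
import re

# Made-up sample containing apostrophes, an ellipsis, an underscore and a percent sign
sample = "mj's feeling... drugs are bad m'kay! file_name 20%"

# Same pattern as in the cell above: drop everything except word characters and whitespace
cleaned = re.sub(r"[^\w\s]", "", sample)

# Variant that also drops underscores, which \w would otherwise keep
cleaned_no_underscore = re.sub(r"[^\w\s]|_", "", sample)

print(cleaned)                # mjs feeling drugs are bad mkay file_name 20
print(cleaned_no_underscore)  # mjs feeling drugs are bad mkay filename 20
```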

**3) Remove stop words**

In [9]:
# Import the stop word list
from nltk.corpus import stopwords 
print(stopwords.words("english")) 

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [10]:
from nltk.tokenize import word_tokenize

# Load the list of English stop words
stop_words = set(stopwords.words('english'))

# Define a function to remove stop words from a text
def remove_stopwords(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stop words
    filtered_tokens = [w for w in tokens if w.lower() not in stop_words]
    # Join the tokens back into a string
    return ' '.join(filtered_tokens)

# Apply the function to the review column
lbd_train_data['review'] = lbd_train_data['review'].apply(remove_stopwords)

print(lbd_train_data['review'][0])

stuff going moment mj ive started listening music watching odd documentary watched wiz watched moonwalker maybe want get certain insight guy thought really cool eighties maybe make mind whether guilty innocent moonwalker part biography part feature film remember going see cinema originally released subtle messages mjs feeling towards press also obvious message drugs bad mkayvisually impressive course michael jackson unless remotely like mj anyway going hate find boring may call mj egotist consenting making movie mj fans would say made fans true really nice himthe actual feature film bit finally starts 20 minutes excluding smooth criminal sequence joe pesci convincing psychopathic powerful drug lord wants mj dead bad beyond mj overheard plans nah joe pescis character ranted wanted people know supplying drugs etc dunno maybe hates mjs musiclots cool things like mj turning car robot whole speed demon sequence also director must patience saint came filming kiddy bad sequence usually direct
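One side effect of the step order is worth noting. The stop word list printed above contains contractions such as "you've", but punctuation was already stripped in step 2, so the apostrophe-free form "youve" no longer matches. The sketch below illustrates this with a tiny hand-picked stop list (a hypothetical subset of the NLTK list) and a plain whitespace split standing in for `word_tokenize`; it is a simplification, not the pipeline used above.

```python
import re

# Hypothetical mini stop list; all four entries appear in the NLTK list printed above
stop_words = {"you", "you've", "the", "this"}

def strip_punctuation(text):
    # Same pattern as step 2
    return re.sub(r"[^\w\s]", "", text)

def drop_stop_words(text):
    # Whitespace split is a simplification of word_tokenize
    return " ".join(w for w in text.split() if w not in stop_words)

review = "you've seen the film"

# Order used in this notebook: punctuation first, then stop words
punct_first = drop_stop_words(strip_punctuation(review))

# Alternative order: stop words first, then punctuation
stops_first = strip_punctuation(drop_stop_words(review))

print(punct_first)  # youve seen film
print(stops_first)  # seen film
```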

**4) Apply NLTK’s PorterStemmer**

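***Note*** For my future reference: the Porter stemmer is a rule-based algorithm for stemming, the process of reducing words to a common stem by stripping suffixes. For instance, "listening" and "watching" become "listen" and "watch". The stem does not have to be a dictionary word ("documentary" becomes "documentari"), and irregular forms are not merged: "running" is stemmed to "run", but "ran" stays "ran".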

In [11]:
from nltk.stem import PorterStemmer

# Create an instance of the PorterStemmer
stemmer = PorterStemmer()

# Define a function to stem each word in the text
def stem_text(text):
    # Tokenize the text
    tokens = word_tokenize(text)
    # Stem each word
    stemmed_tokens = [stemmer.stem(word) for word in tokens]
    # Join the stemmed words back into a string
    return ' '.join(stemmed_tokens)

# Apply the function to the review column
lbd_train_data['review'] = lbd_train_data['review'].apply(stem_text)

print(lbd_train_data['review'][0])

stuff go moment mj ive start listen music watch odd documentari watch wiz watch moonwalk mayb want get certain insight guy thought realli cool eighti mayb make mind whether guilti innoc moonwalk part biographi part featur film rememb go see cinema origin releas subtl messag mj feel toward press also obviou messag drug bad mkayvisu impress cours michael jackson unless remot like mj anyway go hate find bore may call mj egotist consent make movi mj fan would say made fan true realli nice himth actual featur film bit final start 20 minut exclud smooth crimin sequenc joe pesci convinc psychopath power drug lord want mj dead bad beyond mj overheard plan nah joe pesci charact rant want peopl know suppli drug etc dunno mayb hate mj musiclot cool thing like mj turn car robot whole speed demon sequenc also director must patienc saint came film kiddi bad sequenc usual director hate work one kid let alon whole bunch perform complex danc scenebottom line movi peopl like mj one level anoth think peo
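To see the stemmer's behaviour on individual words rather than a whole review, a few words can be stemmed in isolation. This is a minimal sketch with a hand-picked word list; the first three words are taken from the review shown above, and the rest are chosen to show that suffix stripping does not merge every related form.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hand-picked words; "listening", "watching" and "documentary" occur in the first review
words = ["listening", "watching", "documentary", "running", "runner", "ran"]

# Map each word to its Porter stem
stems = {word: stemmer.stem(word) for word in words}

for word, stem in stems.items():
    print(f"{word} -> {stem}")
```

Suffixes such as "-ing" are stripped, while "runner" and the irregular past tense "ran" are left unchanged, so they remain separate columns in any matrix built from the stemmed text.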

**5) Bag of Words Matrix**

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

# Create an instance of CountVectorizer
vectorizer = CountVectorizer()

# Apply CountVectorizer to the stemmed reviews
BoW_matrix = vectorizer.fit_transform(lbd_train_data['review'])

# Convert to a dense array or DataFrame for easier viewing
BoW_dense = BoW_matrix.toarray()

# Wrap the dense array in a DataFrame with one column per vocabulary term
BoW_df = pd.DataFrame(BoW_dense, columns=vectorizer.get_feature_names_out())

In [16]:
# dimensions of the bag of words
BoW_df.shape

(25000, 113034)
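The dimensions can also be read directly from the sparse matrix that `fit_transform` returns, without converting it to a dense array first. The sketch below uses a made-up three-document corpus in place of the stemmed reviews and adds a rough estimate of how much memory a dense copy of the full matrix would need, assuming 8 bytes per value (scikit-learn's default 64-bit counts). The helper `dense_gigabytes` is hypothetical and only does the arithmetic.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up toy corpus standing in for the stemmed reviews
docs = ["movi wa great great fun", "movi wa bore", "great act bad plot"]

vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)  # SciPy sparse matrix

print(bow.shape)  # (3, 8): 3 documents, 8 distinct terms
print(bow.nnz)    # 11 stored non-zero counts

def dense_gigabytes(n_rows, n_cols, bytes_per_value=8):
    """Approximate size in GB of a dense array with the given shape."""
    return n_rows * n_cols * bytes_per_value / 1e9

# Rough size of a dense 25000 x 113034 matrix
print(round(dense_gigabytes(25000, 113034), 1))  # 22.6
```

On machines with limited memory, working with the sparse matrix directly may therefore be the more practical choice.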

**6) tf-idf matrix**

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create an instance of TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer()

# Apply TfidfVectorizer to the stemmed reviews (the 'review' column after step 4)
tfidf_matrix = tfidf_vectorizer.fit_transform(lbd_train_data['review'])

# Convert to a dense array or DataFrame for easier viewing
tfidf_dense = tfidf_matrix.toarray()

# Convert it to a DataFrame
tfidf_df = pd.DataFrame(tfidf_dense, columns=tfidf_vectorizer.get_feature_names_out())

In [18]:
# dimensions of the tfidf data frame
tfidf_df.shape

(25000, 113034)