# Assignment: Exercise 3.2
# Class: DSC 550
# Name: Wittlieff, Harlan
# Date: 12/19/2021

## Part 1.

### 1. Import the movie review data as a data frame and ensure that the data is loaded properly.

In [2]:
# Import the pandas library
import pandas as pd

# Load the datafile
df = pd.read_csv(r'Data/labeledTrainData.tsv', sep='\t')

# Validate the data loaded correctly
df.head()

   id      sentiment  review
0  5814_8  1          With all this stuff going down at the moment w...
1  2381_9  1          \The Classic War of the Worlds\" by Timothy Hi...
2  7759_3  0          The film starts with a manager (Nicholas Bell)...
3  3630_4  0          It must be assumed that those who praised this...
4  9495_8  1          Superbly trashy and wondrously unpretentious 8...


### 2. How many of each positive and negative reviews are there?

In [5]:
df.groupby('sentiment')['id'].count()

sentiment
0    12500
1    12500
Name: id, dtype: int64

There are 12,500 negative (sentiment = 0) and 12,500 positive (sentiment = 1) reviews.
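
The same tally can be read off with `value_counts`, which also reports class proportions when `normalize=True`. A minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the review data):

```python
import pandas as pd

# Hypothetical stand-in for the review frame
toy = pd.DataFrame({"id": ["a", "b", "c", "d"], "sentiment": [1, 0, 0, 1]})

# Absolute counts per class, ordered by label
counts = toy["sentiment"].value_counts().sort_index()

# Share of each class, useful for spotting imbalance at a glance
shares = toy["sentiment"].value_counts(normalize=True).sort_index()

print(counts)
print(shares)
```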

### 3. Use TextBlob to classify each movie review as positive or negative. Assume that a polarity score greater than or equal to zero is a positive sentiment and less than 0 is a negative sentiment.

In [17]:
# Import the textblob library
from textblob import TextBlob

# Add the TextBlob polarity to a new column
df['tb_polarity'] = df['review'].apply(lambda review: TextBlob(review).sentiment.polarity)

In [46]:
# Convert the polarity scores into sentiment scores (1=positive, 0=negative)
def pos_or_neg(score):
    """Return 1 for a score of zero or above, otherwise 0."""
    return 0 if score < 0 else 1

df['tb_sentiment'] = df['tb_polarity'].apply(pos_or_neg)
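
The thresholding rule can also be written as a single vectorized comparison instead of a row-wise `apply`. A minimal sketch, assuming made-up polarity values rather than real TextBlob output:

```python
import pandas as pd

# Made-up polarity scores in [-1, 1], for illustration only
polarity = pd.Series([0.35, 0.0, -0.2, -0.01])

# Polarity >= 0 is treated as positive (1), anything below 0 as negative (0)
sentiment = (polarity >= 0).astype(int)

print(sentiment.tolist())
```

Note that a score of exactly zero lands in the positive class under this rule, as the assignment specifies.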

### 4. Check the accuracy of this model. Is this model better than random guessing?

In [28]:
# Create a new column to calculate the change in sentiment. A value of 0 means the model scored correctly, a value of -1
# means the model scored a positive review negatively, a value of 1 means the model scored a negative review positively.
df['sentiment_change'] = df['tb_sentiment'] - df['sentiment']

In [29]:
# Obtain the totals for each sentiment_change score
df.groupby('sentiment_change')['id'].count()

sentiment_change
-1      676
 0    17131
 1     7193
Name: id, dtype: int64

Overall, TextBlob classified 17,131 of the 25,000 reviews correctly, an accuracy of 68.5%. Because the two classes are balanced, random guessing would be expected to reach only about 50%, so TextBlob's predictions are better than chance.
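
The accuracy figure, and the split between the two kinds of error, can be worked out directly from the tally above. A short sketch using only the counts already shown (the variable names are illustrative):

```python
# Counts copied from the sentiment_change tally above
false_negative = 676    # positive review scored as negative (-1)
correct = 17131         # prediction matches the label (0)
false_positive = 7193   # negative review scored as positive (+1)

total = false_negative + correct + false_positive
accuracy = correct / total

# Each class has 12,500 reviews, per the class counts in step 2
per_class = 12500
negatives_missed = false_positive / per_class
positives_missed = false_negative / per_class

print(f"accuracy: {accuracy:.1%}")
print(f"negative reviews labelled positive: {negatives_missed:.1%}")
print(f"positive reviews labelled negative: {positives_missed:.1%}")
```

The errors are lopsided: roughly 57.5% of negative reviews are scored as positive, against about 5.4% of positive reviews scored as negative.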

### 5. For up to five points extra credit, use another prebuilt text sentiment analyzer, e.g., VADER, and repeat steps (3) and (4).

In [38]:
# Import library
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer

sid = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\wittl\AppData\Roaming\nltk_data...


In [42]:
# Calculate scores
df['v_scores'] = df['review'].apply(lambda review: sid.polarity_scores(review))
df['v_compound'] = df['v_scores'].apply(lambda score_dict: score_dict['compound'])

In [48]:
# Convert the polarity scores into sentiment scores (1=positive, 0=negative)
df['v_sentiment'] = df['v_compound'].apply(pos_or_neg)

In [50]:
# Create a new column to calculate the change in sentiment. A value of 0 means the model scored correctly, a value of -1
# means the model scored a positive review negatively, a value of 1 means the model scored a negative review positively.
df['v_sentiment_change'] = df['v_sentiment'] - df['sentiment']

In [51]:
# Obtain the totals for each sentiment_change score
df.groupby('v_sentiment_change')['id'].count()

v_sentiment_change
-1     1843
 0    17339
 1     5818
Name: id, dtype: int64

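Overall, the VADER predictions are slightly more accurate for this data set: 17,339 of the 25,000 reviews are classified correctly, an accuracy of 69.4% compared with 68.5% for TextBlob. Like TextBlob, VADER performs better than random guessing (50%).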
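
An equivalent check that skips the difference column is to compare predictions with labels directly and cross-tabulate them. A minimal sketch on made-up data (the `toy` frame and its `predicted` column are assumptions for illustration):

```python
import pandas as pd

# Made-up labels and predictions, for illustration only
toy = pd.DataFrame({
    "sentiment": [1, 1, 0, 0, 0, 1],
    "predicted": [1, 0, 1, 0, 1, 1],
})

# Fraction of rows where the prediction equals the label
accuracy = (toy["predicted"] == toy["sentiment"]).mean()

# Rows are true labels, columns are predictions
confusion = pd.crosstab(toy["sentiment"], toy["predicted"])

print(accuracy)
print(confusion)
```

The off-diagonal cells of the table correspond to the -1 and 1 rows of the tallies above.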

## Part 2.

### 1. Convert all text to lowercase letters.

In [55]:
# Reload the data set
df_prep = pd.read_csv(r'Data/labeledTrainData.tsv', sep='\t')

# Convert the review field to lowercase
df_prep.review = df_prep.review.str.lower()

### 2. Remove punctuation and special characters from the text.

In [58]:
# import the string library
import string

# Define a function that strips every ASCII punctuation character
def remove_punctuation(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, "")
    return text

df_prep.review = df_prep.review.apply(remove_punctuation)
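
The loop above calls `str.replace` once per punctuation character, so each review is scanned 32 times. A single-pass alternative is `str.translate`. The sketch below is one possible variant; the name `strip_punctuation` and the sample string are illustrative:

```python
import string

# Translation table that deletes every character in string.punctuation
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation in one pass."""
    return text.translate(_PUNCT_TABLE)

cleaned = strip_punctuation("great film!<br /><br />don't miss it.")
print(cleaned)
```

Both versions only delete characters. If the reviews contain HTML line breaks such as `<br />`, the letters `br` are left behind and can fuse with neighbouring words, so stripping markup first may be worth considering.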

### 3. Remove stop words.

In [63]:
# Load library
from nltk.corpus import stopwords

# Load stop words into a set for fast membership tests
stop_words = set(stopwords.words('english'))

# Remove stop words
df_prep['review'] = df_prep['review'].apply(lambda x: ' '.join(word for word in x.split() if word not in stop_words))
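
The order of steps 2 and 3 matters. NLTK's English stop list contains contractions written with apostrophes (for example "don't"), and once punctuation has been removed those words no longer match. A minimal sketch of the effect, using a tiny hypothetical stop list instead of the NLTK one:

```python
import string

# Hypothetical mini stop list; NLTK's English list is much longer
stop_words = {"the", "is", "don't", "i"}

def drop_stop_words(text):
    """Keep only the words that are not in the stop list."""
    return " ".join(w for w in text.split() if w not in stop_words)

def strip_punctuation(text):
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))

review = "i don't think the plot is bad"

# Order used in this notebook: punctuation first, then stop words
punct_first = drop_stop_words(strip_punctuation(review))

# Alternative order: stop words first, then punctuation
stops_first = strip_punctuation(drop_stop_words(review))

print(punct_first)
print(stops_first)
```

With punctuation removed first, "dont" survives as a token. Whether that matters depends on the downstream model, but it does change the vocabulary.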

### 4. Apply NLTK's PorterStemmer

In [67]:
# Load library
from nltk.stem.porter import PorterStemmer

# Create Stemmer
porter = PorterStemmer()

# Apply stemmer
df_prep['review'] = df_prep['review'].apply(lambda x: ' '.join([porter.stem(word) for word in x.split()]))

### 5. Create a bag-of-words matrix from your stemmed text (output from (4)), where each row is a word-count vector for a single movie review (see sections 5.3 & 6.8 in the Machine Learning with Python Cookbook). Display the dimensions of your bag-of-words matrix. The number of rows in this matrix should be the same as the number of rows in your original data frame.

In [73]:
# Load library
from sklearn.feature_extraction.text import CountVectorizer

# Create the bag of words feature matrix
count = CountVectorizer()
bag_of_words = count.fit_transform(df_prep.review)

# Show dimensions of feature matrix
bag_of_words

<25000x92379 sparse matrix of type '<class 'numpy.int64'>'
	with 2439461 stored elements in Compressed Sparse Row format>

### 6. Create a term frequency-inverse document frequency (tf-idf) matrix from your stemmed text, for your movie reviews (see section 6.9 in the Machine Learning with Python Cookbook). Display the dimensions of your tf-idf matrix. These dimensions should be the same as your bag-of-words matrix.

In [76]:
# Load Library
from sklearn.feature_extraction.text import TfidfVectorizer

# Create the tf-idf feature matrix
tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(df_prep.review)

# Show tf-idf feature matrix
feature_matrix

<25000x92379 sparse matrix of type '<class 'numpy.float64'>'
	with 2439461 stored elements in Compressed Sparse Row format>
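
The tf-idf matrix has the same shape and the same number of stored elements as the bag-of-words matrix because both vectorizers build the same vocabulary; only the cell values change, from integer counts to float weights. The sketch below recomputes the weights by hand on made-up documents. It is meant to mirror the defaults scikit-learn documents for `TfidfVectorizer` (smoothed idf, L2 row normalisation), not to reproduce the library's code:

```python
import math
from collections import Counter

# Made-up, already-stemmed documents for illustration only
docs = ["great movi great cast", "bad movi", "great plot"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed idf: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in vocab}

def tfidf_row(tokens):
    """Weight raw counts by idf, then scale the row to unit length."""
    counts = Counter(tokens)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in tokenized]
print(len(rows), len(vocab))
```

Terms that appear in fewer documents (such as "bad" here) receive a larger idf than terms that appear in most of them (such as "great").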