# DSC 550 Data Mining (DSC550-T302 2227-1)
## Bellevue University
## 5.2 Exercise: Build your own Sentiment Analysis Model
## Author: Jake Meyer
## Date: 

## Assignment Instructions:
Download the labeled training dataset from this link: [Bag of Words Meets Bags of Popcorn](https://www.kaggle.com/c/word2vec-nlp-tutorial/data).

<ol>
    <li> Get the stemmed data using the same process you did in Week 3.
    <li> Split this into a training and test set.
    <li> Fit and apply the tf-idf vectorization to the training set.
    <li> Apply but DO NOT FIT the tf-idf vectorization to the test set (Why?)
    <li> Train a logistic regression using the training data.
    <li> Find the model accuracy on the test set.
    <li> Create a confusion matrix for the test set predictions.
    <li> Get the precision, recall, and F1-score for the test set predictions.
    <li> Create a ROC curve for the test set.
    <li> Pick another classification model you learned about this week and repeat steps (5) - (9).
</ol>
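
Step 4 asks why the vectorizer is applied but not fitted to the test set. Fitting learns the vocabulary and idf weights, so fitting on the test set would leak test information into the features and could change the columns the model was trained on. A minimal sketch of the idea, assuming scikit-learn is installed; the toy sentences and the names `train_docs`, `test_docs`, and `tfidf_demo` are made up for illustration and are not part of the assignment:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration only.
train_docs = ["great film great cast", "bore plot bad act"]
test_docs = ["great plot unseen word"]

tfidf_demo = TfidfVectorizer()
X_train = tfidf_demo.fit_transform(train_docs)  # learns vocabulary and idf from training data
X_test = tfidf_demo.transform(test_docs)        # reuses them; nothing is learned from test data

# Both matrices share the same columns, so a model trained on X_train can score X_test.
print(X_train.shape, X_test.shape)
# Words that appear only in the test set are simply ignored.
print("unseen" in tfidf_demo.vocabulary_)
```

Because `transform()` reuses the fitted vocabulary, the test matrix has the same number of columns as the training matrix.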

In [1]:
'''
Install the required packages (if necessary) to complete the assignment for Week 5.
The installations are commented out because the packages were already installed for Week 3.
'''
import sys
# !{sys.executable} -m pip install -U textblob
# !{sys.executable} -m pip install nltk
# !{sys.executable} -m pip install vaderSentiment
# !{sys.executable} -m pip install neattext

In [2]:
'''
Import the necessary libraries to complete Exercise 5.2.
'''
import numpy as np
import pandas as pd
import textblob
import nltk
import nltk.corpus
import vaderSentiment
import unicodedata
import string
import re
import neattext as nt
import neattext.functions as nfx
from textblob import TextBlob, Word, Blobber
from textblob.classifiers import NaiveBayesClassifier
from textblob.taggers import NLTKTagger
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer
from string import punctuation, printable, digits
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\jkmey\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [3]:
'''
Check the versions of the packages.
'''
print('numpy version:', np.__version__)
print('pandas version:', pd.__version__)
print('textblob version:', textblob.__version__)
print('nltk version:', nltk.__version__)

numpy version: 1.20.3
pandas version: 1.3.4
textblob version: 0.17.1
nltk version: 3.6.5


## Part 1: Get the stemmed data using the same process you did in Week 3.

### Part 1.1 - Import the movie data as a data frame and ensure that the data is loaded properly.

In [4]:
'''
Section 1.1 - Import Data
Import the movie review data from labeledTrainData.tsv.
Note: A copy of the TSV file was placed into the same directory as this notebook.
Utilize pd.read_table() to read the tsv file as a pandas data frame. (tab separated value file)
'''
df = pd.read_table('labeledTrainData.tsv')

In [5]:
'''
Show the data has been loaded successfully into the data frame 
by printing the first 5 rows with head().
'''
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,With all this stuff going down at the moment w...
1,2381_9,1,"\The Classic War of the Worlds\"" by Timothy Hi..."
2,7759_3,0,The film starts with a manager (Nicholas Bell)...
3,3630_4,0,It must be assumed that those who praised this...
4,9495_8,1,Superbly trashy and wondrously unpretentious 8...


In [6]:
'''
Understand the shape of the dataframe.
'''
print('There are {} rows and {} columns in this data frame.'.format(df.shape[0], df.shape[1]))

There are 25000 rows and 3 columns in this data frame.


In [7]:
'''
Display the total number of elements (rows x columns) in this data frame.
'''
print('This data frame contains {} elements.'.format(df.size))

This data frame contains 75000 elements.
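
Note that `DataFrame.size` counts cells (rows times columns), so 75000 here is 25000 rows x 3 columns rather than 25000 reviews. A small sketch with a made-up frame (the name `demo` and its contents are illustrative only) shows the difference between `shape`, `size`, and `len()`:

```python
import pandas as pd

# Small made-up frame: 2 rows, 3 columns.
demo = pd.DataFrame({"id": ["a", "b"], "sentiment": [1, 0], "review": ["good", "bad"]})

n_rows, n_cols = demo.shape
print(n_rows, n_cols)   # rows and columns
print(demo.size)        # total cells = rows * columns
print(len(demo))        # number of records (rows)
```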


In [8]:
'''
Understand if there are any missing values in the data frame.
'''
df.isna().sum()

id           0
sentiment    0
review       0
dtype: int64

### Part 1.2 - How many of each positive and negative reviews are there?

In [9]:
'''
Section 1.2 - Understand how many positive or negative reviews exist.
Use value_counts() to find the number of sentiments classified as 1 (Positive) or 0 (Negative).
'''
df['sentiment'].value_counts()

1    12500
0    12500
Name: sentiment, dtype: int64

There are 12500 positive reviews and 12500 negative reviews based on the output shown above.

### Part 1.3 - Convert all text to lowercase letters.

In [11]:
'''
Convert the reviews to lowercase using the .str.lower() command.
Show the first 5 rows of df to confirm the lowercase correction.
'''
df['review'] = df['review'].str.lower()
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,with all this stuff going down at the moment w...
1,2381_9,1,"\the classic war of the worlds\"" by timothy hi..."
2,7759_3,0,the film starts with a manager (nicholas bell)...
3,3630_4,0,it must be assumed that those who praised this...
4,9495_8,1,superbly trashy and wondrously unpretentious 8...


### Part 1.4 - Remove punctuation and special characters from the text.

In [12]:
'''
Section 1.4 - Remove punctuation and special characters from the text.
Write a function that will use translate to remove punctuation.
'''
def remove_punctuation(text: str) -> str:
    return text.translate(str.maketrans('', '', punctuation))
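
As a quick illustration of how `str.maketrans()` and `translate()` work together, here is a minimal sketch on a made-up sentence (the names `table` and `sample` are illustrative). The third argument to `maketrans()` lists characters to delete, so hyphenated or contracted words are joined rather than split:

```python
from string import punctuation

# Build the translation table once; the third argument lists characters to delete.
table = str.maketrans('', '', punctuation)

sample = "it's a must-see, really!"   # made-up sentence
cleaned = sample.translate(table)
print(cleaned)
```

Building the table once and reusing it avoids recreating it for every review.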

In [13]:
'''
Use apply() to remove punctuation for the 'review' column in df
based on the remove_punctuation() function from above.
'''
df['review'] = df['review'].apply(remove_punctuation)

In [14]:
'''
Check that the punctuation has been removed from a random review in df.
'''
df['review'][104]

'the first time you see the second renaissance it may look boring look at it at least twice and definitely watch part 2 it will change your view of the matrix are the human people the ones who started the war  is ai a bad thing '

In [15]:
'''
Show the first five rows of df after lowercase and punctuation removal were applied.
'''
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,with all this stuff going down at the moment w...
1,2381_9,1,the classic war of the worlds by timothy hines...
2,7759_3,0,the film starts with a manager nicholas bell g...
3,3630_4,0,it must be assumed that those who praised this...
4,9495_8,1,superbly trashy and wondrously unpretentious 8...


In [16]:
'''
Create function to limit the text to only alpha characters.
'''
def alpha_char(raw_reviews):
    letters_only = re.sub("[^a-zA-Z]", " ", raw_reviews)
    return letters_only

In [17]:
'''
Use apply() to replace every non-alphabetic character (such as digits) in the
'review' column of df with a space, using the alpha_char() function from above.
'''
df['review'] = df['review'].apply(alpha_char)

In [18]:
'''
Check that the special characters have been removed from a random review in df.
'''
df['review'][104]

'the first time you see the second renaissance it may look boring look at it at least twice and definitely watch part   it will change your view of the matrix are the human people the ones who started the war  is ai a bad thing '
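
The output above shows that the digit in "part 2" was replaced by a space, leaving a run of spaces. This is harmless for the tokenizer used later, but if single spacing were wanted, one possible follow-up (an optional extra, not part of the original workflow) is to collapse whitespace with a second substitution:

```python
import re

sample = "watch part 2 it will change"   # fragment taken from the review above
letters_only = re.sub("[^a-zA-Z]", " ", sample)          # the digit becomes a space
collapsed = re.sub(r"\s+", " ", letters_only).strip()    # optional: squeeze repeated spaces

print(repr(letters_only))
print(repr(collapsed))
```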

### Part 1.5 - Remove stop words.

In [19]:
'''
Section 1.5 - Remove stop words.
Show the list of stop_words pulled in from nltk package.
'''
stop_words = stopwords.words("english")
stop_words[0:10]

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [20]:
'''
Now tokenize the review data. Create a function to tokenize the text from the reviews.
'''
def tokenize_words(text):
    return word_tokenize(text)

In [21]:
'''
Apply the tokenize_words function to the df['review'] column.
'''
df['review'] = df['review'].apply(tokenize_words)

In [22]:
'''
View the first 5 rows of df['review'] to see the tokenized text.
'''
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,"[with, all, this, stuff, going, down, at, the,..."
1,2381_9,1,"[the, classic, war, of, the, worlds, by, timot..."
2,7759_3,0,"[the, film, starts, with, a, manager, nicholas..."
3,3630_4,0,"[it, must, be, assumed, that, those, who, prai..."
4,9495_8,1,"[superbly, trashy, and, wondrously, unpretenti..."


In [23]:
'''
Create a function to remove stop words.
'''
def remove_stop_words(raw_review):
    cleaned_words = [w for w in raw_review if w not in stop_words]
    return cleaned_words
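
The list comprehension checks every token against the stop word list. With 25000 reviews this check runs many times, so one possible speed-up is to convert the list to a set first, since set membership is constant time on average. A minimal sketch with a hypothetical mini stop list (the notebook itself uses NLTK's English list; `stop_list`, `stop_set`, and `tokens` are illustrative names):

```python
# Hypothetical mini stop list; the notebook uses NLTK's English list instead.
stop_list = ["the", "a", "is", "of"]
stop_set = set(stop_list)   # set membership is O(1) on average vs O(n) for a list

tokens = ["the", "film", "is", "a", "classic", "of", "the", "genre"]
kept = [w for w in tokens if w not in stop_set]
print(kept)
```

The result is the same as filtering against the list; only the lookup cost changes.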

In [24]:
'''
Apply the remove_stop_words() function to the review column from df.
Verify the stop words were removed by viewing the first 5 rows of df.
'''
df['review'] = df['review'].apply(remove_stop_words)
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,"[stuff, going, moment, mj, ive, started, liste..."
1,2381_9,1,"[classic, war, worlds, timothy, hines, enterta..."
2,7759_3,0,"[film, starts, manager, nicholas, bell, giving..."
3,3630_4,0,"[must, assumed, praised, film, greatest, filme..."
4,9495_8,1,"[superbly, trashy, wondrously, unpretentious, ..."


### Part 1.6 - Apply NLTK's PorterStemmer.

In [25]:
'''
Section 1.6 - Apply NLTK's PorterStemmer.
Start by creating the stemmer.
'''
porter = PorterStemmer()

In [26]:
'''
Create a function that applies NLTK's PorterStemmer to each token and joins the stems back into a single string.
'''
def porter_stemmer(reviews):
    stemmed_words = [porter.stem(word) for word in reviews]
    return " ".join(stemmed_words)
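
Porter stems are truncated forms and need not be dictionary words. A short sketch, assuming NLTK is installed and using the first four tokens of the fifth review shown earlier (the names `stemmer`, `tokens`, and `stems` are illustrative):

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Tokens taken from the fifth review shown in the notebook output.
tokens = ["superbly", "trashy", "wondrously", "unpretentious"]
stems = [stemmer.stem(t) for t in tokens]
print(" ".join(stems))
```

Joining the stems with spaces returns a plain string, which is the input format the vectorizers in Parts 1.7 and 1.8 expect.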

In [27]:
'''
Apply the porter_stemmer function created above to the df['review'] column.
Verify that the stemming worked by checking the first 5 rows of the data frame.
'''
df['review'] = df['review'].apply(porter_stemmer)
df.head()

Unnamed: 0,id,sentiment,review
0,5814_8,1,stuff go moment mj ive start listen music watc...
1,2381_9,1,classic war world timothi hine entertain film ...
2,7759_3,0,film start manag nichola bell give welcom inve...
3,3630_4,0,must assum prais film greatest film opera ever...
4,9495_8,1,superbl trashi wondrous unpretenti exploit hoo...


### Part 1.7 - Create a bag-of-words matrix from the stemmed text.

In [28]:
'''
Section 1.7 - Create a bag-of-words matrix.
Start by creating a CountVectorizer limited to the 20,000 most frequent terms.
'''
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None,\
                            preprocessor = None, stop_words = None,\
                            max_features = 20000)

In [29]:
'''
Create the bag of words feature matrix.
Convert the bag_of_words matrix to an array.
Show the dimensions of the matrix as the output.
'''
bag_of_words = vectorizer.fit_transform(df['review'])
bag_of_words = bag_of_words.toarray()
print(bag_of_words.shape)

(25000, 20000)
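
Each row of the bag-of-words matrix is one review and each column is one vocabulary term; a cell holds how often that term occurs in that review. A hand-rolled sketch of the same idea using only the standard library, on two made-up stemmed reviews (this illustrates the concept and is not how CountVectorizer is implemented internally):

```python
from collections import Counter

docs = ["film start film end", "bad film"]   # made-up stemmed reviews

# Vocabulary: sorted unique terms, each mapped to a column position.
vocab = sorted({w for d in docs for w in d.split()})

rows = []
for d in docs:
    counts = Counter(d.split())
    rows.append([counts[term] for term in vocab])   # one count per vocabulary term

print(vocab)
print(rows)
```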


In [30]:
'''
Show the feature names for the bag_of_words matrix.
'''
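# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
feature_names = vectorizer.get_feature_names()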
print(feature_names)



In [31]:
'''
Show the bag_of_words matrix.
'''
bag_of_words

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

### Part 1.8 - Create a term frequency-inverse document frequency (tf-idf) matrix from the stemmed text.

In [32]:
'''
Section 1.8 - Create a term frequency-inverse document frequency (tf-idf) matrix.
Create the tf-idf feature matrix. Start by using TfidfVectorizer().
'''
tfidf = TfidfVectorizer(analyzer = "word", tokenizer = None,\
                            preprocessor = None, stop_words = None,\
                            max_features = 20000)

In [33]:
'''
Create the matrix using the fit_transform() function.
Convert the tfidf feature matrix to an array.
Show the dimensions of the matrix as the output.
'''
feature_matrix = tfidf.fit_transform(df['review'])
feature_matrix = feature_matrix.toarray()
print(feature_matrix.shape)

(25000, 20000)
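
To make the tf-idf weights less of a black box, the sketch below computes them by hand for two made-up stemmed reviews. It assumes the defaults documented for TfidfVectorizer: raw term counts, smoothed idf of the form ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. All names (`docs`, `df_counts`, `idf`, `tfidf_rows`) are illustrative:

```python
import math
from collections import Counter

docs = ["film start film end", "bad film"]   # made-up stemmed reviews
vocab = sorted({w for d in docs for w in d.split()})
n_docs = len(docs)

# Document frequency: number of documents containing each term.
df_counts = {t: sum(t in d.split() for d in docs) for t in vocab}
# Smoothed idf, as documented for TfidfVectorizer's default smooth_idf=True.
idf = {t: math.log((1 + n_docs) / (1 + df_counts[t])) + 1 for t in vocab}

tfidf_rows = []
for d in docs:
    counts = Counter(d.split())
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))   # L2-normalize each row
    tfidf_rows.append([v / norm for v in raw])

for row in tfidf_rows:
    print([round(v, 3) for v in row])
```

A term that appears in every document, such as "film" here, gets the minimum idf of 1, while rarer terms are weighted up.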


In [34]:
'''
Show the vocabulary learned by the tfidf vectorizer (each term mapped to its column index).
'''
tfidf.vocabulary_

{'stuff': 16992,
 'go': 7244,
 'moment': 11469,
 'mj': 11416,
 'ive': 9129,
 'start': 16732,
 'listen': 10241,
 'music': 11757,
 'watch': 19274,
 'odd': 12347,
 'documentari': 4877,
 'wiz': 19638,
 'moonwalk': 11541,
 'mayb': 10962,
 'want': 19217,
 'get': 7096,
 'certain': 2786,
 'insight': 8891,
 'guy': 7627,
 'thought': 17773,
 'realli': 14255,
 'cool': 3660,
 'eighti': 5376,
 'make': 10666,
 'mind': 11283,
 'whether': 19461,
 'guilti': 7571,
 'innoc': 8867,
 'part': 12896,
 'biographi': 1684,
 'featur': 6160,
 'film': 6288,
 'rememb': 14483,
 'see': 15483,
 'cinema': 3090,
 'origin': 12547,
 'releas': 14449,
 'subtl': 17065,
 'messag': 11165,
 'feel': 6175,
 'toward': 18043,
 'press': 13686,
 'also': 463,
 'obviou': 12325,
 'drug': 5126,
 'bad': 1173,
 'br': 2047,
 'visual': 19097,
 'impress': 8666,
 'cours': 3788,
 'michael': 11199,
 'jackson': 9142,
 'unless': 18649,
 'remot': 14494,
 'like': 10182,
 'anyway': 719,
 'hate': 7885,
 'find': 6323,
 'bore': 1958,
 'may': 10959,
 'cal
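
The `vocabulary_` attribute maps each term to its column index in the tf-idf matrix. To go the other way, from column index to term, the mapping can be inverted. A small sketch using a few entries copied from the output above (`vocab_sample`, `index_to_term`, and `ordered_terms` are illustrative names):

```python
# A few entries copied from the vocabulary_ output above.
vocab_sample = {"stuff": 16992, "go": 7244, "moment": 11469, "also": 463}

# Invert the mapping: column index -> term.
index_to_term = {idx: term for term, idx in vocab_sample.items()}

# Terms ordered by column index.
ordered_terms = [index_to_term[i] for i in sorted(index_to_term)]
print(ordered_terms)
```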

In [35]:
'''
Show the feature_matrix.
'''
feature_matrix

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

### Part 2 - Split the data into a training and test set.