# Introduction

Author: Luis Sejas 

Student ID: 8440116

Hi and welcome to the first notebook of this nlp project.

On this notebook, we will be implementing the TF_IDF framework and analyze its effect on sentiment analysis.

The focus will be on two classification models: Logistic Regression and Random Forest Classifier.

Since it is given that these models will be underperformers compared to neural networks, there will not be a lot of emphasis on accuracy.

The main objective of this notebook is to start exploring machine learning models and end with the most powerful model.

# Before the Model

## Part 1: Loading and Seeing the Data

In [1]:
!pip install tensorflow-datasets > /dev/null

In [2]:
import tensorflow_datasets as tfds
import pandas as pd
import numpy as np
import re
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [3]:
(ds_train,ds_test),ds_info = tfds.load(
    name="imdb_reviews",
    split=["train","test"],
    shuffle_files=True,
    as_supervised=True,
    with_info=True
)

[1mDownloading and preparing dataset imdb_reviews/plain_text/1.0.0 (download: 80.23 MiB, generated: Unknown size, total: 80.23 MiB) to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0...[0m


Dl Completed...: 0 url [00:00, ? url/s]

Dl Size...: 0 MiB [00:00, ? MiB/s]





0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteFO4IAV/imdb_reviews-train.tfrecord


  0%|          | 0/25000 [00:00<?, ? examples/s]

0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteFO4IAV/imdb_reviews-test.tfrecord


  0%|          | 0/25000 [00:00<?, ? examples/s]

0 examples [00:00, ? examples/s]

Shuffling and writing examples to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0.incompleteFO4IAV/imdb_reviews-unsupervised.tfrecord


  0%|          | 0/50000 [00:00<?, ? examples/s]



[1mDataset imdb_reviews downloaded and prepared to /root/tensorflow_datasets/imdb_reviews/plain_text/1.0.0. Subsequent calls will reuse this data.[0m


In [4]:
df_train = tfds.as_dataframe(ds_train, ds_info)
df_test = tfds.as_dataframe(ds_test, ds_info)

In [5]:
df_train.head()

Unnamed: 0,label,text
0,0,"b""This was an absolutely terrible movie. Don't..."
1,0,b'I have been known to fall asleep during film...
2,0,b'Mann photographs the Alberta Rocky Mountains...
3,1,b'This is the kind of film for a snowy Sunday ...
4,1,"b'As others have mentioned, all the women that..."


## Part 2: Pre-processing the data

I have noticed that the reviews start with b' or with b" and ' or " at the end, among other stuff.

The aim here is to clean the data to train an algorithm that will automatically detect the sentiment correctly Ideally, even ambiguous text

Below is a series of formulas to clean the reviews.

Keep in mind this is only the beginning, therefore some deep cleaning will not be employed at this stage and yes on the other ones.

This pre-processing will be preserved for comparison purposes.

In [6]:
"""
Order and Explanation

1. prepare_for_ai

The purpose of this function is to convert the df column values into a list.
This way we can easily manipulate the data and clean it.
It is then passed on to the next function.

2. clean_entry

The purpose of this function is to convert each entry of the list from bytes
into string and getting the converted bytes that are now strings into a new 
list. This is passed on to apply_re.

3. apply_re

The purpose of this function is to apply regular expressions and remove the 
punctuation from each movie review. This makes our life easier, since punctuations
do not a lot of value for the analysis. 
This is the moved on to the next function

4. remove_brbr

The purpose of this function is to remove some unusual characters seen while
using the dataset. Then it is moved to the next function.

5. convert_to_df

The purpose of this function is to convert the data back into pdf for more flexible
processing later on.

*6. prepare_for_logistic_regression

The first model to be tested is a simple one, logistic regression.
From there on, more complex models will be employed.

"""


def remove_brbr(re_list):
  lower_string = []
  for text in re_list:
    text.replace("br br", "")
    lower_string.append(text.lower())
  return convert_to_df(lower_string)


def apply_re(str_list):
  re_list = []
  for text in str_list:
    text = re.sub("[^0-9A-Za-z ]", "", text)
    re_list.append(text)
  return remove_brbr(re_list)

def clean_entry(text_list):
  str_list = []
  for text in text_list:
    str_text = str(text)
    str_text_lim = len(str_text)-1
    str_text = str_text[1:str_text_lim]
    str_list.append(str_text)
  return apply_re(str_list)

def prepare_for_ai(df_col):
  list_to_return = df_col.tolist()
  return clean_entry(list_to_return)

def convert_to_df(lower_string):
  new_df = pd.DataFrame(lower_string, columns=['text'])
  return new_df

def prepare_for_logistic_regression(new_df):
  x_values = new_df.text.values
  tokenizer=Tokenizer(num_words=130500)
  tokenizer.fit_on_texts(x_values)
  encoded_reviews = tokenizer.texts_to_sequences(x_values)
  padded_seq = pad_sequences(encoded_reviews, maxlen=500)
  return padded_seq

In [7]:
# The following variables are base and every model will have its own adaptations

x_train = prepare_for_ai(df_train['text'])
x_test = prepare_for_ai(df_test['text'])
y_train = df_train['label']
y_test = df_test['label']

# Model 1: Logistic Regression (No IDF)

The idea of starting with such a simple model is that we need to have a baseline to see why other models perform better than others.

Here is a very simple architecture: 

- With the preprocessing we tokenize the entries, allowing up to 130500 words to be formed.

- Then we create a padding to make sure each entry has the same length due to memory and convergence restrictions, it was limited to 500.

Given that this model WILL NOT consider the relative importance of each word, I expect it to perform fairly poorly.

In [8]:
x_train_lr = prepare_for_logistic_regression(x_train)
x_test_lr = prepare_for_logistic_regression(x_test)

In [9]:
model = LogisticRegression(random_state=32)
model.fit(x_train_lr, y_train)

y_pred_lr = model.predict(x_test_lr)
print("The accuracy for this model is: ", accuracy_score(y_test, y_pred_lr))

The accuracy for this model is:  0.50684


As seen, the model performed poorly and is not able to predict accurately the sentiment.

How about a more advanced model? Such as random forest.

I expect the model to perform better since it will on its own start to realize some words matter more than orders.

Just to keep it 'fair', I will keep the same processing from the Logistic Regression.

# Model 2: Random Forest (No IDF)

In this section, we will explore the model Random Forest.

Given that the data is to some level pre-processed we will recycle it and employ it.

The only adjustment will be n_estimators, since this is more to illustrate rather than achieve optimal performance.

In [10]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators = 500, random_state=32)

# the x_train and y_train can be the same as in logistic regression

model.fit(x_train_lr, y_train)

y_pred_lr = model.predict(x_test_lr)
print("The accuracy for this model is: ", accuracy_score(y_test, y_pred_lr))

The accuracy for this model is:  0.5366


As expected, the model performed better than logistic regression with almost no tuning.

However, as language speakers, we know that certain words are key to determine sentiment.

Therefore the next model to be used will be TF_IDF.

# TF-IDF on Both Models

The idea of this model, is to focus on the words that do not appear often in the documents and with that apply better predictions.

For this model, additional preprocessing will need to employed as it will be shown below.

**REMINDER**

In the reviews, a lot of stop words, or words that do not contribute to the analysis will be present.

During the pre-processing, these will be removed.


## Stopwords: NLTK or SpaCy?

There are many libraries from where to retrieve our stop words.

Two of them are NLTK and SpaCy. I will then see which one of those has more stop words and from then decide which one to follow.

In [14]:
# SpaCy
!python -m spacy download en_core_web_sm
import spacy
import en_core_web_sm

en_spacy = en_core_web_sm.load()

stopwords_spacy = en_spacy.Defaults.stop_words

print(len(stopwords_spacy)) #326 Stop Words


Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
[K     |████████████████████████████████| 12.0 MB 29.9 MB/s 
[38;5;2m✔ Download and installation successful[0m
You can now load the model via spacy.load('en_core_web_sm')
326


In [15]:
# NLTK

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stopwords_nltk =set(stopwords.words('english'))
len(stopwords_nltk) # 179 Stop Words

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


179

As seen here, the model with more stop words is spacy, therefore we will proceed with spacy.

## TF_IDF: Prepping the Data

In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=2500,
    min_df = 15,
    max_df = 0.75,
    stop_words = stopwords_spacy
)

x_train_idf = vectorizer.fit_transform(x_train['text'].values).toarray()
x_test_idf = vectorizer.fit_transform(x_test['text'].values).toarray()

  % sorted(inconsistent)


Rationale:

Here I am constraining the model,

- Given that there will only be a certain number of key words to describe sentiment, limiting to 1500 words on all reviews is pretty conservative.

- The words should appear at least in 15 reviews and it cannot be in more than 65% of them.

- The stop words used was spacy.


## TF_IDF Model 1: Logistic Regression

Now that we have better quality data, we will explore whether the accuracy of Logistic Regression can be improved.

My hypothesis is that we will see some sort of improvement.

In [20]:
model_idf_lr = LogisticRegression(random_state=32)
model_idf_lr.fit(x_train_idf, y_train)

y_pred_idf_lr = model_idf_lr.predict(x_test_idf)
print("The accuracy for this model is: ", accuracy_score(y_test, y_pred_idf_lr))

The accuracy for this model is:  0.60796


We see that accuracy has been improved by about 10%, I expect random forest to perform better than its previous model since it will 'filter' the already filtered words.

## TF_IDF Model 2: Random Forest

In [22]:
model_idf_rf = RandomForestClassifier(n_estimators = 700, random_state=32)
model_idf_rf.fit(x_train_idf, y_train)

y_pred_idf_rf = model_idf_rf.predict(x_test_idf)
print("The accuracy for this model is: ", accuracy_score(y_test, y_pred_idf_rf))

The accuracy for this model is:  0.59796


# Conclusion

As one can see, applying TF-IDF to both models led to better results.

This was to be expected because a lot of the noise was removed from the pre-processing.

A surprising result was how logistic regression outperformed random forest after the application of TF-IDF.

It seems that after the transformation, the model became simpler and the random forest itself was liable to overfitting.

However, one must consider that minimal tuning was applied purposely.

Fortunately, after TF-IDF, both of the models perform better than randomly predicting a sentiment. 

On the next section, LSTM Classifier will be explored and the results will be compared with these two models.