# **Sentiment Classification - Machine Learning / Basic Preprocessing**

# **Prerequisites**

**Install Required Packages**

In [1]:
# install the 'datasets' library 
!pip install datasets -q

# install the 'spacy' library
!pip install spacy -q

# install the 'joblib' library
!pip install joblib -q


[notice] A new release of pip is available: 23.3.1 -> 23.3.2
[notice] To update, run: python.exe -m pip install --upgrade pip



In [2]:
# download the small English model (en_core_web_sm) for the spaCy library
!python -m spacy download en_core_web_sm -q

[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')





**Load Dataset**

In [3]:
# import the 'load_and_prepare_imdb_dataset' function from the 'imdb_data_loader' module
from imdb_data_loader import load_and_prepare_imdb_dataset

# call the 'load_and_prepare_imdb_dataset' function to import the IMDB dataset
trainData, testData = load_and_prepare_imdb_dataset()

# **Dataset Analysis**

**Data Checks**

In [4]:
# call the 'head()' method on the 'trainData' DataFrame to inspect the first five rows
trainData.head()

                                                text  label
0  I rented I AM CURIOUS-YELLOW from my video sto...      0
1  "I Am Curious: Yellow" is a risible and preten...      0
2  If only to avoid making this type of film in t...      0
3  This film was probably inspired by Godard's Ma...      0
4  Oh, brother...after hearing about this ridicul...      0


In [5]:
# get the shape of the 'trainData' DataFrame
trainData.shape

(25000, 2)

In [6]:
# call the 'head()' method on the 'testData' DataFrame to inspect the first five rows
testData.head()

                                                text  label
0  I love sci-fi and am willing to put up with a ...      0
1  Worth the entertainment value of a rental, esp...      0
2  its a totally average film with a few semi-alr...      0
3  STAR RATING: ***** Saturday Night **** Friday ...      0
4  First off let me say, If you haven't enjoyed a...      0


In [7]:
# get the shape of the 'testData' DataFrame
testData.shape

(25000, 2)

**Remove Duplicates**

In [8]:
# calculate the number of duplicated entries in the 'text' column of the 'trainData' DataFrame
trainDataDuplicates = trainData['text'].duplicated().sum()
trainDataDuplicates

96

In [9]:
# remove duplicate rows from 'trainData' based on the 'text' column
noTrainDataDuplicates = trainData.drop_duplicates(subset='text')
noTrainDataDuplicatesShape = noTrainDataDuplicates.shape

noTrainDataDuplicatesShape

(24904, 2)

In [10]:
# calculate the number of duplicated entries in the 'text' column of the 'testData' DataFrame
testDataDuplicates = testData['text'].duplicated().sum()
testDataDuplicates

199

In [11]:
# remove duplicate rows from 'testData' based on the 'text' column
noTestDataDuplicates = testData.drop_duplicates(subset='text')
noTestDataDuplicatesShape = noTestDataDuplicates.shape

noTestDataDuplicatesShape

(24801, 2)

# **Basic Preprocessing**

In [12]:
# import the preprocessing function 'preprocess_basic' from the 'basic_preprocessing' module;
# it applies the following steps to each text:
#   1. Convert text to lowercase.
#   2. Remove HTML tags using BeautifulSoup.
#   3. Handle contractions.
#   4. Expand acronyms.
#   5. Tokenize the text using spaCy.
#   6. Remove punctuation tokens.
#   7. Remove non-alphabetic characters.
from basic_preprocessing import preprocess_basic
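
The steps listed above can be sketched with the standard library alone. This is a simplified illustration, not the contents of `basic_preprocessing`: the name `preprocess_basic_sketch`, the helper `replace_whole_words`, and both lookup tables are hypothetical, and regex tag stripping and whitespace splitting stand in for BeautifulSoup and spaCy.

```python
import re

# Hypothetical lookup tables for illustration only; the tables used by the
# real 'basic_preprocessing' module are not shown in this notebook.
CONTRACTIONS = {"can't": "can not", "don't": "do not", "it's": "it is"}
ACRONYMS = {"tv": "television"}


def replace_whole_words(text, mapping):
    """Replace each key of 'mapping' wherever it occurs as a whole word."""
    for short, full in mapping.items():
        text = re.sub(rf"\b{re.escape(short)}\b", full, text)
    return text


def preprocess_basic_sketch(text):
    """Simplified stand-in for the seven steps listed above."""
    text = text.lower()                             # 1. lowercase
    text = re.sub(r"<[^>]+>", " ", text)            # 2. strip HTML tags (regex, not BeautifulSoup)
    text = replace_whole_words(text, CONTRACTIONS)  # 3. handle contractions
    text = replace_whole_words(text, ACRONYMS)      # 4. expand acronyms
    tokens = text.split()                           # 5. tokenize (whitespace, not spaCy)
    # 6-7. keep only alphabetic characters; tokens that become empty
    # (pure punctuation or digits) are dropped
    tokens = ["".join(ch for ch in tok if ch.isalpha()) for tok in tokens]
    return " ".join(tok for tok in tokens if tok)


example = preprocess_basic_sketch("I can't believe it!<br />Great TV, 10/10.")
print(example)
```

The sketch returns a single cleaned string per review, which matches the shape of the preprocessed entries printed in this notebook.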

In [13]:
# create copies of the noTrainDataDuplicates and noTestDataDuplicates DataFrames
rawtrainData = noTrainDataDuplicates.copy()
rawtestData = noTestDataDuplicates.copy()

In [14]:
# apply the 'preprocess_basic' function to each text entry in the 'rawtrainData' DataFrame
trainDataBasic = rawtrainData['text'].apply(preprocess_basic)



In [15]:
# print the first three preprocessed text entries (train data) to verify the preprocessing step;
# 'head(3)' is used so the loop does not depend on index label 2 surviving duplicate removal
for index, value in trainDataBasic.head(3).items():
    print(f"Index {index}: {value}")

Index 0: i rented i am curious yellow from my video store because of all the controversy that surrounded it when it was first released in i also heard that at first it was seized by customs if it ever tried to enter this country therefore being a fan of films considered controversial i really had to see this for plot is centered around a young swedish drama student named lena who wants to learn everything she can about life in particular she wants to focus her attentions to making some sort of documentary on what the average swede thought about certain political issues such as the vietnam war and race issues in the united states in between asking politicians and ordinary denizens of stockholm about their opinions on politics she has sex with her drama teacher classmates and married kills me about i am curious yellow is that years ago this was considered pornographic really the sex and nudity scenes are few and far between even then it has it is not shot like some cheaply made porno whi

In [16]:
# apply the 'preprocess_basic' function to each text entry in the 'rawtestData' DataFrame

testDataBasic = rawtestData['text'].apply(preprocess_basic)

In [17]:
# print the first three preprocessed text entries (test data) to verify the preprocessing step;
# 'head(3)' is used so the loop does not depend on index label 2 surviving duplicate removal
for index, value in testDataBasic.head(3).items():
    print(f"Index {index}: {value}")

Index 0: i love sci fi and am willing to put up with a lot sci fi movies tv are usually underfunded under appreciated and misunderstood i tried to like this i really did but it is to good tv sci fi as babylon is to star trek the original silly prosthetics cheap cardboard sets stilted dialogues cg that does not match the background and painfully one dimensional characters can not be overcome with a sci fi setting i sure there are those of you out there who think babylon is good sci fi tv it has it is not it has it is clichéd and uninspiring while us viewers might like emotion and character development sci fi is a genre that does not take itself seriously cf star trek it may treat important issues yet not as a serious philosophy it has it is really difficult to care about the characters here as they are not simply foolish just missing a spark of life their actions and reactions are wooden and predictable often painful to watch the makers of earth know it has it is rubbish as they have to

# **Feature Extraction**

**Bag of Words**

In [18]:
# import the function for creating a Bag-of-Words (BOW) representation from 'bow_text_vectorization'
from bow_text_vectorization import create_bow_representation

trainBow, testBow = create_bow_representation(trainDataBasic, testDataBasic)
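
A Bag-of-Words representation is commonly built with scikit-learn's `CountVectorizer`. The sketch below assumes that approach; whether `create_bow_representation` works this way is not shown in this notebook, and `create_bow_sketch` is a hypothetical name. The point it illustrates is that the vocabulary is learned from the training texts only and then reused for the test texts.

```python
from sklearn.feature_extraction.text import CountVectorizer


def create_bow_sketch(train_texts, test_texts):
    """Hypothetical stand-in: fit the vocabulary on train, reuse it on test."""
    vectorizer = CountVectorizer()
    train_matrix = vectorizer.fit_transform(train_texts)  # learn vocabulary + count
    test_matrix = vectorizer.transform(test_texts)        # count with the same vocabulary
    return train_matrix, test_matrix, vectorizer


toy_train = ["a good film", "a bad film", "good good acting"]
toy_test = ["good plot", "bad acting"]
bow_train, bow_test, bow_vectorizer = create_bow_sketch(toy_train, toy_test)

# Columns follow the alphabetically sorted training vocabulary.
bow_vocabulary = sorted(bow_vectorizer.vocabulary_)
print(bow_vocabulary)
print(bow_test.toarray())
```

Words that appear only in the test texts ("plot" here) are ignored, so both matrices have the same number of columns. With the default token pattern, single-character tokens such as "a" are dropped as well.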

**Term Frequency-Inverse Document Frequency**

In [19]:
# import the function for creating a TF-IDF representation from 'tfidf_text_vectorization'
from tfidf_text_vectorization import create_tfidf_representation

trainTfidf, testTfidf = create_tfidf_representation(trainDataBasic, testDataBasic)
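
TF-IDF down-weights words that occur in many documents. The sketch below assumes scikit-learn's `TfidfVectorizer` with default settings; whether `create_tfidf_representation` uses it is not shown in this notebook, and `create_tfidf_sketch` is a hypothetical name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def create_tfidf_sketch(train_texts, test_texts):
    """Hypothetical stand-in: fit IDF weights on train, reuse them on test."""
    vectorizer = TfidfVectorizer()
    train_matrix = vectorizer.fit_transform(train_texts)  # learn vocabulary + IDF
    test_matrix = vectorizer.transform(test_texts)        # apply the learned IDF
    return train_matrix, test_matrix, vectorizer


toy_train = ["good film", "bad film", "good acting film"]
toy_test = ["good film"]
tfidf_train, tfidf_test, tfidf_vectorizer = create_tfidf_sketch(toy_train, toy_test)

# In the first document both words occur once, but "film" appears in every
# training document while "good" appears in only two of them.
columns = tfidf_vectorizer.vocabulary_
first_row = tfidf_train[0].toarray()[0]
weight_good = first_row[columns["good"]]
weight_film = first_row[columns["film"]]
print(weight_good > weight_film)
```

With the defaults, each row is also scaled to unit Euclidean length, so documents of different lengths are comparable.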

# **Classification**

In [20]:
# import the functions for training the classifiers and for evaluating them (F1 score)
from final_machine_learning_models_evaluation import (
    train_nb_classifier,
    train_svm_classifier,
    train_lr_classifier,
    evaluate_model,
)
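
The imported helpers are not shown in this notebook. As a rough sketch of the train-then-evaluate pattern used in the cells that follow, the example below assumes a multinomial Naive Bayes model and scikit-learn's binary `f1_score`; the names `train_nb_sketch` and `evaluate_sketch` are hypothetical, and the real helpers may use different models, hyperparameters, or F1 averaging.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB


def train_nb_sketch(features, labels):
    """Hypothetical stand-in: fit a multinomial Naive Bayes classifier."""
    model = MultinomialNB()
    model.fit(features, labels)
    return model


def evaluate_sketch(model, features, labels):
    """Hypothetical stand-in: predict and return the binary F1 score."""
    predictions = model.predict(features)
    return f1_score(labels, predictions)


# Toy data: label 1 = positive review, label 0 = negative review.
toy_train = ["great film", "wonderful acting", "terrible film", "awful acting"]
toy_train_labels = [1, 1, 0, 0]
toy_test = ["great acting", "awful film"]
toy_test_labels = [1, 0]

toy_vectorizer = CountVectorizer()
toy_train_features = toy_vectorizer.fit_transform(toy_train)
toy_test_features = toy_vectorizer.transform(toy_test)

toy_model = train_nb_sketch(toy_train_features, toy_train_labels)
toy_f1 = evaluate_sketch(toy_model, toy_test_features, toy_test_labels)
print(f"F1 on toy data: {toy_f1}")
```

The labels passed to training and evaluation must come from the same de-duplicated DataFrames as the texts, which is why the cells below use `rawtrainData['label']` and `rawtestData['label']`.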

**Naive Bayes**

In [21]:
# train a Naive Bayes classifier using the BOW representation of the training data
nb_model = train_nb_classifier(trainBow, rawtrainData['label'])
f1_nb_bow = evaluate_model(nb_model, testBow, rawtestData['label'])

print(f"F1 Score for Naive Bayes(BOW): {f1_nb_bow}")

F1 Score for Naive Bayes(BOW): 0.7993638786211639


In [22]:
# train a Naive Bayes classifier using the TF-IDF representation of the training data
nb_model = train_nb_classifier(trainTfidf, rawtrainData['label'])
f1_nb_tfidf = evaluate_model(nb_model, testTfidf, rawtestData['label'])

print(f"F1 Score for Naive Bayes(TFIDF): {f1_nb_tfidf}")

F1 Score for Naive Bayes(TFIDF): 0.8207615400978514


**Support Vector Machine**

In [23]:
# train an SVM classifier using the BOW representation of the training data
svm_model = train_svm_classifier(trainBow, rawtrainData['label'])
f1_svm_bow = evaluate_model(svm_model, testBow, rawtestData['label'])

print(f"F1 Score for SVM(BOW): {f1_svm_bow}")

F1 Score for SVM(BOW): 0.8355399845194932


In [24]:
# train an SVM classifier using the TF-IDF representation of the training data
svm_model = train_svm_classifier(trainTfidf, rawtrainData['label'])
f1_svm_tfidf = evaluate_model(svm_model, testTfidf, rawtestData['label'])

print(f"F1 Score for SVM(TFIDF): {f1_svm_tfidf}")

F1 Score for SVM(TFIDF): 0.8784539340410015


**Logistic Regression**

In [25]:
# train a Logistic Regression classifier using the BOW representation of the training data
lr_model = train_lr_classifier(trainBow, rawtrainData['label'])
f1_lr_bow = evaluate_model(lr_model, testBow, rawtestData['label'])

print(f"F1 Score for Logistic Regression(BOW): {f1_lr_bow}")

F1 Score for Logistic Regression(BOW): 0.8608385370205174


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
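
The message above is scikit-learn's convergence warning: the solver stopped at its iteration limit (`max_iter`, 100 by default for `LogisticRegression`) before converging. One of the remedies the warning itself suggests is a higher limit. The sketch below shows that option on toy data; `train_lr_sketch` is a hypothetical name, and whether `train_lr_classifier` exposes such a parameter is not shown in this notebook.

```python
from sklearn.linear_model import LogisticRegression


def train_lr_sketch(features, labels, max_iter=1000):
    """Hypothetical stand-in: logistic regression with a raised iteration limit."""
    model = LogisticRegression(max_iter=max_iter)
    model.fit(features, labels)
    return model


# Toy count-like features: column 0 signals class 1, column 1 signals class 0.
toy_features = [[2, 0], [1, 0], [0, 2], [0, 1]]
toy_labels = [1, 1, 0, 0]

lr_sketch = train_lr_sketch(toy_features, toy_labels)

# n_iter_ reports how many iterations the solver actually needed.
iterations_used = int(lr_sketch.n_iter_[0])
print(iterations_used, "iterations used, limit", lr_sketch.max_iter)
```

If `n_iter_` stays below `max_iter`, the solver converged and the warning does not appear.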


In [26]:
# train a Logistic Regression classifier using the TF-IDF representation of the training data
lr_model = train_lr_classifier(trainTfidf, rawtrainData['label'])
f1_lr_tfidf = evaluate_model(lr_model, testTfidf, rawtestData['label'])

print(f"F1 Score for Logistic Regression(TFIDF): {f1_lr_tfidf}")

F1 Score for Logistic Regression(TFIDF): 0.8806534685337196


# **Results**

In [27]:
# import the function 'generate_ml_results_df' from the 'model_results' module
from model_results import generate_ml_results_df

In [28]:
# generate a DataFrame 'resultsDF' using the F1 scores from different models
resultsDF = generate_ml_results_df(f1_nb_bow, f1_nb_tfidf, f1_svm_bow, f1_svm_tfidf, f1_lr_bow, f1_lr_tfidf)
print(resultsDF)

                 Model  F1 Score
0    Naive Bayes (BOW)  0.799364
1  Naive Bayes (TFIDF)  0.820762
2            SVM (BOW)  0.835540
3          SVM (TFIDF)  0.878454
4             LR (BOW)  0.860839
5           LR (TFIDF)  0.880653
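
A table like the one above can be assembled and ranked with pandas. In this sketch, `generate_results_sketch` is a hypothetical stand-in for `generate_ml_results_df`, whose implementation is not shown; the scores are the rounded values printed above.

```python
import pandas as pd


def generate_results_sketch(scores):
    """Hypothetical stand-in: build a two-column results table from a dict."""
    return pd.DataFrame(
        {"Model": list(scores.keys()), "F1 Score": list(scores.values())}
    )


# Rounded F1 scores copied from the results table in this notebook.
scores = {
    "Naive Bayes (BOW)": 0.799364,
    "Naive Bayes (TFIDF)": 0.820762,
    "SVM (BOW)": 0.835540,
    "SVM (TFIDF)": 0.878454,
    "LR (BOW)": 0.860839,
    "LR (TFIDF)": 0.880653,
}

results_sketch = generate_results_sketch(scores)

# Rank the models from best to worst F1 score.
ranked = results_sketch.sort_values("F1 Score", ascending=False).reset_index(drop=True)
best_model = ranked.loc[0, "Model"]
print(ranked)
```

In this run, TF-IDF features score higher than Bag-of-Words features for all three classifiers, and Logistic Regression with TF-IDF has the highest F1 score.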


In [29]:
# save the 'resultsDF' DataFrame to a CSV file named "final_ml_models_basic_prep.csv".
resultsDF.to_csv("final_ml_models_basic_prep.csv", index=False)