<a href="https://colab.research.google.com/github/killianb22/BI_CA/blob/main/Killian_Brady_NLP_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Use the notebooks from the class and implement the model, train and test with the corresponding datasets.

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
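
To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of K Nearest Neighbors on made-up 2-D points. It is not scikit-learn's implementation; the function name `knn_predict` and the toy data are assumptions for illustration only. A query point receives the majority label among its `k` closest training points, measured by Euclidean distance.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data: label 0 clusters near the origin, label 1 near (5, 5)
toy_points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_labels, (0.5, 0.5), k=3))  # prints 0
print(knn_predict(toy_points, toy_labels, (5.5, 5.5), k=3))  # prints 1
```

In the assignment itself, the points are vectorized reviews rather than 2-D coordinates, and `KNeighborsClassifier` performs the neighbour search.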

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__
Let's read our training data. Here, we have the text and label fields. The label is 1 for positive reviews and 0 for negative reviews.

In [None]:
import pandas as pd

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
train_df.head(10)

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0
5,Several years ago the Navy kept a studied dist...,1
6,This is a masterpiece footage in B/W 35mm film...,1
7,Such a long awaited movie.. But it has disappo...,0
8,When two writers make a screenplay of a horror...,1
9,"Make no mistake, Maureen O'Sullivan is easily ...",1


#### __Test data:__

In [None]:
import pandas as pd

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
test_df.head()

Unnamed: 0,text,label
0,What I hoped for (or even expected) was the we...,0
1,Garden State must rate amongst the most contri...,0
2,There is a lot wrong with this film. I will no...,1
3,"To qualify my use of ""realistic"" in the summar...",1
4,Dirty War is absolutely one of the best politi...,1


## 2. Train a KNN Classifier
Here, you will apply the pre-processing operations we covered in class. Then, you can split your dataset into training and validation sets. For your first submission, you will use the __K Nearest Neighbors Classifier__. It is available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
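
A minimal sketch of the training/validation split, assuming scikit-learn's `train_test_split` and a made-up list of eight toy reviews (the names `toy_texts` and `toy_labels` are hypothetical). Passing `stratify` keeps the positive/negative ratio the same in both parts.

```python
from sklearn.model_selection import train_test_split

# Made-up reviews and labels (hypothetical data for illustration only)
toy_texts = ["great film", "loved it", "wonderful cast", "superb plot",
             "awful film", "hated it", "terrible cast", "boring plot"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the reviews for validation, preserving the class balance
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=42
)

print(len(X_tr), len(X_val))  # prints 6 2
print(sorted(y_val))          # prints [0, 1]
```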

In [None]:
train_df["label"].value_counts()

0    12500
1    12500
Name: label, dtype: int64

In [None]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25000 entries, 0 to 24999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    25000 non-null  object
 1   label   25000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 390.8+ KB


In [None]:
print(train_df.isna().sum())

text     0
label    0
dtype: int64


In [None]:
# Download the NLTK tokenizer models and the stop-word list
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from nltk.tokenize import word_tokenize


stop = stopwords.words('english')
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]


stop_words = [word for word in stop if word not in excluding]
snow = SnowballStemmer('english')

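def process_text(texts):
    """Lowercase, strip HTML tags, drop numeric tokens, short tokens and stop words, then stem."""
    final_text_list = []
    for sent in texts:
        # Treat missing or non-string values as empty text
        if not isinstance(sent, str):
            sent = ""

        filtered_sentence = []

        sent = sent.lower()
        sent = sent.strip()
        sent = re.sub(r'\s+', ' ', sent)  # collapse runs of whitespace
        sent = re.compile('<.*?>').sub('', sent)  # remove HTML tags such as <br />

        for w in word_tokenize(sent):
            # Keep tokens that are not numeric, are longer than 2 characters,
            # and are not stop words (negations were kept out of stop_words above)
            if (not w.isnumeric()) and (len(w) > 2) and (w not in stop_words):
                filtered_sentence.append(snow.stem(w))
        final_string = " ".join(filtered_sentence)
        final_text_list.append(final_string)

    return final_text_list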

In [None]:
print("Processing the text field")
train_text_list = process_text(train_df["text"].tolist())
test_text_list = process_text(test_df["text"].tolist())

y_train= train_df["label"].tolist()
y_test= test_df["label"].tolist()

Processing the text field


In [None]:
from sklearn import set_config
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
import gensim
from gensim.models import Word2Vec

# Binary bag-of-words features (limited to 200 terms) followed by KNN with default settings
pipeline = Pipeline([
    ('text_vect', CountVectorizer(binary=True, max_features=200)),
    ('knn', KNeighborsClassifier()),
])

# Render the pipeline as a diagram in the notebook
set_config(display='diagram')
pipeline

In [None]:
# We are using the lists of processed text fields
X_train = train_text_list
X_test = test_text_list

# Fit the Pipeline to training data
pipeline.fit(X_train, y_train)

## 3. Make predictions on your test dataset

Once we select our best performing model, we can use it to make predictions on the test dataset. You can simply call the __.fit()__ function with your training data and the best performing K value, then call __.predict()__ with your test data to get your test predictions.
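
The fit-then-predict flow can be sketched as follows. This is a minimal example on made-up, already-cleaned reviews; the names `toy_pipeline`, `best_k` and the toy data are hypothetical, and `best_k = 3` is simply assumed rather than tuned.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up, already-cleaned reviews (hypothetical data for illustration only)
toy_train_texts = ["great film", "loved great plot", "great cast",
                   "awful film", "hated awful plot", "awful cast"]
toy_train_labels = [1, 1, 1, 0, 0, 0]
toy_test_texts = ["great acting", "awful acting"]
toy_test_labels = [1, 0]

best_k = 3  # assumed to have been chosen on a validation split beforehand

toy_pipeline = Pipeline([
    ("text_vect", CountVectorizer(binary=True)),
    ("knn", KNeighborsClassifier()),
])

# Set the chosen K on the 'knn' step, fit on the training data, then predict
toy_pipeline.set_params(knn__n_neighbors=best_k)
toy_pipeline.fit(toy_train_texts, toy_train_labels)
toy_predictions = toy_pipeline.predict(toy_test_texts)

print(toy_predictions)
print("Accuracy (test):", accuracy_score(toy_test_labels, toy_predictions))
```

The `knn__n_neighbors` name follows scikit-learn's `<step name>__<parameter>` convention for reaching a parameter inside a pipeline step.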

In [None]:
# Implement this
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

# Use the fitted pipeline to make predictions on the test dataset
# (the variable name and printed label say "validation", but X_test holds the test reviews)
val_predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, val_predictions))
print(classification_report(y_test, val_predictions))
print("Accuracy (validation):", accuracy_score(y_test, val_predictions))

[[7556 4944]
 [3451 9049]]
              precision    recall  f1-score   support

           0       0.69      0.60      0.64     12500
           1       0.65      0.72      0.68     12500

    accuracy                           0.66     25000
   macro avg       0.67      0.66      0.66     25000
weighted avg       0.67      0.66      0.66     25000

Accuracy (validation): 0.6642


## Describing the 'max_features' decision

My initial 'max_features' value was 50, which returned an accuracy of 62.72%. Lowering this value decreased the accuracy of the model, so I then tried values from 50 up to 5000 to find the most accurate setting. The final value I chose was 200, which returned an accuracy of 66.42%. Increasing the value beyond 200 reduced the accuracy, so I took 200 as the best of the values tried.
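
To make concrete what this setting controls: `max_features=N` keeps only the N most frequent terms in the vocabulary, so each review becomes a vector of length at most N. A small sketch on made-up documents (the variable names are hypothetical):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents: "good" and "film" each appear in three documents,
# "plot" and "bad" in one each
toy_docs = ["good film", "good film", "good plot", "bad film"]

# No limit: every term enters the vocabulary
full_vectorizer = CountVectorizer(binary=True)
full_vectorizer.fit(toy_docs)

# Limit of 2: only the two most frequent terms are kept
limited_vectorizer = CountVectorizer(binary=True, max_features=2)
limited_matrix = limited_vectorizer.fit_transform(toy_docs)

print(sorted(full_vectorizer.vocabulary_))     # prints ['bad', 'film', 'good', 'plot']
print(sorted(limited_vectorizer.vocabulary_))  # prints ['film', 'good']
print(limited_matrix.shape)                    # prints (4, 2)
```

A very small vocabulary drops informative words, while a very large one adds many sparse dimensions in which nearest-neighbour distances can become less informative. That is one plausible reason for accuracy peaking at an intermediate value.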

## 4. Implementing KNN

In [None]:
import nltk, re
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import word_tokenize

stop = stopwords.words('english')
excluding = ['against', 'not', 'don', "don't",'ain', 'aren', "aren't", 'couldn', "couldn't",
             'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't",
             'haven', "haven't", 'isn', "isn't", 'mightn', "mightn't", 'mustn', "mustn't",
             'needn', "needn't",'shouldn', "shouldn't", 'wasn', "wasn't", 'weren',
             "weren't", 'won', "won't", 'wouldn', "wouldn't"]


stop_words = [word for word in stop if word not in excluding]
snow = SnowballStemmer('english')

def preprocess_text(text):
    tokens = nltk.word_tokenize(text.lower())
    tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
    tokens = [snow.stem(word) for word in tokens]
    return ' '.join(tokens)

train_df['cleaned_text'] = train_df['text'].apply(preprocess_text)

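# Vectorization (using TF-IDF)
# Note: the vectorizer is fit on all training reviews before any split,
# so held-out reviews also contribute to the vocabulary and IDF weights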
tfidf_vectorizer = TfidfVectorizer(max_features=200)
X = tfidf_vectorizer.fit_transform(train_df['cleaned_text'])
y = train_df['label']

In [None]:
from sklearn.model_selection import train_test_split

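# Hold out 20% of the training data for validation.
# Note: this overwrites the X_train, X_test, y_train and y_test defined earlier,
# so X_test now refers to this validation split rather than to test_df.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)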

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

knn = KNeighborsClassifier(n_neighbors=5)


knn.fit(X_train, y_train)

y_pred = knn.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")

Accuracy: 0.6954
Precision: 0.7117743254292723
Recall: 0.6803438843298163
F1 Score: 0.6957042957042956


In [None]:
k_values = [2, 4, 6, 8, 10]

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"K={k} - Accuracy: {accuracy}")

K=2 - Accuracy: 0.6258
K=4 - Accuracy: 0.6646
K=6 - Accuracy: 0.6828
K=8 - Accuracy: 0.6914
K=10 - Accuracy: 0.6974


Of the values tried, k = 10 is the most accurate, at 69.74%. Accuracy rose with every increase in k, so larger values may perform better still.
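
A wider sweep can be sketched as follows, scoring each K with 5-fold cross-validation rather than a single split. The sketch runs on synthetic data from `make_classification` as a stand-in for the TF-IDF features (an assumption, so its numbers say nothing about the IMDB results), and it uses odd K values so that a vote between the two classes cannot end in a tie.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the TF-IDF features (assumption: not the IMDB data)
X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=42)

cv_accuracy = {}
for k in [5, 11, 21, 31]:
    knn_demo = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds
    scores = cross_val_score(knn_demo, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_accuracy[k] = scores.mean()
    print(f"K={k} - mean CV accuracy: {cv_accuracy[k]:.4f}")

# Pick the K with the highest mean cross-validated accuracy
best_k_demo = max(cv_accuracy, key=cv_accuracy.get)
print("Best K on the synthetic data:", best_k_demo)
```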

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Initialize and train the Naive Bayes model
nb_model = MultinomialNB()
nb_model.fit(X_train, y_train)

# Initialize and train the Logistic Regression model
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)

models = [nb_model, lr_model]
model_names = ['Naive Bayes', 'Logistic Regression']

# Report precision, recall and F1 for each model on the held-out split
for name, model in zip(model_names, models):
    y_pred = model.predict(X_test)
    print(f"Model: {name}")
    print(classification_report(y_test, y_pred))

Model: Naive Bayes
              precision    recall  f1-score   support

           0       0.77      0.76      0.77      2441
           1       0.78      0.79      0.78      2559

    accuracy                           0.78      5000
   macro avg       0.78      0.78      0.78      5000
weighted avg       0.78      0.78      0.78      5000

Model: Logistic Regression
              precision    recall  f1-score   support

           0       0.79      0.76      0.78      2441
           1       0.78      0.81      0.80      2559

    accuracy                           0.79      5000
   macro avg       0.79      0.79      0.79      5000
weighted avg       0.79      0.79      0.79      5000



Logistic Regression is slightly more accurate than Naive Bayes (0.79 vs. 0.78), and both models are considerably more accurate than KNN, whose best accuracy on the same split was 0.6974 at k = 10.
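
One way to put the three models on an equal footing is to wrap the vectorizer and each classifier in a pipeline and score them on the same cross-validation folds. The sketch below assumes made-up toy reviews (the names `demo_texts`, `demo_labels` and `candidates` are hypothetical), so its scores are not comparable with the results above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Made-up reviews (hypothetical data, far too small for meaningful scores)
demo_texts = ["great film", "loved great plot", "great cast", "superb acting",
              "awful film", "hated awful plot", "awful cast", "boring acting"]
demo_labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "KNN": KNeighborsClassifier(n_neighbors=3),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(),
}

mean_accuracy = {}
for name, classifier in candidates.items():
    # The vectorizer sits inside the pipeline, so it is refit on each training fold
    model = make_pipeline(TfidfVectorizer(), classifier)
    scores = cross_val_score(model, demo_texts, demo_labels, cv=2, scoring="accuracy")
    mean_accuracy[name] = scores.mean()
    print(f"{name}: {mean_accuracy[name]:.2f}")
```

Keeping the vectorizer inside the pipeline means the vocabulary and IDF weights are learned only from each fold's training portion.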

# BERT Transformers