
# Natural Language Processing

## Assignment: K Nearest Neighbors Model for the IMDB Movie Review Dataset

For the final project, build a K Nearest Neighbors model to predict the sentiment (positive or negative) of movie reviews. The dataset is originally hosted here: http://ai.stanford.edu/~amaas/data/sentiment/

Use the class notebooks as a reference to implement the model, then train and test it with the corresponding datasets.

You can follow these steps:
1. Read training-test data (Given)
2. Train a KNN classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Experimentation (Implement)

__You can use the KNN Classifier from here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html__
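
As a rough illustration of what a K Nearest Neighbors classifier does, the sketch below predicts a label by majority vote among the k closest training points. It is a from-scratch toy, not scikit-learn's implementation; the function name `knn_predict` and the 2-D toy data are assumptions made for this example. It uses Euclidean distance, which is also the default metric of `KNeighborsClassifier`.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the neighbours' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: label 0 clusters near the origin, label 1 near (5, 5)
toy_points = [(0.0, 0.0), (0.5, 1.0), (1.0, 0.5), (5.0, 5.0), (5.5, 4.5), (4.5, 5.5)]
toy_labels = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_points, toy_labels, (0.8, 0.8)))  # query near the first cluster
print(knn_predict(toy_points, toy_labels, (5.2, 5.1)))  # query near the second cluster
```

For movie reviews, the points would be the vectorized texts rather than 2-D coordinates, but the voting step is the same idea.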

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset.

#### __Training data:__
Let's read our training data. Here, we have the text and label fields. The label is 1 for positive reviews and 0 for negative reviews.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report

# Reading training and test data
train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv')
test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv')

# Displaying the first few rows of the dataset
print(train_df.head())
print(test_df.head())


                                                text  label
0  This movie makes me want to throw up every tim...      0
1  Listening to the director's commentary confirm...      0
2  One of the best Tarzan films is also one of it...      1
3  Valentine is now one of my favorite slasher fi...      1
4  No mention if Ann Rivers Siddons adapted the m...      0
                                                text  label
0  What I hoped for (or even expected) was the we...      0
1  Garden State must rate amongst the most contri...      0
2  There is a lot wrong with this film. I will no...      1
3  To qualify my use of "realistic" in the summar...      1
4  Dirty War is absolutely one of the best politi...      1


#### __Test data:__

In [None]:
# Separating features and labels
X_train = train_df['text']
y_train = train_df['label']
X_test = test_df['text']
y_test = test_df['label']

# TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=5000)  # Limit the number of features for efficiency

# Fit and transform the training data
X_train_tfidf = vectorizer.fit_transform(X_train)

# Transform the test data
X_test_tfidf = vectorizer.transform(X_test)

# Splitting the training data into training and validation sets
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train_tfidf, y_train, test_size=0.2, random_state=42)



## 2. Train a KNN Classifier
Here, apply the pre-processing operations covered in class, then split your dataset into training and validation sets. For your first submission, use the __K Nearest Neighbors Classifier__, available [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
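
One possible arrangement of these steps, sketched on hypothetical toy reviews rather than the IMDB data: split the raw text first, then wrap `TfidfVectorizer` and `KNeighborsClassifier` in a scikit-learn pipeline so that the vectorizer is fitted on the training split only. The toy texts, labels, and variable names below are assumptions for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy reviews standing in for the text and label columns
toy_texts = [
    "a wonderful, moving film", "great acting and a great story",
    "I loved every minute", "brilliant and funny",
    "a dull, boring mess", "terrible acting and a weak plot",
    "I hated every minute", "awful and tedious",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Split the raw text first, so the vectorizer never sees validation reviews
texts_fit, texts_val, labels_fit, labels_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# The pipeline fits TF-IDF on the training split only, then trains KNN on the result
model = make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=3))
model.fit(texts_fit, labels_fit)

val_predictions = model.predict(texts_val)
print(val_predictions)
```

With only eight reviews the predictions themselves are not meaningful; the point is the order of operations.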

In [None]:
# Implement this

# Initializing the KNN Classifier with k=5 (default)
knn = KNeighborsClassifier(n_neighbors=5)

# Training the classifier
knn.fit(X_train_split, y_train_split)

# Making predictions on the validation set
y_val_pred = knn.predict(X_val_split)

# Evaluating performance on validation set
val_accuracy = accuracy_score(y_val_split, y_val_pred)
print(f"Validation Accuracy: {val_accuracy * 100:.2f}%")


Validation Accuracy: 72.90%


## 3. Make predictions on your test dataset

Once you have selected the best-performing model, use it to make predictions on the test dataset: call __.fit()__ on your training data with the best-performing K value, then call __.predict()__ on your test data to get the test predictions.
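
The select-then-refit procedure can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not on the IMDB reviews; the candidate K values and all variable names are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the vectorized reviews (hypothetical data, not IMDB)
X_all, y_all = make_classification(n_samples=300, n_features=10, random_state=0)
X_trainval, X_holdout, y_trainval, y_holdout = train_test_split(
    X_all, y_all, test_size=0.2, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=0
)

# Score each candidate K on the validation split only
candidate_ks = [1, 3, 5, 7, 9]
val_scores = {}
for k in candidate_ks:
    candidate = KNeighborsClassifier(n_neighbors=k).fit(X_fit, y_fit)
    val_scores[k] = accuracy_score(y_val, candidate.predict(X_val))

best_k = max(val_scores, key=val_scores.get)

# Refit on all training data with the chosen K, then predict the held-out set once
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_trainval, y_trainval)
holdout_predictions = final_knn.predict(X_holdout)
print(f"best K = {best_k}, "
      f"held-out accuracy = {accuracy_score(y_holdout, holdout_predictions):.3f}")
```

Keeping the held-out set out of the K selection means its accuracy is measured on data that played no part in choosing the model.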

In [None]:
# Implement this

# Making predictions on the test set
y_test_pred = knn.predict(X_test_tfidf)

# Evaluating performance on test set
test_accuracy = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {test_accuracy * 100:.2f}%")

# Detailed classification report
print(classification_report(y_test, y_test_pred))


Test Accuracy: 66.25%
              precision    recall  f1-score   support

           0       0.64      0.76      0.69     12500
           1       0.70      0.57      0.63     12500

    accuracy                           0.66     25000
   macro avg       0.67      0.66      0.66     25000
weighted avg       0.67      0.66      0.66     25000



## 4. Experimentation

For each of the following tasks, track both the **weighted F1-score** and **accuracy**:

1. **Change the binary parameter in CountVectorizer**: Test both `binary=True` and `binary=False`, and evaluate performance.
2. **Switch to TfidfVectorizer**: Replace the CountVectorizer with TfidfVectorizer and compare results.
3. **Adjust the max_features**: Experiment with different values of `max_features` for both TfidfVectorizer and CountVectorizer (`binary=True`).
4. **Optimize KNN**: Select the best-performing model from task 3 and vary the number of neighbors (`n_neighbors`) in the KNN classifier.
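
To make the two tracked metrics concrete, here is a small from-scratch sketch of accuracy and the weighted F1-score: per-class F1 values averaged with each class weighted by its number of true samples, which is how `f1_score(..., average='weighted')` is defined. The helper names and toy labels are assumptions for illustration.

```python
from collections import Counter

def f1_for_class(y_true, y_pred, cls):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no correct predictions for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def weighted_f1(y_true, y_pred):
    """Average the per-class F1 scores, weighting each class by its support."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(f1_for_class(y_true, y_pred, cls) * n / total for cls, n in support.items())

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical imbalanced toy labels: two positives, six negatives
toy_true = [1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 0, 0, 0, 0, 0, 1, 1]

print(f"accuracy    = {accuracy(toy_true, toy_pred):.3f}")
print(f"weighted F1 = {weighted_f1(toy_true, toy_pred):.3f}")
```

On balanced data such as the 12,500/12,500 test split the two numbers tend to sit close together; they can diverge more when the classes are imbalanced, as in this toy.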


In [None]:
# Task 1

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score

# Initialize CountVectorizer with binary=False
vectorizer_count_false = CountVectorizer(binary=False, max_features=5000)
X_train_count_false = vectorizer_count_false.fit_transform(X_train)
X_test_count_false = vectorizer_count_false.transform(X_test)

# Initialize CountVectorizer with binary=True
vectorizer_count_true = CountVectorizer(binary=True, max_features=5000)
X_train_count_true = vectorizer_count_true.fit_transform(X_train)
X_test_count_true = vectorizer_count_true.transform(X_test)

# Function to train and evaluate KNN
def evaluate_knn(X_train, y_train, X_test, y_test, n_neighbors=5):
    knn = KNeighborsClassifier(n_neighbors=n_neighbors)
    knn.fit(X_train, y_train)
    y_pred = knn.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    return accuracy, f1

# Evaluate KNN with binary=False
accuracy_count_false, f1_count_false = evaluate_knn(X_train_count_false, y_train, X_test_count_false, y_test)
print(f"CountVectorizer (binary=False): Accuracy = {accuracy_count_false*100:.2f}%, F1-score = {f1_count_false:.2f}")

# Evaluate KNN with binary=True
accuracy_count_true, f1_count_true = evaluate_knn(X_train_count_true, y_train, X_test_count_true, y_test)
print(f"CountVectorizer (binary=True): Accuracy = {accuracy_count_true*100:.2f}%, F1-score = {f1_count_true:.2f}")


# Implement this

CountVectorizer (binary=False): Accuracy = 62.11%, F1-score = 0.62
CountVectorizer (binary=True): Accuracy = 62.74%, F1-score = 0.63
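
To see what the `binary` flag changes, the sketch below vectorizes two hypothetical toy documents both ways. The documents and variable names are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents; "good" appears twice in the first one
toy_docs = ["good good movie", "bad movie"]

# binary=False keeps raw term counts
count_vectorizer = CountVectorizer(binary=False)
count_matrix = count_vectorizer.fit_transform(toy_docs).toarray()

# binary=True records only presence (1) or absence (0) of each term
binary_vectorizer = CountVectorizer(binary=True)
binary_matrix = binary_vectorizer.fit_transform(toy_docs).toarray()

print(sorted(count_vectorizer.vocabulary_))  # column order of both matrices
print(count_matrix)
print(binary_matrix)
```

The columns are in alphabetical order (`bad`, `good`, `movie`), so the first document is `[0, 2, 1]` with counts and `[0, 1, 1]` with `binary=True`: the repeated word is capped at 1.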


In [None]:
# Task 2

# Initialize TfidfVectorizer
vectorizer_tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer_tfidf.fit_transform(X_train)
X_test_tfidf = vectorizer_tfidf.transform(X_test)

# Evaluate KNN with TfidfVectorizer
accuracy_tfidf, f1_tfidf = evaluate_knn(X_train_tfidf, y_train, X_test_tfidf, y_test)
print(f"TfidfVectorizer: Accuracy = {accuracy_tfidf*100:.2f}%, F1-score = {f1_tfidf:.2f}")

# Implement this

TfidfVectorizer: Accuracy = 66.48%, F1-score = 0.66


In [None]:
# Task 3

# Function to evaluate KNN with varying max_features
def experiment_with_max_features(vectorizer_cls, max_features_list, **vectorizer_kwargs):
    # Take the vectorizer class rather than an instance: TfidfVectorizer subclasses
    # CountVectorizer, so an isinstance(..., CountVectorizer) check is True for both
    # and would build a CountVectorizer (with binary=False) in every run.
    for max_features in max_features_list:
        vectorizer = vectorizer_cls(max_features=max_features, **vectorizer_kwargs)

        # Transform the data
        X_train_vect = vectorizer.fit_transform(X_train)
        X_test_vect = vectorizer.transform(X_test)

        # Evaluate KNN
        accuracy, f1 = evaluate_knn(X_train_vect, y_train, X_test_vect, y_test)
        print(f"max_features={max_features}: Accuracy = {accuracy*100:.2f}%, F1-score = {f1:.2f}")

# List of max_features to test
max_features_list = [1000, 2000, 5000, 10000]

print("\nCountVectorizer (binary=True) with different max_features:")
experiment_with_max_features(CountVectorizer, max_features_list, binary=True)

print("\nTfidfVectorizer with different max_features:")
experiment_with_max_features(TfidfVectorizer, max_features_list)


# Implement this


CountVectorizer (binary=True) with different max_features:
max_features=1000: Accuracy = 62.03%, F1-score = 0.62
max_features=2000: Accuracy = 62.20%, F1-score = 0.62
max_features=5000: Accuracy = 62.11%, F1-score = 0.62
max_features=10000: Accuracy = 61.91%, F1-score = 0.62

TfidfVectorizer with different max_features:
max_features=1000: Accuracy = 62.03%, F1-score = 0.62
max_features=2000: Accuracy = 62.20%, F1-score = 0.62
max_features=5000: Accuracy = 62.11%, F1-score = 0.62
max_features=10000: Accuracy = 61.91%, F1-score = 0.62
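
The two result blocks above are identical line for line, and the 5000-feature row (62.11%) matches the `binary=False` result from Task 1. One explanation consistent with this: in scikit-learn, `TfidfVectorizer` is a subclass of `CountVectorizer`, so a check such as `isinstance(vectorizer, CountVectorizer)` is true for both, and a helper that branches on it would build a `CountVectorizer` each time, with `binary` left at whatever default the helper supplies. A quick check of the class relationship:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# TfidfVectorizer inherits from CountVectorizer, so this check is True for both
tfidf_passes_isinstance = isinstance(TfidfVectorizer(), CountVectorizer)
print(tfidf_passes_isinstance)

# Comparing exact types (or passing the class itself) tells the two apart
tfidf_is_exactly_count = type(TfidfVectorizer()) is CountVectorizer
print(tfidf_is_exactly_count)
```

Results recorded with such a branch would need to be regenerated before comparing the two vectorizers.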


In [None]:
# Task 4

# Function to experiment with different n_neighbors
def experiment_with_knn_neighbors(X_train, X_test, y_train, y_test, neighbors_list):
    for n_neighbors in neighbors_list:
        accuracy, f1 = evaluate_knn(X_train, y_train, X_test, y_test, n_neighbors=n_neighbors)
        print(f"n_neighbors={n_neighbors}: Accuracy = {accuracy*100:.2f}%, F1-score = {f1:.2f}")

# List of n_neighbors to test
neighbors_list = [3, 5, 7, 9]

print("\nOptimizing KNN with TfidfVectorizer (best model):")
experiment_with_knn_neighbors(X_train_tfidf, X_test_tfidf, y_train, y_test, neighbors_list)


# Implement this


Optimizing KNN with TfidfVectorizer (best model):
n_neighbors=3: Accuracy = 64.85%, F1-score = 0.65
n_neighbors=5: Accuracy = 66.48%, F1-score = 0.66
n_neighbors=7: Accuracy = 67.34%, F1-score = 0.67
n_neighbors=9: Accuracy = 68.23%, F1-score = 0.68
