<a href="https://colab.research.google.com/github/leon3108/Applied/blob/main/Copy_of_NLP_KNN_BERT_EMBEEDINGS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <a name="0">BERT Embedding Natural Language Processing</a>

## K Nearest Neighbors Model for a Classification Problem: Classify Product Reviews as Positive or Negative

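In this notebook, we use the K Nearest Neighbors method to build a classifier that predicts whether a review is positive or negative. The walkthrough follows the __isPositive__ example for our product review dataset (which is very similar to the final project dataset), but the cells below load the IMDB final project data, where the target column is named __label__.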


1. <a href="#1">Reading the dataset</a>
2. <a href="#2">Exploratory data analysis</a>
3. <a href="#3">Text Processing: Stop words removal and stemming</a>
4. <a href="#4">Train - Validation Split</a>
5. <a href="#5">Data processing with Pipeline</a>
6. <a href="#6">Train the classifier</a>
7. <a href="#7">Test the classifier</a> Find more details on the KNN Classifier here: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
8. <a href="#8">Ideas for improvement</a>

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes). *This field is a processed version of the votes field. People can click on the "helpful" button when they find a customer review helpful. This increases the vote by 1. __log_votes__ is calculated like this log(1+votes). This formulation helps us get a smaller range for votes.*
* __isPositive:__ Whether the review is positive or negative (1 or 0)
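
As a quick illustration of the __log_votes__ transformation described above, the following sketch applies log(1+votes) to a few made-up vote counts. The `votes` values are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical raw "helpful" vote counts (not from the dataset)
votes = np.array([0, 1, 9, 99, 999])

# log(1 + votes); np.log1p stays accurate for small inputs
log_votes = np.log1p(votes)

for v, lv in zip(votes, log_votes):
    print(f"votes={v:4d}  log_votes={lv:.3f}")
```

The transformed values grow slowly as the raw counts grow, which is the "smaller range" the schema description refers to.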


## 1. <a name="1">Reading the dataset</a>
(<a href="#0">Go to top</a>)

We will use the __pandas__ library to read our dataset.

In [2]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

print('The shape of the dataset is:', df.shape)

The shape of the dataset is: (70000, 6)


In [3]:
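# IMDB dataset: this replaces the review DataFrame loaded above.
# The rest of the notebook uses its `text` and `label` columns.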
df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)

train_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_train.csv', header=0)
# train_df.head()

test_df = pd.read_csv('https://raw.githubusercontent.com/aws-samples/aws-machine-learning-university-accelerated-nlp/master/data/final_project/imdb_test.csv', header=0)
# test_df.head()


Let's look at the first 10 rows of the dataset.

In [4]:
df.head(10)

Unnamed: 0,text,label
0,This movie makes me want to throw up every tim...,0
1,Listening to the director's commentary confirm...,0
2,One of the best Tarzan films is also one of it...,1
3,Valentine is now one of my favorite slasher fi...,1
4,No mention if Ann Rivers Siddons adapted the m...,0
5,Several years ago the Navy kept a studied dist...,1
6,This is a masterpiece footage in B/W 35mm film...,1
7,Such a long awaited movie.. But it has disappo...,0
8,When two writers make a screenplay of a horror...,1
9,"Make no mistake, Maureen O'Sullivan is easily ...",1


## 2. <a name="2">Exploratory data analysis</a>
(<a href="#0">Go to top</a>)

Let's look at the distribution of the __label__ field (1 = positive, 0 = negative).

In [5]:
df["label"].value_counts()

0    12500
1    12500
Name: label, dtype: int64

We can check the number of missing values for each column below.

In [6]:
print(df.isna().sum())

text     0
label    0
dtype: int64


Neither column has missing values in this dataset, so the `dropna()` call below leaves it unchanged.

## 3. <a name="3">Text Processing: Stop words removal and stemming</a>
(<a href="#0">Go to top</a>)

In [7]:
df=df.dropna()
print(df.isna().sum())

text     0
label    0
dtype: int64


In [8]:
!pip install transformers torch scikit-learn


Collecting transformers
  Downloading transformers-4.34.1-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Co

## 4. <a name="4">Train - Validation Split</a>
(<a href="#0">Go to top</a>)

Let's split our dataset into training (90%) and validation (10%).

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Split the dataset into train and test sets
train_data, test_data, train_labels, test_labels = train_test_split(df["text"], df['label'], test_size=0.1, random_state=42)
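
To make the split sizes concrete, here is a small sketch on synthetic stand-in data with the same shape as the IMDB training set (25,000 rows, balanced labels). The `texts` and `labels` names are placeholders. It also shows the optional `stratify` argument, which the cell above does not use, as one way to keep the class balance identical in both splits.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 25,000 rows with balanced 0/1 labels
labels = np.array([0, 1] * 12500)
texts = np.array([f"review {i}" for i in range(len(labels))])

# Same settings as the notebook cell: 10% held out, fixed seed
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.1, random_state=42
)
print(len(X_tr), len(X_val))

# Optional variant (an assumption, not used above): stratify on the labels
X_tr_s, X_val_s, y_tr_s, y_val_s = train_test_split(
    texts, labels, test_size=0.1, random_state=42, stratify=labels
)
print(int(y_val_s.sum()), "positives out of", len(y_val_s))
```

Without `stratify`, the held-out set can end up slightly unbalanced, which is consistent with the 1207 / 1293 class support seen in the reports later in this notebook.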

## 5. <a name="5">Use BERT for text embeddings</a>
(<a href="#0">Go to top</a>)

You can use the Hugging Face Transformers library to load a pre-trained BERT-style model and tokenize your text data. The cells below load the `roberta-base` checkpoint.

In [10]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [13]:
from transformers import RobertaTokenizer, RobertaModel
import torch
from tqdm import tqdm


# Use the GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the pre-trained RoBERTa tokenizer and model (a BERT-style encoder)
# and move the model to the selected device
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
bert_model = RobertaModel.from_pretrained("roberta-base").to(device)

# Tokenize the text (truncated/padded to at most max_length tokens) and move the tensors to the device
max_length = 128
X_train_tokens = tokenizer(list(train_data), truncation=True, padding=True, max_length=max_length, return_tensors="pt", add_special_tokens=True).to(device)
X_test_tokens = tokenizer(list(test_data), truncation=True, padding=True, max_length=max_length, return_tensors="pt", add_special_tokens=True).to(device)

# Compute one embedding per review by averaging the last hidden states over all token positions
def get_bert_embeddings(tokens):
    embeddings = []
    for i in tqdm(range(len(tokens['input_ids']))):
        # Run one review at a time (batch of size 1) without tracking gradients
        with torch.no_grad():
            output = bert_model(input_ids=tokens['input_ids'][i].unsqueeze(0), attention_mask=tokens['attention_mask'][i].unsqueeze(0))
        # output[0] is the last hidden state, shape (1, seq_len, hidden_dim); average over seq_len
        embeddings.append(output[0].squeeze().mean(dim=0).cpu().numpy())
    return embeddings

X_train_bert_embeddings = get_bert_embeddings(X_train_tokens)
X_test_bert_embeddings = get_bert_embeddings(X_test_tokens)

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/499M [00:00<?, ?B/s]

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-base and are newly initialized: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 22500/22500 [04:42<00:00, 79.75it/s]
100%|██████████| 2500/2500 [00:32<00:00, 77.53it/s]
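
One detail of the embedding cell above: because the tokenizer pads every review to a common length, `output[0].squeeze().mean(dim=0)` averages over padding positions as well as real tokens. A common alternative is to weight the average by the attention mask. The sketch below illustrates the arithmetic with NumPy on a tiny made-up array; `masked_mean_pool` is a hypothetical helper name, and the same computation carries over to torch tensors. Whether it improves the scores reported in this notebook has not been tested.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors, ignoring positions where attention_mask == 0.

    hidden_states: array of shape (batch, seq_len, hidden_dim)
    attention_mask: array of shape (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(hidden_states.dtype)
    summed = (hidden_states * mask).sum(axis=1)
    # Guard against division by zero for an all-padding row
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

# Toy batch: 1 sequence, 4 positions (2 real tokens + 2 padding), hidden_dim 2
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0, 0]])

plain_mean = hidden.mean(axis=1)              # padding leaks into the average
masked_mean = masked_mean_pool(hidden, mask)  # only the two real tokens count
print(plain_mean, masked_mean)
```

In the toy example the padding rows dominate the plain mean, while the masked version recovers the average of the two real token vectors.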


## 6. <a name="6">Train the classifier</a>
(<a href="#0">Go to top</a>)

Now that we have embeddings for the text data, we can train a KNN model on them using scikit-learn. We train the classifier with __.fit()__ on our training dataset.

The cell below trains a KNN model on the embeddings for sentiment analysis of the IMDB movie reviews, and fits Gaussian Naive Bayes and Logistic Regression models on the same features for comparison. You may also tune the models and preprocessing steps to improve performance.

In [14]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Initialize and train the KNN classifier
knn = KNeighborsClassifier(n_neighbors=10)
knn.fit(X_train_bert_embeddings, train_labels)

# Gaussian Naive Bayes on the same embeddings, for comparison
NB = GaussianNB()
NB.fit(X_train_bert_embeddings, train_labels)

# Logistic Regression on the same embeddings, for comparison
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train_bert_embeddings, train_labels)

# Other classifiers can be tried the same way, e.g.:
#rf=RandomForestClassifier()
#rf.fit(X_train_bert_embeddings, train_labels)
#xgb=GradientBoostingClassifier()
#xgb.fit(X_train_bert_embeddings, train_labels)
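
To make the KNN step concrete, the sketch below reimplements the core idea by hand: find the k training vectors closest to a query (Euclidean distance here) and take a majority vote over their labels. The 2-D points are made up for illustration and `knn_predict` is a hypothetical helper, not part of scikit-learn; `KNeighborsClassifier` implements the same idea with faster neighbor search.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    distances = np.linalg.norm(X_train - query, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                  # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Made-up 2-D "embeddings": class 0 near the origin, class 1 near (5, 5)
X_train = np.array([[0.0, 0.0], [0.5, 0.2], [0.1, 0.7],
                    [5.0, 5.0], [5.2, 4.8], [4.7, 5.3]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.3, 0.3]))
pred_b = knn_predict(X_train, y_train, np.array([4.9, 5.1]))
print(pred_a, pred_b)
```

With real embeddings the vectors have hundreds of dimensions rather than two, but the distance-then-vote logic is the same.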



## 7. <a name="7">Test the classifier</a>
(<a href="#0">Go to top</a>)

To evaluate the models' performance on sentiment classification, we first compare their accuracy on the held-out data, then generate a classification report and a confusion matrix for each model using scikit-learn.

In [15]:
from sklearn.metrics import classification_report, confusion_matrix

# Make predictions on the test data
predictions = knn.predict(X_test_bert_embeddings)

# Calculate accuracy
accuracy = accuracy_score(test_labels, predictions)
print(f'KNN Accuracy: {accuracy * 100:.2f}%')

# Repeat for Gaussian Naive Bayes
predictions = NB.predict(X_test_bert_embeddings)
accuracy = accuracy_score(test_labels, predictions)
print(f'Naive Bayes Accuracy: {accuracy * 100:.2f}%')

# Repeat for Logistic Regression
predictions = lr.predict(X_test_bert_embeddings)
accuracy = accuracy_score(test_labels, predictions)
print(f'Logistic Regression Accuracy: {accuracy * 100:.2f}%')

KNN Accuracy: 78.08%
Naive Bayes Accuracy: 78.00%
Logistic Regression Accuracy: 85.80%


In [16]:
from sklearn.metrics import classification_report, confusion_matrix

# Print the classification report and confusion matrix for each trained model,
# in the same order as above: KNN, Gaussian Naive Bayes, Logistic Regression
for model in (knn, NB, lr):
    # Make predictions on the test data
    predictions = model.predict(X_test_bert_embeddings)

    print("Classification Report:")
    print(classification_report(test_labels, predictions, target_names=['negative', 'positive']))

    print("\nConfusion Matrix:")
    print(confusion_matrix(test_labels, predictions))




Classification Report:
              precision    recall  f1-score   support

    negative       0.72      0.89      0.80      1207
    positive       0.87      0.68      0.76      1293

    accuracy                           0.78      2500
   macro avg       0.79      0.78      0.78      2500
weighted avg       0.80      0.78      0.78      2500


Confusion Matrix:
[[1070  137]
 [ 411  882]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.77      0.77      0.77      1207
    positive       0.79      0.79      0.79      1293

    accuracy                           0.78      2500
   macro avg       0.78      0.78      0.78      2500
weighted avg       0.78      0.78      0.78      2500


Confusion Matrix:
[[ 932  275]
 [ 275 1018]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      1207
    positive       0.86      0.86      0.86      1293

    accuracy         

****Best With BoW****
TfidfVectorizer
Max Features 100
k = 10
              precision    recall  f1-score   support

    negative       0.66      0.76      0.70      12500
    positive       0.71      0.60      0.65      12500

    accuracy                           0.68      25000
   macro avg       0.68      0.68      0.68      25000
weighted avg       0.68      0.68      0.68      25000
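
For reference, a bag-of-words baseline with the settings listed above (TfidfVectorizer, 100 max features, k = 10) could be assembled roughly as follows. This is a sketch on a handful of made-up sentences, not the code that produced the report above; the toy texts and variable names are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up toy reviews (1 = positive, 0 = negative); at least 10 are needed for k = 10
texts = [
    "great movie loved it", "wonderful acting and story", "best film this year",
    "really enjoyable and fun", "a great and moving story", "loved the wonderful cast",
    "terrible movie hated it", "boring plot and bad acting", "worst film this year",
    "really dull and awful", "a bad and boring story", "hated the awful script",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Vectorize with TF-IDF (vocabulary capped at 100 terms), then classify with KNN
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=100)),
    ("knn", KNeighborsClassifier(n_neighbors=10)),
])
baseline.fit(texts, labels)

preds = baseline.predict(["loved this great film", "awful boring movie"])
print(preds)
```

Wrapping both steps in a `Pipeline` means the vectorizer is fitted only on the training texts and then reused unchanged at prediction time.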

<br/>
<br/>


****With BERT****

Classification Report:
              precision    recall  f1-score   support

    negative       0.72      0.87      0.79      1207
    positive       0.85      0.68      0.75      1293

    accuracy                           0.77      2500
   macro avg       0.78      0.78      0.77      2500
weighted avg       0.79      0.77      0.77      2500


Confusion Matrix:
[[1056  151]
 [ 418  875]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.71      0.77      0.74      1207
    positive       0.77      0.71      0.74      1293

    accuracy                           0.74      2500
   macro avg       0.74      0.74      0.74      2500
weighted avg       0.74      0.74      0.74      2500


Confusion Matrix:
[[929 278]
 [374 919]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.82      0.82      0.82      1207
    positive       0.83      0.83      0.83      1293

    accuracy                           0.83      2500
   macro avg       0.83      0.83      0.83      2500
weighted avg       0.83      0.83      0.83      2500


Confusion Matrix:
[[ 992  215]
 [ 216 1077]]

<br/>
<br/>
<br/>

 ***With RoBERTa***

KNN Accuracy: 78.08%

Naive Bayes Accuracy: 78.00%

Logistic Regression Accuracy: 85.80%

Classification Report:
              precision    recall  f1-score   support

    negative       0.72      0.89      0.80      1207
    positive       0.87      0.68      0.76      1293

    accuracy                           0.78      2500
   macro avg       0.79      0.78      0.78      2500
weighted avg       0.80      0.78      0.78      2500


Confusion Matrix:
[[1070  137]
 [ 411  882]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.77      0.77      0.77      1207
    positive       0.79      0.79      0.79      1293

    accuracy                           0.78      2500
   macro avg       0.78      0.78      0.78      2500
weighted avg       0.78      0.78      0.78      2500


Confusion Matrix:
[[ 932  275]
 [ 275 1018]]
Classification Report:
              precision    recall  f1-score   support

    negative       0.85      0.85      0.85      1207
    positive       0.86      0.86      0.86      1293

    accuracy                           0.86      2500
   macro avg       0.86      0.86      0.86      2500
weighted avg       0.86      0.86      0.86      2500


Confusion Matrix:
[[1027  180]
 [ 175 1118]]
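
The accuracy and per-class figures above can be recomputed directly from a confusion matrix, which is a useful sanity check. The sketch below uses the KNN matrix from the RoBERTa run; rows are true labels and columns are predicted labels (scikit-learn's convention), in the order negative, positive.

```python
import numpy as np

# KNN confusion matrix from the RoBERTa run above
cm = np.array([[1070, 137],
               [411, 882]])

accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted as that class
recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / actually that class
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy: {accuracy:.4f}")
for name, p, r, f in zip(["negative", "positive"], precision, recall, f1):
    print(f"{name}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

The recomputed values agree with the KNN report: the model recovers most negative reviews but misses roughly a third of the positive ones.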

In [21]:
from transformers import pipeline

classifier = pipeline("feature-extraction")
cl = classifier("I've been waiting for a HuggingFace course  my whole life.")
print(cl)

No model was supplied, defaulted to distilbert-base-cased and revision 935ac13 (https://huggingface.co/distilbert-base-cased).
Using a pipeline without specifying a model name and revision in production is not recommended.


[[[0.34841567277908325, 0.17586295306682587, -0.02909609116613865, -0.16489030420780182, -0.26146945357322693, -0.11878133565187454, 0.4164116680622101, -0.15681076049804688, 0.09240972995758057, -1.1160190105438232, -0.18134604394435883, -0.012234225869178772, -0.10304189473390579, -0.08109723776578903, -0.4432018995285034, 0.07643585652112961, 0.133864626288414, 0.14760109782218933, -0.1267905980348587, -0.224333718419075, 0.0298288706690073, -0.2123529314994812, 0.5039181709289551, -0.26784926652908325, 0.2697387933731079, 0.08961866050958633, 0.28673845529556274, 0.1971598118543625, -0.333369642496109, 0.3026839792728424, 0.006308895070105791, 0.11743327230215073, -0.03913762792944908, -0.0010521016083657742, -0.31574955582618713, 0.12717166543006897, -0.03555822744965553, -0.38847601413726807, -0.13536590337753296, -0.2928382158279419, -0.48132938146591187, 0.1862703412771225, 0.5725143551826477, -0.2594643831253052, 0.00269020046107471, -0.4819663166999817, 0.033089395612478256, 