# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification** and to evaluate their performance. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
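
As a rough illustration of the split described above, the following sketch holds out 20% of a toy corpus for validation and runs 10-fold cross-validation on the remaining 80%. The corpus, the variable names, and the choice of `MultinomialNB` are assumptions made only for this example; the real exercise would use the Canvas training file.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the real training file.
reviews = [f"great film number {i}" for i in range(50)] + \
          [f"awful film number {i}" for i in range(50)]
labels = [1] * 50 + [0] * 50

# 80% for training, 20% for validation; stratify keeps the class ratio in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    reviews, labels, test_size=0.2, stratify=labels, random_state=1
)

# 10-fold cross-validation on the training part only.
clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
cv_scores = cross_val_score(clf, X_tr, y_tr, cv=cv, scoring="accuracy")

# Refit on the whole training part, then score once on the validation part.
clf.fit(X_tr, y_tr)
val_accuracy = clf.score(X_val, y_val)
print(len(X_tr), len(X_val), cv_scores.mean(), val_accuracy)
```

The test file stays untouched until a final model has been chosen, so it can serve as an unbiased last check.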


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
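
To make the four measurements concrete, here is a small sketch that computes them from raw counts for a binary problem where 1 is the positive class. The helper name `binary_metrics` and the example labels are made up for illustration; in practice `sklearn.metrics` provides the same quantities.

```python
def binary_metrics(y_true, y_pred):
    """Return accuracy, recall, precision and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}

# Hypothetical labels: 3 TP, 1 FN, 1 FP, 3 TN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
result = binary_metrics(y_true, y_pred)
print(result)
```

With these counts every metric equals 0.75, which makes it easy to check the formulas by hand.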


In [None]:
# Setting up the environment
!pip install gensim xgboost transformers -U
!pip install accelerate -U
!pip install numpy --upgrade --force-reinstall
!pip install pandas scikit-learn

Collecting numpy
  Using cached numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
Installing collected packages: numpy
  Attempting uninstall: numpy
    Found existing installation: numpy 1.26.4
    Uninstalling numpy-1.26.4:
      Successfully uninstalled numpy-1.26.4
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
torchtext 0.17.1 requires torch==2.2.1, but you have torch 2.3.0 which is incompatible.
Successfully installed numpy-1.26.4


In [1]:
# Import libraries
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score, make_scorer, classification_report
from gensim.models import Word2Vec
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch

In [None]:
# Function to read the dataset file
def load_data(file_path):
    with open(file_path, 'r') as file:
        lines = file.readlines()
        data = [line.strip().split(' ', 1) for line in lines if line.strip()]
        data = [[int(label), sentence.strip()] for label, sentence in data]
    return pd.DataFrame(data, columns=['label', 'review'])

# Load training and test data
train_data = load_data('/content/stsa-train.txt')
test_data = load_data('/content/stsa-test.txt')

# Verify the head of the dataframe
print(train_data.head())
print(test_data.head())


   label                                             review
0      1  a stirring , funny and finally transporting re...
1      0  apparently reassembled from the cutting-room f...
2      0  they presume their audience wo n't sit still f...
3      1  this is a visually stunning rumination on love...
4      1  jonathan parker 's bartleby should have been t...
   label                                             review
0      0     no movement , no yuks , not much of anything .
1      0  a gob of drivel so sickly sweet , even the eag...
2      0  gangs of new york is an unapologetic mess , wh...
3      0  we never really feel involved with the story ,...
4      1            this is one of polanski 's best films .


In [None]:
from sklearn.model_selection import cross_validate

def evaluate_model(model, X, y):
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
    scoring = {
        'accuracy': make_scorer(accuracy_score),
        'recall': make_scorer(recall_score, average='binary'),
        'precision': make_scorer(precision_score, average='binary'),
        'f1_score': make_scorer(f1_score, average='binary')
    }
    scores = cross_validate(model, X, y, cv=kfold, scoring=scoring, return_train_score=False)
    return scores

# Define the vectorizer
vectorizer = TfidfVectorizer(stop_words='english', max_features=5000)
X_train_vect = vectorizer.fit_transform(train_data['review'])
y_train = train_data['label']

# Define the models
models = [
    MultinomialNB(),
    SVC(),
    KNeighborsClassifier(),
    DecisionTreeClassifier(),
    RandomForestClassifier(),
    XGBClassifier(eval_metric='logloss')
]

model_names = [
    'MultinomialNB',
    'SVM',
    'KNN',
    'DecisionTree',
    'RandomForest',
    'XGBClassifier'
]

# Train and evaluate models
for model, name in zip(models, model_names):
    scores = evaluate_model(model, X_train_vect, y_train)
    # Output the performance metrics for each model
    print(f"{name} - Accuracy: {np.mean(scores['test_accuracy'])}, "
          f"Recall: {np.mean(scores['test_recall'])}, "
          f"Precision: {np.mean(scores['test_precision'])}, "
          f"F1: {np.mean(scores['test_f1_score'])}")


MultinomialNB - Accuracy: 0.7757225433526012, Recall: 0.8232686980609418, Precision: 0.7650230414362393, F1: 0.7929471525338624
SVM - Accuracy: 0.7721098265895955, Recall: 0.8063711911357341, Precision: 0.7684245248053749, F1: 0.7867812610086388
KNN - Accuracy: 0.5141618497109828, Recall: 0.32465373961218835, Precision: 0.5587600329837245, F1: 0.40911007755775985
DecisionTree - Accuracy: 0.6504335260115607, Recall: 0.6332409972299169, Precision: 0.6765902301262351, F1: 0.6539971839591352
RandomForest - Accuracy: 0.7218208092485549, Recall: 0.7055401662049862, Precision: 0.7476092492812774, F1: 0.725769865628839
XGBClassifier - Accuracy: 0.6901734104046243, Recall: 0.7592797783933518, Precision: 0.6877859940139102, F1: 0.7178211688723224
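
One detail worth noting about the cell above: the TF-IDF vectorizer is fitted on all training rows before cross-validation, so each validation fold also contributes to the vocabulary and IDF weights. A possible alternative, sketched here on a made-up toy corpus with only two of the models, is to place the vectorizer inside a pipeline so that it is refit on each training fold. The names `candidates` and `summary` are illustrative only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; in the notebook these would be
# train_data['review'] and train_data['label'].
texts = [f"good fun movie {i}" for i in range(30)] + \
        [f"dull boring movie {i}" for i in range(30)]
y = [1] * 30 + [0] * 30

candidates = {
    "MultinomialNB": MultinomialNB(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
metrics = ["accuracy", "recall", "precision", "f1"]

rows = []
for name, estimator in candidates.items():
    # The vectorizer is part of the pipeline, so it is refit on every training fold.
    pipe = make_pipeline(TfidfVectorizer(), estimator)
    scores = cross_validate(pipe, texts, y, cv=kfold, scoring=metrics)
    row = {"model": name}
    for m in metrics:
        row[m] = scores[f"test_{m}"].mean()
    rows.append(row)

# One row per model, one column per metric (mean over the 10 folds).
summary = pd.DataFrame(rows).set_index("model").round(3)
print(summary)
```

Collecting the fold means in a single table also makes the models easier to compare than separate printed lines.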


In [None]:
# Word2Vec
from gensim.models import Word2Vec

# Tokenize the text
tokenized_train = [review.split() for review in train_data['review']]

# Train the Word2Vec model
w2v_model = Word2Vec(sentences=tokenized_train, vector_size=100, window=5, min_count=1, workers=4)

# Function to create a document vector by averaging the word vectors
def document_vector(model, doc):
    # Keep only in-vocabulary words
    vectors = [model.wv[word] for word in doc if word in model.wv]
    # A review with no known words would otherwise average to NaN; use a zero vector instead
    if not vectors:
        return np.zeros(model.wv.vector_size)
    return np.mean(vectors, axis=0)

# Create document vectors for the training set
X_train_w2v = np.array([document_vector(w2v_model, doc) for doc in tokenized_train])

# Train a classifier using these document vectors
from sklearn.ensemble import RandomForestClassifier

rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_w2v, y_train)

# Tokenize the test data and create document vectors
tokenized_test = [review.split() for review in test_data['review']]
X_test_w2v = np.array([document_vector(w2v_model, doc) for doc in tokenized_test])

# Predict and evaluate using the test set
y_test = test_data['label']
predictions_w2v = rf_classifier.predict(X_test_w2v)
print(classification_report(y_test, predictions_w2v))

              precision    recall  f1-score   support

           0       0.59      0.49      0.54       912
           1       0.56      0.65      0.60       909

    accuracy                           0.57      1821
   macro avg       0.57      0.57      0.57      1821
weighted avg       0.57      0.57      0.57      1821
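
The averaging step used for the Word2Vec features can be checked in isolation. The sketch below uses a hypothetical three-dimensional lookup table in place of a trained model (a real model would supply the vectors through `model.wv`) and shows one way to handle a review in which no token is known.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, used only for this illustration.
toy_vectors = {
    "good": np.array([1.0, 0.0, 2.0]),
    "film": np.array([3.0, 2.0, 0.0]),
}

def mean_pool(tokens, vectors, dim):
    """Average the vectors of known tokens; fall back to zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

docs = [["good", "film"], ["unseen", "words"]]
X = np.vstack([mean_pool(doc, toy_vectors, dim=3) for doc in docs])
print(X)
```

Mean pooling discards word order, which may be one reason averaged vectors can trail TF-IDF features on short reviews; that is a hypothesis, not something this notebook tests.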



In [None]:
!pip install torch torchvision torchaudio

# BERT
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Tokenization and encoding for BERT
train_encodings = tokenizer(train_data['review'].tolist(), truncation=True, padding=True, max_length=512, return_tensors='pt')
test_encodings = tokenizer(test_data['review'].tolist(), truncation=True, padding=True, max_length=512, return_tensors='pt')

# Dataset class for BERT
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

# Convert encodings to dataset objects
train_dataset = IMDbDataset(train_encodings, train_data['label'].tolist())
test_dataset = IMDbDataset(test_encodings, test_data['label'].tolist())

# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset
)

# Train and evaluate
trainer.train()
results = trainer.evaluate()
print(results)




The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Step,Training Loss
500,0.3993
1000,0.2021
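
As written, the `Trainer` above is given no metric function, so the evaluation would be expected to report the loss rather than accuracy, recall, precision, or F1. A minimal sketch of a metric function follows; it assumes the evaluation predictions unpack into `(logits, labels)`, and the logits below are invented purely to exercise the function.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    """Turn (logits, labels) into accuracy, precision, recall and F1."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="binary", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical logits for four examples (predictions: 0, 1, 1, 0).
fake_logits = np.array([[2.0, 0.1], [0.2, 1.5], [0.3, 0.9], [1.2, 0.4]])
fake_labels = np.array([0, 1, 0, 1])
bert_metrics = compute_metrics((fake_logits, fake_labels))
print(bert_metrics)
```

Such a function would typically be passed as `compute_metrics=compute_metrics` when constructing the `Trainer`, after which `trainer.evaluate()` should include these values alongside the loss.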


## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use a different text dataset of your choice.)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering
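
For orientation, the first three methods in the list can be sketched on a tiny invented corpus as follows. The documents, `eps`, `min_samples`, and the number of clusters are all guesses chosen for illustration and would need tuning on the real reviews.

```python
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini corpus with two rough topics (battery, screen).
docs = [
    "battery life is great",
    "great battery and long battery life",
    "battery lasts long",
    "screen cracked on arrival",
    "cracked screen and broken glass",
    "screen glass broken",
]
X = TfidfVectorizer().fit_transform(docs)

# K-means: the number of clusters is fixed in advance.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)

# DBSCAN: density based, no cluster count needed; the label -1 marks noise points.
dbscan_labels = DBSCAN(eps=0.9, min_samples=2, metric="cosine").fit_predict(X)

# Hierarchical (agglomerative) clustering with Ward linkage needs a dense array,
# which is only practical for small samples.
hier_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X.toarray())

print(kmeans_labels, dbscan_labels, hier_labels)
```

For Word2Vec and BERT, the same clustering calls would be applied to document embeddings instead of the TF-IDF matrix.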

In [3]:
!pip install pandas nltk scikit-learn
import pandas as pd
import re
import string
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import nltk
nltk.download('stopwords')
data = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')



[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
# Build these once; recreating them for every review is slow on a large dataset
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()
# re.escape keeps characters such as ']' and '\' literal inside the character class
PUNCT_RE = re.compile(f'[{re.escape(string.punctuation)}]')

def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove punctuation
    text = PUNCT_RE.sub('', text)
    # Remove numbers
    text = re.sub(r'\d+', '', text)
    # Remove stopwords, then stem the remaining words
    words = [word for word in text.split() if word not in STOP_WORDS]
    return ' '.join(STEMMER.stem(word) for word in words)

# Apply the preprocessing function to the Reviews column
data['Processed_Reviews'] = data['Reviews'].astype(str).apply(preprocess_text)

In [1]:
# K-means
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Create TF-IDF features
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(data['Processed_Reviews'])

# Perform K-means clustering
num_clusters = 5  # You might want to find the optimal number using methods like the elbow method.
kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=42)
kmeans.fit(X_tfidf)

# Add cluster labels to the original data
data['Kmeans_Cluster'] = kmeans.labels_

# Plot the clusters (TruncatedSVD reduces the sparse TF-IDF matrix to 2D without converting it to a dense array)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2, random_state=42)
scatter_plot_points = svd.fit_transform(X_tfidf)

fig, ax = plt.subplots(figsize=(20, 10))
# Colour each point by its K-means cluster label
ax.scatter(scatter_plot_points[:, 0], scatter_plot_points[:, 1], s=25, c=kmeans.labels_, cmap='tab10')

plt.show()


NameError: name 'data' is not defined
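
The K-means cell fixes the number of clusters at 5 and mentions the elbow method as a way to choose it. A rough sketch of that idea, on an invented corpus small enough to run instantly, records the inertia and the silhouette score for several values of `k`; the documents and the range of `k` are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical reviews standing in for data['Processed_Reviews'].
docs = [
    "battery life is great",
    "battery drains fast",
    "battery charger stopped working",
    "screen is bright and sharp",
    "screen cracked after a week",
    "cracked screen glass",
    "camera takes blurry photos",
    "camera photos look great",
]
X = TfidfVectorizer().fit_transform(docs)

results = {}
for k in range(2, 5):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    # inertia_: within-cluster sum of squared distances (lower means tighter clusters)
    # silhouette: ranges from -1 to 1 (higher means better separated clusters)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))

for k, (inertia, silhouette) in results.items():
    print(k, round(inertia, 3), round(silhouette, 3))
```

Plotting inertia against `k` and looking for the point where the curve flattens is the usual elbow heuristic; on the full dataset it may be more practical to run this on a random sample of reviews.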

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

In [None]:
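I had difficulty running the cell above: it kept raising the error "name 'data' is not defined", even though I had executed the two preceding cells that load the data from the CSV file and preprocess it.
My understanding is that this is caused by the large dataset; the cell's execution count restarting at 1 suggests that the runtime restarted, which would have cleared `data`.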

**Write your response here:**

.

.

.

.

.




# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the above questions):

This assignment was somewhat hard: running BERT required a lot of research and a lot of time, and I repeatedly had to restart the kernel because training took so long.
I learned how to perform 10-fold cross-validation for different models, how to evaluate them using accuracy, recall, precision, and F1, and how to use Word2Vec and BERT for text classification.


