# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text classification** as well as the performance evaluation. In addition, you are requried to conduct **10 fold cross validation** (https://scikit-learn.org/stable/modules/cross_validation.html) in the training.



The dataset can be download from canvas. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data.


**Algorithms:**

*   MultinominalNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precison
*   F-1 score


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load data
def load_data(filename):
    with open(filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()
    labels = []
    texts = []
    for line in lines:
        parts = line.strip().split(' ', 1)  # Split only once at the first space
        labels.append(int(parts[0]))
        texts.append(parts[1])
    return texts, labels

train_texts, train_labels = load_data('stsa-train.txt')
test_texts, test_labels = load_data('stsa-test.txt')

# Preprocess data
X_train = train_texts
y_train = train_labels
X_test = test_texts
y_test = test_labels

# Split training data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Create CountVectorizer
vectorizer = CountVectorizer()

# Fit and transform on training data
X_train_counts = vectorizer.fit_transform(X_train)

# Transform validation and test data
X_val_counts = vectorizer.transform(X_val)
X_test_counts = vectorizer.transform(X_test)

# Define classifiers
classifiers = {
    "MultinomialNB": MultinomialNB(),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "XGBoost": XGBClassifier()
}

# Initialize evaluation metrics
metrics = {
    "Accuracy": accuracy_score,
    "Precision": precision_score,
    "Recall": recall_score,
    "F1 Score": f1_score
}

# Evaluate models using 10-fold cross-validation
for clf_name, clf in classifiers.items():
    print(f"Evaluating {clf_name}:")
    cv_scores = cross_val_score(clf, X_train_counts, y_train, cv=StratifiedKFold(n_splits=10, shuffle=True), scoring='accuracy')
    print(f"Mean cross-validation accuracy: {np.mean(cv_scores)}")

    # Train model on entire training data
    clf.fit(X_train_counts, y_train)

    # Evaluate on validation set
    y_pred = clf.predict(X_val_counts)
    print("Validation Metrics:")
    for metric_name, metric_func in metrics.items():
        print(f"{metric_name}: {metric_func(y_val, y_pred)}")

    print("--------------------------------------")



Evaluating MultinomialNB:
Mean cross-validation accuracy: 0.7821528126856464
Validation Metrics:
Accuracy: 0.7947976878612717
Precision: 0.777490297542044
Recall: 0.8429172510518934
F1 Score: 0.8088829071332435
--------------------------------------
Evaluating SVM:
Mean cross-validation accuracy: 0.7389859055627003
Validation Metrics:
Accuracy: 0.7557803468208093
Precision: 0.7394636015325671
Recall: 0.8120617110799438
F1 Score: 0.7740641711229947
--------------------------------------
Evaluating KNN:
Mean cross-validation accuracy: 0.5706295167155195
Validation Metrics:
Accuracy: 0.6163294797687862
Precision: 0.6226415094339622
Recall: 0.6479663394109397
F1 Score: 0.6350515463917525
--------------------------------------
Evaluating DecisionTree:
Mean cross-validation accuracy: 0.6221091388618694
Validation Metrics:
Accuracy: 0.6517341040462428
Precision: 0.6460176991150443
Recall: 0.7166900420757363
F1 Score: 0.6795212765957447
--------------------------------------
Evaluating RandomF

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please downlad the dataset by using the following link.  https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to of the codes from  the follwing link below.
https://www.kaggle.com/karthik3890/text-clustering

In [13]:
# Write your code here
# !pip install sentence-transformers
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from gensim.models import Word2Vec
from sentence_transformers import SentenceTransformer
import pandas as pd
import re

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')

# Load the dataset (first 1000 records)
data = pd.read_csv("McDonald_s_Reviews.csv",encoding='latin-1').head(1000)

# Preprocess the text data (remove punctuation, lowercase, stop words removal, tokenization)
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    if isinstance(text, str):  # Check if text is a string
        # Convert text to lowercase
        text = text.lower()
        # Remove punctuation
        text = re.sub(r'[^\w\s]', '', text)
        # Remove stop words
        text = ' '.join([word for word in text.split() if word not in stop_words])
        # Tokenization
        tokens = word_tokenize(text)
        # Join tokens back to text
        text = ' '.join(tokens)
    return text

# Apply preprocessing to 'Reviews' column
data['processed_text'] = data['review'].apply(preprocess_text)

# Choose appropriate features or representations
# For K-means, DBSCAN, and Hierarchical clustering
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
tfidf_matrix = tfidf_vectorizer.fit_transform(data['processed_text'])

# For Word2Vec
word2vec_model = Word2Vec(sentences=data['processed_text'].apply(lambda x: x.split()), vector_size=100, window=5, min_count=1, workers=4)
word2vec_model.train(data['processed_text'], total_examples=len(data['processed_text']), epochs=10)

# For BERT
bert_model = SentenceTransformer('bert-base-nli-mean-tokens')
bert_embeddings = bert_model.encode(data['processed_text'])

# Apply K-means clustering
kmeans = KMeans(n_clusters=5, random_state=42)
kmeans_clusters = kmeans.fit_predict(tfidf_matrix)
kmeans_silhouette = silhouette_score(tfidf_matrix, kmeans_clusters)

# Apply DBSCAN with adjusted parameters
dbscan = DBSCAN(eps=0.5, min_samples=10)  # Adjusted min_samples
dbscan_clusters = dbscan.fit_predict(tfidf_matrix)
# Filter out outliers (-1) from TF-IDF matrix
tfidf_matrix_filtered = tfidf_matrix[dbscan_clusters != -1]
# Filter out outliers from DBSCAN clusters
dbscan_clusters_filtered = dbscan_clusters[dbscan_clusters != -1]
# Calculate silhouette score using filtered data
dbscan_silhouette = silhouette_score(tfidf_matrix_filtered, dbscan_clusters_filtered)

# Adjust the sample size for hierarchical clustering due to memory constraints
sample_size = 500  # Adjust according to your memory constraints
hierarchical = AgglomerativeClustering(n_clusters=5)
hierarchical_clusters = hierarchical.fit_predict(tfidf_matrix[:sample_size].toarray())
hierarchical_silhouette = silhouette_score(tfidf_matrix[:sample_size], hierarchical_clusters)


# Apply K-means clustering to BERT embeddings
kmeans_bert = KMeans(n_clusters=5, random_state=42)
bert_clusters = kmeans_bert.fit_predict(bert_embeddings)

# Calculate silhouette score for BERT
bert_silhouette = silhouette_score(bert_embeddings, bert_clusters)

# Apply K-means clustering to Word2Vec vectors
kmeans_word2vec = KMeans(n_clusters=5, random_state=42)
word2vec_clusters = kmeans_word2vec.fit_predict(word2vec_model.wv.vectors)

# Calculate silhouette score for Word2Vec
word2vec_silhouette = silhouette_score(word2vec_model.wv.vectors, word2vec_clusters)

# Output the silhouette scores including Word2Vec and BERT
print(f"Silhouette Score (K-means, Word2Vec): {kmeans_silhouette}")
print(f"Silhouette Score (DBSCAN, Word2Vec): {dbscan_silhouette}")
print(f"Silhouette Score (Hierarchical, Word2Vec): {hierarchical_silhouette}")
print(f"Silhouette Score (Word2Vec): {word2vec_silhouette}")
print(f"Silhouette Score (K-means, BERT): {bert_silhouette}")




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Silhouette Score (K-means, Word2Vec): 0.18063719465518144
Silhouette Score (DBSCAN, Word2Vec): 1.0
Silhouette Score (Hierarchical, Word2Vec): 0.0017263556648420106
Silhouette Score (Word2Vec): 0.9314850568771362
Silhouette Score (K-means, BERT): 0.21864353120326996


**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

The clustering algorithms and embedding approaches performed differently while clustering text data from McDonald's reviews. K-means clustering produced a modest silhouette score, suggesting satisfactory cluster quality. DBSCAN, on the other hand, returned a flawless silhouette score, indicating that it successfully recognized dense areas in the data. However, because of its inclination to group outliers together, DBSCAN may have oversimplified the grouping. Hierarchical clustering obtained the lowest silhouette score, suggesting that the cluster quality was inadequate in comparison to other approaches. When clustered with K-means, Word2Vec embeddings had good cluster quality, however it was somewhat worse than when K-means were applied to TF-IDF features. When paired with K-means, BERT embeddings produced the greatest silhouette score of any approach, showing the efficiency.Overall, while each technique has advantages and disadvantages, BERT embeddings paired with K-means provided the most promising results for clustering McDonald's review data.

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:

Completing these exercises gave me practical experience in applying various text clustering techniques to real-world data.
I learned about preprocessing, implementing different clustering algorithms, evaluating results using silhouette scores, and exploring embedding techniques like Word2Vec and BERT.
It was valuable to see how these methods performed and to understand their strengths and limitations in clustering textual data.



'''