# **INFO5731 In-class Exercise 5**

**This exercise aims to provide a comprehensive learning experience in text analysis and machine learning techniques, focusing on both text classification and clustering tasks.**

***Please use the text corpus you collected in your last in-class-exercise for this exercise. Perform the following tasks***.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## **Question 1 (20 Points)**

The purpose of this question is to practice different machine learning algorithms for **text classification**, as well as their performance evaluation. In addition, you are required to conduct **10-fold cross-validation** (https://scikit-learn.org/stable/modules/cross_validation.html) during training.



The dataset can be downloaded from Canvas. It contains two files, training data and test data, for sentiment analysis of IMDB reviews, with two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross-validation while training the classifier. The final trained model is then evaluated on the test data.
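The split-then-cross-validate procedure described above can be sketched as follows. This is a minimal illustration, not the required solution: the toy `texts` and `labels` are hypothetical stand-ins for the training file, and placing the vectorizer inside a pipeline is one possible design choice so that each fold fits TF-IDF on its own training part only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for the labelled training file.
texts = ["great movie", "loved the acting", "wonderful and moving story", "a fine film",
         "terrible plot", "boring and slow", "awful acting", "a dull mess"] * 10
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 10

# 80% training / 20% validation, stratified so both classes stay balanced.
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42)

# Vectorizer and classifier in one pipeline, refitted from scratch on every fold.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# 10-fold stratified cross-validation on the training portion only.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
cv_results = cross_validate(model, X_tr, y_tr, cv=cv,
                            scoring=["accuracy", "precision", "recall", "f1"])
print("Mean CV accuracy:", cv_results["test_accuracy"].mean())

# Refit on the whole training portion and check the held-out validation split.
model.fit(X_tr, y_tr)
val_accuracy = model.score(X_val, y_val)
print("Validation accuracy:", val_accuracy)
```

The test file would only be touched once, after a model has been chosen from these cross-validation scores.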


**Algorithms:**

*   MultinomialNB
*   SVM
*   KNN
*   Decision tree
*   Random Forest
*   XGBoost
*   Word2Vec
*   BERT

**Evaluation measurement:**


*   Accuracy
*   Recall
*   Precision
*   F-1 score
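As a quick illustration of how the four measurements relate, the sketch below computes them with `sklearn.metrics` on a small hand-made example. The `y_true` and `y_pred` lists are hypothetical; with them there are 3 true positives, 1 false negative, 2 false positives, and 2 true negatives.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = positive, 0 = negative.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0]

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 5 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(accuracy, precision, recall, f1)
```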


In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_validate, StratifiedKFold
from xgboost import XGBClassifier
train_dfrme = pd.read_csv(r'/content/stsa-train.txt',sep = 'delimiter=',header= None,names=['Text'])
train_dfrme[['Sentiment','Text']] = train_dfrme["Text"].str.split(" ", n=1, expand=True)

train_dfrme.head()



Unnamed: 0,Text,Sentiment
0,"a stirring , funny and finally transporting re...",1
1,apparently reassembled from the cutting-room f...,0
2,they presume their audience wo n't sit still f...,0
3,this is a visually stunning rumination on love...,1
4,jonathan parker 's bartleby should have been t...,1


In [4]:
test_dfrme = pd.read_csv(r'stsa-test.txt',sep = 'delimiter=',header= None,names=['Text'])
test_dfrme[['Sentiment','Text']] = test_dfrme["Text"].str.split(" ", n=1, expand=True)

test_dfrme.head()



Unnamed: 0,Text,Sentiment
0,"no movement , no yuks , not much of anything .",0
1,"a gob of drivel so sickly sweet , even the eag...",0
2,"gangs of new york is an unapologetic mess , wh...",0
3,"we never really feel involved with the story ,...",0
4,this is one of polanski 's best films .,1


In [5]:
import nltk
import re
import string
nltk.download('stopwords')
nltk.download('wordnet')
stopword=nltk.corpus.stopwords.words('english')
from nltk.stem import WordNetLemmatizer
w_l= WordNetLemmatizer()
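def clean_text(txt):
  # Drop punctuation characters and lowercase the rest
  txt="".join([w.lower() for w in txt if w not in string.punctuation])
  # Replace standalone numbers with a space
  txt = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", " ", txt)
  # Split on non-word characters, then remove stop words and lemmatize
  tok = re.split(r'\W+',txt)
  txt = [w_l.lemmatize(w1) for w1 in tok if w1 not in stopword]
  return txt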

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [6]:
from sklearn.feature_extraction.text import TfidfVectorizer
nltk.download('omw-1.4')
tf_vect = TfidfVectorizer(analyzer = clean_text)
X_idf = tf_vect.fit_transform(train_dfrme['Text'])
print(X_idf.shape)
X_idf_dafrme=pd.DataFrame(X_idf.toarray())
# Get the feature names in column order (vocabulary_.keys() is not ordered by column index)
feature_names = tf_vect.get_feature_names_out()
# Assign the feature names to the columns of the DataFrame
X_idf_dafrme.columns=feature_names
X_test_idf = tf_vect.transform(test_dfrme['Text'])
print(X_idf.shape)

[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


(6920, 13343)
(6920, 13343)


In [7]:
MNB = MultinomialNB()
SVM = LinearSVC()
KNN = KNeighborsClassifier(n_neighbors=5,n_jobs=-1)
DT = DecisionTreeClassifier()
RF = RandomForestClassifier()
XGB = XGBClassifier()
x_train, x_test, y_train, y_test = train_test_split(X_idf_dafrme, train_dfrme['Sentiment'].values,
                                                test_size=0.2, random_state=42)

In [8]:
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
# Fit on the 80% training split and evaluate on the 20% validation split
MNB_test = MNB.fit(x_train,y_train)
Y_MNB = MNB_test.predict(x_test)
print('Accuracy %s' % accuracy_score(y_test,Y_MNB))
print(classification_report(y_test,Y_MNB))
# 10-fold cross-validation; note that it is run on the validation split (x_test, y_test)
sco_MNB = cross_val_score(MNB, x_test, y_test, cv=10)
print("Accuracy using MNB",sco_MNB.mean())

Accuracy using MNB 0.7247054530288813


In [9]:
SVM_test = SVM.fit(x_train,y_train)
Y_SVM = SVM_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_SVM,y_test))
print(classification_report(y_test,Y_SVM))
from sklearn.model_selection import cross_val_score
sco_SVM = cross_val_score(SVM, x_test, y_test, cv=10)
print("Accuracy using SVM",sco_SVM.mean())

Accuracy 0.791907514450867
              precision    recall  f1-score   support

           0       0.81      0.75      0.78       671
           1       0.78      0.83      0.80       713

    accuracy                           0.79      1384
   macro avg       0.79      0.79      0.79      1384
weighted avg       0.79      0.79      0.79      1384

Accuracy using SVM 0.7348034615785632


In [10]:
KNN_test = KNN.fit(x_train,y_train)
Y_KNN = KNN_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_KNN,y_test))
print(classification_report(y_test,Y_KNN))
from sklearn.model_selection import cross_val_score
scores_KNN = cross_val_score(KNN, x_test, y_test, cv=10)
print("Accuracy using knn",scores_KNN.mean())

Accuracy 0.740606936416185
              precision    recall  f1-score   support

           0       0.75      0.71      0.73       671
           1       0.74      0.77      0.75       713

    accuracy                           0.74      1384
   macro avg       0.74      0.74      0.74      1384
weighted avg       0.74      0.74      0.74      1384

Accuracy using knn 0.6675737670732979


In [11]:
dec_tree_test = DT.fit(x_train,y_train)
Y_dec_tree = dec_tree_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_dec_tree,y_test))
print(classification_report(y_test,Y_dec_tree))
scores_DT = cross_val_score(DT, x_test, y_test, cv=10)
print("Accuracy of Decision trees",scores_DT.mean())

Accuracy 0.6488439306358381
              precision    recall  f1-score   support

           0       0.64      0.61      0.63       671
           1       0.65      0.68      0.67       713

    accuracy                           0.65      1384
   macro avg       0.65      0.65      0.65      1384
weighted avg       0.65      0.65      0.65      1384

Accuracy of Decision trees 0.6119017829214888


In [12]:
RF_test = RF.fit(x_train,y_train)
Y_RF = RF_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_RF,y_test))
print(classification_report(y_test,Y_RF))
sco_RF = cross_val_score(RF, x_test, y_test, cv=10)
print("Accuracy using Random Forest",sco_RF.mean())

Accuracy 0.7413294797687862
              precision    recall  f1-score   support

           0       0.78      0.64      0.71       671
           1       0.71      0.83      0.77       713

    accuracy                           0.74      1384
   macro avg       0.75      0.74      0.74      1384
weighted avg       0.75      0.74      0.74      1384

Accuracy using Random Forest 0.6784172661870504


In [None]:
# Convert y_train and y_test to integers
y_train = y_train.astype(int)
y_test = y_test.astype(int)

# Fit the XGBClassifier
XGB_test = XGB.fit(x_train, y_train)

# Make predictions and evaluate the model
Y_XGB = XGB_test.predict(x_test)
print('Accuracy %s' % accuracy_score(Y_XGB, y_test))
print(classification_report(y_test, Y_XGB))
sco_XGB = cross_val_score(XGB, x_test, y_test, cv=10)
print("Accuracy using XGBoost", sco_XGB.mean())

In [None]:
print("Accuracy using the MNB",sco_MNB.mean())
print("Accuracy using the SVM",sco_SVM.mean())
print("Accuracy using the knn",scores_KNN.mean())
print("Accuracy using the Decision trees",scores_DT.mean())
print("Accuracy using the Random Forest",sco_RF.mean())
print("Accuracy using the XGBoost",sco_XGB.mean())
predict_MNB = MNB_test.predict(X_test_idf)
print('Accuracy of the final trained model (MNB) on the test data: %s' % accuracy_score(test_dfrme['Sentiment'], predict_MNB))
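One way to pick the final model is to compare the candidates by their mean 10-fold cross-validation accuracy on the training data and evaluate only the winner on the test data. The sketch below assumes hypothetical toy lists (`train_texts`, `train_labels`, `test_texts`, `test_labels`) and just three of the listed classifiers; it is an illustration of the selection step, not a statement of which model wins on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data standing in for the training and test files.
train_texts = ["great movie", "loved the acting", "wonderful story", "a fine film",
               "terrible plot", "boring and slow", "awful acting", "a dull mess"] * 10
train_labels = [1, 1, 1, 1, 0, 0, 0, 0] * 10
test_texts = ["a wonderful film", "a boring plot"]
test_labels = [1, 0]

candidates = {
    "MNB": MultinomialNB(),
    "SVM": LinearSVC(),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Mean 10-fold cross-validation accuracy per candidate, on training data only.
cv_means = {}
for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(), clf)
    cv_means[name] = cross_val_score(pipe, train_texts, train_labels, cv=10).mean()

# Refit the best candidate on all training data, then score it once on the test data.
best_name = max(cv_means, key=cv_means.get)
final_model = make_pipeline(TfidfVectorizer(), candidates[best_name])
final_model.fit(train_texts, train_labels)
test_accuracy = final_model.score(test_texts, test_labels)

print(cv_means, best_name, test_accuracy)
```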

## **Question 2 (20 Points)**

The purpose of the question is to practice different machine learning algorithms for **text clustering**.

Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data which you want)

**Apply the listed clustering methods to the dataset:**
*   K-means
*   DBSCAN
*   Hierarchical clustering
*   Word2Vec
*   BERT

You can refer to the code at the following link.
https://www.kaggle.com/karthik3890/text-clustering

In [18]:
import pandas as pd

df = pd.read_csv('/content/Amazon_Unlocked_Mobile.csv')

In [19]:
import nltk

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [20]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
stop = stopwords.words('english')
from textblob import Word
nltk.download('wordnet')

df['Reviews'] = df['Reviews'].apply(lambda x: " ".join(x.lower() for x in str(x).split()))
df['Reviews'] = df['Reviews'].str.replace('[^\w\s]','')
df['Reviews'] = df['Reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df['Reviews'] = df['Reviews'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df.head()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,Product Name,Brand Name,Price,Rating,Reviews,Review Votes
0,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,feel lucky found used (phone u & used hard all...,1.0
1,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,"nice phone, nice grade pantach revue. clean se...",0.0
2,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,5,pleased,0.0
3,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,work good go slow sometimes good phone love,0.0
4,"""CLEAR CLEAN ESN"" Sprint EPIC 4G Galaxy SPH-D7...",Samsung,199.99,4,great phone replace lost phone. thing volume b...,0.0


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from textblob import Word
import nltk
from nltk.corpus import stopwords
import pandas as pd

# Assuming df is your DataFrame
# For example: df = pd.read_csv('your_dataset.csv')

# Cleaning the data before beginning clustering
# Lowercasing
df['Reviews'] = df['Reviews'].apply(lambda x: " ".join(x.lower() for x in str(x).split()))
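# Removing punctuation (regex=True is required in pandas >= 2.0, where the pattern is otherwise treated literally)
df['Reviews'] = df['Reviews'].str.replace(r'[^\w\s]', '', regex=True)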
# Removing stop words
stop = stopwords.words('english')
df['Reviews'] = df['Reviews'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
# Lemmatization
df['Reviews'] = df['Reviews'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

# Create the vectorizer
tfidf_vect = TfidfVectorizer()

# Fit the vectorizer on the reviews
tfidf_vects = tfidf_vect.fit_transform(df['Reviews'].values.astype('U'))

# Get the feature names in column order (vocabulary_.keys() is not ordered by column index)
feature_names = tfidf_vect.get_feature_names_out()

# Form 9 clusters
model = KMeans(n_clusters=9, init='k-means++', max_iter=10000, random_state=50)
model.fit(tfidf_vects)
from collections import Counter
print(Counter(model.labels_))

In [None]:
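# Column-ordered feature names, so that index i matches column i of the centroids
feature_names_list = list(tfidf_vect.get_feature_names_out())

top_words = 7
# For each centroid, column indices sorted from highest to lowest weight
centroids = model.cluster_centers_.argsort()[:, ::-1]

for cluster_num in range(9):
    # Look up the highest-weighted terms of this centroid
    key_features = [feature_names_list[i] for i in centroids[cluster_num, :top_words]]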

    print(f'Cluster {cluster_num + 1}')
    print('Top Words:', key_features)

In [None]:
# Get cluster centers
cluster_centers = model.cluster_centers_
cluster_centers

In [None]:
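# Build averaged Word2Vec review vectors to use as clustering input (e.g. for DBSCAN)
reviews = [str(i).split() for i in df['Reviews']]

import gensim
# gensim 4.x names this parameter vector_size (it was size in 3.x)
w2v_model = gensim.models.Word2Vec(reviews, vector_size=100, workers=4)

import numpy as np
vectors = []

for i in reviews:
    vector = np.zeros(100)
    count = 0
    for word in i:
        try:
            vec = w2v_model.wv[word]
            vector += vec
            count += 1
        except KeyError:
            # Word is not in the Word2Vec vocabulary (e.g. below min_count)
            pass
    # Average only when at least one word was found; otherwise keep the zero vector
    if count:
        vector /= count
    vectors.append(vector)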

vectors = np.array(vectors)
vectors = np.nan_to_num(vectors)
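The averaging step can be illustrated without training a model. The sketch below assumes a hypothetical dictionary of 3-dimensional embeddings in place of `w2v_model.wv`, and a hypothetical helper `mean_vector`; documents with no known words get a zero vector.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for a trained Word2Vec model.
embeddings = {
    "good": np.array([1.0, 0.0, 2.0]),
    "phone": np.array([3.0, 2.0, 0.0]),
}

def mean_vector(tokens, embeddings, dim):
    """Average the embeddings of the known tokens; zero vector if none are known."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

docs = [["good", "phone"], ["unknownword"], []]
doc_vectors = np.array([mean_vector(d, embeddings, 3) for d in docs])
print(doc_vectors)
```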

In [None]:
from bisect import insort
from sklearn.cluster import DBSCAN

# min_pts = 2 * vector dimensionality (the Word2Vec vectors have 100 dimensions)
min_pts = 2 * 100

def compute_200th_nearest_neighbour(x, data, k=min_pts):
    # Sorted list holding the k smallest squared Euclidean distances seen so far
    dists = []

    for val in data:
        # Squared Euclidean distance between x and the current point
        dist = np.sum((x - val) ** 2)

        if len(dists) < k:
            insort(dists, dist)
        elif dist < dists[-1]:
            # Closer than the current k-th neighbour: insert it and drop the farthest
            insort(dists, dist)
            dists.pop()

    # dists[k - 1] is the squared distance to the k-th (by default 200th) nearest neighbour
    return dists[k - 1]

vectors.shape
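A common way to turn k-th nearest neighbour distances into a DBSCAN run is to sort them and pick `eps` near the bend of the curve. The following is a minimal sketch under stated assumptions: it uses hypothetical toy 2-D points rather than the review vectors, `NearestNeighbors` from scikit-learn instead of a hand-written search, and a percentile as a crude stand-in for reading the elbow off a plot.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Hypothetical toy data: two tight blobs plus two isolated points.
blob_a = rng.normal(loc=0.0, scale=0.1, size=(50, 2))
blob_b = rng.normal(loc=5.0, scale=0.1, size=(50, 2))
outliers = np.array([[2.5, 2.5], [10.0, -10.0]])
points = np.vstack([blob_a, blob_b, outliers])

min_pts = 5

# Querying the fitted points returns each point as its own first neighbour
# (distance 0), which matches DBSCAN counting a point in its own neighbourhood.
nn = NearestNeighbors(n_neighbors=min_pts).fit(points)
distances, _ = nn.kneighbors(points)
k_distances = np.sort(distances[:, -1])

# Assumption: the 90th percentile approximates the elbow of the k-distance curve.
eps = float(np.percentile(k_distances, 90))

labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)

print("eps:", eps)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```

Points labelled `-1` are the ones DBSCAN treats as noise; on real review vectors the percentile would need to be replaced by an inspected elbow value.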

**In one paragraph, please compare the results of K-means, DBSCAN, Hierarchical clustering, Word2Vec, and BERT.**

**Write your response here:**

Unsupervised machine learning algorithms that are commonly used for data clustering include K-means clustering, DBSCAN, and Hierarchical clustering.

K-means is sensitive to initial centroid selection and appropriate for compact, spherical clusters since it finds cluster centers by minimizing intra-cluster distances.

DBSCAN groups points that lie in dense regions, so it can detect arbitrarily shaped clusters and handles noise well, although it needs careful tuning of its parameters (`eps` and `min_samples`).

Hierarchical clustering can be computationally costly, but it builds a tree-like hierarchy of clusters that gives insight into the structure of the data and accommodates clusters of different shapes.

Conversely, Word2Vec and BERT are models for natural language processing rather than clustering algorithms. Word2Vec learns dense vector representations of words that capture semantic relationships, while BERT, a transformer-based model, produces contextualized word embeddings by drawing on bidirectional context and excels at tasks such as text classification and language understanding.

Word2Vec and BERT therefore provide semantic and contextual representations for NLP tasks by embedding words in a continuous vector space, whereas the clustering algorithms group data points based on similarity.
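For reference, the tree-building behaviour of hierarchical clustering can be sketched with scikit-learn's `AgglomerativeClustering`. This is a minimal illustration on hypothetical toy 2-D points with Ward linkage, not a result computed on the review data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Hypothetical toy data: two well-separated blobs of 20 points each.
points = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.1, size=(20, 2)),
])

# Ward linkage merges, at each step, the pair of clusters that increases
# the within-cluster variance the least; the tree is cut at 2 clusters.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(points)

# children_ records the merge tree: one row per merge, n_samples - 1 rows in total.
print(labels)
print(model.children_.shape)
```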

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment.


**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:
Finishing the tasks in this assignment was a worthwhile educational experience for me.
I was able to comprehend the tasks and expectations by reading the directions, which were brief and easy to understand.
I was able to successfully implement the concepts and approaches provided in the course since the activities were both demanding and manageable.

'''