# **The fifth in-class-exercise (40 points in total, 11/11/2021)**

(20 points) The purpose of this question is to practice different machine learning algorithms for text classification as well as performance evaluation. In addition, you are required to conduct *10-fold cross validation (https://scikit-learn.org/stable/modules/cross_validation.html)* during training. 

The dataset can be downloaded from here: https://github.com/unt-iialab/info5731_spring2021/blob/main/class_exercises/exercise09_datacollection.zip. The dataset contains two files, training data and test data, for sentiment analysis of IMDB reviews. It has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validation sets (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10-fold cross validation while training the classifier. The final trained model is then evaluated on the test data. 
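The split-then-cross-validate workflow described above might look roughly like the sketch below. It runs on a small synthetic corpus (an assumption, not the IMDB files), and the variable names are illustrative. Wrapping `CountVectorizer` and the classifier in a pipeline is one possible design: the vocabulary is then re-learned on the training part of every fold instead of on all the data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate, train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in corpus (assumption): 30 positive and 30 negative reviews
texts = [f"great wonderful movie number{i}" for i in range(30)]
texts += [f"terrible boring movie number{i}" for i in range(30)]
labels = [1] * 30 + [0] * 30

# 80% training / 20% validation, keeping the class balance in both parts
X_train, X_val, y_train, y_val = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=0
)

# The pipeline refits the vectorizer on the training part of every fold
model = make_pipeline(CountVectorizer(), MultinomialNB())
cv_scores = cross_validate(
    model, X_train, y_train, cv=10,
    scoring=["accuracy", "recall", "precision", "f1"],
)
print("mean CV accuracy:", cv_scores["test_accuracy"].mean())

# Fit once on the full training split and check the held-out validation split
model.fit(X_train, y_train)
val_accuracy = model.score(X_val, y_val)
print("validation accuracy:", val_accuracy)
```

The test file is not touched here; it would only be used once, after the model has been chosen.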

Algorithms:

(1) MultinomialNB

(2) SVM 

(3) KNN 

(4) Decision tree

(5) Random Forest

(6) XGBoost

Evaluation measurement:

(1) Accuracy

(2) Recall

(3) Precision 

(4) F-1 score
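
As a reminder of what the four measurements mean, here is a minimal pure-Python sketch that computes them from true and predicted labels for the binary case (1 = positive, 0 = negative). The helper name `binary_metrics` and the example labels are made up for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, recall, precision and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall,
            "precision": precision, "f1": f1}


# Made-up labels: 3 true positives, 1 false negative, 2 false positives, 2 true negatives
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```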

In [None]:
# Importing required libraries
import re, string, nltk
import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
import gensim
from scipy.cluster import hierarchy
#nltk.download("stopwords")
#nltk.download('punkt')
#nltk.download('wordnet')

In [None]:
# Write your code here
with open("stsa-train.txt", 'r') as file: # open the txt file 
    data = file.read().splitlines()
sentiments= [int(review[0]) for review in data] # extract sentiments from txt file
reviews = [review[2:] for review in data] # extract reviews from the txt file

In [None]:
stopwords_list = stopwords.words('english') # load English stopwords
punctuations_list = string.punctuation # punctuation characters
tokenizer = nltk.tokenize.TweetTokenizer() # initialize tokenizer
lemmatizer = WordNetLemmatizer() # initialize word lemmatizer
def preprocessing(text):
    """
    Clean the given text: strip HTML tags, lowercase, drop stopwords and
    punctuation, and lemmatize the remaining words.
    """
    text = re.sub(r'<[^>]*>', '', text) # remove HTML tags
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text) # collect emoticons such as :-) or ;P
    text = re.sub(r'\W+', ' ', text.lower()) # replace runs of non-word characters with one space
    text = text + " " + " ".join(emoticons).replace('-', '') # note: the isalnum() filter below discards these again
    tokenize_text = [lemmatizer.lemmatize(word.lower()) for word in nltk.tokenize.word_tokenize(text) if (word not in stopwords_list) and (word not in punctuations_list) and (len(word)>=2) and (word.isalnum())]
    return " ".join(tokenize_text)
reviews = list(map(preprocessing, reviews)) # clean all the reviews read from the txt file
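
To see what the three regular expressions in the cleaning step do, the following sketch applies them one at a time to a made-up sentence. It uses only the standard library; the sample text is an assumption chosen for illustration.

```python
import re

sample = "<br />Great movie :-) but too long!!"

# 1. Drop anything that looks like an HTML tag
no_tags = re.sub(r"<[^>]*>", "", sample)

# 2. Collect emoticons (eyes, optional nose, mouth) and strip the nose
emoticons = re.findall(r"(?::|;|=)(?:-)?(?:\)|\(|D|P)", no_tags)
emoticons = [e.replace("-", "") for e in emoticons]

# 3. Lowercase and collapse every run of non-word characters into one space
normalized = re.sub(r"\W+", " ", no_tags.lower())

print(repr(no_tags))
print(emoticons)
print(repr(normalized))
```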

In [None]:
X_train, X_test, y_train, y_test = train_test_split(reviews, sentiments, test_size=0.2) # split the data: 80% train, 20% validation
c_vectorize=CountVectorizer()
X_train_reviews = c_vectorize.fit_transform(X_train) # learn the vocabulary on the training split and vectorize it
X_test_reviews = c_vectorize.transform(X_test) # vectorize the validation split with the same vocabulary
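
The difference between fitting and transforming can be illustrated with a simplified bag-of-words written in plain Python. This is only a sketch of the idea: the helper names are hypothetical, and the real `CountVectorizer` tokenizes differently (for example, it lowercases and drops one-character tokens by default).

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map every word seen in the training documents to a column index."""
    words = sorted({word for doc in docs for word in doc.split()})
    return {word: index for index, word in enumerate(words)}


def transform(docs, vocabulary):
    """Turn each document into a row of word counts, one column per known word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Words that are not in the vocabulary are silently ignored
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


train_docs = ["good movie", "bad movie movie"]
val_docs = ["good unseen plot"]

vocabulary = fit_vocabulary(train_docs)        # learned from training data only
train_rows = transform(train_docs, vocabulary)
val_rows = transform(val_docs, vocabulary)     # reuses the training vocabulary

print(vocabulary)
print(train_rows)
print(val_rows)
```

Words that appear only in the validation text ("unseen", "plot") get no column, which is why the vocabulary has to come from the training split.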


In [None]:
# Validation 
"""
Algorithms:

(1) MultinomialNB

(2) SVM 

(3) KNN 

(4) Decision tree

(5) Random Forest

(6) XGBoost
"""
# fit each of the algorithms listed above on the training split
multinomial_model = MultinomialNB().fit(X_train_reviews, y_train)  #(1)
svm_model = LinearSVC().fit(X_train_reviews, y_train)  #(2)
knn_model = KNeighborsClassifier().fit(X_train_reviews, y_train)  #(3)
d_tree_model = DecisionTreeClassifier().fit(X_train_reviews, y_train)  #(4)
r_forest_model = RandomForestClassifier().fit(X_train_reviews, y_train)  #(5)
x_g_boost_model = GradientBoostingClassifier().fit(X_train_reviews, y_train)  #(6) scikit-learn gradient boosting, used here in place of the xgboost package
# print the accuracy of each model on the validation split
print("Accuracy of 'MultinomialNB' on validation data: ", multinomial_model.score(X_test_reviews, y_test))
print("Accuracy of 'SVM' on validation data: ", svm_model.score(X_test_reviews, y_test))
print("Accuracy of 'KNN' on validation data: ", knn_model.score(X_test_reviews, y_test))
print("Accuracy of 'Decision tree' on validation data: ", d_tree_model.score(X_test_reviews, y_test))
print("Accuracy of 'Random Forest' on validation data: ", r_forest_model.score(X_test_reviews, y_test))
print("Accuracy of 'XGBoost' on validation data: ", x_g_boost_model.score(X_test_reviews, y_test))

In [None]:
scoring = ["accuracy", 'recall', 'precision', 'f1'] # evaluation metrics
X = c_vectorize.fit_transform(reviews) # vectorize all training reviews
y = sentiments 
# Apply 10-fold cross validation to each of the algorithms mentioned above
mulit_nomial_scores = cross_validate(MultinomialNB(), X, y, scoring=scoring, cv=10)
svm_score = cross_validate(LinearSVC(), X, y, scoring=scoring, cv=10)
knn_score = cross_validate(KNeighborsClassifier(), X, y, scoring=scoring, cv=10)
d_tree_score = cross_validate(DecisionTreeClassifier(), X, y, scoring=scoring, cv=10)
r_forest_score = cross_validate(RandomForestClassifier(), X, y, scoring=scoring, cv=10)
xg_boost_score = cross_validate(GradientBoostingClassifier(), X, y, scoring=scoring, cv=10)
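
What `cv=10` does to the data can be pictured with a small index-splitting sketch. This plain-Python version uses contiguous, unshuffled folds and a hypothetical helper name. As far as I know, scikit-learn uses stratified folds when an integer `cv` is passed with a classifier, so the exact indices will differ, but the principle is the same: every sample lands in the validation part exactly once.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=25, k=10))
for fold_number, (train, validation) in enumerate(folds, start=1):
    print(f"fold {fold_number}: train on {len(train)} samples, validate on {validation}")
```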

In [None]:
# Print the cross-validation metrics (mean over the 10 folds) for each model
cv_results = {
    "MultinomialNB": mulit_nomial_scores,
    "SVM": svm_score,
    "KNN": knn_score,
    "Decision tree": d_tree_score,
    "Random Forest": r_forest_score,
    "XGBoost": xg_boost_score,
}
for name, scores in cv_results.items():
    print(f"-------- {name}: 10-fold cross validation on the training data --------")
    print("Accuracy:\t", scores["test_accuracy"].mean())
    print("Recall:\t", scores["test_recall"].mean())
    print("Precision:\t", scores["test_precision"].mean())
    print("F-1:\t", scores["test_f1"].mean())
    print()


In [None]:
# Write your code here
with open("stsa-test.txt", 'r') as file: # open the test file
    data = file.read().splitlines()
test_sentiments = [int(review[0]) for review in data] # extract sentiments (first character of each line)
test_reviews = [review[2:] for review in data] # extract test reviews (text after the label)

In [None]:
test_reviews = list(map(preprocessing, test_reviews)) # clean the reviews

In [None]:
X = c_vectorize.fit_transform(reviews) # refit the vectorizer on all training reviews
y = sentiments
X_test_d = c_vectorize.transform(test_reviews) # vectorize the test reviews with the training vocabulary
y_test_d = test_sentiments

In [None]:
# fit each algorithm on all training reviews and sentiments
multinomial_model = MultinomialNB().fit(X, y)
svm_model = LinearSVC().fit(X, y)
knn_model = KNeighborsClassifier().fit(X, y)
d_tree_model = DecisionTreeClassifier().fit(X, y)
r_forest_model = RandomForestClassifier().fit(X, y)
x_g_boost_model = GradientBoostingClassifier().fit(X, y)
# print the accuracy of each trained model on the test reviews and sentiments
print("Accuracy of 'MultinomialNB' on test data: ", multinomial_model.score(X_test_d, y_test_d))
print("Accuracy of 'SVM' on test data: ", svm_model.score(X_test_d, y_test_d))
print("Accuracy of 'KNN' on test data: ", knn_model.score(X_test_d, y_test_d))
print("Accuracy of 'Decision tree' on test data: ", d_tree_model.score(X_test_d, y_test_d))
print("Accuracy of 'Random Forest' on test data: ", r_forest_model.score(X_test_d, y_test_d))
print("Accuracy of 'XGBoost' on test data: ", x_g_boost_model.score(X_test_d, y_test_d))

In [None]:
scoring = ["accuracy", 'recall', 'precision', 'f1'] # evaluation metrics
# Apply 10-fold cross validation on the test data. Note: cross_validate refits a fresh
# copy of each estimator on the test folds, so it does not score the models trained above.
mulit_nomial_scores = cross_validate(multinomial_model, X_test_d, y_test_d, scoring=scoring, cv=10)
svm_score = cross_validate(svm_model, X_test_d, y_test_d, scoring=scoring, cv=10)
knn_score = cross_validate(knn_model, X_test_d, y_test_d, scoring=scoring, cv=10)
d_tree_score = cross_validate(d_tree_model, X_test_d, y_test_d, scoring=scoring, cv=10)
r_forest_score = cross_validate(r_forest_model, X_test_d, y_test_d, scoring=scoring, cv=10)
xg_boost_score = cross_validate(x_g_boost_model, X_test_d, y_test_d, scoring=scoring, cv=10) # fixed name: was X_test_dg_boost_score, which left the printed XGBoost scores stale

In [None]:
# Print the cross-validation metrics (mean over the 10 folds) on the test data
cv_results = {
    "MultinomialNB": mulit_nomial_scores,
    "SVM": svm_score,
    "KNN": knn_score,
    "Decision tree": d_tree_score,
    "Random Forest": r_forest_score,
    "XGBoost": xg_boost_score,
}
for name, scores in cv_results.items():
    print(f"-------- {name}: 10-fold cross validation on the test data --------")
    print("Accuracy:\t", scores["test_accuracy"].mean())
    print("Recall:\t", scores["test_recall"].mean())
    print("Precision:\t", scores["test_precision"].mean())
    print("F-1:\t", scores["test_f1"].mean())
    print()
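
An alternative way to report all four metrics for the final model is to predict on the test set with the already-trained classifier and score those predictions directly, rather than cross-validating on the test data. The sketch below shows this on a made-up four-review training set; the texts and names are assumptions for illustration, not the exercise data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.naive_bayes import MultinomialNB

# Made-up data standing in for the train and test files
train_texts = ["great fun film", "wonderful acting", "boring plot", "terrible film"]
train_labels = [1, 1, 0, 0]
test_texts = ["great acting", "boring terrible plot"]
test_labels = [1, 0]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # vocabulary from training data only
X_test = vectorizer.transform(test_texts)

model = MultinomialNB().fit(X_train, train_labels)  # trained once, on training data
predictions = model.predict(X_test)                 # the test set is only predicted on

test_metrics = {
    "accuracy": accuracy_score(test_labels, predictions),
    "recall": recall_score(test_labels, predictions),
    "precision": precision_score(test_labels, predictions),
    "f1": f1_score(test_labels, predictions),
}
print(test_metrics)
```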


(20 points) The purpose of this question is to practice different machine learning algorithms for text clustering.
Please download the dataset using the following link: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones
(You can also use different text data of your choice.)

Apply the listed clustering methods to the dataset:

K means, 
DBSCAN,
Hierarchical clustering. 

You can refer to the code at the following link: 
https://www.kaggle.com/karthik3890/text-clustering 

In [None]:
#Write your code here.
df = pd.read_csv("Amazon_Unlocked_Mobile.csv", usecols=["Reviews"], nrows=2000) # import the data as a pandas dataframe
df['Reviews'] = df['Reviews'].replace('', np.nan) # replace empty reviews with NaN
df = df.dropna(subset=['Reviews']).reset_index(drop=True) # drop NaN rows and renumber so labels match positions
amazon_reviews = df['Reviews'].tolist()  # get reviews as a list

In [None]:
amazon_reviews = list(map(preprocessing, amazon_reviews)) # clean the reviews

In [None]:
count_vect = CountVectorizer()
bow = count_vect.fit_transform(amazon_reviews) # convert reviews to bag-of-words count vectors

In [None]:
model = KMeans(n_clusters = 10,init='k-means++',random_state=99) # initialize k-means model 
model.fit(bow) # fit the k-means model to the review vectors
cluster_labels = model.labels_ # get cluster labels
df['K-mean Clusters'] = cluster_labels # save cluster labels in dataframe
cluster_center=model.cluster_centers_ # get cluster center

s_score = silhouette_score(bow, cluster_labels) # get silhouette score
print("Silhouette_score of K-means clustering: ", s_score)
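
For intuition about the number printed above: the silhouette coefficient of one point is (b - a) / max(a, b), where a is its mean distance to the other points in its own cluster and b is its mean distance to the nearest other cluster. The sketch below computes the mean over four 1-D points in plain Python; the function name and data are illustrative, and it assumes at least two clusters.

```python
def mean_silhouette(points, labels):
    """Mean silhouette coefficient for 1-D points, using absolute distance."""
    scores = []
    for i, (point, label) in enumerate(zip(points, labels)):
        # a: mean distance to the other members of the same cluster
        own = [abs(point - q) for j, (q, l) in enumerate(zip(points, labels))
               if l == label and j != i]
        if not own:  # a singleton cluster gets a score of 0 by convention
            scores.append(0.0)
            continue
        a = sum(own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(point - q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in set(labels) if other != label
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [0.0, 1.0, 10.0, 11.0]
good = mean_silhouette(points, [0, 0, 1, 1])  # nearby points share a cluster
bad = mean_silhouette(points, [0, 1, 0, 1])   # nearby points are split apart
print(good, bad)
```

Values close to 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or wrongly assigned points.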

In [None]:
# print the top 10 keywords of each of the 10 clusters
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = count_vect.get_feature_names_out()
for i in range(10):
    print("Cluster -->", i+1, '-->', [terms[ind] for ind in order_centroids[i, :10]])

In [None]:
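# Print one example review from each cluster
groups = df.groupby('K-mean Clusters').groups # maps cluster label -> row index labels
for i in range(10):
    print("*"*70, "\n","Cluster: ", i+1)
    print(df.loc[groups[i][0], "Reviews"]) # .loc because groups holds index labels, not positions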

In [None]:
model = DBSCAN(min_samples=10) # initialize DBSCAN model (default eps=0.5)
model.fit(bow) # fit the model on the vectorized reviews
cluster_labels = model.labels_ # get the cluster labels (-1 marks noise points)
df['DBSCAN Clusters'] = cluster_labels
s_score = silhouette_score(bow, cluster_labels) # silhouette score; requires at least two distinct labels
print("Silhouette_score of DBSCAN clustering: ", s_score)

In [None]:
groups = df.groupby('DBSCAN Clusters').groups # maps cluster label -> row index labels
for label in sorted(set(cluster_labels) - {-1}): # print one example review per cluster, skipping noise (-1)
    print("*"*70, "\n", "Cluster: ", label+1)
    print(df.loc[groups[label][0], "Reviews"])
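
DBSCAN reports noise points with the label -1, so the number of clusters is not simply the number of distinct labels. A minimal sketch on made-up 2-D points (the coordinates, `eps`, and `min_samples` values are assumptions chosen so the outcome is easy to check by hand):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of three points each, plus one isolated point
points = np.array([
    [0.0, 0.0], [0.0, 0.1], [0.1, 0.0],
    [5.0, 5.0], [5.0, 5.1], [5.1, 5.0],
    [20.0, 20.0],
])

labels = DBSCAN(eps=0.5, min_samples=3).fit(points).labels_

n_noise = int(np.sum(labels == -1))                         # points in no cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # exclude the noise label

print("labels:", labels.tolist())
print("clusters:", n_clusters, "noise points:", n_noise)
```

On sparse, high-dimensional count vectors the default `eps=0.5` may leave most reviews as noise, so it is worth checking these two counts before interpreting the silhouette score.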

In [None]:
# Word2Vec expects an iterable of token lists, so split each cleaned review into words
tokenized_reviews = [review.split() for review in amazon_reviews]
w2v_model = gensim.models.Word2Vec(tokenized_reviews) # train word vectors (defaults: 100 dimensions, min_count=5)

In [None]:
# Represent each review by the average of the vectors of its in-vocabulary words
sent_vectors = []
for tokens in tokenized_reviews:
    sent_vec = np.zeros(w2v_model.wv.vector_size)
    cnt_words = 0
    for word in tokens:
        if word in w2v_model.wv: # rare words (below min_count) have no vector
            sent_vec += w2v_model.wv[word]
            cnt_words += 1
    if cnt_words: # reviews with no known words keep an all-zero vector
        sent_vec /= cnt_words
    sent_vectors.append(sent_vec)
sent_vectors = np.array(sent_vectors) # one averaged vector per review
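
The idea of turning a review into a single vector by averaging its word vectors can be shown without gensim. The sketch below uses a hypothetical two-dimensional embedding table and a made-up helper name; note that the review string is split into words before lookup, and that words without a vector are skipped.

```python
# Hypothetical 2-D word vectors (real Word2Vec vectors are learned from data)
embeddings = {
    "good": [1.0, 0.0],
    "phone": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}


def average_vector(tokens, embeddings, size=2):
    """Average the vectors of the known tokens; all zeros if none are known."""
    known = [embeddings[token] for token in tokens if token in embeddings]
    if not known:
        return [0.0] * size
    # zip(*known) groups the values of each dimension together
    return [sum(values) / len(known) for values in zip(*known)]


review_vector = average_vector("good phone".split(), embeddings)
mixed_vector = average_vector("bad phone whatever".split(), embeddings)
empty_vector = average_vector("whatever".split(), embeddings)
print(review_vector, mixed_vector, empty_vector)
```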

In [None]:
dendro=hierarchy.dendrogram(hierarchy.linkage(sent_vectors,method='ward')) # show hierarchy

In [None]:
# the dendrogram above suggests a total of 4 clusters
cluster = AgglomerativeClustering(n_clusters=4, linkage='ward')  # n_clusters=4 taken from the dendrogram; ward linkage always uses Euclidean distance
Agg=cluster.fit_predict(sent_vectors)

In [None]:

df['Hierarchical Clusters'] = cluster.labels_ # store cluster labels
s_score = silhouette_score(sent_vectors, cluster.labels_) # score on the same vectors that were clustered
print("Silhouette_score of hierarchical clustering: ", s_score) # print silhouette score

In [None]:
groups = df.groupby('Hierarchical Clusters').groups # maps cluster label -> row index labels
for i in range(len(set(cluster.labels_))): # print one example review per cluster
    print("*"*70, "\n", "Cluster: ", i+1)
    print(df.loc[groups[i][0], "Reviews"])

In one paragraph, please compare K means, DBSCAN and Hierarchical clustering.

In [None]:
#You can write your answer here. (No code needed)
"""
Clustering is an unsupervised method that groups similar data points (in our case, text reviews) into clusters.
K-means is the simplest and fastest of the three: it partitions the data into exactly k clusters, but k must be chosen in advance,
the result depends on the initial centroids, and it works best when clusters are roughly spherical and similar in size.
DBSCAN is density-based: it does not need the number of clusters, it can find clusters of arbitrary shape, and it labels sparse points as noise (-1),
but its results are sensitive to eps and min_samples, which are hard to tune for high-dimensional text vectors.
Hierarchical (agglomerative) clustering builds a tree of nested clusters that can be inspected with a dendrogram and cut at any level to choose the number of clusters,
but it needs more time and memory than K-means, so it scales poorly to large datasets.
"""
