## COSC 74/274: Machine Learning and Statistical Data Analysis

## Spring 2023 Class Final Project

## Katherine Lasonde

### Overview
The goal of the course project is to implement machine learning models and concepts covered
in this course for a real-world dataset. The project will utilize the Amazon product review dataset and focus on binary classification, multi-class classification, and clustering approaches to analyze and categorize product reviews. All code must be implemented in Python and all models must use the Scikit Learn toolkit - https://scikit-learn.org/stable/index.html. You are not allowed to use other toolkits, such as NLTK or transformer network architectures, for your project results.

Here are examples of some useful Scikit modules:
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer
- https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html#sklearn.feature_extraction.text.CountVectorizer

In [1]:
# Imports
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, f1_score, accuracy_score, make_scorer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import TfidfVectorizer
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import autograd.numpy as np
from autograd import grad 
%matplotlib inline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_validate
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.metrics import silhouette_score
from sklearn.metrics.cluster import adjusted_rand_score

In [2]:
# Read in the data
training_data = pd.read_csv('Training.csv')
testing_data = pd.read_csv('Test.csv')

In [3]:
# The cutoff is not an input to the model, but to the experiment.
def threshold_label(rating, threshold=1):
    # all samples with a rating <= cutoff will have label 0
    class_label = 0

    # all samples with a rating > threshold have label 1.
    if rating > threshold:
        class_label = 1
    
    return class_label

In [4]:
# Begin to parse data
training_data['combined_text'] = (training_data['reviewText'].fillna('') + ' ' +  (training_data['summary'].fillna('')))
testing_data['combined_text'] = (testing_data['reviewText'].fillna('') + ' ' +  (testing_data['summary'].fillna('')))

# Concatenate data
all_text = training_data['combined_text'].tolist() + testing_data['combined_text'].tolist()
train_size = len(training_data) # For indexing purposes

# Fit the vectorizer! for both train and test data
tfidf_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', strip_accents='ascii', max_df=0.85, max_features=4000)
tfidf_matrix = tfidf_vectorizer.fit_transform(all_text)
tfidf_matrix_train = tfidf_matrix.toarray()[:train_size]
tfidf_matrix_test = tfidf_matrix.toarray()[train_size:]

# Create a data frame for it
X_train = pd.DataFrame(tfidf_matrix_train, columns=tfidf_vectorizer.get_feature_names())

# Use a lambda function to apply to ea value fo the training data to either 0 or 1
y_train = training_data['overall'].apply(lambda x: threshold_label(x,2)) # Binary

In [5]:
# Do the same with the test data
X_test = pd.DataFrame(tfidf_matrix_test, columns=tfidf_vectorizer.get_feature_names())

In [6]:
# Create a cross validation object with 5-folds
num_folds = 5
cross_validation = StratifiedKFold(n_splits=num_folds)

### Binary classification
In this task, you have to develop binary classifiers to classify product reviews as good or bad. 

The cutoff of ‘goodness’ will be an input, i.e., you have to develop classifiers with the following cutoffs of product rating: 1,2,3,4. 

Note: The cutoff is not an input to the model, but to the experiment. For example, when cutoff=3, all samples with a rating <= 3 will have label 0, and all samples with a rating > 3 have label 1. 
- Report the performance of at least three different classifiers for each of the four cutoffs. 
- Perform cross-validation for hyperparameter tuning. 

Your report should describe why certain model parameters help or hurt the model performance. For each classifier, you should report in your report the confusion matrix, ROC, AUC, macro F1 score, and accuracy for the best combination of hyperparameters using 5-fold cross-validation. We will share a baseline macro F1 score for classification and at least one of your classification models must achieve at least the baseline score for full credit.

In [7]:
# Plot the ROC curve
def plot_roc_curve(false_positive_rate, true_positive_rate):
    # Plot ROC curve
    plt.figure()
    plt.plot(false_positive_rate, true_positive_rate)
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('Receiver Operating Characteristic')
    plt.show()
    
# Define a function to calculate the best model and associated metrics for each cutoff
def evaluate_performance(y_test, y_pred):  
    num_decimal_points = 3

    # Calculate 1. Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # Calculate 2. Accuracy 
    accuracy = accuracy_score(y_test, y_pred)
    
    # Calculate 3. ROC
    false_positive_rate, true_positive_rate, _ = roc_curve(y_test, y_pred)
    # plot_roc_curve(false_positive_rate, true_positive_rate)

    # Calculate 4. AUC
    auc_score = roc_auc_score(y_test, y_pred)
    
    # Calculate 5. Macro F1 score
    f1 = f1_score(y_test, y_pred, average='macro')
    
    print("\tConfusion Matrix: TP = " + str(cm[0][0]) + "\tTN = " + str(cm[1][1]) + "\tFP = " + str(cm[0][1]) + "\tFN = " + str(cm[1][0]))
    print("\tFPR: " + str(round(np.array(false_positive_rate)[1], num_decimal_points)) + " TPR: " + str(round(np.array(true_positive_rate)[1], num_decimal_points)))
    print("\tAccuracy: " + str(round(accuracy, num_decimal_points)))
    print("\tAUC Score: " + str(round(auc_score,num_decimal_points)))
    print("\tF1 Score: " + str(round(f1, num_decimal_points)))
    
    return f1

In [8]:
# Find the best fit for the data given a model type
def find_best_fit(fcn, param, cross_validation, X_train, y_train):    
    # Separate the test and train data with the new y-threshold
    x_train_specific, x_test_specific, y_train_specific, y_test_specific = train_test_split(X_train, y_train, test_size = 0.15)
        
    # Fit the function to the training daata
    fcn.fit(x_train_specific, y_train_specific)
         
    # Define grid search with cross-validation via grid search cv
    grid_search_fcn = GridSearchCV(fcn, param_grid=param, cv=cross_validation, scoring='f1_macro')
        
    # Fit the grid search validation to the data
    grid_search_fcn.fit(x_train_specific, y_train_specific)

    # Predict using the best model
    best_model = grid_search_fcn.best_estimator_
        
    return [best_model, x_test_specific, y_test_specific]

In [9]:
# Export data for submission  
def export_data(best_model, current_threshold):
    y_pred_submission = best_model.predict(X_test)

    # IMPORTANT: use 'id' and 'predicted' as the column names
    test_ids = list(testing_data.index) # the 'id' column name is the index of the test samples
    test_submission_mnb = pd.DataFrame({'id':test_ids, 'binary_split_' + str(current_threshold): y_pred_submission})

    test_submission_mnb.to_csv('test_submission_mnb' + str(current_threshold) + '.csv', index=False)

# Binary classifier 1 - Logistic regression

In [10]:
# First classifier, logistic regression
logistic_regression = LogisticRegression(max_iter=5000)

In [11]:
# Test different parameters
parameters_logistic_regression = {'C': [0.1, 0.5, 1, 5, 10]}
parameters_logistic_regression_1 = {'C': [0.01]}
parameters_logistic_regression_2 = {'C': [0.1]}
parameters_logistic_regression_3 = {'C': [0.01, 0.1]}
parameters_logistic_regression_4 = {'C': [0.01, 0.1, 1]}
parameters_logistic_regression_5 = {'C': [0.1, 1]}
parameters_logistic_regression_6 = {'C': [1]}
parameters_logistic_regression_7 = {'C': [1, 10]}
parameters_logistic_regression_8 = {'C': [0.01, 1, 10]}
parameters_logistic_regression_9 = {'C': [0.01, 0.1, 1, 10]}
parameters_logistic_regression_10 = {'C': [0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10]}
parameters_logistic_regression_11 = {'C': [0.01, 0.05, 0.1, 0.2, 0.25, 0.3, 0.5, 0.6, 0.75, 0.8, 1, 2.5, 5, 7.5, 10]}
parameters_logistic_regression_12 = {'C': [0.01, 0.025, 0.05, 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.75, 0.8, 1, 2.5, 5, 7.5, 10]}

lr_param_array = [parameters_logistic_regression, parameters_logistic_regression_1, parameters_logistic_regression_2, parameters_logistic_regression_3, parameters_logistic_regression_4, parameters_logistic_regression_5, parameters_logistic_regression_6, parameters_logistic_regression_7, parameters_logistic_regression_8, parameters_logistic_regression_9, parameters_logistic_regression_10, parameters_logistic_regression_11, parameters_logistic_regression_12]
mean_f1_score = 0

for i in range(len(lr_param_array)-1):
    [best_model, X_test_lr, y_test_lr] = find_best_fit(logistic_regression, lr_param_array[i], cross_validation, X_train, y_train)
    y_pred_lr = best_model.predict(X_test_lr)
    f1 = f1_score(y_test_lr, y_pred_lr, average='macro')
    mean_f1_score = mean_f1_score + f1
    print("Parameters: " + str(lr_param_array[i]) + " F1: " + str(f1))

mean_f1_score = mean_f1_score/len(lr_param_array)

print('Mean F1 score: ' + str(mean_f1_score))

Parameters: {'C': [0.1, 0.5, 1, 5, 10]} F1: 0.7953245026178931
Parameters: {'C': [0.01]} F1: 0.5135770099165156
Parameters: {'C': [0.1]} F1: 0.7606102190258699
Parameters: {'C': [0.01, 0.1]} F1: 0.7671353712636166
Parameters: {'C': [0.01, 0.1, 1]} F1: 0.7918924056648609
Parameters: {'C': [0.1, 1]} F1: 0.7908520171264223
Parameters: {'C': [1]} F1: 0.7867407153513046
Parameters: {'C': [1, 10]} F1: 0.7891742356613155
Parameters: {'C': [0.01, 1, 10]} F1: 0.7969903922907082
Parameters: {'C': [0.01, 0.1, 1, 10]} F1: 0.794535760835229
Parameters: {'C': [0.1, 0.25, 0.5, 0.75, 1, 2.5, 5, 7.5, 10]} F1: 0.7953250025723486
Parameters: {'C': [0.01, 0.05, 0.1, 0.2, 0.25, 0.3, 0.5, 0.6, 0.75, 0.8, 1, 2.5, 5, 7.5, 10]} F1: 0.7823624992213625
Mean F1 score: 0.7049630870421113


In [None]:
# Eval the model and param for each fold!! 
print("Logistic Regression results: ")

lr_best_models = []

for current_threshold in range(4):
    print("\n Threshold number " + str(current_threshold + 1))
    y_train = training_data['overall'].apply(lambda x: threshold_label(x,current_threshold + 1)) # Binary
    
    # Now.... use the trained model & make ~predictions~ via the .predict method
    [best_model, X_test_lr, y_test_lr] = find_best_fit(logistic_regression, parameters_logistic_regression, cross_validation, X_train, y_train)
    
    y_pred_lr = best_model.predict(X_test_lr)
    
    # Get the metrics (mainly printed except for f1)
    f1 = evaluate_performance(y_test_lr, y_pred_lr)
    
    lr_best_models.append(best_model)

#     export_data(best_model, current_threshold + 1)

# Binary classifier 2 - Multinomial naive bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

# Define the multinomial naive bayes model
multinomial_nb = MultinomialNB()

In [None]:
# Try a bunch of the parameters for grid search
parameters_mnb =  {'alpha': [0.001], 'fit_prior': [True]}
parameters_mnb_1 = {'alpha': [0.001], 'fit_prior': [False]}
parameters_mnb_2 = {'alpha': [0.01], 'fit_prior': [True]}
parameters_mnb_3 = {'alpha': [0.01], 'fit_prior': [False]}
parameters_mnb_4 = {'alpha': [0.1], 'fit_prior': [True]}
parameters_mnb_5 = {'alpha': [0.1], 'fit_prior': [False]}
parameters_mnb_6 = {'alpha': [1], 'fit_prior': [True]}
parameters_mnb_7 = {'alpha': [1], 'fit_prior': [False]}
parameters_mnb_8 = {'alpha': [10], 'fit_prior': [True]}
parameters_mnb_9 = {'alpha': [10], 'fit_prior': [False]}
parameters_mnb_10 = {'alpha': np.logspace(-1, 1, 10), 'fit_prior': [True, False]}
parameters_mnb_11 = {'alpha': np.logspace(-3, 1, 10), 'fit_prior': [True, False]}
parameters_mnb_12 = {'alpha': np.logspace(-3, 3, 10), 'fit_prior': [True, False]}
parameters_mnb_13 = {'alpha': np.logspace(-2, 2, 6), 'fit_prior': [True, False]}

mnb_param_array = [parameters_mnb, parameters_mnb_1, parameters_mnb_2, parameters_mnb_3, parameters_mnb_4, parameters_mnb_5, parameters_mnb_6, parameters_mnb_7, parameters_mnb_8, parameters_mnb_9, parameters_mnb_10, parameters_mnb_11, parameters_mnb_12, parameters_mnb_13]
mean_f1_score_mnb = 0

for i in range(len(mnb_param_array)):
    [best_model, X_test_mnb, y_test_mnb] = find_best_fit(multinomial_nb, mnb_param_array[i], cross_validation, X_train, y_train)
    y_pred_mnb = best_model.predict(X_test_mnb)
    f1_mnb = f1_score(y_test_mnb, y_pred_mnb, average='macro')
    mean_f1_score_mnb = mean_f1_score_mnb + f1_mnb
    print("Parameters: " + str(mnb_param_array[i]) + " f1 : " + str(f1_mnb))

In [None]:
mean_f1_score_all_mnb = mean_f1_score_mnb/len(mnb_param_array) 
print("The mean score for all hyperparameters for multinomial naive bayes: " + str(mean_f1_score_all_mnb))

In [None]:
# Eval the model and param for each fold!! 
print("Multinomial naive bayes results with the selected parameters " + str(parameters_mnb))

mnb_best_models = []

for current_threshold in range(4):
    print("\n Threshold number " + str(current_threshold + 1))
    y_train = training_data['overall'].apply(lambda x: threshold_label(x,current_threshold + 1)) # Binary
    
    # Now.... use the trained model & make ~predictions~ via the .predict method
    [best_model, X_test_mnb, y_test_mnb] = find_best_fit(multinomial_nb, parameters_mnb_11, cross_validation, X_train, y_train)
    
    y_pred = best_model.predict(X_test_mnb)
    
    # Get the metrics (mainly printed except for f1)
    f1 = evaluate_performance(y_test_mnb, y_pred)
    
    mnb_best_models.append(best_model)

In [None]:
print(mnb_best_models)

for i in range(len(mnb_best_models)):
    export_data(mnb_best_models[i], i + 1)

# Binary classifier 3 - Perceptron

In [None]:
from sklearn.datasets import load_digits
from sklearn.linear_model import Perceptron

perceptron = Perceptron(tol=1e-3, random_state=0)
perptron_parameters = {'penalty': ['l2'], 'alpha': [0.0001], 'max_iter': [1000], 'tol': [1e-3]}

In [None]:
# Try a bunch of parameters to see which are the best
parameters_percep_1 = {'penalty': ['l2'], 'alpha': [0.0001], 'max_iter': [1000], 'tol': [1e-3]}
parameters_percep_2 = {'penalty': ['l1'], 'alpha': [0.001], 'max_iter': [1000], 'tol': [1e-3]}
parameters_percep_3 = {'penalty': ['elasticnet'], 'alpha': [0.01], 'max_iter': [1000], 'tol': [1e-3]}
parameters_percep_4 = {'penalty': ['l2'], 'alpha': [0.0001], 'max_iter': [2000], 'tol': [1e-3]}
parameters_percep_5 = {'penalty': ['l1'], 'alpha': [0.001], 'max_iter': [2000], 'tol': [1e-3]}
parameters_percep_6 = {'penalty': ['elasticnet'], 'alpha': [0.01], 'max_iter': [2000], 'tol': [1e-3]}
parameters_percep_7 = {'penalty': ['l2'], 'alpha': [0.0001], 'max_iter': [1000], 'tol': [1e-4]}
parameters_percep_8 = {'penalty': ['l1'], 'alpha': [0.001], 'max_iter': [1000], 'tol': [1e-4]}
parameters_percep_9 = {'penalty': ['elasticnet'], 'alpha': [0.01], 'max_iter': [1000], 'tol': [1e-4]}
parameters_percep_10 = {'penalty': ['None'], 'max_iter': [1000], 'tol': [1e-3]}

parameters_percep_array = [parameters_percep_1, parameters_percep_2, parameters_percep_3, parameters_percep_4, parameters_percep_5, parameters_percep_6, parameters_percep_7, parameters_percep_8, parameters_percep_9, parameters_percep_10]

In [None]:
mean_f1_score_percep = 0

for i in range(len(parameters_percep_array)):
    [best_model, X_test_percep, y_test_percep] = find_best_fit(perceptron, parameters_percep_array[i], cross_validation, X_train, y_train)
    y_pred_percep = best_model.predict(X_test_percep)
    f1_percep = f1_score(y_test_percep, y_pred_percep, average='macro')
    mean_f1_score_percep = mean_f1_score_percep + f1_percep
    print("Parameters: " + str(parameters_percep_array[i]) + " f1 : " + str(f1_percep))

In [None]:
mean_f1_score_all_percep = mean_f1_score_percep/len(parameters_percep_array) 
print("The mean score for all hyperparameters for perceptron: " + str(mean_f1_score_all_percep))

In [None]:
print("Perceptron model with selected parameters of : " + str())

for current_threshold in range(4):
    print("\n Threshold number " + str(current_threshold + 1))
    y_train = training_data['overall'].apply(lambda x: threshold_label(x,current_threshold + 1)) # Binary
    
    # Now.... use the trained model & make ~predictions~ via the .predict method
    [best_model_percep, X_test_percep, y_test_percep] = find_best_fit(perceptron, parameters_percep_10, cross_validation, X_train, y_train)
    
    y_pred_percep = best_model_percep.predict(X_test_percep)
    
    # Get the metrics (mainly printed except for f1)
    f1 = evaluate_performance(y_test_percep, y_pred_percep)
    
    print("Threshold: " + str(current_threshold+1) + ' F1: ' + str(f1))
#     export_data(best_model, current_threshold + 1)


## Multiclass testing and performance

### Logistic regression, multiclass

In [None]:
# First classifier, logistic regression
y_train = training_data['overall']

# Eval the model and param for each fold!! 
print("Logistic Regression multiclass results: ")
# Now.... use the trained model & make ~predictions~ via the .predict method
[best_model_lr_multi, X_test_lr_multi, y_test_lr_multi] = find_best_fit(logistic_regression, parameters_logistic_regression, cross_validation, X_train, y_train)
y_pred_lr_multi = best_model_lr_multi.predict(X_test_lr_multi)

In [None]:
y_pred_lr_proba = logistic_regression.predict_proba(X_test_lr_multi)
f1_lr = evaluate_performance_multi(y_test_lr_multi, y_pred_lr_multi, y_pred_lr_proba)

In [None]:
y_pred_lr_multi_submission = best_model_lr_multi.predict(X_test)

# IMPORTANT: use 'id' and 'predicted' as the column names
test_ids = list(testing_data.index) # the 'id' column name is the index of the test samples
test_submission_dt = pd.DataFrame({'id':test_ids, 'label':y_pred_lr_multi_submission})

test_submission_dt.to_csv('test_submission_lr_multi.csv', index=False)

In [None]:
def evaluate_performance_multi(y_test, y_pred, y_pred_lr_proba):
    num_decimal_points = 3

    # Calculate 1. Confusion matrix
    cm = confusion_matrix(y_test, y_pred)
    
    # Calculate 2. Accuracy 
    accuracy = accuracy_score(y_test, y_pred)
    
    # Calculate 3. ROC
    false_positive_rate, true_positive_rate, _ = roc_curve(y_test, y_pred, pos_label=1)
    plot_roc_curve(false_positive_rate, true_positive_rate)

    # Calculate 4. AUC
    auc_score = roc_auc_score(y_test, y_pred_lr_proba, multi_class='ovr')
    
    # Calculate 5. Macro F1 score
    f1 = f1_score(y_test, y_pred, average='macro')
    
    print("\tConfusion Matrix: \n\t\tTP = " + str(cm[0][0]) + "\tTN = " + str(cm[1][1]) + "\tFP = " + str(cm[0][1]) + "\tFN = " + str(cm[1][0]))
    print("\tFPR: " + str(round(np.array(false_positive_rate)[1], num_decimal_points)) + " TPR: " + str(round(np.array(true_positive_rate)[1], num_decimal_points)))
    print("\tAccuracy: " + str(round(accuracy, num_decimal_points)))
    print("\tAUC Score: " + str(round(auc_score,num_decimal_points)))
    print("\tF1 Score: " + str(round(f1, num_decimal_points)))
    
    return f1

### Multinomial bayes, multiclass

In [None]:
# Eval the model and param for each fold!! 
print("Multinomial Naive Bayes multiclass results: ")

# Now.... use the trained model & make ~predictions~ via the .predict method
[best_model_mnb_multi, X_test_mnb_multi, y_test_mnb_multi] = find_best_fit(multinomial_nb, parameters_mnb_11, cross_validation, X_train, y_train)
y_pred_mnb_multi = best_model_mnb_multi.predict(X_test_mnb_multi)

In [None]:
# Get the metrics (mainly printed except for f1)
y_pred_mnb_proba = multinomial_nb.predict_proba(X_test_mnb_multi)
f1_mnb = evaluate_performance_multi(y_test_mnb_multi, y_pred_mnb_multi, y_pred_mnb_proba)
# export_data(best_model_mnb_multi, current_threshold + 1)

In [None]:
y_pred_mnb_multi_submission = best_model_mnb_multi.predict(X_test)

# IMPORTANT: use 'id' and 'predicted' as the column names
test_ids = list(testing_data.index) # the 'id' column name is the index of the test samples
test_submission_dt = pd.DataFrame({'id':test_ids, 'label':y_pred_mnb_multi_submission})

test_submission_dt.to_csv('test_submission_mnb_multi.csv', index=False)

### Perceptron, multiclass

In [None]:
# Define the Perceptron model
perceptron = Perceptron(tol=1e-3, random_state=0)

# Eval the model and param for each fold!! 
print("Perceptron multiclass results: ")

# Now.... use the trained model & make ~predictions~ via the .predict method
[best_model_percep_multi, X_test_percep_multi, y_test_percep_multi] = find_best_fit(perceptron, parameters_percep_10, cross_validation, X_train, y_train)
y_pred_percep_multi = best_model_percep_multi.predict(X_test_percep_multi)    

In [None]:
# Get the metrics (mainly printed except for f1)
y_pred_percep_proba = perceptron.decision_function(X_test_percep_multi) #perceptron.predict_proba(X_test_percep_multi)
f1_percep = evaluate_performance_multi(y_test_percep_multi, y_pred_percep_multi, y_pred_percep_proba)
# export_data(best_model, current_threshold + 1)

In [None]:
y_pred_percep_multi_submission = best_model_percep_multi.predict(X_test)

# IMPORTANT: use 'id' and 'predicted' as the column names
test_ids = list(testing_data.index) # the 'id' column name is the index of the test samples
test_submission_dt = pd.DataFrame({'id':test_ids, 'multi_split':y_pred_percep_multi_submission})

test_submission_dt.to_csv('test_submission_percep.csv', index=False)

### 3.3 Clustering
In this task, you will cluster the product reviews in the test dataset. You will need to create word features from the data and use that for k-means clustering. Clustering will be done by product types, i.e., in this case, the labels will be product categories. You will use the Silhouette score and Rand index to analyze the quality of clustering. We will share a baseline silhouette score for clustering, and your model must achieve at least the baseline score for full credit.

In [None]:
cluster_labels = training_data['category']
# Fit the vectorizer! for both train and test data
tfidf_vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', strip_accents='ascii', max_df=0.85, max_features=4000)

# transform the deeta
tfidf_matrix_cluster = tfidf_vectorizer.fit_transform(cluster_labels).toarray()

# Create a data frame for it
X_train_clust = pd.DataFrame(tfidf_matrix_cluster, columns=tfidf_vectorizer.get_feature_names())

In [None]:
X_train_cluster, X_test_cluster, y_train_cluster, y_test_cluster = train_test_split(X_train_clust, y_train, test_size = 0.15)

In [None]:
# Cluster the data
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

true_k = 5  # adjust this to the number of product categories you have
k_means_model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
k_means_model.fit=(X_train_cluster, y_train_cluster)
y_pred_cluster_labels = model.predict(X_test_cluster)

# Calculate  silhouette_score and
silhouette_score = silhouette_score(X_test_cluster, y_pred_cluster, metric='euclidean')

# Calculate rand index score
rand_index_score = adjusted_rand_score(y_test_cluster, y_pred_cluster)

print("Silhouette score: ", silhouette_score)
print("Rand Index Score: ", rand_index_score)