<center><h1>Haidar_Anastasia_HW7</h1></center>

Name: Anastasia Haidar
<br>
Github Username: haidarnastya
<br>
USC ID: 1163-9833-46

## 1. Multi-class and Multi-Label Classification Using Support Vector Machines

Import packages

In [9]:
import pandas as pd
import numpy as np
from sklearn.svm import SVC, LinearSVC
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import make_scorer, hamming_loss, accuracy_score, classification_report, silhouette_score
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

### (a) Download the Anuran Calls (MFCCs) Data Set

In [2]:
mfcc_data = pd.read_csv('./data/Frogs_MFCCs.csv')

#Choose 70% of the recordings (RecordID) at random as the training set,
#so that syllables from one recording never appear in both training and test data
record_id = mfcc_data['RecordID'].unique()
np.random.seed(42)
training_ids = np.random.choice(record_id, size=int(0.7*len(record_id)), replace=False)

training_data = mfcc_data[mfcc_data['RecordID'].isin(training_ids)]
test_data = mfcc_data[~mfcc_data['RecordID'].isin(training_ids)]
print('Training data shape:', training_data.shape)
print('Test data shape:', test_data.shape)


Training data shape: (3445, 26)
Test data shape: (3750, 26)
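
Because the split is drawn over `RecordID` values rather than rows, 70% of the recordings does not mean 70% of the rows: the shapes above give 3445 of 7195 rows (about 48%) for training. The sketch below illustrates this effect on toy data. The helper `split_by_group` and the toy group sizes are hypothetical and not part of the assignment code; it only mirrors the idea of sampling group IDs without replacement.

```python
import numpy as np

def split_by_group(groups, train_fraction=0.7, seed=42):
    """Return boolean train/test row masks such that no group straddles the split."""
    groups = np.asarray(groups)
    unique_groups = np.unique(groups)
    rng = np.random.default_rng(seed)
    n_train = int(train_fraction * len(unique_groups))
    # Sample group IDs (not rows) without replacement.
    train_groups = rng.choice(unique_groups, size=n_train, replace=False)
    train_mask = np.isin(groups, train_groups)
    return train_mask, ~train_mask

# Toy data (hypothetical): 10 recordings with very different numbers of rows.
toy_groups = np.repeat(np.arange(10), [50, 5, 5, 5, 5, 5, 5, 5, 5, 10])
train_mask, test_mask = split_by_group(toy_groups)

shared = set(toy_groups[train_mask]) & set(toy_groups[test_mask])
print("groups shared by train and test:", len(shared))
print("fraction of rows in train:", train_mask.mean())
```

The row fraction depends on which recordings happen to be drawn, which is why it can land far from 0.7 when a few recordings are much longer than the rest.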


### (b) Train a classifier for each label
Each instance has three labels: Families, Genus, and Species. Each of the labels
has multiple classes. We wish to solve a multi-class and multi-label problem.
One of the most important approaches to multi-label classification is to train a
classifier for each label (binary relevance). We first try this approach:

#### (i) Research

Research exact match and Hamming score/loss methods for evaluating multi-label classification and use them in evaluating the classifiers in this problem.

Hamming loss is the fraction of individual labels that are predicted incorrectly: the number of wrong labels divided by the total number of labels (instances × labels per instance), so it is normalized to lie between 0 and 1. The Hamming score is the proportion of labels that are predicted correctly (here, 1 − Hamming loss) and is a less strict assessment than exact match. Exact match (the complement of the 0-1 loss) counts an instance as correct only if all of its labels are predicted correctly; any partially incorrect prediction counts as a failure.

https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics
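
As a small illustration of these definitions, the sketch below computes the three metrics for (Family, Genus, Species) triplets. The function names and the placeholder labels (`F1`, `G1`, ...) are hypothetical and chosen only for this example; it assumes the Hamming score is taken as 1 − Hamming loss, as described above.

```python
import numpy as np

def exact_match_ratio(y_true, y_pred):
    """Fraction of instances whose whole label tuple is predicted correctly."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(np.all(y_true == y_pred, axis=1)))

def hamming_loss_multilabel(y_true, y_pred):
    """Fraction of individual labels (over all instances and label columns) that are wrong."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))

# Rows are instances, columns are (Family, Genus, Species); values are placeholders.
y_true = np.array([["F1", "G1", "S1"],
                   ["F1", "G2", "S3"],
                   ["F2", "G4", "S7"],
                   ["F2", "G4", "S8"]])
y_pred = np.array([["F1", "G1", "S1"],   # 0 wrong labels
                   ["F1", "G2", "S2"],   # 1 wrong label
                   ["F2", "G5", "S9"],   # 2 wrong labels
                   ["F1", "G3", "S5"]])  # 3 wrong labels

loss = hamming_loss_multilabel(y_true, y_pred)
print("Exact match:  ", exact_match_ratio(y_true, y_pred))  # 1 of 4 rows -> 0.25
print("Hamming loss: ", loss)                               # 6 of 12 labels -> 0.5
print("Hamming score:", 1 - loss)                           # 0.5
```

Only the first row counts toward exact match, while the Hamming metrics still give partial credit to rows 2 and 3.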


#### (ii) Train a SVM for each of the labels

In [None]:
#Train a SVM for each of the labels, using Gaussian kernels and one versus all classifiers. 
#Determine the weight of the SVM penalty and the width of the Gaussian Kernel using 10 fold cross validation.

###RAW ATTRIBUTES###
feature_cols = [col for col in training_data.columns if col.startswith('MFCCs_')]
x_train = training_data[feature_cols].values
x_test = test_data[feature_cols].values


#get record ids for grouped cross validation
training_groups = training_data['RecordID'].values
test_groups = test_data['RecordID'].values

#select label to train on
labels = ['Family', 'Genus', 'Species']
results = {}

#loop through each label
for label in labels:
    y_train = training_data[label].values
    y_test = test_data[label].values
    print("Training SVM for label:", label)

    ### (1) FIND PARAMETER RANGES FOR MODEL ###
    accuracy_threshold = 0.70

    C_test_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000]
    valid_C = []
    #test c values (gamma = 1.0)
    for C in C_test_values:
        svm_model = SVC(kernel = 'rbf', C=C, gamma=1.0, random_state=42)
        svm_model.fit(x_train, y_train)
        training_accuracy = svm_model.score(x_train, y_train)
        print(f"C={C}, Training Accuracy={training_accuracy}")
        if training_accuracy >= accuracy_threshold:
            valid_C.append(C)

    #test gamma values
    gamma_test_values = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
    valid_gamma = []

    for gamma in gamma_test_values:
        svm_model = SVC(kernel='rbf', C=1.0, gamma=gamma, random_state=42)
        svm_model.fit(x_train, y_train)
        training_accuracy = svm_model.score(x_train, y_train)
        print(f"Gamma={gamma}, Training Accuracy={training_accuracy}")
        if training_accuracy >= accuracy_threshold:
            valid_gamma.append(gamma)

    #finalize gamma and C ranges
    C_min, C_max = min(valid_C), max(valid_C)
    gamma_min, gamma_max = min(valid_gamma), max(valid_gamma)
    print ('C range:', C_min, '-', C_max)
    print ('Gamma range:', gamma_min, '-', gamma_max)

    ###(2) CROSS VALIDATION FOR BEST PARAMS###
    n_points = 10

    #for C use log spacing
    #for gamma use linear spacing
    log_C_min = np.log10(C_min)
    log_C_max = np.log10(C_max)
    C_values = np.logspace(log_C_min, log_C_max, n_points)
    gamma_values = np.linspace(gamma_min, gamma_max, n_points)

    param_grid = {'C': C_values, 'gamma': gamma_values}
    print(param_grid)

    print('Cross validating for label:', label)
    n_folds = 10
    gkf = GroupKFold(n_splits=n_folds)

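    #note: SVC always trains one-vs-one internally; decision_function_shape='ovr' only changes the
    #shape of the decision function. A strict one-vs-all model would wrap SVC in OneVsRestClassifier.
    svm_cv = SVC(kernel='rbf', decision_function_shape='ovr', random_state=42)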

    grid_search = GridSearchCV(estimator=svm_cv, param_grid=param_grid, scoring='accuracy', cv=gkf, n_jobs=-1, verbose=2)

    #fit grid search with the grouped recordIDs
    grid_search.fit(x_train, y_train, groups=training_groups)

    print('Best parameters:')
    print('Best C =', grid_search.best_params_['C'])
    print('Best gamma =', grid_search.best_params_['gamma'])
    print('Best cross-validation accuracy =', grid_search.best_score_)

    #get best model
    best_model = grid_search.best_estimator_

    #store results
    results[label] = {'best_C': grid_search.best_params_['C'], 'best_gamma': grid_search.best_params_['gamma'], 'cv_accuracy': grid_search.best_score_, 'model': best_model}

print('Results following Cross Validation:')
for label, res in results.items():
    print(f"Label: {label}, Best C: {res['best_C']}, Best gamma: {res['best_gamma']}, CV Accuracy: {res['cv_accuracy']}")

###(3) USE BEST MODEL ON TEST SET PREDICTION ###
#print test error result for each label
print('Test Set Results:')
for label, res in results.items():
    best_model = res['model']
    y_test = test_data[label].values
    y_pred = best_model.predict(x_test)
    test_err = 1 - accuracy_score(y_test, y_pred)
    print(f"Label: {label}, Test Error: {test_err}")


Training SVM for label: Family
C=0.001, Training Accuracy=0.46879535558780844
C=0.01, Training Accuracy=0.7866473149492017
C=0.1, Training Accuracy=0.9515239477503629
C=1, Training Accuracy=0.9851959361393323
C=10, Training Accuracy=0.9979680696661829
C=100, Training Accuracy=1.0
C=1000, Training Accuracy=1.0
C=10000, Training Accuracy=1.0
C=100000, Training Accuracy=1.0
C=1000000, Training Accuracy=1.0
Gamma=0.001, Training Accuracy=0.5297532656023222
Gamma=0.01, Training Accuracy=0.783744557329463
Gamma=0.1, Training Accuracy=0.9210449927431059
Gamma=0.5, Training Accuracy=0.9753265602322206
Gamma=1.0, Training Accuracy=0.9851959361393323
Gamma=2.0, Training Accuracy=0.9939042089985486
Gamma=5.0, Training Accuracy=0.9985486211901307
Gamma=10.0, Training Accuracy=0.9994194484760522
C range: 0.01 - 1000000
Gamma range: 0.01 - 10.0
{'C': array([1.00000000e-02, 7.74263683e-02, 5.99484250e-01, 4.64158883e+00,
       3.59381366e+01, 2.78255940e+02, 2.15443469e+03, 1.66810054e+04,
       1.
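
The grid construction in step (2) can be checked in isolation. The sketch below rebuilds the grid from the Family ranges printed above (C from 0.01 to 1000000, gamma from 0.01 to 10.0); `build_grid` is a hypothetical helper written for this illustration, not part of the cell above.

```python
import numpy as np

def build_grid(valid_C, valid_gamma, n_points=10):
    """Log-spaced C values and linearly spaced gamma values between the observed extremes."""
    C_values = np.logspace(np.log10(min(valid_C)), np.log10(max(valid_C)), n_points)
    gamma_values = np.linspace(min(valid_gamma), max(valid_gamma), n_points)
    return {"C": C_values, "gamma": gamma_values}

# Extremes of the ranges reported for the Family label.
grid = build_grid([0.01, 1e6], [0.01, 10.0])
print(grid["C"][:3])
print(grid["gamma"][:3])
print("candidates per label:", len(grid["C"]) * len(grid["gamma"]))
```

With linear spacing, only the first gamma value (0.01) lies below about 1.12, although the training-accuracy scan changes most between 0.001 and 0.1. A log-spaced gamma grid would be one alternative if finer resolution at small gamma is wanted.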

#### (iii) Repeat 1(b)ii with L1-penalized SVMs

In [None]:
#Repeat 1(b)ii with L1-penalized SVMs.
#Remember to standardize the attributes. 
#Determine the weight of the SVM penalty using 10 fold cross validation.

###STANDARDIZE ATTRIBUTES###
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

#get record ids for grouped cross validation
training_groups = training_data['RecordID'].values
test_groups = test_data['RecordID'].values

#select label to train on
labels = ['Family', 'Genus', 'Species']
results_l1 = {}

#loop through each label
for label in labels:
    y_train = training_data[label].values
    y_test = test_data[label].values
    print("Training L1 SVM for label:", label)

    ### (1) FIND PARAMETER RANGES FOR MODEL ###
    accuracy_threshold = 0.70

    C_test_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000]
    valid_C = []
    #test C values (a linear SVM has no gamma parameter)
    for C in C_test_values:
        l1_svm_model = LinearSVC(penalty = 'l1', dual=False, C=C, max_iter=10000, random_state=42)
        l1_svm_model.fit(x_train_scaled, y_train)
        training_accuracy = l1_svm_model.score(x_train_scaled, y_train)
        print(f"C={C}, Training Accuracy={training_accuracy}")
        if training_accuracy >= accuracy_threshold:
            valid_C.append(C)

    #finalize C range
    C_min, C_max = min(valid_C), max(valid_C)
    print ('C range:', C_min, '-', C_max)

    ###(2) CROSS VALIDATION FOR BEST PARAMS###
    n_points = 10

    #for C use log spacing
    log_C_min = np.log10(C_min)
    log_C_max = np.log10(C_max)
    C_values = np.logspace(log_C_min, log_C_max, n_points)

    param_grid = {'C': C_values}
    print(param_grid)

    print('Cross validating for label:', label)
    n_folds = 10
    l1_gkf = GroupKFold(n_splits=n_folds)

    l1_svm_cv = LinearSVC(penalty='l1', dual=False, max_iter=10000, random_state=42)
    grid_search = GridSearchCV(estimator=l1_svm_cv, param_grid=param_grid, scoring='accuracy', cv=l1_gkf, n_jobs=-1, verbose=2)

    #fit grid search with the grouped recordIDs
    grid_search.fit(x_train_scaled, y_train, groups=training_groups)

    print('Best parameters:')
    print('Best C =', grid_search.best_params_['C'])
    print('Best cross-validation accuracy =', grid_search.best_score_)

    #get best model
    l1_best_model = grid_search.best_estimator_

    #store results
    results_l1[label] = {'best_C': grid_search.best_params_['C'], 'cv_accuracy': grid_search.best_score_, 'model': l1_best_model}

print('Results following Cross Validation:')
for label, res in results_l1.items():
    print(f"Label: {label}, Best C: {res['best_C']}, CV Accuracy: {res['cv_accuracy']}")

###(3) USE BEST MODEL ON TEST SET PREDICTION ###
#print l1 test error result for each label
#(the models were fit on standardized features, so predict on the standardized test set)
print('Test Set Results:')
for label, res in results_l1.items():
    l1_best_model = res['model']
    y_test = test_data[label].values
    y_pred_l1 = l1_best_model.predict(x_test_scaled)
    test_err_l1 = 1 - accuracy_score(y_test, y_pred_l1)
    print(f"Label: {label}, Test Error: {test_err_l1}")


Training L1 SVM for label: Family
C=0.001, Training Accuracy=0.7822931785195936
C=0.01, Training Accuracy=0.8827285921625544
C=0.1, Training Accuracy=0.9027576197387518
C=1, Training Accuracy=0.9123367198838896
C=10, Training Accuracy=0.9134978229317852
C=100, Training Accuracy=0.9134978229317852
C=1000, Training Accuracy=0.9134978229317852
C=10000, Training Accuracy=0.9134978229317852
C=100000, Training Accuracy=0.9134978229317852
C=1000000, Training Accuracy=0.9134978229317852
C range: 0.001 - 1000000
{'C': array([1.e-03, 1.e-02, 1.e-01, 1.e+00, 1.e+01, 1.e+02, 1.e+03, 1.e+04,
       1.e+05, 1.e+06])}
Cross validating for label: Family
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Best parameters:
Best C = 100.0
Best cross-validation accuracy = 0.8026461132481291
Training L1 SVM for label: Genus
C=0.001, Training Accuracy=0.6862119013062409
C=0.01, Training Accuracy=0.8844702467343977
C=0.1, Training Accuracy=0.939622641509434
C=1, Training Accuracy=0.941654571843251
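
For reference, the objective that `LinearSVC(penalty='l1', dual=False)` is understood to minimize for each binary one-vs-rest subproblem can be written as below. This is a sketch that assumes the default squared hinge loss and sets aside how the intercept $b$ is treated internally; $y_i \in \{-1, +1\}$ and $x_i$ is the standardized feature vector.

$$
\min_{w,\,b} \; \lVert w \rVert_1 + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)^2
$$

The $\ell_1$ term penalizes each coefficient by its absolute value, so the penalty depends on the scale of each feature. This is why the attributes are standardized before fitting. A larger $C$ puts more weight on the loss term and less on sparsity.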

#### (iv) Repeat 1(b)iii by using SMOTE or any other method for imbalance

In [None]:
#Repeat 1(b)iii by using SMOTE or any other method you know to remedy class imbalance. 
#Report your conclusions about the classifiers you trained.

#labels to train on
labels = ['Family', 'Genus', 'Species']
results_smote = {}

#loop through each label
for label in labels:
    y_train = training_data[label].values
    y_test = test_data[label].values
    print("Training SVM with SMOTE for label:", label)

    ###(1) APPLY SMOTE ###
    #apply SMOTE to TRAINING DATA only
    smote = SMOTE(random_state=42)
    x_train_smote, y_train_smote = smote.fit_resample(x_train_scaled, y_train)

    #find c param range
    accuracy_threshold = 0.70
    C_test_values = [0.001, 0.01, 0.1, 1, 10, 100, 1000, 10000, 100000, 1000000]
    valid_C = []
    print('Testing C values for label:', label)
    for C in C_test_values:
        svm_smote_model = LinearSVC(penalty='l1', dual=False, C=C, max_iter=10000, random_state=42)
        svm_smote_model.fit(x_train_smote, y_train_smote)
        training_acc_smote = svm_smote_model.score(x_train_smote, y_train_smote)
        if training_acc_smote >= accuracy_threshold:
            valid_C.append(C)
    
    #finalize C range
    C_min, C_max = min(valid_C), max(valid_C)
    print ('C range:', C_min, '-', C_max)

    ###(2) CROSS VALIDATION FOR BEST PARAMS###
    n_points = 10
    #log spacing for C
    log_C_min = np.log10(C_min)
    log_C_max = np.log10(C_max)
    C_values = np.logspace(log_C_min, log_C_max, n_points)

    params = {'C': C_values}
    print('Performing 10-fold Cross Validation for label:', label)
    n_folds = 10

    svm_smote_cv = LinearSVC(penalty='l1', dual=False, max_iter=10000, random_state=42)
    grid_search = GridSearchCV(estimator=svm_smote_cv, param_grid=params, scoring='accuracy', cv=n_folds, n_jobs=-1, verbose=2)
    grid_search.fit(x_train_smote, y_train_smote)

    print(f'Best Params for {label}:')
    print('Best C =', grid_search.best_params_['C'])
    print('Best Cross validation accuracy =', grid_search.best_score_)

    #get best model
    smote_best_model = grid_search.best_estimator_

    ###(3) EVALUATE ON TEST SET (NO SMOTE ON TEST)###
    y_pred_smote = smote_best_model.predict(x_test_scaled)
    test_err_smote = 1 - accuracy_score(y_test, y_pred_smote)

    #build results dictionary
    results_smote[label] = {'best_C': grid_search.best_params_['C'],
                            'cv_accuracy': grid_search.best_score_,
                            'test_error': test_err_smote,
                            'model': smote_best_model}

#print a summary once, after all labels have been processed
for name, res in results_smote.items():
    print(f"Label: {name}, Best C: {res['best_C']}, CV Accuracy: {res['cv_accuracy']}, Test Error: {res['test_error']}")



Training SVM with SMOTE for label: Family
Testing C values for label: Family
C range: 0.001 - 1000000
Performing 10-fold Cross Validation for label: Family
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Best Params for {label}:
Best C = 1.0
Best Cross validation accuracy = 0.9092879256965943
Label: Family, Best C: 1.0, CV Accuracy: 0.9092879256965943, Test Error: 0.13280000000000003
Training SVM with SMOTE for label: Genus
Testing C values for label: Genus
C range: 0.001 - 1000000
Performing 10-fold Cross Validation for label: Genus
Fitting 10 folds for each of 10 candidates, totalling 100 fits
Best Params for {label}:
Best C = 10.0
Best Cross validation accuracy = 0.9516728624535317
Label: Family, Best C: 1.0, CV Accuracy: 0.9092879256965943, Test Error: 0.13280000000000003
Label: Genus, Best C: 10.0, CV Accuracy: 0.9516728624535317, Test Error: 0.16133333333333333
Training SVM with SMOTE for label: Species
Testing C values for label: Species
C range: 0.001 - 1000000
P
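
To make the resampling step concrete, the sketch below generates synthetic minority samples by interpolating between a sample and one of its nearest minority-class neighbours, which is the core idea behind SMOTE. It is a simplified illustration, not `imblearn`'s implementation; the function name `smote_like_oversample` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_oversample(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic points on segments between minority samples and their neighbours."""
    X = np.asarray(X_minority, dtype=float)
    rng = np.random.default_rng(seed)
    k = min(k, len(X) - 1)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]    # indices of the k nearest neighbours

    base = rng.integers(0, len(X), size=n_new)                 # starting sample for each new point
    picks = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour of each starting sample
    u = rng.random((n_new, 1))                                 # interpolation weight in [0, 1)
    return X[base] + u * (X[picks] - X[base])

# Toy minority class (hypothetical): the four corners of the unit square.
X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
X_new = smote_like_oversample(X_min, n_new=6, k=2)
print(X_new.shape)
```

Because the synthetic points are built from training samples, cross-validating on the resampled set places points derived from the same originals in both the fitting and validation folds. The cross-validation accuracies in this part are therefore not directly comparable with the grouped cross-validation accuracies in 1(b)iii.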

Extra Practice: Study the Classifier Chain method and apply it to the above
problem.
Extra Practice: Research how confusion matrices, precision, recall, ROC,
and AUC are defined for multi-label classification and compute them for the
classifiers you trained above.

## 2. K-Means Clustering on a Multi-Class and Multi-Label Data Set
Monte-Carlo Simulation: Perform the following procedures 50 times, and report
the average and standard deviation of the 50 Hamming Distances that you calculate.

### (a) Use k-means clustering

In [None]:
#Use k-means clustering on the whole Anuran Calls (MFCCs) Data Set (do not split the data into train and test, 
#as we are not performing supervised learning in this exercise). 
#Choose k={1; 2... 50} automatically based on one of the methods provided in the slides 
#(CH or Gap Statistics or scree plots or Silhouettes) or any other method you know.

#use full data set
mfcc_data = pd.read_csv('./data/Frogs_MFCCs.csv')

#extract features
feature_cols = [col for col in mfcc_data.columns if col.startswith('MFCCs_')]
x_data = mfcc_data[feature_cols].values

#standardize
scaler = StandardScaler()
x_scaled = scaler.fit_transform(x_data)

###(1) MONTE CARLO SIMULATION###
#k values to iterate over (the silhouette score needs at least 2 clusters, so k=1 is skipped)
k_values = range(2, 51)
n_simulations = 50
best_ks = []

print('Performing 50 Monte Carlo simulations')
for sim in range(n_simulations):
    silhouette_scores = []

    ###(2) FIND BEST K FOR EACH SIMULATION###
    for k in k_values:
        kmeans = KMeans(n_clusters=k, random_state=sim, n_init=10)
        cluster_labels = kmeans.fit_predict(x_scaled)
        sil_score = silhouette_score(x_scaled, cluster_labels)
        silhouette_scores.append(sil_score)

    #best k for this simulation
    best_k = k_values[np.argmax(silhouette_scores)]
    best_ks.append(best_k)

    print(f'Simulation {sim+1}: Best k = {best_k}')

print('Best k values from all 50 simulations:')
print(best_ks)

Performing 50 Monte Carlo simulations
Simulation 1: Best k = 3
Simulation 2: Best k = 3
Simulation 3: Best k = 3
Simulation 4: Best k = 3
Simulation 5: Best k = 3
Simulation 6: Best k = 3
Simulation 7: Best k = 3
Simulation 8: Best k = 3
Simulation 9: Best k = 4
Simulation 10: Best k = 3
Simulation 11: Best k = 4
Simulation 12: Best k = 3
Simulation 13: Best k = 3
Simulation 14: Best k = 3
Simulation 15: Best k = 3
Simulation 16: Best k = 3
Simulation 17: Best k = 3
Simulation 18: Best k = 3
Simulation 19: Best k = 3
Simulation 20: Best k = 3
Simulation 21: Best k = 4
Simulation 22: Best k = 3
Simulation 23: Best k = 4
Simulation 24: Best k = 3
Simulation 25: Best k = 3
Simulation 26: Best k = 6
Simulation 27: Best k = 3
Simulation 28: Best k = 4
Simulation 29: Best k = 3
Simulation 30: Best k = 3
Simulation 31: Best k = 3
Simulation 32: Best k = 4
Simulation 33: Best k = 3
Simulation 34: Best k = 4
Simulation 35: Best k = 3
Simulation 36: Best k = 4
Simulation 37: Best k = 4
Simulatio
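
The selection rule above picks the k with the highest mean silhouette value. As a reminder of what that value measures, the sketch below computes it directly: for each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to any other cluster, and the point's score is `(b - a) / max(a, b)`. The function `silhouette_mean` and the toy points are hypothetical; in the cell above the score comes from `sklearn.metrics.silhouette_score`.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette value over all points, using Euclidean distances."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster keeps the conventional score of 0
        # Mean distance to the other members of the own cluster (self-distance is 0).
        a = dists[i, same].sum() / (n_same - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Toy data (hypothetical): two tight pairs of points that are far apart.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = silhouette_mean(X_toy, [0, 0, 1, 1])   # clusters follow the two pairs
bad = silhouette_mean(X_toy, [0, 1, 0, 1])    # clusters cut across the pairs
print("good partition:", round(good, 3))
print("bad partition: ", round(bad, 3))
```

The definition requires at least one other cluster to compute `b`, which is why the search starts at k = 2.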

### (b) Determine which family is the majority

In [7]:
#determine majority labels
#extract true labels
family = mfcc_data['Family'].values
genus = mfcc_data['Genus'].values
species = mfcc_data['Species'].values

#store labels from each simulation
pred_family = []
pred_genus = []
pred_species = []

print('Determining majority labels for best k from each simulation')
for sim in range(n_simulations):
    best_k = best_ks[sim]

    #run kmeans with best k
    kmeans = KMeans(n_clusters=best_k, random_state=sim, n_init=10)
    cluster_assignment = kmeans.fit_predict(x_scaled)

    #predicted label arrays
    pred_family_label = np.empty(family.shape, dtype=object)
    pred_genus_label = np.empty(genus.shape, dtype=object)
    pred_species_label = np.empty(species.shape, dtype=object)

    #for each cluster, determine majority label
    for cluster_id in range(best_k):
        idx = np.where(cluster_assignment == cluster_id)[0]

        if len(idx) == 0:
            #for empty cluster, assign None
            pred_family_label[idx] = None
            pred_genus_label[idx] = None
            pred_species_label[idx] = None
        else:
            #majority vote for family
            majority_family = pd.Series(family[idx]).mode()[0]
            pred_family_label[idx] = majority_family

            #majority vote for genus
            majority_genus = pd.Series(genus[idx]).mode()[0]
            pred_genus_label[idx] = majority_genus

            #majority vote for species
            majority_species = pd.Series(species[idx]).mode()[0]
            pred_species_label[idx] = majority_species
    
    #results
    pred_family.append(np.array(pred_family_label, dtype=object))
    pred_genus.append(np.array(pred_genus_label, dtype=object))
    pred_species.append(np.array(pred_species_label, dtype=object))

    print(f'Simulation {sim+1}: Assigned majority labels for k = {best_k}')


Determining majority labels for best k from each simulation
Simulation 1: Assigned majority labels for k = 3
Simulation 2: Assigned majority labels for k = 3
Simulation 3: Assigned majority labels for k = 3
Simulation 4: Assigned majority labels for k = 3
Simulation 5: Assigned majority labels for k = 3
Simulation 6: Assigned majority labels for k = 3
Simulation 7: Assigned majority labels for k = 3
Simulation 8: Assigned majority labels for k = 3
Simulation 9: Assigned majority labels for k = 4
Simulation 10: Assigned majority labels for k = 3
Simulation 11: Assigned majority labels for k = 4
Simulation 12: Assigned majority labels for k = 3
Simulation 13: Assigned majority labels for k = 3
Simulation 14: Assigned majority labels for k = 3
Simulation 15: Assigned majority labels for k = 3
Simulation 16: Assigned majority labels for k = 3
Simulation 17: Assigned majority labels for k = 3
Simulation 18: Assigned majority labels for k = 3
Simulation 19: Assigned majority labels for k = 3
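
The majority-vote step can be illustrated on a handful of points. In this sketch, `majority_label_per_cluster` and the toy cluster IDs and labels are hypothetical; it uses `collections.Counter` instead of `pandas.Series.mode`, so ties may be broken differently than in the cell above.

```python
from collections import Counter

def majority_label_per_cluster(cluster_ids, true_labels):
    """Map each cluster to its most common true label, then label every point by its cluster."""
    members = {}
    for cid, lab in zip(cluster_ids, true_labels):
        members.setdefault(cid, []).append(lab)
    majority = {cid: Counter(labs).most_common(1)[0][0] for cid, labs in members.items()}
    predicted = [majority[cid] for cid in cluster_ids]
    return majority, predicted

# Toy data (hypothetical): 8 points in 3 clusters with placeholder family labels.
clusters = [0, 0, 0, 1, 1, 2, 2, 2]
families = ["A", "A", "B", "B", "B", "C", "A", "C"]

majority, predicted = majority_label_per_cluster(clusters, families)
mismatches = sum(t != p for t, p in zip(families, predicted))
print(majority)                          # {0: 'A', 1: 'B', 2: 'C'}
print(predicted)                         # ['A', 'A', 'A', 'B', 'B', 'C', 'C', 'C']
print(mismatches, "of", len(families))   # 2 of 8
```

The same mapping is applied separately to Family, Genus, and Species, which gives each cluster its majority label triplet.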

### (c) Calculate the average Hamming distance, Hamming score, and Hamming loss

In [None]:
#Now for each cluster you have a majority label triplet (family, genus, species).
#Calculate the average Hamming distance, Hamming score, and Hamming loss
#between the true labels and the labels assigned by clusters.

#store lists of metrics per simulation
hamming_loss_family = []
hamming_score_family =[]
hamming_distance_family = []

hamming_loss_genus =[]
hamming_score_genus =[]
hamming_distance_genus = []

hamming_loss_species = []
hamming_score_species = []
hamming_distance_species = []

print('Calculating Hamming distances')

for sim in range(n_simulations):
    pred_family_ham = pred_family[sim]
    pred_genus_ham = pred_genus[sim]
    pred_species_ham = pred_species[sim]

    #Family: metrics
    fam_loss = hamming_loss(family, pred_family_ham)
    fam_score = 1 - fam_loss
    fam_dist = np.sum(family != pred_family_ham)

    hamming_loss_family.append(fam_loss)
    hamming_score_family.append(fam_score)
    hamming_distance_family.append(fam_dist)

    #Genus: metrics
    gen_loss = hamming_loss(genus, pred_genus_ham)
    gen_score = 1 - gen_loss
    gen_dist = np.sum(genus != pred_genus_ham)

    hamming_loss_genus.append(gen_loss)
    hamming_score_genus.append(gen_score)
    hamming_distance_genus.append(gen_dist)

    #Species: metrics
    spec_loss = hamming_loss(species, pred_species_ham)
    spec_score = 1 - spec_loss
    spec_dist = np.sum(species != pred_species_ham)

    hamming_loss_species.append(spec_loss)
    hamming_score_species.append(spec_score)
    hamming_distance_species.append(spec_dist)

##RESULTS
print('Family:')
print('Hamming Distance: Mean = {:.4f}, Std = {:.4f}'.format(np.mean(hamming_distance_family), np.std(hamming_distance_family)))
print('  Hamming Score:    Mean = {:.4f}, Std = {:.4f}'.format(np.mean(hamming_score_family), np.std(hamming_score_family)))
print('  Hamming Loss:     Mean = {:.4f}, Std = {:.4f}\n'.format(np.mean(hamming_loss_family), np.std(hamming_loss_family)))

print('Genus:')
print('  Hamming Distance: Mean = {:.1f}, Std = {:.1f}'.format(np.mean(hamming_distance_genus), np.std(hamming_distance_genus)))
print('  Hamming Score:    Mean = {:.4f}, Std = {:.4f}'.format(np.mean(hamming_score_genus), np.std(hamming_score_genus)))
print('  Hamming Loss:     Mean = {:.4f}, Std = {:.4f}\n'.format(np.mean(hamming_loss_genus), np.std(hamming_loss_genus)))

print('Species:')
print('  Hamming Distance: Mean = {:.1f}, Std = {:.1f}'.format(np.mean(hamming_distance_species), np.std(hamming_distance_species)))
print('  Hamming Score:    Mean = {:.4f}, Std = {:.4f}'.format(np.mean(hamming_score_species), np.std(hamming_score_species)))
print('  Hamming Loss:     Mean = {:.4f}, Std = {:.4f}\n'.format(np.mean(hamming_loss_species), np.std(hamming_loss_species)))

Calculating Hamming distances
Family:
Hamming Distance: Mean = 1580.0600, Std = 167.4194
  Hamming Score:    Mean = 0.7804, Std = 0.0233
  Hamming Loss:     Mean = 0.2196, Std = 0.0233

Genus:
  Hamming Distance: Mean = 2060.5, Std = 165.5
  Hamming Score:    Mean = 0.7136, Std = 0.0230
  Hamming Loss:     Mean = 0.2864, Std = 0.0230

Species:
  Hamming Distance: Mean = 2152.7, Std = 174.9
  Hamming Score:    Mean = 0.7008, Std = 0.0243
  Hamming Loss:     Mean = 0.2992, Std = 0.0243
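
The metrics above are reported per label. Assuming the triplet-level Hamming loss is defined as the fraction of wrong labels over all three label columns, its Monte Carlo mean is the average of the three per-label mean losses. The mean Hamming distance per instance (the number of wrong labels out of 3) is their sum. The short calculation below uses only the means printed above. The standard deviation of the triplet-level metric cannot be recovered this way and would need the per-simulation values.

```python
# Mean per-label Hamming losses copied from the output above.
per_label_loss = {"Family": 0.2196, "Genus": 0.2864, "Species": 0.2992}

# Mean number of wrong labels per instance, out of 3.
avg_hamming_distance = sum(per_label_loss.values())
# Fraction of wrong labels over all instance-label pairs.
triplet_hamming_loss = avg_hamming_distance / len(per_label_loss)
triplet_hamming_score = 1 - triplet_hamming_loss

print("Average Hamming distance per instance:", round(avg_hamming_distance, 4))  # 0.8052
print("Triplet Hamming loss: ", round(triplet_hamming_loss, 4))                  # 0.2684
print("Triplet Hamming score:", round(triplet_hamming_score, 4))                 # 0.7316
```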



REFERENCES
https://mmuratarat.github.io/2020-01-25/multilabel_classification_metrics

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html

https://scikit-learn.org/0.18/modules/generated/sklearn.metrics.hamming_loss.html

https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html

https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html

Copilot prompt: "Do I find the best k before running the Monte Carlo simulation, or within the Monte Carlo simulation?"

https://stackoverflow.com/questions/17412439/how-to-split-data-into-trainset-and-testset-randomly

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html

https://www.geeksforgeeks.org/machine-learning/what-is-silhouette-score/


## 3. ISLR 12.6.2