# Multi-class and Multi-Label Classification



Each instance in the Anuran Calls Data has three labels: Families, Genus, and Species. 

Each of the labels has multiple classes. We wish to solve a multi-class and multi-label problem. One of the most important approaches to multi-label classification is to train a classifier for each label (binary relevance).

We will use SVM and K-Means Clustering techniques to perform the classification.

## Importing necessary packages

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import hamming_loss, accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Package to ignore warnings
import warnings
warnings.filterwarnings('ignore')



## UDFs

In [3]:
def svc_ideal_parameters(traindata_X,traindata_y):

    '''This function helps to obtain the macximum and minimum range parameter values namely c & gamma in the 
    support vector cassifier to achieve an accuracy of atleast 70% 
    Input parameters:
        1. X train dataset
        2. y train dataset
    '''
    
    # Log space for very large and very small parameters
    logspace = np.logspace(-5,6,6)

    # Gaussian kernal inputs with linear variation
    linspace = np.linspace(0.01,2.5,10)

    c_list = []
    gamma_list = []

    for log in logspace:
        for lin in linspace:
            model_svm = SVC(C = log, gamma = lin)
            model_svm.fit(traindata_X,traindata_y)
            y_pred = model_svm.predict(X_train)

            if accuracy_score(y_pred,traindata_y) >= 0.7: # Filtering for parameters resulting in accuracy of atleast 0.7
                c_list.append(log)
                gamma_list.append(lin)
    
    return(round(min(c_list),3),round(max(c_list),3),round(min(gamma_list),3),round(max(gamma_list),3))

# Question 1

## a) Download the Anuran Calls (MFCCs) Data Set from: https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29. Choose 70% of the data randomly as the training set.

In [4]:
frogs_data = pd.read_csv('../data/Frogs_MFCCs.csv')

In [5]:
frogs_data.shape

(7195, 26)

In [6]:
frogs_data.tail(2)

Unnamed: 0,MFCCs_ 1,MFCCs_ 2,MFCCs_ 3,MFCCs_ 4,MFCCs_ 5,MFCCs_ 6,MFCCs_ 7,MFCCs_ 8,MFCCs_ 9,MFCCs_10,...,MFCCs_17,MFCCs_18,MFCCs_19,MFCCs_20,MFCCs_21,MFCCs_22,Family,Genus,Species,RecordID
7193,1.0,-0.519497,-0.307553,-0.004922,0.072865,0.377131,0.086866,-0.115799,0.056979,0.089316,...,0.051796,0.069073,0.017963,0.041803,-0.027911,-0.096895,Hylidae,Scinax,ScinaxRuber,60
7194,1.0,-0.508833,-0.324106,0.062068,0.078211,0.397188,0.094596,-0.117672,0.058874,0.07618,...,0.061455,0.072983,-0.00398,0.03156,-0.029355,-0.08791,Hylidae,Scinax,ScinaxRuber,60


In [7]:
# Splitting data into train & test in 70:30 ratio randomly

X_train, X_test, y_train, y_test = train_test_split(frogs_data.drop(['Family','Genus','Species'], axis=1), 
                                                    frogs_data[['Family','Genus','Species']],
                                                    test_size=0.3, 
                                                    random_state = 198)

## b)

### i) Research exact match and hamming score/ loss methods for evaluating multi- label classification and use them in evaluating the classifiers in this problem.

1. Exact Match Ratio: It is a metric to evaluate model performance in multi-class variables, which indicates the percentage of samples which have all of their labels correctly classified. It is a strict metric as it counts only when the model correctly identifies all the labels without any false positives. It can be though of as an extension of the accuracy metric in multi-label classification problems


2. Hamming Loss: It is the ratio of the number of wrong labels to the total number of labels. It can be interpreted as the fraction of labels which are wrongly predicted by the model. It can be thought of as, an extension to false positive ratio and false negative ratio

### ii) Train a SVM for each of the labels, using Gaussian kernels and one versus all classifiers. Determine the weight of the SVM penalty and the width of the Gaussian Kernel using 10 fold cross validation. You are welcome to try to solve the problem with both standardized and raw attributes and report the results.

#### Training SVM for each Label: Raw Attributes, one vs rest clasifier

1. Family

In [124]:
# Obtaining minimum and maximum range values for parameters
svc_ideal_parameters(X_train,y_train['Family'])

(0.002, 1000000.0, 0.01, 2.5)

In [126]:
penalty_param = list(np.logspace(-3,6,8))
gaussian_param = list(np.linspace(0.01,2.5,8))

ovr_clf = OneVsRestClassifier(SVC())
svm_clf_family = GridSearchCV(ovr_clf, 
                       param_grid={'estimator__kernel': ['rbf'], 'estimator__C': gaussian_param, 'estimator__gamma': penalty_param}, 
                       cv=10)

svm_clf_family.fit(X_train, y_train['Family'])

y_pred_test_family = svm_clf_family.predict(X_test)
EMR_family_raw = accuracy_score(y_pred_test_family, y_test['Family'])
HL_family_raw = hamming_loss(y_pred_test_family, y_test['Family'])

print('Family: \n')
print(f'Exact Match Ratio: {round(EMR_family_raw,4)}')
print(f'Hamming Loss: {round(HL_family_raw,4)}')

Family: 

Exact Match Ratio: 0.9995
Hamming Loss: 0.0005


2. Genus

In [127]:
# Obtaining minimum and maximum range values for parameters
svc_ideal_parameters(X_train,y_train['Genus'])

(0.002, 1000000.0, 0.01, 2.5)

In [129]:
penalty_param = list(np.logspace(-3,6,8))
gaussian_param = list(np.linspace(0.01,2.5,8))

ovr_clf = OneVsRestClassifier(SVC())
svm_clf_genus = GridSearchCV(ovr_clf, 
                       param_grid={'estimator__kernel': ['rbf'], 'estimator__C': gaussian_param, 'estimator__gamma': penalty_param}, 
                       cv=10)

svm_clf_genus.fit(X_train, y_train['Genus'])

y_pred_test_genus = svm_clf_genus.predict(X_test)
EMR_genus_raw = accuracy_score(y_pred_test_genus, y_test['Genus'])
HL_genus_raw = hamming_loss(y_pred_test_genus, y_test['Genus'])

print('Genus: \n')
print(f'Exact Match Ratio: {round(EMR_genus_raw,4)}')
print(f'Hamming Loss: {round(HL_genus_raw,4)}')

Genus: 

Exact Match Ratio: 0.9995
Hamming Loss: 0.0005


3. Species

In [128]:
# Obtaining minimum and maximum range values for parameters
svc_ideal_parameters(X_train,y_train['Species'])

(0.251, 1000000.0, 0.01, 2.5)

In [306]:
penalty_param = list(np.logspace(-1,6,8))
gaussian_param = list(np.linspace(0.01,2.5,10))

ovr_clf = OneVsRestClassifier(SVC())
svm_clf_species = GridSearchCV(ovr_clf, 
                       param_grid={'estimator__kernel': ['rbf'], 'estimator__C': gaussian_param, 'estimator__gamma': penalty_param}, 
                       cv=10)

svm_clf_species.fit(X_train, y_train['Species'])

y_pred_test_species = svm_clf_species.predict(X_test)
EMR_species_raw = accuracy_score(y_pred_test_species, y_test['Species'])
HL_species_raw = hamming_loss(y_pred_test_species, y_test['Species'])

print('Species: \n')
print(f'Exact Match Ratio: {round(EMR_species_raw,4)}')
print(f'Hamming Loss: {round(HL_species_raw,4)}')

Species: 

Exact Match Ratio: 0.9995
Hamming Loss: 0.005


#### Training SVM for each Label: Standardized Attributes, one vs rest clasifier

Standardizing attributes using sklearn standard scalar 

In [275]:
standarization = StandardScaler()

# Train data
train_std = standarization.fit(X_train)
X_train_trf = train_std.transform(X_train)

# Test data
test_std = standarization.fit(X_test)
X_test_trf = train_std.transform(X_test)


1. Family

In [276]:
penalty_param = list(np.logspace(-2,6,8))
gaussian_param = list(np.linspace(0.01,2.5,10))

ovr_clf = OneVsRestClassifier(SVC())
svm_clf_family_trf = GridSearchCV(ovr_clf, 
                       param_grid={'estimator__kernel': ['rbf'], 'estimator__C': gaussian_param, 'estimator__gamma': penalty_param}, 
                       cv=10)

svm_clf_family_trf.fit(X_train_trf, y_train['Family'])

y_pred_test_family = svm_clf_family_trf.predict(X_test_trf)
EMR_family_std = accuracy_score(y_pred_test_family, y_test['Family'])
HL_family_std = hamming_loss(y_pred_test_family, y_test['Family'])

print('Family: \n')
print(f'Exact Match Ratio: {round(EMR_family_std,4)}')
print(f'Hamming Loss: {round(HL_family_std,4)}')

Family: 

Exact Match Ratio: 0.9907
Hamming Loss: 0.0093


2. Genus

In [307]:
penalty_param = list(np.logspace(-2,6,8))
gaussian_param = list(np.linspace(0.01,2.5,10))

ovr_clf = OneVsRestClassifier(SVC())
svm_clf_genus_trf = GridSearchCV(ovr_clf, 
                       param_grid={'estimator__kernel': ['rbf'], 'estimator__C': gaussian_param, 'estimator__gamma': penalty_param}, 
                       cv=10)

svm_clf_genus_trf.fit(X_train_trf, y_train['Genus'])

y_pred_test_genus = svm_clf_genus_trf.predict(X_test_trf)
EMR_genus_std = accuracy_score(y_pred_test_genus, y_test['Genus'])
HL_genus_std = hamming_loss(y_pred_test_genus, y_test['Genus'])

print('Genus: \n')
print(f'Exact Match Ratio: {round(EMR_genus_std,4)}')
print(f'Hamming Loss: {round(HL_genus_std,4)}')

Genus: 

Exact Match Ratio: 0.9907
Hamming Loss: 0.0093


3. Species

In [308]:
penalty_param = list(np.logspace(-2,6,8))
gaussian_param = list(np.linspace(0.01,2.5,10))

ovr_clf = OneVsRestClassifier(SVC())
svm_clf_species_trf = GridSearchCV(ovr_clf, 
                       param_grid={'estimator__kernel': ['rbf'], 'estimator__C': gaussian_param, 'estimator__gamma': penalty_param}, 
                       cv=10)

svm_clf_species_trf.fit(X_train_trf, y_train['Species'])

y_pred_test_species = svm_clf_species_trf.predict(X_test_trf)
EMR_species_std = accuracy_score(y_pred_test_species, y_test['Species'])
HL_species_std = hamming_loss(y_pred_test_species, y_test['Species'])

print('Species: \n')
print(f'Exact Match Ratio: {round(EMR_species_std,4)}')
print(f'Hamming Loss: {round(HL_species_std,4)}')

Species: 

Exact Match Ratio: 0.9907
Hamming Loss: 0.0093


### iii) L1-penalized SVMs, 10 fold CV

In [293]:
reg_l1_svm = LinearSVC(penalty='l1', loss="squared_hinge", dual=False) #Using default values from sklearn example
reg_l1_svm = GridSearchCV(estimator=reg_l1_svm, param_grid={'C': np.logspace(-2,6,8)} ,cv=10)

1. Family

In [294]:
reg_l1_svm.fit(X_train_trf, y_train['Family'])
y_pred_l1_test_family = reg_l1_svm.predict(X_test_trf)
EMR_family_std = accuracy_score(y_pred_l1_test_family, y_test['Family'])
HL_family_std = hamming_loss(y_pred_l1_test_family, y_test['Family'])

print('Family: \n')
print(f'Exact Match Ratio: {round(EMR_family_std,4)}')
print(f'Hamming Loss: {round(HL_family_std,4)}')

Family: 

Exact Match Ratio: 0.9629
Hamming Loss: 0.0371


2. Genus

In [295]:
reg_l1_svm.fit(X_train_trf, y_train['Genus'])
y_pred_l1_test_genus = reg_l1_svm.predict(X_test_trf)
EMR_genus_std = accuracy_score(y_pred_l1_test_genus, y_test['Genus'])
HL_genus_std = hamming_loss(y_pred_l1_test_genus, y_test['Genus'])

print('Genus: \n')
print(f'Exact Match Ratio: {round(EMR_genus_std,4)}')
print(f'Hamming Loss: {round(HL_genus_std,4)}')

Genus: 

Exact Match Ratio: 0.9796
Hamming Loss: 0.0204


3. Species

In [296]:
reg_l1_svm.fit(X_train_trf, y_train['Species'])
y_pred_l1_test_species = reg_l1_svm.predict(X_test_trf)
EMR_species_std = accuracy_score(y_pred_l1_test_species, y_test['Species'])
HL_species_std = hamming_loss(y_pred_l1_test_species, y_test['Species'])

print('Species: \n')
print(f'Exact Match Ratio: {round(EMR_species_std,4)}')
print(f'Hamming Loss: {round(HL_species_std,4)}')

Species: 

Exact Match Ratio: 0.9792
Hamming Loss: 0.0208


### iv) L1 penalization using SMOTE

In [297]:
# Balancing the dataset using smote
sm = SMOTE()

1. Family

In [298]:
sm_trainX_family, sm_trainY_family = sm.fit_resample(X_train_trf, y_train['Family'])
sm_trainY_family.value_counts()

Dendrobatidae      3082
Bufonidae          3082
Hylidae            3082
Leptodactylidae    3082
Name: Family, dtype: int64

In [299]:
reg_l1_svm.fit(sm_trainX_family, sm_trainY_family)
y_pred_l1_test_family = reg_l1_svm.predict(X_test_trf)
EMR_family_std_sm = accuracy_score(y_pred_l1_test_family, y_test['Family'])
HL_family_std_sm = hamming_loss(y_pred_l1_test_family, y_test['Family'])

print('Family: \n')
print(f'Exact Match Ratio: {round(EMR_family_std_sm,4)}')
print(f'Hamming Loss: {round(HL_family_std_sm,4)}')

Family: 

Exact Match Ratio: 0.9504
Hamming Loss: 0.0496


2. Genus

In [300]:
sm_trainX_genus, sm_trainY_genus = sm.fit_resample(X_train_trf, y_train['Genus'])
sm_trainY_genus.value_counts()

Adenomera        2905
Leptodactylus    2905
Ameerega         2905
Osteocephalus    2905
Dendropsophus    2905
Rhinella         2905
Scinax           2905
Hypsiboas        2905
Name: Genus, dtype: int64

In [301]:
reg_l1_svm.fit(sm_trainX_genus, sm_trainY_genus)
y_pred_l1_test_genus = reg_l1_svm.predict(X_test_trf)
EMR_genus_std_sm = accuracy_score(y_pred_l1_test_genus, y_test['Genus'])
HL_genus_std_sm = hamming_loss(y_pred_l1_test_genus, y_test['Genus'])

print('Genus: \n')
print(f'Exact Match Ratio: {round(EMR_genus_std_sm,4)}')
print(f'Hamming Loss: {round(HL_genus_std_sm,4)}')

Genus: 

Exact Match Ratio: 0.9833
Hamming Loss: 0.0167


3. Species

In [302]:
sm_trainX_species, sm_trainY_species = sm.fit_resample(X_train_trf, y_train['Species'])
sm_trainY_species.value_counts()

Ameeregatrivittata        2423
AdenomeraAndre            2423
HypsiboasCordobae         2423
AdenomeraHylaedactylus    2423
Rhinellagranulosa         2423
ScinaxRuber               2423
HypsiboasCinerascens      2423
LeptodactylusFuscus       2423
OsteocephalusOophagus     2423
HylaMinuta                2423
Name: Species, dtype: int64

In [303]:
reg_l1_svm.fit(sm_trainX_species, sm_trainY_species)
y_pred_l1_test_species = reg_l1_svm.predict(X_test_trf)
EMR_species_std_sm = accuracy_score(y_pred_l1_test_species, y_test['Species'])
HL_species_std_sm = hamming_loss(y_pred_l1_test_species, y_test['Species'])

print('Species: \n')
print(f'Exact Match Ratio: {round(EMR_species_std_sm,4)}')
print(f'Hamming Loss: {round(HL_species_std_sm,4)}')

Species: 

Exact Match Ratio: 0.981
Hamming Loss: 0.019


### Conclusions

The average Hamming loss (across the 3 labels) for each of the following is:

1. SVM with raw attributes: 0.005
2. SVM with standardized attributes: 0.0093
3. L1 penalized SVM: 0.0261
4. L1 penalized SVM using SMOTE to eliminate data imbalance: 0.028

Thus, the best classfier is SVM with raw attributes as it has the least Hamming loss.

### Q1 Sources

1. https://towardsdatascience.com/journey-to-the-center-of-multi-label-classification-384c40229bff
2. https://peltarion.com/knowledge-center/documentation/evaluation-view/classification-loss-metrics/exact-match-ratio
3. https://medium.datadriveninvestor.com/a-survey-of-evaluation-metrics-for-multilabel-classification-bb16e8cd41cd
4. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.hamming_loss.html
5. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
6. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
7. https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html
8. https://scikit-learn.org/stable/auto_examples/svm/plot_svm_kernels.html
9. https://scikit-learn.org/stable/modules/generated/sklearn.multiclass.OneVsRestClassifier.html
10. https://stackoverflow.com/questions/12632992/gridsearch-for-an-estimator-inside-a-onevsrestclassifier
11. https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
12. https://scikit-learn.org/stable/auto_examples/svm/plot_svm_scale_c.html
13. https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

# Question 2

### (a) Use k-means clustering on the whole Anuran Calls (MFCCs) Data Set (do not split the data into train and test, as we are not performing supervised learning in this exercise). Choose k ∈ {1, 2, . . . , 50} automatically based on one of the methods provided in the slides (CH or Gap Statistics or scree plots or Silhouettes) or any other method you know.


In [135]:
# Creating copy of original dataframe
frogs_data_kmeans = frogs_data.copy()

# Filtering out labels and recordID
frogs_data_kmeans.drop(['Family','Genus','Species','RecordID'],axis=1, inplace=True)

In [223]:
# Identifying # clusters & running Monte-Carlo simulations for 50 times

for iter in range(1, 51):
    silhouette_score_list = []
    clusters = []

    for i in range(2,51):
        kmeans_model = KMeans(n_clusters=i)
        kmeans_model.fit(frogs_data_kmeans)
        cluster_labels = kmeans_model.labels_

        # collating silhouette score for each iteration to identify ideal # clusters
        silhouette_score_list.append(silhouette_score(frogs_data_kmeans, cluster_labels))
        clusters.append(i)

    ideal_cluster_cnt = clusters[silhouette_score_list.index(max(silhouette_score_list))]
    print(f'Ideal # clusters in iteration {iter} is: {ideal_cluster_cnt}')

Ideal # clusters in iteration 1 is: 4
Ideal # clusters in iteration 2 is: 4
Ideal # clusters in iteration 3 is: 4
Ideal # clusters in iteration 4 is: 4
Ideal # clusters in iteration 5 is: 4
Ideal # clusters in iteration 6 is: 4
Ideal # clusters in iteration 7 is: 4
Ideal # clusters in iteration 8 is: 4
Ideal # clusters in iteration 9 is: 4
Ideal # clusters in iteration 10 is: 4
Ideal # clusters in iteration 11 is: 4
Ideal # clusters in iteration 12 is: 4
Ideal # clusters in iteration 13 is: 4
Ideal # clusters in iteration 14 is: 4
Ideal # clusters in iteration 15 is: 4
Ideal # clusters in iteration 16 is: 4
Ideal # clusters in iteration 17 is: 4
Ideal # clusters in iteration 18 is: 4
Ideal # clusters in iteration 19 is: 4
Ideal # clusters in iteration 20 is: 4
Ideal # clusters in iteration 21 is: 4
Ideal # clusters in iteration 22 is: 4
Ideal # clusters in iteration 23 is: 4
Ideal # clusters in iteration 24 is: 4
Ideal # clusters in iteration 25 is: 4
Ideal # clusters in iteration 26 i

### (b) In each cluster, determine which family is the majority by reading the true labels. Repeat for genus and species.


In [254]:
# Droping the record column as it is redundant
frogs_data_mc = frogs_data.drop(['RecordID'], axis=1)

In [222]:
print('Iteration # - Cluster ID: \tFamily \t\tGenus \t\tSpecies \n')

for iter in range(1,51):
    # Adding cluster number as a column to the dataframe
    kmeans_model_c4 = KMeans(n_clusters=4)
    kmeans_model_c4.fit(frogs_data_kmeans)
    cluster_labels_c4 = kmeans_model_c4.labels_
    frogs_data_kmeans['cluster'] = cluster_labels_c4
    
    for i in range(0,4):
        label = []

        for j in ['Family','Genus','Species']:    
            index_cluster = frogs_data_kmeans[frogs_data_kmeans['cluster'] == i].index
            temp = pd.DataFrame(frogs_data[j].filter(items = index_cluster, axis=0).value_counts().head(1)).reset_index()
            label.append(temp.iloc[0][0])
            
            
        print(f'Iteration {iter} - Cluster {i}: \t{label[0]} \t{label[1]} \t{label[2]}')
    frogs_data_kmeans.drop(['cluster'], axis=1)


Iteration # - Cluster ID: 	Family 		Genus 		Species 

Iteration 1 - Cluster 0: 	Leptodactylidae 	Adenomera 	AdenomeraHylaedactylus
Iteration 1 - Cluster 1: 	Hylidae 	Hypsiboas 	HypsiboasCordobae
Iteration 1 - Cluster 2: 	Dendrobatidae 	Ameerega 	Ameeregatrivittata
Iteration 1 - Cluster 3: 	Hylidae 	Hypsiboas 	HypsiboasCinerascens
Iteration 2 - Cluster 0: 	Hylidae 	Hypsiboas 	HypsiboasCordobae
Iteration 2 - Cluster 1: 	Leptodactylidae 	Adenomera 	AdenomeraHylaedactylus
Iteration 2 - Cluster 2: 	Hylidae 	Hypsiboas 	HypsiboasCinerascens
Iteration 2 - Cluster 3: 	Dendrobatidae 	Ameerega 	Ameeregatrivittata
Iteration 3 - Cluster 0: 	Leptodactylidae 	Adenomera 	AdenomeraHylaedactylus
Iteration 3 - Cluster 1: 	Dendrobatidae 	Ameerega 	Ameeregatrivittata
Iteration 3 - Cluster 2: 	Hylidae 	Hypsiboas 	HypsiboasCordobae
Iteration 3 - Cluster 3: 	Hylidae 	Hypsiboas 	HypsiboasCinerascens
Iteration 4 - Cluster 0: 	Hylidae 	Hypsiboas 	HypsiboasCinerascens
Iteration 4 - Cluster 1: 	Dendrobatidae 	Amee

Iteration 31 - Cluster 0: 	Hylidae 	Hypsiboas 	HypsiboasCordobae
Iteration 31 - Cluster 1: 	Leptodactylidae 	Adenomera 	AdenomeraHylaedactylus
Iteration 31 - Cluster 2: 	Dendrobatidae 	Ameerega 	Ameeregatrivittata
Iteration 31 - Cluster 3: 	Hylidae 	Hypsiboas 	HypsiboasCinerascens
Iteration 32 - Cluster 0: 	Dendrobatidae 	Ameerega 	Ameeregatrivittata
Iteration 32 - Cluster 1: 	Leptodactylidae 	Adenomera 	AdenomeraHylaedactylus
Iteration 32 - Cluster 2: 	Hylidae 	Hypsiboas 	HypsiboasCordobae
Iteration 32 - Cluster 3: 	Hylidae 	Hypsiboas 	HypsiboasCinerascens
Iteration 33 - Cluster 0: 	Leptodactylidae 	Adenomera 	AdenomeraHylaedactylus
Iteration 33 - Cluster 1: 	Hylidae 	Hypsiboas 	HypsiboasCordobae
Iteration 33 - Cluster 2: 	Dendrobatidae 	Ameerega 	Ameeregatrivittata
Iteration 33 - Cluster 3: 	Hylidae 	Hypsiboas 	HypsiboasCinerascens
Iteration 34 - Cluster 0: 	Leptodactylidae 	Adenomera 	AdenomeraHylaedactylus
Iteration 34 - Cluster 1: 	Hylidae 	Hypsiboas 	HypsiboasCordobae
Iteration 3

### (c) Now for each cluster you have a majority label triplet (family, genus, species). Calculate the average Hamming distance, Hamming score, and Hamming loss between the true labels and the labels assigned by clusters.

In [274]:
frogs_data_kmeans['cluster'] = 0
frogs_data_mc['cluster'] = 0

hamming_loss_list = []
hamming_score_list = []

print('Iteration #: \tHamming Loss \tHamming Score')
for iter in range(1,51):
    
    family_list = []
    genus_list = []
    species_list = []
    cluster_list = []
    
    frogs_data_kmeans.drop(['cluster'], axis=1)
    frogs_data_mc.drop(['cluster'], axis=1)
    
    # Adding cluster number as a column to the dataframe
    kmeans_model_c4 = KMeans(n_clusters=4)
    kmeans_model_c4.fit(frogs_data_kmeans)
    cluster_labels_c4 = kmeans_model_c4.labels_
    frogs_data_kmeans['cluster'] = cluster_labels_c4
    frogs_data_mc['cluster'] = cluster_labels_c4
    
    for i in range(0,4):
        label = []
        cluster_list.append(i)
        
        for j in ['Family','Genus','Species']:    
            index_cluster = frogs_data_kmeans[frogs_data_kmeans['cluster'] == i].index
            temp = pd.DataFrame(frogs_data[j].filter(items = index_cluster, axis=0).value_counts().head(1)).reset_index()
            label.append(temp.iloc[0][0])
            
            if j == 'Family':
                family_list.append(temp.iloc[0][0])
            elif j == 'Genus':
                genus_list.append(temp.iloc[0][0])
            else:
                species_list.append(temp.iloc[0][0])
                
    iter1 = pd.DataFrame()
    iter1['family_pred'] = family_list
    iter1['genus_pred'] = genus_list
    iter1['species_pred'] = species_list
    iter1['cluster_pred'] = cluster_list

    frogs_hamming = frogs_data_mc.merge(iter1, how='inner', left_on='cluster', right_on='cluster_pred')
    
    hamming_loss_f = hamming_loss(frogs_hamming['family_pred'], frogs_hamming['Family'])
    hamming_loss_g = hamming_loss(frogs_hamming['genus_pred'], frogs_hamming['Genus'])
    hamming_loss_s = hamming_loss(frogs_hamming['species_pred'], frogs_hamming['Species'])
    
    hamming_score_f = accuracy_score(frogs_hamming['family_pred'], frogs_hamming['Family'])
    hamming_score_g = accuracy_score(frogs_hamming['genus_pred'], frogs_hamming['Genus'])
    hamming_score_s = accuracy_score(frogs_hamming['species_pred'], frogs_hamming['Species'])
    
    hamming_loss_avg = round((hamming_loss_f + hamming_loss_g + hamming_loss_s)/3,4)
    hamming_score_avg = round((hamming_score_f + hamming_score_g + hamming_score_s)/3,4)

    hamming_loss_list.append(hamming_loss_avg)
    hamming_score_list.append(hamming_score_avg)
    
    
    print(f'Iteration {iter}: \t{hamming_loss_avg} \t\t{hamming_score_avg}')
    

Iteration #: 	Hamming Loss 	Hamming Score
Iteration 1: 	0.2224 		0.7776
Iteration 2: 	0.2224 		0.7776
Iteration 3: 	0.2224 		0.7776
Iteration 4: 	0.2224 		0.7776
Iteration 5: 	0.2224 		0.7776
Iteration 6: 	0.2224 		0.7776
Iteration 7: 	0.2224 		0.7776
Iteration 8: 	0.2224 		0.7776
Iteration 9: 	0.2224 		0.7776
Iteration 10: 	0.2224 		0.7776
Iteration 11: 	0.2224 		0.7776
Iteration 12: 	0.2224 		0.7776
Iteration 13: 	0.2224 		0.7776
Iteration 14: 	0.2224 		0.7776
Iteration 15: 	0.2224 		0.7776
Iteration 16: 	0.2224 		0.7776
Iteration 17: 	0.2224 		0.7776
Iteration 18: 	0.2224 		0.7776
Iteration 19: 	0.2224 		0.7776
Iteration 20: 	0.2224 		0.7776
Iteration 21: 	0.2224 		0.7776
Iteration 22: 	0.2224 		0.7776
Iteration 23: 	0.2224 		0.7776
Iteration 24: 	0.2224 		0.7776
Iteration 25: 	0.2224 		0.7776
Iteration 26: 	0.2224 		0.7776
Iteration 27: 	0.2224 		0.7776
Iteration 28: 	0.2224 		0.7776
Iteration 29: 	0.2224 		0.7776
Iteration 30: 	0.2224 		0.7776
Iteration 31: 	0.2224 		0.7776
Iterat

#### Summary

From the above, we can see that -
1. Mean/Average hamming loss: 0.2224
2. Mean/Average hamming distance: 0.7776
3. Standard deviation of hamming loss: 0
4. Standard deviation of hamming loss: 0

### Sources

1. https://www.analyticsvidhya.com/blog/2021/05/k-mean-getting-the-optimal-number-of-clusters/
2. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

## ISLR 12.6.2

#### I have attached the pdf with the solution to this question in the data folder as well.

In [8]:
from IPython.display import IFrame
IFrame("../data/HW_7_islr.pdf", width=800, height=400)