## 1. Multi-class and Multi-Label Classification Using Support Vector Machines
##### (a) Download the Anuran Calls (MFCCs) Data Set from: https://archive.ics.uci.edu/ml/datasets/Anuran+Calls+%28MFCCs%29. Choose 70% of the data randomly as the training set.

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.metrics import hamming_loss
from collections import defaultdict
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import statistics
import random

## (b) Each instance has three labels: Families, Genus, and Species. Each of the labels has multiple classes. We wish to solve a multi-class and multi-label problem. One of the most important approaches to multi-label classification is to train a classifier for each label. We first try this approach:

#### i. Research exact match and Hamming score/loss methods for evaluating multi-label classification and use them in evaluating the classifiers in this problem.


Exact match: the fraction of samples for which all labels are predicted correctly. Its disadvantage is that a multi-label prediction can be partially correct, and exact match gives no credit for those partial matches.

Hamming loss: the fraction of individual labels that are predicted incorrectly, counted over all samples and all labels. The Hamming score is its complement, 1 - Hamming loss.
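To make the two definitions concrete, here is a minimal pure-Python sketch on four made-up samples. The label triplets and the helper names `exact_match` and `hamming_loss_multilabel` are illustrative assumptions, not values or functions from this notebook.

```python
# Toy multi-label data: each row is a (Family, Genus, Species) triplet.
# These rows are illustrative only, not results from the notebook.
y_true = [
    ("Hylidae", "Hypsiboas", "HypsiboasCordobae"),
    ("Hylidae", "Hypsiboas", "HypsiboasCinerascens"),
    ("Leptodactylidae", "Adenomera", "AdenomeraAndre"),
    ("Dendrobatidae", "Ameerega", "Ameeregatrivittata"),
]
y_hat = [
    ("Hylidae", "Hypsiboas", "HypsiboasCordobae"),       # all three correct
    ("Hylidae", "Hypsiboas", "HypsiboasCordobae"),       # species wrong
    ("Leptodactylidae", "Adenomera", "AdenomeraAndre"),  # all three correct
    ("Hylidae", "Hypsiboas", "HypsiboasCordobae"),       # all three wrong
]


def exact_match(true_rows, pred_rows):
    """Fraction of samples whose whole label triplet is predicted correctly."""
    hits = sum(t == p for t, p in zip(true_rows, pred_rows))
    return hits / len(true_rows)


def hamming_loss_multilabel(true_rows, pred_rows):
    """Fraction of individual labels predicted incorrectly."""
    wrong = sum(a != b
                for t, p in zip(true_rows, pred_rows)
                for a, b in zip(t, p))
    total = sum(len(t) for t in true_rows)
    return wrong / total


print(exact_match(y_true, y_hat))              # 2 of 4 samples fully correct
print(hamming_loss_multilabel(y_true, y_hat))  # 4 of 12 labels wrong
```

The second sample shows the difference between the metrics: it counts as a complete miss for exact match but contributes only one wrong label out of three to the Hamming loss.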

#### ii. Train an SVM for each of the labels, using Gaussian kernels and one-versus-all classifiers. Determine the weight of the SVM penalty and the width of the Gaussian kernel using 10-fold cross-validation. You are welcome to try to solve the problem with both standardized and raw attributes and report the results.

In [4]:
df=pd.read_csv('Frogs_MFCCs.csv')

In [5]:
y=df[['Family','Genus','Species']]

In [6]:
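# Drop the three label columns and the record identifier, keeping only the MFCC features
x=df.drop(['Family','Genus','Species','RecordID'],axis=1)
# Standardize each feature to zero mean and unit variance (computed on the full data set, before the split)
x=preprocessing.scale(x)
# Hold out 30% of the rows as the test set; the remaining 70% is the training set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=42)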
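The cell above standardizes the full data set before splitting, so the test rows contribute to the means and variances. A common alternative is to estimate the scaling statistics on the training rows only and reuse them for the test rows. The sketch below shows that idea in pure Python on made-up numbers; `fit_scaler` and `apply_scaler` are hypothetical helper names, not part of this notebook. In scikit-learn the same idea is usually expressed with `StandardScaler`, optionally inside a `Pipeline`.

```python
import statistics


def fit_scaler(rows):
    """Return per-column (mean, population std) estimated from `rows` only."""
    columns = list(zip(*rows))
    return [(statistics.mean(c), statistics.pstdev(c)) for c in columns]


def apply_scaler(rows, params):
    """Standardize `rows` using previously fitted (mean, std) pairs."""
    return [
        [(v - m) / s if s > 0 else 0.0 for v, (m, s) in zip(row, params)]
        for row in rows
    ]


# Made-up feature rows standing in for the training and test splits
train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
test = [[4.0, 40.0]]

params = fit_scaler(train)               # statistics come from the training rows only
train_std = apply_scaler(train, params)
test_std = apply_scaler(test, params)    # test rows reuse the training statistics

print(train_std)
print(test_std)
```

Whether this changes the reported scores noticeably on this data set is not something the notebook measures; the sketch only illustrates the procedure.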

In [7]:
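y_pred={}
for i in ['Family','Genus','Species']:
    # Grid over the Gaussian kernel width (gamma) and the penalty weight (C)
    parameters = [{'kernel': ['rbf'],
               'gamma': [1e-4, 1e-3, 0.01, 0.1, 0.2, 0.5,10],
                'C': [1, 10, 100, 1000]}]
    # 10-fold cross-validated grid search; one classifier is trained per label
    clf = GridSearchCV(svm.SVC(decision_function_shape='ovr'), parameters,cv=10)
    # Pass the target as a 1-D Series rather than a one-column DataFrame
    clf.fit(x_train, y_train[i])
    print("Best parameters set found on development set for class {} :".format(i))
    print()
    print(clf.best_params_)

    # Evaluate the refitted best model on the held-out test set
    y_pred[i] = clf.predict(x_test)
    print(classification_report(y_test[i], y_pred[i]))
    # For a single multi-class label, the Hamming loss equals the misclassification rate
    print("The hamming distance for class {}".format(hamming_loss(y_test[i], y_pred[i])))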

Best parameters set found on development set for class Family :
()
{'kernel': 'rbf', 'C': 10, 'gamma': 0.1}
                 precision    recall  f1-score   support

      Bufonidae       1.00      0.89      0.94        27
  Dendrobatidae       1.00      1.00      1.00       162
        Hylidae       0.98      1.00      0.99       623
Leptodactylidae       1.00      0.99      1.00      1347

      micro avg       0.99      0.99      0.99      2159
      macro avg       0.99      0.97      0.98      2159
   weighted avg       0.99      0.99      0.99      2159

The hamming distance for class 0.00741083835109
Best parameters set found on development set for class Genus :
()
{'kernel': 'rbf', 'C': 10, 'gamma': 0.1}
               precision    recall  f1-score   support

    Adenomera       1.00      1.00      1.00      1251
     Ameerega       1.00      1.00      1.00       162
Dendropsophus       1.00      0.96      0.98        84
    Hypsiboas       0.96      1.00      0.98       468
Le

In [8]:
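y_pred=pd.DataFrame(y_pred)
df_match=pd.DataFrame()
for i in ['Family','Genus','Species']:
    # Compare by position (.values): y_test keeps its shuffled row index, y_pred is indexed 0..n-1
    df_match[i] = np.where(y_test[i].values == y_pred[i].values,1,0)
# A sample is an exact match only when all three labels are correct
df_match['sum'] = df_match[['Family','Genus','Species']].sum(axis=1)
em = np.where(df_match['sum']==3,1.0,0).sum(axis=0)/len(df_match)
print("the exact match is: {}".format(em))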

the exact match is: 0.981936081519


#### iii. Repeat 1(b)ii with L1-penalized SVMs. Remember to standardize the attributes.

In [9]:
y_pred={}

for i in ['Family','Genus','Species']:
    # Only the penalty weight C is tuned; max_iter is fixed at 2000
    parameters = [{'C': [1, 10, 100, 1000],'max_iter':[2000]}]
    # Linear SVM with an L1 penalty (requires dual=False), one-vs-rest over the classes
    clf = GridSearchCV(svm.LinearSVC(multi_class='ovr',penalty='l1',dual=False), parameters,cv=10)
    # Pass the target as a 1-D Series rather than a one-column DataFrame
    clf.fit(x_train, y_train[i])
    print("Best parameters set found on development set for class {} :".format(i))
    print()
    print(clf.best_params_)

    # Evaluate the refitted best model on the held-out test set
    y_pred[i] = clf.predict(x_test)
    print(classification_report(y_test[i], y_pred[i]))
    print("The hamming distance for class {}".format(hamming_loss(y_test[i], y_pred[i])))


Best parameters set found on development set for class Family :
()
{'C': 1, 'max_iter': 2000}
                 precision    recall  f1-score   support

      Bufonidae       0.00      0.00      0.00        27
  Dendrobatidae       0.86      0.90      0.88       162
        Hylidae       0.89      0.90      0.90       623
Leptodactylidae       0.96      0.96      0.96      1347

      micro avg       0.93      0.93      0.93      2159
      macro avg       0.68      0.69      0.68      2159
   weighted avg       0.92      0.93      0.92      2159

The hamming distance for class 0.0713293191292
Best parameters set found on development set for class Genus :
()
{'C': 10, 'max_iter': 2000}
               precision    recall  f1-score   support

    Adenomera       0.97      0.98      0.97      1251
     Ameerega       0.90      0.93      0.92       162
Dendropsophus       0.87      0.71      0.78        84
    Hypsiboas       0.91      0.97      0.94       468
Leptodactylus       0.98      

In [10]:
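y_pred=pd.DataFrame(y_pred)
df_match=pd.DataFrame()
for i in ['Family','Genus','Species']:
    # Compare by position (.values): y_test keeps its shuffled row index, y_pred is indexed 0..n-1
    df_match[i] = np.where(y_test[i].values == y_pred[i].values,1,0)
# A sample is an exact match only when all three labels are correct
df_match['sum'] = df_match[['Family','Genus','Species']].sum(axis=1)
em = np.where(df_match['sum']==3,1.0,0).sum(axis=0)/len(df_match)
print("the exact match is: {}".format(em))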

the exact match is: 0.913385826772


In [11]:
print(y_train['Family'].value_counts())

Leptodactylidae    3073
Hylidae            1542
Dendrobatidae       380
Bufonidae            41
Name: Family, dtype: int64


#### iv. Repeat 1(b)iii by using SMOTE or any other method you know to remedy class imbalance. Report your conclusions about the classifiers you trained.
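As background for the next cell, SMOTE creates synthetic minority samples by interpolating between a minority point and one of its nearest minority-class neighbours. The sketch below is a simplified pure-Python illustration of that interpolation step on made-up points. The function name `smote_like_samples` and its parameters are hypothetical, and the code is not the imbalanced-learn implementation.

```python
import math
import random


def smote_like_samples(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic points by interpolating minority neighbours.

    Simplified illustration of the SMOTE idea: pick a minority point, pick one
    of its k nearest minority neighbours, and sample a point on the segment
    between them.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of `base`, excluding the point itself
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        target = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (t - b) for b, t in zip(base, target)])
    return synthetic


# Made-up minority-class points on the corners of the unit square
minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two existing minority points, oversampling densifies the minority region rather than adding new information, which is one reason precision for the rare class can drop even when recall rises.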

In [12]:
from imblearn.over_sampling import SMOTE
y_pred = {}
for i in ['Family','Genus','Species']:
    # Oversample only the smallest class of the training set for this label
    smote = SMOTE(random_state=0,sampling_strategy='minority')
    # fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
    X, Y = smote.fit_resample(x_train, y_train[i])
    parameters = [{'C': [1, 10, 100, 1000],'max_iter':[2000]}]
    # L1-penalized linear SVM, tuned with 10-fold cross-validation on the resampled data
    clf = GridSearchCV(svm.LinearSVC(multi_class='ovr',penalty='l1',dual=False), parameters,cv=10)
    clf.fit(X, Y)
    print("Best parameters set found on development set for class {} :".format(i))
    print()
    print(clf.best_params_)
    # The test set is left untouched (no resampling)
    y_pred[i] = clf.predict(x_test).tolist()
    print(classification_report(y_test[i], y_pred[i]))
    print("The hamming distance for class {}".format(hamming_loss(y_test[i], y_pred[i])))
    


Best parameters set found on development set for class Family :
()
{'C': 10, 'max_iter': 2000}
                 precision    recall  f1-score   support

      Bufonidae       0.33      0.93      0.49        27
  Dendrobatidae       0.85      0.89      0.87       162
        Hylidae       0.93      0.88      0.91       623
Leptodactylidae       0.96      0.94      0.95      1347

      micro avg       0.92      0.92      0.92      2159
      macro avg       0.77      0.91      0.80      2159
   weighted avg       0.94      0.92      0.93      2159

The hamming distance for class 0.0782769800834
Best parameters set found on development set for class Genus :
()
{'C': 10, 'max_iter': 2000}
               precision    recall  f1-score   support

    Adenomera       0.97      0.97      0.97      1251
     Ameerega       0.90      0.91      0.90       162
Dendropsophus       0.87      0.63      0.73        84
    Hypsiboas       0.94      0.96      0.95       468
Leptodactylus       0.98     

In [13]:
y_pred=pd.DataFrame(y_pred)

In [14]:
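df_match=pd.DataFrame()
for i in ['Family','Genus','Species']:
    # Compare by position (.values): y_test keeps its shuffled row index, y_pred is indexed 0..n-1
    df_match[i] = np.where(y_test[i].values == y_pred[i].values,1,0)
# A sample is an exact match only when all three labels are correct
df_match['sum'] = df_match[['Family','Genus','Species']].sum(axis=1)
em =np.where(df_match['sum']==3,1.0,0).sum(axis=0)/len(df_match)
print("the exact match is: {}".format(em))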

the exact match is: 0.906438165818


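The exact match for the Gaussian-kernel SVM is 0.982. <br/>
The exact match for the L1-penalized linear SVM is 0.913. <br/>
The exact match for the L1-penalized linear SVM with SMOTE is 0.906. <br/>

The best classifier here is the SVM with a Gaussian kernel, as it has the highest exact match. SMOTE slightly lowers the exact match of the L1-penalized SVM, but for the Family label it raises the recall of the rare class Bufonidae from 0.00 to 0.93, at the cost of low precision (0.33) for that class.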

## 2. K-Means Clustering on a Multi-Class and Multi-Label Data Set

Monte-Carlo Simulation: Perform the following procedures 50 times, and report the average and standard deviation of the 50 Hamming Distances that you calculate.

#### (a) Use k-means clustering on the whole Anuran Calls (MFCCs) Data Set (do not split the data into train and test, as we are not performing supervised learning in this exercise). Choose k automatically based on one of the methods provided in the slides (CH or Gap Statistics or scree plots or Silhouettes) or any other method you know.
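This notebook picks k with the silhouette method. As a rough illustration of what the silhouette score measures, the sketch below computes the mean silhouette coefficient in pure Python for two candidate labelings of four made-up points. The function `silhouette` and the `candidates` dictionary are hypothetical; the cells below rely on `silhouette_score` from scikit-learn instead.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Two well-separated pairs of made-up points
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
candidates = {
    "natural grouping": [0, 0, 1, 1],
    "mixed grouping": [0, 1, 0, 1],
}

scores = {name: silhouette(points, labels) for name, labels in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print(best)
```

The labeling that keeps nearby points together scores close to 1, while the one that mixes the two groups scores below 0. Selecting k works the same way: each k yields a labeling, and the one with the highest mean silhouette is kept.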


In [15]:
df=pd.read_csv('Frogs_MFCCs.csv')
#df=df.drop(['RecordID'],axis=1)
x=df.drop(['Family','Genus','Species','RecordID'],axis=1)
y=df[['Family','Genus','Species']]
x.shape

(7195, 22)

#### (b) In each cluster, determine which family is the majority by reading the true labels. Repeat for genus and species.
#### (c) Now for each cluster you have a majority label triplet (family, genus, species). Calculate the average Hamming distance (score) between the true labels and the labels assigned by clusters.
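Steps (b) and (c) can be sketched on a tiny made-up example: take the most frequent value of each label within a cluster, assign that triplet to every member, and count the mismatches. The data and the helper names `majority_triplets` and `average_hamming_distance` are illustrative assumptions. Note that the notebook cell below reports one minus this quantity, which is the Hamming score.

```python
from collections import Counter, defaultdict


def majority_triplets(labels, cluster_ids):
    """Map each cluster id to its majority (family, genus, species) triplet."""
    members = defaultdict(list)
    for triplet, c in zip(labels, cluster_ids):
        members[c].append(triplet)
    return {
        c: tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))
        for c, rows in members.items()
    }


def average_hamming_distance(labels, cluster_ids):
    """Fraction of individual labels that differ from the cluster majority."""
    majority = majority_triplets(labels, cluster_ids)
    mismatches = sum(
        t != m
        for triplet, c in zip(labels, cluster_ids)
        for t, m in zip(triplet, majority[c])
    )
    return mismatches / (len(labels) * 3)


# Made-up true labels and cluster assignments (three rows per cluster)
true_labels = [
    ("Hylidae", "Hypsiboas", "HypsiboasCordobae"),
    ("Hylidae", "Hypsiboas", "HypsiboasCordobae"),
    ("Hylidae", "Hypsiboas", "HypsiboasCinerascens"),
    ("Leptodactylidae", "Adenomera", "AdenomeraAndre"),
    ("Leptodactylidae", "Adenomera", "AdenomeraAndre"),
    ("Dendrobatidae", "Ameerega", "Ameeregatrivittata"),
]
cluster_ids = [0, 0, 0, 1, 1, 1]

majority = majority_triplets(true_labels, cluster_ids)
distance = average_hamming_distance(true_labels, cluster_ids)
print(majority)
print(distance)      # 4 mismatched labels out of 18
print(1 - distance)  # corresponding Hamming score
```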

In [16]:
hamming_avg=[]
for mc in range(0,50):

    # Silhouette score for each candidate number of clusters k = 2..15
    silhouette_avg={}
    for i in range(2,16):
        kmeans=KMeans(n_clusters=i,init='k-means++',n_init=5)
        cluster_labels = kmeans.fit_predict(x)
        silhouette_avg[i] = silhouette_score(x, cluster_labels)
    # Choose the k with the highest silhouette score
    cluster= max(silhouette_avg, key=silhouette_avg.get)
    print(" The best cluster value is: ",cluster)

    # Refit k-means with the selected k and record each row's cluster
    kmeans=KMeans(n_clusters=cluster,init='k-means++',n_init=5)
    cluster_labels = kmeans.fit_predict(x)
    df['cluster_labels']=cluster_labels

    # Majority [family, genus, species] triplet within each cluster
    d=defaultdict(list)
    for clus in range(0,cluster):
        temp = df[df['cluster_labels']==clus]
        d[clus].append(temp['Family'].value_counts().index[0])
        d[clus].append(temp['Genus'].value_counts().index[0])
        d[clus].append(temp['Species'].value_counts().index[0])
    print("the labels with max count in each cluster is :{}".format(d))

    # Assign every row the majority triplet of its cluster
    df['family'] = 'none'
    df['genus'] = 'none'
    df['species'] = 'none'
    for clus in range(cluster):
        df['family'] = np.where(df['cluster_labels']==clus,d[clus][0],df['family'])
        df['genus'] = np.where(df['cluster_labels']==clus,d[clus][1],df['genus'])
        df['species'] = np.where(df['cluster_labels']==clus,d[clus][2],df['species'])

    df_pred = df.loc[:,['family','genus','species']].rename(columns={'family':'Family','genus':'Genus','species':'Species'})
    df_true = df.loc[:,['Family','Genus','Species']]

    # 1 - hamming_loss is the Hamming score (fraction of correctly assigned labels),
    # averaged here over the three labels
    hamming=[]
    for i in ['Family', 'Genus', 'Species']:
        hamming.append(1-hamming_loss(df_true[i], df_pred[i]))
    hamming_avg.append(sum(hamming)/3)



        

(' The best cluster value is: ', 4)
the labels with max count in each cluster is :defaultdict(<type 'list'>, {0: ['Leptodactylidae', 'Adenomera', 'AdenomeraHylaedactylus'], 1: ['Leptodactylidae', 'Adenomera', 'AdenomeraAndre'], 2: ['Hylidae', 'Hypsiboas', 'HypsiboasCordobae'], 3: ['Hylidae', 'Hypsiboas', 'HypsiboasCordobae']})
(' The best cluster value is: ', 4)
the labels with max count in each cluster is :defaultdict(<type 'list'>, {0: ['Leptodactylidae', 'Adenomera', 'AdenomeraHylaedactylus'], 1: ['Hylidae', 'Hypsiboas', 'HypsiboasCinerascens'], 2: ['Leptodactylidae', 'Adenomera', 'AdenomeraAndre'], 3: ['Hylidae', 'Hypsiboas', 'HypsiboasCordobae']})
(' The best cluster value is: ', 4)
the labels with max count in each cluster is :defaultdict(<type 'list'>, {0: ['Leptodactylidae', 'Adenomera', 'AdenomeraHylaedactylus'], 1: ['Dendrobatidae', 'Ameerega', 'Ameeregatrivittata'], 2: ['Hylidae', 'Hypsiboas', 'HypsiboasCinerascens'], 3: ['Hylidae', 'Hypsiboas', 'HypsiboasCordobae']})
(' The

In [17]:
hamming_avg

[0.75487607134584211,
 0.76636553161918008,
 0.77757702107945337,
 0.77757702107945337,
 0.77757702107945337,
 0.77808663423673841,
 0.77757702107945337,
 0.77757702107945337,
 0.80078758397034966,
 0.77757702107945337,
 0.75487607134584211,
 0.77757702107945337,
 0.77757702107945337,
 0.77757702107945337,
 0.77808663423673841,
 0.83210562890896467,
 0.75473708593930977,
 0.77822561964327075,
 0.77753069261060925,
 0.76631920315033586,
 0.77757702107945337,
 0.77771600648598571,
 0.76627287468149186,
 0.76627287468149186,
 0.77757702107945337,
 0.77757702107945337,
 0.77771600648598571,
 0.77757702107945337,
 0.77757702107945337,
 0.77753069261060925,
 0.77757702107945337,
 0.77757702107945337,
 0.77757702107945337,
 0.75487607134584211,
 0.77757702107945337,
 0.76594857539958305,
 0.75255964790363683,
 0.75473708593930977,
 0.77757702107945337,
 0.77757702107945337,
 0.77757702107945337,
 0.77757702107945337,
 0.75487607134584211,
 0.77757702107945337,
 0.75487607134584211,
 0.7780866

In [19]:
sd= statistics.stdev(hamming_avg)
print("the standard deviation is :",sd)
aver=statistics.mean(hamming_avg)
print("the average of the hamming_avg",aver)

('the standard deviation is :', 0.01297612494457149)
('the average of the hamming_avg', 0.7739068797776234)
