# Guided exercise
This exercise comes from « Practical Data Mining with Python », Giuseppe Vettigli
   
 ## Data imports 
The file describes 150 iris specimens, 50 from each of 3 species (Iris setosa, Iris virginica and Iris versicolor). Each specimen is described by 4 characteristics:
- sepal length,
- sepal width,
- petal length,
- petal width.

The species is the fifth column.

The data is provided as a CSV file. For ease of use, it will be stored in a pandas DataFrame.

In [None]:
# Import the required libraries
# plotting libraries
import matplotlib as mpl
from matplotlib import pyplot
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import figure, subplot, hist, xlim, show, plot
%matplotlib inline

# data libraries

import pandas as pd
import pylab as pl
import numpy as np

from pandas.plotting import scatter_matrix
from pandas.plotting import boxplot
from pandas.plotting import parallel_coordinates

In [None]:
# import the data and create the pandas object
data_panda = pd.read_csv('iris_english.csv')

Check that the import went well.
Display the column names.

In [None]:
print(data_panda.keys())

In [None]:
data_panda.dtypes

We can also display the names of the species and their distribution.

In [None]:
data_panda['Species'].value_counts()

In [None]:
data_panda.groupby('Species').describe()

In [None]:
#definition of the colors used for visualization 
colors = np.where(data_panda['Species']=='setosa','r','-')
colors[data_panda['Species']=='versicolor'] = 'g'
colors[data_panda['Species']=='virginica']= 'b'
#print(colors)
color_dict={'setosa':'r','versicolor':'g' ,'virginica':'b'}

## Visualization
Visualization helps us understand the input data better and check whether preprocessing is needed.



In [None]:
# to facilitate future use, we can create a list with the input columns
Input_cols = ['Sepal_lenght', 'Sepal_width', 'Petal_lenght', 'Petal_width']

The choice of input data is very important. 2 things should be considered:
- the target value should be excluded from this set;
- random data or data unrelated to the target should not be included.



In [None]:
data_panda.corr(numeric_only=True)

In [None]:
# plot the petal length as a function of the sepal length

data_panda.plot(kind="scatter", x='Sepal_lenght', y='Petal_lenght')

plt.title('150 specimens') 
plt.show()


We can use the colors defined above to highlight the species.

In [None]:
sns.scatterplot(x='Sepal_lenght', y='Petal_lenght', hue='Species', palette=color_dict, data=data_panda) 

plt.title('150 specimens') 
plt.show()

We can also use histograms and compare the distributions.


In [None]:
scatter_matrix(data_panda, figsize=(10, 10), diagonal='hist', c=colors);

We can also use boxplots.

In [None]:
data_panda.boxplot(by='Species', figsize=(12, 6));

These plots can give us ideas to ease the classification. For instance, we can see that Iris setosa has smaller sepals than Iris virginica.

# Preprocessing
## Normalization

To guarantee that the use of Euclidean distances will not favor one characteristic over the others, we need to work on normalized data.
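As a rough illustration of the min-max scaling used in this section, here is a plain-Python sketch on made-up values. The helper name `min_max_scale` and the numbers are assumptions for the example, not part of the exercise. Each value x is mapped to (x - min) / (max - min), so every characteristic ends up between 0 and 1.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the [0, 1] range (hypothetical helper)."""
    lowest, highest = min(values), max(values)
    span = highest - lowest
    if span == 0:
        # constant column: there is no spread to rescale, map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lowest) / span for v in values]


# made-up lengths in cm, for illustration only
lengths = [4.0, 5.0, 6.0, 8.0]
scaled = min_max_scale(lengths)
print(scaled)
```

After scaling, a characteristic measured on a large range no longer dominates the distance just because of its units.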

In [None]:
# min-max normalization: rescale each input column to the [0, 1] range
import copy
Norm=copy.deepcopy(data_panda)
Norm[Input_cols]=(data_panda[Input_cols]-data_panda[Input_cols].min())/(data_panda[Input_cols].max()-data_panda[Input_cols].min())
print(Norm.keys())

In [None]:
sns.scatterplot(x='Sepal_lenght', y='Petal_lenght', hue='Species', palette=color_dict, data=Norm) 
plt.title('150 specimens, normalized') 
plt.show()

In [None]:
Norm.boxplot(by='Species', figsize=(12, 6));

## Encoding
Some methods can only use numerical data. In order to use them we need to transform the target values into integers.
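To make the idea of integer encoding concrete, here is a small plain-Python sketch. It is an illustration only, not a reimplementation of the pandas call used in this notebook; the helper name `encode_labels` and the sample list are assumptions.

```python
def encode_labels(labels):
    """Map each distinct label to an integer, in order of first appearance."""
    categories = []   # the position in this list is the integer code
    codes = []
    for label in labels:
        if label not in categories:
            categories.append(label)
        codes.append(categories.index(label))
    return codes, categories


# made-up sample, for illustration only
species = ['setosa', 'virginica', 'setosa', 'versicolor', 'virginica']
codes, categories = encode_labels(species)
print(codes)
print(categories)
```

The list of categories plays the role of a decoding table: `categories[code]` gives back the original name.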

In [None]:
# convert the species into class numbers
Norm.Species=Norm.Species.astype('category')
Norm['Species_encoded'],dict_cat=Norm.Species.factorize()
# the code-to-species mapping is stored in dict_cat
print(dict_cat)


In [None]:
# build the matching color dictionary
color_dict_encoded={}
for i in range (0, 3):
    color_dict_encoded[i]=color_dict[dict_cat[i]]
print(color_dict_encoded)

## Principal component analysis


Principal component analysis (PCA) projects the data onto the directions along which the variance is largest.
As a result, points that are far apart in the n-dimensional space tend to stay apart in the 2-dimensional PCA space, although some information is lost in the projection.

Let's quantify the information that is lost.
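As a hand-computed illustration of what "lost information" means, the sketch below works on made-up 2-D points. It computes the share of variance carried by each principal axis from the eigenvalues of the 2x2 covariance matrix. The helper name `explained_variance_2d` and the points are assumptions for the example.

```python
import math


def explained_variance_2d(points):
    """Share of the total variance carried by each principal axis of 2-D data."""
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # entries of the 2x2 sample covariance matrix [[a, b], [b, c]]
    a = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    c = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    b = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # eigenvalues of a symmetric 2x2 matrix, largest first
    half_trace = (a + c) / 2
    radius = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    eigenvalues = [half_trace + radius, half_trace - radius]
    total = sum(eigenvalues)
    return [value / total for value in eigenvalues]


# made-up points lying exactly on the line y = x
points = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0)]
ratios = explained_variance_2d(points)
loss_with_one_component = 1 - ratios[0]
print(ratios, loss_with_one_component)
```

Because these points lie on a single line, one component carries essentially all the variance and the loss is close to zero. Real data is never this clean, which is why the loss is worth measuring.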

In [None]:
from sklearn.decomposition import PCA
for i in range(1,5):

    pca = PCA(n_components=i)

    pca.fit(Norm[Input_cols])

    print(i, 'components represent a data loss of', (1 - sum(pca.explained_variance_ratio_)) * 100, '%')


We will use the 2-D representation for visualization.
If the algorithms take too long, we should decide which PCA reduction gives an acceptable loss.

In [None]:
n_components=2
pca = PCA(n_components)
pca.fit(Norm[Input_cols])
pca_apply = pca.transform(Norm[Input_cols])

We can identify the composition of those 2 components from the 4 initial dimensions.

In [None]:
base=pd.DataFrame(pca.components_,columns=Norm[Input_cols].columns,index = ['PCA0','PCA1'])            
print(base)

In [None]:
pcad_panda=pd.DataFrame(pca_apply, columns=['PCA%i' % i for i in range(n_components)]) # store the projected data in a pandas DataFrame
Norm=pd.concat([Norm, pcad_panda], axis=1) # append the PCA columns to Norm
print(Norm.keys())

In [None]:
#visualization
sns.scatterplot(x='PCA0',y='PCA1', hue='Species',palette=color_dict, data=Norm)
pl.xlabel('PCA0')
pl.ylabel('PCA1')
pl.title(' 150 specimens in the new base')    
plt.show()



## Classification

Classification assigns a category to a specimen. 2 steps are required:
- Learning
- Prediction.

The Python library "sklearn" has numerous classification models. Here we will use Gaussian Naive Bayes to determine the species of the iris.


# Gaussian Naive Bayes 



The data should then be split into 2 groups:
- learning set
- test set

This split can be done manually by an expert (which will be the case for the car example) or randomly with the "train_test_split" function.

In this case, the expert only has to choose the proportion of each group. Here we keep 40% of the specimens for the test set, so the remaining 60% form the learning set.
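To show what a random split does, here is a minimal plain-Python sketch on row indices. It is not the scikit-learn implementation; the helper name `split_indices`, its parameters and the seed are assumptions for the example. The `test_size` argument is the fraction of specimens held out for testing.

```python
import random


def split_indices(n_samples, test_size, seed=0):
    """Shuffle the indices 0..n_samples-1 and cut them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)   # seeded so the split is reproducible
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10, test_size=0.4)
print(len(train_idx), len(test_idx))
```

Every specimen lands in exactly one of the two groups, and fixing the seed gives the same split on every run.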

In [None]:
from sklearn.model_selection import train_test_split

# the learning population is called train,
# its target value (species) t_train
# the test population is called test
# its target value (species) t_test

train, test, t_train, t_test = train_test_split(Norm, Norm['Species_encoded'], test_size=0.4, random_state=0)
#print(train)
#print(test)

Let's visualize the distribution of specimens between the 2 sets.

In [None]:
sns.scatterplot(x='PCA0',y='PCA1', data=train)
sns.scatterplot(x='PCA0',y='PCA1', data=test)

pl.xlabel('PCA0')
pl.ylabel('PCA1')
plt.legend( loc='upper left', labels=['Learning set', 'Test set'])
pl.title('Random repartition of specimens') 

plt.show()
    
    

First step: learning.

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier_GNB=GaussianNB()
#Learning
classifier_GNB.fit(train[Input_cols],train['Species_encoded']) # train
#Prediction
prediction_train =classifier_GNB.predict(train[Input_cols]) #prediction


In [None]:
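# here we can compare the prediction and the real species for the first specimen of the train set
# .iloc selects by position, matching the order of prediction_train (the split shuffled the index)
print('The first flower is a '+train['Species'].iloc[0]+' (encoded as '+ str(train['Species_encoded'].iloc[0])+')')
print('The Naive Bayes method predicts it as '+ dict_cat[prediction_train[0]])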


We can measure the performance on the whole train set.

In [None]:
print (classifier_GNB.score(train[Input_cols],t_train)) # accuracy on the train set


This performance is the number of correct predictions divided by the number of specimens in the evaluated set (here the learning set), i.e. the proportion of accurate predictions.
If we are happy with the result, we can apply the classifier to the test set to check that there is no overfitting.
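The score described above can be sketched in a few lines of plain Python. The helper name `accuracy` and the label lists are made up for the illustration.

```python
def accuracy(real, predicted):
    """Proportion of predictions that match the real labels."""
    correct = sum(1 for r, p in zip(real, predicted) if r == p)
    return correct / len(real)


# made-up labels: one specimen of class 1 is predicted as class 2
real = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2]
print(accuracy(real, predicted))
```

A score that is much higher on the learning set than on the test set is the usual sign of overfitting.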

In [None]:
prediction_test =classifier_GNB.predict(test[Input_cols]) #prediction
print (classifier_GNB.score(test[Input_cols],t_test)) # test

In [None]:
prediction =classifier_GNB.predict(Norm[Input_cols]) #prediction

Another evaluation tool is the confusion matrix. In this matrix, columns are the predicted classes and rows are the real classes.
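As an illustration of how such a matrix is filled, here is a plain-Python sketch with rows for the real classes and columns for the predicted classes. The helper name `confusion` and the labels are assumptions for the example.

```python
def confusion(real, predicted, n_classes):
    """Count matrix where rows are real classes and columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for r, p in zip(real, predicted):
        matrix[r][p] += 1
    return matrix


# made-up labels: one specimen of class 1 is predicted as class 2
real = [0, 0, 1, 1, 2, 2]
predicted = [0, 0, 1, 2, 2, 2]
toy_matrix = confusion(real, predicted, n_classes=3)
for row in toy_matrix:
    print(row)
```

The single error shows up off the diagonal, in row 1 (real class) and column 2 (predicted class).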


In [None]:
from sklearn.metrics import confusion_matrix
M_GNB=confusion_matrix(t_test,prediction_test) # real classes first, so that rows are the real classes
print (M_GNB)


We can also represent this info in a graph.

In [None]:
conf_GNB = pd.DataFrame(columns=['real','real_name','predicted','density'])
for i in range (0, 3):
    for j in range (0,3) :
        if M_GNB[i][j]>0 :
            new_row =pd.DataFrame({'real':i, 'real_name':dict_cat[i],'predicted':j, 'density':float(M_GNB[i][j])}, index=[0])
            conf_GNB=pd.concat([conf_GNB,new_row], ignore_index=True)
sns.scatterplot(x='real_name', y='predicted', s=(conf_GNB.density)*60, data=conf_GNB) 
pl.xlabel('Real species')
pl.ylabel('Predicted species')
pl.title('Prediction relevance')
plt.show()

In this representation, the errors are the off-diagonal elements. Here, some Iris versicolor were labelled as virginica.

Other metrics can be used.

In [None]:
from sklearn.metrics import classification_report

print (classification_report(t_test,prediction_test)) # real classes first, then predictions

- Precision: proportion of the specimens attributed to this class that really belong to it
- Recall: proportion of the elements of this class that were attributed to it
- F1-score: harmonic mean of the 2 other indicators
- Support: number of elements of this class used in the test.
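These indicators can be read directly off a confusion matrix. The sketch below does so in plain Python for one class, assuming rows are the real classes and columns the predicted ones. The helper name `class_metrics` and the matrix values are made up for the illustration.

```python
def class_metrics(matrix, k):
    """Precision, recall, F1 and support for class k (rows: real, columns: predicted)."""
    true_positive = matrix[k][k]
    predicted_as_k = sum(row[k] for row in matrix)   # column total
    really_k = sum(matrix[k])                        # row total, i.e. the support
    precision = true_positive / predicted_as_k if predicted_as_k else 0.0
    recall = true_positive / really_k if really_k else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0, really_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1, really_k


# made-up confusion matrix, for illustration only
matrix = [[2, 0, 0],
          [0, 1, 1],
          [0, 0, 2]]
precision, recall, f1, support = class_metrics(matrix, 2)
print(precision, recall, f1, support)
```

For class 2, every real member is found (recall of 1), but one of the three specimens predicted as class 2 is wrong, so the precision is 2/3.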

To be relevant, the evaluation should be done on multiple (learning set, test set) pairs.
For this we can use cross-validation: it splits the initial population several times, and the performance of the classification is the mean of the several evaluations.
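The repeated splitting can be pictured with the plain-Python sketch below, where each fold serves once as test set. It is a simplified illustration without shuffling or stratification, and scikit-learn's own splitters are more elaborate. The helper name `k_fold_indices` is an assumption.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) pairs, each fold serving once as test set."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // n_folds] * n_folds
    for i in range(n_samples % n_folds):
        fold_sizes[i] += 1          # spread the remainder over the first folds
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


folds = list(k_fold_indices(7, 3))
for train_part, test_part in folds:
    print(train_part, test_part)
```

Each specimen is used exactly once for testing, so the mean of the fold scores depends less on one lucky or unlucky split.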

In [None]:
from sklearn.model_selection import cross_val_score

# cross-validation with 6 folds
scores = cross_val_score(classifier_GNB,Norm[Input_cols], Norm['Species_encoded'], cv=6)

print (scores)

The result is a vector with the performance for each iteration.


In [None]:
from numpy import mean

print (mean(scores))


## Use
If we are happy with the classification, we can use it to classify the whole set.

In [None]:
prediction_test_GNB =classifier_GNB.predict(Norm[Input_cols]) #prediction
pred_GNB = pd.DataFrame(prediction_test_GNB )
pred_GNB.columns = ['Prediction_GNB']

# append this dataframe to Norm
Norm= pd.concat([Norm,pred_GNB], axis = 1)

We can also use it for a new specimen. For instance, what is the species of a flower whose normalized petal and sepal lengths and widths are all 0.5?

In [None]:
New_specimen = {'Sepal_lenght': [0.5],
        'Sepal_width': [0.5],
        'Petal_lenght': [0.5],
        'Petal_width': [0.5],
        }
panda_New_specimen = pd.DataFrame(New_specimen)  
D=classifier_GNB.predict(panda_New_specimen)
print('Using Gaussian Naive Bayes, the predicted species of such a flower is '+ dict_cat[D[0]])

# Neural network
For a neural network we follow the same method: learn on a train set, apply to a test set and, if we are happy with the performance, use it for prediction.

In [None]:
from sklearn.neural_network import MLPClassifier
classifier_NN = MLPClassifier()
#learn
classifier_NN.fit(train[Input_cols],train['Species_encoded']) # learning classifier.fit(input_dat, target_data)
#use on test
prediction =classifier_NN.predict(test[Input_cols]) #prediction
#evaluate on test
print (classifier_NN.score(test[Input_cols],t_test)) # test

In [None]:
#application on the whole set
prediction_NN =classifier_NN.predict(Norm[Input_cols]) #prediction
pred_NN = pd.DataFrame(prediction_NN )
pred_NN.columns = ['Prediction_NN']

# append this dataframe to Norm
Norm= pd.concat([Norm,pred_NN], axis = 1)

In [None]:
M_NN=confusion_matrix(Norm['Species_encoded'],Norm['Prediction_NN'])
print (M_NN)

In [None]:
confNN = pd.DataFrame(columns=['real','real_species','predicted','predicted_species','density'])
for i in range (0, 3):
    for j in range (0,3) :
        if M_NN[i][j]>0 :
            new_row = pd.DataFrame({'real':i, 'real_species':dict_cat[i],'predicted':j,'predicted_species':dict_cat[j], 'density':float(M_NN[i][j])}, index=[0])
            confNN=pd.concat([confNN,new_row], ignore_index = True)
sns.scatterplot(x='real_species', y='predicted_species', s=(confNN.density)*60, data=confNN)
pl.xlabel('Real species')
pl.ylabel('Predicted species')
pl.title('Prediction relevance (neural network)')
plt.show()

## Clustering

When the initial data is not labelled, groups need to be created based on similarity.
This is unsupervised learning.
Here we will use a classical clustering method: k-means.
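To give an idea of how k-means groups points, here is a plain-Python sketch of Lloyd's algorithm on made-up 1-D values. It starts from fixed centroids so the result is reproducible, whereas the notebook uses a random initialization. The helper name `k_means_1d` and the values are assumptions.

```python
def k_means_1d(values, centroids, n_iterations=10):
    """Lloyd's algorithm on 1-D data, starting from the given centroids."""
    centroids = list(centroids)
    labels = [0] * len(values)
    for _ in range(n_iterations):
        # assignment step: each value joins its nearest centroid
        labels = [min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
                  for v in values]
        # update step: each centroid moves to the mean of its members
        for k in range(len(centroids)):
            members = [v for v, label in zip(values, labels) if label == k]
            if members:                      # an empty cluster keeps its centroid
                centroids[k] = sum(members) / len(members)
    return labels, centroids


# made-up values forming two obvious groups
values = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
labels, centroids = k_means_1d(values, centroids=[0.0, 5.0])
print(labels, centroids)
```

The algorithm only alternates these two steps. It never sees any species label, which is what makes it unsupervised.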

In [None]:
print(Norm[Input_cols])

In [None]:
Norm.dtypes

In [None]:
from sklearn import cluster
from sklearn.cluster import KMeans 
from sklearn.metrics import completeness_score, homogeneity_score

In [None]:
Nombre_clusters=3 # number of clusters, matching the number of species
kmeans = KMeans(n_clusters=Nombre_clusters, init='random',n_init='auto') # initialization 
kmeans.fit(Norm[Input_cols]) #K-means training
labels = kmeans.labels_
centroids = kmeans.cluster_centers_
    
    
print('Coordinates of the  3 centroids')
print(centroids)


In [None]:
#actual prediction
y_pred = kmeans.predict(Norm[Input_cols])
#We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Prediction_kmean']
 

# append this dataframe to Norm
Norm = pd.concat([Norm,pred], axis = 1)


In [None]:
fig, ax = plt.subplots(figsize=(8,6))
sns.scatterplot(x='PCA0', y='PCA1', hue='Prediction_kmean', data=Norm) 
pl.title('3 Clusters K-Means')
pl.show()

We can compare these clusters with reality to measure the performance.



In [None]:
print (completeness_score(Norm['Species_encoded'],Norm['Prediction_kmean']))

Completeness is close to 1 when all elements of a class belong to the same cluster.

In [None]:
print (homogeneity_score(Norm['Species_encoded'],Norm['Prediction_kmean']))

Homogeneity is close to 1 when all elements of a cluster belong to the same class.
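Both scores can be written with conditional entropies: homogeneity is 1 - H(class | cluster) / H(class), and completeness is 1 - H(cluster | class) / H(cluster). The plain-Python sketch below follows these formulas on made-up labels. The helper names are assumptions, and a score is set to 1 when the entropy in its denominator is zero.

```python
import math
from collections import Counter


def entropy(a):
    """Entropy H(a) of a list of labels, in nats."""
    n = len(a)
    return -sum(count / n * math.log(count / n) for count in Counter(a).values())


def conditional_entropy(a, b):
    """Entropy H(a | b) of labels a given labels b, in nats."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(count / n * math.log(count / b_counts[b_value])
                for (_, b_value), count in joint.items())


def homogeneity_completeness(classes, clusters):
    """Return (homogeneity, completeness) of a clustering against real classes."""
    h_class, h_cluster = entropy(classes), entropy(clusters)
    homogeneity = 1.0 if h_class == 0 else 1 - conditional_entropy(classes, clusters) / h_class
    completeness = 1.0 if h_cluster == 0 else 1 - conditional_entropy(clusters, classes) / h_cluster
    return homogeneity, completeness


classes = [0, 0, 1, 1, 2, 2]
# perfect clustering, with cluster numbers that differ from the class numbers
print(homogeneity_completeness(classes, [2, 2, 0, 0, 1, 1]))
# a single big cluster: complete, but not homogeneous at all
print(homogeneity_completeness(classes, [0, 0, 0, 0, 0, 0]))
```

Neither score depends on how the clusters are numbered, which makes them convenient for judging a clustering before any cluster is matched to a species.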

Visualization also lets us compare the clustering with reality.

In [None]:
fig = plt.figure(figsize = (10, 10))

plt.subplot(211)
sns.scatterplot(x='PCA0', y='PCA1', hue='Species_encoded', palette=color_dict_encoded, data=Norm)
plt.title('Real species')

plt.subplot(212)
sns.scatterplot(x='PCA0', y='PCA1', hue='Prediction_kmean', data=Norm)
plt.title('3 Clusters K-Means')

# adjust the spacing once both subplots exist, then display
plt.tight_layout()
plt.show()

In [None]:
M2=confusion_matrix(Norm['Species_encoded'],Norm['Prediction_kmean'])
print (M2)

Beware!
Here a perfect identification does not mean a diagonal matrix: cluster numbers are arbitrary, so it does not matter whether setosa is cluster n°1, n°2 or n°3.

Let's represent it graphically.

In [None]:
dict_cluster={0: 'A', 1: 'B', 2: 'C'}
conf1 = pd.DataFrame(columns=['real','real_species','predicted','predicted_cluster','density'])
for i in range (0, 3):
    for j in range (0,3) :
        if M2[i][j]>0 :
            new_row = pd.DataFrame({'real':i, 'real_species':dict_cat[i],'predicted':j,'predicted_cluster':dict_cluster[j], 'density':float(M2[i][j])}, index=[0])
            conf1=pd.concat([conf1,new_row], ignore_index = True)             
sns.scatterplot(x='real_species', y='predicted_cluster', s=(conf1.density)*60, data=conf1) 
pl.xlabel('Real species')
pl.ylabel('Cluster k-mean')
pl.title('Alignment between clusters and species')
plt.show()

In [None]:
print (classification_report(Norm['Species_encoded'],Norm['Prediction_kmean']))

To "match" the clusters to the real labels (here species) we can use the following function 

In [None]:
# This function finds the best match between clusters and labels
from itertools import permutations

def remap_labels(pred_labels, true_labels):
    """Rename prediction labels (clustered output) to best match true labels."""

    pred_labels, true_labels = np.array(pred_labels), np.array(true_labels)
    assert pred_labels.ndim == 1 == true_labels.ndim
    assert len(pred_labels) == len(true_labels)
    cluster_names = np.unique(pred_labels)
    accuracy = 0
    dict_map = {}  # stays empty if no permutation matches any label

    # try every assignment of true labels to clusters and keep the most accurate one
    perms = np.array(list(permutations(np.unique(true_labels))))

    remapped_labels = true_labels
    for perm in perms:
        flipped_labels = np.zeros(len(true_labels), dtype=true_labels.dtype)
        for label_index, label in enumerate(cluster_names):
            flipped_labels[pred_labels == label] = perm[label_index]

        testAcc = np.sum(flipped_labels == true_labels) / len(true_labels)
        if testAcc > accuracy:
            accuracy = testAcc
            remapped_labels = flipped_labels
            # map each cluster name to the label assigned to it
            dict_map = dict(zip(cluster_names.tolist(), perm))

    return accuracy, remapped_labels, dict_map


In [None]:
acc,y_pred,dict_map_cluster =remap_labels(Norm['Prediction_kmean'],Norm['Species_encoded'])
print(dict_map_cluster)
#We store the K-means results in a dataframe
pred = pd.DataFrame(y_pred)
pred.columns = ['Prediction_kmean_mapped']

# append this dataframe to Norm
Norm= pd.concat([Norm,pred], axis = 1)

This dictionary gives us the relationship between clusters and species. We can use it to build the "traditional" confusion matrix.


In [None]:
M_KM=confusion_matrix(Norm['Species_encoded'],Norm['Prediction_kmean_mapped'])
print (M_KM)

In [None]:
conf_KM = pd.DataFrame(columns=['real','real_name','predicted','density'])
for i in range (0, 3):
    for j in range (0,3) :
        if M_KM[i][j]>0 :
            new_row = pd.DataFrame({'real':i, 'real_name':dict_cat[i],'predicted':j, 'density':float(M_KM[i][j])}, index=[0])
            conf_KM=pd.concat([conf_KM,new_row], ignore_index=True)
            
print(conf_KM)
sns.scatterplot(x='real_name', y='predicted', s=(conf_KM.density)*60, data=conf_KM) 
pl.xlabel('Real species')
pl.ylabel('Predicted species (mapped k-means cluster)')
pl.title('Prediction accuracy')
plt.show()

This clustering of the learning population also makes it possible to predict the cluster of a new specimen.
For instance, what is the cluster of a flower whose normalized petal and sepal lengths and widths are all 0.5?

In [None]:
      
D_kmeans=kmeans.predict(panda_New_specimen)
print('Using kmeans, the predicted species of such a flower is '+ dict_cat[dict_map_cluster[D_kmeans[0]]])

# Export of the results

The following lines create a .csv file that contains the predicted species.

In [None]:
nb_specimen=len(data_panda)

In [None]:
import csv
# writing the csv file
with open('my_prediction.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # writing the header
    writer.writerow(['Flower', 'Prediction GNB', 'Prediction k-means','Prediction NN'])
    # writing the data
    for i in range(nb_specimen):
        writer.writerow([i, dict_cat[Norm.loc[i,'Prediction_GNB']], dict_cat[int(Norm.loc[i,'Prediction_kmean_mapped'])],dict_cat[int(Norm.loc[i,'Prediction_NN'])]])
