# Lab 2: Decision Tree, Naive Bayes, and KNN
#### Department of Software Engineering and Information Technology

| Students              |                                                   |
|-----------------------|---------------------------------------------------|
| Jean-Philippe Decoste | DECJ19059105                                      |
| Ahmad Al-Taher        | ALTA22109307                                      |
| Course                | GTI770 - Intelligent Systems and Machine Learning |
| Term                  | Fall 2018                                         |
| Group                 | 2                                                 |
| Lab number            | 02                                                |
| Professor             | Hervé Lombaert                                    |
| Lab instructor        | Pierre-Luc Delisle                                |
| Date                  | Oct 30, 2018                                      |

In [1]:
import math

import cv2 as cv
import graphviz
import matplotlib.pyplot as plt
import numpy as np
from skimage import img_as_ubyte
from skimage.color import rgb2gray
from skimage.filters import threshold_otsu
from sklearn import preprocessing, tree
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedShuffleSplit, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.naive_bayes import MultinomialNB

In [2]:
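class Email:
    """One row of the spam dataset: feature values plus a boolean label."""

    def __init__(self, features):
        # The last CSV column holds the label; answer is True when it equals 1.
        self.answer = int(features[-1]) == 1
        self.features = features[:-1]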


In [3]:
class Image:
    """One row of the galaxy dataset: an id, feature values, and a boolean label."""

    def __init__(self, features):
        # The first CSV column is the image id and the last one is the label.
        self.id = int(features[0])
        self.answer = features[-1] == 1
        self.features = features[1:-1]

In [46]:
def loadEmailFeatures(emails, csvPath):
    """
    Load the email features from the dataset.

    Args:
        emails: list that receives one Email per CSV row
        csvPath: the dataset file path
    """
    # index limits the number of rows read, for test purposes
    index = 0
    with open(csvPath, 'r') as csvFile:
        for line in csvFile:
            features = line.strip().split(",")
            # a blank line splits into a single empty string, so skip it
            if len(features) > 1:
                emails.append(Email(features))

            index += 1
            if index == 100:
                break

def loadImageFeatures(images, csvPath):
    """
    Load the image features from the dataset.

    Args:
        images: list that receives one Image per CSV row
        csvPath: the dataset file path
    """
    # index limits the number of rows read, for test purposes
    index = 0
    with open(csvPath, 'r') as csvFile:
        for line in csvFile:
            features = line.strip().split(",")
            # a blank line splits into a single empty string, so skip it
            if len(features) > 1:
                cleanCsvFeatures(features)
                images.append(Image(features))

            index += 1
            if index == 100:
                break

def cleanCsvFeatures(features):
    # The CSV stores values in scientific notation (x.xxxxe+XX);
    # convert each one to a float, in place.
    for i, val in enumerate(features):
        features[i] = float(val)

        
def knnModel(dataset):
    x = []
    y = []

    for data in dataset:
        x.append(data.features)
        y.append(data.answer)

    x = np.array(x).astype(np.float64)
    y = np.array(y).astype(np.float64)

    params = dict(n_neighbors=[3, 5, 10], weights=['uniform', 'distance'], algorithm=['auto'])
    grid = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=5)
    grid.fit(x, y)

    # cv_results_ holds one entry per parameter combination, in this order:
    # [uniform k=3, distance k=3, uniform k=5, distance k=5, uniform k=10, distance k=10]
    results = grid.cv_results_
    for combo, score in zip(results['params'], results['mean_test_score']):
        print("KNN %s has a score of %0.2f with k=%s " % (combo['weights'], score, combo['n_neighbors']))

    print("The best parameters are %s with a score of %0.2f" % (grid.best_params_, grid.best_score_))

def bayes(dataset):
    x = []
    y = []

    for data in dataset:
        x.append(data.features)
        y.append(data.answer)

    x = np.array(x).astype(np.float64)
    y = np.array(y).astype(np.float64)

    # MultinomialNB requires non-negative inputs, so rescale every feature to [0, 1].
    xNormalized = MinMaxScaler().fit_transform(x)

    # With an empty param_grid, GridSearchCV acts as a plain 5-fold cross-validation.
    grid = GridSearchCV(MultinomialNB(), param_grid={}, cv=5)
    grid.fit(xNormalized, y)
    print("MultinomialNB with normalized values best score is %0.2f" % (grid.best_score_))

    # Alternative input: discretize each feature into 3 equal-width bins (ordinal codes 0 to 2).
    est = preprocessing.KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
    xKBinsDiscretizer = est.fit_transform(x)

    grid.fit(xKBinsDiscretizer, y)
    print("MultinomialNB with discretized values best score is %0.2f" % (grid.best_score_))
    

In [47]:
Emails = []
Images = []

loadEmailFeatures(Emails, r"C:\Users\ThinkPad\Downloads\gti770\data\csv\spam\spam.csv")
loadImageFeatures(Images, r"C:\Users\ThinkPad\Downloads\gti770\data\csv\galaxy\galaxy_feature_vectors.csv")
# Run KNN and naive Bayes on both datasets
print("Working with the emails")
knnModel(Emails)
bayes(Emails)
print("Working with the images")
knnModel(Images)
bayes(Images)


Working with the emails
KNN uniform has a score of 0.68 with k=3 
KNN distance has a score of 0.69 with k=3 
KNN uniform has a score of 0.64 with k=5 
KNN distance has a score of 0.65 with k=5 
KNN uniform has a score of 0.69 with k=10 
KNN distance has a score of 0.67 with k=10 
The best parameters are {'algorithm': 'auto', 'n_neighbors': 3, 'weights': 'distance'} with a score of 0.69
MultinomialNB with normalized values best score is 0.89
MultinomialNB with discretized values best score is 0.77
Working with the images
KNN uniform has a score of 0.61 with k=3 
KNN distance has a score of 0.59 with k=3 
KNN uniform has a score of 0.62 with k=5 
KNN distance has a score of 0.60 with k=5 
KNN uniform has a score of 0.62 with k=10 
KNN distance has a score of 0.63 with k=10 
The best parameters are {'algorithm': 'auto', 'n_neighbors': 10, 'weights': 'distance'} with a score of 0.63
MultinomialNB with normalized values best score is 0.71
MultinomialNB with discretized values best score is 0.71
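
The `bayes` function above fits `MinMaxScaler` on the whole dataset before cross-validation, so every validation fold has already influenced the scaling. A minimal sketch of one way to keep the scaling inside each fold is to wrap both steps in a scikit-learn `Pipeline`. The synthetic data from `make_classification`, the step names `scale` and `nb`, and the `alpha` values are assumptions made for illustration; they are not part of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the lab's feature vectors (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The scaler is a pipeline step, so GridSearchCV refits it on the
# training part of every fold instead of on the full dataset.
pipeline = Pipeline([
    ("scale", MinMaxScaler(clip=True)),
    ("nb", MultinomialNB()),
])

# Hypothetical grid: the alpha values are illustrative only.
grid = GridSearchCV(pipeline, param_grid={"nb__alpha": [0.1, 1.0]}, cv=5)
grid.fit(X, y)

print("Best parameters: %s, score: %0.2f" % (grid.best_params_, grid.best_score_))
```

`clip=True` keeps validation values inside [0, 1] even when they fall outside the range seen in the training fold, which matters because `MultinomialNB` expects non-negative inputs.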




## Introduction

The results shown above were obtained without adding our own features.

## Question 1
### Method for creating the datasets

## Question 1
### Details of the datasets produced

## Question 2
### Proposed validation approach and justification
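
The notebook imports `StratifiedShuffleSplit` and runs `GridSearchCV` with `cv=5`. One possible way to combine the two, sketched here under assumptions, is to hold out a stratified test set first and run the 5-fold search only on the remaining data. The synthetic data, the 80/20 split, and `random_state=0` are illustrative choices, not the authors' documented protocol.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the lab's feature vectors (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# A single stratified split: class proportions are preserved in both parts.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Hyperparameters are chosen by 5-fold cross-validation on the training part only.
params = {"n_neighbors": [3, 5, 10], "weights": ["uniform", "distance"]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid=params, cv=5)
grid.fit(X_train, y_train)

# The held-out part is scored once, with the refitted best model.
test_accuracy = grid.score(X_test, y_test)
print("Cross-validation score: %0.2f" % grid.best_score_)
print("Held-out test accuracy: %0.2f" % test_accuracy)
```

Keeping the test set out of the search gives an accuracy estimate that the hyperparameter selection has not seen.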

## Question 3
### Experiment matrix

## Question 3
### Study of the hyperparameters and models

## Question 4
### Impact of dataset size on classification performance
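
A rough sketch of how the effect of dataset size could be measured, assuming scikit-learn's `learning_curve` helper is acceptable for this lab. The synthetic data, the choice of `KNeighborsClassifier(n_neighbors=5)`, and the training sizes are placeholders chosen for illustration, not values taken from the experiments.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the lab's feature vectors (assumption, not the real data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# For each training size, the model is trained on a subset of the training
# fold and scored on the validation fold, across 5 folds.
train_sizes_abs, train_scores, test_scores = learning_curve(
    KNeighborsClassifier(n_neighbors=5),
    X,
    y,
    train_sizes=[50, 150, 250, 350],
    cv=5,
    shuffle=True,
    random_state=0,
)

for size, scores in zip(train_sizes_abs, test_scores):
    print("%d training samples -> mean validation accuracy %0.2f" % (size, scores.mean()))
```

Each row of `test_scores` holds the five fold scores for one training size, so the row means can be plotted against `train_sizes_abs`.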

## Question 5
### Impact of noise in the datasets on classification performance
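
One way to probe sensitivity to noise, sketched under assumptions, is to add zero-mean Gaussian noise of increasing standard deviation to the features and compare cross-validated accuracy. The helper `accuracy_with_noise`, the noise levels, and the synthetic data are hypothetical; the lab may define noise differently (for example, as label noise).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the lab's feature vectors (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
rng = np.random.default_rng(0)


def accuracy_with_noise(X, y, noise_std):
    """Mean 5-fold KNN accuracy after adding zero-mean Gaussian noise to X."""
    X_noisy = X + rng.normal(loc=0.0, scale=noise_std, size=X.shape)
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_noisy, y, cv=5)
    return scores.mean()


# Illustrative noise levels; 0.0 is the noise-free reference.
noise_levels = [0.0, 0.5, 1.0, 2.0]
accuracies = {std: accuracy_with_noise(X, y, std) for std in noise_levels}

for std, acc in accuracies.items():
    print("noise std=%.1f -> accuracy %0.2f" % (std, acc))
```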

## Question 6
### Discussion of the nature of the data

## Question 7
### Recommendations

## Question 8
### Possible improvements

## Conclusion

## Bibliography