# Iris flowers

$\textbf{Dataset description:}$

It contains the sepal length (Sepal.Length), sepal width (Sepal.Width), petal length (Petal.Length) and petal width (Petal.Width) for three iris species: Iris setosa, Iris versicolor and Iris virginica.

R.A. Fisher used these data to construct linear combinations of the variables that best separate the iris species.

# Importing the libraries

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
##
from scipy.stats import f_oneway
from scipy import stats
from scipy.stats import skew, kurtosis

%matplotlib notebook

# Importing the data

In [10]:
data_iris=pd.read_csv('iris.csv')
data_iris.head(10)

Unnamed: 0,SepalLength,SepalWidth,PetalLength,PetalWidth,Name
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


In [25]:
data=data_iris.copy()

# Renaming the variables

In [26]:
data=data.rename(columns={'SepalLength':'Longueur_Sepale',
                     'SepalWidth':'Largeur_Sepale',
                     'PetalLength':'Longueur_Petale',
                     'PetalWidth':'Largeur_Petale',
                     'Name':'Nom'})

data.head(10)

Unnamed: 0,Longueur_Sepale,Largeur_Sepale,Longueur_Petale,Largeur_Petale,Nom
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa
5,5.4,3.9,1.7,0.4,Iris-setosa
6,4.6,3.4,1.4,0.3,Iris-setosa
7,5.0,3.4,1.5,0.2,Iris-setosa
8,4.4,2.9,1.4,0.2,Iris-setosa
9,4.9,3.1,1.5,0.1,Iris-setosa


# Identifying and handling missing values

In [28]:
# Number of missing values per column
data.isna().sum()

Longueur_Sepale    0
Largeur_Sepale     0
Longueur_Petale    0
Largeur_Petale     0
Nom                0
dtype: int64

# Detecting duplicates

In [29]:
# Number of duplicated rows
len(data)-len(data.drop_duplicates())

3

In [30]:
data[data.duplicated()]

Unnamed: 0,Longueur_Sepale,Largeur_Sepale,Longueur_Petale,Largeur_Petale,Nom
34,4.9,3.1,1.5,0.1,Iris-setosa
37,4.9,3.1,1.5,0.1,Iris-setosa
142,5.8,2.7,5.1,1.9,Iris-virginica


In [31]:
data=data.drop_duplicates(keep='first')
data.shape

(147, 5)

In [32]:
len(data)-len(data.drop_duplicates())

0
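As a small illustration of how pandas flags duplicates, the sketch below uses a toy DataFrame (hypothetical values, not the iris data). It shows the difference between the default `keep='first'`, which marks only the later copies, and `keep=False`, which marks every member of a duplicated group.

```python
import pandas as pd

# Toy data (hypothetical values): rows 0, 2 and 3 are identical.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 1.0, 1.0],
    "b": ["x", "y", "x", "x"],
})

# By default, duplicated() marks every repeat after the first occurrence.
later_copies = toy.duplicated()
# keep=False marks all rows of a duplicated group, including the first one.
all_copies = toy.duplicated(keep=False)

n_duplicates = int(later_copies.sum())            # rows that would be dropped
deduplicated = toy.drop_duplicates(keep="first")  # one copy of each row remains

print(n_duplicates)
print(all_copies.tolist())
print(deduplicated.shape)
```

Inspecting `keep=False` before dropping makes it easier to check that the flagged rows really are repeated measurements and not distinct flowers that happen to share the same values.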

# Outliers

### Separating the explanatory variables from the response

In [36]:
nvar=data.shape[1]
var=data.columns
var_X=var[:nvar-1]   # every column except the last: explanatory variables
var_y=var[nvar-1]    # last column: response

###
print('The explanatory variables are: ','\n', var_X)
print('The response is: ','\n', var_y)

The explanatory variables are:  
 Index(['Longueur_Sepale', 'Largeur_Sepale', 'Longueur_Petale',
       'Largeur_Petale'],
      dtype='object')
The response is:  
 Nom


In [38]:
# Levels of the response variable (read from data because y is only defined in the next cell)
print('The levels of the response variable are:','\n',data[var_y].unique())
print('Number of classes:','\n',data[var_y].nunique())

In [39]:
# Explanatory variables and response
X=data[var_X]
y=data[var_y]
X.head(10)

Unnamed: 0,Longueur_Sepale,Largeur_Sepale,Longueur_Petale,Largeur_Petale
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
5,5.4,3.9,1.7,0.4
6,4.6,3.4,1.4,0.3
7,5.0,3.4,1.5,0.2
8,4.4,2.9,1.4,0.2
9,4.9,3.1,1.5,0.1


### Boxplots of the explanatory variables

In [41]:
plt.figure()
meanprops = {'marker':'o', 'markeredgecolor':'black','markerfacecolor':'firebrick'}
# showmeans=True is needed for meanprops to have any effect
plt.boxplot(X, labels=var_X, showmeans=True, meanprops=meanprops)
plt.show()
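The points drawn individually on a boxplot are those lying more than 1.5 times the interquartile range beyond the quartiles (the matplotlib default, `whis=1.5`). The sketch below applies the same rule numerically. The helper name `iqr_outliers` and the sample values are hypothetical and are not part of the notebook.

```python
import numpy as np

def iqr_outliers(values, whis=1.5):
    """Return a boolean mask of points outside [Q1 - whis*IQR, Q3 + whis*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    return (values < lower) | (values > upper)

# Hypothetical measurements with one clearly extreme value.
sample = np.array([2.9, 3.0, 3.1, 3.2, 3.0, 3.4, 2.8, 3.1, 5.5])
mask = iqr_outliers(sample)

print(sample[mask])   # values flagged as outliers
```

Applied column by column, for example with `X.apply(iqr_outliers)`, this gives the rows to inspect before deciding whether to keep or discard them.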

# Subsets by flower species

In [75]:
setosa=data[data[var_y]=='Iris-setosa']
virginica=data[data[var_y]=='Iris-virginica']
versicolor=data[data[var_y]=='Iris-versicolor']

In [76]:
# Display the sizes of the subsets
print('Number of setosa flowers in the dataset:', '\n',setosa.shape[0])
print('Number of virginica flowers in the dataset:', '\n',virginica.shape[0])
print('Number of versicolor flowers in the dataset:', '\n',versicolor.shape[0])

Number of setosa flowers in the dataset: 
 48
Number of virginica flowers in the dataset: 
 49
Number of versicolor flowers in the dataset: 
 50


$\textbf{Interpretation:}$ the dataset is almost perfectly balanced (48, 49 and 50 flowers per species once the duplicates are removed), which is rare in practice.

In [10]:
help(sns.pairplot)

Help on function pairplot in module seaborn.axisgrid:

pairplot(data, *, hue=None, hue_order=None, palette=None, vars=None, x_vars=None, y_vars=None, kind='scatter', diag_kind='auto', markers=None, height=2.5, aspect=1, corner=False, dropna=False, plot_kws=None, diag_kws=None, grid_kws=None, size=None)
    Plot pairwise relationships in a dataset.
    
    By default, this function will create a grid of Axes such that each numeric
    variable in ``data`` will by shared across the y-axes across a single row and
    the x-axes across a single column. The diagonal plots are treated
    differently: a univariate distribution plot is drawn to show the marginal
    distribution of the data in each column.
    
    It is also possible to show a subset of variables or plot different
    variables on the rows and columns.
    
    This is a high-level interface for :class:`PairGrid` that is intended to
    make it easy to draw a few common styles. You should use :class:`PairGrid`
    directly 

In [77]:
sns.pairplot(data, hue=var_y,kind='scatter',diag_kind='hist',height=1.8)

<seaborn.axisgrid.PairGrid at 0x206dd58cf40>

In [78]:
sns.pairplot(data, hue=var_y,kind='kde',diag_kind='kde',height=1.8)

<seaborn.axisgrid.PairGrid at 0x206df9033d0>

### 3D scatter plot

In [79]:
fig = plt.figure(figsize=(8,4))
ax = fig.add_subplot(projection='3d')

for spec, clas in data.groupby(var_y):
    ax.scatter(clas[var_X[0]], clas[var_X[2]], clas[var_X[3]], label=spec)

# Axis labels and legend only need to be set once, after the loop
ax.set_xlabel('Longueur_Sepale')
ax.set_ylabel('Longueur_Petale')
ax.set_zlabel('Largeur_Petale')
ax.legend(loc='upper left')

plt.show()

# Independence of the variables and the classes

### Correlation matrix

In [162]:
###
Mat_corr=data[var_X].corr()

plt.figure()
sns.heatmap(Mat_corr, cmap = 'coolwarm')
plt.show()
Mat_corr

Unnamed: 0,Longueur_Sepale,Largeur_Sepale,Longueur_Petale,Largeur_Petale
Longueur_Sepale,1.0,-0.109321,0.871305,0.817058
Largeur_Sepale,-0.109321,1.0,-0.421057,-0.356376
Longueur_Petale,0.871305,-0.421057,1.0,0.961883
Largeur_Petale,0.817058,-0.356376,0.961883,1.0


### Pearson test, Spearman correlation, Kendall's tau

In [None]:
from scipy.stats import pearsonr
from scipy.stats import kendalltau

In [152]:
help(pearsonr)

Help on function pearsonr in module scipy.stats.stats:

pearsonr(x, y)
    Pearson correlation coefficient and p-value for testing non-correlation.
    
    The Pearson correlation coefficient [1]_ measures the linear relationship
    between two datasets.  The calculation of the p-value relies on the
    assumption that each dataset is normally distributed.  (See Kowalski [3]_
    for a discussion of the effects of non-normality of the input on the
    distribution of the correlation coefficient.)  Like other correlation
    coefficients, this one varies between -1 and +1 with 0 implying no
    correlation. Correlations of -1 or +1 imply an exact linear relationship.
    Positive correlations imply that as x increases, so does y. Negative
    correlations imply that as x increases, y decreases.
    
    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Pearson correlation at least as extreme
    as the one computed from these da

In [156]:
help(stats.spearmanr)

Help on function spearmanr in module scipy.stats.stats:

spearmanr(a, b=None, axis=0, nan_policy='propagate', alternative='two-sided')
    Calculate a Spearman correlation coefficient with associated p-value.
    
    The Spearman rank-order correlation coefficient is a nonparametric measure
    of the monotonicity of the relationship between two datasets. Unlike the
    Pearson correlation, the Spearman correlation does not assume that both
    datasets are normally distributed. Like other correlation coefficients,
    this one varies between -1 and +1 with 0 implying no correlation.
    Correlations of -1 or +1 imply an exact monotonic relationship. Positive
    correlations imply that as x increases, so does y. Negative correlations
    imply that as x increases, y decreases.
    
    The p-value roughly indicates the probability of an uncorrelated system
    producing datasets that have a Spearman correlation at least as extreme
    as the one computed from these datasets. The p-va

In [157]:
help(kendalltau)

Help on function kendalltau in module scipy.stats.stats:

kendalltau(x, y, initial_lexsort=None, nan_policy='propagate', method='auto', variant='b')
    Calculate Kendall's tau, a correlation measure for ordinal data.
    
    Kendall's tau is a measure of the correspondence between two rankings.
    Values close to 1 indicate strong agreement, and values close to -1
    indicate strong disagreement. This implements two variants of Kendall's
    tau: tau-b (the default) and tau-c (also known as Stuart's tau-c). These
    differ only in how they are normalized to lie within the range -1 to 1;
    the hypothesis tests (their p-values) are identical. Kendall's original
    tau-a is not implemented separately because both tau-b and tau-c reduce
    to tau-a in the absence of ties.
    
    Parameters
    ----------
    x, y : array_like
        Arrays of rankings, of the same shape. If arrays are not 1-D, they
        will be flattened to 1-D.
    initial_lexsort : bool, optional
        U

In [154]:
# Sepal length versus sepal width
Test_corr=pd.DataFrame({
    'Test_Pearson':pearsonr(data["Longueur_Sepale"],data["Largeur_Sepale"]),
    'Test de Spearman' : stats.spearmanr(data["Longueur_Sepale"],data["Largeur_Sepale"]),
    'Test de Kendall' : kendalltau(data["Longueur_Sepale"],data["Largeur_Sepale"])
})

Test_corr.set_index([["correlation","p_value"]])

Unnamed: 0,Test_Pearson,Test de Spearman,Test de Kendall
correlation,-0.109321,-0.155711,-0.070185
p_value,0.187476,0.059661,0.229711
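To make the difference between the first two coefficients concrete, the sketch below checks on simulated data (hypothetical, not the iris measurements) that the Spearman coefficient is the Pearson coefficient computed on the ranks. This is why Spearman captures monotone but non-linear relationships that Pearson understates.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, rankdata

rng = np.random.default_rng(0)
x = rng.normal(size=50)
# Monotone but non-linear relationship with a little noise (simulated data).
y = np.exp(x) + rng.normal(scale=0.1, size=50)

r_pearson, p_pearson = pearsonr(x, y)
rho_spearman, p_spearman = spearmanr(x, y)

# Spearman's rho is Pearson's r applied to the ranks of the observations.
rho_from_ranks, _ = pearsonr(rankdata(x), rankdata(y))

print(r_pearson, rho_spearman, rho_from_ranks)
```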


### Graphical analysis

In [163]:
# Boxplots of each explanatory variable by species
meanprops = {'marker':'o', 'markeredgecolor':'black','markerfacecolor':'firebrick'}
for var in var_X:
    plt.figure(figsize=(6,4))
    sns.boxplot(x='Nom',y=var,data=data,showmeans=True, meanprops=meanprops,hue='Nom')
    # The x axis shows the species, so the variable name goes in the title
    plt.title(var, fontsize=14)
    plt.show()


### Descriptive analysis

In [164]:
setosa.describe()

Unnamed: 0,Longueur_Sepale,Largeur_Sepale,Longueur_Petale,Largeur_Petale
count,48.0,48.0,48.0,48.0
mean,5.010417,3.43125,1.4625,0.25
std,0.359219,0.383243,0.177002,0.105185
min,4.3,2.3,1.0,0.1
25%,4.8,3.2,1.4,0.2
50%,5.0,3.4,1.5,0.2
75%,5.2,3.7,1.6,0.3
max,5.8,4.4,1.9,0.6


In [165]:
versicolor.describe()

Unnamed: 0,Longueur_Sepale,Largeur_Sepale,Longueur_Petale,Largeur_Petale
count,50.0,50.0,50.0,50.0
mean,5.936,2.77,4.26,1.326
std,0.516171,0.313798,0.469911,0.197753
min,4.9,2.0,3.0,1.0
25%,5.6,2.525,4.0,1.2
50%,5.9,2.8,4.35,1.3
75%,6.3,3.0,4.6,1.5
max,7.0,3.4,5.1,1.8


In [166]:
virginica.describe()

Unnamed: 0,Longueur_Sepale,Largeur_Sepale,Longueur_Petale,Largeur_Petale
count,49.0,49.0,49.0,49.0
mean,6.604082,2.979592,5.561224,2.028571
std,0.632113,0.32338,0.553706,0.276887
min,4.9,2.2,4.5,1.4
25%,6.3,2.8,5.1,1.8
50%,6.5,3.0,5.6,2.0
75%,6.9,3.2,5.9,2.3
max,7.9,3.8,6.9,2.5


### Kruskal-Wallis test

In [133]:
help(stats.kruskal)

Help on function kruskal in module scipy.stats.stats:

kruskal(*args, nan_policy='propagate')
    Compute the Kruskal-Wallis H-test for independent samples.
    
    The Kruskal-Wallis H-test tests the null hypothesis that the population
    median of all of the groups are equal.  It is a non-parametric version of
    ANOVA.  The test works on 2 or more independent samples, which may have
    different sizes.  Note that rejecting the null hypothesis does not
    indicate which of the groups differs.  Post hoc comparisons between
    groups are required to determine which groups are different.
    
    Parameters
    ----------
    sample1, sample2, ... : array_like
       Two or more arrays with the sample measurements can be given as
       arguments. Samples must be one-dimensional.
    nan_policy : {'propagate', 'raise', 'omit'}, optional
        Defines how to handle when input contains nan.
        The following options are available (default is 'propagate'):
    
          * 'pro

In [82]:
# Initialise the result lists
stat_kw=[]
p_val=[]

for var in var_X:
    kstat, pv = stats.kruskal(setosa[var], virginica[var],versicolor[var])

    ###
    stat_kw.append(kstat)
    p_val.append(pv)

res_kw=pd.DataFrame({
    'variable' : var_X,
    'statistique' : stat_kw,
    'p_valeur':p_val
})

res_kw

Unnamed: 0,variable,statistique,p_valeur
0,Longueur_Sepale,94.566718,2.918087e-21
1,Largeur_Sepale,61.921377,3.580501e-14
2,Longueur_Petale,127.653832,1.906876e-28
3,Largeur_Petale,128.37586,1.329034e-28
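As the `kruskal` documentation points out, rejecting the null hypothesis does not say which groups differ. One possible follow-up, sketched below on simulated groups (hypothetical data and group names), is a set of pairwise Mann-Whitney U tests with a Bonferroni correction. Dunn's test is the more usual post hoc choice. Mann-Whitney is used here only because it ships with scipy.

```python
from itertools import combinations

import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)
# Simulated groups standing in for one feature measured on three species.
groups = {
    "A": rng.normal(5.0, 0.4, size=40),
    "B": rng.normal(5.9, 0.5, size=40),
    "C": rng.normal(6.0, 0.5, size=40),
}

pairs = list(combinations(groups, 2))
posthoc = {}
for g1, g2 in pairs:
    stat, p = mannwhitneyu(groups[g1], groups[g2], alternative="two-sided")
    # Bonferroni: multiply by the number of comparisons and cap at 1.
    posthoc[(g1, g2)] = min(1.0, p * len(pairs))

for pair, p_adj in posthoc.items():
    print(pair, p_adj)
```

With the notebook's data, the three groups would be `setosa[var]`, `versicolor[var]` and `virginica[var]` for each variable in `var_X`.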


### Mann-Kendall test

In [100]:
pip install pymannkendall

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [101]:
import pymannkendall as mk

In [102]:
help(mk.original_test)

Help on function original_test in module pymannkendall.pymannkendall:

original_test(x_old, alpha=0.05)
    This function checks the Mann-Kendall (MK) test (Mann 1945, Kendall 1975, Gilbert 1987).
    Input:
        x: a vector (list, numpy array or pandas series) data
        alpha: significance level (0.05 default)
    Output:
        trend: tells the trend (increasing, decreasing or no trend)
        h: True (if trend is present) or False (if trend is absence)
        p: p-value of the significance test
        z: normalized test statistics
        Tau: Kendall Tau
        s: Mann-Kendal's score
        var_s: Variance S
        slope: Theil-Sen estimator/slope
        intercept: intercept of Kendall-Theil Robust Line
    Examples
    --------
          >>> import numpy as np
      >>> import pymannkendall as mk
      >>> x = np.random.rand(1000)
      >>> trend,h,p,z,tau,s,var_s,slope,intercept = mk.original_test(x,0.05)



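Hypotheses of the Mann-Kendall test:

$\mathbf{H_0}$: there is no monotonic trend in the data.

$\mathbf{H_1}$: there is a monotonic trend in the data.

The test is designed for ordered observations such as time series. Here the order is simply the row order of the dataset, which appears to be sorted by species, so a detected trend mainly reflects differences between species.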
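The statistic behind the test is the score $S=\sum_{i<j}\operatorname{sign}(x_j-x_i)$, which is large and positive for an increasing series and large and negative for a decreasing one. The sketch below computes it directly with NumPy. The function name `mann_kendall_s` and the example series are hypothetical, and the variance and p-value computed by `pymannkendall` are not reproduced.

```python
import numpy as np

def mann_kendall_s(x):
    """Mann-Kendall score: sum over all pairs i < j of sign(x[j] - x[i])."""
    x = np.asarray(x, dtype=float)
    diffs = x[None, :] - x[:, None]         # diffs[i, j] = x[j] - x[i]
    upper = np.triu_indices(len(x), k=1)    # index pairs with i < j
    return int(np.sign(diffs[upper]).sum())

s_increasing = mann_kendall_s([1.0, 2.0, 3.0, 4.0])
s_decreasing = mann_kendall_s([4.0, 3.0, 2.0, 1.0])

print(s_increasing, s_decreasing)
```

For a series of length $n$, $S$ ranges from $-n(n-1)/2$ to $n(n-1)/2$, so the two toy series above reach the extreme values for $n=4$.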

In [167]:
level_test=0.01
pval=[]
tend=[]
for var in var_X:
    # var_s (not var) so that the loop variable is not overwritten by the variance of S
    tendance,h,p,z,tau,s,var_s,slope,intercept = mk.original_test(data[var],alpha=level_test)
    pval.append(p)
    tend.append(tendance)
    
tab_mk=pd.DataFrame({
    'variable' : var_X,
    'tendance' : tend,
    'p_valeur' : pval
})

tab_mk

Unnamed: 0,variable,tendance,p_valeur
0,Longueur_Sepale,increasing,0.0
1,Largeur_Sepale,decreasing,2e-06
2,Longueur_Petale,increasing,0.0
3,Largeur_Petale,increasing,0.0


### ANOVA test

$\textbf{Analysis of variance (ANOVA):}$ tests whether the mean of each explanatory variable differs across the classes of the discrete response variable.


$\textbf{ANOVA}$ is appropriate when

- the quantitative variable is normally distributed within each group

- the group variances are equal

- the observations are independent


$\mathbf{H_0}$: the class means are equal (the mean does not vary across the $p$ classes)

$$\mu_1=\mu_2=\cdots=\mu_p$$

$\mathbf{H_1}$: the mean of at least one class differs from the others.

If the p-value is below the significance level, the hypothesis $\mathbf{H_0}$ is rejected.
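The F statistic compares the variability between class means with the variability within classes. The sketch below computes it by hand on simulated data (hypothetical groups, not the iris measurements) and checks the result against `f_oneway`. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
# Simulated data: p = 3 classes with slightly different means.
groups = [rng.normal(loc=mu, scale=1.0, size=30) for mu in (0.0, 0.5, 1.0)]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
p_classes = len(groups)
n_total = all_values.size

# Between-class and within-class sums of squares.
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within.
f_manual = (ss_between / (p_classes - 1)) / (ss_within / (n_total - p_classes))
f_scipy, p_value = f_oneway(*groups)

print(f_manual, f_scipy, p_value)
```

A large F means the class means are far apart relative to the spread inside each class, which is what leads to a small p-value.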

In [168]:
####
stat_val, pval=f_oneway(setosa[var_X],virginica[var_X],versicolor[var_X])

res_anova=pd.DataFrame({'variable':var_X,
                       'stat':stat_val,
                       'p_value' :pval})
res_anova

Unnamed: 0,variable,stat,p_value
0,Longueur_Sepale,116.672939,7.530912e-31
1,Largeur_Sepale,47.869839,1.150338e-16
2,Longueur_Petale,1132.387047,8.180546e-89
3,Largeur_Petale,915.183585,1.352973e-82


# Normality of the data

$\textbf{Skewness}$: measures the asymmetry of the data. A symmetric distribution has zero skewness, but zero skewness does not necessarily imply a symmetric distribution.

$\textbf{Kurtosis}$: measures the flatness (tail weight) of the data. With Fisher's definition, 3 is subtracted from the result so that the normal distribution has a kurtosis of zero.
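Both coefficients are ratios of central moments. The sketch below computes them by hand on a simulated right-skewed sample (hypothetical data) and compares the result with `scipy.stats.skew` and `scipy.stats.kurtosis`, whose defaults (`bias=True`, `fisher=True`) correspond to these formulas.

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(3)
x = rng.exponential(scale=1.0, size=500)   # simulated right-skewed sample

centered = x - x.mean()
m2 = np.mean(centered ** 2)   # second central moment (biased variance)
m3 = np.mean(centered ** 3)   # third central moment
m4 = np.mean(centered ** 4)   # fourth central moment

skew_manual = m3 / m2 ** 1.5
# Fisher's definition: subtract 3 so that a normal distribution gives 0.
excess_kurtosis_manual = m4 / m2 ** 2 - 3.0

print(skew_manual, skew(x))
print(excess_kurtosis_manual, kurtosis(x))
```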

In [126]:
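# Sanity check on a simulated standard normal sample (separate name so that X is not overwritten)
x_norm=np.random.normal(loc=0,scale=1,size=10000)
print(skew(x_norm),kurtosis(x_norm))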

0.020897627848290273 -0.023749633352499355


In [127]:
# Compute skewness and kurtosis for each species
sym_setosa=[]
applat_setosa=[]

###
sym_versicolor=[]
applat_versicolor=[]

###
sym_virginica=[]
applat_virginica=[]



###
for var in var_X:
    ###
    sym_setosa.append(skew(setosa[var]))
    applat_setosa.append(kurtosis(setosa[var]))
    ###
    sym_versicolor.append(skew(versicolor[var]))
    applat_versicolor.append(kurtosis(versicolor[var]))
    ###
    sym_virginica.append(skew(virginica[var]))
    applat_virginica.append(kurtosis(virginica[var]))

coef_setosa=pd.DataFrame({
    'Variable': var_X,
    'symétrie': sym_setosa,
    'applatissement' : applat_setosa
})

coef_versicolor=pd.DataFrame({
    'Variable': var_X,
    'symétrie': sym_versicolor,
    'applatissement' : applat_versicolor
})

coef_virginica=pd.DataFrame({
    'Variable': var_X,
    'symétrie': sym_virginica,
    'applatissement' : applat_virginica
})

print('SETOSA','\n',coef_setosa)
print('VERSICOLOR','\n',coef_versicolor)
print('VIRGINICA','\n',coef_virginica)

SETOSA 
           Variable  symétrie  applatissement
0  Longueur_Sepale  0.078589       -0.437388
1   Largeur_Sepale  0.026116        0.725381
2  Longueur_Petale  0.093784        0.677255
3   Largeur_Petale  1.219437        1.384615
VERSICOLOR 
           Variable  symétrie  applatissement
0  Longueur_Sepale  0.102190       -0.598827
1   Largeur_Sepale -0.351867       -0.448272
2  Longueur_Petale -0.588159       -0.074402
3   Largeur_Petale -0.030236       -0.487833
VIRGINICA 
           Variable  symétrie  applatissement
0  Longueur_Sepale  0.082602       -0.018327
1   Largeur_Sepale  0.319770        0.520322
2  Longueur_Petale  0.499664       -0.279413
3   Largeur_Petale -0.151286       -0.683660


### Shapiro-Wilk test

In [128]:
help(stats.shapiro)

Help on function shapiro in module scipy.stats.morestats:

shapiro(x)
    Perform the Shapiro-Wilk test for normality.
    
    The Shapiro-Wilk test tests the null hypothesis that the
    data was drawn from a normal distribution.
    
    Parameters
    ----------
    x : array_like
        Array of sample data.
    
    Returns
    -------
    statistic : float
        The test statistic.
    p-value : float
        The p-value for the hypothesis test.
    
    See Also
    --------
    anderson : The Anderson-Darling test for normality
    kstest : The Kolmogorov-Smirnov test for goodness of fit.
    
    Notes
    -----
    The algorithm used is described in [4]_ but censoring parameters as
    described are not implemented. For N > 5000 the W test statistic is accurate
    but the p-value may not be.
    
    The chance of rejecting the null hypothesis when it is true is close to 5%
    regardless of sample size.
    
    References
    ----------
    .. [1] https://www.itl.nist.

In [130]:
# Shapiro-Wilk test for the explanatory variables (setosa)
stat_setosa=[]
pv_setosa=[]

####
stat_versicolor=[]
pv_versicolor=[]

####
stat_virginica=[]
pv_virginica=[]

for var in var_X:
    ST_setosa=stats.shapiro(setosa[var])
    stat_setosa.append(ST_setosa.statistic)
    pv_setosa.append(ST_setosa.pvalue)

shap_setosa=pd.DataFrame({
    'variables': var_X,
    'statistic':stat_setosa,
    'p-value':pv_setosa
})
shap_setosa

Unnamed: 0,variables,statistic,p-value
0,Longueur_Sepale,0.976371,0.437932
1,Largeur_Sepale,0.971323,0.28499
2,Longueur_Petale,0.958571,0.088382
3,Largeur_Petale,0.795648,1e-06


In [131]:
# Shapiro-Wilk test for the explanatory variables of setosa
stat_seto=[]
pv_seto=[]

for var in var_X:
    N_stat=stats.shapiro(setosa[var])
    stat_seto.append(N_stat.statistic)
    pv_seto.append(N_stat.pvalue)   # p-value of the current test, not of a stale ST object

setosa_shapiro=pd.DataFrame({
    'variables': var_X,
    'statistic':stat_seto,
    'p-value':pv_seto
})
setosa_shapiro

Unnamed: 0,variables,statistic,p-value
0,Longueur_Sepale,0.976371,0.437932
1,Largeur_Sepale,0.971323,0.28499
2,Longueur_Petale,0.958571,0.088382
3,Largeur_Petale,0.795648,1e-06


In [None]:
help(f_oneway)

# Train/test split

In [10]:
# Split the dataset into a training set and a test set
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data[var_X],data[var_y], test_size=0.2, random_state=42)
X_train.shape

(120, 4)

# Bernoulli Naive Bayes

Bernoulli Naive Bayes is typically used for spam detection, text classification and sentiment analysis, where each feature indicates whether a given word is present in a document. It therefore expects binary features.
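Because the model works on binary features, continuous inputs are thresholded by the `binarize` parameter, whose default is 0.0. The sketch below mimics that thresholding with NumPy on a few hypothetical rows in the same units as the iris features. The helper `binarize` here is a stand-in written for illustration, not the scikit-learn function.

```python
import numpy as np

# Hypothetical rows in the same units as the iris features (cm).
X_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [6.0, 2.7, 4.3, 1.3],
    [6.6, 3.0, 5.6, 2.0],
])

def binarize(X, threshold):
    """Map values strictly above the threshold to 1 and all others to 0."""
    return (X > threshold).astype(int)

# Default threshold: every positive measurement becomes 1.
at_zero = binarize(X_demo, 0.0)
# Alternative threshold (an assumption for illustration): the column medians.
at_median = binarize(X_demo, np.median(X_demo, axis=0))

print(at_zero)
print(at_median)
```

With a threshold of 0.0 all rows become identical, so the features carry no information about the species. A per-feature threshold keeps some of it, although a model designed for continuous features is usually a better fit for this kind of data.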

In [17]:
from sklearn.naive_bayes import BernoulliNB

model_Bern = BernoulliNB()
model_Bern.fit(X_train, y_train)
model_Bern.get_params()

{'alpha': 1.0, 'binarize': 0.0, 'class_prior': None, 'fit_prior': True}

In [13]:
help(BernoulliNB)

Help on class BernoulliNB in module sklearn.naive_bayes:

class BernoulliNB(_BaseDiscreteNB)
 |  BernoulliNB(*, alpha=1.0, binarize=0.0, fit_prior=True, class_prior=None)
 |  
 |  Naive Bayes classifier for multivariate Bernoulli models.
 |  
 |  Like MultinomialNB, this classifier is suitable for discrete data. The
 |  difference is that while MultinomialNB works with occurrence counts,
 |  BernoulliNB is designed for binary/boolean features.
 |  
 |  Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.
 |  
 |  Parameters
 |  ----------
 |  alpha : float, default=1.0
 |      Additive (Laplace/Lidstone) smoothing parameter
 |      (0 for no smoothing).
 |  
 |  binarize : float or None, default=0.0
 |      Threshold for binarizing (mapping to booleans) of sample features.
 |      If None, input is presumed to already consist of binary vectors.
 |  
 |  fit_prior : bool, default=True
 |      Whether to learn class prior probabilities or not.
 |      If false, a uniform prior wil

In [35]:
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix, ConfusionMatrixDisplay

pred_Bern = model_Bern.predict(X_test)
prec_Bern= accuracy_score(y_test, pred_Bern)

####
print("Accuracy score:", prec_Bern)
#cr=classification_report(y_test, pred_Bern)
#print(cr)
cm=confusion_matrix(y_test, pred_Bern)
cm_display = ConfusionMatrixDisplay(cm).plot()

Accuracy score: 0.3


In [32]:
pred_Bern

array(['Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor', 'Iris-versicolor'],
      dtype='<U15')

# Gaussian Naive Bayes

In [36]:
from sklearn.naive_bayes import GaussianNB

model_Gauss = GaussianNB()
model_Gauss.fit(X_train, y_train)
model_Gauss.get_params()

{'priors': None, 'var_smoothing': 1e-09}

In [38]:
pred_Gauss = model_Gauss.predict(X_test)
prec_Gauss = accuracy_score(y_test, pred_Gauss)
print('Accuracy score:', prec_Gauss)

Accuracy score: 1.0


In [39]:
CM=confusion_matrix(y_test, pred_Gauss)
CM_display = ConfusionMatrixDisplay(CM).plot()

# LDA model

In [42]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

help(LinearDiscriminantAnalysis())

Help on LinearDiscriminantAnalysis in module sklearn.discriminant_analysis object:

class LinearDiscriminantAnalysis(sklearn.linear_model._base.LinearClassifierMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator)
 |  LinearDiscriminantAnalysis(solver='svd', shrinkage=None, priors=None, n_components=None, store_covariance=False, tol=0.0001, covariance_estimator=None)
 |  
 |  Linear Discriminant Analysis.
 |  
 |  A classifier with a linear decision boundary, generated by fitting class
 |  conditional densities to the data and using Bayes' rule.
 |  
 |  The model fits a Gaussian density to each class, assuming that all classes
 |  share the same covariance matrix.
 |  
 |  The fitted model can also be used to reduce the dimensionality of the input
 |  by projecting it to the most discriminative directions, using the
 |  `transform` method.
 |  
 |  .. versionadded:: 0.17
 |     *LinearDiscriminantAnalysis*.
 |  
 |  Read more in the :ref:`User Guide <lda_qda>`.
 |  
 |  Pa

In [41]:
model_lda = LinearDiscriminantAnalysis()
model_lda.fit(X_train, y_train)
model_lda.get_params()

{'covariance_estimator': None,
 'n_components': None,
 'priors': None,
 'shrinkage': None,
 'solver': 'svd',
 'store_covariance': False,
 'tol': 0.0001}

In [43]:
pred_lda = model_lda.predict(X_test)
prec_lda= accuracy_score(y_test, pred_lda)
print('Accuracy score:',prec_lda)

Accuracy score: 1.0


In [44]:
conf_mat=confusion_matrix(y_test, pred_lda)
conf_display = ConfusionMatrixDisplay(conf_mat).plot()