### Methods
- **Support Vector Machine (SVM)**: A classification algorithm that finds the optimal (maximum-margin) hyperplane separating the classes in a high-dimensional space. It can also handle non-linear problems by using kernels.
- **Logistic regression**: An algorithm for binary classification (extensible to multiclass classification) that models the probability of each class with a logistic function.
- **Random Forest**: An ensemble algorithm for classification and regression. It combines the predictions of many decision trees to obtain a prediction that is more robust, and usually more accurate, than that of a single tree.
- **Neural networks**: Models built from layers of interconnected units that can learn non-linear decision boundaries.
- **Perceptron or multilayer perceptron**: The perceptron is a linear classifier trained with an error-driven update rule; the multilayer perceptron stacks several layers of such units with non-linear activations.
- **Gradient Boosting**: Another ensemble algorithm that builds decision trees sequentially, each new tree correcting the errors of the previous ones. The result is often a strong predictive model.
- **Naive Bayes**: A simple probabilistic classifier based on Bayes' theorem with a strong independence assumption between features. It is often used for text classification and other tasks where feature independence is a reasonable assumption.
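
As a rough sketch, each method listed above has a counterpart in scikit-learn. The class choices (for example `GaussianNB` for Naive Bayes), the hyperparameters, and the synthetic dataset from `make_classification` are assumptions made for illustration, not the settings used later in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Small synthetic 3-class problem (illustrative only).
X, y = make_classification(
    n_samples=150, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# One scikit-learn estimator per method in the list above.
models = {
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Perceptron": Perceptron(random_state=0),
    "Multilayer perceptron": MLPClassifier(
        hidden_layer_sizes=(32,), max_iter=1000, random_state=0
    ),
    "Gradient boosting": GradientBoostingClassifier(n_estimators=20, random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Fit each model and record its accuracy on the data it was trained on.
train_accuracy = {}
for name, model in models.items():
    model.fit(X, y)
    train_accuracy[name] = model.score(X, y)
    print(f"{name}: {train_accuracy[name]:.3f}")
```

All of these estimators share the same `fit` / `predict` / `score` interface, which is what makes a loop like this possible.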





In [17]:
import sklearn
import numpy as np # linear algebra
import pandas as pd 

from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split

In [18]:
data_dir_path = '../data/external'
d_train = pd.read_csv(data_dir_path + "/train.csv")
d_test = pd.read_csv(data_dir_path + '/test.csv')

## Data preprocessing
### Separating the labels from the features
- `y_train` = class labels
- `x_train` = features

In [19]:
d_test.columns

Index(['id', 'margin1', 'margin2', 'margin3', 'margin4', 'margin5', 'margin6',
       'margin7', 'margin8', 'margin9',
       ...
       'texture55', 'texture56', 'texture57', 'texture58', 'texture59',
       'texture60', 'texture61', 'texture62', 'texture63', 'texture64'],
      dtype='object', length=193)

In [20]:
classes = d_train['species'].unique()

In [21]:
classes.shape

(99,)

In [22]:
processed_data = d_train.copy()
# Initialize the encoder
le = LabelEncoder()

# Encode the 'species' column
processed_data['species'] = le.fit_transform(processed_data['species'])

processed_data

Unnamed: 0,id,species,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,3,0.007812,0.023438,0.023438,0.003906,0.011719,0.009766,0.027344,0.0,...,0.007812,0.000000,0.002930,0.002930,0.035156,0.000000,0.000000,0.004883,0.000000,0.025391
1,2,49,0.005859,0.000000,0.031250,0.015625,0.025391,0.001953,0.019531,0.0,...,0.000977,0.000000,0.000000,0.000977,0.023438,0.000000,0.000000,0.000977,0.039062,0.022461
2,3,65,0.005859,0.009766,0.019531,0.007812,0.003906,0.005859,0.068359,0.0,...,0.154300,0.000000,0.005859,0.000977,0.007812,0.000000,0.000000,0.000000,0.020508,0.002930
3,5,94,0.000000,0.003906,0.023438,0.005859,0.021484,0.019531,0.023438,0.0,...,0.000000,0.000977,0.000000,0.000000,0.020508,0.000000,0.000000,0.017578,0.000000,0.047852
4,6,84,0.005859,0.003906,0.048828,0.009766,0.013672,0.015625,0.005859,0.0,...,0.096680,0.000000,0.021484,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.031250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
985,1575,40,0.060547,0.119140,0.007812,0.003906,0.000000,0.148440,0.017578,0.0,...,0.242190,0.000000,0.034180,0.000000,0.010742,0.000000,0.000000,0.000000,0.000000,0.018555
986,1578,5,0.001953,0.003906,0.021484,0.107420,0.001953,0.000000,0.000000,0.0,...,0.170900,0.000000,0.018555,0.000000,0.011719,0.000000,0.000000,0.000977,0.000000,0.021484
987,1581,11,0.001953,0.003906,0.000000,0.021484,0.078125,0.003906,0.007812,0.0,...,0.004883,0.000977,0.004883,0.027344,0.016602,0.007812,0.000000,0.027344,0.000000,0.001953
988,1582,78,0.000000,0.000000,0.046875,0.056641,0.009766,0.000000,0.000000,0.0,...,0.083008,0.030273,0.000977,0.002930,0.014648,0.000000,0.041992,0.000000,0.001953,0.002930
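
The integer codes produced by `LabelEncoder` can be mapped back to species names, which is useful when predictions have to be reported by name. A minimal sketch, assuming made-up labels rather than the actual contents of `train.csv`:

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; the real values come from the 'species' column.
species = ["Quercus_Rubra", "Acer_Opalus", "Quercus_Rubra", "Betula_Pendula"]

encoder = LabelEncoder()
codes = encoder.fit_transform(species)      # classes are sorted alphabetically
decoded = encoder.inverse_transform(codes)  # back to the original strings

print(list(encoder.classes_))
print(list(codes))
print(list(decoded))
```

The fitted encoder has to be kept around (here `encoder`, in the notebook `le`) for the reverse mapping to be available later.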


### Feature scaling
Feature scaling, or normalization, is an important data preprocessing step. Its goals are:

1. **Uniformity**: It ensures that all numeric features contribute comparably to the analysis instead of being weighted by their original scale.

2. **Better convergence**: Many machine learning algorithms, such as neural networks and other models trained by gradient descent, converge faster when the data is scaled.

3. **Better performance**: Some algorithms, especially those relying on distance measures such as k-means or k-NN, perform better when all features are on a comparable scale.

4. **Numerical stability**: Scaling can also help avoid numerical problems that arise when features differ by several orders of magnitude.

In short, scaling makes the learning process more efficient and more stable.
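
To make this concrete, `MinMaxScaler` maps each column to the range [0, 1] using (x - min) / (max - min). The sketch below checks this by hand on toy values (the arrays are assumptions for illustration) and also shows the usual pattern of fitting on training data and reusing the same minimum and maximum on test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two features on very different scales (illustrative values).
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [4.0, 500.0]])
X_test = np.array([[3.0, 200.0]])

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn min/max on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training min/max

# Same result by hand: (x - min) / (max - min), computed per column.
col_min = X_train.min(axis=0)
col_max = X_train.max(axis=0)
manual = (X_train - col_min) / (col_max - col_min)

print(X_train_scaled)
print(X_test_scaled)
```

Test values outside the training range would fall outside [0, 1]; that is expected with this pattern.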

In [23]:
# Initialize the scaler
scaler = MinMaxScaler()

# Scale numeric columns. Exclude 'id' and 'species' from being scaled
numeric_cols = processed_data.columns.drop(['id', 'species'])
processed_data[numeric_cols] = scaler.fit_transform(processed_data[numeric_cols])
processed_data

Unnamed: 0,id,species,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,1,3,0.088883,0.114287,0.150003,0.022987,0.105264,0.031447,0.297875,0.0,...,0.018181,0.000000,0.016951,0.014635,0.330258,0.000000,0.000000,0.012987,0.000000,0.179315
1,2,49,0.066662,0.000000,0.200000,0.091955,0.228070,0.006289,0.212763,0.0,...,0.002274,0.000000,0.000000,0.004880,0.220178,0.000000,0.000000,0.002599,0.449433,0.158623
2,3,65,0.066662,0.047620,0.124998,0.045975,0.035085,0.018867,0.744676,0.0,...,0.359096,0.000000,0.033896,0.004880,0.073387,0.000000,0.000000,0.000000,0.235957,0.020692
3,5,94,0.000000,0.019046,0.150003,0.034481,0.192976,0.062892,0.255324,0.0,...,0.000000,0.004833,0.000000,0.000000,0.192654,0.000000,0.000000,0.046752,0.000000,0.337938
4,6,84,0.066662,0.019046,0.312499,0.057474,0.122806,0.050314,0.063826,0.0,...,0.224999,0.000000,0.124293,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.220692
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
985,1575,40,0.688887,0.580944,0.049997,0.022987,0.000000,0.477991,0.191488,0.0,...,0.563639,0.000000,0.197744,0.000000,0.100911,0.000000,0.000000,0.000000,0.000000,0.131038
986,1578,5,0.022221,0.019046,0.137498,0.632180,0.017542,0.000000,0.000000,0.0,...,0.397729,0.000000,0.107347,0.000000,0.110089,0.000000,0.000000,0.002599,0.000000,0.151723
987,1581,11,0.022221,0.019046,0.000000,0.126436,0.701743,0.012578,0.085101,0.0,...,0.011364,0.004833,0.028250,0.136583,0.155961,0.013513,0.000000,0.072727,0.000000,0.013792
988,1582,78,0.000000,0.000000,0.300000,0.333339,0.087721,0.000000,0.000000,0.0,...,0.193181,0.149755,0.005652,0.014635,0.137605,0.000000,0.277413,0.000000,0.022470,0.020692


In [24]:
x_train = processed_data.drop(columns=['id', 'species'])
x_train

Unnamed: 0,margin1,margin2,margin3,margin4,margin5,margin6,margin7,margin8,margin9,margin10,...,texture55,texture56,texture57,texture58,texture59,texture60,texture61,texture62,texture63,texture64
0,0.088883,0.114287,0.150003,0.022987,0.105264,0.031447,0.297875,0.0,0.025639,0.340000,...,0.018181,0.000000,0.016951,0.014635,0.330258,0.000000,0.000000,0.012987,0.000000,0.179315
1,0.066662,0.000000,0.200000,0.091955,0.228070,0.006289,0.212763,0.0,0.000000,0.079995,...,0.002274,0.000000,0.000000,0.004880,0.220178,0.000000,0.000000,0.002599,0.449433,0.158623
2,0.066662,0.047620,0.124998,0.045975,0.035085,0.018867,0.744676,0.0,0.000000,0.460002,...,0.359096,0.000000,0.033896,0.004880,0.073387,0.000000,0.000000,0.000000,0.235957,0.020692
3,0.000000,0.019046,0.150003,0.034481,0.192976,0.062892,0.255324,0.0,0.179489,0.179999,...,0.000000,0.004833,0.000000,0.000000,0.192654,0.000000,0.000000,0.046752,0.000000,0.337938
4,0.066662,0.019046,0.312499,0.057474,0.122806,0.050314,0.063826,0.0,0.000000,0.059996,...,0.224999,0.000000,0.124293,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.220692
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
985,0.688887,0.580944,0.049997,0.022987,0.000000,0.477991,0.191488,0.0,0.025639,0.440004,...,0.563639,0.000000,0.197744,0.000000,0.100911,0.000000,0.000000,0.000000,0.000000,0.131038
986,0.022221,0.019046,0.137498,0.632180,0.017542,0.000000,0.000000,0.0,0.384616,0.039998,...,0.397729,0.000000,0.107347,0.000000,0.110089,0.000000,0.000000,0.002599,0.000000,0.151723
987,0.022221,0.019046,0.000000,0.126436,0.701743,0.012578,0.085101,0.0,0.051279,0.000000,...,0.011364,0.004833,0.028250,0.136583,0.155961,0.013513,0.000000,0.072727,0.000000,0.013792
988,0.000000,0.000000,0.300000,0.333339,0.087721,0.000000,0.000000,0.0,0.487174,0.019999,...,0.193181,0.149755,0.005652,0.014635,0.137605,0.000000,0.277413,0.000000,0.022470,0.020692


In [25]:
y_train = processed_data['species']
y_train

0       3
1      49
2      65
3      94
4      84
       ..
985    40
986     5
987    11
988    78
989    50
Name: species, Length: 990, dtype: int32

### Perceptron

In [31]:
from src.models.perceptron_model import PerceptronModel
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Create a Perceptron object
perceptron = PerceptronModel()

# Train the model on the training data
perceptron.fit(x_train, y_train)

# Predict on the training data
perceptron_pred_train = perceptron.predict(x_train)


RuntimeError: scikit-learn estimators should always specify their parameters in the signature of their __init__ (no varargs). <class 'src.models.perceptron_model.PerceptronModel'> with constructor (self, *args, **kwargs) doesn't  follow this convention.
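
The `RuntimeError` above comes from scikit-learn's parameter introspection, which requires `__init__` to list every hyperparameter explicitly. The source of `src.models.perceptron_model` is not shown here, so the following is only a sketch of one possible fix: a hypothetical `PerceptronModel` that wraps `sklearn.linear_model.Perceptron`. The selected hyperparameters and the synthetic data are assumptions.

```python
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.datasets import make_classification
from sklearn.linear_model import Perceptron


class PerceptronModel(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper; the real class in src/models may differ."""

    def __init__(self, penalty=None, alpha=0.0001, max_iter=1000, random_state=None):
        # Every hyperparameter is an explicit keyword argument stored under
        # the same name, with no *args or **kwargs.
        self.penalty = penalty
        self.alpha = alpha
        self.max_iter = max_iter
        self.random_state = random_state

    def fit(self, X, y):
        # Build the underlying estimator at fit time from the stored parameters.
        self.model_ = Perceptron(
            penalty=self.penalty,
            alpha=self.alpha,
            max_iter=self.max_iter,
            random_state=self.random_state,
        )
        self.model_.fit(X, y)
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        return self.model_.predict(X)


# Synthetic stand-in for x_train / y_train (illustrative only).
X, y = make_classification(
    n_samples=100, n_features=8, n_informative=4, n_classes=3, random_state=0
)
model = PerceptronModel(max_iter=500, random_state=0).fit(X, y)
params = model.get_params()
predictions = model.predict(X)
print(params)
```

With an explicit signature, `get_params()` (and therefore `clone`, the estimator `repr`, and grid search) can discover the parameters without raising.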

In [27]:


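# Error rate
error_rate = 1 - accuracy_score(y_train, perceptron_pred_train)

# 'species' is multiclass, so the metrics below need an explicit averaging
# mode (the default average='binary' raises an error on multiclass targets).

# Precision (macro-averaged over classes)
precision = precision_score(y_train, perceptron_pred_train, average='macro')

# Recall (macro-averaged over classes)
recall = recall_score(y_train, perceptron_pred_train, average='macro')

# F1 score (macro-averaged over classes)
f1 = f1_score(y_train, perceptron_pred_train, average='macro')

print("Error rate:", error_rate)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)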

NameError: name 'perceptron_pred_train' is not defined
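
Independently of the `NameError`, the `species` target has 99 classes, and `precision_score`, `recall_score`, and `f1_score` default to `average='binary'`, which raises an error on multiclass targets. A minimal sketch with macro averaging on made-up labels (choosing `'macro'` rather than `'micro'` or `'weighted'` is an assumption):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Toy 3-class labels (illustrative values).
y_true = [0, 1, 2, 2]
y_pred = [0, 1, 2, 1]

error_rate = 1 - accuracy_score(y_true, y_pred)
# 'macro' computes each metric per class, then takes the unweighted mean.
precision = precision_score(y_true, y_pred, average="macro")
recall = recall_score(y_true, y_pred, average="macro")
f1 = f1_score(y_true, y_pred, average="macro")

print("Error rate:", error_rate)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
```

Macro averaging gives every class the same weight regardless of how many samples it has.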

### SVM

In [28]:
from sklearn.svm import SVC

# Create an SVM classifier with a Gaussian (RBF) kernel
svm_classifier = SVC(kernel='rbf')

# Train the model on the training data
svm_classifier.fit(x_train, y_train)

# Predict on the training data (optional)
svm_pred_train = svm_classifier.predict(x_train)
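
Predictions on the training data tend to overestimate how well a model generalizes. `train_test_split` is imported at the top of the notebook; one way it could be used is sketched below. The synthetic data, the 20% validation size, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for x_train / y_train (illustrative only).
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=8, n_classes=4, random_state=0
)

# Hold out 20% of the rows; stratify keeps class proportions in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = SVC(kernel="rbf")
clf.fit(X_fit, y_fit)

# Compare accuracy on the rows used for fitting with accuracy on held-out rows.
train_acc = clf.score(X_fit, y_fit)
val_acc = clf.score(X_val, y_val)
print(f"train accuracy: {train_acc:.3f}, validation accuracy: {val_acc:.3f}")
```

Stratification matters when there are many classes with few samples each, since a purely random split could leave some classes out of the validation part.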


### Logistic Regression

In [29]:
from sklearn.linear_model import LogisticRegression

# Create a LogisticRegression object
logistic_regression = LogisticRegression()

# Train the model on the training data
logistic_regression.fit(x_train, y_train)

# Predict on the training data (optional)
predictions_train = logistic_regression.predict(x_train)


### Neural networks

In [30]:
from sklearn.neural_network import MLPClassifier

# Create an MLPClassifier (neural network) with one hidden layer of 100 units
mlp_classifier = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500)

# Train the model on the training data
mlp_classifier.fit(x_train, y_train)

# Predict on the training data (optional)
predictions_train = mlp_classifier.predict(x_train)
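
To compare the classifiers trained above on more than their training predictions, k-fold cross-validation is one option. The sketch below is an assumption-laden illustration: it uses synthetic data, 5 stratified folds, and hyperparameters chosen only so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic stand-in for x_train / y_train (illustrative only).
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=8, n_classes=4, random_state=0
)

# 5 stratified folds: each fold keeps roughly the same class proportions.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = {
    "SVM": SVC(kernel="rbf"),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=0),
}

# Mean accuracy over the folds for each candidate model.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=cv)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```

The number of folds cannot exceed the number of samples in the smallest class, which is worth checking when each class has only a handful of rows.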
