# Drills in Classification
You cannot claim to know something until you have practiced it, and these drills give you that practice. Are you ready to classify some interesting data?


## Exercise 1
* **Dataset:** `Iris`
* **Model to use:** [`KNN`](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
* **Model evaluation:** try the [classification report](http://www.thedatalife.com/classification-report-python/)

The Iris dataset contains three iris species with 50 samples each, along with four measurements per flower (sepal length, sepal width, petal length, and petal width). One species is linearly separable from the other two, but the other two are not linearly separable from each other.

You can load the dataset with `scikit-learn` by using: 

```python
from sklearn.datasets import load_iris
iris = load_iris()
```

Your mission is to apply KNN to this dataset and find the best K.
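
One possible way to search for K is to score a range of candidate values with cross-validation and keep the one with the highest mean accuracy. The sketch below is only an illustration: the 5-fold cross-validation, the candidate range of 1 to 20, and the scaling step inside a pipeline are assumptions, not requirements of the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Candidate values of K (an arbitrary range chosen for illustration)
k_values = range(1, 21)
mean_scores = {}

for k in k_values:
    # Scaling lives inside the pipeline so each fold is scaled
    # using only its own training part
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[k] = scores.mean()

# Keep the K with the highest mean cross-validated accuracy
best_k = max(mean_scores, key=mean_scores.get)
print(f"Best K: {best_k} (mean CV accuracy: {mean_scores[best_k]:.3f})")
```

Several values of K often score almost the same on a dataset this small, so it is worth looking at the whole `mean_scores` dictionary rather than only at `best_k`.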

You will quickly realize that an accuracy percentage alone is not enough to evaluate a complex classification model.
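
A small made-up example shows the problem: on imbalanced labels, a model that always predicts the majority class reaches a high accuracy while never finding the minority class. The labels below are invented for illustration and do not come from Iris (which is balanced).

```python
from sklearn.metrics import accuracy_score, recall_score

# Invented labels: 9 samples of class 0 and 1 sample of class 1
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A "model" that always predicts the majority class
y_pred = [0] * 10

accuracy = accuracy_score(y_true, y_pred)
# Recall for class 1: share of the true class-1 samples that were found
recall_class_1 = recall_score(y_true, y_pred, pos_label=1)

print(f"Accuracy: {accuracy:.2f}")                  # 0.90
print(f"Recall for class 1: {recall_class_1:.2f}")  # 0.00
```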

To understand how accurate your model is and, more importantly, where it is wrong, use scikit-learn's [classification report](http://www.thedatalife.com/classification-report-python/).

To use it properly, you will need to understand the following terms:
* `Recall`
* `Precision`
* `F1-score`
* `Support`

You can do your own research or [read this article](https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/).
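
As a quick reference, the four terms can be computed by hand for a single class from its counts of true positives (TP), false positives (FP), and false negatives (FN). The sketch below uses invented labels, and the helper name `per_class_metrics` is hypothetical; the last lines compare the result with scikit-learn's `precision_recall_fscore_support`.

```python
from sklearn.metrics import precision_recall_fscore_support

def per_class_metrics(y_true, y_pred, label):
    """Compute precision, recall, F1 and support for one class by counting."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    # Precision: of the samples predicted as `label`, how many are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: of the samples that really are `label`, how many were found
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # Support: number of samples that really are `label`
    support = tp + fn
    return precision, recall, f1, support

# Invented labels for three classes
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 2]

precision, recall, f1, support = per_class_metrics(y_true, y_pred, label=1)
print(precision, recall, f1, support)

# Same quantities from scikit-learn, restricted to class 1
sk_p, sk_r, sk_f, sk_s = precision_recall_fscore_support(y_true, y_pred, labels=[1])
print(sk_p[0], sk_r[0], sk_f[0], sk_s[0])
```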

In [6]:
# Import libraries
from sklearn.metrics import classification_report
from sklearn import datasets
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns


In [13]:
# Load your dataset
X, y = datasets.load_iris(return_X_y=True, as_frame=True)


In [22]:
# Explore the dataset to understand it (use pandas and your favorite data visualization library)
print(X.head())
print(X.isnull().sum())
print(X.describe())
print(y.describe())
print(y.value_counts())  # class balance of the target

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2
sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64
       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000       

In [None]:
# Preprocess the data (deal with NaNs, deal with text features,...)



In [23]:
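# Use a KNN model
# NOTE: the code in this cell fits a LogisticRegression classifier, not the KNN model the exercise asks for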

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

sc = StandardScaler()

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Declare an instance of it
classifier = LogisticRegression(solver='lbfgs')
# Fit the model
classifier.fit(X_train, y_train)

LogisticRegression()
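
Because the exercise asks for KNN, a comparable fit with `KNeighborsClassifier` could look like the sketch below. It assumes the same 70/30 split and `random_state=42` as the cell above, and it uses `n_neighbors=5` (scikit-learn's default) only as a starting point, not as a tuned value. The variable names `knn`, `X_train_scaled`, and `X_test_scaled` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# KNN relies on distances, so the features are put on the same scale first.
# The scaler is fitted on the training set only and then applied to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# n_neighbors=5 is only a starting point; try other values to find the best K
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

y_pred = knn.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
```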

In [26]:
# Evaluate your model

# Overall accuracy (not displayed here; the report below also lists it)
accuracy = classifier.score(X_test, y_test)
print(classification_report(y_test, classifier.predict(X_test)))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45

