# Predicting on the Iris dataset with KNN
In this notebook, we use the K-Nearest Neighbors (KNN) model to predict the species of flowers in the Iris dataset.

## Loading the dataset

In [44]:
import pandas as pd

In [45]:
import seaborn as sns
iris = sns.load_dataset("iris")

In [46]:
iris.head(3)

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa


In [47]:
iris.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   150 non-null    float64
 2   petal_length  150 non-null    float64
 3   petal_width   150 non-null    float64
 4   species       150 non-null    object 
dtypes: float64(4), object(1)
memory usage: 6.0+ KB


In [48]:
iris.shape

(150, 5)

## Handling categorical variables

In [50]:
# Map each species name to an integer code, numbered in order of first appearance
class_mapping = {label: idx for idx, label in enumerate(pd.unique(iris['species']))}
iris['label'] = iris['species'].map(class_mapping)
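
The mapping built above can be inverted to turn integer labels back into species names, which is handy when reading predictions later. The following is a minimal sketch on a small stand-in Series; `species`, `inverse_mapping`, and `decoded` are illustrative names that do not appear in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the `species` column of the Iris DataFrame.
species = pd.Series(["setosa", "versicolor", "virginica", "setosa"])

# Same construction as in the notebook: codes follow order of first appearance.
class_mapping = {label: idx for idx, label in enumerate(pd.unique(species))}

# Reverse dictionary (illustrative helper, not part of the original notebook).
inverse_mapping = {idx: label for label, idx in class_mapping.items()}

labels = species.map(class_mapping)      # names -> integer codes
decoded = labels.map(inverse_mapping)    # integer codes -> names

print(class_mapping)
print(decoded.tolist())
```

Because the codes are assigned in order of first appearance, the integers carry no ordinal meaning; they only serve as class identifiers.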

In [51]:
iris.head(3)

,sepal_length,sepal_width,petal_length,petal_width,species,label
0,5.1,3.5,1.4,0.2,setosa,0
1,4.9,3.0,1.4,0.2,setosa,0
2,4.7,3.2,1.3,0.2,setosa,0


In [52]:
iris['label'].value_counts()

0    50
1    50
2    50
Name: label, dtype: int64

## Splitting the dataset

In [53]:
from sklearn.model_selection import train_test_split

In [54]:
X = iris.drop(["species", "label"], axis=1)
y = iris["label"]

In [55]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)
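
The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in both subsets. As a rough illustration, the sketch below repeats the same call on synthetic data with the same shape as Iris (150 rows, three classes of 50); `X_demo` and `y_demo` are made-up stand-ins, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Iris: 150 rows, three balanced classes (assumption for illustration).
X_demo = np.arange(150).reshape(-1, 1)
y_demo = np.repeat([0, 1, 2], 50)

# Same split parameters as in the notebook.
_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42, stratify=y_demo
)

# Count how many samples of each class ended up in each subset.
train_counts = np.bincount(y_tr)
test_counts = np.bincount(y_te)
print("train:", train_counts, "test:", test_counts)
```

With balanced classes, the per-class counts in each subset should differ by at most one sample.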

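## Feature standardization
We standardize the features because KNN is distance-based and therefore sensitive to feature scale: without rescaling, features with larger numeric ranges would dominate the distance computation.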

In [56]:
from sklearn.preprocessing import StandardScaler
stdsc = StandardScaler()
# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train = stdsc.fit_transform(X_train)
X_test = stdsc.transform(X_test)
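
Standardization replaces each value with its z-score, z = (x − mean) / std, where the mean and standard deviation are computed per feature on the training set. The sketch below reproduces that arithmetic by hand with NumPy on made-up numbers, to show why the test set is transformed with the training statistics; the arrays `train` and `test` are hypothetical.

```python
import numpy as np

# Toy matrices with made-up values: column 0 is on a small scale, column 1 on a large one.
train = np.array([[5.0, 1400.0], [6.0, 1500.0], [7.0, 1600.0]])
test = np.array([[6.5, 1450.0]])

# Statistics come from the training rows only.
# np.std defaults to the population standard deviation (ddof=0), as StandardScaler does.
mean = train.mean(axis=0)
std = train.std(axis=0)

train_z = (train - mean) / std
test_z = (test - mean) / std   # reuse training mean/std, never refit on test data

print(train_z.mean(axis=0))  # approximately 0 for each column
print(train_z.std(axis=0))   # approximately 1 for each column
print(test_z)
```

After the transformation both columns contribute on a comparable scale to a Euclidean distance, which is the property KNN relies on.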

## Training the model

In [57]:
from sklearn.neighbors import KNeighborsClassifier
clf = KNeighborsClassifier(n_neighbors=3)
clf = clf.fit(X_train, y_train)
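
To make the idea behind the classifier concrete, here is a minimal from-scratch sketch of k-nearest-neighbours prediction using Euclidean distance and an unweighted majority vote, which correspond to the scikit-learn defaults. It is an illustration only, not how `KNeighborsClassifier` is implemented internally, and tie-breaking may differ; `knn_predict` and the toy arrays are hypothetical names.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_query, k=3):
    """Predict a label for each query row by majority vote among its k nearest training rows."""
    predictions = []
    for x in X_query:
        # Euclidean distance from the query point to every training point.
        distances = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k closest training points.
        nearest = np.argsort(distances)[:k]
        # Most common label among those neighbours.
        votes = Counter(y_train[nearest].tolist())
        predictions.append(votes.most_common(1)[0][0])
    return np.array(predictions)

# Two well-separated toy clusters (made-up values).
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [3.0, 3.0], [3.1, 2.9], [2.9, 3.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.15, 0.15], [3.0, 3.1]])

toy_pred = knn_predict(X_toy, y_toy, queries, k=3)
print(toy_pred)
```

KNN has no training phase in the usual sense: `fit` essentially stores the training data, and the work happens at prediction time.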

## Testing the model

In [58]:
y_pred = clf.predict(X_test)

## Model accuracy

In [64]:
from sklearn.metrics import accuracy_score, confusion_matrix

In [65]:
accuracy_score(y_test, y_pred)

0.94

In [66]:
confusion_matrix(y_test, y_pred)

array([[16,  0,  0],
       [ 0, 17,  0],
       [ 0,  3, 14]])

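The model achieved 94% accuracy on the test set (47 of 50 samples correct). All three errors are samples of class 2 (virginica) that were predicted as class 1 (versicolor).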
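
The accuracy can be recovered directly from the confusion matrix shown above: correct predictions lie on the diagonal, so accuracy is the diagonal sum divided by the total. The short sketch below redoes that arithmetic and also computes per-class recall; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true class, columns: predicted class).
cm = np.array([[16, 0, 0],
               [0, 17, 0],
               [0, 3, 14]])

# Accuracy: correctly classified samples (diagonal) over all samples.
accuracy = np.trace(cm) / cm.sum()

# Recall per class: correct predictions for a class over the true count of that class.
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(accuracy)
print(recall_per_class)
```

The per-class view shows that the errors are concentrated in a single class, something the overall accuracy figure alone does not reveal.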

## References
- [K-Nearest Neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)
- Handling categorical variables: Raschka, S., Liu, Y. & Mirjalili, V. *Machine Learning with PyTorch and Scikit-Learn*. (Packt, 2022).