![logo](https://bigdata.icict.fiocruz.br/sites/bigdata.icict.fiocruz.br/themes/sunrise/images/faixa-topo.png)

## Sklearn KNN and Decision Tree (task: Classification)  
#### Jefferson Lima (jeffersonlima@icict.fiocruz.br)
18/04/2018 - PPGICS

---

### Haberman's Survival Data Set


In [1]:
import pandas as pd

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

## Importing the data

[Haberman's Survival Data Set](https://archive.ics.uci.edu/ml/datasets/Haberman%27s+Survival)

1. Age of patient at time of operation (numerical)
2. Patient's year of operation (year - 1900, numerical)
3. Number of positive axillary nodes detected (numerical)
4. Survival status (class attribute)  
   1 = the patient survived 5 years or longer  
   2 = the patient died within 5 years

In [5]:
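# A forward slash avoids the invalid '\d' escape sequence and works on every OS.
df_survival = pd.read_csv('dados/dataset.data', header=None)
# Column names: age, year of operation, positive axillary nodes, survival status.
df_survival.columns = ['idade', 'ano', 'nodulos', 'status']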
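
The status codes from the attribute list (1 and 2) are easier to read once mapped to labels, and counting them shows how balanced the classes are. A minimal sketch, assuming a few made-up rows in place of the real file; `sample_csv`, `status_labels`, and `status_label` are illustrative names that do not appear in the original notebook:

```python
import io

import pandas as pd

# Made-up rows in the documented column order (age, year - 1900, nodes, status).
# They are NOT taken from the real data file; they only make the sketch runnable.
sample_csv = io.StringIO("30,64,1,1\n45,60,0,1\n62,58,12,2\n")

df_demo = pd.read_csv(sample_csv, header=None,
                      names=["idade", "ano", "nodulos", "status"])

# Map the numeric class codes to the meanings given in the attribute list.
status_labels = {1: "survived 5 years or longer", 2: "died within 5 years"}
df_demo["status_label"] = df_demo["status"].map(status_labels)

# Count how many rows fall into each class.
counts = df_demo["status_label"].value_counts()
print(counts)
```

Running the same `value_counts()` call on `df_survival['status']` would show the class balance of the real data.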

In [6]:
x = df_survival.iloc[:, 0:3]  # features: idade, ano, nodulos
y = df_survival.iloc[:, 3]    # target: status

## Splitting the data into training and test sets

In [7]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
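
When one class is much more frequent than the other, a purely random split can leave the test set with a different class ratio than the training set. The `stratify` parameter of `train_test_split` keeps the ratio the same in both subsets. A minimal sketch on synthetic stand-in data (not the Haberman data); `X_demo` and `y_demo` are illustrative names:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data: 75 samples of class 1 and 25 of class 2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 75 + [2] * 25)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Share of class 2 in each subset; stratification keeps both at the original 25%.
train_share = (y_tr == 2).mean()
test_share = (y_te == 2).mean()
print(train_share, test_share)
```

In the notebook, the equivalent change would be passing `stratify=y` to the existing call. This is optional and would change the reported results.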

---  
## KNN (KNeighborsClassifier)

[Documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

In [8]:
from sklearn.neighbors import KNeighborsClassifier

In [9]:
knn = KNeighborsClassifier(n_neighbors=13)
knn.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=13, p=2,
           weights='uniform')
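
KNN classifies by distance, so features on larger numeric scales can dominate the result, and `n_neighbors` is a free choice. One common way to handle both is to standardize the features in a pipeline and compare a few values of k with cross-validation. A minimal sketch on synthetic stand-in data (not the Haberman data); the candidate k values and the names `X_demo`, `y_demo`, and `mean_scores` are assumptions made for illustration:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three features on different scales, two classes (1 and 2).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([10.0, 3.0, 7.0])
y_demo = np.where(X_demo[:, 2] + rng.normal(scale=3.0, size=200) > 0, 2, 1)

# Arbitrary candidate values of k, chosen only for this illustration.
mean_scores = {}
for k in (3, 7, 13, 21):
    # The scaler is refit inside each fold, so no test-fold statistics leak in.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    mean_scores[k] = cross_val_score(model, X_demo, y_demo, cv=5).mean()

best_k = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_k)
```

The same loop could be run on `x_train` and `y_train` to check whether 13 neighbors is a reasonable setting for this data set.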

In [10]:
y_pred = knn.predict(x_test)

### Comparing the predictions with the actual data (KNN)

In [11]:
accuracy_score(y_test, y_pred)

0.70967741935483875

In [12]:
confusion_matrix(y_test, y_pred)

array([[41,  3],
       [15,  3]])
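
In scikit-learn's `confusion_matrix`, rows are the true classes and columns are the predicted classes, in sorted label order (here 1, then 2). Accuracy and the per-class recall and precision can be recomputed from the matrix printed above. A small sketch using only those reported numbers; the variable names are illustrative:

```python
import numpy as np

# KNN confusion matrix as reported above (rows: true 1, 2; columns: predicted 1, 2).
cm_knn = np.array([[41, 3],
                   [15, 3]])

correct = np.diag(cm_knn)                        # correctly classified, per class
accuracy = correct.sum() / cm_knn.sum()          # (41 + 3) / 62
recall_per_class = correct / cm_knn.sum(axis=1)     # correct / actual members of the class
precision_per_class = correct / cm_knn.sum(axis=0)  # correct / predictions of the class

print(accuracy)
print(recall_per_class)
print(precision_per_class)
```

The accuracy comes out at about 0.710, matching `accuracy_score`. Recall is about 0.932 for class 1 but only about 0.167 for class 2: of the 18 test patients who died within 5 years, the model identified 3.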

---  
## Decision Tree

[Documentation](http://scikit-learn.org/stable/modules/tree.html#classification)

In [13]:
from sklearn.tree import DecisionTreeClassifier

In [14]:
dtc = DecisionTreeClassifier()

In [15]:
dtc.fit(x_train, y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            presort=False, random_state=None, splitter='best')
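
With the default settings shown in the output above (`max_depth=None`, `random_state=None`), the tree grows until its leaves are pure, and ties between equally good splits can be broken differently from run to run. A minimal sketch on synthetic stand-in data (not the Haberman data) that fixes `random_state` and compares an unrestricted tree with a depth-limited one; the depth of 3 is an arbitrary value for illustration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: the class depends on the first feature plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.where(X_demo[:, 0] + rng.normal(scale=1.0, size=200) > 0, 2, 1)

# Fixing random_state makes repeated fits produce the same tree.
full_tree = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_demo, y_demo)

# Depth and accuracy on the training data itself.
print(full_tree.get_depth(), full_tree.score(X_demo, y_demo))
print(shallow_tree.get_depth(), shallow_tree.score(X_demo, y_demo))
```

The unrestricted tree fits the training data perfectly, which usually signals memorization rather than generalization. Whether a shallower tree does better on the test set has to be checked on the real data.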

In [16]:
y_pred = dtc.predict(x_test)

### Comparing the predictions with the actual data (DecisionTreeClassifier)

In [17]:
accuracy_score(y_test, y_pred)

0.64516129032258063

In [18]:
confusion_matrix(y_test, y_pred)

array([[35,  9],
       [13,  5]])
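
A useful reference point for both accuracy values is a baseline that always predicts the most frequent class. It can be worked out from the two confusion matrices reported above, since each row sum is the number of test samples that truly belong to that class. A small sketch using only those reported numbers; the variable names are illustrative:

```python
import numpy as np

# Confusion matrices as reported above (rows: true 1, 2; columns: predicted 1, 2).
cm_knn = np.array([[41, 3], [15, 3]])
cm_tree = np.array([[35, 9], [13, 5]])

# Row sums give the true class counts in the test set: 44 of class 1, 18 of class 2.
true_counts = cm_knn.sum(axis=1)

# Accuracy of always predicting the majority class (class 1).
baseline_accuracy = true_counts.max() / true_counts.sum()

knn_accuracy = np.trace(cm_knn) / cm_knn.sum()
tree_accuracy = np.trace(cm_tree) / cm_tree.sum()
print(baseline_accuracy, knn_accuracy, tree_accuracy)
```

On this split, always predicting class 1 would score 44/62, about 0.710. That is the same accuracy as KNN, and above the decision tree's 40/62, about 0.645. Accuracy alone therefore says little here, and the per-class errors in the matrices are more informative.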

---  
## Using the trained models on new data

In [19]:
novo_dado = [[70,64,1]]

#### KNN

In [20]:
knn.predict(novo_dado)

array([1], dtype=int64)

#### DecisionTreeClassifier

In [21]:
dtc.predict(novo_dado)

array([2], dtype=int64)
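
The two models disagree on this patient, so the class probabilities are more informative than the label alone. In addition, recent scikit-learn versions may warn when a model fitted on a DataFrame receives a plain list, which a one-row DataFrame with the same column names avoids. A minimal sketch on synthetic stand-in training data (not the Haberman data); `model`, `X_demo`, `y_demo`, and `novo_dado_df` are illustrative names:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

feature_names = ["idade", "ano", "nodulos"]

# Synthetic stand-in training data with the same column names as the notebook.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "idade": rng.integers(30, 84, size=100),
    "ano": rng.integers(58, 70, size=100),
    "nodulos": rng.integers(0, 30, size=100),
})
# Made-up labelling rule, used only so that both classes are present.
y_demo = pd.Series(np.where(X_demo["nodulos"] > 10, 2, 1), name="status")

model = KNeighborsClassifier(n_neighbors=13).fit(X_demo, y_demo)

# The new patient as a one-row DataFrame with matching column names.
novo_dado_df = pd.DataFrame([[70, 64, 1]], columns=feature_names)

predicted_class = model.predict(novo_dado_df)
class_probabilities = model.predict_proba(novo_dado_df)  # columns follow model.classes_
print(model.classes_, predicted_class, class_probabilities)
```

For KNN with uniform weights, each probability is the fraction of the 13 nearest neighbors that belong to that class. For an unrestricted decision tree the probabilities are typically 0 or 1, because the leaves are pure.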

---  
## Discussing the results  
[Kaggle - Machine Learning and Predictions Survival's](https://www.kaggle.com/gilsousa/haberman-s-survival)  
[Kaggle - Exploring the data](https://www.kaggle.com/gokulkarthik/haberman-s-survival-exploratory-data-analysis)  