# Lesson 2

## Logical Sequence

### Reading the Data with Pandas
### Exploring the Data
### Splitting into Features and Labels
### Splitting into Training and Test Data
### Instantiating, Training, and Predicting with the Models

***

In [4]:
"""
"""

# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# constants
PATH_FILE_NAME = 'data/diabetes.csv'
RANDOM_SEED = 42
TEST_SIZE = 0.2

# read the data
df_diabetes = pd.read_csv(PATH_FILE_NAME)

# masks to filter the dataframe (True where the value is NOT zero)
mask_glucose_zero = df_diabetes.loc[:, 'Glucose'] != 0
mask_press_zero = df_diabetes.loc[:, 'BloodPressure'] != 0

# filter the diabetes dataframe
df_diabetes_filt = df_diabetes.loc[mask_glucose_zero & mask_press_zero, :]

# split into features and labels
X = df_diabetes_filt.drop(columns='Outcome')
y = df_diabetes_filt.loc[:, 'Outcome']

# split into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=RANDOM_SEED)

# instantiate the model
rand_forest = RandomForestClassifier(random_state=RANDOM_SEED)

# train
rand_forest.fit(X_train, y_train)

# predict
y_pred = rand_forest.predict(X_test)

print(f'Acc Score Test Data: {accuracy_score(y_test, y_pred)*100:.2f}%')

Acc Score Test Data: 78.77%
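
The cell above imports `classification_report` but only prints the accuracy. As a minimal sketch of how the report could be printed alongside it, the example below repeats the split, fit, and predict steps on synthetic data. The data from `make_classification` and every name ending in `_demo` are assumptions made for illustration, so the numbers it prints say nothing about the diabetes dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42
TEST_SIZE = 0.2

# Synthetic stand-in for the real data (assumption: 8 features, binary label)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=RANDOM_SEED
)

# Same split parameters as in the lesson
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=TEST_SIZE, random_state=RANDOM_SEED
)

# Instantiate, train, predict
model_demo = RandomForestClassifier(random_state=RANDOM_SEED)
model_demo.fit(X_train_demo, y_train_demo)
y_pred_demo = model_demo.predict(X_test_demo)

# Accuracy is a single number; the report adds precision, recall and F1 per class
acc_demo = accuracy_score(y_test_demo, y_pred_demo)
report_demo = classification_report(y_test_demo, y_pred_demo)

print(f'Accuracy: {acc_demo * 100:.2f}%')
print(report_demo)
```

The per-class view can matter when one class is rarer than the other, because a high accuracy can hide poor recall on the minority class.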


## Explaining Lines 18 to 23

```
# masks to filter the dataframe (True where the value is NOT zero)
mask_glucose_zero = df_diabetes.loc[:, 'Glucose'] != 0
mask_press_zero = df_diabetes.loc[:, 'BloodPressure'] != 0

# filter the diabetes dataframe
df_diabetes_filt = df_diabetes.loc[mask_glucose_zero & mask_press_zero, :]
```

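Given the pandas.DataFrame df_diabetes, we want to exclude some rows, since a reading of zero in these columns is not a plausible measurement and most likely marks a missing value:
- Rows where the Glucose column equals zero
- Rows where the BloodPressure column equals zero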
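
As a rough sketch of these two exclusion rules, the example below applies them to a tiny hand-made DataFrame. The values and the names `df_toy`, `zeros_per_column`, `mask_keep`, and `df_toy_filt` are hypothetical and are not taken from the diabetes file.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the lesson
df_toy = pd.DataFrame({
    'Glucose':       [148, 0, 183, 89, 137],
    'BloodPressure': [72, 66, 0, 66, 0],
    'Outcome':       [1, 0, 1, 0, 1],
})

# Count how many zeros each of the two columns contains
zeros_per_column = (df_toy[['Glucose', 'BloodPressure']] == 0).sum()

# A row is kept only if BOTH values are non-zero.
# When the comparisons are written inline, each one needs its own
# parentheses because & binds more tightly than !=.
mask_keep = (df_toy['Glucose'] != 0) & (df_toy['BloodPressure'] != 0)

df_toy_filt = df_toy.loc[mask_keep, :]

print(zeros_per_column)
print(df_toy_filt)
```

Only the rows labelled 0 and 3 survive here, because every other row has a zero in at least one of the two columns.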

In [7]:
# This is the DataFrame
df_diabetes

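| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |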


In [8]:
# let's access the Glucose column
df_diabetes.loc[:, 'Glucose']

0      148
1       85
2      183
3       89
4      137
      ... 
763    101
764    122
765    121
766    126
767     93
Name: Glucose, Length: 768, dtype: int64

In [10]:
# Here we check which rows are different from zero
# For rows different from zero, the value becomes True
# For rows equal to zero, the value becomes False
df_diabetes.loc[:, 'Glucose'] != 0

0      True
1      True
2      True
3      True
4      True
       ... 
763    True
764    True
765    True
766    True
767    True
Name: Glucose, Length: 768, dtype: bool

With the snippet above, we get a Boolean pandas.Series, that is, one holding True and False values. We can pass this Series directly to loc[ ] to filter the rows we want.
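
Before filtering, it can help to inspect the Boolean Series itself. The sketch below, on a hypothetical five-value Series, counts the True values with `sum()` and flips the mask with `~` to see which entries would be dropped. The names `s_glucose`, `s_mask`, `n_nonzero`, `n_zero`, and `zero_entries` are made up for this illustration.

```python
import pandas as pd

# Hypothetical glucose readings, two of them zero
s_glucose = pd.Series([148, 0, 183, 0, 137], name='Glucose')

# Boolean Series: True where the reading is not zero
s_mask = s_glucose != 0

# True counts as 1 and False as 0, so sum() counts the True values
n_nonzero = int(s_mask.sum())

# ~ inverts the mask: True now marks the zero readings
n_zero = int((~s_mask).sum())

# The entries that the filter would remove
zero_entries = s_glucose.loc[~s_mask]

print(n_nonzero, n_zero)
print(zero_entries)
```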

In [12]:
# let's give the Boolean Series a name
s_mask_gluc = df_diabetes.loc[:, 'Glucose'] != 0

# we want the rows that are True in the Boolean Series we created
df_diabetes.loc[s_mask_gluc, :]

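| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |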
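
A truncated display like the one above can look the same before and after filtering when none of the rows it happens to show were removed. One way to confirm the filter did something is to compare row counts, as in the sketch below on hypothetical data. It also shows that `loc` keeps the original index labels, so the filtered frame has gaps in its index unless it is reset. The names `df_small`, `df_small_filt`, `n_removed`, and `df_small_reset` are assumptions for this example.

```python
import pandas as pd

# Hypothetical data with two zero readings
df_small = pd.DataFrame({'Glucose': [148, 0, 183, 89, 0, 137]})

# Keep only the rows with a non-zero reading
df_small_filt = df_small.loc[df_small['Glucose'] != 0, :]

# Number of rows the filter dropped
n_removed = len(df_small) - len(df_small_filt)

# The surviving rows keep their original labels: 0, 2, 3, 5
print(df_small_filt.index.tolist())

# Optional: renumber the rows from 0 and discard the old labels
df_small_reset = df_small_filt.reset_index(drop=True)

print(n_removed)
print(df_small_reset)
```

Resetting the index is optional. `train_test_split` works with the gapped index too, so the lesson's pipeline does not depend on it.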
