# Section 29.04 - Clustering - Covid19
This notebook uses data made available by Einstein Data4u: [Diagnosis of COVID-19 and its clinical spectrum](https://www.kaggle.com/dataset/e626783d4672f182e7870b1bbe75fae66bdfb232289da0a61f08c2ceb01cab01).

In short, the tasks described at the link above are:
1. Predict confirmed Covid-19 cases (response/target/class: **SARS-Cov-2 exam result**) among suspected cases; and
2. Predict admission to the general ward, semi-intensive unit, or intensive care unit among confirmed Covid-19 cases.

## What this notebook covers:
The learning in this notebook rests on 3 pillars/topics:
1. CRISP-DM;
2. Clustering models (K-means, DBSCAN etc.); and
3. Classification models (decision trees, random forests, gradient boosting etc.)

## CRISP-DM:

<tr>
    <td>
        <img src="./imagens/crisp.png" alt="CRISP-DM" width="350"/>
        <p style="text-align:center">Figure 01 - CRISP-DM diagram.</p>
        <p style="text-align:center">Source: 2017; Vasconcellos, P., A. L.; CRISP-DM, SEMMA e KDD: conheça as melhores técnicas para exploração de dados</p>
    </td>
</tr>

References:
- 2017; Vasconcellos, P., A. L.; [CRISP-DM, SEMMA e KDD: conheça as melhores técnicas para exploração de dados](https://paulovasconcellos.com.br/crisp-dm-semma-e-kdd-conhe%C3%A7a-as-melhores-t%C3%A9cnicas-para-explora%C3%A7%C3%A3o-de-dados-560d294547d2)
- 2018; Vaz, A. L.; [Gerenciamento de Projetos de Data Science com CRISP](https://medium.com/data-hackers/project-data-science-pds-39d5a78e058a)

In this case the problem is quite straightforward, so we can safely skip the **BUSINESS UNDERSTANDING** stage and move on to the **DATA UNDERSTANDING** stage.

## Data understanding

Taking the first task proposed by Einstein Data4U (**1. Predict confirmed Covid-19 cases among suspected cases**), we can formulate questions to guide the **DATA UNDERSTANDING** stage for task 1:
1. How many records are in the dataset?
2. What proportion of cases are confirmed Covid-19?
3. How many attributes are there? How many are entirely null? What is the cardinality of the rest?
4. Can we identify clusters?
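
The first three questions map onto a handful of pandas calls. The sketch below uses a small made-up DataFrame (`toy`, with hypothetical column names and values) rather than the real dataset, so the numbers are illustrative only. Question 4 needs a clustering model and is not covered here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "exam_result": ["negative", "negative", "positive", "negative"],
    "hematocrit": [0.2, np.nan, -1.1, np.nan],
    "always_empty": [np.nan, np.nan, np.nan, np.nan],
})

# 1. How many records?
n_records = len(toy)

# 2. Share of each class in the target, in percent.
class_share = toy["exam_result"].value_counts(normalize=True).mul(100).round(2)

# 3. How many attributes, which ones are entirely null, and the cardinality of the rest.
n_attributes = toy.shape[1]
all_null = toy.columns[toy.isna().all()].tolist()
cardinality = toy.drop(columns=all_null).nunique(dropna=False)  # NaN counts as a value

print(n_records, n_attributes, all_null)
print(class_share)
print(cardinality)
```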

In [1]:
import numpy as np
import pandas as pd

# Plotting.
import matplotlib.pyplot as plt
import seaborn as sns
# import plotly
from bokeh.plotting import figure
from bokeh.io import show, output_notebook

# Data preparation.
from data_prep import data_prep as dp # A small module I wrote myself (it's pretty rough!).
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import (StandardScaler
                                   , MinMaxScaler)
from sklearn.model_selection import (train_test_split
                                     , cross_val_score
                                     , StratifiedKFold)

# Model validation.
from sklearn.metrics import (confusion_matrix
                             , accuracy_score
                             , classification_report)

pd.set_option('display.max_columns', None)  
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', 500)

In [2]:
# # Input data files are available in the "../input/" directory.
# # For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# import os
# for dirname, _, filenames in os.walk('/kaggle/input'):
#     for filename in filenames:
#         print(os.path.join(dirname, filename))

# # Any results you write to the current directory are saved as output.

In [3]:
df = pd.read_excel("../bases/dataset_covid_hospital_einstein.xlsx")
df.shape

(5644, 111)

In [4]:
df_proporcao_covid19 = pd.DataFrame(round(100*df["SARS-Cov-2 exam result"].value_counts()/len(df), 2))
df_proporcao_covid19

Unnamed: 0,SARS-Cov-2 exam result
negative,90.11
positive,9.89


In [5]:
dp.breveDescricao(df)

O data set possui: 
- 106 atributos/campos; e 
- 5644 registros.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5644 entries, 0 to 5643
Columns: 106 entries, Patient ID to ctO2 (arterial blood gas analysis)
dtypes: float64(65), int64(4), object(37)
memory usage: 4.6+ MB
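
The source of `dp.breveDescricao` is not shown in this notebook. Judging only from its output (an attribute and record count followed by `df.info()`), and from the column count falling from 111 to 106, a stand-in might look like the sketch below. The name `brief_description` and its behavior are assumptions, not the module's actual code.

```python
import numpy as np
import pandas as pd

def brief_description(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: drop all-null columns, then print a short summary."""
    # Return a new frame instead of mutating the caller's DataFrame.
    cleaned = df.dropna(axis=1, how="all")
    print(f"The data set has {cleaned.shape[1]} attributes and {cleaned.shape[0]} records.")
    cleaned.info(verbose=False)
    return cleaned

# Made-up data: one column is entirely null and should disappear.
toy = pd.DataFrame({"a": [1.0, np.nan], "b": ["x", "y"], "empty": [np.nan, np.nan]})
toy_clean = brief_description(toy)
```

Returning a new DataFrame, rather than dropping columns in place, keeps the caller's data unchanged; that is a design choice of this sketch.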


# Defining a baseline
Since **task 1** is a classification problem, let's take a model that is robust to a flawed dataset (null records, in this case) and see how it performs before any feature engineering.
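
Before fitting any tree, it can help to know what a trivial model scores on data this imbalanced. The sketch below uses scikit-learn's `DummyClassifier` on synthetic labels with roughly the 90/10 split seen above; the arrays and variable names are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic target: 90 negatives (0) and 10 positives (1). Features are irrelevant here.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))

# Always predicts the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)
y_hat = dummy.predict(X_toy)

dummy_accuracy = accuracy_score(y_toy, y_hat)
dummy_recall = recall_score(y_toy, y_hat)  # recall of the positive class (label 1)

print("accuracy:", dummy_accuracy)
print("positive recall:", dummy_recall)
```

A model that never predicts the positive class still reaches high accuracy on such data, so recall on the positive class is the more telling number.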

## Without any feature engineering
Because the decision tree needs numeric inputs, I will proceed as follows:
- First, I will fit the tree using only the numeric attributes;
- In the next section, I will transform the categorical data (for example with one-hot encoding), join it with the numeric data, and fit the tree again.
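
As a rough idea of the second step, the sketch below one-hot encodes a categorical column with `pd.get_dummies` and joins the result to a numeric column. The toy frame and its column names are hypothetical, and `dummy_na=True` is just one possible way to keep missing values as a category of their own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical attribute.
toy = pd.DataFrame({
    "age_quantile": [13, 5, 3],
    "influenza_a": ["not_detected", "detected", np.nan],
})

numeric_part = toy[["age_quantile"]]
categorical_part = toy[["influenza_a"]]

# One column per category; dummy_na=True adds an explicit indicator for NaN.
encoded = pd.get_dummies(categorical_part, dummy_na=True)

# Join the encoded categoricals back to the numeric attributes.
features = pd.concat([numeric_part, encoded], axis=1)
print(features.columns.tolist())
```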

### Fitting with only the numeric attributes:

In [125]:
df["SARS-Cov-2 exam result - Numerico"] = np.where(df["SARS-Cov-2 exam result"]=="positive", 1, 0)

# Keep the numeric columns and simply fill the NaN values with zeros, which is reasonable here.
# fillna returns a new DataFrame, so no SettingWithCopyWarning is raised.
df_numericos = df.select_dtypes(exclude=["object"]).fillna(0)

x = df_numericos.drop("SARS-Cov-2 exam result - Numerico", axis=1)
y = df_numericos["SARS-Cov-2 exam result - Numerico"]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=0)

In [126]:
df_numericos[df_numericos["SARS-Cov-2 exam result - Numerico"]==1].sample(7, random_state=0).T

Unnamed: 0,3261,5077,5086,5302,5444,2237,1432
Patient age quantile,13.0,5.0,3.0,3.0,3.0,13.0,11.0
"Patient addmited to regular ward (1=yes, 0=no)",0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Patient addmited to semi-intensive unit (1=yes, 0=no)",0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Patient addmited to intensive care unit (1=yes, 0=no)",0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hematocrit,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Hemoglobin,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Platelets,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Mean platelet volume,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Red blood Cells,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Lymphocytes,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [113]:
# Check whether the class proportions were preserved after the split:

print("Negativos: {:.2f}".format(100*len(y_test[y_test==0])/len(y_test)))
print("Positivos: {:.2f}".format(100*len(y_test[y_test==1])/len(y_test)))

Negativos: 90.29
Positivos: 9.71
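
The split above preserves the class proportions only by chance. One common way to guarantee it is the `stratify` argument of `train_test_split`. The sketch below shows it on synthetic labels; the arrays are made up, and this is a variation on the split used in this notebook, not a description of it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 90 negatives and 10 positives.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class ratio the same in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy
)

print("train positives (%):", 100 * y_tr.mean())
print("test positives (%):", 100 * y_te.mean())
```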


#### Decision tree

In [128]:
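# Only DecisionTreeClassifier is needed; the unused `export` import was dropped
# because that submodule is not available in recent scikit-learn releases.
from sklearn.tree import DecisionTreeClassifier

# classificador = DecisionTreeClassifier(criterion='entropy', random_state=0, max_depth=(len(x.columns)**0.5)*2)
# max_depth is cast to int: recent scikit-learn versions reject a float here.
classificador = DecisionTreeClassifier(criterion='entropy'
                                       , random_state=0
                                       , max_depth=int(len(x_train.columns)**0.5))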

classificador.fit(x_train, y_train)

y_pred = classificador.predict(x_test)

In [129]:
print("""Métrica 1 (Relatório de classificação): {}""".format(classification_report(y_test
                                                                                    , y_pred
                                                                                    , target_names=["negative", "positive"])))

Métrica 1 (Relatório de classificação):               precision    recall  f1-score   support

    negative       0.91      1.00      0.95      2548
    positive       0.52      0.05      0.09       274

    accuracy                           0.90      2822
   macro avg       0.71      0.52      0.52      2822
weighted avg       0.87      0.90      0.87      2822



In [130]:
feature_importance = pd.DataFrame(data=(classificador.feature_importances_).reshape(1,len(x_train.columns))
                                  , columns=x_train.columns)

feature_importance = feature_importance.T

feature_importance.sort_values(by=0, ascending=False)

Unnamed: 0,0
Patient age quantile,0.318972
"Patient addmited to regular ward (1=yes, 0=no)",0.140942
Leukocytes,0.131215
pO2 (arterial blood gas analysis),0.079572
Red blood Cells,0.064274
Eosinophils,0.054602
Alanine transaminase,0.046001
Platelets,0.042951
Mean platelet volume,0.034985
Urea,0.030874
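
The same ranking can be built a little more directly as a `pandas.Series` indexed by feature name. A minimal sketch on synthetic data follows; the feature names `f0` to `f3` and the data itself are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification problem with four features, two of them informative.
X_arr, y_arr = make_classification(n_samples=200, n_features=4, n_informative=2,
                                   n_redundant=0, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=["f0", "f1", "f2", "f3"])

tree = DecisionTreeClassifier(criterion="entropy", random_state=0, max_depth=3)
tree.fit(X_toy, y_arr)

# One importance per feature, labeled by column name and sorted from high to low.
importances = (pd.Series(tree.feature_importances_, index=X_toy.columns)
                 .sort_values(ascending=False))
print(importances)
```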


#### Random forest

In [131]:
from sklearn.ensemble import RandomForestClassifier

# max_depth is cast to int: recent scikit-learn versions reject a float here.
classificador = RandomForestClassifier(n_estimators=100
                                       , criterion='entropy'
                                       , random_state=0
                                       , max_depth=int(len(x_train.columns)**0.5))

classificador.fit(x_train, y_train)

y_pred = classificador.predict(x_test)

In [132]:
print("Métrica 1 (Relatório de classificação):\n{}\n".format(classification_report(y_test
                                                                                   , y_pred
                                                                                   , target_names=["negative", "positive"]
                                                                                  )))


Métrica 1 (Relatório de classificação):
              precision    recall  f1-score   support

    negative       0.90      1.00      0.95      2548
    positive       0.67      0.01      0.03       274

    accuracy                           0.90      2822
   macro avg       0.79      0.51      0.49      2822
weighted avg       0.88      0.90      0.86      2822




In [64]:
feature_importance = pd.DataFrame(data=(classificador.feature_importances_).reshape(1,len(x_train.columns))
                                  , columns=x_train.columns)

feature_importance = feature_importance.T

feature_importance.sort_values(by=0, ascending=False)

Unnamed: 0,0
Patient age quantile,0.167211
Leukocytes,0.081495
Platelets,0.058368
"Patient addmited to regular ward (1=yes, 0=no)",0.055526
Monocytes,0.03427
Eosinophils,0.032862
Mean corpuscular volume (MCV),0.028207
Proteina C reativa mg/dL,0.027973
Lymphocytes,0.025843
Aspartate transaminase,0.022913


#### Preliminary analysis
Using only the numeric attributes, recall for the positive class is a meager 0.01 with the random forest (0.05 with the decision tree). In other words, only about 1% to 5% of the Covid-19 cases in the test set are being detected, even though accuracy is 0.90. **But we have the most important thing: a baseline**.
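
For reference, recall for the positive class is the fraction of true positives among all actual positives, TP / (TP + FN). The sketch below derives it from a confusion matrix on made-up labels; the arrays are illustrative and are not this notebook's predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up ground truth and predictions: 4 actual positives, only 1 of them detected.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
recall_positive = tp / (tp + fn)

print(tn, fp, fn, tp)
print(recall_positive, recall_score(y_true, y_pred))
```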

**Way forward:** Let's put the categorical attributes through cleaning and feature engineering to improve our model.

### Feature engineering on the categorical variables

In [145]:
# .copy() makes df_categoricas an independent DataFrame, so the assignment below and the
# in-place operations that follow do not trigger SettingWithCopyWarning.
df_categoricas = df.select_dtypes(include=["object"]).copy()
df_categoricas["SARS-Cov-2 exam result - Numerico"] = df["SARS-Cov-2 exam result - Numerico"].values
df_categoricas.sample(5, random_state=42).T

Unnamed: 0,1694,4434,3297,3980,4165
Patient ID,5541f42a107f084,d312238aa508c24,23b19225d2fdd4c,63aeba2b01cb895,b90320c6c7bbb1e
SARS-Cov-2 exam result,negative,negative,negative,negative,negative
Respiratory Syncytial Virus,,,,,
Influenza A,,,,,
Influenza B,,,,,
Parainfluenza 1,,,,,
CoronavirusNL63,,,,,
Rhinovirus/Enterovirus,,,,,
Coronavirus HKU1,,,,,
Parainfluenza 3,,,,,


In [146]:
dp.breveDescricao(df_categoricas)

O data set possui: 
- 38 atributos/campos; e 
- 5644 registros.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5644 entries, 0 to 5643
Data columns (total 38 columns):
 #   Column                             Non-Null Count  Dtype 
---  ------                             --------------  ----- 
 0   Patient ID                         5644 non-null   object
 1   SARS-Cov-2 exam result             5644 non-null   object
 2   Respiratory Syncytial Virus        1354 non-null   object
 3   Influenza A                        1354 non-null   object
 4   Influenza B                        1354 non-null   object
 5   Parainfluenza 1                    1352 non-null   object
 6   CoronavirusNL63                    1352 non-null   object
 7   Rhinovirus/Enterovirus             1352 non-null   object
 8   Coronavirus HKU1                   1352 non-null   object
 9   Parainfluenza 3                    1352 non-null   object
 10  Chlamydophila pneumoniae           1352 non-null   object
 11  Aden


In [151]:
df_temp = dp.cardinalidade(df_categoricas)
df_temp[df_temp.Cardinalidade<=3]

Unnamed: 0,Atributo,Cardinalidade,Valores
30,SARS-Cov-2 exam result - Numerico,2,"[0, 1]"
28,Urine - Yeasts,2,"[nan, absent]"
27,Urine - Granular cylinders,2,"[nan, absent]"
26,Urine - Hyaline cylinders,2,"[nan, absent]"
16,Parainfluenza 2,2,"[nan, not_detected]"
23,Urine - Urobilinogen,3,"[nan, normal, not_done]"
18,"Influenza A, rapid test",3,"[nan, negative, positive]"
17,"Influenza B, rapid test",3,"[nan, negative, positive]"
14,Bordetella pertussis,3,"[nan, not_detected, detected]"
13,Inf A H1N1 2009,3,"[nan, not_detected, detected]"
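
The source of `dp.cardinalidade` is not shown either. From the output above, it appears to return one row per attribute with the number of distinct values (NaN included) and the values themselves. A hypothetical stand-in, reusing the output's column names, could look like this; the function name and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def cardinality_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: one row per column with its distinct values (NaN included)."""
    rows = [
        {
            "Atributo": col,
            "Cardinalidade": df[col].nunique(dropna=False),
            "Valores": df[col].unique().tolist(),
        }
        for col in df.columns
    ]
    return pd.DataFrame(rows).sort_values("Cardinalidade")

# Made-up categorical data with missing values.
toy = pd.DataFrame({
    "yeasts": [np.nan, "absent", "absent"],
    "rapid_test": [np.nan, "negative", "positive"],
})
card = cardinality_table(toy)
print(card[card.Cardinalidade <= 2])
```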


#### Dropping the most obviously unhelpful attributes:
- Attribute **Patient ID:** it is the patient identifier.
- Attribute **Urine - Nitrite:** it has only one non-null value, so it should not add value to the analysis.
- Attribute **SARS-Cov-2 exam result:** we have already label-encoded this attribute.
- Attribute **Urine - Protein:** its three values (nan, absent, not_done) do not seem to add value to the analysis.
- Attribute **Urine - Ketone Bodies:** its three values (nan, absent, not_done) do not seem to add value to the analysis.
- Attribute **Urine - Bile pigments:** its three values (nan, absent, not_done) do not seem to add value to the analysis.
- Attribute **Urine - Esterase:** its three values (nan, absent, not_done) do not seem to add value to the analysis.

In [149]:
df_categoricas.drop(["Patient ID"
                     , "Urine - Nitrite"
                     , "SARS-Cov-2 exam result"
                     , "Urine - Protein"
                     , "Urine - Ketone Bodies"
                     , "Urine - Bile pigments"
                     , "Urine - Esterase"
                    ], axis=1, inplace=True)


#### Dropping low-cardinality attributes whose only values are nan and absent
At first glance, an attribute whose only values are *nan* and *absent* (or *not_detected*) does not seem to add value to the analyses. I will remove these four attributes:
- 'Urine - Yeasts'
- 'Urine - Granular cylinders'
- 'Urine - Hyaline cylinders'
- 'Parainfluenza 2'
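
Attributes like these, where every non-null entry shares a single value, can also be found programmatically instead of by eye. A sketch on hypothetical toy data (the column names and values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "yeasts": [np.nan, "absent", "absent"],
    "parainfluenza_2": ["not_detected", np.nan, "not_detected"],
    "influenza_a": ["detected", "not_detected", np.nan],
})

# nunique() ignores NaN by default, so <= 1 means at most one distinct non-null value.
single_valued = [col for col in toy.columns if toy[col].nunique() <= 1]

# Drop them without modifying the original frame.
toy_reduced = toy.drop(columns=single_valued)
print(single_valued)
```

Whether the mere presence of a test result carries signal is a separate question; this sketch only automates the first-glance rule stated above.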

In [152]:
df_categoricas.drop(['Urine - Yeasts'
                     , 'Urine - Granular cylinders'
                     , 'Urine - Hyaline cylinders'
                     , 'Parainfluenza 2']
                    , axis=1, inplace=True)


In [153]:
dp.cardinalidade(df_categoricas)

Unnamed: 0,Atributo,Cardinalidade,Valores
26,SARS-Cov-2 exam result - Numerico,2,"[0, 1]"
22,Urine - Urobilinogen,3,"[nan, normal, not_done]"
17,"Influenza A, rapid test",3,"[nan, negative, positive]"
16,"Influenza B, rapid test",3,"[nan, negative, positive]"
15,Metapneumovirus,3,"[nan, not_detected, detected]"
14,Bordetella pertussis,3,"[nan, not_detected, detected]"
12,CoronavirusOC43,3,"[nan, not_detected, detected]"
11,Coronavirus229E,3,"[nan, not_detected, detected]"
10,Parainfluenza 4,3,"[nan, not_detected, detected]"
0,Respiratory Syncytial Virus,3,"[nan, not_detected, detected]"


In [155]:
dp.serieNulos(df_categoricas, 95)

(Urine - Urobilinogen    98.777463
 Urine - Color           98.759745
 Urine - Crystals        98.759745
 Urine - Leukocytes      98.759745
 Urine - Hemoglobin      98.759745
 Urine - pH              98.759745
 Urine - Aspect          98.759745
 dtype: float64,
 '-> 7 atributos/features/campos possuem mais de 95% de valores nulos.')
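
`dp.serieNulos` belongs to the same unpublished module. Based on the output above, it seems to report the percentage of null values per attribute for those above a threshold. A hypothetical stand-in, returning only the Series, might be written as follows; the function name, the strict `>` comparison, and the toy data are assumptions.

```python
import numpy as np
import pandas as pd

def null_share_above(df: pd.DataFrame, threshold_pct: float) -> pd.Series:
    """Hypothetical stand-in: percentage of nulls per column, keeping those above the threshold."""
    null_pct = df.isna().mean().mul(100)
    return null_pct[null_pct > threshold_pct].sort_values(ascending=False)

# Made-up data: one column is 95% null, the other 50% null.
toy = pd.DataFrame({
    "mostly_empty": [np.nan] * 19 + ["yellow"],
    "half_empty": [np.nan] * 10 + ["clear"] * 10,
})
high_nulls = null_share_above(toy, 90)
print(high_nulls)
```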