# **Data**

In this task, our goal is to build a predictive model that determines whether a patient infected with COVID-19 is at risk of being admitted to the ICU. This is a classification task with ICU as the target attribute. The data are real and were collected at Hospital Sírio-Libanês. The dataset file is Kaggle_Sirio_Libanes_ICU_Prediction.csv, and the following features were extracted for each patient:

    RESPIRATORY_RATE_MAX: Maximum respiratory rate reached by the patient.
    RESPIRATORY_RATE_MEAN: Mean respiratory rate of the patient.
    AGE_PERCENTIL: Age percentile.
    OTHER:
    BLOODPRESSURE_SISTOLIC_MAX: Maximum systolic blood pressure.
    WINDOW: Patient admission window, in hours (0-12+).
    BLOODPRESSURE_DIASTOLIC_MIN: Minimum diastolic blood pressure.
    HTN: Hypertension (high blood pressure).
    RESPIRATORY_RATE_MEDIAN: Median respiratory rate of the patient.
    BLOODPRESSURE_DIASTOLIC_MEAN: Mean diastolic blood pressure.
    BLOODPRESSURE_SISTOLIC_MEAN: Mean systolic blood pressure.
    RESPIRATORY_RATE_MIN: Minimum respiratory rate of the patient.
    BLOODPRESSURE_DIASTOLIC_MAX: Maximum diastolic blood pressure.
    GENDER: Gender.
    BLOODPRESSURE_SISTOLIC_MEDIAN: Median systolic blood pressure.
    HEART_RATE_DIFF_REL: Relative difference of the heart rate.
    TEMPERATURE_MEAN: Mean temperature.
    IMMUNOCOMPROMISED: Whether the patient is immunocompromised.
    OXYGEN_SATURATION_MAX: Maximum oxygen saturation.
    ICU: Whether or not the patient needed the ICU (0 or 1).

Note: All values were normalized between -1 and 1.
Activity description

To solve the problem, consider the following activities:

(25%) Perform an exploratory analysis of the data, identifying and describing:
- correlations between the predictor variables;
- missing values.
- Include Markdown comments in your notebook to present the observed results;

(50%) Split the training data using the holdout technique (80/20) into training and validation sets for computing the metrics. Based on the preceding analysis and using all available attributes, train and fine-tune the following models with k-fold cross-validation:
- A Logistic Regression model;
- A Naive Bayes model;
- A Decision Tree model;
- One other model of your choice, among those available in the Scikit-Learn API;

(25%) Report Precision, Recall, Accuracy, and F1-Score on the validation set for the models used. Is there a difference between training and validation performance? Describe your understanding and justify the results obtained.

How to submit

    Link to the Colab notebook shared with the instructors, with the questions indexed in the notebook itself to ease grading.


# **Imports**

In [1]:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from numpy import mean, std

# **Load Data**

In [2]:
df = pd.read_csv("Kaggle_Sirio_Libanes_ICU_Prediction.csv")
df.head(5)

Unnamed: 0,RESPIRATORY_RATE_MAX,RESPIRATORY_RATE_MEAN,WINDOW,AGE_PERCENTIL,OTHER,BLOODPRESSURE_SISTOLIC_MAX,BLOODPRESSURE_DIASTOLIC_MIN,HTN,RESPIRATORY_RATE_MEDIAN,BLOODPRESSURE_DIASTOLIC_MEAN,...,RESPIRATORY_RATE_MIN,BLOODPRESSURE_DIASTOLIC_MAX,GENDER,BLOODPRESSURE_SISTOLIC_MEDIAN,HEART_RATE_DIFF_REL,BLOODPRESSURE_DIASTOLIC_MEAN.1,TEMPERATURE_MEAN,IMMUNOCOMPROMISED,OXYGEN_SATURATION_MAX,ICU
0,-0.575758,-0.525424,6,10,1.0,-0.740541,-0.051546,0.0,-0.517241,-0.259259,...,-0.428571,-0.487179,0,-0.630769,-1.0,-0.259259,0.285714,0.0,0.631579,0
1,0.212121,0.355932,4,80,1.0,-0.32973,0.175258,1.0,0.37931,0.012346,...,0.5,-0.299145,0,-0.046154,-1.0,0.012346,-0.464286,0.0,0.842105,0
2,-0.515152,-0.549005,12,10,1.0,-0.567568,-0.175258,0.0,-0.517241,-0.131508,...,-0.5,-0.247863,0,-0.538462,-0.708718,-0.131508,-0.201863,0.0,0.842105,0
3,-0.454545,-0.389831,4,10,0.0,-0.524324,0.072165,0.0,-0.37931,-0.111111,...,-0.285714,-0.384615,0,-0.323077,-1.0,-0.111111,0.107143,0.0,0.578947,1
4,,,4,70,1.0,-0.610811,-0.010309,0.0,,-0.209877,...,,-0.452991,1,-0.446154,-1.0,-0.209877,0.178571,1.0,0.894737,0


In [3]:
df.columns

Index(['RESPIRATORY_RATE_MAX', 'RESPIRATORY_RATE_MEAN', 'WINDOW',
       'AGE_PERCENTIL', 'OTHER', 'BLOODPRESSURE_SISTOLIC_MAX',
       'BLOODPRESSURE_DIASTOLIC_MIN', 'HTN', 'RESPIRATORY_RATE_MEDIAN',
       'BLOODPRESSURE_DIASTOLIC_MEAN', 'BLOODPRESSURE_SISTOLIC_MEAN',
       'RESPIRATORY_RATE_MIN', 'BLOODPRESSURE_DIASTOLIC_MAX', 'GENDER',
       'BLOODPRESSURE_SISTOLIC_MEDIAN', 'HEART_RATE_DIFF_REL',
       'BLOODPRESSURE_DIASTOLIC_MEAN.1', 'TEMPERATURE_MEAN',
       'IMMUNOCOMPROMISED', 'OXYGEN_SATURATION_MAX', 'ICU'],
      dtype='object')

In [4]:
df.shape

(1732, 21)

# **Removing NaN values**

In [5]:
df.isnull().sum()

RESPIRATORY_RATE_MAX              672
RESPIRATORY_RATE_MEAN             672
WINDOW                              0
AGE_PERCENTIL                       0
OTHER                               4
BLOODPRESSURE_SISTOLIC_MAX        616
BLOODPRESSURE_DIASTOLIC_MIN       616
HTN                                 4
RESPIRATORY_RATE_MEDIAN           672
BLOODPRESSURE_DIASTOLIC_MEAN      616
BLOODPRESSURE_SISTOLIC_MEAN       616
RESPIRATORY_RATE_MIN              672
BLOODPRESSURE_DIASTOLIC_MAX       616
GENDER                              0
BLOODPRESSURE_SISTOLIC_MEDIAN     616
HEART_RATE_DIFF_REL               616
BLOODPRESSURE_DIASTOLIC_MEAN.1    616
TEMPERATURE_MEAN                  622
IMMUNOCOMPROMISED                   4
OXYGEN_SATURATION_MAX             617
ICU                                 0
dtype: int64

In [6]:
df = df.dropna()
df.shape

(1018, 21)
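
Dropping every row with a missing value reduces the data from 1732 to 1018 rows, a loss of 714 rows (about 41%). The following is a minimal sketch, run on a made-up toy frame rather than the real file, of how the per-column share of missing values and the cost of `dropna()` could be quantified before committing to that choice. The names `toy`, `missing_share`, `rows_kept`, and `rows_lost` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "RESPIRATORY_RATE_MAX": [0.1, np.nan, -0.3, np.nan],
    "WINDOW": [2, 4, 6, 12],
    "ICU": [0, 0, 1, 1],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)

# Rows that survive a plain dropna(), and rows that are discarded
rows_kept = len(toy.dropna())
rows_lost = len(toy) - rows_kept

print(missing_share)
print(f"dropna() keeps {rows_kept} of {len(toy)} rows")
```

On the real data, the same two numbers would show which measurement groups drive the loss (the respiratory-rate columns have the most missing entries in the counts above).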

# **Correlation**

In [7]:
df.corr()

Unnamed: 0,RESPIRATORY_RATE_MAX,RESPIRATORY_RATE_MEAN,WINDOW,AGE_PERCENTIL,OTHER,BLOODPRESSURE_SISTOLIC_MAX,BLOODPRESSURE_DIASTOLIC_MIN,HTN,RESPIRATORY_RATE_MEDIAN,BLOODPRESSURE_DIASTOLIC_MEAN,...,RESPIRATORY_RATE_MIN,BLOODPRESSURE_DIASTOLIC_MAX,GENDER,BLOODPRESSURE_SISTOLIC_MEDIAN,HEART_RATE_DIFF_REL,BLOODPRESSURE_DIASTOLIC_MEAN.1,TEMPERATURE_MEAN,IMMUNOCOMPROMISED,OXYGEN_SATURATION_MAX,ICU
RESPIRATORY_RATE_MAX,1.0,0.627593,0.495399,0.193001,0.143118,0.564956,-0.530938,0.150551,0.547697,-0.152842,...,-0.143868,0.434001,-0.041032,0.09108,0.640121,-0.152842,-0.0272,0.065045,0.328371,0.5905
RESPIRATORY_RATE_MEAN,0.627593,1.0,-0.030517,0.187851,-0.029002,0.180171,-0.117295,0.081413,0.977732,-0.102032,...,0.621965,-0.000896,-0.084924,0.157595,0.06449,-0.102032,0.099596,0.028768,-0.106418,0.434151
WINDOW,0.495399,-0.030517,1.0,0.005283,0.245584,0.476621,-0.533841,0.085857,-0.071032,-0.043234,...,-0.564892,0.544294,0.027075,-0.078094,0.77113,-0.043234,-0.173891,0.013863,0.493645,0.20563
AGE_PERCENTIL,0.193001,0.187851,0.005283,1.0,0.112707,0.323188,-0.171422,0.337616,0.176854,-0.181769,...,0.055382,0.002046,0.060242,0.303653,0.057618,-0.181769,-0.100578,0.201481,-0.02021,0.287317
OTHER,0.143118,-0.029002,0.245584,0.112707,1.0,0.156719,-0.187961,0.251233,-0.047944,-0.043475,...,-0.17647,0.155787,0.056695,-0.014125,0.246006,-0.043475,-0.027342,0.185188,0.145904,-0.032395
BLOODPRESSURE_SISTOLIC_MAX,0.564956,0.180171,0.476621,0.323188,0.156719,1.0,-0.246654,0.26291,0.142931,0.234236,...,-0.331135,0.698593,-0.053611,0.668709,0.58683,0.234236,-0.089181,0.053137,0.379904,0.384455
BLOODPRESSURE_DIASTOLIC_MIN,-0.530938,-0.117295,-0.533841,-0.171422,-0.187961,-0.246654,1.0,-0.131907,-0.073732,0.752693,...,0.388867,-0.053319,-0.105177,0.338433,-0.608016,0.752693,0.094495,-0.108978,-0.432992,-0.412389
HTN,0.150551,0.081413,0.085857,0.337616,0.251233,0.26291,-0.131907,1.0,0.066769,-0.039058,...,-0.035815,0.101237,-0.031123,0.190311,0.130766,-0.039058,-0.097012,0.172969,0.115209,0.185915
RESPIRATORY_RATE_MEDIAN,0.547697,0.977732,-0.071032,0.176854,-0.047944,0.142931,-0.073732,0.066769,1.0,-0.089194,...,0.639634,-0.03257,-0.088127,0.158203,0.012608,-0.089194,0.117536,0.022432,-0.130574,0.389363
BLOODPRESSURE_DIASTOLIC_MEAN,-0.152842,-0.102032,-0.043234,-0.181769,-0.043475,0.234236,0.752693,-0.039058,-0.089194,1.0,...,0.020245,0.53805,-0.127759,0.474012,-0.085504,1.0,-0.008409,-0.125234,-0.092622,-0.261935
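
The matrix above contains a few near-duplicate predictors: `BLOODPRESSURE_DIASTOLIC_MEAN.1` correlates 1.0 with `BLOODPRESSURE_DIASTOLIC_MEAN`, and `RESPIRATORY_RATE_MEDIAN` correlates 0.977732 with `RESPIRATORY_RATE_MEAN`. A wide matrix is hard to scan by eye, so here is a hedged sketch of one way to list strongly correlated predictor pairs. The helper `strongly_correlated_pairs`, its `threshold` parameter, and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def strongly_correlated_pairs(frame, target="ICU", threshold=0.9):
    """Return (col_a, col_b, r) for predictor pairs with |r| >= threshold."""
    corr = frame.drop(columns=[target]).corr()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    strong = pairs[pairs.abs() >= threshold]
    return [(a, b, float(r)) for (a, b), r in strong.items()]

# Toy frame (made-up values): A_COPY duplicates A, B is only weakly related
toy = pd.DataFrame({
    "A": [1.0, 2.0, 3.0, 4.0],
    "A_COPY": [1.0, 2.0, 3.0, 4.0],
    "B": [1.0, -1.0, 1.0, -1.0],
    "ICU": [0, 1, 0, 1],
})
pairs = strongly_correlated_pairs(toy)
print(pairs)
```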


In [8]:
corr = dict(df.corr())
results = [(k, v['ICU']) for k, v in corr.items()]
results.sort(key=lambda x: x[1], reverse=True)
results[0:3]

[('ICU', 1.0),
 ('RESPIRATORY_RATE_MAX', 0.5905004548441919),
 ('RESPIRATORY_RATE_MEAN', 0.43415135081217776)]

## **Top correlations with ICU**
- #### **RESPIRATORY_RATE_MAX** - 0.590
- #### **RESPIRATORY_RATE_MEAN** - 0.434
- #### **RESPIRATORY_RATE_MEDIAN** - 0.389
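
The ranking in cell `In [8]` includes `ICU` itself, which always correlates 1.0 with the target and takes the first slot. The sketch below shows one possible way to rank predictors by their correlation with the target while excluding the target from the result. The helper name `top_target_correlations` and the toy data are hypothetical.

```python
import pandas as pd

def top_target_correlations(frame, target="ICU", n=3):
    """Rank predictors by correlation with the target, excluding the target itself."""
    return (
        frame.corr()[target]
        .drop(target)                  # remove the trivial self-correlation
        .sort_values(ascending=False)
        .head(n)
    )

# Toy frame (made-up values): X1 tracks the target, X2 is uncorrelated with it
toy = pd.DataFrame({
    "X1": [0.0, 0.1, 0.9, 1.0],
    "X2": [1.0, 0.0, 1.0, 0.0],
    "ICU": [0, 0, 1, 1],
})
top = top_target_correlations(toy, n=2)
print(top)
```

Sorting by the signed value, as here and in the notebook, ignores strong negative correlations. Sorting by absolute value would surface those as well.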

# **Split Train and Test**
- #### **TRAIN** - 0.8
- #### **TEST** - 0.2

In [9]:
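# Use every column except the target as a predictor
X_columns = list(df.columns)
X_columns.remove('ICU')
y_column = ['ICU']
X = df[X_columns]
# Indexing with a list returns a one-column DataFrame, not a 1-D Series
y = df[y_column]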

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
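
The split above passes neither `random_state` nor `stratify`, so each run yields a different partition and the class ratio can drift between the two parts. Below is a minimal sketch, on synthetic data, of a reproducible and stratified 80/20 holdout. The seed values and the `X_demo` and `y_demo` names are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 40% positive class
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(100, 3)), columns=["f1", "f2", "f3"])
y_demo = pd.Series([0] * 60 + [1] * 40, name="ICU")

# stratify keeps the class ratio similar in both parts;
# random_state makes the partition repeatable
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)
print(len(X_tr), len(X_val), y_tr.mean(), y_val.mean())
```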

# **Training Models**
- #### **Default Configuration**
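
The sections that follow train each model with scikit-learn's default settings. As a hedged sketch of how the fine-tuning requested in the task could be layered on top, the example below runs a small grid search with k-fold cross-validation for a decision tree. The data is synthetic, and the grid values and the `f1` scoring choice are illustrative assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the training split
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Candidate hyperparameters (illustrative values only)
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}

# Every grid point is scored with 5-fold cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0), param_grid, cv=cv, scoring="f1"
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is refit on all the data passed to `fit` and can be evaluated once on the held-out split.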

# **Logistic Regression**

In [None]:
model = LogisticRegression()
kfold = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X_train, y_train, cv = kfold)

In [12]:
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Accuracy: 0.839 (0.039)


In [13]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
target_names = ['NO UTI', 'UTI'] 
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

      NO UTI       0.80      0.93      0.86       116
         UTI       0.88      0.69      0.78        88

    accuracy                           0.83       204
   macro avg       0.84      0.81      0.82       204
weighted avg       0.84      0.83      0.82       204



  y = column_or_1d(y, warn=True)
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
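
The two messages above are scikit-learn warnings. The `column_or_1d` line belongs to a `DataConversionWarning`, raised because `y` is a one-column DataFrame. The iteration-limit message is a `ConvergenceWarning` from the default `lbfgs` solver. The sketch below, on synthetic data, shows one way both could be avoided: pass a 1-D target and raise `max_iter`. The value `max_iter=1000` is an arbitrary assumption, and scaling the features is the other remedy the warning text suggests.

```python
import warnings

import numpy as np
import pandas as pd
from sklearn.exceptions import ConvergenceWarning, DataConversionWarning
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a noisy linear decision rule
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 4)), columns=list("abcd"))
noise = rng.normal(scale=0.5, size=200)
y_frame = pd.DataFrame({"ICU": (X_demo["a"] + X_demo["b"] + noise > 0).astype(int)})

# Selecting the column by name yields a 1-D Series instead of a DataFrame
y_1d = y_frame["ICU"]

with warnings.catch_warnings():
    # Fail loudly if either warning would still be emitted
    warnings.simplefilter("error", category=DataConversionWarning)
    warnings.simplefilter("error", category=ConvergenceWarning)
    clf = LogisticRegression(max_iter=1000).fit(X_demo, y_1d)

print(clf.score(X_demo, y_1d))
```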


# **Naive Bayes**

In [None]:
model = GaussianNB()
kfold = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X_train, y_train, cv = kfold)

In [15]:
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Accuracy: 0.830 (0.027)


In [16]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
target_names = ['NO UTI', 'UTI'] 
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

      NO UTI       0.78      0.93      0.85       116
         UTI       0.88      0.65      0.75        88

    accuracy                           0.81       204
   macro avg       0.83      0.79      0.80       204
weighted avg       0.82      0.81      0.80       204



  y = column_or_1d(y, warn=True)


# **Decision Tree**

In [26]:
model = DecisionTreeClassifier()
kfold = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X_train, y_train, cv = kfold)

In [27]:
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Accuracy: 0.784 (0.021)


In [28]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
target_names = ['NO UTI', 'UTI'] 
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

      NO UTI       0.77      0.87      0.82       116
         UTI       0.79      0.66      0.72        88

    accuracy                           0.78       204
   macro avg       0.78      0.76      0.77       204
weighted avg       0.78      0.78      0.78       204
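
The F1-score column in these reports is the harmonic mean of precision and recall, F1 = 2PR / (P + R). As a small worked check, the sketch below recomputes it from the rounded precision and recall values of the Decision Tree report. The helper `f1_from` is a hypothetical name, and because its inputs are already rounded, such a recomputation can in general differ from a report's value in the last digit.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded values copied from the Decision Tree report
f1_uti = round(f1_from(0.79, 0.66), 2)     # UTI class
f1_no_uti = round(f1_from(0.77, 0.87), 2)  # NO UTI class

print(f1_uti, f1_no_uti)  # prints: 0.72 0.82
```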



# **SVM**

In [None]:
model = SVC(kernel='linear')
kfold = KFold(n_splits=5, shuffle=True)
scores = cross_val_score(model, X_train, y_train, cv = kfold)

In [21]:
print('Accuracy: %.3f (%.3f)' % (mean(scores), std(scores)))

Accuracy: 0.838 (0.017)


In [22]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
target_names = ['NO UTI', 'UTI'] 
print(classification_report(y_test, y_pred, target_names=target_names))

              precision    recall  f1-score   support

      NO UTI       0.81      0.93      0.86       116
         UTI       0.89      0.70      0.78        88

    accuracy                           0.83       204
   macro avg       0.85      0.82      0.82       204
weighted avg       0.84      0.83      0.83       204



  y = column_or_1d(y, warn=True)


# **Interpretation of the results**
#### Across the models, the cross-validated accuracy on the training set and the accuracy on the held-out test set are very close: 0.839 vs. 0.83 for Logistic Regression, 0.830 vs. 0.81 for Naive Bayes, 0.784 vs. 0.78 for the Decision Tree, and 0.838 vs. 0.83 for the SVM. This is attributed to k-fold cross-validation, which partitions the dataset into mutually exclusive subsets, uses some of them to estimate the model parameters (training data), and uses the remaining ones (validation or test data) to validate the model. Because each fold's score is computed on data the model did not see during fitting, the cross-validated accuracy is already an out-of-sample estimate, so it is expected to be close to the held-out accuracy.
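
The scores reported by `cross_val_score` are measured on the held-out fold of each split, not on the data the model was fitted on. To compare performance on the fitting data against performance on unseen data directly, one option is `cross_validate` with `return_train_score=True`. The sketch below uses synthetic data and an untuned decision tree purely as an illustration, and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the training split
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
res = cross_validate(
    DecisionTreeClassifier(random_state=0),
    X_demo,
    y_demo,
    cv=cv,
    scoring="accuracy",
    return_train_score=True,  # also score each model on its own training folds
)

train_acc = res["train_score"].mean()
val_acc = res["test_score"].mean()
gap = train_acc - val_acc  # a large positive gap suggests overfitting

print(f"train={train_acc:.3f} validation={val_acc:.3f} gap={gap:.3f}")
```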