# Extracting Covid-19 data via API
### Data source: https://brasil.io/dataset/covid19/
### Augusto SPINELLI

### License
The converted data are licensed under Creative Commons Attribution ShareAlike. If you use the data, cite the original source and whoever processed the data, for example: Source: Secretarias de Saúde das Unidades Federativas, data processed by Álvaro Justen and collaborators/Brasil.IO. If you share the data, use the same license.

## Importing the data

In [6]:
import pandas as pd
import os
import numpy as np


In [7]:
#set the path of the processed data
processed_data_path = os.path.join(os.path.pardir,'data','processed')
train_file_path = os.path.join(processed_data_path,'train.csv')
test_file_path = os.path.join(processed_data_path,'test.csv')

In [10]:
#create train and test dataframes using pandas
train_df = pd.read_csv(train_file_path,index_col=[0])
test_df = pd.read_csv(test_file_path,index_col=[0])

In [11]:
train_df.info()
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49450 entries, 0 to 50639
Data columns (total 11 columns):
 #   Column                     Non-Null Count  Dtype
---  ------                     --------------  -----
 0   sobreviveu                 49450 non-null  int64
 1   ap_residencia_estadia_1.0  49450 non-null  int64
 2   ap_residencia_estadia_2.1  49450 non-null  int64
 3   ap_residencia_estadia_2.2  49450 non-null  int64
 4   ap_residencia_estadia_3.1  49450 non-null  int64
 5   ap_residencia_estadia_3.2  49450 non-null  int64
 6   ap_residencia_estadia_3.3  49450 non-null  int64
 7   ap_residencia_estadia_4.0  49450 non-null  int64
 8   ap_residencia_estadia_5.1  49450 non-null  int64
 9   ap_residencia_estadia_5.2  49450 non-null  int64
 10  ap_residencia_estadia_5.3  49450 non-null  int64
dtypes: int64(11)
memory usage: 4.5 MB
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1190 entries, 70 to 50633
Data columns (total 10 columns):
 #   Column                     Non-Null 

## Data Preparation

We now take our training data and split it so that 80% goes into one matrix and 20% into another.

At the end, we compare the survival rate in both splits to check that they are similar.

In [13]:
X = train_df.loc[:,'ap_residencia_estadia_1.0':].to_numpy().astype('float') #input matrix
y = train_df['sobreviveu'].to_numpy() #output array (Series.ravel() is deprecated in recent pandas)

In [14]:
print (X.shape, y.shape)

(49450, 10) (49450,)


In [15]:
#ML train test split 80/20 (80% will be used to train, 20% to test)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=0)
print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

(39560, 10) (39560,)
(9890, 10) (9890,)


In [16]:
#average survival in train and test, without ML 
print('mean survival in train: {0:.3f}'.format(np.mean(y_train)))
print('mean survival in test: {0:.3f}'.format(np.mean(y_test)))

mean survival in train: 0.905
mean survival in test: 0.903
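
The two survival rates above are close because the split is random and the sample is large. A possible refinement, not used in this notebook, is to pass `stratify` to `train_test_split` so the class proportions are kept nearly identical in both splits by construction. The sketch below uses synthetic data as a stand-in; `X_demo`, `y_demo` and the split variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data (assumption): 1000 rows,
# 10 binary features, roughly 90% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(1000, 10)).astype(float)
y_demo = (rng.random(1000) < 0.9).astype(int)

# stratify=y_demo keeps the class proportions nearly equal in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print('mean survival in train: {0:.3f}'.format(np.mean(y_tr)))
print('mean survival in test: {0:.3f}'.format(np.mean(y_te)))
```

With stratification, the two means can differ only by the rounding of class counts, regardless of the random seed.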


## Baseline Model

This is our baseline model. It is as naive as possible and always predicts that the patient will survive.

In [17]:
import sklearn
#import function
from sklearn.dummy import DummyClassifier

In [18]:
#create a model
model_dummy = DummyClassifier(strategy='most_frequent',random_state=0)

In [19]:
#train a model
model_dummy.fit(X_train,y_train) #input,output params are necessary for fit training function

DummyClassifier(constant=None, random_state=0, strategy='most_frequent')

In [20]:
#calculate model score (baseline considers most frequent data, a.k.a 1 ==  survived covid)
print('score for baseline model : {0:.2f}'.format(model_dummy.score(X_test,y_test)))

score for baseline model : 0.90


In [21]:
#So, without any ML algorithm, if we simply predict that a patient with covid survives, we will be right in 90% of the cases
#let's try to improve that with ML!
#First, let's import some performance metrics
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

In [22]:
#accuracy score
print('accuracy for baseline model : {0:.2f}'.format(accuracy_score(y_test,model_dummy.predict(X_test))))
#confusion matrix
print('confusion matrix for baseline model : \n {0}'.format(confusion_matrix(y_test,model_dummy.predict(X_test))))
#precision and recall scores
print('precision for baseline model : {0:.2f}'.format(precision_score(y_test,model_dummy.predict(X_test))))
print('recall for baseline model : {0:.2f}'.format(recall_score(y_test,model_dummy.predict(X_test))))

accuracy for baseline model : 0.90
confusion matrix for baseline model : 
 [[   0  959]
 [   0 8931]]
precision for baseline model : 0.90
recall for baseline model : 1.00
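
The metrics above can be recomputed by hand from the confusion matrix, which helps show why 90% accuracy is not informative here. The sketch below hard-codes the four counts printed above. Specificity and balanced accuracy are not computed in the notebook and are added only for illustration.

```python
# Counts taken from the confusion matrix printed above.
# scikit-learn lays it out as [[TN, FP], [FN, TP]] for labels 0 and 1.
tn, fp = 0, 959
fn, tp = 0, 8931

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of correct predictions
precision = tp / (tp + fp)                   # predicted survivors who survived
recall = tp / (tp + fn)                      # actual survivors that were found
specificity = tn / (tn + fp)                 # actual deaths that were found
balanced_accuracy = (recall + specificity) / 2

print('accuracy : {0:.2f}'.format(accuracy))
print('precision : {0:.2f}'.format(precision))
print('recall : {0:.2f}'.format(recall))
print('specificity : {0:.2f}'.format(specificity))
print('balanced accuracy : {0:.2f}'.format(balanced_accuracy))
```

The baseline never identifies a patient who died, so its specificity is zero and its balanced accuracy is the same as guessing.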


## Logistic regression model

In [23]:
#import logistic regression function from sklearn
from sklearn.linear_model import LogisticRegression

In [24]:
#create our model 
model_lr_1 = LogisticRegression(random_state=0)

In [25]:
#train our model
model_lr_1.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=0, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [27]:
#evaluate model
print('score for logistic regression - version 1 : {0:.2f}'.format(model_lr_1.score(X_test,y_test)))

score for logistic regression - version 1 : 0.90


In [28]:
#performance metrics

#accuracy score
print('accuracy for logistic regression v1 model : {0:.2f}'.format(accuracy_score(y_test,model_lr_1.predict(X_test))))
#confusion matrix
print('confusion matrix for logistic regression v1 model : \n {0}'.format(confusion_matrix(y_test,model_lr_1.predict(X_test))))
#precision 
print('precision for logistic regression v1 model : {0:.2f}'.format(precision_score(y_test,model_lr_1.predict(X_test))))
#recall
print('recall for logistic regression v1 model : {0:.2f}'.format(recall_score(y_test,model_lr_1.predict(X_test))))

accuracy for logistic regression v1 model : 0.90
confusion matrix for logistic regression v1 model : 
 [[   0  959]
 [   0 8931]]
precision for logistic regression v1 model : 0.90
recall for logistic regression v1 model : 1.00
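
The confusion matrix above is identical to the baseline's, which suggests the logistic regression predicts class 1 for every row. One way to check this is to count the predicted classes and look at the range of predicted probabilities. The sketch below does so on synthetic data whose features are unrelated to the label; `X_demo`, `y_demo` and `model_demo` are assumptions made for illustration, not the notebook's data or model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in (assumption): 10 binary features unrelated to the label,
# with roughly 90% of the labels equal to 1.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(2000, 10)).astype(float)
y_demo = (rng.random(2000) < 0.9).astype(int)

model_demo = LogisticRegression(random_state=0).fit(X_demo, y_demo)

# Count how many times each class is predicted.
classes, counts = np.unique(model_demo.predict(X_demo), return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))

# Range of the predicted probability of class 1: if the minimum stays
# above 0.5, the model never predicts class 0.
proba_1 = model_demo.predict_proba(X_demo)[:, 1]
print('min p(1): {0:.3f}, max p(1): {1:.3f}'.format(proba_1.min(), proba_1.max()))
```

Running the same two checks on `model_lr_1` and `X_test` would show whether the notebook's model behaves this way.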


## Hyperparameter optimization

In [29]:
# base model
model_lr = LogisticRegression(random_state=0)

In [30]:
from sklearn.model_selection import GridSearchCV

In [32]:
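#note: the default lbfgs solver does not support 'l1', so the 'l1' candidates fail to fit (see the ValueError output below) and only 'l2' is effectively searched
parameters = {'C':[1.0,10.0,50.0,100.0,1000.0],'penalty':['l1','l2']}
clf = GridSearchCV(model_lr,param_grid=parameters,cv=3)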

In [33]:
clf.fit(X_train,y_train)

ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.


GridSearchCV(cv=3, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=0, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1.0, 10.0, 50.0, 100.0, 1000.0],
                         'penalty': ['l1', 'l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)

In [34]:
clf.best_params_

{'C': 1.0, 'penalty': 'l2'}

In [35]:
print('best score: {0:.2f}'.format(clf.best_score_))

best score: 0.90


In [36]:
#evaluate model
print('score for logistic regression v2 : {0:.2f}'.format(clf.score(X_test,y_test)))

score for logistic regression v2 : 0.90
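
The `ValueError` messages above mean that every `'l1'` candidate failed to fit, so the search effectively compared only the `'l2'` settings. A minimal sketch of one way to let both penalties be evaluated is to build the base estimator with `solver='liblinear'`, which supports `'l1'` and `'l2'`. The data below are synthetic, and all names ending in `_demo` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption).
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(600, 10)).astype(float)
y_demo = (rng.random(600) < 0.9).astype(int)

# liblinear supports both 'l1' and 'l2' penalties, unlike the default lbfgs.
model_demo = LogisticRegression(random_state=0, solver='liblinear')
parameters_demo = {'C': [1.0, 10.0, 50.0, 100.0, 1000.0], 'penalty': ['l1', 'l2']}
clf_demo = GridSearchCV(model_demo, param_grid=parameters_demo, cv=3)
clf_demo.fit(X_demo, y_demo)

# Every candidate should now have a finite cross-validation score.
scores = clf_demo.cv_results_['mean_test_score']
print(clf_demo.best_params_)
print('candidates with NaN score: {0}'.format(int(np.isnan(scores).sum())))
```

Whether `'l1'` would change the result on the real data is not known from this notebook; the sketch only shows how to make the full grid run.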


## End of Notebook

#### We are not improving on our baseline model. Because the data in this example are poor and the baseline model already reaches 90% accuracy (most Covid patients survive), a logistic regression model is not enough to predict with higher accuracy, given the few features available, whether a patient will survive

#### In this case, we go back to the drawing board. We need more data, such as sex, age and income (not only the AP of residence), to try to build a more accurate model