## 1. Inspecting the COVID-19 clinical data

<p>The dataset comes from the daily reports of the <a href="https://www.gob.mx/salud/documentos/datos-abiertos-152127">General Directorate of Epidemiology</a> (more info <a href="https://datos.gob.mx/busca/dataset/informacion-referente-a-casos-covid-19-en-mexico/resource/e8c7079c-dc2a-4b6e-8035-08042ed37165">here</a>). The data are reported by 475 viral respiratory disease monitoring units/hospitals called USMER (<b>U</b>nidade<b>s</b> <b>M</b>onitoras de <b>E</b>nfermedad <b>R</b>espiratoria <b>V</b>iral) throughout the country and across the entire health sector (IMSS, ISSSTE, SEDENA, SEMAR, and more), which sample 10% of the patients with a viral respiratory diagnosis.</p>

<p>The data are preliminary and subject to validation by the Ministry of Health through the General Directorate of Epidemiology. The information corresponds only to the data obtained from the epidemiological study of a suspected case of viral respiratory disease at the time it is identified in the medical units of the health sector.</p>

<p>Based on the clinical diagnosis at admission, each case is classified as an outpatient or a hospitalized patient. The database does not include the patient's evolution during the stay in the medical units, except for updates at discharge made by the hospital epidemiological surveillance units or health jurisdictions in the case of deaths.</p>

In [28]:
# Print out the first 5 lines from the latest pre-processed report file
!head -n 5 ../latest_raw.csv

"id","FECHA_ARCHIVO","ID_REGISTRO","ENTIDAD_UM","ENTIDAD_RES","RESULTADO","DELAY","ENTIDAD_REGISTRO","ENTIDAD","ABR_ENT","FECHA_ACTUALIZACION","ORIGEN","SECTOR","SEXO","ENTIDAD_NAC","MUNICIPIO_RES","TIPO_PACIENTE","FECHA_INGRESO","FECHA_SINTOMAS","FECHA_DEF","INTUBADO","NEUMONIA","EDAD","NACIONALIDAD","EMBARAZO","HABLA_LENGUA_INDIG","DIABETES","EPOC","ASMA","INMUSUPR","HIPERTENSION","OTRA_COM","CARDIOVASCULAR","OBESIDAD","RENAL_CRONICA","TABAQUISMO","OTRO_CASO","MIGRANTE","PAIS_NACIONALIDAD","PAIS_ORIGEN","UCI"
9269,2020-04-12,"00011f",25,25,2,0,25,"Sinaloa","SL","2020-04-19",2,12,2,25,13,1,"2020-03-20","2020-03-12","9999-99-99",97,2,74,1,97,2,1,2,2,2,1,2,2,1,2,2,2,99,"MÃ©xico","97",97
33333,2020-04-12,"00014e",14,14,2,0,14,"Jalisco","JC","2020-04-19",1,4,1,16,98,2,"2020-03-30","2020-03-30","9999-99-99",2,2,71,1,2,2,1,1,2,2,1,2,2,1,2,1,99,99,"MÃ©xico","97",2
35483,2020-04-12,"000153",8,8,1,0,8,"Chihuahua","CH","2020-04-19",1,4,2,8,19,2,"2020-04-02","2020-03-24","9999-99-99",2,1,50,1,97

## 2. Loading the COVID-19 clinical data
<p>We now load the data into memory.</p>

In [2]:
import pandas as pd

In [3]:
# Read in dataset
df = pd.read_csv('../latest_raw.csv')

In [4]:
# Print out the first rows of our dataset
df.head()

Unnamed: 0,id,FECHA_ARCHIVO,ID_REGISTRO,ENTIDAD_UM,ENTIDAD_RES,RESULTADO,DELAY,ENTIDAD_REGISTRO,ENTIDAD,ABR_ENT,...,OTRA_COM,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,MIGRANTE,PAIS_NACIONALIDAD,PAIS_ORIGEN,UCI
0,9269,2020-04-12,00011f,25,25,2,0,25,Sinaloa,SL,...,2,2,1,2,2,2,99,MÃ©xico,97,97
1,33333,2020-04-12,00014e,14,14,2,0,14,Jalisco,JC,...,2,2,1,2,1,99,99,MÃ©xico,97,2
2,35483,2020-04-12,000153,8,8,1,0,8,Chihuahua,CH,...,2,2,2,2,2,99,99,MÃ©xico,97,2
3,7062,2020-04-12,0001b6,9,15,1,0,9,Ciudad de Mexico,DF,...,2,2,1,2,2,99,99,MÃ©xico,97,97
4,23745,2020-04-12,0001c1,9,9,2,0,9,Ciudad de Mexico,DF,...,2,2,2,2,2,99,99,MÃ©xico,97,97
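The preview shows `MÃ©xico` where we would expect `México`. This pattern usually means UTF-8 text was decoded as Latin-1 at some point. The sketch below is one way to repair such a column, assuming that is what happened here; the toy values (including `Cuba`) are made up for illustration and are not taken from the file.

```python
import pandas as pd

# Toy column reproducing the garbled text from the preview (values are illustrative)
garbled = pd.Series(["MÃ©xico", "MÃ©xico", "Cuba"], name="PAIS_NACIONALIDAD")

# Assumption: the text was UTF-8 that got decoded as Latin-1.
# Re-encoding as Latin-1 recovers the original bytes, which we then decode as UTF-8.
repaired = garbled.str.encode("latin-1").str.decode("utf-8")

print(repaired.tolist())
```

Since we drop `PAIS_NACIONALIDAD` later on, this repair is optional for the model itself.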


In [5]:
# Show the shape (number of rows & columns)
df.shape

(188616, 41)

In [6]:
# Show the number of missing (NAN, NaN, na) data for each column
df.isnull().sum()

id                     0
FECHA_ARCHIVO          0
ID_REGISTRO            0
ENTIDAD_UM             0
ENTIDAD_RES            0
RESULTADO              0
DELAY                  0
ENTIDAD_REGISTRO       0
ENTIDAD                0
ABR_ENT                0
FECHA_ACTUALIZACION    0
ORIGEN                 0
SECTOR                 0
SEXO                   0
ENTIDAD_NAC            0
MUNICIPIO_RES          6
TIPO_PACIENTE          0
FECHA_INGRESO          0
FECHA_SINTOMAS         0
FECHA_DEF              0
INTUBADO               0
NEUMONIA               0
EDAD                   0
NACIONALIDAD           0
EMBARAZO               0
HABLA_LENGUA_INDIG     0
DIABETES               0
EPOC                   0
ASMA                   0
INMUSUPR               0
HIPERTENSION           0
OTRA_COM               0
CARDIOVASCULAR         0
OBESIDAD               0
RENAL_CRONICA          0
TABAQUISMO             0
OTRO_CASO              0
MIGRANTE               0
PAIS_NACIONALIDAD      0
PAIS_ORIGEN            0
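Note that `isnull()` only counts real `NaN` values. The preview above also contains codes such as 97 and 99 in several columns, which look like placeholders rather than actual answers. The sketch below shows how we could count them on a toy frame. The meaning of the codes and the name `PLACEHOLDER_CODES` are assumptions here, so check the official data dictionary before relying on them.

```python
import pandas as pd

# Toy frame mimicking two coded columns (values are made up for illustration)
toy = pd.DataFrame({
    "INTUBADO": [97, 2, 1, 99, 2],
    "UCI": [97, 2, 2, 97, 98],
})

# Assumption: 97, 98 and 99 are placeholder codes, not real answers
PLACEHOLDER_CODES = [97, 98, 99]

# isnull() finds nothing because the placeholders are ordinary integers
nulls_before = toy.isnull().sum()

# Count how many placeholder codes each column holds
placeholder_counts = toy.isin(PLACEHOLDER_CODES).sum()

# Optionally blank them out so pandas treats them as missing
toy_masked = toy.mask(toy.isin(PLACEHOLDER_CODES))
nulls_after = toy_masked.isnull().sum()

print(placeholder_counts)
print(nulls_after)
```

This should not be applied blindly to every column: in `EDAD`, for example, 97 can be a real age.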


In [7]:
# Remove rows that contain any null value
df = df.dropna()

In [8]:
# Exclude identifiers, dates and text columns, plus the coded NACIONALIDAD and MIGRANTE columns
df = df[df.columns[~df.columns.isin(
    ['id', 'ID_REGISTRO',
     'FECHA_ARCHIVO', 'FECHA_ACTUALIZACION', 'FECHA_INGRESO','FECHA_SINTOMAS', 'FECHA_DEF',
     'ABR_ENT', 'NACIONALIDAD', 'MIGRANTE', 'PAIS_NACIONALIDAD', 'PAIS_ORIGEN', 'ENTIDAD']
)]]

In [9]:
df.keys()

Index(['ENTIDAD_UM', 'ENTIDAD_RES', 'RESULTADO', 'DELAY', 'ENTIDAD_REGISTRO',
       'ORIGEN', 'SECTOR', 'SEXO', 'ENTIDAD_NAC', 'MUNICIPIO_RES',
       'TIPO_PACIENTE', 'INTUBADO', 'NEUMONIA', 'EDAD', 'EMBARAZO',
       'HABLA_LENGUA_INDIG', 'DIABETES', 'EPOC', 'ASMA', 'INMUSUPR',
       'HIPERTENSION', 'OTRA_COM', 'CARDIOVASCULAR', 'OBESIDAD',
       'RENAL_CRONICA', 'TABAQUISMO', 'OTRO_CASO', 'UCI'],
      dtype='object')

## 3. Inspecting the COVID-19 clinical data
<p>We ideally want every column to have a numeric type. Let's verify this.</p>

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 188610 entries, 0 to 188615
Data columns (total 28 columns):
ENTIDAD_UM            188610 non-null int64
ENTIDAD_RES           188610 non-null int64
RESULTADO             188610 non-null int64
DELAY                 188610 non-null int64
ENTIDAD_REGISTRO      188610 non-null int64
ORIGEN                188610 non-null int64
SECTOR                188610 non-null int64
SEXO                  188610 non-null int64
ENTIDAD_NAC           188610 non-null int64
MUNICIPIO_RES         188610 non-null float64
TIPO_PACIENTE         188610 non-null int64
INTUBADO              188610 non-null int64
NEUMONIA              188610 non-null int64
EDAD                  188610 non-null int64
EMBARAZO              188610 non-null int64
HABLA_LENGUA_INDIG    188610 non-null int64
DIABETES              188610 non-null int64
EPOC                  188610 non-null int64
ASMA                  188610 non-null int64
INMUSUPR              188610 non-null int64
HIPERTE
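The `info()` listing is long, so it can help to check the types programmatically as well. Here is a minimal sketch on a toy frame; the values are illustrative and `non_numeric` is just a name chosen for this example.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy frame with two numeric columns and one text column (illustrative values)
toy = pd.DataFrame({
    "EDAD": [74, 71, 50],
    "MUNICIPIO_RES": [13.0, 98.0, 19.0],
    "ENTIDAD": ["Sinaloa", "Jalisco", "Chihuahua"],
})

# Collect the names of the columns that are not numeric
non_numeric = [col for col in toy.columns if not is_numeric_dtype(toy[col])]

print(non_numeric)
```

On our `df`, we would expect this list to be empty after the column exclusion above.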

## 4. Creating target column

We are aiming to predict whether a patient with pending COVID-19 results will get a positive or a negative result:
<ul>
<li>Lab results take time to process, which leaves a window during which it is uncertain whether a result will come back positive or negative (we assume that the epidemiological data can be informative for the prediction).</li>
<li>This could also help predict results for people with similar symptoms, e.g. from a survey or an app that collects similar data (ideally, most of the parameters that can be assessed without going to the hospital, such as the person's age).</li>
</ul>
<p>The value of the lab result comes from a real-time PCR, and is stored in <code>RESULTADO</code>, where <code>1=POSITIVE, 2=NEGATIVE, 3=IN PROGRESS</code>. Let's rename this to <code>target</code> so that it's more convenient to work with.</p>

In [11]:
# Rename target column as 'target' for brevity 
df.rename(
    columns={'RESULTADO': 'target'},
    inplace=True
)

In [12]:
# Print out the first 2 rows
df.head(2)

Unnamed: 0,ENTIDAD_UM,ENTIDAD_RES,target,DELAY,ENTIDAD_REGISTRO,ORIGEN,SECTOR,SEXO,ENTIDAD_NAC,MUNICIPIO_RES,...,ASMA,INMUSUPR,HIPERTENSION,OTRA_COM,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,UCI
0,25,25,2,0,25,2,12,2,25,13.0,...,2,2,1,2,2,1,2,2,2,97
1,14,14,2,0,14,1,4,1,16,98.0,...,2,2,1,2,2,1,2,1,99,2


In [13]:
# Remove target variable to move it to the first position of dataframe
col_name = 'target'
first_col = df.pop(col_name)

In [14]:
# Now we can use the pandas insert() method to put the popped column in the first position of the dataframe
# The first argument of insert() is the location where we want to insert, here it is 0
df.insert(0, col_name, first_col)

In [15]:
df

Unnamed: 0,target,ENTIDAD_UM,ENTIDAD_RES,DELAY,ENTIDAD_REGISTRO,ORIGEN,SECTOR,SEXO,ENTIDAD_NAC,MUNICIPIO_RES,...,ASMA,INMUSUPR,HIPERTENSION,OTRA_COM,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,UCI
0,2,25,25,0,25,2,12,2,25,13.0,...,2,2,1,2,2,1,2,2,2,97
1,2,14,14,0,14,1,4,1,16,98.0,...,2,2,1,2,2,1,2,1,99,2
2,1,8,8,0,8,1,4,2,8,19.0,...,2,2,2,2,2,2,2,2,99,2
3,1,9,15,0,9,2,4,1,15,33.0,...,2,2,2,2,2,1,2,2,99,97
4,2,9,9,0,9,1,4,1,99,15.0,...,2,2,2,2,2,2,2,2,99,97
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
188611,1,16,16,0,16,2,4,1,16,53.0,...,2,2,2,2,2,2,2,2,99,97
188612,2,19,19,0,19,1,4,1,24,39.0,...,2,2,1,2,2,2,1,2,99,97
188613,1,8,8,0,8,2,12,1,8,37.0,...,2,2,2,2,2,1,2,2,2,97
188614,2,11,11,0,11,2,4,2,11,37.0,...,2,2,1,2,2,2,2,1,99,97
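As an alternative to `pop()` followed by `insert()`, we could build the new column order as a list and select with it. This is only a sketch on a toy frame with illustrative values; it gives the same column order without modifying the frame in place.

```python
import pandas as pd

# Toy frame with the target in the middle (illustrative values)
toy = pd.DataFrame({"ENTIDAD_UM": [25, 14], "target": [2, 2], "SEXO": [2, 1]})

# New order: target first, then every other column in its original order
ordered = ["target"] + [col for col in toy.columns if col != "target"]
toy = toy[ordered]

print(list(toy.columns))
```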


## 5. Checking target incidence

<p>We want to predict whether or not a person with a certain clinical profile will test positive for COVID-19 by real-time PCR. The model for this is a binary classifier, meaning that there are only 2 possible outcomes:</p>
<ul>
<li><code>0</code> - the lab result is COVID-19 negative</li>
<li><code>1</code> - the lab result is COVID-19 positive</li>
</ul>
<p>This is the convention for a binary classifier, but we know that the actual numeric value entered in the database is coded as <code>1=POSITIVE</code>, <code>2=NEGATIVE</code> and <code>3=IN PROGRESS</code>, so let's check this:</p>

In [24]:
print(df['target'].unique())

['0' '1']


Since the data was already pre-processed, we don't have to exclude lab results that are in progress. If that weren't the case, we could exclude those observations like this:

In [25]:
# Nothing should change if we do this
df = df[df.target != 3]

In any case, it's good to follow the convention for binary classifiers, so let's recode the data:

In [26]:
# Recode result into binary categories (0=negative, 1=positive)
# Only the value 2 needs recoding; 1 already means positive
df['target'] = df['target'].astype(str).str.replace('2', '0')
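String replacement works here because the codes are single digits, but it would also rewrite any `2` inside a longer value (for example, `12` would become `10`). A dictionary-based mapping avoids this and keeps the column as integers. The sketch below uses a toy series with made-up values; `CODE_TO_LABEL` is a name introduced only for this example.

```python
import pandas as pd

# Toy RESULTADO values: 1=POSITIVE, 2=NEGATIVE, 3=IN PROGRESS (illustrative)
resultado = pd.Series([1, 2, 2, 1, 3], name="RESULTADO")

# Map the database codes to the binary classifier convention
CODE_TO_LABEL = {1: 1, 2: 0}

# Drop pending results first, then recode the rest as integers
target = resultado[resultado != 3].map(CODE_TO_LABEL).astype(int)

print(target.tolist())
```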

<p>Target incidence is defined as the number of cases of each individual target value in a dataset, that is, how many 0s there are in the target column compared to how many 1s. Target incidence gives us an idea of how balanced (or imbalanced) our dataset is.</p>

In [27]:
# Print target incidence proportions, rounding output to 3 decimal places
df.target.value_counts(normalize=True).round(3)

0    0.644
1    0.356
Name: target, dtype: float64
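It can be useful to look at counts and proportions side by side, and to note what a trivial model would score. The sketch below does this on a toy target with made-up values; `majority_baseline` is a name chosen for this example and is the accuracy of always predicting the most frequent class.

```python
import pandas as pd

# Toy target: 9 negatives and 5 positives (illustrative values)
target = pd.Series([0] * 9 + [1] * 5, name="target")

# Counts and proportions of each class
counts = target.value_counts()
proportions = target.value_counts(normalize=True).round(3)
summary = pd.DataFrame({"count": counts, "proportion": proportions})

# Accuracy of a model that always predicts the most frequent class
majority_baseline = proportions.max()

print(summary)
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

With the proportions shown above, always predicting `0` would already be right about 64% of the time, so plain accuracy would be a weak yardstick for this dataset.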

## 6. Splitting the data into train and test datasets

<p>We'll now use the <code>train_test_split()</code> function to split the data. Target incidence informed us that in our dataset <code>0</code>s appear 64% of the time. We want to keep the same structure in the train and test datasets, i.e., both datasets must have a <code>0</code> target incidence of 64%. This is easy to do with <code>train_test_split()</code> from the <code>scikit-learn</code> library: all we need to do is specify the <code>stratify</code> parameter. In our case, we'll stratify on the <code>target</code> column.</p>

In [20]:
# First let's assign another variable name to the data
clinical = df.copy()

In [21]:
# Import train_test_split method
from sklearn.model_selection import train_test_split

# Split the clinical DataFrame into
# X_train, X_test, y_train and y_test datasets,
# stratifying on the `target` column
X_train, X_test, y_train, y_test = train_test_split(
    clinical.drop(columns='target'),
    clinical.target,
    test_size=0.25,
    random_state=42,
    stratify=clinical['target']
)

# Print out the first 2 rows of X_train
X_train.head(2)

Unnamed: 0,ENTIDAD_UM,ENTIDAD_RES,DELAY,ENTIDAD_REGISTRO,ORIGEN,SECTOR,SEXO,ENTIDAD_NAC,MUNICIPIO_RES,TIPO_PACIENTE,...,ASMA,INMUSUPR,HIPERTENSION,OTRA_COM,CARDIOVASCULAR,OBESIDAD,RENAL_CRONICA,TABAQUISMO,OTRO_CASO,UCI
167444,12,12,0,12,2,4,1,12,1.0,1,...,2,2,2,2,2,2,2,2,99,97
90061,2,2,0,2,2,99,1,2,1.0,1,...,2,2,2,2,2,2,2,2,99,97
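To confirm that stratification did its job, we can compare the share of positives in the two splits. This is a minimal sketch on toy data (64 negatives and 36 positives, chosen to resemble our incidence); the column name `feature` is made up for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 64 negatives and 36 positives (illustrative)
toy = pd.DataFrame({"feature": range(100), "target": [0] * 64 + [1] * 36})

X_train, X_test, y_train, y_test = train_test_split(
    toy.drop(columns="target"),
    toy["target"],
    test_size=0.25,
    random_state=42,
    stratify=toy["target"],
)

# Share of positives in each split; with stratify they should match
train_share = y_train.mean()
test_share = y_test.mean()

print(f"train: {train_share:.3f}, test: {test_share:.3f}")
```

On the real data the same check would be `y_train.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`.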


## 7. Selecting model using TPOT

<p><a href="https://github.com/EpistasisLab/tpot">TPOT</a> is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.</p>
<p><img src="https://assets.datacamp.com/production/project_646/img/tpot-ml-pipeline.png" alt="TPOT Machine Learning Pipeline"></p>
<p>TPOT will automatically explore hundreds of possible pipelines to find the best one for our dataset. Note, the outcome of this search will be a <a href="https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html">scikit-learn pipeline</a>, meaning it will include any pre-processing steps as well as the model.</p>
<p>We are using TPOT to help us zero in on one model that we can then explore and optimize further.</p>
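To make the idea of a scikit-learn pipeline concrete, here is a small hand-built one on synthetic data. This is not what TPOT produces for our dataset; the step names `scaler` and `model` and the choice of `StandardScaler` are assumptions made only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the clinical features
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

# A pipeline chains pre-processing steps and a final model into one estimator
pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", GaussianNB()),
])
pipeline.fit(X, y)

# Each step is a (name, estimator) pair, which is what we iterate over later
step_names = [name for name, _ in pipeline.steps]
predictions = pipeline.predict(X)

print(step_names)
```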

In [22]:
#!pip3 install tpot

In [23]:
# Import TPOTClassifier and roc_auc_score
from tpot import TPOTClassifier
from sklearn.metrics import roc_auc_score

# Instantiate TPOTClassifier
tpot = TPOTClassifier(
    generations=5,
    population_size=20,
    verbosity=2,
    scoring='roc_auc',
    random_state=42,
    disable_update_check=True,
    config_dict='TPOT light'
)
tpot.fit(X_train, y_train)

# AUC score for tpot model
tpot_auc_score = roc_auc_score(y_test, tpot.predict_proba(X_test)[:, 1])
print(f'\nAUC score: {tpot_auc_score:.4f}')

# Print best pipeline steps
print('\nBest pipeline steps:', end='\n')
for idx, (name, transform) in enumerate(tpot.fitted_pipeline_.steps, start=1):
    # Print idx and transform
    print(f'{idx}. {transform}')

TPOT closed during evaluation in one generation.


TPOT closed prematurely. Will use the current best pipeline.

Best pipeline: GaussianNB(input_matrix)

AUC score: 0.6521

Best pipeline steps:
1. GaussianNB()
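The AUC above is computed from the second column of `predict_proba()`, which holds the predicted probability of the positive class. The sketch below reproduces that step with a plain `GaussianNB` on synthetic data, so the score it prints says nothing about our clinical dataset. It also looks up the positive column from `classes_` instead of hard-coding it.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data standing in for the clinical features
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

model = GaussianNB().fit(X_train, y_train)

# predict_proba() has one column per class, ordered as in model.classes_
positive_column = list(model.classes_).index(1)
proba_positive = model.predict_proba(X_test)[:, positive_column]

# ROC AUC: 0.5 is chance level, 1.0 is a perfect ranking
auc = roc_auc_score(y_test, proba_positive)
print(f"AUC score: {auc:.4f}")
```

In our notebook the target values are the strings `'0'` and `'1'`, so the lookup there would be `list(tpot.fitted_pipeline_.classes_).index('1')`.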
