In [None]:
import data_utils

dataset = data_utils.get_dataset()
display(dataset.head())
dataset.shape

## Special case: AFEC_DPTO

The same state can appear under several similar names. Because there are relatively few states, these values can be fixed manually to avoid duplicate class values.

Some values are written in different ways. For example, 'ARCHIPIELAGO DE SAN ANDRES, PROVIDENCIA Y SANTA CATALINA', 'SAN ANDRES' and 'SAN ANDRÉS' all refer to the same state, as do 'BOGOTA D.C' and 'BOGOTA D.C.'.
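
A minimal sketch of what such a manual fix might look like, assuming a hand-written table that maps each variant spelling to one canonical name. The helper `normalize_dpto`, the table `DPTO_ALIASES` and the choice of canonical names are hypothetical; the actual logic lives in `data_utils.clean_afec_dpto`, which is not shown in this notebook.

```python
import pandas as pd

# Hypothetical alias table: variant spelling -> canonical name.
# The canonical names chosen here are an assumption for illustration.
DPTO_ALIASES = {
    "ARCHIPIELAGO DE SAN ANDRES, PROVIDENCIA Y SANTA CATALINA": "SAN ANDRES",
    "SAN ANDRÉS": "SAN ANDRES",
    "BOGOTA D.C.": "BOGOTA D.C",
}


def normalize_dpto(df: pd.DataFrame, column: str = "AFEC_DPTO") -> pd.DataFrame:
    """Return a copy of df where known variant spellings are replaced."""
    out = df.copy()
    # Strip stray whitespace, then replace exact matches found in the table.
    out[column] = out[column].str.strip().replace(DPTO_ALIASES)
    return out


sample = pd.DataFrame(
    {"AFEC_DPTO": ["SAN ANDRÉS", "BOGOTA D.C.", "BOGOTA D.C", "ANTIOQUIA"]}
)
cleaned = normalize_dpto(sample)
print(sorted(cleaned["AFEC_DPTO"].unique()))
```

An explicit table keeps every merge decision visible and reviewable, which is practical only because the number of states is small.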

In [None]:
dataset[['AFEC_DPTO']].drop_duplicates().sort_values(by=['AFEC_DPTO'])

In [None]:
dataset = data_utils.clean_afec_dpto(dataset)

dataset[['AFEC_DPTO']].drop_duplicates().sort_values(by=['AFEC_DPTO'])

### RIESGO_VIDA


In [None]:

riesgo_vida = dataset['RIESGO_VIDA'].value_counts()
riesgo_vida.plot(kind='bar', title='Patients with life at risk.');

We remove rows with missing information in the target column.
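
As a rough sketch, dropping those rows could look like the following. It assumes that a missing target is stored either as NaN or as the placeholder string `'0'`; the helper `drop_missing_target` and the sample labels `'SI'` and `'NO'` are hypothetical, and `data_utils.clean_riesgo_vida` may work differently.

```python
import pandas as pd


def drop_missing_target(df, target="RIESGO_VIDA", placeholders=("0",)):
    """Return df without the rows whose target is NaN or a placeholder."""
    # Assumption: missing targets are NaN or one of the placeholder strings.
    missing = df[target].isna() | df[target].isin(list(placeholders))
    return df.loc[~missing].reset_index(drop=True)


# Hypothetical sample; the label values are illustrative only.
sample = pd.DataFrame({"RIESGO_VIDA": ["SI", "NO", "0", None, "NO"]})
kept = drop_missing_target(sample)
print(len(sample), "->", len(kept))
```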

In [None]:
dataset = data_utils.clean_riesgo_vida(dataset)

riesgo_vida = dataset['RIESGO_VIDA'].value_counts()
riesgo_vida.plot(kind='bar', title='Patients with life at risk.');

### CIE_10

In the 'Data understanding' notebook we saw that CIE_10 has far too many missing values. '0' is the most common value in the column, so it is not a good candidate for imputation. However, because the column describes the patient's illness, we want to keep it: it can provide a signal for predicting whether the patient's life is at risk.

In [None]:
dataset = data_utils.clean_cie_10(dataset)
dataset.shape

In [None]:
riesgo_vida = dataset['RIESGO_VIDA'].value_counts()
riesgo_vida.plot(kind='bar', title='Patients with life at risk.');

Removing records with CIE_10 = 0 drastically reduces the dataset, from 2,375,371 to 281,311 records, but it greatly improves the balance of the target.
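
The effect on balance can be checked numerically rather than only from the bar plots. The sketch below compares class proportions before and after the filter on a small made-up frame; the helper `target_balance`, the toy rows and the labels `'SI'` and `'NO'` are illustrative assumptions and do not reproduce the real dataset.

```python
import pandas as pd


def target_balance(df, target="RIESGO_VIDA"):
    """Return the share of each target class as a fraction of all rows."""
    return df[target].value_counts(normalize=True)


# Toy data for illustration only; these are not rows from the real dataset.
toy = pd.DataFrame(
    {
        "CIE_10": ["0", "0", "0", "0", "0", "0", "A09", "J45", "I10", "E11"],
        "RIESGO_VIDA": ["NO", "NO", "NO", "NO", "NO", "SI", "SI", "NO", "SI", "NO"],
    }
)

before = target_balance(toy)
after = target_balance(toy[toy["CIE_10"] != "0"])
print(before)
print(after)
```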

## Removing fields

According to the official documentation, the fields "IDRANGOEDADES", "ID_MES" and "PQR_GRUPOALERTA" have no statistical use, so they are removed from the dataset.

The feature "PQR_ESTADO" carries significant statistical weight that may bias the model. Once a PQRS enters the system, it goes through a series of states before the case is closed. Historically, patients with life at risk may tend toward a certain state, or the state may be related to another feature (e.g., cases of patients with life at risk may mostly be closed because they have priority over other cases). Including "PQR_ESTADO" would therefore make the model rely on a feature that is not statistically relevant for a new PQRS: when a new PQRS enters the system, it has a default state that is very unlikely to match the final state recorded in the original dataset.

### Redundant features
These features represent the same data, so we can keep only the codes and drop the descriptions.

* COD_MACROMOT = MACROMOTIVO
* COD_MOTGEN = MOTIVO_GENERAL
* COD_MOTESP = MOTIVO_ESPECIFICO
* ENT_COD_DEPTO = ENT_DPTO
* ENT_COD_MPIO = ENT_MPIO
* PET_COD_DEPTO = PET_DPTO
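
Before dropping a description column, it may be worth confirming that it really is redundant with its code. A minimal sketch, assuming redundancy means a one-to-one mapping between code and description; the helper `is_redundant_pair` and the toy values are hypothetical and are not part of `data_utils`.

```python
import pandas as pd


def is_redundant_pair(df, code_col, desc_col):
    """True if each code maps to exactly one description and vice versa."""
    pairs = df[[code_col, desc_col]].drop_duplicates()
    return bool(pairs[code_col].is_unique and pairs[desc_col].is_unique)


# Toy frame for illustration; the codes and descriptions are made up.
toy = pd.DataFrame(
    {
        "COD_MACROMOT": [1, 1, 2],
        "MACROMOTIVO": ["A", "A", "B"],
    }
)

redundant = is_redundant_pair(toy, "COD_MACROMOT", "MACROMOTIVO")
# Drop the description only when the check passes.
slim = toy.drop(columns=["MACROMOTIVO"]) if redundant else toy
print(redundant, list(slim.columns))
```

If the check fails for a pair, the description holds information the code does not, and dropping it would lose data.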

In [None]:
dataset = data_utils.remove_features(dataset)

display(dataset.head(n = 5))
dataset.shape

### Imputing Values

In [None]:
# Columns that contain at least one '0' (string) value
col_zero_values = set(dataset.columns[dataset.eq('0').mean() > 0])
print(len(col_zero_values))
print(col_zero_values)

In [None]:
dataset = data_utils.impute_values(dataset)
dataset.to_csv("datasets/dataset_clean.csv", index = False)