# Preprocessing data 2 (hospital2.xlsx)

In [18]:
import pandas as pd

We load the dataset.

In [19]:
data2 = pd.read_excel("hospital2.xlsx")

First, let's look at the data as we receive it.

In [20]:
data2.head()

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,admission_date,fever_temperature,oxygen_saturation,history_of_fever,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
0,,,,,,NaT,NaT,,,,...,0,0,0,0,0,0,0,0,0,
1,88567155.0,45.0,T.C.,52.0,E=male K=female,2021-03-01 00:00:00,2021-03-01 00:00:00,37.3,-1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
2,36069621.0,181.0,T.C.,47.0,K,2021-03-01 08:38:00,2021-03-01 08:38:00,38.0,95.0,1.0,...,0,0,0,0,0,0,0,0,0,positive
3,57644199.0,36.0,T.C.,36.0,K,2021-03-01 08:39:00,2021-03-01 08:39:00,37.5,88.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
4,81365404.0,32.0,T.C.,30.0,E,2021-03-01 09:25:00,2021-03-01 09:25:00,37.8,87.0,1.0,...,0,0,0,0,0,0,0,0,0,positive


The first row is made up of null values in every patient field, so we drop it: it carries no information.

In [21]:
data2.drop(index=0, inplace=True)

## Redundant columns?

As with the previous dataset, let's check whether there are redundant columns.

Here, the columns "date_of_first_symptoms" and "admission_date" appear to hold the same values. Let's verify it.

In [22]:
data2.loc[(data2['date_of_first_symptoms'] != data2['admission_date'])]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,admission_date,fever_temperature,oxygen_saturation,history_of_fever,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
12735,,,,,,NaT,NaT,,,,...,0,0,0,0,0,0,0,0,0,positive
12736,,,,,,NaT,NaT,,,,...,0,0,0,0,0,0,0,0,0,positive


The only rows that differ are those in which almost all values are null: both dates are NaT there, and NaT never compares equal to NaT.
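Why the null rows show up in an inequality filter can be illustrated with a small sketch. The toy frame and its values are hypothetical; only the column names come from the dataset.

```python
import pandas as pd

# Toy frame (hypothetical values): one complete row and one row with missing dates.
df = pd.DataFrame({
    "date_of_first_symptoms": pd.to_datetime(["2021-03-01 08:38", None]),
    "admission_date": pd.to_datetime(["2021-03-01 08:38", None]),
})

# NaT != NaT evaluates to True, so the all-missing row is reported as "different".
naive_diff = df["date_of_first_symptoms"] != df["admission_date"]

# Restricting to rows where both dates are present removes that false positive.
both_present = df[["date_of_first_symptoms", "admission_date"]].notna().all(axis=1)
real_diff = naive_diff & both_present

print(naive_diff.tolist())  # [False, True]
print(real_diff.tolist())   # [False, False]
```

With the extra `notna()` mask, an empty result would directly confirm that the two columns agree wherever both are filled in.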

Therefore, as before, we drop the admission_date column because it is redundant, and afterwards we inspect these null rows.

In [23]:
data2.drop(['admission_date'], axis=1, inplace=True)

As for the rows, we have already found 3 in which most values are null. This is anomalous, so let's check whether there are more records like these:

In [24]:
data2[data2.isna().sum(axis=1) > 2]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
12735,,,,,,NaT,,,,0.0,...,0,0,0,0,0,0,0,0,0,positive
12736,,,,,,NaT,,,,0.0,...,0,0,0,0,0,0,0,0,0,positive


In [25]:
indices = data2[data2.isna().sum(axis=1) > 2].index.to_list()
data2.drop(indices, inplace=True)
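As a rough sketch of the row filter used above, the following toy example counts nulls per row and drops the rows that exceed the threshold. The frame and its values are hypothetical; the threshold of 2 is the one used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is almost empty, row 2 has a single missing value.
df = pd.DataFrame({
    "patient_id": [1.0, np.nan, 3.0],
    "age": [52.0, np.nan, 47.0],
    "sex": ["E", None, "K"],
    "fever_temperature": [37.3, np.nan, np.nan],
})

# Number of missing values in each row.
nulls_per_row = df.isna().sum(axis=1)

# Rows with more than 2 nulls are treated as unusable and dropped.
mostly_empty = df[nulls_per_row > 2]
cleaned = df.drop(mostly_empty.index)

print(nulls_per_row.tolist())  # [0, 4, 1]
print(cleaned.index.tolist())  # [0, 2]
```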

In [None]:
data2['patient_id'] = data2['patient_id'].astype(int)

In [26]:
data2[data2.isna().sum(axis=1) > 2]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result


### Inconsistencies in sex?

The first value of the "sex" column holds the description of what the column's codes mean (E=male K=female) instead of a single code, so we replace it. To do so, we first check whether there are other records with the same patient ID and look at the sex recorded there.

In [27]:
data2[data2['patient_id'] == data2['patient_id'].iloc[0]]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
1,88567155.0,45.0,T.C.,52.0,E=male K=female,2021-03-01 00:00:00,37.3,-1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
6177,88567155.0,48.0,T.C.,51.0,E,2021-09-19 18:48:00,37.2,95.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
12339,88567155.0,51.0,T.C.,53.0,E,2022-03-01 07:57:00,36.7,94.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


The other records show that this patient is a man (E), so we replace the value and then check whether there are more inconsistencies in this column.

In [28]:
data2.loc[1, 'sex'] = 'E'

In [29]:
data2['sex'].value_counts()

sex
K    7670
E    5064
Name: count, dtype: int64

In [30]:
data2.groupby('patient_id')['sex'].nunique().value_counts()

sex
1    9423
Name: count, dtype: int64

There are no inconsistencies in sex: every patient_id has exactly one distinct value.
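To show what this check would return if a conflict existed, here is a minimal sketch on hypothetical data in which one patient appears with two different codes.

```python
import pandas as pd

# Hypothetical records: patient 20 is recorded once as K and once as E.
df = pd.DataFrame({
    "patient_id": [10, 10, 20, 20, 30],
    "sex": ["E", "E", "K", "E", "K"],
})

# Count the distinct sex codes per patient; more than one means a conflict.
distinct_sex = df.groupby("patient_id")["sex"].nunique()
conflicting_ids = distinct_sex[distinct_sex > 1].index.tolist()

print(conflicting_ids)  # [20]
```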

### Inconsistencies in age?

Next, let's check whether there are age inconsistencies that we need to handle.

In [16]:
data2.sort_values(by='date_of_first_symptoms').iloc[[0, -1]]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
1,88567155.0,45.0,T.C.,52.0,E,2021-03-01 00:00:00,37.3,-1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
12734,55408811.0,182.0,T.C.,41.0,K,2022-03-13 17:23:00,37.4,98.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


The symptom dates span from 1 March 2021 to 13 March 2022, so a patient can have had at most two birthdays within the dataset. This can only happen for patients with 2 separate PCRs: the first between 1 and 13 March 2021 and the second between 1 and 13 March 2022, with the first date (MM/DD) earlier than the second. Let's check whether such cases exist.

In [39]:
patients_first_period = data2[data2['date_of_first_symptoms'] <= '2021-03-13']['patient_id']
patients_second_period = data2[data2['date_of_first_symptoms'] >= '2022-03-01']['patient_id']
list(set(patients_first_period) & set(patients_second_period))

[89537920.0, 22219078.0, 88567155.0, 81365404.0, 96305535.0]

- Patient 1:

In [45]:
data2[(data2['patient_id'] == 89537920.0) & ((data2['date_of_first_symptoms'] <= '2021-03-13') | (data2['date_of_first_symptoms'] >= '2022-03-01'))]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
61,89537920.0,12.0,T.C.,41.0,K,2021-03-03 15:38:00,36.5,90.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
12670,89537920.0,16.0,T.C.,41.0,K,2022-03-11 09:25:00,36.7,97.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,negative


- Patient 2:

In [46]:
data2[(data2['patient_id'] == 22219078.0) & ((data2['date_of_first_symptoms'] <= '2021-03-13') | (data2['date_of_first_symptoms'] >= '2022-03-01'))]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
205,22219078.0,4.0,T.C.,34.0,K,2021-03-12 08:35:00,38.3,89.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,positive
12500,22219078.0,7.0,T.C.,36.0,K,2022-03-06 09:50:00,36.8,97.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


- Patient 3:

In [47]:
data2[(data2['patient_id'] == 88567155.0) & ((data2['date_of_first_symptoms'] <= '2021-03-13') | (data2['date_of_first_symptoms'] >= '2022-03-01'))]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
1,88567155.0,45.0,T.C.,52.0,E,2021-03-01 00:00:00,37.3,-1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
12339,88567155.0,51.0,T.C.,53.0,E,2022-03-01 07:57:00,36.7,94.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


- Patient 4:

In [48]:
data2[(data2['patient_id'] == 81365404.0) & ((data2['date_of_first_symptoms'] <= '2021-03-13') | (data2['date_of_first_symptoms'] >= '2022-03-01'))]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
4,81365404.0,32.0,T.C.,30.0,E,2021-03-01 09:25:00,37.8,87.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,positive
12482,81365404.0,38.0,T.C.,30.0,E,2022-03-05 13:13:00,36.7,95.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


- Patient 5:

In [49]:
data2[(data2['patient_id'] == 96305535.0) & ((data2['date_of_first_symptoms'] <= '2021-03-13') | (data2['date_of_first_symptoms'] >= '2022-03-01'))]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
194,96305535.0,13.0,T.C.,31.0,E,2021-03-11 13:29:00,38.4,95.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
12345,96305535.0,16.0,T.C.,31.0,E,2022-03-01 08:58:00,37.2,96.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


Of the 5 patients, only 3 meet the conditions we required: 89537920.0, 88567155.0 and 81365404.0. These 3 patients have therefore certainly had a birthday between one PCR and the other. We will handle them separately.
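The manual comparison of the five patients can also be sketched in code. The dates below are copied from the tables above; the helper `at_least_one_year_apart` is a hypothetical name introduced only for this illustration.

```python
import pandas as pd

def at_least_one_year_apart(first, second):
    """Return True if `second` is at least one full calendar year after `first`."""
    first, second = pd.Timestamp(first), pd.Timestamp(second)
    return second >= first + pd.DateOffset(years=1)

# (first PCR, last PCR) for the five candidate patients, taken from the outputs above.
pairs = {
    89537920: ("2021-03-03 15:38", "2022-03-11 09:25"),
    22219078: ("2021-03-12 08:35", "2022-03-06 09:50"),
    88567155: ("2021-03-01 00:00", "2022-03-01 07:57"),
    81365404: ("2021-03-01 09:25", "2022-03-05 13:13"),
    96305535: ("2021-03-11 13:29", "2022-03-01 08:58"),
}

confirmed = [pid for pid, (a, b) in pairs.items() if at_least_one_year_apart(a, b)]
print(confirmed)  # [89537920, 88567155, 81365404]
```

A full year between two tests guarantees at least one birthday in between, which is why only these three IDs need special treatment.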

As with the previous dataset, let's see whether there are age inconsistencies and what they look like:

In [50]:
def age_incoherences_2(min_difference):
    age_diffs = data2.groupby('patient_id')['age'].agg(lambda x: x.max() - x.min())
    count = (age_diffs >= min_difference).sum()
    indexes = age_diffs[age_diffs >= min_difference].index
    return count, indexes

In [51]:
print("Differences of 1 or more years: ", age_incoherences_2(1)[0])
print("Differences of 2 or more years: ", age_incoherences_2(2)[0])
print("Differences of 3 or more years: ", age_incoherences_2(3)[0])

Differences of 1 or more years:  1642
Differences of 2 or more years:  793
Differences of 3 or more years:  159


This confirms that the age of a large number of patients, roughly 1,000, has been recorded incorrectly. Any of the solutions proposed earlier could be applied. As with the previous dataset, we choose solution A, which consists of the following:
The most basic solution is to assign the same age to every record of a patient who repeats the PCR. It could be the lowest age, the highest, the middle one (if the difference is 2), or the most frequent. This removes the age inconsistencies.

To do so, we reuse the function from the previous dataset. Specifically, we take the most frequent age among each patient's records.

We will use the patient whose sex we corrected as an example to check that the ages are modified correctly.

In [53]:
data2[data2['patient_id'] == data2['patient_id'].iloc[0]]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
1,88567155.0,45.0,T.C.,52.0,E,2021-03-01 00:00:00,37.3,-1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
6177,88567155.0,48.0,T.C.,51.0,E,2021-09-19 18:48:00,37.2,95.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
12339,88567155.0,51.0,T.C.,53.0,E,2022-03-01 07:57:00,36.7,94.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


In this case the age is clearly coded incorrectly: in March 2021 the patient is 52, while six months later the patient is 51. We therefore use the mode to set the values.

In [54]:
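def mode_or_min(series):
    # mode() is sorted ascending, so [0] is the mode or, on a tie, the smallest age.
    return series.mode()[0]
# Assign that single age to every record of the same patient.
data2['age'] = data2.groupby('patient_id')['age'].transform(mode_or_min)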

We check whether the values have been modified.

In [55]:
data2[data2['patient_id'] == data2['patient_id'].iloc[0]]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
1,88567155.0,45.0,T.C.,51.0,E,2021-03-01 00:00:00,37.3,-1.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
6177,88567155.0,48.0,T.C.,51.0,E,2021-09-19 18:48:00,37.2,95.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
12339,88567155.0,51.0,T.C.,51.0,E,2022-03-01 07:57:00,36.7,94.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


The data have been modified correctly.
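The result of 51 for this patient follows from how ties are resolved: the three recorded ages (52, 51, 53) are all equally frequent, and `Series.mode()` returns tied values in ascending order. A small sketch on hypothetical data, reusing `mode_or_min` as defined above:

```python
import pandas as pd

# Patient 1 has three different ages (a three-way tie); patient 2 has a clear mode.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "age": [52.0, 51.0, 53.0, 40.0, 41.0, 41.0],
})

def mode_or_min(series):
    # mode() returns the most frequent values sorted ascending;
    # taking [0] keeps the mode, or the smallest value when there is a tie.
    return series.mode()[0]

df["age_fixed"] = df.groupby("patient_id")["age"].transform(mode_or_min)

print(df["age_fixed"].tolist())  # [51.0, 51.0, 51.0, 41.0, 41.0, 41.0]
```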

We handle the inconsistencies in the 3 exceptional cases manually. Let's look at the patient with ID 89537920 as an example.

In [56]:
data2[data2['patient_id'] == 89537920]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
61,89537920.0,12.0,T.C.,41.0,K,2021-03-03 15:38:00,36.5,90.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
6753,89537920.0,15.0,T.C.,41.0,K,2021-10-04 08:19:00,38.1,97.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,negative
12670,89537920.0,16.0,T.C.,41.0,K,2022-03-11 09:25:00,36.7,97.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,negative


Right now there is an inconsistency: at the third PCR the patient should be at least one year older than at the first. We apply this correction to the 3 cases:

In [63]:
data2.loc[(data2['patient_id'] == 89537920) & (data2['date_of_first_symptoms'] == '2022-03-11 09:25:00'), 'age'] = 42
data2.loc[(data2['patient_id'] == 88567155) & (data2['date_of_first_symptoms'] == '2022-03-01 07:57:00'), 'age'] = 52
data2.loc[(data2['patient_id'] == 81365404) & (data2['date_of_first_symptoms'] == '2022-03-05 13:13:00'), 'age'] = 31

In [65]:
data2[data2['patient_id'] == 89537920]

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
61,89537920.0,12.0,T.C.,41.0,K,2021-03-03 15:38:00,36.5,90.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
6753,89537920.0,15.0,T.C.,41.0,K,2021-10-04 08:19:00,38.1,97.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,negative
12670,89537920.0,16.0,T.C.,42.0,K,2022-03-11 09:25:00,36.7,97.0,0.0,1.0,...,0,0,0,0,0,0,0,0,0,negative


Now, finally, there are no inconsistencies left in age.
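One way to double-check a statement like this, offered here only as a sketch and not as a step of the original notebook, is to verify that no patient's age decreases over time. The toy data below is hypothetical; patient 2 reproduces the 52 to 51 drop seen earlier.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2],
    "date_of_first_symptoms": pd.to_datetime(
        ["2021-03-03", "2021-10-04", "2022-03-11", "2021-03-01", "2021-09-19"]
    ),
    "age": [41, 41, 42, 52, 51],
})

# Order each patient's records chronologically and compute the age change
# between consecutive records of the same patient.
ordered = df.sort_values(["patient_id", "date_of_first_symptoms"])
age_step = ordered.groupby("patient_id")["age"].diff()

# A negative step means the patient got "younger" between two tests.
decreasing_ids = ordered.loc[age_step < 0, "patient_id"].unique().tolist()

print(decreasing_ids)  # [2]
```

An empty list on the real data would support the conclusion that ages are now consistent with the dates.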

### Inconsistencies in nationality?

To find inconsistencies here, we check whether there are records that share the same ID but have a different nationality. In our data this column is called "country_of_residence".

In [None]:
data2.groupby('patient_id')['country_of_residence'].nunique().value_counts()

country_of_residence
1    9394
2      29
Name: count, dtype: int64

We can state with certainty that 29 IDs have an inconsistency in nationality.

In [None]:
ids_of_nationality_incoherences = data2.groupby('patient_id')['country_of_residence'].nunique().loc[lambda x: x > 1].index

In [None]:
data2.loc[data2['patient_id'].isin(ids_of_nationality_incoherences)].sort_values(by='patient_id')

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
753,1139027,8.0,Azerbaijan,17.0,K,2021-03-29 15:37:00,38.3,79.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
6381,1139027,11.0,T.C.,17.0,K,2021-09-24 00:31:00,36.9,98.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
5952,1331997,145.0,Ireland,36.0,K,2021-09-14 08:54:00,36.3,95.0,1.0,0.0,...,0,0,0,0,0,0,0,1,0,positive
12314,1331997,151.0,T.C.,36.0,K,2022-02-28 13:12:00,37.4,96.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
7708,1331997,149.0,T.C.,36.0,K,2021-11-01 09:43:00,39.0,96.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,negative
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
803,91145408,8.0,T.C.,21.0,E,2021-03-30 11:21:00,37.9,97.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
10201,99454396,6.0,T.C.,23.0,K,2022-01-12 14:52:00,37.0,97.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,positive
2555,99454396,3.0,Pakistan,23.0,K,2021-05-04 11:59:00,38.7,89.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,positive
12423,99454396,7.0,T.C.,23.0,K,2022-03-03 11:00:00,38.1,97.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


The same pattern repeats in every case: one of the patient's records has a nationality other than T.C., while the rest of the records have T.C., which corresponds to Turkish nationality. To keep things simple, we assume these patients are Turkish and change the values that say otherwise.

As discussed for the previous dataset, this assumption may be wrong, but we expect nationality not to be a very relevant factor in the prediction, and this simplification avoids possible inconsistencies later on.

In [None]:
data2.loc[data2['patient_id'].isin(ids_of_nationality_incoherences), 'country_of_residence'] = "T.C."

In [None]:
data2.loc[data2['patient_id'].isin(ids_of_nationality_incoherences)].sort_values(by='patient_id')

Unnamed: 0,patient_id,admission_id,country_of_residence,age,sex,date_of_first_symptoms,fever_temperature,oxygen_saturation,history_of_fever,cough,...,chronic_hematologic_disease,AIDS_HIV,diabetes_mellitus_type_1,diabetes_mellitus_type_2,rheumatologic_disorder,dementia,tuberculosis,smoking,other_risks,PCR_result
753,1139027,8.0,T.C.,17.0,K,2021-03-29 15:37:00,38.3,79.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
6381,1139027,11.0,T.C.,17.0,K,2021-09-24 00:31:00,36.9,98.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
5952,1331997,145.0,T.C.,36.0,K,2021-09-14 08:54:00,36.3,95.0,1.0,0.0,...,0,0,0,0,0,0,0,1,0,positive
12314,1331997,151.0,T.C.,36.0,K,2022-02-28 13:12:00,37.4,96.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
7708,1331997,149.0,T.C.,36.0,K,2021-11-01 09:43:00,39.0,96.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,negative
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
803,91145408,8.0,T.C.,21.0,E,2021-03-30 11:21:00,37.9,97.0,1.0,0.0,...,0,0,0,0,0,0,0,0,0,positive
10201,99454396,6.0,T.C.,23.0,K,2022-01-12 14:52:00,37.0,97.0,1.0,1.0,...,0,0,0,0,0,0,0,0,0,positive
2555,99454396,3.0,T.C.,23.0,K,2021-05-04 11:59:00,38.7,89.0,0.0,0.0,...,0,0,0,0,0,0,0,1,0,positive
12423,99454396,7.0,T.C.,23.0,K,2022-03-03 11:00:00,38.1,97.0,0.0,0.0,...,0,0,0,0,0,0,0,0,0,positive


The nationalities that were inconsistent have now been set to T.C.
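The detection and replacement steps can be seen end to end in the following sketch. The frame is hypothetical (only the first ID and the country names echo the output above); it also shows that a patient with a single, non-Turkish record is left untouched because there is no conflict to resolve.

```python
import pandas as pd

df = pd.DataFrame({
    "patient_id": [1139027, 1139027, 555, 777],
    "country_of_residence": ["Azerbaijan", "T.C.", "Ireland", "T.C."],
})

# Patients whose records disagree on the country.
n_countries = df.groupby("patient_id")["country_of_residence"].nunique()
conflicting = n_countries[n_countries > 1].index

# Simplifying assumption from the text: conflicting patients are set to T.C.
df.loc[df["patient_id"].isin(conflicting), "country_of_residence"] = "T.C."

print(df["country_of_residence"].tolist())  # ['T.C.', 'T.C.', 'Ireland', 'T.C.']
```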

## Column transformation and null handling

First, we define the columns that will be converted to numeric values. In this dataset they are "country_of_residence", "sex" and "PCR_result".

The columns that contain nulls are the following:

In [None]:
any_null_cols_2 = data2.columns[data2.isnull().any()]
data2[any_null_cols_2].isnull().sum()

fever_temperature    1219
oxygen_saturation       4
history_of_fever        5
bleeding               36
other_symptoms         36
PCR_result             33
dtype: int64

#### PCR_result

We start with the variable we want to predict, the PCR test result. As noted above, it must be converted to numeric values, but first we need to decide what to do with the nulls.

Because this is the target variable, imputing its missing values would make no sense, so we drop those rows.

In [None]:
data2.dropna(subset=['PCR_result'], inplace=True)
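As a brief illustration of what `dropna(subset=...)` does, the sketch below (hypothetical data) removes only the rows whose target is missing and keeps rows that have nulls in other columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "fever_temperature": [37.3, np.nan, 38.0],
    "PCR_result": ["positive", "negative", None],
})

# Only rows with a missing PCR_result are removed; the row with a missing
# temperature but a known result is kept for later imputation.
labelled = df.dropna(subset=["PCR_result"])

print(labelled.index.tolist())  # [0, 1]
```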

Now we transform the column, using a LabelEncoder() from scikit-learn.

In [None]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
PCR_result_encoder = LabelEncoder()

In [None]:
data2['PCR_result'] = PCR_result_encoder.fit_transform(data2['PCR_result'])

data2['PCR_result'].value_counts()

PCR_result
1    9776
0    2925
Name: count, dtype: int64
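To read these counts it helps to know which label each integer stands for. `LabelEncoder` assigns codes in the sorted order of the labels, so, assuming the column only contains the strings "negative" and "positive" as seen in the tables above, 0 would be negative and 1 positive. A minimal sketch on hypothetical values:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

labels = pd.Series(["positive", "negative", "positive"])

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ lists the labels in the order of their integer codes.
print(list(encoder.classes_))                      # ['negative', 'positive']
print(codes.tolist())                              # [1, 0, 1]
print(encoder.inverse_transform([0, 1]).tolist())  # ['negative', 'positive']
```

On the real data, inspecting `PCR_result_encoder.classes_` after fitting gives the actual mapping.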