# COVID-19 Mortality Risk Analysis
## Research Questions
1️⃣ Which risk factors had the greatest impact on mortality between January 2021 and June 2022?

2️⃣ Was COVID-19 the main cause of death during this period compared to other causes?

By: Miguel Trujillo Lopez

## Dataset Scope

This notebook continues the data preparation process after an initial overview.
Because of the size of the original dataset (10M+ records), only variables relevant to the following topics were retained for further analysis:

1. Risk factors associated with mortality
2. Cause of death related to COVID-19

The dataset covers the period from **January 2021 to June 2022**.



### Importing Libraries

In [1]:
import pandas as pd
import numpy as np

pd.set_option("display.max_columns", None)
pd.set_option("display.width", 180)

### Loading Dataset

In [2]:
DATA_PATH = "COVID19MEXICO2021_final.parquet"

df_raw = pd.read_parquet(DATA_PATH)

print("Shape:", df_raw.shape)
df_raw.head()


Shape: (14167460, 18)


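| | SEXO | TIPO_PACIENTE | FECHA_DEF | INTUBADO | NEUMONIA | EDAD | DIABETES | EPOC | ASMA | INMUSUPR | HIPERTENSION | CARDIOVASCULAR | OBESIDAD | RENAL_CRONICA | TABAQUISMO | CLASIFICACION_FINAL | UCI | DEATH |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1 | NaT | 97 | 2 | 26 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 7 | 97 | 0 |
| 1 | 1 | 1 | NaT | 97 | 99 | 34 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 7 | 97 | 0 |
| 2 | 2 | 1 | NaT | 97 | 2 | 41 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 7 | 97 | 0 |
| 3 | 2 | 1 | NaT | 97 | 2 | 25 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 7 | 97 | 0 |
| 4 | 1 | 1 | NaT | 97 | 2 | 20 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 7 | 97 | 0 |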


In [3]:
df = df_raw.copy()
del df_raw

### Verifying Columns

In [4]:
expected_cols = [
    "SEXO","TIPO_PACIENTE","FECHA_DEF","INTUBADO","NEUMONIA","EDAD","DIABETES",
    "EPOC","ASMA","INMUSUPR","HIPERTENSION","CARDIOVASCULAR","OBESIDAD",
    "RENAL_CRONICA","TABAQUISMO","CLASIFICACION_FINAL","UCI","DEATH"
]

missing = [c for c in expected_cols if c not in df.columns]
extra = [c for c in df.columns if c not in expected_cols]

print("Missing columns:", missing)
print("Extra columns:", extra)
print("Shape:", df.shape)


Missing columns: []
Extra columns: []
Shape: (14167460, 18)
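
The schema check above can be wrapped in a small reusable helper. This is a minimal sketch on a toy frame; the function name `check_schema` and the extra duplicate-name check are illustrative additions, not part of the original notebook.

```python
import pandas as pd


def check_schema(df: pd.DataFrame, expected: list) -> dict:
    """Report missing, unexpected, and duplicated column names."""
    actual = list(df.columns)
    return {
        "missing": [c for c in expected if c not in actual],
        "extra": [c for c in actual if c not in expected],
        "duplicated": sorted({c for c in actual if actual.count(c) > 1}),
    }


# Toy frame (hypothetical): one expected column is absent, one is unexpected.
demo = pd.DataFrame({"SEXO": [1, 2], "EDAD": [26, 34], "FOO": [0, 1]})
report = check_schema(demo, ["SEXO", "EDAD", "DEATH"])
print(report)
```

Returning a dictionary instead of printing makes it easy to stop the notebook early, for example with an `assert not report["missing"]`.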


In [5]:
df.info()
df.isna().mean().sort_values(ascending=False).head(12)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14167460 entries, 0 to 14167459
Data columns (total 18 columns):
 #   Column               Dtype         
---  ------               -----         
 0   SEXO                 Int16         
 1   TIPO_PACIENTE        Int16         
 2   FECHA_DEF            datetime64[ns]
 3   INTUBADO             Int16         
 4   NEUMONIA             Int16         
 5   EDAD                 Int16         
 6   DIABETES             Int16         
 7   EPOC                 Int16         
 8   ASMA                 Int16         
 9   INMUSUPR             Int16         
 10  HIPERTENSION         Int16         
 11  CARDIOVASCULAR       Int16         
 12  OBESIDAD             Int16         
 13  RENAL_CRONICA        Int16         
 14  TABAQUISMO           Int16         
 15  CLASIFICACION_FINAL  Int16         
 16  UCI                  Int16         
 17  DEATH                int8          
dtypes: Int16(16), datetime64[ns](1), int8(1)
memory usage: 770.1

| Column | Null fraction |
|---|---|
| FECHA_DEF | 0.978244 |
| UCI | 8.2e-05 |
| CLASIFICACION_FINAL | 6.5e-05 |
| TABAQUISMO | 4.5e-05 |
| RENAL_CRONICA | 4.2e-05 |
| OBESIDAD | 3.9e-05 |
| CARDIOVASCULAR | 3.8e-05 |
| EPOC | 3.6e-05 |
| ASMA | 3.6e-05 |
| HIPERTENSION | 3.6e-05 |


### Null Values

In [6]:
null_counts = df.isna().sum().sort_values(ascending=False)
null_counts

| Column | Null count |
|---|---|
| FECHA_DEF | 13859230 |
| UCI | 1168 |
| CLASIFICACION_FINAL | 922 |
| TABAQUISMO | 638 |
| RENAL_CRONICA | 597 |
| OBESIDAD | 559 |
| CARDIOVASCULAR | 537 |
| EPOC | 510 |
| ASMA | 508 |
| HIPERTENSION | 505 |


In [7]:
null_percent = (df.isna().mean() * 100).sort_values(ascending=False)
null_percent


| Column | Null percent (%) |
|---|---|
| FECHA_DEF | 97.824381 |
| UCI | 0.008244 |
| CLASIFICACION_FINAL | 0.006508 |
| TABAQUISMO | 0.004503 |
| RENAL_CRONICA | 0.004214 |
| OBESIDAD | 0.003946 |
| CARDIOVASCULAR | 0.00379 |
| EPOC | 0.0036 |
| ASMA | 0.003586 |
| HIPERTENSION | 0.003565 |
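
The null counts and percentages can also be viewed side by side in one table. The sketch below uses a toy frame; the helper name `null_summary` and the rounding to four decimals are assumptions made for illustration.

```python
import pandas as pd


def null_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return null counts and percentages for columns with at least one null."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "null_count": counts,
        "null_percent": (counts / len(df) * 100).round(4),
    })
    # Keep only columns that actually contain nulls, largest first.
    return summary[summary["null_count"] > 0].sort_values(
        "null_count", ascending=False
    )


# Toy frame (hypothetical values) mimicking the notebook's dtypes.
demo = pd.DataFrame({
    "FECHA_DEF": pd.to_datetime([None, None, None, "2021-03-01"]),
    "UCI": pd.array([1, None, 2, 2], dtype="Int16"),
    "EDAD": pd.array([26, 34, 41, 25], dtype="Int16"),
})
summary = null_summary(demo)
print(summary)
```

A very high null share in `FECHA_DEF` is expected rather than a data-quality problem, since a date of death only exists for patients who died.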


In [8]:
df["FECHA_DEF"] = pd.to_datetime(df["FECHA_DEF"], errors="coerce")

if "DEATH" in df.columns:
    death_check = df["FECHA_DEF"].notna().astype(int)
    consistency = (df["DEATH"].astype("Int64") == death_check).mean()
    print("DEATH consistency vs FECHA_DEF:", consistency)
else:
    df["DEATH"] = df["FECHA_DEF"].notna().astype(int)


DEATH consistency vs FECHA_DEF: 1.0
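
A single consistency ratio does not show where disagreements occur when it is below 1.0. One way to inspect them, sketched on a toy frame with one deliberately inconsistent row (the data here is invented for illustration), is a cross-tabulation plus a filter on the mismatching rows:

```python
import pandas as pd

# Toy frame (hypothetical): the third row has DEATH = 1 but no death date.
demo = pd.DataFrame({
    "FECHA_DEF": pd.to_datetime([None, "2021-02-10", None, "2021-07-04"]),
    "DEATH": [0, 1, 1, 1],
})

has_date = demo["FECHA_DEF"].notna()

# Rows: DEATH flag; columns: whether a death date is present.
table = pd.crosstab(
    demo["DEATH"], has_date, rownames=["DEATH"], colnames=["has_FECHA_DEF"]
)
print(table)

# Rows where the flag and the date disagree.
mismatches = demo[demo["DEATH"].astype(bool) != has_date]
print(mismatches)
```

In a fully consistent dataset, as reported above, only the diagonal of the table is non-zero and `mismatches` is empty.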


In [9]:
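# 97/98/99 are special codes (not applicable / unknown / unspecified) in the coded columns.
# EDAD is excluded because it holds real ages, so 97, 98 and 99 are valid values there.
INVALID_VALUES = [97, 98, 99]
cols_to_clean = [c for c in df.columns if c not in ["DEATH", "FECHA_DEF", "EDAD"]]
df[cols_to_clean] = df[cols_to_clean].replace(INVALID_VALUES, np.nan)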
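
Because `EDAD` stores real ages, the values 97 to 99 in that column are legitimate and should not be treated as special codes. The sketch below shows the idea on a toy frame, assuming that only the coded columns use 97/98/99 as sentinels; the frame and the name `SENTINELS` are illustrative.

```python
import pandas as pd

SENTINELS = [97, 98, 99]

# Toy frame (hypothetical values) using the notebook's nullable integer dtype.
demo = pd.DataFrame({
    "EDAD": pd.array([26, 97, 99], dtype="Int16"),
    "UCI": pd.array([97, 2, 1], dtype="Int16"),
    "NEUMONIA": pd.array([2, 99, 1], dtype="Int16"),
})

# Only coded columns are cleaned; the age column is left untouched.
coded_cols = [c for c in demo.columns if c != "EDAD"]

cleaned = demo.copy()
for col in coded_cols:
    # mask() replaces entries where the condition is True with <NA>.
    cleaned[col] = cleaned[col].mask(cleaned[col].isin(SENTINELS))

print(cleaned)
```

Using `mask` with `isin` keeps the nullable `Int16` dtype, so no float conversion happens along the way.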


### Datatype Optimization

In [10]:
# Binary columns as Int8
binary_cols = [
    "INTUBADO","NEUMONIA","DIABETES","EPOC","ASMA","INMUSUPR",
    "HIPERTENSION","CARDIOVASCULAR","OBESIDAD","RENAL_CRONICA",
    "TABAQUISMO","UCI"
]
# Inspect the remaining codes first: values other than 1 (yes) and 2 (no) may appear
for col in binary_cols:
    print(col, df[col].dropna().unique()[:10])

# Map 1 -> 1 and 2 -> 0; any other code is absent from the mapping and becomes <NA>
for col in binary_cols:
    df[col] = df[col].map({1: 1, 2: 0}).astype("Int8")

# DEATH
df["DEATH"] = df["DEATH"].astype("Int8")

# EDAD as Int16
df["EDAD"] = pd.to_numeric(df["EDAD"], errors="coerce").astype("Int16")


INTUBADO <IntegerArray>
[2, 1, 56, 19, 6, 15, 12, 78, 13, 7]
Length: 10, dtype: Int16
NEUMONIA <IntegerArray>
[2, 1, 24, 12, 6, 18, 15, 9, 33, 41]
Length: 10, dtype: Int16
DIABETES <IntegerArray>
[2, 1, 26, 21, 7, 11, 29, 9, 241, 28]
Length: 10, dtype: Int16
EPOC <IntegerArray>
[2, 1, 21, 55, 46, 7, 12, 4, 29, 11]
Length: 10, dtype: Int16
ASMA <IntegerArray>
[2, 1, 21, 6, 3, 7, 16, 55, 29, 30]
Length: 10, dtype: Int16
INMUSUPR <IntegerArray>
[2, 1, 114, 7, 19, 3, 33, 14, 21, 30]
Length: 10, dtype: Int16
HIPERTENSION <IntegerArray>
[2, 1, 19, 7, 16, 12, 3, 22, 38, 86]
Length: 10, dtype: Int16
CARDIOVASCULAR <IntegerArray>
[2, 1, 42, 9, 3, 12, 89, 108, 29, 7]
Length: 10, dtype: Int16
OBESIDAD <IntegerArray>
[2, 1, 3, 22, -99, 9, 7, 53, 48, 4]
Length: 10, dtype: Int16
RENAL_CRONICA <IntegerArray>
[2, 1, 42, 45, 12, 22, 3, 9, 14, 27]
Length: 10, dtype: Int16
TABAQUISMO <IntegerArray>
[2, 1, 43, 30, 7, 41, 17, 14, 11, 21]
Length: 10, dtype: Int16
UCI <IntegerArray>
[2, 1, 29, 7, 26, 3, 15, 
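
The output above shows codes other than 1 and 2 in every binary column, and the mapping step turns all of them into missing values. Before mapping, it can be useful to count how many rows are affected. This is a minimal sketch on a toy frame; the helper `out_of_domain_counts` is a hypothetical name, not part of the original notebook.

```python
import pandas as pd

VALID_CODES = (1, 2)  # 1 = yes, 2 = no in the raw coding


def out_of_domain_counts(df, cols, valid=VALID_CODES):
    """Count non-null values per column that are not one of the valid codes."""
    valid_list = list(valid)
    return pd.Series(
        {c: int((df[c].notna() & ~df[c].isin(valid_list)).sum()) for c in cols},
        name="out_of_domain",
    )


# Toy frame (hypothetical values) with a few unexpected codes.
demo = pd.DataFrame({
    "DIABETES": pd.array([2, 1, 241, None], dtype="Int16"),
    "OBESIDAD": pd.array([2, -99, 3, 1], dtype="Int16"),
})
counts = out_of_domain_counts(demo, ["DIABETES", "OBESIDAD"])
print(counts)
```

Comparing these counts with the null counts after mapping shows how much of the final missingness comes from unexpected codes rather than from the 97/98/99 special codes.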

### Decoding Categorical Variables

In [11]:
df["SEXO"] = df["SEXO"].map({1: "Female", 2: "Male"}).astype("category")
df["TIPO_PACIENTE"] = df["TIPO_PACIENTE"].map({
    1: "Ambulatory",
    2: "Hospitalized"
}).astype("category")
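
The final column list in the next section mentions `AGE_GROUP`, but no cell shown here creates it, so the column filter silently skips it. If an age-group column is wanted, one option is `pd.cut`. The sketch below is only an illustration: the bin edges and labels are assumptions, not the author's definition.

```python
import pandas as pd

# Hypothetical bin edges and labels; adjust to the analysis at hand.
AGE_BINS = [0, 18, 40, 60, 80, 200]
AGE_LABELS = ["0-17", "18-39", "40-59", "60-79", "80+"]

# Toy frame (hypothetical values) using the notebook's nullable Int16 dtype.
demo = pd.DataFrame({"EDAD": pd.array([5, 18, 41, 79, 97, None], dtype="Int16")})

# right=False makes each bin include its left edge, e.g. [18, 40).
# Casting to float64 turns <NA> into NaN, which pd.cut leaves unassigned.
demo["AGE_GROUP"] = pd.cut(
    demo["EDAD"].astype("float64"),
    bins=AGE_BINS,
    labels=AGE_LABELS,
    right=False,
)
print(demo)
```

The result is a categorical column, which matches the dtype used for `SEXO` and `TIPO_PACIENTE`.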


### Final Order

In [12]:
final_columns = [
    "DEATH","SEXO","EDAD","AGE_GROUP","TIPO_PACIENTE",
    "UCI","INTUBADO","NEUMONIA",
    "DIABETES","HIPERTENSION","OBESIDAD","CARDIOVASCULAR",
    "RENAL_CRONICA","EPOC","ASMA","INMUSUPR","TABAQUISMO",
    "CLASIFICACION_FINAL","FECHA_DEF"
]

df = df[[c for c in final_columns if c in df.columns]]

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14167460 entries, 0 to 14167459
Data columns (total 18 columns):
 #   Column               Dtype         
---  ------               -----         
 0   DEATH                Int8          
 1   SEXO                 category      
 2   EDAD                 Int16         
 3   TIPO_PACIENTE        category      
 4   UCI                  Int8          
 5   INTUBADO             Int8          
 6   NEUMONIA             Int8          
 7   DIABETES             Int8          
 8   HIPERTENSION         Int8          
 9   OBESIDAD             Int8          
 10  CARDIOVASCULAR       Int8          
 11  RENAL_CRONICA        Int8          
 12  EPOC                 Int8          
 13  ASMA                 Int8          
 14  INMUSUPR             Int8          
 15  TABAQUISMO           Int8          
 16  CLASIFICACION_FINAL  Int16         
 17  FECHA_DEF            datetime64[ns]
dtypes: Int16(2), Int8(13), category(2), datetime64[ns](1)
memory

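| | DEATH | SEXO | EDAD | TIPO_PACIENTE | UCI | INTUBADO | NEUMONIA | DIABETES | HIPERTENSION | OBESIDAD | CARDIOVASCULAR | RENAL_CRONICA | EPOC | ASMA | INMUSUPR | TABAQUISMO | CLASIFICACION_FINAL | FECHA_DEF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Male | 26 | Ambulatory | | | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | NaT |
| 1 | 0 | Female | 34 | Ambulatory | | | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | NaT |
| 2 | 0 | Male | 41 | Ambulatory | | | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 7 | NaT |
| 3 | 0 | Male | 25 | Ambulatory | | | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | NaT |
| 4 | 0 | Female | 20 | Ambulatory | | | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | NaT |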
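
The cell that writes `covid19_2021_clean.parquet` does not appear in this export, although the file is copied to Drive further down. A minimal sketch of such a step follows, assuming a Parquet engine such as `pyarrow` is installed; the helper name `save_clean` and the round-trip shape check are illustrative additions.

```python
import pandas as pd

OUTPUT_PATH = "covid19_2021_clean.parquet"  # file name taken from the notebook


def save_clean(df: pd.DataFrame, path: str = OUTPUT_PATH) -> None:
    """Write the cleaned frame to Parquet and confirm the shape on reload."""
    df.to_parquet(path, index=False)
    reloaded = pd.read_parquet(path)
    # Guard against silent truncation or dropped columns.
    assert reloaded.shape == df.shape, "shape changed on round trip"
    print("Saved:", path, reloaded.shape)


# Usage in the notebook, once the cleaned frame `df` exists:
# save_clean(df)
```

Reloading the full file costs time and memory on a dataset of this size, so the check can be dropped once the pipeline is trusted.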


In [14]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [15]:
!cp covid19_2021_clean.parquet "/content/drive/MyDrive/covid19_2021_clean.parquet"


## Summary and Output

In this notebook we:
- Loaded the processed dataset in Parquet format.
- Validated schema consistency and assessed missingness.
- Standardized special codes (97/98/99) as missing values.
- Normalized binary clinical variables (Yes/No) and optimized dtypes.
- Created an analysis-ready dataset for downstream EDA.

**Output:** `covid19_2021_clean.parquet`  
Next notebook: exploratory analysis and descriptive statistics.
