🧬 Project Introduction
This project aims to develop a Dengue Risk Index for municipalities using sociodemographic data and historical dengue case records. The primary goal is to move beyond simple case counts and create a more nuanced predictive model that identifies which municipalities are most vulnerable to dengue outbreaks.

To achieve this, we use a neural network, a machine learning model loosely inspired by the human brain. Neural networks can capture complex, non-linear patterns in data, which suits a multifaceted problem like disease prediction, where factors such as population density, water infrastructure, and housing conditions can interact in subtle ways to influence the spread of the virus.

The final output of this notebook is a dengue risk index for each municipality, which can be sorted into a ranked list. This gives public health officials a tool to prioritize resource allocation and preventive measures. 🦟

🧠 Model Architecture Explained
We've constructed a sequential neural network using TensorFlow and Keras. This type of architecture is essentially a stack of layers where the output of one layer becomes the input for the next. Let's break down its components and why they were chosen.

The Layers: From Raw Data to Prediction
Our model has an input layer, three hidden layers, and one output layer. Think of it as a funnel that processes and refines information at each stage.

Input Layer (Implicit): This is the entry point for our data. It has a neuron for each of our sociodemographic features (e.g., population, number of homes with piped water, etc.).

Hidden Layer 1 (Dense, 128 neurons, ReLU): The first and largest hidden layer. It takes all the input features and learns an initial set of weighted combinations of them.

Hidden Layer 2 (Dense, 64 neurons, ReLU): This layer receives the patterns from the first layer and combines them to learn more complex relationships.

Hidden Layer 3 (Dense, 32 neurons, ReLU): The final hidden layer further refines the patterns, focusing on the most critical signals that are predictive of dengue cases.

Output Layer (Dense, 1 neuron, Linear): This is the final layer. It condenses all the learned information from the previous layers into a single numerical output: the predicted number of dengue cases. We use a linear activation function here because we're predicting a continuous number, not classifying it into a category.
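
To get a feel for the size of this stack, the number of trainable parameters in a Dense layer can be computed as (inputs + 1) × neurons, where the +1 accounts for the bias. The sketch below assumes a hypothetical count of 20 input features (the real number depends on how many numeric columns the dataset has), and the helper name `dense_params` is only illustrative. Dropout layers are left out because they have no trainable parameters.

```python
# Trainable parameters of a Dense layer: one weight per input plus one bias, per neuron.
def dense_params(n_inputs: int, n_neurons: int) -> int:
    return (n_inputs + 1) * n_neurons

n_features = 20  # ASSUMPTION: hypothetical feature count, not taken from the dataset
layer_sizes = [128, 64, 32, 1]  # hidden layers 1-3 and the output layer

counts = []
n_inputs = n_features
for n_neurons in layer_sizes:
    counts.append(dense_params(n_inputs, n_neurons))
    n_inputs = n_neurons  # this layer's output feeds the next layer

for size, count in zip(layer_sizes, counts):
    print(f"Dense({size}): {count} parameters")
print("Total:", sum(counts))
```

Most of the parameters sit in the first two hidden layers, which is typical for a funnel-shaped network like this one.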

Key Components & Their Purpose
Dense Layers: This means that every neuron in the layer is connected to every neuron in the previous layer. This dense connectivity allows the model to consider all possible feature interactions.

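ReLU (Rectified Linear Unit) Activation: ReLU outputs its input when the input is positive and zero otherwise, so it acts like an "on/off switch" for each neuron. It is a simple function that lets the model learn non-linear relationships. Without a non-linear activation, the stacked layers would collapse into a single linear transformation and could only learn straight-line patterns.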

Dropout Layers (0.3 and 0.2): This is our safeguard against overfitting. During training, dropout randomly "turns off" a fraction of neurons (30% in the first dropout layer, 20% in the second). This forces the network to learn redundant pathways and prevents it from relying too heavily on any single feature or neuron. The result is a more robust and generalizable model that performs better on new, unseen data. It's like studying for an exam with a team where different members are occasionally absent, forcing everyone else to know the material thoroughly.
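
The two mechanisms above can be illustrated with a few lines of NumPy. This is a minimal sketch, not the Keras implementation: the function names `relu` and `dropout` and the sample values are made up for illustration. The dropout function uses the "inverted" form, in which the surviving activations are scaled up by 1 / (1 - rate), which is how Keras documents its Dropout layer.

```python
import numpy as np

def relu(x):
    # ReLU keeps positive values and replaces negatives with zero.
    return np.maximum(0.0, x)

def dropout(x, rate, rng):
    # Inverted dropout: zero each activation with probability `rate`,
    # then rescale the survivors by 1 / (1 - rate) so the expected sum is unchanged.
    keep_mask = rng.random(x.shape) >= rate
    return x * keep_mask / (1.0 - rate)

rng = np.random.default_rng(0)  # fixed seed so the example is repeatable
pre_activations = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])  # ASSUMPTION: toy values

activated = relu(pre_activations)
dropped = dropout(activated, rate=0.3, rng=rng)

print("after ReLU:   ", activated)
print("after dropout:", dropped)
```

Dropout is only applied during training. When the model makes predictions, all neurons are active.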

In [None]:
from google.colab import files
uploaded = files.upload()

Saving dengue_sociograficos.csv to dengue_sociograficos.csv


In [None]:
import pandas as pd

df = pd.read_csv('dengue_sociograficos.csv')
df.head()


Unnamed: 0,Municipio,Área kilómetros cuadrados,Densidad poblacional hab/km2,Población total,Población femenina,Población masculina,Población de 0 a 2 años,Población de 12 años y más,Población de 18 años y más,Población femenina de 18 años y más,...,Viviendas particulares habitadas que disponen de agua entubada y se abastecen del servicio público de agua,Viviendas particulares habitadas que no disponen de agua entubada en el ámbito de la vivienda,Viviendas particulares habitadas que disponen de tinaco,Viviendas particulares habitadas que disponen de cisterna o aljibe,Viviendas particulares habitadas que disponen de letrina (pozo u hoyo),Viviendas particulares habitadas que disponen de drenaje,Viviendas particulares habitadas que no disponen de drenaje,Indicador,Unidad de Medida,casos
0,ACATIC,339.1959,68.323349,23175,11792,11383,1294,17781,15170,7844,...,4956,217,6000,4678,20,6337,64,Casos de Dengue,Casos,0
1,ACATLÁN DE JUÁREZ,160.665691,157.158631,25250,12006,13244,1195,20096,16031,8202,...,6203,71,4854,1339,8,6307,56,Casos de Dengue,Casos,100
2,AHUALULCO DE MERCADO,273.989217,86.24427,23630,11846,11784,1064,18889,16375,8293,...,5951,9,5628,2839,12,6479,29,Casos de Dengue,Casos,5
3,AMACUECA,124.820044,46.010238,5743,2967,2776,297,4461,3852,2030,...,1547,4,1329,212,10,1578,9,Casos de Dengue,Casos,14
4,AMATITÁN,172.573658,95.553402,16490,8306,8184,957,12717,10764,5517,...,3727,43,3748,1808,12,4179,24,Casos de Dengue,Casos,43


In [None]:
import pandas as pd

# Read the CSV file
df = pd.read_csv('dengue_sociograficos.csv')

# Dictionary with Spanish to English column name translations
translation_dict = {
    'Municipio': 'Municipality',
    'Área kilómetros cuadrados': 'Area_square_kilometers',
    'Densidad poblacional hab/km2': 'Population_density_per_km2',
    'Población total': 'Total_population',
    'Población femenina': 'Female_population',
    'Población masculina': 'Male_population',
    'Población de 0 a 2 años': 'Population_0_to_2_years',
    'Población de 12 años y más': 'Population_12_years_and_older',
    'Población de 18 años y más': 'Population_18_years_and_older',
    'Población femenina de 18 años y más': 'Female_population_18_years_and_older',
    'Viviendas particulares habitadas que disponen de agua entubada y se abastecen del servicio público de agua': 'Dwellings_with_piped_water',
    'Viviendas particulares habitadas que no disponen de agua entubada en el ámbito de la vivienda': 'Dwellings_without_piped_water',
    'Viviendas particulares habitadas que disponen de tinaco': 'Dwellings_with_tinaco',
    'Viviendas particulares habitadas que disponen de cisterna o aljibe': 'Dwellings_with_cistern',
    'Viviendas particulares habitadas que disponen de letrina (pozo u hoyo)': 'Dwellings_with_latrine',
    'Viviendas particulares habitadas que disponen de drenaje': 'Dwellings_with_drainage',
    'Viviendas particulares habitadas que no disponen de drenaje': 'Dwellings_without_drainage',
    'Indicador': 'Indicator',
    'Unidad de Medida': 'Unit_of_Measure',
    'casos': 'cases'
}

# Rename the columns
df_en = df.rename(columns=translation_dict)

# Save the translated dataframe to a new CSV file
df_en.to_csv('dengue_sociographics_en.csv', index=False)

# Print the first 5 rows of the translated dataframe
print(df_en.head())

           Municipality  Area_square_kilometers  Population_density_per_km2  \
0                ACATIC              339.195900                   68.323349   
1     ACATLÁN DE JUÁREZ              160.665691                  157.158631   
2  AHUALULCO DE MERCADO              273.989217                   86.244270   
3              AMACUECA              124.820044                   46.010238   
4              AMATITÁN              172.573658                   95.553402   

   Total_population  Female_population  Male_population  \
0             23175              11792            11383   
1             25250              12006            13244   
2             23630              11846            11784   
3              5743               2967             2776   
4             16490               8306             8184   

   Population_0_to_2_years  Population_12_years_and_older  \
0                     1294                          17781   
1                     1195                      

The next cell loads the dengue_sociographics_en.csv dataset with pandas and prepares it for a neural network. It keeps only the numerical columns and splits them into the features (X), which are all the sociodemographic indicators, and the target variable (y), which is the cases column. StandardScaler then standardizes each feature to a mean of zero and a standard deviation of one. This step matters for training neural networks because it puts all input variables on a comparable scale, so features with larger numerical ranges do not dominate the learning process.
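
As a rough illustration of what the standardization step computes, the sketch below applies the same formula, (x - mean) / std per column, with plain NumPy. The numbers are toy values chosen to mimic two features on very different scales. They are assumptions for the example, not rows from the dataset.

```python
import numpy as np

# ASSUMPTION: toy values standing in for two features on very different scales
# (for example total population and population density); not taken from the dataset.
X = np.array([
    [23000.0,  70.0],
    [25000.0, 160.0],
    [ 6000.0,  45.0],
    [16000.0,  95.0],
])

mean = X.mean(axis=0)
std = X.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses
X_scaled = (X - mean) / std

print(X_scaled.round(3))
print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates simply because of its units.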

The core of the script is the construction and training of a deep neural network using TensorFlow's Keras API. A Sequential model is defined with three Dense hidden layers of 128, 64, and 32 neurons, each using the relu activation function to learn non-linear patterns. Dropout layers with rates of 0.3 and 0.2 follow the first and second hidden layers. This regularization technique randomly deactivates a fraction of neurons during training to reduce overfitting and improve the model's ability to generalize to new data. The network ends with a single Dense output layer with a linear activation, appropriate for a regression task that predicts a continuous value (the number of cases). The model is compiled with the adam optimizer and trained for 150 epochs with a batch size of 16, iteratively adjusting its weights to minimize the mean_squared_error between its predictions and the actual dengue case numbers. Mean absolute error (mae) is tracked as an additional metric.
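
To make the loss and the metric concrete, here is a small sketch of how mean squared error and mean absolute error are computed. The observed values are the case counts from the first five rows shown earlier, while the predicted values are made up for illustration.

```python
import numpy as np

y_true = np.array([0.0, 100.0, 5.0, 14.0, 43.0])   # first five observed case counts
y_pred = np.array([10.0, 60.0, 5.0, 20.0, 40.0])   # ASSUMPTION: made-up predictions

errors = y_true - y_pred
mse = np.mean(errors ** 2)      # the loss being minimized
mae = np.mean(np.abs(errors))   # the extra metric reported in the training log

print(f"MSE: {mse:.2f}, MAE: {mae:.2f}")
```

Because the errors are squared, a single large miss (such as the second row here) contributes far more to MSE than to MAE.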

After training, the model generates predictions on the same scaled input data it was trained on. These raw predictions are min-max normalized to the range 0 to 1 to create the dengue_risk index, which is added as a new column to the DataFrame. The script then uses the SHAP (SHapley Additive exPlanations) library to help explain the model's predictions. An Explainer computes shap_values for the first 100 rows, estimating how much each sociodemographic feature contributed to each of those predictions. These values provide the basis for visualizations of the key drivers of predicted dengue risk. Finally, the script prints a preview of the results and saves the complete dataset with the risk index to a new CSV file.
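
The normalization described above is a min-max rescaling. A minimal sketch follows, with a guard for the case where all predictions are identical. The helper name `min_max_scale`, the zero fallback, and the sample values are assumptions made for this example.

```python
import numpy as np

def min_max_scale(values):
    # Rescale values to [0, 1]; return zeros if every value is identical
    # (ASSUMPTION: this fallback is a choice made for this sketch).
    values = np.asarray(values, dtype=float).ravel()
    value_range = values.max() - values.min()
    if value_range == 0:
        return np.zeros_like(values)
    return (values - values.min()) / value_range

# ASSUMPTION: made-up raw predictions shaped (n, 1), like the output of model.predict
raw_predictions = np.array([[12.0], [40.0], [5.0], [75.0]])
risk = min_max_scale(raw_predictions)
print(risk)
```

The resulting index is relative: 0 and 1 mark the lowest and highest predictions in this dataset, not absolute levels of risk.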

In [None]:
import pandas as pd
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import StandardScaler
import shap
import matplotlib.pyplot as plt

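# Load the translated dataset
df_en = pd.read_csv("dengue_sociographics_en.csv")

# Keep only the numerical columns
df_numerico = df_en.select_dtypes(include=[np.number])

# Features (X) are every numerical column except the target (y)
X = df_numerico.drop(columns=['cases'])
y = df_numerico['cases']

# Standardize each feature to mean 0 and standard deviation 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)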

model = Sequential([
    # Explicit input layer; avoids the Keras warning about passing input_shape to a layer
    tf.keras.Input(shape=(X_scaled.shape[1],)),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.2),
    Dense(32, activation='relu'),
    Dense(1, activation='linear')
])

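# Adam optimizer, MSE loss, and MAE as an additional metric
model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mae'])

# Train on the full dataset for 150 epochs
history = model.fit(X_scaled, y, epochs=150, batch_size=16, verbose=1)

# Predict on the training data; flatten the (n, 1) output to a 1-D array
predictions = model.predict(X_scaled).flatten()

# Min-max normalize the predictions to a 0-1 risk index
df_en['dengue_risk'] = (predictions - predictions.min()) / (predictions.max() - predictions.min())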

# Explain the model with SHAP; use only the first 100 rows to speed up the computation
explainer = shap.Explainer(model, X_scaled)
shap_values = explainer(X_scaled[:100])

print("\nFirst 5 rows with the dengue risk index:")
print(df_en[['Municipality', 'cases', 'dengue_risk']].head())

df_en.to_csv('dengue_risk_predictions_en.csv', index=False)
print("\nFull results saved to 'dengue_risk_predictions_en.csv'")

Epoch 1/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 2s 9ms/step - loss: 4713.9277 - mae: 29.7081
Epoch 2/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - loss: 4363.9844 - mae: 28.6996
Epoch 3/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 9ms/step - loss: 5489.0684 - mae: 32.1323
Epoch 4/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 5489.2163 - mae: 30.3231
Epoch 5/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 4515.3613 - mae: 31.5672
Epoch 6/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 4807.3125 - mae: 33.0229
Epoch 7/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step - loss: 5474.0225 - mae: 35.5183
Epoch 8/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/step - loss: 3321.2244 - mae: 29.8435
Epoch 9/150
8/8 ━━━━━━━━━━━━━━━━━━━━ 0s 6ms/s

PermutationExplainer explainer: 101it [00:14,  1.76it/s]


First 5 rows with the dengue risk index:
           Municipality  cases  dengue_risk
0                ACATIC      0     0.038808
1     ACATLÁN DE JUÁREZ    100     0.030174
2  AHUALULCO DE MERCADO      5     0.033180
3              AMACUECA     14     0.039165
4              AMATITÁN     43     0.037896

Full results saved to 'dengue_risk_predictions_en.csv'
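
Once the risk index is stored in a column, turning it into a ranked list is a matter of sorting. The sketch below uses a small stand-in DataFrame with placeholder municipality names and made-up risk values. Only the `Municipality` and `dengue_risk` column names come from the notebook, and `risk_rank` is a name introduced here for illustration.

```python
import pandas as pd

# ASSUMPTION: a small stand-in for the saved results; names and risk values are made up.
results = pd.DataFrame({
    'Municipality': ['A', 'B', 'C', 'D'],
    'dengue_risk': [0.12, 0.87, 0.45, 0.87],
})

# Highest predicted risk first
ranked = (
    results
    .sort_values('dengue_risk', ascending=False, kind='stable')
    .reset_index(drop=True)
)
# method='min' gives tied municipalities the same (best) rank.
ranked['risk_rank'] = ranked['dengue_risk'].rank(method='min', ascending=False).astype(int)

print(ranked)
```

The same two calls could be applied to `df_en` after the `dengue_risk` column has been added.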





In [None]:
# Model Predictions
y_pred = model.predict(X_scaled).flatten()

# Display a table with actual vs. predicted results
df_results = pd.DataFrame({
    'Municipality': df_en['Municipality'],
    'Observed_Cases': y,
    'Predicted_Cases': y_pred.round(2),
    'Difference': (y - y_pred).round(2)
})

# View the results sorted by the absolute difference (error)
df_results.sort_values(by='Difference', key=abs, ascending=False).head(10)

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 8ms/step


Unnamed: 0,Municipality,Observed_Cases,Predicted_Cases,Difference
49,JOCOTEPEC,312,104.050003,207.95
1,ACATLÁN DE JUÁREZ,100,14.29,85.71
62,OCOTLÁN,184,110.169998,73.83
14,AUTLÁN DE NAVARRO,6,76.169998,-70.17
43,IXTLAHUACÁN DE LOS MEMBRILLOS,8,72.830002,-64.83
119,ZAPOTITLÁN DE VADILLO,92,32.029999,59.97
81,TALA,10,67.989998,-57.99
118,ZAPOTILTIC,114,56.700001,57.3
110,SAN GABRIEL,86,34.580002,51.42
83,TAMAZULA DE GORDIANO,88,39.849998,48.15


In [None]:
from sklearn.metrics import r2_score

# Get predictions from the already trained model on the same data it was trained on (in-sample)
y_pred = model.predict(X_scaled).flatten()

# Calculate R²
r2 = r2_score(y, y_pred)

# Display the result
print(f'Coefficient of Determination (R²): {r2:.4f}')

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step
Coefficient of Determination (R²): 0.8305
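
For reference, the coefficient of determination can be written out by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below is a plain NumPy version of that formula. The function name `r_squared` is illustrative, the observed values are the first five case counts shown earlier, and the predictions are made up.

```python
import numpy as np

def r_squared(y_true, y_pred):
    # R² = 1 - (residual sum of squares) / (total sum of squares around the mean)
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = [0.0, 100.0, 5.0, 14.0, 43.0]   # first five observed case counts
y_pred = [10.0, 60.0, 5.0, 20.0, 40.0]   # ASSUMPTION: made-up predictions

print(f"R²: {r_squared(y_true, y_pred):.4f}")
```

A value of 1 means the predictions match the observations exactly, and a value of 0 means the model does no better than always predicting the mean.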
