<img src="https://industrial.uniandes.edu.co/sites/default/files/imagenes/uniandeslogo.png" alt="Universidad de los Andes" style="float: right; width: 300px; height: auto;">

# Replication of the Old Pipeline

Author: Juan Diego Heredia Niño

Email: jd.heredian@uniandes.edu.co

Date: Oct 2025

In [1]:
import pandas as pd
import numpy as np
import yaml
from pathlib import Path

In [2]:
# Load path configuration from a YAML file
with open('paths.yml', 'r') as file:
    paths = yaml.safe_load(file)

raw = Path(paths['data']['raw'])
temp = Path(paths['data']['temp'])
processed = Path(paths['data']['processed'])

---

## Data Processing Pipeline

This section processes multiple data sources related to violence and socioeconomic indicators in Colombian municipalities. The pipeline includes data from:
- Fiscalía (Attorney General's Office)
- Ministerio de Defensa (Ministry of Defense)
- JEP (Special Jurisdiction for Peace)
- DANE (National Statistics Department)
- Illicit Crops Data
- CEDE Panel Data

### 1. Fiscalía Data Processing

Processing crime data from the Attorney General's Office. This includes:
- Filtering specific crime types (Extortion, Homicide, Massacres, Kidnapping, Terrorism)
- Assigning weights to each crime type based on severity
- Aggregating cases by municipality and quarter
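
The weighting and aggregation steps listed above can be sketched on a small made-up table. The weights are the ones used in this notebook; the rows and the names `PESOS`, `toy` and `resultado` are illustrative assumptions, and the dictionary lookup is an alternative to chained `np.where` calls, not the notebook's own code.

```python
import pandas as pd

# Weights per crime type, as used in this notebook
PESOS = {
    'Extorsion': 0.1031,
    'Homicidio': 0.1704,
    'Masacres': 0.4484,
    'Secuestro simple y extorsivo': 0.1435,
    'Terrorismo': 0.1345,
}

# Toy rows (hypothetical values, not taken from the Fiscalía file)
toy = pd.DataFrame({
    'cod_mun': [5001, 5001, 5001, 5002],
    'año': [2019, 2019, 2019, 2019],
    'trimestre': [1, 1, 2, 4],
    'delito': ['Homicidio', 'Extorsion', 'Homicidio', 'Hurto'],
    'casos': [10, 20, 5, 7],
})

# Keep only the weighted crime types, then look up each weight
toy = toy[toy['delito'].isin(list(PESOS))].copy()
toy['pesos'] = toy['delito'].map(PESOS)

# Weighted cases, summed by municipality and quarter
toy['casos_ponderados'] = toy['pesos'] * toy['casos']
resultado = (
    toy.groupby(['cod_mun', 'año', 'trimestre'], as_index=False)['casos_ponderados']
    .sum()
)
```

In this toy input the `Hurto` row is dropped by the filter, so only municipality 5001 appears in `resultado`.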

In [3]:
# Load Fiscalía data
df_fisc = pd.read_csv(raw / 'old' / 'fiscalía.csv')

# Filter specific crime types
df_fisc = df_fisc.query("delito.isin(['Extorsion','Homicidio','Masacres','Secuestro simple y extorsivo','Terrorismo'])")

# Assign weights to each crime type
df_fisc['pesos'] = np.where(df_fisc['delito'] == 'Extorsion', 0.1031, np.nan)
df_fisc['pesos'] = np.where(df_fisc['delito'] == 'Homicidio', 0.1704, df_fisc['pesos'])
df_fisc['pesos'] = np.where(df_fisc['delito'] == 'Masacres', 0.4484, df_fisc['pesos'])
df_fisc['pesos'] = np.where(df_fisc['delito'] == 'Secuestro simple y extorsivo', 0.1435, df_fisc['pesos'])
df_fisc['pesos'] = np.where(df_fisc['delito'] == 'Terrorismo', 0.1345, df_fisc['pesos'])

# Create quarter variable
df_fisc['trimestre'] = np.where(df_fisc['mes'].isin([4,5,6]), 2, 1)
df_fisc['trimestre'] = np.where(df_fisc['mes'].isin([7,8,9]), 3, df_fisc['trimestre'])
df_fisc['trimestre'] = np.where(df_fisc['mes'].isin([10,11,12]), 4, df_fisc['trimestre'])

# Calculate weighted cases
df_fisc['casos_ponderados'] = df_fisc['pesos'] * df_fisc['casos']

# Aggregate by municipality and quarter (from 2019 onwards)
df_fisc = df_fisc.groupby(['cod_mun','año','trimestre'])[['casos_ponderados']].sum().query("año>=2019").reset_index()

# Save intermediate result
df_fisc.to_parquet(temp / 'old' / 'df_fisc.parquet', index=False)

df_fisc.head()

Unnamed: 0,cod_mun,año,trimestre,casos_ponderados
0,5001.0,2019,1,49.0507
1,5001.0,2019,2,54.3729
2,5001.0,2019,3,41.4287
3,5001.0,2019,4,34.811
4,5001.0,2020,1,30.3222


### 2. Ministerio de Defensa Data Processing

Processing crime data from the Ministry of Defense. This includes five crime categories:
- **Extorsión** (Extortion)
- **Homicidio** (Homicide)
- **Masacres** (Massacres)
- **Secuestro** (Kidnapping)
- **Terrorismo** (Terrorism)

Each category is weighted and aggregated by municipality and quarter.
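
Since the five categories follow the same steps, the pattern can be sketched as one function. The helper `ponderar` and the toy rows are hypothetical and only illustrate the logic; the column names `COD_MUNI`, `FECHA HECHO` and `VÍCTIMAS` follow the homicide file used in this notebook.

```python
import pandas as pd

def ponderar(df, col_fecha, col_cantidad, nombre, peso):
    """Hypothetical helper: weight one crime category and
    aggregate it by municipality, year and quarter."""
    out = df.rename(columns={'COD_MUNI': 'cod_mun', col_cantidad: nombre}).copy()
    out['año'] = out[col_fecha].dt.year
    out['trimestre'] = out[col_fecha].dt.quarter
    out['casos_ponderados'] = peso * out[nombre]
    return (
        out.groupby(['cod_mun', 'año', 'trimestre'])[['casos_ponderados', nombre]]
        .sum()
        .reset_index()
    )

# Toy input (made-up rows, not taken from HOMICIDIO.xlsx)
toy_hom = pd.DataFrame({
    'COD_MUNI': [5001, 5001, 5002],
    'FECHA HECHO': pd.to_datetime(['2020-01-15', '2020-03-02', '2020-07-30']),
    'VÍCTIMAS': [2, 1, 4],
})

df_toy_hom = ponderar(toy_hom, 'FECHA HECHO', 'VÍCTIMAS', 'homicidios', 0.1704)
```

The two first-quarter rows of municipality 5001 collapse into one row, so `df_toy_hom` has one row per municipality-quarter with at least one recorded event.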

In [4]:
# EXTORSIÓN
df_md_ext = pd.read_excel(raw / 'old' / 'mindef' / 'EXTORSIÓN.xlsx')
df_md_ext["año"] = df_md_ext['FECHA HECHOS'].dt.year
df_md_ext["trimestre"] = df_md_ext['FECHA HECHOS'].dt.quarter
df_md_ext['pesos'] = 0.1031
df_md_ext['casos_ponderados'] = df_md_ext['pesos'] * df_md_ext['CANTIDAD']
df_md_ext.rename(columns={'COD_MUNI':'cod_mun', 'CANTIDAD':'extorsion'}, inplace=True)
df_md_ext = df_md_ext.groupby(['cod_mun','año','trimestre'])[['casos_ponderados','extorsion']].sum().reset_index()

# HOMICIDIOS
df_md_hom = pd.read_excel(raw / 'old' / 'mindef' / 'HOMICIDIO.xlsx')
df_md_hom["año"] = df_md_hom['FECHA HECHO'].dt.year
df_md_hom["trimestre"] = df_md_hom['FECHA HECHO'].dt.quarter
df_md_hom['pesos'] = 0.1704
df_md_hom['casos_ponderados'] = df_md_hom['pesos'] * df_md_hom['VÍCTIMAS']
df_md_hom.rename(columns={'COD_MUNI':'cod_mun','VÍCTIMAS':'homicidios'}, inplace=True)
df_md_hom = df_md_hom.groupby(['cod_mun','año','trimestre'])[['casos_ponderados','homicidios']].sum().reset_index()

# MASACRES
df_md_mas = pd.read_excel(raw / 'old' / 'mindef' / 'MASACRES.xlsx')
df_md_mas["año"] = df_md_mas['FECHA HECHO'].dt.year
df_md_mas["trimestre"] = df_md_mas['FECHA HECHO'].dt.quarter
df_md_mas['pesos'] = 0.4484
df_md_mas['casos_ponderados'] = df_md_mas['pesos'] * df_md_mas['VICTIMAS']
df_md_mas.rename(columns={'COD_MUNI':'cod_mun','VICTIMAS':'masacres'}, inplace=True)
df_md_mas = df_md_mas.groupby(['cod_mun','año','trimestre'])[['casos_ponderados','masacres']].sum().reset_index()

# SECUESTRO
df_md_sec = pd.read_excel(raw / 'old' / 'mindef' / 'SECUESTRO.xlsx')
df_md_sec["año"] = df_md_sec['FECHA HECHO'].dt.year
df_md_sec["trimestre"] = df_md_sec['FECHA HECHO'].dt.quarter
df_md_sec['pesos'] = 0.1435
df_md_sec['casos_ponderados'] = df_md_sec['pesos'] * df_md_sec['CANTIDAD']
df_md_sec.rename(columns={'COD_MUNI':'cod_mun', 'CANTIDAD':'secuestrados'}, inplace=True)
df_md_sec = df_md_sec.groupby(['cod_mun','año','trimestre'])[['casos_ponderados','secuestrados']].sum().reset_index()

# TERRORISMO
df_md_terr = pd.read_excel(raw / 'old' / 'mindef' / 'TERRORISMO.xlsx')
df_md_terr["año"] = df_md_terr['FECHA HECHO'].dt.year
df_md_terr["trimestre"] = df_md_terr['FECHA HECHO'].dt.quarter
df_md_terr['pesos'] = 0.1345
df_md_terr['casos_ponderados'] = df_md_terr['pesos'] * df_md_terr['CANTIDAD']
df_md_terr.rename(columns={'COD_MUNI':'cod_mun', 'CANTIDAD':'terrorismo'}, inplace=True)
df_md_terr = df_md_terr.groupby(['cod_mun','año','trimestre'])[['casos_ponderados','terrorismo']].sum().reset_index()

# Combine all Ministry of Defense data
df_md = pd.concat([df_md_ext, df_md_hom, df_md_mas, df_md_sec, df_md_terr]).groupby(['cod_mun','año','trimestre'])[['casos_ponderados']].sum().reset_index()

# Save intermediate result
df_md.to_parquet(temp / 'old' / 'df_md.parquet', index=False)

df_md.head()

Unnamed: 0,cod_mun,año,trimestre,casos_ponderados
0,5001,1996,1,7.2503
1,5001,1996,2,5.5506
2,5001,1996,3,4.8331
3,5001,1996,4,3.4747
4,5001,1997,1,1.435


In [5]:
df_1 = (
    pd.concat([
        df_md_terr[['terrorismo']].describe().T,
        df_md_ext[['extorsion']].describe().T,
        df_md_hom[['homicidios']].describe().T,
        df_md_mas[['masacres']].describe().T,
        df_md_sec[['secuestrados']].describe().T
    ])
    [['count', 'mean', 'std', 'min', 'max']]
    .rename(columns={
        'count':'Number of Observations',
        'mean':'Average',
        'std':'Standard Deviation',
        'min':'Minimum',
        'max':'Maximum'})
    .rename(index={
        'terrorismo':'Terrorism',
        'extorsion':'Extortion',
        'homicidios':'Homicides',
        'masacres':'Massacres',
        'secuestrados':'Kidnappings'
    })
    .round(1)
)

In [6]:
# Sort data by municipality and time
df_md_terr = df_md_terr.sort_values(by=['cod_mun', 'año', 'trimestre'])
df_md_ext = df_md_ext.sort_values(by=['cod_mun', 'año', 'trimestre'])
df_md_hom = df_md_hom.sort_values(by=['cod_mun', 'año', 'trimestre'])
df_md_mas = df_md_mas.sort_values(by=['cod_mun', 'año', 'trimestre'])
df_md_sec = df_md_sec.sort_values(by=['cod_mun', 'año', 'trimestre'])

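# Create violence indicator lags (previous 1 to 8 rows per municipality).
# Quarters with no recorded events have no row, so each lag is the previous
# observed quarter, not necessarily the previous calendar quarter.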
for lag in range(1, 9):
    df_md_terr[f'terrorismo_{lag}'] = df_md_terr.groupby('cod_mun')['terrorismo'].shift(lag)
    df_md_ext[f'extorsion_{lag}'] = df_md_ext.groupby('cod_mun')['extorsion'].shift(lag)
    df_md_hom[f'homicidios_{lag}'] = df_md_hom.groupby('cod_mun')['homicidios'].shift(lag)
    df_md_mas[f'masacres_{lag}'] = df_md_mas.groupby('cod_mun')['masacres'].shift(lag)
    df_md_sec[f'secuestrados_{lag}'] = df_md_sec.groupby('cod_mun')['secuestrados'].shift(lag)

df_md_terr.head()

Unnamed: 0,cod_mun,año,trimestre,casos_ponderados,terrorismo,terrorismo_1,terrorismo_2,terrorismo_3,terrorismo_4,terrorismo_5,terrorismo_6,terrorismo_7,terrorismo_8
0,5001,2010,1,0.538,4,,,,,,,,
1,5001,2010,2,0.4035,3,4.0,,,,,,,
2,5001,2010,3,0.269,2,3.0,4.0,,,,,,
3,5001,2011,2,0.538,4,2.0,3.0,4.0,,,,,
4,5001,2011,3,0.538,4,4.0,2.0,3.0,4.0,,,,


In [7]:
# Apply the same outlier rule to every Ministry of Defense series
series_md = [
    (df_md_terr, 'terrorismo'),
    (df_md_ext, 'extorsion'),
    (df_md_hom, 'homicidios'),
    (df_md_mas, 'masacres'),
    (df_md_sec, 'secuestrados'),
]

for frame, col in series_md:
    lag_cols = [f'{col}_{lag}' for lag in range(1, 9)]

    # Keep only rows with all 8 lags available
    frame.dropna(inplace=True)

    # Mean and standard deviation of the 8 lagged values
    lag_mean = frame[lag_cols].mean(axis=1)
    lag_std = frame[lag_cols].std(axis=1)

    # Flag values above mean + 1 std and above mean + 2 std
    frame['atipico 1'] = (frame[col] > lag_mean + lag_std).astype(int)
    frame['atipico 2'] = (frame[col] > lag_mean + 2 * lag_std).astype(int)

    # The lag columns are no longer needed
    frame.drop(columns=lag_cols, inplace=True)



In [8]:
# Share of atypical municipality-quarters per crime type, in percent
df_2 = pd.concat([
    pd.concat([
        df_md_terr[['atipico 1']].describe().T['mean'].rename(index={'atipico 1':'Terrorism'}).to_frame(name='Proportion of Atypical Municipal Quarters (std)'),
        df_md_terr[['atipico 2']].describe().T['mean'].rename(index={'atipico 2':'Terrorism'}).to_frame(name='Proportion of Atypical Municipal Quarters (2x std)')
    ], axis=1),

    pd.concat([
        df_md_ext[['atipico 1']].describe().T['mean'].rename(index={'atipico 1':'Extortion'}).to_frame(name='Proportion of Atypical Municipal Quarters (std)'),
        df_md_ext[['atipico 2']].describe().T['mean'].rename(index={'atipico 2':'Extortion'}).to_frame(name='Proportion of Atypical Municipal Quarters (2x std)')
    ], axis=1),

    pd.concat([
        df_md_hom[['atipico 1']].describe().T['mean'].rename(index={'atipico 1':'Homicides'}).to_frame(name='Proportion of Atypical Municipal Quarters (std)'),
        df_md_hom[['atipico 2']].describe().T['mean'].rename(index={'atipico 2':'Homicides'}).to_frame(name='Proportion of Atypical Municipal Quarters (2x std)')
    ], axis=1),

    pd.concat([
        df_md_mas[['atipico 1']].describe().T['mean'].rename(index={'atipico 1':'Massacres'}).to_frame(name='Proportion of Atypical Municipal Quarters (std)'),
        df_md_mas[['atipico 2']].describe().T['mean'].rename(index={'atipico 2':'Massacres'}).to_frame(name='Proportion of Atypical Municipal Quarters (2x std)')
    ], axis=1),

    pd.concat([
        df_md_sec[['atipico 1']].describe().T['mean'].rename(index={'atipico 1':'Kidnappings'}).to_frame(name='Proportion of Atypical Municipal Quarters (std)'),
        df_md_sec[['atipico 2']].describe().T['mean'].rename(index={'atipico 2':'Kidnappings'}).to_frame(name='Proportion of Atypical Municipal Quarters (2x std)')
    ], axis=1)
]).round(4).map(lambda x: x*100)

In [9]:
pd.concat([
    df_1,
    df_2
], axis=1)

Unnamed: 0,Number of Observations,Average,Standard Deviation,Minimum,Maximum,Proportion of Atypical Municipal Quarters (std),Proportion of Atypical Municipal Quarters (2x std)
Terrorism,2687.0,2.2,2.1,1.0,29.0,17.4,8.75
Extortion,24890.0,4.7,18.3,1.0,729.0,21.88,11.84
Homicides,47997.0,6.7,24.8,1.0,614.0,17.98,8.53
Massacres,259.0,3.8,1.9,3.0,18.0,0.0,0.0
Kidnappings,10538.0,2.7,4.6,1.0,179.0,14.46,8.05


### 3. JEP Data Processing

Processing data from the Special Jurisdiction for Peace (JEP). This includes:
- IGC_JEP: General Index of Conflict
- IA_JEP: Atypicality Index

Data is aggregated by municipality and quarter.
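
The month-to-quarter rule used in this notebook (chained `np.where`) can be checked against a one-line integer formula. This is a minimal sketch on the twelve valid months; the names `trimestre_where`, `trimestre_formula` and `coinciden` are illustrative.

```python
import numpy as np
import pandas as pd

mes = pd.Series(range(1, 13))

# Chained np.where: quarter 1 is the default, then months 4-12 are overwritten
trimestre_where = np.where(mes.isin([4, 5, 6]), 2, 1)
trimestre_where = np.where(mes.isin([7, 8, 9]), 3, trimestre_where)
trimestre_where = np.where(mes.isin([10, 11, 12]), 4, trimestre_where)

# Integer arithmetic alternative: months 1-3 -> 1, 4-6 -> 2, 7-9 -> 3, 10-12 -> 4
trimestre_formula = (mes - 1) // 3 + 1

# True when both rules agree for every month
coinciden = bool((trimestre_where == trimestre_formula).all())
```

One difference matters for data quality: with the chained `np.where`, any value outside 4 to 12 (including a missing month) falls into quarter 1 by default, whereas the formula would propagate a missing month as missing.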

In [10]:
# Load JEP data
df_jep = pd.read_csv(raw / 'old' / 'jep_iacv.csv').rename(columns={'cod_mnpio':'cod_mun'})

# Create quarter variable
df_jep['trimestre'] = np.where(df_jep['mes'].isin([4,5,6]), 2, 1)
df_jep['trimestre'] = np.where(df_jep['mes'].isin([7,8,9]), 3, df_jep['trimestre'])
df_jep['trimestre'] = np.where(df_jep['mes'].isin([10,11,12]), 4, df_jep['trimestre'])

# Aggregate by municipality and quarter
df_jep = df_jep.groupby(['cod_mun','año','trimestre'])[['igc_jep','ia_jep']].sum().reset_index()

# Save intermediate result
df_jep.to_parquet(temp / 'old' / 'df_jep.parquet', index=False)

df_jep.head()

Unnamed: 0,cod_mun,año,trimestre,igc_jep,ia_jep
0,5001.0,2017,1,0.0,0.0
1,5001.0,2017,2,0.0,0.0
2,5001.0,2017,3,0.0,0.0
3,5001.0,2017,4,0.0,0.0
4,5001.0,2018,1,0.0,0.0


### 4. DANE Data Processing

Processing socioeconomic data from the National Statistics Department (DANE). This creates a panel structure with all municipalities across years and quarters.
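
A minimal sketch of how the panel skeleton arises: merging a year-quarter grid with a municipality-year table on `año` repeats every municipality-year row once per quarter. The municipality codes and populations are made up, and the names `grid`, `mun_year` and `panel` are illustrative.

```python
import pandas as pd

# Year-quarter grid (two years only, for illustration)
grid = pd.DataFrame(
    [(año, trimestre) for año in (2005, 2006) for trimestre in (1, 2, 3, 4)],
    columns=['año', 'trimestre'],
)

# Toy municipality-year table (hypothetical codes and populations)
mun_year = pd.DataFrame({
    'año': [2005, 2005, 2006, 2006],
    'cod_mun': [1, 2, 1, 2],
    'pob': [100, 200, 110, 210],
})

# Each grid row matches every municipality of the same year
panel = grid.merge(mun_year, on='año', how='left')
```

Here `panel` has 2 years × 4 quarters × 2 municipalities = 16 rows, with one row per municipality-quarter. Annual values such as `pob` are repeated across the four quarters of the year.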

In [11]:
# Create temporal structure (years and quarters)
years = list(range(2005, 2025))
quarters = [1, 2, 3, 4]
data = [(year, quarter) for year in years for quarter in quarters]
df_trimestres = pd.DataFrame(data, columns=["año", "trimestre"])

# Load DANE municipal information
df_dane = pd.read_csv(raw / 'old' / 'info_mun_dane.csv')

# Merge to create complete panel structure
df_dane = df_trimestres.merge(df_dane, on='año', how='left')

# Save intermediate result
df_dane.to_parquet(temp / 'old' / 'df_dane.parquet', index=False)

df_dane.head()

Unnamed: 0,año,trimestre,dpto,mun,cod_mun,pob
0,2005,1,Antioquia,Medellín,5001,2046341
1,2005,1,Antioquia,Abejorral,5002,22942
2,2005,1,Antioquia,Abriaquí,5004,2719
3,2005,1,Antioquia,Alejandría,5021,4724
4,2005,1,Antioquia,Amagá,5030,27121


### 5. Illicit Crops Data Processing

Processing data on illicit crops from UNODC:
- **Coca**: Coca cultivation areas
- **Amapola**: Poppy cultivation areas

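Each file is reshaped from wide format (one column per year) to long format (one row per municipality-year). Normalization by municipality area is applied later, in the cleaning step.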
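
The wide-to-long reshape can be sketched with `DataFrame.melt` on a two-year toy table. The cultivation values are made up, and the names `wide` and `long` are illustrative.

```python
import pandas as pd

# Toy wide table: one row per municipality, one column per year
wide = pd.DataFrame({
    'CODMPIO': [91263, 91405],
    '2021': [1.5, None],
    '2022': [2.0, 3.0],
})

# One row per municipality-year after melting
long = (
    wide.melt(id_vars=['CODMPIO'], value_vars=['2021', '2022'],
              var_name='año', value_name='coca')
    .rename(columns={'CODMPIO': 'cod_mun'})
)

# Missing cultivation is treated as zero, and the year becomes an integer
long['coca'] = long['coca'].fillna(0)
long['año'] = long['año'].astype(int)
```

`melt` stacks the year columns in order, so all 2021 rows come first, followed by all 2022 rows.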

In [12]:
# Process Coca cultivation data
df_cultivos_coca = pd.read_excel(raw / 'old' / 'RPT_CultivosIlicitos_2025-03-04--170538.xlsx', header=8).dropna(subset=['CODMPIO'])
df_cultivos_coca = df_cultivos_coca.drop(df_cultivos_coca.index[-1])
df_cultivos_coca.columns = df_cultivos_coca.columns.astype(str)
df_cultivos_coca = df_cultivos_coca.melt(id_vars=['CODMPIO'], value_vars=[str(year) for year in range(1999, 2024)], var_name='año', value_name='coca')
df_cultivos_coca.fillna(0, inplace=True)
df_cultivos_coca.rename(columns={'CODMPIO':'cod_mun'}, inplace=True)
df_cultivos_coca['año'] = df_cultivos_coca['año'].astype(int)
df_cultivos_coca['cod_mun'] = df_cultivos_coca['cod_mun'].astype(int)

# Save intermediate result
df_cultivos_coca.to_parquet(temp / 'old' / 'df_cultivos_coca.parquet', index=False)

df_cultivos_coca.head()


Unnamed: 0,cod_mun,año,coca
0,91263,1999,0.0
1,91405,1999,0.0
2,91407,1999,0.0
3,91430,1999,0.0
4,91460,1999,0.0


In [13]:
# Process Poppy cultivation data
df_cultivos_amapola = pd.read_excel(raw / 'old' / 'RPT_CultivosIlicitos_2025-03-04--170553.xlsx', header=8).dropna(subset=['CODMPIO'])
df_cultivos_amapola = df_cultivos_amapola.drop(df_cultivos_amapola.index[-1])
df_cultivos_amapola.columns = df_cultivos_amapola.columns.astype(str)
df_cultivos_amapola = df_cultivos_amapola.melt(id_vars=['CODMPIO'], value_vars=[str(year) for year in range(1999, 2024)], var_name='año', value_name='amapola')
df_cultivos_amapola.rename(columns={'CODMPIO':'cod_mun'}, inplace=True)
df_cultivos_amapola.fillna(0, inplace=True)
df_cultivos_amapola['año'] = df_cultivos_amapola['año'].astype(int)
df_cultivos_amapola['cod_mun'] = df_cultivos_amapola['cod_mun'].astype(int)

# Save intermediate result
df_cultivos_amapola.to_parquet(temp / 'old' / 'df_cultivos_amapola.parquet', index=False)

df_cultivos_amapola.head()


Unnamed: 0,cod_mun,año,amapola
0,15047,1999,0.0
1,15377,1999,0.0
2,17446,1999,0.0
3,17513,1999,0.0
4,17653,1999,0.0


### 6. CEDE Panel Data Processing

Processing CEDE panel data which contains socioeconomic and institutional variables at the municipal level. This includes:
- Population data
- Geographic characteristics
- Poverty indices (NBI, IPM)
- Education indicators
- Fiscal variables

In [14]:
# Load CEDE panel data and drop redundant columns
df_cede = pd.read_csv(raw / 'old' / 'panel_cede.csv').drop(
    ['pob', 'areaoficialkm2_y', 'indrural_y', 'altura_y', 'discapital_y', 
     'dismdo_y', 'disbogota_y', 'distancia_mercado_y'], 
    axis=1
)

# Save intermediate result
df_cede.to_parquet(temp / 'old' / 'df_cede.parquet', index=False)

df_cede.head()

Unnamed: 0,cod_mun,año,indrural,areaoficialkm2,altura,discapital,dismdo,disbogota,distancia_mercado,y_corr,...,ipm_accagua_p_2005,ipm_accagua_p_2018,ipm_excretas_p_2005,ipm_excretas_p_2018,ipm_pisos_p_2005,ipm_pisos_p_2018,ipm_paredes_p_2005,ipm_paredes_p_2018,ipm_hacinam_p_2005,ipm_hacinam_p_2018
0,5001,2005,0.017527,5287.702842,1475.0,0.0,0.0,264.33902,0.0,0.569176,...,2.74429,1.5,3.10372,2.0,0.93007,0.2,1.89035,0.7,11.78,5.4
1,5002,2005,0.690306,46.160966,2275.0,58.200874,58.200874,209.00545,16.083084,0.061483,...,40.85703,32.3,26.16509,23.6,4.73934,1.6,0.75039,0.3,14.33649,5.2
2,5004,2005,0.681041,9.279863,1900.0,63.854633,63.854633,326.9408,49.62903,0.196929,...,44.21416,34.4,35.40587,39.6,5.69948,3.4,0.51813,0.8,15.02591,5.4
3,5021,2005,0.503145,31.284768,1750.0,61.114216,61.114216,241.826,15.529276,0.233209,...,40.57416,26.8,18.56459,3.7,7.08134,2.5,0.47847,0.4,15.311,4.1
4,5030,2005,0.481863,319.070588,1400.0,29.878857,29.878857,247.85628,24.321695,0.117345,...,9.78651,4.6,3.98548,5.6,4.28259,1.4,1.09193,0.3,11.34089,4.8


---

## Merging All Datasets

This section combines all processed datasets into a comprehensive panel database. The merging process follows this order:
1. Start with DANE panel structure (all municipalities × quarters)
2. Merge Ministry of Defense violence data
3. Merge illicit crops data (coca and poppy)
4. Merge CEDE socioeconomic data
5. Finally, merge JEP conflict indices
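
The merge order above can be sketched on toy tables. The `validate` argument of `DataFrame.merge` is an optional safeguard added here for illustration (it is not used in this notebook); it raises an error if the join keys are not unique where expected, so a left merge cannot silently duplicate panel rows. All values and the names `panel`, `violencia`, `cultivos` and `merged` are made up.

```python
import pandas as pd

# Toy panel skeleton: 2 municipalities x 2 quarters
panel = pd.DataFrame({
    'cod_mun': [1, 1, 2, 2],
    'año': [2020, 2020, 2020, 2020],
    'trimestre': [1, 2, 1, 2],
})

# Quarterly violence data: only quarters with events have a row
violencia = pd.DataFrame({
    'cod_mun': [1, 2],
    'año': [2020, 2020],
    'trimestre': [1, 2],
    'casos_ponderados': [0.5, 1.2],
})

# Annual crop data: one row per municipality-year
cultivos = pd.DataFrame({'cod_mun': [1], 'año': [2020], 'coca': [10.0]})

merged = (
    panel
    .merge(violencia, how='left', on=['año', 'cod_mun', 'trimestre'],
           validate='one_to_one')
    .merge(cultivos, how='left', on=['año', 'cod_mun'],
           validate='many_to_one')
    .fillna(0)  # no matching row is read as zero events / zero hectares
)
```

The row count of `merged` equals that of `panel`, and the annual `coca` value is repeated in both quarters of municipality 1.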

In [15]:
# Merge all datasets into a comprehensive panel
df = (df_dane
      .merge(df_md, suffixes=['_fisc', '_md'], how='left', on=['año', 'cod_mun', 'trimestre'])
      .merge(df_cultivos_coca, how='left', on=['año', 'cod_mun'])
      .merge(df_cultivos_amapola, how='left', on=['año', 'cod_mun'])
      .merge(df_cede.loc[:, ~df_cede.columns.isin(['indrural', 'areaoficialkm2', 'altura', 
                                                     'discapital', 'dismdo', 'disbogota', 
                                                     'distancia_mercado'])], 
             how='left', on=['año', 'cod_mun'])
      .fillna(0)
      .merge(df_cede[['año', 'cod_mun', 'indrural', 'areaoficialkm2', 'altura', 
                       'discapital', 'dismdo', 'disbogota', 'distancia_mercado']], 
             how='left', on=['año', 'cod_mun'])
      .merge(df_jep, how='left', on=['año', 'cod_mun', 'trimestre'])
)

df.head()

Unnamed: 0,año,trimestre,dpto,mun,cod_mun,pob,casos_ponderados,coca,amapola,y_corr,...,ipm_hacinam_p_2018,indrural,areaoficialkm2,altura,discapital,dismdo,disbogota,distancia_mercado,igc_jep,ia_jep
0,2005,1,Antioquia,Medellín,5001,2046341,31.2812,0.0,0.0,0.569176,...,5.4,0.017527,5287.702842,1475.0,0.0,0.0,264.33902,0.0,,
1,2005,1,Antioquia,Abejorral,5002,22942,0.6143,0.0,0.0,0.061483,...,5.2,0.690306,46.160966,2275.0,58.200874,58.200874,209.00545,16.083084,,
2,2005,1,Antioquia,Abriaquí,5004,2719,0.287,0.0,0.0,0.196929,...,5.4,0.681041,9.279863,1900.0,63.854633,63.854633,326.9408,49.62903,,
3,2005,1,Antioquia,Alejandría,5021,4724,0.3408,0.0,0.0,0.233209,...,4.1,0.503145,31.284768,1750.0,61.114216,61.114216,241.826,15.529276,,
4,2005,1,Antioquia,Amagá,5030,27121,0.0,0.0,0.0,0.117345,...,4.8,0.481863,319.070588,1400.0,29.878857,29.878857,247.85628,24.321695,,


### Data Quality Check

Checking for missing values across time periods to understand data completeness.
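
A compact sketch of the same check on toy data: build a boolean missing-value mask for the non-key columns and sum it within each year-quarter. The values and the names `toy`, `keys` and `faltantes` are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'año': [2020, 2020, 2021, 2021],
    'trimestre': [1, 1, 1, 1],
    'pob': [100, 200, 110, np.nan],
    'igc_jep': [np.nan, np.nan, 0.0, 0.3],
})

keys = ['año', 'trimestre']

# True marks a missing cell; summing booleans counts missing cells per period
faltantes = (
    toy.drop(columns=keys)
    .isnull()
    .groupby([toy['año'], toy['trimestre']])
    .sum()
)
```

Each cell of `faltantes` is the number of municipalities with a missing value for that column in that period.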

In [16]:
# Check missing values by year and quarter
missing_by_period = pd.concat(
    [df[['año', 'trimestre']],
     df.loc[:, ~df.columns.isin(['año', 'trimestre'])].isnull()],
    axis=1
).groupby(['año', 'trimestre']).sum()

missing_by_period

Unnamed: 0_level_0,Unnamed: 1_level_0,dpto,mun,cod_mun,pob,casos_ponderados,coca,amapola,y_corr,y_corr_tribut_IyC,DF_ing_func,...,ipm_hacinam_p_2018,indrural,areaoficialkm2,altura,discapital,dismdo,disbogota,distancia_mercado,igc_jep,ia_jep
año,trimestre,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
2005,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1122,1122
2005,2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1122,1122
2005,3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1122,1122
2005,4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1122,1122
2006,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,1122,1122
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2023,4,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,1,251,251
2024,1,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,1,251,251
2024,2,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,1,251,251
2024,3,0,0,0,0,0,0,0,0,0,0,...,0,1,0,1,1,1,1,1,251,251


---

## Data Cleaning and Imputation

This section handles missing values and prepares variables for analysis:
1. Filter data to years before 2023
2. Impute JEP indices with zeros for years >= 2017
3. Forward-fill geographic variables (time-invariant)
4. Normalize variables by population or area
5. Scale violence indicators per 100,000 inhabitants
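
Two of these steps, the within-municipality forward fill and the per-100,000 scaling, can be sketched on toy data. All values are made up, and the column `tasa` is an illustrative name (the notebook overwrites `casos_ponderados` instead).

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'cod_mun': [1, 1, 2, 2],
    'año': [2020, 2021, 2020, 2021],
    'pob': [50_000, 50_000, 200_000, 200_000],
    'casos_ponderados': [1.0, 0.5, 1.0, 4.0],
    'altura': [1500.0, np.nan, np.nan, 300.0],
}).sort_values(['cod_mun', 'año'])

# Forward-fill inside each municipality only: values never cross municipalities
toy['altura'] = toy.groupby('cod_mun')['altura'].ffill()

# Weighted cases per 100,000 inhabitants
toy['tasa'] = toy['casos_ponderados'] / toy['pob'] * 100_000
```

A forward fill only uses earlier rows, so the first row of municipality 2 stays missing even though a later value exists. The same number of weighted cases yields a rate four times higher in the municipality with a quarter of the population.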

In [17]:
# Filter data to years before 2023
df.query("año < 2023", inplace=True)

# Impute JEP indices with zeros for years >= 2017 (when JEP was established)
df.loc[df['año'] >= 2017, 'igc_jep'] = df.loc[df['año'] >= 2017, 'igc_jep'].fillna(0)
df.loc[df['año'] >= 2017, 'ia_jep'] = df.loc[df['año'] >= 2017, 'ia_jep'].fillna(0)

# Sort by municipality and time
df = df.sort_values(by=['cod_mun', 'año', 'trimestre'])

# Forward-fill time-invariant geographic variables
df['indrural'] = df.groupby('cod_mun')['indrural'].ffill()
df['altura'] = df.groupby('cod_mun')['altura'].ffill()
df['discapital'] = df.groupby('cod_mun')['discapital'].ffill()
df['dismdo'] = df.groupby('cod_mun')['dismdo'].ffill()
df['disbogota'] = df.groupby('cod_mun')['disbogota'].ffill()
df['distancia_mercado'] = df.groupby('cod_mun')['distancia_mercado'].ffill()

# Normalize variables by population
df[['casos_ponderados', 'y_corr', 'y_corr_tribut_IyC', 'DF_ing_func', 
    'DF_deuda', 'DF_desemp_fisc', 'docen_total', 'alumn_total']] = (
    df[['casos_ponderados', 'y_corr', 'y_corr_tribut_IyC', 'DF_ing_func', 
        'DF_deuda', 'DF_desemp_fisc', 'docen_total', 'alumn_total']].div(df['pob'], axis=0)
)

# Normalize crops by area
df[['coca', 'amapola']] = df[['coca', 'amapola']].div(df['areaoficialkm2'], axis=0)

# Scale violence indicator to per 100,000 inhabitants
df[['casos_ponderados']] = df[['casos_ponderados']] * 100000

df.head()

Unnamed: 0,año,trimestre,dpto,mun,cod_mun,pob,casos_ponderados,coca,amapola,y_corr,...,ipm_hacinam_p_2018,indrural,areaoficialkm2,altura,discapital,dismdo,disbogota,distancia_mercado,igc_jep,ia_jep
0,2005,1,Antioquia,Medellín,5001,2046341,1.528641,0.0,0.0,2.781432e-07,...,5.4,0.017527,5287.702842,1475.0,0.0,0.0,264.33902,0.0,,
1122,2005,2,Antioquia,Medellín,5001,2046341,1.497097,0.0,0.0,2.781432e-07,...,5.4,0.017527,5287.702842,1475.0,0.0,0.0,264.33902,0.0,,
2244,2005,3,Antioquia,Medellín,5001,2046341,1.912555,0.0,0.0,2.781432e-07,...,5.4,0.017527,5287.702842,1475.0,0.0,0.0,264.33902,0.0,,
3366,2005,4,Antioquia,Medellín,5001,2046341,1.61454,0.0,0.0,2.781432e-07,...,5.4,0.017527,5287.702842,1475.0,0.0,0.0,264.33902,0.0,,
4488,2006,1,Antioquia,Medellín,5001,2074195,1.522817,0.0,0.0,3.001994e-07,...,5.4,0.016824,5359.677003,1475.0,0.0,0.0,264.33902,0.0,,


In [18]:
# Check missing values percentage after imputation
df.isnull().mean()

año                      0.000000
trimestre                0.000000
dpto                     0.000000
mun                      0.000000
cod_mun                  0.000000
pob                      0.000000
casos_ponderados         0.000569
coca                     0.000545
amapola                  0.000545
y_corr                   0.000693
y_corr_tribut_IyC        0.000693
DF_ing_func              0.000693
DF_deuda                 0.000693
DF_desemp_fisc           0.000693
s11_total                0.000000
docen_total              0.000594
alumn_total              0.000594
nbi_2005                 0.000000
nbi_2018                 0.000000
IPM_2005                 0.000000
IPM_2018                 0.000000
ipm_ledu_p_2005          0.000000
ipm_ledu_p_2018          0.000000
ipm_analf_p_2005         0.000000
ipm_analf_p_2018         0.000000
ipm_asisescu_p_2005      0.000000
ipm_asisescu_p_2018      0.000000
ipm_rezagoescu_p_2005    0.000000
ipm_rezagoescu_p_2018    0.000000
ipm_serv_pinf_

---

## Creating Lag Variables

Creating lagged variables for violence indicators and socioeconomic factors:
- **Violence lags (1-4 quarters)**: iacv_1 to iacv_4, igc_1 to igc_4, ia_1 to ia_4
- **Socioeconomic lags (1 quarter)**: All control variables are lagged one quarter to limit simultaneity with the outcome (lagging alone does not rule out endogeneity)

This ensures that predictors precede the outcome in time.
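
A minimal sketch of per-municipality lags with `groupby(...).shift(...)`, on made-up values. The lag column names follow the `iacv_{lag}` pattern used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'cod_mun': [1, 1, 1, 2, 2],
    'año': [2020, 2020, 2020, 2020, 2020],
    'trimestre': [1, 2, 3, 1, 2],
    'casos_ponderados': [1.0, 2.0, 3.0, 10.0, 20.0],
}).sort_values(['cod_mun', 'año', 'trimestre'])

# shift() moves values down by `lag` rows inside each municipality
for lag in (1, 2):
    toy[f'iacv_{lag}'] = toy.groupby('cod_mun')['casos_ponderados'].shift(lag)
```

The first rows of each municipality have no earlier value, so their lags are missing and never borrow from the previous municipality. Because `shift` counts rows, a lag of 1 equals one calendar quarter only when every municipality has a row for every quarter, which is why the data must be sorted and the panel complete before lagging.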

In [19]:
# Sort data by municipality and time
df = df.sort_values(by=['cod_mun', 'año', 'trimestre'])

# Create violence indicator lags (1 to 4 quarters)
for lag in range(1, 5):
    df[f'iacv_{lag}'] = df.groupby('cod_mun')['casos_ponderados'].shift(lag)
    df[f'igc_{lag}'] = df.groupby('cod_mun')['igc_jep'].shift(lag)
    df[f'ia_{lag}'] = df.groupby('cod_mun')['ia_jep'].shift(lag)

# Define socioeconomic variables to lag
variables_a_rezagar = [
    'coca', 'amapola', 'y_corr', 'y_corr_tribut_IyC', 'DF_ing_func', 'DF_deuda', 
    'DF_desemp_fisc', 's11_total', 'docen_total', 'alumn_total', 'nbi_2005', 
    'nbi_2018', 'IPM_2005', 'IPM_2018', 'ipm_ledu_p_2005', 'ipm_ledu_p_2018', 
    'ipm_analf_p_2005', 'ipm_analf_p_2018', 'ipm_asisescu_p_2005', 
    'ipm_asisescu_p_2018', 'ipm_rezagoescu_p_2005', 'ipm_rezagoescu_p_2018', 
    'ipm_serv_pinf_p_2005', 'ipm_serv_pinf_p_2018', 'ipm_ti_p_2005', 
    'ipm_ti_p_2018', 'ipm_templeof_p_2005', 'ipm_templeof_p_2018', 
    'ipm_assalud_p_2005', 'ipm_assalud_p_2018', 'ipm_accsalud_p_2005', 
    'ipm_accsalud_p_2018', 'ipm_accagua_p_2005', 'ipm_accagua_p_2018', 
    'ipm_excretas_p_2005', 'ipm_excretas_p_2018', 'ipm_pisos_p_2005', 
    'ipm_pisos_p_2018', 'ipm_paredes_p_2005', 'ipm_paredes_p_2018', 
    'ipm_hacinam_p_2005', 'ipm_hacinam_p_2018', 'indrural', 'areaoficialkm2', 
    'altura', 'discapital', 'dismdo', 'disbogota', 'distancia_mercado'
]

# Lag all socioeconomic variables by 1 quarter
df[variables_a_rezagar] = df.groupby('cod_mun')[variables_a_rezagar].shift(1)

# Drop rows with missing lag values (first 4 quarters per municipality)
df = df.dropna(subset=['iacv_4']).reset_index(drop=True)

print(f"Data shape after creating lags: {df.shape}")
df.head()

Data shape after creating lags: (76254, 70)


Unnamed: 0,año,trimestre,dpto,mun,cod_mun,pob,casos_ponderados,coca,amapola,y_corr,...,ia_1,iacv_2,igc_2,ia_2,iacv_3,igc_3,ia_3,iacv_4,igc_4,ia_4
0,2006,1,Antioquia,Medellín,5001,2074195,1.522817,0.0,0.0,2.781432e-07,...,,1.912555,,,1.497097,,,1.528641,,
1,2006,2,Antioquia,Medellín,5001,2074195,1.449087,0.0,0.0,3.001994e-07,...,,1.61454,,,1.912555,,,1.497097,,
2,2006,3,Antioquia,Medellín,5001,2074195,1.578376,0.0,0.0,3.001994e-07,...,,1.522817,,,1.61454,,,1.912555,,
3,2006,4,Antioquia,Medellín,5001,2074195,1.446725,0.0,0.0,3.001994e-07,...,,1.449087,,,1.522817,,,1.61454,,
4,2007,1,Antioquia,Medellín,5001,2101771,1.299523,0.0,0.0,3.001994e-07,...,,1.578376,,,1.449087,,,1.522817,,


### Creating Atypical Violence Indicator

Creating the target variable for atypical violence detection:
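- **Threshold**: Mean + 1 sample standard deviation of the previous 4 quarters (`iacv_1` to `iacv_4`)
- **Atypical case**: Binary indicator equal to 1 when current violence is greater than or equal to the threshold. Because the comparison is `>=`, a municipality-quarter with zero violence now and in all 4 previous quarters is also flagged (0 >= 0)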
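
The rule can be sketched on three made-up rows to show how the choice between `>=` (used in this notebook) and a strict `>` affects flat series. The columns `atipico_ge` and `atipico_gt` are illustrative names.

```python
import pandas as pd

toy = pd.DataFrame({
    'casos_ponderados': [3.0, 2.0, 0.0],
    'iacv_1': [1.0, 2.0, 0.0],
    'iacv_2': [2.0, 2.0, 0.0],
    'iacv_3': [1.0, 2.0, 0.0],
    'iacv_4': [2.0, 2.0, 0.0],
})
lags = ['iacv_1', 'iacv_2', 'iacv_3', 'iacv_4']

# Threshold: mean + 1 sample standard deviation (pandas default, ddof=1)
toy['umbral'] = toy[lags].mean(axis=1) + toy[lags].std(axis=1)

# Same rule with a non-strict and a strict comparison
toy['atipico_ge'] = (toy['casos_ponderados'] >= toy['umbral']).astype(int)
toy['atipico_gt'] = (toy['casos_ponderados'] > toy['umbral']).astype(int)
```

The first row is a genuine spike and is flagged under both rules. The second and third rows are constant series (standard deviation 0, threshold equal to the current value), so only `>=` flags them. With many municipality-quarters at zero violence, this can raise the reported proportion of atypical cases considerably.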

In [20]:
# Calculate threshold: mean + 1 std of past 4 quarters
df['umbral'] = df[['iacv_1', 'iacv_2', 'iacv_3', 'iacv_4']].mean(axis=1) + df[['iacv_1', 'iacv_2', 'iacv_3', 'iacv_4']].std(axis=1)

# Create binary indicator for atypical violence
df['caso_atipico'] = np.where(df['casos_ponderados'] >= df['umbral'], 1, 0)

# Display summary statistics
print(f"Proportion of atypical cases: {df['caso_atipico'].mean():.2%}")
df[['casos_ponderados', 'umbral', 'caso_atipico']].head(10)

Proportion of atypical cases: 36.48%


Unnamed: 0,casos_ponderados,umbral,caso_atipico
0,1.522817,1.82772,0
1,1.449087,1.827405,0
2,1.578376,1.828207,0
3,1.446725,1.613278,0
4,1.299523,1.562738,0
5,1.302711,1.557391,0
6,1.397441,1.54022,0
7,1.570694,1.434292,1
8,1.693561,1.51972,1
9,1.620068,1.665828,0


### Final Data Quality Check

Verifying data completeness after all transformations.

In [21]:
# Check missing values by year and quarter after all processing
final_missing = pd.concat(
    [df[['año', 'trimestre']],
     df.loc[:, ~df.columns.isin(['año', 'trimestre'])].isnull()],
    axis=1
).groupby(['año', 'trimestre']).sum()

final_missing

Unnamed: 0_level_0,Unnamed: 1_level_0,dpto,mun,cod_mun,pob,casos_ponderados,coca,amapola,y_corr,y_corr_tribut_IyC,DF_ing_func,...,igc_2,ia_2,iacv_3,igc_3,ia_3,iacv_4,igc_4,ia_4,umbral,caso_atipico
año,trimestre,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
2006,1,0,0,0,0,1,2,2,2,2,2,...,1120,1120,2,1120,1120,0,1120,1120,2,0
2006,2,0,0,0,0,0,0,0,0,0,0,...,1118,1118,0,1118,1118,0,1118,1118,0,0
2006,3,0,0,0,0,0,0,0,0,0,0,...,1118,1118,0,1118,1118,0,1118,1118,0,0
2006,4,0,0,0,0,0,0,0,0,0,0,...,1118,1118,0,1118,1118,0,1118,1118,0,0
2007,1,0,0,0,0,0,1,1,1,1,1,...,1119,1119,0,1119,1119,0,1119,1119,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021,4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2022,1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2022,2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2022,3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
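
In the table above, the `igc_*` and `ia_*` columns have over a thousand missing values per quarter in the early years and none in the last quarters shown. A small helper can locate, for each column, the first period from which no missing values remain. This is a sketch under assumptions: `first_complete_period` is a hypothetical name, and the toy counts are illustrative, not taken from the data.

```python
import numpy as np
import pandas as pd


def first_complete_period(missing_counts: pd.DataFrame) -> dict:
    """Map each column to the first index label after its last non-zero count.

    Returns None for a column that still has missing values in the last period.
    """
    result = {}
    for col in missing_counts.columns:
        # Positions of the periods that still contain missing values.
        nonzero_pos = np.flatnonzero(missing_counts[col].to_numpy() > 0)
        start = 0 if nonzero_pos.size == 0 else int(nonzero_pos[-1]) + 1
        result[col] = missing_counts.index[start] if start < len(missing_counts) else None
    return result


# Toy missing-value counts indexed by (año, trimestre).
toy_missing = pd.DataFrame(
    {"pob": [0, 0, 0, 0], "ia_1": [3, 3, 0, 0], "coca": [2, 0, 0, 0]},
    index=pd.MultiIndex.from_tuples(
        [(2017, 3), (2017, 4), (2018, 1), (2018, 2)], names=["año", "trimestre"]
    ),
)
complete_from = first_complete_period(toy_missing)
print(complete_from)
```

Applied to `final_missing`, the result would indicate which period each column starts to be fully populated in, which in turn bounds the sample that survives a `dropna()`.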


In [22]:
# Save the full processed panel, keeping all identifier columns
df_raw = df.copy()
df_raw.to_parquet(temp / 'old' / 'preliminary' / 'df_raw.parquet', index=False)

print(f"Raw data saved: {df_raw.shape}")
df_raw.head()

Raw data saved: (76254, 72)


,año,trimestre,dpto,mun,cod_mun,pob,casos_ponderados,coca,amapola,y_corr,...,igc_2,ia_2,iacv_3,igc_3,ia_3,iacv_4,igc_4,ia_4,umbral,caso_atipico
0,2006,1,Antioquia,Medellín,5001,2074195,1.522817,0.0,0.0,2.781432e-07,...,,,1.497097,,,1.528641,,,1.82772,0
1,2006,2,Antioquia,Medellín,5001,2074195,1.449087,0.0,0.0,3.001994e-07,...,,,1.912555,,,1.497097,,,1.827405,0
2,2006,3,Antioquia,Medellín,5001,2074195,1.578376,0.0,0.0,3.001994e-07,...,,,1.61454,,,1.912555,,,1.828207,0
3,2006,4,Antioquia,Medellín,5001,2074195,1.446725,0.0,0.0,3.001994e-07,...,,,1.522817,,,1.61454,,,1.613278,0
4,2007,1,Antioquia,Medellín,5001,2101771,1.299523,0.0,0.0,3.001994e-07,...,,,1.449087,,,1.522817,,,1.562738,0


---

## Preparing Modeling Datasets

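Two versions of the data are created for modeling:
1. **Without JEP indices**: drops the JEP columns (`ia_jep`, `igc_jep`, and the `ia_*`/`igc_*` series), which keeps the longer period starting in 2006.
2. **With JEP indices**: keeps the JEP conflict metrics, which restricts the sample to the more recent quarters where those indices are available.

Both datasets drop the identifiers (`dpto`, `mun`, `cod_mun`) as well as `casos_ponderados` and `umbral`, the two columns from which the target `caso_atipico` is derived.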

In [23]:
# Create dataset WITHOUT JEP indices
df_sin_jep = df.drop(
    ['dpto', 'mun', 'cod_mun', 'casos_ponderados', 'umbral'] + 
    ['ia_jep', 'igc_jep', 'igc_1', 'ia_1', 'igc_2', 'ia_2', 'igc_3', 'ia_3', 'igc_4', 'ia_4'], 
    axis=1
).dropna()

# Remove infinite values
df_sin_jep = df_sin_jep[np.isfinite(df_sin_jep).all(axis=1)].reset_index(drop=True)

print(f"Dataset without JEP shape: {df_sin_jep.shape}")
print(f"Missing values: {df_sin_jep.isnull().sum().sum()}")
df_sin_jep.head()

Dataset without JEP shape: (76184, 57)
Missing values: 0


,año,trimestre,pob,coca,amapola,y_corr,y_corr_tribut_IyC,DF_ing_func,DF_deuda,DF_desemp_fisc,...,altura,discapital,dismdo,disbogota,distancia_mercado,iacv_1,iacv_2,iacv_3,iacv_4,caso_atipico
0,2006,1,2074195,0.0,0.0,2.781432e-07,4.504404e-08,2.1e-05,6e-06,3.7e-05,...,1475.0,0.0,0.0,264.33902,0.0,1.61454,1.912555,1.497097,1.528641,0
1,2006,2,2074195,0.0,0.0,3.001994e-07,6.333018e-08,1.9e-05,5e-06,3.7e-05,...,1475.0,0.0,0.0,264.33902,0.0,1.522817,1.61454,1.912555,1.497097,0
2,2006,3,2074195,0.0,0.0,3.001994e-07,6.333018e-08,1.9e-05,5e-06,3.7e-05,...,1475.0,0.0,0.0,264.33902,0.0,1.449087,1.522817,1.61454,1.912555,0
3,2006,4,2074195,0.0,0.0,3.001994e-07,6.333018e-08,1.9e-05,5e-06,3.7e-05,...,1475.0,0.0,0.0,264.33902,0.0,1.578376,1.449087,1.522817,1.61454,0
4,2007,1,2101771,0.0,0.0,3.001994e-07,6.333018e-08,1.9e-05,5e-06,3.7e-05,...,1475.0,0.0,0.0,264.33902,0.0,1.446725,1.578376,1.449087,1.522817,0


In [24]:
# Create dataset WITH JEP indices
# (unlike the previous cell, no infinite-value filter or index reset is applied here)
df_con_jep = df.drop(
    ['dpto', 'mun', 'cod_mun', 'casos_ponderados', 'umbral'], 
    axis=1
).dropna()

print(f"Dataset with JEP shape: {df_con_jep.shape}")
print(f"Missing values: {df_con_jep.isnull().sum().sum()}")
df_con_jep.head()

Dataset with JEP shape: (22420, 67)
Missing values: 0


,año,trimestre,pob,coca,amapola,y_corr,y_corr_tribut_IyC,DF_ing_func,DF_deuda,DF_desemp_fisc,...,iacv_2,igc_2,ia_2,iacv_3,igc_3,ia_3,iacv_4,igc_4,ia_4,caso_atipico
48,2018,1,2427129,0.0,0.0,3.635967e-07,1.141004e-07,1.5e-05,3e-06,3.5e-05,...,1.775957,0.0,0.0,1.565938,0.0,0.0,1.345076,0.0,0.0,0
49,2018,2,2427129,0.0,0.0,3.693195e-07,1.135234e-07,1.5e-05,3e-06,3.4e-05,...,1.562294,0.0,0.0,1.775957,0.0,0.0,1.565938,0.0,0.0,1
50,2018,3,2427129,0.0,0.0,3.693195e-07,1.135234e-07,1.5e-05,3e-06,3.4e-05,...,1.486163,0.0,0.0,1.562294,0.0,0.0,1.775957,0.0,0.0,0
51,2018,4,2427129,0.0,0.0,3.693195e-07,1.135234e-07,1.5e-05,3e-06,3.4e-05,...,2.022991,0.0,0.0,1.486163,0.0,0.0,1.562294,0.0,0.0,0
52,2019,1,2483545,0.0,0.0,3.693195e-07,1.135234e-07,1.5e-05,3e-06,3.4e-05,...,1.829149,0.0,0.0,2.022991,0.0,0.0,1.486163,0.0,0.0,0
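
The two cells above differ in which columns they drop, and only the first one filters infinite values and resets the index. A minimal sketch of a shared helper follows, assuming the same cleaning is wanted for both datasets (this is an assumption, not what the original cells do). `build_model_frame`, `DROP_ALWAYS`, and the toy frame are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Identifiers plus the two columns that define the target.
DROP_ALWAYS = ["dpto", "mun", "cod_mun", "casos_ponderados", "umbral"]


def build_model_frame(data: pd.DataFrame, extra_drop=()) -> pd.DataFrame:
    """Drop non-modeling columns, then rows with missing or infinite values."""
    out = data.drop(columns=DROP_ALWAYS + list(extra_drop)).dropna()
    # Check finiteness on numeric columns only, so stray text columns do not raise.
    numeric = out.select_dtypes(include="number").to_numpy(dtype=float)
    keep = np.isfinite(numeric).all(axis=1)
    return out[keep].reset_index(drop=True)


# Toy frame (illustrative values only).
toy = pd.DataFrame({
    "dpto": ["A", "A", "B"],
    "mun": ["x", "y", "z"],
    "cod_mun": [1, 2, 3],
    "casos_ponderados": [1.0, 2.0, 3.0],
    "umbral": [1.5, 1.5, 1.5],
    "pob": [100.0, np.inf, 300.0],
    "ia_jep": [0.1, 0.2, np.nan],
    "caso_atipico": [0, 1, 1],
})
toy_con_jep = build_model_frame(toy)
toy_sin_jep = build_model_frame(toy, extra_drop=["ia_jep"])
print(toy_con_jep.shape, toy_sin_jep.shape)
```

Keeping the column lists in one place makes it harder for the two datasets to drift apart when a new identifier or JEP column is added.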


---

## Saving Final Datasets

Both modeling datasets are written as Parquet files to the `preliminary` folder for later analysis and modeling.

In [25]:
# Save dataset without JEP indices
df_sin_jep.to_parquet(temp / 'old' / 'preliminary' / 'db_no_jep.parquet', index=False)
print(f"✓ Saved: db_no_jep.parquet ({df_sin_jep.shape})")

# Save dataset with JEP indices
df_con_jep.to_parquet(temp / 'old' / 'preliminary' / 'db_con_jep.parquet', index=False)
print(f"✓ Saved: db_con_jep.parquet ({df_con_jep.shape})")

print("\n" + "="*50)
print("DATA PROCESSING PIPELINE COMPLETED SUCCESSFULLY")
print("="*50)

✓ Saved: db_no_jep.parquet ((76184, 57))
✓ Saved: db_con_jep.parquet ((22420, 67))

DATA PROCESSING PIPELINE COMPLETED SUCCESSFULLY
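
The properties both modeling datasets are expected to have (no missing or infinite values, no identifier or target-defining columns, a binary target) can be checked explicitly rather than read off the printed shapes. The sketch below is a possible check, not part of the original pipeline; `check_model_frame`, `FORBIDDEN`, and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

# Columns that the modeling datasets are supposed to have dropped.
FORBIDDEN = ("dpto", "mun", "cod_mun", "casos_ponderados", "umbral")


def check_model_frame(data: pd.DataFrame, target: str = "caso_atipico") -> list:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if data.isnull().any().any():
        problems.append("missing values")
    numeric = data.select_dtypes(include="number").to_numpy(dtype=float)
    if np.isinf(numeric).any():
        problems.append("infinite values")
    leaked = [c for c in FORBIDDEN if c in data.columns]
    if leaked:
        problems.append(f"columns that should have been dropped: {leaked}")
    if target not in data.columns:
        problems.append(f"target column {target!r} not found")
    elif not set(data[target].dropna().unique()) <= {0, 1}:
        problems.append(f"target {target!r} is not binary")
    return problems


# Toy frames (illustrative values only).
clean = pd.DataFrame({"pob": [100.0, 200.0], "caso_atipico": [0, 1]})
dirty = pd.DataFrame({"pob": [100.0, np.inf], "umbral": [1.5, 1.6], "caso_atipico": [0, 2]})
print(check_model_frame(clean))
print(check_model_frame(dirty))
```

Calling it on `df_sin_jep` and `df_con_jep` would flag, for instance, any infinite values remaining in the with-JEP dataset, which is not filtered for them.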


### Summary of Saved Files

**Intermediate files** (saved in `data/temp/old/`):
- `df_fisc.parquet` - Fiscalía processed data
- `df_md.parquet` - Ministry of Defense aggregated data
- `df_jep.parquet` - JEP indices
- `df_dane.parquet` - DANE panel structure
- `df_cultivos_coca.parquet` - Coca cultivation data
- `df_cultivos_amapola.parquet` - Poppy cultivation data
- `df_cede.parquet` - CEDE panel data

**Final datasets** (saved in `data/temp/old/preliminary/`):
- `df_raw.parquet` - Complete processed data with all identifiers
- `db_no_jep.parquet` - Modeling dataset without JEP indices
- `db_con_jep.parquet` - Modeling dataset with JEP indices