<img src="https://industrial.uniandes.edu.co/sites/default/files/imagenes/uniandeslogo.png" alt="Universidad de los Andes" style="float: right; width: 300px; height: auto;">

# Cleaning Ministerio de Defensa Nacional de Colombia data — Extended crimes

Author: Juan Diego Heredia Niño

Email: jd.heredian@uniandes.edu.co

Date: Feb 2026

In [1]:
# Import necessary libraries
import pandas as pd  # For data manipulation and analysis
import yaml  # To read YAML configuration files
from pathlib import Path  # For cross-platform file path handling

In [2]:
# Load directory paths from configuration file
with open('paths.yml', 'r') as file:
    paths = yaml.safe_load(file)  # Read and parse YAML file

# Create Path objects for each directory
raw = Path(paths['data']['raw'])  # Directory with raw data
temp = Path(paths['data']['temp'])  # Directory with temporary processed data
processed = Path(paths['data']['processed'])  # Directory with final processed data

# Ensure output directory exists
(temp / 'mindef' / 'other').mkdir(parents=True, exist_ok=True)
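
The cell above expects `paths.yml` to contain a top-level `data` mapping with `raw`, `temp` and `processed` entries. The sketch below shows that shape using a plain dictionary in place of the parsed YAML. The directory values are invented for illustration (the real ones are not shown in this notebook), and a temporary folder is used so nothing is written to the project.

```python
from pathlib import Path
import tempfile

# Assumed layout: this dict mirrors what yaml.safe_load would return for
#   data:
#     raw: <...>
#     temp: <...>
#     processed: <...>
# The values below are placeholders, not the project's real directories.
with tempfile.TemporaryDirectory() as root:
    paths = {
        "data": {
            "raw": f"{root}/raw",
            "temp": f"{root}/temp",
            "processed": f"{root}/processed",
        }
    }

    temp = Path(paths["data"]["temp"])
    out_dir = temp / "mindef" / "other"

    # parents=True creates temp/ and temp/mindef/ as needed;
    # exist_ok=True makes a second call harmless when re-running the cell.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)

    created = out_dir.is_dir()
    print(out_dir.relative_to(root), created)
```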

## Standard Cleaning Process

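This notebook cleans 12 additional datasets from the Colombian Ministry of Defense, stored in `data/raw/mindef/other/`. Crime codes continue from `min_defensa.ipynb`, which assigns codes 1–5 to the top-5 crimes, so this notebook uses codes 6–17. In the output files, `crime_code` is stored as a zero-padded string (for example `'06'`).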

| # | File | crime_code | Output |
|---|------|-----------|--------|
| 1 | ASPERSION.xlsx | 6 | aspersion.parquet |
| 2 | CAPTURAS POR MINERÍA ILEGAL.xlsx | 7 | illegal_mining_arrests.parquet |
| 3 | DELITOS CONTRA EL MEDIO AMBIENTE.xlsx | 8 | environmental_crimes.parquet |
| 4 | DESTRUCCIÓN INFRAESTRUCTURAS PARA LA PRODUCCIÓN DE DROGAS ILÍCITAS.xlsx | 9 | drug_infrastructure_destruction.parquet |
| 5 | ERRADICACIÓN.xlsx | 10 | eradication.parquet |
| 6 | HOJA DE COCA.xlsx | 11 | coca_leaf.parquet |
| 7 | HURTO A COMERCIO.xlsx | 12 | commerce_theft.parquet |
| 8 | HURTO PERSONAS.xlsx | 13 | person_theft.parquet |
| 9 | INCAUTACIÓN DE COCAINA.xlsx | 14 | cocaine_seizure.parquet |
| 10 | MINAS INTERVENIDAS.xlsx | 15 | intervened_mines.parquet |
| 11 | TRATA DE PERSONAS Y TRÁFICO DE MIGRANTES.xlsx | 16 | human_trafficking.parquet |
| 12 | VOLADURA DE OLEODUCTOS.xlsx | 17 | pipeline_bombing.parquet |
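
The table above can also be kept as data, which makes the file names and codes easy to verify in one place. This is only a sketch: the `SOURCES` name and the tuple layout are assumptions for illustration and are not used elsewhere in this notebook.

```python
# (Excel file stem, crime_code, Parquet output stem), copied from the table above.
SOURCES = [
    ("ASPERSION", 6, "aspersion"),
    ("CAPTURAS POR MINERÍA ILEGAL", 7, "illegal_mining_arrests"),
    ("DELITOS CONTRA EL MEDIO AMBIENTE", 8, "environmental_crimes"),
    ("DESTRUCCIÓN INFRAESTRUCTURAS PARA LA PRODUCCIÓN DE DROGAS ILÍCITAS", 9,
     "drug_infrastructure_destruction"),
    ("ERRADICACIÓN", 10, "eradication"),
    ("HOJA DE COCA", 11, "coca_leaf"),
    ("HURTO A COMERCIO", 12, "commerce_theft"),
    ("HURTO PERSONAS", 13, "person_theft"),
    ("INCAUTACIÓN DE COCAINA", 14, "cocaine_seizure"),
    ("MINAS INTERVENIDAS", 15, "intervened_mines"),
    ("TRATA DE PERSONAS Y TRÁFICO DE MIGRANTES", 16, "human_trafficking"),
    ("VOLADURA DE OLEODUCTOS", 17, "pipeline_bombing"),
]

# Codes should be consecutive and output names unique.
codes = [code for _, code, _ in SOURCES]
outputs = [out for _, _, out in SOURCES]

for stem, code, out in SOURCES:
    # f"{code:02d}" yields the zero-padded string stored in the crime_code column.
    print(f"{code:02d}  {stem}.xlsx -> {out}.parquet")
```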

### Notes on column differences vs top5 files:
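- The date column is `FECHA HECHO` (with a space) in eight files and `FECHA_HECHO` (with an underscore) in the other four (DELITOS CONTRA EL MEDIO AMBIENTE, HURTO A COMERCIO, HURTO PERSONAS, TRATA DE PERSONAS Y TRÁFICO DE MIGRANTES); the space variant is renamed per file.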
- Quantity column is `CANTIDAD` in all files except CAPTURAS POR MINERÍA ILEGAL which uses `CAPTURAS`.
- Some files include extra columns (`UNIDADES DE MEDIDA`, `DESCRIPCION CONDUCTA`, `TIPO CULTIVO`, `ZONA`) that are dropped during aggregation.
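
Because the twelve files differ only in the date and quantity column names, the per-file steps used in this notebook can be expressed as one function. The sketch below is a hypothetical helper (`clean_mindef` and its parameters are not part of the notebook); it is shown on a small made-up table rather than on the real Excel files.

```python
import pandas as pd


def clean_mindef(df, crime_code, date_col="FECHA_HECHO", qty_col="CANTIDAD"):
    """Aggregate one raw table to month x municipality totals (hypothetical helper)."""
    out = df.rename(columns={date_col: "date", "COD_MUNI": "mun_code", qty_col: "qty"})

    # Truncate each date to the first day of its month.
    out["date"] = pd.to_datetime(out["date"]).dt.to_period("M").dt.to_timestamp().dt.date

    # Zero-pad municipality codes to five characters.
    out["mun_code"] = out["mun_code"].astype(str).str.zfill(5)

    # Selecting only qty drops any extra columns (ZONA, TIPO CULTIVO, ...).
    out = out.groupby(["date", "mun_code"])[["qty"]].sum().reset_index()

    # Store the code as a zero-padded string, e.g. 7 -> '07'.
    out["crime_code"] = f"{crime_code:02d}"
    return out


# Made-up rows that imitate the CAPTURAS POR MINERÍA ILEGAL layout.
raw_example = pd.DataFrame({
    "FECHA HECHO": ["2010-08-03", "2010-08-21", "2010-09-02"],
    "COD_MUNI": [5001, 5001, 11001],
    "CAPTURAS": [2, 3, 1],
    "ZONA": ["URBANA", "RURAL", "URBANA"],
})

cleaned = clean_mindef(raw_example, 7, date_col="FECHA HECHO", qty_col="CAPTURAS")
print(cleaned)
```

The two August rows for municipality `05001` collapse into a single row with `qty` 5, and `ZONA` no longer appears in the result.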

## crime_code = 6 | Aspersion

In [3]:
# ASPERSION
# Read Excel file with aerial spraying data
df_aspersion = pd.read_excel(raw / 'mindef' / 'other' / 'ASPERSION.xlsx')

# Rename date column to match standard name before processing
df_aspersion.rename(columns={'FECHA HECHO': 'FECHA_HECHO'}, inplace=True)

# Convert date to monthly period (first day of month) then to date format
df_aspersion['FECHA_HECHO'] = pd.to_datetime(df_aspersion['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_aspersion['COD_MUNI'] = df_aspersion['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing quantity (hectares sprayed)
df_aspersion = df_aspersion.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_aspersion.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (6 = Aspersion) — stored as zero-padded string for consistency with top5
df_aspersion['crime_code'] = '06'

# Save result in Parquet format
df_aspersion.to_parquet(temp / 'mindef' / 'other' / 'aspersion.parquet', index=False)
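
The two normalisation steps in the cell above are easier to follow on a few made-up values. The example also shows a caveat: `astype(str).str.zfill(5)` only pads correctly when `COD_MUNI` is read as integers. If the column were read as floats, the string would carry a trailing `.0`. Whether that happens in any of these files is not known from the notebook; the `Int64` cast is one possible guard, not the notebook's method.

```python
import pandas as pd

# Month truncation: every date maps to the first day of its month.
dates = pd.Series(["2003-01-17", "2003-01-30", "2015-09-05"])
months = pd.to_datetime(dates).dt.to_period("M").dt.to_timestamp().dt.date
print([str(d) for d in months])

# Integer codes pad as intended.
int_codes = pd.Series([5001, 11001]).astype(str).str.zfill(5)
print(int_codes.tolist())    # ['05001', '11001']

# Float codes become '5001.0', which is already longer than 5 characters,
# so zfill leaves it unchanged.
float_codes = pd.Series([5001.0, 11001.0]).astype(str).str.zfill(5)
print(float_codes.tolist())  # ['5001.0', '11001.0']

# Possible guard (assumption): cast to the nullable integer dtype first.
safe_codes = pd.Series([5001.0, 11001.0]).astype("Int64").astype(str).str.zfill(5)
print(safe_codes.tolist())   # ['05001', '11001']
```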

In [4]:
# Sanity checks for aspersion
print("ASPERSION - Data Quality Checks:")
print(f"  Total records: {len(df_aspersion):,}")
print(f"  Date range: {df_aspersion['date'].min()} to {df_aspersion['date'].max()}")
print(f"  Unique municipalities: {df_aspersion['mun_code'].nunique()}")
print(f"  Total qty: {df_aspersion['qty'].sum():,.2f}")
print(f"  Null values: {df_aspersion.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_aspersion['qty'] < 0).sum()}")
print(f"  Duplicates: {df_aspersion.duplicated(subset=['date', 'mun_code']).sum()}")
print()

ASPERSION - Data Quality Checks:
  Total records: 3,584
  Date range: 2003-01-01 to 2015-09-01
  Unique municipalities: 278
  Total qty: 1,424,493.83
  Null values: 0
  Negative quantities: 0
  Duplicates: 0
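
The checks above run on the aggregated frame, so some of them cannot fail. After `groupby(['FECHA_HECHO', 'COD_MUNI']).sum()` there is exactly one row per key, so the duplicate count is zero by construction. By default `groupby` also drops rows whose key is missing, and `sum` skips missing quantities. A complementary check on the raw frame, before aggregation, could look like the sketch below. The function name, its parameters and the toy rows are assumptions for illustration.

```python
import pandas as pd


def pre_aggregation_checks(df, date_col="FECHA_HECHO", code_col="COD_MUNI",
                           qty_col="CANTIDAD"):
    """Count problems in a raw table before it is grouped (hypothetical helper)."""
    # Unparseable dates become NaT instead of raising.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    codes = df[code_col].astype(str).str.zfill(5)
    # A valid municipality code is exactly five digits after padding.
    valid_code = codes.str.fullmatch(r"\d{5}", na=False)
    return {
        "rows": len(df),
        "bad_dates": int(dates.isna().sum()),
        "bad_mun_codes": int((~valid_code).sum()),
        "null_qty": int(df[qty_col].isna().sum()),
        "negative_qty": int((df[qty_col] < 0).sum()),
    }


# Made-up rows with one problem of each kind.
toy = pd.DataFrame({
    "FECHA_HECHO": ["2015-01-10", "not a date", "2015-02-01"],
    "COD_MUNI": ["5001", "11001", None],
    "CANTIDAD": [1.0, None, -2.0],
})

report = pre_aggregation_checks(toy)
print(report)
```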



## crime_code = 7 | Illegal Mining Arrests

In [5]:
# CAPTURAS POR MINERÍA ILEGAL
# Read Excel file with illegal mining arrests data
df_mining_arrests = pd.read_excel(raw / 'mindef' / 'other' / 'CAPTURAS POR MINERÍA ILEGAL.xlsx')

# Rename columns: date uses space variant; quantity column is CAPTURAS (not CANTIDAD)
df_mining_arrests.rename(columns={'FECHA HECHO': 'FECHA_HECHO', 'CAPTURAS': 'CANTIDAD'}, inplace=True)

# Convert date to monthly period (first day of month) then to date format
df_mining_arrests['FECHA_HECHO'] = pd.to_datetime(df_mining_arrests['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_mining_arrests['COD_MUNI'] = df_mining_arrests['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing number of arrests
df_mining_arrests = df_mining_arrests.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_mining_arrests.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (7 = Illegal Mining Arrests) — stored as zero-padded string for consistency with top5
df_mining_arrests['crime_code'] = '07'

# Save result in Parquet format
df_mining_arrests.to_parquet(temp / 'mindef' / 'other' / 'illegal_mining_arrests.parquet', index=False)

In [6]:
# Sanity checks for illegal mining arrests
print("ILLEGAL MINING ARRESTS - Data Quality Checks:")
print(f"  Total records: {len(df_mining_arrests):,}")
print(f"  Date range: {df_mining_arrests['date'].min()} to {df_mining_arrests['date'].max()}")
print(f"  Unique municipalities: {df_mining_arrests['mun_code'].nunique()}")
print(f"  Total arrests: {df_mining_arrests['qty'].sum():,}")
print(f"  Null values: {df_mining_arrests.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_mining_arrests['qty'] < 0).sum()}")
print(f"  Duplicates: {df_mining_arrests.duplicated(subset=['date', 'mun_code']).sum()}")
print()

ILLEGAL MINING ARRESTS - Data Quality Checks:
  Total records: 4,912
  Date range: 2010-08-01 to 2025-09-01
  Unique municipalities: 716
  Total arrests: 25,715
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 8 | Environmental Crimes

In [7]:
# DELITOS CONTRA EL MEDIO AMBIENTE
# Read Excel file with environmental crimes data
# Note: has extra columns DESCRIPCION CONDUCTA and ZONA — dropped during aggregation
df_environmental = pd.read_excel(raw / 'mindef' / 'other' / 'DELITOS CONTRA EL MEDIO AMBIENTE.xlsx')

# Convert date to monthly period (first day of month) then to date format
# Note: this file already uses FECHA_HECHO (with underscore)
df_environmental['FECHA_HECHO'] = pd.to_datetime(df_environmental['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_environmental['COD_MUNI'] = df_environmental['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing number of events (DESCRIPCION CONDUCTA and ZONA are dropped)
df_environmental = df_environmental.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_environmental.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (8 = Environmental Crimes) — stored as zero-padded string for consistency with top5
df_environmental['crime_code'] = '08'

# Save result in Parquet format
df_environmental.to_parquet(temp / 'mindef' / 'other' / 'environmental_crimes.parquet', index=False)

In [8]:
# Sanity checks for environmental crimes
print("ENVIRONMENTAL CRIMES - Data Quality Checks:")
print(f"  Total records: {len(df_environmental):,}")
print(f"  Date range: {df_environmental['date'].min()} to {df_environmental['date'].max()}")
print(f"  Unique municipalities: {df_environmental['mun_code'].nunique()}")
print(f"  Total events: {df_environmental['qty'].sum():,}")
print(f"  Null values: {df_environmental.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_environmental['qty'] < 0).sum()}")
print(f"  Duplicates: {df_environmental.duplicated(subset=['date', 'mun_code']).sum()}")
print()

ENVIRONMENTAL CRIMES - Data Quality Checks:
  Total records: 37,986
  Date range: 2003-01-01 to 2025-09-01
  Unique municipalities: 1092
  Total events: 83,846
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 9 | Drug Infrastructure Destruction

In [9]:
# DESTRUCCIÓN INFRAESTRUCTURAS PARA LA PRODUCCIÓN DE DROGAS ILÍCITAS
# Read Excel file with drug infrastructure destruction data
df_drug_infra = pd.read_excel(raw / 'mindef' / 'other' / 'DESTRUCCIÓN INFRAESTRUCTURAS PARA LA PRODUCCIÓN DE DROGAS ILÍCITAS.xlsx')

# Rename date column to match standard name before processing
df_drug_infra.rename(columns={'FECHA HECHO': 'FECHA_HECHO'}, inplace=True)

# Convert date to monthly period (first day of month) then to date format
df_drug_infra['FECHA_HECHO'] = pd.to_datetime(df_drug_infra['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_drug_infra['COD_MUNI'] = df_drug_infra['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing number of events
df_drug_infra = df_drug_infra.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_drug_infra.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (9 = Drug Infrastructure Destruction) — stored as zero-padded string for consistency with top5
df_drug_infra['crime_code'] = '09'

# Save result in Parquet format
df_drug_infra.to_parquet(temp / 'mindef' / 'other' / 'drug_infrastructure_destruction.parquet', index=False)

In [10]:
# Sanity checks for drug infrastructure destruction
print("DRUG INFRASTRUCTURE DESTRUCTION - Data Quality Checks:")
print(f"  Total records: {len(df_drug_infra):,}")
print(f"  Date range: {df_drug_infra['date'].min()} to {df_drug_infra['date'].max()}")
print(f"  Unique municipalities: {df_drug_infra['mun_code'].nunique()}")
print(f"  Total events: {df_drug_infra['qty'].sum():,}")
print(f"  Null values: {df_drug_infra.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_drug_infra['qty'] < 0).sum()}")
print(f"  Duplicates: {df_drug_infra.duplicated(subset=['date', 'mun_code']).sum()}")
print()

DRUG INFRASTRUCTURE DESTRUCTION - Data Quality Checks:
  Total records: 12,653
  Date range: 2010-01-01 to 2025-09-01
  Unique municipalities: 515
  Total events: 66,244
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 10 | Eradication

In [11]:
# ERRADICACIÓN
# Read Excel file with crop eradication data
# Note: has extra columns TIPO CULTIVO and UNIDAD DE MEDIDA — dropped during aggregation
df_eradication = pd.read_excel(raw / 'mindef' / 'other' / 'ERRADICACIÓN.xlsx')

# Rename date column to match standard name before processing
df_eradication.rename(columns={'FECHA HECHO': 'FECHA_HECHO'}, inplace=True)

# Convert date to monthly period (first day of month) then to date format
df_eradication['FECHA_HECHO'] = pd.to_datetime(df_eradication['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_eradication['COD_MUNI'] = df_eradication['COD_MUNI'].astype(str).str.zfill(5)

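# Group by date and municipality, summing quantity (TIPO CULTIVO and UNIDAD DE MEDIDA are dropped)
# Caution: the sum pools every TIPO CULTIVO and UNIDAD DE MEDIDA value, so it is only meaningful if all rows share one unit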
df_eradication = df_eradication.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_eradication.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (10 = Eradication) — stored as zero-padded string for consistency with top5
df_eradication['crime_code'] = '10'

# Save result in Parquet format
df_eradication.to_parquet(temp / 'mindef' / 'other' / 'eradication.parquet', index=False)

In [12]:
# Sanity checks for eradication
print("ERADICATION - Data Quality Checks:")
print(f"  Total records: {len(df_eradication):,}")
print(f"  Date range: {df_eradication['date'].min()} to {df_eradication['date'].max()}")
print(f"  Unique municipalities: {df_eradication['mun_code'].nunique()}")
print(f"  Total qty: {df_eradication['qty'].sum():,.2f}")
print(f"  Null values: {df_eradication.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_eradication['qty'] < 0).sum()}")
print(f"  Duplicates: {df_eradication.duplicated(subset=['date', 'mun_code']).sum()}")
print()

ERADICATION - Data Quality Checks:
  Total records: 15,613
  Date range: 2007-01-01 to 2025-09-01
  Unique municipalities: 521
  Total qty: 949,514.40
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 11 | Coca Leaf

In [13]:
# HOJA DE COCA
# Read Excel file with coca leaf seizure data
df_coca = pd.read_excel(raw / 'mindef' / 'other' / 'HOJA DE COCA.xlsx')

# Rename date column to match standard name before processing
df_coca.rename(columns={'FECHA HECHO': 'FECHA_HECHO'}, inplace=True)

# Convert date to monthly period (first day of month) then to date format
df_coca['FECHA_HECHO'] = pd.to_datetime(df_coca['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_coca['COD_MUNI'] = df_coca['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing quantity
df_coca = df_coca.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_coca.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (11 = Coca Leaf) — stored as zero-padded string for consistency with top5
df_coca['crime_code'] = '11'

# Save result in Parquet format
df_coca.to_parquet(temp / 'mindef' / 'other' / 'coca_leaf.parquet', index=False)

In [14]:
# Sanity checks for coca leaf
print("COCA LEAF - Data Quality Checks:")
print(f"  Total records: {len(df_coca):,}")
print(f"  Date range: {df_coca['date'].min()} to {df_coca['date'].max()}")
print(f"  Unique municipalities: {df_coca['mun_code'].nunique()}")
print(f"  Total qty: {df_coca['qty'].sum():,.2f}")
print(f"  Null values: {df_coca.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_coca['qty'] < 0).sum()}")
print(f"  Duplicates: {df_coca.duplicated(subset=['date', 'mun_code']).sum()}")
print()

COCA LEAF - Data Quality Checks:
  Total records: 7,351
  Date range: 2010-01-01 to 2025-09-01
  Unique municipalities: 328
  Total qty: 11,256,755.25
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 12 | Commerce Theft

In [15]:
# HURTO A COMERCIO
# Read Excel file with commerce theft data
# Note: this file already uses FECHA_HECHO (with underscore)
df_commerce_theft = pd.read_excel(raw / 'mindef' / 'other' / 'HURTO A COMERCIO.xlsx')

# Convert date to monthly period (first day of month) then to date format
df_commerce_theft['FECHA_HECHO'] = pd.to_datetime(df_commerce_theft['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_commerce_theft['COD_MUNI'] = df_commerce_theft['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing number of events
df_commerce_theft = df_commerce_theft.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_commerce_theft.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (12 = Commerce Theft) — stored as zero-padded string for consistency with top5
df_commerce_theft['crime_code'] = '12'

# Save result in Parquet format
df_commerce_theft.to_parquet(temp / 'mindef' / 'other' / 'commerce_theft.parquet', index=False)

In [16]:
# Sanity checks for commerce theft
print("COMMERCE THEFT - Data Quality Checks:")
print(f"  Total records: {len(df_commerce_theft):,}")
print(f"  Date range: {df_commerce_theft['date'].min()} to {df_commerce_theft['date'].max()}")
print(f"  Unique municipalities: {df_commerce_theft['mun_code'].nunique()}")
print(f"  Total events: {df_commerce_theft['qty'].sum():,}")
print(f"  Null values: {df_commerce_theft.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_commerce_theft['qty'] < 0).sum()}")
print(f"  Duplicates: {df_commerce_theft.duplicated(subset=['date', 'mun_code']).sum()}")
print()

COMMERCE THEFT - Data Quality Checks:
  Total records: 76,722
  Date range: 2003-01-01 to 2025-09-01
  Unique municipalities: 1109
  Total events: 658,563
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 13 | Person Theft

In [17]:
# HURTO PERSONAS
# Read Excel file with personal theft data
# Note: this file already uses FECHA_HECHO (with underscore)
df_person_theft = pd.read_excel(raw / 'mindef' / 'other' / 'HURTO PERSONAS.xlsx')

# Convert date to monthly period (first day of month) then to date format
df_person_theft['FECHA_HECHO'] = pd.to_datetime(df_person_theft['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_person_theft['COD_MUNI'] = df_person_theft['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing number of events
df_person_theft = df_person_theft.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_person_theft.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (13 = Person Theft) — stored as zero-padded string for consistency with top5
df_person_theft['crime_code'] = '13'

# Save result in Parquet format
df_person_theft.to_parquet(temp / 'mindef' / 'other' / 'person_theft.parquet', index=False)

In [18]:
# Sanity checks for person theft
print("PERSON THEFT - Data Quality Checks:")
print(f"  Total records: {len(df_person_theft):,}")
print(f"  Date range: {df_person_theft['date'].min()} to {df_person_theft['date'].max()}")
print(f"  Unique municipalities: {df_person_theft['mun_code'].nunique()}")
print(f"  Total events: {df_person_theft['qty'].sum():,}")
print(f"  Null values: {df_person_theft.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_person_theft['qty'] < 0).sum()}")
print(f"  Duplicates: {df_person_theft.duplicated(subset=['date', 'mun_code']).sum()}")
print()

PERSON THEFT - Data Quality Checks:
  Total records: 108,522
  Date range: 2003-01-01 to 2025-09-01
  Unique municipalities: 1113
  Total events: 3,524,822
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 14 | Cocaine Seizure

In [19]:
# INCAUTACIÓN DE COCAINA
# Read Excel file with cocaine seizure data
df_cocaine = pd.read_excel(raw / 'mindef' / 'other' / 'INCAUTACIÓN DE COCAINA.xlsx')

# Rename date column to match standard name before processing
df_cocaine.rename(columns={'FECHA HECHO': 'FECHA_HECHO'}, inplace=True)

# Convert date to monthly period (first day of month) then to date format
df_cocaine['FECHA_HECHO'] = pd.to_datetime(df_cocaine['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_cocaine['COD_MUNI'] = df_cocaine['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing quantity (kg seized)
df_cocaine = df_cocaine.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_cocaine.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (14 = Cocaine Seizure) — stored as zero-padded string for consistency with top5
df_cocaine['crime_code'] = '14'

# Save result in Parquet format
df_cocaine.to_parquet(temp / 'mindef' / 'other' / 'cocaine_seizure.parquet', index=False)

In [20]:
# Sanity checks for cocaine seizure
print("COCAINE SEIZURE - Data Quality Checks:")
print(f"  Total records: {len(df_cocaine):,}")
print(f"  Date range: {df_cocaine['date'].min()} to {df_cocaine['date'].max()}")
print(f"  Unique municipalities: {df_cocaine['mun_code'].nunique()}")
print(f"  Total qty: {df_cocaine['qty'].sum():,.2f}")
print(f"  Null values: {df_cocaine.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_cocaine['qty'] < 0).sum()}")
print(f"  Duplicates: {df_cocaine.duplicated(subset=['date', 'mun_code']).sum()}")
print()

COCAINE SEIZURE - Data Quality Checks:
  Total records: 47,295
  Date range: 2010-01-01 to 2025-09-01
  Unique municipalities: 1057
  Total qty: 6,944,458.82
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 15 | Intervened Mines

In [21]:
# MINAS INTERVENIDAS
# Read Excel file with intervened illegal mines data
df_mines = pd.read_excel(raw / 'mindef' / 'other' / 'MINAS INTERVENIDAS.xlsx')

# Rename date column to match standard name before processing
df_mines.rename(columns={'FECHA HECHO': 'FECHA_HECHO'}, inplace=True)

# Convert date to monthly period (first day of month) then to date format
df_mines['FECHA_HECHO'] = pd.to_datetime(df_mines['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_mines['COD_MUNI'] = df_mines['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing number of mines intervened
df_mines = df_mines.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_mines.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (15 = Intervened Mines) — stored as zero-padded string for consistency with top5
df_mines['crime_code'] = '15'

# Save result in Parquet format
df_mines.to_parquet(temp / 'mindef' / 'other' / 'intervened_mines.parquet', index=False)

In [22]:
# Sanity checks for intervened mines
print("INTERVENED MINES - Data Quality Checks:")
print(f"  Total records: {len(df_mines):,}")
print(f"  Date range: {df_mines['date'].min()} to {df_mines['date'].max()}")
print(f"  Unique municipalities: {df_mines['mun_code'].nunique()}")
print(f"  Total mines: {df_mines['qty'].sum():,}")
print(f"  Null values: {df_mines.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_mines['qty'] < 0).sum()}")
print(f"  Duplicates: {df_mines.duplicated(subset=['date', 'mun_code']).sum()}")
print()

INTERVENED MINES - Data Quality Checks:
  Total records: 7,838
  Date range: 2010-01-01 to 2025-09-01
  Unique municipalities: 723
  Total mines: 41,957
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 16 | Human Trafficking

In [23]:
# TRATA DE PERSONAS Y TRÁFICO DE MIGRANTES
# Read Excel file with human trafficking and migrant smuggling data
# Note: has extra column DESCRIPCION CONDUCTA — dropped during aggregation
# Note: this file already uses FECHA_HECHO (with underscore)
df_trafficking = pd.read_excel(raw / 'mindef' / 'other' / 'TRATA DE PERSONAS Y TRÁFICO DE MIGRANTES.xlsx')

# Convert date to monthly period (first day of month) then to date format
df_trafficking['FECHA_HECHO'] = pd.to_datetime(df_trafficking['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_trafficking['COD_MUNI'] = df_trafficking['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing number of events (DESCRIPCION CONDUCTA is dropped)
df_trafficking = df_trafficking.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_trafficking.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (16 = Human Trafficking) — stored as zero-padded string for consistency with top5
df_trafficking['crime_code'] = '16'

# Save result in Parquet format
df_trafficking.to_parquet(temp / 'mindef' / 'other' / 'human_trafficking.parquet', index=False)

In [24]:
# Sanity checks for human trafficking
print("HUMAN TRAFFICKING - Data Quality Checks:")
print(f"  Total records: {len(df_trafficking):,}")
print(f"  Date range: {df_trafficking['date'].min()} to {df_trafficking['date'].max()}")
print(f"  Unique municipalities: {df_trafficking['mun_code'].nunique()}")
print(f"  Total events: {df_trafficking['qty'].sum():,}")
print(f"  Null values: {df_trafficking.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_trafficking['qty'] < 0).sum()}")
print(f"  Duplicates: {df_trafficking.duplicated(subset=['date', 'mun_code']).sum()}")
print()

HUMAN TRAFFICKING - Data Quality Checks:
  Total records: 1,749
  Date range: 2003-12-01 to 2025-09-01
  Unique municipalities: 295
  Total events: 5,852
  Null values: 0
  Negative quantities: 0
  Duplicates: 0



## crime_code = 17 | Pipeline Bombing

In [25]:
# VOLADURA DE OLEODUCTOS
# Read Excel file with pipeline bombing data
df_pipeline = pd.read_excel(raw / 'mindef' / 'other' / 'VOLADURA DE OLEODUCTOS.xlsx')

# Rename date column to match standard name before processing
df_pipeline.rename(columns={'FECHA HECHO': 'FECHA_HECHO'}, inplace=True)

# Convert date to monthly period (first day of month) then to date format
df_pipeline['FECHA_HECHO'] = pd.to_datetime(df_pipeline['FECHA_HECHO']).dt.to_period('M').dt.to_timestamp().dt.date

# Standardize municipality code: convert to 5-digit string (zero-padded)
df_pipeline['COD_MUNI'] = df_pipeline['COD_MUNI'].astype(str).str.zfill(5)

# Group by date and municipality, summing number of events
df_pipeline = df_pipeline.groupby(['FECHA_HECHO', 'COD_MUNI'])[['CANTIDAD']].sum().reset_index()

# Rename columns to English for standardization
df_pipeline.rename(columns={'FECHA_HECHO': 'date', 'COD_MUNI': 'mun_code', 'CANTIDAD': 'qty'}, inplace=True)

# Assign crime code (17 = Pipeline Bombing) — stored as zero-padded string for consistency with top5
df_pipeline['crime_code'] = '17'

# Save result in Parquet format
df_pipeline.to_parquet(temp / 'mindef' / 'other' / 'pipeline_bombing.parquet', index=False)

In [26]:
# Sanity checks for pipeline bombing
print("PIPELINE BOMBING - Data Quality Checks:")
print(f"  Total records: {len(df_pipeline):,}")
print(f"  Date range: {df_pipeline['date'].min()} to {df_pipeline['date'].max()}")
print(f"  Unique municipalities: {df_pipeline['mun_code'].nunique()}")
print(f"  Total events: {df_pipeline['qty'].sum():,}")
print(f"  Null values: {df_pipeline.isnull().sum().sum()}")
print(f"  Negative quantities: {(df_pipeline['qty'] < 0).sum()}")
print(f"  Duplicates: {df_pipeline.duplicated(subset=['date', 'mun_code']).sum()}")
print()

PIPELINE BOMBING - Data Quality Checks:
  Total records: 757
  Date range: 2007-01-01 to 2025-08-01
  Unique municipalities: 59
  Total events: 1,358
  Null values: 0
  Negative quantities: 0
  Duplicates: 0

