## Merge datasets related to crime, subway and camera location in Mexico City


Additional datasets are now available: the subway lines in Mexico City, the location of the subway stations, and the location of the security cameras installed under the city program *Mi calle*. After some cleaning and transformation steps, this notebook aims to join that information with the reported crimes in a single dataframe, using the neighborhood and/or district information as the key.

First, we import the required libraries and load the reported crimes dataset.

In [129]:
import pandas as pd
import zipfile
import numpy as np
import matplotlib.pyplot as plt

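# Path to the local copy of the repository.
# Replace the second assignment with your own path. The raw string (r"...")
# keeps the Windows backslashes from being read as escape sequences.
repo_path = "INSERT YOUR LOCAL SDC-SECURITY REPO PATH HERE"
repo_path = r"D:\Archivos\Social Data Challenge\sdc-security"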


In [130]:
crime_data_path = r"\datasets\da_victimas_completa_marzo_2023.zip"

zf = zipfile.ZipFile(repo_path+crime_data_path) 
crimes_raw = pd.read_csv(zf.open('da_victimas_completa_marzo_2023.csv'))
crimes_raw.head()


Unnamed: 0,idCarpeta,Año_inicio,Mes_inicio,FechaInicio,Delito,Categoria,Sexo,Edad,TipoPersona,CalidadJuridica,...,Mes_hecho,FechaHecho,HoraHecho,HoraInicio,alcaldia_hechos,municipio_hechos,colonia_datos,fgj_colonia_registro,latitud,longitud
0,8324429.0,2019,Enero,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62.0,FISICA,OFENDIDO,...,Agosto,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430.0,2019,Enero,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431.0,2019,Enero,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-22,15:30:00,12:23:00,COYOACAN,,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435.0,2019,Enero,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35.0,FISICA,VICTIMA Y DENUNCIANTE,...,Enero,2019-01-04,06:00:00,12:27:00,IZTACALCO,,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438.0,2019,Enero,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,,FISICA,VICTIMA,...,Enero,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324
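
As a possible alternative to string concatenation, the path can be built with `pathlib`, which avoids backslash escape issues and works across operating systems. The sketch below is self-contained: it writes a tiny stand-in archive to a temporary directory (the name `demo.zip` and its contents are made up for illustration) and reads it back. pandas can read a zip archive directly when it holds exactly one file.

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the local repo path; replace with your own.
repo_dir = Path(tempfile.mkdtemp())
datasets_dir = repo_dir / "datasets"
datasets_dir.mkdir()

# Build a tiny zip archive containing a single CSV (illustrative data only).
zip_path = datasets_dir / "demo.zip"
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("demo.csv", "idCarpeta,Delito\n1,FRAUDE\n2,ROBO\n")

# pandas infers zip compression from the file extension when the archive
# contains exactly one file, so no explicit ZipFile handling is needed.
demo = pd.read_csv(zip_path)
print(demo.shape)
```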


## Transforming the crime dataset

In [131]:
#Change column names to lowercase
crimes_raw.rename(columns = lambda x : x.lower() , inplace = True)
crimes_raw.head()

Unnamed: 0,idcarpeta,año_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,municipio_hechos,colonia_datos,fgj_colonia_registro,latitud,longitud
0,8324429.0,2019,Enero,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62.0,FISICA,OFENDIDO,...,Agosto,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430.0,2019,Enero,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431.0,2019,Enero,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-22,15:30:00,12:23:00,COYOACAN,,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435.0,2019,Enero,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35.0,FISICA,VICTIMA Y DENUNCIANTE,...,Enero,2019-01-04,06:00:00,12:27:00,IZTACALCO,,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438.0,2019,Enero,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,,FISICA,VICTIMA,...,Enero,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324


### Handling null values

Many columns contain null values. Since the crimes dataset has more than 1M records, dropping the records with null values still leaves a large sample and yields a lighter, fully populated dataset.

In [132]:
null_counts = crimes_raw.isnull().sum()
print(null_counts)

idcarpeta                     0
año_inicio                    0
mes_inicio                    0
fechainicio                   0
delito                        0
categoria                     0
sexo                     190025
edad                     366188
tipopersona                6645
calidadjuridica               1
competencia                   0
año_hecho                   377
mes_hecho                   377
fechahecho                  377
horahecho                   368
horainicio                    1
alcaldia_hechos               0
municipio_hechos        1028246
colonia_datos             73721
fgj_colonia_registro      50410
latitud                   50202
longitud                  50204
dtype: int64


Almost every value in the column *municipio_hechos* is null (1,028,246 records), so this column will be dropped. After that, every row that still contains a null value will be deleted.


In [133]:
del crimes_raw['municipio_hechos']

In [134]:
crimes = crimes_raw.dropna().copy()

print('Original crime dataset shape is: {}'.format(crimes_raw.shape))
print('The shape of the new crime dataset without null values is: {}'.format(crimes.shape))



Original crime dataset shape is: (1038430, 21)
The shape of the new crime dataset without null values is: (627976, 21)
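
The two-step cleanup above (drop a mostly empty column, then drop incomplete rows) can be sketched on a toy frame. The 70% threshold and the sample values are assumptions chosen for illustration, not values used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for crimes_raw (values are illustrative only).
toy = pd.DataFrame({
    "delito": ["FRAUDE", "ROBO", "ROBO", "FRAUDE"],
    "edad": [62.0, np.nan, 42.0, 35.0],
    "municipio_hechos": [np.nan, np.nan, np.nan, "X"],
})

# Fraction of missing values per column.
null_share = toy.isnull().mean()

# Assumed threshold: drop columns that are more than 70% empty.
threshold = 0.7
sparse_cols = null_share[null_share > threshold].index.tolist()

# Drop the sparse columns first, then the rows that still have nulls.
cleaned = toy.drop(columns=sparse_cols).dropna()
print(sparse_cols, cleaned.shape)
```

Dropping the sparse column first matters: if the rows were dropped first, almost every record would be lost because of that single column.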


The result is a new dataset with more than 600k rows and no null values. Next, the column names are adjusted by replacing the letter *ñ* with *ni*.

In [135]:
crimes.columns = crimes.columns.str.replace('ñ', 'ni')


Rounding numeric values

In [136]:
# These columns were read as floats; round them and cast to integers
crimes["idcarpeta"]  = crimes["idcarpeta"].round().astype(int)
crimes["edad"]       = crimes["edad"].round().astype(int)
crimes["anio_hecho"] = crimes["anio_hecho"].round().astype(int)
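
Casting to `int` works here only because the rows with nulls were removed beforehand; on a float column that still contains `NaN`, `astype(int)` raises an error. If the rows with missing ages had to be kept, the nullable `Int64` dtype in pandas is one option. A minimal sketch on made-up values:

```python
import numpy as np
import pandas as pd

# Illustrative ages with one missing value (not taken from the dataset).
ages = pd.Series([62.0, np.nan, 38.0])

# The nullable integer dtype keeps the missing entry as <NA>.
ages_int = ages.round().astype("Int64")
print(ages_int.isna().sum())
```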


Converting month names to numeric values

In [137]:
month_name_to_number = {
    'enero': 1,
    'febrero': 2,
    'marzo': 3,
    'abril': 4,
    'mayo': 5,
    'junio': 6,
    'julio': 7,
    'agosto': 8,
    'septiembre': 9,
    'octubre': 10,
    'noviembre': 11,
    'diciembre': 12
}

crimes["mes_inicio"] = crimes["mes_inicio"].str.lower().map(month_name_to_number) 
crimes["mes_hecho"] = crimes["mes_hecho"].str.lower().map(month_name_to_number) 

crimes.head()

Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,anio_hecho,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,colonia_datos,fgj_colonia_registro,latitud,longitud
0,8324429,2019,1,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62,FISICA,OFENDIDO,...,2018,8,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430,2019,1,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431,2019,1,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-22,15:30:00,12:23:00,COYOACAN,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435,2019,1,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35,FISICA,VICTIMA Y DENUNCIANTE,...,2019,1,2019-01-04,06:00:00,12:27:00,IZTACALCO,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
5,8324442,2019,1,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,42,FISICA,OFENDIDO,...,2018,10,2018-10-12,18:00:00,12:38:00,COYOACAN,LOS REYES (PBLO),PUEBLO DE LOS REYES,19.33537,-99.16016
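
`Series.map` returns `NaN` for any value that is not a key of the dictionary, so a misspelled month name would silently become a null. The sketch below shows one way to check for that; the sample values, including the variant spelling "Setiembre", are made up for illustration.

```python
import pandas as pd

month_name_to_number = {
    "enero": 1, "febrero": 2, "marzo": 3, "abril": 4,
    "mayo": 5, "junio": 6, "julio": 7, "agosto": 8,
    "septiembre": 9, "octubre": 10, "noviembre": 11, "diciembre": 12,
}

# Illustrative values; "Setiembre" is a made-up spelling variant.
months = pd.Series(["Enero", "Agosto", "Setiembre"])
mapped = months.str.lower().map(month_name_to_number)

# Names that are not dictionary keys become NaN; list them for review.
unmapped = months[mapped.isna()].unique().tolist()
print(unmapped)
```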


In [138]:
crimes[crimes["colonia_datos"] != crimes["fgj_colonia_registro"]].shape

(486658, 21)

The output above shows that *colonia_datos* and *fgj_colonia_registro* differ in 486,658 of the 627,976 records. The reason for the difference between these columns should be investigated.
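
One possible starting point for that investigation is to rule out purely cosmetic differences (accents, case, stray whitespace) and then look at the most frequent mismatched pairs. The sketch below uses toy rows partly modeled on the sample output; the helper `normalize_name` and the third row are hypothetical.

```python
import unicodedata

import pandas as pd


def normalize_name(name: str) -> str:
    """Uppercase, strip accents and surrounding whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return no_accents.upper().strip()


# Toy rows; the third one is made up to show an accent-only difference.
toy = pd.DataFrame({
    "colonia_datos": ["GUADALUPE INN", "PANTITLAN V", "AGRÍCOLA PANTITLAN"],
    "fgj_colonia_registro": ["GUADALUPE INN", "AGRÍCOLA PANTITLAN", "AGRICOLA PANTITLAN"],
})

left = toy["colonia_datos"].map(normalize_name)
right = toy["fgj_colonia_registro"].map(normalize_name)

# Rows that still differ after normalization, most common pairs first.
mismatches = toy[left != right]
top_pairs = mismatches.value_counts(["colonia_datos", "fgj_colonia_registro"])
print(len(mismatches))
```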

## Getting the new datasets

In [139]:
!pip install xlrd



In [140]:
# Paths of the datasets, relative to the local repo path.
# Raw strings keep the backslashes from being read as escape sequences.
metro_stations_path = r"\datasets\metro\metro_cdmx_estaciones.xls"
cams_path = r"\datasets\mi-calle_camaras\programa-mi-calle-shapes.zip"

stations_raw = pd.read_excel(repo_path+metro_stations_path)

zf = zipfile.ZipFile(repo_path+cams_path) 
cams_raw = pd.read_csv(zf.open('programa-mi-calle-shapes.csv'))


In [141]:
stations_raw.head()

Unnamed: 0,FID,geometry,SISTEMA,NOMBRE,LINEA,EST,CVE_EST,CVE_EOD17,TIPO,ALCALDIAS,A_O
0,cdmx_estaciones_metro.1,POINT (-99.07473572701159 19.416334085596525),STC Metro,Pantitlán,1,1,STC0101,5014,Terminal / Transbordo,Venustiano Carranza,1984
1,cdmx_estaciones_metro.2,POINT (-99.0822888117768 19.41192045593743),STC Metro,Zaragoza,1,2,STC0102,5020,Intermedia,Venustiano Carranza,1969
2,cdmx_estaciones_metro.3,POINT (-99.09021039171121 19.416478456907367),STC Metro,Gomez Farías,1,3,STC0103,5007,Intermedia,Venustiano Carranza,1969
3,cdmx_estaciones_metro.4,POINT (-99.09625873241016 19.419941964850207),STC Metro,Boulevard Puerto Aéreo,1,4,STC0104,5003,Intermedia,Venustiano Carranza,1969
4,cdmx_estaciones_metro.5,POINT (-99.10277436397425 19.42335533557525),STC Metro,Balbuena,1,5,STC0105,5001,Intermedia,Venustiano Carranza,1969


In [142]:
cams_raw.head()

Unnamed: 0,id,alcaldia,colonia,barrioproy,crucesproy,senderopro,unidadproy,totalproye,barrioisnt,crucesisnt,senderoisn,unidadisnt,totalinsta,avance,prioritari,geo_shape,geo_point_2d
0,0,IZTACALCO,DE SANTA CRUZ,2,0,2,0,4,2,0,2,0,4,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1183...","19.3882552948,-99.1202643652"
1,1,IZTACALCO,EL MOSCO CHINAMPA,1,0,0,0,1,1,0,0,0,1,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1006...","19.3887000042,-99.1020342766"
2,2,IZTACALCO,LOS PICOS DE IZTACALCO II A,2,0,0,0,2,2,0,0,0,2,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1050...","19.3884203799,-99.1065451021"
3,3,IZTACALCO,SANTIAGO SUR,3,1,3,1,8,3,1,3,1,8,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1218...","19.3887099506,-99.1252431061"
4,4,IZTACALCO,TLAZINTLA,2,1,0,0,3,2,1,0,0,3,100,Si,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1100...","19.3964289682,-99.1118026548"


## Transforming the new datasets
### Metro stations dataset

In [143]:
stations_raw.rename(columns = lambda x : x.lower() , inplace = True)
stations_raw.rename(columns = {"a_o":"year", "fid":"id"} , inplace = True)

del stations_raw['sistema']


The column *geometry* contains the longitude and latitude of each station as a string. The numeric values will be extracted from this string and stored in two new columns.

In [144]:

import re

# Regular expression that captures the two decimals inside "POINT (lon lat)"
pattern = r"\((-?\d+\.\d+) (-?\d+\.\d+)\)"

def extract_coordinates(point_str):
    # Return longitude and latitude as strings, or None values when nothing matches
    matches = re.findall(pattern, point_str)
    if matches:
        return pd.Series(matches[0], index=['station_longitude', 'station_latitude'])
    return pd.Series([None, None], index=['station_longitude', 'station_latitude'])

stations_raw[['station_longitude', 'station_latitude']] = stations_raw['geometry'].apply(extract_coordinates)
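
The row-wise `apply` above keeps the coordinates as strings. A vectorized variant with `Series.str.extract` and named groups is sketched below as an alternative, not as the method used in this notebook; it also converts the result to floats. The two sample points are copied from the `stations_raw.head()` output.

```python
import pandas as pd

# Two WKT-style points copied from the sample output.
geometry = pd.Series([
    "POINT (-99.07473572701159 19.416334085596525)",
    "POINT (-99.0822888117768 19.41192045593743)",
])

# Named groups become column names; these points store longitude first.
pattern = r"\((?P<station_longitude>-?\d+\.\d+) (?P<station_latitude>-?\d+\.\d+)\)"
coords = geometry.str.extract(pattern).astype(float)
print(coords.dtypes)
```

Rows that do not match the pattern would come back as `NaN` in both columns, which mirrors the `None` fallback of the function above.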



The columns *geometry* and *cve_est* will be dropped because they are redundant. The *id* column will be converted to a numeric value.

In [145]:
del stations_raw['geometry']
del stations_raw['cve_est']

The string values in the *id* column will be replaced by numeric values by removing the text prefix and keeping only the number.

In [146]:
# regex=False treats the "." in the prefix as a literal dot instead of a wildcard
stations_raw['id'] = stations_raw['id'].str.replace('cdmx_estaciones_metro.', '', regex=False)
stations_raw['id'] = pd.to_numeric(stations_raw['id'])


In [147]:
stations_raw.head()

Unnamed: 0,id,nombre,linea,est,cve_eod17,tipo,alcaldias,year,station_longitude,station_latitude
0,1,Pantitlán,1,1,5014,Terminal / Transbordo,Venustiano Carranza,1984,-99.0747357270116,19.416334085596525
1,2,Zaragoza,1,2,5020,Intermedia,Venustiano Carranza,1969,-99.0822888117768,19.41192045593743
2,3,Gomez Farías,1,3,5007,Intermedia,Venustiano Carranza,1969,-99.0902103917112,19.416478456907367
3,4,Boulevard Puerto Aéreo,1,4,5003,Intermedia,Venustiano Carranza,1969,-99.09625873241016,19.419941964850207
4,5,Balbuena,1,5,5001,Intermedia,Venustiano Carranza,1969,-99.10277436397423,19.42335533557525
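
Since the crimes dataset stores district names in uppercase without accents (for example `ALVARO OBREGON`) while the stations use title case with accents, a join on the district would need a common key. The sketch below counts stations per district under that assumption; the column `alcaldia_key`, the name `n_stations`, and the third toy row are made up for illustration.

```python
import pandas as pd

# Toy stations modeled on the sample output (the third row is made up).
stations = pd.DataFrame({
    "nombre": ["Pantitlán", "Zaragoza", "Copilco"],
    "alcaldias": ["Venustiano Carranza", "Venustiano Carranza", "Coyoacán"],
})

# Uppercase and strip accents so names resemble crimes["alcaldia_hechos"].
alcaldia_key = (
    stations["alcaldias"]
    .str.normalize("NFKD")
    .str.encode("ascii", errors="ignore")
    .str.decode("ascii")
    .str.upper()
)

# Number of stations per district, ready to merge on the district name.
stations_per_alcaldia = (
    stations.assign(alcaldia_key=alcaldia_key)
    .groupby("alcaldia_key")
    .size()
    .rename("n_stations")
    .reset_index()
)
print(stations_per_alcaldia)
```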


### Cams dataset


In [148]:
cams_raw.head()

Unnamed: 0,id,alcaldia,colonia,barrioproy,crucesproy,senderopro,unidadproy,totalproye,barrioisnt,crucesisnt,senderoisn,unidadisnt,totalinsta,avance,prioritari,geo_shape,geo_point_2d
0,0,IZTACALCO,DE SANTA CRUZ,2,0,2,0,4,2,0,2,0,4,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1183...","19.3882552948,-99.1202643652"
1,1,IZTACALCO,EL MOSCO CHINAMPA,1,0,0,0,1,1,0,0,0,1,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1006...","19.3887000042,-99.1020342766"
2,2,IZTACALCO,LOS PICOS DE IZTACALCO II A,2,0,0,0,2,2,0,0,0,2,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1050...","19.3884203799,-99.1065451021"
3,3,IZTACALCO,SANTIAGO SUR,3,1,3,1,8,3,1,3,1,8,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1218...","19.3887099506,-99.1252431061"
4,4,IZTACALCO,TLAZINTLA,2,1,0,0,3,2,1,0,0,3,100,Si,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1100...","19.3964289682,-99.1118026548"


The cams dataset contains a column with the centroid of each *colonia* (neighborhood). Its latitude and longitude will be extracted and saved in two new numeric columns. Also, the values of the categorical variable *prioritari* (Si/No) will be replaced by 1 or 0.

In [149]:
cams_raw[['centroid_lat', 'centroid_long']] = cams_raw['geo_point_2d'].str.split(',', expand=True)
cams_raw['centroid_lat'] = pd.to_numeric(cams_raw['centroid_lat'])
cams_raw['centroid_long'] = pd.to_numeric(cams_raw['centroid_long'])
del cams_raw['geo_point_2d']

cams_raw['prioritari'] = cams_raw['prioritari'].replace({'Si': 1, 'No': 0})
cams_raw.head()



Unnamed: 0,id,alcaldia,colonia,barrioproy,crucesproy,senderopro,unidadproy,totalproye,barrioisnt,crucesisnt,senderoisn,unidadisnt,totalinsta,avance,prioritari,geo_shape,centroid_lat,centroid_long
0,0,IZTACALCO,DE SANTA CRUZ,2,0,2,0,4,2,0,2,0,4,100,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1183...",19.388255,-99.120264
1,1,IZTACALCO,EL MOSCO CHINAMPA,1,0,0,0,1,1,0,0,0,1,100,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1006...",19.3887,-99.102034
2,2,IZTACALCO,LOS PICOS DE IZTACALCO II A,2,0,0,0,2,2,0,0,0,2,100,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1050...",19.38842,-99.106545
3,3,IZTACALCO,SANTIAGO SUR,3,1,3,1,8,3,1,3,1,8,100,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1218...",19.38871,-99.125243
4,4,IZTACALCO,TLAZINTLA,2,1,0,0,3,2,1,0,0,3,100,1,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1100...",19.396429,-99.111803
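
The *geo_shape* column shown above appears to hold each neighborhood polygon as a GeoJSON-style string. If the polygon itself is needed later, it can be parsed with the standard `json` module. The sketch below uses a made-up polygon in the same layout and a hypothetical helper, `outer_ring_size`, that also tolerates missing shapes.

```python
import json

import pandas as pd

# Made-up polygon in the same GeoJSON layout as the geo_shape column.
shapes = pd.Series([
    '{"type": "Polygon", "coordinates": '
    '[[[-99.12, 19.38], [-99.11, 19.38], [-99.11, 19.39], [-99.12, 19.38]]]}',
    None,
])


def outer_ring_size(geo_shape):
    """Return the number of vertices in the outer ring, or None if missing."""
    if pd.isna(geo_shape):
        return None
    return len(json.loads(geo_shape)["coordinates"][0])


vertex_counts = shapes.map(outer_ring_size)
print(vertex_counts.tolist())
```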


In [150]:
null_counts = cams_raw.isnull().sum()
print(null_counts)

id                0
alcaldia          0
colonia           0
barrioproy        0
crucesproy        0
senderopro        0
unidadproy        0
totalproye        0
barrioisnt        0
crucesisnt        0
senderoisn        0
unidadisnt        0
totalinsta        0
avance            0
prioritari        0
geo_shape        17
centroid_lat     17
centroid_long    17
dtype: int64


Only 17 rows lack neighborhood location data (*geo_shape* and the centroid coordinates). Those rows will be kept for now. However, the number of features in this dataset will be reduced so that it keeps only the total number of installed cameras per neighborhood and the neighborhood location info.

In [151]:
cams = cams_raw.loc[:,['id','alcaldia','colonia','totalinsta','prioritari','geo_shape','centroid_lat','centroid_long']]
cams.head()

Unnamed: 0,id,alcaldia,colonia,totalinsta,prioritari,geo_shape,centroid_lat,centroid_long
0,0,IZTACALCO,DE SANTA CRUZ,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1183...",19.388255,-99.120264
1,1,IZTACALCO,EL MOSCO CHINAMPA,1,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1006...",19.3887,-99.102034
2,2,IZTACALCO,LOS PICOS DE IZTACALCO II A,2,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1050...",19.38842,-99.106545
3,3,IZTACALCO,SANTIAGO SUR,8,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1218...",19.38871,-99.125243
4,4,IZTACALCO,TLAZINTLA,3,1,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1100...",19.396429,-99.111803
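
With the camera data reduced to one row per neighborhood, the join with the crimes could look like the sketch below. It assumes that *alcaldia_hechos*/*colonia_datos* in the crimes dataset and *alcaldia*/*colonia* in the cams dataset use matching spellings, which has not been verified here; the toy values are illustrative.

```python
import pandas as pd

# Toy frames using the column names from this notebook; values are illustrative.
crimes_toy = pd.DataFrame({
    "idcarpeta": [1, 2, 3],
    "alcaldia_hechos": ["IZTACALCO", "IZTACALCO", "COYOACAN"],
    "colonia_datos": ["SANTIAGO SUR", "TLAZINTLA", "COPILCO EL BAJO"],
})
cams_toy = pd.DataFrame({
    "alcaldia": ["IZTACALCO", "IZTACALCO"],
    "colonia": ["SANTIAGO SUR", "TLAZINTLA"],
    "totalinsta": [8, 3],
})

# Left join keeps every crime; indicator=True records whether a match was found.
merged = crimes_toy.merge(
    cams_toy,
    how="left",
    left_on=["alcaldia_hechos", "colonia_datos"],
    right_on=["alcaldia", "colonia"],
    indicator=True,
)

# Share of crimes whose neighborhood has camera information.
match_rate = (merged["_merge"] == "both").mean()
print(match_rate)
```

A low match rate on the real data would suggest that the neighborhood names need normalization before merging.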
