## Merge datasets related to crime, subway and camera location in Mexico City


Several additional datasets are now available: the subway lines in Mexico City, the location of the subway stations, and the location of the security cameras installed under the city program *Mi calle*. After some cleaning and transformation steps, this notebook joins that information with the reported crimes in a single dataframe, using the neighborhood and/or district information as the key.

First, we import the required libraries and load the reported crimes dataset:

In [74]:
import pandas as pd
import zipfile
import numpy as np
import matplotlib.pyplot as plt

# Path
# Insert your local sdc-security repo path here (raw string, so backslashes are kept literally)
repo_path = r"D:\Archivos\Social Data Challenge\sdc-security"


In [75]:
crime_data_path = r"\datasets\da_victimas_completa_marzo_2023.zip"

zf = zipfile.ZipFile(repo_path+crime_data_path) 
crimes_raw = pd.read_csv(zf.open('da_victimas_completa_marzo_2023.csv'))
crimes_raw.head()


Unnamed: 0,idCarpeta,Año_inicio,Mes_inicio,FechaInicio,Delito,Categoria,Sexo,Edad,TipoPersona,CalidadJuridica,...,Mes_hecho,FechaHecho,HoraHecho,HoraInicio,alcaldia_hechos,municipio_hechos,colonia_datos,fgj_colonia_registro,latitud,longitud
0,8324429.0,2019,Enero,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62.0,FISICA,OFENDIDO,...,Agosto,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430.0,2019,Enero,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431.0,2019,Enero,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-22,15:30:00,12:23:00,COYOACAN,,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435.0,2019,Enero,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35.0,FISICA,VICTIMA Y DENUNCIANTE,...,Enero,2019-01-04,06:00:00,12:27:00,IZTACALCO,,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438.0,2019,Enero,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,,FISICA,VICTIMA,...,Enero,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324
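The loading step above builds paths by concatenating strings with Windows backslashes. As a hedged sketch of a more portable variant, the helper below uses `pathlib` and context managers; `read_zipped_csv` is a name invented for this illustration and the repo location in the usage comment is a placeholder.

```python
from pathlib import Path
import zipfile

import pandas as pd


def read_zipped_csv(zip_path, member_name):
    """Read one CSV member of a zip archive into a DataFrame."""
    # Context managers close the archive and the member once reading is done
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member_name) as handle:
            return pd.read_csv(handle)


# Hypothetical usage; replace the placeholder with your local repo path:
# repo = Path("path/to/sdc-security")
# crimes_raw = read_zipped_csv(
#     repo / "datasets" / "da_victimas_completa_marzo_2023.zip",
#     "da_victimas_completa_marzo_2023.csv",
# )
```

Joining path segments with `/` on a `Path` object avoids escape-sequence problems and works on Windows, Linux and macOS alike.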


## Transforming the crime dataset

In [76]:
#Change column names
crimes_raw.rename(columns = lambda x : x.lower() , inplace = True)
crimes_raw.columns = crimes_raw.columns.str.replace('ñ', 'ni')
crimes_raw.rename(columns = {"latitud":"crimen_lat", "longitud":"crimen_lon"} , inplace = True)

crimes_raw.head()

Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,municipio_hechos,colonia_datos,fgj_colonia_registro,crimen_lat,crimen_lon
0,8324429.0,2019,Enero,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62.0,FISICA,OFENDIDO,...,Agosto,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430.0,2019,Enero,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431.0,2019,Enero,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-22,15:30:00,12:23:00,COYOACAN,,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435.0,2019,Enero,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35.0,FISICA,VICTIMA Y DENUNCIANTE,...,Enero,2019-01-04,06:00:00,12:27:00,IZTACALCO,,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438.0,2019,Enero,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,,FISICA,VICTIMA,...,Enero,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324


### Handling null values

Many columns contain null values. Since the crimes dataset has more than 1M records, dropping the records that contain null values still leaves plenty of data and yields a lighter, cleaner dataset.

In [77]:
null_counts = crimes_raw.isnull().sum()
print(null_counts)

idcarpeta                     0
anio_inicio                   0
mes_inicio                    0
fechainicio                   0
delito                        0
categoria                     0
sexo                     190025
edad                     366188
tipopersona                6645
calidadjuridica               1
competencia                   0
anio_hecho                  377
mes_hecho                   377
fechahecho                  377
horahecho                   368
horainicio                    1
alcaldia_hechos               0
municipio_hechos        1028246
colonia_datos             73721
fgj_colonia_registro      50410
crimen_lat                50202
crimen_lon                50204
dtype: int64


About 99% of the values in *municipio_hechos* are null (1,028,246 of 1,038,430), so this column will be dropped. After that, every row containing a null value will be deleted, except for nulls in *edad*: that column is excluded from the check so that most of the rows are kept.
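The rule described above can be illustrated on a toy frame before it is applied to the real data. This is only a sketch: the toy values and the 0.9 threshold for a "mostly empty" column are assumptions made for the example, not values used by this notebook.

```python
import numpy as np
import pandas as pd

# Toy data that mimics the null patterns of the crimes dataset
toy = pd.DataFrame({
    "sexo": ["Masculino", None, "Femenino", "Masculino"],
    "edad": [62.0, np.nan, np.nan, 35.0],
    "municipio_hechos": [np.nan, np.nan, np.nan, np.nan],
    "crimen_lat": [19.36, 19.47, np.nan, 19.40],
})

# Share of nulls per column (0.0 to 1.0)
null_share = toy.isnull().mean()

# Drop columns that are almost entirely empty (assumed threshold: 90%)
mostly_empty = null_share[null_share > 0.9].index.tolist()
toy = toy.drop(columns=mostly_empty)

# Drop rows with nulls in any remaining column except 'edad'
required = toy.columns.drop("edad")
cleaned = toy.dropna(subset=required)

print(mostly_empty)
print(cleaned)
```

Only rows 0 and 3 survive here: row 1 lacks *sexo* and row 2 lacks *crimen_lat*, while a missing *edad* alone would not remove a row.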


In [78]:
del crimes_raw['municipio_hechos']

In [80]:
columns_to_dropna = crimes_raw.columns.drop('edad')
#crimes = crimes_raw.dropna().copy()
crimes = crimes_raw.dropna(subset = columns_to_dropna).copy()

print('Original crime dataset shape is: {}'.format(crimes_raw.shape))
print('The shape of the new crime dataset without null values is: {}'.format(crimes.shape))

crimes.head()


Original crime dataset shape is: (1038430, 21)
The shape of the new crime dataset without null values is: (784813, 21)


Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,anio_hecho,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,colonia_datos,fgj_colonia_registro,crimen_lat,crimen_lon
0,8324429.0,2019,Enero,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62.0,FISICA,OFENDIDO,...,2018.0,Agosto,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430.0,2019,Enero,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38.0,FISICA,VICTIMA Y DENUNCIANTE,...,2018.0,Diciembre,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431.0,2019,Enero,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42.0,FISICA,VICTIMA Y DENUNCIANTE,...,2018.0,Diciembre,2018-12-22,15:30:00,12:23:00,COYOACAN,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435.0,2019,Enero,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35.0,FISICA,VICTIMA Y DENUNCIANTE,...,2019.0,Enero,2019-01-04,06:00:00,12:27:00,IZTACALCO,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438.0,2019,Enero,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,,FISICA,VICTIMA,...,2019.0,Enero,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324


The next step is to cast the numeric columns to integers to avoid unnecessary decimals. Missing values in *edad* are filled with the column mean before the cast.

In [81]:
crimes["idcarpeta"]  = crimes["idcarpeta"].round().astype(int)
crimes["anio_hecho"] = crimes["anio_hecho"].round().astype(int)

average              = crimes['edad'].mean()
crimes["edad"]       = crimes["edad"].fillna(average)
crimes["edad"]       = crimes["edad"].round().astype(int)
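Filling missing ages with the mean makes imputed and observed values indistinguishable afterwards. As a hedged alternative, the sketch below keeps a flag for the imputed rows, or keeps the gap explicit with pandas' nullable `Int64` dtype. The toy series and the names `was_missing` and `nullable` are assumptions for this example only.

```python
import numpy as np
import pandas as pd

ages = pd.Series([62.0, 38.0, np.nan, 35.0], name="edad")

# Approach used in this notebook: fill the gaps with the mean, then cast to int
filled = ages.fillna(ages.mean()).round().astype(int)

# Alternative 1: remember which rows were imputed
was_missing = ages.isna()

# Alternative 2: keep missing ages as <NA> using the nullable integer dtype
nullable = ages.round().astype("Int64")

print(filled.tolist())
print(was_missing.tolist())
print(nullable)
```

Which option fits best depends on the later analysis: models that cannot handle missing values need the filled version, while descriptive statistics on age are less distorted when the gaps stay visible.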


Converting month names to numeric values:

In [82]:
month_name_to_number = {
    'enero': 1,
    'febrero': 2,
    'marzo': 3,
    'abril': 4,
    'mayo': 5,
    'junio': 6,
    'julio': 7,
    'agosto': 8,
    'septiembre': 9,
    'octubre': 10,
    'noviembre': 11,
    'diciembre': 12
}

crimes["mes_inicio"] = crimes["mes_inicio"].str.lower().map(month_name_to_number) 
crimes["mes_hecho"] = crimes["mes_hecho"].str.lower().map(month_name_to_number) 

crimes.head()

Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,anio_hecho,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,colonia_datos,fgj_colonia_registro,crimen_lat,crimen_lon
0,8324429,2019,1,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62,FISICA,OFENDIDO,...,2018,8,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430,2019,1,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431,2019,1,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-22,15:30:00,12:23:00,COYOACAN,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435,2019,1,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35,FISICA,VICTIMA Y DENUNCIANTE,...,2019,1,2019-01-04,06:00:00,12:27:00,IZTACALCO,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438,2019,1,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,39,FISICA,VICTIMA,...,2019,1,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324
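`Series.map` silently returns `NaN` for any value that is not a key of the dictionary, so a quick check for unmapped month names can be useful. The sketch below assumes made-up input values (including the spelling variant "Setiembre") and is not part of the original pipeline.

```python
import pandas as pd

month_name_to_number = {
    'enero': 1, 'febrero': 2, 'marzo': 3, 'abril': 4,
    'mayo': 5, 'junio': 6, 'julio': 7, 'agosto': 8,
    'septiembre': 9, 'octubre': 10, 'noviembre': 11, 'diciembre': 12,
}

# Made-up sample with stray whitespace and one unexpected spelling
months = pd.Series(["Enero", "Agosto", " diciembre ", "Setiembre"])

# Normalize whitespace and case before mapping
mapped = months.str.strip().str.lower().map(month_name_to_number)

# Values that did not match any dictionary key
unmapped = months[mapped.isna()].unique().tolist()

print(mapped.tolist())
print(unmapped)
```

An empty `unmapped` list confirms that every month name was converted.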


In [83]:
crimes[crimes["colonia_datos"] != crimes["fgj_colonia_registro"]].shape

(610309, 21)

The cell above shows that 610,309 of the 784,813 records have values in *colonia_datos* and *fgj_colonia_registro* that do not match. The reason for this difference should be investigated.
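One possible first step in that investigation is to separate purely cosmetic differences (accents, case, extra spaces) from genuinely different names. The sketch below is an assumption about how this could be done; `normalize_name` is a hypothetical helper and the third row of the sample is invented to show a cosmetic mismatch.

```python
import unicodedata

import pandas as pd


def normalize_name(value):
    """Uppercase a place name, strip accents and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", str(value))
    without_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(without_accents.upper().split())


pairs = pd.DataFrame({
    "colonia_datos": ["GUADALUPE INN", "PANTITLAN V", "Agrícola  Pantitlan"],
    "fgj_colonia_registro": ["GUADALUPE INN", "AGRÍCOLA PANTITLAN", "AGRICOLA PANTITLAN"],
})

# Mismatches on the raw strings
raw_mismatch = pairs["colonia_datos"] != pairs["fgj_colonia_registro"]

# Mismatches that remain after normalization
norm_mismatch = (
    pairs["colonia_datos"].map(normalize_name)
    != pairs["fgj_colonia_registro"].map(normalize_name)
)

print(raw_mismatch.sum(), norm_mismatch.sum())
```

Rows that still differ after normalization, such as the second one, point to the two columns using different neighborhood catalogs rather than different spellings.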

In [84]:
crimes.reset_index(drop=True, inplace =True)
crimes.to_csv(repo_path+'/datasets/crimes.csv', index=False)

## Getting the new datasets

In [86]:
!pip install xlrd



In [87]:
# Insert the path of the dataset in your local machine
metro_stations_path = r"\datasets\metro\metro_cdmx_estaciones.xls"
cams_path = r"\datasets\mi-calle_camaras\programa-mi-calle-shapes.zip"

stations_raw = pd.read_excel(repo_path+metro_stations_path)

zf = zipfile.ZipFile(repo_path+cams_path) 
cams_raw = pd.read_csv(zf.open('programa-mi-calle-shapes.csv'))


## Transforming the new datasets
### Metro stations dataset

The *SISTEMA* column has the same value in all rows, so it will be deleted. Some column names will also be changed:

In [88]:
stations_raw.head()

Unnamed: 0,FID,geometry,SISTEMA,NOMBRE,LINEA,EST,CVE_EST,CVE_EOD17,TIPO,ALCALDIAS,A_O
0,cdmx_estaciones_metro.1,POINT (-99.07473572701159 19.416334085596525),STC Metro,Pantitlán,1,1,STC0101,5014,Terminal / Transbordo,Venustiano Carranza,1984
1,cdmx_estaciones_metro.2,POINT (-99.0822888117768 19.41192045593743),STC Metro,Zaragoza,1,2,STC0102,5020,Intermedia,Venustiano Carranza,1969
2,cdmx_estaciones_metro.3,POINT (-99.09021039171121 19.416478456907367),STC Metro,Gomez Farías,1,3,STC0103,5007,Intermedia,Venustiano Carranza,1969
3,cdmx_estaciones_metro.4,POINT (-99.09625873241016 19.419941964850207),STC Metro,Boulevard Puerto Aéreo,1,4,STC0104,5003,Intermedia,Venustiano Carranza,1969
4,cdmx_estaciones_metro.5,POINT (-99.10277436397425 19.42335533557525),STC Metro,Balbuena,1,5,STC0105,5001,Intermedia,Venustiano Carranza,1969


In [89]:
stations_raw.rename(columns = lambda x : x.lower() , inplace = True)
stations_raw.rename(columns = {"a_o":"year", "fid":"id"} , inplace = True)

del stations_raw['sistema']


The *geometry* column stores the longitude and latitude of each station as a string. The two values will be extracted from this string and stored in two new columns:

In [90]:

import re

# Regular expression pattern to extract the two numeric values (longitude, latitude)
pattern = r"\((-?\d+\.\d+) (-?\d+\.\d+)\)"

def extract_coordinates(point_str):
    matches = re.findall(pattern, point_str)
    if matches:
        return pd.Series(matches[0], index=['station_lon', 'station_lat'])
    return pd.Series([None, None], index=['station_lon', 'station_lat'])

stations_raw[['station_lon', 'station_lat']] = stations_raw['geometry'].apply(extract_coordinates)
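The same extraction can also be written without a Python-level function by using `Series.str.extract` with named groups. This is a sketch of an alternative, not the method used in this notebook; the third sample value is invented to show what happens when the pattern does not match.

```python
import pandas as pd

geometry = pd.Series([
    "POINT (-99.07473572701159 19.416334085596525)",
    "POINT (-99.0822888117768 19.41192045593743)",
    "not a point",  # invented value to show the no-match case
])

# Named groups become the column names of the resulting DataFrame
coords = geometry.str.extract(
    r"\((?P<station_lon>-?\d+\.\d+) (?P<station_lat>-?\d+\.\d+)\)"
).astype(float)

print(coords)
```

Rows that do not match the pattern end up as `NaN` in both columns, and the cast to `float` yields numeric columns directly.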



The columns *geometry* and *cve_est* will be dropped because they are redundant:

In [91]:
del stations_raw['geometry']
del stations_raw['cve_est']

The *id* column will be converted to a numeric value by removing everything except the trailing number:

In [92]:
# regex=False treats the prefix literally, so '.' is not read as a regex wildcard
stations_raw['id'] = stations_raw['id'].str.replace('cdmx_estaciones_metro.', '', regex=False)
stations_raw['id'] = pd.to_numeric(stations_raw['id'])



In [93]:
null_counts = stations_raw.isnull().sum()
print(null_counts)

id             0
nombre         0
linea          0
est            0
cve_eod17      0
tipo           0
alcaldias      0
year           0
station_lon    0
station_lat    0
dtype: int64


No null values are present in this dataset, so no rows will be deleted.

In [94]:
stations = stations_raw.loc[:,['id','nombre','linea','est','cve_eod17','tipo','alcaldias','year','station_lat','station_lon']].copy()
stations.head()


Unnamed: 0,id,nombre,linea,est,cve_eod17,tipo,alcaldias,year,station_lat,station_lon
0,1,Pantitlán,1,1,5014,Terminal / Transbordo,Venustiano Carranza,1984,19.416334085596525,-99.0747357270116
1,2,Zaragoza,1,2,5020,Intermedia,Venustiano Carranza,1969,19.41192045593743,-99.0822888117768
2,3,Gomez Farías,1,3,5007,Intermedia,Venustiano Carranza,1969,19.416478456907367,-99.0902103917112
3,4,Boulevard Puerto Aéreo,1,4,5003,Intermedia,Venustiano Carranza,1969,19.419941964850207,-99.09625873241016
4,5,Balbuena,1,5,5001,Intermedia,Venustiano Carranza,1969,19.42335533557525,-99.10277436397423


### Cams dataset


In [95]:
cams_raw.head()

Unnamed: 0,id,alcaldia,colonia,barrioproy,crucesproy,senderopro,unidadproy,totalproye,barrioisnt,crucesisnt,senderoisn,unidadisnt,totalinsta,avance,prioritari,geo_shape,geo_point_2d
0,0,IZTACALCO,DE SANTA CRUZ,2,0,2,0,4,2,0,2,0,4,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1183...","19.3882552948,-99.1202643652"
1,1,IZTACALCO,EL MOSCO CHINAMPA,1,0,0,0,1,1,0,0,0,1,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1006...","19.3887000042,-99.1020342766"
2,2,IZTACALCO,LOS PICOS DE IZTACALCO II A,2,0,0,0,2,2,0,0,0,2,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1050...","19.3884203799,-99.1065451021"
3,3,IZTACALCO,SANTIAGO SUR,3,1,3,1,8,3,1,3,1,8,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1218...","19.3887099506,-99.1252431061"
4,4,IZTACALCO,TLAZINTLA,2,1,0,0,3,2,1,0,0,3,100,Si,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1100...","19.3964289682,-99.1118026548"


The cams dataset contains a column with the centroid of the *colonia* or neighborhood. The numeric values will be extracted and saved in two new columns. Also, the values from the categorical variable *prioritari* will be replaced by 1 or 0. 

In [96]:
cams_raw[['colonia_lat', 'colonia_lon']] = cams_raw['geo_point_2d'].str.split(',', expand=True)
cams_raw['colonia_lat'] = pd.to_numeric(cams_raw['colonia_lat'])
cams_raw['colonia_lon'] = pd.to_numeric(cams_raw['colonia_lon'])

del cams_raw['geo_point_2d']

cams_raw['prioritari'] = cams_raw['prioritari'].replace({'Si': 1, 'No': 0})



In [97]:
null_counts = cams_raw.isnull().sum()
print(null_counts)

id              0
alcaldia        0
colonia         0
barrioproy      0
crucesproy      0
senderopro      0
unidadproy      0
totalproye      0
barrioisnt      0
crucesisnt      0
senderoisn      0
unidadisnt      0
totalinsta      0
avance          0
prioritari      0
geo_shape      17
colonia_lat    17
colonia_lon    17
dtype: int64


Only 17 rows lack neighborhood location data (*geo_shape* and the centroid coordinates). Those rows will be kept for now. However, the number of features in this dataset will be reduced so that it keeps only the total number of installed cameras per neighborhood, the priority flag, and the neighborhood location info:

In [98]:
cams = cams_raw.loc[:,['id','alcaldia','colonia','totalinsta','prioritari','geo_shape','colonia_lat','colonia_lon']]
cams.head()

Unnamed: 0,id,alcaldia,colonia,totalinsta,prioritari,geo_shape,colonia_lat,colonia_lon
0,0,IZTACALCO,DE SANTA CRUZ,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1183...",19.388255,-99.120264
1,1,IZTACALCO,EL MOSCO CHINAMPA,1,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1006...",19.3887,-99.102034
2,2,IZTACALCO,LOS PICOS DE IZTACALCO II A,2,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1050...",19.38842,-99.106545
3,3,IZTACALCO,SANTIAGO SUR,8,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1218...",19.38871,-99.125243
4,4,IZTACALCO,TLAZINTLA,3,1,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1100...",19.396429,-99.111803


## Merging stage
**DataFrame.merge** will be used to merge the dataframes **crimes** and **cams**. *crimes.fgj_colonia_registro* is matched against *cams.colonia*, and *crimes.alcaldia_hechos* against *cams.alcaldia*. Because `merge` performs an inner join by default, crimes whose district and neighborhood pair has no match in **cams** are dropped. Afterwards, some columns are dropped and the remaining ones are reordered.


In [99]:
merged_df = crimes.merge(cams, left_on=['alcaldia_hechos','fgj_colonia_registro'], right_on=['alcaldia','colonia'])

keep_this_columns = ['idcarpeta', 'delito', 'categoria', 'alcaldia', 'colonia', 'sexo', 'edad', 'tipopersona', 'calidadjuridica',
                     'anio_inicio', 'mes_inicio', 'fechainicio','horainicio', 'competencia',
                    'anio_hecho', 'mes_hecho', 'fechahecho', 'horahecho',
                    'crimen_lat', 'crimen_lon','colonia_lat','colonia_lon','totalinsta','prioritari','geo_shape']
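# .copy() makes 'consolidated' an independent dataframe, so adding columns later does not raise SettingWithCopyWarning
consolidated = merged_df[keep_this_columns].copy()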
consolidated.head()

Unnamed: 0,idcarpeta,delito,categoria,alcaldia,colonia,sexo,edad,tipopersona,calidadjuridica,anio_inicio,...,mes_hecho,fechahecho,horahecho,crimen_lat,crimen_lon,colonia_lat,colonia_lon,totalinsta,prioritari,geo_shape
0,8324429,FRAUDE,DELITO DE BAJO IMPACTO,ALVARO OBREGON,GUADALUPE INN,Masculino,62,FISICA,OFENDIDO,2019,...,8,2018-08-29,12:00:00,19.36125,-99.18314,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826..."
1,8325972,ROBO A NEGOCIO CON VIOLENCIA,ROBO A NEGOCIO CON VIOLENCIA,ALVARO OBREGON,GUADALUPE INN,Masculino,26,FISICA,VICTIMA Y DENUNCIANTE,2019,...,1,2019-01-05,18:00:00,19.36157,-99.18556,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826..."
2,8330740,ROBO DE ACCESORIOS DE AUTO,DELITO DE BAJO IMPACTO,ALVARO OBREGON,GUADALUPE INN,Masculino,26,FISICA,VICTIMA Y DENUNCIANTE,2019,...,1,2018-01-12,14:30:00,19.35574,-99.18916,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826..."
3,8448568,ROBO DE VEHICULO DE SERVICIO PARTICULAR CON VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,ALVARO OBREGON,GUADALUPE INN,Masculino,62,FISICA,VICTIMA Y DENUNCIANTE,2019,...,6,2019-06-10,18:10:00,19.35431,-99.19001,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826..."
4,8449807,USURPACIÓN DE IDENTIDAD,DELITO DE BAJO IMPACTO,ALVARO OBREGON,GUADALUPE INN,Femenino,31,FISICA,VICTIMA Y DENUNCIANTE,2019,...,11,2018-11-09,12:00:00,19.36179,-99.18621,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826..."
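Since the merge above is an inner join, it is worth checking how many crimes find no matching neighborhood in the cams data. A minimal sketch, assuming toy frames with the same key columns, uses a left join with `indicator=True`; the toy rows are illustrative only.

```python
import pandas as pd

crimes_toy = pd.DataFrame({
    "idcarpeta": [1, 2, 3],
    "alcaldia_hechos": ["IZTACALCO", "IZTACALCO", "COYOACAN"],
    "fgj_colonia_registro": ["TLAZINTLA", "SANTIAGO SUR", "COPILCO UNIVERSIDAD ISSSTE"],
})
cams_toy = pd.DataFrame({
    "alcaldia": ["IZTACALCO", "IZTACALCO"],
    "colonia": ["TLAZINTLA", "SANTIAGO SUR"],
    "totalinsta": [3, 8],
})

# A left join keeps every crime; the '_merge' column records whether a match was found
checked = crimes_toy.merge(
    cams_toy,
    how="left",
    left_on=["alcaldia_hechos", "fgj_colonia_registro"],
    right_on=["alcaldia", "colonia"],
    indicator=True,
)

match_counts = checked["_merge"].value_counts()
unmatched = checked.loc[checked["_merge"] == "left_only", "fgj_colonia_registro"]

print(match_counts)
print(unmatched.tolist())
```

Listing the most frequent unmatched neighborhood names on the real data would show whether the lost rows come from spelling differences or from neighborhoods that are not covered by the *Mi calle* program.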


## Relation with *stations* dataframe

Relating the consolidated dataframe to the *stations* dataset requires more work than a key-based merge. The approach is to compute the distance between each crime location and every metro station in order to find the nearest one. The **haversine_distance** function will be applied, and the new columns *nearest_distance* (in kilometers), *nearest_location* and *nearest_station* will be populated.


In [156]:
import math

def haversine_distance(lat1, lon1, lat2, lon2):
    # Earth's radius in kilometers
    earth_radius = 6371.0

    # Convert latitude and longitude from degrees to radians
    lat1_rad = math.radians(lat1)
    lon1_rad = math.radians(lon1)
    lat2_rad = math.radians(lat2)
    lon2_rad = math.radians(lon2)

    # Calculate differences in latitude and longitude
    d_lat = lat2_rad - lat1_rad
    d_lon = lon2_rad - lon1_rad

    # Haversine formula
    a = math.sin(d_lat / 2) ** 2 + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(d_lon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    # Calculate the distance
    distance = earth_radius * c
    return distance
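For reference, the function above computes the haversine great-circle distance. With latitudes $\varphi_1, \varphi_2$ and longitudes $\lambda_1, \lambda_2$ in radians, $\Delta\varphi = \varphi_2 - \varphi_1$, $\Delta\lambda = \lambda_2 - \lambda_1$, and the Earth radius $R = 6371$ km used in the code:

$$
a = \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\Delta\lambda}{2}\right), \qquad
c = 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right), \qquad
d = R\,c
$$

The result $d$ is in kilometers because $R$ is given in kilometers.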


In [167]:
def find_nearest_location(ref_lat, ref_lon, locations, stationsDf=None):
    # stationsDf is not used; it is kept so that existing calls keep working
    nearest_distance = float('inf')
    nearest_location = None
    nearest_station = None

    # Loop to compare the crime location with each of the 195 metro station locations
    for lat, lon, station_name in locations:
        lat = float(lat)
        lon = float(lon)
        distance = haversine_distance(ref_lat, ref_lon, lat, lon)
        if distance < nearest_distance:
            nearest_distance = distance
            nearest_location = (lat, lon)
            nearest_station = station_name

    return nearest_distance, nearest_location, nearest_station


In [158]:
consolidated['nearest_distance'], consolidated['nearest_location'], consolidated['nearest_station'] = zip(*consolidated.apply(
    lambda row: find_nearest_location( ref_lat   = row['crimen_lat'], 
                                       ref_lon   = row['crimen_lon'], 
                                       locations = stations[['station_lat', 'station_lon', 'nombre']].values,
                                       stationsDf = stations
                                     ),
                axis = 1
    )
  )



In [164]:
consolidated.head()




Unnamed: 0,idcarpeta,delito,categoria,alcaldia,colonia,sexo,edad,tipopersona,calidadjuridica,anio_inicio,...,crimen_lat,crimen_lon,colonia_lat,colonia_lon,totalinsta,prioritari,geo_shape,nearest_station,nearest_distance,nearest_location
0,8324429,FRAUDE,DELITO DE BAJO IMPACTO,ALVARO OBREGON,GUADALUPE INN,Masculino,62,FISICA,OFENDIDO,2019,...,19.36125,-99.18314,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826...",Barranca del Muerto,0.643025,"(19.361501563286584, -99.18926370689458)"
1,8325972,ROBO A NEGOCIO CON VIOLENCIA,ROBO A NEGOCIO CON VIOLENCIA,ALVARO OBREGON,GUADALUPE INN,Masculino,26,FISICA,VICTIMA Y DENUNCIANTE,2019,...,19.36157,-99.18556,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826...",Barranca del Muerto,0.388617,"(19.361501563286584, -99.18926370689458)"
2,8330740,ROBO DE ACCESORIOS DE AUTO,DELITO DE BAJO IMPACTO,ALVARO OBREGON,GUADALUPE INN,Masculino,26,FISICA,VICTIMA Y DENUNCIANTE,2019,...,19.35574,-99.18916,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826...",Barranca del Muerto,0.640749,"(19.361501563286584, -99.18926370689458)"
3,8448568,ROBO DE VEHICULO DE SERVICIO PARTICULAR CON VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,ALVARO OBREGON,GUADALUPE INN,Masculino,62,FISICA,VICTIMA Y DENUNCIANTE,2019,...,19.35431,-99.19001,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826...",Barranca del Muerto,0.803489,"(19.361501563286584, -99.18926370689458)"
4,8449807,USURPACIÓN DE IDENTIDAD,DELITO DE BAJO IMPACTO,ALVARO OBREGON,GUADALUPE INN,Femenino,31,FISICA,VICTIMA Y DENUNCIANTE,2019,...,19.36179,-99.18621,19.357275,-99.187114,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826...",Barranca del Muerto,0.321955,"(19.361501563286584, -99.18926370689458)"
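The row-by-row search above runs a Python loop over every station for every crime, which can be slow on a dataset of this size. A vectorized variant based on NumPy broadcasting is sketched below. It is an alternative, not the method used in this notebook; `nearest_station_indices` and `EARTH_RADIUS_KM` are names invented for the example, and the two stations in the usage lines are taken from the outputs shown earlier.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0


def nearest_station_indices(crime_lat, crime_lon, station_lat, station_lon):
    """Return (index of the nearest station, distance in km) for each crime."""
    # Column vectors for crimes, row vectors for stations -> (n_crimes, n_stations)
    lat1 = np.radians(np.asarray(crime_lat, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(crime_lon, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(station_lat, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(station_lon, dtype=float))[None, :]

    # Haversine formula applied to the whole matrix at once
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist = 2 * EARTH_RADIUS_KM * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(idx)), idx]


# First crime of the consolidated frame against two stations
names = np.array(["Barranca del Muerto", "Pantitlán"])
idx, km = nearest_station_indices(
    [19.36125], [-99.18314],
    [19.361501563286584, 19.416334085596525],
    [-99.18926370689458, -99.07473572701159],
)
print(names[idx][0], km[0])
```

The full distance matrix grows with the number of crimes times the number of stations, so on the real data it would probably be safer to process the crimes in chunks of rows.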


In [163]:
print('Min. nearest distance is ',consolidated.nearest_distance.min())
print('Max. nearest distance is ',consolidated.nearest_distance.max())
consolidated.nearest_station.unique()


Min. nearest distance is  0.0003785548623949956
Max. nearest distance is  16.953825939600286


array(['Barranca del Muerto', 'Miguel Ángel de Quevedo', 'Norte 45',
       'Popotla', 'Copilco', 'Constitución de 1917', 'UAM-I',
       'San Pedro de Los Pinos', 'Observatorio', 'San Antonio',
       'Politécnico', 'La Raza', 'Ermita', 'Portales', 'Santa Marta',
       'Acatitla', 'Pantitlán', 'Lomas Estrella', 'Normal',
       'Niños Héroes/Poder Judicial CDMX', 'Balderas', 'Salto del Agua',
       'Lázaro Cárdenas', 'Obrera', 'Doctores', 'Cuauhtémoc',
       'Hospital General', 'Centro Médico', 'Lagunilla',
       'Garibaldi/Lagunilla', 'Tepito', 'Valle Gómez', 'Canal del Norte',
       'Insurgentes', 'Chilpancingo', 'Sevilla', 'Chapultepec', 'Mixcoac',
       'Cuitláhuac', 'Colegio Militar', 'Tacuba', 'Insurgentes Sur',
       'Etiopía/Plaza de la Transparencia', 'Hospital 20 de Noviembre',
       'Zapata', 'División del Norte', 'Eugenia', 'Coyoacán',
       'Zapotitlán', 'Tlaltenco', 'Calle 11', 'La Villa/Basílica',
       'Talismán', 'Bondojito', 'Hidalgo', 'Guerrero', 'Tlatelol

In [166]:
consolidated.shape

(450977, 28)

This is the consolidated dataset that can be used as a base for future calculations. It could be exported as a CSV at this stage. However, more data on camera locations from the **C5** program was found, so this information can be merged as well.