
## Merge datasets related to crime, subway and camera location in Mexico City


Several additional datasets are now available: the subway lines of Mexico City, the location of the subway stations, and the location of the security cameras installed under the city program *Mi calle*. After some cleaning and transformation steps, this notebook joins that information with the reported crimes, using the neighborhood (*colonia*) and/or district (*alcaldia*) as keys, so that everything ends up in a single dataframe.

First, we import the required libraries and load the reported crimes dataset.

In [6]:
import pandas as pd
import zipfile
import numpy as np
import matplotlib.pyplot as plt

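# Path to your local copy of the sdc-security repository.
# Raw strings (r'...') keep the Windows backslashes from being read as escape sequences.
repo_path = r"INSERT YOUR LOCAL SDC-SECURITY REPO PATH HERE"
# Example from the author's machine; replace it with your own path
repo_path = r'D:\Archivos\Social Data Challenge\sdc-security'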

In [9]:
crime_data_path = r"\datasets\da_victimas_completa_marzo_2023.zip"

zf = zipfile.ZipFile(repo_path+crime_data_path) 
crimes_raw = pd.read_csv(zf.open('da_victimas_completa_marzo_2023.csv'))
print(crimes_raw.shape)
crimes_raw.head()


(1038430, 22)


Unnamed: 0,idCarpeta,Año_inicio,Mes_inicio,FechaInicio,Delito,Categoria,Sexo,Edad,TipoPersona,CalidadJuridica,...,Mes_hecho,FechaHecho,HoraHecho,HoraInicio,alcaldia_hechos,municipio_hechos,colonia_datos,fgj_colonia_registro,latitud,longitud
0,8324429.0,2019,Enero,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62.0,FISICA,OFENDIDO,...,Agosto,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430.0,2019,Enero,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431.0,2019,Enero,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-22,15:30:00,12:23:00,COYOACAN,,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435.0,2019,Enero,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35.0,FISICA,VICTIMA Y DENUNCIANTE,...,Enero,2019-01-04,06:00:00,12:27:00,IZTACALCO,,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438.0,2019,Enero,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,,FISICA,VICTIMA,...,Enero,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324
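
Building the paths with `pathlib` avoids hand-written backslashes and works on any operating system. The sketch below is only an illustration: the helper name `read_csv_from_zip` and the placeholder repository path are assumptions, not part of the original notebook.

```python
from pathlib import Path
import zipfile

import pandas as pd


def read_csv_from_zip(zip_path, member):
    """Read a single CSV member of a zip archive into a DataFrame."""
    zip_path = Path(zip_path)
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as fh:
            return pd.read_csv(fh)


# Hypothetical usage; point REPO at your own checkout:
# REPO = Path("path/to/sdc-security")
# crimes_raw = read_csv_from_zip(
#     REPO / "datasets" / "da_victimas_completa_marzo_2023.zip",
#     "da_victimas_completa_marzo_2023.csv",
# )
```

The `with` blocks also close the archive once the data is loaded, which the plain `zipfile.ZipFile(...)` call does not do on its own.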


## Transforming the crime dataset

In [10]:
#Change column names
crimes_raw.rename(columns = lambda x : x.lower() , inplace = True)
crimes_raw.columns = crimes_raw.columns.str.replace('ñ', 'ni')
crimes_raw.rename(columns = {"latitud":"crimen_lat", "longitud":"crimen_lon"} , inplace = True)

crimes_raw.head()

Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,municipio_hechos,colonia_datos,fgj_colonia_registro,crimen_lat,crimen_lon
0,8324429.0,2019,Enero,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62.0,FISICA,OFENDIDO,...,Agosto,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430.0,2019,Enero,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431.0,2019,Enero,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42.0,FISICA,VICTIMA Y DENUNCIANTE,...,Diciembre,2018-12-22,15:30:00,12:23:00,COYOACAN,,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435.0,2019,Enero,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35.0,FISICA,VICTIMA Y DENUNCIANTE,...,Enero,2019-01-04,06:00:00,12:27:00,IZTACALCO,,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438.0,2019,Enero,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,,FISICA,VICTIMA,...,Enero,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324


### Handling null values

Many columns contain null values. Since the crimes dataset already has more than 1M records, dropping the records that contain null values still leaves plenty of data and yields a lighter, more consistent dataset.

In [11]:
null_counts = crimes_raw.isnull().sum()
print(null_counts)

idcarpeta                     0
anio_inicio                   0
mes_inicio                    0
fechainicio                   0
delito                        0
categoria                     0
sexo                     190025
edad                     366188
tipopersona                6645
calidadjuridica               1
competencia                   0
anio_hecho                  377
mes_hecho                   377
fechahecho                  377
horahecho                   368
horainicio                    1
alcaldia_hechos               0
municipio_hechos        1028246
colonia_datos             73721
fgj_colonia_registro      50410
crimen_lat                50202
crimen_lon                50204
dtype: int64
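
Raw counts are easier to judge as a share of the dataset: with 1038430 rows, the 1028246 nulls in *municipio_hechos* are about 99% of the column, while the 366188 nulls in *edad* are about 35%. A small helper along these lines could print both numbers side by side; the name `null_report` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def null_report(df):
    """Return the count and percentage of nulls per column, worst first."""
    report = pd.DataFrame({
        "nulls": df.isnull().sum(),
        "percent": (df.isnull().mean() * 100).round(2),
    })
    return report.sort_values("nulls", ascending=False)


# Toy data, only to show the shape of the output
toy = pd.DataFrame({
    "sexo": ["Masculino", None, "Femenino", None],
    "edad": [62.0, 38.0, None, 35.0],
    "delito": ["FRAUDE", "FRAUDE", "AMENAZAS", "FRAUDE"],
})
print(null_report(toy))
```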


Almost every value in column *municipio_hechos* is null, so this column will be dropped. After that, every row containing a null value will be deleted. An exception is made for column *edad*: it has the most null values among the remaining columns (366188), so it is excluded from the check in order to keep most of the rows.


In [12]:
del crimes_raw['municipio_hechos']

In [13]:
columns_to_dropna = crimes_raw.columns.drop('edad')
crimes = crimes_raw.dropna(subset = columns_to_dropna).copy()

print('Original crime dataset shape is: {}'.format(crimes_raw.shape))
print('The shape of the new crime dataset without null values is: {}'.format(crimes.shape))

crimes.head()


Original crime dataset shape is: (1038430, 21)
The shape of the new crime dataset without null values is: (784813, 21)


Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,anio_hecho,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,colonia_datos,fgj_colonia_registro,crimen_lat,crimen_lon
0,8324429.0,2019,Enero,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62.0,FISICA,OFENDIDO,...,2018.0,Agosto,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430.0,2019,Enero,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38.0,FISICA,VICTIMA Y DENUNCIANTE,...,2018.0,Diciembre,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431.0,2019,Enero,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42.0,FISICA,VICTIMA Y DENUNCIANTE,...,2018.0,Diciembre,2018-12-22,15:30:00,12:23:00,COYOACAN,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435.0,2019,Enero,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35.0,FISICA,VICTIMA Y DENUNCIANTE,...,2019.0,Enero,2019-01-04,06:00:00,12:27:00,IZTACALCO,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438.0,2019,Enero,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,,FISICA,VICTIMA,...,2019.0,Enero,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324


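The next step is to convert the numeric columns that only hold whole numbers to integers, to avoid unnecessary decimals. Missing values in *edad* are filled with the column mean before the conversion.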

In [14]:
crimes["idcarpeta"]  = crimes["idcarpeta"].round().astype(int)
crimes["anio_hecho"] = crimes["anio_hecho"].round().astype(int)

# edad still contains nulls: fill them with the mean age before casting to int
average              = crimes['edad'].mean()
crimes["edad"]       = crimes["edad"].fillna(average)
crimes["edad"]       = crimes["edad"].round().astype(int)
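
Filling the missing ages with the mean gives every imputed record the same age, which can distort age-based analysis later on. One possible alternative, sketched here on toy data and not used in this notebook, is pandas' nullable `Int64` dtype, which stores whole numbers while keeping the gaps as `<NA>`.

```python
import pandas as pd

edad = pd.Series([62.0, 38.0, None, 35.0], name="edad")

# Option used in this notebook: fill with the mean, then cast to a plain int
filled = edad.fillna(edad.mean()).round().astype(int)

# Alternative: the nullable integer dtype keeps the missing value as <NA>
nullable = edad.round().astype("Int64")

print(filled.tolist())
print(nullable.isna().sum())
```

Which option fits better depends on whether downstream steps can handle missing values.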


Converting month names to numeric values

In [15]:
month_name_to_number = {
    'enero': 1,
    'febrero': 2,
    'marzo': 3,
    'abril': 4,
    'mayo': 5,
    'junio': 6,
    'julio': 7,
    'agosto': 8,
    'septiembre': 9,
    'octubre': 10,
    'noviembre': 11,
    'diciembre': 12
}

crimes["mes_inicio"] = crimes["mes_inicio"].str.lower().map(month_name_to_number) 
crimes["mes_hecho"] = crimes["mes_hecho"].str.lower().map(month_name_to_number) 

crimes.head()

Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,anio_hecho,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,colonia_datos,fgj_colonia_registro,crimen_lat,crimen_lon
0,8324429,2019,1,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62,FISICA,OFENDIDO,...,2018,8,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430,2019,1,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431,2019,1,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-22,15:30:00,12:23:00,COYOACAN,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435,2019,1,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35,FISICA,VICTIMA Y DENUNCIANTE,...,2019,1,2019-01-04,06:00:00,12:27:00,IZTACALCO,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438,2019,1,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,39,FISICA,VICTIMA,...,2019,1,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324
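
`Series.map` silently returns `NaN` for any value that is not a key of the dictionary, so a misspelled or abbreviated month name would go unnoticed. The sketch below shows one way to make such cases fail loudly; the function name `months_to_numbers` is an assumption made for illustration.

```python
import pandas as pd

MONTHS = ["enero", "febrero", "marzo", "abril", "mayo", "junio", "julio",
          "agosto", "septiembre", "octubre", "noviembre", "diciembre"]
MONTH_TO_NUMBER = {name: number for number, name in enumerate(MONTHS, start=1)}


def months_to_numbers(names):
    """Map Spanish month names to 1..12 and raise on names that are not recognized."""
    cleaned = names.str.strip().str.lower()
    numbers = cleaned.map(MONTH_TO_NUMBER)
    # Names that were present in the input but did not match any key
    unmapped = cleaned[numbers.isna() & names.notna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped month names: {list(unmapped)}")
    return numbers.astype("Int64")


print(months_to_numbers(pd.Series(["Enero", "Agosto ", " diciembre"])).tolist())
```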


In [16]:
crimes[crimes["colonia_datos"] != crimes["fgj_colonia_registro"]].shape

(610309, 21)

There are many records where *colonia_datos* and *fgj_colonia_registro* do not match. The reason is that *colonia* names vary between data sources. Column *fgj_colonia_registro* contains more generic names, so it will be used. Column *colonia_datos* might help with the homologation process; if it doesn't, it will simply be deleted.
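
Part of the mismatch comes from spelling details only: the crimes data contains, for example, both `VILLA LAZARO CARDENAS` and `VILLA LÁZARO CÁRDENAS`. A normalization step such as the following could reduce that noise before comparing or merging on names. This is a sketch; the function name `normalize_name` is an assumption, and the notebook itself does not apply it.

```python
import unicodedata


def normalize_name(name):
    """Upper-case a colonia name, strip accents and collapse repeated whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks (accents, tildes) left by the decomposition
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(without_marks.upper().split())


print(normalize_name("Villa Lázaro Cárdenas"))
```

Note that this also turns `Ñ` into `N` (`EL RETOÑO` becomes `EL RETONO`), which is fine for matching purposes as long as both sides of the comparison are normalized the same way.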

In [17]:
crimes.reset_index(drop=True, inplace =True)
crimes.to_csv(repo_path+'/datasets/crimes.csv', index=False)

## Getting the new datasets

In [18]:
!pip install xlrd



In [19]:
# Insert the path of the dataset in your local machine
metro_stations_path = r"\datasets\metro\metro_cdmx_estaciones.xls"
stations_raw = pd.read_excel(repo_path+metro_stations_path)



## Transforming the new datasets
### Metro stations dataset

The *Sistema* column has the same value in all rows, so it will be deleted. Also, some column names are going to be changed:

In [20]:
stations_raw.head()

Unnamed: 0,FID,geometry,SISTEMA,NOMBRE,LINEA,EST,CVE_EST,CVE_EOD17,TIPO,ALCALDIAS,A_O
0,cdmx_estaciones_metro.1,POINT (-99.07473572701159 19.416334085596525),STC Metro,Pantitlán,1,1,STC0101,5014,Terminal / Transbordo,Venustiano Carranza,1984
1,cdmx_estaciones_metro.2,POINT (-99.0822888117768 19.41192045593743),STC Metro,Zaragoza,1,2,STC0102,5020,Intermedia,Venustiano Carranza,1969
2,cdmx_estaciones_metro.3,POINT (-99.09021039171121 19.416478456907367),STC Metro,Gomez Farías,1,3,STC0103,5007,Intermedia,Venustiano Carranza,1969
3,cdmx_estaciones_metro.4,POINT (-99.09625873241016 19.419941964850207),STC Metro,Boulevard Puerto Aéreo,1,4,STC0104,5003,Intermedia,Venustiano Carranza,1969
4,cdmx_estaciones_metro.5,POINT (-99.10277436397425 19.42335533557525),STC Metro,Balbuena,1,5,STC0105,5001,Intermedia,Venustiano Carranza,1969


In [21]:
stations_raw.rename(columns = lambda x : x.lower() , inplace = True)
stations_raw.rename(columns = {"a_o":"year", "fid":"id"} , inplace = True)

del stations_raw['sistema']


There is a column called *geometry* that contains the longitude and latitude in string format. The numeric values will be extracted from this string and stored in two new columns.

In [22]:

import re

# Regular expression pattern to extract the two numeric values of "POINT (lon lat)"
pattern = r"\((-?\d+\.\d+) (-?\d+\.\d+)\)"

def extract_coordinates(point_str):
    """Return longitude and latitude parsed from a 'POINT (lon lat)' string."""
    matches = re.findall(pattern, point_str)
    if matches:
        lon, lat = matches[0]
        # Convert the captured text to floats so the new columns are numeric
        return pd.Series([float(lon), float(lat)], index=['station_lon', 'station_lat'])
    return pd.Series([None, None], index=['station_lon', 'station_lat'])

stations_raw[['station_lon', 'station_lat']] = stations_raw['geometry'].apply(extract_coordinates)
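
The same extraction can be written without a row-by-row `apply`, using `Series.str.extract` with named groups. The sketch below runs on toy data copied from the first rows of the stations table plus one malformed value; it is an alternative, not the method used in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({"geometry": [
    "POINT (-99.07473572701159 19.416334085596525)",
    "POINT (-99.0822888117768 19.41192045593743)",
    "not a point",
]})

# Named groups become the column names of the result
pattern = r"\((?P<station_lon>-?\d+\.\d+) (?P<station_lat>-?\d+\.\d+)\)"
coords = toy["geometry"].str.extract(pattern).astype(float)
print(coords)
```

Rows that do not match the pattern end up as `NaN` in both columns, so they can be counted with `isnull()` like any other missing value.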



The columns *geometry* and *cve_est* will be dropped because they are redundant. The *id* column will be converted to a numeric value.

In [23]:
del stations_raw['geometry']
del stations_raw['cve_est']

The string values in column *id* will be replaced by numbers by removing the `cdmx_estaciones_metro.` prefix and keeping only the number:

In [24]:
# regex=False treats the prefix as literal text (otherwise '.' would match any character)
stations_raw['id'] = stations_raw.id.str.replace('cdmx_estaciones_metro.', '', regex=False)
stations_raw['id'] = pd.to_numeric(stations_raw['id'])


In [25]:
null_counts = stations_raw.isnull().sum()
print(null_counts)

id             0
nombre         0
linea          0
est            0
cve_eod17      0
tipo           0
alcaldias      0
year           0
station_lon    0
station_lat    0
dtype: int64


No null values are present in this dataset, so no rows will be deleted.

In [26]:
stations = stations_raw.loc[:,['id','nombre','linea','est','cve_eod17','tipo','alcaldias','year','station_lat','station_lon']].copy()
stations.head()


Unnamed: 0,id,nombre,linea,est,cve_eod17,tipo,alcaldias,year,station_lat,station_lon
0,1,Pantitlán,1,1,5014,Terminal / Transbordo,Venustiano Carranza,1984,19.416334085596525,-99.0747357270116
1,2,Zaragoza,1,2,5020,Intermedia,Venustiano Carranza,1969,19.41192045593743,-99.0822888117768
2,3,Gomez Farías,1,3,5007,Intermedia,Venustiano Carranza,1969,19.416478456907367,-99.0902103917112
3,4,Boulevard Puerto Aéreo,1,4,5003,Intermedia,Venustiano Carranza,1969,19.419941964850207,-99.09625873241016
4,5,Balbuena,1,5,5001,Intermedia,Venustiano Carranza,1969,19.42335533557525,-99.10277436397423


### Cams dataset


In [27]:
cams_path = "\datasets\mi-calle_camaras\programa-mi-calle-shapes.zip"
zf = zipfile.ZipFile(repo_path+cams_path) 
cams_raw = pd.read_csv(zf.open('programa-mi-calle-shapes.csv'))
print(cams_raw.shape)
cams_raw.head()

(1761, 17)


Unnamed: 0,id,alcaldia,colonia,barrioproy,crucesproy,senderopro,unidadproy,totalproye,barrioisnt,crucesisnt,senderoisn,unidadisnt,totalinsta,avance,prioritari,geo_shape,geo_point_2d
0,0,IZTACALCO,DE SANTA CRUZ,2,0,2,0,4,2,0,2,0,4,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1183...","19.3882552948,-99.1202643652"
1,1,IZTACALCO,EL MOSCO CHINAMPA,1,0,0,0,1,1,0,0,0,1,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1006...","19.3887000042,-99.1020342766"
2,2,IZTACALCO,LOS PICOS DE IZTACALCO II A,2,0,0,0,2,2,0,0,0,2,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1050...","19.3884203799,-99.1065451021"
3,3,IZTACALCO,SANTIAGO SUR,3,1,3,1,8,3,1,3,1,8,100,No,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1218...","19.3887099506,-99.1252431061"
4,4,IZTACALCO,TLAZINTLA,2,1,0,0,3,2,1,0,0,3,100,Si,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1100...","19.3964289682,-99.1118026548"


The cams dataset contains a column, *geo_point_2d*, with the centroid of the *colonia* (neighborhood) as a "latitude,longitude" string. The numeric values will be extracted and saved in two new columns. Also, the values of the categorical variable *prioritari* (Si/No) will be replaced by 1 or 0.

In [28]:
cams_raw.isnull().sum()

id               0
alcaldia         0
colonia          0
barrioproy       0
crucesproy       0
senderopro       0
unidadproy       0
totalproye       0
barrioisnt       0
crucesisnt       0
senderoisn       0
unidadisnt       0
totalinsta       0
avance           0
prioritari       0
geo_shape       17
geo_point_2d    17
dtype: int64

Only 17 rows lack the neighborhood location data (*geo_shape* and *geo_point_2d*). Those rows will be deleted.

In [29]:
cams_raw = cams_raw.dropna()

cams_raw[['colonia_lat', 'colonia_lon']] = cams_raw['geo_point_2d'].str.split(',', expand=True)
cams_raw['colonia_lat'] = pd.to_numeric(cams_raw['colonia_lat'])
cams_raw['colonia_lon'] = pd.to_numeric(cams_raw['colonia_lon'])

del cams_raw['geo_point_2d']

cams_raw['prioritari'] = cams_raw['prioritari'].replace({'Si': 1, 'No': 0})



In [30]:
cams = cams_raw.loc[:,['id','alcaldia','colonia','totalinsta','prioritari','geo_shape','colonia_lat','colonia_lon']].copy()
cams.head()

Unnamed: 0,id,alcaldia,colonia,totalinsta,prioritari,geo_shape,colonia_lat,colonia_lon
0,0,IZTACALCO,DE SANTA CRUZ,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1183...",19.388255,-99.120264
1,1,IZTACALCO,EL MOSCO CHINAMPA,1,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1006...",19.3887,-99.102034
2,2,IZTACALCO,LOS PICOS DE IZTACALCO II A,2,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1050...",19.38842,-99.106545
3,3,IZTACALCO,SANTIAGO SUR,8,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1218...",19.38871,-99.125243
4,4,IZTACALCO,TLAZINTLA,3,1,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1100...",19.396429,-99.111803
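
Since the centroid string stores latitude first and longitude second, a quick range check helps confirm that the two new columns were not swapped. The sketch below uses two rows from the table above; the bounding box limits are rough values assumed for illustration, not official city limits.

```python
import pandas as pd

toy = pd.DataFrame({
    "prioritari": ["No", "Si"],
    "geo_point_2d": ["19.3882552948,-99.1202643652", "19.3964289682,-99.1118026548"],
})

# Split "lat,lon" into two numeric columns
latlon = toy["geo_point_2d"].str.split(",", expand=True).astype(float)
latlon.columns = ["colonia_lat", "colonia_lon"]

# Rough bounding box around Mexico City (assumed limits)
in_range = (
    latlon["colonia_lat"].between(19.0, 19.7)
    & latlon["colonia_lon"].between(-99.5, -98.8)
)
print(in_range.all())

# Si/No to 1/0; any other value would become NaN and show up in a null check
toy["prioritari"] = toy["prioritari"].map({"Si": 1, "No": 0})
print(toy["prioritari"].tolist())
```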


### Checking for duplicates
It turns out that there are duplicated records in the datasets. They are likely caused by errors during collection, or by updates that added a new record without erasing the previous one. In the cams dataset, the number of installed cams is inconsistent for the same *colonia* within the same district (*alcaldia*). Cleaning is required at this stage, before any merge.

In [31]:
duplicates = cams[['colonia', 'alcaldia']].duplicated(keep=False)
duplicated_rows=cams[duplicates]
duplicated_rows.sort_values(by=['alcaldia','colonia']).head(15)

Unnamed: 0,id,alcaldia,colonia,totalinsta,prioritari,geo_shape,colonia_lat,colonia_lon
148,148,ALVARO OBREGON,BELEN DE LAS FLORES,0,1,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2153...",19.394297,-99.217076
210,210,ALVARO OBREGON,BELEN DE LAS FLORES,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2320...",19.355859,-99.234221
154,154,ALVARO OBREGON,LOMAS DE BECERRA,1,1,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1991...",19.384495,-99.200984
1309,1309,ALVARO OBREGON,LOMAS DE BECERRA,7,1,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2129...",19.384752,-99.217245
203,203,ALVARO OBREGON,LOMAS DE TARANGO,0,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2534...",19.34767,-99.254523
697,697,ALVARO OBREGON,LOMAS DE TARANGO,2,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2083...",19.364082,-99.214819
132,132,ALVARO OBREGON,LOS ALPES,2,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2065...",19.356777,-99.209819
1286,1286,ALVARO OBREGON,LOS ALPES,9,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1989...",19.360429,-99.194006
178,178,ALVARO OBREGON,SANTA FE,0,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2290...",19.384954,-99.229816
727,727,ALVARO OBREGON,SANTA FE,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2318...",19.382868,-99.239754


Since it is uncertain which record to keep (for instance, see colonia Santa Fe in the dataframe above, with 0 and 4 installed cams), the strategy is to keep the last record, on the assumption that it contains the most accurate information.
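
Before choosing which record to keep, it can help to see how far the duplicated records disagree. The sketch below summarizes the conflicts per (*alcaldia*, *colonia*) pair on a toy frame built from a few rows of the table above; the output column names `records`, `lowest` and `highest` are arbitrary choices.

```python
import pandas as pd

toy = pd.DataFrame({
    "id": [178, 727, 132, 1286, 0],
    "alcaldia": ["ALVARO OBREGON"] * 4 + ["IZTACALCO"],
    "colonia": ["SANTA FE", "SANTA FE", "LOS ALPES", "LOS ALPES", "DE SANTA CRUZ"],
    "totalinsta": [0, 4, 2, 9, 4],
})

# One row per (alcaldia, colonia) with the number of records and the range of totalinsta
summary = (
    toy.groupby(["alcaldia", "colonia"])["totalinsta"]
       .agg(records="count", lowest="min", highest="max")
)
conflicts = summary[summary["records"] > 1]
print(conflicts)
```

Note that the duplicated records in the table above also have different centroids, so two records with the same name may describe different polygons rather than two versions of the same one.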

In [32]:
print('Shape of cams dataset before dropping duplicates: ', cams.shape)

cams = cams.drop_duplicates(['colonia', 'alcaldia'], keep ='last')

print('Shape of cams dataset after dropping duplicates: ', cams.shape)

#Verifying that the duplicates are gone
duplicates = cams[['colonia', 'alcaldia']].duplicated(keep=False)
duplicated_rows=cams[duplicates]
duplicated_rows.sort_values(by=['alcaldia','colonia']).head(15)

Shape of cams dataset before dropping duplicates:  (1744, 8)
Shape of cams dataset after dropping duplicates:  (1720, 8)


Unnamed: 0,id,alcaldia,colonia,totalinsta,prioritari,geo_shape,colonia_lat,colonia_lon


### Checking crimes dataset for cleaning


The following lines suggest that there may be several reasons for the duplicates in the *crimes.idcarpeta* column. For example, the same *idcarpeta* can have multiple victims, and those victims can have the same age and gender.

In [39]:
duplicates = crimes['idcarpeta'].duplicated(keep=False)
duplicated_rows=crimes[duplicates]
print("Number of duplicated idcarpetas: {}".format(duplicated_rows.shape[0]))
duplicated_rows[['idcarpeta','delito','categoria','sexo','edad','tipopersona','fgj_colonia_registro','alcaldia_hechos']].sort_values(by=['idcarpeta']).head(15)

Number of duplicated idcarpetas: 55679


Unnamed: 0,idcarpeta,delito,categoria,sexo,edad,tipopersona,fgj_colonia_registro,alcaldia_hechos
95152,8322427,ROBO A NEGOCIO CON VIOLENCIA,ROBO A NEGOCIO CON VIOLENCIA,Femenino,43,FISICA,EL RETOÑO,IZTAPALAPA
95153,8322427,ROBO A NEGOCIO CON VIOLENCIA,ROBO A NEGOCIO CON VIOLENCIA,Masculino,55,FISICA,EL RETOÑO,IZTAPALAPA
49556,8322439,ROBO A TRANSEUNTE EN VIA PUBLICA CON VIOLENCIA,ROBO A TRANSEUNTE EN VÍA PÚBLICA CON Y SIN VIO...,Femenino,45,FISICA,EL ERMITAÑO,LA MAGDALENA CONTRERAS
106332,8322439,ROBO A TRANSEUNTE EN VIA PUBLICA CON VIOLENCIA,ROBO A TRANSEUNTE EN VÍA PÚBLICA CON Y SIN VIO...,Masculino,15,FISICA,EL ERMITAÑO,LA MAGDALENA CONTRERAS
49578,8322533,VIOLENCIA FAMILIAR,DELITO DE BAJO IMPACTO,Masculino,1,FISICA,SAN MARCOS NORTE,XOCHIMILCO
95184,8322533,VIOLENCIA FAMILIAR,DELITO DE BAJO IMPACTO,Femenino,4,FISICA,SAN MARCOS NORTE,XOCHIMILCO
106354,8322550,HOMICIDIO POR ARMA DE FUEGO,HOMICIDIO DOLOSO,Masculino,26,FISICA,MOSCO CHINAMPA,IZTACALCO
95192,8322550,HOMICIDIO POR ARMA DE FUEGO,HOMICIDIO DOLOSO,Masculino,30,FISICA,MOSCO CHINAMPA,IZTACALCO
95198,8322572,DAÑO EN PROPIEDAD AJENA CULPOSA POR TRÁNSITO V...,DELITO DE BAJO IMPACTO,Femenino,39,FISICA,LAS AGUILAS 3ER PARQUE,ALVARO OBREGON
106364,8322572,DAÑO EN PROPIEDAD AJENA CULPOSA POR TRÁNSITO V...,DELITO DE BAJO IMPACTO,Femenino,39,FISICA,LAS AGUILAS 3ER PARQUE,ALVARO OBREGON


Checking duplicates in more columns: 

In [51]:
# Flag the rows that are duplicated across every single column
duplicates = crimes.duplicated(keep=False)
# .copy() gives an independent dataframe, so adding a column does not raise SettingWithCopyWarning
duplicated_rows = crimes[duplicates].copy()
print(duplicated_rows[['idcarpeta','delito','categoria','sexo','edad','tipopersona','crimen_lat','crimen_lon']].shape)
duplicated_rows["duplicates"] = duplicates
crimes.head(100)

(8686, 8)


Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,anio_hecho,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,colonia_datos,fgj_colonia_registro,crimen_lat,crimen_lon
0,8324429,2019,1,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62,FISICA,OFENDIDO,...,2018,8,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430,2019,1,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431,2019,1,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-22,15:30:00,12:23:00,COYOACAN,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435,2019,1,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35,FISICA,VICTIMA Y DENUNCIANTE,...,2019,1,2019-01-04,06:00:00,12:27:00,IZTACALCO,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438,2019,1,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,39,FISICA,VICTIMA,...,2019,1,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,LAS AMERICAS (U HAB),PROGRESISTA,19.35480,-99.06324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,8324792,2019,1,2019-01-04,ROBO DE VEHICULO DE PEDALES,DELITO DE BAJO IMPACTO,Masculino,39,MORAL,OFENDIDO,...,2019,1,2019-01-04,17:00:00,18:19:00,AZCAPOTZALCO,EL ROSARIO C (U HAB),EL ROSARIO,19.50326,-99.20375
96,8324805,2019,1,2019-01-04,ROBO A TRANSEUNTE EN VIA PUBLICA SIN VIOLENCIA,ROBO A TRANSEUNTE EN VÍA PÚBLICA CON Y SIN VIO...,Femenino,42,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-27,21:30:00,18:34:00,CUAUHTEMOC,CENTRO VI,CENTRO,19.42623,-99.13303
97,8324808,2019,1,2019-01-04,ROBO A TRANSEUNTE DE CELULAR CON VIOLENCIA,DELITO DE BAJO IMPACTO,Masculino,34,FISICA,VICTIMA Y DENUNCIANTE,...,2018,10,2018-10-19,23:00:00,18:41:00,TLALPAN,VILLA LAZARO CARDENAS,VILLA LÁZARO CÁRDENAS,19.29550,-99.14017
98,8324813,2019,1,2019-01-04,ROBO DE OBJETOS,DELITO DE BAJO IMPACTO,Masculino,52,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-24,12:00:00,18:45:00,LA MAGDALENA CONTRERAS,HUAYATLA,HUAYATLA,19.30641,-99.26473


Since the columns of this dataset have no unique-value constraints, the next lines will only erase duplicates that match on every single column. Looking at the indexes, many of those duplicates have non-consecutive indexes, which makes a double entry of the same record more likely. Those should be dropped, while the ones with consecutive indexes should be kept, to allow for the possibility of more than one victim of the same crime with the same age and gender.
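
One possible reading of this rule is sketched below: a fully duplicated row is kept if it is the first occurrence of its set or if it sits directly next to an identical row, and it is dropped otherwise. The function name `drop_nonconsecutive_duplicates` is hypothetical, and the sketch assumes a dataframe without missing values.

```python
import pandas as pd


def drop_nonconsecutive_duplicates(df):
    """Drop full-row duplicates unless they sit directly next to an identical row.

    Assumes df has no missing values (groupby ignores keys that contain NaN).
    """
    df = df.reset_index(drop=True)
    # Rows that are identical in every column get the same group number
    group_id = df.groupby(list(df.columns), sort=False).ngroup()
    next_to_twin = group_id.eq(group_id.shift(1)) | group_id.eq(group_id.shift(-1))
    first_seen = ~df.duplicated(keep="first")
    return df[first_seen | next_to_twin]


toy = pd.DataFrame({
    "idcarpeta": [1, 1, 2, 3, 1, 3],
    "sexo": ["M", "M", "F", "F", "M", "F"],
})
print(drop_nonconsecutive_duplicates(toy))
```

In the toy frame, rows 0 and 1 are kept as possible distinct victims, while the later repeats at positions 4 and 5 are dropped as probable double entries.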

In [43]:
# To do next:
# Check the documentation of duplicated() and drop_duplicates() to understand
# what they return with keep=False, keep='first' and keep='last':
# https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.duplicated.html
# We want a boolean column that indicates whether the row is part of a duplicate set,
# and another column that indicates whether the same idcarpeta value is repeated consecutively.
# That way the boolean values can prevent those rows from being dropped, so that only
# the duplicates that do not occur consecutively are dropped.

# Two lazy alternatives:
# - do not drop the duplicates
# - apply DataFrame.drop_duplicates(keep=False)
columns_of_interest=['idcarpeta','delito','categoria','sexo','edad','tipopersona','crimen_lat','crimen_lon','duplicates']
duplicated_rows[columns_of_interest].head(30)

# Function suggested by ChatGPT (not used yet):
# def check_consecutive_same_value(df, col1, new_col_name):
#     df[new_col_name] = (df[col1] == df[col1].shift(1)) | (df[col1] == df[col1].shift(-1))
#     return df

Unnamed: 0,idcarpeta,delito,categoria,sexo,edad,tipopersona,crimen_lat,crimen_lon,duplicates
7518,8454529,HOMICIDIO POR ARMA DE FUEGO,HOMICIDIO DOLOSO,Masculino,39,FISICA,19.28214,-99.2223,True
8897,8437644,HOMICIDIO POR ARMA DE FUEGO,HOMICIDIO DOLOSO,Masculino,39,FISICA,19.3495,-98.99664,True
8898,8437644,HOMICIDIO POR ARMA DE FUEGO,HOMICIDIO DOLOSO,Masculino,39,FISICA,19.3495,-98.99664,True
31312,8461163,DAÑO EN PROPIEDAD AJENA INTENCIONAL A NEGOCIO,DELITO DE BAJO IMPACTO,Masculino,39,FISICA,19.51668,-99.13889,True
32099,8509270,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,39,FISICA,19.34883,-99.21793,True
41709,8400645,LESIONES CULPOSAS POR TRANSITO VEHICULAR EN CO...,DELITO DE BAJO IMPACTO,Femenino,39,FISICA,19.3458,-99.17532,True
45472,8424563,ROBO A CASA HABITACION CON VIOLENCIA,ROBO A CASA HABITACIÓN CON VIOLENCIA,Femenino,39,FISICA,19.2652,-99.16426,True
50197,8323331,ABUSO DE AUTORIDAD Y USO ILEGAL DE LA FUERZA P...,DELITO DE BAJO IMPACTO,Masculino,39,FISICA,19.43079,-99.13232,True
53020,8464053,VIOLENCIA FAMILIAR,DELITO DE BAJO IMPACTO,Femenino,39,FISICA,19.38037,-99.0317,True
77258,8573201,AMENAZAS,DELITO DE BAJO IMPACTO,Femenino,39,FISICA,19.29913,-99.01814,True
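
The to-do notes above ask how `duplicated` and `drop_duplicates` behave for each value of `keep`. A toy example makes the difference visible:

```python
import pandas as pd

toy = pd.DataFrame({"idcarpeta": [1, 1, 2, 1]})

# keep='first': every repeat after the first occurrence is flagged
print(toy.duplicated(keep="first").tolist())  # [False, True, False, True]

# keep='last': every repeat before the last occurrence is flagged
print(toy.duplicated(keep="last").tolist())   # [True, True, False, False]

# keep=False: all members of a duplicate set are flagged
print(toy.duplicated(keep=False).tolist())    # [True, True, False, True]

# drop_duplicates removes exactly the rows that duplicated() flags
print(len(toy.drop_duplicates(keep="first")))  # 2
print(len(toy.drop_duplicates(keep=False)))    # 1
```

This is why `keep=False` is useful for inspecting duplicate sets, while it is a drastic choice for dropping: it removes every copy, including the one that may be legitimate.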


In [48]:
crimes.head()

Unnamed: 0,idcarpeta,anio_inicio,mes_inicio,fechainicio,delito,categoria,sexo,edad,tipopersona,calidadjuridica,...,anio_hecho,mes_hecho,fechahecho,horahecho,horainicio,alcaldia_hechos,colonia_datos,fgj_colonia_registro,crimen_lat,crimen_lon
0,8324429,2019,1,2019-01-04,FRAUDE,DELITO DE BAJO IMPACTO,Masculino,62,FISICA,OFENDIDO,...,2018,8,2018-08-29,12:00:00,12:19:00,ALVARO OBREGON,GUADALUPE INN,GUADALUPE INN,19.36125,-99.18314
1,8324430,2019,1,2019-01-04,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,Femenino,38,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-15,15:00:00,12:20:00,AZCAPOTZALCO,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,19.47181,-99.16458
2,8324431,2019,1,2019-01-04,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,Masculino,42,FISICA,VICTIMA Y DENUNCIANTE,...,2018,12,2018-12-22,15:30:00,12:23:00,COYOACAN,COPILCO EL BAJO,COPILCO UNIVERSIDAD ISSSTE,19.33797,-99.18611
3,8324435,2019,1,2019-01-04,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,35,FISICA,VICTIMA Y DENUNCIANTE,...,2019,1,2019-01-04,06:00:00,12:27:00,IZTACALCO,PANTITLAN V,AGRÍCOLA PANTITLAN,19.40327,-99.05983
4,8324438,2019,1,2019-01-04,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,Masculino,39,FISICA,VICTIMA,...,2019,1,2019-01-03,20:00:00,12:35:00,IZTAPALAPA,LAS AMERICAS (U HAB),PROGRESISTA,19.3548,-99.06324


In [127]:
# Flag the rows that are full-row duplicates AND sit directly next to another
# full-row duplicate (crimes has a 0..n-1 index after the reset_index above).
# Note: this does not check that the neighboring row belongs to the same duplicate set.
prev_dup = duplicates.shift(1, fill_value=False)
next_dup = duplicates.shift(-1, fill_value=False)
consecutive_duplicates = duplicates & (prev_dup | next_dup)
consecutive_duplicates_index = crimes.index[consecutive_duplicates].tolist()


## Merging stage
**DataFrame.merge** will be used to merge the dataframes **crimes** and **cams**. *crimes.fgj_colonia_registro* is matched against *cams.colonia*, and *crimes.alcaldia_hechos* against *cams.alcaldia*. Afterwards, some columns will be dropped or reordered.
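
Because the two sources spell *colonia* names differently, it is worth measuring how many crimes actually find a match. The `indicator=True` option of `merge` adds a `_merge` column for that purpose. The sketch below uses toy frames with made-up values, so only the mechanism carries over to the real data.

```python
import pandas as pd

# Toy frames with made-up values
crimes_toy = pd.DataFrame({
    "idcarpeta": [1, 2, 3],
    "fgj_colonia_registro": ["SANTA FE", "CENTRO", "GUADALUPE INN"],
    "alcaldia_hechos": ["ALVARO OBREGON", "CUAUHTEMOC", "ALVARO OBREGON"],
})
cams_toy = pd.DataFrame({
    "colonia": ["SANTA FE", "CENTRO"],
    "alcaldia": ["ALVARO OBREGON", "CUAUHTEMOC"],
    "totalinsta": [4, 10],
})

merged = crimes_toy.merge(
    cams_toy,
    how="left",
    left_on=["fgj_colonia_registro", "alcaldia_hechos"],
    right_on=["colonia", "alcaldia"],
    indicator=True,  # adds the _merge column: 'both' or 'left_only' for a left join
)
match_counts = merged["_merge"].value_counts()
print(match_counts)
```

Rows marked `left_only` are crimes whose *colonia* and *alcaldia* pair has no counterpart in the cams data.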


In [41]:
print('Crimes dataset shape is ', crimes.shape)
print('Cams dataset shape is ', cams.shape)

print(crimes.isnull().sum())
print(cams.isnull().sum())


Crimes dataset shape is  (784813, 21)
Cams dataset shape is  (1744, 8)
idcarpeta               0
anio_inicio             0
mes_inicio              0
fechainicio             0
delito                  0
categoria               0
sexo                    0
edad                    0
tipopersona             0
calidadjuridica         0
competencia             0
anio_hecho              0
mes_hecho               0
fechahecho              0
horahecho               0
horainicio              0
alcaldia_hechos         0
colonia_datos           0
fgj_colonia_registro    0
crimen_lat              0
crimen_lon              0
dtype: int64
id             0
alcaldia       0
colonia        0
totalinsta     0
prioritari     0
geo_shape      0
colonia_lat    0
colonia_lon    0
dtype: int64


In [52]:

# print("=============CRIMES=============")
# print(crimes.shape)
# print(crimes.isnull().sum())

# print("=============CAMS=============")
# print(cams.shape)
# print(cams.isnull().sum())

merged_df =  crimes.merge(cams, how= 'left', left_on=['fgj_colonia_registro','alcaldia_hechos'], right_on=['colonia','alcaldia'])
print("=============MERGED_DF=============")
print(merged_df.shape)
print(merged_df.isnull().sum())


keep_this_columns = ['idcarpeta', 'delito', 'categoria', 'alcaldia_hechos', 'alcaldia', 'fgj_colonia_registro','colonia', 'sexo', 'edad', 'tipopersona', 'calidadjuridica',
                     'anio_inicio', 'mes_inicio', 'fechainicio','horainicio', 'competencia',
                    'anio_hecho', 'mes_hecho', 'fechahecho', 'horahecho',
                    'crimen_lat', 'crimen_lon','colonia_lat','colonia_lon','totalinsta','prioritari','geo_shape']

consolidated = merged_df[keep_this_columns].copy()

print("=============CONSOLIDATED=============")
print(consolidated.shape)
print(consolidated.isnull().sum())

consolidated.head()

(791185, 29)
idcarpeta                    0
anio_inicio                  0
mes_inicio                   0
fechainicio                  0
delito                       0
categoria                    0
sexo                         0
edad                         0
tipopersona                  0
calidadjuridica              0
competencia                  0
anio_hecho                   0
mes_hecho                    0
fechahecho                   0
horahecho                    0
horainicio                   0
alcaldia_hechos              0
colonia_datos                0
fgj_colonia_registro         0
crimen_lat                   0
crimen_lon                   0
id                      349149
alcaldia                349149
colonia                 349149
totalinsta              349149
prioritari              349149
geo_shape               349149
colonia_lat             349149
colonia_lon             349149
dtype: int64
(791185, 27)
idcarpeta                    0
delito                       0


Unnamed: 0,idcarpeta,delito,categoria,alcaldia_hechos,alcaldia,fgj_colonia_registro,colonia,sexo,edad,tipopersona,...,mes_hecho,fechahecho,horahecho,crimen_lat,crimen_lon,colonia_lat,colonia_lon,totalinsta,prioritari,geo_shape
0,8324429,FRAUDE,DELITO DE BAJO IMPACTO,ALVARO OBREGON,ALVARO OBREGON,GUADALUPE INN,GUADALUPE INN,Masculino,62,FISICA,...,8,2018-08-29,12:00:00,19.36125,-99.18314,19.357275,-99.187114,9.0,0.0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1826..."
1,8324430,"PRODUCCIÓN, IMPRESIÓN, ENAJENACIÓN, DISTRIBUCI...",DELITO DE BAJO IMPACTO,AZCAPOTZALCO,AZCAPOTZALCO,VICTORIA DE LAS DEMOCRACIAS,VICTORIA DE LAS DEMOCRACIAS,Femenino,38,FISICA,...,12,2018-12-15,15:00:00,19.47181,-99.16458,19.468811,-99.164122,5.0,0.0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1612..."
2,8324431,ROBO A TRANSEUNTE SALIENDO DEL BANCO CON VIOLE...,ROBO A CUENTAHABIENTE SALIENDO DEL CAJERO CON ...,COYOACAN,COYOACAN,COPILCO UNIVERSIDAD ISSSTE,COPILCO UNIVERSIDAD ISSSTE,Masculino,42,FISICA,...,12,2018-12-22,15:30:00,19.33797,-99.18611,19.337162,-99.182207,0.0,0.0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1808..."
3,8324435,ROBO DE VEHICULO DE SERVICIO PARTICULAR SIN VI...,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,IZTACALCO,,AGRÍCOLA PANTITLAN,,Masculino,35,FISICA,...,1,2019-01-04,06:00:00,19.40327,-99.05983,,,,,
4,8324438,ROBO DE MOTOCICLETA SIN VIOLENCIA,ROBO DE VEHÍCULO CON Y SIN VIOLENCIA,IZTAPALAPA,IZTAPALAPA,PROGRESISTA,PROGRESISTA,Masculino,39,FISICA,...,1,2019-01-03,20:00:00,19.3548,-99.06324,19.356632,-99.066046,9.0,1.0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.0631..."


In [64]:
#consolidated.to_csv('consolidated.csv')
#crimes.tail()
duplicates = consolidated['idcarpeta'].duplicated()
duplicated_rows=consolidated[duplicates]

duplicates = crimes['idcarpeta'].duplicated()
duplicated_rows=crimes[duplicates]

duplicates = cams[['colonia', 'alcaldia']].duplicated()
duplicated_rows=cams[duplicates]



duplicated_rows

Unnamed: 0,id,alcaldia,colonia,totalinsta,prioritari,geo_shape,colonia_lat,colonia_lon
110,110,CUAJIMALPA DE MORELOS,CRUZ BLANCA,3,0,"{""type"": ""MultiPolygon"", ""coordinates"": [[[[-9...",19.327217,-99.30821
210,210,ALVARO OBREGON,BELEN DE LAS FLORES,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2320...",19.355859,-99.234221
697,697,ALVARO OBREGON,LOMAS DE TARANGO,2,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2083...",19.364082,-99.214819
727,727,ALVARO OBREGON,SANTA FE,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2318...",19.382868,-99.239754
728,728,ALVARO OBREGON,SANTA FE,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2449...",19.379403,-99.246961
735,735,ALVARO OBREGON,SANTA FE,0,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.2246...",19.38598,-99.226336
845,845,COYOACAN,COPILCO UNIVERSIDAD,1,0,"{""type"": ""Polygon"", ""coordinates"": [[[-99.1741...",19.335466,-99.17944
1032,1032,TLAHUAC,SAN AGUSTIN,4,0,"{""type"": ""Polygon"", ""coordinates"": [[[-98.9956...",19.243428,-98.999587
1035,1035,TLAHUAC,SAN MIGUEL,6,0,"{""type"": ""Polygon"", ""coordinates"": [[[-98.9635...",19.222575,-98.962576
1105,1105,XOCHIMILCO,SANTA CECILIA TEPETLAPA,29,1,"{""type"": ""Polygon"", ""coordinates"": [[[-99.0829...",19.219048,-99.09523


After this first merge, almost 350k rows (349,149) of the merged dataframe have no camera information, because their colonia/alcaldia pair found no match in *cams*. The colonia/alcaldia names in the two datasets may be written differently while referring to the same place; hence, a homologation process is needed so that those rows pick up the camera data. A fuzzy string matching library, such as **fuzzywuzzy**, might help to avoid the long, brute-force approach of fixing the names by hand.
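The shapes printed earlier also show that the merged dataframe has more rows than *crimes* (791,185 vs. 784,813). In a left merge this can only happen when the right-hand keys repeat, which is consistent with the duplicated (colonia, alcaldia) pairs found in *cams*. The sketch below reproduces the effect on made-up rows and shows one possible remedy, which is an assumption and not a decision taken in this notebook: collapsing *cams* to one row per key and summing `totalinsta`.

```python
import pandas as pd

crimes_toy = pd.DataFrame({
    "idcarpeta": [10, 11],
    "alcaldia_hechos": ["ALVARO OBREGON", "ALVARO OBREGON"],
    "fgj_colonia_registro": ["SANTA FE", "GUADALUPE INN"],
})
# "SANTA FE" appears twice, mimicking the duplicated keys in cams (values are illustrative)
cams_toy = pd.DataFrame({
    "alcaldia": ["ALVARO OBREGON"] * 3,
    "colonia": ["SANTA FE", "SANTA FE", "GUADALUPE INN"],
    "totalinsta": [4, 4, 9],
})
keys_left = ["fgj_colonia_registro", "alcaldia_hechos"]
keys_right = ["colonia", "alcaldia"]

# The crime in SANTA FE is matched twice, so the result has 3 rows instead of 2
inflated = crimes_toy.merge(cams_toy, how="left", left_on=keys_left, right_on=keys_right)

# Assumed remedy: one row per (colonia, alcaldia), adding up the camera counts
cams_unique = cams_toy.groupby(keys_right, as_index=False).agg(totalinsta=("totalinsta", "sum"))

# validate="many_to_one" makes pandas raise if the right-hand keys are still repeated
clean = crimes_toy.merge(
    cams_unique, how="left", left_on=keys_left, right_on=keys_right, validate="many_to_one"
)
print(len(inflated), len(clean))
```

Whether summing is the right aggregation depends on what the repeated polygons represent, so it should be checked against the source data before being applied.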

In [43]:
print(np.sort(crimes.alcaldia_hechos.unique()), '\n')
print(np.sort(cams.alcaldia.unique()))

['ALVARO OBREGON' 'AZCAPOTZALCO' 'BENITO JUAREZ' 'COYOACAN'
 'CUAJIMALPA DE MORELOS' 'CUAUHTEMOC' 'GUSTAVO A. MADERO' 'IZTACALCO'
 'IZTAPALAPA' 'LA MAGDALENA CONTRERAS' 'MIGUEL HIDALGO' 'MILPA ALTA'
 'TLAHUAC' 'TLALPAN' 'VENUSTIANO CARRANZA' 'XOCHIMILCO'] 

['ALVARO OBREGON' 'AZCAPOTZALCO' 'BENITO JUAREZ' 'COYOACAN'
 'CUAJIMALPA DE MORELOS' 'CUAUHTEMOC' 'GUSTAVO A. MADERO' 'IZTACALCO'
 'IZTAPALAPA' 'LA MAGDALENA CONTRERAS' 'MIGUEL HIDALGO' 'MILPA ALTA'
 'TLAHUAC' 'TLALPAN' 'VENUSTIANO CARRANZA' 'XOCHIMILCO']


The *alcaldia* values in both original dataframes are the same, so the homologation is only needed on the *colonia* columns. Since there are more rows in the crimes dataset, the *colonia* values in the cams dataset will be the ones adjusted. Performing an outer join and keeping the rows where the crimes columns are null (that is, the rows that exist only in *cams*) helps identify which colonia names have to be changed.
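A hedged sketch of that idea on made-up rows: with `indicator=True`, the rows that exist only in *cams* are labelled `right_only`, so the unmatched colonia names on each side can be listed without checking for nulls. The spelling variants below are illustrative and are not claimed to be the actual contents of *cams*.

```python
import pandas as pd

crimes_toy = pd.DataFrame({
    "alcaldia_hechos": ["COYOACAN", "COYOACAN"],
    "fgj_colonia_registro": ["COPILCO UNIVERSIDAD ISSSTE", "COPILCO EL BAJO"],
})
cams_toy = pd.DataFrame({
    "alcaldia": ["COYOACAN", "COYOACAN"],
    "colonia": ["COPILCO UNIVERSIDAD I.S.S.S.T.E.", "COPILCO EL BAJO"],
})

outer = crimes_toy.merge(
    cams_toy,
    how="outer",
    left_on=["alcaldia_hechos", "fgj_colonia_registro"],
    right_on=["alcaldia", "colonia"],
    indicator=True,
)

# Names present only in cams: candidates to be renamed
cams_only = outer.loc[outer["_merge"] == "right_only", ["alcaldia", "colonia"]]
# Names present only in crimes: the pool of possible substitutes
crimes_only = outer.loc[outer["_merge"] == "left_only", ["alcaldia_hechos", "fgj_colonia_registro"]]
print(cams_only)
print(crimes_only)
```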

In [45]:
merged_df2 = crimes.merge(cams, how= 'outer', left_on=['alcaldia_hechos','fgj_colonia_registro'], right_on=['alcaldia','colonia'])

keep_this_columns = ['idcarpeta', 'delito', 'categoria', 'alcaldia_hechos', 'alcaldia', 'fgj_colonia_registro','colonia', 'sexo', 'edad', 'tipopersona', 'calidadjuridica',
                     'anio_inicio', 'mes_inicio', 'fechainicio','horainicio', 'competencia',
                    'anio_hecho', 'mes_hecho', 'fechahecho', 'horahecho',
                    'crimen_lat', 'crimen_lon','colonia_lat','colonia_lon','totalinsta','prioritari','geo_shape']

consolidated2= merged_df2 [keep_this_columns].copy()
print(consolidated2.shape)

consolidated2_nulls= consolidated2[consolidated2["fgj_colonia_registro"].isnull()].copy()
print(consolidated2_nulls.shape)

#.head()


(792083, 27)
(898, 27)


A list of the problematic *colonia* names from the cams dataset will help to replace those values with a substitute name taken from the crimes dataset.


In [None]:
!pip install fuzzywuzzy

In [None]:
from fuzzywuzzy import fuzz, process

def get_best_substitute(substitute_this, potential_substitutes):
    # Get the best match and its score using fuzzy matching
    best_match, score = process.extractOne(substitute_this, potential_substitutes)
    if score >= 90:  # You can adjust the threshold as needed
        return best_match
    else:
        return None

1. Create a dictionary 'substitute_dict' that contains the recommended substitutions.
2. Create a dictionary 'crimes_colonia_to_alcaldia_dict' that contains 'colonia' names from the crimes dataframe as keys, and 'alcaldia' names from the same dataset as values.
3. Create a function that takes the 'colonia' to be substituted, finds the substitution in the first dict and uses it to get the matching 'alcaldia' from the second dict.
4. Return the substituted value from the first dict only if the 'alcaldia' names from the crimes and cams datasets match.
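The steps above can be sketched end to end on a couple of made-up rows. To keep the example free of external dependencies, it swaps fuzzywuzzy for the standard library's `difflib.get_close_matches`; note that its `cutoff` is on a 0 to 1 scale, not the 0 to 100 score fuzzywuzzy returns. The helper names `best_substitute` and `homologate` are hypothetical.

```python
import difflib

# Illustrative inputs (made-up values); in the notebook these come from cams and crimes
cams_rows = [
    {"colonia": "COPILCO UNIVERSIDAD I.S.S.S.T.E.", "alcaldia": "COYOACAN"},
    {"colonia": "SAN MIGUEL", "alcaldia": "TLAHUAC"},
]
# Step 2: colonia -> alcaldia lookup built from the crimes side
crimes_colonia_to_alcaldia = {
    "COPILCO UNIVERSIDAD ISSSTE": "COYOACAN",
    "SAN MIGUEL": "IZTAPALAPA",
}

def best_substitute(name, candidates, cutoff=0.8):
    """Return the closest candidate, or None when nothing is similar enough."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Step 1: recommended substitution for every cams colonia
substitute_dict = {
    row["colonia"]: best_substitute(row["colonia"], list(crimes_colonia_to_alcaldia))
    for row in cams_rows
}

def homologate(row):
    """Steps 3 and 4: accept the substitution only when both alcaldias agree."""
    candidate = substitute_dict.get(row["colonia"])
    if candidate is not None and crimes_colonia_to_alcaldia.get(candidate) == row["alcaldia"]:
        return candidate
    return None

results = [homologate(row) for row in cams_rows]
print(results)
```

In this toy data the second row is rejected even though the colonia name matches exactly, because the same name belongs to a different alcaldia on the crimes side. That is the case the alcaldia check in step 4 is meant to catch.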

In [None]:
missing_colonias = []
missing_colonias =list(consolidated2_nulls.colonia.unique())
substitute_list = list(crimes.fgj_colonia_registro.unique())

missing_colonias.sort()
substitute_list.sort()

# Create a dictionary with target values as keys and their best substitutes as values
substitute_dict = {value: get_best_substitute(value, substitute_list) for value in missing_colonias}

#import json
#pretty_dict = json.dumps(substitute_dict, indent=4)
#print(pretty_dict)
#print(substitute_dict.get("ACUEDUCTO DE GUADALUPE MODULAR"))

In [None]:
# Create a dictionary that maps each 'colonia' name in crimes to its 'alcaldia'.
# If a colonia name appears under several alcaldias, the last pair seen wins.
crimes_colonia_to_alcaldia_dict = dict(zip(crimes['fgj_colonia_registro'], crimes['alcaldia_hechos']))


In [None]:
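def homologation(row, dict_subs, dict_compare, sub_col, ref_col, type):
    """Return the substitute colonia (type='colonia') or its alcaldia (any other type).

    The substitution is accepted only when the alcaldia paired with the substitute
    in dict_compare equals the row's own alcaldia; otherwise None is returned.
    """
    to_be_substituted = row[sub_col]
    reference_val = row[ref_col]
    # Get the colonia substitution name
    recommended_sub = dict_subs.get(to_be_substituted)
    # Obtain the alcaldia name related to that substitution name
    validate_sub = dict_compare.get(recommended_sub)
    if reference_val != validate_sub:
        return None
    return recommended_sub if type == 'colonia' else validate_sub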
    
    

In [None]:
# Create a new column 'colonia2' in the cams DataFrame
cams['colonia2'] = None
cams['relatedCol']=None


# Iterate over each row in the DataFrame and apply the homologation function
for index, row in cams.iterrows():
    cams.at[index, 'colonia2']   = homologation(row, substitute_dict, crimes_colonia_to_alcaldia_dict, 'colonia', 'alcaldia','colonia')
    cams.at[index, 'relatedCol'] = homologation(row, substitute_dict, crimes_colonia_to_alcaldia_dict, 'colonia', 'alcaldia','other')
   

In [None]:

# axis=1 passes each row to homologation; type='colonia' returns the substitute colonia name
cams['colonia2'] = cams.apply(homologation,
                              axis=1,
                              dict_subs=substitute_dict,
                              dict_compare=crimes_colonia_to_alcaldia_dict,
                              sub_col="colonia",
                              ref_col='alcaldia',
                              type='colonia'
                             )


In [None]:
print(cams.shape)
cams.isnull().sum()


In [None]:
cams[(cams["relatedCol"] !=cams["alcaldia"]) & ~(cams["relatedCol"].isnull()) ].head(20)
#cams[~cams["relatedCol"].isnull()].head(20)

In [None]:
cams.head()

In [None]:
data_dict={'cams_colonia': list(substitute_dict.keys()), 'substitute_colonia': list(substitute_dict.values()) }
#k= list(substitute_dict.keys())
#print(k)
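# NOTE: this reuses the name `homologation`, shadowing the function defined earlier;
# re-run the cell that defines the function before calling it again.
homologation = pd.DataFrame(data_dict)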
#homologation.head()


To ensure that the *colonia* substitute name is correct, the *alcaldia* values will be compared before replacing anything. A lookup dataframe built from the *crimes* dataset pairs each *colonia* with its *alcaldia*.

In [None]:
crime_locations= crimes.groupby(['fgj_colonia_registro','alcaldia_hechos'],as_index=False).agg(count=('idcarpeta','count')).copy()
crime_locations.head()


In [None]:
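# Look up the alcaldia of each colonia on both sides with Series.map
# (comparing Series with different indexes using == raises an error).
# When a colonia name appears in several alcaldias, the first one is kept.
cams_lookup = cams.drop_duplicates("colonia").set_index("colonia")["alcaldia"]
crime_lookup = crime_locations.drop_duplicates("fgj_colonia_registro").set_index("fgj_colonia_registro")["alcaldia_hechos"]
homologation["alcaldia_cams"] = homologation["cams_colonia"].map(cams_lookup)
homologation["alcaldia_crime"] = homologation["substitute_colonia"].map(crime_lookup)
homologation["matches"] = (homologation["alcaldia_crime"] == homologation["alcaldia_cams"])

homologation.head()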



In [None]:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# Return the best fuzzy match for `value` among ref_df[ref_col], or None if the score is below the threshold
def find_fuzzy_match(value, ref_df, ref_col):
    match, score, *_ = process.extractOne(value, ref_df[ref_col])
    if score >= 80:  # You can adjust the threshold as needed
        return match
    else:
        return None

In [None]:
# Preview a few of the recommended substitutions
list(substitute_dict.items())[:5]

In [None]:
for old_colonia in substitute_dict.keys():
    # Rows of cams that carry the colonia name to be replaced
    alcaldia_old = cams[cams["colonia"] == old_colonia]

In [None]:
colonias_crimes =crimes.fgj_colonia_registro.unique()

In [None]:
print(consolidated.shape)
consolidated.isnull().sum()


In [None]:
merged_df2 = crimes.merge(cams, how= 'right', left_on=['fgj_colonia_registro','alcaldia_hechos'], right_on=['colonia','alcaldia'])
#merged_df2 = crimes.merge(cams, how= 'left', left_on=['alcaldia_hechos','colonia_datos'], right_on=['alcaldia','colonia'])


keep_this_columns = ['idcarpeta', 'delito', 'categoria', 'alcaldia_hechos', 'alcaldia', 'fgj_colonia_registro','colonia', 'sexo', 'edad', 'tipopersona', 'calidadjuridica',
                     'anio_inicio', 'mes_inicio', 'fechainicio','horainicio', 'competencia',
                    'anio_hecho', 'mes_hecho', 'fechahecho', 'horahecho',
                    'crimen_lat', 'crimen_lon','colonia_lat','colonia_lon','totalinsta','prioritari','geo_shape']


consolidated2 = merged_df2 [keep_this_columns].copy()
print(consolidated2.shape)
print(consolidated2.isnull().sum())
consolidated2.head()


In [None]:
missing_colonias= consolidated[ consolidated["colonia"].isnull()]

missing_colonias



## Relation with *stations* dataframe

Relating the new consolidated dataframe to the *stations* dataset takes a bit more work, because there is no shared key. Instead, the distance between each crime location and every metro station is calculated in order to find the nearest one. The **haversine_distance** function will be applied and the new columns *nearest_distance*, *nearest_location* and *nearest_station* will be populated.


In [None]:
import math

def haversine_distance(lat1, lon1, lat2, lon2):
    # Earth's radius in kilometers
    earth_radius = 6371.0

    # Convert latitude and longitude from degrees to radians
    lat1_rad = math.radians(lat1)
    lon1_rad = math.radians(lon1)
    lat2_rad = math.radians(lat2)
    lon2_rad = math.radians(lon2)

    # Calculate differences in latitude and longitude
    d_lat = lat2_rad - lat1_rad
    d_lon = lon2_rad - lon1_rad

    # Haversine formula
    a = math.sin(d_lat / 2) ** 2 + math.cos(lat1_rad) * math.cos(lat2_rad) * math.sin(d_lon / 2) ** 2
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1 - a))

    # Calculate the distance
    distance = earth_radius * c
    return distance


In [None]:
def find_nearest_location(ref_lat, ref_lon, locations, stationsDf):
    # stationsDf is not used; it is kept so that existing calls keep working
    nearest_distance = float('inf')
    nearest_location = None
    nearest_station = None  # stays None if `locations` is empty

    # Compare the crime location with each of the 195 metro station locations
    for lat, lon, station_name in locations:
        lat = float(lat)
        lon = float(lon)
        distance = haversine_distance(ref_lat, ref_lon, lat, lon)
        if distance < nearest_distance:
            nearest_distance = distance
            nearest_location = (lat, lon)
            nearest_station = station_name

    return nearest_distance, nearest_location, nearest_station


In [None]:
consolidated['nearest_distance'], consolidated['nearest_location'], consolidated['nearest_station'] = zip(*consolidated.apply(
    lambda row: find_nearest_location( ref_lat   = row['crimen_lat'], 
                                       ref_lon   = row['crimen_lon'], 
                                       locations = stations[['station_lat', 'station_lon', 'nombre']].values,
                                       stationsDf = stations
                                     ),
                axis = 1
    )
  )
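The row-wise `apply` above evaluates the haversine formula once per crime and station pair in pure Python, which can be slow for a dataframe of this size. The following is a vectorized sketch of the same calculation using NumPy broadcasting. The function name `haversine_matrix`, the coordinates and the station names are illustrative assumptions, not part of the notebook.

```python
import numpy as np

def haversine_matrix(lat1, lon1, lat2, lon2, earth_radius=6371.0):
    """Distances in km between every point in (lat1, lon1) and every point in (lat2, lon2)."""
    # Column vectors for the first set, row vectors for the second, so they broadcast to a matrix
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    # arcsin(sqrt(a)) is equivalent to atan2(sqrt(a), sqrt(1 - a)) for a in [0, 1]
    return 2 * earth_radius * np.arcsin(np.sqrt(np.clip(a, 0, 1)))

# Illustrative coordinates only
crime_lat = [19.36125, 19.47181]
crime_lon = [-99.18314, -99.16458]
station_lat = [19.36125, 19.40]
station_lon = [-99.18314, -99.10]
station_names = np.array(["STATION_A", "STATION_B"])

dist = haversine_matrix(crime_lat, crime_lon, station_lat, station_lon)  # shape (crimes, stations)
nearest_idx = dist.argmin(axis=1)
nearest_distance = dist[np.arange(len(crime_lat)), nearest_idx]
nearest_station = station_names[nearest_idx]
print(nearest_station, nearest_distance)
```

For the full dataset the distance matrix would hold roughly 791k x 195 values, so it may need to be computed in chunks of rows to keep memory use reasonable.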

In [None]:
consolidated.head()




In [None]:
print('Min. nearest distance is ',consolidated.nearest_distance.min())
print('Max. nearest distance is ',consolidated.nearest_distance.max())
consolidated.nearest_station.unique()


In [None]:
consolidated.shape

This is the consolidated dataset that can be used as the base for future calculations, and it is exported as a zipped CSV at this stage. However, more data on camera locations, from the **C5** program, was found, so that information can be merged in as well.


In [None]:
consolidated.to_csv('../datasets/consolidated.csv.zip', compression = 'zip', index=False)

## C5 cameras
Mexico City's security organization **C5** installed Wi-Fi modules on its public security cameras. The data on those cameras and their locations is public and can be used in this project.

In [None]:
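# Same file name that was used in the export step; raw string avoids backslash escapes
consolidated_path = r"\datasets\consolidated.csv.zip"

# The file was written with compression='zip', so pandas can read the single-file archive directly
consolidated = pd.read_csv(repo_path + consolidated_path, compression='zip')
consolidated.head()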


In [None]:
consolidated.isnull().sum()

In [None]:
c5_path = "\datasets\c5_cams.zip"

zf = zipfile.ZipFile(repo_path+c5_path) 
c5_raw = pd.read_csv(zf.open('c5_cams.csv'), encoding = "latin-1")

c5_raw.head()

In [None]:
c5_raw.rename(columns = lambda x : x.lower() , inplace = True)
c5_raw.rename(columns = {'alcaldía':'alcaldia'}, inplace = True)
del c5_raw["programa"]
del c5_raw["puntos_de_acceso"]


In [None]:
c5_raw.head()

The next step is to relate this information to the consolidated dataset. Some ideas are:

- Counting the number of cameras installed in each neighborhood (colonia) and/or alcaldia/delegacion, as well as the number of cameras within a certain radius of each crime.
- Finding each crime's nearest camera from the coordinates of the crime and the cameras. The calculation would be done only for cameras in the same 'alcaldia' or neighborhood to reduce processing time.
- Building a dictionary of which alcaldias are next to each other, so that the step above could also consider the neighboring alcaldias.
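As a rough sketch of the first two ideas, the function below counts the cameras within a given radius of a crime, looking only at cameras in the same alcaldia. The camera coordinate columns `cam_lat` and `cam_lon`, the helper names and the 0.5 km radius are assumptions made for the example; the actual column names in the C5 file may differ.

```python
import numpy as np
import pandas as pd

def haversine_km(lat1, lon1, lat2, lon2, earth_radius=6371.0):
    """Vectorized haversine distance in kilometers (inputs in degrees)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * earth_radius * np.arcsin(np.sqrt(np.clip(a, 0, 1)))

def cameras_within_radius(crime_row, cameras, radius_km=0.5):
    """Count cameras in the crime's alcaldia that lie within radius_km of it."""
    same_alcaldia = cameras[cameras["alcaldia"] == crime_row["alcaldia_hechos"]]
    if same_alcaldia.empty:
        return 0
    d = haversine_km(
        crime_row["crimen_lat"], crime_row["crimen_lon"],
        same_alcaldia["cam_lat"].to_numpy(), same_alcaldia["cam_lon"].to_numpy(),
    )
    return int((d <= radius_km).sum())

# Made-up rows for illustration
crimes_toy = pd.DataFrame({
    "alcaldia_hechos": ["COYOACAN", "MILPA ALTA"],
    "crimen_lat": [19.33797, 19.19],
    "crimen_lon": [-99.18611, -99.02],
})
cameras_toy = pd.DataFrame({
    "alcaldia": ["COYOACAN", "COYOACAN", "TLALPAN"],
    "cam_lat": [19.3380, 19.3500, 19.3380],
    "cam_lon": [-99.1860, -99.1861, -99.1860],
})

crimes_toy["cams_500m"] = crimes_toy.apply(cameras_within_radius, axis=1, cameras=cameras_toy)
print(crimes_toy)
```

Restricting the search to one alcaldia keeps each comparison small, but it misses cameras just across an alcaldia border, which is what the dictionary of neighboring alcaldias in the last idea would address.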


In [None]:
consolidated.shape


In [None]:
c5_agg = c5_raw.groupby('colonia').agg(count=('id','count'), alcaldia=('alcaldia', 'first'))
c5_agg.reset_index(inplace=True)
c5_agg.rename(columns = {'count':'c5_cam_col'}, inplace = True)
c5_agg.head()


In [None]:
consolidated2 = pd.merge(consolidated, c5_agg[['colonia','c5_cam_col']], on='colonia', how='left')
#consolidated2["c5_cam_col"]= consolidated2["c5_cam_col"].round().astype(int)
consolidated2.head()

In [None]:
#consolidated2["c5_cam_col"]= consolidated2["c5_cam_col"].round().astype(int)
consolidated2.isnull().sum()


There are many null values in *c5_cam_col*. The probable cause is differences between the colonia names in *c5_agg* and *consolidated*, which a homologation process could fix. First, the colonia names in *consolidated* that found no match in *c5_agg* must be identified:

In [None]:
consolidated2[consolidated2["c5_cam_col"].isnull()].head()

In [None]:
missing_colonias= consolidated2.loc[ consolidated2["c5_cam_col"].isnull(), 'colonia'].unique()

In [None]:
l= list(missing_colonias)

In [None]:
test= c5_agg[c5_agg["colonia"].str.contains('COPILCO',na=False, case=False)]
test

For this example, the substring *COPILCO* was used to search for similar colonia names. *COPILCO UNIVERSIDAD I.S.S.S.T.E.* in *c5_agg* is not the same string as *COPILCO UNIVERSIDAD ISSSTE* in *consolidated2*, but both values refer to the same colonia. Since there are almost 300 values here, the search for matching pairs should be automated. A first approach is to use an external library, such as **fuzzywuzzy**.
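Part of the mismatch in this example is only punctuation, and other colonia names in these datasets differ only by accents (for instance *AGRÍCOLA* vs. *AGRICOLA*). A normalization pass applied to both sides before any fuzzy matching may already reconcile such names. The sketch below uses only the standard library; `normalize_colonia` is a hypothetical helper, not something defined in this notebook.

```python
import re
import unicodedata

def normalize_colonia(name):
    """Uppercase, strip accents and punctuation, and collapse whitespace."""
    # NFKD splits accented letters into a base letter plus a combining mark
    text = unicodedata.normalize("NFKD", str(name).upper())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop anything that is not a letter, digit or space (dots, parentheses, etc.)
    text = re.sub(r"[^A-Z0-9 ]", "", text)
    return re.sub(r"\s+", " ", text).strip()

print(normalize_colonia("COPILCO UNIVERSIDAD I.S.S.S.T.E."))
print(normalize_colonia("AGRÍCOLA PANTITLAN"))
```

One caveat of this approach is that it also turns *Ñ* into *N*, so two names that differ only by that letter would collapse into the same key.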

In [None]:
print(consolidated2[consolidated2["c5_cam_col"].isnull()].shape)
consolidated2[consolidated2["c5_cam_col"].isnull()].head()

An aggregated dataset from *consolidated* is created here so that the matching colonia name is searched among the unique values. 

In [None]:
#c5_agg['possible_substitute'] = c5_agg['colonia'].apply(find_fuzzy_match, ref_df=consolidated, ref_col='colonia')

consolidated_agg = consolidated.groupby('colonia').agg(count=('colonia','count'), alcaldia=('alcaldia', 'first'))
consolidated_agg.reset_index(inplace=True)

consolidated_agg.head()

#print(find_fuzzy_match('COPILCO UNIVERSIDAD I.S.S.S.T.E.', consolidated, 'colonia'))

In [None]:

# For each colonia in c5_agg, find the best fuzzy match among the colonia names in consolidated_agg
# (None when no candidate reaches the score threshold)
c5_agg['possible_substitute'] = c5_agg['colonia'].map(lambda x: find_fuzzy_match(x, consolidated_agg, 'colonia'))



In [None]:
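# Keep only the colonias whose best fuzzy match differs from the original name
homologated = c5_agg[c5_agg["colonia"] != c5_agg["possible_substitute"]].copy()

# Look up the alcaldia of each suggested substitute in consolidated_agg
# (comparing the two Series directly with == fails because their indexes differ)
alcaldia_by_colonia = consolidated_agg.set_index("colonia")["alcaldia"]
homologated["alcaldia_substitute"] = homologated["possible_substitute"].map(alcaldia_by_colonia)
homologated.head()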

In [None]:
homologated = homologated.sort_values(by="colonia")
homologated

In [None]:
pd.set_option('display.max_columns', None)
homologated = homologated.loc[:, ['colonia', 'possible_substitute','alcaldia']]
homologated.to_csv('homologated.csv')

In [None]:
c5_agg[c5_agg['colonia'].str.contains("AGRICOLA")]

In [None]:
#print(l)
consolidated_agg[consolidated_agg['colonia'].str.contains("AGRÍCOLA")]

In [None]:
c5_agg.head()