# Clean NUSE
Our first source of information is a database of the crime reports that citizens made in 2018 to NUSE, the security and emergency call center of Bogotá, Colombia (NUSE is the Spanish acronym for "Número Único de Seguridad y Emergencias").

This script has two objectives:
1. Because NUSE is built from individual citizen reports, a single crime often appears in more than one report. We therefore unify all the reports that meet the following three conditions:

    (1) A crime $i$ occurred at a maximum distance of 500 meters from another crime $j$.

    (2) Crime $i$ is separated from crime $j$ by no more than 8 hours.

    (3) The National Police Department classified crimes $i$ and $j$ with the same typology.
    
This process produces the `unique_nuse` data frame, an intermediate output that is used to accomplish the second objective of this script.

2. Because our research is only interested in violent crimes, we filter the `unique_nuse` data frame to keep only violent crimes and produce `violent_nuse`, the main output of this script. This data frame is used in the next script, `2_append_crimes.ipynb`, in which we join `violent_nuse` with siedco, the official crime dataset of the Colombian National Police.
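
The three matching conditions can be sketched for a single pair of reports as follows. This is a minimal illustration and not the notebook's implementation: the helper `same_crime`, the dictionary keys, and the sample values are hypothetical. Distance is approximated by treating one degree as 111,319 meters, the same approximation used in the aggregation step of this script.

```python
import math

METERS_PER_DEGREE = 111_319  # approximate length of one degree at the equator


def same_crime(a, b, max_m=500, max_h=8):
    """Return True if reports a and b satisfy the three matching conditions."""
    # (1) Spatial proximity: Euclidean distance in degrees, converted to meters
    dist_m = math.hypot(a["lat"] - b["lat"], a["lon"] - b["lon"]) * METERS_PER_DEGREE
    # (2) Temporal proximity: difference between fractional hours of the day
    dist_h = abs(a["hour"] - b["hour"])
    # (3) Same typology
    return dist_m < max_m and dist_h < max_h and a["type"] == b["type"]


# Hypothetical reports (coordinates, hours, and codes are made up)
r1 = {"lat": 4.6500, "lon": -74.0800, "hour": 10.5, "type": "934"}
r2 = {"lat": 4.6510, "lon": -74.0810, "hour": 12.0, "type": "934"}  # about 157 m away
r3 = {"lat": 4.7000, "lon": -74.0800, "hour": 10.5, "type": "934"}  # about 5.6 km away

print(same_crime(r1, r2))  # True
print(same_crime(r1, r3))  # False
```

Reports `r1` and `r2` would be unified into a single crime, while `r3` would remain a separate crime because it fails the distance condition.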

## 0. Import packages and data

In [7]:
# Beginning of code

# Load packages
import pandas as pd
import numpy as np
from scipy.spatial import distance
from tqdm import tqdm

In [8]:
# Import nuse as df
nuse = pd.read_csv('../../Data/nuse_2018_raw.csv', sep = ";", low_memory = False)

In [9]:
# Fix the data types of the date-related columns
nuse.FECHA_ORIG = pd.to_datetime(nuse.FECHA_ORIG, format = "%d/%m/%Y %H:%M:%S")
nuse.FECHA = pd.to_datetime(nuse.FECHA, format = "%d/%m/%Y")

## 1. Aggregation of reports to build a unique crime dataset

In [10]:
# Create an hour variable (HORA2) that also reflects the minutes
nuse["MINUTOS"] = nuse["FECHA_ORIG"].dt.minute
# Reports without a timestamp get 0 minutes
nuse["MINUTOS"] = nuse["MINUTOS"].fillna(0)
nuse["HORA2"] = nuse["HORA"] + nuse["MINUTOS"]/60
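
As a small illustration of the fractional-hour variable, the sketch below builds it on a toy series. The timestamps are made up, and the hour is derived here from the timestamp itself with `.dt.hour`, whereas the cell above takes it from the existing `HORA` column.

```python
import pandas as pd

# Hypothetical timestamps in the same day/month/year format as FECHA_ORIG
ts = pd.to_datetime(
    pd.Series(["01/01/2018 10:30:00", "01/01/2018 23:45:00"]),
    format="%d/%m/%Y %H:%M:%S",
)

# Hour of the day plus the minutes expressed as a fraction of an hour
hora2 = ts.dt.hour + ts.dt.minute / 60
print(hora2.tolist())  # [10.5, 23.75]
```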

In [None]:
# An empty data frame to store the results
unique_nuse = pd.DataFrame()
# We iterate by dates
grilla_fechas = np.sort(nuse.FECHA.unique())

for i in tqdm(grilla_fechas):
    # For each date, we identify all the crimes reported
    i = pd.to_datetime(i)
    filtro = nuse.FECHA == i
    df_small = nuse.loc[filtro,].reset_index(drop = True)

    # Then, we divide the reports into groups according to the typology of the crime
    tipos = df_small["TIPO_DETALLE"].unique()
    for t in tipos:
        filtro = df_small["TIPO_DETALLE"] == t
        df_small2 = df_small.loc[filtro,].reset_index(drop = True)

        # Now we identify the events that happened less than 500 meters apart
        coords = df_small2[["LATITUD", "LONGITUD"]].values
        # First, calculate the Euclidean distance between the coordinates
        eu_d = distance.cdist(coords, coords, 'euclidean')
        # Convert the results from degrees to meters.
        # At the equator, one degree is equivalent to about 111,319 meters
        dist_m = eu_d * 111319
        # Build the filter
        cercanos1 = dist_m < 500

        # Lastly, create a filter to identify whether the events are less than 8 hours apart
        horas = df_small2["HORA2"].values
        dist_h = np.abs(np.subtract.outer(horas, horas))
        cercanos2 = dist_h < 8

        # Two reports refer to the same crime if both the 500-meter proximity condition and the
        # 8-hour proximity condition are met.
        cercanos = cercanos1 & cercanos2

        # Since the matrix is symmetric, we only keep the upper triangle (diagonal included)
        cercanos = np.triu(cercanos)

        # Now we keep only the unique crimes. Because the matrix is upper triangular, a column sum
        # equal to 1 means that the report matches only itself and no earlier report
        indices_guardar = np.where(np.sum(cercanos, axis = 0) == 1)[0]
        
        # Store the results
        unique_nuse = pd.concat([unique_nuse, df_small2.loc[indices_guardar,]]).reset_index(drop = True)
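
To see why the column sums identify the reports to keep, consider a toy proximity matrix. The matrix below is invented for illustration: reports 0 and 1 are close to each other, and report 2 is close to no other report.

```python
import numpy as np

# Hypothetical symmetric proximity matrix; the diagonal is True because
# every report is close to itself
cercanos = np.array([
    [True,  True,  False],
    [True,  True,  False],
    [False, False, True],
])

upper = np.triu(cercanos)     # keep the diagonal and the entries above it
col_sums = upper.sum(axis=0)  # column j counts report j plus its earlier matches
keep = np.where(col_sums == 1)[0]

print(col_sums)  # [1 2 1]
print(keep)      # [0 2]
```

Report 1 is dropped because an earlier row (report 0) matches it. In general, a report is kept only if no row with a smaller index matches it.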

In [None]:
unique_nuse.to_csv('../../Data/unique_nuse.csv', sep = ";", index = False)

## 2. Filter to only have violent crimes

In [15]:
violent_crimes_typology = ["901", "903", "905", "910", "911", "912", "929", "934"]
violent_nuse = unique_nuse.loc[unique_nuse.TIPO_UNICO.isin(violent_crimes_typology),:].reset_index(drop = True)
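
Because `isin` matches values by equality, the codes in the list should have the same type as the values in the column. The sketch below uses a toy frame with an integer-typed `TIPO_UNICO` column, which is an assumption made for illustration, and casts it to strings before filtering. Whether this cast is needed here depends on how `TIPO_UNICO` was parsed from the CSV, which this script does not show.

```python
import pandas as pd

# Toy data; the integer dtype and the "id" column are assumptions for illustration
toy = pd.DataFrame({"TIPO_UNICO": [901, 902, 934], "id": [1, 2, 3]})
violent_codes = ["901", "934"]

# Cast to string so that both sides of the comparison have the same type
mask = toy["TIPO_UNICO"].astype(str).isin(violent_codes)
violent_toy = toy.loc[mask].reset_index(drop=True)

print(violent_toy["id"].tolist())  # [1, 3]
```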

In [16]:
violent_nuse.to_csv('../../Data/violent_nuse.csv', sep = ";", index = False)

In [None]:
# End of the code