# Crime clustering 


## Data

* **ENTITY_CODE** : ID Number.
* **STATE** : State's name.
* **ID** : Official abbreviation of the state's name.
* **HOMICIDES*** : The act of one human killing another.
* **CAR_THEFT*** : Total or partial theft of vehicle.
* **EXTORTION*** : Intimidation to force the victim to act to the detriment of their own assets.
* **STREET_TRANSPORT_THEFT*** : Robbery/Theft or assault on the street or public transportation.
* **HOME_THEFT*** : Home theft.
* **FRAUD*** : Delivery of money for a product or service that was not received as agreed.
* **POPULATION** : Total number of inhabitants in the entity$^{4}$.
* **URBAN_PP** : Percentage of urban population$^{4}$.

_* Crime prevalence rate by state per hundred thousand inhabitants_
https://github.com/isaacarroyov/crime_analysis_mx2017

## Exploratory Data Analysis

### Import relevant libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [None]:
#import the csv as a Data Frame
df = pd.read_csv('crimes_mx.csv', encoding='ISO-8859-1')
df.head(10)

In [None]:
#number of columns and rows 
df.shape

In [None]:
#name of the columns
df.columns

A statistical summary of the data is shown below.

In [None]:
#relevant stats ordered in descending order (by the mean)
df.describe().transpose().iloc[1:].sort_values(by = 'mean', ascending = False)

### Data distribution

The histograms below show the distribution of each variable.

In [None]:
df.iloc[:,3:].hist( figsize=(18,15) )
plt.show()

### Bar charts

It is also useful to see how each variable is distributed across the states.

In [None]:
variables = df.columns.values[3:]

plt.figure( figsize=(20,20) )


for i in range(len(variables)):

    col_name=variables[i]

    plt.subplot( 4, 2, i+1 )
    plt.bar( df['ID'], df[col_name])
    
    #add title and labels on the axes
    plt.title( col_name, size = 15 )
    plt.xlabel( 'State [ ID ]', size = 10 )
    plt.xticks( rotation = 90 )
    plt.tick_params( labelsize = 15 )
    plt.subplots_adjust( bottom= -0.05)

plt.show()

In [None]:

#titles 
titles = ['Crimes by State. \nType: Homicides',
           'Crimes by State. \nType: Car theft', 'Crimes by State. \nType: Extortion',
           'Crimes by State. \nType: Theft/Assault on the street or public transportation', 
           'Crimes by State. \nType: Home theft', 'Crimes by State. \nType: Fraud',
          ]


#all columns after 'ID': the six crime rates plus 'POPULATION' and 'URBAN_PP'
variables = df.columns.values[3:]

#create a figure
plt.figure( figsize=(20,25) )


for i in range(len(variables)):

    col_name=variables[i]
    df_i = df.sort_values( by = col_name, ascending = False )

    #create a subplot
    plt.subplot( 4, 2, i+1 )

    #make the bar chart
    plt.bar( df_i['ID'], df_i[col_name])
    
    #add title and labels on the axes
    plt.title( col_name, size = 20 )
    plt.xlabel( 'State [ ID ]', size = 15 )
    plt.xticks( rotation = 90 )
    plt.tick_params( labelsize = 10)

#adjust subplots
plt.subplots_adjust(bottom=-0.05)
plt.show()

### Correlation of the variables

Only the most important variables are considered in this analysis:

1. **CAR_THEFT**
2. **STREET_TRANSPORT_THEFT**
3. **EXTORTION**
4. **HOMICIDES**

The correlations between these four variables are shown below:

In [None]:
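#the four variables listed above
main_variables = ['CAR_THEFT', 'STREET_TRANSPORT_THEFT', 'EXTORTION', 'HOMICIDES']

plt.figure( figsize=(7,5))

sns.heatmap( df[main_variables].corr().round(3), annot = True  )
plt.xticks( rotation = 90 )
plt.yticks( rotation = 0 )

plt.show()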
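
Each cell of the heatmap is a Pearson correlation coefficient, which is what `DataFrame.corr()` computes by default. As a minimal sketch of what that number means, the coefficient can be computed from its definition. The helper name `pearson_r` and the toy values are illustrative assumptions, not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    xc = x - x.mean()   # center x
    yc = y - y.mean()   # center y
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Made-up values: y_up rises with x, y_down falls as x rises
x = np.array([1.0, 2.0, 3.0, 4.0])
y_up = np.array([2.0, 4.0, 6.0, 8.0])
y_down = np.array([8.0, 6.0, 4.0, 2.0])

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(r_up, r_down)
```

Values close to 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.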

## Standardization

In [None]:
#standardize each column: center to zero mean and scale to unit variance
from sklearn import preprocessing
df_standardized = preprocessing.scale( df[variables] )
#keep the original column names on the standardized data
df_standardized = pd.DataFrame( df_standardized, columns = variables )
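
To make the transformation concrete, here is a minimal sketch of column-wise standardization written with NumPy only. It assumes the population standard deviation (`ddof=0`), which is what `preprocessing.scale` uses. The function name `standardize` and the toy matrix are hypothetical.

```python
import numpy as np

# Hypothetical toy matrix: 4 rows (states) x 2 columns (variables), made-up values
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0],
              [40.0, 800.0]])

def standardize(X):
    """Column-wise z-score using the population standard deviation (ddof=0)."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)      # NumPy's default is ddof=0
    return (X - mean) / std

Z = standardize(X)
print(Z.mean(axis=0))   # each column is centered
print(Z.std(axis=0))    # each column has unit spread
```

After scaling, both columns carry the same weight in a Euclidean distance, even though the second one was originally twenty times larger. This matters for k-means, which groups points by distance.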

## Distortion

In [None]:
from sklearn.cluster import KMeans 

inertia = []

#largest number of clusters to try (exclusive upper bound)
max_k=20

for i in range(1, max_k):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 11)
    kmeans.fit(df_standardized)
    inertia.append(kmeans.inertia_)


In [None]:
plt.figure( figsize=(10,6))

plt.plot(range(1, max_k), inertia,   marker = '+')
plt.xlabel('Number of clusters')
plt.ylabel('inertia')
plt.tick_params( labelsize = 10 )

plt.show()
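
The quantity plotted above, `kmeans.inertia_`, is the sum of squared distances from each sample to its closest cluster centre. The sketch below computes the same kind of quantity by hand for a fixed labelling, assuming each centre is the mean of its cluster (which is the case once k-means has converged). The function name `within_cluster_ss` and the four toy points are illustrative assumptions.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

# Made-up points forming two tight pairs that are far apart
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

one_cluster = np.array([0, 0, 0, 0])
two_clusters = np.array([0, 0, 1, 1])

ss_one = within_cluster_ss(X, one_cluster)
ss_two = within_cluster_ss(X, two_clusters)
print(ss_one, ss_two)
```

Inertia can only decrease as clusters are added, so the elbow plot is read by looking for the point where the decrease flattens out, not for a minimum.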

### Silhouette

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score

In [None]:
#silhouette score of the last model fitted in the loop above (k = max_k - 1)
silhouette_score(df_standardized,  kmeans.labels_)

In [None]:
sil=[]
for i in range(3,max_k):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 11)
    kmeans.fit(df_standardized)
    sil_score=silhouette_score(df_standardized,  kmeans.labels_)
    sil.append(sil_score) 
    

In [None]:
plt.figure( figsize=(10,6))

plt.plot(range(3, max_k), sil,   marker = '+')
plt.xlabel('Number of clusters')
plt.ylabel('silhouette')
plt.tick_params( labelsize = 10 )

plt.show()
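
For reference, the silhouette of a sample compares its mean distance to the other members of its own cluster (`a`) with its mean distance to the nearest other cluster (`b`), as `(b - a) / max(a, b)`. The following is a small NumPy sketch of that definition, not the scikit-learn implementation. The function name `silhouette_mean` and the toy points are assumptions for illustration.

```python
import numpy as np

def silhouette_mean(X, labels):
    """Mean silhouette over all samples, using Euclidean distance."""
    # Pairwise distance matrix
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            scores.append(0.0)   # convention: singleton clusters score 0
            continue
        # D[i, i] is 0, so dividing by n_own - 1 excludes the sample itself
        a = D[i, own].sum() / (n_own - 1)
        b = min(D[i, labels == k].mean() for k in clusters if k != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Made-up points: two tight pairs far apart
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

good = silhouette_mean(X, np.array([0, 0, 1, 1]))   # pairs kept together
bad = silhouette_mean(X, np.array([0, 1, 0, 1]))    # pairs split apart
print(good, bad)
```

Scores near 1 indicate compact, well-separated clusters, and negative scores suggest samples assigned to the wrong cluster, so higher values on the curve above point to better candidates for the number of clusters.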

## Clustering  

In [None]:
#Use of n_clusters = 5
kmeans = KMeans( n_clusters=5, init='k-means++', random_state=11 )

#fit the model and predict a cluster for each row of the standardized data
predicted_y = kmeans.fit_predict( df_standardized )

#cluster labels run from 0 to 4; add 1 so they run from 1 to 5
predicted_y = predicted_y + 1 

predicted_y

In [None]:
df['CLUSTER'] = predicted_y
df[ ['CLUSTER', 'STATE'] ]

## Analysis of the output

### Distribution of the variables

The following plots show the distribution (kernel density estimate) of each variable within each of the _clusters_, with one colour per cluster.

In [None]:
df.head()

In [None]:
colours_cluster = ['#004777', '#A30000', '#FF7700','#F564A9', '#00AFB5']

plt.subplots(2, 4,figsize=(15,3)  )
for i, col in enumerate(df.columns[3:-1]):
    plt.subplot(2,4, i+1)
    for j in range(1,6):
        #fill=True replaces the deprecated shade=True (seaborn >= 0.11)
        sns.kdeplot(df.loc[df['CLUSTER'] == j, col], fill=True, label=j, color=colours_cluster[j-1])
    plt.title(col)
    if i == 3:
        plt.legend(loc='upper right')
    else:
        plt.legend().remove()
        
plt.subplots_adjust(bottom=-1)


- CLUSTER 1: rural - safe
- CLUSTER 2: semi-rural - extortion
- CLUSTER 3: large urban - fraud, street theft, homicides
- CLUSTER 4: urban - home/car theft
- CLUSTER 5: urban - safe (high variance in homicides)
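
One way to check labels like these against the numbers is to look at the per-cluster means and cluster sizes. The sketch below does this on a small made-up table. The values are hypothetical, and only the column names `CLUSTER`, `HOMICIDES`, and `URBAN_PP` come from the data above.

```python
import pandas as pd

# Hypothetical toy data: five rows with made-up values
toy = pd.DataFrame({
    'CLUSTER':   [1, 1, 2, 2, 3],
    'HOMICIDES': [10.0, 20.0, 40.0, 60.0, 90.0],
    'URBAN_PP':  [50.0, 60.0, 70.0, 80.0, 95.0],
})

# Mean of every variable within each cluster
profile = toy.groupby('CLUSTER').mean()

# Number of rows (states) in each cluster
sizes = toy['CLUSTER'].value_counts().sort_index()

print(profile)
print(sizes)
```

Applied to the real `df`, the same `groupby('CLUSTER')` call would give one row per cluster, which can be read alongside the density plots. Clusters with very few states deserve caution, since their means and densities rest on little data.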

### Geographical position

In [None]:
df_dict = df.set_index( 'ID' )['CLUSTER']

states_geo = 'states_mx.json'

#use folium to create map
import folium
map_mex = folium.Map( location = [24,-102], zoom_start = 4.5 )

#colour a state according to its cluster
def my_color_function(feature):
    #clusters are numbered 1 to 5; list indices start at 0
    return colours_cluster[ df_dict[feature['id']] - 1 ]

In [None]:
#draw the state polygons; style keys follow Leaflet's naming ('fillOpacity', not 'fill_opacity')
folium.GeoJson(
    states_geo,
    style_function=lambda feature: {
        'fillColor': my_color_function(feature),
        'color' : 'black',
        'fillOpacity' : .5,
        'weight' : 0.5,
        }
    ).add_to(map_mex)

map_mex