<a href="https://colab.research.google.com/github/lsteffenel/RT0902-IntroML/blob/main/15-Chicago_crime_data_on_spark.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chicago crime dataset analysis
---

This notebook lets you apply some of what you have learned by exploring a real dataset.

You will:
 * Read, transform and query data with Apache Spark
 * Visualize it with Python libraries (Matplotlib and Seaborn)
 * Occasionally convert the data to Pandas or NumPy for easier visualization


---

## A few imports



Standard Python data analysis imports

In [None]:
## standard imports
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Spark imports

In [None]:
import os
# Driver memory must be set before the SparkSession is created.
memory = '8g'
pyspark_submit_args = ' --driver-memory ' + memory + ' pyspark-shell'
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

In [None]:
## spark imports
from pyspark.sql import Row, SparkSession
import pyspark.sql.functions as pyf

spark = SparkSession.builder.master("local[1]").appName("RT0902").getOrCreate()

Jupyter visualization options

In [None]:
%matplotlib inline

#Not too sure the following 2 work. This is a TODO
sns.set_color_codes("pastel")
plt.rcParams["figure.figsize"] = [20, 8]

---
## Dataset
We first clean up the dataset's header, since this makes the columns easier to access (no spaces in the names, for example).

The original data comes from Kaggle (https://www.kaggle.com/djonafegnem/chicago-crime-data-analysis)

In [None]:
content_cols = '''
ID - Unique identifier for the record.
Case Number - The Chicago Police Department RD Number (Records Division Number), which is unique to the incident.
Date - Date when the incident occurred. This is sometimes a best estimate.
Block - The partially redacted address where the incident occurred, placing it on the same block as the actual address.
IUCR - The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description. See the list of IUCR codes at https://data.cityofchicago.org/d/c7ck-438e.
Primary Type - The primary description of the IUCR code.
Description - The secondary description of the IUCR code, a subcategory of the primary description.
Location Description - Description of the location where the incident occurred.
Arrest - Indicates whether an arrest was made.
Domestic - Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
Beat - Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts. See the beats at https://data.cityofchicago.org/d/aerh-rz74.
District - Indicates the police district where the incident occurred. See the districts at https://data.cityofchicago.org/d/fthy-xz3r.
Ward - The ward (City Council district) where the incident occurred. See the wards at https://data.cityofchicago.org/d/sp34-6z76.
Community Area - Indicates the community area where the incident occurred. Chicago has 77 community areas. See the community areas at https://data.cityofchicago.org/d/cauq-8yn6.
FBI Code - Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS). See the Chicago Police Department listing of these classifications at http://gis.chicagopolice.org/clearmap_crime_sums/crime_types.html.
X Coordinate - The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
Y Coordinate - The y coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
Year - Year the incident occurred.
Updated On - Date and time the record was last updated.
Latitude - The latitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
Longitude - The longitude of the location where the incident occurred. This location is shifted from the actual location for partial redaction but falls on the same block.
Location - The location where the incident occurred in a format that allows for creation of maps and other geographic operations on this data portal. This location is shifted from the actual location for partial redaction but falls on the same block.'''

In [None]:
def entry_dic(line):
    """
    Convert a "header - description" line into a dictionary holding the original
    header as 'title', a matching field name as 'header', and the description.
    """
    # Split on the first ' - ' only, so a description containing ' - ' stays whole.
    title, description = line.split(' - ', 1)
    return {'title': title, 'description': description, 'header': title.lower().replace(' ', '_')}

Build the list of header entries with the function above

In [None]:
header_dics = [entry_dic(line) for line in content_cols.split('\n') if line != '']

In [None]:
header_dics[:2]
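
To make the transformation concrete, here is a small standalone sketch of what one header line becomes. `to_entry` is a hypothetical stand-in that mirrors the logic of `entry_dic`, so that the snippet runs on its own.

```python
# Standalone sketch; `to_entry` is a hypothetical copy of the idea behind entry_dic.
def to_entry(line):
    # Split on the first ' - ' only, so the description is kept whole.
    title, description = line.split(' - ', 1)
    return {
        'title': title,
        'description': description,
        'header': title.lower().replace(' ', '_'),
    }

sample = ('Case Number - The Chicago Police Department RD Number '
          '(Records Division Number), which is unique to the incident.')
entry = to_entry(sample)
print(entry['title'])   # Case Number
print(entry['header'])  # case_number
```

The `header` value is the name the column takes after the renaming step further down.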

### Data

The data is downloaded and stored in `./data/` as CSV files.


In [None]:
!mkdir -p data

In [None]:
!gsutil -m cp -r gs://angelo_crime_data/*.csv ./data

In [None]:
!ls -lh data/

total 1.9G
-rw-r--r-- 1 root root 454M Mar  8 17:02 Chicago_Crimes_2001_to_2004.csv
-rw-r--r-- 1 root root 450M Mar  8 17:02 Chicago_Crimes_2005_to_2007.csv
-rw-r--r-- 1 root root 647M Mar  8 17:02 Chicago_Crimes_2008_to_2011.csv
-rw-r--r-- 1 root root 350M Mar  8 17:02 Chicago_Crimes_2012_to_2017.csv


---
## Reading the data

Using Spark's CSV reader (`spark.read.csv`), we read and parse the files. The result is a single DataFrame:

In [None]:
df = spark.read.csv('./data/*.csv', inferSchema=True, header=True)

Note: most of the time here goes into schema inference; after all, there are not that many rows.

In [None]:
# Cache this DataFrame (keep it in memory), since it is reused several times.
df = df.cache()

In [None]:
# Print the schema (structure) of the DataFrame
df.printSchema()

---
**We rename the columns so that dot notation can be used (e.g. `df.column`)**

In [None]:
for h in header_dics:
    df = df.withColumnRenamed(h['title'], h['header'])

Some rows have no value in the `location_description` column. Now is the time to remove them.

Below is an example using `rdd.filter`. Your first *mission* is to do the same thing with **`Dataset.filter`** (exposed as `DataFrame.filter` in PySpark).

In [None]:
#df = df.rdd.filter(lambda rec: rec.arrest.find('Location Description') < 0).toDF().cache()

In [None]:
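# your turn
# Rows where the column is null are dropped too: the comparison yields null, which filter treats as false.
df = df.filter(df['location_description'] != '')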

A quick look at the first rows of the DataFrame:

In [None]:
df.show(n=3, truncate=False)

---
## Understanding the data

In [None]:
# crime types
crime_type_groups = df.groupBy('primary_type').count()

In [None]:
crime_type_counts = crime_type_groups.orderBy('count', ascending=False)

So far everything has been fast: Spark evaluates *lazily*, i.e. it has only recorded the *transformations* to apply. Execution starts only when an *action* is requested (for example, displaying the result).

The next cell asks for the total number of rows, which forces Spark to actually apply the transformations, run the filter, and so on. Expect this to take a while, at least 7 minutes.

In [None]:
print(df.count())
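
As a rough analogy for the lazy behaviour described above, the sketch below uses a plain Python generator: defining the pipeline does no work, and only consuming it does. This is an illustration with made-up rows, not Spark code, and `keep_valid` is a hypothetical helper.

```python
# Plain-Python analogy for lazy evaluation (this is not Spark code).
executed = []  # records which rows were actually processed

def keep_valid(rows):
    """A 'transformation': describes the work but does nothing until consumed."""
    for row in rows:
        executed.append(row)
        if row["location_description"] != "":
            yield row

rows = [
    {"primary_type": "THEFT", "location_description": "STREET"},
    {"primary_type": "BATTERY", "location_description": ""},
    {"primary_type": "THEFT", "location_description": "RESIDENCE"},
]

pipeline = keep_valid(rows)            # nothing has run yet
ran_before_action = len(executed)

row_count = sum(1 for _ in pipeline)   # the 'action' triggers the work
ran_after_action = len(executed)

print(ran_before_action, ran_after_action, row_count)  # 0 3 2
```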

What if we list the columns (our future `features`)?

In [None]:
df.columns

In [None]:
# Once more, what is the schema (this time with the "harmonized" names)?
df.printSchema()

### Crime types

The following command shows the 20 most frequent crime types (`show` prints 20 rows by default):

In [None]:
crime_type_counts.show(truncate=False)

We can get a cleaner display (and perform other operations) by converting this DataFrame to Pandas with `crime_type_counts.toPandas()`:

In [None]:
counts_pddf = crime_type_counts.toPandas()

In [None]:
counts_pddf.head(10)

In [None]:
plt.rcParams["figure.figsize"] = [20, 8]

sns.set(style="whitegrid")
sns.set_color_codes("pastel")

#sns.despine(left=True, bottom=True)
type_graph = sns.barplot(x='count', y='primary_type', data=counts_pddf)
type_graph.set(ylabel="Primary Type", xlabel="Crimes Record Count")

### Period covered

In [None]:
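import datetime
# Import only the functions used below instead of `*`, which would shadow
# Python built-ins such as min, max and sum.
from pyspark.sql.functions import (to_timestamp, trunc, hour, dayofweek, month,
                                   dayofmonth, datediff, to_date, lit)

In [None]:
df.select(pyf.min('date').alias('first_record_date'), pyf.max('date').alias('latest_record_date')).show(truncate=False)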

Here we see that the files cover the period from **2001-01-01** to **2016-12-31**.
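
One caveat: at this point `date` is still a string, so `min` and `max` compare text, not time. With an `MM/dd/yyyy` layout, text order sorts by month first, so the reported bounds should be read with care. A minimal plain-Python sketch with made-up sample values illustrates the difference:

```python
from datetime import datetime

# Made-up sample values in the dataset's 'MM/dd/yyyy hh:mm:ss a' layout.
dates = [
    '02/23/2006 09:06:22 PM',
    '12/31/2001 11:59:00 PM',
    '01/05/2016 08:15:00 AM',
]

# Text comparison: ordered by month first, then day, then year.
text_min, text_max = min(dates), max(dates)

# Chronological comparison: parse first, then compare.
fmt = '%m/%d/%Y %I:%M:%S %p'
time_min = min(dates, key=lambda s: datetime.strptime(s, fmt))
time_max = max(dates, key=lambda s: datetime.strptime(s, fmt))

print(text_min, '|', text_max)
print(time_min, '|', time_max)
```

In this sample the two orderings are exactly reversed: the earliest record by text is the latest one chronologically.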

---
We now convert the dates to timestamps: the schema showed that the `date` field is of type `string`, which is not very useful.

The strings look like '02/23/2006 09:06:22 PM', so we parse them with the pattern **`'MM/dd/yyyy hh:mm:ss a'`** (US format).

We also add a `month` column that drops the time-of-day fields.

In [None]:
df = df.withColumn('date_time', to_timestamp('date', 'MM/dd/yyyy hh:mm:ss a'))\
       .withColumn('month', trunc('date_time', 'YYYY'))  # note: 'YYYY' truncates to the first day of the year; use 'MM' for true monthly buckets

In [None]:
df.select(['date','date_time', 'month'])\
  .show(n=2, truncate=False)
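
As a plain-Python cross-check of the pattern (not Spark code): `hh` combined with `a` is a 12-hour clock with an AM/PM marker, which corresponds to `%I` and `%p` in `strptime`. The sketch also shows what truncating to the year and to the month produce, which is the difference between passing `'YYYY'` and `'MM'` to `trunc`.

```python
from datetime import datetime

raw = '02/23/2006 09:06:22 PM'
parsed = datetime.strptime(raw, '%m/%d/%Y %I:%M:%S %p')  # 12-hour clock + AM/PM

year_start = parsed.date().replace(month=1, day=1)   # like truncating to the year
month_start = parsed.date().replace(day=1)           # like truncating to the month

print(parsed)       # 2006-02-23 21:06:22
print(year_start)   # 2006-01-01
print(month_start)  # 2006-02-01
```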

### Combien d'arrestations ?

In [None]:
# Use the `month` column to count records with and without an arrest over the years, one group per truncated date:
type_arrest_date = df.groupBy(['arrest', 'month'])\
                     .count()\
                     .orderBy(['month', 'count'], ascending=[True, False])
print()
type_arrest_date.show(3, truncate=False)

In [None]:
# A quick note on how the datetime type works
import datetime

In [None]:
datetime.datetime.now()
datetime.datetime.strftime(datetime.datetime.now(), '%H')

In [None]:
# Build a Pandas DataFrame from the result above:
type_arrest_pddf = pd.DataFrame(type_arrest_date.rdd.map(lambda l: l.asDict()).collect())

Once again, we convert the date/time types, this time for Pandas.

*Some of these steps are not strictly necessary, but still...*

In [None]:
type_arrest_pddf['yearpd'] = type_arrest_pddf['month'].apply(lambda dt: datetime.datetime.strftime(pd.Timestamp(dt), '%Y'))

In [None]:
type_arrest_pddf['arrest'] = type_arrest_pddf['arrest'].apply(lambda l: l=='True')
type_arrest_pddf.head(5)

### How has the number of arrests evolved over the 16 years of data collection?

In [None]:
# Data for plotting: split the records by arrest outcome
arrested = type_arrest_pddf[type_arrest_pddf['arrest'] == True]
not_arrested = type_arrest_pddf[type_arrest_pddf['arrest'] == False]

# Note that using plt.subplots below is equivalent to using
# fig = plt.figure() and then ax = fig.add_subplot(111)
fig, ax = plt.subplots()
ax.plot(arrested['month'], arrested['count'], label='Arrested')
ax.plot(not_arrested['month'], not_arrested['count'], label='Not Arrested')

ax.set(xlabel='Year - 2001-2017', ylabel='Total records',
       title='Year-on-year crime records')
ax.grid(True, which='both', axis='y')  # positional flag: the 'b' keyword was removed in recent Matplotlib releases
ax.legend()

Apart from the end of 2016, the relative "gap" between arrests and non-arrests stays more or less constant.

### At what time of day are criminals most active?

Here it is up to you to redo the same kind of operation. I will only show you how to create a column holding the hour.

In [None]:
# Extract the "hour" field from the date into a separate column called "hour"
df_hour = df.withColumn('hour', hour(df['date_time']))

In [None]:
# Derive a data frame with crime counts per hour of the day:
hourly_count = df_hour.groupBy(['primary_type', 'hour']).count().cache()
hourly_total_count = hourly_count.groupBy('hour').sum('count')

In [None]:
hourly_count_pddf = pd.DataFrame(hourly_total_count.select(hourly_total_count['hour'], hourly_total_count['sum(count)'].alias('count'))\
                                .rdd.map(lambda l: l.asDict())\
                                 .collect())

In [None]:
hourly_count_pddf = hourly_count_pddf.sort_values(by='hour')

In [None]:
fig, ax = plt.subplots()
ax.plot(hourly_count_pddf['hour'], hourly_count_pddf['count'], label='Hourly Count')

ax.set(xlabel='Hour of Day', ylabel='Total records',
       title='Overall hourly crime numbers')
ax.grid(True, which='both', axis='y')  # positional flag: the 'b' keyword was removed in recent Matplotlib releases
ax.legend()

Activity appears to be highest between 6 pm and 10 pm, with additional peaks at midnight and at noon.



### In what kind of place are crimes committed?

In [None]:
# How many distinct location types are recorded?
df.select('location_description').distinct().count()

What are the top 10 locations?

In [None]:
df.groupBy(['location_description']).count().orderBy('count', ascending=False).show(10)

### "Domestic" crimes

In [None]:
domestic_hour = pd.DataFrame(df_hour.groupBy(['domestic', 'hour']).count().orderBy('hour').rdd.map(lambda row: row.asDict()).collect())

In [None]:
dom = domestic_hour[domestic_hour['domestic'] == 'True']['count']
non_dom = domestic_hour[domestic_hour['domestic'] == 'False']['count']

either_dom = domestic_hour.groupby('hour')['count'].sum()  # select 'count' first so only the numeric column is summed

dom_keys = domestic_hour[domestic_hour['domestic'] == 'False']['hour']

#### How does this type of crime (domestic) compare with the other types?

In [None]:
figure, axes = plt.subplots()

axes.plot(dom_keys, either_dom, label='Total hourly count')
axes.plot(dom_keys, dom, label='Domestic crime count')
axes.plot(dom_keys, non_dom, label='Non-Domestic hourly count')

axes.legend()
#axes.grid(which='b', b=True)

### A time-based analysis

Date and time data tell us more about the types of crime and let us form hypotheses about their surges. On the other hand, external factors such as shift changes or new security policies may have an impact that is not captured here.

Still, once we have an idea of when and where crimes are most frequent, we can venture a few predictions...

We add a few useful fields:

 * the hour of the day (already present in the 'hour' field)
 * the day of the week
 * the month of the year
 * the "day number" in a running sequence, counted in days elapsed since 2001-01-01.

In [None]:
df_dates = df_hour.withColumn('week_day', dayofweek(df_hour['date_time']))\
                 .withColumn('year_month', month(df_hour['date_time']))\
                 .withColumn('month_day', dayofmonth(df_hour['date_time']))\
                 .withColumn('date_number', datediff(df_hour['date_time'], to_date(lit('2001-01-01'), format='yyyy-MM-dd')))\
                 .cache()

In [None]:
df_dates.select(['date', 'month', 'hour', 'week_day', 'year', 'year_month', 'month_day', 'date_number']).show(20, truncate=False)
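
For reference when reading these columns: Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), and `datediff` returns the number of days between two dates, so 2001-01-01 itself gets `date_number` 0. The plain-Python sketch below reproduces both conventions; the helper names are hypothetical.

```python
from datetime import date

ORIGIN = date(2001, 1, 1)

def spark_style_dayofweek(d):
    """Map Python's isoweekday (Mon=1..Sun=7) to 1=Sunday..7=Saturday."""
    return d.isoweekday() % 7 + 1

def date_number(d):
    """Days elapsed since ORIGIN (0 for ORIGIN itself)."""
    return (d - ORIGIN).days

d = date(2006, 2, 23)  # a Thursday
print(spark_style_dayofweek(ORIGIN), date_number(ORIGIN))  # 2 0
print(spark_style_dayofweek(d), date_number(d))            # 5 1879
```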

## Crimes by day of the week


In [None]:
week_day_crime_counts = df_dates.groupBy('week_day').count()

In [None]:
week_day_crime_counts_pddf = pd.DataFrame(week_day_crime_counts.orderBy('week_day').rdd.map(lambda e: e.asDict()).collect())

In [None]:
sns.barplot(data=week_day_crime_counts_pddf, x='week_day', y='count')

There is very little variance... then again, criminals stay "bad" every day of the week.

## Month of the year



In [None]:
year_month_crime_counts = df_dates.groupBy('year_month').count()

In [None]:
year_month_crime_counts_pddf = pd.DataFrame(year_month_crime_counts.orderBy('year_month').rdd.map(lambda e: e.asDict()).collect())

In [None]:
year_month_crime_counts_pddf

The May-August period appears to be the most active for criminals. Any ideas as to why?


In [None]:
sns.barplot(data=year_month_crime_counts_pddf, y='count', x='year_month')

## Day of the month

In [None]:
month_day_crime_counts = df_dates.groupBy('month_day').count()

In [None]:
month_day_crime_counts_pddf = pd.DataFrame(month_day_crime_counts.orderBy('month_day').rdd.map(lambda e: e.asDict()).collect())

#### The 10 worst days of the month

In [None]:
month_day_crime_counts_pddf.sort_values(by='count', ascending=False).head(10)

In [None]:
month_day_crime_counts_pddf = month_day_crime_counts_pddf.sort_values(by='month_day', ascending=True)

In [None]:
fg, ax = plt.subplots()

ax.plot(month_day_crime_counts_pddf['month_day'], month_day_crime_counts_pddf['count'], label='Crimes over the month')

ax.grid(True, which='both')  # positional flag: the 'b' keyword was removed in recent Matplotlib releases
ax.legend()

What happens on the 31st of each month? Is it a real drop or just a statistical artifact?
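
One hint: not every month has a 31st. Counting how often each day-of-month occurs in the calendar between 2001-01-01 and 2016-12-31 (the period reported earlier) gives a baseline to compare against. This is a plain-Python sketch, independent of the dataset.

```python
from collections import Counter
from datetime import date, timedelta

start, end = date(2001, 1, 1), date(2016, 12, 31)

# Count how many times each day-of-month (1..31) appears in the period.
occurrences = Counter()
day = start
while day <= end:
    occurrences[day.day] += 1
    day += timedelta(days=1)

print(occurrences[1], occurrences[29], occurrences[30], occurrences[31])
# 192 180 176 112
```

Dividing each count in `month_day_crime_counts_pddf` by the matching number of occurrences would give a per-day average that is comparable across days of the month.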

### What about the neighbourhoods?

Chicago has 77 community areas. How is crime distributed among them?

In [None]:
df_dates_community_areas = df_dates.na.drop(subset=['community_area']).groupBy('community_area').count()

Which 10 community areas have the most crime?

In [None]:
df_dates_community_areas.orderBy('count', ascending=False).show(10)

In [None]:
## And what is the most common crime type per community area?
# takeOrdered returns the smallest keys first, so negate the count to get the largest counts.
top_crime_types = df_dates.select('primary_type').groupBy('primary_type').count().rdd.map(lambda row: row.asDict()).takeOrdered(10, key=lambda l: -l['count'])
top_busy_areas =  df_dates_community_areas.rdd.map(lambda row: row.asDict()).takeOrdered(10, key=lambda l: -l['count'])

In [None]:
top_crime_types_lst = [dc['primary_type'] for dc in top_crime_types]
top_busy_areas_lst = [dc['community_area'] for dc in top_busy_areas]

In [None]:
top_crime_types_lst

In [None]:
top_busy_areas_lst

Show the 10 most common crimes in the 10 busiest community areas...

In [None]:
q1 = "instr('" + ' '.join(top_busy_areas_lst) + "', community_area) > 0"
q2 = "instr('" + ' '.join(top_crime_types_lst) + "', primary_type) > 0"
print(q1)

In [None]:
## Construct a data frame filtered on these top community areas and top crime types:
df_dates_tops = df_dates.filter(q1).filter(q2)
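
A caveat on the `instr` filters above: `instr` tests whether a value occurs anywhere inside the joined string, so a short value can match as a fragment of a longer one. The plain-Python sketch below uses made-up lists to show the effect and contrasts it with exact membership, which in Spark can be expressed with `Column.isin`.

```python
# Made-up example lists, standing in for top_busy_areas_lst / top_crime_types_lst.
top_areas = ['25.0', '43.0', '8.0']
top_types = ['MOTOR VEHICLE THEFT', 'BATTERY']

areas_joined = ' '.join(top_areas)
types_joined = ' '.join(top_types)

def substring_match(joined, value):
    """Mimics instr(joined, value) > 0: true if value occurs anywhere."""
    return value in joined

def exact_match(items, value):
    """Exact membership, the behaviour of an isin-style filter."""
    return value in set(items)

print(substring_match(areas_joined, '5.0'), exact_match(top_areas, '5.0'))      # True False
print(substring_match(types_joined, 'THEFT'), exact_match(top_types, 'THEFT'))  # True False
```

Whether this affects the result depends on the actual values in the two top-10 lists, so it is worth comparing the filtered row count against an exact-membership filter.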

In [None]:
df_dates_tops.count()

In [None]:
tops_of_tops = df_dates_tops.groupBy(['primary_type', 'community_area']).count().orderBy(['primary_type', 'count', 'community_area'], ascending=[True, False, True]).cache()

In [None]:
tops_of_tops.show(20)

#### Community area names

So far we only had the area numbers; we can now join them with the area names.

The source is the Chicago Tribune: http://www.chicagotribune.com/chi-community-areas-htmlstory.html


In [None]:
area_names = """
01	Rogers Park
40	Washington Park
02	West Ridge
41	Hyde Park
03	Uptown
42	Woodlawn
04	Lincoln Square
43	South Shore
05	North Center
44	Chatham
06	Lakeview
45	Avalon Park
07	Lincoln Park
46	South Chicago
08	Near North Side
47	Burnside
09	Edison Park
48	Calumet Heights
10	Norwood Park
49	Roseland
11	Jefferson Park
50	Pullman
12	Forest Glen
51	South Deering
13	North Park
52	East Side
14	Albany Park
53	West Pullman
15	Portage Park
54	Riverdale
16	Irving Park
55	Hegewisch
17	Dunning
56	Garfield Ridge
18	Montclare
57	Archer Heights
19	Belmont Cragin
58	Brighton Park
20	Hermosa
59	McKinley Park
21	Avondale
60	Bridgeport
22	Logan Square
61	New City
23	Humboldt Park
62	West Elsdon
24	West Town
63	Gage Park
25	Austin
64	Clearing
26	West Garfield Park
65	West Lawn
27	East Garfield Park
66	Chicago Lawn
28	Near West Side
67	West Englewood
29	North Lawndale
68	Englewood
30	South Lawndale
69	Greater Grand Crossing
31	Lower West Side
70	Ashburn
32	Loop
71	Auburn Gresham
33	Near South Side
72	Beverly
34	Armour Square
73	Washington Heights
35	Douglas
74	Mount Greenwood
36	Oakland
75	Morgan Park
37	Fuller Park
76	O'Hare
38	Grand Boulevard
77	Edgewater
39	Kenwood
"""

In [None]:
code_pairs = [[float(p[0]), p[1]] for p in [pair.strip().split('\t') for pair in area_names.strip().split('\n')]]

In [None]:
code_pairs[:5]

#### Crime by community area

In [None]:
community_area_counts = pd.DataFrame(df_dates_community_areas.rdd.map(lambda row: row.asDict()).collect())

In [None]:
# Create a dictionary of area code to names
area_name_dic = {float(k[0]):k[1] for k in code_pairs}

In [None]:
community_area_counts['community_area_name'] = community_area_counts['community_area'].apply(lambda area: area_name_dic.get(float(area),  'unknown_%s'%area))

In [None]:
community_area_counts = community_area_counts.sort_values(by='count')
community_area_counts.head(5)

In [None]:
plt.rcParams["figure.figsize"] = [32, 32]

sns.set(style="whitegrid")
sns.set_color_codes("pastel")

sns.despine(left=True, bottom=True)
area_chart = sns.barplot(x='count', y='community_area_name', data=community_area_counts)
area_chart.set(ylabel="Community Area Name", xlabel="Overall Crimes Record Count")

**What is going on in the Austin area?**

In [None]:
code_pairs_df = spark.createDataFrame(code_pairs, ['community_area', 'area_name'])

In [None]:
named_tops_of_tops = code_pairs_df.join(tops_of_tops, on='community_area', how='right')

In [None]:
named_tops_of_tops.show(10)

In [None]:
tops_of_tops_dff = pd.DataFrame(named_tops_of_tops.rdd.map(lambda l: l.asDict()).collect() )

In [None]:
plt.rcParams["figure.figsize"] = [64, 16]
sns.barplot(data=tops_of_tops_dff, x='area_name', y='count', hue='primary_type', palette='pastel')

---


# Can we predict the most likely crime at a given time and place?

To do so, we clean the dataset a little further.

### Remove these variables:

 * 'id' - Random information that isn't a predictor of crime type
 * 'case_number' - Random information that isn't a predictor of crime type
 * 'date' - Removed because it's been re-featurized in other features generated above
 * 'block' - Excluded as this may simply mean noise
 * 'iucr' - Excluded as correlated with crime type. No point.
 * 'x_coordinate' - Not included
 * 'y_coordinate' - Not included
 * 'year' - Not included (already otherwise featurized)
 * 'updated_on' - not included
 * 'latitude' - not included
 * 'longitude' - not included
 * 'location' - not included
 * 'date_time' - Taken into account in other time-related features
 * 'description' - Excluded. I want to see this as associated with the response (primary type)


### Keep these variables:

 * 'location_description'
 * 'arrest'
 * 'domestic'
 * 'beat'
 * 'district'
 * 'ward'
 * 'community_area'
 * 'fbi_code'
 * 'hour'
 * 'week_day'
 * 'year_month'
 * 'month_day'
 * 'date_number'

In [None]:
selected_features = [
 'location_description',
 'arrest',
 'domestic',
 'beat',
 'district',
 'ward',
 'community_area',
 'fbi_code',
 'hour',
 'week_day',
 'year_month',
 'month_day',
 'date_number']

In [None]:
#Let's see the schema of these selected features:
features_df = df_dates.select(selected_features)
features_df.printSchema()

In [None]:
feature_level_count_dic = []

for feature in selected_features:
    print('Analysing %s' % feature)
    levels_list_df = features_df.select(feature).distinct()
    feature_level_count_dic.append({'feature': feature, 'level_count': levels_list_df.count()})


In [None]:
pd.DataFrame(feature_level_count_dic).sort_values(by='level_count', ascending=False)

### Preparing the model

In [None]:
from pyspark.ml.feature import StringIndexer, VectorAssembler
df_dates_features = df_dates.na.drop(subset=selected_features)



Let's apply Spark's `StringIndexer` to the selected variables.

In [None]:
for feature in feature_level_count_dic:
    indexer = StringIndexer(inputCol=feature['feature'], outputCol='%s_indexed' % feature['feature'])
    print('Fitting feature "%s"' % feature['feature'])
    model = indexer.fit(df_dates_features)
    print('Transforming "%s"' % feature['feature'])
    df_dates_features = model.transform(df_dates_features)

Same thing for the labels.

In [None]:
## String-index the response variable:
response_indexer = StringIndexer(inputCol='primary_type', outputCol='primary_type_indexed')
response_model = response_indexer.fit(df_dates_features)
df_dates_features = response_model.transform(df_dates_features)

In [None]:
#What does it look like now...
df_dates_features.show(1)
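
By default, `StringIndexer` orders labels by descending frequency, so the most frequent value gets index 0.0. Here is a plain-Python sketch of that default on made-up values (ties are avoided on purpose, since the sketch does not try to reproduce Spark's tie-breaking):

```python
from collections import Counter

# Made-up sample of a categorical column.
values = ['THEFT', 'BATTERY', 'THEFT', 'NARCOTICS', 'THEFT', 'BATTERY']

# Most frequent label first, then assign consecutive float indices.
ordered = [label for label, _ in Counter(values).most_common()]
index_of = {label: float(i) for i, label in enumerate(ordered)}

indexed = [index_of[v] for v in values]
print(index_of)  # {'THEFT': 0.0, 'BATTERY': 1.0, 'NARCOTICS': 2.0}
print(indexed)   # [0.0, 1.0, 0.0, 2.0, 0.0, 1.0]
```

Note that these indices are category codes, not quantities; a linear model will still treat them as numbers.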


Now we assemble the indexed columns into a single vector column named `features`.

In [None]:
indexed_features = ['%s_indexed' % fc['feature'] for fc in feature_level_count_dic]
indexed_features

In [None]:
assembler = VectorAssembler(inputCols=indexed_features, outputCol='features')
vectorized_df_dates = assembler.transform(df_dates_features)

In [None]:
vectorized_df_dates.select('features').take(1)

### Training the model

Use a **60%**/**40%** split between training and test data.

To start, let's use logistic regression.

In [None]:
train, test = vectorized_df_dates.randomSplit([0.6, 0.4])
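
`randomSplit` also accepts an optional `seed` argument; without it, each run produces a different split. As I understand it, the split is made by an independent random draw per row, so the 60/40 proportions are approximate rather than exact. The plain-Python sketch below imitates that behaviour with a hypothetical `random_split` helper; it is an illustration, not Spark's implementation.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, in proportion to weights."""
    rng = random.Random(seed)
    total = sum(weights)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random() * total
        cumulative = 0.0
        for i, w in enumerate(weights):
            cumulative += w
            if draw < cumulative:
                splits[i].append(row)
                break
    return splits

rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.6, 0.4], seed=42)
again_train, _ = random_split(rows, [0.6, 0.4], seed=42)

# Sizes land near 600/400 but are not exact; the same seed gives the same split.
print(len(train_rows), len(test_rows), train_rows == again_train)
```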

In [None]:
from pyspark.ml.classification import LogisticRegression

In [None]:
logisticRegression = LogisticRegression(labelCol='primary_type_indexed', featuresCol='features', maxIter=10, family='multinomial')

In [None]:
fittedModel = logisticRegression.fit(train)

## How well does the model perform?

In [None]:
# Note: `summary` describes the training run, so this is accuracy on the training data.
fittedModel.summary.accuracy

In [None]:
model_summary = fittedModel.summary

In [None]:
fittedModel.coefficientMatrix

#### Why is the shape 34x13?

Because this is a multinomial logistic regression, one set of coefficients is learned for **each class** of the label: the matrix has one row per class (34 crime types) and one column per feature (13 features). The model can thus output a probability for each class.
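
In the usual softmax formulation of multinomial logistic regression, each class has its own coefficient row and intercept, and the class probabilities are the normalized exponentials of the per-class scores. A tiny plain-Python sketch with made-up numbers (3 classes, 2 features) shows where the classes x features shape comes from:

```python
import math

# Made-up parameters: 3 classes x 2 features.
W = [[0.5, -1.0],
     [0.1,  0.3],
     [-0.4, 0.8]]
b = [0.0, 0.2, -0.1]
x = [1.0, 2.0]  # one feature vector

# One score per class: row of W dotted with x, plus that class's intercept.
scores = [sum(w_j * x_j for w_j, x_j in zip(row, x)) + b_k for row, b_k in zip(W, b)]

# Softmax: subtract the max for numerical stability, then normalize.
m = max(scores)
exps = [math.exp(s - m) for s in scores]
probs = [e / sum(exps) for e in exps]

predicted_class = probs.index(max(probs))
print(len(W), len(W[0]))   # 3 2
print(predicted_class)     # 2
```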

In [None]:
print(fittedModel.coefficientMatrix)

In [None]:
print('Coefficient matrix:\nRow count = %s\nCol count = %s' % (fittedModel.coefficientMatrix.numRows, fittedModel.coefficientMatrix.numCols))

In [None]:
print('Model:\nNum Classes = %s\nNum Features = %s' % (fittedModel.numClasses, fittedModel.numFeatures))

In [None]:
print('Training "primary_type" factor level count = %s' % train.select('primary_type_indexed').distinct().count())

In [None]:
vectorized_df_dates.select('features').show(2, truncate=False)

In [None]:
fittedModel.numClasses

In [None]:
fittedModel.numFeatures

In [None]:
train.select('primary_type_indexed').distinct().count()

In [None]:
df_dates.select('primary_type').distinct().count()

In [None]:
fittedModel.interceptVector.values.size

In [None]:
print(model_summary.objectiveHistory)
print()
print('Objective history size ', len(model_summary.objectiveHistory))

In [None]:
sns.barplot(y=model_summary.objectiveHistory, x=list(range(len(model_summary.objectiveHistory))))

In [None]:
label_stats = {float(i):{'index': float(i)} for i in range(fittedModel.numClasses)}  # one entry per class instead of a hard-coded 34
print(label_stats)

In [None]:
# Collect the per-label metrics from the training summary
for i, rate in enumerate(model_summary.falsePositiveRateByLabel):
    label_stats[i]['false_positive_rate'] = rate

for i, rate in enumerate(model_summary.truePositiveRateByLabel):
    label_stats[i]['true_positive_rate'] = rate

for i, rate in enumerate(model_summary.precisionByLabel):
    label_stats[i]['precision_rate'] = rate

for i, rate in enumerate(model_summary.recallByLabel):
    label_stats[i]['recall_rate'] = rate

for i, rate in enumerate(model_summary.fMeasureByLabel()):
    label_stats[i]['f_measure'] = rate
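
For intuition about these per-label rates, the sketch below computes precision and recall by hand from made-up true and predicted labels. It is a plain-Python illustration, not how Spark computes its summary internally, and `precision_recall` is a hypothetical helper.

```python
# Made-up labels for six records.
y_true = [0, 0, 1, 1, 2, 0]
y_pred = [0, 1, 1, 1, 0, 0]

def precision_recall(label):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)   # tp + fp
    actual = sum(1 for t in y_true if t == label)      # tp + fn
    # Convention used in this sketch: an empty denominator yields 0.0.
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

for label in sorted(set(y_true)):
    print(label, precision_recall(label))
```

Label 2 is never predicted in this sample, so both of its rates come out as 0; rare crime types are worth checking for the same pattern.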

In [None]:
train_rdd = train.select(['primary_type', 'primary_type_indexed']).distinct().orderBy('primary_type_indexed').rdd.map(lambda l: l.asDict()).collect()


In [None]:
for l in train_rdd:
    print(l)
    label_stats[l['primary_type_indexed']]['primary_type'] = l['primary_type']

In [None]:
rates_pddf = pd.DataFrame(list(label_stats.values()))

In [None]:
rates_pddf = rates_pddf.sort_values(by='precision_rate', ascending=False)

#### Does this look like a good model for predicting crimes?

In [None]:
rates_pddf

## Your turn:

 * Run the model on the test set