<IMG SRC="https://github.com/jacquesroy/byte-size-data-science/raw/master/images/Banner.png" ALT="BSDS Banner" WIDTH=1195 HEIGHT=200>

## Chicago Car Accident Data Analysis
In this notebook, we analyze the data using a Python environment.<br/>
We also use PixieDust, which renders maps through Mapbox, to display maps in the later part of the analysis.

In an additional section, we see how we could use additional data to add the city name to each record.

## Additional Information
The Chicago accident information includes three files: crashes, people, and vehicles.

In this notebook, we explore the crashes through a file called `ChicagoTrafficCrashes20180917.csv`

### 017-Spark Data Exploration 
Execute the next cell if you want to see the `Byte Size Data Science` YouTube channel video.

In [None]:
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/xSDP6u_Xqhc?rel=0&amp;controls=0&amp;showinfo=0", width=560, height=315)


## Read the crash data
In this section, we read the data as a Spark DataFrame

In [None]:
# PixieDust is an open source library that was contributed by IBM
!pip install --user --upgrade pixiedust

In [None]:
import pixiedust

In [None]:
from pyspark.sql import SparkSession
import urllib.request
import zipfile

spark = SparkSession.builder.getOrCreate()

In [None]:
url = 'https://github.com/jacquesroy/byte-size-data-science/raw/master/data/ChicagoTrafficCrashes20180917.csv.zip'
# get the filename from the url: "ChicagoTrafficCrashes20180917.csv"
zipfilename = url.rsplit('/', 1)[-1]
filename = zipfilename.rsplit('.', 1)[0]
print("zipfilename: " + zipfilename)
print("filename: " + filename)
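The two `rsplit` calls above peel the archive name off the URL and then drop the `.zip` suffix. A minimal sketch of the same steps as a reusable helper follows; the function name `names_from_url` is hypothetical and not part of the notebook.

```python
def names_from_url(url):
    """Return (archive_name, extracted_name) for a URL ending in '<name>.zip'."""
    # Everything after the last '/' is the archive name.
    archive_name = url.rsplit('/', 1)[-1]
    # Dropping the last extension leaves the name of the file inside the archive
    # (assumes the archive member is named like the archive minus '.zip').
    extracted_name = archive_name.rsplit('.', 1)[0]
    return archive_name, extracted_name


url = ('https://github.com/jacquesroy/byte-size-data-science/raw/master/data/'
       'ChicagoTrafficCrashes20180917.csv.zip')
archive_name, extracted_name = names_from_url(url)
print(archive_name)
print(extracted_name)
```

Passing `maxsplit=1` to `rsplit` matters here: the file name contains two dots, and only the last extension should be removed.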

In [None]:
urllib.request.urlretrieve(url, zipfilename)
compressed_file = zipfile.ZipFile(zipfilename)
csv_file = compressed_file.extract(filename)

In [None]:
collisions_df = spark.read\
  .format('org.apache.spark.sql.execution.datasources.csv.CSVFileFormat')\
  .option('header', 'true')\
  .option('inferSchema', 'true')\
  .load(filename)

collisions_df.createOrReplaceTempView("collisions")
spark.sql("""
    select rd_no,crash_date,DATE_POLICE_NOTIFIED, LATITUDE,LONGITUDE
    from   collisions
    limit 5
    """).take(5)

## Basic Statistics

In [None]:
# Print the number of records and display the DataFrame schema
print("Records: {}".format(collisions_df.count()))
collisions_df.printSchema()

In [None]:
# Convert the two datetime columns to the proper type
from pyspark.sql.functions import to_timestamp
from pyspark.sql.functions import to_date
from pyspark.sql.functions import column

# A single 'a' is the AM/PM marker; Spark 3 rejects the doubled form 'aa'
collisions_df = collisions_df.withColumn("CRASH_TS", to_timestamp("CRASH_DATE", "MM/dd/yyyy hh:mm:ss a")).\
                              withColumn("CRASH_DATE", column('CRASH_TS').cast('date')).\
                              withColumn("DATE_POLICE_NOTIFIED", to_timestamp("DATE_POLICE_NOTIFIED", "MM/dd/yyyy hh:mm:ss a"))
collisions_df.createOrReplaceTempView("collisions")
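To see what this conversion does outside Spark, here is a small sketch that parses a string in the same 12-hour month/day/year layout with the standard library. The sample value is made up for illustration, and the `strptime` format string is the standard-library counterpart of the Spark pattern, not something the notebook itself uses.

```python
from datetime import datetime

# Hypothetical sample in the same 12-hour layout as the CRASH_DATE strings.
raw = "09/17/2018 02:30:00 PM"

# %I is the hour on a 12-hour clock and %p is the AM/PM marker.
crash_ts = datetime.strptime(raw, "%m/%d/%Y %I:%M:%S %p")

# Casting the timestamp to a date drops the time of day.
crash_date = crash_ts.date()

print(crash_ts.isoformat())
print(crash_date.isoformat())
```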

### Get statistics for each column

In [None]:
stats_df = collisions_df.summary()

In [None]:
# Use multiple cells to show the results in a readable manner
stats_df.select(['summary','RD_NO','CRASH_DATE_EST_I','POSTED_SPEED_LIMIT','TRAFFIC_CONTROL_DEVICE','DEVICE_CONDITION']).show()

From the previous output, we see that `CRASH_DATE_EST_I` is mostly null (92.5% of the time).<br/>
We see that the `POSTED_SPEED_LIMIT` maximum is 99, which raises some questions.<br/>
The last two columns are likely from a short list of possibilities.
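The null percentage quoted above comes from comparing a column's `count` in the summary, which counts only non-null values, with the total number of records. A minimal sketch of that arithmetic, using made-up numbers rather than values from the dataset (the `null_fraction` helper is hypothetical):

```python
def null_fraction(non_null_count, total_records):
    """Fraction of records where the column is null."""
    if total_records <= 0:
        raise ValueError("total_records must be positive")
    return 1 - non_null_count / total_records


# Hypothetical numbers for illustration only: 75 non-null values in 1,000 records.
fraction = null_fraction(75, 1000)
print("{:.1%} null".format(fraction))
```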

In [None]:
stats_df.select(['summary','WEATHER_CONDITION','LIGHTING_CONDITION','FIRST_CRASH_TYPE','TRAFFICWAY_TYPE','LANE_CNT','ALIGNMENT']).show()

Here we have five string columns that each take their values from a list of possibilities.<br/>
Since we saw two similar columns earlier, let's find out how many distinct values each of them has.

In [None]:
spark.sql("""
  select count(distinct TRAFFIC_CONTROL_DEVICE) TRAFFIC_CONTROL_DEVICE, count(distinct DEVICE_CONDITION) DEVICE_CONDITION,
         count(distinct WEATHER_CONDITION) WEATHER_CONDITION,
         count(distinct LIGHTING_CONDITION) LIGHTING_CONDITION, count(distinct FIRST_CRASH_TYPE) FIRST_CRASH_TYPE,
         count(distinct TRAFFICWAY_TYPE) TRAFFICWAY_TYPE, count(distinct ALIGNMENT) ALIGNMENT
  from collisions
""").show()

### Continuing with column statistics

In [None]:
stats_df.select(['summary','ROADWAY_SURFACE_COND','ROAD_DEFECT','REPORT_TYPE','CRASH_TYPE','INTERSECTION_RELATED_I','NOT_RIGHT_OF_WAY_I']).show()

We just saw a bunch of string columns.<br/>
Note that `INTERSECTION_RELATED_I` and `NOT_RIGHT_OF_WAY_I` are often null (80.2% and 95.6% respectively)<br/>
Let's look at the other columns to see how many distinct values there are.

In [None]:
spark.sql("""
  select count(distinct ROADWAY_SURFACE_COND) ROADWAY_SURFACE_COND, count(distinct ROAD_DEFECT) ROAD_DEFECT,
         count(distinct REPORT_TYPE) REPORT_TYPE, count(distinct CRASH_TYPE) CRASH_TYPE
  from collisions
""").show()

In [None]:
stats_df.select(['summary','HIT_AND_RUN_I','DAMAGE','PRIM_CONTRIBUTORY_CAUSE','SEC_CONTRIBUTORY_CAUSE','STREET_DIRECTION']).show()

`HIT_AND_RUN_I` is often null (72.6%).<br/>
We can look at the choices in the other columns

In [None]:
spark.sql("""
  select count(distinct STREET_NAME) STREET_NAME, count(distinct DAMAGE) DAMAGE, count(distinct PRIM_CONTRIBUTORY_CAUSE) PRIM_CONTRIBUTORY_CAUSE,
         count(distinct SEC_CONTRIBUTORY_CAUSE) SEC_CONTRIBUTORY_CAUSE, count(distinct STREET_DIRECTION) STREET_DIRECTION
  from collisions
""").show()

In [None]:
stats_df.select(['summary','BEAT_OF_OCCURRENCE','PHOTOS_TAKEN_I','STATEMENTS_TAKEN_I','DOORING_I','WORK_ZONE_I','WORK_ZONE_TYPE']).show()

`BEAT_OF_OCCURRENCE` could be useful in terms of resource deployment.<br/>
The other columns are null most of the time.

In [None]:
stats_df.select(['summary','WORKERS_PRESENT_I','NUM_UNITS','MOST_SEVERE_INJURY','INJURIES_TOTAL','INJURIES_FATAL','INJURIES_INCAPACITATING']).show()

`WORKERS_PRESENT_I` is mostly null.<br/>
The other columns appear "normal"

In [None]:
stats_df.select(['summary','INJURIES_NON_INCAPACITATING','INJURIES_REPORTED_NOT_EVIDENT','INJURIES_NO_INDICATION','INJURIES_UNKNOWN']).show()

In [None]:
stats_df.select(['summary','CRASH_HOUR','CRASH_DAY_OF_WEEK','CRASH_MONTH','LATITUDE','LONGITUDE']).show()

## Exploring further
We saw earlier that the minimum `CRASH_DATE` was 2014-01-21 and the minimum `DATE_POLICE_NOTIFIED` was 2015-07-25.<br/>
That indicates that there probably are some errors in the data.

The `POSTED_SPEED_LIMIT` value has a maximum of 99. This is suspicious.

There are more...

In [None]:
# DATE_POLICE_NOTIFIED should always be greater than or equal to CRASH_DATE
data_df = spark.sql("""
  select RD_NO, CRASH_DATE, DATE_POLICE_NOTIFIED, datediff(to_date(DATE_POLICE_NOTIFIED), CRASH_DATE) DIFF_DAYS
  from collisions
""")
data_df.summary().show()

In [None]:
spark.sql("""
  select datediff(to_date(DATE_POLICE_NOTIFIED), CRASH_DATE) DIFF_DAYS, count(*) cnt
  from collisions
  group by DIFF_DAYS
  order by DIFF_DAYS desc
  limit 20
""").show()

In [None]:
print(data_df.filter('DIFF_DAYS > 0').count())

In [None]:
data_df.filter('DIFF_DAYS > 30').summary().show()

In [None]:
print(data_df.filter('DIFF_DAYS > 30').count())
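Spark's `datediff(end, start)` returns the number of days from `start` to `end`, so a negative `DIFF_DAYS` means the police were notified before the crash date, which points to a data-entry error. A small standard-library sketch of the same check, with hypothetical sample values and a hypothetical `diff_days` helper:

```python
from datetime import date, datetime


def diff_days(notified_ts, crash_date):
    """Days from crash_date to the date part of notified_ts (negative = suspicious)."""
    return (notified_ts.date() - crash_date).days


# Hypothetical records: (crash date, police notification timestamp).
records = [
    (date(2018, 3, 1), datetime(2018, 3, 1, 18, 45)),   # same day
    (date(2018, 3, 1), datetime(2018, 4, 15, 9, 0)),    # reported 45 days later
    (date(2018, 3, 2), datetime(2018, 3, 1, 23, 30)),   # notified before the crash
]

diffs = [diff_days(notified, crash) for crash, notified in records]
suspicious = [d for d in diffs if d < 0]
print(diffs, suspicious)
```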

### POSTED_SPEED_LIMIT
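The statistics showed a maximum `POSTED_SPEED_LIMIT` of 99, which is suspicious. One way to flag such values is a simple plausibility rule. The sketch below assumes, purely for illustration, that plausible limits are positive multiples of 5 no higher than 70; that rule and the `is_suspicious_limit` helper are hypothetical, not taken from the dataset documentation.

```python
def is_suspicious_limit(limit, max_plausible=70):
    """Flag limits that are non-positive, above max_plausible, or not a multiple of 5."""
    return limit <= 0 or limit > max_plausible or limit % 5 != 0


# Hypothetical sample of POSTED_SPEED_LIMIT values.
limits = [30, 35, 99, 0, 33, 55]
flagged = [v for v in limits if is_suspicious_limit(v)]
print(flagged)
```

Counting how many records fall under each flagged value then tells us whether they are rare outliers or a systematic placeholder.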

In [None]:
spark.sql("""
  select POSTED_SPEED_LIMIT, count(POSTED_SPEED_LIMIT) TOTAL
  from collisions
  group by POSTED_SPEED_LIMIT
  order by POSTED_SPEED_LIMIT Desc
""").show(40)

## Look at the distinct values
How relevant are the distinct values?

In [None]:
spark.sql("""
  select TRAFFIC_CONTROL_DEVICE, count(TRAFFIC_CONTROL_DEVICE) TOTAL
  from collisions
  group by TRAFFIC_CONTROL_DEVICE
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select DEVICE_CONDITION, count(DEVICE_CONDITION) TOTAL
  from collisions
  group by DEVICE_CONDITION
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select WEATHER_CONDITION, count(WEATHER_CONDITION) TOTAL
  from collisions
  group by WEATHER_CONDITION
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select LIGHTING_CONDITION, count(LIGHTING_CONDITION) TOTAL
  from collisions
  group by LIGHTING_CONDITION
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select FIRST_CRASH_TYPE, count(FIRST_CRASH_TYPE) TOTAL
  from collisions
  group by FIRST_CRASH_TYPE
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select TRAFFICWAY_TYPE, count(TRAFFICWAY_TYPE) TOTAL
  from collisions
  group by TRAFFICWAY_TYPE
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select ALIGNMENT, count(ALIGNMENT) TOTAL
  from collisions
  group by ALIGNMENT
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select ROADWAY_SURFACE_COND, count(ROADWAY_SURFACE_COND) TOTAL
  from collisions
  group by ROADWAY_SURFACE_COND
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select ROAD_DEFECT, count(ROAD_DEFECT) TOTAL
  from collisions
  group by ROAD_DEFECT
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select REPORT_TYPE, count(REPORT_TYPE) TOTAL
  from collisions
  group by REPORT_TYPE
  order by TOTAL desc
""").show()

In [None]:
spark.sql("""
  select CRASH_TYPE, count(CRASH_TYPE) TOTAL
  from collisions
  group by CRASH_TYPE
  order by TOTAL desc
""").show()

## Count accidents, accidents with injuries, accidents with fatalities

In [None]:
spark.sql("""
  select count(*) all_accidents from collisions
  where longitude is not null
  and latitude is not null
""").show()

spark.sql("""
  select count(*) accidents_with_injuries from collisions
  where longitude is not null
  and latitude is not null
  and INJURIES_TOTAL > 0
""").show()

spark.sql("""
  select count(*) accidents_with_fatalities from collisions
  where longitude is not null
  and latitude is not null
  and INJURIES_FATAL > 0
""").show()

## Extract a subset of columns

In [None]:
from pyspark.sql.functions import to_timestamp
from pyspark.sql.functions import to_date
from pyspark.sql.functions import column
# Select the columns to use
# RD_NO,CRASH_DATE,POSTED_SPEED_LIMIT,TRAFFIC_CONTROL_DEVICE,DEVICE_CONDITION,
#         WEATHER_CONDITION,LIGHTING_CONDITION,FIRST_CRASH_TYPE,TRAFFICWAY_TYPE,LANE_CNT,
#         ALIGNMENT,ROADWAY_SURFACE_COND,ROAD_DEFECT,REPORT_TYPE,CRASH_TYPE,
#         INTERSECTION_RELATED_I,HIT_AND_RUN_I,DAMAGE,DATE_POLICE_NOTIFIED,
#         PRIM_CONTRIBUTORY_CAUSE,SEC_CONTRIBUTORY_CAUSE,STREET_NO,STREET_DIRECTION,
#         STREET_NAME,BEAT_OF_OCCURRENCE,NUM_UNITS,MOST_SEVERE_INJURY,INJURIES_TOTAL,
#         INJURIES_FATAL,INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,
#         INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,
#         CRASH_HOUR,CRASH_DAY_OF_WEEK,CRASH_MONTH,LATITUDE,LONGITUDE
#
# Additional columns (derived from CRASH_DATE)
# CRASH_TS (original CRASH_DATE as a timestamp)
# CRASH_DATE (derived from CRASH_TS as a date)
#
collisions2_df = spark.sql("""
  select RD_NO,CRASH_TS,CRASH_DATE, INJURIES_TOTAL, INJURIES_FATAL,
         INJURIES_INCAPACITATING,INJURIES_NON_INCAPACITATING,
         INJURIES_REPORTED_NOT_EVIDENT,INJURIES_NO_INDICATION,INJURIES_UNKNOWN,
         CRASH_HOUR,CRASH_DAY_OF_WEEK,dayofmonth(CRASH_DATE) CRASH_DAY, 
         CRASH_MONTH,year(CRASH_DATE) CRASH_YEAR, LATITUDE,LONGITUDE
  from collisions
""")
collisions2_df.createOrReplaceTempView("collisions2")
collisions2_df.show(5)

## Visualization

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt
# matplotlib.patches lets us create colored patches, which we can use for legends in plots
import matplotlib.patches as mpatches
# seaborn also builds on matplotlib and adds graphical features and new plot types
# adjust settings
# The inline statement ensures that the plot shows in the cell output. See the documentation for more information
%matplotlib inline
sns.set_style("white")
plt.rcParams['figure.figsize'] = (15, 15)

### Grouping accidents
First by street for 3 categories:
<ul><li>All accidents</li>
<li>Accidents with injuries</li>
<li>Accidents with fatalities</li>
</ul>

In [None]:
# Plot the top 15 streets by accident count
plt.figure(figsize=(8,5))
streets = collisions_df.groupBy('STREET_NAME').count().sort('count',ascending=False).limit(15).toPandas() 
colors = ['g','0.75','y','k','b','r']
streets.sort_values(by='count', ascending=False)['count'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Street')
plt.title('Total Number of Collisions by Street', size=15)
plt.yticks(range(0,15),streets['STREET_NAME'])
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8,5))
streets = collisions_df.filter(collisions_df.INJURIES_FATAL == 0).\
                        filter(collisions_df.INJURIES_TOTAL > 0).\
                        groupBy('STREET_NAME').count().sort('count',ascending=False).limit(15).toPandas()
colors = ['g','0.75','y','k','b','r']
streets.sort_values(by='count', ascending=False)['count'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Street')
plt.title('Total Number of Injury Accidents by Street', size=15)
plt.yticks(range(0,15),streets['STREET_NAME'])
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8,5))
streets = collisions_df.filter(collisions_df.INJURIES_FATAL > 0).\
                        groupBy('STREET_NAME').count().sort('count',ascending=False).limit(15).toPandas()
colors = ['g','0.75','y','k','b','r']
streets.sort_values(by='count', ascending=False)['count'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Street')
plt.title('Total Number of Fatal accidents by Street', size=15)
plt.yticks(range(0,15),streets['STREET_NAME'])
plt.tight_layout()
plt.show()

### Accidents by month, day of the week
See example code at: `http://benalexkeen.com/bar-charts-in-matplotlib/`

In [None]:
plt.figure(figsize=(8,5))
byMonth = collisions_df.groupBy('CRASH_MONTH').count().sort('CRASH_MONTH',ascending=False).toPandas()
colors = ['g','0.75','y','k','b','r']
byMonth.sort_values(by='CRASH_MONTH', ascending=False)['count'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('month')
plt.title('Total Number of Collisions by month', size=15)
plt.yticks(range(0,12),byMonth['CRASH_MONTH'])
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8,5))
byDay = collisions_df.groupBy('CRASH_DAY_OF_WEEK').count().sort('CRASH_DAY_OF_WEEK',ascending=False).toPandas()
colors = ['g','0.75','y','k','b','r']
byDay.sort_values(by='CRASH_DAY_OF_WEEK', ascending=False)['count'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Day of week')
plt.title('Total Number of Collisions by Day of the week', size=15)
plt.yticks(range(0,7),byDay['CRASH_DAY_OF_WEEK'])
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8,5))
byHour = collisions_df.groupBy('CRASH_HOUR').count().sort('CRASH_HOUR',ascending=False).toPandas()
colors = ['g','0.75','y','k','b','r']
byHour.sort_values(by='CRASH_HOUR', ascending=False)['count'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Hour')
plt.title('Total Number of Collisions by hour of the day', size=15)
plt.yticks(range(0,24),byHour['CRASH_HOUR'])
plt.tight_layout()
plt.show()

## Prep Data for plotting

In [None]:
collisions_pd = collisions_df[collisions_df['LATITUDE'] != 0][['LATITUDE', 'LONGITUDE', 'CRASH_DATE',
                                                               'INJURIES_TOTAL', 'INJURIES_FATAL', 'CRASH_HOUR','CRASH_DAY_OF_WEEK',
                                                               'CRASH_MONTH']].toPandas()

collisions_pd.columns = ['Latitude', 'Longitude', 'Date', 'Persons Injured', 'Persons Killed',
                         'Crash hour', 'Crash day of week', 'Crash month']

collisions_pd['Latitude'] = collisions_pd['Latitude'].astype(float)
collisions_pd['Longitude'] = collisions_pd['Longitude'].astype(float)
collisions_pd['Persons Killed'] = collisions_pd['Persons Killed'].astype(float)
collisions_pd['Persons Injured'] = collisions_pd['Persons Injured'].astype(float)



#divide dataset into accident categories: fatal, non-fatal but with injuries, none of the above
killed_pd = collisions_pd[collisions_pd['Persons Killed']>0]
injured_pd = collisions_pd[np.logical_and(collisions_pd['Persons Injured']>0, collisions_pd['Persons Killed']==0)]
nothing_pd = collisions_pd[np.logical_and(collisions_pd['Persons Killed']==0, collisions_pd['Persons Injured']==0)]
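The three-way split above can also be expressed as a single classification rule. The sketch below is plain Python with a hypothetical `severity` helper; it mirrors the conditions used for `killed_pd`, `injured_pd`, and `nothing_pd`.

```python
def severity(persons_injured, persons_killed):
    """Classify one crash using the same conditions as the three DataFrames."""
    if persons_killed > 0:
        return "fatal"
    if persons_injured > 0:
        return "injury"
    return "no injury"


# Hypothetical (injured, killed) pairs for illustration.
samples = [(0, 0), (2, 0), (1, 1), (0, 1)]
labels = [severity(injured, killed) for injured, killed in samples]
print(labels)
```

One difference is worth noting: rows whose injury counts are missing (NaN) match none of the three pandas filters, whereas this helper would label them "no injury".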

## Plot the accidents using longitude/latitude
This is not a map but a scatter plot that positions each accident by its longitude and latitude.

We got the limits for longitude and latitude earlier and plug them into the xlim/ylim values.

In [None]:
#create scatterplots
plt.figure(figsize=(15,10))
plt.scatter(collisions_pd.Longitude, collisions_pd.Latitude, alpha=0.05, s=4, color='darkseagreen')

#adjust more settings
plt.title('Motor Vehicle Collisions in Chicago', size=25)
plt.xlim((-87.92,-87.52))
plt.ylim((41.64,42.03))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)

plt.show()
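The hard-coded `xlim`/`ylim` values reflect the longitude and latitude ranges seen in the statistics. A minimal sketch, assuming a hypothetical `bounding_box` helper and made-up coordinates, of how such limits could be derived from the data itself:

```python
def bounding_box(longitudes, latitudes, margin=0.01):
    """Return ((min_lon, max_lon), (min_lat, max_lat)) padded by margin degrees."""
    xlim = (min(longitudes) - margin, max(longitudes) + margin)
    ylim = (min(latitudes) - margin, max(latitudes) + margin)
    return xlim, ylim


# Hypothetical coordinates inside the Chicago area.
lons = [-87.90, -87.65, -87.53]
lats = [41.65, 41.88, 42.02]
xlim, ylim = bounding_box(lons, lats)
print(xlim, ylim)
```

The resulting tuples could be passed to `plt.xlim` and `plt.ylim`. Outliers such as a latitude of 0 would stretch the box, so they should be filtered out first.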

## Enhance the scatter plot to identify accident severity
We draw from the pandas DataFrames we created earlier to plot each severity level in a different color.

In [None]:
#adjust settings
plt.figure(figsize=(15,10))

#create scatterplots
plt.scatter(nothing_pd.Longitude, nothing_pd.Latitude, alpha=0.04, s=1, color='blue')
plt.scatter(injured_pd.Longitude, injured_pd.Latitude, alpha=0.1, s=1, color='yellow')
plt.scatter(killed_pd.Longitude, killed_pd.Latitude, color='red', s=5)

#create legend
blue_patch = mpatches.Patch( label='car body damage', alpha=0.2, color='blue')
yellow_patch = mpatches.Patch(color='yellow', label='personal injury', alpha=0.5)
red_patch = mpatches.Patch(color='red', label='fatal accidents')
plt.legend([blue_patch, yellow_patch, red_patch],('car body damage', 'personal injury', 'fatal accidents'), 
           loc='upper left', prop={'size':20})

#adjust more settings
plt.title('Severity of Motor Vehicle Collisions in Chicago', size=20)
plt.xlim((-87.92,-87.52))
plt.ylim((41.64,42.03))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)
plt.savefig('anothertry.png')

plt.show()

## Determine the streets with the most collisions
Find the top ten streets in Chicago where the most vehicle collisions occurred. Display the results in a bar graph and as a scatter plot.

In [None]:
from pyspark.sql import functions as F

# Note the Spark DataFrame SQL-like methods available: groupBy, agg, sort (order by), limit
# The result is converted to a Pandas DataFrame
plottingdf = collisions_df.groupBy("STREET_NAME").agg(F.count("STREET_NAME").alias("count(STREET_NAME)")).\
sort(F.desc('count(STREET_NAME)')).limit(10).toPandas()

plottingdf[['count(STREET_NAME)']].plot(kind='barh', figsize=(11,7), legend=False)
plt.title('Top 10 Streets with the most accidents', size=20)
plt.xlabel('Count')
plt.yticks(range(10), plottingdf['STREET_NAME'])
plt.gca().invert_yaxis()
plt.show()
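The `groupBy` / `count` / `sort` / `limit` chain is a standard top-N aggregation. For readers less familiar with Spark, the same idea on a plain Python list looks like this; the street names are hypothetical sample values.

```python
from collections import Counter

# Hypothetical street names, one entry per crash record.
street_per_crash = ["WESTERN AVE", "PULASKI RD", "WESTERN AVE",
                    "CICERO AVE", "WESTERN AVE", "PULASKI RD"]

# Counter groups equal values and counts them; most_common(n) sorts by
# descending count and keeps the first n entries.
top_streets = Counter(street_per_crash).most_common(2)
print(top_streets)
```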

## Visualize the 10 streets with the most collisions

In [None]:
data1 = collisions_df[['STREET_NAME', 'LATITUDE', 'LONGITUDE']].toPandas()
street_names = collisions_df.groupBy("STREET_NAME").agg(F.count("STREET_NAME").
                                                        alias("count(STREET_NAME)")).\
                sort(F.desc('count(STREET_NAME)')).limit(10).select('STREET_NAME').\
                rdd.map(lambda r : r.STREET_NAME).collect()

# One color per street, in the same order as street_names
street_colors = ['red', 'blue', 'magenta', 'orange', 'yellow',
                 'purple', 'black', 'chartreuse', 'brown', 'darkgreen']

#create scatterplots
plt.figure(figsize=(15,10))
plt.scatter(data1.LONGITUDE, data1.LATITUDE, s=1, color='darkseagreen')
for name, color in zip(street_names, street_colors):
    street_pd = data1[data1['STREET_NAME'] == name]
    plt.scatter(street_pd.LONGITUDE, street_pd.LATITUDE, s=2, color=color)

#create legend
patches = [mpatches.Patch(color=color, label=name)
           for name, color in zip(street_names, street_colors)]
plt.legend(patches, street_names, loc='upper left', prop={'size':12})

#adjust more settings
plt.title('Vehicle Collisions in Chicago', size=25)
plt.xlim((-87.92,-87.52))
plt.ylim((41.64,42.03))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)
plt.show()

## Using K-Means to find hot spots
We are using K-means to find the center of groupings of accidents.

The process is as follows:

<ol><li>We extract the longitude and latitude of all accidents</li>
<li>We create a model (for, arbitrarily, 10 clusters)</li>
<li>We extract the centers and convert them to a pandas DataFrame</li>
<li>We display the result on a map using PixieDust</li>
</ol>
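To make the clustering step concrete, here is a minimal pure-Python sketch of Lloyd's algorithm, the basic iteration behind K-means. It is an illustration only: Spark's `KMeans` uses its own distributed implementation and initialization, and the `kmeans` function, sample points, and starting centers below are hypothetical.

```python
def kmeans(points, centers, iterations=10):
    """Plain Lloyd's algorithm on 2-D points with caller-supplied starting centers.

    Returns the final centers and the cluster sizes from the last assignment step.
    """
    centers = list(centers)
    sizes = [0] * len(centers)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for x, y in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2,
            )
            clusters[nearest].append((x, y))
        sizes = [len(members) for members in clusters]
        # Update step: move each center to the mean of its points.
        new_centers = []
        for center, members in zip(centers, clusters):
            if members:
                mean_x = sum(p[0] for p in members) / len(members)
                mean_y = sum(p[1] for p in members) / len(members)
                new_centers.append((mean_x, mean_y))
            else:
                new_centers.append(center)  # keep an empty cluster's center in place
        if new_centers == centers:
            break  # converged: no center moved
        centers = new_centers
    return centers, sizes


# Hypothetical (longitude, latitude) points forming two obvious groups.
points = [(-87.70, 41.80), (-87.72, 41.82), (-87.71, 41.81),
          (-87.60, 41.95), (-87.62, 41.97), (-87.61, 41.96)]
centers, sizes = kmeans(points, centers=[points[0], points[3]])
print(centers, sizes)
```

Each returned center is the mean position of the points assigned to it, which is why a center can be read as a "hot spot" and its cluster size as the number of accidents around it.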

In [None]:
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors
from pyspark.sql import Row

import pandas as pd
import numpy as np
import seaborn as sns

In [None]:
# Column features must be of type org.apache.spark.ml.linalg.Vector.
data1 = spark.createDataFrame(
    spark.sql("""
          select LONGITUDE, LATITUDE from collisions
          where LATITUDE is not null
          and longitude is not null
    """).rdd.map(lambda r : Row(Vectors.dense([r.LONGITUDE, r.LATITUDE]))), ["features"] )

kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(data1)
centers = model.clusterCenters()

In [None]:
# Convert the list of NumPy arrays into a pandas DataFrame
long=[]
lat=[]
for center in centers :
    long.append(center[0])
    lat.append(center[1])

summary = model.summary
data2 = pd.DataFrame(data={'LONGITUDE': long, 'LATITUDE': lat, "COUNT": summary.clusterSizes })

In [None]:
display(data2)

## K-Means for accidents with injuries

In [None]:
data1 = spark.createDataFrame(
    spark.sql("""
          select LONGITUDE, LATITUDE from collisions
          where LATITUDE is not null
          and longitude is not null
          and INJURIES_TOTAL > 0
    """).rdd.map(lambda r : Row(Vectors.dense([r.LONGITUDE, r.LATITUDE]))), ["features"] )

kmeans = KMeans(k=100, seed=123)
model = kmeans.fit(data1)
centers = model.clusterCenters()

In [None]:
# Convert the list of NumPy arrays into a pandas DataFrame
long=[]
lat=[]
for center in centers :
    long.append(center[0])
    lat.append(center[1])

summary = model.summary
data2 = pd.DataFrame(data={'LONGITUDE': long, 'LATITUDE': lat, "COUNT": summary.clusterSizes })
display(data2)

## K-Means for accidents with fatalities
There are only 180 fatal accidents, so we use a `k` of 10.

In [None]:
data1 = spark.createDataFrame(
    spark.sql("""
          select LONGITUDE, LATITUDE from collisions
          where LATITUDE is not null
          and longitude is not null
          and INJURIES_FATAL > 0
    """).rdd.map(lambda r : Row(Vectors.dense([r.LONGITUDE, r.LATITUDE]))), ["features"] )

kmeans = KMeans(k=10, seed=123)
model = kmeans.fit(data1)
centers = model.clusterCenters()

In [None]:
# Convert the list of NumPy arrays into a pandas DataFrame
long=[]
lat=[]
for center in centers :
    long.append(center[0])
    lat.append(center[1])

summary = model.summary
data2 = pd.DataFrame(data={'LONGITUDE': long, 'LATITUDE': lat, "COUNT": summary.clusterSizes })
display(data2)

In [None]:
# see %lsmagic for all the commands available
%rm $filename

In [None]:
%rm $zipfilename