<IMG SRC="https://github.com/jacquesroy/byte-size-data-science/raw/master/images/Banner.png" ALT="BSDS Banner" WIDTH=1195 HEIGHT=200>

## Chicago Car Accident Data Analysis
In this notebook, we analyze Chicago traffic crash data in a Python environment.
We also use PixieDust, which renders maps through Mapbox, to display maps in the later part of the analysis.

In an additional section, we see how supplementary data could be used to add the city name to each record.

## Additional Information
The Chicago accident information includes three files: crashes, people, and vehicles.

In this notebook, we explore the crashes through a file called `ChicagoTrafficCrashes20180917.csv`.

## Data Location
The data was loaded into a Db2 Warehouse instance in the cloud. You can use a free Db2 Lite database to work with this data.

### 032-JDBC Data Exploration
Execute the next cell if you want to see the `Byte Size Data Science` YouTube channel video.

In [None]:
from IPython.display import IFrame

IFrame(src="https://www.youtube.com/embed/qw4FtewQFZE?rel=0&amp;controls=0&amp;showinfo=0", width=560, height=315)


In [None]:
# PixieDust is an open source library that was contributed by IBM
!pip install --user --upgrade pixiedust

In [None]:
import pixiedust

In [None]:

# @hidden_cell
# The following code contains the credentials for a connection in your Project.
# You might want to remove those credentials before you share your notebook.
credentials_1 = {
    'username': 'psj85494',
    'password': """xw234wkr8xvclf^b""",
    'sg_service_url': 'https://sgmanager.ng.bluemix.net',
    'database': 'BLUDB',
    'host': 'dashdb-txn-sbox-yp-dal09-04.services.dal.bluemix.net',
    'port': '50000',
    'url': 'https://undefined'
}
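
The comment in the cell above suggests removing the credentials before sharing the notebook. One way to avoid embedding them at all is to read them from environment variables. The following is a minimal sketch under assumptions: the environment variable names and the helper `load_credentials` are hypothetical and are not part of the original notebook.

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
ENV_KEYS = {
    'username': 'DB2_USERNAME',
    'password': 'DB2_PASSWORD',
    'database': 'DB2_DATABASE',
    'host': 'DB2_HOST',
    'port': 'DB2_PORT',
}

def load_credentials(env=None):
    """Build a credentials dict from environment variables.

    Raises KeyError naming the variables that are missing.
    """
    env = os.environ if env is None else env
    missing = [name for name in ENV_KEYS.values() if name not in env]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[name] for key, name in ENV_KEYS.items()}
```

With the variables set, `credentials_1 = load_credentials()` could stand in for the literal dictionary.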


In [None]:
import pandas as pd
import ibm_db
import ibm_db_dbi

dsn = (
    "DRIVER={{IBM DB2 ODBC DRIVER}};"
    "DATABASE={0};"
    "HOSTNAME={1};"
    "PORT={2};"
    "PROTOCOL=TCPIP;"
    "UID={3};"
    "PWD={4};").format(credentials_1['database'], credentials_1['host'],
                       credentials_1['port'], credentials_1['username'], 
                       credentials_1['password'])

conn = ibm_db.connect(dsn, "", "")
pconn = ibm_db_dbi.Connection(conn)
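
The connection string above relies on a detail of `str.format`: doubled braces are emitted as literal braces, which is how the driver name ends up wrapped in `{}`. The sketch below isolates that step in a small function so it can be checked without a database. The function name `build_dsn` and the example host are illustrative assumptions.

```python
def build_dsn(credentials):
    """Return a Db2 connection string built from a credentials dict."""
    # Doubled braces are literal braces in str.format, so the driver
    # name is emitted as {IBM DB2 ODBC DRIVER}.
    template = (
        "DRIVER={{IBM DB2 ODBC DRIVER}};"
        "DATABASE={database};"
        "HOSTNAME={host};"
        "PORT={port};"
        "PROTOCOL=TCPIP;"
        "UID={username};"
        "PWD={password};"
    )
    # Extra keys in the dict (such as 'url') are simply ignored.
    return template.format(**credentials)

# Placeholder values, not real connection details.
example = build_dsn({
    'database': 'BLUDB', 'host': 'db.example.com', 'port': '50000',
    'username': 'user', 'password': 'secret',
})
print(example)
```

Named placeholders make the mapping between dictionary keys and DSN fields easier to read than positional indexes.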

## Basic Statistics
- Total number of records in the table
- Number of non-null values for each column
- Min/Max for selected columns
- Number of distinct values in some columns

In [None]:
# Number of records in the table
sql = """
  select count(*) cnt 
  from traffic_crashes_crashes;
"""

data_pd = pd.read_sql(sql, pconn)
data_pd.head(5)

In [None]:
# Number of non-null values in each column
sql = """
  select count(RD_NO) RD_NO, count(CRASH_DATE_EST_I) CRASH_DATE_EST_I,
         count(CRASH_DATE) CRASH_DATE, count(POSTED_SPEED_LIMIT) POSTED_SPEED_LIMIT,
         count(TRAFFIC_CONTROL_DEVICE) TRAFFIC_CONTROL_DEVICE,
         count(DEVICE_CONDITION) DEVICE_CONDITION, count(WEATHER_CONDITION) WEATHER_CONDITION,
         count(LIGHTING_CONDITION) LIGHTING_CONDITION, count(FIRST_CRASH_TYPE) FIRST_CRASH_TYPE,
         count(TRAFFICWAY_TYPE) TRAFFICWAY_TYPE, count(LANE_CNT) LANE_CNT,
         count(ALIGNMENT) ALIGNMENT, count(ROADWAY_SURFACE_COND) ROADWAY_SURFACE_COND,
         count(ROAD_DEFECT) ROAD_DEFECT, count(REPORT_TYPE) REPORT_TYPE,
         count(CRASH_TYPE) CRASH_TYPE, count(INTERSECTION_RELATED_I) INTERSECTION_RELATED_I,
         count(NOT_RIGHT_OF_WAY_I) NOT_RIGHT_OF_WAY_I, count(HIT_AND_RUN_I) HIT_AND_RUN_I,
         count(DAMAGE) DAMAGE, count(DATE_POLICE_NOTIFIED) DATE_POLICE_NOTIFIED,
         count(PRIM_CONTRIBUTORY_CAUSE) PRIM_CONTRIBUTORY_CAUSE, 
         count(SEC_CONTRIBUTORY_CAUSE) SEC_CONTRIBUTORY_CAUSE,
         count(STREET_NO) STREET_NO, count(STREET_DIRECTION) STREET_DIRECTION,
         count(STREET_NAME) STREET_NAME, count(BEAT_OF_OCCURRENCE) BEAT_OF_OCCURRENCE,
         count(PHOTOS_TAKEN_I) PHOTOS_TAKEN_I, count(STATEMENTS_TAKEN_I) STATEMENTS_TAKEN_I,
         count(DOORING_I) DOORING_I, count(WORK_ZONE_I) WORK_ZONE_I,
         count(WORK_ZONE_TYPE) WORK_ZONE_TYPE, count(WORKERS_PRESENT_I) WORKERS_PRESENT_I,
         count(NUM_UNITS) NUM_UNITS, count(MOST_SEVERE_INJURY) MOST_SEVERE_INJURY,
         count(INJURIES_TOTAL) INJURIES_TOTAL, count(INJURIES_FATAL) INJURIES_FATAL,
         count(INJURIES_INCAPACITATING) INJURIES_INCAPACITATING,
         count(INJURIES_NON_INCAPACITATING) INJURIES_NON_INCAPACITATING,
         count(INJURIES_REPORTED_NOT_EVIDENT) INJURIES_REPORTED_NOT_EVIDENT,
         count(INJURIES_NO_INDICATION) INJURIES_NO_INDICATION,
         count(INJURIES_UNKNOWN) INJURIES_UNKNOWN, count(CRASH_HOUR) CRASH_HOUR,
         count(CRASH_DAY_OF_WEEK) CRASH_DAY_OF_WEEK, count(CRASH_MONTH) CRASH_MONTH,
         count(LATITUDE) LATITUDE, count(LONGITUDE) LONGITUDE,
         count(LOCATION_str) LOCATION_str
  from   traffic_crashes_crashes
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.iloc[:,0:8].head()

In [None]:
data_pd.iloc[:,8:16].head()

In [None]:
data_pd.iloc[:,16:24].head()

In [None]:
data_pd.iloc[:,24:32].head()

In [None]:
data_pd.iloc[:,32:40].head()

In [None]:
data_pd.iloc[:,40:48].head()

In [None]:
# Look at minimum and maximum values of date and numerical columns
sql = """
  select min(CRASH_DATE) min_CRASH_DATE, max(CRASH_DATE) max_CRASH_DATE,
         min(DATE_POLICE_NOTIFIED) min_DATE_POLICE_NOTIFIED, max(DATE_POLICE_NOTIFIED) max_DATE_POLICE_NOTIFIED,
         min(POSTED_SPEED_LIMIT) min_POSTED_SPEED_LIMIT, max(POSTED_SPEED_LIMIT) max_POSTED_SPEED_LIMIT,
         min(LANE_CNT) min_LANE_CNT, max(LANE_CNT) max_LANE_CNT,
         min(INJURIES_TOTAL) min_INJURIES_TOTAL, max(INJURIES_TOTAL) max_INJURIES_TOTAL,
         min(INJURIES_FATAL) min_INJURIES_FATAL, max(INJURIES_FATAL) max_INJURIES_FATAL,
         min(INJURIES_INCAPACITATING) min_INJURIES_INCAPACITATING, max(INJURIES_INCAPACITATING) max_INJURIES_INCAPACITATING,
         min(INJURIES_NON_INCAPACITATING) min_INJURIES_NON_INCAPACITATING, 
         max(INJURIES_NON_INCAPACITATING) max_INJURIES_NON_INCAPACITATING,
         min(INJURIES_REPORTED_NOT_EVIDENT) min_INJURIES_REPORTED_NOT_EVIDENT, 
         max(INJURIES_REPORTED_NOT_EVIDENT) max_INJURIES_REPORTED_NOT_EVIDENT,
         min(INJURIES_NO_INDICATION) min_INJURIES_NO_INDICATION, max(INJURIES_NO_INDICATION) max_INJURIES_NO_INDICATION,
         min(INJURIES_UNKNOWN) min_INJURIES_UNKNOWN, max(INJURIES_UNKNOWN) max_INJURIES_UNKNOWN
  from   traffic_crashes_crashes
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.iloc[:,0:6].head()


In [None]:
data_pd.iloc[:,6:12].head()

In [None]:
data_pd.iloc[:,12:16].head()

In [None]:
data_pd.iloc[:,16:20].head()

In [None]:
data_pd.iloc[:,20:23].head()

In [None]:
sql = """
  select count(distinct POSTED_SPEED_LIMIT) POSTED_SPEED_LIMIT, count(distinct TRAFFIC_CONTROL_DEVICE) TRAFFIC_CONTROL_DEVICE,
         count(distinct DEVICE_CONDITION) DEVICE_CONDITION,
         count(distinct WEATHER_CONDITION) WEATHER_CONDITION,
         count(distinct LIGHTING_CONDITION) LIGHTING_CONDITION, count(distinct FIRST_CRASH_TYPE) FIRST_CRASH_TYPE,
         count(distinct TRAFFICWAY_TYPE) TRAFFICWAY_TYPE, count(distinct ALIGNMENT) ALIGNMENT,
         count(distinct ROADWAY_SURFACE_COND) ROADWAY_SURFACE_COND, count(distinct ROAD_DEFECT) ROAD_DEFECT,
         count(distinct REPORT_TYPE) REPORT_TYPE,
         count(distinct CRASH_TYPE) CRASH_TYPE, count(distinct DAMAGE) DAMAGE,
         count(distinct PRIM_CONTRIBUTORY_CAUSE) PRIM_CONTRIBUTORY_CAUSE, 
         count(distinct SEC_CONTRIBUTORY_CAUSE) SEC_CONTRIBUTORY_CAUSE
  from   traffic_crashes_crashes
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.iloc[:,0:6].head(5)

In [None]:
data_pd.iloc[:,6:13].head(5)

In [None]:
data_pd.iloc[:,13:15].head(5)
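
The same four statistics can also be computed in pandas once an extract of the table is in memory. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from the dataset) to show which pandas call corresponds to which SQL aggregate.

```python
import pandas as pd

# Toy data standing in for a few columns of the crashes table.
sample = pd.DataFrame({
    'POSTED_SPEED_LIMIT': [30, 30, 35, 99],
    'WEATHER_CONDITION': ['CLEAR', 'RAIN', None, 'CLEAR'],
    'LANE_CNT': [2, None, 4, 2],
})

total_rows = len(sample)        # like count(*)
non_null = sample.count()       # like count(column)
distinct = sample.nunique()     # like count(distinct column); nulls are ignored
limits = sample[['POSTED_SPEED_LIMIT', 'LANE_CNT']].agg(['min', 'max'])

print(total_rows)
print(non_null)
print(distinct)
print(limits)
```

For the full table, running the aggregates in the database as done above avoids transferring every row to the notebook.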

## Exploring further
We saw earlier that the minimum CRASH_DATE was 2014-01-21 and the minimum DATE_POLICE_NOTIFIED was 2015-07-25.
The earliest crash predates the earliest police notification by about 18 months, which indicates that there are probably some errors in the data.

The POSTED_SPEED_LIMIT value has a maximum of 99. This is suspicious.

There are more...

In [None]:
# DATE_POLICE_NOTIFIED should always be greater than or equal to CRASH_DATE
sql = """
  select RD_NO, CRASH_DATE, DATE_POLICE_NOTIFIED, (DAYS(DATE_POLICE_NOTIFIED) - Days(CRASH_DATE)) DIFF_DAYS
  from   traffic_crashes_crashes
  where (DAYS(DATE_POLICE_NOTIFIED) - Days(CRASH_DATE)) != 0
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select (DAYS(DATE_POLICE_NOTIFIED) - Days(CRASH_DATE)) DIFF_DAYS, count(*) cnt
  from   traffic_crashes_crashes
  group by (DAYS(DATE_POLICE_NOTIFIED) - Days(CRASH_DATE))
  order by (DAYS(DATE_POLICE_NOTIFIED) - Days(CRASH_DATE)) desc
  limit 20
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select count(*) cnt
  from   traffic_crashes_crashes
  where (DAYS(DATE_POLICE_NOTIFIED) - Days(CRASH_DATE)) > 0 
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(5)
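
The day-difference check can be reproduced in pandas on an extract. The sketch below uses made-up rows; it compares calendar days, as the `DAYS(...) - DAYS(...)` expression does, and flags rows where the notification precedes the crash.

```python
import pandas as pd

# Made-up rows for illustration only.
sample = pd.DataFrame({
    'RD_NO': ['A1', 'A2', 'A3'],
    'CRASH_DATE': ['2018-03-01 23:50', '2018-03-05 08:00', '2018-03-10 12:00'],
    'DATE_POLICE_NOTIFIED': ['2018-03-02 00:10', '2018-03-05 08:30', '2018-03-08 09:00'],
})

crash = pd.to_datetime(sample['CRASH_DATE'])
notified = pd.to_datetime(sample['DATE_POLICE_NOTIFIED'])

# Drop the time of day so the difference counts calendar days.
sample['DIFF_DAYS'] = (notified.dt.normalize() - crash.dt.normalize()).dt.days

# A negative difference means the police were notified before the crash date.
suspect = sample[sample['DIFF_DAYS'] < 0]
print(sample[['RD_NO', 'DIFF_DAYS']])
print(suspect['RD_NO'].tolist())
```

Normalizing to midnight matters for the first row: the notification comes only 20 minutes after the crash, but it falls on the next calendar day.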

## POSTED_SPEED_LIMIT

In [None]:
sql = """
  select POSTED_SPEED_LIMIT, count(POSTED_SPEED_LIMIT) TOTAL
  from   traffic_crashes_crashes
  group by POSTED_SPEED_LIMIT
  order by POSTED_SPEED_LIMIT Desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(40)
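
Once the distribution is available, suspicious values such as 99 can be separated from plausible ones with a simple rule. The sketch below is an assumption-heavy illustration: the counts are made up, and the rule (a multiple of 5 between 5 and 70 mph) is a guess at what a valid posted limit looks like, not something taken from the dataset documentation.

```python
import pandas as pd

# Toy distribution standing in for the query result above.
speed_counts = pd.DataFrame({
    'POSTED_SPEED_LIMIT': [99, 70, 35, 30, 33, 0],
    'TOTAL': [5, 2, 4000, 90000, 12, 300],
})

# Assumed plausibility rule (not from the dataset documentation):
# a posted limit is a multiple of 5 between 5 and 70 mph.
MAX_PLAUSIBLE = 70
limit = speed_counts['POSTED_SPEED_LIMIT']
plausible = (limit % 5 == 0) & limit.between(5, MAX_PLAUSIBLE)

suspicious = speed_counts[~plausible]
print(suspicious)
print(suspicious['TOTAL'].sum())
```

Summing `TOTAL` over the flagged rows gives a quick sense of how many records the rule would affect before deciding how to treat them.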

## Look at the distinct values
How relevant are the distinct values?

In [None]:
sql = """
  select  TRAFFIC_CONTROL_DEVICE, count(TRAFFIC_CONTROL_DEVICE) TOTAL
  from    traffic_crashes_crashes
  group by TRAFFIC_CONTROL_DEVICE
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select   DEVICE_CONDITION, count(DEVICE_CONDITION) TOTAL
  from     traffic_crashes_crashes
  group by DEVICE_CONDITION
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select WEATHER_CONDITION, count(WEATHER_CONDITION) TOTAL
  from   traffic_crashes_crashes
  group by WEATHER_CONDITION
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select LIGHTING_CONDITION, count(LIGHTING_CONDITION) TOTAL
  from   traffic_crashes_crashes
  group by LIGHTING_CONDITION
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select FIRST_CRASH_TYPE, count(FIRST_CRASH_TYPE) TOTAL
  from   traffic_crashes_crashes
  group by FIRST_CRASH_TYPE
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select TRAFFICWAY_TYPE, count(TRAFFICWAY_TYPE) TOTAL
  from   traffic_crashes_crashes
  group by TRAFFICWAY_TYPE
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select ALIGNMENT, count(ALIGNMENT) TOTAL
  from   traffic_crashes_crashes
  group by ALIGNMENT
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select ROADWAY_SURFACE_COND, count(ROADWAY_SURFACE_COND) TOTAL
  from   traffic_crashes_crashes
  group by ROADWAY_SURFACE_COND
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select ROAD_DEFECT, count(ROAD_DEFECT) TOTAL
  from   traffic_crashes_crashes
  group by ROAD_DEFECT
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select REPORT_TYPE, count(REPORT_TYPE) TOTAL
  from   traffic_crashes_crashes
  group by REPORT_TYPE
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)

In [None]:
sql = """
  select CRASH_TYPE, count(CRASH_TYPE) TOTAL
  from   traffic_crashes_crashes
  group by CRASH_TYPE
  order by TOTAL desc
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)
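
The cells above repeat the same query with a different column each time. As a sketch, the query text could be generated by a small helper. The names `CATEGORY_COLUMNS` and `value_count_sql` are hypothetical; the allow-list is there because the column name is interpolated directly into the SQL text.

```python
# Columns explored one by one in the cells above.
CATEGORY_COLUMNS = [
    'TRAFFIC_CONTROL_DEVICE', 'DEVICE_CONDITION', 'WEATHER_CONDITION',
    'LIGHTING_CONDITION', 'FIRST_CRASH_TYPE', 'TRAFFICWAY_TYPE',
    'ALIGNMENT', 'ROADWAY_SURFACE_COND', 'ROAD_DEFECT',
    'REPORT_TYPE', 'CRASH_TYPE',
]

def value_count_sql(column, table='traffic_crashes_crashes'):
    """Return the group-by query for one column.

    Only names from CATEGORY_COLUMNS are accepted, because the column
    name is placed directly into the SQL text.
    """
    if column not in CATEGORY_COLUMNS:
        raise ValueError("Unexpected column: " + column)
    return (
        "select {col}, count({col}) TOTAL\n"
        "from   {tbl}\n"
        "group by {col}\n"
        "order by TOTAL desc"
    ).format(col=column, tbl=table)

print(value_count_sql('WEATHER_CONDITION'))
```

In the notebook, each generated string could be passed to `pd.read_sql(..., pconn)` inside a loop over `CATEGORY_COLUMNS`.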

## Count accidents, accidents with injuries, accidents with fatalities

In [None]:
sql = """
  select count(*) all_accidents, 
         sum(INJURIES_TOTAL > 0) accidents_with_injuries, 
         sum(INJURIES_FATAL > 0) accidents_with_fatalities
  from   traffic_crashes_crashes
  where longitude is not null
  and latitude is not null
"""
data_pd = pd.read_sql(sql, pconn)
data_pd.head(20)


## Visualization

In [None]:
import pandas as pd
import numpy as np
# seaborn builds on matplotlib and adds graphical features and new plot types
import seaborn as sns

import matplotlib.pyplot as plt
# matplotlib.patches lets us create colored patches, which we can use for legends in plots
import matplotlib.patches as mpatches

# The inline statement ensures that the plot will show in the cell output. Look at the documentation for more information
%matplotlib inline
# adjust settings
sns.set_style("white")
plt.rcParams['figure.figsize'] = (15, 15)

### Grouping accidents
First by street for 3 categories:
- All accidents
- Accidents with injuries
- Accidents with fatalities

In [None]:
# Plot the top 15 streets by accident count
plt.figure(figsize=(8,5))
sql = """
  select STREET_NAME, count(*) CNT
  from   traffic_crashes_crashes
  group by STREET_NAME
  order by CNT desc
  limit 15
"""
streets = pd.read_sql(sql, pconn)
colors = ['g','0.75','y','k','b','r']
streets.sort_values(by='CNT', ascending=False)['CNT'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Street')
plt.title('Total Number of Collisions by Street', size=15)
plt.yticks(range(0,15),streets['STREET_NAME'])
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sql = """
  select STREET_NAME, count(*) CNT
  from   traffic_crashes_crashes
  where INJURIES_TOTAL > 0
  group by STREET_NAME
  order by CNT desc
  limit 15
"""
streets = pd.read_sql(sql, pconn)
colors = ['g','0.75','y','k','b','r']
streets.sort_values(by='CNT', ascending=False)['CNT'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Street')
plt.title('Total Number of Injury Accidents by Street', size=15)
plt.yticks(range(0,15),streets['STREET_NAME'])
plt.tight_layout()
plt.show()


In [None]:
plt.figure(figsize=(8,5))
sql = """
  select STREET_NAME, count(*) CNT
  from   traffic_crashes_crashes
  where INJURIES_FATAL > 0
  group by STREET_NAME
  order by CNT desc
  limit 15
"""
streets = pd.read_sql(sql, pconn)
colors = ['g','0.75','y','k','b','r']
streets.sort_values(by='CNT', ascending=False)['CNT'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('Street')
plt.title('Total Number of Fatal Accidents by Street', size=15)
plt.yticks(range(0,15),streets['STREET_NAME'])
plt.tight_layout()
plt.show()

### Accidents by month, day of the week
See example code at: http://benalexkeen.com/bar-charts-in-matplotlib/

In [None]:
plt.figure(figsize=(8,5))
sql = """
  select CRASH_MONTH, count(*) CNT
  from   traffic_crashes_crashes
  group by CRASH_MONTH
  order by CRASH_MONTH desc
"""
byMonth = pd.read_sql(sql, pconn)
colors = ['g','0.75','y','k','b','r']
byMonth.sort_values(by='CRASH_MONTH', ascending=False)['CNT'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('month')
plt.title('Total Number of Collisions by month', size=15)
plt.yticks(range(0,12),byMonth['CRASH_MONTH'])
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8,5))
sql = """
  select CRASH_DAY_OF_WEEK, count(*) CNT
  from   traffic_crashes_crashes
  group by CRASH_DAY_OF_WEEK
  order by CRASH_DAY_OF_WEEK desc
"""
byDay = pd.read_sql(sql, pconn)


# byDay = collisions_df.groupBy('CRASH_DAY_OF_WEEK').count().sort('CRASH_DAY_OF_WEEK',ascending=False).toPandas()
colors = ['g','0.75','y','k','b','r']
byDay.sort_values(by='CRASH_DAY_OF_WEEK', ascending=False)['CNT'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('day')
plt.title('Total Number of Collisions by Day of the week', size=15)
plt.yticks(range(0,7),byDay['CRASH_DAY_OF_WEEK'])
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(8,5))

sql = """
  select CRASH_HOUR, count(*) CNT
  from   traffic_crashes_crashes
  group by CRASH_HOUR
  order by CRASH_HOUR desc
"""
byHour = pd.read_sql(sql, pconn)


# byHour = collisions_df.groupBy('CRASH_HOUR').count().sort('CRASH_HOUR',ascending=False).toPandas()
colors = ['g','0.75','y','k','b','r']
byHour.sort_values(by='CRASH_HOUR', ascending=False)['CNT'].plot.barh(color=colors)
plt.xlabel('Collisions')
plt.ylabel('hour')
plt.title('Total Number of Collisions by hour of the day', size=15)
plt.yticks(range(0,24),byHour['CRASH_HOUR'])
plt.tight_layout()
plt.show()
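
The day-of-week chart shows numeric codes on its axis. A small lookup can turn them into names. The sketch below assumes that 1 means Sunday and 7 means Saturday; that numbering is an assumption here and should be checked against the dataset documentation. The helper `day_name` is hypothetical.

```python
# Assumption: 1 = Sunday ... 7 = Saturday. Verify against the dataset
# documentation before relying on this mapping.
DAY_NAMES = ['Sunday', 'Monday', 'Tuesday', 'Wednesday',
             'Thursday', 'Friday', 'Saturday']

def day_name(day_number):
    """Translate a 1-7 day number into a day name."""
    if not 1 <= day_number <= 7:
        raise ValueError("day number must be between 1 and 7")
    return DAY_NAMES[day_number - 1]

labels = [day_name(n) for n in range(1, 8)]
print(labels)
```

In the plotting cell, the tick labels could then be built with `byDay['CRASH_DAY_OF_WEEK'].map(day_name)`.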

## Prep Data for plotting

In [None]:
sql = """
  select longitude, latitude
  from   traffic_crashes_crashes
  where latitude != 0
"""
collisions_pd = pd.read_sql(sql, pconn)

sql = """
  select longitude, latitude
  from   traffic_crashes_crashes
  where latitude != 0
  and INJURIES_FATAL > 0
"""
killed_pd = pd.read_sql(sql, pconn)

sql = """
  select longitude, latitude
  from   traffic_crashes_crashes
  where latitude != 0
  and INJURIES_TOTAL > 0
  and INJURIES_FATAL = 0
"""
injured_pd = pd.read_sql(sql, pconn)

sql = """
  select longitude, latitude
  from   traffic_crashes_crashes
  where latitude != 0
  and INJURIES_TOTAL = 0
  and INJURIES_FATAL = 0
"""
nothing_pd = pd.read_sql(sql, pconn)
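
The four queries above differ only in their severity filter, so an alternative is to fetch the rows once and split them in pandas. The sketch below shows the split on made-up rows; the `*_sample` names are illustrative and deliberately differ from the notebook's variables.

```python
import pandas as pd

# Toy rows standing in for a single query that returns all four columns.
crashes = pd.DataFrame({
    'LONGITUDE': [-87.70, -87.65, -87.62, -87.60],
    'LATITUDE': [41.80, 41.85, 41.90, 41.95],
    'INJURIES_TOTAL': [0, 2, 1, 0],
    'INJURIES_FATAL': [0, 0, 1, 0],
})

# Boolean masks mirroring the where clauses of the queries above.
fatal = crashes['INJURIES_FATAL'] > 0
injured = (crashes['INJURIES_TOTAL'] > 0) & (crashes['INJURIES_FATAL'] == 0)
unharmed = (crashes['INJURIES_TOTAL'] == 0) & (crashes['INJURIES_FATAL'] == 0)

killed_sample = crashes[fatal]
injured_sample = crashes[injured]
nothing_sample = crashes[unharmed]
print(len(killed_sample), len(injured_sample), len(nothing_sample))
```

The trade-off is one larger transfer from the database instead of four smaller ones.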


## Plot the accidents using longitude/latitude
This is not a map but a graphical representation of the accidents plotted by longitude and latitude.

We plug the longitude and latitude limits obtained earlier into the xlim/ylim values.

In [None]:
#create scatterplots
plt.figure(figsize=(15,10))
plt.scatter(collisions_pd.LONGITUDE, collisions_pd.LATITUDE, alpha=0.05, s=4, color='darkseagreen')

#adjust more settings
plt.title('Motor Vehicle Collisions in Chicago', size=25)
plt.xlim((-87.92,-87.52))
plt.ylim((41.64,42.03))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)

plt.show()
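
The axis limits in the plot are hard-coded. As a sketch, they could instead be derived from the coordinates themselves. The helper `axis_limits`, its `padding` default, and the sample coordinates are illustrative assumptions.

```python
def axis_limits(values, padding=0.01):
    """Return (low, high) bounds that enclose the values with some padding."""
    # Drop None and NaN (NaN is the only value not equal to itself).
    clean = [v for v in values if v is not None and v == v]
    if not clean:
        raise ValueError("no valid coordinates")
    return (min(clean) - padding, max(clean) + padding)

# Made-up coordinates, including missing values.
longitudes = [-87.90, -87.75, None, -87.53]
latitudes = [41.65, 41.88, 42.02, float('nan')]
xlim = axis_limits(longitudes)
ylim = axis_limits(latitudes)
print(xlim, ylim)
```

The results could be passed to `plt.xlim(xlim)` and `plt.ylim(ylim)`. Outliers such as zero coordinates would stretch the bounds, so they need to be filtered out first, as the queries above do for latitude.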

## Enhance the scatter plot to identify accident severity
We use the pandas DataFrames we created earlier to plot each severity level in a different color.

In [None]:
#adjust settings
plt.figure(figsize=(15,10))

#create scatterplots
plt.scatter(nothing_pd.LONGITUDE, nothing_pd.LATITUDE, alpha=0.04, s=1, color='blue')
plt.scatter(injured_pd.LONGITUDE, injured_pd.LATITUDE, alpha=0.1, s=1, color='yellow')
plt.scatter(killed_pd.LONGITUDE, killed_pd.LATITUDE, color='red', s=5)

#create legend
blue_patch = mpatches.Patch( label='car body damage', alpha=0.2, color='blue')
yellow_patch = mpatches.Patch(color='yellow', label='personal injury', alpha=0.5)
red_patch = mpatches.Patch(color='red', label='fatal accidents')
plt.legend([blue_patch, yellow_patch, red_patch],('car body damage', 'personal injury', 'fatal accidents'), 
           loc='upper left', prop={'size':20})

#adjust more settings
plt.title('Severity of Motor Vehicle Collisions in Chicago', size=20)
plt.xlim((-87.92,-87.52))
plt.ylim((41.64,42.03))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)
plt.savefig('anothertry.png')

plt.show()

## Visualize the 10 streets with the most collisions

In [None]:
sql = """
  select STREET_NAME, longitude, latitude
  from   traffic_crashes_crashes
  where  latitude != 0    
"""
data1 = pd.read_sql(sql, pconn)

sql = """
  select STREET_NAME, count(*)
  from   traffic_crashes_crashes
  where  latitude != 0
  group by STREET_NAME
  order by count(*) desc
  limit 10
"""
street_names = pd.read_sql(sql, pconn)['STREET_NAME']


# One color per street, in the same order as street_names
colors = ['red', 'blue', 'magenta', 'orange', 'yellow',
          'purple', 'black', 'chartreuse', 'brown', 'darkgreen']

#create scatterplots
plt.figure(figsize=(15,10))
plt.scatter(data1.LONGITUDE, data1.LATITUDE, s=1, color='darkseagreen')

#create legend entries while plotting each street
patches = []
for name, color in zip(street_names, colors):
    street_collisions = data1[data1['STREET_NAME'] == name]
    plt.scatter(street_collisions.LONGITUDE, street_collisions.LATITUDE, s=2, color=color)
    patches.append(mpatches.Patch(color=color, label=name))

plt.legend(handles=patches, loc='upper left', prop={'size':12})

#adjust more settings
plt.title('Vehicle Collisions in Chicago', size=25)
plt.xlim((-87.92,-87.52))
plt.ylim((41.64,42.03))
plt.xlabel('Longitude',size=20)
plt.ylabel('Latitude',size=20)
plt.show()

## Using K-Means to find hot spots
We are using K-Means to find the center of groupings of accidents.

The process is as follows:

- We extract the longitude and latitude of all accidents
- We create a model (for, arbitrarily, 10 clusters)
- We extract the centers and convert them to a pandas DataFrame
- We display the result on a map using PixieDust

In [None]:
# Create dataframes for all accidents, accidents with injuries and accidents with fatalities
sql = """
  select INJURIES_TOTAL, INJURIES_FATAL, longitude, latitude
  from   traffic_crashes_crashes
  where  latitude is not null 
"""
data_pd = pd.read_sql(sql, pconn)

data_injuries_pd = data_pd[data_pd['INJURIES_TOTAL'] > 0]
data_fatal_pd = data_pd[data_pd['INJURIES_FATAL'] > 0]

In [None]:
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.cluster import KMeans
import sklearn.metrics as sm

%matplotlib inline

In [None]:
# K Means Cluster
k=10
model = KMeans(n_clusters=k)
kmeans = model.fit(data_pd[['LONGITUDE','LATITUDE']])
vals=[0] * k
for i in kmeans.labels_ :
    vals[i] = vals[i] + 1

In [None]:
# Create a pandas DataFrame for display
d = {'longitude': kmeans.cluster_centers_[:,0], 'latitude': kmeans.cluster_centers_[:,1], 'total' : vals}
k_pd = pd.DataFrame(data=d)

In [None]:
display(k_pd)

## K-Means for accidents with fatalities

In [None]:
# K Means Cluster
k=10
model = KMeans(n_clusters=k)
kmeans = model.fit(data_fatal_pd[['LONGITUDE','LATITUDE']])
vals=[0] * k
for i in kmeans.labels_ :
    vals[i] = vals[i] + 1

In [None]:
# Create a pandas DataFrame for display
d2 = {'longitude': kmeans.cluster_centers_[:,0], 'latitude': kmeans.cluster_centers_[:,1], 'total' : vals}
k2_pd = pd.DataFrame(data=d2)

In [None]:
display(k2_pd)
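
The K-Means cells above count cluster members with a loop and use the default random initialization, so cluster numbering and centers can change between runs. The sketch below, on made-up points, shows two optional variations: fixing `random_state` for repeatable results and counting members with `numpy.bincount`. The toy coordinates and `k = 2` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy groups of (longitude, latitude) points.
points = np.array([
    [-87.70, 41.80], [-87.71, 41.81], [-87.69, 41.79],
    [-87.60, 41.95], [-87.61, 41.96], [-87.59, 41.94],
])

k = 2
# random_state fixes the initialization so reruns give the same result.
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)

# Number of points assigned to each cluster, in cluster-index order.
counts = np.bincount(model.labels_, minlength=k)
centers = model.cluster_centers_
print(counts)
print(centers)
```

`counts` plays the role of `vals` in the cells above, and `centers` has one row per cluster with longitude in the first column and latitude in the second.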