# Crime in Toronto's Neighbourhoods


This is an ongoing project that uses data from Toronto's [Open Data](http://www1.toronto.ca/wps/portal/contentonly?vgnextoid=9e56e03bb8d1e310VgnVCM10000071d60f89RCRD), which consists of over 200 datasets provided by the City of Toronto. These datasets are organized into 15 different categories.

Toronto is considered a safe city in comparison to other big cities. In a 2015 article in The Economist, Toronto was ranked as the safest major city in North America and the eighth safest major city in the world, as cited in [Wikipedia](https://en.wikipedia.org/wiki/Crime_in_Toronto).

Despite being relatively safe, Toronto has its fair share of crime. The city consists of 140 officially recognized neighbourhoods, along with several smaller, unofficial ones. As in any big city, some neighbourhoods are considered less safe than others. Higher crime is commonly attributed to several factors: lower income, lower literacy and access to education, and unemployment leading to illegal drug activity. An analysis of crime and neighbourhood data within Toronto can show how many of these assumptions hold, and to what degree. It might also reveal patterns, trends or relationships between some independent variables and our dependent variable (e.g. major crimes) that would not otherwise be obvious.

For this project, I will focus only on crime in Toronto’s 140 official neighbourhoods, and explore the following topics in no particular order. 

1) I will provide a summarized visualization of all the major crimes in Toronto. I might have to do some type of normalization (e.g. divide the number of crimes in a neighbourhood by the population of that neighbourhood).

2) I will compare the 3-5 most crime-prone neighbourhoods against the 3-5 least crime-prone neighbourhoods.
 
3) What is the difference in these neighbourhoods as regards median household income and education? Which is the most prominent age group of people? Does this in any way affect crime?

4) My dataset provides me with data only for two years – 2008 and 2011. I plan to compare crime data for both years. Has anything changed from 2008 to 2011?

5) What is the neighbourhood with the most change?

6) What could be the reasons for this change? Can the data give us an answer?

7) Finally, I will try to come up with something predictive (using machine learning). This will be more speculative, because two years of data do not allow for any real validation. It could still be a helpful tool in the current climate of gentrification in Toronto: over the last 4-5 years, the construction of large-to-midsize condominium buildings has increased significantly in various neighbourhoods, undoubtedly changing the social demographics of the city.

#### Import all the libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#### Seaborn settings

In [2]:
# seaborn setting
sns.set_palette("deep", desat=.6)
sns.set_context(rc={"figure.figsize": (8,4)})

#### Read the data files as pandas dataframes

In [3]:
c = pd.read_csv('TorontoCrime2011.csv')
d = pd.read_csv('TorontoDemographics2011.csv')

# Also read in the 2008 data files
c2 = pd.read_csv('TorontoCrime2008.csv')
d2 = pd.read_csv('TorontoDemographics2008.csv')

#### Let's take a quick look at the top few rows of the 2011 and 2008 Toronto Crime dataframes.

In [4]:
#2011 dataframe
c.head()

Unnamed: 0,Neighbourhood,Neighbourhood Id,Arsons,Assaults,Break & Enters,Drug Arrests,Fire Medical Calls,Fire Vehicle Incidents,Fires & Fire Alarms,Hazardous Incidents,Murders,Robberies,Sexual Assaults,Thefts,Total Major Crime Incidents,Vehicle Thefts
0,West Humber-Clairville,1,4,390,175,62,1321,502,705,210,0,82,68,54,1119,288
1,Mount Olive-Silverstone-Jamestown,2,3,316,61,90,1016,59,361,176,1,78,75,7,690,62
2,Thistletown-Beaumond Heights,3,0,85,36,16,323,48,90,34,0,17,24,2,192,12
3,Rexdale-Kipling,4,0,59,32,15,305,34,94,55,1,16,20,3,164,18
4,Elms-Old Rexdale,5,1,77,25,14,321,71,107,43,0,23,5,19,185,22


In [5]:
#2008 dataframe
c2.head()
#c2.shape  # shape is an attribute, not a method

Unnamed: 0,Neighbourhood,Neighbourhood Id,Ambulance Calls,Ambulance Referrals,Arsons,Assaults,Break & Enters,Drug Arrests,Fire Vehicle Incidents,Firearms Incidents,Fires & Fire Alarms,Hazardous Incidents,Murders,Robberies,Sexual Assaults,TCHC Safety Incidents,Thefts,Vehicle Thefts
0,West Humber-Clairville,1,3613,10,4,272,193,88,674,3,135,269,1,85,21,62,42,341
1,Mount Olive-Silverstone-Jamestown,2,2229,5,0,269,88,145,52,11,70,163,0,80,23,426,6,113
2,Thistletown-Beaumond Heights,3,793,5,0,66,30,27,20,0,26,48,0,23,5,89,2,16
3,Rexdale-Kipling,4,664,5,0,49,28,17,14,0,26,64,0,16,6,48,0,25
4,Elms-Old Rexdale,5,836,3,2,49,22,8,45,1,23,52,0,26,3,138,2,17


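We notice a few differences between the crime datasets for 2008 and 2011. The 2008 data has columns for ambulance calls, ambulance referrals, firearms incidents and TCHC safety incidents that are missing in 2011, while it lacks the 2011 columns for fire medical calls and total major crimes. Now, let's take a look at the top few rows of the 2011 and 2008 Toronto Demographics dataframes.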

In [6]:
# 2011 dataframe
d.head()

Unnamed: 0,Neighbourhood,Neighbourhood Id,Total Area,Total Population,Pop - Males,Pop - Females,Pop 0 - 4 years,Pop 5 - 9 years,Pop 10 - 14 years,Pop 15 -19 years,...,Language - Chinese,Language - Italian,Language - Korean,Language - Persian (Farsi),Language - Portuguese,Language - Russian,Language - Spanish,Language - Tagalog,Language - Tamil,Language - Urdu
0,West Humber-Clairville,1,30.09,34100,17095,17000,1865,1950,2155,2550,...,475,925,95,160,205,15,1100,850,715,715
1,Mount Olive-Silverstone-Jamestown,2,4.6,32790,16015,16765,2575,2535,2555,2620,...,275,750,60,350,115,50,820,345,1420,1075
2,Thistletown-Beaumond Heights,3,3.4,10140,4920,5225,575,580,670,675,...,95,705,35,115,105,15,570,130,120,300
3,Rexdale-Kipling,4,2.5,10485,5035,5455,495,520,570,665,...,95,475,30,95,145,30,700,180,70,215
4,Elms-Old Rexdale,5,2.9,9550,4615,4935,670,720,720,725,...,90,510,55,285,80,30,670,195,60,140


In [7]:
# 2008 dataframe
# We notice that the 2008 dataset has more columns
d2.head()

Unnamed: 0,Neighbourhood,Neighbourhood Id,Total Area,Total Population,Pop - Males,Pop - Females,Pop 0 - 4 years,Pop 5 - 9 years,Pop 10 - 14 years,Pop 15 -19 years,...,Home Repairs Needed,Tenant Average Rent,Low Income Families,Low Income Singles,Low Income Children,Family Income Category,Average Family Income,Household Income Category,Pre-Tax Household Income,After-Tax Household Income
0,West Humber-Clairville,1,30.09,32265,16295,15960,2005,2135,2325,2180,...,365,850,7720,725,643,7720,67240,8960,63415,63977
1,Mount Olive-Silverstone-Jamestown,2,4.6,32130,15900,16230,2680,2680,2685,2285,...,980,875,7715,1177,1206,7720,52745,9265,48145,49601
2,Thistletown-Beaumond Heights,3,3.4,9925,4900,5035,615,625,645,630,...,185,875,2520,305,161,2520,71300,3150,55030,54910
3,Rexdale-Kipling,4,2.5,10725,5205,5525,580,645,665,640,...,300,835,2780,653,135,2775,65215,3880,52430,53779
4,Elms-Old Rexdale,5,2.9,9440,4615,4820,725,700,745,655,...,320,895,2560,255,328,2555,56515,3130,53780,55054


**We can do a quick check to see if there are any missing values in the data**

In [8]:
# Drop any missing values, and check to see if any rows have been dropped
# print() with a single argument behaves the same in Python 2 and Python 3
if c.shape == c.dropna().shape:
    print("2011 Crime data has no missing values. Dataframe dimensions are" + str(c.shape))
if d.shape == d.dropna().shape:
    print("2011 Demographics data has no missing values. Dataframe dimensions are" + str(d.shape))
if c2.shape == c2.dropna().shape:
    print("2008 Crime data has no missing values. Dataframe dimensions are" + str(c2.shape))
if d2.shape == d2.dropna().shape:
    print("2008 Demographics data has no missing values. Dataframe dimensions are" + str(d2.shape))

2011 Crime data has no missing values. Dataframe dimensions are(140, 16)
2011 Demographics data has no missing values. Dataframe dimensions are(140, 39)
2008 Crime data has no missing values. Dataframe dimensions are(140, 18)
2008 Demographics data has no missing values. Dataframe dimensions are(140, 85)
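
The shape comparison above only reports whether any row contains a missing value. A minimal sketch of a more detailed check is below: it counts missing values per column with `isnull().sum()`. The small dataframe `demo` is a hypothetical stand-in (not the project data), with one value deliberately set to missing.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a crime dataframe; one value is deliberately missing
demo = pd.DataFrame({
    'Neighbourhood': ['A', 'B', 'C'],
    'Assaults': [390, np.nan, 85],
    'Robberies': [82, 78, 17],
})

# Number of missing values in each column
missing_per_column = demo.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = demo[demo.isnull().any(axis=1)]
print(rows_with_missing)
```

Applied to the real dataframes, `c.isnull().sum()` would be expected to show zero for every column, given the output above.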


**The rows for all these dataframes are listed by neighbourhood. The 140 rows indicate all official neighbourhoods of Toronto. We noticed a difference between the crime datasets in 2011 and 2008, which is reflected in the dimensions and in the column names. We can take a quick look at all the columns to get an idea.**

In [9]:
c.columns #2011

Index([u'Neighbourhood', u'Neighbourhood Id', u'Arsons', u'Assaults',
       u'Break & Enters', u'Drug Arrests', u'Fire Medical Calls',
       u'Fire Vehicle Incidents', u'Fires & Fire Alarms',
       u'Hazardous Incidents', u'Murders', u'Robberies', u'Sexual Assaults',
       u'Thefts', u'Total Major Crime Incidents', u'Vehicle Thefts'],
      dtype='object')

In [10]:
c2.columns #2008

Index([u'Neighbourhood', u'Neighbourhood Id', u'Ambulance Calls',
       u'Ambulance Referrals', u'Arsons', u'Assaults', u'Break & Enters',
       u'Drug Arrests', u'Fire Vehicle Incidents', u'Firearms Incidents',
       u'Fires & Fire Alarms', u'Hazardous Incidents', u'Murders',
       u'Robberies', u'Sexual Assaults', u'TCHC Safety Incidents', u'Thefts',
       u'Vehicle Thefts'],
      dtype='object')
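
Rather than comparing the two listings by eye, the difference can be computed with set operations. The sketch below copies the column names from the two outputs above into plain lists; `cols_2011`, `cols_2008`, `only_2011` and `only_2008` are illustrative names.

```python
# Column names copied from the two outputs above
cols_2011 = ['Neighbourhood', 'Neighbourhood Id', 'Arsons', 'Assaults',
             'Break & Enters', 'Drug Arrests', 'Fire Medical Calls',
             'Fire Vehicle Incidents', 'Fires & Fire Alarms',
             'Hazardous Incidents', 'Murders', 'Robberies', 'Sexual Assaults',
             'Thefts', 'Total Major Crime Incidents', 'Vehicle Thefts']
cols_2008 = ['Neighbourhood', 'Neighbourhood Id', 'Ambulance Calls',
             'Ambulance Referrals', 'Arsons', 'Assaults', 'Break & Enters',
             'Drug Arrests', 'Fire Vehicle Incidents', 'Firearms Incidents',
             'Fires & Fire Alarms', 'Hazardous Incidents', 'Murders',
             'Robberies', 'Sexual Assaults', 'TCHC Safety Incidents', 'Thefts',
             'Vehicle Thefts']

# Columns that appear in one year but not the other
only_2011 = sorted(set(cols_2011) - set(cols_2008))
only_2008 = sorted(set(cols_2008) - set(cols_2011))
print("Only in 2011:", only_2011)
print("Only in 2008:", only_2008)
```

With the dataframes loaded, `set(c.columns) - set(c2.columns)` should give the same result without retyping the names.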

**To make data access and manipulation easier, we can change the column names to shorter, manageable ones.**

In [11]:
# Rename columns of 2011 crime dataframe
c.rename(columns={'Neighbourhood':'N','Neighbourhood Id':'NId','Arsons':'Ars','Assaults':'Ass',
       'Break & Enters':'BE','Drug Arrests':'DA','Fire Medical Calls':'FMC',
       'Fire Vehicle Incidents':'FVI','Fires & Fire Alarms':'FFA',
       'Hazardous Incidents':'HI','Murders':'M','Robberies':'R','Sexual Assaults':'SA',
       'Thefts':'T','Total Major Crime Incidents':'TMCI','Vehicle Thefts':'VT'}, inplace=True)
c.tail()

Unnamed: 0,N,NId,Ars,Ass,BE,DA,FMC,FVI,FFA,HI,M,R,SA,T,TMCI,VT
135,West Hill,136,3,387,102,87,1145,78,338,142,1,71,52,3,749,46
136,Woburn,137,2,412,128,77,1469,219,504,223,3,107,29,7,808,45
137,Eglinton East,138,0,239,88,48,720,76,223,111,1,66,17,10,492,23
138,Scarborough Village,139,1,226,93,31,652,35,180,94,1,62,31,3,474,27
139,Guildwood,140,0,44,32,9,284,24,48,48,0,14,7,2,113,5


In [12]:
# Rename columns of 2008 crime dataframe. All crime columns get the suffix "2" to tell them apart from 2011
c2.rename(columns={'Neighbourhood':'N','Neighbourhood Id':'NId','Ambulance Calls':'AC2',
       'Ambulance Referrals':'AR2','Arsons':'Ars2','Assaults':'Ass2','Break & Enters':'BE2',
       'Drug Arrests':'DA2','Fire Vehicle Incidents':'FVI2','Firearms Incidents':'FI2',
       'Fires & Fire Alarms':'FFA2','Hazardous Incidents':'HI2','Murders':'M2',
       'Robberies':'R2','Sexual Assaults':'SA2','TCHC Safety Incidents':'TCHCSI2','Thefts':'T2',
       'Vehicle Thefts':'VT2'},inplace=True)
c2.tail()

Unnamed: 0,N,NId,AC2,AR2,Ars2,Ass2,BE2,DA2,FVI2,FI2,FFA2,HI2,M2,R2,SA2,TCHCSI2,T2,VT2
135,West Hill,136,2323,10,3,357,90,179,74,7,111,181,0,59,28,721,9,70
136,Woburn,137,3607,29,2,325,129,72,254,9,118,196,1,78,17,373,15,152
137,Eglinton East,138,1500,8,1,171,83,105,92,5,67,104,2,45,10,285,3,92
138,Scarborough Village,139,1364,10,1,170,52,74,37,7,62,95,3,35,8,276,3,57
139,Guildwood,140,688,0,0,50,30,20,17,0,33,51,0,12,2,0,0,12


**Our focus for this data story will mainly be on major crime incidents.** 

**Eight categories of crimes fall under major crime - Assaults, Break & Enters, Drug Arrests, Murders, Robberies, Sexual Assaults, Thefts, and Vehicle Thefts.**

**We notice that the 2008 crime dataframe does not have a column for total major crime incidents. So, let's add one.**

In [13]:
# We can add a column called "Total Major Crime Incidents" in the 2008 dataframe to facilitate comparisons
c2['TMCI2'] = c2['Ass2'] + c2['BE2'] + c2['DA2'] + c2['M2'] + c2['R2'] + c2['SA2'] + c2['T2'] + c2['VT2'] 
c2.tail()

Unnamed: 0,N,NId,AC2,AR2,Ars2,Ass2,BE2,DA2,FVI2,FI2,FFA2,HI2,M2,R2,SA2,TCHCSI2,T2,VT2,TMCI2
135,West Hill,136,2323,10,3,357,90,179,74,7,111,181,0,59,28,721,9,70,792
136,Woburn,137,3607,29,2,325,129,72,254,9,118,196,1,78,17,373,15,152,789
137,Eglinton East,138,1500,8,1,171,83,105,92,5,67,104,2,45,10,285,3,92,511
138,Scarborough Village,139,1364,10,1,170,52,74,37,7,62,95,3,35,8,276,3,57,402
139,Guildwood,140,688,0,0,50,30,20,17,0,33,51,0,12,2,0,0,12,126
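
As a sanity check on this definition, the same eight categories can be summed for 2011, where a total column already exists, and compared against it. The sketch below uses only the five 2011 rows printed by `c.head()` earlier, with the original column names; `check`, `major` and `computed_total` are illustrative names.

```python
import pandas as pd

# The five 2011 rows shown by c.head() earlier, major crime columns only
check = pd.DataFrame({
    'Assaults': [390, 316, 85, 59, 77],
    'Break & Enters': [175, 61, 36, 32, 25],
    'Drug Arrests': [62, 90, 16, 15, 14],
    'Murders': [0, 1, 0, 1, 0],
    'Robberies': [82, 78, 17, 16, 23],
    'Sexual Assaults': [68, 75, 24, 20, 5],
    'Thefts': [54, 7, 2, 3, 19],
    'Vehicle Thefts': [288, 62, 12, 18, 22],
    'Total Major Crime Incidents': [1119, 690, 192, 164, 185],
})

# The eight categories treated as major crime
major = ['Assaults', 'Break & Enters', 'Drug Arrests', 'Murders',
         'Robberies', 'Sexual Assaults', 'Thefts', 'Vehicle Thefts']

# Row-wise sum of the eight categories, compared with the published total
computed_total = check[major].sum(axis=1)
matches = bool((computed_total == check['Total Major Crime Incidents']).all())
print(matches)
```

For these five rows the sums agree with the published totals, which supports using the same eight columns to build the 2008 total.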


**A neighbourhood might have more major crimes simply because more people live in it, which makes population a confounding variable. So, to compare crime data across neighbourhoods and across two years (2008 and 2011 in our case), it makes sense to normalize the data by dividing each neighbourhood's crime counts by its population. Let's use the demographic data for population.**

In [14]:
d.columns #2011

Index([u'Neighbourhood', u'Neighbourhood Id', u'Total Area',
       u'Total Population', u'Pop - Males', u'Pop - Females',
       u'Pop 0 - 4 years', u'Pop 5 - 9 years', u'Pop 10 - 14 years',
       u'Pop 15 -19 years', u'Pop 20 - 24 years', u'Pop  25 - 29 years',
       u'Pop 30 - 34 years', u'Pop 35 - 39 years', u'Pop 40 - 44 years',
       u'Pop 45 - 49 years', u'Pop 50 - 54 years', u'Pop 55 - 59 years',
       u'Pop 60 - 64 years', u'Pop 65 - 69 years', u'Pop 70 - 74 years',
       u'Pop 75 - 79 years', u'Pop 80 - 84 years', u'Pop 85 years and over',
       u'Seniors 55 and over', u'Seniors 65 and over', u'Child 0-14',
       u'Youth 15-24', u'Home Language Category', u'   Language - Chinese',
       u'   Language - Italian', u'   Language - Korean',
       u'   Language - Persian (Farsi)', u'   Language - Portuguese',
       u'   Language - Russian', u'   Language - Spanish',
       u'   Language - Tagalog', u'   Language - Tamil',
       u'   Language - Urdu'],
      dtype='objec

In [15]:
d2.columns #2008

Index([u'Neighbourhood', u'Neighbourhood Id', u'Total Area',
       u'Total Population', u'Pop - Males', u'Pop - Females',
       u'Pop 0 - 4 years', u'Pop 5 - 9 years', u'Pop 10 - 14 years',
       u'Pop 15 -19 years', u'Pop 20 - 24 years', u'Pop  25 - 29 years',
       u'Pop 30 - 34 years', u'Pop 35 - 39 years', u'Pop 40 - 44 years',
       u'Pop 45 - 49 years', u'Pop 50 - 54 years', u'Pop 55 - 59 years',
       u'Pop 60 - 64 years', u'Pop 65 - 69 years', u'Pop 70 - 74 years',
       u'Pop 75 - 79 years', u'Pop 80 - 84 years', u'Pop 85 years and over',
       u'Pop 6-12 years', u'Visible Minority Category', u'   Chinese',
       u'   South Asian', u'   Black', u'   Filipino', u'   Latin American',
       u'   Southeast Asian', u'   Arab', u'   West Asian', u'   Korean',
       u'   Japanese', u'   Other Visible Minority',
       u'   Multiple Visible Minority', u'   Not a Visible Minority',
       u'Aboriginal', u'Home Language Category', u'   Language - Chinese',
       u'   Languag

**We only need the total population for now, from each neighbourhood, to normalize the crime data.**

In [16]:
d.rename(columns={'Neighbourhood':'N','Neighbourhood Id':'NId','Total Population':'TPop'},inplace=True)
d2.rename(columns={'Neighbourhood':'N','Neighbourhood Id':'NId','Total Population':'TPop2'},inplace=True)

**Now we can compute the number of crimes in each neighbourhood per 1000 people for 2011 and 2008**

In [17]:
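# Work on copies: plain assignment (c_norm = c) would only create a second name
# for the same dataframe, so the raw counts in c and c2 would be overwritten
c_norm = c.copy()
c2_norm = c2.copy()
# Divide all the crime data columns by the total neighbourhood population and multiply by 1000
c_norm[c.columns[2:]] = c.iloc[:,2:].div(d['TPop'], axis=0) * 1000
c2_norm[c2.columns[2:]] = c2.iloc[:,2:].div(d2['TPop2'], axis=0) * 1000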
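
The division above pairs crime rows with population rows through the default integer index, so it assumes both dataframes list the neighbourhoods in the same order. A minimal sketch of a variant that aligns on the neighbourhood id explicitly is below. It uses three of the 2011 rows shown earlier; `crime`, `pop` and `rate_per_1000` are illustrative names, and the population rows are deliberately shuffled to show the alignment.

```python
import pandas as pd

# Raw 2011 counts for three neighbourhoods, indexed by neighbourhood id
crime = pd.DataFrame({
    'NId': [1, 2, 3],
    'Ass': [390, 316, 85],
    'TMCI': [1119, 690, 192],
}).set_index('NId')

# Population for the same neighbourhoods, deliberately in a different order
pop = pd.DataFrame({
    'NId': [3, 1, 2],
    'TPop': [10140, 34100, 32790],
}).set_index('NId')

# div(..., axis=0) matches rows by index label (NId), not by position;
# the result is a new dataframe, so the raw counts in crime are unchanged
rate_per_1000 = crime.div(pop['TPop'], axis=0) * 1000
print(rate_per_1000.round(3))
```

Indexing both dataframes by `NId` costs one extra line per dataframe, and the result no longer depends on the row order of the source files.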

In [18]:
c_norm.head() #2011 crime data

Unnamed: 0,N,NId,Ars,Ass,BE,DA,FMC,FVI,FFA,HI,M,R,SA,T,TMCI,VT
0,West Humber-Clairville,1,0.117302,11.43695,5.131965,1.818182,38.739003,14.721408,20.674487,6.158358,0.0,2.404692,1.994135,1.583578,32.815249,8.445748
1,Mount Olive-Silverstone-Jamestown,2,0.091491,9.637084,1.860323,2.744739,30.985056,1.799329,11.009454,5.36749,0.030497,2.378774,2.287283,0.21348,21.043001,1.89082
2,Thistletown-Beaumond Heights,3,0.0,8.382643,3.550296,1.577909,31.854043,4.733728,8.87574,3.353057,0.0,1.676529,2.366864,0.197239,18.934911,1.183432
3,Rexdale-Kipling,4,0.0,5.627086,3.051979,1.430615,29.089175,3.242728,8.965188,5.245589,0.095374,1.52599,1.907487,0.286123,15.641392,1.716738
4,Elms-Old Rexdale,5,0.104712,8.062827,2.617801,1.465969,33.612565,7.434555,11.204188,4.502618,0.0,2.408377,0.52356,1.989529,19.371728,2.303665


In [20]:
c2_norm.head() #2008 crime data

Unnamed: 0,N,NId,AC2,AR2,Ars2,Ass2,BE2,DA2,FVI2,FI2,FFA2,HI2,M2,R2,SA2,TCHCSI2,T2,VT2,TMCI2
0,West Humber-Clairville,1,111.978925,0.309933,0.123973,8.430188,5.981714,2.727414,20.889509,0.09298,4.1841,8.337208,0.030993,2.634434,0.65086,1.921587,1.30172,10.568728,32.32605
1,Mount Olive-Silverstone-Jamestown,2,69.374416,0.155618,0.0,8.372238,2.738873,4.512916,1.618425,0.342359,2.178649,5.07314,0.0,2.489885,0.715842,13.258637,0.186741,3.516962,22.533458
2,Thistletown-Beaumond Heights,3,79.899244,0.503778,0.0,6.649874,3.02267,2.720403,2.015113,0.0,2.619647,4.836272,0.0,2.31738,0.503778,8.967254,0.201511,1.612091,17.027708
3,Rexdale-Kipling,4,61.911422,0.4662,0.0,4.568765,2.610723,1.585082,1.305361,0.0,2.424242,5.967366,0.0,1.491841,0.559441,4.475524,0.0,2.331002,13.146853
4,Elms-Old Rexdale,5,88.559322,0.317797,0.211864,5.190678,2.330508,0.847458,4.766949,0.105932,2.436441,5.508475,0.0,2.754237,0.317797,14.618644,0.211864,1.800847,13.45339


In [22]:
# Get some overall statistics for 2011
c_norm.describe()

Unnamed: 0,NId,Ars,Ass,BE,DA,FMC,FVI,FFA,HI,M,R,SA,T,TMCI,VT
count,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0
mean,70.5,0.060806,8.057472,4.113241,2.026643,34.380419,4.670153,12.96622,6.033728,0.019289,1.859598,0.958315,0.346375,18.899926,1.518993
std,40.5586,0.070187,4.65908,1.548878,2.146247,14.937431,3.003211,8.2713,2.104366,0.037732,1.11383,0.619382,0.298949,8.655682,1.194685
min,1.0,0.0,1.73913,1.30039,0.0,13.194722,0.386349,3.798481,1.583211,0.0,0.2574,0.089286,0.0,7.092723,0.2746
25%,35.75,0.0,4.419568,2.947731,0.828803,25.980638,2.565404,8.416467,4.673796,0.0,1.069989,0.516434,0.155055,12.407377,0.871931
50%,70.5,0.058042,7.205346,3.851444,1.497455,31.418898,3.931834,10.76026,5.882035,0.0,1.559647,0.825615,0.289292,17.225228,1.310282
75%,105.25,0.094234,10.542619,5.14527,2.438344,38.636659,6.12418,13.926176,7.285228,0.030836,2.382644,1.297216,0.436619,22.990496,1.753712
max,140.0,0.367872,28.637891,9.445762,18.516248,111.833231,21.36703,62.186612,13.182097,0.252525,6.409925,2.923812,1.989529,56.468424,9.329248


In [23]:
# Get some overall statistics for 2008
c2_norm.describe()

Unnamed: 0,NId,AC2,AR2,Ars2,Ass2,BE2,DA2,FVI2,FI2,FFA2,HI2,M2,R2,SA2,TCHCSI2,T2,VT2,TMCI2
count,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0,140.0
mean,70.5,87.741,0.404364,0.088052,6.885194,3.740126,3.780567,4.875626,0.12854,3.569013,6.902121,0.025703,1.622161,0.568509,6.521509,0.358737,2.44225,19.423247
std,40.5586,44.222251,0.268568,0.107554,4.344047,2.111306,4.430225,3.782824,0.197849,1.41854,2.672502,0.050091,1.137876,0.404646,9.531093,0.349347,1.662016,11.742062
min,1.0,35.134588,0.0,0.0,1.291513,1.023172,0.0,0.496894,0.0,0.566687,1.700061,0.0,0.200736,0.0,0.0,0.0,0.404776,5.716435
25%,35.75,63.815593,0.232116,0.0,3.770993,2.341697,1.134179,2.490915,0.0,2.665424,5.10219,0.0,0.886543,0.305505,0.0,0.160374,1.313811,11.882192
50%,70.5,78.912274,0.352334,0.068456,6.085921,3.177226,2.443936,3.608613,0.062624,3.357485,6.601917,0.0,1.352844,0.487331,3.851822,0.26275,2.019474,16.327531
75%,105.25,98.091611,0.533653,0.117326,8.923467,4.593931,4.701562,6.046208,0.177953,4.338422,8.113737,0.041946,2.083005,0.716594,9.022628,0.436383,2.954621,23.779808
max,140.0,372.062663,2.111801,0.581395,28.524804,16.765453,32.105943,24.87106,1.095618,9.464752,18.472585,0.306279,6.788512,2.370872,61.531054,2.415144,10.568728,72.157623


**The mean rate of total major crime across all neighbourhoods is more or less the same for both years: about 19.4 incidents per 1000 people in 2008 and 18.9 in 2011.**
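
Similar city-wide means can hide large changes in individual neighbourhoods. A sketch of a per-neighbourhood comparison joins the two years on the neighbourhood id and takes the difference of the rates. Only the five rows printed by `c_norm.head()` and `c2_norm.head()` are used here, rounded to three decimals; `y2011`, `y2008`, `both` and `largest` are illustrative names.

```python
import pandas as pd

# Total major crime per 1000 people, from the head() outputs above
y2011 = pd.DataFrame({
    'NId': [1, 2, 3, 4, 5],
    'N': ['West Humber-Clairville', 'Mount Olive-Silverstone-Jamestown',
          'Thistletown-Beaumond Heights', 'Rexdale-Kipling', 'Elms-Old Rexdale'],
    'TMCI': [32.815, 21.043, 18.935, 15.641, 19.372],
})
y2008 = pd.DataFrame({
    'NId': [1, 2, 3, 4, 5],
    'TMCI2': [32.326, 22.533, 17.028, 13.147, 13.453],
})

# Join the two years on the neighbourhood id and compute the change
both = y2011.merge(y2008, on='NId')
both['change'] = both['TMCI'] - both['TMCI2']

# Neighbourhood with the largest absolute change among these five rows
largest = both.loc[both['change'].abs().idxmax(), 'N']
print(both[['N', 'TMCI2', 'TMCI', 'change']])
print(largest)
```

Among these five rows only, Elms-Old Rexdale changes the most; the same steps on the full dataframes would address the questions about change between 2008 and 2011.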

In [26]:
c_norm['TMCI'].max()

56.468424279583076

In [35]:
# Sort all the major crimes by mean for 2011
c_norm.loc[:,['Ass','BE','DA','M','R','SA','T','VT']].mean().sort_values(ascending=False)

Ass    8.057472
BE     4.113241
DA     2.026643
R      1.859598
VT     1.518993
SA     0.958315
T      0.346375
M      0.019289
dtype: float64

In [36]:
# Sort all the major crimes by mean for 2008
c2_norm.loc[:,['Ass2','BE2','DA2','M2','R2','SA2','T2','VT2']].mean().sort_values(ascending=False)

Ass2    6.885194
DA2     3.780567
BE2     3.740126
VT2     2.442250
R2      1.622161
SA2     0.568509
T2      0.358737
M2      0.025703
dtype: float64

**When we compare the major crimes for 2011 and 2008, we notice that Assaults, Drug Arrests, and Break & Enters are the three largest major crime categories in both years, although Drug Arrests and Break & Enters swap places. Murders and Thefts are the two smallest categories, which is consistent with Toronto's reputation as a safe city.**
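
The two sorted listings can also be put side by side with ranks and a percentage change. The sketch below re-enters the means printed above (with the "2" suffix dropped from the 2008 labels so that the categories line up); `mean_2011`, `mean_2008` and `compare` are illustrative names.

```python
import pandas as pd

# Mean incidents per 1000 people, copied from the two outputs above
mean_2011 = pd.Series({'Ass': 8.057472, 'BE': 4.113241, 'DA': 2.026643,
                       'R': 1.859598, 'VT': 1.518993, 'SA': 0.958315,
                       'T': 0.346375, 'M': 0.019289})
mean_2008 = pd.Series({'Ass': 6.885194, 'DA': 3.780567, 'BE': 3.740126,
                       'VT': 2.442250, 'R': 1.622161, 'SA': 0.568509,
                       'T': 0.358737, 'M': 0.025703})

# Align the two years by category label
compare = pd.DataFrame({'2008': mean_2008, '2011': mean_2011})

# Rank 1 is the most frequent category in that year
compare['rank_2008'] = compare['2008'].rank(ascending=False).astype(int)
compare['rank_2011'] = compare['2011'].rank(ascending=False).astype(int)

# Relative change of the mean rate from 2008 to 2011, in percent
compare['pct_change'] = (compare['2011'] / compare['2008'] - 1) * 100
print(compare.sort_values('rank_2011').round(2))
```

In this table Assaults rank first in both years, while the mean rate of Drug Arrests is clearly lower in 2011 than in 2008.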

In [52]:
# Top 5 neighbourhoods by major crime in 2011
t5TMCI_2011 = c_norm.sort_values(['TMCI']).tail(5)
t5TMCI_2011[['N', 'TMCI']] # Most Total Major Crime Incidents
#b5TMCI_2011 = c.sort_values(['Total Major Crime Incidents']).head(5)

Unnamed: 0,N,TMCI
30,Yorkdale-Glen Park,37.52128
78,University,37.997433
77,Kensington-Chinatown,46.972973
75,Bay Street Corridor,52.571724
72,Moss Park,56.468424


In [47]:
c2_norm.sort_values('TMCI2',ascending=False).head()

Unnamed: 0,N,NId,AC2,AR2,Ars2,Ass2,BE2,DA2,FVI2,FI2,FFA2,HI2,M2,R2,SA2,TCHCSI2,T2,VT2,TMCI2
72,Moss Park,73,256.653747,0.452196,0.581395,20.542636,8.397933,32.105943,7.299742,0.129199,7.687339,12.790698,0.0,6.330749,2.260982,39.793282,0.516796,2.002584,72.157623
78,University,79,211.007621,1.016088,0.169348,16.088061,16.765453,26.587638,10.330229,0.0,8.128704,11.685013,0.0,5.08044,2.370872,10.668925,0.508044,3.725656,71.126164
75,Bay Street Corridor,76,372.062663,0.587467,0.0,28.524804,9.921671,10.313316,15.535248,0.261097,9.464752,18.472585,0.0,6.788512,2.219321,4.438642,2.415144,2.872063,63.05483
65,Danforth,66,181.46789,0.733945,0.550459,21.834862,7.889908,13.577982,11.192661,0.183486,8.256881,16.697248,0.0,6.055046,0.917431,19.816514,0.183486,3.119266,53.577982
77,Kensington-Chinatown,78,172.322996,0.234055,0.175541,15.798713,9.069631,18.490345,6.729081,0.351083,6.319485,11.293154,0.058514,4.095963,1.462844,9.128145,1.521358,1.989468,52.486834
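
To see how stable the top five is between the two years, the neighbourhood names from the two outputs above can be compared as sets. This is a small sketch with the names typed in by hand; `top5_2011`, `top5_2008` and `in_both` are illustrative names.

```python
# Top five neighbourhoods by total major crime per 1000 people, from the outputs above
top5_2011 = {'Yorkdale-Glen Park', 'University', 'Kensington-Chinatown',
             'Bay Street Corridor', 'Moss Park'}
top5_2008 = {'Moss Park', 'University', 'Bay Street Corridor',
             'Danforth', 'Kensington-Chinatown'}

# Neighbourhoods in the top five in both years, and in one year only
in_both = sorted(top5_2011 & top5_2008)
only_2011 = sorted(top5_2011 - top5_2008)
only_2008 = sorted(top5_2008 - top5_2011)

print("Both years:", in_both)
print("2011 only:", only_2011)
print("2008 only:", only_2008)
```

Four of the five neighbourhoods appear in both lists. On the dataframes themselves, `c_norm.nlargest(5, 'TMCI')` is an alternative to sorting and taking the tail, and returns the rows in descending order.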


We can compare the top three major crime incidents and the total major crime incidents visually using a box plot.

In [None]:
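c_box = c.iloc[:,[3,4,5,10,11,12,13,14,15]] # Assumed definition (c_box is not defined above): 2011 major crime columns, mirroring c2_box
c_box.head()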

In [None]:
%matplotlib inline
sns.set_style("whitegrid")
plt.figure(figsize=(10,5)) # This allows us to set the width and height of the box plot
plt.ylim(0,800) # This allows us to limit the range of the y-axis. This is helpful because we have a lot of outliers.
sns.boxplot(data=c_box.iloc[:,[0,1,2,7]])

From the box plot, we see that "Assaults" vary the most between neighbourhoods, in comparison to "Break & Enters" and "Drug Arrests". Has anything changed from 2008? Let's take a look at the 2008 data.

In [None]:
# Selecting just the major crime incidents in 2008 as a dataframe. Find the mean
c2_box = c2.iloc[:,[5,6,7,12,13,14,16,17,18]]
c2_box.mean().sort_values()

The pattern is more or less the same; however, drug arrests were more frequent in 2008 than in 2011.

In [None]:
c2_box.head()

In [None]:
%matplotlib inline
sns.set_style("whitegrid")
plt.figure(figsize=(10,6)) # This allows us to set the width and height of the box plot
plt.ylim(0,800) # This allows us to limit the range of the y-axis. This is helpful because we have a lot of outliers.
sns.boxplot(data=c2_box.iloc[:,[0,1,2,8]])

We notice a larger range and variance of drug arrests in 2008 than in 2011. But we do not see any major changes in the order of mean frequency of the types of crimes when comparing 2008 and 2011.

We can also plot these comparisons between types of crimes and years, as violin plots. One advantage of the violin plot is that it shows a kernel density plot instead of a box. So, we get probability density without the need for a histogram (killing two birds with one stone!). Plots for 2011 and 2008 are provided below.

In [None]:
%matplotlib inline
sns.set_style("whitegrid")
plt.figure(figsize=(10,6)) # This allows us to set the width and height of the violin plot
plt.ylim(0,800) # This allows us to limit the range of the y-axis. This is helpful because we have a lot of outliers.
sns.violinplot(data=c_box.iloc[:,[0,1,2,7]])

In [None]:
%matplotlib inline
sns.set_style("whitegrid")
plt.figure(figsize=(10,6)) # This allows us to set the width and height of the violin plot
plt.ylim(0,800) # This allows us to limit the range of the y-axis. This is helpful because we have a lot of outliers.
sns.violinplot(data=c2_box.iloc[:,[0,1,2,8]])
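
The separate per-year plots above could also be combined into a single figure if the data is first reshaped into long format, with one row per neighbourhood, crime type and year. A minimal sketch of that reshaping is below, using the first three rows of the 2011 and 2008 crime data shown earlier; `wide_2011`, `wide_2008` and `long_df` are illustrative names.

```python
import pandas as pd

# Raw counts for the first three neighbourhoods, from the head() outputs earlier
wide_2011 = pd.DataFrame({'Ass': [390, 316, 85], 'BE': [175, 61, 36], 'DA': [62, 90, 16]})
wide_2008 = pd.DataFrame({'Ass': [272, 269, 66], 'BE': [193, 88, 30], 'DA': [88, 145, 27]})

# melt() turns each column into (crime, count) rows; assign() tags the year
long_2011 = wide_2011.melt(var_name='crime', value_name='count').assign(year=2011)
long_2008 = wide_2008.melt(var_name='crime', value_name='count').assign(year=2008)

# Stack both years into one long dataframe
long_df = pd.concat([long_2008, long_2011], ignore_index=True)
print(long_df)
```

A dataframe in this shape can be passed to seaborn as `sns.boxplot(data=long_df, x='crime', y='count', hue='year')`, which draws the two years next to each other for every crime type.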

Perhaps a better way to compare crimes between both years is to normalize crimes in each neighbourhood based on the population of that neighbourhood.

In [None]:
# The population columns were renamed to TPop and TPop2 above
c_box_norm = c_box.div(d['TPop'], axis=0)
c2_box_norm = c2_box.div(d2['TPop2'], axis=0)

In [None]:
c_box_norm.head()

In [None]:
c2_box_norm.head()

Now, we can compare the means of all major crimes again.

In [None]:
%matplotlib inline

In [None]:
#plt.scatter(c.Robberies, c['Sexual Assaults'])

In [None]:
#c['Robberies']

In [None]:
# t5 is the Top 5, b5 is for Bottom 5; Rob = Robberies; SA = Sexual Assaults; 
# VT = Vehicle Thefts; TMCI = Total Major Crime Incidents
# Column names below are the short names assigned by the rename above
#t5Rob = c.sort_values(['R']).tail(5)
#b5Rob = c.sort_values(['R']).head(5)
#t5SA = c.sort_values(['SA']).tail(5)
#b5SA = c.sort_values(['SA']).head(5)
#t5VT = c.sort_values(['VT']).tail(5)
#b5VT = c.sort_values(['VT']).head(5)
t5TMCI_2011 = c.sort_values(['TMCI']).tail(5)
b5TMCI_2011 = c.sort_values(['TMCI']).head(5)

In [None]:
t5TMCI_2011[['N', 'TMCI']] # Most Total Major Crime Incidents

In [None]:
# Use the renamed columns (TPop, N, NId) from the cells above
c_norm = c.iloc[:,2:16].div(d['TPop'], axis=0)
c_norm['N'] = c['N']
c_norm['NID'] = c['NId']

In [None]:
t5TMCI_2011 = c_norm.sort_values(['TMCI']).tail(5)
t5TMCI_2011[['N', 'TMCI']] # Most Total Major Crime Incidents