# Analyzing Montgomery County crimes to find useful patterns

Montgomery County is a county in the U.S. state of Maryland, on the East Coast ([Montgomery County, Maryland](https://en.wikipedia.org/wiki/Montgomery_County,_Maryland)).

Improving various aspects of social life entails both proactive and reactive analysis. 
With this in mind, I will look for patterns by performing time, location and crime classification analysis. For this project I will rely heavily on graph visualization, as a picture is worth a thousand words.

Here are some conclusion highlights:
1. Time analysis:
    1. Most crimes are committed on Tuesdays
    2. On a 24-hour basis, most crimes are committed between 7 a.m. and 11 p.m.
    3. October has the highest crime count
2. Classification analysis:
    1. Violent and non-violent crime shares are fairly even: 42.8% / 57.2%
3. Location analysis:
    1. Cities with highest crime counts are : Silver Spring, Rockville, Gaithersburg
    2. Most of the crimes happen in the street, residence or parking lot
    3. Silver Spring Police District has the highest crime rates

The dataset for this project is available [here](https://data.montgomerycountymd.gov/Public-Safety/Crime/icn6-v9z3).

In [1]:
import pandas as pd
import numpy as np

crimes = pd.read_csv("MontgomeryCountyCrime2013.csv")
crimes.head()

Unnamed: 0,Incident ID,CR Number,Dispatch Date / Time,Class,Class Description,Police District Name,Block Address,City,State,Zip Code,...,Sector,Beat,PRA,Start Date / Time,End Date / Time,Latitude,Longitude,Police District Number,Location,Address Number
0,200939101,13047006,10/02/2013 07:52:41 PM,511,BURG FORCE-RES/NIGHT,OTHER,25700 MT RADNOR DR,DAMASCUS,MD,20872.0,...,,,,10/02/2013 07:52:00 PM,,,,OTHER,,25700.0
1,200952042,13062965,12/31/2013 09:46:58 PM,1834,CDS-POSS MARIJUANA/HASHISH,GERMANTOWN,GUNNERS BRANCH RD,GERMANTOWN,MD,20874.0,...,M,5M1,470.0,12/31/2013 09:46:00 PM,,,,5D,,
2,200926636,13031483,07/06/2013 09:06:24 AM,1412,VANDALISM-MOTOR VEHICLE,MONTGOMERY VILLAGE,OLDE TOWNE AVE,GAITHERSBURG,MD,20877.0,...,P,6P3,431.0,07/06/2013 09:06:00 AM,,,,6D,,
3,200929538,13035288,07/28/2013 09:13:15 PM,2752,FUGITIVE FROM JUSTICE(OUT OF STATE),BETHESDA,BEACH DR,CHEVY CHASE,MD,20815.0,...,D,2D1,11.0,07/28/2013 09:13:00 PM,,,,2D,,
4,200930689,13036876,08/06/2013 05:16:17 PM,2812,DRIVING UNDER THE INFLUENCE,BETHESDA,BEACH DR,SILVER SPRING,MD,20815.0,...,D,2D3,178.0,08/06/2013 05:16:00 PM,,,,2D,,


### Exploring the data

Each row in the dataset represents one reported crime. The data contains location information, crime classification and various timestamps.
In examining the dataset our goals are to:
1. Find the columns that have meaningful information, have few missing values, and hold suitably granular data.
2. Perform data cleaning/manipulation

In [2]:
crimes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23369 entries, 0 to 23368
Data columns (total 22 columns):
Incident ID               23369 non-null int64
CR Number                 23369 non-null int64
Dispatch Date / Time      23369 non-null object
Class                     23369 non-null int64
Class Description         23369 non-null object
Police District Name      23369 non-null object
Block Address             23369 non-null object
City                      23369 non-null object
State                     23369 non-null object
Zip Code                  23339 non-null float64
Agency                    23369 non-null object
Place                     23369 non-null object
Sector                    23323 non-null object
Beat                      23361 non-null object
PRA                       23363 non-null float64
Start Date / Time         23369 non-null object
End Date / Time           13191 non-null object
Latitude                  23208 non-null float64
Longitude                 2

We need to convert `Start Date / Time`, `End Date / Time` and `Dispatch Date / Time` to datetime format.
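
A minimal sketch of that conversion, assuming the timestamps follow the `MM/DD/YYYY hh:mm:ss AM/PM` layout seen in the `head()` output above. The small `sample` frame is made up for illustration (its values are copied from the first rows), and passing an explicit `format` is optional, but it avoids day/month guessing.

```python
import pandas as pd

# Illustrative sample; values copied from the first rows shown above
sample = pd.DataFrame({
    "Dispatch Date / Time": ["10/02/2013 07:52:41 PM", "12/31/2013 09:46:58 PM"],
    "Start Date / Time": ["10/02/2013 07:52:00 PM", "12/31/2013 09:46:00 PM"],
})

# Assumed layout: month/day/year, 12-hour clock with AM/PM
fmt = "%m/%d/%Y %I:%M:%S %p"
for col in ["Dispatch Date / Time", "Start Date / Time"]:
    sample[col] = pd.to_datetime(sample[col], format=fmt)

print(sample.dtypes)
```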

### Examine the number of (rows,columns) from the dataframe

In [3]:
crimes.shape

(23369, 22)

### Examining missing values. 

In [4]:
crimes.isnull().sum()

Incident ID                   0
CR Number                     0
Dispatch Date / Time          0
Class                         0
Class Description             0
Police District Name          0
Block Address                 0
City                          0
State                         0
Zip Code                     30
Agency                        0
Place                         0
Sector                       46
Beat                          8
PRA                           6
Start Date / Time             0
End Date / Time           10178
Latitude                    161
Longitude                   161
Police District Number        0
Location                    161
Address Number              132
dtype: int64

1. The main takeaway here is that `End Date / Time` has a high number of missing values, so it can't be used for our analysis 
2. A deeper look into the columns that have missing values is needed; we also have to determine which columns will provide useful insight
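
To put the missing counts in proportion, here is a small sketch that turns them into percentages of all rows. The counts and the row total are copied from the outputs above; the names `missing`, `total_rows` and `missing_pct` are only illustrative.

```python
import pandas as pd

# Missing-value counts copied from the isnull().sum() output above
missing = pd.Series({
    "Zip Code": 30, "Sector": 46, "Beat": 8, "PRA": 6,
    "End Date / Time": 10178, "Latitude": 161, "Longitude": 161,
    "Location": 161, "Address Number": 132,
})
total_rows = 23369  # from crimes.shape

# Share of rows with a missing value, in percent, largest first
missing_pct = (missing / total_rows * 100).round(2).sort_values(ascending=False)
print(missing_pct)
```

On the full dataframe, `crimes.isnull().mean() * 100` should give the same percentages in one step.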

In [5]:
columns_to_keep=['Zip Code','Sector','Beat','PRA','Latitude','Longitude','Location','Address Number']

for col in columns_to_keep:
    item_null=crimes[col].notnull()
    print(col+"\n",crimes[col][item_null==True].head(),"\n")

Zip Code
 0    20872.0
1    20874.0
2    20877.0
3    20815.0
4    20815.0
Name: Zip Code, dtype: float64 

Sector
 1    M
2    P
3    D
4    D
5    P
Name: Sector, dtype: object 

Beat
 1    5M1
2    6P3
3    2D1
4    2D3
5    6P1
Name: Beat, dtype: object 

PRA
 1    470.0
2    431.0
3     11.0
4    178.0
5    444.0
Name: PRA, dtype: float64 

Latitude
 10    39.105561
13    39.064334
14    39.067335
15    39.017814
16    39.178862
Name: Latitude, dtype: float64 

Longitude
 10   -77.144617
13   -76.968985
14   -77.124027
15   -77.047689
16   -77.267406
Name: Longitude, dtype: float64 

Location
 10    (39.105560882140779, -77.144617133574968)
13     (39.064334220776551, -76.96898520383327)
14    (39.067334736049553, -77.124027420153752)
15     (39.017814078946948, -77.04768926351224)
16    (39.178862442227761, -77.267405973712243)
Name: Location, dtype: object 

Address Number
 0     25700.0
10      600.0
11     9200.0
13     2100.0
14     2200.0
Name: Address Number, dtype: float64

### Columns analysis:

1. `Zip Code`; `Sector`; `Beat`; `Address Number` will not be used for our analysis, as these details mean little to a casual reader.
2. Columns like `Dispatch Date / Time`; `Class Description`; `City`; `Start Date / Time`; `Police District Number` will be useful
3. `Latitude` and `Longitude` are not an obvious choice, at least from a comprehensibility standpoint, but will be useful for map visualization later on.
4. The number of NaN values in the `End Date / Time` column is quite high, so I will use `Dispatch Date / Time` instead
5. I will exclude rows with missing latitude or longitude values from our dataset

In [6]:
#exclude lat&lon missing values
has_lat=crimes['Latitude'].notnull()
has_lon=crimes['Longitude'].notnull()
# .copy() so that the columns added later do not trigger SettingWithCopyWarning
crimes=crimes[has_lat & has_lon].copy()
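
The same filter can also be sketched with `dropna`. The `toy` frame below is a made-up stand-in for `crimes`, used only to show the call.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `crimes` (illustrative values)
toy = pd.DataFrame({
    "City": ["ROCKVILLE", "BETHESDA", "OLNEY"],
    "Latitude": [39.08, np.nan, 39.15],
    "Longitude": [-77.15, -77.09, np.nan],
})

# Keep only rows where both coordinates are present
toy_clean = toy.dropna(subset=["Latitude", "Longitude"])
print(toy_clean)
```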

### Time analysis

The aim here is to spot time related patterns in crimes. Time analysis is structured around these questions:

1. On what day of the week are the most crimes committed? (e.g. Monday, Tuesday, etc.)
2. During what time of day are the most crimes committed?
3. During what month are the most crimes committed?

The first step is to convert the `Dispatch Date / Time` column from an object type to a datetime type.<br>
`Dispatch Date / Time`: the actual date and time an officer was dispatched

In [7]:
# import datetime modules
import datetime as dt
 
# convert 'Dispatch Date / Time' to datetime format
crimes['Dispatch Date / Time']=pd.to_datetime(crimes['Dispatch Date / Time'])

# get crime counts/weekday;hour;month
# .dt.weekday_name was removed in pandas 1.0; .dt.day_name() returns the same names
crimes['Dispatch Day of the Week']=crimes['Dispatch Date / Time'].dt.day_name()
crimes['Dispatch Hour']=crimes['Dispatch Date / Time'].dt.hour
crimes['Dispatch Month']=crimes['Dispatch Date / Time'].dt.month

In [8]:
# import plotting modules

from plotly.offline import download_plotlyjs, init_notebook_mode, iplot
from plotly.graph_objs import *
init_notebook_mode()
import cufflinks as cf

# day of the week crime counts
dow=crimes['Dispatch Day of the Week'].value_counts().copy()

# Converting the series to a dataframe
# It will be easier for plotting
pd.DataFrame({'Dispatch Day of the Week':dow.index,'Counts':dow}).reset_index(drop=True)

# plotting
dow.iplot(theme='pearl', filename='crime_distrib_per_day', title='Crime distribution per day of the week', xTitle='Day', yTitle='Count')

'''
world_readable=False invalid plotly option
'''
dow

Tuesday      3805
Monday       3712
Wednesday    3591
Friday       3572
Thursday     3382
Saturday     2780
Sunday       2366
Name: Dispatch Day of the Week, dtype: int64

### Day of the week analysis:

1. Generally, fewer crimes are committed towards the weekend, and the highest counts are at the beginning of the week
2. Crime peaks on Tuesday and is lowest on Sunday.
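
Because `value_counts()` orders the days by count, the weekly shape is easier to read once the days are put back in calendar order. A small sketch using the counts printed above (the names `dow_counts`, `week_order` and `dow_share` are illustrative):

```python
import pandas as pd

# Counts copied from the output above
dow_counts = pd.Series({
    "Tuesday": 3805, "Monday": 3712, "Wednesday": 3591, "Friday": 3572,
    "Thursday": 3382, "Saturday": 2780, "Sunday": 2366,
})

week_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
              "Friday", "Saturday", "Sunday"]

# Reorder chronologically and express each day as a share of all crimes
dow_ordered = dow_counts.reindex(week_order)
dow_share = (dow_ordered / dow_ordered.sum() * 100).round(1)
print(dow_share)
```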

In [9]:
# hour crime counts
tod=crimes['Dispatch Hour'].value_counts()

# Converting the series to a dataframe
pd.DataFrame({'Dispatch Hour':tod.index,'Counts':tod}).reset_index(drop=True)

# The series needs to be sorted by index for the time analysis, otherwise the graph will be scrambled.
# This is because value_counts() sorts the series by values
tod.sort_index().iplot(theme='pearl', filename='crime_distrib_per_hour', title='Crime distribution per hour',
         xTitle='Hour', yTitle='Count', world_readable=False)

tod

7     1275
9     1218
16    1209
15    1176
8     1170
14    1141
13    1130
18    1114
10    1112
17    1111
11    1102
6     1074
12    1061
20    1057
23    1022
19    1019
22    1010
21    1000
0      883
1      839
2      675
3      366
4      226
5      218
Name: Dispatch Hour, dtype: int64

### Crime analysis by hour:
1. The highest count is at 7 am and the lowest at 5 am.
2. Most of the crimes are committed between 7 am and 11 pm

In [10]:
# crime distribution/month
tom=crimes['Dispatch Month'].value_counts()

# convert series to df
pd.DataFrame({'Dispatch Month':tom.index,'Counts':tom}).reset_index(drop=True)

# sort index & plot
tom.sort_index().iplot(theme='pearl', filename='crime_distrib_per_month', title='Crime distribution per month',
         xTitle='Month', yTitle='Count', world_readable=False)

tom

10    4045
8     3977
11    3913
9     3898
12    3874
7     3501
Name: Dispatch Month, dtype: int64

###  Month crime analysis:

1. The data is not complete; we only have records from July until the end of the year.
2. July has the fewest crimes, followed by a substantial increase starting in August.
3. One possible explanation is vacation time: most people have probably returned from vacation by the end of August. The peak is in October.

### Dispatch time interval analysis

Dispatch time could be a strong indicator of how the police prioritize crimes. 
1. To check this, we need to look at the time difference between `Dispatch Date / Time` and `Start Date / Time`
2. This will let us define and categorize the dispatch time intervals
3. It would also be interesting to see whether dispatch time differs by type of incident

In [11]:
# convert to datetime
crimes['Start Date / Time']=pd.to_datetime(crimes['Start Date / Time'])

# get difference between 'Dispatch Date / Time' 'Start Date / Time'
crimes['Date diff']=crimes['Dispatch Date / Time'] -crimes['Start Date / Time']
pd.DataFrame(crimes.groupby(['Start Date / Time','Dispatch Date / Time'])['Date diff'].value_counts().head())

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Date diff
Start Date / Time,Dispatch Date / Time,Date diff,Unnamed: 3_level_1
1974-10-30 00:00:00,2013-11-07 11:41:23,14253 days 11:41:23,1
1977-02-11 00:00:00,2013-08-14 17:05:22,13333 days 17:05:22,1
1980-08-24 00:00:00,2013-12-23 10:28:40,12174 days 10:28:40,1
1985-01-01 06:00:00,2013-09-25 20:27:18,10494 days 14:27:18,1
1993-08-27 00:00:00,2013-11-18 17:45:43,7388 days 17:45:43,1


1. As we can see, the start time column is not reliable: many rows have a start time before 2013
2. The data sample we are analyzing covers incidents from 07/2013 to 12/2013
3. So we will have to exclude those rows from our analysis

In [12]:
# Exclude records where start time was before 2013/07/01
# pd.Timestamp avoids the deprecated comparison of a datetime Series with datetime.date
crimes_slice=crimes[(crimes['Start Date / Time'] >= pd.Timestamp(2013,7,1))].copy()

# group Start/Dispatch Time on minutes intervals [<1,(1,5),(5,10),(10,30),(30,60),>60]
crimes_slice.loc[crimes_slice['Date diff']<dt.timedelta(minutes=1),'Less then 1']='True'
crimes_slice.loc[((crimes_slice['Date diff']>dt.timedelta(minutes=1))& (crimes_slice['Date diff']<dt.timedelta(minutes=5))),'Between 1 and 5']='True'
crimes_slice.loc[((crimes_slice['Date diff']>dt.timedelta(minutes=5)) & (crimes_slice['Date diff']<dt.timedelta(minutes=10))),'Between 5 and 10']='True'
crimes_slice.loc[((crimes_slice['Date diff']>dt.timedelta(minutes=10)) & (crimes_slice['Date diff']<dt.timedelta(minutes=30))),'Between 10 and 30']='True'
crimes_slice.loc[((crimes_slice['Date diff']>dt.timedelta(minutes=30)) & (crimes_slice['Date diff']<dt.timedelta(minutes=60))),'Between 30 and 60']='True'
crimes_slice.loc[crimes_slice['Date diff']>dt.timedelta(minutes=60),'Above 60']='True'



In [13]:
# Slicing the df on the time interval columns & get the counts
td=crimes_slice.loc[:,['Less then 1',
                    'Between 1 and 5',
                    'Between 5 and 10',
                    'Between 10 and 30',
                    'Between 30 and 60',
                    'Above 60']].apply(pd.Series.value_counts)


# .iloc[0] selects the first row (the 'True' counts) positionally, which gives a Series with the columns as the index,
# then sort by values
td.iloc[0].sort_values(ascending=False).iplot(kind='bar', title='Crime distribution Start / Dispatch Time',
         xTitle='Time Delta/minutes', yTitle='Count',color='rgba(55, 128, 191, 1.0)',filename='cufflinks/bar-chart-row')

td.iloc[0].sort_values(ascending=False)

Less then 1          10828
Above 60              9185
Between 10 and 30      977
Between 30 and 60      500
Between 5 and 10       491
Between 1 and 5        384
Name: True, dtype: int64

### Dispatch time interval breakdown:

1. The majority of dispatches happened quite fast (in less than a minute). However, this could be misleading, as there isn't a clear description of what this column represents. I tend to believe that `Dispatch Date / Time` is the time when an officer was sent to a crime location, not when the officer arrived at the crime scene.
2. The next largest group is dispatch times above 60 minutes.
3. It's also important to remember that this is only a slice of the dataset, as some start times may not be accurate.
4. The sources I found online suggest that response time varies between 6 and 15 minutes for critical issues. However, I could not find a universal SLA for dispatching police officers.
5. Another point worth taking into consideration is that not all incidents require the same attention, and some of them could be resolved over the phone.
6. It would be interesting to repeat this analysis later on, once the crimes are classified as violent or non-violent.
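
The interval grouping can also be sketched with `pd.cut`, which assigns each delay to exactly one bucket. This is an alternative under stated assumptions, not the method used above: the `delays` values are made up, the bucket labels are my own, and the bins are closed on the left. Unlike the strict `<` and `>` comparisons in the cell above, a delay of exactly 5 minutes lands in a bucket here, so counts on the real data may differ slightly from those shown.

```python
import numpy as np
import pandas as pd

# Illustrative dispatch delays; not taken from the dataset
delays = pd.Series(pd.to_timedelta(["20s", "3min", "5min", "12min", "45min", "2h"]))

# Work in minutes so that numeric bin edges can be used
minutes = delays.dt.total_seconds() / 60

edges = [-np.inf, 1, 5, 10, 30, 60, np.inf]
labels = ["< 1", "1 to 5", "5 to 10", "10 to 30", "30 to 60", ">= 60"]

# right=False makes every bin closed on the left: [1, 5), [5, 10), ...
buckets = pd.cut(minutes, bins=edges, labels=labels, right=False)
print(buckets.value_counts().reindex(labels))
```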

### Dispatch Time Delta analysis

1. The bar chart above is not very revealing when it comes to quantifying the time deltas
2. A pie chart is better suited for that. With cufflinks, a pie chart needs a dataframe, so I will first convert the series to a dataframe. See below:
<br>
https://plot.ly/pandas/pie-charts/
<br>
![pie format](pie_format.png "Title")


In [14]:
# convert series to dataframe
td_pie=td.iloc[0].to_frame()

# reset index, if column is in index it can't be used to plot  
td_pie=td_pie.reset_index()

# rename columns, this step is not necessary
td_pie.columns=['Time Deltas','Values']

# plot
td_pie.iplot(kind='pie',labels='Time Deltas',values='Values',pull=.1,hole=.2,title='Dispatch Time interval breakdown',
                   textposition='outside',textinfo='value+percent')

### Dispatch time intervals percentage breakdown:
1. dispatch time < 1 min             48.4%
2. dispatch time > 1 min & < 5 min   1.72%
3. dispatch time > 5 min & < 10 min  2.2%
4. dispatch time > 10 min & < 30 min 4.37%
5. dispatch time > 30 min & < 60 min 2.24%
6. dispatch time > 60 min            41%

### Analyzing crime locations:

1. I will use the following criteria for choosing columns:
    1. Granularity: Small areas shouldn't be used, because only a few crimes were committed inside them, which makes it hard to analyze and compare
    2. Comprehensibility: Need to analyze data that is compelling for the casual reader
    3. Missing values: If a column has a lot of missing values, that means that the conclusions you draw are less valid, because you don't know if the missing data is systematic


2. Columns used:
    1. Police District Number | Major police boundary corresponding to police district names, e.g. Rockville, Wheaton, etc.
    2. City | City 

### Crime distribution by City

In [15]:
# crime counts per City, transform to df; sort

crimes_city=crimes['City'].value_counts().to_frame().sort_values(by='City')

# graph layout
layout=dict(autosize= True,
            title = 'Crime counts per City / top 15',
            xaxis=dict(domain=[0.08, 1]),     
            )

# plot
crimes_city.tail(15).iplot(kind='barh',barmode='normal', bargap=.8, filename='cufflinbarh',layout=layout)
crimes_city.sort_values(by='City',ascending=False).head(10)

Unnamed: 0,City
SILVER SPRING,8587
ROCKVILLE,3424
GAITHERSBURG,3372
GERMANTOWN,2143
BETHESDA,1733
MONTGOMERY VILLAGE,680
POTOMAC,526
CHEVY CHASE,493
OLNEY,379
KENSINGTON,362


### Crime distribution by location

In [16]:
# get crime counts/place; transform to df
crimes_place=crimes['Place'].value_counts().to_frame().sort_values(by='Place')

# graph layout
layout=dict(
            title = 'Crime counts per Place / top 15',
            xaxis=dict(domain=[0.2, 1]),
            yaxis=dict(domain=[0.1, 0.66])    
)

# plot
crimes_place.tail(15).iplot(kind='barh',barmode='normal', bargap=.4, filename='cufflinbarh',layout=layout)
crimes_place.sort_values(by='Place',ascending=False).head(10)

Unnamed: 0,Place
Residence - Single Family,2724
Street - In vehicle,2501
Residence - Apartment/Condo,1845
Street - Residential,1664
Other/Unknown,1401
Parking Lot - Residential,1065
Parking Lot - Commercial,1018
Residence -Townhouse/Duplex,963
Residence - Driveway,810
Street - Commercial,715


### Analyze crimes based on Police District Location

Although the police district does not mean much to the casual reader, it could provide useful insight, as the areas are quite big and offer a suitable level of granularity.<br>
Let's visualize Police District Location:<br>
![Montgomery County map](Countywidemap.jpg)

PD summary:<br>
    1D | Rockville<br>
    2D | Bethesda<br>
    3D | Silver Spring<br>
    4D | Wheaton<br>
    5D | Germantown<br>
    6D | Gaithersburg

In [17]:
# plot crime distribution/ PD
crimes['Police District Number'].value_counts().iplot(kind='bar',title='Crime distribution  / Police District Number',
        xTitle='Police District Number', yTitle='Count',color='rgba(55, 128, 191, 1.0)', filename='crime_distribution_per_district')

### Crime by PD:

Silver Spring has the highest crime count and Germantown the lowest. Because of the lack of data we will exclude the last district, TPPD, from our analysis.<br>
This is the order of crime counts per district, from highest to lowest:<br>
    1. Silver Spring; Wheaton; Gaithersburg; Rockville; Bethesda; Germantown
But this comparison is not accurate unless we take into account the population of each district.<br>
For this I will use the census data from the Montgomery County Police Department crime report linked below; the per-district census data is at the end of the report.
https://www.wau.edu/wp-content/uploads/2012/09/MCPCrimeReport2014.compressed.pdf

In [18]:
# Constructing a pandas dataframe with Police District Number & Population
census = pd.DataFrame([{'Police District Number':'1D','Population':'149118'},
                       {'Police District Number':'2D','Population':'182883'},
                       {'Police District Number':'3D','Population':'152991'},
                       {'Police District Number':'4D','Population':'208263'},
                       {'Police District Number':'5D','Population':'131391'},
                       {'Police District Number':'6D','Population':'147486'}]
                      )                     

# crime counts / Police District Number
cpd=crimes['Police District Number'].value_counts()

# Change the series to a df with 'Police District Number' as a regular column
cpd=pd.DataFrame({'Police District Number':cpd.index,'Counts':cpd}).reset_index(drop=True)

# Merge the 2 dfs
result=pd.merge(census, cpd, on='Police District Number')

# Compute the crime counts per 100k
result['Counts/100k']=100000*result['Counts']/pd.to_numeric(result['Population'], errors='coerce')

# Set Police district number as index, 
# this is needed to have on the X axis the Police District Number, then get the value_count()
result.set_index('Police District Number',inplace=True)
result['Counts/100k'].sort_values(ascending=False).iplot(kind='bar',title='Crime distribution per 100K',xTitle='Police District Number',yTitle='Counts/100K')

### Crime by PD/100k:

Silver Spring is still the district with the highest crime rates, followed by Gaithersburg and at the end Bethesda.<br>
This is the order from highest to the lowest:<br>
    1. Silver Spring; Gaithersburg; Rockville; Wheaton; Germantown; Bethesda

## Crime distribution/PD visualization pie

In [19]:
# Crime count as a percentage of each district's population
result['Percentage']=result['Counts']*100/pd.to_numeric(result['Population'], errors='coerce')

#Need to reset the index otherwise I can't plot a pandas df and use one column if that column is in the index
result.reset_index().iplot(kind='pie',labels='Police District Number',values='Percentage',title="Crime percentage distribution / District")
result

Unnamed: 0_level_0,Population,Counts,Counts/100k,Percentage
Police District Number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1D,149118,3459,2319.63948,2.319639
2D,182883,3370,1842.70818,1.842708
3D,152991,5502,3596.289978,3.59629
4D,208263,4364,2095.427416,2.095427
5D,131391,2727,2075.484622,2.075485
6D,147486,3763,2551.42861,2.551429


### Crime percentage breakdown/PD:
Each slice of the pie is a district's share of the summed per-capita rates:

* 3D - Silver Spring 24.8%
* 6D - Gaithersburg 17.6%
* 1D - Rockville 16%
* 4D - Wheaton 14.5%
* 5D - Germantown 14.3%
* 2D - Bethesda 12.7%
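
As a worked check of these figures, the sketch below recomputes the rate per 100,000 residents and the share of each district from the counts and populations in the table above. The frame name `rates` and the `Share %` column are illustrative.

```python
import pandas as pd

# Counts and populations copied from the table above
rates = pd.DataFrame({
    "Population": [149118, 182883, 152991, 208263, 131391, 147486],
    "Counts": [3459, 3370, 5502, 4364, 2727, 3763],
}, index=["1D", "2D", "3D", "4D", "5D", "6D"])

# Crimes per 100,000 residents
rates["Counts/100k"] = rates["Counts"] / rates["Population"] * 100_000

# Share of each district in the sum of the per-capita rates (what the pie shows)
rates["Share %"] = (rates["Counts/100k"] / rates["Counts/100k"].sum() * 100).round(1)

print(rates.sort_values("Counts/100k", ascending=False))
```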

### Crime analysis by City & PD

In [20]:
# group crimes based on Police District Number / City; get the value_counts
pd_city_series=crimes.loc[:,['Police District Number','City']].groupby('Police District Number')['City'].apply(lambda s: s.value_counts())

# get the names of cities that appear on different PDs
# convert series to frame; reset index; rename the columns
pd_city=pd_city_series.to_frame()
pd_city.reset_index(inplace=True)
pd_city.columns=['Police District Number','City','Counts']
pd_city_series

Police District Number                    
1D                      ROCKVILLE             2375
                        GAITHERSBURG           398
                        POTOMAC                365
                        DERWOOD                154
                        POOLESVILLE            104
                        GERMANTOWN              37
                        DICKERSON               19
                        BOYDS                    4
                        BEALLSVILLE              2
                        SILVER SPRING            1
2D                      BETHESDA              1733
                        ROCKVILLE              531
                        CHEVY CHASE            492
                        KENSINGTON             274
                        POTOMAC                161
                        SILVER SPRING          157
                        CABIN JOHN              18
                        GLEN ECHO                4
3D                      SILVER SPRING  

One thing that sticks out is that some cities appear in several districts. This could be because the crime locations are close to multiple PDs.<br>
Let's do a map visualization

In [21]:
# get duplicated cities 
pd_city_dup=pd_city[pd_city.duplicated('City')==True]['City'].unique()
pd_city_dup

array(['ROCKVILLE', 'POTOMAC', 'SILVER SPRING', 'CHEVY CHASE',
       'KENSINGTON', 'SPENCERVILLE', 'GAITHERSBURG', 'GERMANTOWN',
       'BOYDS', 'DICKERSON', 'DERWOOD', 'BROOKEVILLE', 'POOLESVILLE',
       'MONTGOMERY VILLAGE', 'OLNEY', 'TAKOMA PARK'], dtype=object)
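
An alternative way to find cities recorded under more than one district is to count distinct districts per city. The sketch below uses a made-up `toy` frame in place of `crimes[['Police District Number','City']]`.

```python
import pandas as pd

# Toy rows standing in for crimes[['Police District Number', 'City']]
toy = pd.DataFrame({
    "Police District Number": ["1D", "1D", "2D", "2D", "3D"],
    "City": ["ROCKVILLE", "POTOMAC", "ROCKVILLE", "BETHESDA", "SILVER SPRING"],
})

# Number of distinct districts in which each city appears
districts_per_city = toy.groupby("City")["Police District Number"].nunique()

# Cities recorded under more than one district
multi_district = districts_per_city[districts_per_city > 1].index.tolist()
print(multi_district)
```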

## City map visualization

In [22]:
import folium

map_1 = folium.Map(location=[39.154743, -77.240515],
                   zoom_start=9.7,
                   tiles='Stamen Terrain')
folium.Marker([39.0838889, -77.1530556], popup='ROCKVILLE').add_to(map_1)
folium.Marker([39.0180556, -77.2088889], popup='POTOMAC').add_to(map_1)
folium.Marker([38.9905556, -77.0263889], popup='SILVER SPRING').add_to(map_1)
folium.Marker([38.9712215, -77.0763667], popup='CHEVY CHASE').add_to(map_1)
folium.Marker([39.0256651, -77.0763669], popup='KENSINGTON').add_to(map_1)
folium.Marker([39.1142747, -76.9783097], popup='SPENCERVILLE').add_to(map_1)
folium.Marker([39.1433333, -77.2016667], popup='GAITHERSBURG').add_to(map_1)
folium.Marker([39.1730556, -77.2719444], popup='GERMANTOWN').add_to(map_1)
folium.Marker([39.1837171, -77.3127623], popup='BOYDS').add_to(map_1)
folium.Marker([39.11733,  -77.1610916], popup='DERWOOD').add_to(map_1)
folium.Marker([39.1806623, -77.0591452], popup='BROOKEVILLE').add_to(map_1)
# NOTE: these coordinates are identical to POTOMAC's above and look like a copy-paste slip
folium.Marker([39.0180556, -77.2088889], popup='POOLESVILLE').add_to(map_1)
folium.Marker([39.1766667, -77.1955556], popup='MONTGOMERY VILLAGE').add_to(map_1)
folium.Marker([39.1530556, -77.0672222], popup='OLNEY').add_to(map_1)
folium.Marker([38.9777778, -77.0077778], popup='TAKOMA PARK').add_to(map_1)

map_1

Because it's not clear why a city name appears on multiple PDs, I will focus the analysis on other areas

### Analyze crime by location:
1. Determine the most and least common crime locations (e.g. residence, street) for Montgomery County and for each PD.
2. Get the 10 most common and 10 least common locations

In [23]:
# import graph modules
import plotly.graph_objs as go
from plotly import tools

# Will get the total number of crimes committed
total=crimes['Place'].count()

# Get the value_counts() for all the places
place=crimes['Place'].value_counts()

# Convert the series to a df
place=pd.DataFrame({'Place':place.index,'Counts':place}).reset_index(drop=True)

# Get the crime percentage for a given location.
place['Percent']=place['Counts']*100/total

# 10 most/least common
trace1=go.Bar(x=place['Place'].head(10),y=place['Percent'].head(10),name='most common')
trace2=go.Bar(x=place['Place'].tail(10),y=place['Percent'].tail(10),name='least common')

fig = tools.make_subplots(rows=1, cols=2, subplot_titles=('Montgomery 10 most common crime locations', 
                                                          'Montgomery 10 least common crime locations'))

fig.append_trace(trace1,1,1)
fig.append_trace(trace2,1,2)

fig['layout'].update(height=500,width=1000,autosize=True,margin=go.layout.Margin(b=135),title='Montgomery crime percentage by location')
iplot(fig, filename='stacked-subplots-shared-xaxes')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]




### 10 most common locations:
1. Residence (Single Family; Apartment; Townhouse/Duplex; Driveway)
2. Street(in Vehicle; Residential; Commercial)
3. Parking Lot (Commercial; Residential)

### 10 least common locations:
1. Jail
2. Retail(Dry Cleaner; Video Store)
3. Liquor Store
4. Lake 
5. Pawn Shop
6. Residence(Mobile Home)
7. Nursery
8. Parking Lot(Park & Ride)

###  10 most/least common crime locations on PDs

In [24]:
# First I will create a df where I will group the df by 'Police District Number' chain it to 'Place' column and get
# the value_counts(), and create a new column called Counts
crimes_district_location=pd.DataFrame({'Counts' : crimes.groupby(['Police District Number'])['Place'].value_counts()}).reset_index()

def find_pcent(row):
    # Get the total crimes per district
    p_d_val_counts = crimes['Police District Number'].value_counts()
    
    # get a list of unique police district numbers and iterate over it
    for p_d in crimes['Police District Number'].unique():
        # for each Police District Number get the percentage
        if row['Police District Number'] == p_d:
            return (row['Counts'] / p_d_val_counts[p_d]) * 100

# Use an apply function to compute the percentage for each PD / location        
crimes_district_location['%'] = crimes_district_location.apply(find_pcent, axis = 1)
crimes_district_location.head()

Unnamed: 0,Police District Number,Place,Counts,%
0,1D,Residence - Single Family,509,14.715236
1,1D,Other/Unknown,303,8.759757
2,1D,Street - In vehicle,287,8.297196
3,1D,Street - Residential,201,5.810928
4,1D,Residence - Driveway,183,5.290546


### Plotting functions

1. To make the code modular, I decided to implement 3 plotting functions
2. These are the benefits:
    1. reduce the overall code
    2. put emphasis on the plot not the plotting code
    3. focus more on the actual data behind the plot
 
 
3. plot_bar_staked:

    1. Official doc: https://plot.ly/python/bar-charts/
    2. plotting will be done based on a data list
    3. data is comprised of traces
    4. a trace is made out of:
        1. labels (x on the axis) 
        2. values (y on the axis)
        3. name of the particular trace
    5. particular to this code we will plot data based on top/least values
    6. so that means we will have separate data and traces for each case

In [25]:
def plot_bar_staked(df,part):
    # one go.Bar trace per police district, for the top ('head') or bottom ('tail') places
    data_h=[]
    data_t=[]
    names=['PD_1D','PD_2D','PD_3D','PD_4D','PD_5D','PD_6D']
    # 'district' instead of 'pd' so the pandas alias is not shadowed
    for district,n in zip(['1D','2D','3D','4D','5D','6D'],names):
        # rows for this district, taken from the df argument rather than a global
        rows=df[df['Police District Number']==district]
        # head()/tail() give the 5 most/least common places, as value_counts() sorts descending
        data_h.append(go.Bar(x=rows['Place'].head(),y=rows['%'].head(),name=n))
        data_t.append(go.Bar(x=rows['Place'].tail(),y=rows['%'].tail(),name=n))
    
    if part=='head':
        return data_h
    elif part=='tail':
        return data_t

In [26]:
# plotting most common locations
# 1) need to pass the data for the analysis to the plotting function, in this case 'crimes_district_location'
# 2) need to specify whether we are looking for the most or least common locations, in this case the most common, i.e. part='head'

df=crimes_district_location
part='head'
data=plot_bar_staked(df,part)

# create figure layout
layout = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='stack',
    autosize=True,
    margin=go.layout.Margin(b=160),
    title='5 most common crime locations across all Police Districts'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='angled-text-bar')




### Most common crime locations:
1. Residence - Single Family
2. Street - In vehicle
3. Residence - Apartment/Condo
4. Street - Residential
5. Other
6. Parking Lot
7. Residence - Townhouse/Duplex
8. Residence - Driveway

In [27]:
# create plot trace for each district; get 5 least common locations
df=crimes_district_location
part='tail'
data=plot_bar_staked(df,part)

#create fig layout
layout = go.Layout(
    xaxis=dict(tickangle=-45),
    barmode='stack',
    autosize=True,
    margin=go.layout.Margin(b=140),
    title='5 least common crime locations across all Police Districts'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig, filename='angled-text-bar')




### Least common crime locations:
* Liquor Store
* Theater
* Pawn Shop
* Retail (Jewelry; Video Store; Dry Cleaner; Hardware)
* Nursery
* Parking Lot (Church; Park & Ride; Rec Center; Metro)
* Residence (Mobile Home; Carport)
* Laundromat
* Check Cashing

### Analyze violent/non-violent crimes:
According to the UCR (Uniform Crime Reporting) definition, violent crimes are murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault. The UCR hierarchy lists them first, followed by the property crimes:<br>
    "The descending order of UCR violent crimes are murder and nonnegligent manslaughter, forcible rape, robbery, and aggravated assault, followed by the property crimes of burglary, larceny-theft, and motor vehicle theft. Although arson is also a property crime, the Hierarchy Rule does not apply to the offense of arson. In cases in which an arson occurs in conjunction with another violent or property crime, both crimes are reported, the arson and the additional crime."

The goal is to:
    1. get a percentage breakdown between violent and non-violent crimes
    2. get a percentage breakdown of the violent subcategories
    3. analyze dispatch time for the above-mentioned cases
    
Columns used:<br>
    Class | Four digit code identifying the crime type of the incident<br>
    Class Description | Common name description of the incident class type<br> 

Official reference:
https://ucr.fbi.gov/crime-in-the-u.s/2011/crime-in-the-u.s.-2011/violent-crime/violent-crime
![alt text](Violent_crime_UCR.png "Title")

### Violent crime class analysis

In [28]:
# Set pandas to display all rows. By default pandas prints only the first and last rows of the df
import pandas as pd
pd.set_option('display.max_rows', None)
pd.options.display.float_format = '{:20,.2f}'.format

# Slice the df on 'Class' and 'Class Description'
violent_crimes=crimes.loc[:,['Class','Class Description']]

# get unique class values & sort ascending
print(violent_crimes.sort_values(by='Class').drop_duplicates())

# Reset pandas display max rows to default value
pd.reset_option('display.max_rows')

       Class                                Class Description
17839    111                                 HOMICIDE-FIREARM
12606    115                                   HOMICIDE-OTHER
4379     211                                       RAPE-FORCE
5508     212                           RAPE - ATTEMPT - FORCE
16701    311                             ROB FIREARM - STREET
12576    312                         ROB FIREARM - COMMERCIAL
6594     313                        ROB FIREARM - GAS/SVC STA
19925    314                        ROB FIREARM - CONV. STORE
16228    315                        ROB FIREARM - RESIDENTIAL
14217    316              ROB FIREARM - FINANCIAL INSTITUTION
4358     317                              ROB FIREARM - OTHER
1593     318                              ROB FIREARM CARJACK
13358    321                      ROB KNIFE/CUT INST - STREET
11292    322                        ROB KNIFE/CUT INST - COMM
15268    324                      ROB KNIFE/CUT - CONV. STORE
9334    

1. Crimes identified by a Class between 111 and 933 could be classified as violent.
2. Classes 911-933 fall into the arson category. Arson on its own would not count as violent, but I tend to believe these incidents involve a combination of crimes, some of which could be classified as violent.
3. For the time being I will classify all crimes with a Class between 111 and 933 as violent. Note that this range also includes burglary, larceny and auto theft, which the UCR quote above lists as property crimes.
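
As a quick illustration of this working rule, here is a minimal sketch of the classification as a plain function. The name `classify` is hypothetical, and the sketch assumes that no Class code below 111 occurs, so everything outside the range is labelled non-violent.

```python
def classify(class_code):
    """Label a crime Class code using the working rule above (111-933 -> violent)."""
    # Assumption: codes below 111 do not occur in the data,
    # so every code outside the 111-933 range is treated as non-violent.
    return "Violent" if 111 <= class_code <= 933 else "Non-Violent"

# a few Class codes that appear in the dataset preview
for code in (211, 634, 933, 1712, 2938):
    print(code, classify(code))
```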

### Setup violent crime classification

In [29]:
crimes.loc[((crimes['Class']>=111) &(crimes['Class']<=933)),'Violent Crimes']='Violent'
crimes.loc[(crimes['Class']>933),'Violent Crimes']='Non-Violent'

#print first 5 rows
crimes.loc[:,['Class','Class Description','Violent Crimes']].head()

Unnamed: 0,Class,Class Description,Violent Crimes
10,2938,POL INFORMATION,Non-Violent
13,821,SIMPLE ASSAULT - CITIZEN,Violent
14,634,LARCENY FROM AUTO UNDER $50,Violent
15,2623,SUICIDE-ATTEMPT-POISON/OVERDOSE,Non-Violent
16,1712,SEX OFFENSE- INDECENT EXPOSURE,Non-Violent


### Violent/Non-Violent crime distribution

In [30]:
# Getting violent/non violent crime counts
violent_count=crimes['Violent Crimes'].value_counts()

# used the same process as for the time analysis pie chart
violent_count=violent_count.to_frame()
violent_count.reset_index(inplace=True)
violent_count.rename(columns={'index':'labels','Violent Crimes':'values'}, inplace=True)

# plot
violent_count.iplot(kind='pie', labels='labels',values='values',title='Violent/Non-Violent crime distribution in Montgomery county')
violent_count

Unnamed: 0,labels,values
0,Non-Violent,13279
1,Violent,9929


### Violent/Non-Violent breakdown:
* 42.8% violent crimes; 57.2% non-violent
* This is a bit of a surprise, as the violent and non-violent shares are quite close. I would have expected a much wider gap between the two.
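
The percentages follow directly from the two counts printed above. A short worked calculation, using only those counts:

```python
# counts taken from the value_counts() output above
counts = {"Non-Violent": 13279, "Violent": 9929}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(shares)
```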

### Pie Plotting function:
* pie_plot:
    * Ref: https://plot.ly/python/pie-charts/
    * this type of plot needs a figure and a layout
    * both figure and layout are dictionaries
    * the figure is comprised of:
        * a domain for each pie: the x and y ranges it occupies, with values in [0,1]
        * labels list
        * values list
    * in this code I use zip to iterate over the items mentioned above at the same time and append each pie to data
    
    
* make_annotations
    * annotations are part of the layout dictionary
    * annotations are dictionaries as well
    * I decided I didn't want an entire function for the layout, but I do need one for the annotations
    * make_annotations requires:
        * size - font size of the text
        * text - text associated with the pie chart
        * x - position of the annotation on the x axis
        * y - position of the annotation on the y axis
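
The domains in this notebook are written out by hand. As an alternative, they could be generated for any grid size. The sketch below is only an illustration: the helper `grid_domains` and its `gap` parameter are hypothetical and not part of the notebook.

```python
def grid_domains(rows, cols, gap=0.05):
    """Split the unit square into rows x cols pie domains, top row first."""
    width = (1 - gap * (cols - 1)) / cols
    height = (1 - gap * (rows - 1)) / rows
    domains = []
    for r in range(rows):
        # the y axis starts at the bottom, so the first row gets the highest y range
        y0 = 1 - (r + 1) * height - r * gap
        for c in range(cols):
            x0 = c * (width + gap)
            domains.append({"x": [round(x0, 3), round(x0 + width, 3)],
                            "y": [round(y0, 3), round(y0 + height, 3)]})
    return domains

# six pies laid out in 2 rows and 3 columns
domains = grid_domains(rows=2, cols=3)
```

Every x and y range stays within [0,1], which is what the pie traces expect for their `domain` entry.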

In [31]:
def pie_plot(domains,labels,values):
    # build one pie trace (a plain dict) per (domain, labels, values) triple
    data=[]
    for domain,label,value in zip(domains,labels,values):
        data.append({
            'labels': label,
            'values': value,
            'type': 'pie',
            'name': 'Starry Night',
            'domain': domain,
            'hoverinfo':'label+percent+name',
            'hole': .4,
            'pull': .2
            })
    return data

def make_annotations(size,text,x,y):
    # one annotation dict per pie; text, x and y must have the same length
    if not (len(text)==len(x)==len(y)):
        raise ValueError('Length mismatch')
    data=[]
    for t,a,b in zip(text,x,y):
        data.append({
            "font":{"size":size},
            "showarrow": False,
            "text":t,
            "x":a,
            "y":b
        })
    return data

### Violent/Non-Violent crime distribution per PD

In [32]:
# Slice the dataframe based on Police District Number & Violent Crimes
vc_pd=crimes.loc[:,['Police District Number','Violent Crimes']]

labels=[]
values=[]

# loop through each PD (the loop variable is not called pd, to avoid shadowing the pandas alias)
for district in ['1D','2D','3D','4D','5D','6D']:
    
    # get the violent/non-violent crime counts for each PD
    item=vc_pd[vc_pd['Police District Number']==district]['Violent Crimes'].value_counts()
    
    # create 2 list from the above step; labels & values
    l=item.index.tolist()
    v=item.values.tolist()
    labels.append(l)
    values.append(v)

In [33]:
domains=[({'x': [0, .20],'y': [0, .49]}),
         ({'x': [.35, .55],'y': [0, .49]}),
         ({'x': [.70, .90],'y': [0, .49]}),
         ({'x': [0, .20],'y': [.50, .93]}),
         ({'x': [.35, .55],'y': [.50, .93]}),
         ({'x': [.70, .90],'y': [.50, .93]})]
          
data=[]
data=pie_plot(domains,labels,values)

size=20
text=['PD_1D','PD_2D','PD_3D','PD_4D','PD_5D','PD_6D']
x=[0.05,0.45,0.85,0.05,0.45,0.85]
y=[1.03,1.03,1.03,0.48,0.48,0.48]
anno=make_annotations(size,text,x,y)

layout = dict(height = 750,
              width = 1000,
              autosize = False,
              title = 'Violent/Non-Violent crime distribution per Police District',
              annotations= anno
              )  

fig = dict(data=data, layout=layout)
iplot(fig, filename='pie-subplots')

### PD violent/non-violent breakdown:
1D | Rockville:
    1. violent 40.8%
    2. non-violent 59.2%
2D | Bethesda:
    1. violent 38.7%
    2. non-violent 61.3%
3D | Silver Spring:
    1. violent 41.6%
    2. non-violent 58.4%
4D | Wheaton:
    1. violent 39.4%
    2. non-violent 60.6%
5D | Germantown:
    1. violent 50.7%
    2. non-violent 49.3%
6D | Gaithersburg:
    1. violent 44.5%
    2. non-violent 55.5%
 
1. Excluding PD_5D, violent crimes range from 38.7% to 44.5%
2. Excluding PD_5D, non-violent crimes range from 55.5% to 61.3%
3. The only surprise is PD_5D, where violent crimes outnumber non-violent crimes; the violent/non-violent distribution is 50.7%/49.3%
4. highest violent percentage: PD_5D 50.7%
5. lowest violent percentage: PD_2D 38.7%
6. highest non-violent percentage: PD_2D 61.3%
7. lowest non-violent percentage: PD_5D 49.3%

### Analyze violent/non-violent crime distribution per PD & Place

1. Get the top 5 places where violent/non-violent crimes are committed
2. For each PD, slice the dataframe on PD number and crime classification
3. Sort the values in descending order
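
The three steps above can be sketched without pandas, using `collections.Counter` on a handful of made-up incidents. The sample data and the helper `top_places` are hypothetical and only illustrate the filter, count and sort logic.

```python
from collections import Counter

# Hypothetical incidents: (police district, place, classification)
incidents = [
    ('1D', 'Street', 'Violent'), ('1D', 'Street', 'Violent'),
    ('1D', 'Parking Lot', 'Violent'), ('1D', 'Residence', 'Non-Violent'),
    ('2D', 'Residence', 'Violent'), ('2D', 'Residence', 'Non-Violent'),
    ('2D', 'Residence', 'Non-Violent'), ('2D', 'Street', 'Non-Violent'),
]

def top_places(incidents, district, classification, n=5):
    """Count places for one district and one classification, most common first."""
    counts = Counter(place for d, place, c in incidents
                     if d == district and c == classification)
    return counts.most_common(n)

print(top_places(incidents, '1D', 'Violent'))
print(top_places(incidents, '2D', 'Non-Violent', n=1))
```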

In [34]:
import pandas as pd

# Create a slice of the df based on 'Police District Number','Place','Violent Crimes'
crimes_pd_place=crimes.loc[:,['Police District Number','Place','Violent Crimes']]

# get violent/non-violent crime counts per PD & Place
# 1) Create a new df by grouping the slice based on 'Police District Number','Place'
# 2) Chain the df to 'Violent Crimes',get value_counts(),reset the index

pd_place_vc=pd.DataFrame({'Counts' : crimes_pd_place.groupby(['Police District Number','Place'])['Violent Crimes'].value_counts()}).reset_index()
labels_v=[]
values_v=[]

labels_nv=[]
values_nv=[]

# loop through the PDs (the loop variable is not called pd, to avoid shadowing the pandas alias)
for district in ['1D','2D','3D','4D','5D','6D']:
    # get the violent/non-violent crime counts for each PD
    item_v=pd_place_vc[(pd_place_vc['Police District Number']==district)& (pd_place_vc['Violent Crimes']=='Violent')].sort_values(by='Counts',ascending=False).head()
    item_nv=pd_place_vc[(pd_place_vc['Police District Number']==district)& (pd_place_vc['Violent Crimes']=='Non-Violent')].sort_values(by='Counts',ascending=False).head()
    
    # create labels,values lists for each case i.e violent/non-violent
    labels_v.append(item_v['Place'])
    values_v.append(item_v['Counts'])
    
    labels_nv.append(item_nv['Place'])
    values_nv.append(item_nv['Counts'])
    

### Plotting the most common places where violent crimes are committed per PD

### Grouped bar subplots plotting functions:
1. plot_subplots:

    1. Ref:  https://plot.ly/python/subplots/
    2. this requires a figure and a layout
    3. the figure is comprised of individual traces and a (row, column) position for each
    4. each trace is made of x (labels) and y (values)
    5. in this function I use zip to iterate over labels, values and names at the same time and construct the traces
    6. make the figure layout
    7. append each trace to the figure
    
    
2. make_titles:
    1. this creates a list of titles
    2. each title is made of a base string plus another string, typically the PD name
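
To make the (row, column) assignment concrete, here is a small sketch of how the grid cells can be enumerated row by row and paired with the trace names. The helper `grid_positions` is hypothetical; it only illustrates the ordering and makes no plotly calls.

```python
from itertools import product

def grid_positions(rows, cols):
    """Return 1-based (row, col) pairs for a rows x cols subplot grid, filled row by row."""
    return list(product(range(1, rows + 1), range(1, cols + 1)))

names = ['PD_1D', 'PD_2D', 'PD_3D', 'PD_4D', 'PD_5D', 'PD_6D']
for name, (r, c) in zip(names, grid_positions(3, 2)):
    print(name, '-> cell', (r, c))
```

Enumerating the cells this way works for any number of columns, not only for two.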

In [35]:
def plot_subplots(labels,values,row,col,names,titles):
    # every subplot needs labels, values, a trace name and a title
    if not (len(labels)==len(values)==len(names)==len(titles)):
        raise ValueError('Length mismatch')

    # one bar trace per subplot
    traces=[go.Bar(x=l,y=v,name=n) for l,v,n in zip(labels,values,names)]

    fig = tools.make_subplots(rows=row, cols=col, subplot_titles=titles)

    # (row, column) position of every grid cell, filled row by row;
    # range() covers every column, so grids wider than 2 columns work too
    positions=[(r,c) for r in range(1,row+1) for c in range(1,col+1)]

    for trace,(r,c) in zip(traces,positions):
        fig.append_trace(trace,r,c)

    return fig

        
def make_titles(names,base_name):
    titles=[]
    for n in names:
        titles+=[str(base_name)+' '+str(n)]
    return titles

### Plot crime locations distribution/PD

In [36]:
# setup labels & values
labels=labels_v
values=values_v

# strings used to construct titles
names=['PD_1D','PD_2D','PD_3D','PD_4D','PD_5D','PD_6D']
base_name='Top 5 violent crime locations'

#construct titles list
titles=make_titles(names,base_name)

row=3;
col=2

#create figure layout
fig=plot_subplots(labels,values,row,col,names,titles)

#plot
fig['layout'].update(height=1200,width=1000, margin=go.layout.Margin(b=155), title='Top 5 Violent crimes locations/PD')
iplot(fig, filename='stacked-subplots-shared-xaxes')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]




### Top 5 violent crime locations:
1. The actual places where these crimes are committed are more specific (e.g. Residence - Single Family), but I will not focus on that as I feel it would not add substantial value to the analysis
2. The order of these locations differs between PDs; I will not focus on that either
3. The most common places where violent crimes are committed are Residence; Street; Parking Lot; Department/Retail Store

In [37]:
#The same process as above will follow

labels=labels_nv
values=values_nv

base_name='Top 5 non-violent crime locations'
titles=make_titles(names,base_name)

row=3;
col=2

fig=plot_subplots(labels,values,row,col,names,titles)

fig['layout'].update(height=1200,width=1000, margin=go.layout.Margin(b=130), title='Top 5 Non-Violent crimes locations/PD')
iplot(fig, filename='stacked-subplots-shared-xaxes')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]




### Top 5 non-violent crime locations:
1. Most common non-violent crime locations are Residence;Street;Parking Lot;Department/Retail Store
2. These are the same locations as for violent crimes

### Analyze violent crimes by main categories

1. The goal here is to find the percentage of each violent crime subcategory.
2. Classification structure:
    1. create a function that is applied to the 'Class' column and maps each code to a main category (Homicide, Rape, etc.)
    2. depending on the value, the function returns the overall category
    3. store the results in a new column
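
One way to express such a mapping is a lookup table of code ranges instead of a chain of conditions. The sketch below is an alternative illustration, not the notebook's function: the names `CATEGORY_RANGES` and `main_category` are hypothetical, and the bounds mirror the ones used by `class2nd`, assuming integer Class codes.

```python
# (lower bound exclusive, upper bound inclusive, main category)
CATEGORY_RANGES = [
    (100, 199, 'HOMICIDE'),
    (200, 299, 'RAPE'),
    (300, 399, 'ROBBERY'),
    (400, 499, 'AGG ASSAULT'),
    (500, 599, 'BURGLARY'),
    (600, 699, 'LARCENY'),
    (700, 799, 'AUTO THEFT'),
    (800, 815, 'ASSAULT & BATTERY'),
    (815, 825, 'ASSAULT'),
    (900, 998, 'ARSON'),
]

def main_category(class_code):
    """Return the main category of a Class code, or None if it falls in no range."""
    for low, high, category in CATEGORY_RANGES:
        if low < class_code <= high:
            return category
    return None

print(main_category(634), main_category(821), main_category(2938))
```

Keeping the ranges in one table makes it easier to spot gaps, for example codes 826-900, which get no main category here.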

In [38]:
def class2nd(x):
    # map a Class code to its main violent crime category; other codes return None
    if 100 < x <= 199:
        return 'HOMICIDE'
    elif 200 < x <= 299:
        return 'RAPE'
    elif 300 < x <= 399:
        return 'ROBBERY'
    elif 400 < x <= 499:
        return 'AGG ASSAULT'
    elif 500 < x <= 599:
        return 'BURGLARY'
    elif 600 < x <= 699:
        return 'LARCENY'
    elif 700 < x <= 799:
        return 'AUTO THEFT'
    elif 800 < x <= 815:
        return 'ASSAULT & BATTERY'
    elif 815 < x <= 825:
        return 'ASSAULT'
    elif 900 < x < 999:
        return 'ARSON'
    
crimes['Class Main Cathegory']=crimes['Class'].apply(class2nd)

### Get violent crimes count by main class category

In [39]:
import pandas as pd

# 1) create a new df and exclude non-violent crimes (rows without a main category)
# 2) group by PD and get the counts for each violent crime main category

violent_main=pd.DataFrame({'Counts': crimes[crimes['Class Main Cathegory'].notnull()].groupby(['Police District Number'])['Class Main Cathegory'].value_counts()}).reset_index()

#print preview
violent_main

Unnamed: 0,Police District Number,Class Main Cathegory,Counts
0,1D,LARCENY,845
1,1D,BURGLARY,220
2,1D,ASSAULT,94
3,1D,ASSAULT & BATTERY,87
4,1D,AUTO THEFT,56
5,1D,ROBBERY,40
6,1D,AGG ASSAULT,12
7,1D,RAPE,5
8,1D,ARSON,3
9,2D,LARCENY,1307


In [40]:
labels_v=[]
values_v=[]

# loop through each PD (the loop variable is not called pd, to avoid shadowing the pandas alias)
for district in ['1D','2D','3D','4D','5D','6D']:
    # for each PD get the main category & counts
    l=violent_main[violent_main['Police District Number']==district]['Class Main Cathegory']
    v=violent_main[violent_main['Police District Number']==district]['Counts']
    
    # create labels & values list
    labels_v.append(l)
    values_v.append(v)    

###  Plot main violent crimes categories distribution on PDs

In [41]:
labels=labels_v
values=values_v

base_name='Violent crime distribution'
titles=make_titles(names,base_name)

row=3
col=2
# plot
fig=plot_subplots(labels,values,row,col,names,titles)

fig['layout'].update(height=1200,width=1000, title='Violent Crimes/PD by main category')
iplot(fig, filename='stacked-subplots-shared-xaxes')

This is the format of your plot grid:
[ (1,1) x1,y1 ]  [ (1,2) x2,y2 ]
[ (2,1) x3,y3 ]  [ (2,2) x4,y4 ]
[ (3,1) x5,y5 ]  [ (3,2) x6,y6 ]



### Main violent crimes categories distribution on PDs - Pie Chart

In [42]:
domains=[({'x':[0, 0.40], 'y':[0.60,1.0]}), # for the cell (1,1)
         ({'x':[0.60, 1], 'y':[0.60,1]}),#cell (1,2)
         ({'x':[0, 0.40], 'y':[0.30, 0.66]}), #cell (2,1)
         ({'x':[0.60, 1], 'y':[0.30, 0.66]}),#cell (2,2)
         ({'x':[0, 0.40], 'y':[0,0.30]}),#cell (3,1)
         ({'x': [0.60, 1], 'y':[0,0.30]})]#cell (3,2)

data=pie_plot(domains,labels,values)

size=24
x=[0.14,0.86,0.14,0.86,0.14,0.86]
y=[0.82,0.82,0.48,0.48,0.13,0.13]

anno=make_annotations(size,text,x,y)
layout = dict(height = 1200,
              width = 1000,
              autosize = False,
              title = 'Violent Crimes/PD by main category',
              annotations=anno
             )

fig = dict(data=data, layout=layout)
iplot(fig, filename='pie-subplots')

### Violent crimes sub categories breakdown:
For the most part, the order of the violent crimes by percentage, from high to low, is the following (I did not focus on the exact order): 
    1. Larceny
    2. Burglary
    3. ASSAULT & BATTERY
    4. ASSAULT
    5. AUTO THEFT
    6. ROBBERY
    7. AGGRAVATED ASSAULT
    8. RAPE; ARSON; HOMICIDE
1. Larceny:
    1. ranges from 59.4%-76.4%
    2. highest percent PD_2D
    3. lowest percent PD_4D
2. Burglary:
    1. ranges from 11%-16.2%
    2. highest percent PD_1D
    3. lowest percent PD_2D
3. ASSAULT & BATTERY:
    1. ranges from 5.2%-11.4%
    2. highest percent PD_4D
    3. lowest percent PD_2D
4. ASSAULT:
    1. ranges from 3.22%-7.1%
    2. highest percent PD_5D
    3. lowest percent PD_2D
5. AUTO THEFT:
    1. ranges from 1.87%-5.27%
    2. highest percent PD_3D
    3. lowest percent PD_2D
6. ROBBERY:
    1. ranges from 1.75%-5.35%
    2. highest percent PD_3D
    3. lowest percent PD_2D
7. AGGRAVATED ASSAULT:
    1. ranges from 0.292%-2.12%
    2. highest percent PD_3D
    3. lowest percent PD_2D

### Time analysis based on violent/non-violent classification

I will check whether dispatch time differs between violent and non-violent crimes.<br>
I will also break down the dispatch time by violent crime subcategory.<br>
The assumption is that the more violent the crime, the faster the dispatch.

In [43]:
import pandas as pd

#create a new df by filtering out the dates before 01.07.2013 & looking at 'Violent Crimes' and 'Date diff' columns
new_td=crimes[crimes['Start Date / Time']>=dt.datetime(2013,7,1)].loc[:,['Violent Crimes','Date diff']].copy()

# function defining dispatch time classification
def time_class(x):
    # assign a dispatch time delta to an interval; lower bounds are implied by the elif order
    if x < pd.Timedelta('1 minute'):
        return 'Less than 1 minute'
    elif x < pd.Timedelta('5 minute'):
        return 'Between 1 and 5'
    elif x < pd.Timedelta('10 minute'):
        return 'Between 5 and 10'
    elif x < pd.Timedelta('30 minute'):
        return 'Between 10 and 30'
    elif x < pd.Timedelta('60 minute'):
        return 'Between 30 and 60'
    elif x >= pd.Timedelta('60 minute'):
        # >= so that a delta of exactly 60 minutes is not left unclassified
        return 'Above 60'

def time_analysis_v_nv(df):
    labels=[]
    values=[]
    # loop through violent/non-violent classification
    for classif in ['Violent','Non-Violent']: 
        
        # filter the df passed in, apply the time classification function, then get the counts
        td=df[df['Violent Crimes']==classif].copy()
        td['Time Deltas']=td['Date diff'].apply(time_class)
        td=td['Time Deltas'].value_counts()
        
        # convert the series to two lists: labels & values
        l=td.index.tolist()
        v=td.values.tolist()
        
        labels.append(l)
        values.append(v)
    
    return (labels,values)

### Plotting dispatch time interval based on violent/non-violent classification

In [44]:
# get data list for plotting a pie chart
domains=[({"x": [0, .40]}),({"x": [.52, .92]})]
[labels,values]=time_analysis_v_nv(df=new_td)
data=pie_plot(domains,labels,values)

# creating the layout
layout = dict(autosize = True,
              margin=go.layout.Margin(b=50),
              title="Dispatch Time based on violent/non-violent crime classification",
              annotations= [{"font": {"size": 15},"showarrow": False,"text": "Violent Crimes Time Deltas","x": 0.07, "y": .97},
                            {"font": {"size": 15},"showarrow": False,"text": "Non-Violent Crimes Time Deltas","x": 0.90, "y": .97}]
              )
# plot
figxx = dict(data=data, layout=layout)
iplot(figxx, filename='pie-subplots')




### Violent/Non-Violent crime dispatch time breakdown:

* less than 1 minute ---> 82.7% decrease from non-violent to violent
* between 1 minute and 5 minutes ---> 27.55% decrease from non-violent to violent
* between 5 minutes and 10 minutes ---> 41.39% increase from non-violent to violent
* between 10 minutes and 30 minutes ---> 97.08% increase from non-violent to violent
* between 30 minutes and 60 minutes ---> 76.78% increase from non-violent to violent
* above 60 minutes ---> 74.27% increase from non-violent to violent
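
The notebook does not show how these figures were derived. One plausible reading, and it is only an assumption, is a relative change between the share a time bucket has in the non-violent pie and its share in the violent pie. A minimal sketch with made-up shares (the function `relative_change` is hypothetical):

```python
def relative_change(non_violent_share, violent_share):
    """Percent change going from the non-violent share to the violent share."""
    return (violent_share - non_violent_share) / non_violent_share * 100

# Hypothetical shares (in %) of one dispatch-time bucket in each pie chart
decrease = relative_change(40.0, 30.0)   # negative value = decrease
increase = relative_change(10.0, 15.0)   # positive value = increase
print(decrease, increase)
```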

### Time delta analysis based on violent class

In [45]:
# create a data frame for violent crimes
td_v=crimes[((crimes['Violent Crimes']=='Violent')& (crimes['Start Date / Time']>=dt.datetime(2013,7,1)))].loc[:,["Date diff",'Class Main Cathegory']]

data_l=[]
data_v=[]
item_d=[]

#loop through violent crimes categories 
for crime in td_v['Class Main Cathegory'].unique():
    #construct 2 lists for plotting labels & values
    data_l+=[str(crime)+'_l']
    data_v+=[str(crime)+'_v']
    
    # for each violent crime type get dispatch time deltas
    item=td_v[td_v['Class Main Cathegory']==crime]['Date diff']
    item=item.apply(time_class).value_counts()
    
    # store labels & values in their corresponding lists
    data_l[len(data_l)-1]=item.index
    data_v[len(data_v)-1]=item.values

In [46]:
domains=[({'x':[0, 0.40], 'y':[0.85, 1]}),    # for the cell (1,1)
         ({'x':[0.60, 1], 'y':[0.85, 1]}),    # cell (1,2)
         ({'x':[0, 0.40], 'y':[0.65, 0.80]}), # cell (2,1)
         ({'x':[0.60, 1], 'y':[0.65, 0.80]}), # cell (2,2)
         ({'x':[0, 0.40], 'y':[0.45, 0.60]}), # cell (3,1)
         ({'x':[0.60, 1], 'y':[0.45, 0.60]}), # cell (3,2)
         ({'x':[0, 0.40], 'y':[0.25,  0.40]}),# cell (4,1)
         ({'x':[0.60 ,1], 'y':[0.25,  0.40]}),# cell (4,2)
         ({'x':[0, 0.40], 'y':[0.05,  0.20]}),# cell (5,1)
         ({'x':[0.5, 1], 'y':[0.05,  0.20]})] # cell (5,2)

labels=data_l
values=data_v

data=pie_plot(domains,labels,values)
text=td_v['Class Main Cathegory'].unique()
text[1]='ASSLT & BTR'
text[4]='A THEFT'
text[5]='AGG ASSLT'
size=15
x=[0.15,0.875,0.14,0.865,0.15,0.875,0.15,0.84,0.16,0.81]
y=[0.93,0.93,0.73,0.73,0.52,0.52,0.32,0.32,0.12,0.12]

anno=make_annotations(size,text,x,y)
layout = dict(height = 1600,
              width = 1000,
              autosize = True,
              margin=go.layout.Margin(b=20),
              title="Dispatch Time based on violent crime classification",
              annotations=anno
             )
figxx = dict(data=data, layout=layout)
iplot(figxx, filename='pie-subplots')




### Violent crimes dispatch time breakdown:
1. LARCENY:
    1. biggest dispatch time category: above 60 minutes 63%
    2. smallest dispatch time category: between 1 and 5 minutes 0.762%
    
    
2. ASSAULT & BATTERY:
    1. biggest dispatch time category: less than 1 minute 61%
    2. smallest dispatch time category: between 1 and 5 minutes 3.09%


3. BURGLARY:
    1. biggest dispatch time category: above 60 minutes 65.6%
    2. smallest dispatch time category: between 30 and 60 minutes 1.1%


4. ROBBERY:
    1. biggest dispatch time category: less than 1 minute 43.8%
    2. smallest dispatch time category: between 30 and 60 minutes 4.61%


5. AUTO THEFT:
    1. biggest dispatch time category: above 60 minutes 71%
    2. smallest dispatch time category: between 5 and 10 minutes 0.763%


6. AGGRAVATED ASSAULT:
    1. biggest dispatch time category: less than 1 minute 64.1%
    2. smallest dispatch time category: between 1 and 5 minutes 3.59%


7. ASSAULT:
    1. biggest dispatch time category: less than 1 minute 62.3%
    2. smallest dispatch time category: between 1 and 5 minutes 2.16%


8. RAPE:
    1. biggest dispatch time category: above 60 minutes 65.7%
    2. smallest dispatch time category: between 30 and 60 minutes 2.86%


9. ARSON:
    1. biggest dispatch time category: less than 1 minute 67.9%
    2. smallest dispatch time category: between 30 and 60 minutes 3.57%


10. HOMICIDE:
    1. biggest dispatch time category: less than 1 minute 75%
    2. smallest dispatch time category: between 1 and 5 minutes 25%


11. DISPATCH TIME DELTAS CATEGORY BREAKDOWN:
    1. less than 1 minute:
        1. highest % HOMICIDE 75%
        2. lowest  % AUTO THEFT 22.1%   
    2. between 1 minute and 5 minutes:
        1. highest % HOMICIDE 25%
        2. lowest  % LARCENY 0.762%
    3. between 5 minutes and 10 minutes:
        1. highest % ROBBERY 11.8%
        2. lowest  % AUTO THEFT 0.763%
    4. between 10 minutes and 30 minutes:
        1. highest % ROBBERY 17%
        2. lowest  % AUTO THEFT 2.54%
    5. between 30 minutes and 60 minutes:
        1. highest % ROBBERY 4.61%
        2. lowest  % BURGLARY 1.1%
    6. above 60 minutes:
        1. highest % AUTO THEFT 71%
        2. lowest  % ASSAULT & BATTERY 13.3%

### Map visualization
Having a map representation of the crime distribution can make the data more compelling to a casual reader.<br>
With this in mind I used folium to plot the crime counts for each Montgomery County police district.<br>
Columns used:
    1. Latitude
    2. Longitude 

### folium plotting function

In [47]:
'''
changed geo_path=district_geo/geo_data
'''

def folium_plot(data_out,data):
    
    # center the map on Montgomery County
    map = folium.Map(location=[39.154743, -77.240515], zoom_start=10.5)
    
    folium.Choropleth(geo_data = district_geo,  # geojson file with PD boundaries
                  data_out = data_out,        # json file with the aggregated crime data
                  data=data,                  # df used for plotting
                  columns = ['DIST', 'Number'],
                  key_on = 'feature.properties.DIST',  # bind the geojson and the data on district
                  fill_color='YlGn', fill_opacity=0.8, line_opacity=0.2,
                  legend_name='Crime Visualization').add_to(map)
    return map

In [48]:
import folium

# Assigning the geojson file to a variable
district_geo = r'montgomery_county_pd.geojson'

# Preparing the data for plotting; this also means reordering the index to match the geojson file
crimedata2 = pd.DataFrame(crimes['Police District Number'].value_counts().astype(float))[:6]
crimedata2.index = [3,4,6,1,2,5]

# Cleaned data to json
crimedata2.to_json('crimeagg_new.json')

# Reset index; rename columns
crimedata2 = crimedata2.reset_index()

crimedata2.columns = ['DIST', 'Number']
# Initiate folium map and then plot it

map=folium_plot(data_out='crimeagg_new.json',data=crimedata2)
map
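
The cell above reorders the index by hand so that it matches the district numbers in the geojson file. As a hedged alternative, the district number could be derived from the label itself. The sketch below uses a made-up sample of the 'Police District Number' column and only the standard library; the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical sample of the 'Police District Number' column
districts = ['3D', '3D', '6D', '1D', '3D', 'OTHER', '6D', '2D']

# keep only labels of the form '<digits>D' and count them
counts = Counter(d for d in districts if d.endswith('D') and d[:-1].isdigit())

# '3D' -> 3, so the key matches a numeric district id
by_number = {int(label[:-1]): n for label, n in counts.items()}
print(by_number)
```

This keeps the mapping correct even if the order of the counts changes with a different dataset.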

1. This map is a little misleading because it does not take each district's population into account.
2. With this in mind, let's do the same thing but normalize the crime counts by each district's population.
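
As a worked example of the normalization, the sketch below divides crime counts by population, keyed by district number so that the order of the two tables does not matter. All numbers are made up for illustration, and the variable names are hypothetical.

```python
# Hypothetical crime counts and populations, keyed by district number
crime_counts = {3: 5000, 1: 3200, 2: 2500}
population = {1: 160000, 2: 125000, 3: 200000}

# crimes per 100 residents (a percentage of the population)
per_100 = {d: crime_counts[d] * 100 / population[d] for d in crime_counts}

# crimes per 100,000 residents, the scale commonly used for crime rates
per_100k = {d: crime_counts[d] * 100_000 / population[d] for d in crime_counts}

print(per_100)
print(per_100k)
```

Both scales produce the same ranking of districts; only the numbers on the map legend differ.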

In [49]:
import folium
district_geo = r'montgomery_county_pd.geojson'

crimedata=crimes['Police District Number'].value_counts()
crimedata=crimedata.iloc[:6]

crimedata=crimedata.to_frame().reset_index()
crimedata.columns=['Police District Number','Number']
crimedata.index=[3,4,6,1,2,5]

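# get the crime count as a percentage of each district's population
# (multiply by 100000 instead of 100 for a rate per 100K residents)
census.index=[1,2,3,4,5,6]
crimedata['Number']=crimedata['Number']*100/census['Population'].astype(float)

# include all six districts in the exported json
json_conv=pd.DataFrame({'Police District Number':crimedata['Number']},index=[3,4,6,1,2,5])
json_conv.to_json('crimeaggcc.json')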


crimedata=crimedata.loc[:,['Police District Number','Number']]
crimedata.columns=['DIST','Number']
crimedata=crimedata.reset_index(drop=True)
crimedata['DIST']=crimedata['DIST'].str.strip('D').astype(float)

map=folium_plot(data_out='crimeaggcc.json',data=crimedata)
map

### Conclusions

The analysis conducted on this dataset has 3 directions: time analysis, location analysis and crime classification analysis
###### Time analysis:
1. Week:
    1. most crimes are committed on Tuesday and the fewest on Sunday
    2. there is a general decreasing trend towards the end of the week<br><br>
2. Hour:
    1. most crimes are committed between 7am-11pm and the fewest between 5am-7am<br><br>
3. Month:
    1. the highest crime rate is in October
    2. the lowest crime rate is in July
    3. caveat: the dataset is not complete, as it only contains timestamps from July onward<br><br>
4. Dispatch Time conclusion:
    1. the most common dispatch time interval is less than 1 min (48.4%), followed by above 60 min (41%)
    2. the least common dispatch time interval is between 1 min and 5 min (1.72%)<br><br>
5. Dispatch time deltas based on violent crime subcategories:
    1. less than 1 minute:
        * highest % HOMICIDE 75%
        * lowest % AUTO THEFT 22.1%<br><br>
    2. between 1 minute and 5 minutes:
        * highest % HOMICIDE 25%
        * lowest % LARCENY 0.762%<br><br>
    3. between 5 minutes and 10 minutes:
        * highest % ROBBERY 11.8%
        * lowest % AUTO THEFT 0.763%<br><br>
    4. between 10 minutes and 30 minutes:
        * highest % ROBBERY 17%
        * lowest % AUTO THEFT 2.54%<br><br>
    5. between 30 minutes and 60 minutes:
        * highest % ROBBERY 4.61%
        * lowest % BURGLARY 1.1%<br><br>
    6. above 60 minutes:
        * highest % AUTO THEFT 71%
        * lowest % ASSAULT & BATTERY 13.3%
        
###### Location analysis:
Locations can be analyzed at different scales, i.e. Police District; City; Location (Street; Residence)<br>
1. Police District:<br>
    1. 3D | Silver Spring 24.8%
    2. 6D | Gaithersburg 17.6%
    3. 1D | Rockville 16%
    4. 4D | Wheaton 14.5%
    5. 5D | Germantown 14.3%
    6. 2D | Bethesda 12.7%
     
2. Locations:
	1. most common:
        * Residence; Street ;Parking Lot
    2. least common:
		* Jail; Retail; Liquor Store; Lake; Pawn Shop; Residence(Mobile Home); Nursery; Parking Lot(Park & Ride)

###### Crime classification analysis:
1. Montgomery county:
    1. overall, violent and non-violent crimes have close values (42.8% violent crimes; 57.2% non-violent), which comes as a surprise
    2. excluding 5D, violent crimes range from 38.7% to 44.5% per Police District
    3. excluding 5D, non-violent crimes range from 55.5% to 61.3% per Police District
         
         
2. Police Districts crime percentage:	
    1. highest violent percentage 5D - Germantown 50.7%
	2. lowest violent percentage 2D - Bethesda 38.7%
	3. highest non-violent percentage 2D - Bethesda 61.3%
	4. lowest non-violent percentage 5D - Germantown 49.3%


3. Violent Crimes subcategory analysis:
    1. Larceny ranges from 59.4%-76.4%
    2. Burglary ranges from 11%-16.2%
    3. ASSAULT & BATTERY ranges from 5.2%-11.4%
    4. ASSAULT ranges from 3.22%-7.1%
    5. AUTO THEFT ranges from 1.87%-5.27%
    6. ROBBERY ranges from 1.75%-5.35%
    7. AGGRAVATED ASSAULT ranges from 0.292%-2.12%


4. Violent crime classification dispatch time breakdown:     
    1. less than 1 minute ---> 82.7% decrease from non-violent to violent
    2. between 1 minute and 5 minutes ---> 27.55% decrease from non-violent to violent
    3. between 5 minutes and 10 minutes ---> 41.39% increase from non-violent to violent
    4. between 10 minutes and 30 minutes ---> 97.08% increase from non-violent to violent
    5. between 30 minutes and 60 minutes ---> 76.78% increase from non-violent to violent
    6. above 60 minutes ---> 74.27% increase from non-violent to violent