### Capstone 2

[Dataset](https://www.phoenixopendata.com/dataset/crime-data) downloaded on 9/25/2019 from the City of Phoenix Open Data Portal. It includes crime data from 2015 to the download date.

1. Do certain types of crimes occur more frequently at specific
 - times of the day,
 - months of the year, or
 - places in the city?
1. If policing efforts were adjusted based on these peak times, how would that affect the frequency of these crimes?

#### General Structure
1. Analysis that highlights your experimental hypothesis.
1. A rollout plan showing how you would implement and roll out the experiment.
1. An evaluation plan showing what constitutes success in this experiment.
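
One way the evaluation plan could define success is a two-sample comparison of weekly incident counts between areas that received adjusted patrols and comparable areas that did not. The sketch below uses `ttest_ind`, which this notebook imports, with Welch's correction for unequal variances. The weekly counts are made-up placeholder numbers, not values from the Phoenix dataset, and the names `control_weekly`, `treated_weekly`, and the 0.05 threshold are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

# Placeholder weekly incident counts (NOT real data), for illustration only
control_weekly = np.array([52, 48, 55, 50, 47, 53, 49, 51])
treated_weekly = np.array([44, 46, 41, 45, 43, 40, 47, 42])

# Welch's t-test: does not assume equal variances in the two groups
t_stat, p_value = ttest_ind(treated_weekly, control_weekly, equal_var=False)

alpha = 0.05  # assumed significance threshold
mean_change = treated_weekly.mean() - control_weekly.mean()
print(f"mean change: {mean_change:.2f}, t = {t_stat:.2f}, p = {p_value:.4f}")
print("significant" if p_value < alpha else "not significant")
```

A negative `mean_change` with a p-value below the chosen threshold would count as evidence of a reduction under these assumptions.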

In [61]:
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
import datetime
%matplotlib inline 
plt.style.use('bmh')
# pd.options.display.max_rows = 999
pd.options.display.float_format = '{:.0f}'.format

In [27]:
# read in the CSV file

df = pd.read_csv('crime-data_crime-data_crimestat.csv', low_memory=False)

In [63]:
df.head()

Unnamed: 0,inc_number,occurred_on,occurred_to,ucr_crime_category,100_block_addr,zip,premise_type
0,201500002101405,2015-11-01,2015-11-01 05:00:00,MOTOR VEHICLE THEFT,102XX W MEDLOCK AVE,85307,SINGLE FAMILY HOUSE
1,201500002102668,2015-11-01,2015-11-01 11:50:00,MOTOR VEHICLE THEFT,69XX W WOOD ST,85043,SINGLE FAMILY HOUSE
2,201600000052855,2015-11-01,2016-01-09 00:00:00,MOTOR VEHICLE THEFT,N 43RD AVE & W CACTUS RD,85029,SINGLE FAMILY HOUSE
3,201500002168686,2015-11-01,2015-11-11 09:30:00,LARCENY-THEFT,14XX E HIGHLAND AVE,85014,PARKING LOT
4,201700001722914,2015-11-01,NaT,LARCENY-THEFT,279XX N 23RD LN,85085,SINGLE FAMILY HOUSE


In [30]:
# convert column labels to lower case and replace spaces with underscores
df.columns = df.columns.str.replace(' ', '_').str.lower()

In [62]:
df.head()

Unnamed: 0,inc_number,occurred_on,occurred_to,ucr_crime_category,100_block_addr,zip,premise_type
0,201500002101405,2015-11-01,2015-11-01 05:00:00,MOTOR VEHICLE THEFT,102XX W MEDLOCK AVE,85307,SINGLE FAMILY HOUSE
1,201500002102668,2015-11-01,2015-11-01 11:50:00,MOTOR VEHICLE THEFT,69XX W WOOD ST,85043,SINGLE FAMILY HOUSE
2,201600000052855,2015-11-01,2016-01-09 00:00:00,MOTOR VEHICLE THEFT,N 43RD AVE & W CACTUS RD,85029,SINGLE FAMILY HOUSE
3,201500002168686,2015-11-01,2015-11-11 09:30:00,LARCENY-THEFT,14XX E HIGHLAND AVE,85014,PARKING LOT
4,201700001722914,2015-11-01,NaT,LARCENY-THEFT,279XX N 23RD LN,85085,SINGLE FAMILY HOUSE


In [97]:
df.dtypes

inc_number                    object
occurred_on           datetime64[ns]
occurred_to           datetime64[ns]
ucr_crime_category            object
100_block_addr                object
zip                          float64
premise_type                  object
dtype: object

In [33]:
# change date and time columns to datetime format
df.occurred_on = pd.to_datetime(df.occurred_on)
df.occurred_to = pd.to_datetime(df.occurred_to)
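
The sample rows above include a missing end time (`NaT`) in `occurred_to`. A minimal sketch, on a small hand-built frame rather than the real file, of converting with `errors="coerce"` so that unparseable values become `NaT`, then counting the missing end times. The frame `toy` and its values are illustrative only.

```python
import pandas as pd

# Toy frame mimicking the two date columns; values are illustrative only
toy = pd.DataFrame({
    "occurred_on": ["2015-11-01 00:00:00", "2015-11-01 00:00:00", "2015-11-01 00:00:00"],
    "occurred_to": ["2015-11-01 05:00:00", None, "not a date"],
})

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
for col in ["occurred_on", "occurred_to"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

missing_end = toy["occurred_to"].isna().sum()
print(missing_end)
```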

In [172]:
# add datetime features
df['occ_on_month'] = df['occurred_on'].dt.month
df['occ_on_year'] = df['occurred_on'].dt.year



In [173]:
df.shape

(253000, 10)

In [37]:
# how many different types of crime are there?
df.ucr_crime_category.value_counts()

LARCENY-THEFT                            128579
BURGLARY                                  40684
MOTOR VEHICLE THEFT                       26801
DRUG OFFENSE                              21111
AGGRAVATED ASSAULT                        19280
ROBBERY                                   10744
RAPE                                       3722
ARSON                                      1568
MURDER AND NON-NEGLIGENT MANSLAUGHTER       511
Name: ucr_crime_category, dtype: int64
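
For the month-of-year question, one option is a crosstab of crime category against `occ_on_month`, normalized within each category so that large categories such as larceny do not dominate. The sketch below runs on a tiny hand-made frame (illustrative rows, not the real data); only the column names `ucr_crime_category` and `occ_on_month` come from this notebook.

```python
import pandas as pd

# Illustrative rows only, not taken from the dataset
toy = pd.DataFrame({
    "ucr_crime_category": ["BURGLARY", "BURGLARY", "BURGLARY", "ARSON", "ARSON"],
    "occ_on_month": [1, 1, 7, 7, 12],
})

# Rows: category; columns: month; values: share of that category's incidents
month_share = pd.crosstab(
    toy["ucr_crime_category"], toy["occ_on_month"], normalize="index"
)
print(month_share.round(2))
```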

### Motor Vehicle Theft

In [89]:
# ZIP codes with the most motor vehicle thefts
df.loc[df.ucr_crime_category == 'MOTOR VEHICLE THEFT', 'zip'].value_counts().head()

85009    1714
85041    1543
85033    1388
85043    1300
85035    1289
Name: zip, dtype: int64
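
Raw counts favor ZIP codes with many incidents overall (85009 and 85041 also appear in the overall ZIP counts further down). A sketch of one complementary view, assuming the quantity of interest is the share of each ZIP code's incidents that are motor vehicle thefts. The rows in `toy` are illustrative, not real data.

```python
import pandas as pd

# Illustrative rows only, not taken from the dataset
toy = pd.DataFrame({
    "zip": [85009, 85009, 85009, 85009, 85033, 85033],
    "ucr_crime_category": [
        "MOTOR VEHICLE THEFT", "BURGLARY", "BURGLARY", "LARCENY-THEFT",
        "MOTOR VEHICLE THEFT", "LARCENY-THEFT",
    ],
})

# The mean of a boolean flag per ZIP is the fraction of incidents that are thefts
is_mvt = toy["ucr_crime_category"] == "MOTOR VEHICLE THEFT"
mvt_share = is_mvt.groupby(toy["zip"]).mean().sort_values(ascending=False)
print(mvt_share)
```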

### Mean Hour of Day for Different Crimes

In [153]:
# create column for hour of day
df['occ_on_hr'] = df.occurred_on.dt.hour

In [160]:
# this doesn't work; extract the hour into its own column first (occ_on_hr), then group
# df.groupby(df.ucr_crime_category)['occurred_on'].dt.hour.mean()

pd.options.display.float_format = '{:.2f}'.format
df.groupby(df.ucr_crime_category)['occ_on_hr'].mean()

ucr_crime_category
AGGRAVATED ASSAULT                      12.95
ARSON                                   10.55
BURGLARY                                11.44
DRUG OFFENSE                            12.89
LARCENY-THEFT                           13.15
MOTOR VEHICLE THEFT                     13.06
MURDER AND NON-NEGLIGENT MANSLAUGHTER   11.75
RAPE                                    10.10
ROBBERY                                 12.90
Name: occ_on_hr, dtype: float64
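
A mean hour can mislead because hours wrap around midnight: incidents at 23:00 and 01:00 average to noon. As a complement, the sketch below finds the most frequent hour per category from an hour-by-category crosstab. It uses a small hand-made frame (illustrative values only) with the notebook's `ucr_crime_category` and `occ_on_hr` column names.

```python
import pandas as pd

# Illustrative rows only, not taken from the dataset
toy = pd.DataFrame({
    "ucr_crime_category": ["ROBBERY"] * 3 + ["ARSON"] * 3,
    "occ_on_hr": [23, 23, 1, 2, 2, 14],
})

# Count incidents per (category, hour), then take the hour with the highest count
hour_counts = pd.crosstab(toy["ucr_crime_category"], toy["occ_on_hr"])
peak_hour = hour_counts.idxmax(axis=1)

# Compare with the arithmetic mean, which ignores the wrap at midnight
mean_hour = toy.groupby("ucr_crime_category")["occ_on_hr"].mean()
print(pd.DataFrame({"peak_hour": peak_hour, "mean_hour": mean_hour}))
```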

### Crimes with the Longest Gaps

In [167]:
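# difference between the hour of day the crime began and the hour of day it ended
# values are negative because this is start minus end, and the end hour is usually later;
# only the hour component is compared, so the dates (and multi-day gaps) are ignored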

(df.occurred_on.dt.hour - df.occurred_to.dt.hour).value_counts().head()

0.00     68872
-1.00    17501
-2.00     7756
-3.00     5402
-4.00     4581
dtype: int64
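
Subtracting the full timestamps, rather than their hour components, gives the elapsed time directly and keeps multi-day gaps. A sketch assuming the goal is the time from `occurred_on` to `occurred_to`; the timestamps mirror three of the sample rows shown by `df.head()`, and `gap_hours` is a hypothetical column name.

```python
import pandas as pd

# Timestamps mirroring three sample rows from df.head(); the last has no end time
toy = pd.DataFrame({
    "occurred_on": pd.to_datetime(["2015-11-01", "2015-11-01", "2015-11-01"]),
    "occurred_to": pd.to_datetime(["2015-11-01 05:00:00", "2016-01-09 00:00:00", None]),
})

# End minus start: positive when the incident ended after it began, NaN when the end is missing
toy["gap_hours"] = (toy["occurred_to"] - toy["occurred_on"]).dt.total_seconds() / 3600
print(toy["gap_hours"])
```

Grouping `gap_hours` by `ucr_crime_category` and taking the median would then show which crime types tend to have the longest reporting windows.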

In [51]:
df.zip.value_counts().head()

85015.0    13074
85008.0    11826
85009.0    11117
85041.0    10870
85051.0    10794
Name: zip, dtype: int64

In [64]:
# which block-level street addresses have the most crimes?
df['100_block_addr'].value_counts().head()

18XX W BELL RD            1619
61XX N 35TH AVE           1333
57XX N 19TH AVE           1225
16XX W BETHANY HOME RD    1056
37XX E THOMAS RD          1044
Name: 100_block_addr, dtype: int64

In [169]:
# which premise types have the most crimes?
df.premise_type.value_counts().head(10)

SINGLE FAMILY HOUSE                    44658
APARTMENT                              28929
STREET / ROADWAY / ALLEY / SIDEWALK    25542
PARKING LOT                            25361
RETAIL BUSINESS                        19515
VEHICLE                                13107
CONVENIENCE MARKET / STORE              9192
DEPARTMENT / DISCOUNT STORE             8872
DRIVEWAY                                8833
GROCERY / SUPER MARKET                  6917
Name: premise_type, dtype: int64
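
To connect places with crime types, the premise counts can be broken down by category. A minimal sketch on hand-made rows (illustrative only), using the notebook's `premise_type` and `ucr_crime_category` column names:

```python
import pandas as pd

# Illustrative rows only, not taken from the dataset
toy = pd.DataFrame({
    "premise_type": ["PARKING LOT"] * 3 + ["SINGLE FAMILY HOUSE"] * 3,
    "ucr_crime_category": [
        "MOTOR VEHICLE THEFT", "MOTOR VEHICLE THEFT", "LARCENY-THEFT",
        "BURGLARY", "BURGLARY", "MOTOR VEHICLE THEFT",
    ],
})

# Count incidents per (premise, category) pair, one row per premise type
counts = (
    toy.groupby(["premise_type", "ucr_crime_category"])
    .size()
    .unstack(fill_value=0)
)

# Most common crime category at each premise type
top_category = counts.idxmax(axis=1)
print(counts)
print(top_category)
```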