# Atlanta Police Department
![APD Logo](https://atlantapd.galls.com/photos/partners/atlantapd/logo.jpg)


The Atlanta Police Department provides Part 1 crime data at http://www.atlantapd.org/i-want-to/crime-data-downloads

A recent copy of the data file is stored in the cluster. <span style="color: red; font-weight: bold;">Please, do not copy this data file into your home directory!</span>

# Introduction


- This notebook leads into an exploration of public crime data provided by the Atlanta Police Department.
- The original data set and supplemental information can be found at http://www.atlantapd.org/i-want-to/crime-data-downloads
- **The data set is available on ARC. Please don't download it into your home directory on ARC!**

In [1]:
import numpy as np
import pandas as pd 
%matplotlib inline
import matplotlib.pyplot as plt

In [2]:
# load data set
df = pd.read_csv('/Users/mxz0434/Downloads/COBRA-YTD-multiyear.csv.gz')
print("Shape of table: ", df.shape)

Shape of table:  (285733, 23)




## Review: Creating a data key

Let's look at the structure of this table. We're actually creating some text output that can be used to create a data dictionary.

In [None]:
dataDict = pd.DataFrame({'DataType': df.dtypes.values, 'Description': '', }, index=df.columns.values)

We need to enter the descriptions for each entry in our dictionary manually...

In [None]:
# use a single .loc[row, column] assignment so the value is written to dataDict itself
dataDict.loc['MI_PRINX', 'Description'] = '' # type: int64
dataDict.loc['offense_id', 'Description'] = 'Unique ID in the format YYDDDNNNN with the year YY, the day of the year DDD and a counter NNNN' # type: int64
dataDict.loc['rpt_date', 'Description'] = 'Date the crime was reported' # type: object
dataDict.loc['occur_date', 'Description'] = 'Estimated date when the crime occurred' # type: object
dataDict.loc['occur_time', 'Description'] = 'Estimated time when the crime occurred' # type: object
dataDict.loc['poss_date', 'Description'] = '' # type: object
dataDict.loc['poss_time', 'Description'] = '' # type: object
dataDict.loc['beat', 'Description'] = '' # type: int64
dataDict.loc['apt_office_prefix', 'Description'] = '' # type: object
dataDict.loc['apt_office_num', 'Description'] = '' # type: object
dataDict.loc['location', 'Description'] = '' # type: object
dataDict.loc['MinOfucr', 'Description'] = '' # type: int64
dataDict.loc['MinOfibr_code', 'Description'] = '' # type: object
dataDict.loc['dispo_code', 'Description'] = '' # type: object
dataDict.loc['MaxOfnum_victims', 'Description'] = '' # type: float64
dataDict.loc['Shift', 'Description'] = 'Zones have 8 or 10 hour shifts' # type: object
dataDict.loc['Avg Day', 'Description'] = '' # type: object
dataDict.loc['loc_type', 'Description'] = '' # type: float64
dataDict.loc['UC2 Literal', 'Description'] = '' # type: object
dataDict.loc['neighborhood', 'Description'] = '' # type: object
dataDict.loc['npu', 'Description'] = '' # type: object
dataDict.loc['x', 'Description'] = '' # type: float64
dataDict.loc['y', 'Description'] = '' # type: float64
dataDict.to_csv("COBRA_Data_Dictionary.csv")
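The `YYDDDNNNN` format described for `offense_id` can be unpacked with a few lines of string slicing. The sketch below is only an illustration: the helper name `decode_offense_id` and the `century` default are assumptions, not part of the APD data set, and it assumes every ID belongs to the 2000s. Because the column is stored as an integer, a leading zero in `YY` is lost, so the ID is padded back to nine digits first.

```python
from datetime import date, timedelta

def decode_offense_id(offense_id, century=2000):
    """Split an ID of the form YYDDDNNNN into (date, counter).

    Hypothetical helper: assumes all IDs fall in the given century.
    """
    digits = str(int(offense_id)).zfill(9)   # restore a dropped leading zero
    year = century + int(digits[:2])         # YY
    day_of_year = int(digits[2:5])           # DDD, 1-based
    counter = int(digits[5:])                # NNNN
    encoded_date = date(year, 1, 1) + timedelta(days=day_of_year - 1)
    return encoded_date, counter

# offense_id taken from the first row of the sample output in this notebook
print(decode_offense_id(90360664))
```

For the ID `90360664` this yields 5 February 2009 and counter 664.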

In [None]:
dataDict

## Convert Time Columns

Please refer to the following resources for working with time series data in pandas:
- https://pandas.pydata.org/pandas-docs/stable/timeseries.html
- https://pandas.pydata.org/pandas-docs/stable/api.html#id10

In [3]:
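# closure: fixdatetime(fld) returns a function that builds a "date time" string for one row
def fixdatetime(fld):
    def _fix(s):
        date_col = '%s_date' % fld # e.g. "occur_date"
        time_col = '%s_time' % fld # e.g. "occur_time"
        if time_col in s.index:
            return str(s[date_col])+' '+str(s[time_col])
        else:
            # no time column for this prefix (e.g. "rpt"): assume midnight
            return str(s[date_col])+' 00:00:00'
    return _fix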

In [4]:
for col in ['rpt', 'occur', 'poss']:
    datser = df.apply(fixdatetime(col), axis=1)
    df['%s_dt'%col] = pd.to_datetime(datser, format="%m/%d/%Y %H:%M:%S", errors='coerce')
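
Applying a Python function row by row can be slow on a table of this size. As a possible alternative, the same date and time strings can be built with whole-column operations. This is a minimal sketch on a small made-up frame; the helper name `combine_datetime` and the toy values are assumptions for illustration only.

```python
import pandas as pd

# toy stand-in for the crime table; the last occur_date is deliberately invalid
toy = pd.DataFrame({
    "occur_date": ["02/03/2009", "02/06/2009", "not a date"],
    "occur_time": ["13:50:00", "08:50:00", "10:00:00"],
    "rpt_date": ["02/05/2009", "02/06/2009", "06/17/2009"],
})

def combine_datetime(frame, prefix):
    """Build a datetime column from <prefix>_date and, if present, <prefix>_time."""
    date_col = f"{prefix}_date"
    time_col = f"{prefix}_time"
    if time_col in frame.columns:
        text = frame[date_col].astype(str) + " " + frame[time_col].astype(str)
    else:
        # no time column for this prefix: assume midnight
        text = frame[date_col].astype(str) + " 00:00:00"
    # unparseable strings become NaT instead of raising an error
    return pd.to_datetime(text, format="%m/%d/%Y %H:%M:%S", errors="coerce")

toy["occur_dt"] = combine_datetime(toy, "occur")
toy["rpt_dt"] = combine_datetime(toy, "rpt")
print(toy[["occur_dt", "rpt_dt"]])
```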

In [5]:
df[["MI_PRINX", "offense_id", "beat", "UC2 Literal", "neighborhood", "rpt_dt", "occur_dt", "poss_dt"]].head()

| | MI_PRINX | offense_id | beat | UC2 Literal | neighborhood | rpt_dt | occur_dt | poss_dt |
|---|---|---|---|---|---|---|---|---|
| 0 | 1160569 | 90360664 | 305.0 | LARCENY-NON VEHICLE | South Atlanta | 2009-02-05 | 2009-02-03 13:50:00 | 2009-02-03 15:00:00 |
| 1 | 1160570 | 90370891 | 502.0 | LARCENY-FROM VEHICLE | Ansley Park | 2009-02-06 | 2009-02-06 08:50:00 | 2009-02-06 10:45:00 |
| 2 | 1160572 | 91681984 | 604.0 | LARCENY-NON VEHICLE | Sweet Auburn | 2009-06-17 | 2009-06-17 14:00:00 | 2009-06-17 15:00:00 |
| 3 | 1160575 | 82040835 | 104.0 | BURGLARY-RESIDENCE | Mozley Park | 2009-02-27 | 2008-07-21 18:00:00 | 2008-07-21 18:00:00 |
| 4 | 1160576 | 82922120 | 210.0 | AUTO THEFT | Lenox | 2009-01-14 | 2008-10-19 18:30:00 | 2008-10-19 19:45:00 |


What's the date range of the data?

In [None]:
print(df.occur_dt.min(), '---', df.occur_dt.max())

Number of crimes reported each year:

In [None]:
# resample is like "groupby" for time
df.resample('A-DEC', closed='right', on='occur_dt').offense_id.count()
# df['Year'] = df.occur_dt.map(lambda d: d.year)
# df2 = df[(df.Year>=2010) & (df.Year<=2017)]
# df2.shape, df.shape
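
To make the "groupby for time" comparison concrete, here is a small sketch that counts records per calendar year with a plain `groupby` on the year of the timestamp. The toy frame is made up for illustration. One difference worth keeping in mind: `resample` also returns a row (with count 0) for years that have no records, while `groupby` only returns years that actually occur.

```python
import pandas as pd

# made-up records: two in 2009, three in 2010
toy = pd.DataFrame({
    "occur_dt": pd.to_datetime(
        ["2009-03-01", "2009-11-15", "2010-01-02", "2010-07-04", "2010-12-31"]
    ),
    "offense_id": [1, 2, 3, 4, 5],
})

# group on the calendar year extracted from the timestamp column
by_year = toy.groupby(toy["occur_dt"].dt.year)["offense_id"].count()
print(by_year)
```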

Looks like most of the data is actually from 2009-2017! Let's throw the rest away...

In [None]:
df = df[df.occur_dt>='01/01/2009']

# Crime Over Time

Has the number of crimes in Atlanta changed over time?

Are some areas more affected by crime than others?

Do different types of crime correlate with each other?

## Number of Crimes Over Time, with Pivot Tables

In [None]:
df[["occur_dt", "UC2 Literal", "offense_id"]].head()

In [None]:
# Pivoting the table:
# index = column that the new table will be indexed by
# columns = column whose unique values will form the new column names
# values = values used to fill the table (default = all columns other than those given in index and columns)
df_ct = df.pivot_table(index="occur_dt", columns="UC2 Literal", values="offense_id")
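
A small made-up example may help to see what this pivot produces. `pivot_table` aggregates with the mean by default, so each cell holds the mean `offense_id` for that timestamp and crime type; only whether a cell is filled or `NaN` matters for the later counting step. A possible caveat, not verified against the real data: if two crimes of the same type share the exact same timestamp, they collapse into one cell and are counted once. Passing `aggfunc="count"` keeps them apart.

```python
import pandas as pd

# made-up rows: two auto thefts with the identical timestamp
toy = pd.DataFrame({
    "occur_dt": pd.to_datetime(
        ["2009-02-03 13:50", "2009-02-03 13:50", "2009-02-06 08:50"]
    ),
    "UC2 Literal": ["AUTO THEFT", "AUTO THEFT", "RAPE"],
    "offense_id": [101, 102, 103],
})

# default aggregation (mean): the two auto thefts share one cell
mean_table = toy.pivot_table(index="occur_dt", columns="UC2 Literal",
                             values="offense_id")
# explicit count: each cell holds the number of records
count_table = toy.pivot_table(index="occur_dt", columns="UC2 Literal",
                              values="offense_id", aggfunc="count")

print(mean_table)
print(mean_table.count())   # filled cells per crime type
print(count_table.sum())    # records per crime type
```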

In [None]:
df_ct.head()

This gives us a timeline for different types of crime reported in Atlanta.

By itself, this can be useful, but we are more interested in aggregate statistics. Let's get the number of crimes by month...

In [None]:
df_ct = df_ct.resample("1M", closed="right").count()

In [None]:
df_ct.head()

Distribution of the monthly number of crimes, for each type of crime:

In [None]:
ax = df_ct.plot.box(figsize=(13,4), rot=45)
plt.ylabel("Total Reported Crimes by Month")

Explanation of boxplot:

From the matplotlib documentation (http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.boxplot):

The box extends from the **lower to upper quartile values** of the data, with a line at the **median**. The whiskers extend from the box to show the range of the data. Flier points are those past the end of the whiskers.

**Whiskers:** IQR is the interquartile range (Q3-Q1). The upper whisker will extend to last datum less than Q3 + whis*IQR (where the default value for whis is 1.5). Similarly, the lower whisker will extend to the first datum greater than Q1 - whis*IQR. Beyond the whiskers, data are considered outliers and are plotted as individual points. 
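
The whisker rule can be checked by hand on a short list of numbers. The sketch below uses made-up monthly counts and NumPy's default percentile interpolation, which is an assumption here and may differ slightly from what a given plotting routine uses.

```python
import numpy as np

# made-up monthly counts with one unusually high month
monthly_counts = np.array([40, 42, 45, 47, 50, 52, 55, 58, 60, 120])

q1, q3 = np.percentile(monthly_counts, [25, 75])
iqr = q3 - q1
whis = 1.5

# limits beyond which points are drawn as fliers
lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr

# whiskers end at the most extreme data points inside the limits
lower_whisker = monthly_counts[monthly_counts >= lower_fence].min()
upper_whisker = monthly_counts[monthly_counts <= upper_fence].max()
fliers = monthly_counts[(monthly_counts < lower_fence) | (monthly_counts > upper_fence)]

print(q1, q3, iqr)
print(lower_whisker, upper_whisker, fliers)
```

Here the month with 120 crimes lies above Q3 + 1.5*IQR and would be plotted as an individual point, while the whiskers end at 40 and 60.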

In [None]:
## In-class exercise:
# Make a boxplot of the number of reported crimes, aggregating by week. 

In [None]:
df_wk = df.pivot_table(index="occur_dt", columns="UC2 Literal", values="offense_id")
df_wk = df_wk.resample("W-SUN", closed='right').count()
df_wk.plot.box(figsize=(13,4), rot=45)
plt.ylabel("Total Reported Crimes by Week")

More on pandas datetime objects:

http://pandas-docs.github.io/pandas-docs-travis/timeseries.html#dateoffset-objects

http://pandas-docs.github.io/pandas-docs-travis/timeseries.html#anchored-offsets

### Crimes over time

Let's take a look at a time series plot of the number of crimes over time...

In [None]:
ax = df_ct.plot(figsize=(10,5), style='-o')
ax.get_legend().set_bbox_to_anchor((1, 1))
plt.ylabel("Total Reported Crimes by Month")
ax.vlines(pd.date_range("12/31/2009", "12/31/2017", freq="A-JAN"), 0, 900)

Can you pick out the seasonal variation in number of crimes per year?

Suppose we are not interested in seasonal trends, but want to see if the number of reported crimes is changing year over year. We could simply add the monthly counts together to get the number of crimes reported each year.

In [None]:
ann_cr = df_ct.resample("A-DEC", closed="right").sum()

In [None]:
ax = ann_cr[ann_cr.index<"01/01/2017"].plot(figsize=(10,5), style='-o')
ax.get_legend().set_bbox_to_anchor((1, 1))

## Correlation In Number of Crimes Over Time

You can use the "corr" method in Pandas to find the correlation between columns of a dataframe. 

In [None]:
crime_corr = df_ct.corr()
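
As a quick illustration of what `corr` returns, the sketch below uses three made-up columns of monthly counts. By default the method computes the Pearson correlation for every pair of columns, so the result is a symmetric table with ones on the diagonal. Keep in mind that monthly counts sharing a common trend or season will correlate even if the crimes are otherwise unrelated.

```python
import pandas as pd

# made-up monthly counts for three crime types
toy = pd.DataFrame({
    "AUTO THEFT": [10, 20, 30, 40],
    "LARCENY-FROM VEHICLE": [12, 22, 29, 41],
    "HOMICIDE": [4, 3, 2, 1],
})

toy_corr = toy.corr()   # Pearson correlation between all pairs of columns
print(toy_corr)
```

In this toy table, auto theft and larceny rise together (correlation close to 1), while homicide falls in a perfectly straight line as auto theft rises (correlation of -1).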

In [None]:
crime_corr

Visualizing the correlation...

In [None]:
plt.matshow(crime_corr);
plt.yticks(range(len(crime_corr.columns)), crime_corr.columns);
plt.xticks(range(len(crime_corr.columns)), crime_corr.columns, rotation=90);
plt.colorbar();

# Crimes By Place

## Beats and Zones
The City of Atlanta is divided into 6 zones, each with 12 to 14 beats. 

![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Atlanta_Police_Zones_-_Feb_2013.jpg/588px-Atlanta_Police_Zones_-_Feb_2013.jpg)

In [6]:
df['Zone'] = df['beat']//100
df['Year'] = df.occur_dt.apply(lambda x: x.year)
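
The zone is the leading digit of the three-digit beat number, which is why integer division by 100 works. A short sketch with beat values taken from the sample rows shown earlier, plus one hypothetical out-of-range value to show why a later filter on zones 1 to 6 is useful:

```python
import pandas as pd

# the first five beats come from the sample output; 50.0 is a made-up bad value
beats = pd.Series([305.0, 502.0, 604.0, 104.0, 210.0, 50.0], name="beat")

zones = beats // 100   # floor division keeps only the hundreds digit
print(zones.tolist())
```

The made-up beat 50 ends up in a "zone 0" that does not exist.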

In [7]:
df_cp = df.pivot_table(index="Zone", 
                       columns="UC2 Literal", 
                       values="offense_id", 
                       aggfunc=lambda x: np.count_nonzero(~np.isnan(x)))

In [None]:
df_cp

In [8]:
df_cp = df.pivot_table(index=["Year","Zone"], 
                       columns="UC2 Literal", 
                       values="offense_id", 
                       aggfunc=lambda x: np.count_nonzero(~np.isnan(x)))

In [9]:
df_cp

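| Year | Zone | AGG ASSAULT | AUTO THEFT | BURGLARY-NONRES | BURGLARY-RESIDENCE | HOMICIDE | LARCENY-FROM VEHICLE | LARCENY-NON VEHICLE | RAPE | ROBBERY-COMMERCIAL | ROBBERY-PEDESTRIAN | ROBBERY-RESIDENCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1916.0 | 2.0 | NaN | 1.0 | 1.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 1916.0 | 4.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 1916.0 | 5.0 | NaN | 1.0 | 1.0 | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 1916.0 | 6.0 | NaN | 1.0 | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1920.0 | 2.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 1920.0 | 4.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1970.0 | 6.0 | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN |
| 1973.0 | 3.0 | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1976.0 | 6.0 | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN |
| 1979.0 | 3.0 | NaN | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |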


In [22]:
# [x[1] for x in df_cp.index.values]
list(zip(*df_cp.index.values))

[(1916.0,
  1916.0,
  1916.0,
  1916.0,
  1920.0,
  1920.0,
  1970.0,
  1973.0,
  1976.0,
  1979.0,
  1980.0,
  1991.0,
  1991.0,
  1993.0,
  1994.0,
  1996.0,
  1996.0,
  1998.0,
  2000.0,
  2000.0,
  2000.0,
  2001.0,
  2001.0,
  2001.0,
  2002.0,
  2002.0,
  2002.0,
  2003.0,
  2003.0,
  2004.0,
  2004.0,
  2004.0,
  2005.0,
  2005.0,
  2005.0,
  2006.0,
  2006.0,
  2006.0,
  2007.0,
  2007.0,
  2007.0,
  2007.0,
  2007.0,
  2007.0,
  2008.0,
  2008.0,
  2008.0,
  2008.0,
  2008.0,
  2008.0,
  2008.0,
  2009.0,
  2009.0,
  2009.0,
  2009.0,
  2009.0,
  2009.0,
  2009.0,
  2009.0,
  2010.0,
  2010.0,
  2010.0,
  2010.0,
  2010.0,
  2010.0,
  2010.0,
  2010.0,
  2011.0,
  2011.0,
  2011.0,
  2011.0,
  2011.0,
  2011.0,
  2011.0,
  2012.0,
  2012.0,
  2012.0,
  2012.0,
  2012.0,
  2012.0,
  2012.0,
  2013.0,
  2013.0,
  2013.0,
  2013.0,
  2013.0,
  2013.0,
  2013.0,
  2013.0,
  2014.0,
  2014.0,
  2014.0,
  2014.0,
  2014.0,
  2014.0,
  2014.0,
  2015.0,
  2015.0,
  2015.0,
  2015.0,


In [23]:
df_cp = df_cp[np.logical_and([x >= 1 for x in list(zip(*df_cp.index.values))[1]], 
                             [x <= 6 for x in list(zip(*df_cp.index.values))[1]])].fillna(0)
df_cp.head(20)

| Year | Zone | AGG ASSAULT | AUTO THEFT | BURGLARY-NONRES | BURGLARY-RESIDENCE | HOMICIDE | LARCENY-FROM VEHICLE | LARCENY-NON VEHICLE | RAPE | ROBBERY-COMMERCIAL | ROBBERY-PEDESTRIAN | ROBBERY-RESIDENCE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1916.0 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1916.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1916.0 | 5.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1916.0 | 6.0 | 0.0 | 1.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1920.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1920.0 | 4.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1970.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1973.0 | 3.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1976.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1979.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [None]:
# A MUCH PRETTIER way to do the same thing:
df_cp = df_cp.loc[(slice(None), slice(1,6)),:].fillna(0)
df_cp.head(20)

In [None]:
## slicing on a multi-index
#  get data for 2009-2010, for zones 1, 3, and 5 (the slice 1..5 in steps of 2)
df_cp.loc[(slice(2009,2010), slice(1,5,2)),:]
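
Slicing on a multi-index takes one slice per index level, and label slices include both end points. The sketch below builds a small made-up table with the same (Year, Zone) index layout and also shows `pd.IndexSlice`, which lets you write the same selection with the usual colon syntax. The toy values are placeholders, not real crime counts.

```python
import pandas as pd

# made-up table indexed by (Year, Zone), sorted as pivot_table would return it
index = pd.MultiIndex.from_product(
    [[2009, 2010, 2011], [1, 2, 3, 4]], names=["Year", "Zone"]
)
toy = pd.DataFrame(
    {"AUTO THEFT": range(12), "ROBBERY-PEDESTRIAN": range(12, 24)},
    index=index,
)

# years 2009-2010 and zones 1-3, written with explicit slice objects
subset_a = toy.loc[(slice(2009, 2010), slice(1, 3)), :]

# the same selection written with pd.IndexSlice
idx = pd.IndexSlice
subset_b = toy.loc[idx[2009:2010, 1:3], :]

print(subset_a)
print(subset_a.equals(subset_b))
```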

In [None]:
## In-class exercise:
# Show all robbery data for 2011, 2013, and 2015, for zones 4-6 

In [None]:
df_cp.loc[(slice(2011,2015,2), slice(4,6)), "ROBBERY-COMMERCIAL":"ROBBERY-RESIDENCE"]

In [None]:
df_cp.filter(like='ROBBERY').loc[(slice(2011,2015,2), slice(4,6)), :]

In [None]:
## In-class exercise:
# Count the total number of crimes in each zone
df_cp.groupby(level=1).sum()

In [None]:
help(df_cp.plot)

In [None]:
## In-class exercise:
# Plot the number of pedestrian robberies in each zone in 2016
df_cp.loc[(slice(2016,2016), slice(None)), "ROBBERY-PEDESTRIAN"].plot.bar()
plt.xticks(range(6), range(1,7), rotation=0);
plt.xlabel("Zone");
plt.ylabel("Ped. Robberies in 2016");

In [None]:
## In-class exercise:
# What is the average annual number of crimes in each zone (for each type of crime)?
# Hint: use "groupby" with a "level" argument.

In [None]:
df_cp.groupby(level=1).mean()
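
Grouping on an index level collapses the other level. Here is a minimal sketch on a made-up (Year, Zone) table; using the level name instead of its position is an optional choice that reads a little more clearly. Note that the mean is taken over whichever years are present in the table, so stray years with only a handful of records would pull the annual average down.

```python
import pandas as pd

# made-up counts for two years and two zones
index = pd.MultiIndex.from_product([[2009, 2010], [1, 2]], names=["Year", "Zone"])
toy = pd.DataFrame({"AUTO THEFT": [10, 20, 30, 40]}, index=index)

per_zone_sum = toy.groupby(level="Zone").sum()    # total over all years
per_zone_mean = toy.groupby(level="Zone").mean()  # average per year

print(per_zone_sum)
print(per_zone_mean)
```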

In [None]:
##### Shapefile stuff ########

In [None]:
import sys
try:
    from osgeo import ogr, osr, gdal
except ImportError:
    sys.exit('ERROR: cannot find GDAL/OGR modules')