# Atlanta Police Department
![APD Logo](https://atlantapd.galls.com/photos/partners/atlantapd/logo.jpg)


The Atlanta Police Department provides Part 1 crime data at http://www.atlantapd.org/i-want-to/crime-data-downloads

A recent copy of the data file is stored in the cluster. <span style="color: red; font-weight: bold;">Please, do not copy this data file into your home directory!</span>

# Introduction


- This notebook leads into an exploration of public crime data provided by the Atlanta Police Department.
- The original data set and supplemental information can be found at http://www.atlantapd.org/i-want-to/crime-data-downloads
- **The data set is available on ARC. Please don't download it into your home directory on ARC!**

In [2]:
%reload_ext autoreload
%autoreload 2

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [5]:
# load data set
# df = pd.read_csv('/home/data/APD/COBRA-YTD2017.csv.gz')
df = pd.read_csv('/home/data/APD/COBRA-YTD-multiyear.csv.gz', low_memory=False)
print(f"Number of records: {df.shape[0]:,}   Number of columns: {df.shape[1]}")

Number of records: 285,733   Number of columns: 23


Let's look at the structure of this table. We're actually creating some text output that can be used to create a data dictionary.

In [6]:
dataDict = pd.DataFrame({'DataType': df.dtypes.values, 'Description': '', }, index=df.columns.values)

We need to enter the descriptions for each entry in our dictionary manually. However, why not just create the Python code automatically...

Run the code below only if you haven't edited the `datadict.py` file in a different way, since it will overwrite what you have so far. (That's why the code is commented-out.)

In [7]:
dataDict

Unnamed: 0,DataType,Description
MI_PRINX,int64,
offense_id,int64,
rpt_date,object,
occur_date,object,
occur_time,object,
poss_date,object,
poss_time,object,
beat,float64,
apt_office_prefix,object,
apt_office_num,object,


In [None]:
# with open("datadict.py", "w") as io:
#     for i in dataDict.index:
#         # emit .loc[row, column] assignments; chained .loc[row].Description may only modify a copy
#         io.write("dataDict.loc['%s', 'Description'] = '' # type: %s\n" % (i, str(dataDict.loc[i].DataType)))

In [None]:
# %load datadict.py
# Assign with .loc[row, column]; the chained form .loc[row].Description may only modify a copy
dataDict.loc['MI_PRINX', 'Description'] = '' # type: int64
dataDict.loc['offense_id', 'Description'] = 'Unique ID in the format YYDDDNNNN with the year YY, the day of the year DDD and a counter NNNN' # type: int64
dataDict.loc['rpt_date', 'Description'] = 'Date the crime was reported' # type: object
dataDict.loc['occur_date', 'Description'] = 'Estimated date when the crime occurred' # type: object
dataDict.loc['occur_time', 'Description'] = 'Estimated time when the crime occurred' # type: object
dataDict.loc['poss_date', 'Description'] = '' # type: object
dataDict.loc['poss_time', 'Description'] = '' # type: object
dataDict.loc['beat', 'Description'] = '' # type: int64
dataDict.loc['apt_office_prefix', 'Description'] = '' # type: object
dataDict.loc['apt_office_num', 'Description'] = '' # type: object
dataDict.loc['location', 'Description'] = '' # type: object
dataDict.loc['MinOfucr', 'Description'] = '' # type: int64
dataDict.loc['MinOfibr_code', 'Description'] = '' # type: object
dataDict.loc['dispo_code', 'Description'] = '' # type: object
dataDict.loc['MaxOfnum_victims', 'Description'] = '' # type: float64
dataDict.loc['Shift', 'Description'] = 'Zones have 8 or 10 hour shifts' # type: object
dataDict.loc['Avg Day', 'Description'] = '' # type: object
dataDict.loc['loc_type', 'Description'] = '' # type: float64
dataDict.loc['UC2 Literal', 'Description'] = '' # type: object
dataDict.loc['neighborhood', 'Description'] = '' # type: object
dataDict.loc['npu', 'Description'] = '' # type: object
dataDict.loc['x', 'Description'] = '' # type: float64
dataDict.loc['y', 'Description'] = '' # type: float64
dataDict.to_csv("COBRA_Data_Dictionary.csv")

## Fixing Data Types

In [None]:
print(df.groupby("Shift").count().index)
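
One way to fix the type of a text column with only a few distinct values is to convert it to a categorical. The sketch below is an illustration on made-up data, not the notebook's own method: the shift labels `Day`, `Eve` and `Morn` are hypothetical placeholders, and the actual values in the `Shift` column should be checked first.

```python
import pandas as pd

# Hypothetical sample; the real labels in the Shift column may differ
sample = pd.DataFrame({"Shift": ["Day", "Eve", "Morn", "Day"]})

# value_counts() lists the distinct labels together with their frequencies
shift_counts = sample["Shift"].value_counts()
print(shift_counts)

# A categorical column stores each distinct label only once
sample["Shift"] = sample["Shift"].astype("category")
print(sample["Shift"].cat.categories)
```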

## Date and Time
- Working with dates can be tricky. Dates and times are often stored as strings and need to be converted to a proper date/time data type.
- Python provides the `datetime` module for converting, parsing, and formatting dates and times. See https://docs.python.org/2/library/datetime.html
- The `pandas` package provides functionality to convert text fields into date/time fields, provided the values adhere to a given format. See http://pandas.pydata.org/pandas-docs/version/0.20/generated/pandas.to_datetime.html
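
As a minimal sketch of the conversion described above, the following applies `pd.to_datetime` to a few made-up strings in the `MM/DD/YYYY HH:MM:SS` layout. The sample values, including the deliberately invalid one, are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample values; the last one is deliberately not a valid date
raw = pd.Series(["09/06/2017 00:30:00", "09/05/2017 11:15:00", "not a date"])

# errors="coerce" turns unparseable strings into NaT instead of raising an error
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M:%S", errors="coerce")
print(parsed)
```

With `errors="coerce"`, malformed entries do not stop the conversion, so it is worth counting the resulting `NaT` values afterwards, for example with `parsed.isna().sum()`.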

### Create a proper text field
To use the text-to-date/time converter, our text columns need to be in the appropriate format.

In [4]:
# closure: fixdatetime(fld) returns a row-wise function that joins <fld>_date and <fld>_time
def fixdatetime(fld):
    def _fix(s):
        date_col = '%s_date' % fld
        time_col = '%s_time' % fld
        if time_col in s.index:
            return str(s[date_col])+' '+str(s[time_col])
        else:
            return str(s[date_col])+' 00:00:00'
    return _fix

df.apply(fixdatetime('rpt'), axis=1)[:10]

0    09/06/2017 00:00:00
1    09/06/2017 00:00:00
2    09/06/2017 00:00:00
3    09/06/2017 00:00:00
4    09/06/2017 00:00:00
5    09/06/2017 00:00:00
6    09/06/2017 00:00:00
7    09/06/2017 00:00:00
8    09/06/2017 00:00:00
9    09/06/2017 00:00:00
dtype: object

### Convert Columns

In [5]:
for col in ['rpt', 'occur', 'poss']:
    datser = df.apply(fixdatetime(col), axis=1)
    df['%s_dt'%col] = pd.to_datetime(datser, format="%m/%d/%Y %H:%M:%S", errors='coerce')
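
Row-wise `apply` can be slow on a table with several hundred thousand records. A possible alternative, sketched here on a small made-up frame rather than on the real file, is to concatenate the text columns as whole Series. The frame `sample` and its values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical rows using the same column names as the COBRA file
sample = pd.DataFrame({
    "occur_date": ["09/06/2017", "09/05/2017", None],
    "occur_time": ["00:30:00", "11:15:00", "17:45:00"],
})

# Column-wise string concatenation; a missing date or time yields NaN
combined = sample["occur_date"] + " " + sample["occur_time"]

# NaN and malformed strings become NaT because of errors="coerce"
sample["occur_dt"] = pd.to_datetime(combined, format="%m/%d/%Y %H:%M:%S", errors="coerce")
print(sample)
```

Unlike `fixdatetime`, this sketch does not fall back to `00:00:00` when a time column is absent, so a column such as `rpt_date` would need to be handled separately.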

In [6]:
df.head()

Unnamed: 0,MI_PRINX,offense_id,rpt_date,occur_date,occur_time,poss_date,poss_time,beat,apt_office_prefix,apt_office_num,...,Avg Day,loc_type,UC2 Literal,neighborhood,npu,x,y,rpt_dt,occur_dt,poss_dt
0,7693210,172490115,09/06/2017,09/06/2017,00:30:00,09/06/2017,00:35:00,607,,,...,Wed,,AGG ASSAULT,Custer/McDonough/Guice,W,-84.3585,33.70839,2017-09-06,2017-09-06 00:30:00,2017-09-06 00:35:00
1,7693211,172490265,09/06/2017,09/05/2017,11:15:00,09/06/2017,02:30:00,512,,,...,Tue,99.0,LARCENY-FROM VEHICLE,Downtown,M,-84.39736,33.74958,2017-09-06,2017-09-05 11:15:00,2017-09-06 02:30:00
2,7693212,172490322,09/06/2017,09/06/2017,03:15:00,09/06/2017,03:45:00,501,,,...,Wed,18.0,LARCENY-FROM VEHICLE,Atlantic Station,E,-84.39776,33.79072,2017-09-06,2017-09-06 03:15:00,2017-09-06 03:45:00
3,7693213,172490390,09/06/2017,09/05/2017,17:45:00,09/06/2017,04:57:00,207,,,...,Tue,18.0,LARCENY-FROM VEHICLE,Brookwood Hills,E,-84.39361,33.80774,2017-09-06,2017-09-05 17:45:00,2017-09-06 04:57:00
4,7693214,172490401,09/06/2017,09/05/2017,17:00:00,09/06/2017,05:00:00,203,,,...,Tue,18.0,LARCENY-FROM VEHICLE,Hills Park,D,-84.43337,33.79848,2017-09-06,2017-09-05 17:00:00,2017-09-06 05:00:00


## Beats and Zones
The City of Atlanta is divided into 6 zones, each with 12 to 14 beats.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/c/cb/Atlanta_Police_Zones_-_Feb_2013.jpg/588px-Atlanta_Police_Zones_-_Feb_2013.jpg)

Let's create a separate column for the zones:

In [41]:
df['Zone'] = df['beat']//100
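
The integer division works because the hundreds digit of a beat number is the zone, e.g. beat 607 lies in zone 6. Since `beat` is stored as `float64`, missing beats stay missing. The sketch below uses made-up values; the conversion to the nullable `Int64` type is an optional extra and not part of the original cell.

```python
import pandas as pd

# Hypothetical beat numbers, including one missing value
beats = pd.Series([607.0, 512.0, 207.0, float("nan")], name="beat")

# Floor division keeps the hundreds digit; NaN stays NaN
zone = beats // 100

# Optional: nullable integers keep missing values as <NA> instead of floats
zone_int = zone.astype("Int64")
print(zone_int)
```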

In [None]:
df['UC2 Literal'].unique()

In [42]:
df[df['UC2 Literal']=='LARCENY-FROM VEHICLE']

Unnamed: 0,MI_PRINX,offense_id,rpt_date,occur_date,occur_time,poss_date,poss_time,beat,apt_office_prefix,apt_office_num,...,UC2 Literal,neighborhood,npu,x,y,rpt_dt,occur_dt,poss_dt,Year,Zone
1,7693211,172490265,09/06/2017,09/05/2017,11:15:00,09/06/2017,02:30:00,512,,,...,LARCENY-FROM VEHICLE,Downtown,M,-84.39736,33.74958,2017-09-06,2017-09-05 11:15:00,2017-09-06 02:30:00,2017.0,5
2,7693212,172490322,09/06/2017,09/06/2017,03:15:00,09/06/2017,03:45:00,501,,,...,LARCENY-FROM VEHICLE,Atlantic Station,E,-84.39776,33.79072,2017-09-06,2017-09-06 03:15:00,2017-09-06 03:45:00,2017.0,5
3,7693213,172490390,09/06/2017,09/05/2017,17:45:00,09/06/2017,04:57:00,207,,,...,LARCENY-FROM VEHICLE,Brookwood Hills,E,-84.39361,33.80774,2017-09-06,2017-09-05 17:45:00,2017-09-06 04:57:00,2017.0,2
4,7693214,172490401,09/06/2017,09/05/2017,17:00:00,09/06/2017,05:00:00,203,,,...,LARCENY-FROM VEHICLE,Hills Park,D,-84.43337,33.79848,2017-09-06,2017-09-05 17:00:00,2017-09-06 05:00:00,2017.0,2
11,7693221,172490557,09/06/2017,09/05/2017,21:00:00,09/06/2017,07:20:00,207,,323,...,LARCENY-FROM VEHICLE,Loring Heights,E,-84.40671,33.79742,2017-09-06,2017-09-05 21:00:00,2017-09-06 07:20:00,2017.0,2
12,7693222,172490741,09/06/2017,09/06/2017,07:30:00,09/06/2017,08:20:00,506,,,...,LARCENY-FROM VEHICLE,Midtown,E,-84.37102,33.77686,2017-09-06,2017-09-06 07:30:00,2017-09-06 08:20:00,2017.0,5
13,7693223,172490781,09/06/2017,09/05/2017,22:00:00,09/06/2017,05:00:00,212,,,...,LARCENY-FROM VEHICLE,Piedmont Heights,F,-84.37157,33.80697,2017-09-06,2017-09-05 22:00:00,2017-09-06 05:00:00,2017.0,2
17,7693227,172490914,09/06/2017,09/05/2017,19:30:00,09/06/2017,10:10:00,408,,,...,LARCENY-FROM VEHICLE,Venetian Hills,S,-84.44116,33.72205,2017-09-06,2017-09-05 19:30:00,2017-09-06 10:10:00,2017.0,4
20,7693230,172491015,09/06/2017,09/06/2017,07:00:00,09/06/2017,10:45:00,208,,,...,LARCENY-FROM VEHICLE,Peachtree Park,B,-84.36815,33.84642,2017-09-06,2017-09-06 07:00:00,2017-09-06 10:45:00,2017.0,2
21,7693231,172491018,09/06/2017,09/06/2017,10:30:00,09/06/2017,11:00:00,413,,,...,LARCENY-FROM VEHICLE,Ben Hill Pines,P,-84.50755,33.68212,2017-09-06,2017-09-06 10:30:00,2017-09-06 11:00:00,2017.0,4


In [None]:
df.occur_dt.map(lambda d: d.year).unique()

In [43]:
df['Year'] = df.occur_dt.map(lambda d: d.year)
df2 = df[(df.Year>=2010) & (df.Year<=2017)]
df2.shape, df.shape

((17418, 28), (17425, 28))

# Descriptive Statistics
https://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics

# Time Series
- https://pandas.pydata.org/pandas-docs/stable/timeseries.html
- https://pandas.pydata.org/pandas-docs/stable/api.html#id10

In [44]:
df_LarcenyFromVehicle = df2[(df2['UC2 Literal']=='LARCENY-FROM VEHICLE')&(df2.Year==2017)].copy()
agr_LarcenyFromVehicle = df_LarcenyFromVehicle.set_index('occur_dt').resample('W').offense_id.count()
agr_LarcenyFromVehicle

occur_dt
2017-01-01     21
2017-01-08    157
2017-01-15    154
2017-01-22    178
2017-01-29    182
2017-02-05    161
2017-02-12    137
2017-02-19    190
2017-02-26    135
2017-03-05    142
2017-03-12    112
2017-03-19    118
2017-03-26    117
2017-04-02    124
2017-04-09    165
2017-04-16    191
2017-04-23    181
2017-04-30    186
2017-05-07    198
2017-05-14    180
2017-05-21    191
2017-05-28    190
2017-06-04    171
2017-06-11    196
2017-06-18    180
2017-06-25    194
2017-07-02    183
2017-07-09    188
2017-07-16    211
2017-07-23    196
2017-07-30    188
2017-08-06    167
2017-08-13    161
2017-08-20    226
2017-08-27    201
2017-09-03    187
2017-09-10     61
Freq: W-SUN, Name: offense_id, dtype: int64
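
Note that `resample('W')` labels each bin by the Sunday that ends the week (`W-SUN`). Because 2017-01-01 was a Sunday, the first bin above covers a single day, and the last bin is cut off where the data ends, which would explain the low counts at both edges. The toy example below, with made-up incidents, shows the binning.

```python
import pandas as pd

# Hypothetical incidents: one on Sunday 2017-01-01, three during the following week
events = pd.DataFrame({
    "occur_dt": pd.to_datetime([
        "2017-01-01 10:00", "2017-01-02 08:00",
        "2017-01-05 23:00", "2017-01-06 12:00",
    ]),
    "offense_id": [1, 2, 3, 4],
})

# Weekly bins end on Sunday and are labeled with that Sunday's date
weekly = events.set_index("occur_dt").resample("W").offense_id.count()
print(weekly)
```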

In [30]:
df_LarcenyFromVehicle["Hour"] = df_LarcenyFromVehicle.occur_dt.map(lambda d: d.hour)

In [33]:
df_LarcenyFromVehicle.groupby("Hour").offense_id.count()

Hour
0     290
1     220
2     121
3      94
4      55
5      57
6      85
7     113
8     210
9     198
10    195
11    258
12    317
13    278
14    254
15    310
16    288
17    339
18    431
19    483
20    441
21    363
22    403
23    317
Name: offense_id, dtype: int64
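
The same hourly profile can also be obtained with the vectorized `.dt` accessor instead of `map` with a lambda. This is a sketch on made-up timestamps; the `reindex` step is an addition that makes hours without any incident show up as 0.

```python
import pandas as pd

# Hypothetical occurrence times, including one missing value
times = pd.Series(pd.to_datetime([
    "2017-09-05 17:45:00", "2017-09-05 17:00:00", "2017-09-06 03:15:00", None,
]), name="occur_dt")

# Drop missing timestamps first so the hour stays an integer
hours = times.dropna().dt.hour

# Count per hour and fill in the hours that never occur
by_hour = hours.value_counts().sort_index().reindex(range(24), fill_value=0)
print(by_hour)
```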

In [53]:
hourly = df_LarcenyFromVehicle.resample('H', on='occur_dt').offense_id.count()

In [71]:
hourly.reset_index().occur_dt.map(lambda d: d.week)

0       52
1       52
2       52
3       52
4       52
5       52
6       52
7       52
8       52
9       52
10      52
11      52
12      52
13      52
14      52
15      52
16      52
17      52
18      52
19      52
20      52
21      52
22      52
23      52
24       1
25       1
26       1
27       1
28       1
29       1
        ..
5942    36
5943    36
5944    36
5945    36
5946    36
5947    36
5948    36
5949    36
5950    36
5951    36
5952    36
5953    36
5954    36
5955    36
5956    36
5957    36
5958    36
5959    36
5960    36
5961    36
5962    36
5963    36
5964    36
5965    36
5966    36
5967    36
5968    36
5969    36
5970    36
5971    36
Name: occur_dt, dtype: int64

In [64]:
df3 = pd.DataFrame({"N": hourly})
##df3['Day'] = df3.reset_index().occur_dt ##.map(lambda d: d.day)
df3

Unnamed: 0_level_0,N
occur_dt,Unnamed: 1_level_1
2017-01-01 00:00:00,2
2017-01-01 01:00:00,1
2017-01-01 02:00:00,0
2017-01-01 03:00:00,0
2017-01-01 04:00:00,0
2017-01-01 05:00:00,0
2017-01-01 06:00:00,0
2017-01-01 07:00:00,0
2017-01-01 08:00:00,0
2017-01-01 09:00:00,0


In [73]:
ls

COBRA_Data_Dictionary.csv  CrimeData_orig.ipynb  datadict.py  HW06/  README.md


# Plotting
The Pandas package provides a number of plotting features. Let's try them out.
- https://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.boxplot.html


In [None]:
fig = plt.figure(figsize=(10,6)) # 10in x 6in
#plt.plot(resdf['BURGLARY-RESIDENCE'].index, resdf['BURGLARY-RESIDENCE'])
plt.scatter(resdf['BURGLARY-RESIDENCE'].index, resdf['BURGLARY-RESIDENCE'], marker='x')
plt.scatter(resdf['BURGLARY-NONRES'].index, resdf['BURGLARY-NONRES'], marker='o')

plt.ylim(0, 500)
plt.title('BURGLARY-RESIDENCE')
plt.xticks(range(13), ['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
fig.savefig('BurglaryResidence_over_month.svg')
x = 1

In [None]:
def getTheMonth(x):
    return x.month

df['occur_month'] = df['occur_dt'].map(getTheMonth)  # occur_dt is the datetime column created above

In [None]:
resdf = df.groupby(['UC2 Literal', 'occur_month']).offense_id.count()
fig = plt.figure(figsize=(10,6))
plt.scatter(resdf['BURGLARY-RESIDENCE'].index, resdf['BURGLARY-RESIDENCE'], marker='x')
plt.ylim(0, 500)
plt.title('BURGLARY-RESIDENCE')
plt.xticks(range(13), ['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.savefig('quiz3-burglary-residence.png')


# Seasonal Model
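
The cells in this section use `crime_year` and `crime_year_month`, which are not defined anywhere in this notebook. Judging from how they are indexed, they are probably counts grouped by crime type and year (and month). The following is only a guess at their construction, by analogy with `resdf`, shown on made-up rows:

```python
import pandas as pd

# Hypothetical records with the column names used in this notebook
sample = pd.DataFrame({
    "UC2 Literal": ["BURGLARY-RESIDENCE", "BURGLARY-RESIDENCE",
                    "LARCENY-FROM VEHICLE", "BURGLARY-RESIDENCE"],
    "occur_dt": pd.to_datetime(["2016-01-15", "2016-02-03", "2017-01-20", "2017-01-02"]),
    "offense_id": [1, 2, 3, 4],
})
sample["Year"] = sample.occur_dt.dt.year
sample["occur_month"] = sample.occur_dt.dt.month

# Assumed definitions: counts per crime type and year, and per crime type, year and month
crime_year = sample.groupby(["UC2 Literal", "Year"]).offense_id.count()
crime_year_month = sample.groupby(["UC2 Literal", "Year", "occur_month"]).offense_id.count()

print(crime_year)
print(crime_year_month.loc["BURGLARY-RESIDENCE"].loc[2016])
```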

In [None]:
fig = plt.figure(figsize=(40,30))
crime_types = crime_year.index.levels[0]
years = crime_year.index.levels[1]
for c in range(len(crime_types)):
    y_max = max(crime_year.loc[crime_types[c]])
    
    plt.subplot(4,3,c+1)
    plt.hlines(crime_year.loc[crime_types[c]].iloc[-1]*100/y_max, years[0], years[-1], linestyles="dashed", color="r")
    plt.bar(crime_year.loc[crime_types[c]].index, crime_year.loc[crime_types[c]]*100/y_max, label=crime_types[c], alpha=0.5)
    ##plt.legend()
    plt.ylim(0, 100)
    plt.xticks(years+0.4, [str(int(y)) for y in years], rotation=0, fontsize=24)
    plt.yticks([0,20,40,60,80,100], ['0%','20%','40%','60%','80%','100%'], fontsize=24)
    plt.title(crime_types[c], fontsize=30)
    None

In [None]:
c = 3 ## 'BURGLARY-RESIDENCE'
resburglaries = crime_year_month.loc[crime_types[c]]
fig = plt.figure(figsize=(20,10))
for y in years:
    plt.plot(resburglaries.loc[y].index, resburglaries.loc[y], label=("%4.0f"%y))
plt.legend()
plt.title("Seasonal Trends - %s"%crime_types[c], fontsize=20)
plt.xticks(range(13), ['', 'Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.xlim(0,13)
None

In [None]:
c = 3 ## 'BURGLARY-RESIDENCE'
fig = plt.figure(figsize=(20,10))
for y in years:
    avg = resburglaries.loc[y].mean()
    std = resburglaries.loc[y].std()
    ##plt.hlines(avg, 1, 13, linestyle='dashed')
    plt.plot(resburglaries.loc[y].index, (resburglaries.loc[y]-avg)/std, label=("%4.0f"%y))
plt.legend()
plt.title("Seasonal Trends - %s (normalized)"%crime_types[c], fontsize=20)
plt.xticks(list(range(1,13)), ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
plt.xlim(0,13)
plt.ylabel(r"Standard deviations $\sigma_y$")  # raw string so that \s is not read as an escape sequence
None
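
The normalization in the last cell is a z-score per year: each monthly count has that year's mean subtracted and is divided by that year's standard deviation, so years with very different totals become comparable. A small sketch with made-up counts, using `groupby(...).transform` instead of an explicit loop:

```python
import pandas as pd

# Hypothetical monthly counts for two years with very different levels
index = pd.MultiIndex.from_product([[2016, 2017], [1, 2, 3]], names=["Year", "occur_month"])
monthly = pd.Series([10, 20, 30, 100, 200, 300], index=index)

# z-score within each year; pandas uses the sample standard deviation (ddof=1)
z = monthly.groupby(level="Year").transform(lambda s: (s - s.mean()) / s.std())
print(z)
```

In this toy data both years end up with the same normalized shape even though the second year has ten times as many incidents.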