# Clean San Francisco Crime data 
- Jim Haskin

- GA-Data Science
- Dec 2015

- 2/17/2016

## Method
- I collected the incident reports of the San Francisco Police Department from the SF OpenData website: https://data.sfgov.org/data?category=Public%20Safety. The records run from January 2003 until the beginning of 2016.
- I cleaned and reformatted the fields.
- I summarized the reports into a daily record of the number of incidents and another factor I am calling Crime Level. Each incident is given a score based on how violent it is. Murders and assaults are rated high. Traffic violations and non-criminal incidents are rated low. These scores are summed and then normalized to a scale of 0 - 10.
- I also added other crime-measuring variables that can be used to narrow the research to different subsets of crime.
- 'gun_level' is an example. These are explained below.
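
The scoring idea in the list above can be sketched on a toy example. The incident scores here are made up, and the min-max scaling onto 0 - 10 is an assumption for illustration; it is not necessarily the normalization used later in this notebook.

```python
import pandas as pd

# Hypothetical toy incidents; the severity scores are illustrative only
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-02',
                            '2015-01-03', '2015-01-03', '2015-01-03']),
    'score': [4, 1, 2, 4, 4, 2],
})

# Sum the per-incident scores for each day
daily = toy.groupby('date')['score'].sum()

# Assumed min-max scaling of the daily sums onto a 0 - 10 range
scaled = 10 * (daily - daily.min()) / (daily.max() - daily.min())
print(scaled)
```

With this scaling the quietest day maps to 0 and the busiest to 10, so the result depends on the extremes of whatever date range is included.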

# Sections


- [Data Source](#Data-Source)
- [Clean Data](#Clean-Data)
- [New Features for Raw Data](#New-Features-for-Raw-data)
- [Corrections for Raw data](#Corrections-for-Raw-data)
- [Create Crime Measuring Variables](#Create-Crime-Measuring-Variables)
- [Consolidate into daily records](#Consolidate-into-daily-records)
- [Corrections for Daily records](#Corrections-for-Daily-records)
- [New Features for Daily records](#New-Features-for-Daily-records)
- [Normalize Levels](#Normalize-Levels)
- [Write final data to file](#Write-final-data-to-file)
- [qq](#qq)

# Working Notes
## features to add or create
- month




## Data Source
[[back to top](#Sections)]

- Data downloaded from SF Open Data site. File includes incidents from 1/1/2003 until the present 
- SFPD_Incidents_-_from_1_January_2003.csv
- https://data.sfgov.org/data?category=Public%20Safety


FieldName|Type|Description                             
---------------|------------|---------------------
IncidntNum|string|Police assigned number
Category|string|General Crime category
Descript|string|Secondary category/details
DayOfWeek|string|Day of week event occurred
Date|string|Date in format : 01/18/2016
Time|string|Time in format : 23:52
PdDistrict|string|Police District that event occurred in
Resolution|string|How case was resolved
Address|string|Address of event
X|float|Longitude 
Y|float|Latitude
Location|string|Latitude,Longitude in character pair
PdId|int|Police Department ID number


In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
%matplotlib inline

In [2]:
! head -2 SFPD_Incidents_-_from_1_January_2003.csv


IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
160051264,WARRANTS,WARRANT ARREST,Monday,01/18/2016,23:52,CENTRAL,"ARREST, BOOKED",400 Block of POWELL ST,-122.408568445228,37.7887594214703,"(37.7887594214703, -122.408568445228)",16005126463010


In [3]:
! tail -2 SFPD_Incidents_-_from_1_January_2003.csv



031353484,OTHER OFFENSES,OBSCENE PHONE CALLS(S),Wednesday,01/01/2003,00:01,TARAVAL,NONE,1500 Block of 41ST AV,-122.5003001196,37.7578465298467,"(37.7578465298467, -122.5003001196)",3135348419050
030320997,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Wednesday,01/01/2003,00:01,SOUTHERN,NONE,0 Block of LAFAYETTE ST,-122.416608653757,37.7725681063387,"(37.7725681063387, -122.416608653757)",3032099764070


### Read in Crime data

In [4]:
sf_data = pd.read_csv('SFPD_Incidents_-_from_1_January_2003.csv', index_col=0)    # has header, commas, index

## Clean Data
[[back to top](#Sections)]

### Convert to lower case
- Feature names
- Feature values that I'm working with

In [5]:
sf_data.columns = sf_data.columns.str.lower()
sf_data['category'] = sf_data['category'].str.lower()
sf_data['descript'] = sf_data['descript'].str.lower()
sf_data['dayofweek'] = sf_data['dayofweek'].str.lower()
sf_data.head(2)

IncidntNum,category,descript,dayofweek,date,time,pddistrict,resolution,address,x,y,location,pdid
160051264,warrants,warrant arrest,monday,01/18/2016,23:52,CENTRAL,"ARREST, BOOKED",400 Block of POWELL ST,-122.408568,37.788759,"(37.7887594214703, -122.408568445228)",16005126463010
160051242,robbery,"robbery, bodily force",monday,01/18/2016,23:40,TENDERLOIN,NONE,100 Block of STOCKTON ST,-122.406428,37.787109,"(37.78710945429, -122.40642786236)",16005124203074


### Investigate data

In [6]:
sf_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1866570 entries, 160051264 to 30320997
Data columns (total 12 columns):
category      object
descript      object
dayofweek     object
date          object
time          object
pddistrict    object
resolution    object
address       object
x             float64
y             float64
location      object
pdid          int64
dtypes: float64(2), int64(1), object(9)
memory usage: 185.1+ MB


### Observations
- 1,866,570 records 
- date and time in string format
- other fields look appropriate

### Convert date to datetime

In [7]:
sf_data['date'] = pd.to_datetime(sf_data['date'])
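
Since the field table above gives the date format as 01/18/2016, the format can also be passed explicitly so pandas does not have to infer it. This is a minimal sketch on two sample values taken from the file preview, not a change to the cell above.

```python
import pandas as pd

# Two date strings in the month/day/year layout shown in the source file
dates = pd.Series(['01/18/2016', '01/01/2003'])

# An explicit format avoids ambiguity between month-first and day-first dates
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed.dt.year.tolist(), parsed.dt.month.tolist(), parsed.dt.day.tolist())
```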

## New Features for Raw data
[[back to top](#Sections)]

### Add the hour as numeric

In [8]:
sf_data['hour'] = sf_data['time'].str[0:2].astype(int)

### Add month, day and year features

In [9]:
#tdf['Date'].dtype
sf_data['month'] = sf_data['date'].dt.month
sf_data['day'] = sf_data['date'].dt.day
sf_data['year'] = sf_data['date'].dt.year
sf_data.head(2)

IncidntNum,category,descript,dayofweek,date,time,pddistrict,resolution,address,x,y,location,pdid,hour,month,day,year
160051264,warrants,warrant arrest,monday,2016-01-18,23:52,CENTRAL,"ARREST, BOOKED",400 Block of POWELL ST,-122.408568,37.788759,"(37.7887594214703, -122.408568445228)",16005126463010,23,1,18,2016
160051242,robbery,"robbery, bodily force",monday,2016-01-18,23:40,TENDERLOIN,NONE,100 Block of STOCKTON ST,-122.406428,37.787109,"(37.78710945429, -122.40642786236)",16005124203074,23,1,18,2016


### Create shift feature
- For more detailed analysis or workforce planning, add a feature that records the shift in which the event occurred.
- 3rd shift - Midnight to 7:59am
- 1st shift - 8:00am - 3:59pm
- 2nd shift - 4:00pm - 11:59pm

NOTE: Many incident times appear to be approximate (see the Day 1 anomaly discussion below), which may make this feature unreliable.

In [10]:
def calc_shift(hour):
    """Map an hour (0-23) to a shift label: 0-7 -> shift_3, 8-15 -> shift_1, 16-23 -> shift_2."""
    shift = hour//8
    if shift == 0:
        shift = 3
    return 'shift_' + str(shift)


In [11]:
sf_data['shift'] = sf_data['hour'].apply(calc_shift)
# or leave shift as hour//8. so that it sorts into time order, but label shift0 as third shift
#sf_data['shift'] = sf_data['hour'].apply(lambda x : x//8)

In [12]:
sf_data[sf_data['shift']=='shift_2'].tail(2)

IncidntNum,category,descript,dayofweek,date,time,pddistrict,resolution,address,x,y,location,pdid,hour,month,day,year,shift
30005882,vehicle theft,"vehicle, recovered, auto",wednesday,2003-01-01,16:00,TARAVAL,NONE,1700 Block of 48TH AV,-122.507539,37.753788,"(37.7537879722664, -122.507539443431)",3000588207041,16,1,1,2003,shift_2
30005882,vehicle theft,stolen automobile,wednesday,2003-01-01,16:00,TARAVAL,NONE,1700 Block of 48TH AV,-122.507539,37.753788,"(37.7537879722664, -122.507539443431)",3000588207021,16,1,1,2003,shift_2
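
The same shift mapping can also be computed on the whole column at once instead of calling a function per row. This is a sketch on a handful of sample hours; `hours` is a stand-in for `sf_data['hour']`.

```python
import pandas as pd

# Sample hours covering the boundaries of each shift
hours = pd.Series([0, 7, 8, 15, 16, 23])

# Integer-divide into 8-hour blocks, then relabel block 0 as the 3rd shift
shift_num = (hours // 8).replace(0, 3)
shift = 'shift_' + shift_num.astype(str)
print(shift.tolist())
```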


In [13]:
copy = sf_data.copy()

In [14]:
sf_data['year'].describe()

count    1866570.000000
mean        2009.062880
std            3.821668
min         2003.000000
25%         2006.000000
50%         2009.000000
75%         2012.000000
max         2016.000000
Name: year, dtype: float64

In [15]:
sf_data['year'].value_counts()

2015    153879
2013    152812
2014    150161
2003    149176
2004    148148
2005    142186
2008    141311
2012    140858
2009    139861
2006    137853
2007    137639
2010    133525
2011    132699
2016      6462
Name: year, dtype: int64

## Corrections for Raw data
[[back to top](#Sections)]

After analysing the data in the 5_analysis notebook, several anomalies appeared. They led me to remove some data, as explained below.

### Investigate Day 1 anomaly
- While investigating the data in Part 2, I found a large spike in the number of incidents on the first day of the month.
- Investigate and resolve

In [16]:
sf_data['day'].value_counts().head()

1     72766
15    63578
20    62161
16    61410
17    61359
Name: day, dtype: int64

In [17]:
sf_data[sf_data['day']==1]['time'].value_counts().head(6)

00:01    6934
12:00    4053
08:00    1977
09:00    1837
18:00    1281
17:00    1247
Name: time, dtype: int64

In [18]:
sf_data['time'].value_counts().head(7)

12:00    48318
00:01    47451
18:00    40872
17:00    35815
19:00    34903
20:00    34794
22:00    33674
Name: time, dtype: int64

In [19]:
sf_data[(sf_data['day']==1) & (sf_data['time']=='00:01')].count()

category      6934
descript      6934
dayofweek     6934
date          6934
time          6934
pddistrict    6934
resolution    6934
address       6934
x             6934
y             6934
location      6934
pdid          6934
hour          6934
month         6934
day           6934
year          6934
shift         6934
dtype: int64

### observations
- 6934 records were recorded on the first of the month at 00:01.
- This is ~10% of the records for the 1st and 0.37% of total records.
- My hypothesis is that some incidents were reported later and, when the exact date and time were not known, they were recorded this way.
- Since my model is based on the day of the incident, these records are not reliable.
- If I remove them, the counts for the 1st are more in line with the other days of the month.
- It also appears that many of the times in the records may be approximate (many at 12:00). This will not affect the daily totals, but could be a problem if I try to divide the day into shifts.
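
The percentages in the observations can be checked directly from the counts printed in the cells above. The variable names are only for illustration.

```python
# Counts copied from the outputs above
first_at_0001 = 6934      # records on day 1 at 00:01
first_total = 72766       # all records on day 1
all_records = 1866570     # all records in the file

share_of_first = first_at_0001 / first_total
share_of_all = first_at_0001 / all_records
remaining_first = first_total - first_at_0001

print(round(share_of_first * 100, 1), round(share_of_all * 100, 2), remaining_first)
```

The remaining count for the 1st (65,832) is much closer to the next-highest day (63,578 for the 15th) than the original 72,766.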

### results
- Remove the records in question

In [21]:
sf_data = sf_data.drop(sf_data[(sf_data['day']==1) & (sf_data['time']=='00:01')].index)
sf_data[(sf_data['day']==1) & (sf_data['time']=='00:01')].count()

category      0
descript      0
dayofweek     0
date          0
time          0
pddistrict    0
resolution    0
address       0
x             0
y             0
location      0
pdid          0
hour          0
month         0
day           0
year          0
shift         0
dtype: int64
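
One caveat with dropping by index label: `sf_data` was read with `index_col=0`, so its index is the incident number, and incident numbers repeat in the data (30005882 appears twice in the output above). `drop` removes every row that carries a given label, not only the rows that matched the filter. The toy frame below is hypothetical and only illustrates the difference from a boolean mask.

```python
import pandas as pd

# Hypothetical frame with a repeated index label (100)
toy = pd.DataFrame(
    {'day': [1, 2, 3], 'time': ['00:01', '09:30', '12:00']},
    index=pd.Index([100, 100, 200], name='IncidntNum'),
)
mask = (toy['day'] == 1) & (toy['time'] == '00:01')

by_label = toy.drop(toy[mask].index)  # removes both rows labelled 100
by_mask = toy[~mask]                  # removes only the matching row

print(len(by_label), len(by_mask))
```

The later `value_counts()` totals in this notebook add up to 1,859,634, which is two fewer than 1,866,570 - 6,934 = 1,859,636. That would be consistent with a couple of extra rows having been dropped this way.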

## Create Crime Measuring Variables
[[back to top](#Sections)]

Instead of only calculating the number of incidents per day, I created several other measurements that could be used for analysis and modeling, for example to help with workforce planning. They are very subjective, and domain knowledge could help tune them.
- Create crime level feature that weights the incident by the severity/violence of the crime
- Create a weather crime level. I select categories that I feel may be weather related.
- Create a Violent crime feature based on a list of violent/emotional words.
- Create a gun related feature


### Investigate the categories and their sub descriptions

In [22]:
sf_data['category'].value_counts()

larceny/theft                  379182
other offenses                 264937
non-criminal                   197349
assault                        162750
vehicle theft                  113192
drug/narcotic                  110768
vandalism                       95006
warrants                        88832
burglary                        77862
suspicious occ                  66303
missing person                  54751
robbery                         48277
fraud                           35053
secondary codes                 21238
forgery/counterfeiting          21126
weapon laws                     18307
trespass                        15557
prostitution                    15475
stolen property                  9944
sex offenses, forcible           9119
drunkenness                      8961
disorderly conduct               8883
recovered vehicle                6343
driving under the influence      4905
kidnapping                       4795
runaway                          3971
liquor laws 

In [23]:
sf_data[sf_data['category']=='larceny/theft']['descript'].value_counts()

grand theft from locked auto                               132399
petty theft from locked auto                                42554
petty theft of property                                     35174
grand theft of property                                     23709
petty theft from a building                                 21477
petty theft shoplifting                                     20397
grand theft from a building                                 19830
grand theft from person                                     14933
grand theft pickpocket                                      11840
grand theft from unlocked auto                              10127
petty theft with prior                                       8196
petty theft from unlocked auto                               5572
grand theft bicycle                                          5342
attempted theft from locked vehicle                          4692
grand theft shoplifting                                      4563
petty thef

In [24]:
sf_data[sf_data['category']=='robbery']['descript'].value_counts()

robbery on the street, strongarm                     13925
robbery, bodily force                                 9543
robbery on the street with a gun                      4308
attempted robbery on the street with bodily force     2359
robbery, armed with a gun                             2333
attempted robbery with bodily force                   1626
robbery on the street with a knife                    1543
robbery on the street with a dangerous weapon         1382
robbery, armed with a knife                           1126
robbery of a commercial establishment, strongarm       874
robbery, armed with a dangerous weapon                 852
robbery of a chain store with bodily force             719
robbery of a commercial establishment with a gun       693
attempted robbery on the street with a gun             684
carjacking with a gun                                  500
robbery of a residence with bodily force               485
attempted robbery on the street w/deadly weapon        3

In [25]:
sf_data[sf_data['category']=='assault']['descript'].value_counts()

battery                                                        57689
threats against life                                           30545
inflict injury on cohabitee                                    15128
aggravated assault with a deadly weapon                        13729
aggravated assault with bodily force                           10492
battery, former spouse or dating relationship                   6069
aggravated assault with a knife                                 5196
child abuse (physical)                                          2824
battery of a police officer                                     2818
aggravated assault with a gun                                   2208
threatening phone call(s)                                       1806
stalking                                                        1744
battery with serious injuries                                   1726
elder adult or dependent abuse (not embezzlement or theft)      1372
assault                           

In [28]:
sf_data[sf_data['category']=='disorderly conduct']['descript'].value_counts()

committing public nuisance                              2729
disturbing the peace                                    1974
maintaining a public nuisance after notification        1629
disturbing the peace, fighting                          1025
maintaining a public nuisance                            819
disturbing the peace, commotion                          358
disturbing the peace, swearing                           339
disturbing religious meetings                              9
disturbance of non-religious, non-political assembly       1
Name: descript, dtype: int64

In [29]:
sf_data[sf_data['category']=='drunkenness']['descript'].value_counts()

under influence of alcohol in a public place    8961
Name: descript, dtype: int64

In [30]:
sf_data[sf_data['category']=='sex offenses, forcible']['descript'].value_counts()

sexual battery                                       3139
forcible rape, bodily force                          1476
child abuse sexual                                    827
assault to rape with bodily force                     617
oral copulation                                       333
annoy or molest children                              290
attempted rape, bodily force                          287
child abuse, pornography                              277
oral copulation, unlawful (adult victim)              263
sodomy (adult victim)                                 245
penetration, forced, with object                      222
sodomy                                                199
sexual assault, aggravated, of child                  163
engaging in lewd act                                  131
sexual assault, administering drug to commit          129
child abuse, exploitation                             128
forcible rape, armed with a sharp instrument          120
forcible rape,

In [31]:
sf_data[sf_data['category']=='prostitution']['descript'].value_counts()

solicits for act of prostitution                   6807
solicits to visit house of prostitution            5089
loitering for purpose of prostitution              2448
engaging in lewd conduct - prostitution related     301
human trafficking                                   239
pimping                                             191
pandering                                           141
indecent exposure - prostitution related             94
solicits lewd act                                    82
inmate/keeper of house of prostitution               64
placing wife in house of prostitution                12
disorderly house, keeping                             3
procurement, pimping, & pandering                     2
purchase female for the purpose of prostitution       2
Name: descript, dtype: int64

In [32]:
sf_data[sf_data['category']=='drug/narcotic']['descript'].value_counts()

possession of narcotics paraphernalia                20638
possession of base/rock cocaine                      14122
possession of marijuana                              11289
sale of base/rock cocaine                             8813
possession of meth-amphetamine                        7627
possession of base/rock cocaine for sale              7390
possession of marijuana for sales                     5664
possession of controlled substance                    4272
possession of heroin                                  4132
possession of cocaine                                 3005
sale of marijuana                                     2900
possession of meth-amphetamine for sale               2387
possession of controlled substance for sale           2207
possession of heroin for sales                        1823
sale of controlled substance                          1568
possession of cocaine for sales                       1312
sale of heroin                                        12

In [33]:
sf_data[sf_data['category']=='non-criminal']['descript'].value_counts()

lost property                                         66498
aided case, mental disturbed                          46230
found property                                        26441
aided case                                            11603
death report, cause unknown                            9085
case closure                                           5039
stay away or court order, non-dv related               3422
aided case, dog bite                                   2910
civil sidewalks, citation                              2667
property for identification                            2552
aided case, injured person                             2220
death report, natural causes                           2048
courtesy report                                        1933
aided case -property for destruction                   1846
fire report                                            1690
located property                                       1570
tarasoff report                         

In [34]:
sf_data[sf_data['category']=='weapon laws']['descript'].value_counts()

poss of loaded firearm                                     4109
carrying a concealed weapon                                2017
exhibiting deadly weapon in a threating manner             2015
poss of firearm by convicted felon/addict/alien            1538
poss of prohibited weapon                                  1398
discharge firearm at an inhabited dwelling                 1179
possession of air gun                                       867
loitering while carrying concealed weapon                   735
discharge firearm within city limits                        630
poss of deadly weapon with intent to assault                530
firearm, loaded, in vehicle, possession or use              502
carrying of concealed weapon by convicted felon             345
ammunition, poss. by prohibited person                      325
weapon, possess or bring other on school grounds            230
switchblade knife, possession                               190
firearm, armed while possessing controll

In [35]:
sf_data[sf_data['category']=='secondary codes']['descript'].value_counts()

domestic violence                         15631
juvenile involved                          1885
gang activity                              1641
prejudice-based incident                   1397
atm related crime                           585
battery by juvenile suspect                  53
weapons possession by juvenile suspect       26
assault by juvenile suspect                  18
shooting by juvenile suspect                  2
Name: descript, dtype: int64

In [36]:
sf_data[sf_data['category']=='family offenses']['descript'].value_counts()

desertion of child                                        266
children, abandonment & neglect of (general)              230
minor without proper parental care                        214
abandonment of child                                      200
failure to provide for child                               82
immoral acts or drunk in presence of child                 38
concealment/removal of child without consent               31
failure to provide for parents                              3
harassing child or ward because of person's employment      1
Name: descript, dtype: int64

### Assign a Crime level and a Weather-related flag to each category
- First number is the crime level in the range 1-4: a higher number is more violent
- Second number (0/1) indicates whether I believe the crime would be affected by the weather

In [37]:
levels = {'larceny/theft' : [2,0],                
          'other offenses' : [1,0],                 
          'non-criminal' : [1,0],
          'assault' :  [4,1],                        
          'vehicle theft' : [2,0],                 
          'drug/narcotic' : [2,0],                 
          'vandalism' : [2,1],                       
          'warrants' : [1,0],                        
          'burglary' : [2,0],                        
          'suspicious occ' : [2,0],                 
          'missing person' : [1,0],                 
          'robbery' : [2,1],                         
          'fraud' : [2,0],                          
          'forgery/counterfeiting' : [2,0],         
          'secondary codes' :  [4,1],              
          'weapon laws' :  [3,1],                    
          'trespass' :  [2,0],                       
          'prostitution' :  [2,0],                  
          'stolen property' :  [2,0],                 
          'sex offenses, forcible' : [4,1],          
          'drunkenness' :  [1,1],                     
          'disorderly conduct'  : [1,1],              
          'recovered vehicle' :  [1,0],              
          'driving under the influence' :  [1,0],      
          'kidnapping' :  [3,0],                      
          'runaway' :  [1,0],                          
          'liquor laws' : [1,0],                     
          'arson' : [3,1],                           
          'embezzlement' : [1,0],                    
          'loitering' : [1,0],                      
          'suicide' :  [1,1],                         
          'family offenses' : [3,1],                  
          'bad checks' : [1,0],                 
          'bribery' : [1,0],                          
          'extortion' : [2,0],                        
          'sex offenses, non forcible' : [2,1],       
          'gambling' : [1,0],                          
          'pornography/obscene mat' : [2,0],          
          'trea' : [1,0]}

### Map these levels to the incidents

In [38]:
sf_data['crime_level'] = sf_data['category'].map(lambda x : levels[x][0])
sf_data['weather_crime'] = sf_data['category'].map(lambda x : levels[x][1])
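
An alternative way to attach both values is to turn the dictionary into a small lookup table and join it on the category. This is a sketch using a three-category subset of `levels`; the check for unmapped categories is an addition for illustration, since the lambda above would raise a `KeyError` on a category missing from the dictionary.

```python
import pandas as pd

# Subset of the levels dictionary, for illustration only
levels = {'assault': [4, 1], 'warrants': [1, 0], 'vandalism': [2, 1]}

# One row per category, one column per value in the list
levels_df = pd.DataFrame.from_dict(
    levels, orient='index', columns=['crime_level', 'weather_crime'])

toy = pd.DataFrame({'category': ['assault', 'warrants', 'vandalism', 'assault']})

# Categories present in the data but absent from the dictionary
missing = set(toy['category']) - set(levels)

# Look up both values in one step by joining on the category name
toy = toy.join(levels_df, on='category')
print(toy)
```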



### Tag incidents using the following list of words
- Words were selected from reviewing the incident descriptions looking for words that are more violent or could be triggered by emotions

In [39]:
v_words = ['assault', 'battery', 'drunk', 'abuse', 'forced', 'rape', 'shootin',
           'violence', 'harassing', 'threat', 'threating', 'threats', 'resist', 'resisting',
           'destruction', 'weapons', 'gun', 'knife', 'armed', 'deadly', 'drunkenness',
           'bomb', 'bombing', 'influence', 'looting', 'disorderly', 'force', 'forcible',
           'fighting', 'injurues', 'nusance', 'homicide', 'alcohol', 'rape', 'mayhem',
           'abuse', 'cruelty', 'lewd', 'molest', 'distubing']
       

In [40]:
sf_data['v_word'] = sf_data['descript'].apply(lambda x: any(word in x for word in v_words) )
sf_data['v_word'].value_counts()



False    1573321
True      286313
Name: v_word, dtype: int64
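
Note that `word in x` is a substring test, not a whole-word match. The sketch below uses a three-word subset of `v_words` and a few descriptions from the outputs above to show two consequences: a short stem such as 'threat' also matches 'threating', while an entry spelled 'nusance' never matches 'nuisance'.

```python
# Subset of v_words, spelled exactly as in the list above
v_words = ['threat', 'nusance', 'gun']

descriptions = [
    'exhibiting deadly weapon in a threating manner',
    'committing public nuisance',
    'possession of air gun',
    'lost property',
]

# Same test as the v_word cell: True if any word occurs as a substring
flags = [any(word in d for word in v_words) for d in descriptions]
print(flags)
```

Correcting the spelling of such entries would flag additional incidents, so the counts shown in the output above would change.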

### Tag incidents where a Gun was used
- Just wanted to know
- May be useful for other analysis

In [41]:
sf_data['gun'] = sf_data['descript'].apply(lambda x: x.find('gun') != -1 )
sf_data['gun'].value_counts()


False    1843633
True       16001
Name: gun, dtype: int64
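
The same tag can be produced with the vectorized string method `str.contains`. This is a sketch on three sample descriptions; as with `find`, it is a plain substring test, so 'possession of air gun' is tagged as well.

```python
import pandas as pd

descript = pd.Series(['robbery, armed with a gun',
                      'possession of air gun',
                      'battery'])

# regex=False treats 'gun' as a literal substring
gun = descript.str.contains('gun', regex=False)
print(gun.tolist())
```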

In [42]:
sf_data.head()

IncidntNum,category,descript,dayofweek,date,time,pddistrict,resolution,address,x,y,...,pdid,hour,month,day,year,shift,crime_level,weather_crime,v_word,gun
160051264,warrants,warrant arrest,monday,2016-01-18,23:52,CENTRAL,"ARREST, BOOKED",400 Block of POWELL ST,-122.408568,37.788759,...,16005126463010,23,1,18,2016,shift_2,1,0,False,False
160051242,robbery,"robbery, bodily force",monday,2016-01-18,23:40,TENDERLOIN,NONE,100 Block of STOCKTON ST,-122.406428,37.787109,...,16005124203074,23,1,18,2016,shift_2,2,1,True,False
160051305,disorderly conduct,committing public nuisance,monday,2016-01-18,23:30,MISSION,"ARREST, BOOKED",500 Block of SHOTWELL ST,-122.415922,37.759612,...,16005130519010,23,1,18,2016,shift_2,1,1,False,False
160051258,robbery,attempted robbery on the street with a gun,monday,2016-01-18,23:30,BAYVIEW,NONE,BANCROFT AV / KEITH ST,-122.392791,37.725605,...,16005125803411,23,1,18,2016,shift_2,2,1,True,True
160051258,assault,aggravated assault with a gun,monday,2016-01-18,23:30,BAYVIEW,NONE,BANCROFT AV / KEITH ST,-122.392791,37.725605,...,16005125804011,23,1,18,2016,shift_2,4,1,True,True


## Consolidate into daily records
[[back to top](#Sections)]

Group the incidents by day, then count the number of incidents and sum the crime_level.

### Group by the day
- Aggregate by the sum and count for my measuring variables
- To group by day and shift, uncomment code marked #SHIFT

In [43]:
day_group = sf_data.groupby(['date'])[['crime_level','weather_crime', 'v_word', 'gun']].agg(['sum', 'count'])


#SHIFT day_group = sf_data.groupby(['date','shift'])[['crime_level']].agg(['sum', 'count'])
day_group.head()

,crime_level,crime_level,weather_crime,weather_crime,v_word,v_word,gun,gun
date,sum,count,sum,count,sum,count,sum,count
2003-01-01,1066,541,143,541,131,541,5,541
2003-01-02,750,411,72,411,52,411,1,411
2003-01-03,799,440,84,440,60,440,0,440
2003-01-04,674,347,65,347,58,347,3,347
2003-01-05,755,377,102,377,71,377,2,377


In [44]:
# unstack to bring shift from rows to columns
#SHIFT day_group = day_group.unstack(level=-1)
#SHIFT day_group.head()

In [45]:
day_group.columns.values

array([('crime_level', 'sum'), ('crime_level', 'count'),
       ('weather_crime', 'sum'), ('weather_crime', 'count'),
       ('v_word', 'sum'), ('v_word', 'count'), ('gun', 'sum'),
       ('gun', 'count')], dtype=object)

In [46]:
day_group.columns = ['_'.join(col).strip() for col in day_group.columns.values]
# drop weather_crime_count, v_word_count and gun_count - each just counts all records, same as crime_level_count
day_group.drop(['weather_crime_count', 'v_word_count', 'gun_count'], axis=1, inplace=True)
day_group.head(2)


Unnamed: 0_level_0,crime_level_sum,crime_level_count,weather_crime_sum,v_word_sum,gun_sum
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2003-01-01,1066,541,143,131,5
2003-01-02,750,411,72,52,1


In [47]:
# Sum the 3 shifts into day totals
# (parentheses let each statement span several lines once uncommented)
#SHIFT day_group['crime_level_sum_day'] = (day_group['crime_level_sum_shift_1'] +
#SHIFT                                     day_group['crime_level_sum_shift_2'] +
#SHIFT                                     day_group['crime_level_sum_shift_3'])
#SHIFT day_group['crime_level_count_day'] = (day_group['crime_level_count_shift_1'] +
#SHIFT                                       day_group['crime_level_count_shift_2'] +
#SHIFT                                       day_group['crime_level_count_shift_3'])
#SHIFT day_group.head(2)
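
The #SHIFT variant can be tried on a small example first. The following is a minimal sketch, assuming toy incident values that are not taken from the SFPD file; `toy` and `by_shift` are illustrative names. It groups by date and shift, unstacks the shift level into columns, flattens the column names, and sums the per-shift columns into a day total.

```python
import pandas as pd

# Toy incidents: hypothetical values, not taken from the SFPD file
toy = pd.DataFrame({
    'date': ['2016-01-17', '2016-01-17', '2016-01-17',
             '2016-01-18', '2016-01-18', '2016-01-18'],
    'shift': ['shift_1', 'shift_2', 'shift_3',
              'shift_1', 'shift_2', 'shift_2'],
    'crime_level': [1, 2, 4, 1, 2, 1],
})

# Group by day and shift, then aggregate as in the day-only version
by_shift = toy.groupby(['date', 'shift'])[['crime_level']].agg(['sum', 'count'])

# Move the shift level from the row index into the columns;
# a day with no incidents in a shift gets 0 rather than NaN
by_shift = by_shift.unstack(level=-1, fill_value=0)

# Flatten the three-level column names, e.g. 'crime_level_sum_shift_1'
by_shift.columns = ['_'.join(col) for col in by_shift.columns]

# Day totals are the row-wise sum over the per-shift sum columns
sum_cols = [c for c in by_shift.columns if c.startswith('crime_level_sum_')]
by_shift['crime_level_sum_day'] = by_shift[sum_cols].sum(axis=1)

print(by_shift)
```

Summing over a list of matching columns avoids writing out each shift by hand, and `fill_value=0` keeps a day with an empty shift from turning the total into NaN.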

### Add in the other fields that are not crime rate
Features that are needed for further analysis
- day, month, year and dayofweek

In [48]:
day_group_static = sf_data.groupby(['date'])[['dayofweek','day', 'month', 'year']].min()
day_group_static.head()

Unnamed: 0_level_0,dayofweek,day,month,year
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2003-01-01,wednesday,1,1,2003
2003-01-02,thursday,2,1,2003
2003-01-03,friday,3,1,2003
2003-01-04,saturday,4,1,2003
2003-01-05,sunday,5,1,2003


## New Features for Daily records
[[back to top](#Sections)]

Analysis and modeling showed that I need some additional features, which have to be created in this step of data preparation.

### Add in the 'end_of_week' feature
- Reduce the 7 day-of-week features to a single one that marks the end of the week (Friday, Saturday, Sunday), when the crime level jumps up.

In [61]:
def eow(s):
    # True for the end-of-week days: Friday, Saturday, Sunday
    return s in ('friday', 'saturday', 'sunday')

In [62]:
day_group_static.loc[:,'end_of_week'] = day_group_static['dayofweek'].map(eow)
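
As an alternative to mapping `eow` over each value, the same flag can be computed for the whole column at once. This is a minimal sketch on a toy frame (`toy` is an illustrative name), assuming the day names are lowercase as in the notebook.

```python
import pandas as pd

# Toy frame with the same lowercase day names used in the notebook
toy = pd.DataFrame(
    {'dayofweek': ['thursday', 'friday', 'saturday', 'sunday', 'monday']}
)

# isin() tests the whole column at once and returns a boolean Series
toy['end_of_week'] = toy['dayofweek'].isin(['friday', 'saturday', 'sunday'])

print(toy)
```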



### Merge the crime level DataFrame with the other fields
- day_group
- day_group_static

In [63]:
# join_axes was removed from pd.concat in pandas 1.0; reindex gives the same row alignment
data = pd.concat([day_group, day_group_static], axis=1).reindex(day_group.index)
data.head()

Unnamed: 0_level_0,crime_level_sum,crime_level_count,weather_crime_sum,v_word_sum,gun_sum,dayofweek,day,month,year,end_of_week
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2003-01-01,1066,541,143,131,5,wednesday,1,1,2003,False
2003-01-02,750,411,72,52,1,thursday,2,1,2003,False
2003-01-03,799,440,84,60,0,friday,3,1,2003,True
2003-01-04,674,347,65,58,3,saturday,4,1,2003,True
2003-01-05,755,377,102,71,2,sunday,5,1,2003,True


## Corrections for Daily records
[[back to top](#Sections)]

After looking at the data here and in the 5_analysis notebook, several anomalies appeared. They led me to remove some data, as explained below.

### Observations

In [64]:
data.describe()

Unnamed: 0,crime_level_sum,crime_level_count,weather_crime_sum,v_word_sum,gun_sum,day,month,year,end_of_week
count,4765.0,4765.0,4765.0,4765.0,4765.0,4765.0,4765.0,4765.0,4765
mean,730.839454,390.269465,79.38489,60.086674,3.358027,15.706611,6.502413,2009.025813,0.428751
std,88.840986,47.094029,14.074783,11.273555,2.150567,8.798708,3.45947,3.759785,0.49495
min,2.0,2.0,0.0,0.0,0.0,1.0,1.0,2003.0,False
25%,673.0,360.0,70.0,52.0,2.0,8.0,4.0,2006.0,0
50%,729.0,390.0,79.0,59.0,3.0,16.0,7.0,2009.0,0
75%,785.0,420.0,88.0,67.0,5.0,23.0,10.0,2012.0,1
max,1196.0,593.0,158.0,131.0,17.0,31.0,12.0,2016.0,True


#### Outliers
#### Minimum crime level count is 2. 
- It seems unreasonable that there would be only 2 incidents on a day.

In [65]:
data.sort_values('crime_level_sum', ascending=True).head(30)

Unnamed: 0_level_0,crime_level_sum,crime_level_count,weather_crime_sum,v_word_sum,gun_sum,dayofweek,day,month,year,end_of_week
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
2007-12-16,2,2,0,0,0,sunday,16,12,2007,True
2008-08-01,8,6,0,0,0,friday,1,8,2008,True
2013-12-24,308,173,18,18,0,tuesday,24,12,2013,False
2013-12-25,314,152,39,29,2,wednesday,25,12,2013,False
2011-12-25,327,168,47,28,1,sunday,25,12,2011,True
2010-12-25,355,175,46,32,0,saturday,25,12,2010,True
2013-12-23,379,187,46,47,1,monday,23,12,2013,False
2007-12-25,395,212,52,39,4,tuesday,25,12,2007,False
2011-11-24,397,222,45,39,2,thursday,24,11,2011,False
2008-12-25,403,199,60,42,4,thursday,25,12,2008,False


#### Remove the two days that appear to be missing data
- There were 2 days that had only 2 and 6 incidents (crime level sums of 2 and 8).
- Remove them, since data must be missing for those 2 days.

In [66]:
data = data[data['crime_level_count'] > 10]
data.shape

(4763, 10)

#### Several really low scores 
- After looking at the data, I saw that almost all the low scores were for December 25.
- It is reasonable to assume that officers know that Christmas is a special day.
- Drop days with an incident count (crime_level_count) of 250 or less, so Christmas does not skew results.

In [67]:
data = data[data['crime_level_count'] > 250]
data.shape

(4743, 10)

#### Several really high scores 
- After looking at the data, I saw that just a few days spiked very high.
- Although I believe that reporting to be true, these few days are probably not related to weather.
- Drop days with an incident count (crime_level_count) of 550 or more, to remove these anomalies.


In [68]:
data = data[data['crime_level_count'] < 550]
data.shape

(4740, 10)
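
The filters in this section all act on `crime_level_count`, so they can be combined into a single mask. This is a minimal sketch on hypothetical daily counts; only the thresholds 250 and 550 come from the steps above, and `LOW_COUNT`, `HIGH_COUNT`, `keep`, `filtered` and `dropped` are illustrative names.

```python
import pandas as pd

# Hypothetical daily counts; only the thresholds come from the notebook
toy = pd.DataFrame({'crime_level_count': [2, 6, 173, 250, 390, 549, 550, 593]})

LOW_COUNT = 250   # at or below this: missing data or holiday lows
HIGH_COUNT = 550  # at or above this: rare spikes

# One boolean mask covering every filter; the > 10 check is implied by > 250
keep = (toy['crime_level_count'] > LOW_COUNT) & (toy['crime_level_count'] < HIGH_COUNT)

# .copy() makes the result an independent frame, so adding columns to it
# later does not trigger a SettingWithCopyWarning
filtered = toy[keep].copy()

# Keeping the rejected rows makes it easy to review what was removed
dropped = toy[~keep]

print(filtered.shape, dropped.shape)
```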

## Normalize Levels
[[back to top](#Sections)]

### Normalize crime level and weather crime level to a scale of 0 to 10
- This makes the levels easier to explain. Knowing that a day is a crime level 7 versus a normal 5 is easier to understand than hearing that today is a 567 versus a 387.

In [69]:
high_crime = data['crime_level_sum'].max()
low_crime = data['crime_level_sum'].min()
data['crime_level'] = (data['crime_level_sum'] - low_crime) * 10 / (high_crime - low_crime)
high_w_crime = data['weather_crime_sum'].max()
low_w_crime = data['weather_crime_sum'].min()
data['weather_crime_level'] = (data['weather_crime_sum'] - low_w_crime) * 10 / (high_w_crime - low_w_crime)
data.head()

Unnamed: 0_level_0,crime_level_sum,crime_level_count,weather_crime_sum,v_word_sum,gun_sum,dayofweek,day,month,year,end_of_week,crime_level,weather_crime_level
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1
2003-01-01,1066,541,143,131,5,wednesday,1,1,2003,False,10.0,8.717949
2003-01-02,750,411,72,52,1,thursday,2,1,2003,False,4.903226,2.649573
2003-01-03,799,440,84,60,0,friday,3,1,2003,True,5.693548,3.675214
2003-01-04,674,347,65,58,3,saturday,4,1,2003,True,3.677419,2.051282
2003-01-05,755,377,102,71,2,sunday,5,1,2003,True,4.983871,5.213675
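
The same min-max scaling is applied to both columns, so it can be wrapped in a small helper. This is a minimal sketch; `scale_0_10` is a hypothetical helper name and the input values are toy numbers, not rows from the data.

```python
import pandas as pd

def scale_0_10(values):
    """Rescale a numeric Series linearly so its min maps to 0 and its max to 10."""
    low = values.min()
    high = values.max()
    return (values - low) * 10 / (high - low)

# Toy daily sums, chosen so the arithmetic is easy to check by hand
toy = pd.Series([400, 550, 700, 1000])
scaled = scale_0_10(toy)

print(scaled.tolist())
```

Because the scale depends on the minimum and maximum of the column, it should be computed after the outlier days have been removed; otherwise a single extreme day would compress every other value.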


## Write final data to file
[[back to top](#Sections)]

In [70]:
data.to_csv('sf_crime_clean.csv')
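
When the file is read back in a later notebook, the date index has to be restored explicitly. This is a minimal sketch of the round trip on a toy frame written to a temporary file; the file name `sf_crime_clean_toy.csv` is illustrative, and it assumes the index is named `date` and should be parsed as datetimes.

```python
import os
import tempfile

import pandas as pd

# Toy frame with a 'date' index, standing in for the daily records
toy = pd.DataFrame(
    {'crime_level_sum': [1066, 750], 'end_of_week': [False, False]},
    index=pd.DatetimeIndex(['2003-01-01', '2003-01-02'], name='date'),
)

# Write to a temporary directory so the real output file is not touched
path = os.path.join(tempfile.mkdtemp(), 'sf_crime_clean_toy.csv')
toy.to_csv(path)

# index_col restores 'date' as the index; parse_dates converts it to datetimes
restored = pd.read_csv(path, index_col='date', parse_dates=['date'])

print(restored.index)
```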

In [71]:
data[data['day']==1].sum()

crime_level_sum                                                   122882
crime_level_count                                                  65251
weather_crime_sum                                                  14016
v_word_sum                                                         10748
gun_sum                                                              563
dayofweek              wednesdaysaturdaysaturdaytuesdaythursdaysunday...
day                                                                  155
month                                                                996
year                                                              311399
end_of_week                                                           65
crime_level                                                      866.968
weather_crime_level                                              654.786
dtype: object

In [72]:
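# numeric_only=True skips the string column dayofweek (required in pandas 2.0 and later)
data[data['day']==15].mean(numeric_only=True)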

crime_level_sum         758.292994
crime_level_count       404.955414
weather_crime_sum        80.987261
v_word_sum               61.707006
gun_sum                   3.280255
day                      15.000000
month                     6.464968
year                   2009.044586
end_of_week               0.426752
crime_level               5.036984
weather_crime_level       3.417715
dtype: float64