# Clean San Francisco Crime data 
- Jim Haskin

- GA-Data Science
- Dec 2015

- 2/17/2016

## Method
- I collected the San Francisco Police Department incident reports from the SF OpenData website (https://data.sfgov.org/data?category=Public%20Safety). The records run from January 2003 until the beginning of 2016.
- I cleaned and reformatted the fields.
- I summarized the reports into a daily record of the number of incidents plus a second measure I am calling Crime Level. Each incident is given a score based on how violent it is: murders and assaults are rated high; traffic violations and non-criminal incidents are rated low. These scores are summed per day and then normalized to a scale of 0 - 10.
- I also added other crime-measuring variables that can be used to narrow the research to different subsets of crime.
- 'gun_level', 'COP_count' and 'violent_count' are examples. These are explained below.
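
The scoring idea can be sketched on a few made-up incidents. The weights, the toy data, and the min-max rescaling to 0 - 10 are assumptions for illustration only; the weights actually used are defined in the sections below, and the normalization step may differ.

```python
import pandas as pd

# Hypothetical severity weights (illustrative, not the notebook's full mapping)
toy_weights = {'assault': 4, 'larceny/theft': 2, 'non-criminal': 1}

# Made-up incidents over three days
toy = pd.DataFrame({
    'date': pd.to_datetime(['2016-01-01', '2016-01-01', '2016-01-02',
                            '2016-01-03', '2016-01-03', '2016-01-03']),
    'category': ['assault', 'non-criminal', 'larceny/theft',
                 'assault', 'assault', 'larceny/theft'],
})

# Score each incident, then sum the scores per day
toy['score'] = toy['category'].map(toy_weights)
daily = toy.groupby('date')['score'].sum()

# Rescale the daily sums to 0 - 10 (min-max scaling is an assumption here)
scaled = 10 * (daily - daily.min()) / (daily.max() - daily.min())
print(scaled)
```

With this scaling the quietest day maps to 0 and the worst day to 10, so the value is relative to the period covered by the data.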

# Sections


- [Data Source](#Data-Source)
- [Clean Data](#Clean-Data)
- [New Features for Raw Data](#New-Features-for-Raw-data)
- [Corrections for Raw data](#Corrections-for-Raw-data)
- [Create Crime Measuring Variables](#Create-Crime-Measuring-Variables)
- [Consolidate into daily records](#Consolidate-into-daily-records)
- [Normalize Levels](#Normalize-Levels)
- [Write final data to file](#Write-final-data-to-file)
- [qq](#qq)

## Data Source
[[back to top](#Sections)]

- Data downloaded from SF Open Data site. File includes incidents from 1/1/2003 until the present 
- SFPD_Incidents_-_from_1_January_2003.csv
- https://data.sfgov.org/data?category=Public%20Safety


FieldName|Type|Description                             
---------------|------------|---------------------
IncidntNum|string|Police assigned number
Category|string|General Crime category
Descript|string|Secondary category/details
DayOfWeek|string|Day of week event occurred
Date|string|Date in format : 01/18/2016
Time|string|Time in format : 23:52
PdDistrict|string|Police District that event occurred in
Resolution|string|How case was resolved
Address|string|Address of event
X|float|Longitude 
Y|float|Latitude
Location|string|Latitude,Longitude in character pair
PdId|int|Police Department ID number


In [1]:
import pandas as pd
import numpy as np
import seaborn as sb
%matplotlib inline

In [2]:
! head -2 SFPD_Incidents_-_from_1_January_2003.csv


IncidntNum,Category,Descript,DayOfWeek,Date,Time,PdDistrict,Resolution,Address,X,Y,Location,PdId
160051264,WARRANTS,WARRANT ARREST,Monday,01/18/2016,23:52,CENTRAL,"ARREST, BOOKED",400 Block of POWELL ST,-122.408568445228,37.7887594214703,"(37.7887594214703, -122.408568445228)",16005126463010


In [3]:
! tail -2 SFPD_Incidents_-_from_1_January_2003.csv



031353484,OTHER OFFENSES,OBSCENE PHONE CALLS(S),Wednesday,01/01/2003,00:01,TARAVAL,NONE,1500 Block of 41ST AV,-122.5003001196,37.7578465298467,"(37.7578465298467, -122.5003001196)",3135348419050
030320997,SUSPICIOUS OCC,SUSPICIOUS OCCURRENCE,Wednesday,01/01/2003,00:01,SOUTHERN,NONE,0 Block of LAFAYETTE ST,-122.416608653757,37.7725681063387,"(37.7725681063387, -122.416608653757)",3032099764070


### Read in Crime data

In [4]:
sf_data = pd.read_csv('SFPD_Incidents_-_from_1_January_2003.csv', index_col=0)    # has header, commas, index
sf_data.head(2)

IncidntNum|Category|Descript|DayOfWeek|Date|Time|PdDistrict|Resolution|Address|X|Y|Location|PdId
---|---|---|---|---|---|---|---|---|---|---|---|---
160051264|WARRANTS|WARRANT ARREST|Monday|01/18/2016|23:52|CENTRAL|ARREST, BOOKED|400 Block of POWELL ST|-122.408568|37.788759|(37.7887594214703, -122.408568445228)|16005126463010
160051242|ROBBERY|ROBBERY, BODILY FORCE|Monday|01/18/2016|23:40|TENDERLOIN|NONE|100 Block of STOCKTON ST|-122.406428|37.787109|(37.78710945429, -122.40642786236)|16005124203074
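
The raw file contains incident numbers with a leading zero (for example `031353484` in the `tail` output above), and parsing `IncidntNum` as an integer index drops that zero. If the original identifiers matter, one option is to read the column as a string. The sketch below uses a small in-memory CSV as a stand-in for the real file; it is a possible alternative, not what this notebook does.

```python
import io
import pandas as pd

# Tiny stand-in for the real CSV (two columns only, for illustration)
toy_csv = io.StringIO(
    "IncidntNum,Category\n"
    "160051264,WARRANTS\n"
    "031353484,OTHER OFFENSES\n"
)

# Read IncidntNum as text so leading zeros survive, then use it as the index
toy = pd.read_csv(toy_csv, dtype={'IncidntNum': str}).set_index('IncidntNum')
print(toy.index.tolist())
```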


## Clean Data
[[back to top](#Sections)]

### Convert to lower case
- Feature names
- Feature values that I'm working with

In [5]:
sf_data.columns = sf_data.columns.str.lower()
sf_data['category'] = sf_data['category'].str.lower()
sf_data['descript'] = sf_data['descript'].str.lower()
sf_data['dayofweek'] = sf_data['dayofweek'].str.lower()
sf_data.head(2)

IncidntNum|category|descript|dayofweek|date|time|pddistrict|resolution|address|x|y|location|pdid
---|---|---|---|---|---|---|---|---|---|---|---|---
160051264|warrants|warrant arrest|monday|01/18/2016|23:52|CENTRAL|ARREST, BOOKED|400 Block of POWELL ST|-122.408568|37.788759|(37.7887594214703, -122.408568445228)|16005126463010
160051242|robbery|robbery, bodily force|monday|01/18/2016|23:40|TENDERLOIN|NONE|100 Block of STOCKTON ST|-122.406428|37.787109|(37.78710945429, -122.40642786236)|16005124203074


### Investigate data

In [6]:
sf_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1866570 entries, 160051264 to 30320997
Data columns (total 12 columns):
category      object
descript      object
dayofweek     object
date          object
time          object
pddistrict    object
resolution    object
address       object
x             float64
y             float64
location      object
pdid          int64
dtypes: float64(2), int64(1), object(9)
memory usage: 185.1+ MB


### Observations
- 1,866,570 records 
- date and time in string format
- other fields look appropriate

### Convert date to Pandas datetime format

In [7]:
sf_data['date'] = pd.to_datetime(sf_data['date'])
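
Since the dates are documented as `MM/DD/YYYY`, the format can also be passed explicitly, which avoids any month/day ambiguity during parsing. A minimal sketch on two sample values taken from the file:

```python
import pandas as pd

# Two date strings in the file's documented format
dates = pd.Series(['01/18/2016', '01/01/2003'])

# An explicit format makes the month/day order unambiguous
parsed = pd.to_datetime(dates, format='%m/%d/%Y')
print(parsed.dt.year.tolist(), parsed.dt.month.tolist(), parsed.dt.day.tolist())
```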

## New Features for Raw data
[[back to top](#Sections)]

Some of these features contain information that I would like to use in future analysis, but I need to put it into a usable form first.
- hour - The hour at which the incident occurred. Can be used to split the data into time frames.
- month, day, year - Separated out from the date to look for trends.
- shift - Feature derived from the time that splits the day into 3 working shifts.

### Add the hour as numeric

In [8]:
sf_data['hour'] = sf_data['time'].str[0:2].astype(int)

### Add month, day and year features

In [9]:
# Split the parsed date into numeric month, day and year columns
sf_data['month'] = sf_data['date'].dt.month
sf_data['day'] = sf_data['date'].dt.day
sf_data['year'] = sf_data['date'].dt.year
sf_data.head(2)

IncidntNum|category|descript|dayofweek|date|time|pddistrict|resolution|address|x|y|location|pdid|hour|month|day|year
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
160051264|warrants|warrant arrest|monday|2016-01-18|23:52|CENTRAL|ARREST, BOOKED|400 Block of POWELL ST|-122.408568|37.788759|(37.7887594214703, -122.408568445228)|16005126463010|23|1|18|2016
160051242|robbery|robbery, bodily force|monday|2016-01-18|23:40|TENDERLOIN|NONE|100 Block of STOCKTON ST|-122.406428|37.787109|(37.78710945429, -122.40642786236)|16005124203074|23|1|18|2016


### Create shift feature
- For more detailed analysis or workforce planning, add a feature that records the shift in which the event occurred.
- 3rd shift - Midnight to 7:59am
- 1st shift - 8:00am - 3:59pm
- 2nd shift - 4:00pm - 11:59pm

NOTE: As discussed below, many of the reported incident times appear to be approximate, which may make this feature unreliable.

In [10]:
def calc_shift(hour):
    shift = hour//8
    if shift == 0:
        shift = 3
    return 'shift_' + str(shift)
        

In [11]:
sf_data['shift'] = sf_data['hour'].apply(calc_shift)
# or leave shift as hour//8. so that it sorts into time order, but label shift0 as third shift
#sf_data['shift'] = sf_data['hour'].apply(lambda x : x//8)
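
A quick check of the shift boundaries can be sketched as follows. The function is repeated here so the snippet runs on its own, and the hours tested are the first and last hour of each shift as defined above.

```python
def calc_shift(hour):
    # Integer-divide the hour into three 8-hour blocks: 0, 1, 2
    shift = hour // 8
    # Block 0 (midnight to 7:59am) is the third shift
    if shift == 0:
        shift = 3
    return 'shift_' + str(shift)

# First and last hour of each shift
boundaries = {h: calc_shift(h) for h in (0, 7, 8, 15, 16, 23)}
print(boundaries)
```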

In [12]:
sf_data.head(2)

IncidntNum|category|descript|dayofweek|date|time|pddistrict|resolution|address|x|y|location|pdid|hour|month|day|year|shift
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
160051264|warrants|warrant arrest|monday|2016-01-18|23:52|CENTRAL|ARREST, BOOKED|400 Block of POWELL ST|-122.408568|37.788759|(37.7887594214703, -122.408568445228)|16005126463010|23|1|18|2016|shift_2
160051242|robbery|robbery, bodily force|monday|2016-01-18|23:40|TENDERLOIN|NONE|100 Block of STOCKTON ST|-122.406428|37.787109|(37.78710945429, -122.40642786236)|16005124203074|23|1|18|2016|shift_2


In [13]:
copy = sf_data.copy()

## Corrections for Raw data
[[back to top](#Sections)]

After analyzing the data in the 5_analysis notebook, several anomalies appeared. They led me to remove some data, as explained below.

### Investigate Day 1 anomaly
- While investigating the data I found a large spike in the number of incidents on the first day of the month.
- Investigate and resolve

In [14]:
sf_data['day'].value_counts().head()

1     72766
15    63578
20    62161
16    61410
17    61359
Name: day, dtype: int64

In [15]:
sf_data[sf_data['day']==15]['time'].value_counts().head()

12:00    2171
00:01    1863
18:00    1399
20:00    1199
15:00    1187
Name: time, dtype: int64

In [16]:
sf_data['time'].value_counts().head()

12:00    48318
00:01    47451
18:00    40872
17:00    35815
19:00    34903
Name: time, dtype: int64

In [17]:
sf_data[sf_data['time']=='00:01']['day'].value_counts().head()

1     6934
15    1863
20    1523
10    1510
13    1411
Name: day, dtype: int64

In [18]:
sf_data[(sf_data['day']==1) & (sf_data['time']=='00:01')]['time'].count()

6934

### Observations
- 6,934 records were recorded on the first of the month at 00:01.
- This is ~10% of the records for the 1st and 0.37% of total records.
- My hypothesis is that some incidents were reported later and, when the exact date and time were not known, they were recorded this way.
- Since my model is based on the day of the incident, these records are not reliable.
- If I remove them, the counts are more in line with the other days of the month.
- It also appears that many of the times in the records may be approximate (many are at 12:00). This will not affect the daily totals, but could be a problem if I try to divide the day into shifts.
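
The share of 00:01 records per day of the month can be computed directly from a boolean mask. The sketch below runs on a made-up six-row frame, so the resulting shares are illustrative and not taken from the real data.

```python
import pandas as pd

# Made-up records: day of month and reported time
toy = pd.DataFrame({
    'day':  [1, 1, 1, 15, 15, 20],
    'time': ['00:01', '00:01', '14:30', '00:01', '12:00', '09:15'],
})

# True where the time looks like the 00:01 placeholder
is_placeholder = toy['time'] == '00:01'

# The mean of a boolean mask is the fraction of True values per day
share_by_day = is_placeholder.groupby(toy['day']).mean()
print(share_by_day)
```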

### Observations 2
- Across all days there are a total of 47,451 records with a time of 00:01, ~2.5% of the data.
- These records may also be inaccurate. Since I have 1.8M records, it is better to get rid of questionable data.

### Results
- Remove the records that have a time of 00:01

In [19]:
# Remove day 1 time 00:01
#sf_data = sf_data.drop(sf_data[(sf_data['day']==1) & (sf_data['time']=='00:01')].index)
#sf_data[(sf_data['day']==1) & (sf_data['time']=='00:01')]['time'].count()

In [20]:
# Remove all time 00:01 records
sf_data = sf_data.drop(sf_data[(sf_data['time']=='00:01')].index)
sf_data[(sf_data['time']=='00:01')]['time'].count()

0
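
One caveat with the cell above: `drop` removes rows by index label, and the index here is `IncidntNum`. If an incident number appears on more than one row, every row sharing a label with a 00:01 record is removed, not only the 00:01 rows. That would be consistent with the later `value_counts` totals (1,819,107), which are 12 lower than 1,866,570 - 47,451 = 1,819,119. A boolean mask avoids this. The sketch below uses a made-up frame with a deliberately duplicated index label.

```python
import pandas as pd

# Made-up frame: incident 111 has two rows, only one of them at 00:01
toy = pd.DataFrame(
    {'time': ['00:01', '10:30', '12:00']},
    index=pd.Index([111, 111, 222], name='IncidntNum'),
)

# Label-based drop removes BOTH rows labelled 111
by_label = toy.drop(toy[toy['time'] == '00:01'].index)

# Mask-based filter removes only the 00:01 row
by_mask = toy[toy['time'] != '00:01']

print(len(by_label), len(by_mask))
```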

## Create Crime Measuring Variables
[[back to top](#Sections)]

Instead of only counting the incidents per day, I have created several other measurements that could be used for analysis and modeling, for example to help with workforce planning. They are very subjective and could be tuned with domain knowledge.
- Create crime level feature that weights the incident by the severity/violence of the crime
- Create a weather crime level. I select categories that I feel may be weather related.
- Create a Crime Of Passion feature based on a list of violent/emotional words.
- Create a violent crime feature for the assault, rape and domestic violence categories
- Create a gun related feature


### Investigate the categories and their sub descriptions

In [66]:
sf_data['category'].value_counts()

larceny/theft                  370337
other offenses                 260051
non-criminal                   193041
assault                        160146
vehicle theft                  110911
drug/narcotic                  110273
vandalism                       92436
warrants                        88371
burglary                        76509
suspicious occ                  64123
missing person                  53887
robbery                         47879
fraud                           31146
secondary codes                 20767
forgery/counterfeiting          18487
weapon laws                     18159
trespass                        15445
prostitution                    15347
stolen property                  9815
drunkenness                      8894
disorderly conduct               8799
sex offenses, forcible           8471
recovered vehicle                6318
driving under the influence      4863
kidnapping                       4678
runaway                          3921
liquor laws 

In [67]:
sf_data[sf_data['category']=='larceny/theft']['descript'].value_counts()

grand theft from locked auto                               128810
petty theft from locked auto                                41730
petty theft of property                                     34211
grand theft of property                                     22969
petty theft from a building                                 20915
petty theft shoplifting                                     20351
grand theft from a building                                 19204
grand theft from person                                     14679
grand theft pickpocket                                      11681
grand theft from unlocked auto                               9864
petty theft with prior                                       8184
petty theft from unlocked auto                               5481
grand theft bicycle                                          5278
attempted theft from locked vehicle                          4598
grand theft shoplifting                                      4551
petty thef

In [68]:
sf_data[sf_data['category']=='other offenses']['descript'].value_counts()

drivers license, suspended or revoked                                        56650
traffic violation                                                            33942
resisting arrest                                                             18441
miscellaneous investigation                                                  17161
probation violation                                                          16239
lost/stolen license plate                                                    13427
violation of restraining order                                               11822
traffic violation arrest                                                     11403
parole violation                                                             10141
conspiracy                                                                    6426
obscene phone calls(s)                                                        4797
violation of municipal code                                                   4760
poss

In [69]:
sf_data[sf_data['category']=='non-criminal']['descript'].value_counts()

lost property                                         64152
aided case, mental disturbed                          45951
found property                                        25989
aided case                                            11505
death report, cause unknown                            8769
case closure                                           4735
stay away or court order, non-dv related               3395
aided case, dog bite                                   2900
civil sidewalks, citation                              2665
property for identification                            2508
aided case, injured person                             2196
death report, natural causes                           1997
aided case -property for destruction                   1840
courtesy report                                        1761
fire report                                            1677
located property                                       1536
tarasoff report                         

In [70]:
sf_data[sf_data['category']=='assault']['descript'].value_counts()

battery                                                        57142
threats against life                                           29896
inflict injury on cohabitee                                    14842
aggravated assault with a deadly weapon                        13584
aggravated assault with bodily force                           10382
battery, former spouse or dating relationship                   5990
aggravated assault with a knife                                 5134
battery of a police officer                                     2802
child abuse (physical)                                          2652
aggravated assault with a gun                                   2127
threatening phone call(s)                                       1729
battery with serious injuries                                   1697
stalking                                                        1619
elder adult or dependent abuse (not embezzlement or theft)      1286
assault                           

In [71]:
sf_data[sf_data['category']=='drug/narcotic']['descript'].value_counts()

possession of narcotics paraphernalia                20567
possession of base/rock cocaine                      14089
possession of marijuana                              11220
sale of base/rock cocaine                             8774
possession of meth-amphetamine                        7578
possession of base/rock cocaine for sale              7358
possession of marijuana for sales                     5643
possession of controlled substance                    4261
possession of heroin                                  4113
possession of cocaine                                 2986
sale of marijuana                                     2889
possession of meth-amphetamine for sale               2376
possession of controlled substance for sale           2201
possession of heroin for sales                        1813
sale of controlled substance                          1564
possession of cocaine for sales                       1305
sale of heroin                                        12

In [72]:
sf_data[sf_data['category']=='vandalism']['descript'].value_counts()

malicious mischief, vandalism of vehicles                   37002
malicious mischief, vandalism                               33041
malicious mischief, breaking windows                        10272
malicious mischief, graffiti                                 7485
malicious mischief                                            942
malicious mischief, tire slashing                             664
malicious mischief, street cars/buses                         587
vandalism or graffiti tools, possession                       548
malicious mischief, breaking windows with bb gun              524
malicious mischief, adult suspect                             517
malicious mischief, juvenile suspect                          160
graffiti on government vehicles or public transportation      124
vandalism or graffiti on or within 100 ft of highway          116
malicious mischief, fictitious phone calls                    111
malicious mischief, building under construction                85
damage to 

In [73]:
sf_data[sf_data['category']=='robbery']['descript'].value_counts()

robbery on the street, strongarm                     13804
robbery, bodily force                                 9453
robbery on the street with a gun                      4260
attempted robbery on the street with bodily force     2345
robbery, armed with a gun                             2318
attempted robbery with bodily force                   1611
robbery on the street with a knife                    1529
robbery on the street with a dangerous weapon         1364
robbery, armed with a knife                           1118
robbery of a commercial establishment, strongarm       872
robbery, armed with a dangerous weapon                 849
robbery of a chain store with bodily force             717
robbery of a commercial establishment with a gun       693
attempted robbery on the street with a gun             678
carjacking with a gun                                  497
robbery of a residence with bodily force               478
attempted robbery on the street w/deadly weapon        3

In [74]:
sf_data[sf_data['category']=='secondary codes']['descript'].value_counts()

domestic violence                         15262
juvenile involved                          1864
gang activity                              1624
prejudice-based incident                   1362
atm related crime                           559
battery by juvenile suspect                  52
weapons possession by juvenile suspect       25
assault by juvenile suspect                  17
shooting by juvenile suspect                  2
Name: descript, dtype: int64

In [75]:
sf_data[sf_data['category']=='weapon laws']['descript'].value_counts()

poss of loaded firearm                                     4075
carrying a concealed weapon                                2006
exhibiting deadly weapon in a threating manner             2002
poss of firearm by convicted felon/addict/alien            1524
poss of prohibited weapon                                  1387
discharge firearm at an inhabited dwelling                 1157
possession of air gun                                       866
loitering while carrying concealed weapon                   733
discharge firearm within city limits                        623
poss of deadly weapon with intent to assault                528
firearm, loaded, in vehicle, possession or use              498
carrying of concealed weapon by convicted felon             344
ammunition, poss. by prohibited person                      323
weapon, possess or bring other on school grounds            230
switchblade knife, possession                               188
firearm, armed while possessing controll

In [76]:
sf_data[sf_data['category']=='prostitution']['descript'].value_counts()

solicits for act of prostitution                   6769
solicits to visit house of prostitution            5062
loitering for purpose of prostitution              2419
engaging in lewd conduct - prostitution related     298
human trafficking                                   229
pimping                                             181
pandering                                           135
indecent exposure - prostitution related             89
solicits lewd act                                    82
inmate/keeper of house of prostitution               64
placing wife in house of prostitution                12
disorderly house, keeping                             3
procurement, pimping, & pandering                     2
purchase female for the purpose of prostitution       2
Name: descript, dtype: int64

In [77]:
sf_data[sf_data['category']=='sex offenses, forcible']['descript'].value_counts()

sexual battery                                       3025
forcible rape, bodily force                          1366
child abuse sexual                                    623
assault to rape with bodily force                     582
oral copulation                                       307
attempted rape, bodily force                          276
annoy or molest children                              269
child abuse, pornography                              261
oral copulation, unlawful (adult victim)              243
sodomy (adult victim)                                 230
penetration, forced, with object                      210
sodomy                                                187
sexual assault, aggravated, of child                  141
engaging in lewd act                                  131
sexual assault, administering drug to commit          124
child abuse, exploitation                             119
forcible rape, armed with a sharp instrument          115
forcible rape,

In [78]:
sf_data[sf_data['category']=='drunkenness']['descript'].value_counts()

under influence of alcohol in a public place    8894
Name: descript, dtype: int64

In [79]:
sf_data[sf_data['category']=='disorderly conduct']['descript'].value_counts()

committing public nuisance                              2710
disturbing the peace                                    1935
maintaining a public nuisance after notification        1623
disturbing the peace, fighting                          1019
maintaining a public nuisance                            816
disturbing the peace, commotion                          355
disturbing the peace, swearing                           331
disturbing religious meetings                              9
disturbance of non-religious, non-political assembly       1
Name: descript, dtype: int64

In [80]:
sf_data[sf_data['category']=='family offenses']['descript'].value_counts()

desertion of child                                        252
children, abandonment & neglect of (general)              222
minor without proper parental care                        212
abandonment of child                                      197
failure to provide for child                               81
immoral acts or drunk in presence of child                 37
concealment/removal of child without consent               27
failure to provide for parents                              3
harassing child or ward because of person's employment      1
Name: descript, dtype: int64

### Assign a crime level and a weather-related flag to each category
- First number is the crime level in the range 1-4: a higher number is more violent
- Second number (0/1) indicates whether I believe the crime would be affected by the weather

In [81]:
levels = {'larceny/theft' : [2,0],                
          'other offenses' : [1,0],                 
          'non-criminal' : [1,0],
          'assault' :  [4,1],                        
          'vehicle theft' : [2,0],                 
          'drug/narcotic' : [2,0],                 
          'vandalism' : [2,1],                       
          'warrants' : [1,0],                        
          'burglary' : [2,0],                        
          'suspicious occ' : [2,0],                 
          'missing person' : [1,0],                 
          'robbery' : [2,1],                         
          'fraud' : [2,0],                          
          'forgery/counterfeiting' : [2,0],         
          'secondary codes' :  [4,1],              
          'weapon laws' :  [3,1],                    
          'trespass' :  [2,0],                       
          'prostitution' :  [2,0],                  
          'stolen property' :  [2,0],                 
          'sex offenses, forcible' : [4,1],          
          'drunkenness' :  [2,1],                     
          'disorderly conduct'  : [3,1],              
          'recovered vehicle' :  [1,0],              
          'driving under the influence' :  [1,0],      
          'kidnapping' :  [3,0],                      
          'runaway' :  [1,0],                          
          'liquor laws' : [1,0],                     
          'arson' : [3,1],                           
          'embezzlement' : [1,0],                    
          'loitering' : [1,0],                      
          'suicide' :  [1,1],                         
          'family offenses' : [3,1],                  
          'bad checks' : [1,0],                 
          'bribery' : [1,0],                          
          'extortion' : [2,0],                        
          'sex offenses, non forcible' : [2,1],       
          'gambling' : [1,0],                          
          'pornography/obscene mat' : [2,0],          
          'trea' : [1,0]}

### Create new crime variables 'crime_level' and 'weather_crime' from the above mapping

In [82]:
sf_data['crime_level'] = sf_data['category'].map(lambda x : levels[x][0])
sf_data['weather_crime'] = sf_data['category'].map(lambda x : levels[x][1])
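
The lambda lookup above raises a `KeyError` if a category is missing from `levels`. A possible alternative is to map with a plain dictionary and then list any categories that did not match. The names `toy_levels` and `cats` below are illustrative stand-ins, and 'new category' is a made-up value.

```python
import pandas as pd

# Small stand-in for the full levels mapping: [crime level, weather flag]
toy_levels = {'assault': [4, 1], 'warrants': [1, 0]}

# Made-up category column, including one value missing from the mapping
cats = pd.Series(['assault', 'warrants', 'new category'])

# Mapping with a dict leaves unmatched categories as NaN instead of failing
crime_level = cats.map({k: v[0] for k, v in toy_levels.items()})

# List the categories that still need a level assigned
unmapped = cats[crime_level.isna()].unique().tolist()
print(unmapped)
```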



### Create new crime variable 'COP' (Crime Of Passion) based on incidents containing the following list of words
- Words were selected by reviewing the incident descriptions and looking for words that are more violent or could be triggered by emotions

In [83]:
cop_words = ['assault', 'battery', 'drunk', 'abuse', 'forced', 'rape', 'shooting',
           'violence', 'harassing', 'threat', 'threatening', 'threats', 'resist', 'resisting',
           'destruction', 'weapons', 'gun', 'knife', 'armed', 'deadly', 'drunkenness',
           'bomb', 'bombing', 'influence', 'looting', 'disorderly', 'force', 'forcible',
           'fighting', 'injuries', 'nuisance', 'homicide', 'alcohol', 'rape', 'mayhem',
           'abuse', 'cruelty', 'lewd', 'molest', 'disturbing']
       

In [84]:
sf_data['COP'] = sf_data['descript'].apply(lambda x: any(word in x for word in cop_words) )
sf_data['COP'].value_counts()



False    1529569
True      289538
Name: COP, dtype: int64
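
Note that `word in x` is a substring test, so a word also matches inside longer words. The sketch below contrasts it with a whole-word regular expression on a short word list. The description 'law enforcement contact' is hypothetical and only there to show the difference; the whole-word variant is an alternative, not the method used in this notebook.

```python
import re
import pandas as pd

toy_words = ['rape', 'gun', 'force']

desc = pd.Series([
    'robbery, bodily force',
    'possession of air gun',
    'law enforcement contact',   # hypothetical description
    'lost property',
])

# Substring test, as in the cell above: 'force' matches inside 'enforcement'
substring_hit = desc.apply(lambda x: any(w in x for w in toy_words))

# Whole-word test: \b anchors each word at word boundaries
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in toy_words) + r')\b'
whole_word_hit = desc.str.contains(pattern, regex=True)

print(substring_hit.tolist())
print(whole_word_hit.tolist())
```

Substring matching is more inclusive (for example 'threat' also covers 'threats' and 'threatening'), which may be exactly what is wanted here; the whole-word form is stricter.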

### Create new crime variable 'violent' for incidents in the categories assault, forcible sex offenses and secondary codes (mostly domestic violence)

In [85]:
violent_cats = ['assault', 'sex offenses, forcible', 'secondary codes']
sf_data['violent'] = sf_data['category'].apply(lambda x: x in violent_cats)
sf_data['violent'].value_counts()

False    1629723
True      189384
Name: violent, dtype: int64

### Create new crime variable southern_assaults
Just the assault incidents from the Southern Police District

In [86]:
sf_data[sf_data['category']== 'assault'].groupby(['pddistrict'])[['pdid']].count().sort_values('pdid', ascending=False)

pddistrict|pdid
---|---
SOUTHERN|25263
MISSION|23184
BAYVIEW|20324
INGLESIDE|18189
NORTHERN|17303
TENDERLOIN|16135
CENTRAL|14636
TARAVAL|11268
PARK|7247
RICHMOND|6597


In [62]:
#sf_data['southern_assaults'] = sf_data[['category','pddistrict']].apply(lambda x, y: (x=='assault') & (y=='SOUTHERN'), axis=1)

#df[['one','two']].apply(sum, axis=1)

#sf_data['southern_assaults'] = np.any((sf_data['category']=='assault') , (sf_data['pddistrict']=='SOUTHERN'))
#sf_data['southern_assaults'].value_counts()
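
The commented-out attempts above do not work as written: with `axis=1`, `apply` passes each row to the function as a single argument, so a two-argument lambda fails, and `np.any` expects an array rather than two separate conditions. Combining two boolean masks with `&` is one way to build the flag. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Made-up incidents
toy = pd.DataFrame({
    'category':   ['assault', 'assault', 'robbery', 'assault'],
    'pddistrict': ['SOUTHERN', 'MISSION', 'SOUTHERN', 'SOUTHERN'],
})

# Element-wise AND of two boolean masks; the parentheses are required
toy['southern_assaults'] = (
    (toy['category'] == 'assault') & (toy['pddistrict'] == 'SOUTHERN')
)
print(toy['southern_assaults'].value_counts())
```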

### Create new crime variable 'gun' for incidents where a gun was involved
- Flags every incident whose description contains the word 'gun'
- May be useful for other analysis

In [88]:
sf_data['gun'] = sf_data['descript'].apply(lambda x: x.find('gun') != -1 )
sf_data['gun'].value_counts()


False    1803322
True       15785
Name: gun, dtype: int64

In [89]:
sf_data.head(2)

IncidntNum|category|descript|dayofweek|date|time|pddistrict|resolution|address|x|y|...|hour|month|day|year|shift|gun|crime_level|weather_crime|COP|violent
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
160051264|warrants|warrant arrest|monday|2016-01-18|23:52|CENTRAL|ARREST, BOOKED|400 Block of POWELL ST|-122.408568|37.788759|...|23|1|18|2016|shift_2|False|1|0|False|False
160051242|robbery|robbery, bodily force|monday|2016-01-18|23:40|TENDERLOIN|NONE|100 Block of STOCKTON ST|-122.406428|37.787109|...|23|1|18|2016|shift_2|False|2|1|True|False


## Consolidate into daily records
[[back to top](#Sections)]

Group the incidents by day, then compute the number of incidents and the daily sum of each measuring variable (crime_level, weather_crime, violent, COP, gun).

### Group by the day
- Aggregate by the sum and count for my measuring variables
- To group by day and shift, uncomment code marked #SHIFT

In [90]:
day_group = sf_data.groupby(['date'])[['crime_level','weather_crime', 'violent', 'COP', 'gun']].agg(['sum', 'count'])


#SHIFT day_group = sf_data.groupby(['date','shift'])[['crime_level']].agg(['sum', 'count'])
day_group.head()

,crime_level,crime_level,weather_crime,weather_crime,violent,violent,COP,COP,gun,gun
date,sum,count,sum,count,sum,count,sum,count,sum,count
2003-01-01,1078,541,143,541,91,541,134,541,5,541
2003-01-02,731,399,72,399,39,399,53,399,1,399
2003-01-03,802,435,84,435,42,435,65,435,0,435
2003-01-04,678,347,65,347,44,347,59,347,3,347
2003-01-05,749,371,101,371,49,371,72,371,2,371


In [91]:
# unstack to bring shift from rows to columns
#SHIFT day_group = day_group.unstack(level=-1)
#SHIFT day_group.head()
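
For readers who want to try the #SHIFT variant, here is a minimal sketch of the group, unstack and flatten sequence on a toy frame. The incident values are invented for illustration; only the column names `date`, `shift` and `crime_level` come from the notebook. `fill_value=0` is an assumption about how a shift with no incidents should be represented.

```python
import pandas as pd

# Toy incidents (hypothetical values); column names mirror sf_data.
toy = pd.DataFrame({
    'date':  ['2003-01-01', '2003-01-01', '2003-01-01', '2003-01-02', '2003-01-02'],
    'shift': ['shift_1', 'shift_2', 'shift_2', 'shift_1', 'shift_3'],
    'crime_level': [2, 1, 3, 4, 1],
})

# One row per (date, shift) with the sum and count of crime_level.
by_shift = toy.groupby(['date', 'shift'])[['crime_level']].agg(['sum', 'count'])

# Move the innermost index level (shift) into the columns;
# date/shift combinations with no incidents become 0.
wide = by_shift.unstack(level=-1, fill_value=0)

# Flatten the three column levels into names such as 'crime_level_sum_shift_1'.
wide.columns = ['_'.join(col) for col in wide.columns]
print(wide)
```

The flattened names match the `crime_level_sum_shift_1` style used in the commented-out day-total code later in this section.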

In [92]:
day_group.columns.values

array([('crime_level', 'sum'), ('crime_level', 'count'),
       ('weather_crime', 'sum'), ('weather_crime', 'count'),
       ('violent', 'sum'), ('violent', 'count'), ('COP', 'sum'),
       ('COP', 'count'), ('gun', 'sum'), ('gun', 'count')], dtype=object)

In [93]:
# Reduce column levels
day_group.columns = ['_'.join(col).strip() for col in day_group.columns.values]
day_group.head(2)

date,crime_level_sum,crime_level_count,weather_crime_sum,weather_crime_count,violent_sum,violent_count,COP_sum,COP_count,gun_sum,gun_count
2003-01-01,1078,541,143,541,91,541,134,541,5,541
2003-01-02,731,399,72,399,39,399,53,399,1,399


In [94]:
# Drop the redundant count columns: each one equals the total number of records for the day.
# crime_level_count is kept as the daily incident count.
day_group.drop(['weather_crime_count', 'COP_count', 'gun_count', 'violent_count'], axis=1, inplace=True)

In [95]:
# Rename columns to avoid confusion: the sum of each flag column is a count of flagged incidents
cols = ['crime_level_sum', 'crime_count', 'weather_crime_count', 'violent_count', 'COP_count', 'gun_crime_count']
day_group.columns = cols
day_group.head(2)

date,crime_level_sum,crime_count,weather_crime_count,violent_count,COP_count,gun_crime_count
2003-01-01,1078,541,143,91,134,5
2003-01-02,731,399,72,39,53,1
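
The aggregate, flatten, drop and rename steps above can also be collapsed into a single call using pandas named aggregation (available from pandas 0.25). The sketch below runs on a small invented frame; the output column names are the ones chosen in the rename step, and the toy values are not real data.

```python
import pandas as pd

# Toy incidents (hypothetical values); column names mirror sf_data.
toy = pd.DataFrame({
    'date': ['2003-01-01', '2003-01-01', '2003-01-02'],
    'crime_level': [2, 1, 3],
    'weather_crime': [1, 0, 1],
    'violent': [True, False, False],
    'COP': [True, False, True],
    'gun': [False, False, True],
})

# Each keyword is an output column: (source column, aggregation).
daily = toy.groupby('date').agg(
    crime_level_sum=('crime_level', 'sum'),
    crime_count=('crime_level', 'count'),
    weather_crime_count=('weather_crime', 'sum'),
    violent_count=('violent', 'sum'),
    COP_count=('COP', 'sum'),
    gun_crime_count=('gun', 'sum'),
)
print(daily)
```

Because every output name is tied to its source column, this avoids the positional rename, which silently mislabels columns if their order ever changes.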


In [96]:
# Sum the three shift columns into day totals
#SHIFT day_group['crime_level_sum_day'] = (day_group['crime_level_sum_shift_1'] +
#SHIFT                                     day_group['crime_level_sum_shift_2'] +
#SHIFT                                     day_group['crime_level_sum_shift_3'])
#SHIFT day_group['crime_level_count_day'] = (day_group['crime_level_count_shift_1'] +
#SHIFT                                       day_group['crime_level_count_shift_2'] +
#SHIFT                                       day_group['crime_level_count_shift_3'])
#SHIFT day_group.head(2)

### Add the other fields that are not crime measures
Calendar features that are needed for further analysis:
- day, month, year and dayofweek

In [97]:
# Every incident on a given date shares the same calendar values, so min() simply picks that value
day_group_static = sf_data.groupby(['date'])[['dayofweek','day', 'month', 'year']].min()
day_group_static.head()

date,dayofweek,day,month,year
2003-01-01,wednesday,1,1,2003
2003-01-02,thursday,2,1,2003
2003-01-03,friday,3,1,2003
2003-01-04,saturday,4,1,2003
2003-01-05,sunday,5,1,2003
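
As an alternative to the groupby above, the calendar fields can be derived directly from the date index. This is a hedged sketch on a toy frame: `toy_daily` and `calendar` are illustrative names, and it assumes the index values can be parsed by `pd.to_datetime`.

```python
import pandas as pd

# Toy daily frame indexed by date, standing in for day_group.
toy_daily = pd.DataFrame(
    {'crime_count': [541, 399, 435]},
    index=pd.Index(['2003-01-01', '2003-01-02', '2003-01-03'], name='date'),
)

# Parse the index once, then read the calendar attributes from it.
dates = pd.to_datetime(toy_daily.index)

calendar = pd.DataFrame({
    'dayofweek': dates.day_name().str.lower(),  # lower-cased to match the cleaned data
    'day': dates.day,
    'month': dates.month,
    'year': dates.year,
}, index=toy_daily.index)
print(calendar)
```

Deriving the fields from the date itself guarantees they stay consistent with the index, even if the per-incident columns contain an error.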


### Merge the two DataFrames

In [98]:
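# Concatenate column-wise on the date index, keeping day_group's rows.
# reindex replaces the join_axes argument, which newer pandas versions no longer accept.
data = pd.concat([day_group, day_group_static], axis=1).reindex(day_group.index)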
data.head()

date,crime_level_sum,crime_count,weather_crime_count,violent_count,COP_count,gun_crime_count,dayofweek,day,month,year
2003-01-01,1078,541,143,91,134,5,wednesday,1,1,2003
2003-01-02,731,399,72,39,53,1,thursday,2,1,2003
2003-01-03,802,435,84,42,65,0,friday,3,1,2003
2003-01-04,678,347,65,44,59,3,saturday,4,1,2003
2003-01-05,749,371,101,49,72,2,sunday,5,1,2003


### Review Data

In [99]:
data.describe()

,crime_level_sum,crime_count,weather_crime_count,violent_count,COP_count,gun_crime_count,day,month,year
count,4765.0,4765.0,4765.0,4765.0,4765.0,4765.0,4765.0,4765.0,4765.0
mean,720.142497,381.764323,77.894648,39.744806,60.763484,3.312697,15.706611,6.502413,2009.025813
std,88.978788,46.919233,13.966653,9.29096,11.326118,2.129817,8.798708,3.45947,3.759785
min,2.0,2.0,0.0,0.0,0.0,0.0,1.0,1.0,2003.0
25%,662.0,351.0,68.0,34.0,53.0,2.0,8.0,4.0,2006.0
50%,718.0,381.0,77.0,39.0,60.0,3.0,16.0,7.0,2009.0
75%,775.0,412.0,86.0,45.0,68.0,5.0,23.0,10.0,2012.0
max,1182.0,579.0,158.0,91.0,134.0,16.0,31.0,12.0,2016.0


#### A minimum of 2 crimes in a day looks wrong

In [100]:
data.sort_values('crime_count', ascending=True).head(10)

date,crime_level_sum,crime_count,weather_crime_count,violent_count,COP_count,gun_crime_count,dayofweek,day,month,year
2007-12-16,2,2,0,0,0,0,sunday,16,12,2007
2008-08-01,8,6,0,0,0,0,friday,1,8,2008
2013-12-25,306,148,39,20,27,2,wednesday,25,12,2013
2011-12-25,310,159,43,21,25,1,sunday,25,12,2011
2013-12-24,292,162,18,6,18,0,tuesday,24,12,2013
2010-12-25,346,171,46,32,32,0,saturday,25,12,2010
2013-12-23,378,184,46,25,47,1,monday,23,12,2013
2008-12-25,396,193,58,31,43,4,thursday,25,12,2008
2007-12-25,394,209,50,26,41,4,tuesday,25,12,2007
2011-11-24,379,212,44,22,39,2,thursday,24,11,2011


#### Remove the two days that appear to be missing data
- Two days (2007-12-16 and 2008-08-01) have only 2 and 8 incidents; the next lowest day has 148.
- Remove them, since those days are almost certainly missing data.

In [101]:
data = data[data['crime_count'] > 100]
data.shape

(4763, 10)
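
A threshold filter only catches days that have a few records. A day with no records at all never appears in the index. The sketch below shows one way to check for both on a toy frame; the counts are invented apart from the 2-incident day, and `suspect_days` and `absent_days` are illustrative names.

```python
import pandas as pd

# Toy daily counts (hypothetical values); 2007-12-15 is deliberately absent.
toy = pd.DataFrame(
    {'crime_count': [371, 2, 402, 388]},
    index=pd.to_datetime(['2007-12-14', '2007-12-16', '2007-12-17', '2007-12-18']),
)

# Days with implausibly few incidents (complement of the > 100 filter used above).
suspect_days = toy[toy['crime_count'] <= 100]

# Calendar days in the covered range that have no row at all.
full_range = pd.date_range(toy.index.min(), toy.index.max(), freq='D')
absent_days = full_range.difference(toy.index)

print(suspect_days)
print(absent_days)
```

Running the same check on `data` would require its index to be converted with `pd.to_datetime` first if the dates are stored as strings.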

In [102]:
data.describe()

,crime_level_sum,crime_count,weather_crime_count,violent_count,COP_count,gun_crime_count,day,month,year
count,4763.0,4763.0,4763.0,4763.0,4763.0,4763.0,4763.0,4763.0,4763.0
mean,720.442788,381.922948,77.927357,39.761495,60.788999,3.314088,15.709637,6.500945,2009.026454
std,87.781889,46.285817,13.878038,9.257131,11.259817,2.129182,8.797973,3.459211,3.76043
min,292.0,148.0,18.0,6.0,18.0,0.0,1.0,1.0,2003.0
25%,662.0,351.5,68.0,34.0,53.0,2.0,8.0,3.5,2006.0
50%,718.0,381.0,77.0,39.0,60.0,3.0,16.0,7.0,2009.0
75%,775.5,412.0,86.0,45.0,68.0,5.0,23.0,10.0,2012.0
max,1182.0,579.0,158.0,91.0,134.0,16.0,31.0,12.0,2016.0


## Write final data to file
[[back to top](#Sections)]

In [103]:
data.to_csv('sf_crime_clean.csv')
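
When the file is read back for analysis, the date index is stored as an ordinary text column unless it is restored explicitly. The sketch below shows the round trip on a toy frame, using an in-memory buffer in place of `sf_crime_clean.csv`. The assumption is that the index is named `date`, as in the tables above.

```python
import io
import pandas as pd

# Toy daily frame with a named date index, standing in for `data`.
toy = pd.DataFrame(
    {'crime_count': [541, 399], 'dayofweek': ['wednesday', 'thursday']},
    index=pd.DatetimeIndex(['2003-01-01', '2003-01-02'], name='date'),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer)
buffer.seek(0)

# Restore 'date' as a parsed DatetimeIndex rather than a plain string column.
restored = pd.read_csv(buffer, index_col='date', parse_dates=['date'])
print(restored)
```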