# Initial Data Exploration 

* This notebook begins exploring the data files uploaded to the data folder. I look at preliminary trends in the Philly crime dataset, which lists the incidents reported by the Philadelphia Police Department in a given year. I rename the variables to make them easier to work with during analysis (the new names stay consistent in the later notebooks!). Finally, I examine trends such as the most frequent crime types and the time of day they tend to occur, to identify patterns for future analyses.

----
#### HISTORY
* 10/19/20 - set up notebook

In [1]:
import pandas as pd

In [2]:
phillycrime_df = pd.read_csv('../data/philly_crime.csv')

In [3]:
phillycrime_df.shape

(110271, 15)

* Number of rows: 110271
* Number of columns: 15

In [4]:
phillycrime_df.columns

Index(['objectid', 'dc_dist', 'psa', 'dispatch_date_time', 'dispatch_date',
       'dispatch_time', 'hour_', 'dc_key', 'location_block', 'ucr_general',
       'text_general_code', 'point_x', 'point_y', 'lat', 'lng'],
      dtype='object')

In [5]:
phillycrime_df.head()

Unnamed: 0,objectid,dc_dist,psa,dispatch_date_time,dispatch_date,dispatch_time,hour_,dc_key,location_block,ucr_general,text_general_code,point_x,point_y,lat,lng
0,129,9,2,2020-03-25 18:32:00,2020-03-25,18:32:00,18,202009012094,1400 BLOCK SPRING GARDEN ST,600,Theft from Vehicle,-75.161446,39.962334,39.962334,-75.161446
1,41,77,A,2020-03-08 19:08:00,2020-03-08,19:08:00,19,202077001196,0 BLOCK PIA WAY,600,Thefts,-75.230706,39.883881,39.883881,-75.230706
2,42,77,A,2020-03-15 18:41:00,2020-03-15,18:41:00,18,202077001312,0 BLOCK PIA WAY,600,Thefts,-75.230706,39.883881,39.883881,-75.230706
3,43,77,A,2020-03-18 08:17:00,2020-03-18,08:17:00,8,202077001343,0 BLOCK PIA WAY,600,Thefts,-75.230706,39.883881,39.883881,-75.230706
4,44,77,A,2020-04-03 15:32:33,2020-04-03,15:32:33,15,202077001460,0 BLOCK PIA WAY,600,Thefts,-75.230706,39.883881,39.883881,-75.230706


In [6]:
phillycrime_df.tail()

Unnamed: 0,objectid,dc_dist,psa,dispatch_date_time,dispatch_date,dispatch_time,hour_,dc_key,location_block,ucr_general,text_general_code,point_x,point_y,lat,lng
110266,5253186,16,2,2020-09-23 11:13:39,2020-09-23,11:13:39,11,202016036076,800 BLOCK N 46TH ST,800,Other Assaults,-75.213683,39.968523,39.968523,-75.213683
110267,3821930,19,1,2020-07-11 00:33:16,2020-07-11,00:33:16,0,202019044845,7200 BLOCK HAVERFORD AV,600,Theft from Vehicle,-75.261351,39.974957,39.974957,-75.261351
110268,4372141,17,3,2020-08-06 13:08:00,2020-08-06,13:08:00,13,202017022625,1900 BLOCK WATKINS ST,1400,Vandalism/Criminal Mischief,-75.177072,39.929892,39.929892,-75.177072
110269,4092171,18,1,2020-07-22 14:47:06,2020-07-22,14:47:06,14,202018052559,5800 BLOCK NORFOLK ST,800,Other Assaults,-75.240107,39.949705,39.949705,-75.240107
110270,3400619,9,2,2020-06-15 10:41:09,2020-06-15,10:41:09,10,202009020906,200 BLOCK N BROAD ST,600,Theft from Vehicle,-75.162803,39.956657,39.956657,-75.162803


### Observations

* each row is a reported crime incident
* columns:
    * `objectid` - unique identification number for each case / crime (QUANTITATIVE)
    * `dc_dist` - most likely the police district number rather than a distance: values such as 9 and 77 reappear inside `dc_key` (e.g. `202009012094`, `202077001196`). Need to confirm! (QUALITATIVE)
    * `dispatch_date_time` - date and time at which the crime was reported (QUANTITATIVE)
    * `dispatch_date` - date on which the crime was reported (QUANTITATIVE)
    * `dispatch_time` - time at which the crime was reported (24 hour cycle) (QUANTITATIVE)
    * `hour_` - hour of the report (note that hours round "down", so 18:32 becomes 18) (QUANTITATIVE)
    * `dc_key` - unique identification code; it appears to combine the year, the `dc_dist` value, and a sequence number, but this still needs confirming (QUANTITATIVE)
    * `location_block` - block-level location within the Philadelphia region where the crime occurred (QUALITATIVE)
    * `ucr_general` - broad UCR code, standard for crime reporting from FBI (QUANTITATIVE)
    * `text_general_code` - text label for the type of crime, paired with the `ucr_general` code (QUALITATIVE)
    * `point_x` - longitude coordinate (QUANTITATIVE)
    * `point_y` - latitude coordinate (QUANTITATIVE)
    * `lat` - equivalent of point_y column (QUANTITATIVE)
    * `lng` - equivalent of point_x column (QUANTITATIVE)
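
A couple of the notes above can be checked directly rather than taken on faith. The sketch below is a minimal check, assuming a tiny hand-built `sample` frame (three rows copied from the `head()` output, not the full file); the names `sample`, `coords_match` and `hour_matches` are mine, for illustration only.

```python
import pandas as pd

# Three rows copied by hand from the head() output above (NOT the full dataset).
sample = pd.DataFrame({
    'dispatch_date_time': ['2020-03-25 18:32:00', '2020-03-08 19:08:00', '2020-04-03 15:32:33'],
    'hour_': [18, 19, 15],
    'point_x': [-75.161446, -75.230706, -75.230706],
    'point_y': [39.962334, 39.883881, 39.883881],
    'lat': [39.962334, 39.883881, 39.883881],
    'lng': [-75.161446, -75.230706, -75.230706],
})

# Do lat/lng really duplicate point_y/point_x?
coords_match = bool(
    (sample['lat'] == sample['point_y']).all()
    and (sample['lng'] == sample['point_x']).all()
)

# Is hour_ simply the hour of the timestamp with the minutes dropped?
parsed = pd.to_datetime(sample['dispatch_date_time'])
hour_matches = bool((parsed.dt.hour == sample['hour_']).all())

print(coords_match, hour_matches)
```

Running the same comparisons on `phillycrime_df` itself would show whether the duplicate coordinate columns can be dropped safely.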

In [7]:
# Keep every column except `psa`
cols_to_use = [
    'objectid', 'dc_dist', 'dispatch_date_time', 'dispatch_date',
    'dispatch_time', 'hour_', 'dc_key', 'location_block', 'ucr_general',
    'text_general_code', 'point_x', 'point_y', 'lat', 'lng',
]
phillycrime_df2 = phillycrime_df[cols_to_use]
phillycrime_df2 = phillycrime_df2.rename(columns={
    'objectid':'id',
    'dc_dist':'distance',  # may be a police district number, not a distance (unconfirmed); later cells use this name
    'dispatch_date_time':'datetime',
    'dispatch_date':'date',
    'dispatch_time':'time',
    'hour_':'hour',
    'dc_key':'key',
    'location_block':'location',
    'ucr_general':'ucr',
    'text_general_code':'crimetype',
    'point_x':'x',
    'point_y':'y',
    'lat':'latitude',
    'lng':'longitude'})
print(phillycrime_df2)

             id  distance             datetime        date      time  hour  \
0           129         9  2020-03-25 18:32:00  2020-03-25  18:32:00    18   
1            41        77  2020-03-08 19:08:00  2020-03-08  19:08:00    19   
2            42        77  2020-03-15 18:41:00  2020-03-15  18:41:00    18   
3            43        77  2020-03-18 08:17:00  2020-03-18  08:17:00     8   
4            44        77  2020-04-03 15:32:33  2020-04-03  15:32:33    15   
...         ...       ...                  ...         ...       ...   ...   
110266  5253186        16  2020-09-23 11:13:39  2020-09-23  11:13:39    11   
110267  3821930        19  2020-07-11 00:33:16  2020-07-11  00:33:16     0   
110268  4372141        17  2020-08-06 13:08:00  2020-08-06  13:08:00    13   
110269  4092171        18  2020-07-22 14:47:06  2020-07-22  14:47:06    14   
110270  3400619         9  2020-06-15 10:41:09  2020-06-15  10:41:09    10   

                 key                     location   ucr  \
0   

In [8]:
phillycrime_df2['crimetype'].unique()

array(['Theft from Vehicle', 'Thefts', 'Robbery No Firearm',
       'Burglary Non-Residential', 'Robbery Firearm',
       'Aggravated Assault No Firearm', 'Aggravated Assault Firearm',
       'Rape', 'Vandalism/Criminal Mischief', 'All Other Offenses',
       'Other Assaults', 'Burglary Residential',
       'DRIVING UNDER THE INFLUENCE', 'Narcotic / Drug Law Violations',
       'Fraud', 'Arson', 'Prostitution and Commercialized Vice',
       'Disorderly Conduct', 'Public Drunkenness',
       'Other Sex Offenses (Not Commercialized)', 'Weapon Violations',
       'Offenses Against Family and Children', 'Embezzlement',
       'Forgery and Counterfeiting', 'Liquor Law Violations',
       'Receiving Stolen Property', 'Gambling Violations',
       'Vagrancy/Loitering', 'Homicide - Criminal',
       'Homicide - Criminal ', 'Recovered Stolen Motor Vehicle',
       'Motor Vehicle Theft', 'Homicide - Justifiable ',
       'Homicide - Gross Negligence'], dtype=object)
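
The array above lists `'Homicide - Criminal'` and `'Homicide - Criminal '` as separate values, which differ only by a trailing space. A minimal sketch of one way to merge them, assuming a small hand-made `labels` Series (hypothetical, not the real column):

```python
import pandas as pd

# Hand-made labels that mimic the issue seen in the unique() output.
labels = pd.Series([
    'Homicide - Criminal',
    'Homicide - Criminal ',   # trailing space
    'Homicide - Justifiable ',
    'Thefts',
])

# Strip leading/trailing whitespace so near-duplicate labels collapse together.
cleaned = labels.str.strip()

print(labels.nunique(), cleaned.nunique())
```

On the real data this would be applied to `phillycrime_df2['crimetype']` before any `value_counts()` or `groupby()` call, so both spellings land in one group.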

In [9]:
phillycrime_df2['crimetype'].value_counts()

Other Assaults                             17698
Thefts                                     16440
All Other Offenses                         13806
Vandalism/Criminal Mischief                11557
Theft from Vehicle                         10118
Fraud                                       6928
Narcotic / Drug Law Violations              5524
Aggravated Assault No Firearm               4245
Motor Vehicle Theft                         3737
Aggravated Assault Firearm                  2929
Burglary Residential                        2788
Weapon Violations                           2602
Burglary Non-Residential                    2336
Robbery No Firearm                          2216
DRIVING UNDER THE INFLUENCE                 1359
Robbery Firearm                             1281
Recovered Stolen Motor Vehicle              1150
Disorderly Conduct                           679
Rape                                         622
Other Sex Offenses (Not Commercialized)      554
Arson               

In [10]:
phillycrime_df2.groupby('crimetype')['hour'].mean()

crimetype
Aggravated Assault Firearm                 13.357460
Aggravated Assault No Firearm              13.128857
All Other Offenses                         13.163842
Arson                                      10.268235
Burglary Non-Residential                   10.619435
Burglary Residential                       12.866930
DRIVING UNDER THE INFLUENCE                12.415011
Disorderly Conduct                         14.600884
Embezzlement                               13.372222
Forgery and Counterfeiting                 15.053191
Fraud                                      13.966946
Gambling Violations                        14.400000
Homicide - Criminal                        14.241611
Homicide - Criminal                        13.622222
Homicide - Gross Negligence                 0.000000
Homicide - Justifiable                     10.500000
Liquor Law Violations                      13.597015
Motor Vehicle Theft                        13.026224
Narcotic / Drug Law Violations      

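* Possible data issue for:
    * Homicide - Gross Negligence (a mean of 0.000000 is not missing data; it more likely reflects very few incidents, all logged at hour 0)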
* High (crime that occurs latest in the day):
    * Narcotic / Drug Law Violations (15.457639)
* Low (crime that occurs earliest in the day): 
    * Arson (10.268235)
* Notes
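    * It would be helpful to know what "All Other Offenses" covers.
    * `Homicide - Criminal` appears twice in the output because one label carries a trailing space; the two groups should be merged before their means are compared.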
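
A mean on its own hides how many incidents sit behind it. The sketch below shows one way to report the count next to the mean, assuming a made-up `toy` frame (the hours are invented for illustration and are not taken from the dataset):

```python
import pandas as pd

# Invented rows, for illustration only.
toy = pd.DataFrame({
    'crimetype': ['Arson', 'Arson', 'Arson', 'Homicide - Gross Negligence'],
    'hour': [2, 9, 20, 0],
})

# Report group size alongside the mean so tiny groups are easy to spot.
summary = toy.groupby('crimetype')['hour'].agg(['count', 'mean'])

print(summary)
```

A group with a `count` of 1 and a `mean` of 0 is a single incident logged at hour 0, which reads very differently from a missing value.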

In [11]:
drug_filter = phillycrime_df2['crimetype']=='Narcotic / Drug Law Violations'
drug_df = phillycrime_df2[drug_filter]
drug_df.shape

(5524, 14)

* Distribution of drug (Narcotic / Drug Law Violations) incidents across `distance` values (the renamed `dc_dist` column):

In [12]:
drug_df.groupby('distance').size()

distance
1       53
2      248
3       52
5        2
6      125
7        9
8       92
9      216
12     464
14     284
15     210
16     225
17     132
18     308
19     217
22     290
24    1223
25     521
26      72
35     518
39     247
77      16
dtype: int64
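
Raw counts are easier to compare as shares of the total. This is a minimal sketch that re-enters the counts printed above by hand into a Series named `counts` (my name, for illustration) and ranks them:

```python
import pandas as pd

# Counts copied by hand from the groupby output above.
counts = pd.Series({
    1: 53, 2: 248, 3: 52, 5: 2, 6: 125, 7: 9, 8: 92, 9: 216,
    12: 464, 14: 284, 15: 210, 16: 225, 17: 132, 18: 308, 19: 217,
    22: 290, 24: 1223, 25: 521, 26: 72, 35: 518, 39: 247, 77: 16,
})

# Share of all drug incidents per value, largest first.
share = (counts / counts.sum()).sort_values(ascending=False)

print(counts.sum())
print(share.head(3).round(3))
```

With the real frame, `drug_df['distance'].value_counts(normalize=True)` should give the same shares without retyping anything.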

* Time of day when drug-related crime incidents occur: 

In [13]:
drug_df.groupby('hour').size().plot(kind='bar', figsize=(12,4))

<AxesSubplot:xlabel='hour'>

### Observations (Part 2)

* Narcotic / drug law violations frequently occur in the evening, peaking at the 19 hour mark (7:00 PM). This does NOT mean they are the most frequent crime.
* Setting aside Homicide - Gross Negligence, arson has the lowest mean hour (about 10.27), so in 2020 it tends to occur earlier in the day than the other crimes covered in this data set.
* Personally, I think drug abuse is a huge issue in Philly (I'm actually conducting a research study right now on tobacco!), so I'd like to dive a bit deeper and compare these rates across the last 10 years.
* Most crime types have a mean hour past the 10 hour mark, which suggests incidents are less frequent in the early morning.
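
The peak hour can also be read off numerically instead of from the bar chart. A minimal sketch, assuming a made-up `hours` Series (invented values, not the real `hour` column):

```python
import pandas as pd

# Invented hours, for illustration only.
hours = pd.Series([19, 19, 19, 12, 12, 3], name='hour')

# Count incidents per hour and fill in hours with no incidents,
# so all 24 hours are present.
by_hour = hours.value_counts().reindex(range(24), fill_value=0)

# Hour with the most incidents.
peak_hour = by_hour.idxmax()

print(peak_hour, by_hour[peak_hour])
```

Reindexing to `range(24)` keeps empty hours in the result, which makes gaps visible if the counts are plotted.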
