# Final Project Exploratory Data Analysis

#### Import pandas, matplotlib

In [11]:
import pandas as pd
import matplotlib.pyplot as plt

In [12]:
plt.rcParams['figure.figsize'] = (9, 8)
plt.rcParams['font.size'] = 12

In [13]:
ls data

NYPD_Complaint_Data_2018.csv


Original source file: https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i. To keep the CSV file small, I dropped several columns and kept only the records from 2018.

#### Read in CSV file

In [14]:
nypd = pd.read_csv('data/NYPD_Complaint_Data_2018.csv')

Reference data dictionary [here](https://git.generalassemb.ly/gracepaet/nypd-complaints)

*Change this to a relative reference to a repo file*

Display all columns

In [9]:
pd.set_option('display.max_columns', None)

In [19]:
nypd.shape

(469212, 15)

In [15]:
nypd.head()

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,OFNS_DESC,LAW_CAT_CD,BORO_NM,PREM_TYP_DESC,JURIS_DESC,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX,CMPLNT_FR_DT_YEAR
0,774621657,2018-10-08,12:30:00,ROBBERY,FELONY,BRONX,BUS STOP,N.Y. POLICE DEPT,UNKNOWN,BLACK,M,<18,BLACK,M,2018
1,232548146,2018-08-24,11:00:00,PETIT LARCENY,MISDEMEANOR,BROOKLYN,RESIDENCE-HOUSE,N.Y. POLICE DEPT,,,,45-64,WHITE,F,2018
2,452701517,2018-03-30,16:55:00,DANGEROUS WEAPONS,FELONY,QUEENS,STREET,N.Y. POLICE DEPT,,,,UNKNOWN,UNKNOWN,E,2018
3,620357753,2018-10-02,16:00:00,PETIT LARCENY,MISDEMEANOR,QUEENS,STREET,N.Y. POLICE DEPT,,,,65+,BLACK,M,2018
4,110535568,2018-08-14,14:20:00,MURDER & NON-NEGL. MANSLAUGHTER,FELONY,,,N.Y. POLICE DEPT,25-44,BLACK,M,45-64,BLACK,M,2018


In [20]:
nypd.columns

Index(['CMPLNT_NUM', 'CMPLNT_FR_DT', 'CMPLNT_FR_TM', 'OFNS_DESC', 'LAW_CAT_CD',
       'BORO_NM', 'PREM_TYP_DESC', 'JURIS_DESC', 'SUSP_AGE_GROUP', 'SUSP_RACE',
       'SUSP_SEX', 'VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX',
       'CMPLNT_FR_DT_YEAR'],
      dtype='object')

In [21]:
nypd.dtypes

CMPLNT_NUM            int64
CMPLNT_FR_DT         object
CMPLNT_FR_TM         object
OFNS_DESC            object
LAW_CAT_CD           object
BORO_NM              object
PREM_TYP_DESC        object
JURIS_DESC           object
SUSP_AGE_GROUP       object
SUSP_RACE            object
SUSP_SEX             object
VIC_AGE_GROUP        object
VIC_RACE             object
VIC_SEX              object
CMPLNT_FR_DT_YEAR     int64
dtype: object

#### Confirm all records are from 2018

In [18]:
nypd.CMPLNT_FR_DT_YEAR.value_counts(dropna=False)

2018    469212
Name: CMPLNT_FR_DT_YEAR, dtype: int64

#### Drop CMPLNT_FR_DT_YEAR from dataframe

In [22]:
nypd.drop('CMPLNT_FR_DT_YEAR', inplace=True, axis=1)

## Exploring the dataframe

In [26]:
nypd.columns

Index(['CMPLNT_NUM', 'CMPLNT_FR_DT', 'CMPLNT_FR_TM', 'OFNS_DESC', 'LAW_CAT_CD',
       'BORO_NM', 'PREM_TYP_DESC', 'JURIS_DESC', 'SUSP_AGE_GROUP', 'SUSP_RACE',
       'SUSP_SEX', 'VIC_AGE_GROUP', 'VIC_RACE', 'VIC_SEX'],
      dtype='object')

In [27]:
nypd.dtypes

CMPLNT_NUM         int64
CMPLNT_FR_DT      object
CMPLNT_FR_TM      object
OFNS_DESC         object
LAW_CAT_CD        object
BORO_NM           object
PREM_TYP_DESC     object
JURIS_DESC        object
SUSP_AGE_GROUP    object
SUSP_RACE         object
SUSP_SEX          object
VIC_AGE_GROUP     object
VIC_RACE          object
VIC_SEX           object
dtype: object

In [28]:
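nypd.describe  # no parentheses, so this shows the bound method and the frame's repr; nypd.describe() would compute summary statistics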

<bound method NDFrame.describe of         CMPLNT_NUM CMPLNT_FR_DT CMPLNT_FR_TM                        OFNS_DESC  \
0        774621657   2018-10-08     12:30:00                          ROBBERY   
1        232548146   2018-08-24     11:00:00                    PETIT LARCENY   
2        452701517   2018-03-30     16:55:00                DANGEROUS WEAPONS   
3        620357753   2018-10-02     16:00:00                    PETIT LARCENY   
4        110535568   2018-08-14     14:20:00  MURDER & NON-NEGL. MANSLAUGHTER   
...            ...          ...          ...                              ...   
469207   690975879   2018-12-31     08:00:00     ASSAULT 3 & RELATED OFFENSES   
469208   779067255   2018-11-01     12:00:00                    GRAND LARCENY   
469209   541626662   2018-12-30     19:30:00                      THEFT-FRAUD   
469210   417694811   2018-12-31     17:45:00                    HARRASSMENT 2   
469211   379455106   2018-12-20     18:00:00                           FRAU

In [35]:
nypd.OFNS_DESC.value_counts(dropna=False)

PETIT LARCENY                     87715
HARRASSMENT 2                     70943
ASSAULT 3 & RELATED OFFENSES      53425
CRIMINAL MISCHIEF & RELATED OF    47916
GRAND LARCENY                     44815
                                  ...  
NEW YORK CITY HEALTH CODE             3
UNLAWFUL POSS. WEAP. ON SCHOOL        2
DISRUPTION OF A RELIGIOUS SERV        1
ABORTION                              1
HOMICIDE-NEGLIGENT-VEHICLE            1
Name: OFNS_DESC, Length: 61, dtype: int64

In [36]:
nypd.LAW_CAT_CD.value_counts(dropna=False)

MISDEMEANOR    254109
FELONY         143434
VIOLATION       71669
Name: LAW_CAT_CD, dtype: int64

In [37]:
nypd.JURIS_DESC.value_counts(dropna=False)

N.Y. POLICE DEPT                415823
N.Y. HOUSING POLICE              33634
N.Y. TRANSIT POLICE              12449
OTHER                             3149
PORT AUTHORITY                    2018
DEPT OF CORRECTIONS                960
TRI-BORO BRDG TUNNL                275
NYC PARKS                          267
HEALTH & HOSP CORP                 240
N.Y. STATE POLICE                  163
METRO NORTH                         80
N.Y. STATE PARKS                    41
NEW YORK CITY SHERIFF OFFICE        33
LONG ISLAND RAILRD                  30
STATN IS RAPID TRANS                17
U.S. PARK POLICE                    14
AMTRACK                             10
NYS DEPT TAX AND FINANCE             8
CONRAIL                              1
Name: JURIS_DESC, dtype: int64

In [38]:
nypd.BORO_NM.value_counts(dropna=False)

BROOKLYN         138329
MANHATTAN        115955
BRONX            101802
QUEENS            91686
STATEN ISLAND     21124
NaN                 316
Name: BORO_NM, dtype: int64

In [40]:
nypd.PREM_TYP_DESC.value_counts(dropna=False)

STREET                        128337
RESIDENCE - APT. HOUSE        105873
RESIDENCE-HOUSE                44415
RESIDENCE - PUBLIC HOUSING     33652
CHAIN STORE                    15218
                               ...  
PHOTO/COPY                        44
CEMETERY                          41
LOAN COMPANY                      21
TRAMWAY                           10
HOMELESS SHELTER                   8
Name: PREM_TYP_DESC, Length: 74, dtype: int64

In [42]:
nypd.SUSP_SEX.value_counts(dropna=False)

M      212879
NaN    120521
U       69647
F       66165
Name: SUSP_SEX, dtype: int64

In [43]:
nypd.VIC_SEX.value_counts(dropna=False)

F    186532
M    160990
D     66191
E     55498
U         1
Name: VIC_SEX, dtype: int64

In [44]:
nypd.SUSP_AGE_GROUP.value_counts(dropna=False)

UNKNOWN    127696
NaN        120521
25-44      117822
45-64       44618
18-24       42088
<18         12472
65+          3962
2018           10
1018            3
928             2
-941            1
938             1
-939            1
948             1
1967            1
-974            1
954             1
920             1
952             1
-63             1
-2              1
-978            1
-80             1
922             1
1012            1
1017            1
924             1
955             1
Name: SUSP_AGE_GROUP, dtype: int64

In [45]:
nypd.VIC_AGE_GROUP.value_counts(dropna=False)

25-44      162582
UNKNOWN    133431
45-64       87810
18-24       45261
65+         20354
<18         19736
-1              2
-956            2
-974            2
965             1
970             1
-61             1
-958            1
-4              1
-5              1
-3              1
-948            1
-940            1
936             1
953             1
-952            1
-970            1
951             1
929             1
954             1
-67             1
-2              1
922             1
-76             1
957             1
-968            1
-972            1
1017            1
-43             1
-962            1
-51             1
-55             1
-966            1
-59             1
948             1
-955            1
Name: VIC_AGE_GROUP, dtype: int64

#### Checking for null values

In [30]:
print(nypd.shape)
nypd.isnull().sum()

(469212, 14)


CMPLNT_NUM             0
CMPLNT_FR_DT           0
CMPLNT_FR_TM           0
OFNS_DESC             14
LAW_CAT_CD             0
BORO_NM              316
PREM_TYP_DESC       2007
JURIS_DESC             0
SUSP_AGE_GROUP    120521
SUSP_RACE         120521
SUSP_SEX          120521
VIC_AGE_GROUP          0
VIC_RACE               0
VIC_SEX                0
dtype: int64

I will need to handle nulls in these fields:
* `OFNS_DESC`
* `BORO_NM`
* `PREM_TYP_DESC`
* `SUSP_AGE_GROUP`
* `SUSP_RACE`
* `SUSP_SEX`

About 26% of records (120,521 of 469,212) are null in each of `SUSP_AGE_GROUP`, `SUSP_RACE`, and `SUSP_SEX`!
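The null share per column can be read directly rather than worked out by hand. A minimal sketch on a made-up four-row frame follows; the column names match the dataset, but the values are illustrative only.

```python
import pandas as pd

# Illustrative stand-in for the real dataframe.
sample = pd.DataFrame({
    'BORO_NM': ['BRONX', 'QUEENS', None, 'BROOKLYN'],
    'SUSP_SEX': ['M', None, None, 'F'],
    'VIC_SEX': ['M', 'F', 'E', 'D'],
})

# isnull() gives booleans; their mean is the fraction of nulls per column.
null_share = sample.isnull().mean().mul(100).round(1)

# Show only the columns that actually contain nulls, worst first.
print(null_share[null_share > 0].sort_values(ascending=False))
```

On the real data the same expression would be `nypd.isnull().mean()`.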

#### Quick look at the dataframe to see what other cleaning I may have to do

In [31]:
nypd.head(5)

Unnamed: 0,CMPLNT_NUM,CMPLNT_FR_DT,CMPLNT_FR_TM,OFNS_DESC,LAW_CAT_CD,BORO_NM,PREM_TYP_DESC,JURIS_DESC,SUSP_AGE_GROUP,SUSP_RACE,SUSP_SEX,VIC_AGE_GROUP,VIC_RACE,VIC_SEX
0,774621657,2018-10-08,12:30:00,ROBBERY,FELONY,BRONX,BUS STOP,N.Y. POLICE DEPT,UNKNOWN,BLACK,M,<18,BLACK,M
1,232548146,2018-08-24,11:00:00,PETIT LARCENY,MISDEMEANOR,BROOKLYN,RESIDENCE-HOUSE,N.Y. POLICE DEPT,,,,45-64,WHITE,F
2,452701517,2018-03-30,16:55:00,DANGEROUS WEAPONS,FELONY,QUEENS,STREET,N.Y. POLICE DEPT,,,,UNKNOWN,UNKNOWN,E
3,620357753,2018-10-02,16:00:00,PETIT LARCENY,MISDEMEANOR,QUEENS,STREET,N.Y. POLICE DEPT,,,,65+,BLACK,M
4,110535568,2018-08-14,14:20:00,MURDER & NON-NEGL. MANSLAUGHTER,FELONY,,,N.Y. POLICE DEPT,25-44,BLACK,M,45-64,BLACK,M


In [32]:
nypd.dtypes

CMPLNT_NUM         int64
CMPLNT_FR_DT      object
CMPLNT_FR_TM      object
OFNS_DESC         object
LAW_CAT_CD        object
BORO_NM           object
PREM_TYP_DESC     object
JURIS_DESC        object
SUSP_AGE_GROUP    object
SUSP_RACE         object
SUSP_SEX          object
VIC_AGE_GROUP     object
VIC_RACE          object
VIC_SEX           object
dtype: object

## Cleaning and Data Prep needed

* Set new index (TBD)
* Handle nulls in these fields
 * `OFNS_DESC`
 * `BORO_NM`
 * `PREM_TYP_DESC`
 * `SUSP_AGE_GROUP`
 * `SUSP_RACE`
 * `SUSP_SEX`
* Convert these fields to Timestamp
 * `CMPLNT_FR_DT`
 * `CMPLNT_FR_TM`
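A rough sketch of the null handling and type conversion listed above, run on a two-row stand-in frame. The fill values are assumptions: `UNKNOWN` and `U` are reused because they already appear as categories in the suspect columns. A time of day on its own is not a full timestamp, so `CMPLNT_FR_TM` is parsed here as a timedelta, which is one of several reasonable options.

```python
import pandas as pd

# Two rows shaped like the head() output above.
sample = pd.DataFrame({
    'CMPLNT_FR_DT': ['2018-10-08', '2018-08-24'],
    'CMPLNT_FR_TM': ['12:30:00', '11:00:00'],
    'SUSP_AGE_GROUP': ['UNKNOWN', None],
    'SUSP_RACE': ['BLACK', None],
    'SUSP_SEX': ['M', None],
})

# Assumption: treat a missing suspect attribute the same as an unknown one.
fill_values = {'SUSP_AGE_GROUP': 'UNKNOWN', 'SUSP_RACE': 'UNKNOWN', 'SUSP_SEX': 'U'}
sample = sample.fillna(fill_values)

# Parse the date strings into datetime64 values.
sample['CMPLNT_FR_DT'] = pd.to_datetime(sample['CMPLNT_FR_DT'], format='%Y-%m-%d')

# Parse HH:MM:SS strings as offsets from midnight.
sample['CMPLNT_FR_TM'] = pd.to_timedelta(sample['CMPLNT_FR_TM'])

print(sample.dtypes)
```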



Optional
* Combine `CMPLNT_FR_DT` and `CMPLNT_FR_TM` into one field
* Create month, week, day, day of week columns for `CMPLNT_FR_DT`
* Create hour column for `CMPLNT_FR_TM`
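The optional steps could look roughly like this. All of the new column names (`CMPLNT_FR_DTTM`, `MONTH`, `WEEK`, `DAY`, `DAY_OF_WEEK`, `HOUR`) are hypothetical, and `dt.isocalendar()` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Three rows shaped like the head() output above.
sample = pd.DataFrame({
    'CMPLNT_FR_DT': ['2018-10-08', '2018-08-24', '2018-03-30'],
    'CMPLNT_FR_TM': ['12:30:00', '11:00:00', '16:55:00'],
})

# Combine the date and time strings, then parse them in one pass.
sample['CMPLNT_FR_DTTM'] = pd.to_datetime(
    sample['CMPLNT_FR_DT'] + ' ' + sample['CMPLNT_FR_TM'],
    format='%Y-%m-%d %H:%M:%S',
)

# Derive calendar parts from the combined column.
ts = sample['CMPLNT_FR_DTTM'].dt
sample['MONTH'] = ts.month
sample['WEEK'] = ts.isocalendar().week  # ISO week number
sample['DAY'] = ts.day
sample['DAY_OF_WEEK'] = ts.day_name()
sample['HOUR'] = ts.hour

print(sample[['CMPLNT_FR_DTTM', 'MONTH', 'WEEK', 'DAY', 'DAY_OF_WEEK', 'HOUR']])
```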


## Framing the problem

* Start *framing* the problem -- This is Step 1 of the Data Science Workflow

 * Make a list of **what you know**
 * Make a list of **what you don't know**
 * Make a list of possible problem statements
   * Try to come up with *5-10* of them if possible
   * You will decide on which ones to explore later, but for now brainstorm as much as you can.
   

#### What I know
* All complaints are from 2018
* Complaints are timestamped, with detail down to the date (year, month, day) and time (hour, minute)
* `OFNS_DESC` is an important field that tells us what each complaint was
* Each offense occurs within a borough of NYC, defined as `BORO_NM`
* Complaints are categorized under `LAW_CAT_CD` as either a felony, misdemeanor, or violation
* Different jurisdictions are responsible for handling each complaint. The police department is not responsible for all complaints
* Victim and suspect demographics (age, sex, race) may be available for each offense



#### What I don't know
* Which offenses had the highest volume overall in 2018
* Seasonal trends, if any, in offenses
* Other time-based trends in offenses, down to month, day of the week, and hour of the day
* If some boroughs in New York City have more complaints than other boroughs
* If certain types of offenses are more prevalent in some boroughs versus others
* Which offenses are deemed a felony vs. misdemeanor vs. violation
* Which complaints each jurisdiction is primarily responsible for handling
* If there’s a relationship between victim demographics and suspect demographics
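Several of these unknowns come down to comparing one categorical column against another. As a minimal sketch, a row-normalized crosstab shows the mix of offense levels within each borough. The six rows below are invented for illustration and say nothing about the real distribution.

```python
import pandas as pd

# Invented rows; only the column names come from the dataset.
sample = pd.DataFrame({
    'BORO_NM': ['BRONX', 'BROOKLYN', 'QUEENS', 'QUEENS', 'BRONX', 'BROOKLYN'],
    'LAW_CAT_CD': ['FELONY', 'MISDEMEANOR', 'FELONY',
                   'MISDEMEANOR', 'MISDEMEANOR', 'VIOLATION'],
})

# normalize='index' turns each borough's counts into shares that sum to 1.
share_by_boro = pd.crosstab(
    sample['BORO_NM'], sample['LAW_CAT_CD'], normalize='index'
)

print(share_by_boro.round(2))
```

Swapping in `OFNS_DESC`, `JURIS_DESC`, or the victim and suspect columns gives the other comparisons in the list.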


#### Possible problem statements
1. More crimes occur during the warmer months of the year
2. Criminal activity is more common at times of day when more people are out and active, e.g. commute times and meal times. It is also more likely when suspects believe they won’t be seen or caught, e.g. late at night
3. Racial bias may exist, with victims tending to file complaints against suspects of specific races
4. Criminal activity in NYC aligns with known correlations of [criminal behavior](https://en.wikipedia.org/wiki/Statistical_correlations_of_criminal_behaviour#Gender_and_biology) **
5. Each borough of NYC has roughly the same volume of complaints as other boroughs on a per-capita basis

<font size=1.5>** Crime occurs most frequently during the second and third decades of life. Males commit more crime overall and more violent crime than females. They commit more property crime except shoplifting, which is about equally distributed between the genders.</font>
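As a starting point for the first statement, complaints can be counted per month and the warm-month share compared with the rest of the year. This is only a sketch on five invented dates. Treating June through August as "warmer months" is an assumption, and `warm_months` and `warm_share` are hypothetical names.

```python
import pandas as pd

# Invented dates; the real input would be nypd['CMPLNT_FR_DT'].
sample = pd.DataFrame({
    'CMPLNT_FR_DT': ['2018-01-15', '2018-07-04', '2018-07-20',
                     '2018-08-24', '2018-12-31'],
})

dates = pd.to_datetime(sample['CMPLNT_FR_DT'], format='%Y-%m-%d')

# Complaints per calendar month, ordered January to December.
monthly = dates.dt.month.value_counts().sort_index()

# Assumption: "warmer months" means June, July, and August.
warm_months = [6, 7, 8]
warm_share = monthly.reindex(warm_months, fill_value=0).sum() / monthly.sum()

print(monthly)
print(f'Share of complaints in warm months: {warm_share:.0%}')
```

Months differ in length, so a fairer comparison would divide each month's count by its number of days before drawing conclusions.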
