# Initial EDA

In this notebook, I explore the data for the first time, using the 2019 dataset to get a feel for its columns and layout.

My aim is to come away from this exploration with a good grasp of what the dataset contains and how things will need to be changed, formatted, or engineered in order to create my FSM.

### Imports

In [4]:
import pandas as pd

# Relative path to the directory that holds the data/ folder
parent_dir = '../../'

### Import Data

In [5]:
# Load the 2019 dataset and preview the first five rows
prr19 = pd.read_csv(parent_dir + 'data/prr_2019.csv')
prr19.head()

Unnamed: 0,OBJECTID,ZIP,FILENUM,UOFNum,OCCURRED_D,OCCURRED_T,CURRENT_BA,OffSex,OffRace,HIRE_DT,...,Council District,RA,BEAT,SECTOR,DIVISION,X,Y,GeoLocation,Council Districts--Test,Dallas City Limits GIS Layer
0,2817,75253.0,UF2019-1702,"62295, 63542",12/01/2019,10:34 PM,11285,Male,White,03/08/2017,...,D8,6062.0,357.0,350.0,SOUTHEAST,2557123.437,6944231.397,POINT (-96.586265 32.702825),8.0,3.0
1,2234,75208.0,UF2019-1344,61093,10/06/2019,12:50 AM,11208,Male,White,08/24/2016,...,D1,4160.0,444.0,440.0,SOUTHWEST,2474936.793,6952151.398,POINT (-96.853036 32.729136),1.0,3.0
2,2755,75231.0,UF2019-1665,62820,12/31/2019,11:37 PM,9415,Male,White,04/02/2008,...,D9,6034.0,247.0,240.0,NORTHEAST,2508349.267,7001784.466,POINT (-96.741661 32.863941),13.0,3.0
3,2110,75228.0,UF2019-1314,60990,09/30/2019,6:20 PM,9884,Male,Hispanic,06/10/2009,...,D9,1132.0,228.0,220.0,NORTHEAST,2536678.324,6999039.025,POINT (-96.649175 32.855492),13.0,3.0
4,1663,75051.0,UF2019-1030,"59592, 59600",08/04/2019,12:10 AM,10480,Male,Hispanic,09/26/2012,...,,,,,,2433285.622,6953645.72,POINT (-96.98722 32.734935),,


In [6]:
prr19.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2944 entries, 0 to 2943
Data columns (total 41 columns):
OBJECTID                        2944 non-null int64
ZIP                             2943 non-null float64
FILENUM                         2944 non-null object
UOFNum                          2944 non-null object
OCCURRED_D                      2944 non-null object
OCCURRED_T                      2931 non-null object
CURRENT_BA                      2944 non-null int64
OffSex                          2944 non-null object
OffRace                         2944 non-null object
HIRE_DT                         2944 non-null object
OFF_INJURE                      2944 non-null bool
OffCondTyp                      2944 non-null object
OFF_HOSPIT                      2944 non-null bool
SERVICE_TY                      2937 non-null object
ForceType                       2944 non-null object
UOF_REASON                      2937 non-null object
Cycles_Num                      174 non-null objec

## Initial comments

There does not appear to be any documentation for this dataset, so we might need to make some inferences about certain columns and dig deeper into what they mean. For now, I'll detail my current understanding of each column:

### Column Descriptions

- objectid:  this appears to be the incident ID and should be unique (check this).  This is probably the identifier of the dataset
- zip:  zip code of the incident - currently a float, should probably change to integer
- filenum:  the file number of the incident that is recorded in official police records - check this and make sure it aligns with police records (should be able to look file numbers up...)
- uofnum:  UOF presumably stands for 'Use of Force'.  The use of force number probably refers to an ID number relating to the type of force used.  Should check that uofnum and forcetype match up.
- occurred_d:  Date the incident occurred.  Change to datetime object
- occurred_t:  Time the incident occurred.  Change to datetime object
- current_ba:  Not sure what this column means - will need to look this up.  What does ba mean?  It looks like some sort of identification number - I wonder if it's a number that identifies the police officer?
- offsex:  Sex of the officer 
- offrace:  Race of the officer
- hire_dt:  Date the officer was hired.  Change to datetime object
- off_injure:  Whether the officer was injured during the encounter.
- offcondtyp:  What condition the officer was in after the incident and/or what injuries they had
- off_hospit:  I think this is whether the officer had to go to hospital or not - check this
- service_ty:  I'm not sure what this is exactly, but it looks like the type of service the officer was performing during the incident, or how the call was labelled when the officer was sent to the scene.  E.g. were they performing an arrest, were they off duty and witnessed a crime, were they attending a service call, etc.
- forcetype:  This details what type of force was used on the citizen
- street_n:  The street number of the incident
- street:  The street of the incident
- street_g:  This appears to be the direction of the street address (E, S etc)
- street_t:  This is the street type (rd, st, blvd etc)
- address:  The full address (as specified by above fields)
- citnum:  This appears to be some sort of identification number for the citizen.  I'm not clear on this though and I should verify this
- citrace:  Race of the citizen
- citsex:  Sex of the citizen
- cit_injure:  Whether the citizen was injured during the encounter or not
- citcondtyp:  The condition of the citizen after the incident
- cit_arrest:  Whether the citizen was arrested or not.  It should be determined whether they were arrested because of the initial incident or because of the resistance (is the dataset on resistance to arrest?  What other types of resistance are there?)
- cit_infl_a:  Whether the citizen was under the influence of anything during the incident.  'Mentally unstable' is a value in this column, so are they counting mental instability as being 'under the influence'?
- citcharget:  Details of what the citizen was being arrested for
- council district:  The ID of the council district of the incident
- ra:  Not sure what RA is - look this up
- beat:  The police beat in which the incident occurred?  Or is this the police beat the officers are from?
- sector:  The police sector?  Or the county section?  Check this
- division:  I think this relates to county info?  But could be more police sector info?  Check this - is it even needed?
- x:  Coordinates of the incident?  Not sure...
- y:  coordinates of the incident?  Not sure...
- geolocation:  Location of the incident
- council districts--test:  Not sure
- dallas city limits gis layer:  Not sure
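
Several of the notes above are concrete to-dos (check that objectid is unique, change zip to an integer, change the date columns to datetime objects), so here is a minimal sketch of how they might look in pandas. It runs on a tiny stand-in frame built from the first two rows of the `head()` output rather than on `prr19` itself. The function name `tidy_types` and the new columns `OCCURRED_DT` and `UOF_COUNT` are my own placeholders, not part of the dataset. I'm also assuming dates are month/day/year, which the `12/31/2019` value in the preview supports.

```python
import pandas as pd

# Stand-in frame copied from the first two rows of the head() output.
sample = pd.DataFrame({
    "OBJECTID": [2817, 2234],
    "ZIP": [75253.0, 75208.0],
    "UOFNum": ["62295, 63542", "61093"],
    "OCCURRED_D": ["12/01/2019", "10/06/2019"],
    "OCCURRED_T": ["10:34 PM", "12:50 AM"],
    "HIRE_DT": ["03/08/2017", "08/24/2016"],
})


def tidy_types(df):
    """Return a copy with the type changes noted in the column descriptions."""
    out = df.copy()

    # Nullable integer dtype, so a missing zip code stays missing
    # instead of forcing the whole column back to float.
    out["ZIP"] = out["ZIP"].astype("Int64")

    # Assumed month/day/year format.
    out["HIRE_DT"] = pd.to_datetime(out["HIRE_DT"], format="%m/%d/%Y")

    # Combine date and time into one timestamp (OCCURRED_DT is a
    # hypothetical column name). Rows with a missing time become NaT.
    out["OCCURRED_DT"] = pd.to_datetime(
        out["OCCURRED_D"] + " " + out["OCCURRED_T"],
        format="%m/%d/%Y %I:%M %p",
        errors="coerce",
    )

    # UOFNum can hold several comma-separated numbers per row, so count
    # them (UOF_COUNT is a hypothetical column name).
    out["UOF_COUNT"] = out["UOFNum"].str.split(",").str.len()
    return out


tidy = tidy_types(sample)

# True if OBJECTID could serve as the row identifier.
objectid_is_unique = tidy["OBJECTID"].is_unique
```

Calling `tidy_types(prr19)` should apply the same changes to the full dataset, provided the column names match the ones in the `info()` output. Counting the entries in UOFNum is only a first step towards checking whether it lines up with ForceType.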

### Other questions about the data:
- Is the dataset on resistance to arrest?  What other types of resistance are there?
- 'Mentally unstable' is a value in the cit_infl_a column, so are they counting mental instability as being 'under the influence'?
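
One way to start on the cit_infl_a question is to list every distinct value in that column with its frequency. The sketch below uses a made-up Series, not the real data. The only value taken from my notes is 'Mentally unstable'; the other values and the upper-case column name `CIT_INFL_A` are assumptions for illustration.

```python
import pandas as pd

# Placeholder data: only 'Mentally unstable' comes from the notes above.
# 'Alcohol' and 'None detected' are invented for illustration.
influence = pd.Series(
    ["Mentally unstable", "Alcohol", "None detected", "Mentally unstable", None],
    name="CIT_INFL_A",  # assumed column name
)

# dropna=False keeps missing entries in the tally, so gaps are visible too.
counts = influence.value_counts(dropna=False)
print(counts)
```

Running the same `value_counts(dropna=False)` call on the real column should show whether mental-health categories sit alongside substance categories.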