# Exploration & Wrangling Data of the Crimes in Chicago Dataset 
* The data is extracted from: https://www.kaggle.com/currie32/crimes-in-chicago/data

###### Outline of Entire Process: 
* Import 4 Datasets
* Analyze Missing Data
* Combine Four Datasets into One
* Conversion of Time Data
* Exploring Null Values in Year 2001 'Ward'  
* Exploration of Unique Labels: 
    * (1) The Unique Labels for each Column 
    * (2) Printing Out the Number of Unique Labels Per Column 
* Dropping Columns 
* Export DfTotal 
* Unnest Location Data: Readding Latitude and Longitude 
* Export and Import the New Cleaned Dataset With Latitude and Longitude Information
* Using the District Data to fill in the Missing Longitude and Latitude Data 
    * Part I: Filling in NaN Data of Longitude and Latitude Observations that have District info in Observation 
    * Part II: Filling in Districts with Missing Values, but nonempty longitude/latitude data
* Filling in Missing Location Description Values
* Export New Dataset with No Missing Values 

###### Outline of Data Wrangling Process: 
*  Combine Four Datasets into One: Chicago_Crimes_2001_to_2004.csv, Chicago_Crimes_2005_to_2007.csv, Chicago_Crimes_2008_to_2011.csv, Chicago_Crimes_2012_to_2017.csv
* Conversion of Time Data 
* Dropping Columns 
* Unnest Location Data: Readding Latitude and Longitude 
* Using the District Data to fill in the Missing Longitude and Latitude Data
    * Part I: Filling in NaN Data of Longitude and Latitude Observations that have District info in Observation 
    * Part II: Filling in Districts with Missing Values, but nonempty longitude/latitude data
* Filling in Missing Location Description Values

###### Outline of Data Exploration Process: 
* Analyze Missing Data 
* Exploring Null Values in Year 2001 'Ward'  
* Exploration of Unique Labels: 
    * (1) The Unique Labels for each Column 
    * (2) Printing Out the Number of Unique Labels Per Column


## Importing Basic Packages 
These packages help us rearrange our data into a dataframe format easily 

In [1]:
import numpy as np  
import pandas as pd 
import math

## Import 4 Datasets 

In [2]:
df1 = pd.read_csv("../../crimesInChicagoData/Chicago_Crimes_2001_to_2004.csv", error_bad_lines = False)

b'Skipping line 1513591: expected 23 fields, saw 24\n'


In [3]:
df2 = pd.read_csv("../../crimesInChicagoData/Chicago_Crimes_2005_to_2007.csv", error_bad_lines = False)

b'Skipping line 533719: expected 23 fields, saw 24\n'


In [4]:
df3 = pd.read_csv("../../crimesInChicagoData/Chicago_Crimes_2008_to_2011.csv", error_bad_lines = False)

b'Skipping line 1149094: expected 23 fields, saw 41\n'


In [9]:
df4 = pd.read_csv("../../crimesInChicagoData/Chicago_Crimes_2012_to_2017.csv", error_bad_lines = False)
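
On newer pandas versions the `error_bad_lines` argument is no longer available (it was deprecated in pandas 1.3 and removed in 2.0); `on_bad_lines` is its replacement. The sketch below uses a small made-up CSV, not the Chicago files, to show how a row with too many fields is skipped.

```python
import io
import pandas as pd

# Toy CSV (hypothetical data): the third data row has one field too many
csvText = "a,b,c\n1,2,3\n4,5,6\n7,8,9,10\n11,12,13\n"

# on_bad_lines='skip' drops malformed rows; 'warn' drops them and reports them
dfToy = pd.read_csv(io.StringIO(csvText), on_bad_lines='skip')
print(len(dfToy))
```

With `on_bad_lines='warn'` the skipped line numbers are reported, which is closest to the "Skipping line ..." messages shown above.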

## Analyze  Missing Data 
We compute the percentage of missing data per column for each of the 4 dataframes to understand how much data is missing.

In [10]:
def missingDataSummary(df):
    '''Takes in a dataframe and returns a one-row dataframe with the percentage
    of missing data in each column. The row is labeled with df.name.
    '''
    dfInitial = pd.DataFrame(index = [df.name], columns = df.columns)
    for column in df.columns:
        numNas = df[column].isnull().sum()
        numObservations = len(df[column])
        missingDataPercentage = numNas/numObservations * 100
        #.loc avoids chained assignment, which may write to a temporary copy
        dfInitial.loc[df.name, column] = missingDataPercentage
    return dfInitial 
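
The same percentages can also be computed without an explicit loop. This is a minimal sketch on a made-up dataframe (the values are hypothetical); it relies on the fact that the mean of a boolean column is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values): half of 'Ward' is missing
dfToy = pd.DataFrame({
    'Ward': [1.0, np.nan, np.nan, 4.0],
    'District': [4.0, 9.0, 14.0, 25.0],
})

# isnull() gives True/False per cell; mean() per column is the missing fraction
missingPct = dfToy.isnull().mean() * 100
print(missingPct)
```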

In [11]:
#set the names of the dataframes for convenience 
df1.name = 'df1'
df2.name = 'df2'
df3.name = 'df3'
df4.name = 'df4'

In [12]:
#retrieve info on missing data for each data frame 
df1_missingSummary = missingDataSummary(df1)
df2_missingSummary  = missingDataSummary(df2)
df3_missingSummary = missingDataSummary(df3)
df4_missingSummary = missingDataSummary(df4)

#concatenate missingData summary info into one
df_missingSummaries = pd.concat([df1_missingSummary, df2_missingSummary, df3_missingSummary, df4_missingSummary], axis = 0)

#display all the missing data summary
pd.set_option('display.max_columns', None)
df_missingSummaries

Unnamed: 0.1,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
df1,0,0,0.0,0,0,0,0,0,0.000831811,0,0,0,0.000103976,36.3986,36.4046,0,1.59557,1.59557,0,0,1.59557,1.59562,1.59562
df2,0,0,0.0,0,0,0,0,0,0.00133523,0,0,0,0.000267045,0.000801135,0.0186397,0,0.488906,0.488906,0,0,0.488906,0.488906,0.488906
df3,0,0,0.000223155,0,0,0,0,0,0.010823,0,0,0,0.00308698,0.00234313,0.0541152,0,1.06538,1.06538,0,0,1.06538,1.06538,1.06538
df4,0,0,6.86477e-05,0,0,0,0,0,0.113818,0,0,0,6.86477e-05,0.000961067,0.00274591,0,2.54566,2.54566,0,0,2.54566,2.54566,2.54566


## Combine Four Datasets into One
Concatenate the 4 data sets to make one total data set covering the years 2001-2017

In [13]:
dfTotal = pd.concat([df1, df2, df3, df4], axis = 0)

Checking if the concatenation indeed happened:

In [14]:
print("Are the number of columns maintained?")
print(len(df1.columns) == len(df2.columns) ==  len(df3.columns) == len(df4.columns) == len(dfTotal.columns))

print("Are the number of observations of all data sets equal to the dfTotal")
print(len(df1['ID']) + len(df2['ID']) + len(df3['ID']) + len(df4['ID']) == len(dfTotal['ID']))

dfTotal.columns

Are the number of columns maintained?
True
Are the number of observations of all data sets equal to the dfTotal
True


Index(['Unnamed: 0', 'ID', 'Case Number', 'Date', 'Block', 'IUCR',
       'Primary Type', 'Description', 'Location Description', 'Arrest',
       'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code',
       'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude',
       'Longitude', 'Location'],
      dtype='object')

In [15]:
dfTotal.head()

Unnamed: 0.1,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,879,4786321,HM399414,01/01/2004 12:01:00 AM,082XX S COLES AVE,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,424,4.0,7.0,46.0,6,,,2004.0,08/17/2015 03:03:40 PM,,,
1,2544,4676906,HM278933,03/01/2003 12:00:00 AM,004XX W 42ND PL,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,935,9.0,11.0,61.0,26,1173974.0,1876760.0,2003.0,04/15/2016 08:55:02 AM,41.8172,-87.637328,"(41.817229156, -87.637328162)"
2,2919,4789749,HM402220,06/20/2004 11:00:00 AM,025XX N KIMBALL AVE,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,1413,14.0,35.0,22.0,20,,,2004.0,08/17/2015 03:03:40 PM,,,
3,2927,4789765,HM402058,12/30/2004 08:00:00 PM,045XX W MONTANA ST,840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,2521,25.0,31.0,20.0,6,,,2004.0,08/17/2015 03:03:40 PM,,,
4,3302,4677901,HM275615,05/01/2003 01:00:00 AM,111XX S NORMAL AVE,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,2233,22.0,34.0,49.0,6,1174948.0,1831050.0,2003.0,04/15/2016 08:55:02 AM,41.6918,-87.635116,"(41.691784636, -87.635115968)"


## Conversion of Time Data 
* We converted the 'Date' column from strings to pandas datetime values so that it can be sorted chronologically. Sorting by time makes the data easier to explore. 
* The conversion also changes the time format: the AM & PM designations are dropped and the hours go from a 12-hour clock to a 24-hour clock. This makes it easier to look at the data by hour. 
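
As a small illustration of the conversion described above, the sketch below parses two 'Date' strings taken from the head of the dataset, using the same format string that this notebook defines further down as `dateFormat`.

```python
import pandas as pd

dateFormat = '%m/%d/%Y %I:%M:%S %p'  # %I is the 12-hour clock, %p is AM/PM

dates = pd.Series(['12/30/2004 08:00:00 PM', '01/01/2004 12:01:00 AM'])
parsed = pd.to_datetime(dates, format=dateFormat)

print(parsed.dt.hour.tolist())  # hours on a 24-hour clock
print(parsed.sort_values())     # chronological order, not alphabetical
```

Once parsed, 08:00 PM becomes hour 20 and 12:01 AM becomes hour 0, and sorting follows the calendar rather than the string order.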

In [16]:
df1.sort_values(by='Year')

Unnamed: 0.1,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
1602847,3629582,1423259,G139165,03/10/2001 11:30:00 PM,035XX S FEDERAL ST,1340,CRIMINAL DAMAGE,TO STATE SUP PROP,CHA PARKING LOT/GROUNDS,True,False,211,2.0,,,14,1176246.0,18 08:55:02 AM,41.789832,-87.672973835,"(41.789832136, -87.672973835)",,
386628,3926953,1797222,G622842,10/16/2001 04:00:00 PM,042XX N WINCHESTER AV,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE-GARAGE,False,False,1922,19.0,,,14,1162614.0,1.92829e+06,2001.000000,08/17/2015 03:03:40 PM,41.9589,-87.677555,"(41.958876653, -87.677555184)"
379441,3919766,1788231,G603480,10/08/2001 09:15:00 AM,023XX W FLOURNOY ST,2027,NARCOTICS,POSS: CRACK,PARK PROPERTY,True,False,1224,12.0,,,18,1160929.0,1.89704e+06,2001.000000,08/17/2015 03:03:40 PM,41.8732,-87.684619,"(41.873154129, -87.684618984)"
379440,3919765,1788230,G588035,10/01/2001 11:45:00 AM,035XX S FEDERAL ST,2095,NARCOTICS,ATTEMPT POSSESSION NARCOTICS,CHA PARKING LOT/GROUNDS,True,False,211,2.0,,,18,1176254.0,1.88129e+06,2001.000000,08/17/2015 03:03:40 PM,41.8296,-87.628828,"(41.829611625, -87.628828268)"
379439,3919764,1788228,G604692,10/08/2001 07:00:10 PM,003XX W LOCUST ST,2024,NARCOTICS,POSS: HEROIN(WHITE),STREET,True,False,1823,18.0,,,18,1173468.0,1.90657e+06,2001.000000,08/17/2015 03:03:40 PM,41.899,-87.638299,"(41.899038499, -87.638299131)"
379438,3919763,1788227,G608421,10/10/2001 12:44:27 PM,023XX W 111 PL,1130,DECEPTIVE PRACTICE,FRAUD OR CONFIDENCE GAME,RESIDENCE,False,False,2212,22.0,,,11,1162729.0,1.83064e+06,2001.000000,08/17/2015 03:03:40 PM,41.6909,-87.679863,"(41.690919865, -87.679862857)"
379437,3919762,1788225,G607431,10/09/2001 10:20:28 PM,008XX N ST LOUIS AV,0460,BATTERY,SIMPLE,SIDEWALK,False,False,1121,11.0,,,08B,1152942.0,1.90515e+06,2001.000000,08/17/2015 03:03:40 PM,41.8956,-87.713728,"(41.895576192, -87.713728213)"
379436,3919761,1788224,G588034,10/01/2001 11:43:00 AM,035XX S FEDERAL ST,2095,NARCOTICS,ATTEMPT POSSESSION NARCOTICS,CHA PARKING LOT/GROUNDS,True,False,211,2.0,,,18,1176254.0,1.88129e+06,2001.000000,08/17/2015 03:03:40 PM,41.8296,-87.628828,"(41.829611625, -87.628828268)"
379435,3919760,1788222,G608283,10/10/2001 11:41:25 AM,013XX W 111 ST,5002,OTHER OFFENSE,OTHER VEHICLE OFFENSE,STREET,True,False,2234,22.0,,,26,1169568.0,1.83112e+06,2001.000000,08/17/2015 03:03:40 PM,41.6921,-87.654811,"(41.692097453, -87.654810782)"
379434,3919759,1788221,G588513,10/01/2001 02:10:00 PM,046XX S STATE ST,2027,NARCOTICS,POSS: CRACK,VEHICLE NON-COMMERCIAL,True,False,221,2.0,,,18,1177006.0,1.8745e+06,2001.000000,08/17/2015 03:03:40 PM,41.811,-87.626274,"(41.810959587, -87.62627425)"


In [17]:
def changeToDateTime(df, columnName, timeFormat): 
    df[columnName] = pd.to_datetime(df[columnName], format = timeFormat)

In [18]:
dateFormat = '%m/%d/%Y %I:%M:%S %p' 
yearFormat = '%Y.0'

In [19]:
changeToDateTime(df1, 'Date', dateFormat)

In [62]:
#changeToDateTime(df1, 'Year', yearFormat)

In [92]:
#df2003 = df1.loc[df1['Date'].dt.year == 2003] #allows easy slicing of data by time

## Exploring Null Values in Year 2001 'Ward' 
* From the "Analyze Missing Data" section, we found that df1 is missing about 36% of its 'Ward' values. We decided to explore whether this was a phenomenon that occurred only in a particular year. 
* We found that 2001 accounted for 563443/700132, or ~80%, of df1's missing 'Ward' data.
* We hypothesize that this is because 'Ward' data was only recorded after mid-2001 (we do not know if this is true). 
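
The per-year breakdown can also be computed for every year at once instead of filtering one year at a time. The following is a sketch on a made-up dataframe (the rows are hypothetical); it assumes 'Date' has already been converted to datetime.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical rows): three observations are missing 'Ward'
dfToy = pd.DataFrame({
    'Date': pd.to_datetime(['2001-03-10', '2001-10-16', '2002-05-01', '2003-03-01']),
    'Ward': [np.nan, np.nan, np.nan, 11.0],
})

# Count the missing 'Ward' values per calendar year
missingByYear = dfToy['Ward'].isnull().groupby(dfToy['Date'].dt.year).sum()

# Share of all missing 'Ward' values that each year accounts for, in percent
shareByYear = missingByYear / missingByYear.sum() * 100
print(shareByYear)
```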

These are the observations in df1 that are missing 'Ward' info:

In [20]:
df1WardNull = df1.loc[df1['Ward'].isnull()] 

This is how many observations in df1 are missing 'Ward':

In [133]:
len(df1WardNull)

700132

This is how many observations are missing 'Ward' in the year 2001: 

In [23]:
ward2001Nan = df1.loc[(df1['Ward'].isnull()) &( df1['Date'].dt.year == 2001)]

len(ward2001Nan)

563443

So 2001 accounts for this percentage of df1's missing observations in 'Ward': 

In [2]:
str(563443/700132*100) + "%" 

'80.47668154005245%'

In [25]:
df2001 = len(df1.loc[df1['Date'].dt.year == 2001])

## Exploration of Unique Labels
* We wanted to see what labels were used for each category/column to better understand the data we were working with 

In [29]:
def uniqueLabels(df):
    '''Takes in a dataframe and returns a one-row dataframe whose cells hold
    the array of unique labels found in the corresponding column
    '''
    #wrapping each array in a list stores the whole array in one cell (row 0)
    return pd.DataFrame({column: [df[column].unique()] for column in df.columns})

In [30]:
dfTotal.name = 'dfTotal'

In [31]:
dftotal_unique = uniqueLabels(dfTotal)

### (1) The Unique Labels for each Column
This demonstrates and summarizes what kind of labels were used per column: 

In [32]:
dftotal_unique

Unnamed: 0.1,Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,"[879, 2544, 2919, 2927, 3302, 3633, 3756, 4502...","[4786321, 4676906, 4789749, 4789765, 4677901, ...","[HM399414, HM278933, HM402220, HM402058, HM275...","[01/01/2004 12:01:00 AM, 03/01/2003 12:00:00 A...","[082XX S COLES AVE, 004XX W 42ND PL, 025XX N K...","[0840, 2825, 1752, 0841, 0266, 5007, 0890, 175...","[THEFT, OTHER OFFENSE, OFFENSE INVOLVING CHILD...","[FINANCIAL ID THEFT: OVER $300, HARASSMENT BY ...","[RESIDENCE, OTHER, APARTMENT, RESIDENCE PORCH/...","[False, True]","[False, True]","[424, 935, 1413, 2521, 2233, 1011, 531, 2222, ...","[4.0, 9.0, 14.0, 25.0, 22.0, 10.0, 5.0, 18.0, ...","[7.0, 11.0, 35.0, 31.0, 34.0, 24.0, 9.0, 21.0,...","[46.0, 61.0, 22.0, 20.0, 49.0, 29.0, 50.0, 73....","[06, 26, 20, 02, 07, 17, 11, 10, 08B, 05, 15, ...","[nan, 1173974.0, 1174948.0, 1182247.0, 1169911...","[nan, 1876757.0, 1831051.0, 1829375.0, 1844832...","[2004.0, 2003.0, 2001.0, 2002.0, 41.789832136,...","[08/17/2015 03:03:40 PM, 04/15/2016 08:55:02 A...","[nan, 41.817229155999996, 41.691784636, 41.687...","[nan, -87.637328162, -87.635115968, -87.608445...","[nan, (41.817229156, -87.637328162), (41.69178..."



### (2) Printing Out the Number of Unique Labels Per Column

In [61]:
def printUniqueSummaryValues (df): 
    for column in df.columns: 
        print(column + " has these many unique values: ")
        print(len(df[column][0]))

In [62]:
printUniqueSummaryValues(dftotal_unique)

Unnamed: 0 has these many unique values: 
6170812
ID has these many unique values: 
6170812
Case Number has these many unique values: 
6170473
Date has these many unique values: 
2451622
Block has these many unique values: 
58776
IUCR has these many unique values: 
398
Primary Type has these many unique values: 
35
Description has these many unique values: 
376
Location Description has these many unique values: 
173
Arrest has these many unique values: 
2
Domestic has these many unique values: 
2
Beat has these many unique values: 
304
District has these many unique values: 
27
Ward has these many unique values: 
51
Community Area has these many unique values: 
79
FBI Code has these many unique values: 
26
X Coordinate has these many unique values: 
78276
Y Coordinate has these many unique values: 
152136
Year has these many unique values: 
18
Updated On has these many unique values: 
1310
Latitude has these many unique values: 
864639
Longitude has these many unique values: 
838433
Lo

## Dropping Columns 
From our exploration of labels and 'Ward' info, we learned enough about the columns to see that some of them were repetitive, too detailed, or unnecessary for answering our question. The dataset was also large enough to cause problems when working with it and building a model, so we decided to shrink it. Thus, we dropped the following columns: 

### Dropping Columns: Part I 

* Unnamed: 0 because it was an extraneous column 
* Ward because df1 is missing a lot of its values and we already have other columns indicating district or location. Hence, we deemed the column repetitive and unnecessary
* Longitude & Latitude because Location includes both coordinates  
* Updated On because it does not help answer our question
* X Coordinate & Y Coordinate because the 'Location' column retains the location information anyway, and it was unclear which coordinate system these x and y values are based on  
* Case Number because ID already identifies each observation uniquely  


In [45]:
dfTotal.columns

Index(['Unnamed: 0', 'ID', 'Case Number', 'Date', 'Block', 'IUCR',
       'Primary Type', 'Description', 'Location Description', 'Arrest',
       'Domestic', 'Beat', 'District', 'Ward', 'Community Area', 'FBI Code',
       'X Coordinate', 'Y Coordinate', 'Year', 'Updated On', 'Latitude',
       'Longitude', 'Location'],
      dtype='object')

In [46]:
delInitialCols = ['Unnamed: 0', 'Case Number', 'Longitude', 'Latitude', 'Ward', 'X Coordinate', 'Y Coordinate', 'Updated On']

In [47]:
dfTotal = dfTotal.drop(delInitialCols, axis = 1) #actual deletion 

In [48]:
dfTotal.head()

Unnamed: 0,ID,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Community Area,FBI Code,Year,Location
0,4786321,01/01/2004 12:01:00 AM,082XX S COLES AVE,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,424,4.0,46.0,6,2004.0,
1,4676906,03/01/2003 12:00:00 AM,004XX W 42ND PL,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,935,9.0,61.0,26,2003.0,"(41.817229156, -87.637328162)"
2,4789749,06/20/2004 11:00:00 AM,025XX N KIMBALL AVE,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,1413,14.0,22.0,20,2004.0,
3,4789765,12/30/2004 08:00:00 PM,045XX W MONTANA ST,840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,2521,25.0,20.0,6,2004.0,
4,4677901,05/01/2003 01:00:00 AM,111XX S NORMAL AVE,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,2233,22.0,49.0,6,2003.0,"(41.691784636, -87.635115968)"


### Dropping Columns: Part II 
We delete:
* Community Area because groupings by region, Beat and District, already exist 
* Block because it contains addresses, which are finer-grained than we need 

In [53]:
delInitialCols2 = ['Community Area', 'Block']

In [54]:
dfTotal = dfTotal.drop(delInitialCols2, axis = 1) #actual deletion 

In [55]:
dfTotal.head()

Unnamed: 0,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,FBI Code,Year,Location
0,4786321,01/01/2004 12:01:00 AM,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,424,4.0,6,2004.0,
1,4676906,03/01/2003 12:00:00 AM,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,935,9.0,26,2003.0,"(41.817229156, -87.637328162)"
2,4789749,06/20/2004 11:00:00 AM,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,1413,14.0,20,2004.0,
3,4789765,12/30/2004 08:00:00 PM,840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,2521,25.0,6,2004.0,
4,4677901,05/01/2003 01:00:00 AM,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,2233,22.0,6,2003.0,"(41.691784636, -87.635115968)"


In [63]:
def printUniqueValues (df): 
    for column in df.columns: 
        print(column + " has these many unique values: ")
        print(len(df[column].unique()))

In [64]:
printUniqueValues(dfTotal)

ID has these many unique values: 
6170812
Date has these many unique values: 
2451622
IUCR has these many unique values: 
398
Primary Type has these many unique values: 
35
Description has these many unique values: 
376
Location Description has these many unique values: 
173
Arrest has these many unique values: 
2
Domestic has these many unique values: 
2
Beat has these many unique values: 
304
District has these many unique values: 
27
FBI Code has these many unique values: 
26
Year has these many unique values: 
18
Location has these many unique values: 
840086
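
The counts printed above can also be obtained in one call with `DataFrame.nunique`. This sketch uses a made-up dataframe (hypothetical values); `dropna=False` is passed so that nan counts as a label, which matches what `len(df[column].unique())` reports.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): 'District' has two labels plus a nan
dfToy = pd.DataFrame({
    'District': [4.0, 9.0, 4.0, np.nan],
    'Arrest': [False, True, False, False],
})

# One count per column; nan is counted because dropna=False
uniqueCounts = dfToy.nunique(dropna=False)
print(uniqueCounts)
```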


### Dropping Columns: Part III 
* We decided to drop Beat.

While Beat has no missing values, it has 304 different categories, whereas District has only 27 unique values (one of which is nan, since District has a few missing values). We believed the regional grouping did not need to be as fine-grained as Beat, particularly when we already have the Location information. 

In [65]:
dfTotal = dfTotal.drop(['Beat'], axis = 1) #actual deletion 

In [67]:
len(dfTotal)

7941282

In [70]:
dfTotal.head()

Unnamed: 0,index,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Year,Location
0,0,4786321,01/01/2004 12:01:00 AM,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,4.0,6,2004.0,
1,1,4676906,03/01/2003 12:00:00 AM,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,9.0,26,2003.0,"(41.817229156, -87.637328162)"
2,2,4789749,06/20/2004 11:00:00 AM,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,14.0,20,2004.0,
3,3,4789765,12/30/2004 08:00:00 PM,840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,25.0,6,2004.0,
4,4,4677901,05/01/2003 01:00:00 AM,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,22.0,6,2003.0,"(41.691784636, -87.635115968)"


### Export dfTotal 
We exported dfTotal to a csv so that we could get rid of the original data and reduce the dataset size on our local computer.

In [69]:
dfTotal = dfTotal.reset_index() #to realign the indices from the concatenation of the four dataframes from before

In [291]:
newTotalDf = pd.read_csv("../../crimesInChicagoData/dfTotal.csv")

In [292]:
newTotalDf = newTotalDf.drop(['Unnamed: 0'], axis = 1)
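
The export cell is not shown in this notebook, so the following is only a sketch of the round trip, assuming a plain `to_csv` call. It uses a made-up two-row dataframe and an in-memory buffer in place of the real file. Passing `index = False` keeps the row index out of the file, so no 'Unnamed: 0' column has to be dropped after reading it back.

```python
import io
import pandas as pd

# Toy data (two rows copied from the head of the dataset, fewer columns)
dfToy = pd.DataFrame({
    'ID': [4786321, 4676906],
    'Date': ['01/01/2004 12:01:00 AM', '03/01/2003 12:00:00 AM'],
})

buffer = io.StringIO()               # stands in for a csv file on disk
dfToy.to_csv(buffer, index = False)  # do not write the row index as a column
buffer.seek(0)

reloaded = pd.read_csv(buffer)
print(reloaded.columns.tolist())
```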

We needed to reconvert 'Date' to a datetime format for sorting, because a csv stores dates as plain text. None of the data was in fact changed. 

In [293]:
changeToDateTime(newTotalDf, 'Date', dateFormat)

In [294]:
newTotalDf.head()

Unnamed: 0,index,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Year,Location
0,0,4786321,2004-01-01 00:01:00,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,4.0,6,2004.0,
1,1,4676906,2003-03-01 00:00:00,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,9.0,26,2003.0,"(41.817229156, -87.637328162)"
2,2,4789749,2004-06-20 11:00:00,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,14.0,20,2004.0,
3,3,4789765,2004-12-30 20:00:00,840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,25.0,6,2004.0,
4,4,4677901,2003-05-01 01:00:00,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,22.0,6,2003.0,"(41.691784636, -87.635115968)"


## Unnest Location Data: Readding Latitude and Longitude 
We decided to include latitude and longitude coordinates again. While this is repetitive, we had lost that information on our local computer and the original dataset took too long to import, so we just re-added the Latitude and Longitude columns using the Location data. 

In [300]:
locationDf = newTotalDf['Location']
latitudeList = []
longitudeList = []
for location in locationDf:
    if not isinstance(location, str): #missing locations are read in as a float nan
        latitudeList.append(location)
        longitudeList.append(location)
    else: 
        #strip the parentheses from "(lat, long)" and split on the comma
        coordinate = location.replace("(", "").replace(")", "")
        splitCoords = coordinate.split(",")
        latitudeList.append(float(splitCoords[0]))
        longitudeList.append(float(splitCoords[1]))
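
On a dataset of this size, a vectorized version of the same unnesting may be faster than a Python loop. The sketch below is an alternative, not what this notebook ran; it uses `Series.str.extract` with a regular expression on two 'Location' strings from the dataset plus one missing value.

```python
import numpy as np
import pandas as pd

locations = pd.Series([
    "(41.817229156, -87.637328162)",
    np.nan,
    "(41.691784636, -87.635115968)",
])

# Each capture group becomes a column; rows that do not match stay nan
coords = locations.str.extract(r'\(\s*([-\d.]+)\s*,\s*([-\d.]+)\s*\)').astype(float)
coords.columns = ['Latitude', 'Longitude']
print(coords)
```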

In [302]:
latitudeDf = pd.DataFrame({'Latitude':latitudeList})
latitudeDf.head(10)

Unnamed: 0,Latitude
0,
1,41.817229
2,
3,
4,41.691785
5,
6,41.68702
7,41.729712
8,
9,41.869772


In [303]:
longitudeDf = pd.DataFrame({'Longitude':longitudeList})
longitudeDf.head(10)

Unnamed: 0,Longitude
0,
1,-87.637328
2,
3,
4,-87.635116
5,
6,-87.608445
7,-87.653159
8,
9,-87.70818


In [304]:
locationsDf = latitudeDf.join(longitudeDf)

In [305]:
locationsDf.head(20)

Unnamed: 0,Latitude,Longitude
0,,
1,41.817229,-87.637328
2,,
3,,
4,41.691785,-87.635116
5,,
6,41.68702,-87.608445
7,41.729712,-87.653159
8,,
9,41.869772,-87.70818


In [306]:
print(len(latitudeDf))
print(len(longitudeDf))
print(len(newTotalDf))
print(len(locationsDf))

7941281
7941281
7941281
7941281


In [307]:
newTotalDf =newTotalDf.join(locationsDf)

In [282]:
newTotalDf= newTotalDf.drop(['index', 'Location'],axis =1)

In [326]:
newTotalDf.head(10)

Unnamed: 0,level_0,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Year,Latitude,Longitude
0,0,4786321,2004-01-01 00:01:00,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,4.0,6,2004.0,,
1,1,4676906,2003-03-01 00:00:00,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,9.0,26,2003.0,41.817229,-87.637328
2,2,4789749,2004-06-20 11:00:00,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,14.0,20,2004.0,,
3,3,4789765,2004-12-30 20:00:00,840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,25.0,6,2004.0,,
4,4,4677901,2003-05-01 01:00:00,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,22.0,6,2003.0,41.691785,-87.635116
5,5,4838048,2004-08-01 00:01:00,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,APARTMENT,False,False,10.0,6,2004.0,,
6,6,4791194,2001-01-01 11:00:00,266,CRIM SEXUAL ASSAULT,PREDATORY,RESIDENCE,True,True,5.0,2,2001.0,41.68702,-87.608445
7,7,4679521,2003-03-15 00:00:00,5007,OTHER OFFENSE,OTHER WEAPONS VIOLATION,RESIDENCE PORCH/HALLWAY,False,False,22.0,26,2003.0,41.729712,-87.653159
8,8,4792195,2004-09-16 10:00:00,890,THEFT,FROM BUILDING,RESIDENCE,False,False,18.0,6,2004.0,,
9,9,4680124,2003-01-01 00:00:00,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,11.0,6,2003.0,41.869772,-87.70818


### Export and Import the New Cleaned Dataset With Latitude and Longitude Information
This is just to prevent the notebook from crashing when it runs out of memory. 

In [327]:
newTotalDf.to_csv("../../crimesInChicagoData/newTotalDf.csv")

In [3]:
dataset = pd.read_csv("../../crimesInChicagoData/newTotalDf.csv")

In [None]:
dataset = dataset.drop(['level_0'], axis =1)

In [6]:
dataset.head()

Unnamed: 0.1,Unnamed: 0,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Year,Latitude,Longitude
0,0,4786321,2004-01-01 00:01:00,840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,4.0,6,2004.0,,
1,1,4676906,2003-03-01 00:00:00,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,9.0,26,2003.0,41.817229,-87.637328
2,2,4789749,2004-06-20 11:00:00,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,14.0,20,2004.0,,
3,3,4789765,2004-12-30 20:00:00,840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,25.0,6,2004.0,,
4,4,4677901,2003-05-01 01:00:00,841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,22.0,6,2003.0,41.691785,-87.635116


## Using the District Data to fill in the Missing Longitude and Latitude Data
This process may be flawed because it could make certain locations appear more frequently than they really do. However, less than 5% of the coordinate data is missing, so we believe the method we use to fill in the missing values will not have a significant effect. We could have dropped those observations, but we wanted to retain the other data in the observations with missing longitude and latitude. 

### Part I: Filling in NaN Data of Longitude and Latitude Observations that have District info in Observation
* District corresponds to a specific region.
* In this section we subdivide the data by district, then find the "average" location for each district. 
* That average is used to fill in the missing location values of that district. 


In [11]:
#dropna removes the nan before collecting the unique districts
uniqueDistricts = list(dataset['District'].dropna().unique())

In [12]:
uniqueDistricts

[4.0,
 9.0,
 14.0,
 25.0,
 22.0,
 10.0,
 5.0,
 18.0,
 11.0,
 20.0,
 8.0,
 7.0,
 1.0,
 16.0,
 15.0,
 3.0,
 6.0,
 2.0,
 19.0,
 12.0,
 24.0,
 17.0,
 31.0,
 21.0,
 23.0,
 13.0]

In [13]:
def subsetLists(df, subsetCategories, columnName): 
    uniqueSubsetsList = []
    for category in subsetCategories: 
        dfSubset = df.loc[df[columnName] == category] 
        uniqueSubsetsList.append(dfSubset)
    return uniqueSubsetsList

In [14]:
uniqueDistrictsList = subsetLists(dataset, uniqueDistricts, 'District')

In [28]:
for i in range(len(uniqueDistrictsList)):
    print(str(uniqueDistrictsList[i]['District'].unique()) + " district is consisted of the follwing # of districts: ")
    print(len(uniqueDistrictsList[i]))

[ 4.] district is consisted of the follwing # of districts: 
453894
[ 9.] district is consisted of the follwing # of districts: 
397942
[ 14.] district is consisted of the follwing # of districts: 
314642
[ 25.] district is consisted of the follwing # of districts: 
463938
[ 22.] district is consisted of the follwing # of districts: 
263055
[ 10.] district is consisted of the follwing # of districts: 
335491
[ 5.] district is consisted of the follwing # of districts: 
354567
[ 18.] district is consisted of the follwing # of districts: 
337197
[ 11.] district is consisted of the follwing # of districts: 
498775
[ 20.] district is consisted of the follwing # of districts: 
137310
[ 8.] district is consisted of the follwing # of districts: 
550011
[ 7.] district is consisted of the follwing # of districts: 
476524
[ 1.] district is consisted of the follwing # of districts: 
291073
[ 16.] district is consisted of the follwing # of districts: 
263442
[ 15.] district is consisted of the foll

In [42]:
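def fillMissingWithAverage(columnName, df):
    '''Fills the nan values of df[columnName] with the average of the column'''
    average = df[columnName].mean() #mean() skips nan, same as sum/count
    #assigning the result back avoids calling fillna(inplace = True) on a selection
    df[columnName] = df[columnName].fillna(average)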

In [43]:
for dataframe in uniqueDistrictsList:
    fillMissingWithAverage('Latitude', dataframe)
    fillMissingWithAverage('Longitude', dataframe)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._update_inplace(new_data)
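
The warning above appears because each dataframe in `uniqueDistrictsList` is a slice of `dataset`, so the fills land in the subsets rather than in `dataset` itself. One alternative, shown here as a sketch on made-up values and not as what this notebook ran, is to fill the original dataframe directly with a per-district mean via `groupby(...).transform('mean')`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical coordinates): one missing latitude per district
dfToy = pd.DataFrame({
    'District': [4.0, 4.0, 4.0, 9.0, 9.0],
    'Latitude': [41.70, 41.80, np.nan, 41.90, np.nan],
})

# transform('mean') returns, for every row, the mean latitude of its district
districtMean = dfToy.groupby('District')['Latitude'].transform('mean')

# Only the nan entries are replaced; existing values are left untouched
dfToy['Latitude'] = dfToy['Latitude'].fillna(districtMean)
print(dfToy)
```

The same two lines would be repeated for 'Longitude'.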


### Part II: Filling in Districts with Missing Values, but nonempty longitude/latitude data

In [90]:
avgCoordsList = []
for dataFrame in  uniqueDistrictsList:
    coords = []
    lat = dataFrame['Latitude'].sum()/dataFrame['Latitude'].describe()['count']
    long = dataFrame['Longitude'].sum()/dataFrame['Longitude'].describe()['count']
    coords.append(lat)
    coords.append(long)
    coords.append(str(dataFrame['District'].unique()[0]))
    avgCoordsList.append(coords)
    print(coords)

[41.734106465357918, -87.563621465897356, '4.0']
[41.814834324982158, -87.665279593948156, '9.0']
[41.915619987842391, -87.694019153700154, '14.0']
[41.919054267813586, -87.752178204964579, '25.0']
[41.708615973820699, -87.658508927398941, '22.0']
[41.853446584131227, -87.712625046552503, '10.0']
[41.687693669783748, -87.62293524998104, '5.0']
[41.902862496469353, -87.636098052120047, '18.0']
[41.882593721342076, -87.71915927439089, '11.0']
[41.978504465170332, -87.671868012214816, '20.0']
[41.77828299100242, -87.715424268281978, '8.0']
[41.775835445561057, -87.653798495789147, '7.0']
[41.871889066502099, -87.62891553227513, '1.0']
[41.964588567605986, -87.796818568950187, '16.0']
[41.886361521048535, -87.758102961052728, '15.0']
[41.770891499332414, -87.597182482763628, '3.0']
[41.745504921274303, -87.63270767909394, '6.0']
[41.810943162372638, -87.613066009139018, '2.0']
[41.947739249024593, -87.660401833179506, '19.0']
[41.880340460211137, -87.672350340257438, '12.0']
[42.0056650132
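
The same per-district averages can be sketched in a single call with `groupby`. The dataframe below is a made-up stand-in (`toy` and `centroids` are hypothetical names); the result is a table indexed by district rather than a list of lists.

```python
import pandas as pd

# Toy stand-in for the crime data; the values are illustrative only.
toy = pd.DataFrame({
    "District": [4.0, 4.0, 9.0, 9.0],
    "Latitude": [41.70, 41.80, 41.80, 41.82],
    "Longitude": [-87.50, -87.60, -87.66, -87.68],
})

# One row per district, holding the mean latitude and longitude of its observations
centroids = toy.groupby("District")[["Latitude", "Longitude"]].mean()
print(centroids)
```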

In [76]:
nullDistricts = dataset.loc[dataset['District'].isnull() ]

In [81]:
nullDistricts['Latitude'].isnull().sum()
nullDistricts['Longitude'].isnull().sum()

0

In [82]:
nullDistricts

Unnamed: 0.1,Unnamed: 0,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Year,Latitude,Longitude
1802296,1802296,3575885,2004-06-26 10:00:00,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,RESTAURANT,False,False,,11,2004.0,41.892164,-87.607702
1817065,1817065,3596991,2004-10-14 15:41:00,0610,BURGLARY,FORCIBLE ENTRY,CONSTRUCTION SITE,False,False,,05,2004.0,41.886323,-87.610023
1969393,1969393,4740376,2006-05-10 19:40:00,0460,BATTERY,SIMPLE,OTHER,False,False,,08B,2006.0,41.884107,-87.610757
2503109,2503109,4740376,2006-05-10 19:40:00,0460,BATTERY,SIMPLE,OTHER,False,False,,08B,2006.0,41.884107,-87.610757
3242555,3242555,6405961,2007-11-16 00:01:00,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,RESIDENCE,False,True,,11,2007.0,41.709756,-87.651424
3242582,3242582,6420740,2007-01-07 05:00:00,1140,DECEPTIVE PRACTICE,EMBEZZLEMENT,OTHER,False,False,,12,2007.0,41.838968,-87.665779
3636356,3636356,4437079,2005-11-10 10:10:00,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,GOVERNMENT BUILDING/PROPERTY,False,True,,26,2005.0,41.892240,-87.603173
4031834,4031834,6376239,2008-07-18 20:00:00,1310,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,,14,2008.0,41.899573,-87.723769
4058377,4058377,6420058,2008-07-13 18:08:55,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,True,False,,18,2008.0,41.699909,-87.620564
4077370,4077370,6451103,2008-08-23 21:23:00,1563,SEX OFFENSE,CRIMINAL SEXUAL ABUSE,SIDEWALK,False,False,,17,2008.0,41.944632,-87.659105


##### Reverse Engineering:
* Use the average latitude and longitude of each district to predict the district of each observation in nullDistricts
* Assign the district whose average coordinates are closest by Euclidean distance: sqrt((x2-x1)^2 + (y2-y1)^2)

In [86]:
import math
def euclideanDistance(x1, x2, y1, y2):
    # Straight-line distance between the points (x1, y1) and (x2, y2)
    return math.sqrt((x2 - x1)**2 + (y2 - y1)**2)

In [117]:
for observation in nullDistricts.index:
    origLat = nullDistricts.loc[observation, 'Latitude']
    origLong = nullDistricts.loc[observation, 'Longitude']
    # Distance from this observation to every district's average coordinates
    distanceList = []
    for coord in avgCoordsList:
        avgLat = coord[0]
        avgLong = coord[1]
        distanceList.append(euclideanDistance(origLat, avgLat, origLong, avgLong))

    # The closest district average determines the predicted district
    shortestDistance = min(distanceList)
    shortestDistanceIndex = distanceList.index(shortestDistance)
    district = float(avgCoordsList[shortestDistanceIndex][2])

    # Use .loc rather than chained indexing so the assignment targets nullDistricts directly
    nullDistricts.loc[observation, 'District'] = district
#     print("this is the distance list: ")
#     print(distanceList)
#     print("\n")
#     print("this is the shortest distance:" )
#     print(shortestDistance)
#     print("\n")
#     print("this is the index: ")
#     print(shortestDistanceIndex)
#     print("\n")  
#     print("this is the district:")
#     print(district)
#     print("******************************\n")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  exec(code_obj, self.user_global_ns, self.user_ns)


In [119]:
nullDistricts['District'].isnull().sum() #check if districts have been successfully filled

0
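
The nearest-average assignment above can also be sketched in vectorized form with NumPy broadcasting. This is a minimal sketch, not the notebook's method: `nearest_district`, `centroids`, and `points` are hypothetical names, the three district averages are rounded copies of values printed earlier, and the two points are coordinates from the nullDistricts rows shown above. Because only three districts are included here, the assignments hold for this subset only.

```python
import numpy as np
import pandas as pd

# Three district averages, rounded from the printed avgCoordsList output (subset only)
centroids = pd.DataFrame(
    {"Latitude": [41.687694, 41.902862, 41.708616],
     "Longitude": [-87.622935, -87.636098, -87.658509]},
    index=pd.Index([5.0, 18.0, 22.0], name="District"),
)

# Coordinates of two observations that have no district
points = pd.DataFrame(
    {"Latitude": [41.709756, 41.699909],
     "Longitude": [-87.651424, -87.620564]},
    index=[3242555, 4058377],
)

def nearest_district(points, centroids):
    """Return, for each point, the label of the centroid with the smallest Euclidean distance."""
    cols = ["Latitude", "Longitude"]
    # Shape (n_points, n_centroids, 2): coordinate differences for every point/centroid pair
    diff = points[cols].to_numpy()[:, None, :] - centroids[cols].to_numpy()[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    labels = centroids.index.to_numpy()[dist.argmin(axis=1)]
    return pd.Series(labels, index=points.index, name="District")

predicted = nearest_district(points, centroids)
print(predicted)
```

Treating degrees as planar coordinates is an approximation: at Chicago's latitude a degree of longitude covers roughly three quarters of the distance of a degree of latitude (cos 41.85° ≈ 0.745), so scaling the longitude difference by that factor before measuring distance could change assignments near district boundaries.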

In [124]:
# Build a new list so that uniqueDistrictsList itself is not modified
allDistricts = uniqueDistrictsList + [nullDistricts]

Resetting 'dataset' to the new dataset with no missing Longitude, Latitude, or District values

In [131]:
dataset = pd.concat(allDistricts, axis = 0)

In [135]:
dataset.sort_index()

Unnamed: 0.1,Unnamed: 0,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Year,Latitude,Longitude
0,0,4786321,2004-01-01 00:01:00,0840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,4.0,06,2004.0,41.734106,-87.563621
1,1,4676906,2003-03-01 00:00:00,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,9.0,26,2003.0,41.817229,-87.637328
2,2,4789749,2004-06-20 11:00:00,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,14.0,20,2004.0,41.915620,-87.694019
3,3,4789765,2004-12-30 20:00:00,0840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,25.0,06,2004.0,41.919054,-87.752178
4,4,4677901,2003-05-01 01:00:00,0841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,22.0,06,2003.0,41.691785,-87.635116
5,5,4838048,2004-08-01 00:01:00,0841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,APARTMENT,False,False,10.0,06,2004.0,41.853447,-87.712625
6,6,4791194,2001-01-01 11:00:00,0266,CRIM SEXUAL ASSAULT,PREDATORY,RESIDENCE,True,True,5.0,02,2001.0,41.687020,-87.608445
7,7,4679521,2003-03-15 00:00:00,5007,OTHER OFFENSE,OTHER WEAPONS VIOLATION,RESIDENCE PORCH/HALLWAY,False,False,22.0,26,2003.0,41.729712,-87.653159
8,8,4792195,2004-09-16 10:00:00,0890,THEFT,FROM BUILDING,RESIDENCE,False,False,18.0,06,2004.0,41.902862,-87.636098
9,9,4680124,2003-01-01 00:00:00,0840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,11.0,06,2003.0,41.869772,-87.708180


In [136]:
dataset.describe()

Unnamed: 0.1,Unnamed: 0,ID,District,Year,Latitude,Longitude
count,7941281.0,7941281.0,7941281.0,7941281.0,7941281.0,7941281.0
mean,3970640.0,5926071.0,11.31216,2007.672,41.8415,-87.67201
std,2292451.0,2568290.0,6.94453,4.064012,0.0913501,0.06319202
min,0.0,634.0,1.0,2001.0,36.61945,-91.68657
25%,1985320.0,3853210.0,6.0,2005.0,41.76862,-87.71409
50%,3970640.0,6165079.0,10.0,2008.0,41.85387,-87.66638
75%,5955960.0,7716590.0,17.0,2010.0,41.90697,-87.62858
max,7941280.0,10827880.0,31.0,2017.0,42.02291,-87.52453


In [139]:
dataset.isnull().sum()

Unnamed: 0                 0
ID                         0
Date                       0
IUCR                       0
Primary Type               0
Description                0
Location Description    1990
Arrest                     0
Domestic                   0
District                   0
FBI Code                   0
Year                       0
Latitude                   0
Longitude                  0
dtype: int64

## Filling in Missing Location Description Values 
* The summary shows that 1,990 observations are missing a 'Location Description' value
* We fill these with 'UNKNOWN': this marks the information as unavailable without deleting the observations and the rest of their data

In [140]:
locDescriptionDf = dataset.loc[dataset['Location Description'].isnull()] 

In [142]:
locDescriptionDf

Unnamed: 0.1,Unnamed: 0,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Year,Latitude,Longitude
518707,518707,1959008,2002-01-26 16:25:00,2820,OTHER OFFENSE,TELEPHONE THREAT,,False,False,4.0,26,2002.0,41.736054,-87.583335
770238,770238,2279950,2002-08-07 04:00:00,0261,CRIM SEXUAL ASSAULT,AGGRAVATED: HANDGUN,,True,False,4.0,02,2002.0,41.744217,-87.578598
1923399,1923399,10645212,2004-08-17 10:05:00,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,4.0,11,2004.0,41.734106,-87.563621
3797266,3797266,10530814,2011-12-01 11:00:00,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,4.0,11,2011.0,41.734106,-87.563621
3809308,3809308,8281441,2011-09-25 08:50:00,0890,THEFT,FROM BUILDING,,False,False,4.0,06,2011.0,41.734106,-87.563621
3810892,3810892,9862543,2011-11-18 18:00:00,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,4.0,11,2011.0,41.734106,-87.563621
4956803,4956803,8281441,2011-09-25 08:50:00,0890,THEFT,FROM BUILDING,,False,False,4.0,06,2011.0,41.734106,-87.563621
4958387,4958387,9862543,2011-11-18 18:00:00,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,4.0,11,2011.0,41.734106,-87.563621
6254420,6254420,8076790,2011-05-19 21:00:00,0820,THEFT,$500 AND UNDER,,False,False,4.0,06,2011.0,41.701996,-87.530450
6483015,6483015,10428831,2011-04-01 21:00:00,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,4.0,11,2011.0,41.734106,-87.563621


There are no unique identifiers that could help us determine what the data in 'Location Description' should be

In [148]:
dataset['Location Description'].unique()

array(['RESIDENCE', 'BANK', 'APARTMENT', 'STREET',
       'CHURCH/SYNAGOGUE/PLACE OF WORSHIP', 'COMMERCIAL / BUSINESS OFFICE',
       'SMALL RETAIL STORE', 'HOSPITAL BUILDING/GROUNDS', 'OTHER',
       'CURRENCY EXCHANGE', 'RESIDENTIAL YARD (FRONT/BACK)', 'RESTAURANT',
       'PORCH', 'VESTIBULE', 'YARD', 'AUTO', 'VACANT LOT', 'PARKING LOT',
       'PARKING LOT/GARAGE(NON.RESID.)', 'GROCERY FOOD STORE',
       'DEPARTMENT STORE', 'TAVERN/LIQUOR STORE',
       'SCHOOL, PUBLIC, BUILDING', 'DRUG STORE', 'RESIDENCE-GARAGE',
       'VEHICLE NON-COMMERCIAL', 'CTA BUS',
       'POLICE FACILITY/VEH PARKING LOT', 'CHA PARKING LOT/GROUNDS',
       'CHA APARTMENT', 'SCHOOL, PUBLIC, GROUNDS',
       'GOVERNMENT BUILDING/PROPERTY', 'SIDEWALK', 'VACANT LOT/LAND',
       'CONSTRUCTION SITE', 'FACTORY/MANUFACTURING BUILDING',
       'BOAT/WATERCRAFT', 'LAKEFRONT/WATERFRONT/RIVERBANK', 'ALLEY',
       'BASEMENT', 'GANGWAY', 'BOWLING ALLEY',
       'OTHER COMMERCIAL TRANSPORTATION', 'RESIDENCE PORCH/HALL

There is no existing label for unknown locations, so we create one, 'UNKNOWN', and fill the missing values with it

In [149]:
dataset['Location Description'] = dataset['Location Description'].fillna('UNKNOWN')

In [150]:
dataset.isnull().sum()

Unnamed: 0              0
ID                      0
Date                    0
IUCR                    0
Primary Type            0
Description             0
Location Description    0
Arrest                  0
Domestic                0
District                0
FBI Code                0
Year                    0
Latitude                0
Longitude               0
dtype: int64

## Export New Dataset with No Missing Values 

In [160]:
dataset = dataset.sort_index()

In [162]:
dataset.to_csv("../../crimesInChicagoData/dataset.csv")

In [163]:
cleanData = pd.read_csv("../../crimesInChicagoData/dataset.csv")

In [164]:
cleanData

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,ID,Date,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,District,FBI Code,Year,Latitude,Longitude
0,0,0,4786321,2004-01-01 00:01:00,0840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,4.0,06,2004.0,41.734106,-87.563621
1,1,1,4676906,2003-03-01 00:00:00,2825,OTHER OFFENSE,HARASSMENT BY TELEPHONE,RESIDENCE,False,True,9.0,26,2003.0,41.817229,-87.637328
2,2,2,4789749,2004-06-20 11:00:00,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,14.0,20,2004.0,41.915620,-87.694019
3,3,3,4789765,2004-12-30 20:00:00,0840,THEFT,FINANCIAL ID THEFT: OVER $300,OTHER,False,False,25.0,06,2004.0,41.919054,-87.752178
4,4,4,4677901,2003-05-01 01:00:00,0841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,RESIDENCE,False,False,22.0,06,2003.0,41.691785,-87.635116
5,5,5,4838048,2004-08-01 00:01:00,0841,THEFT,FINANCIAL ID THEFT:$300 &UNDER,APARTMENT,False,False,10.0,06,2004.0,41.853447,-87.712625
6,6,6,4791194,2001-01-01 11:00:00,0266,CRIM SEXUAL ASSAULT,PREDATORY,RESIDENCE,True,True,5.0,02,2001.0,41.687020,-87.608445
7,7,7,4679521,2003-03-15 00:00:00,5007,OTHER OFFENSE,OTHER WEAPONS VIOLATION,RESIDENCE PORCH/HALLWAY,False,False,22.0,26,2003.0,41.729712,-87.653159
8,8,8,4792195,2004-09-16 10:00:00,0890,THEFT,FROM BUILDING,RESIDENCE,False,False,18.0,06,2004.0,41.902862,-87.636098
9,9,9,4680124,2003-01-01 00:00:00,0840,THEFT,FINANCIAL ID THEFT: OVER $300,RESIDENCE,False,False,11.0,06,2003.0,41.869772,-87.708180
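
The extra 'Unnamed: 0', 'Unnamed: 0.1', and 'Unnamed: 0.2' columns in the table above appear because each `to_csv` call writes the index as an unnamed column and each `read_csv` call reads it back as data. A minimal sketch of one way to avoid this, using an in-memory buffer and a made-up two-row dataframe (`toy` is a hypothetical name) in place of the real file:

```python
import io
import pandas as pd

# Toy stand-in for the cleaned dataset; the values are illustrative only.
toy = pd.DataFrame({"ID": [4786321, 4676906], "District": [4.0, 9.0]})

# Default behaviour: the index is written out and comes back as an 'Unnamed: 0' column
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
with_index = pd.read_csv(buf)

# Writing with index=False keeps the round trip free of the extra column
buf = io.StringIO()
toy.to_csv(buf, index=False)
buf.seek(0)
without_index = pd.read_csv(buf)

print(list(with_index.columns))
print(list(without_index.columns))
```

If the index does carry meaning and should be kept, reading with `pd.read_csv(..., index_col=0)` restores it as the index instead of as a data column.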
