# Data Cleaning

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

### Dataset Description
There are 22 total attributes in the original dataset. Note that each record describes a single crime, regardless of when it was committed, so multiple rows can refer to the same incident.

* **ID** Unique integer identifier for the crime record.
* **Case Number** The Chicago Police Department RD Number (Records Division Number), which is unique to each incident.
* **Date** Date when the incident occurred.
* **Block** The partially redacted address where the incident occurred, placing it on the same block as the actual address.
* **IUCR** The Illinois Uniform Crime Reporting code. This is directly linked to the Primary Type and Description.
* **Primary Type** The primary description of the IUCR code.
* **Description** The secondary description of the IUCR code, a subcategory of the primary description.
* **Location Description** Description of the location where the incident occurred.
* **Arrest** Indicates whether an arrest was made.
* **Domestic** Indicates whether the incident was domestic-related as defined by the Illinois Domestic Violence Act.
* **Beat** Indicates the beat where the incident occurred. A beat is the smallest police geographic area – each beat has a dedicated police beat car. Three to five beats make up a police sector, and three sectors make up a police district. The Chicago Police Department has 22 police districts.
* **District** Indicates the police district where the incident occurred.
* **Ward** The ward (City Council district) where the incident occurred.
* **Community Area** Indicates the community area where the incident occurred. Chicago has 77 community areas.
* **FBI Code** Indicates the crime classification as outlined in the FBI's National Incident-Based Reporting System (NIBRS).
* **X Coordinate** The x coordinate of the location where the incident occurred in State Plane Illinois East NAD 1983 projection. This location is shifted from the actual location for partial redaction but falls on the same block.
* **Y Coordinate** The y coordinate of the location where the incident occurred. Partially redacted.
* **Year** Year when the incident occurred.
* **Updated On** Date and time the record was last updated.
* **Latitude** The latitude of the location where the incident occurred. Partially redacted.
* **Longitude** The longitude of the location where the incident occurred. Partially redacted.
* **Location** The combination of latitude and longitude.

*Descriptions of attributes taken from 
[data.cityofchicago.org](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present-Dashboard/5cd6-ry5g)*

### Parsing the CSV
**Datetimes:** The *Date* and *Updated On* attributes both follow the same U.S. "month/day/year hour:minute:second AM/PM" date and time format, which uses a 12-hour clock. These columns will be converted into datetime objects as part of initial parsing.
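Because the timestamps use a 12-hour clock with an AM/PM marker, the hour has to be parsed with `%I`; with `%H` the marker is matched but does not shift the hour. The short check below illustrates this with the standard library. The sample string is made up for illustration and is not taken from the dataset.

```python
from datetime import datetime

# Hypothetical timestamp in the "month/day/year hour:minute:second AM/PM" layout.
sample = "03/18/2015 07:44:00 PM"

# %I reads the hour on a 12-hour clock, so %p can move it into the afternoon.
parsed = datetime.strptime(sample, "%m/%d/%Y %I:%M:%S %p")

# %H reads the hour on a 24-hour clock; %p is consumed but has no effect.
wrong = datetime.strptime(sample, "%m/%d/%Y %H:%M:%S %p")

print(parsed.hour, wrong.hour)
```

With `%H`, every afternoon and evening incident would be silently recorded twelve hours early, which is why the format string is worth verifying before loading the full file.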

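**Redundancy:** The *IUCR, Year, District, and Location* attributes all repeat information available in other columns (*IUCR* maps to *Primary Type* and *Description*, *Year* is contained in *Date*, *District* follows from *Beat*, and *Location* combines *Latitude* and *Longitude*) and are therefore ignored during initial parsing.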
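The redundancy claim can be spot-checked before the columns are dropped. The sketch below is only an illustration: the two rows are made up rather than taken from the dataset, and it assumes that *Location* is stored as a "(latitude, longitude)" string. It checks that *Year* can be recomputed from *Date* and that *Location* can be rebuilt from *Latitude* and *Longitude*.

```python
import io
import pandas as pd

# Two hypothetical rows for illustration only; values are not from the dataset.
sample_csv = io.StringIO(
    "Date,Year,Latitude,Longitude,Location\n"
    '03/18/2015 07:44:00 PM,2015,41.8914,-87.7444,"(41.8914, -87.7444)"\n'
    '01/02/2016 09:15:00 AM,2016,41.7735,-87.6653,"(41.7735, -87.6653)"\n'
)

# Read the coordinates as text so the string comparison below is exact.
sample = pd.read_csv(sample_csv, dtype={"Latitude": str, "Longitude": str})

# Year should equal the year component of Date.
dates = pd.to_datetime(sample["Date"], format="%m/%d/%Y %I:%M:%S %p")
year_is_redundant = bool((dates.dt.year == sample["Year"]).all())

# Location should equal the pair built from Latitude and Longitude
# (assumes the "(lat, lon)" string layout).
rebuilt = "(" + sample["Latitude"] + ", " + sample["Longitude"] + ")"
location_is_redundant = bool((rebuilt == sample["Location"]).all())

print(year_is_redundant, location_is_redundant)
```

Running the same two comparisons on the real file would show whether any rows break the assumption before the columns are discarded.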

**Locations:** To protect the individuals involved, the location data of incidents is partially redacted and is accurate only to the block. Since every location field carries the same block-level information, only one representation is needed, so the *Block, X Coordinate, and Y Coordinate* columns are ignored during initial parsing. *Latitude* and *Longitude* are chosen over *Block* because numerical values are easier to parse than arbitrary address strings.

In [2]:
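# 12-hour timestamps with an AM/PM marker, e.g. "03/18/2015 07:44:00 PM".
# %I (not %H) is required for %p to affect the parsed hour.
date_format = "%m/%d/%Y %I:%M:%S %p"

# Column name -> dtype for every non-date column that is kept
types = {"ID":"uint64", "Case Number":"object", "Primary Type":"category","Description":"category",
         "Location Description":"category", "Arrest":"bool", "Domestic":"bool", "Beat":"category", "Ward":"category",
         "Community Area":"category", "FBI Code":"category", "Latitude":"float", "Longitude":"float"}

date_columns = ["Date", "Updated On"]
columns = list(types.keys()) + date_columns

# Load only the selected columns; all other columns are skipped
dfCrime = pd.read_csv(
    "Samples/RandomCrime.csv",
    usecols=columns,
    dtype=types
)

# Convert the date columns right after loading. pd.datetime and the
# date_parser argument are deprecated in newer pandas releases, so
# pd.to_datetime with an explicit format is used instead.
for col in date_columns:
    dfCrime[col] = pd.to_datetime(dfCrime[col], format=date_format)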

### Null Values

### Type Checking