<div style="float:left; font-weight:bold">Peter Nagy - 100672801</div>
<div style="float:right">CSCI 2000U - Final Project</div>

# Chicago crimes 2001-2018
Crimes that occurred in the City of Chicago from 2001, up until mid Nov. 2018

## Description
> This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 up until approximately half of November 2018.
> 
> Source: [Chicago Data Portal](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2)

## Analysis

### Imports

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as mp
from IPython.display import Markdown
%matplotlib inline

### Importing data with specific columns of interest

Using the "read_csv" function from the pandas library with the "usecols" parameter to load only the columns of interest (in this run, "Primary Type" and "Description")

In [2]:
data = pd.read_csv("city_of_chicago_crimes_2001_to_present.csv", usecols=["Primary Type", "Description"])
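
As a minimal sketch of how "usecols" behaves, the snippet below reads a tiny in-memory CSV instead of the real file. The sample rows, the `wanted` list and the `missing` check are illustrative assumptions, not part of the original notebook. It shows that any column left out of "usecols" is simply not loaded, so later cells cannot group by it.

```python
import io
import pandas as pd

# Hypothetical miniature of the CSV; the real file has far more rows and columns.
csv_text = """ID,Primary Type,Description,Arrest
1,BATTERY,SIMPLE,False
2,ROBBERY,ARMED: HANDGUN,True
3,THEFT,$500 AND UNDER,False
"""

# Only the columns listed here are parsed and kept.
wanted = ["Primary Type", "Description", "Arrest"]
sample = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# Check up front that every column needed later was actually loaded.
needed_later = ["Primary Type", "Community Area"]
missing = [c for c in needed_later if c not in sample.columns]

print(sample.shape)
print(missing)
```

Running a check like this right after loading makes a missing column visible immediately, rather than as a `KeyError` several cells later.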

### Data preview

Previewing the first 5 rows of the data using the "head" method

In [3]:
data.head()

| | Primary Type | Description |
|---|---|---|
| 0 | BATTERY | AGGRAVATED: HANDGUN |
| 1 | OTHER OFFENSE | PAROLE VIOLATION |
| 2 | BATTERY | DOMESTIC BATTERY SIMPLE |
| 3 | BATTERY | SIMPLE |
| 4 | ROBBERY | ARMED: HANDGUN |


### Data volume
#### Printing number of rows and columns of dataset

Using the "shape" attribute of the data to view the dataset size (rows, columns)

In [4]:
data.shape

(6747040, 2)

As loaded here, the dataset contains **6,747,040 rows** and **2 columns**

### Data quality

The data source lists crimes that occurred in Chicago between 2001 and 2018.
The crimes are described and located by the block, by a general location (street, apartment, sidewalk, ...),
by the Chicago district number and by the Chicago community area number.
Every crime is labeled with a case number, which makes it easy to look up further information.
Finally, every record states whether the suspect was arrested and whether it was a domestic crime.

It's relevant to data science because it lets us draw many conclusions about crime in Chicago,
and possibly about crime in general. For example: which areas of Chicago are the safest and
which are the riskiest, or which kinds of location (street, sidewalk, ...) people should avoid
to remain safe.

### Experiment
First let's put the data into a dataframe in order to process it more easily

Using the "DataFrame" constructor from the pandas library to wrap the data in a dataframe (since "read_csv" already returns a DataFrame, this step is optional)

In [5]:
dataF = pd.DataFrame(data)

In [51]:
a = dataF.groupby(["Primary Type"])[["Primary Type", "Description"]]

In [73]:
b=dataF.loc[dataF["Primary Type"] == "SEX OFFENSE"]

In [74]:
b.groupby(["Description"])["Description"].count()

Description
ADULTRY                              6
AGG CRIMINAL SEXUAL ABUSE         5544
ATT AGG CRIM SEXUAL ABUSE            6
ATT AGG CRIMINAL SEXUAL ABUSE      155
ATT CRIM SEXUAL ABUSE              851
BIGAMY                              28
CRIMINAL SEXUAL ABUSE             9051
CRIMINAL TRANSMISSION OF HIV       124
FORNICATION                         13
INDECENT SOLICITATION/ADULT         74
INDECENT SOLICITATION/CHILD        630
MARRYING A BIGAMIST                  2
OTHER                              604
PUBLIC INDECENCY                  7269
SEX RELATION IN FAMILY              47
SEXUAL EXPLOITATION OF A CHILD     681
Name: Description, dtype: int64
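
The filter-then-count pattern used above can be sketched on a toy frame. The rows below are made up for illustration, and `value_counts` is shown only as an equivalent alternative, not as what the notebook ran.

```python
import pandas as pd

# Made-up rows standing in for the real dataset.
toy = pd.DataFrame({
    "Primary Type": ["SEX OFFENSE", "SEX OFFENSE", "SEX OFFENSE", "THEFT"],
    "Description": ["PUBLIC INDECENCY", "BIGAMY", "PUBLIC INDECENCY", "OVER $500"],
})

# Keep only one primary type, then count each description within it.
subset = toy.loc[toy["Primary Type"] == "SEX OFFENSE"]
by_groupby = subset.groupby("Description")["Description"].count()

# value_counts gives the same numbers; sort_index puts them in the same order.
by_value_counts = subset["Description"].value_counts().sort_index()

print(by_groupby)
print(by_value_counts)
```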

In [46]:
"OFFENSE INVOLVING CHILDREN"

'OFFENSE INVOLVING CHILDREN'

In [50]:
dataF.groupby(["Primary Type"])[["Primary Type", "Description"]].count()

| Primary Type | Primary Type | Description |
|---|---|---|
| ARSON | 11152 | 11152 |
| ASSAULT | 418474 | 418474 |
| BATTERY | 1232273 | 1232273 |
| BURGLARY | 387990 | 387990 |
| CONCEALED CARRY LICENSE VIOLATION | 284 | 284 |
| CRIM SEXUAL ASSAULT | 27081 | 27081 |
| CRIMINAL DAMAGE | 771497 | 771497 |
| CRIMINAL TRESPASS | 193385 | 193385 |
| DECEPTIVE PRACTICE | 262370 | 262370 |
| DOMESTIC VIOLENCE | 1 | 1 |


#### Community area crime analysis

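Using the "groupby" method on the dataframe groups the records by community area; we then take the "Community Area" column and count how many records are in each group using the "count" method. This requires "Community Area" to be listed in "usecols" when the data is loaded; the `KeyError` below occurs because this run loaded only "Primary Type" and "Description".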

In [6]:
comAreaCount = dataF.groupby(["Community Area"])["Community Area"].count()

KeyError: 'Community Area'

Now let's find the community area with the most crimes

Using the "idxmax" method on the community area counts returns the index label of the maximum, in other words the community area number that appears most often in our dataset

In [None]:
comAreaCount.idxmax()

Community area **#25** is _**Austin**_, a community area of Chicago

And the number of crimes reported there over the past 18 years

Using the "max" method on the community area counts returns the number of crimes in that community area

In [None]:
comAreaCount.max()
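
To make the relationship between "idxmax" and "max" concrete, here is a small sketch on invented community area numbers (the values are illustrative, not taken from the dataset).

```python
import pandas as pd

# Invented community area numbers, one per reported crime.
toy = pd.DataFrame({"Community Area": [25, 8, 25, 43, 25, 8]})

# Count the records in each community area.
counts = toy.groupby("Community Area")["Community Area"].count()

busiest_area = counts.idxmax()   # index label (area number) of the largest count
busiest_count = counts.max()     # the largest count itself

print(busiest_area, busiest_count)
```

"idxmax" answers "which area?" while "max" answers "how many?", so the two are usually read together.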

Now let's plot the crime count of each community area

Using the "plot.bar" method on the community area counts produces a bar chart of the community areas and their crime counts. We also pass the "figsize" argument, setting the width to 20 and the height to 5, to enlarge the plot so that all the community areas are displayed properly.

In [None]:
comAreaCount.plot.bar(figsize=(20,5));

This graph shows quite a large difference between the worst community area and the rest, which hints that community area #25 is an extreme outlier.

Let's see what the data looks like when sorted

Calling the "sort_values" method before plotting orders the community areas by their crime count

In [None]:
comAreaCount.sort_values().plot.bar(figsize=(20,5));

##### Conclusion
Now we can observe that there are many community areas where crime is less present, and only one where the crime count is this high: **Austin**.

#### Crime location description analysis

Using the "groupby" method again, this time on "Location Description", groups the records by crime location

In [None]:
crimLocCount = dataF.groupby(["Location Description"])["Location Description"].count()

The most common crime location

In [None]:
crimLocCount.idxmax()

With the total number of crimes there

In [None]:
crimLocCount.max()

Plotting the top 10 crime locations

After sorting the crime location counts with the "sort_values" method, we plot the last 10 entries (the 10 highest counts), since the remaining values are close to negligible compared to the top 10

In [None]:
crimLocCount.sort_values()[-10:].plot.bar();

Let's also calculate what share of all crimes the street, residence, apartment and sidewalk crimes represent.

In [None]:
crimLocCount.sort_values()[-4:].sum() / crimLocCount.sum()
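
The share computed above is the sum of the 4 largest counts divided by the total. A minimal sketch with invented counts follows; `nlargest` is shown as an equivalent way to pick the top entries and is not what the notebook used.

```python
import pandas as pd

# Invented counts per location; they sum to 100 to keep the arithmetic obvious.
counts = pd.Series({
    "STREET": 40, "RESIDENCE": 25, "APARTMENT": 15,
    "SIDEWALK": 10, "OTHER": 6, "ALLEY": 4,
})

# Same approach as the notebook: sort ascending, take the last 4.
share_sorted = counts.sort_values()[-4:].sum() / counts.sum()

# Equivalent: ask for the 4 largest values directly.
share_nlargest = counts.nlargest(4).sum() / counts.sum()

print(share_sorted, share_nlargest)
```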

##### Conclusion
We conclude that **the street, the residence, the apartment and the sidewalk** are where most crimes happen: **63.4%** of all crimes are committed at these 4 locations.

#### Crime types

Using the "groupby" method again, this time on "Primary Type", groups the records by crime type

In [None]:
crimTypeCount = dataF.groupby(["Primary Type"])["Primary Type"].count()

The most common crime type

In [None]:
crimTypeCount.idxmax()

With the total number of crimes of that type

In [None]:
crimTypeCount.max()

Plotting the crime types

In [None]:
crimTypeCount.sort_values().plot.bar(figsize=(15,5));

Calculating the proportion of the top 4 crime types over all crimes

In [None]:
crimTypeCount.sort_values()[-4:].sum() / crimTypeCount.sum()

##### Conclusion
Theft, battery, criminal damage and narcotics are the most common crime types; together they make up 61.3% of all crimes

Now let's see which crime is most commonly committed in each of the top 10 locations

#### Crime location vs Crime committed

Using the "groupby" method again, this time on both "Location Description" and "Primary Type", groups the records by each pair of location and crime type, ordered by "Location Description", and counts them

In [None]:
crimLocTypeCount = dataF.groupby(["Location Description", "Primary Type"])["Location Description"].count()

First we create a new DataFrame named "crimLocType" with 2 columns ("Crime Location" and "Crime Type").
Then we set the index to be "Crime Location".
For every location in the top 10 locations, we get the crime type most often committed there and add it to the "crimLocType" dataframe.
Finally we output the dataframe created.

In [None]:
crimLocType = pd.DataFrame(columns=["Crime Location", "Crime Type"])
crimLocType = crimLocType.set_index(["Crime Location"])
for i in crimLocCount.sort_values()[-10:].index:
    crimLocType.loc[i] = [crimLocTypeCount[i].idxmax()]
    
crimLocType
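
The same "most common type per location" lookup can also be sketched without an explicit loop. The toy rows below are invented, and the `unstack` plus `idxmax(axis=1)` approach is an alternative offered for illustration, not the notebook's method.

```python
import pandas as pd

# Invented records: a location and a crime type for each.
toy = pd.DataFrame({
    "Location Description": ["STREET", "STREET", "STREET",
                             "RESIDENCE", "RESIDENCE", "RESIDENCE"],
    "Primary Type": ["THEFT", "THEFT", "BATTERY",
                     "BATTERY", "BATTERY", "THEFT"],
})

# Count every (location, type) pair.
pair_counts = toy.groupby(["Location Description", "Primary Type"]).size()

# Pivot types into columns (missing pairs become 0), then pick the
# column with the highest count in each row.
table = pair_counts.unstack(fill_value=0)
top_type = table.idxmax(axis=1)

print(top_type)
```

On the full dataset this would cover every location at once; restricting it to the top 10 locations would still require selecting those rows first.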

And let's see the top 10 most common combinations of location and crime type

In [None]:
crimLocTypeCount.sort_values()[-10:].plot.bar();

##### Conclusion
The highest risk locations involve **theft, battery, burglary and narcotics**

#### Arrest analysis

Let's now analyze the proportion of arrests, and then combine it with the crime type to see how the two relate

In [None]:
crimArrestCount = dataF.groupby(["Arrest"])["Arrest"].count()

In [None]:
crimArrestCount

In [None]:
crimArrestCount[True]/crimArrestCount.sum()
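
Because "Arrest" is a boolean column, the proportion of arrests can also be read off directly. This is a small sketch on invented values, assuming the column holds `True`/`False` as in the source data.

```python
import pandas as pd

# Invented arrest flags: 2 arrests out of 5 records.
toy = pd.DataFrame({"Arrest": [True, False, False, True, False]})

# True counts as 1 and False as 0, so the sum is the number of arrests...
arrests = int(toy["Arrest"].sum())
share_from_sum = arrests / len(toy)

# ...and the mean is the proportion of arrests.
share_from_mean = toy["Arrest"].mean()

print(arrests, share_from_sum, share_from_mean)
```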

##### Conclusion
The overall proportion of crimes that led to an arrest is about **27.7%**, which is quite low

#### Arrest vs Crime type analysis

Let's calculate the crime types with the maximum proportion of successful arrests and the crime types with the minimum proportion of successful arrests

First let's count the records for each combination of crime type and arrest outcome

In [None]:
crimArrestTypeCount = dataF.groupby(["Primary Type", "Arrest"])["Arrest"].count()

Next, let's create a dataframe named "crimArrestType" with the proportion of successful arrests of each crime type

In [None]:
crimArrestType = pd.DataFrame(columns=["Crime type", "Crime Proportion of successful arrests"])
crimArrestType = crimArrestType.set_index(["Crime type"])
for i in crimTypeCount.index:
    crimArrestType.loc[i] = [crimArrestTypeCount[i][True]/crimArrestTypeCount[i].sum()]
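
As a hedged alternative to the loop above, the arrest proportion per crime type can be sketched as a grouped mean of the boolean "Arrest" column. The rows are invented for illustration. One difference worth noting: a type with no arrests at all yields 0.0 here, whereas looking up `[True]` in the counts would fail for such a type.

```python
import pandas as pd

# Invented records; GAMBLING has no non-arrests, ARSON has no arrests.
toy = pd.DataFrame({
    "Primary Type": ["NARCOTICS", "NARCOTICS", "BURGLARY", "BURGLARY",
                     "BURGLARY", "BURGLARY", "GAMBLING", "ARSON"],
    "Arrest": [True, True, False, False, True, False, True, False],
})

# Mean of a boolean column within each group = proportion of True values.
arrest_rate = toy.groupby("Primary Type")["Arrest"].mean()

print(arrest_rate.sort_values())
```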

Now let's visualize the dataset by sorting and plotting it

In [None]:
crimArrestType.sort_values(by="Crime Proportion of successful arrests").plot.bar(figsize=(15,5));

Finally, let's get the crime types with the highest proportion of successful arrests. Since there are several, let's take the ones with a proportion over 99%.

In [None]:
maxCrimTypes = crimArrestType["Crime Proportion of successful arrests"] > 0.99

The following list contains the ones that have a proportion over 99%

In [None]:
crimArrestType[maxCrimTypes]

And now let's find the crime types with the lowest proportion of successful arrests, in other words those with a proportion below 10%.

In [None]:
minCrimTypes = crimArrestType["Crime Proportion of successful arrests"] < 0.1

The following list contains the ones that have a proportion below 10%

In [None]:
crimArrestType[minCrimTypes]
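
Thresholds written as fractions are easy to misread: 0.99 is 99%, 0.1 is 10%, and 0.01 would be 1%. The sketch below applies both filters to a handful of invented rates (the numbers are placeholders, not results from the dataset).

```python
import pandas as pd

# Invented arrest proportions, expressed as fractions between 0 and 1.
rates = pd.Series(
    {"NARCOTICS": 0.995, "THEFT": 0.12, "BURGLARY": 0.06, "ROBBERY": 0.098},
    name="arrest_rate",
)

high = rates[rates > 0.99]   # above 99 percent
low = rates[rates < 0.1]     # below 10 percent (0.01 would mean 1 percent)

print(list(high.index))
print(list(low.index))
```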

##### Conclusion
Domestic violence, gambling, liquor law violation, narcotics, prostitution and public indecency are 6 crime types whose records almost always include an arrest, with an arrest rate of over 99%. Note that domestic violence appears only once in the dataset, so its rate is not meaningful.

Burglary, criminal damage, motor vehicle theft, non-criminal crimes and robbery are 5 crime types that rarely lead to an arrest, with an arrest rate of under 10%.