# Chicago Crime Analysis and Pipeline with PySpark

#### Goals


Goals for this project:
- Perform exploratory data analysis (EDA) on the *Chicago Crime* dataset from [Kaggle](https://www.kaggle.com/)
- Create a machine learning pipeline

All of this will be done using **[PySpark](https://spark.apache.org/docs/latest/api/python/)**.

<hr>

#### Import Libraries

In [23]:
try:
    from pyspark.sql import SparkSession
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import DataFrame
    import pyspark.sql.functions as F
    import pandas as pd
    import numpy as np
    import glob
    from functools import reduce
    
    print('[SUCCESS]')
except ImportError as ie:
    raise ImportError(f'[Error importing]: {ie}')

[SUCCESS]


**INITIALIZE SESSION**

In [24]:
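# Build (or reuse) a local SparkSession; the session owns the SparkContext.
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext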

<hr>

#### Read in our DATA

In [25]:
PATH = 'DATA' # FOLDER CONTAINING FILES
csv_files = glob.glob(PATH + '/*.csv') # GET ALL CSV FILES


# CREATE FUNCTION TO READ IN THE DATA AND MERGE FILES
def merge_csv(files):
    df = spark.read.options(header = True).csv(files)
    
    return df

df = merge_csv(csv_files)

In [26]:
df.show(1)

+---+-----------+--------+--------------------+-----------------+------------+-----------+--------------------+---------+--------+-----+--------+----+--------------+--------+------------+------------+----+----------+--------------------+---------+--------+
| ID|Case Number|    Date|               Block|             IUCR|Primary Type|Description|Location Description|   Arrest|Domestic| Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|Year|Updated On|            Latitude|Longitude|Location|
+---+-----------+--------+--------------------+-----------------+------------+-----------+--------------------+---------+--------+-----+--------+----+--------------+--------+------------+------------+----+----------+--------------------+---------+--------+
|879|    4786321|HM399414|01/01/2004 12:01:...|082XX S COLES AVE|        0840|      THEFT|FINANCIAL ID THEF...|RESIDENCE|   False|False|     424| 4.0|           7.0|    46.0|          06|        null|null|      2004|08/17/2015 03

<hr>

#### Look at our Schema

Printing the schema shows the data type Spark assigned to each column.

In [27]:
df.printSchema()

root
 |-- ID: string (nullable = true)
 |-- Case Number: string (nullable = true)
 |-- Date: string (nullable = true)
 |-- Block: string (nullable = true)
 |-- IUCR: string (nullable = true)
 |-- Primary Type: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Location Description: string (nullable = true)
 |-- Arrest: string (nullable = true)
 |-- Domestic: string (nullable = true)
 |-- Beat: string (nullable = true)
 |-- District: string (nullable = true)
 |-- Ward: string (nullable = true)
 |-- Community Area: string (nullable = true)
 |-- FBI Code: string (nullable = true)
 |-- X Coordinate: string (nullable = true)
 |-- Y Coordinate: string (nullable = true)
 |-- Year: string (nullable = true)
 |-- Updated On: string (nullable = true)
 |-- Latitude: string (nullable = true)
 |-- Longitude: string (nullable = true)
 |-- Location: string (nullable = true)



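Every column was read in as a string, because the CSV reader was not given a schema and was not asked to infer one. The counts below do not depend on column types, but numeric, boolean, and date columns will need to be cast (or a custom schema supplied) before modeling.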

#### Exploratory Data Analysis

Goals:
- Do some basic data descriptions
    - NA values
    - Dimensions of the data



In [31]:
# FIND NULL VALUES IN DATAFRAME
def null_values(df):
    return df.agg(*[F.count(F.when(F.isnull(c), c)).alias(c) for c in df.columns])

df_null = null_values(df)

In [32]:
df_null.show()

+---+-----------+----+-----+----+------------+-----------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+------+----------+--------+---------+--------+
| ID|Case Number|Date|Block|IUCR|Primary Type|Description|Location Description|Arrest|Domestic|Beat|District|Ward|Community Area|FBI Code|X Coordinate|Y Coordinate|  Year|Updated On|Latitude|Longitude|Location|
+---+-----------+----+-----+----+------------+-----------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+------+----------+--------+---------+--------+
|  0|          1|   7|    0|   0|           0|          0|                   0|  1990|       0|   0|       0|  91|        700224|  702091|           0|      105573|105573|         0|       0|   105573|  105574|
+---+-----------+----+-----+----+------------+-----------+--------------------+------+--------+----+--------+----+--------------+--------+------------+------------+------+----------+--------+---------+--------+

In [37]:
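def get_dimensions(df):
    """Return (row count, column count) for a Spark DataFrame."""
    return (df.count(), len(df.columns))


# -----
# Call once: df.count() runs a full Spark job, so avoid triggering it twice.
n_rows, n_cols = get_dimensions(df)
print(f'Number of rows: {n_rows} \nNumber of columns: {n_cols}')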

Number of rows: 7941286 
Number of columns: 22


In [22]:
sc.stop()
