# Chicago Crime Analysis with PySpark

#### Goals


Some goals for this project:
- Do some simple EDA on the *Chicago Crime* dataset from [Kaggle](https://www.kaggle.com/)

All this will be done using **[PySpark](https://spark.apache.org/docs/latest/api/python/)**

<hr>

#### Import Libraries

In [1]:
try:
    from pyspark.sql import SparkSession
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql import DataFrame
    import pyspark.sql.types as tp
    import pyspark.sql.functions as F
    import pandas as pd
    import numpy as np
    import glob
    from functools import reduce
    from urllib.request import urlopen
    
    print('[SUCCESS]')
except ImportError as ie:
    raise ImportError(f'[Error importing]: {ie}')

[SUCCESS]


**INITIALIZE SESSION**

In [2]:
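# Create the session through the builder API; keep `sc` so later cells can stop it
spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext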

<hr>

#### Read in our DATA

In [3]:
'''
URL THAT WE ARE USING TO READ IN OUR JSON DATA
'''
URL = 'https://data.cityofchicago.org/resource/x2n5-8w5q.json'
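
Socrata (SODA) endpoints like the one above typically return only a limited page of rows per request (1,000 by default, to my knowledge), so a single read may not cover the whole dataset. A minimal sketch of building paged URLs with the `$limit` and `$offset` query parameters follows; the helper names and the `total_rows=2500` figure are made up for illustration and are not part of the original notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'https://data.cityofchicago.org/resource/x2n5-8w5q.json'


def build_page_url(base_url, limit, offset=0):
    """Append SODA paging parameters ($limit, $offset) to a resource URL."""
    # safe='$' keeps the literal '$' that SODA parameter names start with
    query = urlencode({'$limit': limit, '$offset': offset}, safe='$')
    return f'{base_url}?{query}'


def page_urls(base_url, total_rows, page_size):
    """Return one URL per page needed to cover total_rows rows."""
    return [build_page_url(base_url, page_size, offset)
            for offset in range(0, total_rows, page_size)]


# total_rows is a hypothetical figure, used only to show the paging arithmetic
urls = page_urls(BASE_URL, total_rows=2500, page_size=1000)
for url in urls:
    print(url)
```

Each generated URL could then be passed to a reader function such as `read_json_api`, and the resulting DataFrames combined with `reduce(DataFrame.unionByName, ...)`.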

In [4]:
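def read_json_api(url):
    '''Download the JSON payload at `url` and load it into a Spark DataFrame.'''
    # Fetch the whole response body as one UTF-8 string
    json_data = urlopen(url).read().decode('utf-8')

    # Wrap the string in a single-element RDD so spark.read.json can parse it
    rdd = spark.sparkContext.parallelize([json_data])

    # Spark infers the schema from the JSON records
    df = spark.read.json(rdd)

    return df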

In [5]:
df = read_json_api(URL)
df.show(5)

+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+-----+---------------------+--------------------+----------------------+------+----+--------------------+--------+--------------------+--------+------+------------+--------------------+-------------+----+------------+------------+
|:@computed_region_43wa_7qmu|:@computed_region_6mkv_f3dw|:@computed_region_awaf_s7ux|:@computed_region_bdys_3d7i|:@computed_region_rpca_8um6|:@computed_region_vrxf_vc4k|_iucr|_location_description|_primary_decsription|_secondary_description|arrest|beat|               block|   case_|  date_of_occurrence|domestic|fbi_cd|    latitude|            location|    longitude|ward|x_coordinate|y_coordinate|
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+-----+---------------------+---

<hr>

#### Look at our Schema

Printing the schema shows the data type Spark inferred for each column.

In [9]:
df.printSchema()

root
 |-- :@computed_region_43wa_7qmu: string (nullable = true)
 |-- :@computed_region_6mkv_f3dw: string (nullable = true)
 |-- :@computed_region_awaf_s7ux: string (nullable = true)
 |-- :@computed_region_bdys_3d7i: string (nullable = true)
 |-- :@computed_region_rpca_8um6: string (nullable = true)
 |-- :@computed_region_vrxf_vc4k: string (nullable = true)
 |-- _iucr: string (nullable = true)
 |-- _location_description: string (nullable = true)
 |-- _primary_decsription: string (nullable = true)
 |-- _secondary_description: string (nullable = true)
 |-- arrest: string (nullable = true)
 |-- beat: string (nullable = true)
 |-- block: string (nullable = true)
 |-- case_: string (nullable = true)
 |-- date_of_occurrence: string (nullable = true)
 |-- domestic: string (nullable = true)
 |-- fbi_cd: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- human_address: string (nullable = true)
 |    |-- latitude: string (nullable = 

As we can see, our schema isn't in the best condition for EDA: nearly every column was read in as a string.

We want certain attributes to have the appropriate data types.
- This would involve changing the schema


However, for now we will stick with the way it is and drop columns we don't need.
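
For reference, the schema change mentioned above could later be sketched roughly as follows. The helper builds one SQL expression per column for `DataFrame.selectExpr`; the `TYPE_MAP` entries and the name `build_cast_exprs` are assumptions for illustration, and the target types would need checking against the actual values before use.

```python
# Hypothetical target types for a few columns; adjust after inspecting the data
TYPE_MAP = {
    'ward': 'INT',
    'latitude': 'DOUBLE',
    'longitude': 'DOUBLE',
    'x_coordinate': 'DOUBLE',
    'y_coordinate': 'DOUBLE',
    'date_of_occurrence': 'TIMESTAMP',
}


def build_cast_exprs(columns, type_map):
    """Build one SQL expression per column, casting those listed in type_map."""
    exprs = []
    for name in columns:
        quoted = f'`{name}`'  # backticks protect unusual column names
        if name in type_map:
            exprs.append(f'CAST({quoted} AS {type_map[name]}) AS {quoted}')
        else:
            exprs.append(quoted)  # pass the column through unchanged
    return exprs


exprs = build_cast_exprs(['arrest', 'latitude', 'ward'], TYPE_MAP)
print(exprs)

# In the notebook this would be applied as:
# typed_df = df.selectExpr(*build_cast_exprs(df.columns, TYPE_MAP))
```

Values that cannot be parsed into the target type become null after the cast, so it is worth comparing null counts before and after.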

In [10]:
columns_to_be_dropped = (':@computed_region_43wa_7qmu', ':@computed_region_6mkv_f3dw', ':@computed_region_awaf_s7ux', ':@computed_region_bdys_3d7i', ':@computed_region_rpca_8um6', ':@computed_region_vrxf_vc4k')
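
As an alternative to listing each column by hand, the same tuple could be derived from the column names, since all of the unwanted columns share the `:@computed_region_` prefix. This is only a sketch: `computed_region_columns` and `sample_columns` are hypothetical names, and in the notebook the input would be `df.columns`.

```python
PREFIX = ':@computed_region_'


def computed_region_columns(columns, prefix=PREFIX):
    """Return the column names that start with the given prefix."""
    return tuple(c for c in columns if c.startswith(prefix))


# A shortened stand-in for df.columns
sample_columns = [':@computed_region_43wa_7qmu', '_iucr', 'arrest',
                  ':@computed_region_vrxf_vc4k']
to_drop = computed_region_columns(sample_columns)
print(to_drop)
```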

In [11]:
def drop_cols(df, cols):
    df = df.drop(*cols)
    
    return df

In [12]:
df = drop_cols(df, columns_to_be_dropped)
df.printSchema()

root
 |-- _iucr: string (nullable = true)
 |-- _location_description: string (nullable = true)
 |-- _primary_decsription: string (nullable = true)
 |-- _secondary_description: string (nullable = true)
 |-- arrest: string (nullable = true)
 |-- beat: string (nullable = true)
 |-- block: string (nullable = true)
 |-- case_: string (nullable = true)
 |-- date_of_occurrence: string (nullable = true)
 |-- domestic: string (nullable = true)
 |-- fbi_cd: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- location: struct (nullable = true)
 |    |-- human_address: string (nullable = true)
 |    |-- latitude: string (nullable = true)
 |    |-- longitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- ward: string (nullable = true)
 |-- x_coordinate: string (nullable = true)
 |-- y_coordinate: string (nullable = true)



In [14]:
df.columns

['_iucr',
 '_location_description',
 '_primary_decsription',
 '_secondary_description',
 'arrest',
 'beat',
 'block',
 'case_',
 'date_of_occurrence',
 'domestic',
 'fbi_cd',
 'latitude',
 'location',
 'longitude',
 'ward',
 'x_coordinate',
 'y_coordinate']

Now that we have the columns we want to work with, here is a small description of what each column is according to [chicago.gov](https://www.chicago.gov/city/en/dataset/crime.html)

#### Exploratory Data Analysis

Goals:
- Dataset overview
    - NA values
    - Dimensions of the data
- Description of columns (important ones chosen by me)
    - Description
    - Arrest
    - Year


In [6]:
'''
FUNCTION TO FIND NULL/NA VALUES IN DATAFRAME
'''
def null_values(df):
    # F.count skips nulls, so emit a non-null marker (1) for every null cell and count those
    return df.agg(*[F.count(F.when(F.isnull(c), 1)).alias(c) for c in df.columns])

df_null = null_values(df)
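
The result of `null_values` is a single wide row, which is hard to read for many columns. One possible way to summarize it is sketched below in plain Python: the hypothetical helper `null_report` takes a column-to-count mapping and a row total and returns the columns sorted by how many values are missing. The sample counts are made up; in the notebook the inputs would be `df_null.first().asDict()` and `df.count()`.

```python
def null_report(null_counts, total_rows):
    """Return (column, nulls, percent) tuples, most-missing column first."""
    report = []
    for column, nulls in null_counts.items():
        # Guard against an empty DataFrame to avoid dividing by zero
        percent = 100.0 * nulls / total_rows if total_rows else 0.0
        report.append((column, nulls, round(percent, 2)))
    return sorted(report, key=lambda item: item[1], reverse=True)


# Made-up counts for illustration only
sample_counts = {'arrest': 0, 'latitude': 12, 'ward': 3}
report = null_report(sample_counts, total_rows=200)
for column, nulls, percent in report:
    print(f'{column}: {nulls} nulls ({percent}%)')
```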

In [None]:
'''
FUNCTION TO GET DIMENSIONS OF DATAFRAME
'''
def get_dimensions(df):
    return (df.count(), len(df.columns))

# -----
# Call once: df.count() triggers a Spark job every time it runs
n_rows, n_cols = get_dimensions(df)
print(f'Number of rows: {n_rows} \nNumber of columns: {n_cols}')

In [39]:
sc.stop()