# Chicago Crimes Analysis (2014 - 2016)

Chicago Crimes analysis is going to be performed as follows:

1. PySpark **environment setup**
2. Data source and **Spark data abstraction** (DataFrame) **set up**
3. Data set **metadata analysis**:
  1. Display **schema and size** of the DataFrame
  2. Enrichment of datasets: Ancillary Datasets 
  3. Get one or multiple **random samples** 
  4. Table Joins
  5. Identify **data entities**, **metrics** and **dimensions**
  6. **Columns/fields categorization**
4. Columns groups **basic profiling** to better understand our data set:
  1. **Timing related** columns basic profiling
  2. **Location related** columns basic profiling
  3. **Crime related** columns basic profiling
  4. **Socio-economic related** columns basic profiling
5. **Business Solutions to business questions** 
  1. **Time of day:** Violent and Non-violent crimes
  2. **Areas of Chicago:** Violent and Non-violent crimes
  3. **Domestic Violence Cases:** Conviction Rates
  4. **Seasonality of robberies:**
  5. **Profile of Domestic Violence affected community**



## 1. PySpark environment setup

In [1]:
import findspark
findspark.init()

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

## 2. Data Source and Spark Data Abstraction (DataFrame) Setup

In [2]:
crimesDF = spark.read \
                 .format("csv") \
                 .option("mode","FAILFAST") \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .load("../Assignment1/*.csv")

## 3. DataSet Metadata Analysis
### A. Display Schema & Size of the DataFrame

In [3]:
from IPython.display import display, Markdown

crimesDF.printSchema()
display(Markdown("This DataFrame has **%d rows**." % crimesDF.count()))

root
 |-- _c0: integer (nullable = true)
 |-- id: integer (nullable = true)
 |-- case_number: string (nullable = true)
 |-- date: timestamp (nullable = true)
 |-- year: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- day: integer (nullable = true)
 |-- dow: integer (nullable = true)
 |-- hour: integer (nullable = true)
 |-- block: string (nullable = true)
 |-- iucr: string (nullable = true)
 |-- primary_type: string (nullable = true)
 |-- description: string (nullable = true)
 |-- location_description: string (nullable = true)
 |-- arrest: boolean (nullable = true)
 |-- domestic: boolean (nullable = true)
 |-- beat: integer (nullable = true)
 |-- district: integer (nullable = true)
 |-- ward: string (nullable = true)
 |-- community_area: string (nullable = true)
 |-- fbi_code: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)



This DataFrame has **754541 rows**.

### B. Enrichment of data: Ancillary Datasets
#### IUCR code
The crimes may be based upon preliminary information supplied to the Police Department by the reporting parties that have not been verified. As a result, the preliminary crime classifications may later change based upon an additional investigation and leading to more data entry errors. These codes are also used to aggregate types of cases for statistical purposes. In Illinois, the Illinois State Police establish IUCR codes, but the agencies can add codes to suit their individual need [1].
- The CPD currently uses more than 350 IUCR codes to classify criminal offenses, divided into “Index” and “Non-Index” offenses.
- Index offenses are the offenses that are collected nation-wide by the Federal Bureaus of Investigation’s Uniform Crime Reports program to document crime trends over time (data released semi-annually), and include murder, criminal sexual assault, robbery, aggravated assault & battery, burglary, theft, motor vehicle theft, and arson [1].
- Non-index offenses are all other types of criminal incidents, including vandalism, weapons violations, public peace violations, etc.

To provide the index code, this dataset will be loaded to enrich the final crimes table to gain further insights.

In [4]:
iucrDF = spark.read \
                 .format("csv") \
                 .option("mode","FAILFAST") \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .load("../AncillaryDatasets/*.csv")

In [5]:
iucrDF.printSchema()
display(Markdown("This DataFrame has **%d rows**." % iucrDF.count()))

root
 |-- iucr: string (nullable = true)
 |-- index_code: string (nullable = true)



This DataFrame has **401 rows**.

In [6]:
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit

#### Police District Names
In the crimes dataframe, *districts* variables indicates the police district number where the incident occurred. Although the police district number is, incorporating the police district name will provide further context necessary for analysis. There are 22 police districts, each district has police name and were obtained from __[City of Chicago Data Portal](https://www.chicago.gov/city/en/depts/cpd/dataset/police_stations.html)__.

In [7]:
districtNameDf = spark.read \
                 .format("csv") \
                 .option("mode","FAILFAST") \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .load("../PoliceStations/*.csv")

In [8]:
districtNameDf.printSchema()
display(Markdown("This DataFrame has **%d rows**." % districtNameDf.count()))

root
 |-- district: string (nullable = true)
 |-- districtName: string (nullable = true)
 |-- _c2: string (nullable = true)



This DataFrame has **23 rows**.

The *_c2* id variable is removed as it is not pertinent for the analysis.

In [9]:
districtNameDf.drop('_c2').show(1)
districtNameDf.select(col("district")).distinct().show()

+--------+------------+
|district|districtName|
+--------+------------+
|       1|     Central|
+--------+------------+
only showing top 1 row

+------------+
|    district|
+------------+
|           7|
|          15|
|          11|
|           3|
|           8|
|          22|
|          16|
|           5|
|          18|
|          17|
|           6|
|          19|
|          25|
|Headquarters|
|          24|
|           9|
|           1|
|          20|
|          10|
|           4|
+------------+
only showing top 20 rows



#### Community Area: Names
In the crimes dataset, *community_area* indicates the community area where the incident occurred. Chicago has 77 community areas and the names of each community area which were obtained from __[Census Data by Community Area](http://www.actforchildren.org/wp-content/uploads/2018/01/Census-Data-by-Chicago-Community-Area-2017.pdf)__ will be incorporated to dataset to provide further context necessary for analysis. Moreover, the locational regions (*Northside*, *South side* etc.) are constituted of these community areas. These will also be enriched in the final dataframe.

In [11]:
from pyspark.sql import Row
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType, StringType

comms = [ Row(1, "Rogers Park","Far North side"),Row(2, "West Ridge","Far North side"),Row(3, "Uptown","Far North side"),Row(4, "Lincoln Square","Far North side"),
               Row(5, "North Center","North side"),Row(6, "Lake View","North side"),Row(7, "Lincoln Park","North side"),Row(8, "Near North Side","Central"),
               Row(9, "Edison Park","Far North side"),Row(10, "Norwood Park","Far North side"),Row(11, "Jefferson Park","Far North side"),Row(12, "Forest Glen","Far North side"),
               Row(13, "North Park","Far North side"),Row(14, "Albany Park","Far North side"),Row(15, "Portage Park","Northwest side"),Row(16, "Irving Park","Northwest side"),
               Row(17, "Dunning","Northwest side"),Row(18, "Montclare","Northwest side"),Row(19, "Belmont Cragin","Northwest side"),Row(20, "Hermosa","Northwest side"),
               Row(21, "Avondale","North side"),Row(22, "Logan Square","North side"),Row(23, "Humboldt Park", "West side"),Row(24, "West Town","West side"),
               Row(25, "Austin","West side"),Row(26, "West Garfield Park","West side"),Row(27, "East Garfield Park","West side"),Row(28, "Near West Side","West side"),
               Row(29, "North Lawndale","West side"),Row(30, "South Lawndale","West side"),Row(31, "Lower West Side","West side"),Row(32, "Loop","Central"),
               Row(33, "Near South Side","Central"),Row(34, "Armour Square","South side"),Row(35, "Douglas","South side"),Row(36, "Oakland","South side"),
               Row(37, "Fuller Park","South side"),Row(38, "Grand Boulevard","South side"),Row(39, "Kenwood","South side"),Row(40, "Washington Park","South side"),
               Row(41, "Hyde Park","South side"),Row(42, "Woodlawn","South side"),Row(43, "South Shore","South side"),Row(44, "Chatham","Far Southeast side"),
               Row(45, "Avalon Park","Far Southeast side"),Row(46, "South Chicago","Far Southeast side"),Row(47, "Burnside","Far Southeast side"),Row(48, "Calumet Heights","Far Southeast side"),
               Row(49, "Roseland","Far Southeast side"),Row(50, "Pullman","Far Southeast side"),Row(51, "South Deering","Far Southeast side"),Row(52, "East Side","Far Southeast side"),
               Row(53, "West Pullman","Far Southeast side"),Row(54, "Riverdale","Far Southeast side"),Row(55, "Hegewisch","Far Southeast side"),Row(56, "Garfield Ridge","Southwest side"),
               Row(57, "Archer Heights","Southwest side"),Row(58, "Brighton Park","Southwest side"),Row(59, "McKinley Park","Southwest side"),Row(60, "Bridgeport","South side"),
               Row(61, "New City","Southwest side"),Row(62, "West Elsdon","Southwest side"),Row(63, "Gage Park","Southwest side"),Row(64, "Clearing","Southwest side"),
               Row(65, "West Lawn","Southwest side"),Row(66, "Chicago Lawn","Southwest side"),Row(67, "West Englewood","Southwest side"),Row(68, "Englewood","Southwest side"),
               Row(69, "Greater Grand Crossing","South side"),Row(70, "Ashburn","Far Southwest side"),Row(71, "Auburn Gresham","Far Southwest side"),Row(72, "Beverly","Far Southwest side"),
               Row(73, "Washington Heights","Far Southwest side"),Row(74, "Mount Greenwood","Far Southwest side"),Row(75, "Morgan Park","Far Southwest side"),Row(76, "O'Hare","Far North side"),
               Row(77, "Edgewater","Far North side")]
commsSchema = StructType([ StructField("community_area", IntegerType(), False),
                                StructField("communityAreaName", StringType(), True),
                          StructField("communityAreaSide", StringType(), True)])
commsDf = spark.createDataFrame(comms, commsSchema)
commsDf.show(truncate=False)

+--------------+-----------------+-----------------+
|community_area|communityAreaName|communityAreaSide|
+--------------+-----------------+-----------------+
|1             |Rogers Park      |Far North side   |
|2             |West Ridge       |Far North side   |
|3             |Uptown           |Far North side   |
|4             |Lincoln Square   |Far North side   |
|5             |North Center     |North side       |
|6             |Lake View        |North side       |
|7             |Lincoln Park     |North side       |
|8             |Near North Side  |Central          |
|9             |Edison Park      |Far North side   |
|10            |Norwood Park     |Far North side   |
|11            |Jefferson Park   |Far North side   |
|12            |Forest Glen      |Far North side   |
|13            |North Park       |Far North side   |
|14            |Albany Park      |Far North side   |
|15            |Portage Park     |Northwest side   |
|16            |Irving Park      |Northwest si

### Community Area Statistics
To gain better profiles of the Chicago constituents who may be affected by the crime, general socio-economic factors including **Poverty Level**, **unemployment**, **no High School Diploma** and **per Capita Income** will be considered for the period between 2014 - 2016. Information was obtained from __[Public health Data by Community Area](https://data.cityofchicago.org/Health-Human-Services/Public-Health-Statistics-Selected-public-health-in/iqnk-2tcu/data)__ portal. Note it is the average over the three-year period.

In [12]:
genDF = spark.read \
                 .format("csv") \
                 .option("mode","FAILFAST") \
                 .option("inferSchema", "true") \
                 .option("header", "true") \
                 .load("../GenStats/*.csv")

In [13]:
genDF.printSchema()
display(Markdown("This DataFrame has **%d rows**." % genDF.count()))

root
 |-- community_area: integer (nullable = true)
 |-- belowPovertyLevel: double (nullable = true)
 |-- crowdedHousing: double (nullable = true)
 |-- noHighSchoolDiploma: double (nullable = true)
 |-- perCapitaIncome: integer (nullable = true)
 |-- unemployment: double (nullable = true)



This DataFrame has **77 rows**.

## C. Get one or multiple **random samples**

In [14]:
crimesDF.cache() # optimization to make the processing faster
crimesDF.sample(False, 0.1).take(2)

[Row(_c0=3, id=9446758, case_number='HX100030', date=datetime.datetime(2014, 1, 1, 0, 30), year=2014, month=1, day=1, dow=4, hour=0, block='052XX W RACE AVE', iucr='1310', primary_type='CRIMINAL DAMAGE', description='TO PROPERTY', location_description='APARTMENT', arrest=False, domestic=False, beat=1523, district=15, ward='28', community_area='25', fbi_code='14', latitude='41.890046233', longitude='-87.756333158'),
 Row(_c0=9, id=9446770, case_number='HX100039', date=datetime.datetime(2014, 1, 1, 0, 30), year=2014, month=1, day=1, dow=4, hour=0, block='002XX S LOCKWOOD AVE', iucr='1811', primary_type='NARCOTICS', description='POSS: CANNABIS 30GMS OR LESS', location_description='STREET', arrest=True, domestic=False, beat=1522, district=15, ward='29', community_area='25', fbi_code='18', latitude='41.87786538', longitude='-87.757599414')]

In [15]:
crimesDF.select(col("primary_type")).distinct().show(33,truncate=False)

+---------------------------------+
|primary_type                     |
+---------------------------------+
|OFFENSE INVOLVING CHILDREN       |
|STALKING                         |
|PUBLIC PEACE VIOLATION           |
|OBSCENITY                        |
|NON-CRIMINAL (SUBJECT SPECIFIED) |
|ARSON                            |
|GAMBLING                         |
|CRIMINAL TRESPASS                |
|ASSAULT                          |
|NON - CRIMINAL                   |
|LIQUOR LAW VIOLATION             |
|MOTOR VEHICLE THEFT              |
|THEFT                            |
|BATTERY                          |
|ROBBERY                          |
|HOMICIDE                         |
|PUBLIC INDECENCY                 |
|CRIM SEXUAL ASSAULT              |
|HUMAN TRAFFICKING                |
|INTIMIDATION                     |
|PROSTITUTION                     |
|DECEPTIVE PRACTICE               |
|CONCEALED CARRY LICENSE VIOLATION|
|SEX OFFENSE                      |
|CRIMINAL DAMAGE            

It can be seen that there are multiple entries of the **NON-CRIMINAL** *(e.g "NON - CRIMINAL")* primary type of crime. Care will be taken to deal with these issues for further analysis

## D. Table Joins
Ancillary data is available to enrich the original dataframe with information which may improve the understanding of the generated results. 
### Chicago Police Dept
Below we are enriching the police district number with the police district names: i.e. the chicago police department's full name with an inner join.

In [15]:
combinedDistrictDf = crimesDF\
                        .join(districtNameDf,"district",'inner')

In [16]:
# To ensure that the join occurred
combinedDistrictDf 

DataFrame[district: int, _c0: int, id: int, case_number: string, date: timestamp, year: int, month: int, day: int, dow: int, hour: int, block: string, iucr: string, primary_type: string, description: string, location_description: string, arrest: boolean, domestic: boolean, beat: int, ward: string, community_area: string, fbi_code: string, latitude: string, longitude: string, districtName: string, _c2: string]

### Community Area Names
With an inner join, we enrich the dataframe with community area name and commmunity side.

In [17]:
combinedCommAreaDf = combinedDistrictDf\
                        .join(commsDf,"community_area",'inner')

In [18]:
# To ensure that the join occurred
combinedCommAreaDf

DataFrame[community_area: string, district: int, _c0: int, id: int, case_number: string, date: timestamp, year: int, month: int, day: int, dow: int, hour: int, block: string, iucr: string, primary_type: string, description: string, location_description: string, arrest: boolean, domestic: boolean, beat: int, ward: string, fbi_code: string, latitude: string, longitude: string, districtName: string, _c2: string, communityAreaName: string, communityAreaSide: string]

### IUCR Index Code
The "Index-Code" which is either violent: **"I"** or **"N"** is enriched using an inner join.

In [19]:
combinedICRDf = combinedCommAreaDf\
                   .join(iucrDF,"iucr","inner")

In [20]:
# To ensure that the join occurred
combinedICRDf

DataFrame[iucr: string, community_area: string, district: int, _c0: int, id: int, case_number: string, date: timestamp, year: int, month: int, day: int, dow: int, hour: int, block: string, primary_type: string, description: string, location_description: string, arrest: boolean, domestic: boolean, beat: int, ward: string, fbi_code: string, latitude: string, longitude: string, districtName: string, _c2: string, communityAreaName: string, communityAreaSide: string, index_code: string]

### Socio Economic Factors
Below we are enriching the final dataframe with the socio-economic factors including **% below poverty level**, **% with no High-School Diploma**, **per Capita Income** and **unemployment rate**. This was done using an inner join. 
Note: this information is an average for the years 2014 - 2016 and split by community area. 

In [21]:
crimeDF = combinedICRDf\
             .join(genDF,"community_area",'inner')

In [22]:
#To ensure that the join occurred
crimeDF

DataFrame[community_area: string, iucr: string, district: int, _c0: int, id: int, case_number: string, date: timestamp, year: int, month: int, day: int, dow: int, hour: int, block: string, primary_type: string, description: string, location_description: string, arrest: boolean, domestic: boolean, beat: int, ward: string, fbi_code: string, latitude: string, longitude: string, districtName: string, _c2: string, communityAreaName: string, communityAreaSide: string, index_code: string, belowPovertyLevel: double, crowdedHousing: double, noHighSchoolDiploma: double, perCapitaIncome: int, unemployment: double]

In [25]:
# to ensure the table is fully enriched
crimeDF.show(2)

+--------------+----+--------+----+-------+-----------+-------------------+----+-----+---+---+----+-----------------+-----------------+-----------+--------------------+------+--------+----+----+--------+------------+-------------+------------+----+-----------------+-----------------+----------+-----------------+--------------+-------------------+---------------+------------+
|community_area|iucr|district| _c0|     id|case_number|               date|year|month|day|dow|hour|            block|     primary_type|description|location_description|arrest|domestic|beat|ward|fbi_code|    latitude|    longitude|districtName| _c2|communityAreaName|communityAreaSide|index_code|belowPovertyLevel|crowdedHousing|noHighSchoolDiploma|perCapitaIncome|unemployment|
+--------------+----+--------+----+-------+-----------+-------------------+----+-----+---+---+----+-----------------+-----------------+-----------+--------------------+------+--------+----+----+--------+------------+-------------+------------+-

### Conclusion: Joins

The final dataframe is **crimeDF** and has been enriched with the Community Area Names, Side of Chicago,  Police Department Names, the IUCR Index Codes and the several Socio Economic Factors defined for Chicago. 

### D. Data Entities, Metrics and Dimensions

We've identified the following elements:

* **Entities:** Case Number, date, Crime Type (dimension), Location (dimension)
* **Metrics:** % Below Poverty level, % Crowded Housing, % no High school Diploma, Income per Capita & % Unemployment rate
* **Dimensions:** Arrest, Domestic, Police department, Ward, CommmunityArea,

### E. Column Categorization

The following is the column categorization:

* **Timing related columns:** *Year*, *Month*, *DayofMonth*, *DayOfWeek*, *HourOfDay*
* **Location related columns:** *Block*, *Location Description*, *Beat*, *District*,*Police District Name*,*Ward*, *Community Area*,*Community Area Name*, *Community Area Side*, *Latitude*, *Longitude*
* **Crime related columns:** *ID*, *Case Number*, *Primary Type*, *Description*, *Arrest*, *Domestic*, *IUCR*, *Index Code*, *FBI Code*
* **Socio-economic factors columns:** *Below Poverty Level*, *Crowded Housing*, *No High-school Diploma*, *per Capita Income*, *Unemployment*

## 4. Columns groups basic profiling to better understand our data set
### A. Basic Profiling: Timing related columns

In [23]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit


print ("Summary of columns Date, Year, Month, DayofMonth, DayOfWeek and HourOfDay:")
crimeDF.select("date","year","month","day","dow","hour").summary().show()

print("Checking for nulls on columns Date, Year, Month, DayofMonth, DayOfWeek and HourOfDay:")
crimeDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["date","year","month","day","dow","hour"]]).show()

print("Checking amount of distinct values in columns Year, Month, DayofMonth and DayOfWeek:")
crimeDF.select([countDistinct(c).alias(c) for c in ["date","year","month","day","dow","hour"]]).show()

print ("Most and least frequent occurrences for Year, Month, DayofMonth, DayOfWeek and HourOfDay columns:")
YearOccurencesDF = crimeDF.groupBy("year").agg(count(lit(1)).alias("Total"))
MonthOccurencesDF = crimeDF.groupBy("month").agg(count(lit(1)).alias("Total"))
dayofMonthOccurrencesDF = crimeDF.groupBy("day").agg(count(lit(1)).alias("Total"))
dayOfWeekDF = crimeDF.groupBy("dow").agg(count(lit(1)).alias("Total"))
hourOfDayDF = crimeDF.groupBy("hour").agg(count(lit(1)).alias("Total"))

leastFreqYear          = YearOccurencesDF.orderBy(col("Total").asc()).first()
mostFreqYear           = YearOccurencesDF.orderBy(col("Total").desc()).first()
leastFreqMonth         = MonthOccurencesDF.orderBy(col("Total").asc()).first()
mostFreqMonth          = MonthOccurencesDF.orderBy(col("Total").desc()).first()
leastFreqDayOfMonth    = dayofMonthOccurrencesDF.orderBy(col("Total").asc()).first()
mostFreqDayOfMonth     = dayofMonthOccurrencesDF.orderBy(col("Total").desc()).first()
leastFreqDayOfWeek     = dayOfWeekDF.orderBy(col("Total").asc()).first()
mostFreqDayOfWeek      = dayOfWeekDF.orderBy(col("Total").desc()).first()
leastFreqHourOfDay     = hourOfDayDF.orderBy(col("Total").asc()).first()
mostFreqHourOfDay      = hourOfDayDF.orderBy(col("Total").desc()).first()
display(Markdown("""
| %s | %s | %s | %s | %s | %s | %s | %s | %s | %s |
|----|----|----|----|----|----|----|----|----|----|
| %s | %s | %s | %s | %s | %s | %s | %s | %s | %s |
""" % ("leastFreqYear","mostFreqYear","leastFreqMonth","mostFreqMonth","leastFreqDayOfMonth", "mostFreqDayOfMonth", "leastFreqDayOfWeek", "mostFreqDayOfWeek", "leastFreqHourOfDay","mostFreqHourOfDay", \
       "%d (%d occurrences)" % (leastFreqYear["year"], leastFreqMonth["Total"]), \
       "%d (%d occurrences)" % (mostFreqYear["year"], mostFreqMonth["Total"]), \
       "%d (%d occurrences)" % (leastFreqMonth["month"], leastFreqMonth["Total"]), \
       "%d (%d occurrences)" % (mostFreqMonth["month"], mostFreqMonth["Total"]), \
       "%d (%d occurrences)" % (leastFreqDayOfMonth["day"], leastFreqDayOfMonth["Total"]), \
       "%d (%d occurrences)" % (mostFreqDayOfMonth["day"], mostFreqDayOfMonth["Total"]), \
       "%d (%d occurrences)" % (leastFreqDayOfWeek["dow"], leastFreqDayOfWeek["Total"]), \
       "%d (%d occurrences)" % (mostFreqDayOfWeek["dow"], mostFreqDayOfWeek["Total"]), \
       "%d (%d occurrences)" % (leastFreqHourOfDay["hour"], leastFreqHourOfDay["Total"]), \
       "%d (%d occurrences)" % (mostFreqHourOfDay["hour"], mostFreqHourOfDay["Total"]))))


Summary of columns Date, Year, Month, DayofMonth, DayOfWeek and HourOfDay:
+-------+------------------+------------------+------------------+------------------+------------------+
|summary|              year|             month|               day|               dow|              hour|
+-------+------------------+------------------+------------------+------------------+------------------+
|  count|            312537|            312537|            312537|            312537|            312537|
|   mean|2014.9068270316795|6.2287185197272645|15.498644960436684| 4.052540979148069|13.392619113896915|
| stddev| 0.797917054303514| 3.229429701861881| 8.826555021840456|1.9903781403914629| 6.622624872723397|
|    min|              2014|                 1|                 1|                 1|                 0|
|    25%|              2014|                 4|                 8|                 2|                 9|
|    50%|              2015|                 6|                15|                 4|


| leastFreqYear | mostFreqYear | leastFreqMonth | mostFreqMonth | leastFreqDayOfMonth | mostFreqDayOfMonth | leastFreqDayOfWeek | mostFreqDayOfWeek | leastFreqHourOfDay | mostFreqHourOfDay |
|----|----|----|----|----|----|----|----|----|----|
| 2016 (16924 occurrences) | 2014 (30112 occurrences) | 12 (16924 occurrences) | 7 (30112 occurrences) | 31 (5642 occurrences) | 1 (13184 occurrences) | 1 (42110 occurrences) | 6 (47642 occurrences) | 5 (3613 occurrences) | 19 (20027 occurrences) |


### B. Basic profiling: Location related columns

In [27]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first

print ("Summary of columns Block, LocDescription, Beat, District, Police Department, Ward, Community Area, Area Name and Side of Chicago:")
crimeDF.select("block", "location_description", "beat", "district","districtName","ward", "community_area","communityAreaName", "communityAreaSide").summary().show()

print("Checking for nulls on columns Block, LocationDescription, Beat, District, Police Department, Ward, CommunityArea Area Name and Side of Chicago:")
crimeDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["block", "location_description", "beat", "district","districtName","ward", "community_area","communityAreaName","communityAreaSide"]]).show()


print("Checking amount of distinct values in columns Block, LocationDescription, Beat, District, Police Department, Ward, Community Area, Area Name and Side of Chicago:")
crimeDF.select([countDistinct(c).alias(c) for c in ["block", "location_description", "beat", "district", "ward", "community_area","communityAreaName","communityAreaSide"]]).show()

print ("Most and least frequent occurrences for Block, LocationDescription, Beat, District, Police Department, Ward, CommunityArea, Area Name and Side of Chicago:")
blockDF         = crimeDF.groupBy("block").agg(count(lit(1)).alias("Total"))
locDescDF       = crimeDF.groupBy("location_description").agg(count(lit(1)).alias("Total"))
beatDF          = crimeDF.groupBy("beat").agg(count(lit(1)).alias("Total"))
districtDF      = crimeDF.groupBy("district").agg(count(lit(1)).alias("Total"))
wardDF          = crimeDF.groupBy("ward").agg(count(lit(1)).alias("Total"))
communityAreaDF = crimeDF.groupBy("community_area").agg(count(lit(1)).alias("Total"))

deptDF          = crimeDF.groupBy("districtName").agg(count(lit(1)).alias("Total"))
commAreaNameDF  = crimeDF.groupBy("communityAreaName").agg(count(lit(1)).alias("Total"))
commAreaSideDF  = crimeDF.groupBy("communityAreaSide").agg(count(lit(1)).alias("Total"))


leastFreqBlock          = blockDF.orderBy(col("Total").asc()).first()
mostFreqBlock           = blockDF.orderBy(col("Total").desc()).first()
leastFreqLocationDesc   = locDescDF.orderBy(col("Total").asc()).first()
mostFreqLocationDesc    = locDescDF.orderBy(col("Total").desc()).first()
leastFreqBeat           = beatDF.orderBy(col("Total").asc()).first()
mostFreqBeat            = beatDF.orderBy(col("Total").desc()).first()

leastFreqDistrict       = districtDF.orderBy(col("Total").asc()).first()
mostFreqDistrict        = districtDF.orderBy(col("Total").desc()).first()
leastFreqWard           = wardDF.orderBy(col("Total").asc()).first()
mostFreqWard            = wardDF.orderBy(col("Total").desc()).first()
leastFreqCommunityArea  = communityAreaDF.orderBy(col("Total").asc()).first()
mostFreqCommunityArea   = communityAreaDF.orderBy(col("Total").desc()).first()

leastFreqDeptName       = deptDF.orderBy(col("Total").asc()).first()
mostFreqDeptName        = deptDF.orderBy(col("Total").desc()).first()
leastFreqCommName       = commAreaNameDF.orderBy(col("Total").asc()).first()
mostFreqCommName        = commAreaNameDF.orderBy(col("Total").desc()).first()
leastFreqSide           = commAreaSideDF.orderBy(col("Total").asc()).first()
mostFreqSide            = commAreaSideDF.orderBy(col("Total").desc()).first()

display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqBlock", "mostFreqBlock", "leastFreqLocationDesc", "mostFreqLocationDesc", \
       "%s (%d occurrences)" % (leastFreqBlock["block"], leastFreqBlock["Total"]), \
       "%s (%d occurrences)" % (mostFreqBlock["block"], mostFreqBlock["Total"]), \
       "%s (%d occurrences)" % (leastFreqLocationDesc["location_description"], leastFreqLocationDesc["Total"]), \
       "%s (%d occurrences)" % (mostFreqLocationDesc["location_description"], mostFreqLocationDesc["Total"]))))
display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqBeat", "mostFreqBeat", "leastFreqDistrict", "mostFreqDistrict", \
       "%d (%d occurrences)" % (leastFreqBeat["beat"], leastFreqBeat["Total"]), \
       "%d (%d occurrences)" % (mostFreqBeat["beat"], mostFreqBeat["Total"]), \
       "%d (%d occurrences)" % (leastFreqDistrict["district"], leastFreqDistrict["Total"]), \
       "%d (%d occurrences)" % (mostFreqDistrict["district"], mostFreqDistrict["Total"]))))
display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqWard", "mostFreqWard", "leastFreqCommunityArea", "mostFreqCommunityArea", \
       "%s (%d occurrences)" % (leastFreqWard["ward"], leastFreqWard["Total"]), \
       "%s (%d occurrences)" % (mostFreqWard["ward"], mostFreqWard["Total"]), \
       "%s (%d occurrences)" % (leastFreqCommunityArea["community_area"], leastFreqCommunityArea["Total"]), \
       "%s (%d occurrences)" % (mostFreqCommunityArea["community_area"], mostFreqCommunityArea["Total"]))))
display(Markdown("""
| %s | %s | %s | %s | %s | %s |
|----|----|----|----|----|----|
| %s | %s | %s | %s | %s | %s |
""" % ("leastFreqDeptName", "mostFreqDeptName ", "leastFreqCommName", "mostFreqCommName","leastFreqSide","mostFreqSide", \
       "%s (%d occurrences)" % (leastFreqDeptName["districtName"], leastFreqDeptName["Total"]), \
       "%s (%d occurrences)" % (mostFreqDeptName ["districtName"], mostFreqDeptName["Total"]), \
       "%s (%d occurrences)" % (leastFreqCommName["communityAreaName"], leastFreqCommName["Total"]), \
       "%s (%d occurrences)" % (mostFreqCommName["communityAreaName"], mostFreqCommName["Total"]), \
       "%s (%d occurrences)" % (leastFreqSide["communityAreaSide"], leastFreqSide["Total"]),
       "%s (%d occurrences)" % (mostFreqSide["communityAreaSide"], mostFreqSide["Total"]))))

Summary of columns Block, LocDescription, Beat, District, Police Department, Ward, Community Area, Area Name and Side of Chicago:
+-------+------------------+--------------------+-----------------+------------------+------------+------------------+-----------------+-----------------+-----------------+
|summary|             block|location_description|             beat|          district|districtName|              ward|   community_area|communityAreaName|communityAreaSide|
+-------+------------------+--------------------+-----------------+------------------+------------+------------------+-----------------+-----------------+-----------------+
|  count|            312537|              311629|           312537|            312537|      312537|            312537|           312537|           312537|           312537|
|   mean|              null|                null|1141.745745303756| 11.18682587981583|        null|22.630859706210785|38.15121729587217|             null|             null|
| std


| leastFreqBlock | mostFreqBlock | leastFreqLocationDesc | mostFreqLocationDesc |
|----|----|----|----|
| 017XX W MADISON ST (1 occurrences) | 0000X W TERMINAL ST (725 occurrences) | DELIVERY TRUCK (2 occurrences) | STREET (77396 occurrences) |



| leastFreqBeat | mostFreqBeat | leastFreqDistrict | mostFreqDistrict |
|----|----|----|----|
| 1655 (194 occurrences) | 1533 (3827 occurrences) | 20 (4627 occurrences) | 11 (31023 occurrences) |



| leastFreqWard | mostFreqWard | leastFreqCommunityArea | mostFreqCommunityArea |
|----|----|----|----|
| 48 (2396 occurrences) | 28 (20887 occurrences) | 9 (349 occurrences) | 25 (25144 occurrences) |



| leastFreqDeptName | mostFreqDeptName  | leastFreqCommName | mostFreqCommName | leastFreqSide | mostFreqSide |
|----|----|----|----|----|----|
| Lincoln (4627 occurrences) | Harrison (31023 occurrences) | Edison Park (349 occurrences) | Austin (25144 occurrences) | Central (16622 occurrences) | West side (89922 occurrences) |


### C. Basic Profiling: Crime related columns

In [28]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first

print ("Summary of columns Id, CaseNumber, IUCRCode, IndexCode, FBICode, PrimaryType and CrimeDescription:")
crimeDF.select("id", "case_number", "iucr","index_code", "fbi_code", "primary_type", "description").summary().show()

print ("Summary of columns Domestic and Arrest:")
crimeDF.select("arrest","domestic").summary().show()

print("Checking for nulls on Id, CaseNumber, IUCRCode, FBICode, PrimaryType, CrimeDescription, Domestic, Arrest:")
crimeDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["id", "case_number", "iucr", "fbi_code", "primary_type", "description","domestic","arrest"]]).show()

print("Checking amount of distinct values in columns Id, CaseNumber, IUCRCode, IndexCode, FBICode, PrimaryType, CrimeDescription, Domestic, Arrest:")
crimeDF.select([countDistinct(c).alias(c) for c in ["id", "case_number", "iucr","index_code","fbi_code", "primary_type", "description","domestic","arrest"]]).show()

print ("Most and least frequent occurrences for IUCRCode, IndexCode, FBICode, PrimaryType, CrimeDescription, Domestic and Arrest:")
iucrCodeDF         = crimeDF.groupBy("iucr").agg(count(lit(1)).alias("Total"))
indexCodeDF        = crimeDF.groupBy("index_code").agg(count(lit(1)).alias("Total"))
fbiCodeDF          = crimeDF.groupBy("fbi_code").agg(count(lit(1)).alias("Total"))
primaryTypeDF      = crimeDF.groupBy("primary_type").agg(count(lit(1)).alias("Total"))
crimeDescriptionDF = crimeDF.groupBy("description").agg(count(lit(1)).alias("Total"))
domesticDF         = crimeDF.groupBy("domestic").agg(count(lit(1)).alias("Total"))
arrestDF           = crimeDF.groupBy("arrest").agg(count(lit(1)).alias("Total"))

leastFreqIucrCode         = iucrCodeDF.orderBy(col("Total").asc()).first()
mostFreqIucrCode          = iucrCodeDF.orderBy(col("Total").desc()).first()
leastFreqIndexCode        = indexCodeDF.orderBy(col("Total").asc()).first()
mostFreqIndexCode         = indexCodeDF.orderBy(col("Total").desc()).first()
leastFreqFbiCode          = fbiCodeDF.orderBy(col("Total").asc()).first()
mostFreqFbiCode           = fbiCodeDF.orderBy(col("Total").desc()).first()

leastFreqPrimaryType      = primaryTypeDF.orderBy(col("Total").asc()).first()
mostFreqPrimaryType       = primaryTypeDF.orderBy(col("Total").desc()).first()
leastFreqCrimeDescription = crimeDescriptionDF.orderBy(col("Total").asc()).first()
mostFreqCrimeDescription  = crimeDescriptionDF.orderBy(col("Total").desc()).first()
leastFreqDomestic         = domesticDF.orderBy(col("Total").asc()).first()
mostFreqDomestic          = domesticDF.orderBy(col("Total").desc()).first()
leastFreqArrest           = arrestDF.orderBy(col("Total").asc()).first()
mostFreqArrest            = arrestDF.orderBy(col("Total").desc()).first()

display(Markdown("""
| %s | %s | %s | %s | %s | %s |
|----|----|----|----|----|----|
| %s | %s | %s | %s | %s | %s |
""" % ("leastFreqIucrCode", "mostFreqIucrCode","leastFreqIndexCode","mostFreqIndexCode","leastFreqFbiCode", "mostFreqFbiCode", \
       "%s (%d occurrences)" % (leastFreqIucrCode["iucr"], leastFreqIucrCode["Total"]), \
       "%s (%d occurrences)" % (mostFreqIucrCode["iucr"], mostFreqIucrCode["Total"]), \
       "%s (%d occurrences)" % (leastFreqIndexCode["index_code"], leastFreqIndexCode["Total"]), \
       "%s (%d occurrences)" % (mostFreqIndexCode["index_code"], mostFreqIndexCode["Total"]), \
       "%s (%d occurrences)" % (leastFreqFbiCode["fbi_code"], leastFreqFbiCode["Total"]), \
       "%s (%d occurrences)" % (mostFreqFbiCode["fbi_code"], mostFreqFbiCode["Total"]))))
display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqPrimaryType", "mostFreqPrimaryType", "leastFreqCrimeDescription", "mostFreqCrimeDescription", \
       "%s (%d occurrences)" % (leastFreqPrimaryType["primary_type"], leastFreqPrimaryType["Total"]), \
       "%s (%d occurrences)" % (mostFreqPrimaryType["primary_type"], mostFreqPrimaryType["Total"]), \
       "%s (%d occurrences)" % (leastFreqCrimeDescription["description"], leastFreqCrimeDescription["Total"]), \
       "%s (%d occurrences)" % (mostFreqCrimeDescription["description"], mostFreqCrimeDescription["Total"]))))
display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqDomestic", "mostFreqDomestic", "leastFreqArrest", "mostFreqArrest", \
       "%s (%d occurrences)" % (leastFreqDomestic["domestic"], leastFreqDomestic["Total"]), \
       "%s (%d occurrences)" % (mostFreqDomestic["domestic"], mostFreqDomestic["Total"]), \
       "%s (%d occurrences)" % (leastFreqArrest["arrest"], leastFreqArrest["Total"]), \
       "%s (%d occurrences)" % (mostFreqArrest["arrest"], mostFreqArrest["Total"]))))

Summary of columns Id, CaseNumber, IUCRCode, IndexCode, FBICode, PrimaryType and CrimeDescription:
+-------+--------------------+-----------+------------------+----------+-----------------+-----------------+--------------------+
|summary|                  id|case_number|              iucr|index_code|         fbi_code|     primary_type|         description|
+-------+--------------------+-----------+------------------+----------+-----------------+-----------------+--------------------+
|  count|              312537|     312537|            312537|    312537|           312537|           312537|              312537|
|   mean|1.0090193063195718E7|   161884.0|1872.6413737008586|      null|17.15632679155318|             null|                null|
| stddev|   368486.3256183279|        NaN| 958.0193833000208|      null|6.094784938055646|             null|                null|
|    min|             9446758|     161884|              031A|         I|               02|            ARSON|ABUSE/NEGLECT


| leastFreqIucrCode | mostFreqIucrCode | leastFreqIndexCode | mostFreqIndexCode | leastFreqFbiCode | mostFreqFbiCode |
|----|----|----|----|----|----|
| 4740 (1 occurrences) | 1320 (38883 occurrences) | I (25750 occurrences) | N (286787 occurrences) | 12 (119 occurrences) | 14 (82036 occurrences) |



| leastFreqPrimaryType | mostFreqPrimaryType | leastFreqCrimeDescription | mostFreqCrimeDescription |
|----|----|----|----|
| NON-CRIMINAL (3 occurrences) | CRIMINAL DAMAGE (82036 occurrences) | COMPELLING CONFESSION (1 occurrences) | TO VEHICLE (40883 occurrences) |



| leastFreqDomestic | mostFreqDomestic | leastFreqArrest | mostFreqArrest |
|----|----|----|----|
| True (29958 occurrences) | False (282579 occurrences) | True (116964 occurrences) | False (195573 occurrences) |


### D. Basic Profiling: Socio-economic factors columns

In [26]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first

print ("Summary of columns BelowPovertyLevel, CrowdedHousing, NoHighSchoolDiploma, perCapitaIncome and Unemployment:")
crimeDF.select("belowPovertyLevel", "crowdedHousing", "noHighSchoolDiploma", "perCapitaIncome","unemployment").summary().show()

print("Checking for nulls on columns BelowPovertyLevel, CrowdedHousing, NoHighSchoolDiploma, perCapitaIncome and Unemployment:")
crimeDF.select([count(when(col(c).isNull(), c)).alias(c) for c in ["belowPovertyLevel", "crowdedHousing", "noHighSchoolDiploma", "perCapitaIncome","unemployment"]]).show()


print("Checking amount of distinct values in columns BelowPovertyLevel, CrowdedHousing, NoHighSchoolDiploma, perCapitaIncome and Unemployment:")
crimeDF.select([countDistinct(c).alias(c) for c in ["belowPovertyLevel", "crowdedHousing", "noHighSchoolDiploma", "perCapitaIncome","unemployment"]]).show()

print ("Most and least frequent occurrences for BelowPovertyLevel, CrowdedHousing, NoHighSchoolDiploma, perCapitaIncome and Unemployment columns:")
belowPovertyLevelDF = crimeDF.groupBy("belowPovertyLevel").agg(count(lit(1)).alias("Total"))
crowdedHousingDF = crimeDF.groupBy("crowdedHousing").agg(count(lit(1)).alias("Total"))
noHighSchoolDiplomaDF = crimeDF.groupBy("noHighSchoolDiploma").agg(count(lit(1)).alias("Total"))
perCapitaIncomeDF = crimeDF.groupBy("perCapitaIncome").agg(count(lit(1)).alias("Total"))
unemploymentDF = crimeDF.groupBy("unemployment").agg(count(lit(1)).alias("Total"))

leastFreqPovertyLevel  = belowPovertyLevelDF.orderBy(col("Total").asc()).first()
mostFreqPovertyLevel   = belowPovertyLevelDF.orderBy(col("Total").desc()).first()

leastFreqCrowdHousing  = crowdedHousingDF.orderBy(col("Total").asc()).first()
mostFreqCrowdHousing   = crowdedHousingDF.orderBy(col("Total").desc()).first()

leastFreqNoHighSchool  = noHighSchoolDiplomaDF.orderBy(col("Total").asc()).first()
mostFreqNoHighSchool   = noHighSchoolDiplomaDF.orderBy(col("Total").desc()).first()

leastFreqPerCapitaInc  = perCapitaIncomeDF.orderBy(col("Total").asc()).first()
mostFreqPerCapitaInc   = perCapitaIncomeDF.orderBy(col("Total").desc()).first()

leastFreqUnemployment  = unemploymentDF.orderBy(col("Total").asc()).first()
mostFreqUnemployment   = unemploymentDF.orderBy(col("Total").desc()).first()

display(Markdown("""
| %s | %s | %s | %s | %s | %s |
|----|----|----|----|----|----|
| %s | %s | %s | %s | %s | %s |
""" % ("leastFreqPovertyLevel", "mostFreqPovertyLevel", "leastFreqCrowdHousing", "mostFreqCrowdHousing","leastFreqUnemployment","mostFreqUnemployment", \
       "%d (%d occurrences)" % (leastFreqPovertyLevel["belowPovertyLevel"], leastFreqPovertyLevel["Total"]), \
       "%d (%d occurrences)" % (mostFreqPovertyLevel["belowPovertyLevel"], mostFreqPovertyLevel["Total"]), \
       "%d (%d occurrences)" % (leastFreqCrowdHousing["crowdedHousing"], leastFreqCrowdHousing["Total"]), \
       "%d (%d occurrences)" % (mostFreqCrowdHousing["crowdedHousing"], mostFreqCrowdHousing["Total"]), \
       "%d (%d occurrences)" % (leastFreqUnemployment["unemployment"], leastFreqUnemployment["Total"]), \
       "%d (%d occurrences)" % (mostFreqUnemployment["unemployment"], mostFreqUnemployment["Total"]))))
display(Markdown("""
| %s | %s | %s | %s |
|----|----|----|----|
| %s | %s | %s | %s |
""" % ("leastFreqNoHighSchool", "mostFreqNoHighSchool", "leastFreqPerCapitaInc", "mostFreqPerCapitaInc", \
       "%d (%d occurrences)" % (leastFreqNoHighSchool["noHighSchoolDiploma"], leastFreqNoHighSchool["Total"]), \
       "%d (%d occurrences)" % (mostFreqNoHighSchool["noHighSchoolDiploma"], mostFreqNoHighSchool["Total"]), \
       "%d (%d occurrences)" % (leastFreqPerCapitaInc["perCapitaIncome"], leastFreqPerCapitaInc["Total"]), \
       "%d (%d occurrences)" % (mostFreqPerCapitaInc["perCapitaIncome"], mostFreqPerCapitaInc["Total"]))))

Summary of columns BelowPovertyLevel, CrowdedHousing, NoHighSchoolDiploma, perCapitaIncome and Unemployment:
+-------+------------------+------------------+-------------------+------------------+------------------+
|summary| belowPovertyLevel|    crowdedHousing|noHighSchoolDiploma|   perCapitaIncome|      unemployment|
+-------+------------------+------------------+-------------------+------------------+------------------+
|  count|            312537|            312537|             312537|            312537|            312537|
|   mean| 24.14373882132154| 5.501370717706838|  22.91644989233197|23438.705772436544|15.132684770125152|
| stddev|10.005733134362101|3.5845176042272247| 11.663855138910087|16725.441376752173| 7.043323214748578|
|    min|               3.1|               0.2|                2.9|              8535|               4.2|
|    25%|              15.7|               2.9|               14.9|             13596|               9.3|
|    50%|              24.5|               


| leastFreqPovertyLevel | mostFreqPovertyLevel | leastFreqCrowdHousing | mostFreqCrowdHousing | leastFreqUnemployment | mostFreqUnemployment |
|----|----|----|----|----|----|
| 5 (349 occurrences) | 27 (25144 occurrences) | 5 (483 occurrences) | 5 (25144 occurrences) | 7 (349 occurrences) | 21 (26424 occurrences) |



| leastFreqNoHighSchool | mostFreqNoHighSchool | leastFreqPerCapitaInc | mostFreqPerCapitaInc |
|----|----|----|----|
| 8 (349 occurrences) | 25 (25144 occurrences) | 38337 (349 occurrences) | 15920 (25144 occurrences) |


# 5. Business Question Analysis

### A1. What periods of the day are of concern for violent crimes?

In [24]:
from pyspark.sql.functions import col, count, round

# Period of the day is going to be categorized as required for the shifts of the determined for the police stations:

#   "Early morning"   - 00h00 - 06h00
#   "Morning"         - 06h00 - 12h00
#   "Afternoon"       - 12h00 - 18h00
#   "Evening"         - 18h00 - 00h00 
totalCrimeDF = crimeDF.count()
PeriodDayDF = crimeDF\
   .withColumn("Period_of_Day", when(((col("hour").cast(IntegerType()))>=0) & ((col("hour").cast(IntegerType()))<=6),"Early Morning")\
                               .when(((col("hour").cast(IntegerType()))>6) & ((col("hour").cast(IntegerType()))<=12),"Morning")\
                               .when(((col("hour").cast(IntegerType()))>12) & ((col("hour").cast(IntegerType()))<=18),"Afternoon")\
                               .when(((col("hour").cast(IntegerType()))>18) & ((col("hour").cast(IntegerType()))<=23),"Evening"))

PeriodDayDF.cache() # optimizing the process
print("A summary of the the Number of Indexed/Violent(FBI referenced) Crimes for each period of the day:")

PeriodDayDF.where(col("index_code")=="I")\
           .select("Period_of_Day","hour")\
           .groupBy("Period_of_Day")\
           .agg(count("Period_of_Day").alias("Num_of_Violent_Crimes"),\
                  (count("Period_of_Day")/totalCrimeDF*100).alias("Rate"))\
           .orderBy(col("Num_of_Violent_Crimes").desc())\
           .select("Period_of_Day","Num_of_Violent_Crimes",round("Rate",2).alias("Rate_of_Violent_Crimes")).show()


A summary of the the Number of Indexed/Violent(FBI referenced) Crimes for each period of the day:
+-------------+---------------------+----------------------+
|Period_of_Day|Num_of_Violent_Crimes|Rate_of_Violent_Crimes|
+-------------+---------------------+----------------------+
|      Evening|                 8025|                  2.57|
|Early Morning|                 6985|                  2.23|
|    Afternoon|                 6795|                  2.17|
|      Morning|                 3945|                  1.26|
+-------------+---------------------+----------------------+



### A2. What periods of the day are of concern for non-violent crimes?

In [25]:
from pyspark.sql.functions import col, count, round

totalCrimeDF = crimeDF.count()
PeriodDayDF = crimeDF\
   .withColumn("Period_of_Day", when(((col("hour").cast(IntegerType()))>=0) & ((col("hour").cast(IntegerType()))<=6),"Early Morning")\
                               .when(((col("hour").cast(IntegerType()))>6) & ((col("hour").cast(IntegerType()))<=12),"Morning")\
                               .when(((col("hour").cast(IntegerType()))>12) & ((col("hour").cast(IntegerType()))<=18),"Afternoon")\
                               .when(((col("hour").cast(IntegerType()))>18) & ((col("hour").cast(IntegerType()))<=23),"Evening"))

PeriodDayDF.cache() # optimizing the process
print("A summary of the Number of Non-Indexed/Non-Violent Crimes for each period of the day:")

PeriodDayDF.where(col("index_code")=="N")\
           .select("Period_of_Day","hour")\
           .groupBy("Period_of_Day")\
           .agg(count("Period_of_Day").alias("Num_of_Non_Violent_Crimes"),\
                  (count("Period_of_Day")/totalCrimeDF*100).alias("Rate"))\
           .orderBy(col("Num_of_Non_Violent_Crimes").desc())\
           .select("Period_of_Day","Num_of_Non_Violent_Crimes",round("Rate",2).alias("Rate_of_Non_Violent_Crimes")).show()


A summary of the Number of Non-Indexed/Non-Violent Crimes for each period of the day:
+-------------+-------------------------+--------------------------+
|Period_of_Day|Num_of_Non_Violent_Crimes|Rate_of_Non_Violent_Crimes|
+-------------+-------------------------+--------------------------+
|    Afternoon|                    86009|                     27.52|
|      Evening|                    79629|                     25.48|
|      Morning|                    76941|                     24.62|
|Early Morning|                    44208|                     14.14|
+-------------+-------------------------+--------------------------+



In [44]:
crimeDF.select(col("primary_type")).distinct().show(30,truncate=False)

+---------------------------------+
|primary_type                     |
+---------------------------------+
|ARSON                            |
|ASSAULT                          |
|BATTERY                          |
|CONCEALED CARRY LICENSE VIOLATION|
|CRIMINAL DAMAGE                  |
|CRIMINAL TRESPASS                |
|DECEPTIVE PRACTICE               |
|GAMBLING                         |
|HUMAN TRAFFICKING                |
|INTERFERENCE WITH PUBLIC OFFICER |
|INTIMIDATION                     |
|KIDNAPPING                       |
|LIQUOR LAW VIOLATION             |
|NARCOTICS                        |
|NON-CRIMINAL                     |
|OBSCENITY                        |
|OFFENSE INVOLVING CHILDREN       |
|OTHER NARCOTIC VIOLATION         |
|OTHER OFFENSE                    |
|PROSTITUTION                     |
|PUBLIC INDECENCY                 |
|PUBLIC PEACE VIOLATION           |
|ROBBERY                          |
|SEX OFFENSE                      |
|WEAPONS VIOLATION          

### B1. Which sides of Chicago are of concern for violent crimes?

In [26]:
from pyspark.sql.functions import col, count, round

# The sides of Chicago's which encompass the 77 community areas and subsequent neighbourhoods will combined by joining:
# Far Southwest Side + Far Southeast side + Northwest side + Southwest side => SouthSide
# Far North Side +  Far Southeast side + North side => NorthSide
# Westside + Central => Central 

totCrimeDF = crimeDF.count()
SideofChicagoDF = crimeDF\
   .withColumn("Side_of_Chicago", when((col("communityAreaSide")=="Far Southwest side") | (col("communityAreaSide")=="Far Southeast side") | (col("communityAreaSide")=="Southwest side") | (col("communityAreaSide")=="South side"),"Southside")\
                                 .when((col("communityAreaSide")=="Far North side") | (col("communityAreaSide")=="North side") | (col("communityAreaSide")=="Northwest side"),"Northside")\
                                 .when((col("communityAreaSide")=="Central") | (col("communityAreaSide") =="West side"),"Central"))\

SideofChicagoDF.cache() # optimizing the process
print("A summary of the Number of Indexed/Violent Crimes for each side of Chicago:")

SideofChicagoDF.where(col("index_code")=="I")\
           .select("Side_of_Chicago","communityAreaSide")\
           .groupBy("Side_of_Chicago")\
           .agg(count("Side_of_Chicago").alias("Num_of_Violent_Crimes"),\
                  (count("Side_of_Chicago")/totCrimeDF*100).alias("Rate"))\
           .orderBy(col("Num_of_Violent_Crimes").desc())\
           .select("Side_of_Chicago","Num_of_Violent_Crimes",round("Rate",2).alias("Rate_of_Violent_Crimes")).show()

A summary of the Number of Indexed/Violent Crimes for each side of Chicago:
+---------------+---------------------+----------------------+
|Side_of_Chicago|Num_of_Violent_Crimes|Rate_of_Violent_Crimes|
+---------------+---------------------+----------------------+
|      Southside|                14345|                  4.59|
|        Central|                 8194|                  2.62|
|      Northside|                 3211|                  1.03|
+---------------+---------------------+----------------------+



### B2. Which sides of Chicago are of concern for Non-Violent/Non-violent crimes?

In [41]:
totCrimeDF = crimeDF.count()
SideofChicagoDF = crimeDF\
   .withColumn("Side_of_Chicago", when((col("communityAreaSide")=="Far Southwest side") | (col("communityAreaSide")=="Far Southeast side") | (col("communityAreaSide")=="Southwest side") | (col("communityAreaSide")=="South side"),"Southside")\
                                 .when((col("communityAreaSide")=="Far North side") | (col("communityAreaSide")=="North side") | (col("communityAreaSide")=="Northwest side"),"Northside")\
                                 .when((col("communityAreaSide")=="Central") | (col("communityAreaSide") =="West side"),"Central"))\

SideofChicagoDF.cache() # optimizing the process
print("A summary of the Number of Non-Indexed/non-violent Crimes for each side of Chicago:")

SideofChicagoDF.where(col("index_code")=="N")\
           .select("Side_of_Chicago","communityAreaSide")\
           .groupBy("Side_of_Chicago")\
           .agg(count("Side_of_Chicago").alias("Num_of_Non_Violent_Crimes"),\
                  (count("Side_of_Chicago")/totCrimeDF*100).alias("Rate"))\
           .orderBy(col("Num_of_Non_Violent_Crimes").desc())\
           .select("Side_of_Chicago","Num_of_Non_Violent_Crimes",round("Rate",2).alias("Rate_of_Non_Violent_Crimes")).show()

A summary of the Number of Non-Indexed/non-violent Crimes for each side of Chicago:
+---------------+-------------------------+--------------------------+
|Side_of_Chicago|Num_of_Non_Violent_Crimes|Rate_of_Non_Violent_Crimes|
+---------------+-------------------------+--------------------------+
|      Southside|                   131487|                     42.07|
|        Central|                    98350|                     31.47|
|      Northside|                    56950|                     18.22|
+---------------+-------------------------+--------------------------+



### C1. Which 10 police station have the highest arrests for domestic violence rates for?

In [38]:
totCrimeDF = crimeDF.where(col("domestic")==True).count()

print("Summary of the arrests which occurred for domestic violence for each Police Station")
crimeDF.where((col("arrest")==True) & ((col("domestic")==True)))\
           .select("districtName","arrest","domestic")\
           .groupBy("districtName")\
           .agg(count("districtName").alias("Domestic Violence Arrest Rate"),\
                  (count("districtName")/totCrimeDF*100).alias("Rate"))\
           .orderBy(col("Domestic Violence Arrest Rate").desc())\
           .select("districtName","Domestic Violence Arrest Rate",round("Rate",2).alias("Domestic Violence Cases/Reported Crimes")).show(10)

Summary of the arrests which occurred for domestic violence for each Police Station
+--------------+-----------------------------+---------------------------------------+
|  districtName|Domestic Violence Arrest Rate|Domestic Violence Cases/Reported Crimes|
+--------------+-----------------------------+---------------------------------------+
| South Chicago|                          384|                                   1.28|
| Grand Central|                          345|                                   1.15|
|       Gresham|                          335|                                   1.12|
|      Harrison|                          317|                                   1.06|
|       Calumet|                          309|                                   1.03|
|     Englewood|                          304|                                   1.01|
|  Chicago Lawn|                          286|                                   0.95|
|        Austin|                          286|

### C2. For those who commit domestic violence, how many are actually arrested (top 10)?

In [40]:
totCrimeDF = crimeDF.where(col("domestic")==True).count()

print("Summary of the domestic violence related crime where no arrest happenned for each Police Stations")
crimeDF.where((col("arrest")==False) & ((col("domestic")==True)))\
           .select("districtName","arrest","domestic")\
           .groupBy("districtName")\
           .agg(count("districtName").alias("Domestic Violence Arrest Rate"),\
                  (count("districtName")/totCrimeDF*100).alias("Rate"))\
           .orderBy(col("Domestic Violence Arrest Rate").desc())\
           .select("districtName","Domestic Violence Arrest Rate",round("Rate",2).alias("Domestic Violence Cases/Reported Crimes")).show(10)

Summary of the domestic violence related crime where no arrest happenned for each Police Stations
+--------------+-----------------------------+---------------------------------------+
|  districtName|Domestic Violence Arrest Rate|Domestic Violence Cases/Reported Crimes|
+--------------+-----------------------------+---------------------------------------+
|       Gresham|                         2080|                                   6.94|
|  Chicago Lawn|                         1862|                                   6.22|
| South Chicago|                         1859|                                   6.21|
|      Harrison|                         1842|                                   6.15|
|     Englewood|                         1801|                                   6.01|
|Grand Crossing|                         1745|                                   5.82|
| Grand Central|                         1535|                                   5.12|
|       Calumet|                

### D. Is there seasonality with robberies?


In [39]:
from pyspark.sql.functions import lower, upper
# Seasons is going to be categorized as required for the shifts of the determined for the police stations:

# According to the 
#   "Winter"   -> (1 December - 28/29 February) 12,1,2
#   "Spring"   -> (1 March - 31 May) 3,4,5
#   "Summer"   -> (1 June  - 31 August) 6,7,8
#   "Autumn"   -> (1 September - 30 November) 9,10,11
totalCrimeDF = crimeDF.count()
quartersDF = crimeDF\
   .withColumn("Seasons", when(((col("month").cast(IntegerType()))==1) | ((col("month").cast(IntegerType()))==2) | ((col("month").cast(IntegerType()))==12),"Winter")\
                               .when(((col("month").cast(IntegerType()))>=3) & ((col("month").cast(IntegerType()))<=5),"Spring")\
                               .when(((col("month").cast(IntegerType()))>=6) & ((col("month").cast(IntegerType()))<=8),"Summer")\
                               .when(((col("month").cast(IntegerType()))>=9) & ((col("month").cast(IntegerType()))<=11),"Autumn"))\


quartersDF.cache() # optimizing the process
print("A summary of the number of the robberies for each season:")

quartersDF.where(col("primary_type")=="ROBBERY")\
          .select("Seasons","month")\
          .groupBy("Seasons")\
          .agg(count("Seasons").alias("Num_of_Robberies"),\
                 (count("Seasons")/totalCrimeDF*100).alias("Rate"))\
          .orderBy(col("Num_of_Robberies").desc())\
          .select("Seasons","Num_of_Robberies",round("Rate",2).alias("Seasonal_Rate_of_Robberies")).show()


A summary of the number of the robberies for each season:
+-------+----------------+--------------------------+
|Seasons|Num_of_Robberies|Seasonal_Rate_of_Robberies|
+-------+----------------+--------------------------+
| Summer|            3445|                       1.1|
| Autumn|            2985|                      0.96|
| Winter|            2832|                      0.91|
| Spring|            2690|                      0.86|
+-------+----------------+--------------------------+



### E. What is the socio-economic profile of the communities most affected by domestic violence?

In [62]:
from pyspark.sql.functions import max, min, avg, stddev
from pyspark.sql.types import IntegerType
profileDF =\
SideofChicagoDF.where((col("domestic")==True))\
       .withColumn("PercbelowPovertyLevel", col("belowPovertyLevel").cast(IntegerType()))\
       .withColumn("crowded_Housing", col("crowdedHousing").cast(IntegerType()))\
       .withColumn("noHighSchool", col("noHighSchoolDiploma").cast(IntegerType()))\
       .withColumn("incomePerCapita", col("perCapitaIncome").cast(IntegerType()))\
       .withColumn("unemploymentRate", col("unemployment").cast(IntegerType()))\
       .select("Side_of_Chicago", "PercbelowPovertyLevel","crowded_Housing","noHighSchool",\
                               "incomePerCapita", "unemploymentRate")
profileDF.cache() # optimization to make the processing faster

display(Markdown("**Statistics of percentage of constituents living below the national poverty Level:**"))
profileDF.groupBy("Side_of_Chicago")\
              .agg(avg("PercbelowPovertyLevel").alias("AveragebelowPovertyLvl"),\
                   min("PercbelowPovertyLevel").alias("LowestbelowPovertyLvl"),\
                   max("PercbelowPovertyLevel").alias("HighestbelowPovertyLvl"),\
                   stddev("PercbelowPovertyLevel").alias("StdDevbelowPovertyLevl"))\
              .orderBy("Side_of_Chicago").show()
display(Markdown("**Statistics of percentage of constituents who live in Crowded Housing:**"))
profileDF.groupBy("Side_of_Chicago")\
              .agg(avg("crowded_Housing").alias("AvgCrowdedHousing"),\
                   min("crowded_Housing").alias("LowestCrowdedHousing"),\
                   max("crowded_Housing").alias("HighestCrowdedHousing"),\
                   stddev("crowded_Housing").alias("StdDevCrowdedHousing"))\
              .orderBy("Side_of_Chicago").show()

display(Markdown("**Statistics of percentage of constituents with no High school diploma:**"))
profileDF.groupBy("Side_of_Chicago")\
              .agg(avg("noHighSchool").alias("AvgNoHighSchool"),\
                   min("noHighSchool").alias("LowestNoHighSchool"),\
                   max("noHighSchool").alias("HighestNoHighSchool"),\
                   stddev("noHighSchool").alias("StdDevNoHighSchool"))\
              .orderBy("Side_of_Chicago").show()

display(Markdown("**Income per Capita stats for those affected by domestic violence:**"))
profileDF.groupBy("Side_of_Chicago")\
              .agg(avg("incomePerCapita").alias("AvgIncomePerCapita"),\
                   min("incomePerCapita").alias("LowestIncomePerCapita"),\
                   max("incomePerCapita").alias("HighestIncomePerCapita"),\
                   stddev("incomePerCapita").alias("StdDevIncomePerCapita"))\
              .orderBy("Side_of_Chicago").show()

display(Markdown("**Unemployment Rate for those affected by domestic violence:**"))
profileDF.groupBy("Side_of_Chicago")\
              .agg(avg("unemploymentRate").alias("AvgUnemploymentRate"),\
                   min("unemploymentRate").alias("LowestUnemploymentRate"),\
                   max("unemploymentRate").alias("HighestUnemploymentRate"),\
                   stddev("unemploymentRate").alias("StdDevUnemploymentRate"))\
              .orderBy("Side_of_Chicago").show()


**Statistics of percentage of constituents living below the national poverty Level:**

+---------------+----------------------+---------------------+----------------------+----------------------+
|Side_of_Chicago|AveragebelowPovertyLvl|LowestbelowPovertyLvl|HighestbelowPovertyLvl|StdDevbelowPovertyLevl|
+---------------+----------------------+---------------------+----------------------+----------------------+
|        Central|     28.96376089663761|                   11|                    40|     8.254457194277899|
|      Northside|    14.262842294623226|                    5|                    22|     4.833066217722259|
|      Southside|    25.313264401772525|                    3|                    61|     9.414529471528521|
+---------------+----------------------+---------------------+----------------------+----------------------+



**Statistics of percentage of constituents who live in Crowded Housing:**

+---------------+-----------------+--------------------+---------------------+--------------------+
|Side_of_Chicago|AvgCrowdedHousing|LowestCrowdedHousing|HighestCrowdedHousing|StdDevCrowdedHousing|
+---------------+-----------------+--------------------+---------------------+--------------------+
|        Central|6.744582814445828|                   1|                   17|    3.76917011249918|
|      Northside|5.003997601439137|                   0|                   11|   3.085348644657619|
|      Southside|4.133234859675037|                   0|                   17|  3.1317723156200983|
+---------------+-----------------+--------------------+---------------------+--------------------+



**Statistics of percentage of constituents with no High school diploma:**

+---------------+------------------+------------------+-------------------+------------------+
|Side_of_Chicago|   AvgNoHighSchool|LowestNoHighSchool|HighestNoHighSchool|StdDevNoHighSchool|
+---------------+------------------+------------------+-------------------+------------------+
|        Central| 26.46027397260274|                 3|                 58|12.330175706582406|
|      Northside|20.092344593244054|                 2|                 41|10.392894806289652|
|      Southside|22.360236336779913|                 4|                 54|  9.70613671827135|
+---------------+------------------+------------------+-------------------+------------------+



**Income per Capita stats for those affected by domestic violence:**

+---------------+------------------+---------------------+----------------------+---------------------+
|Side_of_Chicago|AvgIncomePerCapita|LowestIncomePerCapita|HighestIncomePerCapita|StdDevIncomePerCapita|
+---------------+------------------+---------------------+----------------------+---------------------+
|        Central| 22107.09887920299|                10697|                 87163|    18408.65010712782|
|      Northside| 27861.60663601839|                15246|                 71403|   12245.018324692817|
|      Southside|17586.809098966027|                 8535|                 40107|    5440.463682533126|
+---------------+------------------+---------------------+----------------------+---------------------+



**Unemployment Rate for those affected by domestic violence:**

+---------------+-------------------+----------------------+-----------------------+----------------------+
|Side_of_Chicago|AvgUnemploymentRate|LowestUnemploymentRate|HighestUnemploymentRate|StdDevUnemploymentRate|
+---------------+-------------------+----------------------+-----------------------+----------------------+
|        Central| 15.569987546699876|                     4|                     25|     5.888554468845799|
|      Northside|  8.176693983609834|                     4|                     12|    2.1752397193328266|
|      Southside| 17.913382570162483|                     6|                     40|    6.2388876340969075|
+---------------+-------------------+----------------------+-----------------------+----------------------+

