#Vancouver Crime Analysis Program
#####Author: Luke Hansen
---
<b>What - </b>This notebook will report the top crimes in popular Vancouver neighbourhoods, predict the number of certain crimes, and attempt to find a correlation between crime rates and housing values.

---
<b>Why - </b>The goal of this project is to teach myself:
- basic big data principles
- Python 3
- Apache Spark
- Apache Kafka
- the Databricks platform

---

<b>How - </b>Datasets Used:

The Vancouver Police Department's CSV dataset (2003 - present): https://data.vancouver.ca/datacatalogue/crime-data.htm

The Canadian Housing Price Index from RPS Real Solutions (2005 - present): https://www.rpsrealsolutions.com/public-release/hpi/201906/BRPS_HPI_Download_201906.xlsx

---

<b>Conclusions - </b>

---

####Contents:
1. Crime in Vancouver
2. Housing Values in Vancouver
3. Correlations Between Crime Rates and Housing Values

##1. Crime in Vancouver

The first task is to import the crime data from DBFS into the notebook. (NOTE: I plan to use Apache Kafka to fetch the data in real time on each execution of this notebook, but for now I manually download the file and upload it to Databricks.)

In [4]:
# import crime csv file from DBFS
crime_path = "/FileStore/tables/crime_csv_all_years.csv"
crime_dataset = spark.read.csv(crime_path, inferSchema=True, header=True)

crime_dataset.show(20)

# template to be used later to override the inferred schema types
# (the example fields and path below come from a Wikipedia pagecounts sample, not the crime data)
#from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
#
#customSchema = StructType([
#    StructField("project", StringType(), True),
#    StructField("article", StringType(), True),
#    StructField("requests", IntegerType(), True),
#    StructField("bytes_served", DoubleType(), True)])
#
#pagecount = (spark.read.format("com.databricks.spark.csv")
#         .option("delimiter", " ")
#         .option("quote", "")
#         .option("header", "false")
#         .schema(customSchema)
#         .load("dbfs:/databricks-datasets/wikipedia-datasets/data-001/pagecounts/sample/pagecounts-20151124-170000"))


Now that I have the full dataset, I can trim the columns that I won't need for now. For privacy reasons, the Vancouver Police Department has nulled out the time and place of all violent crime records, so I won't include those in these neighbourhood-specific results.

In [6]:
# keep only YEAR, NEIGHBOURHOOD, and TYPE, and drop rows that have no neighbourhood
crime_df = (crime_dataset
            .select("YEAR", "NEIGHBOURHOOD", "TYPE")
            .filter(crime_dataset["NEIGHBOURHOOD"].isNotNull())
            .orderBy("YEAR"))
crime_df.show(20)

####Widget Control
This modified dataframe can now be used for the neighbourhood widget, so that the user can control and explore different neighbourhoods and their crime statistics.

In [8]:
# widget control: populate the dropdown with the distinct neighbourhood names
neighbourhoods_list = crime_df.select("NEIGHBOURHOOD").distinct().rdd.flatMap(lambda x: x).collect()
dbutils.widgets.dropdown("neighbourhood", "Kitsilano", neighbourhoods_list)

####Number of Crimes in Area
The crimes can now be filtered to the neighbourhood that is currently selected in the widget, and a count can be queried.

In [10]:
# keep only the crimes in the neighbourhood selected in the widget
neighbourhood = dbutils.widgets.get("neighbourhood")
crimes = crime_df.filter(crime_df["NEIGHBOURHOOD"] == neighbourhood)

# count the number of crimes in the chosen neighbourhood
num_crimes = crimes.count()
print(f"There have been {num_crimes} crimes in {neighbourhood} since the year 2003.")

####Line graph of the number of crimes per year in this area

In [12]:
# TODO: plot this as a line graph showing the number of crimes each year since 2003
# 2019 is excluded; rows are ordered by count here (order by YEAR instead when plotting)
crimes_per_year = crimes.filter(crimes["YEAR"] != 2019).groupBy("YEAR").count().orderBy("count")
display(crimes_per_year)

YEAR,count
2009,239
2010,257
2008,271
2013,275
2007,334
2006,336
2012,349
2011,351
2015,371
2014,376


In [13]:
from pyspark.sql.functions import desc

# rank crime types in the chosen neighbourhood, most frequent first
top_crimes = crimes.groupBy("TYPE").count().orderBy(desc("count"))
display(top_crimes)

TYPE,count
Theft from Vehicle,2086
Break and Enter Residential/Other,2000
Mischief,717
Vehicle Collision or Pedestrian Struck (with Injury),650
Theft of Vehicle,389
Theft of Bicycle,159
Break and Enter Commercial,150
Other Theft,27
Vehicle Collision or Pedestrian Struck (with Fatality),7


####All of Vancouver
Now we can see what the total of all crimes across the city, including violent crimes, looks like.

In [15]:
# TODO: make a line graph of all of Vancouver's crime, maybe split by month?
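
One possible way to approach the monthly split is sketched below in plain Python rather than Spark, so the grouping logic is easy to follow. It assumes each record carries `YEAR` and `MONTH` fields, which this notebook has not yet confirmed, and the sample rows and the `crimes_per_month` helper are made up purely for illustration.

```python
from collections import Counter

def crimes_per_month(records):
    """Count records for each (year, month) pair, sorted chronologically."""
    counts = Counter((r["YEAR"], r["MONTH"]) for r in records)
    return sorted(counts.items())

# Hypothetical sample rows, not taken from the real dataset.
# The MONTH column is an assumption about the CSV layout.
sample_records = [
    {"YEAR": 2017, "MONTH": 1, "TYPE": "Mischief"},
    {"YEAR": 2017, "MONTH": 1, "TYPE": "Theft of Bicycle"},
    {"YEAR": 2017, "MONTH": 2, "TYPE": "Other Theft"},
    {"YEAR": 2018, "MONTH": 1, "TYPE": "Mischief"},
]

monthly_counts = crimes_per_month(sample_records)
for (year, month), count in monthly_counts:
    print(f"{year}-{month:02d}: {count}")
```

In Spark, the same grouping would presumably be `crime_dataset.groupBy("YEAR", "MONTH").count()`, provided those columns exist in the file.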

####How many crimes will there be per year from 2018 until 2028 in this neighbourhood?

In [17]:
# linear regression here
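
Until the Spark version is written, the idea can be sketched in plain Python. The example below fits an ordinary least-squares line to the ten (YEAR, count) rows displayed earlier for the selected neighbourhood and extrapolates it to 2018 through 2028. The `fit_line` helper is a hypothetical name, the ten rows are only part of the full history, and a straight line is an assumption that may fit this data poorly, so treat the output as a rough illustration rather than a forecast.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# The ten (YEAR, count) rows shown in the crimes-per-year output above.
yearly_counts = {
    2006: 336, 2007: 334, 2008: 271, 2009: 239, 2010: 257,
    2011: 351, 2012: 349, 2013: 275, 2014: 376, 2015: 371,
}

years = sorted(yearly_counts)
counts = [yearly_counts[y] for y in years]
slope, intercept = fit_line(years, counts)

# Extrapolate the fitted line to each year from 2018 to 2028 inclusive.
forecast = {year: slope * year + intercept for year in range(2018, 2029)}
for year, predicted in forecast.items():
    print(f"{year}: {predicted:.0f}")
```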

##2. Housing Values in Vancouver

In [19]:
import pandas as pd

# import the HPI data (a CSV copy of the Excel file) from DBFS
hpi_path = "/FileStore/tables/canada_hpi2.csv"
hpi_dataset = spark.read.csv(hpi_path, inferSchema=True, header=False)

# drop the leading rows that have no value in the second column,
# then use the first remaining row as the column headers
hpi_df = hpi_dataset.filter(hpi_dataset["_c1"].isNotNull())
oldColumns = hpi_df.schema.names
newColumns = hpi_df.first()
for i in range(len(oldColumns)):
  hpi_df = hpi_df.withColumnRenamed(str(oldColumns[i]), str(newColumns[i]))

# remove the header row itself, then keep only the Vancouver columns
hpi_df = hpi_df.filter(hpi_df["Date"] != "Date")
hpi_df = hpi_df.select("Date", "Vancouver_BC_Index", "Vancouver_BC_Value", "Vancouver_BC_YoY")
display(hpi_df)


Date,Vancouver_BC_Index,Vancouver_BC_Value,Vancouver_BC_YoY
200501,100.0,403920,0.0
200502,100.8,407190,0.0
200503,102.1,412410,0.0
200504,103.8,419350,0.0
200505,105.8,427350,0.0
200506,107.8,435610,0.0
200507,109.8,443410,0.0
200508,111.4,450090,0.0
200509,112.9,456210,0.0
200510,114.4,462070,0.0
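
The HPI values are monthly, with `Date` in YYYYMM form, while the crime counts above are yearly, so the two series need a common granularity before they can be compared. The sketch below shows one way to do that in plain Python by averaging the monthly index per year. The `yearly_mean_index` helper is a hypothetical name, averaging is just one reasonable choice of aggregate, and the input is limited to the ten 2005 rows displayed above.

```python
from collections import defaultdict

def yearly_mean_index(rows):
    """Average the monthly index values for each year.

    Each row is (date, index), where date is an int or string in YYYYMM form.
    """
    by_year = defaultdict(list)
    for date, index in rows:
        year = int(str(date)[:4])  # the first four characters are the year
        by_year[year].append(float(index))
    return {year: sum(vals) / len(vals) for year, vals in by_year.items()}

# (Date, Vancouver_BC_Index) pairs copied from the ten rows shown above.
# Only January through October 2005 are visible in that output.
hpi_rows = [
    (200501, 100.0), (200502, 100.8), (200503, 102.1), (200504, 103.8),
    (200505, 105.8), (200506, 107.8), (200507, 109.8), (200508, 111.4),
    (200509, 112.9), (200510, 114.4),
]

yearly_index = yearly_mean_index(hpi_rows)
print(yearly_index)
```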


###ToDo:
- perform a linear regression to determine yearly crime rates
- most dangerous day of the week
- most dangerous month of the year
- connect endpoints to Apache Kafka
- analyze trends with HPI
- find correlation between crime rates and HPI
- split crimes into categories
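
For the correlation item, a minimal sketch of the Pearson correlation coefficient in plain Python is given below. It assumes the crime counts and HPI values have already been aligned year by year. The `pearson_correlation` helper is a hypothetical name, and the two input series are placeholder numbers chosen to be perfectly linear, not results from the datasets.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denom = math.sqrt(sxx * syy)
    if denom == 0:
        raise ValueError("correlation is undefined when a series is constant")
    return sxy / denom

# Placeholder series for illustration only (NOT real crime or HPI figures).
crime_counts = [300, 320, 340, 360]
hpi_values = [100.0, 110.0, 120.0, 130.0]

r = pearson_correlation(crime_counts, hpi_values)
print(f"Pearson r = {r:.3f}")
```

Once both series live in one Spark DataFrame, `DataFrame.stat.corr` should give the same coefficient without collecting the data. Either way, a strong coefficient would only show that the two series move together, not that one drives the other.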