#Vancouver Crime Analysis Program
####Author: Luke Hansen
---
This notebook reports the top crimes in popular Vancouver neighbourhoods and will predict approximate crime rates for the next 10 years.

---
#####The goal of this project is to teach myself:
- basic big data principles
- Python 3
- Apache Spark
- the Databricks platform

####Dataset: the Vancouver Police Department's CSV file, which contains all crimes since 2003.
URL: https://data.vancouver.ca/datacatalogue/crime-data.htm

In [3]:
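from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression  # for the regression cell at the end of the notebook

# Databricks already provides `spark`; getOrCreate() reuses that session
spark = SparkSession.builder.getOrCreate()
# load the crime CSV, taking column names from the header row and inferring column types
path = "/FileStore/tables/crime_csv_all_years.csv"
dataset = spark.read.csv(path, inferSchema=True, header=True)
# display(dataset)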

In [4]:
# keep only the YEAR, NEIGHBOURHOOD, and TYPE columns, drop rows with no neighbourhood, and sort by year
df = (dataset.select("YEAR", "NEIGHBOURHOOD", "TYPE")
      .filter("NEIGHBOURHOOD IS NOT NULL")
      .orderBy("YEAR"))
display(df)

YEAR,NEIGHBOURHOOD,TYPE
2003,Kitsilano,Mischief
2003,Hastings-Sunrise,Theft of Vehicle
2003,Kitsilano,Mischief
2003,Renfrew-Collingwood,Mischief
2003,Grandview-Woodland,Break and Enter Residential/Other
2003,Renfrew-Collingwood,Mischief
2003,Kitsilano,Theft from Vehicle
2003,Central Business District,Theft from Vehicle
2003,Mount Pleasant,Theft of Vehicle
2003,West End,Theft from Vehicle


In [5]:
# interactive widget to control the neighbourhood variable
neighbourhoods_list = df.select("NEIGHBOURHOOD").distinct().rdd.flatMap(lambda x: x).collect()
dbutils.widgets.dropdown("neighbourhood", "Kitsilano", neighbourhoods_list)

In [6]:
# show all crimes in the chosen neighbourhood
crimes = df.filter(df["NEIGHBOURHOOD"]==dbutils.widgets.get("neighbourhood"))
display(crimes)

YEAR,NEIGHBOURHOOD,TYPE
2003,Shaughnessy,Theft from Vehicle
2003,Shaughnessy,Break and Enter Residential/Other
2003,Shaughnessy,Mischief
2003,Shaughnessy,Mischief
2003,Shaughnessy,Break and Enter Residential/Other
2003,Shaughnessy,Break and Enter Commercial
2003,Shaughnessy,Theft of Vehicle
2003,Shaughnessy,Vehicle Collision or Pedestrian Struck (with Injury)
2003,Shaughnessy,Mischief
2003,Shaughnessy,Break and Enter Residential/Other


####Number of crimes

In [8]:
# count the number of crimes in the chosen neighbourhood
# TODO: also calculate the average number of crimes per year
neighbourhood = dbutils.widgets.get("neighbourhood")
num_crimes = crimes.count()
print(f"There have been {num_crimes} crimes in {neighbourhood} since the year 2003")

####The following is a line graph plotting the number of crimes per year in this area

In [10]:
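# number of crimes per year since 2003 in the chosen neighbourhood, excluding 2019
# (ordered by count to match the table below; order by "YEAR" instead for a chronological line graph)
crimes_per_year = crimes.filter(crimes["YEAR"] != 2019).groupBy("YEAR").count().orderBy("count")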
display(crimes_per_year)

YEAR,count
2009,239
2010,257
2008,271
2013,275
2007,334
2006,336
2012,349
2011,351
2015,371
2014,376


In [11]:
from pyspark.sql.functions import desc
# rank crime types in the chosen neighbourhood from most to least frequent
top_crimes = crimes.groupBy("TYPE").count().orderBy(desc("count"))
display(top_crimes)

TYPE,count
Theft from Vehicle,2086
Break and Enter Residential/Other,2000
Mischief,717
Vehicle Collision or Pedestrian Struck (with Injury),650
Theft of Vehicle,389
Theft of Bicycle,159
Break and Enter Commercial,150
Other Theft,27
Vehicle Collision or Pedestrian Struck (with Fatality),7


####How many crimes will there be per year from 2018 until 2028 in this neighbourhood?

In [13]:
# linear regression here
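
The regression cell is still a placeholder, so here is a minimal sketch of what the fit could look like. It is not the notebook's implementation: it uses plain-Python ordinary least squares as a stand-in for `pyspark.ml.regression.LinearRegression`, and it fits only the ten yearly counts visible in the `crimes_per_year` output above. That output is sorted by count and cut off at ten rows, so the sample is incomplete. `fit_line`, `yearly_counts`, and `predictions` are hypothetical names introduced for illustration.

```python
# Yearly counts copied from the ten rows shown in the crimes_per_year output.
# Assumption: this is an incomplete sample; a real fit would use every year.
yearly_counts = {
    2009: 239, 2010: 257, 2008: 271, 2013: 275, 2007: 334,
    2006: 336, 2012: 349, 2011: 351, 2015: 371, 2014: 376,
}


def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept (hypothetical helper)."""
    xs = list(points.keys())
    ys = list(points.values())
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(yearly_counts)

# Extrapolate the fitted line over the years named in the heading (2018 to 2028).
predictions = {year: slope * year + intercept for year in range(2018, 2029)}
for year, count in predictions.items():
    print(year, round(count))
```

A straight line is a rough approximation for a series this short and noisy, so the extrapolated values are best read as a trend indication rather than a forecast.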

###ToDo:
- perform a linear regression to determine yearly crime rates
- most dangerous day of the week
- most dangerous month of the year
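
The last two ToDo items both come down to counting crimes by a calendar attribute. The sketch below shows one way to do that in plain Python. It assumes each crime record carries year, month, and day values, which the columns kept in this notebook (YEAR, NEIGHBOURHOOD, TYPE) do not show. The dates in `sample_dates` are made up for illustration and are not taken from the dataset. `busiest_weekday_and_month` is a hypothetical helper name.

```python
from collections import Counter
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Hypothetical (year, month, day) tuples, for illustration only.
sample_dates = [
    (2003, 1, 3), (2003, 1, 10), (2003, 2, 14),
    (2003, 7, 4), (2003, 7, 5), (2003, 7, 12),
]


def busiest_weekday_and_month(rows):
    """Return the weekday name and month name with the most records."""
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    weekday_counts = Counter(WEEKDAYS[date(y, m, d).weekday()] for y, m, d in rows)
    month_counts = Counter(MONTHS[m - 1] for _, m, _ in rows)
    top_weekday = weekday_counts.most_common(1)[0][0]
    top_month = month_counts.most_common(1)[0][0]
    return top_weekday, top_month


top_weekday, top_month = busiest_weekday_and_month(sample_dates)
print(top_weekday, top_month)
```

On the full dataset the same counting would more naturally be a Spark `groupBy` on the relevant column followed by a descending sort, mirroring the `top_crimes` cell.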