# Association of Crime with Skytrain Stations
It is a popular belief that certain Skytrain stations are notorious for crime. Let us use public data to see whether there is any truth to it. Among the many datasets in the City of Vancouver Open Data Catalogue is one that lists the Skytrain stations in the city. There are 22 stations in total, each belonging to one of the following three transit lines:

* Millennium Line
* Expo Line
* Canada Line

The dataset is in the .kml geographic format, and we have a script (found in the Source folder) that converts it to a csv file and converts the original X,Y co-ordinates to latitude,longitude pairs.

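
To make the conversion step concrete, here is a minimal sketch of how a .kml file can be turned into csv rows using only the standard library. It is not the script from the Source folder. It assumes each station is a `Placemark` with a `name` and a `Point`, and that the coordinates are already stored as longitude,latitude (as the KML specification prescribes); if the file carries projected X,Y values, a reprojection step would also be needed. The function names and the `NAME`, `LAT`, `LONG` header are hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace used by KML 2.2 documents
KML_NS = {"kml": "http://www.opengis.net/kml/2.2"}


def kml_to_rows(kml_text):
    """Return (name, latitude, longitude) tuples for every Placemark that has a Point."""
    root = ET.fromstring(kml_text)
    rows = []
    for placemark in root.iterfind(".//kml:Placemark", KML_NS):
        name = placemark.findtext("kml:name", default="", namespaces=KML_NS).strip()
        coords = placemark.findtext(
            ".//kml:Point/kml:coordinates", default="", namespaces=KML_NS
        ).strip()
        if not coords:
            continue  # skip placemarks without point geometry
        # KML stores coordinates as "longitude,latitude[,altitude]"
        lon, lat = coords.split(",")[:2]
        rows.append((name, float(lat), float(lon)))
    return rows


def rows_to_csv(rows):
    """Serialise the rows to csv text with a (hypothetical) NAME, LAT, LONG header."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["NAME", "LAT", "LONG"])
    writer.writerows(rows)
    return buffer.getvalue()
```
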
In this notebook, we analyse crime with respect to the city's Skytrain stations; in other words, we want to understand how Skytrain stations are associated with crime in the city. This can help us answer popular questions such as:

* Which stations have a prevalence of crime?
* How can Skytrain stations be categorized by the prevalence and/or type of crime that occurs around them?

Such observations are useful to a variety of entities. Most obviously, they help law enforcement agencies concentrate their efforts on particular stations, and they alert residents to potential dangers in their vicinity. Let us proceed step by step. First we must import the necessary dependencies.

In [85]:
from pyspark.sql import SparkSession, functions, types
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [6]:
#Create Spark Session and context
spark = SparkSession\
    .builder\
    .appName("example code")\
    .config("spark.driver.extraClassPath","/home/jim/spark-2.4.0-bin-hadoop2.7/jars/mysql-connector-java-5.1.49.jar")\
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')
sc = spark.sparkContext


In [131]:
skytrain_df = spark.read.format("csv").option("header", "true").load("../Data/skytrain_stations/rapid_transit_stations.csv")
skytrain_df = skytrain_df.select(skytrain_df["LINE"].alias("STATION"), skytrain_df["LAT"].alias("LATITUDE"),skytrain_df["LONG"].alias("LONGITUDE"))
skytrain_df.show(skytrain_df.count(),truncate=False)

+------------------------+----------------+-----------------+
|STATION                 |LATITUDE        |LONGITUDE        |
+------------------------+----------------+-----------------+
|WATERFRONT              |49.2860754854493|-123.111738155627|
|BURRARD                 |49.2858601496754|-123.119972336831|
|GRANVILLE               |49.2836376878638|-123.116404027665|
|STADIUM - CHINATOWN     |49.2794416930032|-123.109564795656|
|MAIN ST. - SCIENCE WORLD|49.2731779129851|-123.100606907519|
|BROADWAY                |49.2618178476532|-123.069099402992|
|NANAIMO                 |49.2482722113354|-123.055871516595|
|29TH AVENUE             |49.2442425934484|-123.045940674179|
|JOYCE - COLLINGWOOD     |49.2383938064   |-123.031806717203|
|RUPERT                  |49.2607647789351|-123.032823831819|
|RENFREW                 |49.2588929116788|-123.045308458871|
|COMMERCIAL DRIVE        |49.2629362684988|-123.068453898483|
|VCC - CLARK             |49.2657831997804|-123.078962252228|
|WATERFR

#### The dataset has LAT/LONG pairs, but these alone are not enough to join it to another dataset by location. The problem is that coordinates from two sources almost never match exactly: a pair given to 13 decimal places pins down a single point, far smaller than a city block.<BR> <BR> We, on the other hand, are considering crime levels by AREA, so we need a way to derive a 'Neighbourhood' field from each LAT/LONG pair.<BR> Geopy can be used for this through reverse geocoding.

In [132]:
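# Collect the coordinate columns to the driver as plain Python lists
latitude_list = skytrain_df.select("LATITUDE").rdd.flatMap(lambda x: x).collect()
longitude_list = skytrain_df.select("LONGITUDE").rdd.flatMap(lambda x: x).collect()
locator = Nominatim(user_agent="myGeocoder")
# Space out requests to respect Nominatim's usage policy (at most one request per second)
reverse = RateLimiter(locator.reverse, min_delay_seconds=1)
neighbourhood_list = []

# Reverse-geocode each station and keep the district name from the returned address
for lat, lon in zip(latitude_list, longitude_list):
    location = reverse([lat, lon])
    neighbourhood_list.append(location.raw['address']['city_district'])

# Attach the neighbourhoods via pandas, then convert back to a Spark DataFrame
temp_df = skytrain_df.toPandas()
temp_df['NEIGHBOURHOOD'] = neighbourhood_list
skytrain_df = spark.createDataFrame(temp_df)
skytrain_df.show(skytrain_df.count(),truncate=False)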

+------------------------+----------------+-----------------+------------------------+
|STATION                 |LATITUDE        |LONGITUDE        |NEIGHBOURHOOD           |
+------------------------+----------------+-----------------+------------------------+
|WATERFRONT              |49.2860754854493|-123.111738155627|Downtown                |
|BURRARD                 |49.2858601496754|-123.119972336831|Downtown                |
|GRANVILLE               |49.2836376878638|-123.116404027665|Downtown                |
|STADIUM - CHINATOWN     |49.2794416930032|-123.109564795656|Downtown                |
|MAIN ST. - SCIENCE WORLD|49.2731779129851|-123.100606907519|Strathcona              |
|BROADWAY                |49.2618178476532|-123.069099402992|Kensington-Cedar Cottage|
|NANAIMO                 |49.2482722113354|-123.055871516595|Renfrew-Collingwood     |
|29TH AVENUE             |49.2442425934484|-123.045940674179|Renfrew-Collingwood     |
|JOYCE - COLLINGWOOD     |49.2383938064   |
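
The lookup `location.raw['address']['city_district']` raises a `KeyError` whenever the geocoder's response has no `city_district` entry. A more defensive extraction might look like the sketch below. The function name is hypothetical, and the fallback order (`suburb`, then `neighbourhood`) is an assumption about which address fields are a reasonable substitute, not something the notebook establishes.

```python
def extract_neighbourhood(raw, keys=("city_district", "suburb", "neighbourhood")):
    """Return the first non-empty address field among `keys`, or None.

    `raw` is expected to look like the `raw` attribute of a geopy Location,
    i.e. a dict with an optional "address" dict inside it.
    """
    address = (raw or {}).get("address", {})
    for key in keys:
        value = address.get(key)
        if value:
            return value
    return None


# Made-up responses for illustration only
print(extract_neighbourhood({"address": {"city_district": "Strathcona"}}))
print(extract_neighbourhood({"address": {"suburb": "Downtown"}}))
print(extract_neighbourhood({"address": {"road": "Main Street"}}))
```

Returning `None` instead of raising keeps the list of neighbourhoods the same length as the list of stations, so the later column assignment still lines up.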

### Now we load the crime dataset, our main source of crime data


In [133]:
crime_df = spark.read.format("csv").option("header", "true").load("../Data/crime/crime_all_years_latlong.csv")
# Keep only the columns we need
crime_df = crime_df.select(['TYPE','LATITUDE','LONGITUDE','NEIGHBOURHOOD'])
crime_df.show(10,truncate=True)
print("Crime Dataset has {} rows".format(crime_df.count()))

+--------------------+------------------+-------------------+--------------------+
|                TYPE|          LATITUDE|          LONGITUDE|       NEIGHBOURHOOD|
+--------------------+------------------+-------------------+--------------------+
|            Mischief| 49.22285547453633|-123.10457767461014|              Sunset|
|    Theft of Vehicle| 49.21942208176436|-123.05928356709362| Victoria-Fraserview|
|Break and Enter C...|49.280454355702865|-123.10100566349294|Central Business ...|
|            Mischief| 49.29261448054877|-123.13962081805273|            West End|
|            Mischief| 49.29260865723727|-123.13945233120421|            West End|
|            Mischief|49.281126361961825| -123.0554729922974|    Hastings-Sunrise|
|  Theft from Vehicle|49.263002922167225|-123.10655743565438|      Mount Pleasant|
|            Mischief| 49.28112610578195|-123.05525671257254|    Hastings-Sunrise|
|  Theft from Vehicle| 49.25958751890934| -123.1707943860336|           Kitsilano|
|  T

#### For this exercise, we group the crimes by neighbourhood and count them. This gives the number of crimes per neighbourhood, which we then join to the Skytrain table to get an idea of the crime level in the area surrounding each station. We discard the other columns, since only the neighbourhood and the crime count are needed for the join.

In [140]:
crime_df = crime_df.select(['TYPE','NEIGHBOURHOOD'])
crime_df = crime_df.groupBy('NEIGHBOURHOOD').count().withColumnRenamed('count', 'CRIME_COUNT')
crime_df.show(10)

+-----------------+-----------+
|    NEIGHBOURHOOD|CRIME_COUNT|
+-----------------+-----------+
|         Oakridge|       8698|
|      Shaughnessy|       5993|
|         Fairview|      34654|
|    Arbutus Ridge|       6503|
|     Stanley Park|       3977|
|             null|       2387|
| Hastings-Sunrise|      19838|
|   Mount Pleasant|      33786|
|Dunbar-Southlands|       8384|
|         Musqueam|        560|
+-----------------+-----------+
only showing top 10 rows
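
To make the aggregation concrete, here is a minimal plain-Python sketch of the same group-and-count step on a few made-up rows. The sample rows and the function name `count_by_neighbourhood` are illustrative assumptions, not part of the notebook. Unlike the Spark cell above, the sketch drops rows that have no neighbourhood, which is one possible way to deal with the `null` group visible in the output.

```python
from collections import Counter


def count_by_neighbourhood(rows):
    """Count crimes per neighbourhood, skipping rows without a neighbourhood.

    `rows` is an iterable of (crime_type, neighbourhood) pairs.
    """
    counts = Counter(
        neighbourhood for _crime_type, neighbourhood in rows if neighbourhood is not None
    )
    return dict(counts)


# Made-up sample rows for illustration only
sample_rows = [
    ("Mischief", "Sunset"),
    ("Mischief", "West End"),
    ("Theft from Vehicle", "West End"),
    ("Theft of Vehicle", None),
]
print(count_by_neighbourhood(sample_rows))
```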



### Join the Dataframes
The DataFrames share a "NEIGHBOURHOOD" column. It is unique in the aggregated crime table, but several stations can fall in the same neighbourhood, so we left-join from the Skytrain table to keep one row per station.
Since Spark SQL supports native SQL syntax, we can write the join by creating temporary views on the DataFrames and using spark.sql()

In [142]:
#Create temp views for Spark SQL
skytrain_df.createOrReplaceTempView("DF1")
crime_df.createOrReplaceTempView("DF2")

#SQL JOIN
joined_df = spark.sql("SELECT DF1.*,DF2.CRIME_COUNT FROM DF1 LEFT JOIN DF2 ON DF1.NEIGHBOURHOOD = DF2.NEIGHBOURHOOD")
joined_df.show(joined_df.count(),truncate=True)
print("The new Dataset has {} rows".format(joined_df.count()))

+--------------------+----------------+-----------------+--------------------+-----------+
|             STATION|        LATITUDE|        LONGITUDE|       NEIGHBOURHOOD|CRIME_COUNT|
+--------------------+----------------+-----------------+--------------------+-----------+
|          WATERFRONT|49.2860754854493|-123.111738155627|            Downtown|       null|
|             BURRARD|49.2858601496754|-123.119972336831|            Downtown|       null|
|           GRANVILLE|49.2836376878638|-123.116404027665|            Downtown|       null|
| STADIUM - CHINATOWN|49.2794416930032|-123.109564795656|            Downtown|       null|
|MAIN ST. - SCIENC...|49.2731779129851|-123.100606907519|          Strathcona|      23566|
|            BROADWAY|49.2618178476532|-123.069099402992|Kensington-Cedar ...|      26840|
|             NANAIMO|49.2482722113354|-123.055871516595| Renfrew-Collingwood|      29294|
|         29TH AVENUE|49.2442425934484|-123.045940674179| Renfrew-Collingwood|      29294|
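
The `null` counts for the Downtown stations arise because the two sources label the same area differently: the geocoder returns "Downtown", while the crime data appears to use "Central Business District" (shown truncated as "Central Business ..." in the earlier output). Below is a minimal plain-Python sketch of reconciling the names before a left join. The mapping, the function names, and the counts in the sample are assumptions for illustration, not values from the notebook.

```python
# Assumed mapping from geocoder names to the names used in the crime data
NAME_FIXES = {"Downtown": "Central Business District"}


def normalise(name):
    """Translate a geocoder neighbourhood name to the crime dataset's name."""
    return NAME_FIXES.get(name, name)


def left_join_counts(stations, crime_counts):
    """Left-join stations to crime counts, keeping every station.

    `stations` is a list of (station, neighbourhood) pairs and
    `crime_counts` maps a neighbourhood name to its crime count.
    Stations without a match get None, mirroring SQL LEFT JOIN semantics.
    """
    return [
        (station, neighbourhood, crime_counts.get(normalise(neighbourhood)))
        for station, neighbourhood in stations
    ]


# Made-up counts for illustration only
sample_counts = {"Central Business District": 100, "Strathcona": 40}
sample_stations = [
    ("WATERFRONT", "Downtown"),
    ("MAIN ST. - SCIENCE WORLD", "Strathcona"),
    ("SOMEWHERE ELSE", "Unknown Area"),
]
for row in left_join_counts(sample_stations, sample_counts):
    print(row)
```

In the notebook itself, the same renaming could be applied to the `NEIGHBOURHOOD` column before the join, for example with a `functions.when(...).otherwise(...)` expression.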

#### We'll save it as a flat file to use in Tableau

In [143]:
joined_df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("Skytrain.csv")
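
Note that Spark's `save` writes a directory (here named `Skytrain.csv`) holding one `part-*.csv` file per partition rather than a single file; `repartition(1)` ensures there is only one part. If a standalone file is more convenient, a minimal sketch for copying that part file out is shown below. The function name and the target path are hypothetical.

```python
import glob
import os
import shutil


def extract_single_csv(spark_output_dir, target_path):
    """Copy the single part file from a Spark csv output directory to target_path."""
    part_files = glob.glob(os.path.join(spark_output_dir, "part-*.csv"))
    if len(part_files) != 1:
        raise ValueError(
            "expected exactly one part file in {}, found {}".format(
                spark_output_dir, len(part_files)
            )
        )
    shutil.copyfile(part_files[0], target_path)
    return target_path


# Example call (hypothetical target name):
# extract_single_csv("Skytrain.csv", "Skytrain_flat.csv")
```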

### Here is the Tableau visualization:
Each bubble represents a Skytrain station, and the size of the bubble represents the number of crimes recorded in the station's neighbourhood. The Tableau public dashboard can be found at <a href="https://public.tableau.com/shared/NTGFY2TG6?:display_count=y&:origin=viz_share_link">https://public.tableau.com/shared/NTGFY2TG6?:display_count=y&:origin=viz_share_link</a><br>
<img src="../Visualisation/Raw/Skytrain_Stations.PNG">