# Crime vs. street lighting
The stereotype of urban crime is that it happens in dark, shady alleys rather than in broad daylight. The Vancouver Open Data Catalogue publishes a dataset listing street lighting poles throughout the city, so we use it to test whether there is any truth to the notion that crime occurs away from lighting and in more remote places.
<br>First we import some dependencies, start a SparkSession and read in the data.

In [33]:
import pandas as pd
import reverse_geocoder as rg
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [34]:
#Create Spark Session and context
spark = SparkSession\
    .builder\
    .appName("example code")\
    .config("spark.driver.extraClassPath","/home/jim/spark-2.4.0-bin-hadoop2.7/jars/mysql-connector-java-5.1.49.jar")\
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')
sc = spark.sparkContext

In [35]:
lights_df = spark.read.format("csv").option("header", "true").load("../Data/street_lightings/street_lighting_poles.csv")
lights_df.show(10,truncate=False)

+-----------+----------------+-----------------+------------+
|NODE_NUMBER|LAT             |LONG             |BLOCK_NUMBER|
+-----------+----------------+-----------------+------------+
|1          |49.2678128146707|-123.162324988189|20          |
|2          |49.2554200308521|-123.164303441137|26          |
|3          |49.2555499673319|-123.164940708487|26          |
|1          |49.2555411740844|-123.163704551483|26          |
|3          |49.2550272963661|-123.164217031879|25          |
|1          |49.2550311396586|-123.163320610485|25          |
|2          |49.2550397663404|-123.163758906708|25          |
|6          |49.2494750376237|-123.100892983293|39          |
|2          |49.2491881601068|-123.100905232449|40          |
|5          |49.2489134798837|-123.101144032904|40          |
+-----------+----------------+-----------------+------------+
only showing top 10 rows



#### The dataset gives each pole as a LAT/LONG pair, but that is not enough to join it to any other dataset by location. At this precision the coordinates are effectively unique: two records almost never share exactly the same pair, and a 13-digit LAT/LONG pair pins down a single point within a single block.<BR> <BR> We, on the other hand, are considering crime levels by AREA, so we need a way to derive a 'Neighbourhood' field from each LAT/LONG pair.<BR> The reverse_geocoder package (imported above as rg) can be used for this.
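To make the point about exact coordinates concrete, here is a minimal sketch in plain Python (no Spark needed). It takes two poles from block 26 in the sample output above and checks both whether their coordinates are equal and how far apart they really are. The helper name `haversine_m` is ours, not part of any library used in this notebook.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, long) points."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Two poles from block 26 in the sample output above
pole_a = (49.2555499673319, -123.164940708487)
pole_b = (49.2555411740844, -123.163704551483)

# An equality join on raw coordinates would never match these two rows...
exact_match = pole_a == pole_b
# ...even though the poles are on the same block
distance_m = haversine_m(*pole_a, *pole_b)

print(exact_match, round(distance_m))
```

The two poles are less than 100 m apart yet share no coordinate value, which is why a coarser key such as a neighbourhood or block label is needed before joining.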

In [43]:
# Collect the coordinates to the driver as (lat, long) float tuples
# (the CSV was read without a schema, so both columns are strings)
coords = [(float(row["LAT"]), float(row["LONG"]))
          for row in lights_df.select("LAT", "LONG").collect()]

# reverse_geocoder resolves a whole list of coordinate tuples in a single call
results = rg.search(coords)
neighbourhood_list = [result['name'] for result in results]

temp_df = lights_df.toPandas()
temp_df['NEIGHBOURHOOD'] = neighbourhood_list
lights_df = spark.createDataFrame(temp_df)
lights_df.show(15,truncate=False) 

+-----------+----------------+-----------------+------------+---------------+
|NODE_NUMBER|LAT             |LONG             |BLOCK_NUMBER|NEIGHBOURHOOD  |
+-----------+----------------+-----------------+------------+---------------+
|1          |49.2678128146707|-123.162324988189|20          |West End       |
|2          |49.2554200308521|-123.164303441137|26          |West End       |
|3          |49.2555499673319|-123.164940708487|26          |West End       |
|1          |49.2555411740844|-123.163704551483|26          |West End       |
|3          |49.2550272963661|-123.164217031879|25          |West End       |
|1          |49.2550311396586|-123.163320610485|25          |West End       |
|2          |49.2550397663404|-123.163758906708|25          |West End       |
|6          |49.2494750376237|-123.100892983293|39          |Vancouver      |
|2          |49.2491881601068|-123.100905232449|40          |Vancouver      |
|5          |49.2489134798837|-123.101144032904|40          |Van

### Next, we load the crime dataset, which is our main source of crime data

In [45]:
crime_df = spark.read.format("csv").option("header", "true").load("../Data/crime/crime_all_years_latlong.csv")
# Keep only the columns we need
crime_df = crime_df.select(['TYPE','HUNDRED_BLOCK','LATITUDE','LONGITUDE','NEIGHBOURHOOD'])
crime_df.show(10,truncate=True)
print("Crime Dataset has {} rows".format(crime_df.count()))

+--------------------+------------------+------------------+-------------------+--------------------+
|                TYPE|     HUNDRED_BLOCK|          LATITUDE|          LONGITUDE|       NEIGHBOURHOOD|
+--------------------+------------------+------------------+-------------------+--------------------+
|            Mischief|     6X E 52ND AVE| 49.22285547453633|-123.10457767461014|              Sunset|
|    Theft of Vehicle|   71XX NANAIMO ST| 49.21942208176436|-123.05928356709362| Victoria-Fraserview|
|Break and Enter C...|   1XX E PENDER ST|49.280454355702865|-123.10100566349294|Central Business ...|
|            Mischief|     9XX CHILCO ST| 49.29261448054877|-123.13962081805273|            West End|
|            Mischief|     9XX CHILCO ST| 49.29260865723727|-123.13945233120421|            West End|
|            Mischief|24XX E HASTINGS ST|49.281126361961825| -123.0554729922974|    Hastings-Sunrise|
|  Theft from Vehicle| 8X W BROADWAY AVE|49.263002922167225|-123.10655743565438|  

#### We must now get this dataset into a format that can be meaningfully joined to the street light data

In [47]:
# Keep only the columns needed for the join
crime_df = crime_df.select(['TYPE','HUNDRED_BLOCK','NEIGHBOURHOOD'])
# Reduce HUNDRED_BLOCK to its first two characters, e.g. "71XX NANAIMO ST" -> "71"
crime_df = crime_df.withColumn("HUNDRED_BLOCK",expr("substring(HUNDRED_BLOCK, 1, 2)"))
# Replace the masking 'X' with '0', e.g. "6X" -> "60"
crime_df = crime_df.withColumn('HUNDRED_BLOCK', regexp_replace('HUNDRED_BLOCK', 'X', '0'))
crime_df.show(10,truncate=True)

+--------------------+-------------+--------------------+
|                TYPE|HUNDRED_BLOCK|       NEIGHBOURHOOD|
+--------------------+-------------+--------------------+
|            Mischief|           60|              Sunset|
|    Theft of Vehicle|           71| Victoria-Fraserview|
|Break and Enter C...|           10|Central Business ...|
|            Mischief|           90|            West End|
|            Mischief|           90|            West End|
|            Mischief|           24|    Hastings-Sunrise|
|  Theft from Vehicle|           80|      Mount Pleasant|
|            Mischief|           24|    Hastings-Sunrise|
|  Theft from Vehicle|           29|           Kitsilano|
|  Theft from Vehicle|           29|           Kitsilano|
+--------------------+-------------+--------------------+
only showing top 10 rows
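The same two-step transformation can be checked outside Spark. The sketch below mirrors it in plain Python on addresses taken from the crime sample shown earlier; the function name `block_prefix` is ours and only illustrative.

```python
def block_prefix(hundred_block):
    """Mimic the two Spark steps: keep the first two characters, then turn 'X' into '0'."""
    if hundred_block is None:
        return None  # Spark would likewise leave a null value as null
    return hundred_block[:2].replace("X", "0")

# Addresses copied from the crime sample output
samples = ["6X E 52ND AVE", "71XX NANAIMO ST", "1XX E PENDER ST", "24XX E HASTINGS ST"]
prefixes = [block_prefix(s) for s in samples]
print(prefixes)
```

One caveat this makes visible: the key is lossy. "1XX E PENDER ST" becomes "10", and a hypothetical "10XX" address would also become "10", so blocks of different magnitude can end up sharing a key.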



#### Combining the HUNDRED_BLOCK and NEIGHBOURHOOD values gives a common key that can also be built on the street light dataset (after applying the same transformation to it). Joining on that key narrows the street lights down to small groups of blocks and associates each light with the crime recorded in its area. Note that the join in the next cell matches on the block number only.
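As a rough illustration of joining on a combined block and neighbourhood key, here is a plain-Python sketch on toy rows. The crime rows come from the sample output above; the light rows are hypothetical and chosen so that one of them matches. `make_key` and the `|` separator are our own choices, not something the notebook defines.

```python
from collections import Counter

def make_key(block, neighbourhood):
    """Combine block prefix and neighbourhood into one join key."""
    return f"{block}|{neighbourhood}"

# Crime rows taken from the sample output above
crimes = [
    {"TYPE": "Mischief", "HUNDRED_BLOCK": "90", "NEIGHBOURHOOD": "West End"},
    {"TYPE": "Mischief", "HUNDRED_BLOCK": "90", "NEIGHBOURHOOD": "West End"},
    {"TYPE": "Theft from Vehicle", "HUNDRED_BLOCK": "29", "NEIGHBOURHOOD": "Kitsilano"},
]

# Hypothetical light rows (not real data), one with a matching key and one without
lights = [
    {"NODE_NUMBER": "1", "BLOCK_NUMBER": "90", "NEIGHBOURHOOD": "West End"},
    {"NODE_NUMBER": "2", "BLOCK_NUMBER": "26", "NEIGHBOURHOOD": "West End"},
]

# Aggregate: number of crimes per combined key
crime_counts = Counter(make_key(c["HUNDRED_BLOCK"], c["NEIGHBOURHOOD"]) for c in crimes)

# Left join: every light is kept; lights without a matching key get None
joined = [
    {**light, "CRIME_COUNT": crime_counts.get(make_key(light["BLOCK_NUMBER"], light["NEIGHBOURHOOD"]))}
    for light in lights
]

for row in joined:
    print(row)
```

A combined key only works if both datasets use the same neighbourhood vocabulary. The reverse-geocoded labels on the light data and the labels shipped with the crime data may not agree, which is worth checking before relying on such a key.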

In [55]:
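# Count crimes per block prefix. The cell that originally produced CRIME_COUNT
# is not shown, so this aggregation is an assumption based on the join below.
crime_counts_df = crime_df.groupBy("HUNDRED_BLOCK").agg(count("*").alias("CRIME_COUNT"))

# Create temp views for Spark SQL
lights_df.createOrReplaceTempView("DF1")
crime_counts_df.createOrReplaceTempView("DF2")

# Left join: keep every street light and attach the crime count of its block
joined_df = spark.sql("SELECT DF1.*,DF2.CRIME_COUNT FROM DF1 LEFT JOIN DF2 ON DF1.BLOCK_NUMBER = DF2.HUNDRED_BLOCK")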
joined_df.show(15,truncate=True)
print("The new Dataset has {} rows".format(joined_df.count()))

+-----------+----------------+-----------------+------------+---------------+-----------+
|NODE_NUMBER|             LAT|             LONG|BLOCK_NUMBER|  NEIGHBOURHOOD|CRIME_COUNT|
+-----------+----------------+-----------------+------------+---------------+-----------+
|          1|49.2678128146707|-123.162324988189|          20|       West End|      20713|
|          2|49.2554200308521|-123.164303441137|          26|       West End|       5916|
|          3|49.2555499673319|-123.164940708487|          26|       West End|       5916|
|          1|49.2555411740844|-123.163704551483|          26|       West End|       5916|
|          3|49.2550272963661|-123.164217031879|          25|       West End|       6913|
|          1|49.2550311396586|-123.163320610485|          25|       West End|       6913|
|          2|49.2550397663404|-123.163758906708|          25|       West End|       6913|
|          6|49.2494750376237|-123.100892983293|          39|      Vancouver|       1756|
|         

In [56]:
# Spark writes a directory named Street_Lights.csv that contains a single part file
joined_df.repartition(1).write.format("csv").option("header", "true").save("Street_Lights.csv")

## Here is the Tableau visualization:
The street lights are plotted on the map in a color range from yellow through orange to red: the more crime associated with a light's block, the closer its color is to red. The Tableau Public dashboard can be found at <a href="https://public.tableau.com/views/StreetLighting_Crime/Dashboard1?:language=en&:display_count=y&publish=yes&:origin=viz_share_link">https://public.tableau.com/views/StreetLighting_Crime/Dashboard1?:language=en&:display_count=y&publish=yes&:origin=viz_share_link</a><br>
<img src="../Visualisation/Raw/Street_Lights.png">
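The color encoding described above can be approximated in code. Tableau computes its own palette, so the following is only a sketch that assumes a simple linear interpolation from yellow to red. The function name `crime_color` is ours, and the bounds are simply the smallest and largest CRIME_COUNT values visible in the sample output, not the true range of the dataset.

```python
def crime_color(count, lo, hi):
    """Map a crime count to an (R, G, B) tuple between yellow (255, 255, 0) and red (255, 0, 0)."""
    if hi == lo:
        t = 0.0  # avoid division by zero when all counts are equal
    else:
        # Normalise to [0, 1] and clamp values outside the range
        t = min(max((count - lo) / (hi - lo), 0.0), 1.0)
    green = round(255 * (1 - t))  # green fades out as crime increases
    return (255, green, 0)

# Bounds taken from the CRIME_COUNT values visible in the sample output
lo_count, hi_count = 1756, 20713

for c in (1756, 5916, 6913, 20713):
    print(c, crime_color(c, lo_count, hi_count))
```

Lights on the lowest-crime block come out pure yellow and those on the highest-crime block pure red, with orange shades in between.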
