# Crime vs. street lighting
The stereotype of crime in a city is that most of it occurs in dark, shady alleys rather than in broad daylight. Since the Vancouver Open Data Catalogue has a dataset that lists street lighting poles throughout the city, we analyse whether there is any truth to the notion that crime occurs away from lighting and in more remote places.
<br>First, we import some dependencies, start a SparkSession and read in the data.

In [33]:
import pandas as pd
import reverse_geocoder as rg
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [34]:
#Create Spark Session and context
spark = SparkSession\
    .builder\
    .appName("example code")\
    .config("spark.driver.extraClassPath","/home/jim/spark-2.4.0-bin-hadoop2.7/jars/mysql-connector-java-5.1.49.jar")\
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')
sc = spark.sparkContext

In [35]:
lights_df = spark.read.format("csv").option("header", "true").load("../Data/street_lightings/street_lighting_poles.csv")
#lights_df = skytrain_df.select(skytrain_df["LINE"].alias("STATION"), skytrain_df["LAT"].alias("LATITUDE"),skytrain_df["LONG"].alias("LONGITUDE"))
lights_df.show(10,truncate=False)

+-----------+----------------+-----------------+------------+
|NODE_NUMBER|LAT             |LONG             |BLOCK_NUMBER|
+-----------+----------------+-----------------+------------+
|1          |49.2678128146707|-123.162324988189|20          |
|2          |49.2554200308521|-123.164303441137|26          |
|3          |49.2555499673319|-123.164940708487|26          |
|1          |49.2555411740844|-123.163704551483|26          |
|3          |49.2550272963661|-123.164217031879|25          |
|1          |49.2550311396586|-123.163320610485|25          |
|2          |49.2550397663404|-123.163758906708|25          |
|6          |49.2494750376237|-123.100892983293|39          |
|2          |49.2491881601068|-123.100905232449|40          |
|5          |49.2489134798837|-123.101144032904|40          |
+-----------+----------------+-----------------+------------+
only showing top 10 rows



#### Here, we have LAT/LONG pairs in the dataset, but this is not enough to join it to any other dataset based on location. The problem is that LAT/LONG pairs are never exact: a 13-digit LAT/LONG pair identifies a single spot on a single block, so pairs from two different datasets will rarely be equal.<BR> <BR> We, on the other hand, are considering crime levels by AREA, hence we need a way to generate a 'Neighbourhood' field from the LAT/LONG pair.<BR> A reverse geocoder such as Geopy can be used to that effect; the next cell uses `reverse_geocoder` (imported as `rg`).
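
To make the point about exact coordinates concrete, here is a minimal sketch using two poles from the sample output above. The `area_key` helper and its rounding to two decimals are assumptions made purely for illustration (a crude stand-in for a real neighbourhood lookup), not the method used in this notebook.

```python
# Two poles from the sample output above: they differ after a few decimals,
# so an equality join on (LAT, LONG) would never pair them.
pole_a = (49.2555411740844, -123.163704551483)
pole_b = (49.2550397663404, -123.163758906708)


def area_key(lat, lon, ndigits=2):
    """Hypothetical coarse key: round both coordinates to ndigits decimals."""
    return (round(lat, ndigits), round(lon, ndigits))


exact_match = pole_a == pole_b                       # compares the raw pairs
same_area = area_key(*pole_a) == area_key(*pole_b)   # compares the coarse keys

print(exact_match, same_area)
```

The raw pairs are not equal, while the coarse keys are. A neighbourhood label plays the same role as the coarse key, but follows real boundaries instead of an arbitrary grid.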

In [None]:
# The CSV was read without a schema, so LAT/LONG are strings: cast them to float
coords = [(float(row["LAT"]), float(row["LONG"]))
          for row in lights_df.select("LAT", "LONG").collect()]

# rg.search takes a (lat, lon) tuple or a list of such tuples and returns one match per input
results = rg.search(coords)
neighbourhood_list = [match['name'] for match in results]

temp_df = lights_df.toPandas()
temp_df['NEIGHBOURHOOD'] = neighbourhood_list
lights_df = spark.createDataFrame(temp_df)
lights_df.show(15,truncate=False) 

### Now we will load the crimes dataset, which is our main source of crime data

In [27]:
neighbourhood_list

['Kitsilano',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Kitsilano',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Arbutus-Ridge',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Riley Park',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Grandview-Woodland',
 'Grandview-Woodland']
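
Since the analysis works at the level of areas, one possible next step is to count poles per neighbourhood. The sketch below does this in plain Python for the first eight labels of the output above; the names `sample_labels` and `poles_per_neighbourhood` are hypothetical and chosen only for this illustration.

```python
from collections import Counter

# First eight labels from the neighbourhood_list output shown above.
sample_labels = ["Kitsilano"] + ["Arbutus-Ridge"] * 6 + ["Riley Park"]

# Count how many poles fall in each neighbourhood.
poles_per_neighbourhood = Counter(sample_labels)

# Print the neighbourhoods from most to fewest poles.
for name, n_poles in poles_per_neighbourhood.most_common():
    print(name, n_poles)
```

On the full Spark DataFrame, the same aggregation could be written as `lights_df.groupBy("NEIGHBOURHOOD").count()`, assuming the `NEIGHBOURHOOD` column has been added as in the earlier cell.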