# Crime near Graffiti
Graffiti is a common sight in Vancouver, with many of the city's buildings adorned with artwork and murals. Since graffiti is illegal in Vancouver, we hypothesize that its presence may correlate with reduced levels of law enforcement in the vicinity and hence with an increased rate of crime. We use the Graffiti Open Dataset from the catalogue to test this hypothesis.
<br>First we import some dependencies, start a SparkSession and read in the data.

In [15]:
import pandas as pd
import reverse_geocoder as rg
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from geocodio import GeocodioClient
API_KEY = 'dd80c07f04d3066730c74d703707660d407fdcf'

In [16]:
#Create Spark Session and context
spark = SparkSession\
    .builder\
    .appName("example code")\
    .config("spark.driver.extraClassPath","/home/jim/spark-2.4.0-bin-hadoop2.7/jars/mysql-connector-java-5.1.49.jar")\
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')
sc = spark.sparkContext

In [18]:
graf_df = spark.read.format("csv").option("header", "true").load("../Data/graffiti.csv")
graf_df.show(10,truncate=False)
print('The Graffiti dataset has {} rows'.format(graf_df.count()))

+-----+-----------+------------+
|COUNT|LATITUDE   |LONGITUDE   |
+-----+-----------+------------+
|1    |49.2238602 |-123.0904255|
|5    |49.26131589|-123.1139357|
|4    |49.28328122|-123.1134863|
|2    |49.2630182 |-123.1201315|
|10   |49.26279451|-123.0923838|
|10   |49.26279573|-123.0925276|
|1    |49.26588693|-123.1082242|
|1    |49.26318465|-123.1853592|
|2    |49.29116564|-123.1322237|
|8    |49.26497349|-123.1375141|
+-----+-----------+------------+
only showing top 10 rows

The Graffiti dataset has 8507 rows


In [51]:
graf_df.select('LONGITUDE').distinct().count()

8054

#### Here, we have lat/long pairs in the dataset, but this is not enough to join it to any other dataset based on location. The problem is that lat/long pairs almost never match exactly: at this level of precision, each pair pinpoints a single location rather than a block.
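To illustrate why exact matching on coordinates fails, the sketch below computes the great-circle distance between two rows of the sample output above that differ only in their later decimal places. The `haversine_m` helper and the Earth radius constant are illustrative assumptions and not part of the original notebook.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/long points (hypothetical helper)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

# Two graffiti rows taken from the sample output above
p1 = (49.26279451, -123.0923838)
p2 = (49.26279573, -123.0925276)

distance = haversine_m(*p1, *p2)
print(p1 == p2)            # the pairs differ, so an equality join would not match them
print(round(distance, 1))  # distance between the two points in metres
```

Points this close together belong to the same stretch of street, which is why a coarser key such as a block label is a more practical join column than raw coordinates.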

We, on the other hand, are considering crime levels by area, so we need a way to generate a 'HUNDRED_BLOCK' field from each lat/long pair.
The Geocodio reverse-geocoding API can be used for this purpose.

In [87]:
latitude_list = graf_df.select("LATITUDE").rdd.flatMap(lambda x: x).collect()
longitude_list = graf_df.select("LONGITUDE").rdd.flatMap(lambda x: x).collect()
neighbourhood_list = []
client = GeocodioClient(API_KEY)

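for lat, lng in zip(latitude_list, longitude_list):
    # Reverse-geocode the point and take the address components of the first result
    components = client.reverse((lat, lng))['results'][0]['address_components']
    # Keep the first two digits of the house number, append 'XX' and the upper-cased street name
    neighbourhood_list.append(components['number'][:2] + 'XX ' + components['formatted_street'].upper())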

temp_df = graf_df.toPandas()
temp_df['HUNDRED_BLOCK'] = neighbourhood_list
graf_df = spark.createDataFrame(temp_df)
graf_df.show(15,truncate=False) 

+-----+-----------+------------+-----------------+
|COUNT|LATITUDE   |LONGITUDE   |HUNDRED_BLOCK    |
+-----+-----------+------------+-----------------+
|1    |49.2238602 |-123.0904255|66XX FRASER ST   |
|5    |49.26131589|-123.1139357|45XX W 12TH AVE  |
|4    |49.28328122|-123.1134863|50XX RICHARDS ST |
|2    |49.2630182 |-123.1201315|70XX W BROADWAY  |
|10   |49.26279451|-123.0923838|51XX E BROADWAY  |
|10   |49.26279573|-123.0925276|51XX E BROADWAY  |
|1    |49.26588693|-123.1082242|16XX W 6TH AVE   |
|1    |49.26318465|-123.1853592|36XX W 10TH AVE  |
|2    |49.29116564|-123.1322237|16XX W GEORGIA ST|
|8    |49.26497349|-123.1375141|14XX W 8TH AVE   |
|5    |49.24216877|-123.0596631|22XX KINGSWAY    |
|10   |49.2387985 |-123.0649746|50XX VICTORIA DR |
|7    |49.26276417|-123.0886586|71XX E BROADWAY  |
|4    |49.26386496|-123.1716457|29XX W BROADWAY  |
|1    |49.28428105|-123.109868 |34XX WATER ST    |
+-----+-----------+------------+-----------------+
only showing top 15 rows
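The crime data used later in this notebook writes blocks such as `1XX E PENDER ST` and `6X E 52ND AVE`, which suggests that the trailing digits of the house number are masked, rather than everything after the first two digits. A minimal sketch of such a masking rule follows. The function name `to_hundred_block`, the handling of one-digit numbers, and the example house numbers are assumptions for illustration and are not part of the original code.

```python
def to_hundred_block(number, street):
    """Mask the trailing digits of a house number (assumed rule, hypothetical helper)."""
    digits = str(number).strip()
    if len(digits) >= 3:
        masked = digits[:-2] + 'XX'   # e.g. 6650 -> 66XX, 450 -> 4XX
    elif len(digits) == 2:
        masked = digits[0] + 'X'      # e.g. 65 -> 6X
    else:
        masked = 'X'                  # one digit or empty: an assumption
    return masked + ' ' + street.upper()

# Hypothetical house numbers, chosen only to exercise each branch
print(to_hundred_block('6650', 'Fraser St'))   # 66XX FRASER ST
print(to_hundred_block('450', 'W 12th Ave'))   # 4XX W 12TH AVE
print(to_hundred_block('65', 'E 52nd Ave'))    # 6X E 52ND AVE
```

Whether this rule matches the crime dataset in every case would need to be checked against the data itself; street suffixes (for example `W BROADWAY` versus `W BROADWAY AVE`) may also differ between sources.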



### Now we load the crime dataset, which is our main source of crime data


In [67]:
crime_df = spark.read.format("csv").option("header", "true").load("../Data/crime/crime_all_years_latlong.csv")
# Keep only the columns we need
crime_df = crime_df.select(['TYPE','HUNDRED_BLOCK','LATITUDE','LONGITUDE'])
crime_df = crime_df.dropna(how='any')
crime_df.show(10,truncate=True)
print("Crime Dataset has {} rows".format(crime_df.count()))

+--------------------+------------------+------------------+-------------------+
|                TYPE|     HUNDRED_BLOCK|          LATITUDE|          LONGITUDE|
+--------------------+------------------+------------------+-------------------+
|            Mischief|     6X E 52ND AVE| 49.22285547453633|-123.10457767461014|
|    Theft of Vehicle|   71XX NANAIMO ST| 49.21942208176436|-123.05928356709362|
|Break and Enter C...|   1XX E PENDER ST|49.280454355702865|-123.10100566349294|
|            Mischief|     9XX CHILCO ST| 49.29261448054877|-123.13962081805273|
|            Mischief|     9XX CHILCO ST| 49.29260865723727|-123.13945233120421|
|            Mischief|24XX E HASTINGS ST|49.281126361961825| -123.0554729922974|
|  Theft from Vehicle| 8X W BROADWAY AVE|49.263002922167225|-123.10655743565438|
|            Mischief|24XX E HASTINGS ST| 49.28112610578195|-123.05525671257254|
|  Theft from Vehicle|   29XX W 14TH AVE| 49.25958751890934| -123.1707943860336|
|  Theft from Vehicle|   29X

#### We must now get this dataset into a format that can be meaningfully joined to the Graffiti data
#### Counting the crimes per HUNDRED_BLOCK gives one row per block, which can then be joined to the graffiti dataset on the HUNDRED_BLOCK field generated above, associating each graffiti location with the crime count of its block

In [99]:
crime_df = crime_df.select(['TYPE','HUNDRED_BLOCK'])
crime_df = crime_df.groupBy('HUNDRED_BLOCK').count().withColumnRenamed('count', 'CRIME_COUNT')
crime_df.show(10)

+--------------------+-----------+
|       HUNDRED_BLOCK|CRIME_COUNT|
+--------------------+-----------+
|   1XX COMMERCIAL DR|         47|
|      6XX W 10TH AVE|        130|
|E 48TH AVE / ELLI...|          2|
|        36XX RAE AVE|        114|
|   64XX CLARENDON ST|          6|
|     28XX E 44TH AVE|         43|
|     26XX W 20TH AVE|         10|
|     13XX W 13TH AVE|        140|
|          5X KERR ST|          2|
|      1XX ONTARIO PL|         56|
+--------------------+-----------+
only showing top 10 rows
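The aggregation above amounts to counting rows per block. A minimal plain-Python sketch of the same idea follows, using only the `HUNDRED_BLOCK` values visible in the crime sample shown earlier (not the full dataset); the variable names are illustrative.

```python
from collections import Counter

# HUNDRED_BLOCK values from the nine complete rows of the crime sample shown earlier
blocks = [
    '6X E 52ND AVE', '71XX NANAIMO ST', '1XX E PENDER ST',
    '9XX CHILCO ST', '9XX CHILCO ST', '24XX E HASTINGS ST',
    '8X W BROADWAY AVE', '24XX E HASTINGS ST', '29XX W 14TH AVE',
]

# Equivalent in spirit to groupBy('HUNDRED_BLOCK').count()
crime_count = Counter(blocks)
for block, n in crime_count.most_common(3):
    print(block, n)
```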



In [103]:
# Create temporary views for Spark SQL
graf_df.createOrReplaceTempView("DF1")
crime_df.createOrReplaceTempView("DF2")

#SQL JOIN
joined_df = spark.sql("""SELECT DF1.COUNT AS GRAFITI_COUNT,DF1.LATITUDE AS GRAFITI_LAT,
                      DF1.LONGITUDE AS GRAFITI_LONG, DF1.HUNDRED_BLOCK,
                      DF2.CRIME_COUNT AS NO_OF_CRIMES 
                      FROM DF1 LEFT JOIN DF2 ON DF1.HUNDRED_BLOCK = DF2.HUNDRED_BLOCK""")
joined_df.show(15,truncate=True)
print("The new Dataset has {} rows".format(joined_df.count()))

+-------------+-----------+------------+-----------------+------------+
|GRAFITI_COUNT|GRAFITI_LAT|GRAFITI_LONG|    HUNDRED_BLOCK|NO_OF_CRIMES|
+-------------+-----------+------------+-----------------+------------+
|            1| 49.2238602|-123.0904255|   66XX FRASER ST|          87|
|            5|49.26131589|-123.1139357|  45XX W 12TH AVE|          39|
|            4|49.28328122|-123.1134863| 50XX RICHARDS ST|        null|
|            2| 49.2630182|-123.1201315|  70XX W BROADWAY|        null|
|           10|49.26279451|-123.0923838|  51XX E BROADWAY|        null|
|           10|49.26279573|-123.0925276|  51XX E BROADWAY|        null|
|            1|49.26588693|-123.1082242|   16XX W 6TH AVE|          39|
|            1|49.26318465|-123.1853592|  36XX W 10TH AVE|         149|
|            2|49.29116564|-123.1322237|16XX W GEORGIA ST|         119|
|            8|49.26497349|-123.1375141|   14XX W 8TH AVE|          94|
|            5|49.24216877|-123.0596631|    22XX KINGSWAY|      
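
Because this is a LEFT JOIN, every graffiti row is kept, and `NO_OF_CRIMES` is null wherever no crime block has exactly the same `HUNDRED_BLOCK` string. The sketch below mimics that behaviour with a dictionary lookup. The values are taken from the output above, and the variable names are illustrative rather than part of the original notebook.

```python
# Crime counts keyed by block (subset taken from the join output above)
crime_counts = {'66XX FRASER ST': 87, '16XX W GEORGIA ST': 119}

# Graffiti rows as (graffiti count, block) pairs
graffiti_rows = [
    (1, '66XX FRASER ST'),
    (4, '50XX RICHARDS ST'),
    (2, '16XX W GEORGIA ST'),
]

# Left join: keep every graffiti row; dict.get returns None when no block matches
joined = [(count, block, crime_counts.get(block)) for count, block in graffiti_rows]
for row in joined:
    print(row)
```

Rows with a missing match are worth inspecting before analysis, since they may reflect differences in how the two sources write a block label rather than a genuine absence of crime.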

In [105]:
joined_df.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save("Graffiti.csv")

## Here is the Tableau visualization:
The locations with graffiti are plotted on the map as blue markers. The size of each marker represents the number of crimes, whereas its color intensity depicts the graffiti count. No clear relationship between graffiti count and crime intensity is visible, i.e. the biggest bubbles are not the most intensely colored. The Tableau Public dashboard can be found at <a href="https://public.tableau.com/views/Crime_vs_Graffiti/Dashboard1?:language=en&:display_count=y&publish=yes&:origin=viz_share_link">https://public.tableau.com/views/Crime_vs_Graffiti/Dashboard1?:language=en&:display_count=y&publish=yes&:origin=viz_share_link</a><br>
<img src="../Visualisation/Raw/Graffiti.png">
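
A visual comparison could be complemented by a correlation coefficient. The sketch below computes Pearson's r in plain Python on the six non-null rows visible in the join output above. This tiny sample is for illustration only and says nothing reliable about the full dataset; the `pearson_r` helper is an assumption, not part of the original notebook.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences (hypothetical helper)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# (GRAFITI_COUNT, NO_OF_CRIMES) for the non-null rows shown in the join output
graffiti_counts = [1, 5, 1, 1, 2, 8]
crime_counts = [87, 39, 39, 149, 119, 94]

r = pearson_r(graffiti_counts, crime_counts)
print(round(r, 3))
```

On the full joined DataFrame, something like `joined_df.stat.corr(...)` could serve the same purpose, assuming the two columns are first cast to numeric types and null rows are dropped.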
