# Crime near Schools
Perhaps one of the most important analyses from a safety point of view is to examine the intensity of crimes occurring near schools. In this exercise, we shall attempt to build several visualizations to understand this.
First, we import some dependencies, start a SparkSession, and read in the data.

In [9]:
import pandas as pd
import reverse_geocoder as rg
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from geocodio import GeocodioClient
API_KEY = 'dd80c07f04d3066730c74d703707660d407fdcf'
import geopy
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter

In [6]:
#Create Spark Session and context
spark = SparkSession\
    .builder\
    .appName("example code")\
    .config("spark.driver.extraClassPath","/home/jim/spark-2.4.0-bin-hadoop2.7/jars/mysql-connector-java-5.1.49.jar")\
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')
sc = spark.sparkContext

Read in the data

In [7]:
schools = spark.read.format("csv").option("header", "true").load("../Data/schools.csv")
schools.show(10,truncate=False)
print('The Schools dataset has {} rows'.format(schools.count()))

+------------------------------------+----------------+-----------------+-------------------+------------------+
|SCHOOL_NAME                         |LATITUDE        |LONGITUDE        |ADDRESS            |SCHOOL_CATEGORY   |
+------------------------------------+----------------+-----------------+-------------------+------------------+
|Admiral Seymour Elementary          |49.27859427     |-123.0803143     |1130 Keefer St     |Public School     |
|Admiral Seymour StrongStart Centre  |49.27859427     |-123.0803143     |1130 Keefer St     |StrongStart BC    |
|Alexander Academy                   |49.2850006001823|-123.114009854193|688 W Hastings St  |Independent School|
|Anchor Point Montessori             |49.277061848292 |-123.130918922266|1351 Hornby St     |Independent School|
|BC Children's Adol. Psych. Unit     |49.239630962    |-123.12579096    |5025 Willow St     |Public School     |
|BC Childrens Hosp School Program    |49.239630962    |-123.12579096    |5025 Willow St     |Pub

#### We can also put our shapefile to use for this exercise, but for that we need a 'NEIGHBOURHOOD' column in our schools dataset, which we obtain by reverse geocoding each school's coordinates

In [13]:
#Won't be needing SCHOOL_CATEGORY
schools = schools.select('SCHOOL_NAME','LATITUDE','LONGITUDE','ADDRESS')
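#Collect the coordinates to the driver; the reverse geocoding runs locally, not on Spark
latitude_list = schools.select("LATITUDE").rdd.flatMap(lambda x: x).collect()
longitude_list = schools.select("LONGITUDE").rdd.flatMap(lambda x: x).collect()
neighbourhood_list = []
locator = Nominatim(user_agent="myGeocoder")
#Throttle requests to one per second to respect Nominatim's usage policy
reverse = RateLimiter(locator.reverse, min_delay_seconds=1)

for lat, lon in zip(latitude_list, longitude_list):
    location = reverse([lat, lon])
    #'city_district' may be missing for some points; store None instead of raising a KeyError
    address = location.raw.get('address', {}) if location else {}
    neighbourhood_list.append(address.get('city_district'))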

temp_df = schools.toPandas()
temp_df['NEIGHBOURHOOD'] = neighbourhood_list
schools = spark.createDataFrame(temp_df)
schools.show(15,truncate=False) 

+-------------------------------------+----------------+-----------------+-------------------+------------------+
|SCHOOL_NAME                          |LATITUDE        |LONGITUDE        |ADDRESS            |NEIGHBOURHOOD     |
+-------------------------------------+----------------+-----------------+-------------------+------------------+
|Admiral Seymour Elementary           |49.27859427     |-123.0803143     |1130 Keefer St     |Strathcona        |
|Admiral Seymour StrongStart Centre   |49.27859427     |-123.0803143     |1130 Keefer St     |Strathcona        |
|Alexander Academy                    |49.2850006001823|-123.114009854193|688 W Hastings St  |Downtown          |
|Anchor Point Montessori              |49.277061848292 |-123.130918922266|1351 Hornby St     |Downtown          |
|BC Children's Adol. Psych. Unit      |49.239630962    |-123.12579096    |5025 Willow St     |South Cambie      |
|BC Childrens Hosp School Program     |49.239630962    |-123.12579096    |5025 Willow St
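
The reverse-geocoding step depends on the `city_district` key being present in every response. Here is a minimal sketch of a more defensive extraction, assuming each response is a Nominatim-style dictionary with an `address` sub-dictionary. The helper name `extract_neighbourhood`, the fallback keys `suburb` and `neighbourhood`, and the sample responses are assumptions for illustration and are not part of the original notebook.

```python
def extract_neighbourhood(raw, keys=("city_district", "suburb", "neighbourhood")):
    """Return the first non-empty district-like field from a raw geocoder response.

    `raw` is assumed to look like {'address': {...}}; None is returned when
    no candidate key is present.
    """
    address = (raw or {}).get("address", {})
    for key in keys:
        value = address.get(key)
        if value:
            return value
    return None


# Made-up responses, shaped like the 'raw' attribute of a geopy Location
samples = [
    {"address": {"city_district": "Strathcona", "city": "Vancouver"}},
    {"address": {"suburb": "Downtown", "city": "Vancouver"}},
    {"address": {"city": "Vancouver"}},
    None,
]

labels = [extract_neighbourhood(raw) for raw in samples]
print(labels)
```

Rows that come back as `None` can then be inspected by hand before the join, rather than stopping the whole loop on the first missing key.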

## Now we will load the crimes dataset, our main source of crime data

In [15]:
crime_df = spark.read.format("csv").option("header", "true").load("../Data/crime/crime_all_years_latlong.csv")
#Keep only the required columns
crime_df = crime_df.select(['TYPE','NEIGHBOURHOOD','LATITUDE','LONGITUDE'])
#Drop rows with a missing value in any of the kept columns
crime_df = crime_df.dropna(how='any')
crime_df.show(10,truncate=True)
print("Crime Dataset has {} rows".format(crime_df.count()))

+--------------------+--------------------+------------------+-------------------+
|                TYPE|       NEIGHBOURHOOD|          LATITUDE|          LONGITUDE|
+--------------------+--------------------+------------------+-------------------+
|            Mischief|              Sunset| 49.22285547453633|-123.10457767461014|
|    Theft of Vehicle| Victoria-Fraserview| 49.21942208176436|-123.05928356709362|
|Break and Enter C...|Central Business ...|49.280454355702865|-123.10100566349294|
|            Mischief|            West End| 49.29261448054877|-123.13962081805273|
|            Mischief|            West End| 49.29260865723727|-123.13945233120421|
|            Mischief|    Hastings-Sunrise|49.281126361961825| -123.0554729922974|
|  Theft from Vehicle|      Mount Pleasant|49.263002922167225|-123.10655743565438|
|            Mischief|    Hastings-Sunrise| 49.28112610578195|-123.05525671257254|
|  Theft from Vehicle|           Kitsilano| 49.25958751890934| -123.1707943860336|
|  T

#### We shall also store another dataframe, this time with crime counts by neighbourhood

In [16]:
crime_df = crime_df.select(['TYPE','NEIGHBOURHOOD'])
concise_crime = crime_df.groupBy('NEIGHBOURHOOD').count().withColumnRenamed('count', 'CRIME_COUNT')
crime_df.show(10)
print('Crime Counts by Neighbourhood:')
concise_crime.show(10)

+--------------------+--------------------+
|                TYPE|       NEIGHBOURHOOD|
+--------------------+--------------------+
|            Mischief|              Sunset|
|    Theft of Vehicle| Victoria-Fraserview|
|Break and Enter C...|Central Business ...|
|            Mischief|            West End|
|            Mischief|            West End|
|            Mischief|    Hastings-Sunrise|
|  Theft from Vehicle|      Mount Pleasant|
|            Mischief|    Hastings-Sunrise|
|  Theft from Vehicle|           Kitsilano|
|  Theft from Vehicle|           Kitsilano|
+--------------------+--------------------+
only showing top 10 rows

Crime Counts by Neighbourhood:
+-------------------+-----------+
|      NEIGHBOURHOOD|CRIME_COUNT|
+-------------------+-----------+
|           Oakridge|       8698|
|        Shaughnessy|       5993|
|           Fairview|      34654|
|      Arbutus Ridge|       6503|
|       Stanley Park|       3977|
|   Hastings-Sunrise|      19838|
|     Mount Pleasant|
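
To make the aggregation concrete, here is a small plain-Python sketch of what `groupBy('NEIGHBOURHOOD').count()` computes. The rows are a handful of illustrative `(TYPE, NEIGHBOURHOOD)` pairs, not the real crime data, and `collections.Counter` stands in for the Spark aggregation.

```python
from collections import Counter

# Illustrative (TYPE, NEIGHBOURHOOD) rows; not the real dataset
rows = [
    ("Mischief", "Sunset"),
    ("Theft of Vehicle", "Victoria-Fraserview"),
    ("Mischief", "West End"),
    ("Mischief", "West End"),
    ("Theft from Vehicle", "Kitsilano"),
    ("Theft from Vehicle", "Kitsilano"),
]

# One count per neighbourhood, regardless of crime type
crime_count = Counter(neighbourhood for _type, neighbourhood in rows)

for neighbourhood, count in sorted(crime_count.items()):
    print(neighbourhood, count)
```

The result has one row per neighbourhood, which is why joining it to the schools later keeps one row per school.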

### Let us go ahead and join the schools with the crime data on neighbourhood to obtain these new datasets

In [17]:
#Create temp views for Spark SQL
schools.createOrReplaceTempView("DF1")
crime_df.createOrReplaceTempView("DF2")
concise_crime.createOrReplaceTempView("DF3")

#SQL JOIN
school_crime = spark.sql("""SELECT DF1.*,
                      DF2.TYPE AS CRIME 
                      FROM DF1 LEFT JOIN DF2 ON DF1.NEIGHBOURHOOD = DF2.NEIGHBOURHOOD""")
school_crime.show(15,truncate=True)
print("The School Crime Type Dataset has {} rows".format(school_crime.count()))

school_crimecount = spark.sql("""SELECT DF1.*,
                      DF3.CRIME_COUNT AS CRIME_COUNT 
                      FROM DF1 LEFT JOIN DF3 ON DF1.NEIGHBOURHOOD = DF3.NEIGHBOURHOOD""")
school_crimecount.show(15,truncate=True)
print("The School Crime Count Dataset has {} rows".format(school_crimecount.count()))


+--------------------+------------+-------------+---------------+-------------+--------------------+
|         SCHOOL_NAME|    LATITUDE|    LONGITUDE|        ADDRESS|NEIGHBOURHOOD|               CRIME|
+--------------------+------------+-------------+---------------+-------------+--------------------+
|Dr Annie B Jamies...|49.226906936|-123.12100426|6350 Tisdall St|     Oakridge|    Theft of Vehicle|
|Dr Annie B Jamies...|49.226906936|-123.12100426|6350 Tisdall St|     Oakridge|    Theft of Vehicle|
|Dr Annie B Jamies...|49.226906936|-123.12100426|6350 Tisdall St|     Oakridge|    Theft of Vehicle|
|Dr Annie B Jamies...|49.226906936|-123.12100426|6350 Tisdall St|     Oakridge|Break and Enter R...|
|Dr Annie B Jamies...|49.226906936|-123.12100426|6350 Tisdall St|     Oakridge|Break and Enter R...|
|Dr Annie B Jamies...|49.226906936|-123.12100426|6350 Tisdall St|     Oakridge|Break and Enter R...|
|Dr Annie B Jamies...|49.226906936|-123.12100426|6350 Tisdall St|     Oakridge|    Theft of
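
Two properties of this left join are worth keeping in mind. Each school row is repeated once for every crime record in its neighbourhood, and a school whose neighbourhood label has no exact match in the crime data is kept with a null crime. The sample outputs show `Downtown` among the geocoded labels and a truncated `Central Business ...` label in the crime data, so such mismatches seem plausible here. The following plain-Python sketch uses made-up rows and a hypothetical helper `left_join` to illustrate both effects; it is not the Spark implementation.

```python
def left_join(schools, crimes):
    """Left join (school, neighbourhood) rows to (crime_type, neighbourhood) rows.

    Schools without a matching neighbourhood are kept with crime set to None.
    """
    crimes_by_hood = {}
    for crime_type, hood in crimes:
        crimes_by_hood.setdefault(hood, []).append(crime_type)

    joined = []
    for name, hood in schools:
        matches = crimes_by_hood.get(hood)
        if matches:
            # One output row per matching crime record
            joined.extend((name, hood, crime) for crime in matches)
        else:
            joined.append((name, hood, None))
    return joined


# Made-up rows for illustration only
schools = [("School A", "Oakridge"), ("School B", "Downtown")]
crimes = [
    ("Theft of Vehicle", "Oakridge"),
    ("Mischief", "Oakridge"),
    ("Mischief", "West End"),
]

joined = left_join(schools, crimes)
# Neighbourhood labels from the schools side that found no crime records
unmatched = sorted({hood for _name, hood, crime in joined if crime is None})

print(len(joined))
print(unmatched)
```

Listing the unmatched labels before exporting makes it easier to spot neighbourhoods that need renaming on one side of the join.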

In [18]:
#Write each result with a header row; repartition(1) yields a single part file per output
school_crime.repartition(1).write.format("csv").option("header", "true").save("SCHOOL_CRIME.csv")
school_crimecount.repartition(1).write.format("csv").option("header", "true").save("SCHOOL_CRIMECOUNT.csv")
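
Note that Spark's `save` creates a directory with the given name, containing `part-*` data files alongside marker files, rather than a single CSV file. Here is a minimal sketch for locating the data file before loading it into Tableau, assuming exactly one partition was written and that the part file name ends in `.csv`. The helper name `find_part_file` is hypothetical.

```python
import glob
import os


def find_part_file(output_dir):
    """Return the path of the single part-*.csv file inside a Spark output directory."""
    matches = sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
    if len(matches) != 1:
        raise FileNotFoundError(
            "expected exactly one part file in {}, found {}".format(output_dir, len(matches))
        )
    return matches[0]
```

For example, `find_part_file("SCHOOL_CRIME.csv")` would return the path of the CSV part file inside that output directory, which can then be renamed or opened directly.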

## Here is the Tableau visualization:
The schools are plotted on the map as red markers. The neighbourhood boundaries come from the shapefile provided by the Open Data Catalogue, and each neighbourhood is shaded by its crime intensity on the color scale. The Tableau Public dashboard can be viewed at: <a href="https://public.tableau.com/views/School_Crime/Dashboard1?:language=en&:display_count=y&publish=yes&:origin=viz_share_link">https://public.tableau.com/views/School_Crime/Dashboard1?:language=en&:display_count=y&publish=yes&:origin=viz_share_link
</a><br>
<img src="../Visualisation/Raw/Schools.png">
