# Analysing Crime by Lanes
Among the many datasets in the City of Vancouver Open Data Catalogue is one that lists every alleyway in the city. It contains 7827 entries, each of which falls into one of the following categories:
* Lanes
* Non-city streets
* One-way streets
* Public Streets
* Right of Way widths
* Street intersections
In this notebook, we analyse crime with respect to city lanes. In other words, we want to understand how lanes relate to crime in the city. This can help us answer questions such as:
* Which lanes have a prevalence of crime?
* How can lanes be categorized by the prevalence and/or type of crime that occurs in them?
and so on. Such observations are useful to a variety of entities. Most obviously, they help law enforcement agencies concentrate their efforts on particular lanes and alert residents to potential dangers in their vicinity.
Let us proceed in our analysis step-by-step.
First we must import the necessary dependencies.

In [2]:
import pandas as pd
import requests
import json
from pyspark.sql import SparkSession, functions, types

#### We can use an API call as follows to obtain data directly from the Open Data Catalogue

In [3]:
response = requests.get("https://opendata.vancouver.ca/api/records/1.0/search/?dataset=lanes&q=&facet=std_street")
d = response.json()
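
The parsed response `d` is a nested dictionary. As a rough sketch, assuming the payload follows the usual Opendatasoft records/1.0 layout (a top-level `records` list whose items carry a `fields` dictionary), the rows could be flattened as shown here. The `records_to_rows` helper and the `sample_payload` values are made up for illustration and are not real API output.

```python
def records_to_rows(payload):
    """Collect the `fields` dictionary of every record in an API payload."""
    # Assumption: each record keeps its column values under a "fields" key.
    return [record.get("fields", {}) for record in payload.get("records", [])]


# Hypothetical payload with the assumed shape (not real data).
sample_payload = {
    "nhits": 2,
    "records": [
        {"fields": {"std_street": "CAMBIE ST", "from_hundred_block": 7800}},
        {"fields": {"std_street": "RUPERT ST", "from_hundred_block": 900}},
    ],
}

rows = records_to_rows(sample_payload)
print(rows[0]["std_street"])
```

A list of flat dictionaries like `rows` can be passed straight to `pd.DataFrame(rows)`.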

#### However, for simplicity and to save time, we download the dataset as a flat file and work on it with Pandas.
In the next cell, we read the data into a pandas DataFrame and drop any rows with empty values.

In [4]:
df = pd.read_csv('/home/jim/Downloads/lanes.csv',sep=';')
df.dropna(how='any',inplace=True)
df.reset_index(drop=True,inplace=True)
df

Unnamed: 0,FROM_HUNDRED_BLOCK,Geom,STD_STREET
0,3400.0,"{""type"": ""LineString"", ""coordinates"": [[-123.0...",GRANDVIEW HIGHWAY
1,2000.0,"{""type"": ""LineString"", ""coordinates"": [[-123.1...",BLENHEIM ST
2,3200.0,"{""type"": ""LineString"", ""coordinates"": [[-123.0...",DIEPPE DRIVE
3,3500.0,"{""type"": ""LineString"", ""coordinates"": [[-123.0...",NORMANDY DRIVE
4,2800.0,"{""type"": ""LineString"", ""coordinates"": [[-123.1...",W 7TH AV
...,...,...,...
7822,7700.0,"{""type"": ""LineString"", ""coordinates"": [[-123.1...",ADERA ST
7823,7800.0,"{""type"": ""LineString"", ""coordinates"": [[-123.1...",CAMBIE ST
7824,7800.0,"{""type"": ""LineString"", ""coordinates"": [[-123.1...",CAMBIE ST
7825,900.0,"{""type"": ""LineString"", ""coordinates"": [[-123.0...",RUPERT ST


Let us perform some basic cleaning/preprocessing

In [5]:
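# Convert the block number to a string without the trailing ".0" (3400.0 -> "3400")
df['FROM_HUNDRED_BLOCK'] = df['FROM_HUNDRED_BLOCK'].astype(int).astype(str)
# Mask zeros with "X" (3400 -> 34XX); note this replaces every zero, so 2000 -> 2XXX
df['FROM_HUNDRED_BLOCK'] = df['FROM_HUNDRED_BLOCK'].str.replace('0', 'X', regex=False)
# Block 0 becomes a lone "X"; blank it out
df['FROM_HUNDRED_BLOCK'] = df['FROM_HUNDRED_BLOCK'].replace({'X': ''})
# Build the join key in the same "<block> <street>" form as the crime data
df['HUNDRED_BLOCK'] = df['FROM_HUNDRED_BLOCK'] + ' ' + df['STD_STREET']
# .copy() avoids SettingWithCopyWarning when columns are assigned later
df = df[['HUNDRED_BLOCK', 'Geom']].copy()
df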

Unnamed: 0,HUNDRED_BLOCK,Geom
0,34XX GRANDVIEW HIGHWAY,"{""type"": ""LineString"", ""coordinates"": [[-123.0..."
1,2XXX BLENHEIM ST,"{""type"": ""LineString"", ""coordinates"": [[-123.1..."
2,32XX DIEPPE DRIVE,"{""type"": ""LineString"", ""coordinates"": [[-123.0..."
3,35XX NORMANDY DRIVE,"{""type"": ""LineString"", ""coordinates"": [[-123.0..."
4,28XX W 7TH AV,"{""type"": ""LineString"", ""coordinates"": [[-123.1..."
...,...,...
7822,77XX ADERA ST,"{""type"": ""LineString"", ""coordinates"": [[-123.1..."
7823,78XX CAMBIE ST,"{""type"": ""LineString"", ""coordinates"": [[-123.1..."
7824,78XX CAMBIE ST,"{""type"": ""LineString"", ""coordinates"": [[-123.1..."
7825,9XX RUPERT ST,"{""type"": ""LineString"", ""coordinates"": [[-123.0..."
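
Note that replacing every zero turns block 2000 into `2XXX`. If the crime data masks only the last two digits of a block number (an assumption; values such as `71XX NANAIMO ST` in the crime data suggest it, but the convention is not confirmed here), keys like `2XXX` would never match in a join. A hedged alternative, using a hypothetical helper named `mask_hundred_block`, might look like this:

```python
def mask_hundred_block(block_number):
    """Mask the last two digits of a block number, e.g. 3400 -> '34XX'."""
    hundreds = int(block_number) // 100
    # Block 0 has no leading digits, so it gets an empty prefix.
    return f"{hundreds}XX" if hundreds > 0 else ""


for value in (3400.0, 2000.0, 900.0, 0.0):
    print(value, "->", repr(mask_hundred_block(value)))
```

It could be applied with `df['FROM_HUNDRED_BLOCK'].apply(mask_hundred_block)` in place of the zero replacement, which would change the join results shown later in this notebook.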


<B>The geographic coordinates are what matter for this exercise. We retain only the first two points of each LineString and treat them as the lane's source/destination pair. The remaining coordinates are discarded.</B>

In [6]:
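def only_coordinates(geom):
    # Parse the GeoJSON string and keep only its list of coordinates
    return json.loads(geom)['coordinates']
def trim_lane(coords):
    # Keep the first two points of the lane
    return coords[:2]
df['Geom'] = df['Geom'].apply(only_coordinates)
df['Geom'] = df['Geom'].apply(trim_lane)
df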


Unnamed: 0,HUNDRED_BLOCK,Geom
0,34XX GRANDVIEW HIGHWAY,"[[-123.02973700619418, 49.25768576683555], [-1..."
1,2XXX BLENHEIM ST,"[[-123.17731344998101, 49.26807124685622], [-1..."
2,32XX DIEPPE DRIVE,"[[-123.02610923682472, 49.25548022534088], [-1..."
3,35XX NORMANDY DRIVE,"[[-123.02810057086191, 49.25148060042281], [-1..."
4,28XX W 7TH AV,"[[-123.16841805985807, 49.26538183821827], [-1..."
...,...,...
7822,77XX ADERA ST,"[[-123.14093918643101, 49.21589632767099], [-1..."
7823,78XX CAMBIE ST,"[[-123.11610653672975, 49.214093048751806], [-..."
7824,78XX CAMBIE ST,"[[-123.11620570205356, 49.21364236624855], [-1..."
7825,9XX RUPERT ST,"[[-123.03194562819851, 49.27661683155955], [-1..."
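
For a lane drawn with more than two points, slicing with `[:2]` keeps the first two vertices rather than the two ends of the lane. The sketch below contrasts both choices on a made-up LineString. The function names and the sample coordinates are hypothetical and only illustrate the difference.

```python
import json


def first_two_points(geom_json):
    """Return the first two coordinate pairs of a GeoJSON LineString string."""
    return json.loads(geom_json)["coordinates"][:2]


def end_points(geom_json):
    """Return the first and last coordinate pairs of a GeoJSON LineString string."""
    coords = json.loads(geom_json)["coordinates"]
    return [coords[0], coords[-1]]


# Hypothetical three-point lane (coordinates invented for illustration).
sample = (
    '{"type": "LineString", "coordinates": '
    "[[-123.10, 49.25], [-123.11, 49.25], [-123.12, 49.26]]}"
)

print(first_two_points(sample))
print(end_points(sample))
```

For lanes with exactly two points, both functions return the same pair.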


#### We split the coordinate pairs into separate longitude/latitude columns. The intermediate columns can be discarded

In [7]:
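# Split each [source, destination] pair into two columns
df[['SRC','DEST']] = pd.DataFrame(df['Geom'].values.tolist(), index=df.index)
# Each point is [longitude, latitude] (GeoJSON order)
df[['SRC_LONGITUDE','SRC_LATITUDE']] = pd.DataFrame(df['SRC'].values.tolist(), index=df.index)
df[['DEST_LONGITUDE','DEST_LATITUDE']] = pd.DataFrame(df['DEST'].values.tolist(), index=df.index)
# Keep only the join key and the four coordinate columns
df = df[['HUNDRED_BLOCK','SRC_LONGITUDE','SRC_LATITUDE','DEST_LONGITUDE','DEST_LATITUDE']]
df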



Unnamed: 0,HUNDRED_BLOCK,SRC_LONGITUDE,SRC_LATITUDE,DEST_LONGITUDE,DEST_LATITUDE
0,34XX GRANDVIEW HIGHWAY,-123.029737,49.257686,-123.030106,49.257687
1,2XXX BLENHEIM ST,-123.177313,49.268071,-123.177330,49.267625
2,32XX DIEPPE DRIVE,-123.026109,49.255480,-123.026012,49.255416
3,35XX NORMANDY DRIVE,-123.028101,49.251481,-123.027537,49.251105
4,28XX W 7TH AV,-123.168418,49.265382,-123.169770,49.265408
...,...,...,...,...,...
7822,77XX ADERA ST,-123.140939,49.215896,-123.140989,49.214947
7823,78XX CAMBIE ST,-123.116107,49.214093,-123.116193,49.214005
7824,78XX CAMBIE ST,-123.116206,49.213642,-123.116222,49.213190
7825,9XX RUPERT ST,-123.031946,49.276617,-123.031832,49.276202


### Now we load the crime dataset, our main source of crime data

In [8]:
crime_df = pd.read_csv('/home/jim/Documents/Data_Analytics_Projects/Crime_in_Vancouver/Data/crime/crime_csv_all_years.csv')
crime_df = crime_df[['TYPE','HUNDRED_BLOCK']]
crime_df

Unnamed: 0,TYPE,HUNDRED_BLOCK
0,Mischief,6X E 52ND AVE
1,Theft of Vehicle,71XX NANAIMO ST
2,Break and Enter Commercial,1XX E PENDER ST
3,Mischief,9XX CHILCO ST
4,Mischief,9XX CHILCO ST
...,...,...
581729,Theft from Vehicle,4XX W PENDER ST
581730,Theft from Vehicle,4XX W PENDER ST
581731,Theft from Vehicle,6XX BURRARD ST
581732,Theft from Vehicle,6XX BURRARD ST


### Since both datasets have a hundred-block column, we shall join them on that value.
We have brought the lanes dataset into a format that can be joined to the crime data. One issue is that the crime dataset, with over 500,000 rows, is rather large. While it will not take too long to process with conventional methods, we want a method that can be scaled up whenever required.<BR> A naive join compares each row in one dataset with every row in the other, so the work grows with the product of the two row counts. Conventional methods become impractical for big datasets, i.e. several million rows. Instead of waiting hours, we can use Apache Spark's parallel computation to speed up the process on a cluster.<BR> Let us load the Spark session.

In [9]:
#Create Spark Session and context
spark = SparkSession\
    .builder\
    .appName("example code")\
    .config("spark.driver.extraClassPath","/home/jim/spark-2.4.0-bin-hadoop2.7/jars/mysql-connector-java-5.1.49.jar")\
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')
sc = spark.sparkContext

### Convert Pandas Dataframes to Spark Dataframes


In [16]:
spark_df1 = spark.createDataFrame(df)
spark_df2 = spark.createDataFrame(crime_df)
print('First Dataframe:')
spark_df1.show(10)
print('\nSecond Dataframe:')
spark_df2.show(10,truncate=True)

First Dataframe:
+--------------------+-------------------+------------------+-------------------+------------------+
|       HUNDRED_BLOCK|      SRC_LONGITUDE|      SRC_LATITUDE|     DEST_LONGITUDE|     DEST_LATITUDE|
+--------------------+-------------------+------------------+-------------------+------------------+
|34XX GRANDVIEW HI...|-123.02973700619418| 49.25768576683555|-123.03010640405142| 49.25768650772408|
|    2XXX BLENHEIM ST|-123.17731344998101| 49.26807124685622|-123.17733013123826| 49.26762460041123|
|   32XX DIEPPE DRIVE|-123.02610923682472| 49.25548022534088| -123.0260116322752|49.255415940553156|
| 35XX NORMANDY DRIVE|-123.02810057086191| 49.25148060042281|-123.02753703907399| 49.25110452889622|
|       28XX W 7TH AV|-123.16841805985807| 49.26538183821827|-123.16976954192342|   49.265407880409|
|      25XX TRUTCH ST| -123.1745192016402| 49.26375244100464|-123.17453581030856| 49.26330182785896|
|       ONTARIO PLACE|-123.10503146176733|49.232158783613016|-123.10368092

### Join the Dataframes
#### The DataFrames share the column "HUNDRED_BLOCK", which we use as the join key
#### Since Spark SQL supports native SQL syntax, we can also write join operations by creating temporary views on the DataFrames and calling spark.sql()

In [22]:
#Create temp views for Spark SQL
spark_df1.createOrReplaceTempView("DF1")
spark_df2.createOrReplaceTempView("DF2")

#SQL JOIN
joined_df = spark.sql("SELECT DF1.*,DF2.TYPE FROM DF1 LEFT JOIN DF2 ON DF1.HUNDRED_BLOCK = DF2.HUNDRED_BLOCK")
joined_df.show(10,truncate=True)

+--------------------+-------------------+------------------+-------------------+------------------+----+
|       HUNDRED_BLOCK|      SRC_LONGITUDE|      SRC_LATITUDE|     DEST_LONGITUDE|     DEST_LATITUDE|TYPE|
+--------------------+-------------------+------------------+-------------------+------------------+----+
|      12XX W 38TH AV| -123.1324494899305| 49.23641747832169| -123.1347488111366| 49.23649657706668|null|
|      13XX W 73RD AV|-123.13477094791011| 49.20373268183043|-123.13555492058455|  49.2037451134505|null|
|      13XX W 73RD AV|-123.13663407519469|49.204131773260656|-123.13809644151712|49.204655935346516|null|
|18XX PRESTWICK DRIVE|-123.06849615879251| 49.21577713941792|-123.06670602679982|49.215744328830375|null|
|18XX PRESTWICK DRIVE| -123.0693602209185|  49.2150040012115|-123.06851130958024|49.214995944326006|null|
|18XX PRESTWICK DRIVE|-123.06851130958024|49.214995944326006|-123.06672082776852| 49.21497812930801|null|
|      18XX W 16TH AV|-123.14601736255405|49.2
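
The `null` values in `TYPE` come from the left join: a lane with no crime on its hundred block is kept once with an empty `TYPE`, while a lane whose block matches several crimes is repeated once per crime. A minimal pure-Python sketch of these semantics follows. The `left_join` helper, the `LANE_ID` field, and the sample rows are hypothetical.

```python
from collections import defaultdict


def left_join(lanes, crimes):
    """Left-join lanes to crimes on HUNDRED_BLOCK using a hash lookup."""
    crimes_by_block = defaultdict(list)
    for crime in crimes:
        crimes_by_block[crime["HUNDRED_BLOCK"]].append(crime["TYPE"])

    joined = []
    for lane in lanes:
        # Lanes without a matching crime still appear once, with TYPE = None.
        for crime_type in crimes_by_block.get(lane["HUNDRED_BLOCK"], [None]):
            joined.append({**lane, "TYPE": crime_type})
    return joined


# Invented sample rows: two lane segments share one hundred block.
lanes = [
    {"HUNDRED_BLOCK": "78XX CAMBIE ST", "LANE_ID": 1},
    {"HUNDRED_BLOCK": "78XX CAMBIE ST", "LANE_ID": 2},
    {"HUNDRED_BLOCK": "12XX W 38TH AV", "LANE_ID": 3},
]
crimes = [
    {"TYPE": "Mischief", "HUNDRED_BLOCK": "78XX CAMBIE ST"},
    {"TYPE": "Other Theft", "HUNDRED_BLOCK": "78XX CAMBIE ST"},
]

joined_rows = left_join(lanes, crimes)
for row in joined_rows:
    print(row)
```

Because two lane segments share the block `78XX CAMBIE ST`, each crime on that block shows up twice in the joined rows. This is worth keeping in mind when reading counts computed on the joined table.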

#### Let us take a quick look at the types of crime and their number of occurrences

In [27]:
joined_df.groupBy('TYPE').count().show()

+--------------------+-----+
|                TYPE|count|
+--------------------+-----+
|Vehicle Collision...|  394|
|         Other Theft|17978|
|                null| 4591|
|Vehicle Collision...|    5|
|            Mischief|24639|
|    Theft of Bicycle| 5281|
|Break and Enter C...|12634|
|  Theft from Vehicle|49421|
|Break and Enter R...|18201|
|    Theft of Vehicle|10328|
+--------------------+-----+
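
The `null` row most likely counts lane rows that found no matching crime in the left join. As a small illustration of what `groupBy('TYPE').count()` computes, here is the same tally in plain Python on made-up values, which can be handy for spot checks on a small sample. The `joined_types` list is hypothetical.

```python
from collections import Counter

# Hypothetical TYPE values from joined rows; None stands for an unmatched lane.
joined_types = [
    "Mischief",
    "Other Theft",
    "Mischief",
    None,
    "Theft of Vehicle",
    None,
    "Mischief",
]

type_counts = Counter(joined_types)
for crime_type, count in type_counts.most_common():
    print(crime_type, count)
```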



#### We'll save the table for use in Tableau

In [123]:
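# Spark writes a directory named 'lanes.csv' containing part files; header=True keeps the column names
joined_df.write.csv('lanes.csv', header=True)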
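
Spark writes `lanes.csv` as a directory of `part-*.csv` files rather than as a single file. If a single file is more convenient for Tableau, one option is to concatenate the parts afterwards. The sketch below uses a hypothetical helper, `merge_part_files`, and demonstrates it on two hand-written part files in a temporary directory. Pass `has_header=True` only if the parts were written with a header row.

```python
import glob
import os
import tempfile


def merge_part_files(output_dir, merged_path, has_header=False):
    """Concatenate part-*.csv files from output_dir into a single CSV file."""
    part_files = sorted(glob.glob(os.path.join(output_dir, "part-*.csv")))
    with open(merged_path, "w", encoding="utf-8", newline="") as merged:
        for position, part in enumerate(part_files):
            with open(part, encoding="utf-8", newline="") as chunk:
                lines = chunk.readlines()
            # Keep the header row from the first part only.
            if has_header and position > 0:
                lines = lines[1:]
            merged.writelines(lines)
    return len(part_files)


# Demo on two invented part files (not real Spark output).
with tempfile.TemporaryDirectory() as tmp:
    out_dir = os.path.join(tmp, "lanes.csv")
    os.mkdir(out_dir)
    parts = (
        ("part-00000.csv", "78XX CAMBIE ST,Mischief\n"),
        ("part-00001.csv", "9XX RUPERT ST,Other Theft\n"),
    )
    for name, text in parts:
        with open(os.path.join(out_dir, name), "w", encoding="utf-8", newline="") as handle:
            handle.write(text)
    merged_file = os.path.join(tmp, "lanes_single.csv")
    merged_count = merge_part_files(out_dir, merged_file)
    with open(merged_file, encoding="utf-8", newline="") as handle:
        merged_text = handle.read()

print(merged_count)
print(merged_text)
```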