# Association of Crime with Cultural Spaces
Another popular dataset in the City of Vancouver Open Data Catalogue lists the cultural spaces in the city. There are 388 in total, and each belongs to one of the following categories:

* Museum/Gallery
* Studio/Rehearsal 
* Community Space
* Educational
* Theatre/Performance
* Cafe/Restaurant/Bar
* Other

In this notebook, we analyse crime with respect to the city's cultural spaces. In other words, we want to understand how the presence of cultural spaces is associated with crime in the city. This can help us answer questions such as:
* Is there a prevalence of crime near cultural spaces?
* Is any particular type of cultural space associated with more/less crime?
These and similar observations are useful to a variety of entities. Most obviously, they can help law enforcement agencies concentrate their efforts around these spaces and alert residents to potential dangers in their vicinity. Let us proceed step by step. First we must import the necessary dependencies.

In [1]:
import pandas as pd
from pyspark.sql import SparkSession, functions, types

#### Let us read in the table as a Dataframe object and retain only the important columns

In [17]:
df = pd.read_csv('../Data/cultural_spaces/2017CulturalSpaces.csv')
df = df[['CULTURAL_SPACE_NAME','TYPE','LOCAL_AREA','OWNERSHIP']]
df

| | CULTURAL_SPACE_NAME | TYPE | LOCAL_AREA | OWNERSHIP |
|---|---|---|---|---|
| 0 | 15th Field Artillery Regiment Museum and Archives | Museum/Gallery | Kitsilano | Privately Owned |
| 1 | 221A - 1654 | Studio/Rehearsal | Grandview-Woodland | Privately Owned |
| 2 | 221A Artist Run Centre | Museum/Gallery | Strathcona | Privately Owned |
| 3 | 222 E Georgia Studios | Studio/Rehearsal | Strathcona | Privately Owned |
| 4 | Aberthau Mansion/West Point Grey Community Centre | Community Space | West Point Grey | City of Vancouver |
| ... | ... | ... | ... | ... |
| 382 | Wise Club Hall | Theatre/Performance | Grandview-Woodland | Non-Profit |
| 383 | Woodward's Atrium | Theatre/Performance | Downtown | Privately Owned |
| 384 | Writers' Exchange | Studio/Rehearsal | Strathcona | Privately Owned |
| 385 | York Theatre | Theatre/Performance | Grandview-Woodland | City of Vancouver |

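As a quick sanity check on the category list above, the spaces can be counted per `TYPE`. The sketch below runs on a small made-up frame (`toy_spaces` is a hypothetical stand-in, not the real file). It strips surrounding whitespace first, on the assumption that some raw values carry a trailing space, as the `Studio/Rehearsal ` cells in the Spark output further down seem to suggest.

```python
import pandas as pd

# Hypothetical toy data standing in for the cultural spaces table
toy_spaces = pd.DataFrame({
    "TYPE": ["Museum/Gallery", "Studio/Rehearsal ", "Studio/Rehearsal", "Other"],
})

# Strip stray whitespace so "Studio/Rehearsal " and "Studio/Rehearsal" count as one category
toy_spaces["TYPE"] = toy_spaces["TYPE"].str.strip()

# Number of spaces per category, largest first
type_counts = toy_spaces["TYPE"].value_counts()
print(type_counts)
```

On the real data, the same two steps would be applied to `df['TYPE']`.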

### Now we load the crime dataset, our main source of crime data

In [28]:
crime_df = pd.read_csv('../Data/crime/crime_all_years_latlong.csv')
# Keep only the columns needed for the join, drop crimes without a
# neighbourhood, and renumber the rows. Chaining avoids in-place edits on a slice.
crime_df = (
    crime_df[['TYPE','NEIGHBOURHOOD','LATITUDE','LONGITUDE']]
    .dropna(subset=['NEIGHBOURHOOD'])
    .reset_index(drop=True)
)
crime_df

| | TYPE | NEIGHBOURHOOD | LATITUDE | LONGITUDE |
|---|---|---|---|---|
| 0 | Mischief | Sunset | 49.222855 | -123.104578 |
| 1 | Theft of Vehicle | Victoria-Fraserview | 49.219422 | -123.059284 |
| 2 | Break and Enter Commercial | Central Business District | 49.280454 | -123.101006 |
| 3 | Mischief | West End | 49.292614 | -123.139621 |
| 4 | Mischief | West End | 49.292609 | -123.139452 |
| ... | ... | ... | ... | ... |
| 520714 | Theft from Vehicle | Central Business District | 49.283099 | -123.112492 |
| 520715 | Theft from Vehicle | Central Business District | 49.283099 | -123.112492 |
| 520716 | Theft from Vehicle | Central Business District | 49.285151 | -123.119935 |
| 520717 | Theft from Vehicle | Central Business District | 49.285151 | -123.119935 |


### Since both datasets record a neighbourhood, we shall join them on that value.
The cultural spaces dataset is now in a suitable format to be joined to the crime data. However, with over 500,000 rows, the crime dataset is fairly large. While it will not take too long to process with conventional methods at this size, we want a method that can scale up whenever required.
When joining large datasets, every row in one dataset is paired with every row in the other that shares its join key, so the result can be far larger than either input. Conventional single-machine methods struggle when this is done on big datasets, i.e. several million rows. Instead of waiting hours, we can use Apache Spark's parallel computation to speed up the process on a cluster.
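The size of a join result can be estimated before running it. For an equi-join, each key contributes (left rows with that key) × (right rows with that key) output rows. A minimal pandas sketch on made-up data follows; `estimate_left_join_rows` and the toy frames are hypothetical names introduced only for illustration, and null keys are ignored.

```python
import pandas as pd

# Hypothetical toy data: four cultural spaces and five crimes
spaces = pd.DataFrame(
    {"LOCAL_AREA": ["Strathcona", "Strathcona", "Kitsilano", "Downtown"]}
)
crimes = pd.DataFrame(
    {"NEIGHBOURHOOD": ["Strathcona", "Strathcona", "Strathcona", "Kitsilano", "West End"]}
)

def estimate_left_join_rows(left_keys: pd.Series, right_keys: pd.Series) -> int:
    """Estimate the row count of a LEFT JOIN from per-key frequencies (null keys ignored)."""
    left_counts = left_keys.value_counts()
    right_counts = right_keys.value_counts()
    # Matches available on the right for each left key; 0 when the key is absent
    matches = right_counts.reindex(left_counts.index).fillna(0)
    # A left row without any match still produces one output row in a LEFT JOIN
    return int((left_counts * matches.clip(lower=1)).sum())

estimated = estimate_left_join_rows(spaces["LOCAL_AREA"], crimes["NEIGHBOURHOOD"])

# Cross-check against an actual pandas left merge on the toy data
actual = len(spaces.merge(crimes, how="left", left_on="LOCAL_AREA", right_on="NEIGHBOURHOOD"))
print(estimated, actual)
```

Here the two Strathcona spaces each pair with three crimes, which is how a few hundred spaces and roughly half a million crimes can expand into millions of joined rows.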
Let us create the Spark session:

In [22]:
#Create Spark Session and context
spark = SparkSession\
    .builder\
    .appName("example code")\
    .config("spark.driver.extraClassPath","/home/jim/spark-2.4.0-bin-hadoop2.7/jars/mysql-connector-java-5.1.49.jar")\
    .getOrCreate()
spark.sparkContext.setLogLevel('WARN')
sc = spark.sparkContext

### Convert Pandas Dataframes to Spark Dataframes

In [29]:
spark_df1 = spark.createDataFrame(df)
spark_df2 = spark.createDataFrame(crime_df)
print('First Dataframe:')
spark_df1.show(10)
print('\nSecond Dataframe:')
spark_df2.show(10,truncate=True)

First Dataframe:
+--------------------+-----------------+------------------+-----------------+
| CULTURAL_SPACE_NAME|             TYPE|        LOCAL_AREA|        OWNERSHIP|
+--------------------+-----------------+------------------+-----------------+
|15th Field Artill...|   Museum/Gallery|         Kitsilano|  Privately Owned|
|         221A - 1654|Studio/Rehearsal |Grandview-Woodland|  Privately Owned|
|221A Artist Run C...|   Museum/Gallery|        Strathcona|  Privately Owned|
|222 E Georgia Stu...|Studio/Rehearsal |        Strathcona|  Privately Owned|
|Aberthau Mansion/...|  Community Space|   West Point Grey|City of Vancouver|
|        Acme Studios|Studio/Rehearsal |          Downtown|  Privately Owned|
|        AHVA Gallery|   Museum/Gallery|               UBC|            Other|
|Al Mozaico Flamen...|      Educational|        Strathcona|  Privately Owned|
|Alliance for Arts...|   Museum/Gallery|          Downtown|  Privately Owned|
|Alliance Francais...|  Community Space|       

### Join the Dataframes
#### The DataFrames share a neighbourhood column (`LOCAL_AREA` in the cultural spaces data, `NEIGHBOURHOOD` in the crime data), which we use as the join key
#### Since Spark SQL supports native SQL syntax, we can write the join as a SQL query after registering the DataFrames as temporary views and calling spark.sql()

In [35]:
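# Register both DataFrames as temporary views so they can be queried with Spark SQL
spark_df1.createOrReplaceTempView("DF1")
spark_df2.createOrReplaceTempView("DF2")

# LEFT JOIN keeps every cultural space and attaches each crime recorded in the same neighbourhood
joined_df = spark.sql("""
    SELECT DF1.*, DF2.TYPE AS CRIME, DF2.LATITUDE, DF2.LONGITUDE
    FROM DF1
    LEFT JOIN DF2 ON DF1.LOCAL_AREA = DF2.NEIGHBOURHOOD
""")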
joined_df.show(10,truncate=True)

+--------------------+---------------+----------+---------------+--------------------+------------------+-------------------+
| CULTURAL_SPACE_NAME|           TYPE|LOCAL_AREA|      OWNERSHIP|               CRIME|          LATITUDE|          LONGITUDE|
+--------------------+---------------+----------+---------------+--------------------+------------------+-------------------+
|Alliance Francais...|Community Space|  Oakridge|Privately Owned|    Theft of Vehicle| 49.22077844498322|-123.11653629459887|
|Alliance Francais...|Community Space|  Oakridge|Privately Owned|    Theft of Vehicle|  49.2202124109129| -123.1197527829363|
|Alliance Francais...|Community Space|  Oakridge|Privately Owned|    Theft of Vehicle|49.220223808123535|-123.12029799421632|
|Alliance Francais...|Community Space|  Oakridge|Privately Owned|Break and Enter R...| 49.23151697725072|-123.11384517629742|
|Alliance Francais...|Community Space|  Oakridge|Privately Owned|Break and Enter R...| 49.22969604392647|-123.11405114

In [38]:
print("There are {} rows in this dataset".format(joined_df.count()))

There are 7348943 rows in this dataset

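Because the join matches `LOCAL_AREA` and `NEIGHBOURHOOD` as exact strings, it is worth checking which names appear on only one side. In the samples printed above, the cultural spaces use `Downtown` while the crime rows use `Central Business District`. Whether these denote the same area is an assumption here, not something the data confirms. The sketch below uses toy frames, a hypothetical `unmatched_keys` helper, and a hypothetical `AREA_ALIASES` mapping to show the check.

```python
import pandas as pd

# Hypothetical toy data mirroring the naming styles seen in the two tables
spaces = pd.DataFrame({"LOCAL_AREA": ["Kitsilano", "Strathcona", "Downtown", "UBC"]})
crimes = pd.DataFrame(
    {"NEIGHBOURHOOD": ["Kitsilano", "Strathcona", "Central Business District", "West End"]}
)

def unmatched_keys(left: pd.Series, right: pd.Series) -> list:
    """Return the distinct left values that never occur in right."""
    left_values = set(left.dropna().str.strip())
    right_values = set(right.dropna().str.strip())
    return sorted(left_values - right_values)

# Areas with cultural spaces but no crime rows under the same name
before = unmatched_keys(spaces["LOCAL_AREA"], crimes["NEIGHBOURHOOD"])

# ASSUMPTION: this alias is illustrative only and must be verified against the real data
AREA_ALIASES = {"Downtown": "Central Business District"}
spaces["JOIN_AREA"] = spaces["LOCAL_AREA"].replace(AREA_ALIASES)
after = unmatched_keys(spaces["JOIN_AREA"], crimes["NEIGHBOURHOOD"])
print(before, after)
```

With a LEFT JOIN, any space whose area stays unmatched still appears in the result, but with empty crime columns, so such rows are easy to overlook in a row count.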

### Here is the Tableau visualization.
This is a dual-axis map that shows the density of crime together with the locations of the cultural spaces.
<img src="../Visualisation/Raw/Cultural_spaces.png">
The Tableau Public dashboard can be viewed at:
<a>https://public.tableau.com/profile/jamshed.khan#!/vizhome/Crime_CulturalSpace/Dashboard1</a>