## Assignment 8
Overview of the towns where **X**% of the stops lie within a radius **R** of the town's location (lat, long). The result is calculated from the De Lijn data on all stops in Belgium in *stops.txt* and the district locations in *zipcodes.csv*. No citizen counts are required in this assignment, so the term *town* can be read as the districts used in both *zipcodes.csv* and the De Lijn data in *stops.txt*.

#### Input Variables
Enter the desired radius in kilometers in *INPUT_RADIUS*. Set the percentage in *INPUT_PERCENTAGE* as a fraction between 0 and 1 (for example, 0.80 for 80%). No input position is needed in this notebook: each stop is measured against the location of its own town.

In [207]:
INPUT_RADIUS = 2
INPUT_PERCENTAGE = 0.80
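
Both values feed directly into the distance and ratio checks further down, so a small sanity check can catch out-of-range inputs early. The sketch below is an assumption and not part of the original notebook; `validate_inputs` is a hypothetical helper name.

```python
def validate_inputs(radius_km, percentage):
    """Hypothetical helper: check the two input variables before they are used."""
    if radius_km <= 0:
        raise ValueError("radius must be a positive number of kilometers")
    if not 0 <= percentage <= 1:
        raise ValueError("percentage must be a fraction between 0 and 1")
    return radius_km, percentage

# Same values as in the cell above.
INPUT_RADIUS, INPUT_PERCENTAGE = validate_inputs(2, 0.80)
```

Passing `80` instead of `0.80` would raise a `ValueError` here rather than silently producing an empty result later on.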

#### Setup SparkContext
The SparkContext is required to set up other aspects of this project. The SQLContext is used later to read *zipcodes.csv* directly into a dataframe, which makes the data easier to read and query.

In [208]:
from pyspark.sql import SparkSession, SQLContext

# Create (or reuse) the Spark session and derive the other entry points from it.
spark = SparkSession.builder.appName("Ex5").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

#### Setup Stops
As with the previous notebooks, we can read the De Lijn data from *stops.txt*. The following function reads and parses the JSON content of that file.

In [209]:
import json

def parseJSON(path):
    # Read the file at `path` and return its parsed JSON content.
    with open(path) as f:
        return json.load(f)

Using the function above we can load the JSON data and create a dataframe. This dataframe contains the geocoordinates of each stop. These coordinates are used to calculate the distance between the stop and the town in which it is located.

In [210]:
stops = parseJSON("data/stops.txt")
columns = ["town", "stop_latitude", "stop_longitude"]
rows = []
for stop in stops["haltes"]:
    try:
        row = (stop["omschrijvingGemeente"], stop["geoCoordinaat"]["latitude"], stop["geoCoordinaat"]["longitude"])
    except (KeyError, TypeError):
        # Skip stops that lack a town name or coordinates.
        continue
    rows.append(row)
    
stopsDF = spark.createDataFrame(rows, columns)
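
To make the expected structure of *stops.txt* concrete, here is a minimal plain-Python sketch of the same extraction on an inline sample. The field names are the ones used in the cell above. The sample itself, including the stop without coordinates, is an illustrative assumption, and `extract_rows` is a hypothetical helper name.

```python
import json

# Illustrative sample mimicking the structure the loop above expects.
# The second stop lacks "geoCoordinaat" and is therefore skipped.
sample = json.loads("""
{"haltes": [
  {"omschrijvingGemeente": "Wilrijk",
   "geoCoordinaat": {"latitude": 51.16388937021345, "longitude": 4.392073389160737}},
  {"omschrijvingGemeente": "Hoboken"}
]}
""")

def extract_rows(stops):
    """Return (town, latitude, longitude) tuples for all complete stops."""
    rows = []
    for stop in stops["haltes"]:
        try:
            coord = stop["geoCoordinaat"]
            rows.append((stop["omschrijvingGemeente"], coord["latitude"], coord["longitude"]))
        except (KeyError, TypeError):
            # A field is missing, or is null instead of an object.
            continue
    return rows

rows = extract_rows(sample)
print(rows)
```

Only the Wilrijk stop survives, because the Hoboken entry has no coordinates to extract.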

In [211]:
stopsDF.show()

+---------+------------------+------------------+
|     town|     stop_latitude|    stop_longitude|
+---------+------------------+------------------+
|  Wilrijk| 51.16388937021345| 4.392073389160737|
|Antwerpen| 51.20624969023759|  4.42550621473228|
|  Hoboken| 51.16606659417422| 4.357705800030343|
|  Hoboken|51.166021637406374| 4.357548549004122|
|  Hoboken|51.174054839412754| 4.341260275816016|
|  Wilrijk| 51.16300843934687| 4.392315960084608|
|  Wilrijk|51.159774888706686|  4.36212420848111|
|  Wilrijk| 51.15996363300075| 4.361809703307476|
|  Wilrijk| 51.16295566692438| 4.385797177482637|
|  Wilrijk|51.163459288346274|  4.38396752309872|
|Antwerpen|51.188743165936835| 4.389583082467416|
|Antwerpen| 51.18297254153696| 4.418927404052235|
|  Wilrijk| 51.16220188106436| 4.372316437753825|
|  Wilrijk| 51.16195022546374| 4.371558768815994|
|  Wilrijk| 51.17942171543447| 4.391480757849793|
|  Wilrijk| 51.17950258375939| 4.391623804277545|
|  Wilrijk|  51.1750618209695| 4.393666351257672|


#### Setup Towns
The *zipcodes.csv* file contains all towns in Belgium with their geocoordinates. These can be used to calculate the distance from each stop to its town. We extract the data using the *SQLContext* that was set up earlier.

In [212]:
townsDF = sqlContext.read.csv("data/zipcodes.csv", sep=";").selectExpr("_c1 as town", "_c2 as town_latitude", "_c3 as town_longitude")

In [213]:
townsDF.show()

+--------------------+-----------------+-----------------+
|                town|    town_latitude|   town_longitude|
+--------------------+-----------------+-----------------+
|             Brussel|       50.8427501|4.351549900000009|
|           Bruxelles|       50.8427501|4.351549900000009|
|Ass. Réun. Com. C...|             null|             null|
|Brusselse Hoofdst...|50.84487679999999|4.351433499999985|
|Conseil Region Br...|        50.847857|4.367408000000069|
|Ver. Verg. Gemeen...|             null|             null|
|Raad Vlaamse Geme...|             null|             null|
|Ass. Commiss. Com...|             null|             null|
|Chambre des Repré...|50.84655679999999|4.364662199999998|
|Kamer van Volksve...|50.84655679999999|4.364662199999998|
|    Belgische Senaat|             null|             null|
|   Senat de Belgique|         50.79834|4.395649999999932|
|Cité Administrati...|             null|             null|
|Rijksadministrati...|       50.8243276|4.51395430000002

Since the *town* column above contains entries that consist of several words, each name is reduced to its first word before the join is executed.

In [214]:
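from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Keep only the first word of each name; pass missing names through unchanged.
cleanName = udf(lambda x: x.split(" ")[0] if x is not None else None, StringType())

towns = townsDF.withColumn("town", cleanName(townsDF["town"]))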
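
The effect of the first-word rule is easiest to see outside Spark. The sketch below is a plain-Python equivalent under the hypothetical name `clean_name`. The input "De Panne" is an illustrative assumption and is not confirmed to appear in *zipcodes.csv*.

```python
def clean_name(name):
    """Plain-Python equivalent of the cleanName UDF, with a guard for missing names."""
    if name is None:
        return None
    return name.split(" ")[0]

print(clean_name("Brusselse Hoofdst..."))  # Brusselse
print(clean_name("Wilrijk"))               # Wilrijk
print(clean_name("De Panne"))              # De
```

Single-word names pass through unchanged, while a multi-word town name is cut down to its first word and may then no longer match the town name used in the stops data.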

In [215]:
towns.show()

+-------------------+-----------------+-----------------+
|               town|    town_latitude|   town_longitude|
+-------------------+-----------------+-----------------+
|            Brussel|       50.8427501|4.351549900000009|
|          Bruxelles|       50.8427501|4.351549900000009|
|               Ass.|             null|             null|
|          Brusselse|50.84487679999999|4.351433499999985|
|            Conseil|        50.847857|4.367408000000069|
|               Ver.|             null|             null|
|               Raad|             null|             null|
|               Ass.|             null|             null|
|            Chambre|50.84655679999999|4.364662199999998|
|              Kamer|50.84655679999999|4.364662199999998|
|          Belgische|             null|             null|
|              Senat|         50.79834|4.395649999999932|
|               Cité|             null|             null|
|Rijksadministratief|       50.8243276|4.513954300000023|
|            V

We can now join our stops and towns to create a table that holds, for each stop, its own coordinates and the coordinates of the corresponding town.

In [216]:
joined_dfs = stopsDF.join(towns, on=["town"])
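
An inner join on the town name pairs every stop with every town entry of that name. The plain-Python sketch below illustrates this with values copied from the outputs in this notebook; `inner_join` is a hypothetical helper, not a Spark function.

```python
# Values copied from the outputs shown in this notebook.
stops = [("Wilrijk", 51.16388937021345, 4.392073389160737),
         ("Antwerpen", 51.20624969023759, 4.42550621473228)]
towns = [("Wilrijk", 51.1683102, 4.394286800000032),
         ("Antwerpen", 51.2293515, 4.427988300000038),
         ("Antwerpen", 51.2287575, 4.374022100000047)]

def inner_join(stops, towns):
    """Pair every stop with every town entry that carries the same name."""
    return [(s_town, s_lat, s_lon, t_lat, t_lon)
            for (s_town, s_lat, s_lon) in stops
            for (t_town, t_lat, t_lon) in towns
            if s_town == t_town]

joined = inner_join(stops, towns)
for row in joined:
    print(row)
```

The single Antwerpen stop appears twice, once per Antwerpen entry. The same duplication is visible in the Spark output, which means the later per-town counts are counts of joined rows rather than of distinct stops.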

In [217]:
joined_dfs.show()

+---------+------------------+-----------------+-----------------+------------------+
|     town|     stop_latitude|   stop_longitude|    town_latitude|    town_longitude|
+---------+------------------+-----------------+-----------------+------------------+
|  Wilrijk| 51.16388937021345|4.392073389160737|       51.1683102| 4.394286800000032|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.2293515| 4.427988300000038|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.2287575| 4.374022100000047|
|Antwerpen| 51.20624969023759| 4.42550621473228|51.34183059999999|4.2964604999999665|
|Antwerpen| 51.20624969023759| 4.42550621473228|51.27639629999999| 4.362460400000032|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.1890846|4.3836284000000205|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.2037695| 4.411263700000063|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.2198771| 4.401135599999975|
|  Hoboken| 51.16606659417422|4.357705800030343|      

#### Calculating the distance
The following function calculates the distance between two positions, both given in (lat, long) format, using the haversine formula with the radius of the earth in kilometers. It then evaluates whether the stop's position lies within the given radius of the town's position.

In [218]:
from pyspark.sql.types import BooleanType, DoubleType
from pyspark.sql.functions import udf
from math import sin, cos, sqrt, atan2, radians

# Source: https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude
def is_within_radius(stop_lat, stop_long, town_lat, town_long):
    # Approximate radius of the earth in kilometers.
    R = 6373.0

    # The town columns are read from the CSV as strings, hence the float() conversion.
    lat1 = radians(float(town_lat))
    lon1 = radians(float(town_long))
    lat2 = radians(float(stop_lat))
    lon2 = radians(float(stop_long))

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Haversine formula for the great-circle distance.
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance <= INPUT_RADIUS
    
in_radius = udf(is_within_radius, BooleanType())

stops_in_radius = joined_dfs.withColumn("in_radius", in_radius(joined_dfs["stop_latitude"], joined_dfs["stop_longitude"], joined_dfs["town_latitude"], joined_dfs["town_longitude"]))
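
As a worked check of the distance calculation, the sketch below applies the same formula to two coordinate pairs copied from the joined output above. `haversine_km` is a hypothetical stand-alone helper that returns the distance itself instead of a boolean.

```python
from math import sin, cos, sqrt, atan2, radians

EARTH_RADIUS_KM = 6373.0  # same approximation as in the cell above

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, long) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return EARTH_RADIUS_KM * 2 * atan2(sqrt(a), sqrt(1 - a))

# Town position first, stop position second.
wilrijk = haversine_km(51.1683102, 4.394286800000032, 51.16388937021345, 4.392073389160737)
antwerpen = haversine_km(51.2293515, 4.427988300000038, 51.20624969023759, 4.42550621473228)

print(round(wilrijk, 2), wilrijk <= 2)
print(round(antwerpen, 2), antwerpen <= 2)
```

The Wilrijk stop lies roughly half a kilometer from its town position and the Antwerpen stop a little over 2.5 km from the first Antwerpen entry, which agrees with the `true` and `false` values in the `in_radius` column for a 2 km radius.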

In [219]:
stops_in_radius.show()

+---------+------------------+-----------------+-----------------+------------------+---------+
|     town|     stop_latitude|   stop_longitude|    town_latitude|    town_longitude|in_radius|
+---------+------------------+-----------------+-----------------+------------------+---------+
|  Wilrijk| 51.16388937021345|4.392073389160737|       51.1683102| 4.394286800000032|     true|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.2293515| 4.427988300000038|    false|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.2287575| 4.374022100000047|    false|
|Antwerpen| 51.20624969023759| 4.42550621473228|51.34183059999999|4.2964604999999665|    false|
|Antwerpen| 51.20624969023759| 4.42550621473228|51.27639629999999| 4.362460400000032|    false|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.1890846|4.3836284000000205|    false|
|Antwerpen| 51.20624969023759| 4.42550621473228|       51.2037695| 4.411263700000063|     true|
|Antwerpen| 51.20624969023759| 4.4255062

#### Calculating the ratio
The following section computes, for each town, the number of stops that are located within the radius. Two dictionaries hold simple counts, the total per town and the number within the radius, from which the ratio is computed afterwards. The ratios are then stored in a new dataframe containing the towns and the share of their stops within the given radius.

In [220]:
total_stops_dict = dict()
stop_in_radius_dict = dict()

for row in stops_in_radius.rdd.collect():
    town = row["town"]
    # Count every joined row of the town.
    total_stops_dict[town] = total_stops_dict.get(town, 0) + 1

    # Count only the rows whose stop lies within the radius.
    if row["in_radius"]:
        stop_in_radius_dict[town] = stop_in_radius_dict.get(town, 0) + 1

columns = ["town", "stops_in_radius_ratio"]
rows = []

for town in total_stops_dict:
    if town in stop_in_radius_dict:
        rows.append((town, (stop_in_radius_dict[town] / total_stops_dict[town])))

towns_stops_in_radius = spark.createDataFrame(rows, columns)
towns_stops_in_radius = towns_stops_in_radius.withColumn("stops_in_radius_ratio", towns_stops_in_radius["stops_in_radius_ratio"].cast(DoubleType()))
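
The counting logic can be sketched compactly with `collections.Counter`. The `(town, in_radius)` pairs below are illustrative and not taken from the real data. Unlike the cell above, this variant also keeps towns without any stop in range, with a ratio of 0.0.

```python
from collections import Counter

# Illustrative (town, in_radius) pairs; not taken from the real data.
rows = [("Wilrijk", True), ("Wilrijk", True), ("Wilrijk", False),
        ("Hoboken", True), ("Antwerpen", False)]

total = Counter(town for town, _ in rows)
inside = Counter(town for town, flag in rows if flag)

# A Counter returns 0 for missing keys, so towns without a stop in range get 0.0.
ratios = {town: inside[town] / total[town] for town in total}
print(ratios)
```

Dropping or keeping the zero-ratio towns makes no difference to the final result as long as the threshold is above 0, since those towns are filtered out either way.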

In [221]:
towns_stops_in_radius.show()

+-----------+---------------------+
|       town|stops_in_radius_ratio|
+-----------+---------------------+
|    Wilrijk|   0.8432835820895522|
|  Antwerpen|  0.21086186540731996|
|    Hoboken|   0.9565217391304348|
|     Deurne|              0.21875|
| Borgerhout|                  1.0|
|  Breendonk|                  1.0|
|    Mortsel|   0.9464285714285714|
|    Schelle|   0.9130434782608695|
|    Berchem|   0.4368421052631579|
|       Boom|                  1.0|
|       Reet|   0.7837837837837838|
|   Hemiksem|                  1.0|
|       Niel|                  1.0|
| Aartselaar|   0.8679245283018868|
|    Hingene|                  0.9|
|     Bornem|                 0.78|
|      Weert|   0.7692307692307693|
|     Edegem|   0.9655172413793104|
| Mariekerke|                  1.0|
|Sint-Amands|   0.9285714285714286|
+-----------+---------------------+
only showing top 20 rows



As we're only interested in the towns for which at least **X**% of the stops lie within the given radius, we need to filter the ratios in the dataframe above. This results in a dataframe containing all towns for which at least **X**% of the stops are located within the given radius of the town's location.

In [222]:
# Keep the towns whose ratio reaches the threshold ("at least X%").
towns_stops_in_radius_percentage = towns_stops_in_radius.filter(towns_stops_in_radius["stops_in_radius_ratio"] >= INPUT_PERCENTAGE)
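
Whether a town whose ratio is exactly equal to the threshold is kept depends on the comparison operator used in the filter. The plain-Python sketch below shows both variants. The first three ratios are copied from the output above, the boundary entry is hypothetical, and `towns_meeting_threshold` is a hypothetical helper name.

```python
# Ratios copied from the output above, plus one hypothetical boundary case.
ratios = {"Wilrijk": 0.8432835820895522, "Reet": 0.7837837837837838,
          "Bornem": 0.78, "Boundary (hypothetical)": 0.80}

def towns_meeting_threshold(ratios, threshold, inclusive=True):
    """Return the sorted town names whose ratio reaches (or exceeds) the threshold."""
    if inclusive:
        return sorted(t for t, r in ratios.items() if r >= threshold)
    return sorted(t for t, r in ratios.items() if r > threshold)

print(towns_meeting_threshold(ratios, 0.80, inclusive=True))
print(towns_meeting_threshold(ratios, 0.80, inclusive=False))
```

With the inclusive comparison the boundary town is kept alongside Wilrijk; with the strict comparison only Wilrijk remains.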

In [223]:
towns_stops_in_radius_percentage.orderBy("town").show()

+----------+---------------------+
|      town|stops_in_radius_ratio|
+----------+---------------------+
|    Aaigem|   0.9583333333333334|
|   Aalbeke|                  1.0|
|   Aarsele|   0.9090909090909091|
|Aartselaar|   0.8679245283018868|
|     Achel|   0.8157894736842105|
| Adinkerke|                  1.0|
|    Afsnee|                  1.0|
| Alsemberg|                  1.0|
|    Appels|                  1.0|
|  Aspelare|                  1.0|
| Assebroek|               0.8375|
|Attenhoven|                  1.0|
| Attenrode|                  0.9|
|Avekapelle|                0.875|
|   Avelgem|                  1.0|
|  Averbode|                  1.0|
|   Baaigem|                  1.0|
| Baardegem|                  1.0|
|  Baasrode|   0.9444444444444444|
|   Balegem|   0.8571428571428571|
+----------+---------------------+
only showing top 20 rows

