## Assignment 5:
This assignment counts the number of stops within a radius of a given position (latitude, longitude), using the town locations in *zipcodes.csv* and the De Lijn data for all stops in Belgium in *stops.txt*. Since *zipcodes.csv* is the only source of town locations, we interpret the term *town* as a district, which is also how the De Lijn data uses it.

#### Setup SparkContext & SQLContext
The SparkContext is required to set up the other parts of this project. The SQLContext lets us read *zipcodes.csv* directly into a dataframe, which makes the data easier to inspect and query; *stops.txt* is parsed as JSON and converted to a dataframe separately.

In [96]:
from pyspark.sql import SparkSession, SQLContext

# SparkSession is the entry point; the SparkContext and SQLContext are derived from it.
spark = SparkSession.builder.appName("Ex5").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

#### Setup Stops
As in the previous notebooks, we read the De Lijn data from *stops.txt*. The following function opens the file and parses its JSON content.

In [97]:
import json

def parseJSON(path):
    # Parse the JSON content of the file; the with-block closes it even if parsing fails.
    with open(path) as f:
        return json.load(f)

Using the function above, we load the JSON data and create a dataframe that holds the geocoordinates of each stop. These coordinates are used later to calculate the distance from each stop to the town in which it is located.

In [98]:
stops = parseJSON("data/stops.txt")
columns = ["stop", "town", "stop_latitude", "stop_longitude"]
rows = []
for stop in stops["haltes"]:
    try:
        row = (stop["omschrijving"], stop["omschrijvingGemeente"], stop["geoCoordinaat"]["latitude"], stop["geoCoordinaat"]["longitude"])
    except (KeyError, TypeError):
        # Skip stops with a missing field or null coordinates.
        continue
    rows.append(row)
    
stopsDF = spark.createDataFrame(rows, columns)

In [99]:
stopsDF.show()

+--------------------+---------+------------------+------------------+
|                stop|     town|     stop_latitude|    stop_longitude|
+--------------------+---------+------------------+------------------+
| A. Chantrainestraat|  Wilrijk| 51.16388937021345| 4.392073389160737|
|           Zurenborg|Antwerpen| 51.20624969023759|  4.42550621473228|
|Verenigde Natieslaan|  Hoboken| 51.16606659417422| 4.357705800030343|
|Verenigde Natieslaan|  Hoboken|51.166021637406374| 4.357548549004122|
|     D. Baginierlaan|  Hoboken|51.174054839412754| 4.341260275816016|
| A. Chantrainestraat|  Wilrijk| 51.16300843934687| 4.392315960084608|
|      Fotografielaan|  Wilrijk|51.159774888706686|  4.36212420848111|
|      Fotografielaan|  Wilrijk| 51.15996363300075| 4.361809703307476|
|            Moerelei|  Wilrijk| 51.16295566692438| 4.385797177482637|
|            Moerelei|  Wilrijk|51.163459288346274|  4.38396752309872|
|        J. De Voslei|Antwerpen|51.188743165936835| 4.389583082467416|
|   Mi
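
The loop above silently skips any stop that lacks a description, a town, or coordinates. A minimal sketch of the same extraction on a hand-made sample is shown below. The sample dictionary and the helper name `extract_rows` are illustrative assumptions; only the key names come from the cell above. The sketch catches `KeyError` and `TypeError` explicitly so that unrelated errors are not hidden.

```python
def extract_rows(stops):
    """Return (stop, town, latitude, longitude) tuples, skipping incomplete entries."""
    rows = []
    for stop in stops.get("haltes", []):
        try:
            coord = stop["geoCoordinaat"]
            rows.append((
                stop["omschrijving"],
                stop["omschrijvingGemeente"],
                coord["latitude"],
                coord["longitude"],
            ))
        except (KeyError, TypeError):
            # KeyError: a field is missing; TypeError: geoCoordinaat is null.
            continue
    return rows


# Hypothetical sample that mimics the structure of stops.txt.
sample = {
    "haltes": [
        {
            "omschrijving": "Zurenborg",
            "omschrijvingGemeente": "Antwerpen",
            "geoCoordinaat": {"latitude": 51.20624969023759, "longitude": 4.42550621473228},
        },
        {"omschrijving": "No coordinates", "omschrijvingGemeente": "Antwerpen"},
        {"omschrijving": "Null coordinates", "omschrijvingGemeente": "Antwerpen", "geoCoordinaat": None},
    ]
}

sample_rows = extract_rows(sample)
print(sample_rows)
```

Only the first sample entry survives, because the other two have no usable coordinates.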

#### Setup Towns
The *zipcodes.csv* file contains all towns in Belgium with their geocoordinates, which we use to calculate the distance from each stop to its town. We read the file with the *SQLContext* created earlier.

In [100]:
townsDF = sqlContext.read.csv("data/zipcodes.csv", sep=";").selectExpr("_c0 as zipcode", "_c1 as town", "_c2 as town_latitude", "_c3 as town_longitude")

In [101]:
townsDF.show()

+-------+--------------------+-----------------+-----------------+
|zipcode|                town|    town_latitude|   town_longitude|
+-------+--------------------+-----------------+-----------------+
|   1000|             Brussel|       50.8427501|4.351549900000009|
|   1000|           Bruxelles|       50.8427501|4.351549900000009|
|   1005|Ass. Réun. Com. C...|             null|             null|
|   1005|Brusselse Hoofdst...|50.84487679999999|4.351433499999985|
|   1005|Conseil Region Br...|        50.847857|4.367408000000069|
|   1005|Ver. Verg. Gemeen...|             null|             null|
|   1006|Raad Vlaamse Geme...|             null|             null|
|   1007|Ass. Commiss. Com...|             null|             null|
|   1008|Chambre des Repré...|50.84655679999999|4.364662199999998|
|   1008|Kamer van Volksve...|50.84655679999999|4.364662199999998|
|   1009|    Belgische Senaat|             null|             null|
|   1009|   Senat de Belgique|         50.79834|4.395649999999
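
Because the file has no header row and the reader is not asked to infer types, every column, including the coordinates, arrives as a string, and missing coordinates show up as null. The plain-Python sketch below mirrors that behaviour with the `csv` module. The two sample lines are an assumption about the raw file layout, reconstructed from the table above, and `parse_towns` is a hypothetical helper.

```python
import csv
import io

# Assumed raw layout: zipcode;town;latitude;longitude, without a header row.
raw = "1000;Brussel;50.8427501;4.351549900000009\n1009;Belgische Senaat;;\n"


def parse_towns(text):
    """Parse semicolon-separated lines; empty fields become None, the rest stay strings."""
    parsed = []
    for zipcode, town, lat, lon in csv.reader(io.StringIO(text), delimiter=";"):
        parsed.append({
            "zipcode": zipcode,
            "town": town,
            "town_latitude": lat or None,
            "town_longitude": lon or None,
        })
    return parsed


parsed_towns = parse_towns(raw)
for entry in parsed_towns:
    print(entry)
```

This is why the distance calculation later has to convert the coordinates with `float()` and why rows without coordinates need care.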

Many entries in the *town* column above consist of several words, so we keep only the first word of each name before executing the join.

In [102]:
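from pyspark.sql.types import StringType, DoubleType  # DoubleType is used for the distance UDF later
from pyspark.sql.functions import udf

# Keep only the first space-separated word of the name; pass nulls through unchanged.
cleanName = udf(lambda x: x.split(" ")[0] if x is not None else None, StringType())

towns = townsDF.withColumn("town", cleanName(townsDF["town"]))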

In [103]:
towns.show()

+-------+-------------------+-----------------+-----------------+
|zipcode|               town|    town_latitude|   town_longitude|
+-------+-------------------+-----------------+-----------------+
|   1000|            Brussel|       50.8427501|4.351549900000009|
|   1000|          Bruxelles|       50.8427501|4.351549900000009|
|   1005|               Ass.|             null|             null|
|   1005|          Brusselse|50.84487679999999|4.351433499999985|
|   1005|            Conseil|        50.847857|4.367408000000069|
|   1005|               Ver.|             null|             null|
|   1006|               Raad|             null|             null|
|   1007|               Ass.|             null|             null|
|   1008|            Chambre|50.84655679999999|4.364662199999998|
|   1008|              Kamer|50.84655679999999|4.364662199999998|
|   1009|          Belgische|             null|             null|
|   1009|              Senat|         50.79834|4.395649999999932|
|   1010| 
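
The first-word rule applied above can be checked in plain Python. The sketch below uses a hypothetical helper `clean_name` that mirrors the lambda and also passes `None` through. It shows one limitation: a town whose real name has several words, such as De Panne, is reduced to its first word and would then no longer match the town name in the stops data.

```python
def clean_name(name):
    """Keep the first space-separated word; pass None through unchanged."""
    if name is None:
        return None
    return name.split(" ")[0]


print(clean_name("Belgische Senaat"))  # Belgische
print(clean_name("Brussel"))           # Brussel
print(clean_name("De Panne"))          # De
```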

We can now join our stops and towns to create a table that holds, for each stop, its own coordinates and the coordinates of the corresponding town.

In [104]:
joined_dfs = stopsDF.join(towns, on=["town"])

In [105]:
joined_dfs.show()

+---------+--------------------+------------------+-----------------+-------+-----------------+------------------+
|     town|                stop|     stop_latitude|   stop_longitude|zipcode|    town_latitude|    town_longitude|
+---------+--------------------+------------------+-----------------+-------+-----------------+------------------+
|  Wilrijk| A. Chantrainestraat| 51.16388937021345|4.392073389160737|   2610|       51.1683102| 4.394286800000032|
|Antwerpen|           Zurenborg| 51.20624969023759| 4.42550621473228|   2060|       51.2293515| 4.427988300000038|
|Antwerpen|           Zurenborg| 51.20624969023759| 4.42550621473228|   2050|       51.2287575| 4.374022100000047|
|Antwerpen|           Zurenborg| 51.20624969023759| 4.42550621473228|   2040|51.34183059999999|4.2964604999999665|
|Antwerpen|           Zurenborg| 51.20624969023759| 4.42550621473228|   2030|51.27639629999999| 4.362460400000032|
|Antwerpen|           Zurenborg| 51.20624969023759| 4.42550621473228|   2020|   
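
An inner join on `town` emits one row per matching pair, so a stop in a town with several zipcodes appears once per zipcode, as the Zurenborg rows above show. The sketch below reproduces that fan-out in plain Python. The helper `inner_join` is hypothetical, and the sample tuples are taken from the tables above.

```python
def inner_join(stops, towns):
    """Pair every stop with every town row that carries the same town name."""
    zipcodes_by_town = {}
    for zipcode, town in towns:
        zipcodes_by_town.setdefault(town, []).append(zipcode)
    return [
        (town, stop, zipcode)
        for stop, town in stops
        for zipcode in zipcodes_by_town.get(town, [])
    ]


sample_stops = [("A. Chantrainestraat", "Wilrijk"), ("Zurenborg", "Antwerpen")]
sample_towns = [
    ("2610", "Wilrijk"),
    ("2060", "Antwerpen"),
    ("2050", "Antwerpen"),
    ("2040", "Antwerpen"),
]

joined_sample = inner_join(sample_stops, sample_towns)
for row in joined_sample:
    print(row)
```

Two stops produce four rows here: one for Wilrijk and three for Antwerpen. Any later count of stops therefore has to be read per zipcode, not per town name.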

#### Calculating the distances
The following function calculates the great-circle distance in kilometers between a stop and its corresponding town, using the haversine formula and an approximate radius of the Earth.
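
Written out, the computation in the function below is as follows, with $\varphi$ for latitude and $\lambda$ for longitude (both in radians), index 1 for the stop, index 2 for the town, and $R = 6373$ km as in the code:

$$
\begin{aligned}
a &= \sin^2\!\left(\frac{\varphi_2 - \varphi_1}{2}\right) + \cos\varphi_1 \cos\varphi_2 \, \sin^2\!\left(\frac{\lambda_2 - \lambda_1}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1 - a}\right) \\
d &= R\,c
\end{aligned}
$$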

In [106]:
from math import sin, cos, sqrt, atan2, radians

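# Source: https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude
def distance(stop_lat, stop_long, town_lat, town_long):
    # Towns without coordinates (null in zipcodes.csv) have no defined distance.
    if None in (stop_lat, stop_long, town_lat, town_long):
        return None

    R = 6373.0  # approximate radius of the Earth in km

    # The town coordinates are read as strings, so convert before taking radians.
    lat1 = radians(float(stop_lat))
    lon1 = radians(float(stop_long))
    lat2 = radians(float(town_lat))
    lon2 = radians(float(town_long))

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Haversine formula
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    return R * c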

calculate_distance = udf(distance, DoubleType())

In [107]:
stops_distances = joined_dfs.withColumn("distance", calculate_distance(joined_dfs["stop_latitude"], joined_dfs["stop_longitude"], joined_dfs["town_latitude"], joined_dfs["town_longitude"]))

In [108]:
stops_distances.show(5)

+---------+-------------------+-----------------+-----------------+-------+-----------------+------------------+------------------+
|     town|               stop|    stop_latitude|   stop_longitude|zipcode|    town_latitude|    town_longitude|          distance|
+---------+-------------------+-----------------+-----------------+-------+-----------------+------------------+------------------+
|  Wilrijk|A. Chantrainestraat|51.16388937021345|4.392073389160737|   2610|       51.1683102| 4.394286800000032|0.5153933302667436|
|Antwerpen|          Zurenborg|51.20624969023759| 4.42550621473228|   2060|       51.2293515| 4.427988300000038|2.5754226142730405|
|Antwerpen|          Zurenborg|51.20624969023759| 4.42550621473228|   2050|       51.2287575| 4.374022100000047|  4.37421465495574|
|Antwerpen|          Zurenborg|51.20624969023759| 4.42550621473228|   2040|51.34183059999999|4.2964604999999665| 17.55162328501925|
|Antwerpen|          Zurenborg|51.20624969023759| 4.42550621473228|   2030|5
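
As a sanity check outside Spark, the same formula can be evaluated for the first row of the table above. The sketch below is a standalone copy of the calculation under the hypothetical name `haversine_km`. For the A. Chantrainestraat stop and the Wilrijk centre it should reproduce the `distance` value shown above, about 0.515 km.

```python
from math import atan2, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6373.0):
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, lam1, phi2, lam2 = (radians(float(v)) for v in (lat1, lon1, lat2, lon2))
    a = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin((lam2 - lam1) / 2) ** 2
    return radius_km * 2 * atan2(sqrt(a), sqrt(1 - a))


# Coordinates copied from the first row of the table above.
d = haversine_km(51.16388937021345, 4.392073389160737, "51.1683102", "4.394286800000032")
print(round(d, 3))  # 0.515
```

The town coordinates are passed as strings on purpose, to mirror the string columns that come out of the CSV reader.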

#### Dataframe Cleanup
The dataframe above is hard to read because it has many columns with long values, so we clean it up. We drop all latitudes and longitudes, since they are no longer needed once we have the distance between each stop and its corresponding town. Because several towns are split across multiple zipcodes, we keep the zipcode to distinguish between the entries of the same town.

In [109]:
stops_towns_dists = stops_distances.select("town", "zipcode", "stop", "distance")

In [110]:
stops_towns_dists.show()

+---------+-------+--------------------+------------------+
|     town|zipcode|                stop|          distance|
+---------+-------+--------------------+------------------+
|  Wilrijk|   2610| A. Chantrainestraat|0.5153933302667436|
|Antwerpen|   2060|           Zurenborg|2.5754226142730405|
|Antwerpen|   2050|           Zurenborg|  4.37421465495574|
|Antwerpen|   2040|           Zurenborg| 17.55162328501925|
|Antwerpen|   2030|           Zurenborg| 8.952703605860485|
|Antwerpen|   2020|           Zurenborg| 3.487883653679154|
|Antwerpen|   2018|           Zurenborg|1.0301779473634403|
|Antwerpen|   2000|           Zurenborg|2.2761900085359903|
|  Hoboken|   2660|Verenigde Natieslaan|1.3902926229584962|
|  Hoboken|   2660|Verenigde Natieslaan| 1.390665295983725|
|  Hoboken|   2660|     D. Baginierlaan|0.7195942952328143|
|  Wilrijk|   2610| A. Chantrainestraat|0.6055236729885773|
|  Wilrijk|   2610|      Fotografielaan|2.4360024831857428|
|  Wilrijk|   2610|      Fotografielaan|
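
With one distance per (stop, zipcode) pair, counting the stops within a given radius of each centre comes down to a filter followed by a group-and-count. The plain-Python sketch below illustrates this on rows copied from the table above. The helper name `stops_within` is hypothetical and the radius of 2 km is chosen purely for illustration.

```python
from collections import Counter


def stops_within(rows, radius_km):
    """Count rows per (town, zipcode) whose distance is at most radius_km."""
    return Counter(
        (town, zipcode)
        for town, zipcode, _stop, dist in rows
        if dist <= radius_km
    )


# Rows copied from the table above: (town, zipcode, stop, distance in km).
sample_rows = [
    ("Wilrijk", "2610", "A. Chantrainestraat", 0.5153933302667436),
    ("Antwerpen", "2060", "Zurenborg", 2.5754226142730405),
    ("Antwerpen", "2018", "Zurenborg", 1.0301779473634403),
    ("Antwerpen", "2000", "Zurenborg", 2.2761900085359903),
    ("Hoboken", "2660", "Verenigde Natieslaan", 1.3902926229584962),
    ("Hoboken", "2660", "Verenigde Natieslaan", 1.390665295983725),
    ("Hoboken", "2660", "D. Baginierlaan", 0.7195942952328143),
    ("Wilrijk", "2610", "A. Chantrainestraat", 0.6055236729885773),
]

counts = stops_within(sample_rows, radius_km=2.0)
for key, n in sorted(counts.items()):
    print(key, n)
```

Stops that share a name, such as the two Verenigde Natieslaan rows, are counted separately here because they have different coordinates in the source data.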