## Assignment 6
Find the closest stop to a given position (latitude, longitude), using the De Lijn data on all stops in Belgium from *stops.txt*.

#### Input Position
Please set the desired position using the (latitude, longitude) notation.

In [77]:
INPUT_POSITION = (51.2181962, 4.4244759)

#### Setup SparkContext
A SparkSession, together with its underlying SparkContext, is required to set up other aspects of this project, such as the dataframes and their transformations.

In [78]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Ex6").getOrCreate()
sc = spark.sparkContext

#### Setup Stops
As in the previous notebooks, we read the De Lijn data from *stops.txt*. The following function reads and parses the JSON in that file.

In [79]:
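import json

def parseJSON(f):
    # Read the file at path f and return the parsed JSON data.
    # The context manager closes the file even if parsing fails.
    with open(f, encoding="utf-8") as fh:
        return json.load(fh)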

Using the function above, we can load the JSON data and create a dataframe. This dataframe contains the geocoordinates of each stop, which are used to calculate the distance from that stop to the input position.

In [80]:
stops = parseJSON("data/stops.txt")
columns = ["stop", "town", "stop_latitude", "stop_longitude"]
rows = []
for stop in stops["haltes"]:
    try:
        row = (stop["omschrijving"], stop["omschrijvingGemeente"], stop["geoCoordinaat"]["latitude"], stop["geoCoordinaat"]["longitude"])
    except (KeyError, TypeError):
        # Skip stops with missing or null fields (e.g. no coordinates).
        continue
    rows.append(row)
    
stopsDF = spark.createDataFrame(rows, columns)

In [81]:
stopsDF.show()

+--------------------+---------+------------------+------------------+
|                stop|     town|     stop_latitude|    stop_longitude|
+--------------------+---------+------------------+------------------+
| A. Chantrainestraat|  Wilrijk| 51.16388937021345| 4.392073389160737|
|           Zurenborg|Antwerpen| 51.20624969023759|  4.42550621473228|
|Verenigde Natieslaan|  Hoboken| 51.16606659417422| 4.357705800030343|
|Verenigde Natieslaan|  Hoboken|51.166021637406374| 4.357548549004122|
|     D. Baginierlaan|  Hoboken|51.174054839412754| 4.341260275816016|
| A. Chantrainestraat|  Wilrijk| 51.16300843934687| 4.392315960084608|
|      Fotografielaan|  Wilrijk|51.159774888706686|  4.36212420848111|
|      Fotografielaan|  Wilrijk| 51.15996363300075| 4.361809703307476|
|            Moerelei|  Wilrijk| 51.16295566692438| 4.385797177482637|
|            Moerelei|  Wilrijk|51.163459288346274|  4.38396752309872|
|        J. De Voslei|Antwerpen|51.188743165936835| 4.389583082467416|
|   Mi

#### Calculating the distances
The following function will be used to calculate the distance between each stop and the given position.
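
The function below implements the haversine formula for great-circle distance. As a reference, it can be written as follows, where $\varphi_1, \lambda_1$ are the latitude and longitude of the stop and $\varphi_2, \lambda_2$ those of the input position (all in radians), and $R$ is the assumed radius of the Earth in kilometres:

```latex
\begin{aligned}
\Delta\varphi &= \varphi_2 - \varphi_1, \qquad \Delta\lambda = \lambda_2 - \lambda_1 \\
a &= \sin^2\!\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \,\cos\varphi_2 \,\sin^2\!\left(\frac{\Delta\lambda}{2}\right) \\
c &= 2\,\operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) \\
d &= R\,c
\end{aligned}
```

The symbols $a$, $c$ and $R$ correspond to the variables of the same name in the code.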

In [82]:
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from math import sin, cos, sqrt, atan2, radians

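# Haversine formula, adapted from: https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude
def distance(stop_lat, stop_long):
    # Approximate radius of the Earth in km, as used by the source (the mean radius is about 6371 km).
    R = 6373.0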

    lat1 = radians(float(stop_lat))
    lon1 = radians(float(stop_long))
    lat2 = radians(INPUT_POSITION[0])
    lon2 = radians(INPUT_POSITION[1])

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    # Great-circle distance in kilometres.
    return R * c

calculate_distance = udf(distance, DoubleType())
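
Before applying the UDF to the whole dataframe, the formula can be sanity-checked in plain Python, without Spark. The following is a minimal sketch: `haversine_km`, `EARTH_RADIUS_KM`, `input_position` and `zurenborg` are names introduced here for illustration only, and the coordinates are copied from the input position and the Zurenborg row of the stops table above.

```python
from math import atan2, cos, radians, sin, sqrt

# Same approximate Earth radius (km) as in the notebook's distance() function.
EARTH_RADIUS_KM = 6373.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return EARTH_RADIUS_KM * c


input_position = (51.2181962, 4.4244759)
zurenborg = (51.20624969023759, 4.42550621473228)

# Distance from the Zurenborg stop to the input position.
d = haversine_km(*zurenborg, *input_position)
print(f"{d:.3f} km")
```

The result should be roughly 1.33 km, which is a plausible value for two points in the same part of Antwerp.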

In [83]:
stops_distances_DF = stopsDF.withColumn("distance", calculate_distance(stopsDF["stop_latitude"], stopsDF["stop_longitude"]))

In [84]:
stops_distances_DF.show()

+--------------------+---------+------------------+------------------+------------------+
|                stop|     town|     stop_latitude|    stop_longitude|          distance|
+--------------------+---------+------------------+------------------+------------------+
| A. Chantrainestraat|  Wilrijk| 51.16388937021345| 4.392073389160737| 6.449053070753454|
|           Zurenborg|Antwerpen| 51.20624969023759|  4.42550621473228|1.3307461782075731|
|Verenigde Natieslaan|  Hoboken| 51.16606659417422| 4.357705800030343| 7.435399992260838|
|Verenigde Natieslaan|  Hoboken|51.166021637406374| 4.357548549004122| 7.446164923701917|
|     D. Baginierlaan|  Hoboken|51.174054839412754| 4.341260275816016| 7.599392968085201|
| A. Chantrainestraat|  Wilrijk| 51.16300843934687| 4.392315960084608|  6.53510913228184|
|      Fotografielaan|  Wilrijk|51.159774888706686|  4.36212420848111| 7.817984573315791|
|      Fotografielaan|  Wilrijk| 51.15996363300075| 4.361809703307476| 7.812777257977161|
|         

#### Finding closest Stop
The closest stop can be found in two steps: first compute the minimum of the distance column, then keep only the rows whose distance to the given position equals that minimum.

In [85]:
# Import the module under an alias so that Python's builtin min() is not shadowed.
from pyspark.sql import functions as F

min_distance = stops_distances_DF.select(F.min("distance").alias("min_distance")).head()[0]
minDF = stops_distances_DF.filter(stops_distances_DF["distance"] == min_distance)
minDF.show()

+---------+---------+-----------------+-----------------+--------------------+
|     stop|     town|    stop_latitude|   stop_longitude|            distance|
+---------+---------+-----------------+-----------------+--------------------+
|Ommeganck|Antwerpen|51.21794411371718|4.424490130633908|0.028057037861096215|
+---------+---------+-----------------+-----------------+--------------------+
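
One property of this aggregate-then-filter pattern is that it keeps every stop tied for the minimum distance, rather than an arbitrary single one. The plain-Python sketch below illustrates the difference on a made-up sample; the stop names and distances in `sample` are hypothetical and not taken from the De Lijn data.

```python
from builtins import min  # the builtin, not a Spark aggregate

# Hypothetical (stop, distance in km) pairs with a deliberate tie.
sample = [("A", 2.5), ("B", 0.4), ("C", 1.1), ("D", 0.4)]

# Step 1: aggregate to the smallest distance.
min_distance = min(dist for _, dist in sample)

# Step 2: keep every row that matches it, so ties are preserved.
closest = [row for row in sample if row[1] == min_distance]

# Picking a single minimum instead returns only the first match.
single = min(sample, key=lambda row: row[1])

print(closest)
print(single)
```

Here `closest` contains both `B` and `D`, while `single` contains only `B`.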

