## Assignment 4:
Count the number of stops within a given radius of a position (lat, long), using the De Lijn data for all stops in Belgium from *stops.txt*.

#### Input Variables
Please set the desired position in *INPUT_POSITION* using the (latitude, longitude) notation. The desired radius in kilometers can be entered into *INPUT_RADIUS*.

In [29]:
INPUT_POSITION = (51.2181962, 4.4244759)
INPUT_RADIUS = 5

#### Setup Spark
The Spark session is needed later to create dataframes and perform transformations on them.

In [30]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Ex4").getOrCreate()

#### Calculating the distance
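The following function calculates the distance between *INPUT_POSITION* and a second position, both given in (lat, long) format, and returns whether that distance is at most *INPUT_RADIUS*. The calculation uses the haversine formula with an approximate radius of the earth in kilometers.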

In [31]:
from math import sin, cos, sqrt, atan2, radians

# Source: https://stackoverflow.com/questions/19412462/getting-distance-between-two-points-based-on-latitude-longitude
def is_within_radius(lat, long):
    R = 6373.0

    lat1 = radians(INPUT_POSITION[0])
    lon1 = radians(INPUT_POSITION[1])
    lat2 = radians(lat)
    lon2 = radians(long)

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))

    distance = R * c

    return distance <= INPUT_RADIUS
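
As a sanity check on the formula, here is a minimal sketch of the same haversine calculation written as a pure function that returns the distance itself. The name `haversine_km` and the constant `EARTH_RADIUS_KM` are illustrative and not part of the original notebook; the second coordinate pair is the "Zurenborg" stop from the *stopsDF* output further down.

```python
from math import atan2, cos, radians, sin, sqrt

# Assumption: same approximate earth radius (in km) as R in the cell above.
EARTH_RADIUS_KM = 6373.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, long) points in degrees."""
    phi1, lam1 = radians(lat1), radians(lon1)
    phi2, lam2 = radians(lat2), radians(lon2)
    a = sin((phi2 - phi1) / 2) ** 2 + cos(phi1) * cos(phi2) * sin((lam2 - lam1) / 2) ** 2
    return EARTH_RADIUS_KM * 2 * atan2(sqrt(a), sqrt(1 - a))

center = (51.2181962, 4.4244759)                 # INPUT_POSITION from the notebook
zurenborg = (51.20624969023759, 4.42550621473228)  # stop coordinates from stopsDF
d = haversine_km(*center, *zurenborg)
print("Distance to Zurenborg: {:.2f} km".format(d))
```

Returning the distance instead of a boolean makes it easy to check individual stops by hand before applying the radius test.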

#### Setup Stops
As in the previous notebooks, we read the De Lijn data from *stops.txt*. The following function reads and parses the JSON data in that file.

In [32]:
import json

def parseJSON(path):
    # The context manager closes the file even if parsing fails
    with open(path) as f:
        return json.load(f)

Using the function above, we can load the JSON data and create a dataframe. This dataframe contains the geocoordinates of each stop; stops without coordinates are skipped. The coordinates are used to check whether a stop lies within the given radius of the requested position.

In [33]:
stops = parseJSON("data/stops.txt")
columns = ["stop", "latitude", "longitude"]
rows = []
for stop in stops["haltes"]:
    try:
        row = (stop["omschrijving"], stop["geoCoordinaat"]["latitude"], stop["geoCoordinaat"]["longitude"])
    except (KeyError, TypeError):
        # Skip stops with a missing description or missing geocoordinates
        continue
    rows.append(row)
    
stopsDF = spark.createDataFrame(rows, columns)
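
The row extraction can also be sketched without exception handling, which makes the skip condition explicit. This is an illustration only: `extract_rows` is a hypothetical helper, the two stops without coordinates are made-up sample entries, and unlike the cell above it also skips stops whose latitude or longitude is null.

```python
def extract_rows(stops):
    """Return (description, latitude, longitude) tuples for stops that have all three."""
    rows = []
    for stop in stops.get("haltes", []):
        coords = stop.get("geoCoordinaat") or {}  # handles a missing or null entry
        name = stop.get("omschrijving")
        lat = coords.get("latitude")
        lon = coords.get("longitude")
        if name is None or lat is None or lon is None:
            continue  # incomplete stop, cannot be placed on the map
        rows.append((name, lat, lon))
    return rows

# Sample input in the same shape as stops.txt; only the first entry is real data.
sample = {"haltes": [
    {"omschrijving": "Zurenborg",
     "geoCoordinaat": {"latitude": 51.20624969023759, "longitude": 4.42550621473228}},
    {"omschrijving": "Hypothetical stop without coordinates"},
    {"omschrijving": "Hypothetical stop with null coordinates", "geoCoordinaat": None},
]}
sample_rows = extract_rows(sample)
print(sample_rows)
```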

In [34]:
stopsDF.show()

+--------------------+------------------+------------------+
|                stop|          latitude|         longitude|
+--------------------+------------------+------------------+
| A. Chantrainestraat| 51.16388937021345| 4.392073389160737|
|           Zurenborg| 51.20624969023759|  4.42550621473228|
|Verenigde Natieslaan| 51.16606659417422| 4.357705800030343|
|Verenigde Natieslaan|51.166021637406374| 4.357548549004122|
|     D. Baginierlaan|51.174054839412754| 4.341260275816016|
| A. Chantrainestraat| 51.16300843934687| 4.392315960084608|
|      Fotografielaan|51.159774888706686|  4.36212420848111|
|      Fotografielaan| 51.15996363300075| 4.361809703307476|
|            Moerelei| 51.16295566692438| 4.385797177482637|
|            Moerelei|51.163459288346274|  4.38396752309872|
|        J. De Voslei|51.188743165936835| 4.389583082467416|
|   Middelheim Vijver| 51.18297254153696| 4.418927404052235|
|          Antarctica| 51.16220188106436| 4.372316437753825|
|          Antarctica| 5

#### Counting the number of stops within the radius
To find the total number of stops, we collect the rows of the *stopsDF* dataframe to the driver and check each one with the *is_within_radius* function to see whether the stop is located within the given radius.

In [35]:
total_stops_in_radius = 0

for row in stopsDF.collect():
    if is_within_radius(row["latitude"], row["longitude"]):
        total_stops_in_radius += 1
        
print("Total number of stops within given radius: {}".format(total_stops_in_radius))

Total number of stops within given radius: 916
