## Assignment 2:
This notebook calculates the number of stops per town, using the town data in *flemish_districs.txt* and the De Lijn data on all stops in Belgium in *stops.txt*.

#### Setup SparkContext and SQLContext
The SparkContext is required to set up the other parts of this project. SQLContext lets us read the JSON in *stops.txt* directly into a dataframe, which makes the data easier to inspect and query.

In [1]:
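from pyspark.sql import SparkSession, SQLContext

# SparkSession is the entry point; the SparkContext is obtained from it
spark = SparkSession.builder.appName("Ex2").getOrCreate()
sc = spark.sparkContext
# SQLContext is kept because the cells below read the JSON file through it
sqlContext = SQLContext(sc)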

#### Setup Districts DataFrame
The following dataframe is set up from *stops.txt*, which contains the De Lijn data. It is used to extract the district name of every stop, which is needed later to calculate the number of stops per town.

In [2]:
from pyspark.sql.functions import explode_outer

stopsDF = sqlContext.read.json('data/stops.txt')

By using the *explode_outer* function as described in the previous notebook, we can extract the different districts from the De Lijn data.

In [3]:
districts = stopsDF.select(explode_outer("haltes.omschrijvingGemeente").alias("district"))
districts.show()

+---------+
| district|
+---------+
|  Wilrijk|
|Antwerpen|
|  Hoboken|
|  Hoboken|
|  Hoboken|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
|Antwerpen|
|Antwerpen|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
|  Wilrijk|
+---------+
only showing top 20 rows
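
To make the reshaping concrete, the following plain-Python sketch mimics what selecting `haltes.omschrijvingGemeente` and exploding it amounts to. It assumes, based only on the column path used above, that a record holds a `haltes` array of objects with an `omschrijvingGemeente` field. The sample record and the helper `districts_of` are made up for illustration, and no Spark is involved.

```python
import json

# Hypothetical record, shaped like the layout assumed for stops.txt.
sample_record = json.loads(
    '{"haltes": ['
    '{"omschrijvingGemeente": "Wilrijk"},'
    '{"omschrijvingGemeente": "Antwerpen"},'
    '{"omschrijvingGemeente": "Hoboken"}]}'
)


def districts_of(record):
    """Return one district name per stop in the record."""
    stops = record.get("haltes") or []
    if not stops:
        # explode_outer yields a single null row for a null or empty array
        return [None]
    return [stop.get("omschrijvingGemeente") for stop in stops]


district_rows = districts_of(sample_record)
print(district_rows)
```

Each element of the returned list corresponds to one row of the `district` column shown above.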



#### Setup Towns&Districts RDD
The following dataframe is set up from *flemish_districs.txt*, which contains the town data. This file lists every town together with its districts. These districts can be matched against the previously created dataframe to calculate the total number of stops in each town. Each line of the file has the following format:
> town_name : \[ all districts belonging to town_name \]


In [4]:
towns_districts = sc.textFile('data/flemish_districs.txt')

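To collect the necessary data, each line is split at the colon into the town name and its comma-separated list of districts. Every name is trimmed, and the non-ASCII hyphen in names is replaced with an ASCII hyphen.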

In [5]:
towns_districts_DF = towns_districts.map(lambda x: [x.split(":")[0].strip().replace('‐', '-'), [p.strip().replace('‐', '-') for p in x.split(":")[-1].split(",")]]).toDF(["town", "district"])
towns_districts_DF.show()

+-------------+--------------------+
|         town|            district|
+-------------+--------------------+
|        Aalst|[Aalst, Gijzegem,...|
|       Aalter|[Aalter, Bellem, ...|
|     Aarschot|[Aarschot, Gelrod...|
|   Aartselaar|        [Aartselaar]|
|     Affligem|[Essene, Hekelgem...|
|        Alken|             [Alken]|
|   Alveringem|[Alveringem, Hoog...|
|    Antwerpen|[Antwerpen, Beren...|
|      Anzegem|[Anzegem, Gijzelb...|
|      Ardooie|[Ardooie, Koolskamp]|
|     Arendonk|          [Arendonk]|
|           As|   [As, Niel-bij-As]|
|         Asse|[Asse, Bekkerzeel...|
|     Assenede|[Assenede, Boekho...|
|      Avelgem|[Avelgem, Kerkhov...|
|Baarle-Hertog|     [Baarle-Hertog]|
|        Balen|      [Balen, Olmen]|
|      Beernem|[Beernem, Oedelem...|
|       Beerse| [Beerse, Vlimmeren]|
|      Beersel|[Beersel, Lot, Al...|
+-------------+--------------------+
only showing top 20 rows
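
The mapping in cell 5 packs several steps into one lambda. The sketch below spells out the same steps in plain Python for a single line. The helper `parse_line` and the input lines are hypothetical, and the hyphen being replaced is assumed to be U+2010.

```python
def parse_line(line):
    """Split 'town : d1, d2' into (town, [d1, d2]) with normalized hyphens."""

    def clean(name):
        # Trim whitespace and replace the Unicode hyphen (assumed U+2010)
        # with an ASCII hyphen-minus.
        return name.strip().replace("\u2010", "-")

    parts = line.split(":")
    town = clean(parts[0])
    # Like the lambda, take the text after the last colon as the district list.
    districts = [clean(d) for d in parts[-1].split(",")]
    return town, districts


# Hypothetical input lines; the second one contains U+2010 hyphens.
print(parse_line("Balen : Balen, Olmen"))
print(parse_line("As : As, Niel\u2010bij\u2010As"))
```

One property worth knowing: a line without a colon yields the whole line as both the town and its only district, because `parts[0]` and `parts[-1]` are then the same element.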



Using the *explode_outer* function we can now create a row for every district in every list, combined with their corresponding town.

In [6]:
towns_distr_sep = towns_districts_DF.select("town", explode_outer("district").alias("district"))
towns_distr_sep.show()

+----------+-------------+
|      town|     district|
+----------+-------------+
|     Aalst|        Aalst|
|     Aalst|     Gijzegem|
|     Aalst|     Hofstade|
|     Aalst|    Baardegem|
|     Aalst|    Herdersem|
|     Aalst|      Meldert|
|     Aalst|      Moorsel|
|     Aalst|  Erembodegem|
|     Aalst|Nieuwerkerken|
|    Aalter|       Aalter|
|    Aalter|       Bellem|
|    Aalter|   Lotenhulle|
|    Aalter|        Poeke|
|  Aarschot|     Aarschot|
|  Aarschot|      Gelrode|
|  Aarschot|     Langdorp|
|  Aarschot|      Rillaar|
|Aartselaar|   Aartselaar|
|  Affligem|       Essene|
|  Affligem|     Hekelgem|
+----------+-------------+
only showing top 20 rows
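
As a rough illustration of what this step does to the rows, here is a plain-Python sketch. The helper `explode_pairs` and the town `Emptytown` are hypothetical. The `outer` flag mirrors the difference between Spark's `explode` and `explode_outer`: the latter keeps a row with a null value when the list is null or empty.

```python
def explode_pairs(rows, outer=True):
    """Turn (town, [districts]) rows into one (town, district) row each."""
    exploded = []
    for town, districts in rows:
        if districts:
            exploded.extend((town, district) for district in districts)
        elif outer:
            # explode_outer behavior: keep the town, with a null district
            exploded.append((town, None))
    return exploded


rows = [
    ("Balen", ["Balen", "Olmen"]),
    ("Alken", ["Alken"]),
    ("Emptytown", []),  # hypothetical town without districts
]
print(explode_pairs(rows))               # keeps ("Emptytown", None)
print(explode_pairs(rows, outer=False))  # drops Emptytown entirely
```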



Both dataframes, *districts* and *towns & districts*, are now ready to be joined on the key **district**. The result has one row per stop, with the *town* column holding the town that the stop's district belongs to, so each town name appears on as many rows as there are stops in that town.

In [7]:
town_districts_joined = districts.join(towns_distr_sep, on=["district"])
town_districts_joined.show()

+--------+-----+
|district| town|
+--------+-----+
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
|   Aalst|Aalst|
+--------+-----+
only showing top 20 rows
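
The call `districts.join(towns_distr_sep, on=["district"])` uses Spark's default join type, an inner join. The plain-Python sketch below imitates that behavior on made-up data to show two consequences: stops whose district is missing from the lookup (or is null) disappear, and a district name listed under more than one town would produce one row per town. The helper `inner_join_on_district` and the town `Elsewhere` are hypothetical.

```python
from collections import defaultdict


def inner_join_on_district(stop_districts, town_district_pairs):
    """Return (district, town) rows for every stop with a matching district."""
    towns_by_district = defaultdict(list)
    for town, district in town_district_pairs:
        if district is not None:  # null keys never match in a join
            towns_by_district[district].append(town)

    joined = []
    for district in stop_districts:
        if district is None:
            continue
        # Unmatched districts contribute no rows (inner join).
        for town in towns_by_district.get(district, []):
            joined.append((district, town))
    return joined


stops = ["Olmen", "Balen", "Unlisted", None]
lookup = [("Balen", "Balen"), ("Balen", "Olmen")]
print(inner_join_on_district(stops, lookup))

# Hypothetical duplicate: the same district name under a second town.
ambiguous = lookup + [("Elsewhere", "Olmen")]
print(inner_join_on_district(stops, ambiguous))
```

If district names are not unique across towns in *flemish_districs.txt*, the affected stops would be counted once for each town, so checking for duplicate district names before joining may be worthwhile.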



Finally, we group by town and count the rows, which gives the total number of stops per town. This count is the sum of the stops in all districts belonging to that town.

In [8]:
stops_per_town = town_districts_joined.groupBy("town").count().orderBy("town")

In [9]:
stops_per_town.show()

+-------------+-----+
|         town|count|
+-------------+-----+
|        Aalst|  554|
|       Aalter|  107|
|     Aarschot|  210|
|   Aartselaar|   53|
|     Affligem|   68|
|        Alken|  118|
|   Alveringem|   78|
|    Antwerpen| 1295|
|      Anzegem|   92|
|      Ardooie|   61|
|     Arendonk|   60|
|           As|   34|
|         Asse|  145|
|     Assenede|  109|
|      Avelgem|   48|
|Baarle-Hertog|   12|
|        Balen|  155|
|      Beernem|   93|
|       Beerse|   62|
|      Beersel|  121|
+-------------+-----+
only showing top 20 rows
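
The grouping step is equivalent to counting how often each town occurs in the joined rows and sorting by town name. A minimal plain-Python sketch on made-up rows (the variable names are illustrative only):

```python
from collections import Counter

# Hypothetical joined rows in (district, town) form, one row per stop.
joined_rows = [
    ("Olmen", "Balen"),
    ("Balen", "Balen"),
    ("Alken", "Alken"),
]

# Count rows per town, then sort by town name like orderBy("town").
counts = Counter(town for _district, town in joined_rows)
stops_per_town_sketch = sorted(counts.items())
print(stops_per_town_sketch)
```

Here the two Balen rows come from different districts, which shows how a town's total adds up the stops of all its districts.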

