## Assignment 3:
Number of stops per town per citizen, calculated from the town data in *flemish_districs.txt*, the citizen data in *citizens2.txt*, and the De Lijn data on all stops in Belgium in *stops.txt*.

#### Setup SparkContext and SQLContext
The SparkContext is required to set up other aspects of this project. SQLContext allows us to read the JSON content of *stops.txt* directly into a dataframe, which makes the data easier to read and to access.

In [1]:
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext

# The SparkSession is the entry point; both contexts are derived from it
spark = SparkSession.builder.appName("Ex3").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

#### Stops per Town Setup
Since this notebook is an elaboration of notebook_2, we will reuse all its code in order to get the total number of stops per town, as defined in the previous notebook.

In [2]:
from pyspark.sql.functions import explode_outer

stopsDF = sqlContext.read.json('data/stops.txt')
districts = stopsDF.select(explode_outer("haltes.omschrijvingGemeente").alias("district"))
towns_districts = sc.textFile('data/flemish_districs.txt')
towns_districts_DF = towns_districts.map(lambda x: [x.split(":")[0].strip().replace('‐', '-'), [p.strip().replace('‐', '-') for p in x.split(":")[-1].split(",")]]).toDF(["town", "district"])
towns_distr_sep = towns_districts_DF.select("town", explode_outer("district").alias("district"))
town_districts_joined = districts.join(towns_distr_sep, on=["district"])
stops_per_town = town_districts_joined.groupBy("town").count().orderBy("town")
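
To make the parsing of *flemish_districs.txt* easier to follow, here is a minimal plain-Python sketch of what the `map` lambda does to a single line. The helper name `parse_district_line` and the sample line are assumptions for illustration only; the file is merely assumed to follow the `town: district, district, ...` layout that the lambda implies, and the replaced character is assumed to be the typographic hyphen U+2010.

```python
def parse_district_line(line):
    """Split a 'town: district, district' line into (town, [districts])."""
    parts = line.split(":")
    # Replace the typographic hyphen (U+2010) with the ASCII hyphen-minus,
    # mirroring the replace() calls in the notebook cell.
    town = parts[0].strip().replace("\u2010", "-")
    districts = [d.strip().replace("\u2010", "-") for d in parts[-1].split(",")]
    return town, districts


# Hypothetical sample line, not copied from the actual data file.
sample = "Sint\u2010Niklaas: Belsele, Nieuwkerken\u2010Waas, Sinaai"
town, districts = parse_district_line(sample)
print(town, districts)
```

Normalising the hyphen on both the town and the district side is what allows the later join on `district` to match the names coming from *stops.txt*, provided those use the ASCII hyphen.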

In [3]:
stops_per_town.show()

+-------------+-----+
|         town|count|
+-------------+-----+
|        Aalst|  554|
|       Aalter|  107|
|     Aarschot|  210|
|   Aartselaar|   53|
|     Affligem|   68|
|        Alken|  118|
|   Alveringem|   78|
|    Antwerpen| 1295|
|      Anzegem|   92|
|      Ardooie|   61|
|     Arendonk|   60|
|           As|   34|
|         Asse|  145|
|     Assenede|  109|
|      Avelgem|   48|
|Baarle-Hertog|   12|
|        Balen|  155|
|      Beernem|   93|
|       Beerse|   62|
|      Beersel|  121|
+-------------+-----+
only showing top 20 rows



#### Citizens Setup
The *citizens2.txt* file lists every town in Belgium with its citizen count. Only Flanders is needed, since De Lijn only operates there, but the extra towns are harmless: the join executed later on keeps only the towns found earlier. Because the file does not follow a strict format, some preprocessing is required. The following function extracts every name listed for a town, since a town can appear under both its Dutch and its French name.

In [4]:
def extractNames(data):
    result = []
    # Only 1 name
    if len(data) == 1:
        # Strip possible trailing whitespaces
        name = data[0].strip()
        result.append(name)
    
    # Both Dutch & French name
    elif len(data) > 1:
        for name in data:
            # Strip possible trailing whitespaces
            name = name.strip()
            # Check whether name is a separation character
            if len(name) == 1:
                continue
            else:
                # Check whether name is encapsulated by parentheses
                if name[0] == "(":
                    name = name[1:-1]
            
            result.append(name)
            
    return result
        

In [5]:
citizensDF = sc.textFile('data/citizens2.txt')
citizens = citizensDF.map(lambda x: [[name for name in extractNames(x.split(" ")[:-1])], int(x.rpartition(" ")[-1].replace('.', ''))]).toDF(["towns", "citizen_count"])
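
The one-line lambda above does two things per line: it passes every token except the last to `extractNames`, and it turns the last token into an integer by dropping the thousands separator. The plain-Python sketch below replays that on a few lines. The exact layout of the sample lines (a `/` separator, parentheses around the second name) is an assumption based on what `extractNames` checks for; the names and counts are taken from the output shown in this notebook. `extract_names` and `parse_citizen_line` are illustrative names, not part of the notebook.

```python
def extract_names(tokens):
    """Compact equivalent of extractNames from the notebook."""
    if len(tokens) == 1:
        return [tokens[0].strip()]
    names = []
    for token in tokens:
        token = token.strip()
        if len(token) == 1:   # a single character is treated as a separator
            continue
        if token[0] == "(":   # "(Name)" -> "Name"
            token = token[1:-1]
        names.append(token)
    return names


def parse_citizen_line(line):
    """Return ([names], citizen_count) for one line of the citizens file."""
    names = extract_names(line.split(" ")[:-1])
    # The last token is the count, written with '.' as thousands separator.
    count = int(line.rpartition(" ")[-1].replace(".", ""))
    return names, count


# Assumed line layouts; only the names and counts come from the notebook output.
samples = [
    "Anderlecht 117.724",
    "Brussel / Bruxelles 177.112",
    "Elsene (Ixelles) 86.336",
]
parsed = [parse_citizen_line(s) for s in samples]
for names, count in parsed:
    print(names, count)
```

Note that splitting on spaces means a town name that itself contains a space would be returned as several separate names; whether that occurs depends on the data file.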

In [6]:
citizens.show()

+--------------------+-------------+
|               towns|citizen_count|
+--------------------+-------------+
|        [Anderlecht]|       117724|
|[Brussel, Bruxelles]|       177112|
|   [Elsene, Ixelles]|        86336|
|         [Etterbeek]|        47410|
|             [Evere]|        41016|
|         [Ganshoren]|        24794|
|             [Jette]|        52144|
|        [Koekelberg]|        21765|
|[Oudergem, Auderg...|        33725|
|[Schaarbeek, Scha...|       132097|
|[Sint‐Agatha‐Berc...|        24831|
|[Sint‐Gillis, Sai...|        49361|
|[Sint‐Jans‐Molenb...|        95455|
|[Sint‐Joost‐ten‐N...|        26813|
|[Sint‐Lambrechts‐...|        56212|
|[Sint‐Pieters‐Wol...|        41513|
|      [Ukkel, Uccle]|        82038|
|     [Vorst, Forest]|        55694|
|[Watermaal‐Bosvoo...|        25001|
|        [Aartselaar]|        14298|
+--------------------+-------------+
only showing top 20 rows



Each row of the dataframe now contains a list and an integer. The list holds one or two names, depending on how the town is listed in the loaded data. To get one row per town name, the *explode_outer* function is applied, as in the previous notebooks.

In [7]:
extracted_citizens = citizens.select("citizen_count", explode_outer("towns").alias("town"))
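
As a rough mental model of what `explode_outer` does here, the plain-Python sketch below turns each `(names, count)` row into one row per name. The function name `explode_outer_rows` is hypothetical and only imitates the Spark behaviour on a small list; the "outer" part is that a row with an empty list is kept with a `None` name instead of being dropped.

```python
def explode_outer_rows(rows):
    """Yield one (count, name) pair per name; empty lists yield (count, None)."""
    for names, count in rows:
        if not names:
            # Outer behaviour: keep the row even when there is nothing to explode.
            yield count, None
        else:
            for name in names:
                yield count, name


# Rows taken from the citizens output shown above.
rows = [(["Anderlecht"], 117724), (["Brussel", "Bruxelles"], 177112)]
exploded = list(explode_outer_rows(rows))
for count, name in exploded:
    print(count, name)
```

Both names of a bilingual town end up with the same citizen count, so whichever spelling the town list uses can be matched in the join.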

In [8]:
extracted_citizens.show()

+-------------+--------------------+
|citizen_count|                town|
+-------------+--------------------+
|       117724|          Anderlecht|
|       177112|             Brussel|
|       177112|           Bruxelles|
|        86336|              Elsene|
|        86336|             Ixelles|
|        47410|           Etterbeek|
|        41016|               Evere|
|        24794|           Ganshoren|
|        52144|               Jette|
|        21765|          Koekelberg|
|        33725|            Oudergem|
|        33725|           Auderghem|
|       132097|          Schaarbeek|
|       132097|          Schaerbeek|
|        24831| Sint‐Agatha‐Berchem|
|        24831|Berchem‐Sainte‐Ag...|
|        49361|         Sint‐Gillis|
|        49361|        Saint‐Gilles|
|        95455| Sint‐Jans‐Molenbeek|
|        95455|Molenbeek‐Saint‐Jean|
+-------------+--------------------+
only showing top 20 rows



#### Join *Stops per Town* and Citizens
The next step towards the number of stops per town per citizen is joining the two previously created dataframes on their town column. This adds the citizen count of each town to its row.

In [9]:
joined_dfs = stops_per_town.join(extracted_citizens, on=["town"])

In [10]:
joined_dfs.orderBy("town").show()

+------------+-----+-------------+
|        town|count|citizen_count|
+------------+-----+-------------+
|       Aalst|  554|        85615|
|      Aalter|  107|        20544|
|    Aarschot|  210|        29956|
|  Aartselaar|   53|        14298|
|    Affligem|   68|        13221|
|       Alken|  118|        11564|
|  Alveringem|   78|         5087|
|   Antwerpen| 1295|       521680|
|     Anzegem|   92|        14599|
|     Ardooie|   61|         8988|
|    Arendonk|   60|        13274|
|          As|   34|         8190|
|        Asse|  145|        32940|
|    Assenede|  109|        14204|
|     Avelgem|   48|        10063|
|       Balen|  155|        22425|
|     Beernem|   93|        15683|
|      Beerse|   62|        17928|
|     Beersel|  121|        25035|
|Begijnendijk|   92|        10053|
+------------+-----+-------------+
only showing top 20 rows
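
Baarle-Hertog is listed in the `stops_per_town` output but is absent from the joined output. One possible cause, assuming the citizen file spells hyphenated names with the typographic hyphen U+2010 (as the `citizens.show()` output suggests), is that the town names were normalised to ASCII hyphens while the citizen names were not. The plain-Python sketch below illustrates that assumption, modelling the inner join as a set intersection; the sets and variable names are illustrative only.

```python
TYPOGRAPHIC_HYPHEN = "\u2010"

# Town names as they would appear after the ASCII-hyphen normalisation.
stop_towns = {"Baarle-Hertog", "Balen"}

# Assumed spelling in the citizen data: U+2010 instead of '-'.
citizen_towns = {"Baarle\u2010Hertog", "Balen"}

# An inner join keeps only names that are equal character for character.
matched_raw = stop_towns & citizen_towns

# Normalising the citizen names first lets the hyphenated town match as well.
normalized = {name.replace(TYPOGRAPHIC_HYPHEN, "-") for name in citizen_towns}
matched_normalized = stop_towns & normalized

print(sorted(matched_raw))
print(sorted(matched_normalized))
```

If this turns out to be the cause, applying the same hyphen replacement to the names returned by `extractNames` would be one way to recover the missing towns.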



#### Calculating the ratio
To calculate the ratio, the stop count of each town is divided by its citizen count. The result is then added to the dataframe as a new column to finish the analysis.

In [11]:
stops_per_town_per_citizen = joined_dfs.withColumn("citizen_ratio", joined_dfs["count"] / joined_dfs["citizen_count"])
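
As a quick sanity check of the division, the plain-Python sketch below recomputes the ratio for Aalst from the values in the joined dataframe. The rescaling to stops per 1000 citizens is not part of the notebook; it is added here only as an assumption about what might be easier to read.

```python
# Values for Aalst, taken from the joined dataframe shown above.
stops = 554
citizens = 85615

ratio = stops / citizens   # stops per citizen
per_1000 = ratio * 1000    # same ratio, expressed per 1000 citizens

print(f"{ratio:.6f} stops per citizen")
print(f"{per_1000:.2f} stops per 1000 citizens")
```

Because the raw ratios are all small fractions, a rescaled column of this kind can make differences between towns easier to compare at a glance.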

In [12]:
stops_per_town_per_citizen.orderBy("town").show()

+------------+-----+-------------+--------------------+
|        town|count|citizen_count|       citizen_ratio|
+------------+-----+-------------+--------------------+
|       Aalst|  554|        85615|0.006470828709922326|
|      Aalter|  107|        20544|0.005208333333333333|
|    Aarschot|  210|        29956|0.007010281746561...|
|  Aartselaar|   53|        14298|0.003706812141558...|
|    Affligem|   68|        13221|0.005143332576960895|
|       Alken|  118|        11564| 0.01020408163265306|
|  Alveringem|   78|         5087|0.015333202280322391|
|   Antwerpen| 1295|       521680|0.002482364667995706|
|     Anzegem|   92|        14599|0.006301801493252963|
|     Ardooie|   61|         8988|0.006786826880284824|
|    Arendonk|   60|        13274|0.004520114509567576|
|          As|   34|         8190|0.004151404151404151|
|        Asse|  145|        32940|0.004401942926533...|
|    Assenede|  109|        14204|0.007673894677555618|
|     Avelgem|   48|        10063|0.004769949319