## Assignment 1:
Compute the overall number of stops per citizen, using the citizen data in *citizens2.txt* and the De Lijn data of all stops in Belgium in *stops.txt*.

#### Setup SparkContext and SQLContext
The SparkContext is required to set up the other parts of this project. SQLContext lets us read the JSON in *stops.txt* directly into a dataframe, which makes the data easier to inspect and query.

In [14]:
from pyspark.sql import SparkSession, SQLContext

# Create (or reuse) a Spark session; the SparkContext is obtained from it.
spark = SparkSession.builder.appName("Ex1").getOrCreate()
sc = spark.sparkContext
sqlContext = SQLContext(sc)

#### Setup Stops DataFrame
Load the stops data into a dataframe, for later use in extracting the total number of stops.

In [15]:
from pyspark.sql.functions import explode_outer

stopsDF = sqlContext.read.json('data/stops.txt')
stops = stopsDF.select(explode_outer("haltes").alias("haltes"))

The *explode_outer* function returns a new row for each element in the given array or map. Since all the necessary data is held in the *haltes* array at the root, as can be seen in the dataframe schema, every stop now has its own row in the new dataframe. The total number of stops can therefore be calculated using *count()*, which returns the number of rows, that is, the number of elements that were in the *haltes* array.
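
To illustrate the row-per-element behaviour without a Spark cluster, here is a minimal plain-Python sketch of what *explode_outer* does to a list of records. The helper name `explode_outer_sketch` and the `id` field are hypothetical and only serve the illustration; the real *stops.txt* schema may differ.

```python
def explode_outer_sketch(records, key):
    """Plain-Python analogue of explode_outer on one array column."""
    rows = []
    for record in records:
        items = record.get(key)
        if not items:
            # Like explode_outer, a null or empty array still yields one row (with None).
            rows.append({key: None})
        else:
            # One output row per array element.
            rows.extend({key: item} for item in items)
    return rows

# Hypothetical input: one record holding two stops, one record with no stops.
records = [{"haltes": [{"id": 1}, {"id": 2}]}, {"haltes": []}]
rows = explode_outer_sketch(records, "haltes")
print(len(rows))
```

Because a null or empty array still produces a row, a plain row count equals the number of stops only when every record actually contains stops.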

In [16]:
total_stops = stops.count()
print("Total number of stops: {}".format(total_stops))

Total number of stops: 35790


#### Setup Citizens RDD
Using the SparkContext, we can read the lines of *citizens2.txt* into an RDD. Every line has the structure "{name} {total}", so we first need to extract the total from each line. To do this, we split the line on its last space and take the final element, which is the total.
The totals use a dot as the thousands separator, so the dots must be removed.
Finally, the extracted strings must be cast to integers so they can be summed.

In [17]:
citizensDF = sc.textFile('data/citizens2.txt')
citizens = citizensDF.map(lambda x: int(x.rpartition(" ")[-1].replace('.', '')))
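
The effect of the lambda on individual lines can be checked in plain Python. This is a small sketch using made-up sample lines, not values from *citizens2.txt*; the helper name `parse_total` is hypothetical.

```python
def parse_total(line):
    # Keep the text after the last space, drop the thousands separators, cast to int.
    return int(line.rpartition(" ")[-1].replace(".", ""))

# Made-up sample lines in the "{name} {total}" format.
sample_lines = ["Gemeente A 1.234.567", "Gemeente B 890"]
totals = [parse_total(line) for line in sample_lines]
print(totals)       # [1234567, 890]
print(sum(totals))  # 1235457
```

Since *rpartition* splits on the last space only, names that themselves contain spaces do not affect the extracted total.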

We can now acquire the total number of citizens by simply summing all calculated integers.

In [18]:
total_citizens = citizens.sum()
print("Total number of citizens: {}".format(total_citizens))

Total number of citizens: 11358357


#### Calculate the number of stops per citizen
Now that we have the required values, the number of stops per citizen can be computed using the following formula: *total number of stops / total number of citizens*.

In [19]:
stops_per_citizen = total_stops / total_citizens
print("Number of stops per citizen: {}".format(stops_per_citizen))

Number of stops per citizen: 0.0031509838967026657
