![Spark Image](https://upload.wikimedia.org/wikipedia/commons/thumb/f/f3/Apache_Spark_logo.svg/1200px-Apache_Spark_logo.svg.png)

# Solution to the exercise

This notebook contains suggested solutions to the exercises. Since the exercises can be implemented in different ways and with different notations, the suggested solutions may differ from the code written in the exercise - that's totally ok :) But at least the output of the individual code cells should match.

#### *Create a SparkSession*

Exercise - let's create a *Spark session* - do you remember which object you have to import?

In [None]:
# import SparkSession from library
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Exercise").getOrCreate()

In [2]:
# display information about the Spark session
spark

#### *Load data*

Now we need data to work with. For this we load data that is already provided in the local /data/berlin-data folder.<br><br>
The data was downloaded from the repository: https://github.com/berlinonline/haeufige-vornamen-berlin.<br>
The originator of the data in the original repository folder data/source/ is "Berlin State Office for Citizens and Regulatory Affairs (LABO)".<br>
The originator of the 'cleaned' data in the original repository folder data/cleaned/ is the "Berlin State Office for Citizens' and Regulatory Affairs (LABO) / BerlinOnline Stadtportal GmbH & Co. KG".<br>

All data sets contained in the repository are licensed under CC BY 3.0 DE (Creative Commons Attribution 3.0 Germany License).

##### Data content and data structure

Since 2013, the Berlin city data portal daten.berlin.de has published, at the beginning of each new year, lists of the first names of all newborn children registered with the registry offices. The Berlin State Office for Citizens' and Regulatory Affairs (LABO) collects the lists from the registry offices in the individual Berlin districts and then publishes them.

There is a folder for each year that contains a CSV file with the frequency of names for each district of Berlin.<br><br>
For the years 2012-2016, the column structure of the CSV files is as follows:
- 'vorname' specifies the first name
- 'anzahl' the total number of children registered with this name
- 'geschlecht' the gender of the child

From 2017 there is an additional column 'position':
- 'position': if a child has been given several first names, position designates the position of the name in the list of names.

*There is nothing to add or exchange in the next code cell!*<br><br> In the next cell, all data in the different subdirectories is read, the information from the folder names and file names is added to the dataframe as columns, and all data is brought into a uniform schema. Furthermore, all column names and values are converted from German to English and the 'new' dataframe is saved as a file. This file is the basis for the further tasks in this notebook.

In [3]:
# import the os module and the required Spark functions and types
import os
from pyspark.sql.functions import lit, translate
from pyspark.sql.types import StructField, StructType, StringType, IntegerType

# create schema for dataframe
csvSchema = StructType([StructField("vorname", StringType(), True),
                             StructField("anzahl", IntegerType(), True),
                             StructField("geschlecht", StringType(), True),
                             StructField("position", IntegerType(), True),
                             StructField("year", IntegerType(), True),
                             StructField("district", StringType(), True) 
                            ])

# create empty dataframe
df = spark.createDataFrame(spark.sparkContext.emptyRDD(), csvSchema)

# 'root' path of berlin data
fileDirectory = 'data/berlin-data/cleaned/'
# loop over all subdirectories (each subdir = one year of data)
for dname in os.listdir(fileDirectory):    
    # loop over all files in the 'year' directory
    fpath = fileDirectory + dname + "/"
    for fname in os.listdir(fpath):
        df_tmp = spark.read.format("csv")\
            .option("header", "true")\
            .option("inferSchema",True)\
            .load(fpath + fname)
        # check if the schema contains the column 'position'; if not, add the column with default value 1
        if "position" not in df_tmp.columns:
            df_tmp = df_tmp.withColumn("position", lit(1))
        # get the final component of a pathname. This represents the year.
        year = os.path.basename(os.path.dirname(fpath))
        # add the year value column
        df_tmp = df_tmp.withColumn("Year", lit(year))
        # add the district value column. The district value is the file name without '.csv' extension.
        df_tmp = df_tmp.withColumn("District", lit(fname[:-4]))
        # add the df_tmp dataframe to the df dataframe 
        df = df.union(df_tmp)

# rename columns from German -> English
df = df.withColumnRenamed("vorname","FirstName") \
    .withColumnRenamed("anzahl","NumberOfChildren")\
    .withColumnRenamed("geschlecht","Gender")\
    .withColumnRenamed("position","Position")\
    .withColumnRenamed("year","Year")\
    .withColumnRenamed("district","District")

# change gender value from w (weiblich) -> f (female)    
df = df.withColumn('Gender', translate('Gender', 'w', 'f'))
# Write DataFrame data to CSV file
df.coalesce(1).write.mode("overwrite").option("header", "true").csv("data/berlin-data/berlin-data")

                                                                                

Exercise - in the next cells you have to 
- read a file in csv format and create a dataframe
- persist the dataframe with the default storage level (MEMORY_AND_DISK)
- print the schema of the dataframe
- display the number of rows in the dataframe
- display the first ten rows of the dataframe

In [4]:
# read .csv file
df = spark.read.format("csv")\
            .option("header", "true")\
            .option("inferSchema",True)\
            .load("data/berlin-data/berlin-data")

                                                                                

In [5]:
# check if the dataframe is already cached - if not, cache/persist it
if not df.storageLevel.useMemory:
    df.cache()

In [6]:
# print the dataframe schema
df.printSchema()

root
 |-- FirstName: string (nullable = true)
 |-- NumberOfChildren: integer (nullable = true)
 |-- Gender: string (nullable = true)
 |-- Position: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- District: string (nullable = true)



In [7]:
# display the number of rows in the dataframe
df.count()

                                                                                

310025

In [8]:
# display the first ten rows of the dataframe
df.show(10,False)

+---------+----------------+------+--------+----+-----------------+
|FirstName|NumberOfChildren|Gender|Position|Year|District         |
+---------+----------------+------+--------+----+-----------------+
|Sophie   |27              |f     |1       |2013|treptow-koepenick|
|Alexander|21              |m     |1       |2013|treptow-koepenick|
|Charlotte|21              |f     |1       |2013|treptow-koepenick|
|Marie    |21              |f     |1       |2013|treptow-koepenick|
|Paul     |19              |m     |1       |2013|treptow-koepenick|
|Jonas    |17              |m     |1       |2013|treptow-koepenick|
|Anton    |13              |m     |1       |2013|treptow-koepenick|
|Emilia   |13              |f     |1       |2013|treptow-koepenick|
|Luca     |13              |m     |1       |2013|treptow-koepenick|
|Mia      |13              |f     |1       |2013|treptow-koepenick|
+---------+----------------+------+--------+----+-----------------+
only showing top 10 rows



#### *Analyse data*

We already know how many rows are in the dataframe.<br><br>Exercise - let's find out
- how many different first names exist
- how many female and male entries exist
- how many entries there are for each year
- what the most popular female first name is
- what the most popular male first name is
- the maximum number of first names a child has in the register (maximum position)
- which district has the most entries
- how many children share the same name (NumberOfChildren) on average in the year 2021

In [9]:
# how many different first names exist?

# import countDistinct function from library
from pyspark.sql.functions import countDistinct
df.select(countDistinct("FirstName")).show()

# alternative implementation (return type is integer)
df.select("FirstName").distinct().count()

+-------------------------+
|count(DISTINCT FirstName)|
+-------------------------+
|                    56478|
+-------------------------+



56478

In [10]:
# how many female and male entries exist?
df.groupBy("Gender").count().show()

+------+------+
|Gender| count|
+------+------+
|     m|156627|
|     f|153398|
+------+------+



In [11]:
# how many entries are there for each year?
# sort after the aggregation: a sort before groupBy does not determine the order of the result
df.groupBy("Year").count().sort("Year").show()

+----+-----+
|Year|count|
+----+-----+
|2012|25971|
|2013|26070|
|2014|27222|
|2015|27548|
|2016|29060|
|2017|35081|
|2018|35176|
|2019|34595|
|2020|34620|
|2021|34682|
+----+-----+



In [12]:
# what is the most popular female first name?
df.sort(df.NumberOfChildren.desc()).filter(df.Gender == "f").show(5, False)

+---------+----------------+------+--------+----+--------------------------+
|FirstName|NumberOfChildren|Gender|Position|Year|District                  |
+---------+----------------+------+--------+----+--------------------------+
|Marie    |128             |f     |1       |2015|pankow                    |
|Marie    |122             |f     |1       |2014|pankow                    |
|Marie    |122             |f     |1       |2012|charlottenburg-wilmersdorf|
|Marie    |121             |f     |1       |2013|charlottenburg-wilmersdorf|
|Marie    |119             |f     |1       |2013|pankow                    |
+---------+----------------+------+--------+----+--------------------------+
only showing top 5 rows



In [13]:
# what is the most popular male first name?
df.sort(df.NumberOfChildren.desc()).filter(df.Gender == "m").show(5, False)

+---------+----------------+------+--------+----+--------------------------+
|FirstName|NumberOfChildren|Gender|Position|Year|District                  |
+---------+----------------+------+--------+----+--------------------------+
|Alexander|74              |m     |1       |2015|pankow                    |
|Alexander|73              |m     |1       |2016|tempelhof-schoeneberg     |
|Paul     |69              |m     |1       |2012|pankow                    |
|Alexander|68              |m     |1       |2016|charlottenburg-wilmersdorf|
|Alexander|68              |m     |1       |2016|pankow                    |
+---------+----------------+------+--------+----+--------------------------+
only showing top 5 rows



In [14]:
# what is the maximum number of first names somebody has (maximum position)?

# import max function from library
from pyspark.sql.functions import max
df.select(max("Position")).show()

+-------------+
|max(Position)|
+-------------+
|            8|
+-------------+



In [15]:
# which district has the most entries?

# import functions from library
from pyspark.sql.functions import col
df.groupBy("District").count().sort(col("count").desc()).show(truncate = False)

+--------------------------+-----+
|District                  |count|
+--------------------------+-----+
|mitte                     |41670|
|tempelhof-schoeneberg     |40136|
|charlottenburg-wilmersdorf|38062|
|friedrichshain-kreuzberg  |36778|
|pankow                    |29550|
|spandau                   |27970|
|neukoelln                 |27510|
|lichtenberg               |22726|
|treptow-koepenick         |12184|
|reinickendorf             |11811|
|steglitz-zehlendorf       |11127|
|marzahn-hellersdorf       |10497|
|standesamt_i              |4    |
+--------------------------+-----+



In [16]:
# How many children have the same name (NumberOfChildren) on average in the year 2021?

# import functions from library
from pyspark.sql.functions import avg
# 'Year' is an integer column, so compare against an integer
df.filter(df.Year == 2021).agg(avg("NumberOfChildren")).show()

+---------------------+
|avg(NumberOfChildren)|
+---------------------+
|   1.8687503604175077|
+---------------------+



#### *Analyse data using 'sql' and 'join' commands*

Besides the functions used in the section above, we now want to use 'join' commands.<br><br>
Exercise - let's find out
- how many first names exist in district mitte but not in district pankow
- how many female and male entries exist
- how many entries there are for each year
- what the most popular female first name is
- what the most popular male first name is
- the maximum number of first names a child has in the register (maximum position)
- which district has the most entries

In [17]:
# to have 'easier' access to the data of the individual districts
# we can create a new dataframe for each relevant district

df_mitte = df.filter(df.District == 'mitte')
df_pankow = df.filter(df.District == 'pankow')
df_lichtenberg = df.filter(df.District == 'lichtenberg')

In [18]:
# how many first names exist in district mitte but not in district pankow?
# use dataframe 'join' command

df_mitte.join(df_pankow, (df_mitte.FirstName == df_pankow.FirstName), "leftanti").count()

15898

In [19]:
# how many first names exist in district mitte but not in district pankow?
# use 'sql' command - hint: don't forget to create local temporary views

df_mitte.createOrReplaceTempView("mitte")
df_pankow.createOrReplaceTempView("pankow")

# show() returns None, so the result is displayed directly instead of being assigned to a variable
spark.sql("SELECT count(*) FROM mitte m \
           LEFT OUTER JOIN pankow p \
           ON m.FirstName = p.FirstName \
           WHERE p.FirstName IS NULL") \
  .show(truncate=False)

+--------+
|count(1)|
+--------+
|15898   |
+--------+



## Stop The Spark Session

In [20]:
# stop the underlying SparkContext.
try:
    spark
except NameError:
    print("Spark session does not exist - nothing to stop.")
else:
    spark.stop()

---
*This is the end of the Spark101 course. The next notebooks show ML algorithms with Spark - for advanced users.*