# <center>Big Data for Engineers – Exercises</center>
## <center>Spring 2021 – Week 9 – ETH Zurich</center>
## <center>Spark - Solutions </center>

## 1. Set up the Spark cluster on Azure

### Create a cluster

1. Sign in to the Azure portal (portal.azure.com).
1. Search for "HDInsight clusters" using the search box at the top.
1. Create a new resource group (for example: 'exercise09').
1. Give the cluster a unique name (for example: 'exercise09cluster').
1. For the "Cluster Type", choose **Spark** and leave the default version as is. We also recommend using the **UK South** region.
1. Create a cluster login password (you can use https://www.random.org/strings/ for inspiration). Keep the password around as you will need it for later!
1. Move to the *Storage* stage of the setup. Here, leave **Azure Storage** as the *Primary Storage Type*. For the *Primary Storage Account* you have the option to set up a new account. The *Container*'s name will be generated automatically, however make sure to remember it, or change it to something memorable, if you plan on finishing the exercises in more than one sitting.
1. Move to the *Configuration + Pricing* stage of the setup (skip *Security + networking*). Set up a Spark cluster which uses **D12 v2** deployments for both the *Head* and *Worker* nodes, and select *4* *Worker* nodes. It should cost roughly 2.2 CHF/h. Notice that you are running a relatively sizable cluster of 6 nodes (the 4 workers plus the head nodes): $24 \text{ cores} = 4 \text{ cores} \times 6 \text{ nodes}$, and $168 \text{ GB} = 28 \text{ GB} \times 6 \text{ nodes}$.
1. Move to the *Review + Create* stage of the setup, and click the **Create** button once validation succeeds.
1. Wait until your cluster is deployed (this can take up to 20 minutes).

## <span style="color: red;">**Important:** Remember to **delete** the cluster once you are done. If you want to stop doing the exercises at any point, delete it and recreate it using the same container name as you used the first time, so that the resources are still there.</span>
<img src="https://bigdataforengineers2021.blob.core.windows.net/exercise09/tutorial_delete.png" style="width: 900px;">

### Access your cluster

Make sure you can access your cluster (the NameNode) via SSH:

```
$ ssh ssh_user_name@cluster_name-ssh.azurehdinsight.net
```

If you are using Linux or macOS, you can use your standard terminal.
If you are using Windows, you can use:
- PuTTY SSH client and PSCP tool (get them [here](http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html)).
- This notebook server's terminal (click on the Jupyter logo, then go to New -> Terminal).
- Azure Cloud Terminal (see the HBase exercise sheet for details)


The cluster has its own Jupyter server. We will use it. You can access it through the following link:
```
 https://cluster_name.azurehdinsight.net/jupyter
```

You can access the cluster's YARN UI in your browser:
```
 https://cluster_name.azurehdinsight.net/yarnui/hn/cluster
```

The Spark UI can be accessed via Azure Portal, see [Spark job debugging](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-job-debugging)

## <span >You need to upload this notebook to your cluster's Jupyter in order to execute Python code blocks.</span>


To do this, just open the Jupyter through the link given above and use the "Upload" button. After the notebook has been loaded, you can open it using a Spark kernel.
<img src="https://bigdataforengineers2021.blob.core.windows.net/exercise09/tutorial_jupyter.png" style="width: 900px;">

## 2. Apache Spark Architecture

Spark is a cluster computing platform designed to be fast and general purpose. Spark extends the MapReduce model to efficiently cover a wide range of workloads that previously required separate distributed systems, including interactive queries and stream processing. Spark offers the ability to run computations in memory.

At a high level, every Spark application consists of a **driver program** that launches various parallel operations on a cluster. The driver program contains your application's main function and defines distributed datasets on the cluster, then applies operations to them.

Driver programs access Spark through a **SparkContext** object, which represents a connection to a computing cluster. There is no need to create a SparkContext yourself; it is created for you automatically when you run the first code cell in the Jupyter notebook.

The driver communicates with a potentially large number of distributed workers called **executors**. The driver runs in its own process and each executor is a separate process. A driver and its executors are together termed a Spark application.

![Spark cluster overview](http://spark.apache.org/docs/latest/img/cluster-overview.png)

### 2.1 Understand resilient distributed datasets (RDD)

An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. 

##### What are RDD operations?
RDDs offer two types of operations: **transformations** and **actions**.

* **Transformations** create a new dataset from an existing one. Transformations are lazy, meaning that no transformation is executed until you execute an action.
* **Actions** compute a result based on an RDD, and either return it to the driver program or save it to an external storage system (e.g., HDFS)




Transformations and actions are different because of the way Spark computes RDDs. Although you can define new RDDs any time, Spark computes them only in a **lazy** fashion, that is, the first time they are used in an action.
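
As a rough plain-Python analogy (not Spark itself), the built-in `map` is also lazy: it only records what to do, and the function runs once something consumes the result. The sketch below assumes a hypothetical three-element list in place of an RDD, and `list(...)` plays the role of an action.

```python
# Plain-Python analogy of lazy transformations; no Spark involved.
calls = []  # records every element the function is actually applied to


def reverse(fruit):
    calls.append(fruit)
    return fruit[::-1]


fruits = ["apple", "banana", "kiwi"]  # hypothetical stand-in for an RDD

# "Transformation": builds a lazy map object, nothing is computed yet.
reversed_fruits = map(reverse, fruits)
calls_before_action = len(calls)

# "Action": consuming the map object triggers the computation.
result = list(reversed_fruits)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, result)
```

The function is not called at all until the result is consumed, which mirrors how Spark defers transformations until an action such as `collect()` is run.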

##### How do I make an RDD?

RDDs can be created from stable storage or by transforming other RDDs. Run the cells below to create RDDs from the sample data files available in the storage container associated with your Spark cluster. One such sample data file is available on the cluster at `wasb:///example/data/fruits.txt`. Notice that we're reading from Azure Blob storage, as indicated by the `wasb` scheme at the start of the URL. In general, you can read data from other sources using the following URL schemes:

* `file`: Read from the local file system.
* `hdfs`: Read from a Hadoop Distributed File System.
* `s3`  : Read from AWS S3 Storage.
* `wasb`: Read from Azure Blob Storage.
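
To see which part of a path is the prefix listed above, the following sketch splits two of the paths used in this notebook with Python's standard `urllib.parse`. This only illustrates the URL structure; it is an assumption-free use of the standard library, but Spark and Hadoop resolve these paths with their own logic.

```python
from urllib.parse import urlparse

# Paths taken from this exercise sheet.
paths = [
    "wasb:///example/data/fruits.txt",
    "file:///tmp/confusion-2014-03-02/confusion-2014-03-02.json",
]

# The scheme is the part before "://".
schemes = [urlparse(path).scheme for path in paths]

# The remainder after the (empty) host part is the path inside that storage.
file_path = urlparse(paths[1]).path

print(schemes)
print(file_path)
```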

In [None]:
# sc is the Spark Context object automatically created for you
fruits = sc.textFile('wasb:///example/data/fruits.txt')
yellowThings = sc.textFile('wasb:///example/data/yellowthings.txt')

##### RDD transformations
Following are examples of some of the common transformations available. For a detailed list, see [RDD Transformations](https://spark.apache.org/docs/2.0.0/programming-guide.html#transformations)

Run some transformations below to understand this better.

**Note:** If some of the queries are taking too long to complete, try restarting the kernel, and rerunning the cell *above*.

In [None]:
# map
fruitsReversed = fruits.map(lambda fruit: fruit[::-1])
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
fruitsReversed.collect()

In [None]:
# filter
shortFruits = fruits.filter(lambda fruit: len(fruit) <= 5)
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
shortFruits.collect()

In [None]:
# flatMap
characters = fruits.flatMap(lambda fruit: list(fruit))
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
characters.collect()

In [None]:
# union
fruitsAndYellowThings = fruits.union(yellowThings)
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
fruitsAndYellowThings.collect()

In [None]:
# intersection
yellowFruits = fruits.intersection(yellowThings)
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
yellowFruits.collect()

In [None]:
# distinct
distinctFruitsAndYellowThings = fruitsAndYellowThings.distinct()
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
distinctFruitsAndYellowThings.collect()

In [None]:
# groupByKey
yellowThingsByFirstLetter = yellowThings.map(lambda thing: (thing[0], thing)).groupByKey()
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
for letter, lst in yellowThingsByFirstLetter.collect():
    print("For letter", letter)
    for obj in lst:
        print(" > ", obj)

In [None]:
# reduceByKey
numFruitsByLength = fruits.map(lambda fruit: (len(fruit), 1)).reduceByKey(lambda x, y: x + y)
# Note: the `collect` command is NOT a Transformation, it is an Action used here for the purposes of showing the results! 
numFruitsByLength.collect()

##### RDD actions
Following are examples of some of the common actions available. For a detailed list, see [RDD Actions](https://spark.apache.org/docs/2.3.0/programming-guide.html#actions).

Run some actions below to understand this better.

In [None]:
# collect
fruitsArray = fruits.collect()
yellowThingsArray = yellowThings.collect()
print(fruitsArray)
print(yellowThingsArray)

In [None]:
# count
numFruits = fruits.count()
numFruits

In [None]:
# take
first3Fruits = fruits.take(3)
first3Fruits

In [None]:
# reduce
letterSet = fruits.map(lambda fruit: set(fruit)).reduce(lambda x, y: x.union(y))
letterSet

##### Lazy evaluation
Lazy evaluation means that when we call a transformation on an RDD (for instance, calling `map()`), the operation is not immediately performed. Instead, Spark internally records metadata to indicate that this operation has been requested. Rather than thinking of an RDD as containing specific data, it is best to think of each RDD as
consisting of instructions on how to compute the data that we build up through transformations. Loading data into an RDD is lazily evaluated in the same way transformations are. So, when we call `sc.textFile()`, the data is not loaded until it is necessary. As with transformations, the operation (in this case, reading the data) can
occur multiple times.


Finally, as you derive new RDDs from each other using transformations, Spark keeps track of the set of dependencies between different RDDs, called the lineage graph. For instance, the code below corresponds to the following graph:

<img src="https://bigdataforengineers2021.blob.core.windows.net/exercise09/stages.png" style="width: 300px;">

In [None]:
apples = fruits.filter(lambda x: "apple" in x)
lemons = yellowThings.filter(lambda x: "lemon" in x)
applesAndLemons = apples.union(lemons)
print(applesAndLemons.collect())
print(applesAndLemons.toDebugString())

##### Persistence (Caching)

Spark's RDDs are by default recomputed each time you run an action on
them. If you would like to reuse an RDD in multiple actions, you can ask Spark to persist it using `RDD.persist()`. After computing it the first time, Spark will store the RDD contents in memory (partitioned across the machines in your cluster), and reuse them in future actions. Persisting RDDs on disk instead of memory is also possible.

If you attempt to cache too much data to fit in memory, Spark will automatically evict old partitions using a Least Recently Used (LRU) cache policy. For the memory-only storage levels, it will recompute these partitions the next time they are accessed,
while for the memory-and-disk ones, it will write them out to disk. In either case, this means that you don't have to worry about your job breaking if you ask Spark to cache too much data. However, caching unnecessary data can lead to eviction of useful data
and more recomputation time. Finally, RDDs come with a method called `unpersist()` that lets you manually remove them from the cache.
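
The difference between recomputing and persisting can be sketched in plain Python. The `LazyDataset` class below is hypothetical and only mimics the behaviour described above (it is not how Spark implements caching): every action recomputes the data unless `persist()` was called first.

```python
class LazyDataset:
    """Hypothetical stand-in for an RDD that counts how often it is computed."""

    def __init__(self, compute):
        self._compute = compute      # function producing the elements
        self._persist = False
        self._cached = None
        self.computations = 0        # how many times the data was computed

    def persist(self):
        self._persist = True
        return self

    def unpersist(self):
        self._persist = False
        self._cached = None
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached      # reuse the stored contents
        self.computations += 1
        data = list(self._compute())
        if self._persist:
            self._cached = data
        return data

    # Two "actions"
    def count(self):
        return len(self._materialize())

    def collect(self):
        return list(self._materialize())


# Without persist(): each action recomputes the data.
squares = LazyDataset(lambda: (x * x for x in range(5)))
squares.count()
squares.collect()
uncached_computations = squares.computations

# With persist(): the data is computed once and then reused.
cached_squares = LazyDataset(lambda: (x * x for x in range(5))).persist()
cached_squares.count()
cached_squares.collect()
cached_computations = cached_squares.computations

print(uncached_computations, cached_computations)
```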


##### Working with Key/Value Pairs


Spark provides special operations on RDDs containing key/value pairs. These RDDs
are called *pair RDDs*. Pair RDDs are a useful building block in many programs, as
they expose operations that allow you to act on each key in parallel or regroup data
across the network. For example, pair RDDs have a `reduceByKey()` method that can
aggregate data separately for each key, and a `join()` method that can merge two
RDDs together by grouping elements with the same key. Pair RDDs are also still RDDs. 



In [None]:
#Example
rdd = sc.parallelize([("key1", 0) ,("key2", 3),("key1", 8) ,("key3", 3),("key3", 9)])
rdd2 = rdd.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
print(rdd2.collect())
print(rdd2.toDebugString())
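
The example above builds a `(sum, count)` pair per key, which is the usual first step towards a per-key average. The plain-Python sketch below reproduces that logic without Spark, so you can check the intermediate values by hand. The helper `reduce_by_key` is a hypothetical, single-machine imitation of `reduceByKey()`, not the Spark implementation.

```python
def reduce_by_key(pairs, func):
    """Combine all values that share a key using a binary function."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


pairs = [("key1", 0), ("key2", 3), ("key1", 8), ("key3", 3), ("key3", 9)]

# Equivalent of mapValues(lambda x: (x, 1))
with_counts = [(key, (value, 1)) for key, value in pairs]

# Equivalent of reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1]))
sums_and_counts = reduce_by_key(
    with_counts, lambda x, y: (x[0] + y[0], x[1] + y[1])
)

# Dividing sum by count gives the average value per key.
averages = {key: total / count for key, (total, count) in sums_and_counts.items()}

print(sums_and_counts)
print(averages)
```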

#### Converting a user program into tasks

A Spark driver is responsible for converting a user program into units of physical execution called tasks. At a high level, all Spark programs follow the same structure: they create RDDs from some input, derive new RDDs from those using transformations, and perform actions to collect or save data. A Spark program implicitly creates a logical **directed acyclic graph (DAG)** of operations.
When the driver runs, it converts this logical graph into a physical execution plan.

Spark performs several optimizations, such as "pipelining" map transformations together to merge them, and converts the execution graph into a set of **stages**.
Each stage, in turn, consists of multiple tasks. The tasks are bundled up and prepared to be sent to the cluster. Tasks are the smallest unit of work in Spark; a typical user program can launch hundreds or thousands of individual tasks.
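
The idea behind pipelining can be illustrated with a small plain-Python sketch: instead of making one full pass over the data per `map`, the functions are chained so that each element goes through all of them in a single pass. The `pipeline` helper is hypothetical and only meant to convey the idea, not Spark's actual scheduler logic.

```python
def pipeline(*functions):
    """Chain several element-wise functions into a single function."""
    def run(element):
        for function in functions:
            element = function(element)
        return element
    return run


fruits = [" apple ", " banana "]  # hypothetical sample data

# Two separate passes with an intermediate list in between.
stripped = [fruit.strip() for fruit in fruits]
two_passes = [fruit.upper() for fruit in stripped]

# One pass: both functions are applied to each element back to back.
strip_then_upper = pipeline(str.strip, str.upper)
one_pass = [strip_then_upper(fruit) for fruit in fruits]

print(one_pass)
```

Both variants produce the same result, but the pipelined one never materializes the intermediate collection.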

Each RDD maintains a pointer to one or more parents along with metadata about what
type of relationship they have. For instance, when you call `b = a.map(f)` on an
RDD, the RDD `b` keeps a reference to its parent `a`. These pointers allow an RDD to be
traced to all of its ancestors.
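
A minimal sketch of such parent pointers, in plain Python, is shown below. The `LineageNode` class is hypothetical; the node names are borrowed from the `applesAndLemons` example in the lazy evaluation section.

```python
class LineageNode:
    """Hypothetical stand-in for an RDD that only remembers its parents."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = tuple(parents)

    def ancestors(self):
        """Return the names of all ancestors, following the parent pointers."""
        found = []
        for parent in self.parents:
            if parent.name not in found:
                found.append(parent.name)
            for name in parent.ancestors():
                if name not in found:
                    found.append(name)
        return found


fruits = LineageNode("fruits")
yellow_things = LineageNode("yellowThings")
apples = LineageNode("apples", [fruits])
lemons = LineageNode("lemons", [yellow_things])
apples_and_lemons = LineageNode("applesAndLemons", [apples, lemons])

lineage = apples_and_lemons.ancestors()
print(lineage)
```

In Spark itself, `toDebugString()` gives you the corresponding information for a real RDD.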

The following phases occur during Spark execution:
* User code defines a DAG (directed acyclic graph) of RDDs. Operations on RDDs create new RDDs that refer back to their parents, thereby creating a graph.
* Actions force translation of the DAG to an execution plan. When you call an action on an RDD, it must be computed. This requires computing its parent RDDs as well. 
* Spark's scheduler submits a job to compute all needed RDDs. That job will have one or more stages, which are parallel waves of computation composed of tasks. Each stage will correspond to one or more RDDs in the DAG. A single stage can correspond to multiple RDDs due to pipelining.
* Tasks are scheduled and executed on a cluster.
* Stages are processed in order, with individual tasks launching to compute segments of the RDD. Once the final stage is finished in a job, the action is complete.

If you visit the application's web UI, you will see how many stages occur in order to
fulfill an action. The Spark UI can be accessed via Azure Portal, see [Spark job debugging](https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-apache-spark-job-debugging)

### 3. The Great Language Game

Now, you will get to write some queries yourself on a larger dataset. You will be using the [language confusion dataset](http://lars.yencken.org/datasets/languagegame/).

This exercise is a little bit different, in that it is part of a small project you will be doing over the following 3 weeks to compare Spark, Spark with DataFrames/SQL, and Sparksoniq. You will hear more about it in the coming weeks.

Apart from that, you will have to submit the results of this exercise to Moodle to obtain the weekly bonus. You will need four things:
- The query you wrote
- Something related to its output (which you will be graded on)
- The time it took you to write it
- The time it took you to run it

As you might have observed in the sample queries above, the time a job took to run is displayed in the rightmost column of its output. If it consists of several stages, however, you will need the sum of them. The easiest approach is to take the execution time of the whole query:

<img src="https://bigdataforengineers2021.blob.core.windows.net/exercise09/tutorial_time.png" style="width: 700px;">

Of course, you will not be evaluated on the time it took you to write the queries (nor on the time it took them to run), but this is useful to us in order to measure the increase in performance when using Sparksoniq. There is a cell that outputs the time you started working before every query. Use this if you find it useful.
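
If you work in a local shell where the timing column is not displayed, one possible way to measure the run time yourself is sketched below. The `timed` helper is hypothetical, and the summation is only a placeholder for your actual query (for example, a lambda wrapping a `count()` or `collect()` call).

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = action()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Placeholder workload; replace the lambda body with your query.
result, seconds = timed(lambda: sum(range(1_000_000)))
print(result, "computed in", round(seconds, 3), "seconds")
```

Since transformations are lazy, make sure the callable ends with an action; otherwise you only measure the time needed to define the RDD.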


For this exercise, you can choose to either set up a local Spark installation on your computer or use an Azure cluster as explained above. Both are fine, but a local installation really makes things easier to debug.

**If you choose to go for a local installation:**

- Make sure you have [Java 8](https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html) or later installed on your computer.
- Install the latest [Spark](https://spark.apache.org/downloads.html) release. Here are tutorials for [Linux](https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f), [Windows](https://medium.com/@naomi.fridman/install-pyspark-to-run-on-jupyter-notebook-on-windows-4ec2009de21f), and [MacOS](https://www.tutorialkart.com/apache-spark/how-to-install-spark-on-mac-os/).
- Launch pyspark with `$ pyspark`

**In any case:**

Now either log in to your cluster using SSH as explained above and run the following commands, or of course do it locally:

```
wget http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2
tar -jxvf confusion-2014-03-02.tbz2 -C /tmp
```
If you're on a cluster:
```
hdfs dfs -copyFromLocal /tmp/confusion-2014-03-02/confusion-2014-03-02.json /confusion.json
```

This downloads the archive file, decompresses it and, when using a cluster, uploads it to HDFS. Now, create an RDD from the file containing the entries:

In [3]:
data = sc.textFile('wasb:///confusion.json')

Or locally:
```
data = sc.textFile('file:///tmp/confusion-2014-03-02/confusion-2014-03-02.json')
```
**Note:** from here on, if you're doing things locally, just copy the contents of the cells into the pyspark shell and work on there!

Since the entries are JSON records, you will need to parse them and use their respective object representations. You can use this mapping for all queries. Since some of the queries take a long time to execute on the dataset, you might want to answer these queries on the first `100000` entries. Later you can run the same queries on the entire dataset by running them on `entries` instead of `test_entries`. For the quiz, running the queries on either dataset (100000 entries or all entries) will be accepted as correct.

In [14]:
import json

testset = sc.parallelize(data.take(100000))
test_entries = testset.map(json.loads)

entries = data.map(json.loads)
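
To get a feeling for what `json.loads` produces per line, here is a plain-Python sketch on two input lines. The first line is one of the records shown in the output of query 1 further down; the second one is made up for this example so that there is also a wrong guess to filter out.

```python
import json

lines = [
    # Record taken from the sample output of query 1.
    '{"guess": "Russian", "target": "Russian", "country": "AU", '
    '"choices": ["Lao", "Nepali", "Russian", "Tongan"], '
    '"sample": "8a59d48e99e8a1df7e366c4648095e27", "date": "2013-08-19"}',
    # Hypothetical record with a wrong guess, made up for this example.
    '{"guess": "Lao", "target": "Nepali", "country": "AU", '
    '"choices": ["Lao", "Nepali"], "sample": "example", "date": "2013-08-19"}',
]

# Same parsing step as data.map(json.loads), but on a plain list.
records = [json.loads(line) for line in lines]

# Each record is now a dict, so fields can be accessed by name.
correct = [record for record in records if record["guess"] == record["target"]]

print(len(records), "records,", len(correct), "correct guess(es)")
```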


And test it. Is it working?

In [None]:
target_german = test_entries.filter(lambda e: e["target"] == "German").take(1)
print(json.dumps(target_german, indent = 4))

Good! Let's get to work. A few last things:
- Take into account that some of the queries might have very large outputs, which Jupyter (or sometimes even Spark) won't be able to handle. It is normal for the queries to take some time, but if the notebook crashes or stops responding, try restarting the kernel. Avoid printing large outputs. You can print the first few entries to confirm the query has worked, as shown in query 1.
- Remember to delete the cluster if you want to stop working! You can recreate it using the same container name and your resources will still be there.
- Refer to the [documentation](http://spark.apache.org/docs/2.3.0/api/python/pyspark.html#pyspark.RDD), as well as the programming guides on actions and transformations linked to above.

And now to the actual queries. *Please make sure that in your queries you **only** use the PySpark RDD API, and avoid any DataFrames (they will be covered in next week's exercises).*

1\. Find all games such that the guessed language is correct (=target), and such that this language is Russian. What is the length of the resulting sequence?

In [18]:
from datetime import datetime

# Started working:
print(datetime.now().time())

19:27:01.344858


In [19]:
# Query:
matching = test_entries.filter(lambda e: e["target"] == "Russian").filter(lambda e: e["target"] == e["guess"]).collect()
# Only print the first few entries
print(json.dumps(matching[:3], indent=4))
# print number of matching 
print('Number of matching games ', len(matching))

[
    {
        "guess": "Russian",
        "target": "Russian",
        "country": "AU",
        "choices": [
            "Lao",
            "Nepali",
            "Russian",
            "Tongan"
        ],
        "sample": "8a59d48e99e8a1df7e366c4648095e27",
        "date": "2013-08-19"
    },
    {
        "guess": "Russian",
        "target": "Russian",
        "country": "AU",
        "choices": [
            "Arabic",
            "Estonian",
            "Greek",
            "Russian"
        ],
        "sample": "8a59d48e99e8a1df7e366c4648095e27",
        "date": "2013-08-19"
    },
    {
        "guess": "Russian",
        "target": "Russian",
        "country": "AU",
        "choices": [
            "Croatian",
            "Nepali",
            "Russian",
            "Slovenian"
        ],
        "sample": "8a59d48e99e8a1df7e366c4648095e27",
        "date": "2013-08-20"
    }
]
Number of matching games  1886


2\. For those cases where the *guess* matched the *target*, list what the *guessed* language was. How many times has *Hindi* been correctly guessed?

In [20]:
# Started working:
print(datetime.now().time())

19:27:05.351216


In [21]:
# Query:
answers = test_entries.filter(lambda e: e["target"] == e["guess"]).map(lambda e: e["guess"]).collect()
print(answers[:3])
matching = test_entries.filter(lambda e: e["target"] == "Hindi").filter(lambda e: e["target"] == e["guess"]).collect()
print('Number of games matching Hindi ',len(matching))

['Norwegian', 'Dinka', 'Japanese']
Number of games matching Hindi  589


3\. Find the number of all distinct values of the *target* languages (i.e. the *target* field). What is the length of the resulting sequence?

In [22]:
# Started working:
print(datetime.now().time())

19:27:06.811894


In [23]:
# Query:
count = test_entries.map(lambda e: e["target"]).distinct().count()
print('Number of distinct targets ', count)

Number of distinct targets  68


4\. Return the top three games where the guessed language is correct (=target) ordered by language (ascending), then country (ascending), then date (ascending). What is the date of the 3rd item in the list? Enter it without quotes, for example 2013-10-02 

In [24]:
# Started working:
print(datetime.now().time())

19:27:07.615430


In [25]:
# Query:
correct = test_entries.filter(lambda e: e['guess'] == e['target'])
first_three = correct.sortBy(lambda e: (e['target'], e['country'], e['date'])).take(3)

print(json.dumps(first_three, indent=4))
print('date of the third item', first_three[-1]["date"])

[
    {
        "guess": "Albanian",
        "target": "Albanian",
        "country": "AP",
        "choices": [
            "Albanian",
            "Indonesian",
            "Kurdish"
        ],
        "sample": "fdf23d0a7063ba2fcef4b18eb7d57ad8",
        "date": "2013-09-03"
    },
    {
        "guess": "Albanian",
        "target": "Albanian",
        "country": "AR",
        "choices": [
            "Albanian",
            "Portuguese"
        ],
        "sample": "efcd813daec1c836d9f030b30caa07ce",
        "date": "2013-09-03"
    },
    {
        "guess": "Albanian",
        "target": "Albanian",
        "country": "AT",
        "choices": [
            "Albanian",
            "Tongan"
        ],
        "sample": "00b85faa8b878a14f8781be334deb137",
        "date": "2013-09-03"
    }
]
date of the third item 2013-09-03


5\. Aggregate all games by country and target language, counting the number of guessing games that were done for each pair (country, target). How many guesses have been made for Maltese from the Netherlands (NL, Maltese)?

In [26]:
# Started working:
print(datetime.now().time())

19:27:09.855480


In [27]:
# Query:
counts = test_entries.map(lambda e: (e["country"], e["target"])).countByValue()
# Print the first few items 
for k in list(counts)[:3]:
    print((k, counts[k]))
    
print('number of games matching (NL, Maltese) ', counts[('NL', 'Maltese')])

(('AU', 'Norwegian'), 84)
(('AU', 'Dinka'), 50)
(('AU', 'Samoan'), 52)
number of games matching (NL, Maltese)  3


6\. Among all the games where the guess was correct (=target), what is the percentage of cases where the first choice (among the array of possible answers) was the target?

In [28]:
# Started working:
print(datetime.now().time())

19:27:10.503525


In [29]:
# Query:
correct = test_entries.filter(lambda e: e["target"] == e["guess"])
percent = float(correct.filter(lambda e: e["target"] == e["choices"][0]).count()) / correct.count()
print('percentage ', percent)

percentage  0.3472529891715626


7\. For each target language, compute the percentage of successful guess games (i.e. *guess* == *target*) relative to all games for that target language, and display the pairs `(target_language, percentage)` in ascending order of the percentage. What is the second language in this list? 

In [30]:
# Started working:
print(datetime.now().time())

19:27:11.499253


In [31]:
# Query:
correct = test_entries.map(lambda e: (e["target"], e["target"] == e["guess"])).groupByKey()
outcomes = correct.mapValues(lambda e: float(list(e).count(True))/len(e)).sortBy(lambda e: e[1]).collect()
# Print the first few values
print(outcomes[:10])
# 
print('The second language on the list ', outcomes[1][0])

[('Dinka', 0.4194053208137715), ('Fijian', 0.426367461430575), ('Kannada', 0.434654919236417), ('Dari', 0.47293447293447294), ('Maori', 0.478502080443828), ('Tigrinya', 0.479951397326853), ('Maltese', 0.4824242424242424), ('Amharic', 0.4922077922077922), ('Sinhalese', 0.49937733499377335), ('Indonesian', 0.5107398568019093)]
The second language on the list  Fijian


8\. Group the games by the index of the correct answer in the choices array and output all counts. In how many games is the last choice the correct one (target)?

In [32]:
# Started working:
print(datetime.now().time())

19:27:12.282216


In [33]:
# Query:
print(test_entries.map(lambda e: e["choices"].index(e["target"])).countByValue())
count = test_entries.filter(lambda e: e["target"]==e["choices"][-1]).count()
print('Last was chosen count', count)

defaultdict(<class 'int'>, {2: 17688, 1: 36509, 3: 7837, 0: 33963, 4: 2746, 5: 881, 6: 259, 7: 83, 8: 20, 9: 11, 10: 3})
Last was chosen count 37176


9\. For the cases where both *guess* and *target* were `'French'`, what is the count of each possible number of choices (for example, if you have two games with 5 choices, report `(5,2)`)? What is the most frequent choice length in this list?

In [34]:
# Started working:
print(datetime.now().time())

19:27:15.396978


In [35]:
# Query:
correct_french_games = test_entries.filter(lambda e: e["target"] == "French").filter(lambda e: e["target"] == e["guess"])
choice_lengths = correct_french_games.map(lambda e: (len(e["choices"]), 1)).reduceByKey(lambda a, b: a + b)
choice_lengths = choice_lengths.sortBy(lambda e: e[1], ascending=False)
# print choice lengths ordered from most to least frequent 
print(choice_lengths.collect())
print('Most frequent choice length ',choice_lengths.first()[0])

[(2, 918), (3, 660), (4, 389), (5, 156), (6, 43), (7, 22), (8, 2), (9, 2), (10, 1)]
Most frequent choice length  2


10\. How many games were played on the last day? 

In [36]:
# Started working:
print(datetime.now().time())

19:27:18.050992


In [37]:
# Query 

latest = test_entries.map(lambda e: e["date"]).sortBy(lambda e: e, ascending=False).first()
count = test_entries.filter(lambda e: e["date"]==latest).count()
print('latest date ', latest)
print('games played on last date ', count)

latest date  2013-09-03
games played on last date  84426


### 4. Exercise

1. Why is Spark faster than Hadoop MapReduce?
1. Study the queries you wrote using Spark UI. Observe how many stages they have.
1. Which of the graphs below are DAGs?
<img src="https://bigdataforengineers2021.blob.core.windows.net/exercise09/dags.png" style="width: 700px;">

### Solution
1. There are many reasons for that. Firstly, Spark processes data in-memory while Hadoop MapReduce persists back to the disk after a map or reduce action. Secondly, Spark pipelines transformations to merge them into stages. 
2. Follow these [instructions](https://docs.microsoft.com/en-gb/azure/hdinsight/spark/apache-spark-job-debugging).
3. DAGs: 2,3,4,5,7
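
If you want to check graphs like these programmatically, one common approach is Kahn's algorithm: repeatedly remove nodes without incoming edges, and if every node can be removed, the graph has no directed cycle. The sketch below is an illustration only; the two example graphs are hypothetical and do not correspond to the numbered graphs in the picture.

```python
from collections import deque


def is_dag(num_nodes, edges):
    """Return True if the directed graph on nodes 0..num_nodes-1 has no cycle."""
    indegree = [0] * num_nodes
    adjacency = [[] for _ in range(num_nodes)]
    for source, target in edges:
        adjacency[source].append(target)
        indegree[target] += 1

    # Start with all nodes that have no incoming edge.
    queue = deque(node for node in range(num_nodes) if indegree[node] == 0)
    removed = 0
    while queue:
        node = queue.popleft()
        removed += 1
        for successor in adjacency[node]:
            indegree[successor] -= 1
            if indegree[successor] == 0:
                queue.append(successor)

    # Nodes on a cycle never reach in-degree zero, so they are never removed.
    return removed == num_nodes


acyclic = is_dag(3, [(0, 1), (1, 2), (0, 2)])  # hypothetical example graph
cyclic = is_dag(3, [(0, 1), (1, 2), (2, 0)])   # hypothetical example graph
print(acyclic, cyclic)
```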

### 5. True or False
Say if the following statements are *true* or *false*, and explain why.

1. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster.
1. Transformations construct a new RDD from a previous one and immediately calculate the result.
1. Spark's RDDs are by default recomputed each time you run an action on them.
1. After computing an RDD, Spark will store its contents in memory and reuse them in future actions.
1. When you derive new RDDs using transformations, Spark keeps track of the set of dependencies between different RDDs.

### Solution
1. True
1. False. Spark will not begin to execute until it sees an action.
1. True
1. False. Users have to use persist() or cache() for that.
1. True