# <center>Big Data &ndash; Exercises</center>
## <center>Fall 2021 &ndash; Week 9 &ndash; ETH Zurich</center>
## <center>Spark Dataframes and Spark SQL, Moodle exercise</center>

# Preparation for the Moodle exercise in Spark

In this Jupyter notebook we preprocess the dataset that is used in this week's graded exercise.
It is the same language game dataset as in exercise08.

1. Change to the exercise09 repository

2. Start Docker: <br>
```docker-compose up -d```

3. Get the data:
Follow the procedure described below. The dataset can be found here: http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2. 

More specifically, do the following:
- download the data      :<br> ```wget http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2```
- extract the data       :<br> ```tar -jxvf confusion-2014-03-02.tbz2```

4. Copy the data to Docker :<br> ```docker cp confusion-2014-03-02/confusion-2014-03-02.json jupyter:/home/jovyan/work``` <br>
(The data needs to be copied to Docker only once; this might take 1-2 minutes.)

## More Info about the data
You can find more information about the dataset (including the schema and examples) at this link: http://lars.yencken.org/datasets/languagegame/

## Instructions:

For every query we ask you for three quantities: the query itself, the result of the query, and the productivity time, that is, the development time of the query (the time elapsed between the moment you start writing the query and the moment the correct, final query is ready). Note that the time part of every question is optional and not graded. To make time recording easier, we created two functions that do it automatically. Run the cell below to import the functions into the current notebook. Before each query there is a ```start_exercise()``` cell that you have to run to start the time recording. After you have finished your query and are sure about the answer, run the ```finish_exercise()``` cell to get the time measurement. 

In [None]:
import time

def start_exercise():
    global last
    last = time.time()
    
def finish_exercise():
    global last
    print("This exercise took {0}s".format(int(time.time()-last)))
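
The two helpers above keep the start time in a module-level global, so `finish_exercise()` always measures from the most recent `start_exercise()` call. As a minimal sketch of the same idea without a global, the `exercise_timer` context manager below times whatever runs inside its `with` block. The name `exercise_timer` and its `label` parameter are hypothetical and not part of the course material; `time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock.

```python
import time
from contextlib import contextmanager

@contextmanager
def exercise_timer(label="This exercise"):
    """Hypothetical helper: print how long the body of the with-block took."""
    start = time.perf_counter()
    result = {"seconds": None}  # filled in when the block exits
    try:
        yield result
    finally:
        result["seconds"] = time.perf_counter() - start
        print("{0} took {1}s".format(label, int(result["seconds"])))

# Usage: wrap the work on one query in a with-block.
with exercise_timer("Assignment 1") as timing:
    sum(range(1000))  # stand-in for developing and running a query
```

This only works when the whole attempt fits in a single cell; the `start_exercise()` / `finish_exercise()` pair remains the better fit when the work on a query spans several cells.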

## <center>1. Spark Dataframes</center>

Write queries for the same questions as last week, but this time using Spark DataFrame operations (loading the data will take a minute).

### 1.0. Data preprocessing

In [None]:
import json
from pyspark.sql import SparkSession
from pyspark import SparkConf

spark = SparkSession.builder.master('local').getOrCreate()
sc = spark.sparkContext

path = "confusion-2014-03-02.json"
dataset = spark.read.json(path).cache()

In [None]:
#test it out
dataset.limit(3).show()

## Assignment 1
Find the number of games where the guessed language is correct (meaning equal to the target one) and that language is Maori.

In [None]:
start_exercise()

In [None]:
#Your code here
dataset.filter(dataset['target'] == dataset['guess']).filter(dataset['target'] == 'Maori').count()

In [None]:
finish_exercise()

## Assignment 2
Return the number of distinct "target" languages.

In [None]:
start_exercise()

In [None]:
#Your code here
dataset.select("target").distinct().count()

In [None]:
finish_exercise()

## Assignment 3
Return the sample IDs (i.e., the *sample* field) of the bottom three games where the guessed language is incorrect (not equal to the target one) ordered by date (ascending), then by language (descending), then by country (ascending).
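
The ordering requested above mixes ascending and descending keys, which is easy to get wrong. The following is a small pure-Python sanity check on hand-made records, not a Spark solution: the field names mirror the dataset, the values are invented, and "language" is assumed to mean the `target` field. It shows what the first and last rows of such an ordering look like, which can help when checking a Spark query on a small sample.

```python
# Hand-made records; the field names mirror the dataset, the values are invented.
games = [
    {"sample": "a", "date": "2013-08-19", "target": "French", "country": "CH"},
    {"sample": "b", "date": "2013-08-19", "target": "Maori",  "country": "NZ"},
    {"sample": "c", "date": "2013-08-19", "target": "Maori",  "country": "AU"},
    {"sample": "d", "date": "2013-08-20", "target": "Danish", "country": "DK"},
]

# Python's sort is stable, so sort by the least significant key first.
ordered = sorted(games, key=lambda g: g["country"])                 # country ascending
ordered = sorted(ordered, key=lambda g: g["target"], reverse=True)  # language descending
ordered = sorted(ordered, key=lambda g: g["date"])                  # date ascending

print([g["sample"] for g in ordered])

# The "bottom three" are the last three rows of that ordering.
bottom_three = [g["sample"] for g in ordered[-3:]]
print(bottom_three)
```

Equivalently, reversing the direction of every sort key and taking the first three rows yields the same three games, in reverse order.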

In [None]:
start_exercise()

In [None]:
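#Your code here
# Keep incorrect guesses only. Sorting in the reverse of the requested order (date asc,
# language desc, country asc; "language" taken as target) puts the bottom three first.
dataset.filter(dataset['target'] != dataset['guess']).orderBy(
    dataset['date'].desc(), dataset['target'].asc(), dataset['country'].desc()).select("sample").take(3)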


In [None]:
finish_exercise()

## Assignment 4
Aggregate all games by country and target language, counting the number of guesses for each group and return the frequencies of the three most frequent country/language combinations.
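
To make the expected shape of the answer concrete, here is a minimal pure-Python sketch of the same group-count-sort pattern on invented (country, target) pairs. It is only an illustration of the aggregation logic, not a Spark query; the data is made up.

```python
from collections import Counter

# Invented (country, target language) pairs, one per game.
games = (
    [("US", "French")] * 4
    + [("DE", "German")] * 3
    + [("CH", "Italian")] * 2
    + [("NZ", "Maori")]
)

# Group by the (country, target) pair and count the games in each group.
counts = Counter(games)

# Frequencies of the three most frequent combinations, largest first.
top_three = [n for _, n in counts.most_common(3)]
print(top_three)
```

Only the frequencies are returned, which matches what the question asks for; the group keys are dropped in the last step.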

In [None]:
start_exercise()

In [None]:
from pyspark.sql.functions import col, asc, desc

#Your code here
dataset.select("country", "target").groupBy(
    "country", "target").count().orderBy(col("count").desc()).take(3)


In [None]:
finish_exercise()

## Assignment 5
Find the fraction (between 0 and 1) of games where the answer was correct and the correct guess was the second choice in the array of possible answers.

Please report the fraction rounded to 4 decimals (e.g. 0.3320).
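
Two details are easy to miss here: the second choice sits at index 1 because array indexing is 0-based (this also holds for bracket indexing on Spark array columns), and rounding alone may drop trailing zeros when the value is printed. The sketch below illustrates both on hand-made records in plain Python; the records are invented and only mirror the dataset's field names.

```python
# Hand-made records (invented values); `choices` lists the options shown to the player.
games = [
    {"target": "Maori",  "guess": "Maori",  "choices": ["Thai", "Maori"]},
    {"target": "French", "guess": "French", "choices": ["French", "Dutch"]},
    {"target": "Danish", "guess": "Czech",  "choices": ["Czech", "Danish"]},
]

# Index 1 is the second entry because indexing is 0-based.
hits = sum(1 for g in games
           if g["target"] == g["guess"] and g["target"] == g["choices"][1])
fraction = hits / len(games)

# Format with exactly four decimals, keeping trailing zeros.
formatted = "{0:.4f}".format(fraction)
print(formatted)
```

String formatting is used for the final step because `round(0.332, 4)` displays as `0.332`, whereas the requested form is `0.3320`.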

In [None]:
start_exercise()

In [None]:
#Your code here
# Correct guesses whose target is the second entry of `choices` (index 1, 0-based)
correct = dataset.filter(dataset['target'] == dataset['guess']).filter(
    dataset['target'] == dataset['choices'][1]).count()
total = dataset.count()

round(correct / total, 4)


In [None]:
finish_exercise()

## Assignment 6
Sort the languages by increasing overall percentage of correct guesses and return the first three languages.
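
One subtlety with this question is what happens to a language that was never guessed correctly: if the per-language totals are combined with the per-language correct counts via an inner join, such a language disappears from the result instead of ranking first with 0%. The pure-Python sketch below shows the intended semantics on invented (target, guess) pairs; it is an illustration of the logic, not a Spark query.

```python
from collections import Counter

# Invented (target, guess) pairs, one per game.
games = [("Maori", "Maori"), ("Maori", "Thai"), ("Fijian", "Thai"),
         ("Shona", "Shona"), ("Shona", "Shona"), ("Shona", "Zulu"),
         ("Fijian", "Hindi")]

total = Counter(target for target, _ in games)
correct = Counter(target for target, guess in games if target == guess)

# Counter returns 0 for missing keys, so languages that were never guessed
# correctly are kept with an accuracy of 0.0 instead of being dropped.
accuracy = {lang: correct[lang] / total[lang] for lang in total}

# Sort by increasing accuracy; the language name breaks ties deterministically.
ranking = sorted(accuracy, key=lambda lang: (accuracy[lang], lang))
print(ranking[:3])
```

In Spark, the analogous safeguard would be a left outer join from the totals to the correct counts, with missing counts replaced by 0. Whether it matters depends on whether the dataset actually contains such a language.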

In [None]:
start_exercise()

In [None]:
#Your code here
allz = dataset.groupBy('target').count().withColumnRenamed("count", "allz")
correct = dataset.filter(dataset['target'] == dataset['guess']).groupBy('target').count()
joined = allz.join(correct, ['target'])

joined.select((joined['count'] / joined['allz']).alias('p'), "target").sort('p').take(3)

In [None]:
finish_exercise()

## Assignment 7
Return the number of games played on the first day.
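
The question breaks down into two steps: find the earliest date, then count the games on that date. The sketch below shows this in plain Python on invented values, assuming that dates are stored as ISO `YYYY-MM-DD` strings, so that the lexicographic minimum is also the chronological minimum. That assumption should be checked against the dataset schema.

```python
# Invented dates, assumed to be ISO-formatted strings (YYYY-MM-DD).
dates = ["2013-08-20", "2013-08-19", "2013-08-19", "2013-09-01"]

# Step 1: the earliest date. For ISO strings, string order equals date order.
first_day = min(dates)

# Step 2: count the games played on that date.
games_on_first_day = sum(1 for d in dates if d == first_day)
print(first_day, games_on_first_day)
```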

In [None]:
start_exercise()

In [None]:
#Your code here
first_day = dataset.select(dataset['date']).sort(dataset['date'].asc()).take(1)[0]['date']
dataset.filter(dataset['date'] == first_day).count()

In [None]:
finish_exercise()

## <center>2. Spark SQL</center>

Write Spark SQL queries for the same questions as earlier.

### 2.0. Data preprocessing

In [None]:
!pip install sparksql-magic

In [None]:
%load_ext sparksql_magic

In [None]:
path = "confusion-2014-03-02.json"
dataset = spark.read.json(path).cache()
# registerTempTable is deprecated; createOrReplaceTempView is its replacement
dataset.createOrReplaceTempView("dataset")

In [None]:
%%sparksql
-- test it out
SELECT *
FROM dataset
LIMIT 3

## Assignment 1
Find the number of games where the guessed language is correct (meaning equal to the target one) and that language is Maori.


In [None]:
start_exercise()

In [85]:
%%sparksql
SELECT count(*) FROM dataset WHERE target = guess and target = 'Maori'

0
count(1)
74810


In [86]:
finish_exercise()

This exercise took 136s


## Assignment 2
Return the number of distinct "target" languages.

In [89]:
start_exercise()

In [87]:
%%sparksql
SELECT count(distinct target) FROM dataset

0
count(DISTINCT target)
78


In [90]:
finish_exercise()

This exercise took 12s


## Assignment 3
Return the sample IDs (i.e., the *sample* field) of the bottom three games where the guessed language is incorrect (not equal to the target one) ordered by date (ascending), then by language (descending), then by country (ascending).



In [92]:
start_exercise()

In [93]:
%%sparksql
SELECT sample from dataset WHERE target = guess ORDER BY target ASC, country ASC, date ASC LIMIT 3


0
sample
00b85faa8b878a14f8781be334deb137
efcd813daec1c836d9f030b30caa07ce
efcd813daec1c836d9f030b30caa07ce


In [94]:
finish_exercise()

This exercise took 59s


## Assignment 4
Aggregate all games by country and target language, counting the number of guesses for each group and return the frequencies of the three most frequent country/language combinations.



In [95]:
start_exercise()

In [96]:
%%sparksql
select count(*) as count from dataset group by target, country order by count desc limit 3



0
count
112934
112007
110919


In [97]:
finish_exercise()

This exercise took 121s


## Assignment 5
Find the fraction (between 0 and 1) of games where the answer was correct and the correct guess was the second choice in the array of possible answers.

Please report the fraction rounded to 4 decimals (e.g. 0.3320).


In [106]:
start_exercise()

In [122]:
%%sparksql
SELECT count(*) /
(SELECT count(*) as count_all
 FROM dataset)
FROM dataset
WHERE target = guess and target = choices[1]


0
(count(1) / scalarsubquery())
0.26400199040361877


In [123]:
finish_exercise()

This exercise took 273s


## Assignment 6
Sort the languages by increasing overall percentage of correct guesses and return the first three languages.

In [124]:
start_exercise()

In [142]:
%%sparksql
SELECT target
FROM
(
    SELECT target, count(*) as count_a
    FROM dataset
    GROUP BY target
)
JOIN
(
    SELECT target, count(*) as count_c
    FROM dataset
    WHERE target = guess
    GROUP BY target
)
 USING(target)
ORDER BY count_c/count_a ASC
LIMIT 3

0
target
Kannada
Fijian
Shona


In [143]:
finish_exercise()

This exercise took 501s


## Assignment 7
Return the number of games played on the first day.

In [None]:
start_exercise()

In [145]:
%%sparksql
SELECT count(*) FROM dataset WHERE date = (SELECT min(date) FROM dataset)

0
count(1)
163


In [146]:
finish_exercise()

This exercise took 598s


