# Exercise Week 12 - Rumble



# 1. Install Rumble

## 1. Setup the Spark cluster in Azure

### Create a cluster

1. Sign into the azure portal (portal.azure.com).
1. Search for "HDInsight clusters" using the search box at the top.
<img src="https://cloud.inf.ethz.ch/s/WxpMXB3Jz8SykMw/download" width="900">
1. Note that under the *Subscription* section, you might be prompted that the subscription is not registered:
<img src="https://cloud.inf.ethz.ch/s/gyTcQYKFCn3Yg6J/download" width="500">

  To fix this, follow the *Click here to register* link, and in the new page, search for *hdinsight*. Then select the *Microsoft.HDInsight* Provider and click the *Register* button.  
<img src="https://cloud.inf.ethz.ch/s/oHn9eyeZRP4LfZq/download" width="500">

1. Create a new resource group (for example: 'exercise08').
1. Give the cluster a unique name.
1. In the "Cluster Type" choose **Spark** and leave the default version as is. It is also indicated to use the **US West** region. 
1. Create a cluster login password (you can use https://www.random.org/strings/ for inspiration). Keep the password around as you will need it for later.
<img src="https://cloud.inf.ethz.ch/s/JY3DRLg8NLH559K/download" width="900">
1. Move to the *Storage* stage of the setup. Here, leave **Azure Storage** as the *Primary Storage Type*. For the *Primary Storage Account* you have the option to set up a new account. The *Container*'s name will be generated automatically, however make sure to remember it, or change it to something memorable, if you plan on finishing the exercises in more than one sitting.
<img src="https://cloud.inf.ethz.ch/s/NgtHE6iwSCZ8FQi/download" width="900">
1. Move to the *Configuration + Pricing* stage of the setup (skip *Security + networking*). Set up a Spark cluster which uses 2 **A5**  deployments as *Head* nodes and 2 **D12 v2** deployments for the *Worker* nodes. It should cost roughly 1.9 EUR/h. Note that if Azure allows you deploy more cores, then do so, by increasing the number of *Worker* nodes.
<img src="https://cloud.inf.ethz.ch/s/JpJEfjkZLPja5EK/download" width="900">
1. Move to the *Reivew + Create* stage of the setup, and click the **Create** button once validation succeeds.
1. Wait until your cluster is deployed (this can take up to 20 minutes).

<span style="color: red;">**Important:** Remember to **delete** the cluster once you are done. If you want to stop doing the exercises at any point, delete it and recreate it using the same container name as you used the first time, so that the resources are still there.</span>

<img src="https://cloud.inf.ethz.ch/s/2jLERoTD6q8nRMQ/download" width="900">

### Access your cluster

Make sure you can access your cluster (the NameNode) via SSH:

```
$ ssh <ssh_user_name>@<cluster_name>-ssh.azurehdinsight.net
```

If you are using Linux or MacOSX, you can use your standard terminal.
If you are using Windows you can use:
- Putty SSH Client and PSCP tool (get them at [here](http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html)).
- This Notebook server terminal (Click on the Jupyter logo and the goto New -> Terminal).
- Azure Cloud Terminal (see the HBase exercise sheet for details)

You can access cluster's YARN in your browser
```
 https://<cluster_name>.azurehdinsight.net/yarnui/hn/cluster
```


## Install Rumble

Then login to the shell and download the latest Rumble version:

```
wget https://github.com/RumbleDB/rumble/releases/download/v1.12/spark-rumble-1.12.0.jar
```

### HDInsight Shell

Unfortunately HDInsight will not provide us access to any other port than SSH.
Therefore the usual way to work with Rumble through HDInsight is through the shell. You can access the Rumble shell by running:
```
spark-submit spark-rumble-1.12.0.jar --shell yes
```

### SSH Forwarding

However for this sheet, we recommend to use SSH forwarding. For that, run the following command instead:

```
spark-submit spark-rumble-1.12.0.jar --server yes --port 8002
```

and then open another terminal on your local machine and run the following command to forward the server port 8002 to your localhost:8002. 

```
ssh -N -L 8002:localhost:8002 sshuser@[servername]-ssh.azurehdinsight.net
```

Now the port 8002 of your own machine (localhost:8002) will become a Rumble server for you to access locally.

# 2. Setup Rumble in Jupyter Notebook



### Install Jupyter Notebook



In order to execute the queries in this notebook, you need to [install](https://jupyter.org/install) jupyter notebook on your **own machine**, and then download this notebook and [run](https://jupyter.readthedocs.io/en/latest/running.html#running) it locally rather than rely on the colab.

To get started, you first need to execute the cell below to activate the Rumble magic (you do not need to understand what it does, this is just initialization Python code).

In [1]:
import requests
import json
import time
from IPython.core.magic import register_line_cell_magic

@register_line_cell_magic
def rumble(line, cell=None):
    if cell is None:
        data = line
    else:
        data = cell

    start = time.time()                                                         
    response = json.loads(requests.post(server, data=data).text)                   
    end = time.time()                                                              
    print("Took: %s ms" % (end - start))

    if 'warning' in response:
        print(json.dumps(response['warning']))
    if 'values' in response:
        for e in response['values']:
            print(json.dumps(e))
    elif 'error-message' in response:
        return response['error-message']
    else:
        return response

By default, this notebook uses a small public backend provided by us (very limited in CPU and memory, and with only the http scheme activated) that is sufficient to discover Rumble. This is new and experimental, so that it may occasionally break, especially if too many users use it at the same time, so please bear with us!

It is straightforward to execute your own Rumble server on your own Spark cluster (and then you can make full use of all the input file systems and file formats). In this case, just set the server the with your own hostname and port as follows.

In [2]:
server='http://localhost:9090/jsoniq' # 'http://public.rumbledb.org:9090/jsoniq' public server in case you get stuck

Now we are all set! You can now start reading and executing the JSONiq queries in this notebook as you go, and you can even edit them!

# 3. Rumble Sandbox

## JSON

As explained on the [official JSON Web site](http://www.json.org/), JSON is a lightweight data-interchange format designed for humans as well as for computers. It supports as values:
- objects (string-to-value maps)
- arrays (ordered sequences of values)
- strings
- numbers
- booleans (true, false)
- null

JSONiq provides declarative querying and updating capabilities on JSON data.

## Elevator Pitch

JSONiq is based on XQuery, which is a W3C standard (like XML and HTML). XQuery is a very powerful declarative language that originally manipulates XML data, but it turns out that it is also a very good fit for manipulating JSON natively.
JSONiq, since it extends XQuery, is a very powerful general-purpose declarative programming language. Our experience is that, for the same task, you will probably write about 80% less code compared to imperative languages like JavaScript, Python or Ruby. Additionally, you get the benefits of strong type checking without actually having to write type declarations.
Here is an appetizer before we start the tutorial from scratch.


In [3]:
%%rumble

let $stores :=
[
  { "store number" : 1, "state" : "MA" },
  { "store number" : 2, "state" : "MA" },
  { "store number" : 3, "state" : "CA" },
  { "store number" : 4, "state" : "CA" }
]
let $sales := [
   { "product" : "broiler", "store number" : 1, "quantity" : 20  },
   { "product" : "toaster", "store number" : 2, "quantity" : 100 },
   { "product" : "toaster", "store number" : 2, "quantity" : 50 },
   { "product" : "toaster", "store number" : 3, "quantity" : 50 },
   { "product" : "blender", "store number" : 3, "quantity" : 100 },
   { "product" : "blender", "store number" : 3, "quantity" : 150 },
   { "product" : "socks", "store number" : 1, "quantity" : 500 },
   { "product" : "socks", "store number" : 2, "quantity" : 10 },
   { "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
  for $store in $stores[], $sale in $sales[]
  where $store."store number" = $sale."store number"
  return {
    "nb" : $store."store number",
    "state" : $store.state,
    "sold" : $sale.product
  }
return [$join]

Took: 2.134681463241577 ms
[{"nb": 1, "state": "MA", "sold": "broiler"}, {"nb": 1, "state": "MA", "sold": "socks"}, {"nb": 2, "state": "MA", "sold": "toaster"}, {"nb": 2, "state": "MA", "sold": "toaster"}, {"nb": 2, "state": "MA", "sold": "socks"}, {"nb": 3, "state": "CA", "sold": "toaster"}, {"nb": 3, "state": "CA", "sold": "blender"}, {"nb": 3, "state": "CA", "sold": "blender"}, {"nb": 3, "state": "CA", "sold": "shirt"}]


## And here you go

### Actually, you already knew some JSONiq

The first thing you need to know is that a well-formed JSON document is a JSONiq expression as well.
This means that you can copy-and-paste any JSON document into a query. The following are JSONiq queries that are "idempotent" (they just output themselves):

In [4]:
%%rumble
{ "pi" : 3.14, "sq2" : 1.4 }

Took: 2.1431849002838135 ms
{"pi": 3.14, "sq2": 1.4}


In [5]:
%%rumble
[ 2, 3, 5, 7, 11, 13 ]

Took: 2.140516996383667 ms
[2, 3, 5, 7, 11, 13]


In [9]:
%%rumble
{
      "operations" : [
        { "binary" : [ "and", "or"] },
        { "unary" : ["not"] }
      ],
      "bits" : [
        0, 1
      ]
    }

Took: 2.097799301147461 ms
{"operations": [{"binary": ["and", "or"]}, {"unary": ["not"]}], "bits": [0, 1]}


In [10]:
%%rumble
[ { "Question" : "Ultimate" }, ["Life", "the universe", "and everything"] ]

Took: 2.110863447189331 ms
[{"Question": "Ultimate"}, ["Life", "the universe", "and everything"]]



This works with objects, arrays (even nested), strings, numbers, booleans, null.

It also works the other way round: if your query outputs an object or an array, you can use it as a JSON document. JSONiq is a declarative language. This means that you only need to say what you want - the compiler will take care of the how.

In the above queries, you are basically saying: I want to output this JSON content, and here it is.

## JSONiq basics

### The real JSONiq Hello, World!

Wondering what a hello world program looks like in JSONiq? Here it is:

In [11]:
%%rumble
"Hello, World!"

Took: 2.110699415206909 ms
"Hello, World!"


Not surprisingly, it outputs the string "Hello, World!".

### Numbers and arithmetic operations

Okay, so, now, you might be thinking: "What is the use of this language if it just outputs what I put in?" Of course, JSONiq can more than that. And still in a declarative way. Here is how it works with numbers:

In [12]:
%%rumble
2 + 2

Took: 2.0880212783813477 ms
4


In [15]:
%%rumble
 (38 + 2) div 2 + 11 * 2


Took: 2.090130090713501 ms
42


(mind the division operator which is the "div" keyword. The slash operator has different semantics).

Like JSON, JSONiq works with decimals and doubles:

In [16]:
%%rumble
 6.022e23 * 42

Took: 2.0985445976257324 ms
2.52924e+25


### Logical operations

JSONiq supports boolean operations.

In [17]:
%%rumble
true and false

Took: 2.1006457805633545 ms
false


In [18]:
%%rumble
(true or false) and (false or true)

Took: 2.088419198989868 ms
true


The unary not is also available:

In [19]:
%%rumble
not true

Took: 2.0799973011016846 ms
false


### Strings

JSONiq is capable of manipulating strings as well, using functions:


In [20]:
%%rumble
concat("Hello ", "Captain ", "Kirk")

Took: 2.260148286819458 ms
"Hello Captain Kirk"


In [21]:
%%rumble
substring("Mister Spock", 8, 5)

Took: 2.1299660205841064 ms
"Spock"


In [25]:
%%rumble
concat(1, 2, 3, 4, true)

Took: 2.1026906967163086 ms
"1234true"


In [28]:
%%rumble
10 || "/" || 6

Took: 2.0957229137420654 ms
"10/6"


JSONiq comes up with a rich string function library out of the box, inherited from its base language. These functions are listed [here](https://www.w3.org/TR/xpath-functions-30/) (actually, you will find many more for numbers, dates, etc).



### Sequences

Until now, we have only been working with single values (an object, an array, a number, a string, a boolean). JSONiq supports sequences of values. You can build a sequence using commas:


In [29]:
%%rumble
 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Took: 2.0990076065063477 ms
1
2
3
4
5
6
7
8
9
10


In [30]:
%%rumble
1, true, 4.2e1, "Life"

Took: 2.094522476196289 ms
1
true
42
"Life"


The "to" operator is very convenient, too:

In [31]:
%%rumble
 (1 to 100)

Took: 2.0548760890960693 ms
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100


Some functions even work on sequences:

In [32]:
%%rumble
sum(1 to 100)

Took: 2.088104486465454 ms
5050


In [35]:
%%rumble
string-join(("These", "are", "some", "words"), "-")

Took: 2.1134231090545654 ms
"These-are-some-words"


In [36]:
%%rumble
count(10 to 20)

Took: 2.068220615386963 ms
11


In [37]:
%%rumble
avg(1 to 100)

Took: 2.101912021636963 ms
50.5


Unlike arrays, sequences are flat. The sequence (3) is identical to the integer 3, and (1, (2, 3)) is identical to (1, 2, 3).

## A bit more in depth

### Variables

You can bind a sequence of values to a (dollar-prefixed) variable, like so:

In [38]:
%%rumble
let $x := "Bearing 3 1 4 Mark 5. "
return concat($x, "Engage!")

Took: 2.088350296020508 ms
"Bearing 3 1 4 Mark 5. Engage!"


In [39]:
%%rumble
let $x := ("Kirk", "Picard", "Sisko")
return string-join($x, " and ")

Took: 2.1182405948638916 ms
"Kirk and Picard and Sisko"


You can bind as many variables as you want:

In [40]:
%%rumble
let $x := 1
let $y := $x * 2
let $z := $y + $x
return ($x, $y, $z)

Took: 2.0896427631378174 ms
1
2
3


and even reuse the same name to hide formerly declared variables:

In [41]:
%%rumble
let $x := 1
let $x := $x + 2
let $x := $x + 3
return $x

Took: 2.104177713394165 ms
6


### Iteration

In a way very similar to let, you can iterate over a sequence of values with the "for" keyword. Instead of binding the entire sequence of the variable, it will bind each value of the sequence in turn to this variable.

In [42]:
%%rumble
for $i in 1 to 10
return $i * 2

Took: 2.1191978454589844 ms
2
4
6
8
10
12
14
16
18
20


More interestingly, you can combine fors and lets like so:

In [43]:
%%rumble
let $sequence := 1 to 10
for $value in $sequence
let $square := $value * 2
return $square

Took: 2.1041972637176514 ms
2
4
6
8
10
12
14
16
18
20


and even filter out some values:

In [44]:
%%rumble
let $sequence := 1 to 10
for $value in $sequence
let $square := $value * 2
where $square < 10
return $square

Took: 2.114689350128174 ms
2
4
6
8


Note that you can only iterate over sequences, not arrays. To iterate over an array, you can obtain the sequence of its values with the [] operator, like so:


In [47]:
%%rumble
[1, 2, 3][]

Took: 2.0980772972106934 ms
1
2
3


### Conditions

You can make the output depend on a condition with an if-then-else construct:

In [49]:
%%rumble
for $x in 1 to 10
return if ($x < 5) then $x else -$x

Took: 2.083033561706543 ms
1
2
3
4
-5
-6
-7
-8
-9
-10


Note that the else clause is required - however, it can be the empty sequence () which is often when you need if only the then clause is relevant to you.

### Composability of Expressions

Now that you know of a couple of elementary JSONiq expressions, you can combine them in more elaborate expressions. For example, you can put any sequence of values in an array:

In [50]:
%%rumble
[ 1 to 10 ]

Took: 2.097451686859131 ms
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]


Or you can dynamically compute the value of object pairs (or their key):

In [51]:
%%rumble
{
      "Greeting" : (let $d := "Mister Spock"
                    return concat("Hello, ", $d)),
      "Farewell" : string-join(("Live", "long", "and", "prosper"),
                               " ")
}

Took: 2.084519863128662 ms
{"Greeting": "Hello, Mister Spock", "Farewell": "Live long and prosper"}


You can dynamically generate object singletons (with a single pair):


In [52]:
%%rumble
{ concat("Integer ", 2) : 2 * 2 }

Took: 2.0732126235961914 ms
{"Integer 2": 4}


and then merge lots of them into a new object with the {| |} notation:

In [53]:
%%rumble
{|
    for $i in 1 to 10
    return { concat("Square of ", $i) : $i * $i }
|}

Took: 2.132176637649536 ms
{"Square of 1": 1, "Square of 2": 4, "Square of 3": 9, "Square of 4": 16, "Square of 5": 25, "Square of 6": 36, "Square of 7": 49, "Square of 8": 64, "Square of 9": 81, "Square of 10": 100}


## JSON Navigation

Up to now, you have learnt how to compose expressions so as to do some computations and to build objects and arrays. It also works the other way round: if you have some JSON data, you can access it and navigate.
All you need to know is: JSONiq views
an array as an ordered list of values,
an object as a set of name/value pairs


### Objects

You can use the dot operator to retrieve the value associated with a key. Quotes are optional, except if the key has special characters such as spaces. It will return the value associated thereto:

In [64]:
%%rumble
let $person := {
    "first name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return $person."first name"

Took: 2.0504555702209473 ms
"Sarah"


You can also ask for all keys in an object:

In [65]:
%%rumble
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return { "keys" : [ keys($person)] }

Took: 2.0966694355010986 ms
{"keys": ["name", "age", "gender", "friends"]}


### Arrays

The [[]] operator retrieves the entry at the given position:

In [67]:
%%rumble
let $friends := [ "Jim", "Mary", "Jennifer"]
return $friends[[1+1]]

Took: 2.118917226791382 ms
"Mary"


It is also possible to get the size of an array:

In [71]:
%%rumble
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return { "how many friends" : size($person.friends) }

Took: 2.0887184143066406 ms
{"how many friends": 3}


Finally, the [] operator returns all elements in an array, as a sequence:

In [72]:
%%rumble
let $person := {
    "name" : "Sarah",
    "age" : 13,
    "gender" : "female",
    "friends" : [ "Jim", "Mary", "Jennifer"]
}
return $person.friends[]

Took: 2.0913147926330566 ms
"Jim"
"Mary"
"Jennifer"


### Relational Algebra

Do you remember SQL's SELECT FROM WHERE statements? JSONiq inherits selection, projection and join capability from XQuery, too.

In [73]:
%%rumble
let $stores :=
[
    { "store number" : 1, "state" : "MA" },
    { "store number" : 2, "state" : "MA" },
    { "store number" : 3, "state" : "CA" },
    { "store number" : 4, "state" : "CA" }
]
let $sales := [
    { "product" : "broiler", "store number" : 1, "quantity" : 20  },
    { "product" : "toaster", "store number" : 2, "quantity" : 100 },
    { "product" : "toaster", "store number" : 2, "quantity" : 50 },
    { "product" : "toaster", "store number" : 3, "quantity" : 50 },
    { "product" : "blender", "store number" : 3, "quantity" : 100 },
    { "product" : "blender", "store number" : 3, "quantity" : 150 },
    { "product" : "socks", "store number" : 1, "quantity" : 500 },
    { "product" : "socks", "store number" : 2, "quantity" : 10 },
    { "product" : "shirt", "store number" : 3, "quantity" : 10 }
]
let $join :=
    for $store in $stores[], $sale in $sales[]
    where $store."store number" = $sale."store number"
    return {
        "nb" : $store."store number",
        "state" : $store.state,
        "sold" : $sale.product
    }
return [$join]

Took: 2.1122922897338867 ms
[{"nb": 1, "state": "MA", "sold": "broiler"}, {"nb": 1, "state": "MA", "sold": "socks"}, {"nb": 2, "state": "MA", "sold": "toaster"}, {"nb": 2, "state": "MA", "sold": "toaster"}, {"nb": 2, "state": "MA", "sold": "socks"}, {"nb": 3, "state": "CA", "sold": "toaster"}, {"nb": 3, "state": "CA", "sold": "blender"}, {"nb": 3, "state": "CA", "sold": "blender"}, {"nb": 3, "state": "CA", "sold": "shirt"}]


### Access datasets

Rumble can read input from many file systems and many file formats. If you are using our backend, you can only use json-doc() with any URI pointing to a JSON file and navigate it as you see fit. 

In [74]:
%%rumble
json-doc("Put any HTTP URL pointing to a JSON document here!").foo[[1]].bar.foobar[]

Took: 2.109262228012085 ms


'There was an error.\n\nCode: [FODC0002] (this code can be looked up in the documentation and specifications).\n\nLocation information: file:/C:/Users/Ivan/:LINE:1:COLUMN:0:\n\nMalformed URI: Put any HTTP URL pointing to a JSON document here! Cause: Illegal character in path at index 3: Put any HTTP URL pointing to a JSON document here!'

If you are using your own Rumble server on your cluster, you can also use any other function and scheme.

In [75]:
%%rumble
json-file("put the path to a JSON lines file here. This will only work against your own Rumble backend and Spark cluster, though.")

Took: 2.1068294048309326 ms


'There was an error.\n\nCode: [FODC0002] (this code can be looked up in the documentation and specifications).\n\nLocation information: file:/C:/Users/Ivan/:LINE:1:COLUMN:0:\n\nFile file:/C:/Users/Ivan/put%20the%20path%20to%20a%20JSON%20lines%20file%20here.%20This%20will%20only%20work%20against%20your%20own%20Rumble%20backend%20and%20Spark%20cluster,%20though. not found.'

# 4. The Great Language Game

This week you will be using again the [language confusion dataset](http://lars.yencken.org/datasets/languagegame/). You will write queries with Rumble. You will have to submit the results of this exercise to Moodle to obtain the weekly bonus. You will need these things:
- The query you wrote
- Something related to its output (which you will be graded on)
- The time it took you to run it

The execution time of the queries will be reported by Rumble.

Download and decompress the dataset in the same folder as `spark-rumble-1.12.0.jar` with the following:
```
wget http://data.greatlanguagegame.com.s3.amazonaws.com/confusion-2014-03-02.tbz2
tar -jxvf confusion-2014-03-02.tbz2
```



Afterwards upload the data into HDFS

```
hadoop dfs -copyFromLocal confusion-2014-03-02 /tmp/
```

## 4.1 Query the data

You can read data from a json file with `json-file`. For example, the following query will read and print the entries in the confusion dataset:




In [77]:
%%rumble
for $i in json-file("/tmp/confusion-2014-03-02/confusion-2014-03-02.json")
return $i

Took: 2.1359879970550537 ms


'There was an error.\n\nCode: [FODC0002] (this code can be looked up in the documentation and specifications).\n\nLocation information: file:/C:/Users/Ivan/:LINE:1:COLUMN:10:\n\nFile file:/tmp/confusion-2014-03-02/confusion-2014-03-02.json not found.'

Note that you have to press enter once at the end of each line and two more times to execute the query if you are using the **shell**. Your jsoniq shell should look like this:
```
jiqs$ for $i in json-file("confusion-2014-03-02/confusion-2014-03-02.json")
>>> return $i
>>> 
>>> 
```



After the results of your query are printed, Rumble will report the execution runtime in milliseconds:
```
Took: 62.02618598937988 ms
```

In the `json-file` method you can optionally specify the number of partitions, which may allow your query to be parallelized and executed faster. For example:


In [None]:
%%rumble
for $i in json-file("/tmp/confusion-2014-03-02/confusion-2014-03-02.json", 10)
return $i

## 4.2 SQL to Rumble

The following examples, show how SQL queries can be converted to Sparksoniq queries. Assume that the dataset is accessible with SQL through the table "entries".





### 4.2.1 Get all games played from Switzerland


```sql
SELECT *
FROM entries
WHERE country == "CH"
```


In [None]:
%%rumble
# Your code here

### 4.2.2 Get all games played from Switzerland, where the correct answer (target) was "German"
```sql
SELECT *
FROM entries
WHERE country == "CH" AND target == "German"
```




In [None]:
%%rumble
# Your code here

### 4.2.3 Get the top 5 games played from Switzerland, where the correct answer (target) was "German"
```sql
SELECT *
FROM entries
WHERE country == "CH" AND target == "German"
LIMIT 5
```




In [None]:
%%rumble
# Your code here

### 4.2.4 Get all games played from Switzerland, where the correct answer (target) was "German", order them by date (ascending), and return the top 5 rows.
```sql
SELECT *
FROM entries
WHERE country == "CH" AND target == "German"
ORDER BY date ASC
LIMIT 5
```




In [None]:
%%rumble
# Your code here

### 4.2.5 Get all games played from Switzerland, where the correct answer (target) was "German", group them by date, and return for each different date the number of games played.

```sql
SELECT date, COUNT(*) AS num_games
FROM entries
WHERE country == "CH" AND target == "German"
GROUP BY date
```


In [None]:
%%rumble
# Your code here

### 4.2.6 Get all games played from Switzerland, group them by date and target, and return for each different date and target the number of games played.


NOTE: Rumble has some reserved keywords, for example `date`. If you try to create a variable `$date`, you may get an error, such as `no viable alternative at input 'date'`.

```sql
SELECT date, target, COUNT(*) AS num_games
FROM entries
WHERE country == "CH"
GROUP BY date, target
```




In [None]:
%%rumble
# Your code here

### 4.2.7 For all games played from Switzerland, return the distinct targets of those games.

```sql
SELECT DISTINCT(target)
FROM entries
WHERE country == "CH"
```




In [None]:
%%rumble
# Your code here

### 4.2.8 For all games played from Switzerland, get the distinct targets of those games, and return the index of "German" in the list of distinct targets.




In [None]:
%%rumble
# Your code here

### 4.2.9 Count the number of games played from Switzerland (without any grouping).


NOTE: `distinct-values` and `index-of` work on "sequences". The method `json-file` returns a sequence. If you have an array on which you want to apply `distinct-values` and `index-of`, you must first convert it to a sequence. This can be done with `[]`. For example, if you have an array called `arr`, you can find its distinct values with `distinct-values(arr[])`

```sql
SELECT COUNT(*) AS count
FROM entries
WHERE country == "CH"
```


In [None]:
%%rumble
# Your code here

If in your query you want to join 2 (or more) sequences (results of `json-file` or subqueries), you can do it in the following way:
```
let $seq1 := ...
let $seq2 := ...
for $i in $seq1, $j in $seq2
where $i.attr1 eq $j.attr2
...
```

## 4.3 More queries

Try writing a few more queries:
- List all chosen answers to games where the guessed language is correct (=target).

In [None]:
%%rumble
# Your code here

- Count the games where the index of the correct answer in the choices array is 2 (as returned by the index-of method).

In [None]:
%%rumble
# Your code here

- Return all games played on February 3rd 2014.

In [None]:
%%rumble
# Your code here

# 5. More nestedness
## 5.1 Create Nestedness
You may remember in the exercise of Spark Dataframes & Spark SQL, we mentioned two methods <font face="courier">collect_set/collect_list</font> for creating arrays. In JSONiq, this kind of things become even simpler because JSONiq natively supports JSON, so we can directly create arrays via adding square brackets (<font face="courier">[]</font>) and even without any group by operations. 

For example, if we want to know the list of date at which "Fijian" was used as the target, we can write a simple JSONiq query:

In [None]:
%%rumble
let $dateSeq := for $i in json-file("/tmp/confusion-2014-03-02/confusion-2014-03-02.json", 10)
where $i.target eq "Fijian"
return $i.date
return [$dateSeq]

The above query is basically a counterpart of a Spark Dataframe query with <font face="courier">collect_list</font>. If we want to imitate the behavior of <font face="courier">collect_set</font>, which means we want the result array to be de-duplicated, we can just resort to <font face="courier">distinct-values</font>:

In [None]:
%%rumble
let $dateSeq := for $i in json-file("/tmp/confusion-2014-03-02/confusion-2014-03-02.json", 10)
where $i.target eq "Fijian"
return $i.date
return [distinct-values($dateSeq)]

Now what if we want to know for each different language, the de-duplicated list of date at which it was used as a target? We may need group by again. Try to come up with the query on your own. What might be the difference in the query with and without group by?

**Note:** from here we use a truncated dataset to run the query because the query on the original dataset might take very long and consume a humongous amount of memory. The most important thing is not about the answer, but to come up with proper queries.

In [None]:
%%rumble
let $truncated := for $i in json-file("/tmp/confusion-2014-03-02/confusion-2014-03-02.json", 10) 
  count $c 
  where $c <= 100000 
  return $i
for $i in $truncated
# complete the query

Obviously, unlike <font face="courier">collect_set/collect_list</font> which only accept one column and create arrays on that very column, JSONiq can create arrays on arbitrary things. For example, if we want to highlight the information of date, we can create a new dataset that shows the game info for each date:

In [None]:
%%rumble
let $truncated := for $i in json-file("/tmp/confusion-2014-03-02/confusion-2014-03-02.json", 10) 
  count $c 
  where $c <= 100000 
  return $i
let $newDataset := for $i in $truncated
  group by $d := $i.date
  return {"date": $d, "info": [$i]}
return $newDataset

Now we have a more nested dataset! We can try to redo some of the exercises above with this new dataset. For example, get all games played in Switzerland:

In [None]:
%%rumble
let $truncated := for $i in json-file("/tmp/confusion-2014-03-02/confusion-2014-03-02.json", 10) 
  count $c 
  where $c <= 100000 
  return $i
let $newDataset := for $i in $truncated
  group by $d := $i.date
  return {"date": $d, "info": [$i]}
# complete the query

Try another one: get the count of games played in Switzerland:

In [None]:
%%rumble
let $truncated := for $i in json-file("/tmp/confusion-2014-03-02/confusion-2014-03-02.json", 10) 
  count $c 
  where $c <= 100000 
  return $i
let $newDataset := for $i in $truncated
  group by $d := $i.date
  return {"date": $d, "info": [$i]}
# complete the query


Feel free to try out more questions on your own!

## 5.2 Git-archive dataset
Now let's get into the mess of the real world. We are going to explore the git-archive dataset to handle some properly messy data, which is very challenging if you use Spark. 

For getting the dataset, just run:
```
wget https://polybox.ethz.ch/index.php/s/HVWlvJAXVkQ05cw/download -O git-archive.json
```
and upload it to hdfs if you are using the cluster:
```
hadoop dfs -copyFromLocal git-archive.json /tmp/
```
Have a look at what the dataset looks like:

In [None]:
%%rumble
for $i in json-file("/tmp/git-archive.json", 10)
count $c
where $c <= 1
return $i

What a mess, isn't it? Anyway, let's try to write some challenging queries, since you've already mastered those easy ones with the language game dataset...

1. What is the number of distinct author names that are part of a push event (i.e., an event with the type PushEvent)?

In [None]:
%%rumble
# Your code here

2. What is the name of the repository with the highest number of push events (i.e., events with the type PushEvent), and how many of these push events occurred in this repository?

In [None]:
%%rumble
# Your code here

# Moodle Graded Exercise

And now to the actual Moodle queries. Here we still use the language game dataset (please use the **original** dataset for these questions).


1\. return the number of games where the player's guess is "Spanish" and the correct answer (target) is in the first place of the choices array

In [None]:
%%rumble 
# Your code here

2\. return the rate of correctness for all games with target "Mandarin" (write the fraction rounding to 4 decimals (eg. 0.3323))

In [None]:
%%rumble 
# Your code here

3\. return the top three countries that had the largest number of games with correct guesses

In [None]:
%%rumble 
# Your code here

4\. For the language that appeared in the choices array most frequently, how many times did it appear in the choices array?

In [None]:
%%rumble 
# Your code here

5\. Sort the languages by decreasing overall percentage of correct guesses and return the top 3 languages.

In [None]:
%%rumble 
# Your code here