### Origin of the data

The queries in the following sections use the orders dataset. The dataset follows this schema:

#### 1. Find the number of distinct products

In [None]:
orders_df.select(explode("items").alias("i")).select("i.product").distinct().count()
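
The queries on the orders dataset rely on `explode`, which emits one row per element of the `items` array. As a rough illustration, the plain-Python sketch below mimics that flattening on a tiny hand-made list of orders. The sample records and the helper name `explode_items` are hypothetical, and the field names (`order_id`, `items`, `product`, `quantity`, `price`) are assumed from the queries in this section rather than taken from the actual schema.

```python
# Hypothetical sample shaped like the orders data (field names assumed from the queries).
sample_orders = [
    {"order_id": 1, "items": [{"product": "pen", "quantity": 2, "price": 1.5},
                              {"product": "ink", "quantity": 1, "price": 4.0}]},
    {"order_id": 2, "items": [{"product": "pen", "quantity": 4, "price": 1.5}]},
]


def explode_items(orders):
    """Yield one flat record per item, roughly like select("order_id", explode("items"))."""
    for order in orders:
        for item in order["items"]:
            yield {"order_id": order["order_id"], **item}


rows = list(explode_items(sample_orders))
# Counting distinct products then reduces to building a set over the flattened rows.
distinct_products = {row["product"] for row in rows}
print(len(rows), len(distinct_products))
```

Two orders with three items in total become three flat rows, which is why the Spark queries can group and aggregate on `product` or `order_id` after the `explode` step.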

#### 2. Find the average quantity at which each product is purchased. Only show the top 10 products by average quantity.


In [None]:
orders_df.select(explode("items").alias("i")).select("i.product", "i.quantity") \
    .groupBy("product").avg("quantity").orderBy(desc("avg(quantity)")).take(10)

#### 3. Find the most expensive order

In [None]:
exploded_df = orders_df.select("order_id", explode("items").alias("i")).select("order_id", "i.price", "i.quantity")
exploded_df.select(exploded_df["order_id"], (exploded_df["price"] * exploded_df["quantity"]).alias("p")) \
    .groupBy("order_id").sum("p").orderBy(desc("sum(p)")).take(1)

### Origin of the data

The next queries are run on the following dataset:

https://f003.backblazeb2.com/file/larsyencken-eu-public/greatlanguagegame/confusion-2014-03-02.tbz2

It follows the schema:

#### 4. Find the number of games where the guessed language and the target language are both Maltese.

In [None]:
dataset.filter(dataset["guess"] == "Maltese").filter(dataset["target"] == "Maltese").count()

#### 5. Return the number of distinct "target" languages.

In [None]:
dataset.select("target").distinct().count()

#### 6. Return the sample IDs (i.e., the "sample" field) of the first three games where the guessed language is correct (equal to the target one) ordered by date (descending), then by language (ascending), then by country (descending). 

In [None]:
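# Keep only correct guesses, sort, then project the "sample" field the task asks for.
dataset.filter(dataset["target"] == dataset["guess"]) \
    .orderBy(dataset["date"].desc(), dataset["guess"].asc(), dataset["country"].desc()) \
    .select("sample").take(3)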
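
The mixed ascending/descending ordering can be checked on a small sample in plain Python. The sketch below is only an illustration: the `games` records, their values, and the ISO-style date strings are made up, and only the field names are assumed from the query.

```python
# Hypothetical records; only the field names are assumed from the query.
games = [
    {"sample": "a", "date": "2013-09-01", "guess": "French", "country": "IT", "target": "French"},
    {"sample": "b", "date": "2013-09-02", "guess": "Maltese", "country": "AU", "target": "Maltese"},
    {"sample": "c", "date": "2013-09-02", "guess": "French", "country": "AU", "target": "French"},
    {"sample": "d", "date": "2013-09-02", "guess": "French", "country": "US", "target": "French"},
    {"sample": "e", "date": "2013-09-03", "guess": "Dutch", "country": "NL", "target": "German"},
]

# Keep only games where the guess matches the target.
correct = [g for g in games if g["guess"] == g["target"]]

# Python's sort is stable, so apply the least significant key first.
correct.sort(key=lambda g: g["country"], reverse=True)  # country, descending
correct.sort(key=lambda g: g["guess"])                  # language, ascending
correct.sort(key=lambda g: g["date"], reverse=True)     # date, descending

first_three = [g["sample"] for g in correct[:3]]
print(first_three)
```

Sorting in three stable passes, from the last key to the first, gives the same ordering as a single multi-column `orderBy` with per-column directions.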

#### 7. Aggregate all games by country and "guess" language, counting the number of guesses for each group and return the frequencies of the two most frequent country/language combinations.

In [None]:
dataset.select("country", "guess").groupBy(["country", "guess"]).count().orderBy(desc("count")).take(2)

#### 8. Sort the languages by decreasing overall percentage of correct guesses and return the first four languages.


In [None]:
# Correct guesses per target language.
correct_df = dataset.filter(dataset["target"] == dataset["guess"]).groupBy("target").count() \
    .withColumnRenamed("count", "correct")
# All games per target language.
total_df = dataset.groupBy("target").count().withColumnRenamed("count", "total")
df = correct_df.join(total_df, "target")
df.select("target", (df["correct"] / df["total"]).alias("perc")).orderBy(desc("perc")).take(4)
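
To cross-check the join-based query on a small sample, the same ratio can be computed in plain Python. This is only a sketch: the `games` list is made up, and only the `target` and `guess` fields are assumed from the query.

```python
from collections import Counter

# Hypothetical games; only "target" and "guess" are assumed from the query.
games = [
    {"target": "Maltese", "guess": "Maltese"},
    {"target": "Maltese", "guess": "Arabic"},
    {"target": "French", "guess": "French"},
    {"target": "French", "guess": "French"},
    {"target": "Dinka", "guess": "Swahili"},
]

# Games per target language, and correctly guessed games per target language.
total = Counter(g["target"] for g in games)
correct = Counter(g["target"] for g in games if g["target"] == g["guess"])

# Fraction of correct guesses per language, highest first.
ratios = sorted(
    ((lang, correct[lang] / total[lang]) for lang in total),
    key=lambda pair: pair[1],
    reverse=True,
)
top = ratios[:4]
print(top)
```

Unlike the inner join in the query, this sketch keeps languages that were never guessed correctly (ratio 0). In the Spark version such languages have no row on the correct-guesses side and are therefore dropped by the join.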