# Out-Of-Memory Exercise

For this exercise, you're given a purely fictional, simulated dataset. The dataset represents data from a webshop and is broken into three parts:



*   **customer data**: A catalogue of registered customers in the webshop. Contains personal information about each customer which (if this were a real dataset) would be encrypted.
*   **product data**: A catalogue of products sold by the webshop. Descriptions, price and other information.
*   **sales data**: Historic sales made by customers. Contains order total, which customer placed the order, status and other information.


In this exercise, you'll answer concrete questions about the dataset, which requires querying it efficiently. To do this, you'll write SQL queries with polars against the Parquet files, and you'll compare the speed of these queries with a SQLite version of the dataset.


We start with downloading the dataset in Parquet format:

In [None]:
# Download sales data
!gdown 1xWAK9ruxl9C9SHNFV09oEasnEWhLcvWZ

# Download product data
!gdown 1Xj8dL1wgNI-NpceKzSxWvs7HHSGKWAUy

# Download customer data
!gdown 1j2In8500o0yXTCXp2hBA8GRVl68UQ5ij

We then install polars:

In [None]:
!pip install polars==0.19.12

To get you started, we'll create the following SQLContext in polars:


*(If you get an error running the cell below, please restart the notebook and run the cell again)*

In [None]:
import polars as pl

# Reference dataset parts
customer_data_path = '/content/customer_data.parquet'
product_data_path = '/content/product_data.parquet'
sales_data_path = '/content/sales_data.parquet'

# Create a pl.LazyFrame for each
customer_data = pl.scan_parquet(customer_data_path)
product_data = pl.scan_parquet(product_data_path)
sales_data = pl.scan_parquet(sales_data_path)

# Combine the LazyFrames into a SQLContext
conn = pl.SQLContext(customer_data = customer_data,
                     product_data = product_data,
                     sales_data = sales_data)


Let's now print the available tables:

In [None]:
conn.tables()

# **Exercise 1.1**

## _Use SQL to write SELECT queries that extract the first 7 rows of each of the tables above, and print the results as cell output._
Hint: LIMIT

# **Exercise 1.2**

##  _Familiarize yourself with the data in each of the tables. Do you understand what each column represents?_

## _Try to answer_:


### 1.   What does a row represent in `'sales_data'`?
### 2.   What does a row represent in `'customer_data'`?
### 3.   What does a row represent in `'product_data'`?
### 4.   Which column in `'sales_data'` relates an order to a row in `'customer_data'`?
### 5.   Which column in `'sales_data'` relates an order to a row in `'product_data'`?




# **Exercise 1.3**

##  _Write a SELECT query that returns the order with the highest total. Return the full order (with all columns) and save as a variable. Print the result to cell output._

Hint: =MAX






# **Exercise 1.4**

##  _Use the data stored in the variable you created in 1.3 to write a SELECT query that returns all available information about the products that were sold in the order with the highest total. Print the results to cell output._

Hint: IN



# **Exercise 1.5**

##  _Use the data stored in the variable you created in 1.3 to write a SELECT query that returns all available information about the customer that placed the order with the highest total. Print the results to cell output._

Hint: =

# **Exercise 2.1**

##  _Suppose we wanted to convert the Parquet files to a single SQLite database, where each file becomes its own table, just like in our pl.SQLContext:_

1. Which column in `'customer_data'` is an obvious candidate for a Primary Key?
2. Which column in `'product_data'` is an obvious candidate for a Primary Key?
3. Why should we construct an Index, rather than a Primary Key, on `'sales_data'`, and which column could be a candidate?
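
For reference, the two constructs can be sketched in SQLite as follows. This is only an illustration of the syntax on an in-memory database; the tables `authors` and `books`, the index name `idx_books_author` and the connection `conn_demo` are hypothetical and unrelated to the webshop data.

```python
import sqlite3

# In-memory database used only for this illustration
conn_demo = sqlite3.connect(":memory:")

# A primary key is declared as part of the table definition
conn_demo.execute("CREATE TABLE authors (author_id INTEGER PRIMARY KEY, name TEXT)")

# An index is created separately, on a column of an existing table
conn_demo.execute("CREATE TABLE books (title TEXT, author_id INTEGER)")
conn_demo.execute("CREATE INDEX idx_books_author ON books (author_id)")

# EXPLAIN QUERY PLAN reports how SQLite intends to look up the rows
plan = conn_demo.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM books WHERE author_id = 1"
).fetchall()
print(plan)
```

The last field of each returned row is a human-readable description of the lookup strategy, which is a convenient way to check whether an index is actually being used.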




# **Exercise 2.2**
## _Download the two webshop SQLite database files `'webshop_data.db'` and `'webshop_data_no_indexing.db'` by running the cell below._

## We will use the function below to query the databases:

```
import sqlite3
# conn = sqlite3.connect(database_path)

def query_database(query, conn):
    return conn.execute(query).fetchall()
```

## and we define our query as the following

```
import numpy as np

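# Draw 10 random ids and convert them to plain Python ints,
# so that the tuple is rendered as a valid SQL list
customer_ids = np.random.randint(0, 9999, 10).tolist()
query = f'SELECT * FROM customer_data WHERE customer_id IN {tuple(customer_ids)}'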

```

## `customer_ids` will contain 10 random customer ids, and the query asks for the corresponding rows in the `customer_data` table.

## Establish a connection to each database, and use `%timeit -n 100 query_database(query = query, conn = your_connection)` to measure the execution time.

## Which of the two databases is faster for this query? Why?
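
Outside a notebook, where the `%timeit` magic is not available, a roughly comparable measurement can be sketched with the standard-library `timeit` module. The in-memory database, the table `demo` and the names `demo_conn` and `demo_query` below are hypothetical stand-ins for a downloaded `.db` file and the query defined above.

```python
import sqlite3
import timeit


def query_database(query, conn):
    return conn.execute(query).fetchall()


# Hypothetical in-memory database standing in for a downloaded .db file
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE demo (id INTEGER, value TEXT)")
demo_conn.executemany(
    "INSERT INTO demo VALUES (?, ?)", [(i, f"row {i}") for i in range(1000)]
)

demo_query = "SELECT * FROM demo WHERE id IN (3, 14, 159)"

# timeit returns the total time in seconds for `number` runs;
# dividing by the number of runs gives the mean time per call
total_seconds = timeit.timeit(lambda: query_database(demo_query, demo_conn), number=100)
seconds_per_call = total_seconds / 100
print(f"{seconds_per_call * 1e6:.1f} microseconds per call")
```

To connect to a file on disk instead, pass its path to `sqlite3.connect` in place of `":memory:"`.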





In [None]:
!gdown 1c66GNAYJxsYEfLNYgh53LtoA0ncnvnmd
!gdown 1CRZA2a_K3tkGQUo3C7Cdwi8x-sf-2UeE


# **Exercise 2.3**
## _Execute the same query as in 2.2, but using the pl.SQLContext from polars on the Parquet version of the dataset. How does the execution time compare to the databases above?_

# **Bonus Exercises**

##  **a)** _Write memory-efficient queries that answer the following questions:_



1.   _In what year did the webshop earn the most money?_
2.   _What are the average earnings in August (averaged over the years)?_
3.   _Which product has been sold the most?_
4.   _Which customer has spent the most money at the webshop?_
5.   _Have all products been sold at least once?_

Hint: DISTINCT, GROUP BY, SUM, NOT IN

## **b)** _Write a single SELECT query that returns the name of the customer that placed the order with the highest total. You may not use data stored in previous variables._
Hint: LEFT JOIN