## Quiz 05 - Parallel Computing, Reproducibility, and Containers

### Instructions

This quiz is based on the material covered in lectures 20 to 24. You may use
any resources available to you, including the lecture notes and the internet.

All the data required for this quiz can be found in the `data` folder within this repository. If you need to recreate the datasets, you can do so by running the Python script included in the `script-data-generation` folder.

**Important:** Please start by completing Question 01 to set up the correct Python environment before proceeding with the other questions.

This notebook contains the questions you need to answer.
If possible, please submit your answers as an `.html` file on Canvas.

### Question 01: Setting up the Python Environment

Before proceeding with the rest of the quiz, it is important to set up a Python environment with specific package versions to ensure compatibility and reproducibility. This quiz requires **Python 3.10** and the following packages with exact versions:
- `dask-sql=2024.5.0`
- `dask=2024.4.1`
- `ipykernel=6.29.3`
- `joblib=1.3.2`
- `numpy=1.26.4`
- `pandas=2.2.1`

You can use tools like `conda`, `pipenv`, or `uv` to manage your environment. If you use conda (recommended), **create the environment and install all packages in the same command**, and include `-c conda-forge` in that command. After creating the environment, activate it so that it becomes your current environment.

Write the terminal commands in the code cell below:

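In [None]:
# Create the environment and install all pinned packages in a single command
conda create -n quiz-env python=3.10 \
    dask-sql=2024.5.0 \
    dask=2024.4.1 \
    ipykernel=6.29.3 \
    joblib=1.3.2 \
    numpy=1.26.4 \
    pandas=2.2.1 \
    -c conda-forge -y

# Switch to the new environment
conda activate quiz-env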


### Question 02: Understanding the `map` Function and Parallelism

The built-in Python `map()` function applies a function to each element sequentially. Using `joblib`, rewrite the following serial code to run in parallel using **all available cores** (hint: use `n_jobs=-1`). Compare the results to verify correctness.

```python
import numpy as np

def cube_root(x):
    return x ** (1/3)

numbers = np.arange(1, 500001)

# Serial version using map
serial_result = list(map(cube_root, numbers))
print("First 5 serial results:", serial_result[:5])
```

Write the parallel version using `joblib.Parallel` and `delayed`.

In [5]:
# Please write your answer here.
from joblib import Parallel, delayed
import numpy as np


def cube_root(x):
    return x ** (1/3)

numbers = np.arange(1, 500001)

# Serial version using map
serial_result = list(map(cube_root, numbers))
print("First 5 serial results:", serial_result[:5])

# Parallel version using joblib
parallel_result = Parallel(n_jobs=-1)(
    delayed(cube_root)(x) for x in numbers
)

print("First 5 parallel results:", parallel_result[:5])

# Verify correctness
print("Results match serial version:", serial_result == parallel_result)


First 5 serial results: [np.float64(1.0), np.float64(1.2599210498948732), np.float64(1.4422495703074083), np.float64(1.5874010519681994), np.float64(1.7099759466766968)]
First 5 parallel results: [np.float64(1.0), np.float64(1.2599210498948732), np.float64(1.4422495703074083), np.float64(1.5874010519681994), np.float64(1.7099759466766968)]
Results match serial version: True
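The exact list comparison above returns `True` because both versions run the same floating-point operation on the same inputs. A tolerance-based check with `np.allclose` is a more forgiving alternative when two code paths might differ in rounding. The sketch below assumes a smaller input (1 to 1000, chosen only to keep the run short) and is one possible way to verify the results, not the only one.

```python
import numpy as np
from joblib import Parallel, delayed


def cube_root(x):
    return x ** (1 / 3)


# Assumption: a small range keeps the example fast; the quiz uses 1..500000.
numbers = np.arange(1, 1001)

serial_result = list(map(cube_root, numbers))
parallel_result = Parallel(n_jobs=-1)(delayed(cube_root)(x) for x in numbers)

# joblib returns results in input order, so the two lists line up element-wise.
same_length = len(serial_result) == len(parallel_result)
close_enough = np.allclose(serial_result, parallel_result)
print("Same length:", same_length)
print("All values within tolerance:", close_enough)
```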


### Question 03: Measuring Parallel Speedup

Create a function called `simulate_computation` that generates 100,000 random numbers and calculates their variance. Using `%timeit`, measure and compare the execution time of:

1. Running the function **4 times sequentially** in a list comprehension (`[simulate_computation() for _ in range(4)]`)
2. Running the function **4 times in parallel** using `joblib` with 4 workers

Print and compare both timing results.

In [8]:
# Please write your answer here.
import numpy as np
from joblib import Parallel, delayed

# Define the computation function
def simulate_computation():
    data = np.random.rand(100_000)
    return np.var(data)

# Sequential execution (4 times)
print("Sequential execution:")
%timeit [simulate_computation() for _ in range(4)]


# Parallel execution using joblib (4 workers)

print("\nParallel execution:")
%timeit Parallel(n_jobs=4)(delayed(simulate_computation)() for _ in range(4))


Sequential execution:
1.52 ms ± 86.2 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Parallel execution:
12.8 ms ± 806 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)
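In the output above, the parallel version is roughly eight times slower than the sequential one (12.8 ms versus 1.52 ms). Each call takes well under a millisecond, so the cost of dispatching tasks to worker processes and collecting the results outweighs the work itself. The sketch below uses `timeit.timeit` as a plain-Python stand-in for `%timeit`, so that both timings can be stored in variables and printed side by side. The repeat count of 3 is an arbitrary assumption.

```python
import timeit

import numpy as np
from joblib import Parallel, delayed


def simulate_computation():
    data = np.random.rand(100_000)
    return np.var(data)


def run_sequential():
    return [simulate_computation() for _ in range(4)]


def run_parallel():
    return Parallel(n_jobs=4)(delayed(simulate_computation)() for _ in range(4))


# Assumption: 3 repeats is enough for a rough comparison, not a benchmark.
repeats = 3
seq_time = timeit.timeit(run_sequential, number=repeats) / repeats
par_time = timeit.timeit(run_parallel, number=repeats) / repeats

print(f"Sequential: {seq_time * 1e3:.2f} ms per run")
print(f"Parallel:   {par_time * 1e3:.2f} ms per run")
print(f"Parallel / sequential ratio: {par_time / seq_time:.1f}")
```

Exact numbers will vary by machine; parallelism tends to pay off only when each task runs long enough to amortise the dispatch cost.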


### Question 04: Dask Array with Custom Chunk Sizes

Create a Dask array of shape (5000, 2000) filled with random integers between 1 and 100. Use chunks of size (500, 500). Then:

1. Compute the sum of each row
2. Calculate the mean and standard deviation of the entire array
3. Print all three results

In [5]:
!pip install dask

ERROR: Could not find a version that satisfies the requirement dask-array (from versions: none)
ERROR: No matching distribution found for dask-array

In [3]:
# Please write your answer here.
import dask.array as da

# Create Dask array: shape (5000, 2000), ints 1–100, chunks of (500, 500)
x = da.random.randint(
    low=1,
    high=101,
    size=(5000, 2000),
    chunks=(500, 500)
)

# 1. Compute the sum of each row
row_sums = x.sum(axis=1).compute()

# 2. Mean and standard deviation of the entire array
mean_val = x.mean().compute()
std_val  = x.std().compute()

# 3. Print all three results
print("Row sums (first 10):", row_sums[:10])
print("Mean of entire array:", mean_val)
print("Std of entire array:", std_val)


ImportError: No module named 'numpy'

Dask array requirements are not installed.

Please either conda or pip install as follows:

  conda install dask                 # either conda install
  python -m pip install "dask[array]" --upgrade  # or python -m pip install
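The `ImportError` above suggests the notebook kernel was not running in the environment created in Question 01, since NumPy is one of its pinned packages. Once that environment is active, a scaled-down check such as the following can confirm that Dask arrays work before running the full-size computation. The shape `(50, 20)` and chunks `(5, 5)` are assumptions chosen to keep it quick; `dask.compute` evaluates all three results in one call.

```python
import dask
import dask.array as da

# Assumption: a small array with the same value range as the quiz (1 to 100).
x = da.random.randint(low=1, high=101, size=(50, 20), chunks=(5, 5))

# Build the three lazy results, then evaluate them together.
row_sums, mean_val, std_val = dask.compute(x.sum(axis=1), x.mean(), x.std())

print("Blocks per axis:", x.numblocks)
print("Row sums shape:", row_sums.shape)
print("Mean:", mean_val)
print("Std:", std_val)
```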


### Question 05: Optimising Chunk Size

The chunk size significantly affects Dask performance. Create a Dask array with 100,000 random numbers and test three different chunk sizes: 1,000 (many small chunks), 10,000 (medium chunks), and 50,000 (few large chunks).

For each configuration, measure the time to compute `mean(sin(x) + cos(x))`. Which chunk size performed best? Explain why in a comment.

In [None]:
# Please write your answer here.

import dask.array as da
import numpy as np
import time

chunk_sizes = [1_000, 10_000, 50_000]
timings = {}

for chunk in chunk_sizes:
    # Create a Dask array with given chunk size
    x = da.random.random(100_000, chunks=chunk)
    expr = da.sin(x) + da.cos(x)
    
    # Time the computation of mean(sin(x) + cos(x))
    start = time.perf_counter()
    result = expr.mean().compute()
    end = time.perf_counter()
    
    elapsed = end - start
    timings[chunk] = elapsed
    print(f"Chunk size {chunk:>6}: mean={result:.6f}, time={elapsed:.4f} seconds")

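# Report the fastest configuration from this run (timings vary between runs and machines).
best_chunk = min(timings, key=timings.get)
print(f"\nThe best-performing chunk size in my run was {best_chunk}.")

# Why: every chunk becomes a separate task with a fixed scheduling cost.
# Many tiny chunks (1,000) spend more time on overhead than on arithmetic,
# while very few large chunks (50,000) leave little work to spread across cores.
# For an array this small, overhead usually dominates, so larger chunks tend to win.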
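The trade-off is easier to see by counting chunks. Each chunk becomes at least one task per operation, so the number of chunks is a rough proxy for scheduling overhead. A small sketch, using the same array length and chunk sizes as above:

```python
import dask.array as da

chunk_sizes = [1_000, 10_000, 50_000]
chunk_counts = {}

for chunk in chunk_sizes:
    x = da.random.random(100_000, chunks=chunk)
    # npartitions is the total number of chunks in the array.
    chunk_counts[chunk] = x.npartitions
    print(f"Chunk size {chunk:>6}: {x.npartitions:>3} chunks")
```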


### Question 06: Reading Parquet Files with Column Selection

The `data` folder contains Parquet files for multiple countries. Using Dask, read **all Parquet files at once** (`data/*.parquet`), but load only the `year` and `population` columns.

Calculate the total world population for each year across all countries and display the results sorted by year.

In [None]:
# Please write your answer here.

### Question 07: Dask SQL with Multiple Conditions

Load the `data.csv` file into a Dask DataFrame and register it as a SQL table. Write a SQL query that:

1. Selects countries where `gdp_per_capita` was between 10000 and 50000
2. Filters for years between 2000 and 2020
3. Orders results by `gdp_per_capita` in descending order
4. Limits to the top 2 results

Execute the query and display the results.

In [None]:
# Please write your answer here.

### Question 08: Dask SQL with Aggregation

Using the same `data.csv` file, write a SQL query that calculates:

1. The average GDP per capita for each country
2. The minimum and maximum years in the dataset for each country

Group by country and display all results.

In [None]:
# Please write your answer here.

### Question 09: Generating `requirements.txt` and `environment.yml` Files

Write the commands to:

1. Export your current environment's packages to a `requirements.txt` and an `environment.yml` file
2. Show how someone else would install these exact dependencies in these two cases

Explain each step with comments. It is not necessary to run the code.
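For orientation, a `requirements.txt` is mostly a list of `name==version` lines. The sketch below builds similar lines with the standard library's `importlib.metadata`. It only illustrates the file format and is not a replacement for the export commands the question asks for; the helper name `pinned_requirements` is hypothetical.

```python
from importlib.metadata import distributions


def pinned_requirements():
    """Return sorted 'name==version' lines for installed distributions."""
    lines = set()
    for dist in distributions():
        meta = dist.metadata
        name = meta.get("Name") if meta is not None else None
        version = meta.get("Version") if meta is not None else None
        if name and version:  # skip entries with missing metadata
            lines.add(f"{name}=={version}")
    return sorted(lines, key=str.lower)


requirements = pinned_requirements()
print("\n".join(requirements[:5]))  # show the first few pinned packages
```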

In [None]:
# Please write your answer here.

### Question 10: Troubleshooting a Broken Dockerfile

The following Dockerfile has several errors. Identify and fix 5 issues, then explain what was wrong with each line:

```dockerfile
# Broken Dockerfile - Fix the errors
from ubuntu

RUN apt install python3 python3-pip
RUN pip install numpy pandas

COPY . .
EXPOSE 8888
RUN ["python3", "app.py"]
```

Write the corrected Dockerfile and list each error with its fix.

### Question 11: Writing a Dockerfile to Install Software on a Base Image

Create a Dockerfile that starts from an Ubuntu image and installs the following software:

- Git version 2.43.0-1ubuntu7.1
- SQLite version 3.45.1-1ubuntu2

Pin the exact package versions in the install command, and verify them by checking the installed versions afterwards. Include commands to clean up the package manager cache after installation to reduce the image size.

#### Please write your answer here. You can use ```dockerfile to format your code

### Question 12: Dockerfile for a Jupyter Data Science Environment

Create a Dockerfile starting from Ubuntu that:

1. Installs Python 3.11 and pip
2. Installs `jupyterlab`, `numpy`, `pandas`, `matplotlib`, and `scikit-learn` with specific versions of your choice
3. Sets the working directory to `/home/analyst/notebooks`
4. Exposes port 8888
5. Starts JupyterLab with `--no-browser` and `--ip=0.0.0.0`

Clean up apt cache to reduce image size.

#### Please write your answer here. You can use ```dockerfile to format your code