## Quiz 05 - Parallel Computing, Reproducibility, and Containers

### Instructions

This quiz is based on the material covered in lectures 21 to 24. You may use
any resources available to you, including the lecture notes and the internet.

All the data required for this quiz can be found in the `data` folder within this repository. If you need to recreate the datasets, you can do so by running the Python script included in the `script-data-generation` folder. Please make sure that the following Python packages are installed:

```bash
pip install numpy pandas pyarrow dask dask-sql joblib SQLAlchemy
```

This notebook contains the questions you need to answer.
If possible, please submit your answers as an `.html` file on Canvas.

### Question 01 - Parallelising a Function with Joblib

Use `joblib` to parallelise the computation of squaring numbers in a large array. Import the required packages and write code that uses four cores to parallelise the computation.

```python
import numpy as np

def square(x):
    return x ** 2

numbers = np.arange(1000000)
```

In [15]:
# Please write your code here
import numpy as np
from joblib import Parallel, delayed

def square(x):
    return x ** 2

numbers = np.arange(1000000)
squared_numbers = (delayed(square)(num) for num in numbers)
result = Parallel(n_jobs=4)(squared_numbers)

In [16]:
result

[0,
 1,
 4,
 9,
 16,
 25,
 36,
 49,
 64,
 81,
 100,
 ...
(output truncated)
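
Dispatching one `delayed` call per element creates a million tiny tasks, so scheduling overhead can easily outweigh the work itself. The following is a minimal sketch of one alternative, assuming it is acceptable to hand whole NumPy chunks to `square`. The four-way split with `np.array_split` is an illustrative choice and not part of the original answer.

```python
import numpy as np
from joblib import Parallel, delayed

def square(x):
    # Works element-wise on a NumPy array as well as on a scalar
    return x ** 2

numbers = np.arange(1_000_000)

# Split the array into four contiguous chunks, one per worker
chunks = np.array_split(numbers, 4)

# Each worker squares a whole chunk in a single vectorised call
squared_chunks = Parallel(n_jobs=4)(delayed(square)(chunk) for chunk in chunks)

# Stitch the chunks back together in their original order
result = np.concatenate(squared_chunks)
print(result[:5], result.shape)
```

With this layout each worker receives a single task, and `result` is a NumPy array rather than a Python list.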

### Question 02 - Using Dask Arrays for Large Data

Using Dask's `array` module, create a Dask array of random numbers with 10,000 rows and 10,000 columns. The array should be divided into chunks of 1,000 rows by 1,000 columns to enable efficient parallel computation. Populate the array with random numbers drawn from a normal distribution, where the mean is 0 and the standard deviation is 1. After creating the array, compute the mean, standard deviation, maximum, and minimum of the array using Dask's parallel computation capabilities. Use the `.compute()` method to execute the computations and print the results.

In [17]:
# Please write your code here
import dask.array as da
dask_array = da.random.normal(0, 1, size=(10000, 10000), chunks=(1000, 1000))

mean_value = dask_array.mean().compute()
std_dev_value = dask_array.std().compute()
max_value = dask_array.max().compute()
min_value = dask_array.min().compute()

print("Mean:", mean_value)
print("Standard Deviation:", std_dev_value)
print("Maximum:", max_value)
print("Minimum:", min_value)

Mean: 6.803333574040174e-05
Standard Deviation: 0.9999119946072418
Maximum: 5.773407607487137
Minimum: -5.852508745564136
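
Each separate `.compute()` call evaluates its own task graph, so the random chunks are generated four times. Below is a hedged sketch of evaluating all four statistics in one pass with `dask.compute`. The smaller shape and the seeded `RandomState` are assumptions made here so that the example runs quickly and repeatably; they are not part of the original answer.

```python
import dask
import dask.array as da

# Seeded generator (an assumption for repeatability; the answer above is unseeded)
rng = da.random.RandomState(42)

# Smaller array than in the question so the sketch finishes quickly
small_array = rng.normal(0, 1, size=(2000, 2000), chunks=(500, 500))

# A single compute call lets Dask share the chunk-generation tasks
# across all four reductions instead of repeating them
mean_value, std_value, max_value, min_value = dask.compute(
    small_array.mean(),
    small_array.std(),
    small_array.max(),
    small_array.min(),
)

print("Mean:", mean_value)
print("Standard Deviation:", std_value)
print("Maximum:", max_value)
print("Minimum:", min_value)
```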


### Question 03 - Dask DataFrame Operations with Parquet Files

The `data` folder contains datasets for four countries (Brazil, India, UK, and USA) covering the years 1945 to 2023. Each country's data is stored in a separate Parquet file named after the country (`Brazil.parquet`, `India.parquet`, `UK.parquet`, `USA.parquet`). Each file contains the following columns:

- `country` (string): The name of the country.
- `year` (integer): The year of the record.
- `gdp_per_capita` (float): The GDP per capita for that country and year.
- `population` (integer): The population for that country and year.

Using Dask's `dataframe` module, read _only the `country` and the `gdp_per_capita` columns_ from the Parquet files into a Dask DataFrame. Then, compute the mean and standard deviation of the GDP per capita for each country using Dask's parallel computation capabilities.


In [18]:
# Please write your code here
import dask.dataframe as dd

file_path_template = "data/{}.parquet"
countries = ["Brazil", "India", "UK", "USA"]

dfs = [dd.read_parquet(file_path_template.format(country), columns=["country", "gdp_per_capita"]) for country in countries]
data = dd.concat(dfs)

results = data.groupby("country")["gdp_per_capita"].agg(["mean", "std"]).compute()
print(results)

                 mean           std
country                            
Brazil    5496.292031   2682.494158
India     1251.704443    456.525628
UK       27496.851363  10607.858036
USA      40189.822290  14892.455747


### Question 04 - Dask and SQL Queries

Load the `data.csv` file into a Dask DataFrame and use the `dask_sql` package to perform a SQL query that selects the `country` and `gdp_per_capita` columns and filters the rows where `gdp_per_capita` is greater than 20000 in 2014. Display the results. Do not forget to register the Dask DataFrame as a SQL table with the `create_table` method.

In [22]:
# Please write your code here
import dask.dataframe as dd
from dask_sql import Context

data = dd.read_csv("data/data.csv")
context = Context()

context.create_table("gdp_data", data)
query = """
SELECT country, gdp_per_capita
FROM gdp_data
WHERE gdp_per_capita > 20000 AND year = 2014
"""
result = context.sql(query).compute()
print(result)

    country  gdp_per_capita
227      UK    40455.486012
306     USA    65386.141694
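
One way to sanity-check the query logic independently of `dask_sql` is to run the same `SELECT` against a tiny in-memory table. The sketch below swaps in Python's built-in `sqlite3` module purely for illustration. The rows are hypothetical placeholders, not values from `data.csv`.

```python
import sqlite3

# Hypothetical rows (country, year, gdp_per_capita); NOT taken from data.csv
rows = [
    ("A", 2013, 25000.0),
    ("A", 2014, 26000.0),
    ("B", 2014, 15000.0),
    ("C", 2014, 40000.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gdp_data (country TEXT, year INTEGER, gdp_per_capita REAL)"
)
conn.executemany("INSERT INTO gdp_data VALUES (?, ?, ?)", rows)

# Same query text as in the answer above
query = """
SELECT country, gdp_per_capita
FROM gdp_data
WHERE gdp_per_capita > 20000 AND year = 2014
"""
matches = conn.execute(query).fetchall()
conn.close()

print(matches)
```

Only the rows that satisfy both conditions survive: country `A` in 2013 is dropped by the year filter and country `B` by the GDP threshold.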


### Question 05 - Parallelising a Function with Dask Delayed

Suppose we need to compute the sum of squares of numbers for large ranges. The function below calculates the sum of squares from `0` up to `n-1`. Modify the given `sum_of_squares` function to use Dask's `@delayed` decorator and compute the sum of squares for each number in the numbers list in parallel. Measure and print the total execution time for the parallel computation, and print the results for each input number (as indicated in the code).

```python
import time

def sum_of_squares(n):
    """Compute the sum of squares from 0 to n-1."""
    return sum(i * i for i in range(n))

numbers = [100_000_000, 200_000_000, 300_000_000, 400_000_000]

# Measure the start time
start_time = time.time()

# Perform the computations serially
results_serial = []
for n in numbers:
    result = sum_of_squares(n)
    results_serial.append(result)
    print(f"Sum of squares up to {n}: {result}")

# Measure the end time
end_time = time.time()

# Calculate and print the total execution time
serial_execution_time = end_time - start_time
print(f"\nSerial execution time: {serial_execution_time:.2f} seconds")
```


In [20]:
# Please write your code here
import time
from dask import delayed, compute

@delayed
def sum_of_squares(n):
    """Compute the sum of squares from 0 to n-1."""
    return sum(i * i for i in range(n))

numbers = [100_000_000, 200_000_000, 300_000_000, 400_000_000]

# Measure the start time
start_time = time.time()

# Perform the computations in parallel
results_parallel = [sum_of_squares(n) for n in numbers]
results_computed = compute(*results_parallel)

# Measure the end time
end_time = time.time()
parallel_execution_time = end_time - start_time
print(f"\nParallel execution time: {parallel_execution_time:.2f} seconds")


Parallel execution time: 38.65 seconds
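
The values returned by the parallel run can be checked without repeating the long loops, because the sum of squares from `0` to `n-1` has the closed form `(n-1)n(2n-1)/6`. The sketch below is an independent check, not the Dask solution itself, and `sum_of_squares_closed_form` is a helper name introduced here for illustration.

```python
def sum_of_squares_closed_form(n):
    """Closed form for 0**2 + 1**2 + ... + (n - 1)**2."""
    # The product is always divisible by 6, so integer division is exact
    return (n - 1) * n * (2 * n - 1) // 6

numbers = [100_000_000, 200_000_000, 300_000_000, 400_000_000]
expected = [sum_of_squares_closed_form(n) for n in numbers]

for n, value in zip(numbers, expected):
    print(f"Sum of squares up to {n}: {value}")

# If the tuple returned by dask.compute is in scope, it can be compared directly:
# assert list(results_computed) == expected
```

Dask's default scheduler for `delayed` objects is thread-based, and a pure-Python loop like this one holds the GIL, so passing `scheduler="processes"` to `compute` is one option worth trying if the timing shows little speed-up over the serial version.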


### Question 06 - Using `pip` and `requirements.txt` for Dependency Management

Explain how you can use `pip` to manage dependencies in a Python project. Describe the process of generating a `requirements.txt` file from your current environment and how to use this file to install the same packages in another environment or on a different machine. Please comment your code to explain each step.

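`pip` installs packages into the active Python environment, and `pip freeze` lists every installed package pinned to its exact version, which is what makes the environment reproducible. \
Step 1: Generate a `requirements.txt` file from the current environment with the command `pip freeze > requirements.txt`; each line has the form `package==version`. \
Step 2: On another machine (ideally inside a fresh virtual environment), install the same versions with the command `pip install -r requirements.txt`.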
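
To see what kind of information ends up in `requirements.txt`, the installed distributions can also be listed from Python. This is a rough standard-library analogue of the `name==version` lines, not a reimplementation of `pip freeze`, and `freeze_like` is a hypothetical helper name.

```python
from importlib.metadata import distributions

def freeze_like():
    """Return sorted 'name==version' strings for installed distributions."""
    pins = set()
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip entries whose metadata lacks a name
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)

for line in freeze_like():
    print(line)
```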

### Question 07 - Creating and Sharing a Conda Environment

Describe the steps to create a new Conda virtual environment named `qtm350` with Python 3.12 and install the packages `numpy`, `pandas`, and `matplotlib`. Explain how to export this environment to an `environment.yml` file and how someone else can recreate the same environment on their machine using this file. Please comment your code to explain each step. There is no need to run the code for this question, but you can do so if you wish.

> Step 1: Create a new Conda environment \
conda create --name qtm350 python=3.12 numpy pandas matplotlib

> Step 2: Activate the new environment \
conda activate qtm350

> Step 3: Export the environment to a file \
conda env export > environment.yml

> Step 4: Recreate the environment on another machine \
conda env create --file environment.yml

### Question 08 - Writing a Simple Dockerfile

Write a simple `Dockerfile` that creates a Docker image for a Python application. The application consists of a single Python script named `app.py` that prints "Hello, World!" when executed. The `Dockerfile` should use the official Python image as the base image and copy the `app.py` script into the image. When the container is run, it should execute the `app.py` script and print "Hello, World!".

In [None]:
# Please write your code here (no need to run it)
FROM python:3.12-slim

WORKDIR /app

COPY app.py .

CMD ["python", "app.py"]
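
The Dockerfile expects an `app.py` next to it in the build context. Here is a minimal sketch of such a script. The `greeting` helper is a hypothetical name, introduced only to keep the message easy to test.

```python
# app.py
def greeting():
    """Return the message the container should print."""
    return "Hello, World!"

if __name__ == "__main__":
    print(greeting())
```

Assuming an image tag such as `hello-app` (an arbitrary name), the image could then be built with `docker build -t hello-app .` and run with `docker run --rm hello-app`.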

### Question 09 - Writing a Dockerfile to Install Software on a Base Image

Create a Dockerfile that starts from an Ubuntu 24.04 base image and installs the following software:

- Git version 2.43.0-1ubuntu7.1
- SQLite version 3.45.1-1ubuntu2

Ensure that you specify the exact versions of the packages. Include commands to clean up the package manager cache after installation to reduce the image size.

In [None]:
# Please write your code here (no need to run it)
FROM ubuntu:24.04

RUN apt-get update && apt-get install -y \
    git=1:2.43.0-1ubuntu7.1 \
    sqlite3=3.45.1-1ubuntu2 \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

CMD ["bash"]

### Question 10 - Writing a Dockerfile to Install Python and Packages on Ubuntu

Write a `Dockerfile` that starts from an Ubuntu 24.04 base image, installs Python 3.12 and `pip`, and then uses `pip` to install specific versions of `numpy` (1.26.4), `pandas` (2.2.2), and `matplotlib` (3.9.2). Ensure you include commands to clean up the package manager cache after installation to reduce the image size. Set up a working directory named `app/` and configure the container to start an interactive Python shell `python3` by default.

In [None]:
# Please write your code here (no need to run it)
FROM ubuntu:24.04

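# Install Python 3.12, pip, and venv support, then clear the apt cache
RUN apt-get update && apt-get install -y \
    python3.12 \
    python3.12-venv \
    python3-pip \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

# Ubuntu 24.04 marks the system Python as externally managed (PEP 668),
# so install the pinned packages into a virtual environment instead
RUN python3.12 -m venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
RUN pip install --no-cache-dir numpy==1.26.4 pandas==2.2.2 matplotlib==3.9.2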

WORKDIR /app
CMD ["python3"]
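
Once the container starts its Python shell, the pinned versions can be confirmed from inside it. The snippet below is a small sketch of such a check using only the standard library; `check_pins` and `EXPECTED` are names introduced here for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions requested in the question
EXPECTED = {"numpy": "1.26.4", "pandas": "2.2.2", "matplotlib": "3.9.2"}

def check_pins(expected):
    """Map each package to (wanted, installed, matches)."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        report[name] = (wanted, installed, installed == wanted)
    return report

for name, (wanted, installed, ok) in check_pins(EXPECTED).items():
    status = "OK" if ok else "MISMATCH"
    print(f"{name}: wanted {wanted}, found {installed} [{status}]")
```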