# W11: High-Performance Computing and Dask for Parallel Computing
- Contributors: Dr. Zhonghua Zheng, Yuan Sun
- Course Unit: Earth and Environmental Data Science (EART60702)
- Last modified date: 19 April, 2024

## Intended Learning Outcomes (ILOs)
- High-Performance Computing: Gain a practical understanding of accessing HPC systems and running applications on HPC systems.

- Dask for Parallel Computing: Learn to use Dask for parallel data processing and computing tasks.

## 1. High-Performance Computing System

- HPC systems are designed to provide significantly higher computational power. HPC clusters typically consist of multiple nodes, each containing powerful processors (CPUs) and sometimes specialized accelerators such as GPUs.
- HPC systems are optimized for parallel computing, where multiple processing units work together to run jobs simultaneously.
- HPC at The University of Manchester: https://ri.itservices.manchester.ac.uk/csf3/
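
To make the idea of several processing units working on one problem concrete, here is a minimal sketch that uses only the Python standard library. It splits a sum into chunks, hands each chunk to a worker in a pool, and combines the partial results. The names `partial_sum` and `parallel_sum` are hypothetical and exist only for this illustration. A thread pool is used to keep the sketch portable; on an HPC system the workers would typically be separate processes or nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum(bounds):
    """Sum the integers in the half-open range [start, stop)."""
    start, stop = bounds
    return sum(range(start, stop))


def parallel_sum(n, n_workers=4):
    """Sum 0..n-1 by splitting the range into roughly one chunk per worker."""
    step = max(1, -(-n // n_workers))  # ceiling division, at least 1
    chunks = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # each chunk is handled independently; results are combined at the end
        return sum(pool.map(partial_sum, chunks))


total = parallel_sum(1_000_000)
print(total == sum(range(1_000_000)))
```

In standard CPython, threads do not speed up CPU-bound work like this, so the sketch shows the split, apply and combine pattern rather than a real speed-up.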

Log in remotely through SSH. The operating system is Linux and the batch system is PBS: https://openpbs.org/
- On Windows, use the built-in OpenSSH client.
- On Linux, use `ssh` directly in a terminal.


```bash
ssh UserName@10.141.12.196
```

Run the commands below:
```bash
touch my_script.sh # create a script file to run jobs
vim my_script.sh # open the file in vim and write the script
```

The script should be:
```
#!/bin/bash
export OMP_NUM_THREADS=4
# use 1 node with 4 processors 
# This is a simple Bash script
echo "Hello, world!"
```

Run the commands below:
```bash
chmod +x my_script.sh
qsub my_script.sh # submit job to the batch system (pbs)
qstat # check state
cat my_script.sh.o3 # view output (the number after ".o" is the job number, so replace 3 with yours)
```
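
The job script above mentions one node with four processors, but the resource request itself is not shown. The sketch below builds a job script with explicit `#PBS` directives from Python and submits it only if a `qsub` command is available. The directives (`-N`, `-l select=...:ncpus=...`, `-l walltime=...`) follow OpenPBS conventions and are an assumption here; the cluster you use may expect different options. The function name `build_job_script` and its default values are hypothetical.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_job_script(job_name="hello", nodes=1, cpus=4, walltime="00:05:00"):
    """Return the text of a PBS job script (directive values are assumptions)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",
        f"#PBS -l select={nodes}:ncpus={cpus}",
        f"#PBS -l walltime={walltime}",
        f"export OMP_NUM_THREADS={cpus}",
        'echo "Hello, world!"',
    ]
    return "\n".join(lines) + "\n"


# written to a temporary directory so the sketch is safe to run anywhere
script_path = Path(tempfile.mkdtemp()) / "my_script.sh"
script_path.write_text(build_job_script())
print(script_path.read_text())

# only try to submit when a PBS client is actually installed
if shutil.which("qsub"):
    result = subprocess.run(["qsub", str(script_path)], capture_output=True, text=True)
    print(result.stdout.strip())  # qsub prints the ID of the new job
else:
    print("qsub not found; script written but not submitted")
```

Generating the script from a function makes it easy to vary the number of CPUs or the walltime without editing the file by hand.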

**Q1: Try to run a job with the following steps:**
- create a folder 'hpc'
- create a script file 'loop.sh'
- use loop.sh to create five script files ('1.sh', '2.sh', ..., '5.sh') under the ./hpc/job_script directory, and write 'Hello, world!' into each file


Run the commands below:
```bash
mkdir -p hpc/job_script
cd hpc
touch loop.sh
vim loop.sh
```

The script should be:
```
#!/bin/bash
NUM_SCRIPT=5
for ((i=1; i<=NUM_SCRIPT; i++)); do
    SCRIPT="./job_script/${i}.sh"  # use the loop counter as the file name
    touch ${SCRIPT}
    echo "Hello, world!" >> ${SCRIPT}
    chmod +x ${SCRIPT}
done    
```

Run the commands below:
```bash
chmod +x loop.sh
./loop.sh # run the script directly
qsub loop.sh # run the script through the batch system
```
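
For comparison, the same task can be sketched in Python with `pathlib`. This is an illustrative equivalent of `loop.sh`, not part of the original exercise; the function name `make_job_scripts` is hypothetical, and the files are written to a temporary directory here so that the sketch is safe to run anywhere.

```python
import stat
import tempfile
from pathlib import Path


def make_job_scripts(target_dir, num_scripts=5):
    """Create 1.sh ... <num_scripts>.sh in target_dir, each containing 'Hello, world!'."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # same idea as `mkdir -p`
    created = []
    for i in range(1, num_scripts + 1):
        script = target_dir / f"{i}.sh"
        script.write_text("Hello, world!\n")  # overwrites, unlike `>>` which appends
        # add execute permission for user, group and others, similar to `chmod +x`
        script.chmod(script.stat().st_mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
        created.append(script)
    return created


# on the cluster you would pass Path("hpc/job_script") instead
scripts = make_job_scripts(Path(tempfile.mkdtemp()) / "job_script")
print([s.name for s in scripts])
```

Note one difference from the Bash version: running `loop.sh` twice appends a second line to each file, while this sketch overwrites the file each time.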

- Learn about VS Code, which makes it easier to browse and edit files on the remote system.

## 2. Dask (20 mins)
- Dask | Scale the Python tools you love: https://docs.dask.org/en/stable/.
- Dask has become a popular choice for data scientists, researchers, and engineers tackling big data and parallel computing challenges in various domains, including scientific computing, machine learning, and data analysis.
- Dask provides high-level parallelism and distributed collections that mimic familiar data structures such as NumPy arrays and pandas DataFrames.

**Dask installation**: https://docs.dask.org/en/stable/install.html

### 2.1 Comparisons in computation speed and result

In [None]:
import numpy as np
import pandas as pd
import dask.array as da
import dask.dataframe as dd

### 2.1.1: Dask vs NumPy

In [None]:
%%time

# traditional way using NumPy
x = np.random.random((10000, 10000))
y = x * 2

mean_y = y.mean(axis=0)
sum_y = mean_y.sum()

In [None]:
%%time

# if using Dask
# chunks specifies how the array is split into smaller blocks for parallel computation
x = da.random.random((10000, 10000), chunks=(1000, 1000))

y = x * 2

mean_y = y.mean(axis=0)
# Dask is lazy: the lines above only build a task graph, so call .compute() to run it
sum_y = mean_y.sum().compute()
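
To see what `chunks` and lazy evaluation mean in practice, the following minimal sketch uses a small array whose result is easy to check by hand. The variable names are illustrative only. It inspects how the array is partitioned, then shows that a number is produced only once `.compute()` is called, and that it matches the NumPy result.

```python
import numpy as np
import dask.array as da

# a small array so that the result is easy to check by hand
x_np = np.arange(16, dtype=float).reshape(4, 4)
x_da = da.from_array(x_np, chunks=(2, 2))  # four blocks of shape (2, 2)

print(x_da.chunks)       # block sizes along each axis
print(x_da.npartitions)  # total number of blocks

lazy_result = (x_da * 2).mean(axis=0).sum()  # builds a task graph, no numbers yet
dask_value = lazy_result.compute()           # runs the graph and returns the value
numpy_value = (x_np * 2).mean(axis=0).sum()

print(dask_value, numpy_value)
```

When timing Dask against NumPy, the `.compute()` call needs to be inside the timed cell; otherwise only the construction of the task graph is measured.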

### 2.1.2: Dask vs Pandas

In [None]:
data = {
    'feature1': np.random.rand(10000) * 100,
    'feature2': np.random.rand(10000) * 100,
    'target': np.random.randint(0, 2, 10000)
}
df = pd.DataFrame(data)
ddf = dd.from_pandas(df, npartitions=10) 

In [None]:
ddf['feature3'] = ddf['feature1'] / (ddf['feature2'] + 1)

# metrics (the two means below stay lazy until .compute() is called)
mean_feature1 = ddf['feature1'].mean()
mean_feature2 = ddf['feature2'].mean()
correlation = ddf[['feature1', 'feature2']].corr().compute()
print(correlation)

**Q1: Repeat the calculation above using pandas only.**