# HPC intro

## Using Python

Python is of course a very useful programming language for data processing, analysis and visualization.

There are many tutorials and courses that will teach you Python, so that is not the scope of this tutorial.  Here you will learn how to run Python on our HPC infrastructure, assuming you are already familiar with the language and packages.

### Python scripts

Given that Python scripts are simple text files, you can create or modify them using your favorite editor.  You can do this, for instance, directly on the infrastructure using `nano`, or on your own system, and then transfer the finished script or module to the HPC system.

To build up gradually, you can start with a very simple script that takes a string as a command line argument, and prints a greeting to standard output.  Your script is stored in a file `hello.py` which could look like this.

```
#!/usr/bin/env python

import argparse


arg_parser = argparse.ArgumentParser(description='say hello')
arg_parser.add_argument('name', help='who to say hello to')
options = arg_parser.parse_args()

print('Hello ' + options.name + '!')
```

The only module used in this script, `argparse`, is in Python's standard library, and the script has been written in such a way that it will work with any version of Python.  In practice, use f-strings and a recent version of Python.
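
As an illustration of that advice, here is a minimal sketch of the same script using an f-string, which requires Python 3.6 or later.  The function names `build_greeting` and `main` are not part of the original script; they are introduced here only to keep the example easy to test.

```python
#!/usr/bin/env python

import argparse


def build_greeting(name):
    """Return the greeting for the given name, formatted with an f-string."""
    return f'Hello {name}!'


def main(argv=None):
    """Parse the command line (sys.argv when argv is None) and print the greeting."""
    arg_parser = argparse.ArgumentParser(description='say hello')
    arg_parser.add_argument('name', help='who to say hello to')
    options = arg_parser.parse_args(argv)
    print(build_greeting(options.name))


# Demonstration with an explicit argument list; in a real script you would
# call main() without arguments so that the actual command line is used.
main(['there'])
```

Passing the argument list explicitly is only a convenience for this sketch; the behavior is otherwise the same as that of `hello.py`.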

### Running simple scripts

You can run the script by giving it as a command line argument to the Python interpreter.

In [None]:
python hello.py there

You can of course easily check which version of Python is used to run your script, as well as where it is installed on the system.

In [None]:
python --version

In [None]:
which python
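
The same information is also available from within Python, which can be handy when you want a script to report which interpreter is running it.  The following is a small sketch; the helper name `interpreter_info` is made up for this example.

```python
import sys


def interpreter_info():
    """Return the version string and the path of the running interpreter."""
    version = '.'.join(str(part) for part in sys.version_info[:3])
    return version, sys.executable


version, executable = interpreter_info()
print(f'Python {version}')
print(executable)
```

The output depends on the system, so none is shown here.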

However, often the version of the Python interpreter that comes with the operating system is not the one you would like to use, or you want to use Python packages that are not installed, so what can you do?

Here, we will assume that you have a (fairly simple) script that computes a function for an array of floating point values, and that writes a line plot showing the results to a file.  The script is stored in a file `sin_plot.py`.

```
#!/usr/bin/env python

import matplotlib.pyplot as plt
import numpy as np


x = np.linspace(-2*np.pi, 2*np.pi, 501)
y = np.sin(x)

plt.plot(x, y, '-')
plt.savefig('sin.png')
```

This script requires both the numpy and matplotlib packages, so running it with the default Python interpreter is unlikely to succeed.

In [None]:
python sin_plot.py
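
If you would rather check up front which packages are missing, instead of waiting for an `ImportError`, a sketch along the following lines may help.  It relies on `importlib.util.find_spec` from the standard library; the helper name `missing_packages` is an assumption made for this example.

```python
import importlib.util


def missing_packages(names):
    """Return the names that cannot be imported by the current interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(['numpy', 'matplotlib'])
if missing:
    print('not available: ' + ', '.join(missing))
else:
    print('all required packages are available')
```

Note that this only checks whether a package can be found, not whether its version is suitable.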

As it happens, there are quite a number of options:

  1. install packages in your home directory's `.local` directory using `pip`;
     * advantages: fairly straightforward
     * disadvantages: sure to create a dependency mess later on, performance is likely to be an issue
     * conclusion: *please don't*
  1. use the module system and Python versions and packages installed by your system administrator;
     * advantages: typically excellent performance
     * disadvantages: since system administrators really can't install any and all Python packages, you
       may have to install some packages yourself anyway
     * conclusion: perfect if you have no requirements beyond the packages that are available
  1. use a package manager such as [miniconda](https://docs.conda.io/en/latest/miniconda.html)
     or [mamba](https://github.com/mamba-org/mamba)
     * advantages: you have full control over the versions of Python and all packages
     * disadvantages: unless you know what you are doing, performance may be an issue
     * conclusion: way to go if you know what you are doing
  1. use [apptainer](https://apptainer.org/) or [podman](https://podman.io/) containers
     * advantages: if you know what you are doing, you can create a reproducible environment that is
       portable across systems
     * disadvantages: more involved than the other approaches, with considerable pitfalls
     * conclusion: not for the faint of heart
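
To see whether the first option has already left traces on your account, you can inspect the per-user site-packages directory, which is where `pip install --user` puts packages.  Interpreters of the same Python version may pick up packages from that directory, which is one way a dependency mess can arise.  The sketch below uses `site.getusersitepackages()` from the standard library; the helper name `user_site_entries` is hypothetical.

```python
import os
import site


def user_site_entries():
    """List what is installed in the per-user site-packages directory."""
    user_site = site.getusersitepackages()
    if not os.path.isdir(user_site):
        # nothing has been installed with 'pip install --user' for this version
        return user_site, []
    return user_site, sorted(os.listdir(user_site))


user_site, entries = user_site_entries()
print(f'user site directory: {user_site}')
print(f'{len(entries)} entries found')
```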
    
Given that the first option is not recommended at all, and the fourth goes beyond the scope of this tutorial, you will learn how to

  * use the module system
  * install and use miniconda

### Software module system

An HPC system is almost by definition a multi-tenant system.  The users on such a system have specific requirements with respect to the software they want to use.  For instance, some may want to work with a certain version of Python, while others prefer a newer one.

To deal with this, most HPC systems use a module system that allows you to easily pick the software, and the specific version of it, that you want to use.  There is just a single command to interact with the software stack: `module`.  It has several subcommands that you will learn about below.

#### Available software

In order to get a list of the software that is available through the module system, you can use the `module available` command.  That will list all the software packages that you can use on the system.

Since this list is huge, you can be a bit more specific by providing (part of) the name of the software package you are looking for.  Note that this is case-sensitive.

In [None]:
module available Python

To run Python, the `Python/3.9.5-GCCcore-10.3.0` module sounds promising.  The name may seem a bit cryptic, but once you understand the pattern, it is easy to interpret.

The name of a module consists of several parts that provide useful information:
  * `Python` is the name of the software package;
  * `3.9.5` is the version of that package, i.e., of the Python distribution;
  * `GCCcore-10.3.0` tells you that this Python distribution has been compiled using the GCC compiler suite, version 10.3.0.
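
The naming pattern described above can be sketched in a few lines of Python.  This assumes that the name has the form `name/version-toolchain` and that the version itself contains no hyphen; module names on your system may deviate from that, so treat `parse_module_name` as an illustrative helper rather than a general parser.

```python
def parse_module_name(module_name):
    """Split 'name/version-toolchain' into its three parts."""
    # everything before the first slash is the name of the software package
    name, _, rest = module_name.partition('/')
    # the first hyphen separates the version from the toolchain
    version, _, toolchain = rest.partition('-')
    return name, version, toolchain


name, version, toolchain = parse_module_name('Python/3.9.5-GCCcore-10.3.0')
print(f'{name}: version {version}, toolchain {toolchain}')
```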

#### Using a software package

To use a software package, you simply load the corresponding module.

In [None]:
module load Python/3.9.5-GCCcore-10.3.0

You can verify that you now use the Python interpreter you expect by checking version and location of the `python` executable.

In [None]:
python --version

In [None]:
which python

Now you can run your script using this version of Python.

In [None]:
python sin_plot.py

As you can see, there is still an issue: no numpy.

#### Searching for software

While `module available` can be used to find what you are looking for if you know the name of the package *exactly*, that is not always the case.  Here, `module spider` can help you.  It also searches the metadata of the modules, and it is case-insensitive, so it is a very useful tool.

You are looking for a module that would be useful to do scientific computing with Python, so perhaps "scipy" would be a useful search term.

In [None]:
module spider scipy

This module sounds promising.  You can load it and test your script.

In [None]:
module load SciPy-bundle/2021.05-foss-2021a

Notice that the module system will sometimes substitute one module for another in order to satisfy dependencies.  Although this is usually harmless, you may want to keep an eye on the output of `module load` commands.

In [None]:
python sin_plot.py

Closer, but no cigar.  You still need matplotlib.  You can check whether it is available.  Note that you can use `av` as an abbreviation for `available`.

In [None]:
module av matplotlib

After you load that module, you can successfully run your script.

In [None]:
module load matplotlib/3.4.2-foss-2021a

In [None]:
python sin_plot.py

#### Which modules are loaded?

It can be useful to check which modules you have loaded.  You can get a list of them easily.

In [None]:
module list

You will see that many more modules are listed than the ones you loaded explicitly, such as SciPy-bundle and matplotlib.  All the other modules in the list were loaded automatically because the ones you loaded list them as dependencies.
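
A Python script can also find out which modules were loaded when it was started, for instance to record them in a log file.  The sketch below assumes that the module system exports the loaded modules as a colon-separated list in the `LOADEDMODULES` environment variable; the helper name `loaded_modules` is made up for this example.

```python
import os


def loaded_modules(environ=None):
    """Return the loaded modules listed in the LOADEDMODULES variable."""
    if environ is None:
        environ = os.environ
    value = environ.get('LOADEDMODULES', '')
    # an empty or missing variable means that no modules are loaded
    return [module for module in value.split(':') if module]


for module in loaded_modules():
    print(module)
```

When no modules are loaded, or the variable is not set, the sketch simply prints nothing.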

#### Getting rid of loaded modules

If you no longer need a loaded module, you can simply unload it.

In [None]:
module unload matplotlib

Note that you don't have to specify the version of the module.  The module system can have only one version of a software module loaded, so it will unload that one.

To get rid of all loaded modules, you can purge them.

In [None]:
module purge

As you will see later, it is good practice to purge all the modules in a job script, and to load only the ones that you require in your script, with the exact version you would like to use.

You can verify that no modules are loaded.

In [None]:
module list

#### More information

For a more systematic overview of the module system and how to use it, you can view the [tutorial](003_software_modules.ipynb) or the [documentation](http://lmod.readthedocs.org).

### Package manager: miniconda

Although using the module system guarantees that you will use a version of Python and packages that give you good performance, this approach may not be flexible enough for you.  You may want to use other versions of Python or packages than provided through the module system, or even use packages that are not provided at all.

Of course, you can ask the helpdesk to install them for you, but typically this is only done for packages that are used fairly frequently.

Using a package manager such as miniconda can help you with this issue.  Moreover, using conda environments helps you manage your dependencies and keep them sane.  With respect to reproducible computations, they are a great help as well, since you can freeze an environment for a particular project and be sure that it will run with the identical software stack at a later stage.

#### Installing miniconda

The first step is to download the miniconda installer script, and that is easy to do on the cluster itself using `wget`, a command line tool for downloading files from the web (and much more, but that is outside the scope of this tutorial).

In [None]:
 wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

You can verify that the installer was downloaded; it is a shell script with the `.sh` extension.

In [None]:
ls

It is important to install miniconda in your data directory since this directory will also contain all your environments, and this can easily run into the gigabytes of storage after a short while.  This would exceed the quota of your home directory.  You can specify the directory where you want to install using the `-p` option.

In [None]:
bash Miniconda3-latest-Linux-x86_64.sh -b -p $VSC_DATA/miniconda3
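
To keep an eye on how much space the installation and its environments take up, you could use a rough estimate such as the one sketched below.  The helper name `directory_size` is hypothetical, and the result is only an approximation: files that are hard-linked in several places are counted more than once.

```python
import os


def directory_size(path):
    """Return the total size in bytes of the regular files below path."""
    total = 0
    for root, _, files in os.walk(path):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            # skip symbolic links so that linked files are not counted
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


# VSC_DATA is the variable used in this tutorial; fall back to the current
# directory so that the sketch also runs where it is not defined
base_dir = os.environ.get('VSC_DATA', '.')
conda_dir = os.path.join(base_dir, 'miniconda3')
print(f'{conda_dir}: {directory_size(conda_dir) / 1024**3:.2f} GiB')
```

If the directory does not exist, the reported size is simply zero.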

You can make miniconda more convenient to use by adding some configuration information to the files that control your settings.  This is easy using the following command.

In [None]:
$VSC_DATA/miniconda3/bin/conda init

To make these new settings active for this notebook, you should reload your `.bashrc` file by sourcing it.

In [None]:
source ~/.bashrc

Clearly, you have to do this only once.  Now you are ready to use `conda` conveniently and create your first environment.

#### Create an environment

To create a new environment, you have to specify a name, e.g., `tutorial` for this example, and a list of packages you would like to include, e.g., `numpy`.  Of course, you also want matplotlib, but for the sake of this tutorial, you'll do that later so that you also know how to install new packages in an existing environment.

In [None]:
conda create -y -q --name tutorial numpy

#### Activating an environment

To use an environment, you have to activate it.  You can do this as follows.

In [None]:
conda activate tutorial
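
If you want a script to report which environment it runs in, the following sketch may be useful.  It assumes that `conda activate` sets the `CONDA_DEFAULT_ENV` environment variable to the name of the active environment; the helper name `active_conda_env` is made up for this example.

```python
import os
import sys


def active_conda_env(environ=None):
    """Return the name of the active conda environment, or None."""
    if environ is None:
        environ = os.environ
    return environ.get('CONDA_DEFAULT_ENV')


print(f'active environment: {active_conda_env()}')
# sys.prefix shows where the running interpreter is installed
print(f'interpreter prefix: {sys.prefix}')
```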

When you are done, you can deactivate the currently active environment very easily.

In [None]:
conda deactivate

#### Installing packages

You still need to install matplotlib.  Since you can only install packages in an active environment, make sure that the one you want to install in is active.

In [None]:
conda activate tutorial

To install matplotlib, you can use `conda install`.

In [None]:
conda install -y -q matplotlib

Note that you can install multiple packages by simply listing them, e.g., `conda install pandas seaborn`.

Now you can run `sin_plot.py` in your new `tutorial` environment.

In [None]:
python sin_plot.py

In [None]:
ls

As you can see, the PNG file containing the plot has been created successfully.
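
Listing the directory only shows that a file exists.  If you also want to check that it really is a PNG file, for instance at the end of a longer workflow, you could compare the first bytes of the file with the PNG signature.  This is a minimal sketch; the helper name `is_png` is hypothetical.

```python
from pathlib import Path

# every PNG file starts with these eight bytes
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def is_png(path):
    """Return True if the file exists and starts with the PNG signature."""
    path = Path(path)
    if not path.is_file():
        return False
    with path.open('rb') as png_file:
        return png_file.read(len(PNG_SIGNATURE)) == PNG_SIGNATURE


print(f"sin.png is a PNG file: {is_png('sin.png')}")
```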

#### Updating an environment

Sometimes you want to make sure that your environment contains the latest version of the packages you are using.  Updating an environment is straightforward, but bear in mind that older scripts may no longer work, or that previous results may not be exactly reproducible, so consider carefully before updating.

With the environment you want to update active, you can update easily.

In [None]:
conda update -y -q --all
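
Since an update can change results, it may be worth recording the package versions a computation was run with, for example by printing them next to your output.  The sketch below uses `importlib.metadata` from the standard library (Python 3.8 or later); the helper name `installed_versions` is an assumption made for this example.

```python
from importlib import metadata


def installed_versions(names):
    """Map each distribution name to its installed version, or None."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # the package is not installed in the current environment
            versions[name] = None
    return versions


for name, version in installed_versions(['numpy', 'matplotlib']).items():
    print(f'{name}: {version}')
```

Running this before and after an update shows which of the listed packages actually changed.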

#### Duplicating an environment

It can be a good idea to duplicate an existing environment as a starting point for a new project, or to ensure backward compatibility (the original environment) after an update (new environment).  This can easily be done by cloning the original environment.

In [None]:
conda create  --name tutorial_copy  --clone tutorial

#### Removing an environment

Once you are sure you no longer need an environment, you can remove it.

In [None]:
conda env remove  --name tutorial_copy

## Summary

In this tutorial you learned how to run Python scripts, either
  * using modules, or
  * by creating your own environment.
  
For the module system, you learned how to
  * list available modules using `module available`
  * search for modules using `module spider`
  * use the software package using `module load`
  * list all the modules you have currently loaded using `module list`
  * unload a module you no longer require using `module unload`
  * unload all modules, cleaning your environment, using `module purge`

If you prefer to use conda, you've learned how to
  * create a new environment using `conda create`
  * activate an environment using `conda activate`
  * deactivate an active environment using `conda deactivate`
  * install additional packages in an environment using `conda install`
  * update an environment using `conda update`
  * clone an environment using `conda create --clone`
  * remove an environment using `conda env remove`

## Where to go from here?

You can now run a Python script on the login node, but that is only useful for very short computations, i.e., scripts that run in a minute or less.  You share the login node with many other users of the HPC system, so if you run computationally intensive work there, it will impact the performance for all other users.

Your real workloads will run on the compute nodes of the HPC system, and these computations are typically run via a job script.  You can learn more about that in
  * [job scripts and the scheduler](020_jobs.ipynb).