## Introduction to SSH remote connections

This notebook contains only instructions for running code in a terminal, so you do not need to execute any of its cells. Simply read along with a terminal open and run the commands there as instructed.

### What is SSH?
The command `ssh` stands for secure shell, and it provides a secure, encrypted way to exchange communications between different computers. You already rely on a similar kind of encrypted connection (HTTPS, not `ssh` itself) every day when you visit a website that asks you to enter a password. Here we will learn to set up an SSH connection to the remote high performance computing (HPC) cluster at Columbia University so we can use the cluster as a remote Linux computing environment.

### What is the remote computer?
`ssh` can be used to connect to any remote computer for which we have login credentials, and which has a public `IP address` allowing it to be reached through the internet. An IP address is simply a string of numbers that identifies a specific computer. Similarly, a computer can have a `hostname`, which works like an *alias* for the IP address, but is a string of characters (a name) instead of numbers, and so is easier to remember. We'll be using the `hostname` of the computer that we will connect to, called *Habanero*, which is `habanero.rcs.columbia.edu`. We'll discuss in much more detail later what the Habanero cluster is and how to use it.

### Logging in for the first time

To log in to a computing cluster you will typically first need to sign up so that the administrator creates an account for you. I've already done this for you, so you should be able to log in using your Columbia UNI and password. The standard command to log in to a remote computer using `ssh` is the following: `ssh <username>@<hostname>`. Open a terminal and try logging in with your credentials as shown below. Once you are logged in you will see a different address to the left of your cursor in the terminal (e.g., instead of `deren@oud` it now shows `de2356@holmes`).

![../Lecture/ssh-habanero1.gif](../Lecture/ssh-habanero1.gif)

### Now logout
To disconnect from the `remote` cluster you could just close the terminal and it would disconnect, or you can type `exit` and it will exit the connection and return to your `local` terminal. For the rest of this notebook I will refer to the terminal that is connected to Habanero as the `remote`, and to a terminal running on your local laptop as `local`.  

Type `exit` in the `remote` terminal to return to `local`. 

### Login faster with a config file
As with other command line programs we've used, after you use `ssh` for the first time it creates a hidden directory in your home directory called `.ssh/`. It *starts with a dot*, just like other hidden files and directories that we've seen before. There will be a few files in this folder containing information, such as encryption keys, that `ssh` uses so that the two computers can verify each other's identity. We're going to add another file to this directory that is useful for connecting to our cluster more easily, and is simply called `config`. Create this file exactly like in the gif below using a text editor such as nano or Sublime Text.

The format of the config file is the following: 

```bash
Host <name of the remote>
    HostName <remote hostname>
    User <your-username>
```

Once this file is created we will be able to log in much more easily by simply typing `ssh habanero` instead of the longer command above. This is nice because there is less to remember, and because, as we will see later, we will eventually write longer, more complex ssh commands, so it helps to have simplified the login by storing our username and the hostname.

![../Lecture/ssh-habanero2.gif](../Lecture/ssh-habanero2.gif)
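As a concrete sketch of the format above, the snippet below writes a filled-in entry to a temporary file so you can inspect it before touching your real config. The alias `habanero` and the hostname come from this notebook; the username `de2356` is only a placeholder for your own UNI, and the variable name `demo_config` is made up for this example.

```bash
# Write a sample entry to a scratch file (nothing in ~/.ssh is modified here).
demo_config="$(mktemp)"

cat > "$demo_config" <<'EOF'
Host habanero
    HostName habanero.rcs.columbia.edu
    User de2356
EOF

# Show what would go into ~/.ssh/config
cat "$demo_config"
```

If the entry looks right, its three lines can be pasted into `~/.ssh/config` with your own UNI in place of the placeholder. `ssh` can refuse a config file that other users are able to write to, so restricting it with `chmod 600 ~/.ssh/config` is a reasonable precaution.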

### Computing clusters
Now that we have a simple command to log in to the cluster, let's talk about what the cluster is and why you would want to use it. The Habanero cluster, like all large computing clusters, is a network of connected computers that can be used to distribute computational work across the CPUs and RAM (processing power and memory) of multiple machines. And like all of the largest computing clusters, these machines are all running Linux. This is because Linux has the simplest and most secure methods for networking computers, it is lightweight and takes up little space on each computer, it is free and open source, and most scientific software is written to run on Linux. Because throughout this class we've been working in Linux-like environments already (e.g., OSX or GitBash), this should feel familiar.

### Cluster structure (head nodes, job nodes)
HPC clusters are organized such that when you log in you are connected to a `head node`, sometimes called the `login node`, and from there you can submit jobs to be run on the `compute nodes`, sometimes called `job nodes`. **The `head node` is not intended to be used for computing**, and you will get scolded pretty quickly if you try running any kind of intensive job on it. This node is shared by all users and is meant only as a landing pad where you do simple things like view files, move or copy them, and write scripts to run your jobs. Instead, when you are ready to run your intensive computing job, **a job script must be submitted to run on a `job node`**. In this way, the proper amount of resources can be allocated to each job, and no one person can use up too much of the shared resources.

### The Habanero cluster
The [documentation](https://confluence.columbia.edu/confluence/display/rcs/Habanero+HPC+Cluster+User+Documentation) for the Habanero cluster is a great place to start to understand how it works, and how it is organized. The "Getting Started" section explains how to write your first job script, which we will repeat here, but I recommend going through the full documentation to learn how to use the habanero cluster in more detail. 

### User accounts
As you will read in the "Getting Started" section of the documentation, there are many different accounts on the cluster and each is allocated a different amount of resources. In total there are 302 compute nodes on Habanero, and on average each has 24 CPU cores and 128 GB of RAM, meaning the total processing power on the cluster is ~302 x 24 = 7,248 computing cores. To put that in perspective, your laptop likely has 2-4 cores and ~4-16 GB of RAM. We have access to the `edu` account for this class, which has a few dozen nodes.
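The arithmetic above can be checked directly in the terminal with shell integer math. The numbers are the per-node averages quoted in this section, so the totals are approximations; the variable names are just for illustration.

```bash
nodes=302             # compute nodes on Habanero
cores_per_node=24     # average cores per node
ram_per_node_gb=128   # average RAM per node, in GB

total_cores=$((nodes * cores_per_node))
total_ram_gb=$((nodes * ram_per_node_gb))

echo "total cores: $total_cores"
echo "total RAM:   $total_ram_gb GB"
```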

### Job allocations
To run a job we will write a job script using a specific format that is recognized by scheduling software called SLURM. This software compares how many resources are available with how many we are requesting, and allocates resources to us based on whether they are available and whether we have overused our allotted quota. It does not only allocate entire nodes to users, but can slice nodes into smaller pieces, so that if 24 users each asked for only a single core, one 24-core node could serve all 24 users.

### File structure & organization
Before we start running jobs it is important that we take a moment to get familiar with the new environment that we have logged into on the remote computer. Remember, as we learned in the first few lectures, when you are in a terminal you should always be thinking "where am I?", and "where are my files?". We can ask these questions using commands like `pwd` to print our current directory, and `ls` to see the files in any given directory. 

Unlike when working on your own local computer, the file system on the computing cluster is enormous, and includes the files for many other users in addition to your own. You will not be able to see most other users' files because their permission settings exclude you from peeking into their stuff.



### Home and Scratch


#### HOME
When you log in you will be located in your home directory. This is just like your home directory on your own computer, although in this case the naming will follow Linux conventions. The files are structured relative to the `root` location. When we call `pwd` you should see something like `/rigel/home/de2356`. This means that the root has a directory called `rigel/` and within `rigel` there is a directory called `home`. Unlike our own systems where there is typically only one user in the home directory, this home directory will have hundreds of users, each referred to by their UNI. It is good practice to try to keep your home directory organized by creating new folders in it (e.g., `mkdir scripts/`) and keeping files in those folders.

#### Scratch
Another important location on a cluster is a space called `scratch/`. You can think of it like having an external hard drive connected to your own computer. It is a place for storing large files and keeping them separate from your normal disk space. The `scratch` space is meant for storing large data files temporarily, while you are working on them, and files there are often automatically deleted if they have not been changed in 30 days or so, as a way to ensure there is plenty of space for all users. Unlike an external hard drive connected to your laptop, the scratch space is designed to be super fast and efficient for moving big data around, and so it is a good thing to use. Your home directory will often have only a very limited amount of disk space allotted to it, so you often cannot store large genomic data files in home and instead need to use a space like scratch. On Habanero you will have a scratch space in `/rigel/edu/w4050/users/` which has the capacity to hold ~1 TB of data.
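Because scratch files may be purged once they go unmodified for a while, it can help to list files that are getting old. Below is a minimal sketch using the standard `find` option `-mtime`; the helper name `list_stale` and the 30-day default are assumptions for illustration, not Habanero's actual purge policy. The demo runs on a throwaway directory and uses GNU `touch -d`, which is available on Linux.

```bash
# list_stale DIR [DAYS]: print files under DIR last modified more than DAYS days ago
list_stale() {
    find "$1" -type f -mtime +"${2:-30}"
}

# Demo on a temporary directory with one fresh and one back-dated file
demo_dir="$(mktemp -d)"
touch "$demo_dir/new.txt"
touch -d "40 days ago" "$demo_dir/old.txt"   # GNU touch syntax

stale="$(list_stale "$demo_dir")"
echo "$stale"
```

On the cluster you would call `list_stale` on your own scratch directory instead of `demo_dir`, and `du -sh <directory>` reports how much space a directory is using.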


### Submit first job (`sbatch`, `squeue`)

Following the instructions from the "Getting Started" guide, let's try submitting our first script to run on Habanero. The important format here is that we have a `shebang` line at the top that tells the system this file should be executed as a shell script (`#!/bin/sh`), followed by several lines that start with the string `#SBATCH` and tell the SLURM scheduler how to allocate resources for our job. Here we must tell it the account we have permissions to use, which for us is the `edu` account. Finally, at the end we execute some code (the actual job), a few lines of bash commands.

In the GIF below I use `nano` to create the job submission file by copying and pasting the code below, and I save it in a directory called `scripts/`. Then I use the command `sbatch` to submit the job to SLURM. Once submitted, you can check the status of your jobs using `squeue`, and the argument `-u <username>` will show only your submitted jobs. 

![../Lecture/ssh-habanero3.gif](../Lecture/ssh-habanero3.gif)

```bash
#!/bin/sh
#SBATCH --account=edu        # The account name for the job.
#SBATCH --job-name=PDSB      # The job name.
#SBATCH -c 1                 # The number of cpu cores to use.
#SBATCH --time=1:00          # The time to run the job (here, 1 min)
#SBATCH --mem-per-cpu=1gb    # The memory the job will use per cpu core.

echo "Hello World"
sleep 10
date
```
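For reference, the submission steps from the GIF look roughly like the sketch below. It assumes the job script was saved as `scripts/hello-world.sbatch` (the name used for this first script later in the notebook) and guards the calls so that the snippet just prints a reminder when run on a machine without SLURM.

```bash
job_script="scripts/hello-world.sbatch"   # assumed file name and location

if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"     # submit the job to the scheduler
    squeue -u "$USER"        # list only your own queued and running jobs
else
    echo "sbatch not found: run these commands on the cluster head node"
fi
```

Unless the script says otherwise, SLURM writes the job's printed output to a file named `slurm-<jobid>.out` in the directory you submitted from.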

## Installing software on HPC
One of the largest difficulties of working on a remote cluster is that you arrive there to a fresh slate, without any of the software and files that are on your local computer, and in most instances you do not have permission to install packages on your own, since you are not the administrator of the system.

There are two ways to get software onto the remote cluster: (1) ask the administrator to install it for you and then load it from the system-wide modules, or (2) install it *locally*, like we have been doing so far with our conda software. I'll first show you how to load system-wide modules, so that you know how to do so, but in practice I recommend using locally installed software since it gives you much greater control over the versions of software you are using.

### Loading modules

There is typically a large range of software tools available on an HPC system, but none of them are made available to you by default. Instead, you request only the resources that you plan to use. Some details on how this works for habanero in particular can be found [here](https://confluence.columbia.edu/confluence/display/rcs/Habanero+-+Software). The main commands to know are `module avail` which lists the available software packages that can be loaded, and `module load <package>` which loads the designated package into your $PATH so that you can use it. The `module load` commands should be placed in your submission scripts. A quick example is shown below. 

![../Lecture/ssh-habanero5.gif](../Lecture/ssh-habanero5.gif)


### Conda installation
Using conda will allow us to recreate a software environment on the remote system that is identical to the one on our local systems. This can be a much easier alternative to using `module load` commands to load many software packages individually, especially if the software you want is not already available.

Since we are all installing miniconda into a Linux environment this time it should be much easier than before when we were installing into many different kinds of operating systems. Technically we should write a job script and submit it to run this code, but since it runs fast and is not CPU intensive you can go ahead and just run the code below on the head node. Run each of the cells below one at a time by copying and pasting into the remote terminal.

The first line of code downloads the Miniconda installer. 

```bash
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
```

The next line of code runs the installer to create a new directory in your home directory called `miniconda3/`, and installs Python 3 and some basic packages along with it. This is just like we did before on our own computers. The `-b` flag tells it to run in batch mode so that you do not need to answer yes to a bunch of questions.

```bash
bash Miniconda3-latest-Linux-x86_64.sh -b
```

The command below appends a line to the `.bashrc` file in your remote home directory. The `.bashrc` file is the Linux equivalent of the `.bash_profile` file on OSX or GitBash. This file is run automatically when you connect to the system, so the added line makes our miniconda software always available and at the front of our `$PATH` variable.

```bash
echo 'export PATH=/rigel/home/'$USER'/miniconda3/bin:$PATH' >> ~/.bashrc
```

Finally, the command below will update our environment by running the code in `.bashrc` so that we can start using conda right away. Calling `conda info` will print information about our installation to the screen (stdout).

```bash
source ~/.bashrc
conda info
```
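To see why putting miniconda at the *front* of `$PATH` matters, here is a small sketch. The shell searches the `$PATH` entries from left to right, so the first directory containing a program wins. The variables `demo_path` and `first_entry` are made up for illustration, and `$HOME` stands in for `/rigel/home/<your UNI>` on the cluster.

```bash
# Build a PATH-like string with miniconda prepended (no change to your real PATH)
demo_path="$HOME/miniconda3/bin:$PATH"

# Everything before the first ':' is the directory searched first
first_entry="${demo_path%%:*}"
echo "searched first: $first_entry"

# Show which conda your current shell would actually run, if any
command -v conda || echo "conda is not on PATH yet"
```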

### Install *locally* from source (e.g., GitHub code)
Another way to install software on HPC is to install it *locally* from source. Instructions for installing from source often assume a system-wide install location. If so, they will need to be modified to install into a directory that you own instead, and that directory will need to be in your `$PATH`. We won't go into that type of installation in detail for now.

But for Python code we have an easy shortcut for this kind of local installation: we can install code into our miniconda directory. For example, running `pip install` on a Python package cloned from GitHub works, because the `pip` that comes with miniconda installs into the miniconda directory by default. As an example, let's install our `helloworld` assignment from GitHub by cloning the repo and then running `pip install`.

To keep my directory organized I create a new directory called `PDSB/` and will put our class repos in there. Execution of the code below is demonstrated in the following GIF (it runs a bit slowly since we're on the head node): 

```bash
## this code is run in the GIF below
mkdir PDSB/
cd PDSB/
git clone https://github.com/programming-for-bio/helloworld
cd helloworld
pip install .
cd
helloworld -n $USER
```

![../Lecture/ssh-habanero4.gif](../Lecture/ssh-habanero4.gif)

In this same way you can copy and run any of our future GitHub repos for this class on the remote cluster using `git clone`. 

### Submit another job

Let's now submit a job that uses software we have installed ourselves into our local miniconda directory. To save you from having to write it out, I've saved a script in a shared scratch space that we have for this class. To test it out, do the following:

+ Create a new directory for output files called `outputs/`.
+ Use the `cp` command below to copy the new `hello-world2` script to your `scripts` dir.
+ Use `cat` to look at the new `hello-world2` script.

```bash
## new outputs directory
mkdir outputs/

## copy the helloworld submission script from our shared scratch space
cp /rigel/edu/w4050/files/hello-world2.sbatch scripts/

## print it to stdout
cat scripts/hello-world2.sbatch
```

Compared to the first `hello-world.sbatch` script this new one has a number of new arguments to SLURM. These include: 

+ added a `--workdir` argument to designate an output directory (newer SLURM releases call this option `--chdir`).
+ added a `--output` argument to designate output files to be named `slurm-{jobname}-{jobid}.out`.
+ changed the bash commands at the end to call our `helloworld` program.


### Submit the job script and read the output from outputs/
Finally, use `sbatch scripts/hello-world2.sbatch` to submit the new job script. 

In the case below, the queue was full and so it took about 10 seconds before my job would start. You can see that I used `squeue` to check its status. The first few times it was listed as `(Resources)`, meaning the scheduler was waiting for resources to become available. After around 10 seconds the job started and the queue showed which node the job was assigned to. 

![../Lecture/ssh-habanero6.gif](../Lecture/ssh-habanero6.gif)
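Once the job has finished, its output can be read from `outputs/`. The loop below is a small sketch that assumes the files follow the `slurm-{jobname}-{jobid}.out` naming described above; it prints each file it finds and simply reports zero files if the job has not written anything yet.

```bash
found=0
for f in outputs/slurm-*.out; do
    [ -e "$f" ] || continue    # the pattern stays literal when nothing matches
    found=$((found + 1))
    echo "== $f"
    cat "$f"
done
echo "$found output file(s) found"
```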


### SLURM arguments
The SLURM arguments can be modified to request more resources in very detailed ways. Many of those details can be found through good documentation like [here](https://slurm.schedmd.com/sbatch.html), or using google to search for specific questions. 