# Chapter 3. Remedial Unix Shell

In the notebook `~/Projects/learning/linux/notebooks/my_notes.ipynb`, I have summarized some useful Linux/bash notes.

Unix lets us use computer memory efficiently through streams: rather than loading an entire dataset into memory, bioinformatics tools read data from a source and process it as it arrives. Both general Unix tools and many bioinformatics programs are designed to take input through one stream and pass output through a different stream.

The recommended Unix shell is `bash` (the Bourne-again shell). Other shells like the C shell (`csh`), its descendant `tcsh`, and the Korn shell (`ksh`) are less popular in bioinformatics. The Bourne shell (`sh`) was the predecessor of `bash`, but `bash` is newer and usually preferred.

### Redirecting Standard Output and Error

To redirect the standard output and standard error streams to separate files, we combine the `>` operator (standard output) with the `2>` operator (standard error). `2>` also has an appending form, `2>>`, which is analogous to `>>`: it appends to a file rather than overwriting it.

Note that every open file (including streams) on a Unix system is assigned an integer known as a file descriptor, which is unique within the process. Unix's three standard streams (standard input, standard output, and standard error) are given the file descriptors `0`, `1`, and `2`, respectively.

Unix-like operating systems have a special “fake” disk (known as a pseudodevice) to redirect unwanted output to: `/dev/null`. Output written to `/dev/null` disappears.
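
The redirection operators described above can be tried without any real data. The following is a minimal sketch: `emit` is a hypothetical stand-in for a program that writes to both streams, and the file names `out.txt` and `err.txt` are made up for illustration.

```bash
# Hypothetical stand-in for a real program: one line to stdout, one to stderr.
emit() {
    echo "result line"        # file descriptor 1 (standard output)
    echo "warning line" >&2   # file descriptor 2 (standard error)
}

workdir=$(mktemp -d)          # scratch directory so nothing is overwritten

# Send each stream to its own file.
emit > "$workdir/out.txt" 2> "$workdir/err.txt"

# Run again, appending to both files instead of overwriting them.
emit >> "$workdir/out.txt" 2>> "$workdir/err.txt"

# Keep standard output but discard standard error.
kept=$(emit 2> /dev/null)

out_lines=$(wc -l < "$workdir/out.txt" | tr -d ' ')
err_lines=$(wc -l < "$workdir/err.txt" | tr -d ' ')
echo "stdout file: $out_lines lines; stderr file: $err_lines lines; kept: $kept"

rm -r "$workdir"
```

Because the second call appends, each file should end up with two lines, while the `2> /dev/null` call keeps only the standard output line.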

`tail` can also be used to continuously monitor a file with `-f` (for follow). As the monitored file is updated, `tail` displays the new lines on your terminal screen, rather than displaying the last 10 lines and exiting as it would without this option.

In [1]:
%%bash
cd ../test-project/data/seqs/
pwd
ls -l zmaysA_R1.fastq zmaysA_R3.fastq > listing.txt 2> listing.stderr
ls -l
echo
cat listing.txt
cat listing.stderr

bash: line 1: cd: ../test-project/data/seqs/: No such file or directory


/home/payman/Projects/learning/learning-bioinformatics/bioinfx-data-skills-book/notebooks
total 532
-rw-rw-r-- 1 payman payman 208764 Nov  8 21:19 bioinfx_skills.ipynb
-rw-rw-r-- 1 payman payman   3198 Nov  8 21:11 chapter01.ipynb
-rw-rw-r-- 1 payman payman  20354 Nov  8 21:19 chapter02.ipynb
-rw-rw-r-- 1 payman payman     72 Nov  8 21:19 chapter03.ipynb
-rw-rw-r-- 1 payman payman 294242 Jul  1 21:45 foundations_of_git.html
drwxrwxr-x 2 payman payman   4096 Oct 22 08:23 images
-rw-rw-r-- 1 payman payman    126 Nov  8 21:20 listing.stderr
-rw-rw-r-- 1 payman payman      0 Nov  8 21:20 listing.txt
-rw-rw-r-- 1 payman payman   1039 Nov  8 21:19 notebook.md

ls: cannot access 'zmaysA_R1.fastq': No such file or directory
ls: cannot access 'zmaysA_R3.fastq': No such file or directory


### Using Standard Input Redirection

You can read standard input directly from a file with the `<` redirection operator. 

```bash
$ program < inputfile > outputfile
```

It's a bit more common to use Unix pipes (e.g., `cat inputfile | program > outputfile`) than `<`.
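
As a rough illustration that the two forms are equivalent, the sketch below runs `tr` (standing in for `program`) both ways on a made-up two-line input file and compares the results. The file names are assumptions for the example only.

```bash
workdir=$(mktemp -d)
printf 'acgt\nttga\n' > "$workdir/input.txt"   # made-up input

# Read standard input from a file with `<` ...
tr 'acgt' 'ACGT' < "$workdir/input.txt" > "$workdir/via_redirect.txt"

# ... or feed the same file through a pipe.
cat "$workdir/input.txt" | tr 'acgt' 'ACGT' > "$workdir/via_pipe.txt"

# cmp -s exits with status 0 when the two files are identical.
if cmp -s "$workdir/via_redirect.txt" "$workdir/via_pipe.txt"; then
    same=yes
else
    same=no
fi
first_line=$(head -n 1 "$workdir/via_redirect.txt")

echo "identical output: $same (first line: $first_line)"
rm -r "$workdir"
```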

### The Almighty Unix Pipe: Speed and Beauty in One

Rather than redirecting a program’s standard output stream to a file, pipes redirect it to another program’s standard input.

In many cases, creating a file would be helpful for checking intermediate results and possibly debugging steps of your workflow, so why not do this every time? The answer often comes down to computational efficiency: reading from and writing to the disk is very slow. We use pipes in bioinformatics (quite compulsively) not only because they are a useful way of building pipelines, but because they're faster (in some cases, much faster). Modern disks are orders of magnitude slower than memory. For example, it takes only about 15 microseconds to read 1 megabyte of data from memory, but 2 milliseconds to read 1 megabyte from a disk.

%%bash
cd ../test-project/data/seqs/

# `-v` inverts the match, so lines starting with `>` (FASTA headers) are dropped
# `^` anchors the regular expression to the start of the line
# the pattern [^ATCG] matches any character that's not A, T, C, or G
# `-i` ignores character case
# `--color` highlights the matching non-nucleotide characters
grep -v '^>' tb1.fasta | \
grep --color -i '[^ATCG]'
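
Since `tb1.fasta` may not be at hand, here is a self-contained sketch of the same two-stage pipeline run on a tiny made-up FASTA file (the records and the file name `example.fasta` are invented for illustration).

```bash
workdir=$(mktemp -d)
# Made-up FASTA records; the second sequence line contains Y and N.
printf '>seq1\nATCGATCG\n>seq2\nATCGYNTA\n' > "$workdir/example.fasta"

# Drop header lines, then keep only lines containing a non-ATCG character.
bad_lines=$(grep -v '^>' "$workdir/example.fasta" | grep -i '[^ATCG]')

# Same pipeline, but count the offending lines instead of printing them.
bad_count=$(grep -v '^>' "$workdir/example.fasta" | grep -c -i '[^ATCG]')

echo "$bad_count line(s) with unexpected characters: $bad_lines"
rm -r "$workdir"
```

No intermediate file is written between the two `grep` calls; the header-free sequence lines flow straight from one process to the next.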

### Combining Pipes and Redirection

```bash
program1 input.txt 2> program1.stderr | \
program2 2> program2.stderr > results.txt
```

`program1` processes the `input.txt` input file and then outputs its results to standard output. `program1`’s standard error stream is redirected to the `program1.stderr` logfile. Meanwhile, `program2` uses the standard output from `program1` as its standard input. The shell redirects `program2`’s standard error stream to the `program2.stderr` logfile, and `program2`’s standard output to `results.txt`.

When we want to merge standard error into standard output, for example to search both streams for a term, we use the `2>&1` operator:

```bash
program1 2>&1 | grep "error"
```

In [None]:
%%bash
ls payman 2>&1 | grep -i "No"
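
To see what `2>&1` changes, the following sketch compares a pipeline with and without it. `noisy` is a hypothetical function standing in for a program that reports progress on standard output and problems on standard error.

```bash
# Hypothetical program: progress on stdout, problems on stderr.
noisy() {
    echo "processed 10 reads"
    echo "error: malformed record" >&2
}

# Without 2>&1, only standard output travels through the pipe, so grep finds nothing.
# (2> /dev/null keeps the unmatched error line off the terminal;
#  `|| true` is there because grep exits nonzero when nothing matches.)
stdout_only=$(noisy 2> /dev/null | grep -c "error" || true)

# With 2>&1, standard error is merged into standard output before the pipe.
merged=$(noisy 2>&1 | grep -c "error")

echo "matches without 2>&1: $stdout_only; with 2>&1: $merged"
```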

Sometimes we need to save a program's intermediate output to a file while also passing its standard output on to another program. To do so, we can use the Unix `tee` command:

```bash
program1 input1.txt | tee intermediate-file.txt | program2 > results.txt
```

Briefly, `tee` reads from standard input and writes what it reads both to the named file and to standard output.
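
A small runnable sketch of this pattern, assuming made-up tab-separated records: `printf` plays the role of `program1`, `grep -c` plays the role of `program2`, and `intermediate.txt` is a hypothetical file name.

```bash
workdir=$(mktemp -d)

# Stage 1 (printf) emits made-up records, tee saves a copy,
# and stage 2 (grep -c) counts the chr1 records.
n_chr1=$(printf 'chr1\t100\nchr2\t200\nchr1\t300\n' \
    | tee "$workdir/intermediate.txt" \
    | grep -c '^chr1')

# The intermediate file holds everything that flowed through the pipe.
saved_lines=$(wc -l < "$workdir/intermediate.txt" | tr -d ' ')

echo "chr1 records: $n_chr1; lines saved by tee: $saved_lines"
rm -r "$workdir"
```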

### Managing and Interacting with Processes

We can tell the Unix shell to run a program in the background by appending an ampersand (`&`) to the end of our command. For example:

```bash
program1 input.txt > results.txt &
```

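When a command is started this way in an interactive shell, the shell prints a job ID in square brackets followed by a number. That number is the process ID, or PID, of `program1`.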

We can check what processes we have running in the background with `jobs`. To bring a background process into the foreground again, we can use `fg` (for foreground). `fg` will bring the most recent process to the foreground. If you have many processes running in the background, they will all appear in the list output by `jobs`. The numbers like `[1]` are job IDs (which are different from the process IDs your system assigns to your running programs). To return a specific background job to the foreground, use `fg %<num>`, where `<num>` is its number in the job list. If we wanted to return `program1` to the foreground, both `fg` and `fg %1` would do the same thing, as there's only one background process.

Note that although jobs run in the background and seem disconnected from our terminal, closing our terminal window would cause these processes to be killed.
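
The sketch below shows the non-interactive side of this: starting a background process, capturing its PID, listing jobs, and waiting for it to finish. `sleep` is only a stand-in for a long-running program; `fg` and `bg` themselves need an interactive shell with job control, so they are not used here.

```bash
# `sleep` stands in for a long-running program.
sleep 1 &
bg_pid=$!            # $! holds the PID of the most recent background process

# List the background jobs known to this shell while the job is still running.
jobs

# Block until the background process finishes and collect its exit status.
wait "$bg_pid"
bg_status=$?

echo "background PID $bg_pid finished with exit status $bg_status"
```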

To place a process that is already running in the foreground into the background, we first need to suspend the process and then use the `bg` command to resume it in the background. Suspending a process temporarily pauses it, allowing you to put it in the background. We can suspend processes by sending a stop signal through the key combination `Control-z`. With our imaginary `program1`, we would accomplish this as follows:

```bash
$ program1 input.txt > results.txt # forgot to append ampersand
# enter control-z
[1]+  Stopped                 program1 input.txt > results.txt
$ bg
[1]+ program1 input.txt > results.txt &
```

### Exit Status: How to Programmatically Tell Whether Your Command Worked

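When a program finishes, it returns an exit status to the shell: by convention, 0 indicates success and any nonzero value indicates failure. The exit status isn't printed to the terminal, but your shell stores it in a variable named `$?`. We can use the `echo` command to look at this variable's value after running a command: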

```bash
$ program1 input.txt > results.txt
$ echo $?
0
```

The shell provides two operators that run a command conditionally, depending on the exit status of the previous command:
- one operator that runs the subsequent command only if the first command completed successfully (`&&`).
- one operator that runs the next command only if the first completed unsuccessfully (`||`).

Here is an example: the shell operator `&&` executes the subsequent command only if the previous command completed with a zero (successful) exit status:

```bash
$ program1 input.txt > intermediate-results.txt && \
    program2 intermediate-results.txt > results.txt
```

Using the `||` operator, we can have the shell execute a command only if the previous command has failed (exited with a nonzero status). This is useful for warning messages:

```bash
$ program1 input.txt > intermediate-results.txt || \
    echo "warning: an error occurred"
```

If you don’t care about the exit status and you just wish to execute two or more commands sequentially, you can use a single semicolon (`;`).
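
The behavior of `$?`, `&&`, `||`, and `;` can be checked with the standard utilities `true` and `false`, which do nothing except exit with status 0 and 1, respectively. This is a minimal sketch; the messages are made up.

```bash
# `true` always exits with status 0; `false` always exits with status 1.
true
status_ok=$?
false
status_fail=$?

# && runs the right-hand command only after success (exit status 0).
and_result=$(true && echo "ran after success")
and_skipped=$(false && echo "never printed")

# || runs the right-hand command only after failure (nonzero exit status).
or_result=$(false || echo "warning: an error occurred")

# ; runs the next command regardless of the previous exit status.
seq_result=$(false; echo "ran anyway")

echo "$status_ok $status_fail | $and_result | $or_result | $seq_result"
```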


### Managing Processes with `ps` and Brief Look at Memory Management

Programs that aren't writing to output files or actively logging what they're doing require a different way of monitoring their activity, and the two most common programs for this are `top` and `ps`.

`ps` stands for *process status*; it reports the status of running processes. It is useful to run it with the `aux` options, which display processes for all users (`a`), add a column indicating the user (`u`), and include processes that are running even if they weren't started from a terminal (`x`).

In [3]:
%%bash
ps -aux | head -n5

USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.0  0.0 167100 11652 ?        Ss   09:29   0:02 /sbin/init splash
root           2  0.0  0.0      0     0 ?        S    09:29   0:00 [kthreadd]
root           3  0.0  0.0      0     0 ?        S    09:29   0:00 [pool_workqueue_release]
root           4  0.0  0.0      0     0 ?        I<   09:29   0:00 [kworker/R-rcu_g]


Columns in `ps -aux`:

- `USER`: user running the process
- `PID`: process ID
- `%CPU`: percentage of CPU used
- `%MEM`: percentage of physical memory used
- `VSZ`: virtual memory size (in kilobytes)
- `RSS`: resident set size (in kilobytes)
- `TTY`: controlling terminal
- `STAT`: process state code
- `START`: time command was started (sometimes this is STARTED)
- `TIME`: time running
- `COMMAND`: command that started process

`VSZ` and `RSS` are the columns of most interest to bioinformaticians. Occasionally your system runs low on physical memory (RAM), and your operating system does its best to manage. Unfortunately, the only way your operating system can give a process more memory than is physically available is by taking a chunk of less-used memory, writing it to disk (this is slow!), and then using that now-free page for your process. Since you're swapping an in-memory page for one on a hard drive, the part of your disk set aside for this activity is called *swap space*. If you run out of physical memory, your processes will be forced to start swapping memory to the hard disk, and hard disks are really slow. High-memory tasks like assembly (and in some cases alignment) on machines with insufficient physical memory will bring even the fastest machines to a standstill.

With this information, `VSZ` and `RSS` will now make more sense. `VSZ` is the amount of virtual memory a process has allocated, and `RSS` is the amount of physical memory it currently occupies. Virtual memory includes both swap and physical memory, so `VSZ` is at least as large as `RSS`. `ps aux` gives you a quick glance at these values, but the way an operating system allocates memory can be a very baffling process to decode. When we want to interrogate our processes to see which are using the most memory, CPU, or swap space, `ps` becomes less useful and `top` becomes our tool of choice. But remember, for finding particular processes, `ps` and `grep` combine well.

### Interacting with Processes Through Signals: Using `kill`

You can terminate a running process or change its priority using the Unix tools `kill` and `renice`, respectively. By default, `kill` sends a `SIGTERM` signal to the process.

```bash
$ ps aux | grep "program"  # we can obtain PID from this command
$ kill <pid>
```

Programs can choose how they wish to handle a `SIGTERM`. Some programs could even ignore a termination signal entirely, although this is not common practice with most programs we'll use. Still, sometimes you'll need to send a more forceful signal like `SIGKILL`, which can't be ignored by programs. To specify the signal with `kill`, we use `kill -s SIGKILL <pid>`.
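
Both signals can be tried safely on a throwaway process. In this sketch `sleep` stands in for a long-running program, and the exit status reported by `wait` follows the shell convention of 128 plus the signal number. The signal is written as `KILL`, a portable spelling; bash's `kill` also accepts `SIGKILL`.

```bash
# `sleep` stands in for a long-running program.
sleep 30 &
pid=$!
kill "$pid"                       # default signal is SIGTERM (15)
wait "$pid" 2> /dev/null
term_status=$?                    # 128 + 15 = 143

sleep 30 &
pid=$!
kill -s KILL "$pid"               # SIGKILL (9) cannot be caught or ignored
wait "$pid" 2> /dev/null
kill_status=$?                    # 128 + 9 = 137

echo "after SIGTERM: $term_status; after SIGKILL: $kill_status"
```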

### Prioritizing Processes: Using `nice` and `renice`

The `nice` value of a process ranges from -20 to 20 (19 on some systems), where a lower nice value gives a process more priority. A very high nice value like 19 tells your operating system that this process is pretty low priority, so it should run it whenever resources are available. Note that the `nice` value only affects how much CPU priority a process gets. Memory or disk-bound processes will not gain much from getting a lower `nice` value.

```bash
$ nice -n 10 gzip zmaysA_R1.fastq
```

This runs the command `gzip zmaysA_R1.fastq` with its `nice` value raised from the default of 0 to 10. For a process that is already running, we can adjust the `nice` value with `renice`, which takes a `nice` value and a process ID: `renice 10 <pid>` sets the `nice` value of the process with ID `<pid>` to 10. As more cores are packed into modern CPUs, the CPU is less likely to be the bottleneck than the disk or memory.
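
A quick way to observe the effect of `nice -n` without compressing anything is to let `nice` report on itself: run with no arguments, it prints the niceness it was started with. The increment of 5 below is an arbitrary choice for illustration.

```bash
# Running `nice` with no arguments prints the niceness of the current shell.
base=$(nice)

# Run a command (here `nice` itself, so it reports its own value)
# with its niceness raised by 5 relative to the current shell.
raised=$(nice -n 5 nice)

echo "current niceness: $base; child started with nice -n 5: $raised"
```

Note that the value given to `nice -n` is an increment relative to the current niceness, which is why the example reads the starting value first.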

### Monitoring Disk Input and Output

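Run without any arguments, the Linux (`sysstat`) version of `iostat` prints a single report of CPU usage and per-device I/O statistics averaged since the system booted. To generate reports continually at 10-second intervals (until we exit with Control-C), we can use `iostat 10`; adding a count, as in `iostat 10 3`, stops after three reports.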

If it's unclear which processes are causing increased I/O, another useful Linux program is `iotop`. `iotop` updates at a fixed interval and shows which processes have the highest disk I/O usage.


### Disk Usage

As disks fill up, they also become more fragmented, meaning that data is written to the disk in non-consecutive chunks. Disk fragmentation leads to slower disk performance; with disks pushing 80% full, you not only run the risk of filling them up during data processing, but even tasks that don't require much disk space will suffer. Thus, it's useful to monitor disk usage periodically. The two tools used to look at disk usage are `df` and `du`. The first, `df`, simply gives you a terminal-based display of your disk usage, broken down by volume:

In [None]:
%%bash
df -h

One command that's useful in finding large files is `du`, which recursively reports the disk space used by each directory beneath the one it's given (or the current directory if none is given). For example, if you suspect that there are some large files in a project directory named `~/Projects/tarsier_genes`, you could use `du` to find which directories contain the largest files:

In [None]:
%%bash
du -h /home/payman/Projects/learning/bioinfx_data_skills/notes/
du /home/payman/Projects/learning/ | sort -r -n | head -n 5
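
The `du | sort -r -n | head` pattern can be tested on a throwaway directory tree. This is a sketch under assumed names: the `alignments` and `notes` directories and their contents are made up. Sizes are requested in kilobytes with `-k` because human-readable sizes from `-h` do not sort correctly with `sort -n`.

```bash
workdir=$(mktemp -d)
mkdir "$workdir/alignments" "$workdir/notes"      # hypothetical project layout

# Fill one directory with about 1 MB of random bytes and the other with one short line.
dd if=/dev/urandom of="$workdir/alignments/reads.bin" bs=1024 count=1024 2> /dev/null
echo "todo" > "$workdir/notes/readme.txt"

# -k reports sizes in kilobytes so sort -n can compare them numerically.
largest=$(du -k "$workdir"/* | sort -r -n | head -n 1 | cut -f 2)

echo "largest directory: $largest"
rm -r "$workdir"
```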