# Chapter 7: Unix Data Tools

### Unix Data Tools and the Unix One-Liner Approach: Lessons from Programming Pearls

The speed and power of this approach is why it’s a core part of bioinformatics work. For example, the following one-liner returns the $K$ most common words in a file (here $K = 10$, set by `sed 10q`):

```bash
cat input.txt | tr -cs A-Za-z '\n' | tr A-Z a-z | sort | uniq -c | sort -rn | sed 10q
```

### When to Use the Unix Pipeline Approach and How to Use It Safely

Unix tools provide a fast, low-level data manipulation toolkit to:
- Explore data
- Transform data between formats
- Inspect data for potential problems

For larger, more complex tasks it’s often preferable to write a custom script in a language like Python (or R if the work involves lots of data analysis).

Lengthy Unix pipelines can be more fragile than a custom script.

Storing pipelines in scripts is a good approach: not only do scripts document the steps performed on the data, they also allow pipelines to be rerun and checked into a Git repository.
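
As a minimal sketch of this practice, the word-count one-liner from the start of the chapter could be stored in a script like the one below. The script name `top_words.sh`, the function name `top_words`, and the sample text are hypothetical. `set -euo pipefail` is a common bash safeguard, and the last step uses `awk 'NR <= k'` in place of `sed 10q` so that $K$ can be passed as a parameter.

```bash
#!/usr/bin/env bash
# top_words.sh (hypothetical name): print the K most common words in a file.
set -euo pipefail   # stop on errors, unset variables, and failed pipeline steps

top_words() {
    local file="$1"
    local k="${2:-10}"                  # default: the 10 most common words
    tr -cs 'A-Za-z' '\n' < "$file" |    # one word per line
        tr 'A-Z' 'a-z' |                # lowercase everything
        sort |                          # group identical words together
        uniq -c |                       # count each distinct word
        sort -rn |                      # most frequent first
        awk -v k="$k" 'NR <= k'         # keep only the first k lines
}

# Demo on a small made-up sample
sample=$(mktemp)
printf 'The cat. The dog. the cat\n' > "$sample"
top_words "$sample" 2
rm -f "$sample"
```

Because the pipeline lives in a function inside a script, it can be rerun unchanged on new input and tracked in Git alongside the data it was applied to.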

### Inspecting and Manipulating Text Data with Unix Tools

There are mainly three types of delimited files:
1. Tab-delimited (TSV)
2. Comma-separated (CSV)
3. Variable space-delimited (VSV)

Of these three formats, tab-delimited is the most commonly used in bioinformatics. File formats such as BED, GTF/GFF, SAM, tabular BLAST output, and VCF are all examples of tab-delimited files.

Sometimes it’s useful to see both the beginning and end of a file:

```bash
# Return first and last two lines
(head -n2; tail -n2) < file.txt
```

#### **less (zless)**

Common uses of `less` or `zless`:
- Allows you to search text and highlights matches.
- Helps with debugging our command-line pipelines.
- Crucial when iteratively building up a pipeline — which is the best way to construct pipelines.
  
  ```bash
  step1 input.txt | less                  # inspect output in less
  step1 input.txt | step2 | less
  step1 input.txt | step2 | step3 | less
  ```
- Piping a program's output to `less` pauses the program's execution when the screen is full.
- This happens because `less` stops reading data from the pipe once its buffer is full, causing the pipe to become blocked.
- The blocked pipe then prevents the sending program from writing more data, effectively pausing its process.
- This behavior allows you to safely inspect the output of a long-running command without it wasting CPU resources to process data you aren't yet viewing.

Here are some commonly used `less` commands:

| Shortcut  | Action    |
|---------- | ----------|
| space bar | Next page |
| b         | Previous page |
| g         | First line |
| G         | Last line  |
| j         | Down (one line at a time) |
| k         | Up (one line at a time)   |
| /\<pattern\>| Search down (forward) for string \<pattern\> |
| ?\<pattern\>| Search up (backward) for string \<pattern\> |
| n          | Repeat last search downward (forward) |
| N          | Repeat last search upward (backward)  |


#### **cut** and **column**

```bash
# cut 3rd column from tab-delimited file
cut -f 3 file.tsv 

# cut 3rd-5th columns from tab-delimited file
cut -f 3-5 file.tsv

# cut 3rd, 5th, and 7th columns from tab-delimited file
cut -f 3,5,7 file.tsv

# Combined selection of columns
cut -f1-4,6,7 file.tsv

# Cut columns from csv file
cut -d, -f 2 file.csv
```

Note that `cut` cannot reorder columns: fields are always printed in their original order. To reorder columns we will use `awk`.
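
To see this restriction concretely, the short sketch below feeds one made-up three-column line to `cut` and asks for column 3 before column 1. The variable name `reordered` is only for illustration.

```bash
# cut ignores the order in which the fields are listed after -f
reordered=$(printf 'chr1\t100\t200\n' | cut -f 3,1)
printf '%s\n' "$reordered"    # column 1 still comes first: chr1<TAB>200
```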

The program `column -t` (the `-t` option tells `column` to treat data as a table) produces neat columns that are much easier to read. By default, `column -t` splits fields on whitespace; we can specify a different delimiter, such as a comma for CSV files, with the `-s` option.

Note that you should only use `column -t` to visualize data in the terminal, not to reformat data to write to a file.

#### **grep**

- `grep` is one of the fastest ways to find patterns (fixed strings or regular expressions) in a file:

<img src='images/grep_speed.png' width='600'>

We can show context around any match using:
- `-A`: after match
- `-B`: before match
- `-C`: after and before

```bash
# Return two lines after matched <pattern> line
grep -A2 "<pattern>" file
```

**grep's regular expressions:**
- By default, `grep` supports a flavor of regular expressions called POSIX Basic Regular Expressions (BRE).
- BRE is less expressive and comprehensive than Python's or Perl's regular expressions.
- For more complex patterns, we can enable POSIX Extended Regular Expressions (ERE) with the `-E` option.

**grep's other useful options:**
- `-c`: counts matching lines
- `-o`: outputs only the part of each line that matches the pattern
- `-w`: matches whole words only, so the match cannot be part of a longer word
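
The options above can be tried on a few made-up lines. This is only a sketch: the sample text and the variable names `sample`, `n_gene`, `n_chr1`, and `chroms` are assumptions for illustration.

```bash
sample=$(printf 'chr1 gene\nchr10 gene\nchr1 exon\n')

# -c: count the lines that contain "gene"
n_gene=$(printf '%s\n' "$sample" | grep -c 'gene')

# -w: "chr1" must be a whole word, so the chr10 line is not counted
n_chr1=$(printf '%s\n' "$sample" | grep -c -w 'chr1')

# -o with -E: print only the matched text, one match per line
chroms=$(printf '%s\n' "$sample" | grep -o -E 'chr[0-9]+')

printf 'gene lines: %s, chr1 lines: %s\n' "$n_gene" "$n_chr1"
printf '%s\n' "$chroms"
```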

#### **sort**
- `sort` is designed to work with plain-text data with columns.
- Running `sort` without any arguments simply sorts a file alphanumerically by line.
- By default, `sort` treats blank characters (like tab or spaces) as field delimiters. If your file uses another delimiter (such as a comma for CSV files), you can specify the field separator with `-t` (e.g., `-t","`).
- To specify a sorting key, we use `-k` arguments. Each `-k` argument takes a range of columns as `start,end`. So to sort by a single column we use `start,start`. For example, to sort a file using first column and second column (a number), we can use the following command:
    ```bash
    sort -k1,1 -k2,2n file.bed
    ```
- If we don’t want sort to change the order of lines that are equal according to our sort keys, we can specify the `-s` option. `-s` turns off this last-resort sorting, thus making sort a stable sorting algorithm.
- Sorting can be computationally intensive. We can check whether a file is already sorted according to our `-k` arguments using `-c`. If the file is sorted, `sort -c` exits with status `0` (so `echo $?` prints `0`); otherwise it reports the first out-of-order line and exits with status `1`.
    ```bash
    sort -k1,1 -k2,2n -c file_sorted.bed
    ```
- We can sort in reverse order using the `-r` option:

    ```bash
    # Reverse based on first column (a global -r applies to keys without their own flags)
    sort -k1,1 -k2,2n -r file.bed
    # Reverse based on second column
    sort -k1,1 -k2,2nr file.bed
    ```
- For natural ordering of alphanumeric values (for example, `chr2` before `chr10`), we can add `V` (version sort, available in GNU `sort`) to the sorting key:

    ```bash
    sort -k1,1V file.bed
    ```
- Under the hood, `sort` uses a fixed-size memory buffer to sort as much data in memory as fits. Increasing the size of this buffer allows more data to be sorted in memory, which reduces the number of temporary sorted files that need to be written to and read from disk. To do this, we can use the `-S` argument. The `-S` argument understands suffixes like **K** for kilobyte, **M** for megabyte, and **G** for gigabyte, as well as **%** for specifying what percent of total memory to use (e.g., 50% with `-S 50%`). For example:

    ```bash
    # Use a 2 GB in-memory sort buffer
    sort -k1,1 -k2,2n -S2G file.bed
    ```
- We can also run `sort` on multiple cores (parallel runs) with `--parallel` argument:

    ```bash
    # Run sort on 4 cores
    sort -k1,1 -k2,2n --parallel 4 file.bed
    ```
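
The key-based sorting and the `-c` check described above can be sketched on a tiny made-up BED-like table. The variable names `bed`, `sorted`, and `is_sorted` are hypothetical, and `LC_ALL=C` is set only to make the ordering independent of the local language settings.

```bash
bed=$(printf 'chr2\t50\t60\nchr1\t200\t210\nchr1\t30\t40\n')

# Sort by chromosome (column 1), then numerically by start position (column 2)
sorted=$(printf '%s\n' "$bed" | LC_ALL=C sort -k1,1 -k2,2n)
printf '%s\n' "$sorted"

# sort -c exits with 0 if the input is already sorted, non-zero otherwise
if printf '%s\n' "$sorted" | LC_ALL=C sort -k1,1 -k2,2n -c; then
    is_sorted=yes
else
    is_sorted=no
fi
printf 'already sorted: %s\n' "$is_sorted"
```

Without the `n` flag on the second key, `200` would sort before `30`, because the comparison would be character by character rather than numeric.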

#### **uniq**

Unix’s `uniq` takes lines from a file or standard input stream, and outputs all lines with **consecutive duplicates removed**. If we want to find all unique lines in a file, we would first sort all lines using `sort` so that all identical lines are grouped next to each other, and then run `uniq`.

**uniq options:**
- `-i`: makes `uniq` case-insensitive
- `-c`: shows the count of occurrences next to each unique line
- `-d`: prints only the lines that are duplicated (one copy of each)
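
The effect of sorting before `uniq` can be sketched with a few made-up lines. The variable names `letters`, `n_unsorted`, `n_unique`, and `dups` are only for illustration.

```bash
letters=$(printf 'A\nB\nA\nC\nB\nA\n')

# Without sorting, no two identical lines are adjacent, so nothing is removed
n_unsorted=$(printf '%s\n' "$letters" | uniq | wc -l | tr -d ' ')

# Sorting first groups identical lines, so uniq collapses every duplicate
n_unique=$(printf '%s\n' "$letters" | sort | uniq | wc -l | tr -d ' ')

# Count occurrences of each line, most frequent first
printf '%s\n' "$letters" | sort | uniq -c | sort -rn

# Only the lines that occur more than once
dups=$(printf '%s\n' "$letters" | sort | uniq -d)

printf 'unsorted: %s lines, sorted: %s unique lines\n' "$n_unsorted" "$n_unique"
```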

#### **join**
- The Unix tool `join` is used to join different files together by a common column.
- To join two files on their common column, we first need to sort both files by the column to be joined on. This is a vital step: Unix’s `join` will not work correctly unless both files are sorted by the column to join on.
- The basic syntax is `join -1 <file_1_field> -2 <file_2_field> <file_1> <file_2>`, where `<file_1>` and `<file_2>` are the two files to be joined by a column `<file_1_field>` in `<file_1>` and column `<file_2_field>` in `<file_2>`.
- `join` provides the `-a` option to include unpairable lines, that is, lines in one file that have no matching entry in the other. To use `-a`, we specify which file is allowed to have unpairable entries:

```bash
join -1 1 -2 1 -a 1 file1.txt file2.txt
```
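
The sort-then-join workflow can be sketched end to end with two small made-up files. The file names, their contents, and the variables `tmpdir`, `inner`, and `left` are assumptions for illustration only.

```bash
tmpdir=$(mktemp -d)
printf 'chr1 a\nchr2 b\nchr3 c\n' > "$tmpdir/file1.txt"
printf 'chr3 300\nchr1 100\n' > "$tmpdir/file2.txt"     # not yet sorted

# join needs both inputs sorted on the join field
sort -k1,1 "$tmpdir/file1.txt" > "$tmpdir/file1.sorted.txt"
sort -k1,1 "$tmpdir/file2.txt" > "$tmpdir/file2.sorted.txt"

# Default join: only keys present in both files
inner=$(join -1 1 -2 1 "$tmpdir/file1.sorted.txt" "$tmpdir/file2.sorted.txt")

# -a 1 also keeps lines of file 1 that have no partner in file 2
left=$(join -1 1 -2 1 -a 1 "$tmpdir/file1.sorted.txt" "$tmpdir/file2.sorted.txt")

printf '%s\n' "$left"
rm -r "$tmpdir"
```

In the second result, `chr2 b` appears without a value from the second file, since `chr2` has no entry there.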


#### **awk**
- `Awk` is an easy, small programming language great at working with text data like TSV and CSV files.
- The key to using `Awk` effectively is to reserve it for the subset of tasks it’s best at: quick data-processing tasks on tabular data.
- There’s also `GNU Awk`, known as `Gawk`, which is based on the original Awk but has many extended features.

**How Awk works:**

* **Data Structure**: Awk processes data one record at a time. Each **record** is a line, and each **field** is a column within that line.
* **Automatic Variables**: Awk automatically assigns the entire record to the variable `$0` and each field to a corresponding variable, such as `$1`, `$2`, and so on.
* **Program Structure**: Awk programs are built using `pattern { action }` pairs.
    * The `pattern` is an expression or regular expression that, if true, triggers the `action`.
    * The `action` contains the commands to be executed.
* **Optional Components**: You can omit either the pattern or the action.
    * If you omit the pattern, the action runs on all records.
    * If you omit the action, Awk prints all records that match the pattern.
* **Summary**: These two core concepts—records/fields and pattern-action pairs—form the foundation of using Awk for text processing.

For example, the following Awk command mimics `cat`:

```bash
awk '{print $0}' file.tsv
```
In the above command, `print` prints its argument; here that is `$0`, the entire record.

- Awk supports arithmetic with the standard operators +, -, *, /, % (remainder), and ^ (exponentiation).

**Awk comparison and logical operators**

<img src="images/awk_comp_logic.png" width="400">

We can also chain patterns, by using logical operators `&&` (AND), `||` (OR), and `!` (NOT). For example, if we wanted all lines on chromosome 1 with a length greater than 10:

```bash
$ awk '$1 ~ /chr1/ && $3 - $2 > 10' example.bed
# [output]
# chr1 26 39
# chr1 32 47
# chr1 9  28
```

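The first pattern, `$1 ~ /chr1/`, shows how to match a field against a regular expression. Regular expressions are written between **slashes**. Note that `/chr1/` matches any first field that contains `chr1` (including `chr10`); for an exact match, use `$1 == "chr1"`.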

We can combine patterns and more complex actions than just printing the entire record. For example, if we wanted to add a column with the length of this feature (end position - start position) for only chromosomes 2 and 3, we could use:

```bash
$ awk '$1 ~ /chr2|chr3/ { print $0 "\t" $3 - $2 }' example.bed
# [Output]
# chr3 11 28 17
# chr3 16 27 11
# chr2 35 54 19
```

**`BEGIN` and `END` patterns:**
- The `BEGIN` pattern specifies what to do *before* the first record is read in. It is useful to initialize and set up variables.
- `END` specifies what to do *after* the last record’s processing is complete. It is useful to print data summaries at the end of file processing.

For example, suppose we wanted to calculate the mean feature length. We would have to sum the feature lengths, and then divide by the total number of records. We can do this with:

```bash
$ awk 'BEGIN{ s = 0 }; { s += ($3-$2) }; END{ print "mean: " s/NR };' example.bed
# [Output]
# mean: 14
```
In the above example, `NR` is the current record number, so on the last record `NR` is set to the total number of records processed. In this example, we’ve initialized a variable `s` to 0 in `BEGIN` (variables you define do not need a dollar sign). Then, for each record we increment `s` by the length of the feature. At the end of the records, we print this sum `s` divided by the number of records `NR`, giving the mean.

- We can use `NR` to extract ranges of lines. For example, if we wanted to extract all lines between 3 and 5 (inclusive):

    ```bash
    awk 'NR >=3 && NR <=5' example.bed
    ```

- Awk's **associative arrays** are a powerful data structure that works like a **dictionary** in Python or a hash map in other languages. Instead of using a numeric index, you access and store data using a **key** and a **value**. This is useful for tasks such as counting items, like features belonging to a specific gene, by simply assigning a value to a key.

    ```bash
    awk '/gene_name/ {feat[$3] += 1}; END {for (i in feat) print i "\t" feat[i] }' example.gtf 
    ```
    
- Awk has several useful built-in functions:

<img src="images/awk_funcs.png" width="400">
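
As a sketch of the associative-array counting idiom described above, the example below counts how often each value of column 3 occurs in a few made-up GTF-like rows. The data and the names `gtf`, `counts`, `n`, and `k` are assumptions for illustration.

```bash
gtf=$(printf 'chr1\tsrc\tgene\nchr1\tsrc\texon\nchr1\tsrc\texon\nchr1\tsrc\tCDS\n')

# n[$3] is created on first use and incremented once per record
counts=$(printf '%s\n' "$gtf" |
    awk -F'\t' '{ n[$3] += 1 } END { for (k in n) print k "\t" n[k] }' |
    LC_ALL=C sort)    # "for (k in n)" visits keys in no fixed order, so sort the result

printf '%s\n' "$counts"
```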


**Setting Field, Output Field, and Record Separators:**

* **Field Separator (-F):** You can use the `-F` flag to specify a field separator other than whitespace, like a comma for a CSV file (e.g., `awk -F","`).
* **Variable Assignments (-v):** The `-v` flag allows you to set variables. This is useful for specifying other separators.
* **Output Field Separator (OFS):** The `OFS` variable defines the character that separates fields in the output. It is inserted between the comma-separated arguments of `print`, so `awk -F"," -v OFS="\t" '{print $1,$2,$3}'` prints the first three columns of a comma-separated file as tab-separated output.
* **Other Separators:** You can also set the Record Separator (`RS`) and Output Record Separator (`ORS`).
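
The interaction between `-F` and `OFS` can be sketched on a two-line made-up CSV. The variable names `csv`, `tsv`, and `tsv_all` are hypothetical. The second command relies on standard Awk behavior: assigning to any field makes Awk rebuild `$0` using `OFS`.

```bash
csv=$(printf 'id,start,end\nS_000,10,25\n')

# OFS is placed between the comma-separated arguments of print
tsv=$(printf '%s\n' "$csv" | awk -F',' -v OFS='\t' '{ print $1, $2, $3 }')

# Reassigning a field ($1 = $1) rebuilds the record with OFS,
# which converts every column without listing them one by one
tsv_all=$(printf '%s\n' "$csv" | awk -F',' -v OFS='\t' '{ $1 = $1; print }')

printf '%s\n' "$tsv_all"
```

A plain `{ print }` or `{ print $0 }` without the field assignment would leave the commas in place, because the record is only rebuilt when a field changes.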

#### Bioawk: An Awk for Biological Formats 

Bioawk is an extension of Awk, developed by Heng Li, that is tailored for bioinformatics. The basic idea of Bioawk is that we specify what bioinformatics format we’re working with, and Bioawk will automatically set variables for each field (just as regular Awk sets the columns of a tabular text file to `$1`, `$2`, `$3`, etc.). For Bioawk to set these fields, specify the format of the input file or stream with `-c`.

```bash
bioawk -c help
```

Bioawk is also quite useful for processing FASTA/FASTQ files. For example, we could use it to turn a FASTQ file into a FASTA file:

```bash
bioawk -c fastx '{print ">"$name"\n"$seq}' contam.fastq
```

Or Bioawk’s function `revcomp()` can be used to reverse complement a sequence:

```bash
$ bioawk -c fastx '{print ">"$name"\n"revcomp($seq)}' contam.fastq
```

Bioawk offers two convenient options for working with tab-delimited files:

* **-t:** This option is for general tab-delimited files. It automatically sets both the input and output field separators to tabs, which saves you from having to manually set `FS` and `OFS`.
* **-c hdr:** Use this option for tab-delimited files that have a header on the first line. It not only sets the field separators to tabs but also uses the names from the header row to create variables, making it easier to refer to specific columns.

For example, for the following file:

```bash
$ head -n 4 genotypes.txt
# [Output]
# id    ind_A ind_B ind_C
# S_000 T/T   A/T   A/T
# S_001 G/C   C/C   C/C
# S_002 C/A   A/C   C/C
```
If we wanted to return all variants for which individuals `ind_A` and `ind_B` have identical genotypes:

```bash
bioawk -c hdr '$ind_A == $ind_B {print $id}' genotypes.txt
```

#### Stream Editing with Sed
- The stream editor, or `sed`, allows you to make simple edits to a stream, usually to prepare it for the next step in a Unix pipeline.
- `sed` reads data from a file or standard input and edits it **a line at a time**, so the entire file never has to be loaded into memory (very useful for large files).

For example, to replace `chrom` with `chr` on each line of the file `chroms.txt`:

```bash
sed 's/chrom/chr/' chroms.txt > modified_chroms.txt
```

- The syntax of sed’s substitute is `s/pattern/replacement/`. By default, `sed` only replaces the first occurrence of a match on each line. To replace all occurrences of strings that match our pattern, we need to set the global flag `g` after the last slash: `s/pattern/replacement/g`. For case-insensitive matching, GNU `sed` accepts the flag `i` (e.g., `s/pattern/replacement/i`).
- The `-n` option stops `sed` from printing every line by default, so only the lines we explicitly print are output.
- It’s also possible to select and print certain ranges of lines with `sed`. To print the first 10 lines of a file (similar to `head -n 10`), we use:

    ```bash
    sed -n '1,10p' Mus_musculus.GRCm38.75_chr1.gtf
    ```
    If we wanted to print lines 20 through 50, we would use:

    ```bash
    sed -n '20,50p' Mus_musculus.GRCm38.75_chr1.gtf
    ```
- Appending `p` to a line range such as `20,50` tells `sed` to print those lines; combined with `-n`, only the requested lines are output.
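
The substitution flags and line ranges above can be sketched on three made-up lines. The variable names `chroms`, `first_only`, `all_matches`, and `middle` are only for illustration.

```bash
chroms=$(printf 'chrom1 chrom1\nchrom2 chrom2\nchrom3 chrom3\n')

# Without g, only the first match on each line is replaced
first_only=$(printf '%s\n' "$chroms" | sed 's/chrom/chr/')

# With g, every match on each line is replaced
all_matches=$(printf '%s\n' "$chroms" | sed 's/chrom/chr/g')

# -n plus a range and p: print lines 2 through 3 only
middle=$(printf '%s\n' "$chroms" | sed -n '2,3p')

printf '%s\n' "$first_only"
```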
   
#### Decoding Plain-Text Data: hexdump

* **ASCII Encoding**: ASCII is a character encoding scheme that uses 7 bits to represent 128 different characters, including letters, numbers, and symbols.
* **Modern Computers and ASCII**: Although ASCII only uses 7 bits, modern computers typically use an 8-bit byte to store ASCII characters.
* **Relevance in Bioinformatics**: Plain-text data in bioinformatics is often encoded in ASCII.
* **When Encoding Matters**: While most of the time you won't need to worry about encoding details, issues can arise when non-ASCII or invisible characters are present in a file, causing errors and "major headaches."
* **Non-ASCII formats**: These may contain special characters that cause problems with bioinformatics tools. The most common character encoding scheme is UTF-8, which is a superset of ASCII but allows for special characters.
* **Identify file encoding**: Use the `file` command to infer the encoding from the file's content.
* **Decipher non-ASCII characters**: Use `hexdump -c file.txt` to display each byte as a character, with non-printing bytes shown as escape sequences or octal codes; `hexdump -C file.txt` shows the hexadecimal value of each byte alongside its ASCII rendering.
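
To illustrate what a non-ASCII character looks like at the byte level, the sketch below uses the POSIX tool `od` as a stand-in for `hexdump` (a substitution made here only because `od` is available on most systems). The sample word and the variable name `bytes` are assumptions.

```bash
# "é" is not ASCII: in UTF-8 it is stored as the two bytes c3 a9
# (written here as the octal escapes \303 \251)
bytes=$(printf 'caf\303\251\n' | od -An -t x1 | tr -d ' \n')
printf '%s\n' "$bytes"    # 636166c3a90a
```

Every byte of plain ASCII text is below `80` in hexadecimal, so any value of `80` or higher in such a dump points to a non-ASCII character.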


### Subshells

**Sequential vs. Piped Commands**

* **Sequential Commands:** These commands run one after the other. The output from the first command does not automatically become the input for the next command.
    * If we run two commands with `command1 ; command2`, `command2` will always run, regardless of whether `command1` exits successfully (with a zero exit status). In contrast, if we use `command1 && command2`, `command2` will only run if `command1` completed with a zero exit status.
* **Piped Commands:** When you use a pipe (`|`) to connect two commands, the standard output of the first command is sent directly to the standard input of the second command. This lets you chain multiple commands together to process data.
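
The difference between `;` and `&&` can be sketched with the built-in `false`, which always exits with a non-zero status. The variable names `with_semicolon` and `with_and` are hypothetical.

```bash
# ";" runs the second command even though the first one fails
with_semicolon=$(false ; echo "ran")

# "&&" skips the second command because false exits with a non-zero status
# (the trailing "|| true" only keeps this demo going under "set -e")
with_and=$(false && echo "ran") || true

printf 'semicolon: [%s]  and: [%s]\n' "$with_semicolon" "$with_and"
# semicolon: [ran]  and: []
```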

You can group multiple sequential commands with parentheses `()` so that their combined output is treated as a single stream. This single stream can then be piped (`|`) as input to another command.

```bash
echo "this command"; echo "that command" | sed 's/command/step/'
# [Output]
# this command
# that step
(echo "this command"; echo "that command") | sed 's/command/step/'
# [Output]
# this step
# that step
```

Another example of using subshells:

```bash
(zgrep "^#" Mus_musculus.GRCm38.75_chr1.gtf.gz; \
zgrep -v "^#" Mus_musculus.GRCm38.75_chr1.gtf.gz | \
sort -k1,1 -k4,4n) | gzip > Mus_musculus.GRCm38.75_chr1_sorted.gtf.gz
```

### Named Pipes and Process Substitution

Named pipes are a solution for programs that cannot use standard Unix pipes. Here are the key takeaways:

* **The Problem:** Some programs, especially in bioinformatics, require separate input files and produce separate output files, making it impossible to use the standard Unix pipe (`|`) to connect them in a pipeline.
* **The Bottleneck:** This limitation forces the user to write and read multiple temporary files to and from the disk, which is a very slow process and creates a significant performance bottleneck in a data-processing pipeline.
* **The Solution:** A **named pipe** (also known as a **FIFO**, First In First Out) is a special type of file that acts like a pipe but has a name and exists on the filesystem. It allows you to pass data between programs that require explicit file inputs and outputs, avoiding the need to write to disk and thus maintaining the speed of a pipeline.
* We can create a named pipe with the program `mkfifo`.
* Like a file, we clean up by using `rm` to remove it.

```bash
mkfifo fqin
ls -l fqin
# [Output]
# prw-r--r--   1 vinceb  staff    0 Aug 5 22:50 fqin

echo "hello, named pipes" > fqin &
cat fqin
rm fqin
```

You’ll notice that this is indeed a special type of file: the `p` before the file permissions is for pipe.

Process substitution is another way to make data pipelines more flexible:

* **Process Substitution:** This is a shortcut that creates an anonymous named pipe on the fly. It allows the output of a command to be treated as a file, which can then be used as input for a program. This removes the need to manually create and remove named pipe files, making it easier to connect programs that don't accept standard input.
* **Difference from named pipes:** While both achieve similar goals, a named pipe is a persistent file you create with `mkfifo` and remove yourself, whereas process substitution creates a temporary pipe that the shell sets up and cleans up automatically, making it ideal for quick command-line use.

Here are two examples:

```bash
program --in1 <(makein raw1.txt) --in2 <(makein raw2.txt) \
--out1 out1.txt --out2 out2.txt
```

```bash
program --in1 in1.txt --in2 in2.txt \
--out1 >(gzip > out1.txt.gz) --out2 >(gzip > out2.txt.gz)
```