<p style="font-size:22pt; text-decoration:underline; font-weight:bold; color:#003057">
    PACE Linux 102: Efficient Workflow with Command Line Utilities
</p>

<br>
<center>
    <a href = "mailto: jvaldez8@gatech.edu"><b>Jeffrey Valdez</b></a><br>
    <a href = "https://pace.gatech.edu" target = "_blank"><b>PACE, Georgia Tech</b></a>
</center>


PACE’s Linux 102: Efficient Workflow with Command Line Utilities builds upon the Linux 101 content to demonstrate extended features to improve user knowledge and workflow using command line utilities in a Hands-On course. Topics covered include the use of built-in job control to manage multiple processes, filtering and stream-processing utilities for lightweight data processing, advanced I/O options, and data compression utilities. The content is focused on example usage, and is intended to provide a template for users to adopt in their own workflow. (Material developed by Aaron Jezghani, PhD)

#### This Course is Being Taught Online Exclusively


<font size=4 color=B3A369><u><b><i>Why</i> are we running bash through a Jupyter notebook?</b></u></font>

Traditionally, this course would be taught using a slide-deck and a terminal for the hands-on components. While it is true that the terminal has more overall functionality (as you can run interactive shells), a Jupyter Notebook is used to integrate the course content and hands-on exercises into a single document, while maintaining a record of all output for your review. The examples and exercises for this class have been designed so that they can fully run within this notebook.

<font size=4 color=B3A369><u><b>Connect to PACE-ICE</b></u></font>

- Connect to the GT VPN and then log in to <b>ondemand-ice.pace.gatech.edu</b>. Open OnDemand is a gateway developed by the Ohio Supercomputer Center and adopted at HPC centers around the world. Through your browser, you can launch interactive applications such as a terminal shell or Jupyter notebooks.
- You can also connect to PACE-ICE via ssh. On Mac or Linux, a terminal window provides access to ssh, while on Windows 10 version 1803 and newer, it can be accessed via PowerShell (which is already installed!). PuTTY and other ssh software can be used, but you should consult the manual to add port-forwarding to an existing session, as the displayed instructions will not work.
- If you have an older version of Windows or your computer has a heavily restricted firewall, you may wish to use VLab (http://mycloud.gatech.edu), which offers access to Windows machines on Georgia Tech's Virtual Lab. 

- Log in to ICE via the headnode: 
    - `ssh USERNAME@login-ice.pace.gatech.edu`. 
    - Replace USERNAME with your username. 
    - Enter your password when prompted. No asterisks will appear, but type your full password, then hit Enter.

<font size=4 color=B3A369><u><b>Copy workshop files</b></u></font>
- Download the workshop repo from GitHub:
    - `git clone https://github.com/jeffvaldez/PACE-Linux-102.git`
- Change into the Linux102 directory, where you should have a Jupyter notebook file and some other content


<font size=4 color=B3A369><u><b>Installing Anaconda, with Python 3, bash_kernel, and Jupyter included, on your laptop</b></u></font>

- The `bash_kernel` allows you to execute bash commands within your environment without the use of the %%bash magic command so that each code block reflects native bash (although the magic command has real benefits too - you should check it out!)

- For those without PACE accounts, this will allow you to use Jupyter Notebooks with `bash_kernel` in the future. Even those with PACE access may find using a local installation to be a convenient option at times. 

- Download from www.anaconda.com, for Mac/Linux/Windows

- To install bash_kernel, simply run the following commands after you have installed Anaconda:

```
conda install -c conda-forge bash_kernel #installs bash_kernel
python -m bash_kernel.install #makes bash_kernel available for jupyter notebook
```

- Load Jupyter from the Anaconda Navigator, or type `jupyter notebook` on the command line (except on Windows). 

- When you finish, end the Jupyter session by pressing `Ctrl-c` in the terminal where you started it (except on Windows), or click **Quit** in the top right of the main Jupyter window. (orange arrow below)



<p style="font-size:24pt; text-decoration:underline; font-weight:bold; color:#003057">
    Stream Manipulation
</p>

When used correctly, Bash can be extremely powerful for language and data processing, especially through stream manipulation. A number of utilities exist to quickly perform some simple data-handling tasks:
- `tr <set1> <set2>` can be used to translate characters from _set1_ to _set2_
    - The `-d` option is used to delete characters (don't provide <i>set2</i>)
    - The `-c` option inverts the character set to match
    - The `-s` option "squeezes" each run of a repeated character in the set down to a single occurrence (handy for collapsing the whitespace between text fields in formatted text)
- `sort` is used to sort a text stream or file
    - The `-n` option is used to do a numerical sort
    - The `-r` option reverses the sorted order
    - The `-m` option merges presorted input files
    - The `-u` option only prints one occurrence of each item in the sorted list
    - The `-k` option allows you to define the key (column) to sort by
    - The `-t` option is used to indicate the field delimiter
- `uniq` removes adjacent duplicate lines, which is why its input is usually sorted first
    - Without any options, it will simply return the first occurrence of each line
    - The `-c` option can be used to count the number of occurrences of each line
- `head` is used to print from the beginning of a file
    - The `-n <#>` option is used to specify the number of lines to print
    - The `-c <#>` option is used to specify the number of bytes to print
    - The `<#>` can be preceded by a `-` (e.g. `-n -10`) to print all but the last `<#>` lines or bytes
- `tail` is used to print from the end of a file
    - The `-n <#>` option is used to specify the number of lines to print
    - The `-c <#>` option is used to specify the number of bytes to print
    - The `<#>` can be preceded by a `+` (e.g. `-n +10`) to print everything starting at line or byte `<#>`
- `cut` is used to extract a specific field from a file
    - The `-f<#>[,<#>,<#>,...]` option is used to specify which field(s) to cut
    - The `-d<C>` option to indicate the character used to delimit fields in the file
- `paste` is used to merge lines of files
    - The `-d <LIST>` option specifies a list of delimiters, to be cycled through on each line, to delimit each field
    - The `-s` option is used to combine lines from each file separately, rather than merging the respective lines
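
As a minimal sketch of how these filters chain together, the block below runs them on a small made-up table. The node names and values are hypothetical and are not real PACE output; the variable names are chosen only for illustration.

```bash
# Hypothetical sample data: node name, core count, memory use in percent
data='node1  24   75
node2  24   10
node3  28   75
node4  24   10
node5  28   75'

# Squeeze each run of spaces to a single space so cut sees consistent fields
squeezed=$(printf '%s\n' "$data" | tr -s ' ')

# Cut the third field, sort it numerically in descending order, count each value
# (the final tr only normalizes the padding that uniq -c puts before each count)
counts=$(printf '%s\n' "$squeezed" | cut -d' ' -f3 | sort -nr | uniq -c | tr -s ' ')
printf '%s\n' "$counts"    # three nodes at 75, two nodes at 10

# Skip the first line with tail, keep the next two with head,
# then join the node names on one line with paste
middle=$(printf '%s\n' "$squeezed" | tail -n +2 | head -n 2 | cut -d' ' -f1 | paste -s -d, -)
printf '%s\n' "$middle"    # node2,node3
```

Without the `tr -s ' '` step, `cut` would treat every single space as a delimiter and the third field would be empty for most lines.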
    
<font color=377117>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Example 2: Filtering Command Output
</p>

- The first cell shows the familiar output for the details of the pace-cpu queue
- The second cell pipes that output through tail:  
  <i>the tail command starts printing at line 12, skipping the header so that only compute node lines remain</i>
- The third cell adds a translate command to filter the output:  
  <i>the translate command turns every '/' into a space and, with squeeze, reduces each run of spaces to a single character so that fields are delimited consistently</i>
- The remaining cells add filters one at a time to print an ordered list of the per-node memory utilization percent in the queue:  
  <i>the 10th field is cut, a descending numerical sort is implemented, and finally the counts for each unique value are ascertained</i>
    
<b>Execute each line, and note what each command does</b>
</font>

In [None]:
pace-check-queue pace-cpu

In [None]:
pace-check-queue pace-cpu | tail -n+12

In [None]:
 pace-check-queue pace-cpu | tail -n+12 | tr -s '/' ' '

In [None]:
pace-check-queue pace-cpu | tail -n+12 | tr -s '/' ' '  | cut -f10 -d' '

In [None]:
pace-check-queue pace-cpu | tail -n+12 | tr -s '/' ' '  | cut -f10 -d' ' | sort -nr

In [None]:
pace-check-queue pace-cpu | tail -n+12 | tr -s '/' ' '  | cut -f10 -d' ' | sort -nr | uniq -c

<font size=3><u><font color=B3A369><b>
    Shell Parameter Expansion
</b></font>: light-weight variable manipulation</u></font>

Beyond arithmetic expansion (`$((...))`) and command substitution (`$(<cmd> <options>)`), the `$` character also introduces <b>parameter expansion</b>. Substrings can be selected by index/pattern, variables can be indirectly referenced, string lengths can be measured, and strings can be modified - all using parameter expansion. The general syntax of parameter expansion uses some combination of `${...}`, `PARAMETER` (the value being modified - cannot be an expansion or pattern), `WORD` (the pattern explaining what to modify), and offset/length.

When working with variables, they can be evaluated with the `$` operator, but broadly speaking, it is good to be overly cautious and use double quotes and curly braces:
- Double quotes will help correctly interpret null or white space characters in your variable
    - NOTE: variables wrapped in single quotes will not be evaluated
- Curly braces correctly identify the variable to expand, especially if it is concatenated with another string such as `$foobar` versus `${foo}bar`
- Curly braces are also unconditionally required for:
    - Expanding array elements: `${array[42]}`
    - Expanding positional parameters beyond 9: `$8 $9 ${10} ${11}`  
    - Parameter expansion (as below)
    
$^{1}$[https://stackoverflow.com/questions/8748831/when-do-we-need-curly-braces-around-shell-variables](https://stackoverflow.com/questions/8748831/when-do-we-need-curly-braces-around-shell-variables)

In [None]:
echo $SLURM_JOBID #This works, but can leave you open to misinterpretations
echo ${SLURM_JOBID} #This works, but can be misinterpreted if there are whitespace or null characters
echo "$SLURM_JOBID" #This will interpret whitespace/null characters correctly, but will fail with arrays and positions>9
echo "${SLURM_JOBID}" #This is overly cautious, but leaves no room for misinterpretation
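
Because a job ID contains no whitespace, the four forms above print the same thing. The difference only shows up with other values, so here is a small sketch using hypothetical variables (`greeting`, `foo`) that are not part of the course material:

```bash
# Hypothetical variable containing several consecutive spaces
greeting="hello    world"

unquoted=$(echo $greeting)       # word splitting collapses the run of spaces
quoted=$(echo "${greeting}")     # the value is passed through unchanged

echo "unquoted: [$unquoted]"
echo "quoted:   [$quoted]"

# Braces mark where the variable name ends
foo="data"
unset foobar
with_braces="${foo}bar"    # expands foo, then appends the text "bar"
without_braces="$foobar"   # expands the unset variable foobar, giving an empty string

echo "with braces:    [$with_braces]"
echo "without braces: [$without_braces]"
```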

As for examples of parameter expansion, here are several operators you might encounter or include in your scripts:
- `${!PARAMETER}`: indirectly references the variable pointed to by parameter

In [None]:
JUST_ANOTHER_VAR="SLURM_JOBID"
echo JUST_ANOTHER_VAR=${JUST_ANOTHER_VAR}
echo '!JUST_ANOTHER_VAR'=${!JUST_ANOTHER_VAR}

- `${#PARAMETER}`: returns the length of `PARAMETER`

In [None]:
echo ${#SLURM_JOBID}

- `${PARAMETER:offset:length}`: returns the substring of length `length` starting at index `offset` (zero-indexed as in Python, inclusive bound). If `offset` is negative, the substring starts that many characters from the end of `PARAMETER` (inclusive bound); write this as `${PARAMETER: -offset:length}`, with a space before the minus sign, to avoid confusion with the `:-` operator. If `length` is negative, the substring instead stops that many characters from the end of `PARAMETER` (exclusive bound).

In [None]:
echo ${SLURM_SUBMIT_HOST}
echo ${SLURM_SUBMIT_HOST:7}
echo ${SLURM_SUBMIT_HOST:7:5}
echo ${SLURM_SUBMIT_HOST::5}
echo ${SLURM_SUBMIT_HOST::-10}
echo ${SLURM_SUBMIT_HOST:-10}
echo ${SLURM_SUBMIT_HOST: -10}
echo ${SLURM_SUBMIT_HOST: -10:3}
echo ${SLURM_SUBMIT_HOST: -10:-6}
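
The cell above depends on `SLURM_SUBMIT_HOST` being set by the scheduler. For trying the same expansions outside of a job, the following sketch uses a hypothetical host name; the value and the variable names are assumptions made for illustration. Negative lengths need a reasonably recent bash (4.2 or newer).

```bash
host="login-ice-1.pace.gatech.edu"   # hypothetical value

len="${#host}"          # 27 characters
first5="${host::5}"     # "login": offset defaults to 0
middle="${host:6:3}"    # "ice": three characters starting at index 6
last3="${host: -3}"     # "edu": note the space before the minus sign
no_tld="${host::-4}"    # "login-ice-1.pace.gatech": stop 4 characters from the end
default="${host:-10}"   # no space, so this is the default-value operator;
                        # host is set, so the whole value is returned

echo "$len $first5 $middle $last3"
echo "$no_tld"
echo "$default"
```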

- `${PARAMETER%WORD}`: Removes the shortest match for `WORD` from the end of `PARAMETER` (note that `WORD` can be a globbing pattern)
- `${PARAMETER%%WORD}`: Removes the longest match for `WORD` from the end of `PARAMETER` (note that `WORD` can be a globbing pattern)

In [None]:
echo ${SLURM_SUBMIT_HOST}
echo ${SLURM_SUBMIT_HOST%.*} #remove .edu
echo ${SLURM_SUBMIT_HOST%%.*} #remove .pace.gatech.edu
echo ${SLURM_SUBMIT_HOST%.g*} #remove .gatech.edu

- `${PARAMETER#WORD}`: Removes the shortest match for `WORD` from the start of `PARAMETER` (note that `WORD` can be a globbing pattern)
- `${PARAMETER##WORD}`: Removes the longest match for `WORD` from the start of `PARAMETER` (note that `WORD` can be a globbing pattern)

In [None]:
echo ${SLURM_SUBMIT_HOST}
echo ${SLURM_SUBMIT_HOST#*.}
echo ${SLURM_SUBMIT_HOST##*.}
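
A common use of the `#` and `%` operators is splitting a file path into its parts without calling `basename` or `dirname`. The sketch below assumes a hypothetical path; nothing here touches the file system.

```bash
path="/home/user/results/run01.tar.gz"   # hypothetical path

filename="${path##*/}"      # longest match of */ from the start -> run01.tar.gz
directory="${path%/*}"      # shortest match of /* from the end  -> /home/user/results
stem="${filename%%.*}"      # longest match of .* from the end   -> run01
all_ext="${filename#*.}"    # shortest match of *. from the start -> tar.gz
last_ext="${filename##*.}"  # longest match of *. from the start  -> gz

echo "$directory | $filename | $stem | $all_ext | $last_ext"
```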

- `${PARAMETER^WORD}`: Changes first character of `PARAMETER` to uppercase if it matches `WORD`, which can be a globbing pattern.
- `${PARAMETER^^WORD}`: Changes all characters of `PARAMETER` to uppercase if they match `WORD`, which may be a globbing pattern.

In [None]:
echo ${SLURM_SUBMIT_HOST}
echo ${SLURM_SUBMIT_HOST^*}
echo ${SLURM_SUBMIT_HOST^o}
echo ${SLURM_SUBMIT_HOST^^*}
echo ${SLURM_SUBMIT_HOST^^[se]}

- `${PARAMETER,WORD}`: Changes first character of `PARAMETER` to lowercase if it matches `WORD`, which can be a globbing pattern.
- `${PARAMETER,,WORD}`: Changes all characters of `PARAMETER` to lowercase if they match `WORD`, which may be a globbing pattern.

In [None]:
echo ${JUST_ANOTHER_VAR}
echo ${JUST_ANOTHER_VAR,*}
echo ${JUST_ANOTHER_VAR,S}
echo ${JUST_ANOTHER_VAR,,*}
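
When `WORD` is left out, the case operators apply to every character, which is the form you will probably use most often. A short sketch with hypothetical strings (these operators need bash 4 or newer):

```bash
name="pace cluster"   # hypothetical value
shout="LINUX"         # hypothetical value

capitalized="${name^}"      # first character only  -> Pace cluster
upper="${name^^}"           # every character       -> PACE CLUSTER
selective="${name^^[pc]}"   # only p and c          -> PaCe Cluster
first_lower="${shout,}"     # first character only  -> lINUX
lower="${shout,,}"          # every character       -> linux

echo "$capitalized | $upper | $selective | $first_lower | $lower"
```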

- `${PARAMETER/WORD/NEWWORD}`: Replaces first occurrence of `WORD` with `NEWWORD` in `PARAMETER` (note that `WORD` can be a globbing pattern, but `NEWWORD` cannot)
- `${PARAMETER//WORD/NEWWORD}`: Replaces all occurrences of `WORD` with `NEWWORD` in `PARAMETER` (note that `WORD` can be a globbing pattern, but `NEWWORD` cannot)

In [None]:
echo ${SLURM_SUBMIT_HOST}
echo ${SLURM_SUBMIT_HOST/-/.}
echo ${SLURM_SUBMIT_HOST//./-}
echo ${SLURM_SUBMIT_HOST/[ec]/*}
echo ${SLURM_SUBMIT_HOST//[ec]/*}
echo ${SLURM_SUBMIT_HOST/*/*}
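
The same replacement operators can be tried on any string. The following sketch uses a hypothetical date stamp to show the difference between replacing the first match and replacing all matches, and how an empty `NEWWORD` deletes the match:

```bash
stamp="2024-01-15"   # hypothetical value

first_only="${stamp/-//}"       # first "-" becomes "/"   -> 2024/01-15
every="${stamp//-//}"           # every "-" becomes "/"   -> 2024/01/15
masked="${stamp//[0-9]/#}"      # every digit becomes "#" -> ####-##-##
compact="${stamp//-/}"          # empty replacement deletes every "-" -> 20240115

echo "$first_only | $every | $masked | $compact"
```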

<font size=3><u><font color=B3A369><b>
    Regular Expressions
</b></font>: character pattern descriptions</u></font>

In addition to the above filter utilities, regular expressions can be used for stream manipulation. A regular expression (abbreviated as regex or regexp) is a pattern that describes a sequence of characters. Programs such as `sed`, `awk`, and `grep` use them to perform operations and search for general patterns, such as finding and stripping email addresses from large volumes of user data. There are entire text books written about the use of regular expressions, but some of the general patterns are listed here:

|Operator | Effect                       |Example |
|:---     | :---                         |:---    |
|` . `    |Matches any single character except new line. | **ab.** matches **abc**, **abC**, **abz**, **ab5**, **ab$**, **...**|
|`?`    |Matches the preceding item 0 or 1 times.|**ab?c** matches **ac** and **abc**, but not **abbc**|
|`*`    |Matches the preceding item 0 or more times.|**ab*c** matches **ac**, **abc**, **abbc**, **abbbc**, **...**|
|`+`    |Matches the preceding item 1 or more times.|**ab+c** matches **abc** and **abbc**, but not **ac**|
|`{n}`|Matches the preceding item exactly _n_ times|**ab{2}c** matches **abbc** but not **abc** or **abbbc**|
|`{n,}`|Matches the preceding item _n_ or more times|**ab{2,}c** matches **abbc**, **abbbc**, **...** but not **abc**|
|`{n,m}`|Matches the preceding item at least _n_ times, but not more than _m_ times|**ab{2,3}c** matches **abbc** and **abbbc**, but not **abc** or **abbbbc**|
|`[ ]`|Matches any single character contained in a set enclosed by `[` and `]`; ranges can be specified with `-`|**a[bBr-u]c** matches **abc**, **aBc**, **arc**, **asc**, **atc**, and **auc**|
|`-[ ]` in `[ ]`|Character set subtraction (only available in some regex flavors; not supported by `grep`, `sed`, or `awk`)|**a[a-z-[ac-z]]c** matches **abc** (`(a-z)-(a+c-z)=b`)|
|`[^ ]`|Matches any single character not included in set enclosed by `[` and `]`|**a[^c-z]c** matches **aac**, **abc**, **aCc**, **a3c**, **a^c**, **...**|
|`( )`|Matches a group of characters for extracting a substring or using a backreference|**a(abc)+c** matches **aabcc**, **aabcabcc**, **...** but not **aabc**|
|`^` |Start of a line or string|**^a** matches **abc 123**, but not **123 abc** |
|`$` |End of a line or string|**a\$** matches **bababa**, but not **ababab** |

_Note: in basic regular expressions the metacharacters `?`, `+`, `{`, `|`, `(`, and `)` lose their special meaning; 
<br>instead use the backslashed versions `\?`, `\+`, `\{`, `\|`, `\(`, and `\)`_



In addition to the above regular expression operators to describe patterns, certain classes, which can describe general types of characters, are defined by the POSIX standard. Since each class is actually a list of characters, it should be enclosed in square brackets (i.e. `[[:alpha:]]` not `[:alpha:]`)

|POSIX Class | Bracket Expression| Meaning |
|:---  |:--- |:---|
|`[:alnum:]`|`[A-Za-z0-9]`|upper- and lowercase letters, digits|
|`[:alpha:]`|`[A-Za-z]`|upper- and lowercase letters|
|`[:ascii:]`|`[\x00-\x7F]`|ASCII characters (not defined by POSIX; an extension in some regex flavors)|
|`[:blank:]`|`[ \t]`|space and TAB characters only|
|`[:cntrl:]`|`[\x00-\x1F\x7F]`|control characters|
|`[:digit:]`|`[0-9]`|digits|
|`[:graph:]`|`[\x21-\x7E]`|graphic characters (all characters which have graphic representation)|
|`[:lower:]`|`[a-z]`|lowercase letters|
|`[:print:]`|`[[:graph:] ]`|graphic characters and space|
|`[:punct:]`|`[!"#$%&'()*+,-.\/:;<=>?@‘[]^_{\|}~]`|punctuation and symbols|
|`[:space:]`|`[ \t\r\n\v\f]`|all whitespace characters, including line breaks|
|`[:upper:]`|`[A-Z]`|uppercase letters|
|`[:word:]`|`[A-Za-z0-9_]`|upper- and lowercase letters, digits, and underscores (not defined by POSIX; an extension in some regex flavors)|
|`[:xdigit:]`|`[A-Fa-f0-9]`|hexadecimal digits|
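
To see the operators and classes in action, the sketch below filters a few made-up lines with `grep -E` (extended regular expressions). The sample text and variable names are hypothetical.

```bash
# Hypothetical sample lines
sample='abc123
ABC
hello world
42
  indented'

# Lines made up entirely of digits
digits_only=$(printf '%s\n' "$sample" | grep -E '^[[:digit:]]+$')

# Lines made up entirely of uppercase letters
upper_only=$(printf '%s\n' "$sample" | grep -E '^[[:upper:]]+$')

# Lowercase letters followed by exactly three digits
mixed=$(printf '%s\n' "$sample" | grep -E '^[[:lower:]]+[[:digit:]]{3}$')

# Count the lines containing a space or TAB
blank_count=$(printf '%s\n' "$sample" | grep -c '[[:blank:]]')

echo "$digits_only | $upper_only | $mixed | $blank_count"
```

Note the doubled brackets: the outer pair starts the bracket expression, and the inner pair belongs to the class name.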

<font size=3><u><font color=B3A369><b>grep</b></font>: _g/re/p_ (<b>g</b>lobally search a <b>r</b>egular <b>e</b>xpression and <b>p</b>rint)</u></font>

As mentioned in Linux 101, grep is a great utility for searching for and printing matching text within files or streams. Given the name, it shouldn't be surprising that `grep` can also be used to look for regular expressions!

Here are some of the more useful options to use with grep:

|Option|Meaning|
|:---|:---|
|`-E`|Interpret PATTERN as an extended regular expression|
|`-f FILE`|Obtain patterns from `FILE`, one per line. The empty file contains zero patterns, and therefore matches nothing.|
|`-i`|Ignore case distinctions in both the PATTERN and the input files.|
|`-c`|Suppress normal output; instead print a count of matching lines for each input file. With the `-v` option, count non-matching lines.|
|`-l`|Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning will stop on the first match.|
|`-o`|Print only the matched (non-empty) parts of a matching line, with each such part on a separate output line.|
|`-v`|Invert the sense of matching, to select non-matching lines.|
|`-m <#>`|Stop reading after *#* matching lines, so at most *#* matches are printed.|
|`-n`|Prefix each line of output with the 1-based line number within its input file.|
|`-q`|Quit immediately and return exit status 0 if a match is found; else return 1. (useful for conditionals!)|
|`-R`|Search recursively for expression.|

In [None]:
scontrol show job ${SLURM_JOBID}

In [None]:
scontrol show job ${SLURM_JOBID} | grep -n 'Time'

In [None]:
scontrol show job ${SLURM_JOBID} | grep -q 'Time'
echo $?

In [None]:
scontrol show job ${SLURM_JOBID} | grep -n 'TIme'

In [None]:
scontrol show job ${SLURM_JOBID} | grep -q 'TIme'
echo $?

In [None]:
scontrol show job ${SLURM_JOBID} | grep -m 3 'Time'

In [None]:
scontrol show job ${SLURM_JOBID} | grep -c -i 'Time'

In [None]:
scontrol show job ${SLURM_JOBID} | grep 'SubmitTime'

In [None]:
scontrol show job ${SLURM_JOBID} | grep 'StartTime'

In [None]:
scontrol show job ${SLURM_JOBID} | grep -E 'SubmitTime|StartTime'
scontrol show job ${SLURM_JOBID} | egrep 'SubmitTime|StartTime'
scontrol show job ${SLURM_JOBID} | grep 'SubmitTime\|StartTime'

In [None]:
scontrol show job ${SLURM_JOBID} | grep 'Time.*[0-9]*:[0-9]*:[0-9]*'

In [None]:
scontrol show job ${SLURM_JOBID} | grep 'Time.\+[0-9]\+:[0-9]\+:[0-9]\+'

In [None]:
scontrol show job ${SLURM_JOBID} | grep -E 'Time.+[0-9]+:[0-9]+:[0-9]+'

In [None]:
scontrol show job ${SLURM_JOBID} | egrep 'Time.+[0-9]+:[0-9]+:[0-9]+'

In [None]:
scontrol show job ${SLURM_JOBID} | egrep -o '.*SubmitTime.*'

In [None]:
scontrol show job ${SLURM_JOBID} | egrep '.*StartTime.*'
scontrol show job ${SLURM_JOBID} | egrep '.*StartTime.*' | egrep -o '[0-9:]{8}'

In [None]:
scontrol show job ${SLURM_JOBID} | egrep '.*AccrueTime.*'
scontrol show job ${SLURM_JOBID} | egrep '.*AccrueTime.*' | egrep -o '[0-9:]{8}'
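
The cells above need a running Slurm job. To experiment with the same options anywhere, the sketch below applies them to a made-up record that only imitates the `KEY=VALUE` layout; the field values are hypothetical and are not real `scontrol` output.

```bash
# Hypothetical job record in KEY=VALUE form
record='JobId=1234 JobName=demo
RunTime=00:05:12 TimeLimit=01:00:00
SubmitTime=2024-01-15T09:30:00 EligibleTime=2024-01-15T09:30:00
StartTime=2024-01-15T09:31:07 EndTime=2024-01-15T10:31:07'

# -c counts the matching lines instead of printing them
time_lines=$(printf '%s\n' "$record" | grep -c 'Time')

# -n prefixes each match with its line number
numbered=$(printf '%s\n' "$record" | grep -n 'StartTime')

# -o prints only the matched text; -E enables the {2} interval syntax
clock=$(printf '%s\n' "$record" | grep 'StartTime' | grep -E -o '[0-9]{2}:[0-9]{2}:[0-9]{2}' | head -n 1)

# -q prints nothing and only sets the exit status, which suits conditionals
if printf '%s\n' "$record" | grep -q 'TimeLimit'; then
  found="yes"
else
  found="no"
fi

echo "$time_lines | $clock | $found"
echo "$numbered"
```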

<font size=3><u><font color=B3A369>**sed**</font>: **s**tream **ed**itor</u></font>

sed is another extremely powerful regular expression utility. Typically, it is used to search and replace in more robust fashions than `tr`, but it can also insert by line address and more. Here are some of the more useful options to use with sed:

|Option|Meaning|
|:---|:---|
|`-f SCRIPT_FILE`|Add the contents of script-file to the commands to be executed.|
|`-r`|Use extended regular expressions in the script.|
|`-n`|Suppress automatic printing of pattern space|
|`-i[SUFFIX]`|Edit files in place (makes backup if SUFFIX supplied).|

And here is some basic usage:

- `s`: to substitute
- `/../../`: to delimit search and replace patterns (can delimit with #,$,/,...)
- `g`: to apply globally
- `/../p`: print any line matching `/../`
- `/../a`: append after any line matching `/../`
- `/../d`: delete any line matching `/../`
- `<#>`: specify an address (line number)
- `<#1>,<#2>`: specify an address range (from line <#1> to line <#2>)

In [None]:
scontrol show job ${SLURM_JOBID} | sed '1,10p'

In [None]:
scontrol show job ${SLURM_JOBID} | sed -n '1,10p'

In [None]:
scontrol show job ${SLURM_JOBID} | sed 's/\//\\/'

In [None]:
scontrol show job ${SLURM_JOBID} | sed 's&/&\\&g'

In [None]:
scontrol show job ${SLURM_JOBID} | sed '/[Rr]e/d'

In [None]:
scontrol show job ${SLURM_JOBID} | egrep '.*SubmitTime.*'
scontrol show job ${SLURM_JOBID} | sed -n 's&SubmitTime.*\([0-9]*:[0-9]*:[0-9]*\)&\1&p'

In [None]:
scontrol show job ${SLURM_JOBID} | egrep '.*SubmitTime.*'
scontrol show job ${SLURM_JOBID} | sed -n 's&SubmitTime.*\([0-9]\{2\}:[0-9]*:[0-9]*\)&\1&p'

In [None]:
scontrol show job ${SLURM_JOBID} | sed -n 's&.*SubmitTime.*\([0-9:]\{8\}\)&\1&p'

In [None]:
scontrol show job ${SLURM_JOBID} | sed -rn 's&.*SubmitTime.*([0-9:]{8})&\1&p'
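
The same `sed` techniques can be tried without a Slurm job. This sketch reuses a made-up `KEY=VALUE` record (hypothetical values, not real `scontrol` output) to show line addresses, an alternate delimiter, deletion, and a capture group.

```bash
# Hypothetical job record in KEY=VALUE form
record='JobId=1234 JobName=demo
RunTime=00:05:12 TimeLimit=01:00:00
SubmitTime=2024-01-15T09:30:00 EligibleTime=2024-01-15T09:30:00
StartTime=2024-01-15T09:31:07 EndTime=2024-01-15T10:31:07'

# -n suppresses automatic printing, so only the addressed lines 1 to 2 appear
first_two=$(printf '%s\n' "$record" | sed -n '1,2p')

# Using # as the delimiter means the / in the pattern needs no escaping
backslashed=$(echo '/usr/local/bin' | sed 's#/#\\#g')

# Delete every line that contains "Time"
no_time=$(printf '%s\n' "$record" | sed '/Time/d')

# Keep only the captured clock time that follows SubmitTime=<date>T
submit_clock=$(printf '%s\n' "$record" | sed -n 's/.*SubmitTime=[^T]*T\([0-9:]\{8\}\).*/\1/p')

echo "$backslashed | $no_time | $submit_clock"
```

Anchoring the capture group right after the `T` avoids the greedy `.*` swallowing most of the digits, which is what happens in the first `SubmitTime` cells above.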

<font size=3><u><font color=B3A369>**AWK**</font>: data-driven language by **A**ho, **W**einberger, and **K**ernighan</u></font>



Here are some of the more useful options to use with awk:

|Option|Meaning|
|:---|:---|
|`-f PROGRAM_FILE`|Read the AWK program source from the file PROGRAM_FILE, instead of from the first command line argument.|
|`-F FIELD_SEPARATOR`|Use FIELD_SEPARATOR for the input field separator (the value of the FS predefined variable).|
|`-v VAR=VAL`|Assign the value VAL to the variable VAR, before execution of the program begins.|
|`-r`|Enable the use of interval expressions in regular expression matching.|


And here are some of the useful built-in variables and special patterns (`BEGINFILE` and `ENDFILE` are GNU awk extensions):

|Variable|Description|
|:---|:---|
|`FS`|The input field separator, a space by default.|
|`RS`|The input record separator, a newline by default.|
|`NF`|The number of fields in the current input record.|
|`NR`|The total number of input records seen so far.|
|`OFMT`|The output format for numbers, "%.6g", by default.|
|`OFS`|The output field separator, a space by default.|
|`ORS`|The output record separator, a newline by default.|
|`BEGIN`|Special keyword to indicate beginning of awk program, executes action before processing file(s).|
|`END`|Special keyword to indicate end of awk program, executes action after processing all file(s).|
|`BEGINFILE`|Special keyword to indicate beginning of each file; executes action before processing each file.|
|`ENDFILE`|Special keyword to indicate end of each file; executes action after processing each file.|

The basic one-line `awk` command consists of `awk <OPTIONS> '<PATTERN> <ACTION>' <input-file>`. Patterns can include regular expression searches (`/<REGEX>/`) or conditional statements (`NR>4`), along with many more! Actions are usually enclosed in curly braces, and can involve things like displaying specific fields (e.g. `{print $3}` to only print the 3rd field).<br><br>
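
As a minimal sketch of the pattern/action structure, the block below runs a few one-liners against a small made-up table. The node names and numbers are hypothetical.

```bash
# Hypothetical table: a header line followed by name, cores, memory in GB
table='name cores mem
node1 24 192
node2 28 256
node3 24 192'

# Conditional pattern: skip the header (record 1), print fields 1 and 3
names_mem=$(printf '%s\n' "$table" | awk 'NR>1 {print $1, $3}')

# Regex pattern combined with a condition; END runs after the last record
count24=$(printf '%s\n' "$table" | awk '/^node/ && $2==24 {n++} END {print n}')

# Accumulate a column and print the total at the end
total_mem=$(printf '%s\n' "$table" | awk 'NR>1 {sum+=$3} END {print sum}')

# -F sets the input field separator, -v sets a variable (here OFS) before the run
ends=$(echo 'a:b:c' | awk -F: -v OFS='-' '{print $1, $NF}')

echo "$names_mem"
echo "$count24 | $total_mem | $ends"
```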

In [None]:
scontrol show job ${SLURM_JOBID} | awk '/Time.*[0-9]*:[0-9]*:[0-9]*/ {print}'

In [None]:
scontrol show job ${SLURM_JOBID} | awk '/SubmitTime.*[0-9]*:[0-9]*:[0-9]*/ {print}'

In [None]:
scontrol show job ${SLURM_JOBID} | awk -r '/SubmitTime.*[0-9:]{8}/ {print}'

In [None]:
scontrol show job ${SLURM_JOBID} | awk -r '/SubmitTime.*[0-9:]{8}/ {
split($1,submit_str,"=");
split(submit_str[2],sdt,"T"); 
print sdt[1], sdt[2]

split($2,elig_str,"=");
split(elig_str[2],edt,"T"); 
print edt[1], edt[2]
}'

In [None]:
scontrol show job ${SLURM_JOBID} | awk -r '/SubmitTime.*[0-9:]{8}/ {
split($1,submit_str,"=");
split(submit_str[2],sdt,"T");
split(sdt[2],stime,":");
print 3600*stime[1]+60*stime[2]+stime[3]

split($2,elig_str,"=");
split(elig_str[2],edt,"T"); 
split(edt[2],etime,":");
print 3600*etime[1]+60*etime[2]+etime[3]
}'

In [None]:
scontrol show job ${SLURM_JOBID} | 
awk '/RunTime/ {
split($1,runtime_str,"=");
split(runtime_str[2],ts,":"); 
runtime=3600*ts[1]+60*ts[2]+ts[3];
print runtime;
}

/TimeLimit/ {
split($2,walltime_str,"=");
split(walltime_str[2],ts,":"); 
walltime=3600*ts[1]+60*ts[2]+ts[3];
print walltime;

print runtime/walltime
}'

<font size=3><u><font color=B3A369>**=~**</font>: Bash regex</u></font>

In bash, the binary operator `=~` is used to compare a string against an extended regular expression; any part of the regular expression can be quoted to force it to be matched as a literal string. If capture groups are used, the variable `BASH_REMATCH` is an array of the indexed matches.

The return values are:

- `0` if the string on the LHS matches the regex on the RHS
- `2` if the regex on the RHS is syntactically incorrect
- `1` otherwise

Thus, it can be used in conditional expressions such as the following:

In [None]:
if [[ "The quick brown fox" =~ ."ui"(.{4}).*(fox) ]]
then
  echo "Return value is $?"
  echo "Capture group matches are \"${BASH_REMATCH[1]}\" and \"${BASH_REMATCH[2]}\""
  echo "Note that you could also print the entire matched pattern with \"${BASH_REMATCH[0]}\""
fi

In [None]:
scontrol show job

In [None]:
scontrol show job | grep "TimeLimit"

In [None]:
for line in `scontrol show job`
do
  if [[ $line =~ "TimeLimit="([0-9:]{8}) ]]
  then
   echo "${BASH_REMATCH[1]}"
  fi
done
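
The `for` loop above splits the output into whitespace-separated words. One alternative, sketched here on a made-up record (hypothetical values, not real `scontrol` output), is to read whole lines with `while read` and to keep the regex in a variable, which sidesteps quoting questions inside `[[ ]]`.

```bash
# Hypothetical job record in KEY=VALUE form
record='JobId=1234 JobName=demo
RunTime=00:05:12 TimeLimit=01:00:00'

# Three capture groups: hours, minutes, seconds
re='TimeLimit=([0-9]{2}):([0-9]{2}):([0-9]{2})'

while IFS= read -r line; do
  if [[ $line =~ $re ]]; then
    # Index 0 holds the whole match; strip the key to keep just the value
    limit="${BASH_REMATCH[0]#TimeLimit=}"
    # 10# forces base 10 so values such as 08 are not read as octal
    seconds=$(( 10#${BASH_REMATCH[1]} * 3600 + 10#${BASH_REMATCH[2]} * 60 + 10#${BASH_REMATCH[3]} ))
  fi
done <<< "$record"

echo "Time limit ${limit} is ${seconds} seconds"
```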

<font size=3 color='red'><u>DEPRECATED - Comparing each approach</u></font>

Although `sed` and `awk` are extremely efficient, if you make many separate calls to them, the total time can accumulate and make for an unnecessarily inefficient script. This is a consequence of the shell's need to create many subprocesses and continuously move the programs in and out of memory. As such, if your workflow executes many (hundreds or thousands) regex calls, you may find it more effective to instead use the bash regex operator. As an example, consider the following:

In [None]:
#pbsnodes

In [None]:
#pbsnodes > pbsnodes.cache

In [None]:
#pbsnodes | sed -n 's&.*total_cores = \([0-9]*\).*&\1&p' | wc -l

In [None]:
#IFS_OLD=$IFS
#IFS=$'\n'
#time for iter in {1..10}; do for line in $(cat pbsnodes.cache); do echo $line | sed -n 's&.*total_cores = \([0-9]*\).*&\1&p'; done; done > sed.out
#echo 'sed'
#IFS=$IFS_OLD

In [None]:
#IFS_OLD=$IFS
#IFS=$'\n'
#time for iter in {1..10}; do for line in $(cat pbsnodes.cache); do echo $line | awk '/total_cores/ {print $NF}' ; done; done > awk.out
#echo 'awk'
#IFS=$IFS_OLD

In [None]:
#IFS_OLD=$IFS
#IFS=$'\n'
#TOTCOREEXP='.*total_cores = ([0-9]+)'
#time for iter in {1..10}; do for line in $(cat pbsnodes.cache); do if [[ $line =~ $TOTCOREEXP ]]; then echo "${BASH_REMATCH[1]}"; fi; done; done > bash_regex.out
#echo 'bash regex'
#IFS=$IFS_OLD

In [None]:
#diff sed.out awk.out
#diff sed.out bash_regex.out

<font color=377117>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Example 3: Using AWK and sed to Parse the Output of 'pace-check-queue'
</p>

- pace-check-queue returns the usual table of nodes within the pace-cpu queue
- The output is piped to awk, which prints the first column (node name) of every line that starts with `atl` (which skips the output header), separating each value with a comma
- Because the last value is also its own record, it is followed by the output record separator too, so we pipe the output to sed to strip only that trailing comma

<b>Could this be done with one command (e.g. just awk or sed) instead of the two?</b>
</font>

In [None]:
pace-check-queue pace-cpu

In [None]:
pace-check-queue pace-cpu | awk 'BEGIN {ORS=","} /^atl/ {print $1}'

In [None]:
pace-check-queue pace-cpu | awk 'BEGIN {ORS=","} /^atl/ {print $1}' | sed 's&,$&&'

In [None]:
pace-check-queue pace-cpu | awk '/^atl/ {nodes=nodes$1","; print nodes;}'

In [None]:
pace-check-queue pace-cpu | awk '/^atl/ {nodes=nodes$1","} END {print gensub(/,$/,"","g",nodes)}'
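
Two further variations, sketched on made-up input: the last cell above relies on `gensub`, which is specific to GNU awk, so the first variation builds the list with a separator variable that works in any awk, and the second hands the joining to `paste`. The node names below are hypothetical and only imitate the `atl` prefix.

```bash
# Hypothetical queue listing: two header lines, then one line per node
queue='Queue header line
=====
atl1-1-01-001 8 24
atl1-1-01-002 0 24
atl1-1-01-003 24 24'

# sep is empty for the first node and a comma afterwards,
# so no trailing comma is ever produced
awk_only=$(printf '%s\n' "$queue" | awk '/^atl/ {nodes = nodes sep $1; sep = ","} END {print nodes}')

# Let awk select the column and paste join the lines with commas
with_paste=$(printf '%s\n' "$queue" | awk '/^atl/ {print $1}' | paste -s -d, -)

echo "$awk_only"
echo "$with_paste"
```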

<font color='red'>
    <p style="font-size:14pt; text-decoration:underline; font-weight:bold">
        DEPRECATED - PBSNODES EXAMPLE
    </p>
</font>

<font color=377117>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Example 4: Using AWK and sed to Parse the Output of 'pbsnodes'
</p>

- pbsnodes provides the node details used for `pace-check-queue`
- One of the options allows the data to be output in XML format, which makes for slightly easier parsing
- Using a combination of `sed` and `awk`, we can do some neat things!


In [None]:
# pbsnodes --xml

In [None]:
# pbsnodes --xml | sed 's&<Node>&\n<Node>&g'

In [None]:
#  pbsnodes --xml | sed 's&<Node>&\n<Node>&g' | sed -rn 's&.*<name>([^.]*).*ze=([0-9]*)kb:([0-9]*)kb.*&\1 \2 \3&p'

In [None]:
#  pbsnodes --xml | sed 's&<Node>&\n<Node>&g' \
#    | sed -rn 's&.*<name>([^.]*).*ze=([0-9]*)kb:([0-9]*)kb.*&\1 \2 \3&p'

In [None]:
#   pbsnodes --xml \
#     | sed 's&<Node>&\n<Node>&g' \
#     | sed -rn 's&.*<name>([^.]*).*ze=([0-9]*)kb:([0-9]*)kb.*&\1 \2 \3&p' \
#     | awk '{sum1+=$2; sum2+=$3; printf "%40s %5.2f of %6.2f GB used (%5.2f%% available)\n",$1" disk utilization:",(($3-$2)/1048576),($3/1048576),(100*$2/$3)} END {printf "Total of %7.2f out of %7.2f GB used (%5.2f%% available)",((sum2-sum1)/1048576),(sum2/1048576),(100*sum1/sum2)}' 

In [None]:
# pbsnodes --xml \
#     | sed 's&<Node>&\n<Node>&g' \
#     | sed -n 's&.*<name>\([-a-z0-9-]*\)\.pace\.gatech\.edu</name>.*size=\([0-9]*\)kb:\([0-9]*\)kb.*&\1 \2 \3&p' \
#     | awk '
#           {
#             sum1+=$2; 
#             sum2+=$3; 
#             printf "%40s %5.2f of %6.2f GB used (%5.2f%% available)\n",$1" disk utilization:",
#             (($3-$2)/1048576),($3/1048576),(100*$2/$3)
#           }
#           END {
#             printf "Total of %7.2f out of %7.2f GB used (%5.2f%% available)",
#             ((sum2-sum1)/1048576),(sum2/1048576),(100*sum1/sum2)
#           }'

In [None]:
# AWKCMD='
# {
#   sum1+=$2; 
#   sum2+=$3; 
#   printf "%40s %5.2f of %6.2f GB used (%5.2f%% available)\n",$1" disk utilization:",
#     (($3-$2)/1048576),($3/1048576),(100*$2/$3)
# } 
# END {
#   printf "Total of %7.2f out of %7.2f GB used (%5.2f%% available)",
#   ((sum2-sum1)/1048576),(sum2/1048576),(100*sum1/sum2)
# }' 
 
# pbsnodes --xml \
#     | sed 's&<Node>&\n<Node>&g' \
#     | sed -n 's&.*<name>\([-a-z0-9-]*\)\.pace\.gatech\.edu</name>.*size=\([0-9]*\)kb:\([0-9]*\)kb.*&\1 \2 \3&p' \
#     | awk "$AWKCMD" 

<font color='red'>
    <p style="font-size:14pt; text-decoration:underline; font-weight:bold">
        DEPRECATED - DISTRIBUTED CODE EXAMPLE
    </p>
</font>

<font color=377117>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Example 5: Using Various Techniques to Run Distributed Code
</p>

- `run_wrapper.sh` is the launcher script to invoke MPI
- `run_linpack.sh` is the task script to map processes correctly

In [None]:
# cat content/run_wrapper.sh
# cat content/run_linpack.sh

<font color=F95E10>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Hands-On Exercise 1: Stream Manipulation for Efficient Data Processing
</p>

Use command line tools to determine the best (i.e. lowest sigma) shape and gap parameters for the filter. These should be stored (and exported!) in the variables SHAPE_TIME and GAP_TIME. There is a right answer, but multiple methods to getting it - do what works for you and makes sense :)
</font>

In [None]:
cat content/filt.par

In [None]:
export SHAPE_TIME=$(cat content/filt.par | tail -n+2 | sort -k7 -n | awk 'NR==1 {print $1}')
export GAP_TIME=$(cat content/filt.par | tail -n+2 | sort -k7 -n | awk 'NR==1 {print $2}')
echo "Shape time is $SHAPE_TIME"
echo "Gap time is $GAP_TIME"

<p style="font-size:19.5pt; text-decoration:underline; font-weight:bold; color:#003057">
    Beyond Standard Files in Linux
</p>

<font size=3><u><font color=B3A369><b>Soft/Symbolic and Hard Links</b></font>: Simplifying Paths and Reducing Redundancy</u></font>

Sometimes, the same file/directory is needed many times by many applications, or by multiple users. While the data could be copied for each instance, this can introduce problems, mainly:

- Any updates to a copy will only be reflected in that copy
- Each copy creates data redundancy, requiring more disk space than likely necessary

Path links address both of these problems by creating a reference to a path in another location. There are two types of links:

- Hard Links: To create a hard link, use `ln <PATH_TO_SOURCE> <PATH_OF_LINK>`
    - Hard links associate two or more filenames with the same inode
    - Hard links share the same data blocks on disk, but behave as independent files
    - Because inodes are partition specific, hard links CANNOT span partitions (e.g., TruNAS where home lives cannot have a hard link to GPFS where project lives)
    - Hard links CANNOT be created for directories, because they could create loops in the directory tree
    - To remove the underlying data, all hard links to it must be deleted (in other words, running `rm <PATH_TO_SOURCE>` will remove the file entry in the source location, but `<PATH_OF_LINK>` can still access the file because the data itself was not deleted)
    - An example use case for hard links is incremental backups with `rsync` ([http://www.mikerubel.org/computers/rsync_snapshots/#Incremental](http://www.mikerubel.org/computers/rsync_snapshots/#Incremental))
- Symbolic (or Soft) Links: To create a symbolic link, use `ln -s <PATH_TO_SOURCE> <PATH_OF_LINK>`
    - Symbolic links are effectively pointers to a file location
    - Each symbolic link gets its own unique inode entry on the partition on which it resides
    - Symbolic links CAN be created for directories
    - Removing the source file BREAKS symbolic links
    - An example use case for symbolic links is to reduce path name complexity, such as your **data** and **scratch** storage volumes
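
The differences listed above can be checked directly. The following is a minimal sketch, assuming GNU `stat` (as found on most Linux systems) and using throwaway file names (`source.txt`, `hard.txt`, `soft.txt`) in a temporary directory; none of these names come from the course materials.

```shell
# Work in a throwaway directory so nothing in the course folder is touched
demo=$(mktemp -d)
echo "original data" > "$demo/source.txt"

ln    "$demo/source.txt" "$demo/hard.txt"   # hard link: a second name for the same inode
ln -s "$demo/source.txt" "$demo/soft.txt"   # symbolic link: a pointer to a path

# GNU stat: %i prints the inode number, %h the number of hard links
src_inode=$(stat -c %i "$demo/source.txt")
hard_inode=$(stat -c %i "$demo/hard.txt")
soft_inode=$(stat -c %i "$demo/soft.txt")   # stat does not follow the symlink by default
link_count=$(stat -c %h "$demo/hard.txt")
echo "source=$src_inode hard=$hard_inode soft=$soft_inode links=$link_count"

# Remove the original name: the hard link still reads the data, the symlink dangles
rm "$demo/source.txt"
hard_content=$(cat "$demo/hard.txt")
if [ -e "$demo/soft.txt" ]; then soft_state="ok"; else soft_state="broken"; fi
echo "hard link content: $hard_content; symbolic link is $soft_state"

rm -rf "$demo"
```

The hard link and the source report the same inode, while the symbolic link reports its own.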

In [None]:
echo -e "\033[1;34mls -ial\033[m"
ls -ial
echo -e "\n\033[1;34mls -lia content\033[m"
ls -lia content
echo -e "\n\033[1;34mls -lid ../PACE-Linux-102\033[m"
ls -lid ../PACE-Linux-102

<font color=377117>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Example 4: Creating a symbolic link to a shared data file
</p>

- We want to create a symbolic link to a shared data file so that we can treat it like it's actually located here
- To make things more exciting, we will randomly pick one of the 5 available files using RANDOM, the built-in bash variable that produces a random number each time it is read
    - By operating modulo 5, we can get a random file name from "unknown_0.tgz" through "unknown_4.tgz"
- We will use this symlink in a later exercise!

<b>Can you determine to which file you created a link? (Hint: think about 'file' or 'ls -l')</b>
</font>
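
As a rough sketch of the idea, the snippet below picks an index with `RANDOM` modulo 5 and links to the matching file. It assumes bash and uses empty placeholder files in a temporary directory instead of the shared course data; `readlink` is one more way to see where a link points.

```shell
work=$(mktemp -d)
# Empty placeholder files standing in for the five shared archives
for i in 0 1 2 3 4; do touch "$work/unknown_$i.tgz"; done

idx=$((RANDOM % 5))                                   # integer from 0 through 4
ln -sfn "$work/unknown_$idx.tgz" "$work/unknown.tgz"  # -f replaces an existing link

target=$(readlink "$work/unknown.tgz")                # path stored in the link
echo "unknown.tgz -> $target"

rm -rf "$work"
```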

In [None]:
ln -sfn /storage/ice-shared/pace-shared/materials/linux102/unknown_$((RANDOM%5)).tgz unknown.tgz

In [None]:
ls -l

<font size=3><u><font color=B3A369><b>Named Pipes</b></font>: Command Line Shared Memory Streams</u></font>

In Linux 101, we learned about the anonymous pipe, `|`, which transfers the STDOUT of one command to STDIN of another. While generally useful, there are limitations to this type of pipe. The data must immediately be consumed by another process, the two commands must run within the same shell, and the two commands must run on the same partition. This is especially problematic for producer-consumer schemes, where two processes run simultaneously in separate shells and exchange data back and forth. In compiled code, the solution is to utilize shared memory, a memory space that is utilized by both programs simultaneously; in addition to the problem of having to write your own code to achieve this, this method has its own challenges in handling the effective read-write synchronization.

In Linux, there is a powerful tool called a Named Pipe that can operate outside of compiled applications. A FIFO (first-in, first-out) is a special file type that can ingest data from a producer and provide it to a consumer, even if they're on separate partitions or in different shells. To create a Named Pipe, the command `mkfifo <NAME>` is used. It will look just like a regular file, but now it takes up no disk space and will directly move data from one command to another!

Warning: data written to a pipe must be consumed, so producer processes will hang until there is a consumer ingesting the output.
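
A small producer-consumer sketch of this behavior, assuming GNU `stat` and a temporary directory (the names `work` and `fifo` are arbitrary): the producer is started in the background and waits until the consumer opens the pipe.

```shell
work=$(mktemp -d)
mkfifo "$work/fifo"

# Producer: blocks until some process opens the FIFO for reading
seq 1 100 > "$work/fifo" &

# Consumer: reads the stream as if it were a regular file and sums the numbers
total=$(awk '{ s += $1 } END { print s }' "$work/fifo")
wait                                    # make sure the producer has finished
echo "sum of 1..100 = $total"

fifo_size=$(stat -c %s "$work/fifo")    # the FIFO itself never holds data on disk
echo "size of the FIFO on disk: $fifo_size bytes"

rm -rf "$work"
```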

In [None]:
rm -f fifo
mkfifo fifo
ls -l
for i in $(seq 3000); do echo $i; done > fifo &

In [None]:
ls -l
cat fifo

<font size=3><u><font color=B3A369><b>Temporary File System</b></font>: More Explicit Use of Local Memory</u></font>

Unfortunately, Named Pipes have the inherent problem that data must be consumed as it is written, which can block processes from running. To address this, another option is to use the temporary file system, tmpfs. This is a filesystem based on local memory, and while its use is the same as the regular filesystem, everything written to and read from here is actually in memory, not on a disk. This can help with file I/O immensely, especially if the data is not amenable to caching (although the use of parallel filesystems such as GPFS and Lustre on HPC clusters makes a compelling case to use disk-based filesystems).

Often, `/dev/shm` (shared-memory device) is the best way to utilize the temporary file system. Data can be written to and read from here just like normal - but if you use it, be sure to clean up after yourself, as this volume is local to the machine and can fill up very quickly!
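
One possible pattern for using `/dev/shm` politely is sketched below. It assumes GNU `mktemp`; the directory prefix `linux102` is made up, and the fallback to `/tmp` is only there so the sketch still runs on machines without a writable `/dev/shm`.

```shell
base=/dev/shm
# Fall back to a disk-based temporary directory if /dev/shm is unavailable
if [ ! -d "$base" ] || [ ! -w "$base" ]; then base=${TMPDIR:-/tmp}; fi

scratch=$(mktemp -d "$base/linux102.XXXXXX")   # private scratch directory
seq 1 1000 > "$scratch/numbers.txt"            # write just like a normal file
lines=$(wc -l < "$scratch/numbers.txt")        # read just like a normal file
echo "wrote $lines lines to $scratch"

df -h "$base"                                  # check how full the volume is

rm -rf "$scratch"                              # clean up after yourself
if [ -e "$scratch" ]; then cleaned="no"; else cleaned="yes"; fi
```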

<font size=3><u><font color=B3A369><b>Linux Device Mounts - /dev</b></font>: Super Special Awesome Files in Linux</u></font>

The `/dev` directory contains special files for the various devices, which are created during installation. Within this directory are a few really special files:

- `/dev/null`: the bit bucket; anything written here will disappear forever; useful if you want to discard the output from commands
- `/dev/zero`: provides as many NULL characters as are read from it; the stream never ends, so something like `head` with a byte count (`-c`) should be used to bound the read (`cat` would keep reading until interrupted)
- `/dev/random`: a non-deterministic random number generator that uses entropy from system hardware; if no entropy is available, it will wait until more is available before producing additional numbers
- `/dev/urandom`: a semi-deterministic random number generator that uses entropy from system hardware; if no entropy is available, it uses a pseudo-random number generator to produce additional numbers
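
A few quick checks of these special files, as an illustrative sketch (the missing path and the 16-byte token length are arbitrary choices):

```shell
# /dev/zero: ask for exactly 1024 bytes, then confirm every one of them is NULL
zero_bytes=$(head -c 1024 /dev/zero | wc -c)
non_null=$(head -c 1024 /dev/zero | tr -d '\000' | wc -c)
echo "read $zero_bytes bytes, $non_null of them non-NULL"

# /dev/null: discard an error message while still acting on the exit status
if ls /path/that/should/not/exist 2>/dev/null; then status="found"; else status="missing"; fi
echo "path is $status"

# /dev/urandom: 16 random bytes printed as 32 hexadecimal characters
token=$(head -c 16 /dev/urandom | od -An -tx1 | tr -d ' \n')
echo "random token: $token"
```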

In [None]:
head -c5G /dev/zero | timeout 10 tail

In [None]:
head -c 30 /dev/urandom | tr -dc 'A-Za-z0-9[:punct:]' ; echo ''

<font size=3><u><font color=B3A369><b>Here Docs and Here Strings</b></font>: On-the-Fly STDIN Utilization</u></font>

Sometimes, it is more convenient to redirect variable values or string literals to the STDIN of commands. For example, the `pace-jupyter-notebook` wrapper is a Bash script that uses a built-in template to produce the job script, rather than a separate template file stored elsewhere. To preserve the formatting and utilize run-time determined parameters, it is constructed using a Here Doc.

From Linux 101, the `<` operator is used to redirect the contents of a file to STDIN of a command. If, instead, we use `<<TERMINATOR`, the shell will continue reading from the terminal (or if in a script, the subsequent lines) until it encounters the `TERMINATOR` value; this could be a single character, or a sequence such as **EOF**. This operation is called a Here Doc, because a multi-line, formatted document is effectively written between the initial redirect (`<<TERMINATOR`) and the termination sequence (`TERMINATOR`). When writing scripts that use a general format, but can be configured differently, Here Docs are invaluable for passing the content to the command of interest.
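
One detail worth seeing in action is how the terminator controls expansion. In the sketch below (bash assumed; the variable `NAME` is invented for illustration), an unquoted terminator lets the shell expand variables inside the document unless the `$` is escaped, while a quoted terminator passes every line through literally.

```shell
NAME="analysis"

# Unquoted terminator: $NAME is expanded now, \$HOME is kept literal for later
unquoted=$(cat <<EOF
job=$NAME home=\$HOME
EOF
)

# Quoted terminator: nothing inside the document is expanded
quoted=$(cat <<'EOF'
job=$NAME home=$HOME
EOF
)

echo "unquoted: $unquoted"
echo "quoted:   $quoted"
```

This matters when a Here Doc is used to write a script: values that should be fixed at creation time are left unescaped, and variables that should be resolved when the script runs are escaped.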

In [None]:
awk -F';' '{printf "%-30s%6.2f\n",$2,$3*100/($4+$3)}' <<EOF
SEC;Kentucky Wildcats;38;2
Big 12;Kansas Jayhawks;32;7
ACC;Duke Blue Devils;27;7
ACC;North Carolina Tar Heels;32;6
Pac-12;UCLA Bruins;19;14
EOF

Similarly, sometimes we want to use the value of a variable or literal string as the input to our command. To redirect this content, we simply add one more `<` to get `<<<STRING`. With this, the string is read directly into STDIN and processed by the command accordingly. The torque submit filter very heavily utilizes Here Strings to efficiently investigate job submissions.
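
As a small sketch (bash assumed), a Here String pairs nicely with `read` to split a delimited value into separate variables. The record format `host:cores:memory` is hypothetical.

```shell
record="node-17:24:192"     # hypothetical host:cores:memory record

# IFS=: applies only to this read, which splits the string on colons
IFS=: read -r host cores mem_gb <<<"$record"
echo "$host has $cores cores and $mem_gb GB of memory"

# A Here String appends a trailing newline, so "abc" is counted as 4 bytes
nchars=$(wc -c <<<"abc")
echo "bytes read from the Here String: $nchars"
```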

In [None]:
VALUE="The quick brown fox jumped over the lazy dog."
echo "$VALUE" | sed 's/The \([[:alpha:]]*\) \([[:alpha:]]*\) fox jumped over the \([[:alpha:]]*\) dog\./\1 \2 \3/'
sed 's/The \([[:alpha:]]*\) \([[:alpha:]]*\) fox jumped over the \([[:alpha:]]*\) dog\./\1 \2 \3/' <<<"$VALUE"
read -r WORD1 WORD2 WORD3 <<<"$VALUE" && echo "$WORD3" "$WORD2" "$WORD1"
wc <<<"$VALUE"

<font size=3><u><font color=B3A369><b>Process Substitution</b></font>: Redirecting Streams as Files</u></font>

Usually we think of I/O redirection as simply connecting the STDIN or STDOUT of a command to a file using `<` and `>`, respectively. However, sometimes commands expect file names for input/output, and cannot work with these streams directly. For instances like this, **process substitution** provides the solution:

```
Process substitution allows a process’s input or output to be referred to using a filename.
```

The syntax for process substitution is `<(command list)` and `>(command list)` for input and output, respectively. Note, there are no spaces between the `<`/`>` and the opening parenthesis, as this would result in an error.
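
A typical case is a command that insists on two file arguments, such as `comm`, which compares two sorted files. The sketch below (bash assumed; the fruit lists are made up) feeds it two sorted streams without creating any intermediate files.

```shell
# comm -12 prints only the lines common to both (sorted) inputs
common=$(comm -12 <(printf '%s\n' cherry apple banana | sort) \
                  <(printf '%s\n' banana date cherry | sort))
echo "$common"
```

Each `<(...)` is replaced by a path (for example under `/dev/fd`) that `comm` opens like an ordinary file.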

In [None]:
awk '{print $1/$2}' <(echo "10 5")

<p style="font-size:19.5pt; text-decoration:underline; font-weight:bold; color:#003057">
    Compiling Code from Source
</p>

As mentioned in Linux 101, scientific software is usually available in two ways - as a precompiled binary or as source code. Precompiled binaries are pre-built packages

<font size=3><u><font color=B3A369><b>Git</b></font>: A Software Repository for Version Control</u></font>

According to the `git` man page:

```
Git is a fast, scalable, distributed revision control system with an unusually rich command set that provides both high-level operations and full access to internals.
```

Originally designed for development of the Linux kernel, Git is widely used by teams of programmers to coordinate work, as well as track changes across files. Additionally, a shared Git repository can provide a great way to distribute your code to other researchers. If you write code (in Python, Bash, Java, C, Fortran - any language!) for simulation, analysis, or data manipulation, integrating Git as a part of your workflow can help you immensely!

Repository management services like GitHub (GT provides enterprise access) and GitLab provide hosting services to share and maintain your repository. This allows you to maintain one global repository, with multiple branches if desired, to proceed with non-linear development.

Some important commands to use Git are as follows:

- `git init`: turn the current directory into a Git repository
- `git remote add origin <REMOTE REPOSITORY URL>`: connect your local repository to the remote 
- `git clone <REMOTE REPOSITORY URL>`: creates a local copy of a remote repository
- `git add .`: stages all of the changes in the current directory for the next commit
- `git commit -m "INITIALS: Comment for commit"`: records the staged changes along with a useful message summarizing them
- `git push origin <BRANCH>`: updates the remote repository to reflect the changes committed locally
- `git pull`: updates the local repository to reflect the status of the remote repository
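
To see how these commands fit together without needing an account anywhere, the sketch below uses a bare repository in a temporary directory as a stand-in for a GitHub/GitLab remote. The user name, email, commit message, and file name are all placeholders, and `git -C <DIR>` is used only to avoid changing directories.

```shell
work=$(mktemp -d)
git init -q --bare "$work/remote.git"     # stand-in for a hosted remote repository

# Create a local repository and connect it to the "remote"
git init -q "$work/project"
git -C "$work/project" remote add origin "$work/remote.git"

# Make a change, stage it, and commit it
echo 'echo "hello from PACE"' > "$work/project/run.sh"
git -C "$work/project" add .
git -C "$work/project" -c user.name='Example User' -c user.email='user@example.com' \
    commit -q -m "EU: add run script"

# Push the current branch (its default name depends on the local Git configuration)
branch=$(git -C "$work/project" symbolic-ref --short HEAD)
git -C "$work/project" push -q origin "$branch"

# A collaborator would obtain the code with git clone
git clone -q "$work/remote.git" "$work/copy"
cloned=$(cat "$work/copy/run.sh")
echo "cloned file contains: $cloned"

rm -rf "$work"
```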

<font size=3><u><font color=B3A369><b>A Note on Linking Libraries and Environment Variables</b></font></u></font>

When compiling and running code, the compiler and the dynamic loader need to know where to find shared libraries. The locations of shared libraries are actually stored in two separate variables:

- `LIBRARY_PATH` tells the compiler where to find the libraries at linkage (i.e. when compiling)
- `LD_LIBRARY_PATH` tells the program where to find the libraries at runtime

Both of these variables should be set so that the libraries can be found. `LIBRARY_PATH` plays the same role as the `-L<PATH_TO_LIBRARY_DIRECTORY>` compiler flag, which adds a directory to the link-time search path; the library itself is then requested by name with `-l<NAME_OF_LIBRARY>`. If you are unsure of the name of the library (e.g. `fftw` versus `fftw3`), you can run `ldconfig -p | grep <LIBRARY_BASE_NAME>` to see whether it is installed and, if so, under what name.
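
When a library lives outside the system locations (and is not provided by a module), one common approach is to prepend its directory to both variables. This is a sketch only: the install prefix `$HOME/sw/fftw/lib` is hypothetical, and the `${VAR:+:$VAR}` form simply avoids a stray colon when the variable was previously empty.

```shell
prefix="$HOME/sw/fftw/lib"     # hypothetical location of a locally built library

# Prepend the directory, keeping whatever was already on each path
export LIBRARY_PATH="$prefix${LIBRARY_PATH:+:$LIBRARY_PATH}"
export LD_LIBRARY_PATH="$prefix${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

first_entry=${LD_LIBRARY_PATH%%:*}     # first directory the loader will search
echo "runtime search starts in: $first_entry"

# With LIBRARY_PATH set, a gcc link step could then name the library only, e.g.:
#   gcc main.c -lfftw3 -o main
```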

<font size=3><u><font color=B3A369><b>Make</b></font>: Build Automation</u></font>

From the `make` man page:

```
The purpose of the make utility is to determine automatically which pieces of a large program need to be recompiled, and issue the commands to recompile them.
```

Make is a <b>build automation</b> tool that reads a <b>Makefile</b>, which specifies the details for how to correctly build the target application. The Makefile provides instructions, such as the compiler flags to be used and what libraries should be linked, meaning that you, the user, do not need to remember them each time. Additionally, if only part of the source code is changed, `make` will rebuild only that part and anything that depends on it, which improves workflow efficiency by reducing the time spent compiling code.

Typically, usage is as simple as updating the <b>Makefile</b> to reflect the local software environment (if not already captured in the usual variables such as `LD_LIBRARY_PATH`, `LIBRARY_PATH`, etc.) and running `make` (and `make install` if the project should be installed somewhere else). If an error occurs and the build needs to be restarted, run `make clean` (a target that most Makefiles provide by convention) to purge any problematic components.
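
The rebuild logic can be seen with a toy Makefile that needs no compiler at all. This is a sketch assuming GNU make; the file names `data.txt` and `sorted.txt` are invented, and `printf` is used to write the Makefile so that the required tab characters are explicit. The `-q` option asks `make` whether the target is up to date without building anything.

```shell
build=$(mktemp -d)
printf '3\n1\n2\n' > "$build/data.txt"

# Toy Makefile: sorted.txt depends on data.txt; "clean" removes the product
printf 'sorted.txt: data.txt\n\tsort -n data.txt > sorted.txt\n\n.PHONY: clean\nclean:\n\trm -f sorted.txt\n' \
    > "$build/Makefile"

make -s -C "$build"                               # first build creates sorted.txt
first_line=$(head -n 1 "$build/sorted.txt")

# Nothing changed, so make reports that the target is up to date
if make -s -q -C "$build"; then fresh="up to date"; else fresh="needs rebuild"; fi

# Updating the prerequisite makes the target stale again
sleep 1
touch "$build/data.txt"
if make -s -q -C "$build"; then stale="up to date"; else stale="needs rebuild"; fi
echo "before touch: $fresh; after touch: $stale"

make -s -C "$build" clean                         # purge the build product
if [ -e "$build/sorted.txt" ]; then cleaned="no"; else cleaned="yes"; fi
rm -rf "$build"
```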

<font color=F95E10>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Hands-On Exercise 2: Building Code from Source with git and Make
</p>
    
Clone the repository from [https://gitlab.com/apjezghani/analyze](https://gitlab.com/apjezghani/analyze), load the appropriate modules, and build the project.
</font>

In [None]:
git clone https://gitlab.com/apjezghani/analyze.git
cd analyze
module load gcc/10.3.0 mvapich2/2.3.6 fftw root
make clean
make
cd ..

In [None]:
tree analyze

<p style="font-size:19.5pt; text-decoration:underline; font-weight:bold; color:#003057">
    File Management with Compression and Archiving
</p>

Compression and archival of data have many benefits for researchers:

- Reduced data volume, allowing more data to be stored on disk
- Reduced number of inodes, preventing issues with quotas on the number of files/directories
- Smaller files can be transferred more quickly (for example, from network storage like on PACE)
- Reduces data structure complexity, as multiple files can be compressed to a single archive

The trade-off, however, is that CPU time must be spent packing/unpacking the data, which can impact performance. The trick is to find the right balance for individual needs - what may work for one problem might not be applicable for another.

There are two types of compression: lossy and lossless.

- <b>Lossy Compression</b> is fast and can achieve very high compression ratios (95%+ is quite common), but it introduces "noise" (random errors) into the data, so it can never be reconstructed perfectly. Common examples include JPEG, MP3, and MP4 files; more specialized examples for scientific research are SZ and ZFP.
- <b>Lossless Compression</b> is slower and typically achieves lower compression ratios (50-75% is usual), but the files can be reconstructed exactly. Common examples include PNG, FLAC, and ZIP; however, domain-specific tools such as GeCo for bioinformatics can improve both compression rate and ratio.

<img src="img/lossy_compression.png" alt="Illustration of lossy compression." width="75%" align="center">

<font size=3><u><font color=B3A369><b>Lossless compression utilities</b></font>: zip, gzip, bzip2, and xz</u></font>

A variety of utilities are available as "standard" on most Linux systems (they don't need to be built from source), and they each serve a purpose. The zip format used on Windows is also available on Linux, but for many reasons other compression utilities are preferred. The table below lists the standard compression utilities, and some general properties (but note - these are very generalized statements, and the individual use case should be considered before picking the utility to use):

|Utility|Algorithm|Archives?|Compression Ratio|Compression Speed|Decompression Speed|
|:---|:---:|:---:|:---:|:---:|:---:|
|zip|DEFLATE|Yes|2x-3x|Very Fast|Very Fast|
|gzip|DEFLATE|No|2x-3x|Very Fast|Very Fast|
|bzip2|Burrows-Wheeler + Huffman|No|3x-4x|Fast|Very Slow|
|xz|LZMA|No|5x-7x|Very, Very Slow|Slow|

From the above table, it would seem that the only difference between `zip` and `gzip` is that the former can archive (compress multiple files at once) while the latter cannot. However, an important detail is that when `zip` is used to archive, each file is compressed individually - thus it cannot benefit from similarities between each file! In other words, compressing and archiving many similar files with `zip` will often result in lower efficiency than the same operation with `gzip`+`tar`!
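
This effect can be illustrated without `zip` itself. In the sketch below, compressing each file separately with `gzip` stands in for the per-file compression that `zip` performs (an approximation, not a measurement of `zip`), and it is compared with compressing a single `tar` stream. The 20 identical 2000-byte random files are an artificial best case chosen to make the difference obvious.

```shell
work=$(mktemp -d)
mkdir "$work/files"
head -c 2000 /dev/urandom > "$work/base.bin"
for i in $(seq 1 20); do cp "$work/base.bin" "$work/files/copy_$i.bin"; done

# Per-file compression: every file is compressed with no knowledge of the others
separate=0
for f in "$work"/files/*.bin; do
    size=$(gzip -c "$f" | wc -c)
    separate=$((separate + size))
done

# Archive first, then compress: repeated content across files can be exploited
together=$(tar -C "$work" -cf - files | gzip -c | wc -c)

echo "compressed separately: $separate bytes"
echo "tar + gzip:            $together bytes"
rm -rf "$work"
```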

<font size=3><u><font color=B3A369><b>Archiving with tar</b></font>: the <b>t</b>ape <b>ar</b>chive utility</u></font>

As mentioned in Linux 101, `tar` is used for archival (packaging many directories and files into a single container), and can be used with the above utilities to compress the archive (`-zc` compresses with `gzip` and `-zx` decompresses with `gunzip`). Some of the more advanced options for `tar` include:

- `-j`: filter compression/decompression through `bzip2`
- `-J`: filter compression/decompression through `xz` (note: this replaces the deprecated `--lzma` option)
- `-I <UTILITY>`: filter compression/decompression through `<UTILITY>` (note: the utility must accept the `-d` option for decompression)
- `-t`: list the contents of the archive (use with a compression utility flag if needed)
- `-O`: write the extracted data to STDOUT instead of to the filesystem
- `--to-command=<COMMAND>`: pipe the extracted data to STDIN of `<COMMAND>`
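
A short sketch of `-t` and `-O` on a small, made-up archive (the names `run`, `a.txt`, and `b.txt` are arbitrary): the archive is listed, and one member is then streamed straight into `awk` without being written back to disk.

```shell
work=$(mktemp -d)
mkdir "$work/run"
seq 1 5  > "$work/run/a.txt"
seq 6 10 > "$work/run/b.txt"

# Create a gzip-compressed archive of the run directory
tar -C "$work" -czf "$work/run.tgz" run

# -t lists the contents; -z is still needed because the archive is compressed
listing=$(tar -tzf "$work/run.tgz")
echo "$listing"

# -O sends the extracted member to STDOUT, so it can be piped into another tool
total=$(tar -xzOf "$work/run.tgz" run/b.txt | awk '{ s += $1 } END { print s }')
echo "sum of values in run/b.txt: $total"

rm -rf "$work"
```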

<font color=377117>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Example 5: Comparing compression with different data types
</p>

- To highlight the differences in the standard command line compression utilities, try running the test script **compress-demo.sh**, which analyzes performance on some small files for illustrative purposes
- The first command allows us to see what the script does
    - For each data type file and compressor combination, time compression and decompression, and report the compressed size
- The second command actually runs the test - but be sure to run it twice to get results that make sense!

<b>What general conclusions can you draw regarding the differences between the compression utilities? Why do you think we had to run the second command twice?</b>
</font>


In [None]:
cat content/compress-demo.sh

In [None]:
./content/compress-demo.sh #run this command twice!

<font color=F95E10>
<p style="font-size:14pt; text-decoration:underline; font-weight:bold">
    Hands-On Exercise 3: Putting it All Together
</p>

Use command line tools to create a Slurm batch script that will:

- Set the appropriate Slurm directives for the job:
    - &num;SBATCH -p pace-cpu
    - &num;SBATCH --mem-per-cpu=8G
    
- Load the prerequisite modules for the **analyze** program
- Use the results from Exercise 1 (stored in the variables SHAPE_TIME and GAP_TIME) that give the best resolution (smallest SIGMA) as command line arguments to the **analyze** program
- Unpack the linked tarball in memory and run the **analyze** program on the data

Submit the job via sbatch and wait for results!
</font>

In [None]:
tar -ztf unknown.tgz

In [None]:
cat <<EOF > script.sh
#!/bin/bash
#SBATCH -p pace-cpu
#SBATCH -N1 -n2
#SBATCH --mem-per-cpu=8G

cd \$SLURM_SUBMIT_DIR
pwd

module purge
module load gcc/10.3.0 root

export SHAPE_TIME=$(cat content/filt.par | tail -n+2 | sort -k7 -n | awk 'NR==1 {print $1}')
export GAP_TIME=$(cat content/filt.par | tail -n+2 | sort -k7 -n | awk 'NR==1 {print $2}')

./analyze/bin/analyze $SHAPE_TIME $GAP_TIME <(tar -Ozxf unknown.tgz)
EOF

sbatch script.sh

In [None]:
cat script.sh

In [None]:
squeue -u $USER

<p style="font-size:19.5pt; text-decoration:underline; font-weight:bold; color:#003057">
    Your Feedback is Valued!
</p>

While the above code runs, please take the time to fill out this brief questionnaire. These answers provide useful feedback to better tailor the course content to your needs!

<center><font size=4><a href=https://b.gatech.edu/2Fwx1PB>https://b.gatech.edu/2Fwx1PB</a></font></center>

<p style="font-size:19.5pt; text-decoration:underline; font-weight:bold; color:#003057">
    Checking Your Results
</p>

When your analysis job has finished (~2-3 minutes), run the command in the cell below to display an image of the spectrum. Compare it to the known spectra below - which isotope do you think you had?

<img src="img/choices.png" alt="Possible nuclear isotopes for measured spectra" width="100%" align="center">

In [None]:
display < spectrum.jpeg

<p style="font-size:19.5pt; text-decoration:underline; font-weight:bold; color:#003057">
    Some Useful Links!
</p>

* [https://explainshell.com/](https://explainshell.com/): Breaks down shell commands and options according to utility man page
* [https://www.regextester.com/111539](https://www.regextester.com/111539): A regular expression tester (personally I still prefer fumbling through the command line).
* [An Introduction to Regular Expressions](https://learning.oreilly.com/library/view/an-introduction-to/9781492082569/): A textbook detailing the use of regular expressions for pattern matching.
* [Learning AWK Programming](https://learning.oreilly.com/library/view/learning-awk-programming/9781788391030/): A textbook all about AWK!
* [sed & awk Pocket Reference](https://learning.oreilly.com/library/view/sed-and-awk/0596003528/ch01.html): A brief overview for the usage of sed and AWK.