# Learn the GAWK programming language hands-on

## Introduction

GAWK (GNU AWK) is the GNU implementation of AWK, a programming language designed for text processing and typically used as a data extraction and reporting tool. An AWK implementation is a standard feature of most Unix-like operating systems.

AWK was created in the 1970s at Bell Labs and remains one of the most widely used scripting languages for text processing. GAWK is also available for Microsoft Windows and other operating systems.

GAWK is a complete programming language. It has variables, control structures, and user-defined functions, along with built-in functions and operators for regular expression matching, string concatenation, and file input/output. It is interpreted and dynamically typed.
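As a small illustration of these language features, the sketch below combines a variable, a `for` loop, and a user-defined function. The function name `sum_fields` and the sample input are made up for this example; only standard awk features are used, so it should behave the same under `gawk` and other awk implementations.

```shell
# Sum the numbers on each input line with a user-defined function.
# sum_fields is a hypothetical name chosen for this example.
result=$(printf '1 2 3\n10 20\n' | awk '
function sum_fields(    i, total) {   # extra parameters act as local variables
    for (i = 1; i <= NF; i++)
        total += $i                   # dynamically typed: fields become numbers here
    return total
}
{ print "line " NR ": " sum_fields() }
')
echo "$result"
```

Declaring `i` and `total` as extra parameters is the usual awk idiom for local variables, since every other variable is global.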

## GAWK vs AWK

GAWK is a superset of the original awk. It is upwardly compatible with both the original version and the System V Release 4 version of awk, so programs written for them should run under GAWK, and it adds many features of its own.

These are some of the extensions of GAWK over standard (POSIX) awk:

* Additional data types: arrays of arrays, which give true multidimensional arrays. (Associative arrays are already part of standard awk.)
* Additional control structures: the `switch` statement. (`do`-`while`, `for`, `break`, and `continue` are already part of standard awk.)
* Additional operators: `**` and `**=` as synonyms for `^` and `^=`. (`++`, `--`, `+=`, `-=`, `*=`, `/=`, `%=`, `^=`, and the ternary conditional operator are already part of standard awk.)
* Additional built-in functions: and(), asort(), asorti(), gensub(), mktime(), strftime(), and systime(). (atan2(), close(), cos(), exp(), fflush(), index(), int(), log(), match(), rand(), sin(), split(), sprintf(), srand(), substr(), tolower(), and toupper() are not GAWK-specific and are available in other awk implementations as well; getline is a statement rather than a function.)
* Additional special patterns: BEGINFILE and ENDFILE.
* Additional command-line options: -M, and the -W form of GAWK-specific options. (-f, -F, and -v are already part of standard awk.)
* Additional built-in variables: ARGIND, BINMODE, ERRNO, FIELDWIDTHS, FPAT, IGNORECASE, LINT, RT, and TEXTDOMAIN. (ARGC, ARGV, CONVFMT, ENVIRON, NF, NR, OFMT, OFS, ORS, RLENGTH, RS, RSTART, and SUBSEP are already part of standard awk.)
* Additional features: 
    1. the ability to include source files with the `@include` directive
    1. the ability to load shared libraries with the `@load` directive
    1. the ability to use arbitrary-precision arithmetic with the -M option

*On many modern Linux distributions, the awk command is a symbolic link to the gawk command, so when you run awk you are actually running gawk. This is not universal: Debian and Ubuntu, for example, use mawk as the default awk unless gawk is installed. The link exists for backward compatibility, so that scripts calling awk keep working.*


## General syntax

A GAWK program is made up of rules. Each rule consists of a pattern and an action. These concepts are explained below.

This is the general syntax of a GAWK program:

```awk
pattern { action }
pattern { action }
...
```

This is an example of pattern and action:

```awk
BEGIN { print "Hello, world!" }
```
where `BEGIN` is a special pattern: its action is executed before the first input record is read.
`{ print "Hello, world!" }` is an action, which can consist of one or more statements. Here the action is a single statement that prints the string "Hello, world!". The curly braces are required to delimit an action.


When a rule's pattern matches the current input record, its action is executed. A program is made up of one or more rules separated by newlines or semicolons. An action is made up of one or more statements separated by newlines or semicolons.
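Either half of a rule may be left out: a pattern with no action prints every matching record, and an action with no pattern runs for every record. The following is a minimal sketch of both forms, using made-up sample input and a hypothetical variable named `count`.

```shell
# Three rules: a pattern-only rule, an action-only rule, and an END rule.
rules_out=$(printf 'apple 3\nbanana 12\ncherry 7\n' | awk '
$2 > 5                 # pattern only: matching records are printed unchanged
{ count++ }            # action only: runs for every record
END { print "records:", count }
')
echo "$rules_out"
```

Because a newline ends a rule, the opening brace of an action has to sit on the same line as its pattern. Here the first two lines are deliberately two separate rules.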



In [16]:
awk 'BEGIN { print "Hello, world!" }'

Hello, world!


In [None]:
# Overall structure of a gawk program including BEGIN, main, and END blocks

awk 'BEGIN { print "Hello, world!" }
    (NR == 1) { print "This is the first line." }
    END { print "Goodbye, world!" }' file1 file2
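To try the same BEGIN / main / END structure without needing existing input, the sketch below creates two small throwaway files first. The file names and their contents are invented for this example.

```shell
# Build two small sample files in a temporary directory (names are arbitrary).
tmpdir=$(mktemp -d)
printf 'a1\na2\n' > "$tmpdir/file1"
printf 'b1\nb2\nb3\n' > "$tmpdir/file2"

structure_out=$(awk '
BEGIN    { print "start" }                    # runs once, before any input
NR == 1  { print "first line: " $0 }          # runs only for the first record overall
END      { print "total records: " NR }       # runs once, after all input
' "$tmpdir/file1" "$tmpdir/file2")

echo "$structure_out"
rm -r "$tmpdir"
```

The `NR == 1` rule fires only once even though two files are read, because `NR` is not reset between files.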




## GAWK's special variables

GAWK has many built-in variables. These are some of the built-in variables:

* `ARGC`: The number of command-line arguments.
* `ARGV`: An array of command-line arguments.
* `BINMODE`: Controls binary mode for file input/output on non-POSIX systems.
* `CONVFMT`: The format used when numbers are converted to strings.
* `ENVIRON`: An array of environment variables.
* `ERRNO`: A string describing the system error when a `getline` or `close()` operation fails.
* `FIELDWIDTHS`: The field widths for fixed-width data.
* `FILENAME`: The name of the current input file.
* `FNR`: The current record number in the current input file.
* `FPAT`: The regular expression describing the contents of the fields in a record.
* `FS`: The input field separator.
* `IGNORECASE`: The case-insensitive matching flag.
* `LINT`: When true, enables lint warnings about dubious or non-portable constructs.
* `NF`: The number of fields in the current record.
* `NR`: The total number of input records read so far, across all input files.
* `OFMT`: The format used for numbers printed by `print`.
* `OFS`: The output field separator.
* `ORS`: The output record separator.
* `RLENGTH`: The length of the string matched by the `match()` function.
* `RS`: The input record separator.
* `RSTART`: The index of the first character matched by the `match()` function.
* `RT`: The input text that matched `RS` and terminated the current record.
* `SUBSEP`: The separator placed between subscripts in multidimensional array indices.
* `TEXTDOMAIN`: The text domain used to look up translations of the program's messages.
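A short sketch of how a few of these variables work together: `FS` splits each record into fields, `NF` reports how many fields resulted, and `OFS` is placed between the arguments of `print`. The colon-separated input is made up for this example.

```shell
# Split on ":" and print the first field, third field, and field count, tab-separated.
fields_out=$(printf 'root:x:0:0\ndaemon:x:1:1\n' | awk '
BEGIN { FS = ":"; OFS = "\t" }   # set separators before any input is read
{ print $1, $3, NF }             # the commas are replaced by OFS on output
')
echo "$fields_out"
```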

These are the special patterns and variables most often used to control the execution of a GAWK program:

* `BEGIN`: A special pattern whose action runs before any input is read.
* `END`: A special pattern whose action runs after all input has been read.
* `FILENAME`: The name of the current input file.
* `FNR`: The current record number in the current input file.
* `NF`: The number of fields in the current record.
* `NR`: The total number of input records read so far, across all input files.
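The difference between `NR` and `FNR` only shows up when more than one file is read. The sketch below prints both counters for two tiny files; the names `a.txt` and `b.txt` and their contents are invented for illustration.

```shell
# Create two sample files, then print FILENAME, NR, FNR, and the record itself.
tmpdir=$(mktemp -d)
printf 'x\ny\n' > "$tmpdir/a.txt"
printf 'z\n' > "$tmpdir/b.txt"

counters_out=$(cd "$tmpdir" && awk '{ print FILENAME, NR, FNR, $0 }' a.txt b.txt)

echo "$counters_out"
rm -r "$tmpdir"
```

`NR` keeps counting across files while `FNR` restarts at 1 for each file, which is why the condition `NR == FNR` is true only while the first file is being read.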

In [2]:
# Create some randomly generated data to work with GAWK

awk 'BEGIN {
    for (i = 0; i < 10; i++)
        print int(101 * rand())
}'


93
59
30
58
74
79
44
33
78
10
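In gawk, `rand()` starts from the same seed on every run unless `srand()` is called, so the numbers above repeat from run to run. The sketch below passes an explicit seed so that a sequence can be reproduced on purpose. The helper name `gen` and the seed value 42 are arbitrary choices for this example.

```shell
# gen SEED: print five pseudo-random integers in the range 0..100 for the given seed.
gen() {
    awk -v seed="$1" 'BEGIN {
        srand(seed)                      # seed the generator explicitly
        for (i = 0; i < 5; i++)
            print int(101 * rand())      # rand() is in [0, 1), so this yields 0..100
    }'
}

run1=$(gen 42)
run2=$(gen 42)
echo "$run1"
```

Calling `srand()` with no argument seeds from the current time instead, which gives a different sequence on runs made at different times.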


In [None]:
# One-liner to list the FASTA records common to two files, printing only the header line of each
awk 'BEGIN {RS = ">" ; FS = "\n" ; OFS = "\t"} NR == FNR {a[$1] = $0 ; next} $1 in a {print $1}' data/ecoli.fasta data/ecoli2.fasta


NZ_JAGWDO010000079.1 Escherichia coli strain INF32/16/A Scaffold_79, whole genome shotgun sequence
NZ_JAGWDO010000078.1 Escherichia coli strain INF32/16/A Scaffold_78, whole genome shotgun sequence


In [None]:
# One-liner to list the FASTA records common to two files, printing each full record as stored from the first file
awk 'BEGIN {RS = ">" ; FS = "\n" ; OFS = "\t"} NR == FNR {a[$1] = $0 ; next} $1 in a {print a[$1]}' data/ecoli.fasta data/ecoli2.fasta


NZ_JAGWDO010000079.1 Escherichia coli strain INF32/16/A Scaffold_79, whole genome shotgun sequence
CCTTCGAACCACAACTGGTGAAGAAAAATCAGACCCGCATTACCGGCATGGACAACCAGATCCTGGCGCT
GTATGCCAGAGGGATGACTACCCGCGAAATTACCTCAGCCTTCAAAGAAATGTACGACGCCGATGTCTCG
CCCACGCTGATATCCAAAGTGACCGATGCGGTTAAAGAACAGGTCTCTGAATGGCAAAACCGACCGTTGG
ATGCACTGTATCCCATTGTTTATCTTGACTGTATTGTGGTTAAGGTTCGTCACAGCGGGAGTGTCATTAA
CAAAGCGGTGTTCCTCGCGCTGGGCATCAATACCGACGGTCAGAAAGAGCTGCTGGATATGTGGCTGGCC
GAAAATGAAGGTGCAAAGTTCTGGCTGAATGTACTGACGGAACTGAAAAATCGCGGCCTGAACGATATCC
TTATCGCCTTCGTGGACGGCCTGAAAGGCTTCCCGGAAGCGAT


NZ_JAGWDO010000078.1 Escherichia coli strain INF32/16/A Scaffold_78, whole genome shotgun sequence
TTTTTACTAAGCATTTCAGTAAGTGTTATACACGTATTTTCTACTAAGTGTTTTACCAGACATACACATG
TTTTCATAAACAATTCTACAAGTGTTTTCATTAAGTATTGTTATACACATTGCTTTGTCTGATACACATC
CAGTTTAGTAAACCTGTCAGTCCTGTTTTTACGTTAACTAAAATCAGCTTAATGCTACTACAAAACAATC
TTTCCTGTTTAAGCAATCAATATGATATTCTATTTTTCTGATGTTACATTTAACCTTTTATGAGTTTTCG
TGTTTTTAGTCGTTAACCTGAGACAAGTGTTTTTATGTAAAACGC

In [None]:
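# Exercise: Print out the fasta header and sequence length for the common fasta between two files
# Note: length($2) measures only the first sequence line, and the empty record before the first ">" produces the leading tab-and-0 line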
awk 'BEGIN {RS = ">" ; FS = "\n" ; OFS = "\t"} NR == FNR {a[$1] = $0 ; next} $1 in a {print $1, length($2)}' data/ecoli.fasta data/ecoli2.fasta

	0
NZ_JAGWDO010000079.1 Escherichia coli strain INF32/16/A Scaffold_79, whole genome shotgun sequence	70
NZ_JAGWDO010000078.1 Escherichia coli strain INF32/16/A Scaffold_78, whole genome shotgun sequence	70
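One possible way to address both issues in the exercise output (the empty leading record and the length of only the first sequence line) is sketched below on tiny made-up FASTA files. The file names, sequences, and the array name `seen` are assumptions for this example, not part of the data set used above.

```shell
# Two small sample FASTA files sharing the records seq1 and seq2.
tmpdir=$(mktemp -d)
printf '>seq1 first\nACGT\nAC\n>seq2 second\nGGGG\n' > "$tmpdir/a.fasta"
printf '>seq2 second\nGGGG\n>seq3 third\nTT\n>seq1 first\nACGT\nAC\n' > "$tmpdir/b.fasta"

common_out=$(awk '
BEGIN { RS = ">"; FS = "\n"; OFS = "\t" }
$1 == "" { next }                      # skip the empty record before the first ">"
NR == FNR { seen[$1] = 1; next }       # first file: remember each header line
$1 in seen {                           # second file: header also present in the first
    len = 0
    for (i = 2; i <= NF; i++)          # add up the length of every sequence line
        len += length($i)
    print $1, len
}
' "$tmpdir/a.fasta" "$tmpdir/b.fasta")

echo "$common_out"
rm -r "$tmpdir"
```

Records are reported in the order they appear in the second file, since that is the file being scanned when the match is tested.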


In [None]:
# Tabulate statistics for multi-fasta file with gawk.
# This has a bug. Can you find it?

gawk 'BEGIN {
    RS = ">" # Record separator
    FS = "\n" # Field separator
    OFS = "\t" # Output field separator
    print "Name", "Length", "GC", "AT", "N", "GC%", "AT%", "N%"
} {
    if (NR > 1) {
        name = $1
        # Keep only fasta ID
        sub(/ .*/, "", name)
        seq = $2
        gsub("\n", "", seq)
        gsub("\r", "", seq)
        len = length(seq)
        gc = gsub(/[GC]/, "", seq)
        at = gsub(/[AT]/, "", seq)
        n = gsub(/[Nn]/, "", seq)
        # Round to 2 decimal places
        gc = sprintf("%.2f", gc / len)
        at = sprintf("%.2f", at / len)
        n = sprintf("%.2f", n / len)
        print name, len, gc, at, n, gc * 100, at * 100, n * 100
    }
}' data/ecoli.fasta


In [15]:
# Tabulate statistics for multi-fasta file with gawk.
# Fasta sequence of each record is on multiple lines.
# Need to read in entire record before processing.
# Output is tab-delimited.

gawk 'BEGIN {
    RS = ">" # Record separator
    FS = "\n" # Field separator
    OFS = "\t" # Output field separator
    print "Name", "Length", "GC", "AT", "N", "GC%", "AT%", "N%"
} (NR > 1){
    # Read multi-line fasta sequence into single variable
    seq = ""
    for (i = 2; i <= NF; i++) {
        seq = seq $i
    }
    len = length(seq)
    gc = gsub(/[GC]/, "", seq)
    at = gsub(/[AT]/, "", seq)
    n = gsub(/[Nn]/, "", seq)
    # Round to 2 decimal places
    gc = sprintf("%.2f", gc / len)
    at = sprintf("%.2f", at / len)
    n = sprintf("%.2f", n / len)
    print $1, len, gc, at, n, gc * 100, at * 100, n * 100
}' data/ecoli.fasta

Name	Length	GC	AT	N	GC%	AT%	N%
NZ_JAGWDO010000097.1 Escherichia coli strain INF32/16/A Scaffold_97, whole genome shotgun sequence	220	0.28	0.40	0.32	28	40	32
NZ_JAGWDO010000096.1 Escherichia coli strain INF32/16/A Scaffold_96, whole genome shotgun sequence	243	0.60	0.40	0.00	60	40	0
NZ_JAGWDO010000095.1 Escherichia coli strain INF32/16/A Scaffold_95, whole genome shotgun sequence	272	0.51	0.49	0.00	51	49	0
NZ_JAGWDO010000094.1 Escherichia coli strain INF32/16/A Scaffold_94, whole genome shotgun sequence	273	0.48	0.52	0.00	48	52	0
NZ_JAGWDO010000093.1 Escherichia coli strain INF32/16/A Scaffold_93, whole genome shotgun sequence	300	0.52	0.48	0.00	52	48	0
NZ_JAGWDO010000092.1 Escherichia coli strain INF32/16/A Scaffold_92, whole genome shotgun sequence	300	0.52	0.48	0.00	52	48	0
NZ_JAGWDO010000091.1 Escherichia coli strain INF32/16/A Scaffold_91, whole genome shotgun sequence	301	0.37	0.63	0.00	37	63	0
NZ_JAGWDO010000090.1 Escherichia coli strain INF32/16/A Scaffold_90, whole genome shot