# AWK tutorial

See https://github.com/tdhopper/awk-lessons for inspiration

## Lesson 01: Basics of Awk

If you haven't read the Awk man page, you should start there. It's helpful! Some highlights:

```
awk - pattern-directed scanning and processing language

awk [ -F fs ] [ -v var=value ] [ 'prog' | -f progfile ] [ file ... ]
```

Awk scans each input file for lines that match any of a set of patterns specified literally in prog or in one or more files specified as -f progfile.

With each pattern there can be an associated action that will be performed when a line of a file matches the pattern.

Each line is matched against the pattern portion of every pattern-action statement; the associated action is performed for each matched pattern.

A pattern-action statement has the form pattern {action}.

A missing { action } means print the line; a missing pattern always matches.
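
As a small sketch of those last two rules, the commands below pipe three made-up lines into Awk with `printf` (the input and the variable names `no_action` and `no_pattern` are illustrative assumptions, not taken from the man page). The first program has a pattern but no action, so matching lines are printed; the second has an action but no pattern, so it runs for every line.

```shell
# Pattern without an action: lines matching /b/ are printed unchanged.
no_action=$(printf 'abc\nxyz\nbcd\n' | awk '/b/')
echo "$no_action"

# Action without a pattern: the action runs for every input line.
no_pattern=$(printf 'abc\nxyz\nbcd\n' | awk '{ print "line: " $0 }')
echo "$no_pattern"
```

The first command should print only `abc` and `bcd`; the second prefixes all three lines.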

I created a simple example file to demonstrate basic Awk:

In [1]:
cat data/letters.txt

a
bb
ccc
dddd
ggg
hh
i

### A Basic Pattern

If we match lines longer than two characters and use the implicit print action, we get:

In [2]:
awk 'length($0) > 2' data/letters.txt

ccc
dddd
ggg


$0 is a built-in variable that contains the line.
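
One detail worth checking for yourself: without parentheses, Awk reads `length $0` as the string concatenation of `length` and `$0`, and it is that string which gets compared with `2`. Here is a minimal sketch, using made-up input piped from `printf` rather than the data file (the variable names are illustrative):

```shell
# Without parentheses: "1a", "2bb" and "3ccc" are compared with 2 as strings,
# so the two-character line "bb" also passes.
unparenthesized=$(printf 'a\nbb\nccc\n' | awk 'length $0 > 2')
echo "$unparenthesized"

# With parentheses: the numeric length of the line is compared with 2.
parenthesized=$(printf 'a\nbb\nccc\n' | awk 'length($0) > 2')
echo "$parenthesized"
```

Writing `length($0)` keeps the comparison numeric, which is usually what is intended.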

### A Basic Function

If we leave out a pattern, we will match every line. A trivial action would be to print each line:

In [3]:
awk '{ print }' data/letters.txt

echo
echo Using the length function as our action, we can get the length of each line:
echo 

awk '{ print length }' data/letters.txt

a
bb
ccc
dddd
ggg
hh
i

Using the length function as our action, we can get the length of each line:

1
2
3
4
3
2
1


In [4]:
# We can combine things
awk '{ print length $0}' data/letters.txt

echo 
echo The above prints length of line and line - the value of \$0
echo

awk '{ print length,$0}' data/letters.txt
echo 
echo Using "," as separator puts whitespace 
echo


1a
2bb
3ccc
4dddd
3ggg
2hh
1i

The above prints length of line and line - the value of $0

1 a
2 bb
3 ccc
4 dddd
3 ggg
2 hh
1 i

Using , as separator puts whitespace
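
The whitespace comes from Awk's output field separator, `OFS`, which defaults to a single space; a comma in `print` emits it between the arguments. As a rough sketch, the commands below use made-up input and an arbitrary `OFS` of `:` (set with the `-v` option, which assigns a variable before the program runs) to make the difference visible:

```shell
# Expressions placed side by side are concatenated with nothing in between.
concatenated=$(printf 'a\nbb\n' | awk '{ print length($0) $0 }')
echo "$concatenated"

# A comma inserts the output field separator (OFS), set here to ":".
separated=$(printf 'a\nbb\n' | awk -v OFS=':' '{ print length($0), $0 }')
echo "$separated"
```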



In [5]:
# Awk has special controls for executing some code before the file input begins and after it is complete.

awk 'BEGIN { print "HI" } { print $0 } END { print "BYE!" }' data/letters.txt

HI
a
bb
ccc
dddd
ggg
hh
i
BYE!
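
A common use of these two blocks is to initialise a value in BEGIN, update it for every line, and report it in END. The sketch below assumes made-up input from `printf`; `total` and `summary` are illustrative names, and `NR` is Awk's built-in count of the lines read so far.

```shell
# BEGIN runs before any input is read, END after the last line.
# The middle action runs once per line and accumulates a running total.
summary=$(printf 'a\nbb\nccc\n' | awk '
    BEGIN { total = 0 }
          { total += length($0) }
    END   { print NR " lines, " total " characters" }
')
echo "$summary"
```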


In [6]:
awk "BEGIN { print \"Don't Panic! \" }"

Don't Panic! 


### Combining Patterns and Functions
Of course, patterns and functions can be combined so that the function is only applied when the pattern is matched.

From the man page:
```
A pattern-action statement has the form

pattern { action }
```

We can print the length of all lines longer than 2 characters.

In [7]:
awk 'length($0) > 2 { print length($0) }' data/letters.txt

3
4
3


In [8]:
# Actually, we don't have to limit Awk to just one pattern! 
# We can have arbitrarily many patterns separated by a semicolon or a new line:

awk 'length($0) > 2 { print "Long:  " length($0) }; length($0) < 2 { print "Short: " length($0) }' data/letters.txt

Short: 1
Long:  3
Long:  4
Long:  3
Short: 1
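
The patterns do not have to be mutually exclusive. As the man page excerpt says, the action runs for each matched pattern, so one line can trigger several actions. A minimal sketch with made-up input (the labels and the `matches` variable are illustrative):

```shell
# "ccc" satisfies both patterns, so both actions run for it, in program order.
# "a" satisfies neither, so nothing is printed for it.
matches=$(printf 'a\nccc\n' | awk '
    length($0) > 2 { print "Long:  " $0 }
    /c/            { print "Has c: " $0 }
')
echo "$matches"
```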


### Multiple Fields

Awk is designed for easy handling of data with multiple fields per row. 
The field delimiter can be specified with the -F option.

Here's a simple space-delimited file:

In [9]:
cat data/field_data.txt

Roses are red,
Violets are blue,
Sugar is sweet,
And so are you.


In [10]:
# If we specify the field separator, we can print the second field from each row:

awk -F " " '{ print $2 }' data/field_data.txt
echo

# A single space is also the default separator, so -F can be omitted
awk '{ print $2 }' data/field_data.txt


are
are
is
so

are
are
is
so


In [11]:
# We don't get an error if a line doesn't have the referenced field; it just shows up as blank
awk -F " " '{ print $4 }' data/field_data.txt




you.


In [12]:
# The separator expression is interpreted as a regular expression.

awk -F "((so )?are|is) " '{print "Field 1: " $1 "\nField 2: " $2}' data/field_data.txt


Field 1: Roses 
Field 2: red,
Field 1: Violets 
Field 2: blue,
Field 1: Sugar 
Field 2: sweet,
Field 1: And 
Field 2: you.
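
A simpler case of the same idea is a character class that accepts more than one delimiter. The sketch below assumes a made-up input where fields are split by either a comma or a semicolon; `second_fields` is an illustrative name.

```shell
# The separator is the regular expression [,;], i.e. a comma or a semicolon.
second_fields=$(printf 'a,b;c\nd;e,f\n' | awk -F '[,;]' '{ print $2 }')
echo "$second_fields"
```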


### Regular Expressions

Patterns can be regular expressions, not just built-in functions. From the man page:

Regular expressions are as defined in re_format(7).
Isolated regular expressions in a pattern apply to the entire line.

In [13]:
ls -la dict

total 1172
drwxr-xr-x  4 jovyan users    128 Oct 20 17:01 .
drwxr-xr-x 17 jovyan users    544 Oct 20 18:21 ..
-rw-r--r--  1 jovyan users 256374 Oct 20 16:59 8927565-d9783627c731268fb2935a731a618aa8e95cf465.zip
-rw-r--r--  1 jovyan users 938847 Feb 11  2014 words


In [14]:
awk '/^[a-z][aeiou][aeiou][aeiou][aeiou]/' dict/words
echo
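# The same match written with an interval expression; the output is identical here, though some older Awk implementations do not support {n}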
awk '/^[a-z][aeiou]{4}/' dict/words

gooier
gooiest
queue
queue's
queued
queues
queuing

gooier
gooiest
queue
queue's
queued
queues
queuing
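
To match against one field instead of the entire line, the `~` operator can be used. A small sketch, assuming made-up two-word input lines (the variable names are illustrative):

```shell
# A bare /regex/ is tested against the whole line ($0).
whole_line=$(printf 'red apple\ngreen pear\napple red\n' | awk '/^red/')
echo "$whole_line"

# The ~ operator tests a single field instead; here, the second field.
second_field=$(printf 'red apple\ngreen pear\napple red\n' | awk '$2 ~ /^red/')
echo "$second_field"
```

The first command should select `red apple`, the second `apple red`.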


### Passing variables into program

The -v option for Awk allows us to pass variables into the program. 
For example, we could use it to hard code constants.

In [15]:
awk -v pi=3.1415 'BEGIN { print pi }'

# Shell variables can be passed the same way. $USER is set in a terminal but may be unset in Jupyter or Docker, so $PWD is used here.
awk -v curdir="$PWD" 'BEGIN { print curdir }'

3.1415
/home/jovyan/work
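
Variables set with -v are also visible in patterns, which makes it easy to parameterise a filter. In the sketch below, `min_len` is a hypothetical variable name and the input is made up:

```shell
# -v assigns min_len before the program starts, so the pattern can use it.
long_lines=$(printf 'a\nbb\nccc\ndddd\n' | awk -v min_len=3 'length($0) >= min_len')
echo "$long_lines"
```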


### If-else Statements
If-else statements in Awk are of the form:

if (condition) then-body [else else-body]

For example:

In [16]:
printf "1\n2\n3\n4" | awk \
    '{ \
        if ($1 % 2 == 0) print $1, "is even"; \
        else print $1, "is odd" \
     }'

1 is odd
2 is even
3 is odd
4 is even
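
Because the else-body is itself a statement, it can be another if, which gives the usual else-if chain. A minimal sketch with made-up numbers (`classified` is an illustrative name):

```shell
# Each input line is classified by the first condition that holds.
classified=$(printf '5\n-2\n0\n' | awk '{
    if ($1 > 0)      print $1, "is positive"
    else if ($1 < 0) print $1, "is negative"
    else             print $1, "is zero"
}')
echo "$classified"
```

Inside single quotes the program can span several lines, so no trailing backslashes are needed here.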


### Looping
Awk includes several looping statements: while, do while, and for.

They take the expected C-ish syntax.

In [17]:
awk \
    'BEGIN { \
        i = 0; \
        while (i < 5) { print i; i+=1; } \
     }'

0
1
2
3
4


In [18]:
awk \
    'BEGIN { \
        i = 0; \
        do { print i; i+=1; } while(i < 5) \
     }'

0
1
2
3
4


In [19]:
awk \
    'BEGIN { \
        for (i = 0; i < 5; i++) print i \
     }'

0
1
2
3
4
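
Loops are often used to walk over the fields of a line. The sketch below assumes one line borrowed from the field data shown earlier; `NF` is Awk's built-in number of fields on the current line, and `numbered` is an illustrative name.

```shell
# NF is the number of fields on the current line; $i selects field i.
numbered=$(printf 'Roses are red,\n' | awk '{
    for (i = 1; i <= NF; i++) print i ": " $i
}')
echo "$numbered"
```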


In [20]:
awk --version

GNU Awk 4.1.4, API: 1.1 (GNU MPFR 4.0.1, GNU MP 6.1.2)
Copyright (C) 1989, 1991-2016 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.
