
# Intro To Awk

First, let's make a test file to run awk commands on:

In [46]:
%%bash
# "%%bash" must be the first line of the cell; it means this cell will run bash/command line code, not Python code

# create an empty file named test.txt
if [[ -f test.txt ]]; then rm test.txt; fi
touch test.txt

# append each line to the file one by one
# each line has 3 columns
echo "1    A    Apple" >> test.txt
echo "2    B    Banana" >> test.txt
echo "3    C    Cat" >> test.txt
echo "4    D    Dog" >> test.txt
echo "5    E    Extra Dog" >> test.txt

# ensure columns are tab-separated (replace 4 spaces with a tab)
sed 's/    /\t/g' test.txt > tmp.txt
mv tmp.txt test.txt

# print out the file contents
cat test.txt

1	A	Apple
2	B	Banana
3	C	Cat
4	D	Dog
5	E	Extra Dog


At its simplest, an awk command takes the format:

`awk '{  [something]  }' filename`

This command runs the awk code inside the braces once for each line of the input file.

The simplest thing we could put in the braces is `print`. This causes awk to print the entire current line; because it is executed once for each line, the result is that the entire file is printed.

In [47]:
%%bash
awk '{ print }' test.txt

1	A	Apple
2	B	Banana
3	C	Cat
4	D	Dog
5	E	Extra Dog


We can specify that we only want awk to print one or more columns from each line:

In [48]:
%%bash
# print only column 1
awk '{ print $1 }' test.txt

1
2
3
4
5


In [49]:
%%bash
# print only columns 1 and 3
# note that only "Extra", not "Extra Dog", is printed for line 5
awk '{ print $1, $3 }' test.txt

1 Apple
2 Banana
3 Cat
4 Dog
5 Extra
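
The comma in `print $1, $3` matters. The sketch below shows what happens with and without it, and how literal text can be mixed in. The file name `demo.txt` is a throwaway name chosen for this sketch, not part of the notebook above.

```shell
# demo.txt is a hypothetical scratch file with two tab-separated rows
printf '1\tA\tApple\n2\tB\tBanana\n' > demo.txt

# with a comma, the two values are printed with a space between them
awk '{ print $1, $3 }' demo.txt
# 1 Apple
# 2 Banana

# without a comma, awk concatenates the two values directly
awk '{ print $1 $3 }' demo.txt
# 1Apple
# 2Banana

# quoted strings are concatenated the same way
awk '{ print $3 ": item " $1 }' demo.txt
# Apple: item 1
# Banana: item 2
```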


Using `print $0` is equivalent to `print` -- the entire line is printed.

## Field Separators

By default, awk splits each line into columns on runs of whitespace (tabs or spaces), and when it prints multiple columns separated by commas, it puts a single space between them. We can change both behaviors by setting the variables `FS` (input field separator) and `OFS` (output field separator).

In [54]:
%%bash
# here the output columns will be tab-separated
awk -v OFS="\t" '{ print $1, $3 }' test.txt

1	Apple
2	Banana
3	Cat
4	Dog
5	Extra


In [55]:
%%bash
# here "Extra Dog", with the space, is printed for line 5
awk -v FS='\t' '{ print $1, $3 }' test.txt

1 Apple
2 Banana
3 Cat
4 Dog
5 Extra Dog


In [56]:
%%bash
# combining the two
awk -v FS='\t' -v OFS='\t' '{ print $1, $3 }' test.txt

1	Apple
2	Banana
3	Cat
4	Dog
5	Extra Dog
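
As a side note, `-v FS=...` is not the only way to set the separators. A minimal sketch of two common alternatives follows: the `-F` option, and assigning the variables in a `BEGIN` block (which runs once before any input is read). The sample row is piped in so the sketch does not depend on test.txt.

```shell
# -F sets the input field separator from the command line
printf '5\tE\tExtra Dog\n' | awk -F'\t' -v OFS='\t' '{ print $1, $3 }'
# 5	Extra Dog

# the same thing, with both separators assigned inside the awk program
printf '5\tE\tExtra Dog\n' | awk 'BEGIN { FS = "\t"; OFS = "\t" } { print $1, $3 }'
# 5	Extra Dog
```

Which form to use is mostly a matter of taste; the `BEGIN` form keeps everything inside the awk program, which can be handy when the program is saved to a file.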


## Awk With Conditionals/Regex

If you don't want the main awk command to run on every line, there are two ways to restrict it to certain lines: `if` statements and regex conditionals.

### If Statements

In [71]:
%%bash
# only lines where column 1 is less than 4 (here, the first 3 lines)
awk '{ if ($1 < 4) print }' test.txt

1	A	Apple
2	B	Banana
3	C	Cat


In [72]:
%%bash
# only lines where column 1 is even
awk '{ if ($1 % 2 == 0) print }' test.txt

2	B	Banana
4	D	Dog


In [73]:
%%bash
# if you want to output the whole line, you can use this shorthand:
awk '$1 % 2 == 0' test.txt
# note that there are no braces now

# this is equivalent to:
# awk '{ if ($1 % 2 == 0) print }' test.txt

2	B	Banana
4	D	Dog
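
The brace-free shorthand can also be combined with braces: a condition placed before the braces selects the lines, and the code inside the braces then runs only on those lines. A small sketch, with sample rows piped in so it does not depend on test.txt:

```shell
# for lines where column 1 is even, print only column 3
printf '1\tA\tApple\n2\tB\tBanana\n3\tC\tCat\n4\tD\tDog\n' |
  awk '$1 % 2 == 0 { print $3 }'
# Banana
# Dog

# conditions can be combined with && (and) or || (or)
printf '1\tA\tApple\n2\tB\tBanana\n3\tC\tCat\n4\tD\tDog\n' |
  awk '$1 > 1 && $1 < 4 { print $2 }'
# B
# C
```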


### Regex

In [74]:
%%bash
# only lines 2, 3, and 5 have a lowercase a in them
awk '/a/ { print }' test.txt

2	B	Banana
3	C	Cat
5	E	Extra Dog


In [77]:
%%bash
# only line 4 ends in "[tab]Dog"
awk '/\tDog$/ { print }' test.txt

4	D	Dog
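
The regexes above are matched against the whole line. To test a single column instead, awk provides the `~` (matches) and `!~` (does not match) operators. A minimal sketch, with sample rows piped in so it does not depend on test.txt:

```shell
# print column 3 for lines where column 2 is B or C
printf '2\tB\tBanana\n3\tC\tCat\n4\tD\tDog\n' |
  awk '$2 ~ /[BC]/ { print $3 }'
# Banana
# Cat

# print column 1 for lines where column 3 does NOT contain a lowercase a
printf '2\tB\tBanana\n3\tC\tCat\n4\tD\tDog\n' |
  awk '$3 !~ /a/ { print $1 }'
# 4
```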


## Some Awk Variables

Within an awk command, the variable `NR` holds the number of the line currently being processed (so on the first line, `NR == 1`). The variable `NF` holds the number of columns/fields found in that line, which means you can access the last column of any line with `$NF`. Note that `NF` by itself is a number, while `$NF` with the dollar sign refers to the contents of a column.

In [82]:
%%bash
# another way of only printing the even lines
awk 'NR % 2 == 0' test.txt

2	B	Banana
4	D	Dog


In [85]:
%%bash
# only print the last column in each line
awk -v FS='\t' '{ print $NF }' test.txt

Apple
Banana
Cat
Dog
Extra Dog


In [86]:
%%bash
# only print the second-to-last column in each line
awk -v FS='\t' '{ print $(NF - 1) }' test.txt

A
B
C
D
E
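
To see the three values side by side, the sketch below prints `NR`, `NF`, and `$NF` for each line. Only the last two rows of the test data are piped in, which also shows that `NR` counts the lines awk has read rather than echoing column 1.

```shell
printf '4\tD\tDog\n5\tE\tExtra Dog\n' |
  awk -F'\t' '{ print NR, NF, $NF }'
# 1 3 Dog
# 2 3 Extra Dog
```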


One good use of `NF` is to check if all the lines in a file have the same number of columns.

In [89]:
%%bash
# with the default spaces and tabs as field separator...
# note that we use NF without a dollar sign here
# the result: we see the last line in the file appears to have an extra column
awk '{ print NF }' test.txt

3
3
3
3
4


In [88]:
%%bash
# with tab as the input field separator...
# now all the lines in the file have the same number of columns
awk -v FS='\t' '{ print NF }' test.txt

3
3
3
3
3
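
For a long file, reading a column of numbers by eye is impractical. One possible approach, sketched below, is to remember the field count of the first line and report only the lines that differ. The variable name `expected` is made up for this sketch.

```shell
# with the default separator, the space in "Extra Dog" creates a fourth field
printf '4\tD\tDog\n5\tE\tExtra Dog\n' |
  awk 'NR == 1 { expected = NF }
       NF != expected { print "line " NR ": " NF " fields, expected " expected }'
# line 2: 4 fields, expected 3

# with tab as the separator, every line has 3 fields, so nothing is printed
printf '4\tD\tDog\n5\tE\tExtra Dog\n' |
  awk -F'\t' 'NR == 1 { expected = NF }
       NF != expected { print "line " NR ": " NF " fields, expected " expected }'
```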


## You Can Do Math In Awk: A BED File Example

In [90]:
%%bash

# let's make an example BED-formatted file (a common bioinformatics file format).
if [[ -f test.bed ]]; then rm test.bed; fi
touch test.bed

# append each line to the file one by one
# each line has 3 columns
echo "chr1    0    500" >> test.bed
echo "chr1    20    340" >> test.bed
echo "chr2    58    60" >> test.bed
echo "chr2    100    1000" >> test.bed
echo "chr3    101    201" >> test.bed

# ensure columns are tab-separated (replace 4 spaces with a tab)
sed 's/    /\t/g' test.bed > tmp.bed
mv tmp.bed test.bed


# print out the file contents
cat test.bed

chr1	0	500
chr1	20	340
chr2	58	60
chr2	100	1000
chr3	101	201


In [91]:
%%bash
# only print lines where column 3 is more than 400 larger than column 2
awk '$3 - $2 > 400' test.bed

chr1	0	500
chr2	100	1000
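
Arithmetic results can be printed as well as tested. The sketch below appends the difference between columns 3 and 2 as a new fourth column, then sums those differences over all lines using an `END` block (which runs once after the last line). Two sample rows are piped in, and the variable name `total` is made up for this sketch.

```shell
# append (column 3 - column 2) as an extra tab-separated column
printf 'chr1\t0\t500\nchr2\t58\t60\n' |
  awk -v OFS='\t' '{ print $0, $3 - $2 }'
# chr1	0	500	500
# chr2	58	60	2

# add up the differences and print the total once at the end
printf 'chr1\t0\t500\nchr2\t58\t60\n' |
  awk '{ total += $3 - $2 } END { print total }'
# 502
```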


In [96]:
%%bash
# a more complicated example, with if/else

# if the size of the difference between column 2 and column 3 is larger than 200,
# replace the values in columns 2 and 3 such that the difference is exactly 200,
# while keeping their midpoint the same.

# otherwise, print the line as-is.

awk -v OFS='\t' '{ if ($3 - $2 > 200) print $1, ($2 + $3)/2 - 100, ($2 + $3)/2 + 100 ; else print }' test.bed

chr1	150	350
chr1	80	280
chr2	58	60
chr2	450	650
chr3	101	201
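
One caveat with the midpoint arithmetic: when columns 2 and 3 sum to an odd number, `($2 + $3)/2` is not a whole number, so the printed coordinates would be fractional. If whole-number coordinates are needed, one option is to truncate the midpoint with awk's built-in `int()`. The sketch below is a variant of the command above under that assumption; the variable name `mid` is made up for this sketch.

```shell
# the first row has an odd sum (0 + 501), so its exact midpoint is 250.5
printf 'chr1\t0\t501\nchr2\t58\t60\n' |
  awk -v OFS='\t' '{
    if ($3 - $2 > 200) {
      mid = int(($2 + $3) / 2)   # int() drops the fractional part: 250.5 becomes 250
      print $1, mid - 100, mid + 100
    } else {
      print
    }
  }'
# chr1	150	350
# chr2	58	60
```

Truncating shifts the window by half a unit for odd sums; whether that matters depends on what the coordinates are used for.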
