# 1.6 awk

This notebook demonstrates the basics of using `awk` to extract specific fields from data files.

In [29]:
%%bash
less data/art/artists.txt


KILIAN, Wolfgang	(1581-1662)	Baroque	German graphic artist (Augsburg)
KINSOEN, François-Joseph	(1771-1839)	Romanticism	Flemish painter
KISS, August Karl Edouard	(1802-1865)	Romanticism	German sculptor
KISS, Bálint	(1802-1868)	Romanticism	Hungarian painter (Pest)
KLEINER, Salomon	(1703-1761)	Baroque	German graphic artist
KLENZE, Leo von	(1784-1864)	Romanticism	German architect
KLESECKER, Justus (see GLESKER, Justus)	(c. 1615-1678)	Baroque	German sculptor (Franconia)
KLINGER, Max	(1857-1920)	Realism	German painter
KLOCKER, Hans	(active 1478-1500l)	Northern Renaissance	Austrian sculptor (South Tyrol)
KLODT, Mikhail Konstantinovich	(c. 1832-1902)	Realism	Russian painter (St. Petersburg)
KLODT, Pyotr Karlovich	(1805-1867)	Romanticism	Russian sculptor (St. Petersburg)
TACCA, Ferdinando	(1619-c. 1688)	Baroque	Italian sculptor (Florence)
TACCA, Pietro	(1577-1640)	Baroque	Italian sculptor (Florence)
TACCONE, Paolo (see PAOLO ROMANO)	(c. 1415-c. 1470)	Early Renaissance	Italian sculptor (Rome)
TA

#### Manual inspection of `artists.txt`

There are at least two options for loading `artists.txt` into your favorite spreadsheet program:

* the file may already be on your computer, in which case you can open it from its location in your file system
* you can download the text file from https://raw.githubusercontent.com/SuLab/Applied-Bioinformatics/master/Module-1_bash-jupyter-git/data/art/artists.txt

Note that every row has four tab-separated columns: name, dates, art period, and a short description.

In [38]:
%%bash
man grep



GREP(1)                   BSD General Commands Manual                  GREP(1)

NNAAMMEE
     ggrreepp, eeggrreepp, ffggrreepp, zzggrreepp, zzeeggrreepp, zzffggrreepp -- file pattern searcher

SSYYNNOOPPSSIISS
     ggrreepp [--aabbccddDDEEFFGGHHhhIIiiJJLLllmmnnOOooppqqRRSSssUUVVvvwwxxZZ] [--AA _n_u_m] [--BB _n_u_m] [--CC[_n_u_m]]
          [--ee _p_a_t_t_e_r_n] [--ff _f_i_l_e] [----bbiinnaarryy--ffiilleess=_v_a_l_u_e] [----ccoolloorr[=_w_h_e_n]]
          [----ccoolloouurr[=_w_h_e_n]] [----ccoonntteexxtt[=_n_u_m]] [----llaabbeell] [----lliinnee--bbuuffffeerreedd]
          [----nnuullll] [_p_a_t_t_e_r_n] [_f_i_l_e _._._.]

DDEESSCCRRIIPPTTIIOONN
     The ggrreepp utility searches any given input files, selecting lines that
     match one or more patterns.  By default, a

#### Confirm using `awk` that each row has four columns
Find and fix the error.

```
%%bash
awk '{print NF}' artists.txt
```

In [2]:
%%bash
cd data/art
awk '{print NF}' artists.txt

8
6
8
7
7
7
11
6
10
10
9
8
7
13
8
14
7
9
9
10
9
12
8
9
7
7
7
7
7
8
7
8
7
6
6
7
7
7
11
12
8
6
7
9
7
12
6
7
7
9
9
7
7
8
8
8
9
9
7
6
10
7
7
7
9
7
7
7
7
7
9
7
7
10
12
7
7
9
7
7
10
6
7
10
9
7
9
8
9
7
7
11
12
7
9
13
7
7
10
7
7
8
11
8
8
6
9
13
8
6
7
9
7
7
7
7
7
7
7
7
7
11
8
11
7
9
9
7
10
8
7
13
11
9
9
9
14
12
8
7
9
8
11
7
8
10
7
7
8
8
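
The counts above range from 6 to 14 rather than 4 because `awk` splits on runs of spaces and tabs by default, so every word in a name or description counts as its own field. A minimal sketch of the likely fix, assuming the columns are tab-separated, is to pass `-F'\t'`. The temporary file below is a stand-in holding two rows copied from `artists.txt`; it is an assumption for illustration, not part of the course data.

```shell
# Stand-in for artists.txt: two rows copied from the listing above (assumed tab-separated).
sample=$(mktemp)
printf 'KLINGER, Max\t(1857-1920)\tRealism\tGerman painter\n' > "$sample"
printf 'KLENZE, Leo von\t(1784-1864)\tRomanticism\tGerman architect\n' >> "$sample"

# Default splitting: any run of blanks separates fields, so NF counts words.
default_counts=$(awk '{print NF}' "$sample" | paste -sd' ' -)

# Tab splitting: NF counts the tab-separated columns.
tab_counts=$(awk -F'\t' '{print NF}' "$sample" | paste -sd' ' -)

echo "whitespace-split: $default_counts"
echo "tab-split:        $tab_counts"
rm -f "$sample"
```

On the full file, piping the tab-split counts through `sort | uniq -c` would summarize them as one line per distinct field count.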


#### Print the third column
Find and fix the error.
```
%%bash
awk '{print $3}' artists.txt
```

In [11]:
%%bash
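# -F sets the field separator before the first line is read;
# assigning FS inside the action would leave line 1 split on whitespace
awk -F'\t' '{print $3}' data/art/artists.txt

Baroque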
Romanticism
Romanticism
Romanticism
Baroque
Romanticism
Baroque
Realism
Northern Renaissance
Realism
Romanticism
Baroque
Baroque
Early Renaissance
Medieval
Early Renaissance
Realism
Medieval
Medieval
Early Renaissance
High Renaissance
Early Renaissance
Baroque
Baroque
Baroque
Rococo
Neoclassicism
Impressionism
Baroque
Baroque
Mannerism
Baroque
Baroque
Rococo
Romanticism
Baroque
Baroque
Rococo
Baroque
Baroque
Mannerism
Romanticism
Rococo
Baroque
Baroque
Baroque
Neoclassicism
Baroque
Baroque
Baroque
Baroque
Baroque
Baroque
Baroque
Neoclassicism
Mannerism
Baroque
Baroque
Baroque
Baroque
Mannerism
Romanticism
Baroque
Baroque
Baroque
Realism
Romanticism
Romanticism
Romanticism
Rococo
Baroque
Neoclassicism
Baroque
Baroque
Baroque
Rococo
Rococo
Rococo
Neoclassicism
Romanticism
Northern Renaissance
Romanticism
Romanticism
Baroque
Baroque
Romanticism
Northern Renaissance
Baroque
Baroque
Neoclassicism
Rococo
Mannerism
Baroque
Baroque
Baroque
Baroque
Baroque
Baroque
Northern Renaissan
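
One pitfall worth illustrating: an assignment to `FS` inside the main action only applies from the next record onward, because the current line has already been split into fields. Setting the separator before any input is read, with `-F` or a `BEGIN` block, avoids this. The sketch below uses a temporary stand-in file with two rows copied from `artists.txt`; the file and variable names are assumptions for illustration.

```shell
# Stand-in for artists.txt: two rows copied from the listing (assumed tab-separated).
sample=$(mktemp)
printf 'KLINGER, Max\t(1857-1920)\tRealism\tGerman painter\n' > "$sample"
printf 'KLENZE, Leo von\t(1784-1864)\tRomanticism\tGerman architect\n' >> "$sample"

# FS assigned inside the action: the first record was already split on whitespace.
late=$(awk '{FS="\t"; print $3}' "$sample" | head -n 1)

# FS assigned in BEGIN: in effect before the first record is read.
early=$(awk 'BEGIN {FS="\t"} {print $3}' "$sample" | head -n 1)

echo "FS set in action, first line: $late"
echo "FS set in BEGIN, first line:  $early"
rm -f "$sample"
```

Only the first line differs between the two variants, which makes this mistake easy to miss in long output.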

#### Use `uniq` (and another command) to find the most common art periods
Find and fix the error(s).
```
%%bash
awk '{print $3}' artists.txt | uniq
```

In [37]:
%%bash
cd data/art 
ls
# -F sets the separator before the first line is read, so line 1 is split on tabs too
awk -F'\t' '{print $3}' artists.txt | sort | uniq -c 

artists.txt
t1
  61 Baroque
  12 Early Renaissance
   5 High Renaissance
   1 Impressionism
  10 Mannerism
   8 Medieval
  10 Neoclassicism
   6 Northern Renaissance
   9 Realism
  10 Rococo
  18 Romanticism
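
`uniq` only merges adjacent duplicate lines, which is why the `sort` in front of it matters. To rank the periods from most to least common, one option is a second, numeric reverse sort on the counts. A minimal sketch, using a stand-in list of periods (assumed values) rather than the real file:

```shell
# Stand-in for the third column of artists.txt (values assumed for illustration).
periods=$(printf 'Baroque\nRealism\nBaroque\nRococo\nBaroque\nRealism\n')

# Without sort, no duplicates are adjacent here, so uniq collapses nothing.
unsorted_groups=$(printf '%s\n' "$periods" | uniq | wc -l | tr -d ' ')

# sort groups equal lines, uniq -c counts each group, sort -rn ranks by count.
ranked=$(printf '%s\n' "$periods" | sort | uniq -c | sort -rn)
top=$(printf '%s\n' "$ranked" | head -n 1 | awk '{print $2}')

echo "lines left by uniq alone: $unsorted_groups"
echo "$ranked"
echo "most common: $top"
```

With the real data, the same `sort | uniq -c | sort -rn` tail can be appended to the `awk` command that prints the third column.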


## Homework

### HW1: Repeat the above command (finding the most common art periods) restricting to Dutch painters only.
HINT: start by searching/filtering for Dutch painters...

In [43]:
%%bash
grep 'Dutch painter' data/art/artists.txt | awk '{FS = "\t"; print $3}' | sort | uniq -c


  16 Baroque
   1 Neoclassicism
   1 Realism
   2 Rococo
   4 Romanticism
   1 van
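
The `1 van` row above is not an art period. A likely explanation is that the first matching line was split on whitespace, so `$3` picked up a word from the artist's name. A sketch of one way around this, assuming tab-separated columns, sets the separator with `-F` and does the filtering inside `awk` on the fourth column, so `grep` is no longer needed. The rows below are hypothetical placeholders, not real entries from `artists.txt`.

```shell
# Hypothetical rows in the same four-column, tab-separated layout as artists.txt.
sample=$(mktemp)
printf 'DOE, Jan van\t(1600-1650)\tBaroque\tDutch painter (Haarlem)\n' > "$sample"
printf 'ROE, Piet\t(1610-1670)\tBaroque\tDutch painter\n' >> "$sample"
printf 'POE, Anna\t(1800-1860)\tRomanticism\tGerman sculptor\n' >> "$sample"

# Keep rows whose 4th column mentions "Dutch painter", print the period, then count and rank.
dutch_periods=$(awk -F'\t' '$4 ~ /Dutch painter/ {print $3}' "$sample" | sort | uniq -c | sort -rn)

echo "$dutch_periods"
rm -f "$sample"
```

Matching on `$4` rather than on the whole line also avoids counting rows where the phrase happens to appear in another column.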


### HW2: Find the most common last names among all artists in `artists.txt`.
HINT: you will use `awk`, `sort`, and `uniq`, and you may use one or more of those commands multiple times

In [49]:
%%bash
awk '{print $1}' data/art/artists.txt | sort | uniq -c | sort -r

   5 KOBELL,
   5 KNIP,
   4 KONINCK,
   4 ALBERTI,
   4 ADAM,
   3 TENIERS,
   3 TARAVAL,
   3 AGOSTINO
   3 AGNOLO
   2 TESSIN,
   2 TEMPESTA,
   2 TEDESCO,
   2 TASSAERT,
   2 TALENTI,
   2 TADDEO
   2 TACCA,
   2 KNYFF,
   2 KNIJFF,
   2 KLODT,
   2 KISS,
   1 TETRODE,
   1 TESTELIN,
   1 TESTA,
   1 TERZIO,
   1 TERRENI,
   1 TERILLI,
   1 TERBRUGGHEN,
   1 TERBORCH,
   1 TENGNAGEL,
   1 TENERANI,
   1 TEMPEL,
   1 TEMANZA,
   1 TELEPY,
   1 TEERLINC,
   1 TAUNAY,
   1 TASSI,
   1 TASSEL,
   1 TARUFFI,
   1 TARSIA,
   1 TARGONE,
   1 TARDIEU,
   1 TARCHIANI,
   1 TARBELL,
   1 TANZIO
   1 TAMM,
   1 TAMAGNINO,
   1 TAMAGNI,
   1 TALPA,
   1 TADOLINI,
   1 TACCONE,
   1 KŘBKE,
   1 KÖNIG,
   1 KOPISCH,
   1 KONRAD
   1 KOMPE,
   1 KOLUNIĆ,
   1 KOLLONITSCH,
   1 KOLBE,
   1 KOKORINOV,
   1 KOETS,
   1 KOERBECKE,
   1 KOEKKOEK,
   1 KOEDIJCK,
   1 KOEBERGER,
   1 KOCH,
   1 KOBERGER,
   1 KNÜPFER,
   1 KNOLLER,
   1 KNOBELSDORFF,
   1 KNELLER,
   1 KNEBEL,
   1 KLOCKER,
   1 KLINGER
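
The names above still carry their trailing comma, and `sort -r` compares lines as text rather than as numbers, which may misorder counts depending on padding and locale. A hedged sketch of a variant: take the text before the first comma of the tab-separated name column as the surname, and rank with `sort -rn`. Treating the text before the comma as the last name is an assumption; entries without a comma keep their full name. The rows are copied from `artists.txt` into a temporary stand-in file.

```shell
# Stand-in for artists.txt: three rows copied from the listing (assumed tab-separated).
sample=$(mktemp)
printf 'KLODT, Mikhail Konstantinovich\t(c. 1832-1902)\tRealism\tRussian painter (St. Petersburg)\n' > "$sample"
printf 'KLINGER, Max\t(1857-1920)\tRealism\tGerman painter\n' >> "$sample"
printf 'KLODT, Pyotr Karlovich\t(1805-1867)\tRomanticism\tRussian sculptor (St. Petersburg)\n' >> "$sample"

# Split the name column on commas and keep the first piece as the surname,
# then count each surname and rank numerically, highest count first.
surnames=$(awk -F'\t' '{split($1, parts, ","); print parts[1]}' "$sample" | sort | uniq -c | sort -rn)
top_surname=$(printf '%s\n' "$surnames" | head -n 1 | awk '{print $2}')

echo "$surnames"
echo "most common surname: $top_surname"
rm -f "$sample"
```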