# Unix tools for text file manipulation and exploration

This notebook contains a list of examples of common Linux/Unix commands.

```{objectives}
1. View and manipulate text and text files using fundamental Unix commands
2. Combine fundamental Unix commands
```


```{admonition} Weird syntax ahead
To show the effect of commands without creating temporary files,
the examples use *process substitution* (that is, the syntax `<(command)`) and pipes (`|`).
If this looks obscure,
the reader is invited to create the necessary temporary files instead.
```
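
As a minimal sketch of what process substitution replaces, the two variants below should give the same result. The file names `demo.tsv`, `top.tmp` and `bottom.tmp` are hypothetical and are created in a scratch directory only for this illustration. Process substitution needs a shell that supports it, such as `bash` or `zsh`.

```bash
# Work in a scratch directory so that no existing file is touched.
workdir=$(mktemp -d)
cd "$workdir"
printf 'line1\nline2\nline3\nline4\n' > demo.tsv   # hypothetical sample file

# Variant 1: explicit temporary files.
head -n 2 demo.tsv > top.tmp
tail -n 2 demo.tsv > bottom.tmp
with_tmp=$(cat top.tmp bottom.tmp)

# Variant 2: process substitution. Each <( ... ) behaves like a file
# whose content is the output of the enclosed command.
with_subst=$(cat <(head -n 2 demo.tsv) <(tail -n 2 demo.tsv))

echo "$with_subst"
```

Both variables should hold the same four lines; the second variant simply leaves no files behind.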

```{admonition} To follow along
Please navigate to the `examples/text_manipulation` directory.
```

In [1]:
cd ../../examples/text_manipulation/
ls

example.tsv  generate.sh


## "Vertical" text manipulation
In this section, we look at commands
that cut and sew together files "vertically",
that is, by selecting, reordering or concatenating lines.

We are going to use a tab-separated values (TSV) file
as an example:

In [2]:
cat example.tsv

X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x


Tabs are characters that are nicely understood as column separators by many tools.
We can slice the file vertically using the `head` (first lines) and `tail` (last lines) commands.

In [3]:
head -n 5 example.tsv

X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q


In [4]:
tail -n 4 example.tsv

p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x


When a text file is large, 
the `head` and `tail` commands are very useful 
to get an idea of its content.
To explore files, `wc` is also useful:

In [5]:
wc example.tsv

 12  48 108 example.tsv


These numbers are, respectively, the number of lines, words and bytes in `example.tsv` (for plain ASCII text like this, bytes and characters coincide).
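
Each count can also be requested on its own with the `-l`, `-w` and `-c` options. The sketch below uses a small hypothetical three-line input fed through a pipe, so it does not depend on `example.tsv`:

```bash
# Hypothetical sample: 3 lines, 6 words, 16 bytes.
n_lines=$(printf 'X\tY\na1\ta2\nb1\tb2\n' | wc -l)   # newline characters
n_words=$(printf 'X\tY\na1\ta2\nb1\tb2\n' | wc -w)   # blank-separated words
n_bytes=$(printf 'X\tY\na1\ta2\nb1\tb2\n' | wc -c)   # bytes

echo "lines:" $n_lines "words:" $n_words "bytes:" $n_bytes
```

When `wc` reads from a pipe, it prints only the numbers, without a file name.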

Multiple files can be joined together vertically 
using the `cat` command:

In [6]:
cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv)

X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
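
The form `tail -n +6` used above means "start at line 6", as opposed to `tail -n 6`, which means "the last 6 lines". A minimal sketch on ten hypothetical numbered lines, which also shows how `head` and `tail` can be chained to extract a range from the middle of a file:

```bash
# Ten numbered lines as hypothetical input.
numbers=$(printf '%s\n' 1 2 3 4 5 6 7 8 9 10)

# tail -n +6: everything from line 6 to the end (lines 6..10).
from_six=$(printf '%s\n' "$numbers" | tail -n +6)

# head, then tail: keep the first 5 lines, then the last 2 of those (lines 4..5).
middle=$(printf '%s\n' "$numbers" | head -n 5 | tail -n 2)

echo "$from_six"
echo "---"
echo "$middle"
```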


```{exercise} Break and concatenate
   1. Verify, using the `diff` command,
      that the output of the command above
      reproduces the original `example.tsv`. 

      ```{hint}
         You can save the output to a temporary 
         file using `>`.
      ```


   2. What if you change the head command to take only 4 lines instead of 5?

   3. (Bonus points) Try again, but do not use temporary files.
   ```{solution}
   1. Using temporary files:
      ```bash
      cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv) > reconstructed.tsv
      diff example.tsv reconstructed.tsv
      ```
      And there should be no output. 
   2. (Bonus item, without temporary files) A possible solution,
      in a single command,
      can be obtained by nesting
      process substitution:
      ```bash
      diff <(cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv)) example.tsv
      ```
      And there should be no output. 
   ```
```

Another notable program that cuts files vertically is `grep`, which prints only the lines that match a pattern.
The pattern can be a *regular expression*:

In [7]:
grep -E '[acp][12]?' example.tsv

a1	a2	a3	a4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
p	2	n	o
p	1	n	o


*Regular expressions* are a very powerful tool to search, extract and replace text, implemented in many programming languages and supported by many tools in the shell.
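
To see which part of a line a pattern such as `[acp][12]?` actually matches, the `-o` option of `grep` prints only the matched text, one match per line. A minimal sketch on a single hypothetical input line:

```bash
# [acp]  : exactly one of the characters a, c or p
# [12]?  : optionally followed by a 1 or a 2
matches=$(printf 'a1\ta2\ta3\ta4\n' | grep -oE '[acp][12]?')

echo "$matches"
```

Here the first two matches include the digit (`a1`, `a2`), while for `a3` and `a4` only the letter `a` matches, because `3` and `4` are not in `[12]`.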

Multiple `grep` commands can be combined with *pipes*, creating sophisticated filters:

In [8]:
grep -E '[acp][12]?' example.tsv | grep -v '2'

n	1	p	q
p	1	n	o


```{exercise} View and search command history
1. How would you print the last 10 entries in your command history into a text file?
2. How many times have you used the `ls` command?
```

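Another kind of vertical manipulation is done with the `sort` command. Here `-k2` sorts on a key that starts at the second column and extends to the end of the line (use `-k2,2` to restrict the key to the second column alone):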

In [9]:
tail -n 8 example.tsv | sort -k2

p	1	n	o
q	1	o	x
n	1	p	q
o	1	q	n
p	2	n	o
q	2	o	x
n	2	p	q
o	2	q	n
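
Sort keys can be restricted to a single column and combined. The sketch below, on a small hypothetical two-column input, sorts numerically on the second column and breaks ties with the first column; `LC_ALL=C` is set only to make the ordering independent of the local language settings.

```bash
# -k2,2n : sort numerically on field 2 only
# -k1,1  : break ties using field 1 only
sorted=$(printf 'p\t2\nn\t1\no\t2\nn\t2\np\t1\no\t1\n' | LC_ALL=C sort -k2,2n -k1,1)

echo "$sorted"
```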


## "Horizontal" text manipulation

`cut` can extract columns from a file:

In [10]:
cut -f1,2 example.tsv

X	Y
a1	a2
b1	b2
c1	c2
n	2
n	1
o	2
o	1
p	2
p	1
q	2
q	1


and `paste` can join columns together:

In [11]:
paste <(cut -f1,2 example.tsv) <(cut -f3,4 example.tsv)

X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x


```{exercise} Cut and paste
1. Verify using the `diff` command that the output of the command above is equal to the original `example.tsv` file
2. What happens if you change the column choices in `cut`?
```

## When `cut` does not cut it: `awk`

`cut` is a simple tool that works when the columns of a file are separated by a single character (a tab by default; another one can be chosen with the `-d` option).

If this is not the case, one can resort to `awk`, which is a very powerful tool.
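
A typical case where `cut` struggles is a file whose columns are aligned with a variable number of spaces. A minimal sketch, using a hypothetical line with three spaces between the first two words:

```bash
line='alpha   beta gamma'   # three spaces between alpha and beta

# cut treats every single space as a separator, so field 2 is empty here.
with_cut=$(printf '%s\n' "$line" | cut -d' ' -f2)

# awk splits on runs of blanks by default, so field 2 is "beta".
with_awk=$(printf '%s\n' "$line" | awk '{print $2}')

echo "cut: [$with_cut]"
echo "awk: [$with_awk]"
```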

To print the first 2 columns of `example.tsv` with `awk`, we can use

In [12]:
awk '{print $1,$2}' example.tsv

X Y
a1 a2
b1 b2
c1 c2
n 2
n 1
o 2
o 1
p 2
p 1
q 2
q 1


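`awk` can also be used in a pipe and can do arithmetic, which is handy for quick checks. Note that `print $1,$2` separates the columns with a single space, the default output field separator of `awk`, rather than with a tab.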
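
As a sketch of such a quick check, the example below sums a numeric column and swaps two columns while keeping the output tab-separated. The input is a small hypothetical table with a header line, not `example.tsv` itself.

```bash
# Hypothetical two-column input with a header line, tab separated.
input=$(printf 'X\tY\nn\t2\nn\t1\no\t2\n')

# Skip the header (NR > 1), add up column 2, print the total at the end.
total=$(printf '%s\n' "$input" | awk -F'\t' 'NR > 1 { sum += $2 } END { print sum }')

# Swap the two columns; -v OFS='\t' keeps the output tab separated.
swapped=$(printf '%s\n' "$input" | awk -F'\t' -v OFS='\t' '{ print $2, $1 }')

echo "total: $total"
echo "$swapped"
```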

## Following a running program

If we have a process that is generating some output in a text file
and we want to monitor its output, we have two possibilities.


### The `tee` command

If we just want to see the output of a process 
and at the same time save it into a file, the `tee` command helps us to do that:


In [14]:
./generate.sh | tee growing_file 

Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.
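
Since `tee` passes its input on unchanged, it can also sit in the middle of a longer pipeline. In the sketch below, the log file name and the three generated lines are hypothetical stand-ins for the output of a real program:

```bash
workdir=$(mktemp -d)          # scratch directory for the hypothetical log file
log="$workdir/lines.log"

# tee writes its input to the file AND passes it on to the next command.
count=$(printf 'Generating line %s...\n' 1 2 3 | tee "$log" | wc -l)

echo "lines seen downstream:" $count
cat "$log"
```

By default `tee` overwrites the file; `tee -a` appends to it instead.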


### The `tail -f` command
Alternatively, we can use `tail -f` (`-f` stands for follow).
Example:

In [15]:
./generate.sh > growing_file &

[1] 23541


This command runs in the background (because of the trailing `&`), generating lines of text and adding them one by one to `growing_file`.
To monitor the process, we can do:

In [16]:
tail -f -s 5 growing_file

Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.

[1]+  Done                    ./generate.sh > growing_file


`tail -f` keeps running until we terminate it with `CTRL+C`.

```{warning}

`tail -f` can be nasty to other users!

When used to monitor files in a global filesystem (e.g., your home directory), the frequent *polling* by `tail -f` might strain the filesystem unnecessarily. 
By adding the option `-s 20`, for example, we reduce the load by telling `tail` to check less frequently, in this case every 20 seconds.
```