# Unix tools for text file manipulation and exploration

This notebook collects examples of common Linux/Unix commands for exploring and manipulating text files.

**Note 1**: To avoid creating temporary files just to demonstrate a command,
the examples use *process substitution* (that is, the syntax `<(command)`) and pipes (`|`).
If this looks obscure,
the reader is invited to create the necessary temporary files instead.
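
For readers who prefer the temporary-file route, here is a minimal sketch of the correspondence. The process-substitution form is shown only as a comment; the file names come from `mktemp` and the two `printf` commands are hypothetical stand-ins for any command that produces output.

```shell
# Process-substitution form, as used in this notebook (bash, zsh, ksh):
#   cat <(printf 'first\n') <(printf 'second\n')

# Equivalent form with explicit temporary files.
tmp_a=$(mktemp)
tmp_b=$(mktemp)
printf 'first\n'  > "$tmp_a"   # output of the first command
printf 'second\n' > "$tmp_b"   # output of the second command
combined=$(cat "$tmp_a" "$tmp_b")
echo "$combined"
rm -f "$tmp_a" "$tmp_b"        # clean up the temporary files
```

Both forms hand `cat` two readable file names; process substitution simply spares us the bookkeeping and the cleanup.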

**Note 2**: To follow along, please navigate to the `examples/text_manipulation` directory.

In [1]:
cd ../../examples/text_manipulation/
ls

example.tsv  generate.sh  growing_file  other_example.tsv


## "Vertical" text manipulation
In this section, we have a look at commands
that cut and sew together files "vertically", that is, line by line.

We are going to use a tab-separated values (TSV) file
as an example:

In [2]:
cat example.tsv

X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x


Many tools treat the tab character as a column separator by default.
We can slice the file vertically using the `head` and `tail` commands.

In [3]:
head -n 5 example.tsv

X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q


In [4]:
tail -n 4 example.tsv

p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
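
The two commands can also be chained to extract a range of lines, and `tail -n +K` starts printing at line `K` instead of counting from the end. A small sketch on made-up input (the numbered lines and the `header`/`row` text are illustrative only, not taken from `example.tsv`):

```shell
# Lines 3 to 5 of a 7-line stream: keep the first 5, then the last 3 of those.
middle=$(printf '%s\n' 1 2 3 4 5 6 7 | head -n 5 | tail -n 3)
echo "$middle"

# tail -n +2 prints everything from line 2 onwards, i.e. it skips a header row.
body=$(printf 'header\nrow1\nrow2\n' | tail -n +2)
echo "$body"
```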


When a text file is large, 
the `head` and `tail` commands are very useful 
to get an idea of its content.
To explore files, `wc` is also useful:

In [5]:
wc example.tsv

 12  48 108 example.tsv


These numbers are, respectively, the number of lines, words, and bytes in `example.tsv` (for a plain ASCII file like this one, bytes and characters coincide).
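
Each count can also be requested on its own with `-l`, `-w`, or `-c`. The sketch below uses a small throwaway file created with `mktemp` (hypothetical data, not `example.tsv`) and reads it through `<` so that `wc` prints only the number, without the file name:

```shell
# Build a three-line, tab-separated sample file.
sample=$(mktemp)
printf 'X\tY\na1\ta2\nb1\tb2\n' > "$sample"

lines=$(wc -l < "$sample")   # number of newline characters
words=$(wc -w < "$sample")   # number of whitespace-separated words
bytes=$(wc -c < "$sample")   # number of bytes

echo "$lines lines, $words words, $bytes bytes"
rm -f "$sample"
```

Some implementations pad the numbers with leading spaces, so it is safer to compare them numerically than as strings.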

Multiple files can be joined together vertically 
using the `cat` command:

In [6]:
cat <(head -n 5 example.tsv) <(tail -n +6 example.tsv)

X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x


**Exercise for the reader**:
Verify, using the `diff` command,
that the command above reproduces the original `example.tsv`.
Bonus point: do not use temporary files.

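Another notable program that cuts files vertically is `grep`,
which keeps only the lines that match a pattern.
The pattern can be a *regular expression*; here, `-E` enables the extended syntax and the pattern matches `a`, `c`, or `p`, optionally followed by `1` or `2`: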

In [7]:
grep -E '[acp][12]?' example.tsv

a1	a2	a3	a4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
p	2	n	o
p	1	n	o


Regular expressions are a very powerful tool to search, extract and replace text, implemented in many programming languages and supported by many tools in the shell.

Multiple `grep` commands can be combined with *pipes*, creating sophisticated filters:

In [8]:
grep -E '[acp][12]?' example.tsv | grep -v '2'

n	1	p	q
p	1	n	o
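
A few other `grep` options are often handy in such filters. The following sketch, run on a made-up three-line sample rather than on `example.tsv`, anchors the pattern to the start of the line with `^`, counts matches with `-c`, and inverts the selection with `-v`:

```shell
# Hypothetical sample, fed through a pipe instead of a file.
sample=$(printf 'a1\ta2\nn\t1\np\t2\n')

# -c prints the number of matching lines instead of the lines themselves.
count=$(printf '%s\n' "$sample" | grep -c -E '^[acp]')

# -v keeps the lines that do NOT match the pattern.
rest=$(printf '%s\n' "$sample" | grep -v -E '^[acp]')

echo "lines starting with a, c or p: $count"
echo "other lines: $rest"
```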


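Another kind of vertical manipulation is reordering lines with the `sort` command. Here, `-k2` sorts on a key that starts at the second column and runs to the end of the line (use `-k2,2` to restrict the key to the second column alone):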

In [9]:
sort -k2 <(tail -n 8 example.tsv)

p	1	n	o
q	1	o	x
n	1	p	q
o	1	q	n
p	2	n	o
q	2	o	x
n	2	p	q
o	2	q	n
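
By default `sort` compares keys as text, which gives surprising results for multi-digit numbers. The sketch below contrasts a textual and a numeric (`n`) sort on the second column of a made-up sample; `LC_ALL=C` and the explicit tab separator (`-t`) are choices made here only to keep the result reproducible.

```shell
tab=$(printf '\t')                       # a literal tab, used as field separator
data=$(printf 'q\t2\nn\t10\np\t1\n')     # hypothetical sample

# Textual comparison on column 2 only: "10" sorts before "2".
as_text=$(printf '%s\n' "$data" | LC_ALL=C sort -t "$tab" -k2,2)

# Numeric comparison on column 2 only: 1 < 2 < 10.
as_number=$(printf '%s\n' "$data" | LC_ALL=C sort -t "$tab" -k2,2n)

echo "$as_text"
echo "$as_number"
```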


## "Horizontal" text manipulation

`cut` can extract columns from a file:

In [10]:
cut -f1,2 example.tsv

X	Y
a1	a2
b1	b2
c1	c2
n	2
n	1
o	2
o	1
p	2
p	1
q	2
q	1


and `paste` can join columns together:

In [11]:
paste <(cut -f1,2 example.tsv) <(cut -f3,4 example.tsv)

X	Y	Z	T
a1	a2	a3	a4
b1	b2	b3	b4
c1	c2	c3	c4
n	2	p	q
n	1	p	q
o	2	q	n
o	1	q	n
p	2	n	o
p	1	n	o
q	2	o	x
q	1	o	x
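
Both commands assume tab-separated columns unless told otherwise with `-d`. As a sketch, the following reorders two columns of a comma-separated sample and joins them with a semicolon; the directory and file names are hypothetical, and temporary files are used in place of process substitution.

```shell
workdir=$(mktemp -d)
printf 'a1,a2,a3\nb1,b2,b3\n' > "$workdir/sample.csv"

# Extract the first and third comma-separated columns into separate files.
cut -d ',' -f 1 "$workdir/sample.csv" > "$workdir/first.txt"
cut -d ',' -f 3 "$workdir/sample.csv" > "$workdir/third.txt"

# Join them side by side, third column first, separated by ';'.
joined=$(paste -d ';' "$workdir/third.txt" "$workdir/first.txt")
echo "$joined"

rm -rf "$workdir"
```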


**Exercise for the reader**:
Verify, using the `diff` command,
that the command above reproduces the original `example.tsv`.
Bonus point: do not use temporary files.

## Following a running program

If we have a process that is generating some output in a text file
and we want to monitor its output, we have two possibilities.

### The `tee` command

If we just want to see the output of a process 
and at the same time save it into a file, the `tee` command helps us to do that:


In [12]:
./generate.sh | tee growing_file 

Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.
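
Since `tee` also writes to its standard output, it can sit in the middle of a pipeline, and with `-a` it appends to the file instead of overwriting it. A minimal sketch, where the `printf` commands stand in for `generate.sh` and the log file is a throwaway created with `mktemp`:

```shell
log=$(mktemp)

# tee saves a copy in the file and passes the lines on to the next command.
count=$(printf 'line 1\nline 2\n' | tee "$log" | wc -l)

# -a appends to the existing file; stdout is discarded here.
printf 'line 3\n' | tee -a "$log" > /dev/null

saved=$(cat "$log")
echo "lines seen downstream: $count"
echo "$saved"
rm -f "$log"
```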


### The `tail -f` command
Alternatively, we can use `tail -f` (`-f` stands for "follow"), which keeps printing lines as they are appended to the file.
Example:

In [13]:
./generate.sh > growing_file &

[1] 35007


This command runs in the background (`&`), generating lines of text and adding them one by one to `growing_file`.
To monitor the process, we can run:

In [None]:
tail -f growing_file

Generating line 1...
Generating line 2...
Generating line 3...
Generating line 4...
Generating line 5...
Generating line 6...
Generating line 7...
Generating line 8...
Generating line 9...
Generating line 10...
Done.


`tail -f` does not exit on its own, so we terminate it with `CTRL+C` when we are done.
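
In a script there is nobody to press a key, so one possible approach is to run `tail -f` in the background and stop it with `kill` once the writer has finished. The sketch below assumes a hypothetical writer loop in place of `generate.sh`, and the sleep durations are arbitrary choices:

```shell
log=$(mktemp)
out=$(mktemp)

# Hypothetical stand-in for generate.sh: one line per second, then "Done."
( for i in 1 2 3; do echo "Generating line $i..."; sleep 1; done; echo "Done." ) > "$log" &
writer=$!

# Follow the growing file in the background, capturing what tail prints.
tail -f "$log" > "$out" &
follower=$!

wait "$writer"                       # block until the writer has finished
sleep 2                              # give tail time to pick up the last lines
kill "$follower"                     # scripted equivalent of pressing CTRL+C
wait "$follower" 2>/dev/null || true # reap tail; its non-zero status is expected

followed=$(cat "$out")
echo "$followed"
rm -f "$log" "$out"
```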