<script>
    var code_show=true; //true -> hide code at first

    function code_toggle() {
        $('div.prompt').hide(); // always hide prompt

        if (code_show){
            $('div.input').hide();
        } else {
            $('div.input').show();
        }
        code_show = !code_show
    }
    $( document ).ready(code_toggle);
</script>

<center>
<h1>Workshop 1b: Advanced UNIX</h1><br><br>
<i><big> File operations, pipes, data reformatting and queries.</big><br><br>
Hatzikotoulas Konstantinos (Kostas) (konstantinos.hatzikotoulas@helmholtz-munich.de)</i>
</center>

**Topics covered in this tutorial**:

* Text processing tools: grep, cut, tr and paste.
* Sed and AWK.
* Pipes and redirections.

**Topics _not_ covered in this tutorial**:

* Variable handling.
* Loops.
* Conditionals.
* Subshells, FIFO pipes and named pipes.


## Step 1 : Downloading resources from the internet

It is easy to access the internet via the command line. For this we use the `wget` tool. We want to download the complete [GWAS catalog](https://www.ebi.ac.uk/gwas/) maintained by [EBI](https://www.ebi.ac.uk/), which contains all the published genome-wide associations to date. You can perform simple queries on the database using the GWAS catalog website, but you can also download it so you can build your own database.


<div class="alert alert-success"><b>Question 0:</b>  Go to the GWAS catalog website and find the download link. Create a directory in your home named `GWAS_catalog` and download it to a file named `GWAS_catalog.txt`. How large is the file you just downloaded?
</div>


<button type="button" class="btn btn-primary" onClick="code_toggle()">Click here to show/hide answer</button>



In [1]:
%%bash

# Goes to your home
cd

# Creates the directory (-p: no error if it already exists) and enters it
mkdir -p GWAS_catalog && cd GWAS_catalog

# After the URL to the dataset, we specify the output file name.
wget https://www.ebi.ac.uk/gwas/api/search/downloads/full -O GWAS_catalog.txt

# Size
ls -lh GWAS_catalog.txt

-rw-rw---- 1 ag15 team144 71M May 13 10:52 GWAS_catalog.txt


--2019-05-13 10:52:24--  https://www.ebi.ac.uk/gwas/api/search/downloads/full
Resolving wwwcache.sanger.ac.uk (wwwcache.sanger.ac.uk)... 172.18.24.1, 172.18.24.2
Connecting to wwwcache.sanger.ac.uk (wwwcache.sanger.ac.uk)|172.18.24.1|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: unspecified [text/tsv]
Saving to: ‘GWAS_catalog.txt’

     0K .......... .......... .......... .......... .......... 1.16M
    50K .......... .......... .......... .......... .......... 1.10M
   100K .......... .......... .......... .......... .......... 5.86M
   150K .......... .......... .......... .......... .......... 11.1M
   200K .......... .......... .......... .......... .......... 10.7M
   250K .......... .......... .......... .......... .......... 57.4M
   300K .......... .......... .......... .......... .......... 10.8M
   350K .......... .......... .......... .......... .......... 27.6M
   400K .......... .......... .......... .......... .......... 15.7M
   450K .......

<div class="alert alert-info"> The `&&` operator allows you to execute several commands one after the other, but only if previous ones succeed. For example `command1 && command2 && command3` will execute `command2` only if `command1` succeeded, and `command3` only if both `command1` and `command2` succeeded.
</div>
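
As a small illustration of `&&` chaining, here is a sketch that works in a throwaway directory. The names `workdir` and `demo` are made up for this example, and `mktemp -d` is only used to get a scratch directory so that nothing in your home is touched.

```bash
# Create a scratch directory; $(...) captures the path printed by mktemp.
workdir=$(mktemp -d)

# Every command succeeds, so the whole chain runs, including the final echo.
mkdir "$workdir/demo" && cd "$workdir/demo" && echo "now in: $(basename "$PWD")"

# The directory already exists, so mkdir fails and the echo is skipped.
mkdir "$workdir/demo" 2>/dev/null && echo "this is never printed"

# $? holds the exit status of the last command; non-zero means failure.
status=$?
echo "exit status of the failed chain: $status"
```

If you want something to run only when a command fails, the `||` operator does the opposite of `&&`.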

`wget` is an extremely powerful command-line tool with a wide range of features (like resuming interrupted downloads) that we can't cover here. As mentioned in the previous workshop, you can read the comprehensive manual using `man wget`. This also applies to the rest of the tools we are covering in this session.

<div class="alert alert-warning"> The file we just downloaded was created in Windows. This creates problems, because some characters, notably line endings, are encoded differently in UNIX/Linux and Windows (Windows ends each line with a carriage return followed by a newline, UNIX with a newline only). To make the file Linux-compatible, run `dos2unix GWAS_catalog.txt`.
</div>

## Text processing bash tools

###  Step 2: Counting lines, first $n$ and last $n$ lines of a file


<div class="alert alert-success"><b>Question 1:</b>  Using `wc`, check the number of lines in the GWAS catalog. Using `head` and `tail`, display the first 10 lines and the last 5 lines of the file.
</div>


<button type="button" class="btn btn-primary" onClick="code_toggle()">Click here to show/hide answer</button>


In [3]:
%%bash

cd ~/GWAS_catalog

# How many lines does the file have?
wc -l GWAS_catalog.txt

# First 10 rows
head GWAS_catalog.txt

# Last 5 rows (both forms are equivalent; -n 5 is the more portable one)
tail -5 GWAS_catalog.txt
tail -n 5 GWAS_catalog.txt

136288 GWAS_catalog.txt
DATE ADDED TO CATALOG	PUBMEDID	FIRST AUTHOR	DATE	JOURNAL	LINK	STUDY	DISEASE/TRAIT	INITIAL SAMPLE SIZE	REPLICATION SAMPLE SIZE	REGION	CHR_ID	CHR_POS	REPORTED GENE(S)	MAPPED_GENE	UPSTREAM_GENE_ID	DOWNSTREAM_GENE_ID	SNP_GENE_IDS	UPSTREAM_GENE_DISTANCE	DOWNSTREAM_GENE_DISTANCE	STRONGEST SNP-RISK ALLELE	SNPS	MERGED	SNP_ID_CURRENT	CONTEXT	INTERGENIC	RISK ALLELE FREQUENCY	P-VALUE	PVALUE_MLOG	P-VALUE (TEXT)	OR or BETA	95% CI (TEXT)	PLATFORM [SNPS PASSING QC]	CNV
2018-08-21	29518117	Lee HS	2018-03-08	PLoS One	www.ncbi.nlm.nih.gov/pubmed/29518117	Nuclear receptor and VEGF pathways for gene-blood lead interactions, on bone mineral density, in Korean smokers.	Bone mineral density x blood lead interaction in current smokers (1df test)	119 Korean ancestry current smokers	NA	7p22.1	7	5212643	WIPI2	WIPI2			ENSG00000157954			rs4720530-?	rs4720530	0	4720530	intron_variant	0		4E-7	6.3979400086720375				Affymetrix [344396]	N
2018-05-11	29523524	Singh S	2018-03-09	J Am Heart Assoc


From the first 10 rows we can see how the file is structured: 
* the first line is the header describing the content of each column (a more comprehensive explanation can be found on the [gwas website](https://www.ebi.ac.uk/gwas/docs/fileheaders))
* then each line is an individual association between a genetic variant and an observed phenotype.

### Step 2b : Using pipes

So far, all the commands we ran printed their output to the screen (like `wc`) or to a file (`wget -O`). But we can also send the output to another command. We do this using the "pipe" character (`|`). For example, `command1 | command2` will pass the output of `command1` to the input of `command2`.
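
A minimal sketch of a pipe, using generated input rather than the GWAS catalog. `seq` (which prints a sequence of numbers, one per line) and `echo` are only there to produce some text to feed into `wc`.

```bash
# seq prints the numbers 1 to 100, one per line; wc -l counts the lines it receives.
seq 1 100 | wc -l

# The same idea with words: wc -w counts the words it reads from the pipe.
echo "one two three" | wc -w
```

The command on the right of the pipe never sees a file name: it simply reads whatever the command on the left prints.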


<div class="alert alert-success"><b>Question 2:</b>  Using `head` piped to `tail`, print **only the 200th line** of the GWAS catalog. 
<br><br>
`cat` prints the contents of one or more files to the screen. Using `cat` and `wc`, print the number of lines in the GWAS catalog.
</div>


<button type="button" class="btn btn-primary" onClick="code_toggle()">Click here to show/hide answer</button>


In [9]:
%%bash

cd ~/GWAS_catalog

head -200 GWAS_catalog.txt | tail -1

cat GWAS_catalog.txt | wc -l

2018-07-02	29615537	Kulminski AM	2018-03-01	Aging (Albany NY)	www.ncbi.nlm.nih.gov/pubmed/29615537	Strong impact of natural-selection-free heterogeneity in genetics of age-related phenotypes.	High density lipoprotein cholesterol levels	up to 33,431 European ancestry individuals	NA	9q31.1	9	104894789	ABCA1	ABCA1			ENSG00000165029			rs3905000-A	rs3905000	0	3905000	intron_variant	0	0.1354	3E-8	7.522878745280337		1.6761924	[1.08-2.27] unit decrease	Affymetrix, Illumina [~ 2500000] (imputed)	N
136288


### Step 3 : redirections

Instead of sending the result of a command to another command, we can also tell UNIX to write the output to a file. This is not so useful when you have a single command, but when you have several piped commands (a pipeline) it is used to write the end result to a file. There are 2 types of redirection:

* **overwrite** which is a single "greater than". `command1 > file` will overwrite `file` if it exists, otherwise create it and fill it with the output of `command1`.
* **append**, which is a double "greater than". `command1 >> file` will add the output of `command1` to `file`, creating it if it does not exist.
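
The difference between the two redirections can be sketched on a toy file. The file name `notes.txt` is made up for this example, and the commands run in a scratch directory created with `mktemp -d`.

```bash
# Work in a scratch directory so no existing file gets overwritten.
cd "$(mktemp -d)"

echo "first"  >  notes.txt   # creates notes.txt with one line
echo "second" >> notes.txt   # appends a second line
cat notes.txt                # shows both lines

echo "third"  >  notes.txt   # overwrites: the two earlier lines are gone
cat notes.txt                # shows only "third"
```

Because `>` silently replaces the previous content, it is worth double-checking the file name before running a long pipeline that ends with it.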


<div class="alert alert-success"><b>Question 3:</b>  Using `head`, `tail` and pipes, create a file named `200th_line.txt` that contains **only the header and the 200th line**, and display it using `cat`.

</div>


<button type="button" class="btn btn-primary" onClick="code_toggle()">Click here to show/hide answer</button>


In [11]:
%%bash

cd ~/GWAS_catalog

## First the header
head -1 GWAS_catalog.txt > 200th_line.txt

## Then the 200th line
head -200 GWAS_catalog.txt | tail -1 >> 200th_line.txt

cat 200th_line.txt

DATE ADDED TO CATALOG	PUBMEDID	FIRST AUTHOR	DATE	JOURNAL	LINK	STUDY	DISEASE/TRAIT	INITIAL SAMPLE SIZE	REPLICATION SAMPLE SIZE	REGION	CHR_ID	CHR_POS	REPORTED GENE(S)	MAPPED_GENE	UPSTREAM_GENE_ID	DOWNSTREAM_GENE_ID	SNP_GENE_IDS	UPSTREAM_GENE_DISTANCE	DOWNSTREAM_GENE_DISTANCE	STRONGEST SNP-RISK ALLELE	SNPS	MERGED	SNP_ID_CURRENT	CONTEXT	INTERGENIC	RISK ALLELE FREQUENCY	P-VALUE	PVALUE_MLOG	P-VALUE (TEXT)	OR or BETA	95% CI (TEXT)	PLATFORM [SNPS PASSING QC]	CNV
2018-07-02	29615537	Kulminski AM	2018-03-01	Aging (Albany NY)	www.ncbi.nlm.nih.gov/pubmed/29615537	Strong impact of natural-selection-free heterogeneity in genetics of age-related phenotypes.	High density lipoprotein cholesterol levels	up to 33,431 European ancestry individuals	NA	9q31.1	9	104894789	ABCA1	ABCA1			ENSG00000165029			rs3905000-A	rs3905000	0	3905000	intron_variant	0	0.1354	3E-8	7.522878745280337		1.6761924	[1.08-2.27] unit decrease	Affymetrix, Illumina [~ 2500000] (imputed)	N


<div class="alert alert-success"><b>Question 4:</b> The file you have created contains both spaces and tab characters as separators, which makes it a bit messy to display. The `column` command can add artificial spaces to a pipe, so that it looks a bit more like Excel. Pipe `cat` into `column -s$'\t' -t`, and then into `less`. By default, `less` folds long lines to make them fit the screen. We have very long lines so we want to display them laterally, without breaking them. Can you find the option in the `less` manual that does that? The 200th line of the GWAS catalog tells us about a mutation in a gene. Can you find the chromosome, position and gene for that variant?

</div>


<button type="button" class="btn btn-primary" onClick="code_toggle()">Click here to show/hide answer</button>


In [12]:
%%bash

cd ~/GWAS_catalog

cat 200th_line.txt | column -s$'\t' -t | less -S

# chromosome 9, position 104894789, gene ABCA1

DATE ADDED TO CATALOG  PUBMEDID  FIRST AUTHOR  DATE        JOURNAL            LINK                                  STUDY                                                                                         DISEASE/TRAIT                                INITIAL SAMPLE SIZE                         REPLICATION SAMPLE SIZE  REGION  CHR_ID  CHR_POS    REPORTED GENE(S)  MAPPED_GENE  UPSTREAM_GENE_ID  DOWNSTREAM_GENE_ID  SNP_GENE_IDS  UPSTREAM_GENE_DISTANCE  DOWNSTREAM_GENE_DISTANCE  STRONGEST SNP-RISK ALLELE  SNPS  MERGED  SNP_ID_CURRENT  CONTEXT            INTERGENIC  RISK ALLELE FREQUENCY      P-VALUE                                     PVALUE_MLOG  P-VALUE (TEXT)  OR or BETA  95% CI (TEXT)  PLATFORM [SNPS PASSING QC]  CNV
2018-07-02             29615537  Kulminski AM  2018-03-01  Aging (Albany NY)  www.ncbi.nlm.nih.gov/pubmed/29615537  Strong impact of natural-selection-free heterogeneity in genetics of age-related phenotypes.  High density lipoprotein cholesterol levels  up to 33,431 

### Searching files

Very often, we want to search the contents of files. This is done using a very powerful command, `grep`.

```bash

grep [OPTIONS] [PATTERN] [FILE]

# Example:

grep Volos cities.txt
```

The example above looks for every line that contains "Volos" in the file `cities.txt` and displays the matching lines. If your pattern contains a space, enclose it in single or double quotes (e.g. `"Volos, Thessalia"` or `'Volos, Thessalia'`).

<div class="alert alert-info"> It is possible to search for several patterns at a time, not just one. If you are interested in only a few patterns, you can add them to the command line, each preceded by `-e`. For example, `grep -e Volos -e Thessaloniki cities.txt`. If you have many patterns that you want to search for, you can use `grep -f patterns.txt file.txt`, where `patterns.txt` contains all the patterns you are interested in, one per line.
</div>
<br>

<div class="alert alert-info"> Interesting parameters for `grep` are `-w`, `-v`, `-i`, `-n`, `-A`, `-B`, `-C`, `-l` and `-L`. Check them out and play around with them if you have time!
</div>
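
To get a feel for these options before running them on the catalog, here is a sketch on a tiny, made-up `cities.txt` (the file contents and the `patterns.txt` name are assumptions for illustration only).

```bash
# Build a five-line toy file in a scratch directory.
cd "$(mktemp -d)"
printf 'Volos\nThessaloniki\nAthens\nvolos port\nNew Volos\n' > cities.txt

grep -i volos cities.txt                # ignore case: matches 3 lines
grep -v Volos cities.txt                # invert: lines that do NOT contain "Volos"
grep -n Athens cities.txt               # prefix each match with its line number
grep -c -e Volos -e Athens cities.txt   # count lines matching either pattern

# Patterns can also come from a file, one per line.
printf 'Athens\nThessaloniki\n' > patterns.txt
grep -f patterns.txt cities.txt
```

Note that `-c` counts matching lines, which gives the same number as piping the matches into `wc -l`.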

<div class="alert alert-success"><b>Question 5:</b> Use `grep` to search for other variants in the gene you found above. Using a pipe and `wc`, determine how many lines from the GWAS catalog mention it. How about the variant you found? Are there other lines that mention this variant?

</div>

<div class="alert alert-success"><b>Question 6:</b> Before, we used 2 steps to create a file containing the 200th line. Now with `grep`, we can do everything in one line. Find something unique to the header and our 200th line, and create a file called `200th_line2.txt` that contains them (you might need several grep commands piped together). Use the `diff` command to check that the files are identical.
</div>

<button type="button" class="btn btn-primary" onClick="code_toggle()">Click here to show/hide answers</button>


In [13]:
%%bash

cd ~/GWAS_catalog

## Question 5
#grep ABCA1 GWAS_catalog.txt
grep ABCA1 GWAS_catalog.txt | wc -l

grep 'rs3905000-A' GWAS_catalog.txt | wc -l

182
1


In [20]:
%%bash

cd ~/GWAS_catalog

## Question 6

# the -n option of grep is useful here to show which line has a match

# There is no single right answer here, you might have selected other search criteria that give the same result.
# grep -F treats the patterns as fixed strings (fgrep is a deprecated alias for it).
grep -F -e DOWNSTREAM_GENE_DISTANCE -e 'rs3905000-A' GWAS_catalog.txt > 200th_line2.txt

# The diff output is empty: our files are identical!
diff 200th_line.txt 200th_line2.txt

### Step 4: Playing with columns: cut and paste

As we now know, our file contains many columns, but we are interested in only a couple of them. `cut` extracts specific columns from a file, whereas `paste` joins files side by side, line by line, as columns. Both use the tab character as the default delimiter; a different delimiter (space, comma, etc.) is specified with the `-d` argument (as these are often special characters, they need to be quoted with `'`).

#### Example
The command below extracts columns 1 to 4, 6, and 8 to 10 of `file.txt` and prints the output to the screen.
```bash
cut -f1-4,6,8-10 file.txt
```
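
The `-d` option can be sketched on a small comma-separated file. The file `cities.csv` and its contents are invented for this example.

```bash
# Build a toy comma-separated file in a scratch directory.
cd "$(mktemp -d)"
printf 'id,city,region\n1,Volos,Thessalia\n2,Patras,Achaea\n' > cities.csv

# cut always prints fields in their original order, whatever order you list them in.
cut -d',' -f3,1 cities.csv | head -n 1   # prints "id,region", not "region,id"

# To reorder columns, cut them into separate files and paste them back together.
cut -d',' -f2 cities.csv > city.txt
cut -d',' -f1 cities.csv > id.txt
paste -d',' city.txt id.txt > city_id.csv
cat city_id.csv
```

`paste` assumes that the files have the same number of lines and that line *n* of each file belongs to the same record.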

<div class="alert alert-success"><b>Question 7:</b> Extract the associated disease/trait and the SNP id (column `SNPS`) from the GWAS catalog and write them to a file called `phenotype_SNP.txt`. Then create a file called `SNP_phenotype.txt` where the SNP column comes first, and the Disease/trait one second (`cut` alone is not enough for this, you need to create 2 files and paste them together).
</div>

<button type="button" class="btn btn-primary" onClick="code_toggle()">Click here to show/hide answers</button>



In [22]:
%%bash

cd ~/GWAS_catalog

cut -f8,22 GWAS_catalog.txt > phenotype_SNP.txt

cut -f8 GWAS_catalog.txt > phenotype.txt
cut -f22 GWAS_catalog.txt > SNP.txt

paste SNP.txt phenotype.txt > SNP_phenotype.txt

## Step 5 : Find and replace

Once we have extracted some data, we often want to modify it. There are 2 programs to do this type of thing, one very simple and the other very complicated:

* `tr` replaces every occurrence of one particular character by another character.
    * The syntax is very simple: `command1 | tr 'a' 'b'` replaces every occurrence of the letter a in the pipe by the letter b.
* `sed` is more complicated to use, but can substitute pretty much anything by pretty much anything else.
    * `command1 | sed 's/pattern_to_find/replacement_pattern/'`
    * `pattern_to_find` is a **regular expression**, a special pattern syntax that allows you to match certain bits of text and not others. Regular expressions are very, very common in the computing world, and many programs understand them (`grep` also allows them). Regular expressions (or regexes) can take a long time to learn and use properly. You can learn about them [here](http://www.regular-expressions.info/) and test them out [here](http://regexr.com/).

### Some (very) basic regular expressions

* `/Volos/` matches the string `Volos`, as expected
* `/Vo*los/` matches Volos with zero or more first `o`: `Vlos`, `Volos`, `Voooooooolos`
* `/Vo+los/` matches Volos with one or more first `o`: `Volos`, `Voooooooolos` but not `Vlos` (in `grep` and `sed`, `+` needs the extended syntax, enabled with `-E`)
* `/V.los/` matches Volos, but the second letter can be anything (but not nothing): `Volos`, `Vylos`, `Vvlos` but not `Voolos`
* `/$/` matches the end of the line
* `/^/` matches the beginning of the line
* `/^Volos$/` matches a line that contains only `Volos` but not if there is a space (or anything) before or after.
* `s/o/i/` will replace `Volos` by `Vilos`, but `s/o/i/g` will replace all occurrences of the pattern: `Volos` will become `Vilis`.
* etc...
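
The patterns above can be tried directly on the command line. This is a small sketch using `echo` and `printf` to generate the input; the three test words are the ones from the list.

```bash
# tr works on single characters and replaces every occurrence.
echo "Volos" | tr 'o' 'i'        # Vilis

# sed replaces the first match on each line, or all matches with the g flag.
echo "Volos" | sed 's/o/i/'      # Vilos
echo "Volos" | sed 's/o/i/g'     # Vilis

# grep -E enables extended regular expressions, where + means "one or more".
printf 'Vlos\nVolos\nVoolos\n' | grep -E '^Vo+los$'   # Volos and Voolos

# * (zero or more) and . (any single character) work without -E.
printf 'Vlos\nVolos\nVoolos\n' | grep -c '^Vo*los$'   # 3 matching lines
printf 'Vlos\nVolos\nVoolos\n' | grep -c '^V.los$'    # 1 matching line
```

Quoting the pattern with `'` keeps the shell from interpreting characters such as `*` and `$` before `grep` or `sed` sees them.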


<br><br><br>

<div class="alert alert-success"><b>Question 8:</b> Using `tr`, replace every space character in the GWAS catalog by an underscore (`_`) and write it to `GWAS_catalog_no_spaces.txt`. Remember they are special characters, so need to be quoted.
</div>

<div class="alert alert-success"><b>Question 9:</b> In the file `phenotype_SNP.txt` you generated before, some SNPs have several IDs separated by a semicolon. Find them and remove all alternate IDs, keeping only the first one; write the file to `duplicates_removed.txt`.
</div>

<button type="button" class="btn btn-primary" onClick="code_toggle()">Click here to show/hide answers</button>


In [23]:
%%bash
cd ~/GWAS_catalog

## Question 8
cat GWAS_catalog.txt | tr ' ' '_' > GWAS_catalog_no_spaces.txt

## Question 9
grep ';' phenotype_SNP.txt | sed 's/;.*//' > duplicates_removed.txt

### Step 6 : Advanced column manipulation using `awk`

Here, we introduce another very powerful tool: `awk`. Like `sed`, entire programs can be written in awk, so we will show only a tiny fraction of what can be done with it.

The great usefulness of `awk` is that it **automatically splits lines on whitespace**, allowing you to select specific columns and perform actions on them.

<div class="alert alert-danger"><b>Beware:</b> `awk` splits on **any whitespace**. What will happen in a file like ours, with both tabs and spaces? Try it out: run `awk '{print $2}' GWAS_catalog.txt | less`. Look at the first row. Is this what you would expect? What would you do to solve this problem?
</div>

### Some (very) basic examples


* `'{print $2}'` prints the second field
* `'{print $0}'` prints the whole line
* `'$2==1'` prints the whole line if the second field is equal to 1 (equivalent to `'{if($2==1){print $0}}'`)
* `'$3~/Volos/{print "field 2 is: ", $2+1, "field 3 is: ", $3}'` prints a custom string if the third field matches (contains) "Volos"
* `'NF>2'` prints the line if it has more than 2 fields
* `'$NF>2'` prints the line if the value of the last field is greater than 2
* `'$10>$2+1 && $(NF-1)=="yes"'` prints the line if the 10th field is greater than the second + 1 **and** if the second-to-last field is equal to `yes`.
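
Here is a sketch of a few of these patterns on a three-line toy file. The file `towns.txt` and its values are made up for illustration.

```bash
# Three space-separated fields per line: name, a number, and yes/no.
cd "$(mktemp -d)"
printf 'Volos 3 yes\nAthens 10 no\nPatras 7 yes\n' > towns.txt

awk '{print $2}' towns.txt                    # second field of every line
awk '$2>5' towns.txt                          # whole line when field 2 is greater than 5
awk '$NF=="yes"{print $1, $2*2}' towns.txt    # for lines ending in "yes": name and doubled number
awk '$1~/ol/{print "match:", $1}' towns.txt   # lines whose first field contains "ol"
```

A condition without braces prints the whole line, while braces without a condition run on every line.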


<br><br>
<div class="alert alert-info"> AWK commands, like `sed` ones, are enclosed within single quotes (`'`). If you want to print the whole line only if a condition is satisfied, just write the condition. If you want to print certain columns, or do more complex operations, you must include your code in curly brackets (`{...}`).
</div>

<br><br>
<div class="alert alert-info"> By default, AWK merges consecutive whitespace delimiters into one. For example, in the string `a\tb\t\tc`, `$2=="b"` and `$3=="c"`. Sometimes (as here) we might have empty fields, so we want to keep consecutive delimiters separate. To do that, tell awk explicitly which delimiter to use: `awk -F'\t' '...'`
</div>
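
The effect of `-F'\t'` is easy to check on a single generated line. In this sketch the input comes from `printf`; the `-v OFS='\t'` option in the last command is not covered in the text above and is shown only as one possible way to make awk write tabs between output fields as well.

```bash
# One line with an empty third field: a <TAB> b <TAB> <TAB> c
printf 'a\tb\t\tc\n' | awk '{print NF}'          # 3: the empty field is lost
printf 'a\tb\t\tc\n' | awk -F'\t' '{print NF}'   # 4: the empty field is kept
printf 'a\tb\t\tc\n' | awk -F'\t' '{print $4}'   # c

# A field containing a space is split in two unless the delimiter is a tab.
printf 'Lee HS\tPLoS One\n' | awk '{print $2}'          # HS
printf 'Lee HS\tPLoS One\n' | awk -F'\t' '{print $2}'   # PLoS One

# OFS sets the separator awk writes between fields listed with commas in print.
printf 'a\tb\t\tc\n' | awk -F'\t' -v OFS='\t' '{print $1, $4}'
```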


<div class="alert alert-success"><b>Question 10:</b> Use the version of the GWAS catalog with no spaces we just created. How many fields does the GWAS catalog contain? Using `paste`, `tr` and `seq`, build a file named `column_indices.txt` that contains the column names and their numbers. With `grep`, locate the number of the columns `DISEASE/TRAIT`, `CHR_ID`, `CHR_POS`, `MAPPED_GENE`, `SNPS` in that file. Use awk to extract all records of the GWAS catalog that are on chromosome 11 between 5220000 and 5300000, and print only the fields above to a file called `Hemoglobin_region.txt`.
</div>


In [25]:
%%bash
cd ~/GWAS_catalog
# number of fields (=34)
echo "Number of fields"
head -1 GWAS_catalog.txt| tr ' ' '_' | awk '{print NF}'
echo

# building a file with column names and their numbers
head -1 GWAS_catalog_no_spaces.txt | tr '\t' '\n' > header.column
seq 1 34 > indices.column
paste header.column indices.column > column_indices.txt

# finding column IDs
echo "Column numbers"
grep -w -e DISEASE -e CHR_ID -e CHR_POS -e MAPPED_GENE -e SNPS column_indices.txt

# Printing only selected fields
awk -F'\t' '$12==11 && $13>5220000 && $13<5300000{print $8, $12, $13, $15, $22}' GWAS_catalog_no_spaces.txt > Hemoglobin_region.txt

Number of fields
34

Column numbers
DISEASE/TRAIT	8
CHR_ID	12
CHR_POS	13
MAPPED_GENE	15
SNPS	22


### Step 7:  Sorting

Last but not least, we introduce the `sort` and `uniq` commands. Very often we need to sort files according to one or several columns; this is done with `sort`. `sort` understands alphabetical (the default) and numerical (`-n`) orders, GNU `sort` also offers a general numeric order (`-g`) that handles scientific notation such as `3E-8`, and it can output only unique lines (`-u`).

#### Examples
* `sort -k1,1 -k2,2n` sorts according to the first field (alphanumeric order) then the second field (numeric order)
* `sort` sorts on the whole line according to alphanumeric order
* `sort -r` sorts in reverse order (in GNU `sort`, the uppercase `-R` option sorts randomly)

<br>
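<div class="alert alert-info"> `uniq` removes repeated lines, but only when they are adjacent, so its input usually needs to be sorted first. On its own it does not add much to sort, as `sort -u` is the same as `sort | uniq`. `uniq` is frequently used for its `-c` argument, which counts the number of occurrences of every unique line.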
</div>
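
A short sketch of these options on a made-up file of chromosome names and positions (`positions.txt` and its values are assumptions for illustration):

```bash
# Five space-separated lines, one of them duplicated.
cd "$(mktemp -d)"
printf 'chr2 300\nchr1 20\nchr1 3\nchr2 100\nchr1 20\n' > positions.txt

# Whole-line alphanumeric sort: "chr1 20" comes before "chr1 3".
sort positions.txt

# First field alphanumerically, then second field numerically: 3 comes before 20.
sort -k1,1 -k2,2n positions.txt

# Number of distinct lines.
sort -u positions.txt | wc -l

# How many times each chromosome appears (3 for chr1, 2 for chr2).
cut -d' ' -f1 positions.txt | sort | uniq -c
```

Without the `sort` in the last command, `uniq -c` would count `chr1` in two separate runs, because the `chr1` lines are not all adjacent in the file.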

<div class="alert alert-success"><b>Question 11:</b> How many unique genes are there in our previous query? How many times is each of them mentioned?
</div>

In [26]:
%%bash
cd ~/GWAS_catalog
cut -d' ' -f4 Hemoglobin_region.txt | sort | uniq -c

      1 AC104389.4,_HBE1,_HBG2
      6 HBB
      4 HBBP1,_HBD
      3 HBD
      2 HBE1,_AC104389.4,_HBG2
      2 HBE1,_HBG2,_AC104389.4
      2 HBG2,_AC104389.4,_HBE1
      2 HBG2,_HBE1,_AC104389.4


## Step 8 : Variables and loops

In bash, we can define variables to store temporary information:

```bash
city=Volos
```

To recall the value of the variable we have to prefix it with a dollar sign (`$`):
```bash
echo $city
```



### Variable operations

#### Concatenation
This is the easiest of all operations: in order to "glue" the value of 2 variables together, simply write them one after the other:

In [27]:
%%bash
a=1
b=2
echo $a$b

12


You can also mix variables with text of your own:

```bash
echo "Konstantinos lives in $city."
```

<div class="alert alert-danger"><b>Beware:</b> The character that follows your variable name is important. Bash treats letters, digits and underscores as part of the variable name, so you cannot write `$a_$b` (bash would look for a variable named `a_`), although you can write `$a-$b` or `$a.$b`. When a variable is followed by such a character, write `${a}_${b}`.
</div>
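
The role of the curly brackets can be sketched with two variables whose names overlap. The variable `city_port` is invented for this example, to show which name bash actually looks up.

```bash
city=Volos
city_port="a different variable"

# Without braces, bash reads the longest valid name, here city_port.
echo "$city_port"       # prints: a different variable

# With braces, the name stops where you say it does.
echo "${city}_port"     # prints: Volos_port
echo "${city}2019"      # prints: Volos2019
```

Writing `${name}` everywhere is harmless, so when in doubt, add the braces.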

#### Mathematical operations

* Addition `$(( a + b ))`, `$(( 1 + 2 ))`
* Other arithmetic operators `$(( a * b ))`, `$(( a / 10 ))`
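
A quick sketch of these operators, with arbitrary values for `a` and `b`. Note that bash arithmetic works on integers only, so division discards the remainder.

```bash
a=7
b=2

echo $(( a + b ))   # 9
echo $(( a * b ))   # 14
echo $(( a / b ))   # 3, not 3.5: the result is truncated to an integer
echo $(( a % b ))   # 1, the remainder of the division
```

For calculations with decimals, `awk` is one tool that can be used instead.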

#### For loops

For loops are used to iterate on a list of values:

```bash
for i in a b c; do
    echo $i
done
```


In [29]:
%%bash
for i in a b c; do
    echo $i
done

a
b
c


The value of a variable can contain spaces or other separators. When the variable is used unquoted, bash splits its value on whitespace and treats the pieces as a list, so you can iterate over them. For example:

In [33]:
%%bash
values="a b c d e f g h"
for v in $values; do
    echo $v
done

a
b
c
d
e
f
g
h


#### Special variables

There are many "special" variables hidden in bash. Some of them are set by the system, such as `$PATH`, which contains all the directories in which your system is going to look for commands (a funny way to temporarily break your shell session is to set `PATH` to something random).

Another useful built-in feature, although not strictly a variable, is brace expansion, which generates a range of values:

In [37]:
%%bash
echo {1..22}


1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
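
Ranges like this are handy in `for` loops. The sketch below assumes bash (other shells may not expand `{1..3}`), and the `chr` and `n` variable names are made up for the example.

```bash
# Loop over a fixed range.
for chr in {1..3}; do
    echo "chr${chr}"
done

# Braces are expanded before variables, so a range cannot use a variable directly.
n=3
echo {1..$n}            # prints {1..3} literally

# seq is one way around this.
for i in $(seq 1 "$n"); do
    echo "iteration $i"
done
```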


#### The `read` command

The `read` command is a very powerful tool that reads one line from its standard input (for example, from a pipe) into one or more variables. We use it a lot in combination with the `while` loop when we want to **execute the same command on every line of a file**.

In [39]:
%%bash
echo -e "a\nb\nc\nc\nd\ne\nf\ng" > test.txt
cat test.txt

a
b
c
c
d
e
f
g


In [40]:
%%bash
cat test.txt | while read letter; do
    echo $letter
done

a
b
c
c
d
e
f
g


`read` can also read several fields at the same time:

In [42]:
%%bash
paste test.txt test.txt

a	a
b	b
c	c
c	c
d	d
e	e
f	f
g	g


In [44]:
%%bash
paste test.txt test.txt| while read first second; do
    echo $first $second $second $first
done

a a a a
b b b b
c c c c
c c c c
d d d d
e e e e
f f f f
g g g g
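
When fields themselves contain spaces, as in the GWAS catalog, `read` can be told to split on tabs only. The following is a sketch under a few assumptions: it requires bash (for the `$'\t'` syntax), the file `authors.tsv` and its contents are made up, and `IFS` and the `-r` option are not covered in the text above (`IFS` sets the separator for this `read` only, and `-r` keeps backslashes as they are).

```bash
# Two tab-separated columns whose values contain spaces.
cd "$(mktemp -d)"
printf 'Lee HS\tPLoS One\nSingh S\tJ Am Heart Assoc\n' > authors.tsv

# Split each line on tabs only and write one sentence per line to a new file.
while IFS=$'\t' read -r author journal; do
    echo "$author published in $journal"
done < authors.tsv > sentences.txt

cat sentences.txt
```

Here the loop reads from a file with `<` instead of a pipe; both approaches feed the lines to `read` one at a time.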


The great use of this is that you can put whatever you want between `do` and `done` (many many commands, potentially). If you just have one quick command that you want to run, you can use `xargs` instead.

In [41]:
%%bash
cat test.txt | xargs echo

a b c c d e f g


<div class="alert alert-success"><b>Question 12:</b> Extract the chromosome and position columns for the first 10 reported associations from the `GWAS_catalog.txt` file.
</div>

<div class="alert alert-success"><b>Question 13:</b> For every chromosome and position pair (e.g. `1   1234`), build an interval extended by 1000 on either side (e.g. `1:234-2234`).
</div>