# Advanced Bash tutorial

**Topics covered in this tutorial**:

* Text processing tools: grep, sed, cut, awk, join, tr and paste.
* Simple variable handling.
* Loops.
* Conditionals.
* Pipes, backticks and brackets.
* Functions (?).


## Downloading resources from the internet

We are downloading the complete [GWAS catalog](https://www.ebi.ac.uk/gwas/), maintained by [EBI](https://www.ebi.ac.uk/), which contains all genome-wide associations published to date. As of 2017.05.12, the collection contains 29,196 SNPs and 33,898 associations published in 2,882 research articles. You can perform simple queries on the database using the GWAS catalog website, but you can also download it to build your own database or to perform systematic searches on a larger scale. You can find the download link in the top menu bar.

To download the file from the internet we use wget:

In [1]:
%%bash

# After the URL to the dataset, we specify the output file name.
wget https://www.ebi.ac.uk/gwas/api/search/downloads/full -O GWAS_catalog_2017.05.12

bash: line 3: wget: command not found


`wget` is an extremely powerful command-line tool with a wide range of features (such as resuming interrupted downloads) that we can't cover here. As mentioned in the previous workshop, you can read the comprehensive manual using `man wget`. This also applies to the rest of the tools we are covering in this session.
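
The cell output above shows that `wget` is not installed on every system. As a minimal sketch, assuming `curl` may be available as an alternative, the download step could be wrapped in a small function. The function name `download` is hypothetical and not part of the original tutorial.

```bash
# Hypothetical helper: download URL ($1) to FILE ($2) with whichever tool is installed.
download() {
    local url=$1 out=$2
    if command -v wget >/dev/null 2>&1; then
        # -O sets the output file name, as in the cell above.
        wget "$url" -O "$out"
    elif command -v curl >/dev/null 2>&1; then
        # -L follows redirects, -o sets the output file name.
        curl -L "$url" -o "$out"
    else
        echo "Neither wget nor curl is installed" >&2
        return 1
    fi
}

# Usage (commented out so the sketch can be loaded without network access):
# download https://www.ebi.ac.uk/gwas/api/search/downloads/full GWAS_catalog_2017.05.12
```

`command -v` only checks whether a program can be found, so the function fails with a clear message instead of "command not found" when neither tool is present.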

## Text processing bash tools

**This section introduces the following tools:** wc, head, cat, tail, cut, join, paste, grep

Taking a look at the content of the file we have just downloaded (some of these tools have already been mentioned in the previous workshop):

In [6]:
%%bash

# How many lines does the file have?
wc -l GWAS_catalog_2017.05.12

   38038 GWAS_catalog_2017.05.12


In [7]:
%%bash

# Display the first n rows of the file (here n = 3):
echo "First 3 rows:"
head -n3 GWAS_catalog_2017.05.12

First 3 rows:
DATE ADDED TO CATALOG	PUBMEDID	FIRST AUTHOR	DATE	JOURNAL	LINK	STUDY	DISEASE/TRAIT	INITIAL SAMPLE SIZE	REPLICATION SAMPLE SIZE	REGION	CHR_ID	CHR_POS	REPORTED GENE(S)	MAPPED_GENE	UPSTREAM_GENE_ID	DOWNSTREAM_GENE_ID	SNP_GENE_IDS	UPSTREAM_GENE_DISTANCE	DOWNSTREAM_GENE_DISTANCE	STRONGEST SNP-RISK ALLELE	SNPS	MERGED	SNP_ID_CURRENT	CONTEXT	INTERGENIC	RISK ALLELE FREQUENCY	P-VALUE	PVALUE_MLOG	P-VALUE (TEXT)	OR or BETA	95% CI (TEXT)	PLATFORM [SNPS PASSING QC]	CNV
2009-09-28	18403759	Ober C	2008-04-09	N Engl J Med	www.ncbi.nlm.nih.gov/pubmed/18403759	Effect of variation in CHI3L1 on serum YKL-40 level, risk of asthma, and lung function.	YKL-40 levels	632 Hutterite individuals	443 European ancestry cases, 491 European ancestry controls, 206 European ancestry individuals	1q32.1	1	203186754	CHI3L1	CHI3L1			1116			rs4950928-G	rs4950928	0	4950928	upstream_gene_variant	0	0.29	1E-13	13.0		0.3	[NR] ng/ml decrease	Affymetrix [290325]	N
2008-06-16	18369459	Liu Y	2008-04-04	PLoS Genet	www.

We can also list the last 10 lines of the file using `tail`:
```bash
tail -n10 GWAS_catalog_2017.05.12
```

To display the whole content of a file we can use `cat`:
```bash
cat GWAS_catalog_2017.05.12
```

From the first few rows we can see how the file is structured:
* the first line is the header describing the content of each column (a more comprehensive explanation can be found on the [GWAS catalog website](https://www.ebi.ac.uk/gwas/docs/fileheaders))
* then each line is an individual association between a genetic variant and an observed phenotype.

Most of the columns are not relevant for us now, so let's extract the following columns: **PUBMEDID** (2nd column) and **DISEASE/TRAIT** (8th column).

In [None]:
%%bash

echo "List of the columns in the file:"
head -n1 GWAS_catalog_2017.05.12 | tr '\t' '\n'
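
The column extraction described above can be sketched with `cut`. To keep the example runnable without the downloaded catalog, it builds a small stand-in file from the header and the first data row shown earlier, trimmed to eight columns with shortened placeholder values for the link and study title. The variable names `tmp` and `selected` are illustrative only.

```bash
# Stand-in for the catalog: a trimmed header plus one trimmed row, tab separated.
tmp=$(mktemp)
printf 'DATE ADDED TO CATALOG\tPUBMEDID\tFIRST AUTHOR\tDATE\tJOURNAL\tLINK\tSTUDY\tDISEASE/TRAIT\n' > "$tmp"
printf '2009-09-28\t18403759\tOber C\t2008-04-09\tN Engl J Med\tlink\tstudy title\tYKL-40 levels\n' >> "$tmp"

# cut splits each line on tabs by default; -f selects fields by their number.
selected=$(cut -f2,8 "$tmp")
echo "$selected"

rm -f "$tmp"
```

On the real file the same call would be `cut -f2,8 GWAS_catalog_2017.05.12`, assuming the column order matches the header shown above.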

### grep

`grep` filters the lines of a file, keeping only those that match a given pattern.

In [None]:
%%bash

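# Extracting associations that mention height (case-sensitive):
grep height GWAS_catalog_2017.05.12
# The same search, ignoring the difference between upper and lower case:
grep -i height GWAS_catalog_2017.05.12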


# extracting lines that are associated with 
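
To make the effect of the `grep` options easier to see, here is a minimal sketch on a toy file. The four trait names are made up for illustration and are not taken from the catalog; `-c` counts matching lines and `-v` inverts the match.

```bash
# Toy list of traits, one per line (invented for this example).
tmp=$(mktemp)
printf '%s\n' 'Height' 'Body mass index' 'height in children' 'Asthma' > "$tmp"

# Case-sensitive search: only the lowercase "height" matches.
exact=$(grep -c height "$tmp")

# -i ignores case, so "Height" matches as well.
nocase=$(grep -ci height "$tmp")

# -v keeps the lines that do NOT match.
others=$(grep -vi height "$tmp")

echo "case-sensitive matches:   ${exact}"
echo "case-insensitive matches: ${nocase}"
echo "lines without height:"
echo "$others"

rm -f "$tmp"
```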

## Variables

A variable temporarily stores a piece of information that we can refer to later by its name. There are two basic actions:
1. Setting the value (assignment).
2. Reading the value (the shell substitutes the name of the variable with its value).

In [10]:
%%bash

# Setting the value of a variable (no spaces around the = sign):
chromosome=12

# Reading the value of the variable:
echo 1. $chromosome
echo 2. ${chromosome}
echo 3. "${chromosome}"

# When curly braces are needed:
ls gene_list_chr$chromosome_lst # Error! The shell looks for a variable named chromosome_lst, which is not set, so it expands to nothing.
ls gene_list_chr${chromosome}_lst

1. 12
2. 12
3. 12
gene_list_chr12_lst


ls: gene_list_chr: No such file or directory
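
The difference between the two `ls` calls above can be made visible without touching any files. The following sketch stores both expansions in variables and prints them; the names `without_braces` and `with_braces` are illustrative only.

```bash
chromosome=12
unset chromosome_lst   # make sure the accidentally referenced name is really unset

# Without braces the shell reads the variable name chromosome_lst, which expands to nothing.
without_braces="gene_list_chr$chromosome_lst"

# With braces the name ends at the closing brace, and _lst is ordinary text.
with_braces="gene_list_chr${chromosome}_lst"

echo "without braces: '${without_braces}'"
echo "with braces:    '${with_braces}'"
```

This prints `gene_list_chr` for the first form and `gene_list_chr12_lst` for the second, which matches the error message and the file name in the output above.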


Often a variable has to be assigned automatically, with its value derived from the output of another process. Imagine a list of files
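
As a minimal sketch of assigning a variable from the output of another command, the example below uses command substitution on a temporary directory. The file names follow the `gene_list_chr..._lst` pattern used earlier and are created only for this illustration; `workdir`, `n_files` and `first_file` are hypothetical names.

```bash
# Create a scratch directory with three empty example files.
workdir=$(mktemp -d)
touch "$workdir/gene_list_chr1_lst" "$workdir/gene_list_chr2_lst" "$workdir/gene_list_chr3_lst"

# $(...) runs the command and substitutes its output into the assignment.
# tr -d ' ' strips the padding that some wc implementations add.
n_files=$(ls "$workdir" | wc -l | tr -d ' ')

# Backticks do the same thing, but are harder to read and to nest.
first_file=`ls "$workdir" | head -n1`

echo "Found ${n_files} files, the first one is ${first_file}"

rm -r "$workdir"
```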


## Loops

Often we have to repeat a set of steps multiple times. In such cases loops are constructed to keep the code short and clean. In bash we mainly use the `for` and the `while` loops. 
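
As a minimal sketch of the `while` form mentioned above, the loop below reads a file line by line. The input is a toy list written to a temporary file, and the names `tmp`, `name` and `count` are illustrative only.

```bash
# Toy input: one chromosome name per line.
tmp=$(mktemp)
printf '%s\n' chr1 chr2 chrX > "$tmp"

count=0
# read -r stores the next line in $name; the loop stops when the input is exhausted.
while read -r name; do
    echo "Processing ${name}"
    count=$((count + 1))
done < "$tmp"

echo "Processed ${count} lines"
rm -f "$tmp"
```

Feeding the file in with `< "$tmp"` rather than through a pipe keeps the loop in the current shell, so `count` still holds its final value after `done`.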

In [17]:
%%bash

# Repeating a process for all autosomes:
for chr in {1..22}; do
    echo Processing ${chr}
    # Some commands... 
done

Processing 1
Processing 2
Processing 3
Processing 4
Processing 5
Processing 6
Processing 7
Processing 8
Processing 9
Processing 10
Processing 11
Processing 12
Processing 13
Processing 14
Processing 15
Processing 16
Processing 17
Processing 18
Processing 19
Processing 20
Processing 21
Processing 22


In the above example, the curly braces generate the series of numbers from 1 to 22. Then, in each iteration, the next value is assigned to the variable named `chr` in 