# Bedtools

Bedtools is a very useful toolkit for comparing regions defined by genomic coordinates. It has many commands for region-based comparisons of files in BAM, VCF, GFF and BED format.

To see the list of commands available, on the command line type:

In [None]:
bedtools

Navigate to the `exercise5` directory.

In [None]:
cd ../exercise5

In [None]:
ls

In this directory, there are two VCF files and the yeast genome annotation in GFF3 format `Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3`.

## bedtools intersect
 
Given two sets of genomic features, the `bedtools intersect` command can be used to determine whether any of the features in the two sets “overlap” with one another. For the intersect command, the `-a` and `-b` parameters denote the input files A and B.

![bedtools intersect](images/bedtools.png "bedtools intersect")

(Credit to Aaron Quinlan for original source of figure: http://quinlanlab.org/tutorials/bedtools/bedtools.html)
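To make the idea of an "overlap" concrete, here is a minimal sketch of the arithmetic for a single pair of features. It uses plain shell arithmetic rather than bedtools itself, and the coordinates are made-up values in BED style (0-based start, end not included), not taken from the exercise files.

```shell
# Hypothetical toy intervals (BED-style: 0-based start, end excluded).
a_start=100; a_end=200    # a feature from file A
b_start=180; b_end=250    # a feature from file B

# The shared region runs from the larger start to the smaller end.
lo=$(( a_start > b_start ? a_start : b_start ))
hi=$(( a_end < b_end ? a_end : b_end ))

# A positive length means the two features overlap.
overlap_bp=$(( hi > lo ? hi - lo : 0 ))
echo "overlap: ${overlap_bp} bp"
```

With these values the two features share 20 bases, so they would count as overlapping under the default rule of at least one base pair.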

For example, to find the overlap between the SVs in `ERR1015069.dels.vcf` and the annotated regions of the yeast genome, try:

In [None]:
bedtools intersect -a ERR1015069.dels.vcf -b Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3

This command reports a variant from the file `ERR1015069.dels.vcf` every time it overlaps with a feature in `Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3`. A variant that overlaps more than one feature is therefore reported more than once. To report each overlapping variant only once, add the `-u` option:

In [None]:
bedtools intersect -u -a ERR1015069.dels.vcf -b Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3

The default is to report overlaps between features in A and B so long as at least one base pair of overlap exists. However, the `-f` option allows you to specify what fraction of each feature in A should be overlapped by a feature in B before it is reported.

To make the intersection stricter and require that at least 25% of each SV is overlapped by a gene, use the command:

In [None]:
bedtools intersect -u -f 0.25 -a ERR1015069.dels.vcf -b Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3
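The effect of a `-f` threshold can be checked by hand for a single pair of features. The sketch below is worked arithmetic in shell and `awk`, not a call to bedtools; the intervals are made-up BED-style values and the variable names are only illustrative.

```shell
# Hypothetical toy intervals (BED-style: 0-based start, end excluded).
a_start=100; a_end=200    # feature in A, 100 bp long
b_start=180; b_end=250    # feature in B
min_fraction=0.25         # the threshold that would be passed to -f

a_len=$(( a_end - a_start ))
lo=$(( a_start > b_start ? a_start : b_start ))
hi=$(( a_end < b_end ? a_end : b_end ))
overlap_bp=$(( hi > lo ? hi - lo : 0 ))

# Fraction of A covered by B (awk is used for the floating-point division).
fraction=$(awk -v o="$overlap_bp" -v n="$a_len" 'BEGIN { printf "%.2f", o / n }')

# Compare the fraction with the threshold.
if awk -v f="$fraction" -v m="$min_fraction" 'BEGIN { exit !(f >= m) }'; then
    verdict="reported"
else
    verdict="not reported"
fi
echo "fraction of A overlapped: $fraction ($verdict at -f $min_fraction)"
```

Here only 20 of the 100 bases of A are covered, a fraction of 0.20, so this pair would pass the default one-base rule but not a 25% requirement.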

The `bedtools intersect` command can also be used to determine how many SVs overlap between two VCF files. For more information about `bedtools intersect` see the help:

In [None]:
bedtools intersect -h
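Several of the exercises ask "how many", so it helps to know how to count records. The following is a small sketch that counts the variant records in a made-up VCF written to a temporary file; the records are hypothetical and only stand in for a real file such as `ERR1015069.dels.vcf`.

```shell
# Build a tiny, made-up VCF in a temporary file (hypothetical records).
toy_vcf=$(mktemp)
{
    printf '##fileformat=VCFv4.2\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
    printf 'I\t1000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=1500\n'
    printf 'II\t2000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=2600\n'
    printf 'IV\t3000\t.\tN\t<DEL>\t.\tPASS\tSVTYPE=DEL;END=3200\n'
} > "$toy_vcf"

# Count records: every line that does not start with '#' is a variant.
total=$(grep -vc '^#' "$toy_vcf")
echo "records: $total"

rm -f "$toy_vcf"
```

The output of a `bedtools intersect` command can be counted in a similar way by piping it into `wc -l`, since header lines are not printed unless the `-header` option is given. Comparing the counts of overlapping and non-overlapping variants with the total is a quick sanity check.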

### Exercises
1. How many SVs found in `ERR1015069.dels.vcf` overlap with a gene? (**Hint:** use the `bedtools intersect` command.)

2. How many SVs found in `ERR1015069.dels.vcf` do not overlap with a gene? (**Hint:** note the `-v` parameter to `bedtools intersect`.)

3. How many SVs found in `ERR1015069.dels.vcf` overlap with a gene under a stricter definition that requires 50% overlap?

4. How many features does the deletion at VII:811446 overlap with? What type of genes? Note that you will also need to use the `-wb` option of `bedtools intersect`.

5. How many features does the deletion at XII:650823 overlap with? What type of genes? Note that you will also need to use the `-wb` option of `bedtools intersect`.

## bedtools closest

Similar to `intersect`, `bedtools closest` searches for overlapping features in A and B. If no feature in B overlaps the current feature in A, `closest` reports the nearest feature in B, that is, the one with the smallest genomic distance from the start or end of A.

An example of the usage of `bedtools closest` is:

In [None]:
bedtools closest -d -a ERR1015069.dels.vcf -b Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3

This command will list all the features in the file `Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3` that are closest to each of the variants in `ERR1015069.dels.vcf`.

The `-d` option means that in addition to the closest feature in `Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3`, the distance to the variant in `ERR1015069.dels.vcf` will be reported as an extra column. The reported distance for any overlapping features will be 0.
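As a rough illustration of what "distance" means here, the sketch below computes the gap between two non-overlapping features with shell arithmetic. The coordinates are made up (BED-style, 0-based start, end excluded), and the number bedtools itself prints may differ by one base from this simple gap because of its coordinate conventions, so treat the value as illustrative only.

```shell
# Hypothetical toy intervals (BED-style: 0-based start, end excluded).
a_start=100; a_end=200     # a variant from file A
b_start=500; b_end=1000    # the nearest feature in file B

# If the larger start lies beyond the smaller end, the features do not
# overlap and the difference is the gap between them; otherwise it is 0.
lo=$(( a_start > b_start ? a_start : b_start ))
hi=$(( a_end < b_end ? a_end : b_end ))
gap_bp=$(( lo > hi ? lo - hi : 0 ))
echo "gap between features: ${gap_bp} bp"
```

For overlapping features the same calculation gives 0, which matches the behaviour described above for `-d`.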

For example, to find the closest gene to the variant found at position 43018 on chromosome XV, try:

In [None]:
bedtools closest -d -a ERR1015069.dels.vcf -b Saccharomyces_cerevisiae.R64-1-1.82.genes.gff3 | grep -w XV | grep -w 43018

For more information about `bedtools closest` see the help:

In [None]:
bedtools closest -h

### Exercises

6. What is the closest gene to the structural variant at IV:384220 in `ERR1015069.dels.vcf`?

7. How many SVs overlap between the two files `ERR1015069.dels.vcf` and `ERR1015121.dels.vcf`?

8. How many SVs have a 50% reciprocal overlap between the two files `ERR1015069.dels.vcf` and `ERR1015121.dels.vcf`? (**Hint:** first find the option for reciprocal overlap by typing `bedtools intersect -h`.)

Congratulations, you have reached the end of the tutorial.