Welcome to the Introduction to Data Formats in Computationaly Biology exercises page!
These materials are part of the COST Project Epichembio Bioinformatics workshop: Introduction to NGS data analysis held in Badalona the 13th to 15th of March 2019.
- EU Horizon 2020 COST Project Epichembio
- IJC Carreras Foundation
- Organizers: Sarah, Marguerite-Marie, David, Roberto
- SIB Swiss Institute of Bioinformatics and University of Zurich
The course slides cover first why do we need scripted workflows with data stored using standards, in context of the best practices for scientific reproducibility; and after that discuss sequence formats (fasta, fastq), coordinate-based formats (BED, SAM, GTF, VCF..) and finally presents some tricks in efficience and speed-up (Wiggle files, indexed binary formats).
Slides are linked to exercises (see below) that largely differ in complexity.
Computational biology needs high standards to make results reproducible, track work, organize projects and collaborate with colleagues.
To accomplish this we rely on sharing data and results using data standards (lingua franca) dealing with them using powerful and open tools (UNIX).
Recommended reads:
- Wilson et al Good enough practices in scientific computing
- Brian Excuse me, do you have a moment to talk about version control?
- Data carpentry lessons
These exercises are meant to be run under a commodity computer (i.e. old laptop) running GNU/Linux or MacOS. To refresh your UNIX skills please run the tutorial at SIB Course on UNIX.
As a backup, the repository data folder has a backup of all datasets used, and the script get_data.sh
to retrieve them.
Exercises complexity goes from low to challenging. Namely, from counting the number of lines of a fasta file or using the modulus operator to do so in FASTQ to dissecting chromatin compartmentalization trends in 1000 genomes project transposon variants. Please note gluttony is encouraged but exercises listing exceeds the workshop's workload.
Cheatsheets:
For extra/advanced reading, please check the following tutorials on bash and awk:
Setting up a working directory with folders
cd ~ # goes to the home directory
mkdir -p course
ls -l course
mkdir -p course/data course/soft # multiple directories can be created at once
ls -l course
Downloading text files and checking their details.
cd ~/course/data
wget http://gattaca.imppc.org/groups/maplab/imallona/teaching/example.bed
wget http://gattaca.imppc.org/groups/maplab/imallona/teaching/hg19.genome
file example.bed # which kind of file is it, is it text?
ls -lah example.bed # how big?
head example.bed # how do the first lines look like?
file hg19.genome
ls -lah hg19.genome
head hg19.genome
Retrieving source code and using a Makefile to compile exactly the software version required is a standard task in computational biology. Importantly, this can be scripted to make sure of dependencies/version. In this example we'll install bedtools (if not yet installed, which can be tested typing bedtools --help
in a terminal).
cd
mkdir -p course/soft
cd course/soft
wget https://github.com/arq5x/bedtools2/releases/download/v2.25.0/bedtools-2.25.0.tar.gz
tar zxvf bedtools-2.25.0.tar.gz
cd bedtools2
make # will take time! you could read about Makefiles during compiling time
alias bedtools='~/course/soft/bedtools2/bin/bedtools'
bedtools --help
Find the number of lines of the file ~/course/data/example.bed
Answer
Note the -l
flag
wc -l ~/course/data/example.bed
Get the manual page of head
, what is this command for?
Answer
man head
Get the first 5 lines of the file ~/course/data/example.bed
Answer
head -5 ~/course/data/example.bed
Using pipes (|
): chain the head -5
command before with wc -l
to make sure the number of lines was as expected (5).
Seems a stupid thing to do? Actually it's a good idea to check some results during routine computational biology runs, i.e. whether the number of genes checked remains the same across samples and similar things. Negative and positive controls are easily transferable from wetlab to drylab read more about unit testing in Wikipedia.
Answer
head -5 ~/course/data/example.bed | wc -l
DISCLAIMER: data is based upon the beautiful Genome Analysis Workshop (MOLB 7621) Course on genome analysis at the University of Colorado SOM at https://github.com/MOLB7621
Retrieve the data for these exercises (mind the path you're working at, should be ~/course/data
).
The path for URL for the data is https://molb7621.github.io/workshop/_downloads/SP1.fq
Answer
cd ~/course/data
wget https://molb7621.github.io/workshop/_downloads/SP1.fq
(Source: MOLB7621)
Inspect the file from the previous exercise (file size, number of lines, and visualize its head)
Answer
cd ~/course/data
ls -lh SP1.fq
wc -l SP1.fq
head SP1.fq
(Source: MOLB7621)
The FASTQ file stored at ~/course/data/SP1.fq
is in FASTQ format, meaning it contains 4 lines per read sequence.
head -n 4 SP1.fq
produces
@cluster_2:UMI_ATTCCG
TTTCCGGGGCACATAATCTTCAGCCGGGCGC
+
9C;=;=<9@4868>9:67AA<9>65<=>591
- So Line 1 contains a sequence identifier that begins with an
@
- Line 2 contains the read sequence (A,T,C,G,N)
- Line 3 is a comment line, often unused and only contains a
+
- Line 4 is the Phred quality score for each sequence encoded in ASCII format
First use wc
and awk
to determine the number of sequences in the fastq
Answer
cd ~/course/data/
wc -l SP1.fq # total number of lines
wc -l SP1.fq | awk '{print $1 / 4}' # total number of fastq records
(Source: MOLB7621)
A common mistake is to use grep
to pattern match the @
in the
sequence identifier. Why doesn't this work?
wc -l SP1.fq | awk '{print $1 / 4}'
renders 250
grep -c "@" SP1.fq
renders 459
.
Answer
`@` is a valid quality score, so lines of phred scores will be counted as well when using grep
(Source: MOLB7621)
Next, extract out the read sequences from the fastq. This is a bit complicated as we need to only pull out the second line from each record.
Answer
One approach to this problem is to use the %
modulo operator
(Wikipedia), which returns
the remainder after division of two integers. For example using awk
awk 'BEGIN { {print 4 % 2}}'
awk 'BEGIN { {print 4 % 3}}'
awk 'BEGIN { {print 5 % 3}}'
awk 'BEGIN { {print 1 % 4}}'
In awk
there is a special variable NR
which is equal to the
current line number.
So let's extract the sequences of the fasta file SP1.fq
awk 'NR%4==2' ~/course/data/SP1.fq
Print the line numbers of the file SP1.fq
Answer
awk '{print NR}' ~/course/data/SP1.fq
We can also prepend the line number to the FASTQ file (please note we asume SP1.fq
is on the working directory)
Tip: awk '{print $0 }'
filename prints the whole line, to which you can add the NR
.
awk '{print NR, $0 }' SP1.fq | head # note output in first column
(Source: MOLB7621)
We can use the modulo operator with NR
to only capture specific
records from the fastq
awk 'NR % 4 == 1' SP1.fq | head # get header line
awk 'NR % 4 == 2' SP1.fq | head # get sequence line
awk 'NR % 4 == 3' SP1.fq | head # get comment line
awk 'NR % 4 == 0' SP1.fq | head # get quality line
(Source: MOLB7621)
FASTA format is a more compact sequence format than FASTQ
>sequence_identifier_1
ATCGTCGATGCTAGTCGA
>sequence_identifier_2
AGCTAGCTAGCTAGC
Convert fastq to fasta (tip: use lines 1 and 2)
Answer
awk 'NR % 4 == 1 {print ">"$1};
NR % 4 == 2 {print}' SP1.fq \
| head
Save the SP1 sequences as a file in fasta format (named ~/course/data/example.fa
)
Answer
awk 'NR % 4 == 1 {print ">"$1};
NR % 4 == 2 {print}' SP1.fq \
> ~/course/data/example.fa
or, equivalently (data is one-column),
awk 'NR % 4 == 1 {print ">"$0};
NR % 4 == 2 {print}' SP1.fq \
> ~/course/data/example.fa
Make sure the number of sequences is the same in both fastq and fasta files
Answer
awk 'NR % 4 == 1' SP1.fq | wc -l
awk 'NR % 2 == 1' example.fa | wc -l
Download the compressed SAM data from https://github.com/samtools/samtools/raw/develop/examples/ex1.sam.gz
, uncompress (.gz
files need to be uncompressed with gunzip
) and inspect it.
Answer
cd ~/course/data
wget https://github.com/samtools/samtools/raw/develop/examples/ex1.sam
wget http://genomedata.org/rnaseq-tutorial/annotations/GRCh38/chr22_with_ERCC92.gtf
head chr22_with_ERCC92.gtf
wc -l chr22_with_ERCC92.gtf
Transform the file ~/course/soft/bedtools2/test/intersect/a.bed
(BED6) to BED3 (tip: use awk to extract the first three columns).
Tip: OFS='\t'
in awk specifies the output field separator: a tab (\t
).
Answer
## to go to the a.bed file directory
cd ~/course/soft/bedtools2/test/intersect/
awk -v OFS='\t' '{print $1,$2,$3}' a.bed
cd - ## to go back to the previous directory
Transform the BED3 file ~/course/soft/bedtools2/test/intersect/recordsOutOfOrder.bed
to BED6 (unspecified strand, 0 score)
Tip: check the BED6 file specification
Answer
## to go to the a.bed file directory
cd ~/course/soft/bedtools2/test/intersect/
awk -v OFS='\t' '{print $1,$2,$3,".",0,"."}' \
recordsOutOfOrder.bed
cd - ## to go back to the previous directory
Add a nucleotide to the start and subtract a nucleotide to the end to all records, regardless of the strand, to the file ~/course/soft/bedtools2/test/intersect/a.bed
.
Tip: use awk
.
Answer
## to go to the a.bed file directory
cd ~/course/soft/bedtools2/test/intersect/
awk -v OFS='\t' '{print $1,$2-1,$3+1,$4,$5,$6}' a.bed
cd - ## to go back to the previous directory
Use bedtools intersect to find overlapping genomic features. The BEDtools paper is also very helpful to understand the genomic arithmetics analysis (without sequences, using coordinates).
For this and the next exercises we use some example datasets from the bedtools sourcecode we compiled in one of the first activities. For instance, you could check them i.e. at ~/course/soft/bedtools2/test/intersect/
.
Check the ~/course/soft/bedtools2/test/intersect/a.bed
and ~/course/soft/bedtools2/test/intersect/b.bed
files content. Are they BED3, BED6 or BED12 files?
Answer
cat ~/course/soft/bedtools2/test/intersect/a.bed
cat ~/course/soft/bedtools2/test/intersect/b.bed
What will happen if you intersect those files? For example, the a.bed
region chr1:100-200 overlaps with b.bed
alias bedtools='~/course/soft/bedtools2/bin/bedtools'
bedtools intersect \
-a ~/course/soft/bedtools2/test/intersect/a.bed \
-b ~/course/soft/bedtools2/test/intersect/b.bed
Which is the format of the output? How are results interpreted?
Answer
The output is a direct intersect:
chr1 100 101 a2 2 -
chr1 100 110 a2 2 -
Starting from the original interval from a.bed:
chr1 100 200 a2 2 -
And the overlapping intervals from b.bed:
chr1 90 101 b2 2 -
chr1 100 110 b3 3 +
What does it happen if running bedtools intersect
with the a.bed
file as -b
and the b.bed
file as -a
?
Answer
bedtools intersect \
-b ~/course/soft/bedtools2/test/intersect/a.bed \
-a ~/course/soft/bedtools2/test/intersect/b.bed
Tip: check the strands
Run bedtools intersect with the a.bed
file as -a
, the b.bed
file as -b
but forcing strandness, i.e. reporting hits in B that overlap A on the same strand
Answer
bedtools intersect \
-s \
-a ~/course/soft/bedtools2/test/intersect/a.bed \
-b ~/course/soft/bedtools2/test/intersect/b.bed
Use the -v
to report those intervals in a.bed
that do not overlap any of the intervals in b.bed
.
Answer
bedtools intersect \
-v \
-a ~/course/soft/bedtools2/test/intersect/a.bed \
-b ~/course/soft/bedtools2/test/intersect/b.bed
You can explore what happens if inverting the -a and -b flags; for this the -wa and -wb flags might be helpful
bedtools intersect \
-v \
-wa -wb \
-b ~/course/soft/bedtools2/test/intersect/a.bed \
-a ~/course/soft/bedtools2/test/intersect/b.bed
Use the -wao
flag to report the amounts of overlap for all features when comparing a.bed
and b.bed
, including those that do not overlap. How are non overlaps encoded? Which strand are they on?
Answer
bedtools intersect \
-wao \
-a ~/course/soft/bedtools2/test/intersect/a.bed \
-b ~/course/soft/bedtools2/test/intersect/b.bed
Retrieve the details of transcript ENST00000342247
(tip: use grep) from the
chr22_with_ERCC92.gtf
file. Then, retrieve the details of the exon
s of transcript ENST00000342247
(tip: use grep after the grep). How many exons are they?
Answer
cd ~/course/data
grep ENST00000342247 chr22_with_ERCC92.gtf | grep "exon\s"
grep ENST00000342247 chr22_with_ERCC92.gtf | grep "exon\s" | wc -l
How many start codons and stop codons does the chr22 have? Tip: use the gtf and grep for start_codon
and stop_codon
.
Answer
cd ~/course/data
grep start_codon chr22_with_ERCC92.gtf | wc -l
grep stop_codon chr22_with_ERCC92.gtf | wc -l
Install an old version of VCFtools (disclaimer, this is old! this version is ok for teaching purposes, but please consider installing an up-to-date verson for handling your data in the future) you can download from https://sourceforge.net/projects/vcftools/files/vcftools_0.1.13.tar.gz/download
using make
. The path you can choose, but you can use for instance ~/course/soft/
.
Answer
cd ~/course/soft/
wget https://sourceforge.net/projects/vcftools/files/vcftools_0.1.13.tar.gz/download
tar xzvf vcftools.tar.gz
cd vcftools_0.1.13/
make
Check which is the current version of VCFtools? How should installation be carried out? (tip: Git repositories release tagged versions).
Answer
Git repository at https://github.com/vcftools/vcftools
There is a tag named v0.1.16 (as for the 26th Sept 2018) https://github.com/vcftools/vcftools/releases
VCFtools has some data for testing purposes; find all the VCF files (filenames similar to *vcf*
) that you downloaded during installation and inspect them using head
.
Answer
find ~/course/soft/vcftools_0.1.13 -name "*vcf" -type f
Alias the vcftool binary (full path) to vcftools
and then run it with no parameters
Answer
alias vcftools='~/course/soft/vcftools_0.1.13/bin/vcftools'
vcftools
Should render
VCFtools (v0.1.13)
© Adam Auton and Anthony Marcketta 2009
Process Variant Call Format files
For a list of options, please go to:
https://vcftools.github.io/examples.html
Questions, comments, and suggestions should be emailed to:
vcftools-help@lists.sourceforge.net
How many variants are kept after filtering at ~/course/soft/vcftools_0.1.13/examples/merge-test-a.vcf
? Tip: use the --vcf
flag to vcftools
. What does this result mean?
Answer
vcftools --vcf ~/course/soft/vcftools_0.1.13/examples/merge-test-a.vcf
Explore human variation data for mobile elements insertion using the 1000genomes data. More precisely, download the pilot pilot data named CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf.gz
(the full path to the remote FTP server is ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/low_coverage/sv/CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf.gz
).
Then, explore the number of variants (tip: use vcftools
), the file contents etc.
Tip: remember where the bedtools
and vcftools
binaries are and/or use an alias to them.
For further details on mobile elements insertion as a source of human variation please check the paper Stewart et al 2011.
Answer
cd ~/course/data
curl -L ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/paper_data_sets/a_map_of_human_variation/low_coverage/sv/CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf.gz > CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf.gz
gunzip CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf.gz
wc -l CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf
vcftools --vcf CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf
As a first exercise, how many of these intertions overlap human exons? Do you expect it to be a low or high proportion?
To do so download an exons bedfile from https://s3.amazonaws.com/bedtools-tutorials/web/exons.bed
and then use bedtools intersect. Tip: it won't work because of the chromosome numbering (tip: check how chromosome numbers are encoded in both exons.bed and the vcf file, i.e. whether starting with a chr
or not, and harmonize them using awk
).
Answer
cd ~/course/data
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/exons.bed
bedtools intersect -a CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf \
-b exons.bed
## this does not work as intended! or, rather, gives some warnings.
## Why? The chr strings at the coordinates are present in one bedfile but not in the other
tail CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf
tail exons.bed
## removing the chr string from the exons bedfile
sed 's/^chr//g' exons.bed > exons_nochr.bed
head exons_nochr.bed
## this should work (chr naming conventions are equivalent)
bedtools intersect -a CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf \
-b exons_nochr.bed
Are these few or lots? What would you expect?
bedtools intersect -a CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf -b exons_nochr.bed | wc -l
Continuing with the mobile elements insertions, where do they insert preferentially? In active or inactive regions? To briefly explore this get the chromatin states with the highest number of insertions using the CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf
and the Ernst's genome segmentation (more info here) as downloaded from https://s3.amazonaws.com/bedtools-tutorials/web/hesc.chromHmm.bed
. Tip: then select the fourth column of the ChromHMM bedfile, sort it and count with uniq -c
.
For further details on what is a ChromHMM segmentation (i.e. assigning functions to genomic ranges) please read Nature Methods volume 9, pages 215–216 (2012).
Answer
cd ~/course/data
curl -O https://s3.amazonaws.com/bedtools-tutorials/web/hesc.chromHmm.bed
# first removing the chrstrings again
sed 's/^chr//g' hesc.chromHmm.bed > hesc.chromHmm_nochr.bed
bedtools intersect\
-b CEU.low_coverage.2010_10.MobileElementInsertions.sites.vcf \
-a hesc.chromHmm_nochr.bed | awk '{print $4}' | sort | uniq -c
So apparently there is plenty of heterochromatin/low CNV regions, but is there an overrepresentation? How many heterochromatin/low CNV regions are there in the whole genome?
awk '{print $4}' hesc.chromHmm_nochr.bed | sort | uniq -c
Play with other real genomic data using the BEDtools tutorial which explores the Maurano et al paper Systematic Localization of Common Disease-Associated Variation in Regulatory DNA published in Science, 2012.
Mind that the tutorial recommends creating a folder with mkdir -p ~/workspace/monday/bedtools
: if you do so and move (mv
) there, your path (the one you can get using pwd
) won't be at the standard ~/course
we used till now.
Remember that the bedtools binary can be aliased using alias bedtools='~/course/soft/bedtools2/bin/bedtools'
.
Some of the puzzles include:
- Count the number of exons and CpG islands
Tip
Count the number of lines using `wc -l exons.bed` etc.
- How many CpG islands overlap exons in the genome?
Tip
Intersect and count
bedtools intersect -a cpg.bed -b exons.bed | wc -l
- Create a BED file representing all of the intervals in the genome that are NOT exonic.
Tip
Use intersect with the -v
flag
- What is the average distance from GWAS SNPs to the closest exon?
Tip
(Hint - have a look at the closest tool.)
- What fraction of the GWAS SNPs are exonic?
Explore (not necessarily run) more usage examples with biological meaning using UNIX and BEDTools http://pedagogix-tagc.univ-mrs.fr/courses/jgb53d-bd-prog/practicals/03_bedtools/.
Read an overview of all the file formats available at UCSC. You can upload some of the files you used during the course to the UCSC genome browser to see how do they look like. Anyway you can download, read, edit them using UNIX (i.e. awk
, grep
etc). And ask biological questions using bedtools/vcftools.
In this course we used plain text data standards. But faster, indexed versions of such formats exist: i.e. bigWig (for wiggle), BCF (for VCF), BAM (for SAM). For instance, you can read how indexing is carried out in bigWigs.