In this tutorial, we describe how to make custom genome annotations <annotation>
in BED, BigBed and GTF2 formats. These can then be used like any other annotation file, for example:
- to view custom regions of interest in a
genome browser
- to perform analyses, using pipelines in
plastid
or other programs
Note
plastid
can also make custom GFF3 files. For this see /concepts/gff3
For a detailed comparison of the various annotation formats, see data-annotation-format
.
This document contains the following sections:
Annotation files are tab-delimited formats, and could manually be made in a text editor or spreadsheet. However, an easier and potentially less error-prone process is to:
- Enter an interactive Python session
- Construct and/or that represent the
features <feature>
of interest, using0-indexed
,half-open
coordinates. This can be done completely de novo, or using other features as starting points. - Use the
~plastid.genomics.roitools.SegmentChain.as_bed
or~plastid.genomics.roitools.SegmentChain.as_gtf
methods to export the or in the desired format. To convert the BED file to a BigBed file, seemake-annotation-bigbed
.
plastid
will automatically handle all coordinate conversions, splicing, text formatting, and attribute conversion.
Suppose we made a cell line containing a reporter construct, and that we want to measure the expression level of this reporter along with the expression levels of all other genes in an RNA-seq
experiment. Let's call the vector lenti223 and suppose it contains a GFP and an mCherry, each driven by some promoter.
An analysis pipeline for this experiment might look the following:
- Create a custom BED or GTF2 file describing the reporter construct(s)
- Combine the custom annotation file with an annotation from the host genome
- Download the appropriate genome sequence, and add the vector sequence as another contig
- Align sequencing data to the combined genome
- Analyze sequencing data using custom annotation file.
Here, we'll focus on just making the custom GTF2 file. Interactively we'll represent the reporter transcripts as and define coordinates manually:
>>> from plastid import GenomicSegment, SegmentChain, Transcript
# GFP transcript, containing 100 bp of 5' UTR and 150 bp of 3' UTR
# 714bp coding region from bases 945-1659
>>> gfp = Transcript(GenomicSegment("lenti223",845,1809,"+"),ID="sfGFP",cds_genome_start=945,cds_genome_end=1659)
# mCherry transcript, similarly constructed
>>> rfp = Transcript(GenomicSegment("lenti223",2100,3061,"+"),ID="mCherry",cds_genome_start=2200,cds_genome_end=2911)
# now, write out features
# we could have made a BED file using as_bed() in place of as_gtf()
>>> with open("custom.gtf","w") as fout:
>>> fout.write(gfp.as_gtf())
>>> fout.write(rfp.as_gtf())
>>> fout.close()
The file custom.gtf
should look something like this:
lenti223 . exon 846 1809 . + . gene_id "gene_sfGFP"; transcript_id "sfGFP"; ID "sfGFP";
lenti223 . CDS 946 1656 . + 0 gene_id "gene_sfGFP"; transcript_id "sfGFP"; ID "sfGFP";
lenti223 . start_codon 946 948 . + . gene_id "gene_sfGFP"; transcript_id "sfGFP"; cds_start "100"; cds_end "814"; ID "sfGFP";
lenti223 . stop_codon 1657 1659 . + . gene_id "gene_sfGFP"; transcript_id "sfGFP"; cds_start "100"; cds_end "814"; ID "sfGFP";
lenti223 . exon 2101 3061 . + . gene_id "gene_mCherry"; transcript_id "mCherry"; ID "mCherry";
lenti223 . CDS 2201 2908 . + 0 gene_id "gene_mCherry"; transcript_id "mCherry"; ID "mCherry";
lenti223 . start_codon 2201 2203 . + . gene_id "gene_mCherry"; transcript_id "mCherry"; cds_start "100"; cds_end "811"; ID "mCherry";
lenti223 . stop_codon 2909 2911 . + . gene_id "gene_mCherry"; transcript_id "mCherry"; cds_start "100"; cds_end "811"; ID "mCherry";
Manually entering coordinates is laborious. More frequently, novel annotations are derived from existing ones. Let's suppose we'd like to make a BED file containing halves of coding regions. For this we'll use the demo dataset </test_dataset>
.
We'll load the transcripts, create new from those, and save them:
>>> from plastid import BED_Reader
# read transcripts
>>> reader = BED_Reader("some_file.bed")
# open file for writing
>>> halfbed = open("cds_halves.bed","w")
>>> for transcript in reader:
>>> cds = transcript.get_cds()
>>>
>>> # make sure transcript is coding before divide CDS
>>> if cds.length > 0:
>>> name = transcript.get_name()
>>> halflength = cds.length // 2
>>>
>>> # get halves, name each half after the parent CDS
>>> first_half = cds.get_subchain(0,halflength,ID=name + "_firsthalf")
>>> second_half = cds.get_subchain(halflength,cds.length,ID=name + "_secondhalf")
>>>
>>> # save output
>>> halfbed.write(first_half.as_bed())
>>> halfbed.write(second_half.as_bed())
# close file
>>> halfbed.close()
Extended BED <extended BED>
and BigBed files can contain extra columns, such as a gene ID. This can be extremely useful.
To export attributes of a or as extra columns in a extended BED
format, pass a list of the attribute names (from the dictionary attr) to the extra_columns keyword of SegmentChain.as_bed <plastid.genomics.roitools.SegmentChain.as_bed>
. Attributes will be exported in the order they appear in extra_columns, and will be given an empty value of "" when they are not defined
>>> attr = { "ID" : "some feature ID",
"extra_field_1" : 542,
"extra_field_2" : "some extra field",
}
>>> my_chain = Transcript(GenomicSegment("chrA",100,150,"+"),
GenomicSegment("chrA",500,550,"+"),
**attr)
>>> print(my_chain.as_bed()
chrA 100 550 some feature ID 0 + 100 100 0,0,0 2 50,50, 0,400,
>>> print(my_chain.as_bed(extra_columns=["extra_field_1","extra_field_2"]))
chrA 100 550 some feature ID 0 + 100 100 0,0,0 2 50,50, 0,400, 542 some extra field
If an attribute is not defined, the column will be left empty "":
>>> print(my_chain.as_bed(extra_columns=["extra_field_1","nonexistent_field","extra_field_2"]))
chrA 100 550 some feature ID 0 + 100 100 0,0,0 2 50,50, 0,400, 542 some extra field
Making BigBed files
BigBed files are easily made from BED or Extended BED <extended BED>
files using Jim Kent's utilities. To make a BigBed file:
- Create a custom BED or
extended BED
file. Forextended BED
files, consider making an optional autoSql description of the names & data types of the extra columns. This will allow parsers to convert these to native types when reading the BigBed file. Sort the BED file by chromosome and start position. This is easily done in a terminal session:
$ sort -k1,1n -k2,2n my_annotation.bed >my_annotation_sorted.bed
- Download and install Jim Kent's utilities, which include the
bedToBigBed
program. Obtain a chromosome/contig
.sizes
file. If using genome builds from UCSC, these can be downloaded using thefetchChromSizes
program included with Jim Kent's utilities. For example:$ fetchChromSizes hg38 >>hg38.sizes
Run
bedToBigBed
. From the terminal:$ bedToBigBed my_annotation_sorted.bed my_genome.sizes my_annotation.bb
Your annotation will be saved as
my_annotation.bb
.
data-annotation-format
for a brief overview of the costs & benefits of BED, BigBed, GTF2 and GFF3 files./concepts/gff3
for information on making GFF3 files~plastid.genomics.roitools.SegmentChain
and~plastid.genomics.roitools.Transcript
for details on these classes- The UCSC file format FAQ for details on file formats and further discussion of their capabilities, advantages, and disadvantages
- The GFF3 specification for details on GFF3 files
/concepts/coordinates
for information on genomic coordinates- Sequence Ontology (SO) v2.53, for a description of a common GFF3 feature ontology
- SO releases, for the current SO consortium release.
- Jim Kent's utilities for more info on making BigBed files.
- autoSql Grammar specification