In this recipe, we will learn how to extract a gene sequence from a reference FASTA, using an
annotation file to get the gene's coordinates. We will use the **Anopheles gambiae** genome, along with
its annotation file (as per the previous two recipes). First, we will extract the **voltage-gated sodium channel (VGSC)** gene, which is involved in resistance to insecticides.

In [1]:
! wget https://vectorbase.org/common/downloads/release-55/AgambiaePEST/gff/data/VectorBase-55_AgambiaePEST.gff -O gambiae.gff

--2022-10-21 20:56:04--  https://vectorbase.org/common/downloads/release-55/AgambiaePEST/gff/data/VectorBase-55_AgambiaePEST.gff
Resolving vectorbase.org (vectorbase.org)... 128.91.204.54
Connecting to vectorbase.org (vectorbase.org)|128.91.204.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23512903 (22M) [application/x-gff]
Saving to: ‘gambiae.gff’


2022-10-21 20:56:07 (7.88 MB/s) - ‘gambiae.gff’ saved [23512903/23512903]



In [2]:
! gzip -9 gambiae.gff

In [9]:
! wget https://vectorbase.org/common/downloads/Current_Release/AgambiaePEST/fasta/data/VectorBase-59_AgambiaePEST_Genome.fasta -O gambiae.fa

--2022-10-21 21:01:27--  https://vectorbase.org/common/downloads/Current_Release/AgambiaePEST/fasta/data/VectorBase-59_AgambiaePEST_Genome.fasta
Resolving vectorbase.org (vectorbase.org)... 128.91.204.54
Connecting to vectorbase.org (vectorbase.org)|128.91.204.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 286083264 (273M) [application/x-fasta]
Saving to: ‘gambiae.fa’


2022-10-21 21:01:45 (15.1 MB/s) - ‘gambiae.fa’ saved [286083264/286083264]



In [10]:
! gzip -9 gambiae.fa

In [4]:
import gffutils
import sqlite3

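# Build the annotation database on the first run; if 'ag.db' already
# exists, creation fails and we open the existing database instead.
try:
    db = gffutils.create_db('gambiae.gff.gz', 'ag.db')
except sqlite3.OperationalError:
    db = gffutils.FeatureDB('ag.db')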

Let’s start by retrieving the annotation information for our gene:

In [7]:
import gzip
from Bio import Seq, SeqIO
gene_id = 'AGAP004707'
gene = db[gene_id]
print(gene)
print(gene.seqid, gene.strand)

AgamP4_2L	VEuPathDB	protein_coding_gene	2358158	2431617	.	+	.	ID=AGAP004707;Name=para;description=voltage-gated sodium channel
AgamP4_2L +


*gene_id* was retrieved from **VectorBase**, an online database for the genomics of disease vectors.
For other cases, you will need to know the ID of your gene (which depends on the species and database).

Note that the gene is on the 2L chromosome arm and is coded in the positive direction (the + strand).
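
The printed record follows the nine-column, tab-separated GFF3 layout. As a minimal sketch of how those columns map to the values used in this recipe, the record can be split by hand. The helper `parse_gff_line` is hypothetical (it is not part of gffutils), and it ignores the percent-encoding that GFF3 allows in attribute values:

```python
# Toy illustration: split one GFF3 record into its nine tab-separated columns.
# The record is copied from the output above; parse_gff_line is a hypothetical helper.
GFF_COLUMNS = ['seqid', 'source', 'type', 'start', 'end',
               'score', 'strand', 'phase', 'attributes']


def parse_gff_line(line):
    """Return a dict describing one GFF3 feature line."""
    fields = line.rstrip('\n').split('\t')
    record = dict(zip(GFF_COLUMNS, fields))
    record['start'] = int(record['start'])
    record['end'] = int(record['end'])
    # Column 9 holds semicolon-separated key=value pairs.
    record['attributes'] = dict(
        pair.split('=', 1) for pair in record['attributes'].split(';') if pair
    )
    return record


line = ('AgamP4_2L\tVEuPathDB\tprotein_coding_gene\t2358158\t2431617\t.\t+\t.\t'
        'ID=AGAP004707;Name=para;description=voltage-gated sodium channel')
parsed = parse_gff_line(line)

# GFF coordinates are 1-based and inclusive, so the span is end - start + 1.
print(parsed['seqid'], parsed['strand'], parsed['end'] - parsed['start'] + 1)
# AgamP4_2L + 73460
```

In practice, gffutils exposes the same information as attributes of the feature object (`gene.seqid`, `gene.strand`, `gene.start`, `gene.end`), so manual parsing is only useful for understanding the file format.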

Let’s hold the sequence for the *2L* chromosome arm in memory (it’s just a single chromosome arm,
so we can afford it):

In [11]:
# Stream the compressed FASTA and stop as soon as the record we need is found.
with gzip.open('gambiae.fa.gz', 'rt', encoding='utf-8') as handle:
    for rec in SeqIO.parse(handle, 'fasta'):
        print(rec.description)
        if rec.id == gene.seqid:
            my_seq = rec.seq
            break

AgamP4_2L | organism=Anopheles_gambiae_PEST | version=AgamP4 | length=49364325 | SO=chromosome
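
The loop above stops at the first matching record, which is why only one description is printed. To make that early-exit idea concrete, here is a minimal sketch that does the same thing without Biopython on a toy in-memory FASTA. The function `read_one_record` and the toy record names are assumptions for illustration only; `SeqIO.parse` remains the approach used in this recipe:

```python
import io


def read_one_record(handle, wanted_id):
    """Return the sequence of the first FASTA record whose ID matches, or None.

    The ID is taken to be the first whitespace-delimited token of the header,
    which mirrors how the record ID is compared in the loop above.
    """
    chunks = []
    in_wanted = False
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if in_wanted:
                break  # the next header ends the record we wanted
            header = line[1:].split()
            in_wanted = bool(header) and header[0] == wanted_id
        elif in_wanted:
            chunks.append(line)
    return ''.join(chunks) if in_wanted else None


# Toy FASTA with two records; only the first one is read in full.
toy_fasta = io.StringIO('>chrA | toy\nACGT\nACGT\n>chrB | toy\nTTTT\n')
print(read_one_record(toy_fasta, 'chrA'))  # ACGTACGT
```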


Let’s create a function to construct a gene sequence from a list of CDSs. This function will
receive a chromosome sequence (in our case, the 2L arm), a list of coding sequences (retrieved
from the annotation file, sorted by start position), and the strand:

In [12]:
# We have to be very careful with the start and end of the sequence (note that the GFF file is
# 1-based, whereas the Python array is 0-based). Finally, we return the reverse complement if
# the strand is negative.

def get_sequence(chrom_seq, CDSs, strand):
    seq = Seq.Seq('')
    for CDS in CDSs:
        my_cds = Seq.Seq(str(chrom_seq[CDS.start - 1: CDS.end]))
        seq += my_cds
    return seq if strand == '+' else seq.reverse_complement()
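
The coordinate conversion inside `get_sequence` is easy to get wrong by one base, so it is worth checking on a toy sequence. The sketch below assumes a made-up 12-base chromosome and a hypothetical helper `gff_slice`; it only illustrates why the slice is `[start - 1:end]`:

```python
# Toy chromosome; in GFF terms its bases are numbered 1..12.
toy_chrom = 'AACCGGTTACGT'


def gff_slice(seq, start, end):
    """Extract a 1-based, inclusive GFF interval from a 0-based Python sequence."""
    # Subtract 1 from start to move to 0-based indexing; end stays as-is
    # because Python slices already exclude the end index.
    return seq[start - 1:end]


piece = gff_slice(toy_chrom, 3, 6)
print(piece)       # CCGG
print(len(piece))  # 4, that is end - start + 1
```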

Although we have the gene ID at hand, the gene has three annotated transcripts and we only want
one of them, so we need to choose. Here, we take the transcript whose ID ends in RA:

In [13]:
mRNAs = db.children(gene, featuretype='mRNA')
for mRNA in mRNAs:
    print(mRNA.id)
    if mRNA.id.endswith('RA'):
        break

AGAP004707-RA


Now, let’s get the CDS features for our transcript, assemble the coding sequence, and translate it:

In [14]:
CDSs = db.children(mRNA, featuretype='CDS', order_by='start')
gene_seq = get_sequence(my_seq, CDSs, gene.strand)

print(len(gene_seq), gene_seq)
prot = gene_seq.translate()
print(len(prot), prot)

6195 ATGACCGAAGACTCCGATTCGATATCTGAGGAAGAACGTAGTTTGTTCCGTCCTTTCACTCGTGAATCATTACAAGCTATCGAAGCACGCATTGCAGATGAAGAAGCCAAACAGCGAGAATTGGAAAGAAAACGAGCTGAGGGGGAGATACGCTACGATGACGAGGATGAGGATGAAGGTCCCCAACCGGACCCTACTCTTGAACAGGGTGTACCAGTCCCAGTTCGAATGCAAGGCAGCTTCCCCCCGGAGTTGGCCTCCACGCCTCTCGAGGATATTGACAGTTTCTATTCAAATCAAAGGACATTCGTAGTGATTAGTAAAGGAAAAGATATATTTCGTTTCTCCGCAACTAACGCATTATATGTACTTGATCCGTTTAACCCCATACGCCGCGTAGCTATTTATATTTTAGTACATCCACTGTTTTCACTTTTTATAATAACGACCATTCTTGTTAATTGTATATTGATGATTATGCCTACCACGCCGACAGTCGAATCTACCGAGGTGATATTCACCGGCATCTACACGTTCGAATCAGCTGTAAAAGTGATGGCGCGAGGTTTCATATTACAACCGTTTACTTATCTTAGAGATGCATGGAATTGGTTGGACTTCGTAGTAATAGCATTAGCATATGTAACTATGGGTATAGATTTGGGTAATCTTGCTGCGTTGAGAACATTCAGGGTATTACGAGCTCTTAAAACGGTAGCCATCGTTCCAGGCTTAAAAACCATCGTCGGAGCCGTTATAGAATCCGTAAAGAATCTCAGAGATGTGATAATTTTAACAATGTTTTCGTTATCCGTGTTTGCTTTGATGGGTCTACAAATCTACATGGGAGTACTAACACAAAAGTGCATAAAAGAGTTCCCATTGGATGGTTCCTGGGGTAATCTAACCGACGAAAGCTGGGAGCTGTTCAACAGCAATGACACAAATTGGTTCTATTCCGAGAGTGGCGACATTCCTCTTTGTGGAAACTCATC
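
Before trusting a translation, it can help to check a few basic properties of the assembled coding sequence. The 6195 bases reported above are a multiple of three (2065 codons) and begin with ATG. The following is a minimal sketch of such checks, assuming the standard genetic code; `check_cds` is a hypothetical helper, and whether the final codon is a stop depends on how the annotation defines its CDS features:

```python
# Stop codons of the standard genetic code.
STOP_CODONS = {'TAA', 'TAG', 'TGA'}


def check_cds(seq):
    """Return simple sanity checks for an assembled coding sequence."""
    seq = str(seq).upper()
    return {
        'multiple_of_three': len(seq) % 3 == 0,
        'starts_with_atg': seq.startswith('ATG'),
        'ends_with_stop': seq[-3:] in STOP_CODONS,
    }


# Toy coding sequence: start codon, one codon, stop codon.
toy_cds = 'ATGGCCTGA'
print(check_cds(toy_cds))
```

In the notebook, the same function could be called as `check_cds(gene_seq)`. A length that is not a multiple of three usually points to an off-by-one slice or a missing CDS feature.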

Now, let’s get a gene that is coded on the negative strand. We will just take the gene next
to VGSC (which happens to be on the negative strand). Here, I avoided retrieving all of the
information about the gene and just hardcoded the transcript ID:

In [15]:
reverse_transcript_id = 'AGAP004708-RA'

reverse_CDSs = db.children(reverse_transcript_id, featuretype='CDS', order_by='start')
reverse_seq = get_sequence(my_seq, reverse_CDSs, '-')

print(len(reverse_seq), reverse_seq)
reverse_prot = reverse_seq.translate()
print(len(reverse_prot), reverse_prot)

1992 ATGGCTGACTTCGATAGTGCCACTAAATGTATCAGAAACATTGAAAAAGAAATTCTTCTCTTGCAATCCGAAGTTTTGAAGACTCGTGAGGGGCTTGGGCTGGAAGATGATAACGTGGAACTTAAAAAGTTAATGGAGGAAAACACGAGATTAAAGCATCGTTTGGAGATAGTGCAATCGGCTATTGTACAGGAAGGCGGATCAATCGCATCCTCCGATTCTGGCAACCAATCCATTGTTGGCGAACTGCAGCAAGTATTTACCGAAGCCATTCAAAAAGCTTTTCCAAGTGTGTTGGTTGAGGCGGTTATTACTATTTCGTCATCCCCCAAGTTTGGCGATTATCAATGCAATAGTGCTATGCAGATTGCGCAGCATTTGAAGCAGTTATCTGTTAAATCGTCGCCACGTGAAGTGGCCCAAAAACTGGTAGCTGAATTGCAAAAACCAATACCTTGTGTCGATAGATTAGAAATCGCTGGAGCGGGATACGTTAATATTTTCCTGTCTAGATCTTATGGAGAACAACGCATTATGAGCATCTTGAGGCATGGGATTGTGGTACCATTAATAGAAAAGAAACGTGTGATAGTCGATTTTTCCTCGCCTAACGTAGCGAAAGAAATGCATGTCGGTCATTTACGTTCGACCATCATTGGTGATTCAATTTGTCGATTTTTGGAATATCTCGGACACGATGTGCTTCGTATTAACCATATCGGAGACTGGGGAACGCAATTTGGTATGTTAATTGCTCATTTGCAGGACCGTTTCCCTAATTTCCAAACCGAGTCCCCGCCTATCAGCGATTTGCAAGCATTTTACAAGGAGTCAAAGGTCCGATTTGACAGCGATGAAGTATTTAAAAAGCGTGCCTACGAATGTGTAGTCAAACTGCAAAGTGGAGAGCTGAGTTATTTGAAGGCCTGGAATCTAATTTGCGATGTTTCACGCAAAGAATTCCAAACCATCTACAACAGATTGGATGTGAAACT
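
For a negative-strand gene, `get_sequence` joins the CDSs in ascending coordinate order and reverse-complements the result once at the end. The sketch below checks on a toy sequence that this is equivalent to reverse-complementing each CDS and joining them in descending order. The toy chromosome, the intervals, and the `reverse_complement` helper are assumptions for illustration (Biopython's `Seq.reverse_complement()` is what the recipe itself uses):

```python
# Translation table for complementing unambiguous DNA bases.
COMPLEMENT = str.maketrans('ACGT', 'TGCA')


def reverse_complement(seq):
    """Reverse-complement a plain DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


toy_chrom = 'ATGCCGTTAGGCTAA'        # forward-strand toy sequence
cds_intervals = [(1, 4), (8, 12)]    # 1-based, inclusive, sorted by start

# Approach used by get_sequence: join ascending, then reverse-complement once.
joined = ''.join(toy_chrom[s - 1:e] for s, e in cds_intervals)
approach_a = reverse_complement(joined)

# Alternative: reverse-complement each CDS and join in descending order.
approach_b = ''.join(
    reverse_complement(toy_chrom[s - 1:e])
    for s, e in sorted(cds_intervals, reverse=True)
)

print(approach_a)                 # GCCTAGCAT
print(approach_a == approach_b)   # True
```

This is why the CDSs are requested with `order_by='start'` for both strands: the single reverse complement at the end takes care of the ordering for negative-strand genes.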