<a href="https://colab.research.google.com/github/lauraluebbert/gget/blob/dev/examples/gget_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

gget features:
- `gget ref` Fetch FTP links for reference genomes and annotations by species.
- `gget search` Fetch gene and transcript IDs from Ensembl using free-form search terms.
- `gget info` Fetch gene and transcript metadata using Ensembl IDs.
- `gget seq` Fetch nucleotide or amino acid sequences of genes or transcripts.
- `gget blast` BLAST a nucleotide or amino acid sequence against any BLAST database.
- `gget muscle` Align multiple nucleotide or amino acid sequences against each other.

___

Install gget from the `dev` branch of the repository (only necessary until the next release):

In [None]:
!git clone -b dev --single-branch https://github.com/lauraluebbert/gget.git -q
!pip install mysql-connector-python -q
!cd gget && pip install . -q

  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
  Building wheel for gget (setup.py) ... done


___


<h1><center>Terminal version</center></h1>
<center>Jupyter Lab version below.</center>


In [None]:
!gget

Mon May  2 21:27:51 2022 INFO NumExpr defaulting to 2 threads.
usage: gget [-h] [-v] {ref,search,info,seq,muscle,blast,blat} ...

gget v0.0.18

positional arguments:
  {ref,search,info,seq,muscle,blast,blat}
    ref                 Fetch FTPs for reference genomes and annotations by
                        species.
    search              Fetch gene and transcript IDs from Ensembl using free-
                        form search terms.
    info                Fetch gene and transcript metadata using Ensembl IDs.
    seq                 Fetch nucleotide or amino acid sequence (FASTA) of a
                        gene (and all isoforms) or transcript by Ensembl ID.
    muscle              Align multiple nucleotide or amino acid sequences
                        against each other (using the Muscle v5 algorithm).
    blast               BLAST a nucleotide or amino acid sequence against any
                        BLAST DB.
    blat                BLAT a nucleotide or amino acid sequence ag

In [None]:
# # Show detailed help page
# !gget -h

___
Ensembl just released Ensembl 106. `gget ref` and `gget search` now fetch from that release automatically unless an earlier release is specified (all other functions are release-independent):

In [None]:
!gget ref -s human -w gtf

Mon May  2 21:27:53 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:27:59 2022 INFO Fetching reference information for homo_sapiens from Ensembl release: 106.
{
    "homo_sapiens": {
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "23:27",
            "bytes": "51379459"
        }
    }
}
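
The JSON printed by `gget ref` can be consumed directly by other tools. The following is a minimal sketch using only the Python standard library; it assumes the output has exactly the structure shown above, and the variable names are illustrative.

```python
import json

# JSON text as printed by `gget ref -s human -w gtf` above.
ref_output = """
{
    "homo_sapiens": {
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "23:27",
            "bytes": "51379459"
        }
    }
}
"""

ref = json.loads(ref_output)
gtf = ref["homo_sapiens"]["annotation_gtf"]

gtf_url = gtf["ftp"]
# "bytes" is reported as a string, so convert it before doing arithmetic.
size_mb = int(gtf["bytes"]) / 1e6

print(gtf_url)
print(f"Ensembl release {gtf['ensembl_release']}, {size_mb:.1f} MB")
```

The extracted link can then be passed to any download tool.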


Show newly available genomes in the latest Ensembl release (compared to previous release 105):

In [None]:
!comm -13 <(gget ref -l -r 105 | sort) <(gget ref -l | sort)

Mon May  2 21:28:00 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:01 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:03 2022 INFO Fetching available genomes in Ensembl release 106 (latest).
Mon May  2 21:28:04 2022 INFO Fetching available genomes in Ensembl release 105.
cyprinus_carpio_carpio
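
The `comm -13` call above prints only the lines unique to the second (latest) list. The same comparison can be sketched in Python as a set difference. The two species lists below are shortened, hypothetical stand-ins for the output of `gget ref -l -r 105` and `gget ref -l`.

```python
def new_species(previous, latest):
    """Return names present in `latest` but not in `previous`, sorted."""
    return sorted(set(latest) - set(previous))

# Hypothetical, shortened species lists standing in for the two releases.
release_105 = ["homo_sapiens", "taeniopygia_guttata"]
release_106 = ["cyprinus_carpio_carpio", "homo_sapiens", "taeniopygia_guttata"]

print(new_species(release_105, release_106))
```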


___

# Find gene IDs based on free-form search terms:
Search for 'fun' genes in the zebra finch genome. Writing just 'tae' as the species is enough, because no other genome name begins with those letters.
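
To illustrate why a short prefix such as 'tae' can be sufficient, here is a sketch of unique-prefix matching. This is an assumption made for illustration, not gget's actual implementation; `resolve_species` and the shortened species list are hypothetical.

```python
def resolve_species(prefix, species_names):
    """Return the single species whose name starts with `prefix`.

    Raises ValueError if the prefix matches no species or more than one.
    """
    matches = [name for name in species_names if name.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"'{prefix}' matches {len(matches)} species: {matches}")
    return matches[0]

# Hypothetical, shortened list of species names.
species = ["homo_sapiens", "taeniopygia_guttata", "cyprinus_carpio_carpio"]

print(resolve_species("tae", species))
```

An ambiguous or unknown prefix raises an error instead of silently picking a genome.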

In [None]:
!gget search -sw fun -s tae

Mon May  2 21:28:05 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:08 2022 INFO Fetching results from database: taeniopygia_guttata_core_106_12
Mon May  2 21:28:11 2022 INFO Query time: 4.26 seconds
Mon May  2 21:28:11 2022 INFO Matches found: 14
ensembl_id,gene_name,ensembl_description,ext_ref_description,biotype,url
ENSTGUG00000003915,AIMP1,aminoacyl tRNA synthetase complex interacting multifunctional protein 1 [Source:NCBI gene;Acc:100227419],aminoacyl tRNA synthetase complex interacting multifunctional protein 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000003915
ENSTGUG00000004896,MFHAS1,malignant fibrous histiocytoma amplified sequence 1 [Source:NCBI gene;Acc:100217808],multifunctional ROCO family signaling regulator 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000004896
ENSTGUG00000004956,BFAR,bifunctional apoptosis regulator [Source:NCBI gene;Acc:100223595],bifunctional apoptosis

# Fetch additional information about genes/transcripts (like the IDs of all known transcripts of a gene):

In [None]:
# Show short info on a few of the genes
!gget info -id ENSTGUG00000006139 ENSTGUG00000026050 ENSTGUG00000004956

Mon May  2 21:28:12 2022 INFO NumExpr defaulting to 2 threads.
uniprot_id,primary_gene_name,ensembl_gene_name,synonyms,parent_gene,protein_names,uniprot_description,ensembl_description,object_type,biotype,canonical_transcript,species,assembly_name,seq_region_name,strand,start,end
A0A674GVD2,FUNDC1,FUNDC1,['FUNDC1'],,Uncharacterized protein,,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],Gene,protein_coding,ENSTGUT00000027003.1,taeniopygia_guttata,bTaeGut1_v1.p,1,-1,107513786,107528106
A0A674GIX6,LOC115492155,TRMT112,['LOC115492155'],,Multifunctional methyltransferase subunit TRM112-like protein (tRNA methyltransferase 112 homolog),,multifunctional methyltransferase subunit TRM112-like protein [Source:NCBI gene;Acc:115492155],Gene,protein_coding,ENSTGUT00000042451.1,taeniopygia_guttata,bTaeGut1_v1.p,RRCB01000041.1,1,484672,487065
H0Z3G6,BFAR,BFAR,['BFAR'],,Uncharacterized protein,,bifunctional apoptosis regulator [Source:NCBI gene;Acc:100223595],Gene,protein_coding,ENSTGUT00

In [None]:
# Expand info to show all transcripts
!gget info -id ENSTGUG00000006139 -e

Mon May  2 21:28:21 2022 INFO NumExpr defaulting to 2 threads.
uniprot_id,primary_gene_name,ensembl_gene_name,synonyms,parent_gene,protein_names,uniprot_description,ensembl_description,object_type,biotype,canonical_transcript,species,assembly_name,seq_region_name,strand,start,end,all_transcripts,transcript_biotypes,transcript_names,all_exons,exon_starts,exon_ends,all_translations,translation_starts,translation_ends
A0A674GVD2,FUNDC1,FUNDC1,['FUNDC1'],,Uncharacterized protein,,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],Gene,protein_coding,ENSTGUT00000027003.1,taeniopygia_guttata,bTaeGut1_v1.p,1,-1,107513786,107528106,"['ENSTGUT00000006367', 'ENSTGUT00000027003']","['protein_coding', 'protein_coding']","['FUNDC1-201', 'FUNDC1-202']",,,,,,
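
In the CSV output, list-valued columns such as `all_transcripts` are stored as quoted strings. The sketch below shows one way to turn them back into Python lists, assuming the cells are valid Python list literals as shown above. Only a subset of the columns is reproduced here.

```python
import ast
import csv
import io

# A subset of the columns from the expanded output above (header + one row).
csv_text = '''ensembl_gene_name,all_transcripts,transcript_names
FUNDC1,"['ENSTGUT00000006367', 'ENSTGUT00000027003']","['FUNDC1-201', 'FUNDC1-202']"
'''

row = next(csv.DictReader(io.StringIO(csv_text)))

# The list-valued cells are list literals stored as strings;
# ast.literal_eval parses them without executing arbitrary code.
transcript_ids = ast.literal_eval(row["all_transcripts"])
transcript_names = ast.literal_eval(row["transcript_names"])

for transcript_id, name in zip(transcript_ids, transcript_names):
    print(transcript_id, name)
```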


# Fetch the **nucleotide** sequence of a gene, or the **nucleotide** sequences of all its known transcripts (isoforms).

In [None]:
!gget seq -id ENSTGUG00000006139 -o gene_fasta.fa

Mon May  2 21:28:26 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:27 2022 INFO Requesting nucleotide sequence of ENSTGUG00000006139 from Ensembl.


In [None]:
!gget seq -id ENSTGUG00000006139 -iso -o gene_iso_fasta.fa

Mon May  2 21:28:27 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:30 2022 INFO Requesting nucleotide sequences of all transcripts of ENSTGUG00000006139 from Ensembl.


# Fetch the **amino acid** sequence of the canonical transcript of a gene, or the **amino acid** sequences corresponding to all its known protein isoforms.

In [None]:
# Get amino acid (AA) sequence of canonical transcript
!gget seq -id ENSTGUG00000006139 -st transcript -o transcript_fasta.fa

Mon May  2 21:28:31 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:33 2022 INFO Requesting amino acid sequence of the canonical transcript ENSTGUT00000027003 of gene ENSTGUG00000006139 from UniProt.


In [None]:
# Get AA sequences of all isoforms
!gget seq -id ENSTGUG00000006139 -st transcript -iso -o transcript_iso_fasta.fa

Mon May  2 21:28:36 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:38 2022 INFO Requesting amino acid sequences of all transcripts of gene ENSTGUG00000006139 from UniProt.
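
The saved .fa files are plain FASTA text, so they can be read back without extra dependencies. The following is a minimal parser sketch; `parse_fasta` is a hypothetical helper, and the example records are shortened prefixes of the two FUNDC1 isoforms rather than the full sequences.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record ID to sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token of the header.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequences may span several lines; concatenate them.
            records[current_id] += line
    return records

# Shortened example records (the real sequences are longer).
example = """>ENSTGUT00000006367 gene_name(s): FUNDC1
MAARRPRSASDHD
SDDDSYEVLDLTE
>ENSTGUT00000027003 gene_name(s): FUNDC1
MLMPGPLRRALGQKFSIFPSVDHD
"""

for seq_id, seq in parse_fasta(example).items():
    print(seq_id, len(seq))
```

To parse a saved file, its contents can be read with `open(...).read()` and passed to the same function.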


Note: If the isoform option is used with a transcript ID, gget simply fetches the sequence of the specified transcript and notifies the user that the isoform option only applies to genes:

In [None]:
!gget seq -id ENSTGUT00000027003.1 -st transcript -iso

Mon May  2 21:28:40 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:42 2022 INFO Requesting amino acid sequence of ENSTGUT00000027003 from UniProt.
>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 167
MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS
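
Note that the command above was given the versioned ID `ENSTGUT00000027003.1`, while the log reports the unversioned `ENSTGUT00000027003`. A sketch of how such a version suffix could be removed before comparing IDs follows; `strip_version` is a hypothetical helper, and gget's internal handling may differ.

```python
def strip_version(ensembl_id):
    """Drop a trailing '.<version>' from an Ensembl ID, if present."""
    base, sep, version = ensembl_id.rpartition(".")
    if sep and version.isdigit():
        return base
    return ensembl_id

print(strip_version("ENSTGUT00000027003.1"))  # ENSTGUT00000027003
print(strip_version("ENSTGUT00000027003"))    # ENSTGUT00000027003
```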


# BLAST the gene **nucleotide** sequence:

Note: `blast` also accepts a sequence passed as a string instead of a .fa file.

In [None]:
!gget blast -s gene_fasta.fa -o gene_blast.csv

Mon May  2 21:28:44 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:28:45 2022 INFO Sequence recognized as nucleotide sequence.
Mon May  2 21:28:45 2022 INFO BLAST will use program 'blastn' with database 'nt'.
Mon May  2 21:28:46 2022 INFO BLAST initiated with search ID 70K8DSRS016. Estimated time to completion: 66 seconds.
Mon May  2 21:29:53 2022 INFO Retrieving results...


# BLAST the **amino acid** sequence of the canonical transcript:

In [None]:
!gget blast -s transcript_fasta.fa -o transcript_blast.csv

Mon May  2 21:29:54 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:29:55 2022 INFO Sequence recognized as amino acid sequence.
Mon May  2 21:29:55 2022 INFO BLAST will use program 'blastp' with database 'nr'.
Mon May  2 21:29:56 2022 INFO BLAST initiated with search ID 70KAKYKJ016. Estimated time to completion: 47 seconds.
Mon May  2 21:30:43 2022 INFO BLASTING...
Mon May  2 21:31:44 2022 INFO BLASTING...
Mon May  2 21:32:46 2022 INFO BLASTING...
Mon May  2 21:33:47 2022 INFO BLASTING...
Mon May  2 21:34:52 2022 INFO Retrieving results...
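
The logs above show that the sequence type is recognized automatically and determines the BLAST program and database (`blastn` with `nt` for nucleotides, `blastp` with `nr` for amino acids). The sketch below illustrates one way such a check could work. It is an assumption for illustration only, not gget's actual implementation, and `guess_blast_settings` is a hypothetical name.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")

def guess_blast_settings(sequence):
    """Guess a (program, database) pair from the letters in `sequence`."""
    letters = {char for char in sequence.upper() if char.isalpha()}
    if not letters:
        raise ValueError("Sequence contains no letters.")
    # Only nucleotide letters: treat as DNA/RNA; otherwise treat as protein.
    if letters <= NUCLEOTIDE_LETTERS:
        return "blastn", "nt"
    return "blastp", "nr"

print(guess_blast_settings("CTCAGCACCGGCCAACATGG"))
print(guess_blast_settings("MLMPGPLRRALGQKFSIFPS"))
```

A simple rule like this would misclassify a peptide that happens to consist only of A, C, G, T and N, so a real tool may need a stricter check.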


# Use the MUSCLE algorithm to align the **nucleotide** sequences of all transcripts:
Returns the alignment in aligned FASTA (.afa) format.

In [None]:
# For long or many sequences, use the super5 algorithm (flag -s5) to reduce memory usage
# Save the results with the -o flag
!gget muscle -fa gene_iso_fasta.fa

Mon May  2 21:34:53 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:34:54 2022 INFO MUSCLE compiled. 
Mon May  2 21:34:54 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 13750, max 14321

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
tcmalloc: large alloc 3775569920 bytes == 0x55fefe080000 @  0x7f64dc6c11e7 0x55fefc3f3017 0x55fefc3ab993 0x55fefc3e2957 0x7f64dbd45edf 0x55fefc3e4fe3 0x55fefc3a0c6d 0x55fefc3a13fd 0x55fefc3991f7 0x7f64db531c87 0x55fefc39ee2a
tcmalloc: large alloc 3775569920 bytes == 0x55ffdf12a000 @  0x7f64dc6c11e7 0x55fefc3f3017 0x55fefc3ab99f 0x55fefc3e2957 0x7f64dbd45edf 0x55fefc3e4fe3 0x55fefc3a0c6d 0x55fefc3a13fd 0x55fefc3991f7 0x7f64db531c87 0x55fefc39ee2a
00:33 8.4Gb   100.0% UPGMA5         
Mon May  2 21:35:28 2022 INFO MUSCLE alignment complete. Alignment time: 34.31 second

# Use the MUSCLE algorithm to align the **amino acid** sequences of all transcripts:

In [None]:
!gget muscle -fa transcript_iso_fasta.fa

Mon May  2 21:35:32 2022 INFO NumExpr defaulting to 2 threads.
Mon May  2 21:35:33 2022 INFO MUSCLE compiled. 
Mon May  2 21:35:33 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 162, max 167

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
00:00 48Mb    100.0% UPGMA5         
Mon May  2 21:35:33 2022 INFO MUSCLE alignment complete. Alignment time: 0.02 seconds


ENSTGUT00000006367 MAARRPR---------
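
In a terminal, the alignment is printed with a colored background per residue, which is done with ANSI escape sequences. If the output is captured in a log or text file, those sequences can be removed to leave a plain alignment. This is a minimal sketch; `strip_ansi` is a hypothetical helper, and the pattern only covers color and style (SGR) codes.

```python
import re

# Matches SGR (color/style) escape sequences such as ESC[38;5;15m or ESC[0;0m.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")

def strip_ansi(text):
    """Remove ANSI color codes, leaving only the visible characters."""
    return ANSI_SGR.sub("", text)

colored = "\x1b[38;5;15m\x1b[48;5;12mM\x1b[0;0m\x1b[38;5;15m\x1b[48;5;12mA\x1b[0;0m"
print(strip_ansi(colored))  # MA
```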

___

<h1><center>Jupyter Lab version</center></h1>

In [None]:
import gget

Mon May  2 21:35:33 2022 INFO NumExpr defaulting to 2 threads.


In [None]:
# # Show manual per sub-function, e.g. for seq:
# help(gget.seq)

___

# Find gene IDs based on free-form search terms:

In [None]:
# Note: 'wrap_text' displays the data frame with wrapped text for easier reading
search_results = gget.search("fun", "tae", wrap_text=True)

Mon May  2 21:35:35 2022 INFO Fetching results from database: taeniopygia_guttata_core_106_12
Mon May  2 21:35:36 2022 INFO Query time: 2.04 seconds
Mon May  2 21:35:36 2022 INFO Matches found: 14


Unnamed: 0,ensembl_id,gene_name,ensembl_description,ext_ref_description,biotype,url
0,ENSTGUG00000003915,AIMP1,aminoacyl tRNA synthetase complex interacting multifunctional protein 1 [Source:NCBI gene;Acc:100227419],aminoacyl tRNA synthetase complex interacting multifunctional protein 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000003915
1,ENSTGUG00000004896,MFHAS1,malignant fibrous histiocytoma amplified sequence 1 [Source:NCBI gene;Acc:100217808],multifunctional ROCO family signaling regulator 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000004896
2,ENSTGUG00000004956,BFAR,bifunctional apoptosis regulator [Source:NCBI gene;Acc:100223595],bifunctional apoptosis regulator,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000004956
3,ENSTGUG00000006139,FUNDC1,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],FUN14 domain containing 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000006139
4,ENSTGUG00000008804,AIMP2,aminoacyl tRNA synthetase complex interacting multifunctional protein 2 [Source:NCBI gene;Acc:100226087],aminoacyl tRNA synthetase complex interacting multifunctional protein 2,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000008804
5,ENSTGUG00000011666,,pseudouridine-metabolizing bifunctional protein C1861.05-like [Source:NCBI gene;Acc:100222446],,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000011666
6,ENSTGUG00000014433,,hydroxyacyl-CoA dehydrogenase trifunctional multienzyme complex subunit beta [Source:NCBI gene;Acc:115494596],,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000014433
7,ENSTGUG00000014477,,"trifunctional enzyme subunit alpha, mitochondrial-like [Source:NCBI gene;Acc:115494667]",,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000014477
8,ENSTGUG00000019264,ASF1A,anti-silencing function 1A histone chaperone [Source:NCBI gene;Acc:100229097],,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000019264
9,ENSTGUG00000020253,,hydroxyacyl-CoA dehydrogenase trifunctional multienzyme complex subunit beta [Source:NCBI gene;Acc:100226147],,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000020253


# Fetch additional information about genes/transcripts (like the IDs of all known transcripts of a gene):

In [None]:
# Get gene ID of FUNDC1
gene_ID = search_results[search_results["gene_name"]=="FUNDC1"]["ensembl_id"].values[0]
gene_ID

'ENSTGUG00000006139'

In [None]:
# Show short info on a few genes
# Note: 'wrap_text' displays the data frame with wrapped text for easier reading
df = gget.info([gene_ID, "ENSTGUG00000019264", "ENSTGUG00000022620"], wrap_text=True)



Unnamed: 0,uniprot_id,primary_gene_name,ensembl_gene_name,synonyms,parent_gene,protein_names,uniprot_description,ensembl_description,object_type,biotype,canonical_transcript,species,assembly_name,seq_region_name,strand,start,end
ENSTGUG00000006139,A0A674GVD2,FUNDC1,FUNDC1,[FUNDC1],,Uncharacterized protein,,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],Gene,protein_coding,ENSTGUT00000027003.1,taeniopygia_guttata,bTaeGut1_v1.p,1,-1,107513786,107528106
ENSTGUG00000019264,A0A674HJ14,ASF1A,ASF1A,[ASF1A],,Uncharacterized protein,,anti-silencing function 1A histone chaperone [Source:NCBI gene;Acc:100229097],Gene,protein_coding,ENSTGUT00000021767.1,taeniopygia_guttata,bTaeGut1_v1.p,3,1,49529450,49548203
ENSTGUG00000022620,A0A674GWL9,LOC100223290,,[LOC100223290],,Enoyl-CoA hydratase (EC 4.2.1.17),,hydroxyacyl-CoA dehydrogenase trifunctional multienzyme complex subunit alpha [Source:NCBI gene;Acc:100223290],Gene,protein_coding,ENSTGUT00000026940.1,taeniopygia_guttata,bTaeGut1_v1.p,3,1,1645998,1678163


In [None]:
# Show expanded info
info_results = gget.info(gene_ID, expand=True, wrap_text=True)



Unnamed: 0,uniprot_id,primary_gene_name,ensembl_gene_name,synonyms,parent_gene,protein_names,uniprot_description,ensembl_description,object_type,biotype,canonical_transcript,species,assembly_name,seq_region_name,strand,start,end,all_transcripts,transcript_biotypes,transcript_names,all_exons,exon_starts,exon_ends,all_translations,translation_starts,translation_ends
ENSTGUG00000006139,A0A674GVD2,FUNDC1,FUNDC1,[FUNDC1],,Uncharacterized protein,,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],Gene,protein_coding,ENSTGUT00000027003.1,taeniopygia_guttata,bTaeGut1_v1.p,1,-1,107513786,107528106,"[ENSTGUT00000006367, ENSTGUT00000027003]","[protein_coding, protein_coding]","[FUNDC1-201, FUNDC1-202]",,,,,,


# Fetch the **nucleotide** sequence of a gene, or the **nucleotide** sequences of all its known transcripts (isoforms).

In [None]:
gene_fasta = gget.seq(gene_ID)
gene_fasta

Mon May  2 21:35:44 2022 INFO Requesting nucleotide sequence of ENSTGUG00000006139 from Ensembl.


['>ENSTGUG00000006139 primary_assembly:bTaeGut1_v1.p:1:107513786:107528106:-1',
 'CTCAGCACCGGCCAACATGGCGGCGCGGCGGCCCCGCTCCGCCTCAGGTCAGTGCTGGCCCCTTCCTGGCGGGAGCGGGGAGGAGGCGGCGCGGCGGGCGCCCTTCCTCCTCTCGGGGGCGGCGGGCAGCTCCCTCCCGCTGTATCCCGGGGGCGGGGAGGGATGGCCCGGGCTCTGCGCCTCCCTAGTCCGTTGCGTGTCCGTTGCGTGTGCCCCCTCCACCGCGCGGCCCGGCGCATCGGCCCCGGCGCTCCTGGCATCACCCGGGCTGAAGCTCATTCCCGGGATTTAACCAGCGAAACCCTTTCTAGGCAGGCCCGCAGGCAGAATGAGTGCCGGCCGAGCCCTCACGGAGCCGGAGGTGCCGGGGGGATGCGGGACCGGAGCCAGGAGGCTCCAGCCCCCATGGCCGCCGCATCCTTCGCCGTGCGGGGCTCTCCGGGCATCGCCGCGCCTCGTCCTGCGGCCCTTTCGGCACTGGCACAGTTCCGTGTGCTGCTCAATGTCCGGAACATCATTTGTCGCATGCAGGAGTATTTTTCATCGGTAGAAAATGCTCTGGCAGTTACTTGCCATAGAGCATGTTATGCTTGTGTACATGAGTTTTGGTTTAGATAATAATAATTTAAGGGCGGAATGAATGTGACTGTTCATGACAGTGTTTTAATATTCTCCATCTAAAGAAGTTAAACGTGTTGTCCTCAGTACCGCAAAGAAAGAACATTCAATAACAAGTTCTCAGCATTATGGATCTCACTATTTATTATTTAGTACTCAAGACCATGTGGTAATAAAGGGAAATAATGCACACCTATATATGTACTTCTTGCAGTCTTTGAAGCTTTTACCCATACTGGTAGTACAAGTTAAAAAGCTGTCAAAATTCTAATAAAATGTTTATATCCACAGTCTTACTTT

In [None]:
gget.seq(gene_ID, isoforms=True)

Mon May  2 21:35:45 2022 INFO Requesting nucleotide sequences of all transcripts of ENSTGUG00000006139 from Ensembl.


['>ENSTGUT00000006367 primary_assembly:bTaeGut1_v1.p:1:107513786:107528106:-1',
 'CTCAGCACCGGCCAACATGGCGGCGCGGCGGCCCCGCTCCGCCTCAGGTCAGTGCTGGCCCCTTCCTGGCGGGAGCGGGGAGGAGGCGGCGCGGCGGGCGCCCTTCCTCCTCTCGGGGGCGGCGGGCAGCTCCCTCCCGCTGTATCCCGGGGGCGGGGAGGGATGGCCCGGGCTCTGCGCCTCCCTAGTCCGTTGCGTGTCCGTTGCGTGTGCCCCCTCCACCGCGCGGCCCGGCGCATCGGCCCCGGCGCTCCTGGCATCACCCGGGCTGAAGCTCATTCCCGGGATTTAACCAGCGAAACCCTTTCTAGGCAGGCCCGCAGGCAGAATGAGTGCCGGCCGAGCCCTCACGGAGCCGGAGGTGCCGGGGGGATGCGGGACCGGAGCCAGGAGGCTCCAGCCCCCATGGCCGCCGCATCCTTCGCCGTGCGGGGCTCTCCGGGCATCGCCGCGCCTCGTCCTGCGGCCCTTTCGGCACTGGCACAGTTCCGTGTGCTGCTCAATGTCCGGAACATCATTTGTCGCATGCAGGAGTATTTTTCATCGGTAGAAAATGCTCTGGCAGTTACTTGCCATAGAGCATGTTATGCTTGTGTACATGAGTTTTGGTTTAGATAATAATAATTTAAGGGCGGAATGAATGTGACTGTTCATGACAGTGTTTTAATATTCTCCATCTAAAGAAGTTAAACGTGTTGTCCTCAGTACCGCAAAGAAAGAACATTCAATAACAAGTTCTCAGCATTATGGATCTCACTATTTATTATTTAGTACTCAAGACCATGTGGTAATAAAGGGAAATAATGCACACCTATATATGTACTTCTTGCAGTCTTTGAAGCTTTTACCCATACTGGTAGTACAAGTTAAAAAGCTGTCAAAATTCTAATAAAATGTTTATATCCACAGTCTTACTTT

# Fetch the **amino acid** sequence of the canonical transcript of a gene, or the **amino acid** sequences corresponding to all its known protein isoforms.

In [None]:
# Get AA sequence of canonical transcript
transcript_fasta = gget.seq(gene_ID, seqtype="transcript")
transcript_fasta

Mon May  2 21:35:48 2022 INFO Requesting amino acid sequence of the canonical transcript ENSTGUT00000027003 of gene ENSTGUG00000006139 from UniProt.


['>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 167',
 'MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS']

In [None]:
# Get AA sequences of all isoforms
gget.seq(gene_ID, seqtype="transcript", isoforms=True)

Mon May  2 21:35:51 2022 INFO Requesting amino acid sequences of all transcripts of gene ENSTGUG00000006139 from UniProt.


['>ENSTGUT00000006367 uniprot_id: H0Z6V5 ensembl_id: ENSTGUT00000006367 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 156',
 'MAARRPRSASDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS',
 '>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 167',
 'MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS']

Note: If the isoform option is used with a transcript ID, gget simply fetches the sequence of the specified transcript and notifies the user that the isoform option only applies to genes:

In [None]:
gget.seq("ENST00000334527", seqtype="transcript", isoforms=True)

Mon May  2 21:35:54 2022 INFO Requesting amino acid sequence of ENST00000334527 from UniProt.


['>ENST00000334527 uniprot_id: Q9GZM8 ensembl_id: ENST00000334527 gene_name(s): NDEL1 EOPA MITAP1 NUDEL organism: Homo sapiens (Human) sequence_length: 345',
 'MDGEDIPDFSSLKEETAYWKELSLKYKQSFQEARDELVEFQEGSRELEAELEAQLVQAEQRNRDLQADNQRLKYEVEALKEKLEHQYAQSYKQVSVLEDDLSQTRAIKEQLHKYVRELEQANDDLERAKRATIVSLEDFEQRLNQAIERNAFLESELDEKESLLVSVQRLKDEARDLRQELAVRERQQEVTRKSAPSSPTLDCEKMDSAVQASLSLPATPVGKGTENTFPSPKAIPNGFGTSPLTPSARISALNIVGDLLRKVGALESKLAACRNFAKDQASRKSYISGNVNCGVLNGNGTKFSRSGHTSFFDKGAVNGFDPAPPPPGLGSSRPSSAPGMLPLSV']

# BLAST the gene **nucleotide** sequence:

In [None]:
# Note: 'wrap_text' displays the data frame with wrapped text for easier reading
df = gget.blast(gene_fasta[1], wrap_text=True)

Mon May  2 21:35:55 2022 INFO Sequence recognized as nucleotide sequence.
Mon May  2 21:35:55 2022 INFO BLAST will use program 'blastn' with database 'nt'.
Mon May  2 21:35:56 2022 INFO BLAST initiated with search ID 70KNWMY4016. Estimated time to completion: 37 seconds.
Mon May  2 21:36:34 2022 INFO BLASTING...
Mon May  2 21:37:35 2022 INFO BLASTING...
Mon May  2 21:38:37 2022 INFO BLASTING...
Mon May  2 21:39:38 2022 INFO BLASTING...
Mon May  2 21:40:40 2022 INFO BLASTING...
Mon May  2 21:41:41 2022 INFO BLASTING...
Mon May  2 21:42:43 2022 INFO Retrieving results...


Unnamed: 0,Description,Scientific Name,Common Name,Taxid,Max Score,Total Score,Query Cover,E value,Per. Ident,Acc. Len,Accession
0,"Aquila chrysaetos chrysaetos genome assembly, chromosome: 7",Aquila chrysaetos chrysaetos,,223781,2259,6189,56%,0.0,79.59%,47779391,LR606187.1
1,"Accipiter gentilis genome assembly, chromosome: 32",Accipiter gentilis,Northern goshawk,8957,2143,5579,55%,0.0,78.68%,21169547,OV839393.1
2,"PREDICTED: Motacilla alba alba FUN14 domain containing 1 (FUNDC1), transcript variant X1, mRNA",Motacilla alba alba,,1094192,1194,2525,14%,0.0,85.33%,2172,XM_038145982.1
3,"PREDICTED: Lonchura striata domestica FUN14 domain containing 1 (FUNDC1), transcript variant X2, mRNA",Lonchura striata domestica,Bengalese finch,299123,1151,2162,8%,0.0,99.37%,1184,XM_031507429.1
4,"PREDICTED: Lonchura striata domestica FUN14 domain containing 1 (FUNDC1), transcript variant X1, mRNA",Lonchura striata domestica,Bengalese finch,299123,1151,1909,7%,0.0,99.37%,1076,XM_021538647.2
5,"PREDICTED: Taeniopygia guttata FUN14 domain containing 1 (FUNDC1), transcript variant X2, mRNA",Taeniopygia guttata,zebra finch,59729,1134,1909,7%,0.0,100.00%,1087,XM_002190180.6
6,"PREDICTED: Taeniopygia guttata FUN14 domain containing 1 (FUNDC1), transcript variant X1, mRNA",Taeniopygia guttata,zebra finch,59729,1134,2092,7%,0.0,100.00%,1123,XM_032750485.2
7,"PREDICTED: Pyrgilauda ruficollis FUN14 domain containing 1 (FUNDC1), mRNA",Pyrgilauda ruficollis,rufous-necked snowfinch,221976,1040,1776,7%,0.0,96.51%,1086,XM_041487908.1
8,"PREDICTED: Geospiza fortis FUN14 domain containing 1 (FUNDC1), transcript variant X2, mRNA",Geospiza fortis,medium ground-finch,48883,1031,1635,7%,0.0,95.92%,1184,XM_031064164.1
9,"PREDICTED: Geospiza fortis FUN14 domain containing 1 (FUNDC1), transcript variant X1, mRNA",Geospiza fortis,medium ground-finch,48883,1031,1636,7%,0.0,95.92%,1120,XM_031064163.1
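
BLAST results saved as CSV (for example with `-o gene_blast.csv` in the terminal version) can be filtered with the standard library. The sketch below keeps hits with at least 99% identity. It assumes the `Per. Ident` column holds strings ending in `%`, as displayed above, and reproduces only a subset of the columns for three of the rows.

```python
import csv
import io

# Three rows from the BLAST result table above (subset of columns).
blast_csv = '''Description,Scientific Name,Per. Ident,Accession
"PREDICTED: Lonchura striata domestica FUN14 domain containing 1 (FUNDC1), transcript variant X2, mRNA",Lonchura striata domestica,99.37%,XM_031507429.1
"PREDICTED: Taeniopygia guttata FUN14 domain containing 1 (FUNDC1), transcript variant X2, mRNA",Taeniopygia guttata,100.00%,XM_002190180.6
"PREDICTED: Pyrgilauda ruficollis FUN14 domain containing 1 (FUNDC1), mRNA",Pyrgilauda ruficollis,96.51%,XM_041487908.1
'''

def percent_to_float(value):
    """Convert a string such as '99.37%' to the float 99.37."""
    return float(value.rstrip("%"))

hits = list(csv.DictReader(io.StringIO(blast_csv)))
close_hits = [hit for hit in hits if percent_to_float(hit["Per. Ident"]) >= 99.0]

for hit in close_hits:
    print(hit["Accession"], hit["Scientific Name"], hit["Per. Ident"])
```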


# BLAST the **amino acid** sequence of the canonical transcript:

In [None]:
df = gget.blast(transcript_fasta[1], wrap_text=True)

Mon May  2 21:42:43 2022 INFO Sequence recognized as amino acid sequence.
Mon May  2 21:42:43 2022 INFO BLAST will use program 'blastp' with database 'nr'.
Mon May  2 21:42:44 2022 INFO BLAST initiated with search ID 70M2MGTB013. Estimated time to completion: 41 seconds.
Mon May  2 21:43:26 2022 INFO BLASTING...
Mon May  2 21:44:29 2022 INFO Retrieving results...


Unnamed: 0,Description,Scientific Name,Common Name,Taxid,Max Score,Total Score,Query Cover,E value,Per. Ident,Acc. Len,Accession
0,FUN14 domain-containing protein 1 isoform X1 [Taeniopygia guttata],Taeniopygia guttata,zebra finch,59729,345,345,100%,6e-120,100.00%,167,XP_032606376.2
1,FUN14 domain-containing protein 1 isoform X2 [Lonchura striata domestica],Lonchura striata domestica,Bengalese finch,299123,341,341,100%,4e-118,98.20%,167,XP_031363289.1
2,FUN14 domain-containing protein 1 isoform X1 [Motacilla alba alba],Motacilla alba alba,,1094192,340,340,100%,9e-118,97.60%,182,XP_038001910.1
3,FUN14 domain-containing protein 1 isoform X2 [Geospiza fortis],Geospiza fortis,medium ground-finch,48883,335,335,100%,1.9999999999999999e-115,96.41%,203,XP_030920024.1
4,FUN14 domain-containing protein 1 isoform X2 [Molothrus ater],Molothrus ater,,84834,331,331,100%,2e-114,97.01%,165,XP_036241792.1
5,FUN14 domain-containing protein 1 isoform X1 [Parus major],Parus major,Great Tit,9157,328,328,100%,3e-113,95.81%,165,XP_033367835.1
6,FUN14 domain-containing protein 1 [Egretta garzetta],Egretta garzetta,little egret,188379,316,316,100%,2e-107,92.22%,216,XP_035757750.1
7,FUND1 protein [Tachuris rubrigastra],Tachuris rubrigastra,,495162,306,306,91%,8e-105,96.73%,156,NWR31777.1
8,FUND1 protein [Donacobius atricapilla],Donacobius atricapilla,,237420,305,305,89%,2e-104,98.00%,156,NXB69828.1
9,FUN14 domain-containing protein 1 isoform X2 [Phasianus colchicus],Phasianus colchicus,Ring-necked pheasant,9054,304,304,93%,2e-103,94.23%,180,XP_031450465.1


# Use the MUSCLE algorithm to align the **nucleotide** sequences of all transcripts:

This example uses the .fa files generated by the terminal commands above. Unlike `blast`, which also accepts a sequence passed as a string, `muscle` only accepts .fa files as input. In Jupyter Lab, compatible .fa files can be generated using the `save=True` option of `seq()`.
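
As shown in the outputs above, `seq()` returns a list of alternating header and sequence strings. If such a list is already in memory, it could also be written to a .fa file by hand. The following is a sketch under that assumption; `write_fasta`, the file name, and the shortened sequences are hypothetical.

```python
import os
import tempfile

def write_fasta(lines, path):
    """Write a list of alternating header/sequence strings to a FASTA file."""
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")

# Shortened stand-in for a list of alternating headers and sequences.
fasta_lines = [
    ">ENSTGUT00000006367",
    "MAARRPRSASDHD",
    ">ENSTGUT00000027003",
    "MLMPGPLRRALGQ",
]

# Write to the system temp directory to avoid cluttering the working directory.
out_path = os.path.join(tempfile.gettempdir(), "example_iso_fasta.fa")
write_fasta(fasta_lines, out_path)
print(out_path)
```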

In [None]:
gget.muscle("gene_iso_fasta.fa")

Mon May  2 21:44:29 2022 INFO MUSCLE compiled. 
Mon May  2 21:44:29 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 13750, max 14321

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
tcmalloc: large alloc 3775569920 bytes == 0x556e2e44c000 @  0x7f890e1fe1e7 0x556e2d407017 0x556e2d3bf993 0x556e2d3f6957 0x7f890d882edf 0x556e2d3f8fe3 0x556e2d3b4c6d 0x556e2d3b53fd 0x556e2d3ad1f7 0x7f890d06ec87 0x556e2d3b2e2a
tcmalloc: large alloc 3775569920 bytes == 0x556f0f4f6000 @  0x7f890e1fe1e7 0x556e2d407017 0x556e2d3bf99f 0x556e2d3f6957 0x7f890d882edf 0x556e2d3f8fe3 0x556e2d3b4c6d 0x556e2d3b53fd 0x556e2d3ad1f7 0x7f890d06ec87 0x556e2d3b2e2a
00:33 8.4Gb   100.0% UPGMA5         
Mon May  2 21:45:03 2022 INFO MUSCLE alignment complete. Alignment time: 34.2 seconds




ENSTGUT00000006367 CTCAGCACCGGCCAACATGGCGGCGCGGCGGCCCCGC

# Use the MUSCLE algorithm to align the **amino acid** sequences of all transcripts:

In [None]:
gget.muscle("transcript_iso_fasta.fa")

Mon May  2 21:45:03 2022 INFO MUSCLE compiled. 
Mon May  2 21:45:03 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 162, max 167

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
00:00 48Mb    100.0% UPGMA5         
Mon May  2 21:45:03 2022 INFO MUSCLE alignment complete. Alignment time: 0.07 seconds




ENSTGUT00000006367 MAARRPR-----------SASDHDSDDDSYEVLDLTE