<a href="https://colab.research.google.com/github/pachterlab/gget_examples/blob/main/gget_workflow_terminal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# `gget` tutorial for use from the terminal

gget features:
- `gget ref` Fetch FTPs for reference genomes and annotations by species.
- `gget search` Fetch gene and transcript IDs from Ensembl using free-form search terms.
- `gget info` Fetch gene and transcript metadata using Ensembl IDs.
- `gget seq` Fetch nucleotide or amino acid sequences of genes or transcripts.
- `gget blast` BLAST a nucleotide or amino acid sequence against any BLAST database.
- `gget muscle` Align multiple nucleotide or amino acid sequences against each other.
- `gget enrichr` Perform an enrichment analysis on a list of genes using Enrichr.

___

In [None]:
# For pretty plots
%config InlineBackend.figure_format='retina'

Install gget from source:

In [None]:
!git clone https://github.com/pachterlab/gget.git

Cloning into 'gget'...
remote: Enumerating objects: 1892, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 1892 (delta 9), reused 16 (delta 5), pack-reused 1871
Receiving objects: 100% (1892/1892), 133.31 MiB | 21.87 MiB/s, done.
Resolving deltas: 100% (1176/1176), done.


In [None]:
# !git clone https://github.com/pachterlab/gget.git
!pip install mysql-connector-python -q
!cd gget && pip install . -q

     |████████████████████████████████| 25.2 MB 1.8 MB/s 
  DEPRECATION: A future pip version will change local packages to be built in-place without first copying to a temporary directory. We recommend you use --use-feature=in-tree-build to test your packages with this new behavior before it becomes the default.
   pip 21.3 will remove support for this functionality. You can find discussion regarding this at https://github.com/pypa/pip/issues/7555.
  Building wheel for gget (setup.py) ... done


___


<h1><center>Terminal version</center></h1>
<center>Jupyter lab version below.</center>


In [None]:
!gget

Wed May  4 16:31:13 2022 INFO NumExpr defaulting to 2 threads.
usage: gget [-h] [-v] {ref,search,info,seq,muscle,blast,blat,enrichr} ...

gget v0.0.18

positional arguments:
  {ref,search,info,seq,muscle,blast,blat,enrichr}
    ref                 Fetch FTPs for reference genomes and annotations by
                        species.
    search              Fetch gene and transcript IDs from Ensembl using free-
                        form search terms.
    info                Fetch gene and transcript metadata using Ensembl IDs.
    seq                 Fetch nucleotide or amino acid sequence (FASTA) of a
                        gene (and all isoforms) or transcript by Ensembl ID.
    muscle              Align multiple nucleotide or amino acid sequences
                        against each other (using the Muscle v5 algorithm).
    blast               BLAST a nucleotide or amino acid sequence against any
                        BLAST DB.
    blat                BLAT a nucleotide or amino 

In [None]:
# Show detailed help page (uncomment to run)
# !gget -h

___
At the time of writing, the latest Ensembl release is 106. `gget ref` and `gget search` fetch from the latest release by default unless an earlier release is specified (all other gget modules are release-independent):

In [None]:
!gget ref -s human -w gtf

Wed May  4 16:31:14 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:31:20 2022 INFO Fetching reference information for homo_sapiens from Ensembl release: 106.
{
    "homo_sapiens": {
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "23:27",
            "bytes": "51379459"
        }
    }
}
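
Because `gget ref` prints JSON, its output can be consumed programmatically. The sketch below parses the JSON shown above with Python's standard `json` module and extracts the GTF link. The variable names are illustrative; in practice the string would come from the command's captured output or from a saved file.

```python
import json

# JSON copied from the `gget ref -s human -w gtf` output above.
ref_json = """
{
    "homo_sapiens": {
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "23:27",
            "bytes": "51379459"
        }
    }
}
"""

ref = json.loads(ref_json)
gtf = ref["homo_sapiens"]["annotation_gtf"]

# "bytes" is reported as a string, so convert it before doing arithmetic.
size_mb = int(gtf["bytes"]) / 1e6

print(gtf["ftp"])
print(f"Ensembl release {gtf['ensembl_release']}, {size_mb:.1f} MB")
```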


Show newly available genomes in the latest Ensembl release (compared to previous release 105):

In [None]:
!comm -13 <(gget ref -l -r 105 | sort) <(gget ref -l | sort)

Wed May  4 16:31:21 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:31:21 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:31:25 2022 INFO Fetching available genomes in Ensembl release 106 (latest).
Wed May  4 16:31:25 2022 INFO Fetching available genomes in Ensembl release 105.
cyprinus_carpio_carpio
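
`comm -13` suppresses lines unique to the first input and lines common to both, leaving only the lines unique to the second input. The same comparison can be sketched in Python as a set difference. The species lists below are short, hypothetical stand-ins for the full `gget ref -l` output; only `cyprinus_carpio_carpio` is taken from the result above.

```python
def new_entries(previous, latest):
    """Return the sorted entries present in `latest` but not in `previous`."""
    return sorted(set(latest) - set(previous))

# Hypothetical excerpts of the species lists for two releases.
release_105 = ["danio_rerio", "homo_sapiens", "taeniopygia_guttata"]
release_106 = [
    "cyprinus_carpio_carpio",
    "danio_rerio",
    "homo_sapiens",
    "taeniopygia_guttata",
]

print(new_entries(release_105, release_106))
```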


___

# Find gene IDs based on free-form search words:
Here we search for 'fun' genes in the zebra finch genome. Passing `tae` as the species is sufficient, because no other species in the database begins with those letters.

In [None]:
!gget search -sw fun -s tae

Wed May  4 16:31:27 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:31:30 2022 INFO Fetching results from database: taeniopygia_guttata_core_106_12
Wed May  4 16:31:32 2022 INFO Query time: 2.77 seconds
Wed May  4 16:31:32 2022 INFO Matches found: 14
ensembl_id,gene_name,ensembl_description,ext_ref_description,biotype,url
ENSTGUG00000003915,AIMP1,aminoacyl tRNA synthetase complex interacting multifunctional protein 1 [Source:NCBI gene;Acc:100227419],aminoacyl tRNA synthetase complex interacting multifunctional protein 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000003915
ENSTGUG00000004896,MFHAS1,malignant fibrous histiocytoma amplified sequence 1 [Source:NCBI gene;Acc:100217808],multifunctional ROCO family signaling regulator 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000004896
ENSTGUG00000004956,BFAR,bifunctional apoptosis regulator [Source:NCBI gene;Acc:100223595],bifunctional apoptosis
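
The species shorthand works because `tae` matches exactly one database. One way such unique-prefix matching could be implemented is sketched below. `resolve_prefix` is a hypothetical helper, not gget's actual code, and the database names other than the one logged above are included only for illustration.

```python
def resolve_prefix(prefix, names):
    """Return the single name starting with `prefix`; raise if none or several match."""
    matches = [name for name in names if name.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"{prefix!r} matches {len(matches)} names: {matches}")
    return matches[0]

databases = [
    "danio_rerio_core_106_11",          # illustrative entry
    "homo_sapiens_core_106_38",         # illustrative entry
    "taeniopygia_guttata_core_106_12",  # from the log output above
]

print(resolve_prefix("tae", databases))
```

Raising an error on ambiguous or empty matches, rather than silently picking the first hit, keeps a too-short prefix from selecting the wrong species.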

# Use Enrichr to perform an enrichment analysis on a list of genes

In [None]:
!gget enrichr --genes AIMP1 MFHAS1 BFAR FUNDC1 AIMP2 ASF1A -db pathway

Wed May  4 16:31:33 2022 INFO NumExpr defaulting to 2 threads.
rank,path_name,p_val,z_score,combined_score,overlapping_genes,adj_p_val,database
1,Cytosolic tRNA aminoacylation,2.063922922489771e-05,453.90909090909093,4896.9151445712,"['AIMP1', 'AIMP2']",0.00014447460457428396,BioPlanet_2019
2,Transfer RNA aminoacylation,7.71711962436819e-05,226.70454545454547,2146.775128526375,"['AIMP1', 'AIMP2']",0.00027009918685288667,BioPlanet_2019
3,Keap1-Nrf2 pathway,0.0038940964853966904,333.03333333333336,1847.7667100463034,['AIMP2'],0.0089070475437485,BioPlanet_2019
4,Osteoclast signaling,0.005089741453570572,249.725,1318.6799159042428,['AIMP2'],0.0089070475437485,BioPlanet_2019
5,Gene expression,0.030821052881120598,9.848861283643892,34.26967705214415,"['AIMP1', 'AIMP2']",0.037978947440081436,BioPlanet_2019
6,Leptin influence on immune response,0.0325533835200698,36.48623853211009,124.96076851480254,['AIMP1'],0.037978947440081436,BioPlanet_2019
7,Interleukin-2 signaling pathway,0.2286939936153
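
The CSV printed by `gget enrichr` can be post-processed with the Python standard library. The sketch below reads the six complete rows shown above, converts the `overlapping_genes` column (a Python-style list literal) with `ast.literal_eval`, and keeps pathways below an adjusted p-value cutoff. The 0.01 cutoff and the variable names are arbitrary choices for illustration.

```python
import ast
import csv
import io

# Header and complete rows copied from the output above.
enrichr_csv = """rank,path_name,p_val,z_score,combined_score,overlapping_genes,adj_p_val,database
1,Cytosolic tRNA aminoacylation,2.063922922489771e-05,453.90909090909093,4896.9151445712,"['AIMP1', 'AIMP2']",0.00014447460457428396,BioPlanet_2019
2,Transfer RNA aminoacylation,7.71711962436819e-05,226.70454545454547,2146.775128526375,"['AIMP1', 'AIMP2']",0.00027009918685288667,BioPlanet_2019
3,Keap1-Nrf2 pathway,0.0038940964853966904,333.03333333333336,1847.7667100463034,['AIMP2'],0.0089070475437485,BioPlanet_2019
4,Osteoclast signaling,0.005089741453570572,249.725,1318.6799159042428,['AIMP2'],0.0089070475437485,BioPlanet_2019
5,Gene expression,0.030821052881120598,9.848861283643892,34.26967705214415,"['AIMP1', 'AIMP2']",0.037978947440081436,BioPlanet_2019
6,Leptin influence on immune response,0.0325533835200698,36.48623853211009,124.96076851480254,['AIMP1'],0.037978947440081436,BioPlanet_2019
"""

rows = list(csv.DictReader(io.StringIO(enrichr_csv)))

# Keep (pathway, gene list) pairs whose adjusted p-value is below the cutoff.
significant = [
    (row["path_name"], ast.literal_eval(row["overlapping_genes"]))
    for row in rows
    if float(row["adj_p_val"]) < 0.01
]

for name, genes in significant:
    print(name, genes)
```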

# Fetch additional information about genes/transcripts (like the IDs of all known transcripts of a gene):

In [None]:
# Show short info on a few of the genes
!gget info -id ENSTGUG00000006139 ENSTGUG00000026050 ENSTGUG00000004956

Wed May  4 16:31:35 2022 INFO NumExpr defaulting to 2 threads.
uniprot_id,primary_gene_name,ensembl_gene_name,synonyms,parent_gene,protein_names,uniprot_description,ensembl_description,object_type,biotype,canonical_transcript,species,assembly_name,seq_region_name,strand,start,end
A0A674GVD2,FUNDC1,FUNDC1,['FUNDC1'],,Uncharacterized protein,,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],Gene,protein_coding,ENSTGUT00000027003.1,taeniopygia_guttata,bTaeGut1_v1.p,1,-1,107513786,107528106
A0A674GIX6,LOC115492155,TRMT112,['LOC115492155'],,Multifunctional methyltransferase subunit TRM112-like protein (tRNA methyltransferase 112 homolog),,multifunctional methyltransferase subunit TRM112-like protein [Source:NCBI gene;Acc:115492155],Gene,protein_coding,ENSTGUT00000042451.1,taeniopygia_guttata,bTaeGut1_v1.p,RRCB01000041.1,1,484672,487065
H0Z3G6,BFAR,BFAR,['BFAR'],,Uncharacterized protein,,bifunctional apoptosis regulator [Source:NCBI gene;Acc:100223595],Gene,protein_coding,ENSTGUT00

In [None]:
# Expand info to show all transcripts
!gget info -id ENSTGUG00000006139 -e

Wed May  4 16:31:44 2022 INFO NumExpr defaulting to 2 threads.
uniprot_id,primary_gene_name,ensembl_gene_name,synonyms,parent_gene,protein_names,uniprot_description,ensembl_description,object_type,biotype,canonical_transcript,species,assembly_name,seq_region_name,strand,start,end,all_transcripts,transcript_biotypes,transcript_names,all_exons,exon_starts,exon_ends,all_translations,translation_starts,translation_ends
A0A674GVD2,FUNDC1,FUNDC1,['FUNDC1'],,Uncharacterized protein,,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],Gene,protein_coding,ENSTGUT00000027003.1,taeniopygia_guttata,bTaeGut1_v1.p,1,-1,107513786,107528106,"['ENSTGUT00000006367', 'ENSTGUT00000027003']","['protein_coding', 'protein_coding']","['FUNDC1-201', 'FUNDC1-202']",,,,,,


# Fetch the **nucleotide** sequence of a gene, or the **nucleotide** sequences corresponding to all its known protein isoforms.

In [None]:
!gget seq -id ENSTGUG00000006139 -o gene_fasta.fa

Wed May  4 16:31:49 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:31:50 2022 INFO Requesting nucleotide sequence of ENSTGUG00000006139 from Ensembl.


In [None]:
!gget seq -id ENSTGUG00000006139 -iso -o gene_iso_fasta.fa

Wed May  4 16:31:51 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:31:53 2022 INFO Requesting nucleotide sequences of all transcripts of ENSTGUG00000006139 from Ensembl.


# Fetch the **amino acid** sequence of the canonical transcript of a gene, or the **amino acid** sequences corresponding to all its known protein isoforms.

In [None]:
# Get amino acid (AA) sequence of canonical transcript
!gget seq -id ENSTGUG00000006139 -st transcript -o transcript_fasta.fa

Wed May  4 16:31:55 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:31:57 2022 INFO Requesting amino acid sequence of the canonical transcript ENSTGUT00000027003 of gene ENSTGUG00000006139 from UniProt.


In [None]:
# Get AA sequences of all isoforms
!gget seq -id ENSTGUG00000006139 -st transcript -iso -o transcript_iso_fasta.fa

Wed May  4 16:31:59 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:32:01 2022 INFO Requesting amino acid sequences of all transcripts of gene ENSTGUG00000006139 from UniProt.


Note: If the isoform option (`-iso`) is used with a transcript ID, gget simply fetches the sequence of the specified transcript and notifies the user that the isoform option only applies to genes:

In [None]:
!gget seq -id ENSTGUT00000027003.1 -st transcript -iso

Wed May  4 16:32:03 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:32:05 2022 INFO Requesting amino acid sequence of ENSTGUT00000027003 from UniProt.
>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 167
MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS
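
The FASTA record printed above carries metadata in its header line. A minimal sketch of reading such a record and checking the sequence against the `sequence_length` field follows. `parse_fasta_record` is a hypothetical helper, and the header layout is assumed to match the output shown here.

```python
import re

def parse_fasta_record(text):
    """Split a single FASTA record into (header, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("FASTA record must start with '>'")
    # Sequence lines may be wrapped, so join everything after the header.
    return lines[0][1:], "".join(lines[1:])

# Record copied from the output above.
record = """>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 167
MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS
"""

header, sequence = parse_fasta_record(record)
declared = int(re.search(r"sequence_length: (\d+)", header).group(1))

print(header.split()[0], len(sequence), len(sequence) == declared)
```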


# BLAST the gene **nucleotide** sequence:

Note: `blast` also accepts a sequence passed as a string instead of a .fa file.

In [None]:
!gget blast -s gene_fasta.fa -o gene_blast.csv

Wed May  4 16:32:07 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:32:07 2022 INFO Sequence recognized as nucleotide sequence.
Wed May  4 16:32:07 2022 INFO BLAST will use program 'blastn' with database 'nt'.
Wed May  4 16:32:08 2022 INFO BLAST initiated with search ID 75AM8S6W013. Estimated time to completion: 26 seconds.
Wed May  4 16:32:35 2022 INFO BLASTING...
Wed May  4 16:33:37 2022 INFO Retrieving results...


# BLAST the **amino acid** sequence of the canonical transcript:

In [None]:
!gget blast -s transcript_fasta.fa -o transcript_blast.csv

Wed May  4 16:33:38 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:33:39 2022 INFO Sequence recognized as amino acid sequence.
Wed May  4 16:33:39 2022 INFO BLAST will use program 'blastp' with database 'nr'.
Wed May  4 16:33:40 2022 INFO BLAST initiated with search ID 75AR3ZP6013. Estimated time to completion: 26 seconds.
Wed May  4 16:34:06 2022 INFO BLASTING...
Wed May  4 16:35:07 2022 INFO BLASTING...
Wed May  4 16:36:09 2022 INFO BLASTING...
Wed May  4 16:37:10 2022 INFO BLASTING...
Wed May  4 16:38:13 2022 INFO Retrieving results...
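
In both BLAST runs, the log reports whether the input was recognized as a nucleotide or an amino acid sequence, and the program and database follow from that. How gget makes this decision is not shown here. The sketch below is one simple heuristic, using the hypothetical function `guess_sequence_type`: a sequence is treated as nucleotide if it contains only the letters A, C, G, T, U, and N. The two example sequences are short made-up strings.

```python
NUCLEOTIDE_LETTERS = set("ACGTUN")

def guess_sequence_type(sequence):
    """Return 'nucleotide' or 'amino acid' based on the letters in `sequence`."""
    letters = set(sequence.upper()) - {"\n", " ", "-"}
    if not letters:
        raise ValueError("empty sequence")
    return "nucleotide" if letters <= NUCLEOTIDE_LETTERS else "amino acid"

# Program/database pairs as reported in the log output above.
DEFAULTS = {
    "nucleotide": ("blastn", "nt"),
    "amino acid": ("blastp", "nr"),
}

for seq in ["ATGGCGGCGCGGAGG", "MLMPGPLRRALGQKF"]:
    kind = guess_sequence_type(seq)
    print(kind, DEFAULTS[kind])
```

A letter-based check like this misclassifies a peptide that happens to consist only of A, C, G, and T, so a real tool may need a stricter rule or an explicit override.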


# Use the MUSCLE algorithm to align the **nucleotide** sequences of all transcripts:
Returns an aligned FASTA (.afa) file.

In [None]:
# For long or numerous sequences, use the super5 algorithm (flag -s5) to reduce memory usage
# To save the results to a file, add the flag -o
!gget muscle -fa gene_iso_fasta.fa

Wed May  4 16:38:14 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:38:15 2022 INFO MUSCLE compiled. 
Wed May  4 16:38:15 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 13750, max 14321

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
tcmalloc: large alloc 3775569920 bytes == 0x55eceed12000 @  0x7f8f7b41b1e7 0x55eced8b3017 0x55eced86b993 0x55eced8a2957 0x7f8f7aa9fedf 0x55eced8a4fe3 0x55eced860c6d 0x55eced8613fd 0x55eced8591f7 0x7f8f7a28bc87 0x55eced85ee2a
tcmalloc: large alloc 3775569920 bytes == 0x55edcfdbc000 @  0x7f8f7b41b1e7 0x55eced8b3017 0x55eced86b99f 0x55eced8a2957 0x7f8f7aa9fedf 0x55eced8a4fe3 0x55eced860c6d 0x55eced8613fd 0x55eced8591f7 0x7f8f7a28bc87 0x55eced85ee2a
00:32 8.4Gb   100.0% UPGMA5         
Wed May  4 16:38:48 2022 INFO MUSCLE alignment complete. Alignment time: 33.72 second
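
An aligned FASTA (.afa) file pads sequences with `-` so that all rows have equal length, which makes column-wise comparisons straightforward. As a rough sketch, the snippet below computes the percent identity between two aligned rows. The two short sequences are invented for illustration and are not taken from the alignment above.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical columns, ignoring columns where either row has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0

# Toy aligned rows (hypothetical data).
row_1 = "ATG-CGTA"
row_2 = "ATGACGAA"

print(f"{percent_identity(row_1, row_2):.1f}% identity")
```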

# Use the MUSCLE algorithm to align the **amino acid** sequences of all transcripts:

In [None]:
!gget muscle -fa transcript_iso_fasta.fa

Wed May  4 16:38:53 2022 INFO NumExpr defaulting to 2 threads.
Wed May  4 16:38:53 2022 INFO MUSCLE compiled. 
Wed May  4 16:38:53 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 162, max 167

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
00:00 48Mb    100.0% UPGMA5         
Wed May  4 16:38:53 2022 INFO MUSCLE alignment complete. Alignment time: 0.02 seconds


ENSTGUT00000006367 MAARRPR---------