<a href="https://colab.research.google.com/github/pachterlab/gget_examples/blob/main/gget_workflow_terminal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: `gget` in the terminal

gget modules:
- `gget ref` Fetch FTPs and metadata for reference genomes and annotations from [Ensembl](https://www.ensembl.org/) by species.
- `gget search` Fetch genes and transcripts from [Ensembl](https://www.ensembl.org/) using free-form search terms.
- `gget info` Fetch extensive gene and transcript metadata from [Ensembl](https://www.ensembl.org/), [UniProt](https://www.uniprot.org/), and [NCBI](https://www.ncbi.nlm.nih.gov/) using Ensembl IDs.
- `gget seq` Fetch nucleotide or amino acid sequences of genes or transcripts from [Ensembl](https://www.ensembl.org/) or [UniProt](https://www.uniprot.org/), respectively.
- `gget blast` BLAST a nucleotide or amino acid sequence against any [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) database.
- `gget blat` Find the genomic location of a nucleotide or amino acid sequence using [BLAT](https://genome.ucsc.edu/cgi-bin/hgBlat).
- `gget muscle` Align multiple nucleotide or amino acid sequences against each other using [Muscle5](https://www.drive5.com/muscle/).
- `gget enrichr` Perform an enrichment analysis on a list of genes using [Enrichr](https://maayanlab.cloud/Enrichr/).
- `gget archs4` Find the most correlated genes or the tissue expression atlas of a gene of interest using [ARCHS4](https://maayanlab.cloud/archs4/).

___

In [None]:
# For pretty plots
%config InlineBackend.figure_format='retina'

Install gget:

In [None]:
!pip install gget -q


___


<h1><center>Terminal version</center></h1>
<center>Jupyter lab version below.</center>


In [None]:
!gget

Tue May 10 01:43:13 2022 INFO NumExpr defaulting to 2 threads.
usage: gget [-h] [-v]
            {ref,search,info,seq,muscle,blast,blat,enrichr,archs4} ...

gget v0.0.19

positional arguments:
  {ref,search,info,seq,muscle,blast,blat,enrichr,archs4}
    ref                 Fetch FTPs for reference genomes and annotations by
                        species.
    search              Fetch gene and transcript IDs from Ensembl using free-
                        form search terms.
    info                Fetch gene and transcript metadata using Ensembl IDs.
    seq                 Fetch nucleotide or amino acid sequence (FASTA) of a
                        gene (and all isoforms) or transcript by Ensembl ID.
    muscle              Align multiple nucleotide or amino acid sequences
                        against each other (using the Muscle v5 algorithm).
    blast               BLAST a nucleotide or amino acid sequence against any
                        BLAST DB.
    blat                B

In [None]:
# # Show detailed help page
# !gget -h

___
Ensembl just released version 106. Note that `gget ref` and `gget search` now fetch from that release automatically unless an earlier release is specified (all other modules are release-independent):

In [None]:
!gget ref -s human -w gtf

Tue May 10 01:43:15 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:43:21 2022 INFO Fetching reference information for homo_sapiens from Ensembl release: 106.
{
    "homo_sapiens": {
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "23:27",
            "bytes": "51379459"
        }
    }
}
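
Because `gget ref` prints JSON, the result is easy to consume from a script. The following is a minimal sketch that parses the output shown above with Python's standard `json` module; the variable names are illustrative and the JSON text is pasted in by hand rather than captured from the command.

```python
import json

# Output of `gget ref -s human -w gtf`, copied from the cell above
ref_output = """
{
    "homo_sapiens": {
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "23:27",
            "bytes": "51379459"
        }
    }
}
"""

ref = json.loads(ref_output)
gtf = ref["homo_sapiens"]["annotation_gtf"]

gtf_url = gtf["ftp"]
# Note: "bytes" is reported as a string, so convert before doing arithmetic
gtf_size_mb = int(gtf["bytes"]) / 1e6

print(gtf_url)
print(f"Release {gtf['ensembl_release']}, {gtf_size_mb:.1f} MB")
```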


Show newly available genomes in the latest Ensembl release (compared to previous release 105):

In [None]:
!comm -13 <(gget ref -l -r 105 | sort) <(gget ref -l | sort)

Tue May 10 01:43:22 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:43:22 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:43:25 2022 INFO Fetching available genomes (GTF and FASTAs present) from Ensembl release 106 (latest).
Tue May 10 01:43:25 2022 INFO Fetching available genomes (GTF and FASTAs present) from Ensembl release 105.
cyprinus_carpio_carpio
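
`comm -13` suppresses the lines unique to the first input and the lines common to both, so only the lines unique to the second input remain. The same comparison can be sketched in Python as a set difference. The two lists below are small illustrative subsets, not the full Ensembl species lists, and `new_entries` is a hypothetical helper name.

```python
def new_entries(previous, latest):
    """Return the entries that appear only in `latest`, sorted (like `comm -13`)."""
    return sorted(set(latest) - set(previous))


# Illustrative subsets of the species lists for the two releases
release_105 = ["homo_sapiens", "taeniopygia_guttata"]
release_106 = ["cyprinus_carpio_carpio", "homo_sapiens", "taeniopygia_guttata"]

print(new_entries(release_105, release_106))
```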


___

# Find gene IDs based on free-form search words:
Search for 'fun' genes in the zebra finch genome. Writing 'tae' is enough to identify the species, because no other genome name begins with those letters.

In [None]:
!gget search -sw fun -s tae

Tue May 10 01:43:26 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:43:28 2022 INFO Fetching results from database: taeniopygia_guttata_core_106_12
Tue May 10 01:43:30 2022 INFO Total matches found: 14.
Tue May 10 01:43:30 2022 INFO Query time: 3.11 seconds.
ensembl_id,gene_name,ensembl_description,ext_ref_description,biotype,url
ENSTGUG00000003915,AIMP1,aminoacyl tRNA synthetase complex interacting multifunctional protein 1 [Source:NCBI gene;Acc:100227419],aminoacyl tRNA synthetase complex interacting multifunctional protein 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000003915
ENSTGUG00000004896,MFHAS1,malignant fibrous histiocytoma amplified sequence 1 [Source:NCBI gene;Acc:100217808],multifunctional ROCO family signaling regulator 1,protein_coding,https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000004896
ENSTGUG00000004956,BFAR,bifunctional apoptosis regulator [Source:NCBI gene;Acc:100223595],bifunctional a
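
The species shortcut used above only works while the abbreviation is unambiguous. As a rough illustration of that idea (this is not gget's actual implementation), the sketch below resolves a prefix against a list of database names and refuses to guess when several match. Only `taeniopygia_guttata_core_106_12` is taken from the log above; the other names and the `resolve_species` helper are hypothetical.

```python
def resolve_species(prefix, databases):
    """Return the single database whose name starts with `prefix`."""
    matches = [db for db in databases if db.startswith(prefix)]
    if len(matches) != 1:
        raise ValueError(f"'{prefix}' matches {len(matches)} databases: {matches}")
    return matches[0]


databases = [
    "taeniopygia_guttata_core_106_12",  # from the log output above
    "takifugu_rubripes_core_106_5",     # hypothetical entry for illustration
    "homo_sapiens_core_106_38",         # hypothetical entry for illustration
]

print(resolve_species("tae", databases))
```

With this list, `resolve_species("ta", databases)` raises a `ValueError`, because two names share that prefix.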

# Use [Enrichr](https://maayanlab.cloud/Enrichr/) to perform an enrichment analysis on a list of genes

In [None]:
!gget enrichr --genes AIMP1 MFHAS1 BFAR FUNDC1 AIMP2 ASF1A -db pathway

Tue May 10 01:43:30 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:43:31 2022 INFO Performing Enichr analysis using databse KEGG_2021_Human. Please note that there might a more appropriate database for your application. Go to https://maayanlab.cloud/Enrichr/#libraries for a full list of supported databases.
rank,path_name,p_val,z_score,combined_score,overlapping_genes,adj_p_val,database
1,Mitophagy,0.02022973917450602,59.483582089552236,232.02175076653654,['FUNDC1'],0.02022973917450602,KEGG_2021_Human
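
The CSV output can be filtered with a few lines of Python. This sketch reads the result shown above and keeps the pathways whose adjusted p-value is below a cutoff. The 0.05 threshold is an assumption chosen for illustration, and the CSV text is pasted in by hand.

```python
import ast
import csv
import io

# Output of the `gget enrichr` call above
enrichr_csv = """rank,path_name,p_val,z_score,combined_score,overlapping_genes,adj_p_val,database
1,Mitophagy,0.02022973917450602,59.483582089552236,232.02175076653654,['FUNDC1'],0.02022973917450602,KEGG_2021_Human
"""

ALPHA = 0.05  # assumed significance cutoff

significant = []
for row in csv.DictReader(io.StringIO(enrichr_csv)):
    if float(row["adj_p_val"]) < ALPHA:
        # The gene list is written as a Python-style list literal
        genes = ast.literal_eval(row["overlapping_genes"])
        significant.append((row["path_name"], genes))

print(significant)
```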


# Find the 100 most correlated genes to a gene of interest or show its tissue expression using the [ARCHS4](https://maayanlab.cloud/archs4/) database

In [None]:
!gget archs4 --gene AIMP1

Tue May 10 01:43:33 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:43:34 2022 INFO Fetching the 100 most correlated genes to AIMP1 from ARCHS4.
gene_symbol,pearson_correlation
MRPL1,0.7187576293945312
ZC3H15,0.7041563391685486
MRPL47,0.6948618292808533
TRMT10C,0.6906108856201172
C8orf59,0.6878266334533691
SNRPB2,0.678638756275177
AK6,0.6754871606826782
NSA2,0.6750699877738953
METTL5,0.6749789714813232
RSL24D1,0.6746876835823059
NIFK,0.6722356677055359
PSMA4,0.6672275066375732
PSMC2,0.6665088534355164
NOP58,0.6644575595855713
SNRPE,0.6627632975578308
SF3B6,0.6607162356376648
EIF2A,0.6597107648849487
METAP2,0.6589860320091248
POLR2K,0.6589668989181519
LTV1,0.65687495470047
DARS,0.6566787958145142
TWISTNB,0.6559195518493652
MRPL13,0.655347466468811
RPF1,0.6551865339279175
EIF3M,0.6549532413482666
MRPL32,0.6546421647071838
TPRKB,0.652922511100769
PSMA3,0.6519926190376282
CWC15,0.651310384273529
SRP72,0.6507214307785034
MRPS35,0.6504343152046204
TAF9,0.6491758823394775
RIOK2,0.647
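
To narrow the list down, the correlation table can be filtered by a minimum Pearson correlation. The sketch below uses the first five rows of the output above; the 0.69 cutoff is an arbitrary value picked for illustration.

```python
import csv
import io

# First rows of the `gget archs4 --gene AIMP1` output above
correlation_csv = """gene_symbol,pearson_correlation
MRPL1,0.7187576293945312
ZC3H15,0.7041563391685486
MRPL47,0.6948618292808533
TRMT10C,0.6906108856201172
C8orf59,0.6878266334533691
"""

MIN_CORRELATION = 0.69  # arbitrary cutoff for illustration

strongly_correlated = [
    row["gene_symbol"]
    for row in csv.DictReader(io.StringIO(correlation_csv))
    if float(row["pearson_correlation"]) >= MIN_CORRELATION
]

print(strongly_correlated)
```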

In [None]:
!gget archs4 --gene AIMP1 --which tissue

Tue May 10 01:43:38 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:43:39 2022 INFO Fetching the tissue expression atlas of AIMP1 from human ARCHS4 data.
id,min,q1,median,q3,max
System.Muscular System.Skeletal muscle.SKELETAL MUSCLE,7.99445,9.22397,9.81827,10.1036,10.6314
System.Digestive System.Pancreas.PANCREATIC ISLET,0.113644,8.83132,9.7887,10.5512,11.5828
System.Immune System.Lymphoid.PLASMA CELL,0.113644,8.94132,9.7726,10.5481,11.4732
System.Digestive System.Pancreas.ALPHA CELL,7.66651,8.9879,9.74156,10.6557,11.5419
System.Immune System.Lymphoid.TLYMPHOCYTE,8.50117,9.19823,9.61646,10.049,10.7136
System.Digestive System.Stomach.GASTRIC EPITHELIAL CELL,8.5976,9.11788,9.58698,10.0557,10.317
System.Integumentary System.Skin.BASAL CELL,8.6597,9.25821,9.58068,9.88286,10.2913
System.Immune System.Thymus.THYMOCYTE,8.70847,9.0947,9.56961,10.1719,10.5715
System.Immune System.Lymphoid.BLYMPHOCYTE,7.88689,9.15784,9.55795,9.87444,12.2162
System.Connective Tissue.Bone.STROMAL CELL,8.3
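
Each `id` in the tissue table is a dot-separated path. In the rows shown above it has four levels, which appear to be root, organ system, organ, and cell type (that reading of the levels is an assumption). The sketch below splits the path and groups the median expression values by organ system, using the first four rows of the output.

```python
import csv
import io
from collections import defaultdict

# First rows of the `gget archs4 --gene AIMP1 --which tissue` output above
tissue_csv = """id,min,q1,median,q3,max
System.Muscular System.Skeletal muscle.SKELETAL MUSCLE,7.99445,9.22397,9.81827,10.1036,10.6314
System.Digestive System.Pancreas.PANCREATIC ISLET,0.113644,8.83132,9.7887,10.5512,11.5828
System.Immune System.Lymphoid.PLASMA CELL,0.113644,8.94132,9.7726,10.5481,11.4732
System.Digestive System.Pancreas.ALPHA CELL,7.66651,8.9879,9.74156,10.6557,11.5419
"""

medians_by_system = defaultdict(dict)
for row in csv.DictReader(io.StringIO(tissue_csv)):
    # Assumed layout of the id: root.organ system.organ.cell type
    _root, system, _organ, cell_type = row["id"].split(".")
    medians_by_system[system][cell_type] = float(row["median"])

for system, cells in medians_by_system.items():
    print(system, cells)
```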

# Fetch additional information about genes/transcripts (like the IDs of all known transcripts of a gene):

In [None]:
# Show short info on a few of the genes
!gget info -id ENSTGUG00000006139 ENSTGUG00000026050 ENSTGUG00000004956

Tue May 10 01:43:40 2022 INFO NumExpr defaulting to 2 threads.
uniprot_id,ncbi_gene_id,species,assembly_name,primary_gene_name,ensembl_gene_name,synonyms,parent_gene,protein_names,ensembl_description,uniprot_description,ncbi_description,object_type,biotype,canonical_transcript,seq_region_name,strand,start,end
"['A0A674GVD2', 'H0Z6V5']",100228946,taeniopygia_guttata,bTaeGut1_v1.p,FUNDC1,FUNDC1,['FUNDC1'],,Uncharacterized protein,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],"[nan, nan]",,Gene,protein_coding,ENSTGUT00000027003.1,1,-1,107513786,107528106
A0A674GIX6,,taeniopygia_guttata,bTaeGut1_v1.p,LOC115492155,TRMT112,['LOC115492155'],,Multifunctional methyltransferase subunit TRM112-like protein (tRNA methyltransferase 112 homolog),multifunctional methyltransferase subunit TRM112-like protein [Source:NCBI gene;Acc:115492155],,,Gene,protein_coding,ENSTGUT00000042451.1,RRCB01000041.1,1,484672,487065
H0Z3G6,100223595,taeniopygia_guttata,bTaeGut1_v1.p,BFAR,BFAR,['BFAR'],,Uncha
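
In the output above, a column such as `uniprot_id` holds either a single value (`A0A674GIX6`) or a Python-style list (`['A0A674GVD2', 'H0Z6V5']`). When post-processing the CSV, it can help to normalize both forms to a list. The following is a small sketch of one way to do that; `as_list` is a hypothetical helper, not part of gget.

```python
import ast


def as_list(cell):
    """Normalize a CSV cell that holds a single value or a list literal."""
    cell = cell.strip()
    if not cell:
        return []
    if cell.startswith("["):
        # Safely evaluate the list literal without executing arbitrary code
        return ast.literal_eval(cell)
    return [cell]


print(as_list("['A0A674GVD2', 'H0Z6V5']"))
print(as_list("A0A674GIX6"))
print(as_list(""))
```

Cells such as `[nan, nan]` are not valid Python literals, so `ast.literal_eval` raises a `ValueError` for them and they would need separate handling.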

In [None]:
# Expand info to show all transcripts
!gget info -id ENSTGUG00000006139 -e

Tue May 10 01:43:55 2022 INFO NumExpr defaulting to 2 threads.
uniprot_id,ncbi_gene_id,species,assembly_name,primary_gene_name,ensembl_gene_name,synonyms,parent_gene,protein_names,ensembl_description,uniprot_description,ncbi_description,object_type,biotype,canonical_transcript,seq_region_name,strand,start,end,all_transcripts,transcript_biotypes,transcript_names,all_exons,exon_starts,exon_ends,all_translations,translation_starts,translation_ends
"['A0A674GVD2', 'H0Z6V5']",100228946,taeniopygia_guttata,bTaeGut1_v1.p,FUNDC1,FUNDC1,['FUNDC1'],,Uncharacterized protein,FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946],"[nan, nan]",,Gene,protein_coding,ENSTGUT00000027003.1,1,-1,107513786,107528106,"['ENSTGUT00000006367', 'ENSTGUT00000027003']","['protein_coding', 'protein_coding']","['FUNDC1-201', 'FUNDC1-202']",,,,,,


# Fetch the **nucleotide** sequence of a gene, or the **nucleotide** sequences corresponding to all its known protein isoforms.

In [None]:
!gget seq -id ENSTGUG00000006139 -o gene_fasta.fa

Tue May 10 01:44:01 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:44:02 2022 INFO Requesting nucleotide sequence of ENSTGUG00000006139 from Ensembl.


In [None]:
!gget seq -id ENSTGUG00000006139 -iso -o gene_iso_fasta.fa

Tue May 10 01:44:03 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:44:08 2022 INFO Requesting nucleotide sequences of all transcripts of ENSTGUG00000006139 from Ensembl.


# Fetch the **amino acid** sequence of the canonical transcript of a gene, or the **amino acid** sequences corresponding to all its known protein isoforms.

In [None]:
# Get amino acid (AA) sequence of canonical transcript
!gget seq -id ENSTGUG00000006139 -st transcript -o transcript_fasta.fa

Tue May 10 01:44:10 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:44:16 2022 INFO Requesting amino acid sequence of the canonical transcript ENSTGUT00000027003 of gene ENSTGUG00000006139 from UniProt.


In [None]:
# Get AA sequences of all isoforms
!gget seq -id ENSTGUG00000006139 -st transcript -iso -o transcript_iso_fasta.fa

Tue May 10 01:44:19 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:44:25 2022 INFO Requesting amino acid sequences of all transcripts of gene ENSTGUG00000006139 from UniProt.


Note: If the isoform option is used with a transcript ID, gget simply fetches the sequence of the specified transcript and notifies the user that the isoform option only applies to genes:

In [None]:
!gget seq -id ENSTGUT00000027003.1 -st transcript -iso

Tue May 10 01:44:27 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:44:31 2022 INFO Requesting amino acid sequence of ENSTGUT00000027003 from UniProt.
>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 167
MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS
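
The FASTA header returned for amino acid sequences packs several `key: value` pairs onto one line. As a sketch of how those fields could be pulled apart, the code below parses the header shown above with a regular expression. The pattern assumes that keys contain no spaces and are followed by `": "`, which holds for this header but is not a documented format guarantee.

```python
import re

# Header line from the output above
header = (
    ">ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 "
    "gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) "
    "sequence_length: 167"
)


def parse_header(line):
    """Split a FASTA header into its record ID and a dict of key/value fields."""
    record_id, _, rest = line.lstrip(">").partition(" ")
    # A value runs until the next " key: " or the end of the line
    fields = dict(re.findall(r"(\S+): (.*?)(?= \S+: |$)", rest))
    return record_id, fields


record_id, fields = parse_header(header)
print(record_id)
print(fields)
```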


# BLAST the gene **nucleotide** sequence:

Note: `blast` also accepts a sequence passed as a string instead of a .fa file.

In [None]:
!gget blast -s gene_fasta.fa -o gene_blast.csv

Tue May 10 01:44:34 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:44:34 2022 INFO Sequence recognized as nucleotide sequence.
Tue May 10 01:44:34 2022 INFO BLAST will use program 'blastn' with database 'nt'.
Tue May 10 01:44:36 2022 INFO BLAST initiated with search ID 7KGW3182013. Estimated time to completion: 46 seconds.
Tue May 10 01:45:24 2022 INFO Retrieving results...


# BLAST the **amino acid** sequence of the canonical transcript:

In [None]:
!gget blast -s transcript_fasta.fa -o transcript_blast.csv

Tue May 10 01:45:25 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:45:26 2022 INFO Sequence recognized as amino acid sequence.
Tue May 10 01:45:26 2022 INFO BLAST will use program 'blastp' with database 'nr'.
Tue May 10 01:45:27 2022 INFO BLAST initiated with search ID 7KGXP6YY016. Estimated time to completion: 56 seconds.
Tue May 10 01:46:23 2022 INFO BLASTING...
Tue May 10 01:47:25 2022 INFO Retrieving results...
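
The logs above show that the sequence type is detected automatically and that the BLAST program and database follow from it (`blastn` with `nt`, `blastp` with `nr`). How gget makes that decision internally is not shown here. The sketch below is one simple heuristic, assumed purely for illustration: treat a sequence as nucleotide if it contains only nucleotide letters.

```python
NUCLEOTIDE_CODES = set("ACGTUN")

# Program/database pairs as reported in the log lines of this notebook
BLAST_DEFAULTS = {
    "nucleotide": ("blastn", "nt"),
    "amino acid": ("blastp", "nr"),
}


def guess_sequence_type(sequence):
    """Guess the sequence type from its alphabet (illustrative heuristic only)."""
    residues = set(sequence.upper())
    if not residues:
        raise ValueError("empty sequence")
    return "nucleotide" if residues <= NUCLEOTIDE_CODES else "amino acid"


for seq in ("ATGGCC", "MLMPGPLRRA"):  # the first sequence is a made-up example
    seq_type = guess_sequence_type(seq)
    print(seq, seq_type, BLAST_DEFAULTS[seq_type])
```

A heuristic like this misclassifies a peptide that happens to consist only of A, C, G, T, and N, so an explicit override is useful whenever the type is known.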


# Use MUSCLE algorithm to align the **nucleotide** sequences of all transcripts:

In [None]:
# For long or many sequences, use the Super5 algorithm (flag -s5) to reduce memory usage
# Save the results with flag -o
!gget muscle -fa gene_iso_fasta.fa

Tue May 10 01:47:27 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:47:27 2022 INFO MUSCLE compiled. 
Tue May 10 01:47:27 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 13750, max 14321

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
tcmalloc: large alloc 3775569920 bytes == 0x55f16258c000 @  0x7f8287f3a1e7 0x55f160a69017 0x55f160a21993 0x55f160a58957 0x7f82875beedf 0x55f160a5afe3 0x55f160a16c6d 0x55f160a173fd 0x55f160a0f1f7 0x7f8286daac87 0x55f160a14e2a
tcmalloc: large alloc 3775569920 bytes == 0x55f243636000 @  0x7f8287f3a1e7 0x55f160a69017 0x55f160a2199f 0x55f160a58957 0x7f82875beedf 0x55f160a5afe3 0x55f160a16c6d 0x55f160a173fd 0x55f160a0f1f7 0x7f8286daac87 0x55f160a14e2a
00:36 8.4Gb   100.0% UPGMA5         
Tue May 10 01:48:04 2022 INFO MUSCLE alignment complete. Alignment time: 36.52 second

# Use MUSCLE algorithm to align the **amino acid** sequences of all transcripts:

In [None]:
!gget muscle -fa transcript_iso_fasta.fa

Tue May 10 01:48:08 2022 INFO NumExpr defaulting to 2 threads.
Tue May 10 01:48:09 2022 INFO MUSCLE compiled. 
Tue May 10 01:48:09 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 162, max 167

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
00:00 48Mb    100.0% UPGMA5         
Tue May 10 01:48:09 2022 INFO MUSCLE alignment complete. Alignment time: 0.02 seconds


ENSTGUT00000006367 [38;5;15m[48;5;12mM[0;0m[38;5;15m[48;5;12mA[0;0m[38;5;15m[48;5;12mA[0;0m[38;5;15m[48;5;9mR[0;0m[38;5;15m[48;5;9mR[0;0m[38;5;0m[48;5;11mP[0;0m[38;5;15m[48;5;9mR[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48;5;15m-[0;0m[38;5;0m[48