<a href="https://colab.research.google.com/github/pachterlab/gget_examples/blob/main/gget_workflow_terminal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: `gget` in the terminal

[Complete gget manual](https://pachterlab.github.io/gget/)

___

Install gget:

In [1]:
!pip install -q gget

In [2]:
# # Show command line manual for all modules
# !gget -h

___
# Find reference genome metadata and download links

In [3]:
# # Show manual
# !gget ref -h

In [4]:
# Fetch the reference genome metadata of the latest Homo sapiens genome
!gget ref human

Mon May 22 17:13:28 2023 INFO Fetching reference information for homo_sapiens from Ensembl release: 109.
{
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "11:30",
            "bytes": "75M"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "00:02",
            "bytes": "840M"
        },
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-15",
            "release_time": "11:20",
     

In [5]:
# Fetch only the GTF (annotation reference) FTP of the latest Homo sapiens genome
!gget ref -w gtf -ftp human

Mon May 22 17:13:37 2023 INFO Fetching reference information for homo_sapiens from Ensembl release: 109.
http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
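
The JSON printed by `gget ref` can be post-processed with a few lines of Python. The following is a minimal sketch, assuming the JSON has been captured as text; `ftp_links` is a hypothetical helper name, and the embedded data is an excerpt (two entries) of the output shown above.

```python
import json

# Excerpt of the JSON printed by `gget ref human` above (two of the entries).
ref_json = """
{
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "11:30",
            "bytes": "75M"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "00:02",
            "bytes": "840M"
        }
    }
}
"""


def ftp_links(ref_text, species):
    """Map each reference type (e.g. 'genome_dna') to its FTP link.

    Hypothetical helper for illustration; it is not part of gget.
    """
    entries = json.loads(ref_text)[species]
    return {name: entry["ftp"] for name, entry in entries.items()}


links = ftp_links(ref_json, "homo_sapiens")
for name, url in links.items():
    print(f"{name}\t{url}")
```

This prints one tab-separated line per reference type, which is convenient for feeding the links into a download step.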


At the time of writing this notebook, Ensembl had recently released Ensembl 106; the outputs shown here come from a later run against release 109. Note that `gget ref` and `gget search` automatically fetch from the most recent release unless a previous release is specified (all other functions are release independent).

Show newly available genomes in the latest Ensembl release (compared to release 105):

In [6]:
!comm -13 <(gget ref -l -r 105 | sort) <(gget ref -l | sort)

Mon May 22 17:13:42 2023 INFO Fetching available genomes (GTF and FASTAs present) from Ensembl release 105.
Mon May 22 17:13:43 2023 INFO Fetching available genomes (GTF and FASTAs present) from Ensembl release 109 (latest).
canis_lupus_familiarisgsd
cebus_imitator
cyprinus_carpio_carpio
equus_asinus
gallus_gallus_gca000002315v5
gallus_gallus_gca016700215v2


___

# Find gene IDs based on free-form search words:
Searching for 'fun' genes in the zebra finch genome.

In [7]:
# # Show manual
# !gget search -h

In [8]:
!gget search --species taeniopygia_guttata fun

Mon May 22 17:13:49 2023 INFO Fetching results from database: taeniopygia_guttata_core_109_12
Mon May 22 17:13:51 2023 INFO Total matches found: 14.
Mon May 22 17:13:51 2023 INFO Query time: 5.52 seconds.
[
    {
        "ensembl_id": "ENSTGUG00000003915",
        "gene_name": "AIMP1",
        "ensembl_description": "aminoacyl tRNA synthetase complex interacting multifunctional protein 1 [Source:NCBI gene;Acc:100227419]",
        "ext_ref_description": "aminoacyl tRNA synthetase complex interacting multifunctional protein 1",
        "biotype": "protein_coding",
        "url": "https://useast.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000003915"
    },
    {
        "ensembl_id": "ENSTGUG00000004896",
        "gene_name": "MFHAS1",
        "ensembl_description": "malignant fibrous histiocytoma amplified sequence 1 [Source:NCBI gene;Acc:100217808]",
        "ext_ref_description": "multifunctional ROCO family signaling regulator 1",
        "biotype": "protein_coding",
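
The gene names in these search results are what the next step passes to `gget enrichr`. As a rough sketch, assuming the search JSON has been captured as text, the names could be collected like this. `gene_names` is a hypothetical helper, and the embedded records keep only a few of the fields shown above.

```python
import json
import shlex

# Two records from the search output above, reduced to the fields used here.
search_json = """
[
    {"ensembl_id": "ENSTGUG00000003915", "gene_name": "AIMP1", "biotype": "protein_coding"},
    {"ensembl_id": "ENSTGUG00000004896", "gene_name": "MFHAS1", "biotype": "protein_coding"}
]
"""


def gene_names(search_text, biotype="protein_coding"):
    """Collect non-empty gene names of the given biotype (hypothetical helper)."""
    records = json.loads(search_text)
    return [
        record["gene_name"]
        for record in records
        if record.get("biotype") == biotype and record.get("gene_name")
    ]


names = gene_names(search_json)
# Build the enrichr command used in the next section from the collected names.
command = shlex.join(["gget", "enrichr", "-db", "pathway", *names])
print(command)
```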
     

# Use [Enrichr](https://maayanlab.cloud/Enrichr/) to perform a pathway enrichment analysis on a list of genes

In [9]:
# # Show manual
# !gget enrichr -h

In [10]:
!gget enrichr -db pathway AIMP1 MFHAS1 BFAR FUNDC1 AIMP2 ASF1A

Mon May 22 17:13:53 2023 INFO Performing Enichr analysis using database KEGG_2021_Human. 
    Please note that there might a more appropriate database for your application. 
    Go to https://maayanlab.cloud/Enrichr/#libraries for a full list of supported databases.
    
[
    {
        "rank": 1,
        "path_name": "Mitophagy",
        "p_val": 0.0202297392,
        "z_score": 59.4835820896,
        "combined_score": 232.0217507665,
        "overlapping_genes": [
            "FUNDC1"
        ],
        "adj_p_val": 0.0202297392,
        "database": "KEGG_2021_Human"
    }
]
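
Enrichment results are usually filtered by adjusted p-value before interpretation. The sketch below applies such a cutoff to the record shown above. The 0.05 threshold and the helper name `significant_pathways` are assumptions for illustration, not gget defaults.

```python
import json

# The single enrichment record printed above.
enrichr_json = """
[
    {
        "rank": 1,
        "path_name": "Mitophagy",
        "p_val": 0.0202297392,
        "z_score": 59.4835820896,
        "combined_score": 232.0217507665,
        "overlapping_genes": ["FUNDC1"],
        "adj_p_val": 0.0202297392,
        "database": "KEGG_2021_Human"
    }
]
"""


def significant_pathways(enrichr_text, alpha=0.05):
    """Return (pathway, overlapping genes) pairs with adj_p_val below `alpha`.

    `alpha` is an illustrative threshold chosen for this sketch.
    """
    return [
        (record["path_name"], record["overlapping_genes"])
        for record in json.loads(enrichr_text)
        if record["adj_p_val"] < alpha
    ]


hits = significant_pathways(enrichr_json)
for pathway, genes in hits:
    print(pathway, ", ".join(genes))
```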


# Find the 100 genes most correlated with a gene of interest, or show its tissue expression, using the [ARCHS4](https://maayanlab.cloud/archs4/) database

In [11]:
# # Show manual
# !gget archs4 -h

In [12]:
!gget archs4 AIMP1

Mon May 22 17:13:57 2023 INFO Fetching the 100 most correlated genes to AIMP1 from ARCHS4.
[
    {
        "gene_symbol": "MRPL1",
        "pearson_correlation": 0.7187576294
    },
    {
        "gene_symbol": "ZC3H15",
        "pearson_correlation": 0.7041563392
    },
    {
        "gene_symbol": "MRPL47",
        "pearson_correlation": 0.6948618293
    },
    {
        "gene_symbol": "TRMT10C",
        "pearson_correlation": 0.6906108856
    },
    {
        "gene_symbol": "C8orf59",
        "pearson_correlation": 0.6878266335
    },
    {
        "gene_symbol": "SNRPB2",
        "pearson_correlation": 0.6786387563
    },
    {
        "gene_symbol": "AK6",
        "pearson_correlation": 0.6754871607
    },
    {
        "gene_symbol": "NSA2",
        "pearson_correlation": 0.6750699878
    },
    {
        "gene_symbol": "METTL5",
        "pearson_correlation": 0.6749789715
    },
    {
        "gene_symbol": "RSL24D1",
        "pearson_correlation": 0.6746876836
    },
    {
    

In [13]:
!gget archs4 --which tissue AIMP1

Mon May 22 17:14:01 2023 INFO Fetching the tissue expression atlas of AIMP1 from human ARCHS4 data.
[
    {
        "id": "System.Muscular System.Skeletal muscle.SKELETAL MUSCLE",
        "min": 7.99445,
        "q1": 9.22397,
        "median": 9.81827,
        "q3": 10.1036,
        "max": 10.6314
    },
    {
        "id": "System.Digestive System.Pancreas.PANCREATIC ISLET",
        "min": 0.113644,
        "q1": 8.83132,
        "median": 9.7887,
        "q3": 10.5512,
        "max": 11.5828
    },
    {
        "id": "System.Immune System.Lymphoid.PLASMA CELL",
        "min": 0.113644,
        "q1": 8.94132,
        "median": 9.7726,
        "q3": 10.5481,
        "max": 11.4732
    },
    {
        "id": "System.Digestive System.Pancreas.ALPHA CELL",
        "min": 7.66651,
        "q1": 8.9879,
        "median": 9.74156,
        "q3": 10.6557,
        "max": 11.5419
    },
    {
        "id": "System.Immune System.Lymphoid.TLYMPHOCYTE",
        "min": 8.50117,
        "q1": 9.198

# Fetch additional information about genes/transcripts:

In [14]:
# # Show manual
# !gget info -h

In [15]:
# Show short info on a few of the genes (includes the canonical transcript for each)
!gget info ENSTGUG00000006139 ENSTGUG00000026050 ENSTGUG00000004956

{
    "ENSTGUG00000006139": {
        "ensembl_id": "ENSTGUG00000006139.2",
        "uniprot_id": [
            "A0A674GVD2",
            "H0Z6V5"
        ],
        "pdb_id": null,
        "ncbi_gene_id": "100228946",
        "species": "taeniopygia_guttata",
        "assembly_name": "bTaeGut1_v1.p",
        "primary_gene_name": "FUNDC1",
        "ensembl_gene_name": "FUNDC1",
        "synonyms": [],
        "parent_gene": null,
        "protein_names": [
            null,
            null
        ],
        "ensembl_description": "FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946]",
        "uniprot_description": "",
        "ncbi_description": null,
        "subcellular_localisation": "Mitochondrion outer membrane",
        "object_type": "Gene",
        "biotype": "protein_coding",
        "canonical_transcript": "ENSTGUT00000027003.1",
        "seq_region_name": "1",
        "strand": -1,
        "start": 107513786,
        "end": 107528106,
        "all_transcripts": [
  

# Fetch the **nucleotide** sequence of a gene, or the **nucleotide** sequences corresponding to all its known protein isoforms.

In [16]:
# # Show manual
# !gget seq -h

In [17]:
# The [-o] flag defines the file the results will be saved in
!gget seq -o gene_fasta.fa ENSTGUG00000006139

Mon May 22 17:14:19 2023 INFO Requesting nucleotide sequence of ENSTGUG00000006139 from Ensembl.


Get the nucleotide sequences of all known isoforms of ENSTGUG00000006139:

In [18]:
!gget seq -iso -o gene_iso_fasta.fa ENSTGUG00000006139

Mon May 22 17:14:22 2023 INFO Requesting nucleotide sequences of all transcripts of ENSTGUG00000006139 from Ensembl.


# Fetch the **amino acid** sequence of the canonical transcript of a gene, or the **amino acid** sequences corresponding to all its known protein isoforms.

In [19]:
# Get amino acid (AA) sequence of canonical transcript
!gget seq --translate -o transcript_fasta.fa ENSTGUG00000006139

Mon May 22 17:14:27 2023 INFO Requesting amino acid sequence of the canonical transcript ENSTGUT00000027003 of gene ENSTGUG00000006139 from UniProt.


In [20]:
# Get AA sequences of all isoforms
!gget seq --translate -iso -o transcript_iso_fasta.fa ENSTGUG00000006139

Mon May 22 17:14:30 2023 INFO Requesting amino acid sequences of all transcripts of gene ENSTGUG00000006139 from UniProt.


Note: If you use the isoform option on a transcript, it will simply fetch the sequence of the specified transcript and notify the user that the isoform option only applies to genes:

In [21]:
!gget seq --translate -iso ENSTGUT00000027003.1

Mon May 22 17:14:35 2023 INFO We noticed that you may have passed a version number with your Ensembl ID.
Please note that gget seq will return information linked to the latest Ensembl ID version.
Mon May 22 17:14:35 2023 INFO Requesting amino acid sequence of ENSTGUT00000027003 from UniProt.
>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name: FUNDC1 organism: Taeniopygia guttata sequence_length: 167
MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS
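
The FASTA header above packs several `key: value` pairs after the record ID. Here is a minimal sketch of how such a record could be split into its parts, assuming a single record and the header layout shown above. `parse_record` is a hypothetical helper, not part of gget.

```python
import re

# The FASTA record printed above.
fasta_text = """>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name: FUNDC1 organism: Taeniopygia guttata sequence_length: 167
MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS
"""


def parse_record(text):
    """Split a single FASTA record into (record_id, header_fields, sequence)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    header, sequence = lines[0], "".join(lines[1:])
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    record_id, _, rest = header[1:].partition(" ")
    # A value runs until the next " key: " or the end of the header,
    # so values containing spaces (e.g. the organism name) stay intact.
    fields = dict(re.findall(r"(\w+): (.*?)(?= \w+: |$)", rest))
    return record_id, fields, sequence


record_id, fields, sequence = parse_record(fasta_text)
print(record_id, fields["gene_name"], fields["organism"], len(sequence))
```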


# BLAST the gene **nucleotide** sequence:

In [22]:
# # Show manual
# !gget blast -h

Note: `blast` also accepts a sequence passed as a string instead of a .fa file.

In [28]:
!gget blast gene_fasta.fa

Mon May 22 19:39:08 2023 INFO Sequence recognized as nucleotide sequence.
Mon May 22 19:39:08 2023 INFO BLAST will use program 'blastn' with database 'nt'.
Mon May 22 19:39:08 2023 INFO BLAST initiated. Estimated time to completion: 11 seconds.
Mon May 22 19:39:20 2023 INFO BLASTING...
Mon May 22 19:40:21 2023 INFO BLASTING...
Mon May 22 19:41:23 2023 INFO BLASTING...
Mon May 22 19:42:25 2023 INFO BLASTING...
Mon May 22 19:43:27 2023 INFO Retrieving results...
[
    {
        "Description": "Aquila chrysaetos chrysaetos genome assembly, chromosome: 7",
        "Scientific Name": "Aquila chrysaetos chrysaetos",
        "Common Name": null,
        "Taxid": 223781,
        "Max Score": 2259,
        "Total Score": 6189,
        "Query Cover": "56%",
        "E value": 0.0,
        "Per. Ident": "79.59%",
        "Acc. Len": 47779391,
        "Accession": "LR606187.1"
    },
    {
        "Description": "Haliaeetus albicilla genome assembly, chromosome: 6",
        "Scientific Name": "Hal

# BLAST the **amino acid** sequence of the canonical transcript:

In [24]:
!gget blast transcript_fasta.fa

Mon May 22 19:26:30 2023 INFO Sequence recognized as amino acid sequence.
Mon May 22 19:26:30 2023 INFO BLAST will use program 'blastp' with database 'nr'.
Mon May 22 19:26:31 2023 INFO BLAST initiated with search ID 6RGF6C3H016. Estimated time to completion: 21 seconds.
Mon May 22 19:26:54 2023 INFO Retrieving results...
[
    {
        "Description": "FUN14 domain-containing protein 1 isoform X1 [Taeniopygia guttata]",
        "Scientific Name": "Taeniopygia guttata",
        "Common Name": "zebra finch",
        "Taxid": 59729,
        "Max Score": 345,
        "Total Score": 345,
        "Query Cover": "100%",
        "E value": 7e-120,
        "Per. Ident": "100.00%",
        "Acc. Len": 167,
        "Accession": "XP_032606376.2"
    },
    {
        "Description": "FUN14 domain-containing protein 1 isoform X2 [Lonchura striata domestica]",
        "Scientific Name": "Lonchura striata domestica",
        "Common Name": "Bengalese finch",
        "Taxid": 299123,
        "Max Score

# Use the MUSCLE algorithm to align the **nucleotide** sequences of all transcripts:

In [25]:
# # Show manual
# !gget muscle -h

In [26]:
!gget muscle gene_iso_fasta.fa

Mon May 22 19:26:56 2023 INFO MUSCLE compiled. 
Mon May 22 19:26:56 2023 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 13750, max 14321

00:00 8.0Mb  CPU has 2 cores, running 2 threads
00:00 16Mb    100.0% Calc posteriors
00:39 17Mb    100.0% UPGMA5         
Mon May 22 19:27:36 2023 INFO MUSCLE alignment complete. Alignment time: 39.6 seconds


ENSTGUT00000006367 CTCAGCACCGGCCAACAT
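
In a terminal, `gget muscle` colors each residue of the alignment with ANSI escape sequences. When the output is captured as text (for example in a saved notebook or a log file), those sequences appear as clutter around every character. The sketch below strips them, assuming they are standard SGR color codes of the form `ESC[...m`. `strip_ansi` is a hypothetical helper.

```python
import re

# Matches SGR (color/style) escape sequences: ESC [ <numeric parameters> m
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text):
    """Remove ANSI color codes, leaving only the visible characters."""
    return ANSI_SGR.sub("", text)


# Start of a colorized alignment row: foreground code, background code,
# residue, then a reset code, repeated for each residue.
colored = (
    "ENSTGUT00000006367 "
    "\x1b[38;5;15m\x1b[48;5;12mC\x1b[0;0m"
    "\x1b[38;5;0m\x1b[48;5;10mT\x1b[0;0m"
    "\x1b[38;5;15m\x1b[48;5;12mC\x1b[0;0m"
)

plain = strip_ansi(colored)
print(plain)
```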

# Use the MUSCLE algorithm to align the **amino acid** sequences of all transcripts:

In [27]:
!gget muscle transcript_iso_fasta.fa

Mon May 22 19:27:41 2023 INFO MUSCLE compiled. 
Mon May 22 19:27:41 2023 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 162, max 167

00:00 8.0Mb  CPU has 2 cores, running 2 threads
00:00 16Mb    100.0% Calc posteriors
00:00 17Mb    100.0% UPGMA5         
Mon May 22 19:27:41 2023 INFO MUSCLE alignment complete. Alignment time: 0.03 seconds


ENSTGUT00000006367 MAARRPR-----------S
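
Once two sequences are aligned, a common follow-up is to quantify how similar they are. The sketch below computes percent identity over the columns where neither sequence has a gap. This is only one of several definitions in use, and the two aligned strings are made up for illustration (they are not the gget output).

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of identical residues over columns without a gap in either row."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned rows, chosen only to exercise gaps and one mismatch.
row_a = "MAARRPR---SLL"
row_b = "MAA-RPRGGGSLV"

identity = percent_identity(row_a, row_b)
print(f"{identity:.1f}% identity")
```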