<a href="https://colab.research.google.com/github/pachterlab/gget_examples/blob/main/gget_workflow_terminal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial: `gget` in the terminal

[`gget`](https://github.com/pachterlab/gget) currently consists of the following nine modules:
- `gget ref`
Fetch File Transfer Protocol (FTP) links and metadata for reference genomes and annotations from [Ensembl](https://www.ensembl.org/) by species.
- `gget search`
Fetch genes and transcripts from [Ensembl](https://www.ensembl.org/) using free-form search terms.
- `gget info`
Fetch extensive gene and transcript metadata from [Ensembl](https://www.ensembl.org/), [UniProt](https://www.uniprot.org/), and [NCBI](https://www.ncbi.nlm.nih.gov/) using Ensembl IDs.  
- `gget seq`
Fetch nucleotide or amino acid sequences of genes or transcripts from [Ensembl](https://www.ensembl.org/) or [UniProt](https://www.uniprot.org/), respectively.  
- `gget blast`
BLAST a nucleotide or amino acid sequence to any [BLAST](https://blast.ncbi.nlm.nih.gov/Blast.cgi) database.
- `gget blat` 
Find the genomic location of a nucleotide or amino acid sequence using [BLAT](https://genome.ucsc.edu/cgi-bin/hgBlat).
- `gget muscle` 
Align multiple nucleotide or amino acid sequences to each other using [Muscle5](https://www.drive5.com/muscle/).
- `gget enrichr`
Perform an enrichment analysis on a list of genes using [Enrichr](https://maayanlab.cloud/Enrichr/).
- `gget archs4` 
Find the most correlated genes to a gene of interest or find the gene's tissue expression atlas using [ARCHS4](https://maayanlab.cloud/archs4/).

___

Install gget:

In [1]:
!pip install gget -q

___


<h1><center>Terminal version</center></h1>
<center>Jupyter lab version below.</center>


In [2]:
!gget

usage: gget [-h] [-v]
            {ref,search,info,seq,muscle,blast,blat,enrichr,archs4} ...

gget v0.2.0

positional arguments:
  {ref,search,info,seq,muscle,blast,blat,enrichr,archs4}
    ref                 Fetch FTPs for reference genomes and annotations by
                        species.
    search              Fetch gene and transcript IDs from Ensembl using free-
                        form search terms.
    info                Fetch gene and transcript metadata using Ensembl IDs.
    seq                 Fetch nucleotide or amino acid sequence (FASTA) of a
                        gene (and all isoforms) or transcript by Ensembl,
                        WormBase or FlyBase ID.
    muscle              Align multiple nucleotide or amino acid sequences
                        against each other (using the Muscle v5 algorithm).
    blast               BLAST a nucleotide or amino acid sequence against any
                        BLAST database.
    blat                BLAT a nucleot

In [3]:
# # Show complete manual
# !gget -h

___
# Find reference genome metadata and download links

In [4]:
# Show manual
!gget ref

usage: gget ref [-h] [-l] [-w WHICH] [-r RELEASE] [-ftp] [-d] [-o OUT]
                [-s SPECIES_DEPRECATED]
                [species]

Fetch FTPs for reference genomes and annotations by species.

positional arguments:
  species               Species for which the FTPs will be fetched, e.g.
                        homo_sapiens.

optional arguments:
  -h, --help            show this help message and exit
  -l, --list_species    List all available species. (Combine with `--release`
                        to get the available species from a specific Ensembl
                        release.)
  -w WHICH, --which WHICH
                        Defines which results to return. Default: 'all' ->
                        Returns all available results. Possible entries are
                        one or a combination (as a comma-separated list) of
                        the following: 'gtf' - Returns the annotation (GTF).
                        'cdna' - Returns the trancriptome (cDNA). 'dna'

In [5]:
# Fetch the reference genome metadata of the latest Homo sapiens genome
!gget ref human

Thu Jun  9 02:25:01 2022 INFO Fetching reference information for homo_sapiens from Ensembl release: 106.
{
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
            "ensembl_release": 106,
            "release_date": "17-Feb-2022",
            "release_time": "19:50",
            "bytes": "76937571"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 106,
            "release_date": "21-Feb-2022",
            "release_time": "09:35",
            "bytes": "881211416"
        },
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz",
            "ensembl_release": 106,
            "release_date": "28-Feb-2022",
            "release_time": "

In [6]:
# Fetch only the GTF (annotation reference) FTP of the latest Homo sapiens genome
!gget ref -w gtf -ftp human

Thu Jun  9 02:25:03 2022 INFO Fetching reference information for homo_sapiens from Ensembl release: 106.
http://ftp.ensembl.org/pub/release-106/gtf/homo_sapiens/Homo_sapiens.GRCh38.106.gtf.gz
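The JSON printed by `gget ref` can be post-processed in a script. The following is a minimal sketch, assuming the JSON has been captured on its own (without the INFO log line) and has the structure shown above: species, then component, then an `ftp` link and a `bytes` size. The sample is copied from the first two entries of that output; the helper name `ftp_links` is hypothetical and not part of gget.

```python
import json

# Sample copied from the first two entries of the `gget ref human` output above.
ref_output = """
{
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
            "ensembl_release": 106,
            "release_date": "17-Feb-2022",
            "release_time": "19:50",
            "bytes": "76937571"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-106/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 106,
            "release_date": "21-Feb-2022",
            "release_time": "09:35",
            "bytes": "881211416"
        }
    }
}
"""


def ftp_links(ref_json: str) -> dict:
    """Return {"species:component": (ftp_link, size_in_MB)} (hypothetical helper)."""
    links = {}
    for species, components in json.loads(ref_json).items():
        for component, meta in components.items():
            # "bytes" is reported as a string, so convert it before doing arithmetic.
            size_mb = round(int(meta["bytes"]) / 1e6, 1)
            links[f"{species}:{component}"] = (meta["ftp"], size_mb)
    return links


links = ftp_links(ref_output)
for key, (url, size_mb) in links.items():
    print(f"{key}\t{size_mb} MB\t{url}")
```

Checking the file size before downloading is useful here because the genome FASTA is more than ten times larger than the transcriptome.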


Ensembl release 106 was recently published. `gget ref` and `gget search` now fetch from this release by default unless a previous release is specified (e.g. with `-r 105` for `gget ref`); all other modules are release-independent. The following command lists the genomes that are newly available in release 106 compared to the previous release, 105:

In [7]:
!comm -13 <(gget ref -l -r 105 | sort) <(gget ref -l | sort)

Thu Jun  9 02:25:05 2022 INFO Fetching available genomes (GTF and FASTAs present) from Ensembl release 106 (latest).
Thu Jun  9 02:25:06 2022 INFO Fetching available genomes (GTF and FASTAs present) from Ensembl release 105.
cyprinus_carpio_carpio
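For readers unfamiliar with `comm`: it compares two sorted inputs, and `-13` suppresses the lines unique to the first input and the lines common to both, leaving only the lines unique to the second. The sketch below reproduces that logic with a set difference. The two species lists are shortened, illustrative stand-ins for the real `gget ref -l` output (only `cyprinus_carpio_carpio` is taken from the result above), and `only_in_second` is a hypothetical helper name.

```python
# Illustrative stand-ins; the real lists come from `gget ref -l` and are much longer.
release_105 = ["homo_sapiens", "taeniopygia_guttata"]
release_106 = ["cyprinus_carpio_carpio", "homo_sapiens", "taeniopygia_guttata"]


def only_in_second(first, second):
    """Return the sorted entries found in `second` but not in `first` (like `comm -13`)."""
    return sorted(set(second) - set(first))


new_species = only_in_second(release_105, release_106)
print(new_species)
```

Unlike `comm`, the set-based version does not require sorted input.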


___

# Find gene IDs based on free-form search words:
Search for 'fun' genes in the zebra finch (*Taeniopygia guttata*) genome.

In [8]:
# Show manual
!gget search

usage: gget search [-h] -s SPECIES [-t {gene,transcript}] [-ao {and,or}]
                   [-l LIMIT] [-csv] [-o OUT]
                   [-sw SW_DEPRECATED [SW_DEPRECATED ...]] [--seqtype SEQTYPE]
                   [-j]
                   [searchwords [searchwords ...]]

Fetch gene and transcript IDs from Ensembl using free-form search terms.

positional arguments:
  searchwords           One or more free form search words, e.g. gaba, nmda.

optional arguments:
  -h, --help            show this help message and exit
  -s SPECIES, --species SPECIES
                        Species to be queried, e.g. homo_sapiens.
  -t {gene,transcript}, --id_type {gene,transcript}
                        'gene': Returns genes that match the searchwords.
                        (default). 'transcript': Returns transcripts that
                        match the searchwords.
  -ao {and,or}, --andor {and,or}
                        'or': Gene descriptions must include at least one of
                     

In [9]:
!gget search --species taeniopygia_guttata fun

Thu Jun  9 02:25:08 2022 INFO Fetching results from database: taeniopygia_guttata_core_106_12
Thu Jun  9 02:25:09 2022 INFO Total matches found: 14.
Thu Jun  9 02:25:09 2022 INFO Query time: 1.03 seconds.
[
    {
        "ensembl_id": "ENSTGUG00000003915",
        "gene_name": "AIMP1",
        "ensembl_description": "aminoacyl tRNA synthetase complex interacting multifunctional protein 1 [Source:NCBI gene;Acc:100227419]",
        "ext_ref_description": "aminoacyl tRNA synthetase complex interacting multifunctional protein 1",
        "biotype": "protein_coding",
        "url": "https://uswest.ensembl.org/taeniopygia_guttata/Gene/Summary?g=ENSTGUG00000003915"
    },
    {
        "ensembl_id": "ENSTGUG00000004896",
        "gene_name": "MFHAS1",
        "ensembl_description": "malignant fibrous histiocytoma amplified sequence 1 [Source:NCBI gene;Acc:100217808]",
        "ext_ref_description": "multifunctional ROCO family signaling regulator 1",
        "biotype": "protein_coding",
     
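The gene names in these results can be collected programmatically, for example to pass them on to another module. The sketch below is a minimal example that assumes the JSON structure shown above and uses the two records visible in the output (the second record is reproduced only as far as it is shown). It also reports which description field contains the search word; this only inspects the returned records and is not a description of how `gget search` queries Ensembl. The helper name `matching_fields` is hypothetical.

```python
import json

# The two records visible in the output above (second one only as far as shown).
search_output = """
[
    {
        "ensembl_id": "ENSTGUG00000003915",
        "gene_name": "AIMP1",
        "ensembl_description": "aminoacyl tRNA synthetase complex interacting multifunctional protein 1 [Source:NCBI gene;Acc:100227419]",
        "ext_ref_description": "aminoacyl tRNA synthetase complex interacting multifunctional protein 1",
        "biotype": "protein_coding"
    },
    {
        "ensembl_id": "ENSTGUG00000004896",
        "gene_name": "MFHAS1",
        "ensembl_description": "malignant fibrous histiocytoma amplified sequence 1 [Source:NCBI gene;Acc:100217808]",
        "ext_ref_description": "multifunctional ROCO family signaling regulator 1",
        "biotype": "protein_coding"
    }
]
"""

DESCRIPTION_FIELDS = ("ensembl_description", "ext_ref_description")


def matching_fields(hit: dict, word: str) -> list:
    """List the description fields of a hit that contain `word` (case-insensitive)."""
    word = word.lower()
    return [f for f in DESCRIPTION_FIELDS if word in (hit.get(f) or "").lower()]


hits = json.loads(search_output)
gene_names = [hit["gene_name"] for hit in hits]
for hit in hits:
    print(hit["gene_name"], matching_fields(hit, "fun"))
```

For MFHAS1, the word "fun" appears only in the external reference description, which shows why both description fields are worth checking.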

# Use [Enrichr](https://maayanlab.cloud/Enrichr/) to perform a pathway enrichment analysis on a list of genes

In [10]:
# Show manual
!gget enrichr

usage: gget enrichr [-h] -db DATABASE [-e] [-csv] [-o OUT]
                    [-g GENES_DEPRECATED [GENES_DEPRECATED ...]] [-j]
                    [genes [genes ...]]

Perform an enrichment analysis on a list of genes using Enrichr.

positional arguments:
  genes                 List of gene symbols or Ensembl gene IDs to perform
                        enrichment analysis on.

optional arguments:
  -h, --help            show this help message and exit
  -db DATABASE, --database DATABASE
                        'pathway', 'transcription', 'ontology',
                        'diseases_drugs', 'celltypes', 'kinase_interactions'or
                        any database listed at:
                        https://maayanlab.cloud/Enrichr/#libraries
  -e, --ensembl         Add this flag if genes are given as Ensembl gene IDs.
  -csv, --csv           Returns results in csv format instead of json.
  -o OUT, --out OUT     Path to the csv file the results will be saved in,
                       

In [11]:
!gget enrichr -db pathway AIMP1 MFHAS1 BFAR FUNDC1 AIMP2 ASF1A

Thu Jun  9 02:25:11 2022 INFO Performing Enichr analysis using database KEGG_2021_Human. 
    Please note that there might a more appropriate database for your application. 
    Go to https://maayanlab.cloud/Enrichr/#libraries for a full list of supported databases.
    
[
    {
        "rank": 1,
        "path_name": "Mitophagy",
        "p_val": 0.0202297392,
        "z_score": 59.4835820896,
        "combined_score": 232.0217507665,
        "overlapping_genes": [
            "FUNDC1"
        ],
        "adj_p_val": 0.0202297392,
        "database": "KEGG_2021_Human"
    }
]
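Enrichment results are usually filtered by adjusted p-value before interpretation. The sketch below applies a cutoff to the record shown above; the 0.05 threshold and the helper name `significant_pathways` are illustrative assumptions, not gget defaults. It also checks that, for this record, the combined score is numerically consistent with `-ln(p_val) * z_score`, which appears to match how Enrichr describes its combined score.

```python
import math

# Record copied from the `gget enrichr` output above.
results = [
    {
        "rank": 1,
        "path_name": "Mitophagy",
        "p_val": 0.0202297392,
        "z_score": 59.4835820896,
        "combined_score": 232.0217507665,
        "overlapping_genes": ["FUNDC1"],
        "adj_p_val": 0.0202297392,
        "database": "KEGG_2021_Human",
    }
]


def significant_pathways(rows, alpha=0.05):
    """Keep rows whose adjusted p-value is below `alpha` (illustrative cutoff)."""
    return [row for row in rows if row["adj_p_val"] < alpha]


significant = significant_pathways(results)
for row in significant:
    # Recompute the combined score from the p-value and z-score for comparison.
    recomputed = -math.log(row["p_val"]) * row["z_score"]
    print(row["path_name"], row["overlapping_genes"], round(recomputed, 2))
```

Note that this single hit rests on one overlapping gene (FUNDC1), so it should be interpreted with caution even though it passes the cutoff.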


# Find the 100 most correlated genes to a gene of interest or show its tissue expression using the [ARCHS4](https://maayanlab.cloud/archs4/) database

In [12]:
# Show manual
!gget archs4

usage: gget archs4 [-h] [-e] [-w {correlation,tissue}] [-gc GENE_COUNT]
                   [-s {human,mouse}] [-csv] [-o OUT] [-g GENE_DEPRECATED]
                   [-j]
                   [gene]

Find the most correlated genes or the tissue expression atlas of a gene using
data from the human and mouse RNA-seq database ARCHS4
(https://maayanlab.cloud/archs4/).

positional arguments:
  gene                  Gene symbol or Ensembl gene ID of gene of interest
                        (str), e.g. 'STAT4'.

optional arguments:
  -h, --help            show this help message and exit
  -e, --ensembl         Add this flag if gene is given as an Ensembl gene ID.
  -w {correlation,tissue}, --which {correlation,tissue}
                        'correlation' (default) or 'tissue'. - 'correlation'
                        returns a gene correlation table that contains the 100
                        most correlated genes to the gene of interest. The
                        Pearson correlation is cal

In [13]:
!gget archs4 AIMP1

Thu Jun  9 02:25:15 2022 INFO Fetching the 100 most correlated genes to AIMP1 from ARCHS4.
[
    {
        "gene_symbol": "MRPL1",
        "pearson_correlation": 0.7187576294
    },
    {
        "gene_symbol": "ZC3H15",
        "pearson_correlation": 0.7041563392
    },
    {
        "gene_symbol": "MRPL47",
        "pearson_correlation": 0.6948618293
    },
    {
        "gene_symbol": "TRMT10C",
        "pearson_correlation": 0.6906108856
    },
    {
        "gene_symbol": "C8orf59",
        "pearson_correlation": 0.6878266335
    },
    {
        "gene_symbol": "SNRPB2",
        "pearson_correlation": 0.6786387563
    },
    {
        "gene_symbol": "AK6",
        "pearson_correlation": 0.6754871607
    },
    {
        "gene_symbol": "NSA2",
        "pearson_correlation": 0.6750699878
    },
    {
        "gene_symbol": "METTL5",
        "pearson_correlation": 0.6749789715
    },
    {
        "gene_symbol": "RSL24D1",
        "pearson_correlation": 0.6746876836
    },
    {
    

In [14]:
!gget archs4 --which tissue AIMP1

Thu Jun  9 02:25:19 2022 INFO Fetching the tissue expression atlas of AIMP1 from human ARCHS4 data.
[
    {
        "id": "System.Muscular System.Skeletal muscle.SKELETAL MUSCLE",
        "min": 7.99445,
        "q1": 9.22397,
        "median": 9.81827,
        "q3": 10.1036,
        "max": 10.6314
    },
    {
        "id": "System.Digestive System.Pancreas.PANCREATIC ISLET",
        "min": 0.113644,
        "q1": 8.83132,
        "median": 9.7887,
        "q3": 10.5512,
        "max": 11.5828
    },
    {
        "id": "System.Immune System.Lymphoid.PLASMA CELL",
        "min": 0.113644,
        "q1": 8.94132,
        "median": 9.7726,
        "q3": 10.5481,
        "max": 11.4732
    },
    {
        "id": "System.Digestive System.Pancreas.ALPHA CELL",
        "min": 7.66651,
        "q1": 8.9879,
        "median": 9.74156,
        "q3": 10.6557,
        "max": 11.5419
    },
    {
        "id": "System.Immune System.Lymphoid.TLYMPHOCYTE",
        "min": 8.50117,
        "q1": 9.198
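Each record of the tissue atlas is a five-number summary (minimum, quartiles, maximum) of the gene's expression in one tissue or cell type. The sketch below uses the four complete records shown above to derive a short tissue label and the interquartile range (IQR = q3 - q1), then ranks tissues by how tightly the expression is distributed. The helper name `summarize` is hypothetical.

```python
# The four complete records from the `gget archs4 --which tissue AIMP1` output above.
tissue_rows = [
    {"id": "System.Muscular System.Skeletal muscle.SKELETAL MUSCLE",
     "min": 7.99445, "q1": 9.22397, "median": 9.81827, "q3": 10.1036, "max": 10.6314},
    {"id": "System.Digestive System.Pancreas.PANCREATIC ISLET",
     "min": 0.113644, "q1": 8.83132, "median": 9.7887, "q3": 10.5512, "max": 11.5828},
    {"id": "System.Immune System.Lymphoid.PLASMA CELL",
     "min": 0.113644, "q1": 8.94132, "median": 9.7726, "q3": 10.5481, "max": 11.4732},
    {"id": "System.Digestive System.Pancreas.ALPHA CELL",
     "min": 7.66651, "q1": 8.9879, "median": 9.74156, "q3": 10.6557, "max": 11.5419},
]


def summarize(row: dict) -> dict:
    """Reduce a record to its tissue label, median, and interquartile range."""
    return {
        "tissue": row["id"].split(".")[-1],  # last component of the dotted path
        "median": row["median"],
        "iqr": row["q3"] - row["q1"],
    }


# Rank tissues from the narrowest to the widest expression spread.
summary = sorted((summarize(row) for row in tissue_rows), key=lambda s: s["iqr"])
for s in summary:
    print(f'{s["tissue"]:<18} median={s["median"]:.2f} IQR={s["iqr"]:.2f}')
```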

# Fetch additional information about genes/transcripts:

In [15]:
# Show manual
!gget info

usage: gget info [-h] [-e] [-csv] [-q] [-o OUT]
                 [-id ID_DEPRECATED [ID_DEPRECATED ...]] [-j]
                 [ens_ids [ens_ids ...]]

Fetch gene and transcript metadata using Ensembl IDs.

positional arguments:
  ens_ids               One or more Ensembl, WormBase or FlyBase IDs).

optional arguments:
  -h, --help            show this help message and exit
  -e, --expand          DEPRECATED - gget info now always returns all
                        available information.
  -csv, --csv           Returns results in csv format instead of json.
  -q, --quiet           Do not print progress information.
  -o OUT, --out OUT     Path to file the results will be saved as, e.g.
                        path/to/directory/results.json. Default: Standard out.
  -id ID_DEPRECATED [ID_DEPRECATED ...], --ens_ids ID_DEPRECATED [ID_DEPRECATED ...]
                        DEPRECATED - use positional argument instead. One or
                        more Ensembl, WormBase or FlyBase IDs).

In [16]:
# Show short info on a few of the genes (includes the canonical transcript for each)
!gget info ENSTGUG00000006139 ENSTGUG00000026050 ENSTGUG00000004956

{
    "ENSTGUG00000006139": {
        "ensembl_id": "ENSTGUG00000006139.2",
        "uniprot_id": [
            "A0A674GVD2",
            "H0Z6V5"
        ],
        "ncbi_gene_id": "100228946",
        "species": "taeniopygia_guttata",
        "assembly_name": "bTaeGut1_v1.p",
        "primary_gene_name": "FUNDC1",
        "ensembl_gene_name": "FUNDC1",
        "synonyms": [
            "FUNDC1"
        ],
        "parent_gene": null,
        "protein_names": "Uncharacterized protein",
        "ensembl_description": "FUN14 domain containing 1 [Source:NCBI gene;Acc:100228946]",
        "uniprot_description": [
            null,
            null
        ],
        "ncbi_description": null,
        "object_type": "Gene",
        "biotype": "protein_coding",
        "canonical_transcript": "ENSTGUT00000027003.1",
        "seq_region_name": "1",
        "strand": -1,
        "start": 107513786,
        "end": 107528106,
        "all_transcripts": [
            {
                "transcript
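Two fields of this output are used repeatedly in the rest of the tutorial: the canonical transcript and the genomic coordinates. The sketch below works on a subset of the fields shown above. It removes the version suffix from an Ensembl ID and computes the length of the gene locus, assuming that `start` and `end` are inclusive coordinates. The helper names `strip_version` and `gene_span` are hypothetical.

```python
# Subset of the fields shown in the `gget info` output above.
info = {
    "ENSTGUG00000006139": {
        "ensembl_id": "ENSTGUG00000006139.2",
        "primary_gene_name": "FUNDC1",
        "canonical_transcript": "ENSTGUT00000027003.1",
        "seq_region_name": "1",
        "strand": -1,
        "start": 107513786,
        "end": 107528106,
    }
}


def strip_version(ensembl_id: str) -> str:
    """Drop the version suffix, e.g. 'ENSTGUT00000027003.1' -> 'ENSTGUT00000027003'."""
    return ensembl_id.split(".")[0]


def gene_span(record: dict) -> int:
    """Length of the locus in bases, assuming inclusive start/end coordinates."""
    return record["end"] - record["start"] + 1


record = info["ENSTGUG00000006139"]
canonical = strip_version(record["canonical_transcript"])
print(record["primary_gene_name"], canonical, gene_span(record))
```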

# Fetch the **nucleotide** sequence of a gene, or the **nucleotide** sequences corresponding to all its known protein isoforms.

In [17]:
# Show manual
!gget seq

usage: gget seq [-h] [-t] [-iso] [-o OUT]
                [-id ID_DEPRECATED [ID_DEPRECATED ...]] [--seqtype SEQTYPE]
                [ens_ids [ens_ids ...]]

Fetch nucleotide or amino acid sequence (FASTA) of a gene (and all isoforms)
or transcript by Ensembl, WormBase or FlyBase ID.

positional arguments:
  ens_ids               One or more Ensembl, WormBase or FlyBase IDs.

optional arguments:
  -h, --help            show this help message and exit
  -t, --transcribe      Returns amino acid sequences from UniProt. (Otherwise
                        returns nucleotide sequences from Ensembl.)
  -iso, --isoforms      Returns sequences of all known transcripts (default:
                        False). (Only for gene IDs.)
  -o OUT, --out OUT     Path to the FASTA file the results will be saved in,
                        e.g. path/to/directory/results.fa. Default: Standard
                        out.
  -id ID_DEPRECATED [ID_DEPRECATED ...], --ens_ids ID_DEPRECATED [ID_DEPRECATED ...]


In [18]:
# Flag [-o] defines the file the results will be saved in
!gget seq -o gene_fasta.fa ENSTGUG00000006139

Thu Jun  9 02:25:34 2022 INFO Requesting nucleotide sequence of ENSTGUG00000006139 from Ensembl.


Get the nucleotide sequences of all known isoforms of ENSTGUG00000006139:

In [19]:
!gget seq -iso -o gene_iso_fasta.fa ENSTGUG00000006139

Thu Jun  9 02:25:39 2022 INFO Requesting nucleotide sequences of all transcripts of ENSTGUG00000006139 from Ensembl.


# Fetch the **amino acid** sequence of the canonical transcript of a gene, or the **amino acid** sequences corresponding to all its known protein isoforms.

In [20]:
# Get amino acid (AA) sequence of canonical transcript
!gget seq --transcribe -o transcript_fasta.fa ENSTGUG00000006139

Thu Jun  9 02:25:43 2022 INFO Requesting amino acid sequence of the canonical transcript ENSTGUT00000027003 of gene ENSTGUG00000006139 from UniProt.


In [21]:
# Get AA sequences of all isoforms
!gget seq --transcribe -iso -o transcript_iso_fasta.fa ENSTGUG00000006139

Thu Jun  9 02:25:49 2022 INFO Requesting amino acid sequences of all transcripts of gene ENSTGUG00000006139 from UniProt.


Note: If the isoform option (`-iso`) is used with a transcript ID, `gget seq` simply fetches the sequence of the specified transcript and notifies you that the isoform option only applies to genes:

In [22]:
!gget seq --transcribe -iso ENSTGUT00000027003.1

Thu Jun  9 02:25:51 2022 INFO We noticed that you may have passed a version number with your Ensembl ID.
Please note that gget seq will return information linked to the latest Ensembl ID version.
Thu Jun  9 02:25:52 2022 INFO Requesting amino acid sequence of ENSTGUT00000027003 from UniProt.
>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 167
MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS
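The FASTA records returned by `gget seq` can be read with a few lines of code. The sketch below parses the record shown above and checks that the sequence length matches the `sequence_length` value declared in the header. The header layout is assumed to be the one in this output, and `parse_fasta` is a hypothetical helper rather than a gget function.

```python
import re

# Record copied from the `gget seq --transcribe` output above.
fasta_text = """>ENSTGUT00000027003 uniprot_id: A0A674GVD2 ensembl_id: ENSTGUT00000027003 gene_name(s): FUNDC1 organism: Taeniopygia guttata (Zebra finch) (Poephila guttata) sequence_length: 167
MLMPGPLRRALGQKFSIFPSVDHDSDDDSYEVLDLTEYARRHHWWNRLFGRNSGPVVEKYSVATQIVMGGVTGWCAGFLFQKVGKLAATAVGGGFLLLQIASHSGYVQVDWKRVEKDVNKAKKQLKKRANKAAPEINTLIEESTEFIKQNIVVSSGFVGGFLLGLAS
"""


def parse_fasta(text: str):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


records = list(parse_fasta(fasta_text))
header, sequence = records[0]
seq_id = header.split()[0]
declared_length = int(re.search(r"sequence_length: (\d+)", header).group(1))
print(seq_id, len(sequence), declared_length)
```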


# BLAST the gene **nucleotide** sequence:

In [23]:
# Show manual
!gget blast

usage: gget blast [-h] [-p {blastn,blastp,blastx,tblastn,tblastx}]
                  [-db {nt,nr,refseq_rna,refseq_protein,swissprot,pdbaa,pdbnt}]
                  [-l LIMIT] [-e EXPECT] [-lcf] [-mbo] [-q] [-csv] [-o OUT]
                  [-seq SEQ_DEPRECATED] [-j]
                  [sequence]

BLAST a nucleotide or amino acid sequence against any BLAST database.

positional arguments:
  sequence              Sequence (str) or path to fasta file.

optional arguments:
  -h, --help            show this help message and exit
  -p {blastn,blastp,blastx,tblastn,tblastx}, --program {blastn,blastp,blastx,tblastn,tblastx}
                        'blastn', 'blastp', 'blastx', 'tblastn', or 'tblastx'.
                        Default: 'blastn' for nucleotide sequences; 'blastp'
                        for amino acid sequences.
  -db {nt,nr,refseq_rna,refseq_protein,swissprot,pdbaa,pdbnt}, --database {nt,nr,refseq_rna,refseq_protein,swissprot,pdbaa,pdbnt}
                        'nt', 'nr', 'ref

Note: `gget blast` also accepts a sequence passed as a string instead of a path to a FASTA file.

In [24]:
!gget blast gene_fasta.fa

Thu Jun  9 02:25:55 2022 INFO Sequence recognized as nucleotide sequence.
Thu Jun  9 02:25:55 2022 INFO BLAST will use program 'blastn' with database 'nt'.
Thu Jun  9 02:25:57 2022 INFO BLAST initiated with search ID A2PHM9NG013. Estimated time to completion: 20 seconds.
Thu Jun  9 02:26:17 2022 INFO BLASTING...
Thu Jun  9 02:27:20 2022 INFO Retrieving results...
[
    {
        "Description": "Aquila chrysaetos chrysaetos genome assembly, chromosome: 7",
        "Scientific Name": "Aquila chrysaetos chrysaetos",
        "Common Name": null,
        "Taxid": 223781,
        "Max Score": 2259,
        "Total Score": 6189,
        "Query Cover": "56%",
        "E value": 0.0,
        "Per. Ident": "79.59%",
        "Acc. Len": 47779391,
        "Accession": "LR606187.1"
    },
    {
        "Description": "Accipiter gentilis genome assembly, chromosome: 32",
        "Scientific Name": "Accipiter gentilis",
        "Common Name": "Northern goshawk",
        "Taxid": 8957,
        "Max Sco
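BLAST reports query coverage and percent identity as strings with a trailing `%`, so they need to be converted before hits can be filtered numerically. The sketch below uses the first hit shown above. The thresholds are arbitrary illustrative values, and `percent` and `passes` are hypothetical helper names.

```python
# First hit copied from the `gget blast gene_fasta.fa` output above.
hit = {
    "Description": "Aquila chrysaetos chrysaetos genome assembly, chromosome: 7",
    "Scientific Name": "Aquila chrysaetos chrysaetos",
    "Common Name": None,
    "Taxid": 223781,
    "Max Score": 2259,
    "Total Score": 6189,
    "Query Cover": "56%",
    "E value": 0.0,
    "Per. Ident": "79.59%",
    "Acc. Len": 47779391,
    "Accession": "LR606187.1",
}


def percent(value: str) -> float:
    """Convert a string such as '79.59%' to the float 79.59."""
    return float(value.rstrip("%"))


def passes(hit: dict, min_identity=75.0, min_cover=50.0, max_evalue=1e-5) -> bool:
    """Check a hit against illustrative identity, coverage, and E-value thresholds."""
    return (
        percent(hit["Per. Ident"]) >= min_identity
        and percent(hit["Query Cover"]) >= min_cover
        and hit["E value"] <= max_evalue
    )


print(hit["Accession"], passes(hit), passes(hit, min_cover=60.0))
```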

# BLAST the **amino acid** sequence of the canonical transcript:

In [25]:
!gget blast transcript_fasta.fa

Thu Jun  9 02:27:21 2022 INFO Sequence recognized as amino acid sequence.
Thu Jun  9 02:27:21 2022 INFO BLAST will use program 'blastp' with database 'nr'.
Thu Jun  9 02:27:23 2022 INFO BLAST initiated with search ID A2PMA2TX013. Estimated time to completion: 19 seconds.
Thu Jun  9 02:27:42 2022 INFO BLASTING...
Thu Jun  9 02:28:44 2022 INFO BLASTING...
Thu Jun  9 02:29:46 2022 INFO BLASTING...
Thu Jun  9 02:30:48 2022 INFO BLASTING...
Thu Jun  9 02:31:50 2022 INFO Retrieving results...
[
    {
        "Description": "FUN14 domain-containing protein 1 isoform X1 [Taeniopygia guttata]",
        "Scientific Name": "Taeniopygia guttata",
        "Common Name": "zebra finch",
        "Taxid": 59729,
        "Max Score": 345,
        "Total Score": 345,
        "Query Cover": "100%",
        "E value": 6e-120,
        "Per. Ident": "100.00%",
        "Acc. Len": 167,
        "Accession": "XP_032606376.2"
    },
    {
        "Description": "FUN14 domain-containing protein 1 isoform X2 [Lonc

# Use the MUSCLE algorithm to align the **nucleotide** sequences of all transcripts:

In [26]:
# Show manual
!gget muscle

usage: gget muscle [-h] [-s5] [-o OUT] [-fa FASTA_DEPRECATED] [fasta]

Align multiple nucleotide or amino acid sequences against each other (using
the Muscle v5 algorithm).

positional arguments:
  fasta                 Path to fasta file containing the sequences to be
                        aligned.

optional arguments:
  -h, --help            show this help message and exit
  -s5, --super5         If True, align input using Super5 algorithm instead of
                        PPP algorithm to decrease time and memory. Use for
                        large inputs (a few hundred sequences).
  -o OUT, --out OUT     Path to save an 'aligned FASTA' (.afa) file with the
                        results, e.g. path/to/directory/results.afa.Default:
                        'None' -> Standard out in Clustal format.
  -fa FASTA_DEPRECATED, --fasta FASTA_DEPRECATED
                        DEPRECATED - use positional argument instead. Path to
                        fasta file containing the seque

In [27]:
!gget muscle gene_iso_fasta.fa

Thu Jun  9 02:31:53 2022 INFO MUSCLE compiled. 
Thu Jun  9 02:31:53 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 13750, max 14321

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
tcmalloc: large alloc 3775569920 bytes == 0x55f313b46000 @  0x7fbcee5ac1e7 0x55f311e57017 0x55f311e0f993 0x55f311e46957 0x7fbcedc30edf 0x55f311e48fe3 0x55f311e04c6d 0x55f311e053fd 0x55f311dfd1f7 0x7fbced41cc87 0x55f311e02e2a
tcmalloc: large alloc 3775569920 bytes == 0x55f3f4bf0000 @  0x7fbcee5ac1e7 0x55f311e57017 0x55f311e0f99f 0x55f311e46957 0x7fbcedc30edf 0x55f311e48fe3 0x55f311e04c6d 0x55f311e053fd 0x55f311dfd1f7 0x7fbced41cc87 0x55f311e02e2a
00:27 8.4Gb   100.0% UPGMA5         
Thu Jun  9 02:32:20 2022 INFO MUSCLE alignment complete. Alignment time: 27.43 seconds


ENSTGUT00000006367 C

# Use the MUSCLE algorithm to align the **amino acid** sequences of all transcripts:

In [28]:
!gget muscle transcript_iso_fasta.fa

Thu Jun  9 02:32:25 2022 INFO MUSCLE compiled. 
Thu Jun  9 02:32:25 2022 INFO MUSCLE aligning... 

muscle 5.2.linux64 [00617b]  13.3Gb RAM, 2 cores
Built Apr 13 2022 00:43:46
(C) Copyright 2004-2021 Robert C. Edgar.
https://drive5.com

Input: 2 seqs, avg length 162, max 167

00:00 38Mb   CPU has 2 cores, running 2 threads
00:00 47Mb    100.0% Calc posteriors
00:00 48Mb    100.0% UPGMA5         
Thu Jun  9 02:32:25 2022 INFO MUSCLE alignment complete. Alignment time: 0.01 seconds


ENSTGUT00000006367 MAARRPR-----------S
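When printed to a terminal, the Clustal-style alignment is colored per residue, presumably with ANSI escape sequences. If that output is captured in a log or file, the color codes appear as literal text. The sketch below removes them with a regular expression, assuming standard SGR sequences of the form `ESC[...m`; the sample string is a hand-made two-residue example, and `strip_ansi` is a hypothetical helper. Saving the alignment with the `-o` flag avoids the issue altogether.

```python
import re

# SGR (color) escape sequence: ESC, '[', numbers separated by ';', then 'm'.
ANSI_SGR = re.compile(r"\x1b\[[0-9;]*m")


def strip_ansi(text: str) -> str:
    """Remove ANSI color codes, leaving only the printable characters."""
    return ANSI_SGR.sub("", text)


# Hand-made example: the residues 'M' and 'A', each wrapped in color codes.
colored = (
    "\x1b[38;5;15m\x1b[48;5;12mM\x1b[0;0m"
    "\x1b[38;5;15m\x1b[48;5;12mA\x1b[0;0m"
)
plain = strip_ansi(colored)
print(plain)
```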