<a href="https://colab.research.google.com/github/lauraluebbert/gget/blob/dev/examples/gget_workflow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

gget features:
- `gget ref` Fetch FTP links for reference genomes and annotations by species.
- `gget search` Fetch gene and transcript IDs from Ensembl using free-form search terms.
- `gget info` Fetch gene and transcript metadata using Ensembl IDs.
- `gget seq` Fetch nucleotide or amino acid sequences of genes or transcripts.
- `gget blast` BLAST a nucleotide or amino acid sequence against any BLAST database.
- `gget muscle` Align multiple nucleotide or amino acid sequences against each other.
- `gget enrichr` Perform an enrichment analysis on a list of genes using Enrichr.

___

Install gget from the dev branch of its repository (only necessary until the next release):

In [None]:
%config InlineBackend.figure_format='retina'
!git clone -b dev --single-branch https://github.com/lauraluebbert/gget.git -q
!pip install mysql-connector-python -q
!cd gget && pip install . -q

___


<h1><center>Terminal version</center></h1>
<center>Jupyter Lab version below.</center>


In [None]:
!gget

In [None]:
# # Show detailed help page
# !gget -h

___
Ensembl recently released Ensembl 106. `gget ref` and `gget search` now fetch from that release by default unless an earlier release is specified; all other functions are release-independent:

In [None]:
!gget ref -s human -w gtf

Show the genomes that are new in the latest Ensembl release compared to the previous release (105):

In [None]:
!comm -13 <(gget ref -l -r 105 | sort) <(gget ref -l | sort)
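
`comm -13` prints only the lines that are unique to the second sorted input, which here means the species listed for the latest release but not for release 105. The same comparison can be sketched in Python as a set difference. The species lists below are placeholders rather than real Ensembl output, and `new_entries` is a hypothetical helper, not part of gget:

```python
def new_entries(previous, latest):
    """Return the sorted entries of `latest` that are absent from `previous`."""
    return sorted(set(latest) - set(previous))


# Placeholder lists for illustration only, not real `gget ref -l` output.
release_105 = ["species_a", "species_b"]
release_latest = ["species_a", "species_b", "species_c"]

print(new_entries(release_105, release_latest))
```

Unlike `comm`, the set-based version does not need sorted input and collapses duplicate lines.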

___

# Find gene IDs based on free-form search words:
Search for 'fun' genes in the zebra finch (*Taeniopygia guttata*) genome. The abbreviation 'tae' is sufficient because no other genome name begins with those letters.

In [None]:
!gget search -sw fun -s tae
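
The abbreviation works because it matches exactly one species. The sketch below illustrates that idea with a hypothetical helper, `resolve_species`, and a short hand-written species list. It is an assumption about the matching behavior, not gget's actual implementation:

```python
def resolve_species(prefix, species_names):
    """Return the single species name starting with `prefix`, else raise."""
    matches = [name for name in species_names if name.startswith(prefix.lower())]
    if len(matches) != 1:
        raise ValueError(f"'{prefix}' matches {len(matches)} species: {matches}")
    return matches[0]


# Hand-written example list, for illustration only.
species = ["homo_sapiens", "mus_musculus", "taeniopygia_guttata"]

print(resolve_species("tae", species))  # taeniopygia_guttata
```

Raising an error on zero or multiple matches makes an ambiguous abbreviation fail loudly instead of silently picking a genome.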

# Use Enrichr to perform an enrichment analysis on a list of genes

In [None]:
!gget enrichr --genes AIMP1 MFHAS1 BFAR FUNDC1 AIMP2 ASF1A -db pathway

# Fetch additional information about genes/transcripts (like the IDs of all known transcripts of a gene):

In [None]:
# Show short info on a few of the genes
!gget info -id ENSTGUG00000006139 ENSTGUG00000026050 ENSTGUG00000004956

In [None]:
# Expand info to show all transcripts
!gget info -id ENSTGUG00000006139 -e

# Fetch the **nucleotide** sequence of a gene, or the **nucleotide** sequences corresponding to all its known protein isoforms.

In [None]:
!gget seq -id ENSTGUG00000006139 -o gene_fasta.fa

In [None]:
!gget seq -id ENSTGUG00000006139 -iso -o gene_iso_fasta.fa

# Fetch the **amino acid** sequence of the canonical transcript of a gene, or the **amino acid** sequences corresponding to all its known protein isoforms.

In [None]:
# Get amino acid (AA) sequence of canonical transcript
!gget seq -id ENSTGUG00000006139 -st transcript -o transcript_fasta.fa

In [None]:
# Get AA sequences of all isoforms
!gget seq -id ENSTGUG00000006139 -st transcript -iso -o transcript_iso_fasta.fa

Note: If the isoform option is used with a transcript ID, gget simply fetches the sequence of the specified transcript and reports that the isoform option only applies to genes:

In [None]:
!gget seq -id ENSTGUT00000027003.1 -st transcript -iso

# BLAST the gene **nucleotide** sequence:

Note: `blast` also accepts a sequence passed as a string instead of a .fa file.

In [None]:
!gget blast -s gene_fasta.fa -o gene_blast.csv

# BLAST the **amino acid** sequence of the canonical transcript:

In [None]:
!gget blast -s transcript_fasta.fa -o transcript_blast.csv

# Use the MUSCLE algorithm to align the **nucleotide** sequences of all transcripts:
Returns an aligned FASTA (.afa) file.

In [None]:
# For long or many sequences, use the super5 algorithm (flag -s5) to reduce memory usage
# Save the results with the -o flag
!gget muscle -fa gene_iso_fasta.fa
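
An aligned FASTA file has the same layout as a regular FASTA file, except that gap characters (`-`) pad the sequences to a common length. A minimal sketch for reading such a file back into Python follows. `parse_fasta` is an illustrative helper rather than part of gget, and the file name in the comment is hypothetical:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; drop the leading '>'.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so concatenate them.
            records[header] += line
    return records


# Toy alignment for illustration; real input would be read from the .afa file,
# e.g. open("alignment.afa").read() (hypothetical file name).
example = ">seq1\nATG-CA\n>seq2\nATGGCA\n"
aligned = parse_fasta(example)

# All sequences in an alignment should have the same length.
print({len(seq) for seq in aligned.values()})  # {6}
```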

# Use the MUSCLE algorithm to align the **amino acid** sequences of all transcripts:

In [None]:
!gget muscle -fa transcript_iso_fasta.fa

___

<h1><center>Jupyter Lab version</center></h1>

In [None]:
import gget

In [None]:
# # Show manual per sub-function, e.g. for seq:
# help(gget.seq)

___

# Find gene IDs based on free-form search words:

In [None]:
# Note: 'wrap_text' displays the data frame with wrapped text for easier reading
search_results = gget.search("fun", "tae", wrap_text=True)

# Use Enrichr to perform an enrichment analysis on a list of genes

In [None]:
# plot=True displays a graphical overview of the first 15 results
enrichr_df = gget.enrichr(search_results["gene_name"], database="pathway", plot=True)

In [None]:
enrichr_df

# Fetch additional information about genes/transcripts (like the IDs of all known transcripts of a gene):

In [None]:
# Get gene ID of FUNDC1
gene_ID = search_results[search_results["gene_name"]=="FUNDC1"]["ensembl_id"].values[0]
gene_ID

In [None]:
# Show short info on a few genes
# Note: 'wrap_text' displays the data frame with wrapped text for easier reading
df = gget.info([gene_ID, "ENSTGUG00000019264", "ENSTGUG00000022620"], wrap_text=True)

In [None]:
# Show expanded info
info_results = gget.info(gene_ID, expand=True, wrap_text=True)

# Fetch the **nucleotide** sequence of a gene, or the **nucleotide** sequences corresponding to all its known protein isoforms.

In [None]:
gene_fasta = gget.seq(gene_ID)
gene_fasta

In [None]:
gget.seq(gene_ID, isoforms=True)

# Fetch the **amino acid** sequence of the canonical transcript of a gene, or the **amino acid** sequences corresponding to all its known protein isoforms.

In [None]:
# Get AA sequence of canonical transcript
transcript_fasta = gget.seq(gene_ID, seqtype="transcript")
transcript_fasta

In [None]:
# Get AA sequences of all isoforms
gget.seq(gene_ID, seqtype="transcript", isoforms=True)

Note: If the isoform option is used with a transcript ID, gget simply fetches the sequence of the specified transcript and reports that the isoform option only applies to genes:

In [None]:
gget.seq("ENST00000334527", seqtype="transcript", isoforms=True)

# BLAST the gene **nucleotide** sequence:

In [None]:
# Note: 'wrap_text' displays the data frame with wrapped text for easier reading
df = gget.blast(gene_fasta[1], wrap_text=True)
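
The `[1]` index above picks the sequence that follows the first header. Assuming `seq()` returns a flat list that alternates headers and sequences (which the index suggests, but the notebook does not spell out), the entries could be paired up as sketched below. `pair_records` is a hypothetical helper and the example list is made up:

```python
def pair_records(flat):
    """Turn [header1, seq1, header2, seq2, ...] into [(header, seq), ...]."""
    if len(flat) % 2 != 0:
        raise ValueError("expected an even number of entries")
    return list(zip(flat[0::2], flat[1::2]))


# Made-up example in the assumed layout (not real gget output).
flat = [">gene_1", "ATGC", ">gene_2", "GGCA"]
for header, sequence in pair_records(flat):
    print(header, len(sequence))
```

Pairing the entries first avoids hard-coded indices when a result holds more than one sequence, for example when isoforms are requested.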

# BLAST the **amino acid** sequence of the canonical transcript:

In [None]:
df = gget.blast(transcript_fasta[1], wrap_text=True)

# Use the MUSCLE algorithm to align the **nucleotide** sequences of all transcripts:

This section uses the .fa files generated by the terminal commands above. Unlike `blast`, which also accepts a sequence passed as a string, `muscle` only accepts .fa files as input. In Jupyter Lab, compatible .fa files can be generated using the `save=True` option of `seq()`.

In [None]:
gget.muscle("gene_iso_fasta.fa")
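
For sequences that do not come from `seq()`, a .fa file can also be written by hand. The sketch below uses a hypothetical helper, `write_fasta`, with made-up records. It shows the general FASTA layout and is not the code gget uses internally:

```python
import os
import tempfile


def write_fasta(records, path, width=60):
    """Write (header, sequence) pairs to `path` in FASTA format."""
    with open(path, "w") as handle:
        for header, sequence in records:
            # Headers start with '>'; add it if it is missing.
            handle.write(header if header.startswith(">") else ">" + header)
            handle.write("\n")
            # Wrap long sequences at `width` characters per line.
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")


# Made-up records for illustration.
records = [("seq1", "ATGCATGC"), (">seq2", "ATGGATGC")]
path = os.path.join(tempfile.mkdtemp(), "example.fa")
write_fasta(records, path)

with open(path) as handle:
    print(handle.read())
```

The resulting path could then be passed to `muscle` in the same way as the files above.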

# Use the MUSCLE algorithm to align the **amino acid** sequences of all transcripts:

In [None]:
gget.muscle("transcript_iso_fasta.fa")