# Using NCBI Datasets command-line tools to download protein sequences of orthologs from certain taxa and prepare them for alignment


### Orthologs

Since `datasets` version 14, users can retrieve ortholog information using the flag `--ortholog` with the `gene` subcommand.

#### <font color='blue'>Wait, but what is an ortholog set?</font>

>An ortholog set, or ortholog gene group, is a group of genes that have been identified by the NCBI genome annotation team as homologous genes that are separated by speciation events. They are identified by a combination of protein similarity and local synteny information.
>Currently, NCBI has ortholog sets calculated for vertebrates and some insects.
>
>You can retrieve the genes in an ortholog set using an identifier for one of its genes, such as a gene symbol or sequence accession.


#### Examples:

`datasets download gene accession NM_007037.6 --ortholog all`  
`datasets download gene gene-id 11095 --ortholog all`  
`datasets download gene symbol adamts8 --taxon 'human' --ortholog all`  

All three commands will download the **same** ortholog set (which is the complete set). 

**What if I want to filter the ortholog set to include *only* a taxonomic group of interest?**

### Applying a taxonomic filter to the ortholog set

When using the `--ortholog` flag, users need to provide an argument for it. The argument should be one or more taxa (any rank) to filter results or 'all' for the complete set.
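
As a rough illustration of the argument format, the sketch below joins a list of taxa with commas and prints the resulting command instead of running it. The helper name `build_ortholog_cmd` is made up for this example and is not part of NCBI Datasets; the comma-separated form mirrors the `--ortholog 'human',cetacea` argument used later in this tutorial.

```bash
# Hypothetical helper (not part of NCBI Datasets): join taxa with commas
# and print the datasets command that would be run (dry run only).
build_ortholog_cmd() {
    local symbol="$1"; shift
    local taxa
    taxa=$(IFS=,; echo "$*")   # "$*" joins the arguments with the first character of IFS
    printf "datasets download gene symbol %s --taxon human --ortholog '%s'\n" "$symbol" "$taxa"
}

cmd_all=$(build_ortholog_cmd adamts8 all)                  # complete ortholog set
cmd_filtered=$(build_ortholog_cmd adamts8 human cetacea)   # human plus cetaceans
echo "$cmd_all"
echo "$cmd_filtered"
```

Printing the command first makes it easy to check the quoting before any data is downloaded.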

### Case Study

A common task for biologists who work on a particular gene or protein is to find a set of orthologous protein sequences and create an alignment of them to identify organism-specific differences (variations). 

An example of a research project that aims to do this focuses on the human ADAMTS8 protein, which has been proposed to serve as a tumor suppressor, with reduced activity noted in many cancers. Despite their vast size and long lifespans, cetaceans (whales, dolphins, and porpoises) have proportionately very low incidences of cancer.

**In this example, we'll start with the human ADAMTS8 protein, find a set of cetacean orthologs, and then align these sequences to look for variations in the cetaceans.** This choice of gene was inspired by a [2021 publication by Tejada-Martinez et al.](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7935004/) on positive selection and duplications of tumor suppressor genes in cetaceans.

### We are going to follow these steps:
- Before downloading the actual sequence data - get a sense of what species are present 
- Download a dataset including the original human ADAMTS8 sequence and all of the available cetacean orthologs
- Unzip it to a custom folder
- Look at some metadata for the genes in the ortholog set
- Clean up FASTA headers and align the protein sequences for the genes in the set
- Optional workflow - download only the longest protein sequence for an ortholog

**Note**: 
This tutorial assumes that you have installed [NCBI Datasets and Dataformat command line tools](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) and the [Clustal Omega](http://www.clustal.org/omega/) aligner. 

#### Assessing available orthologs in a target taxonomic group 

Here, we use the `summary` command to look at metadata without downloading any of the ortholog data just yet. The `dataformat` command takes the information from `summary`, reformats it into a human-readable TSV table, and lets us view just the taxonomic names associated with the available orthologs.

In [None]:
%%bash
datasets summary gene symbol adamts8 --taxon 'human' --ortholog cetacea --as-json-lines | dataformat tsv gene --fields tax-name
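
To turn that listing into a quick species count, the TSV can be piped through standard text tools. The sketch below uses a small stand-in for the pipeline's output (a header line followed by taxonomic names, with one duplicate added on purpose) rather than a live query, so the variable `summary_tsv` and its contents are assumptions of this example.

```bash
# Stand-in for the TSV printed by the summary | dataformat pipeline:
# a header line followed by one taxonomic name per gene.
summary_tsv=$(printf '%s\n' \
    "Taxonomic Name" \
    "Orcinus orca" \
    "Tursiops truncatus" \
    "Physeter catodon" \
    "Orcinus orca")

# Drop the header, keep unique names, and count them.
# tr removes the padding that some wc implementations add.
species_count=$(printf '%s\n' "$summary_tsv" | tail -n +2 | sort -u | wc -l | tr -d ' ')
echo "Distinct taxa: $species_count"
```

With real data, the same `tail -n +2 | sort -u | wc -l` tail could be appended directly to the `dataformat` command.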

#### Download and unzip sequence dataset package containing human sequence and cetacean orthologs

We are satisfied with the number of species present (18 cetaceans at the time of writing) and taxonomic spread of cetacean orthologs for human ADAMTS8, so we are now downloading the actual protein sequence data to align. 

In [None]:
%%bash
# download the ortholog set for cetaceans and the original human sequence
datasets download gene symbol adamts8 --taxon 'human' --ortholog 'human',cetacea --filename adamts8_orthologs.zip --no-progressbar

In [None]:
%%bash
# Unzip it to a folder with the same name
unzip adamts8_orthologs.zip -d adamts8_orthologs


#### Use dataformat to look at some metadata for the downloaded gene set

This time, our input for `dataformat` is the `data_report.jsonl` file that came with the sequence data package, found under `ncbi_dataset/data/` in the unzipped directory. We are adding a `protein-count` column to see if any of our orthologs have multiple associated protein sequences.

In [56]:
%%bash
# Generate a table describing the genes in the ortholog set. 
dataformat tsv gene --inputfile adamts8_orthologs/ncbi_dataset/data/data_report.jsonl \
--fields tax-name,symbol,gene-id,group-method,group-id,protein-count | head

Taxonomic Name	Symbol	NCBI GeneID	Gene Group Method	Gene Group Identifier	Proteins
Orcinus orca	ADAMTS8	101290160	NCBI Ortholog	11095	1
Tursiops truncatus	ADAMTS8	101325122	NCBI Ortholog	11095	1
Physeter catodon	ADAMTS8	102987620	NCBI Ortholog	11095	1
Balaenoptera acutorostrata	ADAMTS8	103000097	NCBI Ortholog	11095	3
Lipotes vexillifer	ADAMTS8	103074223	NCBI Ortholog	11095	1
Homo sapiens	ADAMTS8	11095	NCBI Ortholog	11095	3
Delphinapterus leucas	ADAMTS8	111165488	NCBI Ortholog	11095	1
Neophocaena asiaeorientalis asiaeorientalis	ADAMTS8	112407533	NCBI Ortholog	11095	1
Lagenorhynchus obliquidens	ADAMTS8	113612809	NCBI Ortholog	11095	1
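
To pull out only the genes with more than one protein, the table can be filtered on its last column. This is a minimal sketch that rebuilds three rows from the output above in a temporary file; in practice the same `awk` filter could read the `dataformat` output directly.

```bash
# A few rows copied from the table above, saved as a tab-separated file.
report_tsv=$(mktemp)
printf 'Taxonomic Name\tSymbol\tNCBI GeneID\tGene Group Method\tGene Group Identifier\tProteins\n' > "$report_tsv"
printf 'Orcinus orca\tADAMTS8\t101290160\tNCBI Ortholog\t11095\t1\n' >> "$report_tsv"
printf 'Balaenoptera acutorostrata\tADAMTS8\t103000097\tNCBI Ortholog\t11095\t3\n' >> "$report_tsv"
printf 'Homo sapiens\tADAMTS8\t11095\tNCBI Ortholog\t11095\t3\n' >> "$report_tsv"

# Skip the header (NR > 1) and print taxon and protein count
# for rows where column 6 is greater than 1.
multi=$(awk -F'\t' 'NR > 1 && $6 > 1 { print $1 "\t" $6 }' "$report_tsv")
echo "$multi"
rm -f "$report_tsv"
```

Setting the field separator to a tab matters here because the taxonomic names contain spaces.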


#### Examine protein FASTA file and prepare headers for alignment

As you can see below, the FASTA headers contain a lot of information (Accession, Gene Symbol, Organism and Gene ID) with fields separated by spaces. Spaces in headers can cause issues for downstream analyses and make output hard to read, so at minimum, you may want to replace them with underscores.

In [116]:
%%bash
#Extracting just the headers to look at their format
grep ">" adamts8_orthologs/ncbi_dataset/data/protein.faa | head -5

>XP_029063740.1 ADAMTS8 [organism=Monodon monoceros] [GeneID=114886841]
>XP_033717680.1 ADAMTS8 [organism=Tursiops truncatus] [GeneID=101325122]
>XP_023987074.2 ADAMTS8 [organism=Physeter catodon] [GeneID=102987620]
>XP_059786001.1 ADAMTS8 [organism=Balaenoptera ricei] [GeneID=132370273]
>XP_061058257.1 ADAMTS8 [organism=Eubalaena glacialis] [GeneID=133098920]


In [114]:
%%bash
# Update FASTA headers in the protein sequence file to make clustalo output easier to understand
sed 's/ /_/g' adamts8_orthologs/ncbi_dataset/data/protein.faa > renamed.proteins.faa

You could trim the headers down to just the fields you need with additional scripting, or use the NCBI Datasets data report to access the protein metadata in a form that is more amenable to programming.
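
As one possible way to trim the headers, the sketch below keeps only the accession and the organism name. It assumes the header layout shown above (`>accession symbol [organism=...] [GeneID=...]`); the two headers are copied from that output, while the sequence lines are short placeholders.

```bash
# Two headers copied from the output above, with placeholder sequence lines.
trim_in=$(mktemp)
cat > "$trim_in" <<'EOF'
>XP_029063740.1 ADAMTS8 [organism=Monodon monoceros] [GeneID=114886841]
MSKPGRKYKYEGLGSPLCASRGG
>XP_033717680.1 ADAMTS8 [organism=Tursiops truncatus] [GeneID=101325122]
MSKPGRKYKYEGLGSPL
EOF

# On header lines only: keep the accession and the organism name,
# then replace the remaining spaces with underscores.
trimmed=$(sed -E '/^>/ { s/^>([^ ]+) .*\[organism=([^]]+)\].*/>\1_\2/; s/ /_/g; }' "$trim_in")
echo "$trimmed"
rm -f "$trim_in"
```

Headers that do not match the assumed layout only get their spaces replaced, so it is worth checking the result with `grep ">"` before aligning.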

#### Infer an alignment from renamed FASTA file ####

In this example, we are using Clustal Omega, a powerful open-source aligner, but you are free to use your preferred alignment software.

Publication: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5734385/

Clustal Omega Binaries: http://www.clustal.org/omega/

**The following command will generate both a Clustal formatted alignment and a distance matrix:**


In [124]:
%%bash
# Run alignment
./clustalo --infile renamed.proteins.faa --full --outfmt=clu --distmat-out=adamts8_distmat --force > adamts8_clustal.clu

In [73]:
%%bash
# Look at the Clustal Omega alignment output
head adamts8_clustal.clu

CLUSTAL O(1.2.3) multiple sequence alignment


XP_029063740.1_ADAMTS8_[organism=Monodon_monoceros]_[GeneID=114886841]                                MSKPGRKYKYEGLGSPLCASRGGEKAGAEDAHAAGAPVARDRLGGRAREQRQRRQPEGPP
XP_033717680.1_ADAMTS8_[organism=Tursiops_truncatus]_[GeneID=101325122]                               ------------------------------------------------------------
XP_023987074.2_ADAMTS8_[organism=Physeter_catodon]_[GeneID=102987620]                                 MSKPRRKYKYEGLGSPLWASRAGEKAGAEDARAAGAPVARARLGGRARRQRX-RQPEGPP
XP_059786001.1_ADAMTS8_[organism=Balaenoptera_ricei]_[GeneID=132370273]                               MSKPGRKYKYEGLGSPLCASRGGEKAGAEDAHAAGAPVARARLGGRAREQRP-RQPEGPP
XP_061058257.1_ADAMTS8_[organism=Eubalaena_glacialis]_[GeneID=133098920]                              MSKPGRKYKYEGLGSPLCASRGGEKAGAEDAHAAGAPVA--RLGGRAREQRP-RQPEGSP
XP_024612646.1_ADAMTS8_[organism=Neophocaena_asiaeorientalis_asiaeorientalis]_[GeneID=112407533]      ------------------------------------




### Get one protein per gene from a set of orthologs ###

As we saw above, several species have multiple available proteins per `adamts8` ortholog. To simplify our final alignment, let's select the longest protein sequence per gene from this set of orthologs.

This can be done in three steps:

* Get transcript and protein metadata for the gene products of the ortholog set
* Extract the accession of the longest protein for each gene from this metadata
* Download the set of longest protein sequences, one per gene

#### Get transcript and protein metadata necessary for choosing the ortholog: ####

In [None]:
%%bash
datasets summary gene symbol adamts8 \
--ortholog 'homo sapiens,cetacea' \
--report product --as-json-lines > adamts8_products.jsonl

In [98]:
%%bash
dataformat tsv gene-product \
--inputfile adamts8_products.jsonl \
--fields gene-id,tax-name,symbol,transcript-accession,transcript-length,transcript-protein-accession,transcript-protein-length > transcript_protein.tsv

In [101]:
%%bash
head -5 transcript_protein.tsv

NCBI GeneID	Taxonomic Name	Symbol	Transcript Accession	Transcript Transcript Length	Transcript Protein Accession	Transcript Protein Length
112407533	Neophocaena asiaeorientalis asiaeorientalis	ADAMTS8	XM_024756878.1	3090	XP_024612646.1	850
115844401	Globicephala melas	ADAMTS8	XM_030841433.2	3626	XP_030697293.2	998
116758202	Phocoena sinus	ADAMTS8	XM_032640990.1	5949	XP_032496881.1	998
103000097	Balaenoptera acutorostrata	ADAMTS8	XM_057552216.1	2592	XP_057408199.1	851



#### Extract the accessions of the longest protein ####

To pick a single protein sequence for each gene, we will write a Bash command that parses the `transcript_protein.tsv` file we just created, identifies the longest protein for each gene, and saves its accession to a new file called `longest.list`.

This code first sorts this file by `NCBI Gene ID`, then by protein length, then by Protein Accession number (to produce consistent results if two proteins are of the same length). Next, it uses `awk` to print only the first line per Gene ID, and then cuts just the column of protein IDs. We will then pass on this list of IDs to Datasets.


To learn more about selecting one isoform per ortholog, check out this other Datasets tutorial:  https://www.ncbi.nlm.nih.gov/datasets/docs/v2/tutorials/ortholog-get-one-isoform/.

In [65]:
%%bash
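# Skip the header row, then sort on tab-separated fields: GeneID (numeric),
# protein length (numeric, descending), protein accession (tie-breaker).
# awk keeps the first row per GeneID; cut keeps only the protein accession.
tail -n +2 transcript_protein.tsv | sort -t$'\t' -k1,1n -k7,7nr -k6,6 | \
awk -F'\t' '!seen[$1]++' | \
cut -f6 > longest.list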
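
The sort-then-keep-first logic described above can be tried in isolation on a small made-up table. All gene IDs, accessions, and lengths below are invented for illustration. The sketch sets the field separator to a tab explicitly (taxonomic names contain spaces) and skips the header row; both choices are assumptions of this example.

```bash
# Toy table: every value below is invented for illustration.
toy=$(mktemp)
printf 'NCBI GeneID\tTaxonomic Name\tSymbol\tTranscript Accession\tTranscript Transcript Length\tTranscript Protein Accession\tTranscript Protein Length\n' > "$toy"
printf '200\tExample species b\tGENE\tXM_B1\t3000\tXP_B1\t900\n' >> "$toy"
printf '100\tExample species a\tGENE\tXM_A1\t2500\tXP_A1\t800\n' >> "$toy"
printf '100\tExample species a\tGENE\tXM_A2\t3100\tXP_A2\t950\n' >> "$toy"
printf '100\tExample species a\tGENE\tXM_A3\t3200\tXP_A3\t950\n' >> "$toy"

# Sort by GeneID, then protein length (longest first), then protein
# accession to break ties; keep the first row seen for each GeneID.
longest=$(tail -n +2 "$toy" \
    | sort -t "$(printf '\t')" -k1,1n -k7,7nr -k6,6 \
    | awk -F'\t' '!seen[$1]++ { print $6 }')
echo "$longest"
rm -f "$toy"
```

For gene `100`, two proteins tie at 950 residues, so the accession sort decides which one is kept.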

Now that we have a list of protein accessions, we can use `datasets` to download the sequences.
Use the `--fasta-filter-file` flag to get sequences only for the accessions listed in `longest.list`.

In [None]:
%%bash
datasets download gene accession --no-progressbar  \
--inputfile longest.list \
--fasta-filter-file longest.list \
--filename longest_protein.zip

#### Unzip, update headers and align ####

Just like in the previous example, we are now going to expand the `.zip` file into a new directory, adjust the FASTA headers, and use Clustal Omega to align the sequences and output an alignment and distance matrix.

In [68]:
%%bash
unzip longest_protein.zip -d longest_protein

Archive:  longest_protein.zip
  inflating: longest_protein/README.md  
  inflating: longest_protein/ncbi_dataset/data/protein.faa  
  inflating: longest_protein/ncbi_dataset/data/data_report.jsonl  
  inflating: longest_protein/ncbi_dataset/data/dataset_catalog.json  
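
Before aligning, it can be worth confirming that the package contains one sequence per requested accession. The following is a minimal sketch using stand-in files with invented contents; with real data, the two paths would be `longest.list` and `longest_protein/ncbi_dataset/data/protein.faa`.

```bash
check_dir=$(mktemp -d)
# Stand-ins for longest.list and the downloaded protein.faa (contents invented).
printf 'XP_A2\nXP_B1\n' > "$check_dir/longest.list"
printf '>XP_A2 GENE\nMSKP\n>XP_B1 GENE\nMSKPG\n' > "$check_dir/protein.faa"

requested=$(grep -c . "$check_dir/longest.list")      # non-empty lines in the list
downloaded=$(grep -c '^>' "$check_dir/protein.faa")   # FASTA records in the file

if [ "$requested" -eq "$downloaded" ]; then
    status="ok"
else
    status="mismatch"
fi
echo "requested=$requested downloaded=$downloaded status=$status"
rm -rf "$check_dir"
```

A mismatch would suggest that some accessions in the list were not found or that the filter file did not match the downloaded records.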


In [69]:
%%bash
# Update FASTA headers in the protein sequence file to make clustalo output easier to understand
sed 's/ /_/g' longest_protein/ncbi_dataset/data/protein.faa > longest.renamed.proteins.faa

In [74]:
%%bash
# Run Clustal Omega on this new input file
./clustalo --infile longest.renamed.proteins.faa --full --outfmt=clu --distmat-out=adamts8_distmat_longest --force > adamts8_longest_clustal.clu

### The NIH Comparative Genomics Resource (CGR)

The NCBI tools used in this tutorial (NCBI Datasets CLI tools and NCBI Orthologs) are both part of the [NIH Comparative Genomics Resource (CGR)](https://www.ncbi.nlm.nih.gov/datasets/cgr/). CGR facilitates reliable comparative genomics analyses for all eukaryotic organisms through an NCBI Toolkit and community collaboration. CGR provides comparative genomicists with a wide and expanding taxonomic range of genome assemblies and annotations while specialized BLAST databases, comparative visualization tools, orthology data, protein domain data, and more support your analyses.

Follow us on X @NCBI and join our mailing list to keep up to date with NCBI Datasets, NCBI Orthologs and other CGR news.