# NCBI Datasets (07/21/2022)

### Table of contents
* [Part I: Introduction to NCBI Datasets](#Part-I) 
* [Part II: Getting a list of gene-ids per species](#Part-II)
* [Part III: Getting ortholog data packages from a list of gene-ids](#Part-III)
* [Part IV: Closer look at the ortholog metadata files](#Part-IV)

## Part I: Introduction to NCBI Datasets<a class="anchor" id="Part-I"></a>

### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/download-and-install/) is a command-line tool that lets users download data packages (data + metadata) or view metadata summaries for genomes, RefSeq annotated genes, curated ortholog sets, and SARS-CoV-2 virus sequences and proteins. The program follows a command hierarchy that makes it easier to select exactly which options to use. Beyond the commands themselves, flags are available for filtering the results. We will go over those during this tutorial.
<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v1/datasets_schema_complete.svg" alt="datasets" style="width: 800px;"/>

In addition to `datasets`, we will use `jq` (a JSON parser) and `dataformat`, a `datasets` companion tool, to look at the metadata. Our metadata reports are almost all in JSON or [JSON Lines](https://jsonlines.org/) format. We put together a [jq cheat sheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract information from those files.

For example: if I want to download the cat reference genome (<i>Felis catus</i>), I would use the command below:

In [10]:
%%bash
#Download the cat reference genome, with associated annotation files and metadata

datasets download genome taxon cat --reference --no-progressbar


Process is interrupted.


Instead of downloading a data package, I could look at the metadata by using the `summary` command. Here, I'm piping the output to [`jq`](https://stedolan.github.io/jq/) so it's easier to read:

In [11]:
%%bash
#Check the metadata information for the cat reference genome

datasets summary genome taxon cat --reference | jq .

{
  "assemblies": [
    {
      "assembly": {
        "annotation_metadata": {
          "busco": {
            "busco_lineage": "carnivora_odb10",
            "busco_ver": "4.1.4",
            "complete": 0.9884154,
            "duplicated": 0.0073093367,
            "fragmented": 0.0021376363,
            "missing": 0.009446973,
            "single_copy": 0.98110604,
            "total_count": "14502"
          },
          "file": [
            {
              "estimated_size": "26355419",
              "type": "GENOME_GFF"
            },
            {
              "estimated_size": "1026987832",
              "type": "GENOME_GBFF"
            },
            {
              "estimated_size": "68647008",
              "type": "RNA_FASTA"
            },
            {
              "estimated_size": "16615960",
              "type": "PROT_FASTA"
            },
            {
              "estimated_size": "25749352",
              "type": "GENOME_GTF"
            },
            {
    

        "submitter": "Texas A&M University"
      }
    }
  ],
  "total_count": 1
}


### Data packages

NCBI Datasets delivers data as <u>data packages</u>, which are zip archives containing both data files (FASTA, GFF3, GTF, GBFF) and metadata files (JSON, JSON Lines). The image below shows the contents of all data packages. Files are included depending on their availability. For example, the data package for an annotated genome would include FASTA files (genomic, transcript, protein and CDS sequences) and annotation files (GFF3, GTF and GBFF).


<img src="./images/datapackages.png" alt="data_package" style="width: 800px;"/>

### How to get help when using the command line

Because `datasets` is organized hierarchically, we can ask for help at any level of a command and get very specific answers. For example, if we type `datasets --help`, we will see the first level of commands available.


In [12]:
%%bash
datasets --help

datasets is a command-line tool that is used to query and download biological sequence data
across all domains of life from NCBI databases.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  datasets [command]

Data Retrieval Commands
  summary              print a summary of a gene or genome dataset
  download             download a gene, genome or coronavirus dataset as a zip file
  rehydrate            rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands
  completion           generate autocompletion scripts
  version              print the version of this client and exit
  help                 Help about any command

Flags
      --api-key string   NCBI Datasets API Key
  -h, --help             help for datasets
      --no-progressbar   hide progress bar

Use datasets help <command> for detailed help about a command.


Compare that with the output we get when we type `datasets summary genome taxon formicidae --help`:


In [13]:
%%bash
datasets summary genome taxon formicidae --help


Print a summary of a genome dataset by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank). The summary is returned in JSON format.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/download-and-install/) documentation for information about getting started with the command-line tools.

Usage
  datasets summary genome taxon [flags]

Examples
  datasets summary genome taxon human
  datasets summary genome taxon "mus musculus"
  datasets summary genome taxon 10116

Flags
  -h, --help              help for taxon
      --tax-exact-match   exclude sub-species when a species-level taxon is specified


Global Flags
  -a, --annotated                only include genomes with annotation
      --api-key string           NCBI Datasets API Key
      --as-json-lines            Stream results as newline delimited JSON-Lines
      --assembly-level string    restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, con

## Part II: Getting a list of gene-ids per species <a class="anchor" id="Part-II"></a>

Regardless of whether you choose `datasets download` or `datasets summary`, there are three options for retrieving gene information:
- accession
- gene-id
- symbol

When choosing any of those three options, you will retrieve the gene information for the **reference** taxon. Like this:

`datasets download gene accession XR_002738142.1`  
`datasets download gene gene-id 101081937`  
`datasets download gene symbol BRCA1 --taxon cat`  

All three commands will download the same gene from the cat (<i>Felis catus</i>) <u>reference genome</u>. 

#### accession
Unique identifier. Accessions include RefSeq RNA and protein accessions. Since an accession is unique, the taxon is implied (that is, two sequences from different taxa will never share the same accession number).

#### gene-id
Also a unique identifier. Every annotated RefSeq genome has its own unique set of gene-ids. For example, the gene-id for BRCA1 is 672 in human, while in cat it is 101081937.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. If you use the symbol option, you should specify the species. The default is human.

**Remember**: both `summary` and `download` will return results for the **reference assembly** of a <u>single species</u>. If you want to download a curated set of the same gene for multiple taxa, you should use the option `ortholog`. We'll talk more about it later. For reference, here's the JSON organization of the gene summary metadata. 

### Now let's take a look at a gene example:

<img src="./images/gene_json_gene-id.png" />

We want to extract a list of gene-ids for *Drosophila melanogaster* (tax-id 7227, as used in the command below). We will use that list later to download ortholog sets for each gene-id.

In [None]:
%%bash
# Get list of gene-ids for D. melanogaster and save as a txt file
datasets summary gene taxon 7227 --as-json-lines | jq -r '.gene.gene_id' > dmel_gene-ids.txt

#alternative
# datasets summary gene taxon 7227 | jq -r '.genes[].gene.gene_id' > dmel_gene-ids.txt


In [None]:
%%bash
# Count the number of lines (genes) in the list
wc -l dmel_gene-ids.txt
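
Before feeding the list to a download loop, it can help to check that every line really is a gene-id. The block below is a minimal sketch of such a check, not part of the original workflow. It runs on a small made-up list (the numbers and the file names `gene-ids.txt` and `gene-ids.clean.txt` are placeholders) and keeps only numeric lines, dropping blank lines, duplicates and the literal `null` that `jq -r` prints when a field is missing.

```shell
set -eu

# Made-up stand-in for dmel_gene-ids.txt: it contains a blank line,
# a duplicate and a "null" entry on purpose.
workdir=$(mktemp -d)
cat > "$workdir/gene-ids.txt" <<'EOF'
31208
43767

31208
null
3772082
EOF

# Keep only lines made of digits, then drop repeated ids while
# preserving the original order (awk prints a line the first time it is seen).
grep -E '^[0-9]+$' "$workdir/gene-ids.txt" | awk '!seen[$0]++' > "$workdir/gene-ids.clean.txt"

raw_count=$(wc -l < "$workdir/gene-ids.txt" | tr -d ' ')
clean_count=$(wc -l < "$workdir/gene-ids.clean.txt" | tr -d ' ')
echo "raw lines: $raw_count, usable gene-ids: $clean_count"
```

Order is preserved on purpose, so a later `head -n20` on the cleaned file would still pick the same leading entries.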

## Part III: Getting ortholog data packages from a list of gene-ids <a class="anchor" id="Part-III"></a>

### Orthologs

The options to retrieve ortholog sets are the same as those for genes. We'll go over the differences when using each option:

- accession
- gene-id
- symbol

When choosing any of those three options, you will download the **full ortholog set** to which they belong, unless you apply additional filtering (covered below). Like this:

`datasets download ortholog accession XR_002738142.1`  
`datasets download ortholog gene-id 101081937`  
`datasets download ortholog symbol BRCA1 --taxon cat`  

All three commands will download the **same** ortholog set. 

---

#### <font color='blue'>Wait, but what is an ortholog set?</font>

>An ortholog set, or ortholog gene group, is a group of sequences that have been identified by the NCBI genome annotation team as homologous genes related to each other by speciation events. They are identified by a combination of protein similarity and local synteny information.
Currently, NCBI has ortholog sets calculated for vertebrates and some insects.


#### accession
Unique identifier. Accessions include RefSeq RNA and protein accessions. Since an accession is unique, the taxon is implied (that is, two sequences from different taxa will never share the same accession number).

#### gene-id
Also a unique identifier. Every annotated RefSeq genome has its own unique set of gene-ids. For example, the gene-id for BRCA1 is 672 in human, while in cat it is 101081937. You can use either one (672 or 101081937) to get the same vertebrate BRCA1 ortholog set.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. For example, the P53 ortholog set in vertebrates is different from the insect set. If you use the symbol option, you should specify the taxonomic group. The default is human. Note that if you request ortholog sets for multiple vertebrate species, you might end up downloading the same ortholog set multiple times. Like this:

`datasets download ortholog symbol brca1 --taxon cat`  
`datasets download ortholog symbol brca1 --taxon chicken`  
`datasets download ortholog symbol brca1 --taxon "chelonia mydas"`  

If that's the case, how do you filter the ortholog set to include *only* your taxonomic group of interest?

### Applying a taxonomic filter to the ortholog set

For the orthologs, `datasets` provides the flag `--taxon-filter`, which allows the user to restrict the summary or download to one or multiple taxonomic groups.  `--taxon` and `--taxon-filter` have different effects on the data package/summary output. A few examples:

- `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`  
Prints a JSON metadata summary of the gene brca1 for the domestic cat.
We did not specify `--taxon` because the default is human, and Felidae and human are part of the same brca1 ortholog set.

  

- `datasets summary ortholog symbol brca1 --taxon "felis catus"`  
Even though this command looks almost the same as the one above, the result is *very different*. Here, we're asking `datasets` to find the ortholog set to which the brca1 gene of the domestic cat belongs, so `datasets` returns the <u>entire</u> ortholog set, not only the entries for the domestic cat.


- `datasets summary ortholog symbol brca1 --taxon "felis catus" --taxon-filter "felis catus"`  
gives you the same result as `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`


The summary metadata for orthologs is presented in JSON Lines, which means that each gene entry is on its own line. Here's a diagram to help you create queries.
  
<img src="./images/ortholog_jsonl.drawio.png" />
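
Because JSON Lines puts one record per line, ordinary line-oriented tools already work on it. The block below is a rough illustration on three made-up records; the field names (`gene_id`, `symbol`, `taxname`) and values are placeholders and are not taken from the real ortholog report, so check them against the diagram above before reusing the pattern. For real field extraction, `jq` remains the better tool.

```shell
set -eu

# Made-up, heavily trimmed records: one JSON object per line.
workdir=$(mktemp -d)
cat > "$workdir/sample.jsonl" <<'EOF'
{"gene_id": "1", "symbol": "geneA", "taxname": "Felis catus"}
{"gene_id": "2", "symbol": "geneA", "taxname": "Homo sapiens"}
{"gene_id": "3", "symbol": "geneA", "taxname": "Gallus gallus"}
EOF

# One line per record, so the line count is the record count.
total=$(wc -l < "$workdir/sample.jsonl" | tr -d ' ')

# Counting matching lines counts matching records (placeholder field name).
cats=$(grep -c '"taxname": "Felis catus"' "$workdir/sample.jsonl")

echo "records: $total, cat records: $cats"
```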

#### We are going to do the following steps:
- Use a loop to download an ortholog data package for each gene-id in the list
- Save each package to the folder `orthologs`
- Look at the files

In [None]:
%%bash
# Download one ortholog data package per gene-id. We use a reduced set (`head -n20`) as an example:

# -p avoids an error if the folder already exists
mkdir -p orthologs
head -n20 dmel_gene-ids.txt | while read -r GENEID; do
        echo "${GENEID}"
        # The trailing backslashes keep the flags on the same command
        datasets download ortholog gene-id "${GENEID}" \
        --filename "./orthologs/${GENEID}.zip" \
        --taxon-filter 7215
done

In [None]:
%%bash
# Check the number of ortholog sets downloaded:
ls orthologs
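
Listing the folder shows what arrived, but not what is missing. As a sketch, the block below compares a list of requested ids against the zip files in a folder and records every id whose package is absent or empty (for example after an interrupted download). It runs on a throwaway directory with made-up ids (`101`, `102`, `103`); the file names `requested.txt` and `missing.txt` are placeholders.

```shell
set -eu

# Throwaway layout that mimics the download folder, with made-up ids.
workdir=$(mktemp -d)
mkdir "$workdir/orthologs"
printf '%s\n' 101 102 103 > "$workdir/requested.txt"
echo "placeholder content" > "$workdir/orthologs/101.zip"  # looks complete
touch "$workdir/orthologs/103.zip"                         # empty file; 102 is absent

# Record every requested id without a non-empty zip file (-s: exists and size > 0).
: > "$workdir/missing.txt"
while read -r GENEID; do
    if [ ! -s "$workdir/orthologs/$GENEID.zip" ]; then
        echo "$GENEID" >> "$workdir/missing.txt"
    fi
done < "$workdir/requested.txt"

missing_count=$(wc -l < "$workdir/missing.txt" | tr -d ' ')
echo "missing or empty packages: $missing_count"
cat "$workdir/missing.txt"
```

The resulting `missing.txt` has the same one-id-per-line shape as the input list, so it could be fed back into the download loop to retry only the failed ids.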


## Part IV: Closer look at the ortholog metadata files<a class="anchor" id="Part-IV"></a>

First, let's extract some metadata information from each of those data packages using `dataformat`

`dataformat` is a `datasets` companion tool that lets you export metadata to tabular (TSV) or Excel format. Here, we will use `dataformat` to extract the following fields from each ortholog data package: organism, accession, CDS range start and CDS range end. You can find more info about the JSON schemas on this page #ADD LINK#

In [None]:
%%bash
# The trailing backslashes keep the flags on the same command
dataformat tsv gene \
    --fields tax-name,transcript-cds-accession,transcript-cds-range-start,transcript-cds-range-stop \
    --package 10178785.zip
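
Once the table is saved as a TSV file, the start and stop columns can be combined with standard tools. The block below is a small sketch that derives a CDS length per row with `awk`. The table is made up: the header labels, accessions and coordinates are placeholders, not real `dataformat` output, and the `stop - start + 1` formula assumes 1-based, inclusive coordinates, which should be verified against the schema documentation.

```shell
set -eu

# Made-up table with the same four columns requested above
# (header labels and values are placeholders).
workdir=$(mktemp -d)
{
    printf 'tax_name\tcds_accession\tcds_start\tcds_stop\n'
    printf 'Drosophila simulans\tACC_A.1\t101\t1300\n'
    printf 'Drosophila yakuba\tACC_B.1\t51\t950\n'
} > "$workdir/cds.tsv"

# Replace the start/stop columns with a length column.
# Assumption: coordinates are 1-based and inclusive, hence the "+ 1".
awk -F '\t' 'BEGIN { OFS = "\t" }
    NR == 1 { print $1, $2, "cds_length"; next }
    { print $1, $2, $4 - $3 + 1 }' "$workdir/cds.tsv" > "$workdir/cds_length.tsv"

cat "$workdir/cds_length.tsv"
```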