# NCBI Datasets (07/21/2022)

### Table of contents <a class="anchor" id="top"></a>
* [Part I: Introduction to NCBI Datasets](#Part-I) 
* [Part II: Getting a list of gene-ids per species](#Part-II)
* [Part III: Getting ortholog data packages from a list of gene-ids](#Part-III)
* [Part IV: Closer look at the ortholog metadata files](#Part-IV)

## Part I: Introduction to NCBI Datasets<a class="anchor" id="Part-I"></a>

### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/download-and-install/) is a command-line tool for downloading data packages (data + metadata) or viewing metadata summaries for genomes, RefSeq-annotated genes, curated ortholog sets, and SARS-CoV-2 virus sequences and proteins. The program follows a command hierarchy that makes it easier to select exactly which options you would like to use. In addition to the program commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="https://www.ncbi.nlm.nih.gov/datasets/docs/v1/datasets_schema_complete.svg" alt="datasets" style="width: 800px;"/>

In addition to `datasets`, we will be using `jq` (a JSON parser) and `dataformat`, a `datasets` companion tool, to take a look at the metadata. Our metadata reports are almost all in JSON or [JSON Lines](https://jsonlines.org/) format. We put together a [jq cheat sheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract information from those files.  

For example: if I want to download the monarch butterfly reference genome (taxid 13037), I would use the command below:

In [None]:
%%bash
#Download the monarch butterfly (taxid 13037) reference genome, with associated annotation files and metadata

datasets download genome taxon 13037 --reference --no-progressbar


Instead of downloading a data package, I could look at the metadata by using the `summary` command. Here, I'm piping it to [`jq`](https://stedolan.github.io/jq/) so it's easier to read:

In [None]:
%%bash
#Check the metadata information for the monarch butterfly reference genome

datasets summary genome taxon 13037 --reference | jq .
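
If you would rather post-process the summary in Python than in `jq`, a minimal sketch could look like the one below. The sample JSON is made up, and the field names (`assemblies`, `assembly`, `assembly_accession`, `contig_n50`) are assumptions about the summary layout, so check them against the real `jq .` output before relying on them.

```python
import json

# Hypothetical, heavily trimmed stand-in for `datasets summary genome` output.
# Field names and values are assumptions for illustration only.
sample_summary = """
{
  "assemblies": [
    {"assembly": {"assembly_accession": "GCF_000000001.1", "contig_n50": 100000}},
    {"assembly": {"assembly_accession": "GCF_000000002.1", "contig_n50": 250000}}
  ],
  "total_count": 2
}
"""


def summarize_assemblies(summary_text):
    """Return (accession, contig N50) pairs from a genome summary JSON string."""
    summary = json.loads(summary_text)
    return [
        (entry["assembly"]["assembly_accession"], entry["assembly"].get("contig_n50"))
        for entry in summary.get("assemblies", [])
    ]


rows = summarize_assemblies(sample_summary)
print(f"{len(rows)} assemblies")
for accession, n50 in rows:
    print(accession, n50)
```

To try it on real data, you could save the summary to a file first and pass the file contents to `summarize_assemblies`.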

### Data packages

NCBI Datasets delivers data as <u>data packages</u>, which are zip archives containing both data files (FASTA, GFF3, GTF, GBFF) and metadata files (JSON, JSON Lines). The image below shows the contents of all data packages. Files are included depending on their availability. For example: for an annotated genome, the data package would include FASTA files (genomic, transcript, protein and CDS sequences) and annotation files (GFF3, GTF and GBFF).


<img src="./images/datapackages.png" alt="data_package" style="width: 800px;"/>
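
Because a data package is an ordinary zip archive, you can inspect it without unzipping. Here is a small Python sketch that counts the files in a package by extension. It builds a tiny in-memory archive with hypothetical contents so it runs anywhere; the file names are placeholders, not a guaranteed package layout.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def build_demo_package():
    """Create an in-memory zip that loosely mimics a data package (hypothetical files)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("README.md", "demo package\n")
        archive.writestr("ncbi_dataset/data/data_report.jsonl", '{"taxname": "Demo species"}\n')
        archive.writestr("ncbi_dataset/data/dataset_catalog.json", "{}\n")
        archive.writestr("ncbi_dataset/data/protein.faa", ">demo\nMK\n")
    buffer.seek(0)
    return buffer


def count_by_extension(package):
    """Count archive members by file extension without extracting them."""
    with zipfile.ZipFile(package) as archive:
        return Counter(PurePosixPath(name).suffix for name in archive.namelist())


counts = count_by_extension(build_demo_package())
for extension, n in sorted(counts.items()):
    print(extension, n)
```

`count_by_extension` also accepts a path to a real package on disk, since `zipfile.ZipFile` takes either a file name or a file-like object.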

Currently, the `virus` option only includes SARS-CoV-2 genomes and proteins. 

### How to get help when using the command line

Because `datasets` is organized as a hierarchy of commands, we can ask for help at any level and get very specific answers. For example: if we type `datasets --help`, we will see the first level of commands available.


In [None]:
%%bash
datasets --help

Notice the difference from when we type `datasets summary genome taxon formicidae --help`  


In [None]:
%%bash
datasets summary genome taxon formicidae --help

### Exercises:

Now let's use `datasets` and `jq` to take a look at the available genomes for your taxon of interest. Use our [jq cheat sheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to see which fields are available and to find the information you need. Some ideas: 
- How many genomes are available?
- What is the contig N50 of each genome?


In [None]:
%%bash



In [None]:
%%bash



In [None]:
%%bash



⬆︎[back to top](#top)

## Part II: Getting a list of gene-ids per species <a class="anchor" id="Part-II"></a>

Whether you choose `datasets download` or `datasets summary`, there are three options for retrieving gene information: <i>accession</i>, <i>gene-id</i>, and <i>symbol</i>.  

When choosing any of those three options, you will retrieve the gene information for the **reference** taxon. Like this:

`datasets download gene accession XR_002738142.1`  
`datasets download gene gene-id 101081937`  
`datasets download gene symbol BRCA1 --taxon cat`  

All three commands will download the same gene from the cat (<i>Felis catus</i>) <u>reference genome</u>. 

- **accession**: Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, the taxon is implied (that is, there will never be two sequences from different taxa with the same accession number).  

- **gene-id**: Also a unique identifier. Every annotated RefSeq genome has a unique set of identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat it is 101081937.  

- **symbol**: Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. If using the symbol option, you should specify the species. The default is human.

**Remember**: both `summary` and `download` will return results for the **reference assembly** of a <u>single species</u>. If you want to download a curated set of the same gene for multiple taxa, you should use the option `ortholog`. We'll talk more about it later. For reference, the figure in the next section shows the JSON organization of the gene summary metadata. 

### Now let's take a look at a gene example:

<img src="./images/gene_json_gene-id.png" />

We want to extract a list of gene-ids for *Drosophila melanogaster* (tax-id 7227). We will use that list later to download ortholog sets for each gene-id. Here, we will invoke the command `summary` and pipe the JSON output to `jq` to extract the complete list of gene-ids for that species.   
If you look at the figure above and follow the numbers in the command below, you'll better understand how to build a `jq` filtering command.

In [None]:
%%bash
# Get list of gene-ids for D. melanogaster and save as a txt file

                                             #1     #2    #3
datasets summary gene taxon 7227 | jq -r '.genes[].gene.gene_id' > dmel_gene-ids.txt


Let's take a quick look at how many genes are annotated in <i>D. melanogaster</i>.

In [None]:
%%bash
# Count the number of lines (genes) in the list

wc -l dmel_gene-ids.txt
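
The two cells above can also be mirrored in Python, which may help if `jq` syntax is new to you. This is only a sketch: the sample JSON is made up, and only the nesting (`.genes[].gene.gene_id`) is taken from the `jq` filter above.

```python
import json

# Tiny made-up stand-in for `datasets summary gene taxon 7227` output.
# Only the nesting genes -> gene -> gene_id comes from the jq filter; ids are invented.
sample_summary = (
    '{"genes": ['
    '{"gene": {"gene_id": "1001"}}, '
    '{"gene": {"gene_id": "1002"}}, '
    '{"gene": {"gene_id": "1003"}}'
    ']}'
)


def extract_gene_ids(summary_text):
    """Equivalent of jq -r '.genes[].gene.gene_id' for a JSON string."""
    summary = json.loads(summary_text)
    return [str(entry["gene"]["gene_id"]) for entry in summary.get("genes", [])]


gene_ids = extract_gene_ids(sample_summary)
print("\n".join(gene_ids))      # what would go into dmel_gene-ids.txt
print(len(gene_ids), "genes")   # same count `wc -l` would report
```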

### Exercises

Choose another species and look for the number of genes annotated in it. Hint: you can pipe the `wc -l` command (count number of lines) after the `jq` command that extracts the gene-ids. 

In [3]:
%%bash




⬆︎[back to top](#top)

## Part III: Accessing orthologs <a class="anchor" id="Part-III"></a>

### Orthologs

The options to retrieve ortholog sets are the same as those for genes. We'll go over the differences when using each option: <i>accession</i>, <i>gene-id</i>, and <i>symbol</i>.   

When choosing any of those three options, you will download the **full ortholog set** to which they belong, unless you apply additional filtering (covered below). Like this:

`datasets download ortholog accession XR_002738142.1`  
`datasets download ortholog gene-id 101081937`  
`datasets download ortholog symbol BRCA1 --taxon cat`  

All three commands will download the **same** ortholog set. 

---

#### <font color='blue'>Wait, but what is an ortholog set?</font>

>An ortholog set, or ortholog gene group, is a group of sequences that have been identified by the NCBI genome annotation team as homologous genes related to each other by speciation events. They are identified by a combination of protein similarity and local synteny information. 
>Currently, NCBI has ortholog sets calculated for vertebrates and some insects. 

---

- **accession**: Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, the taxon is implied (that is, there will never be two sequences from different taxa with the same accession number).

- **gene-id**: Also a unique identifier. Every annotated RefSeq genome has a unique set of identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat it is 101081937. You can use either one (672 or 101081937) to get the same vertebrate BRCA1 ortholog set.

- **symbol**: Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. For example: the P53 ortholog set in vertebrates is different from the insect set. If using the symbol option, you should specify the taxonomic group. The default is human. Note that if you request ortholog sets by symbol for multiple vertebrate species, you might end up downloading the same ortholog set multiple times. Like this: 

`datasets download ortholog symbol brca1 --taxon cat`  
`datasets download ortholog symbol brca1 --taxon chicken`  
`datasets download ortholog symbol brca1 --taxon "chelonia mydas"`  

If that's the case, how do you filter the ortholog set to include *only* your taxonomic group of interest?

### Applying a taxonomic filter to the ortholog set

For the orthologs, `datasets` provides the flag `--taxon-filter`, which allows the user to restrict the summary or download to one or multiple taxonomic groups.  `--taxon` and `--taxon-filter` have different effects on the data package/summary output. A few examples:

- `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`  
Prints a JSON metadata summary of the gene brca1 for the domestic cat. 
We did not specify a `--taxon` because the default is human, and Felidae and human are part of the same brca1 ortholog set.   

  

- `datasets summary ortholog symbol brca1 --taxon "felis catus"`  
Even though this command looks almost the same as the one above, the result is *very different*. Here, we're asking `datasets` to find the ortholog set to which the gene brca1 in the domestic cat belongs. And `datasets` will return the <u>entire</u> ortholog set, not only the sequences for the domestic cat.


- `datasets summary ortholog symbol brca1 --taxon "felis catus" --taxon-filter "felis catus"`  
Gives you the same result as `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`.
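
The difference between the two flags can be sketched as a toy model in Python: `--taxon` decides *which* ortholog set is looked up, and `--taxon-filter` decides which members of that set are kept. Everything below (set names, lookup table, members) is invented for illustration and is not how `datasets` is implemented. The real filter also accepts whole taxonomic groups, while this toy only matches a single name.

```python
# Made-up ortholog set, keyed by a hypothetical set name.
ORTHOLOG_SETS = {
    "vertebrate_brca1": {"human", "felis catus", "chicken", "chelonia mydas"},
}

# Hypothetical lookup: which set a (symbol, taxon) pair belongs to.
SET_LOOKUP = {
    ("brca1", "human"): "vertebrate_brca1",
    ("brca1", "felis catus"): "vertebrate_brca1",
}


def ortholog_summary(symbol, taxon="human", taxon_filter=None):
    """Toy version of `summary ortholog symbol`: pick a set, then optionally filter it."""
    members = ORTHOLOG_SETS[SET_LOOKUP[(symbol, taxon)]]   # role of --taxon (default: human)
    if taxon_filter is not None:                           # role of --taxon-filter
        members = members & {taxon_filter}
    return sorted(members)


only_filter = ortholog_summary("brca1", taxon_filter="felis catus")
only_taxon = ortholog_summary("brca1", taxon="felis catus")
both = ortholog_summary("brca1", taxon="felis catus", taxon_filter="felis catus")

print(only_filter)  # just the cat entry
print(only_taxon)   # the entire set
print(both)         # same as only_filter
```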


The summary metadata for orthologs is presented in JSON Lines, which means that each gene entry is a complete JSON object on its own line. Here's a diagram to help you create queries.
  
<img src="./images/ortholog_jsonl.drawio.png" />
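
Since every line of a JSON Lines file is a standalone JSON object, you can parse it one line at a time. A minimal Python sketch, using made-up records: only the `taxname` field name appears elsewhere in this tutorial, and `symbol` and all values are placeholders.

```python
import json
from collections import Counter

# Made-up JSON Lines text: one JSON object per line (blank lines are tolerated).
sample_jsonl = """\
{"taxname": "Species A", "symbol": "gene1"}
{"taxname": "Species B", "symbol": "gene1"}

{"taxname": "Species A", "symbol": "gene1-like"}
"""


def read_json_lines(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = read_json_lines(sample_jsonl)
per_taxon = Counter(record["taxname"] for record in records)
print(len(records), "gene entries")
print(per_taxon.most_common())
```

Note that calling `json.loads` on the whole file at once would fail, because the file as a whole is not a single JSON document.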

#### We are going to do the following steps:
- Use a loop to download an ortholog data package for each gene-id in the list. Because the full list is very long, we will download packages for only the first 20 gene-ids.
- Unzip one of the downloaded packages
- Look at the files

In [None]:
%%bash
# Download one ortholog data package per gene-id, using a reduced set (`head -n20`) as an example.

rm -rf orthologs   # remove the orthologs folder if it already exists
mkdir orthologs    # create an empty orthologs folder

# --taxon-filter 7215 keeps only sequences from the genus Drosophila (tax-id 7215)
head -n20 dmel_gene-ids.txt | while read -r GENEID; do
    echo "${GENEID}"
    datasets download ortholog gene-id "${GENEID}" \
        --include-cds \
        --filename "./orthologs/${GENEID}.zip" \
        --taxon-filter 7215 \
        --no-progressbar
done
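
If you prefer to drive the loop from Python, the sketch below builds the same command for each gene-id and runs it with `subprocess`. The flags are copied from the bash cell above. The helper names are hypothetical, and treating a non-zero exit code as "no ortholog set" is an assumption about how `datasets` reports that error.

```python
import shutil
import subprocess
from pathlib import Path


def build_download_command(gene_id, out_dir="orthologs", taxon_filter="7215"):
    """Return the argv list for one ortholog download (same flags as the bash loop)."""
    return [
        "datasets", "download", "ortholog", "gene-id", str(gene_id),
        "--include-cds",
        "--filename", str(Path(out_dir) / f"{gene_id}.zip"),
        "--taxon-filter", str(taxon_filter),
        "--no-progressbar",
    ]


def download_orthologs(gene_ids, out_dir="orthologs"):
    """Run one download per gene-id and return the ids whose command failed."""
    Path(out_dir).mkdir(exist_ok=True)
    failed = []
    for gene_id in gene_ids:
        result = subprocess.run(build_download_command(gene_id, out_dir))
        if result.returncode != 0:  # assumption: errors are signalled by the exit code
            failed.append(gene_id)
    return failed


# Only print the command here, so running this cell downloads nothing.
command = build_download_command("10178777")
print(" ".join(command))
if shutil.which("datasets") is None:
    print("datasets not found on PATH; showing the command only")
```

To actually download, you could read the first 20 lines of `dmel_gene-ids.txt` into a list and pass it to `download_orthologs`.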

If you look at the output above, you will notice not all gene-ids were part of ortholog sets (`Error: no valid NCBI gene identifiers, exiting`). Let's check the number of ortholog sets downloaded.

In [None]:
%%bash

ls orthologs
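
To see exactly which gene-ids ended up without a package, you could compare the requested list against the zip files on disk. A minimal sketch, assuming each package is named `<gene-id>.zip` as in the loop above and that a failed download leaves no zip file behind. The demo uses a temporary folder and made-up ids.

```python
import tempfile
from pathlib import Path


def split_by_download(gene_ids, package_dir):
    """Split gene-ids into (downloaded, missing) based on <gene-id>.zip files."""
    present = {path.stem for path in Path(package_dir).glob("*.zip")}
    downloaded = [gene_id for gene_id in gene_ids if gene_id in present]
    missing = [gene_id for gene_id in gene_ids if gene_id not in present]
    return downloaded, missing


# Demo with a temporary folder and made-up gene-ids, so nothing real is touched.
with tempfile.TemporaryDirectory() as demo_dir:
    for gene_id in ("111", "333"):
        (Path(demo_dir) / f"{gene_id}.zip").touch()
    downloaded, missing = split_by_download(["111", "222", "333"], demo_dir)

print(len(downloaded), "downloaded:", downloaded)
print(len(missing), "missing:", missing)
```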

Now, let's unzip one of those data packages and check its content:

In [None]:
%%bash

unzip orthologs/10178777.zip -d 10178777

Let's use the command `tree` to view the folder hierarchy more easily.

In [None]:
%%bash
tree 10178777/

In the `data` folder, you will find all the data and metadata files:
- FASTA sequences (CDS, gene, protein and RNA); 
- `data_report.jsonl`: metadata file describing the genes included in the data package. 
- `dataset_catalog.json`: lists all files included in the data package
- `data_table`: TSV file with a subset of the metadata from the `data_report.jsonl`

In the next section, we will take a closer look at the metadata files. 

⬆︎[back to top](#top)

## Part IV: Accessing metadata<a class="anchor" id="Part-IV"></a>


For the JSON Lines metadata files included with the data packages, you have the option of using `dataformat` instead of `jq` to extract metadata. `dataformat` is a `datasets` companion tool that allows you to export metadata to tabular or Excel format. We are working to make `dataformat` compatible with all `datasets summary` command options (currently, it works only with `datasets summary virus`).

Here, we will use `dataformat` to extract the following information from an ortholog data package: taxonomic name, genomic range accession, genomic range start, and genomic range stop. You can find more information about the `dataformat` fields for gene in [our documentation page](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/reference-docs/command-line/dataformat/tsv/dataformat_tsv_gene/).

In [2]:
!dataformat tsv gene \
--package orthologs/10178777.zip \
--fields tax-name,genomic-range-accession,genomic-range-range-start,genomic-range-range-stop |\
column -t -s$'\t' #added to align all the columns and make the visualization easier. But not really needed.

Taxonomic Name           Genomic Range Sequence Accession  Genomic Range Start  Genomic Range Stop
Drosophila melanogaster  NT_033777.3                       17420638             17420989
Drosophila biarmipes     NW_025319173.1                    23563281             23563646
Drosophila rhopaloa      NW_025335059.1                    2369786              2370166
Drosophila takahashii    NW_025323511.1                    10203538             10203894
Drosophila eugracilis    NW_024573645.1                    1690150              1690521
Drosophila elegans       NW_024545863.1                    5457384              5457919
Drosophila mauritiana    NC_046670.1                       9555695              9556075
Drosophila suzukii       NW_023496835.1                    1607206              1607604
Drosophila santomea      NC_053019.2                       17363835             17364169
Drosophila teissieri     NC_053032.1                       11934060             11934395
Droso

If we prefer to use `jq`, here's how to achieve the same result.

In [None]:
%%bash

# Build one tab-separated row per gene entry: taxon name, accession, range start, range end
jq -r '[.taxname,
.genomicRanges[].accessionVersion,
.genomicRanges[].range[].begin,
.genomicRanges[].range[].end] | @tsv' 10178777/ncbi_dataset/data/data_report.jsonl
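
For comparison, here is a Python sketch of the same extraction. The sample line reuses the *D. melanogaster* values from the `dataformat` table above. The nesting follows the `jq` filter, and storing `begin` and `end` as strings is an assumption.

```python
import json

# One made-up data_report.jsonl line, shaped after the jq filter above.
sample_jsonl = (
    '{"taxname": "Drosophila melanogaster", "genomicRanges": ['
    '{"accessionVersion": "NT_033777.3", '
    '"range": [{"begin": "17420638", "end": "17420989"}]}]}\n'
)


def report_rows(text):
    """Build one row per record: taxname, accessions, range begins, range ends."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        ranges = record.get("genomicRanges", [])
        row = [record.get("taxname", "")]
        row += [r["accessionVersion"] for r in ranges]
        row += [str(span["begin"]) for r in ranges for span in r["range"]]
        row += [str(span["end"]) for r in ranges for span in r["range"]]
        rows.append(row)
    return rows


rows = report_rows(sample_jsonl)
for row in rows:
    print("\t".join(row))
```

As with the `jq` version, a record with more than one genomic range would produce a longer row, so check the column count before loading the output into a table.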

Useful tip: if you want to extract a list of all fields from any JSON file, you can use the code in the cell below. Notice that the field names are missing a period (".") in the beginning, so don't forget to add that when you use the info from this list.


Modified from here: https://www.fabian-keller.de/blog/5-useful-jq-commands-parse-json-cli/

In [None]:
%%bash

cat 10178777/ncbi_dataset/data/data_report.jsonl | jq 'select(objects)|=[.] 
        | map( paths(scalars) ) 
        | map( map(select(numbers)="[]") 
        | join("."))' | sort | uniq | sed 's/\.\[/\[/g'
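
The same idea can be sketched in Python with a short recursive function. Unlike the `jq` cell, this version keeps the leading period, so the paths can be pasted into a `jq` filter as they are. The sample record is made up and only shaped after the fields used earlier in this section.

```python
import json


def field_paths(value, prefix=""):
    """Yield a jq-style path for every scalar found in a parsed JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}")
    elif isinstance(value, list):
        for child in value:
            yield from field_paths(child, f"{prefix}[]")
    else:
        yield prefix


# Made-up JSON Lines sample with the nesting used in the cells above.
sample_jsonl = (
    '{"taxname": "Species A", "genomicRanges": ['
    '{"accessionVersion": "XX_000001.1", "range": [{"begin": "1", "end": "10"}]}]}\n'
)

# Collect the unique paths across all lines, like `sort | uniq` in the bash cell.
paths = sorted(
    {
        path
        for line in sample_jsonl.splitlines()
        if line.strip()
        for path in field_paths(json.loads(line))
    }
)
print("\n".join(paths))
```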

### Exercises

Use `dataformat` to extract metadata from another data package. Remember that `dataformat` accepts both zip packages (`--package`) and data reports (`--inputfile`) as inputs, so no need to unzip anything.  

Also, instead of using `%%bash`, please use an exclamation mark `!` in front of the command.

In [None]:
!dataformat 



⬆︎[back to top](#top)