# NCBI Datasets - CSHL (11/02/2021)

### Table of contents
* [Part I: Accessing genomes](#Part-I)
* [Part II: Accessing genes](#Part-II)
* [Part III: Accessing orthologs](#Part-III)
* [Part IV: Building a BLAST database and creating a phylogenetic tree](#Part-IV)
* [Part V: Downloading large datasets (dehydration/rehydration) and `dataformat`](#Part-V)

### Important resources
- Etherpad: https://etherpad.wikimedia.org/p/CSHL_Datasets_Workshop_2021
- GitHub: https://github.com/ncbi/datasets/tree/workshop-cshl-2021/training/cshl-2021
- NCBI datasets: https://www.ncbi.nlm.nih.gov/datasets/
- jq cheat sheet: https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md

## Before we start... What is a Jupyter Notebook?

Jupyter Notebooks are a web-based approach to interactive code. A single notebook (the file you are currently reading) is composed of many "cells", each of which contains either text or code. To move between cells, click on one or use the arrow keys on your keyboard.

A text cell will look like... well... this! While a code cell will look something like what you see below. To run the code inside a code cell, click on it, then click the "Run" button at the top of the screen. Try it on the code cell below!

In [1]:
#This is a code cell
print('You ran the code cell!')

You ran the code cell!


If it worked, you should have seen text pop up underneath the cell saying `You ran the code cell!`. Note the `In [1]:` that appeared next to the cell. This number tells you the order in which you have run code cells throughout the notebook. The next time you run a code cell, it will say `In [2]:`, then `In [3]:`, and so on. This will help you know if and when code has been run.

The remainder of the notebook below has been pre-built by the workshop organizer. You will not need to create any new cells, and you will be explicitly told if/when to execute a code cell.

The code in this workshop is either Bash (i.e., terminal commands) or Python. Bash commands are either prefixed with `!` or placed in cells that start with the notation `%%bash`, while Python commands have no prefix. If you are not familiar with code, don't feel pressured to interpret it very deeply. Descriptions of each code block will be provided!

(Jupyter Notebook explanation by Cooper Park at the workshop on [Finding and Analyzing Metagenomic Data](https://www.nlm.nih.gov/oet/ed/ncbi/2021_10_meta.html))

## Case study: Elmo loves ants

Elmo is a graduate student at the Via Sesamum University. As part of his Ph.D. project, he studies Panamanian leaf cutter ants (genus *Acromyrmex*, family Formicidae) and how variation in the gene *orco* (**o**dorant **r**eceptor **co**receptor) affects the colonies of this genus.

(Here's the [link](https://www.ncbi.nlm.nih.gov/labs/pmc/articles/PMC5556950/) to a cool paper about this gene in ants of the species *Ooceraea biroi*.)

<img src="./images/ants.png" alt="image"/>

Elmo will use `datasets` to help him gather the existing genomic resources from NCBI. He will:

- download all available genomes for the genus *Acromyrmex*
- download the *orco* gene from the *Acromyrmex* reference genome
- download the ortholog set for this gene for all ants (Formicidae)

In addition, he will:
- create a custom BLAST database with the Panamanian leaf cutter ant genomes
- BLAST the *orco* gene against the database
- build a multiple sequence alignment of the BLAST results and the ortholog gene sequences
- build a phylogenetic tree using FastTree


### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/quickstarts/command-line-tools/) is a command line tool that allows users to download data packages (data + metadata) or look at metadata summaries for genomes, RefSeq annotated genes, curated ortholog sets and SARS-CoV-2 virus sequences and proteins. The program follows a hierarchy that makes it easier for users to select exactly which options they would like to use. Beyond the program commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="./images/datasets_horizontal.drawio.png" alt="datasets" style="width: 600px;"/>
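
To make the hierarchy concrete, here is a minimal Python sketch that assembles a `datasets` call level by level. The helper `build_datasets_command` is a hypothetical name introduced only for illustration, and the sketch only prints the command rather than running it. The one command it builds is the genome summary used later in this notebook.

```python
import shlex


def build_datasets_command(action, data_type, selector, value):
    """Compose a `datasets` call as an argument list.

    Hypothetical helper: it only mirrors the order program, action,
    data type, selector, value. It does not validate the arguments.
    """
    return ["datasets", action, data_type, selector, value]


# The genome summary for ants that is run in Part I of this notebook.
cmd = build_datasets_command("summary", "genome", "taxon", "formicidae")

# Print the command as it would be typed in a terminal.
print(shlex.join(cmd))

# If `datasets` is installed, the same list could be passed to
# subprocess.run(cmd) to execute it (not done here).
```

Keeping the command as a list of words, rather than a single string, makes each level of the hierarchy easy to swap out.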

In addition to `datasets`, we will be using `jq` (a JSON parser) to take a look at the metadata. Our metadata reports are almost all in JSON or [JSON Lines](https://jsonlines.org/) format. We put together a [jq cheat sheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract information from those files.
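
The difference between the two formats is easiest to see in code. The sketch below is a rough Python illustration, not part of the workshop material. The sample strings are made up for the example: the first accession is the one that appears in the workshop output, while `EXAMPLE_0001` and the flat record layout of the JSON Lines sample are placeholders, not a real report schema.

```python
import json

# A single JSON document: the whole text is one value.
json_document = (
    '{"assemblies": [{"assembly": {"assembly_accession": "GCA_017607545.1"}}]}'
)

# JSON Lines: every line is its own independent JSON value.
# The record layout and the second accession are placeholders.
json_lines = (
    '{"assembly_accession": "GCA_017607545.1", "assembly_level": "Scaffold"}\n'
    '{"assembly_accession": "EXAMPLE_0001", "assembly_level": "Contig"}\n'
)


def parse_json_lines(text):
    """Parse JSON Lines text into a list, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


whole = json.loads(json_document)        # parsed in one call
records = parse_json_lines(json_lines)   # parsed one line at a time

print(whole["assemblies"][0]["assembly"]["assembly_accession"])
for record in records:
    print(record["assembly_accession"], record["assembly_level"])
```

Because each JSON Lines record stands alone, large reports can be processed line by line without loading the whole file at once.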

## Part I: Accessing genomes<a class="anchor" id="Part-I"></a>

![workflow](./images/elmo_workflow.drawio.png)

First, let's figure out what kind of genome information NCBI has for ants (family Formicidae).

<img src="./images/genome_summary.drawio.png" style="width: 600px;"/>

In [2]:
%%bash
# Get metadata info - this example output has been truncated to fit the notebook output screen in github
# by printing only the first 1500 characters

datasets summary genome taxon formicidae | cut -c1-1500

# Original code run in the workshop
# datasets summary genome taxon formicidae


{"assemblies": [{"assembly": {"annotation_metadata":{"file":[{"estimated_size":"3421616","type":"GENOME_GFF"},{"estimated_size":"129483045","type":"GENOME_GBFF"},{"estimated_size":"3444924","type":"PROT_FASTA"},{"estimated_size":"2684704","type":"GENOME_GTF"},{"estimated_size":"7862131","type":"CDS_FASTA"}],"name":"From INSDC submitter","release_date":"2021-03-29","source":"BGI","stats":{"gene_counts":{"protein_coding":8986,"total":14640}}},"assembly_accession":"GCA_017607545.1","assembly_category":"representative genome","assembly_level":"Scaffold","bioproject_lineages":[{"bioprojects":[{"accession":"PRJNA605929","title":"Project of the leaf-cutting ants"}]}],"biosample_accession":"SAMN14167745","blast_url":"https://blast.ncbi.nlm.nih.gov/Blast.cgi?PAGE_TYPE=BlastSearch\u0026PROG_DEF=blastn\u0026BLAST_SPEC=GDH_GCA_017607545.1","chromosomes":[{"length":"296539234","name":"Un"}],"contig_n50":34925,"display_name":"ASM1760754v1","estimated_size":"241200730","gc_count":"99149685","org":{"a

In [3]:
%%bash
# Get metadata info and save to a file
datasets summary genome taxon formicidae > formicidae_summary.json
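
Once the report is saved, it can also be read with Python instead of `jq`. The following is a minimal sketch, assuming the report keeps the `assemblies` → `assembly` layout visible in the truncated output above. The function name `list_assemblies` is hypothetical, and the inline sample is a trimmed copy of the first record so that the cell runs without the saved file.

```python
import json


def list_assemblies(summary):
    """Return (accession, level, contig N50) for each assembly in a report.

    Assumes the layout seen in the workshop output:
    {"assemblies": [{"assembly": {...}}, ...]}
    Missing fields come back as None.
    """
    rows = []
    for entry in summary.get("assemblies", []):
        assembly = entry.get("assembly", {})
        rows.append((
            assembly.get("assembly_accession"),
            assembly.get("assembly_level"),
            assembly.get("contig_n50"),
        ))
    return rows


# Trimmed copy of the first record shown in the output above.
sample = json.loads("""
{"assemblies": [{"assembly": {"assembly_accession": "GCA_017607545.1",
                              "assembly_level": "Scaffold",
                              "contig_n50": 34925,
                              "display_name": "ASM1760754v1"}}]}
""")

# To use the real report instead, load the file saved in the cell above:
# with open("formicidae_summary.json") as handle:
#     sample = json.load(handle)

for accession, level, n50 in list_assemblies(sample):
    print(accession, level, n50)
```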

**Now let's take a look at the metadata using jq**

In [4]:
%%bash
# this example has been truncated to fit the notebook output in github by only printing the first 100 lines
datasets summary genome taxon formicidae | jq . | perl -ne'1..100 and print' 

# Original code run in the workshop
# datasets summary genome taxon formicidae | jq . 

{
  "assemblies": [
    {
      "assembly": {
        "annotation_metadata": {
          "file": [
            {
              "estimated_size": "3421616",
              "type": "GENOME_GFF"
            },
            {
              "estimated_size": "129483045",
              "type": "GENOME_GBFF"
            },
            {
              "estimated_size": "3444924",
              "type": "PROT_FASTA"
            },
            {
              "estimated_size": "2684704",
              "type": "GENOME_GTF"
            },
            {
              "estimated_size": "7862131",
              "type": "CDS_FASTA"
            }
          ],
          "name": "From INSDC submitter",
          "release_date": "2021-03-29",
          "source": "BGI",
          "stats": {
            "gene_counts": {
              "protein_coding": 8986,
              "total": 14640
            }
          }
        },
        "assembly_accession": "GCA_017607545.1",
        "assembly_category": "representa