# NCBI Datasets - CSHL (11/02/2021)

### Table of contents
* [Part I: Accessing genomes](#Part-I)
* [Part II: Accessing genes](#Part-II)
* [Part III: Accessing orthologs](#Part-III)
* [Part IV: Building a BLAST database and creating a phylogenetic tree](#Part-IV)
* [Part V: Downloading large datasets (dehydration/rehydration) and `dataformat`](#Part-V)

### Important resources
- Etherpad: https://etherpad.wikimedia.org/p/CSHL_Datasets_Workshop_2021
- Github: https://github.com/ncbi/datasets/tree/workshop-cshl-2021/training/cshl-2021
- NCBI datasets: https://www.ncbi.nlm.nih.gov/datasets/
- jq cheat sheet: https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md

## Before we start... What is a jupyter notebook?

Jupyter Notebooks are a web-based approach to interactive code. A single notebook (the file you are currently reading) is composed of many "cells" which can contain either text, or code. To navigate between cells, either click, or use the arrow keys on your keyboard.

A text cell will look like... well... this! While a code cell will look something like what you see below. To run the code inside a code cell, click on it, then click the "Run" button at the top of the screen. Try it on the code cell below!

In [None]:
#This is a code cell
print('You ran the code cell!')

If it worked, you should have seen text pop up underneath the cell saying `You ran the code cell!`. Note the `In [1]:` that appeared next to the cell. This tells you the order you have run code cells throughout the notebook. The next time you run a code cell, it will say `In [2]:`, then `In [3]:` and so on... This will help you know if/when code has been run.

The remainder of the notebook below has been pre-built by the workshop organizer. You will not need to create any new cells, and you will be explicitly told if/when to execute a code cell.

The code in this workshop is either Bash (i.e., terminal commands) or Python. Bash commands are prefixed with `!`, or sit in cells that start with `%%bash`; Python commands have neither marker. If you are not familiar with code, don't feel pressured to interpret it very deeply. Descriptions of each code block will be provided!

(Jupyter Notebook explanation by Cooper Park at the workshop on [Finding and Analyzing Metagenomic Data](https://www.nlm.nih.gov/oet/ed/ncbi/2021_10_meta.html))

## Case study: Elmo loves ants

Elmo is a graduate student at the Via Sesamum University. As part of his Ph.D. project, he studies Panamanian leaf cutter ants (genus *Acromyrmex*, family Formicidae) and how variation in the gene *orco* (**o**dorant **r**eceptor **co**receptor) affects the colonies of this genus.

(Here's the [link](https://www.ncbi.nlm.nih.gov/labs/pmc/articles/PMC5556950/) to a cool paper about this gene in ants of the species *Ooceraea biroi*.)

<img src="./images/ants.png" alt="image"/>

Elmo will use `datasets` to help him gather the existing genomic resources from NCBI. He will:

- download all available genomes for the genus *Acromyrmex*
- download the *orco* gene from the *Acromyrmex* reference genome
- download the ortholog set for this gene for all ants (Formicidae)

In addition, he will:
- create a custom BLAST database with the Panamanian leaf cutter ant genomes
- BLAST the gene *orco* against the database
- build a multiple sequence alignment of the BLAST results and the ortholog gene sequences
- build a phylogenetic tree using FastTree


### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/quickstarts/command-line-tools/) is a command line tool that allows users to download data packages (data + metadata) or look at metadata summaries for genomes, RefSeq annotated genes, curated ortholog sets and SARS-CoV-2 virus sequences and proteins. The program follows a hierarchy that makes it easier for users to select exactly which options they would like to use. In addition to the program commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="./images/datasets_horizontal.drawio.png" alt="datasets" style="width: 600px;"/>

In addition to `datasets`, we will be using `jq` (JSON parser) to take a look at the metadata information. Our metadata reports are almost all in JSON or [JSON Lines](https://jsonlines.org/) format. We put together a [jq cheat sheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract information from those files.  

## Part I: Accessing genomes<a class="anchor" id="Part-I"></a>

![workflow](./images/elmo_workflow.drawio.png)

First, let's figure out what kind of genome information NCBI has for ants (family Formicidae).

<img src="./images/genome_summary.drawio.png" style="width: 600px;"/>

In [None]:
%%bash
# Get metadata info
datasets summary genome taxon formicidae

In [None]:
%%bash
# Get metadata info and save to a file
datasets summary genome taxon formicidae > formicidae_summary.json

**Now let's take a look at the metadata using jq**

In [None]:
%%bash
datasets summary genome taxon formicidae | jq .

### A little bit more about json files
A JSON (JavaScript Object Notation) file stores data structures and objects. In a very simplified (and non-technical) way, a JSON file is a box, that might contain other boxes with more boxes inside. In `datasets summary genome` our JSON "box" is organized like this:
<img src="./images/json8.png" alt="image"/>

But let's explore the "boxes" in stages, so we can understand how everything is organized and how we can use this knowledge to extract information from the summary metadata file. At the first level, we have this: 
```
{
 assemblies[
      assembly{},
      assembly{},
 ],
 total_count
}
```
<img src="./images/json1.png" />

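If we want to look at the value in the field "total_count", here's the command we would use. Note that the `jq` examples in this section query the family Herpestidae (mongooses) rather than Formicidae; the JSON structure is the same for any taxon.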

In [None]:
%%bash
datasets summary genome taxon herpestidae | jq '.total_count'

If we continue to expand each one of those assembly boxes, more levels of the hierarchy will be revealed. Let's expand each assembly and look at what information we can find at that level.  

<img src="./images/json2a.png" alt="image"/>

Here we can see that some of the assembly information, such as the assembly accession number, contig N50 and submission date, is not inside any of the nested "boxes" (`annotation_metadata`, `chromosomes`, `bioproject_lineages`, and `org`). Those fields describe characteristics of the assembly as a whole rather than of any one nested box. What are the contig N50 values of those assemblies?

To retrieve that information, we need to name each box in turn, starting from the outermost one, until we reach the field we're interested in. Each level is separated from the next by a period (`.`).

<img src="./images/json3.png" />

In [None]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly.contig_n50'
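
If `jq` syntax is new to you, it may help to see the same walk through the "boxes" written in Python. This is only a minimal sketch on a tiny hand-written summary: the accessions, species names and numbers are invented placeholders, not real `datasets` output.

```python
import json

# A made-up, heavily trimmed summary with the nesting described above.
# All values are placeholders, not real NCBI records.
summary_text = """
{
  "assemblies": [
    {"assembly": {"assembly_accession": "GCA_EXAMPLE_1", "contig_n50": 120000,
                  "org": {"sci_name": "Example species A"}}},
    {"assembly": {"assembly_accession": "GCA_EXAMPLE_2", "contig_n50": 45000,
                  "org": {"sci_name": "Example species B"}}}
  ],
  "total_count": 2
}
"""
summary = json.loads(summary_text)

# jq '.total_count'
total_count = summary["total_count"]

# jq '.assemblies[].assembly.contig_n50'
# The [] in jq means "for every item in the list", which is a loop in Python.
contig_n50_values = [entry["assembly"]["contig_n50"] for entry in summary["assemblies"]]

print(total_count)
print(contig_n50_values)
```

Each `["..."]` lookup in Python corresponds to one `.name` step in the `jq` path.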

Now let's see what we have inside `annotation_metadata`, `bioproject_lineages`, `org` and `chromosomes`. 
<img src="./images/json8.png" alt="image"/>

Now let's see how we can retrieve the scientific names associated with those assemblies.
<img src="./images/json4.png" />

In [None]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly.org.sci_name'

As you can see, `jq` is very useful in retrieving information from the summary metadata *as long as* you know the path to find it. Let's try a few more complex examples.
<img src="./images/json5.png" />

First, let's retrieve information from three fields at the same time: scientific name (`sci_name`), assembly accession number (`assembly_accession`) and contig N50 (`contig_n50`).

In [None]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly | (.org.sci_name, .assembly_accession, .contig_n50)'

Since all three fields are inside the `.assemblies[].assembly`, we can call the first part of the path once and use a pipe (|) to call each specific field. 
Now let's try to make this a little easier to read. We can create new fields and assign values to them, like this:

In [None]:
%%bash
datasets summary genome taxon herpestidae | jq '.assemblies[].assembly 
| {species: .org.sci_name, accession: .assembly_accession, contigN50:.contig_n50}'
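
For comparison, here is a rough Python sketch of the same reshaping step. The records are hypothetical stand-ins for the items produced by `.assemblies[].assembly`; only the field names come from the query above.

```python
# Hypothetical assembly records with the same shape as `.assemblies[].assembly`.
assemblies = [
    {"assembly_accession": "GCA_EXAMPLE_1", "contig_n50": 120000,
     "org": {"sci_name": "Example species A"}},
    {"assembly_accession": "GCA_EXAMPLE_2", "contig_n50": 45000,
     "org": {"sci_name": "Example species B"}},
]

# Equivalent of:
# {species: .org.sci_name, accession: .assembly_accession, contigN50: .contig_n50}
reshaped = [
    {
        "species": a["org"]["sci_name"],
        "accession": a["assembly_accession"],
        "contigN50": a["contig_n50"],
    }
    for a in assemblies
]

for record in reshaped:
    print(record)
```

The new keys (`species`, `accession`, `contigN50`) are labels we chose ourselves; they do not exist in the original metadata.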

**Last one**: let's look at a larger collection of genome assemblies (let's say, all Carnivora) and select only those assemblies with contig N50 larger than 15 Mb (15000000 bp). `datasets` provides many options for filtering, but there is no built-in filter for contig N50 size.  

Here's what we want to see: assembly accession number, species and assembly level for those genomes with contig N50 above 15 Mb.

In [None]:
%%bash
datasets summary genome taxon carnivora | jq -r '.assemblies[].assembly 
| select(.contig_n50 > 15000000) 
| [.assembly_accession, .org.sci_name, .assembly_level] 
| @tsv'
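
The filter-then-format logic of that query can also be sketched in Python. The three records below are invented for illustration (including one sitting exactly on the threshold), so the output says nothing about real Carnivora assemblies.

```python
# Hypothetical records; accessions, names and N50 values are invented.
assemblies = [
    {"assembly_accession": "GCA_EXAMPLE_1", "assembly_level": "Chromosome",
     "contig_n50": 52000000, "org": {"sci_name": "Example species A"}},
    {"assembly_accession": "GCA_EXAMPLE_2", "assembly_level": "Scaffold",
     "contig_n50": 90000, "org": {"sci_name": "Example species B"}},
    {"assembly_accession": "GCA_EXAMPLE_3", "assembly_level": "Chromosome",
     "contig_n50": 15000000, "org": {"sci_name": "Example species C"}},
]

THRESHOLD = 15_000_000  # 15 Mb, as in the jq query

# select(.contig_n50 > 15000000): strictly greater than,
# so an assembly with exactly 15 Mb is left out.
selected = [a for a in assemblies if a["contig_n50"] > THRESHOLD]

# [.assembly_accession, .org.sci_name, .assembly_level] | @tsv
tsv_lines = [
    "\t".join([a["assembly_accession"], a["org"]["sci_name"], a["assembly_level"]])
    for a in selected
]
print("\n".join(tsv_lines))
```

If you want to include assemblies at exactly 15 Mb, the comparison would be `>=` in both `jq` and Python.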

**RESOURCE:**  
We included a list of all fields in the genome summary in our [jq cheatsheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md) to help you extract the information you need. Next, we will show you how to put it to use.

### Let's continue to explore the available genomes for the family Formicidae

<img src="./images/genome_summary.drawio.png" alt="summary" style="width: 600px"/>

For this part, we will use two UNIX commands: `sort` and `uniq`. 

- `sort` can be used to sort text files line by line, numerically and alphabetically.   
- `uniq` collapses repeated lines in a file. However, `uniq` only detects repeated lines if they are adjacent to each other, which is why we sort the input first. The flag `-c` or `--count` tells `uniq` to collapse the repeated lines and to report how many times each value appeared.

So, we will use `jq` to extract the information we need, sort the result and count the number of unique entries.
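
To see why the `sort` step matters, here is a small Python sketch. `itertools.groupby` behaves like `uniq`: it only merges identical values that sit next to each other. The list of assembly levels is made up for the example.

```python
from itertools import groupby

# Made-up assembly levels, in the order they might come out of jq
levels = ["Scaffold", "Contig", "Scaffold", "Chromosome", "Scaffold"]

def count_adjacent(values):
    """Count runs of identical neighbouring values, like `uniq -c`."""
    return [(len(list(run)), value) for value, run in groupby(values)]

unsorted_counts = count_adjacent(levels)          # like: ... | uniq -c
sorted_counts = count_adjacent(sorted(levels))    # like: ... | sort | uniq -c

print(unsorted_counts)
print(sorted_counts)
```

Without sorting, "Scaffold" is reported three separate times with a count of 1; after sorting it appears once with a count of 3.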

In [None]:
%%bash
# For which species does NCBI have genomes in its database? How many per species?

datasets summary genome taxon formicidae | jq '.assemblies[].assembly.org.sci_name' | sort | uniq -c

In [None]:
%%bash
# What is the assembly level (contig, scaffold, chromosome, complete) breakdown?

datasets summary genome taxon formicidae | jq '.assemblies[].assembly.assembly_level' | sort | uniq -c

### How to get help when using the command line

Since `datasets` is a very hierarchical program, we can use that characteristic to our advantage to get very specific help. For example, if we type `datasets --help`, we will see the first level of commands available.


In [None]:
%%bash
datasets --help

Notice the difference from when we type `datasets summary genome taxon formicidae --help`  


In [None]:
%%bash
datasets summary genome taxon formicidae --help

### Exercises

Now we will practice what we learned about `datasets`. Take a look at the questions below and feel free to ask questions. Useful resources for this exercise are the `--help` from the command line and the [jq cheatsheet](https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md). 


In [None]:
%%bash
# How many reference genomes are there in the family Formicidae? (hint: --reference)



In [None]:
%%bash
# How many reference genomes are annotated? (hint: --annotated)



In [None]:
%%bash
# How many genomes have NCBI (RefSeq) annotations? (hint: --assembly-source)



### Bonus questions:

In [None]:
%%bash
## Take a look at the jq cheat sheet and try to build a jq query for the metadata
## (https://github.com/ncbi/datasets/blob/workshop-cshl-2021/training/cshl-2021/jq_cheatsheet.md)



In [None]:
%%bash
# Now look at the summary metadata for your organism of interest 
# (if you don't have a favorite, go with red panda, Ailurus fulgens, taxid: 9649)



In [None]:
%%bash
# How many genomes?



In [None]:
%%bash
# Assembly level breakdown



In [None]:
%%bash
# How many genomes have a contig N50 value above 15Mb?



### Back to the main room


### What is the difference/relationship between GenBank, RefSeq and Reference assemblies?

<img src="./images/gca_gcf.png" alt="ref" />

### Data package

We explored the `datasets summary` option, in which we had a chance to look at the summary metadata ***without*** downloading any files. In the next steps, we will look at the data packages, which contain the actual data files. 

<img src="./images/genome_data_package.png" alt="data_package" />

In [None]:
%%bash
# Download all available GenBank assemblies for the genus Acromyrmex and save as genomes.zip
datasets download genome taxon acromyrmex --assembly-source genbank --filename genomes.zip --no-progressbar

In [None]:
%%bash
# Unzip genomes.zip to the folder genomes
unzip genomes.zip -d genomes

In [None]:
%%bash
# Explore the structure of the folder genomes with the command tree
tree -C genomes/

### Let's recap our goals

We used `datasets` to download all the GenBank assemblies for the genus *Acromyrmex*. The next step is to download the gene *orco* (odorant receptor coreceptor) for the same genus. But first, let's learn more about how genes are organized at NCBI.

<img src="./images/elmo_done1.png" alt="done1" style="width: 500px;" />

## Part II: Accessing genes <a class="anchor" id="Part-II"></a>
### GENES

Regardless of whether you choose `datasets download` or `datasets summary`, there are three options for retrieving gene information:
- accession
- gene-id
- symbol

<img src="./images/genes_op2.png" style="width: 800px;"/>


When choosing any of those three options, you will retrieve the gene information for the **reference** taxon. Like this:

`datasets download gene accession XR_002738142.1`  
`datasets download gene gene-id 101081937`  
`datasets download gene symbol BRCA1 --taxon cat`  

All three commands will download the same gene from the cat (<i>Felis catus</i>) <u>reference genome</u>. 

#### accession
Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).  

#### gene-id
Also a unique identifier. Every RefSeq genome annotated has a unique set of identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat is 101081937.  

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and means different things in different taxonomic groups. If using the symbol option, you should specify the species. The default option is human.
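
The difference between the identifiers can be pictured with a toy lookup in Python. The two gene IDs are the ones quoted above; the table and the `resolve_symbol` helper are purely illustrative and are not how NCBI stores or resolves genes.

```python
# Gene IDs quoted in the text above. The lookup table itself is only an
# illustration, not an NCBI data structure.
GENE_IDS = {
    ("BRCA1", "human"): 672,
    ("BRCA1", "cat"): 101081937,
}

def resolve_symbol(symbol, taxon="human"):
    """Toy resolver: a symbol only identifies a gene once a taxon is fixed."""
    return GENE_IDS[(symbol.upper(), taxon)]

default_id = resolve_symbol("brca1")             # no taxon given, falls back to human
cat_id = resolve_symbol("brca1", taxon="cat")    # same symbol, different gene-id

print(default_id, cat_id)
```

The same symbol leads to two different gene-ids, which is why the `symbol` option needs `--taxon` whenever you are not interested in human.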

**Remember**: both `summary` and `download` will return results for the **reference assembly** of a <u>single species</u>. If you want to download a curated set of the same gene for multiple taxa, you should use the option `ortholog`. We'll talk more about it later. For reference, here's the JSON organization of the gene summary metadata.  

<img src="./images/gene_json.drawio.png" />

Now let's take a look at a gene example:

In [None]:
%%bash
#Example: IFNG in human
datasets summary gene symbol ifng | jq -C .


In [None]:
%%bash
# how datasets deals with synonyms
datasets summary gene symbol IFG | jq -C -r '.genes[].gene | {species: .taxname, symbol: .symbol, synonyms:.synonyms}'


In [None]:
%%bash
#Example: IFNG in cat
datasets summary gene symbol ifng --taxon "felis catus"


### Back to ants
We will download the gene *orco* for the species *Acromyrmex echinatior*. We will use the gene-id 105147775 instead of the symbol because no informative gene symbol has been assigned for this gene.  

In [None]:
%%bash
# Using gene-id to retrieve gene information
datasets summary gene gene-id 105147775 | jq -C '.genes[].gene 
| {gene_description: .description, gene_id: .gene_id, symbol: .symbol, species: .taxname}'

In [None]:
%%bash
# if we try to retrieve metadata information for this gene using the symbol orco, what happens?
datasets summary gene symbol orco --taxon "acromyrmex echinatior"


In [None]:
%%bash
# Download the gene data package for the gene-id 105147775 (*orco* in Acromyrmex echinatior)
datasets download gene gene-id 105147775 --filename gene.zip --no-progressbar


In [None]:
%%bash
#Unzip the file
unzip gene.zip -d gene

In [None]:
%%bash
#Explore the data package structure using tree
tree gene

Now we are going to take advantage of the fact that we are using a Jupyter Notebook and use the package `pandas` to look at the gene data table.

In [None]:
import pandas as pd                                                        #load pandas to this notebook
gene_orco = pd.read_csv('gene/ncbi_dataset/data/data_table.tsv', sep='\t') #use pandas to import the data_table.tsv
gene_orco                                                                  #visualize the data table as the object gene_orco

### Exercises

1. Look for the summary data for a gene of interest (check the [etherpad](https://etherpad.wikimedia.org/p/CSHL_Datasets_Workshop_2021) for suggestions)
2. What is the gene location?
3. What is the gene range?
4. Now, download a list of gene symbols using the file genes.txt (provided). Save it as gene_list.zip
5. Unzip gene_list.zip and explore the folder structure
6. How many fasta files are there?

In [None]:
%%bash
# Summary data



In [None]:
%%bash
# Gene location



In [None]:
%%bash
# Gene range



In [None]:
%%bash
# Download a list of genes and save the data package as gene_list.zip (--filename gene_list.zip)


In [None]:
%%bash
# Explore the folder structure



In [None]:
%%bash
# How many genes were downloaded?



In [None]:
%%bash
# How many fasta files in the data package?



## Part III: Accessing orthologs <a class="anchor" id="Part-III"></a>

### Orthologs

The options to retrieve ortholog sets are the same as those for genes. We'll go over the differences when using each option:

- accession
- gene-id
- symbol

<img src="./images/ortholog.png" style="width: 800px;" />

When choosing any of those three options, you will download the **full ortholog set** to which they belong (unless you use additional filtering. We'll cover it below). Like this:

`datasets download ortholog accession XR_002738142.1`  
`datasets download ortholog gene-id 101081937`  
`datasets download ortholog symbol BRCA1 --taxon cat`  

All three commands will download the **same** ortholog set. 

---

#### <font color='blue'>Wait, but what is an ortholog set?</font>

>An ortholog set, or ortholog gene group, is a group of sequences that have been identified by the NCBI genome annotation team as homologous genes related to each other by speciation events. They are identified by a combination of protein similarity + local synteny information. 
Currently, NCBI has ortholog sets calculated for vertebrates and some insects. 


#### accession
Unique identifier. Accession includes RefSeq RNA and protein accessions. Since it's unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).

#### gene-id
Also a unique identifier. Every RefSeq genome annotated has a unique set of identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat is 101081937. You can use either one (672 or 101081937) to get the same vertebrate BRCA1 ortholog set.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and means different things in different taxonomic groups. For example: the P53 ortholog set in vertebrates is different from the insect set. If using the symbol option, you should specify the taxonomic group. The default option is human. Note that if you request ortholog sets for several vertebrate species, you might end up downloading the same ortholog set multiple times. Like this: 

`datasets download ortholog symbol brca1 --taxon cat`  
`datasets download ortholog symbol brca1 --taxon chicken`  
`datasets download ortholog symbol brca1 --taxon "chelonia mydas"`  

If that's the case, how do you filter the ortholog set to include *only* your taxonomic group of interest?

### Applying a taxonomic filter to the ortholog set

For the orthologs, `datasets` provides the flag `--taxon-filter`, which allows the user to restrict the summary or download to one or multiple taxonomic groups.  `--taxon` and `--taxon-filter` have different effects on the data package/summary output. A few examples:

- `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`  
Prints a JSON metadata summary of the gene brca1 for the domestic cat. 
We did not specify a `--taxon` because the default is human, and Felidae and human are part of the same brca1 ortholog set.   

  

- `datasets summary ortholog symbol brca1 --taxon "felis catus"`  
Even though this option looks almost the same as the one above, the result is *very different*. Here, we're asking `datasets` to find the ortholog set to which the gene brca1 in the domestic cat belongs. And `datasets` will return the <u>entire</u> ortholog set, not only the sequences for the domestic cat.


- `datasets summary ortholog symbol brca1 --taxon "felis catus" --taxon-filter "felis catus"`  
gives you the same result as `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`
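
One way to keep the two flags apart is the toy Python model below. It is a deliberately simplified sketch, not the `datasets` implementation: the set name, the membership table and the `ortholog_summary` function are all hypothetical, and the real `--taxon-filter` accepts whole taxonomic groups rather than the exact-name match used here.

```python
# Toy data: one ortholog set with three members (taxon names only, no real records).
ORTHOLOG_SETS = {
    "brca1_vertebrates": ["homo sapiens", "felis catus", "gallus gallus"],
}

# Hypothetical lookup: which set does a (symbol, taxon) pair belong to?
MEMBERSHIP = {
    ("brca1", "homo sapiens"): "brca1_vertebrates",
    ("brca1", "felis catus"): "brca1_vertebrates",
}

def ortholog_summary(symbol, taxon="homo sapiens", taxon_filter=None):
    """--taxon picks the ortholog set; --taxon-filter trims what is returned."""
    members = ORTHOLOG_SETS[MEMBERSHIP[(symbol, taxon)]]
    if taxon_filter is None:
        return members
    return [m for m in members if m == taxon_filter]

filter_only = ortholog_summary("brca1", taxon_filter="felis catus")
taxon_only = ortholog_summary("brca1", taxon="felis catus")
both = ortholog_summary("brca1", taxon="felis catus", taxon_filter="felis catus")

print(filter_only)
print(taxon_only)
print(both)
```

In this model, `taxon` only decides where the search starts, while `taxon_filter` decides which members come back, which mirrors the three examples above.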


The summary metadata for orthologs is presented in JSON Lines, which means that each gene entry is in a different line. Here's the diagram to help you create queries.
  
<img src="./images/ortholog_jsonl.drawio.png" />
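
Because every line of a JSON Lines file is a complete JSON object, it can be read one line at a time. Here is a minimal Python sketch; the two records and their field names (`gene_id`, `symbol`, `taxname`) are assumptions made for illustration and may not match the real ortholog report exactly.

```python
import json

# Two made-up JSON Lines records; field names are assumed for illustration.
jsonl_text = (
    '{"gene_id": "1", "symbol": "geneA", "taxname": "Example species A"}\n'
    '{"gene_id": "2", "symbol": "geneA", "taxname": "Example species B"}\n'
)

# Parse each non-empty line on its own; there is no enclosing list or object.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
taxa = [record["taxname"] for record in records]

print(len(records), taxa)
```

This is also why `jq` queries on JSON Lines start directly at the record (no leading `.genes[]`-style step is needed to loop over entries).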

#### We are going to do the following steps:
- download the ortholog data package and save it with the name ortholog.zip
- unzip it to the folder ortholog
- look at the files

Helpful info:

- gene symbol: orco
- gene-id in *Drosophila melanogaster*: 40650
- gene-id in *Acromyrmex echinatior*: 105147775
- target taxon: Formicidae

In [None]:
%%bash
# download the orco ortholog set for ants (Formicidae)
datasets download ortholog gene-id 40650 --taxon-filter formicidae --filename ortholog.zip --no-progressbar


In [None]:
%%bash
# unzip it to the folder ortholog
unzip ortholog.zip -d ortholog


In [None]:
%%bash
#Explore the folder structure
tree ortholog/


In [None]:
# Create an object called ortho_table using pandas
ortho_table = pd.read_csv("ortholog/ncbi_dataset/data/data_table.tsv", sep='\t')
ortho_table

## What have we done so far?
- Explored metadata for all ant genomes
- Downloaded genomes for the Panamanian leaf cutter ants
- Downloaded the *orco* gene for *Acromyrmex echinatior*
- Downloaded the ortholog set for all ants for the *orco* gene

<img src="./images/elmo_done.png" />

## Part IV: Building a BLAST database and creating a phylogenetic tree<a class="anchor" id="Part-IV"></a>

### Here's what we are showing you now:
- BLAST:
    - Create a BLAST database for each genome
    - BLAST the *orco* gene sequence against the genomes database and extract the matching regions
- multiple sequence alignment of the blast matches and the ortholog sequences
- generate an approximate maximum likelihood tree using FastTree

We'll add more detailed information about the commands we're using here to the GitHub page.

#### Extracting taxIDs from the genome data package

First, let's use `dataformat` to extract the species names, taxID and assembly accession numbers from the genomes we downloaded. We will talk in more detail about `dataformat` later.

In [None]:
%%bash
# Extract tax id for each species:
dataformat tsv genome --fields organism-name,tax-id,assminfo-accession --package genomes.zip 

#### Creating a BLAST database with taxonomy information.

First we are going to create a folder called `blastdb` with the UNIX command `mkdir`. Next, we will change to the directory we just created. Finally, we will download a copy of the NCBI taxonomy database (taxdb).

In [None]:
# Create a folder called blastdb
!mkdir blastdb

# change directory to the folder blastdb
%cd blastdb

# download the NCBI Taxonomy Database (taxdb)
!update_blastdb.pl taxdb

#### BLAST database and search
Now we will create a BLAST database with the *Acromyrmex* genomes we downloaded. More information about the commands is available on our GitHub page.

In [None]:
%%bash
# Create a blast database for each genome
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna -taxid 103372 -out Aechinatior
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna -taxid 230686 -out Ainsinuator 
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna -taxid 2715315 -out Acharruanus
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607565.1/GCA_017607565.1_ASM1760756v1_genomic.fna -taxid 230685 -out Aheyeri

# Create an alias under which the four genome databases can be called
blastdb_aliastool -dbtype nucl -title acromyrmex -out acromyrmex -dblist "Acharruanus Aechinatior Aheyeri Ainsinuator"

# BLASTN search
blastn \
-db acromyrmex \
-query ../gene/ncbi_dataset/data/gene.fna \
-evalue 1e-50 \
-outfmt 11 \
-max_hsps 1 \
-out orco_acromyrmex_1e-50.asn

# Convert the asn.1 output to tabular (output format 6)

blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 sseqid sstart send evalue length staxid ssciname' > orco_acromyrmex_1e-50.tsv

Using `pandas` again, we will create an object with the tsv file we just created from the BLAST output, so we can take a look at our results.

In [None]:
# Create a table and visualize the BLAST results

blast_table = pd.read_csv('orco_acromyrmex_1e-50.tsv', sep='\t', header=None)
blast_table
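
The table above has no header row, so the columns are numbered 0 to 6. As a hedged sketch, the snippet below attaches the field names from the `-outfmt` string and works out the strand of each match. The two rows are invented stand-ins for the real file (only the taxIDs are borrowed from the `makeblastdb` commands), and the standard-library `csv` module is used so the example runs without any extra packages.

```python
import csv
import io

# Column names taken from the -outfmt string passed to blast_formatter above
COLUMNS = ["sseqid", "sstart", "send", "evalue", "length", "staxid", "ssciname"]

# Two invented rows standing in for orco_acromyrmex_1e-50.tsv
example_tsv = (
    "seq_EXAMPLE_1\t100\t900\t0.0\t801\t103372\tAcromyrmex echinatior\n"
    "seq_EXAMPLE_2\t2500\t1800\t1e-120\t701\t230685\tAcromyrmex heyeri\n"
)

reader = csv.DictReader(io.StringIO(example_tsv), fieldnames=COLUMNS, delimiter="\t")
hits = list(reader)

for hit in hits:
    start, end = int(hit["sstart"]), int(hit["send"])
    # For matches on the minus strand, BLAST reports sstart > send
    hit["strand"] = "plus" if start <= end else "minus"
    print(hit["ssciname"], hit["sseqid"], min(start, end), max(start, end), hit["strand"])
```

With `pandas`, the same labels can be attached by passing `names=COLUMNS` to `pd.read_csv` in place of `header=None` alone.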

#### Converting from BLAST to fasta

Now we are going to use some "tricks" (not really, just some good old bash scripting) to extract fasta sequences from the BLAST output. For this task, we will be using `blast_formatter` again.

In [None]:
%%bash
# Convert BLAST output to fasta

blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 ssciname sseqid sseq' \
-max_target_seqs 4 | awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/ /, "_", $1);gsub(/-/, "", $3); print ">"$1"_"$2,$3}' > ../acromyrmex_orco.fasta
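
If the `awk` one-liner looks cryptic, here is a rough Python sketch of the same transformation applied to a single row. The function name `blast_row_to_fasta` and the example row are made up for illustration; the row only mimics the three tab-separated fields (`ssciname`, `sseqid`, `sseq`) requested above.

```python
def blast_row_to_fasta(row):
    """Turn one 'ssciname<TAB>sseqid<TAB>sseq' line into a FASTA record."""
    sci_name, seq_id, aligned_seq = row.rstrip("\n").split("\t")
    # gsub(/ /, "_", $1): spaces in the species name become underscores
    header = ">" + sci_name.replace(" ", "_") + "_" + seq_id
    # gsub(/-/, "", $3): alignment gaps are removed from the sequence
    sequence = aligned_seq.replace("-", "")
    return header + "\n" + sequence

# Invented example row; not a real BLAST hit
example_row = "Acromyrmex echinatior\tseq_EXAMPLE_1\tATG-CC--GTA\n"
fasta_record = blast_row_to_fasta(example_row)
print(fasta_record)
```

The gaps are removed because the aligned subject sequence (`sseq`) can contain `-` characters, which do not belong in an unaligned FASTA file.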


<img src="./images/elmo_blast_done.png"/>

### VERY IMPORTANT!
For the next steps, we need to go back to our home folder. Let's do it in steps again.

In [None]:
%%bash
## Check where you are
pwd

In [None]:
## If you're not in the home folder, run this command:
%cd ..

### Multiple sequence alignment: BLAST matches + *orco* orthologs

First, let's simplify the FASTA headers in the ortholog set.

In [None]:
%%bash
# Extract the FASTA header lines from the ortholog FASTA and replace spaces with commas
grep ">" ortholog/ncbi_dataset/data/gene.fna | sed 's/ /,/g' > ortholog_seqid.txt

# Create a mapping file with the original name in column 1 and a shortened name in column 2.
# substr($3,1,1) takes the first letter of the genus (awk strings are 1-indexed).
# The loop output is redirected once, so re-running this cell overwrites name_map.tsv instead of appending to it.
while read -r line; do
new=$( echo "$line" | awk 'BEGIN {FS=","; OFS="_"}{gsub(/\[organism\=/, "", $3);gsub(/]/, "", $4);gsub(/\[GeneID\=|\]/, "", $5)} ;{print substr($3,1,1)$4,$5}')
old=$( echo "$line" | sed 's/,/\_/g;s/>//g')
printf '%s\t%s\n' "$old" "$new"
done < ortholog_seqid.txt > name_map.tsv

#Copy the ortholog dataset fasta
cp ortholog/ncbi_dataset/data/gene.fna ortholog_gene.fna

#Remove spaces in the fasta sequence names
sed 's/ /_/g' ortholog_gene.fna > ortholog_gene_nospaces.fna

#Replace the names in the fasta file
cat ortholog_gene_nospaces.fna | seqkit replace \
--kv-file  <(cut -f 1,2 name_map.tsv) \
--pattern "^(.*)" --replacement "{kv}" > ortholog_gene_final.fna
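
The renaming rule (first letter of the genus + species + gene-id) can be sketched in Python as well. This assumes the FASTA headers carry `[organism=Genus species]` and `[GeneID=...]` tags, as the bash code above implies; the example header, including its accession and gene-id, is invented.

```python
import re

def short_name(header):
    """Build a name like 'Gspecies_12345' from organism and GeneID tags."""
    organism = re.search(r"\[organism=([^\]]+)\]", header).group(1)
    gene_id = re.search(r"\[GeneID=(\d+)\]", header).group(1)
    genus, species = organism.split()[:2]
    return f"{genus[0]}{species}_{gene_id}"

# Invented header; the accession, symbol and GeneID are placeholders
example_header = ">ACCESSION_EXAMPLE.1 symbol [organism=Ooceraea biroi] [GeneID=12345]"
new_name = short_name(example_header)
print(new_name)
```

Short names of this form are what you will see on the tips of the tree at the end of this part.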

### Multiple sequence alignment and phylogenetic reconstruction

Now, let's concatenate the ortholog sequences with the FASTA we extracted from the BLAST matches, align them using MAFFT and use FastTree to generate an approximate ML phylogeny.

In [None]:
%%bash

#Concatenate sequences
cat ortholog_gene_final.fna acromyrmex_orco.fasta > orco_all.fasta

#align sequences with mafft
mafft orco_all.fasta > orco_all_aln.fasta

#Generate a phylogeny using fasttree
FastTree -nt orco_all_aln.fasta > orco.tree

### Visualizing the tree

In [None]:
# We will use the package toytree to look at the phylogenetic tree we just created

import toytree
orco_tree = toytree.tree("orco.tree")
orco_tree_rooted = orco_tree.root(names=["Obrunneus_116854080","Dquadriceps_106748868","Hsaltator_105183395"])
orco_tree_rooted.draw(tree_style='d')

## Part V: Downloading large datasets (dehydration/rehydration) and `dataformat`<a class="anchor" id="Part-V"></a>

You have now learned how to download genomes, genes and ortholog gene sets from NCBI with one command using `datasets`. Next, we want to show you another feature of `datasets`: downloading what we call a `dehydrated` package. Let's download a dehydrated package and explore the files inside it.

In [None]:
%%bash
# Download a dehydrated data package for all acromyrmex GenBank genomes
datasets download genome taxon acromyrmex --assembly-source genbank --dehydrated --filename acromyrmex-dry.zip --no-progressbar

In [None]:
%%bash
# Next we have to unzip the dehydrated package
unzip acromyrmex-dry.zip -d acromyrmex-dry 

In [None]:
%%bash
# Now let's use the command tree to look at the data package contents
tree acromyrmex-dry/

**What is the difference between this folder (`acromyrmex-dry`) and the folder `genomes`?**   
Let's use `tree` again to look at the contents of the folder genomes.

In [None]:
%%bash
# Check the contents of the folder genomes
tree genomes/

Both packages include the files `assembly_data_report.jsonl` and `dataset_catalog.json`, but the folder acromyrmex-dry has the file `fetch.txt` instead of the *actual* data. Let's take a look at this file.

In [None]:
# Inspect the file fetch.txt
fetch = pd.read_csv('./acromyrmex-dry/ncbi_dataset/fetch.txt', sep='\t', header=None)
fetch

The file `fetch.txt` has a list of files to be "fetched" (downloaded) with their respective links. They are the same files that were included when we downloaded the genomes at the beginning of this notebook.  
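
A list like this is easy to inspect before committing to a large download. The sketch below assumes each line holds three tab-separated values (link, size in bytes, destination path); that layout, the `example.org` links and the sizes are assumptions made for illustration, so check them against your own `fetch.txt`.

```python
import csv
import io

# Invented stand-in for fetch.txt. Assumed layout: link, size in bytes, path.
example_fetch = (
    "https://example.org/download/GCA_EXAMPLE_1/genomic.fna\t1048576\tdata/GCA_EXAMPLE_1/genomic.fna\n"
    "https://example.org/download/GCA_EXAMPLE_1/protein.faa\t204800\tdata/GCA_EXAMPLE_1/protein.faa\n"
)

rows = list(csv.reader(io.StringIO(example_fetch), delimiter="\t"))

# Total download size if every listed file were rehydrated
total_bytes = sum(int(size) for _url, size, _path in rows)

# Destination paths, useful for spotting which file types are available
paths = [path for _url, _size, path in rows]

print(total_bytes)
print(paths)
```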

#### BUT WHY WOULD I WANT TO USE THIS OPTION?

Some possibilities:
- You are working with very large genomes and want to share the data with your collaborators. Instead of sending a massive data file, you can send a text file that they can use to download the same data you're working on.
- Or maybe you hand selected some genomes for a project from the [NCBI Datasets website](https://www.ncbi.nlm.nih.gov/datasets/genomes/) and they don't follow a specific pattern that can be replicated. You can also download a dehydrated package from our website, share it and download everything you need later.

### `dataformat`

Now we are going to combine `datasets` with another tool called `dataformat`. `dataformat` allows you to extract metadata information from the JSON data report files included with all `datasets` data packages. You can use `dataformat` to:
- Create a tab-delimited file (.tsv) or Excel file with the fields you need
- Quickly visualize the information on the screen

`dataformat` currently cannot be used with the output of `datasets summary`, only with the JSON Lines data report included with the data package.

In [None]:
%%bash
# Read the dataformat help menu. This is a great way to get a list of the available metadata fields.
dataformat tsv genome -h

#### Now let's combine the features of `dataformat` and dehydration/rehydration to select which genomes to download.

Let's use `dataformat` to look at the genome data package for ants. We can use this information to select a "best" genome - we'll pick one with the highest contigN50 value.

In [None]:
%%bash
# Use dataformat to look at the genome data package for ants
dataformat tsv genome \
--fields organism-name,assminfo-accession,assmstats-contig-n50,assminfo-level,assminfo-submission-date,assminfo-submitter \
--package acromyrmex-dry.zip
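
Once the table is on screen you can pick the "best" genome by eye, but for longer tables a few lines of Python can do it. This is a sketch on an invented table: the column headers and values are placeholders and will differ from the real `dataformat` output.

```python
import csv
import io

# Invented table; column headers and values are placeholders.
example_table = (
    "organism\taccession\tcontig_n50\n"
    "Example species A\tGCA_EXAMPLE_1\t62000\n"
    "Example species B\tGCA_EXAMPLE_2\t18000\n"
    "Example species C\tGCA_EXAMPLE_3\t41000\n"
)

rows = list(csv.DictReader(io.StringIO(example_table), delimiter="\t"))

# Values read from a TSV are text, so convert to int before comparing;
# otherwise "9000" would rank above "62000".
best = max(rows, key=lambda row: int(row["contig_n50"]))

print(best["accession"], best["contig_n50"])
```

The accession of the selected row is what you would then pass to `datasets rehydrate --match`.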

In [None]:
%%bash
# Let's look at the help file for rehydrate
datasets rehydrate -h

In [None]:
%%bash
# Let's get a list of files that are available for download 
datasets rehydrate --directory acromyrmex-dry/ --list

In [None]:
%%bash
# Let's only get the protein sequences for the genome with the highest contigN50 value
datasets rehydrate --directory acromyrmex-dry/ --match GCA_000204515.1/protein.faa --no-progressbar

In [None]:
%%bash
# Let's use tree to look at our folder acromyrmex-dry again
tree acromyrmex-dry/

We can see that the file we requested, `GCA_000204515.1/protein.faa`, was downloaded to the folder `acromyrmex-dry`.

In [None]:
%%bash
# Take a peek at the downloaded protein file
cat acromyrmex-dry/ncbi_dataset/data/GCA_000204515.1/protein.faa | head

## Exercise
* Download a dehydrated package for all *Mycobacterium tuberculosis* genomes that meet all of the following criteria (hint: use flags)
    1. submitted/released in 2021
    2. annotated
    3. assembly level of complete_genome
* Use `dataformat` to view the sequencing technology used for each of these genomes
* Use `datasets rehydrate` to get the genome sequence for one genome generated using Oxford Nanopore

In [None]:
%%bash
# Download a dehydrated genome data package



In [None]:
%%bash
# Unzip the data package



In [None]:
%%bash
# Use dataformat to generate a table that includes sequencing technology



In [None]:
%%bash
# Use rehydrate to get genome sequence generated using Oxford Nanopore

