# NCBI Datasets - CSHL (11/02/2021)

### Important resources:
- Etherpad: <link>
- Github: <link>
- NCBI datasets: <link>
- Code of Conduct (?): <link> *we can use the Carpentries COC*
- jq cheat sheet: <link>
- UNIX cheat sheet: <link>

## Case study: Elmo loves ants

Elmo is a graduate student at the Via Sesamum University. As part of his Ph.D. project, he studies Panamanian leaf cutter ants (genus *Acromyrmex*, family Formicidae) and how variation in the gene *orco* (**o**dorant **r**eceptor **co**receptor) affects the colonies of this genus.

(Here is a [paper](https://www.sciencedirect.com/science/article/pii/S0092867417307729#app3) describing this gene in the ant species *Ooceraea biroi*.)

<img src="./images/ants.png" alt="image"/>

Elmo will use `datasets` to help him gather the existing genomic resources from NCBI. He will:

- download all available genomes for the genus *Acromyrmex*
- download the *orco* gene from the *Acromyrmex* reference genome
- download the ortholog set for this gene for all ants (Formicidae)

In addition, he will also do the following tasks:
- Create a custom BLAST database from the Panamanian leaf cutter ant genomes
- BLAST the *orco* gene against the database
- Run a multiple sequence alignment of the BLAST results and the ortholog gene sequences
- Build a phylogenetic tree using FastTree

## something about dataformat and dehydrated files

### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/quickstarts/command-line-tools/) is a command-line tool that lets users download data packages (data + metadata) or view metadata summaries for genomes, RefSeq-annotated genes, curated ortholog sets, and SARS-CoV-2 virus sequences and proteins. The program follows a hierarchy that makes it easier for users to select exactly which options they would like to use. In addition to the program commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="./images/datasets_horizontal.drawio.png" alt="datasets" style="width: 600px;"/>

In addition to `datasets`, we will be using `jq` (a JSON parser) to look at the metadata. Our metadata reports are almost all in JSON or JSON Lines format. We put together a [jq cheat sheet](<add link>) to help you extract information from those files.
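
The difference between the two formats matters when parsing: a JSON file is one document, while a JSON Lines file holds one complete JSON object per line. The sketch below is not part of the tutorial workflow; it only illustrates reading JSON Lines in Python, and the keys and values are hypothetical placeholders rather than real `datasets` output.

```python
import json

# Hypothetical JSON Lines text: each line is a complete JSON object.
# The keys and values here are made up for illustration.
jsonl_text = (
    '{"accession": "GCA_EXAMPLE_1", "level": "Scaffold"}\n'
    '{"accession": "GCA_EXAMPLE_2", "level": "Contig"}\n'
)

# Parse line by line; a single json.loads() on the whole text would fail
# because the text as a whole is not one JSON document.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]

accessions = [record["accession"] for record in records]
print(accessions)
```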

## Tutorial - Part 1

![workflow](./images/elmo_workflow.drawio.png)

First, let's figure out what kind of information NCBI has for ants (family Formicidae).

In [None]:
# Get metadata info
!datasets summary genome taxon formicidae

In [None]:
# Get metadata info and save to a file
!datasets summary genome taxon formicidae > formicidae_summary.json

**Now let's take a look at the metadata using jq**

In [None]:
!datasets summary genome taxon formicidae | jq .

### A little bit more about json files
A JSON (JavaScript Object Notation) file stores data structures and objects. In a very simplified (and non-technical) way, a JSON file is a box that might contain other boxes, with more boxes inside. In `datasets summary` our JSON "box" is organized like this:
<img src="./images/json1.png" alt="image" style="width: 600px;"/>

If we continue to expand each one of those assembly boxes, more levels of the hierarchy will be revealed. Let's take a look inside the pink assembly box:
<img src="./images/json2.png" alt="image" style="width: 600px;"/>

Here we can see that some of the assembly information, such as the assembly accession number, contig N50, or submission date, is not included inside any of the available "boxes" (annotation_metadata, chromosomes, bioproject_lineage, and org). Those fields describe features that pertain to the entire assembly rather than to any one of those boxes.
Let's expand those boxes now:
<img src="./images/json3.png" alt="image" style="width: 600px;"/>

Now we can see all the available fields for the genome summary. Not every assembly will have a value for every field, but this gives you an idea of how the information is organized. Each assembly is organized with the same set of fields, like this:

<img src="./images/json4.png" alt="image" style="width: 600px;"/>

**RESOURCE:**  
We included a list of all fields in the genome summary in our [jq cheat sheet]() to help you extract the information you need. We will now show you how to do that.
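
The "boxes inside boxes" idea maps directly onto nested dictionaries and lists in Python. The sketch below walks a heavily trimmed, hypothetical stand-in for a genome summary. The key names (`assemblies`, `assembly`, `assembly_accession`, `contig_n50`, `org`, `sci_name`) are assumptions modeled on the boxes described above, and the values are made up; check the real output of `datasets summary genome` before relying on them.

```python
import json

# Hypothetical, trimmed stand-in for a genome summary (made-up values).
# Key names are assumptions; compare them with the real summary output.
summary_text = """
{
  "assemblies": [
    {"assembly": {"assembly_accession": "GCA_EXAMPLE_1",
                  "contig_n50": 1000,
                  "org": {"sci_name": "Example ant one"}}},
    {"assembly": {"assembly_accession": "GCA_EXAMPLE_2",
                  "contig_n50": 2500,
                  "org": {"sci_name": "Example ant two"}}}
  ],
  "total_count": 2
}
"""

summary = json.loads(summary_text)


def flatten_assemblies(summary):
    """Pull a few fields out of each assembly box into a flat list of dicts."""
    rows = []
    for entry in summary.get("assemblies", []):
        assembly = entry["assembly"]
        rows.append({
            # Top-level assembly fields sit directly in the assembly box
            "accession": assembly.get("assembly_accession"),
            "contig_n50": assembly.get("contig_n50"),
            # Organism fields sit one level deeper, inside the org box
            "organism": assembly.get("org", {}).get("sci_name"),
        })
    return rows


rows = flatten_assemblies(summary)
for row in rows:
    print(row["accession"], row["organism"], row["contig_n50"])
```

Using `.get()` keeps the loop from failing when an assembly lacks one of the optional boxes.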

In [None]:
!datasets download genome taxon acromyrmex --assembly-source genbank --filename genomes.zip --no-progressbar

In [None]:
!unzip genomes.zip -d genomes

In [None]:
!tree genomes/

In [None]:
!datasets summary gene gene-id 105147775 | jq '.genes[].gene | {gene_description: .description, gene_id: .gene_id, symbol: .symbol, species: .taxname}'

In [None]:
!datasets download gene gene-id 105147775 --filename gene.zip --no-progressbar

!unzip gene.zip -d gene

In [None]:
!tree gene

In [None]:
import pandas as pd
gene_orco = pd.read_csv('gene/ncbi_dataset/data/data_table.tsv', sep='\t')
gene_orco

In [None]:
!time datasets download ortholog gene-id 40650 --taxon-filter formicidae --filename ortholog.zip --no-progressbar

In [None]:
!time unzip ortholog.zip -d ortholog

In [None]:
!tree ortholog/

In [None]:
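# Create the BLAST database directory if it does not exist yet, then move into it
!mkdir -p blastdb
%cd blastdb/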

In [None]:
%%bash
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna -taxid 103372 -out Aechinatior
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna -taxid 230686 -out Ainsinuator 
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna -taxid 2715315 -out Acharruanus
makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607565.1/GCA_017607565.1_ASM1760756v1_genomic.fna -taxid 230685 -out Aheyeri

In [None]:
!blastdb_aliastool -dbtype nucl -title acromyrmex -out acromyrmex -dblist "Acharruanus Aechinatior Aheyeri Ainsinuator"

In [None]:
%%bash
blastn \
-db acromyrmex \
-query ../gene/ncbi_dataset/data/gene.fna \
-evalue 1e-50 \
-outfmt 11 \
-max_hsps 1 \
-out orco_acromyrmex_1e-50.asn

In [None]:
%%bash
blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 sseqid sstart send evalue length staxid ssciname' > orco_acromyrmex_1e-50.tsv

In [None]:
blast_table = pd.read_csv('orco_acromyrmex_1e-50.tsv', sep='\t', header=None)
blast_table
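
Because the table is read with `header=None`, its columns are numbered rather than named. A minimal sketch of attaching the names from the `-outfmt` string used above, run here on two made-up rows instead of the real `orco_acromyrmex_1e-50.tsv` file. The `strand` column is an illustrative extra and not part of the tutorial.

```python
import io

import pandas as pd

# Column order mirrors the -outfmt string passed to blast_formatter above.
columns = ["sseqid", "sstart", "send", "evalue", "length", "staxid", "ssciname"]

# Hypothetical two-row stand-in for the BLAST table (all values are made up).
example_tsv = (
    "seqA\t100\t900\t1e-120\t801\t111\tExample species one\n"
    "seqB\t2000\t1500\t3e-80\t501\t222\tExample species two\n"
)

blast_named = pd.read_csv(
    io.StringIO(example_tsv), sep="\t", header=None, names=columns
)

# In blastn tabular output, sstart > send indicates a hit on the minus strand
# of the subject sequence.
blast_named["strand"] = [
    "plus" if start <= end else "minus"
    for start, end in zip(blast_named["sstart"], blast_named["send"])
]
print(blast_named)
```

To use it on the real results, replace `io.StringIO(example_tsv)` with the path to the TSV file.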

In [None]:
%%bash
blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 ssciname sseqid sseq' \
-max_target_seqs 4

In [None]:
%%bash
blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 ssciname sseqid sseq' \
-max_target_seqs 4 | awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/ /, "_", $1);gsub(/-/, "", $3); print ">"$1"_"$2,$3}'

In [None]:
%%bash
blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 ssciname sseqid sseq' \
-max_target_seqs 4 | awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/ /, "_", $1);gsub(/-/, "", $3); print ">"$1"_"$2,$3}' > ../acromyrmex_orco.fasta
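
For readers less familiar with `awk`, the conversion in the cell above can be sketched in Python. This is only an equivalent illustration under the same assumptions (three tab-separated columns: `ssciname`, `sseqid`, `sseq`); the function name and the example row are hypothetical.

```python
def tabular_to_fasta(lines):
    """Convert 'ssciname<TAB>sseqid<TAB>sseq' rows into FASTA text."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        sciname, seqid, seq = line.split("\t")
        # Header: species name with spaces turned into underscores, then the seqid
        header = ">" + sciname.replace(" ", "_") + "_" + seqid
        # Aligned subject sequences contain gap characters; remove them
        records.append(header + "\n" + seq.replace("-", "") + "\n")
    return "".join(records)


# Hypothetical row standing in for one line of blast_formatter output.
example_rows = ["Example species\tseqA\tACG-TT--A\n"]
fasta_text = tabular_to_fasta(example_rows)
print(fasta_text, end="")
```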

In [None]:
%cd ~

In [None]:
# Extract the sequence headers from the ortholog FASTA and replace the spaces with commas
!grep ">" ortholog/ncbi_dataset/data/gene.fna | sed 's/ /,/g' > ortholog_seqid.txt

In [None]:
%%bash
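# Create a mapping file with the original name in column 1 and a shortened name in column 2
# Start from an empty file so that re-running the cell does not duplicate rows
: > name_map.tsv
while read -r line; do
    # Short name: first letter of the genus + species name + GeneID, joined with "_"
    new=$(echo "$line" | awk 'BEGIN {FS=","; OFS="_"} {gsub(/\[organism=/, "", $3); gsub(/\]/, "", $4); gsub(/\[GeneID=|\]/, "", $5); print substr($3,1,1)$4, $5}')
    # Original name: header without ">" and with commas turned into underscores
    old=$(echo "$line" | sed 's/,/_/g;s/>//g')
    printf '%s\t%s\n' "$old" "$new" >> name_map.tsv
done < ortholog_seqid.txt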

In [None]:
import pandas as pd
name_map = pd.read_csv('name_map.tsv', sep='\t', header=None)
name_map
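
The shortened names follow the pattern "first letter of the genus + species name + GeneID". A minimal Python sketch of the same rule, using a regular expression instead of field positions. The function name and the example header are hypothetical, and the header layout (`[organism=...]` and `[GeneID=...]` tags) is assumed from what the shell loop above expects.

```python
import re


def shorten_header(header):
    """Build a short name such as 'Gspecies_12345' from a FASTA header."""
    organism = re.search(r"\[organism=([^\]]+)\]", header)
    gene_id = re.search(r"\[GeneID=(\d+)\]", header)
    if organism is None or gene_id is None:
        raise ValueError(f"unexpected header layout: {header!r}")
    # Keep only the genus and species words of the organism name
    genus, species = organism.group(1).split()[:2]
    return f"{genus[0]}{species}_{gene_id.group(1)}"


# Hypothetical header following the layout assumed above (made-up values).
example_header = (
    ">ACC_000001.1 orco [organism=Examplegenus examplespecies] [GeneID=12345]"
)
short_name = shorten_header(example_header)
print(short_name)
```

Matching on the tags rather than on column positions keeps the rule working if extra fields appear earlier in the header.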

In [None]:
#Copy the ortholog dataset fasta
!cp ortholog/ncbi_dataset/data/gene.fna ortholog_gene.fna

In [None]:
#Remove spaces in the fasta sequence names
!sed 's/ /_/g' ortholog_gene.fna > ortholog_gene_nospaces.fna

In [None]:
!head -n1 ortholog_gene_nospaces.fna

In [None]:
%%bash
#Replace the names in the fasta file
cat ortholog_gene_nospaces.fna | seqkit replace \
--kv-file  <(cut -f 1,2 name_map.tsv) \
--pattern "^(.*)" --replacement "{kv}" > ortholog_gene_final.fna

In [None]:
!grep ">" ortholog_gene_final.fna

In [None]:
#Concatenate sequences
!cat ortholog_gene_final.fna acromyrmex_orco.fasta > orco_all.fasta

In [None]:
#align sequences with mafft
!time mafft orco_all.fasta > orco_all_aln.fasta

In [None]:
%%bash
#Generate a phylogeny using FastTree
time FastTree -nt orco_all_aln.fasta > orco.tree

In [None]:
import toytree

In [None]:
orco_tree = toytree.tree("orco.tree")
orco_tree_rooted = orco_tree.root(names=["Obrunneus_116854080","Dquadriceps_106748868","Hsaltator_105183395"])
orco_tree_rooted.draw(tree_style='d')

## PART 2: large datasets (GENOMES)

In [None]:
# Download a dehydrated data package for all acromyrmex GenBank genomes
!time datasets download genome taxon acromyrmex --assembly-source genbank --dehydrated --filename acromyrmex-dry.zip --no-progressbar

In [None]:
# Read the dataformat help menu. This is a great way to get a list of the available metadata fields.
!dataformat tsv genome -h

In [None]:
%%bash
# Use dataformat to look at the genome data package for ants
# We can use this information to select a "best" genome--we'll pick one with the highest contigN50 value
dataformat tsv genome \
--fields organism-name,assminfo-accession,assmstats-contig-n50,assminfo-level,assminfo-submission-date,assminfo-submitter \
--package acromyrmex-dry.zip
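
The table that `dataformat` prints can also be loaded into pandas to pick the assembly with the highest contig N50 programmatically. A minimal sketch, assuming hypothetical column headers and made-up values; the real headers are whatever appears in the first row of the `dataformat` output.

```python
import io

import pandas as pd

# Hypothetical stand-in for the TSV printed by dataformat.
# Column headers and values are made up for illustration.
example_tsv = (
    "Organism Name\tAssembly Accession\tContig N50\n"
    "Example ant one\tGCA_EXAMPLE_1\t1000\n"
    "Example ant two\tGCA_EXAMPLE_2\t2500\n"
    "Example ant three\tGCA_EXAMPLE_3\t1800\n"
)

table = pd.read_csv(io.StringIO(example_tsv), sep="\t")

# idxmax() returns the row label of the largest value in the column
best = table.loc[table["Contig N50"].idxmax()]
print(best["Assembly Accession"], best["Contig N50"])
```

To try it on real data, the `dataformat` output could be redirected to a file and that file's path passed to `pd.read_csv`.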

In [None]:
# Next we have to unzip the dehydrated package
!unzip acromyrmex-dry.zip -d acromyrmex-dry 

In [None]:
# Let's get a list of files that are available for download 
!datasets rehydrate --directory acromyrmex-dry/ --list

In [None]:
# Let's only get the protein sequences for the genome with the highest contigN50 value
!datasets rehydrate --directory acromyrmex-dry/ --match GCA_000204515.1/protein.faa --no-progressbar

In [None]:
# Take a peek at the downloaded protein file
!head acromyrmex-dry/ncbi_dataset/data/GCA_000204515.1/protein.faa

In [None]:
!datasets rehydrate -h

## Exercise
* Download a dehydrated package for all *Mycobacterium tuberculosis* genomes that meet all of the following criteria (hint: use flags):
    1. submitted/released in 2021
    2. annotated
    3. assembly level of complete_genome
* Use `dataformat` to view the sequencing technology used for each of these genomes
* Use `datasets rehydrate` to get the genome sequence for one genome generated using Oxford Nanopore

In [None]:
# Download a dehydrated genome data package

In [None]:
# Unzip the data package

In [None]:
# Use dataformat to generate a table that includes sequencing technology

In [None]:
# Use rehydrate to get genome sequence generated using Oxford Nanopore