# NCBI Datasets - CSHL (11/02/2021)

### Important resources:
- Etherpad: <link>
- Github: <link>
- NCBI datasets: <link>
- Code of Conduct (?): <link> *we can use the Carpentries COC*
- jq cheat sheet: <link>
- UNIX cheat sheet: <link>

## Case study: Elmo loves ants

Elmo is a graduate student at the Via Sesamum University. As part of his Ph.D. project, he studies Panamanian leaf cutter ants (genus *Acromyrmex*, family Formicidae) and how variation in the gene *orco* (**o**dorant **r**eceptor **co**receptor) affects the colonies of this genus.

(Here's a [link](https://www.sciencedirect.com/science/article/pii/S0092867417307729#app3) to a cool paper about this gene in ants of the species *Ooceraea biroi*.)

<img src="./images/ants.png" alt="image"/>

Elmo will use `datasets` to help him gather the existing genomic resources from NCBI. He will:

- download all available genomes for the genus *Acromyrmex*
- download the *orco* gene from the *Acromyrmex* reference genome
- download the ortholog set for this gene for all ants (Formicidae)

In addition, he will also do the following tasks:
- Create a custom BLAST database with the Panamanian leaf cutter ant genomes
- BLAST the gene *orco* against the database
- Build a multiple sequence alignment of the BLAST results and the ortholog gene sequences
- Build a phylogenetic tree using FastTree

## something about dataformat and dehydrated files

### How is `datasets` organized?

[NCBI datasets](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/quickstarts/command-line-tools/) is a command line tool that allows users to download data packages (data + metadata) or look at metadata summaries for genomes, RefSeq annotated genes, curated ortholog sets and SARS-CoV-2 virus sequences and proteins. The program follows a hierarchy that makes it easier for users to select exactly which options they would like to use. In addition to the program commands, flags are available for filtering the results. We will go over those during this tutorial.
<img src="./images/datasets_horizontal.drawio.png" alt="datasets" style="width: 600px;"/>

In addition to `datasets`, we will be using `jq` (a JSON parser) to take a look at the metadata. Our metadata reports are almost all in JSON or JSON Lines format. We put together a [jq cheat sheet](<add link>) to help you extract information from those files.
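
As a small illustration of the two formats mentioned above, the sketch below parses the same records once as a single JSON document and once as JSON Lines (one complete JSON object per line). The records and the `accession` key are made up for this example and are not real `datasets` output.

```python
import json

# A single JSON document: one object that holds a list of records.
json_text = '{"records": [{"accession": "EXAMPLE_1"}, {"accession": "EXAMPLE_2"}]}'
from_json = [rec["accession"] for rec in json.loads(json_text)["records"]]

# JSON Lines: every line is a complete JSON object on its own,
# so each line is parsed separately.
jsonl_text = '{"accession": "EXAMPLE_1"}\n{"accession": "EXAMPLE_2"}\n'
from_jsonl = [
    json.loads(line)["accession"]
    for line in jsonl_text.splitlines()
    if line.strip()
]

print(from_json)
print(from_jsonl)
```

Both approaches recover the same list of values; the difference is only in whether the file is read as one document or line by line.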

## Tutorial - Part 1

![workflow](./images/elmo_workflow.drawio.png)

First, let's figure out what kind of information NCBI has for ants (family Formicidae).

![summary_genome](./images/summary_genome_taxon.png)

In [None]:
# Get metadata info
!datasets summary genome taxon formicidae

In [None]:
# Get metadata info and save to a file
!datasets summary genome taxon formicidae > formicidae_summary.json

**Now let's take a look at the metadata using jq**

In [None]:
!datasets summary genome taxon formicidae | jq .

### A little bit more about json files
A JSON (JavaScript Object Notation) file stores data structures and objects. In a very simplified (and non-technical) way, a JSON file is a box that might contain other boxes, with more boxes inside. In the output of `datasets summary`, our JSON "box" is organized like this:
<img src="./images/json1.png" alt="image" style="width: 600px;"/>

If we continue to expand each one of those assembly boxes, more levels of the hierarchy will be revealed. Let's take a look inside the pink assembly box:
<img src="./images/json2.png" alt="image" style="width: 600px;"/>

Here we can see that some of the assembly information, such as assembly accession number, contig N50 or submission date, is not included inside any of the available "boxes" (annotation_metadata, chromosomes, bioproject_lineage, and org). Those fields describe features that pertain to the entire assembly, not to any one of those boxes.
Let's try to expand those boxes now:
<img src="./images/json3.png" alt="image" style="width: 600px;"/>

Now we can see all the available fields for the genome summary. Not every assembly will have a value for each of them, but this gives you an idea of how the information is organized. Every assembly in the summary follows the same layout, like this:

<img src="./images/json4.png" alt="image" style="width: 600px;"/>

**RESOURCE:**  
We included a list of all fields in the genome summary in our [jq cheatsheet]() to help you extract the information you need. And we will show you now how to do that. 

### Let's continue to explore the available genomes for the family Formicidae

<img src="./images/summary_genome_taxon.png" alt="summary" />

In [None]:
# For which species does NCBI have genomes in its database? How many per species?

!datasets summary genome taxon formicidae | jq '.assemblies[].assembly.org.sci_name' | sort | uniq -c

In [None]:
# What is the assembly level (contig, scaffold, chromosome, complete) breakdown?

!datasets summary genome taxon formicidae | jq '.assemblies[].assembly.assembly_level' | sort | uniq -c
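
For readers who prefer Python over `jq`, here is a minimal sketch of the same counting idea (`jq ... | sort | uniq -c`). It runs on a small hand-written stand-in for the summary output: only the keys used in the two `jq` queries above are included, and the species names and values are invented for the example.

```python
import json
from collections import Counter

# Hand-written stand-in for `datasets summary genome taxon formicidae`.
# Only the keys used by the jq queries are present; values are illustrative.
summary_text = """
{"assemblies": [
  {"assembly": {"assembly_level": "Scaffold", "org": {"sci_name": "Species A"}}},
  {"assembly": {"assembly_level": "Chromosome", "org": {"sci_name": "Species A"}}},
  {"assembly": {"assembly_level": "Scaffold", "org": {"sci_name": "Species B"}}}
]}
"""
summary = json.loads(summary_text)

# Equivalent of: jq '.assemblies[].assembly.org.sci_name' | sort | uniq -c
species_counts = Counter(
    a["assembly"]["org"]["sci_name"] for a in summary["assemblies"]
)
# Equivalent of: jq '.assemblies[].assembly.assembly_level' | sort | uniq -c
level_counts = Counter(
    a["assembly"]["assembly_level"] for a in summary["assemblies"]
)

for name, n in sorted(species_counts.items()):
    print(n, name)
for level, n in sorted(level_counts.items()):
    print(n, level)
```

To try it on real data, replace `summary_text` with the contents of the `formicidae_summary.json` file saved earlier, for example by loading it with `json.load`.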

### How to get help when using the command line

Because `datasets` is organized hierarchically, we can ask for help at any level of a command and get help specific to that level. For example, if we type `datasets --help`, we will see the first level of available commands.


In [3]:
!datasets --help

datasets is a command-line tool that is used to query and download biological sequence data
across all domains of life from NCBI databases.

Refer to NCBI's [command line quickstart](https://www.ncbi.nlm.nih.gov/datasets/docs/quickstarts/command-line-tools/) documentation for information about getting started with the command-line tools.

Usage
  datasets [command]

Data Retrieval Commands
  summary              print a summary of a gene or genome dataset
  download             download a gene, genome or coronavirus dataset as a zip file
  rehydrate            rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands
  completion           generate autocompletion scripts
  version              print the version of this client and exit
  help                 Help about any command

Flags
      --api-key string   NCBI Datasets API Key
  -h, --help             help for datasets
      --no-progressbar   hide progress bar

Use datasets help <command> for deta

Notice the difference from when we type `datasets summary genome taxon formicidae --help`  


In [4]:
!datasets summary genome taxon formicidae --help


Print a summary of a genome dataset by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank). The summary is returned in JSON format.

Refer to NCBI's [command line quickstart](https://www.ncbi.nlm.nih.gov/datasets/docs/quickstarts/command-line-tools/) documentation for information about getting started with the command-line tools.

Usage
  datasets summary genome taxon [flags]

Examples
  datasets summary genome taxon human
  datasets summary genome taxon "mus musculus"
  datasets summary genome taxon 10116

Flags
  -h, --help              help for taxon
      --tax-exact-match   exclude sub-species when a species-level taxon is specified


Global Flags
  -a, --annotated                only include genomes with annotation
      --api-key string           NCBI Datasets API Key
      --assembly-level string    restrict assemblies to a comma-separated list of one or more of: chromosome, complete_genome, contig, scaffold
      --assembly-source string 

### Exercises

Now we will practice what we learned about `datasets`. Take a look at the questions below and feel free to ask questions. Useful resources for this exercise are the `--help` from the command line and the [jq cheatsheet](). 


In [None]:
# How many reference genomes are there in the family Formicidae? (hint: --reference)
!


In [None]:
# How many reference genomes are annotated? (hint: --annotated)
!


In [None]:
# How many genomes have NCBI (RefSeq) annotations? (hint: --assembly-source)
!


### Bonus questions:

In [None]:
## Take a look at the jq cheat sheet (link here) and try to build a jq query for the metadata
!



In [None]:
# Now look at the summary metadata for your organism of interest 
# (if you don't have a favorite, go with red panda, Ailurus fulgens, taxid: 9649)
!


In [None]:
# How many genomes?
!



In [None]:
# Assembly level breakdown
!


In [None]:
# How many above contig N50 15Mb?
!


### Back to the main room


### What is the difference/relationship between GenBank, RefSeq and Reference assemblies?

<img src="./images/gca_gcf.png" alt="ref" />

### Data package

We explored the `datasets summary` option, in which we had a chance to look at the summary metadata ***without*** downloading any files. In the next steps, we will look at the data packages, which contain the actual data files. 
<img src="./images/genome_data_package.png" alt="data_package" />

In [None]:
# Download all available GenBank assemblies for the genus Acromyrmex and save as genomes.zip
!datasets download genome taxon acromyrmex --assembly-source genbank --filename genomes.zip --no-progressbar

In [None]:
# Unzip genomes.zip to the folder genomes
!unzip genomes.zip -d genomes

In [None]:
# Explore the structure of the folder genomes with the command tree
!tree genomes/

### Let's recap our goals

We used `datasets` to download all the GenBank assemblies for the genus *Acromyrmex*. The next step is to download the gene *orco* (odorant receptor coreceptor) for the same genus. But first, let's learn more about how genes are organized at NCBI.

<img src="./images/elmo_done1.png" alt="done1" style="width: 450px;" />

### GENES

Regardless of whether you choose `datasets download` or `datasets summary`, there are three options for retrieving gene information:
- accession
- gene-id
- symbol

<img src="./images/genes_op2.png" style="width: 800px;"/>

When choosing any of those three options, you will retrieve the gene information for the **reference** taxon. Like this:

`datasets download gene accession XR_002738142.1`  
`datasets download gene gene-id 101081937`  
`datasets download gene symbol BRCA1 --taxon cat`  

All three commands will download the same gene from the cat (<i>Felis catus</i>) <u>reference genome</u>. 

#### accession
Unique identifier. Accessions include RefSeq DNA, RNA and protein sequence accessions. Since an accession is unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).

#### gene-id
Also a unique identifier. Every annotated RefSeq genome has its own unique set of gene identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat it is 101081937.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. If using the symbol option, you should specify the species. The default option is human.

**Remember**: both `summary` and `download` will return results for the **reference assembly** of a <u>single species</u>. If you want to download a curated set of the same gene for multiple taxa, you should use the option `ortholog`. We'll talk more about it later.

Now let's take a look at a gene example:

In [None]:
#Example: IFNG in human
!datasets summary gene symbol ifng | jq .


In [None]:
# How datasets deals with synonyms
!datasets summary gene symbol IFG | jq -r '.genes[].gene | {species: .taxname, symbol: .symbol, synonyms:.synonyms}'

In [None]:
#Example: IFNG in cat
!datasets summary gene symbol ifng --taxon "felis catus"

### Back to ants
We will download the gene *orco* for the species *Acromyrmex echinatior*. We will use the gene-id 105147775 instead of the symbol.
The reason for it is that sometimes even when a known gene is characterized in a species, the gene symbol is not necessarily propagated.

In [None]:
# Using gene-id to retrieve gene information
!datasets summary gene gene-id 105147775 | jq '.genes[].gene | {gene_description: .description, gene_id: .gene_id, symbol: .symbol, species: .taxname}'

In [None]:
# if we try to retrieve metadata information for this gene using the symbol orco, what happens?
!datasets summary gene symbol orco --taxon "acromyrmex echinatior"

In [None]:
# Download the gene data package for the gene-id 105147775 (*orco* in Acromyrmex echinatior)
!datasets download gene gene-id 105147775 --filename gene.zip --no-progressbar

In [None]:
#Unzip the file
!unzip gene.zip -d gene

In [None]:
#Explore the data package structure using tree
!tree gene

Now we are going to take advantage of the fact that we are using a Jupyter Notebook and use the package `pandas` to look at the gene data table

In [None]:
import pandas as pd                                                        #load pandas to this notebook
gene_orco = pd.read_csv('gene/ncbi_dataset/data/data_table.tsv', sep='\t') #use pandas to import the data_table.tsv
gene_orco                                                                  #visualize the data table as the object gene_orco

### Exercises

1. Look for the summary data for a gene of interest (check the [etherpad]() for suggestions)
2. What is the gene location?
3. What is the gene range?
4. Now, download a list of genes using the file genes.txt (provided). Save it as gene_list.zip
5. Unzip gene_list.zip and explore the folder structure
6. How many fasta files?

In [None]:
# Summary data
!


In [None]:
# Gene location
!


In [None]:
# Gene range
!


In [None]:
# Download a list of genes
! --filename gene_list.zip


In [None]:
# Explore the folder structure
!


In [None]:
# How many genes were downloaded?
!


In [None]:
# How many fasta files in the data package?
!


### Orthologs

The options to retrieve ortholog sets are the same as those for genes. We'll go over the differences when using each option:

- accession
- gene-id
- symbol

<img src="./images/ortholog.png" style="width: 800px;" />

When choosing any of those three options, you will download the **full** ortholog set to which they belong (unless you use additional filtering. We'll cover it below). Like this:

`datasets download ortholog accession XR_002738142.1`  
`datasets download ortholog gene-id 101081937`  
`datasets download ortholog symbol BRCA1 --taxon cat`  

All three commands will download the **same** ortholog set. 

#### accession
Unique identifier. Accessions include RefSeq DNA, RNA and protein sequence accessions. Since an accession is unique, the taxon is implied (i.e., there will never be two sequences from different taxa with the same accession number).

#### gene-id
Also a unique identifier. Every annotated RefSeq genome has its own unique set of gene identifiers. For example: the gene-id for BRCA1 in human is 672, while in cat it is 101081937. You can use either one (672 or 101081937) to get the same vertebrate BRCA1 ortholog set.

#### symbol
Unlike accession and gene-id, a gene symbol is not unique and can mean different things in different taxonomic groups. For example: the P53 ortholog set in vertebrates is different from the insect set. If using the symbol option, you should specify the taxonomic group. The default option is human. Note that if you request the same symbol for several vertebrate species, you might end up downloading the same ortholog set multiple times. Like this: 

`datasets download ortholog symbol brca1 --taxon cat`  
`datasets download ortholog symbol brca1 --taxon chicken`  
`datasets download ortholog symbol brca1 --taxon "chelonia mydas"`  

If that's the case, how do you filter the ortholog set to include *only* your taxonomic group of interest?

### Applying a taxonomic filter to the ortholog set

For the orthologs, `datasets` provides the flag `--taxon-filter`, which allows the user to restrict the summary or download to one or multiple taxonomic groups. `--taxon` and `--taxon-filter` have different effects on the data package/summary output. A few examples:

- `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`  
Prints a json metadata summary of the gene brca1 for the domestic cat. 
We did not specify a `--taxon` because the default is human, and Felidae and human are part of the same brca1 ortholog set.   

  

- `datasets summary ortholog symbol brca1 --taxon "felis catus"`  
Even though this option looks almost the same as the one above, the result is *very different*. Here, we're asking `datasets` to find the ortholog set to which the gene brca1 in the domestic cat belongs. And `datasets` will return the <u>entire</u> ortholog set, not only the sequences for the domestic cat.


- `datasets summary ortholog symbol brca1 --taxon "felis catus" --taxon-filter "felis catus"`  
gives you the same result as `datasets summary ortholog symbol brca1 --taxon-filter "felis catus"`
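
To make the difference between the two flags concrete, here is a toy model in Python. It is only a way to reason about the behavior described above and says nothing about how `datasets` is implemented: the set name, the lookup table and the function `ortholog_summary` are all hypothetical, and the filter is simplified to an exact species match.

```python
# Toy model: one made-up ortholog set for brca1 shared by three vertebrates.
ORTHOLOG_SETS = {
    "vertebrate_brca1": ["homo sapiens", "felis catus", "gallus gallus"],
}

# Which ortholog set a (symbol, taxon) pair belongs to (hypothetical lookup).
SET_LOOKUP = {
    ("brca1", "homo sapiens"): "vertebrate_brca1",
    ("brca1", "felis catus"): "vertebrate_brca1",
    ("brca1", "gallus gallus"): "vertebrate_brca1",
}


def ortholog_summary(symbol, taxon="homo sapiens", taxon_filter=None):
    """Return the species reported for an ortholog query.

    `taxon` only decides WHICH ortholog set is found (default: human).
    `taxon_filter` decides which MEMBERS of that set are returned.
    """
    members = ORTHOLOG_SETS[SET_LOOKUP[(symbol, taxon)]]
    if taxon_filter is None:
        return list(members)
    return [m for m in members if m == taxon_filter]


# --taxon-filter only: cat sequences from the set found via human brca1
print(ortholog_summary("brca1", taxon_filter="felis catus"))
# --taxon only: the entire set that cat brca1 belongs to
print(ortholog_summary("brca1", taxon="felis catus"))
# both flags: same result as --taxon-filter alone
print(ortholog_summary("brca1", taxon="felis catus", taxon_filter="felis catus"))
```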

### Exercise
- download the ortholog data package and save it with the name ortholog.zip
- unzip it to the folder ortholog
- look at the files

Without looking below, how would you download the *orco* dataset for all ants (Formicidae)? Here's some information to help you with that:

- gene symbol: orco
- gene-id in *Drosophila melanogaster*: 40650
- gene-id in *Acromyrmex echinatior*: 105147775
- target taxon: Formicidae

In [6]:
# download the orco ortholog set for ants (Formicidae)
!


In [None]:
# unzip it to the folder ortholog
!


In [None]:
#Explore the folder structure
!


In [None]:
# Create an object called ortho_table using pandas
ortho_table = pd.read_csv("ortholog/ncbi_dataset/data/data_table.tsv", sep='\t')
ortho_table

## What have we done so far?
- Explored metadata for all ant genomes
- Downloaded genomes for the Panamanian leaf cutter ants
- Downloaded the orco gene for Acromyrmex echinatior
- Downloaded the ortholog set for all ants for the orco gene

<img src="./images/elmo_done.png" />

### Here's what we are showing you now:
- create BLAST database for each genome
- BLAST the *orco* gene sequence against the genomes database and extract the matching regions
- multiple sequence alignment of the blast matches and the ortholog sequences
- generate an approximate maximum likelihood tree using FastTree

In [None]:
# Change directories to the folder blastdb
%cd blastdb/

In [None]:
# Extract tax id for each species:
!dataformat tsv genome --fields organism-name,tax-id,assminfo-accession --package ../genomes.zip 

In [None]:
# Create a blast database for each genome
!makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_000204515.1/unplaced.scaf.fna -taxid 103372 -out Aechinatior
!makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607455.1/GCA_017607455.1_ASM1760745v1_genomic.fna -taxid 230686 -out Ainsinuator 
!makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607545.1/GCA_017607545.1_ASM1760754v1_genomic.fna -taxid 2715315 -out Acharruanus
!makeblastdb -dbtype nucl -in ../genomes/ncbi_dataset/data/GCA_017607565.1/GCA_017607565.1_ASM1760756v1_genomic.fna -taxid 230685 -out Aheyeri

Let's go over the `makeblastdb` command:
```
makeblastdb \
-dbtype nucl \
-in ../genomes/ncbi_dataset/data/GCA_017607565.1/GCA_017607565.1_ASM1760756v1_genomic.fna \
-taxid 230685 \
-out Aheyeri
```
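
The four `makeblastdb` calls differ only in the input FASTA (`-in`), the taxonomy ID assigned to the sequences (`-taxid`) and the database name (`-out`); `-dbtype nucl` marks each one as a nucleotide database. As a sketch of how that repetition could be expressed once, the Python snippet below builds the four command lines from a table copied from the cell above. The helper name `makeblastdb_command` is made up for this example, and the snippet only prints the commands rather than running them.

```python
# (accession directory, FASTA file name, NCBI taxid, output database name),
# copied from the four makeblastdb calls in this notebook.
GENOMES = [
    ("GCA_000204515.1", "unplaced.scaf.fna", 103372, "Aechinatior"),
    ("GCA_017607455.1", "GCA_017607455.1_ASM1760745v1_genomic.fna", 230686, "Ainsinuator"),
    ("GCA_017607545.1", "GCA_017607545.1_ASM1760754v1_genomic.fna", 2715315, "Acharruanus"),
    ("GCA_017607565.1", "GCA_017607565.1_ASM1760756v1_genomic.fna", 230685, "Aheyeri"),
]


def makeblastdb_command(accession, fasta, taxid, out):
    """Build the argument list for one makeblastdb call (hypothetical helper)."""
    path = f"../genomes/ncbi_dataset/data/{accession}/{fasta}"
    return ["makeblastdb", "-dbtype", "nucl", "-in", path,
            "-taxid", str(taxid), "-out", out]


commands = [makeblastdb_command(*genome) for genome in GENOMES]
for cmd in commands:
    print(" ".join(cmd))
```

Each list could then be handed to `subprocess.run(cmd, check=True)` if you wanted to execute the commands from Python instead of from the shell.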

Now we want to create an alias that includes the four BLAST databases we just created. We will call this database <i>Acromyrmex</i>.

In [None]:
# Create an alias under which the four genome databases can be called
!blastdb_aliastool -dbtype nucl -title acromyrmex -out acromyrmex -dblist "Acharruanus Aechinatior Aheyeri Ainsinuator"

Time for our BLAST search. We will search for matches to the gene *orco* in the genomes (the databases we just created). We will be very stringent in our search, and will use an output format that can later be converted into other formats. Here's the command breakdown:

---
```
blastn \                                     # Calls the program BLASTN (nucleotide to nucleotide search)
-db acromyrmex \                             # BLAST database to be used
-query ../gene/ncbi_dataset/data/gene.fna \  # query: in our example, orco gene sequence
-evalue 1e-50 \                              # e-value: number of hits one would expect to see by chance
-outfmt 11 \                                 # output format: asn.1
-max_hsps 1 \                                # Maximum number of HSPs (alignments) to keep 
-out orco_acromyrmex_1e-50.asn               # output file
```
---

In [None]:
%%bash
# BLASTN search

blastn \
-db acromyrmex \
-query ../gene/ncbi_dataset/data/gene.fna \
-evalue 1e-50 \
-outfmt 11 \
-max_hsps 1 \
-out orco_acromyrmex_1e-50.asn

Now we will convert our output file using a program called `blast_formatter` (included in the BLAST package). We could have created an output file in the desired format, but if we ever need the same BLAST results in another format, we wouldn't be able to easily make the conversion.


In [None]:
%%bash
# Convert the asn.1 output to tabular (output format 6)

blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 sseqid sstart send evalue length staxid ssciname' > orco_acromyrmex_1e-50.tsv

Using `pandas` again, we will create an object with the tsv file we just created from the BLAST output, so we can take a look at our results.

In [None]:
# Create a table and visualize the BLAST results

blast_table = pd.read_csv('orco_acromyrmex_1e-50.tsv', sep='\t', header=None)
blast_table
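
Because the table is read with `header=None`, its columns show up as numbers. The sketch below shows one way to attach the field names from the `blast_formatter` call to each row, and to tell plus-strand from minus-strand hits (in BLASTN tabular output, a minus-strand hit has `sstart` greater than `send`). It uses the standard-library `csv` module and two invented rows in place of the real `orco_acromyrmex_1e-50.tsv`.

```python
import csv
import io

# Column order matches the blast_formatter call:
# -outfmt '6 sseqid sstart send evalue length staxid ssciname'
BLAST_COLUMNS = ["sseqid", "sstart", "send", "evalue", "length", "staxid", "ssciname"]

# Two made-up rows standing in for the real BLAST output file.
sample_tsv = (
    "seq_example_1\t1200\t2500\t0.0\t1301\t103372\tAcromyrmex echinatior\n"
    "seq_example_2\t9100\t7800\t1e-120\t1301\t230685\tAcromyrmex heyeri\n"
)

reader = csv.DictReader(io.StringIO(sample_tsv), fieldnames=BLAST_COLUMNS, delimiter="\t")
hits = []
for row in reader:
    start, end = int(row["sstart"]), int(row["send"])
    # Subject coordinates run backwards when the hit is on the minus strand.
    row["strand"] = "plus" if start <= end else "minus"
    hits.append(row)

for hit in hits:
    print(hit["ssciname"], hit["sseqid"], hit["strand"])
```

With `pandas`, passing `names=BLAST_COLUMNS` to `pd.read_csv` labels the columns in the same way.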

#### Converting from BLAST to fasta

Now we are going to use some "tricks" (not really, just some good old bash scripting) to extract fasta sequences from the BLAST output. We will be using `blast_formatter` again, and we'll do everything in multiple steps so we can all understand what's going on. 

In [None]:
%%bash
# First, let's extract the following fields from the top 4 results: 
# subject scientific name (ssciname)
# subject sequence ID (sseqid) and 
# subject sequence (sseq)

blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 ssciname sseqid sseq' \
-max_target_seqs 4

In [None]:
%%bash
# Same command as above, but now we're piping the output into an awk command. The command explanation is in the Etherpad

blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 ssciname sseqid sseq' \
-max_target_seqs 4 | awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/ /, "_", $1);gsub(/-/, "", $3); print ">"$1"_"$2,$3}'

In [None]:
%%bash
# Now let's save it to a file called acromyrmex_orco.fasta

blast_formatter \
-archive orco_acromyrmex_1e-50.asn \
-outfmt '6 ssciname sseqid sseq' \
-max_target_seqs 4 | awk 'BEGIN{FS="\t"; OFS="\n"}{gsub(/ /, "_", $1);gsub(/-/, "", $3); print ">"$1"_"$2,$3}' > ../acromyrmex_orco.fasta
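
If the `awk` one-liner is hard to read, the same transformation can be sketched in Python: each tab-separated row (`ssciname`, `sseqid`, `sseq`) becomes a FASTA record, with spaces in the species name turned into underscores and alignment gaps (`-`) removed from the sequence. The function name and the sample row are invented for this example.

```python
def blast_rows_to_fasta(tsv_text):
    """Turn 'ssciname<TAB>sseqid<TAB>sseq' rows into FASTA records."""
    records = []
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        sciname, seqid, seq = line.split("\t")
        # Same as gsub(/ /, "_", $1) followed by print ">"$1"_"$2
        header = ">" + sciname.replace(" ", "_") + "_" + seqid
        # Same as gsub(/-/, "", $3): drop the gap characters from the alignment
        records.append(header + "\n" + seq.replace("-", ""))
    return "\n".join(records) + "\n"


# One made-up row in the layout produced by -outfmt '6 ssciname sseqid sseq'
sample = "Acromyrmex echinatior\tseq_example_1\tACG-TTA--GC\n"
fasta = blast_rows_to_fasta(sample)
print(fasta, end="")
```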

In [None]:
#Now let's go back to our home directory and review what we've done so far.
%cd ~

<img src="./images/elmo_blast_done.png"/>

In [None]:
# Extract the seqids from the gene ortholog fasta and replace the spaces with commas
!grep ">" ortholog/ncbi_dataset/data/gene.fna | sed 's/ /,/g' > ortholog_seqid.txt

In [None]:
%%bash
#Create a mapping file with the original name in column 1 and a shortened name in column 2
cat ortholog_seqid.txt | while read line; do
new=$( echo "$line" | awk 'BEGIN {FS=","; OFS="_"}{gsub(/\[organism\=/, "", $3);gsub(/]/, "", $4);gsub(/\[GeneID\=|\]/, "", $5)} ;{print substr($3,1,1)$4,$5}'); 
old=$( echo $line | sed 's/,/\_/g;s/>//g')
printf "${old}\t${new}\n" >> name_map.tsv; 
done
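
The renaming logic in the loop above can also be sketched in Python. The header layout used here (accession, symbol, `[organism=Genus species]`, `[GeneID=...]`) is an assumption inferred from the field positions that the `awk` command uses, and the accession `ACC_EXAMPLE.1` is a placeholder; check a real header with `head -n1` before relying on it.

```python
def short_name(header):
    """Build a short label such as 'Aechinatior_105147775' from a FASTA header."""
    fields = header.lstrip(">").split(" ")
    genus = fields[2].replace("[organism=", "")            # awk field $3
    species = fields[3].replace("]", "")                   # awk field $4
    gene_id = fields[4].replace("[GeneID=", "").replace("]", "")  # awk field $5
    # First letter of the genus + species name + gene ID
    return f"{genus[0]}{species}_{gene_id}"


def old_name(header):
    """Original name with spaces replaced by underscores, as in the sed step."""
    return header.lstrip(">").replace(" ", "_")


# Assumed header layout with a placeholder accession
header = ">ACC_EXAMPLE.1 orco [organism=Acromyrmex echinatior] [GeneID=105147775]"
print(old_name(header), short_name(header), sep="\t")
```

The two values printed correspond to columns 1 and 2 of `name_map.tsv`.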

In [None]:
import pandas as pd
name_map = pd.read_csv('name_map.tsv', sep='\t', header=None)
name_map

In [None]:
#Copy the ortholog dataset fasta
!cp ortholog/ncbi_dataset/data/gene.fna ortholog_gene.fna

In [None]:
#Remove spaces in the fasta sequence names
!sed 's/ /_/g' ortholog_gene.fna > ortholog_gene_nospaces.fna

In [None]:
!head -n1 ortholog_gene_nospaces.fna

In [None]:
%%bash
#Replace the names in the fasta file
cat ortholog_gene_nospaces.fna | seqkit replace \
--kv-file  <(cut -f 1,2 name_map.tsv) \
--pattern "^(.*)" --replacement "{kv}" > ortholog_gene_final.fna

In [None]:
!grep ">" ortholog_gene_final.fna

In [None]:
#Concatenate sequences
!cat ortholog_gene_final.fna acromyrmex_orco.fasta > orco_all.fasta

In [None]:
#align sequences with mafft
!time mafft orco_all.fasta > orco_all_aln.fasta

In [None]:
#Generate a phylogeny using fasttree
!time FastTree -nt orco_all_aln.fasta > orco.tree

In [None]:
import toytree

In [None]:
orco_tree = toytree.tree("orco.tree")
orco_tree_rooted = orco_tree.root(names=["Obrunneus_116854080","Dquadriceps_106748868","Hsaltator_105183395"])
orco_tree_rooted.draw(tree_style='d')

## PART 2: large datasets (GENOMES)

In [None]:
# Download a dehydrated data package for all acromyrmex GenBank genomes
!time datasets download genome taxon acromyrmex --assembly-source genbank --dehydrated --filename acromyrmex-dry.zip --no-progressbar

In [None]:
# Read the dataformat help menu. This is a great way to get a list of the available metadata fields.
!dataformat tsv genome -h

In [None]:
%%bash
# Use dataformat to look at the genome data package for ants
# We can use this information to select a "best" genome--we'll pick one with the highest contigN50 value
dataformat tsv genome \
--fields organism-name,assminfo-accession,assmstats-contig-n50,assminfo-level,assminfo-submission-date,assminfo-submitter \
--package acromyrmex-dry.zip
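
Picking the "best" genome by eye works for a handful of assemblies; for longer tables the selection can be scripted. The sketch below does this in Python on a made-up table. The column headers simply reuse the `--fields` identifiers, which is an assumption (the header text printed by `dataformat` may differ), and the accessions and N50 values are placeholders.

```python
import csv
import io

# Made-up table; header names reuse the --fields identifiers (an assumption)
# and the accessions and N50 values are placeholders, not real data.
sample_tsv = (
    "organism-name\tassminfo-accession\tassmstats-contig-n50\n"
    "Example species A\tGCA_EXAMPLE_1\t62000\n"
    "Example species B\tGCA_EXAMPLE_2\t1500000\n"
    "Example species C\tGCA_EXAMPLE_3\t480000\n"
)

rows = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

# Compare N50 values as integers, not as strings
best = max(rows, key=lambda r: int(r["assmstats-contig-n50"]))
print(best["assminfo-accession"], best["assmstats-contig-n50"])
```

Converting the N50 column to integers matters: compared as text, "62000" would sort above "1500000".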

In [None]:
# Next we have to unzip the dehydrated package
!unzip acromyrmex-dry.zip -d acromyrmex-dry 

In [None]:
# Let's get a list of files that are available for download 
!datasets rehydrate --directory acromyrmex-dry/ --list

In [None]:
# Let's only get the protein sequences for the genome with the highest contigN50 value
!datasets rehydrate --directory acromyrmex-dry/ --match GCA_000204515.1/protein.faa --no-progressbar

In [None]:
# Take a peek at the downloaded protein file
!cat acromyrmex-dry/ncbi_dataset/data/GCA_000204515.1/protein.faa | head

In [None]:
!datasets rehydrate -h

## Exercise
* Download a dehydrated package for all *Mycobacterium tuberculosis* genomes that meet all of the following criteria (hint: use flags)
    1. submitted/released in 2021
    2. annotated
    3. assembly level of complete_genome
* use dataformat to view the sequencing technology used for each of these genomes
* use rehydrate to get the genome sequence for one genome generated using Oxford Nanopore

In [None]:
# Download a dehydrated genome data package

In [None]:
# Unzip the data package

In [None]:
# Use dataformat to generate a table that includes sequencing technology

In [None]:
# Use rehydrate to get genome sequence generated using Oxford Nanopore