## NCBI Genomes - Oracle Open Data

**TODO:** Fix the repository list below, because only all and genbank apply.

This data repository contains sequence data for all single-organism genome assemblies contained in [NCBI's Assembly resource](https://www.ncbi.nlm.nih.gov/assembly/).

The data is divided into 3 main repositories: genbank, refseq, and all.

* **Genbank** - [GenBank](https://www.ncbi.nlm.nih.gov/genbank/) is the NIH genetic sequence database, a collection of all publicly available DNA sequences. It includes primary submissions of assembled genome sequence and associated annotation data. This collection includes genome sequence data for a larger number of organisms than the RefSeq directory, but some assemblies are unannotated.

* **Refseq** - [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/) includes assembled genome sequence and RefSeq annotation data. All prokaryotic and eukaryotic genomes in this directory have annotation. The annotation data is collected either from NCBI annotation pipelines or from the GenBank submission. This collection includes fewer organisms than GenBank because not all genome assemblies are selected for the RefSeq project.

* **All** - The combination of both GenBank and RefSeq assemblies. 

All files are available by anonymous File Transfer Protocol (FTP); see the [FTP FAQ page](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/) for details. The NCBI Genomes data repository contains a vast amount of data, and searching it manually for files related to a specific organism would be difficult. Instead, we can write functions to parse through the file names and identify those related to a specific organism.
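As a minimal sketch of that idea, the function below filters a list of object names by organism. The function name `filter_by_species` and the sample names are hypothetical, and the sketch assumes each name is laid out as repository, then group, then species directory, with spaces in the binomial name replaced by underscores.

```python
def filter_by_species(names, species):
    """Return the names whose third path component equals the species directory.

    `species` is a binomial name such as "Escherichia coli"; spaces are
    converted to underscores to match the directory naming.
    """
    target = species.strip().replace(" ", "_")
    matches = []
    for name in names:
        parts = name.split("/")
        # parts[0] = repository, parts[1] = group, parts[2] = species directory
        if len(parts) > 2 and parts[2] == target:
            matches.append(name)
    return matches


# Hypothetical object names, for illustration only
sample_names = [
    "genbank/bacteria/Escherichia_coli/example_a.txt",
    "genbank/bacteria/Bacillus_thuringiensis/example_b.txt",
    "genbank/bacteria/Escherichia_coli_O157/example_c.txt",
]
print(filter_by_species(sample_names, "Escherichia coli"))
```

Matching on the whole path component, rather than on a substring, keeps a directory such as `Escherichia_coli_O157` from being returned for `Escherichia coli`.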

After division of the files into the 3 main repositories (all, genbank, refseq), the files are further organized under a series of sub-directories. The sub-directories for genbank are as follows:

    a. archaea
    b. bacteria
    c. fungi
    d. invertebrate
    e. metagenomes
    f. other -  this directory includes synthetic genomes
    g. plant
    h. protozoa
    i. vertebrate_mammalian
    j. vertebrate_other
    k. viral
    
 
The sub-directories for refseq are as follows:

    a. archaea
    b. bacteria
    c. fungi
    d. invertebrate
    e. plant
    f. protozoa
    g. vertebrate_mammalian
    h. vertebrate_other 
    i. viral
    j. mitochondrion 
    k. plasmid     
    l. plastid 
    
Data are further organized within each of the above directories using the species binomial name. For example, *E. coli* files under genbank are in the directory 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/', while files for humans under refseq are in the directory 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/'.
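The directory layout described above can be sketched as a small helper that builds the FTP directory for a species. The names `FTP_ROOT`, `GROUPS`, and `species_dir` are hypothetical; the group names are copied from the lists in this notebook, and the sketch assumes the species directory is the binomial name with spaces replaced by underscores.

```python
FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/genomes"

# Group sub-directories as listed in this notebook
GROUPS = {
    "genbank": {
        "archaea", "bacteria", "fungi", "invertebrate", "metagenomes", "other",
        "plant", "protozoa", "vertebrate_mammalian", "vertebrate_other", "viral",
    },
    "refseq": {
        "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa",
        "vertebrate_mammalian", "vertebrate_other", "viral",
        "mitochondrion", "plasmid", "plastid",
    },
}


def species_dir(repository, group, species):
    """Build the FTP directory for a species, validating repository and group."""
    if repository not in GROUPS:
        raise ValueError(f"unknown repository: {repository!r}")
    if group not in GROUPS[repository]:
        raise ValueError(f"{group!r} is not listed under {repository}")
    directory = species.strip().replace(" ", "_")
    return f"{FTP_ROOT}/{repository}/{group}/{directory}/"


print(species_dir("genbank", "bacteria", "Escherichia coli"))
print(species_dir("refseq", "vertebrate_mammalian", "Homo sapiens"))
```

Validating the group against the repository catches mistakes such as asking for `metagenomes` under refseq before any network request is made.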

Hierarchies beyond this initial division 


In [None]:
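import requests

def list_files_genbank(subdir: str, species: str) -> list:
    """List object names under genbank/<subdir>/<species>/ in the bucket."""
    # Assumption: the bucket allows anonymous listing through the Object Storage
    # ListObjects endpoint. Only the first page of results is returned.
    base = "https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcxvbiyd8fn/b/ncbi_genomes/o"
    prefix = "genbank/" + subdir + "/" + species + "/"
    r = requests.get(base, params={"prefix": prefix})
    r.raise_for_status()
    return [obj["name"] for obj in r.json()["objects"]]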

In [4]:
import pandas as pd
import requests

url = 'https://objectstorage.us-ashburn-1.oraclecloud.com/n/idcxvbiyd8fn/b/ncbi_genomes/o/genbank/bacteria/Bacillus_thuringiensis/all_assembly_versions/GCA_000008505.1_ASM850v1_feature_table.txt.gz'
r = requests.get(url, allow_redirects=True)
# The object is gzip-compressed, so keep the .gz suffix on the local copy
open('bacillus_feature_table.txt.gz', 'wb').write(r.content)

201
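
The value 201 is the number of bytes written by `write()`, which seems small for a genome feature table, so the response may not have contained the expected file. A minimal sketch of a check, using only the standard library: confirm that a payload starts with the gzip magic bytes before decompressing it. The function names `is_gzip` and `parse_gzip_tsv` are hypothetical, and the sketch assumes the table is tab-delimited UTF-8 text.

```python
import csv
import gzip
import io


def is_gzip(data: bytes) -> bool:
    """Return True if the payload starts with the gzip magic bytes."""
    return data[:2] == b"\x1f\x8b"


def parse_gzip_tsv(data: bytes) -> list:
    """Decompress a gzip payload and split it into rows of tab-separated fields."""
    if not is_gzip(data):
        raise ValueError("payload is not gzip-compressed; check the URL and HTTP status")
    text = gzip.decompress(data).decode("utf-8")
    return list(csv.reader(io.StringIO(text), delimiter="\t"))


# Illustration with an in-memory payload rather than a real download
payload = gzip.compress(b"col_a\tcol_b\n1\t2\n")
rows = parse_gzip_tsv(payload)
print(rows)
```

With the download cell above, calling `r.raise_for_status()` and then `parse_gzip_tsv(r.content)` would surface a failed request or an unexpected payload instead of silently saving it.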

In [6]:
!wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Bacillus_thuringiensis/all_assembly_versions/GCA_000008505.1_ASM850v1_feature_table.txt.gz

--2022-08-04 18:30:55--  ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Bacillus_thuringiensis/all_assembly_versions/GCA_000008505.1_ASM850v1_feature_table.txt.gz
           => ‘GCA_000008505.1_ASM850v1_feature_table.txt.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.7, 130.14.250.10, 2607:f220:41f:250::229, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done.    ==> PWD ... done.
==> TYPE I ... done.  ==> CWD (1) /genomes/genbank/bacteria/Bacillus_thuringiensis/all_assembly_versions ... done.
==> SIZE GCA_000008505.1_ASM850v1_feature_table.txt.gz ... done.
==> PASV ... done.    ==> RETR GCA_000008505.1_ASM850v1_feature_table.txt.gz ... 
No such file ‘GCA_000008505.1_ASM850v1_feature_table.txt.gz’.



In [None]:
import oci

# Load the default config file (~/.oci/config) and initialize the service client
config = oci.config.from_file()
object_storage_client = oci.object_storage.ObjectStorageClient(config)

# Send the request to the service. Optional parameters are omitted here;
# see the API documentation for more info
list_objects_response = object_storage_client.list_objects(
    namespace_name="idcxvbiyd8fn",
    bucket_name="ncbi_genomes"
)

# Get the data from response
print(list_objects_response.data)
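
A single `list_objects` call returns one page of results, so a bucket of this size needs pagination. The sketch below follows the `next_start_with` marker on each response and narrows the listing with the `prefix` parameter. The helper name `iter_object_names` is hypothetical, and it takes the client as an argument so it can be used with the `object_storage_client` created above.

```python
def iter_object_names(client, namespace, bucket, prefix=None):
    """Yield every object name in the bucket, optionally limited to a prefix."""
    start = None
    while True:
        # Only pass the optional parameters that are actually set
        kwargs = {}
        if prefix is not None:
            kwargs["prefix"] = prefix
        if start is not None:
            kwargs["start"] = start
        response = client.list_objects(
            namespace_name=namespace,
            bucket_name=bucket,
            **kwargs,
        )
        for obj in response.data.objects:
            yield obj.name
        # next_start_with is empty once the last page has been returned
        start = response.data.next_start_with
        if not start:
            break
```

For example, `iter_object_names(object_storage_client, "idcxvbiyd8fn", "ncbi_genomes", prefix="genbank/bacteria/Bacillus_thuringiensis/")` would walk all objects for that species, assuming the bucket uses the same path layout as the FTP site.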