# NCBI Genomes - Oracle Open Data


The NCBI Genomes data repository contains sequence data for all single-organism genome assemblies in [NCBI's Assembly resource](https://www.ncbi.nlm.nih.gov/assembly/). Please note that as of August 2022, Oracle Open Data does not include the RefSeq data files, so those are not accessible through Object Storage. They can still be downloaded from NCBI by FTP.

The data is divided into 3 main repositories: genbank, refseq, and all.

* **Genbank** - [GenBank](https://www.ncbi.nlm.nih.gov/genbank/) is the NIH genetic sequence database, a collection of all publicly available DNA sequences. It includes primary submissions of assembled genome sequence and associated annotation data. This collection includes genome sequence data for a larger number of organisms than the RefSeq directory, but some assemblies are unannotated.

* **Refseq** - [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/about/) includes assembled genome sequence and RefSeq annotation data. All prokaryotic and eukaryotic genomes in this directory are annotated. The annotation data comes either from NCBI annotation pipelines or from the GenBank submission. This collection includes fewer organisms than GenBank because not all genome assemblies are selected for the RefSeq project.

* **All** - The combination of both GenBank and RefSeq assemblies. 

All files are available by anonymous File Transfer Protocol (FTP); please see the [FTP FAQ page](https://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/). For example, to download the feature table file for the bacterium *Bacillus thuringiensis*, run `wget` followed by the FTP link in the terminal: `wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Bacillus_thuringiensis/all_assembly_versions/GCA_000008505.1_ASM850v1_feature_table.txt.gz`.
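For readers working in a notebook rather than a terminal, the same download can be sketched in Python with the standard library. This is a minimal sketch, not part of the original workflow: the helper names `local_name_from_url` and `download_ftp_file` are hypothetical, and the download itself assumes the FTP server is reachable from your environment.

```python
import posixpath
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def local_name_from_url(url: str) -> str:
    """Return the last path component of a URL, used as the local file name."""
    return posixpath.basename(urlparse(url).path)


def download_ftp_file(url: str, dest_dir: str = ".") -> Path:
    """Download one file over anonymous FTP and return the local path (hypothetical helper)."""
    dest = Path(dest_dir) / local_name_from_url(url)
    # urllib understands ftp:// URLs and logs in anonymously when no credentials are given
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling `download_ftp_file` with the FTP link from the `wget` example should save the file to the current directory under the name returned by `local_name_from_url`.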


After division of the files into the 3 main repositories (all, genbank, refseq), the files are further organized into a series of sub-directories. Sub-directories for genbank include:

    a. archaea
    b. bacteria
    c. fungi
    d. invertebrate
    e. metagenomes
    f. other -  this directory includes synthetic genomes
    g. plant
    h. protozoa
    i. vertebrate_mammalian
    j. vertebrate_other
    k. viral
    
 
The sub-directories for refseq are as follows:

    a. archaea
    b. bacteria
    c. fungi
    d. invertebrate
    e. plant
    f. protozoa
    g. vertebrate_mammalian
    h. vertebrate_other 
    i. viral
    j. mitochondrion 
    k. plasmid     
    l. plastid 
    
Data are further organized within each of the above directories by species binomial name. For example, *E. coli* files under genbank are in the directory 'ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Escherichia_coli/', while files for humans under refseq are in the directory 'ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/'.
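The directory layout described above can be sketched as a small helper that validates the inputs and builds the path. This is an illustrative sketch only: the names `SUBDIRS`, `organism_path`, and `organism_ftp_url` are hypothetical, the sub-directory sets are copied from the lettered lists above, and the `all` repository is left out because its sub-directories are not listed here.

```python
FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/genomes"

# Sub-directories copied from the lettered lists above (assumed to be complete)
SUBDIRS = {
    "genbank": {
        "archaea", "bacteria", "fungi", "invertebrate", "metagenomes", "other",
        "plant", "protozoa", "vertebrate_mammalian", "vertebrate_other", "viral",
    },
    "refseq": {
        "archaea", "bacteria", "fungi", "invertebrate", "plant", "protozoa",
        "vertebrate_mammalian", "vertebrate_other", "viral",
        "mitochondrion", "plasmid", "plastid",
    },
}


def organism_path(repository: str, subdir: str, species: str) -> str:
    """Build the 'repository/subdir/Genus_species/' path for an organism."""
    if repository not in SUBDIRS:
        raise ValueError(f"unknown repository: {repository!r}")
    if subdir not in SUBDIRS[repository]:
        raise ValueError(f"{subdir!r} is not a sub-directory of {repository!r}")
    # Directory names join the words of the binomial name with underscores
    folder = "_".join(species.split())
    return f"{repository}/{subdir}/{folder}/"


def organism_ftp_url(repository: str, subdir: str, species: str) -> str:
    """Build the FTP directory URL for an organism."""
    return f"{FTP_ROOT}/{organism_path(repository, subdir, species)}"
```

For example, `organism_ftp_url("genbank", "bacteria", "Escherichia coli")` reproduces the *E. coli* directory given above, and a mismatched pair such as `("refseq", "metagenomes")` raises a `ValueError` instead of producing a path that does not exist.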



## Getting All Files for an Organism

Because there are millions of files in this repository, searching it manually for files related to a specific organism would be difficult. The first function we will write lists the files stored under a specific organism's directory. For this, we will be using the Oracle Cloud Infrastructure (OCI) Python SDK (the `oci` package). Please note that the code below requires that you have an OCI config file set up and configured. You will also need the repository of choice (genbank, refseq, or all), sub-directory (from the lettered lists above), and species name (e.g. *Bacillus thuringiensis*).

In [34]:
# Import the OCI Python SDK
import oci

# Initialize the service client with the default config file
config = oci.config.from_file()
object_storage_client = oci.object_storage.ObjectStorageClient(config)

list_files = []

# Function to list files; takes the name of the repository, sub-directory, and species
def list_files_for_organism(repository: str, subdir: str, species: str):
    myprefix = repository + "/" + subdir + "/" + species + "/"

    # Print the prefix so the user can check that the input is correct
    print(myprefix)

    # List the objects under the prefix using the Object Storage client
    list_objects_response = object_storage_client.list_objects(
        namespace_name="idcxvbiyd8fn",  # namespace for ncbi_genomes
        bucket_name="ncbi_genomes",  # bucket name
        fields="name",
        prefix=myprefix
    )

    # Get the file names from the response and remove the redundant prefix from each
    list_files = [obj.name[len(myprefix):] for obj in list_objects_response.data.objects]

    print('\n')
    print("The following is a list of all available files for: " + myprefix)
    print('\n')
    print(*list_files, sep='\n')

Now that the function to list files is written, let's test it on a couple of common bacterial species first:

In [None]:
# Running a test for a couple of species of bacteria
list_files_for_organism("genbank", "bacteria", "Escherichia_coli")
list_files_for_organism("genbank", "bacteria", "Listeria_monocytogenes")
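Note that a single `list_objects` call returns only one page of results, so for organisms with very many files the listing may be truncated. The following is a minimal sketch of how the remaining pages could be followed using the `start` parameter and the `next_start_with` field of the response. The function name `list_all_object_names` is hypothetical, and the client is passed in as an argument so that the sketch does not depend on a configured OCI environment.

```python
from typing import Any, List


def list_all_object_names(client: Any, namespace: str, bucket: str, prefix: str) -> List[str]:
    """Collect every object name under a prefix, following pagination (hypothetical helper)."""
    names: List[str] = []
    start = None  # no start marker means "begin with the first page"
    while True:
        kwargs = {"prefix": prefix, "fields": "name"}
        if start is not None:
            # Resume listing from the marker returned by the previous page
            kwargs["start"] = start
        response = client.list_objects(
            namespace_name=namespace,
            bucket_name=bucket,
            **kwargs,
        )
        names.extend(obj.name for obj in response.data.objects)
        # next_start_with is None once the last page has been returned
        start = response.data.next_start_with
        if start is None:
            break
    return names
```

With the client created earlier, this could be called as `list_all_object_names(object_storage_client, "idcxvbiyd8fn", "ncbi_genomes", "genbank/bacteria/Escherichia_coli/")`. Unlike `list_files_for_organism`, it returns the full object names and does not print anything.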

Next, testing some other species types: