<a href="https://colab.research.google.com/github/khanfs/ComputationalBiology-xGenomics/blob/main/NCBI_Datasets_APIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **NCBI Datasets APIs**

NCBI Datasets collates data from across NCBI databases. Use it to download gene, transcript, protein, and genome sequences, along with annotation and metadata. The Datasets API is still in alpha and is updated often to add new functionality. For some larger downloads, you may need to download a [dehydrated bag](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/how-tos/genomes/large-download/) and retrieve the individual data files at a later time.

This Python package is automatically generated by the [OpenAPI Generator](https://openapi-generator.tech/) project:

* API version: v1
* Package version: 13.27.0
* Build package: org.openapitools.codegen.languages.PythonClientCodegen

**Resources**:

* [NCBI Datasets](https://www.ncbi.nlm.nih.gov/datasets/)
* [Datasets Documentation](https://www.ncbi.nlm.nih.gov/datasets/docs/v1/)
* [ncbi-datasets-pylib](https://github.com/ncbi/datasets/tree/master/client_docs/python#installation--usage)
* [Documentation for API Endpoints](https://github.com/ncbi/datasets/tree/master/client_docs/python#documentation-for-api-endpoints)
* [Documentation For Models](https://github.com/ncbi/datasets/tree/master/client_docs/python#documentation-for-models)

In [1]:
! pip install ncbi-datasets-pylib

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
import time
import ncbi.datasets.openapi
from pprint import pprint
from ncbi.datasets.openapi.api import gene_api
from ncbi.datasets.openapi.model.rpc_status import RpcStatus
from ncbi.datasets.openapi.model.v1_download_summary import V1DownloadSummary
from ncbi.datasets.openapi.model.v1_fasta import V1Fasta
from ncbi.datasets.openapi.model.v1_gene_dataset_request import V1GeneDatasetRequest
from ncbi.datasets.openapi.model.v1_gene_dataset_request_content_type import V1GeneDatasetRequestContentType
from ncbi.datasets.openapi.model.v1_gene_dataset_request_sort_field import V1GeneDatasetRequestSortField
from ncbi.datasets.openapi.model.v1_gene_match import V1GeneMatch
from ncbi.datasets.openapi.model.v1_gene_metadata import V1GeneMetadata
from ncbi.datasets.openapi.model.v1_organism import V1Organism
from ncbi.datasets.openapi.model.v1_organism_query_request_tax_rank_filter import V1OrganismQueryRequestTaxRankFilter
from ncbi.datasets.openapi.model.v1_ortholog_request_content_type import V1OrthologRequestContentType
from ncbi.datasets.openapi.model.v1_ortholog_set import V1OrthologSet
from ncbi.datasets.openapi.model.v1_sci_name_and_ids import V1SciNameAndIds
from ncbi.datasets.openapi.model.v1_sort_direction import V1SortDirection

In [3]:
import ncbi.datasets

### **JavaScript Object Notation (JSON)** 

JSON is a widely used data exchange format that has replaced XML in many applications. JSON Lines (jsonl), newline-delimited JSON (ndjson), and line-delimited JSON (ldjson) are three names for essentially the same format, which is intended primarily for JSON streaming. A JSON Lines file consists of lines that are each a valid JSON value (usually an object), separated by the newline character `\n`. Because each line can be parsed on its own, a JSON Lines file can be processed as a stream, which makes parsing large documents more [efficient](https://medium.com/hackernoon/json-lines-format-76353b4e588d).

* **Serialization** is the process of encoding data into JSON format (like converting a Python list to JSON).

* **Deserialization** is the process of decoding JSON data back into native objects you can work with (like reading JSON data into a Python list).
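
As a minimal sketch of the two terms above, the standard-library `json` module can serialize a Python list to JSON text and deserialize it back. The same calls, applied one line at a time, handle JSON Lines. The gene records here are invented placeholders, not NCBI data.

```python
import json

# Invented placeholder records, used only for illustration.
genes = [
    {"gene_id": 1, "symbol": "GENE_A"},
    {"gene_id": 2, "symbol": "GENE_B"},
]

# Serialization: Python list -> JSON text.
as_json = json.dumps(genes)

# Deserialization: JSON text -> Python list.
restored = json.loads(as_json)

# JSON Lines: one JSON object per line, so each line parses independently.
as_jsonl = "\n".join(json.dumps(record) for record in genes)
streamed = [json.loads(line) for line in as_jsonl.splitlines() if line.strip()]

print(as_jsonl)
```

Parsing line by line means a large file never has to be held in memory as a single JSON document.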

In [4]:
import json
import jsonlines
import os # module provides functions for interacting with the operating system
import csv
import zipfile
import pandas as pd

[**Pyfaidx**](https://pypi.org/project/pyfaidx/): provides an interface for creating and using a samtools-style FASTA index (`.fai`) for fast random access to DNA subsequences in huge FASTA files in a "pythonic" manner. According to the project description, indexing speed is comparable to samtools, and in some cases sequence retrieval is much faster.

* Shirley MD, Ma Z, Pedersen BS, Wheelan SJ. 2015. Efficient "pythonic" access to FASTA files using pyfaidx. PeerJ PrePrints 3:e970v1 https://doi.org/10.7287/peerj.preprints.970v1
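
To illustrate why an index allows fast random access, here is a standard-library sketch of the idea behind a faidx-style index: record where each sequence starts and how its lines are laid out, then seek directly to the bytes you need. This is not pyfaidx's implementation. The function names `build_index` and `fetch` are hypothetical, and the sketch assumes a well-formed FASTA file in which every sequence line of a record, except possibly the last, has the same length.

```python
import io

def build_index(handle):
    """Map record name -> (seq_length, seq_offset, bases_per_line, bytes_per_line)."""
    index = {}
    name = None
    offset = 0  # byte offset of the current line within the file
    for raw in handle:
        line_len = len(raw)
        if raw.startswith(b">"):
            name = raw[1:].split()[0].decode()
            # The sequence starts immediately after the header line.
            index[name] = [0, offset + line_len, 0, 0]
        elif name is not None:
            bases = len(raw.rstrip(b"\r\n"))
            entry = index[name]
            if entry[2] == 0:
                # The first sequence line defines the line layout.
                entry[2], entry[3] = bases, line_len
            entry[0] += bases
        offset += line_len
    return {key: tuple(value) for key, value in index.items()}

def fetch(handle, index, name, start, end):
    """Return bases [start, end) (0-based) of a record by seeking, not scanning."""
    length, seq_offset, bases_per_line, bytes_per_line = index[name]
    end = min(end, length)
    if start >= end:
        return ""
    first = seq_offset + (start // bases_per_line) * bytes_per_line + start % bases_per_line
    last = seq_offset + (end // bases_per_line) * bytes_per_line + end % bases_per_line
    handle.seek(first)
    chunk = handle.read(last - first)
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()

# A tiny in-memory FASTA file standing in for a large one on disk.
fasta = io.BytesIO(b">chr1 demo\nACGTACGT\nTTGGCCAA\nGG\n>chr2\nAAAA\n")
index = build_index(fasta)
print(fetch(fasta, index, "chr1", 6, 10))
```

With pyfaidx itself, the index is built and used behind the `Fasta` object imported in the next cell, so none of this bookkeeping is written by hand.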

In [5]:
from pyfaidx import Fasta

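[**google.protobuf.json_format**](https://googleapis.dev/python/protobuf/latest/google/protobuf/json_format.html): contains routines for printing protocol messages in JSON format and for parsing JSON back into messages. `ParseDict`, imported below, parses a Python dictionary into a protocol message.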

In [6]:
from google.protobuf.json_format import ParseDict

[**Collections**](https://docs.python.org/3/library/collections.html): this module implements specialized container datatypes (such as `Counter`) that serve as alternatives to Python's general-purpose built-in containers `dict`, `list`, `set`, and `tuple`.

In [7]:
from collections import Counter
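
As a short illustration of one such container, `Counter` tallies hashable items, which makes it convenient for counting bases in a sequence. The sequence below is an invented example.

```python
from collections import Counter

# Invented example sequence.
sequence = "ATGCGCGATTAA"

# Counter maps each base to the number of times it occurs.
base_counts = Counter(sequence)

# Missing keys count as zero, so no KeyError handling is needed.
gc_fraction = (base_counts["G"] + base_counts["C"]) / len(sequence)

print(base_counts.most_common(2))
print(f"GC fraction: {gc_fraction:.2f}")
```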

In [8]:
from datetime import datetime, timezone, timedelta

In [19]:
# Defining the host is optional and defaults to https://api.ncbi.nlm.nih.gov/datasets/v1
# See configuration.py for a list of all supported configuration parameters.
configuration = ncbi.datasets.openapi.Configuration(
    host = "https://api.ncbi.nlm.nih.gov/datasets/v1"
)

In [20]:
# The client must configure the authentication and authorization parameters
# in accordance with the API server security policy.
# Examples for each auth method are provided below, use the example that
# satisfies your auth use case.

# Configure API key authorization: ApiKeyAuthHeader
configuration.api_key['ApiKeyAuthHeader'] = 'YOUR_API_KEY'
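
Rather than pasting a key into the notebook, one option is to read it from an environment variable. The sketch below assumes the key is stored in a variable named `NCBI_API_KEY`; both that name and the helper `read_api_key` are assumptions for illustration, not part of the generated client.

```python
import os

def read_api_key(env=os.environ, var_name="NCBI_API_KEY"):
    """Return the API key from the environment, or None if it is unset or blank."""
    key = env.get(var_name, "").strip()
    return key or None

api_key = read_api_key()
if api_key is None:
    print("No API key found in the environment.")
else:
    print("API key loaded from the environment.")
    # With the client configured as above, the key would be applied like this:
    # configuration.api_key['ApiKeyAuthHeader'] = api_key
```

This keeps the key out of the saved notebook and its version history.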

In [23]:
# Enter a context with an instance of the API client
with ncbi.datasets.openapi.ApiClient(configuration) as api_client:
    # Create an instance of the API class
    api_instance = gene_api.GeneApi(api_client)
    gene_ids = [
        59067,
    ] # [int] | NCBI gene ids
    # The optional parameters below are defined for reference only;
    # they are not passed in the call that follows.
    include_annotation_type = [
        V1Fasta("FASTA_UNSPECIFIED"),
    ] # [V1Fasta] | Additional annotation types to include in the data package. If unset, no annotation is provided. (optional)
    fasta_filter = [
        "fasta_filter_example",
    ] # [str] | Limit the FASTA sequences in the package to these transcript and protein accessions (optional)
    filename = "ncbi_dataset.zip" # str | Output file name (optional). If omitted, the server uses the default "ncbi_dataset.zip"

    # Example passing only the required values, which have no defaults
    try:
        # Get a gene dataset by gene ID
        api_response = api_instance.download_gene_package(gene_ids)
        pprint(api_response)
    except ncbi.datasets.openapi.ApiException as e:
        print("Exception when calling GeneApi->download_gene_package: %s\n" % e)

<_io.BufferedReader name='/tmp/ncbi_dataset.zip'>
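
The response above is a file handle on a zip archive. The sketch below shows one way to list the archive's members and read a JSON Lines report from it with the standard library. To stay runnable without a network call, it builds a small stand-in archive in memory. The member path `ncbi_dataset/data/data_report.jsonl` and the field names are assumptions about the package layout, and `read_jsonl_member` is a hypothetical helper.

```python
import io
import json
import zipfile

# Assumed location of the report inside the package; check namelist() on a real download.
report_path = "ncbi_dataset/data/data_report.jsonl"

# Build a stand-in archive in memory (invented content, not NCBI data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(report_path, '{"geneId": "0", "symbol": "DEMO"}\n')
    archive.writestr("README.md", "stand-in package\n")

def read_jsonl_member(zip_source, member):
    """Parse one JSON object per line from a member of a zip archive."""
    with zipfile.ZipFile(zip_source) as archive:
        with archive.open(member) as handle:
            return [json.loads(line) for line in handle if line.strip()]

# List the archive contents, then read the report.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    members = archive.namelist()

buffer.seek(0)
records = read_jsonl_member(buffer, report_path)

print(members)
print(records)
```

For a real download, the path shown in the output above (`/tmp/ncbi_dataset.zip`) could be passed in place of `buffer`.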
