# Fetch Descriptors for Chemical Component IDs

## Introduction

Many structures in the [RCSB Protein Data Bank](https://www.rcsb.org) contain chemical components that are not amino acids. These include cofactors such as NAD<sup>+</sup> or thiamine pyrophosphate, substrate analogs, metal ions, and drug candidates. Structure files in the RCSB PDB record these chemical components in clearly structured formats, so they can be readily identified and retrieved. Each chemical component has a unique alphanumeric identifier (e.g., NDP for NADPH, or dihydro-nicotinamide-adenine-dinucleotide phosphate) and is described in a number of computer-compatible formats. This notebook is designed to help you identify chemical components by their IDs and then fetch the SMILES, InChI, and related descriptor strings associated with them.

Inputs: id, name, formula, pdbx_formal_charge, formula_weight, or type\
Outputs: InChI, InChIKey, SMILES, SMILES_stereo

The code in this notebook is designed to perform the following tasks:

1. Run a Search API query to retrieve all desired chemical component IDs
2. Use a Data API query to retrieve data on all chemical components

### Questions

* What types of chemical components are found in the RCSB PDB?
* What is a SMILES string? an InChIKey?
* How can I obtain computationally compatible versions of chemical components that are found in the PDB archive?
* How can I expand or shrink the output from a search?

### Learning Objectives

* To search for and retrieve chemical component data from the RCSB PDB APIs
* To store the obtained data in file formats or data structures that will be useful in future computation

### Purpose

This notebook is designed to help you fetch all chemical component IDs and then fetch the SMILES, InChI, etc. strings associated with them. 

## Libraries

These libraries will be called in the coding cells in this notebook. 

| Library | Abbreviation |Contents | Source |
| :-----: | ------------ | :------- | :----- |
| json | json | library for working with JavaScript Object Notation for data interchange| [json — JSON encoder and decoder](https://docs.python.org/3/library/json.html) |
| rcsbsearchapi | N/A | library for automated searching of the [RCSB Protein Data Bank](https://www.rcsb.org)| [py-rcsbsearchapi on GitHub](https://github.com/rcsb/py-rcsbsearchapi) |
| python_graphql_client | GraphQL | library for making requests to a GraphQL server | [PyPI page on python_graphql_client](https://pypi.org/project/python-graphql-client/) |


## Installation

To use this notebook, you will need the following libraries available in your computing environment: json, rcsbsearchapi, and python_graphql_client. The json library is part of the Python standard library, so it does not need to be installed. To install the other two from the command line on your computer, use these commands:

`pip install rcsbsearchapi`\
`pip install python_graphql_client`

To install from within a Jupyter or Colab notebook, type the same commands in a coding cell, each preceded by an exclamation point.

`!pip install rcsbsearchapi`\
`!pip install python_graphql_client`
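
Before installing anything, it can help to check which libraries are already present. The following is a minimal sketch, assuming the import names match those used in the import statements of this notebook; `find_missing` is a hypothetical helper written for this illustration.

```python
import importlib.util

# Import names taken from the import statements used in this notebook.
REQUIRED_MODULES = ["json", "rcsbsearchapi", "python_graphql_client"]


def find_missing(modules):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Only the modules reported as missing need to be installed with pip.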



In [None]:
# Use this coding cell to install necessary libraries if they are not already in your system or environment
# (json is part of the Python standard library and does not need to be installed)
!pip install rcsbsearchapi
!pip install python_graphql_client

## Notebook Contents

The coding cell below contains all of the raw code for this example. **Experienced coders** should use this as they see fit.

For **novice coders**, the code is broken up into smaller chunks in the subsequent coding cells, with stepwise inputs and outputs to better explain how this code can be used.

In [None]:
# For Experienced Coders

# Minimal Python script to fetch all chemical component IDs and then 
# fetch the SMILES, InChI, etc. strings associated with them
# Requires the following modules to be installed:

import json
from rcsbsearchapi.search import AttributeQuery
from python_graphql_client import GraphqlClient as GraphQL

# Step 1: Run search to retrieve all chemical component IDs
q1 = AttributeQuery("rcsb_chem_comp_container_identifiers.comp_id", "exists", service="text_chem")
results = [mol for mol in q1("mol_definition")]

subListSize = 300
resultsSubListL = [results[i:i+subListSize] for i in range(0, len(results), subListSize)]

# Step 2: Run data API query to retrieve data on all chemical components
url_data_api = 'https://data.rcsb.org/graphql'
client = GraphQL(endpoint=url_data_api)  # instantiate client with the RCSB Data API endpoint (GraphqlClient is imported as GraphQL above)
query_method = """
query structure ($comp_ids: [String!]!) {
  chem_comps(comp_ids:$comp_ids){
        chem_comp {
            id
            name
            formula
            pdbx_formal_charge
            formula_weight
            type
        }
        rcsb_chem_comp_descriptor {
            InChI
            InChIKey
            SMILES
            SMILES_stereo
        }
    }
}
"""

# Iterate over each sublist and perform the data API query
for subList in resultsSubListL:
    query_variables = {"comp_ids": subList}
    dataResult = client.execute(query=query_method, variables=query_variables)
    data = dataResult['data']  # This will contain your data API query results--process/rewrangle this as needed or desired
    # Print out the first result in each batch
    print(json.dumps(data["chem_comps"][0], indent=2))  # Probably don't want to print out everything, but this is one way to do it while testing

## Importing Libraries

The following simply imports the required libraries that contain the methods that are called in this notebook.

In [6]:
# For novice or intermediate coders - importing resources

# Requires the following modules to be installed on your system and then imported using these commands:

import json
from rcsbsearchapi.search import AttributeQuery
from python_graphql_client import GraphqlClient as GraphQL


## Step 1 

To start this process, we need to use the AttributeQuery class from the rcsbsearchapi.search module. In the next coding cell, we search the RCSB PDB for the IDs of all chemical components. The results are converted to a list, which is then split into sublists of 300 IDs so that the Data API can be queried in batches. The final command simply tells us the number of items in the full list.

In [7]:
# Step 1: Run search to retrieve all chemical component IDs

# rcsb_chem_comp_container_identifiers.comp_id is a searchable attribute that holds the chemical component ID.
# The "exists" operator matches every chemical component for which this attribute has a value,
# and service="text_chem" sends the query to the chemical text search service.

q1 = AttributeQuery("rcsb_chem_comp_container_identifiers.comp_id", "exists", service="text_chem")
# Calling q1("mol_definition") runs the search and yields one ID per hit;
# the list comprehension collects those IDs into a list.
results = [mol for mol in q1("mol_definition")]

# Split the full list into sublists of 300 IDs so the Data API can be queried in batches
subListSize = 300
resultsSubListL = [results[i:i+subListSize] for i in range(0, len(results), subListSize)]

# Before printing anything else, find out how many items were returned by the search
print(f"There are {len(results)} chemical components in the RCSB PDB.")


There are 44029 chemical components in the RCSB PDB.
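
The batching performed by the list comprehension in the cell above can be seen more easily on a short list. This is a small sketch of the same slicing pattern; `chunk` is a hypothetical helper name and the sample IDs are placeholders used only for illustration.

```python
def chunk(items, size):
    """Split a list into consecutive sublists of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


# Placeholder IDs, used only to show how the slicing behaves.
sample_ids = ["001", "002", "003", "004", "005", "006", "007"]
batches = chunk(sample_ids, 3)
print(batches)
# [['001', '002', '003'], ['004', '005', '006'], ['007']]
```

With the 44029 IDs reported above and a sublist size of 300, the same pattern produces 147 sublists, the last of which holds 229 IDs.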


## Step 2

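In this step, we first declare a variable that points to the Data API on the RCSB PDB website and create a client for it. The query_method string then defines the GraphQL query, which lists the fields to retrieve for each chemical component. When the query is run in Step 3, it returns the data about the chemical components as a dictionary.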
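
A GraphQL request is conventionally sent as a JSON body that carries the query text and its variables under the keys `query` and `variables`. The sketch below builds such a body with the standard library only, to show how a list of IDs is bound to the `$comp_ids` variable. It assumes the client library does something broadly similar internally; `build_payload` is a hypothetical helper and the query is shortened for illustration.

```python
import json

# Shortened version of the query, for illustration only.
query = """
query structure ($comp_ids: [String!]!) {
  chem_comps(comp_ids: $comp_ids) {
    chem_comp { id name }
  }
}
"""


def build_payload(query_text, comp_ids):
    """Package a GraphQL query and its variables as a JSON request body."""
    return json.dumps({"query": query_text, "variables": {"comp_ids": comp_ids}})


body = build_payload(query, ["NDP", "ATP"])
print(json.loads(body)["variables"])
# {'comp_ids': ['NDP', 'ATP']}
```

The `[String!]!` type in the query declares that `$comp_ids` must be a non-null list of non-null strings, which is why a plain Python list of ID strings is passed.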

In [8]:
# Step 2: Run data API query to retrieve data on all chemical components

# The client object sends GraphQL queries to the RCSB Data API endpoint.

url_data_api = 'https://data.rcsb.org/graphql'
client = GraphQL(endpoint=url_data_api)  # instantiate client with the RCSB Data API endpoint

# query_method is a GraphQL query: for every ID passed in $comp_ids, it requests the
# chem_comp fields and the rcsb_chem_comp_descriptor strings listed below.

query_method = """
query structure ($comp_ids: [String!]!) {
  chem_comps(comp_ids:$comp_ids){
        chem_comp {
            id
            name
            formula
            pdbx_formal_charge
            formula_weight
            type
        }
        rcsb_chem_comp_descriptor {
            InChI
            InChIKey
            SMILES
            SMILES_stereo
        }
    }
}
"""

## Step 3

The final step runs the query once for each sublist of IDs and prints the first record of each batch. This produces the results we want to see and store for future use.
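
Each returned record is nested in two sections, `chem_comp` and `rcsb_chem_comp_descriptor`. For storage, it is often easier to work with one flat dictionary per component. The following is a minimal sketch of that reshaping, assuming the response has the shape shown in the output of this notebook; `flatten_record` is a hypothetical helper, and the sample entry is abbreviated from that output.

```python
def flatten_record(record):
    """Merge the two nested sections of one chem_comps entry into a flat dict."""
    flat = {}
    # "or {}" guards against a section that is missing or null in the response.
    flat.update(record.get("chem_comp") or {})
    flat.update(record.get("rcsb_chem_comp_descriptor") or {})
    return flat


# Abbreviated copy of one entry from the notebook output (name and InChI omitted).
sample_data = {
    "chem_comps": [
        {
            "chem_comp": {
                "id": "001",
                "formula": "C35 H42 F2 N2 O6",
                "pdbx_formal_charge": 0,
                "formula_weight": 624.715,
                "type": "non-polymer",
            },
            "rcsb_chem_comp_descriptor": {
                "InChIKey": "NBYCDVVSYOMFMS-VMPREFPWSA-N",
            },
        }
    ]
}

records = [flatten_record(entry) for entry in sample_data["chem_comps"]]
print(records[0]["id"], records[0]["InChIKey"])
```

Inside the loop over sublists, the same list comprehension could be applied to each batch and the results appended to one growing list instead of being printed.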

In [9]:
# Iterate over each sublist and perform the data API query
for subList in resultsSubListL:
    query_variables = {"comp_ids": subList}
    dataResult = client.execute(query=query_method, variables=query_variables)
    data = dataResult['data']  # This will contain your data API query results--process/rewrangle this as needed or desired
    # Print out the first result in each batch
    print(json.dumps(data["chem_comps"][0], indent=2))  # Probably don't want to print out everything, but this is one way to do it while testing

{
  "chem_comp": {
    "id": "001",
    "name": "1-[2,2-DIFLUORO-2-(3,4,5-TRIMETHOXY-PHENYL)-ACETYL]-PIPERIDINE-2-CARBOXYLIC ACID\n4-PHENYL-1-(3-PYRIDIN-3-YL-PROPYL)-BUTYL ESTER",
    "formula": "C35 H42 F2 N2 O6",
    "pdbx_formal_charge": 0,
    "formula_weight": 624.715,
    "type": "non-polymer"
  },
  "rcsb_chem_comp_descriptor": {
    "InChI": "InChI=1S/C35H42F2N2O6/c1-42-30-22-27(23-31(43-2)32(30)44-3)35(36,37)34(41)39-21-8-7-19-29(39)33(40)45-28(17-9-14-25-12-5-4-6-13-25)18-10-15-26-16-11-20-38-24-26/h4-6,11-13,16,20,22-24,28-29H,7-10,14-15,17-19,21H2,1-3H3/t28-,29-/m0/s1",
    "InChIKey": "NBYCDVVSYOMFMS-VMPREFPWSA-N",
    "SMILES": "COc1cc(cc(c1OC)OC)C(C(=O)N2CCCCC2C(=O)OC(CCCc3ccccc3)CCCc4cccnc4)(F)F",
    "SMILES_stereo": "COc1cc(cc(c1OC)OC)C(C(=O)N2CCCC[C@H]2C(=O)O[C@@H](CCCc3ccccc3)CCCc4cccnc4)(F)F"
  }
}
{
  "chem_comp": {
    "id": "08S",
    "name": "3-CHLORO-6-FLUORO-N-[2-[4-[(5-PROPAN-2-YL-1,3,4-THIADIAZOL-2-YL)SULFAMOYL]PHENYL]ETHYL]-1-BENZOTHIOPHENE-2-CARBOXAMIDE",

## Results

* Explain the results of this search
* Describe how these data can be used in their current format
* Describe how these data can be converted to another format (e.g., pandas dataframe) for storage (e.g., export to csv) or for use in another setting
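
As one possible route to CSV storage, flat records can be written with the `csv` module from the standard library. This is a sketch under the assumption that each component has already been reduced to a flat dictionary; `records_to_csv` is a hypothetical helper, and the single sample row is abbreviated from the notebook output.

```python
import csv
import io


def records_to_csv(records, fieldnames):
    """Serialize a list of flat dicts to CSV text with the given column order."""
    buffer = io.StringIO()
    # extrasaction="ignore" skips any keys that are not listed in fieldnames.
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Abbreviated sample row taken from the notebook output.
records = [
    {
        "id": "001",
        "formula": "C35 H42 F2 N2 O6",
        "formula_weight": 624.715,
        "InChIKey": "NBYCDVVSYOMFMS-VMPREFPWSA-N",
    }
]

csv_text = records_to_csv(records, ["id", "formula", "formula_weight", "InChIKey"])
print(csv_text)
```

To write a file instead of a string, the same writer can be pointed at a file opened with `open(path, "w", newline="")`. If pandas is installed, `pandas.DataFrame(records).to_csv(...)` is an alternative that also gives a dataframe for further analysis.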