# Assemblies

## Purpose

The RCSB PDB provides access to the Protein Data Bank, the global archive of publicly available, experimentally determined structures of proteins and other biological macromolecules. This notebook demonstrates how to use the `rcsbsearchapi` Python library to recreate Advanced Searches from the RCSB PDB website, and how to download the files for the search results, also in Python. Each example searches for a different kind of assembly in the RCSB PDB and demonstrates a different way of using the Advanced Search tool.

## Steps Taken

The following is a step-by-step explanation of what will be performed for each code example.

### 1) Creating the Search

An explanation of what the search describes is followed by the creation of an RCSB PDB search.

### 2a) Validation By List

The search is tested for functionality by analyzing the first 10 results of the search in a list.

### 2b) Validation By File Request

The search is tested for functionality by requesting the file of the first search result. If the identifiers returned by the search do not match the file names, they are adjusted in this step so that the corresponding files can be requested and downloaded.

### 2c) Validation By File Download

The search is tested for functionality by downloading and reading the contents of the first search result's file. This includes the generation of a folder for the files of the search result.

### 3) Complete Search Download

Following validation, each file in the search result is downloaded into the previously generated folder.

## Importing Libraries

The following libraries need to be installed and imported to complete the tasks in this notebook.

| Library | Contents | Source |
| :-----: | :------- | :----- |
| rcsbsearchapi | library for automated searching of the [RCSB Protein Data Bank](https://www.rcsb.org)| [py-rcsbsearchapi on GitHub](https://github.com/rcsb/py-rcsbsearchapi) |
| requests | library for sending HTTP requests | [requests Documentation](https://requests.readthedocs.io/en/latest/) |
| os | standard library for creating directories | [os Documentation](https://docs.python.org/3/library/os.html) |

## Installation

These libraries need to be installed in your computing environment before running this notebook. The exception is `os`, which is part of the Python standard library and needs no installation.

To install from the command line on your computer, use this command (with the `requests` library as the example):

`pip install requests`

To install from within a Jupyter notebook or Google Colab notebook, type the same command in a code cell, preceded by an exclamation point.

`!pip install requests`

These libraries will be imported as they are needed over the course of this notebook.


In [None]:
# Import the components of rcsbsearchapi needed for this search
from rcsbsearchapi import rcsb_attributes as attrs
# For Operator notation

from rcsbsearchapi.const import CHEMICAL_ATTRIBUTE_SEARCH_SERVICE, STRUCTURE_ATTRIBUTE_SEARCH_SERVICE
from rcsbsearchapi.search import AttributeQuery, Attr, TextQuery, StructSimilarityQuery
# For Fluent notation

import requests  # to enable us to pull files from the PDB
import os        # to enable us to create a directory to store the files

## Total Polymers in Biological Assembly

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22equals%22%2C%22negation%22%3Afalse%2C%22value%22%3A12%2C%22attribute%22%3A%22rcsb_assembly_info.polymer_entity_instance_count%22%7D%7D%5D%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22assembly%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%7D%2C%22request_info%22%3A%7B%22query_id%22%3A%22c8e2869a9b79b9d79d3955b5c9009713%22%7D%7D) which is a search of the total polymers in biological assembly. This search can be divided into 1 attribute:

Number of polymers: 12

In [None]:
q1 = AttributeQuery(attribute="rcsb_assembly_info.polymer_entity_instance_count", operator="equals", value=12)

result_polymers = list(q1("assembly"))
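
The Advanced Search link above carries the whole query as URL-encoded JSON in its `request` parameter, which is where the attribute name, operator, and value passed to `AttributeQuery` can be read off. The following is a minimal sketch of decoding such a link with the standard library. The helper names `decode_search_url` and `find_terminal_parameters` and the shortened example URL are illustrative assumptions, not part of `rcsbsearchapi`.

```python
import json
from urllib.parse import urlparse, parse_qs, quote


def decode_search_url(url):
    """Return the JSON query embedded in an Advanced Search URL as a dict."""
    query_string = urlparse(url).query
    # parse_qs URL-decodes the values and returns a list per parameter
    request_json = parse_qs(query_string)["request"][0]
    return json.loads(request_json)


def find_terminal_parameters(node):
    """Recursively collect the 'parameters' of every terminal node."""
    if node.get("type") == "terminal":
        return [node["parameters"]]
    found = []
    for child in node.get("nodes", []):
        found.extend(find_terminal_parameters(child))
    return found


# Shortened, hand-built example with the same shape as the link above
example_request = {
    "query": {
        "type": "group",
        "logical_operator": "and",
        "nodes": [{
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": "rcsb_assembly_info.polymer_entity_instance_count",
                "operator": "equals",
                "negation": False,
                "value": 12,
            },
        }],
    },
    "return_type": "assembly",
}
example_url = "https://www.rcsb.org/search?request=" + quote(json.dumps(example_request))

decoded = decode_search_url(example_url)
for parameters in find_terminal_parameters(decoded["query"]):
    print(parameters["attribute"], parameters["operator"], parameters["value"])
```

Passing the full link from the text above to `decode_search_url` should expose the same three pieces of information that the `AttributeQuery` call uses.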

### 2a)

We can confirm that the list was created successfully by printing its first 10 elements. These should match the first 10 results shown on the RCSB PDB search page.

In [None]:
print(f"The following are among the {len(result_polymers)} assemblies with 12 polymer chains:", result_polymers[0:10])

### 2b)

Now we can begin downloading the files for the results in the list. First, request the file for the first element in the list and check that the request succeeded. Then, open the file to see whether its contents are in line with what is expected from the download.

Additionally, because the search returns assembly identifiers (an entry ID followed by an assembly number) rather than the entry IDs used in the file names, we truncate each result to its first four characters so that it matches the file name.

In [None]:
# Keep only the 4-character entry ID from each assembly identifier
result_polymers = [assembly_id[0:4] for assembly_id in result_polymers]

# Request the file of the first search result
test_validation = requests.get(f'https://files.rcsb.org/download/{result_polymers[0]}.cif')
print(result_polymers[0])
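
The truncation above relies on assembly identifiers consisting of an entry ID, a hyphen, and an assembly number (for example `1ABC-1`, a made-up identifier). As a hedged alternative to slicing, the sketch below splits the identifier explicitly and builds the download URL for the entry file used in this notebook. The function names are hypothetical, and the assembly-specific URL pattern (`<entry>-assembly<N>.cif`) is an assumption that should be checked against the RCSB PDB file download documentation before relying on it.

```python
def split_assembly_id(assembly_id):
    """Split an identifier such as '1ABC-1' into ('1ABC', '1')."""
    entry_id, _, assembly_number = assembly_id.partition("-")
    return entry_id, assembly_number


def entry_file_url(assembly_id):
    """URL of the full entry file, as requested elsewhere in this notebook."""
    entry_id, _ = split_assembly_id(assembly_id)
    return f"https://files.rcsb.org/download/{entry_id}.cif"


def assembly_file_url(assembly_id):
    """ASSUMED URL pattern for the file of one specific biological assembly."""
    entry_id, assembly_number = split_assembly_id(assembly_id)
    return f"https://files.rcsb.org/download/{entry_id}-assembly{assembly_number}.cif"


example_id = "1ABC-1"  # made-up identifier for illustration
print(split_assembly_id(example_id))
print(entry_file_url(example_id))
print(assembly_file_url(example_id))
```

Note that the entry file describes the deposited structure as a whole, so two assemblies of the same entry map to the same entry file.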

In [None]:
test_validation.status_code
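
A status code of 200 means the request succeeded, while a code such as 404 means the file was not found. As a small sketch using only the standard library, the helper below (a hypothetical name, not part of `requests`) turns a numeric code into a readable message.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short, readable description of an HTTP status code."""
    status = HTTPStatus(status_code)
    outcome = "success" if 200 <= status_code < 300 else "problem"
    return f"{status_code} {status.phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (problem)
```

In this notebook it could be called as `describe_status(test_validation.status_code)`.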

### 2c)

To further check, we can write the downloaded text to a file and then read the contents of that file. This includes creating a directory (folder) to store the files, which will be called `assemblies/Biological_Assembly`.

In [None]:
os.makedirs("assemblies/Biological_Assembly", exist_ok=True) 

with open(f"assemblies/Biological_Assembly/{result_polymers[0]}.cif", 'w+') as file:
    file.write(test_validation.text)

In [None]:
# Read the file back; the with block closes it automatically
with open(f"assemblies/Biological_Assembly/{result_polymers[0]}.cif", 'r') as file1:
    file_text = file1.read()

print(file_text)
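
Printing a whole structure file produces a lot of output. As a lighter check, an mmCIF file normally starts with a `data_` line naming its data block, so looking at the first non-blank, non-comment line is often enough to tell a structure file from an error page. The sketch below assumes that convention. `looks_like_mmcif` is a hypothetical helper, and the sample strings are made up.

```python
def looks_like_mmcif(text):
    """Return True if the first meaningful line starts an mmCIF data block."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments
        return stripped.startswith("data_")
    return False  # empty text


sample_cif = "data_1ABC\n#\n_entry.id   1ABC\n"
sample_error = "<html><body>Not Found</body></html>"

print(looks_like_mmcif(sample_cif))    # True
print(looks_like_mmcif(sample_error))  # False
```

In this notebook the check could be applied to `file_text` after reading the file back.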

### 3)

Once we have confirmed that the file downloaded correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code does this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_polymers:
    cFile = f"{entry_id}.cif"
    cFileUrl = baseUrl + cFile
    cFileLocal = "assemblies/Biological_Assembly/" + cFile  # local path for the downloaded file
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)
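
For long result lists it can help to skip files that are already on disk and to record failed requests rather than writing error pages into `.cif` files. The following is one possible sketch, not the notebook's own method. `download_all` and its `fetch` parameter are hypothetical: `fetch` is any function that takes a URL and returns a `(status_code, text)` pair, so the logic can be tried without network access.

```python
import os
import tempfile


def download_all(entry_ids, folder, fetch, base_url="https://files.rcsb.org/download/"):
    """Download one .cif file per entry ID; return the IDs that failed."""
    os.makedirs(folder, exist_ok=True)
    failed = []
    for entry_id in dict.fromkeys(entry_ids):  # drop duplicates, keep order
        local_path = os.path.join(folder, f"{entry_id}.cif")
        if os.path.exists(local_path):
            continue  # already downloaded
        status_code, text = fetch(f"{base_url}{entry_id}.cif")
        if status_code != 200:
            failed.append(entry_id)
            continue
        with open(local_path, "w") as file:
            file.write(text)
    return failed


def fake_fetch(url):
    """Stand-in for a real request: '1ABC' succeeds, everything else fails."""
    if url.endswith("1ABC.cif"):
        return 200, "data_1ABC\n"
    return 404, "Not Found"


with tempfile.TemporaryDirectory() as tmp:
    failures = download_all(["1ABC", "1ABC", "9ZZZ"], tmp, fake_fetch)
    saved = sorted(os.listdir(tmp))

print(failures)  # ['9ZZZ']
print(saved)     # ['1ABC.cif']
```

With `requests`, a matching `fetch` would be a small wrapper that calls `requests.get(url)` and returns the response's `status_code` and `text`.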

## Single Protein Chain

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22equals%22%2C%22negation%22%3Afalse%2C%22value%22%3A1%2C%22attribute%22%3A%22rcsb_assembly_info.polymer_entity_instance_count%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22equals%22%2C%22negation%22%3Afalse%2C%22value%22%3A1%2C%22attribute%22%3A%22rcsb_assembly_info.polymer_entity_instance_count_protein%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22range%22%2C%22negation%22%3Afalse%2C%22value%22%3A%7B%22from%22%3A350%2C%22to%22%3A400%2C%22include_lower%22%3Atrue%2C%22include_upper%22%3Atrue%7D%2C%22attribute%22%3A%22rcsb_assembly_info.polymer_monomer_count%22%7D%7D%5D%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22assembly%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%7D%2C%22request_info%22%3A%7B%22query_id%22%3A%221c00017375f2a0562f492bb208ab74df%22%7D%7D) which is a search of biological assemblies that have a single protein chain with 350-400 residues. This search can be divided into 3 attributes:

- Single chain
- Chain is a protein chain
- 350 - 400 residues

The following code creates these three attributes, combines them into one 'query', then places the result in a list.

In [None]:
q1 = AttributeQuery(attribute="rcsb_assembly_info.polymer_entity_instance_count", operator="equals", value=1)

q2 = AttributeQuery(attribute="rcsb_assembly_info.polymer_entity_instance_count_protein", operator="equals", value=1)

q3_1 = attrs.rcsb_assembly_info.polymer_monomer_count >= 350
q3_2 = attrs.rcsb_assembly_info.polymer_monomer_count <= 400

query = q1 & q2 & q3_1 & q3_2
result_polymers = list(query("assembly"))
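
In the linked Advanced Search, the residue limits are expressed as a single `range` operator with inclusive bounds, whereas the code above combines a `>=` and a `<=` condition. Both describe 350 to 400 residues. To make the equivalence concrete, the sketch below builds the JSON node as it appears in the link and checks a few residue counts against both forms. The helper names are hypothetical, and nothing here contacts the RCSB PDB.

```python
def range_node(attribute, lower, upper):
    """Build a terminal node shaped like the 'range' node in the linked search."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": attribute,
            "operator": "range",
            "negation": False,
            "value": {
                "from": lower,
                "to": upper,
                "include_lower": True,
                "include_upper": True,
            },
        },
    }


def in_range(count, node):
    """Evaluate the range node locally for one residue count."""
    value = node["parameters"]["value"]
    above = count >= value["from"] if value["include_lower"] else count > value["from"]
    below = count <= value["to"] if value["include_upper"] else count < value["to"]
    return above and below


node = range_node("rcsb_assembly_info.polymer_monomer_count", 350, 400)
for count in (349, 350, 375, 400, 401):
    # The two comparisons mirror q3_1 and q3_2 in the cell above
    assert in_range(count, node) == (count >= 350 and count <= 400)
    print(count, in_range(count, node))
```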

### 2a)

We can confirm that the list was created successfully by printing its first 10 elements. These should match the first 10 results shown on the RCSB PDB search page.

In [None]:
print(f"The following are among the {len(result_polymers)} assemblies with a single protein chain:", result_polymers[0:10])

### 2b)

Now we can begin downloading the files for the results in the list. First, request the file for the first element in the list and check that the request succeeded. Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
# Keep only the 4-character entry ID from each assembly identifier
result_polymers = [assembly_id[0:4] for assembly_id in result_polymers]

# Request the file of the first search result
test_validation = requests.get(f'https://files.rcsb.org/download/{result_polymers[0]}.cif')
print(result_polymers[0])

In [None]:
test_validation.status_code

### 2c)

To further check, we can write the downloaded text to a file and then read the contents of that file. This includes creating a directory (folder) to store the files, which will be called `assemblies/Single_Protein_Chain`.

In [None]:
os.makedirs("assemblies/Single_Protein_Chain", exist_ok=True) 

with open(f"assemblies/Single_Protein_Chain/{result_polymers[0]}.cif", 'w+') as file:
    file.write(test_validation.text)

In [None]:
# Read the file back; the with block closes it automatically
with open(f"assemblies/Single_Protein_Chain/{result_polymers[0]}.cif", 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once we have confirmed that the file downloaded correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code does this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_polymers:
    cFile = f"{entry_id}.cif"
    cFileUrl = baseUrl + cFile
    cFileLocal = "assemblies/Single_Protein_Chain/" + cFile  # local path for the downloaded file
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)

## 24 Identical Chains

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22exact_match%22%2C%22negation%22%3Afalse%2C%22value%22%3A%22Homo%2024-mer%22%2C%22attribute%22%3A%22rcsb_struct_symmetry.oligomeric_state%22%7D%7D%5D%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22assembly%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%7D%2C%22request_info%22%3A%7B%22query_id%22%3A%229c66f6bfea1407cc71d25812786a9dc1%22%7D%7D) which is a search of biological assemblies that have 24 identical chains. This search can be divided into 1 attribute:

24 identical chains

In [None]:
q1 = AttributeQuery(attribute="rcsb_struct_symmetry.oligomeric_state", operator="exact_match", value="Homo 24-mer")

result_polymers = list(q1("assembly"))

### 2a)

We can confirm that the list was created successfully by printing its first 10 elements. These should match the first 10 results shown on the RCSB PDB search page.

In [None]:
print(f"The following are among the {len(result_polymers)} assemblies that contain exactly 24 identical chains:", result_polymers[0:10])

### 2b)

Now we can begin downloading the files for the results in the list. First, request the file for the first element in the list and check that the request succeeded. Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
# Keep only the 4-character entry ID from each assembly identifier
result_polymers = [assembly_id[0:4] for assembly_id in result_polymers]

# Request the file of the first search result
test_validation = requests.get(f'https://files.rcsb.org/download/{result_polymers[0]}.cif')
print(result_polymers[0])
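
Because one entry can have several assemblies, truncating assembly identifiers to entry IDs can leave the same ID in the list more than once, and each copy would then be downloaded again. A minimal sketch of removing repeats while keeping the search order follows. The identifiers are made up for illustration.

```python
from collections import Counter

# Made-up assembly identifiers: entry 1ABC has two assemblies
assembly_ids = ["1ABC-1", "1ABC-2", "2XYZ-1", "3DEF-1", "2XYZ-3"]

entry_ids = [assembly_id[0:4] for assembly_id in assembly_ids]

# dict.fromkeys keeps the first occurrence of each key in insertion order
unique_entry_ids = list(dict.fromkeys(entry_ids))

# Count how many assemblies each entry contributed to the results
assemblies_per_entry = Counter(entry_ids)

print(unique_entry_ids)              # ['1ABC', '2XYZ', '3DEF']
print(assemblies_per_entry["1ABC"])  # 2
```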

In [None]:
test_validation.status_code

### 2c)

To further check, we can write the downloaded text to a file and then read the contents of that file. This includes creating a directory (folder) to store the files, which will be called `assemblies/24_Identical_Chains`.

In [None]:
os.makedirs("assemblies/24_Identical_Chains", exist_ok=True) 

with open(f"assemblies/24_Identical_Chains/{result_polymers[0]}.cif", 'w+') as file:
    file.write(test_validation.text)

In [None]:
# Read the file back; the with block closes it automatically
with open(f"assemblies/24_Identical_Chains/{result_polymers[0]}.cif", 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once we have confirmed that the file downloaded correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code does this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_polymers:
    cFile = f"{entry_id}.cif"
    cFileUrl = baseUrl + cFile
    cFileLocal = "assemblies/24_Identical_Chains/" + cFile  # local path for the downloaded file
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)

## Immunoglobulin Fab Fragments Bound to Dimeric Antigen

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_entry_info.polymer_entity_count_protein%22%2C%22operator%22%3A%22equals%22%2C%22negation%22%3Afalse%2C%22value%22%3A3%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_assembly_info.polymer_entity_instance_count%22%2C%22operator%22%3A%22equals%22%2C%22negation%22%3Afalse%2C%22value%22%3A6%7D%7D%5D%2C%22logical_operator%22%3A%22and%22%7D%5D%2C%22label%22%3A%22text%22%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22structure%22%2C%22parameters%22%3A%7B%22operator%22%3A%22strict_shape_match%22%2C%22value%22%3A%7B%22entry_id%22%3A%221BJ1%22%2C%22asym_id%22%3A%22A%22%7D%7D%7D%5D%7D%2C%22return_type%22%3A%22entry%22%2C%22request_info%22%3A%7B%22query_id%22%3A%228248067268094d3dfb85da133bd86681%22%7D%2C%22request_options%22%3A%7B%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A25%7D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%7D%7D), which is a search for immunoglobulin Fab fragments bound to a dimeric antigen. This search can be divided into 3 attributes:

- 6 polymer entity instances (2 Fab heavy chains, 2 Fab light chains, 2 antigen chains)
- 3 polymer entities (Fab heavy chain, Fab light chain, antigen chain)
- Structure similarity to 1bj1, chain A


The following code creates these three attributes, combines them into one 'query', then places the result in a list.

In [None]:
q1 = AttributeQuery(attribute="rcsb_assembly_info.polymer_entity_instance_count", operator="equals", value=6)

q2 = AttributeQuery(attribute="rcsb_entry_info.polymer_entity_count_protein", operator="equals", value=3)

q3 = StructSimilarityQuery(entry_id = "1BJ1", structure_input_type="chain_id", chain_id="A", target_search_space="polymer_entity_instance")

query = q1 & q2 & q3 
result_polymers = list(query("entry"))

### 2a)

We can confirm that the list was created successfully by printing its first 10 elements. These should match the first 10 results shown on the RCSB PDB search page.

In [None]:
print(f"The following are among the {len(result_polymers)} entries with immunoglobulin Fab fragments bound to a dimeric antigen:", result_polymers[0:10])

### 2b)

Now we can begin downloading the files for the results in the list. First, request the file for the first element in the list and check that the request succeeded. Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
test_validation = requests.get(f'https://files.rcsb.org/download/{result_polymers[0]}.cif')
print(result_polymers[0])

In [None]:
test_validation.status_code

### 2c)

To further check, we can write the downloaded text to a file and then read the contents of that file. This includes creating a directory (folder) to store the files, which will be called `assemblies/Immunoglobin_Fab`.

In [None]:
os.makedirs("assemblies/Immunoglobin_Fab", exist_ok=True) 

with open(f"assemblies/Immunoglobin_Fab/{result_polymers[0]}.cif", 'w+') as file:
    file.write(test_validation.text)

In [None]:
# Read the file back; the with block closes it automatically
with open(f"assemblies/Immunoglobin_Fab/{result_polymers[0]}.cif", 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once we have confirmed that the file downloaded correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code does this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_polymers:
    cFile = f"{entry_id}.cif"
    cFileUrl = baseUrl + cFile
    cFileLocal = "assemblies/Immunoglobin_Fab/" + cFile  # local path for the downloaded file
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)

## One Heavy Water

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_assembly_info.deuterated_water_count%22%2C%22operator%22%3A%22greater_or_equal%22%2C%22negation%22%3Afalse%2C%22value%22%3A1%7D%7D%5D%2C%22logical_operator%22%3A%22and%22%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22logical_operator%22%3A%22and%22%7D%2C%22return_type%22%3A%22assembly%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A25%7D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22scoring_strategy%22%3A%22combined%22%7D%2C%22request_info%22%3A%7B%22query_id%22%3A%22f393a5ca40427895ca85117b8db0e69b%22%7D%7D) which is a search for assemblies that contain at least one heavy water. This search can be divided into 1 attribute:

Deuterated water count >= 1

In [None]:
q1 = AttributeQuery(attribute="rcsb_assembly_info.deuterated_water_count", operator="greater_or_equal", value=1)

result_polymers = list(q1("assembly"))

### 2a)

We can confirm that the list was created successfully by printing its first 10 elements. These should match the first 10 results shown on the RCSB PDB search page.

In [None]:
print(f"The following are among the {len(result_polymers)} assemblies with at least one heavy water:", result_polymers[0:10])

### 2b)

Now we can begin downloading the files for the results in the list. First, request the file for the first element in the list and check that the request succeeded. Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
# Keep only the 4-character entry ID from each assembly identifier
result_polymers = [assembly_id[0:4] for assembly_id in result_polymers]

# Request the file of the first search result
test_validation = requests.get(f'https://files.rcsb.org/download/{result_polymers[0]}.cif')
print(result_polymers[0])

In [None]:
test_validation.status_code

### 2c)

To further check, we can write the downloaded text to a file and then read the contents of that file. This includes creating a directory (folder) to store the files, which will be called `assemblies/Heavy_Water`.

In [None]:
os.makedirs("assemblies/Heavy_Water", exist_ok=True) 

with open(f"assemblies/Heavy_Water/{result_polymers[0]}.cif", 'w+') as file:
    file.write(test_validation.text)

In [None]:
# Read the file back; the with block closes it automatically
with open(f"assemblies/Heavy_Water/{result_polymers[0]}.cif", 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once we have confirmed that the file downloaded correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code does this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_polymers:
    cFile = f"{entry_id}.cif"
    cFileUrl = baseUrl + cFile
    cFileLocal = "assemblies/Heavy_Water/" + cFile  # local path for the downloaded file
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)