# Macromolecules

## Purpose

The RCSB PDB is the repository for publicly available, experimentally determined structures of proteins and other biological macromolecules. This notebook demonstrates how to use the rcsbsearchapi Python library to recreate Advanced Searches from the RCSB PDB in Python, and how to download the files for the results of those searches. Each example searches for a different kind of macromolecule represented in the RCSB PDB and demonstrates a different way of using the Advanced Search tool.

## Steps Taken

The following is a step-by-step explanation of what will be performed for each code example.

### 1) Creating the Search

An explanation of what the search is describing will be followed by the creation of an RCSB PDB search.

### 2a) Validation By List

The search is tested for functionality by analyzing the first 10 results of the search in a list.

### 2b) Validation By File Request

The search is tested for functionality by requesting the file of the first search result. This is the step where the results will be changed if needed for the sake of requesting and downloading their corresponding file.

### 2c) Validation By File Download

The search is tested for functionality by downloading and reading the contents of the first search result's file. This includes the generation of a folder for the files of the search result.

### 3) Complete Search Download

Following validation, each file in the search result is downloaded into the previously generated folder.

## Importing Libraries

The following libraries need to be imported to complete the tasks in this notebook.

| Library | Contents | Source |
| :-----: | :------- | :----- |
| rcsbsearchapi | library for automated searching of the [RCSB Protein Data Bank](https://www.rcsb.org)| [py-rcsbsearchapi on GitHub](https://github.com/rcsb/py-rcsbsearchapi) |
| requests | library for sending HTTP requests | [requests Documentation](https://requests.readthedocs.io/en/latest/) |
| os | standard library for creating directories | [os Documentation](https://docs.python.org/3/library/os.html) |

## Installation

Of these, `rcsbsearchapi` and `requests` will need to be installed in your computing environment to perform the tasks in this notebook; `os` is part of the Python standard library.

To install from the command line on your computer, use this command (with the `requests` library as the example):

`pip install requests`

To install from within a Jupyter notebook or Colab notebook, type the same command in a code cell, preceded by an exclamation point.

`!pip install requests`

These libraries will be imported as they are needed over the course of this notebook.
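Before running the rest of the notebook, it can help to confirm which of these libraries are already available. The following is a minimal sketch using only the standard library; the helper name `missing_libraries` is an illustrative assumption and is not part of any of the packages listed above.

```python
import importlib.util

def missing_libraries(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_libraries(["rcsbsearchapi", "requests", "os"])
if missing:
    print("Install these first:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Anything reported as missing can then be installed with the `pip install` command shown above.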


In [None]:
# Import the components of rcsbsearchapi needed for this search
from rcsbsearchapi import rcsb_attributes as attrs  # for Operator notation

from rcsbsearchapi.const import CHEMICAL_ATTRIBUTE_SEARCH_SERVICE, STRUCTURE_ATTRIBUTE_SEARCH_SERVICE
from rcsbsearchapi.search import AttributeQuery, Attr, TextQuery, SequenceQuery, SeqMotifQuery  # for Fluent notation

import requests  # to enable us to pull files from the PDB
import os        # to enable us to create a directory to store the files

## Single Protein Chain

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22equals%22%2C%22negation%22%3Afalse%2C%22value%22%3A1%2C%22attribute%22%3A%22rcsb_entry_info.deposited_polymer_entity_instance_count%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22exact_match%22%2C%22negation%22%3Afalse%2C%22value%22%3A%22Protein%20(only)%22%2C%22attribute%22%3A%22rcsb_entry_info.selected_polymer_entity_types%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22range%22%2C%22negation%22%3Afalse%2C%22value%22%3A%7B%22from%22%3A350%2C%22to%22%3A400%2C%22include_lower%22%3Atrue%2C%22include_upper%22%3Atrue%7D%2C%22attribute%22%3A%22rcsb_entry_info.deposited_polymer_monomer_count%22%7D%7D%5D%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%7D%2C%22request_info%22%3A%7B%22query_id%22%3A%22d06b0c0f28c1ee86b2a6bb2187eff51a%22%7D%7D) which is a search of structures with a single protein chain (no others) of length between 350-400 residues. This search can be divided into 3 attributes:

- Total Number of Polymer Instances
- Entry Polymer Types (Proteins)
- Total Number of Polymer Residues per Deposited Model

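The following code creates a query for each of these three attributes, expressing the length range as two comparisons (at least 350 and at most 400 residues). It then combines them into one query with `&` and places the results in a list.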

In [None]:
q1 = AttributeQuery(attribute="rcsb_entry_info.deposited_polymer_entity_instance_count", operator = "equals", value = 1)
q2 = AttributeQuery(attribute="rcsb_entry_info.selected_polymer_entity_types", operator = "exact_match", value = "Protein (only)")
q3_1 = AttributeQuery(attribute="rcsb_entry_info.deposited_polymer_monomer_count", operator = "greater_or_equal", value = 350)
q3_2 = AttributeQuery(attribute="rcsb_entry_info.deposited_polymer_monomer_count", operator = "less_or_equal", value = 400)

query = q1 & q2 & q3_1 & q3_2
result_molecules = list(query("entry"))
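The linked search expresses the length limit as a single `range` node, while the cell above uses two separate comparisons. As a rough sketch of what that node looks like, the snippet below rebuilds it as a plain Python dictionary. The field names are copied from the request encoded in the linked search URL; the helper `range_node` is hypothetical and does not use rcsbsearchapi at all.

```python
import json

def range_node(attribute, lower, upper):
    """Build a 'terminal' text-service node for an inclusive numeric range."""
    return {
        "type": "terminal",
        "service": "text",
        "parameters": {
            "attribute": attribute,
            "operator": "range",
            "negation": False,
            "value": {
                "from": lower,
                "to": upper,
                "include_lower": True,   # 350 itself is allowed
                "include_upper": True,   # 400 itself is allowed
            },
        },
    }

node = range_node("rcsb_entry_info.deposited_polymer_monomer_count", 350, 400)
print(json.dumps(node, indent=2))
```

Both bounds are inclusive, which is why the two-comparison version uses `greater_or_equal` and `less_or_equal`.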

### 2a)

We can check that the list was created successfully by printing its first 10 elements. These should be the same first ten results shown by the RCSB PDB search.

In [None]:
print(f"First 10 of the {len(result_molecules)} structures with a single protein chain:", result_molecules[0:10])

### 2b)

Now, we can begin downloading the files from the list we made. First, request the file for the first element in the list and check whether the request succeeded (a status code of 200 indicates success). Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
test_validation = requests.get(f'https://files.rcsb.org/download/{result_molecules[0]}.cif')
print(result_molecules[0])

In [None]:
test_validation.status_code
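The status code shown above is a plain integer, so it can be checked in code rather than by eye. The sketch below uses only the standard library to turn a code into a readable label; the helper names `describe_status` and `is_success` are illustrative assumptions.

```python
from http import HTTPStatus

def describe_status(code):
    """Return a short label such as '200 OK' for an HTTP status code."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown status"
    return f"{code} {phrase}"

def is_success(code):
    """True for any 2xx status code."""
    return 200 <= code < 300

for code in (200, 404):
    print(describe_status(code), "->", "ok" if is_success(code) else "failed")
```

In the notebook, `is_success(test_validation.status_code)` would be the corresponding check.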

### 2c)

To check further, we can write the downloaded content to a file and then read the file back. This includes creating a directory (folder) to store the file, which will be called macromolecules.

In [None]:
# To really be sure, let's look at the file one line at a time. First we write the downloaded content to a file.

# Make a macromolecules folder for our results. If this macromolecules folder already exists, a new one is not created.
os.makedirs("macromolecules/Single_Protein_Chain", exist_ok=True)

with open(f"macromolecules/Single_Protein_Chain/{result_molecules[0]}.cif", "w+") as file:
    file.write(test_validation.text)

In [None]:
# Read the file back in as a single string; the with-block closes the file afterwards.
with open(f'macromolecules/Single_Protein_Chain/{result_molecules[0]}.cif', 'r') as file1:
    file_text = file1.read()

print(file_text)
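Reading the whole file by eye works for one entry, but a quick automated check can also be useful. The sketch below assumes that an mmCIF file begins with a `data_` block header carrying the entry ID; the helper `looks_like_mmcif` and the sample text are invented for illustration and are not taken from a real entry.

```python
def looks_like_mmcif(text, entry_id=None):
    """Heuristic check that `text` starts with an mmCIF data block header."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and comments before the header
        if not stripped.startswith("data_"):
            return False
        if entry_id is None:
            return True
        return stripped[len("data_"):].upper() == entry_id.upper()
    return False  # no content at all

sample = "data_XXXX\n#\n_entry.id   XXXX\n"  # placeholder snippet
print(looks_like_mmcif(sample, "xxxx"))
```

An error page saved by mistake would fail this check, because it would not start with a `data_` line.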

### 3)

Once you've confirmed that the file download occurred correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code will perform this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_molecules:
    cFile = f"{entry_id}.cif"  # same mmCIF format that was validated above
    cFileUrl = baseUrl + cFile
    cFileLocal = "macromolecules/Single_Protein_Chain/" + cFile
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)
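A search can return many entries, so it may be worth skipping files that are already on disk when the loop is run again. The following is a sketch of that idea only: `plan_downloads` is a hypothetical helper that lists what still needs fetching and performs no network requests, and the IDs are placeholders.

```python
import os
import tempfile

BASE_URL = "https://files.rcsb.org/download/"

def plan_downloads(entry_ids, folder, extension="cif"):
    """Return (url, local_path) pairs for entries not yet saved in `folder`."""
    plan = []
    for entry_id in entry_ids:
        filename = f"{entry_id}.{extension}"
        local_path = os.path.join(folder, filename)
        if os.path.exists(local_path):
            continue  # already downloaded on a previous run
        plan.append((BASE_URL + filename, local_path))
    return plan

# Demonstration in a temporary folder with placeholder IDs (no network access).
with tempfile.TemporaryDirectory() as folder:
    open(os.path.join(folder, "AAAA.cif"), "w").close()  # pretend AAAA exists
    for url, path in plan_downloads(["AAAA", "BBBB"], folder):
        print(url, "->", os.path.basename(path))
```

Each remaining pair could then be fetched with `requests.get(url)` and written to `local_path`, as in the loop above.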

## RNA Polymer

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22greater%22%2C%22negation%22%3Afalse%2C%22value%22%3A0%2C%22attribute%22%3A%22rcsb_entry_info.polymer_entity_count_RNA%22%7D%2C%22node_id%22%3A0%7D%5D%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%229d86fd0df8d6003d2eb2735ca616caa6%22%7D%7D) which is a search of structures that contain an RNA polymer. This search can be divided into 1 attribute:

Number of Distinct RNA Entities

In [None]:
q1 = AttributeQuery(attribute="rcsb_entry_info.polymer_entity_count_RNA", operator = "greater", value = 0)
result_molecules = list(q1("entry"))

### 2a)

We can check that the list was created successfully by printing its first 10 elements. These should be the same first ten results shown by the RCSB PDB search.

In [None]:
print(f"First 10 of the {len(result_molecules)} structures that contain an RNA polymer:", result_molecules[0:10])

### 2b)

Now, we can begin downloading the files from the list we made. First, request the file for the first element in the list and check whether the request succeeded (a status code of 200 indicates success). Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
test_validation = requests.get(f'https://files.rcsb.org/download/{result_molecules[0]}.cif')
print(result_molecules[0])

In [None]:
test_validation.status_code

### 2c)

To check further, we can write the downloaded content to a file and then read the file back. This includes creating a directory (folder) to store the file, which will be called macromolecules.

In [None]:
# To really be sure, let's look at the file one line at a time. First we write the downloaded content to a file.

# make a macromolecules folder for our results. If this macromolecules folder already exists, then it doesn't create a new one
os.makedirs("macromolecules/RNA_Polymer", exist_ok=True)

with open(f"macromolecules/RNA_Polymer/{result_molecules[0]}.cif", "w+") as file:
    file.write(test_validation.text)

In [None]:
# Read the file back in as a single string; the with-block closes the file afterwards.
with open(f'macromolecules/RNA_Polymer/{result_molecules[0]}.cif', 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once you've confirmed that the file download occurred correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code will perform this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_molecules:
    cFile = f"{entry_id}.cif"  # same mmCIF format that was validated above
    cFileUrl = baseUrl + cFile
    cFileLocal = "macromolecules/RNA_Polymer/" + cFile
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)

## Membrane Proteins

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22or%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_polymer_entity_annotation.type%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22PDBTM%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_polymer_entity_annotation.type%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22MemProtMD%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_polymer_entity_annotation.type%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22OPM%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22attribute%22%3A%22rcsb_polymer_entity_annotation.type%22%2C%22operator%22%3A%22exact_match%22%2C%22value%22%3A%22mpstruc%22%7D%7D%5D%7D%5D%2C%22logical_operator%22%3A%22and%22%2C%22label%22%3A%22text%22%7D%5D%2C%22logical_operator%22%3A%22and%22%7D%2C%22return_type%22%3A%22entry%22%2C%22request_info%22%3A%7B%22query_id%22%3A%22c932d0a99eb6732d4ac5d6c97962f67e%22%7D%2C%22request_options%22%3A%7B%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A25%7D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%7D%7D) which is a search of membrane protein structures. This search can be divided into 1 attribute:

PDBTM or MemProtMD or OPM or mpstruc

In [None]:
q1 = AttributeQuery(attribute="rcsb_polymer_entity_annotation.type", operator = "exact_match", value = "PDBTM")
q2 = AttributeQuery(attribute="rcsb_polymer_entity_annotation.type", operator = "exact_match", value = "MemProtMD")
q3 = AttributeQuery(attribute="rcsb_polymer_entity_annotation.type", operator = "exact_match", value = "OPM")
q4 = AttributeQuery(attribute="rcsb_polymer_entity_annotation.type", operator = "exact_match", value = "mpstruc")

query = q1 | q2 | q3 | q4  # "|" builds an OR query; the keyword "or" would simply return q1
result_molecules = list(query("entry"))
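The four alternatives in this search need a logical OR, and in Python the keyword `or` and the operator `|` do different things when applied to objects. The toy class below illustrates the difference; `Q` is a made-up stand-in and is not how rcsbsearchapi is implemented.

```python
class Q:
    """Toy stand-in for a query object; NOT the rcsbsearchapi implementation."""

    def __init__(self, *terms):
        self.terms = list(terms)

    def __or__(self, other):
        # Called for `a | b`: build a new object holding both sets of terms.
        return Q(*self.terms, *other.terms)

a, b = Q("PDBTM"), Q("OPM")

print((a or b).terms)  # keyword `or`: returns `a` unchanged, `b` is ignored
print((a | b).terms)   # operator `|`: combines both
```

The keyword `or` only asks whether its left operand is truthy, so a chain such as `a or b` never combines anything. The same distinction applies to `and` versus `&`.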

### 2a)

We can check that the list was created successfully by printing its first 10 elements. These should be the same first ten results shown by the RCSB PDB search.

In [None]:
print(f"First 10 of the {len(result_molecules)} membrane protein structures:", result_molecules[0:10])

### 2b)

Now, we can begin downloading the files from the list we made. First, request the file for the first element in the list and check whether the request succeeded (a status code of 200 indicates success). Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
test_validation = requests.get(f'https://files.rcsb.org/download/{result_molecules[0]}.cif')
print(result_molecules[0])

In [None]:
test_validation.status_code

### 2c)

To check further, we can write the downloaded content to a file and then read the file back. This includes creating a directory (folder) to store the file, which will be called macromolecules.

In [None]:
# To really be sure, let's look at the file one line at a time. First we write the downloaded content to a file.

# Make a macromolecules folder for our results. If this macromolecules folder already exists, a new one is not created.
os.makedirs("macromolecules/Membrane_Proteins", exist_ok=True)

with open(f"macromolecules/Membrane_Proteins/{result_molecules[0]}.cif", "w+") as file:
    file.write(test_validation.text)

In [None]:
# Read the file back in as a single string; the with-block closes the file afterwards.
with open(f'macromolecules/Membrane_Proteins/{result_molecules[0]}.cif', 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once you've confirmed that the file download occurred correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code will perform this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_molecules:
    cFile = f"{entry_id}.cif"  # same mmCIF format that was validated above
    cFileUrl = baseUrl + cFile
    cFileLocal = "macromolecules/Membrane_Proteins/" + cFile
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)

## Modified Residues

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22equals%22%2C%22negation%22%3Afalse%2C%22value%22%3A1%2C%22attribute%22%3A%22rcsb_entry_info.deposited_polymer_entity_instance_count%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22exact_match%22%2C%22negation%22%3Afalse%2C%22value%22%3A%22Protein%20(only)%22%2C%22attribute%22%3A%22rcsb_entry_info.selected_polymer_entity_types%22%7D%7D%2C%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22range%22%2C%22negation%22%3Afalse%2C%22value%22%3A%7B%22from%22%3A350%2C%22to%22%3A400%2C%22include_lower%22%3Atrue%2C%22include_upper%22%3Atrue%7D%2C%22attribute%22%3A%22rcsb_entry_info.deposited_polymer_monomer_count%22%7D%7D%5D%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%7D%2C%22request_info%22%3A%7B%22query_id%22%3A%22d06b0c0f28c1ee86b2a6bb2187eff51a%22%7D%7D) which is a search of structures with non-standard polymeric components. This search can be divided into 1 attribute:

Modified Chemical Component Count Per Polymer Entity

In [None]:
q1 = AttributeQuery(attribute="rcsb_polymer_entity_feature_summary.count", operator = "greater", value = 0)
q2 = AttributeQuery(attribute="rcsb_polymer_entity_feature_summary.type", operator = "exact_match", value = "modified_monomer")

query = q1 & q2
result_molecules = list(query("entry"))

### 2a)

We can check that the list was created successfully by printing its first 10 elements. These should be the same first ten results shown by the RCSB PDB search.

In [None]:
print(f"First 10 of the {len(result_molecules)} structures with modified residues:", result_molecules[0:10])

### 2b)

Now, we can begin downloading the files from the list we made. First, request the file for the first element in the list and check whether the request succeeded (a status code of 200 indicates success). Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
test_validation = requests.get(f'https://files.rcsb.org/download/{result_molecules[0]}.cif')
print(result_molecules[0])

In [None]:
test_validation.status_code

### 2c)

To check further, we can write the downloaded content to a file and then read the file back. This includes creating a directory (folder) to store the file, which will be called macromolecules.

In [None]:
# To really be sure, let's look at the file one line at a time. First we write the downloaded content to a file.

# Make a macromolecules folder for our results. If this macromolecules folder already exists, a new one is not created.
os.makedirs("macromolecules/Modified_Residues", exist_ok=True)

with open(f"macromolecules/Modified_Residues/{result_molecules[0]}.cif", "w+") as file:
    file.write(test_validation.text)

In [None]:
# Read the file back in as a single string; the with-block closes the file afterwards.
with open(f'macromolecules/Modified_Residues/{result_molecules[0]}.cif', 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once you've confirmed that the file download occurred correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code will perform this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_molecules:
    cFile = f"{entry_id}.cif"  # same mmCIF format that was validated above
    cFileUrl = baseUrl + cFile
    cFileLocal = "macromolecules/Modified_Residues/" + cFile
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)

## Chimeric Entities

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22text%22%2C%22parameters%22%3A%7B%22operator%22%3A%22greater%22%2C%22negation%22%3Afalse%2C%22value%22%3A1%2C%22attribute%22%3A%22rcsb_polymer_entity.rcsb_source_taxonomy_count%22%7D%7D%5D%7D%5D%2C%22label%22%3A%22text%22%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22entry%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22combined%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%2C%22results_content_type%22%3A%5B%22experimental%22%5D%7D%2C%22request_info%22%3A%7B%22query_id%22%3A%22adad8b36d958c9102ebc1ba15af20074%22%7D%7D) which is a search of entries containing chimeric entities, which are engineered by fusing sequence fragments from different organisms. This search can be divided into 1 attribute:

Distinct Taxonomy Count

In [None]:
q1 = AttributeQuery(attribute="rcsb_polymer_entity.rcsb_source_taxonomy_count", operator = "greater", value = 1)

result_molecules = list(q1("entry"))

### 2a)

We can check that the list was created successfully by printing its first 10 elements. These should be the same first ten results shown by the RCSB PDB search.

In [None]:
print(f"First 10 of the {len(result_molecules)} structures containing chimeric entities:", result_molecules[0:10])

### 2b)

Now, we can begin downloading the files from the list we made. First, request the file for the first element in the list and check whether the request succeeded (a status code of 200 indicates success). Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
test_validation = requests.get(f'https://files.rcsb.org/download/{result_molecules[0]}.cif')
print(result_molecules[0])

In [None]:
test_validation.status_code

### 2c)

To check further, we can write the downloaded content to a file and then read the file back. This includes creating a directory (folder) to store the file, which will be called macromolecules.

In [None]:
# To really be sure, let's look at the file one line at a time. First we write the downloaded content to a file.

# Make a macromolecules folder for our results. If this macromolecules folder already exists, a new one is not created.
os.makedirs("macromolecules/Chimeric_Entities", exist_ok=True)

with open(f"macromolecules/Chimeric_Entities/{result_molecules[0]}.cif", "w+") as file:
    file.write(test_validation.text)

In [None]:
# Read the file back in as a single string; the with-block closes the file afterwards.
with open(f'macromolecules/Chimeric_Entities/{result_molecules[0]}.cif', 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once you've confirmed that the file download occurred correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code will perform this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entry_id in result_molecules:
    cFile = f"{entry_id}.cif"  # same mmCIF format that was validated above
    cFileUrl = baseUrl + cFile
    cFileLocal = "macromolecules/Chimeric_Entities/" + cFile
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)

## Sequence Similarity Search

### 1)

The following code is a recreation of the search example on the RCSB Protein Data Bank shown [here](https://www.rcsb.org/search?request=%7B%22query%22%3A%7B%22type%22%3A%22group%22%2C%22logical_operator%22%3A%22and%22%2C%22nodes%22%3A%5B%7B%22type%22%3A%22terminal%22%2C%22service%22%3A%22sequence%22%2C%22parameters%22%3A%7B%22evalue_cutoff%22%3A1%2C%22identity_cutoff%22%3A0%2C%22target%22%3A%22pdb_protein_sequence%22%2C%22value%22%3A%22MDAESIEWKLTANLRNGPTFFQPLADSIEPLQFKLIGSDTVATAFPVFDTKYIPDSLINYLFKLFNLEIESGKTYPQLHSLTKQGFLNYWFHSFAVVVLQTDEKFIQDNQDWNSVLLGTFYIKPNYAPRCSHNCNAGFLVNGAHRGQKVGYRLAQVYLNWAPLLGYKYSIFNLVFVTNQASWKIWDKLNFQRIGLVPHAGILNGFSEPVDAIIYGKDLTKIEPEFLSME%22%7D%2C%22label%22%3A%22sequence%22%2C%22node_id%22%3A0%7D%5D%2C%22label%22%3A%22query-builder%22%7D%2C%22return_type%22%3A%22polymer_entity%22%2C%22request_options%22%3A%7B%22paginate%22%3A%7B%22start%22%3A0%2C%22rows%22%3A100%7D%2C%22scoring_strategy%22%3A%22sequence%22%2C%22sort%22%3A%5B%7B%22sort_by%22%3A%22score%22%2C%22direction%22%3A%22desc%22%7D%5D%7D%2C%22request_info%22%3A%7B%22src%22%3A%22ui%22%2C%22query_id%22%3A%22d90fa2a404c722826f43c067ccf71959%22%7D%7D) which is a search of sequences similar to "N-acetyltransferase MPR1". This search can be divided into 1 attribute:

Sequence Similarity (N-acetyltransferase MPR1)

In [None]:
sequencesimilarityquery = SequenceQuery(value = "MDAESIEWKLTANLRNGPTFFQPLADSIEPLQFKLIGSDTVATAFPVFDTKYIPDSLINYLFKLFNLEIESGKTYPQLHSLTKQGFLNYWFHSFAVVVLQTDEKFIQDNQDWNSVLLGTFYIKPNYAPRCSHNCNAGFLVNGAHRGQKVGYRLAQVYLNWAPLLGYKYSIFNLVFVTNQASWKIWDKLNFQRIGLVPHAGILNGFSEPVDAIIYGKDLTKIEPEFLSME", evalue_cutoff = 1, identity_cutoff = 0, sequence_type = "protein")

result_search = list(sequencesimilarityquery("polymer_entity"))
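Because this search returns polymer entities rather than entries, its IDs are assumed here to take the form of an entry ID, an underscore, and an entity number. The file server is addressed by entry ID, so the results may need converting before any download. The sketch below shows one way to do that; `entry_ids_from_entities` is a hypothetical helper and the IDs are placeholders.

```python
def entry_ids_from_entities(entity_ids):
    """Map IDs like 'XXXX_1' to 'XXXX', dropping duplicates but keeping order."""
    seen = {}
    for entity_id in entity_ids:
        entry_id = entity_id.split("_", 1)[0]
        seen.setdefault(entry_id, None)  # dicts preserve insertion order
    return list(seen)

# Placeholder IDs for illustration only
print(entry_ids_from_entities(["AAAA_1", "AAAA_2", "BBBB_1"]))
```

Removing duplicates avoids downloading the same entry file once for every matching entity it contains.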

### 2a)

We can check that the list was created successfully by printing its first 10 elements. These should be the same first ten results shown by the RCSB PDB search.

In [None]:
print(f"First 10 of the {len(result_search)} polymer entities with sequences similar to N-acetyltransferase MPR1:", result_search[0:10])

### 2b)

Now, we can begin downloading the files from the list we made. First, request the file for the first element in the list and check whether the request succeeded (a status code of 200 indicates success). Then, open the file to see whether its contents are in line with what is expected from the download.

In [None]:
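# Results are polymer entity IDs (entry ID + "_" + entity number); files are requested by entry ID.
first_entry = result_search[0].split("_")[0]
test_validation = requests.get(f'https://files.rcsb.org/download/{first_entry}.cif')
print(first_entry)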

In [None]:
test_validation.status_code

### 2c)

To check further, we can write the downloaded content to a file and then read the file back. This includes creating a directory (folder) to store the file, which will be called macromolecules.

In [None]:
# To really be sure, let's look at the file one line at a time. First we write the downloaded content to a file.

# Make a macromolecules folder for our results. If this macromolecules folder already exists, a new one is not created.
os.makedirs("macromolecules/Sequence_Similarity", exist_ok=True)

with open(f"macromolecules/Sequence_Similarity/{result_search[0]}.cif", "w+") as file:
    file.write(test_validation.text)

In [None]:
# Read the file back in as a single string; the with-block closes the file afterwards.
with open(f'macromolecules/Sequence_Similarity/{result_search[0]}.cif', 'r') as file1:
    file_text = file1.read()

print(file_text)

### 3)

Once you've confirmed that the file download occurred correctly, we can finish by downloading all of the files from the list we made previously into the folder we generated. The following block of code will perform this.

In [None]:
baseUrl = "https://files.rcsb.org/download/"

for entity_id in result_search:
    cFile = f"{entity_id.split('_')[0]}.cif"  # entry ID part of the entity ID, mmCIF format
    cFileUrl = baseUrl + cFile
    cFileLocal = "macromolecules/Sequence_Similarity/" + cFile
    response = requests.get(cFileUrl)
    with open(cFileLocal, "w+") as file:
        file.write(response.text)

## Sequence Motif Search (NPPTP)

NPPTP = ...
XPPXP = ...
W7G, G20L = ...

In [None]:
sequencemotifquery = SeqMotifQuery(value = "NPPTP", pattern_type = "simple", sequence_type = "protein")

result_search = list(sequencemotifquery("polymer_entity"))

In [None]:
print(f"First 10 of the {len(result_search)} sequences containing the motif NPPTP:", result_search[0:10])

## Sequence Motif Search (XPPXP)

In [None]:
sequencemotifquery = SeqMotifQuery(value = "XPPXP", pattern_type = "simple", sequence_type = "protein")

result_search = list(sequencemotifquery("polymer_entity"))

In [None]:
print(f"First 10 of the {len(result_search)} sequences containing the motif XPPXP:", result_search[0:10])
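To get a feel for what a simple pattern such as XPPXP matches, it can be translated into a regular expression and tried on a short string locally. The sketch below assumes that `X` in a simple pattern stands for any single residue; that reading, the helper `simple_to_regex`, and the toy sequence are all assumptions made for illustration, and the search service may apply additional rules.

```python
import re

def simple_to_regex(pattern):
    """Translate a simple motif, assuming 'X' stands for any one residue."""
    return "".join("." if ch.upper() == "X" else re.escape(ch) for ch in pattern)

regex = simple_to_regex("XPPXP")
print(regex)

toy_sequence = "MKAPPGPLL"  # invented sequence containing APPGP
print(bool(re.search(regex, toy_sequence)))
```

A pattern with no `X`, such as NPPTP, is left unchanged by this translation.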

## Sequence Motif Search (W 7 G, G 20 L)

In [None]:
sequencemotifquery = SeqMotifQuery(value = "W.{7}G.{20}L", pattern_type = "regex", sequence_type = "protein")

result_search = list(sequencemotifquery("polymer_entity"))

In [None]:
print(f"First 10 of the {len(result_search)} sequences containing 7 variable residues between W and G, and 20 variable residues between G and L:", result_search[0:10])
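The regular expression used in this search can be tried locally with Python's `re` module before it is sent to the server. This is only a sketch on an invented sequence, and it assumes that Python's regex semantics are close enough to the search service's for a pattern this simple.

```python
import re

motif = re.compile(r"W.{7}G.{20}L")

# Toy sequence built to contain the motif: W + 7 residues + G + 20 residues + L
toy = "MK" + "W" + "A" * 7 + "G" + "S" * 20 + "L" + "EE"
match = motif.search(toy)
print(match.start(), match.end())

# With only 6 residues between W and G, the pattern no longer matches.
too_short = "W" + "A" * 6 + "G" + "S" * 20 + "L"
print(motif.search(too_short))
```

The `{7}` and `{20}` quantifiers require exact counts, which is why the second sequence fails to match.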

## Sequence Motif Search (Zinc Finger Motif)

In [None]:
sequencemotifquery = SeqMotifQuery(value = "C.{2,4}C.{12}H.{3,5}H", pattern_type = "regex", sequence_type = "protein")

result_search = list(sequencemotifquery("polymer_entity"))

In [None]:
print(f"First 10 of the {len(result_search)} sequences containing the zinc finger motif:", result_search[0:10])

## Sequence Motif Search (N-terminal Histidine)

In [None]:
sequencemotifquery = SeqMotifQuery(value = "^HHHHHH", pattern_type = "regex", sequence_type = "protein")

result_search = list(sequencemotifquery("polymer_entity"))

In [None]:
print(f"First 10 of the {len(result_search)} sequences with N-terminal Histidine tags:", result_search[0:10])
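The `^` at the start of this pattern anchors it to the beginning of the sequence, which is what restricts the search to N-terminal tags. A small local check with Python's `re` module shows the effect; both sequences below are invented for illustration.

```python
import re

his_tag = re.compile(r"^HHHHHH")

n_terminal = "HHHHHHSSGLVPRGS"   # tag at the very start of the sequence
c_terminal = "MKTAYIAKQRHHHHHH"  # tag at the end instead

print(bool(his_tag.search(n_terminal)))
print(bool(his_tag.search(c_terminal)))
```

Because of the anchor, a sequence that begins with another residue before the histidines, such as MHHHHHH, would not match this pattern either.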

## Sequence Motif Search (Walker)

In [None]:
sequencemotifquery = SeqMotifQuery(value = "[AG]....GK[ST]", pattern_type = "regex", sequence_type = "protein")

result_search = list(sequencemotifquery("polymer_entity"))

In [None]:
print(f"First 10 of the {len(result_search)} sequences with the Walker (P loop) motif:", result_search[0:10])
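The search reports which sequences contain the motif, but not where it occurs. Once a sequence is in hand, the same regular expression can locate the motif locally. The sketch below uses an invented sequence; `motif_positions` is a hypothetical helper and is not part of rcsbsearchapi.

```python
import re

walker_a = re.compile(r"[AG]....GK[ST]")

def motif_positions(sequence):
    """Return 1-based start positions of non-overlapping motif matches."""
    return [m.start() + 1 for m in walker_a.finditer(sequence)]

toy = "MSTGAPGSGKTLLA"  # invented sequence containing GAPGSGKT
print(motif_positions(toy))
```

Positions are reported 1-based here to follow the usual residue-numbering convention rather than Python's 0-based indexing.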