<a href="https://colab.research.google.com/github/neetushibu/IontheFold-Team6/blob/main/IontheFoldDataCuration001.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Data Curation - Bulk Downloading of Structural Files from the PDB**

This notebook performs automated bulk downloading of 3D structural data from the RCSB Protein Data Bank (PDB). It is part of the data curation process for protein design and analysis, specifically focusing on structures that will later be screened for charged protein–protein interfaces.

The shell script (batch_download.sh) leverages the RCSB’s programmatic access API to download various file formats including:

*   .cif.gz: standard mmCIF structure format. Contains full atom details, experimental methods, symmetry, and model information. This is required for accurate residue mapping and atom coordinates, particularly for structures from recent deposits.

*   .pdb.gz: legacy PDB format. Used for structure preprocessing as input to ProteinMPNN, from which chain IDs, residue sequences, and backbone coordinates are extracted.

*   sf.cif.gz: structure factor files (used in crystallographic analysis). Contains the diffraction data used to compute the 3D structure (X-ray crystallography) and allows validation of the atomic model (e.g., checking electron density around binding sites).
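
The three formats map onto a simple URL pattern, which can be read off the download log further down in this notebook. The following is a minimal sketch of that pattern, not the script's actual implementation; `build_urls` and `SUFFIXES` are hypothetical names introduced here for illustration.

```python
# Base URL as it appears in the download log of this notebook.
BASE_URL = "https://files.rcsb.org/download"

# Suffix appended to the PDB ID for each file type. The keys mirror the
# -c, -p and -s flags passed to batch_download.sh (assumed mapping, based
# on the order of files in the log).
SUFFIXES = {
    "c": ".cif.gz",     # mmCIF coordinates
    "p": ".pdb.gz",     # legacy PDB coordinates
    "s": "-sf.cif.gz",  # structure factors
}


def build_urls(pdb_id, flags="cps"):
    """Return the download URLs for one PDB ID (hypothetical helper)."""
    pdb_id = pdb_id.strip().upper()
    return [f"{BASE_URL}/{pdb_id}{SUFFIXES[flag]}" for flag in flags]


for url in build_urls("6T95"):
    print(url)
```

Calling `build_urls("6T95", flags="c")` would return only the mmCIF URL, which may be enough when structure factors are not needed.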


To prevent data loss from Colab session timeouts, all files are downloaded directly to Google Drive.

The PDB IDs used for the download were exported from the RCSB Protein Data Bank (PDB) advanced search (https://www.rcsb.org/search/advanced), and the bulk download script was obtained from https://www.rcsb.org/docs/programmatic-access/batch-downloads-with-shell-script
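
Before starting a long download, it can be worth sanity-checking the exported ID list. The sketch below assumes the file holds classic four-character PDB IDs separated by commas (whitespace is tolerated as well); `read_pdb_ids` is a hypothetical helper and is not part of the RCSB script.

```python
import re
from pathlib import Path

# Classic PDB IDs: one digit followed by three alphanumeric characters.
PDB_ID_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def read_pdb_ids(path):
    """Read an ID list, validate each entry, and drop duplicates in order."""
    text = Path(path).read_text()
    # Split on commas and/or whitespace, ignoring empty tokens.
    tokens = [t.upper() for t in re.split(r"[,\s]+", text) if t]
    invalid = [t for t in tokens if not PDB_ID_PATTERN.match(t)]
    if invalid:
        raise ValueError(f"Unexpected entries in ID list: {invalid[:5]}")
    # dict.fromkeys preserves first-seen order while removing duplicates.
    return list(dict.fromkeys(tokens))
```

For example, `len(read_pdb_ids(path_to_id_file))` gives a quick count of unique entries to compare against the expected batch size.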



# Upload Required Files

In [6]:
from google.colab import files
uploaded = files.upload()


Saving batch_download.sh to batch_download.sh


# Make the Shell Script Executable

In [8]:
!chmod +x batch_download.sh


# Mount Google Drive

Mounting Google Drive lets large downloaded files persist even if the Colab session disconnects or expires.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Create Output Directory in Google Drive

In [4]:
#!mkdir -p /content/drive/MyDrive/IontheFold/downloads/500001/

#!mkdir -p /content/drive/MyDrive/IontheFold/downloads/70001/

#!mkdir -p /content/drive/MyDrive/IontheFold/downloads/10-20K/

!mkdir -p /content/drive/MyDrive/IontheFold/downloads/40-50K/
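
The cell above is edited by hand for each batch. As an alternative, the same step can be sketched in Python with the batch label as a parameter; `make_batch_dir` is a hypothetical helper, and the default root simply repeats the Drive path used in this notebook.

```python
from pathlib import Path

# Root folder on the mounted Drive, as used in the mkdir commands above.
DOWNLOAD_ROOT = Path("/content/drive/MyDrive/IontheFold/downloads")


def make_batch_dir(label, root=DOWNLOAD_ROOT):
    """Create (if needed) and return the output folder for one batch."""
    batch_dir = Path(root) / label
    # parents/exist_ok give the same behaviour as `mkdir -p`.
    batch_dir.mkdir(parents=True, exist_ok=True)
    return batch_dir


# Example (requires the Drive mount from the previous step):
# out_dir = make_batch_dir("40-50K")
```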



# Run the Batch Download Script

In [9]:
!./batch_download.sh -f /content/rcsb_pdb_ids_fb58ddb1bbf0908cc8aa8e8a10cf5f3a_040001-050000.txt -o /content/drive/MyDrive/IontheFold/downloads/40-50K -c -p -s

Streaming output truncated to the last 5000 lines.
Downloading https://files.rcsb.org/download/6T94-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/40-50K/6T94-sf.cif.gz
Downloading https://files.rcsb.org/download/6T95.cif.gz to /content/drive/MyDrive/IontheFold/downloads/40-50K/6T95.cif.gz
Downloading https://files.rcsb.org/download/6T95.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/40-50K/6T95.pdb.gz
Downloading https://files.rcsb.org/download/6T95-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/40-50K/6T95-sf.cif.gz
Downloading https://files.rcsb.org/download/6T97.cif.gz to /content/drive/MyDrive/IontheFold/downloads/40-50K/6T97.cif.gz
Downloading https://files.rcsb.org/download/6T97.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/40-50K/6T97.pdb.gz
Downloading https://files.rcsb.org/download/6T97-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/40-50K/6T97-sf.cif.gz
Downloading https://files.rcsb.org/download/6T9F.cif.gz to /con
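

Because Colab keeps only the last 5000 lines of streamed output, earlier messages from a long run are lost. One possible workaround, sketched here under the assumption that the script is invoked the same way as above, is to send its output to a log file instead; `run_and_log` is a hypothetical helper.

```python
import subprocess


def run_and_log(cmd, log_path):
    """Run cmd, write combined stdout/stderr to log_path, return exit code."""
    with open(log_path, "w") as log:
        result = subprocess.run(
            cmd, stdout=log, stderr=subprocess.STDOUT, check=False
        )
    return result.returncode


# Example with placeholder paths (replace with the real ID file and folder):
# run_and_log(
#     ["./batch_download.sh", "-f", "ids.txt", "-o", "out_dir", "-c", "-p", "-s"],
#     "batch_download.log",
# )
```

Writing the log to the Drive folder rather than `/content` would keep it available after the session ends.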

In [7]:
#!./batch_download.sh -f /content/rcsb_pdb_ids_fb58ddb1bbf0908cc8aa8e8a10cf5f3a_010001-020000.txt -o /content/drive/MyDrive/IontheFold/downloads/10-20K/ -c -p -s

Streaming output truncated to the last 5000 lines.
Downloading https://files.rcsb.org/download/8BK2.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/10-20K//8BK2.pdb.gz
Downloading https://files.rcsb.org/download/8BK2-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/10-20K//8BK2-sf.cif.gz
Downloading https://files.rcsb.org/download/8BK3.cif.gz to /content/drive/MyDrive/IontheFold/downloads/10-20K//8BK3.cif.gz
Downloading https://files.rcsb.org/download/8BK3.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/10-20K//8BK3.pdb.gz
Downloading https://files.rcsb.org/download/8BK3-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/10-20K//8BK3-sf.cif.gz
Downloading https://files.rcsb.org/download/8BL0.cif.gz to /content/drive/MyDrive/IontheFold/downloads/10-20K//8BL0.cif.gz
Downloading https://files.rcsb.org/download/8BL0.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/10-20K//8BL0.pdb.gz
Downloading https://files.rcsb.org/download/8BL0-sf.cif.gz to 

In [14]:
#!./batch_download.sh -f /content/rcsb_pdb_ids_fb58ddb1bbf0908cc8aa8e8a10cf5f3a_070001-080000.txt -o /content/drive/MyDrive/IontheFold/downloads/70001/ -c -p -s

Streaming output truncated to the last 5000 lines.
Downloading https://files.rcsb.org/download/5XPT-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/70001//5XPT-sf.cif.gz
Downloading https://files.rcsb.org/download/5XPU.cif.gz to /content/drive/MyDrive/IontheFold/downloads/70001//5XPU.cif.gz
Downloading https://files.rcsb.org/download/5XPU.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/70001//5XPU.pdb.gz
Downloading https://files.rcsb.org/download/5XPU-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/70001//5XPU-sf.cif.gz
Downloading https://files.rcsb.org/download/5XUV.cif.gz to /content/drive/MyDrive/IontheFold/downloads/70001//5XUV.cif.gz
Downloading https://files.rcsb.org/download/5XUV.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/70001//5XUV.pdb.gz
Downloading https://files.rcsb.org/download/5XUV-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/70001//5XUV-sf.cif.gz
Downloading https://files.rcsb.org/download/5XXY.cif.gz to /con

In [11]:
#!./batch_download.sh -f /content/rcsb_pdb_ids_fb58ddb1bbf0908cc8aa8e8a10cf5f3a_050001-060000.txt -o /content/drive/MyDrive/IontheFold/downloads/500001/ -c -p -s

Streaming output truncated to the last 5000 lines.
Downloading https://files.rcsb.org/download/6KUW-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/500001//6KUW-sf.cif.gz
Downloading https://files.rcsb.org/download/6KUX.cif.gz to /content/drive/MyDrive/IontheFold/downloads/500001//6KUX.cif.gz
Downloading https://files.rcsb.org/download/6KUX.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/500001//6KUX.pdb.gz
Downloading https://files.rcsb.org/download/6KUX-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/500001//6KUX-sf.cif.gz
Downloading https://files.rcsb.org/download/6KYK.cif.gz to /content/drive/MyDrive/IontheFold/downloads/500001//6KYK.cif.gz
Downloading https://files.rcsb.org/download/6KYK.pdb.gz to /content/drive/MyDrive/IontheFold/downloads/500001//6KYK.pdb.gz
Downloading https://files.rcsb.org/download/6KYK-sf.cif.gz to /content/drive/MyDrive/IontheFold/downloads/500001//6KYK-sf.cif.gz
Downloading https://files.rcsb.org/download/6L18.cif.gz 
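

Since sessions can time out partway through a batch, it may help to check which expected files actually reached the output folder. The following is a minimal sketch, assuming the file naming seen in the logs above; `find_missing` is a hypothetical helper and not part of `batch_download.sh`.

```python
from pathlib import Path

# File name suffixes as they appear in the download logs above.
SUFFIXES = (".cif.gz", ".pdb.gz", "-sf.cif.gz")


def find_missing(pdb_ids, out_dir, suffixes=SUFFIXES):
    """Return expected file names that are absent or empty in out_dir."""
    out_dir = Path(out_dir)
    missing = []
    for pdb_id in pdb_ids:
        for suffix in suffixes:
            target = out_dir / f"{pdb_id}{suffix}"
            # Zero-byte files are treated as missing, as a precaution.
            if not target.is_file() or target.stat().st_size == 0:
                missing.append(target.name)
    return missing
```

A missing file does not always indicate a failed download: legacy PDB files are not available for very large structures, and structure factors exist only for diffraction experiments, so the result is best read as a list of candidates to recheck.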