<a href="https://colab.research.google.com/github/neugiriger/datasciencecoursera/blob/master/IontheFoldDataCuration_directfromdrive.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Data Curation -  Bulk downloading of structural files from the PDB**

This notebook performs automated bulk downloading of 3D structural data from the RCSB Protein Data Bank (PDB). It is part of the data curation process for protein design and analysis, specifically focusing on structures that will later be screened for charged protein–protein interfaces.

The shell script (batch_download.sh) leverages the RCSB’s programmatic access API to download various file formats including:

*   .cif.gz: standard mmCIF structure format ; Contains full atom details, experimental methods, symmetry, and model info. This is requried for Accurate residue mapping and atom coordinates for structures from recent deposits.

*   .pdb.gz: legacy PDB format ;Used for structure preprocessing for input to ProteinMPNN. Extracts chain IDs, residue sequences, and backbone coordinates.

*   sf.cif.gz: structure factor files (used in crystallographic analysis). This is required for Contains diffraction data used to compute the 3D structure (X-ray crystallography).

Allows validation of the atomic model (e.g., checking charge density around binding sites).


To prevent data loss from Colab session timeouts, all files are downloaded directly to Google Drive.

Protein IDs used for the extract has been downloaded from RCSB Protein Data Bank (PDB) https://www.rcsb.org/search/advanced and the bulk extract script from https://www.rcsb.org/docs/programmatic-access/batch-downloads-with-shell-script



# 1. Start here: Mount Google Drive

This enables saving large downloaded files persistently, even if the Colab session disconnects or expires

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 2. Import Libraries

Just run this cell so everything below works. You'll need to rerun it if you start the runtime again.

In [2]:
from google.colab import drive
import os
import ipywidgets as widgets
from IPython.display import display
from IPython.display import clear_output

# 2. Load in required files from Googledrive (Ion the Fold)

Adjust the file locations as you need

In [4]:

# Link to the protein ID file in google drive
inputProteinFile = '/content/drive/MyDrive/IonTheFold/CodesandWorkingFiles/CollectingProteins/Input_PDBIDs/testproteins.txt'

# Link to  the batch download file
batchDownloadFile = '/content/drive/MyDrive/IonTheFold/CodesandWorkingFiles/CollectingProteins/batch_download.sh'

# Link to  the output location where the protein files of each ID in the ID file will be saved
outputLocationFolder = '/content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/0-10k' # this could be any one of the folders from 1-10k to 10-20k, 10-30k and so on up to 170-180k



#3. Set base directories in google drive

In [5]:
# Set base directories

# Link to the protein ID file folder in google drive (general)
input_base = "/content/drive/MyDrive/IonTheFold/CodesandWorkingFiles/CollectingProteins/Input_PDBIDs"

# Link to  the batch download file
batchDownloadFile = '/content/drive/MyDrive/IonTheFold/CodesandWorkingFiles/CollectingProteins/batch_download.sh'

# Link to  the output location where the protein files of each ID in the ID file will be saved
output_base = "/content/drive/MyDrive/IonTheFold/ProteinInformation_RAW"


# and we will check they are all working
if os.path.isdir(input_base):
    print(f"✅ Input folder found: {input_base}")
    print("   Contents:", os.listdir(input_base)[:5])
else:
    print(f"❌ Input folder not found: {input_base}")

if os.path.isfile(batchDownloadFile):
    print(f"✅ Batch script found: {batchDownloadFile}")
else:
    print(f"❌ Batch script NOT found: {batchDownloadFile}")

if os.path.isdir(output_base):
    print(f"✅ Output folder found: {output_base}")
    print("   Subfolders:", os.listdir(output_base)[:5])
else:
    print(f"❌ Output folder not found: {output_base}")

✅ Input folder found: /content/drive/MyDrive/IonTheFold/CodesandWorkingFiles/CollectingProteins/Input_PDBIDs
   Contents: ['testproteins.txt', 'pdb_ids_1-10000.txt', 'pdb_IDs_20001-30000.txt', 'pdb_ids_10001-20000.txt', 'pdb_ids_30001-40000.txt']
✅ Batch script found: /content/drive/MyDrive/IonTheFold/CodesandWorkingFiles/CollectingProteins/batch_download.sh
✅ Output folder found: /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW
   Subfolders: ['20-30k', '10-20k', '30-40k', '40-50k', 'testCollection']


#4a. Check what's already been done

Run this script to see what's already been done in the drive. Make sure you're not overlapping

In [7]:
import os
import pandas as pd

# Base directory
base_dir = "/content/drive/MyDrive/IonTheFold/ProteinInformation_RAW"

# Prepare a list to store results
folder_counts = []

# Loop through subfolders and count files
for subfolder in sorted(os.listdir(base_dir)):
    subfolder_path = os.path.join(base_dir, subfolder)
    if os.path.isdir(subfolder_path):
        count = sum(
            os.path.isfile(os.path.join(subfolder_path, f))
            for f in os.listdir(subfolder_path)
        )
        folder_counts.append((subfolder, count))

# Convert to DataFrame and display
df = pd.DataFrame(folder_counts, columns=["Subfolder", "File Count"])
print(df.to_string(index=False))

      Subfolder  File Count
          1-10k           1
         10-20k           5
         20-30k           5
         30-40k           1
         40-50k       14444
         50-60k           1
         60-70k           2
         70-80k           2
 testCollection           1
testCollection2           0


# 4b. Select the protein set you would like to process.

This function will make sure the right folders are set up to grab the protein file you want, extract the protein IDs and then make sure the output is going to the right folder.

Each has 10k proteins and will take about 8 hours. Make sure you select your protein set from the dropdown list below after you run this cell.

Once you've run this, go into Google Drive folder "ProteinInformation_RAW" under your selected folder and check the test file is there. This is only a test file with just the first few proteins IDs copied in, just so we know this set up is working.

In [8]:
# Define dropdown options (manual mapping based on your folder structure)
id_file_options = {
    "1-10k": "pdb_ids_1-10000.txt",
    "10-20k": "pdb_ids_10001-20000.txt",
    "20-30k": "pdb_ids_20001-30000.txt",
    "30-40k": "pdb_ids_30001-40000.txt",
    "40-50k": "pdb_ids_40001-50000.txt",
    "50-60k": "pdb_ids_50001-60000.txt",
    "60-70k": "pdb_ids_60001-70000.txt",
    "70-80k": "pdb_ids_70001-80000.txt",
    #"80-90k": "pdb_ids_80001-90000.txt",
    #"90-1000k": "pdb_ids_90001-1000000.txt",
    #"100-110k": "pdb_ids_100001-110000.txt",
    #"110-120k": "pdb_ids_110001-120000.txt",
    #"120-130k": "pdb_ids_120001-130000.txt",
    #"130-140k": "pdb_ids_130001-140000.txt",
    #"140-150k": "pdb_ids_140001-150000.txt",
    #"150-160k": "pdb_ids_150001-160000.txt",
    #"160-170k": "pdb_ids_160001-170000.txt",
    #"170-180k": "pdb_ids_170001-180000.txt",
    #"180-188k": "pdb_ids_180001-188000.txt",
    "Test file": "testproteins.txt"
}

# Dropdown widget for user selection
dropdown = widgets.Dropdown(
    options=list(id_file_options.keys()),
    value="Test file",
    description='Protein Set:',
    style={'description_width': 'initial'},
    layout=widgets.Layout(width='50%')
)

display(dropdown)

# Handler function to run once selection is made
def validate_selection(change):
    from IPython.display import clear_output
    clear_output(wait=True)
    label = change.new
    input_file = os.path.join(input_base, id_file_options[label])

    # Match folder name to ID range label
    folder_key = "testCollection" if label == "Test file" else label
    output_dir = os.path.join(output_base, folder_key)

    print(f"\n📄 Selected ID File: {input_file}")
    print(f"📁 Matched Output Folder: {output_dir}")

    # Check existence
    assert os.path.exists(input_file), "❌ ID file does not exist."
    assert os.path.isdir(output_dir), "❌ Output folder does not exist."
    print("✅ Both input and output paths are valid.")

    # Read first protein ID
    with open(input_file, 'r') as f:
        for line in f:
            first_id = line.strip()
            if first_id:
                break

    print(f"First Protein ID: {first_id}")

    # Auto-name test file
    base_name = "testProteinWrite"
    ext = ".txt"
    i = 1
    while os.path.exists(os.path.join(output_dir, f"{base_name}{i}{ext}")):
        i += 1
    test_file_path = os.path.join(output_dir, f"{base_name}{i}{ext}")

    # Write file
    with open(test_file_path, 'w') as f:
        f.write(f"This is a test file for protein ID: {first_id}\n")

    print(f"✅ Test file written as: {test_file_path}")

# Register the handler
dropdown.observe(validate_selection, names='value')


📄 Selected ID File: /content/drive/MyDrive/IonTheFold/CodesandWorkingFiles/CollectingProteins/Input_PDBIDs/pdb_ids_60001-70000.txt
📁 Matched Output Folder: /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/60-70k
✅ Both input and output paths are valid.
First Protein ID: 6QHU,6QHV,6QHW,6QHX,6QHY,6QHZ,6QI0,6QI1,6QI2,6QI3,6QIQ,6QIR,6QIS,6QIT,6QIV,6QL1,6QL2,6QL3,6QPJ,6QPS,6QVV,6QW0,6QW5,6QXV,6QXW,6QY0,6QY1,6QY2,6QY4,6R4M,6R4N,6R6T,6R6U,6RAB,6RB7,6RD1,6RJS,6RK5,6RK7,6RKA,6RKC,6RUZ,6RVB,6RZN,6RZO,6S43,6S4Q,6S50,6S5A,6S7K,6S7S,6S7U,6S7W,6S7Z,6S88,6S8T,6S8Y,6SEI,6SEV,6SFB,6SFC,6SGE,6SH1,6SH2,6SHK,6SJ8,6SJK,6SJN,6SJO,6SJP,6SJR,6SJS,6SK4,6SKB,6SKC,6SKD,6U2J,6U2K,6U2L,6UAC,6UB9,5QQO,5QQP,6AG2,6AG3,6AG9,6AH7,6AH8,6FWG,6FWI,6FWJ,6FWL,6FWM,6FWO,6FWP,6FWQ,6GQ3,6H3R,6H63,6HH1,6HHS,6HJA,6HJI,6HJS,6HJZ,6HK2,6HK9,6HKP,6HPB,6HPC,6HWM,6HWN,6HZX,6I3C,6I3D,6I3H,6I43,6IDY,6IE0,6IE2,6IE3,6IES,6IET,6IEU,6IEV,6IEX,6IF4,6IJ7,6INU,6IOE,6IRI,6J2D,6J2E,6J2F,6J2G,6J2H,6J2I,6J2J,6J33,6J34,6J35,6J4H,6J4J,6J4M,

#5a. Run the download scripts, with counts on the first 5 protein IDs.

Check that the google drive file contains 15 files. There should be 3 files per protein

In [9]:
# === Set dynamic paths from dropdown ===
label = dropdown.value
original_file = os.path.join(input_base, id_file_options[label])
output_dir = os.path.join(output_base, "testCollection" if label == "Test file" else label)

# === Reuse batch script path ===
script_path = batchDownloadFile

# === Read the first 5 comma-separated IDs from the original file ===
with open(original_file, 'r') as f:
    content = f.read().strip()
    all_ids = [x.strip() for x in content.split(",") if x.strip()]
    first_5_ids = all_ids[:5]

# === Write the trimmed list to a temporary input file ===
temp_input_file = "/content/first5_ids.txt"
with open(temp_input_file, 'w') as f:
    f.write(",".join(first_5_ids))

print(f"🧪 Running batch download on first 5 IDs: {first_5_ids}")

# === Ensure output directory exists ===
os.makedirs(output_dir, exist_ok=True)

# === Make the script executable ===
!chmod +x "{script_path}"

# === Run the download script with flags ===
!bash "{script_path}" -f "{temp_input_file}" -o "{output_dir}" -c -p -s

# Count successfully downloaded *.pdb.gz files (1 per protein ID)
saved_ids = !find "{output_dir}" -type f -name "*.pdb.gz" | wc -l
saved_ids = int(saved_ids[0])
missing = len(first_5_ids) - saved_ids

print("----------------------------------------")
print("Summary (First 5 IDs):")
print(f"Expected PDB IDs:            {len(first_5_ids)}")
print(f"Protein structures saved:    {saved_ids}")
print(f"Missing or failed downloads: {missing}")
print("----------------------------------------")

🧪 Running batch download on first 5 IDs: ['6QHU', '6QHV', '6QHW', '6QHX', '6QHY']
Downloading https://files.rcsb.org/download/6QHU.cif.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/60-70k/6QHU.cif.gz
Downloading https://files.rcsb.org/download/6QHU.pdb.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/60-70k/6QHU.pdb.gz
Downloading https://files.rcsb.org/download/6QHU-sf.cif.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/60-70k/6QHU-sf.cif.gz
Downloading https://files.rcsb.org/download/6QHV.cif.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/60-70k/6QHV.cif.gz
Downloading https://files.rcsb.org/download/6QHV.pdb.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/60-70k/6QHV.pdb.gz
Downloading https://files.rcsb.org/download/6QHV-sf.cif.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/60-70k/6QHV-sf.cif.gz
Downloading https://files.rcsb.org/download/6QHW.cif.gz to /content/drive/MyDrive/IonTheFold/Pro

#5b. Now run the rest of the first half of the proteins. (4995 protein ids)

This will take around 6-8 hours.

In [None]:
# === Reuse dropdown-selected file and output dir from previous cell ===
label = dropdown.value
original_file = os.path.join(input_base, id_file_options[label])
output_dir = os.path.join(output_base, "testCollection" if label == "Test file" else label)
script_path = batchDownloadFile

# === Read IDs 6 to 5000 ===
with open(original_file, 'r') as f:
    content = f.read().strip()
    all_ids = [x.strip() for x in content.split(",") if x.strip()]
    next_4995_ids = all_ids[5:5000]  # Skip first 5

# === Write to temp file ===
temp_input_file = "/content/next4995_ids.txt"
with open(temp_input_file, 'w') as f:
    f.write(",".join(next_4995_ids))

print(f"🧪 Downloading next 4,995 protein IDs from: {label}")

# === Ensure output dir exists ===
os.makedirs(output_dir, exist_ok=True)

# === Run download script ===
!chmod +x "{script_path}"
!bash "{script_path}" -f "{temp_input_file}" -o "{output_dir}" -c -p -s

# === Count saved *.pdb.gz files
saved_ids = !find "{output_dir}" -type f -name "*.pdb.gz" | wc -l
saved_ids = int(saved_ids[0])
processed = len(next_4995_ids)
missing = processed - saved_ids

# === Summary
print("----------------------------------------")
print("Summary (IDs 6–5000):")
print(f"Expected PDB IDs:            {processed}")
print(f"Protein structures saved:    {saved_ids}")
print(f"Missing or failed downloads: {missing}")
print("----------------------------------------")

🧪 Downloading next 4,995 protein IDs from: 1-10k
Downloading https://files.rcsb.org/download/8XWU.cif.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/1-10k/8XWU.cif.gz
Downloading https://files.rcsb.org/download/8XWU.pdb.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/1-10k/8XWU.pdb.gz
Failed to download https://files.rcsb.org/download/8XWU.pdb.gz
Downloading https://files.rcsb.org/download/8XWU-sf.cif.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/1-10k/8XWU-sf.cif.gz
Downloading https://files.rcsb.org/download/8XXG.cif.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/1-10k/8XXG.cif.gz
Downloading https://files.rcsb.org/download/8XXG.pdb.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/1-10k/8XXG.pdb.gz
Failed to download https://files.rcsb.org/download/8XXG.pdb.gz
Downloading https://files.rcsb.org/download/8XXG-sf.cif.gz to /content/drive/MyDrive/IonTheFold/ProteinInformation_RAW/1-10k/8XXG-sf.cif.gz
Downloadi

#5c. Finally, run the last half of the proteins. (5000 protein ids)
This will take around 6-8 hours.

In [1]:
# === Continue using previous dropdown selection and base paths ===
label = dropdown.value
original_file = os.path.join(input_base, id_file_options[label])
output_dir = os.path.join(output_base, "testCollection" if label == "Test file" else label)
script_path = batchDownloadFile

# === Read IDs 5001 to 10000 ===
with open(original_file, 'r') as f:
    content = f.read().strip()
    all_ids = [x.strip() for x in content.split(",") if x.strip()]
    ids_5001_10000 = all_ids[5000:10000]


# === Write to temp file ===
temp_input_file = "/content/ids_5001_10000.txt"
with open(temp_input_file, 'w') as f:
    f.write(",".join(ids_5001_10000))

print(f"🧪 Downloading protein IDs 5001–10000 from: {label}")

# === Ensure output dir exists ===
os.makedirs(output_dir, exist_ok=True)

# === Run download script ===
!chmod +x "{script_path}"
!bash "{script_path}" -f "{temp_input_file}" -o "{output_dir}" -c -p -s

# === Count saved *.pdb.gz files
saved_ids = !find "{output_dir}" -type f -name "*.pdb.gz" | wc -l
saved_ids = int(saved_ids[0])
processed = len(ids_5001_10000)
missing = processed - saved_ids

# === Summary
print("----------------------------------------")
print("Summary (IDs 5001–10000):")
print(f"Expected PDB IDs:            {processed}")
print(f"Protein structures saved:    {saved_ids}")
print(f"Missing or failed downloads: {missing}")
print("----------------------------------------")

NameError: name 'dropdown' is not defined

### PDB Download Log Parser: Extract Successfully Downloaded and Failed PDB IDs
Use this script to automatically extract which PDB files were successfully downloaded and which failed from the batch script output. It helps track download progress and prepare a retry list for failed IDs.

In [4]:
import re

# Assuming the output of the script is captured in the variable 'script_output'
script_output = """
Downloading https://files.rcsb.org/download/6DBM.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DBM.cif.gz
Downloading https://files.rcsb.org/download/6DBM.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DBM.pdb.gz
Downloading https://files.rcsb.org/download/6DBM-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DBM-sf.cif.gz
Downloading https://files.rcsb.org/download/6DBN.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DBN.cif.gz
Downloading https://files.rcsb.org/download/6DBN.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DBN.pdb.gz
Downloading https://files.rcsb.org/download/6DBN-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DBN-sf.cif.gz
Downloading https://files.rcsb.org/download/6DR1.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR1.cif.gz
Downloading https://files.rcsb.org/download/6DR1.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR1.pdb.gz
Downloading https://files.rcsb.org/download/6DR1-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR1-sf.cif.gz
Downloading https://files.rcsb.org/download/6DR7.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR7.cif.gz
Downloading https://files.rcsb.org/download/6DR7.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR7.pdb.gz
Downloading https://files.rcsb.org/download/6DR7-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR7-sf.cif.gz
Downloading https://files.rcsb.org/download/6DR9.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR9.cif.gz
Downloading https://files.rcsb.org/download/6DR9.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR9.pdb.gz
Downloading https://files.rcsb.org/download/6DR9-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DR9-sf.cif.gz
Downloading https://files.rcsb.org/download/6DRB.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DRB.cif.gz
Downloading https://files.rcsb.org/download/6DRB.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DRB.pdb.gz
Downloading https://files.rcsb.org/download/6DRB-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DRB-sf.cif.gz
Downloading https://files.rcsb.org/download/6DRY.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DRY.cif.gz
Downloading https://files.rcsb.org/download/6DRY.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DRY.pdb.gz
Downloading https://files.rcsb.org/download/6DRY-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DRY-sf.cif.gz
Downloading https://files.rcsb.org/download/6DS6.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DS6.cif.gz
Downloading https://files.rcsb.org/download/6DS6.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DS6.pdb.gz
Downloading https://files.rcsb.org/download/6DS6-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DS6-sf.cif.gz
Downloading https://files.rcsb.org/download/6DT6.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DT6.cif.gz
Downloading https://files.rcsb.org/download/6DT6.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DT6.pdb.gz
Downloading https://files.rcsb.org/download/6DT6-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DT6-sf.cif.gz
Downloading https://files.rcsb.org/download/6DUP.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DUP.cif.gz
Downloading https://files.rcsb.org/download/6DUP.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DUP.pdb.gz
Downloading https://files.rcsb.org/download/6DUP-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DUP-sf.cif.gz
Downloading https://files.rcsb.org/download/6DVL.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVL.cif.gz
Downloading https://files.rcsb.org/download/6DVL.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVL.pdb.gz
Downloading https://files.rcsb.org/download/6DVL-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVL-sf.cif.gz
Downloading https://files.rcsb.org/download/6DVM.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVM.cif.gz
Downloading https://files.rcsb.org/download/6DVM.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVM.pdb.gz
Downloading https://files.rcsb.org/download/6DVM-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVM-sf.cif.gz
Downloading https://files.rcsb.org/download/6DVN.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVN.cif.gz
Downloading https://files.rcsb.org/download/6DVN.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVN.pdb.gz
Downloading https://files.rcsb.org/download/6DVN-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVN-sf.cif.gz
Downloading https://files.rcsb.org/download/6DVO.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVO.cif.gz
Downloading https://files.rcsb.org/download/6DVO.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVO.pdb.gz
Downloading https://files.rcsb.org/download/6DVO-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DVO-sf.cif.gz
Downloading https://files.rcsb.org/download/6DW2.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DW2.cif.gz
Downloading https://files.rcsb.org/download/6DW2.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DW2.pdb.gz
Downloading https://files.rcsb.org/download/6DW2-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DW2-sf.cif.gz
Downloading https://files.rcsb.org/download/6DWA.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWA.cif.gz
Downloading https://files.rcsb.org/download/6DWA.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWA.pdb.gz
Downloading https://files.rcsb.org/download/6DWA-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWA-sf.cif.gz
Downloading https://files.rcsb.org/download/6DWC.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWC.cif.gz
Downloading https://files.rcsb.org/download/6DWC.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWC.pdb.gz
Downloading https://files.rcsb.org/download/6DWC-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWC-sf.cif.gz
Downloading https://files.rcsb.org/download/6DWI.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWI.cif.gz
Downloading https://files.rcsb.org/download/6DWI.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWI.pdb.gz
Downloading https://files.rcsb.org/download/6DWI-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6DWI-sf.cif.gz
Downloading https://files.rcsb.org/download/6EDF.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EDF.cif.gz
Downloading https://files.rcsb.org/download/6EDF.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EDF.pdb.gz
Downloading https://files.rcsb.org/download/6EDF-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EDF-sf.cif.gz
Downloading https://files.rcsb.org/download/6EDI.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EDI.cif.gz
Downloading https://files.rcsb.org/download/6EDI.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EDI.pdb.gz
Downloading https://files.rcsb.org/download/6EDI-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EDI-sf.cif.gz
Downloading https://files.rcsb.org/download/6EE5.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EE5.cif.gz
Downloading https://files.rcsb.org/download/6EE5.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EE5.pdb.gz
Downloading https://files.rcsb.org/download/6EE5-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EE5-sf.cif.gz
Downloading https://files.rcsb.org/download/6EEG.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEG.cif.gz
Downloading https://files.rcsb.org/download/6EEG.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEG.pdb.gz
Downloading https://files.rcsb.org/download/6EEG-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEG-sf.cif.gz
Downloading https://files.rcsb.org/download/6EEK.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEK.cif.gz
Downloading https://files.rcsb.org/download/6EEK.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEK.pdb.gz
Downloading https://files.rcsb.org/download/6EEK-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEK-sf.cif.gz
Downloading https://files.rcsb.org/download/6EEP.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEP.cif.gz
Downloading https://files.rcsb.org/download/6EEP.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEP.pdb.gz
Downloading https://files.rcsb.org/download/6EEP-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EEP-sf.cif.gz
Downloading https://files.rcsb.org/download/6EFG.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EFG.cif.gz
Downloading https://files.rcsb.org/download/6EFG.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EFG.pdb.gz
Downloading https://files.rcsb.org/download/6EFG-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EFG-sf.cif.gz
Downloading https://files.rcsb.org/download/6EFH.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EFH.cif.gz
Downloading https://files.rcsb.org/download/6EFH.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EFH.pdb.gz
Downloading https://files.rcsb.org/download/6EFH-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EFH-sf.cif.gz
Downloading https://files.rcsb.org/download/6EIF.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIF.cif.gz
Downloading https://files.rcsb.org/download/6EIF.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIF.pdb.gz
Downloading https://files.rcsb.org/download/6EIF-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIF-sf.cif.gz
Downloading https://files.rcsb.org/download/6EIJ.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIJ.cif.gz
Downloading https://files.rcsb.org/download/6EIJ.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIJ.pdb.gz
Downloading https://files.rcsb.org/download/6EIJ-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIJ-sf.cif.gz
Downloading https://files.rcsb.org/download/6EIL.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIL.cif.gz
Downloading https://files.rcsb.org/download/6EIL.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIL.pdb.gz
Downloading https://files.rcsb.org/download/6EIL-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIL-sf.cif.gz
Downloading https://files.rcsb.org/download/6EIP.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIP.cif.gz
Downloading https://files.rcsb.org/download/6EIP.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIP.pdb.gz
Downloading https://files.rcsb.org/download/6EIP-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIP-sf.cif.gz
Downloading https://files.rcsb.org/download/6EIQ.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIQ.cif.gz
Downloading https://files.rcsb.org/download/6EIQ.pdb.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIQ.pdb.gz
Downloading https://files.rcsb.org/download/6EIQ-sf.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIQ-sf.cif.gz
Downloading https://files.rcsb.org/download/6EIR.cif.gz to /content/drive/MyDrive/IontheFoldPDBstructures/60-70k/6EIR.cif.gz
Downloading https://files.rcsb.org/download/6EIR.pdb.gz to /cont...
"""

# Initialize sets to store downloaded and failed protein IDs
downloaded_ids = set()
failed_ids = set()

# Regex to find protein IDs in the output
download_pattern = re.compile(r'Downloading .*?/(\w{4})\..*? to .*?/\w{4}\..*')
failed_pattern = re.compile(r'Failed to download .*?/(\w{4})\..*')

# Process each line in the script output
for line in script_output.splitlines():
    download_match = download_pattern.search(line)
    if download_match:
        downloaded_ids.add(download_match.group(1))
        continue # Move to the next line if a download is matched

    failed_match = failed_pattern.search(line)
    if failed_match:
        failed_ids.add(failed_match.group(1))

# Remove IDs that were downloaded successfully from the failed set
# This is important because the output shows "Downloading" before "Failed to download" for failed files.
# We only want to count something as 'failed' if it wasn't successfully downloaded.
failed_ids = failed_ids - downloaded_ids

print(f"Number of downloaded files: {len(downloaded_ids)}")
print(f"Number of undownloaded protein IDs: {len(failed_ids)}")

if failed_ids:
    print("\nUndownloaded protein IDs:")
    for pdb_id in failed_ids:
        print(pdb_id)

Number of downloaded files: 32
Number of undownloaded protein IDs: 0


#6. Run this script to double check you have the requisite files all set in the Google Drive

In [None]:
from IPython.display import Image, display

# === Step 6: Verify first 10,000 protein downloads ===
label = dropdown.value
original_file = os.path.join(input_base, id_file_options[label])
output_dir = os.path.join(output_base, "testCollection" if label == "Test file" else label)

with open(original_file, 'r') as f:
    content = f.read().strip()
    all_ids = [x.strip() for x in content.split(",") if x.strip()]
    first_10k_ids = all_ids[:10000]

missing_ids = []
for pdb_id in first_10k_ids:
    expected_file = os.path.join(output_dir, f"{pdb_id}.pdb.gz")
    if not os.path.exists(expected_file):
        missing_ids.append(pdb_id)

found = len(first_10k_ids) - len(missing_ids)

print("----------------------------------------")
print(f"✅ Total expected IDs:        {len(first_10k_ids)}")
print(f"📂 .pdb.gz files found:       {found}")
print(f"❌ Missing protein structures: {len(missing_ids)}")
print("----------------------------------------")

if missing_ids:
    print("Examples of missing PDB IDs:", missing_ids[:10])
else:
    # 🎆 All good — celebrate with fireworks!
    display(Image("https://media.giphy.com/media/xT0xeJpnrWC4XWblEk/giphy.gif"))
