In [1]:
import sqlite3
import pandas as pd
import FetchAlphaFoldPDBs as FETCH
import os
import shutil

ACCESSION_DB_PATH = ".\\..\\AlphaFold\\accession_id_db.db"
ACCESSION_ID_TABLE_NAME = "accession_ids"
TABLE_UNIPROT_ID_FEATURE_NAME = "UniProtAccessionID"

The AlphaFold Database doesn't seem to have an API; instead, it provides an FTP server with a file mapping all UniProt accession IDs to AlphaFold IDs, available [here](http://ftp.ebi.ac.uk/pub/databases/alphafold/).

[This article](https://www.blopig.com/blog/2022/08/retrieving-alphafold-models-from-alphafolddb/) gives an example of retrieving AlphaFold models using that file. However, it involves loading a 7 GB CSV into memory, which is very slow and leads to memory errors.

So instead, let's store the accession IDs in a database that can be queried.

- Note: searching a directory of PDB files is much quicker; however, this DB approach covers every available AlphaFold PDB (any of which can subsequently be pulled from AlphaFold).

- Note: the approach is comparatively fast when searching for multiple IDs, as the database can be queried for all UniProt IDs at once.

- Note: some IDs are hyphenated ('-'). These are caught, and if no match is found for the hyphenated ID, the reduced ID (the part before the hyphen, e.g. P78352 for P78352-3) is searched instead.

#### Set Up Database Connection to Accession Info

Requires a local database built from the accession ID file available from [AlphaFold's FTP server](http://ftp.ebi.ac.uk/pub/databases/alphafold/).
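If the local database does not exist yet, one way to build it is to stream the accession ID CSV into SQLite in chunks, so the full file never has to sit in memory. The sketch below is an assumption about how that could be done, not the code used to build `accession_id_db.db`. It assumes the CSV has no header row and five columns in the order shown; only the table name `accession_ids` and the column `UniProtAccessionID` come from this notebook, and the other column names, the helper `build_accession_db`, and the chunk size are illustrative. A two-row in-memory CSV stands in for the real file.

```python
import io
import sqlite3

import pandas as pd

# Assumed column order of the accession ID CSV (no header row).
# Only "UniProtAccessionID" is taken from this notebook; the rest are illustrative names.
COLUMNS = ["UniProtAccessionID", "firstResidueIndex", "lastResidueIndex",
           "AF_DB_ID", "latestVersion"]


def build_accession_db(csv_source, conn, table="accession_ids", chunksize=500_000):
    """Stream the CSV into SQLite chunk by chunk and return the number of rows loaded."""
    total = 0
    for chunk in pd.read_csv(csv_source, header=None, names=COLUMNS, chunksize=chunksize):
        # Append each chunk; the table is created on the first call.
        chunk.to_sql(table, conn, if_exists="append", index=False)
        total += len(chunk)
    # Index the lookup column so queries by UniProt ID do not scan the whole table.
    conn.execute(
        f"CREATE INDEX IF NOT EXISTS idx_{table}_uniprot ON {table} (UniProtAccessionID)"
    )
    conn.commit()
    return total


# Tiny in-memory stand-in for the real multi-gigabyte file.
sample_csv = io.StringIO(
    "P35637,1,526,AF-P35637-F1,4\n"
    "P22626,1,353,AF-P22626-F1,4\n"
)
demo_conn = sqlite3.connect(":memory:")
rows_loaded = build_accession_db(sample_csv, demo_conn, chunksize=1)
```

For the real file, `csv_source` would be the path to the downloaded CSV and `conn` a connection to `ACCESSION_DB_PATH`.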

In [2]:
# Set Up Connections
conn = sqlite3.connect(ACCESSION_DB_PATH)
cur = conn.cursor()
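The batch lookup and hyphen fallback described in the notes earlier could look roughly like the sketch below. It is not the implementation inside `FetchAlphaFoldPDBs`; the function `lookup_accessions`, the column name `AF_DB_ID`, and the batch size are hypothetical. It queries all IDs with `IN (...)` clauses, then retries any unmatched hyphenated ID using only the part before the hyphen.

```python
import sqlite3


def lookup_accessions(cur, uniprot_ids, table="accession_ids",
                      id_col="UniProtAccessionID", af_col="AF_DB_ID", batch=500):
    """Map each source ID to (matched_id, AlphaFold ID), or None when nothing matches.

    `af_col` is an assumed column name; adjust it to the real schema.
    """
    def query(ids):
        found = {}
        ids = list(ids)
        # Query in batches to stay under SQLite's limit on bound parameters.
        for start in range(0, len(ids), batch):
            part = ids[start:start + batch]
            marks = ",".join("?" * len(part))
            cur.execute(
                f"SELECT {id_col}, {af_col} FROM {table} WHERE {id_col} IN ({marks})",
                part,
            )
            found.update(cur.fetchall())
        return found

    # First pass: exact matches for every ID at once.
    exact = query(set(uniprot_ids))
    # Second pass: unmatched hyphenated IDs, retried without the hyphen suffix.
    reduced = {uid: uid.split("-")[0]
               for uid in uniprot_ids if uid not in exact and "-" in uid}
    fallback = query(set(reduced.values()))

    result = {}
    for uid in uniprot_ids:
        if uid in exact:
            result[uid] = (uid, exact[uid])
        elif uid in reduced and reduced[uid] in fallback:
            result[uid] = (reduced[uid], fallback[reduced[uid]])
        else:
            result[uid] = None
    return result


# Small in-memory table standing in for the real accession database.
demo_conn = sqlite3.connect(":memory:")
demo_cur = demo_conn.cursor()
demo_cur.execute("CREATE TABLE accession_ids (UniProtAccessionID TEXT, AF_DB_ID TEXT)")
demo_cur.executemany(
    "INSERT INTO accession_ids VALUES (?, ?)",
    [("P35637", "AF-P35637-F1"), ("P78352", "AF-P78352-F1")],
)
matches = lookup_accessions(demo_cur, ["P35637", "P78352-3", "C5MKY7"])
```

Here `P78352-3` falls back to `P78352`, and an ID with no entry maps to `None`, which corresponds to the "has no AF match" case.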

#### Pull List of UniProt IDs to Check
This is the list of UniProt IDs to be queried. Here, the llps_minus dataset (see [PNAS](https://www.pnas.org/doi/10.1073/pnas.2019053118)) is used as a demo.

In [3]:
DEMO_DATA_FILEPATH = ".\\demo_datasets\\demo_llps_minus.csv"
OUTPUT_FILEPATH = ".\\demo_datasets\\demo_llps_minus_AlphaFold_Info.csv"

In [4]:
# Get IDs to Query from demo data
dataset_ID_column_name = 'Uniprot_ID'
llpsMinusData = pd.read_csv(DEMO_DATA_FILEPATH)
uniqueUniProtIDs = list(set(llpsMinusData[dataset_ID_column_name]))
print(f'{len(llpsMinusData[dataset_ID_column_name])} IDs in llps minus, {len(uniqueUniProtIDs)} of which are unique')

84 IDs in llps minus, 52 of which are unique


#### Get AF Info for Each ID

In [5]:
# Get AF Identifiers from ID list
AF_info = FETCH.getAndSaveAFinfoForListOfUniProtIDs(cur, uniqueUniProtIDs, OUTPUT_FILEPATH, debug=False)

In [7]:
AF_info.head()

Unnamed: 0,uniprot_ID_source,uniprot_ID_match,AF_DB_ID,firstResidueIndex,lastResidueIndex,latestVersion
0,Q95XR4,Q95XR4,AF-Q95XR4-F1,1.0,690.0,4.0
1,P78352-3,P78352,AF-P78352-F1,1.0,724.0,4.0
2,Q9TZQ3,Q9TZQ3,AF-Q9TZQ3-F1,1.0,730.0,4.0
3,P35637,P35637,AF-P35637-F1,1.0,526.0,4.0
4,P22626,P22626,AF-P22626-F1,1.0,353.0,4.0


#### Check If PDB Present, If Not Fetch It, Copying All to a Target Directory

Returns a dataframe linking each original row with the path to its PDB file.

(Could be sped up with multithreading, but that is not currently worth the hassle.)
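For reference, the check-locally-then-fetch step and the multithreading idea could be sketched as follows. This is not how `fetchPDBsFromAlphaFoldInfoDataFrame` works internally; the helpers `collect_pdb` and `collect_pdbs`, the worker count, and the download URL pattern are all assumptions. The downloader is passed in as a parameter so it can be swapped out or stubbed.

```python
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.request import urlretrieve

# Assumed URL pattern for AlphaFold model files; verify before relying on it.
AF_FILES_URL = "https://alphafold.ebi.ac.uk/files/{name}"


def download_from_alphafold(filename, dest):
    """Download one model file to `dest` (assumes the URL pattern above)."""
    urlretrieve(AF_FILES_URL.format(name=filename), dest)


def collect_pdb(filename, target_dir, local_dirs, download=download_from_alphafold):
    """Copy the file from a local directory if present, otherwise download it."""
    target = Path(target_dir) / filename
    if target.exists():
        return target
    for directory in local_dirs:
        candidate = Path(directory) / filename
        if candidate.exists():
            shutil.copy(candidate, target)
            return target
    download(filename, target)
    return target


def collect_pdbs(filenames, target_dir, local_dirs,
                 download=download_from_alphafold, workers=8):
    """Collect many files concurrently; returns {filename: path in target_dir}."""
    filenames = list(filenames)
    Path(target_dir).mkdir(parents=True, exist_ok=True)
    # Threads suit this task because the work is dominated by network and disk I/O.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        paths = pool.map(
            lambda name: collect_pdb(name, target_dir, local_dirs, download),
            filenames,
        )
        return dict(zip(filenames, paths))
```

A call such as `collect_pdbs(["AF-P35637-F1-model_v4.pdb"], COLLECTED_PDBS_DIR, LOCAL_ALPHAFOLD_PDB_DIRECTORIES)` would copy the file from a local directory when it exists there and only download it otherwise.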

In [8]:
# target directory
COLLECTED_PDBS_DIR = '.\\demo_datasets\\collected_pdbs'

# List of directories that contain local pdb files
LOCAL_ALPHAFOLD_PDB_DIRECTORIES = ['.\\demo_datasets\\local_pdbs']

FINAL_OUTPUT_PATH = ".\\demo_datasets\\demo_llps_plus_AlphaFold_Info_with_PDB_Paths.csv"

In [9]:
AF_info_with_PDB_paths = FETCH.fetchPDBsFromAlphaFoldInfoDataFrame(AF_info, COLLECTED_PDBS_DIR, LOCAL_ALPHAFOLD_PDB_DIRECTORIES, outputPath=FINAL_OUTPUT_PATH)

Q95XR4 (AF: AF-Q95XR4-F1-model_v4.pdb) not found locally, pulling from AlphaFold
P78352-3 (AF: AF-P78352-F1-model_v4.pdb) not found locally, pulling from AlphaFold
Q9TZQ3 (AF: AF-Q9TZQ3-F1-model_v4.pdb) not found locally, pulling from AlphaFold
P35637 (AF: AF-P35637-F1-model_v4.pdb) found locally
	copying .\demo_datasets\local_pdbs\AF-P35637-F1-model_v4.pdb to .\demo_datasets\collected_pdbs\AF-P35637-F1-model_v4.pdb
P22626 (AF: AF-P22626-F1-model_v4.pdb) not found locally, pulling from AlphaFold
O00571 (AF: AF-O00571-F1-model_v4.pdb) not found locally, pulling from AlphaFold
A0A2K3DA85 (AF: AF-A0A2K3DA85-F1-model_v4.pdb) not found locally, pulling from AlphaFold
C5MKY7 has no AF match
P06748 (AF: AF-P06748-F1-model_v4.pdb) found locally
	copying .\demo_datasets\local_pdbs\AF-P06748-F1-model_v4.pdb to .\demo_datasets\collected_pdbs\AF-P06748-F1-model_v4.pdb
P40070 (AF: AF-P40070-F1-model_v4.pdb) not found locally, pulling from AlphaFold
P42212 (AF: AF-P42212-F1-model_v4.pdb) not found l

#### All Available AlphaFold PDBs Will Now Be Present in the Target Directory

In [10]:
AF_info_with_PDB_paths.head()

Unnamed: 0,uniprot_ID_source,uniprot_ID_match,AF_DB_ID,firstResidueIndex,lastResidueIndex,latestVersion,PDB_path
0,Q95XR4,Q95XR4,AF-Q95XR4-F1,1.0,690.0,4.0,.\demo_datasets\collected_pdbs\AF-Q95XR4-F1-mo...
1,P78352-3,P78352,AF-P78352-F1,1.0,724.0,4.0,.\demo_datasets\collected_pdbs\AF-P78352-F1-mo...
2,Q9TZQ3,Q9TZQ3,AF-Q9TZQ3-F1,1.0,730.0,4.0,.\demo_datasets\collected_pdbs\AF-Q9TZQ3-F1-mo...
3,P35637,P35637,AF-P35637-F1,1.0,526.0,4.0,.\demo_datasets\collected_pdbs\AF-P35637-F1-mo...
4,P22626,P22626,AF-P22626-F1,1.0,353.0,4.0,.\demo_datasets\collected_pdbs\AF-P22626-F1-mo...
