### The Protein Data Bank and obsolete entries

The Protein Data Bank (PDB) is a repository of 3D structural data for biological macromolecules such as proteins, nucleic acids, and complex assemblies. As new structures are determined, older entries can become obsolete for several reasons: errors in the data, retraction of the associated paper, updates to the structural information, or the deposition of a newer, more accurate structure. The PDB lists these entries in the `obsolete.dat` file, which users can consult to identify structures that may no longer be reliable or up to date. Each line of the file gives the date on which the replacement took place, the PDB ID of the obsolete entry, and the PDB ID of its 'successor' (i.e. the entry that replaced the obsolete one, if any). Example:

```
LIST OF OBSOLETE COORDINATE ENTRIES AND SUCCESSORS
OBSLTE    31-JUL-94 116L     216L
OBSLTE    15-APR-98 125D     1AW6
OBSLTE    20-SEP-99 14PS     1QJB
OBSLTE    30-OCT-78 151C     251C
OBSLTE    15-JAN-91 156B     256B
OBSLTE    08-JUL-08 179L     
OBSLTE    07-DEC-04 1A0V     1Y46
OBSLTE    07-DEC-04 1A0W     1Y4F
OBSLTE    07-DEC-04 1A0X     1Y4G
...
```
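As a minimal sketch of how such a file can be read, the snippet below parses a few of the lines shown above into a dictionary that maps each obsolete ID to its list of successors. The function name `parse_obsolete` and the whitespace-based splitting are illustrative assumptions, not part of any PDB or AlphaFold tooling.

```python
# Sample lines copied from the excerpt above (the 179L entry has no successor).
SAMPLE = """\
LIST OF OBSOLETE COORDINATE ENTRIES AND SUCCESSORS
OBSLTE    31-JUL-94 116L     216L
OBSLTE    08-JUL-08 179L
OBSLTE    07-DEC-04 1A0V     1Y46
"""


def parse_obsolete(text):
    """Map each obsolete PDB ID to the list of its successors (possibly empty).

    Hypothetical helper: it assumes whitespace-separated fields in the order
    record name, date, obsolete ID, successor ID(s).
    """
    mapping = {}
    for line in text.splitlines():
        if not line.startswith("OBSLTE"):
            continue  # skip the header line and any blank lines
        fields = line.split()
        if len(fields) < 3:
            continue  # malformed line without an obsolete ID
        mapping[fields[2].upper()] = [s.upper() for s in fields[3:]]
    return mapping


mapping = parse_obsolete(SAMPLE)
print(mapping)  # {'116L': ['216L'], '179L': [], '1A0V': ['1Y46']}
```

An empty list marks an entry that was withdrawn without a replacement.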

### AlphaFold bug

AlphaFold, developed by DeepMind, is a deep learning method for protein structure prediction. It combines deep neural networks with evolutionary information to predict protein structures with remarkable accuracy, and it can use template structures to guide the prediction. AlphaFold checks the `obsolete.dat` file to determine whether any of the templates chosen for a prediction is obsolete and, if so, uses the successor entry instead. This check matters because (1) obsolete template files are not stored in the local PDB structure database and (2) a missing file raises an error that immediately aborts the AlphaFold run.

Unfortunately, older AlphaFold versions (e.g. 2.2.0) cannot handle cases in which the successor entry is itself obsolete, as in this example:

```
OBSLTE    08-MAR-23 4N1V     5CVX
...
OBSLTE    22-JUN-16 5CVX     5L8Z
```

Here, 5CVX is the successor of the obsolete entry 4N1V, and 5L8Z is the successor of 5CVX, so both 4N1V and 5CVX are superseded by 5L8Z. The affected AlphaFold versions do not follow this chain: when the template 4N1V is requested, 5CVX is returned, and because that file is not present in the local PDB structure database, AlphaFold raises an exception:

```
FileNotFoundError: [Errno 2] No such file or directory: '/data/mmcif/5cvx.cif'
```
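The difference between a single lookup and following the whole chain can be sketched as follows. This is not AlphaFold's code: the dictionary and the functions `lookup_once` and `resolve` are hypothetical and only reproduce the two entries from the example above.

```python
# Obsolete ID -> successor ID, taken from the two example lines above.
obsolete = {"4N1V": "5CVX", "5CVX": "5L8Z"}


def lookup_once(pdb_id, mapping):
    """Single lookup: returns the direct successor, even if it is itself obsolete."""
    return mapping.get(pdb_id, pdb_id)


def resolve(pdb_id, mapping):
    """Follow the chain of successors until an ID that is not obsolete is reached."""
    seen = set()
    current = pdb_id
    while current in mapping:
        if current in seen:
            raise ValueError(f"Cyclic successor chain involving {current}")
        seen.add(current)
        current = mapping[current]
    return current


print(lookup_once("4N1V", obsolete))  # 5CVX (obsolete, so its file is missing locally)
print(resolve("4N1V", obsolete))      # 5L8Z
```

The `seen` set is only a guard against malformed input in which two entries point at each other.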

### The solution

The simplest option is to update AlphaFold to a newer version. Versions newer than 2.2.0 should not be affected by this bug.

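If an update is not possible, the `obsolete.dat` file can be modified to collapse successor chains (so that every obsolete entry points directly to its final successor) and to remove references to PDB entries that are not found in the local database. The following script does exactly that and writes the result to `obsolete_new.dat` in the current working directory.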

Please specify the correct paths for these variables:
- `MMCIF_PATH`
- `OBSOLETE_PATH`

On the HPC cluster of the University of Basel (sciCORE), these correspond to the environment variables `${MMCIF_PATH}` and `${OBSOLETE_PATH}`, respectively, which you can print after loading the AlphaFold module. Example:

```
module load AlphaFold
echo ${MMCIF_PATH}
echo ${OBSOLETE_PATH}
```
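If the two environment variables are set, the paths can also be picked up from Python instead of being copied by hand. The sketch below assumes the variable names shown above; the helper `path_from_env` is hypothetical, and the fallback values are simply the placeholder paths used in the script.

```python
import os


def path_from_env(name, default):
    """Return the value of environment variable `name`, or `default` if it is unset or empty."""
    return os.environ.get(name) or default


# Fallbacks are placeholders; replace them with your own paths.
MMCIF_PATH = path_from_env("MMCIF_PATH", "databases/PDB/latest/data/structures/all/mmcif_files/")
OBSOLETE_PATH = path_from_env("OBSOLETE_PATH", "databases/PDB/latest/data/status/")

print(f"MMCIF_PATH:    {MMCIF_PATH}")
print(f"OBSOLETE_PATH: {OBSOLETE_PATH}")
```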

Modify `OBSOLETE_DAT_FILEPATH` only if you know what you are doing; the default should work as is.

```python
import os
from datetime import datetime

MMCIF_PATH = 'databases/PDB/latest/data/structures/all/mmcif_files/'        # Replace with your own
OBSOLETE_PATH = 'databases/PDB/latest/data/status/'                         # Replace with your own
OBSOLETE_DAT_FILEPATH = os.path.normpath(f'{OBSOLETE_PATH}/obsolete.dat')


# Parse: map each obsolete ID to the list of its successor IDs (empty if none)

with open(OBSOLETE_DAT_FILEPATH, "r") as f:
    file = f.read()

print(f"Input file: {OBSOLETE_DAT_FILEPATH} \n")

obs_dict = {}

for line in file.split("\n"):
    if line.startswith("OBSLTE"):
        #date = line[9:19].strip().upper()
        obsolete_id = line[19:24].strip().upper()
        replacement_ids = line[29:].strip().upper()

        # split() without arguments tolerates repeated spaces and returns [] for an empty string
        obs_dict[obsolete_id] = replacement_ids.split()


# Correct file

def correct(iteration=1):
    # Step 1: replace every successor that is itself obsolete with its own successors
    count = 0
    for obs in obs_dict.keys():
        replacement_ids = obs_dict[obs]
        updated_ids = []
        for ri in replacement_ids:
            if ri in obs_dict.keys():
                updated_ids += obs_dict[ri]
                count += 1
            else:
                updated_ids.append(ri)
        obs_dict[obs] = updated_ids

    print(f"Iteration {iteration}")
    print("---------------------")
    print(f"Processed ids: {len(obs_dict)}")
    print(f"Modified ids:  {count} \n")

    # Repeat until no chained successor is left.
    # The iteration limit is a safety net (e.g. against cyclic references).
    if count > 0 and iteration < 10:
        correct(iteration+1)
    else:
        # Step 2: drop successors whose mmCIF file is missing from the local database
        count = 0
        removed_ids = []
        for obs in obs_dict.keys():
            replacement_ids = obs_dict[obs]
            updated_ids = []
            for ri in replacement_ids:
                count += 1
                cif_filename = f'{ri.lower()}.cif'
                cif_filepath = os.path.normpath(f'{MMCIF_PATH}/{cif_filename}')
                if os.path.isfile(cif_filepath):
                    updated_ids.append(ri)
                else:
                    # id will be removed if corresponding file is not found in the local MMCIF_PATH database
                    removed_ids.append(ri)
            obs_dict[obs] = updated_ids

        print("Local database check")
        print("---------------------")
        print(f"Processed ids: {count}")
        print(f"Removed ids:   {len(removed_ids)} ({', '.join(removed_ids)}) \n")


correct()

# Reconvert and export

# All entries will have today's date to indicate that the file was modified.
# upper() matches the upper-case month abbreviations of the original file (e.g. 31-JUL-94).
formatted_date = datetime.now().strftime("%d-%b-%y").upper()

new_obsolete_file = file.split("\n")[0] # Copy header from original file
for obs in obs_dict.keys():
    # Add lines
    new_obsolete_file += "\n"
    new_obsolete_file += f"OBSLTE    {formatted_date} {obs}     {' '.join(obs_dict[obs])}"

output_filepath = os.path.abspath("./obsolete_new.dat")

with open(output_filepath, "w") as f:
    f.write(new_obsolete_file)

print(f"Output file: {output_filepath}")
```
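Before using the new file, it may be worth confirming that no successor listed in it is itself obsolete. The following is a minimal sketch of such a check; the function `find_chained_successors` is hypothetical, and the two inline strings stand in for the contents of the original and corrected files.

```python
def find_chained_successors(text):
    """Return (obsolete_id, successor_id) pairs whose successor is itself listed as obsolete."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "OBSLTE":
            mapping[fields[2].upper()] = [s.upper() for s in fields[3:]]
    return [
        (obs, succ)
        for obs, successors in mapping.items()
        for succ in successors
        if succ in mapping
    ]


# Stand-ins for the file contents before and after the correction.
before = "OBSLTE    08-MAR-23 4N1V     5CVX\nOBSLTE    22-JUN-16 5CVX     5L8Z\n"
after = "OBSLTE    08-MAR-23 4N1V     5L8Z\nOBSLTE    22-JUN-16 5CVX     5L8Z\n"

print(find_chained_successors(before))  # [('4N1V', '5CVX')]
print(find_chained_successors(after))   # []
```

To check the real output, read `obsolete_new.dat` and pass its contents to the function; an empty list means no chains remain. The corrected file can then be given to AlphaFold in place of the original, for example through the `--obsolete_pdbs_path` flag of `run_alphafold.py`, assuming your installation exposes that flag.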