# NESP: train data wildtypes AlphaFold predictions exploration
Here I show how to open and use the files in [this](https://www.kaggle.com/datasets/shlomoron/train-wildtypes-af) data set. The data set contains 1186 AlphaFold predictions for 73 of the 78 wildtypes in the training data (several tens of predictions for each wildtype). All predictions are exact matches to the wildtype sequences. The data set is based on the work of @roberthatch (see [here](https://www.kaggle.com/code/roberthatch/novo-train-data-contains-wildtype-groups/notebook)) and @vslaykovsky (see [here](https://www.kaggle.com/code/vslaykovsky/nesp-alphafold-v2-exact-match-data) and [here](https://www.kaggle.com/code/vslaykovsky/nesp-alphafold2-all-close-matches)).

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)

!pip install biopandas -q
from biopandas.pdb import PandasPdb
from biopandas.mmcif import PandasMmcif
!pip install py3Dmol -q
import py3Dmol 

In [None]:
alpha_fold_df = pd.read_csv("../input/train-wildtypes-af/alpha_fold_df.csv")
alpha_fold_df.head()
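
To check how many predictions each wildtype actually has, we can group by the sequence column. Below is a minimal sketch on a toy table. The sequences and ids are made up for illustration; only the column names `af2_sequence` and `af2id` come from the notebook.

```python
import pandas as pd

# Toy stand-in for alpha_fold_df; the sequences and ids are made up.
toy_df = pd.DataFrame({
    "af2_sequence": ["MKT", "MKT", "GAV", "MKT", "GAV"],
    "af2id": ["id_a", "id_b", "id_c", "id_d", "id_e"],
})

# Number of distinct predictions per wildtype sequence, largest group first.
matches_per_wildtype = (
    toy_df.groupby("af2_sequence")["af2id"]
    .nunique()
    .sort_values(ascending=False)
)
print(matches_per_wildtype)
```

Running the same `groupby` on `alpha_fold_df` should give the match count for every wildtype at once.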

## We will explore the matches to the first wildtype. There are 84 of them

In [None]:
wildtypes = list(alpha_fold_df["af2_sequence"].drop_duplicates())
wildtype = wildtypes[0]
print("wildtype:")
print(wildtype)
AF_keys = alpha_fold_df.loc[alpha_fold_df["af2_sequence"] == wildtype, "af2id"]
print("Number of matches to wildtype: " + str(len(AF_keys)))
# Positional indexing: the filtered Series keeps the original row labels
print("AF key: " + AF_keys.iloc[0])

## Confidence file exploration i.e. pLDDT 

In [None]:
confidense_df = pd.read_json("../input/train-wildtypes-af/confidence/" + AF_keys.iloc[0][5:] + "-confidence_v3.json")
confidense_df.head()

In [None]:
plt.plot(confidense_df["confidenceScore"])

In [None]:
# Load the confidence file of every match to this wildtype
confidense_df_list = [
    pd.read_json("../input/train-wildtypes-af/confidence/" + key[5:] + "-confidence_v3.json")
    for key in AF_keys
]
for confidense_df in confidense_df_list:
    plt.plot(confidense_df["confidenceScore"])

### We note that all the score arrays have the same length, so the matches are probably all exact. We will verify later that the residues are the same.

### We will look at the minimum and maximum pLDDT score across all matches at each point

In [None]:
# Stack the per-residue scores into one array of shape (n_matches, n_residues).
# This relies on all matches having the same length, as noted above.
scores = np.array([df["confidenceScore"].to_numpy() for df in confidense_df_list])
min_score = scores.min(axis=0)
max_score = scores.max(axis=0)

plt.plot(min_score, label="min pLDDT")
plt.plot(max_score, label="max pLDDT")
plt.legend()

### So we can see quite a bit of variation. This raises the question of how to decide which prediction is the best one to use.
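
One simple heuristic (an assumption on my part, not something the data set prescribes) is to rank the predictions by their mean pLDDT and prefer the highest. Below is a minimal sketch on toy scores. The function name `rank_by_mean_plddt` and the keys are hypothetical placeholders.

```python
from statistics import fmean


def rank_by_mean_plddt(scores_by_key):
    """Return (key, mean pLDDT) pairs sorted from highest to lowest mean."""
    means = {key: fmean(scores) for key, scores in scores_by_key.items()}
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Toy per-residue scores; the keys are placeholders, not real AlphaFold ids.
toy_scores = {
    "prediction_a": [90.0, 85.0, 70.0],
    "prediction_b": [95.0, 92.0, 88.0],
    "prediction_c": [60.0, 65.0, 55.0],
}

ranking = rank_by_mean_plddt(toy_scores)
best_key, best_mean = ranking[0]
print(best_key, round(best_mean, 2))
```

With the objects from this notebook, the input could be built as `{key: df["confidenceScore"] for key, df in zip(AF_keys, confidense_df_list)}`. A mean hides local differences, so a prediction that ranks lower overall may still be more confident around a particular mutation site.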

## Now we will explore the cif file

The snippets for reading and displaying the structure are courtesy of @cdeotte.

In [None]:

cif_path_list = ["../input/train-wildtypes-af/cif/" + key[5:] + "-model_v3.cif" for key in AF_keys]
atom_df_list = []
for cif_path in cif_path_list:
    # Keep only the ATOM records of each predicted structure
    atom_df = PandasMmcif().read_mmcif(cif_path).df['ATOM']
    atom_df_list.append(atom_df)
atom_df_list[0].head()

In [None]:
# Read the whole cif file of the first match into one string
with open(cif_path_list[0]) as ifile:
    protein = ifile.read()
view = py3Dmol.view(width=800, height=600)
view.addModelsAsFrames(protein)
style = {'cartoon': {'color': 'spectrum'}, 'stick': {}}
view.setStyle({'model': -1}, style)
view.zoom(0.12)
view.rotate(235, {'x': 0, 'y': 1, 'z': 1})
#view.spin({'x':-0.2,'y':0.5,'z':1},1)
view.show()

### And now we will just make sure that the residues are the same for all matches for the explored wildtype:

In [None]:
wildtype_set = set()
for cif_df in atom_df_list:
    wildtype_set.add("".join(list(cif_df["label_comp_id"])))
print(len(wildtype_set))

### So indeed, they are the same.
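
The check above compares per-atom residue names between the predictions. To also compare a prediction against the wildtype string itself, the atom table can be collapsed to a one-letter sequence. Below is a minimal sketch. It assumes the atom table has a `label_seq_id` column with the residue position (the standard mmCIF field, not shown in this notebook), and the helper name `one_letter_sequence` is hypothetical.

```python
# Standard three-letter to one-letter codes for the 20 canonical amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def one_letter_sequence(comp_ids, seq_ids):
    """Collapse per-atom residue names into one letter per residue.

    comp_ids: per-atom three-letter residue names (e.g. label_comp_id).
    seq_ids:  per-atom residue positions (e.g. label_seq_id, assumed present).
    Residues outside the 20 canonical amino acids are written as "X".
    """
    residue_by_position = {}
    for comp_id, seq_id in zip(comp_ids, seq_ids):
        # Keep the first residue name seen at each position.
        residue_by_position.setdefault(int(seq_id), comp_id)
    return "".join(
        THREE_TO_ONE.get(residue_by_position[pos], "X")
        for pos in sorted(residue_by_position)
    )


# Toy atom records: two atoms of MET, three of LYS, one of THR.
toy_comp_ids = ["MET", "MET", "LYS", "LYS", "LYS", "THR"]
toy_seq_ids = [1, 1, 2, 2, 2, 3]
print(one_letter_sequence(toy_comp_ids, toy_seq_ids))
```

For a real table this would be called as `one_letter_sequence(cif_df["label_comp_id"], cif_df["label_seq_id"])`, and the result should equal `wildtype` if the match is exact.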

## Lastly, we will get the mutations in the train data for this wildtype

In [None]:
train_wildtype_groups = pd.read_csv("../input/train-wildtypes-af/train_wildtype_groups.csv")
print(len(train_wildtype_groups))
train_wildtype_groups.head()

In [None]:
mutations = train_wildtype_groups.loc[train_wildtype_groups["wildtype"] == wildtype]
print(len(mutations))
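
To see what each mutation changes, a mutant sequence can be compared with the wildtype position by position. Below is a minimal sketch that handles substitutions only. The helper name `find_substitutions` is hypothetical, and I assume the mutant sequence sits in a column such as `protein_sequence` (the name used in the competition's train.csv, not confirmed for this file).

```python
def find_substitutions(wildtype_seq, mutant_seq):
    """List (position, wildtype residue, mutant residue) for each mismatch.

    Positions are 1-based. Returns None when the lengths differ, because
    insertions and deletions are not handled by this sketch.
    """
    if len(wildtype_seq) != len(mutant_seq):
        return None
    return [
        (position, wt_res, mut_res)
        for position, (wt_res, mut_res) in enumerate(zip(wildtype_seq, mutant_seq), start=1)
        if wt_res != mut_res
    ]


# Toy sequences: a single A -> G substitution at position 4.
print(find_substitutions("MKTAYI", "MKTGYI"))
```

Applied to the `mutations` rows, this would give the substitution position for each mutant, which could then be looked up in the per-residue pLDDT scores.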