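# 1. AF-Cluster Overview
AF-Cluster is a method that works alongside AlphaFold2 to generate multiple plausible conformations for a single protein sequence. It was developed by a research team at Brandeis University and Harvard University. Its core idea is to cluster the sequences of the protein's multiple sequence alignment (MSA) by sequence similarity, which extracts subsets of sequences with different conformational preferences. Each subset is then given to AlphaFold2 as its input MSA, so that one protein sequence yields several predicted conformations.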



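# 2. Differences from Static Structure Predictors

Conventional static structure predictors usually output a single conformation per protein, so they are poorly suited to proteins that adopt several conformations, for example when distinguishing the apo and holo structures of a protein. AF-Cluster takes a different approach: it clusters the sequences in the protein's MSA by similarity, treats each cluster as carrying its own structural preference, and changes the MSA input to AlphaFold2 accordingly. The resulting set of predictions forms a conformational ensemble for the same protein.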

# 3. Computational Workflow

## 3.1 MSA Data Preparation
Generate the MSA with the ColabFold AlphaFold2 notebook; this yields an a3m file.
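An a3m file is a FASTA-like alignment in which lowercase letters mark insertions relative to the query. As a minimal sketch, the file can be loaded and its insertions stripped so that every row has the query's length. The helper name `read_a3m` is hypothetical (it is not part of ColabFold or AF-Cluster), and treating lines that start with `#` as comments is an assumption about the file layout.

```python
def read_a3m(text):
    """Parse a3m text into (headers, aligned_sequences).

    Hypothetical helper for illustration only. Lowercase letters
    (insertions relative to the query) are removed so that all rows
    share the query's length.
    """
    headers, seqs = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # assumption: blank and '#' lines carry no sequence data
        if line.startswith(">"):
            headers.append(line[1:])
            seqs.append("")
        elif seqs:
            # a sequence may wrap over several lines; drop lowercase insertions
            seqs[-1] += "".join(c for c in line if not c.islower())
    return headers, seqs


# Toy alignment, not real data
example = """>query
MKVL
>hit1
MK-aL
>hit2
M-VL
"""
headers, seqs = read_a3m(example)
print(headers, seqs)
```

The first entry is taken to be the query, which matches the assumption made later by `make_processed_feature_dict`.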

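## 3.2 Environment Setup
In Colab, install the required Python packages and download the AlphaFold2 source code and model parameters used for structure prediction.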



In [None]:
%%bash

# Download and unpack the AlphaFold2 model parameters once
if [ ! -d params ]; then
  mkdir params
  curl -fsSL https://storage.googleapis.com/alphafold/alphafold_params_2021-07-14.tar | tar x -C params
fi

# Fetch the AF-Cluster scripts
if [ ! -d AFcluster ]; then
  git clone https://github.com/HWaymentSteele/AFcluster.git
fi

# Fetch the AlphaFold2 source and install its Python dependencies
if [ ! -d alphafold ]; then
  git clone https://github.com/deepmind/alphafold.git
  pip -q install ml-collections dm-haiku biopython
fi

# Directory for the predicted PDB files
if [ ! -d output ]; then
  mkdir output
fi

## 3.3 Define the Functions Needed to Run AlphaFold2

In [None]:

import sys
import os
import argparse
import hashlib
import jax
import jax.numpy as jnp
import numpy as np
import re
import subprocess
from glob import glob

sys.path.append('alphafold')

from alphafold.model import model, config, data
from alphafold.data import parsers, pipeline
from alphafold.common import protein

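"""
Create an AlphaFold model runner
model_num -- Which pTM model to load parameters for (1-5); resolved to 'model_<model_num>_ptm'
recycles -- Number of recycling iterations
deterministic -- If True, disable masked-MSA replacement and run the model deterministically
"""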

def make_model_runner(model_num=3, recycles=1, deterministic=True):
  model_name = 'model_%d_ptm' % model_num
  cfg = config.model_config(model_name)      

  cfg.data.common.num_recycle = recycles
  cfg.model.num_recycle = recycles
  cfg.data.eval.num_ensemble = 1
  if deterministic:
    cfg.data.eval.masked_msa_replace_fraction = 0.0
    cfg.model.global_config.deterministic = True
  params = data.get_model_haiku_params(model_name, '.')

  return model.RunModel(cfg, params)

def make_processed_feature_dict(runner, a3m_file, name="test", seed=0):
  feature_dict = {}

  # assumes sequence is first entry in msa
  with open(a3m_file,'r') as msa_fil:
    sequence = msa_fil.read().splitlines()[1].strip()

  feature_dict.update(pipeline.make_sequence_features(sequence, name, len(sequence)))

  with open(a3m_file,'r') as msa_fil:
    msa = pipeline.parsers.parse_a3m(msa_fil.read())

  feature_dict.update(pipeline.make_msa_features([msa]))
  processed_feature_dict = runner.process_features(feature_dict, random_seed=seed)

  return processed_feature_dict

"""
Package AlphaFold's output into an easy-to-use dictionary
prediction_result - output from running AlphaFold on an input dictionary
processed_feature_dict -- The dictionary passed to AlphaFold as input. Returned by `make_processed_feature_dict`.
"""
def parse_results(prediction_result, processed_feature_dict):
  b_factors = prediction_result['plddt'][:,None] * prediction_result['structure_module']['final_atom_mask']  
  dist_bins = jax.numpy.append(0,prediction_result["distogram"]["bin_edges"])
  dist_mtx = dist_bins[prediction_result["distogram"]["logits"].argmax(-1)]
  contact_mtx = jax.nn.softmax(prediction_result["distogram"]["logits"])[:,:,dist_bins < 8].sum(-1)

  out = {"unrelaxed_protein": protein.from_prediction(processed_feature_dict, prediction_result, b_factors=b_factors),
        "plddt": prediction_result['plddt'],
        "pLDDT": prediction_result['plddt'].mean(),
        "dists": dist_mtx,
        "adj": contact_mtx}

  out.update({"pae": prediction_result['predicted_aligned_error'],
              "pTMscore": prediction_result['ptm']})
  return out

def write_results(result, pdb_out_path):
  plddt = float(result['pLDDT'])
  ptm = float(result["pTMscore"])
  print('plddt: %.3f' % plddt)
  print('ptm  : %.3f' % ptm)

  pdb_lines = protein.to_pdb(result["unrelaxed_protein"])
  with open(pdb_out_path, 'w') as f:
    f.write(pdb_lines)
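The contact map in `parse_results` comes from the distogram: the logits are turned into probabilities with a softmax, and the probabilities of all bins whose lower edge is below 8 Å are summed. The sketch below reproduces that arithmetic for a single residue pair in plain Python. The function name `contact_probability` and the toy bin edges and logits are illustrative assumptions, not values from AlphaFold2.

```python
import math


def contact_probability(logits, bin_edges, cutoff=8.0):
    """Probability that one residue pair is closer than `cutoff` angstroms.

    Mirrors parse_results: a 0 is prepended to the bin edges so that each
    bin is labelled by its lower edge, then the softmax mass of the bins
    with lower edge < cutoff is summed.
    """
    lower_edges = [0.0] + list(bin_edges)
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return sum(e for e, edge in zip(exps, lower_edges) if edge < cutoff) / total


# Toy distogram with 4 bins whose lower edges are 0, 4, 8 and 12 angstroms
toy_edges = [4.0, 8.0, 12.0]
uniform = contact_probability([0.0, 0.0, 0.0, 0.0], toy_edges)
confident = contact_probability([10.0, 0.0, 0.0, 0.0], toy_edges)
print(uniform, confident)
```

With uniform logits, half of the probability mass lies in the two bins below 8 Å; a large logit on the first bin pushes the contact probability toward 1.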

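## 3.4 Clustering the MSA
Cluster the sequences of the MSA. Each cluster corresponds to one possible conformational preference and is used to steer the AlphaFold2 prediction.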
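The eps scan and the "not clustered" counts in the log of this section, together with the `kaib_dbscan_msas` path that appears in section 3.5, suggest density-based clustering (DBSCAN). The sketch below illustrates the idea in plain Python; it is not the code in `ClusterMSA.py`. It assumes sequences are compared through their one-hot encodings (with the gap as its own symbol), in which case each mismatched column adds 2 to the squared Euclidean distance. The function names, `min_samples=3`, `eps=2.5` and the toy sequences are all illustrative assumptions; the 25% gap threshold is the one reported in the log.

```python
import math


def gap_fraction(seq):
    """Fraction of alignment columns that are gaps in this row."""
    return seq.count("-") / len(seq)


def onehot_distance(a, b):
    """Euclidean distance between the one-hot encodings of two aligned rows."""
    mismatches = sum(x != y for x, y in zip(a, b))
    return math.sqrt(2 * mismatches)


def dbscan(seqs, eps, min_samples=3):
    """Minimal DBSCAN; returns one label per sequence, -1 means not clustered."""
    n = len(seqs)
    neighbors = [[j for j in range(n) if onehot_distance(seqs[i], seqs[j]) <= eps]
                 for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise for now; may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # core point: keep growing the cluster
    return labels


# Toy aligned rows, not real data
rows = ["AAAAAA", "AAAAAC", "AAAACA", "GGGGGG", "GGGGGT", "GGGGTG", "WY-KLM", "A---AA"]
kept = [s for s in rows if gap_fraction(s) <= 0.25]  # drop rows with more than 25% gaps
labels = dbscan(kept, eps=2.5)
print(kept, labels)
```

In this toy case the gap-rich row is filtered out, the two families form separate clusters, and the outlier is left unclustered. Each cluster would then be written out as its own a3m file with the query sequence first.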

In [None]:
!pip -q install polyleven

!python AFcluster/scripts/ClusterMSA.py EX -i 1HAK_ab609.a3m -o subsampled_MSAs

EX
2581 seqs removed for containing more than 25% gaps, 14415 remaining
eps	n_clusters	n_not_clustered
3.00	1	3604
3.50	1	3604
4.00	2	3601
4.50	1	3604
5.00	1	3604
5.50	4	3591
6.00	6	3587
6.50	5	3590
7.00	3	3596
7.50	6	3579
8.00	5	3574
8.50	5	3572
9.00	7	3565
9.50	8	3555
10.00	19	3523
10.50	29	3467
11.00	76	3202
11.50	64	2603
12.00	50	1829
12.50	24	905
13.00	6	222
13.50	2	24
14.00	1	0
Selected eps=11.00
14415 total seqs
599 clusters, 9535 of 14415 not clustered (0.66)
avg identity to query of unclustered: 0.35
avg identity to query of clustered: 0.40
writing 10 size-10 uniformly sampled clusters
writing 10 size-100 uniformly sampled clusters
wrote clustering data to subsampled_MSAs/EX_clustering_assignments.tsv
wrote cluster metadata to subsampled_MSAs/EX_cluster_metadata.tsv
Saved this output to EX.log
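In the scan above, the selected value eps=11.00 is the row with the largest `n_clusters` (76). This is consistent with choosing the eps that yields the most clusters, although that rule is an inference from the log rather than something stated in it. The sketch below applies the rule to the scan rows copied from the log and recomputes the reported unclustered fraction. The names `parse_scan` and `select_eps` are hypothetical.

```python
# Rows copied from the eps scan in the log: eps, n_clusters, n_not_clustered
SCAN_LOG = """\
3.00 1 3604
3.50 1 3604
4.00 2 3601
4.50 1 3604
5.00 1 3604
5.50 4 3591
6.00 6 3587
6.50 5 3590
7.00 3 3596
7.50 6 3579
8.00 5 3574
8.50 5 3572
9.00 7 3565
9.50 8 3555
10.00 19 3523
10.50 29 3467
11.00 76 3202
11.50 64 2603
12.00 50 1829
12.50 24 905
13.00 6 222
13.50 2 24
14.00 1 0
"""


def parse_scan(text):
    """Turn whitespace-separated scan rows into (eps, n_clusters, n_not_clustered)."""
    rows = []
    for line in text.splitlines():
        eps, n_clusters, n_not_clustered = line.split()
        rows.append((float(eps), int(n_clusters), int(n_not_clustered)))
    return rows


def select_eps(rows):
    """Assumed rule: pick the eps that produces the most clusters."""
    return max(rows, key=lambda row: row[1])[0]


best_eps = select_eps(parse_scan(SCAN_LOG))
unclustered_fraction = 9535 / 14415  # counts reported in the log
print(best_eps, round(unclustered_fraction, 2))
```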

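## 3.5 Structure Prediction
Feed each clustered a3m file from the previous step into AlphaFold2 to obtain the conformational ensemble. Each a3m file yields one PDB structure.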

In [None]:
n_recycles = 3
model_number = 3
seed = 0
name = 'KaiB_TE'  # label stored in the sequence features

runner = make_model_runner(model_num=model_number, recycles=n_recycles)

#subsampled_msas = glob('AFcluster/data_sep2022/00_KaiB/kaib_dbscan_msas/*a3m')
# This pattern matches only EX_300.a3m to EX_399.a3m; widen it to predict other clusters
subsampled_msas = sorted(glob('subsampled_MSAs/EX_3[0-9][0-9].a3m'))

# Predict one structure per cluster MSA and save it under output/
for fil in subsampled_msas:
  print(fil)
  features = make_processed_feature_dict(runner, fil, name=name, seed=seed)
  result = parse_results(runner.predict(features, random_seed=seed), features)
  write_results(result, 'output/' + os.path.basename(fil).replace('.a3m','.pdb'))
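Because `parse_results` passes the per-residue pLDDT as B-factors, each PDB file written above carries its own confidence values, and the ensemble can be ranked without rerunning the model. The sketch below reads the B-factor column of the CA atoms and averages it. It assumes the standard fixed-column PDB layout (atom name in columns 13-16, B-factor in columns 61-66). The helper names and the toy atom records are hypothetical.

```python
def mean_ca_plddt(pdb_text):
    """Average the B-factor column (pLDDT) over CA atoms of a PDB string."""
    values = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    return sum(values) / len(values)


def atom_line(serial, name, resname, resseq, bfactor):
    """Build a fixed-column ATOM record for the toy example below."""
    return ("ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f%6.2f%6.2f"
            % (serial, name, resname, resseq, 0.0, 0.0, 0.0, 1.0, bfactor))


# Toy structure: two residues, pLDDT 90 and 70 (not real predictions)
toy_pdb = "\n".join([
    atom_line(1, "N", "MET", 1, 90.0),
    atom_line(2, "CA", "MET", 1, 90.0),
    atom_line(3, "N", "LYS", 2, 70.0),
    atom_line(4, "CA", "LYS", 2, 70.0),
])
toy_mean = mean_ca_plddt(toy_pdb)
print(toy_mean)
```

To rank real predictions, the same function could be applied to the text of every file matched by `glob('output/*.pdb')`, sorting the files by the returned value.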

# 4. Results
The PDB structures in the folder https://github.com/kinggmars/Pyrophosphokinase-/tree/main/data/predictions/af-cluster/pdb were obtained by running structure prediction on the 599 MSA clusters produced by the clustering step.