# Convert CSV to AF3 JSON

This repo targets **protein-peptide complex** prediction.

A complex can contain multiple protein chains, each of which must contain only natural amino acids.\
Each complex should include exactly one peptide, which may contain both natural and unnatural amino acids.

Before starting, run `build_smi2ccd.py`:
```bash
python build_smi2ccd.py
```
You can ignore any errors or warnings; just make sure that a `smi2ccd.json` file appears after the script finishes.
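
If you prefer to check this from Python, the sketch below confirms that the file exists and parses as JSON. The helper name `check_smi2ccd` is hypothetical, and nothing is assumed about the file's contents beyond it being a JSON object or list.

```python
import json
from pathlib import Path


def check_smi2ccd(path="smi2ccd.json"):
    """Return the number of top-level entries if the file exists and is valid JSON."""
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"{path} not found; re-run build_smi2ccd.py")
    with file.open() as handle:
        data = json.load(handle)  # raises json.JSONDecodeError on a broken file
    return len(data)
```

Calling `check_smi2ccd()` in the repo folder after the build step should return a positive count if the script produced a usable file.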

## Peptides with Only Natural Amino Acids

In [None]:
import pandas as pd
from other2afjson import *

output_folder = "af_json"
csv_file = "cplx_candidates.csv"

df = pd.read_csv(csv_file)
df_natural = df[df['ID']==1]
df_natural

Unnamed: 0,ID,chainA,chainB,P1,P2,P3,P4,P5,P6,P7,...,P10,P11,P12,P13,linker1,pos_cyclic1,pos_linker1,linker2,pos_cyclic2,pos_linker2
0,1,VSGWLGPQQYLSYNSLRGEAEPCGAWVWENQVSWYWEKETTDLRIK...,VEHSDLSFSKD,,,,,,,,...,,,,,,,,,,


Put each protein and peptide sequence in its own `chainX` column (`chainA`, `chainB`, ...). Each `chainX` entry is treated as a single chain that contains only natural amino acids.

In [None]:
dataframe2afjsons(df_natural, output_folder="af_json")
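
Since every `chainX` entry must contain only natural amino acids, it can help to screen the sequences before conversion. This is a minimal sketch that is not part of the repo; the helper name `non_natural_positions` is hypothetical, and it assumes one-letter codes restricted to the 20 standard residues.

```python
NATURAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard one-letter codes


def non_natural_positions(sequence):
    """Return (1-based position, letter) for every letter outside the standard 20."""
    return [
        (index, letter)
        for index, letter in enumerate(sequence.upper(), start=1)
        if letter not in NATURAL_AA
    ]


print(non_natural_positions("VEHSDLSFSKD"))  # []
print(non_natural_positions("VEHXDL"))       # [(4, 'X')]
```

An empty list means the chain can stay in a `chainX` column; anything else suggests moving that peptide to the `P#` columns described in the next section.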

## Peptides with Unnatural Amino Acids

### 1. Linear peptides

In [3]:
df_uaa_linear = df[df['ID']==2]
df_uaa_linear

Unnamed: 0,ID,chainA,chainB,P1,P2,P3,P4,P5,P6,P7,...,P10,P11,P12,P13,linker1,pos_cyclic1,pos_linker1,linker2,pos_cyclic2,pos_linker2
1,2,VSGWLGPQQYLSYNSLRGEAEPCGAWVWENQVSWYWEKETTDLRIK...,,ACE,Ala,Cys,Phe,Ala,CUSTOM,Asp,...,Val,Ala,Pro,NH2,,,,,,


If the peptide contains unnatural amino acids, list its residues one per column under `P#`: use the 3-letter code for natural residues and your custom abbreviation for unnatural ones.
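
If you start from a one-letter sequence, the natural part of the `P#` columns can be filled in programmatically. The sketch below is only an illustration: the helper `to_p_columns` is hypothetical, and the title-case spelling (`Ala`, `Cys`) simply follows the example table above.

```python
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_p_columns(sequence, start=1):
    """Map a one-letter sequence to {"P<start>": "Ala", ...} column entries."""
    return {
        f"P{start + offset}": ONE_TO_THREE[letter]
        for offset, letter in enumerate(sequence.upper())
    }


# P1 holds the ACE cap in the example row, so the natural stretch starts at P2.
print(to_p_columns("ACFA", start=2))
# {'P2': 'Ala', 'P3': 'Cys', 'P4': 'Phe', 'P5': 'Ala'}
```

Caps and unnatural residues (such as `ACE`, `NH2`, or a custom abbreviation) still need to be placed in their columns by hand.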

Some abbreviations may already exist among AF3's CCD codes (such as ACE and NH2 here). If you are not sure whether your custom residue already has a CCD code, you can look it up with the function below and update your CSV accordingly.

In [None]:
custom_smiles = "CN[C@@H](CC(C)C)C(=O)O" # Me-Leu
find_ccd(custom_smiles) # MLE

If nothing shows up (or you would rather skip the lookup), add your custom abbreviation and its SMILES to `lookuptable.csv`. And then...

In [None]:
dataframe2afjsons(df_uaa_linear, lookuptable="lookuptable.csv", output_folder="af_json")
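
Adding rows to `lookuptable.csv` can also be scripted. The sketch below assumes the table has exactly two columns, `CCD` and `smiles`, in that order (these are the column names read later in this notebook), and that an existing file ends with a newline. The helper `add_lookup_entry` is hypothetical.

```python
import csv
from pathlib import Path


def add_lookup_entry(path, ccd, smiles):
    """Append one (CCD, smiles) row, writing the header first if the file is new."""
    file = Path(path)
    is_new = not file.exists() or file.stat().st_size == 0
    with file.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["CCD", "smiles"])
        if is_new:
            writer.writeheader()
        writer.writerow({"CCD": ccd, "smiles": smiles})
```

For example, `add_lookup_entry("lookuptable.csv", "MYAA", "CN[C@@H](CC(C)C)C(=O)O")` would register the illustrative abbreviation `MYAA` (a made-up name) for the SMILES shown above.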

### 2. Cyclic Peptides

For cyclic peptides, we use PTM (modified-residue) entries together with `bondedAtomPairs` in the AF3 JSON to mimic cyclization.

In [4]:
df_uaa_cyclic = df[df['ID']==3]
df_uaa_cyclic

Unnamed: 0,ID,chainA,chainB,P1,P2,P3,P4,P5,P6,P7,...,P10,P11,P12,P13,linker1,pos_cyclic1,pos_linker1,linker2,pos_cyclic2,pos_linker2
2,3,VSGWLGPQQYLSYNSLRGEAEPCGAWVWENQVSWYWEKETTDLRIK...,,ACE,Ala,Cys,Phe,Ala,Pro,Asp,...,Val,Ala,Pro,NH2,SS,"2|CB,11|CB","1|S1,1|S2",,,


Take the simplest case, a disulfide, as an example:
- linker#: the custom abbreviation for your linker (it must also be added to `lookuptable.csv`)
- pos_cyclic#: specifies the position and atom on the peptide where the ring closes. Here we replace CYS with ALA and model the disulfide as a separate linker
- pos_linker#: specifies the position and atom on the linker where the ring closes

pos_cyclic: 2|CB means the cyclization point is the CB atom of ALA@P2\
pos_linker: 1|S1 means the cyclization point is the S1 atom of SS@P1; the linker chain contains only one component, so the position is always 1\
2|CB,11|CB and 1|S1,1|S2 together mean that CB of ALA@P2 connects to S1 of the SS linker, and CB of ALA@P11 connects to S2 of the SS linker
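
To make the pairing rule concrete, the sketch below parses the two columns and pairs them entry by entry in the `[chain, position, atom]` layout that AF3 uses for `bondedAtomPairs`. It is an illustration, not the repo's implementation: the function names and the chain IDs `"P"` and `"L"` are hypothetical, and it assumes the CSV positions are passed through unchanged.

```python
def parse_positions(spec):
    """Turn "2|CB,11|CB" into [(2, "CB"), (11, "CB")]."""
    pairs = []
    for item in spec.split(","):
        position, atom = item.strip().split("|")
        pairs.append((int(position), atom))
    return pairs


def bonded_atom_pairs(pos_cyclic, pos_linker, peptide_id="P", linker_id="L"):
    """Pair the i-th peptide atom with the i-th linker atom."""
    peptide_atoms = parse_positions(pos_cyclic)
    linker_atoms = parse_positions(pos_linker)
    if len(peptide_atoms) != len(linker_atoms):
        raise ValueError("pos_cyclic and pos_linker need the same number of entries")
    return [
        [[peptide_id, p_pos, p_atom], [linker_id, l_pos, l_atom]]
        for (p_pos, p_atom), (l_pos, l_atom) in zip(peptide_atoms, linker_atoms)
    ]


print(bonded_atom_pairs("2|CB,11|CB", "1|S1,1|S2"))
# [[['P', 2, 'CB'], ['L', 1, 'S1']], [['P', 11, 'CB'], ['L', 1, 'S2']]]
```

The length check catches the most common mistake, which is listing a different number of atoms in the two columns.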

How do you find the atom names?

In [None]:
lookuptable = "lookuptable.csv"
ccd_folder = "CCD"
lookup_df = pd.read_csv(lookuptable)
lookup_dict = dict(zip(lookup_df['CCD'], lookup_df['smiles']))

for name, smiles in lookup_dict.items():
    output_cif_file = f"{ccd_folder}/{name}.cif"
    smiles2cif(smiles, output_cif_file, name)

Then open the CIF file in PyMOL and label the object by atom name. The atom names shown are the ones to use here.
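
If PyMOL is not at hand, the atom names can also be read from the CIF text directly. The sketch below is a rough stand-in for a real mmCIF parser: it assumes the file has a `loop_` block with `_chem_comp_atom.` tags and whitespace-separated rows, which may not match the exact output of `smiles2cif`. The sample text is hand-written for illustration.

```python
def atom_names_from_cif(cif_text):
    """Collect the atom_id column of the _chem_comp_atom loop."""
    headers, names = [], []
    in_loop = False
    for raw_line in cif_text.splitlines():
        line = raw_line.strip()
        if line.startswith("_chem_comp_atom."):
            headers.append(line.split(".", 1)[1])  # e.g. "atom_id"
            in_loop = True
        elif in_loop:
            # The data rows end at a blank line, a comment, or the next block.
            if not line or line.startswith(("#", "loop_", "_")):
                break
            fields = line.split()
            names.append(fields[headers.index("atom_id")].strip('"'))
    return names


SAMPLE_CIF = """data_SS
#
loop_
_chem_comp_atom.comp_id
_chem_comp_atom.atom_id
_chem_comp_atom.type_symbol
SS S1 S
SS S2 S
#
"""

print(atom_names_from_cif(SAMPLE_CIF))  # ['S1', 'S2']
```

For a generated file, something like `atom_names_from_cif(open("CCD/SS.cif").read())` would list the names to use in `pos_linker#`.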

And then...

In [None]:
dataframe2afjsons(df_uaa_cyclic, lookuptable="lookuptable.csv", output_folder="af_json")

What about bicyclic peptides?\
You just need to extend the columns by adding `linker2,pos_cyclic2,pos_linker2`.
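
The numbered-column convention can be read back with a small loop. This is a sketch under the assumption that linker columns are numbered consecutively from 1; the helper `collect_linkers` is hypothetical, and the example values come from the bicyclic row (ID 4) shown in the next section.

```python
def collect_linkers(row):
    """Return [(linker, pos_cyclic, pos_linker), ...] for linker1, linker2, ..."""
    linkers = []
    index = 1
    while f"linker{index}" in row:
        name = row[f"linker{index}"]
        # Empty CSV cells come back from pandas as NaN (a float), so keep strings only.
        if isinstance(name, str) and name:
            linkers.append(
                (name, row[f"pos_cyclic{index}"], row[f"pos_linker{index}"])
            )
        index += 1
    return linkers


row = {
    "linker1": "SS", "pos_cyclic1": "2|CB,11|CB", "pos_linker1": "1|S1,1|S2",
    "linker2": "DABE", "pos_cyclic2": "5|CB,9|CB", "pos_linker2": "1|C1,1|C4",
}
print(collect_linkers(row))
# [('SS', '2|CB,11|CB', '1|S1,1|S2'), ('DABE', '5|CB,9|CB', '1|C1,1|C4')]
```

The same function accepts a plain dict or a pandas row, since both support `in` and `[]` lookups by column name.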

## Update MSA Template

For convenience, once you have a JSON file that contains an MSA, you can reuse it. This is especially useful when designing different peptides against the same protein target.
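
Before reusing a JSON file as a template, you may want to confirm which protein chains actually carry an MSA. The sketch below inspects the `unpairedMsa` and `unpairedMsaPath` fields of each protein entry in an AF3 input JSON; the helper `chains_with_msa` and the tiny sample document are hypothetical.

```python
import json


def chains_with_msa(af3_json_text):
    """Map each protein chain ID to True if it has a non-empty unpaired MSA."""
    data = json.loads(af3_json_text)
    result = {}
    for entry in data.get("sequences", []):
        protein = entry.get("protein")
        if protein is None:
            continue  # skip ligand, RNA, and DNA entries
        ids = protein["id"]
        ids = [ids] if isinstance(ids, str) else ids  # "id" may be a string or a list
        has_msa = bool(protein.get("unpairedMsa")) or bool(protein.get("unpairedMsaPath"))
        for chain_id in ids:
            result[chain_id] = has_msa
    return result


SAMPLE_JSON = json.dumps({
    "name": "demo",
    "sequences": [
        {"protein": {"id": "A", "sequence": "VSGW", "unpairedMsa": ">query\nVSGW\n"}},
        {"protein": {"id": "B", "sequence": "VEHS"}},
        {"ligand": {"id": "L", "ccdCodes": ["SS"]}},
    ],
})

print(chains_with_msa(SAMPLE_JSON))  # {'A': True, 'B': False}
```

A chain that reports `False` would still need an MSA search if the file were used as is.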

If your data share the same proteins but have different peptides (of unequal lengths):

In [5]:
df_same_prots = df[df["ID"].isin([4,5])]
df_same_prots

Unnamed: 0,ID,chainA,chainB,P1,P2,P3,P4,P5,P6,P7,...,P10,P11,P12,P13,linker1,pos_cyclic1,pos_linker1,linker2,pos_cyclic2,pos_linker2
3,4,VSGWLGPQQYLSYNSLRGEAEPCGAWVWENQVSWYWEKETTDLRIK...,IQRTPKIQVYSRHPAENGKSNFLNCYVSGFHPSDIEVDLLKNGERI...,ACE,Ala,Cys,Phe,Ala,DAL,Asp,...,Val,Ala,Pro,NH2,SS,"2|CB,11|CB","1|S1,1|S2",DABE,"5|CB,9|CB","1|C1,1|C4"


You may provide a single MSA template file that contains only chainA and chainB.

In [None]:
dataframe2afjsons(df_same_prots, template_file=None, lookuptable="lookuptable.csv", output_folder="af_json")

If the peptides all have the same length, you can set `mut_peptides` to `True`.

If each entry in the CSV file has its own template file, you can provide a list of template files, in the same order as the entries.
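
Because the template list is matched to the entries by position, a quick check before conversion can catch a missing or extra file. This is a minimal sketch; the helper `pair_templates` and the file names are hypothetical, and it does not call any function from this repo.

```python
def pair_templates(entry_ids, template_files):
    """Pair each CSV entry with its template file, in row order."""
    if len(entry_ids) != len(template_files):
        raise ValueError(
            f"{len(entry_ids)} entries but {len(template_files)} template files"
        )
    return list(zip(entry_ids, template_files))


# Hypothetical file names, listed in the same order as the CSV rows.
print(pair_templates([4, 5], ["msa_4.json", "msa_5.json"]))
# [(4, 'msa_4.json'), (5, 'msa_5.json')]
```

With a DataFrame, the IDs could be taken from `df_same_prots["ID"].tolist()` so that the pairs can be inspected before the list is passed on.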