This notebook takes the raw datasets and curates them into ready-to-use inputs for the model.

You can preprocess your own dataset the same way we preprocessed the Voldborg data.

The Bojar rules data should stay the same if you wish to use our model. 

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as st


## Bojar rules

In [2]:
# Load the lectin binding rules (motif-associated p-values, Table S4) from the Bojar paper
bojar_rule = pd.read_excel('../Data/S4_Motif-associated p values.xlsx')
# Skip the first four rows; the first column (lectin names) becomes the index
index = bojar_rule[bojar_rule.columns[0]].to_numpy()[4:]
bojar_rule = bojar_rule.loc[bojar_rule.index[4:],bojar_rule.columns[1:]]
bojar_rule.index = index
# Load the corrected motif names; the original names contain stray characters
motif_names = pd.read_excel("../Data/Motifs_name.xlsx",index_col=0)
bojar_rule.columns = motif_names.index
bojar_rule.columns.name = None
bojar_rule

Unnamed: 0,Terminal GlcNAca,Terminal type-II LacdiNAc,Thr-tail,3- O sulfate,Core 6 O-link,Core 4 O-link,Internal Type I LacNAc,Terminal GlcNAcb,Type 1 H,Terminal Manb,...,Blood group A,Core Fucose,Forssman antigen,Terminal Mana,"a1,2 Fuc",Terminal Man1-6,terminal Type II LacNAc,Terminal GalNAca,Biantennary,Man6
AAA_EY,0.056695,0.056695,0.056695,0.228909,0.228909,0.228909,0.242261,0.242261,0.461527,0.797154,...,1,1,1,1,1,1,1,1,1,1
AAL_Vector,1,1,1,0.063158,0.278215,1,0.155415,1,0.072717,1,...,1,0.000262,1,1,0.003972,1,1,1,1,1
ABA_EY,0.17729,1,0.004589,0.497016,0.497016,0.497016,0.17729,0.002898,0.497016,0.797154,...,1,1,1,1,1,1,1,1,1,1
ACL_Vector,1,1,0.086059,0.05887,0.074013,0.856496,0.111732,1,0.636552,1,...,1,1,1,1,1,1,1,1,1,1
AIA_EY,1,1,0.015648,0.375831,0.871812,0.822073,0.375831,1,0.613195,0.613195,...,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
WFL_Vector,1,0.116191,1,0.58601,0.516018,0.58601,0.516018,1,1,1,...,1,1,1,1,1,1,0.0,1,1,1
WGA_EY,0.080241,0.080241,0.890132,0.213407,0.104493,0.943062,0.26956,0.104493,1,1,...,1,1,1,1,1,1,1,1,1,1
WGA_SeikagakuBio,0.012874,0.035691,1,0.934777,0.326946,1,1,0.000807,1,1,...,1,1,1,1,1,1,1,1,1,1
WGA_Sigma,0.009322,0.035691,1,0.274634,0.253924,1,0.777179,0.000049,1,1,...,1,1,1,1,1,1,1,1,1,1


In [3]:
# Find and display every lectin/motif pair in bojar_rule whose p-value is exactly 0
for col in bojar_rule.columns:
    if 0 in bojar_rule[col].to_numpy():
        display(bojar_rule.loc[bojar_rule[col]==0,col])

GS-II_Vector    0
Name: Terminal GlcNAcb, dtype: object

SNA_Vector            0
TJA-I_SeikagakuBio    0
Name: Terminal 2,6 NeuAc, dtype: object

AAL_Vector    0
AOL           0
Name: Terminal Fuc, dtype: object

RCAI_EY        0
RCAI_Vector    0
WFL_Vector     0
Name: Terminal b-Gal, dtype: object

GS-I_Vector    0
PA-IL_Sigma    0
Name: Terminal a-Gal, dtype: object

RCAI_EY        0
RCAI_Vector    0
Name: terminal Type II LacNAc, dtype: object

### Check why some p-values are 0
### Terminal GlcNAcb
1. GS-II_Vector: primary rule in Bojar's paper, aligns with the literature (https://www.jbc.org/article/S0021-9258(19)63057-7/pdf)
### Terminal 2,6 NeuAc
1. SNA_Vector: primary rule in Bojar's paper, aligns with the literature (Dugan, Aisling S et al. “Direct correlation between sialic acid binding and infection of cells by two human polyomaviruses (JC virus and BK virus).” Journal of virology vol. 82,5 (2008): 2560-4. doi:10.1128/JVI.02123-07)
2. TJA-I_SeikagakuBio: primary rule in Bojar's paper, aligns with the literature (https://www.tandfonline.com/doi/pdf/10.1080/19420862.2016.1149662)
### Terminal Fuc
1. AAL_Vector: not mentioned in Bojar's paper, but it tolerates Fuca1-6, and an interaction with core Fuc is reported in the literature (https://www.sciencedirect.com/topics/immunology-and-microbiology/aleuria-aurantia)
2. AOL: not mentioned in Bojar's paper, but it tolerates Fuca1-6. The literature shows it is similar to AFL and AAL (https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0083077&type=printable)
### Terminal b-Gal
1. RCAI_EY
2. RCAI_Vector: part of Type II LacNAc; evidence found in the literature (https://pubmed.ncbi.nlm.nih.gov/3424394/)
3. WFL_Vector: only affinity for Terminal GalNAcb was found in the literature, so this zero is probably a mistake (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7873859/#ref36)
### Terminal a-Gal
1. GS-I_Vector: primary rule in Bojar's paper, evidence found in the literature (https://www.rcsb.org/structure/1GNZ)
2. PA-IL_Sigma: primary rule in Bojar's paper, aligns with the literature (https://hal.archives-ouvertes.fr/hal-02554317/document)
### terminal Type II LacNAc
1. RCAI_EY
2. RCAI_Vector: primary rule in Bojar's paper, aligns with the literature (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7175966/)
### Decision
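Most of these lectins do bind the glycan feature for which their p-value is 0. A p-value of exactly 0 would map to an infinite z-score in the conversion below, so set all of them to 0.00001.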
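
The effect of this floor can be checked with a minimal sketch. It uses `statistics.NormalDist` from the Python standard library as a stand-in for `st.norm.ppf`, and the helper names `floor_p_value` and `two_sided_z` are hypothetical, not part of this notebook.

```python
from statistics import NormalDist, StatisticsError

P_FLOOR = 0.00001  # the floor chosen in the decision above


def floor_p_value(p, floor=P_FLOOR):
    """Hypothetical helper: replace an exact zero with a small positive floor."""
    return floor if p == 0 else p


def two_sided_z(p):
    """z-score whose two-sided tail probability is p."""
    return NormalDist().inv_cdf(1 - p / 2)


# Without the floor, p = 0 asks for the quantile at probability 1, which is
# not finite (the standard library raises; scipy's norm.ppf returns inf).
try:
    two_sided_z(0)
    zero_is_finite = True
except StatisticsError:
    zero_is_finite = False

z_floored = two_sided_z(floor_p_value(0))
print(zero_is_finite)       # False
print(round(z_floored, 2))  # 4.42
```

With the floor in place, the strongest rules are capped at a z-score of about 4.42 instead of becoming infinite.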

In [4]:
# Replace the zero p-values in Bojar Table S4 with a small positive floor
# (adjustment justified by Bojar's paper and the literature; details are above)
for col in bojar_rule.columns:
    if 0 in bojar_rule[col].to_numpy():
        bojar_rule.loc[bojar_rule[col]==0,col] = 0.00001

In [5]:
# Convert p-values into z-scores, used as a continuous binding affinity.
# Each p-value is treated as two-sided: z = norm.ppf(1 - p/2)
bojar_rule_z = pd.DataFrame(index = bojar_rule.index)
for col in bojar_rule.columns:
    bojar_rule_z[col] = 1-bojar_rule[col].values[:]/2
    bojar_rule_z[col] = st.norm.ppf(bojar_rule_z[col].values.tolist())
# Save the z-score table (uncomment to write the file)
# bojar_rule_z.to_excel('../Data/Lectin binding rules z-scores.xlsx', index = True)
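
As a sanity check on the conversion, the sketch below applies the same mapping, z = ppf(1 - p/2), to a few scalar p-values and inverts it again. It is an illustration only: it uses the standard library's `statistics.NormalDist` in place of `st.norm.ppf`, and the names `p_to_z` and `z_to_p` are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def p_to_z(p):
    """Map a two-sided p-value in (0, 1] to a non-negative z-score."""
    return _STD_NORMAL.inv_cdf(1 - p / 2)


def z_to_p(z):
    """Inverse mapping: two-sided p-value for a non-negative z-score."""
    return 2 * (1 - _STD_NORMAL.cdf(z))


# Smaller p-values give larger z-scores; p = 1 gives z = 0.
for p in (1.0, 0.05, 0.001):
    z = p_to_z(p)
    print(f"p = {p:<6} -> z = {z:.3f}")
```

A p-value of 1 (no association) maps to z = 0 and p = 0.05 maps to roughly 1.96, so the z-score grows with the strength of the rule.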