<a href="https://colab.research.google.com/github/mlraul/biosensor_predictor/blob/main/Data_codification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Data codification
This notebook presents the functions needed to encode the data so that the network can interpret it.

In [None]:
import numpy as np
import pandas as pd

!pip install kora -q
import kora.install.rdkit
from rdkit import Chem

###Sequence codification
A function that encodes a single sequence using one-hot encoding. It builds a matrix of 1s and 0s with 20 rows, one for each amino acid, and 978 columns, one for each sequence position. The amino acid sequence must be given as input.

In [None]:
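def seq_cod(seq):

  # Row order of the one-hot matrix: one row per amino acid
  aa_list = ["A","C","D","E","F","G","H","I","K","L","M","N","P","Q","R",
             "S","T","V","W","Y"]
  # 20 rows (amino acids) x 978 columns (sequence positions);
  # a sequence longer than 978 residues raises an IndexError
  aam = np.zeros((20, 978), dtype=int)
  for pos, aa in enumerate(seq):
    # "*" and "X" are skipped, leaving an all-zero column at that position
    if aa == "*" or aa == "X":
      continue
    aa_idx = aa_list.index(aa)
    aam[aa_idx][pos] = 1
  # Add a leading axis so encoded sequences can be concatenated later
  aam = aam.reshape((1,20,978))

  return aam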
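
As a hedged illustration of the one-hot layout described above, the following self-contained sketch re-implements the encoding under hypothetical names (`one_hot`, `decode`, `AA_LIST`, `MAX_LEN`) that are not part of this notebook. It also adds an inverse function, which is an assumption made here only so the encoded matrix can be inspected.

```python
import numpy as np

# Hypothetical names for illustration; the alphabet and width mirror seq_cod.
AA_LIST = list("ACDEFGHIKLMNPQRSTVWY")
MAX_LEN = 978

def one_hot(seq, max_len=MAX_LEN):
    """Return a (1, 20, max_len) one-hot matrix; '*' and 'X' leave a zero column."""
    mat = np.zeros((len(AA_LIST), max_len), dtype=int)
    for pos, aa in enumerate(seq):
        if aa in ("*", "X"):
            continue
        mat[AA_LIST.index(aa), pos] = 1
    return mat.reshape((1, len(AA_LIST), max_len))

def decode(mat):
    """Invert one_hot: read the set row of each column, '-' where the column is empty."""
    grid = mat[0]
    return "".join(
        AA_LIST[grid[:, pos].argmax()] if grid[:, pos].any() else "-"
        for pos in range(grid.shape[1])
    )

encoded = one_hot("ACX")
print(encoded.shape)          # (1, 20, 978)
print(decode(encoded)[:4])    # AC--
```

Each column holds at most a single 1, so the skipped characters and the unused positions past the end of the sequence both appear as all-zero columns.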

###Molecule codification
A function that encodes a single molecule as its 512-bit RDKit fingerprint. The molecule must be given as input in SMILES format.

In [None]:
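def smi_cod(smi):

  # Parse the SMILES string; RDKit returns None when it cannot parse it
  ms = Chem.MolFromSmiles(smi)
  if ms is None:
    raise ValueError(f"Invalid SMILES: {smi}")
  # 512-bit RDKit fingerprint as a string of "0"/"1" characters
  fp = Chem.RDKFingerprint(ms, fpSize = 512).ToBitString()
  fp = np.array([int(x) for x in fp])
  # Shape (1, 512) so fingerprints can be concatenated row by row
  fp = fp.reshape((1,512))

  return fp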
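
The step that turns the fingerprint bit string into a numeric row can be looked at in isolation. The sketch below does not call RDKit: `demo_bits` is a made-up 8-character string standing in for a real 512-bit fingerprint, and `bits_to_array` is a hypothetical helper that shows one possible vectorized alternative to the list comprehension.

```python
import numpy as np

def bits_to_array(bits):
    """Convert a string of '0'/'1' characters into a (1, n) integer array."""
    # Read the ASCII codes in one pass, then shift '0'/'1' (48/49) down to 0/1.
    codes = np.frombuffer(bits.encode("ascii"), dtype=np.uint8)
    return (codes - ord("0")).astype(int).reshape((1, -1))

# Made-up bit string standing in for a real 512-bit fingerprint.
demo_bits = "01100001"
fp = bits_to_array(demo_bits)
print(fp)         # [[0 1 1 0 0 0 0 1]]
print(fp.shape)   # (1, 8)
```

For a string that contains only `0` and `1`, this gives the same values as converting the characters one at a time with `int`.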

###Sequences matrix
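A function that encodes a set of sequences and stacks them into a single array of shape (n, 20, 978), where n is the number of sequences. The result is the matrix used as input for the model. The database containing the sequences, in its `AA_sequence` column, must be given as input.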

In [None]:
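def aam3d(db):

  # Encode each sequence to shape (1, 20, 978), printing a running count
  blocks = []
  for count, seq in enumerate(db['AA_sequence'], start=1):
    blocks.append(seq_cod(seq))
    print(count)

  # Concatenate once along the first axis: (n_sequences, 20, 978)
  aamatrix = np.concatenate(blocks)

  return aamatrix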
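
To make the resulting shape concrete, the sketch below concatenates three placeholder blocks that have the shape returned by `seq_cod`. The blocks are filled with dummy values rather than real encodings, and the names `blocks` and `stacked` are assumptions used only for this illustration.

```python
import numpy as np

# Three placeholder blocks with the shape seq_cod returns; contents are dummy values.
blocks = [np.full((1, 20, 978), fill_value=i, dtype=int) for i in range(3)]

# Concatenating along axis 0 turns n blocks of (1, 20, 978) into (n, 20, 978).
stacked = np.concatenate(blocks, axis=0)
print(stacked.shape)        # (3, 20, 978)

# Row i of the result is block i, so the input order is preserved.
print(stacked[2, 0, 0])     # 2
```

Collecting the blocks in a list and concatenating once avoids copying the growing array on every iteration, which can matter for large databases.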

###Molecules matrix
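A function that encodes a set of molecules and stacks their fingerprints into a single array of shape (n, 512), where n is the number of molecules. The result is the matrix used as input for the model. The database containing the molecules, in its `SMILES` column, must be given as input.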

In [None]:
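def fp3d(db):

  # Encode each molecule to shape (1, 512), printing a running count
  rows = []
  for count, smi in enumerate(db['SMILES'], start=1):
    rows.append(smi_cod(smi))
    print(count)

  # Concatenate once along the first axis: (n_molecules, 512)
  fpmatrix = np.concatenate(rows)

  return fpmatrix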