# CLICK_SERCH
## Clique generation
+ The PDB files are read with biopandas to obtain the $C_\alpha$ atoms and their coordinates.
+ The distance between every pair of atoms is computed, and a complete graph weighted by those distances is built.
+ The edges are restricted by a given distance threshold, and the cliques with exactly k members are generated.
+ Once the cliques of each protein have been generated, their coordinates are extracted so that they can be compared.

In [1]:
import pandas as pd
pd.set_option('display.float_format', lambda x: '%.5f' % x)
pd.set_option('display.max_rows', 200)
pd.set_option('display.max_columns', 40)

import networkx as nx
import community
import matplotlib.pyplot as plt
plt.style.use('ggplot')
font = {'family' : 'sans',
        'weight' : 'bold',
        'size'   : 20}
plt.rc('font', **font)

plt.rcParams['xtick.labelsize'] = 16
plt.rcParams['axes.labelsize'] = 18
plt.rcParams['ytick.labelsize'] = 16
plt.rcParams[u'figure.figsize'] = (16,8)

In [2]:
# additional libraries, added as they are needed
import biopandas.pdb as bp
biop = bp.PandasPdb()  # PDB reader
# Euclidean distance calculation
from scipy.spatial.distance import pdist, squareform
# numerical library
import numpy as np
import itertools as it

In [3]:
# reproduce the results of the CLICK algorithm
path1 ='/Users/serch/pdbmani/Serch/1phr.pdb'
path2 ='/Users/serch/pdbmani/Serch/1tig.pdb'

# read a PDB file with biopandas and keep only the C-alpha atoms
def read_biopdb(path):
    df = biop.read_pdb(path)
    df_atom = df.df['ATOM']
    # NOTE: this is tailored to the PDB file; to select a single frame in trj_0 and trj_0_A use [:1805]
    # .copy() avoids pandas' SettingWithCopyWarning when the 'vector' column is added below
    df_ca = df_atom[df_atom.atom_name == 'CA'][[
        'atom_number','atom_name','residue_name','residue_number',
        'x_coord','y_coord','z_coord']].copy()
    # store the coordinates of each atom as an [x, y, z] list
    df_ca['vector'] = [list(xyz) for xyz in zip(df_ca.x_coord, df_ca.y_coord, df_ca.z_coord)]
    return df_ca

In [4]:
df_ca1 = read_biopdb(path1)
df_ca2 = read_biopdb(path2)

In [18]:
df_ca1.head()

| | atom_number | atom_name | residue_name | residue_number | x_coord | y_coord | z_coord | vector |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | CA | VAL | 4 | -1.152 | 49.045 | 47.247 | [-1.152, 49.045, 47.247] |
| 8 | 9 | CA | THR | 5 | -2.11 | 47.476 | 43.973 | [-2.11, 47.476, 43.973] |
| 15 | 16 | CA | LYS | 6 | 1.146 | 46.724 | 42.118 | [1.146, 46.724, 42.118] |
| 24 | 25 | CA | SER | 7 | 1.539 | 45.139 | 38.664 | [1.539, 45.139, 38.664] |
| 30 | 31 | CA | VAL | 8 | 4.309 | 42.829 | 37.399 | [4.309, 42.829, 37.399] |


In [5]:
# # compute the distance between every pair of nodes.
# # def distancia_entre_atomos(df_ca):
# distancias = []
# # compute the Euclidean distance between every pair of alpha-carbon atoms
# for v,i in zip(df_ca1.vector,df_ca1.atom_number):
#     distancia_un_atomo = []
#     for av,j in zip(df_ca1.vector,df_ca1.atom_number):
#         distancia = pdist([v,av],metric='euclidean').item()
#         distancia_un_atomo.append(distancia)
#     distancias.append(distancia_un_atomo)
    
# pd.DataFrame(index=df_ca1.atom_number,
#                                       columns=df_ca1.atom_number,
#                                       data=distancias)

In [6]:
# compute the distance between every pair of nodes.
def distancia_entre_atomos(df_ca):
    distancias = []
    # compute the Euclidean distance between every pair of alpha-carbon atoms
    for v,i in zip(df_ca.vector,df_ca.atom_number):
        distancia_un_atomo = []
        for av,j in zip(df_ca.vector,df_ca.atom_number):
            distancia = pdist([v,av],metric='euclidean').item()
            distancia_un_atomo.append(distancia)
        distancias.append(distancia_un_atomo)
    # build the adjacency matrix for the network
    df_da = pd.DataFrame(index=df_ca.atom_number,columns=df_ca.atom_number,data=distancias)
    return(df_da)
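
The double loop above calls `pdist` once per pair of atoms. A minimal sketch of a vectorised alternative, assuming only NumPy broadcasting, is shown here; the function name `distance_matrix` and the toy coordinates are hypothetical and not part of the notebook. `squareform(pdist(coords))`, using the SciPy functions already imported above, should give the same matrix.

```python
import numpy as np

def distance_matrix(coords):
    """Pairwise Euclidean distances for an (n, 3) array of coordinates."""
    coords = np.asarray(coords, dtype=float)
    # diff[i, j] is the vector from point j to point i (broadcast over both axes)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# toy coordinates (hypothetical, not taken from 1phr or 1tig)
toy = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 12.0]]
dist = distance_matrix(toy)
```

Wrapping the result as `pd.DataFrame(distance_matrix(df_ca.vector.tolist()), index=df_ca.atom_number, columns=df_ca.atom_number)` would be expected to reproduce `df_da`, although that has not been checked against the notebook's data.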

In [7]:
df_da1 = distancia_entre_atomos(df_ca1)
df_da2 = distancia_entre_atomos(df_ca2)

In [8]:
def gen_3_cliques(df_da, dth = 10, k=3):
    """Generate k-cliques from a distance DataFrame, keeping only the edges
    whose distance is less than or equal to dth.
    Default values:
    dth=10, k=3
    Returns a DataFrame with the members of each clique and prints
    a summary of the network before and after the filter."""
    # complete distance network
    # NOTE: nx.info was removed in NetworkX 3.0, so this cell needs an older release
    red = nx.from_pandas_adjacency(df_da)
    print("network before filters:",nx.info(red))

    # distance filter
    edgesstrong = [(u,v) for (u,v,d) in red.edges(data=True) if d["weight"] <= dth]

    red = nx.Graph(edgesstrong)
    print("=="*20)
    print("network after filters:",nx.info(red))

    # nx.find_cliques yields the maximal cliques of the filtered network
    n_cliques = [clq for clq in nx.find_cliques(red) if len(clq) >=k]
    print('number of maximal cliques found:',len(n_cliques))

    lista_cliques = []
    for i,v in enumerate(n_cliques):
        a = list(it.combinations(v,k))
        for j in a:
            if set(j) not in lista_cliques:
                # compare as sets so that member order is ignored, and store them as sets
                lista_cliques.append(set(j))

    df_lc = pd.DataFrame(lista_cliques)            
    print("number of possible %s-cliques:" % (k), df_lc.shape[0])
    return(df_lc)

In [9]:
df_lc1 = gen_3_cliques(df_da1)
print('--'*50)
df_lc2 = gen_3_cliques(df_da2)

network before filters: Name: 
Type: Graph
Number of nodes: 154
Number of edges: 11781
Average degree: 153.0000
network after filters: Name: 
Type: Graph
Number of nodes: 154
Number of edges: 1378
Average degree:  17.8961
number of maximal cliques found: 419
number of possible 3-cliques: 4480
----------------------------------------------------------------------------------------------------
network before filters: Name: 
Type: Graph
Number of nodes: 88
Number of edges: 3828
Average degree:  87.0000
network after filters: Name: 
Type: Graph
Number of nodes: 88
Number of edges: 709
Average degree:  16.1136
number of maximal cliques found: 246
number of possible 3-cliques: 2102
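
Expanding every maximal clique into its k-member subsets and removing duplicates yields the same set as enumerating all k-cliques directly. The sketch below does the direct enumeration by brute force, which can serve as an independent check on the counts for small inputs. The function name `k_cliques` and the toy distance matrix are hypothetical.

```python
import itertools as it
import numpy as np

def k_cliques(dist, dth=10, k=3):
    """Return every k-subset of nodes whose pairwise distances are all <= dth."""
    found = []
    for nodes in it.combinations(range(len(dist)), k):
        # a subset is a clique when every pair inside it passes the distance filter
        if all(dist[u][v] <= dth for u, v in it.combinations(nodes, 2)):
            found.append(nodes)
    return found

# toy symmetric distance matrix (hypothetical values): node 3 is far from the rest
toy_dist = np.array([
    [0.0, 1.0, 1.0, 9.0],
    [1.0, 0.0, 1.4, 10.0],
    [1.0, 1.4, 0.0, 9.5],
    [9.0, 10.0, 9.5, 0.0],
])
tight = k_cliques(toy_dist, dth=2, k=3)
loose = k_cliques(toy_dist, dth=10, k=3)
```

This checks on the order of n^k subsets, so it is only a cross-check; the `nx.find_cliques` route above remains the one used in the notebook.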


In [10]:
# # complete distance network
# red = nx.from_pandas_adjacency(df_da1)
# print(nx.info(red))


# # distance filter
# edgesstrong = [(u,v) for (u,v,d) in red.edges(data=True) if d["weight"] <= 10]

# red = nx.Graph(edgesstrong)
# print("=="*20)
# print(nx.info(red))

# cliques3 = [clq for clq in nx.find_cliques(red) if len(clq) >=3]
# print('number of maximal cliques found:',len(cliques3))

# lista_cliques = []
# for i,v in enumerate(cliques3):
#     a = list(it.combinations(v,3))
#     for j in a:
#         if set(j) not in lista_cliques:
#             # compare as sets so that member order is ignored, and store them as sets
#             lista_cliques.append(set(j))

# print("number of possible 3-cliques:",pd.DataFrame(lista_cliques).shape[0])

In [11]:
### CHECK THAT THE NUMBER OF CLIQUES IS CORRECT ###
# build the adjacency matrix for the network
# distancias_adyacenctes = pd.DataFrame(index=df_ca1.atom_number,
#                                       columns=df_ca1.atom_number,
#                                       data=distancias)

# # complete distance network
# red = nx.from_pandas_adjacency(distancias_adyacenctes)
# print(nx.info(red))

# print("=="*20)
# print(nx.info(red))
# cliques3 = [clq for clq in nx.find_cliques(red) if len(clq) >=3]
# print('number of maximal cliques found:',len(cliques3))

# lista_cliques = []
# for i,v in enumerate(cliques3):
#     a = list(it.combinations(v,3))
#     for j in a:
#         #if set(j) not in lista_cliques:
#             # compare as sets so that member order is ignored, and store them as sets
#         lista_cliques.append(set(j))

# print("number of possible 3-cliques:",pd.DataFrame(lista_cliques).shape[0])
### CHECK THAT THE NUMBER OF CLIQUES IS CORRECT ###

In [12]:
def get_coord_clique(df_ca,df_lc):
    """Attach to each 3-clique in df_lc the coordinates of its three atoms,
    looked up in df_ca by atom_number."""
    lista_matriz_coordendas = []
    for i in df_lc.index:
        # despite its name, mat_dist holds the three [x, y, z] vectors of the clique
        mat_dist = [df_ca[df_ca.atom_number==df_lc.iloc[i,0]].vector.values[0],
                df_ca[df_ca.atom_number==df_lc.iloc[i,1]].vector.values[0],
                df_ca[df_ca.atom_number==df_lc.iloc[i,2]].vector.values[0]]
        lista_matriz_coordendas.append(mat_dist)

    df_lc['matriz_distancias'] = lista_matriz_coordendas
    return(df_lc)

In [13]:
df_lc1 = get_coord_clique(df_ca1,df_lc1)
df_lc2 = get_coord_clique(df_ca2,df_lc2)

In [16]:
df_lc1.head()

| | 0 | 1 | 2 | matriz_distancias |
|---|---|---|---|---|
| 0 | 121 | 514 | 431 | [[4.736, 31.117, 32.112], [-4.239, 29.134, 31.... |
| 1 | 121 | 514 | 507 | [[4.736, 31.117, 32.112], [-4.239, 29.134, 31.... |
| 2 | 121 | 514 | 393 | [[4.736, 31.117, 32.112], [-4.239, 29.134, 31.... |
| 3 | 121 | 514 | 387 | [[4.736, 31.117, 32.112], [-4.239, 29.134, 31.... |
| 4 | 514 | 507 | 431 | [[-4.239, 29.134, 31.826], [-1.742, 27.922, 34... |


## Clique comparison
The objective is to compute a one-to-one mapping between the amino acid residues of the two structures A and B. To begin with, all possible 3-body cliques $A_3$ and $B_3$, where $A_3, B_3 \in S_3$, are compared to one another (inclusive of all permutations). Equivalent pairs of cliques are deduced according to the relations in Equations (2–4). A pair $(A_3, B_3)$ is matched if their RMSD on superimposition is smaller than a preset threshold ($RMSD_3 = 0.15$ Å). The RMSD between cliques is calculated by a 3-D least-squares fit (39).
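
As an illustration of the comparison described above, the sketch below superimposes two 3-body cliques with the Kabsch algorithm and tries every ordering of the second clique. Kabsch is one common way to carry out a 3-D least-squares fit; it is an assumption here that it corresponds to the method of reference (39). The function names and the toy triangle are hypothetical.

```python
import itertools as it
import numpy as np

def superposed_rmsd(P, Q):
    """RMSD of P onto Q after the optimal translation and rotation (Kabsch)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)          # move both centroids to the origin
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # -1 would mean a reflection
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T  # proper rotation taking Pc onto Qc
    return float(np.sqrt(((Pc @ R.T - Qc) ** 2).sum() / len(P)))

def best_clique_rmsd(A, B):
    """Smallest superposed RMSD over all orderings of B, and that ordering."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    return min((superposed_rmsd(A, B[list(p)]), p)
               for p in it.permutations(range(len(B))))

# toy clique: a 3-4-5 triangle, then the same triangle rotated, shifted and reordered
tri = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
shuffled = (tri @ rot_z.T + np.array([1.0, 2.0, 3.0]))[[2, 0, 1]]

rmsd, perm = best_clique_rmsd(tri, shuffled)
matched = rmsd < 0.15   # RMSD_3 threshold quoted in the text
```

The returned `perm` gives, for each residue of the first clique, the position of its partner in the second one, which is the residue pairing that the later extension steps build on.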

Additionally, the secondary-structure state and side-chain solvent accessible area of the amino acid residues also determine which pairs of cliques are matched. Secondary structure provides the general three-dimensional form of local segments of proteins, while side-chain solvent accessibility is the degree to which a residue in a protein is accessible to a solvent molecule. For matching a pair of cliques in our algorithm, the secondary-structure states of two equivalent residues $A_i$ and $B_j$ are compared [Equation (2)], where SSM is an empirically determined secondary-structure match matrix (Table 1), $SS(A_i)$ is the secondary-structure state of amino acid residue $A_i$, and $s$ is a preset threshold for matching secondary-structure elements.

The cut-off threshold for comparing secondary structure used in this study was 2, hence $SSM[A_i, B_j] < 2$ [Equation (2)]. This implies that residues of regular secondary structures can only match either with other residues of the same secondary structure or with residues in loops.

The solvent accessibility scores of two residues $A_i$ and $B_j$, taken from the solvent accessibility matrix (Table 2), are matched by using the inequality in Equation (3), where SAM is an empirical solvent accessibility match matrix (Table 2), $SA(A_i)$ is the side-chain solvent accessibility of amino acid residue $A_i$, and $a$ is a preset threshold for matching solvent accessible area states. The cut-off threshold for solvent accessibility matching is $a=1$, implying that residues categorized in different accessible area classes cannot be matched. However, this criterion is relaxed to allow the matching of two residues in adjacent accessible area classes if their side-chain accessible areas are within 10% of each other.

Next, $A_3$ and $B_3$ are extended to 4-body cliques $A_4$ and $B_4$ by including one residue, $A_i$ and $B_j$ respectively, subject to the distance threshold criterion [Equation (1)].
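
The secondary-structure and accessibility filters can be sketched as two small predicates. Everything numeric below is a placeholder: Tables 1 and 2 are not reproduced in this notebook, so the `SSM` scores are invented to be consistent with the rule stated above, the accessibility score is modelled as the difference between integer class labels, and "within 10% of each other" is read as an absolute difference of at most 10 percentage points. The names `ss_match` and `sa_match` are hypothetical.

```python
# PLACEHOLDER scores, NOT the values of Table 1: H = helix, E = strand, C = loop
SSM = {
    ("H", "H"): 0, ("E", "E"): 0, ("C", "C"): 0,
    ("H", "C"): 1, ("E", "C"): 1,
    ("H", "E"): 2,
}

def ss_match(ss_a, ss_b, s=2):
    """True when the secondary-structure score is below the threshold s."""
    key = (ss_a, ss_b) if (ss_a, ss_b) in SSM else (ss_b, ss_a)
    return SSM[key] < s

def sa_match(class_a, class_b, area_a, area_b, a=1, tol=10.0):
    """True for the same accessibility class, or for adjacent classes whose
    side-chain accessible areas (in percent) differ by at most tol.
    Treating the SAM score as a class difference is an assumption."""
    gap = abs(class_a - class_b)
    if gap < a:
        return True
    return gap == 1 and abs(area_a - area_b) <= tol
```

For example, `ss_match("H", "C")` passes while `ss_match("H", "E")` does not, which mirrors the statement that regular secondary structures match only themselves or loops.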

This new pair $(A_i, B_j)$ and the matched residues of $(A_3, B_3)$ are used to superimpose the pair of cliques $A_4 = A_3 \cup A_i$ and $B_4 = B_3 \cup B_j$. Pairs of four-body cliques, $A_4$ and $B_4$, are matched if their RMSD is smaller than another preset threshold, $RMSD_4 = 0.30$ Å [Equation (4), $n=4$].

Pairs of $n$-body cliques, $A_n = A_{n-1} \cup A_i$ and $B_n = B_{n-1} \cup B_j$, are selected if their RMSD is smaller than a preset threshold $RMSD_n$ (Table 3). Every value of $n$ has a different RMSD threshold, $RMSD_n$. See the section on RMSD threshold optimization for details. At every step the secondary structure and accessible area comparisons [Equations (2) and (3)] are also performed.

All matched pairs of 4-body cliques $A_4$ and $B_4$ are extended to all possible higher order cliques, $A_n$ and $B_n$, where $A_n, B_n \in S_n$ and $n>4$. In this study, cliques are extended to a maximum of seven constituent residues.
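
One extension step, from a matched pair of $n$-body cliques to the matched pairs of $(n+1)$-body cliques, might look like the sketch below. It covers only the distance and RMSD criteria; the secondary-structure and accessibility checks are left out, and the per-size thresholds of Table 3 have to be supplied by the caller. The Kabsch superposition, the function names and the toy coordinates are all assumptions made for illustration.

```python
import numpy as np

def superposed_rmsd(P, Q):
    """RMSD of P onto Q after the optimal translation and rotation (Kabsch)."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)
    Qc = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return float(np.sqrt(((Pc @ R.T - Qc) ** 2).sum() / len(P)))

def extend_pair(coords_a, coords_b, clique_a, clique_b, dth, rmsd_thr):
    """Add one residue to each clique (index lists matched position by position)
    and keep the extensions whose superposed RMSD stays below rmsd_thr."""
    coords_a = np.asarray(coords_a, dtype=float)
    coords_b = np.asarray(coords_b, dtype=float)

    def candidates(coords, clique):
        # residues outside the clique that lie within dth of every member
        return [i for i in range(len(coords)) if i not in clique
                and np.all(np.linalg.norm(coords[clique] - coords[i], axis=1) <= dth)]

    extended = []
    for i in candidates(coords_a, clique_a):
        for j in candidates(coords_b, clique_b):
            new_a, new_b = clique_a + [i], clique_b + [j]
            if superposed_rmsd(coords_a[new_a], coords_b[new_b]) < rmsd_thr:
                extended.append((new_a, new_b))
    return extended

# toy structures: B is A rotated about z and shifted; residue 4 is far from the clique
A = np.array([[0.0, 0, 0], [3, 0, 0], [0, 4, 0], [1, 1, 2], [30, 30, 30]])
rot_z = np.array([[0.0, -1, 0], [1, 0, 0], [0, 0, 1]])
B = A @ rot_z.T + np.array([5.0, -2.0, 1.0])

pairs = extend_pair(A, B, [0, 1, 2], [0, 1, 2], dth=10, rmsd_thr=0.30)
```

Applying `extend_pair` repeatedly to its own output, with the threshold for each size, would grow the cliques towards the seven-residue limit mentioned above.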

### Steps for the comparison
To obtain the __RMSD__, both sets of atoms (or molecules) must first be translated to the origin and rotated; the __RMSD__ is then computed on the superimposed coordinates.
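
Written out, and assuming the usual least-squares definition (the notebook does not state the formula), the quantity for two cliques with matched residues $a_i$ and $b_i$, $i = 1, \dots, n$, is:

$$
RMSD(A, B) = \min_{R} \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left\lVert R\,(a_i - \bar{a}) - (b_i - \bar{b}) \right\rVert^{2}}
$$

where $\bar{a}$ and $\bar{b}$ are the centroids of the two cliques, so that subtracting them is the translation to the origin, and $R$ ranges over the 3-D rotation matrices.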

To obtain C, $\alpha$, $\beta$ with:
   + $\Phi$
   + $\Psi$
1. Secondary-structure comparison matrix (SSM)
2. Solvent accessibility (SAM)
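
The $\Phi$ and $\Psi$ angles listed above are backbone dihedral angles: $\Phi$ is defined by the atoms C(i-1), N(i), CA(i), C(i), and $\Psi$ by N(i), CA(i), C(i), N(i+1). A minimal sketch of the dihedral calculation for four points follows; the function name `dihedral` and the test points are hypothetical, and how the resulting angles are mapped to the C, $\alpha$ and $\beta$ states is not specified here.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Dihedral angle in degrees defined by four points, in the range (-180, 180]."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)       # unit vector along the central bond
    # components of the outer bonds perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

# toy geometry: central bond along z, first atom on the +x side
cis = dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1])
trans = dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1])
quarter = dihedral([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1])
```

With the full `ATOM` table from biopandas (not only the CA rows kept above), the N, CA and C coordinates of consecutive residues could be passed to this function to obtain $\Phi$ and $\Psi$ per residue.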