**What is a graph neural network?**

A graph neural network (GNN) is a type of neural network designed to process data represented as a graph. 
GNNs are used for a variety of tasks such as node classification, link prediction, and graph generation. 
They are particularly useful for problems that involve analyzing relationships between entities, such as social networks, molecular structures, and transportation networks. GNNs use a combination of neural networks and graph theory to learn features and representations of nodes and edges in a graph.

**A mechanism of GNNs**


A graph is a collection of nodes (also called vertices) and edges (also called connections) that connect the nodes. Each node in the graph represents an entity and each edge represents a relationship between two entities.
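
As a small illustration of this definition, the sketch below stores a toy graph as an adjacency matrix with NumPy. The edge list and node count are made up for the example and are not taken from any dataset in this notebook.

```python
import numpy as np

# Hypothetical toy graph: 4 nodes, undirected edges given as (node, node) pairs.
edges = [(0, 1), (0, 2), (2, 3)]
n_nodes = 4

A = np.zeros((n_nodes, n_nodes), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected graph: store the edge in both directions

degree = A.sum(axis=1)  # number of neighbors of each node
print(A)
print(degree)
```

Row i of `A` marks the neighbors of node i, which is the form most GNN formulas work with.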

The main mechanism of GNNs is the message passing process. In message passing, the model iteratively updates the representation of each node based on the representations of its neighboring nodes. This process is done by passing messages from each node to its neighboring nodes through the edges that connect them.

The message passing process can be divided into two main steps:
1. Aggregation: In the aggregation step, the model combines the representations of a node's neighbors into a single message. Typical choices are a sum, a mean, a maximum, or a weighted sum whose weights are either fixed by the graph structure (as in GCNs) or learned by the model (as in GATs).
1. Update: In the update step, the model updates the representation of the current node based on the aggregated information. This is usually done by applying a neural network to the aggregated information, often together with the node's own previous representation.

This process is repeated for a fixed number of iterations or until a stopping criterion is met. After k rounds, each node's representation reflects information from nodes up to k hops away.
The aggregation and update steps together form a trainable function called the "GNN layer". The update part of this function can be a neural network such as an MLP or a recurrent unit. The final representation of each node is used as the output of the GNN.
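
The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a specific published architecture: the mean aggregator, the two weight matrices `W_self` and `W_neigh`, and the ReLU update are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy graph with 4 nodes (adjacency matrix without self-loops).
A = np.array([[0, 1, 1, 0],
              [1, 0, 0, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))        # one 3-dimensional representation per node
W_self = rng.normal(size=(3, 3))   # hypothetical weights for the node's own features
W_neigh = rng.normal(size=(3, 3))  # hypothetical weights for the aggregated message


def message_passing_round(A, H, W_self, W_neigh):
    # Aggregation: mean of the neighbors' representations.
    deg = A.sum(axis=1, keepdims=True)
    M = (A @ H) / np.maximum(deg, 1)
    # Update: combine own representation and message, then apply ReLU.
    return np.maximum(H @ W_self + M @ W_neigh, 0.0)


for _ in range(2):  # two rounds: each node sees its 2-hop neighborhood
    H = message_passing_round(A, H, W_self, W_neigh)
print(H.shape)
```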

GNNs can be applied to various tasks such as node classification, link prediction, and graph generation. They have been widely used in various fields such as natural language processing, computer vision, bioinformatics, and social networks.

**Several types of GNNs**

There are several types of Graph Neural Networks (GNNs) that have been developed, each with its own strengths and weaknesses. Some of the most common types of GNNs include:

1. Graph Convolutional Networks (GCNs): GCNs use a convolutional architecture to process graph data. They are based on the convolutional neural networks (CNNs) used in image processing, but are adapted to work with graph-structured data. GCNs use a local neighborhood aggregation scheme to update the representation of each node.

1. Graph Attention Networks (GATs): GATs use an attention mechanism to weight the contributions of neighboring nodes when updating the representation of each node. This allows the model to focus on the most important neighbors for each node.

1. Graph Recurrent Networks (GRNs): GRNs use a recurrent architecture to process graph data. They are based on the recurrent neural networks (RNNs) used in sequential data, but are adapted to work with graph-structured data.

1. Graph Auto-Encoders (GAEs): GAEs are neural networks that are designed to learn representations of graphs in a low-dimensional space. They consist of two main components: an encoder that maps the graph to a low-dimensional representation, and a decoder that maps the low-dimensional representation back to the original graph.

1. Graph Transformer Networks (GTNs): GTNs are based on the transformer architecture, which is widely used in natural language processing. They use self-attention mechanisms to weight the contributions of neighboring nodes when updating the representation of each node.

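1. Simplifying GNNs: Simplifying GNNs are designed to reduce the complexity of GNNs, making them more efficient and scalable. GraphSAGE and FastGCN, for example, sample neighbors or nodes instead of aggregating over the full neighborhood. JK-Net (Jumping Knowledge Networks) is often listed alongside them, although its focus is on combining the representations from different layers rather than on efficiency.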

These are some of the most popular types of GNNs, but there are many other variations and combinations that have been proposed in the literature. The choice of GNN architecture will depend on the specific problem and dataset you're working with.
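
To make the attention idea behind GATs more concrete, here is a hedged NumPy sketch of single-head attention weights in the style of the GAT formulation. The function names, the toy graph, and the random parameters `W` and `a` are assumptions for illustration. The final non-linearity and multi-head attention are left out.

```python
import numpy as np


def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)


def gat_attention(A, H, W, a):
    """Return attention weights alpha[i, j] over the neighbors j of node i."""
    Z = H @ W                              # transformed node features
    d_out = Z.shape[1]
    # score(i, j) = LeakyReLU(a^T [z_i || z_j]) splits into two dot products
    src = Z @ a[:d_out]
    dst = Z @ a[d_out:]
    scores = leaky_relu(src[:, None] + dst[None, :])
    scores = np.where(A > 0, scores, -np.inf)             # mask non-neighbors
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)   # softmax per row


rng = np.random.default_rng(0)
# Toy adjacency WITH self-loops, so every row has at least one neighbor.
A = np.array([[1, 1, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
H = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))
a = rng.normal(size=4)            # length 2 * d_out

alpha = gat_attention(A, H, W, a)
H_new = alpha @ (H @ W)           # attention-weighted aggregation
print(alpha.round(2))
```

Each row of `alpha` sums to one and is zero outside the node's neighborhood, so the aggregation is a learned weighted average of neighbor features.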

**Graph Convolutional Networks (GCNs)**

Graph Convolutional Networks (GCNs) are one of the most popular types of Graph Neural Networks (GNNs). They were first introduced by Thomas Kipf and Max Welling in 2017 in the paper "Semi-Supervised Classification with Graph Convolutional Networks".

GCNs are based on the convolutional neural networks (CNNs) used in image processing, but are adapted to work with graph-structured data. They use a local neighborhood aggregation scheme to update the representation of each node, together with a trainable weight matrix that is shared across all nodes and plays the role of the convolutional kernel.

GCNs have been successfully applied to a variety of graph-based tasks such as node classification, link prediction, and graph generation. They have been widely used in various fields such as natural language processing, computer vision, bioinformatics, and social networks.

One of the key advantages of GCNs is that they can effectively capture local structural information in the graph, making them well suited for tasks that involve analyzing relationships between entities. They also have a relatively simple architecture and can be trained efficiently on large graphs.

Due to their simplicity and good performance, GCNs have become a common first choice for many researchers working on graph-based problems, although they are only one of several popular GNN architectures.

**1. Aggregation steps**

In Graph Convolutional Networks (GCNs), the aggregation step is used to combine the information from a node's neighborhood in order to update the node's representation. The aggregation step is a weighted sum over the neighborhood. In the standard GCN the aggregation weights come from a degree-based normalization of the adjacency matrix, while the feature transformation applied alongside it is learned during training.

The mathematical formulation of the aggregation step in GCNs can be represented as follows:

Consider a graph $G = (V, E)$ with $n$ nodes and $m$ edges, and a node feature matrix $X \in \mathbb{R}^{n \times d}$, where $d$ is the number of features for each node.

Let A be the adjacency matrix of the graph, where $A_{i,j} = 1$ if there is an edge between node i and j, and 0 otherwise.

Let D be the degree matrix of the graph, where $D_{i,i}$ is the degree of node i.

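The normalized adjacency matrix, with self-loops added so that each node also keeps its own features, is defined as:
$\hat{A} = \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$, where $\tilde{A} = A + I$ and $\tilde{D}$ is the degree matrix of $\tilde{A}$.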

The graph convolution is defined as:
$H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)})$

where $H^{(l)}$ is the matrix of node representations at the l-th layer (with $H^{(0)} = X$), $W^{(l)}$ is the weight matrix of the l-th layer, and $\sigma$ is the non-linear activation function.

The above equation describes the aggregation step in GCNs, where the representations of the node's neighbors are combined with the node's own representation and the learned weight matrix, in order to update the node's representation.

In summary, the aggregation step in GCNs is a mathematical operation that combines the information from a node's neighborhood in order to update the node's representation. It is a weighted sum implemented by multiplying the normalized adjacency matrix by the node feature matrix; the result is then multiplied by a weight matrix that is learned during training.
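
The graph convolution can be sketched with NumPy as follows. The sketch assumes that self-loops are added before normalization (the renormalization used by Kipf and Welling) and that $\sigma$ is a ReLU. The function names and the toy path graph are illustrative only.

```python
import numpy as np


def normalize_adjacency(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2, with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]


def gcn_layer(A_hat, H, W):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)


rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)   # path graph 0 - 1 - 2
X = rng.normal(size=(3, 4))              # 4 input features per node
W0 = rng.normal(size=(4, 2))             # hypothetical layer weights

A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, X, W0)
print(A_hat.round(3))
print(H1.shape)
```

Note that the entries of `A_hat` depend only on node degrees. The only trainable quantity in the layer is the weight matrix.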

**2. Update steps**

In Graph Convolutional Networks (GCNs), the update step is used to update the representation of each node in the graph, based on the information gathered from the aggregation step. The update step is typically performed using a linear transformation of the aggregated information, where the transformation is learned during training.

The mathematical formulation of the update step in GCNs can be represented as follows:

Given the aggregation step, which computes the new node features as:
$H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)})$

Where $H^{(l)}$ is the node representations at the l-th layer, $W^{(l)}$ is the weight matrix of the l-th layer, $\hat{A}$ is the normalized adjacency matrix, and $\sigma$ is the non-linear activation function.

Some GCN variants additionally add a residual (skip) connection, in which case the update step can be written as:
$H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)}) + H^{(l)}$

This equation adds the new node representations computed by the aggregation step to the current node representations, which requires $H^{(l)}$ and $H^{(l+1)}$ to have the same dimension. This allows the model to incorporate both the local structural information from the neighborhood and the node's own features from the previous layer. The basic GCN layer of Kipf and Welling does not include this term; there, the update consists only of the learned linear transformation $W^{(l)}$ and the activation $\sigma$.

In summary, the update step of a GCN is a mathematical operation that updates the representation of each node in the graph based on the information gathered from the aggregation step. It applies a linear transformation that is learned during training, followed by a non-linear activation, and can optionally add the result to the current node representations through a skip connection.
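
The additive update described above can be sketched as a small variation of a graph convolution. This is an illustration under stated assumptions: `gcn_layer_with_skip` is a hypothetical name, $\sigma$ is taken to be a ReLU, and the weight matrix is square so that the two terms can be added.

```python
import numpy as np


def gcn_layer_with_skip(A_hat, H, W):
    """ReLU(A_hat @ H @ W) + H, i.e. a graph convolution plus a skip connection."""
    if W.shape[0] != W.shape[1]:
        raise ValueError("skip connection needs equal input and output dimensions")
    return np.maximum(A_hat @ H @ W, 0.0) + H


rng = np.random.default_rng(0)
A_hat = np.full((3, 3), 1.0 / 3.0)   # placeholder normalized adjacency
H = rng.normal(size=(3, 2))
W = rng.normal(size=(2, 2))

out = gcn_layer_with_skip(A_hat, H, W)
print(out.shape)
```

With `W` set to zero the layer returns `H` unchanged, which shows how the skip connection preserves each node's own features.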

**Between Keras and PyTorch, which one is more popular for implementing GNN algorithms?**

Both PyTorch and Keras are popular deep learning libraries, and both can be used to implement Graph Neural Networks (GNNs). The choice between the two depends on the specific requirements of your project and your personal preferences.

PyTorch is a lower-level library and provides more flexibility in customizing the model's architecture and training process. It also has a large community of researchers and developers who contribute to its development and provide support. PyTorch itself does not include graph-specific layers, but libraries built on top of it, such as PyTorch Geometric and DGL, provide graph data structures and many GNN architectures, which makes it easy to implement GNNs.

Keras, on the other hand, is a higher-level library that provides a simpler and more intuitive interface for building and training models. It is also more user-friendly, making it a popular choice for researchers and practitioners who are new to deep learning. Keras supports GNNs through external libraries such as Keras-GCN and keras-gat, but this support is not as extensive as in the PyTorch ecosystem.

In practice, both libraries are widely used in research and industry, and both have their own advantages and disadvantages. It ultimately comes down to the specific needs of your project and your own experience and preferences. If you are familiar with PyTorch and you need to implement complex GNN architectures, PyTorch might be the better choice. But if you are new to deep learning and you prefer a more user-friendly interface, Keras might be the better choice.

In [3]:
import pandas as pd
pd.set_option('display.max_rows', 200)
import numpy as np

import time
import random

from os import getcwd 
from os.path import exists

print(getcwd()) # current working directory

version = 'v4'
update = True

D:\Project\MIT_glyco


In [4]:
load_name = f'{version}_all_sites_group.csv'
all_sites = pd.read_csv(load_name)

protein_list = list(all_sites.protein.unique())
pass_list = ["P24622_2", "Q91YE8_2"] #these proteins have positive sites which are out of bound

for x in pass_list:
    protein_list.remove(x) 
    
print("total number of proteins:", len(protein_list))

total number of proteins: 273


In [46]:
load_path = './protein_dataset'
frames = []
for name in protein_list:
    load_name = f"{load_path}/{version}_dataset_{name}.csv"
    temp = pd.read_csv(load_name, index_col=0)
    temp['protein'] = temp.index.name  # header of the index column, used here as the protein label
    frames.append(temp)

# concatenate once instead of growing the frame inside the loop
dataset = pd.concat(frames, axis=0).reset_index(drop=True)

In [47]:
dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258653 entries, 0 to 258652
Data columns (total 27 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   #                258653 non-null  int64  
 1   SEQ              258653 non-null  object 
 2   SS               258653 non-null  object 
 3   ASA              258653 non-null  float64
 4   Phi              258653 non-null  float64
 5   Psi              258653 non-null  float64
 6   Theta(i-1=>i+1)  258653 non-null  float64
 7   Tau(i-2=>i+2)    258653 non-null  float64
 8   HSE_alpha_up     258653 non-null  float64
 9   HSE_alpha_down   258653 non-null  float64
 10  P(C)             258653 non-null  float64
 11  P(H)             258653 non-null  float64
 12  P(E)             258653 non-null  float64
 13  flexibility      258051 non-null  float64
 14  side_-1          258653 non-null  object 
 15  side_1           258653 non-null  object 
 16  side_2           258653 non-null  obje

In [48]:
header = ['#', 'protein']
x_cts = ['flexibility', 'Proline']
x_cat = ['SEQ', 'nAli', 'nPos', 'nS/nT', 'SS', 'phi_psi', 
         'side_-1', 'side_1', 'side_2', 'side_3', 'side_4', 'side_5']
y = ['positivity']

dummies = pd.get_dummies(dataset[x_cat+y], columns=x_cat+y)
display(np.array(dummies.columns))

array(['SEQ_A', 'SEQ_C', 'SEQ_D', 'SEQ_E', 'SEQ_F', 'SEQ_G', 'SEQ_H',
       'SEQ_I', 'SEQ_K', 'SEQ_L', 'SEQ_M', 'SEQ_N', 'SEQ_P', 'SEQ_Q',
       'SEQ_R', 'SEQ_S', 'SEQ_T', 'SEQ_V', 'SEQ_W', 'SEQ_Y', 'nAli_0',
       'nAli_1', 'nAli_2', 'nAli_3', 'nPos_0', 'nPos_1', 'nPos_2',
       'nPos_3', 'nS/nT_0', 'nS/nT_1', 'nS/nT_2', 'nS/nT_3', 'nS/nT_4',
       'nS/nT_5', 'nS/nT_6', 'nS/nT_7', 'nS/nT_8', 'nS/nT_9', 'nS/nT_10',
       'nS/nT_11', 'nS/nT_12', 'nS/nT_13', 'nS/nT_14', 'nS/nT_15',
       'nS/nT_16', 'nS/nT_17', 'nS/nT_18', 'nS/nT_19', 'nS/nT_20',
       'nS/nT_21', 'SS_C', 'SS_E', 'SS_H', 'phi_psi_alpha',
       'phi_psi_beta', 'phi_psi_other', 'side_-1_None', 'side_-1_cycle',
       'side_-1_gly', 'side_-1_long', 'side_-1_normal', 'side_-1_pro',
       'side_-1_small', 'side_-1_very_small', 'side_1_None',
       'side_1_cycle', 'side_1_gly', 'side_1_long', 'side_1_normal',
       'side_1_pro', 'side_1_small', 'side_1_very_small', 'side_2_None',
       'side_2_cycle', 'side_2_gly'

In [49]:
dataset_onehot = pd.concat([dataset[header], dataset[x_cts], dummies], axis=1)
display(np.array(dataset_onehot.columns))

array(['#', 'protein', 'flexibility', 'Proline', 'SEQ_A', 'SEQ_C',
       'SEQ_D', 'SEQ_E', 'SEQ_F', 'SEQ_G', 'SEQ_H', 'SEQ_I', 'SEQ_K',
       'SEQ_L', 'SEQ_M', 'SEQ_N', 'SEQ_P', 'SEQ_Q', 'SEQ_R', 'SEQ_S',
       'SEQ_T', 'SEQ_V', 'SEQ_W', 'SEQ_Y', 'nAli_0', 'nAli_1', 'nAli_2',
       'nAli_3', 'nPos_0', 'nPos_1', 'nPos_2', 'nPos_3', 'nS/nT_0',
       'nS/nT_1', 'nS/nT_2', 'nS/nT_3', 'nS/nT_4', 'nS/nT_5', 'nS/nT_6',
       'nS/nT_7', 'nS/nT_8', 'nS/nT_9', 'nS/nT_10', 'nS/nT_11',
       'nS/nT_12', 'nS/nT_13', 'nS/nT_14', 'nS/nT_15', 'nS/nT_16',
       'nS/nT_17', 'nS/nT_18', 'nS/nT_19', 'nS/nT_20', 'nS/nT_21', 'SS_C',
       'SS_E', 'SS_H', 'phi_psi_alpha', 'phi_psi_beta', 'phi_psi_other',
       'side_-1_None', 'side_-1_cycle', 'side_-1_gly', 'side_-1_long',
       'side_-1_normal', 'side_-1_pro', 'side_-1_small',
       'side_-1_very_small', 'side_1_None', 'side_1_cycle', 'side_1_gly',
       'side_1_long', 'side_1_normal', 'side_1_pro', 'side_1_small',
       'side_1_very_small', '

In [51]:
### generate GNN dataset###

win_size = 10
mat_size = 2*win_size+1
feat_size = dataset_onehot.shape[1]-4
name_list = []
adj_list  = []
feat_list = []
label_list = []
for name in protein_list:
    data = dataset_onehot[dataset_onehot['protein']==name].reset_index(drop=True)
    ST_index = np.where((data['SEQ_S']==1) | (data['SEQ_T']==1))[0]
    for index in ST_index:
        # clip the window to the protein boundaries
        start_index = max(index - win_size, 0)
        end_index   = min(index + win_size + 1, len(data))
        # row of the fixed-size window where the first real residue lands
        offset = start_index - (index - win_size)
        n_real = end_index - start_index

        # chain graph: every residue is linked to itself and to its sequence neighbors
        adj_matrix = np.eye(mat_size, k=-1) + np.eye(mat_size) + np.eye(mat_size, k=1)
        feat_matrix = np.zeros((mat_size, feat_size))
        feat_matrix[offset:offset + n_real] = data.iloc[start_index:end_index, 2:-2].values

        # positions outside the protein are padding: zero features and no edges
        is_pad = np.ones(mat_size, dtype=bool)
        is_pad[offset:offset + n_real] = False
        adj_matrix[is_pad, :] = 0
        adj_matrix[:, is_pad] = 0

        label = data.iloc[[index], -2:].values

        name_list.append(name)
        adj_list.append(adj_matrix)
        feat_list.append(feat_matrix)
        label_list.append(label)

In [65]:
np.array(label_list).shape

(41432, 1, 2)
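
To see what one window graph looks like, the following toy sketch builds the adjacency matrix for a short, made-up sequence. `window_adjacency` is a hypothetical helper written for this illustration, not part of the notebook. It zeroes every row and column that belongs to padding, which is one possible way to treat positions outside the protein.

```python
import numpy as np


def window_adjacency(seq_len, index, win_size):
    """Chain-graph adjacency for a window centered on `index`, with padding masked out."""
    mat_size = 2 * win_size + 1
    # each position is linked to itself and to its two sequence neighbors
    adj = np.eye(mat_size, k=-1) + np.eye(mat_size) + np.eye(mat_size, k=1)
    positions = np.arange(index - win_size, index + win_size + 1)
    is_real = (positions >= 0) & (positions < seq_len)
    adj[~is_real, :] = 0   # padded positions send no messages
    adj[:, ~is_real] = 0   # and receive none
    return adj, is_real


# Residue 1 of an 8-residue sequence with win_size=2: one padded slot on the left.
adj, is_real = window_adjacency(seq_len=8, index=1, win_size=2)
print(is_real)
print(adj)
```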

In [40]:
import argparse
import sys
import time
import copy

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from torch.autograd import Variable

In [67]:
parser = argparse.ArgumentParser()
args = parser.parse_args("")
args.seed = 123
args.val_size = 0.1
args.test_size = 0.1
args.shuffle = True

class GCNDataset(Dataset):
    """Dataset of (node features, adjacency matrix, label) triples, one per window."""

    def __init__(self, list_feature, list_adj, list_label):
        self.list_feature = list_feature
        self.list_adj = list_adj
        self.list_label = list_label

    def __len__(self):
        return len(self.list_feature)

    def __getitem__(self, index):
        return self.list_feature[index], self.list_adj[index], self.list_label[index]
    
def partition(list_feature, list_adj, list_label, args):
    # Sequential split (no shuffling): the first samples are used for training,
    # the next ones for validation, and the last num_test samples for testing.
    num_total = list_feature.shape[0]
    num_train = int(num_total * (1 - args.test_size - args.val_size))
    num_val = int(num_total * args.val_size)
    num_test = int(num_total * args.test_size)

    feature_train = list_feature[:num_train]
    adj_train = list_adj[:num_train]
    label_train = list_label[:num_train]
    feature_val = list_feature[num_train:num_train + num_val]
    adj_val = list_adj[num_train:num_train + num_val]
    label_val = list_label[num_train:num_train + num_val]
    feature_test = list_feature[num_total - num_test:]
    adj_test = list_adj[num_total - num_test:]
    label_test = list_label[num_total - num_test:]

    train_set = GCNDataset(feature_train, adj_train, label_train)
    val_set = GCNDataset(feature_val, adj_val, label_val)
    test_set = GCNDataset(feature_test, adj_test, label_test)

    return {
        'train': train_set,
        'val': val_set,
        'test': test_set
    }

dict_partition = partition(np.array(feat_list), np.array(adj_list), np.array(label_list), args)
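
The labels are stored one-hot with shape (N, 1, 2), while `nn.CrossEntropyLoss` (used further down) is most commonly given integer class indices. A minimal NumPy sketch of that conversion is shown below. The toy array and the column order (negative first, positive second) are assumptions for illustration.

```python
import numpy as np

# Toy stand-in for label_list: three windows, one-hot labels of shape (3, 1, 2).
# Assumed column order: [negative, positive].
label_onehot = np.array([[[1, 0]], [[0, 1]], [[1, 0]]])

class_index = label_onehot.reshape(-1, 2).argmax(axis=1)  # shape (3,)
print(label_onehot.shape, class_index)
```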

In [42]:
np.random.seed(args.seed)
torch.manual_seed(args.seed)

if torch.cuda.is_available():
    torch.set_default_tensor_type('torch.cuda.FloatTensor')
else:
    torch.set_default_tensor_type('torch.FloatTensor')

In [68]:
class GCNLayer(nn.Module):
    
    def __init__(self, in_dim, out_dim, n_amino, act=None, bn=False):
        super(GCNLayer, self).__init__()
        
        self.use_bn = bn
        self.linear = nn.Linear(in_dim, out_dim)
        nn.init.xavier_uniform_(self.linear.weight)
        self.bn = nn.BatchNorm1d(n_amino)
        self.activation = act
        
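    def forward(self, x, adj):
        # x: (batch, n_amino, in_dim), adj: (batch, n_amino, n_amino)
        out = self.linear(x)            # feature transform: X W
        out = torch.matmul(adj, out)    # neighborhood aggregation: A (X W)
        if self.use_bn:
            out = self.bn(out)          # BatchNorm1d treats the n_amino axis as channels
        if self.activation is not None:
            out = self.activation(out)
        return out, adj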
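
The training code below calls a `GCNNet` model that is not defined in this part of the notebook. As a rough idea of how layers like the one above could be stacked into a classifier, here is a hedged sketch. `SketchGCNLayer`, `SketchGCNNet`, their parameters, and the center-node readout are all assumptions made for illustration, not the notebook's actual model.

```python
import torch
import torch.nn as nn


class SketchGCNLayer(nn.Module):
    """Hypothetical minimal layer: ReLU(A @ (X W))."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (batch, n_nodes, in_dim), adj: (batch, n_nodes, n_nodes)
        return torch.relu(torch.matmul(adj, self.linear(x)))


class SketchGCNNet(nn.Module):
    """Hypothetical stand-in for GCNNet: stacked layers, center-node readout, classifier."""

    def __init__(self, in_dim, hidden_dim, n_layer, n_class=2):
        super().__init__()
        dims = [in_dim] + [hidden_dim] * n_layer
        self.layers = nn.ModuleList(
            [SketchGCNLayer(d_in, d_out) for d_in, d_out in zip(dims[:-1], dims[1:])]
        )
        self.classifier = nn.Linear(hidden_dim, n_class)

    def forward(self, x, adj):
        for layer in self.layers:
            x = layer(x, adj)
        center = x[:, x.shape[1] // 2, :]   # the S/T residue sits in the middle of the window
        return self.classifier(center)      # raw logits, as nn.CrossEntropyLoss expects


torch.manual_seed(0)
net = SketchGCNNet(in_dim=8, hidden_dim=16, n_layer=2)
x = torch.randn(4, 21, 8)              # 4 windows of 21 residues, 8 features each
adj = torch.eye(21).repeat(4, 1, 1)    # placeholder adjacency (self-loops only)
logits = net(x, adj)
print(logits.shape)
```

Reading out only the center node is one design choice; pooling over all real residues in the window would be another.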


In [72]:
def train(net, partition, optimizer, criterion, args):
    trainloader = torch.utils.data.DataLoader(partition['train'], 
                                              batch_size=args.train_batch_size, 
                                              shuffle=True, num_workers=2)
    net.train()
    optimizer.zero_grad()

    train_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        optimizer.zero_grad() # [21.01.05 bug fix] .zero_grad() used to run once per epoch; it now runs on every iteration.

        # get the inputs
        list_feature, list_adj, list_label = data
        list_feature = list_feature.cuda().float()
        list_adj = list_adj.cuda().float()
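        # one-hot labels (batch, 1, 2) -> class indices (batch,); assumes net outputs (batch, 2) logits
        list_label = list_label.cuda().float().view(-1, 2).argmax(dim=1)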
        outputs = net(list_feature, list_adj)

        loss = criterion(outputs, list_label)
        train_loss += loss.item()
        loss.backward()
        optimizer.step()

    train_loss = train_loss / len(trainloader)
    return net, train_loss


def validate(net, partition, criterion, args):
    valloader = torch.utils.data.DataLoader(partition['val'], 
                                            batch_size=args.test_batch_size, 
                                            shuffle=False, num_workers=2)
    net.eval()
    val_loss = 0 
    with torch.no_grad():
        for data in valloader:
            list_feature, list_adj, list_label = data
            list_feature = list_feature.cuda().float()
            list_adj = list_adj.cuda().float()
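            # one-hot labels (batch, 1, 2) -> class indices (batch,); assumes net outputs (batch, 2) logits
            list_label = list_label.cuda().float().view(-1, 2).argmax(dim=1)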
            
            outputs = net(list_feature, list_adj)

            loss = criterion(outputs, list_label)
            val_loss += loss.item()

        val_loss = val_loss / len(valloader)
    return val_loss

# def test(net, partition, args):
#     testloader = torch.utils.data.DataLoader(partition['test'], 
#                                              batch_size=args.test_batch_size, 
#                                              shuffle=False, num_workers=2)
#     net.eval()
#     with torch.no_grad():
#         logP_total = list()
#         pred_logP_total = list()
#         for data in testloader:
#             list_feature, list_adj, list_label = data
#             list_feature = list_feature.cuda().float()
#             list_adj = list_adj.cuda().float()
#             list_label = list_label.cuda().float()
#             label_total += list_label.tolist()
#             list_label = list_label.view(-1, 1)
            
#             outputs = net(list_feature, list_adj)
#             pred_label_total += outputs.view(-1).tolist()

#         mae = mean_absolute_error(label_total, pred_label_total)
#         std = np.std(np.array(label_total)-np.array(pred_label_total))
    
#     return mae, std, label_total, pred_label_total

In [73]:
def experiment(partition, args):
  
    net = GCNNet(args)
    net.cuda()

    criterion = nn.CrossEntropyLoss()
    if args.optim == 'SGD':
        optimizer = optim.SGD(net.parameters(), lr=args.lr, weight_decay=args.l2)
    elif args.optim == 'RMSprop':
        optimizer = optim.RMSprop(net.parameters(), lr=args.lr, weight_decay=args.l2)
    elif args.optim == 'Adam':
        optimizer = optim.Adam(net.parameters(), lr=args.lr, weight_decay=args.l2)
    else:
        raise ValueError('Invalid optimizer choice')
    
    train_losses = []
    val_losses = []
        
    for epoch in range(args.epoch):  # loop over the dataset multiple times
        ts = time.time()
        net, train_loss = train(net, partition, optimizer, criterion, args)
        val_loss = validate(net, partition, criterion, args)
        te = time.time()
        
        train_losses.append(train_loss)
        val_losses.append(val_loss)
        
        # accuracy is not computed by train()/validate(), so only the losses are reported
        print('Epoch {}, Loss(train/val) {:2.2f}/{:2.2f}. Took {:2.2f} sec'.format(epoch, train_loss, val_loss, te-ts))
        
#     mae, std, logP_total, pred_logP_total = test(net, partition, args)    
    
    result = {}
    result['train_losses'] = train_losses
    result['val_losses'] = val_losses
#     result['mae'] = mae  # needs the test() call above, which is commented out
#     result['std'] = std
#     result['logP_total'] = logP_total
#     result['pred_logP_total'] = pred_logP_total
    return vars(args), result
     