**What is a graph neural network?**

A graph neural network (GNN) is a type of neural network designed to process data represented as a graph. 
GNNs are used for a variety of tasks such as node classification, link prediction, and graph generation. 
They are particularly useful for problems that involve analyzing relationships between entities, such as social networks, molecular structures, and transportation networks. GNNs use a combination of neural networks and graph theory to learn features and representations of nodes and edges in a graph.

**The mechanism of GNNs**

A graph is a collection of nodes (also called vertices) and edges (also called connections) that connect the nodes. Each node in the graph represents an entity and each edge represents a relationship between two entities.
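
As a minimal sketch of this definition, the snippet below stores a small undirected graph both as an adjacency matrix and as an adjacency list. The four-node graph and the variable names are hypothetical, chosen only for illustration.

```python
import numpy as np

# Hypothetical toy graph: 4 nodes, undirected edges given as (i, j) pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
num_nodes = 4

# Adjacency matrix: A[i, j] = 1 when nodes i and j share an edge.
A = np.zeros((num_nodes, num_nodes), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1  # undirected, so the matrix is symmetric

# Adjacency list: for each node, the indices of its neighbors.
neighbors = {i: np.flatnonzero(A[i]).tolist() for i in range(num_nodes)}

# Node degree = number of edges touching the node.
degrees = A.sum(axis=1)

print(A)
print(neighbors)
print(degrees)
```

The matrix form is convenient for the linear-algebra view of GNNs, while the list form mirrors how messages travel along edges.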

The main mechanism of GNNs is the message passing process. In message passing, the model iteratively updates the representation of each node based on the representations of its neighboring nodes. This process is done by passing messages from each node to its neighboring nodes through the edges that connect them.

The message passing process can be divided into two main steps:
1. Aggregation: In the aggregation step, the model aggregates the information from the neighboring nodes. This is usually done with an order-independent function such as a sum, mean, or maximum over the representations of the neighboring nodes. Some models, such as attention-based ones, use a weighted sum whose weights are learned.
1. Update: In the update step, the model updates the representation of the current node based on the aggregated information. This is usually done by applying a neural network to the aggregated information.

This process is repeated for a fixed number of iterations or until a stopping criterion is met.
One of the key components of GNNs is a trainable function called the "GNN layer", which carries out one round of aggregation and update for every node. The update function inside the layer can be a neural network such as an MLP, RNN, or CNN. After the last layer, the final representation of each node is used as the output of the GNN.
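
The two steps can be sketched in NumPy as follows. This is an illustrative simplification, not a specific published model: it assumes mean aggregation, a single linear update followed by ReLU, and the same (randomly initialized, untrained) weights in every iteration. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy graph with 4 nodes (symmetric adjacency, no self-loops).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
H = rng.normal(size=(4, 3))        # one 3-dimensional representation per node
W_self = rng.normal(size=(3, 3))   # weights applied to the node's own state
W_neigh = rng.normal(size=(3, 3))  # weights applied to the aggregated message

def message_passing_step(A, H, W_self, W_neigh):
    # Aggregation: average the representations of each node's neighbors.
    deg = A.sum(axis=1, keepdims=True)
    messages = (A @ H) / np.maximum(deg, 1.0)
    # Update: combine own state and aggregated message, then apply ReLU.
    return np.maximum(H @ W_self + messages @ W_neigh, 0.0)

# Repeat for a fixed number of iterations; after k rounds each node has
# received information from nodes up to k hops away.
for _ in range(2):
    H = message_passing_step(A, H, W_self, W_neigh)

print(H.shape)
```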

GNNs can be applied to various tasks such as node classification, link prediction, and graph generation. They have been widely used in various fields such as natural language processing, computer vision, bioinformatics, and social networks.

**Several types of GNNs**

There are several types of Graph Neural Networks (GNNs) that have been developed, each with its own strengths and weaknesses. Some of the most common types of GNNs include:

1. Graph Convolutional Networks (GCNs): GCNs use a convolutional architecture to process graph data. They are based on the convolutional neural networks (CNNs) used in image processing, but are adapted to work with graph-structured data. GCNs use a local neighborhood aggregation scheme to update the representation of each node.

1. Graph Attention Networks (GATs): GATs use an attention mechanism to weight the contributions of neighboring nodes when updating the representation of each node. This allows the model to focus on the most important neighbors for each node.

1. Graph Recurrent Networks (GRNs): GRNs use a recurrent architecture to process graph data. They are based on the recurrent neural networks (RNNs) used in sequential data, but are adapted to work with graph-structured data.

1. Graph Auto-Encoders (GAEs): GAEs are neural networks that are designed to learn representations of graphs in a low-dimensional space. They consist of two main components: an encoder that maps the graph to a low-dimensional representation, and a decoder that maps the low-dimensional representation back to the original graph.

1. Graph Transformer Networks (GTNs): GTNs are based on the transformer architecture, which is widely used in natural language processing. They use self-attention mechanisms to weight the contributions of neighboring nodes when updating the representation of each node.

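1. Simplifying GNNs: Simplifying GNNs are designed to reduce the complexity of GNNs, making them more efficient and scalable. Examples include GraphSAGE and FastGCN, which sample neighbors or nodes instead of aggregating over the full neighborhood. JK-Net is often listed alongside them, although its main aim is to combine representations from different layers rather than to reduce cost.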

These are some of the most popular types of GNNs, but there are many other variations and combinations that have been proposed in the literature. The choice of GNN architecture will depend on the specific problem and dataset you're working with.
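
To make the attention-based weighting used by GATs more concrete, here is a rough NumPy sketch of single-head attention coefficients. It assumes the additive scoring form with a LeakyReLU and a softmax over each node's neighborhood (self-loops included), and it leaves out the output non-linearity and multi-head averaging. The weights are random and untrained, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
A_self = A + np.eye(4)         # let each node attend to itself as well
H = rng.normal(size=(4, 3))    # input node features
W = rng.normal(size=(3, 2))    # shared linear projection
a = rng.normal(size=(4,))      # attention vector over the pair [z_i || z_j]

Z = H @ W                      # projected features, shape (4, 2)

# Raw score e_ij = LeakyReLU(a^T [z_i || z_j]); the dot product splits into
# a part that depends on node i and a part that depends on node j.
scores = (Z @ a[:2])[:, None] + (Z @ a[2:])[None, :]
scores = np.where(scores > 0, scores, 0.2 * scores)   # LeakyReLU, slope 0.2

# Mask non-edges, then apply a row-wise softmax over each neighborhood.
scores = np.where(A_self > 0, scores, -np.inf)
scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
alpha = np.exp(scores)
alpha = alpha / alpha.sum(axis=1, keepdims=True)

H_out = alpha @ Z              # attention-weighted sum of neighbor features
print(alpha.round(3))
```

Each row of `alpha` sums to one and is zero wherever two nodes are not connected, so a node only mixes in features from its own neighborhood.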

**Graph Convolutional Networks (GCNs)**

Graph Convolutional Networks (GCNs) are one of the most popular types of Graph Neural Networks (GNNs). They were first introduced by Thomas Kipf and Max Welling in 2017 in the paper "Semi-Supervised Classification with Graph Convolutional Networks".

GCNs are based on the convolutional neural networks (CNNs) used in image processing, but are adapted to work with graph-structured data. They use a local neighborhood aggregation scheme to update the representation of each node and a trainable convolutional kernel to learn the weights.

GCNs have been successfully applied to a variety of graph-based tasks such as node classification, link prediction, and graph generation. They have been widely used in various fields such as natural language processing, computer vision, bioinformatics, and social networks.

One of the key advantages of GCNs is that they can effectively capture local structural information in the graph, making them well suited for tasks that involve analyzing relationships between entities. They also have a relatively simple architecture and can be trained efficiently on large graphs.

Because of their simplicity and good performance, GCNs have become a common first choice for many researchers working on graph-based problems, although they are by no means the only option.

**1. Aggregation steps**

In Graph Convolutional Networks (GCNs), the aggregation step is used to combine the information from a node's neighborhood in order to update the node's representation. The aggregation step is a weighted sum over the neighborhood. In a GCN the weights of this sum are fixed by the node degrees, and the parameters learned during training are the entries of the layer's weight matrix.

The mathematical formulation of the aggregation step in GCNs can be represented as follows:

Consider a graph $G = (V, E)$ with $n$ nodes and $m$ edges, and a node feature matrix $X \in \mathbb{R}^{n \times d}$, where $d$ is the number of features for each node.

Let $A$ be the adjacency matrix of the graph, where $A_{i,j} = 1$ if there is an edge between nodes $i$ and $j$, and 0 otherwise.

So that each node keeps its own features during aggregation, self-loops are added: $\tilde{A} = A + I$. Let $\tilde{D}$ be the degree matrix of $\tilde{A}$, where $\tilde{D}_{i,i} = \sum_j \tilde{A}_{i,j}$.

The normalized adjacency matrix is defined as:
$\hat{A} = \tilde{D}^{-\frac{1}{2}}\tilde{A}\tilde{D}^{-\frac{1}{2}}$

The graph convolution is defined as:
$H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)})$

where $H^{(l)}$ is the matrix of node representations at the $l$-th layer (with $H^{(0)} = X$), $W^{(l)}$ is the weight matrix of the $l$-th layer, and $\sigma$ is the non-linear activation function.

The above equation describes the aggregation step in GCNs, where the representations of the node's neighbors are combined with the node's own representation and the learned weight matrix, in order to update the node's representation.

In summary, the aggregation step in GCNs combines the information from a node's neighborhood in order to update the node's representation. It is a degree-normalized weighted sum, computed by multiplying the normalized adjacency matrix by the node feature matrix. The result is then multiplied by a weight matrix that is learned during training.
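
The graph convolution can be sketched in a few lines of NumPy. This sketch adds self-loops before normalizing, as in Kipf and Welling's formulation, so that each node's own features enter the sum. The function names, the three-node path graph, and the identity weight matrix are assumptions made for illustration.

```python
import numpy as np

def normalize_adjacency(A):
    """Return D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_tilde = A + np.eye(A.shape[0])     # add self-loops
    deg = A_tilde.sum(axis=1)            # every degree is >= 1 after self-loops
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One graph convolution: ReLU(A_hat @ H @ W)."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Hypothetical toy input: a 3-node path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.eye(2)   # identity weights, so the output shows the pure aggregation

A_hat = normalize_adjacency(A)
H1 = gcn_layer(A_hat, X, W)
print(A_hat.round(3))
print(H1.round(3))
```

Entry $(i, j)$ of `A_hat` equals $1/\sqrt{\tilde{d}_i \tilde{d}_j}$ for connected nodes, so high-degree neighbors contribute less to the sum.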

**2. Update steps**

In Graph Convolutional Networks (GCNs), the update step is used to update the representation of each node in the graph, based on the information gathered from the aggregation step. The update step is typically performed using a linear transformation of the aggregated information, where the transformation is learned during training.

The mathematical formulation of the update step in GCNs can be represented as follows:

Given the aggregated features $\hat{A}H^{(l)}$, the standard GCN update applies the learned weight matrix and the activation function:
$H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)})$

where $H^{(l)}$ is the matrix of node representations at the $l$-th layer, $W^{(l)}$ is the weight matrix of the $l$-th layer, $\hat{A}$ is the normalized adjacency matrix, and $\sigma$ is the non-linear activation function.

Optionally, a residual (skip) connection can add the layer input back to the result:
$H^{(l+1)} = \sigma(\hat{A}H^{(l)}W^{(l)}) + H^{(l)}$

This equation adds the new node representations to the current node representations, which allows the model to incorporate both the local structural information from the neighborhood and the node's own previous features. It requires $H^{(l)}$ and the layer output to have the same number of columns.

In summary, the update step of a GCN produces the new representation of each node from the information gathered in the aggregation step. It applies a linear transformation, learned during training, followed by a non-linearity, and it can optionally add the result to the current node representations through a residual connection.
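
The additive form of the update can be sketched as below. This is a minimal illustration under stated assumptions: the `residual` flag, the random untrained weights, and the three-node graph are hypothetical, and the sum is only defined when the layer keeps the feature width unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(A_hat, H, W, residual=False):
    """ReLU(A_hat @ H @ W), optionally adding the layer input back."""
    out = np.maximum(A_hat @ H @ W, 0.0)
    if residual:
        if out.shape != H.shape:
            raise ValueError("residual connection needs matching input/output widths")
        out = out + H
    return out

# Normalized adjacency of a hypothetical 3-node path graph with self-loops.
A_tilde = np.array([[1, 1, 0],
                    [1, 1, 1],
                    [0, 1, 1]], dtype=float)
d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
A_hat = A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

H0 = rng.normal(size=(3, 4))
W0 = rng.normal(size=(4, 4))   # square, so the residual sum is well defined

H_plain = gcn_layer(A_hat, H0, W0)
H_res = gcn_layer(A_hat, H0, W0, residual=True)

# The two outputs differ exactly by the layer input.
print(np.allclose(H_res - H_plain, H0))
```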

**Between Keras and PyTorch, which one is more popular for implementing GNN algorithms?**

Both PyTorch and Keras are popular deep learning libraries, and both can be used to implement Graph Neural Networks (GNNs). The choice between the two depends on the specific requirements of your project and your personal preferences.

PyTorch is a lower-level library and provides more flexibility in terms of customizing the model's architecture and training process. It also has a large community of researchers and developers who contribute to its development and provide support. Graph-specific libraries built on PyTorch, such as PyTorch Geometric and DGL, provide graph-processing utilities and ready-made GNN layers, which makes it easy to implement GNNs.

Keras, on the other hand, is a higher-level library that provides a simpler and more intuitive interface for building and training models. It is also more user-friendly, making it a popular choice for researchers and practitioners who are new to deep learning. Keras also provides support for GNNs through external libraries such as Keras-GCN and keras-gat, but this support is not as extensive as what is available for PyTorch.

In practice, both libraries are widely used in research and industry, and both have their own advantages and disadvantages. It ultimately comes down to the specific needs of your project and your own experience and preferences. If you are familiar with PyTorch and you need to implement complex GNN architectures, PyTorch might be the better choice. But if you are new to deep learning and you prefer a more user-friendly interface, Keras might be the better choice.

In [12]:
import torch
import torch.nn.functional as F
import torch_geometric as tg
import torch_geometric.transforms as T
from torch_geometric.nn import GCNConv  # used by GCNModel below

In [13]:
import pandas as pd
pd.set_option('display.max_rows', 200)
import numpy as np

import time
import random

from os import getcwd 
from os.path import exists

print(getcwd()) # current working directory

version = 'v4'
update = True

D:\project\MIT_glyco


In [None]:
load_name = f'{version}_all_sites_group.csv'
all_sites = pd.read_csv(load_name)

protein_list = list(all_sites.protein.unique())
pass_list = ["P24622_2", "Q91YE8_2"] #these proteins have positive sites which are out of bound

for x in pass_list:
    protein_list.remove(x) 
    
print("total number of proteins:", len(protein_list))

In [None]:
load_path = './protein_dataset'
frames = []
for name in protein_list:
    load_name = f"{load_path}/{version}_dataset_{name}.csv"
    temp = pd.read_csv(load_name, index_col=0)
    temp['protein'] = temp.index.name  # the index header is used as the protein label
    frames.append(temp)

# concatenate once instead of growing the frame inside the loop
dataset = pd.concat(frames, axis=0).reset_index(drop=True)

In [None]:
dataset.info()

In [None]:
header = ['#', 'protein']
x_cts = ['flexibility', 'Proline']
x_cat = ['SEQ', 'nAli', 'nPos', 'nS/nT', 'SS', 'phi_psi', 
         'side_-1', 'side_1', 'side_2', 'side_3', 'side_4', 'side_5']
y = ['positivity']

dummies = pd.get_dummies(dataset[x_cat], columns=x_cat)
display(np.array(dummies.columns))
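
As a small illustration of what `pd.get_dummies` does to the categorical columns, the sketch below encodes a hypothetical three-row frame. The column names mirror two entries of `x_cat`, but the values are made up.

```python
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({"SEQ": ["S", "T", "A"], "SS": ["H", "E", "H"]})

# Each categorical column expands into one indicator column per observed
# category, named "<column>_<category>".
toy_onehot = pd.get_dummies(toy, columns=["SEQ", "SS"])

print(list(toy_onehot.columns))
print(toy_onehot.astype(int))
```

This naming scheme is why later cells can select serine and threonine positions through the `SEQ_S` and `SEQ_T` columns.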

In [None]:
dataset_onehot = pd.concat([dataset[header], dataset[x_cts], dummies, dataset[y]], axis=1)
display(np.array(dataset_onehot.columns))

In [None]:
### generate GNN dataset ###

win_size = 10

# Initialize the lists once, outside the loop, so that samples from every
# protein are kept (re-creating them per protein keeps only the last one).
name_list = []
adj_list = []
feat_list = []
label_list = []

for name in protein_list:
    data = dataset_onehot[dataset_onehot['protein']==name].reset_index(drop=True)
    # positions of serine (S) and threonine (T) residues
    ST_index = np.where((data['SEQ_S']==1) | (data['SEQ_T']==1))[0]
    for index in ST_index:
        # clip the window to the sequence boundaries
        start_index = max(index-win_size, 0)
        end_index   = min(index+win_size+1, len(data))
        index_len = end_index - start_index

        # chain graph: each residue is linked to itself and its sequence neighbors
        adj_matrix = np.eye(index_len, k=-1) + np.eye(index_len) + np.eye(index_len, k=1)
        feat_matrix = data.iloc[start_index:end_index, 2:-1]  # drop header and label columns
        label = data.iloc[[index], [-1]]

        name_list.append(name)
        adj_list.append(adj_matrix)
        feat_list.append(feat_matrix)
        label_list.append(label)
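
The window logic above can be checked in isolation with the sketch below, which also converts the dense adjacency matrix into a `2 x num_edges` array of (row, column) pairs, the layout PyTorch Geometric uses for `edge_index`. The helper names `window_adjacency` and `to_edge_index` and the 30-residue length are hypothetical.

```python
import numpy as np

def window_adjacency(index, seq_len, win_size):
    """Chain-graph adjacency (with self-loops) for residues within win_size of index."""
    start = max(index - win_size, 0)
    end = min(index + win_size + 1, seq_len)
    n = end - start
    adj = np.eye(n, k=-1) + np.eye(n) + np.eye(n, k=1)
    return start, end, adj

def to_edge_index(adj):
    """Convert a dense adjacency matrix to a 2 x num_edges array of (row, col) pairs."""
    rows, cols = np.nonzero(adj)
    return np.stack([rows, cols])

# A site near the start of a hypothetical 30-residue protein: the window is
# truncated on the left, so it holds fewer than 2 * win_size + 1 residues.
start, end, adj = window_adjacency(index=3, seq_len=30, win_size=10)
edge_index = to_edge_index(adj)

print(start, end, adj.shape, edge_index.shape)
```

Because windows near the ends of a sequence are shorter, the resulting graphs differ in size, which matters when batching them for training.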

In [15]:
import numpy as np 
import dgl 
from dgl.nn import GraphConv 
 
import torch
import torch.nn as nn 
import torch.optim as optim 
import torch.nn.functional as F 

import dgl.data

DGL backend not selected or invalid.  Assuming PyTorch for now.


Setting the default backend to "pytorch". You can change it in the ~/.dgl/config.json file or export the DGLBACKEND environment variable.  Valid options are: pytorch, mxnet, tensorflow (all lowercase)


ImportError: cannot import name 'spmatrix' from 'scipy.sparse' (unknown location)

In [None]:
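import torch.nn as nn
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNModel(nn.Module):
    """Two-layer GCN for node classification."""
    def __init__(self, nfeat, nhid, nclass, dropout):
        super().__init__()

        self.gc1 = GCNConv(nfeat, nhid)
        self.gc2 = GCNConv(nhid, nclass)
        self.dropout = dropout

    def forward(self, x, edge_index):
        # GCNConv takes the graph as an edge_index tensor of shape [2, num_edges]
        x = F.relu(self.gc1(x, edge_index))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.gc2(x, edge_index)
        return F.log_softmax(x, dim=1)

# Model and optimizer
import torch.optim as optim
import time

# NOTE: from here on, `data` is assumed to be a torch_geometric.data.Data object
# (with x, y, edge_index and train/val/test masks), not the DataFrame used above.

# Create the GCNModel object
model = GCNModel(nfeat=data.x.shape[1],
            nhid=20,
            nclass=data.y.max().item() + 1,
            dropout=0.6)

# Define the accuracy function
def accuracy(output, labels):
    preds = output.max(1)[1].type_as(labels)
    correct = preds.eq(labels).double()
    correct = correct.sum()
    return correct / len(labels)

# Create the optimizer
optimizer = optim.Adam(model.parameters(),
                       lr=0.5, weight_decay=.0)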



In [None]:
# Define the train function
def train(epoch):
    t = time.time()
    model.train()
    optimizer.zero_grad()
    output = model(data.x, data.edge_index)
    loss_train = F.nll_loss(output[data.train_mask], data.y[data.train_mask])
    acc_train = accuracy(output[data.train_mask], data.y[data.train_mask])
    loss_train.backward()
    optimizer.step()

    # Evaluate on the validation set with dropout disabled and no gradient tracking
    model.eval()
    with torch.no_grad():
        output = model(data.x, data.edge_index)
    loss_val = F.nll_loss(output[data.val_mask], data.y[data.val_mask])
    acc_val = accuracy(output[data.val_mask], data.y[data.val_mask])
    print('Epoch: {:04d}'.format(epoch+1),
          'loss_train: {:.4f}'.format(loss_train.item()),
          'acc_train: {:.4f}'.format(acc_train.item()),
          'loss_val: {:.4f}'.format(loss_val.item()),
          'acc_val: {:.4f}'.format(acc_val.item()),
          'time: {:.4f}s'.format(time.time() - t))

# Train for 100 epochs
for epoch in range(100):
    train(epoch)


In [None]:
# Define the test function
def test():
    model.eval()
    with torch.no_grad():  # no gradients are needed for evaluation
        output = model(data.x, data.edge_index)
    loss_test = F.nll_loss(output[data.test_mask], data.y[data.test_mask])
    acc_test = accuracy(output[data.test_mask], data.y[data.test_mask])
    print("Test set results:",
          "loss= {:.4f}".format(loss_test.item()),
          "accuracy= {:.4f}".format(acc_test.item()))
test()


In [None]:
name = protein_list[0]
load_path = './protein_dataset'
load_name = f"{load_path}/{version}_dataset_{name}.csv"

dataset = pd.read_csv(load_name, index_col=0)
dataset.info()

In [None]:
np.set_printoptions(threshold=np.inf, linewidth=np.inf) 

pro_win_size = 10 # protein window size
pro_win_len = pro_win_size*2+1 # protein window length

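pro_name = name # assumption: the protein selected in the previous cell
pro_ss = pd.read_csv(f"./protein_sequence/{pro_name}.csv", index_col=0)
pro_S_T = pro_ss[(pro_ss['SEQ']=='S') | (pro_ss['SEQ']=='T')]
# connect each residue in the window to its immediate sequence neighbors
adjacency_matrix = np.eye(pro_win_len, k=1) + np.eye(pro_win_len, k=-1)

adjacency_matrix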