TypeError: __init__() missing 1 required positional argument: 'source_path' #1
Hello, The |
Hmm, I don't really understand this question, but I now have a new problem: how do I import data from other papers? I already have the tags and source files, but I don't know how to convert them into your format, and I couldn't find this in your commit history or the README. 👍 |
First of all, thank you for your comments. We are missing some input preprocessing and will update the repository. |
Yes, I've already discovered that when creating a new dataset with "GAE", "LINE", or "Node2vec", the file you need to read doesn't actually exist. Also, as far as I can tell, your paper describes GCN, not GAE. |
So I was very curious about how to recreate GAE, LINE, and Node2vec. |
Hi, we reused the following GitHub repository, with minor modifications, to generate the node embeddings for the LINE and Node2vec models, and the authors' repository for the GAE (or GCN) model. However, please note that these GitHub repositories are quite old, so you need to set up a dedicated environment with some older settings to run them. |
Do you mean feeding those cfg_cg_compressed_graphs.gpickle files into these two libraries to generate the corresponding pre-trained embedding files? Which model do you load when the option is "GAE", "LINE", or "Node2vec"? |
Yes. You feed the cfg_cg_compressed_graphs.gpickle files (in NetworkX format) to the two libraries to generate the corresponding embedding files. |
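As a rough illustration of that step, the sketch below loads a pickled NetworkX graph and writes a whitespace-separated edge list with integer node ids, which is the kind of input edge-list-based embedding tools commonly read. The helper names (`load_gpickle`, `index_nodes`, `write_edge_list`) and the output file name are hypothetical and not part of this repository, and the exact input format expected by the LINE and Node2vec repositories should be checked against their own documentation. Plain `pickle` is used because `nx.read_gpickle` was removed in NetworkX 3.0.

```python
import pickle


def load_gpickle(path):
    """Load a pickled graph object (networkx must be installed to unpickle a graph)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def index_nodes(nodes):
    """Map arbitrary node ids to consecutive integers, in iteration order."""
    return {node: index for index, node in enumerate(nodes)}


def write_edge_list(edges, node_index, out_path):
    """Write one 'source target' pair per line and return the number of edges written.

    Only the first two items of each edge are used, so (u, v) pairs and
    (u, v, key) triples from multigraphs are both accepted.
    """
    count = 0
    with open(out_path, "w") as handle:
        for edge in edges:
            source, target = edge[0], edge[1]
            handle.write(f"{node_index[source]} {node_index[target]}\n")
            count += 1
    return count


# Hypothetical usage (file names are placeholders):
#   graph = load_gpickle("cfg_cg_compressed_graphs.gpickle")
#   node_index = index_nodes(graph.nodes())
#   write_edge_list(graph.edges(), node_index, "graph.edgelist")
```

Keeping the node-to-integer mapping around matters afterwards: the rows of the resulting embedding file have to be matched back to the original graph nodes.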
Thank you. I'll give it a try and hope for the best. :) |
I tried it and it worked for LINE and Node2vec, but I couldn't find a suitable interface for the conversion with GAE and GCN. |
The authors seem to have modified their repository a bit since I forked it. You can check our version:

from __future__ import division
from __future__ import print_function

import time
import os
import sys

# find path to root directory of the project so as to import from other packages
tokens = os.path.abspath(__file__).split('/')
# print('tokens = ', tokens)
path2root = '/'.join(tokens[:-4])
# print('gae', 'path2root = ', path2root)
if path2root not in sys.path:
    sys.path.append(path2root)

# Train on CPU (hide GPU) due to memory constraints
# os.environ['CUDA_VISIBLE_DEVICES'] = ""

import tensorflow.compat.v1 as tf
import numpy as np
import scipy.sparse as sp
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
import networkx as nx

# from gae.optimizer import OptimizerAE, OptimizerVAE
# from gae.input_data import load_data
# from gae.model import GCNModelAE, GCNModelVAE
# from gae.preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple, mask_test_edges
from auto_encoders.vgae.gae.optimizer import OptimizerAE, OptimizerVAE
from auto_encoders.vgae.gae.model import GCNModelAE, GCNModelVAE
from auto_encoders.vgae.gae.preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple, mask_test_edges

tf.disable_eager_execution()


def train(input_network, model_name='gcn_ae', emb_dim=16):
    """
    :param input_network: networkx network
    :param model_name: 'gcn_vae' or 'gcn_ae'
    :param emb_dim: size of the second hidden layer, i.e. the embedding dimension
    :return: node embedding matrix (the model's z_mean)
    """
    adj = nx.adjacency_matrix(input_network)

    # Settings
    flags = tf.app.flags
    FLAGS = flags.FLAGS
    FLAGS.remove_flag_values(FLAGS.flag_values_dict())
    flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
    flags.DEFINE_integer('epochs', 500, 'Number of epochs to train.')
    # flags.DEFINE_integer('epochs', 2000, 'Number of epochs to train.')
    flags.DEFINE_integer('hidden1', 32, 'Number of units in hidden layer 1.')
    flags.DEFINE_integer('hidden2', emb_dim, 'Number of units in hidden layer 2.')
    flags.DEFINE_float('weight_decay', 0., 'Weight for L2 loss on embedding matrix.')
    flags.DEFINE_float('dropout', 0., 'Dropout rate (1 - keep probability).')
    flags.DEFINE_string('model', model_name, 'Model string.')
    # flags.DEFINE_string('dataset', 'cora', 'Dataset string.')
    # flags.DEFINE_integer('features', 1, 'Whether to use features (1) or not (0).')

    model_str = FLAGS.model
    # dataset_str = FLAGS.dataset

    # Load data
    # adj, features = load_data(dataset_str)

    # Store original adjacency matrix (without diagonal entries) for later
    adj_orig = adj
    adj_orig = adj_orig - sp.dia_matrix((adj_orig.diagonal()[np.newaxis, :], [0]), shape=adj_orig.shape)
    adj_orig.eliminate_zeros()

    adj_train, train_edges, val_edges, val_edges_false, test_edges, test_edges_false = mask_test_edges(adj)
    adj = adj_train

    # if FLAGS.features == 0:
    #     features = sp.identity(features.shape[0])  # featureless
    features = sp.identity(adj.shape[0])  # featureless

    # Some preprocessing
    adj_norm = preprocess_graph(adj)

    # Define placeholders
    placeholders = {
        'features': tf.sparse_placeholder(tf.float32),
        'adj': tf.sparse_placeholder(tf.float32),
        'adj_orig': tf.sparse_placeholder(tf.float32),
        'dropout': tf.placeholder_with_default(0., shape=())
    }

    num_nodes = adj.shape[0]

    features = sparse_to_tuple(features.tocoo())
    num_features = features[2][1]
    features_nonzero = features[1].shape[0]

    # Create model
    model = None
    if model_str == 'gcn_ae':
        model = GCNModelAE(placeholders, num_features, features_nonzero)
    elif model_str == 'gcn_vae':
        model = GCNModelVAE(placeholders, num_features, num_nodes, features_nonzero)

    pos_weight = float(adj.shape[0] * adj.shape[0] - adj.sum()) / adj.sum()
    norm = adj.shape[0] * adj.shape[0] / float((adj.shape[0] * adj.shape[0] - adj.sum()) * 2)

    # Optimizer
    with tf.name_scope('optimizer'):
        if model_str == 'gcn_ae':
            opt = OptimizerAE(preds=model.reconstructions,
                              labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'],
                                                                          validate_indices=False), [-1]),
                              pos_weight=pos_weight,
                              norm=norm)
        elif model_str == 'gcn_vae':
            opt = OptimizerVAE(preds=model.reconstructions,
                               labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'],
                                                                           validate_indices=False), [-1]),
                               model=model, num_nodes=num_nodes,
                               pos_weight=pos_weight,
                               norm=norm)

    # Initialize session
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    cost_val = []
    acc_val = []

    def get_roc_score(edges_pos, edges_neg, emb=None):
        # feed_dict is taken from the enclosing scope; it is set in the training loop below
        if emb is None:
            feed_dict.update({placeholders['dropout']: 0})
            emb = sess.run(model.z_mean, feed_dict=feed_dict)

        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        # Predict on test set of edges
        adj_rec = np.dot(emb, emb.T)
        preds = []
        pos = []
        for e in edges_pos:
            preds.append(sigmoid(adj_rec[e[0], e[1]]))
            pos.append(adj_orig[e[0], e[1]])

        preds_neg = []
        neg = []
        for e in edges_neg:
            preds_neg.append(sigmoid(adj_rec[e[0], e[1]]))
            neg.append(adj_orig[e[0], e[1]])

        preds_all = np.hstack([preds, preds_neg])
        labels_all = np.hstack([np.ones(len(preds)), np.zeros(len(preds_neg))])
        roc_score = roc_auc_score(labels_all, preds_all)
        ap_score = average_precision_score(labels_all, preds_all)

        return roc_score, ap_score

    val_roc_score = []

    adj_label = adj_train + sp.eye(adj_train.shape[0])
    adj_label = sparse_to_tuple(adj_label)

    # Train model
    for epoch in range(FLAGS.epochs):
        t = time.time()
        # Construct feed dictionary
        feed_dict = construct_feed_dict(adj_norm, adj_label, features, placeholders)
        feed_dict.update({placeholders['dropout']: FLAGS.dropout})
        # Run single weight update
        outs = sess.run([opt.opt_op, opt.cost, opt.accuracy], feed_dict=feed_dict)

        # Compute average loss
        avg_cost = outs[1]
        avg_accuracy = outs[2]

        roc_curr, ap_curr = get_roc_score(val_edges, val_edges_false)
        val_roc_score.append(roc_curr)

        print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(avg_cost),
              "train_acc=", "{:.5f}".format(avg_accuracy), "val_roc=", "{:.5f}".format(val_roc_score[-1]),
              "val_ap=", "{:.5f}".format(ap_curr),
              "time=", "{:.5f}".format(time.time() - t))

    print("Optimization Finished!")

    roc_score, ap_score = get_roc_score(test_edges, test_edges_false)
    print('Test ROC score: ' + str(roc_score))
    print('Test AP score: ' + str(ap_score))

    feed_dict.update({placeholders['dropout']: 0})
    emb = sess.run(model.z_mean, feed_dict=feed_dict)
    # print('type(emb) = ', type(emb))
    # print('emb.shape = ', emb.shape)
    return emb


# emb = train(nx.karate_club_graph(), emb_dim=32)
# emb = np.asmatrix(emb)
# print('type(emb) = ', type(emb))
# print('emb.shape = ', emb.shape) |
OK, I ran it using your source code, but it exceeds the memory limit. I tried many approaches, but none of them reduced the memory consumption, so I have to give up for now. |
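One likely contributor to the memory use, judging from the script above, is that it materialises dense N×N arrays: the labels are built with `tf.sparse_tensor_to_dense`, and `get_roc_score` computes `np.dot(emb, emb.T)`. The sketch below estimates the size of a single such matrix; the function name is hypothetical and the node count in the usage line is an arbitrary illustration, not a measurement of the contract graphs.

```python
def dense_matrix_gib(num_nodes, bytes_per_value=4):
    """Size in GiB of one dense num_nodes x num_nodes matrix (float32 by default)."""
    return num_nodes * num_nodes * bytes_per_value / 2**30


# Illustrative only: a graph with 32768 nodes needs 4 GiB per dense float32 matrix.
print(dense_matrix_gib(32768))  # 4.0
```

Because the cost grows quadratically with the number of nodes, lowering `emb_dim` or the number of epochs would not be expected to help much; only a smaller graph shrinks these arrays.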
Yes, running the GCN model on our contract graphs requires a powerful GPU (we used an Nvidia A100-16GB). However, if you only want quick results, I suggest focusing on the node features generated from node-type one-hot vectors and from the LINE and node2vec models. In our experiments, the results from these settings are often better than those from the GCN model. These settings can also run with limited GPU resources, especially the LINE model, which is designed for very large graphs. |
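For readers unfamiliar with node-type one-hot features, here is a minimal sketch of the idea: each node gets a vector with a single 1 in the column of its type. The function name and the example type labels are hypothetical; how this repository actually stores node types and builds the features is not shown in this thread.

```python
def one_hot_node_features(node_types):
    """Return (vocabulary, feature rows) for a list with one type label per node."""
    vocabulary = sorted(set(node_types))
    column = {node_type: index for index, node_type in enumerate(vocabulary)}
    features = []
    for node_type in node_types:
        row = [0.0] * len(vocabulary)
        row[column[node_type]] = 1.0  # single active entry per node
        features.append(row)
    return vocabulary, features


# Hypothetical type labels for three nodes.
vocabulary, features = one_hot_node_features(["IF", "EXPRESSION", "IF"])
print(vocabulary)  # ['EXPRESSION', 'IF']
print(features)    # [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Unlike the GAE embeddings, these features need no training, which is why they are cheap to produce.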
Thank you, I've successfully reproduced everything except the gae part, and it's great. |
Hello, I am not sure whether this is caused by an environment problem or by your latest code update not having been uploaded. I installed the environment following the requirements in the README, but running the following commands produces errors. Here is an example. The command was copied directly and run in my environment, named MANDO.
=======================================================================
(MANDO) astronaut@dell-PowerEdge-T640:/data/space_station/ge-sc$ python node_classifier.py -ld ./logs/node_classification/cfg/gae/access_control --output_models ./models/node_classification/cfg/gae/access_control --dataset ./experiments/ge-sc-data/source_code/access_control/buggy_curated/ --compressed_graph ./experiments/ge-sc-data/source_code/access_control/buggy_curated/cfg_compressed_graphs.gpickle --node_feature gae --feature_extractor ./experiments/ge-sc-data/source_code/gesc_matrices_node_embedding/matrix_gae_dim128_of_core_graph_of_access_control_cfg_buggy_curated.pkl --testset ./experiments/ge-sc-data/source_code/access_control/curated --seed 1
Using backend: pytorch
Training phase
Getting features
Traceback (most recent call last):
  File "node_classifier.py", line 240, in <module>
    train_results, val_results = main(args)
  File "node_classifier.py", line 56, in main
    model = MANDONodeClassifier(args['compressed_graph'], feature_extractor=feature_extractor, node_feature=args['node_feature'], device=device)
TypeError: __init__() missing 1 required positional argument: 'source_path'
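The traceback suggests that the call in node_classifier.py no longer matches the constructor's signature. A mismatch like this can be checked without running the whole pipeline by binding the call's arguments against the signature. In the sketch below, `NodeClassifierStub` is a hypothetical stand-in whose parameter order is assumed; it is not the real MANDONodeClassifier, and `missing_required_args` is a helper introduced only for illustration.

```python
import inspect


class NodeClassifierStub:
    """Hypothetical stand-in with an assumed signature, NOT the real MANDONodeClassifier."""

    def __init__(self, compressed_graph, source_path, feature_extractor=None,
                 node_feature="nodetype", device="cpu"):
        self.compressed_graph = compressed_graph
        self.source_path = source_path


def missing_required_args(callable_obj, *args, **kwargs):
    """List required parameters that the given call arguments do not supply."""
    signature = inspect.signature(callable_obj)
    bound = signature.bind_partial(*args, **kwargs)
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    return [
        name
        for name, parameter in signature.parameters.items()
        if parameter.default is inspect.Parameter.empty
        and parameter.kind not in variadic
        and name not in bound.arguments
    ]


# Same argument shape as the failing call: one positional graph path plus keywords.
call_kwargs = {"feature_extractor": "matrix.pkl", "node_feature": "gae", "device": "cpu"}
print(missing_required_args(NodeClassifierStub, "graphs.gpickle", **call_kwargs))
# ['source_path']
```

Running the same helper against the real class after importing it would show which arguments the current constructor expects and whether the script or the class needs updating.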