
TypeError: __init__() missing 1 required positional argument: 'source_path' #1

Closed
Astronaut-diode opened this issue Jan 27, 2023 · 14 comments

Comments

@Astronaut-diode

Hello, I am not sure whether this is caused by an environment problem or by a code update that was not pushed. I followed the requirements in the README and set up the environment, but some of the commands fail when I run them. Here's an example; my environment is called MANDO, and the command below is copied directly from the README.

=======================================================================
(MANDO) astronaut@dell-PowerEdge-T640:/data/space_station/ge-sc$ python node_classifier.py -ld ./logs/node_classification/cfg/gae/access_control --output_models ./models/node_classification/cfg/gae/access_control --dataset ./experiments/ge-sc-data/source_code/access_control/buggy_curated/ --compressed_graph ./experiments/ge-sc-data/source_code/access_control/buggy_curated/cfg_compressed_graphs.gpickle --node_feature gae --feature_extractor ./experiments/ge-sc-data/source_code/gesc_matrices_node_embedding/matrix_gae_dim128_of_core_graph_of_access_control_cfg_buggy_curated.pkl --testset ./experiments/ge-sc-data/source_code/access_control/curated --seed 1
Using backend: pytorch
Training phase
Getting features
Traceback (most recent call last):
  File "node_classifier.py", line 240, in <module>
    train_results, val_results = main(args)
  File "node_classifier.py", line 56, in main
    model = MANDONodeClassifier(args['compressed_graph'], feature_extractor=feature_extractor, node_feature=args['node_feature'], device=device)
TypeError: __init__() missing 1 required positional argument: 'source_path'

@minhnn-tiny (Contributor)

Hello, the source_path parameter was removed in this commit. Please make sure you pull the latest version. By the way, this command still works well on my side.

@Astronaut-diode (Author)

Hmm, I haven't fully gotten to the bottom of that issue, but now there is a new question: how do I import data from other papers? I already have the labels and source files, but I don't know how to convert them into your format, and I couldn't find this in your commit history or README. 👍

@minhnn-tiny (Contributor)

First of all, thank you for your comments; we are indeed missing some input preprocessing, and we will update it.
The currently required input is a compressed_graph, which can be generated by the scripts in the process_graphs folder. In addition, when using "GAE", "LINE", or "Node2vec" as the input node feature, you have to refer to those papers and generate the node features from the compressed_graph yourself. We did not include the "GAE", "LINE", or "Node2vec" tools in this repo; we just dumped their output for our current dataset to the .pkl files in the gesc_matrices_node_embedding folder.
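To illustrate the expected flow, here is a minimal sketch. The exact .pkl layout that node_classifier.py expects is an assumption on our side of this sketch, so please check how the feature_extractor file is loaded before reusing it; the file names are placeholders.

import pickle
import networkx as nx
import numpy as np

# Load the compressed graph produced by the process_graphs scripts
# (nx.read_gpickle is available in NetworkX 2.x).
graph = nx.read_gpickle('cfg_compressed_graphs.gpickle')

# Suppose `embeddings` is the (num_nodes x dim) matrix produced by an
# external tool such as GAE, LINE, or Node2vec, with one row per node.
embeddings = np.zeros((graph.number_of_nodes(), 128), dtype=np.float32)

# Dump it to a .pkl file, mirroring the files shipped in the
# gesc_matrices_node_embedding folder.
with open('matrix_dim128.pkl', 'wb') as f:
    pickle.dump(embeddings, f)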

@Astronaut-diode (Author) commented Mar 1, 2023

Yes, I had already discovered that when creating a new dataset, if you use "GAE", "LINE", or "Node2vec", the feature file that needs to be read doesn't actually exist. Also, it seems your paper describes GCN rather than GAE.

@Astronaut-diode (Author)

So I was very curious about how to reproduce the GAE, LINE, and Node2vec embeddings.

@erichoang (Collaborator) commented Mar 1, 2023

Hi, we reused the following GitHub repository, with minor modifications, to generate the node embeddings for the LINE and Node2vec models:
https://github.com/shenweichen/GraphEmbedding

And the authors' repository for the GAE (or GCN) model:
https://github.com/tkipf/gae

However, please note that the above GitHub repositories are quite old. You will need to set up a dedicated environment with some older package versions to run them.
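For reference, the GraphEmbedding repository exposes the two models roughly as sketched below (based on its README; the hyperparameter values are illustrative, not necessarily the ones we used):

import networkx as nx
from ge import LINE, Node2Vec

# Load the compressed graph (NetworkX 2.x).
G = nx.read_gpickle('cfg_cg_compressed_graphs.gpickle')

# LINE: embeddings based on first/second-order proximity.
line = LINE(G, embedding_size=128, order='second')
line.train(batch_size=1024, epochs=50, verbose=2)
line_embeddings = line.get_embeddings()  # dict: node -> vector

# Node2vec: biased random walks plus skip-gram.
n2v = Node2Vec(G, walk_length=10, num_walks=80, p=0.25, q=4, workers=1)
n2v.train(window_size=5, iter=3)
n2v_embeddings = n2v.get_embeddings()  # dict: node -> vector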

@Astronaut-diode (Author)

Do you mean feeding those cfg_cg_compressed_graphs.gpickle files into these two libraries to generate the corresponding pre-trained embedding files? Which model is read in when the option is "GAE", "LINE", or "Node2vec"?

@erichoang (Collaborator) commented Mar 1, 2023

Do you mean feeding those cfg_cg_compressed_graphs.gpickle files into these two libraries to generate the corresponding pre-trained embedding files?

Yes. You feed the cfg_cg_compressed_graphs.gpickle files (in NetworkX format) into the two libraries to generate the corresponding embedding files.
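Note that get_embeddings() in the GraphEmbedding repository returns a dict keyed by node, so you may need to turn it into a matrix first. A minimal sketch, assuming the rows should follow the graph's node iteration order (that ordering is an assumption here):

import numpy as np

def embeddings_to_matrix(graph, emb_dict):
    # Stack the per-node vectors in the graph's node iteration order.
    return np.stack([emb_dict[node] for node in graph.nodes()])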

@Astronaut-diode (Author)

Thank you. I'll give it a try and hope for the best. :)

@Astronaut-diode (Author)

I tried it, and it worked for LINE and Node2vec, but I couldn't find a suitable conversion interface for GAE/GCN.

@erichoang (Collaborator) commented Mar 2, 2023

The authors have seemingly modified their repository a bit since I forked it. You can check our train function in the code below. Note that if you want to reuse the code, you will need to change the import paths to your GCN model code and use a compatible TensorFlow version.

from __future__ import division
from __future__ import print_function

import time
import os
import sys

# find path to root directory of the project so as to import from other packages
tokens = os.path.abspath(__file__).split('/')
# print('tokens = ', tokens)
path2root = '/'.join(tokens[:-4])
# print('gae', 'path2root = ', path2root)
if path2root not in sys.path:
    sys.path.append(path2root)

# Train on CPU (hide GPU) due to memory constraints
# os.environ['CUDA_VISIBLE_DEVICES'] = ""

import tensorflow.compat.v1 as tf
import numpy as np
import scipy.sparse as sp

from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score
import networkx as nx

# from gae.optimizer import OptimizerAE, OptimizerVAE
# from gae.input_data import load_data
# from gae.model import GCNModelAE, GCNModelVAE
# from gae.preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple, mask_test_edges

from auto_encoders.vgae.gae.optimizer import OptimizerAE, OptimizerVAE
from auto_encoders.vgae.gae.model import GCNModelAE, GCNModelVAE
from auto_encoders.vgae.gae.preprocessing import preprocess_graph, construct_feed_dict, sparse_to_tuple, mask_test_edges

tf.disable_eager_execution()

def train(input_network, model_name='gcn_ae', emb_dim=16):
    """

    :param input_network: networkx network
    :param model_name: 'gcn_vae' or 'gcn_ae'
    :param emb_dim:
    :return:
    """
    adj = nx.adjacency_matrix(input_network)

    # Settings
    flags = tf.app.flags
    FLAGS = flags.FLAGS
    FLAGS.remove_flag_values(FLAGS.flag_values_dict())

    flags.DEFINE_float('learning_rate', 0.01, 'Initial learning rate.')
    flags.DEFINE_integer('epochs', 500, 'Number of epochs to train.')
    # flags.DEFINE_integer('epochs', 2000, 'Number of epochs to train.')
    flags.DEFINE_integer('hidden1', 32, 'Number of units in hidden layer 1.')
    flags.DEFINE_integer('hidden2', emb_dim, 'Number of units in hidden layer 2.')
    flags.DEFINE_float('weight_decay', 0., 'Weight for L2 loss on embedding matrix.')
    flags.DEFINE_float('dropout', 0., 'Dropout rate (1 - keep probability).')

    flags.DEFINE_string('model', model_name, 'Model string.')
    # flags.DEFINE_string('dataset', 'cora', 'Dataset string.')
    # flags.DEFINE_integer('features', 1, 'Whether to use features (1) or not (0).')

    model_str = FLAGS.model
    # dataset_str = FLAGS.dataset

    # Load data
    # adj, features = load_data(dataset_str)

    # Store original adjacency matrix (without diagonal entries) for later
    adj_orig = adj
    adj_orig = adj_orig - sp.dia_matrix((adj_orig.diagonal()[np.newaxis, :], [0]), shape=adj_orig.shape)
    adj_orig.eliminate_zeros()

    adj_train, train_edges, val_edges, val_edges_false, test_edges, test_edges_false = mask_test_edges(adj)
    adj = adj_train

    # if FLAGS.features == 0:
    #    features = sp.identity(features.shape[0])  # featureless

    features = sp.identity(adj.shape[0])  # featureless

    # Some preprocessing
    adj_norm = preprocess_graph(adj)

    # Define placeholders
    placeholders = {
        'features': tf.sparse_placeholder(tf.float32),
        'adj': tf.sparse_placeholder(tf.float32),
        'adj_orig': tf.sparse_placeholder(tf.float32),
        'dropout': tf.placeholder_with_default(0., shape=())
    }

    num_nodes = adj.shape[0]

    features = sparse_to_tuple(features.tocoo())
    num_features = features[2][1]
    features_nonzero = features[1].shape[0]

    # Create model
    model = None
    if model_str == 'gcn_ae':
        model = GCNModelAE(placeholders, num_features, features_nonzero)
    elif model_str == 'gcn_vae':
        model = GCNModelVAE(placeholders, num_features, num_nodes, features_nonzero)

    pos_weight = float(adj.shape[0] * adj.shape[0] - adj.sum()) / adj.sum()
    norm = adj.shape[0] * adj.shape[0] / float((adj.shape[0] * adj.shape[0] - adj.sum()) * 2)

    # Optimizer
    with tf.name_scope('optimizer'):
        if model_str == 'gcn_ae':
            opt = OptimizerAE(preds=model.reconstructions,
                              labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'],
                                                                          validate_indices=False), [-1]),
                              pos_weight=pos_weight,
                              norm=norm)
        elif model_str == 'gcn_vae':
            opt = OptimizerVAE(preds=model.reconstructions,
                               labels=tf.reshape(tf.sparse_tensor_to_dense(placeholders['adj_orig'],
                                                                           validate_indices=False), [-1]),
                               model=model, num_nodes=num_nodes,
                               pos_weight=pos_weight,
                               norm=norm)

    # Initialize session
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    def get_roc_score(edges_pos, edges_neg, emb=None):
        if emb is None:
            feed_dict.update({placeholders['dropout']: 0})
            emb = sess.run(model.z_mean, feed_dict=feed_dict)

        def sigmoid(x):
            return 1 / (1 + np.exp(-x))

        # Predict on test set of edges
        adj_rec = np.dot(emb, emb.T)
        preds = []
        pos = []
        for e in edges_pos:
            preds.append(sigmoid(adj_rec[e[0], e[1]]))
            pos.append(adj_orig[e[0], e[1]])

        preds_neg = []
        neg = []
        for e in edges_neg:
            preds_neg.append(sigmoid(adj_rec[e[0], e[1]]))
            neg.append(adj_orig[e[0], e[1]])

        preds_all = np.hstack([preds, preds_neg])
        labels_all = np.hstack([np.ones(len(preds)), np.zeros(len(preds_neg))])
        roc_score = roc_auc_score(labels_all, preds_all)
        ap_score = average_precision_score(labels_all, preds_all)

        return roc_score, ap_score

    cost_val = []
    acc_val = []
    val_roc_score = []

    adj_label = adj_train + sp.eye(adj_train.shape[0])
    adj_label = sparse_to_tuple(adj_label)

    # Train model
    for epoch in range(FLAGS.epochs):
        t = time.time()
        # Construct feed dictionary
        feed_dict = construct_feed_dict(adj_norm, adj_label, features, placeholders)
        feed_dict.update({placeholders['dropout']: FLAGS.dropout})
        # Run single weight update
        outs = sess.run([opt.opt_op, opt.cost, opt.accuracy], feed_dict=feed_dict)

        # Compute average loss
        avg_cost = outs[1]
        avg_accuracy = outs[2]

        roc_curr, ap_curr = get_roc_score(val_edges, val_edges_false)
        val_roc_score.append(roc_curr)

        print("Epoch:", '%04d' % (epoch + 1), "train_loss=", "{:.5f}".format(avg_cost),
              "train_acc=", "{:.5f}".format(avg_accuracy), "val_roc=", "{:.5f}".format(val_roc_score[-1]),
              "val_ap=", "{:.5f}".format(ap_curr),
              "time=", "{:.5f}".format(time.time() - t))

    print("Optimization Finished!")

    roc_score, ap_score = get_roc_score(test_edges, test_edges_false)
    print('Test ROC score: ' + str(roc_score))
    print('Test AP score: ' + str(ap_score))

    feed_dict.update({placeholders['dropout']: 0})
    emb = sess.run(model.z_mean, feed_dict=feed_dict)
    # print('type(emb) = ', type(emb))
    # print('emb.shape = ', emb.shape)
    return emb

# emb = train(nx.karate_club_graph(), emb_dim=32)
# emb = np.asmatrix(emb)
# print('type(emb) = ', type(emb))
# print('emb.shape = ', emb.shape)
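To tie this back to the compressed graphs discussed above, a call like the following should produce and save the embedding matrix (a sketch; the file names are placeholders):

import pickle
import networkx as nx

graph = nx.read_gpickle('cfg_cg_compressed_graphs.gpickle')
emb = train(graph, model_name='gcn_ae', emb_dim=128)

with open('matrix_gae_dim128.pkl', 'wb') as f:
    pickle.dump(emb, f)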

@Astronaut-diode (Author)

OK, I ran it based on your source code, but it exceeds my memory limit. I tried many ways to reduce the consumption, but all of them failed, so I have to give up for now.
It is worth mentioning that your model also consumes quite a lot of memory; no criticism intended, hah! :)

@erichoang (Collaborator) commented Mar 3, 2023

Yes, executing the GCN model on our contract graphs requires powerful GPU resources (we used an Nvidia A100 with 16 GB). However, if you just want quick results, I suggest focusing on the node features generated by node-type one-hot vectors, LINE, and Node2vec. In our experiments, the results from these settings are often better than those from the GCN model. Besides, these settings can run with limited GPU resources, especially LINE, which was designed for very large graphs.

@Astronaut-diode (Author)

Thank you, I've successfully reproduced everything except the GAE part, and it's great.
