
Memory Error while reading large files #20

Open
KarthikRevanuru opened this issue Oct 2, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

@KarthikRevanuru

I have a file with nearly 70M edges and it fails to load into memory on a machine with 32 GB of RAM.

@KarthikRevanuru
Author

@RemyLau

@KarthikRevanuru
Author

I have a graph with 4.5M nodes and 80M edges. What is your rough estimate of the running time on a large graph like this?

@RemyLau
Contributor

RemyLau commented Oct 3, 2021

Hi @KarthikRevanuru, from my rough estimate, it should take no more than 20 GB of memory to fully load and convert your graph into the CSR format, which is used as the final graph data structure. I have run some tests with a couple of large biological networks (see the bench repo). For example, SSN has roughly 72M edges and 800k nodes, and it uses ~10 GB of memory throughout the execution of the program (see line 65 or line 71 in this benchmarking result table).
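
For a rough sense of where an estimate like this comes from, below is a back-of-envelope calculation for the final CSR arrays. This is a sketch only: it assumes an undirected graph stored with 64-bit row offsets, 32-bit column indices, and 64-bit float weights (none of which are confirmed in this thread), and it ignores the temporary structures built while parsing the edge list, which dominate peak memory during conversion.

# Back-of-envelope size of the CSR arrays for the graph described above.
# Assumptions (not from PecanPy itself): undirected edges stored twice,
# 64-bit indptr, 32-bit column indices, 64-bit float edge weights.
nodes = 4_500_000
edges = 80_000_000
entries = 2 * edges                   # each undirected edge appears in both rows
indptr_bytes = (nodes + 1) * 8        # row offsets
indices_bytes = entries * 4           # column indices
data_bytes = entries * 8              # edge weights
total_gb = (indptr_bytes + indices_bytes + data_bytes) / 1e9
print(f"~{total_gb:.1f} GB for the CSR arrays alone")  # prints ~2.0 GB

The gap between this ~2 GB and the ~20 GB quoted above would come from those intermediate structures held during loading and conversion.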

May I ask what mode of execution you are using (i.e. did you explicitly set --mode to PreComp or DenseOTF)? In this case, since the network has a large number of nodes with very sparse connections, it is best to use SparseOTF (which is the default mode).
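
For reference, a minimal SparseOTF run through the Python API might look like the sketch below. The class and method names (SparseOTF, read_edg, simulate_walks) follow the PecanPy README; the exact signatures and defaults are assumptions and may differ across versions.

# Hypothetical SparseOTF usage sketch; exact signatures may vary by version.
from pecanpy import pecanpy as node2vec

g = node2vec.SparseOTF(p=1, q=1, workers=4, verbose=True)
g.read_edg("network.edg", weighted=True, directed=False)
walks = g.simulate_walks(num_walks=10, walk_length=80)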

@KarthikRevanuru
Author

Thanks for the quick reply @RemyLau
I'm using SparseOTF and I've used 32 GB of RAM, but it throws an error. On disk my input data is 3.4 GB, stored as an edge list.
I've taken a bigger instance and it seems to work, but it has printed only the following in the last 6 hours:

Took 00:06:06.35 to load graph
Took 00:00:00.00 to pre-compute transition probabilities

Do you have any estimates on the running time? Also, I've enabled verbose mode and it didn't print anything regarding the walks.

@RemyLau
Contributor

RemyLau commented Oct 3, 2021

Hmm.. What error are you seeing @KarthikRevanuru? It would be helpful if you could share the error log here.

From the log message you shared, it only took 6 minutes to load the graph, which is quite a reasonable time for a network of this size. And since you're using SparseOTF, there is no preprocessing step, so 0 sec of preprocessing is expected. But I'm not sure why you only see this after 6 hours. If you look at the command-line interface, there are no other steps before loading the graph.

PecanPy/src/pecanpy/cli.py

Lines 202 to 233 in 6a0a733

def main():
    """Pipeline for representational learning for all nodes in a graph."""
    args = parse_args()
    if args.directed and args.extend:
        raise NotImplementedError("Node2vec+ not implemented for directed graph yet.")

    @Timer("load graph", True)
    def timed_read_graph():
        return read_graph(args)

    @Timer("pre-compute transition probabilities", True)
    def timed_preprocess():
        g.preprocess_transition_probs()

    @Timer("generate walks", True)
    def timed_walk():
        return g.simulate_walks(args.num_walks, args.walk_length)

    @Timer("train embeddings", True)
    def timed_emb():
        learn_embeddings(args=args, walks=walks)

    if args.workers == 0:
        args.workers = numba.config.NUMBA_DEFAULT_NUM_THREADS
    numba.set_num_threads(args.workers)

    g = timed_read_graph()
    timed_preprocess()
    walks = timed_walk()
    g = None
    timed_emb()
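
As an aside, the Timer decorator above is what produces the "Took HH:MM:SS.ss to ..." lines in the log. A minimal sketch of how such a decorator could work is shown below; this is hypothetical, not PecanPy's actual implementation.

# Minimal sketch of a Timer-style decorator; PecanPy's real Timer may differ.
import time
from functools import wraps

class Timer:
    def __init__(self, name, verbose=True):
        self.name = name
        self.verbose = verbose

    def __call__(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            if self.verbose:
                hrs, rem = divmod(elapsed, 3600)
                mins, secs = divmod(rem, 60)
                print(f"Took {int(hrs):02d}:{int(mins):02d}:{secs:05.2f} "
                      f"to {self.name}")
            return result
        return wrapper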

@KarthikRevanuru
Author

The error message is just "Killed"; after I moved to 64 GB of RAM it's fixed.
@RemyLau this is not printed after 6 hrs. It's printed within 6 min of running, but nothing after that for 6 hrs.

@KarthikRevanuru
Author

What's the expected time to generate walks and train embeddings?

RemyLau added the enhancement (New feature or request) label Oct 3, 2021
@RemyLau
Contributor

RemyLau commented Oct 3, 2021

@KarthikRevanuru It honestly depends on a lot of factors, e.g., the number of processors, CPU clock, memory clock, etc. But in your case, I'd say 6 hours would be about the time for the random walk generation process to finish. So given these clues, I think the issue might be caused by the large number of random walks generated. Previously in my case, although the SSN network has roughly the same number of edges as yours, it has an order of magnitude fewer nodes (800k compared to 4.5M). The number of nodes does not affect the size of the sparse graph structure much, but it does affect the size of the corpus generated (i.e., the random walks).

In particular, your "Killed" error message may very likely be caused by exceeding the memory limit at the following line of code in the simulate_walks function, which converts the node index sequences into lists of strings (of node IDs).

walks = [[self.IDlst[idx] for idx in walk[:walk[-1]]] for walk in node2vec_walks()]

This was originally done for convenience when calling gensim.Word2Vec. I'll try to find an alternative solution where we don't need to convert to lists of strings first, but instead use the index sequences directly, to save memory in cases like these (networks with a large number of nodes). So stay posted.
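
To see why this conversion can exhaust 32 GB, here is a rough estimate of the walk corpus size. The walk parameters below are the common node2vec defaults (num_walks=10, walk_length=80), which are not confirmed in this thread, and the CPython object overheads are approximations.

# Rough size of the walk corpus held in memory after conversion.
# Assumed parameters (hypothetical, not confirmed in this thread):
nodes = 4_500_000
num_walks = 10
walk_length = 80

walks = nodes * num_walks               # 45M lists
tokens = walks * (walk_length + 1)      # ~3.6e9 node-id strings (upper bound)
pointer_bytes = tokens * 8              # one pointer per token; strings shared via IDlst
list_overhead = walks * 64              # approx. per-list header plus slack
total_gb = (pointer_bytes + list_overhead) / 1e9
print(f"~{total_gb:.0f} GB just to hold the lists of node-id strings")  # ~32 GB

Even with the node-id strings themselves deduplicated through self.IDlst, the list pointers alone are on the order of the machine's total RAM, which is consistent with the process being killed at 32 GB but surviving at 64 GB.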

In the meantime, if possible, try to further increase your memory allocation to, say, 128 GB, and see if that resolves the issue.

@KarthikRevanuru
Author

Ok, thanks!
