
Memory Error while reading large files #20

Open
KarthikRevanuru opened this issue Oct 2, 2021 · 9 comments
Labels
enhancement New feature or request

Comments

@KarthikRevanuru

I have a file with nearly 70M edges and it fails to load into memory on a machine with 32 GB of RAM.

@KarthikRevanuru
Author

@RemyLau

@KarthikRevanuru
Author

I have a graph with 4.5M nodes and 80M edges. What is your rough estimate of the running time on a large graph like this?

@RemyLau
Contributor

RemyLau commented Oct 3, 2021

Hi @KarthikRevanuru, from my rough estimate, it should take no more than 20 GB of memory to fully load and convert your graph into the CSR format, which is used as the final graph data structure. I have run some tests with a couple of large biological networks (see the bench repo). For example, SSN has roughly 72M edges and 800k nodes, and it uses ~10 GB of memory throughout the execution of the program (see line 65 or line 71 in this benchmarking result table).
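
For a rough sense of where an estimate like this comes from, below is a back-of-envelope calculation for the final CSR arrays. This is a sketch only: it assumes an undirected graph stored with 64-bit row offsets, 32-bit column indices, and 64-bit float weights (none of which are confirmed in this thread), and it ignores the temporary structures built while parsing the edge list, which dominate peak memory during conversion.

# Back-of-envelope size of the CSR arrays for the graph described above.
# Assumptions (not from PecanPy itself): undirected edges stored twice,
# 64-bit indptr, 32-bit column indices, 64-bit float edge weights.
nodes = 4_500_000
edges = 80_000_000
entries = 2 * edges                   # each undirected edge appears in both rows
indptr_bytes = (nodes + 1) * 8        # row offsets
indices_bytes = entries * 4           # column indices
data_bytes = entries * 8              # edge weights
total_gb = (indptr_bytes + indices_bytes + data_bytes) / 1e9
print(f"~{total_gb:.1f} GB for the CSR arrays alone")  # prints ~2.0 GB

The gap between this ~2 GB and the ~20 GB quoted above would come from those intermediate structures held during loading and conversion.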

May I ask what mode of execution you are using (i.e. did you explicitly set --mode to PreComp or DenseOTF)? In this case, since the network has a large number of nodes with very sparse connections, it is best to use SparseOTF (which is the default mode).
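
For reference, a minimal SparseOTF run through the Python API might look like the sketch below. The class and method names (SparseOTF, read_edg, simulate_walks) follow the PecanPy README; the exact signatures and defaults are assumptions and may differ across versions.

# Hypothetical SparseOTF usage sketch; exact signatures may vary by version.
from pecanpy import pecanpy as node2vec

g = node2vec.SparseOTF(p=1, q=1, workers=4, verbose=True)
g.read_edg("network.edg", weighted=True, directed=False)
walks = g.simulate_walks(num_walks=10, walk_length=80)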

@KarthikRevanuru
Author

Thanks for the quick reply @RemyLau
I'm using SparseOTF and I've used 32 GB of RAM, but it throws an error. On disk my input data is 3.4 GB, stored as an edge list.
I've taken a bigger instance and it seems to work, but it has printed only the following in the last 6 hours:

Took 00:06:06.35 to load graph
Took 00:00:00.00 to pre-compute transition probabilities

Do you have any estimates on the running time? Also, I've enabled verbose mode and it didn't print anything regarding the walks.

@RemyLau
Contributor

RemyLau commented Oct 3, 2021

Hmm.. What error are you seeing @KarthikRevanuru? It would be helpful if you could share the error log here.

From the log message you shared, it only took 6 minutes to load the graph, which is quite a reasonable time for a network of this size. And since you're using SparseOTF, there is no preprocessing step, so 0 sec of preprocessing is expected. But I'm not sure why you only see this after 6 hours. If you look at the command-line interface, there are no other steps before loading the graph.

PecanPy/src/pecanpy/cli.py

Lines 202 to 233 in 6a0a733

def main():
    """Pipeline for representational learning for all nodes in a graph."""
    args = parse_args()
    if args.directed and args.extend:
        raise NotImplementedError("Node2vec+ not implemented for directed graph yet.")

    @Timer("load graph", True)
    def timed_read_graph():
        return read_graph(args)

    @Timer("pre-compute transition probabilities", True)
    def timed_preprocess():
        g.preprocess_transition_probs()

    @Timer("generate walks", True)
    def timed_walk():
        return g.simulate_walks(args.num_walks, args.walk_length)

    @Timer("train embeddings", True)
    def timed_emb():
        learn_embeddings(args=args, walks=walks)

    if args.workers == 0:
        args.workers = numba.config.NUMBA_DEFAULT_NUM_THREADS
    numba.set_num_threads(args.workers)

    g = timed_read_graph()
    timed_preprocess()
    walks = timed_walk()
    g = None
    timed_emb()
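
As an aside, the Timer decorator above is what produces the "Took HH:MM:SS.ss to ..." lines in the log. A minimal sketch of how such a decorator could work is shown below; this is hypothetical, not PecanPy's actual implementation.

# Minimal sketch of a Timer-style decorator; PecanPy's real Timer may differ.
import time
from functools import wraps

class Timer:
    def __init__(self, name, verbose=True):
        self.name = name
        self.verbose = verbose

    def __call__(self, func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            if self.verbose:
                hrs, rem = divmod(elapsed, 3600)
                mins, secs = divmod(rem, 60)
                print(f"Took {int(hrs):02d}:{int(mins):02d}:{secs:05.2f} "
                      f"to {self.name}")
            return result
        return wrapper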

@KarthikRevanuru
Author

The error message is just "Killed"; after I moved to 64 GB of RAM it's fixed.
@RemyLau this is not printed after 6 hrs. It's printed within 6 min of running, but nothing after that for 6 hrs.

@KarthikRevanuru
Author

What's the expected time to generate walks and train embeddings?

RemyLau added the enhancement (New feature or request) label Oct 3, 2021
@RemyLau
Contributor

RemyLau commented Oct 3, 2021

@KarthikRevanuru It honestly depends on a lot of factors, e.g., the number of processors, CPU clock, memory clock, etc. But in your case, I'd say 6 hours would be about the time for the random walk generation process to finish. So given these clues, I think the issue might be caused by the large number of random walks generated. Previously in my case, although the SSN network has roughly the same number of edges as yours, it has an order of magnitude fewer nodes (800k compared to 4.5M). The number of nodes does not affect the size of the sparse graph structure much, but it does affect the size of the corpus generated (i.e., the random walks).

In particular, your "Killed" error message may very likely be caused by exceeding the memory limit at the following line of code in the simulate_walks function, which converts the node index sequences into lists of strings (of node IDs).

walks = [[self.IDlst[idx] for idx in walk[:walk[-1]]] for walk in node2vec_walks()]

This was originally done for convenience when calling gensim.Word2Vec. I'll try to find an alternative solution where we don't need to convert to lists of strings first, but instead use the index sequences directly, to save memory in cases like these (networks with a large number of nodes). So stay posted.
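
To see why this conversion can exhaust 32 GB, here is a rough estimate of the walk corpus size. The walk parameters below are the common node2vec defaults (num_walks=10, walk_length=80), which are not confirmed in this thread, and the CPython object overheads are approximations.

# Rough size of the walk corpus held in memory after conversion.
# Assumed parameters (hypothetical, not confirmed in this thread):
nodes = 4_500_000
num_walks = 10
walk_length = 80

walks = nodes * num_walks               # 45M lists
tokens = walks * (walk_length + 1)      # ~3.6e9 node-id strings (upper bound)
pointer_bytes = tokens * 8              # one pointer per token; strings shared via IDlst
list_overhead = walks * 64              # approx. per-list header plus slack
total_gb = (pointer_bytes + list_overhead) / 1e9
print(f"~{total_gb:.0f} GB just to hold the lists of node-id strings")  # ~32 GB

Even with the node-id strings themselves deduplicated through self.IDlst, the list pointers alone are on the order of the machine's total RAM, which is consistent with the process being killed at 32 GB but surviving at 64 GB.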

In the meantime, if possible, try to further increase your memory allocation to, say, 128 GB, and see if that resolves the issue.

@KarthikRevanuru
Author

Ok, thanks!
