Numerosity-based input? #23

vigna · 2019-10-22T10:03:12Z

It is very common, when trying to analyse distributions of, say, web graph, to have the distribution in the form
1 143245
2 395599
etc., that is, for each possible value the associated numerosity. The way we're using plfit now is simply that of generating a sample by printing 143254 times "1", 395599 times "2", and so on, but this is slow, memory-hungry and definitely suboptimal. How difficult would be to have plfit use directly a sample specified as above?

ntamas · 2019-10-22T17:07:03Z

Well, it's not impossible but it definitely needs a bit of work. Right now all the functions in plfit.c work with an array of doubles. (This is because the primary use-case for plfit so far was to fit power-laws to degree distributions in igraph, where you already have a vector that contains the degrees for each node in the graph). Switching to the proposed representation (which is basically a run-length encoded version of the sorted sample vector) would necessitate rewriting plfit_i_logsum_discrete() and plfit_i_logsum_less_than_discrete at least, along with all the functions that call these two functions, and the functions related to resampling (for estimating the p-value). Unfortunately I don't think I'll have time for this in the near future, but I can help with the code review if someone is willing to send a PR.

Alternatively, if you worry only about the space consumed on the disk but don't mind if the run-length encoded representation from the disk is "expanded" into a full vector in memory, then it's enough to re-write the first part of process_file in main.c that deals with reading the input file - or write a small Python script that reads the run-length encoded representation from a file and prints the "expanded" representation to stdout, and pipe this into the stdin of plfit.

vigna · 2019-10-22T17:11:38Z

Yes, the latter is what we are doing now. Only, on a graph with 12 billion nodes, so it's a bit slow and memory consuming 😂.

ntamas · 2019-10-22T17:30:18Z

Yeah, I can imagine. Sorry about that :( Unfortunately I'm not doing research any more so it's hard to find spare time for these side-projects, that's why I cannot promise that I'll implement this in the near future. It would definitely be a nice addition.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Numerosity-based input? #23

Numerosity-based input? #23

vigna commented Oct 22, 2019

ntamas commented Oct 22, 2019

vigna commented Oct 22, 2019

ntamas commented Oct 22, 2019

Numerosity-based input? #23

Numerosity-based input? #23

Comments

vigna commented Oct 22, 2019

ntamas commented Oct 22, 2019

vigna commented Oct 22, 2019

ntamas commented Oct 22, 2019