Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Numerosity-based input? #23

Open
vigna opened this issue Oct 22, 2019 · 3 comments
Open

Numerosity-based input? #23

vigna opened this issue Oct 22, 2019 · 3 comments

Comments

@vigna
Copy link

vigna commented Oct 22, 2019

It is very common, when trying to analyse distributions of, say, web graph, to have the distribution in the form
1 143245
2 395599
etc., that is, for each possible value the associated numerosity. The way we're using plfit now is simply that of generating a sample by printing 143254 times "1", 395599 times "2", and so on, but this is slow, memory-hungry and definitely suboptimal. How difficult would be to have plfit use directly a sample specified as above?

@ntamas
Copy link
Owner

ntamas commented Oct 22, 2019

Well, it's not impossible but it definitely needs a bit of work. Right now all the functions in plfit.c work with an array of doubles. (This is because the primary use-case for plfit so far was to fit power-laws to degree distributions in igraph, where you already have a vector that contains the degrees for each node in the graph). Switching to the proposed representation (which is basically a run-length encoded version of the sorted sample vector) would necessitate rewriting plfit_i_logsum_discrete() and plfit_i_logsum_less_than_discrete at least, along with all the functions that call these two functions, and the functions related to resampling (for estimating the p-value). Unfortunately I don't think I'll have time for this in the near future, but I can help with the code review if someone is willing to send a PR.

Alternatively, if you worry only about the space consumed on the disk but don't mind if the run-length encoded representation from the disk is "expanded" into a full vector in memory, then it's enough to re-write the first part of process_file in main.c that deals with reading the input file - or write a small Python script that reads the run-length encoded representation from a file and prints the "expanded" representation to stdout, and pipe this into the stdin of plfit.

@vigna
Copy link
Author

vigna commented Oct 22, 2019

Yes, the latter is what we are doing now. Only, on a graph with 12 billion nodes, so it's a bit slow and memory consuming 😂.

@ntamas
Copy link
Owner

ntamas commented Oct 22, 2019

Yeah, I can imagine. Sorry about that :( Unfortunately I'm not doing research any more so it's hard to find spare time for these side-projects, that's why I cannot promise that I'll implement this in the near future. It would definitely be a nice addition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants