
Guidance about usage with large datasets #22

Closed
610v4nn1 opened this issue Mar 17, 2021 · 2 comments

@610v4nn1

I tried the library on some datasets and I have to say I am positively surprised by the usability and effectiveness of the methods provided.

At the same time, I found a serious blocker in using it with large datasets, since doing so currently requires reading the literature referenced in the documentation. It would be extremely useful to provide guidance about the computational complexity of the different methods, or at least to distinguish between scalable methods (e.g., streaming methods) and less scalable ones.

@jmschrei
Owner

Howdy

It is a reasonable point that I should provide some more basic advice on when to apply the different types of functions. I think the key issue is whether one uses a feature-based function, where the complexity is linear in the size of the data set, or a graph-based function, where the complexity is linear in the number of edges (i.e., quadratic w.r.t. the data set size). Unfortunately, submodular functions with the same computational complexity can sometimes have very different runtimes depending on the number of operations in the score function. Further, it's often unclear which function will work best on a given data set beyond making high-level decisions like feature-based vs. graph-based.
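To make that distinction a bit more concrete, here is a toy numpy sketch of the two kinds of marginal gains. This is only an illustration of the complexity difference, not the library's internal code:

```python
import numpy as np

def feature_based_gain(X, selected_sums):
    """Marginal gain of adding each row of X under a sqrt feature-based
    objective; it needs only the (n, d) feature matrix and the running
    column sums of the selected set, so each round is O(n * d)."""
    return np.sqrt(selected_sums + X).sum(axis=1) - np.sqrt(selected_sums).sum()

def facility_location_gain(S, best_sim):
    """Marginal gain of adding each candidate under facility location;
    S[i, j] is the similarity of example i to candidate j and best_sim[i]
    is example i's best similarity to the current selection, so the full
    (n, n) similarity matrix is needed -- quadratic in the data set size."""
    return np.maximum(S, best_sim[:, None]).sum(axis=0) - best_sim.sum()

X = np.abs(np.random.randn(1000, 20))   # n = 1000 examples, d = 20 features
S = X @ X.T                             # n x n similarity matrix

print(feature_based_gain(X, np.zeros(X.shape[1])).argmax())
print(facility_location_gain(S, np.zeros(X.shape[0])).argmax())
```

The feature-based gain only ever touches the feature matrix, while the facility-location gain can't even be evaluated until the full pairwise similarity matrix has been built.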

Can you describe a bit more what you'd like to see? I'll see if I can add some more thoughts in soon.

@610v4nn1
Copy link
Author

610v4nn1 commented Mar 18, 2021

I completely understand the issue, and I do not expect a 1:1 mapping between use cases and functions.
Some high-level guidance along the lines of "in cases such as X and Y where condition Z applies, it is often possible to get good results with function F1, but if you have a big dataset you probably want to use F2 because ..." would already help. I see this as something that puts users in a good position to start experimenting, rather than a fully fledged solution.

I am also quite interested in using streaming algorithms, for which you produced a notebook and provided the partial_fit function. In this case it would be good to understand what happens when you have a stream of points (or multiple batches).
There is an example where you iterate over several batches of data, but it does not fully explain what is going on (I can follow the code, but I may be missing something). In particular, since I load data from disk, I would not like to keep all of it in memory (or reload it for a second pass); I would like to retain only the points that are needed. A sketch of the batch loop I have in mind follows below.
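The selector here is a toy stand-in I wrote just to keep the sketch self-contained: it is not the library's implementation, and only the partial_fit call pattern is taken from your notebook.

```python
import numpy as np

class ToyStreamingSelector:
    """Toy stand-in, NOT the library's implementation: keeps only the
    selected points (and their stream indices) in memory, scores sets with
    a sqrt feature-based objective, and uses a simple fill-then-swap rule."""

    def __init__(self, n_samples):
        self.n_samples = n_samples
        self.selected = []        # the retained points themselves
        self.selected_idx = []    # their indices in the stream
        self.seen = 0             # how many stream points have gone by

    def _objective(self, points):
        # f(S) = sum over features of sqrt of that feature's total mass in S
        return np.sqrt(np.sum(points, axis=0)).sum()

    def partial_fit(self, X_batch):
        for x in X_batch:
            if len(self.selected) < self.n_samples:
                self.selected.append(x)
                self.selected_idx.append(self.seen)
            else:
                # Swap x in for an existing point if that improves the objective.
                current = self._objective(self.selected)
                for j in range(self.n_samples):
                    trial = self.selected.copy()
                    trial[j] = x
                    if self._objective(trial) > current:
                        self.selected[j] = x
                        self.selected_idx[j] = self.seen
                        break
            self.seen += 1
        return self

# Each batch is loaded (here: generated) one at a time, so the full data set
# never has to sit in memory -- only the selected points are retained.
selector = ToyStreamingSelector(n_samples=10)
rng = np.random.default_rng(0)
for _ in range(5):                 # in my case: for path in batch_files: np.load(path)
    batch = np.abs(rng.normal(size=(200, 16)))
    selector.partial_fit(batch)

print(selector.selected_idx)       # stream indices of the retained points
```

The point of the toy is only that the selected points (and their stream indices) are what survive the loop; that is the behaviour I would like to confirm for the real partial_fit.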
