
Guidance about usage with large datasets #22

Closed
610v4nn1 opened this issue Mar 17, 2021 · 2 comments

@610v4nn1

I tried the library on some datasets and I have to say I am positively surprised by the usability and effectiveness of the methods provided.

At the same time, I found a serious blocker in using it with large datasets, since doing so currently requires reading the literature referenced in the documentation. It would be extremely useful to provide guidance about the computational complexity of the different methods, or at least to distinguish between scalable methods (e.g., streaming methods) and less scalable ones.

@jmschrei
Owner

Howdy

It is a reasonable point that I should provide some more basic advice on when to apply the different types of functions. I think the key issue is whether one uses a feature-based function, where the complexity is linear in the size of the data set, or a graph-based function, where the complexity is linear in the number of edges (i.e., quadratic w.r.t. the data set size). Unfortunately, submodular functions with the same computational complexity can sometimes have very different runtimes depending on the number of operations in the score function. Further, it's often unclear which function will work best on a given data set beyond making high-level decisions like feature-based vs. graph-based.
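To make that distinction a bit more concrete, here is a toy numpy sketch of the two kinds of marginal gains. This is only an illustration of the complexity difference, not the library's internal code:

```python
import numpy as np

def feature_based_gain(X, selected_sums):
    """Marginal gain of adding each row of X under a sqrt feature-based
    objective; it needs only the (n, d) feature matrix and the running
    column sums of the selected set, so each round is O(n * d)."""
    return np.sqrt(selected_sums + X).sum(axis=1) - np.sqrt(selected_sums).sum()

def facility_location_gain(S, best_sim):
    """Marginal gain of adding each candidate under facility location;
    S[i, j] is the similarity of example i to candidate j and best_sim[i]
    is example i's best similarity to the current selection, so the full
    (n, n) similarity matrix is needed -- quadratic in the data set size."""
    return np.maximum(S, best_sim[:, None]).sum(axis=0) - best_sim.sum()

X = np.abs(np.random.randn(1000, 20))   # n = 1000 examples, d = 20 features
S = X @ X.T                             # n x n similarity matrix

print(feature_based_gain(X, np.zeros(X.shape[1])).argmax())
print(facility_location_gain(S, np.zeros(X.shape[0])).argmax())
```

The feature-based gain only ever touches the feature matrix, while the facility-location gain can't even be evaluated until the full pairwise similarity matrix has been built.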

Can you describe a bit more what you'd like to see? I'll see if I can add some more thoughts in soon.

@610v4nn1
Copy link
Author

610v4nn1 commented Mar 18, 2021

I completely understand the issue, and I do not expect a 1:1 mapping between use cases and functions.
Some high-level guidance along the lines of "in cases such as X and Y where condition Z applies, it is often possible to get good results with function F1, but if you have a big dataset you probably want to use F2 because ..." would already help. I see this as something that puts users in a good position to start experimenting, rather than a fully fledged solution.

I am also quite interested in using streaming algorithms, for which you produced a notebook and provided the partial_fit function. In this case it would be good to understand what happens when you have a stream of points (or multiple batches).
There is an example where you iterate over several batches of data, but it does not fully explain what is going on (I can follow the code, but I may be missing something). In particular, since I load data from disk, I would not like to keep all of it in memory (or reload it for a second pass); I would like to retain only the points that are needed. A sketch of the batch loop I have in mind follows below.
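The selector here is a toy stand-in I wrote just to keep the sketch self-contained: it is not the library's implementation, and only the partial_fit call pattern is taken from your notebook.

```python
import numpy as np

class ToyStreamingSelector:
    """Toy stand-in, NOT the library's implementation: keeps only the
    selected points (and their stream indices) in memory, scores sets with
    a sqrt feature-based objective, and uses a simple fill-then-swap rule."""

    def __init__(self, n_samples):
        self.n_samples = n_samples
        self.selected = []        # the retained points themselves
        self.selected_idx = []    # their indices in the stream
        self.seen = 0             # how many stream points have gone by

    def _objective(self, points):
        # f(S) = sum over features of sqrt of that feature's total mass in S
        return np.sqrt(np.sum(points, axis=0)).sum()

    def partial_fit(self, X_batch):
        for x in X_batch:
            if len(self.selected) < self.n_samples:
                self.selected.append(x)
                self.selected_idx.append(self.seen)
            else:
                # Swap x in for an existing point if that improves the objective.
                current = self._objective(self.selected)
                for j in range(self.n_samples):
                    trial = self.selected.copy()
                    trial[j] = x
                    if self._objective(trial) > current:
                        self.selected[j] = x
                        self.selected_idx[j] = self.seen
                        break
            self.seen += 1
        return self

# Each batch is loaded (here: generated) one at a time, so the full data set
# never has to sit in memory -- only the selected points are retained.
selector = ToyStreamingSelector(n_samples=10)
rng = np.random.default_rng(0)
for _ in range(5):                 # in my case: for path in batch_files: np.load(path)
    batch = np.abs(rng.normal(size=(200, 16)))
    selector.partial_fit(batch)

print(selector.selected_idx)       # stream indices of the retained points
```

The point of the toy is only that the selected points (and their stream indices) are what survive the loop; that is the behaviour I would like to confirm for the real partial_fit.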
