
DBSTREAM performance #1086

Closed
AIRobotZhang opened this issue Nov 10, 2022 · 20 comments

@AIRobotZhang

As the data stream keeps arriving, the DBSTREAM algorithm gradually becomes slower. Is there any solution?

@MaxHalford
Member

Hey @AIRobotZhang. Could you share some more context and some figures to illustrate what you're saying? What is the current performance? What performance are you looking for?

@AIRobotZhang
Author

DBSTREAM is a clustering algorithm for evolving data streams. As time goes on, the number of clusters keeps growing, so DBSTREAM became slower and slower in my experiment. Is there any way to keep DBSTREAM's speed stable as the data stream keeps arriving?
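To make the slowdown measurable, per-sample `learn_one` times can be logged over the stream. The sketch below uses a hypothetical stand-in clusterer whose cost grows linearly with the number of stored centers; to profile the real model, substitute River's `cluster.DBSTREAM()` for the stand-in:

```python
import random
import time

# Hypothetical stand-in for an online clusterer: learn_one scans every stored
# center, so its per-sample cost grows as centers accumulate (the behavior
# reported for DBSTREAM early in a stream).
class StandInClusterer:
    def __init__(self, radius=0.01):
        self.radius = radius
        self.centers = []

    def learn_one(self, x):
        point = (x["x"], x["y"])
        for cx, cy in self.centers:  # linear scan over all centers
            if abs(cx - point[0]) + abs(cy - point[1]) < self.radius:
                return
        self.centers.append(point)

random.seed(42)
model = StandInClusterer()
timings = []
for _ in range(5000):
    x = {"x": random.random(), "y": random.random()}
    t0 = time.perf_counter()
    model.learn_one(x)
    timings.append(time.perf_counter() - t0)

early = sum(timings[:500]) / 500
late = sum(timings[-500:]) / 500
print(f"avg learn_one: first 500 samples {early:.2e}s, last 500 samples {late:.2e}s")
```

If the late average keeps growing without leveling off, the number of stored clusters is most likely still increasing.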

@smastelini
Member

@hoanganhngo610, can you check that? Is this the expected behavior, and is there a way to "forget" old information in DBSTREAM?

@hoanganhngo610
Contributor

@AIRobotZhang @smastelini I will try to replicate this using the datasets I currently have. Once I have something available, I will let you know!

@AIRobotZhang
Author

> @AIRobotZhang @smastelini I will try to replicate this using the datasets I currently have. Once I have something available, I will let you know!

Thank you very much!

@hoanganhngo610
Contributor

Dear @AIRobotZhang, I have just investigated the performance of DBSTREAM on a simple dataset with 20k observations, and found that the number of clusters generated increases significantly within the first few thousand observations, but after a certain point the number of clusters is actually quite stable.

This is quite expected for methods using the concept of micro-clusters or macro-clusters, since they require a large number of data points at the beginning to establish a stable structure before actually doing the clustering later.

The dataset and code that I used can be found here.

If you have any other concerns, please do not hesitate to let me know! Otherwise, hope that this answer helps!

Pinging @MaxHalford FYI.

@MaxHalford
Member

Thanks @hoanganhngo610 for looking into it.

Is there a straightforward way to control the number of micro/macro clusters? I can't see anything in the parameters documentation.

@hoanganhngo610
Contributor

@MaxHalford To be honest, I don't think there is a lot we can do at this moment. I have also been trying to tweak the hyperparameters, but there is very little to no change. This also happens with CluStream, and in some corner cases with DenStream.

One idea I am considering is to impose a limit on the number of micro-clusters that can be generated, thus allowing the macro-cluster formation phase to happen more regularly. However, this would indeed change the nature of the algorithm a little. In any case, I will get back to you soon on what can be done!

@MaxHalford
Member

Cool! I think it's healthy to have theoretical algorithms meet practical considerations. What matters is that the models in River are performant and can actually meet user needs :). But don't go out of your way to make changes if it's too difficult or modifies the algorithm too much. Maybe what's needed is a different algorithm entirely.

@hoanganhngo610
Contributor

@MaxHalford Got it! Also, do you want me to close the issue, or would you prefer to keep it open for further follow-up?

@MaxHalford
Member

@AIRobotZhang what do you think? Do you have any other questions?

I think we can keep this issue open, it doesn't hurt. Thanks @hoanganhngo610 :)

@AIRobotZhang
Author

> @AIRobotZhang what do you think? Do you have any other questions?
>
> I think we can keep this issue open, it doesn't hurt. Thanks @hoanganhngo610 :)

Thanks for the patient answers!
I think keeping this issue open would be better!

@Dref360

Dref360 commented Nov 28, 2022

Hello! I have a similar issue when using River with BERTopic, a topic modeling library.

I saw that DBSTREAM.predict_one calls self._recluster every time, which is quite time-consuming if we need to predict after each learning step.

I created a new predict function that can predict on a batch of embeddings. I thought that could be useful to some. In my testing, it was also about 30x faster.

Please let me know if I missed something trivial.

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def predict(self, embeddings: np.ndarray) -> np.ndarray:
        # Assign a cluster to each embedding in the batch.

        def _dict_to_array(d):
            # Convert a {feature_index: value} center dict into a numpy array.
            return np.array([d[i] for i in range(len(d))])

        # Recluster once for the whole batch instead of once per sample.
        self._recluster()

        centers = np.array([_dict_to_array(c) for c in self.centers.values()])
        # Minkowski distance with p=2 is the Euclidean distance.
        return np.argmin(pairwise_distances(embeddings, centers, metric="minkowski", p=2), axis=-1)

@smastelini
Member

I have no in-depth knowledge of clustering algorithms. Still, if performance is not too impaired, they could have a grace_period parameter to control how often reclustering is performed.

Alternatively, we could make the recluster method public and delegate to the user how frequently the macro-cluster representations should be updated.

Or even a combination of the two proposals: reclustering performed at regular intervals, with the option of explicit reclustering calls when needed.
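The combined proposal could be sketched roughly as follows; `interval`, the public `recluster()`, and the counter standing in for the expensive macro-cluster formation step are all hypothetical names, not River's actual API:

```python
# Sketch of the combined proposal: automatic reclustering every `interval`
# samples, interval=None to switch the automatic trigger off, and a public
# recluster() for manual use. The expensive step is replaced by a counter.
class IntervalReclusterer:
    def __init__(self, interval=100):
        self.interval = interval
        self._n_seen = 0
        self.n_reclusterings = 0

    def recluster(self):
        self.n_reclusterings += 1  # placeholder for macro-cluster formation

    def learn_one(self, x):
        self._n_seen += 1
        if self.interval is not None and self._n_seen % self.interval == 0:
            self.recluster()

auto = IntervalReclusterer(interval=100)
for i in range(1000):
    auto.learn_one({"x": i})
print(auto.n_reclusterings)  # 10 automatic reclusterings

manual = IntervalReclusterer(interval=None)
for i in range(1000):
    manual.learn_one({"x": i})
manual.recluster()  # explicit trigger, left to the user
print(manual.n_reclusterings)  # 1
```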

@Dref360

Dref360 commented Nov 28, 2022

From what I understand, DBSTREAM._recluster is deterministic. So an easy fix would be to _recluster only when needed, i.e. after calling learn_one.

@MaxHalford
Member

Hey @Dref360 @smastelini. A lot of good stuff! Let's unpack.

In the code snippet you shared, what shape is embeddings? Is it basically a mini-batch of samples? If so, we could add a predict_many method to accommodate that case.

> Alternatively, we could make the recluster method public and delegate to the user how frequently the macro-cluster representations should be updated.

I see how that is great for power users, but I would rather have it just work magically. I like the idea of specifying an interval. We do that elsewhere in the codebase.

> From what I understand, DBSTREAM._recluster is deterministic. So an easy fix would be to _recluster only when needed, i.e. after calling learn_one.

Good call! We could indeed move it to learn_one. @hoanganhngo610 do you have any objection to that? Note that we could also introduce a boolean to check whether reclustering is needed or not. But my gut feeling is that moving it to learn_one is cleaner.
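The boolean alternative could look like the following sketch (attribute and method names are hypothetical, and _recluster is a stub): learn_one only marks the model as dirty, and predict_one pays for reclustering once per batch of updates rather than on every call:

```python
# Sketch of the dirty-flag idea; _recluster stands in for the real
# macro-cluster formation and just counts how often it runs.
class LazyReclusterer:
    def __init__(self):
        self._recluster_needed = False
        self.n_reclusterings = 0

    def _recluster(self):
        self.n_reclusterings += 1  # placeholder for the expensive step

    def learn_one(self, x):
        self._recluster_needed = True  # defer the work until prediction time

    def predict_one(self, x):
        if self._recluster_needed:
            self._recluster()
            self._recluster_needed = False
        return 0  # placeholder cluster label

model = LazyReclusterer()
model.learn_one({"x": 1.0})
for _ in range(100):
    model.predict_one({"x": 1.0})  # only the first call reclusters
print(model.n_reclusterings)  # 1
```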

@smastelini
Member

> I see how that is great for power users, but I would rather have it just work magically. I like the idea of specifying an interval. We do that elsewhere in the codebase.

I personally like the third option better: reclustering magically happens at predefined intervals by default, but the user can switch this off (interval=None) and trigger reclustering manually.

@MaxHalford
Member

Yes, why not. We'll need to document this properly though.

@MaxHalford
Member

We could also pass this as a parameter to the learn_one method, which avoids adding a new method.

@MaxHalford
Member

Closing this because we now recluster only when predict_one is called.


5 participants