
DBSTREAM performance #1086

Closed
AIRobotZhang opened this issue Nov 10, 2022 · 20 comments

@AIRobotZhang

As the data stream keeps arriving, the DBSTREAM algorithm gradually becomes slower. Is there any solution?

@MaxHalford
Member

Hey @AIRobotZhang. Could you share some more context and some figures to illustrate what you're saying? What is the current performance? What performance are you looking for?

@AIRobotZhang
Author

DBSTREAM is a clustering algorithm for evolving data streams. As time goes on, the number of clusters keeps growing, so DBSTREAM became slower and slower in my experiment. Is there any way to keep DBSTREAM's speed stable as the data stream keeps arriving?
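To make the slowdown measurable, per-sample `learn_one` times can be logged over the stream. The sketch below uses a hypothetical stand-in clusterer whose cost grows linearly with the number of stored centers; to profile the real model, substitute River's `cluster.DBSTREAM()` for the stand-in:

```python
import random
import time

# Hypothetical stand-in for an online clusterer: learn_one scans every stored
# center, so its per-sample cost grows as centers accumulate (the behavior
# reported for DBSTREAM early in a stream).
class StandInClusterer:
    def __init__(self, radius=0.01):
        self.radius = radius
        self.centers = []

    def learn_one(self, x):
        point = (x["x"], x["y"])
        for cx, cy in self.centers:  # linear scan over all centers
            if abs(cx - point[0]) + abs(cy - point[1]) < self.radius:
                return
        self.centers.append(point)

random.seed(42)
model = StandInClusterer()
timings = []
for _ in range(5000):
    x = {"x": random.random(), "y": random.random()}
    t0 = time.perf_counter()
    model.learn_one(x)
    timings.append(time.perf_counter() - t0)

early = sum(timings[:500]) / 500
late = sum(timings[-500:]) / 500
print(f"avg learn_one: first 500 samples {early:.2e}s, last 500 samples {late:.2e}s")
```

If the late average keeps growing without leveling off, the number of stored clusters is most likely still increasing.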

@smastelini
Member

@hoanganhngo610, can you check that? Is this the expected behavior, and is there a way to "forget" old information in DBSTREAM?

@hoanganhngo610
Contributor

@AIRobotZhang @smastelini I will try to replicate this using the datasets I currently have. Once I have something available, I will let you know!

@AIRobotZhang
Author

> @AIRobotZhang @smastelini I will try to replicate this using the datasets I currently have. Once I have something available, I will let you know!

Thank you very much!

@hoanganhngo610
Contributor

Dear @AIRobotZhang, I have just investigated the performance of DBSTREAM on a simple dataset with 20k observations, and found that the number of clusters generated increases significantly within the first few thousand observations, but after a certain point the number of clusters is actually quite stable.

This is quite expected for methods using the concept of micro-clusters or macro-clusters, since they require a large number of data points at the beginning to establish a stable structure before actually doing the clustering later.

The dataset and code that I used can be found here.

If you have any other concerns, please do not hesitate to let me know! Otherwise, hope that this answer helps!

Pinging @MaxHalford FYI.

@MaxHalford
Member

Thanks @hoanganhngo610 for looking into it.

Is there a straightforward way to control the number of micro/macro clusters? I can't see anything in the parameters documentation.

@hoanganhngo610
Contributor

@MaxHalford To be honest, I don't think there is a lot we can do at this moment. I have also been trying to tweak the hyperparameters, but there is very little to no change. This also happens with CluStream, and in some corner cases with DenStream.

One idea I am considering is to impose a limit on the number of micro-clusters that can be generated, thus allowing the macro-cluster formation phase to happen more regularly. However, this would indeed change the nature of the algorithm a little. In any case, I will get back to you soon on what can be done!

@MaxHalford
Member

Cool! I think it's healthy to have theoretical algorithms meet practical considerations. What matters is that the models in River are performant and can actually meet user needs :). But don't go out of your way to make changes if it's too difficult or modifies the algorithm too much. Maybe what's needed is a different algorithm entirely.

@hoanganhngo610
Contributor

@MaxHalford Got it! Also, do you want me to close the issue, or would you prefer to keep it open for further follow-up?

@MaxHalford
Member

@AIRobotZhang what do you think? Do you have any other questions?

I think we can keep this issue open, it doesn't hurt. Thanks @hoanganhngo610 :)

@AIRobotZhang
Author

> @AIRobotZhang what do you think? Do you have any other questions?
>
> I think we can keep this issue open, it doesn't hurt. Thanks @hoanganhngo610 :)

Thanks for the patient answers!
I think keeping this issue open would be better!

@Dref360

Dref360 commented Nov 28, 2022

Hello! I have a similar issue when using River with BERTopic, a topic modeling library.

I saw that DBSTREAM.predict_one calls self._recluster every time, which is quite time-consuming if we need to predict after each learning step.

I created a new predict function that can predict on a batch of embeddings. I thought that could be useful to some. In my testing, it was also about 30x faster.

Please let me know if I missed something trivial.

    import numpy as np
    from sklearn.metrics import pairwise_distances

    def predict(self, embeddings: np.ndarray) -> np.ndarray:
        # Assign a cluster to each embedding in the batch.

        def _dict_to_array(d):
            # Convert a {feature_index: value} center dict into a numpy array.
            return np.array([d[i] for i in range(len(d))])

        # Recluster once for the whole batch instead of once per sample.
        self._recluster()

        centers = np.array([_dict_to_array(c) for c in self.centers.values()])
        # Minkowski distance with p=2 is the Euclidean distance.
        return np.argmin(pairwise_distances(embeddings, centers, metric="minkowski", p=2), axis=-1)

@smastelini
Member

I have no in-depth knowledge of clustering algorithms. Still, if performance is not too impaired, they could have a grace_period parameter to control how often reclustering is performed.

Alternatively, we could make the recluster method public and delegate to the user how frequently the macro-cluster representations should be updated.

Or even a combination of the two proposals: reclustering performed at regular intervals, with the option of explicit reclustering calls when needed.
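The combined proposal could be sketched roughly as follows; `interval`, the public `recluster()`, and the counter standing in for the expensive macro-cluster formation step are all hypothetical names, not River's actual API:

```python
# Sketch of the combined proposal: automatic reclustering every `interval`
# samples, interval=None to switch the automatic trigger off, and a public
# recluster() for manual use. The expensive step is replaced by a counter.
class IntervalReclusterer:
    def __init__(self, interval=100):
        self.interval = interval
        self._n_seen = 0
        self.n_reclusterings = 0

    def recluster(self):
        self.n_reclusterings += 1  # placeholder for macro-cluster formation

    def learn_one(self, x):
        self._n_seen += 1
        if self.interval is not None and self._n_seen % self.interval == 0:
            self.recluster()

auto = IntervalReclusterer(interval=100)
for i in range(1000):
    auto.learn_one({"x": i})
print(auto.n_reclusterings)  # 10 automatic reclusterings

manual = IntervalReclusterer(interval=None)
for i in range(1000):
    manual.learn_one({"x": i})
manual.recluster()  # explicit trigger, left to the user
print(manual.n_reclusterings)  # 1
```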

@Dref360

Dref360 commented Nov 28, 2022

From what I understand, DBSTREAM._recluster is deterministic. So an easy fix would be to _recluster only when needed, i.e. after calling learn_one.

@MaxHalford
Member

Hey @Dref360 @smastelini. A lot of good stuff! Let's unpack.

In the code snippet you shared, what shape is embeddings? Is it basically a mini-batch of samples? If so, we could add a predict_many method to accommodate that case.

> Alternatively, we could make the recluster method public and delegate to the user how frequently the macro-cluster representations should be updated.

I see how that is great for power users, but I would rather have it just work magically. I like the idea of specifying an interval. We do that elsewhere in the codebase.

> From what I understand, DBSTREAM._recluster is deterministic. So an easy fix would be to _recluster only when needed, i.e. after calling learn_one.

Good call! We could indeed move it to learn_one. @hoanganhngo610 do you have any objection to that? Note that we could also introduce a boolean to check whether reclustering is needed or not. But my gut feeling is that moving it to learn_one is cleaner.
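The boolean alternative could look like the following sketch (attribute and method names are hypothetical, and _recluster is a stub): learn_one only marks the model as dirty, and predict_one pays for reclustering once per batch of updates rather than on every call:

```python
# Sketch of the dirty-flag idea; _recluster stands in for the real
# macro-cluster formation and just counts how often it runs.
class LazyReclusterer:
    def __init__(self):
        self._recluster_needed = False
        self.n_reclusterings = 0

    def _recluster(self):
        self.n_reclusterings += 1  # placeholder for the expensive step

    def learn_one(self, x):
        self._recluster_needed = True  # defer the work until prediction time

    def predict_one(self, x):
        if self._recluster_needed:
            self._recluster()
            self._recluster_needed = False
        return 0  # placeholder cluster label

model = LazyReclusterer()
model.learn_one({"x": 1.0})
for _ in range(100):
    model.predict_one({"x": 1.0})  # only the first call reclusters
print(model.n_reclusterings)  # 1
```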

@smastelini
Member

> I see how that is great for power users, but I would rather have it just work magically. I like the idea of specifying an interval. We do that elsewhere in the codebase.

I personally like the third option better: reclustering magically happens at predefined intervals by default, but the user can switch this off (interval=None) and trigger reclustering manually.

@MaxHalford
Member

Yes, why not. We'll need to document this properly though.

@MaxHalford
Member

We could also pass this as a parameter to the learn_one method, which avoids adding a new method.

@MaxHalford
Member

Closing this because we now recluster only when predict_one is called.


5 participants