DBSTREAM performance #1086
When the data stream comes in continually, the DBSTREAM algorithm gradually becomes slower. Is there any solution?
Hey @AIRobotZhang. Could you share some more context and some figures to illustrate what you're saying? What is the current performance? What performance are you looking for?
DBSTREAM is a clustering algorithm for evolving data streams. As time goes on, the number of clusters keeps growing, so DBSTREAM becomes slower and slower in my experiment. Is there any way to keep DBSTREAM fast as the data stream keeps arriving?
@hoanganhngo610, can you check that? Is this the expected behavior, and is there a way to "forget" old information in DBSTREAM?
@AIRobotZhang @smastelini I will try to replicate this using the datasets I currently have. Once I have something available I will let you know!
Thank you very much!
Dear @AIRobotZhang, I have just investigated the performance of DBSTREAM on a simple dataset of 20k observations, and found that the number of clusters generated increases significantly within the first few thousand observations, but after a certain point it is actually quite stable. This is to be expected for methods built on micro-clusters and macro-clusters, since they require a large number of data points at the beginning to fully establish a stable structure before doing the actual clustering. The dataset and code that I used can be found here. If you have any other concerns, please do not hesitate to let me know! Otherwise, I hope this answer helps! Pinging @MaxHalford FYI.
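For anyone who wants to reproduce this kind of check, here is a minimal sketch (it assumes River's `cluster.DBSTREAM` and substitutes the synthetic `Hyperplane` generator for the original dataset; the hyperparameter value is illustrative):

```python
import time

from river import cluster
from river.datasets import synth

dbstream = cluster.DBSTREAM(clustering_threshold=1.5)

start = time.perf_counter()
for i, (x, _) in enumerate(synth.Hyperplane(seed=42).take(20_000), start=1):
    dbstream.learn_one(x)
    if i % 2_000 == 0:
        # Watch how the micro-cluster count and the throughput evolve over time.
        elapsed = time.perf_counter() - start
        print(f"{i:>6} points | {len(dbstream.micro_clusters):>4} micro-clusters | {elapsed:5.1f}s")
```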
Thanks @hoanganhngo610 for looking into it. Is there a straightforward way to control the number of micro/macro clusters? I can't see anything in the parameters documentation.
@MaxHalford To be honest, I don't think there is much we can do at this moment. I have also been trying to mess around with the hyperparameters, but there is very little to no change. This also happens with CluStream, and in some corner cases with DenStream. One idea I am considering is to impose a limit on the number of micro-clusters that can be generated, thus allowing the macro-cluster formation phase to happen more regularly. However, this would indeed change the nature of the algorithm a little bit. In any case, I will get back to you soon on what can be done!
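For reference, a sketch of the knobs that most directly influence how many micro-clusters survive (the values here are illustrative, and defaults may differ between River versions):

```python
from river import cluster

dbstream = cluster.DBSTREAM(
    clustering_threshold=2.0,  # larger radius: nearby points merge into fewer micro-clusters
    fading_factor=0.05,        # faster decay: stale micro-clusters lose weight sooner
    cleanup_interval=1,        # more frequent cleanup: weak micro-clusters are pruned earlier
    minimum_weight=2.0,        # higher bar for a micro-cluster to survive a cleanup pass
)
```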
Cool! I think it's healthy to have theoretical algorithms meet practical considerations. What matters is that the models in River are performant and can actually meet user needs :). But don't go out of your way to make changes if it's too difficult or modifies the algorithm too much. Maybe what's needed is a different algorithm entirely.
@MaxHalford Got it! Also, do you want me to close the issue, or do you prefer to keep it open for further follow-up?
@AIRobotZhang what do you think? Do you have any other questions? I think we can keep this issue open, it doesn't hurt. Thanks @hoanganhngo610 :)
Thanks for the patient answers!
Hello! I have a similar issue when using River with BERTopic, a topic modeling library. I saw that `DBSTREAM.predict_one` calls `self._recluster()`, which is expensive. I created a new function, `predict`, that reclusters once and then assigns clusters to a whole batch of embeddings. Please let me know if I missed something trivial.

```python
from typing import Iterable

import numpy as np
from sklearn.metrics import pairwise_distances


# These are intended to be attached as methods on river's DBSTREAM.
def predict(self, embeddings: np.ndarray) -> Iterable[int]:
    # Recluster once, then assign each embedding to its closest center.
    self._recluster()
    centers = np.array([self._dict_to_array(ci) for ci in self.centers.values()])
    return np.argmin(pairwise_distances(embeddings, centers, metric="minkowski", p=2), -1)


def _dict_to_array(self, d):
    # Turn a center dict (feature index -> value) into a numpy array.
    return np.array([d[i] for i in range(len(d))])
```
I have no in-depth knowledge of clustering algorithms. Still, if performance is not too impaired, DBSTREAM could have a parameter that triggers reclustering at regular intervals. Alternatively, we could make the recluster method public and delegate to the user how frequently the macro-cluster representations should be updated. Or even a combination of the two proposals: reclustering performed at regular intervals, with the option of explicit reclustering calls when needed.
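As a rough illustration of that combined proposal, a hypothetical wrapper (neither this class nor its `recluster_interval` parameter exist in River; the prediction step computes nearest centers manually so that it does not trigger the internal reclustering):

```python
import math

from river import cluster


class IntervalDBSTREAM:
    """Hypothetical wrapper, not part of River: recluster at a fixed
    interval during learning, plus a public escape hatch for power users."""

    def __init__(self, recluster_interval=1000, **dbstream_params):
        self.model = cluster.DBSTREAM(**dbstream_params)
        self.recluster_interval = recluster_interval
        self._n_seen = 0

    def learn_one(self, x):
        self.model.learn_one(x)
        self._n_seen += 1
        if self._n_seen % self.recluster_interval == 0:
            self.recluster()

    def recluster(self):
        # Explicit reclustering, delegated to the user on demand.
        self.model._recluster()

    def predict_one(self, x):
        if not getattr(self.model, "centers", None):
            self.recluster()  # make sure macro-cluster centers exist

        # Nearest macro-cluster center by Euclidean distance; this bypasses
        # predict_one on the wrapped model, which would recluster on every call.
        def dist(center):
            return math.sqrt(sum((x[k] - center.get(k, 0.0)) ** 2 for k in x))

        return min(self.model.centers, key=lambda i: dist(self.model.centers[i]))
```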
From what I understand,
Hey @Dref360 @smastelini. A lot of good stuff! Let's unpack. In the code snippet you shared, what shape is the `embeddings` array?

> we could make the recluster method public and delegate to the user how frequently the macro-cluster representations should be updated

I see how that is great for power users, but I would rather have it just work magically. I like the idea of specifying an interval. We do that elsewhere in the codebase.

Good call! We could indeed move it to
I personally like the third option better: reclustering magically happens at predefined intervals by default, but the user can switch this off.
Yes, why not. We'll need to document this properly though.
We could also pass this as a parameter to the `learn_one` method, which avoids adding a new method.
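A sketch of what that variant could look like (a hypothetical signature, not River's actual API):

```python
def learn_one(self, x, recluster: bool = False):
    # ... existing DBSTREAM micro-cluster update logic would go here ...
    if recluster:
        self._recluster()  # refresh the macro-clusters on demand while learning
```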
Closing this because we now recluster only when needed.
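For reference, the gist of that lazy-reclustering pattern as a simplified sketch (a subclass for illustration only; the attribute names are made up and this is not the actual patch):

```python
from river import cluster


class LazyDBSTREAM(cluster.DBSTREAM):
    """Sketch: only recluster when the model changed since the last prediction."""

    _stale = True

    def learn_one(self, x):
        super().learn_one(x)
        self._stale = True  # micro-clusters changed, so macro-clusters are now stale

    def predict_one(self, x):
        if self._stale:
            self._recluster()  # expensive step, now performed only when needed
            self._stale = False
        # Assign x to the nearest macro-cluster center (squared Euclidean distance).
        return min(
            self.centers,
            key=lambda c: sum((x[k] - self.centers[c].get(k, 0.0)) ** 2 for k in x),
        )
```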