Returning 0 clusters regardless of data (even using examples from documentation) #1466

Open
shiraeisenberg opened this issue Nov 29, 2023 · 4 comments

shiraeisenberg commented Nov 29, 2023

Versions

river version: installed from github
Python version: 3.11
Operating system: Mac OS

Describe the bug

With OpenAI embeddings, DenStream returns 0 clusters regardless of the parameters I set.

Steps/code to reproduce

from schema import db, Post, TiktokVideo, Topic, Subtopic, ResponseCampaign, TiktokVideoHashtag, TiktokQuery
import time
from river import cluster, stream
import requests
from qdrant_client import QdrantClient

client = QdrantClient("localhost", port=6333)

initial_sample = 1000

denstream = cluster.DenStream(decaying_factor=0.01,
                              beta=0.5,
                              mu=2.5,
                              epsilon=0.5,
                              n_samples_init=10)


# dbstream = cluster.DBSTREAM(
#     clustering_threshold=5,
#     fading_factor=0.05,
#     cleanup_interval=40,
#     intersection_factor=0.5,
#     minimum_weight=5
# )

# clustream = cluster.CluStream(
#     n_macro_clusters=30,
#     max_micro_clusters=300,
#     time_gap= 100,
#     time_window = 10000,
#     seed = 42
# )


def get_initial_tiktok_videos():
    try:
        videos = TiktokVideo.select().limit(initial_sample)
        ids = [video.id for video in videos]
        # print("IDs: ", ids)
        tiktok_ids = [video.tiktok_id for video in videos]
        # print("Tiktok IDs: ", tiktok_ids)
        return ids, tiktok_ids
    except TiktokVideo.DoesNotExist:
        print("No Tiktok videos found.")
        return None, None


def get_stream_videos(limit=100, page=10):
    try:
        videos = TiktokVideo.select().limit(limit).offset(page*limit)
        ids = [video.id for video in videos]
        # print("IDs: ", ids)
        tiktok_ids = [video.tiktok_id for video in videos]
        # print("Tiktok IDs: ", tiktok_ids)
        return ids, tiktok_ids
    except TiktokVideo.DoesNotExist:
        print("No Tiktok videos found.")
        return None, None

def fetch_embeddings_from_qdrant(ids):
    try:
        embeddings = []
        for id in ids:
            url = f"http://localhost:6333/collections/raqia_test_collection/points/{id}"
            response = requests.get(url)
            if response.status_code == 200:
                embedding = response.json()['result']['vector']
                # print(embedding)
                embeddings.append(embedding)
        return embeddings
    except Exception as e:
        print(f"Error: {e}")
        return None


if __name__ == "__main__":
    # get_initial_tiktok_videos()
    # get_stream_videos()
    # fetch_embeddings_from_qdrant([0,1])
    
    initial_set = get_initial_tiktok_videos()[0]
    initial_embeddings = fetch_embeddings_from_qdrant(initial_set)
    for x, _ in stream.iter_array(initial_embeddings):
        # print(x)
        denstream = denstream.learn_one(x)
        print(denstream.n_clusters)
        print(denstream.clusters)
        # dbstream = dbstream.learn_one(x)
        # clustream = clustream.learn_one(x)
        # print(dbstream.n_clusters)
        # print(clustream.centers)
    print(denstream.n_clusters)
    print(denstream.clusters)
    # print(dbstream.n_clusters)
    # print(dbstream.clusters)
    # print(dbstream.centers)
    # print(clustream.centers)
    time.sleep(60)
    
    while True:
        stream_set = get_stream_videos()[0]
        embeddings = fetch_embeddings_from_qdrant(stream_set)
        for embedding, _ in stream.iter_array(embeddings):
            denstream = denstream.learn_one(embedding)
            # dbstream = dbstream.learn_one(embedding)
            # clustream = clustream.learn_one(embedding)
        # print(denstream.n_clusters)
        # print(denstream.clusters)
        time.sleep(60*5)


@hoanganhngo610
Contributor

Thank you so much for raising the issue @shiraeisenberg. I will have a look into this as soon as possible.

@hoanganhngo610 hoanganhngo610 self-assigned this Dec 4, 2023
@hoanganhngo610
Contributor

@shiraeisenberg Would you mind sharing the value of your denstream._n_samples_seen? In my experience, DenStream usually does not return any clusters for roughly the first 100 observations. For that reason, we typically use the first 100-1000 samples as a burn-in, i.e., we let the model learn without requesting any predictions.
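
The burn-in idea described above can be sketched as follows. This is a minimal illustration, not code from river: `run_with_burn_in` and `CountingModel` are hypothetical names, and the only assumption about the model is that it exposes `learn_one` and `predict_one`, as river's clusterers do. The stand-in class is there only so the snippet runs without river installed.

```python
def run_with_burn_in(model, samples, burn_in=100):
    """Learn from every sample, but only ask for predictions after `burn_in` samples."""
    predictions = []
    for i, x in enumerate(samples):
        model.learn_one(x)
        # During the burn-in phase the model only learns; no prediction is requested.
        if i >= burn_in:
            predictions.append(model.predict_one(x))
    return predictions


class CountingModel:
    """Hypothetical stand-in for a clusterer; it only counts the calls it receives."""

    def __init__(self):
        self.n_learned = 0
        self.n_predicted = 0

    def learn_one(self, x):
        self.n_learned += 1

    def predict_one(self, x):
        self.n_predicted += 1
        return 0


model = CountingModel()
samples = [{0: float(i)} for i in range(150)]  # dict-shaped samples, as river expects
preds = run_with_burn_in(model, samples, burn_in=100)
print(model.n_learned, model.n_predicted, len(preds))
```

With a real clusterer, `CountingModel()` would presumably be replaced by something like the `cluster.DenStream(...)` instance from the original report, and `burn_in` tuned to the data.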

@nipunagarwala

@hoanganhngo610 I see this too.
I took the example in the documentation, and re-ran that.
If we do not add a call such as denstream.predict_one({0: -1, 1: -2}), then n_clusters is always 0.

@hoanganhngo610
Contributor

Dear @nipunagarwala, thank you very much for your response.

If you look closely at the source code of DenStream and the example in the documentation, you will notice that learn_one only creates and updates p-micro-clusters and o-micro-clusters. The final clusters are generated, and the number of clusters becomes known, only when predict_one is called for the first time.

As such, in the documentation example, if you check the number of o-micro-clusters and p-micro-clusters with the following commands, the results are 0 and 2 respectively:

>>> len(denstream.o_micro_clusters)
0
>>> len(denstream.p_micro_clusters)
2

We adopted this design because generating the cluster solution, i.e. the final cluster centers, is extremely computationally expensive, so it should only happen when it is explicitly requested. Otherwise, only the o-micro-clusters and p-micro-clusters are updated.
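
To make the behaviour above concrete, here is a toy sketch of the same lazy pattern. It is not river's DenStream implementation: `LazyClusterer`, its one-dimensional data, and its merging rule are all assumptions chosen for illustration. What it shows is only the split between a cheap `learn_one` that maintains micro-clusters and a `predict_one` that triggers the expensive step and refreshes `n_clusters`.

```python
class LazyClusterer:
    """Toy 1-D stand-in (hypothetical, not river's DenStream)."""

    def __init__(self, radius=1.0):
        self.radius = radius
        self.micro_clusters = []  # [center, weight] pairs, updated on every learn_one
        self.centers = []         # final cluster centers, built only on request
        self.n_clusters = 0       # stays 0 until predict_one is called

    def learn_one(self, x):
        # Cheap step: absorb x into the first micro-cluster within `radius`,
        # otherwise open a new micro-cluster.
        for mc in self.micro_clusters:
            if abs(mc[0] - x) <= self.radius:
                mc[0] = (mc[0] * mc[1] + x) / (mc[1] + 1)
                mc[1] += 1
                return
        self.micro_clusters.append([x, 1])

    def _build_clusters(self):
        # "Expensive" step: group micro-clusters whose sorted centers lie within
        # 2 * radius of the previous one (a crude stand-in for a DBSCAN-like pass).
        groups = []
        for center in sorted(mc[0] for mc in self.micro_clusters):
            if groups and center - groups[-1][-1] <= 2 * self.radius:
                groups[-1].append(center)
            else:
                groups.append([center])
        self.centers = [sum(g) / len(g) for g in groups]
        self.n_clusters = len(self.centers)

    def predict_one(self, x):
        # The final clusters are only generated when a prediction is requested.
        self._build_clusters()
        return min(range(self.n_clusters), key=lambda i: abs(self.centers[i] - x))


model = LazyClusterer(radius=1.0)
for value in [0.0, 0.2, 0.1, 5.0, 5.3, 5.1]:
    model.learn_one(value)

before = model.n_clusters       # still 0: no prediction has been requested yet
label = model.predict_one(4.8)  # triggers cluster generation
after = model.n_clusters
print(before, len(model.micro_clusters), label, after)
```

The counter reads 0 after learning alone and only changes once a prediction is requested, which mirrors the symptom reported in this issue.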

Hope that this answer is clear and helpful to you.
