
Adding support for a variable vocab KMeans #864

Closed
wants to merge 1 commit into from

Conversation

@vsoch (Contributor) commented Mar 6, 2022

Currently the KMeans model(s) require providing a set of indexed values that map to the currently known centroids. The models allow prediction for a feature that has not yet been seen, but it is not obvious how to update a model with new features, especially when they are strings! This pull request adds a new KMeans class, VariableVocabKMeans, that allows:

  1. providing a dictionary of strings and integers more suitable for NLP / text processing
  2. updating the model and vocabulary with new words as they are seen.

This means that to learn, the user provides a vocabulary (a lookup/dict with words as keys and counts, or some other value for the word, as values) that is then mapped into the same model space as the traditional KMeans via a vocabulary lookup. To support this I added stream.iter_counts, which transforms a vector of tokens into the suitable input (since the user will need to do it every time!). I am also adding a function that, given a centroid, returns the words instead of the indices, since that might be useful.
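For illustration, the proposed stream.iter_counts helper could be sketched in plain Python roughly like this (the name and behavior follow this PR's proposal; this is not an existing river API):

```python
from collections import Counter


def iter_counts(X):
    """Sketch of the proposed stream.iter_counts: turn each list of
    tokens into a dict mapping token -> count, the input format that
    VariableVocabKMeans would expect."""
    for tokens in X:
        yield dict(Counter(tokens))


# Two token vectors become two count dictionaries
print(list(iter_counts([["one", "two"], ["nine", "nine"]])))
# -> [{'one': 1, 'two': 1}, {'nine': 2}]
```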

The use case for this is what I'm working on. I started out with Word2Vec/Doc2Vec to generate vectors (which would work nicely with the current KMeans implementation, since that dimension does not change), but then I realized I really don't want to carry around two models (the Word2Vec AND the online-ml KMeans) just to better visualize that space. With this new model I can simply do my processing of the text to generate tokens and feed them directly to the online-ml system, with no need for Word2Vec. We of course lose context that way, but since my data is standardized error messages, we can be fairly confident that when we see a set of tokens, they will generally come in the same order. Here is an example of running my new model:

from river import cluster, stream


def main():

    # Create variable vocab kmeans
    # You'd typically do some kind of NLP preprocessing to get tokens
    X = [
        ["one", "two"],
        ["one", "four"],
        ["one", "zero"],
        ["five", "six"],
        ["seven", "eight"],
        ["nine", "nine"]
    ]

    model = cluster.VariableVocabKMeans(n_clusters=4, halflife=0.4, sigma=3, seed=0)

    # We want to convert a vector of words into counts
    for i, vocab in enumerate(stream.iter_counts(X)):
        model = model.learn_one(vocab)
        center = model.predict_one(vocab)
        print(f'{vocab} is assigned to cluster {center}')

        # this has words (key) and coordinate (value)
        coords = model.get_center_vocab(center)
        print(coords)

if __name__ == "__main__":
    main()

I ran the pre-commit for black, isort, etc. Please let me know how/where to test, what other changes you would like, and whether you'll accept this idea, period! I hope that you will consider it, because I know a lot of researchers do NLP stuff, and this seems like a nice way to extend that to KMeans clustering.

And a follow-up question: I was looking through metrics, and the only one that seemed suitable for clustering was https://github.com/online-ml/river/blob/main/river/metrics/fowlkes_mallows.py, but I believe that is for comparing two clusterings and not appropriate for evaluating a single one. Is there anything akin to looking at average distances to centroids or something like that? You can probably guess I'm anticipating adding a "cluster" class to django-river-ml and I want some metrics associated :)

Thank you!

Signed-off-by: vsoch vsoch@users.noreply.github.com

@vsoch (Contributor, Author) commented Mar 6, 2022

Just had another idea: I think it would be good to support users adding custom models to django-river-ml, so basically a CustomModelFlavor that is more lenient about requirements and can possibly have custom metrics added too... going to think more about this!

Currently the kmeans model(s) require providing a set of indexed values to map
to the currently known centroids, and the models allow prediction given a feature
has not yet been seen, but it is not clear that it is easy to update models with new
features, and especially when they are strings! This pull request adds a new KMeans class,
VariableVocabKMeans that allows the user to provide a vocabulary (lookup/dict of keys as
words, values as counts or other value for the word) that is then mapped into the same
model space as the traditional Kmeans. I am also adding a function to get a centroid
and return the words instead of the indices.

Signed-off-by: vsoch <vsoch@users.noreply.github.com>
@vsoch (Contributor, Author) commented Mar 7, 2022

Okay, the custom model idea wasn't too bad! It's basically a model flavor that doesn't constrain things. I have a basic example working, and next I'm going to test when the module is outside of the apps Django can see. https://github.com/vsoch/django-river-ml/pull/8/files#diff-2fc3b9f958ccce7c952b2460ec359be8c54adb468a2e6908328c8537066d1419R150.

Tests here have a different setup: it looks like we run the models through common checks, and I suspect that because my model's input looks different, it's failing because of that. Let me know what would be best to do next (and I'll likely have some time in an evening this week).

@MaxHalford (Member) commented Mar 7, 2022

Hey!

So I'm not exactly sure what problem you're trying to solve with these two additions. Can't you just do something like this?

from river import cluster, feature_extraction

model = (
    feature_extraction.BagOfWords() |
    cluster.KMeans(
        n_clusters=4,
        halflife=0.4,
        sigma=3,
        seed=0
    )
)

for x in X:
    sentence = ' '.join(x)
    model.learn_one(sentence)
    center = model.predict_one(sentence)
    print()
    print(f"'{sentence}' is assigned to cluster {center}")
    print(dict(model["KMeans"].centers[center]))
    print()
'one two' is assigned to cluster 1
{'two': -0.8234860065411578, 'one': 1.0669064214291877}

'one four' is assigned to cluster 1
{'two': -0.8234860065411578, 'one': 1.0401438528575127, 'four': 0.7489979342483873}

'one zero' is assigned to cluster 1
{'two': -0.8234860065411578, 'one': 1.0240863117145076, 'four': 0.7489979342483873, 'zero': 3.3622091690488194}

'five six' is assigned to cluster 3
{'two': 0.5375894618245707, 'one': -2.4932976458129645, 'four': -1.9409448916425145, 'zero': -1.542469989158181, 'six': 2.598585543538009, 'five': -1.2220162707116904}

'seven eight' is assigned to cluster 3
{'two': 0.5375894618245707, 'one': -2.4932976458129645, 'four': -1.9409448916425145, 'zero': -1.542469989158181, 'six': 2.598585543538009, 'five': -1.2220162707116904, 'eight': -3.1667595431959117, 'seven': 0.9188387420459362}

'nine nine' is assigned to cluster 3
{'two': 0.5375894618245707, 'one': -2.4932976458129645, 'four': -1.9409448916425145, 'zero': -1.542469989158181, 'six': 2.598585543538009, 'five': -1.2220162707116904, 'eight': -3.1667595431959117, 'seven': 0.9188387420459362, 'nine': 0.7088125154322985}

And a follow up question - I was looking through metrics and the only one that seemed suitable for clustering was https://github.com/online-ml/river/blob/main/river/metrics/fowlkes_mallows.py but I believe that is for between clusters and not appropriate for a single cluster. Is there anything akin to looking at average distances to centroids or something like that? You can probably guess I'm anticipating adding a "cluster" class to django-river-ml and I want some metrics associated :)

There are actually loads of clustering metrics in the metrics.cluster module, such as this one. Hopefully that will help.

Just had another idea: I think it would be good to support users adding custom models to django-river-ml, so basically a CustomModelFlavor that is more lenient about requirements and can possibly have custom metrics added too... going to think more about this!

Yes I think it's a very good idea to offer flexibility along with strong defaults :)

@vsoch (Contributor, Author) commented Mar 7, 2022

lol! Yes that works too, I just didn't know you could do that :) Now I do.

I don't know how I missed BIC! I should probably just ask - what metrics should I make default for a cluster flavor model?

Sorry for the extraneous PR - I actually had a lot of fun doing it anyway, and the "custom" type I think is still useful.

@MaxHalford (Member)

No worries! It's our fault, we haven't documented it well enough.

I don't know how I missed BIC! I should probably just ask - what metrics should I make default for a cluster flavor model?

That's a good question... I would suggest using SilhouetteScore. Dunn's index is good too, but I don't believe we have it.
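For reference, the per-sample silhouette value that such a score averages can be written in a few lines of plain Python (a textbook sketch, not river's implementation):

```python
def silhouette_sample(a, b):
    """Textbook silhouette value for a single sample, where `a` is the
    mean distance to points in its own cluster and `b` is the mean
    distance to points in the nearest other cluster. Ranges from -1
    (badly placed) to 1 (well placed)."""
    if a == b:
        return 0.0
    return (b - a) / max(a, b)


# A sample that is tight within its cluster and far from the next one
print(silhouette_sample(1.0, 3.0))
```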

I actually had a lot of fun doing it anyway, and the "custom" type I think is still useful.

That's all that matters :)

@vsoch (Contributor, Author) commented Mar 7, 2022

Thanks! I'll add those tonight. Closing PR here, hopefully I can contribute more meaningfully at some point.

@vsoch vsoch closed this Mar 7, 2022
@MaxHalford (Member)

We have a whole roadmap that needs tackling :D

@vsoch (Contributor, Author) commented Mar 7, 2022

@MaxHalford follow-up question: so let's say we have a cluster-based model, we do a prediction, and we get back:

res = cli.predict(x=sentence, model_name='spack-errors')

In [49]: res
Out[49]: 
{'model': 'spack-errors',
 'prediction': 69,
 'identifier': 'c7e9ffb8-e876-4664-b986-bc5c90f7a2d6'}

Yay, we got the centroid ID, but we don't really know what that means :P Should there be endpoints (maybe flavor-specific) to inspect some attribute of a model (such as a centroid), or perhaps a wrapper around predict_one for different types that, given that we have a centers attribute, is able to look up the details? I figure that in the case of a prediction, people might want more information back than the centroid number. I suppose it could be meaningful to see what is grouped together, but if I just want metadata for one point, I'd rather just get that.

Let me know your thoughts, and if you think any of the above is a good idea, what you'd think is a good design.

The alternative would be to require the user to download the model, but arguably I don't want to do that, because it's a download -> load cycle and I would have to do it regularly, given there could be updates.
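One possible shape for such a wrapper, sketched against a minimal stand-in model (TinyKMeans, predict_with_center, and the example centers are all hypothetical names invented for illustration, not part of river or django-river-ml):

```python
class TinyKMeans:
    """Minimal stand-in exposing the `.centers` / `.predict_one`
    interface that river's KMeans provides."""

    def __init__(self, centers):
        self.centers = centers  # {cluster_index: {feature: coordinate}}

    def predict_one(self, x):
        # Assign to the nearest center by squared Euclidean distance
        def dist(c):
            keys = set(c) | set(x)
            return sum((c.get(k, 0.0) - x.get(k, 0.0)) ** 2 for k in keys)

        return min(self.centers, key=lambda i: dist(self.centers[i]))


def predict_with_center(model, x):
    """Hypothetical server-side wrapper: return the cluster index
    together with that center's coordinates, so the caller gets more
    back than a bare centroid number."""
    center = model.predict_one(x)
    return {"prediction": center, "center": dict(model.centers[center])}


model = TinyKMeans({0: {"one": 1.0, "two": 1.0}, 1: {"five": 1.0, "six": 1.0}})
print(predict_with_center(model, {"one": 1, "two": 1}))
# -> {'prediction': 0, 'center': {'one': 1.0, 'two': 1.0}}
```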

@MaxHalford (Member)

That's a bit of a philosophical question. Clustering models are not directly used for decision making, so indeed there's not much you can do when you're given a cluster index. Note that clustering models are sometimes used for visualization purposes, but that's a bit of a toy example in my opinion.

The ideal experience for clustering is to provide a list of examples for each cluster. That way users can label each cluster with a human friendly cluster label. Then, the system could return that cluster label along with the cluster index.

Makes sense?
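That suggestion could be as simple as a user-maintained mapping from cluster index to a friendly name that the system attaches to predictions (all names and labels here are illustrative):

```python
# Hypothetical user-curated labels, assigned after eyeballing a few
# examples from each cluster
cluster_labels = {0: "network timeouts", 1: "compiler errors"}


def describe(prediction):
    """Attach the friendly label (if any) to a raw cluster index."""
    return {
        "cluster": prediction,
        "label": cluster_labels.get(prediction, "unlabeled"),
    }


print(describe(0))   # -> {'cluster': 0, 'label': 'network timeouts'}
print(describe(42))  # -> {'cluster': 42, 'label': 'unlabeled'}
```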

@vsoch (Contributor, Author) commented Mar 8, 2022

The ideal experience for clustering is to provide a list of examples for each cluster. That way users can label each cluster with a human friendly cluster label. Then, the system could return that cluster label along with the cluster index.

It does make sense, although in many cases we never have labels, and the best we can do is look at the other points assigned to the same cluster. That will have to do, then, I suppose, and I can build the visualization of the centers into my UI somewhere.

Thanks, and let me know if you think further and change your mind or have other ideas!

@MaxHalford (Member)

It does make sense, although for many cases we never have labels, and the best we can do is look at other points assigned to the same cluster. That will have to do then I suppose, and I can build the visualization of the centers into my UI somewhere.

To be clear, I meant "label" as in the friendly name you could give to a cluster (a group of points), not "label" as in the label of each observation.

@vsoch (Contributor, Author) commented Mar 8, 2022

Yes understood! 👍
