Add similarity utility. #64

Closed
koaning opened this issue Jun 21, 2023 · 1 comment

koaning commented Jun 21, 2023

Something like this:

import numpy as np
from sklearn.metrics import pairwise_distances

def calc_distances(inputs, anchors, pipeline, anchor_pipeline=None, metric="cosine", aggregate=np.max, n_jobs=None):
    """
    Shortcut to compare a sequence of inputs to a set of anchors. 

    The available metrics are: `cityblock`, `cosine`, `euclidean`, `haversine`, `l1`, `l2`, `manhattan` and `nan_euclidean`.

    You can read a verbose description of the metrics [here](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.distance_metrics.html#sklearn.metrics.pairwise.distance_metrics).

    Arguments:
        - inputs: sequence of inputs to calculate scores for
        - anchors: set/list of anchors to compare against
        - pipeline: the pipeline to use to calculate the embeddings
        - anchor_pipeline: the pipeline to apply to the anchors, meant to be used if the anchors should use a different pipeline
        - metric: the distance metric to use 
        - aggregate: function used to reduce the distances to the different anchors down to a single score per input; numpy functions that accept `axis=1`, like `np.max` and `np.mean`, can be used
        - n_jobs: set to -1 to use all cores for calculation
    """
    X_input = pipeline.transform(inputs)
    if anchor_pipeline:
        X_anchors = anchor_pipeline.transform(anchors)
    else:
        X_anchors = pipeline.transform(anchors)

    X_dist = pairwise_distances(X_input, X_anchors, metric=metric, n_jobs=n_jobs)
    return aggregate(X_dist, axis=1)
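
For context, a minimal usage sketch (not part of the proposal): `TfidfVectorizer` is just a stand-in for whatever embedding pipeline would be used in practice; any fitted transformer with a `.transform()` method fits the signature.

from sklearn.feature_extraction.text import TfidfVectorizer

texts = ["the dog barked loudly", "the cat meowed", "markets fell sharply"]
anchors = ["puppy sounds", "kitten noises"]

# Any fitted transformer whose `.transform()` returns vectors will do here.
pipeline = TfidfVectorizer().fit(texts + anchors)

# Distance from each input to its nearest anchor: `np.min` aggregates over
# the anchor axis, so lower scores mean "closer to an anchor".
scores = calc_distances(texts, anchors, pipeline, aggregate=np.min)
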
koaning commented Jun 24, 2023

Then the Prodigy recipe might use something like:

from prodigy.sorters import ExpMovingAverage, prefer_low_scores

def make_scored_stream(stream, anchors):
    # Assumes `pipeline` and a `batched` helper are available in scope.
    for batch in batched(stream):
        batch_text = [b['text'] for b in batch]
        distances = calc_distances(batch_text, anchors, pipeline)
        for score, ex in zip(distances, batch):
            yield score, ex

def sorted_stream(stream):
    return prefer_low_scores(ExpMovingAverage(stream))

Worth rethinking though. Something about recalculating the anchors feels a bit wasteful.
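
One possible shape for that, sketched under the same assumptions as above (`pipeline`, `batched`, cosine distance): embed the anchors once up front and compare each batch against the cached matrix.

# Computed once, outside the stream, so the anchors are never re-embedded.
X_anchors = pipeline.transform(anchors)

def make_scored_stream(stream, X_anchors):
    for batch in batched(stream):
        batch_text = [b['text'] for b in batch]
        X_batch = pipeline.transform(batch_text)
        # Distance to the nearest anchor, per example in the batch.
        distances = pairwise_distances(X_batch, X_anchors, metric="cosine").min(axis=1)
        for score, ex in zip(distances, batch):
            yield score, ex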
