glove.py get_expanded_query() is slow, potential improvement #33

Open
Woracle opened this issue Oct 5, 2022 · 1 comment



Woracle commented Oct 5, 2022

So I've been playing around with the different expanders and found that the glove get_expanded_query() method was quite slow.

I realised that line 33, which loops through the comparisons (one full sorted() over the whole vocabulary per query word), was the main culprit slowing the method:

w = sorted(Glove.glove.keys(), key=lambda word: scipy.spatial.distance.euclidean(Glove.glove[word], Glove.glove[qw]))

Reviewing scipy.spatial.distance.cdist, I saw that we could run all the distance comparisons at the same time and improve speed. Combined with np.argpartition(), I was able to get the indices of the top n words for each row of the cdist output and retrieve the words from the model by index (the model being the glove dictionary).
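
If it helps, here is a tiny standalone sketch of what those two calls return (toy 2-d vectors and made-up words, just to show the shapes):

import numpy as np
import scipy.spatial.distance

# toy model: four words with made-up 2-d embeddings
model = {'lemon': [0.0, 1.0], 'lime': [0.1, 0.9], 'pie': [1.0, 0.0], 'cake': [0.9, 0.1]}
nums = np.vstack(list(model.values()))            # (4, 2): every vocab embedding
embs = np.vstack([model['lemon'], model['pie']])  # (2, 2): the query embeddings

# one call computes every (query word, vocab word) euclidean distance
matrix = scipy.spatial.distance.cdist(embs, nums)  # shape (2, 4)

# argpartition moves the 2 smallest distances into the first 2 columns (unordered)
top2 = np.argpartition(matrix, 2)[:, :2]
print([list(model)[i] for i in top2[0]])  # lemon's neighbours, e.g. ['lemon', 'lime']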

In my testing I got the same outputs in a fraction of the time on short queries: "a nice lemon pie" went from 11 seconds to 1.4 seconds with no change in output.
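
If you want to reproduce the comparison, a rough harness along these lines is what I mean (old_get_expanded_query is just a placeholder name for the original sorted()-based method, and model is the loaded glove dictionary):

import time

query = ['a', 'nice', 'lemon', 'pie']

t0 = time.perf_counter()
old = old_get_expanded_query(model, query)  # placeholder: the original sorted() loop
t1 = time.perf_counter()
new = get_expanded_query(model, query)      # the vectorised version below
t2 = time.perf_counter()

print(f'old: {t1 - t0:.1f}s  new: {t2 - t1:.1f}s  same output: {old == new}')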

Below is a reduced snippet showing the basic approach. Anyway, I found this repo really helpful, so I thought I would share the suggestion for anyone else's benefit.

import numpy as np
import scipy.spatial.distance

def get_expanded_query(model, query, topn=5, coef=0.7):
    # model: dict mapping word -> embedding vector; query: list of terms
    # (coef is unused in this reduced snippet)
    embs = []
    terms = []
    for qw in query:
        if qw.lower() in model:
            embs.append(model[qw.lower()])  # index with the same lowercased key we tested
        else:
            terms.append([qw, 1])  # out-of-vocabulary terms, kept as-is (unused below)

    # stack the query embeddings into an array, which allows us to use cdist to
    # compare all words much faster
    embs = np.vstack(embs)
    nums = np.vstack(list(model.values()))

    model_list = list(model)  # allows us to use integer indexing on the model keys

    # cdist performs a pairwise comparison for each pair, generating an array of
    # shape [query length (post cleaning), number of words in model]
    matrix = scipy.spatial.distance.cdist(embs, nums)

    # we now have an array of pairwise distances for each term; np.argpartition
    # gives the indices of the topn nearest words per row, then we just look up
    # those words in the model keys
    words = [model_list[idx] for idx in np.argpartition(matrix, topn)[:, :topn].flatten()]
    return ' '.join(words)
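
For reference, this is roughly how I call it: load the glove vectors into a plain dict (word -> numpy array) and pass the query as a list of terms. The file path here is just an example; point it at your own glove download.

import numpy as np

model = {}
with open('glove.6B.50d.txt', encoding='utf8') as f:  # example path
    for line in f:
        parts = line.rstrip().split(' ')
        model[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

print(get_expanded_query(model, ['a', 'nice', 'lemon', 'pie'], topn=5))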
@hosseinfani
Owner

Hi @Woracle
Thank you for the speed improvement. Please send a PR, and I will merge it.
