Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

practical-nlp/Ch4/03_Word2Vec_Example.ipynb #5

Closed
sai-teja-ponugoti opened this issue Jul 10, 2020 · 5 comments
Closed

practical-nlp/Ch4/03_Word2Vec_Example.ipynb #5

sai-teja-ponugoti opened this issue Jul 10, 2020 · 5 comments

Comments

@sai-teja-ponugoti
Copy link

In the above Jupiter notebook, in function:

# Creating a feature vector by averaging all embeddings for all sentences
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this =  np.zeros(DIMENSION)
        count_for_this = 0
        for token in tokens:
            if token in w2v_model:
                **feat_for_this += w2v_model[token]
                count_for_this +=1
        feats.append(feat_for_this)**        
    return feats

The feature vectors are not averaged, directly sum of the word embeddings is appended as the feature, whereas the function description indicates that the vectors will be averaged.

@varunp2k
Copy link
Collaborator

varunp2k commented Jul 11, 2020

Thank you @sai-teja-ponugoti for bringing this to our notice.
The fix to this most likely would be just having
np.mean(w2v_model[token],axis=0)
instead of the current
w2v_model[token]

Would you like to send a PR to fix this issue?

@brianalexander
Copy link

Hi @varunp2k,

I believe the fix would be a modification of

feats.append(feat_for_this)
to
feats.append(feat_for_this/count_for_this if count_for_this > 0 else feat_for_this)

In the example, we're categorizing the tweet by taking the average of all of the word vectors for each word in the sentence. This also makes sense because previously count_for_this is included in the code, but never ends up being used.

count_for_this if count_for_this > 0 else feat_for_this is needed to handle cases where there are no token matches and avoid dividing by zero.

Here is the updated cell in its entirety:

def embedding_feats(list_of_lists, dims=300):
    zero_vector = np.zeros(dims)
    feats = []
    for tokens in list_of_lists:
        feat_for_this = np.zeros(dims)
        count_for_this = 0
        for token in tokens:
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this += 1
        feats.append(feat_for_this/count_for_this if count_for_this > 0 else feat_for_this)
    return feats

train_vectors = embedding_feats(texts_processed)
print(len(train_vectors))

@varunp2k
Copy link
Collaborator

Looks good @brianalexander
Please send a PR

@sai-teja-ponugoti
Copy link
Author

sai-teja-ponugoti commented Jul 13, 2020

Thank you @sai-teja-ponugoti for bringing this to our notice.
The fix to this most likely would be just having
np.mean(w2v_model[token],axis=0)
instead of the current
w2v_model[token]

Would you like to send a PR to fix this issue?

@brainalexander have you sent one already?

@varunp2k
Copy link
Collaborator

@sai-teja-ponugoti
A PR stands for Pull Request. This article will explain what it is in a detail.
https://docs.github.com/en/github/collaborating-with-issues-and-pull-requests/about-pull-requests
Brainalexander is yet to send a PR.
You can check the active PR's for this repo under the pull requests section right next to the issues. The link to that is here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants