lstm_imdb validation set is not included in vocabulary #1

Open
SlothWithCloth opened this issue Mar 24, 2020 · 0 comments


SlothWithCloth commented Mar 24, 2020

Hello,

it seems that in lstm_imdb.py many words from valid_ds are missing from the vocabulary.
Those words are then mapped to index 0 (<unk>). The network still runs, but this biases its evaluation. You can reproduce the problem as follows:
Copy the first 98 lines of the original file and add this loop:

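# Count validation tokens that were mapped to index 0 (<unk>).
counter = 0
for batch in valid_iter:
    batch_text = batch.text[0]  # tensor of token indices for this batch
    for a in batch_text:
        for b in a:
            if b == 0:
                counter += 1
                # Progress report every 1000 unknown tokens.
                if counter % 1000 == 0:
                    print(f'{counter} words could not be translated')
# Final total over the whole validation iterator.
print(f'{counter} words could not be translated')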

It counts the tokens in valid_iter that are mapped to index 0, which shows that there are many of them.

I resolved the issue by manually loading the whole dataset and passing it to TEXT.build_vocab as the first argument, but I am sure there is a nicer way of doing it.
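To illustrate the effect without the original script, here is a minimal sketch in plain Python. It does not use torchtext: the toy sentences and the helpers `build_vocab` and `count_unknown` are hypothetical stand-ins, and index 0 is assumed to be `<unk>` as described above. It compares a vocabulary built from the training data only with one built from training and validation data together.

```python
from collections import Counter

UNK, PAD = "<unk>", "<pad>"  # assumed special tokens; <unk> gets index 0


def build_vocab(*datasets):
    """Map every token seen in any of the given datasets to an integer index."""
    counts = Counter(tok for ds in datasets for sent in ds for tok in sent)
    itos = [UNK, PAD] + sorted(counts)
    return {tok: i for i, tok in enumerate(itos)}


def count_unknown(dataset, stoi):
    """Count tokens that fall back to index 0 (<unk>)."""
    return sum(1 for sent in dataset for tok in sent if stoi.get(tok, 0) == 0)


# Toy data, invented for illustration only.
train = [["a", "great", "movie"], ["a", "dull", "plot"]]
valid = [["a", "great", "cast"], ["dull", "pacing"]]

train_only = build_vocab(train)        # vocabulary from training data only
combined = build_vocab(train, valid)   # vocabulary from both splits

unk_train_only = count_unknown(valid, train_only)
unk_combined = count_unknown(valid, combined)
print(unk_train_only, unk_combined)
```

In the toy data, "cast" and "pacing" appear only in the validation split, so they are the tokens that become `<unk>` with the training-only vocabulary. If I recall the torchtext API correctly, `Field.build_vocab` accepts several datasets as positional arguments, so passing the training split and valid_ds together may be a shorter route than loading the whole dataset by hand; please treat that as an assumption to check against the installed version.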

Hope I could help!
