Hello,
it seems that in the file `lstm_imdb.py` many words from `valid_ds` are missing from the vocabulary.
Those words are then embedded as 0, i.e. `<unk>`, and the network still works, but it biases the evaluation of the net. You can reproduce the problem by doing the following:
Copy the first 98 lines of the original file and add this loop:
```python
counter = 0
for batch in valid_iter:
    batch_text = batch.text[0]  # tensor of token indices for this batch
    for a in batch_text:
        for b in a:
            if b == 0:  # index 0 is <unk>: the token was not in the vocabulary
                counter += 1
                if counter % 1000 == 0:
                    print(f'{counter} words could not be translated')
print(f'{counter} words could not be translated')
```
This loop counts the words embedded as 0 in `valid_iter`, showing that there are many of them.
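For reference, the same count can be written as a single tensor comparison per batch (just a sketch, assuming the same `valid_iter` and batch layout as above):

```python
# Count all occurrences of index 0 (<unk>) across the validation batches.
counter = sum((batch.text[0] == 0).sum().item() for batch in valid_iter)
print(f'{counter} words could not be translated')
```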
I resolved the issue by manually loading the whole dataset and passing it to `TEXT.build_vocab` as the first argument, but I am sure there is a nicer way of doing it.
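For illustration, a minimal sketch of a fix along these lines, assuming the `TEXT` field and the `train_ds`/`valid_ds` split from the original script (in legacy torchtext, `Field.build_vocab` accepts several `Dataset` objects as positional arguments):

```python
# Build the vocabulary over both splits so that tokens appearing only in
# valid_ds get real indices instead of all collapsing to <unk> (index 0).
TEXT.build_vocab(train_ds, valid_ds)
```

Whether validation tokens should enter the vocabulary at all is a design choice; keeping them as `<unk>` measures true out-of-vocabulary behaviour, at the cost of the evaluation bias described above.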
Hope I could help!