Hello,
it seems that in the file `lstm_imdb.py` many words from `valid_ds` are missing from the vocabulary.
Those words are then embedded as 0, i.e. `<unk>`, and the network still works, but it biases the evaluation of the net. You can reproduce the problem by doing the following:
Copy the first 98 lines of the original file and add this loop:
```python
counter = 0
for batch in valid_iter:
    batch_text = batch.text[0]  # tensor of token indices for this batch
    for a in batch_text:
        for b in a:
            if b == 0:  # index 0 is <unk>: the token was not in the vocabulary
                counter += 1
                if counter % 1000 == 0:
                    print(f'{counter} words could not be translated')
print(f'{counter} words could not be translated')
```
This loop counts the words embedded as 0 in `valid_iter`, showing that there are many of them.
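For reference, the same count can be written as a single tensor comparison per batch (just a sketch, assuming the same `valid_iter` and batch layout as above):

```python
# Count all occurrences of index 0 (<unk>) across the validation batches.
counter = sum((batch.text[0] == 0).sum().item() for batch in valid_iter)
print(f'{counter} words could not be translated')
```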
I resolved the issue by manually loading the whole dataset and passing it to `TEXT.build_vocab` as the first argument, but I am sure there is a nicer way of doing it.
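For illustration, a minimal sketch of a fix along these lines, assuming the `TEXT` field and the `train_ds`/`valid_ds` split from the original script (in legacy torchtext, `Field.build_vocab` accepts several `Dataset` objects as positional arguments):

```python
# Build the vocabulary over both splits so that tokens appearing only in
# valid_ds get real indices instead of all collapsing to <unk> (index 0).
TEXT.build_vocab(train_ds, valid_ds)
```

Whether validation tokens should enter the vocabulary at all is a design choice; keeping them as `<unk>` measures true out-of-vocabulary behaviour, at the cost of the evaluation bias described above.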
Hope I could help!