Cannot use custom sentence for QG #14

Closed
ilham-bintang opened this issue Oct 21, 2019 · 3 comments

Comments


ilham-bintang commented Oct 21, 2019

Problem

Hi. I want to try QG using decode_seq2seq.py. It works when I use the sample data, but when I use other data it raises KeyError: 'H.E.'.

Note

  • I use BERT-LARGE-CASED
  • It succeeds if I remove that word, but then fails on another 'weird' word.

Question

  1. Does decode_seq2seq match each input word against the BERT-LARGE-CASED vocabulary?
  2. How should the text be preprocessed before running decode_seq2seq? Is there any guidance for preprocessing?
  3. I also read a similar issue about unseen vocab in pytorch-bert-transformer: huggingface/transformers#63.

Terminal Output

    File "/root/code/unilm/src/pytorch_pretrained_bert/tokenization.py", line 117, in convert_tokens_to_ids
        ids.append(self.vocab[token])
    KeyError: 'H.E.'  # or another weird word
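
For reference, the failure can be reproduced in isolation with a snippet like the one below (a minimal sketch; bert-large-cased and do_lower_case=False are assumptions based on the note above):

    from pytorch_pretrained_bert.tokenization import BertTokenizer

    # Assumption: the cased BERT-large vocabulary, matching the note above.
    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-cased', do_lower_case=False)

    # A raw, untokenized word is not a single entry in the WordPiece vocab,
    # so looking it up directly raises the same KeyError as above.
    tokenizer.convert_tokens_to_ids(['H.E.'])   # KeyError: 'H.E.'

    # Running tokenizer.tokenize() first splits it into in-vocabulary pieces.
    pieces = tokenizer.tokenize('H.E.')         # e.g. ['H', '.', 'E', '.']
    tokenizer.convert_tokens_to_ids(pieces)     # succeeds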
@donglixp (Contributor)

I think the problem is caused by the tokenization. The training examples should be tokenized as follows:

    from pytorch_pretrained_bert.tokenization import BertTokenizer

    # Use the same vocabulary as the checkpoint you decode with.
    tokenizer = BertTokenizer.from_pretrained(
        args.bert_model, do_lower_case=args.do_lower_case)
    r_list = []
    for idx, line in enumerate(chunk):
        # WordPiece-tokenize each source line so every token is in the vocab.
        tk_list = tokenizer.tokenize(line)
        r_list.append((idx, tk_list))

For example, the sentence Who did the Panthers beat to become the NFC champs ? is tokenized to
Who did the Panthers beat to become the NFC ch ##amp ##s ?, so that all the tokens are in the vocabulary.
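
Applying that tokenizer to the raw sentence should reproduce the pieces above; a minimal sketch (bert-large-cased is an assumption, substitute the vocabulary that matches your checkpoint):

    from pytorch_pretrained_bert.tokenization import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-cased', do_lower_case=False)

    src = "Who did the Panthers beat to become the NFC champs ?"
    # Join the WordPiece tokens back with spaces to build the source line
    # that decode_seq2seq.py reads.
    print(' '.join(tokenizer.tokenize(src)))
    # -> Who did the Panthers beat to become the NFC ch ##amp ##s ?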


aretius commented Nov 5, 2019

Hey @donglixp,
I also see lsb and rsb tokens. So does any source text, e.g. a paragraph, need to be tokenized with the Stanford tokenizer first and then passed through BertTokenizer?


donglixp commented Nov 6, 2019

Hi @aretius ,

If you would like to run the provided fine-tuned checkpoint directly, the same preprocessing pipeline is recommended. For custom fine-tuning, other toolkits should also work, as long as fine-tuning and inference use the same input format.
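
A minimal sketch of that pipeline for a raw paragraph, assuming nltk.word_tokenize as a stand-in for the Stanford tokenizer used to build the released data (it gives similar word-level splits but does not do the lsb/rsb bracket replacement), and bert-large-cased as the vocabulary:

    from nltk.tokenize import word_tokenize      # requires the 'punkt' data
    from pytorch_pretrained_bert.tokenization import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-cased', do_lower_case=False)

    paragraph = "The Panthers finished the regular season 15-1, and beat the Cardinals to become the NFC champs."

    # 1) Word-level tokenization, rejoined with single spaces.
    line = ' '.join(word_tokenize(paragraph))

    # 2) WordPiece tokenization with the same vocabulary as the checkpoint,
    #    so decode_seq2seq.py can map every token to an id.
    print(' '.join(tokenizer.tokenize(line)))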
