Cannot use custom sentence for QG #14

Closed
ilham-bintang opened this issue Oct 21, 2019 · 3 comments

Comments


ilham-bintang commented Oct 21, 2019

Problem

Hi. I want to try QG using decode_seq2seq.py. It works when I use the sample data, but when I use other data it raises KeyError: 'H.E.'.

Note

  • I use BERT-LARGE-CASED
  • It succeeds if I remove that word, but then fails on another 'weird' word.

Question

  1. Does decode_seq2seq match each input word against the BERT-LARGE-CASED vocabulary?
  2. How should the text be preprocessed before running decode_seq2seq? Is there any guidance for preprocessing?
  3. I also read a similar issue about unseen vocab in pytorch-bert-transformer: huggingface/transformers#63.

Terminal Output

    File "/root/code/unilm/src/pytorch_pretrained_bert/tokenization.py", line 117, in convert_tokens_to_ids
        ids.append(self.vocab[token])
    KeyError: 'H.E.'  # or another weird word
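
For reference, the failure can be reproduced in isolation with a snippet like the one below (a minimal sketch; bert-large-cased and do_lower_case=False are assumptions based on the note above):

    from pytorch_pretrained_bert.tokenization import BertTokenizer

    # Assumption: the cased BERT-large vocabulary, matching the note above.
    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-cased', do_lower_case=False)

    # A raw, untokenized word is not a single entry in the WordPiece vocab,
    # so looking it up directly raises the same KeyError as above.
    tokenizer.convert_tokens_to_ids(['H.E.'])   # KeyError: 'H.E.'

    # Running tokenizer.tokenize() first splits it into in-vocabulary pieces.
    pieces = tokenizer.tokenize('H.E.')         # e.g. ['H', '.', 'E', '.']
    tokenizer.convert_tokens_to_ids(pieces)     # succeeds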
@donglixp (Contributor)

I think the problem is caused by the tokenization. The training examples should be tokenized as follows:

    from pytorch_pretrained_bert.tokenization import BertTokenizer

    # Use the same vocabulary as the checkpoint you decode with.
    tokenizer = BertTokenizer.from_pretrained(
        args.bert_model, do_lower_case=args.do_lower_case)
    r_list = []
    for idx, line in enumerate(chunk):
        # WordPiece-tokenize each source line so every token is in the vocab.
        tk_list = tokenizer.tokenize(line)
        r_list.append((idx, tk_list))

For example, the sentence Who did the Panthers beat to become the NFC champs ? is tokenized to
Who did the Panthers beat to become the NFC ch ##amp ##s ?, so that all the tokens are in the vocabulary.
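
Applying that tokenizer to the raw sentence should reproduce the pieces above; a minimal sketch (bert-large-cased is an assumption, substitute the vocabulary that matches your checkpoint):

    from pytorch_pretrained_bert.tokenization import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-cased', do_lower_case=False)

    src = "Who did the Panthers beat to become the NFC champs ?"
    # Join the WordPiece tokens back with spaces to build the source line
    # that decode_seq2seq.py reads.
    print(' '.join(tokenizer.tokenize(src)))
    # -> Who did the Panthers beat to become the NFC ch ##amp ##s ?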


aretius commented Nov 5, 2019

Hey @donglixp,
I also see lsb and rsb tokens. So does any source text, e.g. a paragraph, need to be tokenized with the Stanford tokenizer first and then passed through BertTokenizer?


donglixp commented Nov 6, 2019

Hi @aretius ,

If you would like to run the provided fine-tuned checkpoint directly, the same preprocessing pipeline is recommended. For custom fine-tuning, other toolkits should also work, as long as fine-tuning and inference use the same input format.
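
A minimal sketch of that pipeline for a raw paragraph, assuming nltk.word_tokenize as a stand-in for the Stanford tokenizer used to build the released data (it gives similar word-level splits but does not do the lsb/rsb bracket replacement), and bert-large-cased as the vocabulary:

    from nltk.tokenize import word_tokenize      # requires the 'punkt' data
    from pytorch_pretrained_bert.tokenization import BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(
        'bert-large-cased', do_lower_case=False)

    paragraph = "The Panthers finished the regular season 15-1, and beat the Cardinals to become the NFC champs."

    # 1) Word-level tokenization, rejoined with single spaces.
    line = ' '.join(word_tokenize(paragraph))

    # 2) WordPiece tokenization with the same vocabulary as the checkpoint,
    #    so decode_seq2seq.py can map every token to an id.
    print(' '.join(tokenizer.tokenize(line)))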
