
[ALBERT]: In run_squad_sp, convert_examples_to_features gives error in case sentence piece model is not provided. #98

Open
Rachnas opened this issue Oct 31, 2019 · 10 comments

Comments

@Rachnas

Rachnas commented Oct 31, 2019

I am trying to run the ALBERT model on the SQuAD dataset. If the SentencePiece (SP) model is not provided, convert_examples_to_features fails. Please let me know where I can find the SP model.

@s4sarath

Download the model from TensorFlow Hub. The downloaded model has an assets folder containing a .vocab file and a .model file; the .model file is the SentencePiece (SPM) model.
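As a quick check (a sketch; the extraction path ./albert_base is an assumption, substitute wherever you unpacked the TF Hub module), you can list the assets folder:

    import os

    # Hypothetical local path where the TF Hub ALBERT module was extracted.
    assets_dir = './albert_base/assets'
    print(sorted(os.listdir(assets_dir)))
    # Expected, e.g.: ['30k-clean.model', '30k-clean.vocab']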

With no SPM Model

import tokenization  # the ALBERT repo's tokenization module

vocab_file = '/albert_base/assets/30k-clean.vocab'
spm_model_file = None  # no SentencePiece model: falls back to WordPiece
tokenizer = tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=True,
      spm_model_file=spm_model_file)
text_a   = "Hello how are you"
tokens_a = tokenizer.tokenize(text_a)

Output

['hello', 'how', 'are', 'you']

With SPM Model

vocab_file = '/albert_base/assets/30k-clean.vocab'
spm_model_file = '/albert_base/assets/30k-clean.model'  # SentencePiece model provided
tokenizer = tokenization.FullTokenizer(
      vocab_file=vocab_file, do_lower_case=True,
      spm_model_file=spm_model_file)
text_a   = "Hello how are you"
tokens_a = tokenizer.tokenize(text_a)

Output

['▁', 'H', 'ello', '▁how', '▁are', '▁you']
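
Note the difference: without the SPM model the tokenizer falls back to BERT-style WordPiece tokenization, while with it the text is split into SentencePiece pieces, where '▁' marks a word boundary. Also note that the input was not lowercased in the SPM case ('H' stays uppercase), which is relevant to the fix discussed below.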

@np-2019

np-2019 commented Oct 31, 2019

I had a similar issue; the workaround was to use convert_examples_to_features from XLNet's run_squad.py along with prepare_utils, with the necessary changes. This let me bypass the problem.

@Rachnas
Author

Rachnas commented Oct 31, 2019

Thanks @s4sarath and @np-2019, I am able to process the data with 30k-clean.model. I also incorporated convert_examples_to_features from XLNet with other changes; I am not bypassing the SP model.

@Rachnas Rachnas closed this as completed Oct 31, 2019
@wxp16

wxp16 commented Oct 31, 2019

The trained model is uncased, so the returned value of do_lower_case in create_tokenizer_from_hub_module() is True.

But in class FullTokenizer, when spm_model_file is not None, the current code ignores the value of do_lower_case. To fix this, first add self.do_lower_case = do_lower_case in the constructor of FullTokenizer; then, in def tokenize(self, text), lowercase the text when you are using the SentencePiece model, i.e.

    if self.sp_model:
      if self.do_lower_case:
        text = text.lower()

Hope this works.
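
A minimal standalone check of the effect (a sketch, assuming the sentencepiece Python package and the 30k-clean.model path from the earlier comment):

    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load('/albert_base/assets/30k-clean.model')  # path as in the earlier comment

    text = 'Hello how are you'
    print(sp.EncodeAsPieces(text))
    # ['▁', 'H', 'ello', '▁how', '▁are', '▁you']  (uppercase 'H' splits the first word)
    print(sp.EncodeAsPieces(text.lower()))
    # expected to yield in-vocab pieces, e.g. ['▁hello', '▁how', '▁are', '▁you']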

@s4sarath

s4sarath commented Nov 1, 2019

@np-2019 - It is better not to use XLNet preprocessing; things are a bit different here. The provided code runs without any error. If you are familiar with BERT preprocessing, this is very close to it, except for the use of the SentencePiece model.

@Rachnas
Author

Rachnas commented Nov 1, 2019

> The trained model is uncased, so the returned value of do_lower_case in create_tokenizer_from_hub_module() is True.
>
> But in class FullTokenizer, when spm_model_file is not None, the current code ignores the value of do_lower_case. To fix this, first add self.do_lower_case = do_lower_case in the constructor of FullTokenizer; then, in def tokenize(self, text), lowercase the text when you are using the SentencePiece model, i.e.
>
>     if self.sp_model:
>       if self.do_lower_case:
>         text = text.lower()
>
> Hope this works.

Thanks @wxp16, it helped.

@Rachnas
Author

Rachnas commented Nov 18, 2019

Sharing my learning: using XLNet preprocessing will not help, as the sequence of tokens in XLNet and ALBERT differs. SQuAD 2.0 will get preprocessed, but training will not converge. It is better to make selective changes in the ALBERT code only.

@np-2019

np-2019 commented Nov 19, 2019

FYI @Rachnas and @s4sarath, using XLNet preprocessing I could achieve the following results on SQuAD 2.0:
[Screenshot: SQuAD 2.0 evaluation results, Nov 19, 2019]

@s4sarath

@np-2019 - Those are pretty good results. Which ALBERT model (large, xlarge) and version (v1 or v2) did you use?

@Rachnas
Author

Rachnas commented Nov 19, 2019

@np-2019, it's very nice that you are able to reproduce the results successfully.

According to the XLNet paper, section 2.5: "We only reuse the memory that belongs to the same context. Specifically, the input to our model is similar to BERT: [A, SEP, B, SEP, CLS]."
According to the ALBERT paper, section 4.1: "We format our inputs as “[CLS] x1 [SEP] x2 [SEP]”."

As we can see, the CLS token is in a different position. Will it not cause any problem if we format the data according to XLNet?
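
As a hypothetical illustration (placeholder tokens, not real tokenizer output), the two layouts quoted above look like this:

    # Placeholder question/context tokens, purely for illustrating the layout.
    tokens_a = ['what', 'is', 'x']
    tokens_b = ['x', 'is', 'y']

    # XLNet paper, section 2.5: [A, SEP, B, SEP, CLS]
    xlnet_input = tokens_a + ['<sep>'] + tokens_b + ['<sep>', '<cls>']

    # ALBERT paper, section 4.1: [CLS] x1 [SEP] x2 [SEP]
    albert_input = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']

    print(xlnet_input)   # CLS at the end
    print(albert_input)  # CLS at the front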

@Rachnas Rachnas reopened this Nov 19, 2019
@andrewluchen andrewluchen transferred this issue from google-research/google-research Jan 6, 2020