
Data processing for the MAN model #7

Open
songdezhao opened this issue Dec 23, 2021 · 9 comments

@songdezhao

Hi Di, thanks a lot for sharing the code of this QA system. I have been trying to apply it to my own data. I skipped the pre-training and the multi-task learning; instead, I was trying to apply the MAN architecture (i.e., BertForMultipleChoice_SAN) to a single dataset of my own.

I didn't find the exact code that produces the two additional masks, premise_mask and hyp_mask, so I tried to implement them myself. Sorry for the ask, but I am wondering if the following implementation makes sense:

  1. In run_classifier_bert.py:
    a) Right after Line 143, I added the following. My understanding is that the hypothesis contains the answer only, so we should do this after tokens_b gets its "fair share" of the total max_len but before we concatenate it with the question in the next line.
    hypothesis = ["[CLS]"] + tokens_b + ["[SEP]"]

    b) Then, before Line 151, I added the following:

        # The premise is the concatenation of the passage/dialogue and the question plus additional special tokens
        premise = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_c + ["[SEP]"]
        
        # Convert to IDs
        premise_ids = tokenizer.convert_tokens_to_ids(premise)
        hypothesis_ids = tokenizer.convert_tokens_to_ids(hypothesis)
        
        # Compute how much to pad and then build the mask with the actual content length and the pad length
        premise_pad_length = max_seq_length - len(premise_ids)
        # create a mask with the actual content only
        premise_mask = [1] * len(premise_ids)
        # do padding
        premise_ids += [0] * premise_pad_length
        # append the padded length to mask
        premise_mask += [0] * premise_pad_length

        hypothesis_pad_length = max_seq_length - len(hypothesis_ids)
        hypothesis_mask = [1] * len(hypothesis_ids)
        hypothesis_ids += [0] * hypothesis_pad_length
        hypothesis_mask += [0] * hypothesis_pad_length
  2. With the above, I also modified InputFeatures to include these two additional masks and passed them along to the forward function.

Sorry for the long message, but I am wondering if the additional data processing above looks correct for using MAN? Many thanks!

@jind11
Owner

jind11 commented Dec 23, 2021

Hi, for MAN, to find the positions of the premise and hypothesis, we can make use of the segment IDs. For example, in my code, a segment ID of 0 means the dialogue/context and a segment ID of 1 means the concatenation of the question and answer. But of course, you can also use a special mask to indicate which tokens belong to the premise and which belong to the hypothesis; they serve the same purpose. Lines 152 and 156 set up the segment IDs.
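As a minimal sketch (mine, not code from the repo), the two masks could be derived from the standard `segment_ids` and `input_mask` lists that run_classifier_bert.py already builds per example, something like:

```python
def build_man_masks(segment_ids, input_mask):
    """Derive premise/hypothesis masks from BERT inputs.

    Premise = tokens with segment ID 0 (dialogue/context),
    hypothesis = tokens with segment ID 1 (question + answer);
    padding positions (input_mask == 0) are excluded from both.
    """
    premise_mask = [m * (1 - s) for s, m in zip(segment_ids, input_mask)]
    hyp_mask = [m * s for s, m in zip(segment_ids, input_mask)]
    return premise_mask, hyp_mask

# Toy example: 3 context tokens, 2 question+answer tokens, 2 padding slots
segment_ids = [0, 0, 0, 1, 1, 0, 0]
input_mask = [1, 1, 1, 1, 1, 0, 0]
premise_mask, hyp_mask = build_man_masks(segment_ids, input_mask)
# premise_mask -> [1, 1, 1, 0, 0, 0, 0]
# hyp_mask     -> [0, 0, 0, 1, 1, 0, 0]
```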

@songdezhao
Author

Thanks a lot for the quick response!

I guess I had one misunderstanding before, but just to clarify: is the hypothesis the concatenation of the question and answer, or the answer only?

From the other conversation here, it seems the hypothesis is the answer only? If so, then I guess the segment IDs cannot be directly used to derive the two masks, since a segment ID of 1 means the concatenation of the question and answer.

Thanks again.

@jind11
Owner

jind11 commented Dec 23, 2021

Good question. I have tried both versions: the hypothesis consisting of the question and answer, and the hypothesis consisting of the answer only. At least for the DREAM dataset, I did not find much difference. The first choice also makes intuitive sense: we carry the information from both the question and the answer, seek out the most relevant information from the context, and see whether we can find evidence to support this particular question-answer pair, which is similar to a factual correctness task.

@songdezhao
Author

Got it, and thanks again. I will use the first choice. I agree it makes more sense and also requires fewer changes to the data processing code, i.e., I can directly use the segment IDs to derive the two additional masks.

Two other quick questions (sorry):

  1. I assume this is the model: BertForMultipleChoice_SAN? I am asking because I also see "SAN2".

  2. For initialization, I simply replaced the general model with the following. For "opt", I simply put "use_SAN" there. Is there anything else I should put into this "opt" parameter?
    model = BertForMultipleChoice_SAN.from_pretrained(args.bert_model, opt={"use_SAN": 1}, num_choices=[5])

@jind11
Owner

jind11 commented Dec 24, 2021

  1. Yes.
  2. That's it.

@songdezhao
Author

Thanks a lot

@songdezhao
Author

Thanks again for your help. I am now able to train the model with MAN (i.e., BertForMultipleChoice_SAN).

Just one quick question. When training, in the log, I see this warning:
pytorch_pretrained_bert.modeling - Failed for Randomly initialize the top level classifiers!

I checked the code, and I think this is because modeling.py tries to randomly initialize the variable "classifiers", while there is no such variable in BertForMultipleChoice_SAN. Can I simply ignore this warning, or should I do the following initialization at Line 776:

logger.info("Randomly initialize the top level classifiers!")
for i in range(len(model.out_proj)):
    model.out_proj[i].classifier.proj.weight.data.normal_(mean=0.0, std=config.initializer_range)
    model.out_proj[i].classifier.proj.bias.data.zero_()

@jind11
Owner

jind11 commented Dec 24, 2021

The code you mentioned is all right, but you can also ignore the warning, since PyTorch initializes any tensor with its default initialization method.
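For instance, a freshly constructed layer already has non-trivial parameters (PyTorch's default for `nn.Linear` weights is Kaiming-uniform initialization), which is why skipping the BERT-style truncated-normal init is usually harmless; a quick check:

```python
import torch
from torch import nn

# A linear head of the rough shape used for multiple choice
# (768 hidden size, 5 choices); numbers are illustrative only.
layer = nn.Linear(768, 5)

# The weights are already initialized by nn.Linear's constructor,
# not left as zeros or uninitialized memory.
assert layer.weight.abs().sum().item() > 0
assert layer.bias is not None
```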

@songdezhao
Author

I see. Thanks.
