How to represent sentence in Template Denoising step? #13

Closed
CSerxy opened this issue Aug 20, 2022 · 16 comments

CSerxy commented Aug 20, 2022

Hi there,

I have recently been re-implementing your work in fairseq. Your model is really impressive.

I was able to reproduce your results in Table 8: with different templates, I get 78.41 on average (RoBERTa_base as the backbone model).

However, when I try to reproduce your default method, which uses different templates with template denoising, the highest score I can get is 78.54 (RoBERTa_base as the backbone model).

I tried using either 1) the [MASK] token's representation or 2) the [CLS] token's representation to represent the template in the template denoising step.

Can you clarify which representation you use as the template bias?

Many thanks!
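For concreteness, here is a minimal, hypothetical sketch (not from the Prompt-BERT repo; it assumes a HuggingFace RoBERTa encoder) of the two candidate "template bias" vectors being compared, using the template with the sentence slot left empty:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# Template with the sentence slot left empty (the "denoising" input).
template_only = "The sentence : ' ' means <mask> ."
inputs = tokenizer(template_only, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Option 1: the <mask> token's hidden state.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
mask_bias = hidden[0, mask_pos].squeeze(0)

# Option 2: the <s> (CLS) token's hidden state.
cls_bias = hidden[0, 0]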


CSerxy commented Aug 20, 2022

Given the template The sentence : ‘[X]’ means [MASK] .

I understand that you use [MASK]'s vector to represent X.

I am curious which vector you used to represent the template bias, given The sentence : ‘’ means [MASK] . without [X].


CSerxy commented Aug 20, 2022

On my side, I tried both [MASK] and [CLS] to represent the template bias. The first performed badly, with a 75.89 average score, while the second reached 78.54.


kongds commented Aug 20, 2022

Thanks for your interest in our work.

For template denoising, we use the [MASK] representation, keeping the same position ids the template tokens would have in the full input.

For example, if we use a template like T1 <s1> T2 [MASK], where <s1> is the sentence and T1, T2 are template tokens, we feed BERT with T1:pos_id0 T2:pos_id(1+len(s1)) [MASK]:pos_id(2+len(s1)) to represent the bias of s1. (pos_id is the position id, len(s1) is the length of s1 in tokens.)
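To make the position-id layout concrete, a toy sketch (the token ids and sent_len below are made up, not from the repo):

import torch

# Toy token ids for T1, T2 and [MASK]; in practice they come from the tokenizer.
t1_id, t2_id, mask_id = 101, 102, 103
sent_len = 12  # len(s1): length of the removed sentence in tokens

# Template-only input: T1 T2 [MASK] (the sentence <s1> is dropped).
d_input_ids = torch.tensor([[t1_id, t2_id, mask_id]])

# Position ids as if <s1> were still present:
# T1 -> 0, T2 -> 1 + len(s1), [MASK] -> 2 + len(s1)
d_position_ids = torch.tensor([[0, 1 + sent_len, 2 + sent_len]])

# The template bias is then the [MASK] hidden state of
# encoder(input_ids=d_input_ids, position_ids=d_position_ids).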

You can also refer to this issue: #11 (comment).


CSerxy commented Aug 20, 2022

Thank you for the quick response!

However, I did use the method you described and found it performed worse than a version using [CLS] to represent the template bias.

For now, my version using [MASK] gets 75.89, and the version using [CLS] gets 78.93.

I am curious whether you did any tests using [CLS] to represent the sentence bias. If so, what performance did you get? Many thanks!


kongds commented Aug 21, 2022

Hello, can you share your implementation?
Our implementation of template denoising is:

def get_delta(template_token, length=50):
    with torch.set_grad_enabled(not cls.model_args.mask_embedding_sentence_delta_freeze):
        device = input_ids.device
        d_input_ids = torch.Tensor(template_token).repeat(length, 1).to(device).long()
        if cls.model_args.mask_embedding_sentence_autoprompt:
            d_inputs_embeds = encoder.embeddings.word_embeddings(d_input_ids)
            p = torch.arange(d_input_ids.shape[1]).to(d_input_ids.device).view(1, -1)
            b = torch.arange(d_input_ids.shape[0]).to(d_input_ids.device)
            for i, k in enumerate(cls.dict_mbv):
                if cls.fl_mbv[i]:
                    index = ((d_input_ids == k) * p).max(-1)[1]
                else:
                    index = ((d_input_ids == k) * -p).min(-1)[1]
                #print(d_inputs_embeds[b,index][0].sum().item(), cls.p_mbv[i].sum().item())
                #print(d_inputs_embeds[b,index][0].mean().item(), cls.p_mbv[i].mean().item())
                d_inputs_embeds[b, index] = cls.p_mbv[i]
        else:
            d_inputs_embeds = None
        d_position_ids = torch.arange(d_input_ids.shape[1]).to(device).unsqueeze(0).repeat(length, 1).long()
        if not cls.model_args.mask_embedding_sentence_delta_no_position:
            d_position_ids[:, len(cls.bs)+1:] += torch.arange(length).to(device).unsqueeze(-1)
        m_mask = d_input_ids == cls.mask_token_id
        outputs = encoder(input_ids=d_input_ids if d_inputs_embeds is None else None,
                          inputs_embeds=d_inputs_embeds,
                          position_ids=d_position_ids, output_hidden_states=True, return_dict=True)
        last_hidden = outputs.last_hidden_state
        delta = last_hidden[m_mask]
        template_len = d_input_ids.shape[1]
        if cls.model_args.mask_embedding_sentence_org_mlp:
            delta = cls.mlp(delta)
        return delta, template_len

if cls.model_args.mask_embedding_sentence_delta:
    delta, template_len = get_delta([cls.mask_embedding_template])
    if len(cls.model_args.mask_embedding_sentence_different_template) > 0:
        delta1, template_len1 = get_delta([cls.mask_embedding_template2])

if cls.model_args.mask_embedding_sentence_org_mlp:
    pooler_output = cls.mlp(pooler_output)
if len(cls.model_args.mask_embedding_sentence_different_template) > 0:
    pooler_output = pooler_output.view(batch_size, num_sent, -1)
    attention_mask = attention_mask.view(batch_size, num_sent, -1)
    blen = attention_mask.sum(-1) - template_len
    pooler_output[:, 0, :] -= delta[blen[:, 0]]
    blen = attention_mask.sum(-1) - template_len1
    pooler_output[:, 1, :] -= delta1[blen[:, 1]]
    if num_sent == 3:
        pooler_output[:, 2, :] -= delta1[blen[:, 2]]
else:
    blen = attention_mask.sum(-1) - template_len
    pooler_output -= delta[blen]

Our results are from ./run.sh unsup-roberta $SEED.

As for [CLS], I have not tried using [CLS] for template denoising.


CSerxy commented Aug 21, 2022

Sure, and thanks for the details of your implementation. I had actually already read it, but I appreciate it anyway!

I attach the main part for computing template denoising below:

In my code, the original input is src_tokens (which is [s1] or [X] in your paper)

I first concatenate the bs and es together without adding [s1]:
template = torch.cat([self.bs1[aug].type_as(src_tokens), self.es1.type_as(src_tokens)], 0).repeat(batch_size, 1).to('cuda:0')
I also concatenate bs, the sentence tokens, and es to get the new token sequence:
new_src_tokens = torch.cat([self.bs1[aug].repeat(batch_size, 1).type_as(src_tokens), src_tokens[:, 1:-1], self.es1.repeat(batch_size, 1).type_as(src_tokens)], 1).to('cuda:0')

Then I get the position of [MASK] in the template:
template_mask = (template==self.mask).to('cuda:0')

Next, I call the sentence encoder to get the template representation. Note that here I only use new_src_tokens for computing the positional embeddings, which I will cover later. I use bs_length and x_length to get the index when I compute them:

template_features, extra = self.sentence_encoder(template,
                                                 last_state_only=not return_all_hiddens,
                                                 new_src_tokens=new_src_tokens,
                                                 bs_length=self.bs1[aug].size()[0],
                                                 x_length=src_tokens.size()[1] - 2 - truncated_len)

At last, I get the mask representation:
template_rep = template_features[-1].transpose(0, 1)[template_mask]


Inside self.sentence_encoder, I copy the main part below, which is the part that computes the positional embeddings:
The line below computes the positional embeddings given new_src_tokens; it actually only needs a sample of new_src_tokens and returns the positional embeddings:
a = self.embed_positions(new_src_tokens, positions=positions)
Next, we add the first part (the bs part) of the positional embeddings into x:
x[:, :bs_length, :] += a[:, :bs_length, :]
Next, we add the second part (the es part) of the positional embeddings into x:
x[:, bs_length:, :] += a[:, bs_length+x_length:, :]

For your reference, the original transformer in fairseq adds the positional embeddings in this way (given the input token sequence):
x += self.embed_positions(tokens, positions=positions)

That's the main part of the code. For all other parts, I use the default code in fairseq to compute attentions, segment embeddings, etc.

Thanks for your time and looking forward to your reply!!


CSerxy commented Aug 21, 2022

Indeed, I found one slight difference between my code and yours: my template has a space between the word 'means' and [MASK], but your version does not, according to https://github.com/kongds/Prompt-BERT/blob/main/train.py#L145

Do you think this slight difference would cause a performance difference?


kongds commented Aug 21, 2022

Hello,

I don't think the space will cause the performance difference.

For new_src_tokens, which comes from
new_src_tokens = torch.cat([self.bs1[aug].repeat(batch_size, 1).type_as(src_tokens), src_tokens[:, 1:-1], self.es1.repeat(batch_size, 1).type_as(src_tokens)], 1).to('cuda:0'), the length of src_tokens is not the same within a batch, so src_tokens[:, 1:-1] may contain <pad>. (The <pad> tokens make the position ids of es in new_src_tokens the same for every sentence, regardless of its actual length.)
For x_length, I think it should be (src_tokens != pad_token).sum(-1) - 2.
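For example, a small illustrative sketch of that calculation (the toy batch and variable names below are only for illustration, not from either codebase):

import torch

# Toy batch: two sentences of different length, right-padded with pad id 1.
pad_token = 1
src_tokens = torch.tensor([
    [0, 11, 12, 13, 2, 1, 1],    # <s> w w w </s> <pad> <pad>
    [0, 21, 22, 23, 24, 25, 2],  # <s> w w w w w </s>
])

# Per-sample sentence length excluding <pad>, <s> and </s>.
x_length = (src_tokens != pad_token).sum(-1) - 2
print(x_length)  # tensor([3, 5])

# The es tokens of sample i should then start at position bs_length + x_length[i],
# rather than at a fixed offset shared by the whole batch.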


CSerxy commented Aug 21, 2022

I see, that's a great point. I will remove the pad in src_tokens and train the model again.

Many thanks and enjoy the weekend!


kongds commented Aug 21, 2022

Thank you
Our new_src_tokens is from:

Prompt-BERT/train.py

Lines 674 to 699 in 8c0cb4c

bs = tokenizer.encode(model_args.mask_embedding_sentence_bs)[:-1]
es = tokenizer.encode(model_args.mask_embedding_sentence_es)[1:]  # remove cls or bos
if len(model_args.mask_embedding_sentence_different_template) > 0:
    bs2 = tokenizer.encode(model_args.mask_embedding_sentence_bs2)[:-1]
    es2 = tokenizer.encode(model_args.mask_embedding_sentence_es2)[1:]  # remove cls or bos
else:
    bs2, es2 = bs, es
sent_features = {'input_ids': [], 'attention_mask': []}
for i, s in enumerate(sentences):
    if i < total:
        s = tokenizer.encode(s, add_special_tokens=False)[:data_args.max_seq_length]
        sent_features['input_ids'].append(bs+s+es)
    elif i < 2*total:
        s = tokenizer.encode(s, add_special_tokens=False)[:data_args.max_seq_length]
        sent_features['input_ids'].append(bs2+s+es2)
    else:
        s = tokenizer.encode(s, add_special_tokens=False)[:data_args.max_seq_length]
        sent_features['input_ids'].append(bs2+s+es2)
ml = max([len(i) for i in sent_features['input_ids']])
for i in range(len(sent_features['input_ids'])):
    t = sent_features['input_ids'][i]
    sent_features['input_ids'][i] = t + [tokenizer.pad_token_id]*(ml-len(t))
    sent_features['attention_mask'].append(len(t)*[1] + (ml-len(t))*[0])


CSerxy commented Aug 21, 2022

I see. So basically you encode each sentence again, prepend bs, and append es. Because you use a for loop to do this one sentence at a time, [pad] will not appear before es in your case.

Can I ask a stupid question: it seems you add the [pad] tokens after collecting all the sentences.

However, do you add the pads (line 698) after the eos token? Is that the correct way to do it? An alternative is adding [pad] between '.' and [SEP]; which way should I use?


kongds commented Aug 21, 2022

The [SEP] is already in es and es2, so the [PAD] is after [SEP].

es = tokenizer.encode(model_args.mask_embedding_sentence_es)[1:] # remove cls or bos

Although using template.replace('*sent_0*', sentence) to insert the original sentence can avoid [PAD] appearing inside the sentence part, one problem is that the sentence may influence the template tokens during tokenizer.encode, which makes the template tokens different from the tokens used in template denoising.
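For illustration, a rough check of this tokenization mismatch (hypothetical template strings, not the exact ones in train.py):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

bs = tokenizer.encode("The sentence : '")[:-1]   # keep <s>, drop </s>
es = tokenizer.encode("' means <mask> .")[1:]    # drop <s>, keep </s>
sentence = "cats purr"

separate = bs + tokenizer.encode(sentence, add_special_tokens=False) + es
joined = tokenizer.encode("The sentence : '" + sentence + "' means <mask> .")

# With a BPE tokenizer, the tokens at the sentence/template boundary (e.g. the
# closing quote) may merge differently in `joined`, so the template token ids
# no longer line up with the bs/es used for template denoising.
print(separate == joined)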


CSerxy commented Aug 21, 2022

That makes sense. Thank you so much for your insightful answers!

I will let you know whether it works or not. And I could open-source your model in the fairseq framework once I finish my project.


CSerxy commented Aug 22, 2022

Hi,

Sorry to bother you again. I changed the way new_src_tokens and x_length are calculated accordingly. The performance of both [MASK] and [CLS] improved slightly: 75.89 -> 76.52 ([MASK]) and 78.54 -> 79.08 ([CLS]).

However, the best performance using [MASK] for template denoising still cannot reach the performance reported in your paper.

I am curious if you could help me check the code once you have time, many thanks!

batch_size, seq_len = src_tokens.size()[0], src_tokens.size()[1]
I first calculate the truncated_len if adding a template leads to the total length exceeding the max sequence length:
truncated_len = 0
if seq_len - 2 + self.template_len[aug] > self.max_seq_len:
    truncated_len = seq_len - 2 + self.template_len[aug] - self.max_seq_len

Next, I iterate each src_tokens[i] to get its new form (adding the template), i.e., new_src_tokens[i].
The first step is to get the end index of eos, in case some pads are at the end of src_tokens[i]:
for eos_idx in range(seq_len - 1, -1, -1):
    if src_tokens[i][eos_idx] == self.eos:
        break

Now we can get the pad length we shall add after the eos in new_src_tokens[i]:
pad_len = seq_len - 1 - eos_idx

Next, we discuss three situations:

  1. if pad_len == 0, then we don't have to add pad after new_src_tokens[i]:
    new_src_tokens.append(torch.cat((self.bs1[aug].type_as(src_tokens), src_tokens[i][1:eos_idx-truncated_len], self.es1.type_as(src_tokens)), 0))
    x_length.append(eos_idx - truncated_len - 1)

  2. if pad_len > 0 and truncated_len < pad_len, then we still need to add some pads but don't have to truncate x; the truncated part comes from the pads:
    new_src_tokens.append(torch.cat((self.bs1[aug].type_as(src_tokens), src_tokens[i][1:eos_idx], self.es1.type_as(src_tokens), self.pad_idx.type_as(src_tokens).repeat(pad_len - truncated_len)), 0))
    x_length.append(eos_idx - 1)

  3. if pad_len > 0 and truncated_len >= pad_len, then no pad is added after new_src_tokens[i] and, at the same time, some of x needs to be truncated:
    new_src_tokens.append(torch.cat((self.bs1[aug].type_as(src_tokens), src_tokens[i][1:eos_idx - (truncated_len - pad_len)], self.es1.type_as(src_tokens)), 0))
    x_length.append(eos_idx - (truncated_len - pad_len) - 1)

In this way, all new_src_tokens and each x_length[i] are re-computed.

I also rewrote the sentence encoder part that gets the template representation. I added one more parameter, es_length:
template_features, extra = self.sentence_encoder(template, last_state_only=not return_all_hiddens, new_src_tokens=new_src_tokens, bs_length=self.bs1[aug].size()[0], x_length=x_length, es_length=self.es1.size()[0])


Inside self.sentence_encoder, I first compute the positional embeddings from the shape of new_src_tokens:
a = self.embed_positions(new_src_tokens, positions=positions)
Next, we add the first part (the bs part) of the positional embeddings into x:
x[:, :bs_length, :] += a[:, :bs_length, :]
Next, we add the second part (the es part) of the positional embeddings into x; note that this is the line I changed compared to the last version:
for i in range(x.size()[0]):
    x[i, bs_length:, :] += a[0, bs_length+x_length[i]:bs_length+x_length[i]+es_length, :]

Thanks for your time and looking forward to your reply!!


kongds commented Aug 24, 2022

Hello,
Sorry for the late reply.

I don't see a problem with the calculation of new_src_tokens and x_length. (Not entirely sure, though, since I'm not familiar with fairseq.)
It is also weird that [CLS] for denoising can achieve 79.04, while [MASK] achieves 76.52.
But I can confirm that our code, using [MASK] for denoising, achieves the performance in our paper with bash run.sh unsup-roberta $SEED.


CSerxy commented Aug 24, 2022

Thanks for the help!

Yeah, performance differences between different frameworks do happen sometimes.

For example, my re-implemented SimCSE gets 77.45 in fairseq, while the original SimCSE in HuggingFace gets 76.57.

Anyway, your answers help me a lot. And I believe the [CLS] version, even with its slight difference from the [MASK] one, is still a good reproduction of PromptBERT.
