How to represent sentence in Template Denoising step? #13

Closed
CSerxy opened this issue Aug 20, 2022 · 16 comments

CSerxy commented Aug 20, 2022

Hi there,

I have recently been re-implementing your work in fairseq. Your model is really impressive.

I was able to reproduce your results in Table 8: with different templates, I get 78.41 on average (RoBERTa_base as the backbone model).

However, when I try to reproduce your default method, which uses different templates with template denoising, the highest score I can get is 78.54 (RoBERTa_base as the backbone model).

I tried using either 1) the [MASK] token's representation or 2) the [CLS] token's representation to represent the template in the template denoising step.

Can you clarify which representation you use as the template bias?

Many thanks!
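For concreteness, here is a minimal, hypothetical sketch (not from the Prompt-BERT repo; it assumes a HuggingFace RoBERTa encoder) of the two candidate "template bias" vectors being compared, using the template with the sentence slot left empty:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModel.from_pretrained("roberta-base")

# Template with the sentence slot left empty (the "denoising" input).
template_only = "The sentence : ' ' means <mask> ."
inputs = tokenizer(template_only, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Option 1: the <mask> token's hidden state.
mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0]
mask_bias = hidden[0, mask_pos].squeeze(0)

# Option 2: the <s> (CLS) token's hidden state.
cls_bias = hidden[0, 0]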


CSerxy commented Aug 20, 2022

Given the template The sentence : ‘[X]’ means [MASK] .

I understand that you use [MASK]'s vector to represent X.

I am curious which vector you used to represent the template bias, given The sentence : ‘’ means [MASK] . without [X].


CSerxy commented Aug 20, 2022

On my side, I tried both [MASK] and [CLS] to represent the template bias. The first performed badly, with a 75.89 average score, while the second reached 78.54.


kongds commented Aug 20, 2022

Thanks for your interest in our work.

For template denoising, we use the [MASK] representation, keeping the same position ids the template tokens would have in the full input.

For example, if we use a template like T1 <s1> T2 [MASK], where <s1> is the sentence and T1, T2 are template tokens, we feed BERT with T1:pos_id0 T2:pos_id(1+len(s1)) [MASK]:pos_id(2+len(s1)) to represent the bias of s1. (pos_id is the position id, len(s1) is the length of s1 in tokens.)
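To make the position-id layout concrete, a toy sketch (the token ids and sent_len below are made up, not from the repo):

import torch

# Toy token ids for T1, T2 and [MASK]; in practice they come from the tokenizer.
t1_id, t2_id, mask_id = 101, 102, 103
sent_len = 12  # len(s1): length of the removed sentence in tokens

# Template-only input: T1 T2 [MASK] (the sentence <s1> is dropped).
d_input_ids = torch.tensor([[t1_id, t2_id, mask_id]])

# Position ids as if <s1> were still present:
# T1 -> 0, T2 -> 1 + len(s1), [MASK] -> 2 + len(s1)
d_position_ids = torch.tensor([[0, 1 + sent_len, 2 + sent_len]])

# The template bias is then the [MASK] hidden state of
# encoder(input_ids=d_input_ids, position_ids=d_position_ids).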

You can also refer to this issue: #11 (comment).


CSerxy commented Aug 20, 2022

Thank you for the quick response!

However, I did use the method you described and found it performed worse than a version using [CLS] to represent the template bias.

For now, my version using [MASK] gets 75.89, and the version using [CLS] gets 78.93.

I am curious whether you did any tests using [CLS] to represent the sentence bias. If so, what performance did you get? Many thanks!


kongds commented Aug 21, 2022

Hello, can you share your implementation?
Our implementation of template denoising is:

def get_delta(template_token, length=50):
    with torch.set_grad_enabled(not cls.model_args.mask_embedding_sentence_delta_freeze):
        device = input_ids.device
        d_input_ids = torch.Tensor(template_token).repeat(length, 1).to(device).long()
        if cls.model_args.mask_embedding_sentence_autoprompt:
            d_inputs_embeds = encoder.embeddings.word_embeddings(d_input_ids)
            p = torch.arange(d_input_ids.shape[1]).to(d_input_ids.device).view(1, -1)
            b = torch.arange(d_input_ids.shape[0]).to(d_input_ids.device)
            for i, k in enumerate(cls.dict_mbv):
                if cls.fl_mbv[i]:
                    index = ((d_input_ids == k) * p).max(-1)[1]
                else:
                    index = ((d_input_ids == k) * -p).min(-1)[1]
                #print(d_inputs_embeds[b,index][0].sum().item(), cls.p_mbv[i].sum().item())
                #print(d_inputs_embeds[b,index][0].mean().item(), cls.p_mbv[i].mean().item())
                d_inputs_embeds[b, index] = cls.p_mbv[i]
        else:
            d_inputs_embeds = None
        d_position_ids = torch.arange(d_input_ids.shape[1]).to(device).unsqueeze(0).repeat(length, 1).long()
        if not cls.model_args.mask_embedding_sentence_delta_no_position:
            d_position_ids[:, len(cls.bs)+1:] += torch.arange(length).to(device).unsqueeze(-1)
        m_mask = d_input_ids == cls.mask_token_id
        outputs = encoder(input_ids=d_input_ids if d_inputs_embeds is None else None,
                          inputs_embeds=d_inputs_embeds,
                          position_ids=d_position_ids, output_hidden_states=True, return_dict=True)
        last_hidden = outputs.last_hidden_state
        delta = last_hidden[m_mask]
        template_len = d_input_ids.shape[1]
        if cls.model_args.mask_embedding_sentence_org_mlp:
            delta = cls.mlp(delta)
        return delta, template_len

if cls.model_args.mask_embedding_sentence_delta:
    delta, template_len = get_delta([cls.mask_embedding_template])
    if len(cls.model_args.mask_embedding_sentence_different_template) > 0:
        delta1, template_len1 = get_delta([cls.mask_embedding_template2])

if cls.model_args.mask_embedding_sentence_org_mlp:
    pooler_output = cls.mlp(pooler_output)
if len(cls.model_args.mask_embedding_sentence_different_template) > 0:
    pooler_output = pooler_output.view(batch_size, num_sent, -1)
    attention_mask = attention_mask.view(batch_size, num_sent, -1)
    blen = attention_mask.sum(-1) - template_len
    pooler_output[:, 0, :] -= delta[blen[:, 0]]
    blen = attention_mask.sum(-1) - template_len1
    pooler_output[:, 1, :] -= delta1[blen[:, 1]]
    if num_sent == 3:
        pooler_output[:, 2, :] -= delta1[blen[:, 2]]
else:
    blen = attention_mask.sum(-1) - template_len
    pooler_output -= delta[blen]

Our results are from ./run.sh unsup-roberta $SEED.

As for [CLS], I have not tried using [CLS] for template denoising.


CSerxy commented Aug 21, 2022

Sure, and thanks for the details of your implementation. I had actually already read it, but I appreciate it anyway!

I attach the main part for computing template denoising below:

In my code, the original input is src_tokens (which is [s1] or [X] in your paper)

I first concatenate the bs and es together without adding [s1]:
template = torch.cat([self.bs1[aug].type_as(src_tokens), self.es1.type_as(src_tokens)], 0).repeat(batch_size, 1).to('cuda:0')
I also concatenate bs, the sentence tokens, and es to get the new token sequence:
new_src_tokens = torch.cat([self.bs1[aug].repeat(batch_size, 1).type_as(src_tokens), src_tokens[:, 1:-1], self.es1.repeat(batch_size, 1).type_as(src_tokens)], 1).to('cuda:0')

Then I get the position of [MASK] in the template:
template_mask = (template==self.mask).to('cuda:0')

Next, I call the sentence encoder to get the template representation. Note that here I only use new_src_tokens for computing the positional embeddings, which I will cover later. I use bs_length and x_length to get the index when I compute them:

template_features, extra = self.sentence_encoder(template,
                                                 last_state_only=not return_all_hiddens,
                                                 new_src_tokens=new_src_tokens,
                                                 bs_length=self.bs1[aug].size()[0],
                                                 x_length=src_tokens.size()[1] - 2 - truncated_len)

At last, I get the mask representation:
template_rep = template_features[-1].transpose(0, 1)[template_mask]


Inside self.sentence_encoder, I copy the main part below, which is the part that computes the positional embeddings:
The line below computes the positional embeddings given new_src_tokens; it actually only needs a sample of new_src_tokens and returns the positional embeddings:
a = self.embed_positions(new_src_tokens, positions=positions)
Next, we add the first part (the bs part) of the positional embeddings into x:
x[:, :bs_length, :] += a[:, :bs_length, :]
Next, we add the second part (the es part) of the positional embeddings into x:
x[:, bs_length:, :] += a[:, bs_length+x_length:, :]

For your reference, the original transformer in fairseq adds the positional embeddings in this way (given the input token sequence):
x += self.embed_positions(tokens, positions=positions)

That's the main part of the code. For all other parts, I use the default code in fairseq to compute attentions, segment embeddings, etc.

Thanks for your time and looking forward to your reply!!


CSerxy commented Aug 21, 2022

Indeed, I found one slight difference between my code and yours: my template has a space between the word 'means' and [MASK], but your version does not, according to https://github.com/kongds/Prompt-BERT/blob/main/train.py#L145

Do you think this slight difference would cause a performance difference?


kongds commented Aug 21, 2022

Hello,

I don't think the space will cause the performance difference.

For new_src_tokens, which comes from
new_src_tokens = torch.cat([self.bs1[aug].repeat(batch_size, 1).type_as(src_tokens), src_tokens[:, 1:-1], self.es1.repeat(batch_size, 1).type_as(src_tokens)], 1).to('cuda:0'), the length of src_tokens is not the same within a batch, so src_tokens[:, 1:-1] may contain <pad>. (The <pad> tokens make the position ids of es in new_src_tokens the same for every sentence, regardless of its actual length.)
For x_length, I think it should be (src_tokens != pad_token).sum(-1) - 2.
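For example, a small illustrative sketch of that calculation (the toy batch and variable names below are only for illustration, not from either codebase):

import torch

# Toy batch: two sentences of different length, right-padded with pad id 1.
pad_token = 1
src_tokens = torch.tensor([
    [0, 11, 12, 13, 2, 1, 1],    # <s> w w w </s> <pad> <pad>
    [0, 21, 22, 23, 24, 25, 2],  # <s> w w w w w </s>
])

# Per-sample sentence length excluding <pad>, <s> and </s>.
x_length = (src_tokens != pad_token).sum(-1) - 2
print(x_length)  # tensor([3, 5])

# The es tokens of sample i should then start at position bs_length + x_length[i],
# rather than at a fixed offset shared by the whole batch.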


CSerxy commented Aug 21, 2022

I see, that's a great point. I will remove the pad in src_tokens and train the model again.

Many thanks and enjoy the weekend!


kongds commented Aug 21, 2022

Thank you
Our new_src_tokens is from:

Prompt-BERT/train.py

Lines 674 to 699 in 8c0cb4c

bs = tokenizer.encode(model_args.mask_embedding_sentence_bs)[:-1]
es = tokenizer.encode(model_args.mask_embedding_sentence_es)[1:]  # remove cls or bos
if len(model_args.mask_embedding_sentence_different_template) > 0:
    bs2 = tokenizer.encode(model_args.mask_embedding_sentence_bs2)[:-1]
    es2 = tokenizer.encode(model_args.mask_embedding_sentence_es2)[1:]  # remove cls or bos
else:
    bs2, es2 = bs, es
sent_features = {'input_ids': [], 'attention_mask': []}
for i, s in enumerate(sentences):
    if i < total:
        s = tokenizer.encode(s, add_special_tokens=False)[:data_args.max_seq_length]
        sent_features['input_ids'].append(bs+s+es)
    elif i < 2*total:
        s = tokenizer.encode(s, add_special_tokens=False)[:data_args.max_seq_length]
        sent_features['input_ids'].append(bs2+s+es2)
    else:
        s = tokenizer.encode(s, add_special_tokens=False)[:data_args.max_seq_length]
        sent_features['input_ids'].append(bs2+s+es2)
ml = max([len(i) for i in sent_features['input_ids']])
for i in range(len(sent_features['input_ids'])):
    t = sent_features['input_ids'][i]
    sent_features['input_ids'][i] = t + [tokenizer.pad_token_id]*(ml-len(t))
    sent_features['attention_mask'].append(len(t)*[1] + (ml-len(t))*[0])


CSerxy commented Aug 21, 2022

I see. So basically you encode each sentence again, prepend bs, and append es. Because you use a for loop to do this one sentence at a time, [pad] will not appear before es in your case.

Can I ask a stupid question: it seems you add the [pad] tokens after collecting all the sentences.

However, do you add the pads (line 698) after the eos token? Is that the correct way to do it? An alternative is adding [pad] between '.' and [SEP]; which way should I use?


kongds commented Aug 21, 2022

The [SEP] is already in es and es2, so the [PAD] is after [SEP].

es = tokenizer.encode(model_args.mask_embedding_sentence_es)[1:] # remove cls or bos

Although using template.replace('*sent_0*', sentence) to insert the original sentence can avoid [PAD] appearing inside the sentence part, one problem is that the sentence may influence the template tokens during tokenizer.encode, which makes the template tokens different from the tokens used in template denoising.
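For illustration, a rough check of this tokenization mismatch (hypothetical template strings, not the exact ones in train.py):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

bs = tokenizer.encode("The sentence : '")[:-1]   # keep <s>, drop </s>
es = tokenizer.encode("' means <mask> .")[1:]    # drop <s>, keep </s>
sentence = "cats purr"

separate = bs + tokenizer.encode(sentence, add_special_tokens=False) + es
joined = tokenizer.encode("The sentence : '" + sentence + "' means <mask> .")

# With a BPE tokenizer, the tokens at the sentence/template boundary (e.g. the
# closing quote) may merge differently in `joined`, so the template token ids
# no longer line up with the bs/es used for template denoising.
print(separate == joined)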


CSerxy commented Aug 21, 2022

That makes sense. Thank you so much for your insightful answers!

I will let you know whether it works or not. And I could open-source your model in the fairseq framework once I finish my project.


CSerxy commented Aug 22, 2022

Hi,

Sorry to bother you again. I changed the way new_src_tokens and x_length are calculated accordingly. The performance of both [MASK] and [CLS] improved slightly: 75.89 -> 76.52 ([MASK]) and 78.54 -> 79.08 ([CLS]).

However, the best performance using [MASK] for template denoising still cannot reach the performance reported in your paper.

I am curious if you could help me check the code once you have time, many thanks!

batch_size, seq_len = src_tokens.size()[0], src_tokens.size()[1]
I first calculate the truncated_len if adding a template leads to the total length exceeding the max sequence length:
truncated_len = 0
if seq_len - 2 + self.template_len[aug] > self.max_seq_len:
    truncated_len = seq_len - 2 + self.template_len[aug] - self.max_seq_len

Next, I iterate each src_tokens[i] to get its new form (adding the template), i.e., new_src_tokens[i].
The first step is to get the end index of eos, in case some pads are at the end of src_tokens[i]:
for eos_idx in range(seq_len - 1, -1, -1):
    if src_tokens[i][eos_idx] == self.eos:
        break

Now we can get the pad length we shall add after the eos in new_src_tokens[i]:
pad_len = seq_len - 1 - eos_idx

Next, we discuss three situations:

  1. if pad_len == 0, then we don't have to add pad after new_src_tokens[i]:
    new_src_tokens.append(torch.cat((self.bs1[aug].type_as(src_tokens), src_tokens[i][1:eos_idx-truncated_len], self.es1.type_as(src_tokens)), 0))
    x_length.append(eos_idx - truncated_len - 1)

  2. if pad_len > 0 and truncated_len < pad_len, then we still need to add some pads but don't have to truncate x; the truncated part comes from the pads:
    new_src_tokens.append(torch.cat((self.bs1[aug].type_as(src_tokens), src_tokens[i][1:eos_idx], self.es1.type_as(src_tokens), self.pad_idx.type_as(src_tokens).repeat(pad_len - truncated_len)), 0))
    x_length.append(eos_idx - 1)

  3. if pad_len > 0 and truncated_len >= pad_len, then no pad is added after new_src_tokens[i] and, at the same time, some of x needs to be truncated:
    new_src_tokens.append(torch.cat((self.bs1[aug].type_as(src_tokens), src_tokens[i][1:eos_idx - (truncated_len - pad_len)], self.es1.type_as(src_tokens)), 0))
    x_length.append(eos_idx - (truncated_len - pad_len) - 1)

In this way, all new_src_tokens and each x_length[i] are re-computed.

I also rewrote the sentence encoder part that gets the template representation. I added one more parameter, es_length:
template_features, extra = self.sentence_encoder(template, last_state_only=not return_all_hiddens, new_src_tokens=new_src_tokens, bs_length=self.bs1[aug].size()[0], x_length=x_length, es_length=self.es1.size()[0])


Inside self.sentence_encoder, I first compute the positional embeddings from the shape of new_src_tokens:
a = self.embed_positions(new_src_tokens, positions=positions)
Next, we add the first part (the bs part) of the positional embeddings into x:
x[:, :bs_length, :] += a[:, :bs_length, :]
Next, we add the second part (the es part) of the positional embeddings into x; note that this is the line I changed compared to the last version:
for i in range(x.size()[0]):
    x[i, bs_length:, :] += a[0, bs_length+x_length[i]:bs_length+x_length[i]+es_length, :]

Thanks for your time and looking forward to your reply!!


kongds commented Aug 24, 2022

Hello,
Sorry for the late reply.

I don't see a problem with the calculation of new_src_tokens and x_length. (Not entirely sure, though, since I'm not familiar with fairseq.)
It is also weird that [CLS] for denoising can achieve 79.04, while [MASK] achieves 76.52.
But I can confirm that our code, using [MASK] for denoising, achieves the performance in our paper with bash run.sh unsup-roberta $SEED.


CSerxy commented Aug 24, 2022

Thanks for the help!

Yeah, performance differences between different frameworks do happen sometimes.

For example, my re-implemented SimCSE gets 77.45 in fairseq, while the original SimCSE in HuggingFace gets 76.57.

Anyway, your answers help me a lot. And I believe the [CLS] version, even with its slight difference from the [MASK] one, is still a good reproduction of PromptBERT.
