
Using padded tokens when creating averaged sentence embeddings #10

Open
AndrewLim1990 opened this issue Mar 14, 2022 · 0 comments

When calculating the similarity loss between two sentences, it looks like we are using the averaged word embeddings per sentence. Within models.SDR.similarity_modeling.SimilarityModeling we have the following:

...
non_masked_outputs = self.roberta(
    non_masked_input_ids,
    attention_mask=attention_mask,
    token_type_ids=token_type_ids,
    position_ids=position_ids,
    head_mask=head_mask,
    inputs_embeds=inputs_embeds,
    output_hidden_states=output_hidden_states,
    return_dict=return_dict,
)
non_masked_seq_out = non_masked_outputs[0]

# mean over the sequence dimension; padded positions are included in the average
meaned_sentences = non_masked_seq_out.mean(1)
miner_output = list(self.miner_func(meaned_sentences, sample_labels))

sim_loss = self.similarity_loss_func(meaned_sentences, sample_labels, miner_output)
...

It appears we are also using the embeddings of the padded tokens, since the sentence lengths (or the attention mask) aren't taken into account when averaging. Was this done by design, perhaps?
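For reference, a mask-aware average could look something like the sketch below. This is only an illustration of the idea, not code from the repository; masked_mean is a hypothetical helper, and it assumes attention_mask is the usual 0/1 tensor of shape (batch, seq_len) that is already passed to self.roberta above.

import torch

def masked_mean(seq_out: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # seq_out: (batch, seq_len, hidden); attention_mask: (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).type_as(seq_out)   # (batch, seq_len, 1)
    summed = (seq_out * mask).sum(dim=1)                   # sum embeddings of real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)               # number of real tokens per sentence
    return summed / counts                                 # (batch, hidden)

# e.g. meaned_sentences = masked_mean(non_masked_seq_out, attention_mask)

With something like this, padded positions would contribute nothing to the sentence embedding, and each sentence would be averaged over its own length.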
