Why sometimes the target attribution can be 'nan'? #268

Closed · 1 task done
frankdarkluo opened this issue Apr 27, 2024 · 7 comments
Labels: question (Further information is requested)

frankdarkluo commented Apr 27, 2024

Question

When I load my decoder-only language model and analyze the target_attributions with the attention-based method, some tokens get 'nan' values.

My code is basically

import inseq
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'huggyllama/llama-30b'

model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map='cuda',
    torch_dtype=torch.float16,
)
toker = AutoTokenizer.from_pretrained(model_name)

question = (
"The following are multiple choice questions: You should directly answer the question by choosing the correct option.\n\nQuestion: The morning temperature in a city is 41\u00b0F. If a sunny, mild day is forecast, which temperature is most likely for 2:00 p.m.?\nOptions:\nC. 32\u00b0 F\nD. 41\u00b0 F\nA. 78\u00b0 F\nB. 98\u00b0 F\nAnswer: ")

qa_model = inseq.load_model(model, "attention", tokenizer=model_name, tokenizer_kwargs={"legacy": False})

out=qa_model.attribute(
    question,
    generation_args={"max_new_tokens": 1, 'do_sample':False})

out_agg = out.aggregate(normalise=True)
print(out_agg.sequence_attributions[0].target_attributions)

The printed result is:

tensor([[9.5070e-01],
[2.8751e-03],
[7.9385e-04],
[ nan],
[ nan],
[ nan],
[ nan],
[1.0483e-03],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[1.0087e-02],
[2.0023e-03],
[2.9738e-03],
[ nan],
[1.1513e-03],
[1.2865e-03],
...

I wonder:

  1. Does it mean the language model does not give attention to those tokens at all?
  2. If it does attend to them, is there a threshold below which the attribution is set to 'nan'? Where is the specific code that sets attributions to 'nan'?

Additional context

Checklist

  • I've searched the project's issues.
frankdarkluo added the question (Further information is requested) label Apr 27, 2024
gsarti (Member) commented Apr 27, 2024

Hi @frankdarkluo,

Nans are used in the target attribution tensor to mark positions that are not used for the current prediction step due to the model's causal attention mask. That said, in the case above the single attribution step should not produce any nans in the scores. Could you try to reproduce this on another CausalLM model like gpt-2 or Qwen/Qwen1.5-0.5B-Chat and see if you get nans for those too?
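
For illustration only (a minimal sketch, not Inseq's actual code), this is how causally masked positions of a toy per-step attention matrix could end up marked as nan:

import torch

# Toy sketch: at step i, only positions <= i are attended to under a causal mask;
# the remaining entries carry no signal and can be marked as nan.
T = 5
scores = torch.rand(T, T)                                # toy per-step attention scores
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, 0.0)
scores = scores / scores.sum(dim=-1, keepdim=True)       # rows sum to 1 over valid positions
marked = scores.masked_fill(~causal_mask, float("nan"))  # nan = position unused at that step
print(marked)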

frankdarkluo (Author)

Thank you @gsarti.

I tried 'gpt2-large', and the result is even worse: more 'nan' values appear.

tensor([[0.6928],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
...
[0.0141],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
...

frankdarkluo (Author)

When I use 'Qwen/Qwen1.5-0.5B-Chat', another error appears while running the attribution.

The error happens when I do

out=qa_model.attribute(
    question,
    # generation_args={"max_new_tokens": 1, 'do_sample':False},
)

ValueError Traceback (most recent call last)
Cell In[6], line 4
1 print(question)
3 qa_model = inseq.load_model(model, "attention", tokenizer=toker)
----> 4 out=qa_model.attribute(
5 question,
6 generation_args={"max_new_tokens": 1, 'do_sample':False},
7 # n_steps=500,
8 # internal_batch_size=50
9 )
10 # out.sequence_attributions[0].target_attributions[0,:]=0
11 out.show(normalize=True)

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/attribution_model.py:421, in AttributionModel.attribute(self, input_texts, generated_texts, method, override_default_attribution, attr_pos_start, attr_pos_end, show_progress, pretty_progress, output_step_attributions, attribute_target, step_scores, include_eos_baseline, attributed_fn, device, batch_size, generate_from_target_prefix, generation_args, **kwargs)
419 # If constrained decoding is not enabled, output texts are generated from input texts.
420 if not has_generated_texts or generate_from_target_prefix:
--> 421 encoded_input = self.encode(input_texts, return_baseline=True, include_eos_baseline=include_eos_baseline)
422 if generate_from_target_prefix:
423 decoder_input = self.encode(generated_texts, as_targets=True)

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py:266, in HuggingfaceModel.encode(self, texts, as_targets, return_baseline, include_eos_baseline, add_bos_token, add_special_tokens)
264 if as_targets and not self.is_encoder_decoder:
265 raise ValueError("Decoder-only models should use tokenization as source only.")
--> 266 batch = self.tokenizer(
267 text=texts if not as_targets else None,
268 text_target=texts if as_targets else None,
269 add_special_tokens=add_special_tokens,
270 padding=True,
271 truncation=True,
272 return_tensors="pt",
273 ).to(self.device)
274 baseline_ids = None
275 if return_baseline:

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2829, in PreTrainedTokenizerBase.call(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2827 if not self._in_target_context_manager:
2828 self._switch_to_input_mode()
-> 2829 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
2830 if text_target is not None:
2831 self._switch_to_target_mode()

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2915, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2910 raise ValueError(
2911 f"batch length of text: {len(text)} does not match batch length of text_pair:"
2912 f" {len(text_pair)}."
2913 )
2914 batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 2915 return self.batch_encode_plus(
2916 batch_text_or_text_pairs=batch_text_or_text_pairs,
2917 add_special_tokens=add_special_tokens,
2918 padding=padding,
2919 truncation=truncation,
2920 max_length=max_length,
2921 stride=stride,
2922 is_split_into_words=is_split_into_words,
2923 pad_to_multiple_of=pad_to_multiple_of,
2924 return_tensors=return_tensors,
2925 return_token_type_ids=return_token_type_ids,
2926 return_attention_mask=return_attention_mask,
2927 return_overflowing_tokens=return_overflowing_tokens,
2928 return_special_tokens_mask=return_special_tokens_mask,
2929 return_offsets_mapping=return_offsets_mapping,
2930 return_length=return_length,
2931 verbose=verbose,
2932 **kwargs,
2933 )
2934 else:
2935 return self.encode_plus(
2936 text=text,
2937 text_pair=text_pair,
(...)
2953 **kwargs,
2954 )

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:3097, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
3080 """
3081 Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
3082
(...)
3093 details in encode_plus).
3094 """
3096 # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
-> 3097 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
3098 padding=padding,
3099 truncation=truncation,
3100 max_length=max_length,
3101 pad_to_multiple_of=pad_to_multiple_of,
3102 verbose=verbose,
3103 **kwargs,
3104 )
3106 return self._batch_encode_plus(
3107 batch_text_or_text_pairs=batch_text_or_text_pairs,
3108 add_special_tokens=add_special_tokens,
(...)
3123 **kwargs,
3124 )

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2734, in PreTrainedTokenizerBase._get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
2732 # Test if we have a padding token
2733 if padding_strategy != PaddingStrategy.DO_NOT_PAD and (self.pad_token is None or self.pad_token_id < 0):
-> 2734 raise ValueError(
2735 "Asking to pad but the tokenizer does not have a padding token. "
2736 "Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) "
2737 "or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})."
2738 )
2740 # Check that we will truncate to a multiple of pad_to_multiple_of if both are provided
2741 if (
2742 truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE
2743 and padding_strategy != PaddingStrategy.DO_NOT_PAD
(...)
2746 and (max_length % pad_to_multiple_of != 0)
2747 ):

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

Even though I have explicitly set

toker.pad_token = toker.eos_token
toker.pad_token_id = toker.eos_token_id

the same error still happens. How should I fix this? I would greatly appreciate your help!
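
As a side check (a sketch, not a confirmed fix), one could verify that the pad token actually reaches the tokenizer instance Inseq uses internally, since the error is raised from self.tokenizer inside HuggingfaceModel.encode:

import inseq

# Sketch, reusing `model` and `toker` from the snippet above: set the pad token
# on the tokenizer instance that is passed to Inseq, then confirm it is visible
# on the tokenizer Inseq actually calls (HuggingfaceModel.encode uses self.tokenizer).
toker.pad_token = toker.eos_token
toker.pad_token_id = toker.eos_token_id

qa_model = inseq.load_model(model, "attention", tokenizer=toker)
print(qa_model.tokenizer.pad_token, qa_model.tokenizer.pad_token_id)  # should not be None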

gsarti (Member) commented Apr 29, 2024

Hi @frankdarkluo, thanks for the follow-up!

Indeed, the attention method had an issue where 0s in the tensor were set to nan even when they were not in causal masking positions. I opened PR #269 to address it; let me know if it works for you!

Re: the usage of Qwen, this works for me on #269:

import inseq
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name='Qwen/Qwen1.5-0.5B-Chat'

model = AutoModelForCausalLM.from_pretrained(
        model_name, #device_map='cuda',
        #torch_dtype=torch.float16,
    )
toker=AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "The following are multiple choice questions: You should directly answer the question by choosing the correct option."},
    {"role": "user", "content": "Question: The morning temperature in a city is 41\u00b0F. If a sunny, mild day is forecast, which temperature is most likely for 2:00 p.m.?\nOptions:\nC. 32\u00b0 F\nD. 41\u00b0 F\nA. 78\u00b0 F\nB. 98\u00b0 F\nAnswer:"}
]
question = toker.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(question)
qa_model = inseq.load_model(model, "attention", tokenizer=model_name, tokenizer_kwargs={"legacy": False})
out=qa_model.attribute(
    question,
    generation_args={"max_new_tokens": 20, 'do_sample':False, "skip_special_tokens": False})

out_agg = out.aggregate(normalise=True)
out_agg.show(do_aggregation=False)

The only notable adjustments are the use of the chat template and skip_special_tokens=False, which prevents special template tokens from being removed; without it, Inseq raises an error about a mismatch between the generation prefix and the original input.

frankdarkluo (Author) commented Apr 29, 2024

Thanks @gsarti!

I have tried the open PR #269 and it works for me, solving both of my concerns!

gsarti closed this as completed Apr 29, 2024
frankdarkluo (Author)

In addition, I noticed an interesting phenomenon: the attribution scores for the source tokens are no longer 'nan' but small, meaningful values. Is it because there was previously some precision threshold below which these very small values became 'nan'?

tensor([[7.5076e-01],
[5.6619e-03],
[7.4188e-04],
[6.0200e-04],
[6.4065e-04],
[1.2943e-03],
[9.0566e-04],
[9.5552e-04],
[6.3531e-04],
[5.2958e-04],
[5.6214e-04],
[6.5459e-04],
[1.1314e-03],
[5.6269e-04],
[5.3682e-04],
[5.1956e-04],
[7.7899e-04],
[4.5886e-04],
[7.8914e-04],
[1.4563e-03],
[1.8974e-03],
[2.6545e-03],
[1.6023e-03],
[1.1958e-03],

gsarti (Member) commented Apr 30, 2024

Hey @frankdarkluo, yes, it was being rounded after the fourth decimal!
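
To illustrate how these two behaviours could combine (a rough sketch with a hypothetical 3e-5 score, not Inseq's actual implementation): a value that rounds to 0 at four decimals would then hit the pre-#269 rule that mapped 0 entries to nan.

import torch

# Rough sketch with made-up values (not Inseq's code): rounding to four decimals
# sends very small scores to 0, and the pre-#269 behaviour then treated 0 entries
# as masked positions, i.e. nan.
scores = torch.tensor([7.5076e-01, 2.6545e-03, 3.0000e-05])
rounded = torch.round(scores, decimals=4)                  # last value rounds to 0
pre_fix = torch.where(rounded == 0, torch.tensor(float("nan")), rounded)
print(pre_fix)  # tensor([0.7508, 0.0027,    nan])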
