Why sometimes the target attribution can be 'nan'? #268

Closed · 1 task done
frankdarkluo opened this issue Apr 27, 2024 · 7 comments
Labels: question (Further information is requested)

frankdarkluo commented Apr 27, 2024

Question

When I load my decoder-only language model and analyze the target_attributions with the attention-based method, some tokens get 'nan' values.

My code is basically

import inseq
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'huggyllama/llama-30b'

model = AutoModelForCausalLM.from_pretrained(
    model_name, device_map='cuda',
    torch_dtype=torch.float16,
)
toker = AutoTokenizer.from_pretrained(model_name)

question = (
"The following are multiple choice questions: You should directly answer the question by choosing the correct option.\n\nQuestion: The morning temperature in a city is 41\u00b0F. If a sunny, mild day is forecast, which temperature is most likely for 2:00 p.m.?\nOptions:\nC. 32\u00b0 F\nD. 41\u00b0 F\nA. 78\u00b0 F\nB. 98\u00b0 F\nAnswer: ")

qa_model = inseq.load_model(model, "attention", tokenizer=model_name, tokenizer_kwargs={"legacy": False})

out=qa_model.attribute(
    question,
    generation_args={"max_new_tokens": 1, 'do_sample':False})

out_agg = out.aggregate(normalise=True)
print(out_agg.sequence_attributions[0].target_attributions)

The printed result is:

tensor([[9.5070e-01],
[2.8751e-03],
[7.9385e-04],
[ nan],
[ nan],
[ nan],
[ nan],
[1.0483e-03],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[1.0087e-02],
[2.0023e-03],
[2.9738e-03],
[ nan],
[1.1513e-03],
[1.2865e-03],
...

I wonder:

  1. Does it mean the language model does not give attention to those tokens at all?
  2. If it does attend to them, is there a threshold below which the attribution is set to 'nan'? Where is the specific code that sets attributions to 'nan'?

Additional context

Checklist

  • I've searched the project's issues.
frankdarkluo added the question (Further information is requested) label Apr 27, 2024
gsarti (Member) commented Apr 27, 2024

Hi @frankdarkluo,

Nans are used in the target attribution tensor to mark positions that are not used for the current prediction step due to the model's causal attention mask. That said, in the case above the single attribution step should not produce any nans in the scores. Could you try to reproduce this on another CausalLM model like gpt-2 or Qwen/Qwen1.5-0.5B-Chat and see if you get nans for those too?
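
For illustration only (a minimal sketch, not Inseq's actual code), this is how causally masked positions of a toy per-step attention matrix could end up marked as nan:

import torch

# Toy sketch: at step i, only positions <= i are attended to under a causal mask;
# the remaining entries carry no signal and can be marked as nan.
T = 5
scores = torch.rand(T, T)                                # toy per-step attention scores
causal_mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = scores.masked_fill(~causal_mask, 0.0)
scores = scores / scores.sum(dim=-1, keepdim=True)       # rows sum to 1 over valid positions
marked = scores.masked_fill(~causal_mask, float("nan"))  # nan = position unused at that step
print(marked)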

frankdarkluo (Author)

Thank you @gsarti.

I tried 'gpt2-large', and the result is even worse: more 'nan' values appear.

tensor([[0.6928],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
...
[0.0141],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
[ nan],
...

frankdarkluo (Author)

When I use 'Qwen/Qwen1.5-0.5B-Chat', another error appears while running the attribution.

The error happens when I do

out=qa_model.attribute(
    question,
    # generation_args={"max_new_tokens": 1, 'do_sample':False},
)

ValueError Traceback (most recent call last)
Cell In[6], line 4
1 print(question)
3 qa_model = inseq.load_model(model, "attention", tokenizer=toker)
----> 4 out=qa_model.attribute(
5 question,
6 generation_args={"max_new_tokens": 1, 'do_sample':False},
7 # n_steps=500,
8 # internal_batch_size=50
9 )
10 # out.sequence_attributions[0].target_attributions[0,:]=0
11 out.show(normalize=True)

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/attribution_model.py:421, in AttributionModel.attribute(self, input_texts, generated_texts, method, override_default_attribution, attr_pos_start, attr_pos_end, show_progress, pretty_progress, output_step_attributions, attribute_target, step_scores, include_eos_baseline, attributed_fn, device, batch_size, generate_from_target_prefix, generation_args, **kwargs)
419 # If constrained decoding is not enabled, output texts are generated from input texts.
420 if not has_generated_texts or generate_from_target_prefix:
--> 421 encoded_input = self.encode(input_texts, return_baseline=True, include_eos_baseline=include_eos_baseline)
422 if generate_from_target_prefix:
423 decoder_input = self.encode(generated_texts, as_targets=True)

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/inseq/models/huggingface_model.py:266, in HuggingfaceModel.encode(self, texts, as_targets, return_baseline, include_eos_baseline, add_bos_token, add_special_tokens)
264 if as_targets and not self.is_encoder_decoder:
265 raise ValueError("Decoder-only models should use tokenization as source only.")
--> 266 batch = self.tokenizer(
267 text=texts if not as_targets else None,
268 text_target=texts if as_targets else None,
269 add_special_tokens=add_special_tokens,
270 padding=True,
271 truncation=True,
272 return_tensors="pt",
273 ).to(self.device)
274 baseline_ids = None
275 if return_baseline:

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2829, in PreTrainedTokenizerBase.call(self, text, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2827 if not self._in_target_context_manager:
2828 self._switch_to_input_mode()
-> 2829 encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
2830 if text_target is not None:
2831 self._switch_to_target_mode()

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2915, in PreTrainedTokenizerBase._call_one(self, text, text_pair, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2910 raise ValueError(
2911 f"batch length of text: {len(text)} does not match batch length of text_pair:"
2912 f" {len(text_pair)}."
2913 )
2914 batch_text_or_text_pairs = list(zip(text, text_pair)) if text_pair is not None else text
-> 2915 return self.batch_encode_plus(
2916 batch_text_or_text_pairs=batch_text_or_text_pairs,
2917 add_special_tokens=add_special_tokens,
2918 padding=padding,
2919 truncation=truncation,
2920 max_length=max_length,
2921 stride=stride,
2922 is_split_into_words=is_split_into_words,
2923 pad_to_multiple_of=pad_to_multiple_of,
2924 return_tensors=return_tensors,
2925 return_token_type_ids=return_token_type_ids,
2926 return_attention_mask=return_attention_mask,
2927 return_overflowing_tokens=return_overflowing_tokens,
2928 return_special_tokens_mask=return_special_tokens_mask,
2929 return_offsets_mapping=return_offsets_mapping,
2930 return_length=return_length,
2931 verbose=verbose,
2932 **kwargs,
2933 )
2934 else:
2935 return self.encode_plus(
2936 text=text,
2937 text_pair=text_pair,
(...)
2953 **kwargs,
2954 )

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:3097, in PreTrainedTokenizerBase.batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
3080 """
3081 Tokenize and prepare for the model a list of sequences or a list of pairs of sequences.
3082
(...)
3093 details in encode_plus).
3094 """
3096 # Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
-> 3097 padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
3098 padding=padding,
3099 truncation=truncation,
3100 max_length=max_length,
3101 pad_to_multiple_of=pad_to_multiple_of,
3102 verbose=verbose,
3103 **kwargs,
3104 )
3106 return self._batch_encode_plus(
3107 batch_text_or_text_pairs=batch_text_or_text_pairs,
3108 add_special_tokens=add_special_tokens,
(...)
3123 **kwargs,
3124 )

File /opt/anaconda3/envs/tuned-lens/lib/python3.9/site-packages/transformers/tokenization_utils_base.py:2734, in PreTrainedTokenizerBase._get_padding_truncation_strategies(self, padding, truncation, max_length, pad_to_multiple_of, verbose, **kwargs)
2732 # Test if we have a padding token
2733 if padding_strategy != PaddingStrategy.DO_NOT_PAD and (self.pad_token is None or self.pad_token_id < 0):
-> 2734 raise ValueError(
2735 "Asking to pad but the tokenizer does not have a padding token. "
2736 "Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) "
2737 "or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'})."
2738 )
2740 # Check that we will truncate to a multiple of pad_to_multiple_of if both are provided
2741 if (
2742 truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE
2743 and padding_strategy != PaddingStrategy.DO_NOT_PAD
(...)
2746 and (max_length % pad_to_multiple_of != 0)
2747 ):

ValueError: Asking to pad but the tokenizer does not have a padding token. Please select a token to use as pad_token (tokenizer.pad_token = tokenizer.eos_token e.g.) or add a new pad token via tokenizer.add_special_tokens({'pad_token': '[PAD]'}).

Even though I have explicitly set

toker.pad_token = toker.eos_token
toker.pad_token_id = toker.eos_token_id

the same error still happens. How should I fix this? I would greatly appreciate your help!
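
As a side check (a sketch, not a confirmed fix), one could verify that the pad token actually reaches the tokenizer instance Inseq uses internally, since the error is raised from self.tokenizer inside HuggingfaceModel.encode:

import inseq

# Sketch, reusing `model` and `toker` from the snippet above: set the pad token
# on the tokenizer instance that is passed to Inseq, then confirm it is visible
# on the tokenizer Inseq actually calls (HuggingfaceModel.encode uses self.tokenizer).
toker.pad_token = toker.eos_token
toker.pad_token_id = toker.eos_token_id

qa_model = inseq.load_model(model, "attention", tokenizer=toker)
print(qa_model.tokenizer.pad_token, qa_model.tokenizer.pad_token_id)  # should not be None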

gsarti (Member) commented Apr 29, 2024

Hi @frankdarkluo, thanks for the follow-up!

Indeed, the attention method had an issue where 0s in the tensor were set to nan even when they were not in causal masking positions. I opened PR #269 to address it; let me know if it works for you!

Re: the usage of Qwen, this works for me on #269:

import inseq
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name='Qwen/Qwen1.5-0.5B-Chat'

model = AutoModelForCausalLM.from_pretrained(
        model_name, #device_map='cuda',
        #torch_dtype=torch.float16,
    )
toker=AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "system", "content": "The following are multiple choice questions: You should directly answer the question by choosing the correct option."},
    {"role": "user", "content": "Question: The morning temperature in a city is 41\u00b0F. If a sunny, mild day is forecast, which temperature is most likely for 2:00 p.m.?\nOptions:\nC. 32\u00b0 F\nD. 41\u00b0 F\nA. 78\u00b0 F\nB. 98\u00b0 F\nAnswer:"}
]
question = toker.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
print(question)
qa_model = inseq.load_model(model, "attention", tokenizer=model_name, tokenizer_kwargs={"legacy": False})
out=qa_model.attribute(
    question,
    generation_args={"max_new_tokens": 20, 'do_sample':False, "skip_special_tokens": False})

out_agg = out.aggregate(normalise=True)
out_agg.show(do_aggregation=False)

The only notable adjustments are the use of the chat template and skip_special_tokens=False, which prevents special template tokens from being removed; without it, Inseq raises an error about a mismatch between the generation prefix and the original input.

frankdarkluo (Author) commented Apr 29, 2024

Thanks @gsarti!

I have tried the open PR #269 and it works for me, solving both of my concerns!

gsarti closed this as completed Apr 29, 2024
frankdarkluo (Author)

In addition, I noticed an interesting phenomenon: the attribution scores for the source tokens are no longer 'nan' but small, meaningful values. Is it because there was previously some precision threshold below which these very small values became 'nan'?

tensor([[7.5076e-01],
[5.6619e-03],
[7.4188e-04],
[6.0200e-04],
[6.4065e-04],
[1.2943e-03],
[9.0566e-04],
[9.5552e-04],
[6.3531e-04],
[5.2958e-04],
[5.6214e-04],
[6.5459e-04],
[1.1314e-03],
[5.6269e-04],
[5.3682e-04],
[5.1956e-04],
[7.7899e-04],
[4.5886e-04],
[7.8914e-04],
[1.4563e-03],
[1.8974e-03],
[2.6545e-03],
[1.6023e-03],
[1.1958e-03],

gsarti (Member) commented Apr 30, 2024

Hey @frankdarkluo, yes, it was being rounded after the fourth decimal!
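
To illustrate how these two behaviours could combine (a rough sketch with a hypothetical 3e-5 score, not Inseq's actual implementation): a value that rounds to 0 at four decimals would then hit the pre-#269 rule that mapped 0 entries to nan.

import torch

# Rough sketch with made-up values (not Inseq's code): rounding to four decimals
# sends very small scores to 0, and the pre-#269 behaviour then treated 0 entries
# as masked positions, i.e. nan.
scores = torch.tensor([7.5076e-01, 2.6545e-03, 3.0000e-05])
rounded = torch.round(scores, decimals=4)                  # last value rounds to 0
pre_fix = torch.where(rounded == 0, torch.tensor(float("nan")), rounded)
print(pre_fix)  # tensor([0.7508, 0.0027,    nan])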
