Warning: Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. #539

Open
artkpv opened this issue May 23, 2024 · 0 comments


System Info

Cuda 12.1
PyTorch 2.3.0
Python 3.11

Thu May 23 15:30:20 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.06              Driver Version: 545.23.06    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          Off | 00000000:0F:00.0 Off |                    0 |
| N/A   37C    P0              61W / 400W |      4MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Thanks for the open-source model. I initialize Llama 3 70B following the recipe for local inference. However, when I run inference I see these warnings:

Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

I took max_length=None from the recipe. Also, the HF documentation (linked here) advises calling tokenizer(batch_sentences, padding='max_length', truncation=True) without passing max_length, which then defaults to None. However, the model does not provide a predefined maximum length, so how should max_length be set?
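
For context, below is a minimal sketch of what I would expect to avoid the warning. The model id and the 8192 limit are placeholders I chose for illustration, not values taken from the recipe:

    # Sketch only: the model id and the 8192 limit are placeholders.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
    tokenizer.pad_token = tokenizer.eos_token
    prompt = "Hello, world"  # stand-in for the chat-templated prompt below

    # Option 1: provide an explicit limit so padding and truncation have a target
    batch = tokenizer(prompt, padding="max_length", truncation=True,
                      max_length=8192, return_tensors="pt")

    # Option 2: pad only to the longest sequence in the batch
    batch = tokenizer(prompt, padding=True, return_tensors="pt")

Neither option matches the recipe's max_length=None, which is why I am asking what the intended setting is.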

My code:

        self._model = AutoModelForCausalLM.from_pretrained(
            llm,
            return_dict=True,
            load_in_8bit=llm_kwargs["load_in_8bit"],
            load_in_4bit=llm_kwargs["load_in_4bit"],
            device_map="auto",
            low_cpu_mem_usage=True,
            attn_implementation="sdpa" if llm_kwargs.get("use_fast_kernels", False) else None,
            torch_dtype=torch.bfloat16
        )
        self._model.eval()

        tokenizer = AutoTokenizer.from_pretrained(self._llm)
        prompt = tokenizer.apply_chat_template(
            prompt, tokenize=False, add_generation_prompt=True
        )
        tokenizer.pad_token = tokenizer.eos_token
        batch = tokenizer(
            prompt,
            padding='max_length', 
            truncation=True, 
            max_length=None,
            return_tensors="pt"
        )
        batch = {k: v.to("cuda") for k, v in batch.items()}
        outputs = self._model.generate(
            **batch,
            **self._gen_kwargs,
        )
        # Take only response:
        outputs = outputs[0][batch['input_ids'][0].size(0):]
        response = tokenizer.decode(outputs, skip_special_tokens=True)

Error logs

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
> Asking to pad to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no padding.
> Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
> Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
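
For what it's worth, the last message above seems to go away if pad_token_id is passed explicitly to generate(); this is my own workaround for that one line, not something from the recipe, and it does not affect the padding/truncation warnings:

    # Variant of the generate() call above; passing pad_token_id explicitly
    # silences only the "Setting `pad_token_id` to `eos_token_id`" message.
    outputs = self._model.generate(
        **batch,
        pad_token_id=tokenizer.eos_token_id,
        **self._gen_kwargs,
    )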

Expected behavior

No warning expected.
