Install the latest transformers. I tried both the latest main and the v4.31-release branches, with accelerate==0.21.0 and the latest accelerate (the versions used in the blog); the problem remained the same.
cd $HOME
# git clone -b v4.31-release https://github.com/huggingface/transformers.git
git clone https://github.com/huggingface/transformers.git
cd transformers
# For Python 3.8
pip install -e .
pip install datasets evaluate scikit-learn accelerate py7zr
Prepare llama2_fsdp_config.json and copy it to the home folder. Log in with an HF token (e.g. via huggingface-cli login).
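For reference, here is a minimal sketch of what llama2_fsdp_config.json might contain. The keys follow the Trainer docs for PyTorch/XLA FSDP (second link below); the exact values, and LlamaDecoderLayer as the wrap class, are assumptions since the actual config from this run is not reproduced here. The file is then passed to the Trainer via --fsdp "full_shard" --fsdp_config llama2_fsdp_config.json.

import json

# Hypothetical llama2_fsdp_config.json contents; values are assumptions,
# not the exact config used in this run.
fsdp_config = {
    "fsdp_transformer_layer_cls_to_wrap": ["LlamaDecoderLayer"],
    "xla": True,
    "xla_fsdp_settings": {},
    "xla_fsdp_grad_ckpt": True,
}
with open("llama2_fsdp_config.json", "w") as f:
    json.dump(fsdp_config, f, indent=2)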
On a CUDA device, load the fine-tuned model and run inference:
import os

import torch
import transformers
from transformers import AutoTokenizer

# Path to the fine-tuned checkpoint; "~" must be expanded explicitly,
# since from_pretrained() does not expand it.
model = os.path.expanduser("~/llama-2-7b-hf-ft-xla")
tokenizer = AutoTokenizer.from_pretrained(model, use_auth_token=True)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
sequences = pipeline(
    ['I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n'],
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=200,
)
print(sequences)
Inference with the fine-tuned Llama2-7B fails with a runtime error:
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
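For what it's worth, a minimal way to check whether the bad values already appear in a plain forward pass (my own diagnostic sketch, not part of the original repro; it reuses the checkpoint path from above):

import os

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = os.path.expanduser("~/llama-2-7b-hf-ft-xla")  # checkpoint from above
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)

inputs = tokenizer('I liked "Breaking Bad"', return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
# If the checkpoint is broken, NaNs/Infs usually show up in the logits already.
print(torch.isnan(logits).any().item(), torch.isinf(logits).any().item())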
I also tried GPT-2. With GPT-2, the model can be loaded and used for inference, but it produces garbage output like:
[{'generated_text': 'I liked "Breaking Bad" and "Band of Brothers". Do you have any recommendations of other shows I might like?\n contends Creator smiling reminiscentoffset prophets contends contends Sheffield contends wetlandslocked maximizing maximizing WIratorct continuity=- ...'}]
For both the fine-tuned Llama2-7B and GPT-2, I get this kind of warning while instantiating transformers.pipeline:
Some weights of the model checkpoint at /home/hzchen/scripts/llm/gpt-ft-test were not used when initializing GPT2LMHeadModel: [<FSDP_LAYERS_OMITTED...>]
- This IS expected if you are initializing GPT2LMHeadModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing GPT2LMHeadModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of GPT2LMHeadModel were not initialized from the model checkpoint at /home/hzchen/scripts/llm/gpt-ft-test and are newly initialized: [<LAYERS_OMITTED...>]
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
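These warnings suggest the checkpoint stores FSDP-flattened parameter names instead of the plain module names the model classes expect. A quick way to verify (my own sketch; the checkpoint path is illustrative):

import os

import torch

# Inspect the raw checkpoint keys; FSDP-wrapped checkpoints typically contain
# names like "_fsdp_wrapped_module...." instead of "transformer.h.0.attn....".
state_dict = torch.load(
    os.path.expanduser("~/llama-2-7b-hf-ft-xla/pytorch_model.bin"),
    map_location="cpu",
)
print(len(state_dict), "tensors in checkpoint")
print(list(state_dict.keys())[:10])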
I also noticed that the output fine-tuned models are abnormally small: the fine-tuned GPT-2 is 60+ MB while the original is 500+ MB, and the fine-tuned Llama2-7B is 3.2 GB while the original is 13 GB (fine-tuning the same model on CUDA yields a 20+ GB checkpoint).
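The small sizes are consistent with each rank saving only its own shard. torch_xla ships a helper to merge XLA-FSDP shard files into a full state dict; a minimal sketch, assuming rank-suffixed shard files (the file names below are hypothetical, and the exact return value may vary across torch_xla versions):

from torch_xla.distributed.fsdp import consolidate_sharded_model_checkpoints

# Merge per-rank shard files (e.g. "llama2_ckpt_rank-0-of-8.pth", ...) into a
# single full checkpoint; the prefix/suffix here are hypothetical.
consolidated = consolidate_sharded_model_checkpoints(
    ckpt_prefix="llama2_ckpt",
    ckpt_suffix="_rank-*-of-*.pth",
)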
I also tried Accelerate + FSDP on 8×L4 GPUs, and everything worked fine with the same configs, which makes me believe the problem is in XLA + FSDP.
System Info
transformers version: 4.36.0.dev0
Who can help?
@muellerzr @pacman100
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
I was following these two blogs/docs:
https://pytorch.org/blog/large-scale-training-hugging-face/
https://huggingface.co/docs/transformers/main_classes/trainer#pytorchxla-fully-sharded-data-parallel
The steps above are how I ran it successfully on CUDA devices.
Expected behavior
The fine-tuned model checkpoints produced with XLA + FSDP on TPU should be usable, just as they are with Accelerate + FSDP on GPUs.