Skip to content

Latest commit

 

History

History
140 lines (87 loc) · 10.3 KB

llama2.md

File metadata and controls

140 lines (87 loc) · 10.3 KB

Llama2

Overview

The Llama2 model was proposed in LLaMA: Open Foundation and Fine-Tuned Chat Models by Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushka rMishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing EllenTan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom. It is a collection of foundation language models ranging from 7B to 70B parameters, with checkpoints finetuned for chat application!

The abstract from the paper is the following:

In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our models outperform open-source chat models on most benchmarks we tested, and based on our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source models. We provide a detailed description of our approach to fine-tuning and safety improvements of Llama 2-Chat in order to enable the community to build on our work and contribute to the responsible development of LLMs.

Checkout all Llama2 model checkpoints here. This model was contributed by Arthur Zucker with contributions from Lysandre Debut. The code of the implementation in Hugging Face is based on GPT-NeoX here. The original code of the authors can be found here.

Usage tips

The Llama2 models were trained using bfloat16, but the original inference uses float16. The checkpoints uploaded on the Hub use torch_dtype = 'float16', which will be used by the AutoModel API to cast the checkpoints from torch.float32 to torch.float16.

The dtype of the online weights is mostly irrelevant unless you are using torch_dtype="auto" when initializing a model using model = AutoModelForCausalLM.from_pretrained("path", torch_dtype = "auto"). The reason is that the model will first be downloaded ( using the dtype of the checkpoints online), then it will be casted to the default dtype of torch (becomes torch.float32), and finally, if there is a torch_dtype provided in the config, it will be used.

Training the model in float16 is not recommended and is known to produce nan; as such, the model should be trained in bfloat16.

Tips:

  • Weights for the Llama2 models can be obtained by filling out this form
  • The architecture is very similar to the first Llama, with the addition of Grouped Query Attention (GQA) following this paper
  • Setting config.pretraining_tp to a value different than 1 will activate the more accurate but slower computation of the linear layers, which should better match the original logits.
  • The original model uses pad_id = -1 which means that there is no padding token. We can't have the same logic, make sure to add a padding token using tokenizer.add_special_tokens({"pad_token":"<pad>"}) and resize the token embedding accordingly. You should also set the model.config.pad_token_id. The embed_tokens layer of the model is initialized with self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.config.padding_idx), which makes sure that encoding the padding token will output zeros, so passing it when initializing is recommended.
  • After filling out the form and gaining access to the model checkpoints, you should be able to use the already converted checkpoints. Otherwise, if you are converting your own model, feel free to use the conversion script. The script can be called with the following (example) command:
python src/transformers/models/llama/convert_llama_weights_to_hf.py \
    --input_dir /path/to/downloaded/llama/weights --model_size 7B --output_dir /output/path
  • After conversion, the model and tokenizer can be loaded via:
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("/output/path")
model = LlamaForCausalLM.from_pretrained("/output/path")

Note that executing the script requires enough CPU RAM to host the whole model in float16 precision (even if the biggest versions come in several checkpoints they each contain a part of each weight of the model, so we need to load them all in RAM). For the 75B model, it's thus 145GB of RAM needed.

  • The LLaMA tokenizer is a BPE model based on sentencepiece. One quirk of sentencepiece is that when decoding a sequence, if the first token is the start of the word (e.g. "Banana"), the tokenizer does not prepend the prefix space to the string.

  • When using Flash Attention 2 via attn_implementation="flash_attention_2", don't pass torch_dtype to the from_pretrained class method and use Automatic Mixed-Precision training. When using Trainer, it is simply specifying either fp16 or bf16 to True. Otherwise, make sure you are using torch.autocast. This is required because the Flash Attention only support fp16 and bf16 data type.

Resources

A list of official Hugging Face and community (indicated by 🌎) resources to help you get started with LLaMA2. If you're interested in submitting a resource to be included here, please feel free to open a Pull Request and we'll review it! The resource should ideally demonstrate something new instead of duplicating an existing resource.

  • A notebook on how to fine-tune Llama 2 in Google Colab using QLoRA and 4-bit precision. 🌎
  • A notebook on how to fine-tune the "Llama-v2-7b-guanaco" model with 4-bit QLoRA and generate Q&A datasets from PDFs. 🌎
  • A notebook on how to fine-tune the Llama 2 model with QLoRa, TRL, and Korean text classification dataset. 🌎🇰🇷

⚗️ Optimization

  • Fine-tune Llama 2 with DPO, a guide to using the TRL library's DPO method to fine tune Llama 2 on a specific dataset.
  • Extended Guide: Instruction-tune Llama 2, a guide to training Llama 2 to generate instructions from inputs, transforming the model from instruction-following to instruction-giving.
  • A notebook on how to fine-tune the Llama 2 model on a personal computer using QLoRa and TRL. 🌎

⚡️ Inference

  • A notebook on how to quantize the Llama 2 model using GPTQ from the AutoGPTQ library. 🌎
  • A notebook on how to run the Llama 2 Chat Model with 4-bit quantization on a local computer or Google Colab. 🌎

🚀 Deploy

LlamaConfig

[[autodoc]] LlamaConfig

LlamaTokenizer

[[autodoc]] LlamaTokenizer - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - save_vocabulary

LlamaTokenizerFast

[[autodoc]] LlamaTokenizerFast - build_inputs_with_special_tokens - get_special_tokens_mask - create_token_type_ids_from_sequences - update_post_processor - save_vocabulary

LlamaModel

[[autodoc]] LlamaModel - forward

LlamaForCausalLM

[[autodoc]] LlamaForCausalLM - forward

LlamaForSequenceClassification

[[autodoc]] LlamaForSequenceClassification - forward