# [Llama 2](https://huggingface.co/meta-llama/Llama-2-7b-hf)

**Llama 2** is a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. This repository houses the 7B pretrained model, converted to the Hugging Face Transformers format. Links to other models can be found in the index at the bottom.

## Model Details

*Note: Use of this model is governed by the Meta license. To download the model weights and tokenizer, please visit the [website](website_link) and accept our License before requesting access [here](access_request_link).*

Meta developed and publicly released the **Llama 2** family of large language models (LLMs), a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Our fine-tuned LLMs, called **Llama-2-Chat**, are optimized for dialogue use cases. **Llama-2-Chat** models outperform open-source chat models on most benchmarks we tested. In our human evaluations for helpfulness and safety, they are on par with some popular closed-source models like ChatGPT and PaLM.

## Model Developers

Developed by Meta.

## Variations

**Llama 2** comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pretrained and fine-tuned variations.

- **Input:** Models accept text only.
- **Output:** Models generate text only.

## Model Architecture

**Llama 2** is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF) to align with human preferences for helpfulness and safety.

### Code for Fine-tuning Llama2-7b

In [None]:
!pip install datasets
!pip install peft
!pip install transformers==4.30


In [None]:
# Importing Libraries
from transformers import LlamaTokenizer, LlamaForCausalLM
import torch
from datasets import Dataset
import transformers
import pandas as pd
from peft import get_peft_model, LoraConfig, TaskType, prepare_model_for_int8_training, get_peft_model_state_dict, PeftModel
from sklearn.utils import shuffle

# Loading Model

In [None]:
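import os
from transformers import AutoTokenizer, AutoModelForCausalLM, TrainingArguments, DataCollatorForSeq2Seq, Trainer

# Read the Hugging Face access token from the environment instead of hard-coding it.
# The variable name HF_TOKEN is an assumption; use whatever name your setup defines.
token = os.environ["HF_TOKEN"]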
model_name = "meta-llama/Llama-2-7b-hf"

# Load tokenizer and model with authentication token
tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=token)
model = AutoModelForCausalLM.from_pretrained(model_name, use_auth_token=token)


# Data Preparation Phase

In [None]:
# Load the annotated token/entity pairs
df = pd.read_csv('/content/drive/Shareddrives/dataset/clients/LLama2/Qspot-Sea Final Annotation reviwed by sandeep - combined_df.csv', encoding='latin-1')

# Rename the columns to the names used by the prompt template
df.rename({'word':'input_text', 'entity_group': 'output_text'}, axis=1, inplace=True)
print(df.head(5))

   Unnamed: 0.1  Unnamed: 0  input_text output_text
0           0.0           0  Buongiorno      others
1           1.0           1     Dario,;      others
2           2.0           2       avrei      others
3           3.0           3         una      others
4           4.0           4  spedizione      others


# Fine-tuning Phase


### Tokenize the Prompt

In [None]:
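def tokenize(prompt, add_eos_token=True):
    # Tokenize without padding; the data collator pads each batch later
    result = tokenizer(
        prompt,
        truncation=True,
        max_length=128,
        padding=False,
        return_tensors=None,
    )
    # Append EOS only when the sequence was not truncated and does not already end with it
    if (
        result["input_ids"][-1] != tokenizer.eos_token_id
        and len(result["input_ids"]) < 128
        and add_eos_token
    ):
        result["input_ids"].append(tokenizer.eos_token_id)
        result["attention_mask"].append(1)
    # Labels must be set for every example, not only when EOS was appended
    result["labels"] = result["input_ids"].copy()
    return result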
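
To see the EOS rule in `tokenize` in isolation, here is a minimal sketch that works on plain lists of token ids instead of a real tokenizer. The helper name `append_eos` and the sample ids are hypothetical; only the rule itself (append EOS when the sequence is shorter than `max_length` and does not already end with it) comes from the function above.

```python
def append_eos(input_ids, eos_token_id, max_length=128):
    """Return a copy of input_ids with EOS appended when there is room for it."""
    ids = list(input_ids)
    if ids and ids[-1] != eos_token_id and len(ids) < max_length:
        ids.append(eos_token_id)
    return ids


EOS = 2  # assumption: the Llama 2 tokenizer uses id 2 for </s>

short = append_eos([1, 15043, 3186], EOS)      # room left, EOS is appended
full = append_eos(list(range(10, 138)), EOS)   # 128 ids already, nothing is appended
labels = short.copy()                          # causal LM labels mirror the input ids

print(short)
print(len(full))
```

A sequence that was truncated to `max_length` therefore ends without EOS, which is usually acceptable because the text was cut off anyway.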

### Add PEFT Config

🤗 PEFT (Parameter-Efficient Fine-Tuning) is a library for efficiently adapting large pretrained models to downstream applications without fine-tuning all of a model's parameters, which is prohibitively costly. PEFT methods fine-tune only a small number of (extra) model parameters, significantly decreasing computational and storage costs while yielding performance comparable to a fully fine-tuned model. This makes it more accessible to train and store large language models (LLMs) on consumer hardware.

PEFT is integrated with the Transformers, Diffusers, and Accelerate libraries to provide a faster and easier way to load, train, and use large models for inference.

In [None]:
def create_peft_config(m):
    # LoRA adapters on the attention query and value projections only
    peft_config = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False,
        r=8,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=['q_proj', 'v_proj'],
    )
    # Newer PEFT releases rename this helper to prepare_model_for_kbit_training
    model = prepare_model_for_int8_training(m)
    model.enable_input_require_grads()
    model = get_peft_model(model, peft_config)
    model.print_trainable_parameters()
    return model, peft_config


model, lora_config = create_peft_config(model)

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
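
The `trainable params` figure reported by `print_trainable_parameters()` can be reproduced by hand. The sketch below assumes the Llama-2-7B dimensions from the public model config (hidden size 4096, 32 decoder layers), which are not stated in this notebook. The rank and target modules are the ones passed to `LoraConfig`.

```python
# Assumed Llama-2-7B dimensions (from the public model config, not from this notebook)
hidden_size = 4096
num_layers = 32

r = 8                                   # LoRA rank used in LoraConfig
target_modules = ["q_proj", "v_proj"]   # each is a hidden_size x hidden_size projection

# Each adapted module gains two low-rank matrices: A (r x in) and B (out x r)
params_per_module = r * (hidden_size + hidden_size)
trainable = params_per_module * len(target_modules) * num_layers

all_params = 6_742_609_920  # total reported by print_trainable_parameters()
percent = 100 * trainable / all_params

print(f"{trainable:,}")
print(f"{percent:.4f}")
```

Under these assumptions the count comes to 4,194,304, matching the logged value, so only about 0.06% of the weights receive gradient updates.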

### Generate Prompt

In [None]:
def generate_prompt(data_point):
    return f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
### Instruction:
Extract entity from the given input:
### Input:
{data_point["input_text"]}
### Response:
{data_point["output_text"]}"""

tokenizer.pad_token_id = 0
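
As an illustration of what the template produces, the following sketch fills it for one sample row. The names `TEMPLATE` and `render` are hypothetical stand-ins for `generate_prompt`, and the `with_answer` switch is an addition that leaves the response empty, which is the form a prompt would need at inference time.

```python
TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes the request.\n"
    "### Instruction:\n"
    "Extract entity from the given input:\n"
    "### Input:\n"
    "{input_text}\n"
    "### Response:\n"
    "{output_text}"
)


def render(row, with_answer=True):
    """Fill the template; leave the response empty when building an inference prompt."""
    answer = row["output_text"] if with_answer else ""
    return TEMPLATE.format(input_text=row["input_text"], output_text=answer)


row = {"input_text": "spedizione", "output_text": "others"}
train_prompt = render(row)                     # full example used for fine-tuning
infer_prompt = render(row, with_answer=False)  # ends right after "### Response:"

print(train_prompt)
```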


In [None]:
def generate_and_tokenize_prompt(data_point):
    full_prompt = generate_prompt(data_point)
    tokenized_full_prompt = tokenize(full_prompt)
    return tokenized_full_prompt

In [None]:
total_ds = shuffle(df, random_state=42)
total_train_ds = total_ds.head(4000)
total_test_ds = total_ds.tail(1500)


total_train_ds_hf = Dataset.from_pandas(total_train_ds)
total_test_ds_hf = Dataset.from_pandas(total_test_ds)

tokenized_tr_ds = total_train_ds_hf.map(generate_and_tokenize_prompt)
tokenized_te_ds = total_test_ds_hf.map(generate_and_tokenize_prompt)

Map:   0%|          | 0/3615 [00:00<?, ? examples/s]

Map:   0%|          | 0/1500 [00:00<?, ? examples/s]
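
The progress output reports 3615 training rows even though `head(4000)` was requested, which suggests the DataFrame holds fewer than 4000 rows. If so, `head(4000)` returns every row and `tail(1500)` is a subset of it, so the evaluation rows would also appear in the training set. A minimal sketch of a disjoint split follows; the helper `split_indices` is hypothetical, and the row count of 3615 is taken from the log as an assumption.

```python
import random


def split_indices(n_rows, test_size, seed=42):
    """Shuffle row positions and cut them into disjoint train and test lists."""
    if not 0 < test_size < n_rows:
        raise ValueError("test_size must be between 1 and n_rows - 1")
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)
    return positions[test_size:], positions[:test_size]


train_idx, test_idx = split_indices(n_rows=3615, test_size=1500)
print(len(train_idx), len(test_idx))
```

With pandas, the two lists could then be applied as `df.iloc[train_idx]` and `df.iloc[test_idx]` before converting each part with `Dataset.from_pandas`.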

In [None]:
training_arguments = transformers.TrainingArguments(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=4e-05,
    logging_steps=50,
    optim="adamw_torch",
    evaluation_strategy="steps",
    save_strategy="steps",
    eval_steps=50,
    save_steps=50,
    output_dir="/content/drive/Shareddrives/dataset/clients/LLama2/"
)
data_collator = transformers.DataCollatorForSeq2Seq(tokenizer)
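
The batch settings above interact: gradients from several small batches are accumulated before each optimizer step. The arithmetic below is a sketch of the resulting effective batch size, assuming a single GPU (the device count is not stated in the notebook).

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 16
num_devices = 1  # assumption: a single GPU, as in a typical Colab session

# Number of examples that contribute to each optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Evaluation and checkpointing both run every 50 optimizer steps
eval_steps = 50
examples_between_evals = eval_steps * effective_batch_size

print(effective_batch_size, examples_between_evals)
```

Under that assumption each evaluation and checkpoint is separated by 800 training examples.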

### Training begins

In [None]:
trainer = transformers.Trainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=tokenized_tr_ds,
    eval_dataset=tokenized_te_ds,
    args=training_arguments,
    data_collator=data_collator
)


with torch.autocast("cuda"):
    trainer.train()




| Step | Training Loss | Validation Loss |
|------|---------------|-----------------|
| 50   | 2.383         | 0.959191        |
| 100  | 0.4403        | 0.291672        |
| 150  | 0.2651        | 0.24623         |


KeyboardInterrupt: ignored

### Inference

In [None]:
!pip install sentencepiece



In [None]:
import torch
from transformers import LlamaTokenizer, LlamaForCausalLM
from peft import PeftModel

In [None]:
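model_name="/content/drive/Shareddrives/dataset/clients/LLama2/checkpoint-150/"

Loaded_tokenizer = LlamaTokenizer.from_pretrained(model_name)
# load_in_8bit needs the bitsandbytes and accelerate packages to be installed
Loaded_model = LlamaForCausalLM.from_pretrained(model_name,
                                                load_in_8bit=True,
                                                torch_dtype=torch.float16,
                                                device_map='auto')

# "saved_model_path" appears to be a placeholder for the directory holding the LoRA adapter
Model = PeftModel.from_pretrained(Loaded_model, "saved_model_path",
                                  torch_dtype=torch.float16)

Model.config.pad_token_id = Loaded_tokenizer.pad_token_id = 0
Model.eval()

def extract_entity(text):
    # Reuse the training template with an empty response so the model fills it in
    prompt = generate_prompt({"input_text": text, "output_text": ""})
    inp = Loaded_tokenizer(prompt, return_tensors='pt').to("cuda")
    with torch.no_grad():
        output_ids = Model.generate(**inp, max_new_tokens=128)[0]
    P_ent = Loaded_tokenizer.decode(output_ids, skip_special_tokens=True)
    # Keep only the text after the "Response:" marker
    int_idx = P_ent.find('Response:')
    P_ent = P_ent[int_idx+len('Response:'):]
    return P_ent.strip()

text = "Buongiorno"  # example input taken from the first row of the dataset
extracted_entity = extract_entity(text)
print(extracted_entity)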
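
The post-processing step in `extract_entity` can be exercised without a model. One detail worth noting: `str.find` returns -1 when the marker is missing, so slicing at `int_idx + len('Response:')` would silently drop the first eight characters of the output. The sketch below, with a hypothetical helper `parse_response` and a hand-written sample string, handles that case explicitly.

```python
def parse_response(generated_text, marker="Response:"):
    """Return the text after the first marker, or the whole text if it is absent."""
    idx = generated_text.find(marker)
    if idx == -1:
        return generated_text.strip()
    return generated_text[idx + len(marker):].strip()


# Hand-written stand-in for decoded model output
sample = (
    "### Instruction:\nExtract entity from the given input:\n"
    "### Input:\nBuongiorno\n### Response:\nothers"
)
print(parse_response(sample))
```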


ImportError: ignored