<a href="https://colab.research.google.com/github/nataliakzm/colab_collection/blob/main/Fine_tuning_LangSmith_%26_LLaMA.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LangSmith + LLaMA Fine-tuning Guide

## Context

This guide shows how to fine-tune an open-source LLM on a single GPU, using HuggingFace for training and LangSmith for dataset management and evaluation:

* We will fine-tune LLaMA2-7b-chat for an extraction task (knowledge graph triple extraction)
* We will use [training data](https://docs.smith.langchain.com/evaluation/datasets) exported from LangSmith
* We will [evaluate](https://docs.smith.langchain.com/evaluation/evaluator-implementations) the results using LangSmith


## Environment

First we'll set our `LANGCHAIN_API_KEY` so that we can access our LangSmith datasets.

We'll also install a few libraries to support fine-tuning.


In [None]:
%env LANGCHAIN_API_KEY=
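
Writing the key directly into a `%env` line makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, assuming an interactive session, the helper below prompts for the value only when the variable is not already set. The name `ensure_env_var` and its `prompt_fn` parameter are hypothetical and not part of LangSmith.

```python
import getpass
import os


def ensure_env_var(name, prompt_fn=getpass.getpass):
    """Return the value of `name`, prompting for it only if it is unset.

    `prompt_fn` is injectable so the helper can be exercised without a terminal.
    """
    value = os.environ.get(name)
    if not value:
        value = prompt_fn(f"Enter {name}: ")
        os.environ[name] = value
    return value


# Interactive usage (prompts once, then reuses the stored value):
# ensure_env_var("LANGCHAIN_API_KEY")
```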

In [None]:
%pip install --quiet -U langchain
%pip install --quiet -U langsmith
%pip install --quiet -U pandas
%pip install --quiet -U openai
%pip install --quiet -U xformers
%pip install --quiet -U datasets
%pip install --quiet -U huggingface_hub
%pip install --quiet -U accelerate==0.21.0
%pip install --quiet -U peft==0.4.0
%pip install --quiet -U bitsandbytes==0.40.2
%pip install --quiet -U transformers==4.31.0
%pip install --quiet -U trl==0.4.7

# Get dataset

We can load a dataset from LangSmith (e.g., in this case `Carb-IE-train`).


In [None]:
import pandas as pd
from langsmith import Client

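client = Client()


def get_dataset(train_dataset_name):
    """Load a LangSmith dataset into a DataFrame with one row per example."""
    examples = client.list_examples(dataset_name=train_dataset_name)
    # Merge each example's input and output fields into a single flat record
    train_df = pd.DataFrame([{**e.inputs, **e.outputs} for e in examples])
    return train_df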

# Load dataset from LangSmith
dataset_name_tmpl = "Carb-IE-{split}"
train_dataset_name = dataset_name_tmpl.format(split="train")
train_df = get_dataset(train_dataset_name)

# Prepare for fine-tuning
# Copy the two columns so that renaming does not touch train_df
df = train_df[["sentence", "clusters"]].copy()
df.columns = ["prompt", "response"]
df.head(3)

| | prompt | response |
|---|---|---|
| 0 | Wide acceptance of zero-energy building techno... | [{'s': 'Wide acceptance of zero-energy buildin... |
| 1 | Ms. Waleson is a free - lance writer based in ... | [{'s': '[Ms.] Waleson', 'object': 'a free - la... |
| 2 | The residue can be reprocessed for more drippi... | [{'s': '[The] residue', 'object': 'more drippi... |
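
To see what the `{**e.inputs, **e.outputs}` merge in `get_dataset` produces, here is a small sketch with stand-in objects instead of real LangSmith examples. The sentence and triple are made up for illustration, and `SimpleNamespace` only imitates the `inputs` and `outputs` attributes used above.

```python
from types import SimpleNamespace

# Stand-ins for LangSmith examples; real ones come from client.list_examples(...).
examples = [
    SimpleNamespace(
        inputs={"sentence": "The cat sat on the mat."},
        outputs={"clusters": [{"s": "The cat", "object": "the mat", "relation": "sat on"}]},
    ),
]

# Same merge as in get_dataset: one flat dict per example.
rows = [{**e.inputs, **e.outputs} for e in examples]
print(sorted(rows[0].keys()))

# On a key collision, the right-hand dict (the outputs) wins.
merged = {**{"a": 1, "b": 2}, **{"b": 3}}
print(merged)
```

If an input field and an output field ever share a name, the output value silently replaces the input value, so it is worth checking the column names after loading.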


# Create Instructions

We follow the [LLaMA](https://huggingface.co/blog/llama2#how-to-prompt-llama-2) chat prompt structure, which has also been used to format [other datasets](https://huggingface.co/datasets/mlabonne/guanaco-llama2-1k/viewer/mlabonne--guanaco-llama2-1k/train?row=10) for fine-tuning LLaMA.

```
<s>[INST] <<SYS>>
{{ system_prompt }}
<</SYS>>
{{ user_message }} [/INST]
```

The instructions between the special `<<SYS>>` tokens provide context for the model so it knows how we expect it to respond.

All the interactions between the human and the "bot" are appended to the previous prompt, enclosed between `[INST]` delimiters.
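
The way turns are appended can be sketched as a small helper. This follows the layout described in the linked Hugging Face post as I understand it; the function name `build_llama2_prompt` and its argument shapes are hypothetical, and the exact whitespace is an assumption worth checking against the tokenizer you use.

```python
def build_llama2_prompt(system_prompt, turns):
    """Assemble a Llama 2 chat prompt.

    `turns` is a list of (user_message, assistant_reply) pairs; the reply of the
    last pair may be None when the model is expected to produce it.
    """
    parts = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            # The system prompt is folded into the first user turn only.
            user = f"<<SYS>>\n{system_prompt}\n<</SYS>>\n\n{user}"
        segment = f"<s>[INST] {user} [/INST]"
        if assistant is not None:
            # A completed turn carries the reply and is closed with </s>.
            segment += f" {assistant} </s>"
        parts.append(segment)
    return "".join(parts)


prompt_text = build_llama2_prompt("Be brief.", [("Hi", "Hello"), ("Bye", None)])
print(prompt_text)
```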


In [None]:
# Save our DataFrame to a format that can be read by HuggingFace
from datasets import load_dataset

# Write to JSON
df.to_json('train.jsonl', orient='records', lines=True)
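# `train_df_synthetic` is not defined in this notebook; uncomment if you have built a synthetic set
# train_df_synthetic.to_json('train_synthetic.jsonl', orient='records', lines=True)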

In [None]:
# Create instructions
import json
def create_instructions(examples):
    texts = []

    for prompt, response in zip(examples['prompt'], examples['response']):
        # The response is a list of triple dicts: serialize it to a JSON string
        if isinstance(response, list):
            # Pretty-print for better readability
            response_str = json.dumps(response, indent=2)
        else:
            response_str = response

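        # Format the text using the instruction structure shown above.
        # Note: this template keeps the expected output inside [INST] ... [/INST];
        # in the standard Llama 2 chat layout the answer follows [/INST].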
        text = (f'<s>[INST] <<SYS>>\n'
                f'{system_prompt.strip()}\n'
                f'<</SYS>>\n\n'
                f'### Input: \n{prompt}\n\n'
                f'### Output: \n{response_str}\n'
                f'[/INST]</s>')

        texts.append(text)

    return {'text': texts}

# Set system prompt for our particular task
system_prompt = ("you are a model tasked with extracting knowledge graph triples from given text. "
              "The triples consist of:\n"
              "- \"s\": the subject, which is the main entity the statement is about.\n"
              "- \"object\": the object, which is the entity or concept that the subject is related to.\n"
              "- \"relation\": the relationship between the subject and the object. "
              "Given an input sentence, output the triples that represent the knowledge contained in the sentence.")

# Read JSON we saved
train_dataset = load_dataset('json', data_files='train.jsonl', split="train")
# Optional: only if you also wrote a synthetic set to train_synthetic.jsonl
# train_dataset_synthetic = load_dataset('json', data_files='train_synthetic.jsonl', split="train")

# Create instructions, which we can see in "text" field below
train_dataset_mapped = train_dataset.map(create_instructions, batched=True)
train_dataset_mapped[0]

Map:   0%|          | 0/1466 [00:00<?, ? examples/s]

{'prompt': 'Wide acceptance of zero-energy building technology may require more government incentives or building code regulations , the development of recognized standards , or significant increases in the cost of conventional energy .',
 'response': [{'s': 'Wide acceptance of zero-energy building technology',
   'object': 'government incentives',
   'relation': 'may require more'},
  {'s': 'Wide acceptance of zero-energy building technology',
   'object': 'building code regulations',
   'relation': 'may require more'},
  {'s': 'Wide acceptance of zero-energy building technology',
   'object': 'the development of recognized standards',
   'relation': 'may require'},
  {'s': 'Wide acceptance of zero-energy building technology',
   'object': 'significant increases in the cost of conventional energy',
   'relation': 'may require'}],
 'text': '<s>[INST] <<SYS>>\nyou are a model tasked with extracting knowledge graph triples from given text. The triples consist of:\n- "s": the subject, whi

# Benchmarking our base LLM

Before fine-tuning, we sanity-check the base LLM.

We can do this in two ways:

* Text generation pipeline to spot-check inference
* Evaluation using LangSmith (at bottom of the notebook)

To spot-check inference, we can try the [Replicate chat](https://replicate.com/a16z-infra/llama-2-7b-chat) API, which serves this same LLM (`llama-2-7b-chat`).

In addition, we spot check a text generation pipeline on `Llama-2-7b-chat-hf`.

As done [here](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing#scrollTo=ib_We3NLtj2E), we load the LLM with 4-bit quantization.
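
A rough back-of-the-envelope calculation shows why 4-bit loading matters on a single GPU. The sketch below counts weight storage only, under the assumption of a nominal 7 billion parameters; it ignores quantization constants, activations, the KV cache and optimizer state, so real usage is higher.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 7e9  # nominal size of a "7B" model; an approximation

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)

print(f"fp16 weights:  ~{fp16_gb:.1f} GB")
print(f"4-bit weights: ~{int4_gb:.1f} GB")
```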


In [None]:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)

# -- Bitsandbytes parameters --

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Chat model
model_name = "daryl149/llama-2-7b-chat-hf"

# Load the entire model on GPU 0
device_map = {"": 0}

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)

model.config.use_cache = False
model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# The model is already placed on the GPU via `device_map`, so no `device` argument is passed
pipe_llama7b_chat = pipeline(task="text-generation", model=model, tokenizer=tokenizer, max_length=300)

In [None]:
# Build a test prompt
input_sentence = train_dataset_mapped[0]['prompt']
test_prompt = f'{system_prompt}\n\n### Input: \n{input_sentence}'

# Run inference with text-generation pipeline
result = pipe_llama7b_chat(test_prompt)
result



[{'generated_text': 'you are a model tasked with extracting knowledge graph triples from given text. The triples consist of:\n- "s": the subject, which is the main entity the statement is about.\n- "object": the object, which is the entity or concept that the subject is related to.\n- "relation": the relationship between the subject and the object. Given an input sentence, output the triples that represent the knowledge contained in the sentence.\n\n### Input: \nWide acceptance of zero-energy building technology may require more government incentives or building code regulations , the development of recognized standards , or significant increases in the cost of conventional energy .\n\n### Output: \n\n* s: Wide acceptance\n* object: building technology\n* relation: may require\n\n### Input: \nThe company\'s new product line is expected to increase sales by 20% in the next quarter.\n\n### Output: \n\n* s: company\'s new product line\n* object: sales\n* relation: increase\n\n### Input: \

# Hyperparameters

We follow the training recipe [here](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing#scrollTo=ib_We3NLtj2E).

In [None]:
# *** Modify the model_name ***
model_name = "daryl149/llama-2-7b-chat-hf"

# *** The instruction dataset to use ***
dataset = train_dataset_mapped

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on GPU 0
device_map = {"": 0}
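
To get a feel for what these settings imply, the sketch below estimates the number of optimizer updates and warmup steps. It is a simplified approximation of how a trainer typically derives these values, not a copy of the library's logic; the function name `training_schedule` is hypothetical, and 1466 is the example count shown in the `Map` output earlier.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps,
                      num_epochs, warmup_ratio, num_devices=1):
    """Estimate effective batch size, update steps and warmup steps."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    # The last, possibly smaller, batch of an epoch is kept (no drop_last).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    total_updates = updates_per_epoch * num_epochs
    warmup_steps = math.ceil(total_updates * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "total_updates": total_updates,
        "warmup_steps": warmup_steps,
    }


schedule = training_schedule(
    num_examples=1466,
    per_device_batch_size=4,
    grad_accum_steps=1,
    num_epochs=1,
    warmup_ratio=0.03,
)
print(schedule)
```

Raising `gradient_accumulation_steps` increases the effective batch size without increasing per-step memory, at the cost of fewer optimizer updates per epoch.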

#### Tokenization

As shown above, each instruction is stored under the `text` key.

The trainer tokenizes this `text` field with the model's tokenizer.

```
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
```
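
The training cell in this notebook reuses the EOS token as the padding token and pads on the right. As a toy illustration of what right padding does to a batch, the sketch below works on plain lists of token ids. The ids are made up and `pad_right` is a hypothetical helper, not a tokenizer method.

```python
def pad_right(sequences, pad_id):
    """Right-pad token-id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


EOS_ID = 2  # illustrative id standing in for tokenizer.eos_token_id
ids, mask = pad_right([[5, 6, 7], [8]], pad_id=EOS_ID)
print(ids)
print(mask)
```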

#### Objective

`AutoModelForCausalLM` loads a pre-trained language model (`Llama-2-7b-chat`, in our case) for causal language modeling, the task of predicting the next token in a sequence given the previous tokens. For each instruction, the training target at every position is the token that follows it.

```
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
)
```
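
The next-token objective can be illustrated on a plain list of token ids: the inputs are the sequence without its last token, and the labels are the same sequence shifted left by one. This is a conceptual sketch with made-up ids; in practice the shift happens inside the model's loss computation.

```python
def shift_for_causal_lm(token_ids):
    """Return (inputs, labels) where labels[i] is the token that follows inputs[i]."""
    inputs = token_ids[:-1]
    labels = token_ids[1:]
    return inputs, labels


def next_token_pairs(token_ids):
    """List every (context, target) pair the model is trained on."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


inputs, labels = shift_for_causal_lm([10, 11, 12, 13])
print(inputs, labels)
print(next_token_pairs([1, 2, 3]))
```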

#### Quantization

We also pass `bnb_config`, a `BitsAndBytesConfig` object that specifies the quantization parameters for loading the model in 4-bit precision. Both the tokenizer and the model are passed to the SFT trainer.

```
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset_mapped,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)
```
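
The `peft_config` passed to the trainer carries the LoRA settings chosen above (`lora_r = 64`, `lora_alpha = 16`). The sketch below works out how many trainable parameters LoRA adds to a single square projection matrix and the resulting scaling factor. It assumes a hidden size of 4096 for the 7B model and considers one matrix only; which matrices are adapted depends on the PEFT defaults for the model.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns A (r x d_in) and B (d_out x r) instead of updating W directly.
    """
    return r * (d_in + d_out)


d = 4096           # assumed hidden size of the 7B model
r, alpha = 64, 16  # lora_r and lora_alpha from the hyperparameter cell

full = d * d
added = lora_param_count(d, d, r)
scaling = alpha / r  # factor applied to the low-rank update B @ A

print(f"full matrix: {full:,} params")
print(f"LoRA adds:   {added:,} params ({added / full:.2%})")
print(f"scaling:     {scaling}")
```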

In [None]:
import os
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer

# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map
)
model.config.use_cache = False
model.config.pretraining_tp = 1

# Load LLaMA tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training

# Load LoRA configuration
peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
)

# Set training parameters
training_arguments = TrainingArguments(
    output_dir=output_dir,
    num_train_epochs=num_train_epochs,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    weight_decay=weight_decay,
    fp16=fp16,
    bf16=bf16,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=group_by_length,
    lr_scheduler_type=lr_scheduler_type,
    report_to="tensorboard"
)

Your GPU supports bfloat16: accelerate training with bf16=True




Fine-tune the model.

In [None]:
# Set supervised fine-tuning parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
    packing=packing,
)

# Train model
trainer.train()

# *** Fine-tuned model name ***
new_model = "llama-2-7b-chat-hf-ft-triples"

# Save trained model
trainer.model.save_pretrained(new_model)

You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


| Step | Training Loss |
|---|---|
| 25 | 1.0345 |
| 50 | 0.5302 |
| 75 | 0.3643 |
| 100 | 0.3538 |
| 125 | 0.3253 |
| 150 | 0.3732 |
| 175 | 0.3191 |
| 200 | 0.3519 |
| 225 | 0.331 |
| 250 | 0.3574 |


# Save fine-tuned model in Google Drive

We follow the recipes [here](https://colab.research.google.com/drive/1xB0ZeiBAF78FAxNCz-TeRRwQtmmVjxOh?usp=sharing#scrollTo=R7WKZyxtpUPS) and [here](https://colab.research.google.com/drive/1PEQyJO1-f6j0S_XJ8DV50NkpzasXkrzd?usp=sharing#scrollTo=ib_We3NLtj2E).

We merge the LoRA weights with the FP16 base model.

**Beware: on a GPU with limited VRAM (e.g., a T4), re-loading the base model in FP16 may cause an out-of-memory (OOM) error (see [here](https://mlabonne.github.io/blog/posts/Fine_Tune_Your_Own_Llama_2_Model_in_a_Colab_Notebook.html)).**

With an A100, VRAM is not a problem for the 7B-parameter model.
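
Conceptually, merging folds the low-rank update into the base weight: `W_merged = W + (alpha / r) * B @ A`. The toy sketch below shows this arithmetic on tiny hand-written matrices using plain Python lists. It illustrates the formula only and is not how `merge_and_unload` is implemented; `matmul` and `merge_lora` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(W, A, B, alpha, r):
    """Return W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(B, A)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(W, delta)
    ]


W = [[1.0, 0.0], [0.0, 1.0]]  # base weight, d_out x d_in
A = [[1.0, 2.0]]              # r x d_in, with r = 1
B = [[0.5], [1.0]]            # d_out x r
merged_W = merge_lora(W, A, B, alpha=2, r=1)
print(merged_W)
```

After merging, the adapter matrices are no longer needed at inference time, which is why the merged model can be saved and reloaded like an ordinary checkpoint.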

In [None]:
# Optional: Empty VRAM if using T4
'''
del model
del pipe_llama7b_chat
del trainer
import gc
gc.collect()
gc.collect()
'''

In [None]:
# Merge and save the fine-tuned model
from google.colab import drive
drive.mount('/content/drive')
model_path = f"/content/drive/MyDrive/{new_model}"  # change to your preferred path

# Reload model in FP16 and merge it with LoRA weights
base_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.float16,
    device_map=device_map,
)
model = PeftModel.from_pretrained(base_model, new_model)
model = model.merge_and_unload()

# Reload tokenizer to save it
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# Save the merged model
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

('/content/drive/MyDrive/llama-2-7b-chat-hf-ft-triples-synthetic/tokenizer_config.json',
 '/content/drive/MyDrive/llama-2-7b-chat-hf-ft-triples-synthetic/special_tokens_map.json',
 '/content/drive/MyDrive/llama-2-7b-chat-hf-ft-triples-synthetic/tokenizer.json')

# Load fine-tuned model

* Test inference again

In [None]:
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fine-tuned model
new_model = "llama-2-7b-chat-hf-ft-triples"
drive.mount('/content/drive')
model_path = f"/content/drive/MyDrive/{new_model}"

model_loaded = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

Mounted at /content/drive


Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

In [None]:
from transformers import pipeline
pipe_llama7b_chat_ft = pipeline(task="text-generation", model=model_loaded, tokenizer=tokenizer, max_length=300, device=0) # Run inference on the first (and only) GPU
result = pipe_llama7b_chat_ft(test_prompt)
result

[{'generated_text': 'you are a model tasked with extracting knowledge graph triples from given text. The triples consist of:\n- "s": the subject, which is the main entity the statement is about.\n- "object": the object, which is the entity or concept that the subject is related to.\n- "relation": the relationship between the subject and the object. Given an input sentence, output the triples that represent the knowledge contained in the sentence.\n\n### Input: \nWide acceptance of zero-energy building technology may require more government incentives or building code regulations , the development of recognized standards , or significant increases in the cost of conventional energy .\n\n### Output: \n[\n  {\n    "s": "Wide acceptance of zero-energy building technology",\n    "object": "more government incentives or building code regulations",\n    "relation": "may require"\n  },\n  {\n    "s": "Wide acceptance of zero-energy building technology",\n    "object": "the development of recog
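
The pipeline returns the prompt followed by the completion, so the triples still have to be pulled out of `generated_text`. A minimal sketch, assuming the model emits a JSON array after an `### Output:` marker as in the result above: the helper name `extract_triples` is hypothetical, and it returns `None` when the output is missing or truncated (for example, cut off by `max_length`).

```python
import json


def extract_triples(generated_text, marker="### Output:"):
    """Parse the first JSON array that follows `marker`; return None on failure."""
    _, found, tail = generated_text.partition(marker)
    if not found:
        return None
    start = tail.find("[")
    if start == -1:
        return None
    try:
        # raw_decode stops at the end of the first complete JSON value,
        # so any text generated after the array is ignored.
        triples, _ = json.JSONDecoder().raw_decode(tail, start)
    except json.JSONDecodeError:
        return None
    return triples if isinstance(triples, list) else None


sample = (
    'instructions\n\n### Input: \nA likes B .\n\n### Output: \n'
    '[\n  {"s": "A", "object": "B", "relation": "likes"}\n]\n### Input: extra text'
)
print(extract_triples(sample))
```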

# Evaluation

We created a 100-sample test set in LangSmith.

We use an LLM (GPT-4) evaluator instructed to identify factual discrepancies between the labels and the predicted triplets.

This penalizes predictions that contain triplets not present in the label or that omit a triplet from the label.

However, it is lenient when the exact wording of the object or relation differs in a non-meaningful way.
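
For intuition about the two penalties (extra triplets and missing triplets), here is a deterministic sketch that scores them as precision and recall over exact matches. It is not the evaluator used in this notebook: unlike the GPT-4 grader it is not lenient about wording beyond case and surrounding whitespace. The helper names are hypothetical, and the key names `s`, `relation` and `object` follow the dataset shown earlier.

```python
def triple_key(triple):
    """Normalize a triple dict to a comparable (subject, relation, object) tuple."""
    return tuple(triple[k].strip().lower() for k in ("s", "relation", "object"))


def triple_prf(predicted, reference):
    """Exact-match precision, recall and F1 between two lists of triples."""
    pred = {triple_key(t) for t in predicted}
    ref = {triple_key(t) for t in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0  # penalizes extra triplets
    recall = true_positives / len(ref) if ref else 0.0       # penalizes missing triplets
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


reference = [
    {"s": "A", "relation": "likes", "object": "B"},
    {"s": "A", "relation": "knows", "object": "C"},
]
predicted = [
    {"s": "a", "relation": "Likes", "object": "B "},
    {"s": "A", "relation": "owns", "object": "D"},
]
scores = triple_prf(predicted, reference)
print(scores)
```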

In [None]:
%env OPENAI_API_KEY=

In [None]:
import json
from typing import Any, Optional
from langchain.evaluation import StringEvaluator
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.output_parsers import openai_functions

eval_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an impartial grader tasked with measuring the accuracy of extracted entity relations."),
        ("human", "Please evaluate the following data:\n\n"
         "<INPUT>\n{input}</INPUT>\n"
         "<PREDICTED>\n{prediction}</PREDICTED>\n"
         "<GROUND_TRUTH>\n{reference}</GROUND_TRUTH>\n\n"
         "Please save your reasoning and grading by calling the commit_grade function."
         " First, enumerate all factual discrepancies in the predicted triplets relative to the ground truth."
         " Finally, score the prediction on a scale out of 100, taking into account factuality and"
         " correctness according to the ground truth."),

    ]
)

commit_grade_schema = {
    "name": "commit_grade",
    "description": "Commits a grade with reasoning.",
    "parameters": {
        "title": "commit_grade_parameters",
        "description": "Parameters for the commit_grade function.",
        "type": "object",
        "properties": {
            "mistakes": {
                "title": "discrepancies",
                "type": "string",
                "description": "Any discrepancies between the predicted and ground truth."
            },
            "reasoning": {
                "title": "reasoning",
                "type": "string",
                "description": "The explanation or logic behind the final grade."
            },
            "grade": {
                "title": "grade",
                "type": "number",
                "description": "The numerical value representing the grade.",
                "minimum": 0,
                "maximum": 100
            }
        },
        "required": ["reasoning", "grade", "mistakes"],
    }
}

def normalize_grade(func_args: str) -> dict:
    args = json.loads(func_args)
    return {
        # The schema's property key is "mistakes" ("discrepancies" is only its title)
        "reasoning": (args.get("reasoning", "") + "\n\n" + args.get("mistakes", "")).strip(),
        "score": args.get("grade", 0) / 100,
    }

eval_chain = (
    eval_prompt
    | ChatOpenAI(model="gpt-4", temperature=0).bind(functions=[commit_grade_schema])
    | openai_functions.OutputFunctionsParser()
    | normalize_grade
)

class EvaluateTriplets(StringEvaluator):
    """Evaluate the triplets of a predicted string."""

    @property
    def requires_input(self) -> bool:
        return True

    @property
    def requires_reference(self) -> bool:
        return True

    def _evaluate_strings(
        self,
        *,
        prediction: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        callbacks = kwargs.pop("callbacks", None)
        return eval_chain.invoke(
            {"prediction": prediction, "reference": reference, "input": input},
            {"callbacks": callbacks},
        )

In [None]:
from langsmith import Client
from langchain.smith import RunEvalConfig
from langchain.prompts import PromptTemplate
from langchain.llms import HuggingFacePipeline
from transformers import pipeline, AutoModelForCausalLM, AutoTokenizer

client = Client()
# Note that "sentence" is the key in the test dataset
prompt = PromptTemplate.from_template(
    "[INST] <<SYS>>\n{system_message}\n<</SYS>>\n\n### Input:{sentence}\n\n[/INST]\n"
).partial(system_message=system_prompt)

config = RunEvalConfig(
    custom_evaluators=[EvaluateTriplets()],
)

In [None]:
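# Chat LLM
# Assumption: replace the placeholder with the name of your 100-sample test dataset in LangSmith
validation_dataset_name = "<your-test-dataset-name>"
llama_llm_chat = HuggingFacePipeline(pipeline=pipe_llama7b_chat)
llama_chain_chat = prompt | llama_llm_chat
results = await client.arun_on_dataset(validation_dataset_name, llama_chain_chat, evaluation=config)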

View the evaluation results for project 'ec254abd664a4521b876a0d67d68d35d-RunnableSequence' at:
https://smith.langchain.com/projects/p/019bd6f2-d5e3-4af3-b139-877ab47d81c9?eval=true



In [None]:
# Chat LLM w/ FT
llama_llm_chat_ft = HuggingFacePipeline(pipeline=pipe_llama7b_chat_ft)
llama_chain_chat_ft = prompt | llama_llm_chat_ft
results = await client.arun_on_dataset(validation_dataset_name, llama_chain_chat_ft, evaluation=config)

View the evaluation results for project 'e012f3ad0d9b413caa54b6b7f33197b8-RunnableSequence' at:
https://smith.langchain.com/projects/p/77088c6a-1a78-4db7-9f34-e0c9a808aaf7?eval=true


