## Step 1: Installing Required Packages
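In this step, we install the Python libraries used throughout the notebook: `sentence-transformers` for embeddings, `ragas` for evaluation metrics, `peft` and `bitsandbytes` for adapter and quantized model loading, `datasets`, `wandb`, `scipy`, and `google-cloud-secret-manager`. Every package except `google-cloud-secret-manager` is pinned to an exact version so that the notebook behaves the same from run to run.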

In [None]:
!pip install -q sentence-transformers==3.2.1 ragas==0.2.2 peft==0.13.2 bitsandbytes==0.44.1 datasets==3.0.1 wandb==0.18.5 scipy==1.13.1 google-cloud-secret-manager
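
Because the pins matter for reproducibility, it can help to confirm that the running kernel actually picked them up, since Colab sometimes keeps an older, already-imported version until the runtime restarts. The following is a minimal sketch using only the standard library; the helper name `find_mismatches` is hypothetical, and the pin table is copied from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install command above.
PINNED = {
    "sentence-transformers": "3.2.1",
    "ragas": "0.2.2",
    "peft": "0.13.2",
    "bitsandbytes": "0.44.1",
    "datasets": "3.0.1",
    "wandb": "0.18.5",
    "scipy": "1.13.1",
}


def find_mismatches(pins, get_version=version):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

If nothing is printed, every pinned package is installed at the expected version.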

## Step 2: Importing Necessary Libraries
Here, we import the libraries used in the rest of the notebook: `torch` and `transformers` for loading and running the models, LangChain for the embedding model and for wrapping a Hugging Face pipeline as an LLM, and Ragas for the `answer_relevancy` and `faithfulness` metrics.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
# HuggingFacePipeline lives in langchain_community; importing it from the
# top-level `langchain` package is deprecated.
from langchain_community.llms import HuggingFacePipeline
from ragas.metrics import (
    answer_relevancy,
    faithfulness,
)
from ragas import evaluate


## Step 3: Accessing Hugging Face Token for Model Authentication
In this step, we retrieve the Hugging Face API token. We first look in the Google Colab secrets; if the token is not available there, we fall back to Google Cloud Secret Manager. The token is needed to download gated or private models from the Hugging Face Hub.

In [None]:
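import os

try:
  from google.colab import userdata
  # Export the Colab secret so Hugging Face libraries can read it from the environment.
  os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
except Exception:
  # Reached when not running in Colab (the import fails) or when the secret is
  # missing or not shared with this notebook. Catching userdata.SecretNotFoundError
  # directly would raise a NameError if the import itself had failed.
  print("HuggingFace Token not found, looking in caltech class project")
  from google.cloud import secretmanager
  client = secretmanager.SecretManagerServiceClient()
  response = client.access_secret_version(request={"name": "projects/240830225929/secrets/HF_TOKEN/versions/1"})
  os.environ["HF_TOKEN"] = response.payload.data.decode("UTF-8")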
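
The lookup above follows a general pattern: try several secret sources in order and export the first one that yields a value. As a minimal sketch of that pattern, the hypothetical helper `resolve_token` below takes a list of zero-argument callables; it is not part of Colab, Google Cloud, or Hugging Face, and the provider functions are assumed to be supplied by the caller.

```python
import os


def resolve_token(providers, env=os.environ, name="HF_TOKEN"):
    """Try each provider in order; export and return the first non-empty token."""
    for provider in providers:
        try:
            token = provider()
        except Exception:
            continue  # this source is unavailable here; try the next one
        if token:
            env[name] = token
            return token
    raise RuntimeError(f"{name} not found in any configured source")
```

In this notebook the first provider could wrap `userdata.get("HF_TOKEN")` and the second the Secret Manager call, so that adding another source later means adding one more callable rather than another nested `try` block.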


## Step 4: Loading Pre-trained Language Models
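In this step, we load two pre-trained models from Hugging Face: Model A (`motherduckdb/DuckDB-NSQL-7B-v0.1`, fine-tuned for SQL generation) and Model B (`codellama/CodeLlama-7b-hf`, a base CodeLlama model). Both are loaded with 8-bit quantization through `BitsAndBytesConfig(load_in_8bit=True)` to reduce their GPU memory footprint, and `device_map="auto"` places their layers on the available hardware.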

In [None]:


# Model A From Step 1 in Lab01
base_model = "motherduckdb/DuckDB-NSQL-7B-v0.1"
modelA = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizerA = AutoTokenizer.from_pretrained(base_model)
embedding_modelA = HuggingFaceEmbeddings()


# Model B From Step 2 in Lab01
base_model = "codellama/CodeLlama-7b-hf"
modelB = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizerB = AutoTokenizer.from_pretrained(base_model)
embedding_modelB = HuggingFaceEmbeddings()
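
To see why quantization matters when two 7B models share the same hardware, a back-of-the-envelope estimate of the weight memory is useful. The sketch below assumes a round 7 billion parameters per model and counts weights only; activations, the KV cache, and any modules that `bitsandbytes` keeps in higher precision are not included, so real usage will be higher.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 7e9  # assumption: "7B" taken as a round 7 billion parameters

for label, bits in [("float16", 16), ("int8", 8)]:
    print(f"{label}: ~{weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the weights take roughly 13 GiB in float16 and about half that in 8-bit, which is the difference between fitting one model or two on a typical 16 GB GPU.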


## Step 5: Preparing for Multi-GPU Setup (Optional)
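If more than one GPU is available, we set `is_parallelizable` and `model_parallel` on both models. Because `device_map="auto"` has already spread each model's layers across the available GPUs, these flags tell the Hugging Face `Trainer` that the model is already parallelized, which keeps it from applying its own data parallelism on top.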

In [None]:
if torch.cuda.device_count() > 1:
    print(f"multi cuda devices #{torch.cuda.device_count()}")
    # keeps Trainer from trying its own DataParallelism when more than 1 gpu is available
    modelA.is_parallelizable = True
    modelA.model_parallel = True
    modelB.is_parallelizable = True
    modelB.model_parallel = True
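
The effect of the two flags can be pictured with a small, simplified model of the decision. This is a sketch of the idea only, not the actual `Trainer` source, whose details vary between `transformers` versions; both function names are hypothetical.

```python
from types import SimpleNamespace


def is_marked_model_parallel(model):
    """True when both flags set in the cell above are present and truthy."""
    return bool(
        getattr(model, "is_parallelizable", False)
        and getattr(model, "model_parallel", False)
    )


def would_use_data_parallel(model, n_gpu):
    """Simplified decision: replicate only an unmarked model when 2+ GPUs exist."""
    return n_gpu > 1 and not is_marked_model_parallel(model)


# Stand-ins for real models, used only to exercise the logic.
marked = SimpleNamespace(is_parallelizable=True, model_parallel=True)
unmarked = SimpleNamespace()

print(would_use_data_parallel(marked, n_gpu=2))
print(would_use_data_parallel(unmarked, n_gpu=2))
```

In this simplified picture, a model that is already split across devices is left alone, while an unmarked model on a multi-GPU machine would be replicated.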

## Step 6: Loading Evaluation Dataset
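We load the `train` split of the `b-mc2/sql-create-context` dataset and hold out 10% of it with `train_test_split`. The held-out portion is the evaluation set used in the remaining steps; the training portion is not needed in this notebook, so that line is left commented out.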

In [None]:
from datasets import load_dataset
dataset = load_dataset("b-mc2/sql-create-context", split="train")
# train_dataset = dataset.train_test_split(test_size=0.1)["train"]
eval_dataset = dataset.train_test_split(test_size=0.1)["test"]
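
`train_test_split` shuffles the rows before splitting, so without a fixed seed the held-out rows may differ from run to run, and calling it twice (once for the training portion and once for the test portion) may produce partitions that overlap. The standard-library sketch below illustrates the idea of a seeded split on row indices; `split_indices` is a hypothetical helper and the seed value 42 is arbitrary.

```python
import random


def split_indices(n_rows, test_size=0.1, seed=42):
    """Shuffle row indices with a fixed seed and hold out a `test_size` fraction."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    # The first n_test shuffled indices form the test set, the rest the train set.
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(100)
print(len(train_idx), len(test_idx))
```

With the `datasets` library, the equivalent would be to call `dataset.train_test_split(test_size=0.1, seed=42)` once and read both `"train"` and `"test"` from the same result.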


## Step 7: Adding Context to the Evaluation Dataset
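Here, we add a `retrieved_contexts` column to the evaluation dataset. Ragas reads this column when it scores a sample, and it is meant to hold the list of context passages retrieved for that question. In this notebook every row starts with an empty list.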

In [None]:

eval_dataset = eval_dataset.add_column("retrieved_contexts", [[],] * len(eval_dataset))
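
Two details of this cell are worth a closer look. First, `[[],] * n` repeats a reference to one list rather than creating `n` lists; that appears harmless here because the list is never mutated, but a list comprehension is the safer habit. Second, one plausible way to fill the column is to reuse each row's schema text. The sketch below assumes the rows expose a `context` field, and `contexts_from_rows` is a hypothetical helper.

```python
n = 3

shared = [[]] * n                     # three references to one list object
independent = [[] for _ in range(n)]  # three separate lists

shared[0].append("CREATE TABLE t (id INT)")
independent[0].append("CREATE TABLE t (id INT)")

print(shared)       # every slot shows the appended string
print(independent)  # only the first slot does


def contexts_from_rows(rows, key="context"):
    """Wrap each row's schema text in a one-element list, one list per row."""
    return [[row[key]] for row in rows]
```

The result of `contexts_from_rows` has the same shape as the column added above (one list of strings per row), so it could be passed to `add_column` in place of the empty lists.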

## Step 8: Setting Up the Evaluation Pipeline
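We wrap Model A in a Hugging Face `text-generation` pipeline and pass it to LangChain's `HuggingFacePipeline`, so that Ragas can use it as the LLM that scores each sample. Generation is capped at 1024 new tokens, sampled at a low temperature of 0.1, and given a repetition penalty of 1.1 to keep the output from looping. Ragas then computes `faithfulness` and `answer_relevancy` over the evaluation dataset, using `embedding_modelA` for the embedding-based parts of the metrics.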

In [None]:


pipe = pipeline(
    model=modelA,
    max_new_tokens=1024,
    do_sample=True,
    tokenizer=tokenizerA,
    return_full_text=True,  # langchain expects the full text
    task='text-generation',
    temperature=0.1,
    repetition_penalty=1.1  # without this output begins repeating
)

evaluator = HuggingFacePipeline(pipeline=pipe)

# ragas
result = evaluate(
    dataset=eval_dataset,
    llm=evaluator,
    embeddings=embedding_modelA,
    metrics=[
        faithfulness,
        answer_relevancy,
    ],
)
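
Once the evaluation finishes, the per-row scores usually need to be condensed into one number per metric. Scores can come back as NaN for rows where the judging model's output could not be used, so a plain mean can be misleading. The sketch below assumes the scores have been exported as a list of dictionaries (for example from a DataFrame of results); the helper names and the sample values are illustrative, not actual results.

```python
import math


def is_usable(score):
    """A score counts only if it is present and not NaN."""
    return score is not None and not math.isnan(score)


def summarize(rows, metric_names):
    """Map each metric name to (mean of usable scores, number of usable rows)."""
    summary = {}
    for name in metric_names:
        usable = [row[name] for row in rows if is_usable(row.get(name))]
        mean = sum(usable) / len(usable) if usable else math.nan
        summary[name] = (mean, len(usable))
    return summary


# Illustrative values only.
rows = [
    {"faithfulness": 1.0, "answer_relevancy": 0.8},
    {"faithfulness": math.nan, "answer_relevancy": 0.6},
    {"faithfulness": 0.5, "answer_relevancy": math.nan},
]
print(summarize(rows, ["faithfulness", "answer_relevancy"]))
```

Reporting the count next to the mean shows how many rows each average rests on, which matters when a small model is used as the judge and many rows fail to score.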

## Step 9: Quiz - Evaluating LLMs for SQL Queries
Now that you have completed this exercise, test your understanding with the following quiz questions:

### Quiz Questions:
1. What is the purpose of the `ragas` library in this notebook?
2. Why do we use quantization when loading models?
3. What evaluation metrics are applied to assess model performance?
4. How do we handle multiple GPUs in the model setup?
5. What is the significance of adding the `retrieved_contexts` column in the evaluation dataset?