# **1. Introduction**

<p style="font-size: 24px;text-align:center;"><b>What is Google Gemma Model?</b></p>

<div align="center"><img style="height:40%; width:40%" src="https://storage.googleapis.com/gweb-uniblog-publish-prod/images/gemma-header.width-1200.format-webp.webp"/></div>

<p style="text-align:center;font-size:18px;"><b>Image </b><a href="https://blog.google/technology/developers/gemma-open-models/"><b>source</b></a></p>

<p style="font-size: 18px;"><strong>Gemma</strong> is a family of open large language models released by Google. The models are lightweight yet perform well for their size across a range of tasks. In brief:</p>

<ul style="font-size: 18px;">
    <li><strong>State-of-the-art for their size:</strong> On benchmarks that measure knowledge, problem solving, and common-sense reasoning, Gemma models are competitive with other open models such as Llama 2 and Mistral 7B.</li>
    <li><strong>Lightweight:</strong> Gemma models are designed to be smaller and require less computing power than other large language models. This makes them more accessible for people who don't have access to powerful machines.</li>
    <li><strong>Open:</strong> The model weights are available to anyone who accepts the terms of use, and researchers can fine-tune them for specific tasks.</li>
    <li><strong>Flexible:</strong> Gemma models work with various tools and frameworks, including TensorFlow, JAX, PyTorch, and Hugging Face Transformers. They can also run on different devices, from laptops to mobile phones.</li>
</ul>

<p style="font-size: 18px;">Overall, Gemma models are a good option for developers and researchers who are looking for a powerful and versatile large language model that is easy to use and modify.</p>



<p style="font-size: 24px;text-align:center;"><b>Gemma Released Models</b></p>

| Model preset             | Variant           | Parameters |
|--------------------------|-------------------|------------|
| `gemma_2b_en`            | Pretrained        | 2B         |
| `gemma_instruct_2b_en`   | Instruction tuned | 2B         |
| `gemma_7b_en`            | Pretrained        | 7B         |
| `gemma_instruct_7b_en`   | Instruction tuned | 7B         |

<p style="font-size: 24px;text-align:center;"><b>Pretrained vs Instruction tuned</b></p>

A pre-trained model and an instruction-tuned model are both starting points for building a model that performs a specific task. They differ in how much training they have already received and in how they respond to prompts:

### **Pre-trained Model:**

* **Training:** A pre-trained model is trained on a massive corpus of unlabeled data (text, in Gemma's case) that covers a broad range of topics. This initial training gives the model a general understanding of language. 
* **Focus:**  Think of it as learning the building blocks or foundational skills. It doesn't learn a specific task but develops a strong base for various tasks related to the type of data it's trained on. 
* **Benefits:** Pre-trained models are very efficient. They leverage the vast amount of data processed during pre-training, so you don't need to start from scratch when tackling a specific task. 
* **Drawback:** While versatile, they might not be perfect for a specific domain or task since the focus wasn't on that specific area.

### **Instruction-Tuned Model:**

* **Training:** An instruction-tuned model starts with a pre-trained model as a base. It is then trained further on a smaller dataset of instructions paired with the desired responses, so that it learns to follow a prompt rather than simply continue the text.  
* **Focus:** This additional training refines how the model interprets and answers requests. It's like taking those building blocks and using them to construct something specific. 
* **Benefits:** Instruction-tuned models usually follow prompts more reliably than a pre-trained model used directly, which makes them the natural choice for chat and question answering. 
* **Drawback:**  Tuning requires a curated dataset of instructions and responses, and the success depends on the quality and size of that data. 
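
To make the difference concrete: an instruction-tuned Gemma checkpoint expects each conversation turn to be wrapped in turn markers, whereas a pre-trained checkpoint simply continues raw text. The sketch below builds such a prompt by hand. The helper name `build_gemma_chat_prompt` is hypothetical; the markers follow Gemma's documented chat format, and in practice the tokenizer's `apply_chat_template` method produces this string (plus a leading beginning-of-sequence token) for you.

```python
def build_gemma_chat_prompt(user_message: str) -> str:
    """Wrap a single user message in Gemma-style turn markers (illustrative helper)."""
    return (
        "<start_of_turn>user\n"
        f"{user_message}<end_of_turn>\n"
        # The open "model" turn cues the model to write its reply next.
        "<start_of_turn>model\n"
    )


prompt = build_gemma_chat_prompt("Write a haiku about Python.")
print(prompt)
```

A pre-trained checkpoint given the same string would treat the markers as ordinary text, which is why the instruction-tuned variant is the better fit for question answering.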

<p style="font-size: 24px;text-align:center;"><b>How to Access Google's Gemma Model?</b></p>

<div align="center"><img style="max-width:720px;" src="https://i.ibb.co/VBLJ3tN/Screenshot-from-2024-04-04-23-43-49.png" /></div>

- Go to <a href="https://www.kaggle.com/models/google/gemma?utm_medium=kagglecomp&utm_source=kagglecompetition1&utm_campaign=models-gemmalaunch"><b>Gemma</b></a> on Kaggle, scroll down, and accept the terms of use.
- You are now ready to use the Gemma model.
- Click on Create Notebook.
- Click on Input.
<div align="center"><img style="width:50%;max-width:240px;" src="https://i.ibb.co/2Sxf1K1/Screenshot-from-2024-04-05-00-36-43.png" alt="Screenshot-from-2024-04-05-00-36-43" border="0"></div>
- Click on Add Input
- Click on Gemma Model
<div align="center"><img style="width:50%;max-width:240px;" src="https://i.ibb.co/3mHgJbD/Screenshot-from-2024-04-05-00-37-11.png" alt="Screenshot-from-2024-04-05-00-37-11" border="0"></div>
<br/>
- The selected model is now visible in the Input section.
- Select the accelerator on which you want to run your model.
<div align="center"><img style="width:50%;max-width:240px;" src="https://i.ibb.co/XFPrBst/Screenshot-from-2024-04-05-00-50-19.png" alt="Screenshot-from-2024-04-05-00-50-19" border="0"></div>

# **2. Training the model**

## **Install dependencies**
### Install Transformers, PEFT, TRL, bitsandbytes, and the RAG dependencies.

In [None]:
%%capture
%pip install -q bitsandbytes
%pip install -q transformers
%pip install -q peft
%pip install -q accelerate
%pip install -q trl
%pip install -q torch
%pip install -q qdrant-client langchain pypdf sentence-transformers

## **Load all libraries**

In [None]:
%%capture
import os, torch
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, AutoConfig, TrainingArguments, pipeline
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from datasets import Dataset
from IPython.display import Markdown, display
from langchain_community.document_loaders import PyPDFDirectoryLoader
# Integrations are imported from langchain_community, matching the loader import above.
from langchain_community.vectorstores import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline

## The code below configures a large language model (LLM) for inference with quantization techniques for efficiency. Here's a breakdown of what each part does:

**Model Path and Quantization Configuration**

1. **Model Path:** The `model` variable stores the path to the instruction-tuned Gemma 2B checkpoint (`2b-it`) in Transformers format, attached to the notebook through Kaggle Models. The same variable name is later reused for the loaded model object.

2. **BitsAndBytesConfig:** The `bnbConfig` object defines the configuration for quantization using the BitsAndBytes library. Here are the key arguments:
    * `load_in_4bit (bool, optional)`: Enables 4-bit quantization, which shrinks the memory needed for the weights to roughly a quarter of what 16-bit weights require.
    * `bnb_4bit_quant_type (str, optional)`: Specifies the type of 4-bit quantization to use. Here, it's set to `"nf4"` (4-bit NormalFloat), a format supported by BitsAndBytes and designed for normally distributed weights.
    * `bnb_4bit_compute_dtype (torch.dtype, optional)`: Defines the data type used for computation; the 4-bit weights are dequantized to this type when they are used. Here, it's set to `torch.bfloat16`, a 16-bit format that can improve speed on compatible hardware.

**Loading Tokenizer and Model with Quantization**

1. **AutoTokenizer:** The `AutoTokenizer.from_pretrained` function loads the tokenizer associated with the checkpoint at the specified path (`model`). The tokenizer only maps text to token IDs and back; it holds no model weights, so quantization does not apply to it, and the `quantization_config` and `device_map` arguments are not needed for this call.

2. **AutoModelForCausalLM:** Similarly, `AutoModelForCausalLM.from_pretrained` loads the LLM itself from the same path (`model`). The `device_map="auto"` argument allows automatic device placement (GPU when available, otherwise CPU), and `quantization_config` ensures the model is loaded with the 4-bit quantization configuration.

**Overall, this code snippet aims to achieve two goals:**

* **Load a pre-trained LLM:** It retrieves a pre-trained causal language model from the specified path.
* **Enable Quantization for Efficiency:** By passing the `BitsAndBytesConfig` during loading, the code loads the model with 4-bit weights, which reduces memory use and can speed up inference on compatible hardware.
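
The memory saving can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope calculation, not a measurement: it assumes a nominal 2 billion parameters (the real checkpoint's count differs somewhat) and ignores the small overhead of quantization constants as well as activations and the KV cache. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 2_000_000_000  # assumption: nominal "2B" parameter count

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)  # 16-bit weights
int4_gib = weight_memory_gib(NUM_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: {fp16_gib:.2f} GiB")
print(f" 4-bit weights: {int4_gib:.2f} GiB")
print(f"reduction: {fp16_gib / int4_gib:.0f}x")
```

Under these assumptions the weights shrink from roughly 3.7 GiB to under 1 GiB, which is what makes the model comfortable to run on a single Kaggle GPU.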


<h3><strong>Know More about <a href="https://www.kaggle.com/code/lorentzyeung/what-s-4-bit-quantization-how-does-it-help-llama2">4-bit quantization</a></strong></h3>

In [None]:
model = "/kaggle/input/gemma/transformers/2b-it/2"

bnbConfig = BitsAndBytesConfig(
    load_in_4bit = True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# The tokenizer holds no weights, so it needs neither quantization_config nor device_map.
tokenizer = AutoTokenizer.from_pretrained(model)

model = AutoModelForCausalLM.from_pretrained(
    model,
    device_map = "auto",
    quantization_config=bnbConfig
)

## **Test Model using a prompt**
### The code below demonstrates how to prompt the model to generate text. Here's a breakdown of what each part does:

**Creating the Prompt**

1. **System Message:** The `system` variable holds an instruction that frames the model as a skilled software engineer who writes high-quality Python code.
2. **User Request:** The `user` variable specifies the request to "Write a Python code to display text in a star pattern."
3. **Prompt Construction:** The `prompt` f-string joins the system message, the user request, and a trailing `AI:` marker that cues the model to answer.

**Tokenization and Model Input Preparation**

1. **Tokenizer:** The `tokenizer` loaded earlier converts the prompt text into token IDs. Passing `return_tensors='pt'` returns them as PyTorch tensors.
2. **Device Placement:** `.to("cuda")` moves those tensors to the GPU, where the model runs. This call requires a CUDA-capable accelerator and raises an error if none is available.

**Model Generation**

1. **Model Generation:** The `model.generate` function utilizes the LLM to generate text following the prompt. The provided arguments specify:
    * `inputs`: The tokenized prompt as input.
    * `num_return_sequences`: Set to 1, indicating only one generated sequence is desired.
    * `max_new_tokens`: Limits the maximum number of tokens the model generates to 1000.

**Decoding and Output**

1. **Decoding:** The `tokenizer.decode` function converts the generated token sequence back into human-readable text.
2. **Splitting and Markdown:** The decoded text still contains the prompt, so the code splits it on "AI:" and keeps the part after the marker as the model's response. It then wraps the response in a `Markdown` object so that the notebook renders code blocks and other formatting.

**Overall Functionality:**

This code snippet simulates a conversation where the user asks for Python code for a star pattern, and the LLM generates the code using the prompt and its knowledge.

**Note:** The generated star-pattern code appears as the cell's output when the notebook runs; it is not reproduced here, and it may differ from run to run.
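
The prompt-building and response-extraction steps can be tried without a model. In the sketch below, the string `decoded` is a stand-in for what `tokenizer.decode` would return (prompt followed by the completion), and the helper names `build_prompt` and `extract_response` are hypothetical. It uses `str.partition` instead of `split(...)[1]`, so a response that itself contains "AI:" is not cut short and a missing marker does not raise an `IndexError`.

```python
def build_prompt(system: str, user: str) -> str:
    """Reproduce the notebook's System/User/AI prompt layout."""
    return f"System: {system} \n User: {user} \n AI: "


def extract_response(decoded_text: str, marker: str = "AI:") -> str:
    """Return the text after the first marker, or the whole text if the marker is absent."""
    _, found, response = decoded_text.partition(marker)
    return response.strip() if found else decoded_text.strip()


prompt = build_prompt("You are a helpful assistant.", "Say hello.")

# Stand-in for tokenizer.decode(outputs[0], skip_special_tokens=True):
# the decoded sequence contains the prompt followed by the model's completion.
decoded = prompt + "Hello there!"

reply = extract_response(decoded)
print(reply)
```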


In [None]:
system =  "You are a skilled software engineer who consistently produces high-quality Python code."
user = "Write a Python code to display text in a star pattern."

prompt = f"System: {system} \n User: {user} \n AI: "
    
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")

outputs = model.generate(**inputs, num_return_sequences=1, max_new_tokens=1000)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)
Markdown(text.split("AI:")[1])

# **3. Fine Tune Model**

## **Load the dataset**

In [None]:
data = pd.read_csv("/kaggle/input/dataset-python-question-answer/Dataset_Python_Question_Answer.csv")
dataset = Dataset.from_pandas(data)
data.head()

## **Define a formatting function for the training examples.**

In [None]:
def formatting_func(example):
    template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
    line = template.format(instruction=example['Question'], response=example['Answer'])
    return [line]
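
To see what the trainer will receive, the function can be applied to a single row. The sample row below is made up for illustration; only the column names `Question` and `Answer` come from the dataset used in this notebook.

```python
def formatting_func(example):
    # Same template as in the notebook: one instruction/response pair per example.
    template = "Instruction:\n{instruction}\n\nResponse:\n{response}"
    line = template.format(instruction=example["Question"], response=example["Answer"])
    return [line]


# Hypothetical row with the same column names as the CSV.
sample = {
    "Question": "What does len() do?",
    "Answer": "It returns the number of items in an object.",
}

formatted = formatting_func(sample)
print(formatted[0])
```

The function returns a list with one string, which is the text the model is trained to reproduce.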

In [None]:
import os
os.environ["WANDB_DISABLED"] = "true"

## **What is WANDB?**

WANDB refers to Weights & Biases (W&B, used through the `wandb` package), a cloud platform for experiment tracking designed for machine learning. It provides functionalities like:

* **Logging** training metrics, parameters, and visualizations.
* **Version control** for your machine learning experiments.
* **Sweeping** hyperparameters to find the best performing configuration.
* **Collaboration** features to share and discuss experiments with your team.

## **Why disable WANDB?**

There are a few reasons why you might want to disable WANDB:

* **You don't need experiment tracking:** If your code is a simple experiment or you're not interested in tracking the results, disabling WANDB can reduce overhead.
* **Privacy concerns:** WANDB logs your experiment data to the cloud. If your data is sensitive, you might not want to upload it.
* **Troubleshooting errors:**  Sometimes errors can arise from WANDB itself. Disabling it can help isolate the issue.

## **Setting the `WANDB_DISABLED` environment variable**

We set the environment variable `WANDB_DISABLED` to `"true"`. This tells the WANDB integration not to initialize itself, which disables it for the current session.

## **In summary:**

* WANDB is a useful tool for experiment tracking in machine learning.
* You might disable WANDB if you don't need experiment tracking or for debugging purposes.

In [None]:
lora_config = LoraConfig(
    r = 8,
    target_modules = ["q_proj", "o_proj", "k_proj", "v_proj",
                      "gate_proj", "up_proj", "down_proj"],
    task_type = "CAUSAL_LM",
)

### The code above defines a configuration object called `lora_config` for a technique called LoRA (Low-Rank Adaptation). Here's a breakdown of what each parameter does:

**LoRA - Low-Rank Adaptation**

LoRA is a technique used to fine-tune large language models (LLMs) more efficiently. It allows you to adapt pre-trained models to new tasks with minimal memory and computational cost compared to traditional fine-tuning.

**LoraConfig Parameters:**

* **r (int):** This parameter sets the rank of the low-rank matrices that LoRA trains. It controls the trade-off between adaptation capacity and the number of trainable parameters: a lower `r` uses less memory but might lead to slightly lower accuracy. The value used here, 8, is also the PEFT default.

* **target_modules (List[str]):** This list names the modules inside each Transformer block to which LoRA is applied. The provided configuration targets the attention projections and the feed-forward (MLP) projections:
    * `q_proj`: Query projection (attention)
    * `o_proj`: Output projection (attention)
    * `k_proj`: Key projection (attention)
    * `v_proj`: Value projection (attention)
    * `gate_proj`: Gate projection (gated feed-forward/MLP block)
    * `up_proj`: Up projection (expands the hidden state to the MLP's intermediate size)
    * `down_proj`: Down projection (maps the intermediate activations back to the hidden size)

By applying LoRA to these projection layers, the model can learn task-specific adaptations in small added matrices while the original model weights stay frozen.

* **task_type (str, optional):** This parameter tells PEFT what kind of task the model is being fine-tuned for, so that the model is wrapped in the matching PEFT class. Here, it's set to `"CAUSAL_LM"` for causal language modeling.

**In summary:**

This configuration defines how LoRA will be applied to a pre-trained model for fine-tuning. It specifies the rank of the decomposition (memory usage) and the target layers within the Transformer architecture where LoRA will be used to adapt the model to a new task.
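
The saving from a low rank is easy to quantify. For a weight matrix of shape `d_out × d_in`, LoRA trains two matrices of shapes `d_out × r` and `r × d_in`, that is `r * (d_in + d_out)` parameters instead of `d_in * d_out`. The sketch below works this out for a hypothetical square 2048 × 2048 projection; the dimensions are an assumption for illustration, not values read from the Gemma checkpoint.

```python
def full_params(d_in: int, d_out: int) -> int:
    """Parameters in the original dense weight matrix."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by LoRA: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


D_IN = D_OUT = 2048  # assumption: illustrative layer size
R = 8                # rank used in lora_config

full = full_params(D_IN, D_OUT)
lora = lora_params(D_IN, D_OUT, R)

print(f"full matrix:  {full:,} parameters")
print(f"LoRA (r={R}):  {lora:,} trainable parameters")
print(f"fraction:     {lora / full:.4%}")
```

For this layer size, LoRA trains well under 1% of the parameters that full fine-tuning would update, and doubling `r` doubles that number.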


In [None]:
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    max_seq_length=512,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=50,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)

### The code above creates an instance of `SFTTrainer` from the TRL library, which is designed for supervised fine-tuning (SFT) tasks. Here's a breakdown of what each part does:

**SFTTrainer for Supervised Fine-Tuning**

This code utilizes `SFTTrainer` to fine-tune a pre-trained model (`model`) on a specific training dataset (`dataset`). It's designed for tasks where you have labeled data and want to adapt the model for a new purpose.

**Key Parameters:**

* **model (PreTrainedModel):** This argument specifies the pre-trained model you want to fine-tune.

* **train_dataset (Dataset):** This argument points to the training dataset you'll use for fine-tuning. The dataset should be formatted appropriately for the task.

* **max_seq_length (int):** This parameter defines the maximum sequence length allowed in the training data. Sequences exceeding this length will be truncated.

* **args (TrainingArguments):** This argument is an instance of `TrainingArguments` that defines various hyperparameters for the training process. Here are some notable arguments within `args`:
    * `per_device_train_batch_size (int)`: Sets the batch size per device (GPU/TPU) during training. Here, it's set to 1, which is a small batch size commonly used with gradient accumulation.
    * `gradient_accumulation_steps (int)`: This parameter allows accumulating gradients over several batches before updating the model weights. Here, it's set to 4, effectively increasing the effective batch size.
    * `warmup_steps (int)`: This defines the number of warmup steps where the learning rate is gradually increased from 0 to its full value. Here, it's set to 2.
    * `max_steps (int)`: This parameter specifies the total number of training steps. Here, it's set to 50, which might be a short training run for fine-tuning depending on your dataset size and task complexity.
    * `learning_rate (float)`: This sets the learning rate for the optimizer. Here, it's set to 2e-4, which is a common starting point for fine-tuning.
    * `fp16 (bool)`: Enables training using 16-bit floating-point precision (mixed precision) for faster training with minimal accuracy loss (if supported by your hardware).
    * `logging_steps (int)`: Defines how often training metrics are logged during training. Here, it's set to 1, logging metrics for every step. 
    * `output_dir (str)`: Specifies the directory where training outputs (model checkpoints, logs, etc.) will be saved. Here, it's set to "outputs".
    * `optim (str)`: Defines the optimizer used for training. Here, it's set to "paged_adamw_8bit", a bitsandbytes variant of AdamW that stores optimizer state in 8 bits and can page it to CPU memory when GPU memory runs short. 

* **peft_config (LoraConfig):** This is the `lora_config` defined earlier. It provides the configuration for LoRA (Low-Rank Adaptation), so that only the small adapter matrices are trained.

* **formatting_func (Callable):** This argument (if provided) specifies a custom function for formatting the training data before feeding it to the model. This allows for specific pre-processing steps tailored to your task.

**In essence:**

This code snippet configures and initializes an `SFTTrainer` for fine-tuning a pre-trained model with LoRA for memory efficiency. The training hyperparameters are set within the `TrainingArguments` object. 
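
Two consequences of these hyperparameters can be checked by hand: the effective batch size, and how the learning rate evolves over the 50 steps. The sketch below assumes a single device and the linear warmup-then-decay schedule that `TrainingArguments` uses when no scheduler is specified; the function names are hypothetical and the schedule is re-implemented here for illustration rather than taken from the library.

```python
def effective_batch_size(per_device: int, accumulation_steps: int, num_devices: int = 1) -> int:
    """Number of examples that contribute to each optimizer update."""
    return per_device * accumulation_steps * num_devices


def linear_schedule_lr(step: int, base_lr: float, warmup_steps: int, max_steps: int) -> float:
    """Linear warmup from 0 to base_lr, then linear decay to 0 at max_steps."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, max_steps - step)
    return base_lr * remaining / max(1, max_steps - warmup_steps)


batch = effective_batch_size(per_device=1, accumulation_steps=4)
examples_seen = batch * 50  # max_steps=50 optimizer updates

print(f"effective batch size: {batch}")
print(f"examples seen in 50 steps: {examples_seen}")
for step in (0, 1, 2, 26, 50):
    lr = linear_schedule_lr(step, base_lr=2e-4, warmup_steps=2, max_steps=50)
    print(f"step {step:>2}: lr = {lr:.2e}")
```

With these settings the run touches only a few hundred examples, which is why the text above describes 50 steps as a short run.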

In [None]:
trainer.train()

## **Test the Fine-Tuned Model**

In [None]:
system =  "You are a skilled software engineer who consistently produces high-quality Python code."
# Add a separating space so the system message and the question do not run together.
question = system + " What is the difference between a variable and an object"

prompt = f"Question: {question} \n Answer: "
    
inputs = tokenizer(prompt, return_tensors='pt', padding=True, truncation=True).to("cuda")

outputs = model.generate(**inputs, num_return_sequences=1, max_new_tokens=512)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)

Markdown(text.split("Answer:")[1])

# **4. Retrieval-Augmented Generation (RAG)**
**Retrieval-Augmented Generation (RAG)** is an approach that integrates retrieval and generation: before the language model answers, a retriever fetches relevant passages from an external knowledge source, and the model generates its response with those passages in its context. In essence, it combines the strength of retrieval systems, which access external knowledge, with that of generative models, which produce fluent and contextually relevant responses.

The primary benefit of RAG in large language models (LLMs) is its ability to leverage external knowledge sources during the generation process. By retrieving relevant information from a predefined knowledge base or corpus, the model can augment its understanding of the input context and produce more accurate and informative responses. This approach not only improves the coherence and relevance of generated text but also enables the model to incorporate real-world knowledge and factual accuracy into its outputs.

RAG aims to achieve several key objectives:

1. **Enhanced Contextual Understanding:** By retrieving relevant information from external sources, RAG can better understand the context of a given prompt or query, leading to more contextually appropriate responses.

2. **Improved Content Quality:** Integrating external knowledge sources allows RAG to generate content that is more accurate, informative, and relevant to the input context, enhancing the overall quality of generated text.

3. **Factually Grounded Responses:** By accessing external knowledge bases, RAG grounds its responses in retrieved information, which reduces (but does not eliminate) the likelihood of generating misleading or incorrect information.

The workflow of RAG typically involves the following steps:

1. **Retrieval:** The model first retrieves relevant information from a knowledge base or corpus based on the input prompt or query. This retrieval process aims to identify key facts, concepts, or contextually relevant information to inform the generation process.

2. **Augmentation:** The retrieved information is then used to augment the model's understanding of the input context. By incorporating this external knowledge, the model can generate more informed and contextually appropriate responses.

3. **Generation:** Finally, the model generates a response based on the augmented understanding of the input context, leveraging both the original prompt and the retrieved information to produce a coherent and relevant output.

RAG addresses two limitations of purely generative models: their knowledge is fixed at training time, and they can produce fluent but incorrect statements. Retrieving from an external knowledge source at query time gives the model relevant context, which leads to more accurate and informative text. This approach is particularly valuable in tasks that depend on large or specialized knowledge bases, such as question answering, dialogue generation, and content summarization.
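
The retrieval and augmentation steps can be illustrated without any model. The sketch below is a deliberately simplified stand-in: it ranks documents with bag-of-words cosine similarity instead of neural embeddings, and keeps them in a plain list instead of a vector database. The corpus, the helper names, and the prompt layout are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, documents: list[str], k: int = 1) -> list[str]:
    """Step 1 (retrieval): return the k documents most similar to the query."""
    query_vec = Counter(tokenize(query))
    ranked = sorted(
        documents,
        key=lambda doc: cosine_similarity(query_vec, Counter(tokenize(doc))),
        reverse=True,
    )
    return ranked[:k]


def augment(query: str, context_docs: list[str]) -> str:
    """Step 2 (augmentation): place the retrieved text in the prompt."""
    context = "\n".join(context_docs)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


documents = [
    "A Python list is a mutable ordered sequence.",
    "A Python tuple is an immutable ordered sequence.",
    "Qdrant is a vector database.",
]
query = "Which Python sequence is immutable?"

top_docs = retrieve(query, documents, k=1)
prompt = augment(query, top_docs)
print(prompt)  # Step 3 (generation) would pass this prompt to the LLM.
```

The rest of this section follows the same three steps, with sentence embeddings and Qdrant doing the retrieval and the Gemma model doing the generation.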

## **Load documents for RAG**

In [None]:
# Instantiate a PyPDFDirectoryLoader object with the specified directory path
pdf_loader = PyPDFDirectoryLoader("/kaggle/input/knowledge-base")

# Load PDF documents from the specified directory
pdfs = pdf_loader.load()

In [None]:
# Instantiate the HuggingFaceEmbeddings wrapper, which turns text into embedding vectors.
embeddings = HuggingFaceEmbeddings(
    # This argument specifies the pre-trained model to be used for generating embeddings.
    # Here, "sentence-transformers/all-mpnet-base-v2" is a pre-trained sentence transformer model
    # from the Sentence Transformers library (not Transformers).
    # Sentence transformer models are specifically trained to generate meaningful representations
    # of sentences that capture semantic similarity.
    model_name="sentence-transformers/all-mpnet-base-v2",

    # Keyword arguments forwarded to the underlying Sentence Transformers model.
    # "device": "cuda" runs the embedding model on the GPU, so a GPU accelerator must be enabled.
    model_kwargs={"device": "cuda"}
)

In [None]:
# Instantiate a RecursiveCharacterTextSplitter object with specified parameters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Split documents into chunks using the RecursiveCharacterTextSplitter
all_splits = text_splitter.split_documents(pdfs)
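
What `chunk_size` and `chunk_overlap` mean can be shown with a much simpler splitter. The sketch below cuts fixed-size character windows that overlap by a set amount; it is not the algorithm `RecursiveCharacterTextSplitter` uses (that class prefers to break on paragraph, line, and word boundaries), and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 100) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


# Synthetic 1000-character document for illustration.
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text(text, chunk_size=500, chunk_overlap=100)

print([len(c) for c in chunks])
# The tail of each chunk is repeated at the head of the next one.
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.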

In [None]:
# Create a Qdrant collection from the document splits
# For storing and searching document information we use a vector database called Qdrant. 

qdrant_collection = Qdrant.from_documents(
    all_splits,                # List of document splits
    embeddings,                # HuggingFaceEmbeddings object for generating embeddings
    location=":memory:",       # Location to store the collection (in memory)
    collection_name="all_documents"  # Name of the Qdrant collection
)

In [None]:
# Create a retriever
retriever = qdrant_collection.as_retriever()

In [None]:
# This code creates a pipeline for text generation using the already loaded model (model)
# and its tokenizer (tokenizer), and limits generated text to 512 new tokens.
# The model was loaded above with 4-bit weights and bfloat16 compute, so no dtype
# needs to be passed here.
# Note: this assignment rebinds the name `pipeline` from the imported function to the
# pipeline object, so re-running this cell requires re-importing `pipeline` first.
pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512
)

## Here's a breakdown of the code snippet below and the parameters used for text generation:

**1. Prompt Creation (`pipeline.tokenizer.apply_chat_template`)**

* **Function:** `apply_chat_template` is a method of the tokenizer, reached here through `pipeline.tokenizer`. It formats the conversation (`message`) into the prompt layout that the instruction-tuned model expects. 
* **Arguments:**
    * `messages (List[Dict])`: This argument is a list of dictionaries representing the conversation history. Each dictionary has a "role" key (for example, "user") and a "content" key (the actual text).
    * `tokenize (bool, optional)`: Set to `False` here, so the method returns the formatted prompt as a string instead of token IDs.
    * `add_generation_prompt (bool, optional)`: Set to `True` here, which appends the tokens that mark the start of the model's turn to the end of the prompt, so that the model writes a reply rather than continuing the user's message.

**2. Text Generation with Pipeline (`pipeline`)**

* **Function:** This line calls the text-generation pipeline itself to generate text that continues the conversation based on the provided prompt.
* **Arguments:**
    * `prompt (str)`: The prompt created in the previous step, containing the formatted conversation and the generation prompt.
    * `max_new_tokens (int, optional)`: Limits the number of tokens the LLM can generate, 512 in this case. Tokens are sub-word units, so this is not the same as 512 words.
    * `add_special_tokens (bool, optional)`: Set to `True` here, which asks the tokenizer to add its special tokens (such as the beginning-of-sequence token) when encoding the prompt.
    * `do_sample (bool, optional)`: Set to `True` here, enabling random sampling during generation, which can introduce some variation in the response compared to greedy decoding.
    * `temperature (float, optional)`: Controls the randomness of the generated text by rescaling the token scores before sampling. A temperature of 1.0 leaves the model's distribution unchanged, while lower values like 0.7 (used here) sharpen it toward higher-probability tokens, resulting in a more conservative response.
    * `top_k (int, optional)`: Restricts generation to the k most likely tokens at each step, reducing the likelihood of going off on tangents. Here, it's set to 10.
    * `top_p (float, optional)`: Enables nucleus sampling: at each step, sampling is limited to the smallest set of most likely tokens whose cumulative probability reaches p. With 0.95 (used here), the least likely tail, holding roughly 5% of the probability mass, is cut off.
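
How temperature, `top_k`, and `top_p` interact is easier to see on a toy distribution. The sketch below applies the three filters in sequence (temperature, then top-k, then top-p) to a hand-written dictionary of logits. It is a simplified illustration of the idea, not the library's implementation, and the vocabulary, logit values, and function name are made up.

```python
import math
import random


def filter_distribution(logits: dict[str, float], temperature: float = 0.7,
                        top_k: int = 10, top_p: float = 0.95) -> dict[str, float]:
    """Return the renormalized sampling distribution after temperature, top-k, and top-p."""
    # 1. Temperature: divide logits; values below 1.0 sharpen the distribution.
    scaled = {tok: logit / temperature for tok, logit in logits.items()}

    # 2. Top-k: keep only the k highest-scoring tokens (sorted from most to least likely).
    top = sorted(scaled.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

    # Softmax over the surviving tokens (subtract the max for numerical stability).
    max_logit = top[0][1]
    exps = [(tok, math.exp(value - max_logit)) for tok, value in top]
    total = sum(e for _, e in exps)
    probs = [(tok, e / total) for tok, e in exps]

    # 3. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, p in probs:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1.
    kept_total = sum(p for _, p in kept)
    return {tok: p / kept_total for tok, p in kept}


toy_logits = {"a": 2.0, "b": 1.0, "c": 0.0, "d": -5.0}  # made-up scores

filtered = filter_distribution(toy_logits, temperature=0.7, top_k=3, top_p=0.95)
print(filtered)

# Draw one token from the filtered distribution.
rng = random.Random(0)
token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(token)
```

In this toy case, top-k removes `d` and top-p then removes `c`, so the sample is drawn from `a` and `b` only.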

In [None]:
question = "What is the difference between a variable and an object"

message = [
    {"role": "user", "content": question},
]

prompt = pipeline.tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True)

outputs = pipeline(
    prompt,
    max_new_tokens=512,
    add_special_tokens=True,
    do_sample=True,
    temperature=0.7,
    top_k=10,
    top_p=0.95
)
Markdown(outputs[0]["generated_text"][len(prompt):])

## The code below combines a large language model (LLM) for text generation with a retrieval system for building a more informative response system. Here's a breakdown of what each part does:

**1. Creating a Custom LLM Pipeline (`gemma_llm`)**

* **HuggingFacePipeline:** This line wraps the existing text-generation pipeline (`pipeline`) from the previous steps in LangChain's `HuggingFacePipeline` class, which exposes it through the LLM interface that LangChain chains expect.
* **Model Kwargs:** The `model_kwargs` dictionary carries the same generation settings as before (`temperature=0.7`, `top_k=10`, `top_p=0.95`, and so on; explained earlier). Whether these are applied to an already constructed pipeline can depend on the LangChain version, so it is worth checking the output for the expected behavior.

**2. Building a RetrievalQA Object (`qa`)**

* **RetrievalQA.from_chain_type:** This line creates a `RetrievalQA` chain, a LangChain class that combines a retriever with a question-answering LLM call.
* **Arguments:**
    * `llm`: This argument takes the `gemma_llm` object, which provides the text-generation capabilities.
    * `chain_type (str)`: This argument is set to `"stuff"`, the simplest chain type: all retrieved document chunks are inserted ("stuffed") into a single prompt together with the question.
    * `retriever`: This argument takes the `retriever` created earlier from the Qdrant collection, which returns the chunks most similar to the user's query.

**In essence:**

This code combines the text-generation capabilities of the LLM with a retrieval system. For each query, the `RetrievalQA` chain fetches relevant chunks through the `retriever`, places them in the prompt, and asks the LLM to answer from that context, which can lead to more comprehensive and better-grounded responses to user queries.
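
The "stuff" strategy itself is only string assembly, which the sketch below imitates without LangChain. The template is modeled on LangChain's default question-answering prompt, but its exact wording is an assumption and may differ between versions; the helper names and sample chunks are hypothetical. The sketch also shows why the final answer can be recovered by splitting on "Helpful Answer:" when the returned text still contains the prompt.

```python
# Assumption: approximates the default "stuff" QA prompt; wording may vary by version.
STUFF_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end. "
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Helpful Answer:"
)


def stuff_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Insert every retrieved chunk into one prompt, separated by blank lines."""
    context = "\n\n".join(retrieved_chunks)
    return STUFF_TEMPLATE.format(context=context, question=question)


def extract_answer(generated_text: str, marker: str = "Helpful Answer:") -> str:
    """Return the text after the last marker, or the whole text if the marker is absent."""
    return generated_text.rpartition(marker)[2].strip()


chunks = [
    "Python is a high-level programming language.",
    "Python uses indentation to delimit blocks.",
]
prompt = stuff_prompt("What is Python?", chunks)

# Stand-in for the LLM output, which here echoes the prompt before the answer.
generated = prompt + " Python is a high-level programming language."
answer = extract_answer(generated)
print(answer)
```

Because every chunk goes into a single prompt, this strategy only works while the retrieved text fits in the model's context window.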


In [None]:
gemma_llm = HuggingFacePipeline(
    pipeline=pipeline,
    model_kwargs={
        "temperature": 0.7,
        "max_new_tokens": 512,
        "add_special_tokens": True,
        "do_sample": True,
        "top_k": 10,
        "top_p": 0.95
    },
)
# Create a RetrievalQA object
qa = RetrievalQA.from_chain_type(
    llm=gemma_llm,  # Pass the text-generation pipeline object
    chain_type="stuff",
    retriever=retriever  # retriever object
)

In [None]:
question = "Write in detail about python"
message = [
    {"role": "user", "content": question},
]

prompt = pipeline.tokenizer.apply_chat_template(message, tokenize=False, add_generation_prompt=True, truncation=True)
result = qa.invoke(prompt)
Markdown(result['result'].split('Helpful Answer:')[1])