In [1]:
print("oitik the boss")

oitik the boss


#### KaggleBot - Gemma-7b-it RAG with few-shot prompting
A RAG system that uses gemma-7b-it to answer questions<br>
about Bangladesh politics using news articles.<br>
by oitik<br>
14 April 2024

#### Competition
Google – AI Assistants for Data Tasks with Gemma<br>
Task: Answer common questions about Bangladesh Politics



## 1- Objective

The objective is to create a notebook that demonstrates how to use the Gemma LLM to answer common questions about Bangladesh politics.

To achieve this objective we will build a RAG system that:

- takes a question from a user as input
- uses the question to retrieve relevant information from news documents via vector search and reranking,
- and then uses gemma-7b-it together with few-shot prompting and heuristics to output a natural language answer to the question.

## 2- What is RAG?



Retrieval Augmented Generation (RAG) is a technique that combines information retrieval and text generation. Information Retrieval involves searching documents to find pieces of information that are relevant to a query. The text generation part involves the LLM taking the retrieved information and the query as inputs, reviewing the information, and then generating a natural language answer to the query.

Example:
1. Employee question: "How do I report office bullying?"
2. >.. A search is made of all company procedures.
3. >.. The question and the information from the search are passed to an LLM.
4. LLM Response: "CompanyX has a 24 hour employee wellness helpline where employees can report harassment. The number is 0800-123-456. You can also send a confidential email to helpme@companyx.com."

In this example the language model understands that bullying and harassment are the same thing i.e. the two words are semantically similar.  This depth of language understanding would not be possible in a keyword search.


## 3- Approach



To create the RAG system we will take documents that contain information about Bangladesh politics and convert their text into chunks.

Then we will use the Sentence Transformers package to convert each chunk into a vector embedding with length 384. These vectors will be stored in a FAISS index.

FAISS (Facebook AI Similarity Search) is an open-source library designed for fast (GPU supported) vector similarity search in large datasets.

When a question is submitted, the question text string will be vectorized. This query vector will then be compared to all vectors in the FAISS index. The top 20 matches will be returned.
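To make the search step concrete, here is a minimal pure-Python sketch of an exhaustive nearest-neighbour search, where the query is compared against every stored vector. It is not the FAISS API. The function name `exhaustive_search`, the toy 3-dimensional vectors and the choice of squared Euclidean distance are assumptions made for illustration (the real embeddings have length 384).

```python
import heapq

def exhaustive_search(query_vec, index_vecs, top_k):
    """Return the positions of the top_k vectors closest to query_vec.

    Hypothetical helper: compares the query against every stored vector
    using squared Euclidean distance (smaller means more similar).
    """
    distances = []
    for position, vec in enumerate(index_vecs):
        dist = sum((q - v) ** 2 for q, v in zip(query_vec, vec))
        distances.append((dist, position))

    # Never return more results than there are stored vectors
    k = min(top_k, len(distances))
    return [position for _, position in heapq.nsmallest(k, distances)]

# Toy 3-dimensional "embeddings" standing in for the 384-dimensional ones
toy_index = [
    [0.9, 0.1, 0.0],   # chunk 0
    [0.0, 1.0, 0.0],   # chunk 1
    [0.8, 0.2, 0.1],   # chunk 2
]
toy_query = [1.0, 0.0, 0.0]

print(exhaustive_search(toy_query, toy_index, top_k=2))
```

The cap on `k` matters when the collection is small: a search for 20 matches over a handful of chunks can only return as many results as there are chunks.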

Then, the search results will be reranked (i.e. reordered). Vector search compares numbers but reranking compares text. Reranking measures the relevance of the question to each text chunk returned by the vector search and assigns a relevance score to each question/chunk pair. The search results are then reordered based on these scores with the text chunks that are most relevant appearing first.
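The reordering step can be sketched without any model. In the snippet below a simple word-overlap score stands in for the learned relevance score. The names `overlap_score` and `rerank` are hypothetical, and the toy scorer is only a placeholder for a real reranking model.

```python
def overlap_score(question, chunk):
    """Toy relevance score: fraction of question words found in the chunk.

    This is only a stand-in for a learned relevance score.
    """
    question_words = set(question.lower().split())
    chunk_words = set(chunk.lower().split())
    if not question_words:
        return 0.0
    return len(question_words & chunk_words) / len(question_words)

def rerank(question, chunks, score_fn=overlap_score):
    """Score every question/chunk pair and return the chunks, best first."""
    scored = [(score_fn(question, chunk), chunk) for chunk in chunks]
    # list.sort is stable, so ties keep their vector-search order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored]

search_results = [
    "the weather was warm in dhaka",
    "students led the protests",
    "who led the protests in dhaka",
]
reranked = rerank("who led the protests", search_results)
print(reranked[0])
```

Because `score_fn` is a parameter, the same function could accept a model-based scorer without changing the sorting logic.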

We will then pass the user's question and the top three text chunks to Gemma. These top three text chunks are now called the context. Gemma will review this information and output a natural language answer to the question using only the information in the context. 

We will use few-shot prompting and two heuristics (rules) to condition Gemma's output so that the text is in the style that we want.
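As a rough illustration of how the context, the question and a few example turns could be combined, here is a minimal sketch that uses Gemma's `<start_of_turn>` / `<end_of_turn>` markers. The helper name `build_rag_prompt` and the instruction wording are assumptions, not the prompt that this notebook actually uses.

```python
def build_rag_prompt(question, context_chunks, examples=()):
    """Assemble a Gemma-style prompt from few-shot examples, context and question.

    Hypothetical helper: the instruction wording is an assumption made
    for illustration.
    """
    parts = []

    # Optional few-shot turns: each example is a (question, answer) pair
    for example_question, example_answer in examples:
        parts.append(f"<start_of_turn>user\n{example_question}<end_of_turn>")
        parts.append(f"<start_of_turn>model\n{example_answer}<end_of_turn>")

    context = "\n\n".join(context_chunks)
    parts.append(
        "<start_of_turn>user\n"
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}<end_of_turn>"
    )
    # Leave the model turn open so that generation continues from here
    parts.append("<start_of_turn>model\n")
    return "\n".join(parts)

prompt = build_rag_prompt(
    "Who led the protests?",
    ["Chunk one text.", "Chunk two text.", "Chunk three text."],
)
print(prompt)
```

Passing `examples=[("Example question?", "Example answer.")]` would add one demonstration turn before the real question, which is the basic idea behind few-shot prompting.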

I chose this architecture because: 
- it's simple to set up
- it's easy to understand what each component is doing
- it's memory efficient and doesn't crash the notebook
- the vector search step is super fast
- the reranking step greatly improves the quality of the final result

## 4- Hardware and inference time



GPU T4 x2 with 29GB RAM

I chose to use the two T4 GPUs because of the larger amount of RAM that's available (2xT4: 29GB vs P100: 16GB). This increased RAM is important when running evaluation. Without this the evaluation step would crash this notebook. Being able to run evaluation without crashing allows for different ideas to be tested quickly.

An interesting observation that I made was that the gemma-7b-it inference time when using the RAG system is about two times faster than for normal inference. 

Consider the following example question: "What is Kaggle?"

Inference time: 15 seconds<br>
Inference time with RAG: 7 seconds

When passing your own question to the RAG system (see section 23) you can expect the inference time to be approx. 7 seconds. 

This entire notebook runs in approx. 10 minutes.


## 5- Where does the data come from?


I collected it myself.

## 6- What pre-processing was done?



I manually cleaned up each txt file and separated the text into chunks. To mark the chunk boundaries I added '###' to each txt file. That way, after reading a file, I can use `text.split('###')` to create the text chunks.

It's important that each chunk contains one idea. Also, a text chunk in isolation can lose its meaning, so to each chunk I prepended the name of the context that the chunk was part of.

This is what an example chunk looks like:

```

###


In 2006, Muhammad Yunus and his Grameen Bank were awarded and accepted the Nobel Peace Prize. Yunus perfectly fitted the profile of the ideal candidate: just enough social commitment without any whiff of brimstone! Already at the time we were aware that Yunus’ contribution was in no way as revolutionary as was claimed. A book under his name published in 2008, Creating a World Without Poverty: Social Business and the Future of Capitalism and what is presumably a French version, Vers un nouveau capitalism [1], will let us throw light onto Yunus ‘phenomenon’ and the Grameen Bank.

###

```
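The splitting and prefixing described above can be sketched as follows. The helper name `split_into_chunks` and the `"source_name: chunk"` prefix format are assumptions made for illustration; the exact prefix used for the real data may differ.

```python
def split_into_chunks(raw_text, source_name, delimiter="###"):
    """Split raw_text on the delimiter and prefix each chunk with its source.

    Hypothetical helper: the "source_name: chunk" prefix format is an
    assumption made for illustration.
    """
    chunks = []
    for piece in raw_text.split(delimiter):
        # Collapse newlines and repeated spaces into single spaces
        piece = " ".join(piece.split())
        # Skip the empty strings produced by leading/trailing delimiters
        if piece:
            chunks.append(f"{source_name}: {piece}")
    return chunks

sample = """
###
First idea, spread
over two lines.
###
Second idea.
###
"""
print(split_into_chunks(sample, "political.txt"))
```

Dropping empty pieces avoids indexing blank chunks when a file starts or ends with the delimiter.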

Having a good data chunking strategy is vital for ensuring that the vector search and reranking steps produce good results. In this notebook I found that the vector search and reranking work very well. Even in cases where Gemma produces an incorrect output, if you look at the context (i.e. the text chunks that were passed to Gemma) you'll find that the context contains the correct answer.

## 7- Evaluating the RAG system

To evaluate the performance of the RAG system I passed a list of twenty five questions to the system and then reviewed the quality of the answers. The questions cover different aspects of the Kaggle platform including Competitions, Notebooks and Datasets.

The evaluation process is done manually. I chose twenty five questions because that number is small enough for the evaluation process to finish within a reasonable time (approx. 3 minutes), but large enough to give an impression of how the system is performing.
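A manual evaluation run of this kind can be organised with a small loop that records each answer and its inference time for later review. This is only a sketch: `run_evaluation` and `dummy_answer_fn` are hypothetical names, and the dummy function stands in for the real RAG system.

```python
import time

def run_evaluation(questions, answer_fn):
    """Collect an answer and the inference time for each question.

    `answer_fn` is any callable that maps a question string to an answer
    string. The results are meant to be read and judged by a person.
    """
    results = []
    for question in questions:
        start_time = time.time()
        answer = answer_fn(question)
        elapsed = round(time.time() - start_time, 1)
        results.append({"question": question, "answer": answer, "seconds": elapsed})
    return results

# Stand-in for the real RAG system, used only to make the sketch runnable
def dummy_answer_fn(question):
    return f"(answer to: {question})"

for row in run_evaluation(["Question one?", "Question two?"], dummy_answer_fn):
    print(row["seconds"], "s |", row["question"], "->", row["answer"])
```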

## 8- Resources to learn RAG basics



Here's a list of resources that will help you understand the code and the workflow that's used in this notebook:

- [Faiss - Introduction to Similarity Search <br> James Briggs on YouTube](https://www.youtube.com/watch?v=sKyvsdEv6rk)

- [Large Language Models with Semantic Search <br> Deeplearning.Ai Short Course](https://www.deeplearning.ai/short-courses/large-language-models-semantic-search/)

- [Colab Notebook that explains rerank](https://colab.research.google.com/github/UKPLab/sentence-transformers/blob/master/examples/applications/retrieve_rerank/retrieve_rerank_simple_wikipedia.ipynb#scrollTo=UlArb7kqN3Re)


- [Sentence transformers docs](https://www.sbert.net/)

- [HuggingFace Transformers docs](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationConfig)

- [Gemma prompt engineering template](https://www.promptingguide.ai/models/gemma)

## 9- Install packages

In [2]:
!pip install git+https://github.com/huggingface/transformers -U
#!pip install accelerate
!pip install -i https://pypi.org/simple/ bitsandbytes

Collecting git+https://github.com/huggingface/transformers
  Cloning https://github.com/huggingface/transformers to /tmp/pip-req-build-vbzcyff3
  Running command git clone --filter=blob:none --quiet https://github.com/huggingface/transformers /tmp/pip-req-build-vbzcyff3
  Resolved https://github.com/huggingface/transformers to commit 0a7af19f4dc868bafc82f35eb7e8d13bac87a594
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting huggingface-hub<1.0,>=0.23.2 (from transformers==4.45.0.dev0)
  Downloading huggingface_hub-0.24.6-py3-none-any.whl.metadata (13 kB)
Collecting tokenizers<0.20,>=0.19 (from transformers==4.45.0.dev0)
  Downloading tokenizers-0.19.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Downloading huggingface_hub-0.24.6-py3-none-any.whl (417 kB)

In [3]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl.metadata (10 kB)
Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-3.0.1


In [4]:
#!pip install faiss-cpu
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.4 kB)
Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


## 10- Imports

In [5]:
import pandas as pd
import numpy as np
import os
import ast

import torch
import gc

import sys, random, string, re, time
from transformers import (BitsAndBytesConfig, 
                          AutoModelForCausalLM, 
                          AutoTokenizer, pipeline)
from tqdm.auto import tqdm

# Don't Show Warning Messages
import warnings
warnings.filterwarnings('ignore')

print(f"CUDA Version: {torch.version.cuda}")
print(f"Pytorch {torch.__version__}")

2024-08-26 03:41:51.671054: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-08-26 03:41:51.671161: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-08-26 03:41:51.801472: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered


CUDA Version: 12.1
Pytorch 2.1.2


In [6]:
# Set a seed value

import torch, random

# Ensure that all GPU operations are deterministic 
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

seed_val = 1023

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

## 11- Define variables

In [7]:
# set the path to the Gemma model hosted on Kaggle
MODEL_PATH = "/kaggle/input/gemma/transformers/7b-it/1"

# set the path to the data that will be used in the few shot prompt
FEW_SHOT_DATA_PATH = '../input/gemma-comp-data/df_corrected_data.csv'

# set the path to the text files containing info about Kaggle
KAGGLE_DATA_PATH = '../input/gemma-comp-data/rev4-cleaned-txt-kaggle/'

# the number of results from the vector search that will be reranked
TOP_K = 20

# the number of text chunks that will be passed to Gemma
NUM_CHUNKS_IN_CONTEXT = 3


## 12- Define the device

We will be using two T4 GPUs with 29GB RAM in total.

In [8]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [9]:
DEVICE = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print(f"Device: {DEVICE}")
print(f"CUDA Version: {torch.version.cuda}")
print(f"Pytorch {torch.__version__}")

Device: cuda
CUDA Version: 12.1
Pytorch 2.1.2


In [10]:
# Check the type and quantity of GPUs

if torch.cuda.is_available():
    print('Num CPUs:', os.cpu_count())
    print('Num GPUs:', torch.cuda.device_count())
    print('GPU Type:', torch.cuda.get_device_name(0))

Num CPUs: 4
Num GPUs: 2
GPU Type: Tesla T4


In [11]:
os.cpu_count()

4

## 13- Helper functions

In [12]:
def run_faiss_search(query_text, top_k):
    
    """
    Executes an exhaustive search using FAISS to find the most 
    similar items to a given query.

    This function vectorizes the input query text using 
    a pre-defined model and then performs a search in a FAISS index 
    to retrieve the top_k most similar items. 
    It returns the indices of these items in the FAISS index, 
    which can be used to retrieve the corresponding documents
    or items.

    Parameters:
    - query_text (str): The text of the query for which similar 
    items are to be found.
    - top_k (int): The number of top similar items to retrieve.

    Returns:
    - index_vals_list (list of int): A list of indices for the top_k 
    most similar items found in the FAISS index. 
    These indices correspond to the positions of the items in 
    the dataset used to build the FAISS index.
    
    Note:
    - This function assumes that a FAISS index (`faiss_index`) 
    and a model for vectorization (`model`) are already defined 
    outside the function.
    - The function is designed for use with the Sentence Transformers
    package to convert text to vectors.
    
    """
    
    # Run FAISS exhaustive search
    query = [query_text]

    # Vectorize the query string
    query_embedding = model.encode(query, show_progress_bar=False)

    # Run the query
    # index_vals refers to the chunk_list index values
    scores, index_vals = faiss_index.search(query_embedding, top_k)
    
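    # Get the list of index vals.
    # FAISS pads the result with -1 when top_k is larger than the
    # number of vectors in the index, so drop those placeholders.
    index_vals_list = [int(i) for i in index_vals[0] if i != -1]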
    
    return index_vals_list
    

def run_rerank(index_vals_list, query_text):
    
    """
    Re-ranks a list of retrieved passages based on 
    their similarity to the input query using a cross-encoder.

    This function takes a list of index values corresponding to 
    retrieved passages and the input query text. 
    It then retrieves the actual text of these passages from a 
    dataframe (`df_data`) and formats them for input to a cross-encoder.
    The cross-encoder is then used to score the similarity between 
    each passage and the query. The passages are re-ranked
    based on these scores, and the re-ranked list of 
    passages is returned.

    Parameters:
    - index_vals_list (list of int): A list of index values 
    corresponding to retrieved passages.
    - query_text (str): The text of the query to be used 
    for re-ranking the passages.

    Returns:
    - pred_list (list of set): A list of re-ranked passages, each 
    wrapped in a single-element set, ordered by their similarity 
    to the query text.

    Note:
    - This function assumes that a dataframe (`df_data`) 
    containing the prepared text of passages and a 
    cross-encoder (`cross_encoder`) for scoring the similarity 
    between text pairs are already defined outside the function.
    """
    
    # Create a list of text chunks
    chunk_list = list(df_data['prepared_text'])

    # Replace the chunk index values with the corresponding strings
    pred_strings_list = [chunk_list[item] for item in index_vals_list]

    # Format the input for the cross encoder
    # The input to the cross_encoder is a list of lists
    # [[query_text, pred_text1], [query_text, pred_text2], ...]

    cross_input_list = []

    for item in pred_strings_list:
        
        # Create a question/chunk pair: [question, text_chunk]
        new_list = [query_text, item]
        
        # Append to the list containing all the question/chunk pairs
        # [[question, text_chunk], [question, text_chunk], ...]
        cross_input_list.append(new_list)


    # Put the pred text into a dataframe
    df = pd.DataFrame(cross_input_list, 
                      columns=['query_text', 'pred_text'])

    # Save the original index (i.e. df_data index values)
    df['original_index'] = index_vals_list

    # Now, score all retrieved passages using the cross_encoder
    cross_scores = cross_encoder.predict(cross_input_list, show_progress_bar=False)

    # Add the scores to the dataframe
    df['cross_scores'] = cross_scores

    # Sort the DataFrame in descending order based on the scores
    df_sorted = df.sort_values(by='cross_scores', ascending=False)
    
    # Reset the index
    df_sorted = df_sorted.reset_index(drop=True)

    pred_list = []

    for i in range(0,len(df_sorted)):
        
        # Get the text
        text = df_sorted.loc[i, 'pred_text']
        
        # Wrap the text in a single-element set.
        # (Despite the curly braces this is a set, not a dict.)
        item = {
            text
        }

        # Append the text to a list
        pred_list.append(item)

    return pred_list

    

def vector_search_and_rerank(query_text, top_k=10):
    
    """
    Retrieves and re-ranks the passages that are most relevant 
    to a given query (the retrieval part of the RAG system).

    This function integrates FAISS for initial retrieval and 
    re-ranking using a cross-encoder to produce a list of responses 
    to the input query text. 
    First, it runs a FAISS exhaustive search to retrieve the top_k 
    most relevant passages based on the query. 
    Then, it re-ranks these passages using a cross-encoder
    to prioritize those with the highest similarity to the query. 
    The resulting list of passages is returned as the 
    output of the RAG system.

    Parameters:
    - query_text (str): The text of the query for which responses 
    are to be generated.
    - top_k (int, optional): The number of top passages to 
    retrieve and re-rank. Defaults to 10.

    Returns:
    - pred_list (list of str): A list of passages ranked and 
    generated by the RAG system in response to the query.

    Note:
    - This function assumes that `run_faiss_search` and `run_rerank` 
    functions are already defined. 
    These functions handle the initial retrieval and 
    re-ranking processes, respectively.
    """
    
    # Run a faiss exhaustive search
    pred_index_list = run_faiss_search(query_text, top_k)

    # This returns a list of single-element sets, one per retrieved chunk
    pred_list = run_rerank(pred_index_list, query_text)
    
    return pred_list

 

def extract_gemma_response(response):
    
    # Extract the answer:
    # Split and select the last item in the list
    response = response.split('<start_of_turn>model')[-1]
    # Remove leading and trailing spaces
    response = response.strip()
    # Remove the '<end_of_turn> token
    response = response.replace('<end_of_turn>', "")

    # Gemma always uses the phrase "I cannot answer this question"
    # when the answer is not available.
    text1 = 'I cannot answer this question'
    
    # If Gemma can't answer the question then
    # output a standard response.
    if text1 in response:
        response = "Sorry, that information is not available."
        
    return response


def format_text(text):

    # Create a list
    answer_list = text.split('\n')

    for i, item in enumerate(answer_list):

        # Replace * with nothing
        new_item = item.replace('*','')
        
        # Remove leading and trailing spaces
        new_item = new_item.strip()

        # Create the output string
        if i == 0:  
            fin_string = new_item + '\n'
        else:
            fin_string = fin_string + new_item + '\n'

    return fin_string


def gemma_assistant(question):
    
    # Create the prompt
    prompt = f"""<start_of_turn>user 
    Don't use Markdown to format your response.
    {question}<end_of_turn>
    <start_of_turn>model
    """

    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    # Generate the outputs from prompt
    generate_ids = gemma_model.generate(**inputs, max_new_tokens=768)
    # Decode the generated output
    generated_text = tokenizer.batch_decode(generate_ids, 
                                        skip_special_tokens=True,
                                        clean_up_tokenization_spaces=False)[0]


    # Extract the answer
    response = generated_text.split('<start_of_turn>model')[-1]
    # Remove leading and trailing spaces
    response = response.strip()
    # Remove the '<end_of_turn> token
    response = response.replace('<end_of_turn>', "")
    
    # Remove markdown '*' symbols
    response = format_text(response)
    
    return response


def timer(start_time):

    # End timing
    end_time = time.time()
    # Calculate the elapsed time
    elapsed_time = end_time - start_time
    # round to one decimal place
    elapsed_time = round(elapsed_time, 1)
    
    return elapsed_time

## 14- Initialize gemma-7b-it

There are three important capabilities that LLMs have - knowledge, reasoning and reading comprehension. I experimented with both gemma-2b-it (trained on 2T tokens) and gemma-7b-it (trained on 6T tokens).

I chose the larger gemma-7b-it for this solution because it has a better  reasoning ability and better reading comprehension. When both models are given the same reference text and asked to extract an answer to a question, gemma-7b-it more often produced the correct answer.

We will use the HuggingFace Transformers package to load the model and run inference. We will also use the bitsandbytes package to reduce the size of the model by using 4-bit precision. This will allow it to fit in the memory (RAM) available in this notebook environment.
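A quick back-of-the-envelope calculation shows why 4-bit precision helps. The sketch below estimates the memory needed for the weights alone at different precisions. The parameter count of roughly 8.5 billion for gemma-7b-it is an assumption, and `weight_memory_gb` is a hypothetical helper.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed to hold the weights alone, in gigabytes."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1e9

# Assumption: roughly 8.5 billion parameters for gemma-7b-it
NUM_PARAMS = 8.5e9

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```

These figures are lower bounds. Actual usage is typically higher because activations and quantization metadata need memory too, and some layers may be kept at higher precision.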

We are using two T4 GPUs.<br>
You will note that in the code below we have set: device_map="auto"<br>
This feature of the Transformers package automatically takes care of distributing the model across both GPUs.


In [13]:
# Initialize the model and the tokenizer.
# (This step takes about 2 minutes)


# Set the compute data type to 16-bit floating point (float16).
# This is a more memory-efficient format than float32, 
# It lowers memory usage and can speed up computation.
compute_dtype = getattr(torch, "float16")


# Configure the model to use 4-bit precision for certain weights, 
# and specify the quantization details. This further reduces the 
# model size and can speed up inference.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=False,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)

# Load the causal language model with the defined quantization 
# configuration and set it to automatically map 
# to the available device.
gemma_model = AutoModelForCausalLM.from_pretrained(MODEL_PATH,
                                        device_map="auto",
                                        quantization_config=bnb_config)

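# Disable caching of past key values.
# This saves some memory, but it can make text generation much slower
# because earlier tokens are re-processed at every step. Consider
# setting this to True if inference speed matters more than memory.
gemma_model.config.use_cache = False

# Set the tensor parallelism degree recorded from pretraining to 1.
# (pretraining_tp is a field of Llama-style configs; it may have
# no effect on Gemma.)
gemma_model.config.pretraining_tp = 1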

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)


`config.hidden_act` is ignored, you should use `config.hidden_activation` instead.
Gemma's activation function will be set to `gelu_pytorch_tanh`. Please, use
`config.hidden_activation` if you want to override this behaviour.
See https://github.com/huggingface/transformers/pull/29402 for more details.


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

## 15- Ask Gemma questions about BD Politics

Let's ask Gemma a few questions about Bangladesh politics. Gemma would have gained this knowledge during training.

It's important to use a good prompt template when working with Gemma. If we don't then we might get bad outputs.
The prompt template we will be using is explained here:<br>
https://www.promptingguide.ai/models/gemma


In [15]:
question = 'Tell me about some ploticial parties of Bangladesh.'

# Create the prompt
prompt = f"""<start_of_turn>user
{question}<end_of_turn>
<start_of_turn>model
"""

# Start timing
start_time = time.time()

# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
# Generate the outputs from prompt
generate_ids = gemma_model.generate(**inputs, max_new_tokens=768)
# Decode the generated output
generated_text = tokenizer.batch_decode(generate_ids, 
                                    skip_special_tokens=True,
                                    clean_up_tokenization_spaces=False)[0]


# Extract the answer

# Split and select the last item in the list
response = generated_text.split('<start_of_turn>model')[-1]
# Remove leading and trailing spaces
response = response.strip()
# Remove the '<end_of_turn> token
response = response.replace('<end_of_turn>', "")

# Remove markdown '*' symbols
# The default Markdown that Gemma outputs
# doesn't always display well.
response = format_text(response)


# Get the inference time
elapsed_time = timer(start_time)
print(f"Time taken: {elapsed_time} seconds")

print()
print('User:\n',question)
print()
print('Gemma:\n', response)


Time taken: 153.3 seconds

User:
 Tell me about some ploticial parties of Bangladesh.

Gemma:
 Sure, here are some of the major political parties in Bangladesh:

Awami League (AL)

Founded in 1949, the Awami League is the largest political party in Bangladesh.
Led by Prime Minister Sheikh Hasina.
Known for its progressive stance on issues such as women's rights, social justice, and secularism.
Has a strong grassroots organization and a large following among the working class and marginalized communities.

BNP (Bangladesh National Party)

Founded in 1971, the BNP is the main opposition party in Bangladesh.
Led by former Prime Minister Khaleda Zia.
Known for its conservative stance on issues such as religion, traditional values, and national security.
Has a strong presence in the business community and among the middle class.

Jatiya Samaj Party (JSP)

Founded in 1982, the JSP is a third-largest party in Bangladesh.
Led by former Speaker of Parliament, Dr. Shirin Sharmin Chowdhury.
Known

<hr>
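This answer reads well, although it is cut off mid-sentence and not every detail is right (the BNP, for example, was founded in 1978, not 1971). Let's put the above code into a function called gemma_assistant() and ask Gemma a few more questions.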

In [16]:
# Start timing
start_time = time.time()

question = "What is the most popular political party in Bangladesh?"

answer = gemma_assistant(question)


# Get the inference time
elapsed_time = timer(start_time)
print(f"Time taken: {elapsed_time} seconds")
print()

print('User:\n',question)
print()
print('Gemma:\n',answer)

Time taken: 9.1 seconds

User:
 What is the most popular political party in Bangladesh?

Gemma:
 The Awami League is the most popular political party in Bangladesh. It is a center-left party that has been in power for much of the country's history.



<hr>
This answer also looks good.

In [18]:
question = "Who is the prime Minister of Bangladesh?"

answer = gemma_assistant(question)

print('User:\n',question)
print()
print('Gemma:\n',answer)

User:
 Who is the prime Minister of Bangladesh>

Gemma:
 The Prime Minister of Bangladesh is Sheikh Hasina.



In [79]:
eval_questions = ['Who is the new prime minister of Bangladesh?',
 'Who is the ex Prime Minister Of Bangladesh?',
 'Name Some Student who put great Impact on Bangladesh recent political activity.']

In [81]:
question = eval_questions[0]

answer = gemma_assistant(question)

print('User:\n',question)
print()
print('Gemma:\n',answer)

User:
 Who is the new prime minister of Bangladesh?

Gemma:
 The new prime minister of Bangladesh is Sheikh Hasina.



In [82]:
question = eval_questions[1]

answer = gemma_assistant(question)

print('User:\n',question)
print()
print('Gemma:\n',answer)

User:
 Who is the ex Prime Minister Of Bangladesh?

Gemma:
 The ex Prime Minister of Bangladesh is Sheikh Hasina. She is a prominent politician and the leader of the Awami League party.



In [83]:
question = eval_questions[2]

answer = gemma_assistant(question)

print('User:\n',question)
print()
print('Gemma:\n',answer)

User:
 Name Some Student who put great Impact on Bangladesh recent political activity.

Gemma:
 Sure, here is the answer to the question:

Sheikh Hasina is a prominent student who has made a significant impact on Bangladesh's recent political activity. She is a former Prime Minister of Bangladesh and is known for her strong leadership and her ability to navigate complex political situations. Hasina has been a dominant force in Bangladesh's politics for decades, and her contributions to the country's development are undeniable.



<hr>
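Gemma does not know the answers. It names Sheikh Hasina as both the new and the ex prime minister, and then as a student. It also opens its last reply with "Sure, here is the answer to the question:", which is a strange response.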

To answer the above questions Gemma would have had to rely on the knowledge it gained during training. This limited knowledge could be one reason for the wrong answers and hallucinations. Another reason could be that we are running this model in 4-bit mode, which can result in lower quality output.

Using a Retrieval Augmented Generation (RAG) system is a way to give a language model more knowledge without having to train it all over again. This method works by combining the language model with a system that can look up and bring in information from outside sources in real time. 

The RAG system helps the model answer questions more accurately because it can use the latest information available. 

Next we will build a RAG system. The source of information will be txt files that contain information about Bangladesh politics.

## 16- Load and pre-process the data

In the cells that follow we will:
- read the content of each txt file
- convert the text from each file into chunks
- store the chunks in a Pandas dataframe
- clean the text by removing newline ('\n') characters and removing leading and trailing spaces.

## 16.1. Read all the txt files

In [24]:
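# List the files in the data folder and keep only political.txt.
# os.listdir() returns names in arbitrary order, so we select
# the file by name instead of by position.
file_list = os.listdir("/kaggle/input/personal-data")
file_list = [fname for fname in file_list if fname == 'political.txt']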
print('Num files:', len(file_list))
print(file_list)

Num files: 1
['political.txt']


In [25]:
# Load all txt files

for i, fname in enumerate(file_list):
    
    # set the path to the file
    file_path = "/kaggle/input/personal-data/" + fname

    with open(file_path, "r") as file:

        # read the file
        content = file.read()

        # split on the '###' delimiter
        lines = content.split("###")

        # create the chunks
        chunk_list = [line.strip() for line in lines]
            

    # Create a dataframe with one column called text_chunk
    cols = ['text_chunk']
    df = pd.DataFrame(chunk_list, columns=cols)

    # Add a new column called fname
    df['fname'] = fname
    
    if i == 0:
        # make a copy of the dataframe
        df_data = df.copy()
    else:
        # concatenate the two dataframes
        df_data = pd.concat([df_data, df], axis=0)

        
# Reset the index       
df_data = df_data.reset_index(drop=True)

# print the shape of the dataframe
print(df_data.shape)

df_data.head()

(14, 2)


| | text_chunk | fname |
|---|---|---|
| 0 | The CADTM welcomes the departure of Prime Mini... | political.txt |
| 1 | In 2006, Muhammad Yunus and his Grameen Bank w... | political.txt |
| 2 | The Nobel Peace Prize-winning economist Muhamm... | political.txt |
| 3 | They include Nahid Islam and Asif Mahmud, top ... | political.txt |
| 4 | Bangladeshi Prime Minister Sheikh Hasina has r... | political.txt |


In [26]:
df_data['text_chunk'][3]

'They include Nahid Islam and Asif Mahmud, top leaders of the Students Against Discrimination group, which led the weeks-long protests that ousted Hasina.\n\nOthers include Touhid Hossain, a former foreign secretary, and Hassan Ariff, a former attorney general. Syeda Rizwana Hasan, an award-winning environmental lawyer, and Asif Nazrul, a top law professor and writer, were also sworn in.\n\nAdilur Rahman Khan, a prominent human rights activist who was sentenced to two years in jail by Hasina’s government, also took the oath as an adviser.'

## 16.2. Clean the text

In [30]:
# Replace newline characters ('\n') with a space
# Remove leading and trailing spaces



def clean_text(x):
    
    # Replace newline characters with a space
    x = x.replace("\n", " ")
    # Remove leading and trailing spaces
    new_text = x.strip()
    # Use split to break the text into words and then join them with a single space
    cleaned_text = " ".join(new_text.split())
    
    return cleaned_text

# Clean the text and save the cleaned text in
# a new column called prepared_text.
df_data['prepared_text'] = df_data['text_chunk'].apply(clean_text)

df_data.head()

Unnamed: 0,text_chunk,fname,prepared_text
0,The CADTM welcomes the departure of Prime Mini...,political.txt,The CADTM welcomes the departure of Prime Mini...
1,"In 2006, Muhammad Yunus and his Grameen Bank w...",political.txt,"In 2006, Muhammad Yunus and his Grameen Bank w..."
2,The Nobel Peace Prize-winning economist Muhamm...,political.txt,The Nobel Peace Prize-winning economist Muhamm...
3,"They include Nahid Islam and Asif Mahmud, top ...",political.txt,"They include Nahid Islam and Asif Mahmud, top ..."
4,Bangladeshi Prime Minister Sheikh Hasina has r...,political.txt,Bangladeshi Prime Minister Sheikh Hasina has r...
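As a side note, the three cleaning steps above can probably be collapsed into one expression, because `str.split()` with no arguments already splits on any run of whitespace (including newlines) and ignores leading and trailing whitespace. The sketch below uses a hypothetical name, `clean_text_compact`, and a made-up input string.

```python
def clean_text_compact(x):
    # split() with no arguments breaks the text on any run of whitespace
    # (spaces, tabs, newlines) and discards leading/trailing whitespace,
    # so joining the pieces with one space normalises the whole string.
    return " ".join(x.split())


raw = "  Line one.\n\nLine   two.\t Line three. \n"
print(clean_text_compact(raw))
```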


## 17- Create the embedding vectors

In the cells that follow we will use the Sentence Transformers package to convert the text chunks into embedding vectors with length 384.

## 17.1. Get the data ready for vectorizing
The input format for the Sentence Transformers model is a list of text strings.

In [31]:
# Create a list of text chunks
chunk_list = list(df_data['prepared_text'])

# Display the number of text chunks
len(chunk_list)

14

In [32]:
chunk_list[3]

'They include Nahid Islam and Asif Mahmud, top leaders of the Students Against Discrimination group, which led the weeks-long protests that ousted Hasina. Others include Touhid Hossain, a former foreign secretary, and Hassan Ariff, a former attorney general. Syeda Rizwana Hasan, an award-winning environmental lawyer, and Asif Nazrul, a top law professor and writer, were also sworn in. Adilur Rahman Khan, a prominent human rights activist who was sentenced to two years in jail by Hasina’s government, also took the oath as an adviser.'

## 17.2. Convert text chunks into embedding vectors

Sentence similarity models convert input text into vectors that are also called embeddings. These embeddings capture semantic information. Here we will use the all-MiniLM-L6-v2 model to create the vectors.

Model: all-MiniLM-L6-v2<br>
Max tokens: 256<br>
Output vector length: 384<br>
Size: 80 MB


In [33]:
from sentence_transformers import SentenceTransformer

# Initialize the model
model = SentenceTransformer("all-MiniLM-L6-v2")

# Sentences are encoded by calling model.encode()
embeddings = model.encode(chunk_list, show_progress_bar=False)

print(embeddings.shape)
print('Embedding length', embeddings.shape[1])

(14, 384)
Embedding length 384


In [34]:
em = model.encode(['i love you', 'i hate you'], show_progress_bar=True)
em.shape

(2, 384)

In [35]:
# Display one text chunk and its embedding

i = 1
print('Text chunk:\n',chunk_list[i])
print()
print('Embedding vector Shape:\n',embeddings[i].shape)
print('Embedding vector:\n',embeddings[i])

Text chunk:
 In 2006, Muhammad Yunus and his Grameen Bank were awarded and accepted the Nobel Peace Prize. Yunus perfectly fitted the profile of the ideal candidate: just enough social commitment without any whiff of brimstone! Already at the time we were aware that Yunus’ contribution was in no way as revolutionary as was claimed. A book under his name published in 2008, Creating a World Without Poverty: Social Business and the Future of Capitalism and what is presumably a French version, Vers un nouveau capitalism [1], will let us throw light onto Yunus ‘phenomenon’ and the Grameen Bank.

Embedding vector Shape:
 (384,)
Embedding vector:
 [-4.96243536e-02  9.62151389e-04 -6.78160936e-02 -2.28274819e-02
 -2.41332501e-02  1.56425238e-02  3.53450589e-02  3.32547352e-02
 -3.25388610e-02  7.96441548e-03  1.01619055e-02  8.98080245e-02
 -8.40059295e-03 -3.34957987e-02 -7.96375517e-03  1.04775466e-02
 -6.81441426e-02 -9.10326745e-03 -2.19236016e-02 -6.65952936e-02
 -1.50681315e-02 -5.141787

## 18- Conduct a Vector Search using FAISS

FAISS (Facebook AI Similarity Search) is an open-source library designed for fast (GPU supported) vector similarity search in large datasets. First we will set up FAISS. Then we will execute an exhaustive vector search. In an exhaustive search (brute-force search) we compare a query vector to every vector stored in the FAISS index.

The similarity metric is L2 distance, also known as Euclidean distance. A smaller value indicates that two points are closer to each other. Therefore, a smaller distance between two vectors indicates a higher similarity.
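To make the idea of an exhaustive search concrete, here is a small pure-Python sketch that compares a query vector with every stored vector and returns the closest ones. It is only an illustration of the technique, not how FAISS is implemented; the function names and the toy 2-dimensional vectors are made up. It returns squared L2 distances, which is also what FAISS's `IndexFlatL2` reports.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def exhaustive_search(query, vectors, top_k):
    """Compare the query to every vector and return the top_k closest."""
    # Score every stored vector (this is the brute-force part)
    scored = [(squared_l2(query, v), i) for i, v in enumerate(vectors)]
    # Smaller distance means higher similarity, so sort ascending
    scored.sort()
    top = scored[:top_k]
    distances = [d for d, _ in top]
    indexes = [i for _, i in top]
    return distances, indexes


# Toy data: three 2-dimensional "embeddings"
toy_vectors = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
distances, indexes = exhaustive_search([1.0, 0.5], toy_vectors, top_k=2)
print('Index values:', indexes)
print('Distances:', distances)
```

Because the ranking only depends on the order of the distances, squared and unsquared L2 distance give the same search results.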

In [36]:
import faiss

# Get the embedding length
embed_length = embeddings.shape[1]

# IndexFlatL2 is used for exhaustive search
faiss_index = faiss.IndexFlatL2(embed_length)

# Check if the index is trained.
# No training needed when using exhaustive search i.e. IndexFlatL2
faiss_index.is_trained

True

In [37]:
# Add the embeddings to the index
faiss_index.add(embeddings)

# Check the total number of embeddings in the index
faiss_index.ntotal

14

Next we will conduct a vector search. We will convert a question into a vector (embedding) and then compare that query vector to every vector in the FAISS index. The search will return a list of vector index values that are ordered by similarity score (L2 distance).

In [39]:
# Run a vector search

# Create the query string
query_text = """
Who is the Prime Minister Of Bangladesh?
"""
query = [query_text]


# Vectorize the query string
query_embedding = model.encode(query, show_progress_bar=False)

# Set the number of outputs we want
top_k = 5

# Run the query
# index_vals refers to the chunk_list index values
scores, index_vals = faiss_index.search(query_embedding, top_k)

# Print the index values and the similarity scores.
# Each index value corresponds to position in the list named chunk_list
print('Index values:\n',index_vals[0])
print()
print('Similarity scores:\n',scores[0])

Index values:
 [10 12  7  2  4]

Similarity scores:
 [0.84797126 0.86966133 0.91124725 0.99497765 1.0011293 ]


Let's print the text associated with each of the above index values. Remember that these results are ordered by similarity score. The lower the score the higher the similarity.

In [40]:
# Get a list of predicted index values
pred_indexes = index_vals[0]

for i in range(0, len(pred_indexes)):
    
    # get the chunk index
    chunk_index = pred_indexes[i]
    
    # get the text that corresponds to the index
    text = chunk_list[chunk_index]
    
    print()
    print(text)


DHAKA, BANGLADESH — Nobel Peace Prize laureate Muhammad Yunus has been chosen to head Bangladesh's interim government after the nation's longtime prime minister resigned and fled abroad in the face of a broad uprising against her rule. Known as the "banker to the poorest of the poor" and a longtime critic of the ousted Sheikh Hasina, Yunus will act as a caretaker premier until new elections are held. The decision followed a meeting late Tuesday that included student protest leaders, military chiefs, civil society members and business leaders.

NEW DELHI, India — Bangladesh’s Nobel laureate Muhammad Yunus is set to return to Dhaka on Thursday to be sworn in as his country’s interim leader, after former Prime Minister Sheikh Hasina resigned and fled to India Monday following widespread protests against her government. Rioters burned down police stations and attacked homes and temples of minority Hindus in the protests. “The whole edifice has collapsed,” said Jyoti Rahman, an Australia-b

<hr>
In the above results you'll note that the top match is about the interim government rather than the prime minister. The chunk that answers the question most directly (index 7, which names Sheikh Hasina as prime minister) is only the third search result.

Next we will apply reranking to improve the order of the search results.

## 19- Rerank (reorder) the search results

Vector search compares vectors but reranking compares text.<br>
During reranking the query text (user question) is compared to the text chunk associated with each of the vectors that were returned during vector search. A relevance score is assigned to each question/text_chunk pair.

When the search results are sorted based on these relevance scores, the text chunks that are most relevant to the question will appear at the top.

Here we will use the cross-encoder/ms-marco-MiniLM-L-6-v2 model for reranking.

model: ms-marco-MiniLM-L-6-v2<br>
Max tokens: 384<br>
Size: 90.9 MB
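The reranking flow described above can be sketched with a stand-in scoring function. The sketch is an assumption-laden toy: `overlap_score` simply counts shared words and is not the cross-encoder, and `rerank` is a hypothetical helper. It only shows the mechanics of scoring each question/text_chunk pair and sorting by the score.

```python
def overlap_score(query, passage):
    """Toy relevance score: number of distinct query words found in the passage."""
    query_words = set(query.lower().split())
    passage_words = set(passage.lower().split())
    return len(query_words & passage_words)


def rerank(query, passages, score_fn):
    """Score each [query, passage] pair and sort passages by descending score."""
    pairs = [[query, p] for p in passages]
    scores = [score_fn(q, p) for q, p in pairs]
    # Higher score means more relevant, so sort in descending order
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order]


passages = [
    "the interim government was sworn in",
    "the prime minister of the country resigned",
    "crowds celebrated in the streets",
]
ranked = rerank("who is the prime minister", passages, overlap_score)

for passage, score in ranked:
    print(score, passage)
```

A real cross-encoder replaces the word-overlap function with a model that reads the question and the passage together, which is why it can judge relevance better than comparing two independently created vectors.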

In [41]:
from sentence_transformers import CrossEncoder

# Initialize the cross-encoder
cross_encoder = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

In [42]:
# Get the text associated with each search result

# Replace the chunk index values with the corresponding strings
pred_strings_list = [chunk_list[item] for item in pred_indexes]

Now let's put the data into a format that the cross encoder expects. The input format is a list of lists.

In [43]:
query[0]

'\nWho is the Prime Minister Of Bangladesh?\n'

In [44]:
# Format the input for the cross encoder

# The input to the cross_encoder is a list of lists
# [[query_text, pred_text1], [query_text, pred_text2], ...]

cross_input_list = []

for item in pred_strings_list:
    
    # Create the question/chunk pair: [question, text_chunk]
    new_list = [query[0], item]
    
    # Append the pair to a list
    cross_input_list.append(new_list)

In [46]:
# cross_input_list

In [47]:
# Put the text chunks from the FAISS vector search into a dataframe.
# Create a dataframe with two columns
df = pd.DataFrame(cross_input_list, columns=['query_text', 'pred_text'])

# Add a third column containing the original predicted index values
df['original_index'] = pred_indexes
df

Unnamed: 0,query_text,pred_text,original_index
0,\nWho is the Prime Minister Of Bangladesh?\n,"DHAKA, BANGLADESH — Nobel Peace Prize laureate...",10
1,\nWho is the Prime Minister Of Bangladesh?\n,"NEW DELHI, India — Bangladesh’s Nobel laureate...",12
2,\nWho is the Prime Minister Of Bangladesh?\n,"Dhaka, Bangladesh CNN — The prime minister of ...",7
3,\nWho is the Prime Minister Of Bangladesh?\n,The Nobel Peace Prize-winning economist Muhamm...,2
4,\nWho is the Prime Minister Of Bangladesh?\n,Bangladeshi Prime Minister Sheikh Hasina has r...,4


In [48]:

# Now, score all question/text_chunk pairs using the cross_encoder
cross_scores = cross_encoder.predict(cross_input_list, show_progress_bar=False)

# Add the scores to the dataframe
df['cross_scores'] = cross_scores

# Sort the DataFrame in descending order based on the scores
df_sorted = df.sort_values(by='cross_scores', ascending=False)

# Reset the index.
df_sorted = df_sorted.reset_index(drop=True)

df_sorted.head()

Unnamed: 0,query_text,pred_text,original_index,cross_scores
0,\nWho is the Prime Minister Of Bangladesh?\n,"Dhaka, Bangladesh CNN — The prime minister of ...",7,7.817359
1,\nWho is the Prime Minister Of Bangladesh?\n,Bangladeshi Prime Minister Sheikh Hasina has r...,4,6.713878
2,\nWho is the Prime Minister Of Bangladesh?\n,The Nobel Peace Prize-winning economist Muhamm...,2,6.076886
3,\nWho is the Prime Minister Of Bangladesh?\n,"DHAKA, BANGLADESH — Nobel Peace Prize laureate...",10,6.008359
4,\nWho is the Prime Minister Of Bangladesh?\n,"NEW DELHI, India — Bangladesh’s Nobel laureate...",12,4.826979


In [49]:
# Compare the original predicted index order and
# the re-ranked index order

print('Original order:',pred_indexes)
print('Reranked order:',list(df_sorted['original_index']))

Original order: [10 12  7  2  4]
Reranked order: [7, 4, 2, 10, 12]


Okay now let's see if reranking has improved the order of the search results. If it has then the first text chunk should contain the answer to our question. 

The question was: Who is the Prime Minister Of Bangladesh?

In [50]:
# Print the output

# Print three results
num_results = 3

for i in range(0,num_results):
    
    # Get the text chunk
    text = df_sorted.loc[i, 'pred_text']

    print('Profile:',text)
    print()

Profile: Dhaka, Bangladesh CNN — The prime minister of Bangladesh, Sheikh Hasina, resigned and fled to neighboring India on Monday after protesters stormed her official residence after weeks of deadly anti-government demonstrations in the South Asian nation. Scenes of jubilation erupted on the streets as protesters celebrated the end of her 15 years in power by climbing on tanks and scaling an imposing statue of Hasina’s father, independence leader Sheikh Mujibur Rahman, in Dhaka, attacking the head with an ax. In a national address, Bangladesh’s army chief, Gen. Waker-uz-Zaman confirmed Hasina had resigned and said the military would form an interim government.

Profile: Bangladeshi Prime Minister Sheikh Hasina has resigned after weeks of deadly anti-government protests, putting an end to her more than two decades dominating the country's politics. Ms Hasina, 76, fled the country, reportedly landing in India on Monday. Jubilant crowds took to the streets to celebrate the news, with so

<hr>
You'll note that the first text chunk now contains the answer. Reranking has definitely improved the search results.

Later we will be passing text chunks to Gemma to be used when answering user questions. LLMs have a limited context. This means that the amount of text that can be given to them has a fixed limit. For Gemma this limit is 8192 tokens. Reranking allows us to make efficient use of the available context by passing only the most useful text chunks to the LLM.
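One way to think about the context limit is as a budget that the reranked chunks are spent against, best chunk first. The sketch below is illustrative only: `select_chunks_within_budget` is a hypothetical helper, and counting whitespace-separated words is a rough stand-in for a real tokenizer, which would usually report more tokens than words.

```python
def count_words(text):
    # Rough stand-in for a tokenizer: one "token" per whitespace-separated word.
    return len(text.split())


def select_chunks_within_budget(chunks, max_tokens, count_tokens=count_words):
    """Keep chunks in ranked order until the next one would exceed the budget."""
    selected = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            # Stop at the first chunk that does not fit
            break
        selected.append(chunk)
        used += cost
    return selected


ranked_chunks = ["a b c", "d e", "f g h i"]
print(select_chunks_within_budget(ranked_chunks, max_tokens=5))
```

In practice the budget would also need to leave room for the prompt template, the few-shot examples and the generated answer.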

## 20- Use Gemma to create a natural language output

After the reranking step we have a list of text chunks that are ordered based on relevance to the question. We now need gemma-7b-it to review these chunks (called the context) and then answer the user's question using natural language.

## 20.1. Zero-shot prompt

First we will send the question and the context to Gemma using a zero-shot prompt. Let's look at two examples.

I've printed the raw Gemma response in the first example. This will make it easy to understand the code that extracts the answer from the raw response.

LLMs have the ability to infer the task based on the structure of the prompt. You will note that, in the prompt, I will not explicitly tell the LLM to use the context to answer the question. I will give the LLM a context and a question. The LLM will infer that it needs to use the context to answer the question.

In [63]:
# Question 1: 
# Who is the new prime minister of Bangladesh?
query_text = "Who is the new prime minister of Bangladesh?"


# Start timing
start_time = time.time()


# Run the RAG search
sorted_pred_list = vector_search_and_rerank(query_text, top_k=5)

# Choose the first 3 reranked and sorted text chunks 
context_list = sorted_pred_list[0:3]
context_list

[{'DHAKA, BANGLADESH — Nobel Peace Prize laureate Muhammad Yunus has been chosen to head Bangladesh\'s interim government after the nation\'s longtime prime minister resigned and fled abroad in the face of a broad uprising against her rule. Known as the "banker to the poorest of the poor" and a longtime critic of the ousted Sheikh Hasina, Yunus will act as a caretaker premier until new elections are held. The decision followed a meeting late Tuesday that included student protest leaders, military chiefs, civil society members and business leaders.'},
 {'Dhaka, Bangladesh CNN — The prime minister of Bangladesh, Sheikh Hasina, resigned and fled to neighboring India on Monday after protesters stormed her official residence after weeks of deadly anti-government demonstrations in the South Asian nation. Scenes of jubilation erupted on the streets as protesters celebrated the end of her 15 years in power by climbing on tanks and scaling an imposing statue of Hasina’s father, independence lea

In [64]:
# Create the prompt
prompt = f"""<start_of_turn>user
Context: {context_list}
Question: {query_text}<end_of_turn>
<start_of_turn>model
"""
    
    
# Tokenize the prompt
inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
# Generate the outputs from prompt
generate_ids = gemma_model.generate(**inputs, max_new_tokens=768)
# Decode the generated output
response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                     clean_up_tokenization_spaces=False)[0]


# Extract the answer

# Split and select the last item in the list
gemma_response = response.split('<start_of_turn>model')[-1]
# Remove leading and trailing spaces
gemma_response = gemma_response.strip()
# Remove the '<end_of_turn>' token
gemma_response= gemma_response.replace('<end_of_turn>', "")

# Get the inference time
elapsed_time = timer(start_time)
print(f"Time taken: {elapsed_time} seconds")
print()

print('-----')
print('User:\n',query_text)
print()
print('Raw Gemma response:\n\n',response)
print()
print()
print('Extracted Gemma response:\n\n',gemma_response)

Time taken: 220.8 seconds

-----
User:
 Who is the new prime minister of Bangladesh?

Raw Gemma response:

 <start_of_turn>user
Context: [{'DHAKA, BANGLADESH — Nobel Peace Prize laureate Muhammad Yunus has been chosen to head Bangladesh\'s interim government after the nation\'s longtime prime minister resigned and fled abroad in the face of a broad uprising against her rule. Known as the "banker to the poorest of the poor" and a longtime critic of the ousted Sheikh Hasina, Yunus will act as a caretaker premier until new elections are held. The decision followed a meeting late Tuesday that included student protest leaders, military chiefs, civil society members and business leaders.'}, {'Dhaka, Bangladesh CNN — The prime minister of Bangladesh, Sheikh Hasina, resigned and fled to neighboring India on Monday after protesters stormed her official residence after weeks of deadly anti-government demonstrations in the South Asian nation. Scenes of jubilation erupted on the streets as protest

## 20.2. Implement few-shot prompting

We will use few-shot prompting to condition the output so that Gemma responds in the style that we want i.e. without mentioning the context. 

In few-shot prompting we give the LLM a few example questions and answers, prepended to the question from the user. The example answers match the style that we are looking for. The LLM will then respond to the user's question using the same style as the example answers it was given.
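As a rough illustration of this idea, the sketch below assembles a few-shot prompt from a list of example dictionaries using Gemma's turn markers. The helper name `build_few_shot_prompt`, the dictionary keys and the example texts are all assumptions made for this sketch.

```python
def build_few_shot_prompt(examples, context, question):
    """Prepend example question/answer turns to the user's question."""
    turns = []
    for ex in examples:
        # Each example is a completed user turn followed by a model turn
        turns.append(
            "<start_of_turn>user\n"
            f"Context: {ex['context']}\n"
            f"Question: {ex['question']}<end_of_turn>\n"
            "<start_of_turn>model\n"
            f"{ex['answer']}<end_of_turn>\n"
        )
    # The final turn is left open so that the model writes the answer
    turns.append(
        "<start_of_turn>user\n"
        f"Context: {context}\n"
        f"Question: {question}<end_of_turn>\n"
        "<start_of_turn>model\n"
    )
    return "".join(turns)


# Made-up example data, only for demonstration
examples = [
    {
        "context": "The library opens at 9am.",
        "question": "When does the library open?",
        "answer": "The library opens at 9am.",
    }
]
prompt = build_few_shot_prompt(examples, "The cafe closes at 5pm.", "When does the cafe close?")
print(prompt)
```

Note that the example answer states the fact directly, without a phrase like "The text states that". That is the style the model is being nudged to copy.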



## How I created the few-shot data

I created the question/answer pairs for the few-shot prompts manually. I asked Gemma questions, and then edited the answers. I also saved the context that's associated with each question/answer pair. The context contains 5 text chunks. During experiments I found that three to five chunks contained enough information to produce good answers.

The data is saved as a csv file named df_corrected_data.csv. It's stored in the gemma-comp-data dataset that's attached to this notebook.

Example from the few-shot data:

<b>User Question:</b> What is the min age limit to use Kaggle?<br>
<b>Original response:</b> The text states that the minimum age limit to use Kaggle is 13 years old.<br>
<b>Edited response:</b> The minimum age limit to use Kaggle is 13 years old.

## Load the few-shot data

In [55]:
# Load the few shot data into a pandas dataframe
df_fshot = pd.read_csv(FEW_SHOT_DATA_PATH)

def convert_to_list(x):
    
    # Convert the string to a list: '[...]' to [...]
    x_as_list = ast.literal_eval(x)
    
    return x_as_list

# Convert each item in the context column from a string to a 
# python list i.e. '[...]' to [...]
df_fshot['gem_context'] = df_fshot['gem_context'].apply(convert_to_list)

df_fshot.head()

Unnamed: 0,query,gem_context,response,corrected_text
0,Who is the CEO of kaggle?,[{Kaggle Team Members (Employees) and their pr...,The text states that D. Sculley is the CEO of ...,D. Sculley is the CEO of Kaggle.
1,What is Kaggle?,[{{Kaggle FAQ} What is Kaggle? Kaggle is a pla...,Kaggle is a platform for data science and mach...,Kaggle is a platform for data science and mach...
2,When was Kaggle founded?,[{{Kaggle History} Kaggle is a platform that h...,Kaggle was founded in 2010 by Anthony Goldbloo...,Kaggle was founded in 2010 by Anthony Goldbloo...
3,What info does kaggle collect about me?,[{{Kaggle Privacy Policy} Information Kaggle C...,Kaggle collects information to provide better ...,Kaggle collects information to provide better ...
4,What behaviors are prohibited on Kaggle?,[{{Kaggle Community Guidelines} Enforcement an...,"Sure, here are the behaviors that are prohibit...",Here are the behaviors that are prohibited on ...


In [56]:
print(df_fshot.shape)
df_fshot.tail(3)

(7, 4)


Unnamed: 0,query,gem_context,response,corrected_text
4,What behaviors are prohibited on Kaggle?,[{{Kaggle Community Guidelines} Enforcement an...,"Sure, here are the behaviors that are prohibit...",Here are the behaviors that are prohibited on ...
5,What is the min age limit to use Kaggle?,[{{Kaggle Privacy Policy} Our Commitment to Ch...,The text states that the minimum age limit to ...,The minimum age limit to use Kaggle is 13 year...
6,Are any members of the kaggle team foodies?,[{{Kaggle team member} Andrew Wang Developer A...,"Sure, the text indicates that Andrew, Brandon,...","Andrew, Brandon, and Kestin are foodies. They ..."


## Create three few-shot prompts

Here we will create three prompts using the few-shot data. We will prepend these three prompts to the prompt that contains the question from the user, as shown in the example few-shot prompt in the code cell below.

These are the three questions that are part of the few-shot prompt:
- Who is the CEO of kaggle? (index 0)
- What is the min age limit to use Kaggle? (index 5)
- Are any members of the kaggle team foodies? (index 6)

The context (gem_context) and corrected answer (corrected_text) for each question (query) are included in the prompt.


You will note that I've added the following sentence to the prompt:<br>
*Think and write your step-by-step reasoning before responding*<br>

Instructing a model to approach problems in this way is called chain-of-thought prompting. Adding this instruction to the prompt guides the model to break the problem down into smaller steps and to then go one step at a time.
Chain-of-thought prompting can improve the performance of LLMs when solving problems that require reasoning.

In [62]:
# Example with three few-shot prompts

prompt = f"""

    <start_of_turn>user
    Context: {df_fshot.loc[0, 'gem_context']}
    Question: {df_fshot.loc[0, 'query']}<end_of_turn>
    <start_of_turn>model
    {df_fshot.loc[0, 'corrected_text']}<end_of_turn>
    
    <start_of_turn>user
    Context: {df_fshot.loc[5, 'gem_context']}
    Question: {df_fshot.loc[5, 'query']}<end_of_turn>
    <start_of_turn>model
    {df_fshot.loc[5, 'corrected_text']}<end_of_turn>
    
    <start_of_turn>user
    Context: {df_fshot.loc[6, 'gem_context']}
    Question: {df_fshot.loc[6, 'query']}<end_of_turn>
    <start_of_turn>model
    {df_fshot.loc[6, 'corrected_text']}<end_of_turn>
    
    <start_of_turn>user
    Think and write your step-by-step reasoning before responding.
    
    Context: {context_list}
    Question: {query_text}<end_of_turn>
    <start_of_turn>model
    """

# prompt

Let's put everything into a function called get_gemma_response().

In [58]:
def get_gemma_response(query_text, context_list):
    
    prompt = f"""<start_of_turn>user
    Context: {df_fshot.loc[0, 'gem_context']}
    Question: {df_fshot.loc[0, 'query']}<end_of_turn>
    <start_of_turn>model
    {df_fshot.loc[0, 'corrected_text']}<end_of_turn>
    <start_of_turn>user
    Context: {df_fshot.loc[5, 'gem_context']}
    Question: {df_fshot.loc[5, 'query']}<end_of_turn>
    <start_of_turn>model
    {df_fshot.loc[5, 'corrected_text']}<end_of_turn>
    <start_of_turn>user
    Context: {df_fshot.loc[6, 'gem_context']}
    Question: {df_fshot.loc[6, 'query']}<end_of_turn>
    <start_of_turn>model
    {df_fshot.loc[6, 'corrected_text']}<end_of_turn>
    <start_of_turn>user
    Think and write your step-by-step reasoning before responding.
    
    Context: {context_list}
    Question: {query_text}<end_of_turn>
    <start_of_turn>model
    """
    
    
    # Get a natural language response from Gemma
    # Tokenize the prompt
    inputs = tokenizer(prompt, return_tensors="pt").to(DEVICE)
    # Generate the outputs from prompt
    generate_ids = gemma_model.generate(**inputs, max_new_tokens=768)
    # Decode the generated output
    gemma_response = tokenizer.batch_decode(generate_ids, skip_special_tokens=True,
                                         clean_up_tokenization_spaces=False)[0]

    # Clear the memory to create space
    del prompt
    del inputs
    del generate_ids
    torch.cuda.empty_cache() 
    gc.collect()
    
    return gemma_response

Now let's ask the same two questions again and see if Gemma refers to the context when answering.

In [65]:
# Question 1: 
# Who is the new prime minister of Bangladesh?
query_text = "Who is the new prime minister of Bangladesh?"


# Run the RAG search
sorted_pred_list = vector_search_and_rerank(query_text, top_k=10)

# Choose the top reranked and sorted text chunks 
# to put in the context
context_list = sorted_pred_list[0:NUM_CHUNKS_IN_CONTEXT]

# This function includes the few-shot prompts
response = get_gemma_response(query_text, context_list)

#response

# Extract the answer
gemma_response = response.split('<start_of_turn>model')[-1]
# Remove leading and trailing spaces
gemma_response = gemma_response.strip()
# Remove the '<end_of_turn>' token
gemma_response = gemma_response.replace('<end_of_turn>', "")
    

print('-----')
print('User:\n', query_text)
print()
print('Gemma:\n',gemma_response)

-----
User:
 Who is the new prime minister of Bangladesh?

Gemma:
 Muhammad Yunus is the new prime minister of Bangladesh.


In [66]:
# Question 2: 
# Who was the ex prime minister of Bangladesh?
query_text = "Who was the ex prime minister of Bangladesh?"

# Run the RAG search
sorted_pred_list = vector_search_and_rerank(query_text, top_k=10)

# Choose the top reranked and sorted text chunks 
# to put in the context
context_list = sorted_pred_list[0:NUM_CHUNKS_IN_CONTEXT]

# This function includes the few-shot prompts
response = get_gemma_response(query_text, context_list)

#response

# Extract the answer
gemma_response = response.split('<start_of_turn>model')[-1]
# Remove leading and trailing spaces
gemma_response = gemma_response.strip()
# Remove the '<end_of_turn>' token
gemma_response = gemma_response.replace('<end_of_turn>', "")
    

print('-----')
print('User:\n', query_text)
print()
print('Gemma:\n',gemma_response)

-----
User:
 Who was the ex prime minister of Bangladesh?

Gemma:
 Sheikh Hasina was the ex prime minister of Bangladesh. She resigned and fled to neighboring India after protesters stormed her official residence.


These answers are much better. You will notice that the response from Gemma no longer refers to the context i.e. the phrase "The text states that" is not being used. Few-shot prompting is successfully conditioning Gemma's responses.
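The answer-extraction steps that are repeated in the cells above could be collected into one small helper. This is a sketch under the assumption that the decoded response still contains the turn markers, as it does in the raw response printed earlier; `extract_answer` is a hypothetical name.

```python
def extract_answer(raw_response,
                   marker="<start_of_turn>model",
                   end_token="<end_of_turn>"):
    """Return the text of the last model turn without the turn markers."""
    # Everything after the last model marker is the generated answer
    answer = raw_response.split(marker)[-1]
    # Drop the end-of-turn token, then trim the surrounding whitespace
    answer = answer.replace(end_token, "")
    return answer.strip()


raw = (
    "<start_of_turn>user\n"
    "Question: Q?<end_of_turn>\n"
    "<start_of_turn>model\n"
    "An answer.<end_of_turn>\n"
)
print(extract_answer(raw))
```

Removing the token before stripping means that whitespace left in front of the token is also cleaned up.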

## 20.3. Use heuristics (rules) to post process Gemma's responses

Here are two example questions and answers:

<b>User:</b><br>
Is there life on Mars?<br>
<b>Gemma:</b><br>
The text does not provide information about life on Mars, therefore I cannot answer this question.<br>

<b>User:</b><br>
What are Kaggle notebooks?<br>
<b>Gemma:</b><br>
Sure, here is the answer to the question: Kaggle notebooks are a cloud-based computational environment where you can write and execute Python or R code. They support libraries for data analysis and machine learning, making them an ideal tool for experimenting with datasets directly on Kaggle.<br>


The first example shows how Gemma currently responds when it does not have information in the context to be able to answer the user's question. Instead, when the information needed to answer the question is not in the context, we want Gemma to just say: Sorry, that information is not available.

Gemma also has a habit of starting answers with the word "Sure", as in the second example above. We want answers that read as if we were talking to a person, so we don't want Gemma to start them with "Sure".

We will change the above two behaviours by simply using code to post process Gemma's responses. 

The two post processing steps are included in the function below. I've commented the code so you can see what changes are being made to Gemma's responses.

In [67]:
def post_process_gemma_response(response):
    
    # Remove leading and trailing spaces
    response = response.strip()
    
    # Initialize revised_response at the beginning
    revised_response = response  # Default case, no changes to the original response
 
    # Gemma always uses the phrase "I cannot answer this question"
    # when the answer is not available.
    text1 = 'I cannot answer this question'
    
    # If Gemma's response contains the phrase in text1 then 
    # replace the entire response with this sentence: 
    # "Sorry, that information is not available."
    if text1 in response:
        revised_response = "Sorry, that information is not available."

    # If Gemma's response starts with the word "Sure" then
    # remove all text from the word "Sure" to the first colon (':')
    elif response.startswith("Sure"):
        # REMOVE "Sure,..."
        # Check if the first word is "Sure"
        # Find the position of the first occurrence of ":"
        colon_pos = response.find(":")
        if colon_pos != -1:
            # Remove the text from "Sure" up to and including the colon.
            # The strip() call below removes any whitespace that follows it,
            # so this also works when there is no space after the colon.
            revised_response = response[colon_pos+1:]

    # Remove leading and trailing spaces
    revised_response = revised_response.strip()
    
    return revised_response

Here are the revised responses that we now get after applying post processing.

In [68]:
response = """
The text does not provide information about life on Mars, 
therefore I cannot answer this question.
"""
revised_response = post_process_gemma_response(response)

print(revised_response)

Sorry, that information is not available.


In [70]:
response = """
Sure, here is the answer to the question: 
 Sheikh Hasina was the ex prime minister of Bangladesh.
 She resigned and fled to neighboring India after 
 protesters stormed her official residence.
"""

revised_response = post_process_gemma_response(response)

print(revised_response)

Sheikh Hasina was the ex prime minister of Bangladesh.
 She resigned and fled to neighboring India after 
 protesters stormed her official residence.


These modified responses look much better. To fix the original responses we could have spent hours trying to get the perfect prompt or we could have tried more complex few-shot prompts or we could have even tried fine tuning. But, there's no guarantee that any of these approaches would work reliably. What I found when experimenting with smaller LLMs is that often when you fix one thing you break something else. 

Simply using code to modify the responses is quick and easy to implement. It also works reliably, as you will see when we evaluate the RAG system next.
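
One way to keep heuristics like these reliable is to pin them down with a few assertions. The sketch below is a simplified, standalone stand-in for post_process_gemma_response(), not the function itself. The name `clean_response`, the `REFUSAL_MARKER` text and the test cases are assumptions made for illustration.

```python
# Assumed marker for a "no information" reply; the notebook's own
# marker (text1) may be worded differently.
REFUSAL_MARKER = "The text does not"
FALLBACK = "Sorry, that information is not available."


def clean_response(response: str) -> str:
    """Simplified, hypothetical stand-in for post_process_gemma_response()."""
    text = response.strip()

    # Heuristic 1: replace a long-winded refusal with a fixed message.
    if REFUSAL_MARKER in text:
        return FALLBACK

    # Heuristic 2: drop a leading "Sure, ...:" preamble, if there is one.
    if text.startswith("Sure"):
        colon_pos = text.find(":")
        if colon_pos != -1:
            text = text[colon_pos + 1:]

    return text.strip()


# Small regression checks: each pair is (raw model text, expected clean text).
cases = [
    ("The text does not provide information about life on Mars.", FALLBACK),
    ("Sure, here is the answer: Dhaka is the capital.", "Dhaka is the capital."),
    ("Sure thing", "Sure thing"),  # no colon, so nothing is removed
    ("  Dhaka is the capital.  ", "Dhaka is the capital."),
]

for raw, expected in cases:
    assert clean_response(raw) == expected, (raw, clean_response(raw))

print("All post-processing checks passed.")
```

Re-running checks like these after every change to a heuristic makes it easier to notice when a new rule breaks an older one.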

## 21- Evaluate the RAG system

Let's create a function called run_gemma_rag_system(). This function combines everything we've covered so far: vector search and reranking, few-shot prompting and, finally, post-processing Gemma's responses using heuristics.

There will be three text chunks in the context that gets passed to Gemma i.e. NUM_CHUNKS_IN_CONTEXT = 3. The more chunks there are, the more RAM gets used. Using too many chunks can crash this notebook. During experiments I found that three to five chunks is enough to produce good answers.
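
If the chunks vary a lot in length, a fixed chunk count may not be enough to keep memory use under control. The sketch below shows one possible guard: it caps both the number of chunks and their total length. The function `select_context` and the `max_chars` budget are hypothetical and not part of this notebook. Character count is only a rough proxy for token count, and the chunks are assumed to be plain strings sorted from most to least relevant.

```python
def select_context(chunks, max_chunks=3, max_chars=4000):
    """Pick top-ranked chunks without exceeding a count or character budget."""
    selected = []
    used = 0
    for chunk in chunks:
        if len(selected) >= max_chunks:
            break
        # Always keep the top chunk, even if it is longer than the budget.
        if selected and used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    return selected


# Hypothetical reranked chunks, most relevant first.
ranked = ["a" * 1500, "b" * 1500, "c" * 1500, "d" * 100]
picked = select_context(ranked, max_chunks=5, max_chars=4000)
print(len(picked), "chunks,", sum(len(c) for c in picked), "characters")
```

Stopping at the first chunk that does not fit, rather than skipping it and trying shorter ones, keeps the context in strict relevance order.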

In [71]:
def run_gemma_rag_system(query_text):
    
    # Run the RAG search
    sorted_pred_list = vector_search_and_rerank(query_text, top_k=TOP_K)

    # Choose the top reranked and sorted text chunks 
    # to put in the context
    context_list = sorted_pred_list[0:NUM_CHUNKS_IN_CONTEXT]

    # Submit the question and its context to Gemma and
    # get a natural language answer.
    response = get_gemma_response(query_text, context_list)

    # Extract the answer
    
    # Split and select the last item in the list
    gemma_response = response.split('<start_of_turn>model')[-1]
    # Remove leading and trailing spaces
    gemma_response = gemma_response.strip()
    # Remove the '<end_of_turn>' token
    gemma_response = gemma_response.replace('<end_of_turn>', "")

    # Post process the response
    gemma_response = post_process_gemma_response(gemma_response)
    
    print()
    print('User:\n', query_text)
    print()
    print('Gemma:\n', gemma_response)
    print()
    print('-----')
    
    return gemma_response, context_list
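
The answer-extraction step inside this function can be tried on its own, without loading the model. The sketch below is a minimal standalone version. The helper name `extract_model_answer` and the sample text are assumptions, and the sample only imitates the turn markers that appear in Gemma's generated text.

```python
START_OF_MODEL_TURN = "<start_of_turn>model"
END_OF_TURN = "<end_of_turn>"


def extract_model_answer(generated_text: str) -> str:
    """Return the text of the last model turn, without control tokens."""
    # Everything after the last model-turn marker is the newest answer.
    answer = generated_text.split(START_OF_MODEL_TURN)[-1]
    # Remove the end-of-turn token first, so that strip() also removes
    # any whitespace that was sitting in front of it.
    answer = answer.replace(END_OF_TURN, "")
    return answer.strip()


# Hand-written sample that imitates the shape of the generated text.
sample = (
    "<start_of_turn>user\nWho is the ex prime minister?<end_of_turn>\n"
    "<start_of_turn>model\nSheikh Hasina.<end_of_turn>\n"
)
print(extract_model_answer(sample))
```

Splitting on the last model-turn marker matters with few-shot prompts, because the prompt itself contains earlier model turns that should not end up in the answer.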

In the cell below are the three questions that we will use to evaluate the system.

Included in the list of evaluation questions are two of the three questions that appear in the few-shot prompts:

- What is the min age limit to use Kaggle?
- Are any members of the kaggle team foodies?

I included them to see if asking a question that's part of the few-shot prompt causes any issues.

In [74]:
# Evaluation questions

eval_questions = [
    
    # FAQ
    "Who is the new prime minister of Bangladesh?",
    "Who is the ex Prime Minister Of Bangladesh?",
    "Name Some Student who put great Impact on Bangladesh recent political activity."
]

print('Num questions:', len(eval_questions))

Num questions: 3


Now let's pass all these questions to the RAG system and review Gemma's answers.
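
If you would like to keep the answers for later review instead of only printing them, the loop can also record each answer together with its latency. The sketch below is one way to do that. The function `evaluate` and the stand-in `fake_rag` are hypothetical; `fake_rag` only exists so that the sketch runs without a GPU.

```python
import time


def evaluate(questions, answer_fn):
    """Run answer_fn on each question and record the answer and latency."""
    results = []
    for i, question in enumerate(questions):
        start = time.perf_counter()
        answer = answer_fn(question)
        results.append({
            "index": i,
            "question": question,
            "answer": answer,
            "seconds": round(time.perf_counter() - start, 1),
        })
    return results


# Stand-in for the real system so the sketch runs without a GPU.
def fake_rag(question):
    return f"(answer to: {question})"


results = evaluate(["Q1?", "Q2?"], fake_rag)
avg = sum(r["seconds"] for r in results) / len(results)
print(f"{len(results)} questions, avg {avg:.1f} s per question")
```

With the real system, `answer_fn` could be `lambda q: run_gemma_rag_system(q)[0]`, since run_gemma_rag_system() returns the answer together with the context.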

In [75]:
# Start timing
start_time = time.time()

# Pass each question to the RAG system.

for i, question in enumerate(eval_questions):
    print(f"Question {i}")
    answer, context = run_gemma_rag_system(question)
    


# Get the time taken
elapsed_time = timer(start_time)
total_time = round(elapsed_time/60, 1)
time_per_question = elapsed_time/len(eval_questions)
time_per_question = round(time_per_question, 1)

print('Evaluation complete.')
print(f'Total time: {total_time} minutes')
print(f"Avg time per question: {time_per_question} seconds")

Question 0

User:
 Who is the new prime minister of Bangladesh?

Gemma:
 Muhammad Yunus is the new prime minister of Bangladesh.

-----
Question 1

User:
 Who is the ex Prime Minister Of Bangladesh?

Gemma:
 The ex Prime Minister Of Bangladesh is Sheikh Hasina.

-----
Question 2

User:
 Name Some Student who put great Impact on Bangladesh recent political activity.

Gemma:
 Nahid Islam is one of the students who put great impact on Bangladesh recent political activity.

-----
Evaluation complete.
Total time: 2.2 minutes
Avg time per question: 44.8 seconds


## 23- Enter your question

To try out this RAG system please enter your question below. You can also print and review the three-chunk context that Gemma is referencing in order to answer your question.

In [86]:
# Start timing
start_time = time.time()

################################
# Please Enter your Question here

question = eval_questions[2]

################################

# Run the RAG system
answer, context = run_gemma_rag_system(question)


# Get the inference time
elapsed_time = timer(start_time)
print(f"Time taken: {elapsed_time} seconds")


User:
 Name Some Student who put great Impact on Bangladesh recent political activity.

Gemma:
 Nahid Islam is one of the students who put great impact on Bangladesh recent political activity.

-----
Time taken: 54.2 seconds


In [87]:
# Print the context that gemma is referencing
# to answer the question.
for item in context:
    print()
    print(item)


{'NEW DELHI, India — Bangladesh’s Nobel laureate Muhammad Yunus is set to return to Dhaka on Thursday to be sworn in as his country’s interim leader, after former Prime Minister Sheikh Hasina resigned and fled to India Monday following widespread protests against her government. Rioters burned down police stations and attacked homes and temples of minority Hindus in the protests. “The whole edifice has collapsed,” said Jyoti Rahman, an Australia-based economist who writes on Bangladeshi politics and economy, referring to Hasina’s government. The Bangladesh military’s swift appointment of Yunus was a demand of students who led the protests that triggered the former prime minister’s resignation. “Any government other than the one we recommended would not be accepted,” Reuters quoted one of the student leaders, Nahid Islam, as writing on Facebook.'}

{"Bangladeshi Prime Minister Sheikh Hasina has resigned after weeks of deadly anti-government protests, putting an end to her more than two

## 24- Is this system robust?

Let's test the RAG system.

We will submit the same question twice. In the second question we will change the word "there" to "their". This change will make the second question grammatically incorrect.

<b>Question1:</b> "Are there any kaggle employees who have pet cats?"<br>
<b>Question2:</b> "Are their any kaggle employees who have pet cats?"

Let's see how the system responds.

In [None]:
# there

question = "Are there any kaggle employees who have pet cats?"

answer, context = run_gemma_rag_system(question)

In [None]:
# Print the context that gemma is referencing
# to answer the question.
for item in context:
    print()
    print(item)

<hr>
Gemma answered that Kinnera and Yuting have pet cats. According to the context, this is correct. However, the answer also contains hallucinations. Gemma said that Kinnera has two cats, but the context shows that Kinnera has only one. Gemma also said that Yuting has two cats. The context does say that Yuting has more than one cat, but it does not say that Yuting has exactly two.

Now let's change "there" to "their" and ask the question again.


In [None]:
# their

# question = ""

# answer, context = run_gemma_rag_system(question)

In [None]:
# Print the context that gemma is referencing
# to answer the question.
# for item in context:
#     print()
#     print(item)

<hr>
Changing the word "there" to "their" caused Gemma to output a different response. Also, now Gemma incorrectly tells us that Mark has pet cats. In the context above we see that Mark only has two dogs.



From this test we can learn two things:
1. The vector search and reranking parts of the system work very well i.e. the context contains all the information needed to answer the question correctly.
2. There is a weakness in the generative part of the system. Gemma can make errors when extracting fine-grained information from a given context. Gemma is also sensitive to small changes to an input question, like changing the word "there" to "their".
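
Errors with counts, like the ones seen in this test, can sometimes be caught with another simple heuristic. The sketch below flags any number in an answer that never appears in the context. It is an assumption-laden illustration, not part of this notebook: the function `unsupported_numbers`, the word list and the example sentences are all made up. The check is also coarse, because a number that appears anywhere in the context counts as supported, even if it refers to something else.

```python
import re

NUMBER_WORDS = {"one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"}


def unsupported_numbers(answer: str, context: str) -> set:
    """Return number tokens that occur in the answer but not in the context."""
    def numbers(text):
        tokens = re.findall(r"[a-z]+|\d+", text.lower())
        return {t for t in tokens if t.isdigit() or t in NUMBER_WORDS}

    return numbers(answer) - numbers(context)


# Made-up context and answers, for illustration only.
context = "Alice has a cat. Bob has two dogs."
print(unsupported_numbers("Alice has three cats.", context))
print(unsupported_numbers("Bob has two dogs.", context))
```

An answer that triggers the check could be flagged for review or replaced with a more cautious response.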

It's important to keep in mind that we are running Gemma in 4-bit mode. This could lower the quality of its output.

Is this RAG system robust? The vector search and reranking parts of the system are robust. But the text generation part is not robust. It can produce errors when asked to extract highly specific information from a given context.
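
The there/their test can be turned into a small repeatable check. The sketch below asks several variants of the same question and groups them by the answer they produce. The function `consistency_report` and the stand-in `fake_rag` are hypothetical, and `fake_rag` is deliberately sensitive to one word so that the sketch runs without the model. Answers are compared by exact match after normalising case and spacing, so two differently worded but equivalent answers would still count as different.

```python
def consistency_report(variants, answer_fn):
    """Ask every variant of a question and group variants by normalised answer."""
    groups = {}
    for question in variants:
        # Lower-case and collapse whitespace before comparing answers.
        key = " ".join(answer_fn(question).lower().split())
        groups.setdefault(key, []).append(question)
    return groups


# Stand-in for the real system: deliberately sensitive to a single word.
def fake_rag(question):
    return "Yes." if "there" in question else "No."


variants = [
    "Are there any kaggle employees who have pet cats?",
    "Are their any kaggle employees who have pet cats?",
]
report = consistency_report(variants, fake_rag)
print("consistent" if len(report) == 1 else f"{len(report)} different answers")
```

A single group means every variant produced the same answer. More than one group points to the kind of sensitivity described above.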

## 25- Conclusion

The task was to use Gemma to answer common questions about Bangladesh politics. This notebook has demonstrated how to accomplish that task by using gemma-7b-it with a RAG system that incorporates few-shot prompting and heuristics.

I would like to thank Google and Kaggle for hosting this interesting competition. 

## 26- Reference Notebooks

- [Create AI-generated essays | Gemma](https://www.kaggle.com/code/minhsienweng/create-ai-generated-essays-gemma/notebook)<br>
by Min-Hsien Weng

- [Data Science AI Assistant with Gemma 2b-it](https://www.kaggle.com/code/lucamassaron/data-science-ai-assistant-with-gemma-2b-it/notebook#4.-Wrapping-up-everything)<br>
by Luca Massaron

- [Part 1 - Build an ArXiv RAG search system w FAISS](https://www.kaggle.com/code/vbookshelf/part-1-build-an-arxiv-rag-search-system-w-faiss)<br>
by vbookshelf

## 27- Get a list of all packages


In [None]:
# Create a requirements.txt file

!pip freeze > requirements.txt

In [None]:
!ls