<a href="https://colab.research.google.com/github/arafat04/text-generation-with-LLM/blob/main/4_Retrieval_Augmented_Generation_(RAG)_Arafat.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 4. More about Prompting

Welcome to the Fourth Lesson of the NNLG Tutorial!

As we have seen in previous tutorials, Llms can be fine-tuned to achieve several common tasks. However, when we need to access specific knowledge (especially when said knowledge may change over time) it's possible to build a language model-based system that accesses external knowledge sources to complete tasks, such a system is typically called Retrieval-Augmented Generation (RAG). RAG combines an information retrieval component with a text generator model.

In this session, we look at , how to query a database to enrich the input to generation, and generate from the original and the retrieved input.

Let's start by loading the model ([`HuggingFaceTB/SmolLM2-1.7B-Instruct`](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct)) and  the trivia dataset ([`mandarjoshi/trivia_qa`](https://huggingface.co/datasets/mandarjoshi/trivia_qa)) from the previous lesson.

In [None]:
! pip install transformers --quiet
! pip install datasets --quiet
! pip install nltk --quiet
! pip install sentence_transformers --quiet
! pip install faiss-cpu --quiet
# ! pip install faiss-gpu --quiet
! pip install wikipedia --quiet

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m3.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.3/179.3 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m134.8/134.8 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.1/194.1 kB[0m [31m5.4 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.[0m[31m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m27.5/27.5 MB[0m [31m30.3 MB/s[0m eta [36m0:

In [None]:
# Import Pipeline for the LLM and Pytorch to find the best available device
from transformers import pipeline
import torch

# Find the best available device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # Use GPU if available, otherwise use CPU

# Load the model
model_identifier = 'HuggingFaceTB/SmolLM2-1.7B-Instruct'
llm = pipeline(model=model_identifier, device=device)

config.json:   0%|          | 0.00/792 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/3.42G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.76k [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/801k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/466k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/2.10M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/655 [00:00<?, ?B/s]

In [None]:
# Import load_dataset and Dataset
from datasets import load_dataset, Dataset

# Instantiate the Trivia QA dataset in streaming model
dataset_raw = load_dataset(
    'mandarjoshi/trivia_qa',
    'rc',
    split='train',
    streaming=True
)

# # Preprocess the dataset
def preprocess_trivia_qa(sample):
    wiki_context = []
    for title, context in zip(sample['entity_pages']['title'], sample['entity_pages']['wiki_context']):
        wiki_context.append(tuple([title, context]))
    new_sample = {
        'wiki_context':wiki_context,
        'answer':sample['answer']['value']
    }
    return new_sample

dataset = dataset_raw.map(
    preprocess_trivia_qa,
    remove_columns=[
        'question_id',
        'question_source',
        'entity_pages',
        'search_results'
    ]
)

# Take the first 8 elements ofthe dataset
dataset = dataset.take(8)

# Convert from IterableDataset to Dataset
dataset = Dataset.from_generator(lambda: iter(dataset), features=dataset.features)

dataset[0]

Resolving data files:   0%|          | 0/26 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/26 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

{'question': 'Which American-born Sinclair won the Nobel Prize for Literature in 1930?',
 'answer': 'Sinclair Lewis',
 'wiki_context': []}

In [None]:
print(type(dataset))

<class 'datasets.arrow_dataset.Dataset'>


In [None]:
print(len(dataset[1]['wiki_context']))

2


In [None]:
print(len(dataset[3]['wiki_context']))

3


In [None]:
# Get the first item from the dataset
first_item = next(iter(dataset_raw))

print(type(first_item))
# Iterate over the keys and print the values
for key, value in first_item.items():
    print(f'{key}: {value}')

<class 'dict'>
question: Which American-born Sinclair won the Nobel Prize for Literature in 1930?
question_id: tc_1
question_source: http://www.triviacountry.com/
entity_pages: {'doc_source': [], 'filename': [], 'title': [], 'wiki_context': []}
search_results: {'description': ['The Nobel Prize in Literature 1930 Sinclair ... The Nobel Prize in Literature 1930 was awarded to ... nobelprize.org/nobel_prizes/literature/laureates/1930/>', 'Why Don’t More Americans Win the Nobel Prize? By . ... When the Nobel Prize in Literature was awarded to Sinclair ... In 1930, Lewis told his Nobel audience that ...', '... Sauk Centre native Sinclair Lewis became the first American to be awarded a Nobel Prize for Literature. ... in 1930, Sauk Centre native Sinclair Lewis became the ...', 'Sinclair Lewis - Nobel Prize in Literature, 1930 (20 books) Type ... Literature Fiction Classics Short Stories Essays American literature Nobel Prize Uploaded: 2015 ...', "The Nobel Prize in Literature 1930 Sinclair Le

In [None]:
dataset[1]["wiki_context"][1]

['Judi Dench',
 'Dame Judith Olivia "Judi" Dench,  (born 9 December 1934)  is an English actress and author.  Dench made her professional debut in 1957 with the Old Vic Company. Over the following few years she performed in several of Shakespeare\'s plays in such roles as Ophelia in Hamlet, Juliet in Romeo and Juliet and Lady Macbeth in Macbeth. Although most of her work during this period was in theatre, she also branched into film work, and won a BAFTA Award as Most Promising Newcomer. She drew strong reviews for her leading role in the musical Cabaret in 1968.\n\nOver the next two decades, Dench established herself as one of the most significant British theatre performers, working for the National Theatre Company and the Royal Shakespeare Company. She achieved success in television during this period, in the series A Fine Romance from 1981 until 1984, and in 1992 with a starring role in the romantic comedy series As Time Goes By. Her film appearances were infrequent and included sup

## 4.1 Creating a vector Database

### Collecting data

Some of the entries from the dataset include a `wiki_context` with the wikipedia page of entities related to the question. We can use this information as the source of our vector database. However, we need to process the data;

First, it would be useful to split the large wikipedia document into smaller chunks, since our model has a limited input size.

Second, we might want to add the title of the page to each chunk, to provide context.

In [None]:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')

facts = []
for sample in dataset:
    for title, page in sample['wiki_context']:
        for paragraph in page.replace(r'\n+', '\n').split('\n'):
            for sentence in nltk.sent_tokenize(paragraph):
                if len(nltk.word_tokenize(sentence)) > 10:
                    facts.append(f'{title}: {sentence}')

print(facts[0])
len(facts)

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


England: England is a country that is part of the United Kingdom.


2863

In [None]:
print(facts[2000])

Super Bowl XX: McMahon suffered a strained glute as the result of a hit taken in the NFC Championship Game and flew his acupuncturist into New Orleans to get treatment.


### Encoding Data

Now that we have collected all th facts, we can use a `SentenceTransformer` to convert the different text into vectors.

Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. It can be used to compute embeddings using Sentence Transformer models or to calculate similarity scores using Cross-Encoder models.

For now, lets use the following model: [`avsolatorio/GIST-small-Embedding-v0`](https://huggingface.co/avsolatorio/GIST-small-Embedding-v0). You can check the [Massive Textual Embedding Benchmark leaderboard](https://huggingface.co/spaces/mteb/leaderboard) from huggingface to find more models.

In [None]:
from sentence_transformers import SentenceTransformer
encoder = SentenceTransformer('avsolatorio/GIST-small-Embedding-v0')

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [None]:
encoder.encode(["Hello my name is William"]).shape

(1, 384)

In [None]:
encoded_facts = encoder.encode(facts, show_progress_bar=True)

Batches:   0%|          | 0/90 [00:00<?, ?it/s]

In [None]:
encoded_facts[2000]

array([-3.08835022e-02,  1.35429371e-02,  1.83294935e-03, -3.60750630e-02,
       -3.27222375e-03,  4.73675020e-02,  3.79150994e-02,  9.34317559e-02,
       -1.43485172e-02, -3.31995673e-02,  1.24484312e-03, -2.44582508e-04,
        5.78271039e-02, -3.68466205e-03, -7.23100454e-02, -1.48698017e-02,
        3.97444479e-02,  1.82554163e-02, -7.12054297e-02,  1.36031993e-02,
       -7.15836417e-04,  2.28580274e-02,  2.38290299e-02, -7.79643580e-02,
       -1.49962623e-02, -9.36619844e-03, -3.20653282e-02, -5.87035483e-03,
       -5.93632236e-02, -1.93749309e-01,  6.75767427e-04,  2.13626772e-02,
        1.51219051e-02, -1.52254887e-02,  2.11444125e-02, -6.20544031e-02,
       -1.69345867e-02, -2.61854474e-02,  4.16663140e-02,  2.81720292e-02,
        3.04318275e-02,  5.15900590e-02, -3.96664767e-03, -6.84062243e-02,
        1.72300003e-02, -2.15205587e-02, -3.84414755e-02, -9.00626369e-03,
        1.54119894e-01,  5.26492335e-02,  1.36414561e-02, -1.61318481e-02,
        4.94889021e-02,  

### Storing data

Once the data has been encoded as vectors, we can save it in a vectordatabase that is easy to query. In our case, we will use the `IndexFlatL2` from the [faiss](https://github.com/facebookresearch/faiss) library that computes simple L2 euclidean distance between vectors.

FAISS (Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

In [None]:
import faiss

encoding_dimension = encoded_facts.shape[1]
index = faiss.IndexFlatL2(encoding_dimension)
index.add(encoded_facts)

### Querying the database

Now, every time we get a new question, all we need to do is encode the question and query the index for the `k` closest vectors. Then, we can retrieve the text from the original source.

In [None]:
[dataset[5]["question"]]

['Who won Super Bowl XX?']

In [None]:
question_encoding = encoder.encode([dataset[5]['question']])
distances, positions = index.search(question_encoding, k=3)
distances, positions

(array([[0.20137188, 0.21171883, 0.21670619]], dtype=float32),
 array([[1925, 1926, 1978]]))

In [None]:
relevant_facts = []
for p in positions[0]:
    relevant_facts.append(facts[p])
relevant_facts

['Super Bowl XX: Super Bowl XX was an American football game between the National Football Conference (NFC) champion Chicago Bears and the American Football Conference (AFC) champion New England Patriots to decide the National Football League (NFL) champion for the 1985 season.',
 'Super Bowl XX: The Bears defeated the Patriots by the score of 46–10, capturing their first NFL championship since 1963, three years prior to the birth of the Super Bowl.',
 'Super Bowl XX: Wide receiver Stanley Morgan provided the team with a good deep threat, catching 39 passes for 760 yards and 5 touchdowns.']

Now we can add this to our existing generation flow.

In [None]:
%%time
for sample in dataset:

    if sample['wiki_context']:
        question_encoding = encoder.encode([sample['question']])
        distances, positions = index.search(question_encoding, k=3)
        relevant_facts = []
        for p in positions[0]:
            relevant_facts.append(facts[p])
        relevant_facts = '\n'.join(relevant_facts)+'\n'
    else:
        relevant_facts = ''

    prompt_in_chat_format = [
         {
            'role':'system',
            'content':'''You are a helpful assistant.'''
        },
         {
            'role':'user',
            'content':f'''Using the following information:{relevant_facts}Answer this question: {sample['question']}'''
         }
    ]

    prompt = llm.tokenizer.apply_chat_template(prompt_in_chat_format, tokenize=False, add_generation_prompt=True)

    generation = llm(prompt, max_new_tokens = 32)

    new_text = generation[0]['generated_text'][len(prompt):].strip()

    print('Question: '+sample['question'])
    print('Relevant Facts:\n'+relevant_facts.strip())

    print('TriviaQA Answer: '+sample['answer'])
    print('Generated Answer: '+new_text)
    print(f'\n-----------\n')

Question: Which American-born Sinclair won the Nobel Prize for Literature in 1930?
Relevant Facts:

TriviaQA Answer: Sinclair Lewis
Generated Answer: The American-born Sinclair who won the Nobel Prize for Literature in 1930 was Sinclair Lewis.

-----------

Question: Where in England was Dame Judi Dench born?
Relevant Facts:
Judi Dench: Dench was born in Heworth, North Riding of Yorkshire.
Judi Dench: Dame Judith Olivia "Judi" Dench,  (born 9 December 1934)  is an English actress and author.
Judi Dench: Her father, Reginald Arthur Dench, a doctor, was born in Dorset, and later moved to Dublin, where he was raised.
TriviaQA Answer: York
Generated Answer: Dame Judi Dench was born in Heworth, North Riding of Yorkshire, England.

-----------

Question: In which decade did Billboard magazine first publish and American hit chart?
Relevant Facts:

TriviaQA Answer: 30s
Generated Answer: Billboard magazine first published and American hit chart in the 1950s.

-----------

Question: From which c

## 4.3 Exercise

- Modify the prompt to ask the model not to produce an answer if no relevant fact is provided. Look at the results: does it work or does the LLM always produce an answer regardless ?

- Modify the prompt to produce short answers like we did in the previous tutorial (eg. you can use Few-shot)

- Create your own RAG
   - Select one or more documents which you want to use as information database (e.g., wikipedia documents about astronauts or a dataset from Huggingface)
   - Create a testset consisting of (question, answer) pairs that can be answered based on your information base
   - Adapt the colabcode to create a RAG model that can answer questions about your information base
   - Run your model on the questions from your testset and look at the retrieved informaton and at the generated answers. Is the retrieved information relevant ? Is it reflected in the generated answer ?

**Create your own RAG**

Here, we have chosen a wikipedia page containing information about "Bangladesh". It's the English version of the page

In [None]:
# Your Code Here
import wikipedia
wikipedia.set_lang("en")
page = wikipedia.page("Bangladesh")
print(page.content)

Bangladesh, officially the People's Republic of Bangladesh, is a country in South Asia. It is the eighth-most populous country in the world and seventh most densely populated with a population of 173,562,364 in an area of 148,460 square kilometres (57,320 sq mi). Bangladesh shares land borders with India to the north, west, and east, and Myanmar to the southeast. To the south, it has a coastline along the Bay of Bengal. To the north, it is separated from Bhutan and Nepal by the Siliguri Corridor, and from China by the mountainous Indian state of Sikkim. Dhaka, the capital and largest city, is the nation's political, financial, and cultural centre. Chittagong is the second-largest city and the busiest port. The official language is Bengali, with Bangladeshi English also used in government.
Bangladesh is part of the historic and ethnolinguistic region of Bengal, which was divided during the Partition of British India in 1947 as the eastern enclave of the Dominion of Pakistan. The country

**Extraction of the sentences from the page to make the database for question answering:**
Here we are extracting each sentences and add them to a list using this format: </br>
`title_of_the_section+:subsections*:sentence_of_the_section` </br>
for example: each sentences in the first section that gives the summary about Bangladesh is added to the list in this way:</br>
`Bangladesh: Bangladesh, officially the People's Republic of Bangladesh, is a country in South Asia.` </br>
If there were one or more than one subections under a section, then the format would look like this:</br>
`Bangladesh: Demographics: Language: These include Chittagonian which is spoken in the south-eastern Chittagong region, and Sylheti spoken in the north-eastern region of Sylhet.`
</br>
*This is done due to provide context.*



In [None]:
import nltk

nltk.download('punkt_tab')
nltk.download('punkt')

def get_sentences(text, level=1, title='Bangladesh'):
  facts = []
  sections = text.split(f'''\n{"="*(level+1)} ''')
  #print(len(sections))
  #print(f"type of the sections: {type(sections)} and length of the sections: {len(sections)}")
  #print(sections[0])
  for section in sections:
    if f''' {"="*(level+1)}\n''' in section:
      #print(f"if: this section has: ={level+1} and section is: \n{section}")
      section_title, section_content = section.split( f''' {"="*(level+1)}\n''')
      section_sentences = get_sentences(section_content, level+1, title=section_title)
      for sentence in section_sentences:
         facts.append(f'{title}: {sentence}')
    else:
      #print(f"else: this section has: ={level+1} and section is: \n{section}")

      for paragraph in text.replace(r'\n+', '\n').split('\n'):
          for sentence in nltk.sent_tokenize(paragraph):
              if len(nltk.word_tokenize(sentence)) > 10:
                  facts.append(f'{title}: {sentence}')
  return facts


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [None]:
fact_bgd_en = get_sentences(page.content)
fact_bgd_en


["Bangladesh: Bangladesh, officially the People's Republic of Bangladesh, is a country in South Asia.",
 'Bangladesh: It is the eighth-most populous country in the world and seventh most densely populated with a population of 173,562,364 in an area of 148,460 square kilometres (57,320 sq mi).',
 'Bangladesh: Bangladesh shares land borders with India to the north, west, and east, and Myanmar to the southeast.',
 'Bangladesh: To the south, it has a coastline along the Bay of Bengal.',
 'Bangladesh: To the north, it is separated from Bhutan and Nepal by the Siliguri Corridor, and from China by the mountainous Indian state of Sikkim.',
 "Bangladesh: Dhaka, the capital and largest city, is the nation's political, financial, and cultural centre.",
 'Bangladesh: The official language is Bengali, with Bangladeshi English also used in government.',
 'Bangladesh: Bangladesh is part of the historic and ethnolinguistic region of Bengal, which was divided during the Partition of British India in 19

In [None]:
print(len(fact_bgd_en))
print(type(fact_bgd_en))

1367
<class 'list'>


In [None]:
print(fact_bgd_en[0])
print(type(fact_bgd_en[0]))

Bangladesh: Bangladesh, officially the People's Republic of Bangladesh, is a country in South Asia.
<class 'str'>


In [None]:
print(fact_bgd_en[973])

Bangladesh: Demographics: Language: These include Chittagonian which is spoken in the south-eastern Chittagong region, and Sylheti spoken in the north-eastern region of Sylhet.


**Create a testset consisting of (question, answer) pairs that can be answered based on your information base**


In [None]:
from datasets import Dataset
import pandas as pd
# Define the data as a list of dictionaries
test_data_eng = [
    {
        'question': 'In which region of the world is Bangladesh located?',
        'answer': 'South Asia.'
    },
    {
        'question': 'To which historical period does the history of Bangladesh date back over four millennia?',
        'answer': 'The Chalcolithic period.'
    },
    {
        'question': 'Which ruler\'s expansion of the Bengal Sultanate led to economic prosperity and military dominance, making Bengal known to Europeans as the richest country to trade with?',
        'answer': 'Shamsuddin Ilyas Shah.'
    },
    {
        'question': 'When did Bangladesh declare independence from Pakistan, leading to the Bangladesh Liberation War?',
        'answer': 'March 26, 1971'
    }
]

# Create the Dataset object
testdata_bgd_en = Dataset.from_pandas(pd.DataFrame(test_data_eng))

# Display the Dataset
print(testdata_bgd_en)
print(type(testdata_bgd_en))


Dataset({
    features: ['question', 'answer'],
    num_rows: 4
})
<class 'datasets.arrow_dataset.Dataset'>


**Adapt the colabcode to create a RAG model that can answer questions about your information base**

**Creating a vector Database**</br>
As we already have the wikipedia database and the facts are stored in a list, we can now perform the operations where we will encode the data.

**Encoding Data**
we can use a SentenceTransformer to convert the different text into vectors.

Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. It can be used to compute embeddings using Sentence Transformer models or to calculate similarity scores using Cross-Encoder models.

For now, lets use the following model: avsolatorio/GIST-small-Embedding-v0

In [None]:
from sentence_transformers import SentenceTransformer
# encoder = SentenceTransformer('avsolatorio/GIST-small-Embedding-v0') # this is used for English version of the wikipedia page
encoder = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2') # this is used for the Bengali version of the wikipedia page

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/4.13k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/723 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.11G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/402 [00:00<?, ?B/s]

sentencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.08M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
encoded_facts = encoder.encode(fact_bgd_en, show_progress_bar=True)

Batches:   0%|          | 0/43 [00:00<?, ?it/s]

In [None]:
import faiss

encoding_dimension = encoded_facts.shape[1]
index = faiss.IndexFlatL2(encoding_dimension)
index.add(encoded_facts)

In [2]:
#encoded_facts[1000]

**Storing data**</br>
Once the data has been encoded as vectors, we can save it in a vectordatabase that is easy to query. In our case, we will use the IndexFlatL2 from the faiss library that computes simple L2 euclidean distance between vectors.

FAISS (Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.

### Querying the database

Now, every time we get a new question, all we need to do is encode the question and query the index for the `k` closest vectors. Then, we can retrieve the text from the original source.

In [None]:
[testdata_bgd_en[1]["question"]]

['To which historical period does the history of Bangladesh date back over four millennia?']

In [None]:
question_encoding = encoder.encode([testdata_bgd_en[1]['question']])
distances, positions = index.search(question_encoding, k=3)
distances, positions

relevant_facts = []
for p in positions[0]:
    relevant_facts.append(fact_bgd_en[p])
relevant_facts

['Bangladesh: The history of Bangladesh dates back over four millennia to the Chalcolithic period.',
 'Bangladesh: History: The history of Bangladesh dates back over four millennia to the Chalcolithic period.',
 "Bangladesh: History: The region's early history was characterized by a succession of Hindu and Buddhist kingdoms and empires that fought for control over the Bengal region."]

In [None]:
import time
def generate_answer(dataset, facts,llm):
  start_time = time.time()  # Record the start time
  for sample in dataset:


    question_encoding = encoder.encode([sample['question']])
    distances, positions = index.search(question_encoding, k=3)
    relevant_facts = []
    for p in positions[0]:
        relevant_facts.append(facts[p])
    relevant_facts = '\n'.join(relevant_facts)+'\n'


    prompt_in_chat_format = [
         {
            'role':'system',
            'content':'''You are a helpful assistant.'''
        },
         {
            'role':'user',
            'content':f'''Using the following information:{relevant_facts}Answer this question: {sample['question']}'''
         }
    ]

    prompt = llm.tokenizer.apply_chat_template(prompt_in_chat_format, tokenize=False, add_generation_prompt=True)

    generation = llm(prompt, max_new_tokens = 32)

    new_text = generation[0]['generated_text'][len(prompt):].strip()

    print('Question: '+sample['question'])
    print('Relevant Facts:\n'+relevant_facts.strip())

    print('Answer: '+sample['answer'])
    print('Generated Answer: '+new_text)
    print(f'\n-----------\n')
  end_time = time.time()  # Record the end time
  print(f"Execution time: {end_time - start_time} seconds")  # Print the execution time

In [None]:
generate_answer(testdata_bgd_en, fact_bgd_en)

Question: In which region of the world is Bangladesh located?
Relevant Facts:
Bangladesh: Bangladesh is in South Asia on the Bay of Bengal.
Bangladesh: Geography: Bangladesh is in South Asia on the Bay of Bengal.
Bangladesh: Geography: Bangladesh is located in the Indomalayan realm, and lies within four terrestrial ecoregions: Lower Gangetic Plains moist deciduous forests, Mizoram–Manipur–Kachin rain forests, Sundarbans freshwater swamp forests, and Sundarbans mangroves.
Answer: South Asia.
Generated Answer: Bangladesh is located in the Indomalayan realm.

-----------

Question: To which historical period does the history of Bangladesh date back over four millennia?
Relevant Facts:
Bangladesh: The history of Bangladesh dates back over four millennia to the Chalcolithic period.
Bangladesh: History: The history of Bangladesh dates back over four millennia to the Chalcolithic period.
Bangladesh: History: The region's early history was characterized by a succession of Hindu and Buddhist ki

### comment about the generated answer by the model:
The results for English version of the wikipedia page that contains the information about Bangladesh was quite good. For the 4 questions, it provided correct answers and the use of the retrieval system played a significant role. If we look into this question:</br>
Question: In which region of the world is Bangladesh located?
Relevant Facts:
Bangladesh: Bangladesh is in South Asia on the Bay of Bengal.
Bangladesh: Geography: Bangladesh is in South Asia on the Bay of Bengal.
Bangladesh: Geography: Bangladesh is located in the Indomalayan realm, and lies within four terrestrial ecoregions: Lower Gangetic Plains moist deciduous forests, Mizoram–Manipur–Kachin rain forests, Sundarbans freshwater swamp forests, and Sundarbans mangroves.
Answer: South Asia.
Generated Answer: Bangladesh is located in the Indomalayan realm.
</br>
Here, out of 3 relevant sentences, the model chose the one that contains Indomalayan realm that is not a false answer. Though we would like to get the answer to be South Asia, but it stills provided the somewhat correct answer.

###Working with the Bengali version of the wikipedia page:
This time we examine the performance of meta-llama/Llama-3.2-1B-Instruct for Bengali version of the wikipedia page. It supports Hindi, an Indo-European language like Bengali and it has a same root as Bengali as they are heavily influened by Sanskrit. So it is another area where we can examine how meta-llama/Llama-3.2-1B-Instruct will perform for a relative language to a supported lanugage by the model.

#### load the model

In [None]:
from huggingface_hub import login
from google.colab import userdata


# Login with your token
login(userdata.get('hf_token'))


In [None]:
# Import Pipeline for the LLM and Pytorch to find the best available device
from transformers import pipeline
import torch

# Find the best available device
device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu') # Use GPU if available, otherwise use CPU

# Load the model
model_identifier_llama = 'meta-llama/Llama-3.2-1B-Instruct'
llm_llama = pipeline(model=model_identifier_llama, device=device, token=userdata.get('hf_token'))

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

In [None]:
# Your Code Here
import wikipedia
wikipedia.set_lang("bn")
page_bgd_bn = wikipedia.page("Bangladesh")
# print(page_bgd_bn.content)

In [None]:
fact_bgd_bn = get_sentences(page_bgd_bn.content, title = "বাংলাদেশ")

In [1]:
#fact_bgd_bn

In [None]:
#Encoding Data
encoded_facts = encoder.encode(fact_bgd_bn, show_progress_bar=True)


# Storing data
# Once the data has been encoded as vectors, we can save it in a vectordatabase that is easy to query. In our case, we will use the IndexFlatL2 from the faiss library that computes simple L2 euclidean distance between vectors.

# FAISS (Facebook AI Similarity Search) is a library that provides efficient algorithms to quickly search and cluster embedding vectors.
import faiss

encoding_dimension = encoded_facts.shape[1]
index = faiss.IndexFlatL2(encoding_dimension)
index.add(encoded_facts)



Batches:   0%|          | 0/13 [00:00<?, ?it/s]

In [None]:
[testdata_bgd_en[1]["question"]]

In [None]:
# Querying the database
# Now, every time we get a new question, all we need to do is encode the question and query the index for the k closest vectors. Then, we can retrieve the text from the original source.
print(f'{[testdata_bgd_en[1]["question"]]}')
question_encoding = encoder.encode([testdata_bgd_en[1]['question']])
distances, positions = index.search(question_encoding, k=3)
distances, positions

relevant_facts = []
for p in positions[0]:
    relevant_facts.append(fact_bgd_bn[p])
relevant_facts

['To which historical period does the history of Bangladesh date back over four millennia?']


['বাংলাদেশ: ইতিহাস: বাংলাদেশে ইসলামের প্রাথমিক ইতিহাস দুটি পর্যায়ে বিভক্ত। প্রথম পর্যায়টি ৮ম থেকে ১২শ শতাব্দী পর্যন্ত ছিল, যখন বাংলার সাথে মুসলিম বাণিজ্য আরব ও ইরানের সাথে বিকশিত হয়েছিল। দ্বিতীয় পর্যায়টি ১৩শ শতাব্দীতে শুরু হয়েছিল, যখন বাংলা মুসলিম শাসকদের অধীনে আসার পর মুসলিম রাজবংশের শাসন শুরু হয়েছিল। বাংলার সাথে মুসলিম বাণিজ্য আরব ও ইরানের সাথে বিকশিত হয়েছিল। বাংলা মুসলিম শাসনের অধীনে আসার পর বাংলায় মুসলিম রাজবংশের শাসন শুরু হয়েছিল। এই সময়ের মধ্যে, ইসলাম বাংলার প্রধান ধর্ম হয়ে ওঠে। মুসলিম রাজবংশগুলো ইসলামি সংস্কৃতি এবং শিক্ষার উন্নয়নে অবদান রেখেছিল। মুহাম্মদ আল-ইদ্রিসি, ইবনে হাওকাল, আল-মাসুদি, ইবন খোরদাদবেহ এবং সুলাইমান আল তাজিরের লেখা থেকে আরব, ইরান এবং বাংলার মধ্যকার সমুদ্র বাণিজ্যের বর্ণনা পাওয়া যায়। এই লেখকরা বাংলার সাথে মুসলিম বাণিজ্যের বিকাশ এবং বাংলায় ইসলামের প্রাথমিক প্রসার সম্পর্কে তথ্য প্রদান করে। বাংলার সাথে মুসলিম বাণ্যিজ্য সাসানীয় সাম্রাজ্যের পতনের পর এবং আরবদের পারস্য বাণিজ্য পথগুলো দখল করার পর বিকাশ লাভ করে। সাসানীয় সাম্রাজ্য ছিল মধ্যপ্রাচ্য এবং এশিয়

In [None]:
generate_answer(testdata_bgd_en, fact_bgd_bn, llm_llama)

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Question: In which region of the world is Bangladesh located?
Relevant Facts:
বাংলাদেশ: ব্যাক্সটার, সি (১৯৯৭)। Bangladesh, from a Nation to a State [বাংলাদেশ, একটি জাতি থেকে একটি রাষ্ট্রে] (ইংরেজি ভাষায়)। ওয়েস্ট ভিউ প্রেস। আইএসবিএন 0-8133-3632-5। ওসিএলসি 47885632।
বাংলাদেশ: বাংলাদেশ () দক্ষিণ এশিয়ার একটি স্বাধীন সার্বভৌম রাষ্ট্র। বাংলাদেশের সাংবিধানিক নাম গণপ্রজাতন্ত্রী বাংলাদেশ। ভৌগোলিকভাবে বাংলাদেশের পশ্চিমে ভারতের পশ্চিমবঙ্গ, উত্তরে পশ্চিমবঙ্গ, আসাম ও মেঘালয়, পূর্ব সীমান্তে আসাম, ত্রিপুরা ও মিজোরাম, দক্ষিণ-পূর্ব সীমান্তে মিয়ানমারের চিন ও রাখাইন রাজ্য এবং দক্ষিণ উপকূলে  বঙ্গোপসাগর অবস্থিত। ভৌগোলিকভাবে পৃথিবীর বৃহত্তম ব-দ্বীপের সিংহভাগ অঞ্চল জুড়ে বাংলাদেশ ভূখণ্ড অবস্থিত। জনসংখ্যার বিচারে প্রায় ১৭ কোটিরও অধিক জনসংখ্যা নিয়ে বাংলাদেশ বিশ্বের ৮ম বৃহত্তম দেশ। নদীমাতৃক বাংলাদেশ ভূখণ্ডের উপর দিয়ে বয়ে গেছে ৫৭টি আন্তর্জাতিক নদী। বাংলাদেশের উত্তর-পূর্বে ও দক্ষিণ-পূর্বে টারশিয়ারি যুগের পাহাড় মেঘের সাথে মিশে আছে। বিশ্বের বৃহত্তম ম্যানগ্রোভ অরণ্য সুন্দরবন ও দীর্ঘতম প্রাকৃতিক সৈকত কক্সব

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Question: To which historical period does the history of Bangladesh date back over four millennia?
Relevant Facts:
বাংলাদেশ: ইতিহাস: বাংলাদেশে ইসলামের প্রাথমিক ইতিহাস দুটি পর্যায়ে বিভক্ত। প্রথম পর্যায়টি ৮ম থেকে ১২শ শতাব্দী পর্যন্ত ছিল, যখন বাংলার সাথে মুসলিম বাণিজ্য আরব ও ইরানের সাথে বিকশিত হয়েছিল। দ্বিতীয় পর্যায়টি ১৩শ শতাব্দীতে শুরু হয়েছিল, যখন বাংলা মুসলিম শাসকদের অধীনে আসার পর মুসলিম রাজবংশের শাসন শুরু হয়েছিল। বাংলার সাথে মুসলিম বাণিজ্য আরব ও ইরানের সাথে বিকশিত হয়েছিল। বাংলা মুসলিম শাসনের অধীনে আসার পর বাংলায় মুসলিম রাজবংশের শাসন শুরু হয়েছিল। এই সময়ের মধ্যে, ইসলাম বাংলার প্রধান ধর্ম হয়ে ওঠে। মুসলিম রাজবংশগুলো ইসলামি সংস্কৃতি এবং শিক্ষার উন্নয়নে অবদান রেখেছিল। মুহাম্মদ আল-ইদ্রিসি, ইবনে হাওকাল, আল-মাসুদি, ইবন খোরদাদবেহ এবং সুলাইমান আল তাজিরের লেখা থেকে আরব, ইরান এবং বাংলার মধ্যকার সমুদ্র বাণিজ্যের বর্ণনা পাওয়া যায়। এই লেখকরা বাংলার সাথে মুসলিম বাণিজ্যের বিকাশ এবং বাংলায় ইসলামের প্রাথমিক প্রসার সম্পর্কে তথ্য প্রদান করে। বাংলার সাথে মুসলিম বাণ্যিজ্য সাসানীয় সাম্রাজ্যের 

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Question: Which ruler's expansion of the Bengal Sultanate led to economic prosperity and military dominance, making Bengal known to Europeans as the richest country to trade with?
Relevant Facts:
বাংলাদেশ: অষ্টাদশ শতাব্দীর সময়, বাংলার নবাবরা এই অঞ্চলের প্রকৃত শাসক হয়ে ওঠেন। দক্ষিণ এশিয়ার অনেক বড় একটি অঞ্চল তাদের নিয়ন্ত্রণে ছিলো। নবাবরা ইউরোপীয় বাণিজ্য কোম্পানিগুলির সাথে মিত্রতা স্থাপন করে, যার ফলে শতাব্দীর শুরুর দিকে অঞ্চলটি বেশ সমৃদ্ধি লাভ করে। সমগ্র মুঘল সাম্রাজ্যের মোট দেশজ উৎপাদনের প্রায় ৫০% বাংলা থেকে আসতো। বাঙালি অর্থনীতি নির্ভর করতো কাপড় উৎপাদন, জাহাজ নির্মাণ, লবণ উৎপাদন, বিভিন্ন হস্তশিল্প ও কৃষিজ পণ্যের উপর। আন্তর্জাতিক বাণিজ্যেরও একটি বড় কেন্দ্র ছিলো বাংলা। বিশ্বব্যাপী বাংলার রেশম ও তুলা বস্ত্রের সুনাম ছড়িয়ে পড়েছিল। জাহাজ নির্মাণেও বাংলা বিখ্যাত ছিলো।
বাংলাদেশ: ইতিহাস: ইসলামি বাংলা: অষ্টাদশ শতাব্দীর সময়, বাংলার নবাবরা এই অঞ্চলের প্রকৃত শাসক হয়ে ওঠেন। দক্ষিণ এশিয়ার অনেক বড় একটি অঞ্চল তাদের নিয়ন্ত্রণে ছিলো। নবাবরা ইউরোপীয় বাণিজ্য কোম্পানিগুলির সাথে মিত্রতা স্থা

Setting `pad_token_id` to `eos_token_id`:None for open-end generation.


Question: When did Bangladesh declare independence from Pakistan, leading to the Bangladesh Liberation War?
Relevant Facts:
বাংলাদেশ: ইতিহাস: স্বাধীনতা যুদ্ধ: ১৯৭২ সালের আগস্টের মধ্যে, ৮৬টি দেশ নতুন রাষ্ট্রটিকে স্বীকৃতি দেয়। বেশিরভাগ মুসলিম দেশের চাপের পরে ১৯৭৪ সালে পাকিস্তান বাংলাদেশকে স্বীকৃতি দেয়।
বাংলাদেশ: ইতিহাস: স্বাধীনতা যুদ্ধ: পাকিস্তানের ব্যর্থ প্রথম আঘাত হানার পর ১৯৭১ সালের ৩রা ডিসেম্বর ভারত যুদ্ধে হস্তক্ষেপ করে। বাংলাদেশী ও ভারতীয় যৌথ বাহিনীর স্থল অগ্রযাত্রা সহ ভারত এবং ছোট বাংলাদেশী বিমান বাহিনীর বিমান হামলার মুখে পাকিস্তানের দখলদারিত্ব থেকে মধ্য ডিসেম্বরে রাজধানী ঢাকা মুক্ত হয়। যুদ্ধের শেষ পর্যায়ে সোভিয়েত ইউনিয়ন ও যুক্তরাষ্ট্র উভয়ই শীতল যুদ্ধের মুখোমুখি অবস্থানের রণকৌশল হিসেবে বঙ্গোপসাগরে নৌবাহিনী পাঠায়। নয় মাসব্যাপী যুদ্ধ ১৯৭১ সালের ১৬ই ডিসেম্বরে পাকিস্তানের পূর্ব কমান্ডের বাংলাদেশ-ভারত যৌথ বাহিনীর কাছে আত্মসমর্পণের মধ্য দিয়ে শেষ হয়। আন্তর্জাতিক চাপে পাকিস্তান ১৯৭২ সালের ৮ই জানুয়ারী মুজিবকে কারাগার থেকে মুক্তি দেয় এবং তাঁকে বাংলাদেশ ফেরত আনা হয়। যেখানে তাঁক

### Comment about the Bengali version of the wikipedia page:
We used different model and a different encoding model for this experiment. They are: `'meta-llama/Llama-3.2-1B-Instruct'` and for encoder: `'sentence-transformers/paraphrase-multilingual-mpnet-base-v2`'
As the model and the encoder model does not primarily supports Bengali, they both provided related information about the query. There is another thing to notice: the query is written in English and the wikipedia text is written in Bengali, so for the similarity measure, the encoded query and the encoded wiki text will have different mappings as the lanuages do not share common alphabet. For instance if we look into this query and the generated answer: </br>
Question: To which historical period does the history of Bangladesh date back over four millennia?
Answer: The Chalcolithic period.
Generated Answer: ইতিহাস: বাংলাদেশে ইসলামের � </br>
The term "Chalcolithic" has its own word in Bengali that is totally different. However, as the question contains the word "historical", and the word "ইতিহাস" translates to "history" in English. So both the retrieval model and the encoder model provided the sentences regarding to history.