# Utilizing **RAG** Technique with LLM for Answering Biology Questions in **Serbian Language**

The aim of this work is to provide Serbian students with quick and easy access to answers for their biology questions through interaction with a large language model.
The featured technique is **Retrieval-Augmented Generation (RAG)**, chosen because large language models are not typically trained on Serbian language literature.
To enable the LLM to effectively answer users' queries, it is essential to provide it with external knowledge, which is the core of the RAG approach.


The knowledge base is built from a biology document available at this **[link](https://core.ac.uk/download/pdf/237391838.pdf)**.

The first step is to download and import the necessary libraries.

In [12]:
import warnings
warnings.filterwarnings('ignore')

In [4]:
# !pip install bitsandbytes

To handle the high memory demands of large language models (LLMs), a helpful strategy is quantization, a technique that was implemented in this project. This involves reducing the precision of model parameters like weights and activations, cutting down the model's memory usage without sacrificing much accuracy. This makes it easier to use LLMs on devices with limited resources.

## **Quantization**

---



In [1]:
from transformers import BitsAndBytesConfig
from torch import bfloat16

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=bfloat16
)
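As a rough illustration of why 4-bit loading matters, the memory needed for the weights can be estimated from the parameter count and the number of bits per parameter. The sketch below is a back-of-the-envelope estimate only: the helper `weight_memory_gb` is hypothetical, the parameter count is an assumption ("7B" taken as about 7 billion), and activations, the KV cache, and the small overhead of quantization constants are ignored.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 7e9  # assumption: "7B" taken as roughly 7 billion parameters

for label, bits in [("bfloat16", 16), ("int8", 8), ("nf4", 4)]:
    print(f"{label:>8}: ~{weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights shrink to about a quarter of their 16-bit size, which is what makes a 7B model practical on a single Colab GPU.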

## **Model and adequate tokenizer**

---



The LLM selected for this project is a fine-tuned Mistral 7B Instruct model that is tailored to the Serbian language.

More information about the model is available at this [link](https://huggingface.co/MaziyarPanahi/Mistral-7B-Instruct-Aya-101).

To load pre-trained model weights and its tokenizer, this project utilized the AutoTokenizer and AutoModelForCausalLM classes from the Transformers library.

In [7]:
# !pip install accelerate

In [2]:
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "MaziyarPanahi/Mistral-7B-Instruct-Aya-101"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    quantization_config=nf4_config
)


## **Pipeline**

---



A text generation pipeline is first created with the `pipeline` function from the Hugging Face Transformers library. The task is set to "text-generation", and two parameters receive the most attention:

*   temperature
*   repetition_penalty

Temperature controls how random the sampled responses are, with lower values giving more deterministic outputs. In Transformers it only takes effect when sampling is enabled (`do_sample=True`); otherwise decoding is greedy. The repetition_penalty parameter discourages tokens that have already appeared, so values above 1 reduce repetition. These adjustments aim to make the model's answers more precise.
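To see what these two parameters do to the token scores, here is a minimal sketch in plain Python. It is not the Transformers implementation: the function names and the toy logits are made up for illustration, and the penalty rule (divide positive scores, multiply negative ones) is the commonly used formulation that I assume matches the library's behavior.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the score of every token id that was already generated."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        # Positive scores are divided and negative ones multiplied, so both move down.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [2.0, 1.0, 0.5]                      # toy scores for a 3-token vocabulary
print(softmax_with_temperature(logits, 1.0))  # comparatively flat distribution
print(softmax_with_temperature(logits, 0.1))  # nearly all probability on token 0
print(apply_repetition_penalty(logits, generated_ids=[0], penalty=1.2))
```

A temperature of 0.1 concentrates almost all probability on the top-scoring token, which is why low temperatures behave nearly deterministically.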



In [8]:
# !pip install langchain

In [10]:
# !pip install langchain_community

In [11]:
from transformers import pipeline
from langchain_community.llms import HuggingFacePipeline

hf_pipe = pipeline(task="text-generation",
                   model=model,
                   tokenizer= tokenizer,
                   return_full_text=True,
                   temperature=0.1,
                   max_new_tokens = 512,
                   repetition_penalty=1.2,
                   num_return_sequences=1
                )

mistral_llm = HuggingFacePipeline(pipeline = hf_pipe)



Mounting the Google Drive to the Colab environment.

In [13]:
from google.colab import drive

drive.mount('/content/drive')

Mounted at /content/drive


## **Loading documents**

---



Loading essential biology-related files into Colab to support the task.

In [15]:
# !pip install pypdf

In [16]:
from langchain.document_loaders import PyPDFLoader, DirectoryLoader

document_locations = '/content/drive/MyDrive/rag_serbian/documents'

loader = DirectoryLoader(
    document_locations,
    glob ='*.pdf',
    loader_cls = PyPDFLoader
)

documents = loader.load()

In [17]:
len(documents)

16

The first page of the document (title and abstract) is displayed below.

In [18]:
documents[0]

Document(page_content='357 Arh.farm 2009;59: 357 – 372 Stru čni rad/Professional paper \n \n  \nBiologija i fiziologija starenja \n \nBosiljka Ple ćaš, Lada Živkovi ć, Biljana Potparevi ć \n \nInstitut za fiziologiju i biologiju, Farmaceutski fakultet Univerziteta u \nBeogradu, Vojvode Stepe 450, 11221Beograd, Srbija \n \n \n \n \n \n \nKratak sadržaj \nStarenje je neizbežno i kako zastupljenost starih osoba u populaciji raste, \nrazumevanje fizioloških mehanizama uklju čenih u proces normalnog starenja ima sve \nveći značaj u cilju održavanja kvaliteta života u starosti. Tako đe, očuvanje fizioloških \nfunkcija ili „zdravlja” starih ljudi smanji će pritisak na zdravstvene sisteme i troškove. U \novom radu, u svetlu najnovijih nau čnih podataka i hipoteza, s ažeto se razmatraju neka \nključna pitanja u vezi biologije i fiziologije starenja: šta je starenje i zašto se ono \ndogađa, kao i mogu ćnosti odlaganja i usporavanja starenja.   \n \n \nKljučne reči: starenje – karakteristike, uzr

The second page, where the introduction begins, is displayed below.

In [19]:
documents[1]

Document(page_content='358 Uvod \nEnormno produženje ljudskog život a u nekoliko poslednjih vekova \nsmatra se jednim od najve ćih dostignu ća naše civilizacije. Od 25 do 30 godina, \nkoliko su prvi ljudi verovatno živeli, do 1900. godine životni vek je u industrijalizovanim zemljama produžen na  oko 45 do 50 godina, da bi samo sto \ngodina kasnije u najzdravijoj naciji, japanskoj, o čekivani životni vek iznosio \noko 80 godina. Pove ćanje dužine života prvenstveno je bilo posledica \ndrastičnog smanjenja smrtnosti novoro đenčadi, uspešnog le čenja infektivnih \nbolesti i poboljšanih uslova života. Pr oduženje životnog veka istovremeno je \npokazalo da je starenje pra ćeno pove ćanom osetljivoš ću organizma prema \nvelikom broju degenerativnih stanja koja se prema težini kre ću od trivijalnih do \nfatalnih, i koja smanjuju kvalitet života starih ljudi.    \nDemografske promene koje su se odigrale u XX veku uslovile su i novi \nnačin razmišljanja o starenju. Danas ve ćina istraživa ča s

## **Chunking documents**

---



To make the documents more manageable for the LLM, the next step is **chunking**, or splitting them into smaller pieces. In this project, the chosen chunk size is 256 characters, with an overlap of 20 characters between chunks so that context and meaning are preserved across chunk boundaries. The units are characters rather than tokens because the splitter measures length with Python's `len`.

The RecursiveCharacterTextSplitter from the Langchain framework is used for chunking the text.

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 256,
    chunk_overlap = 20,
    length_function = len,
    separators= ['\n \n','\n\n','\n', ' ']
)

document_chunks = text_splitter.split_documents(documents)

In [21]:
len(document_chunks)

184

In [22]:
document_chunks[0]

Document(page_content='357 Arh.farm 2009;59: 357 – 372 Stru čni rad/Professional paper \n \n  \nBiologija i fiziologija starenja \n \nBosiljka Ple ćaš, Lada Živkovi ć, Biljana Potparevi ć', metadata={'source': '/content/drive/MyDrive/rag_serbian/documents/biologija.pdf', 'page': 0})

In [23]:
document_chunks[1]

Document(page_content='Institut za fiziologiju i biologiju, Farmaceutski fakultet Univerziteta u \nBeogradu, Vojvode Stepe 450, 11221Beograd, Srbija', metadata={'source': '/content/drive/MyDrive/rag_serbian/documents/biologija.pdf', 'page': 0})
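The roles of chunk size and overlap can be illustrated with a simplified fixed-window splitter. This is a sketch only and not the RecursiveCharacterTextSplitter algorithm, which additionally prefers to cut at the listed separators (which is why the real chunks above are shorter than 256 characters). The function name `chunk_text` and the sample string are made up for illustration.

```python
def chunk_text(text, chunk_size=256, chunk_overlap=20):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

# Tiny sizes so the overlap is easy to see: each chunk starts with the
# last character of the previous one.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence cut at a chunk boundary still appears partly in both neighbors, at the cost of storing some text twice.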

## **Transforming document chunks into vector representation**

---



Because similarity search operates on numbers rather than raw text, the next step is to convert the document chunks into numerical representations, or vectors.

For this task, Sentence Transformers were used. They embed the chunks while preserving their meaning: sentences with similar meanings get similar vector representations and lie closer to each other in the vector space than sentences with different meanings.

The Sentence Transformer used in this project can be found [here](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2). It was chosen primarily for its ability to handle multiple languages, including Serbian.
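The idea that similar meanings give nearby vectors can be made concrete with cosine similarity. The sketch below uses made-up three-dimensional vectors as stand-ins for real embeddings, which have hundreds of dimensions; the numbers and variable names are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction, 0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings: two sentences about aging and one about the sun.
aging_1 = [0.9, 0.1, 0.0]
aging_2 = [0.8, 0.2, 0.1]
sun = [0.0, 0.1, 0.9]

print(cosine_similarity(aging_1, aging_2))  # close to 1
print(cosine_similarity(aging_1, sun))      # close to 0
```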

In [25]:
# !pip install sentence-transformers

In [26]:
from langchain.embeddings import HuggingFaceEmbeddings

model_name = 'sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2'

embeddings = HuggingFaceEmbeddings(
    model_name = model_name,
    model_kwargs ={'device':'cuda'},
    encode_kwargs = {'normalize_embeddings':True}
)


## **Vector store - Faiss**

---



The sentence embeddings are stored in a vector store using **FAISS**. In the FAISS vector database, similarity is measured using the L2 (Euclidean) distance.

The vector store also acts as a retriever: when a new query arrives, it is embedded and the chunks most similar to it are returned. Two configurations were tried: returning a fixed number of top chunks (k = 2), and returning only chunks whose relevance score is at least 0.5. The second one is used in the rest of the notebook, so the number of retrieved chunks varies with the query and can be zero.
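Since the embeddings were created with `normalize_embeddings` set to `True`, ranking by L2 distance and ranking by cosine similarity give the same order. For unit vectors the two are tied by the identity $\lVert a - b \rVert^2 = 2 - 2\,a \cdot b$. The short check below verifies this numerically on toy vectors; the helper names are illustrative and not part of FAISS.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def dot(a, b):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(x * y for x, y in zip(a, b))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

lhs = squared_l2(a, b)
rhs = 2 - 2 * dot(a, b)
print(lhs, rhs)  # the two values agree up to floating-point error
```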

In [5]:
# !pip install -U langchain-community faiss-gpu

In [27]:
from langchain.vectorstores import FAISS

vector_db = FAISS.from_documents(document_chunks, embeddings)

In [28]:
# retriever = vector_db.as_retriever(
#     search_type="similarity",
#     search_kwargs={"k": 2}
# )
retriever = vector_db.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.5}
)
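Conceptually, a score-threshold retriever keeps only the candidates whose relevance score reaches the threshold, so it may return nothing at all. The sketch below shows that filtering step on made-up scores. It is not LangChain's implementation: `filter_by_score` is a hypothetical helper, and the cap of `k=4` results is an assumption.

```python
def filter_by_score(scored_chunks, score_threshold=0.5, k=4):
    """Keep at most k chunks whose relevance score reaches the threshold, best first."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in kept[:k]]

# Made-up (chunk, score) pairs for an on-topic query.
on_topic = [("chunk A", 0.62), ("chunk B", 0.31), ("chunk C", 0.81)]
# Made-up scores for an off-topic query: nothing reaches 0.5.
off_topic = [("chunk A", 0.12), ("chunk B", 0.08)]

print(filter_by_score(on_topic))
print(filter_by_score(off_topic))
```

An empty result is useful information in itself: it signals that the knowledge base has nothing relevant to the query.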

In [29]:
def format_print_result(result):
  formatted = []
  for doc in result:
    formatted.append(doc.page_content)

  for doc in formatted:
    print(doc,'\n\n')

For the query 'Starenje in vivo', the most relevant sentences that were retrieved are shown below.

In [30]:
result = retriever.invoke('Starenje in vivo')
format_print_result(result)

Ključne reči: starenje – karakteristike, uzroci, teorije, odlaganje 


Kratak sadržaj 
Starenje je neizbežno i kako zastupljenost starih osoba u populaciji raste, 
razumevanje fizioloških mehanizama uklju čenih u proces normalnog starenja ima sve 


heterogenost stare populacije i kompleksnos t fizioloških procesa, teško može da 
se povuče jasna vremenska granica po četka starosti (1). Starenje zapo činje u 
nepoznatom periodu života zrele osobe i progresivno se odvija na razli čit način 


Key words: aging – traits, causes, theories, delaying 




Testing retrieval for a query that is not covered by the biology documents. Nothing is printed because no chunk passes the score threshold.

In [31]:
format_print_result(retriever.invoke('Šta je sunce?'))

## **Prompting**

---



In the prompt below, we instruct the LLM to generate a response based only on the context that it was provided with for the asked question. The context is retrieved from the created vector store.

In [32]:
prompt_template = """
<s> [INST] Sigurno, korisno i smisleno odgovori na postavljeno pitanje isključivo na osnovu konteksta.
Ako imaš bilo kakav kontekst, odgovori na osnovu tog konteksta.
Ako nešto ne znaš, jednostavno reci Ne znam.
Ako je rečenica u pitanju nedovršena, završi je na osnovu konteksta ili svog znanja.
[/INST]
Kontekst: {context}
Pitanje: {question}
</s>
 """

# prompt_template = """
# <s> [INST] Sigurno i korisno odgovaraj na postavljeno pitanje isključivo na osnovu konteksta.
# Ako nešto ne znaš, jednostavno reci da ne znaš.
# [/INST]
# Kontekst: {context}
# Pitanje: {question}
# </s>
#  """


# prompt_template = """
# <s> [INST] Sigurno i korisno odgovaraj na postavljeno pitanje isključivo na osnovu konteksta.
# Ako nemaš kontekst, jednostavno reci da ne znaš.
# [/INST]
# Kontekst: {context}
# Pitanje: {question}
# </s>
# """

# prompt_template = """
# <s> [INST] Please respond in Serbian language to the question based solely on the provided context.
# If you have any context, respond only based on that context.
# If you don't have context, simply respond with "Ne znam odgovor na ovo pitanje".
# Only respond to questions in Serbian language.
# If question is not in Serbian language, respond with "Molim te postavi pitanje na srpskom jeziku".
# Always respond in Serbian language.
# [/INST]
# Kontekst: {context}
# Pitanje: {question}
# </s>
# """

# prompt_template = """
# <s> [INST] Molim te da odgovoriš na pitanje na srpskom jeziku isključivo na osnovu pruženog konteksta.
# Ako nemaš kontekst, jednostavno odgovori sa "Ne znam".
# Ako imaš kontekst, odgovori samo na osnovu tog konteksta.
# Odgovaraj samo na pitanja na srpskom jeziku.
# Uvek odgovaraj na srpskom jeziku.
# [/INST]
# Kontekst: {context}
# Pitanje: {question}
# </s>
# """

# prompt_template = """
# <s> [INST] Sigurno i korisno odgovaraj na postavljeno pitanje isključivo na osnovu konteksta.
# Ako nisi dobio kontekst, jednostavno reci da ne znaš.
# Ako je kontekst prazan, jednostavno reci da ne znaš.
# Ako pitanje nije na srpskom, reci da postavi pitanje na srpskom jeziku.
# Ako pitanje nije o biologiji, reci da postavi pitanje o biologiji.
# Ako nemaš kontekst, ne odgovaraj na osnovu svojih saznanja.
# [/INST]
# Kontekst: {context}
# Pitanje: {question}
# </s>
# """

# prompt_template = """
# <s> [INST] Please respond to the question based solely on the provided context. Only respond in Serbian language.
# If you don't have context, simply say "Ne znam, izvini"
# If the question is not in Serbian, say "Postavi pitanje na srpskom jeziku".
# If the question is not about biology, say "Postavi pitanje o biologiji".
# If you don't have context, don't respond based on your knowledge.
# [/INST]
# Kontekst: {context}
# Pitanje: {question}
# </s>

# """

In [33]:
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["context", "question"],
    template=prompt_template,
)

## **RAG chain**

---



 The flow is as follows:
*   We pass the query to the vector database to find the most similar documents
*   The retrieved context, along with the question, goes to the LLM to generate the response
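The two steps above can be sketched without any model, using stand-in functions. Everything here is hypothetical: `toy_retrieve` and `toy_generate` replace the FAISS retriever and the LLM, the template is a simplified version of the prompt, and joining chunks with a blank line is an assumption about how the context is assembled.

```python
TEMPLATE = "Kontekst: {context}\nPitanje: {question}\n"  # simplified stand-in prompt

def build_prompt(template, chunks, question, separator="\n\n"):
    """Join the retrieved chunks into one context string and fill the template."""
    context = separator.join(chunks)
    return template.format(context=context, question=question)

def toy_retrieve(question):
    """Hypothetical retriever: returns fixed chunks instead of searching FAISS."""
    return [
        "Telomeraza je glavni, ali ne i jedini mehanizam koji produžava telomere.",
        "Dužina telomera ima ulogu tzv. mitotičkog sata.",
    ]

def toy_generate(prompt):
    """Hypothetical LLM: reports the prompt size instead of generating text."""
    return f"(model received {len(prompt)} characters)"

def answer(question, retrieve, generate, template=TEMPLATE):
    chunks = retrieve(question)                        # step 1: find similar chunks
    prompt = build_prompt(template, chunks, question)  # step 2: put them into the prompt
    return prompt, generate(prompt)                    # step 3: let the LLM respond

prompt, response = answer("Šta je telomeraza?", toy_retrieve, toy_generate)
print(prompt)
print(response)
```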

In [34]:
from langchain.chains import RetrievalQA

rag_chain = RetrievalQA.from_chain_type(llm=mistral_llm,
                                         chain_type="stuff",
                                         retriever=retriever,
                                         return_source_documents=True,
                                         chain_type_kwargs={
                                             "prompt": prompt
                                         })

In [35]:
def wrap_text(text, word_limit):
    words = text.split()
    wrapped_text = []

    for i in range(0, len(words), word_limit):
        wrapped_text.append(' '.join(words[i:i+word_limit]))

    return '\n'.join(wrapped_text)

In [36]:
def print_result(question):
  # Skip the LLM call entirely when nothing in the knowledge base is relevant.
  if not retriever.invoke(question):
    print('Ne znam kako da ti odgovorim na ovo pitanje :(')
    return

  result = rag_chain.invoke(question)
  text = result['result']
  # Keep only the part starting at "Odgovor:". str.find returns -1 when the
  # marker is absent, so fall back to printing the whole text in that case.
  start = max(text.find("Odgovor:"), 0)
  print("Pitanje: ", question, '\n\n')
  print(wrap_text(text[start:].strip(), 20))

In [38]:
question1 = 'Karakteristike starenja'
print_result(question1)

Pitanje:  Karakteristike starenja 


Odgovor: Starenje se javlja kod životinjskih organizama i ljudima, a karakteristike su mu: • Progresivni pad funkcionalnosti organa (npr. gubitak
sluhove) • Pada sposobnosti za regeneraciju tkiva (npr. kože), • Pada reproduktivnih sposobnosti (npr. menopauza). Uzroci starenja Nekoliko hipoteza objašnjava
uzroke starenja: • Teorija o starom dobi - po ovoj teoriji, starenje je posledica progresivnog pada broja stanica u telu.
• Teorija o epigenetici - po ovaj teoriji, genetičke promene u DNK mogu da uticaju na starenje. • Teorija o
molekularnom otpadu - po ovaj teoriji, starenje je posledica akumulacije štetnih molekula u telu. • Teorija o senescenciji - po
ovaj teoriji, starenje je posledica senescentnih stanica koje se formiraju u telu. Teorija o hormonima - po ovaj teoriji, starenje
je posledica padajućeg razina hormona u telu. Teorije o delovanju antioxidansa Antioxidanti su supstance koje brane organizam od štete izazvane
slobodnim radikalima. Po

In [51]:
question2 = 'Starenje ćelija'
print_result(question2)

Pitanje:  Starenje ćelija 


Odgovor: Uloga ćelijskog starenja i ćelijske smrti u starenju organizama


**Retrieved sentences for second query "Starenje ćelija"**

In [52]:
res = retriever.invoke(question2)
for doc in res:
  print(doc.page_content,'\n\n')

Starenje ćelija 
Ćelije su osnovne strukturne jedini ce organizma i zato pretpostavka da 
njihovo starenje i propadanje doprinosi st arenju organizma ima smisla. Prirodno 
starenje čoveka može da bude rezultat: 1) slabljenja procesa odgovornih za 


mobilizacijom mati čnih ili prekursorskih  ćelija, ali izgleda da i te ćelije imaju 
konačni životni vek i u starosti se njihov broj smanjuje. Štaviše, tkivni ili 
sistemski milje starog organizma doda tno kompromituje mobilizaciju i/ili 


proliferativnu homeostazu i čak da stimulišu rast mutiranih preneoplasti čnih ili 
neoplastičnih ćelija.  
Uloga ćelijskog starenja i ćelijske smrti u starenju organizma 
Ćelijska starost i smrt mogu da ig raju ulogu u starenju na dva na čina 
(36): 


postoji kao starost ćelija in vivo  i, ako postoji, kakve su posledice prisustva 
starih ćelija u tkivima i organima.  Za sada nije utvr đena direktna uzro čna veza 
između starenja ćelija i ćelijske smrti, s jedne strane , i starenja organizma, s 




**Retrieved sentences for third query "Šta je telomeraza?"**

In [40]:
question3 = 'Šta je telomeraza?'
print_result(question3)

Pitanje:  Šta je telomeraza? 


Odgovor: Telomeraza je enzim koji produžuje telomere.


In [50]:
res = retriever.invoke(question3)
for doc in res:
  print(doc.page_content,'\n\n')

da se dele a da se ne ispolji RS zavis na od dužine telomera. Telomeraza je 
glavni, ali ne i jedini mehanizam koji produžava telomere. 
Tokom embrionalnog i postnatalnog ži vota ekspresija telomeraze se 


kraju završava  jednostrukim lancem nukl eotida koji se savija unazad i strukturu 
koja se zove T-petlja. Direktno ili indire ktno, u vezi sa DNK telomera je veliki 
broj proteina koji olakšavaju formiranje T-petlji i u čestvuju u funkciji telomera. 


ciklusa. Pri tome, dužina telomera ima ulogu tzv. mitoti čkog sata ili 
replikometra koji prati broj  deoba normalnih somatskih ćelija. 


potpunosti rasvetljen; danas se  uglavnom smatra da skra ćenje telomera indukuje 
klasični odgovor ćelija na ošte ćenje DNK (24,26). Signali o ošte ćenju DNK 
potiču iz kratkih telomera, a odgovor zahteva u č
ešće nekoliko proteina, od kojih 




We can observe that the model **used the retrieved documents to generate a response** while also incorporating its own knowledge to synthesize the information.

In [41]:
question4 = 'How are you?'
print_result(question4)

Ne znam kako da ti odgovorim na ovo pitanje :(


In [53]:
question5 = 'Šta je sunce?'
print_result(question5)

Ne znam kako da ti odgovorim na ovo pitanje :(


In [43]:
retriever.invoke(question4)

[]

In [54]:
retriever.invoke(question5)

[]

The fourth and fifth questions are not covered by the externally added knowledge base, so nothing relevant was retrieved for them.

They were therefore not passed to the LLM: the point of this project is to show how an LLM can be enhanced with knowledge it did not already have, not how well it answers questions on its own.

## Conclusion for this version of Mistral 7B

There were two main challenges:


1.   Preventing the LLM from responding to questions not in Serbian language
2.   Ensuring it relied on an external knowledge base, specifically a biology PDF

To address both, I first tried to write a clear prompt that would guide the LLM's behavior. After multiple attempts, however, I could not find one that handled all the problems effectively. I also wanted to avoid using the GPU unless it was really necessary.

Hence, I chose a different approach. I configured the retriever to return only documents whose similarity score is above a threshold. Before passing the query and context to the LLM, I check whether the retriever returns any relevant documents. If no document is relevant to the query, none has a score above the threshold and the context is empty, so the LLM is not called at all. If relevant documents are found, they are passed to the LLM together with the query.
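The gating idea can be summarized in a few lines. The sketch below uses stand-in functions to show that the generator is never invoked when retrieval comes back empty; `toy_retrieve`, `toy_generate`, and the keyword match are hypothetical and only mimic the behavior of the real retriever and LLM.

```python
FALLBACK = "Ne znam kako da ti odgovorim na ovo pitanje :("

def answer_with_gate(question, retrieve, generate):
    """Call the LLM only if the retriever found at least one relevant chunk."""
    chunks = retrieve(question)
    if not chunks:
        return FALLBACK  # no context, so the (expensive) LLM call is skipped
    return generate(question, chunks)

llm_calls = []  # records every question that actually reached the "LLM"

def toy_retrieve(question):
    """Hypothetical retriever: only questions mentioning 'telomer' match a chunk."""
    if "telomer" in question.lower():
        return ["Telomeraza je glavni, ali ne i jedini mehanizam koji produžava telomere."]
    return []

def toy_generate(question, chunks):
    """Hypothetical LLM that just logs the call."""
    llm_calls.append(question)
    return f"Odgovor na osnovu {len(chunks)} pasusa."

print(answer_with_gate("Šta je telomeraza?", toy_retrieve, toy_generate))
print(answer_with_gate("How are you?", toy_retrieve, toy_generate))
print(llm_calls)  # only the on-topic question reached the generator
```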


# Comparison to Llama LLM

**Authentication**

---



In [44]:
from huggingface_hub import login
login()


**Model and adequate tokenizer**

---



For a fair comparison, the **same retriever** is used for both models.

In [45]:
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers

model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    quantization_config=nf4_config
)


In [55]:
hf_llama_pipe = pipeline(task="text-generation",
                   model=model,
                   tokenizer= tokenizer,
                   return_full_text=True,
                   temperature=0.1,
                   max_new_tokens = 512,
                   repetition_penalty=1.2,
                   num_return_sequences=1
                )

llama_llm = HuggingFacePipeline(pipeline = hf_llama_pipe)

In [56]:
system_message = """
You are a helpful, respectful, and honest assistant.
Always answer as helpfully as possible, while ensuring safety.
Your answers should be based only on the context provided.
Do not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content in your responses.
Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, is not factually coherent, or the context does not provide enough information to answer it, just respond with "Ne znam odgovor na ovo pitanje :(".
Always answer in Serbian.
"""

In [57]:
user_message = """
CONTEXT:\n\n {context}\n

Question: {question}
"""

Llama expects a different prompt format than Mistral, and it is crucial to pass the prompt to the LLM in the format it was trained on. Otherwise, it won't generate an adequate response.

In [58]:
llama_prompt_format = f"<s>[INST] <<SYS>>\n{system_message}\n<</SYS>>\n{user_message} [/INST]"

In [59]:
llama_prompt = PromptTemplate(
    template= llama_prompt_format,
    input_variables=["context", "question"]
)

In [60]:
llama_qa_chain = RetrievalQA.from_chain_type(llm=llama_llm,
                                         chain_type="stuff",
                                         retriever=retriever,
                                         return_source_documents=True,
                                         chain_type_kwargs={
                                             "prompt": llama_prompt
                                         })

In [61]:
q1_result = llama_qa_chain.invoke(question1)

In [62]:
q1_result.keys()

dict_keys(['query', 'result', 'source_documents'])

In [76]:
def return_llama_response(qa_chain, query):
  q1_result = qa_chain.invoke(query)
  index = q1_result['result'].find('[/INST]') + len('[/INST]')

  result = wrap_text(q1_result['result'][index:].strip(), 50)
  print(result)

In [64]:
question1

'Karakteristike starenja'

In [77]:
return_llama_response(llama_qa_chain, question1)

Ne znam odgovor na ovo pitanje :(


In [66]:
question3

'Šta je telomeraza?'

In [68]:
res = retriever.invoke(question3)
for doc in res:
  print(doc.page_content,'\n\n')

da se dele a da se ne ispolji RS zavis na od dužine telomera. Telomeraza je 
glavni, ali ne i jedini mehanizam koji produžava telomere. 
Tokom embrionalnog i postnatalnog ži vota ekspresija telomeraze se 


kraju završava  jednostrukim lancem nukl eotida koji se savija unazad i strukturu 
koja se zove T-petlja. Direktno ili indire ktno, u vezi sa DNK telomera je veliki 
broj proteina koji olakšavaju formiranje T-petlji i u čestvuju u funkciji telomera. 


ciklusa. Pri tome, dužina telomera ima ulogu tzv. mitoti čkog sata ili 
replikometra koji prati broj  deoba normalnih somatskih ćelija. 


potpunosti rasvetljen; danas se  uglavnom smatra da skra ćenje telomera indukuje 
klasični odgovor ćelija na ošte ćenje DNK (24,26). Signali o ošte ćenju DNK 
potiču iz kratkih telomera, a odgovor zahteva u č
ešće nekoliko proteina, od kojih 




In [78]:
return_llama_response(llama_qa_chain, question3)

Ne znam odgovor na ovo pitanje :(. The context you provided does not give enough information for me to accurately answer your question about telomerase. Can you please provide more details or clarify what you would like to know about telomerase?


With Llama, the retrieval part works relatively well, but the generative aspect falls short.

The most likely reason is that the model was not trained on Serbian literature and therefore struggles to generate appropriate responses in Serbian.


#### Prompt in Serbian language with Llama

Finally, I tested Llama with prompts in the Serbian language to see if the accuracy would improve.

In [70]:
system_message2 = """
Vi ste uslužan, pun poštovanja i pošten asistent.
Uvek odgovarajte što je moguće korisnije, pridržavajući se datog konteksta.
Nemojte uključivati nikakav štetan, neetički, rasistički, seksistički, toksičan, opasan ili nezakonit sadržaj u svoje odgovore.
Uverite se da su vaši odgovori društveno nepristrasni i pozitivne prirode.
Ako pitanje nema smisla, nije činjenično koherentno ili kontekst ne pruža dovoljno informacija za odgovor, odgovorite sa „Ne znam odgovor na ovo pitanje :(“.
Uvek odgovarajte na srpskom jeziku.
"""

In [71]:
user_message2 = """
Kontekst:\n\n {context}\n

Pitanje: {question}
"""

In [72]:
llama_prompt_format = f"<s>[INST] <<SYS>>\n{system_message2}\n<</SYS>>\n{user_message2} [/INST]"

In [73]:
llama_prompt2 = PromptTemplate(
    template= llama_prompt_format,
    input_variables=["context", "question"]
)

In [74]:
llama_qa_chain2 = RetrievalQA.from_chain_type(llm=llama_llm,
                                         chain_type="stuff",
                                         retriever=retriever,
                                         return_source_documents=True,
                                         chain_type_kwargs={
                                             "prompt": llama_prompt2
                                         })

In [75]:
question1

'Karakteristike starenja'

In [79]:
return_llama_response(llama_qa_chain2, question1)

Ne znam odgovor na ovo pitanje :(. The topic of aging is a complex and multifaceted one, and there are many different characteristics, causes, theories, and delays involved in the process. To provide a comprehensive answer would require a detailed examination of the various physiological mechanisms that contribute to aging,
as well as an analysis of the social and cultural factors that influence the aging process. Additionally, there are many different perspectives on what constitutes "normal" aging, and how it should be defined and measured. Therefore, I cannot provide a simple or definitive answer to this question without further context
or information.


In [82]:
question2

'Starenje ćelija'

In [83]:
return_llama_response(llama_qa_chain2, question2)

Ne znam odgovor na ovo pitanje :(


In [80]:
question3

'Šta je telomeraza?'

In [81]:
return_llama_response(llama_qa_chain2, question3)

Telomeraza je glavni, ali ne i jedini mehanizam koji produžava telomere. Tokom embrijske i posmatalske faze, ekspresija telomeraze se kraju završava jednostruim lancom nukl eotida koji se savija unazad i forma structure zove T-petlja. U vezi sa DNK telomera je veliki broj proteina koji olakšavaju formiranje T-petlji i učešće u
funkciji telomera. Oni potpomene ciklusu. Dužina telomera ima ulogu tzv. mitoti čkog sata ili replikometra koji prati broj deoba normalnih somatских ćelija.


The generative aspect improves a bit when the prompt is in Serbian, but the difference is not significant.

Of the questions I tested, Llama produced a valid response in Serbian for only one ("Šta je telomeraza?"); for the others it did not generate a valid response.











## The **first** model (Mistral-7B-Instruct-Aya-101) seems to be more suitable for this task overall.





