# Multilingual QA (Part 1)

This multilingual QA pipeline is implemented with a **multilingual embedding** model.

Steps:
1. Data preparation: load the PDF and split its text into chunks.
2. Document embedding: embed the chunks with a multilingual model.
3. Vector store: create a FAISS vector store and insert the embedded documents.
4. Retrieval: query the vector store with a multilingual query.
5. Answer generation: for the same query, generate the answer with a Gemini model, prompting it to reply in the language of the query.

In [16]:
#!pip install langchain_community langchain langchain-google-genai faiss-cpu pdfplumber sentence-transformers tiktoken

## Import Libraries

In [2]:
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.docstore.in_memory import InMemoryDocstore
import faiss
from langchain_google_genai.chat_models import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate

## Data Preparation

In [13]:
# loading pdf data and chunking

pdf_data = PDFPlumberLoader("./sample_data/2306.04542v3.pdf").load()
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,
    chunk_overlap=50,
)
data_splits = splitter.split_documents(pdf_data)
print("no of chunks: ", len(data_splits))


no of chunks:  53
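The splitter above counts tokens with tiktoken and splits recursively on separators. The sketch below is much simpler and is not that algorithm: it assumes a plain list of tokens and a hypothetical helper `sliding_chunks`, and only illustrates how `chunk_size` and `chunk_overlap` interact.

```python
def sliding_chunks(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Toy input: ten placeholder tokens instead of real tiktoken tokens.
words = [f"w{i}" for i in range(10)]
chunks = sliding_chunks(words, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, so a sentence cut at a chunk boundary still appears with some of its surrounding text in the next chunk.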


## Document Embedding and vector store generation

In [16]:
# multilingual huggingface embedding model
model_name = "Alibaba-NLP/gte-multilingual-base"
embeddings = HuggingFaceEmbeddings(
                model_name=model_name,
                model_kwargs={"device": "cpu", "trust_remote_code": True},
                encode_kwargs={"normalize_embeddings": True},
            )

Some weights of the model checkpoint at Alibaba-NLP/gte-multilingual-base were not used when initializing NewModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing NewModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing NewModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [51]:
# create faiss vectorstore
vector_store = FAISS(
    embedding_function=embeddings,
    index=faiss.IndexFlatL2(768),  # 768 = embedding dimension of gte-multilingual-base
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
# print(vector_store)
vector_store.add_documents(documents=data_splits)
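The embeddings are created with `normalize_embeddings=True`, while the index is a flat L2 index. For unit-length vectors the squared L2 distance equals `2 - 2 * cosine`, so both metrics rank documents in the same order. The sketch below checks this in plain Python; the three-dimensional vectors and document names are made up for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine(a, b):
    # A plain dot product is the cosine similarity only for unit vectors.
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy vectors standing in for 768-dimensional embeddings.
query_vec = normalize([1.0, 2.0, 0.5])
docs = {
    "doc_a": normalize([1.0, 1.9, 0.4]),
    "doc_b": normalize([-1.0, 0.3, 2.0]),
    "doc_c": normalize([0.2, 1.0, 1.0]),
}

by_l2 = sorted(docs, key=lambda name: squared_l2(query_vec, docs[name]))
by_cosine = sorted(docs, key=lambda name: cosine(query_vec, docs[name]), reverse=True)
print("ranked by L2:    ", by_l2)
print("ranked by cosine:", by_cosine)
```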

In [53]:
# vector_store.save_local('faiss_index_1')

## Multilingual Retrieval

In [36]:
# multilingual retriever
multilingual_retriever = vector_store.as_retriever()    

In [47]:
# Multilingual query
query = "Que se passe-t-il dans le processus avancé du modèle de diffusion ?"
out_docs = multilingual_retriever.invoke(query)
context = "\n--\n".join(doc.page_content for doc in out_docs)

In [48]:
print("Retrieval output for multilingual embedding model: \n")
for doc in out_docs :
    print("Document: ", doc.page_content[0:200])
    print("*"*100)

Retrieval output for multilingual embedding model: 

Document:  [274],Chemistry[275],[276],etc.Theyarenotonlyapplying
and adapting diffusion models to solve problems in these
domains,butalsoleveragingknowledgeinadisciplinaryto
theoretically improve diffusion model
****************************************************************************************************
Document:  each timestep. The reverse process moves on the chain in pling procedure, as shown in Figure 1. This breakdown is
the opposite direction. It optimizes a network to remove aligned with the generic pipe
****************************************************************************************************
Document:  1
On the Design Fundamentals of
Diffusion Models: A Survey
Ziyi Chang, George Koulieris, Hubert P. H. Shum, Senior Member, IEEE
Abstract—Diffusionmodelsaregenerativemodels,whichgraduallyaddandremoveno
****************************************************************************************************
Document

## Question Answering

In [44]:
# define LLM
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", google_api_key="google_api_key")
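The string `"google_api_key"` above is a placeholder for a real key. One way to keep the key out of the notebook is to read it from an environment variable. The sketch below assumes the variable is named `GOOGLE_API_KEY`, and `read_api_key` is a hypothetical helper, not part of any library.

```python
import os


def read_api_key(env_var="GOOGLE_API_KEY"):
    """Return the API key stored in env_var, or fail with a clear message."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before creating the LLM")
    return key


# Assumed usage with the LLM defined in this notebook:
# llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", google_api_key=read_api_key())
```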

In [49]:
prompt = PromptTemplate.from_template("You are a multilingual assistant. Answer user's {query} from given {context} in same language as query.")
chain = prompt | llm
response = chain.invoke({'query': query,
                            'context': context})
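To see roughly what text the model receives, the template can be filled by hand. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate`; `render_prompt` is a hypothetical helper and the query and context values are made up.

```python
TEMPLATE = (
    "You are a multilingual assistant. Answer user's {query} "
    "from given {context} in same language as query."
)


def render_prompt(query, context):
    """Fill the two placeholders of the template with plain string formatting."""
    return TEMPLATE.format(query=query, context=context)


# Hypothetical inputs; in the notebook, context is the joined retrieved chunks.
rendered = render_prompt(
    query="Que se passe-t-il ?",
    context="chunk one\n--\nchunk two",
)
print(rendered)
```

The only language instruction is "in same language as query", so the model has to infer the language from the query text itself.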

In [50]:
print("query: \n",query)
print("response: \n", response.content)

query: 
 Que se passe-t-il dans le processus avancé du modèle de diffusion ?
response: 
 Le processus de diffusion avancé implique une compréhension approfondie de la façon dont le bruit est progressivement ajouté à un échantillon de données dans le processus de diffusion directe, puis supprimé dans le processus inverse pour générer de nouvelles données. 

En termes simples, imaginez une goutte d'encre dans un verre d'eau. 

1. **Processus de diffusion direct (ajout de bruit):**  C'est comme si l'on regardait l'encre se diffuser lentement dans l'eau. À chaque étape, l'encre se répand un peu plus, devenant de plus en plus "bruyante" jusqu'à ce qu'elle soit uniformément répartie et que l'on ne puisse plus distinguer la goutte d'origine.

2. **Processus de diffusion inverse (suppression du bruit):** C'est là que ça devient intéressant. Le modèle apprend à inverser ce processus de diffusion. Il apprend à "dé-mélanger" l'encre, étape par étape, en commençant par l'état "bruyant" final et en

**Translated output:**

The advanced diffusion process involves a thorough understanding of how noise is progressively added to a data sample in the forward diffusion process and then removed in the reverse process to generate new data. 

In simple terms, imagine a drop of ink in a glass of water. 

1. **Forward diffusion process (adding noise):** It's like watching ink slowly diffuse in water. With each step, the ink spreads a little more, becoming more and more "noisy" until it is evenly distributed and you can no longer make out the original drop.

2. **Reverse diffusion process (noise removal):** This is where it gets interesting. The model learns to reverse this diffusion process. It learns how to "un-mix" the ink, step by step, starting with the final "noisy" state and working back to the initial state of the ink drop.

The advanced process involves training a neural network to understand and replicate this ink “un-mixing” process.  Once trained, the model can generate new data by starting from a “noisy” state and gradually “de-noising” it to create something new and coherent.

In summary, the advanced diffusion process helps capture the underlying structure of data by learning to both add and remove noise in a controlled manner.

# Multilingual Documents

* The multilingual embeddings are now used with multilingual documents: the source document contains French and German text.

## Data Preparation

In [3]:
# loading multilingual data and chunking
from langchain_community.document_loaders import TextLoader

# read the file as UTF-8 so that accented characters and umlauts load correctly
text_data = TextLoader("./sample_data/multilingual_text.txt", encoding="utf-8").load()
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=500,
    chunk_overlap=50,
)
data_splits = splitter.split_documents(text_data)
print("no of chunks: ", len(data_splits))


no of chunks:  7
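French and German text contains non-ASCII characters, so the encoding used to read the file matters. The sketch below shows what happens when UTF-8 bytes are decoded as Latin-1 and how that specific damage can be reversed. The sample word is only an illustration.

```python
original = "Vorwärtsprozess"

# Reading UTF-8 bytes with a single-byte codec turns each umlaut into two characters.
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)

# Reversing the wrong decoding step recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)
```

Passing an explicit `encoding="utf-8"` to the loader avoids the problem at the source, assuming the file is actually stored as UTF-8.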


## Multilingual embedding and vector store generation

In [4]:
# multilingual huggingface embedding model
model_name = "Alibaba-NLP/gte-multilingual-base"
embeddings = HuggingFaceEmbeddings(
                model_name=model_name,
                model_kwargs={"device": "cpu", "trust_remote_code": True},
                encode_kwargs={"normalize_embeddings": True},
            )



In [7]:
# create faiss vectorstore
vector_store1 = FAISS(
    embedding_function=embeddings,
    index=faiss.IndexFlatL2(768),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
# print(vector_store)
vector_store1.add_documents(documents=data_splits)

['f7d6d487-08bd-423d-987d-2084f5f08242',
 'afc3fec4-9d65-4b05-bbd8-791739b7c276',
 '003f21c3-4b6a-4cd7-8b8b-09cca3fd2626',
 'c2d92a07-8cae-4a81-8e8b-7863e9f8c967',
 '11cfc88f-39c1-46f5-bda7-cab4c7d6427c',
 '72bd6ccf-97af-4188-a84b-885eb95a8925',
 '7115b853-e195-44ce-8684-da955e02807f']

## Multilingual Retrieval

In [8]:
# multilingual retriever
multilingual_doc_retriever = vector_store1.as_retriever()    

In [9]:
# Multilingual query
query = "wat is die verskil tussen GAN'e en diffusiemodel?"
out_docs = multilingual_doc_retriever.invoke(query)
context = "\n--\n".join(doc.page_content for doc in out_docs)

In [10]:
print("Retrieval output for multilingual embedding model: \n")
for doc in out_docs :
    print("Document: ", doc.page_content[0:200])
    print("*"*100)

Retrieval output for multilingual embedding model: 

Document:  Evolution von Diffusionsmodellen aus GANs
Generative Adversarial Networks (GANs) und Diffusionsmodelle sind zwei wichtige Techniken im Bereich der generativen Modellierung, jede mit ihren eigenen Stä
****************************************************************************************************
Document:  Funktionsweise von Diffusionsmodellen

Vorwärtsprozess: Saubere Daten werden in mehreren Schritten mit Rauschen verfälscht.
Rückwärtsprozess: Ein neuronales Netzwerk lernt, den Rauschprozess umzuk
****************************************************************************************************
Document:  Export to Sheets
Fazit
Während GANs einen bedeutenden Meilenstein in der generativen Modellierung darstellen, haben sich Diffusionsmodelle als vielversprechende Alternative mit mehreren Vorteilen her
****************************************************************************************************
Document

## Question Answering

In [11]:
# define LLM
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", google_api_key="google_api_key")

In [12]:
prompt = PromptTemplate.from_template("You are a multilingual assistant. Answer user's {query} from given {context} in same language as query.")
chain = prompt | llm
response = chain.invoke({'query': query,
                            'context': context})

In [13]:
print("query: \n",query)
print("response: \n", response.content)

query: 
 wat is die verskil tussen GAN'e en diffusiemodel?
response: 
 Die verskil tussen GAN'e (Generatiewe Teenstander Netwerke) en diffusiemodelle lê in hoe hulle werk:

**GAN'e:**

* **Twee netwerke:** 'n Generator wat probeer om vals data te skep, en 'n diskrimineerder wat probeer om te onderskei tussen werklike en vals data. Hulle "veg" teen mekaar, wat die generator beter maak met tyd.
* **Voorbeeld:** Dink aan 'n vervalser wat probeer om 'n skildery na te maak, en 'n kunskenner wat probeer om die vervalsing te identifiseer.
* **Voordele:** Kan baie realistiese data skep.
* **Nadele:** Kan onstabiel wees om te oefen, en kan sukkel om diverse data te skep.

**Diffusiemodelle:**

* **Een netwerk:** 'n Enkele netwerk wat leer om geraas by data te voeg en dit dan weer te verwyder. Deur hierdie proses te keer, kan dit nuwe data genereer.
* **Voorbeeld:** Dink aan 'n glas water waar jy ink byvoeg en dit dan stadig verwyder totdat jy skoon water oor het. Die model leer hoe om die "ink"

**Translated output:**

The difference between GANs (Generative Adversarial Networks) and diffusion models lies in how they work:

**GANs:**

* **Two networks:** A generator that tries to create fake data, and a discriminator that tries to distinguish between real and fake data. They "fight" against each other, making the generator better with time.
* **Example:** Consider a forger trying to fake a painting, and an art connoisseur trying to identify the forgery.
* **Advantages:** Can create very realistic data.
* **Disadvantages:** Can be unstable to train, and can struggle to create diverse data.

**Diffusion models:**

* **One network:** A single network that learns to add noise to data and then remove it again. By reversing this process, it can generate new data.
* **Example:** Think of a glass of water where you add ink and then slowly remove it until you are left with clear water. The model learns how to add and remove the "ink" (noise).
* **Advantages:** More stable to train, and can create more diverse data.
* **Disadvantages:** Can be slower to generate than GANs.

**In short:** GANs are like two artists competing against each other, while diffusion models are like a single artist learning to unravel and reassemble an image. Both can be used to achieve impressive results, but have different strengths and weaknesses.