# Multilingual QA (Part 2)

To implement multilingual QA, a **translation** step is added on the query side.

Steps:
1. Data preparation: load the PDF and split the text into chunks.
2. Create the document embeddings.
3. Create a FAISS vector store and insert the embedded documents into it.
4. Translate the query into English.
5. Run retrieval with the English-translated query.
6. Generate the answer from the original query and the retrieved documents.
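
The control flow of steps 4 to 6 can be sketched without any model or index. The snippet below is only an illustration: `multilingual_qa` and the `toy_*` functions are hypothetical stand-ins, not part of LangChain or of this notebook. The point it shows is that the translated query is used only for retrieval, while the original query is passed to answer generation.

```python
from typing import Callable, List


def multilingual_qa(
    query: str,
    translate: Callable[[str], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, str], str],
) -> str:
    """Retrieve with the English query, but answer the original query."""
    english_query = translate(query)      # step 4: translate the query
    documents = retrieve(english_query)   # step 5: retrieve with the English query
    context = "\n--\n".join(documents)    # join the retrieved chunks
    return answer(query, context)         # step 6: answer the original query


# Hypothetical stand-ins so the sketch runs without a model or an index.
def toy_translate(query: str) -> str:
    return "What happens in the forward process?"


def toy_retrieve(english_query: str) -> List[str]:
    corpus = {
        "forward": "The forward process adds noise step by step.",
        "reverse": "The reverse process removes noise step by step.",
    }
    return [text for key, text in corpus.items() if key in english_query.lower()]


def toy_answer(query: str, context: str) -> str:
    return f"[answer to {query!r} using context: {context}]"


result = multilingual_qa(
    "Que se passe-t-il dans le processus avancé ?",
    toy_translate,
    toy_retrieve,
    toy_answer,
)
print(result)
```

In the notebook itself, the three callables correspond to the translation chain, the FAISS retriever, and the answer chain.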

## Import libraries

In [1]:
from langchain_community.document_loaders import PDFPlumberLoader
from langchain_community.embeddings.huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores.faiss import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.docstore.in_memory import InMemoryDocstore
import faiss
from langchain_google_genai.chat_models import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate

## Data preparation

In [3]:
# Load the PDF and split it into chunks (chunk_size and chunk_overlap are counted in tiktoken tokens)

pdf_data = PDFPlumberLoader("./sample_data/2306.04542v3.pdf").load()
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=1000,
    chunk_overlap=50,
)
data_splits = splitter.split_documents(pdf_data)
print("no of chunks: ", len(data_splits))


no of chunks:  53
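
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over a token list. It is a simplification and an assumption for illustration only: `RecursiveCharacterTextSplitter` first splits on separators and then merges pieces, so its chunk boundaries will differ. The function name `split_with_overlap` is hypothetical.

```python
from typing import List


def split_with_overlap(
    tokens: List[str], chunk_size: int, chunk_overlap: int
) -> List[List[str]]:
    """Cut tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


toy_tokens = [f"t{i}" for i in range(10)]
toy_chunks = split_with_overlap(toy_tokens, chunk_size=4, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Each window starts `chunk_size - chunk_overlap` tokens after the previous one, so consecutive chunks repeat the overlapping tokens and a sentence cut at a boundary still appears whole in one of them.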


## Document Embedding and Vector store creation

In [2]:
# Hugging Face embedding model; gte-base targets English text, which is why the query is translated before retrieval
model_name = "thenlper/gte-base"
embeddings = HuggingFaceEmbeddings(
    model_name=model_name,
    model_kwargs={"device": "cpu", "trust_remote_code": True},
    encode_kwargs={"normalize_embeddings": True},
)

In [4]:
# create faiss vectorstore
vector_store = FAISS(
    embedding_function=embeddings,
    index=faiss.IndexFlatL2(768),  # exact L2 index; 768 is the gte-base embedding dimension
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
# print(vector_store)
vector_store.add_documents(documents=data_splits)
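
As a rough picture of what a flat L2 index does at query time, the sketch below runs an exhaustive nearest-neighbour search in plain Python. It is not the FAISS implementation, and the names `normalize`, `squared_l2` and `flat_l2_search` are hypothetical. It also illustrates why `normalize_embeddings=True` matters: for unit-length vectors the squared L2 distance equals `2 - 2 * cosine_similarity`, so ranking by L2 distance and ranking by cosine similarity give the same order.

```python
import math
from typing import List, Sequence, Tuple


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    index_vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[int, float]]:
    """Return the k nearest (position, squared distance) pairs, closest first."""
    scored = sorted(
        (squared_l2(vec, query), pos) for pos, vec in enumerate(index_vectors)
    )
    return [(pos, dist) for dist, pos in scored[:k]]


# Toy 2-dimensional "embeddings" (the notebook's index uses 768 dimensions).
doc_vectors = [normalize(v) for v in ([1.0, 0.0], [1.0, 1.0], [0.0, 1.0])]
query_vector = normalize([2.0, 1.0])
hits = flat_l2_search(doc_vectors, query_vector, k=2)
print(hits)
```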

In [11]:
vector_store.save_local('faiss_index_2')

In [12]:
# Retriever
retriever = vector_store.as_retriever()

## Query translation to english

In [13]:
# translation chain

# "google_api_key" is a placeholder; supply your own key
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", google_api_key="google_api_key")
prompt = PromptTemplate.from_template("Translate the query {query} to English if it is not already in English. Output only the translated query.")
translate_chain = prompt | llm
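
Rather than writing the key into the notebook, one option is to read it from an environment variable. The helper below is a minimal sketch; `read_api_key` is a hypothetical name, and the variable name `GOOGLE_API_KEY` is an assumption about how the key is stored on your machine.

```python
import os


def read_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key stored in env_var, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

The result can then be passed as `google_api_key=read_api_key()` when constructing `ChatGoogleGenerativeAI`.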

In [15]:
# Multilingual query translation
query = "Que se passe-t-il dans le processus avancé du modèle de diffusion ?"
translated_query = translate_chain.invoke({'query': query})
print("English translated query: ", translated_query.content)

English translated query:  What happens in the forward process of the diffusion model? 



## Retrieval

In [16]:
out_docs = retriever.invoke(translated_query.content)
context = "\n--\n".join(doc.page_content for doc in out_docs)

In [17]:
print("Retrieval output: \n")
for doc in out_docs:
    print("Document: ", doc.page_content[0:200])
    print("*"*100)

Retrieval output: 

Document:  1
On the Design Fundamentals of Diffusion
Models: A Survey - Supplementary Material
Ziyi Chang, George Koulieris, Hubert P. H. Shum, Senior Member, IEEE
✦
1 DERIVATION OF THE FORWARD PROCESS Mathemati
****************************************************************************************************
Document:  4
whereβ
t
isthenoiseschedule,whichisahyper-parameter 3 THE FORWARD PROCESS
tocontroltheamountofnoisetobeaddedineachtimestep.
The forward process defines the way data to be perturbed
Asallforwardtrans
****************************************************************************************************
Document:  1
On the Design Fundamentals of
Diffusion Models: A Survey
Ziyi Chang, George Koulieris, Hubert P. H. Shum, Senior Member, IEEE
Abstract—Diffusionmodelsaregenerativemodels,whichgraduallyaddandremoveno
****************************************************************************************************
Document:  2
Fig.1.Theoverviewofdiffusion

## Question Answering

In [18]:
ans_prompt = PromptTemplate.from_template("You are a helpful multilingual assistant. Answer the user's query {query} from the given context {context} in the same language as the query.")
ans_chain = ans_prompt | llm

In [19]:
response = ans_chain.invoke({'query': query, 'context': context})

In [21]:
print("query: \n", query)
print("response: \n", response.content)

query: 
 Que se passe-t-il dans le processus avancé du modèle de diffusion ?
response: 
 Que se passe-t-il dans le processus avancé du modèle de diffusion ?

Le processus avancé, également appelé processus de diffusion, perturbe un exemple d'apprentissage  $x_0$ à mesure que le pas de temps $t$ augmente, comme illustré dans la Figure 2. Une transition avancée $p(x_t | x_{t-1})$ décrit une telle perturbation où une petite quantité de bruit  $\epsilon_t$ est ajoutée entre deux pas de temps. En d'autres termes, à mesure que le processus avancé progresse dans la chaîne, de plus en plus de bruit est ajouté par $p(x_t | x_{t-1})$ et l'échantillon perturbé $x_t$ devient de plus en plus bruyant. Après plusieurs pas de temps, la distribution originale $p(x_0)$ est finalement perturbée en une distribution terminale traitable $p(x_T)$ qui est généralement définie comme une distribution isotrope gaussienne, c'est-à-dire $x_T \sim \mathcal{N}(0,I)$ , où $I$ est la matrice d'identité. 



**Translated output:**

What happens in the forward process of the diffusion model?

The forward process, also called the diffusion process, perturbs a training example $x_0$ as the time step $t$ increases, as shown in Figure 2. A forward transition $p(x_t | x_{t-1})$ describes such a perturbation, where a small amount of noise $\epsilon_t$ is added between two time steps. In other words, as the forward process progresses along the chain, more and more noise is added by $p(x_t | x_{t-1})$ and the perturbed sample $x_t$ becomes noisier and noisier. After several time steps, the original distribution $p(x_0)$ is finally perturbed into a tractable terminal distribution $p(x_T)$, which is usually defined as an isotropic Gaussian distribution, i.e. $x_T \sim \mathcal{N}(0,I)$, where $I$ is the identity matrix.