# A walkthrough for multi-language document Q&A

For my data journalism talk [How I convinced GPT to teach me about Hungarian folktales (without speaking a word of Hungarian)](https://github.com/jsoma/mediaparty-folktales) at Media Party Chicago 2023.

There's a *lot* more you can do with this; this is just the very basics! Feel free to email me at [js4571@columbia.edu](mailto:js4571@columbia.edu) or find me on Twitter at [@dangerscarf](https://twitter.com/dangerscarf) if you want more details.

We're going to be using [langchain](https://python.langchain.com/en/latest/index.html) but you can absolutely use other great tools like [LlamaIndex](https://gpt-index.readthedocs.io/en/latest/).

In [2]:
# !pip install -U langchain chromadb

# Read in our text

We're going to be asking questions about the book [Eredeti népmesék](https://www.google.com/books/edition/Eredeti_n%C3%A9pmes%C3%A9k/FcZSEAAAQBAJ?hl=en). We're getting the text version [from Project Gutenberg](https://www.gutenberg.org/files/38852/38852-0.txt).

In [4]:
import requests
import re

# Gutenberg pretends everything is English, which
# means "Hát gyöngyömadta" gets really mangled unless
# we decode the raw bytes as UTF-8 ourselves
response = requests.get("https://www.gutenberg.org/files/38852/38852-0.txt")
text = response.content.decode("utf-8")

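# Cleaning up newlines: drop carriage returns, then remove every
# newline that is followed by text. Hard-wrapped lines are joined
# without a space, so some words run together in the passages below.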
text = text.replace("\r", "")
text = re.sub("\n(?=[^\n])", "", text)

# Saving the book
with open('book.txt', 'w', encoding='utf-8') as f:
    f.write(text)
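
To see what the newline cleanup does, here is a small sketch on a made-up sample string. The sample and the `join_with_space` variant are illustrative assumptions, not part of the original notebook. The notebook's pattern removes every newline that is followed by text, which glues hard-wrapped lines together without a space. The variant swaps single newlines for spaces and keeps blank-line paragraph breaks, in case you'd rather keep words separate.

```python
import re

# Made-up sample: Windows-style line endings and one paragraph break
sample = "first line\r\nsecond line\r\n\r\nnew paragraph"


def join_as_in_notebook(text):
    """Same two steps as the notebook cell above."""
    text = text.replace("\r", "")
    # Delete any newline that is directly followed by a non-newline character
    return re.sub("\n(?=[^\n])", "", text)


def join_with_space(text):
    """Hypothetical variant: single newlines become spaces, blank lines stay."""
    text = text.replace("\r", "")
    # Only touch newlines that have no newline on either side
    return re.sub(r"(?<!\n)\n(?!\n)", " ", text)


print(repr(join_as_in_notebook(sample)))
print(repr(join_with_space(sample)))
```

Swapping in the variant would change the chunk boundaries and the retrieved passages, so the outputs further down would no longer match exactly.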

# Split up the text

We'll divide our text up into chunks of up to 1,000 characters, with up to 100 characters of overlap between neighboring chunks. We're loading text here, but you can load [all sorts of other kinds of documents](https://python.langchain.com/en/latest/modules/indexes/document_loaders.html) like HTML, PDFs and more.

In [51]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('book.txt', encoding='utf-8')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)
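
If the idea of overlapping chunks is fuzzy, here is a minimal sketch of a plain sliding window. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators like paragraph breaks and spaces, so its chunks vary in length). The `chunk_text` function is a hypothetical helper, just for illustration.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example: 4-character chunks with 1 character of overlap
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The overlap means a sentence that straddles a chunk boundary still shows up mostly intact in at least one chunk.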

# Create embeddings for each passage

There are lots lots lots of different multilingual embeddings available! We're using one called [`paraphrase-multilingual-MiniLM-L12-v2`](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2). Different ones will perform better or worse, depending on many many *many* variables.

In [52]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name='paraphrase-multilingual-MiniLM-L12-v2')

from langchain.vectorstores import Chroma

docsearch = Chroma.from_documents(docs, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.


If you're curious, we can look at the embedding for just _one_ piece of text. Let's try it with **What did Zsuzska steal from the devil?**

In [53]:
scores = embeddings.embed_documents(["What did Zsuzska steal from the devil?"])[0]
len(scores)

384

384 scores for the document! Unlike the examples in the talk, they don't stand for "dog" or "wild" or "fuzzy" or anything else specific. They're just... magic numbers that only mean things to the embedding model. We just have to trust that texts with similar scores are going to have similar content!

If we wanted to look at the first twenty of those scores:

In [54]:
print(scores[:20])

[-0.46812501549720764, 0.47161218523979187, -0.39475440979003906, 0.18969321250915527, 0.08756688982248306, 0.04914027079939842, 0.6678051948547363, 0.24234464764595032, 0.011556974612176418, 0.24045951664447784, 0.15715603530406952, 0.04403669759631157, 0.25661030411720276, -0.12375714629888535, -0.5067397356033325, 0.053942807018756866, 0.06712772697210312, 0.13114140927791595, -0.17556653916835785, 0.2375480830669403]
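
How do you decide whether two lists of scores are "similar"? One common measure is cosine similarity, sketched below with tiny made-up vectors standing in for the real 384-number embeddings. This is an assumption for illustration: the vector store may use a different distance measure under the hood.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 2-dimensional "embeddings"
same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
unrelated = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, unrelated)
```

Values near 1 mean the two vectors point the same way, and values near 0 mean they have little in common.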


## Find relevant passages using embeddings

To take that one step further, let's try to find one related passage. Here's a match for **What did Zsuzska steal from the devil?**

In [55]:
# k=1 because we only want one result
docsearch.similarity_search("What did Zsuzska steal from the devil?", k=1)

[Document(page_content='Hiába tagadta szegény Zsuzska, nem használt semmit, elindult hát nagyszomorúan. Épen éjfél volt, mikor az ördög házához ért, aludt az ördögis, a felesége is. Zsuzska csendesen belopódzott, ellopta a tenger-ütőpálczát, avval bekiáltott az ablakon.\n– Hej ördög, viszem ám már a tenger-ütő pálczádat is.\n– Hej kutya Zsuzska, megöletted három szép lyányomat, elloptad atenger-lépő czipőmet, most viszed a tenger-ütő pálczámat, de majdmeglakolsz te ezért.\nUtána is szaladt, de megint csak a tengerparton tudott közel jutnihozzá, ott meg Zsuzska megütötte a tengert a tenger-ütő pálczával,kétfelé vált előtte, utána meg összecsapódott, megint nem foghatta megaz ördög. Zsuzska ment egyenesen a királyhoz.\n– No felséges király, elhoztam már a tengerütő pálczát is.', metadata={'source': 'book.txt'})]

I don't know Hungarian, but we can [translate the match with DeepL](https://www.deepl.com/en/translator).

**Translation:**

> Poor Zsuzska denied it in vain, it was of no use, so she set off in sorrow. It was only midnight when she reached the devil's house, and the devil's wife was asleep. Zuzska crept in quietly, stole the sea-whisk, and with it she called through the window.
> 
> "Hey, devil, I'll take your sea-rod too."
>
> "Hey dog, Zsuzska, you killed three of my beautiful girls, you stole my sea-stepping-hip, now you're taking my sea-bat, but you'll get it for this."
> 
> He ran after her, but again he could only get close to her on the beach, and there she hit the sea with her sea-beating stick, and it split in two in front of him, and then they clashed, and again the devil couldn't catch her. She went straight to the king.
> 
> "Well, sire King, I have brought the sea-striking stick."

Seems like a good match!
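
Conceptually, a similarity search scores every stored passage against the question and keeps the best `k`. Here is a brute-force sketch of that idea using made-up three-number vectors and cosine similarity. Both are assumptions for illustration: real vector stores use approximate indexes and may rank by a different distance. `top_k` is a hypothetical helper, not a Chroma or langchain function.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, passage_vectors, k=1):
    """Return the indexes of the k passages most similar to the query."""
    scored = [
        (cosine_similarity(query_vector, vector), index)
        for index, vector in enumerate(passage_vectors)
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:k]]


# Made-up embeddings for three passages and one question
passage_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
]
query_vector = [1.0, 0.1, 0.0]

print(top_k(query_vector, passage_vectors, k=2))
```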

## Filter, search, and send our question to GPT

In [56]:
# Build our connection to GPT

from langchain.chat_models import ChatOpenAI

# Your API key
openai_api_key = "sk-...."

# Use temperature=0 to make the results as repeatable as possible
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    openai_api_key=openai_api_key)
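
If you'd rather not paste your key into the notebook, one option is to read it from an environment variable. This is a small sketch, assuming you've stored the key under the conventional name `OPENAI_API_KEY`. The `load_api_key` helper is hypothetical.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Read the API key from the environment, failing loudly if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

You could then write `openai_api_key = load_api_key()` instead of hard-coding the string.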

In [57]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever()
)
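
The `"stuff"` chain type means the retrieved passages get stuffed into a single prompt along with the question. Here is a rough sketch of that idea. The prompt wording and the `build_stuff_prompt` helper are made up for illustration and are not langchain's actual template.

```python
def build_stuff_prompt(question, passages):
    """Combine all retrieved passages and the question into one prompt string."""
    context = "\n\n".join(passages)
    return (
        "Use the following passages to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder passages standing in for the retrieved Hungarian text
example_prompt = build_stuff_prompt(
    "What did Zsuzska steal from the devil?",
    ["(first retrieved passage)", "(second retrieved passage)"],
)
print(example_prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved passages fit inside the model's context window.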

In [58]:
query = "What did Zsuzska steal from the devil?"
qa.run(query)

"Zsuzska stole the devil's sea-striking stick, his golden cabbage head, and his golden baby in a golden cradle."

# What about attribution?

Do we want sources? We just need to add some sort of metadata to our passages. We'll do it in a very simple way right now: we're just **giving them numbers.**

In [102]:
# Loop through each document, adding a source label and an index
for i, doc in enumerate(docs):
    doc.metadata['source'] = f"passage-{i}"
    doc.metadata['passage_index'] = i

docsearch = Chroma.from_documents(docs, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.


In [103]:
# Now metadata has both a source and a passage_index
docs[0].metadata

{'source': 'passage-0', 'passage_index': 0}

In [104]:
from langchain.chains import RetrievalQAWithSourcesChain

chain = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=docsearch.as_retriever()
)

In [105]:
query = "What did Zsuzska steal from the devil?"

chain({ "question": query, "verbose": True })

{'question': 'What did Zsuzska steal from the devil?',
 'verbose': True,
 'answer': 'Zsuzska stole the tenger-ütő pálczát (a staff that can split the sea) from the devil.\n',
 'sources': 'passage-267, passage-268, passage-269, passage-266'}

Why is this answer different?? I don't know! Oh boy, looks like LLMs... *aren't perfect?*

In [106]:
# Then you can pull out the docs individually
docs[267]

Document(page_content='Hiába tagadta szegény Zsuzska, nem használt semmit, elindult hát nagyszomorúan. Épen éjfél volt, mikor az ördög házához ért, aludt az ördögis, a felesége is. Zsuzska csendesen belopódzott, ellopta a tenger-ütőpálczát, avval bekiáltott az ablakon.\n– Hej ördög, viszem ám már a tenger-ütő pálczádat is.\n– Hej kutya Zsuzska, megöletted három szép lyányomat, elloptad atenger-lépő czipőmet, most viszed a tenger-ütő pálczámat, de majdmeglakolsz te ezért.\nUtána is szaladt, de megint csak a tengerparton tudott közel jutnihozzá, ott meg Zsuzska megütötte a tengert a tenger-ütő pálczával,kétfelé vált előtte, utána meg összecsapódott, megint nem foghatta megaz ördög. Zsuzska ment egyenesen a királyhoz.\n– No felséges király, elhoztam már a tengerütő pálczát is.', metadata={'source': 'passage-267', 'passage_index': 267})
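
To pull out every cited passage instead of typing the numbers by hand, you could parse the `sources` string. This is a minimal sketch that assumes the string looks like the output above (comma-separated `passage-N` labels). The model writes that string, so it may not always be this tidy, and `parse_sources` is a hypothetical helper.

```python
def parse_sources(sources):
    """Turn 'passage-267, passage-268' into [267, 268], skipping anything else."""
    prefix = "passage-"
    indexes = []
    for label in sources.split(","):
        label = label.strip()
        number = label[len(prefix):]
        if label.startswith(prefix) and number.isdigit():
            indexes.append(int(number))
    return indexes


print(parse_sources("passage-267, passage-268, passage-269, passage-266"))
```

With the notebook's variables, something like `[docs[i] for i in parse_sources(result['sources'])]` would then give you the cited passages, where `result` is whatever the chain returned.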