Hi, I'm Soma! You can find me on email at [jonathan.soma@gmail.com](mailto:jonathan.soma@gmail.com), on Twitter at [@dangerscarf](https://twitter.com/dangerscarf), or maybe even on [this newsletter I've never sent](https://tinyletter.com/jsoma).

# Multi-language document Q&A with LangChain and GPT-3.5-turbo 

## Using GPT, LangChain, and vector stores to ask questions of documents in languages you don't speak

I don't speak Hungarian, but **I demand to have my questions about Hungarian folktales answered!** Let's use GPT to do this for us!

*This might be useful if you're doing a cross-border investigation, are interested in academic papers outside of your native tongue, or are just interested in learning how LangChain and document Q&A work.*

In this tutorial, we'll look at:

1. Why making ChatGPT read a whole book is impossible
2. How to provide GPT (and other AI tools) with context to provide answers

If you don't want to read all of this nonsense you can go directly to the source and check out [Question Answering](https://langchain.readthedocs.io/en/latest/use_cases/question_answering.html) or [Question Answering with Sources](https://langchain.readthedocs.io/en/latest/modules/indexes/chain_examples/qa_with_sources.html). This just adds a bit of multi-language sparkle on top!

## Our source material

**We'll begin by downloading the source material.** If your original documents are in PDF form or anything like that, you'll want to convert them to text first.

Our reference is a book of folktales called [Eredeti népmesék](https://www.gutenberg.org/ebooks/38852) by László Arany on Project Gutenberg. It's just [a basic text file](https://www.gutenberg.org/files/38852/38852-0.txt) so we can download it easily.

In [1]:
import requests
import re

# Gutenberg pretends everything is English, which
# means "Hát gyöngyömadta" gets really mangled
response = requests.get("https://www.gutenberg.org/files/38852/38852-0.txt")
text = response.content.decode("utf-8")

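# Cleaning up newlines: join hard-wrapped lines while keeping
# paragraph breaks. No space is added at the join, so words on
# either side of a line break run together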
text = text.replace("\r", "")
text = re.sub("\n(?=[^\n])", "", text)

# Saving the book
with open('book.txt', 'w', encoding='utf-8') as f:
    f.write(text)

And the text is indeed in Hungarian:

In [5]:
print(text[3000:4500])

be, de az is épen úgy járt, mint abátyja, ez is kiszaladt a szobából.
Harmadik nap a legfiatalabb királyfin volt a sor; a bátyjai be se’akarták ereszteni, hogy ha ők ki nem tudták venni az apjokból, biz’ e’se’ sokra megy, de a királyfi nem tágitott, hanem bement. Mikor elmondtahogy m’ért jött, ehez is hozzá vágta az öreg király a nagy kést, de eznem ugrott félre, hanem megállt mint a peczek, kicsibe is mult, hogybele nem ment a kés, a sipkáját kicsapta a fejéből, úgy állt meg azajtóban. De a királyfi még ettől se’ ijedt meg, kihúzta a kést azajtóból, odavitte az apjának. ,,Itt van a kés felséges király atyám, hamegakar ölni, öljön meg, de elébb mondja meg mitől gyógyulna meg aszeme, hogy a bátyáim megszerezhessék.’’
Nagyon megilletődött ezen a beszéden a király, nemhogy megölte volnaezért a fiát, hanem össze-vissza ölelte, csókolta. No kedves fiam –mondja neki – nem hiában voltál te egész életemben nekem legkedvesebbfiam, de látom most is te szántad el magad legjobban a halálra az énme

Luckily for us, GPT speaks Hungarian! So if we tell it to read the book, it'll be able to answer all of our English-language questions. There's just one catch: the book is *not* a short little paragraph.

Life would be nice if we could just feed it directly to ChatGPT and start asking questions, but **you can't make ChatGPT read a whole book**. After it gets partway through the book ChatGPT starts forgetting the earlier pieces!

There are a few tricks to get around this. We'll work with one of the simplest for now:

1. Split our original text up into pieces
2. Find the pieces most relevant to our question
3. Send those pieces to GPT along with our question

Newer LLMs can deal with a lot more tokens at a time – GPT-4 has both an 8k and 32k version – but hey, we work with what we've got.

## Part 1: Split our original text up into pieces

To do pretty much everything from here on out we're relying on [LangChain](https://langchain.readthedocs.io/), a really fun library that allows you to bundle together different common tasks when working with language models. Its best trick is chaining together AI at different steps in the process, but for the moment we're just using its text search abilities.

We're going to split our text up into 1000-character chunks, which should be around 150-200 words apiece. I'm also going to add a little overlap.

In [112]:
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = TextLoader('book.txt')
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

Technically speaking I'm using a `RecursiveCharacterTextSplitter`, which tries to keep paragraphs and sentences and all of those things together, so it might go above or below 1000. But it should *generally* hit the mark.
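
To make the chunk-and-overlap idea concrete, here's a minimal sketch of a naive fixed-size splitter. It is *not* how `RecursiveCharacterTextSplitter` works internally (that one tries to break on paragraphs and sentences); `naive_split` is a hypothetical helper that just slices by character count, using the same `chunk_size` and `chunk_overlap` values as above.

```python
def naive_split(text, chunk_size=1000, chunk_overlap=100):
    """Slice text into fixed-size character chunks that overlap a little.

    Hypothetical helper for illustration only: unlike LangChain's
    RecursiveCharacterTextSplitter, it ignores paragraph and sentence
    boundaries and cuts purely by character count.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    # Each new chunk starts this many characters after the previous one
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


# 2,500 characters of filler text standing in for the book
sample = "".join(str(i % 10) for i in range(2500))
chunks = naive_split(sample, chunk_size=1000, chunk_overlap=100)

print(len(chunks))
print([len(chunk) for chunk in chunks])
```

The last 100 characters of one chunk are repeated at the start of the next, so a sentence that straddles a cut still shows up whole in at least one chunk.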

In [113]:
len(docs)

440

Overall this gave us just over 400 documents. Let's pick one at random to check out, just to make sure things went okay.

In [124]:
docs[109]

Document(page_content='Mikor aztán eljött a lakodalom napja, felöltözött, de olyan ruhába, hogyTündérországban se igen látni párját sátoros ünnepkor se, csak elfogta acselédje szemefényét. Mire a királyi palotához ért, már ott ugyancsakszólott a muzsika, úgy tánczoltak, majd leszakadt a ház, még a süketnekis bokájába ment a szép muzsika.', lookup_str='', metadata={'source': 'book.txt'}, lookup_index=0)

It's a little short, but it's definitely part of the folktales. According to Google Translate:

> When the day of the wedding came, she dressed up, but in such a dress that one would not see her partner in a fairyland even during a tent festival, she only caught the eye of her mistress. By the time he got to the royal palace, the music was already playing there too, they were dancing like that, and then the house was torn apart, the beautiful music even went to the deaf man's ankles

Sounds like a pretty fun party!

## Part 2: Find the pieces most relevant to our question

### Understanding text embeddings and semantic search

If we're asking questions about a wedding, we can't just look for the text *wedding* – our documents are in Hungarian, so that's *lakodalom* (I think). Instead, we're going to use something called **embeddings**.

Embeddings take a word, sentence, or snippet of text and turn it into a string of numbers. Take the sentences below as an example: I've scored each one of them as to how much they're about shopping, home, and animals.

|sentence|shopping|home|animals|result|
|---|---|---|---|---|
|You should buy a house|0.9|0.8|0|`(0.9, 0.8, 0.0)`|
|The cat is in the house|0|1|0.8|`(0.0, 1.0, 0.8)`|
|The dog bought a pet mouse|1|0.2|1|`(1.0, 0.2, 1.0)`|

Let's say we have a fourth sentence – **the dog is at home**. I've decided it scores `(0.0, 1.0, 0.9)` since it's about home and animals, but not shopping. How can we find a similar text?

**The cat is in the house** is the best match from our original list, *even though it doesn't have any words that match*. But if we ignore the words and look at the scores, it's clearly the best match! That's more or less the basic idea behind text embeddings and semantic search.
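
Here's a small sketch of that comparison, reusing the hand-made scores from the table. It assumes cosine similarity as the way to compare two lists of scores; that's one common choice, not necessarily the measure a given vector store uses.

```python
import math

# Hand-made (shopping, home, animals) scores from the table above
scored_sentences = {
    "You should buy a house": (0.9, 0.8, 0.0),
    "The cat is in the house": (0.0, 1.0, 0.8),
    "The dog bought a pet mouse": (1.0, 0.2, 1.0),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# "The dog is at home"
query = (0.0, 1.0, 0.9)

# Score every known sentence against the query, ignoring the words entirely
scores = {
    sentence: cosine_similarity(query, vector)
    for sentence, vector in scored_sentences.items()
}
best_match = max(scores, key=scores.get)

for sentence, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{score:.2f}  {sentence}")
print("Best match:", best_match)
```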

Instead of reasonable categories like mine, actual embeddings are something like 384 or 512 different dimensions your text is scored on. And unlike "shopping" or "animal" above, the dimensions aren't anything you can understand. They're generated by computers that have read a lot lot lot of the internet, so we just have to trust them!

:::{.callout-info}
You might want to read [my introduction to word embeddings](https://investigate.ai/text-analysis/word-embeddings/) for more details, along with [conceptual document similarity](https://investigate.ai/text-analysis/document-similarity-using-word-embeddings/).
:::

### Creating and searching our embeddings database

There are many, many embeddings out there, and they each score text differently. We need one that supports English (for our queries) and Hungarian (for the dataset): while not all of them support multiple languages, [it isn't hard to find some that do](https://www.sbert.net/docs/pretrained_models.html#multi-lingual-models)!

We're going to pick `paraphrase-multilingual-MiniLM-L12-v2` since it supports a delightful 50 languages. That way we can ask questions in French or Italian, or maybe add some Japanese folklore to the mix later on.

These multilingual embeddings have read enough sentences across the all-languages-speaking internet to *somehow* know things like that cat and lion and Katze and tygrys and 狮 are all vaguely feline. At this point we don't need to know how it works, just that it gets the job done!

In [146]:
from langchain.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name='paraphrase-multilingual-MiniLM-L12-v2')

In order to find the most relevant pieces of text, we'll also need something that can store and search embeddings. That way when we want to find anything about *weddings* it won't have a problem finding *lakodalom*.
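
If "store and search embeddings" sounds abstract, here's a toy version of the idea. `TinyVectorStore` is a hypothetical class and the three-number vectors are made up for illustration; a real vector database works with real embeddings and much smarter indexing, but the interface is the same in spirit: put vectors in, ask for the `k` closest ones.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and ranks them."""

    def __init__(self):
        self.items = []

    def add(self, text, vector):
        self.items.append((text, vector))

    def similarity_search(self, query_vector, k=4):
        """Return the k stored texts whose vectors are closest to the query."""
        ranked = sorted(
            self.items,
            key=lambda item: cosine_similarity(query_vector, item[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = TinyVectorStore()
# Made-up vectors, NOT real embeddings of these Hungarian phrases
store.add("lakodalom", (0.9, 0.1, 0.0))
store.add("tenger-ütő pálcza", (0.1, 0.9, 0.2))
store.add("arany bölcső", (0.2, 0.6, 0.8))

# A made-up vector standing in for the English query "wedding"
wedding_query = (1.0, 0.0, 0.1)
print(store.similarity_search(wedding_query, k=1))
```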

We're going to use [Chroma](https://github.com/chroma-core/chroma) for no real reason, just because it has a convenient LangChain extension.

In [187]:
# You'll probably need to install chromadb
# !pip install chromadb

In [148]:
from langchain.vectorstores import Chroma

db = Chroma.from_documents(docs, embeddings)

Running Chroma using direct local API.
Using DuckDB in-memory for database. Data will be transient.


Now that this is stored, we can search for weddings at a festival.

In [149]:
db.similarity_search("weddings at a festival with loud music", k=1)

[Document(page_content='Eltelt az egy hónap, elérkezett az esküvő napja, ott volt a sok vendég,köztök a boltos is, csak a vőlegényt meg a menyasszonyt nem lehetettlátni. Bekövetkezett az ebéd ideje is, mindnyájan vígan ültek le azasztalhoz, elkezdtek enni. Az volt a szokás a gróf házánál, hogy mindenembernek egy kis külön tálban vitték az ételt; a boltos amint a magatáljából szedett levest, hát csak alig tudta megenni, olyan sótalanvolt, nézett körül só után, de nem volt az egész asztalon; a másodikétel még sótalanabb volt, a harmadik meg már olyan volt, hogy hozzá se’tudott nyúlni. Kérdezték tőle hogy mért nem eszik? tán valami baja vanaz ételnek? amint ott vallatták, eszébe jutott a lyánya, hogy az nekiazt mondta, hogy úgy szereti, mint a sót, elkezdett sírni; kérdeztékaztán tőle, hogy mért sír, akkor elbeszélt mindent, hogy volt neki egylyánya, az egyszer neki azt mondta, hogy úgy szereti mint a sót, őmegharagudott érte, elkergette a házától, lám most látja, hogy milyenigazságtalan 

It's a match! Now we'll use this to find passages relevant to our question, which we'll then pass along to GPT as context.

## Part 3: Send the matches to GPT along with our question

This is the part where [LangChain](https://langchain.readthedocs.io/) really shines. We just say "hey, go get the relevant pieces from our database, then go talk to GPT for us!"

First, we'll fire up our connection to GPT (you'll need to provide your own API key!).

In [166]:
from langchain.llms import OpenAI

# Connect to GPT-3.5 turbo
openai_api_key = "sk-..."

llm = OpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    openai_api_key=openai_api_key)

Second, we'll put together our vector-based Q&A. This is a custom LangChain tool that takes our original question, finds relevant passages, and packages it all up to send over to GPT.

In [190]:
from langchain.chains import VectorDBQA

# Vector-database-based Q&A
qa = VectorDBQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    vectorstore=db
)
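
The `"stuff"` chain type means the retrieved passages all get stuffed into a single prompt alongside the question. Here's a rough sketch of that idea; `build_stuff_prompt` is a hypothetical helper and the prompt wording is made up, so don't read it as LangChain's actual template.

```python
def build_stuff_prompt(question, passages):
    """Pack every retrieved passage plus the question into one prompt string.

    The wording of this prompt is made up for illustration; LangChain's
    own template for the "stuff" chain is not reproduced here.
    """
    context = "\n\n".join(passages)
    return (
        "Answer the question using only the context below. "
        "If the context doesn't contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Two short snippets taken from the book, standing in for search results
passages = [
    "Zsuzska csendesen belopódzott, ellopta a tenger-ütőpálczát, avval bekiáltott az ablakon.",
    "– Hej ördög, viszem ám már a tenger-ütő pálczádat is.",
]

prompt = build_stuff_prompt("What did Zsuzska steal from the devil?", passages)
print(prompt)
```

Notice the passages stay in Hungarian while the question is in English: the translating happens inside the language model, not in the search step.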

## Let's see it in action!

I'm going to ask some questions about Zsuzska, who according to some passages apparently stole some of the devil's belongings! I don't really know anything about her, this is just from a couple random passages I translated for myself.

In [167]:
query = "What did Zsuzska steal from the devil?"
qa.run(query)

'The tenger-ütő pálczát (sea-beating stick).'

In [168]:
query = "Why did Zsuzska steal from the devil?"
qa.run(query)

"Zsuzska was forced to steal from the devil by the king, who threatened her with death if she didn't."

In [169]:
query = "Why were the king's aunts jealous of Zsuzska?"
qa.run(query)

"The king's aunts were jealous of Zsuzskát because the king had grown to love her and they wanted to undermine her by claiming that she could not steal the devil's golden cabbage head."

That's a good amount of information about Zsuzska! Let's try another character, Janko.

In [170]:
query = "Who did Janko marry?"
qa.run(query)

'Janko married a beautiful princess.'

In [171]:
query = "How did Janko meet the princess?"
qa.run(query)

"The context does not provide information on a character named Janko meeting the king's daughter."

I know for a fact that Janko met the princess because *he stole her clothes while she was swimming in a lake*, but I guess the appropriate context didn't get sent to GPT. **It actually used to get the question right before I changed the embeddings!** In the next section we'll see how to provide more context and hopefully get better answers.

There's also a big long story about a red or bloody cow that had to do with a character's mother coming back to protect him. Let's see what we can learn about it!

In [172]:
query = "Who was the bloody cow?"
qa.run(query)

'The bloody cow was a cow that Ferkó rode away on after throwing the lasso at it.'

In [173]:
query = "Why was Ferko's mother disguised as a cow?"
qa.run(query)

"Ferko's mother was not disguised as a cow, but rather the red cow was actually Ferko's mother, the first queen."

## Improving our answers from GPT

When we asked what was stolen from the devil, we were told "The tenger-ütő pálczát (sea-beating stick)." I know for a fact more things were stolen than that!

If we provide better context, we can hopefully get better answers. Usually "better context" means "more context," so we have two major options:

* Increase the size of our window/include more overlap so passages are longer
* Provide more passages to GPT as context when asking for an answer (the default is 4)

Since I haven't seen the second one show up in too many places, let's do that instead. We'll increase the number of passages provided as context to eight by passing `k=8`.

In [174]:
qa = VectorDBQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    vectorstore=db,
    k=8
)

At this point we have to be careful of two things: money and token limits.

1. **Money:** Larger requests cost more.
2. **Token limits:** We have around 3,000 words to work with for each GPT-3.5 request. If each chunk is up to 250 words long, eight chunks get us up to 2,000 words. We should be safe!
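
The arithmetic is easy to check for any `k`. This sketch uses the rough numbers from the list above (250 words per chunk, a 3,000-word budget). `fits_word_budget` is a made-up helper, and real limits are counted in tokens rather than words, so treat it as a ballpark. The question and GPT's reply need room too, so landing exactly on the budget is cutting it too close.

```python
def fits_word_budget(k, max_words_per_chunk=250, word_budget=3000):
    """Return (worst-case context size in words, whether it fits the budget).

    Rough estimate only: the defaults are the ballpark figures from the
    text, and the real limit is measured in tokens, not words.
    """
    worst_case_words = k * max_words_per_chunk
    return worst_case_words, worst_case_words <= word_budget


for k in (4, 8, 12, 16):
    words, fits = fits_word_budget(k)
    status = "fits" if fits else "too big"
    print(f"k={k}: up to {words} words of context ({status})")
```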

But we want good answers, right??? **Let's see if it works:**

In [175]:
query = "What did Zsuzská steal from the devil?"
qa.run(query)

"Zsuzska stole the devil's tenger-ütő pálczája (sea-beating stick), tenger-lépő czipője (sea-stepping shoes), and arany kis gyermek (golden baby) in an arany bölcső (golden cradle). She also previously stole the devil's tenger-ütőpálczát (sea-beating stick) and arany fej káposztát (golden head cabbage)."

Perfect! That gold cabbage sounds great, and it's almost time for lunch, so let's wrap up with *one more thing*.

## Seeing the context

If you're having trouble getting good answers to your questions, it might be because the **context you're providing isn't very good.** I was actually having this issue earlier on with `distiluse-base-multilingual-cased-v2` before I switched to `paraphrase-multilingual-MiniLM-L12-v2`! I honestly don't know the difference between them, just that one provided more relevant snippets to GPT.

Let's see what context is being provided to GPT for each search!

### Method one: Context from the question

To see what context is being sent to GPT, include the `return_source_documents=True` parameter.

In [176]:
qa = VectorDBQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    vectorstore=db,
    return_source_documents=True
)

In [181]:
query = "What did Zsuzská steal from the devil?"
result = qa({"query": query})

In [182]:
result["result"]

'Zsuzská stole the tenger-ütő pálczát (sea-beater stick) from the devil.'

In [183]:
result["source_documents"]

[Document(page_content='Hiába tagadta szegény Zsuzska, nem használt semmit, elindult hát nagyszomorúan. Épen éjfél volt, mikor az ördög házához ért, aludt az ördögis, a felesége is. Zsuzska csendesen belopódzott, ellopta a tenger-ütőpálczát, avval bekiáltott az ablakon.\n– Hej ördög, viszem ám már a tenger-ütő pálczádat is.\n– Hej kutya Zsuzska, megöletted három szép lyányomat, elloptad atenger-lépő czipőmet, most viszed a tenger-ütő pálczámat, de majdmeglakolsz te ezért.\nUtána is szaladt, de megint csak a tengerparton tudott közel jutnihozzá, ott meg Zsuzska megütötte a tengert a tenger-ütő pálczával,kétfelé vált előtte, utána meg összecsapódott, megint nem foghatta megaz ördög. Zsuzska ment egyenesen a királyhoz.\n– No felséges király, elhoztam már a tengerütő pálczát is.', lookup_str='', metadata={'source': 'book.txt'}, lookup_index=0),
 Document(page_content='De Zsuzska nem adta;,,Tán bolond vagyok, hogy visszaadjam, mikor kivülvagyok már vele az udvaron?!’’ Az ördög kergette egy 

### Method two: Just ask your database

If you already know what GPT is going to say in response and you're debugging one specific query, you can just ask your database what the relevant snippets are!

In [189]:
db.similarity_search("What did Zsuzská steal from the devil?", k=2)

[Document(page_content='Hiába tagadta szegény Zsuzska, nem használt semmit, elindult hát nagyszomorúan. Épen éjfél volt, mikor az ördög házához ért, aludt az ördögis, a felesége is. Zsuzska csendesen belopódzott, ellopta a tenger-ütőpálczát, avval bekiáltott az ablakon.\n– Hej ördög, viszem ám már a tenger-ütő pálczádat is.\n– Hej kutya Zsuzska, megöletted három szép lyányomat, elloptad atenger-lépő czipőmet, most viszed a tenger-ütő pálczámat, de majdmeglakolsz te ezért.\nUtána is szaladt, de megint csak a tengerparton tudott közel jutnihozzá, ott meg Zsuzska megütötte a tengert a tenger-ütő pálczával,kétfelé vált előtte, utána meg összecsapódott, megint nem foghatta megaz ördög. Zsuzska ment egyenesen a királyhoz.\n– No felséges király, elhoztam már a tengerütő pálczát is.', lookup_str='', metadata={'source': 'book.txt'}, lookup_index=0),
 Document(page_content='De Zsuzska nem adta;,,Tán bolond vagyok, hogy visszaadjam, mikor kivülvagyok már vele az udvaron?!’’ Az ördög kergette egy 

You can keep playing with your `k` values until you get what you think is enough context.
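
If you'd rather compare a few `k` values side by side, a small loop does the trick. This is a sketch under assumptions: `preview_context` is a hypothetical helper that expects anything with a `similarity_search(query, k=...)` method returning objects with a `page_content` attribute, like the `db` above. `FakeStore` is just a stand-in so the example runs on its own; swap in your real database.

```python
from types import SimpleNamespace


def preview_context(store, query, k_values=(2, 4, 8), preview_chars=80):
    """Print the start of each passage returned for several values of k.

    Returns a dict mapping each k to the number of passages that came back.
    """
    counts = {}
    for k in k_values:
        results = store.similarity_search(query, k=k)
        counts[k] = len(results)
        print(f"k={k}: {len(results)} passages")
        for doc in results:
            # Flatten newlines so each preview stays on one line
            print("   ", doc.page_content[:preview_chars].replace("\n", " "))
    return counts


class FakeStore:
    """Stand-in for the real database so the sketch runs on its own."""

    def __init__(self, texts):
        self.docs = [SimpleNamespace(page_content=text) for text in texts]

    def similarity_search(self, query, k=4):
        # No real ranking here: just return the first k documents
        return self.docs[:k]


fake_db = FakeStore([f"passage {i}" for i in range(5)])
counts = preview_context(fake_db, "What did Zsuzská steal from the devil?")
```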

## Improvements and next steps

This is a collection of folktales, not one long story. That means asking about something like a wedding might end up mixing together all sorts of different stories! Our next step will allow us to add other books, filter stories from one another, and more techniques that can help with larger, more complex datasets.

If you're interested in hearing when it comes out, feel free to follow me [@dangerscarf](https://twitter.com/dangerscarf) or [hop on my mailing list](https://tinyletter.com/jsoma). Questions, comments, and blind cat adoption inquiries can go to [jonathan.soma@gmail.com](mailto:jonathan.soma@gmail.com).