> Large Language Models and Retrieval-Augmented Generation

[Large Language Models](https://en.wikipedia.org/wiki/Large_language_model) (LLMs), powering tools such as *ChatGPT*, are great at *generating* text, but they don’t always know the latest information or specific facts we care about. This is where [Retrieval-Augmented Generation](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) (RAG) comes in: instead of relying only on what the model "remembers", we can connect it to an external knowledge source (like documents, notes, or a database). With RAG, the model first retrieves the most relevant pieces of information, and then uses them to generate an accurate, helpful response. This approach makes LLMs more reliable, more up-to-date, and more useful for real-world applications like chatbots, assistants, and search systems.

To put together a RAG easily, we'll be making use of [LangChain](https://github.com/langchain-ai/langchain), an open-source framework that helps you quickly build LLM-powered apps and agents using prebuilt components and integrations.

# Setup

In principle, you could run the notebook either in Colab or locally. Is the notebook running in *Colab*?

In [None]:
try:
    import google.colab
    running_in_colab = True
except ImportError:
    running_in_colab = False

running_in_colab

If in *Colab*, you might want to switch to a GPU environment... or get comfy while the CPU is crunching numbers.

If not running in *Colab*, you might want to choose a GPU if several are available. Ignore this if you are running in *Colab*.

In [None]:
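if not running_in_colab:

    import os
    # expose only the first GPU (index 0); change the index to pick another one
    # (must be set before any library initializes CUDA)
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"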

## Ollama

We are going to exploit a freely available (trained by someone else) LLM. For *running* it, we will be using [Ollama](https://ollama.com/), which is software for

> getting up and running large language models

It relies on a client-server model, and hence we must start up the server before sending any request. 

- If in *Colab*, the required packages will be installed. You might see some ugly <font color='red'>ERROR/WARNING</font>s. Just ignore them. It should be fine.

- If **not** in *Colab* you should install *ollama* on your own (along with every required *python* package).

In [None]:
import shutil
import pathlib
import subprocess
import time

if running_in_colab:
    !pip uninstall -y langchain 2>/dev/null || true # to avoid conflicts with new langchain version
    !pip -q install -U langchain-core langchain-community langchain-text-splitters langchain-chroma langchain-huggingface langchain-ollama llama-index-core chromadb unstructured
    !curl -fsSL https://ollama.com/install.sh | sh
    !nohup ollama serve > /dev/null 2>&1 &

else:

    log_file = pathlib.Path('ollama.log').open('w')
    proc = subprocess.Popen(
        [str(shutil.which('ollama') or pathlib.Path.home() / 'ollama' / 'bin' / 'ollama'), 'serve'],
        stdout=log_file,
        stderr=subprocess.STDOUT,
        start_new_session=True,
    )
    
# give it some time to start
time.sleep(10)

The LLM server (*ollama*) should now be up and running.
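
The fixed `time.sleep(10)` above is a guess: the server may need more or less time. A minimal sketch of an alternative is to poll the server until it answers. It assumes Ollama's default address, `http://localhost:11434`; the helper name `wait_for_server` and its parameters are made up for this illustration.

```python
import time
import urllib.request


def wait_for_server(url="http://localhost:11434", timeout=30.0, interval=0.5):
    """Poll `url` until it answers with HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except OSError:
            # connection refused, HTTP error or timeout: the server is not ready yet
            pass
        time.sleep(interval)
    return False
```

Calling `wait_for_server()` in place of the fixed pause would return `True` as soon as the server responds, and `False` if it never does within the timeout.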

## Python libraries

The remaining required `import`s are centralized here.

In [None]:
import requests

import numpy as np

# embedding
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings

import ollama

from langchain_core.prompts import PromptTemplate, ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

# database
from langchain_chroma import Chroma

# model
from langchain_ollama import OllamaLLM

# Data

Let us download some data. Here we are using the short story [Bartleby, the Scrivener](https://en.wikipedia.org/wiki/Bartleby,_the_Scrivener) by Herman Melville (freely available at [Project Gutenberg](https://www.gutenberg.org/)), but you can use any *plain text* documents you like (just make sure to place them in the correct directory or modify the code accordingly).

In [None]:
bartleby_url = 'https://www.gutenberg.org/ebooks/11231.txt.utf-8'
response = requests.get(bartleby_url)
response.raise_for_status()

data_dir = pathlib.Path('data')
data_dir.mkdir(exist_ok=True)

book_file = data_dir / 'bartleby.txt'

with book_file.open('w', encoding='utf-8') as f:
    f.write(response.text)

print(f'Downloaded book to {book_file}')

`DirectoryLoader` *loads* every document in the `data` directory as a *LangChain* `Document` (a `list` with all the documents is returned).

In [None]:
docs = DirectoryLoader('data').load()
type(docs)

<font color='red'>TO-DO</font>: How many documents do you have? Check the content of one of them. What's its type in Python? Try to access the text inside (the important attribute is `page_content`).

Each `Document` above must be turned into a **fixed-size** vector of real numbers. Previously, we achieved this using *Bag-of-words*, but different (already trained) models can be summoned for this task, and here we'll be using a *sentence transformer* (never mind the details for now). This is a much more sophisticated model than *Bag-of-words*, since it preserves semantic and contextual information. The resulting **fixed-size** vectors of real numbers are often referred to as [embeddings](https://en.wikipedia.org/wiki/Embedding_(machine_learning)).
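
To see what *Bag-of-words* throws away, here is a toy sketch (the vocabulary and the helper `bag_of_words` are made up for this illustration). Two sentences with opposite meanings get exactly the same fixed-size vector, because only word counts, not word order, are recorded.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


vocabulary = ["bartleby", "employer", "refuses", "his"]  # toy vocabulary
v1 = bag_of_words("Bartleby refuses his employer", vocabulary)
v2 = bag_of_words("His employer refuses Bartleby", vocabulary)

print(v1)        # [1, 1, 1, 1]
print(v1 == v2)  # True: the word order is lost
```

The vector length equals the vocabulary size regardless of the sentence length, which is what "fixed-size" means here.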

In [None]:
embed_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
embed_model

<font color='red'>TO-DO</font>: Embed any sentence you like (just pass any text to the `embed_query` method of the above `embed_model`), and then another one that is really close in meaning. Check their corresponding vectors and compare them somehow.
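
One common way (among others) of comparing two embedding vectors is the *cosine similarity*. The sketch below uses toy 3-dimensional vectors as stand-ins for real embeddings; the helper name `cosine_similarity` is an assumption of this example, not part of any library used here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# toy 3-dimensional stand-ins for real embeddings
print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```

With real embeddings, the two arguments would be the lists returned by the embedding model for each sentence.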

<font color='red'>TO-DO</font>: Try (a few) [different models](https://huggingface.co/sentence-transformers/models) to carry out the embedding of the `Document`s. Do you see any difference (speed, performance...)?


The number of *tokens* fed as input to a sentence transformer is limited (that was not the case for *Bag-of-words*). The documentation for the [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model used above states

> input text longer than 256 word pieces is truncated.

To deal with this, documents are split into **overlapping** sequences of (here) 500 *characters* (a *token* usually encompasses several characters). Splitting also gives a finer *granularity* when looking for information: the RAG can locate and return the specific segment where the relevant information appears instead of an entire long document.

In [None]:
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
documents = splitter.split_documents(docs)
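
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is **not** how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to cut at separators such as paragraph breaks, newlines and spaces); the helper `split_with_overlap` is hypothetical and only illustrates the overlap idea.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters; neighbours share `chunk_overlap`."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # stop once the remaining text is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


chunks = split_with_overlap("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears (at least partly) in both neighbours.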

<font color='red'>TO-DO</font>: After splitting, how many documents do you have? What are the lengths of the first three?

# Storing the documents

A [Chroma](https://www.trychroma.com/) database is assembled to store the *embedded documents*. When instantiating the corresponding object, we need to pass in the embedding model.

In [None]:
vectordb = Chroma.from_documents(
    documents, embedding=embed_model
)

Notice the above [`from_documents`](https://reference.langchain.com/python/integrations/langchain_chroma/#langchain_chroma.Chroma.from_documents) method embeds and stores the documents in one go.

<font color='red'>TO-DO</font>: What kind of database is [Chroma](https://www.trychroma.com/)? How does it differ from traditional databases?

You can exploit the [Chroma](https://www.trychroma.com/) database to look for sentences *similar* to a given one. Every search in the database might (will, in principle) return several hits (as many as requested through the parameter `k`).

In [None]:
hits = vectordb.similarity_search(
    "I would prefer not to do it", k=3
)
len(hits)

In [None]:
for h in hits:
    print(h.page_content[:120], '\n---')
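
Conceptually, a similarity search embeds the query and returns the `k` stored items whose vectors are closest to it. The brute-force sketch below illustrates the idea with invented 2-dimensional "embeddings" and Euclidean distance; the helper `top_k`, the vectors and the distance choice are assumptions of this example, and real vector databases typically rely on faster (approximate) indexes.

```python
import heapq
import math


def top_k(query_vector, stored, k):
    """Return the `k` texts whose vectors lie closest (Euclidean distance) to the query."""
    nearest = heapq.nsmallest(
        k, stored, key=lambda item: math.dist(query_vector, item[1])
    )
    return [text for text, _ in nearest]


# toy 2-dimensional "embeddings", invented for the example
stored = [
    ("I would prefer not to.", (0.9, 0.1)),
    ("The office was on Wall Street.", (0.1, 0.9)),
    ("He declined, politely.", (0.8, 0.3)),
]
print(top_k((1.0, 0.0), stored, k=2))
# ['I would prefer not to.', 'He declined, politely.']
```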

<font color='red'>TO-DO</font>: Search for:

- A sentence that is close to something that you know is there (in the text).

- A sentence that is (most likely) not there at all.

# LLM

## Model

Let us go and pick a model. You can check [here](https://ollama.com/library?sort=popular) for which ones are freely available. Let us start with

In [None]:
# model_name = 'llama3:8b'
# model_name = 'deepseek-r1:8b'
model_name = 'llama3.2:1b' # faster download

CAVEAT: some models also report their thinking process (inside `<think>` tags), which might not be convenient for some tasks and which we might want to filter out.
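
A minimal sketch of such a filter, assuming the thinking process comes wrapped in a matching `<think>...</think>` pair (the helper name `strip_thinking` is made up for this illustration):

```python
import re

# non-greedy match, with DOTALL so that the block may span several lines
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_thinking(text):
    """Remove every `<think>...</think>` block and the surrounding whitespace."""
    return THINK_BLOCK.sub("", text).strip()


raw = "<think>\nThe user greets me.\n</think>\n\nHello there!"
print(strip_thinking(raw))  # Hello there!
```

Such a function could be applied to the text returned by the model before using it any further.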

A model must be downloaded only once (...per session if you are using *Colab*)

In [None]:
ollama.pull(model_name)

Once the server is running, a (previously downloaded) model can be accessed through a `langchain` object.

In [None]:
llm = OllamaLLM(model=model_name, temperature=0.0)

<font color='red'>TO-DO</font>: Ask the LLM model whatever (using the `invoke` *method*), e.g., "What is backpropagation?" or "What is Rick and Morty?"

<font color='red'>TO-DO</font>: Ask the LLM model about the main character in the short story, Bartleby. We are doing this **before** *augmenting* what the LLM knows.

## Templates

If you plan to always interact with an LLM using a certain style of *prompting*, you can make a template for it.

In [None]:
chat_prompt = ChatPromptTemplate([
    ('system', 'Try to rhyme every answer you give.'),
    ('human', '{user_input}'),
])

Then you build a *chain* (to be invoked just like we did above) that *sends* (read `|` as $\to$) the template to the LLM and the result of the latter to a *parser* that outputs a `str`.

In [None]:
chain = chat_prompt | llm | StrOutputParser()
print(chain.invoke('Explain backpropagation in one paragraph.'))
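
To demystify the `|` operator, here is a toy sketch of how such *piping* can be built in plain Python by overloading `__or__`. It is only an illustration of the idea, not LangChain's actual implementation; the class `Step` and the stand-in functions are hypothetical.

```python
class Step:
    """Wrap a one-argument function so that steps can be chained with `|`."""

    def __init__(self, func):
        self.func = func

    def __or__(self, other):
        # `self | other` builds a new step: run `self` first, then `other`
        return Step(lambda value: other.invoke(self.invoke(value)))

    def invoke(self, value):
        return self.func(value)


fill_template = Step(lambda user_input: f"Question: {user_input}")
fake_llm = Step(str.upper)                # stand-in for a real model
parser = Step(lambda text: text.strip())  # stand-in for an output parser

toy_chain = fill_template | fake_llm | parser
print(toy_chain.invoke("why? "))  # QUESTION: WHY?
```

The result of `|` is itself a `Step`, which is why chains can in turn be piped into other chains.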

<font color='red'>TO-DO</font>: Experiment with the above `system` instructions.

<font color='red'>TO-DO</font>: Make a new *chain* (stored in a new variable) that *keeps only the last word in every sentence and gets rid of the rest*. For the sake of simplicity (we don't need `system` instructions), you might want to use `PromptTemplate.from_template`.

<font color='red'>TO-DO</font>: Make yet another chain by *piping* (`|`) the two previous ones in order to get a chain that yields the *rhyming* words in the above output.

# Retriever + LLM

`langchain` provides a way to construct an entire end-to-end RAG. We just need:

- an LLM, 

- a database of documents (which is tidied up using `as_retriever` so that `langchain` knows how to "talk" to it), and

- a *prompt* factoring in the latter.

Let us start with the *prompt*

In [None]:
system_prompt = (
    'Use the given context to answer the question. '
    'If you don\'t know, say so. '
    'Keep the answer under three sentences.\n\n'
    'Context:\n{context}'
)
prompt = ChatPromptTemplate.from_messages(
    [('system', system_prompt), ('human', '{input}')]
)

A function simply joining the documents.

In [None]:
join_docs = RunnableLambda(
    lambda docs: '\n\n'.join(d.page_content for d in docs)
)

`context` and `input` are the inputs to (*fill in*) the `prompt`, which is sent to the `llm`, whose output is *parsed* to yield the final output.

In [None]:
piped_rag_chain = (
    {                                             #  a `dict` for the prompt
        'context': vectordb.as_retriever(search_kwargs={"k": 4}) | join_docs,
        'input':   RunnablePassthrough(),
    }
    | prompt                                      # fill {context} + {input}
    | llm                                         # generate answer
    | StrOutputParser()                           # model output → str
)
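
The data flow of the chain above can be spelled out in plain Python. The sketch below uses stand-ins (`fake_retrieve` and `echo_llm`, both invented here, as well as an abbreviated prompt) so that it runs without a database or a model: the question goes to the retriever **and**, untouched, to the prompt.

```python
def rag_answer(question, retrieve, generate, k=4):
    """Retrieve `k` chunks, paste them into the prompt, and generate an answer."""
    chunks = retrieve(question, k)   # plays the role of the retriever
    context = "\n\n".join(chunks)    # plays the role of `join_docs`
    prompt_text = (                  # plays the role of `prompt` (abbreviated)
        "Use the given context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt_text)     # plays the role of the LLM


def fake_retrieve(question, k):
    # a real retriever would search the vector database
    return ["Bartleby is a scrivener.", "He would prefer not to."][:k]


def echo_llm(prompt_text):
    # a fake model that just echoes its prompt, to show what a real one would receive
    return prompt_text


print(rag_answer("Who is Bartleby?", fake_retrieve, echo_llm))
```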

<!-- <font color='red'>TO-DO</font>: Ask this *chain* (using the `invoke` *method*) again about *backpropagation*. Compare answers. -->
<font color='red'>TO-DO</font>: Ask this *chain* (using the `invoke` *method*) again the same questions you asked above (Bartleby, and whatever you asked before that). Compare answers (now **after** *augmenting* the LLM).

In [None]:
if not running_in_colab:
    proc.terminate()   # or proc.kill() for a hard stop
    proc.wait()
    log_file.close()   # close the log opened when the server was started

<font color='red'>TO-DO</font>: Try a different language model (a couple of them are commented out in the code setting `model_name`). Be careful with the size of the model (*7b*, *8b*...). Large models might take **a lot** of time to produce an answer... or even exhaust the GPU memory. The suffix *b* in those models stands for *billion* (parameters). The *Toy GPT* we trained before had... 0.2 **m**illion.
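
A back-of-the-envelope sketch of why size matters: the memory taken by the weights alone is roughly the number of parameters times the bytes used per parameter. The precisions below (16 and 4 bits) are assumptions for the sake of the example; the one actually used depends on how the model was packaged, and running a model needs additional memory on top of the weights.

```python
def weights_size_gb(n_parameters, bits_per_parameter):
    """Rough size of the model weights in gigabytes (10^9 bytes), ignoring any overhead."""
    return n_parameters * bits_per_parameter / 8 / 1e9


models = [("Toy GPT (0.2 million)", 0.2e6), ("1b", 1e9), ("8b", 8e9)]
for label, n_parameters in models:
    at_16 = weights_size_gb(n_parameters, 16)
    at_4 = weights_size_gb(n_parameters, 4)
    print(f"{label}: {at_16:.4f} GB at 16 bits, {at_4:.4f} GB at 4 bits")
```

Under these assumptions, an *8b* model needs about 16 GB for its weights at 16 bits per parameter and about 4 GB at 4 bits.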

# Sample questions

## What is the main benefit of adding retrieval to an LLM-based system?
- [ ] It removes the need to design prompts.
- [ ] It brings in relevant external context first, making answers more accurate and up-to-date.
- [ ] It permanently rewrites the model’s weights with new facts.
- [ ] It forces the model to answer only with quotes.

## What does an embedding represent for a piece of text?
- [ ] A perfect, lossless compression of the original text.
- [ ] The count of words and punctuation only.
- [ ] A numeric vector that captures meaning so similar texts are close together.
- [ ] A list of keywords in alphabetical order.