![Roche logo](images/roche-logo-blue-aligned-right.png)

<p>&nbsp;</p>
<p>&nbsp;</p>

# Building a Q&A engine with LangChain and open-source LLMs
## Marek Grzenkowicz

#### December 2023

## About Roche

![Key Roche Informatics hubs](images/roche-about.png)

## Large Language Models did not appear out of nowhere

### Natural language processing (NLP)

- Standardized tasks (question answering, summarization, sentiment analysis, ...)
- Evaluation benchmarks
- **Leaderboards**
- Word and sentence **embeddings** (word2vec, GloVe, fastText, ELMo, BERT, ...)

### Machine learning

- Neural networks
- Deep learning
- CUDA
- Transformer architecture
- Attention (ML technique)
- Reinforcement learning

## Objectives

1. Build an LLM application **without extensive NLP expertise**
2. Enable **local development and execution** (easy to replace with cloud-hosted inference APIs later)
3. Utilize **open-source models**
4. Prototype a solution for a **real business challenge**
    1. Use semantic search for proprietary company documents
    2. Query the index in natural language

## Tools

- [LangChain](https://python.langchain.com/docs/get_started/introduction.html) - framework for developing applications powered by LLMs
- [Hugging Face Hub](https://huggingface.co/models) - repository of pre-trained language models
  - [Transformers](https://huggingface.co/docs/transformers/index) - downloading the models
- [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0) - compact, open-source LLM from [lmsys.org](https://lmsys.org) 
- [ChromaDB](https://www.trychroma.com) - embedding database (vector store)
- [Jupyter Notebook](https://jupyter.org/) with the [RISE](https://github.com/damianavila/RISE) extension - IDE with a slideshow feature

## It's full of LLMs!

![LLM Evolutionary Tree](https://raw.githubusercontent.com/Mooler0410/LLMsPracticalGuide/main/imgs/tree.jpg)

Source: [github.com/Mooler0410/LLMsPracticalGuide/](https://github.com/Mooler0410/LLMsPracticalGuide/)

## Why the `lmsys/fastchat-t5-3b-v1.0` model?

- GPU + 4GB memory - 1B parameters at best
- CPU + 32GB RAM - 3B parameters

- [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) ➡ [`CobraMamba/mamba-gpt-3b-v3`](https://huggingface.co/CobraMamba/mamba-gpt-3b-v3) claims to surpass some 12B models
  - but it is slow ➡ [Open LLM performance leaderboard](https://huggingface.co/spaces/optimum/llm-perf-leaderboard)


- [lmsys/fastchat-t5-3b-v1.0](https://huggingface.co/lmsys/fastchat-t5-3b-v1.0)
  - _A commercial-friendly, compact, yet powerful chat assistant_
  - The first model to actually generate any response on my laptop in reasonable time 🎉

## Some duct tape first 🙈

In [1]:
# careful! important warnings may get hidden

import warnings
warnings.filterwarnings('ignore')

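# the transformers logging utilities are imported under the name `logging`,
# so a separate `import logging` of the standard library module is not needed
from transformers.utils import logging
logging.set_verbosity(logging.ERROR)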

In [2]:
# https://github.com/chroma-core/chroma/blob/main/chromadb/__init__.py#L57

import sys
__import__("pysqlite3")
sys.modules["sqlite3"] = sys.modules.pop("pysqlite3")

## Load a model

In [3]:
from langchain.llms import HuggingFacePipeline

In [4]:
model_id, task = "lmsys/fastchat-t5-3b-v1.0", "text2text-generation"

# model will be downloaded on first use and cached in ~/.cache/huggingface/hub/

model = HuggingFacePipeline.from_model_id(
    model_id=model_id,
    task=task,
    model_kwargs={
        "temperature": 0,
        "max_length": 1000
    },
    device=-1,  # CPU
)

## Initialize an LLM chain and start asking questions

In [5]:
from langchain import PromptTemplate, LLMChain

template_text = """
{question}
"""
template = PromptTemplate(template=template_text, input_variables=["question"])

llm_chain = LLMChain(llm=model, prompt=template)

In [6]:
llm_chain("Who is Sheryl Crow?")["text"]

'<pad> Sheryl  Crow  is  an  American  singer,  songwriter,  and  actress.  She  is  best  known  for  her  role  as  the  lead  singer  and  lead  guitarist  of  the  rock  band  The  Band wagon,  and  for  her  role  as  the  lead  singer  and  lead  guitarist  of  the  alternative  rock  band  The  Mamas  and  the  Papas.  Crow  has  also  been  a  member  of  the  band  The  Mamas  and  the  Papas  since  its  formation  in  1995.\n'
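
The raw pipeline output above starts with a `<pad>` token and contains doubled spaces. A minimal post-processing sketch, assuming those are the only artifacts to remove (`clean_generation` is a hypothetical helper, not part of LangChain or Transformers):

```python
import re


def clean_generation(text: str) -> str:
    """Remove the `<pad>` token and collapse repeated whitespace."""
    without_pad = text.replace("<pad>", "")
    return re.sub(r"\s+", " ", without_pad).strip()


# shortened copy of the raw output shown above
raw = "<pad> Sheryl  Crow  is  an  American  singer,  songwriter,  and  actress.\n"
print(clean_generation(raw))
```

This only tidies the formatting; it does nothing about the factual errors in the answer itself.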

## ↪️ 2019: Language models cannot augment knowledge automatically

!['Run, Baby, Run' and deaths of Huxley and Kennedy](./images/sheryl-crow-huxley-kennedy.png)

## ↪️ 2023: ChatGPT still fails at that! 😮 (for this particular example)

!['Run, Baby, Run' confuses ChatGPT](./images/sheryl-crow-chatgpt.png)

## Prompt engineering

In [7]:
template_text = """
Provide brief answers, use 10 words or less.
{question}
"""
template = PromptTemplate(template=template_text, input_variables=["question"])

llm_chain = LLMChain(llm=model, prompt=template)

In [8]:
llm_chain("Who is Sheryl Crow?")["text"]

'<pad> Singer-songwriter'

## Easy questions

In [9]:
llm_chain("Where is Poland located?")["text"]

'<pad> Europe'

In [10]:
llm_chain("What is the Bialowieza Forest?")["text"]

'<pad> Bialowieza Forest is a protected forest in Poland.'

## Harder questions

In [11]:
llm_chain("What does the name 'Bialowieza' mean in English?")["text"]

'<pad> "Birch Tree"'

In [12]:
llm_chain("""Bialowieza Forest trails are marked with colors.
   What's the color of the Wolf’s Trail?""")["text"]

'<pad> Yellow.\n'

# What now?

# Should I fine-tune the base model? 🤔

## ↪️ Embeddings as a vector representation of text

![cosine similarity in 2D](./images/vectors-cos-sim-500.png)

The actual embedding spaces have **100s or even 1000s of dimensions**! 🤯

An embedding vector can represent: a **word**, an entire **sentence** or a longer **chunk of text**.

Source: [github.com/grzenkom/do-androids-read/](https://github.com/grzenkom/do-androids-read/)

## ↪️ Vector similarity measures

In [13]:
import spacy

nlp = spacy.load("en_core_web_md")
vector_dog = nlp.vocab[u"dog"].vector
vector_dog[:10]

array([  1.233  ,   4.2963 ,  -7.9738 , -10.121  ,   1.8207 ,   1.4098 ,
        -4.518  ,  -5.2261 ,  -0.29157,   0.95234], dtype=float32)

In [14]:
from sklearn.metrics.pairwise import cosine_similarity

for word in ["husky", "cows", "stone", "Xerox"]:
    print(f"cos(dog, {word:<5}) = {cosine_similarity([vector_dog], [nlp.vocab[word].vector])[0][0]:>6.3f}")

cos(dog, husky) =  0.811
cos(dog, cows ) =  0.352
cos(dog, stone) =  0.071
cos(dog, Xerox) = -0.134


❗Cosine similarity is one of many vector similarity and vector distance measures.
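
To make the difference between these measures concrete, here is a minimal sketch in plain Python. The three vectors are made up by hand for illustration (they are not real embeddings), and `cos_sim` is a local helper named so that it does not shadow the scikit-learn function used above.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cos_sim(a, b):
    # cosine of the angle between a and b; ignores vector length
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    # straight-line distance; sensitive to vector length
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# toy vectors, chosen by hand for illustration only
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]   # same direction as a, twice as long
c = [-2.0, 1.0, 0.0]  # perpendicular to a

for name, other in [("b", b), ("c", c)]:
    print(f"a vs {name}: cos = {cos_sim(a, other):.3f}, "
          f"dot = {dot(a, other):.1f}, "
          f"euclidean = {euclidean_distance(a, other):.3f}")
```

Cosine similarity treats `a` and `b` as identical in direction even though they are far apart by Euclidean distance, which is why the choice of measure can change search rankings.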

## ↪️ Arithmetic of word vectors

\begin{equation}
\LARGE{\mathit{ v_{parent} + v_{woman} \approx v_{x} }}
\end{equation}

![sum of "parent" and "woman" vectors](./images/vector-mother.png)

\begin{equation}
\LARGE{\mathit{ v_{seawater} - v_{salt} \approx v_{x} }}
\end{equation}

![difference of "seawater" and "salt" vectors](./images/vector-water.png)

Source: [github.com/grzenkom/do-androids-read/](https://github.com/grzenkom/do-androids-read/)
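
The same idea can be tried on a toy scale. The sketch below uses hand-made 2-D vectors whose two axes loosely stand for "parenthood" and "femaleness"; the values are invented for illustration and do not come from any real embedding model. The word closest (by cosine similarity) to the sum of two vectors is taken as the answer.

```python
import math


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# hand-made 2-D "embeddings": (parenthood, femaleness); illustrative only
toy_vectors = {
    "parent": (1.0, 0.0),
    "woman": (0.0, 1.0),
    "man": (0.0, -1.0),
    "mother": (1.0, 1.0),
    "father": (1.0, -1.0),
}


def nearest_word(target, exclude):
    """Return the word whose vector points in the direction closest to target."""
    candidates = {w: v for w, v in toy_vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cos_sim(target, candidates[w]))


v_sum = tuple(p + w for p, w in zip(toy_vectors["parent"], toy_vectors["woman"]))
print(nearest_word(v_sum, exclude={"parent", "woman"}))  # prints: mother
```

Real embedding spaces are far less tidy, so the nearest neighbor of a sum or difference is only approximately the expected word.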

## Retrieval Augmented Generation (RAG)

With RAG, documents are split into chunks and stored as embeddings in a vector
database. At query time, the chunks most semantically similar to the question
are retrieved and passed into the **model prompt via the context window**. The
LLM then uses these chunks of the original documents to generate an answer.

![Question Answering flow](images/langchain-qa-flow.jpg)

Source: [python.langchain.com/docs/use_cases/question_answering/](https://python.langchain.com/docs/use_cases/question_answering/#overview)
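
The flow can be sketched without any libraries. In the toy version below, a bag-of-words count vector stands in for a real embedding model, a Python list stands in for the vector database, and the three chunks are made-up sentences. All function names (`embed`, `retrieve`, `build_prompt`) are hypothetical and are not LangChain APIs.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cos_sim(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, index, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cos_sim(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, chunks):
    """Place the retrieved chunks into the prompt as context."""
    context = "\n".join(chunks)
    return (f"Use the context to answer the question.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}")


# made-up chunks; the "vector database" is just a list of (chunk, vector) pairs
chunks = [
    "The forest trails are marked with colors.",
    "European bison live in the forest.",
    "Pancakes are made from flour, eggs and milk.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

question = "Which animals live in the forest?"
prompt = build_prompt(question, retrieve(question, index, k=2))
print(prompt)  # this string is what would be sent to the LLM
```

The unrelated pancake chunk never reaches the prompt, which is the point of the retrieval step: the model only sees text that is likely to contain the answer.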

## LangChain integrations

1. [Catalog](https://integrations.langchain.com/llms)
2. [Documentation](https://python.langchain.com/docs/integrations)

![LangChain integrations](./images/langchain-integrations.png)

## RAG step 1 - load documents

In [15]:
from langchain.document_loaders import WikipediaLoader

loader = WikipediaLoader(query="Białowieża Forest", lang="en")
bf_docs = loader.load()

In [16]:
from langchain.document_loaders import WebBaseLoader

urls = [
    "https://bpn.com.pl/index.php?option=com_content&task=view&id=651&Itemid=297&lang=en",
    "https://bpn.com.pl/index.php?option=com_content&task=view&id=27&Itemid=211&lang=en"
]

loader = WebBaseLoader(web_paths=urls)
bpn_page = loader.load()

## RAG step 2 - split documents into chunks

In [17]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# RecursiveCharacterTextSplitter splits text recursively on characters
# ["\n\n", "\n", " ", ""] and stops as soon as the chunks are small enough

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=0
)
all_splits = text_splitter.split_documents(bf_docs + bpn_page)
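
To show what "splitting recursively" means, here is a simplified sketch of the idea in plain Python. It is not LangChain's implementation (for example, it ignores `chunk_overlap`), and `recursive_split` is a hypothetical name: it tries the coarsest separator first and falls back to finer ones only for pieces that are still too long.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # last resort: cut at fixed positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)       # flush what has been merged so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # the piece alone is too long: retry with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = ("First paragraph, short.\n\n"
          "Second paragraph is a bit longer than the limit.")
chunks = recursive_split(sample, chunk_size=30)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The short paragraph stays intact, while the long one is broken at word boundaries, so chunks tend to follow the document's natural structure where the size limit allows.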

## RAG step 3 - initialize `sentence_transformers` embedding model

In [18]:
from langchain.embeddings import HuggingFaceEmbeddings

hf_embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2",
    model_kwargs={
        "device": "cpu"
    },
    encode_kwargs={
        "normalize_embeddings": False
    }
)

## RAG step 4 - calculate embeddings for text chunks and store them in a database

In [19]:
from langchain.vectorstores import Chroma

vector_store = Chroma.from_documents(
    documents=all_splits, embedding=hf_embeddings
)

# Chroma's default similarity measure is `l2`, not `cosine`
#  - https://docs.trychroma.com/usage-guide#changing-the-distance-function
#  - https://github.com/nmslib/hnswlib/tree/master#python-bindings
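
Whether the choice between `l2` and `cosine` changes the search results depends on vector length. The sketch below uses hand-picked toy vectors and assumes that `l2` denotes squared Euclidean distance. For unit-length vectors, squared L2 distance equals `2 - 2 * cos`, so both measures produce the same ranking; for unnormalized vectors they can disagree.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# toy vectors, chosen by hand for illustration only
query = [1.0, 2.0, 2.0]
docs = {
    "doc_a": [2.0, 4.0, 4.0],   # same direction as the query, but longer
    "doc_b": [3.0, 0.0, 4.0],
    "doc_c": [0.5, 0.4, 0.4],   # short vector, close to the origin
}

by_cos = sorted(docs, key=lambda d: -cos_sim(query, docs[d]))
by_l2_raw = sorted(docs, key=lambda d: squared_l2(query, docs[d]))
by_l2_unit = sorted(
    docs, key=lambda d: squared_l2(normalize(query), normalize(docs[d]))
)

print("cosine ranking:         ", by_cos)
print("l2 ranking (raw):       ", by_l2_raw)
print("l2 ranking (normalized):", by_l2_unit)
```

Some embedding models already return unit-length vectors, in which case the distinction does not affect the ranking, only the scale of the scores.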

## ↪️ Similarity search

In [20]:
# similarity_search_with_score returns a distance: a lower score means higher similarity

# examples: Białowieża, bison, nature protection, pancake recipe
relevant_splits = vector_store.similarity_search_with_score("Białowieża", k=5)
print(" score  | size | document chunk")
[
    f'{score:.3f} | {len(chunk.page_content):>4} | {chunk.page_content[:60]} ...'
    for chunk, score in relevant_splits
]

 score  | size | document chunk


["0.570 |  245 | Białowieża [bʲawɔˈvʲɛʐa] in Poland's Podlasie Province, in t ...",
 '0.624 |  450 | == Location ==\nBiałowieża is in eastern Poland, in Podlasie  ...',
 '0.716 |   95 | village of Białowieża lies within the forest. Białowieża mea ...',
 '0.748 |  498 | The Białowieża Forest takes its name from the Polish village ...',
 '0.768 |  500 | Białowieża National Park (Polish: Białowieski Park Narodowy) ...']

## RAG step 5 - initialize a Q&A chain

In [21]:
from langchain.chains import RetrievalQA

# the QA chain is constructed with the LLM model (loaded previously)
# and the embedding database
qa_chain = RetrievalQA.from_chain_type(
    llm=model, retriever=vector_store.as_retriever()
)

### `RetrievalQA` chain - processing steps

1. fetch document splits relevant to the question from the vector database,
2. inject the retrieved splits into the model prompt,
3. have the LLM generate an answer based on original documents.

## RAG step 6 - generate answers with the Q&A chain

In [22]:
question = "What does the name 'Bialowieza' mean in English?"

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

"<pad> The  name  'Bialowieza'  means  'White  Tower'  in  English.\n"

In [23]:
question = """Bialowieza Forest trails are marked with colors.
   What's the color of the Wolf’s Trail?"""

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

"<pad> The  Wolf's  Trail  is  green.\n"

## No more hallucinations then?

In [29]:
question = "Are the Bialowieza Forest walking trails available to the public?"

qa_chain(f"Provide brief answers, use 10 words or less. {question}")["result"]

'<pad> No,  only  scientists  can  navigate  freely.\n'

## Wait! But why?! 🕵

It is possible to configure the QA chain to return the source chunks used to generate the answer.

In [25]:
explainable_qa_chain = RetrievalQA.from_chain_type(
    llm=model, retriever=vector_store.as_retriever(),
    return_source_documents=True
)

explainable_qa_output = explainable_qa_chain(f"Provide brief answers, use 10 words or less. {question}")

## Source documents - metadata 🕵

In [26]:
[
    f"{chunk.metadata['title'][:35]:<35} | {chunk.metadata['source'][:70]}"
    for chunk in explainable_qa_output['source_documents']
]

['Hajnówka                            | https://en.wikipedia.org/wiki/Hajn%C3%B3wka',
 'Bialowieski Park Narodowy - Walking | https://bpn.com.pl/index.php?option=com_content&task=view&id=651&Itemi',
 'Białowieża Forest                   | https://en.wikipedia.org/wiki/Bia%C5%82owie%C5%BCa_Forest',
 'Białowieża Forest                   | https://en.wikipedia.org/wiki/Bia%C5%82owie%C5%BCa_Forest']

## Source documents - content 🕵

In [27]:
[
    f"{chunk.page_content}"
    for chunk in explainable_qa_output['source_documents']
]

['== History ==\nFor a more detailed history of Białowieża and the area see: Białowieża Forest',
 'tourists about undergrowth plants, animal trails and processes occuring in the natural forest untainted by manmade activity. Remaining trails are made available to specialised groups, on every such occasion permission from the Park’s Directorate is required.',
 'village of Białowieża lies within the forest. Białowieża means "the white tower" in Old Polish.',
 '== Nature protection ==\n\n\n=== Białowieża National Park, Poland ===']

## Next steps

1. Other data loaders
    - PDF files
    - slide decks
    - Google Sites
    - files in Google Drive
2. Conversational user experience
    - chat user interface
    - memory of the conversation
3. Larger model
4. GPU processing
5. Other chunking strategies
    - find optimal chunk size
    - ensure chunks preserve the context of the source documents

## Links

### Overview

- [How OpenAI trained ChatGPT](https://blog.quastor.org/p/openai-trained-chatgpt)
    - 🎥 [State of GPT | Andrej Karpathy](https://www.youtube.com/watch?v=s6zNXZaIiiI)
- [Generative AI exists because of the transformer](https://ig.ft.com/generative-ai)
- [Catching up on the weird world of LLMs](https://simonwillison.net/2023/Aug/3/weird-world-of-llms)
- [What We Know About LLMs (Primer)](https://willthompson.name/what-we-know-about-llms-primer)
- [The history, timeline, and future of LLMs](https://toloka.ai/blog/history-of-llms)
- [The Practical Guides for Large Language Models](https://github.com/Mooler0410/LLMsPracticalGuide)
- [The Many Ways that Digital Minds Can Know](https://moultano.wordpress.com/2023/06/28/the-many-ways-that-digital-minds-can-know)

## Links (cont.)

### Courses, tutorials

- 👩‍🎓 [Large Language Models with Semantic Search](https://learn.deeplearning.ai/large-language-models-semantic-search) (great explanation of **embeddings** + sandbox to experiment with provided code samples)
- [Running a Hugging Face Large Language Model (LLM) locally on my laptop](https://www.markhneedham.com/blog/2023/06/23/hugging-face-run-llm-model-locally-laptop)
- [Why You (Probably) Don’t Need to Fine-tune an LLM](https://www.tidepool.so/2023/08/17/why-you-probably-dont-need-to-fine-tune-an-llm/)
- [Building RAG-based LLM Applications for Production](https://www.anyscale.com/blog/a-comprehensive-guide-for-building-rag-based-llm-applications-part-1)

## Links (cont.)

### Challenges and future research

- [Open challenges in LLM research](https://huyenchip.com/2023/08/16/llm-research-open-challenges.html) 
- [Getting from Generative AI to Trustworthy AI: What LLMs might learn from Cyc](https://arxiv.org/abs/2308.04445)
- [OWASP Top 10 for Large Language Model Applications](https://owasp.org/www-project-top-10-for-large-language-model-applications) 

### Skepticism

- 🎥 [Why AI Is Incredibly Smart and Shockingly Stupid | Yejin Choi](https://www.youtube.com/watch?v=SvBR0OGT5VI)
- [Anti-hype LLM reading list](https://gist.github.com/veekaybee/be375ab33085102f9027853128dc5f0e)
- [What if Generative AI turned out to be a Dud?](https://garymarcus.substack.com/p/what-if-generative-ai-turned-out)
    - And [Marcus on AI](https://garymarcus.substack.com) in general

![Roche logo](images/roche-logo-blue-aligned-right.png)

<p>&nbsp;</p>
<p>&nbsp;</p>


# Doing now what patients need next

<p>&nbsp;</p>
<p>&nbsp;</p>


## ⏬ GitHub repository: [go.roche.com/zsbit](https://go.roche.com/zsbit)