# QA Pipeline
---

In [1]:
import os
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

## Dependencies

This notebook was written in Python 3.10.13.

In [2]:
%%capture
if iskaggle:
    ! pip install -U pip datasets sentence_transformers

In [3]:
from transformers import pipeline
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util
import numpy as np



#### Note:

Running this notebook on Kaggle requires an internet connection, which can be turned on in the *Notebook options* tab on the right. This can only be enabled with a phone number.

## Introduction

#### Task
I created this notebook as a solution to a take-home exercise for a job application. The task was to create a QA pipeline built on a RAG framework, grounding an LLM in factual knowledge from a database. The choice of database was left to me, although Wikipedia was recommended.

I was also asked how I would assess performance and to provide 2-3 ways to make the model accessible to the public.

#### Outcome
The language model I chose finds the relevant information in the provided context, but it resisted my prompts to be more chatty. Strangely enough, it seemed quite friendly when not restricted by facts. I made some recommendations for how to assess performance and one recommendation for how to make the model accessible.

#### Inspiration
I followed **James Briggs'** excellent video for the implementation of this notebook: https://www.youtube.com/watch?v=0xyXYHMrAP0. My choice of models follows his.

## Import the Wikipedia dataset

I chose to work with the Simple English version because it is significantly smaller than the full English one, which would have required me to invest in more storage. The 20 GB provided for free on the Kaggle notebook servers was not enough to store more than 6 million vectors. I looked at the Pinecone free tier offering and that was also insufficient. Since model performance was not the objective, I judged the smaller corpus to be sufficient.

There is another benefit: the articles are shorter. If an article contains information relevant to a question, that information is less likely to be cut off when the text is truncated to fit a model's input limit. This helps both the retrieval and the generation phase.

There is also a further helpful underlying assumption: 

A1: Articles most likely contain the relevant information in their introductory paragraphs.

This is underpinned by the fact that Wikipedia is an encyclopedia, whose articles usually open with a summary of the topic.
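If A1 holds, one option would be to encode only the opening part of each article. This notebook does not do that: it passes the full text to the encoder and relies on the model's own truncation. The following is a minimal sketch, assuming blocks of text are separated by blank lines, as the printed articles further down suggest. `intro_paragraphs` and `max_paragraphs` are hypothetical names.

```python
def intro_paragraphs(text: str, max_paragraphs: int = 2) -> str:
    """Return the first `max_paragraphs` non-empty blocks of an article."""
    # Split on blank lines and drop blocks that contain only whitespace.
    blocks = [block.strip() for block in text.split("\n\n") if block.strip()]
    return "\n\n".join(blocks[:max_paragraphs])


# Sample text taken from one of the articles printed later in the notebook.
sample = "Glocester is a town in Rhode Island.\n\nTowns in Rhode Island"
print(intro_paragraphs(sample, max_paragraphs=1))
```

The trade-off is that answers located further down an article, such as a list near the end, would no longer be part of the encoded text.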

In [4]:
wikipedia = load_dataset("wikipedia", "20220301.simple")['train']

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.



In [5]:
wikipedia

Dataset({
    features: ['id', 'url', 'title', 'text'],
    num_rows: 205328
})

There are 205,328 articles in our dataset, which should provide enough opportunity to ask interesting questions.

## Create vector representation for Wikipedia in Simple English

**all-MiniLM-L6-v2** is by far the most popular sentence transformer available on *HuggingFace* at the time of writing, with 7,019,691 downloads in January 2024. It encodes text as a 384-dimensional vector. Input longer than the model's maximum sequence length (256 word pieces by default, according to the model card) is truncated.

In [6]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


To run the code below, enable the GPU.

In [7]:
%%time
# Encode every article in the Dataset
pool = model.start_multi_process_pool()
encoded_texts = model.encode_multi_process(
    wikipedia['text'],
    pool=pool
)
model.stop_multi_process_pool(pool)

CPU times: user 1.5 s, sys: 872 ms, total: 2.37 s
Wall time: 3min 46s


The output has the expected shape, one 384-dimensional vector per article:

In [8]:
encoded_texts.shape

(205328, 384)
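To put this shape in terms of storage, here is a rough estimate of the size of a dense embedding matrix. It assumes the vectors are stored as `float32`, which is not verified in this notebook. `embedding_bytes` is a hypothetical helper.

```python
import numpy as np


def embedding_bytes(n_rows: int, n_dims: int, dtype=np.float32) -> int:
    """Bytes needed to hold a dense (n_rows, n_dims) embedding matrix."""
    return n_rows * n_dims * np.dtype(dtype).itemsize


# Shape reported above for the Simple English Wikipedia embeddings.
size = embedding_bytes(205_328, 384)
print(f"{size / 1e6:.0f} MB")
```

Under that assumption the Simple English embeddings take a few hundred megabytes, which fits comfortably in memory on a free notebook instance.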

In [9]:
# %%time
# n_batches = 7
# pool = model.start_multi_process_pool()
# for i in tqdm(range(n_batches)):
#     encoded_batch = model.encode_multi_process(
#         wikipedia_xxl_intro.shard(n_batches, index=i)['title'], 
#         pool=pool
#     )
#     pickle_object(encoded_batch, f"encoded_titles_{i}")
# model.stop_multi_process_pool(pool)

## Encode the question and find the top 3 relevant articles

In [10]:
question = "What are the sister cities of Győr?"
query = model.encode(question)
# using cosine similarity to find the most relevant articles
similarity = util.pytorch_cos_sim(np.array(query), encoded_texts)[0]


Get the top three by cosine similarity (efficient sort: https://stackoverflow.com/a/23734295):

In [11]:
# top three not in order
most_similar = np.argpartition(similarity,-3)[-3:]
# sorted
most_similar = np.array(most_similar[np.argsort(similarity[most_similar])], dtype=int)
similarity[most_similar]

tensor([0.4829, 0.4885, 0.6066])
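For readers who want to see the two steps without the library helpers, here is a NumPy-only sketch of cosine similarity followed by the partial sort. `top_k_cosine` is a hypothetical name and the toy vectors are made up for illustration. It assumes no row is the zero vector and that `k` is not larger than the number of rows.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 3) -> np.ndarray:
    """Indices of the k rows most similar to `query`, best match last."""
    # Cosine similarity is the dot product of L2-normalised vectors.
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    scores = matrix_unit @ query_unit
    # argpartition moves the k largest scores to the end without a full sort;
    # only those k candidates are then sorted in ascending order.
    candidates = np.argpartition(scores, -k)[-k:]
    return candidates[np.argsort(scores[candidates])]


toy = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
print(top_k_cosine(np.array([1.0, 0.1]), toy, k=2))
```

As in the cell above, the result is in ascending order of similarity, so the best match is the last element.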

As we can see below, the most relevant article (printed last, since the indices are sorted in ascending order of similarity) contains the answer to our question.

In [12]:
for article in most_similar:
    # cast article to int since it's a tensor
    print(wikipedia[int(article)]['text'][:2000])
    print('*')

Glyfada is a district in southern Athens and a famous coastal resort.
The area developed as a tourist destination through the 1970s and 1980s and has a large number of hotels, shops, restaurants and clubs. It is situated close to the old international airport.

References

Athens
*
Glocester is a town in Rhode Island.

Towns in Rhode Island
*
Győr (; , ) is the sixth-largest city in Hungary and the capital of Győr-Moson-Sopron county. In 2016, 131.267 people lived there.

Twin towns − Sister cities
Győr is twinned with:

  Kuopio, Finland
  Erfurt, Germany
  Sindelfingen, Germany
  Ingolstadt, Germany
  Colmar, France
  Bryansk, Russia
  Brașov, Romania
  Nazareth Illit, Israel
  Wuhan, China
  Poznań, Poland
  Montevideo, Uruguay

Other websites

 Official site 

Cities in Hungary
*


## Generate natural language responses based on the relevant context to the question

In [13]:
llm = pipeline("text2text-generation", "google/flan-t5-xl")


The LLM is large enough that it sometimes gives the right answer without any context:

In [14]:
llm.predict("What is the largest city in Hungary?")



[{'generated_text': 'budapest'}]

But for others it can be quite wrong, as with our previous question:

In [15]:
question

'What are the sister cities of Győr?'

In [16]:
llm.predict(question)

[{'generated_text': 'budapest budapest budapest'}]

Now with context:

In [17]:
def prompt(question, context): 
    return f"""Answer the following QUESTION based on the CONTEXT
        given. If you do not know the answer and the CONTEXT doesn't
        contain the answer truthfully say: "I don't know the answer to that question. Can you provide more information?". 

        QUESTION:
        {question}
        
        CONTEXT:
        {context}
        """

In [18]:
# pass only the article text as context, not the whole record
context = wikipedia[int(most_similar[-1])]['text']
answer = llm(prompt(question, context), max_length=200)[0]["generated_text"]

In [19]:
answer

'Kuopio, Finland Erfurt, Germany Sindelfingen, Germany Ingolstadt, Germany Colmar, France Bryansk, Russia Brașov, Romania Nazareth Illit, Israel Wuhan, China Pozna, Poland Montevideo, Uruguay'

Unfortunately, this model is not very chatty. From what I've seen, Llama would fit my purpose better, but I'd need to ask Meta for permission to use it.

## Now we can fit the pieces together

We'll wrap it up in a function.

In [20]:
def ask(question: str):
    # create vector representation
    vec = model.encode(question)
    # using cosine similarity to find the most relevant articles
    similarity = util.pytorch_cos_sim(np.array(vec), encoded_texts)[0]
    # top three not in order
    most_similar = np.argpartition(similarity,-3)[-3:]
    # sorted
    most_similar = np.array(most_similar[np.argsort(similarity[most_similar])], dtype=int)
    # we'll only provide one context file
    context_id = int(most_similar[-1])
    context = wikipedia[context_id]['text']
    # ask llm to generate response
    answer = llm(prompt(question, context), max_length=100, truncation=True)[0]["generated_text"]
    link = "" if answer.startswith("I don't know the answer") else wikipedia[context_id]['url']
    print(f"""
    Q: {question}
    A: {answer}
    link: {link}
    """)


Let's see how it works.

In [21]:
ask("How big is the Sun?")


    Q: How big is the Sun?
    A: The Sun is about a hundred times as wide as the Earth. It has a mass of . This is 333,000 times the mass of the Earth. 1.3 million Earths can fit inside the Sun.
    link: https://simple.wikipedia.org/wiki/Sun
    


In [22]:
ask("Can I have some water?")


    Q: Can I have some water?
    A: I don't know the answer to that question. Can you provide more information?
    link: 
    


## How to assess performance

1. The first method I'd use is to try the pipeline on example questions and check that it performs as expected. I would not spend too much time on this, since live performance can be hard to predict.
2. After making the QA pipeline accessible to end users, I would collect usage statistics:
    1. (self-reported satisfaction) Implement a satisfaction survey
    2. (revealed satisfaction) Measure how much time users spend chatting with the bot
    3. (fit for purpose) Measure how much the solution reduces demand on other workflows, such as live QA phone lines
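The first step could be made slightly more systematic by keeping a small list of questions, each paired with the article that should be retrieved, and measuring how often that article appears among the top k results. This is a sketch and not part of the original notebook. `hit_rate_at_k` is a hypothetical helper, and the retriever is passed in as a function so that the example runs without the models.

```python
from typing import Callable, Sequence


def hit_rate_at_k(
    examples: Sequence[tuple[str, int]],
    retrieve: Callable[[str, int], Sequence[int]],
    k: int = 3,
) -> float:
    """Fraction of questions whose expected article id is in the top k."""
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(expected in retrieve(question, k) for question, expected in examples)
    return hits / len(examples)


# Stand-in retriever for illustration only; a real one would wrap the
# embedding model and the similarity search used in `ask`.
def fake_retrieve(question: str, k: int) -> list[int]:
    lookup = {"sun": [7, 1, 2], "győr": [4, 5, 6]}
    return lookup.get(question, [])[:k]


print(hit_rate_at_k([("sun", 7), ("győr", 9)], fake_retrieve))
```

This only measures retrieval. Judging the generated answers would still need manual review or user feedback, as described in the list above.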

## Making the chat function accessible to the public

I was quite impressed by the simple solution outlined in **James Briggs'** video. Following his guidelines, I would host the embedding model and the large language model on *SageMaker* instances, the database of embedded vectors on *Pinecone*, and the corpus in a cloud database. The chat interface would send API requests to a gateway, which would follow the steps in the *ask* function above and send back the response from the *LLM*.
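The flow through the gateway can be sketched as a handler that calls four injected services in order. This is an illustration under assumptions, not a deployment recipe. `make_handler` and its parameter names are hypothetical, and the stubs stand in for the real endpoints, so no SageMaker or Pinecone calls appear here.

```python
from typing import Callable


def make_handler(
    embed: Callable[[str], list[float]],
    search: Callable[[list[float]], str],
    fetch: Callable[[str], str],
    generate: Callable[[str, str], str],
) -> Callable[[dict], dict]:
    """Build a request handler from four injected service clients."""

    def handle(request: dict) -> dict:
        question = request["question"]
        vector = embed(question)              # embedding model endpoint
        article_id = search(vector)           # vector database lookup
        context = fetch(article_id)           # corpus database lookup
        answer = generate(question, context)  # LLM endpoint
        return {"question": question, "answer": answer, "source": article_id}

    return handle


# Stub services so the sketch runs without any cloud resources.
handler = make_handler(
    embed=lambda q: [float(len(q))],
    search=lambda v: "Sun",
    fetch=lambda article_id: f"Text of the {article_id} article.",
    generate=lambda q, c: f"Based on: {c}",
)
print(handler({"question": "How big is the Sun?"}))
```

Keeping the services behind plain callables means each one can be swapped or mocked independently, which also makes the handler easy to test.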