<a href="https://colab.research.google.com/github/marcinmosiolek/nlp/blob/main/Custom_Knowledge_Generative_QA_With_Alpaca.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

OpenAI has just announced and released [plugins for ChatGPT](https://openai.com/blog/chatgpt-plugins) that address its key problems: hallucinations and the difficulty of updating it with new, case-specific information. The plugins work mainly by restricting answer generation to a specific context obtained from a dedicated information source, such as your own documents.

Overcoming these challenges makes ChatGPT suitable for a much wider range of applications. However, it also comes at a significant cost. Fortunately, there is an open-source alternative that can be prototyped in 15 minutes. In this short article, I'll show how to do retrieval-augmented generative question answering with [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) and [Sentence Transformers](https://www.sbert.net). That is, how to construct a solution that answers your questions like a human domain expert.


---
## Installing dependencies
As the very first step, we need to install the required Python dependencies:

In [1]:
!pip install -q bitsandbytes datasets loralib sentencepiece tenacity
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q git+https://github.com/huggingface/peft.git
!pip install -q sentence-transformers

---
## Alpaca - Instruction-Following Language Model

[Stanford's Alpaca 7B](https://crfm.stanford.edu/2023/03/13/alpaca.html) is a small but super powerful instruction-following language model constructed in a [very clever way](https://arxiv.org/abs/2212.10560). In other words, it works much like ChatGPT, but it's much smaller and free to use - you can even [run it on your CPU](https://github.com/antimatter15/alpaca.cpp)!

In [2]:
import torch

from transformers import GenerationConfig
from transformers import LlamaTokenizer, LlamaForCausalLM
from peft import PeftModel
import bitsandbytes as bnb

# The code comes from here: https://github.com/deep-diver/Alpaca-LoRA-Serve/

# Note: temperature and top_p only take effect when sampling is enabled (do_sample=True)
GENERATION_CONFIG = GenerationConfig(
    max_length=256,
    temperature=0.9,
    top_p=0.75,
    num_beams=1,
    use_cache=True,
    min_length=0
)


def load_model(
        base="decapoda-research/llama-7b-hf",
        finetuned="tloen/alpaca-lora-7b",
):
    tokenizer = LlamaTokenizer.from_pretrained(base)
    tokenizer.pad_token_id = 0
    tokenizer.padding_side = "left"

    model = LlamaForCausalLM.from_pretrained(
        base,
    )

    model = PeftModel.from_pretrained(model, finetuned).to("cuda")
    return model, tokenizer


model, tokenizer = load_model()




---
## Semantic Search

The idea of semantic search is to embed the meaning of the text in a vector of numbers and then use this vector to search for similar documents - vector similarity indicates the similarity of documents. The embedding is done using deep learning models specially trained for the task of semantic similarity, while the searching is done using vector search databases. However, in this example we will limit our solution to what comes with the Python package called [SentenceTransformers](https://sbert.net).
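To make the idea of ranking by vector similarity concrete, here is a minimal sketch in plain Python. The three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `search` are hypothetical helpers written for illustration, not part of SentenceTransformers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, corpus_vecs, top_k=1):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up 3-dimensional "embeddings"; real models produce hundreds of dimensions.
toy_corpus = [
    [0.9, 0.1, 0.0],  # passage 0
    [0.0, 1.0, 0.2],  # passage 1
    [0.1, 0.0, 1.0],  # passage 2
]
toy_query = [0.8, 0.2, 0.1]

best_index, best_score = search(toy_query, toy_corpus, top_k=1)[0]
print(best_index, best_score)  # passage 0 points in nearly the same direction as the query
```

A brute-force scan like this is fine for a small corpus; dedicated vector databases exist to make the same lookup fast at much larger scale.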

For simplicity, we'll use the Simple English Wikipedia (simplewiki) to stand in for a custom dataset. We will also download the corresponding pre-computed text embeddings of the dataset to save time and resources. This way we only need to compute the embedding of the query.


In [3]:
import os
import gzip
import json

from sentence_transformers import SentenceTransformer, util


# This code comes from: https://github.com/UKPLab/sentence-transformers/

def load_wikipedia():
    wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

    # retrieve the dataset from online location
    if not os.path.exists(wikipedia_filepath):
        util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

    # extract the text and store as a list in passages variable
    passages = []
    with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
        for line in fIn:
            data = json.loads(line.strip())
            for paragraph in data['paragraphs']:
                # We encode the passages as [title, text]
                passages.append([data['title'], paragraph])

    # also download the embeddings to avoid redundant computation
    embeddings_filepath = 'simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
    if not os.path.exists(embeddings_filepath):
        util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01-nq-distilbert-base-v1.pt', embeddings_filepath)

    corpus_embeddings = torch.load(embeddings_filepath)
    corpus_embeddings = corpus_embeddings.float()  # Convert embedding file to float
    if torch.cuda.is_available():
        corpus_embeddings = corpus_embeddings.to('cuda')

    return passages, corpus_embeddings


# load the dataset
passages, embeddings = load_wikipedia()

# load the embeddings model to be used for the query
model_name = 'nq-distilbert-base-v1'
bi_encoder = SentenceTransformer(model_name)


---
## Domain-Specific Generative Question Answering

We now combine semantic search with generative question answering. First, we look up our question in the database and retrieve the most relevant passage. To do this, we embed the query into a vector and run the search.

In [4]:
question = "How many James Bond films has Sean Connery starred in?"

question_embedding = bi_encoder.encode(question, convert_to_tensor=True)
hit = util.semantic_search(question_embedding, embeddings, top_k=1)
retrieved_passage = passages[hit[0][0]["corpus_id"]][1]

print(retrieved_passage)

Sir Thomas Sean Connery (25 August 1930 – 31 October 2020) was a Scottish actor. He was known for his charm and good looks, which have made him very famous. He was best known for playing James Bond in seven of the James Bond movies. He appeared in 94 movies. He won the Academy Award for Best Supporting Actor for his role as Jimmy Malone in "The Untouchables" (1987).


Then, we ask Alpaca to answer our question based on the retrieved passage. To do this, we need to construct a suitable prompt.


In [8]:
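def answer(question, context, model, tokenizer):
    # a batch with a single prompt: the instruction, the question and the retrieved passage
    prompt = [
        "Answer the question using the following context\n"
        f"Question: {question}\n"
        f"Context: {context}"
    ]

    # tokenize the prompt and move the tensors to the GPU
    encodings = tokenizer(prompt, padding=True, return_tensors="pt").to('cuda')
    generated_ids = model.generate(
        **encodings,
        generation_config=GENERATION_CONFIG,
        max_new_tokens=256
    )

    decoded = tokenizer.batch_decode(generated_ids)
    # free GPU memory before returning
    del encodings, generated_ids
    torch.cuda.empty_cache()
    # the decoded text repeats the prompt, so keep only its last line (the generated answer)
    return decoded[0].split("\n")[-1]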
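The prompt above is free-form. As an alternative worth evaluating, the sketch below fills the instruction/input/response template that was published with Stanford Alpaca for training, on the assumption that a model fine-tuned on that format may follow it more reliably. The helpers `build_alpaca_prompt` and `extract_response` are hypothetical names introduced here, and this is not what the notebook actually runs.

```python
# Template published with Stanford Alpaca for instructions that come with an input.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{context}\n\n"
    "### Response:\n"
)


def build_alpaca_prompt(question, context):
    """Hypothetical helper: fill the template with the question and the retrieved passage."""
    return ALPACA_TEMPLATE.format(instruction=question.strip(), context=context.strip())


def extract_response(decoded_text):
    """Keep only the text that follows the final '### Response:' marker."""
    return decoded_text.split("### Response:")[-1].strip()


example_prompt = build_alpaca_prompt(
    "How many James Bond films has Sean Connery starred in?",
    "He was best known for playing James Bond in seven of the James Bond movies.",
)
print(example_prompt)
```

With this template the answer would be recovered with `extract_response` rather than by taking the last line of the decoded text, since the generated response may span several lines.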

And voila! That's it:

In [9]:
def answer_question(question):
    question_embedding = bi_encoder.encode(question, convert_to_tensor=True)
    hit = util.semantic_search(question_embedding, embeddings, top_k=1)
    retrieved_passage = passages[hit[0][0]["corpus_id"]][1]

    return answer(question, retrieved_passage, model, tokenizer)

Let's see some results:

In [10]:
answer_question("How many James Bond films has Sean Connery starred in?")

'Answer: Sean Connery starred in seven James Bond films.'

In [11]:
answer_question("Who played Vito Corleone in the movie Godfather?")

'Answer: Marlon Brando'

In [12]:
answer_question("What is the most popular album of Pink Floyd?")

"Answer: The most popular album of Pink Floyd is The Dark Side of the Moon. It was released in 1973 and has sold over 45 million copies worldwide. It is the second best-selling album of all time, behind Michael Jackson's Thriller."

---
## Summary

Obviously, the above example is oversimplified, and much more work is needed to make it part of user-facing software. For example, a [larger model](https://huggingface.co/baseten/alpaca-30b) could be used to better follow user instructions, different prompts could be evaluated, the context could be extended to more than a single document, conversation history could be added, and the semantic search could be performed by more sophisticated models and vector databases. Moreover, for commercial applications you could turn your attention to [Dolly](https://github.com/databrickslabs/dolly). But the idea remains the same.
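
As a rough illustration of one of these extensions, using more than a single document as context, the sketch below joins several retrieved passages into one context string. `build_context` and its `max_chars` budget are hypothetical and introduced only for this example; `hits` is assumed to be the per-query list of dictionaries (each with a `corpus_id` key) that `util.semantic_search` returns, and `passages` the `[title, text]` list loaded earlier.

```python
def build_context(hits, passages, max_chars=2000):
    """Join the text of several retrieved passages into one context string.

    hits: list of dicts with a "corpus_id" key, best match first.
    passages: list of [title, text] pairs.
    max_chars: hypothetical budget that keeps the prompt reasonably short.
    """
    parts, used = [], 0
    for hit in hits:
        title, text = passages[hit["corpus_id"]]
        if used + len(text) > max_chars:
            break  # stop once the next passage would exceed the budget
        parts.append(f"{title}: {text}")
        used += len(text)
    return " ".join(parts)


# Toy data in the same shape as the notebook's variables.
toy_passages = [["A", "first passage"], ["B", "second passage"], ["C", "third passage"]]
toy_hits = [{"corpus_id": 2, "score": 0.9}, {"corpus_id": 0, "score": 0.7}]

context = build_context(toy_hits, toy_passages)
print(context)
```

In the notebook this could be fed with `util.semantic_search(question_embedding, embeddings, top_k=3)[0]` and the result passed to `answer` in place of the single retrieved passage; how many passages actually fit depends on the model's context window.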