OpenAI has just announced and released [plugins for ChatGPT](https://openai.com/blog/chatgpt-plugins) that address its key problems - hallucinations and difficulties in updating with new, case-specific information. The plugins work mainly by restricting the generation of answers to a specific context, obtained from a dedicated information source, such as your specific documents.

Overcoming these challenges makes ChatGPT suitable for a much wider range of applications. However, it also comes at a significant cost. Fortunately, there is an open-source alternative that can be prototyped in 15 minutes. In this short article, I'll show how to do retrieval-augmented generative question answering with [Alpaca](https://crfm.stanford.edu/2023/03/13/alpaca.html) and [Sentence Transformers](https://www.sbert.net). That is, how to construct a solution that answers your questions like a human domain expert.


---
## Installing dependencies
As the very first step, we need to install the required Python dependencies:

In [None]:
!pip install bitsandbytes datasets loralib sentencepiece 
!pip install git+https://github.com/huggingface/transformers.git
!pip install git+https://github.com/huggingface/peft.git tenacity
!pip install -U sentence-transformers

---
## Alpaca - Instruction-Following Language Model

[Stanford's Alpaca 7B](https://crfm.stanford.edu/2023/03/13/alpaca.html) is a small but surprisingly capable instruction-following language model, built by fine-tuning Meta's LLaMA 7B on instruction-following demonstrations. It works much like ChatGPT, but it's much smaller and free to use - you can even [run it on your CPU](https://github.com/antimatter15/alpaca.cpp)!

In [None]:
import torch

from transformers import GenerationConfig
from transformers import LlamaTokenizer, LlamaForCausalLM
from peft import PeftModel
import bitsandbytes as bnb

# The code comes from here: https://github.com/deep-diver/Alpaca-LoRA-Serve/

GENERATION_CONFIG = GenerationConfig(
    max_length=256,
    temperature=0.9,
    top_p=0.75,
    num_beams=1,
    use_cache=True,
    min_length=0
)

def load_model(
    base="decapoda-research/llama-7b-hf",
    finetuned="tloen/alpaca-lora-7b",
):
    tokenizer = LlamaTokenizer.from_pretrained(base)
    tokenizer.pad_token_id = 0
    tokenizer.padding_side = "left"

    model = LlamaForCausalLM.from_pretrained(
        base,
    )
    
    model = PeftModel.from_pretrained(model, finetuned).to("cuda")
    return model, tokenizer

model, tokenizer = load_model()

---
## Semantic Search

The idea of semantic search is to embed the meaning of the text in a vector of numbers and then use this vector to search for similar documents - vector similarity indicates the similarity of documents. The embedding is done using deep learning models specially trained for the task of semantic similarity, while the searching is done using vector search databases. However, in this example we will limit our solution to what comes with the Python package called [SentenceTransformers](https://sbert.net).

For simplicity, we'll use simple-wiki as our search database. We will also download the corresponding pre-computed text embeddings of the dataset to save time and resources. This way we only need to compute the embeddings of the query.
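
To make the idea of "vector similarity" concrete, here is a minimal sketch of a top-k search by cosine similarity in plain Python. The three-dimensional vectors and the names `cosine_similarity`, `top_k`, `toy_corpus` and `toy_query` are made up for illustration; real embeddings have hundreds of dimensions, and libraries such as SentenceTransformers do this on tensors far more efficiently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, k=1):
    """Return (index, score) pairs for the k corpus vectors closest to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical toy "embeddings", one vector per document
toy_corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
toy_query = [0.9, 0.1, 0.0]

best = top_k(toy_query, toy_corpus, k=2)
print(best)  # documents 0 and 2 point in the most similar direction to the query
```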


In [None]:
import os
import gzip
import json

from sentence_transformers import SentenceTransformer, util


# This code comes from: https://github.com/UKPLab/sentence-transformers/

def load_wikipedia():
  wikipedia_filepath = 'data/simplewiki-2020-11-01.jsonl.gz'

  # retrieve the dataset from online location
  if not os.path.exists(wikipedia_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01.jsonl.gz', wikipedia_filepath)

  # extract the text and store as a list in passages variable
  passages = []
  with gzip.open(wikipedia_filepath, 'rt', encoding='utf8') as fIn:
    for line in fIn:
        data = json.loads(line.strip())
        for paragraph in data['paragraphs']:
            # We encode the passages as [title, text]
            passages.append([data['title'], paragraph])

  # also download the embeddings to avoid redundant computation
  embeddings_filepath = 'simplewiki-2020-11-01-nq-distilbert-base-v1.pt'
  if not os.path.exists(embeddings_filepath):
    util.http_get('http://sbert.net/datasets/simplewiki-2020-11-01-nq-distilbert-base-v1.pt', embeddings_filepath)

  corpus_embeddings = torch.load(embeddings_filepath)
  corpus_embeddings = corpus_embeddings.float()  # Convert embedding file to float
  if torch.cuda.is_available():
      corpus_embeddings = corpus_embeddings.to('cuda')

  return passages, corpus_embeddings

# load the dataset
passages, embeddings = load_wikipedia()

# load the embeddings model to be used for the query
model_name = 'nq-distilbert-base-v1'
bi_encoder = SentenceTransformer(model_name)

---
## Domain-Specific Generative Question Answering

What we are now doing is combining semantic search with generative question answering. First, we look for the answer to our question in the database by retrieving the passage that is most similar to the question. To do that, we need to embed the query into a vector and run the search.

In [None]:
question = "How many James Bond films has Sean Connery starred in?"

question_embedding = bi_encoder.encode(question, convert_to_tensor=True)
hit = util.semantic_search(question_embedding, embeddings, top_k=1)
retrieved_passage = passages[hit[0][0]["corpus_id"]][1]

print(retrieved_passage)
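
The top hit is only the most similar passage, not necessarily a relevant one. A minimal sketch of a guard against weak matches could look like the following. It assumes the `hits` structure used above (one list per query, each entry a dict with `corpus_id` and `score`); the helper name `best_passage` and the cut-off `min_score=0.3` are hypothetical and would need tuning on real data.

```python
def best_passage(hits, passages, min_score=0.3):
    """Return the text of the top hit, or None if nothing scores high enough.

    min_score is a hypothetical threshold, not a recommended value.
    """
    if not hits or not hits[0]:
        return None
    top = hits[0][0]
    if top["score"] < min_score:
        return None
    # passages are stored as [title, text], so index 1 is the text
    return passages[top["corpus_id"]][1]


# Stand-in data in the same shape as the real passages and hits
toy_passages = [["Title A", "Text of passage A."], ["Title B", "Text of passage B."]]
strong_hit = [[{"corpus_id": 1, "score": 0.71}]]
weak_hit = [[{"corpus_id": 0, "score": 0.05}]]

print(best_passage(strong_hit, toy_passages))
print(best_passage(weak_hit, toy_passages))
```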

Then, we will ask Alpaca to answer our question by looking at the identified document. To do this we need to construct the correct prompt.


In [None]:
def answer(question, context, model, tokenizer):
  query = [
      "Answer the question using the following context\n"
      f"Question: {question}\n"
      f"Context: {context}"
  ]

  encodings = tokenizer(query, padding=True, return_tensors="pt").to('cuda')
  generated_ids = model.generate(
    **encodings,
    generation_config=GENERATION_CONFIG,
    max_new_tokens=256
  )

  decoded = tokenizer.batch_decode(generated_ids)
  del encodings, generated_ids
  torch.cuda.empty_cache()
  return decoded[0].split("\n")[-1]
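
One of the prompt variants worth trying is the instruction/input/response template published with Stanford Alpaca. The sketch below assumes the fine-tuned weights were trained with that template; `build_prompt` and `extract_response` are hypothetical helper names, not part of any library.

```python
def build_prompt(question, context):
    """Format a question and its context using the Alpaca prompt template."""
    return (
        "Below is an instruction that describes a task, paired with an input "
        "that provides further context. Write a response that appropriately "
        "completes the request.\n\n"
        f"### Instruction:\n{question}\n\n"
        f"### Input:\n{context}\n\n"
        "### Response:\n"
    )


def extract_response(decoded_text):
    """Keep only the text generated after the last '### Response:' marker."""
    return decoded_text.split("### Response:")[-1].strip()


prompt = build_prompt("Who wrote the report?", "The report was written by the audit team.")
print(prompt)
```

With this template, the answer can be cut out of the decoded output with `extract_response` instead of taking the last line, which also keeps multi-line answers intact.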

And voila! That's it:

In [None]:
def answer_question(question):
  question_embedding = bi_encoder.encode(question, convert_to_tensor=True)
  hit = util.semantic_search(question_embedding, embeddings, top_k=1)
  retrieved_passage = passages[hit[0][0]["corpus_id"]][1]

  return answer(question, retrieved_passage, model, tokenizer)

Let's see some results:

In [None]:
answer_question("How many James Bond films has Sean Connery starred in?")

In [None]:
answer_question("Who played Vito Corleone in the movie The Godfather?")

In [None]:
answer_question("What is the most popular album of Pink Floyd?")

In [None]:
answer_question("What was the reason for establishing the city of Gdynia?")

---
## Summary

Obviously, the above example is oversimplified, and much more work is needed to make it part of commercial software. For example, a [larger model](https://huggingface.co/baseten/alpaca-30b) could be used to better follow user instructions, different prompts could be evaluated, the context could be extended to more than a single document, conversation history could be added, and the semantic search could be performed by more sophisticated models and vector databases. But the idea remains the same.
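
As an illustration of the "more than a single document" idea, here is a rough sketch that joins several retrieved passages into one context string. It assumes the search is run with a larger `top_k` (for example `top_k=3`) and that hits come back sorted by score; the helper name `build_context` and the `max_chars` budget are hypothetical, and a real solution would more likely count tokens than characters.

```python
def build_context(hits, passages, max_chars=1500):
    """Join retrieved passages, best first, without exceeding max_chars in total."""
    parts = []
    used = 0
    for hit in hits[0]:
        title, text = passages[hit["corpus_id"]]
        entry = f"{title}: {text}"
        if used + len(entry) > max_chars:
            break  # stop before the context grows past the budget
        parts.append(entry)
        used += len(entry)
    return "\n".join(parts)


# Stand-in data in the same shape as the passages and hits used in this notebook
toy_passages = [["A", "aaa"], ["B", "bbb"], ["C", "ccc"]]
toy_hits = [[{"corpus_id": 2, "score": 0.9}, {"corpus_id": 0, "score": 0.8}]]

print(build_context(toy_hits, toy_passages))
```

The resulting string can be passed as the `context` argument in place of the single retrieved passage.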