<a href="https://colab.research.google.com/github/gregmeldrum/open-source-llm-talk/blob/main/colabs/QA_Chandrayaan3_Llama2_OpenOrca_Platypus2_13B_4_bit.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title
from IPython.display import HTML, display

def set_css():
  display(HTML('''
  <style>
    pre {
        white-space: pre-wrap;
    }
  </style>
  '''))
get_ipython().events.register('pre_run_cell', set_css)

## Document Q & A using Llama2 LLM in Google Colab.

This is an example of Retrieval Augmented Generation (RAG) using open source models for embedding and inference.

We'll run this on a Google Colab Python 3 hosted runtime with a T4 GPU, which offers about 15 GB of VRAM.

Due to the resource constraints of Colab, we'll use a small (0.44 GB) model for embeddings and a 4-bit quantized 13B-parameter model in GGML format for inference. The [llama.cpp](https://github.com/ggerganov/llama.cpp) project supports GGML-formatted models and allows the model to run on a combination of CPU and GPU.
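To see why quantization matters here, the rough estimate below compares the weight storage of a 13B-parameter model at 4 bits and at 16 bits per weight. It is only a back-of-the-envelope sketch: it assumes exactly 4 bits per weight (real 4-bit GGML files also store scaling data, so they are somewhat larger) and ignores the context cache and other runtime overhead.

```python
# Back-of-the-envelope estimate of weight storage (illustrative assumptions only).
PARAMS = 13e9          # 13B parameters
VRAM_GB = 15           # approximate VRAM on the Colab T4

# bits -> bytes -> gigabytes (decimal)
weights_4bit_gb = PARAMS * 4 / 8 / 1e9
weights_fp16_gb = PARAMS * 16 / 8 / 1e9

print(f"4-bit weights: ~{weights_4bit_gb:.1f} GB, fits in VRAM: {weights_4bit_gb < VRAM_GB}")
print(f"fp16 weights:  ~{weights_fp16_gb:.1f} GB, fits in VRAM: {weights_fp16_gb < VRAM_GB}")
```

Under these assumptions the 4-bit weights leave headroom on the GPU, while the 16-bit weights alone would exceed it.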

We'll use [langchain](https://python.langchain.com/) to run the model.

The model used in this colab is based on [OpenOrca-Platypus2-13B](https://huggingface.co/Open-Orca/OpenOrca-Platypus2-13B) which (at the time of writing this colab) is the top rated 13B model on the [Hugging Face LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard). More specifically, we'll be using the [GGML quantized version](https://huggingface.co/TheBloke/OpenOrca-Platypus2-13B-GGML).

This model uses the Alpaca InstructOnly prompt template:

```
### Instruction:

{prompt}

### Response:
```
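As a minimal sketch of how this template is filled in, the snippet below substitutes a user prompt with plain Python string formatting. The names `ALPACA_TEMPLATE` and `build_prompt` are hypothetical helpers for illustration, not part of the model or of langchain.

```python
# Hypothetical helper: fill the Alpaca InstructOnly template with a user prompt.
ALPACA_TEMPLATE = """### Instruction:

{prompt}

### Response:
"""

def build_prompt(user_prompt: str) -> str:
    """Return the full text that would be sent to the model."""
    return ALPACA_TEMPLATE.format(prompt=user_prompt)

print(build_prompt("Summarize the Chandrayaan-3 mission in one sentence."))
```

The model is expected to continue the text after the `### Response:` marker.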

The overall steps are as follows:
1. Scrape web sites and store the content in files
2. Read the files into chunks
3. Use an embedding model to vectorize the chunks
4. Store the vectors in a local Chroma database
5. Initialize the Llama2 inference model
6. Construct the inference chain in langchain
7. Run inference:
 - Take a query from the user
 - Search the vector database for the relevant data
 - Use the inference model to construct a response for the user

 TODO:
 1. Add memory

# Step 1: Scrape the website

## The Scraper

The following code will scrape the text from a website and follow links to other related sites as long as they fall under the same URL hierarchy.
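To make the "same URL hierarchy" rule concrete, here is a small standalone sketch of one way to test it. It is not the notebook's scraper: it uses `urljoin` to resolve relative links, and the function name `in_same_hierarchy` is an assumption made for illustration.

```python
from urllib.parse import urljoin, urlparse

def in_same_hierarchy(start_url: str, href: str) -> bool:
    """Return True if `href`, resolved against `start_url`, has the same host
    and sits under the directory of the start URL's path."""
    start = urlparse(start_url)
    # Directory part of the start path, e.g. "/wiki/" for "/wiki/Chandrayaan-3"
    base_dir = start.path.rsplit("/", 1)[0] + "/"
    target = urlparse(urljoin(start_url, href))
    return target.netloc == start.netloc and target.path.startswith(base_dir)

start = "https://en.wikipedia.org/wiki/Chandrayaan-3"
for href in ["/wiki/Chandrayaan-2", "https://www.isro.gov.in/", "/w/index.php"]:
    print(href, in_same_hierarchy(start, href))
```

Only the first link is under `/wiki/` on the same host, so it is the only one such a check would follow.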

First we pip install all of the dependencies.

In [None]:
!pip install -q requests==2.31.0 beautifulsoup4==4.11.2 sentence-transformers==2.2.2 chromadb==0.4.5 langchain==0.0.271 FlagEmbedding==1.0.3

In [None]:
import os
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse

# Set for storing already visited urls
visited_urls = set()

data_directory = "/content/data"

def get_page_content(url):
    """
    Returns the content of the webpage at `url`
    """
    response = requests.get(url)
    return response.text

def get_all_links(content, domain):
    """
    Returns all valid links on the page
    """
    soup = BeautifulSoup(content, "html.parser")
    links = soup.find_all("a")
    valid_links = []

    for link in links:
        href = link.get('href')
        if href is not None and not href.startswith("..") and not href.startswith("#"):
            if href.startswith("http"):
                if href.startswith(domain):
                    print("Following", href)
                    valid_links.append(href)
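            else:
                # Relative link: drop a trailing #fragment and append it to the base path.
                # Root-relative hrefs (starting with "/") are not handled specially.
                print("Following", strip_after_last_hash(href))
                valid_links.append(domain + '/' + strip_after_last_hash(href))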
    return valid_links

def strip_after_last_hash(url):
    """
    Strips off all characters after the last "#" in `url`,
    if "#" does not have a "/" character before it.
    """
    hash_index = url.rfind('#')
    if hash_index > 0 and url[hash_index - 1] != '/':
        return url[:hash_index]
    else:
        return url

def write_to_file(url, content):
    """
    Write the content to a text file with the name as the URL
    """
    if not os.path.exists(data_directory):
        os.makedirs(data_directory)
    filename = data_directory + '/' + url.replace('/', '_').replace(':', '_') + '.txt'
    with open(filename, 'w', encoding='utf-8') as f:
        lines = content.split('\n')
        non_blank_lines = [line for line in lines if line.strip() != '']
        f.write('\n'.join(non_blank_lines))

def scrape(url, depth):
    """
    Scrapes the webpage at `url` up to a certain `depth`
    """
    scheme = urlparse(url).scheme # Get the scheme
    domain = urlparse(url).netloc # Get base domain
    path = os.path.dirname(urlparse(url).path) # Get base path excluding the last part

    print("URL", url)
    if depth == 0 or url in visited_urls:
        return

    visited_urls.add(url)

    print(f"Scraping: {url}")
    content = get_page_content(url)
    soup = BeautifulSoup(content, "html.parser")
    text = soup.get_text()
    write_to_file(url, text)

    links = get_all_links(content, scheme + "://" + domain + path)

    for link in links:
        scrape(link, depth - 1)

Now, let's scrape Wikipedia and the Indian Space Research Organisation (ISRO) site for information on Chandrayaan-3.

In [None]:
scrape("https://en.wikipedia.org/wiki/Chandrayaan-3", 1)
scrape("https://www.isro.gov.in/Chandrayaan3_Details.html", 1)

In [None]:
!ls -l /content/data/*


# Vector Embedding

## Step 2: Text Splitter

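We split the text into chunks using langchain's `RecursiveCharacterTextSplitter`, with a chunk size of 500 characters and an overlap of 50 characters between consecutive chunks.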
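The idea of overlapping chunks can be sketched with a plain sliding window. This is a simplification, not the recursive splitter itself: the real splitter prefers to break on separators such as paragraphs and newlines, whereas this hypothetical `chunk_text` helper cuts at fixed character offsets.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> List[str]:
    """Split `text` into fixed-size windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk's overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]

sample = "".join(chr(65 + i % 26) for i in range(1000))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
print(chunks[0][-50:] == chunks[1][:50])  # consecutive chunks share 50 characters
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, which helps retrieval.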

## Step 3: Embed
We'll use the BGE embedding model via the Hugging Face libraries. Ideally we would use the GPU for the embedding task, but since we are trying to run this in a free-tier Colab, we'll save the GPU for inference and use the CPU instead. This takes much longer, but it's a trade-off we need to make due to resource constraints.
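The embedding code in this step sets `normalize_embeddings` to `True`. As a small illustration of why that is useful, the sketch below shows that once vectors are scaled to unit length, a plain dot product gives the same value as cosine similarity. The two-dimensional vectors are made-up toy values, not real embeddings.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a = [3.0, 4.0]  # toy vectors for illustration only
b = [4.0, 3.0]
print(cosine_similarity(a, b))
print(dot(normalize(a), normalize(b)))  # matches the line above up to rounding
```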

In [None]:
from langchain.embeddings import HuggingFaceBgeEmbeddings

from langchain.vectorstores import Chroma
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
import sys

embedding_directory = "/content/chroma_db"

embedding_db = None

def embed():

    print("\nCalculating Embeddings\n")

    # Load the text from the data directory
    loader=DirectoryLoader(data_directory,
                        glob="*.txt",
                        loader_cls=TextLoader)

    documents=loader.load()

    # Split the data into chunks
    text_splitter=RecursiveCharacterTextSplitter(chunk_size=500,
                                                chunk_overlap=50)

    chunks = text_splitter.split_documents(documents)

    # Load the huggingface embedding model
    model_name = "BAAI/bge-base-en"
    encode_kwargs = {'normalize_embeddings': True} # set True to compute cosine similarity

    embedding_model = HuggingFaceBgeEmbeddings(
        model_name=model_name,
        model_kwargs={'device': 'cpu'},
        #model_kwargs={'device': 'cuda'},
        encode_kwargs=encode_kwargs
    )

    global embedding_db
    embedding_db = Chroma.from_documents(chunks, embedding_model, persist_directory=embedding_directory)

    print("Embeddings completed")

In [None]:
embed()

In [None]:
!ls /content/chroma_db


Next we'll install and compile the `llama-cpp-python` library with cuBLAS (GPU) support, and then download the GGML model from Hugging Face.

In [None]:
!pip uninstall -y llama-cpp-python
!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python==0.1.78 --no-cache-dir

!apt-get -y install -qq aria2
!aria2c --console-log-level=error -c -x 16 -s 16 -k 1M https://huggingface.co/TheBloke/OpenOrca-Platypus2-13B-GGML/resolve/main/openorca-platypus2-13b.ggmlv3.q4_K_M.bin -d /content/ -o openorca-platypus2-13b.ggmlv3.q4_K_M.bin

Next we'll create the LLM using langchain's `LlamaCpp` wrapper. For a description of the parameters used to configure the LlamaCpp LLM, see the [API definition](https://api.python.langchain.com/en/latest/llms/langchain.llms.llamacpp.LlamaCpp.html#langchain.llms.llamacpp.LlamaCpp).

In [None]:
from langchain.llms import LlamaCpp
from langchain import PromptTemplate, LLMChain
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

# Callbacks support token-wise streaming
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])
# Verbose is required to pass to the callback manager

temperature = 0.1 # Use a value between 0 and 2. Lower = factual, higher = creative
n_gpu_layers = 43  # Change this value based on your model and your GPU VRAM pool.
n_batch = 512  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.

# Make sure the model path is correct for your system!
llm = LlamaCpp(
    model_path="/content/openorca-platypus2-13b.ggmlv3.q4_K_M.bin",
    temperature=temperature,
    max_tokens=1024,
    n_ctx=4096,
    top_p=0.95,
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    #callback_manager=callback_manager,
    verbose=True,
)


Now, we create a custom prompt to be used by the LLM. We feed the prompt into a langchain PromptTemplate.

In [None]:
## Follow the default prompt style from the OpenOrca-Platypus2 huggingface model card.

def get_prompt():
  return """### Instruction:

Use the following Context information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Context: {context}
User Question: {question}
Only return the helpful answer below and nothing else.

### Response:"""

from langchain.prompts import PromptTemplate
prompt_template = get_prompt()

llama_prompt = PromptTemplate(
    template=prompt_template, input_variables=["context", "question"]
)

chain_type_kwargs = {"prompt": llama_prompt}

Create the "retriever" from the embedding DB to get the text chunks that best match the query.

Set up the RetrievalQA chain by providing the retriever, the chain type and the LLM.
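Conceptually, the "stuff" chain type places all retrieved chunks into the `{context}` slot of a single prompt. The sketch below imitates that with plain string handling. The `stuff_prompt` helper, the shortened template, the placeholder chunk texts and the blank-line separator are all assumptions for illustration; the real chain does this internally.

```python
# Shortened, hypothetical version of the notebook's prompt template
PROMPT_TEMPLATE = """### Instruction:

Use the following Context information to answer the user's question.
Context: {context}
User Question: {question}

### Response:"""

def stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks into one context block and fill the template."""
    context = separator.join(chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)

retrieved = ["<text of chunk 1>", "<text of chunk 2>"]  # placeholders, not real data
print(stuff_prompt(retrieved, "When did the Chandrayaan-3 lander land on the moon?"))
```

Because every chunk goes into one prompt, the number of retrieved chunks times the chunk size has to fit within the model's context window along with the question and the answer.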

In [None]:
from langchain.chains import RetrievalQA

#retriever = embedding_db.as_retriever(search_type="mmr", search_kwargs={'k': 5})
retriever = embedding_db.as_retriever(search_kwargs={'k': 5})

# create the chain to answer questions
qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                       chain_type="stuff",
                                       retriever=retriever,
                                       chain_type_kwargs=chain_type_kwargs,
                                       return_source_documents=True)


Process the response. We print the sources to the debug output, but return only the answer.

In [None]:
## Cite sources

import textwrap

def wrap_text_preserve_newlines(text, width=110):
    # Split the input text into lines based on newline characters
    lines = text.split('\n')

    # Wrap each line individually
    wrapped_lines = [textwrap.fill(line, width=width) for line in lines]

    # Join the wrapped lines back together using newline characters
    wrapped_text = '\n'.join(wrapped_lines)

    return wrapped_text

def process_llm_response(llm_response):
    print(wrap_text_preserve_newlines(llm_response['result']))
    print('\n\nSources:')
    for source in llm_response["source_documents"]:
        print(source.metadata['source'])
    response = llm_response['result']
    # Keep only the text before any repeated "### Response" marker the model may emit
    response = response.split("### Response")[0]
    return response

Example with code

In [None]:
# full example
query = "When did the Chandrayaan-3 lander land on the moon?"
llm_response = qa_chain(query)
process_llm_response(llm_response)

Example with UI

In [None]:
!pip install -q gradio
import gradio

def runChain(query, history):
  return process_llm_response(qa_chain(query))

gradio.ChatInterface(runChain).launch(share=True, debug=True)