# Code documentation Q&A bot example with LangChain
![picture](https://lancedb.github.io/lancedb/assets/ecosystem-illustration.png)

This Q&A bot lets you query your own documentation with natural-language questions. Along the way, we'll demonstrate how to use LangChain and LanceDB with the OpenAI API.

In this example we'll use the **NumPy 1.26** documentation, but you can replace it with your own docs.

### Installing dependencies

In [None]:
! pip install -U langchain langchain-openai

In [23]:
! pip install -qq tiktoken unstructured pandas lancedb

First, let's get some setup out of the way. As we're using the OpenAI API, ensure that you've set your key (and organization if needed):

In [1]:
import openai
import os


# Set OPENAI_API_KEY only if it is not already defined in the environment
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = "sk-..."
openai.api_key = os.environ["OPENAI_API_KEY"]

# assert len(openai.models.list()["data"]) > 0
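
Hardcoding a key in a notebook makes it easy to leak by accident. As one possible alternative, the sketch below reads the key from the environment and only prompts for it when it is missing. The helper name `ensure_api_key` and its parameters are illustrative assumptions, not part of the OpenAI or LangChain APIs.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only if it is missing."""
    key = os.environ.get(var_name)
    if not key:
        # getpass hides the typed value so it does not end up in the notebook output
        key = prompt(f"Enter {var_name}: ")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` at the top of the notebook could then replace the hardcoded assignment.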


We'll use LangChain to build our Q&A bot. It provides several APIs that simplify development, as well as a LanceDB integration that serves as the vector store.

### Importing all libraries

In [14]:
import lancedb
import re
import pickle
import requests
import zipfile
from pathlib import Path

from langchain.document_loaders import UnstructuredHTMLLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import LanceDB
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA

To make this easier, the raw HTML files of the pandas documentation are hosted for download. We'll download the docs, parse them with LangChain's HTML document loader, and store them in LanceDB as a vector store, along with relevant metadata.
By default we use the NumPy docs because they are much smaller than the pandas docs, but you can substitute your own docs.

### Get the data

In [3]:
# pandas_docs = requests.get("https://eto-public.s3.us-west-2.amazonaws.com/datasets/pandas_docs/pandas.documentation.zip")

numpy_docs = requests.get("https://numpy.org/doc/1.26/numpy-html.zip")
with open("numpy-html.zip", "wb") as f:
    f.write(numpy_docs.content)

# Extract the archive into the numpy_docs directory
with zipfile.ZipFile("numpy-html.zip") as archive:
    archive.extractall(path="numpy_docs")
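
A quick sanity check after extraction is to count the HTML files that landed on disk. The sketch below wraps extraction and counting in a hypothetical helper, `extract_and_count_html`, and runs it on a throwaway archive so it works without network access.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_count_html(zip_path, dest_dir):
    """Extract zip_path into dest_dir and return the number of .html files found."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(path=dest_dir)
    return sum(1 for _ in Path(dest_dir).rglob("*.html"))


# Demo on a small generated archive instead of the real documentation zip
with tempfile.TemporaryDirectory() as tmp:
    archive_path = Path(tmp) / "docs.zip"
    with zipfile.ZipFile(archive_path, "w") as archive:
        archive.writestr("reference/index.html", "<html></html>")
        archive.writestr("README.txt", "not html")
    html_count = extract_and_count_html(archive_path, Path(tmp) / "out")

print(html_count)  # 1
```

For this notebook, the equivalent call would be `extract_and_count_html("numpy-html.zip", "numpy_docs")`.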

We'll create a simple **helper function** that extracts metadata so we can use it downstream when querying with filters. In this case, we want to keep the lineage (the URI or path) of each document that we process:

In [4]:
# Helper that derives a title from each document's source path
import re


def get_document_title(document_list):
    """Return one title per document, derived from its source path."""
    titles = []
    for doc in document_list:
        # LangChain Document objects expose metadata as an attribute, not a dict key
        source = str(doc.metadata.get("source", ""))
        # Capture the path between "numpy_docs" and the ".html" extension
        title = re.findall(r"numpy_docs(.*)\.html", source)
        titles.append(title[0] if title else "")
    return titles
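
To see what the regular expression captures, here is a standalone sketch applied to a single path. The example path is hypothetical, and `title_from_source` is an illustrative name; on Windows the captured text would contain backslashes instead of forward slashes.

```python
import re

# Escape the dot so only a literal ".html" extension matches
TITLE_PATTERN = re.compile(r"numpy_docs(.*)\.html")


def title_from_source(source):
    """Return the path between 'numpy_docs' and '.html', or '' if there is no match."""
    match = TITLE_PATTERN.search(str(source))
    return match.group(1) if match else ""


example_title = title_from_source("numpy_docs/reference/generated/numpy.linalg.norm.html")
print(example_title)  # /reference/generated/numpy.linalg.norm
```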

### Pre-processing and loading the documentation

Next, let's pre-process and load the documentation. So that we don't have to repeat this work every time the code changes, we cache the parsed documents with pickle and reload them later (the first run can take a few minutes). We'll also add extra metadata to each document, such as its title:

If you run into an issue with the nltk package, try running
```
import nltk
nltk.download('punkt')
```
or manually install the [nltk_data](https://github.com/nltk/nltk_data/tree/gh-pages) package and unzip the **punkt tokenizer** and **averaged_perceptron_tagger** zip files into the packages folder.

In [5]:
from tqdm import tqdm

docs_path = Path("docs.pkl")

if docs_path.exists():
    # Reuse the documents cached by a previous run
    with open(docs_path, "rb") as fh:
        docs = pickle.load(fh)
else:
    docs = []
    for p in tqdm(Path("numpy_docs").rglob("*.html")):
        if p.is_dir():
            continue
        loader = UnstructuredHTMLLoader(p)
        raw_document = loader.load()
        if raw_document:
            # get_document_title returns one title per document
            titles = get_document_title(raw_document)
            raw_document[0].metadata["title"] = titles[0] if titles else ""
            raw_document[0].metadata["source"] = str(raw_document[0].metadata["source"])
            docs.extend(raw_document)

    # Cache the parsed documents so later runs can skip parsing
    with open(docs_path, "wb") as fh:
        pickle.dump(docs, fh)

2699it [03:03, 14.72it/s]
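
The caching idea described above can be factored into a small reusable function. The sketch below is one way to do it; `load_or_build` and `expensive_build` are hypothetical names, and the demo uses a temporary directory rather than the real `docs.pkl`.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Load a pickled result if the cache file exists, else build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = build_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result


build_calls = []


def expensive_build():
    # Stand-in for the slow HTML parsing loop
    build_calls.append(1)
    return ["doc-a", "doc-b"]


with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "docs.pkl"
    first = load_or_build(cache_file, expensive_build)
    second = load_or_build(cache_file, expensive_build)

print(first == second, len(build_calls))  # True 1
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.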


In [6]:
len(docs)

2699

### Generating embeddings from our docs

Now that the raw documents are loaded, we split them into overlapping chunks and set up the embedding model that will turn each chunk into a vector:

In [15]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
documents = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
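
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width sliding window. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraph breaks); `sliding_chunks` is a hypothetical helper used only to show how overlap repeats text between neighbouring chunks.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks whose edges overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous chunk
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(chunks)  # ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, which is the role the 200-character overlap plays in the cell above: a sentence cut at a chunk boundary still appears whole in one of the two chunks.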

### Storing

Let's connect to LanceDB Cloud to store our documents. It requires no setup!

In [9]:
vectorstore = LanceDB(
    embedding=embeddings,
    uri="db://test",  # your remote database URI
    api_key="sk_...",
    region="us-east-x-xxx",  # the cloud region you have configured
    table_name="langchain_vectorstore",  # Optional, defaults to "vectors"
    mode="overwrite", # Optional, defaults to "overwrite"
)

doc_ids = vectorstore.add_documents(documents=documents)
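
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose vectors are closest to the query's vector. The pure-Python sketch below illustrates that idea with made-up two-dimensional vectors and a brute-force cosine-similarity scan. It is not LanceDB's implementation, and the names `cosine_similarity` and `top_k` are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, rows, k=2):
    """Return the k rows whose vectors are most similar to query_vector."""
    ranked = sorted(
        rows,
        key=lambda row: cosine_similarity(query_vector, row["vector"]),
        reverse=True,
    )
    return ranked[:k]


# Toy "table": real embeddings have hundreds or thousands of dimensions
rows = [
    {"text": "numpy.linalg.norm", "vector": [1.0, 0.0]},
    {"text": "numpy.fft.fft", "vector": [0.0, 1.0]},
    {"text": "numpy.linalg.inv", "vector": [0.7, 0.3]},
]
hits = top_k([1.0, 0.05], rows, k=2)
print([row["text"] for row in hits])  # ['numpy.linalg.norm', 'numpy.linalg.inv']
```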

Now let's create our RetrievalQA chain using the LanceDB vector store:

In [10]:
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=vectorstore.as_retriever()
)
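
The `"stuff"` chain type places all retrieved chunks into a single prompt alongside the question. The sketch below shows that idea in plain Python; the template wording and the `build_stuff_prompt` helper are assumptions for illustration, not LangChain's actual prompt.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate every retrieved chunk into one prompt, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "What does numpy.linalg.norm compute?",
    ["numpy.linalg.norm returns a matrix or vector norm.", "See also numpy.linalg.det."],
)
print(prompt)
```

Because every chunk ends up in one prompt, this approach only works while the combined chunks fit within the model's context window.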

And that's it! We're all set up. The next step is to run some queries. Let's try a few:

### Querying

In [11]:
query = "tell me about the numpy library?"
qa.invoke(query)

{'query': 'tell me about the numpy library?',
 'result': ' The NumPy library is a Python library that provides multidimensional array and matrix data structures. It is used for efficient mathematical operations on arrays and matrices, and it also offers a wide range of high-level mathematical functions. It is a fundamental library for scientific computing in Python and is used extensively in many other data science and scientific packages such as Pandas, SciPy, and OpenCV. It is known for its speed and performance, and is used by everyone from beginners to experienced researchers in various fields of science and engineering.'}

In [12]:
query = "What's the current version of numpy?"
qa.invoke(query)

{'query': "What's the current version of numpy?",
 'result': ' The current version of NumPy is 1.21.6, according to the context provided.'}

In [13]:
query = "What kind of linear algebra related operations can be done in numpy?"
qa.invoke(query)

{'query': 'What kind of linear algebra related operations can be done in numpy?',
 'result': ' Numpy provides a variety of linear algebra related operations, including decompositions, matrix eigenvalues, norms and other numbers, solving equations and inverting matrices, and linear algebra on several matrices at once.'}
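
As the outputs above show, `qa.invoke` returns a dictionary with `query` and `result` keys, and the result text starts with a space. A small hypothetical helper such as `format_answer` can tidy this for display; the sample dictionary below copies one of the responses shown above.

```python
def format_answer(response):
    """Render a RetrievalQA-style response dict as a two-line Q/A string."""
    return f"Q: {response['query']}\nA: {response['result'].strip()}"


sample = {
    "query": "What's the current version of numpy?",
    "result": " The current version of NumPy is 1.21.6, according to the context provided.",
}
formatted = format_answer(sample)
print(formatted)
```

In the notebook, `print(format_answer(qa.invoke(query)))` would display a real response the same way.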

Thanks