# Code documentation Q&A bot example with LangChain
![picture](https://lancedb.github.io/lancedb/assets/ecosystem-illustration.png)

This Q&A bot will allow you to query your own documentation easily using questions. We'll also demonstrate the use of LangChain and LanceDB using the OpenAI API.

In this example we'll use the **NumPy 1.26** documentation, but you can replace it with your own docs as well.

### Credentials

Copy and paste the project name and the API key from your project page.
These will be used later to [connect to LanceDB Cloud](#scroll-to=5q8m6GMD7sGu).

In [1]:
project_slug = "your-project-slug"  # @param {type:"string"}

In [2]:
api_key = "sk_..."  # @param {type:"string"}

You can also set the LANCEDB_API_KEY as an environment variable. More details can be found <a href="https://github.com/lancedb/vectordb-recipes/tree/main/examples/RAG_Reranking/lancedb_cloud/README.md">**here**</a>.
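
As a minimal sketch of the environment-variable route, the snippet below prefers a key passed explicitly and falls back to `LANCEDB_API_KEY` otherwise. The helper name `resolve_api_key` is hypothetical and is not part of LanceDB; the key value is a placeholder.

```python
import os


def resolve_api_key(explicit_key=None, env_var="LANCEDB_API_KEY"):
    """Return the explicit key if given, else the value of the environment variable."""
    key = explicit_key or os.environ.get(env_var)
    if not key:
        raise ValueError(f"No API key provided and {env_var} is not set")
    return key


# Passing a key explicitly (placeholder value, not a real key)
resolved = resolve_api_key("sk_placeholder")
```

Calling `resolve_api_key()` with no argument would read the key from the environment instead, which keeps secrets out of the notebook itself.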

Since we will be using the OpenAI API, let's set the OpenAI API key as well.

In [None]:
openai_api_key = "sk-..."  # @param {type:"string"}

### Installing dependencies

In [None]:
! pip install -U langchain langchain-openai langchain-community

In [None]:
! pip install -qq tiktoken unstructured pandas lancedb

### Importing libraries

In [None]:
import openai
import os
import re
import pickle
import requests
import zipfile
from pathlib import Path

from langchain_community.document_loaders import UnstructuredHTMLLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import LanceDB
from langchain_openai import OpenAI, OpenAIEmbeddings
from langchain.chains import RetrievalQA

os.environ["OPENAI_API_KEY"] = openai_api_key
assert openai.models.list() is not None

### Get the data
To make this easier, we download the zipped NumPy HTML documentation directly. Once the docs are downloaded, we use LangChain's HTML document loader to parse them and store them in LanceDB as a vector store, along with relevant metadata.
By default we use the NumPy docs, but you can replace them with your own docs as well.

In [None]:
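# Download the zipped HTML docs and fail early on a bad HTTP response
response = requests.get("https://numpy.org/doc/1.26/numpy-html.zip")
response.raise_for_status()
with open("numpy-html.zip", "wb") as f:
    f.write(response.content)

# Extract the archive into the numpy_docs directory
with zipfile.ZipFile("numpy-html.zip") as archive:
    archive.extractall(path="numpy_docs")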
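
If you swap in your own docs, it can help to check which HTML files will be picked up before parsing them. The following is a small sketch; the helper name `list_html_files` is hypothetical, and it assumes the docs live under a local directory such as `numpy_docs`.

```python
from pathlib import Path


def list_html_files(root):
    """Return all .html files under root, sorted for a stable order."""
    root = Path(root)
    if not root.is_dir():
        # Nothing extracted yet (or wrong path): report no files instead of failing
        return []
    return sorted(p for p in root.rglob("*.html") if p.is_file())


html_files = list_html_files("numpy_docs")
print(f"Found {len(html_files)} HTML files")
```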

We'll create a simple **helper function** that extracts metadata so it can be used later when querying with filters. In this case, we want to keep the lineage of the URI or path for each document that has been processed:

In [None]:
# Helper that derives a title for each document from its source path


def get_document_title(document_list):
    """Return one title per document, taken from the path between 'numpy_docs' and '.html'."""
    titles = []
    for doc in document_list:
        # LangChain Document objects expose metadata as an attribute, not as a dict key
        source = str(doc.metadata.get("source", ""))
        match = re.search(r"numpy_docs(.*)\.html", source)
        titles.append(match.group(1) if match else "")
    return titles
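
To see what the title extraction yields, here is a standalone sketch that applies the same kind of regular expression to a single path. The function name `title_from_source` and the sample path are made up for illustration.

```python
import re


def title_from_source(source):
    """Extract the path between 'numpy_docs' and '.html', or '' if the pattern is absent."""
    match = re.search(r"numpy_docs(.*)\.html", str(source))
    return match.group(1) if match else ""


example = title_from_source("numpy_docs/reference/routines.linalg.html")
print(example)  # /reference/routines.linalg
```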

### Pre-processing and loading the documents

Next, let's pre-process and load the documents. To avoid repeating this work while updating code, we cache the parsed documents with pickle so they can be loaded again on later runs (parsing could take a few minutes the first time). We also add the title as extra metadata on each document:

*Note*: This step might take up to 10 minutes to run!

*Note*: If you run into an issue with the nltk package, try running
```
import nltk
nltk.download('punkt')
```
or try to manually install the [nltk_data](https://github.com/nltk/nltk_data/tree/gh-pages) package and unzip the **punkt tokenizer** zip and the **averaged_perceptron_tagger** zip file in the packages folder.

In [None]:
from tqdm import tqdm

docs = []
docs_path = Path("docs.pkl")

if docs_path.exists():
    # Reuse the cached documents from a previous run
    with open(docs_path, "rb") as fh:
        docs = pickle.load(fh)
else:
    for p in tqdm(Path("numpy_docs").rglob("*.html")):
        if p.is_dir():
            continue
        loader = UnstructuredHTMLLoader(str(p))
        raw_document = loader.load()
        if raw_document:
            # Attach the title and store the source path as a plain string
            titles = get_document_title(raw_document)
            raw_document[0].metadata["title"] = titles[0] if titles else ""
            raw_document[0].metadata["source"] = str(raw_document[0].metadata["source"])
            docs.extend(raw_document)

    # Cache the parsed documents so later runs can skip parsing
    with open(docs_path, "wb") as fh:
        pickle.dump(docs, fh)

len(docs)
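
The caching idea described for this step can be sketched in isolation as a generic load-or-build helper. The name `load_or_build` is hypothetical; it assumes the cached object can be pickled and that the cache file comes from a source you trust, since unpickling untrusted files is unsafe.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Load a pickled object from cache_path, or build it once and cache the result."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    result = build_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(result, fh)
    return result
```

With a function that parses all HTML files and returns the list of documents, the cell above would reduce to a single `load_or_build("docs.pkl", ...)` call. Delete the cache file whenever the source docs change.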

### Generating embeddings from our docs

Now that we have our raw documents loaded, we split them into overlapping chunks and set up the embedding model that will turn each chunk into a vector:

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
documents = text_splitter.split_documents(docs)
embeddings = OpenAIEmbeddings()
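
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified fixed-width sliding window. This is not how `RecursiveCharacterTextSplitter` works internally (it prefers to split on separators such as paragraph and line breaks); the function `sliding_chunks` is a hypothetical illustration of the overlap idea only.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.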

### Store data in LanceDB Cloud

Let's connect to LanceDB Cloud so we can store our documents. It requires no setup!

In [None]:
uri = "db://" + project_slug
table_name = "langchain_vectorstore"

vectorstore = LanceDB(
    embedding=embeddings,
    uri=uri,  # your remote database URI
    api_key=api_key,
    region="us-east-1",
    table_name=table_name,  # Optional, defaults to "vectors"
    mode="overwrite",  # Optional, defaults to "overwrite"
)

doc_ids = vectorstore.add_documents(documents=documents)
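
Conceptually, a vector store answers a query by returning the stored chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea in plain Python with made-up two-dimensional vectors and cosine similarity. Both are assumptions for illustration; the distance metric and index that LanceDB actually uses may differ, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k=2):
    """items is a list of (text, vector) pairs; return the k texts most similar to the query."""
    ranked = sorted(items, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy index with invented vectors
toy_index = [
    ("linear algebra routines", [0.9, 0.1]),
    ("random sampling", [0.1, 0.9]),
    ("matrix decompositions", [0.8, 0.3]),
]
nearest = top_k([1.0, 0.0], toy_index, k=2)
print(nearest)  # ['linear algebra routines', 'matrix decompositions']
```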

Now let's create our RetrievalQA chain using the LanceDB vector store:

In [None]:
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(), chain_type="stuff", retriever=vectorstore.as_retriever()
)
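
The `"stuff"` chain type places all retrieved chunks into a single prompt for the LLM. A rough sketch of that assembly step is shown below; the function `build_stuff_prompt` and the prompt wording are hypothetical and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, chunks):
    """Concatenate all retrieved chunks into one context block, followed by the question."""
    context = "\n\n".join(chunks)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt = build_stuff_prompt(
    "What does numpy.linalg provide?",
    ["Chunk about decompositions.", "Chunk about norms."],
)
print(prompt)
```

Because every chunk goes into one prompt, this approach is simple but limited by the model's context window, which is one reason the documents were split into chunks earlier.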

And that's it! We're all set up. The next step is to run some queries. Let's try a few:

### Query

In [None]:
query = "tell me about the numpy library?"
qa.invoke(query)

{'query': 'tell me about the numpy library?',
 'result': ' The NumPy library is an open source Python library that is used for working with numerical data in Python. It contains multidimensional array and matrix data structures, and provides methods for efficient operations on these arrays. It is widely used in various fields of science and engineering and is a core component of the scientific Python and PyData ecosystems. It also offers a large library of high-level mathematical functions for working with arrays and matrices. '}

In [None]:
query = "What's the current version of numpy?"
qa.invoke(query)

{'query': "What's the current version of numpy?",
 'result': '\nThe current version of numpy is 1.16.4.'}

In [None]:
query = "What kind of linear algebra related operations can be done in numpy?"
qa.invoke(query)

{'query': 'What kind of linear algebra related operations can be done in numpy?',
 'result': ' The numpy package provides various operations related to linear algebra, such as decompositions, matrix eigenvalues, norms, solving equations and inverting matrices, and performing linear algebra on several matrices at once. It also has support for logic functions, masked array operations, mathematical functions, matrix library, miscellaneous routines, padding arrays, polynomials, random sampling, set routines, sorting, searching, counting, statistics, and window functions.'}
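
As the outputs above show, each response is a dictionary with `query` and `result` keys, and the result text can carry leading whitespace. A small sketch for pulling out a clean answer follows; the helper name `extract_answer` is hypothetical, and the sample dictionary copies one of the responses shown above.

```python
def extract_answer(response):
    """Return the answer text from a response dict, without surrounding whitespace."""
    return response.get("result", "").strip()


sample = {
    "query": "What's the current version of numpy?",
    "result": "\nThe current version of numpy is 1.16.4.",
}
answer = extract_answer(sample)
print(answer)  # The current version of numpy is 1.16.4.
```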