# Building a Q&A Assistant Using MongoDB and OpenAI

## Introduction

This notebook is designed to demonstrate how to implement a document Question-and-Answer (Q&A) task using SuperDuperDB in conjunction with OpenAI and MongoDB. It provides a step-by-step guide and explanation of each component involved in the process.

Implementing a document Question-and-Answer (Q&A) system using SuperDuperDB, OpenAI, and MongoDB can find applications in various real-life scenarios:

1. **Customer Support Chatbots:** Enable a chatbot to answer customer queries by extracting information from documents, manuals, or knowledge bases stored in MongoDB or any other database supported by SuperDuperDB.

2. **Legal Document Analysis:** Facilitate legal professionals in quickly extracting relevant information from legal documents, statutes, and case laws, improving efficiency in legal research.

3. **Medical Data Retrieval:** Assist healthcare professionals in obtaining specific information from medical documents, research papers, and patient records for quick reference during diagnosis and treatment.

4. **Educational Content Assistance:** Enhance educational platforms by enabling students to ask questions related to course materials stored in a MongoDB database, providing instant and accurate responses.

5. **Technical Documentation Search:** Support software developers and IT professionals in quickly finding solutions to technical problems by querying documentation and code snippets stored in MongoDB or any other database supported by SuperDuperDB. This is the use case this notebook implements.

6. **HR Document Queries:** Simplify HR processes by allowing employees to ask questions about company policies, benefits, and procedures, with answers extracted from HR documents stored in MongoDB or any other database supported by SuperDuperDB.

7. **Research Paper Summarization:** Enable researchers to pose questions about specific topics, automatically extracting relevant information from a MongoDB repository of research papers to generate concise summaries.

8. **News Article Information Retrieval:** Empower users to inquire about specific details or background information from a database of news articles stored in MongoDB or any other database supported by SuperDuperDB, enhancing their understanding of current events.

9. **Product Information Queries:** Improve e-commerce platforms by allowing users to ask questions about product specifications, reviews, and usage instructions stored in a MongoDB database.

By implementing a document Q&A system with SuperDuperDB, OpenAI, and MongoDB, these use cases demonstrate the versatility and practicality of such a solution across different industries and domains.

All of this is possible with zero friction using SuperDuperDB. Now, back to the notebook.

## Prerequisites

Before starting the implementation, make sure you have the required libraries installed by running the following commands:

In [None]:
!pip install pinnacledb
!pip install ipython openai==1.1.2 numpy==1.24.4

Additionally, ensure that you have set your OpenAI API key as an environment variable. You can uncomment the following code and add your API key:

In [1]:
import os

# Load env variables
from dotenv import load_dotenv
load_dotenv()

# Or set your OPENAI_API_KEY directly
#os.environ['OPENAI_API_KEY'] = 'sk-...'

if 'OPENAI_API_KEY' not in os.environ:
    raise Exception('Environment variable "OPENAI_API_KEY" not set')
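
When more than one variable is required, the same check can be wrapped in a small helper. The following is a minimal sketch; `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration and are not part of `pinnacledb` or `openai`.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f'Environment variable "{name}" is not set')
    return value


# Throwaway variable so the snippet runs without a real API key.
os.environ['DEMO_API_KEY'] = 'sk-demo'
print(require_env('DEMO_API_KEY')[:3] + '...')
```

Printing only a prefix of the value avoids leaking a real key into notebook output.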

## Connect to datastore 

First, we need to establish a connection to a MongoDB datastore via SuperDuperDB. You can set the `MONGODB_URI` environment variable to match your specific setup.
Here are some examples of MongoDB URIs:

* For testing (default connection): `mongomock://test`
* Local MongoDB instance: `mongodb://localhost:27017`
* MongoDB with authentication: `mongodb://pinnacle:pinnacle@mongodb:27017/documents`
* MongoDB Atlas: `mongodb+srv://<username>:<password>@<atlas_cluster>/<database>`

In [2]:
from pinnacledb import pinnacle
from pinnacledb.backends.mongodb import Collection
import os

mongodb_uri = os.getenv("MONGODB_URI", "mongomock://test")

# SuperDuperDB now wraps your MongoDB database:
# `db` is the entry point for all data and model operations below
db = pinnacle(
    mongodb_uri,
    artifact_store='filesystem://./data/',
    downloads__folder='./data',
    cluster__vector_search=mongodb_uri,
)

collection = Collection('questiondocs')

2024-Jan-25 18:33:09.44| INFO     | Duncans-MBP.fritz.box| 9d62ba71-b922-4bff-915a-4dfe51d67422| pinnacledb.base.build:60   | Data Client is ready. MongoClient(host=['cluster0-shard-00-01.j28qm.mongodb.net:27017', 'cluster0-shard-00-02.j28qm.mongodb.net:27017', 'cluster0-shard-00-00.j28qm.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retrywrites=True, w='majority', authsource='admin', replicaset='atlas-os13l9-shard-0', tls=True, serverselectiontimeoutms=5000)
2024-Jan-25 18:33:09.45| INFO     | Duncans-MBP.fritz.box| 9d62ba71-b922-4bff-915a-4dfe51d67422| pinnacledb.base.build:35   | Connecting to Metadata Client with engine:  MongoClient(host=['cluster0-shard-00-01.j28qm.mongodb.net:27017', 'cluster0-shard-00-02.j28qm.mongodb.net:27017', 'cluster0-shard-00-00.j28qm.mongodb.net:27017'], document_class=dict, tz_aware=False, connect=True, retryw
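
The URI examples listed earlier differ mainly in their scheme. As a rough illustration, the sketch below classifies a URI by scheme using only the standard library; `describe_uri` is a hypothetical helper and the descriptions are informal labels, not anything `pinnacledb` reports.

```python
import os
from urllib.parse import urlparse


def describe_uri(uri: str) -> str:
    """Classify a connection URI by its scheme."""
    scheme = urlparse(uri).scheme
    if scheme == 'mongomock':
        return 'in-memory mock (testing)'
    if scheme == 'mongodb':
        return 'standard MongoDB connection'
    if scheme == 'mongodb+srv':
        return 'DNS seed list connection (e.g. MongoDB Atlas)'
    return 'unknown scheme'


# Same fallback as the notebook: use the mock backend when MONGODB_URI is unset.
uri = os.getenv('MONGODB_URI', 'mongomock://test')
print(describe_uri(uri))
```

A check like this can be handy before connecting, for example to avoid accidentally running a destructive demo against a production cluster.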

## Load Dataset

In this example, we use the internal textual data from the `pinnacledb` project's API documentation. The objective is to create a chatbot that can offer information about the project. You can either load the data from your local project or use the provided data.

If you have the SuperDuperDB project locally and want to load the latest version of the API documentation, run the following cell:

In [3]:
import glob
import re

ROOT = '../docs/hr/content/docs/'

STRIDE = 3       # stride in numbers of lines
WINDOW = 25       # length of window in numbers of lines

files = sorted(glob.glob(f'{ROOT}/**/*.md', recursive=True))

def get_chunk_link(chunk, file_name):
    # Get the original link of the chunk
    file_link = file_name[:-3].replace(ROOT, 'https://docs.pinnacledb.com/docs/docs/')
    # If the chunk has subtitles, the link to the first subtitle will be used first.
    first_title = (re.findall(r'(^|\n)## (.*?)\n', chunk) or [(None, None)])[0][1]
    if first_title:
        # Convert subtitles and splice URLs
        first_title = first_title.lower()
        first_title = re.sub(r'[^a-zA-Z0-9]', '-', first_title)
        file_link = file_link + '#' + first_title
    return file_link

def create_chunk_and_links(file, file_prefix=ROOT):
    with open(file, 'r') as f:
        lines = f.readlines()
    if len(lines) > WINDOW:
        chunks = ['\n'.join(lines[i: i + WINDOW]) for i in range(0, len(lines), STRIDE)]
    else:
        chunks = ['\n'.join(lines)]
    return [{'txt': chunk, 'link': get_chunk_link(chunk, file)}  for chunk in chunks]


all_chunks_and_links = sum([create_chunk_and_links(file) for file in files], [])
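
To see what `STRIDE` and `WINDOW` do, here is a minimal, self-contained sketch of the same sliding-window idea on a toy list. `sliding_windows` is a hypothetical name, and the small `window=4`, `stride=3` values are chosen only to keep the output readable.

```python
def sliding_windows(lines, window, stride):
    """Return overlapping chunks of `window` lines, starting every `stride` lines."""
    if len(lines) <= window:
        return [lines]
    return [lines[i:i + window] for i in range(0, len(lines), stride)]


lines = [f'line {i}' for i in range(10)]
chunks = sliding_windows(lines, window=4, stride=3)
for chunk in chunks:
    print(chunk[0], '->', chunk[-1], f'({len(chunk)} lines)')
```

Consecutive chunks overlap by `window - stride` lines, and the chunks near the end of the file get shorter. With the notebook's values (`WINDOW = 25`, `STRIDE = 3`), neighbouring chunks share 22 lines, which is why vector-search results over these chunks often look very similar to one another.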

Otherwise, you can load the data from an external source. The text chunks include code snippets and explanations, which will be utilized to construct the document Q&A chatbot.

In [None]:
# Use !curl to download the 'pinnacledb_docs.json' file
!curl -O https://datas-public.s3.amazonaws.com/pinnacledb_docs.json

import json
from IPython.display import Markdown

# Open the downloaded JSON file and load its contents into 'all_chunks_and_links'
with open('pinnacledb_docs.json') as f:
    all_chunks_and_links = json.load(f)
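
Before inserting downloaded data, it can be worth confirming that every record has the shape the rest of the notebook relies on (a `txt` field and a `link` field). The sketch below assumes that shape; `check_records` and the sample records are hypothetical.

```python
def check_records(records):
    """Verify that every record is a dict with string 'txt' and 'link' fields."""
    for position, record in enumerate(records):
        for key in ('txt', 'link'):
            if not isinstance(record.get(key), str):
                raise ValueError(f'record {position} has no string field {key!r}')
    return len(records)


# Hypothetical sample in the same shape as the notebook's chunks.
sample = [
    {'txt': '# Vector-search', 'link': 'https://example.com/docs/vector_search'},
    {'txt': '## Philosophy', 'link': 'https://example.com/docs/vector_search#philosophy'},
]
print(check_records(sample), 'records look valid')
```

In the notebook, the same function could be called on `all_chunks_and_links` right after loading.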

View the chunk content:

In [4]:
from IPython.display import Markdown

# 'all_chunks_and_links' is a list of dicts with 'txt' (markdown) and 'link' fields
chunk_and_link = all_chunks_and_links[48]
print(chunk_and_link['link'])
Markdown(chunk_and_link['txt'])

https://docs.pinnacledb.com/docs/docs/ai_integrations/llm


```python

from pinnacledb.ext.llm import VllmModel

model = VllmModel(model_name="mistralai/Mistral-7B-Instruct-v0.2")

```



**Load to a Ray Cluster**



Requires installing `ray`, no need for `vllm` dependencies.



> Installing `vllm` requires a CUDA environment, which can prevent clients without CUDA from installing `vllm`. Therefore, pinnacledb has adapted so that if loading to a ray cluster, local installation of `vllm` is not required.



```bash

pip install 'ray[default]'

```



```python

from pinnacledb.ext.llm import VllmModel

model = VllmModel(model_name="mistralai/Mistral-7B-Instruct-v0.2", ray_address="ray://ray_cluster_ip:10001")

```



> If this is your first time running on that ray cluster, the wait time might be a bit longer, as `vllm` dependencies and the corresponding model will be installed on the ray cluster's server.



**Parameter**



- model_name: Same as `model` of vLLM


The chunks of text contain both code snippets and explanations, making them valuable for constructing a document Q&A chatbot. The combination of code and explanations enables the chatbot to provide comprehensive and context-aware responses to user queries.

Next, we insert the data. The `Document` wrapper allows `pinnacledb` to handle records with special data types such as images, video, and custom data types.

In [5]:
from pinnacledb import Document

# Insert multiple documents into the collection
insert_ids = db.execute(collection.insert_many([Document(chunk_and_link) for chunk_and_link in all_chunks_and_links]))
print(insert_ids[:5])

2024-Jan-25 18:36:07.70| INFO     | Duncans-MBP.fritz.box| 9d62ba71-b922-4bff-915a-4dfe51d67422| pinnacledb.backends.local.compute:32   | Submitting job. function:<function callable_job at 0x12c290680>
2024-Jan-25 18:36:08.82| SUCCESS  | Duncans-MBP.fritz.box| 9d62ba71-b922-4bff-915a-4dfe51d67422| pinnacledb.backends.local.compute:38   | Job submitted.  function:<function callable_job at 0x12c290680> future:82b3a62b-d36a-4ba2-bffa-4d0e0ebab85d
([ObjectId('65b29c06125201bc8461be50'), ObjectId('65b29c06125201bc8461be51'), ObjectId('65b29c06125201bc8461be52'), ObjectId('65b29c06125201bc8461be53'), ObjectId('65b29c06125201bc8461be54'), ObjectId('65b29c06125201bc8461be55'), ObjectId('65b29c06125201bc8461be56'), ObjectId('65b29c06125201bc8461be57'), ObjectId('65b29c06125201bc8461be58'), ObjectId('65b29c06125201bc8461be59'), ObjectId('65b29c06125201bc8461be5a')

## Create a Vector-Search Index

To enable question-answering over your documents, set up a standard `pinnacledb` vector-search index using `openai` (other options include `torch`, `sentence_transformers`, `transformers`, etc.).

A `Model` is a wrapper around a self-built or ecosystem model, such as `torch`, `transformers`, `openai`.

In [6]:
from pinnacledb.ext.openai import OpenAIEmbedding

# Create an instance of the OpenAIEmbedding model with the specified identifier ('text-embedding-ada-002')
model = OpenAIEmbedding(identifier='text-embedding-ada-002')

In [8]:
vector = model.predict('This is a test', one=True)
print('vector size: ', len(vector))
vector

vector size:  1536


[-0.008059182204306126,
 -0.003603511257097125,
 -0.000528058095369488,
 -0.005753727629780769,
 -0.024468205869197845,
 0.016131576150655746,
 -0.014929304830729961,
 -0.004634029697626829,
 -0.0009636337636038661,
 -0.03445630520582199,
 0.015920188277959824,
 0.01726778782904148,
 -0.008997217752039433,
 0.0022311382927000523,
 0.008713165298104286,
 1.3005340406380128e-05,
 0.02448141761124134,
 0.0005771893775090575,
 0.008336629718542099,
 -0.007444834802299738,
 0.005446553695946932,
 0.0075637404806911945,
 -0.011547090485692024,
 0.02483813464641571,
 -0.028352467343211174,
 -0.02319987490773201,
 0.0035044229589402676,
 -0.03522258996963501,
 0.019421307370066643,
 -0.009941860102117062,
 0.021878696978092194,
 -0.0173470601439476,
 0.001747257076203823,
 -0.0363323800265789,
 0.0007807332440279424,
 -0.012676697224378586,
 -0.010609054937958717,
 -0.01729421131312847,
 0.00801954697817564,
 -0.010886501520872116,
 0.009162365458905697,
 0.016686471179127693,
 0.0071475696749
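
Embedding vectors like this one are useful because texts with similar meaning map to nearby vectors. Cosine similarity is one common way to measure that closeness; which measure a given index actually uses depends on its configuration, so treat the following as an illustrative sketch with toy 3-dimensional vectors rather than a description of `pinnacledb` internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for 1536-dimensional embeddings.
query = [1.0, 0.0, 1.0]
close = [0.9, 0.1, 1.1]
far = [-1.0, 1.0, 0.0]
print(cosine_similarity(query, close) > cosine_similarity(query, far))
```

A value near 1 means the vectors point in almost the same direction; values near 0 or below indicate unrelated or opposing directions.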

A `Listener` essentially deploys a `Model` to "listen" to incoming data, computes outputs, and then saves the results in the database via `db`.

In [9]:
# Import the Listener class from the pinnacledb module
from pinnacledb import Listener


# Create a Listener instance with the specified model, key, and selection criteria
listener = Listener(
    model=model,          # The model to be used for listening
    key='txt',            # The key field in the documents to be processed by the model
    select=collection.find()  # The selection criteria for the documents
)
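
Conceptually, a listener pairs a model with a field name and reacts to new records. The toy class below illustrates only that idea; `ToyListener` and `on_insert` are hypothetical names, and this is not how `pinnacledb` implements `Listener`.

```python
class ToyListener:
    """Conceptual stand-in: apply `model` to `key` of each incoming record."""

    def __init__(self, model, key):
        self.model = model
        self.key = key
        self.outputs = {}  # record id -> model output

    def on_insert(self, records):
        # Compute and store an output for every newly inserted record.
        for record in records:
            self.outputs[record['_id']] = self.model(record[self.key])


# `len` plays the role of the model; a real listener would compute embeddings.
listener_demo = ToyListener(model=len, key='txt')
listener_demo.on_insert([{'_id': 1, 'txt': 'hello'}, {'_id': 2, 'txt': 'vector search'}])
print(listener_demo.outputs)
```

The point of the pattern is that outputs stay in sync with the data: whenever records arrive, the model runs on them and the results are stored alongside.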

A `VectorIndex` wraps a `Listener`, allowing its outputs to be searchable.

In [10]:
# Import the VectorIndex class from the pinnacledb module
from pinnacledb import VectorIndex

# Add a VectorIndex to the SuperDuperDB database with the specified identifier and indexing listener
_ = db.add(
    VectorIndex(
        identifier='my-index',        # Unique identifier for the VectorIndex
        indexing_listener=listener    # Listener to be used for indexing documents
    )
)

1076it [00:00, 3190.61it/s]
100%|██████████| 11/11 [00:14<00:00,  1.30s/it]


2024-Jan-25 18:39:03.79| INFO     | Duncans-MBP.fritz.box| 9d62ba71-b922-4bff-915a-4dfe51d67422| pinnacledb.components.model:477  | Adding 1076 model outputs to `db`
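
Once every document has a stored vector, a search amounts to ranking those vectors against the query vector and keeping the best `n`. The brute-force sketch below shows that ranking on toy 2-dimensional data; `top_n` and the document ids are hypothetical, and production indexes typically use faster approximate methods.

```python
import math


def top_n(query, stored, n):
    """Return the ids of the `n` stored vectors most similar to `query`."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    # Sort ids by similarity to the query, most similar first.
    ranked = sorted(stored, key=lambda _id: cosine(query, stored[_id]), reverse=True)
    return ranked[:n]


stored_vectors = {
    'doc-a': [1.0, 0.0],
    'doc-b': [0.7, 0.7],
    'doc-c': [0.0, 1.0],
}
print(top_n([1.0, 0.1], stored_vectors, n=2))
```

The `n` argument here plays the same role as the `n` parameter passed to `.like(...)` in the notebook's queries.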


In [11]:
# Execute a find_one operation on the SuperDuperDB collection
document = db.execute(collection.find_one())
document.content['txt']

'# Anthropic\n\n\n\n`pinnacledb` allows users to work with `anthropic` API models.\n\n\n\nRead more about this [here](/docs/docs/walkthrough/ai_models#anthropic).'

In [13]:
from pinnacledb.backends.mongodb import Collection
from pinnacledb import Document as D
from IPython.display import Markdown, display

# Define the query for the search
# query = 'Code snippet how to create a `VectorIndex` with a torchvision model'
query = 'can you explain vector-indexes with `pinnacledb`?'

# Run a vector search to find the 5 documents whose 'txt' is most similar to the query
result = db.execute(
    collection
        .like(D({'txt': query}), vector_index='my-index', n=5)
        .find()
)

# Display a horizontal rule to separate results
display(Markdown('---'))

# Display each document's 'txt' field and separate them with a horizontal rule
for r in result:
    display(Markdown(r['txt']))
    display(r['link'])
    display(Markdown('---'))

---



<details>

<summary>Is SuperDuperDB a vector-database?</summary>



No, SuperDuperDB is not a vector-database. It is a versatile Python framework that excels in bringing AI into your favorite database.

</details>



<details>

<summary>Is plugging `pinnacledb` directly into a database secure? What precautions are in place, and can I restrict access to specific tables, such as a users table?</summary>



To adhere to the principle of least privilege, SuperDuperDB requires read-only access to the tables you intend to `index`.



One option is maintaining your database as read-only and storing the index externally, such as on your filesystem. Alternatively, you can establish a new table dedicated to housing the index (e.g pinnacle_index). In this case, the requisite step would be granting us write access to that specific table.



For enhanced security, consider creating a new user specifically for SuperDuperDB. Grant this user read-only access to your data tables and read-write access exclusively to the `pinnacle_index` table.



If you value privacy as well, we recommend engaging in a more in-depth discussion within the project's Slack channel: [SuperDuperDB Slack](https://join.slack.com/t/pinnacledb/shared_invite/zt-1zuojj0k0-RjAYBs1TDsvEa7yaFGa6QA).



</details>







<details>

<summary> What benefits does pinnacledb offer for training AI models (such as Classification) compared to conventional methods? Are there any fundamental distinctions? </summary>




'https://docs.pinnacledb.com/docs/docs/faq'

---



No, SuperDuperDB is not a vector-database. It is a versatile Python framework that excels in bringing AI into your favorite database.

</details>



<details>

<summary>Is plugging `pinnacledb` directly into a database secure? What precautions are in place, and can I restrict access to specific tables, such as a users table?</summary>



To adhere to the principle of least privilege, SuperDuperDB requires read-only access to the tables you intend to `index`.



One option is maintaining your database as read-only and storing the index externally, such as on your filesystem. Alternatively, you can establish a new table dedicated to housing the index (e.g pinnacle_index). In this case, the requisite step would be granting us write access to that specific table.



For enhanced security, consider creating a new user specifically for SuperDuperDB. Grant this user read-only access to your data tables and read-write access exclusively to the `pinnacle_index` table.



If you value privacy as well, we recommend engaging in a more in-depth discussion within the project's Slack channel: [SuperDuperDB Slack](https://join.slack.com/t/pinnacledb/shared_invite/zt-1zuojj0k0-RjAYBs1TDsvEa7yaFGa6QA).



</details>







<details>

<summary> What benefits does pinnacledb offer for training AI models (such as Classification) compared to conventional methods? Are there any fundamental distinctions? </summary>



While the underlying algorithm for training remains unchanged, the key distinction lies in the enhanced connectivity between your model and the data stores. 



This improved connectivity enables users to easily customize their models for different data subsets, offering flexibility in model development.


'https://docs.pinnacledb.com/docs/docs/faq'

---

---

sidebar_position: 7

---



# Vector-search



SuperDuperDB allows users to implement vector-search in their database by either 

using in-database functionality, or via a sidecar implementation with `lance` and `FastAPI`.



## Philosophy



In `pinnacledb`, from a user point-of-view vector-search isn't a completely different beast than other ways of 

using the system:



- The vector-preparation is exactly the same as preparing outputs with any model, 

  with the special difference that the outputs are vectors, arrays or tensors.

- Vector-searches are just another type of database query which happen to use 

  the stored vectors.



## Algorithm



Here is a schematic of how vector-search works:



![](/img/vector-search.png)




'https://docs.pinnacledb.com/docs/docs/fundamentals/vector_search_algorithm#philosophy'

---



# Vector-search



SuperDuperDB allows users to implement vector-search in their database by either 

using in-database functionality, or via a sidecar implementation with `lance` and `FastAPI`.



## Philosophy



In `pinnacledb`, from a user point-of-view vector-search isn't a completely different beast than other ways of 

using the system:



- The vector-preparation is exactly the same as preparing outputs with any model, 

  with the special difference that the outputs are vectors, arrays or tensors.

- Vector-searches are just another type of database query which happen to use 

  the stored vectors.



## Algorithm



Here is a schematic of how vector-search works:



![](/img/vector-search.png)



## Explanation



A vector-search query has the schematic form:


'https://docs.pinnacledb.com/docs/docs/fundamentals/vector_search_algorithm#philosophy'

---

SuperDuperDB allows users to implement vector-search in their database by either 

using in-database functionality, or via a sidecar implementation with `lance` and `FastAPI`.



## Philosophy



In `pinnacledb`, from a user point-of-view vector-search isn't a completely different beast than other ways of 

using the system:



- The vector-preparation is exactly the same as preparing outputs with any model, 

  with the special difference that the outputs are vectors, arrays or tensors.

- Vector-searches are just another type of database query which happen to use 

  the stored vectors.



## Algorithm



Here is a schematic of how vector-search works:



![](/img/vector-search.png)



## Explanation



A vector-search query has the schematic form:



```python

table_or_collection


'https://docs.pinnacledb.com/docs/docs/fundamentals/vector_search_algorithm#philosophy'

---

## Create a Chat-Completion Component

In this step, a chat-completion component is created and added to the system. This component is essential for the Q&A functionality:

In [14]:
# Import the OpenAIChatCompletion class from the pinnacledb.ext.openai module
from pinnacledb.ext.openai import OpenAIChatCompletion

# Define the prompt for the OpenAIChatCompletion model
prompt = (
    'Use the following description and code snippets about SuperDuperDB to answer this question about SuperDuperDB\n'
    'Do not use any other information you might have learned about other python packages\n'
    'Only base your answer on the code snippets retrieved and provide a very concise answer\n'
    '{context}\n\n'
    'Here\'s the question:\n'
)

# Create an instance of OpenAIChatCompletion with the specified model and prompt
chat = OpenAIChatCompletion(identifier='gpt-3.5-turbo', prompt=prompt)

# Add the OpenAIChatCompletion instance
db.add(chat)

# Print information about the models in the SuperDuperDB database
print(db.show('model'))

['gpt-3.5-turbo', 'text-embedding-ada-002']
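
The `{context}` placeholder in the prompt is where retrieved chunks end up at query time. How `pinnacledb` assembles the final prompt internally is not shown in this notebook, so the sketch below is only an assumption-laden illustration of the general pattern; `build_prompt` is a hypothetical helper and the template is a shortened version of the prompt above.

```python
prompt_template = (
    'Use the following description and code snippets to answer the question\n'
    '{context}\n\n'
    "Here's the question:\n"
)


def build_prompt(template, chunks, question):
    """Join retrieved chunks into the {context} slot and append the question."""
    context = '\n\n'.join(chunks)
    return template.format(context=context) + question


retrieved = ['Chunk about vector search.', 'Chunk about listeners.']
print(build_prompt(prompt_template, retrieved, 'What is a VectorIndex?'))
```

Placing the question after the context keeps the instruction, the evidence, and the query clearly separated for the chat model.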


## Ask Questions to Your Docs

Finally, you can ask questions about the documents. You can target specific queries and use the power of MongoDB for vector-search and filtering rules. Here's an example of asking a question:

In [16]:
from pinnacledb import Document
from IPython.display import Markdown

# Define the search parameters
search_term = 'Tell me about pinnacledb'
# search_term = 'can you explain vector-indexes with `pinnacledb`?'

num_results = 5

# Use the SuperDuperDB model to generate a response based on the search term and context
output, sources = db.predict(
    model_name='gpt-3.5-turbo',
    input=search_term,
    context_select=(
        collection
            .like(Document({'txt': search_term}), vector_index='my-index', n=num_results)
            .find()
    ),
    context_key='txt',
)

# Get the reference links corresponding to the answer context
links = '\n'.join(sorted(set([source.unpack()['link'] for source in sources])))

# Display the generated response using Markdown
Markdown(output.content + f'\n\nrefs: \n\n{links}')

SuperDuperDB is an open-source framework that integrates databases with AI models, APIs, and vector search engines. It is not a database itself but transforms existing databases into intelligent systems. SuperDuperDB allows for streaming inference, scalable model training, and model chaining. It provides a simple but extendable interface, supports difficult data types, and enables vector search without the need for specialized vector databases. SuperDuperDB is licensed under Apache 2.0 and encourages developers to contribute to the project.

refs: 

https://docs.pinnacledb.com/docs/docs/get_started/quickstart
https://docs.pinnacledb.com/docs/docs/intro
https://docs.pinnacledb.com/docs/docs/intro#what-is-pinnacledb-
https://docs.pinnacledb.com/docs/docs/production/command_line_interface
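
The `#what-is-pinnacledb-` fragment in the references comes from the heading-to-anchor conversion in `get_chunk_link`, which lower-cases the first `##` heading of a chunk and replaces every non-alphanumeric character with a hyphen. The sketch below isolates that step; `heading_to_anchor` is a hypothetical name, and the example heading text is an assumption inferred from the fragment.

```python
import re


def heading_to_anchor(heading: str) -> str:
    """Lower-case a heading and replace every non-alphanumeric character with '-'."""
    return re.sub(r'[^a-zA-Z0-9]', '-', heading.lower())


# Assumed heading text; only the resulting fragment appears in the notebook output.
print(heading_to_anchor('What is pinnacledb?'))
```

Because punctuation is replaced rather than removed, a trailing question mark becomes a trailing hyphen in the link.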

## Now You Can Build an API as Well, Just Like We Did
### FastAPI Question-the-Docs App Tutorial
The following tutorial walks through setting up a basic FastAPI application that answers questions about documentation. It covers both local development and deployment to the Fly.io platform:
https://github.com/SuperDuperDB/chat-with-your-docs-backend

Reset the demo

In [None]:
db.drop(force=True)