# Semantic Search of one's own data with OpenAI Embedding Model and MongoDB Atlas Vector Search

It is often a valuable exercise, when developing and documenting, to consider [User Stories](https://www.atlassian.com/agile/project-management/user-stories). We have a number of different personas interested in the ChatGPT Retrieval Plugin.

1. The End User, who wishes to extract information from her organization's or personal data.
2. The Data Scientist, who curates the data.
3. The Application Engineer, who sets up and maintains the application.

### Application Setup

**The Application Engineer** has a number of tasks to complete in order to provide service to her two users.

1. Set up the DataStore.


    * Create a MongoDB Atlas cluster.
    * Add a Vector Search Index to it.<br><br>

    Begin by following the detailed steps in **[setup.md](https://github.com/caseyclements/chatgpt-retrieval-plugin/blob/mongodb/docs/providers/mongodb_atlas/setup.md)**.
    Once completed, you will have a running Cluster, with a Database, a Collection, and a Vector Search Index attached to it.

    You will also have a number of required environment variables. These need to be available to run this example.
    We will check for them below, and suggest how to set them up with an `.env` file if that is your preference.

  
2. Create and Serve the ChatGPT Retrieval Plugin.
    * Provide an API for the Data Scientist to insert, update, and delete data.
    * Provide an API for the End User to query the data using natural language.<br><br>
    
    Start the service in another terminal as described in the repo's **[QuickStart](https://github.com/openai/chatgpt-retrieval-plugin#quickstart)**. 

   **IMPORTANT** Make sure the environment variables are set in that terminal before running `poetry run start`.

### Application Usage

This notebook tells a story of a **Data Scientist** and an **End User** as they interact with the service.


We begin by collecting and filtering an example dataset, the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).
We upsert the data into a MongoDB Collection via the `upsert` endpoint of the Plugin API. 
Upon doing this, Atlas begins to automatically index the data in preparation for Semantic Search. 

We close by asking a question of the data, searching not for a particular text string, but using common language.



## 1) Application Engineering

Of course, we cannot begin until we test that our environment is set up.

### Check environment variables


In [None]:
!pwd

In [None]:
!which python

In [1]:
import os
required_vars = {'BEARER_TOKEN', 'OPENAI_API_KEY', 'DATASTORE', 'EMBEDDING_DIMENSION', 'EMBEDDING_MODEL',
                 'MONGODB_COLLECTION', 'MONGODB_DATABASE', 'MONGODB_INDEX', 'MONGODB_URI'}
assert os.environ["DATASTORE"] == 'mongodb-atlas'
missing = required_vars - set(os.environ)
if missing:
    print(f"It is strongly recommended to set these additional environment variables. {missing=}")

In [2]:
# If you keep the environment variables in a .env file, like .env.example, do this:
if missing:
    from dotenv import dotenv_values
    from pathlib import Path
    import os
    config = dotenv_values(Path('../.env'))
    # Skip keys without a value: os.environ only accepts strings
    os.environ.update({k: v for k, v in config.items() if v is not None})
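
After loading the `.env` file, it can be worth repeating the check, since the first cell only reported what was missing at that moment. A minimal sketch, assuming the same variable names as above; `find_missing` is an illustrative helper of ours, not part of the plugin:

```python
import os

REQUIRED_VARS = {
    'BEARER_TOKEN', 'OPENAI_API_KEY', 'DATASTORE', 'EMBEDDING_DIMENSION', 'EMBEDDING_MODEL',
    'MONGODB_COLLECTION', 'MONGODB_DATABASE', 'MONGODB_INDEX', 'MONGODB_URI',
}

def find_missing(required, env=None):
    """Return the sorted names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return sorted(name for name in required if not env.get(name))

# Re-check against the live environment after the .env values were merged in.
still_missing = find_missing(REQUIRED_VARS)
if still_missing:
    print(f"Still missing after loading .env: {still_missing}")
```

Treating empty strings as missing is a design choice here: a key that is present in `.env` but left blank would otherwise pass the check and fail later.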

### Check MongoDB Atlas Datastore connection

In [3]:
from pymongo import MongoClient
client = MongoClient(os.environ["MONGODB_URI"])
# Send a ping to confirm a successful connection
try:
    client.admin.command('ping')
    print("Pinged your deployment. You successfully connected to MongoDB!")
except Exception as e:
    print(e)

Pinged your deployment. You successfully connected to MongoDB!


In [4]:
db = client[os.environ["MONGODB_DATABASE"]]
clxn = db[os.environ["MONGODB_COLLECTION"]]
clxn.name

'Beyonce'

### Check OpenAI Connection

These tests require the environment variables: `OPENAI_API_KEY, EMBEDDING_MODEL`

We set the API key, then query the API for its available models. We then loop over this list to find which models can provide text embeddings, and record each one's default embedding dimension.

In [5]:
import openai
openai.api_key = os.environ["OPENAI_API_KEY"]
models = openai.Model.list()
model_names = [model["id"] for model in models['data']]
model_dimensions = {}
for model_name in model_names:
    try:
        response = openai.Embedding.create(input=["Some input text"], model=model_name)
        model_dimensions[model_name] = len(response['data'][0]['embedding'])
    except Exception:
        # Not an embedding model (or the request failed); skip it
        pass
f"{model_dimensions=}"

"model_dimensions={'text-embedding-3-small': 1536, 'text-embedding-ada-002': 1536, 'text-embedding-3-large': 3072}"

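Since `EMBEDDING_DIMENSION` is among the required variables, it may be worth confirming that it agrees with what the chosen model returns by default. A minimal sketch; `dimension_matches` is a hypothetical helper, and the dictionary below simply copies the output printed above rather than querying the API again:

```python
def dimension_matches(model_dimensions, model, dimension):
    """Return True if `dimension` equals the model's default embedding size.

    Raises KeyError if `model` is not among the models that returned an embedding.
    `dimension` may be an int or a numeric string, as read from the environment.
    """
    if model not in model_dimensions:
        raise KeyError(f"{model!r} did not return an embedding; known: {sorted(model_dimensions)}")
    return int(dimension) == model_dimensions[model]

# Values copied from the cell output above.
model_dimensions = {
    'text-embedding-3-small': 1536,
    'text-embedding-ada-002': 1536,
    'text-embedding-3-large': 3072,
}
print(dimension_matches(model_dimensions, 'text-embedding-3-small', '1536'))
```

In the notebook you would pass `os.environ['EMBEDDING_MODEL']` and `os.environ['EMBEDDING_DIMENSION']`. A mismatch is not necessarily an error if you configured a different size on purpose, but it is worth comparing against the index definition.
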
## 2) Data Engineering

### Prepare personal or organizational dataset

The ChatGPT Retrieval Plugin provides semantic search of your own data using OpenAI's Embedding Models and MongoDB's Vector Datastore and Semantic Search.

In this example, we will use the **S**tanford **Qu**estion **A**nswering **D**ataset (SQuAD), which we download from Hugging Face Datasets.

In [6]:
import pandas as pd
from datasets import load_dataset
data = load_dataset("squad_v2", split="train")
data = data.to_pandas().drop_duplicates(subset=["context"])
print(f'{len(data)=}')
data.head()

len(data)=19029


|   | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 56be85543aeaaa14008c9063 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce start becoming popular? | {'text': ['in the late 1990s'], 'answer_start'... |
| 15 | 56be86cf3aeaaa14008c9076 | Beyoncé | Following the disbandment of Destiny's Child i... | After her second solo album, what other entert... | {'text': ['acting'], 'answer_start': [207]} |
| 27 | 56be88473aeaaa14008c9080 | Beyoncé | A self-described "modern-day feminist", Beyonc... | In her music, what are some recurring elements... | {'text': ['love, relationships, and monogamy']... |
| 39 | 56be892d3aeaaa14008c908b | Beyoncé | Beyoncé Giselle Knowles was born in Houston, T... | Beyonce's younger sibling also sang with her i... | {'text': ['Destiny's Child'], 'answer_start': ... |
| 52 | 56be8a583aeaaa14008c9094 | Beyoncé | Beyoncé attended St. Mary's Elementary School ... | What town did Beyonce go to school in? | {'text': ['Fredericksburg'], 'answer_start': [... |


To speed up our example, let's focus specifically on questions about Beyoncé.

In [7]:
data = data.loc[data['title']=='Beyoncé']

In [8]:
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[0]

{'id': '56be85543aeaaa14008c9063',
 'text': 'Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny\'s Child. Managed by her father, Mathew Knowles, the group became one of the world\'s best-selling girl groups of all time. Their hiatus saw the release of Beyoncé\'s debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".',
 'metadata': {'title': 'Beyoncé'}}
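
Before sending documents to the service, a quick sanity check can catch records that would embed poorly or that share an id and could therefore overwrite one another on upsert. The sketch below is our own illustrative helper (`validate_documents` is not part of the plugin) and assumes only the `id` and `text` keys used above:

```python
def validate_documents(documents):
    """Return a list of problems found; an empty list means the batch looks OK."""
    problems = []
    seen_ids = set()
    for position, doc in enumerate(documents):
        doc_id = doc.get('id')
        if not doc_id:
            problems.append(f"document {position}: missing id")
        elif doc_id in seen_ids:
            problems.append(f"document {position}: duplicate id {doc_id!r}")
        else:
            seen_ids.add(doc_id)
        text = doc.get('text')
        # Whitespace-only text carries nothing to embed, so flag it as empty.
        if not isinstance(text, str) or not text.strip():
            problems.append(f"document {position}: empty text")
    return problems

# Made-up records for illustration; in the notebook you would pass `documents`.
sample = [
    {'id': 'a1', 'text': 'First passage.', 'metadata': {'title': 'Example'}},
    {'id': 'a1', 'text': '   ', 'metadata': {'title': 'Example'}},
]
print(validate_documents(sample))
```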

### Upsert and Index data via the Plugin API

Posting an `upsert` request to the ChatGPT Retrieval Plugin API performs two tasks on the backend. First, it inserts your data into (or updates it in) the MONGODB_COLLECTION in the MongoDB Cluster that you set up. Second, Atlas asynchronously begins populating a Vector Search Index on the embedding key. 

If you have already created the Collection and a Vector Search Index through the Atlas UI while following [setup.md](https://github.com/caseyclements/chatgpt-retrieval-plugin/blob/main/docs/providers/mongodb_atlas/setup.md), then indexing will begin immediately.

If you haven't set up the Atlas Vector Search yet, no problem. `upsert` will insert the data. To start indexing, simply go back to the Atlas UI and add a Search Index. This will trigger indexing. Once complete, we can begin semantic queries!


The front end of the Plugin is a FastAPI web server. Its API accepts simple HTTP requests. We will need to provide authorization in the form of the BEARER_TOKEN we set earlier. We do this below:

In [9]:
endpoint_url = 'http://0.0.0.0:8000'
headers = {"Authorization": f"Bearer {os.environ['BEARER_TOKEN']}"}

Although our sample data is not large, and the service and datastore are responsive, we follow best practice and execute bulk upserts in batches with retries.

In [10]:
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

# Set up request parameters to batch requests and retry on 5xx errors
batch_size = 100
retries = Retry(total=5, backoff_factor=0.1, status_forcelist=[500, 502, 503, 504])
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retries))
n_docs = len(documents)
for i in tqdm(range(0, n_docs, batch_size)):
    i_end = min(n_docs, i+batch_size)
    print(f'{(i,i_end) =}') 
    # make post request that allows up to 5 retries
    res = session.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={"documents": documents[i:i_end]}
    )

  0%|          | 0/1 [00:00<?, ?it/s]

(i,i_end) =(0, 66)


In [11]:
if res.status_code != 200:
    # Note: `res` holds only the response for the final batch
    print(res.status_code, res.reason, res.text)
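
The check above sees only the last response. If you want to know which batches failed, one option is to record the status of every request. This is a sketch under our own assumptions: `iter_batches` and `upsert_in_batches` are hypothetical helpers, and `post` stands for any callable that sends one batch and returns an object with a `status_code` attribute.

```python
def iter_batches(items, batch_size):
    """Yield (start, end, batch) slices of `items`, each at most `batch_size` long."""
    for start in range(0, len(items), batch_size):
        end = min(len(items), start + batch_size)
        yield start, end, items[start:end]

def upsert_in_batches(post, documents, batch_size=100):
    """Call `post(batch)` once per batch and return the failures.

    Each failure is a (start, end, status_code) tuple, so the affected
    slice of `documents` can be retried or inspected afterwards.
    """
    failures = []
    for start, end, batch in iter_batches(documents, batch_size):
        response = post(batch)
        if response.status_code != 200:
            failures.append((start, end, response.status_code))
    return failures
```

With the session defined earlier, `post` could be `lambda batch: session.post(f"{endpoint_url}/upsert", headers=headers, json={"documents": batch})`. An empty return value then means every batch was accepted.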

## 3) Answering Questions

Now would be a good time to go back to the Atlas UI and navigate to your collection's search index. Once all our SQuAD records have been successfully indexed, we can proceed with the querying phase. By passing one or more queries to the `/query` endpoint, we can search the datastore. For this task, we can use a few questions from SQuAD2.
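
Instead of watching the Atlas UI, the index status can also be polled from Python. The following is a sketch, not a definitive recipe: `wait_until_queryable` is a hypothetical helper, and it assumes that `collection.list_search_indexes()` (available in recent PyMongo releases against Atlas) yields documents carrying `name` and `queryable` keys.

```python
import time

def wait_until_queryable(collection, index_name, timeout_s=300, poll_s=5):
    """Poll until the named search index reports itself queryable.

    Returns True once the index is queryable, or False if `timeout_s`
    seconds elapse first. `collection` only needs a `list_search_indexes()`
    method that yields dict-like index descriptions.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for index in collection.list_search_indexes():
            if index.get('name') == index_name and index.get('queryable'):
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In this notebook the call would look like `wait_until_queryable(clxn, os.environ['MONGODB_INDEX'])`. A queryable index may still be catching up on very recent upserts, so treat a `True` result as a minimum condition rather than proof that every document has been indexed.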

In [12]:
def format_results(results):
    for query_result in results.json()['results']:
        query = query_result['query']
        answers = []
        scores = []
        for result in query_result['results']:
            answers.append(result['text'])
            scores.append(round(result['score'], 2))
        print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")    

def ask(question: str):
    res = requests.post(
        f"{endpoint_url}/query",
        headers=headers,
        json={'queries': [{"query": question}]}
    )
    format_results(res)

In [13]:
ask("Who is Beyonce?")

----------------------------------------------------------------------
Who is Beyonce?

0.83: Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ bee-YON-say) (born September 4, 1981) is an American singer, songwriter, record producer and actress. Born and raised in Houston, Texas, she performed in various singing and dancing competitions as a child, and rose to fame in the late 1990s as lead singer of R&B girl-group Destiny's Child. Managed by her father, Mathew Knowles, the group became one of the world's best-selling girl groups of all time. Their hiatus saw the release of Beyoncé's debut album, Dangerously in Love (2003), which established her as a solo artist worldwide, earned five Grammy Awards and featured the Billboard Hot 100 number-one singles "Crazy in Love" and "Baby Boy".
0.82: A self-described "modern-day feminist", Beyoncé creates songs that are often characterized by themes of love, relationships, and monogamy, as well as female sexuality and empowerment. On stage, her dynamic

In [14]:
ask("Who is Beyonce married to?")

----------------------------------------------------------------------
Who is Beyonce married to?

0.78: On April 4, 2008, Beyoncé married Jay Z. She publicly revealed their marriage in a video montage at the listening party for her third studio album, I Am... Sasha Fierce, in Manhattan's Sony Club on October 22, 2008. I Am... Sasha Fierce was released on November 18, 2008 in the United States. The album formally introduces Beyoncé's alter ego Sasha Fierce, conceived during the making of her 2003 single "Crazy in Love", selling 482,000 copies in its first week, debuting atop the Billboard 200, and giving Beyoncé her third consecutive number-one album in the US. The album featured the number-one song "Single Ladies (Put a Ring on It)" and the top-five songs "If I Were a Boy" and "Halo".
0.77: Beyoncé is believed to have first started a relationship with Jay Z after a collaboration on "'03 Bonnie & Clyde", which appeared on his seventh album The Blueprint 2: The Gift & The Curse (2002). 
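
Because the endpoint accepts a list of queries, several questions can be sent in one request. A minimal sketch: `build_query_payload` is an illustrative helper of ours, and the `top_k` field is an assumption about the plugin's query schema, so leave it as `None` if your version of the service rejects it.

```python
def build_query_payload(questions, top_k=None):
    """Build the JSON body for the /query endpoint from a list of questions.

    `top_k` (assumed field) limits the number of results per question;
    when None, only the `query` field used elsewhere in this notebook is sent.
    """
    queries = []
    for question in questions:
        query = {'query': question}
        if top_k is not None:
            query['top_k'] = top_k
        queries.append(query)
    return {'queries': queries}

payload = build_query_payload(["Who is Beyonce?", "Who is Beyonce married to?"], top_k=2)
print(payload)
```

The payload can then be posted as in `ask`, with `requests.post(f"{endpoint_url}/query", headers=headers, json=payload)`, and the response handed to `format_results`, which already loops over one result set per query.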

## 4) Clean up

In [15]:
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"delete_all":True}
)

response.json()

{'success': True}