# Azure Cosmos DB No SQL

This notebook shows you how to leverage this integrated [vector database](https://learn.microsoft.com/en-us/azure/cosmos-db/vector-database) to store documents in collections, create indicies and perform vector search queries using approximate nearest neighbor algorithms such as COS (cosine distance), L2 (Euclidean distance), and IP (inner product) to locate documents close to the query vectors. 
    
Azure Cosmos DB is the database that powers OpenAI's ChatGPT service. It offers single-digit millisecond response times, automatic and instant scalability, along with guaranteed speed at any scale. 

Azure Cosmos DB for NoSQL now offers vector indexing and search in preview. This feature is designed to handle high-dimensional vectors, enabling efficient and accurate vector search at any scale. You can now store vectors directly in the documents alongside your data. This means that each document in your database can contain not only traditional schema-free data, but also high-dimensional vectors as other properties of the documents. This colocation of data and vectors allows for efficient indexing and searching, as the vectors are stored in the same logical unit as the data they represent. This simplifies data management, AI application architectures, and the efficiency of vector-based operations.

Please refer here for more details:
- [Vector Search](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/vector-search)
- [Full Text Search](https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/full-text-search)
- [Hybrid Search](https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/hybrid-search)

[Sign Up](https://azure.microsoft.com/en-us/free/) for lifetime free access to get started today.

In [None]:
%pip install --upgrade --quiet azure-cosmos langchain-openai langchain-community

In [None]:
OPENAI_API_KEY = ""
OPENAI_API_TYPE = "azure"
OPENAI_API_VERSION = "2024-07-01-preview"
OPENAI_API_BASE = ""
OPENAI_EMBEDDINGS_MODEL_NAME = "text-embedding-3-small"
OPENAI_EMBEDDINGS_MODEL_DEPLOYMENT = "text-embedding-3-small"

## Insert Data

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF
loader = PyPDFLoader("https://arxiv.org/pdf/2303.08774.pdf")
data = loader.load()

In [3]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)

In [4]:
print(docs[0])

page_content='GPT-4 Technical Report
OpenAI∗
Abstract
We report the development of GPT-4, a large-scale, multimodal model which can
accept image and text inputs and produce text outputs. While less capable than
humans in many real-world scenarios, GPT-4 exhibits human-level performance
on various professional and academic benchmarks, including passing a simulated
bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-
based model pre-trained to predict the next token in a document. The post-training
alignment process results in improved performance on measures of factuality and
adherence to desired behavior. A core component of this project was developing
infrastructure and optimization methods that behave predictably across a wide
range of scales. This allowed us to accurately predict some aspects of GPT-4’s
performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction' metadata={'source': 'https://arxiv.org/pdf/2303.0877

## Creating AzureCosmosDB NoSQL Vector Search

In [5]:
indexing_policy = {
    "indexingMode": "consistent",
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [{"path": '/"_etag"/?'}],
    "vectorIndexes": [{"path": "/embedding", "type": "diskANN"}],
    "fullTextIndexes": [{"path": "/text"}],
}

vector_embedding_policy = {
    "vectorEmbeddings": [
        {
            "path": "/embedding",
            "dataType": "float32",
            "distanceFunction": "cosine",
            "dimensions": 1536,
        }
    ]
}

full_text_policy = {
    "defaultLanguage": "en-US",
    "fullTextPaths": [{"path": "/text", "language": "en-US"}],
}

In [16]:
from azure.cosmos import CosmosClient, PartitionKey
from langchain_community.vectorstores.azure_cosmos_db_no_sql import (
    AzureCosmosDBNoSqlVectorSearch,
)
from langchain_openai import OpenAIEmbeddings
from pydantic import SecretStr

HOST = "AZURE_COSMOS_DB_ENDPOINT"
KEY = "AZURE_COSMOS_DB_KEY"

cosmos_client = CosmosClient(HOST, KEY)
database_name = "langchain_python_db_notebook"
container_name = "langchain_python_container"
partition_key = PartitionKey(path="/id")
cosmos_container_properties = {"partition_key": partition_key}

openai_embeddings = OpenAIEmbeddings(
    deployment="smart-agent-embedding-ada",
    model="text-embedding-ada-002",
    chunk_size=1,
    openai_api_key=SecretStr("OPENAI_API_KEY"),
)


# insert the documents in AzureCosmosDBNoSql with their embedding
vector_search = AzureCosmosDBNoSqlVectorSearch.from_documents(
    documents=docs,
    embedding=openai_embeddings,
    cosmos_client=cosmos_client,
    database_name=database_name,
    container_name=container_name,
    vector_embedding_policy=vector_embedding_policy,
    full_text_policy=full_text_policy,
    indexing_policy=indexing_policy,
    cosmos_container_properties={"partition_key": partition_key},
    cosmos_database_properties={},
    vector_search_fields={"text_field": "text", "embedding_field": "embedding"},
    full_text_search_enabled=True,
)

## Vector Search

In [9]:
# Perform a similarity search between the embedding of the query and the embeddings of the documents
import json

query = "What were the compute requirements for training GPT 4"
results = vector_search.similarity_search(query)

print(results[0].page_content)

performance based on models trained with no more than 1/1,000th the compute of
GPT-4.
1 Introduction
This technical report presents GPT-4, a large multimodal model capable of processing image and
text inputs and producing text outputs. Such models are an important area of study as they have the
potential to be used in a wide range of applications, such as dialogue systems, text summarization,
and machine translation. As such, they have been the subject of substantial interest and progress in
recent years [1–34].
One of the main goals of developing such models is to improve their ability to understand and generate
natural language text, particularly in more complex and nuanced scenarios. To test its capabilities
in such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In
these evaluations it performs quite well and often outscores the vast majority of human test takers.


## Vector Search with Score

In [10]:
query = "What were the compute requirements for training GPT 4"

results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
)

# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print(f"Score {i+1}: ", results[i][1])
    print("\n")

Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"5a9a248f-6885-4e07-8321-e416ecd01556"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it

## Vector Search with filtering

In [19]:
from langchain_community.vectorstores.azure_cosmos_db_no_sql import (
    Condition,
    PreFilter,
)

query = "What were the compute requirements for training GPT 4"

pre_filter = PreFilter(
    conditions=[
        Condition(property="metadata.page", operator="$eq", value=0),
    ]
)

results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
    pre_filter=pre_filter,
)

# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print(f"Score {i+1}: ", results[i][1])
    print("\n")

Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"4b0034fa-0d0e-46b3-9385-0582511eb28f"},"page_content":"performance based on models trained with no more than 1/1,000th the compute of\nGPT-4.\n1 Introduction\nThis technical report presents GPT-4, a large multimodal model capable of processing image and\ntext inputs and producing text outputs. Such models are an important area of study as they have the\npotential to be used in a wide range of applications, such as dialogue systems, text summarization,\nand machine translation. As such, they have been the subject of substantial interest and progress in\nrecent years [1–34].\nOne of the main goals of developing such models is to improve their ability to understand and generate\nnatural language text, particularly in more complex and nuanced scenarios. To test its capabilities\nin such scenarios, GPT-4 was evaluated on a variety of exams originally designed for humans. In\nthese evaluations it

## Full Text Search

In [20]:
query = "What were the compute requirements for training GPT 4"
pre_filter = PreFilter(
    conditions=[
        Condition(
            property="text",
            operator="$full_text_contains_any",
            value="What were the compute requirements for training GPT 4",
        ),
    ],
)
results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
    query_type="full_text_search",
    pre_filter=pre_filter,
)

# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print("\n")

Result 1:  {"id":null,"metadata":{"source":"https://arxiv.org/pdf/2303.08774.pdf","page":0,"id":"87ec78d5-26e9-4eae-afd3-07935f706230"},"page_content":"GPT-4 Technical Report\nOpenAI∗\nAbstract\nWe report the development of GPT-4, a large-scale, multimodal model which can\naccept image and text inputs and produce text outputs. While less capable than\nhumans in many real-world scenarios, GPT-4 exhibits human-level performance\non various professional and academic benchmarks, including passing a simulated\nbar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-\nbased model pre-trained to predict the next token in a document. The post-training\nalignment process results in improved performance on measures of factuality and\nadherence to desired behavior. A core component of this project was developing\ninfrastructure and optimization methods that behave predictably across a wide\nrange of scales. This allowed us to accurately predict some aspects of GPT-4’s\nper

## Full Text Search BM 25 Ranking

In [22]:
query = "What were the compute requirements for training GPT 4"

full_text_rank_filter = [
    {
        "search_field": "text",
        "search_text": "What were the compute requirements for training GPT 4",
    }
]
results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
    query_type="full_text_ranking",
    full_text_rank_filter=full_text_rank_filter,
)

# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print("\n")

Result 1:  {"id":null,"metadata":{"id":"f81d994b-bd4e-4471-905f-841ac529584d"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}


Result 2:  {"id":null,"metadata":{"id":"b8117761-b5ec-473d-a818-dd5f7dda75ac"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\nmodel-specific tuning. To address this, we developed infrastructure and optimization methods that\nhave very predictable behavior across multiple scales. These improvements allowed us to reliably\npredict some aspects of the performance of GPT-4 from smalle

## Hybrid Search

In [23]:
query = "What were the compute requirements for training GPT 4"

full_text_rank_filter = [
    {
        "search_field": "text",
        "search_text": "What were the compute requirements for training GPT 4",
    }
]
results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
    query_type="hybrid",
    full_text_rank_filter=full_text_rank_filter,
)

# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print(f"Score {i+1}: ", results[i][1])
    print("\n")

Result 1:  {"id":null,"metadata":{"id":"3ae8615a-5fd3-4543-8d94-b01687904e02"},"page_content":"Figure 11: Results on IF evaluations across GPT3.5, GPT3.5-Turbo, GPT-4-launch\n98","type":"Document"}
Score 1:  0.5545045822126439


Result 2:  {"id":null,"metadata":{"id":"f81d994b-bd4e-4471-905f-841ac529584d"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Score 2:  0.5529193759066282


Result 3:  {"id":null,"metadata":{"id":"b8117761-b5ec-473d-a818-dd5f7dda75ac"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\

## Hybrid Search with filtering

In [25]:
query = "What were the compute requirements for training GPT 4"

pre_filter = PreFilter(
    conditions=[
        Condition(
            property="text",
            operator="$full_text_contains_any",
            value="What were the compute requirements for training GPT 4",
        ),
        Condition(property="metadata.page", operator="$eq", value=0),
    ],
    logical_operator="$and",
)

full_text_rank_filter = [
    {
        "search_field": "text",
        "search_text": "What were the compute requirements for training GPT 4",
    }
]
results = vector_search.similarity_search_with_score(
    query=query,
    k=5,
    query_type="hybrid",
    full_text_rank_filter=full_text_rank_filter,
)

# Display results
for i in range(0, len(results)):
    print(f"Result {i+1}: ", results[i][0].json())
    print(f"Score {i+1}: ", results[i][1])
    print("\n")

Result 1:  {"id":null,"metadata":{"id":"3ae8615a-5fd3-4543-8d94-b01687904e02"},"page_content":"Figure 11: Results on IF evaluations across GPT3.5, GPT3.5-Turbo, GPT-4-launch\n98","type":"Document"}
Score 1:  0.5545045822126439


Result 2:  {"id":null,"metadata":{"id":"f81d994b-bd4e-4471-905f-841ac529584d"},"page_content":"the HumanEval dataset. A power law fit to the smaller models (excluding GPT-4) is shown as the dotted\nline; this fit accurately predicts GPT-4’s performance. The x-axis is training compute normalized so that\nGPT-4 is 1.\n3","type":"Document"}
Score 2:  0.5529193759066282


Result 3:  {"id":null,"metadata":{"id":"b8117761-b5ec-473d-a818-dd5f7dda75ac"},"page_content":"safety considerations above against the scientific value of further transparency.\n3 Predictable Scaling\nA large focus of the GPT-4 project was building a deep learning stack that scales predictably. The\nprimary reason is that for very large training runs like GPT-4, it is not feasible to do extensive\