# Canopy Experiments
This notebook covers some basic experiments with the [Canopy SDK](https://www.pinecone.io/blog/canopy-rag-framework/) by Pinecone. In this notebook, we will show how to:
1. Programmatically upsert new documents to our Canopy index
2. Programmatically query our Canopy index, with metadata filters
    - The metadata filters will be used to create partitions for the index

In [4]:
import os
from dotenv import load_dotenv

load_dotenv()

# We have PINECONE_API_KEY, PINECONE_ENVIRONMENT, INDEX_NAME, and OPENAI_API_KEY as .env variables
PINECONE_API_KEY = os.getenv("PINECONE_API_KEY")
PINECONE_ENVIRONMENT = os.getenv("PINECONE_ENVIRONMENT")
INDEX_NAME = os.getenv("INDEX_NAME")

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

# Fail early if any required variable is missing from the environment
assert all(
    value is not None
    for value in (PINECONE_API_KEY, PINECONE_ENVIRONMENT, INDEX_NAME, OPENAI_API_KEY)
), "Please set the environment variables in the .env file"
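
The single assertion above does not say which variable is missing. A minimal sketch of a more specific check follows; `missing_env_vars` and `example_env` are hypothetical names introduced for illustration and are not part of Canopy or python-dotenv.

```python
import os
from typing import List, Mapping, Sequence

REQUIRED_VARS = ("PINECONE_API_KEY", "PINECONE_ENVIRONMENT", "INDEX_NAME", "OPENAI_API_KEY")


def missing_env_vars(names: Sequence[str], env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping, so the check does not depend on the real environment.
example_env = {"PINECONE_API_KEY": "pc-123", "INDEX_NAME": "canopy--demo"}
print(missing_env_vars(REQUIRED_VARS, example_env))
# → ['PINECONE_ENVIRONMENT', 'OPENAI_API_KEY']
```

Calling `missing_env_vars(REQUIRED_VARS)` without a second argument checks the real process environment after `load_dotenv()` has run.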

### Creating a Tokenizer

As a prerequisite to the remaining cells in this notebook, we need to set up a tokenizer, which is used to tokenize our data for the embedding model.

In [7]:
from canopy.tokenizer import Tokenizer
Tokenizer.initialize()

After initializing the global object, we can simply create an instance from anywhere in our code, without providing any parameters:

In [8]:
from canopy.tokenizer import Tokenizer

tokenizer = Tokenizer()
tokenizer.tokenize("Hello world!")

['Hello', ' world', '!']
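
The initialize-once, instantiate-anywhere behaviour described above is a singleton pattern. The sketch below illustrates the idea with a hypothetical `ToyTokenizer` class; it is not Canopy's implementation, and the regex rule is a toy stand-in for a real tokenizer vocabulary.

```python
import re
from typing import List, Optional


class ToyTokenizer:
    """Hypothetical stand-in showing an initialize-once, use-anywhere pattern."""

    _instance: Optional["ToyTokenizer"] = None

    def __new__(cls) -> "ToyTokenizer":
        # Every ToyTokenizer() call returns the shared instance built by initialize().
        if cls._instance is None:
            raise RuntimeError("Call ToyTokenizer.initialize() before creating instances.")
        return cls._instance

    @classmethod
    def initialize(cls, pattern: str = r"\s?\w+|\s?[^\w\s]") -> None:
        # Build the shared instance once; configuration lives here, not in the constructor.
        instance = object.__new__(cls)
        instance._pattern = re.compile(pattern)
        cls._instance = instance

    def tokenize(self, text: str) -> List[str]:
        return self._pattern.findall(text)

    def detokenize(self, tokens: List[str]) -> str:
        return "".join(tokens)


ToyTokenizer.initialize()
a, b = ToyTokenizer(), ToyTokenizer()
print(a is b)                      # → True
print(a.tokenize("Hello world!"))  # → ['Hello', ' world', '!']
```

Keeping configuration in `initialize()` means code elsewhere can obtain the tokenizer without knowing how it was configured.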

### Upsert Data

We will begin by programmatically upserting data to the index. For this task, we can look in the `data/` directory in the root of the project. This directory contains some test files that we will parse and upsert.

In [12]:
from canopy.knowledge_base import KnowledgeBase

kb = KnowledgeBase(index_name=INDEX_NAME)
kb.connect()

Now that we have our `KnowledgeBase` initialized and connected, we can upsert our documents.

Canopy uses the `Document` object to store information about documents that are to be uploaded to Canopy:
```python
from canopy.models.data_models import Document

example_docs = [
    Document(id="1",
             text="This is text for example",
             source="https://url.com"),
    Document(id="2",
             text="this is another text",
             source="https://another-url.com",
             metadata={"my-key": "my-value"}),
]
```

For each document that we want to upload, we should create a random UUID for the document ID, set a source URL if the document is publicly available, and attach metadata, which will be used when querying the index.

In [68]:
import glob
from uuid import uuid4
from canopy.models.data_models import Document

file_paths = glob.glob('../data/sample-docs/*')

documents = []
for f_path in file_paths:
    name = f_path.split('/')[-1]
    with open(f_path, 'r') as f:
        text = f.read()
    doc = Document(id=str(uuid4()), text=text, metadata={'value': 'hello'})
    documents.append(doc)
documents

[Document(id='3bd6ddbe-9195-4dde-83e1-7f962768dbde', text='Recently, quantum algorithms that leverage real-time evolution under a many-body Hamiltonian have proven to be exceptionally effective in estimating individual eigenvalues near the edge of the Hamiltonian spectrum, such as the ground state energy. By contrast, evaluating the trace of an operator requires the aggregation of eigenvalues across the entire spectrum. In this work, we introduce an efficient near-term quantum algorithm for computing the trace of a broad class of operators, including matrix functions of the target Hamiltonian. Our trace estimator is similar to the classical Girard-Hutchinson estimator in that it involves the preparation of many random states. Although the exact Girard-Hutchinson estimator is not tractably realizable on a quantum computer, we can construct random states that match the variance of the Girard-Hutchinson estimator through only real-time evolution. Importantly, our random states are all gen

Now that we have our document objects created, we can upload them to the knowledge base.

In [69]:
kb.upsert(documents)
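
Each upserted document carries its metadata into the index, and a metadata filter at query time restricts results to the records whose metadata matches, which is what lets one index act as several partitions. The sketch below illustrates that matching rule locally. It is only an illustration: Pinecone evaluates filters server-side, `matches_filter` and the `tenant` key are hypothetical, and only plain equality and `$in` are handled. In Canopy the filter would be passed along with the query, assumed here to be through a `metadata_filter` field on `Query`; check the installed version before relying on that.

```python
from typing import Any, Dict


def matches_filter(metadata: Dict[str, Any], metadata_filter: Dict[str, Any]) -> bool:
    """Return True if every filter condition holds (equality or {"$in": [...]} only)."""
    for key, condition in metadata_filter.items():
        value = metadata.get(key)
        if isinstance(condition, dict) and "$in" in condition:
            if value not in condition["$in"]:
                return False
        elif value != condition:
            return False
    return True


# Hypothetical records: the "tenant" key acts as the partition label.
records = [
    {"id": "a", "metadata": {"tenant": "physics"}},
    {"id": "b", "metadata": {"tenant": "biology"}},
    {"id": "c", "metadata": {"tenant": "physics"}},
]

physics_only = [r["id"] for r in records if matches_filter(r["metadata"], {"tenant": "physics"})]
print(physics_only)  # → ['a', 'c']
```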

### Query the Knowledge Base

Now that we have the knowledge base populated with some sample documents, we are ready to query the knowledge base using Canopy.

In [73]:
import json

def print_query_results(results):
    for query_results in results:
        print('query: ' + query_results.query + '\n')
        for document in query_results.documents:
            print('document: ' + document.text.replace("\n", "\\n"))
            print("metadata: " + json.dumps(document.metadata))
            print(f"score: {document.score}\n")

In [87]:
from canopy.models.data_models import Query

query = Query(
    text='What is an effective way of distributing quantum states.',
)

results = kb.query([query])
print_query_results(results)

query: What is an effective way of distributing quantum states.

document: Quantum communication implementations require efficient and reliable quantum channels. Optical fibers have proven to be an ideal candidate for distributing quantum states. Thus, today's efforts address overcoming issues towards high data transmission and long-distance implementations. Here, we experimentally demonstrate the secret key rate enhancement via space-division multiplexing using a multicore fiber. Our multiplexing technique exploits the momentum correlation of photon pairs generated by spontaneous parametric down-conversion. We distributed polarization-entangled photon pairs into opposite cores within a 19-core multicore fiber. We estimated the secret key rates in a configuration with 6 and 12 cores from the entanglement visibility after transmission through 411 m long multicore fiber.
metadata: {"value": "hello"}
score: 0.848302126

document: Recently, quantum algorithms that leverage real-time evolut

In addition to simple queries, we can also use the `ContextEngine` to set a maximum context length. It returns the most relevant context that fits within a given token limit.

In [91]:
from canopy.context_engine import ContextEngine

# Instantiate the context engine with the knowledge base
context_engine = ContextEngine(kb)

result = context_engine.query([query], max_context_tokens=512)
print(result.to_text(indent=2))

[
  {
    "query": "What is an effective way of distributing quantum states.",
    "snippets": [
      {
        "source": "",
        "text": "Quantum communication implementations require efficient and reliable quantum channels. Optical fibers have proven to be an ideal candidate for distributing quantum states. Thus, today's efforts address overcoming issues towards high data transmission and long-distance implementations. Here, we experimentally demonstrate the secret key rate enhancement via space-division multiplexing using a multicore fiber. Our multiplexing technique exploits the momentum correlation of photon pairs generated by spontaneous parametric down-conversion. We distributed polarization-entangled photon pairs into opposite cores within a 19-core multicore fiber. We estimated the secret key rates in a configuration with 6 and 12 cores from the entanglement visibility after transmission through 411 m long multicore fiber."
      },
      {
        "source": "",
        "

To demonstrate the token limit, we can shrink the value of the `max_context_tokens` kwarg to return less context.

In [93]:
result = context_engine.query([query], max_context_tokens=256)
print(result.to_text(indent=2))

[
  {
    "query": "What is an effective way of distributing quantum states.",
    "snippets": [
      {
        "source": "",
        "text": "Quantum communication implementations require efficient and reliable quantum channels. Optical fibers have proven to be an ideal candidate for distributing quantum states. Thus, today's efforts address overcoming issues towards high data transmission and long-distance implementations. Here, we experimentally demonstrate the secret key rate enhancement via space-division multiplexing using a multicore fiber. Our multiplexing technique exploits the momentum correlation of photon pairs generated by spontaneous parametric down-conversion. We distributed polarization-entangled photon pairs into opposite cores within a 19-core multicore fiber. We estimated the secret key rates in a configuration with 6 and 12 cores from the entanglement visibility after transmission through 411 m long multicore fiber."
      }
    ]
  }
]


Finally, if our chunks are too large, we get no context at all, because no result fits in the requested context window.

In [94]:
result = context_engine.query([query], max_context_tokens=128)
print(result.to_text(indent=2))

[]
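
The three results above (several snippets at 512 tokens, one at 256, none at 128) are consistent with a budget rule that keeps whole snippets in rank order and never truncates one. The sketch below shows such a rule under stated assumptions: `fit_to_budget` and `count_tokens` are hypothetical helpers, whitespace-separated words stand in for real tokens, and Canopy's actual context builder may differ in detail, for example by counting formatting overhead.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(text.split())


def fit_to_budget(ranked_snippets: List[str], max_context_tokens: int) -> List[str]:
    """Keep whole snippets, in rank order, while they fit in the token budget."""
    selected: List[str] = []
    used = 0
    for snippet in ranked_snippets:
        cost = count_tokens(snippet)
        if used + cost > max_context_tokens:
            break  # stop at the first snippet that does not fit; nothing is truncated
        selected.append(snippet)
        used += cost
    return selected


snippets = ["one two three four five six", "seven eight nine", "ten"]
print(fit_to_budget(snippets, 8))  # → ['one two three four five six']
print(fit_to_budget(snippets, 5))  # → []
```

With a budget smaller than the top-ranked snippet, nothing is returned, which mirrors the empty result at 128 tokens.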


### Customization

In addition to using the Canopy features out of the box, we can customize the chunking and embedding process. This is important for ensuring that we can store and retrieve relevant context regardless of the size of the context window. Here we use a chunk size of 64 tokens, which should be granular enough that some chunks fit even a small context window while each chunk still carries enough text to be useful.

In [107]:
from typing import List

from canopy.knowledge_base.chunker.base import Chunker
from canopy.knowledge_base.models import KBDocChunk

class TokenSizeChunker(Chunker):

    def __init__(self, chunk_size: int = 64):
        self.chunk_size = chunk_size
        self.tokenizer = Tokenizer()

    def chunk_single_document(self, document: Document) -> List[KBDocChunk]:
        """Chunk a single document into consecutive, non-overlapping chunks of `chunk_size` tokens (64 by default)."""
        tokens = self.tokenizer.tokenize(document.text)
        chunks = []
        for i in range(0, len(tokens), self.chunk_size):
            chunk = tokens[i:i + self.chunk_size]
            chunk_text = self.tokenizer.detokenize(chunk)
            chunks.append(KBDocChunk(
                id=f"{document.id}-{i}",
                document_id=document.id,
                text=chunk_text,
                metadata=document.metadata,
            ))
        return chunks

    async def achunk_single_document(self, document: Document) -> List[KBDocChunk]:
        """Async variant of `chunk_single_document`; not implemented in this notebook."""
        raise NotImplementedError("Async chunking is not implemented yet.")

chunker = TokenSizeChunker()
chunks = chunker.chunk_single_document(documents[0])
chunks

[KBDocChunk(id='3bd6ddbe-9195-4dde-83e1-7f962768dbde-0', text='Recently, quantum algorithms that leverage real-time evolution under a many-body Hamiltonian have proven to be exceptionally effective in estimating individual eigenvalues near the edge of the Hamiltonian spectrum, such as the ground state energy. By contrast, evaluating the trace of an operator requires the aggregation of eigenvalues across the entire spectrum. In this', source='', metadata={'value': 'hello'}, document_id='3bd6ddbe-9195-4dde-83e1-7f962768dbde'),
 KBDocChunk(id='3bd6ddbe-9195-4dde-83e1-7f962768dbde-64', text=' work, we introduce an efficient near-term quantum algorithm for computing the trace of a broad class of operators, including matrix functions of the target Hamiltonian. Our trace estimator is similar to the classical Girard-Hutchinson estimator in that it involves the preparation of many random states. Although the exact Girard-Hutchinson estimator', source='', metadata={'value': 'hello'}, document_id
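
Note that the loop in `TokenSizeChunker` advances by a full `chunk_size`, so neighbouring chunks share no tokens (the chunk IDs above end in `-0`, `-64`, and so on). If overlapping chunks are wanted, for example 64-token chunks that start every 32 tokens, a sliding window over the token list is one option. The sketch below works on plain token lists; `sliding_windows` is a hypothetical helper, not part of Canopy.

```python
from typing import List, Sequence


def sliding_windows(tokens: Sequence[str], chunk_size: int = 64, stride: int = 32) -> List[List[str]]:
    """Split tokens into windows of chunk_size that start every `stride` tokens."""
    if chunk_size <= 0 or stride <= 0:
        raise ValueError("chunk_size and stride must be positive")
    windows = []
    for start in range(0, len(tokens), stride):
        windows.append(list(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the document
    return windows


tokens = [f"t{i}" for i in range(10)]
for window in sliding_windows(tokens, chunk_size=4, stride=2):
    print(window)
# → ['t0', 't1', 't2', 't3']
# → ['t2', 't3', 't4', 't5']
# → ['t4', 't5', 't6', 't7']
# → ['t6', 't7', 't8', 't9']
```

Overlap increases the number of chunks stored in the index, so it trades storage and embedding cost for a lower chance of splitting a relevant passage across a chunk boundary.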

Now that we have our chunker built, we can instantiate our knowledge base to use this chunker to process documents.

In [110]:
kb = KnowledgeBase(index_name=INDEX_NAME, chunker=chunker)
kb.connect()

kb.upsert(documents)

Now, if we retry our query to fetch the documents with a small `max_context_tokens`, we should be able to get results.

In [111]:
result = context_engine.query([query], max_context_tokens=128)
print(result.to_text(indent=2))

[
  {
    "query": "What is an effective way of distributing quantum states.",
    "snippets": [
      {
        "source": "",
        "text": "Quantum communication implementations require efficient and reliable quantum channels. Optical fibers have proven to be an ideal candidate for distributing quantum states. Thus, today's efforts address overcoming issues towards high data transmission and long-distance implementations. Here, we experimentally demonstrate the secret key rate enhancement via space-division multiplexing using a multicore"
      },
      {
        "source": "",
        "text": " between the experimental state and the stationary thermal symmetric theoretical state, offering direct evidence of subsystem thermalization."
      }
    ]
  }
]
