Collaborator

@jhamon jhamon commented Jan 30, 2025

Problem

This PR migrates search_records (aliased to search) and upsert_records from the pinecone-plugin-records plugin into the SDK itself.

Solution

Working off the content of the records plugin, I have done the following:

  • Adjusted the codegen script to fix how the OpenAPI generator handles underscore-prefixed fields such as _id and _score
  • Adjusted the rest library code in rest_urllib3.py and rest_aiohttp.py to handle record uploading with content-type application/x-ndjson
  • Copied and modified the integration tests from the plugin
  • Extracted much of the guts of the upsert_records and search_records methods into the request factory, where they are easier to unit test. The logic for parsing user inputs into the OpenAPI request objects is surprisingly complicated, so I added quite a lot of new unit tests covering those edge cases.
  • Compared to the plugin implementation, the major changes are:
    • Made search an alias of search_records
    • Moved away from using .pop(), which mutates the input objects; that could confuse users who reuse those objects for anything else
    • Added better typing of dict fields
    • Incorporated optional use of enum values for RerankModel
    • Added asyncio variants of these methods, although most of the guts are shared in the request factory.
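
To make two of the points above concrete, here is a minimal sketch of NDJSON serialization (the application/x-ndjson body format is one JSON object per line) and of reading keys from a user-supplied query without mutating it. The function names records_to_ndjson and read_query_fields are hypothetical; they are not the names used in the request factory, and the real parsing handles more cases than this.

```python
import json
from typing import Any, Dict, List


def records_to_ndjson(records: List[Dict[str, Any]]) -> str:
    """Serialize records as newline-delimited JSON: one JSON object per line."""
    # json.dumps escapes newlines inside strings, so each record stays on one line.
    return "\n".join(json.dumps(record) for record in records)


def read_query_fields(query: Dict[str, Any]) -> Dict[str, Any]:
    """Read known keys from a query dict without changing the caller's dict."""
    # .get() leaves `query` untouched; .pop() would delete the keys as a side effect.
    return {
        "inputs": query.get("inputs"),
        "top_k": query.get("top_k"),
    }


records = [
    {"_id": "test1", "my_text_field": "Apple is a popular fruit."},
    {"_id": "test2", "my_text_field": "Apple is a tech company."},
]
body = records_to_ndjson(records)

query = {"inputs": {"text": "Apple corporation"}, "top_k": 3}
parsed = read_query_fields(query)
```

After the call, query still contains both of its original keys, so the caller can reuse it for another request.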

I already handled disallowing the records plugin in yesterday's PR, #438.

Usage

from pinecone import Pinecone, CloudProvider, AwsRegion, EmbedModel, RerankModel

pc = Pinecone(api_key="key")

# Create an index for your embedding model
index_model = pc.create_index_for_model(
    name="my-model-index",
    cloud=CloudProvider.AWS,
    region=AwsRegion.US_EAST_1,
    embed={
        "model": EmbedModel.Multilingual_E5_Large,
        "field_map": {"text": "my_text_field"}
    }
)

# Create an index client
index = pc.Index(host=index_model.host)

# Upsert records
namespace = "target-namespace"
index.upsert_records(
    namespace=namespace,
    records=[
        {
            "_id": "test1",
            "my_text_field": "Apple is a popular fruit known for its sweetness and crisp texture.",
        },
        {
            "_id": "test2",
            "my_text_field": "The tech company Apple is known for its innovative products like the iPhone.",
        },
        {
            "_id": "test3",
            "my_text_field": "Many people enjoy eating apples as a healthy snack.",
        },
        {
            "_id": "test4",
            "my_text_field": "Apple Inc. has revolutionized the tech industry with its sleek designs and user-friendly interfaces.",
        },
        {
            "_id": "test5",
            "my_text_field": "An apple a day keeps the doctor away, as the saying goes.",
        },
        {
            "_id": "test6",
            "my_text_field": "Apple Computer Company was founded on April 1, 1976, by Steve Jobs, Steve Wozniak, and Ronald Wayne as a partnership.",
        },
    ],
)

# Search for similar records
response = index.search(
    namespace=namespace,
    query={
        "inputs":{
            "text": "Apple corporation",
        },
        "top_k":3,
    },
    rerank={
        "model": RerankModel.Bge_Reranker_V2_M3,
        "rank_fields": ["my_text_field"],
        "top_n": 3,
    },
)
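
The rerank "model" value above uses the enum, but enum use is optional, so a plain string should work too. A minimal sketch of how that kind of input could be normalized follows. RerankModelSketch is a hypothetical stand-in for pinecone.RerankModel, its member value is an assumption, and model_name is not a function from the SDK.

```python
from enum import Enum
from typing import Union


class RerankModelSketch(Enum):
    """Hypothetical stand-in for pinecone.RerankModel; the value is assumed."""

    Bge_Reranker_V2_M3 = "bge-reranker-v2-m3"


def model_name(model: Union[RerankModelSketch, str]) -> str:
    """Return the plain model string for either an enum member or a raw string."""
    if isinstance(model, Enum):
        return model.value
    return model


from_enum = model_name(RerankModelSketch.Bge_Reranker_V2_M3)
from_string = model_name("bge-reranker-v2-m3")
```

Both calls produce the same string, which is what lets callers pick whichever form they prefer.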

These methods also have asyncio variants available:

import asyncio
from pinecone import Pinecone, RerankModel

async def main():
    # Create an index client
    pc = Pinecone(api_key='key')
    index = pc.AsyncioIndex(host='host')

    # Upsert records
    namespace = "target-namespace"
    records = [
        {
            "_id": "test1",
            "my_text_field": "Apple is a popular fruit known for its sweetness and crisp texture.",
        },
        {
            "_id": "test2",
            "my_text_field": "The tech company Apple is known for its innovative products like the iPhone.",
        },
        {
            "_id": "test3",
            "my_text_field": "Many people enjoy eating apples as a healthy snack.",
        },
        {
            "_id": "test4",
            "my_text_field": "Apple Inc. has revolutionized the tech industry with its sleek designs and user-friendly interfaces.",
        },
        {
            "_id": "test5",
            "my_text_field": "An apple a day keeps the doctor away, as the saying goes.",
        },
        {
            "_id": "test6",
            "my_text_field": "Apple Computer Company was founded on April 1, 1976, by Steve Jobs, Steve Wozniak, and Ronald Wayne as a partnership.",
        },
    ]
    await index.upsert_records(
        namespace=namespace,
        records=records,
    )

    # Search for similar records
    response = await index.search(
        namespace=namespace,
        query={
            "inputs":{
                "text": "Apple corporation",
            },
            "top_k":3,
        },
        rerank={
            "model": RerankModel.Bge_Reranker_V2_M3,
            "rank_fields": ["my_text_field"],
            "top_n": 3,
        },
    )
    
asyncio.run(main())
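
Since the sync and asyncio methods share most of their logic through the request factory, the arrangement might look roughly like the sketch below. Every class name here is hypothetical, the top_k check is only an illustrative validation, and _send is a placeholder that echoes the request instead of calling the API.

```python
import asyncio
from typing import Any, Dict


class RequestFactorySketch:
    """Hypothetical stand-in for the request factory: pure input parsing, no I/O."""

    @staticmethod
    def search_request(namespace: str, query: Dict[str, Any]) -> Dict[str, Any]:
        # Illustrative validation only; the real rules live in the SDK.
        if "top_k" not in query:
            raise ValueError("query must include top_k")
        # Copy the query so the caller's dict is never modified.
        return {"namespace": namespace, "query": dict(query)}


class SyncIndexSketch:
    def search(self, namespace: str, query: Dict[str, Any]) -> Dict[str, Any]:
        request = RequestFactorySketch.search_request(namespace, query)
        return self._send(request)

    def _send(self, request: Dict[str, Any]) -> Dict[str, Any]:
        # Placeholder transport: echoes the request.
        return {"echo": request}


class AsyncIndexSketch:
    async def search(self, namespace: str, query: Dict[str, Any]) -> Dict[str, Any]:
        request = RequestFactorySketch.search_request(namespace, query)
        return await self._send(request)

    async def _send(self, request: Dict[str, Any]) -> Dict[str, Any]:
        await asyncio.sleep(0)  # Yield to the event loop, as real I/O would.
        return {"echo": request}


query = {"inputs": {"text": "Apple corporation"}, "top_k": 3}
sync_result = SyncIndexSketch().search("target-namespace", query)
async_result = asyncio.run(AsyncIndexSketch().search("target-namespace", query))
```

Because the parsing lives in one place with no I/O, it can be unit tested once and both clients pick up the same behavior.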

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update
  • Infrastructure change (CI configs, etc)
  • Non-code change (docs, etc)
  • None of the above: (explain here)

@jhamon jhamon force-pushed the jhamon/search-and-upsert-records branch from 5068508 to 9fcb8db January 30, 2025 13:48
@jhamon jhamon marked this pull request as ready for review January 30, 2025 15:39
@jhamon jhamon merged commit a887143 into release-candidate/2025-01 Jan 30, 2025
70 of 71 checks passed
@jhamon jhamon deleted the jhamon/search-and-upsert-records branch January 30, 2025 16:57
Contributor

@austin-denoble austin-denoble left a comment

Looks great, thanks again!

    fields: Optional[List[str]] = ["*"],  # Default to returning all fields
) -> SearchRecordsResponse:
    """Alias of the search() method."""
    pass
Contributor

Should this be calling search()?

    fields: Optional[List[str]] = ["*"],  # Default to returning all fields
) -> SearchRecordsResponse:
    """Alias of the search() method."""
    pass
Contributor

Maybe I'm just misunderstanding how pass works in this context. 🤔
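
One possible reading, offered as a guess because the diff context does not show the enclosing class: if the quoted method is declared on an abstract interface, a body of pass is expected there, and the delegation to search() would live in a concrete subclass. The sketch below illustrates that pattern with hypothetical class names and simplified signatures.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class IndexInterfaceSketch(ABC):
    """Hypothetical interface: declares the methods, implements nothing."""

    @abstractmethod
    def search(self, namespace: str, query: Dict[str, Any]) -> Dict[str, Any]:
        """Search a namespace."""
        pass

    @abstractmethod
    def search_records(self, namespace: str, query: Dict[str, Any]) -> Dict[str, Any]:
        """Alias of the search() method."""
        pass  # No behavior here; a concrete subclass must override this.


class IndexSketch(IndexInterfaceSketch):
    def search(self, namespace: str, query: Dict[str, Any]) -> Dict[str, Any]:
        # Placeholder result: echoes the inputs instead of calling an API.
        return {"namespace": namespace, "query": query}

    def search_records(self, namespace: str, query: Dict[str, Any]) -> Dict[str, Any]:
        # The concrete class is where the alias actually delegates.
        return self.search(namespace, query)


index = IndexSketch()
via_alias = index.search_records("target-namespace", {"top_k": 3})
via_search = index.search("target-namespace", {"top_k": 3})
```

If the quoted method were on a concrete class instead, pass would make it return None, and the reviewer's question would point at a real bug.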

from pinecone import RerankModel


class TestIndexRequestFactory:
Contributor

Beautiful coverage, thank you.

2 participants