[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/pinecone-bulk-import.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/pinecone-bulk-import.ipynb)

# Pinecone Bulk Import

**Note:** This feature is in [public preview](https://docs.pinecone.io/release-notes/feature-availability) and available only on [Standard and Enterprise plans](https://www.pinecone.io/pricing/).

## Scenario: Ingesting Embedded Parquet Data from S3 to Pinecone

In this scenario, you are tasked with ingesting pre-generated vector embeddings stored in Parquet files located in an S3 bucket into a Pinecone index. The embeddings have been precomputed by a third-party vendor and are ready to be indexed for future vector similarity search or other downstream tasks.

### Problem Overview
The goal is to seamlessly move the data from S3 to Pinecone so that it can be used for future tasks such as semantic search, recommendations, and anomaly detection.

### Solution steps
1. **Locate the S3 Bucket**: You will identify the S3 folder where the Parquet files are stored. These files contain the embeddings and metadata needed for indexing.

2. **Create a Pinecone Index**: You will create a serverless index whose dimension matches the embeddings in the Parquet files.

3. **Import Embeddings into Pinecone**: You will start an import operation that reads the Parquet files from S3 and loads each embedding, with its identifier and metadata, into the index. You can then monitor, list, or cancel the operation.

This approach transfers embedded Parquet files from S3 storage to Pinecone to support vector search. See [Understanding imports](https://docs.pinecone.io/guides/data/understanding-imports) in the Pinecone documentation for additional information.


## Install required libraries

In [None]:
# Bulk import needs a recent release of `pinecone`, the successor to `pinecone-client`
!pip install -U pinecone
!pip install pinecone_notebooks

In [1]:
from pinecone import Pinecone, ServerlessSpec
import time
import os
from datetime import datetime
import json



## Get Pinecone API key

In [None]:
from pinecone_notebooks.colab import Authenticate
Authenticate()

In [2]:
api_key = os.getenv('PINECONE_API_KEY')

# Configure Pinecone client
pc = Pinecone(api_key=api_key)

# Get cloud and region settings
cloud = os.getenv('PINECONE_CLOUD', 'aws')
region = os.getenv('PINECONE_REGION', 'us-east-1')

# Define serverless specifications
spec = ServerlessSpec(cloud=cloud, region=region)
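
`os.getenv` returns `None` when `PINECONE_API_KEY` is not set, and the failure then surfaces later in a less obvious place. A minimal sketch of a guard, where `require_env` is a hypothetical helper written for this notebook and not part of the Pinecone SDK:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error.

    Hypothetical helper for this notebook; not part of the Pinecone SDK.
    """
    value = os.getenv(name, "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set. "
            "Run the authentication cell or export it before starting the notebook."
        )
    return value
```

With this in place, `api_key = require_env("PINECONE_API_KEY")` would fail immediately with a readable message instead of passing `None` to the client.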



## Create a serverless index




In [4]:

index_name = "pinecone-bulk-import"
dimension = 1536

# Create the index only if it does not already exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=dimension,  # must match the dimension of the imported vectors
        metric="cosine",
        spec=spec,  # serverless spec defined earlier
    )
    print(f"Index '{index_name}' created successfully.")
else:
    print(f"Index '{index_name}' already exists.")

index = pc.Index(name=index_name)

Index 'pinecone-bulk-import' created successfully.
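
A newly created index may need a short time before it accepts requests. The sketch below polls until the index reports ready. It assumes that `pc.describe_index(name)` returns an object exposing `status["ready"]`; `wait_until_ready` and its timeout values are hypothetical and chosen only for illustration.

```python
import time


def wait_until_ready(pc, index_name, timeout_s=300.0, poll_s=2.0):
    """Poll the index description until it reports ready, or raise on timeout.

    Assumption: `pc.describe_index(index_name).status["ready"]` is a boolean.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if pc.describe_index(index_name).status["ready"]:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Index '{index_name}' not ready after {timeout_s} s")
        time.sleep(poll_s)
```

Calling `wait_until_ready(pc, index_name)` before starting the import avoids sending requests to an index that is still initializing.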


## Start import task

This sample dataset contains:

*   **Dimensions**: 1536
*   **Rows**: 10,000
*   **Files**: 10 parquet files
*   **Size per file**: ~12.58 MB
*   **Total size**: ~125.8 MB

Each file contains:

*   **id**: Unique identifier
*   **values**: Embedded vectors
*   **metadata**: JSON-formatted dictionary with metadata
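
To make the row layout concrete, here is a rough sketch of one record in this shape. `make_record` is a hypothetical helper, and serializing `metadata` with `json.dumps` is an assumption based on the "JSON-formatted dictionary" description above.

```python
import json


def make_record(record_id, values, metadata, dimension=1536):
    """Build one row with the three columns described above.

    Hypothetical helper; metadata is stored as a JSON string (assumption).
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} values, got {len(values)}")
    return {
        "id": str(record_id),
        "values": [float(v) for v in values],
        "metadata": json.dumps(metadata),
    }
```

A call such as `make_record("doc-1", [0.0] * 1536, {"source": "vendor"})` yields a dictionary that could be written as one Parquet row by the tool of your choice.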

***Note***: *This task may take 10 minutes or more to complete. Each import request can import up to 1 TB of data or 100,000,000 records into a maximum of 100 namespaces, whichever limit is met first.*
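
The limits in the note can be checked before starting an import. The following is a sketch only: `exceeded_limits` is a hypothetical helper, and treating 1 TB as 10^12 bytes is an assumption, since the note does not say whether decimal or binary units are meant.

```python
from typing import List

# Limits taken from the note above; 1 TB is assumed to mean 10**12 bytes.
MAX_BYTES = 10**12
MAX_RECORDS = 100_000_000
MAX_NAMESPACES = 100


def exceeded_limits(total_bytes: int, total_records: int, namespaces: int) -> List[str]:
    """Return the names of the per-request limits that a planned import exceeds."""
    exceeded = []
    if total_bytes > MAX_BYTES:
        exceeded.append("size")
    if total_records > MAX_RECORDS:
        exceeded.append("records")
    if namespaces > MAX_NAMESPACES:
        exceeded.append("namespaces")
    return exceeded
```

For the sample dataset (about 125.8 MB, 10,000 records, assuming a single namespace), `exceeded_limits(125_800_000, 10_000, 1)` returns an empty list, so one import request is sufficient.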

In [6]:
# ImportErrorMode (CONTINUE or ABORT) is only available in recent releases of the Pinecone SDK
from pinecone import ImportErrorMode

## Specify AWS S3 folder and start task

In [5]:
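# Root folder containing the Parquet files; Pinecone reads them directly from S3
root = "s3://dev-bulk-import-datasets-pub/10k-1536/"

# "CONTINUE" keeps the import running when individual records fail to import
op = index.start_import(uri=root, error_mode="CONTINUE")
print(f"Started import operation {op.id}")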

## Check the status of the import

In [None]:
index.describe_index_stats()
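
`describe_index_stats()` shows how many records are in the index, but not the state of the import operation itself. A minimal polling sketch built on `describe_import` follows. The status strings `"Completed"`, `"Failed"`, and `"Cancelled"` and the `status` attribute are assumptions about the response, and `wait_for_import` is a hypothetical helper.

```python
import time

# Assumed terminal status strings for an import operation.
TERMINAL_STATUSES = {"Completed", "Failed", "Cancelled"}


def wait_for_import(index, import_id, timeout_s=1800.0, poll_s=15.0):
    """Poll an import operation until it reaches a terminal status.

    Returns the final status string, or raises TimeoutError.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = index.describe_import(import_id).status
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Import {import_id} still '{status}' after {timeout_s} s")
        time.sleep(poll_s)
```

A call like `wait_for_import(index, op.id)` blocks until the operation finishes, which suits a notebook where later cells depend on the imported data.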


## List import operations

In [None]:
imports = list(index.list_imports())
if imports:
    for i in imports:
        print(i)
else:
    print("No imports found in the index.")
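
When an index has many import operations, a count per status is easier to scan than the full listing. This sketch assumes each item returned by `list_imports()` has a `status` attribute; `summarize_imports` is a hypothetical helper.

```python
from collections import Counter


def summarize_imports(imports):
    """Count import operations by status; items without a status count as 'Unknown'."""
    return Counter(getattr(item, "status", "Unknown") for item in imports)
```

For example, `summarize_imports(index.list_imports())` returns a `Counter` mapping each status string to the number of operations in that state.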

## Describe a specific import

In [None]:
# Describe the import operation started above, using its ID
index.describe_import(op.id)

## Cancel the Import (if needed)

In [None]:
# Look up the current status of the operation and cancel it if it is still running
op_status = index.describe_import(op.id).status
print(f"Operation status: {op_status}")

# Assumed status strings: "Pending" and "InProgress" mark an operation that can be cancelled
if op_status in ("Pending", "InProgress"):
    try:
        index.cancel_import(op.id)
        print(f"Import operation {op.id} cancelled.")
    except Exception as e:
        print(f"Error cancelling import: {e}")
else:
    print(f"Cannot cancel operation {op.id} because its status is: {op_status}")


## Delete the index

In [None]:
pc.delete_index(index_name)
print(f"Index '{index_name}' deleted.")