[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/assets/how-to-create-pinecone-datasets.ipynb)
[![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/assets/how-to-create-pinecone-datasets.ipynb)

# Creating Pinecone Datasets

This notebook will walk you through the process of creating a Pinecone dataset from a pandas Dataframe.

## Step 1: Create a simple sample dataset

In [24]:
!pip install -qU pinecone==6.0.1 pinecone-datasets==1.0.1 pandas==2.2.3

In [85]:
def build_doc(sentence, category):
    return {
        'text': sentence,
        'category': category
    }

planet_sentences = [
  "The smallest planet in our solar system, Mercury, holds the title for being closest to the Sun.",
  "Orbiting the Sun in just 88 Earth days, Mercury zips around at an incredible pace.",
  "A rocky, cratered surface defines this planet, offering clues about its tumultuous history.",
  "Extreme temperature fluctuations mark its days and nights, creating a challenging environment.",
]
planet_sentence_objs = [build_doc(s, "astronomy") for s in planet_sentences]

mythology_sentences = [
  "In Roman mythology, the messenger god Mercury is celebrated for his swift movement.",
  "With winged sandals propelling him, Mercury traverses the skies effortlessly.",
  "Renowned as a mediator between gods and mortals, the deity Mercury also guides souls to the underworld.",
  "Carrying the caduceus, Mercury symbolizes both commerce and communication.",
]
mythology_sentence_objs = [build_doc(s, 'mythology') for s in mythology_sentences]

chemistry_sentences = [
  "Known as quicksilver, mercury is the only metal that remains liquid at room temperature.",
  "With the symbol Hg, derived from the Greek 'hydrargyrum' meaning 'water-silver', mercury has captivated chemists for centuries.",
  "Utilized in devices like thermometers and barometers, mercury’s unique properties make it invaluable in scientific instruments.",
  "Once a common component in industrial processes, mercury now faces strict regulation due to its toxic effects.",
]
chemistry_sentence_objs = [build_doc(s, 'chemistry') for s in chemistry_sentences]

sentences = planet_sentence_objs + mythology_sentence_objs + chemistry_sentence_objs


In [86]:
import pandas as pd
pd.DataFrame(sentences)

Unnamed: 0,text,category
0,"The smallest planet in our solar system, Mercu...",astronomy
1,"Orbiting the Sun in just 88 Earth days, Mercur...",astronomy
2,"A rocky, cratered surface defines this planet,...",astronomy
3,Extreme temperature fluctuations mark its days...,astronomy
4,"In Roman mythology, the messenger god Mercury ...",mythology
5,"With winged sandals propelling him, Mercury tr...",mythology
6,Renowned as a mediator between gods and mortal...,mythology
7,"Carrying the caduceus, Mercury symbolizes both...",mythology
8,"Known as quicksilver, mercury is the only meta...",chemistry
9,"With the symbol Hg, derived from the Greek 'hy...",chemistry


## Step 2: Create embeddings

Next, we'll create some embeddings to go along with this data. There are many ways you could do this, but for this demo we'll use a Pinecone Inference hosted model.

In [77]:
from pinecone import Pinecone

pc = Pinecone()
sentence_embeddings = pc.inference.embed(
    model='multilingual-e5-large',
    inputs=[s['text'] for s in sentences],
    parameters={"input_type": "passage", "truncate": "END"}
)
sentence_embeddings

EmbeddingsList(
  model='multilingual-e5-large',
  vector_type='dense',
  data=[
    {'vector_type': dense, 'values': [0.0259246826171875, 0.0168609619140625, ..., -0.0272369384765625, -0.033203125]},
    {'vector_type': dense, 'values': [0.0198974609375, 0.00733184814453125, ..., -0.0400390625, -0.005207061767578125]},
    ... (26 more embeddings) ...,
    {'vector_type': dense, 'values': [0.01123809814453125, -0.02001953125, ..., -0.0277252197265625, -0.0232696533203125]},
    {'vector_type': dense, 'values': [-0.0018777847290039062, 0.004322052001953125, ..., -0.019012451171875, 0.022186279296875]}
  ],
  usage={'total_tokens': 834}
)

## Step 3: Format data

Now we need to do a bit of work to format our data into the schema expected by the pinecone-datasets package.

Some notes:
* Note that we have both metadata field and 'blob' field, the metadata field is the acutal pinecone metadata we will use in our index, blob, is an additional field that we can use to store any additional information we want to store along with the Dataset.
* here we used both `values` but `sparse_values`, is also available as an optional field.

In [88]:
import uuid

docs = [{
    "id": str(uuid.uuid4()),
    "values": e.values,
    "metadata": {
        "category": t['category']
    },
    "blob": {"text": t['text']}
} for e, t in zip(sentence_embeddings.data, sentences)]

docs_df = pd.DataFrame(docs)
docs_df.head()

Unnamed: 0,id,values,metadata,blob
0,0c55e3ab-6f8b-43de-a759-fe2d1e884b56,"[0.0259246826171875, 0.0168609619140625, -0.02...",{'category': 'astronomy'},{'text': 'The smallest planet in our solar sys...
1,47c6c690-26ef-491e-82f1-ace0d7f29a30,"[0.0198974609375, 0.00733184814453125, -0.0004...",{'category': 'astronomy'},{'text': 'Orbiting the Sun in just 88 Earth da...
2,1e5a2b22-6bf7-4466-878b-619e573fb268,"[0.006023406982421875, 0.0005712509155273438, ...",{'category': 'astronomy'},"{'text': 'A rocky, cratered surface defines th..."
3,39843338-ef61-425a-9a70-6a4d1ae1fc75,"[0.01971435546875, -0.0247802734375, -0.024307...",{'category': 'astronomy'},{'text': 'Extreme temperature fluctuations mar...
4,1a294212-a879-4a4c-842f-083570a19f8e,"[0.01239013671875, -0.007007598876953125, -0.0...",{'category': 'mythology'},"{'text': 'In Roman mythology, the messenger go..."


## Pinecone Dataset

Now that we have our data Ready, we can create a Pinecone Dataset. A Pinecone Dataset is a collection of documtents, queries and Metadata. We can create a Pinecone
* Documents: a collection of records with Id, Vectors (dense, sparse) and metadata
* Queries: a collection of queries with Vectors (dense, sparse), metadata filter and top_k
* Metadata: a defintion of the dataset: Name, dimension, metric, embedding models, etc.

In [89]:
from pinecone_datasets import Dataset, DatasetMetadata, DenseModelMetadata
from datetime import datetime

metadata = DatasetMetadata(
    name="mercury-sentences",
    documents=len(docs_df),
    queries=0,
    created_at=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f"),
    dense_model=DenseModelMetadata(
        name='multilingual-e5-large',
        dimension=1024
    )
)
metadata.model_dump()

{'name': 'mercury-sentences',
 'created_at': '2025-02-28 21:59:43.146979',
 'documents': 12,
 'queries': 0,
 'source': None,
 'license': None,
 'bucket': None,
 'task': None,
 'dense_model': {'name': 'multilingual-e5-large',
  'tokenizer': None,
  'dimension': 1024},
 'sparse_model': None,
 'description': None,
 'tags': None,
 'args': None}

In [92]:
ds = Dataset.from_pandas(documents=df, q=None, metadata=metadata)
ds.documents.head()

Unnamed: 0,id,values,sparse_values,metadata,blob
0,95290eba-a884-43e9-8f25-fb56b581eca0,"[0.0259246826171875, 0.0168609619140625, -0.02...",,{'category': 'astronomy'},{'text': 'The smallest planet in our solar sys...
1,5eec80c1-0f39-4cf8-8826-d1098d44e8f7,"[0.0198974609375, 0.00733184814453125, -0.0004...",,{'category': 'astronomy'},{'text': 'Orbiting the Sun in just 88 Earth da...
2,efedb64b-8e3a-421d-b956-9e7c6f3774f5,"[0.006023406982421875, 0.0005712509155273438, ...",,{'category': 'astronomy'},"{'text': 'A rocky, cratered surface defines th..."
3,d3e5ac3e-5a74-4bc4-b016-ebcaed352fc0,"[0.01971435546875, -0.0247802734375, -0.024307...",,{'category': 'astronomy'},{'text': 'Extreme temperature fluctuations mar...
4,27ebe053-dda3-424d-baa8-0784fa5f9672,"[0.01239013671875, -0.007007598876953125, -0.0...",,{'category': 'astronomy'},"{'text': 'Unlike many other planets, it has no..."


## Save dataset to local path


In [81]:
import tempfile
from pinecone_datasets import Catalog

catalog_path = tempfile.mkdtemp()
catalog = Catalog(base_path=catalog_path)
catalog.save_dataset(ds)

print('Files saved to ' + catalog_path)

Files saved to /tmp/tmpsgv_o4bj




In [83]:
catalog.list_datasets(as_df=True)

Unnamed: 0,name,created_at,documents,queries,source,license,bucket,task,dense_model,sparse_model,description,tags,args
0,mercury-sentences,2025-02-28 21:44:48.510318,30,0,,,,,"{'name': 'multilingual-e5-large', 'tokenizer':...",,,,


### Reload dataset (or another dataset from your local catalog)

In [93]:
new_ds = catalog.load_dataset('mercury-sentences')
new_ds.head()

Unnamed: 0,id,values,sparse_values,metadata,blob
0,95290eba-a884-43e9-8f25-fb56b581eca0,"[0.0259246826171875, 0.0168609619140625, -0.02...",,{'category': 'astronomy'},{'text': 'The smallest planet in our solar sys...
1,5eec80c1-0f39-4cf8-8826-d1098d44e8f7,"[0.0198974609375, 0.00733184814453125, -0.0004...",,{'category': 'astronomy'},{'text': 'Orbiting the Sun in just 88 Earth da...
2,efedb64b-8e3a-421d-b956-9e7c6f3774f5,"[0.006023406982421875, 0.0005712509155273438, ...",,{'category': 'astronomy'},"{'text': 'A rocky, cratered surface defines th..."
3,d3e5ac3e-5a74-4bc4-b016-ebcaed352fc0,"[0.01971435546875, -0.0247802734375, -0.024307...",,{'category': 'astronomy'},{'text': 'Extreme temperature fluctuations mar...
4,27ebe053-dda3-424d-baa8-0784fa5f9672,"[0.01239013671875, -0.007007598876953125, -0.0...",,{'category': 'astronomy'},"{'text': 'Unlike many other planets, it has no..."


# Save to bucket

Saving to a bucket in the cloud is basically the same except you will need to auth and specify the bucket location using the Catalog `base_path`. 

To authenticate:
- For Google Storage, run `gcloud auth login` to set auth credentials where the underlying `gcsfs` library can find them. Or, alternatively, set `GOOGLE_APPLICATION_CREDENTIALS` environment variable with a path to your service account json credentials. 
- For S3, set `AWSACCESSKEYID` and `AWSSECRETACCESSKEY`.

Once you've done that, you instantiate your Catalog with the bucket path and load/save datasets.

```pythom
from pinecone_dataset import Catalog

catalog = Catalog("gs://bucket-name")
catalog.list_datasets(as_df=True)
```