# First steps with Pinecone DB

This hands-on tutorial shows you how to load data into Pinecone DB, and how to search across the data using vector semantic search. 

We'll cover it in 3 parts:

1. Getting set up
2. Putting data in Pinecone DB
3. Searching Pinecone

### No LLMs here
You may be surprised to know that LLMs (large language models) like GPT-3.5, GPT-4, Llama 2, etc. are NOT used in any part of this tutorial. The only model we call is an embeddings model. Vector search all the way!

### What is Pinecone DB?
Pinecone DB (https://www.pinecone.io/) is a powerful, fully managed [vector database](https://www.pinecone.io/learn/vector-database/) that provides long-term memory and semantic search for modern apps.

### Tutorial use case
Have you ever struggled to remember the name of a movie? *"It's about this guy who gets stuck on Mars or something?"*

Well, we're going to build an app that can help you remember. We want to use whatever details you can remember about the movie to search movie summaries, and hopefully find a match! Let's call our app **"Total Movie Recall"**. 

We'll be using the [IMDB Top 1000 Movies Dataset](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows) from Kaggle for this tutorial. (*By the way, how is [Total Recall](https://www.imdb.com/title/tt0100802/) not in this list!?* 🤨)

<figure align="middle">
  <img src="./img/01-total-recall-app.png" width="800"/>
  <figcaption>Fig. 1 - Total Movie Recall app</figcaption>
</figure>

> **Note**:
>
> This is just a mock user interface for this app. In this tutorial we're going to be focused on the back-end, not the front-end.

Let's go!

## 1. Getting set up
This section shows you how to set up the Python dependencies you'll need. It also shows you how to get API keys for both Pinecone and OpenAI.

### Running the tutorial

If you want to run this tutorial yourself, you can find it hosted [here on Colab](https://colab.research.google.com/github/ninetack/blog-public/blob/main/content/002_blog/002_first_steps_with_pinecone.ipynb) as a Jupyter notebook (easiest), or you can find the original notebook file [here on GitHub](https://github.com/ninetack/blog-public/blob/main/content/002_blog/002_first_steps_with_pinecone.ipynb).

The only runtime requirement is Python 3.9 or later; the `list[str]` type hints used in the code below raise an error on older versions.

### Environment setup

Let's set up our environment, including installing dependencies and obtaining API keys.

#### Install dependencies

We install `pinecone-client` to talk to Pinecone, and the `openai` package because we'll be using the `text-embedding-ada-002` embedding model [from OpenAI](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings). We also install `pandas` to load and inspect the dataset, along with `tqdm`.

In [1]:
! python -m pip install -qU \
    pinecone-client==2.2.2 \
    openai==0.27.8 \
    pandas==2.0.3 \
    tqdm

#### Get a Pinecone API key

After creating or logging into your account with [Pinecone](https://app.pinecone.io), create a new **Project**. (Note: if you just created your account, you might already have a default Project, but to avoid any issues we suggest deleting that one and creating a brand new Project.)

Click on "API Keys" in the menu. You can use the default API key, but the best practice is to create a new API key specific to the client application, so you can manage it separately from anything else you build.

Copy both the **Environment** setting and the **API key**, which will look something like `5a5b3643-4cc9-48da-af3e-56eee93bf435`.

You'll enter them both below in the Environment variables section.

#### Get an OpenAI API key

If you don't have an OpenAI account, create one at https://platform.openai.com/.

> Note that you will need to establish billing info with OpenAI. Creating embeddings is not free, but the embeddings model we're using in this tutorial (`text-embedding-ada-002`) is incredibly cost efficient. See [OpenAI's embeddings documentation](https://platform.openai.com/docs/guides/embeddings/what-are-embeddings) for more info about estimating embeddings costs.
>
> It's also a good idea to visit the "Usage Limits" settings on your OpenAI account page, and establish spending limits that make sense for you to avoid getting a nasty surprise bill!

Click on "API Keys" in the menu. Creating a new API key specific to the client application is preferred.

You'll also enter this below.

#### Environment variables

We need to set 3 environment variables. You can edit the code below to set them directly.

- `PINECONE_ENVIRONMENT` - The Pinecone environment where your index resides
- `PINECONE_API_KEY` - Your Pinecone API key
- `OPENAI_API_KEY` - Your OpenAI API key

In practice, you'd likely set these in a private `.env` file or otherwise securely configure them in your runtime environment settings.
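
A minimal sketch of a fail-fast check for these three variables, using only the standard library. The `require_env` helper is a hypothetical name, not part of the Pinecone or OpenAI packages, and the example values are made up.

```python
import os

REQUIRED_VARS = ("PINECONE_ENVIRONMENT", "PINECONE_API_KEY", "OPENAI_API_KEY")

def require_env(names=REQUIRED_VARS, environ=None):
    """Return {name: value} for every required variable, or raise if any are unset."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}

# Example with an explicit mapping instead of the real process environment
settings = require_env(environ={
    "PINECONE_ENVIRONMENT": "example-env",
    "PINECONE_API_KEY": "example-key",
    "OPENAI_API_KEY": "example-key",
})
print(sorted(settings))
```

Raising early on a missing variable tends to give a clearer error than an authentication failure later in the notebook.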

In [3]:

import os

print("Check environment\n---------------------")

pinecone_env = os.environ.get('PINECONE_ENVIRONMENT') or "YOUR PINECONE ENVIRONMENT"
pinecone_api_key = os.environ.get('PINECONE_API_KEY') or "YOUR PINECONE API KEY"
openai_api_key = os.environ.get('OPENAI_API_KEY') or "YOUR OPENAI API KEY"

print("pinecone_env:", pinecone_env)
print("pinecone_api_key:", pinecone_api_key[:5], "...")
print("openai_api_key:", openai_api_key[:5], "...")

Check environment
---------------------
pinecone_env: us-west4-gcp-free
pinecone_api_key: 05131 ...
openai_api_key: sk-7w ...


### Dataset review

We're using the [IMDB Movies Dataset](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows) from Kaggle for this tutorial.

For simplicity, we'll assume it's already downloaded and stored in a CSV file relative to the current directory: `./data/imdb_top_1000_movies.csv` (although the [Kaggle Python library](https://pypi.org/project/kaggle/) could help with that part too). If the file isn't found locally, the code below falls back to a copy hosted on GitHub.

Let's have a look at the dataset:

In [26]:
import pandas as pd
from pathlib import Path

# try to find the dataset locally, otherwise download it from GH
dataset_file_path = "./data/imdb_top_1000_movies.csv"
if not Path(dataset_file_path).is_file():
  dataset_file_path = "https://raw.githubusercontent.com/ninetack/blog-public/main/content/002_blog/data/imdb_top_1000_movies.csv"

movies_df = pd.read_csv(dataset_file_path)
movies_df.head()

Unnamed: 0,Poster_Link,Series_Title,Released_Year,Certificate,Runtime,Genre,IMDB_Rating,Overview,Meta_score,Director,Star1,Star2,Star3,Star4,No_of_Votes,Gross
0,https://m.media-amazon.com/images/M/MV5BMDFkYT...,The Shawshank Redemption,1994,A,142 min,Drama,9.3,Two imprisoned men bond over a number of years...,80.0,Frank Darabont,Tim Robbins,Morgan Freeman,Bob Gunton,William Sadler,2343110,28341469
1,https://m.media-amazon.com/images/M/MV5BM2MyNj...,The Godfather,1972,A,175 min,"Crime, Drama",9.2,An organized crime dynasty's aging patriarch t...,100.0,Francis Ford Coppola,Marlon Brando,Al Pacino,James Caan,Diane Keaton,1620367,134966411
2,https://m.media-amazon.com/images/M/MV5BMTMxNT...,The Dark Knight,2008,UA,152 min,"Action, Crime, Drama",9.0,When the menace known as the Joker wreaks havo...,84.0,Christopher Nolan,Christian Bale,Heath Ledger,Aaron Eckhart,Michael Caine,2303232,534858444
3,https://m.media-amazon.com/images/M/MV5BMWMwMG...,The Godfather: Part II,1974,A,202 min,"Crime, Drama",9.0,The early life and career of Vito Corleone in ...,90.0,Francis Ford Coppola,Al Pacino,Robert De Niro,Robert Duvall,Diane Keaton,1129952,57300000
4,https://m.media-amazon.com/images/M/MV5BMWU4N2...,12 Angry Men,1957,U,96 min,"Crime, Drama",9.0,A jury holdout attempts to prevent a miscarria...,96.0,Sidney Lumet,Henry Fonda,Lee J. Cobb,Martin Balsam,John Fiedler,689845,4360000


There's potentially a lot of interesting info here, but for our app we'll focus on just two columns: title and description.

In [12]:
movies_df = movies_df[['Series_Title', 'Overview']].copy() # select the columns we care about
movies_df.columns = ['Title', 'Description'] # rename columns

movies_df.head()

Unnamed: 0,Title,Description
0,The Shawshank Redemption,Two imprisoned men bond over a number of years...
1,The Godfather,An organized crime dynasty's aging patriarch t...
2,The Dark Knight,When the menace known as the Joker wreaks havo...
3,The Godfather: Part II,The early life and career of Vito Corleone in ...
4,12 Angry Men,A jury holdout attempts to prevent a miscarria...


We'll also need a unique ID for each movie, so let's add one now, starting from 1000.

In [13]:
# IDs are stored as strings; `movie_id` avoids shadowing the built-in `id`
movies_df['ID'] = [str(movie_id) for movie_id in range(1000, 1000 + len(movies_df))]

movies_df.head()

Unnamed: 0,Title,Description,ID
0,The Shawshank Redemption,Two imprisoned men bond over a number of years...,1000
1,The Godfather,An organized crime dynasty's aging patriarch t...,1001
2,The Dark Knight,When the menace known as the Joker wreaks havo...,1002
3,The Godfather: Part II,The early life and career of Vito Corleone in ...,1003
4,12 Angry Men,A jury holdout attempts to prevent a miscarria...,1004


## 2. Putting data in Pinecone DB

Now that we're all set up, let's get our data into Pinecone!

We need to:
- Design the data model
- Create embeddings
- Create the Pinecone Index
- Insert to Pinecone

### Data modeling in Pinecone

*How* we store data in Pinecone is just as important as what data we store in Pinecone, and it's all based on how we intend to search the data.

#### Namespaces
Pinecone allows you to insert data into a `namespace`. Think of this as a logical grouping of data that identifies a search boundary. When searching, you specify the namespace you want to execute your search within. 

In our movie recall app, we want to allow searches by Description. So we'll put all of our data into a single namespace:
- `movie-descriptions`

#### Metadata
Pinecone also allows you to associate metadata with each vector stored in the DB. The primary use cases for metadata are:
- Filtering at search time
- Retrieval of associated/additional (non-vectored) content from search results

For our use case, we don't need to use metadata.

#### Our data model
Remember, we want to be able to search for a movie title based on a similarity search against its description.

For example, if someone enters "A guy gets sent back in time in a car", the app should return "Back to the Future".

So the vector search will be performed against the movie descriptions, which means we need the descriptions to be *vectorized* in the database. 

Finally, we want to return the name of a movie when we find a match, so we'll use the movie's unique ID as the ID of its vector in Pinecone. When we find a match, we can use that ID to look up the title and original description in our dataset.
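
The ID-based lookup described above can be sketched with a plain Python dictionary. The two sample rows are copied from the dataset preview; `titles_by_id` and `lookup` are illustrative names, and the tutorial itself uses a pandas filter for the same job.

```python
# Two rows from the dataset preview, keyed the same way as the vectors in Pinecone
movies = [
    {"ID": "1000", "Title": "The Shawshank Redemption"},
    {"ID": "1001", "Title": "The Godfather"},
]

# Build the lookup table once so each search result resolves with one dict access
titles_by_id = {movie["ID"]: movie["Title"] for movie in movies}

def lookup(match_id):
    """Return the title for a matched vector ID, or None if the ID is unknown."""
    return titles_by_id.get(match_id)

print(lookup("1001"))  # The Godfather
```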

### Overview of data flow

Here is an overview of the flow of data, including both indexing time and query time. 

<figure align="middle">
  <img src="./img/02-data-flow.png" width="800"/>
  <figcaption>Fig. 2 - Total Movie Recall app data flow</figcaption>
</figure>

In this section, we show you how to index your data in Pinecone. In the next section, we show you how to query.

### Creating embeddings

To enable semantic search, we need to encode our dataset using an embeddings model. You can learn more about embeddings in our post "[Intro to semantic search with vector databases](https://www.ninetack.io/post/intro-to-semantic-search-with-vector-databases)".

To get accurate search results, we need to use the same model to create embeddings for the searchable dataset as well as for each query against that dataset. 

We're going to use the `text-embedding-ada-002` model from OpenAI, which is both affordable and effective at preserving the semantic meaning of text.

Let's try creating embeddings for the first few items.

In [14]:
import openai
openai.api_key = openai_api_key

model_id = 'text-embedding-ada-002'

# grab the first 3 descriptions
batch_strs = movies_df['Description'][0:3].tolist()

# create the embeddings for all 3
embeddings_resp = openai.Embedding.create(input=batch_strs,
                                          model=model_id)['data']

print("Embedding results:\n----------------------")
for i, embrsp in enumerate(embeddings_resp):
  print(f'Input: "{batch_strs[i]}"')
  print("Embedding:", embrsp['embedding'][0:3], "...(truncated)")
  print("Embedding length:", len(embrsp['embedding']))
  print()

Embedding results:
----------------------
Input: "Two imprisoned men bond over a number of years, finding solace and eventual redemption through acts of common decency."
Embedding: [0.023489240556955338, -0.03612123429775238, 0.012475396506488323] ...(truncated)
Embedding length: 1536

Input: "An organized crime dynasty's aging patriarch transfers control of his clandestine empire to his reluctant son."
Embedding: [0.004656488541513681, -0.03917321190237999, 0.019279731437563896] ...(truncated)
Embedding length: 1536

Input: "When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice."
Embedding: [-0.00337048526853323, -0.03665561601519585, 0.010372190736234188] ...(truncated)
Embedding length: 1536



Great, so we can see each movie description is being encoded as a 1536-dimensional vector.

> **What's so special about 1536?**
> 
> That's the dimension size of the model we're using, `text-embedding-ada-002`. If you use a different embedding model, it will likely have a different dimension size. You'll need to remember this number when you create the Pinecone index below!
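
Because the index dimension has to match the model's output, a small guard can catch a mismatch before anything is uploaded. This is a sketch only: `check_dimension` is a hypothetical helper, and 1536 is the length observed in the output above.

```python
EXPECTED_DIM = 1536  # length of text-embedding-ada-002 vectors, as observed above

def check_dimension(vector, expected_dim=EXPECTED_DIM):
    """Return the vector unchanged, or raise ValueError if its length is wrong."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return vector

check_dimension([0.0] * 1536)  # passes silently

try:
    check_dimension([0.0] * 768)  # e.g. a vector from some other model
except ValueError as exc:
    print(exc)  # expected 1536 dimensions, got 768
```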

Let's wrap the call to create embeddings in a function that accepts a batch of strings as input and returns a list of embedding vectors, one per input string.

In [15]:
def create_embeddings(batch: list[str]):
  model_id = 'text-embedding-ada-002'
  embedding_resp = openai.Embedding.create(input=batch, model=model_id)
  return [emb['embedding'] for emb in embedding_resp['data']]

# try it out for the next few descriptions
embeddings = create_embeddings(batch=movies_df['Description'][4:6].tolist())
print(len(embeddings[0]), embeddings[0][0:3], "...")
print(len(embeddings[1]), embeddings[1][0:3], "...")


1536 [-0.020593425258994102, -0.02472022734582424, -0.0007298014825209975] ...
1536 [-0.011000500060617924, -0.02523421309888363, -0.024779541417956352] ...


Checkpoint! 

So far we have:
- selected an embedding model
- identified our vector data model
- created a function that can create embeddings

Next up, let's store some data in Pinecone!

### Creating a Pinecone index

We'll create the Pinecone index via the Pinecone web console (although it's possible to create it via the API as well).

Open up the Pinecone app at https://app.pinecone.io, click on Indexes, and then Create Index.

> Data Modeling Tip: Each Pinecone index can only store one 'shape' of thing. This means all the embeddings stored in the index must use the same embedding model, have the same dimensions, and same metric type setting. 
>
> For example, if we wanted to allow searching for similar movie posters (images), we would need to select an embeddings model that is trained on images, and the embeddings would need to be stored in a different index in Pinecone.

Give the index a name based on the use case, set the Dimension size (1536 based on our use of `text-embedding-ada-002`), and leave the default Metric set to `cosine`.

<figure align="middle">
  <img src="./img/03a-create-index.png" width="800"/>
  <figcaption>Fig. 3a - Create Pinecone index</figcaption>
</figure>

> Note: For a more detailed discussion of the Metric setting, including when you should set it to something else like Euclidean, this article from Pinecone provides a nice overview of the different similarity metric options: [Vector Similarity Explained](https://www.pinecone.io/learn/vector-similarity).
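
To make the `cosine` metric concrete, here is a small pure-Python sketch of cosine similarity. It illustrates the formula only and says nothing about how Pinecone computes scores internally.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction -> 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))  # orthogonal -> 0.0
```

Note that the first pair scores 1.0 even though the vectors differ in length: cosine similarity compares direction, not magnitude.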

The index will take a few minutes to initialize.

<figure align="middle">
  <img src="./img/03b-index-ready.png" width="800"/>
  <figcaption>Fig. 3b - Index ready</figcaption>
</figure>
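
If you'd rather create the index from code, the logic could look roughly like the sketch below. It assumes a client object exposing `list_indexes()` and `create_index(name, dimension=..., metric=...)`, which is how we understand the 2.x `pinecone-client` module to behave; check the docs for your installed version. `ensure_index` and `FakeClient` are hypothetical names, and the fake exists only so the sketch runs without an API key.

```python
def ensure_index(client, name, dimension=1536, metric="cosine"):
    """Create the index if it does not exist yet; return True if it was created."""
    if name in client.list_indexes():
        return False
    client.create_index(name, dimension=dimension, metric=metric)
    return True

class FakeClient:
    """In-memory stand-in (hypothetical) that mimics the two calls used above."""
    def __init__(self):
        self.indexes = {}

    def list_indexes(self):
        return list(self.indexes)

    def create_index(self, name, dimension, metric="cosine"):
        self.indexes[name] = {"dimension": dimension, "metric": metric}

client = FakeClient()
print(ensure_index(client, "movie-index"))  # True: index was created
print(ensure_index(client, "movie-index"))  # False: index already exists
```

Checking for the index first keeps the step safe to re-run when you execute the notebook more than once.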

### Inserting data to Pinecone

We'll use the `pinecone-client` lib to insert data into Pinecone. First we need to initialize it with our API key and environment.

In [16]:
import pinecone

pinecone.init(api_key=pinecone_api_key, environment=pinecone_env)

# verify client init by retrieving info about our index
pinecone.describe_index(name="movie-index")



IndexDescription(name='movie-index', metric='cosine', replicas=1, dimension=1536.0, shards=1, pods=1, pod_type='p1', status={'ready': True, 'state': 'Ready'}, metadata_config=None, source_collection='')

Each item you store in Pinecone has this structure:

| Name | Description |
|------|-------------|
| `id` | A unique ID used to manage the vector |
| `values` | The vector embedding itself |
| `metadata` | Key/value data that can be used for filtering query results or returning associated data |
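
As a sketch of that structure, the hypothetical `make_item` helper below builds one item as a plain dictionary and omits `metadata` when none is given, which matches how this tutorial stores its vectors. The string-ID check mirrors the string IDs we created earlier.

```python
def make_item(item_id, values, metadata=None):
    """Build one item dict; the metadata key is left out when not provided."""
    if not isinstance(item_id, str):
        raise TypeError("id must be a string")
    item = {"id": item_id, "values": [float(v) for v in values]}
    if metadata:
        item["metadata"] = dict(metadata)
    return item

print(make_item("1000", [0.1, 0.2, 0.3]))
# {'id': '1000', 'values': [0.1, 0.2, 0.3]}
```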

To store an item in Pinecone, we need to first get a reference to the index. Then we can use the `upsert` function, which will insert or update the item based on its ID.

Let's store the description for *The Shawshank Redemption* in Pinecone.

In [17]:
movie_index = pinecone.Index(index_name="movie-index")

shawshank_item = {
  'id': "1000",
  'values': create_embeddings(batch=movies_df['Description'][0:1].tolist())[0],
}

response = movie_index.upsert(vectors=[shawshank_item], namespace="movie-descriptions")
response

{'upserted_count': 1}

Where did we get the namespace "movie-descriptions" from? It came from our data model design above!

#### Batching upserts

For best performance, Pinecone recommends uploading batches of 100 embeddings at a time, and the same batch size works well for OpenAI's embeddings endpoint. Let's write a function to process our dataset in batches of 100.

It will create embeddings for 100 items and then upload those 100 embeddings to Pinecone, each keyed by its movie ID (per our data model, no metadata is needed).

In [18]:
batch_size = 100

def batch_process():
  total_items = len(movies_df)
  num_batches = total_items // batch_size
  if total_items % batch_size > 0:
    num_batches = num_batches + 1 

  print(f'Start processing {num_batches} batches for {total_items} total items')

  # process a batch at a time
  for i in range(num_batches):
    start = i * batch_size
    end = (i + 1) * batch_size
    df_batch = movies_df[start:end]

    print(f'Processing batch {i+1} with {len(df_batch)} items')

    # Create the embeddings
    embeddings = create_embeddings(batch=df_batch['Description'].tolist())

    # Create an array of items for uploading to Pinecone
    to_upload = [{
      'id': row['ID'],
      'values': embeddings[batch_idx],
    } for batch_idx, (df_idx, row) in enumerate(df_batch.iterrows())]

    response = movie_index.upsert(vectors=to_upload, namespace="movie-descriptions")
    print(response)

batch_process()

Start processing 10 batches for 1000 total items
Processing batch 1 with 100 items
{'upserted_count': 100}
Processing batch 2 with 100 items
{'upserted_count': 100}
Processing batch 3 with 100 items
{'upserted_count': 100}
Processing batch 4 with 100 items
{'upserted_count': 100}
Processing batch 5 with 100 items
{'upserted_count': 100}
Processing batch 6 with 100 items
{'upserted_count': 100}
Processing batch 7 with 100 items
{'upserted_count': 100}
Processing batch 8 with 100 items
{'upserted_count': 100}
Processing batch 9 with 100 items
{'upserted_count': 100}
Processing batch 10 with 100 items
{'upserted_count': 100}
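
The batch arithmetic in `batch_process` can also be expressed as a small generator of slice bounds. This is an alternative sketch, not the code used above; `batch_ranges` is a hypothetical helper.

```python
def batch_ranges(total_items, batch_size=100):
    """Yield (start, end) slice bounds that cover total_items in order."""
    for start in range(0, total_items, batch_size):
        yield start, min(start + batch_size, total_items)

print(list(batch_ranges(250)))
# [(0, 100), (100, 200), (200, 250)]
```

Stepping `range` by the batch size handles the final partial batch without a separate remainder check.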


Finally, let's verify all the vectors made it -- there should be 1000.

In [19]:
movie_index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'movie-descriptions': {'vector_count': 1000}},
 'total_vector_count': 1000}

Now that we've got data in Pinecone, let's search it!

## 3. Searching Pinecone

Going back to our Total Movie Recall app, we want to be able to provide a brief, possibly terrible, recollection of a movie and have the app recall the title.

For example, *"It's about this guy who gets stuck on Mars or something?"* should somehow come back as a match for *The Martian*.

### Running the search

To search the vector space we need to create embeddings for the query text, and then pass those to the `query` interface of our index.

We also need to specify the "movie-descriptions" namespace, since that is where the data resides within Pinecone.

Finally, we can specify the number of top search results we want to return using the `top_k` parameter.

In [20]:
query = "It's about this guy who gets stuck on Mars or something?"
query_emb = create_embeddings(batch=[query])[0]

query_resp = movie_index.query(vector=query_emb, namespace="movie-descriptions", top_k=3)
query_resp

{'matches': [{'id': '1329', 'score': 0.892151892, 'values': []},
             {'id': '1426', 'score': 0.853859782, 'values': []},
             {'id': '1623', 'score': 0.847651958, 'values': []}],
 'namespace': 'movie-descriptions'}

We're getting the IDs of the best-matching search results, as well as a score indicator (highest value == best match).
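
Since a higher score means a closer match, one option is to discard weak matches before showing them to a user. The sketch below works on plain dictionaries shaped like the matches above. The 0.85 cutoff is an arbitrary value chosen for illustration, not a recommended threshold, and `confident_matches` is a hypothetical helper.

```python
def confident_matches(matches, min_score=0.85):
    """Keep matches scoring at least min_score, best first."""
    kept = [m for m in matches if m["score"] >= min_score]
    return sorted(kept, key=lambda m: m["score"], reverse=True)

# IDs and scores copied from the query response above
matches = [
    {"id": "1329", "score": 0.892151892},
    {"id": "1426", "score": 0.853859782},
    {"id": "1623", "score": 0.847651958},
]
print([m["id"] for m in confident_matches(matches)])
# ['1329', '1426']
```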

Let's use that first result to look up the corresponding row in our dataset.

In [21]:
top_match_id = query_resp.matches[0]['id']

movies_df.loc[movies_df['ID'] == top_match_id]

Unnamed: 0,Title,Description,ID
329,The Martian,An astronaut becomes stranded on Mars after hi...,1329


The Martian! Incredible.

### Putting the app together

Let's write a function to encapsulate the search and the lookup, and then do a bit more testing.

In [22]:
def total_recall(query: str):
  query_emb = create_embeddings(batch=[query])[0]

  query_resp = movie_index.query(vector=query_emb, namespace="movie-descriptions", top_k=1)

  if len(query_resp.matches) == 0:
    return None

  top_match_id = query_resp.matches[0]['id']

  match_df = movies_df.loc[movies_df['ID'] == top_match_id]
  return {
    'title': match_df['Title'].values[0],
    'description': match_df['Description'].values[0]
  }

total_recall(query="It's about this guy who gets stuck on Mars or something?")

{'title': 'The Martian',
 'description': 'An astronaut becomes stranded on Mars after his team assume him dead, and must rely on his ingenuity to find a way to signal to Earth that he is alive.'}

And now we'll run some additional test queries.

In [23]:
test_queries = [
  "A guy fights to save his home in Scotland", # Braveheart
  "A guy that eats people helps a cop",        # The Silence of the Lambs
  "A guy gets sent back in time in a car",     # Back to the Future
  "A guy is the head of a mafia family",       # The Godfather
  "A guy goes crazy in a deserted hotel",      # The Shining
]

for testq in test_queries:
  result = total_recall(query=testq)
  print(f'"{testq}"')
  print("--------------------------------------------------")
  if result is None:  # total_recall returns None when there are no matches
    print("No match found")
  else:
    print("Title:", result['title'])
    print("Description:", result['description'])
  print()

"A guy fights to save his home in Scotland"
--------------------------------------------------
Title: Braveheart
Description: Scottish warrior William Wallace leads his countrymen in a rebellion to free his homeland from the tyranny of King Edward I of England.

"A guy that eats people helps a cop"
--------------------------------------------------
Title: The Silence of the Lambs
Description: A young F.B.I. cadet must receive the help of an incarcerated and manipulative cannibal killer to help catch another serial killer, a madman who skins his victims.

"A guy gets sent back in time in a car"
--------------------------------------------------
Title: Back to the Future
Description: Marty McFly, a 17-year-old high school student, is accidentally sent thirty years into the past in a time-traveling DeLorean invented by his close friend, the eccentric scientist Doc Brown.

"A guy is the head of a mafia family"
--------------------------------------------------
Title: The Godfather
Descript

## Wrap-up

### It works!

As you can see, vector semantic search is very powerful. We were able to find matching movie titles despite providing only the bare minimum of information about the movie.

Importantly, we can see that the search is *not* relying on keyword search, but is instead finding similarities in the *meanings* of words in the description and query.

<figure align="middle">
  <img src="./img/04-total-recall-godfather.png" width="800"/>
  <figcaption>Fig. 4 - Semantic search, not keyword search</figcaption>
</figure>

### Next steps

#### 1. Expand to TV shows?
Now that our Total Movie Recall app is working great for movies, you might consider expanding it to include TV shows. (We'll leave that as an exercise for the reader.)

If we were to do that, we'd likely use this dataset from Kaggle: [IMDB Top 250 TV Shows](https://www.kaggle.com/datasets/khushipitroda/imdb-top-250-tv-shows). 

Based on whether we want to search movies and TV shows together or separately, we could either store TV show descriptions in the same namespace as movies or in a separate namespace.
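
That choice could be captured in one small function, sketched below. Only `movie-descriptions` comes from this tutorial; the `tv-descriptions` and `all-descriptions` namespace names, and the `namespace_for` helper itself, are hypothetical.

```python
def namespace_for(content_type, combined=False):
    """Pick the namespace to use when upserting or querying an item."""
    if combined:
        return "all-descriptions"  # hypothetical shared namespace
    namespaces = {
        "movie": "movie-descriptions",  # the namespace used in this tutorial
        "tv": "tv-descriptions",        # hypothetical namespace for TV shows
    }
    if content_type not in namespaces:
        raise ValueError(f"unknown content type: {content_type}")
    return namespaces[content_type]

print(namespace_for("tv"))                    # tv-descriptions
print(namespace_for("movie", combined=True))  # all-descriptions
```

With a shared namespace, movie and TV show IDs would need to come from non-overlapping ranges, because an upsert with an existing ID updates that item rather than adding a new one.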

#### 2. Deploy the app in a user interface 
Our tutorial is long enough at this point, but you can imagine how this application could be packaged and rolled out to end users as a simple chat interface.

Note that if we were to build this as a full app, we'd likely start with one of Vercel's excellent [Next.js app templates](https://vercel.com/templates), such as [Next.js AI Chatbot](https://vercel.com/templates/next.js/nextjs-ai-chatbot).


## We'd love to talk with you

Ninetack is dedicated to helping our clients leverage the latest technologies to build innovative solutions for every industry.

We'd love to talk with you about how you're planning to incorporate vector search in your next AI application. Connect with us today @ ninetack.io!