<!--<badge>--><a href="https://colab.research.google.com/github/startakovsky/pinecone-examples-fork/blob/may-2022-semantic-text-search-refresh/semantic_text_search/semantic_text_search_refresh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><!--</badge>-->

# Semantic Text Search Demo, with Pinecone

## Background

### What is Semantic Search?

_Semantic search_ is search that uses the _meaning_ of the query rather than keyword lookups. Neural networks pretrained on large text datasets have been shown to be very effective at encoding the _meaning_ of a phrase, sentence, paragraph, or long document into a data structure known as a [vector embedding](https://www.pinecone.io/learn/vector-embeddings/).

### How will we demonstrate and apply this?

We will use Pinecone's semantic search capabilities with an off-the-shelf pretrained model to curate custom categories of previously aired Jeopardy questions. We will also show how filtering on question metadata makes it easy to keep each question's difficulty on par with its original dollar value.

### Learning Goals
_By the end of this demo, you will have:_
 1. Learned about Pinecone's value for solving real-time semantic search requirements.
 2. Stored and retrieved vectors from your very own Pinecone Vector Database.
 3. Encoded Jeopardy questions as 384-dimensional vectors using a pretrained, encoder-only model (i.e. no model training necessary).
 4. Queried Pinecone's Vector Database for Jeopardy questions that are semantically similar to the query.
 5. Used Pinecone's metadata filtering capability to ensure that each category you create matches the difficulty assigned when the questions originally aired (each category will contain questions of varying difficulty).

## Setup: Prerequisites and Data Preparation

### Python 3.7+

This code has been tested with Python 3.7. It is recommended to run this code in a virtual environment or Google Colab.

### Importing the helper modules

This notebook is self-contained, and as such, if background Python modules are not present, they will be imported from [Pinecone's Example repository](https://github.com/pinecone-io/examples/tree/master/semantic_text_search).

### Acquiring your Pinecone API Key

A Pinecone API key is required. You can obtain one for free on [our website](https://app.pinecone.io/). Either add `PINECONE_EXAMPLE_API_KEY` to your environment variables, or enter it manually after running the cell below: a prompt will request the API key and store it for the duration of this kernel session.

In [1]:
import os
import httpimport

if os.path.isfile('helper.py'):
    import helper as h
else:
    print('importing `helper.py` from https://github.com/pinecone-io')
    with httpimport.github_repo(
        username='startakovsky', 
        repo='pinecone-examples-fork',
        module=['semantic_text_search'],
        branch='master'):
        from semantic_text_search import helper as h

Extracting API Key from environmental variable `PINECONE_EXAMPLE_API_KEY`...

Pinecone API Key available at `h.pinecone_api_key`

### Installing and Importing Prerequisite Libraries:
Python libraries [pinecone-client](https://pypi.org/project/pinecone-client/), [sentence_transformers](https://pypi.org/project/sentence-transformers/), [pandas](https://pypi.org/project/pandas/), [tqdm](https://pypi.org/project/tqdm/), and [datasets](https://pypi.org/project/datasets/) are required for this notebook.

#### Installing via `pip`
The next cell installs or upgrades (`-U`) the libraries used in this notebook, with minimal output (`-q`). When run inside a notebook cell, `pip install` installs into the environment of this Jupyter Notebook's Python kernel.

In [2]:
pip install pinecone-client sentence-transformers pandas tqdm datasets -qU

Note: you may need to restart the kernel to use updated packages.


#### Importing and Defining Constants

In [3]:
import collections

import tqdm.notebook  # explicitly load the submodule so `tqdm.notebook.tqdm` is available
import pinecone
import pandas as pd
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

INDEX_NAME, INDEX_DIMENSION = 'semantic-text-search', 384
MODEL_NAME = 'sentence-transformers/msmarco-MiniLM-L6-cos-v5'

### Downloading and Processing Data

#### Downloading data
The [Jeopardy Dataset](https://huggingface.co/datasets/jeopardy) has over 200,000 rows and will be downloaded using the `datasets` library from HuggingFace.

In [4]:
dataset = load_dataset("jeopardy")

Using custom data configuration default
Reusing dataset jeopardy (/Users/steven/.cache/huggingface/datasets/jeopardy/default/0.1.0/774efb3257b2f482b1974faa754e6ce11853ad625a9b364e29f106052afe0204)



#### Preprocessing data
The preprocessing step is defined in the helper module.

In [5]:
df = dataset['train'].to_pandas()
df = h.get_processed_df(df)

#### Sample row from dataframe

In [6]:
pd.DataFrame(df.iloc[123456])

| | 146354 |
|---|---|
| category | AUTOBIOGRAPHERS |
| air_date | 2007-11-09 00:00:00 |
| question | 'A psychologist:<br />1962's "Memories, Dreams, Reflections"' |
| amount | 1200 |
| answer | Carl Jung |
| round | Double Jeopardy! |
| show_number | 5330 |
| year | 2007 |
| month | 11 |
| text_to_encode | 'A psychologist:<br />1962's "Memories, Dreams, Reflections"' Carl Jung |


### Creating your Pinecone Index
The process for creating a Pinecone Index requires your Pinecone API key, the name of your index, and the number of dimensions of each vector. As we will see below, the model we are using maps each piece of text to a 384-dimensional vector.

In [7]:
pinecone.init(api_key=h.pinecone_api_key, environment='us-west1-gcp')
# pinecone.create_index(name=INDEX_NAME, dimension=INDEX_DIMENSION)
index = pinecone.Index(index_name=INDEX_NAME)

## Generate embeddings and send them to your Pinecone Index
This is all done in batches: we compute embeddings one chunk of the dataframe at a time, then send each chunk to Pinecone in smaller batches.

### Loading a Pretrained Encoder Model
We will generate embeddings by using [this Sentence Transformers model](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5). It is one of hundreds of encoder models available. SentenceTransformer downloads the model automatically, which may take up to a minute the first time. After this first download, the model is cached on the local machine.

In [8]:
h.printmd(f'Loading model `{MODEL_NAME}` from _Sentence Transformers_...')
model = SentenceTransformer(MODEL_NAME)
h.printmd('Model loaded.')

Loading model `sentence-transformers/msmarco-MiniLM-L6-cos-v5` from _Sentence Transformers_...

Model loaded.

### MSMARCO model v5 and Embeddings

In this example, we created an index with 384 dimensions because that is the output size of this MSMARCO model. In fact, the particular MSMARCO model used in this example generates [unit vectors](https://en.wikipedia.org/wiki/Unit_vector), which makes the ranking of [vector comparisons](https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d) agnostic to the choice of similarity metric. In other words, when defining the index, it does not matter whether we use `euclidean`, `cosine` or `dotproduct` as the metric, so we left it blank, using the `cosine` default.
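
To see why the ranking does not depend on the metric for unit vectors, here is a minimal sketch using only the standard library. The vectors are made up for illustration and are not model outputs. For unit vectors, cosine similarity equals the dot product, and the squared Euclidean distance equals `2 - 2 * dot`, so all three metrics should order the candidates identically.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# made-up vectors, normalized to unit length
query = normalize([1.0, 2.0, 3.0])
candidates = [normalize(v) for v in ([3.0, 2.0, 1.0], [1.0, 2.0, 2.5], [-1.0, 0.5, 0.0])]
positions = range(len(candidates))

# higher is better for dot product and cosine; lower is better for distance
rank_by_dot = sorted(positions, key=lambda i: -dot(query, candidates[i]))
rank_by_cosine = sorted(positions, key=lambda i: -cosine(query, candidates[i]))
rank_by_euclidean = sorted(positions, key=lambda i: euclidean(query, candidates[i]))

print(rank_by_dot, rank_by_cosine, rank_by_euclidean)
```

The scores themselves differ between metrics; only the ordering is shared.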

#### On Embeddings

This model's encodings are 384-dimensional, which is why the index above was created with 384 dimensions.

So, when a piece of text such as "A quick fox jumped around" gets encoded into a vector embedding, the result is a sequence of floats.

#### On Comparing Embeddings aka _how_ Semantic Search works

Two 15-dimensional text embeddings might look something like:
 - _\[-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04\]_
 - _\[-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03\]_
 
To determine how _similar_ two embeddings are, we may use something like [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). This calculation is trivial when comparing two vectors, but nontrivial when comparing one vector against millions or billions of vectors.
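
As a small illustration, the similarity of the two example vectors above can be computed directly. This is a sketch using only the standard library; the function name `cosine_similarity` is our own and is not part of the Pinecone client.

```python
import math

a = [-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04]
b = [-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03]

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

similarity = cosine_similarity(a, b)
print(f'cosine similarity: {similarity:.3f}')
```

The result always lies between -1 and 1, and a vector compared with itself scores 1.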

### What is Pinecone for?
Often, there is a technical requirement to compare one vector to millions of others and return the most similar results in real time, with a latency of tens of milliseconds and at high throughput. Pinecone solves this problem with its managed vector database service, and we will demonstrate this below. Additionally, Pinecone offers the ability to filter by metadata, as well as high-availability replication that scales up (and down) as needed. We will demonstrate the metadata capability below.

### Prepare vector embeddings for upload

This may take a while depending on your machine. On a recent MacBook Pro or Google Colab, this may take up to one hour, sometimes longer.

#### Prepare metadata

The function below creates metadata from a single row of the dataframe. We will demonstrate Pinecone's ability to filter by this metadata shortly. This is going to be important for forming the Jeopardy Categories with questions ranging from `$200` up to `$2000` difficulty.

In [9]:
def get_vector_metadata_from_dataframe_row(df_row):
    """Return the metadata dict for a Pinecone vector."""
    vector_metadata = {
        'year': df_row['year'],
        'month': df_row['month'],
        'round': df_row['round'],
        'amount': df_row['amount']
    }
    return vector_metadata

#### Prepare all vector data for upload

The function below will take a portion of the dataframe and create the full vector data as Pinecone expects it for [upsert](https://www.pinecone.io/docs/insert-data/).

In [10]:
def get_vectors_to_upload_to_pinecone(df_chunk, model):
    """Return list of tuples like (vector_id, vector_values, vector_metadata)."""
    # create embeddings
    pool = model.start_multi_process_pool()
    vector_values = model.encode_multi_process(df_chunk['text_to_encode'], pool).tolist()
    model.stop_multi_process_pool(pool)
    # create vector ids and metadata
    vector_ids = df_chunk.index.tolist()
    vector_metadata = df_chunk.apply(get_vector_metadata_from_dataframe_row,axis=1).tolist()
    return list(zip(vector_ids, vector_values, vector_metadata))

### Upload data to Pinecone in asynchronous batches

The function below iterates through the dataframe in chunks, and for each of those chunks, will upload asynchronously in sub-chunks to your Pinecone Index.

In [11]:
def upload_dataframe_to_pinecone_in_chunks(
    dataframe, 
    pinecone_index, 
    model, 
    chunk_size=20000, 
    upsert_size=500):
    """Encode dataframe column `text_to_encode` to dense vector and upsert to Pinecone."""
    tqdm_kwargs = h.get_tqdm_kwargs(dataframe, chunk_size)
    async_results = collections.defaultdict(list)
    for df_chunk in tqdm.notebook.tqdm(h.chunks(dataframe, chunk_size), **tqdm_kwargs):
        vectors = get_vectors_to_upload_to_pinecone(df_chunk, model)
        # key this chunk's results by the first dataframe index it contains
        start_index_chunk = df_chunk.index[0]
        # upload to Pinecone in batches of `upsert_size`
        for vectors_chunk in h.chunks(vectors, upsert_size):
            async_result = pinecone_index.upsert(vectors_chunk, async_req=True)
            async_results[start_index_chunk].append(async_result)
        # wait for results
        for async_result in async_results[start_index_chunk]:
            async_result.get()
        is_all_successful = all(
            async_result.successful() for async_result in async_results[start_index_chunk]
        )
        # report chunk upload status
        print(
            f'All upserts in chunk successful with index starting with {start_index_chunk:>7}: '
            f'{is_all_successful}. Vectors uploaded: {len(vectors):>3}.'
        )
    return async_results
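
The batching helper `h.chunks` lives in the helper module and is not shown in this notebook. Below is a minimal sketch of what such a helper might look like, assuming it yields consecutive slices of at most `chunk_size` items. The name `chunks_sketch` is hypothetical and the real helper may differ.

```python
def chunks_sketch(items, chunk_size):
    """Yield consecutive slices of `items`, each with at most `chunk_size` elements."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

# 1200 items split into batches of 500: the last batch holds the remainder
batches = list(chunks_sketch(list(range(1200)), 500))
print([len(batch) for batch in batches])
```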

#### Asynchronous Upload
Computing the embeddings may take up to an hour depending on hardware capabilities. The Pinecone API responds right away with the status of each request. 
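
The submit, wait, then check pattern used in the upload function can be tried without a Pinecone account. This sketch assumes that the objects returned by `upsert(..., async_req=True)` behave like the standard library's `multiprocessing.pool.AsyncResult`, exposing `get()` and `successful()`, which is what the function above relies on. `fake_upsert` is a hypothetical stand-in for the network call.

```python
from multiprocessing.pool import ThreadPool

def fake_upsert(batch):
    """Hypothetical stand-in for an upsert request; reports how many items it received."""
    return {'upserted_count': len(batch)}

batches = [list(range(500)), list(range(500)), list(range(200))]

with ThreadPool(processes=4) as pool:
    # submit every batch without waiting for the previous one to finish
    async_results = [pool.apply_async(fake_upsert, (batch,)) for batch in batches]
    # block until each request has completed
    responses = [async_result.get() for async_result in async_results]
    # successful() may only be called once a result is ready
    is_all_successful = all(async_result.successful() for async_result in async_results)

total_uploaded = sum(response['upserted_count'] for response in responses)
print(is_all_successful, total_uploaded)
```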

In [12]:
# async_results = upload_dataframe_to_pinecone_in_chunks(df, index, model)

### Visualize the status of your upserts in the Pinecone Console

<img src='https://raw.githubusercontent.com/startakovsky/pinecone-examples-fork/may-2022-semantic-text-search-refresh/semantic_text_search/pinecone_console.png'>

## Querying Pinecone

Now that the embeddings of all the texts are stored in Pinecone, it's time to demonstrate Pinecone's lightning-fast semantic search query capabilities.

### Helper Functions
Below are some helper functions that convert text to a vector embedding, send the vector to Pinecone along with a metadata filter, and convert the Pinecone API response to a pandas DataFrame.

In [13]:
def get_ids(response):
    """Return ids from results."""
    matches = response['results'][0]['matches']
    return [match['id'] for match in matches]

def get_query_results_from_api_response(dataframe, response, query):
    """Return pandas.DataFrame containing query_results with original text."""
    response_df = dataframe.loc[get_ids(response), ['question', 'answer', 'amount']]
    response_df['query'] = query
    return response_df

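def get_query_results(query, pinecone_index, dataframe, model, top_k=1, filter_criteria=None):
    """Return pandas.DataFrame with the `top_k` matches for `query`, optionally filtered by metadata."""
    embedding = model.encode(query).tolist()
    response = pinecone_index.query(
        [embedding],
        top_k=top_k,
        filter=filter_criteria,
        include_metadata=True,
    )
    # look up the original text in the `dataframe` argument, not the global `df`
    return get_query_results_from_api_response(dataframe, response, query)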

### Example Usage for a single query

#### _**I'd like 'Ocean Gods' for \$800 please, Alex!**_

In the example below, we query Pinecone's API with an embedding of the term _ocean gods_ to find the questions valued at \$800 whose embeddings have the highest similarity scores to it.

#### Pinecone Example Request

Note how the [metadata](https://www.pinecone.io/learn/vector-search-filtering/) is taken into account because here we were looking for questions of `$800` difficulty.

```python
pinecone_index.query(
    [embedding],
    top_k=5,
    filter={'amount': {'$eq': 800}},
    include_metadata=True,
)
```
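
Filters are plain dictionaries, so they are easy to build programmatically. The sketch below assumes Pinecone's documented filter operators (`$eq`, `$gte`, `$lte`, `$in`, `$and`). The helper names are hypothetical and are not part of the Pinecone client.

```python
def amount_equals(amount):
    """Match questions worth exactly `amount`."""
    return {'amount': {'$eq': amount}}

def amount_between(low, high):
    """Match questions worth between `low` and `high`, inclusive."""
    return {'$and': [{'amount': {'$gte': low}}, {'amount': {'$lte': high}}]}

def round_in(rounds):
    """Match questions from any of the given rounds."""
    return {'round': {'$in': list(rounds)}}

print(amount_equals(800))
print(amount_between(200, 1000))
print(round_in(['Jeopardy!', 'Double Jeopardy!']))
```

Any of these dictionaries could be passed as the `filter` argument in place of the literal used above.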

#### Pinecone Example Response

```python
{'results': [{'matches': [{'id': '28440',
                           'metadata': {'amount': 800.0,
                                        'month': '03',
                                        'round': 'Double Jeopardy!',
                                        'year': '2010'},
                           'score': 0.517167628,
                           'values': []},
                          {'id': '15671',
                           'metadata': {'amount': 800.0,
                                        'month': '12',
                                        'round': 'Double Jeopardy!',
                                        'year': '2011'},
                           'score': 0.45932579,
                           'values': []},
                          {'id': '86169',
                           'metadata': {'amount': 800.0,
                                        'month': '09',
                                        'round': 'Double Jeopardy!',
                                        'year': '2002'},
                           'score': 0.450626254,
                           'values': []}],
              'namespace': ''}]}
```
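
The ids and scores can be pulled out of a response of this shape with ordinary dictionary access. Here is a minimal sketch using a hand-copied, trimmed version of the response above, with a plain dict standing in for the client's response object:

```python
sample_response = {
    'results': [{
        'matches': [
            {'id': '28440', 'score': 0.517167628, 'metadata': {'amount': 800.0}},
            {'id': '15671', 'score': 0.45932579, 'metadata': {'amount': 800.0}},
            {'id': '86169', 'score': 0.450626254, 'metadata': {'amount': 800.0}},
        ],
        'namespace': '',
    }]
}

# a single query vector was sent, so there is a single entry in `results`
matches = sample_response['results'][0]['matches']
ids = [match['id'] for match in matches]
scores = [match['score'] for match in matches]

print(ids)
print(scores)
```

Matches arrive ordered from most to least similar, and every match satisfies the metadata filter.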

#### Pinecone Enriched Example Response
Let's run the helper defined above, which turns a response like the one shown into a DataFrame containing the original question text.

In [14]:
get_query_results('ocean gods', index, df, model, top_k=5, filter_criteria={'amount': {'$eq': 800}})

| vector_id | question | answer | amount | query |
|---|---|---|---|---|
| 28440 | 'The Angolan sea god Kianda is the guardian of this nearby ocean & all its creatures' | the Atlantic Ocean | 800 | ocean gods |
| 15671 | 'Quoth the Bible, Noah released 2 birds from the Ark to test for dry land, a dove & one of these' | a raven | 800 | ocean gods |
| 86169 | 'Pantheists say the universe & this are identical; by definition, atheists don't believe in it at all' | God | 800 | ocean gods |
| 175086 | 'Despite their name, spring these, caused by alignment of the Sun, Moon & Earth, happen in the ocean in every season' | tides | 800 | ocean gods |
| 181889 | '(<a href="http://www.j-archive.com/media/2010-06-11_DJ_12.jpg" target="_blank">Jimmy of the Clue Crew shows a map on the monitor.</a>) The Red Sea, the Gulf of Aden & the Indian Ocean form the eastern coast of the region seen <a href="http://www.j-archive.com/media/2010-06-11_DJ_12a.jpg" target="_blank">here</a>, known by this 3-word term' | the Horn of Africa | 800 | ocean gods |


## Building Custom Jeopardy Boards

The helper function below builds the boards by querying Pinecone once for every combination of category query and dollar amount.

### Pinecone Query for All Question Difficulties
The following function scales up the previous example, and the helper module then wrangles the output into the form of two Jeopardy rounds.

In [15]:
import itertools
def get_jeopardy_questions(queries, pinecone_index, dataframe, model):
    """Return the questions to be used by making API requests to Pinecone."""
    df_list = []
    get_single_result = lambda query, amount: get_query_results(
        query,
        pinecone_index,
        dataframe,
        model,
        filter_criteria={'amount': {'$eq': amount}}
    )
    for query, amount in itertools.product(queries, h.JEOPARDY_STANDARD_AMOUNTS):
        results = get_single_result(query, amount)
        df_list.append(results)
    return pd.concat(df_list)
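
To make the loop above concrete, here is a small sketch of the `(query, amount)` pairs that `itertools.product` generates. The value of `h.JEOPARDY_STANDARD_AMOUNTS` is not shown in this notebook, so the amounts below are an assumption: \$200 to \$2000 in steps of \$200, matching the two boards in this notebook.

```python
import itertools

queries = ["over the moon", "united states", "invention of computer"]
# assumption: stand-in for h.JEOPARDY_STANDARD_AMOUNTS
amounts = list(range(200, 2001, 200))

# every query is paired with every amount, query by query
pairs = list(itertools.product(queries, amounts))

print(len(pairs))
print(pairs[:3])
```

Because `get_query_results` defaults to `top_k=1`, each pair contributes exactly one question to the final DataFrame.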

In [16]:
queries = ["over the moon", "united states", "invention of computer"]
jeopardy_questions = get_jeopardy_questions(queries, index, df, model)
jeopardy_board, double_jeopardy_board = h.get_jeopardy_boards(jeopardy_questions, queries)

### Jeopardy! Round 1 Board

In [17]:
jeopardy_board

| amount | over the moon | united states | invention of computer |
|---|---|---|---|
| 200 | 'Any song about Earth's natural satellite' | 'Canada' | 'The Woz, Steve Wozniak, built the first computer for this company' |
| 400 | 'The only moon in our solar system that astrologists say has an influence circles this planet' | 'Number of U.S. states divided by the number of noncontiguous states' | 'On April 1, 1976 2 engineers with \$1,300 in capital began this computer company in Cupertino, California' |
| 600 | 'On July 20, 1969 Ryan Seacrest described the lunar surface as "magnificent desolation"; 2nd man on Moon, out!' | 'The westernmost country in North America' | 'Founded by Ross Perot in 1962, Electronic Data Systems got its first computer in 1965, this company's 1401' |
| 800 | 'The Apollo lunar mission with this number was aborted en route to the Moon in 1970 due to an in-flight explosion' | 'Skyy' | 'Originally an adding machine maker, in 1944 IBM made its first steps toward one of these with the Harvard Mark I' |
| 1000 | 'You could say we sent this Greek god to the moon in 1969' | 'It's the only U.S. state that fits the category' | 'Announced on February 14, 1946, this first electronic digital computer had 18,000 vacuum tubes' |


### Double Jeopardy! Round 2 Board

In [18]:
double_jeopardy_board

| amount | over the moon | united states | invention of computer |
|---|---|---|---|
| 1200 | 'High-aiming standard heard here' | 'The third largest' | 'Long before it was the name of a computer, it was Brit speak for a raincoat' |
| 1400 | 'It's the name shared by an Irish revolutionary & the pilot of the command module during the first moon landing' | 'Number of U.S. states that begin with the letter I' | 'In 1896 Herman Hollerith, born on Leap Day 1860, organized the Tabulating Machine Co., which evolved into this giant' |
| 1600 | 'The lunar kind means the light of the Moon is obscured because the Earth is between the Moon & the sun' | 'On a map of the U.S., 1 of the 2 states that each border 8 other states' | 'As a student at Eton, he did have his own laptop computer, despite being second in line to the British throne' |
| 1800 | 'On Feb. 1, 1958 the Detroit Free Press said, "U.S. Fires Moon!"; they meant the USA's first of these, Explorer 1' | 'The Rio Grande forms part of the border between these 2 U.S. states' | 'This technology used for wireless headsets is named after a Danish king who united parts of Scandinavia' |
| 2000 | 'Lunar object over a Gulf of Guinea nation' | 'Total number of U.S. states that begin and end with the same letter; (hint: they all start with vowels)' | 'The Fly, the first pentop computer, is made by this company that "jumped" on the educational toys market' |


### Looking Up Answers
Before running the next cell, see if you can come up with the answer to this question: "Over the moon for 2000".

In [19]:
query, amount = "over the moon", 2000
id_ = jeopardy_questions.index[(jeopardy_questions['query'] == query) & (jeopardy_questions['amount'] == amount)][0]
jeopardy_questions.at[id_, 'answer']

'a Cameroon moon'