<!--<badge>--><a href="https://colab.research.google.com/github/startakovsky/pinecone-examples-fork/blob/master/semantic_text_search/semantic_text_search_refresh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><!--</badge>-->

# Semantic Text Search Demo, with Pinecone

## Background

### What is Semantic Search?

_Semantic search_ is search that uses the _meaning_ of the query rather than keyword lookups. Neural networks pretrained on large sets of text data have been shown to be very effective at encoding the _meaning_ of a particular phrase, sentence, paragraph or long document into a data structure known as a [vector embedding](https://www.pinecone.io/learn/vector-embeddings/).

### How we will demonstrate this

We are going to use Pinecone's semantic search capabilities with an off-the-shelf, pretrained model to curate custom categories of previously aired Jeopardy questions. We will show how filtering on question metadata in Pinecone makes it easy to keep each question's difficulty on par with the dollar amount it was originally assigned.

### Learning Goals
_By the end of this demo, you will have:_
 1. Learned about Pinecone's value for solving real-time semantic search requirements.
 2. Stored and retrieved vectors from your very own Pinecone Vector Database.
 3. Encoded Jeopardy questions as 384-dimensional vectors using a pretrained, encoder-only model (i.e. no model training necessary).
 4. Queried Pinecone's Vector Database for Jeopardy questions that are semantically similar to the query.
 5. Used Pinecone's metadata filtering capability to ensure that the questions in each category you create match the difficulty they had when they originally aired (each category will contain questions of varying difficulty).

## Setup: Prerequisites and Data Preparation

### Download the Data
Download [this Jeopardy dataset](https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions) from Kaggle and place it in a `tmp` folder relative to this notebook (see the `CSV_FILEPATH` variable defined below). A free Kaggle account is required, and you will be prompted to sign in or create an account before the download can start.

### Python 3.7+

This code has been tested with Python 3.7. It is recommended to run this code in a virtual environment or Google Colab.

### Importing the helper modules

This notebook is self-contained, and as such, if background Python modules are not present, they will be imported from [Pinecone's Example repository](https://github.com/pinecone-io/examples/tree/master/semantic_text_search).

### Acquiring your Pinecone API Key

A Pinecone API key is required. You can obtain one for free on [our website](https://app.pinecone.io/). Either add `PINECONE_EXAMPLE_API_KEY` to your environment variables, or enter it manually after running the cell below (a prompt will request the API key and store it for the duration of this kernel session).

In [1]:
import sys
import os
import httpimport

if os.path.isfile('helper.py'):
    import helper as h
else:
    print('importing `helper.py` from https://github.com/pinecone-io')
    with httpimport.github_repo(
        username='startakovsky', 
        repo='pinecone-examples-fork',
        module=['semantic_text_search'],
        branch='master'):
        from semantic_text_search import helper as h

Extracting API Key from environmental variable `PINECONE_EXAMPLE_API_KEY`...

Pinecone API Key available at `h.pinecone_api_key`
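
The helper's key-lookup code is not shown in this notebook. As a rough sketch of the behavior described above (environment variable first, interactive prompt as a fallback), something like the following would work. The function name `get_api_key` and its parameters are hypothetical and are not part of `helper.py`.

```python
import getpass
import os


def get_api_key(env_var='PINECONE_EXAMPLE_API_KEY', environ=None, prompt=getpass.getpass):
    """Return the API key from the environment, prompting only if it is missing.

    Hypothetical stand-in for the helper's behavior, not its actual code.
    """
    environ = os.environ if environ is None else environ
    api_key = environ.get(env_var)
    if not api_key:
        # getpass hides the typed key so it does not appear in the notebook output
        api_key = prompt(f'Enter value for {env_var}: ')
    return api_key
```

Passing `environ` and `prompt` as parameters is only a convenience for trying the function without touching the real environment or waiting on keyboard input.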

### Installing and Importing Prerequisite Libraries:
Python libraries [pinecone-client](https://pypi.org/project/pinecone-client/), [sentence_transformers](https://pypi.org/project/sentence-transformers/), [pandas](https://pypi.org/project/pandas/), and [tqdm](https://pypi.org/project/tqdm/) are required for this notebook.

#### Installing via `pip`
The next line is equivalent to `pip install pinecone-client sentence-transformers pandas tqdm`. Invoking pip through `sys.executable` ensures that the packages are installed for the Python interpreter behind this Jupyter Notebook's kernel.

In [2]:
!{sys.executable} -m pip install pinecone-client sentence-transformers pandas tqdm -qU

#### Importing and Defining Constants

In [3]:
import collections

import tqdm.notebook  # binds `tqdm` and loads the submodule used by tqdm.notebook.tqdm below
import pinecone
import pandas as pd
from sentence_transformers import SentenceTransformer

CSV_FILEPATH = './tmp/jeopardy_csv.zip'
INDEX_NAME, INDEX_DIMENSION = 'semantic-text-search', 384
MODEL_NAME = 'sentence-transformers/msmarco-MiniLM-L6-cos-v5'

### Processing Data
The preprocessing step is self-explanatory and defined in the helper module.

In [4]:
df = pd.read_csv(CSV_FILEPATH)
df = h.get_processed_df(df)

#### Sample row from dataframe

In [5]:
pd.DataFrame(df.iloc[123456])

Unnamed: 0,146354
show_id,5330
date,2007-11-09 00:00:00
round,Double Jeopardy!
category,AUTOBIOGRAPHERS
amount,1200
question,"A psychologist: 1962's ""Memories, Dreams, Reflections"""
answer,Carl Jung
year,2007
month,11
text_to_encode,"A psychologist: 1962's ""Memories, Dreams, Reflections"" Carl Jung"


### Creating your Pinecone Index
The process for creating a Pinecone Index requires your Pinecone API key, the name of your index, and the number of dimensions of each vector. As we will see below, the model we are using maps each piece of text to a 384-dimensional vector.

In [6]:
pinecone.init(api_key=h.pinecone_api_key, environment='us-west1-gcp')
# pinecone.create_index(name=INDEX_NAME, dimension=INDEX_DIMENSION)
index = pinecone.Index(index_name=INDEX_NAME)

## Generate embeddings and send them to your Pinecone Index
This will all be done in batches: we compute embeddings one batch of rows at a time, then send each batch to Pinecone in smaller sub-batches.

### Loading a Pretrained Encoder Model
We will generate embeddings by using [this Sentence Transformers model](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5). It is one of hundreds of encoder models available. Downloads happen automatically with SentenceTransformer, and may take up to a minute the first time. After this first load, the model is cached and available on the local machine.

In [7]:
h.printmd(f'Loading model from _Sentence Transformers_: `{MODEL_NAME}`...')
model = SentenceTransformer(MODEL_NAME)
h.printmd('Model loaded.')

Loading model from _Sentence Transformers_: `sentence-transformers/msmarco-MiniLM-L6-cos-v5`...

Model loaded.

### MSMARCO model v5 and Embeddings

In this example, we created an index with 384 dimensions because that is the output dimension of this MSMARCO model. In fact, the particular MSMARCO model used in this example generates [unit vectors](https://en.wikipedia.org/wiki/Unit_vector), which makes the ranking produced by [vector comparisons](https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d) independent of the choice of similarity metric. In other words, when defining the index, it does not matter whether we use `euclidean`, `cosine` or `dotproduct` as the metric, so we left it unspecified, which uses the `cosine` default.
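
To see why the metric choice does not change the ranking for unit vectors, note that for unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance equals `2 - 2 * dot`. The sketch below checks this on a few made-up 3-dimensional vectors (illustrative values only, not model output) using plain Python.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors, made up for illustration only
query = normalize([1.0, 2.0, 0.5])
candidates = [normalize(v) for v in ([1.0, 1.5, 0.0], [-1.0, 0.2, 3.0], [0.3, 2.0, 1.0])]

indices = range(len(candidates))
# Higher is better for dot product and cosine; lower is better for Euclidean distance
rank_by_dot = sorted(indices, key=lambda i: -dot(query, candidates[i]))
rank_by_cosine = sorted(indices, key=lambda i: -cosine(query, candidates[i]))
rank_by_euclidean = sorted(indices, key=lambda i: euclidean(query, candidates[i]))

print(rank_by_dot, rank_by_cosine, rank_by_euclidean)
```

All three orderings agree. The scores themselves differ between metrics; only the order of the results is the same.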

#### On Embeddings

This model's encodings are 384-dimensional, which we knew before creating the index above.

So, when a piece of text such as "A quick fox jumped around" gets encoded into a vector embedding, the result is a sequence of floats.

#### On Comparing Embeddings aka _how_ Semantic Search works

Two 15-dimensional text embeddings might look something like:
 - _\[-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04\]_
 - _\[-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03\]_
 
To determine how _similar_ two embeddings are, we may use something like [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). This calculation is trivial when comparing two vectors, but nontrivial when one vector must be compared against millions or billions of vectors.
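
As a small worked example, the cosine similarity of the two 15-dimensional vectors listed above can be computed with the standard library alone. This is a minimal sketch; the function name `cosine_similarity` is our own.

```python
import math

a = [-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04]
b = [-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03]


def cosine_similarity(u, v):
    """Return the cosine of the angle between u and v (1 = same direction, -1 = opposite)."""
    dot_product = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot_product / (norm_u * norm_v)


similarity = cosine_similarity(a, b)
print(similarity)
```

A single comparison costs one pass over the vector's dimensions. Comparing a query against every stored vector repeats that work once per vector, which is what becomes expensive at the scale of millions or billions.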

### What is Pinecone for?
Often, there is a technical requirement to compare one vector against millions of others and return the most similar results in real time, with a latency of tens of milliseconds and at high throughput. Pinecone solves this problem with its managed vector database service, and we will demonstrate this below. Additionally, Pinecone offers the ability to filter by metadata as well as high-availability replication that will scale up (and down) as needed. We will demonstrate the metadata capability below.

### Prepare vector embeddings for upload

This may take a while depending on your machine. On a recent MacBook Pro or Google Colab, this may take up to one hour, sometimes longer.

#### Prepare metadata

The function below creates metadata from a single row of the dataframe. We will demonstrate Pinecone's ability to filter by this metadata shortly. This is going to be important for forming the Jeopardy Categories with questions ranging from `$200` up to `$2000` difficulty.

In [8]:
def get_vector_metadata_from_dataframe_row(df_row):
    """Return the metadata dict to store alongside a single vector."""
    vector_metadata = {
        'year': df_row['year'],
        'month': df_row['month'],
        'round': df_row['round'],
        'amount': df_row['amount']
    }
    return vector_metadata

#### Prepare all vector data for upload

The function below will take a portion of the dataframe and create the full vector data as Pinecone expects it for [upsert](https://www.pinecone.io/docs/insert-data/).

In [9]:
def get_vectors_to_upload_to_pinecone(df_chunk, model):
    """Return list of tuples like (vector_id, vector_values, vector_metadata)."""
    # create embeddings
    pool = model.start_multi_process_pool()
    vector_values = model.encode_multi_process(df_chunk['text_to_encode'].tolist(), pool).tolist()
    model.stop_multi_process_pool(pool)
    # create vector ids and metadata
    vector_ids = df_chunk.index.tolist()
    vector_metadata = df_chunk.apply(get_vector_metadata_from_dataframe_row, axis=1).tolist()
    return list(zip(vector_ids, vector_values, vector_metadata))

### Upload data to Pinecone in asynchronous batches

The function below iterates through the dataframe in chunks, and for each of those chunks, will upload asynchronously in sub-chunks to your Pinecone Index.

In [10]:
def upload_dataframe_to_pinecone_in_chunks(
    dataframe, 
    pinecone_index, 
    model, 
    chunk_size=20000, 
    upsert_size=500):
    """Encode dataframe column `text_to_encode` to dense vector and upsert to Pinecone."""
    tqdm_kwargs = h.get_tqdm_kwargs(dataframe, chunk_size)
    async_results = collections.defaultdict(list)
    for df_chunk in tqdm.notebook.tqdm(h.chunks(dataframe, chunk_size), **tqdm_kwargs):
        vectors = get_vectors_to_upload_to_pinecone(df_chunk, model)
        # key the async results by the first index label of this chunk
        start_index_chunk = df_chunk.index[0]
        # upload to Pinecone in batches of `upsert_size`
        for vectors_chunk in h.chunks(vectors, upsert_size):
            async_result = pinecone_index.upsert(vectors_chunk, async_req=True)
            async_results[start_index_chunk].append(async_result)
        # wait for every upsert in this chunk to finish before encoding the next chunk
        for async_result in async_results[start_index_chunk]:
            async_result.get()
        is_all_successful = all(
            async_result.successful() for async_result in async_results[start_index_chunk]
        )
        # report chunk upload status
        print(
            f'All upserts in chunk successful with index starting with {start_index_chunk:>7}: '
            f'{is_all_successful}. Vectors uploaded: {len(vectors):>3}.'
        )
    return async_results
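
The two-level batching above relies on `h.chunks`, whose source is not shown in this notebook. A minimal sketch of such a helper for list-like inputs, assuming it simply yields consecutive slices, might look like this (the implementation here is a hypothetical stand-in, not the helper's actual code):

```python
def chunks(items, size):
    """Yield successive slices of `items` containing at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 1200 placeholder (id, values, metadata) tuples, split into upsert-sized batches
vectors = [(str(i), [0.0, 0.0], {}) for i in range(1200)]
batch_sizes = [len(batch) for batch in chunks(vectors, 500)]
print(batch_sizes)  # [500, 500, 200]
```

The last batch is simply shorter when the total is not a multiple of the batch size.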

#### Asynchronous Upload
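Computing the embeddings may take up to an hour depending on hardware capabilities. Because each upsert is sent with `async_req=True`, the call returns right away with a handle to the pending request; calling `.get()` on that handle waits for the result.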

In [11]:
# async_results = upload_dataframe_to_pinecone_in_chunks(df, index, model)

### Visualize the status of your upserts in the Pinecone Console

<img src='https://raw.githubusercontent.com/startakovsky/pinecone-examples-fork/may-2022-semantic-text-search-refresh/semantic_text_search/pinecone_console.png'>

## Querying Pinecone

Now that all the text embeddings are stored in Pinecone, it's time to demonstrate Pinecone's lightning-fast semantic search query capabilities.

### Helper Functions
Below are some helper functions that convert text to a vector embedding, send the vector to Pinecone along with a metadata filter, and convert the Pinecone API JSON response to a Pandas DataFrame.

In [12]:
def get_query_ids_map_from_response(response, queries):
    """Return dict mapping each query to the list of vector ids matched for it."""
    matches = [response['results'][i]['matches'] for i in range(len(queries))]
    return {query: [match.get('id') for match in matches[i]] for i, query in enumerate(queries)}

def get_query_results_from_api_response(dataframe, response, queries):
    """Return pandas.DataFrame containing query_results with original text."""
    df_list = []
    query_ids_map = get_query_ids_map_from_response(response, queries)
    for query, ids in query_ids_map.items():
        single_query_response_df = dataframe.loc[ids, ['question', 'answer', 'amount']]
        single_query_response_df['query'] = query
        df_list.append(single_query_response_df)
    enriched_response_df = pd.concat(df_list)
    return enriched_response_df

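def get_query_results_from_queries(queries, pinecone_index, dataframe, model, filter_criteria=None):
    """Return pandas.DataFrame with the top match for each query, subject to `filter_criteria`."""
    embedding = model.encode(queries).tolist()
    response = pinecone_index.query(
        embedding,
        top_k=1,
        filter=filter_criteria,
        include_metadata=True,
    )
    # use the `dataframe` argument rather than the global `df`
    return get_query_results_from_api_response(dataframe, response, queries)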

### Example Usage for a single query

In the below example we get the following response from Pinecone that the above helper functions convert to a Pandas DataFrame. 

#### Pinecone Example Request

Note how the [metadata](https://www.pinecone.io/learn/vector-search-filtering/) is taken into account because here we were looking for questions of `$600` difficulty.

```python
pinecone_index.query(
    [embedding1, embedding2],
    top_k=1,
    filter={'amount': {'$eq': 600}},
    include_metadata=True,
)
```
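
To make the semantics of a filtered query concrete, here is a brute-force sketch in plain Python: keep only the records whose metadata matches, score them against the query, and return the best `top_k`. It only illustrates what the request asks for. It is not how Pinecone executes the query, and the function name `filtered_top_k` and the toy records are made up for this example.

```python
def filtered_top_k(query_vector, records, top_k=1, amount=None):
    """Brute-force stand-in for a filtered vector query.

    `records` is a list of (id, unit_vector, metadata) tuples, mirroring the upsert format.
    """
    # Step 1: keep only records whose metadata satisfies the equality filter
    candidates = [r for r in records if amount is None or r[2].get('amount') == amount]
    # Step 2: score the survivors (dot product equals cosine similarity for unit vectors)
    scored = [
        {'id': r[0], 'score': sum(q * x for q, x in zip(query_vector, r[1])), 'metadata': r[2]}
        for r in candidates
    ]
    # Step 3: return the best `top_k` matches, highest score first
    return sorted(scored, key=lambda match: match['score'], reverse=True)[:top_k]


# Made-up 2-dimensional unit vectors, for illustration only
records = [
    ('a', [1.0, 0.0], {'amount': 200}),
    ('b', [0.6, 0.8], {'amount': 600}),
    ('c', [0.0, 1.0], {'amount': 600}),
]
matches = filtered_top_k([1.0, 0.0], records, top_k=1, amount=600)
print(matches[0]['id'])  # 'b': record 'a' is closer but is excluded by the filter
```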

#### Pinecone Example Response

```
{'results': [{'matches': [{'id': '45481',
                           'metadata': {'amount': 600.0,
                                        'month': '03',
                                        'round': 'Jeopardy!',
                                        'year': '2007'},
                           'score': 0.59718585,
                           'values': []}],
              'namespace': ''},
             {'matches': [{'id': '6247',
                           'metadata': {'amount': 600.0,
                                        'month': '10',
                                        'round': 'Jeopardy!',
                                        'year': '2007'},
                           'score': 0.480220914,
                           'values': []}],
              'namespace': ''}]}
```
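
The ids in this response are what tie each match back to a row of the dataframe. The sketch below pulls them out of the response shown above, written here as a plain dict (an assumption for illustration; the helper functions index the real response object the same way).

```python
# The example response from above, as a plain dict
response = {'results': [
    {'matches': [{'id': '45481',
                  'metadata': {'amount': 600.0, 'month': '03', 'round': 'Jeopardy!', 'year': '2007'},
                  'score': 0.59718585,
                  'values': []}],
     'namespace': ''},
    {'matches': [{'id': '6247',
                  'metadata': {'amount': 600.0, 'month': '10', 'round': 'Jeopardy!', 'year': '2007'},
                  'score': 0.480220914,
                  'values': []}],
     'namespace': ''},
]}
queries = ['love', 'ocean life']

# Results come back in the same order as the query embeddings were sent
query_to_ids = {
    query: [match['id'] for match in result['matches']]
    for query, result in zip(queries, response['results'])
}
print(query_to_ids)  # {'love': ['45481'], 'ocean life': ['6247']}
```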

#### Pinecone Enriched Example Response
Do the results make sense for the queries `love` and `ocean life`? Note the ability to send one request to Pinecone's Vector Database for multiple queries.

In [13]:
get_query_results_from_queries(['love', 'ocean life'], index, df, model, filter_criteria={'amount': {'$eq': 600}})

id,question,answer,amount,query
45481,A rose has long been a symbol of love; rearrange its letters & you get the name of this Greek god of love,Eros,600,love
6247,"Check out exotic marine life at one of these, like the National one in Baltimore",an aquarium,600,ocean life


## Building Custom Jeopardy Boards

The helper function below repeats the filtered query from the previous example once for each standard dollar amount.

### Pinecone Query for All Question Difficulties
The following function scales up the previous example; its output is then wrangled into the form of two Jeopardy rounds.

In [14]:
def get_jeopardy_questions(queries, pinecone_index, dataframe, model):
    """Return the questions to be used by making API requests to Pinecone."""
    df_list = []
    get_result_from_amount = lambda amount: get_query_results_from_queries(
        queries,
        pinecone_index,
        dataframe,
        model,
        filter_criteria={'amount': {'$eq': amount}}
    )
    for amount in h.JEOPARDY_STANDARD_AMOUNTS:
        results = get_result_from_amount(amount)
        df_list.append(results)
    return pd.concat(df_list)

In [15]:
queries = ["over the moon", "united states", "invention of computer"]
jeopardy_questions = get_jeopardy_questions(queries, index, df, model)
jeopardy_board, double_jeopardy_board = h.get_jeopardy_boards(jeopardy_questions, queries)
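
The source of `h.get_jeopardy_boards` is not shown in this notebook. As a rough idea of the wrangling involved, the sketch below pivots flat `(query, amount, question)` rows into two boards keyed by amount and then by category. The function name `build_boards`, the split at 1000 (inferred from the boards displayed below), and the placeholder clue strings are all assumptions made for illustration.

```python
def build_boards(rows, queries, first_round_max=1000):
    """Pivot (query, amount, question) rows into two boards keyed by amount, then category."""
    jeopardy, double_jeopardy = {}, {}
    for row in rows:
        # amounts up to `first_round_max` go to round 1, the rest to round 2
        board = jeopardy if row['amount'] <= first_round_max else double_jeopardy
        board_row = board.setdefault(row['amount'], {query: None for query in queries})
        board_row[row['query']] = row['question']
    return jeopardy, double_jeopardy


queries = ['over the moon', 'united states']
# Placeholder clues, not real Jeopardy questions
rows = [
    {'query': query, 'amount': amount, 'question': f'<{query} clue for ${amount}>'}
    for query in queries
    for amount in (200, 400, 1200, 1400)
]
jeopardy, double_jeopardy = build_boards(rows, queries)
print(sorted(jeopardy), sorted(double_jeopardy))  # [200, 400] [1200, 1400]
```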

### Jeopardy! Round 1 Board

In [16]:
jeopardy_board

amount,over the moon,united states,invention of computer
200,Any song about Earth's natural satellite,"Canada, United States, Russia","The Woz, Steve Wozniak, built the first computer for this company"
400,The only moon in our solar system that astrologists say has an influence circles this planet,Number of U.S. states divided by the number of noncontiguous states,"""iWOZ"" tells how he went ""from computer geek to cult icon"" by inventing the personal computer"
600,U can't touch this lunar lander Neil Armstrong used; it was released & crashed back into the moon,The westernmost country in North America,"Founded by Ross Perot in 1962, Electronic Data Systems got its first computer in 1965, this company's 1401"
800,"Occurring the first new moon after the sun enters Aquarius, it's Vietnamese new year",Skyy,"Originally a computer term, it now refers to doing any number of jobs at the same time"
1000,You could say we sent this Greek god to the moon in 1969,It's the only U.S. state that fits the category,"Announced on February 14, 1946, this first electronic digital computer had 18,000 vacuum tubes"


### Double Jeopardy! Round 2 Board

In [17]:
double_jeopardy_board

amount,over the moon,united states,invention of computer
1200,High-aiming standard heard here,The third largest,"Long before it was the name of a computer, it was Brit speak for a raincoat"
1400,It's the name shared by an Irish revolutionary & the pilot of the command module during the first moon landing,Number of U.S. states that begin with the letter I,"In 1896 Herman Hollerith, born on Leap Day 1860, organized the Tabulating Machine Co., which evolved into this giant"
1600,The lunar kind means the light of the Moon is obscured because the Earth is between the Moon & the sun,"On a map of the U.S., 1 of the 2 states that each border 8 other states","As a student at Eton, he did have his own laptop computer, despite being second in line to the British throne"
1800,"On Feb. 1, 1958 the Detroit Free Press said, ""U.S. Fires Moon!""; they meant the USA's first of these, Explorer 1",The Rio Grande forms part of the border between these 2 U.S. states,This metal used to make semiconductors was discovered by Clemens Winkler & named for his homeland
2000,Lunar object over a Gulf of Guinea nation,Total number of U.S. states that begin and end with the same letter; (hint: they all start with vowels),"The Fly, the first pentop computer, is made by this company that ""jumped"" on the educational toys market"


### Looking Up Answers
Before running the next cell, see if you can come up with the answer to this question: "Over the moon for 400".

In [None]:
query, amount = "over the moon", 400
id_ = jeopardy_questions.index[(jeopardy_questions['query'] == query) & (jeopardy_questions['amount'] == amount)][0]
jeopardy_questions.at[id_, 'answer']