<!--<badge>--><a href="https://colab.research.google.com/github/startakovsky/pinecone-examples-fork/blob/may-2022-semantic-text-search-refresh/semantic_text_search/semantic_text_search_refresh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a><!--</badge>-->

# Do-it-yourself Jeopardy Board Curation with Pinecone

## Background

### What is Semantic Search and how will we use it?

_Semantic search_ is search that matches on the _meaning_ of a query rather than on keyword lookups. Neural networks pretrained on large sets of text data have been shown to be very effective at encoding the _meaning_ of a phrase, sentence, paragraph or long document into a data structure known as a [vector embedding](https://www.pinecone.io/learn/vector-embeddings/).

We are going to use Pinecone's semantic search capabilities with an off-the-shelf, pretrained model to curate custom categories of previously aired Jeopardy questions. We will also show how Pinecone makes it easy to keep question difficulty on par with the dollar amount each question originally aired for.

### Learning Goals and Estimated Reading Time
_By the end of this 10 minute demo, you will have:_
 1. Learned about Pinecone's value for solving realtime semantic search requirements!
 2. Stored and retrieved vectors from your very own Pinecone Vector Database.
 3. Encoded Jeopardy Questions as 384-dimensional vectors using a pretrained, encoder-only model (i.e. no model training necessary).
 4. Queried Pinecone's Vector Database for Jeopardy Questions that are semantically similar to the query.
 5. Bonus for the Interested Reader: Near-Instant Custom Jeopardy Board Creation With Increasing Question Difficulty!
 
If you want to execute the code yourself, either in Google Colab or on your own computer, it may take up to an hour depending on processing speed.

## Setup: Prerequisites and Data Preparation

### Python 3.7+

This code has been tested with Python 3.7. It is recommended to run this code in a virtual environment or Google Colab.

### Importing the helper modules

This notebook is self-contained, and as such, if background Python modules are not present, they will be imported from [Pinecone's Example repository](https://github.com/pinecone-io/examples/tree/master/semantic_text_search).

### Acquiring your Pinecone API Key

A Pinecone API key is required. You can obtain one for free on [our website](https://app.pinecone.io/). Either add `PINECONE_EXAMPLE_API_KEY` to your environment variables, or enter it manually after running the cell below (a prompt will pop up requesting the API key, and the result is stored only within this kernel session).

In [1]:
import os
import httpimport

if os.path.isfile('helper.py'):
    import helper as h
else:
    print('importing `helper.py` from https://github.com/pinecone-io')
    with httpimport.github_repo(
        username='startakovsky', 
        repo='pinecone-examples-fork',
        module=['semantic_text_search'],
        branch='master'):
        from semantic_text_search import helper as h

Extracting API Key from environmental variable `PINECONE_EXAMPLE_API_KEY`...

Pinecone API Key available at `h.pinecone_api_key`
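
The helper's source is not shown in this notebook, so the following is only a minimal sketch of the lookup described above: read the key from the environment and fall back to a hidden prompt. The function name `get_api_key` and its `prompt` parameter are assumptions made for illustration, not the helper's actual API.

```python
import os
from getpass import getpass


def get_api_key(env_var="PINECONE_EXAMPLE_API_KEY", prompt=getpass):
    """Return the key from the environment, or ask for it interactively.

    Hypothetical helper: `get_api_key` and `prompt` are illustrative names.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        # Fall back to a hidden prompt; the value lives only in this session.
        api_key = prompt(f"Enter {env_var}: ")
    return api_key
```

Calling `get_api_key()` in a notebook would return the environment value when it is set and otherwise show an input box that does not echo what is typed.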

### Installing and Importing Prerequisite Libraries:
Python libraries [pinecone-client](https://pypi.org/project/pinecone-client/), [sentence_transformers](https://pypi.org/project/sentence-transformers/), [pandas](https://pypi.org/project/pandas/), [tqdm](https://pypi.org/project/tqdm/), and [datasets](https://pypi.org/project/datasets/) are required for this notebook.

#### Installing via `pip`
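The next cell runs `pip install pinecone-client sentence-transformers pandas tqdm datasets`, quietly (`-q`) and upgrading any existing versions (`-U`). If the `pip` on your path does not belong to this Jupyter Notebook's Python kernel, invoking it through _sys.executable_ (`{sys.executable} -m pip install ...`) is a way of ensuring the right one is used.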

In [2]:
!pip install pinecone-client sentence-transformers pandas tqdm datasets -qU

#### Importing and Defining Constants

In [3]:
import collections

import tqdm.notebook  # binds `tqdm` and loads the `notebook` submodule used during upload
import pinecone
import pandas as pd
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

INDEX_NAME, INDEX_DIMENSION = 'semantic-text-search', 384
MODEL_NAME = 'sentence-transformers/msmarco-MiniLM-L6-cos-v5'

### Downloading and Processing Data

#### Downloading data
The [Jeopardy Dataset](https://huggingface.co/datasets/jeopardy) has over 200,000 rows and will be downloaded using the `datasets` library from HuggingFace.

In [4]:
dataset = load_dataset("jeopardy")

Using custom data configuration default
Reusing dataset jeopardy (/Users/steven/.cache/huggingface/datasets/jeopardy/default/0.1.0/774efb3257b2f482b1974faa754e6ce11853ad625a9b364e29f106052afe0204)



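#### Preprocessing
The preprocessing step is defined in the helper module (`h.get_processed_df`). As the sample row below shows, the processed dataframe includes a `text_to_encode` column that joins each question with its answer.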

In [5]:
df = dataset['train'].to_pandas()
df = h.get_processed_df(df)

#### Sample row from dataframe

In [6]:
pd.DataFrame(df.iloc[123456])

| | 146354 |
|---|---|
| category | AUTOBIOGRAPHERS |
| air_date | 2007-11-09 00:00:00 |
| question | 'A psychologist:<br />1962's "Memories, Dreams, Reflections"' |
| amount | 1200 |
| answer | Carl Jung |
| round | Double Jeopardy! |
| show_number | 5330 |
| year | 2007 |
| month | 11 |
| text_to_encode | 'A psychologist:<br />1962's "Memories, Dreams, Reflections"' Carl Jung |


### Creating your Pinecone Index
The process for creating a Pinecone Index requires your Pinecone API key, the name of your index, and the number of dimensions of each vector. As we will see below, the model we are using maps each piece of text to a 384-dimensional vector.

In [7]:
pinecone.init(api_key=h.pinecone_api_key, environment='us-west1-gcp')
# pinecone.create_index(name=INDEX_NAME, dimension=INDEX_DIMENSION)
index = pinecone.Index(index_name=INDEX_NAME)

## Generate embeddings and send them to your Pinecone Index
This will all be done in batches. We will compute embeddings for one large chunk of the data at a time, then send each chunk's vectors to Pinecone in smaller batches.

### Loading a Pretrained Encoder model.
We will generate embeddings by using [this Sentence Transformers model](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5). It is one of hundreds of encoder models available. Downloads happen automatically with SentenceTransformer, and may take up to a minute the first time. After this first download, the model is cached and available on the local machine.

In [8]:
h.printmd(f'Loading model from _Sentence Transformers_: `{MODEL_NAME}` from Sentence Transformers...')
model = SentenceTransformer(MODEL_NAME)
h.printmd('Model loaded.')

Loading model from _Sentence Transformers_: `sentence-transformers/msmarco-MiniLM-L6-cos-v5` from Sentence Transformers...

Model loaded.

### MSMARCO model v5 and Embeddings

In this example, we created an index with 384 dimensions because that is the output size of this MSMARCO model. In fact, the particular MSMARCO model used in this example generates [unit vectors](https://en.wikipedia.org/wiki/Unit_vector), which makes the ranking produced by [vector comparisons](https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d) agnostic to one's choice of similarity metric. In other words, when defining the index, it does not matter whether we use `euclidean`, `cosine` or `dotproduct` as the metric, so we left it blank, using the `cosine` default.
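
To make the unit-vector claim concrete, here is a small pure-Python sketch (toy 3-dimensional vectors, not real model output) that ranks the same candidates by dot product, cosine similarity and Euclidean distance. For unit vectors, squared Euclidean distance equals `2 - 2 * dot`, so all three orderings should agree.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy data for illustration only; every vector is normalized to unit length.
query = normalize([1.0, 2.0, 0.5])
candidates = {
    "a": normalize([1.0, 1.8, 0.7]),
    "b": normalize([-1.0, 0.2, 0.3]),
    "c": normalize([0.1, 2.0, -1.0]),
}

# Higher is better for dot product and cosine; lower is better for distance.
by_dot = sorted(candidates, key=lambda k: dot(query, candidates[k]), reverse=True)
by_cosine = sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)
by_euclidean = sorted(candidates, key=lambda k: euclidean(query, candidates[k]))

print(by_dot, by_cosine, by_euclidean)
```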

#### On Embeddings

This model's encodings are 384-dimensional, which we knew before creating the index above.

So, when a piece of text such as "A quick fox jumped around" gets encoded into a vector embedding, the result is a sequence of floats.

#### On Comparing Embeddings aka _how_ Semantic Search works

Two 15-dimensional text embeddings might look something like: 
 - _\[-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04\]_
 - _\[-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03\]_
 
To determine how _similar_ two embeddings are, we may use a measure such as [cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity). This calculation is trivial when comparing two vectors, but nontrivial when needing to compare one vector against millions or billions of vectors.
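
As a quick illustration, the two toy 15-dimensional vectors listed above can be compared with a few lines of plain Python. This is a sketch of the standard cosine similarity formula, not of how Pinecone computes it internally.

```python
import math

a = [-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02,
     0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04]
b = [-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04,
     -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


similarity = cosine_similarity(a, b)
print(f"cosine similarity: {similarity:.3f}")
```

The result always lies between -1 and 1. For these two vectors it comes out negative, meaning they are not similar to each other.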

### What is Pinecone for?
Often, there is a technical requirement to compare one vector against millions of others and return the most similar results in real time, with a latency of tens of milliseconds and at high throughput. Pinecone solves this problem with its managed vector database service, and we will demonstrate this below. 
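
For contrast, the naive alternative is a brute-force scan that scores every stored vector against the query. The sketch below uses random toy vectors and hypothetical names (`brute_force_top_k`, `stored`); it only illustrates why the cost grows linearly with the number of vectors and says nothing about how Pinecone indexes data.

```python
import heapq
import math
import random


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def brute_force_top_k(query, vectors, k):
    """Score every stored vector against the query and keep the k best.

    Work done is proportional to (number of vectors) x (dimension).
    """
    scored = ((dot(query, values), vector_id) for vector_id, values in vectors.items())
    return heapq.nlargest(k, scored)


random.seed(0)
dimension = 8
# Toy unit vectors standing in for stored embeddings.
stored = {
    f"vec-{i}": normalize([random.uniform(-1, 1) for _ in range(dimension)])
    for i in range(1000)
}
query = stored["vec-42"]

top_matches = brute_force_top_k(query, stored, k=3)
best_score, best_id = top_matches[0]
print(best_id, round(best_score, 3))
```

Because the query here is itself one of the stored unit vectors, it is returned as its own best match, which is a handy sanity check for any similarity search.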

### Prepare vector embeddings for upload

This may take a while depending on your machine. On a recent MacBook Pro or Google Colab, this may take up to one hour, sometimes longer.

#### Prepare metadata

The function below creates metadata from a single row of the dataframe. This will be important further down in this notebook, where we may want to apply additional filters in our queries.

In [9]:
def get_vector_metadata_from_dataframe_row(df_row):
    """Return the metadata dict to store alongside a Pinecone vector."""
    vector_metadata = {
        'year': df_row['year'],
        'month': df_row['month'],
        'round': df_row['round'],
        'amount': df_row['amount']
    }
    return vector_metadata

#### Prepare all vector data for upload

The function below will take a portion of the dataframe and create the full vector data as Pinecone expects it for [upsert](https://www.pinecone.io/docs/insert-data/).

In [10]:
def get_vectors_to_upload_to_pinecone(df_chunk, model):
    """Return list of tuples like (vector_id, vector_values, vector_metadata)."""
    # create embeddings
    pool = model.start_multi_process_pool()
    vector_values = model.encode_multi_process(df_chunk['text_to_encode'], pool).tolist()
    model.stop_multi_process_pool(pool)
    # create vector ids and metadata
    vector_ids = df_chunk.index.tolist()
    vector_metadata = df_chunk.apply(get_vector_metadata_from_dataframe_row,axis=1).tolist()
    return list(zip(vector_ids, vector_values, vector_metadata))
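
To show the shape of the data that function returns, here is a toy sketch with made-up 3-dimensional values standing in for the real 384-dimensional embeddings. The ids and metadata values are placeholders; only the metadata keys are taken from the function above.

```python
# Placeholder ids, values and metadata for illustration only.
vector_ids = ["0", "1"]
vector_values = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
vector_metadata = [
    {"year": "2006", "month": "01", "round": "Jeopardy!", "amount": 200},
    {"year": "2007", "month": "05", "round": "Double Jeopardy!", "amount": 800},
]

# One (id, values, metadata) tuple per vector, in matching order.
vectors = list(zip(vector_ids, vector_values, vector_metadata))
print(vectors[0])
```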

### Upload data to Pinecone in asynchronous batches

The function below iterates through the dataframe in chunks, and for each of those chunks, will upload asynchronously in sub-chunks to your Pinecone Index.

In [11]:
def upload_dataframe_to_pinecone_in_chunks(
    dataframe, 
    pinecone_index, 
    model, 
    chunk_size=20000, 
    upsert_size=500):
    """Encode dataframe column `text_to_encode` to dense vector and upsert to Pinecone."""
    tqdm_kwargs = h.get_tqdm_kwargs(dataframe, chunk_size)
    async_results = collections.defaultdict(list)
    for df_chunk in tqdm.notebook.tqdm(h.chunks(dataframe, chunk_size), **tqdm_kwargs):
        vectors = get_vectors_to_upload_to_pinecone(df_chunk, model)
        # key this chunk's async results by the first dataframe index it covers
        start_index_chunk = df_chunk.index[0]
        # upload to Pinecone in batches of `upsert_size`
        for vectors_chunk in h.chunks(vectors, upsert_size):
            async_result = pinecone_index.upsert(vectors_chunk, async_req=True)
            async_results[start_index_chunk].append(async_result)
        # wait for results; `.get()` blocks until the upsert has finished
        for async_result in async_results[start_index_chunk]:
            async_result.get()
        is_all_successful = all(
            async_result.successful() for async_result in async_results[start_index_chunk]
        )
        # report chunk upload status
        print(
        f'All upserts in chunk successful with index starting with {start_index_chunk:>7}: '
        f'{is_all_successful}. Vectors uploaded: {len(vectors):>3}.'
        )
    return async_results
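
The `h.chunks` helper used above is defined in the helper module and not shown here. A minimal sketch of such a helper, assuming it simply yields consecutive slices, might look like this (the real implementation may differ, for example by using `.iloc` for dataframes):

```python
def chunks(items, chunk_size):
    """Yield consecutive slices of `items`, each at most `chunk_size` long.

    Hypothetical stand-in for `h.chunks`; works on any sliceable sequence.
    """
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]


batches = list(chunks(list(range(7)), 3))
print(batches)
```

The last batch is allowed to be shorter than `chunk_size`, so no items are dropped.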

#### Asynchronous Upload
Computing the embeddings may take up to an hour depending on hardware capabilities. By contrast, the Pinecone API returns right away from its [async](https://www.pinecone.io/docs/insert-data/#sending-upserts-in-parallel) requests, and the results are collected once each chunk has been sent. 

Note: You may see the following output a few times when executing. It is not an error: 
```
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
```

In [12]:
# async_results = upload_dataframe_to_pinecone_in_chunks(df, index, model)
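
The upload function treats each async result as an object with `.get()` and `.successful()`, which is the same interface as `multiprocessing.pool.AsyncResult` in the standard library. The sketch below reproduces that submit-then-collect pattern with a local thread pool and a stand-in `fake_upsert` function (an assumption for illustration; it makes no network calls and is not Pinecone's API).

```python
from multiprocessing.pool import ThreadPool


def fake_upsert(batch):
    """Stand-in for a network call; returns how many vectors it 'stored'."""
    return len(batch)


# Two toy batches of (id, values) pairs.
batches = [[("id-0", [0.1]), ("id-1", [0.2])], [("id-2", [0.3])]]

with ThreadPool(processes=4) as pool:
    # apply_async returns immediately with a handle to the pending result.
    async_results = [pool.apply_async(fake_upsert, (batch,)) for batch in batches]
    # .get() blocks until that call has finished and re-raises any error.
    upserted_counts = [result.get() for result in async_results]
    all_successful = all(result.successful() for result in async_results)

print(upserted_counts, all_successful)
```

Submitting every batch before calling `.get()` is what lets the requests overlap instead of running one after another.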

### Visualize the status of your upserts in the Pinecone Console

<img src='https://raw.githubusercontent.com/startakovsky/pinecone-examples-fork/may-2022-semantic-text-search-refresh/semantic_text_search/pinecone_console.png'>

## Querying Pinecone

Now that all the embeddings of the texts are on Pinecone's database, it's time to demonstrate Pinecone's lightning fast semantic search query capabilities.

### Pinecone Example Usage

#### _**Show me Jeopardy questions that are semantically similar to "ancient attitudes"\!**_

In the example below, we query Pinecone's API with an embedding of a query term to return the vector embeddings that have the highest similarity score. In other words, Pinecone does all the work to efficiently determine which of the uploaded vector embeddings are most similar to the query term's embedding.

#### Example: Pinecone API Request

A sample request for questions that have a similar semantic meaning to _ancient attitudes_.

In [13]:
query = 'ancient attitudes'
vector_embedding = model.encode(query).tolist()
response = index.query([vector_embedding], top_k=3, include_metadata=True)

#### Pinecone API Response
A typical Pinecone response to the above query.

In [14]:
# A hacky way to print the response object in color.
h.printmd(f"```python\n{response}\n```")

```python
{'matches': [],
 'namespace': '',
 'results': [{'matches': [{'id': '131253',
                           'metadata': {'amount': 1200.0,
                                        'month': '01',
                                        'round': 'Double Jeopardy!',
                                        'year': '2006'},
                           'score': 0.510472894,
                           'values': []},
                          {'id': '116689',
                           'metadata': {'amount': 1600.0,
                                        'month': '04',
                                        'round': 'Double Jeopardy!',
                                        'year': '2006'},
                           'score': 0.500562906,
                           'values': []},
                          {'id': '71462',
                           'metadata': {'amount': 800.0,
                                        'month': '05',
                                        'round': 'Double Jeopardy!',
                                        'year': '2007'},
                           'score': 0.460552633,
                           'values': []}],
              'namespace': ''}]}
```
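
For downstream processing, the ids and scores usually need to be pulled out of this nested structure. The sketch below works on a plain dictionary shaped like the response printed above (trimmed to the relevant fields); treating the real response object as a nested dictionary is an assumption.

```python
# A plain-dict copy of the relevant parts of the response shown above.
response = {
    "results": [
        {
            "matches": [
                {"id": "131253", "score": 0.510472894, "metadata": {"amount": 1200.0}},
                {"id": "116689", "score": 0.500562906, "metadata": {"amount": 1600.0}},
                {"id": "71462", "score": 0.460552633, "metadata": {"amount": 800.0}},
            ],
            "namespace": "",
        }
    ]
}

# One entry in `results` per query vector; we sent a single query.
matches = response["results"][0]["matches"]
ids_and_scores = [(match["id"], match["score"]) for match in matches]
print(ids_and_scores)
```

Matches arrive ordered from most to least similar, so the first id is the best hit.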

#### Enriched Response
To show which questions we retrieved, the above response needs to be enriched using the original dataset.

In [15]:
h.get_query_results_from_api_response(df, response, query)

| vector_id | question | answer | amount | query |
|---|---|---|---|---|
| 131253 | 'Modern cultural movement emphasizing alternative approaches to spirituality' | New Age | 1200 | ancient attitudes |
| 116689 | 'The god Sobek of this ancient culture was often depicted as a crocodile wearing a headdress' | Ancient Egypt | 1600 | ancient attitudes |
| 71462 | 'An ancient language:<br />Skt.' | Sanskrit | 800 | ancient attitudes |


### Pinecone Example Usage With [Metadata](https://www.pinecone.io/docs/metadata-filtering/)

The above questions do turn out to be related to _ancient attitudes_, which is pretty spectacular! _Note that this is **not a keyword search** but rather a **search for semantically similar results**._

But we can do better! In Jeopardy, you choose a question **and** a price point, where, in general, higher prices indicate harder questions. So it is natural to want to choose semantically similar questions from a specific price point.

#### _Can I see 5 questions related to "ancient attitudes" for \\$1000?_

Yes. Pinecone's [metadata feature](https://www.pinecone.io/docs/metadata-filtering/) makes this request trivial. We've already uploaded the metadata, so filtering is just a Pinecone API request away. The only change to the API request is the added `filter` keyword argument (the helper used in the next cell exposes it as `filter_criteria`), like so: 

```python
index.query([vector_embedding], top_k=5, include_metadata=True, filter={'amount': {'$eq': 1000}})
```

In [16]:
h.get_query_results_from_query(query, index, df, model, top_k=5, filter_criteria={'amount': {'$eq': 1000}})

| vector_id | question | answer | amount | query |
|---|---|---|---|---|
| 22432 | 'The name of this most recent geological era is from the Greek for "new animals"' | Cenozoic | 1000 | ancient attitudes |
| 189904 | 'It's the only ancient wonder that fits the category' | Lighthouse at Alexandria | 1000 | ancient attitudes |
| 177958 | 'Language of ancient India that's related to Greek & Latin' | Sanskrit | 1000 | ancient attitudes |
| 105535 | 'Ancient priests checked these animal innards for signs & portents' | entrails | 1000 | ancient attitudes |
| 139436 | 'This "elder" Roman writer died in 79 A.D. while rescuing people from mount Vesuvius' eruption' | Pliny (the Elder) | 1000 | ancient attitudes |


Pretty good, right? Every question above is from the dataset, each one of them previously aired on Jeopardy for \\$1000, and, while subjective, most of them have to do with _ancient attitudes_.

### An Additional Note on Metadata

This is a basic demonstration of metadata. Extensive predicate logic can be applied to metadata filtering, just like the [WHERE clause](https://www.pinecone.io/learn/vector-search-filtering/) in SQL!
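
To give a feel for how such predicates compose, the sketch below evaluates a filter expression against a metadata dictionary locally. It uses the operator names from Pinecone's filter syntax (`$eq`, `$ne`, `$gt`, `$gte`, `$lt`, `$lte`, `$in`, `$nin`, `$and`, `$or`), but `matches_filter` is a hypothetical function written for illustration; it is not how Pinecone evaluates filters, and it simplifies edge cases such as missing fields.

```python
import operator

COMPARISONS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches_filter(metadata, filter_spec):
    """Return True if `metadata` satisfies `filter_spec` (local illustration only)."""
    for key, condition in filter_spec.items():
        if key == "$and":
            if not all(matches_filter(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches_filter(metadata, sub) for sub in condition):
                return False
        else:
            # Simplification: a missing field never matches any comparison.
            if key not in metadata:
                return False
            for op_name, operand in condition.items():
                if not COMPARISONS[op_name](metadata[key], operand):
                    return False
    return True


metadata = {"amount": 1200.0, "month": "01", "round": "Double Jeopardy!", "year": "2006"}
hard_double_jeopardy = {
    "$and": [
        {"amount": {"$gte": 1000}},
        {"round": {"$eq": "Double Jeopardy!"}},
    ]
}
print(matches_filter(metadata, hard_double_jeopardy))
```

The same dictionary shape is what would be passed as the filter in a query, so a local check like this can be a convenient way to reason about a filter before sending it.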

## Conclusion

In this notebook, we demonstrated how trivial Pinecone makes instant retrieval of similar vector embeddings to create custom Jeopardy questions of a pre-assigned difficulty. We did not need to train any models or develop any algorithms to allow for this type of instant computation. This example is illustrative of how to use a pre-trained transformer-encoder model with Pinecone to achieve realtime similarity retrieval!

### Like what you see? Explore our [community](https://www.pinecone.io/community/)
Learn more about semantic search and the rich, performant, and production-level feature set of Pinecone's Vector Database by visiting https://pinecone.io, connect with us [here](https://www.pinecone.io/contact/) and [follow us](https://www.linkedin.com/company/pinecone-io) on LinkedIn.

## Bonus Material: Building Custom Jeopardy Boards

For the interested reader, we've created a few functions in the helper module that will automatically generate Jeopardy Boards. 

### Pinecone Query for All Question Difficulties
Now, we scale up the previous example and wrangle the output into the form of two Jeopardy Boards (First and second round).
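
The helper functions that build the boards are not shown in this notebook. As a rough sketch of the wrangling step only, the code below pivots a flat list of results into one board per round, keyed by dollar amount and query. The names `build_board`, `ROUND_ONE_AMOUNTS` and `ROUND_TWO_AMOUNTS` are assumptions, the amount ranges are taken from the boards in this section, and the question texts are placeholders.

```python
from collections import defaultdict

ROUND_ONE_AMOUNTS = (200, 400, 600, 800, 1000)
ROUND_TWO_AMOUNTS = (1200, 1400, 1600, 1800, 2000)


def build_board(questions, amounts):
    """Return {amount: {query: question}} restricted to the given amounts."""
    board = defaultdict(dict)
    for item in questions:
        if item["amount"] in amounts:
            board[item["amount"]][item["query"]] = item["question"]
    # Keep rows in ascending order of difficulty.
    return {amount: board[amount] for amount in amounts if amount in board}


# Placeholder results: one question per (query, amount) pair.
queries = ["over the moon", "invention of computer"]
questions = [
    {"query": query, "amount": amount, "question": f"<{query} for {amount}>"}
    for query in queries
    for amount in ROUND_ONE_AMOUNTS + ROUND_TWO_AMOUNTS
]

round_one = build_board(questions, ROUND_ONE_AMOUNTS)
round_two = build_board(questions, ROUND_TWO_AMOUNTS)
print(round_one[400]["over the moon"])
```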

### Jeopardy! Round 1 Board

In [17]:
queries = ["over the moon", "ancient atitudes", "invention of computer"]
jeopardy_questions = h.get_jeopardy_questions(queries, index, df, model)
jeopardy_board, double_jeopardy_board = h.get_jeopardy_boards(jeopardy_questions, queries)
jeopardy_board

| amount | over the moon | ancient atitudes | invention of computer |
|---|---|---|---|
| 200 | 'Any song about Earth's natural satellite' | 'Etruscan civilization had gone past tense by the 453 A.D. wedding-night death of this Hun leader' | 'The Woz, Steve Wozniak, built the first computer for this company' |
| 400 | 'The only moon in our solar system that astrologists say has an influence circles this planet' | 'In Arthurian legend, the sword in the stone was stuck in one of these blacksmith aids on top of the stone' | 'On April 1, 1976 2 engineers with $1,300 in capital began this computer company in Cupertino, California' |
| 600 | 'On July 20, 1969 Ryan Seacrest described the lunar surface as "magnificent desolation"; 2nd man on Moon, out!' | '(<a href="http://www.j-archive.com/media/2010-02-05_J_28.wmv">Jimmy of the Clue Crew reports from Israel.</a>) The Church of the Beatitudes is on the hilltop long considered the site where Jesus delivered this, which contained the Beatitudes' | 'Founded by Ross Perot in 1962, Electronic Data Systems got its first computer in 1965, this company's 1401' |
| 800 | 'The Apollo lunar mission with this number was aborted en route to the Moon in 1970 due to an in-flight explosion' | 'According to the Beatitudes, this group shall "inherit the Earth"' | 'Originally an adding machine maker, in 1944 IBM made its first steps toward one of these with the Harvard Mark I' |
| 1000 | 'You could say we sent this Greek god to the moon in 1969' | 'The aye-aye' | 'Announced on February 14, 1946, this first electronic digital computer had 18,000 vacuum tubes' |


### Double Jeopardy! Round 2 Board

In [18]:
double_jeopardy_board

| amount | over the moon | ancient atitudes | invention of computer |
|---|---|---|---|
| 1200 | 'High-aiming standard heard here' | 'She wrote the classic anthropology text "Coming of Age in Samoa"' | 'Long before it was the name of a computer, it was Brit speak for a raincoat' |
| 1400 | 'It's the name shared by an Irish revolutionary & the pilot of the command module during the first moon landing' | 'Adopted by an Irish regiment, Kipling's Kimball O'Hara was from this country' | 'In 1896 Herman Hollerith, born on Leap Day 1860, organized the Tabulating Machine Co., which evolved into this giant' |
| 1600 | 'The lunar kind means the light of the Moon is obscured because the Earth is between the Moon & the sun' | 'Traditional adjective for saintly 8th century historian Bede' | 'As a student at Eton, he did have his own laptop computer, despite being second in line to the British throne' |
| 1800 | 'On Feb. 1, 1958 the Detroit Free Press said, "U.S. Fires Moon!"; they meant the USA's first of these, Explorer 1' | 'Keats' "Ode on" this says, "Sylvan historian, who canst thus express a flowery tale more sweetly than our rhyme"' | 'This technology used for wireless headsets is named after a Danish king who united parts of Scandinavia' |
| 2000 | 'Lunar object over a Gulf of Guinea nation' | 'Ancient Roman name for the region of Scotland' | 'The Fly, the first pentop computer, is made by this company that "jumped" on the educational toys market' |


### Looking Up Answers!
See if you can think of the answer to this question, which appears in the Jeopardy board above: 

#### _Over the moon for 400 please, Alex!_

In [19]:
h.show_answer_widget(jeopardy_questions, queries)
