<a href="https://colab.research.google.com/github/iwswordpress/DRL/blob/main/Copy_of_semantic_text_search_refresh.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/semantic_text_search/semantic_text_search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/semantic_text_search/semantic_text_search.ipynb)

# Semantic Search With Pinecone

## Background

### What is Semantic Search and how will we use it?

_Semantic search_ is search where the _meaning_ of the search query is the focus, rather than keyword lookups. Neural networks pretrained on large sets of text data have been shown to be effective at encoding the _meaning_ of a particular phrase, sentence, paragraph or long document into a data structure known as a [vector embedding](https://www.pinecone.io/learn/vector-embeddings/).

In this example, we are going to demonstrate Pinecone's semantic search capabilities with an off-the-shelf, pretrained NLP model. In the process we'll learn a few things.

### Learning Goals and Estimated Reading Time
_By the end of this 10 minute demo, you will have:_
 1. Learned about Pinecone's value for solving realtime semantic search requirements!
 2. Stored and retrieved vectors from your very own Pinecone Vector Database.
 3. Encoded news articles as 384-dimensional vectors using a pretrained, encoder-only model (i.e. no model training necessary).
 4. Queried Pinecone's Vector Database to find similar news articles to the query in question.
 
Executing all the code in the notebook on the full dataset may take a few hours, but once all the data is encoded, queries to Pinecone are processed on the order of tens of milliseconds.

## Setup: Prerequisites and Data Preparation

### Python 3.7+

This code has been tested with Python 3.7. It is recommended to run this code in a virtual environment or Google Colab.

### Acquiring your Pinecone API Key

A Pinecone API key is required. You can obtain one for free on [our website](https://app.pinecone.io/). Either add `PINECONE_EXAMPLE_API_KEY` to your environment variables, or enter it manually when the helper module cell below prompts for it. The key is then stored for the duration of this kernel session.
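As a minimal sketch of this lookup (the helper module's actual implementation may differ, and `get_api_key` is a hypothetical name used only for illustration), the key can be read from the environment with an interactive fallback:

```python
import os
from getpass import getpass


def get_api_key(env_var="PINECONE_EXAMPLE_API_KEY"):
    """Return the API key from the environment, prompting only if it is missing."""
    api_key = os.environ.get(env_var)
    if not api_key:
        # getpass hides the typed key so it does not end up in the notebook output
        api_key = getpass(f"{env_var} not found. Enter your Pinecone API key: ")
    return api_key
```

Reading the key from the environment keeps it out of the notebook file itself, which matters if the notebook is shared or committed to a repository.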

### Installing and Importing Prerequisite Libraries
Python libraries [pinecone-client](https://pypi.org/project/pinecone-client/), [sentence_transformers](https://pypi.org/project/sentence-transformers/), [datasets](https://pypi.org/project/datasets/), [pandas](https://pypi.org/project/pandas/), and [tqdm](https://pypi.org/project/tqdm/) are required for this notebook. The `httpimport` library is also installed so that the helper module can be loaded remotely.

#### Installing via `pip`
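The next cell installs the required libraries with `pip`. The `-q` flag reduces the output and `-U` upgrades any packages that are already installed. If `pip` on your machine does not belong to this notebook's Python kernel, invoke it through _sys.executable_ instead (for example `!{sys.executable} -m pip install ...` after `import sys`), which ensures it's the version of pip associated with this Jupyter Notebook's Python kernel.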

In [None]:
!pip install pinecone-client sentence-transformers pandas tqdm datasets httpimport -qU


#### Importing and Defining Constants

In [None]:
import os
import collections

import httpimport
import tqdm
import pinecone
import pandas as pd
from sentence_transformers import SentenceTransformer
from datasets import load_dataset

INDEX_NAME, INDEX_DIMENSION = 'squad', 384
MODEL_NAME = 'sentence-transformers/msmarco-MiniLM-L6-cos-v5'



### Helper Module

In [None]:
# There is a helper module required for this notebook to run.
# When not present with this notebook, it will be streamed in from Pinecone's Example Repository.
# You can find the module at https://github.com/pinecone-io/examples/tree/master/semantic_text_search

if os.path.isfile('helper.py'):
    import helper as h
else:
    print('importing `helper.py` from https://github.com/pinecone-io')
    with httpimport.github_repo(
        username='pinecone-io', 
        repo='examples',
        profile='semantic_text_search',
        ref='master'):
        from semantic_text_search import helper as h

importing `helper.py` from https://github.com/pinecone-io


Extracting API Key from environmental variable `PINECONE_EXAMPLE_API_KEY`...

PINECONE_EXAMPLE_API_KEY not found in environmental variables list.
Get yours at https://app.pinecone.io and enter it here: 

··········


Pinecone API Key available at `h.pinecone_api_key`

### Downloading and Processing Data

#### Downloading data
To demonstrate semantic search using Pinecone, we will be using [a dataset](https://huggingface.co/datasets/cc_news) consisting of over 700,000 English-language news articles. We will download this dataset using the `datasets` library in the next cell.

In [None]:
rows = 30_000  # number of rows to download, increase/decrease as preferred
# To download all 700,000+ news articles remove the `split` keyword argument entirely

dataset = load_dataset("cc_news", split=f"train[:{rows}]")


Downloading and preparing dataset cc_news/plain_text to /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/ae469e556251e6e7e20a789f93803c7de19d0c4311b6854ab072fecb4e401bd6...



Dataset cc_news downloaded and prepared to /root/.cache/huggingface/datasets/cc_news/plain_text/1.0.0/ae469e556251e6e7e20a789f93803c7de19d0c4311b6854ab072fecb4e401bd6. Subsequent calls will reuse this data.


#### Preprocessing

The preprocessing step is defined in the helper module as `get_processed_df`.

In [None]:
df = h.get_processed_df(dataset.to_pandas())

#### Sample row from dataframe

Note we use an abridged version of the original text in the _text\_to\_encode_ field.

In [None]:
pd.DataFrame(df.iloc[1234])

Unnamed: 0,1234
title,Red Sox manager Alex Cora brings youthful eye to new job
text,"BOSTON (AP) — Dave Dombrowski wanted to make sure he and Alex Cora were on the same page, so the Red Sox boss sent off an email for his new manager's approval.\nThe response: A thumbs-up emoji.\nTony La Russa and Jim Leyland never did that.\n""He's a good emoji texter,"" Dombrowski said with a laugh this month as the team turned its thoughts toward spring training. ""He's very good with the thumbs-up. My children, they help me out at times.""\nA native of Puerto Rico, Cora is already a pioneer as the first minority manager in the history of a franchise that was the last to field a black player. But he's also a new kind of Red Sox dugout boss: One of the youngest managers in franchise history, giving him a unique chance to connect with his players.\n""He's not too far removed from actually playing the game. He's played — actually, personally — with some of my teammates now,"" Boston outfielder Jackie Bradley Jr. said. ""I think it's going to be a great combination of old school and new school. He's learned from the past, and he's going to be able to put his own twist on things.""\nStill just 42 and in his first major league managerial job, Cora is no newbie.\nHis shaved head shows the stubble of a receded hairline, with some gray around the temples picked up during a 14-year career spent with six big-league teams. As a member of the Red Sox from 2005-08, he was a part of the franchise's 2007 World Series title and was teammates with current second baseman Dustin Pedroia. (He also overlapped with first baseman Mitch Moreland for about five days with the Rangers in 2010.)\nIt's this that made him an intriguing choice to replace John Farrell, who was fired last fall at the age of 55 despite leading Boston to the first back-to-back AL East titles in franchise history. 
Farrell's predecessor, Bobby Valentine, was 62 for his lone season in Boston; you'd have to go back to Kevin Kennedy, who was 41 when he was hired in 1995, to find a younger Red Sox skipper.\n""I'm 42. I'm young..."
domain,www.taiwannews.com.tw
date,2018-01-29 15:54:00
description,Red Sox manager Alex Cora brings youthful eye to new job
url,https://www.taiwannews.com.tw/en/news/3353845
image_url,https://www.taiwannews.com.tw/images/category/580888eb17740.jpg
text_to_encode,"Red Sox manager Alex Cora brings youthful eye to new job BOSTON (AP) — Dave Dombrowski wanted to make sure he and Alex Cora were on the same page, so the Red Sox boss sent off an email for his new manager's approval. The response: A thumbs-up emoji. Tony La Russa and Jim Leyland never did that. ""He's a good emoji texter,"" Dombrowski said with a laugh this month as the team turned its thoughts toward spring training."
year,2018
month,1


### Creating your Pinecone Index
The process for creating a Pinecone Index requires your Pinecone API key, the name of your index, and the number of dimensions of each vector. As we will see below, the model we are using maps each piece of text to a 384-dimensional vector.

In [None]:
# Use the key collected by the helper module rather than hard-coding it in the notebook.
pinecone.init(
    api_key=h.pinecone_api_key,
    environment='us-east1-gcp'  # find in console next to api key
)

# Create the index only if it does not exist yet.
if INDEX_NAME not in pinecone.list_indexes():
    pinecone.create_index(name=INDEX_NAME, dimension=INDEX_DIMENSION, metric='cosine')

index = pinecone.Index(index_name=INDEX_NAME)

## Generate embeddings and send them to your Pinecone Index
This is all done in batches. We compute embeddings for one chunk of the dataframe at a time, then send each chunk to Pinecone in smaller upsert batches.

### Loading a Pretrained Encoder Model
We will generate embeddings by using [this Sentence Transformers model](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L6-cos-v5). It is one of hundreds of encoder models available. Downloads happen automatically with SentenceTransformer, and may take up to a minute the first time. After this first import, the model is cached and available on a local machine.

In [None]:
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
h.printmd(f'Loading model from _Sentence Transformers_: `{MODEL_NAME}` from Sentence Transformers to `{device}`...')
model = SentenceTransformer(MODEL_NAME, device=device)
h.printmd('Model loaded.')

Loading model from _Sentence Transformers_: `sentence-transformers/msmarco-MiniLM-L6-cos-v5` from Sentence Transformers to `cuda`...


Model loaded.

### MSMARCO model v5 and Embeddings

In this example, we created an index with 384 dimensions that uses the [cosine similarity score](https://en.wikipedia.org/wiki/Cosine_similarity) as its metric. Computing this score for two vectors is trivial, but it becomes very difficult when a query vector must be compared against millions or billions of vectors to determine which are most similar to it.

#### On Embeddings

This model produces vectors from text, each a sequence of 384 floats. So, when a piece of text such as "A quick fox jumped around" gets encoded into a vector embedding, the result is a sequence of floats of length 384. The same is true for a long news article and a single word. 

#### On Comparing Embeddings aka _how_ Semantic Search works

Two 15-dimensional text embeddings might look something like:
 - _\[-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04\]_
 - _\[-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03\]_
 
To determine how [_similar_](https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d) they are, we apply a simple formula that takes very little time to compute. Similarity scores are, in general, an excellent proxy for semantic similarity. A natural next step is to compare one vector to a handful of others and select the most similar.
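As an illustration, the cosine similarity of the two example vectors can be computed in a few lines of plain Python. This is a sketch of the formula itself, not of how Pinecone computes scores internally; the names `cosine_similarity`, `u` and `v` are chosen here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# The two 15-dimensional example embeddings from the text
u = [-0.02, 0.06, 0.0, 0.01, 0.08, -0.03, 0.01, 0.02, 0.01, 0.02, -0.07, -0.11, -0.01, 0.08, -0.04]
v = [-0.04, -0.09, 0.04, -0.1, -0.05, -0.01, -0.06, -0.04, -0.02, -0.04, -0.04, 0.07, 0.03, 0.02, 0.03]

print(f"similarity(u, v) = {cosine_similarity(u, v):.3f}")
print(f"similarity(u, u) = {cosine_similarity(u, u):.3f}")  # a vector compared with itself
```

A vector compared with itself scores 1, the maximum, and scores closer to 1 indicate vectors pointing in more similar directions.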

### What is Pinecone for?
There is often a technical requirement to compare one vector to tens or hundreds of millions or more vectors, to do so with low latency (less than 50ms) and a high throughput. Pinecone solves this problem with its managed vector database service, and we will demonstrate this below.
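To see why this becomes a problem at scale, consider the naive baseline: score every stored vector against the query and keep the best few. The sketch below uses small random vectors as stand-in data, and `brute_force_top_k` is a hypothetical name. It shows the exhaustive scan that a vector database is designed to avoid, not Pinecone's own search method.

```python
import heapq
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def brute_force_top_k(query, vectors, k=3):
    """Score every stored vector against the query: O(N * d) work for each query."""
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    return heapq.nlargest(k, scored)


# Stand-in data: 1,000 random 8-dimensional vectors keyed by a made-up ID
random.seed(0)
vectors = {f"doc-{i}": [random.uniform(-1, 1) for _ in range(8)] for i in range(1000)}

query = vectors["doc-42"]
top = brute_force_top_k(query, vectors, k=3)
for score, vec_id in top:
    print(f"{vec_id}: {score:.3f}")
```

The cost of each query grows linearly with the number of stored vectors, so this approach cannot deliver low latency once the collection reaches hundreds of millions of entries.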

### Components of a Pinecone vector embedding

There are three components to every Pinecone vector embedding:
 - a vector ID
 - a sequence of floats of a user-defined, fixed dimension
 - vector metadata (a key-value store)
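A single record with these three components might be assembled as shown below. The ID, values and metadata are made up for illustration; the tuple layout follows the `(vector_id, vector_values, vector_metadata)` form used by the upload function later in this notebook.

```python
INDEX_DIMENSION = 384

# Hypothetical values for illustration only
vector_id = "article-0"
vector_values = [0.0] * INDEX_DIMENSION  # one float per index dimension
vector_metadata = {"year": 2018, "month": 1, "source": "example.com"}

record = (vector_id, vector_values, vector_metadata)

# Every vector in an index must have exactly the dimension the index was created with
assert len(record[1]) == INDEX_DIMENSION
```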

### Prepare vector embeddings for upload

We will encode the news articles for upload to Pinecone. This may take a while depending on your machine. On a recent MacBook Pro or in Google Colab, this may take up to one hour, sometimes longer. We will use the index of the pandas dataframe for the vector ID, the pretrained model to generate the sequence of 384 floats, and the year, month and article source as the metadata.

#### Prepare metadata

The function below creates metadata from a single row of the dataframe. This is going to be important further down this notebook for additional filter requirements we may want to employ in our queries.

In [None]:
def get_vector_metadata_from_dataframe_row(df_row):
    """Return vector metadata."""
    vector_metadata = {
        'year': df_row['year'],
        'month': df_row['month'],
        'source': df_row['processed_domain']
    }
    return vector_metadata

#### Prepare all vector data for upload

The function below will take a portion of the dataframe and create the full vector data as Pinecone expects it for [upsert](https://www.pinecone.io/docs/insert-data/).

In [None]:
def get_vectors_to_upload_to_pinecone(df_chunk, model, is_multiprocess=False):
    """Return list of tuples like (vector_id, vector_values, vector_metadata)."""
    # pass the encoder a plain list of strings rather than a pandas Series
    texts = df_chunk['text_to_encode'].tolist()
    # create embeddings
    if is_multiprocess:
        pool = model.start_multi_process_pool()
        vector_values = model.encode_multi_process(texts, pool).tolist()
        model.stop_multi_process_pool(pool)
    else:
        vector_values = model.encode(texts, show_progress_bar=True).tolist()
    # create vector ids and metadata
    vector_ids = df_chunk.index.tolist()
    vector_metadata = df_chunk.apply(get_vector_metadata_from_dataframe_row, axis=1).tolist()
    return list(zip(vector_ids, vector_values, vector_metadata))

### Upload data to Pinecone in asynchronous batches

The function below iterates through the dataframe in chunks, and for each of those chunks, will upload asynchronously in sub-chunks to your Pinecone Index.

In [None]:
def upload_dataframe_to_pinecone_in_chunks(
    dataframe,
    pinecone_index,
    model,
    is_multiprocess=False,
    chunk_size=20000,
    upsert_size=500):
    """Encode dataframe column `text_to_encode` to dense vector and upsert to Pinecone."""
    tqdm_kwargs = h.get_tqdm_kwargs(dataframe, chunk_size)
    async_results = collections.defaultdict(list)
    for df_chunk in tqdm.notebook.tqdm(h.chunks(dataframe, chunk_size), **tqdm_kwargs):
        # results for this chunk are keyed by the first dataframe index it contains
        start_index_chunk = df_chunk.index[0]
        vectors = get_vectors_to_upload_to_pinecone(df_chunk, model, is_multiprocess=is_multiprocess)
        # upload to Pinecone in batches of `upsert_size` without blocking on each request
        for vectors_chunk in h.chunks(vectors, upsert_size):
            async_result = pinecone_index.upsert(vectors_chunk, async_req=True)
            async_results[start_index_chunk].append(async_result)
        # wait for all requests in this chunk to finish
        for async_result in async_results[start_index_chunk]:
            async_result.get()
        is_all_successful = all(r.successful() for r in async_results[start_index_chunk])
        # report chunk upload status
        print(
            f'All upserts in chunk successful with index starting with {start_index_chunk:>7}: '
            f'{is_all_successful}. Vectors uploaded: {len(vectors):>3}.'
        )
    return async_results
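The function above relies on `h.chunks` from the helper module to split both the dataframe and the list of vectors. The helper's source is not shown in this notebook, so the following is only a sketch of how such a batching generator could work for a list. It is a hypothetical stand-in, not the helper's actual code.

```python
def chunks(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


vectors = list(range(1200))  # stand-in for 1,200 (id, values, metadata) tuples
batch_sizes = [len(batch) for batch in chunks(vectors, 500)]
print(batch_sizes)  # [500, 500, 200]
```

The last batch is simply shorter when the total is not a multiple of the batch size, so no vectors are dropped.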

#### Asynchronous Upload
Computing the embeddings _on the full dataset of 700k+ articles_ may take several hours, depending on hardware capabilities. With [async](https://www.pinecone.io/docs/insert-data/#sending-upserts-in-parallel) requests, each upsert call returns right away instead of waiting for the upload to complete. For 30,000 records (the default for this notebook), in Google Colab, the upload should take approximately 2 minutes without async, or 1 minute with it.
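The general pattern of sending several requests first and waiting for them afterwards can be sketched with the standard library alone. Here `fake_upsert` is a hypothetical stand-in that sleeps instead of contacting a server; it illustrates the idea of overlapping requests and is not the Pinecone client's implementation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_upsert(batch):
    """Hypothetical stand-in for a network call; returns how many items were 'sent'."""
    time.sleep(0.05)
    return len(batch)


batches = [list(range(500)) for _ in range(4)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # submit every batch first so the requests overlap ...
    futures = [pool.submit(fake_upsert, batch) for batch in batches]
    # ... then block until each one has finished
    upserted = sum(future.result() for future in futures)

print(upserted)  # 2000
```

Because the simulated requests run concurrently, the total wait is close to that of a single request rather than the sum of all four.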

In [None]:
# Setting the `is_multiprocess` flag to `False` gives visibility
# into per-batch progress, but the embeddings will be created at roughly a 2x
# slower rate, based on a few runs on a 2021 MacBook Pro
async_results = upload_dataframe_to_pinecone_in_chunks(df, index, model, is_multiprocess=False)


### Visualize the status of your upserts in the [Pinecone Console](https://app.pinecone.io/)

## Querying Pinecone

Now that all the embeddings of the texts are in Pinecone's database, it's time to demonstrate Pinecone's lightning-fast semantic search query capabilities.

### Pinecone Example Usage

In the example below, we query Pinecone's API with an embedding of a query term to return the vector embeddings that have the highest similarity score. Pinecone efficiently estimates which of the uploaded vector embeddings have the highest similarity when paired with the query term's embedding, and the database will scale to billions of embeddings while maintaining low latency and high throughput. In this example we have upserted up to 700,000 embeddings (depending on the `rows` variable). Our [starter plan](https://www.pinecone.io/pricing/) supports up to one million.

#### Example: Pinecone API Request and Response

Let's find articles with a similar semantic meaning to the `query` variable.

In [None]:
query = "Is too much CO2 in the ocean bad for the environment? Research supports this claim."
vector_embedding = model.encode(query).tolist()
response = index.query([vector_embedding], top_k=3, include_metadata=True)
h.printmd(f"#### A sample response from Pinecone \n ==============\n \n ```python\n{response}\n```")

#### Enriched Response
To show which articles we retrieved, the above response needs to be enriched using the original dataset.

In [None]:
vector_ids, scores = h.get_ids_scores(response)
result = df.loc[vector_ids]
result['score'] = scores
result[['title', 'score', 'domain', 'date', 'description', 'url', 'text_to_encode']]

#### Are the results any good?

We invite the reader to explore various queries by running the code in the last two cells. Note that this is **not a keyword search** but rather a **search for semantically similar results**. Note the _score_ column, which gives the similarity score with the query. Higher scores typically indicate greater semantic similarity.

### Pinecone Example Usage With [Metadata](https://www.pinecone.io/docs/metadata-filtering/)

Extensive predicate logic can be applied to metadata filtering, just like the [WHERE clause](https://www.pinecone.io/learn/vector-search-filtering/) in SQL! Pinecone's [metadata feature](https://www.pinecone.io/docs/metadata-filtering/) provides easy-to-implement filtering.

Here are the top 20 sources, with the rest grouped into the _other_ category. We will filter results so that they come from any of the top 5 sources and were written in 2018 or 2019. We are able to do this because we provided this metadata when upserting the vectors to the Pinecone index.

In [None]:
sources = h.get_top_sources(df)
print(*sources, sep=', ')

In [None]:
response = index.query(
    [vector_embedding], 
    top_k=5, 
    filter={
        "$and": [
            {'year': {'$in': [2018, 2019]}},
            {'source': {'$in': sources[:5]}}
        ]
    }
)
vector_ids, scores = h.get_ids_scores(response)
result = df.loc[vector_ids]
result['score'] = scores
result[['title', 'score', 'domain', 'date', 'description', 'url', 'text_to_encode']]
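To make the semantics of such a filter concrete, the sketch below evaluates the same kind of predicate locally against plain metadata dictionaries. It covers only `$and`, `$in` and simple equality, the function name `matches` and the sample records are hypothetical, and Pinecone applies filters on the server side rather than with code like this.

```python
def matches(metadata, flt):
    """Evaluate a small subset of the filter syntax against one metadata dict."""
    for key, condition in flt.items():
        if key == "$and":
            # every sub-filter in the list must hold
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict) and "$in" in condition:
            # the field's value must be one of the listed values
            if metadata.get(key) not in condition["$in"]:
                return False
        elif metadata.get(key) != condition:
            # plain value: treat as an equality check
            return False
    return True


# Made-up metadata records keyed by vector ID
records = {
    "a": {"year": 2018, "source": "example.com"},
    "b": {"year": 2017, "source": "example.com"},
    "c": {"year": 2019, "source": "example.net"},
}
flt = {
    "$and": [
        {"year": {"$in": [2018, 2019]}},
        {"source": {"$in": ["example.com", "example.org"]}},
    ]
}

kept = [vec_id for vec_id, meta in records.items() if matches(meta, flt)]
print(kept)  # ['a']
```

Record `b` fails the year condition and record `c` fails the source condition, so only `a` satisfies both parts of the `$and`.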

#### Are the results any good?

We leave this to the reader to assess, as it is subjective. One thing to notice is that the similarity scores are a bit lower when retrieving from the top news sources. This is not surprising, since one might expect relevant results to come from more scientific sources such as _climatecentral.org_ and _energylivenews.com_, as in the non-filtered query.

After we've finished, delete the index to save resources.

In [None]:
# pinecone.delete_index(INDEX_NAME)

## Conclusion

In this example, we demonstrated how easy Pinecone makes it to do semantic search with a pretrained transformer encoder model and achieve realtime similarity retrieval! We also demonstrated the use of metadata filtering when querying Pinecone's vector database.

### Like what you see? Explore our [community](https://www.pinecone.io/community/)
Learn more about semantic search and the rich, performant, and production-level feature set of Pinecone's Vector Database by visiting https://pinecone.io, connecting with us [here](https://www.pinecone.io/contact/) and [following us](https://www.linkedin.com/company/pinecone-io) on LinkedIn. If you are interested in some of the algorithms that allow for efficient estimation of similar vectors, visit the Algorithms and Libraries section of our [Learning Center](https://www.pinecone.io/learn/).