[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/question-answering/question-answering.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/question-answering/question-answering.ipynb)

# Question Answering with Similarity Search

This notebook demonstrates how Pinecone's similarity search as a service helps you build a question answering application. We will index a set of questions and retrieve the most similar stored questions for a new (unseen) question. That way, we can link a new question to answers we might already have.

You can build a question answering application with Pinecone in three steps:
- Represent questions as vector embeddings so that semantically similar questions are in close proximity within the same vector space.
- Index vectors using Pinecone.
- Given a new question, query the index to fetch similar questions. Answers stored alongside those questions can then be returned for the new question.

In this notebook, we index a set of questions and retrieve similar questions for a new, unseen question.
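
The three steps can be sketched end to end on toy data. This is only an illustration under stated assumptions: the `embed` function is a hypothetical word-count stand-in for a real embedding model, the "index" is a plain NumPy matrix rather than Pinecone, and the questions and answers are made up.

```python
import numpy as np

# Toy stand-in for an embedding model: count words from a tiny fixed vocabulary.
VOCAB = ["reset", "password", "delete", "account", "change", "email"]

def embed(text):
    """Return a unit-length word-count vector (hypothetical embedding)."""
    words = text.lower().replace("?", "").split()
    vec = np.array([words.count(w) for w in VOCAB], dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Steps 1 and 2: embed the stored questions and keep them in an in-memory "index".
stored = {
    "q1": "How do I reset my password?",
    "q2": "How can I delete my account?",
}
answers = {
    "q1": "Use the 'Forgot password' link on the login page.",
    "q2": "Open Settings and choose 'Delete account'.",
}
ids = list(stored)
matrix = np.stack([embed(stored[i]) for i in ids])

# Step 3: embed a new question, find the closest stored question, reuse its answer.
query = embed("I need to reset the password")
best = ids[int(np.argmax(matrix @ query))]
print(stored[best], "->", answers[best])
```

The rest of the notebook replaces the toy embedding with a pretrained model and the NumPy matrix with a Pinecone index.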



## Dependencies

In [None]:
!pip install -qU pinecone-client
!pip install -qU matplotlib ipywidgets
!pip install -qU sentence-transformers --no-cache-dir

In [2]:
import pandas as pd
import numpy as np

%matplotlib inline

## Pinecone Installation and Setup

In [3]:
from pinecone import Pinecone
import os

# load Pinecone API key
api_key = os.getenv("PINECONE_API_KEY") or "YOUR_API_KEY"
# create the client; it is named `pinecone` so the cells below can call it directly
pinecone = Pinecone(api_key=api_key)

[Get a Pinecone API key](https://www.pinecone.io/start/) if you don’t have one already.

## Create a New Pinecone Index

In [4]:
# pick a name for the new index
index_name = "question-answering"

In [5]:
# check whether an index with the same name already exists
if index_name in pinecone.list_indexes().names():
    pinecone.delete_index(index_name)

**Create index**

In [6]:
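from pinecone import ServerlessSpec  # recent clients require a spec; cloud and region below are assumptions
pinecone.create_index(name=index_name, dimension=300, spec=ServerlessSpec(cloud="aws", region="us-east-1"))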

**Connect to the index**

The index object, an instance of pinecone.Index, will be reused for optimal performance.

In [7]:
index = pinecone.Index(index_name)

## Uploading Questions

The dataset used in this notebook is the [Quora Question Pairs Dataset](https://www.kaggle.com/c/quora-question-pairs).

Let's download the dataset and load the data.

In [8]:
# download dataset from the url
import requests

DATA_DIR = "tmp"
DATA_FILE = f"{DATA_DIR}/quora_duplicate_questions.tsv"
DATA_URL = "https://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"


def download_data():
    os.makedirs(DATA_DIR, exist_ok=True)

    if not os.path.exists(DATA_FILE):
        r = requests.get(DATA_URL)  # download the dataset file
        r.raise_for_status()  # fail early instead of saving an HTTP error page
        with open(DATA_FILE, "wb") as f:
            f.write(r.content)


download_data()

In [9]:
pd.set_option("display.max_colwidth", 500)

df = pd.read_csv(
    DATA_FILE, sep="\t", usecols=["qid1", "question1"], index_col=False
)
df = df.sample(frac=1).reset_index(drop=True)
df.drop_duplicates(inplace=True)
df['qid1'] = df['qid1'].apply(str)
df['question1'] = df['question1'].apply(str)
print(df.head())

     qid1  \
0  216488   
1  424959   
2  300233   
3  302677   
4  468590   

                                                                                                                                     question1  
0                                                                                               I would love to give a TED talk. What do I do?  
1                                                                Do all caps titles on YouTube videos attract more viewers than normal titles?  
2                                                                                                How do I start self-learning ethical hacking?  
3                                                                           Should learning musical instruments in schools be made compulsory?  
4  Does the success of a self proclaimed Acharya Pankaj Pathak in Assam prove that we, as a state, are regressing back instead of progressing?  


### Define the model

We will use the [Average Word Embeddings Model](https://nlp.stanford.edu/projects/glove/) for this example. This model is fast to compute but produces relatively low-quality embeddings. To improve embedding quality, you can look into other sentence embedding models such as the [Sentence Embeddings Models trained on Paraphrases](https://huggingface.co/sentence-transformers/paraphrase-distilroberta-base-v1).
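
To see why averaging is fast but lossy, here is a minimal sketch of the idea with hypothetical 3-dimensional word vectors (the real GloVe vectors used below are 300-dimensional, and the actual model's tokenization may differ). The names `word_vectors` and `average_embedding` are made up for illustration.

```python
import numpy as np

# Hypothetical word vectors; a real model looks these up in a pretrained table.
word_vectors = {
    "make": np.array([1.0, 0.0, 0.0]),
    "money": np.array([0.0, 1.0, 0.0]),
    "online": np.array([0.0, 0.0, 1.0]),
}

def average_embedding(sentence, dim=3):
    """Average the vectors of all known words; unknown words are skipped."""
    tokens = sentence.lower().split()
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)

a = average_embedding("make money online")
b = average_embedding("online money make")

print(a)
# Averaging ignores word order, so both sentences get the same embedding.
print(np.allclose(a, b))
```

Because the average discards word order and context, sentences with the same words in a different arrangement become indistinguishable, which is one reason the embeddings are of lower quality than those from transformer-based models.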

In [10]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the model from huggingface model hub
model = SentenceTransformer("average_word_embeddings_glove.6B.300d", device=device)

### Creating Vector Embeddings

In [11]:
# create embedding for each question
question_vectors = model.encode(list(df.question1), show_progress_bar=True).tolist()

In [12]:
# add question embeddings to dataframe
df["question_vector"] = question_vectors

### Index the Vectors

In [13]:
import itertools

def chunks(iterable, batch_size=100):
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

In [14]:
for batch in chunks(zip(df.qid1, df.question_vector)):
    index.upsert(vectors=batch)
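
To check how the batching behaves without calling Pinecone, the same `chunks` helper can be run on a few made-up `(id, vector)` pairs. The pairs and the batch size of 3 are illustrative assumptions only.

```python
import itertools

def chunks(iterable, batch_size=100):
    """Yield successive tuples of at most batch_size items from iterable."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))

# Hypothetical (id, vector) pairs standing in for zip(df.qid1, df.question_vector).
pairs = [(str(i), [float(i)] * 3) for i in range(7)]

batch_sizes = [len(batch) for batch in chunks(pairs, batch_size=3)]
print(batch_sizes)  # [3, 3, 1]
```

The last batch is simply shorter, so no vectors are dropped when the dataset size is not a multiple of the batch size.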

## Search

Once the vectors are indexed, querying the index takes three steps:
- Select a set of questions you want to query with.
- Use the Average Word Embeddings Model to transform the questions into embeddings.
- Send each question vector to the Pinecone index and retrieve the most similar indexed questions.
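
Conceptually, a query ranks the stored vectors by similarity to the query vector and returns the best `top_k`. The sketch below does this by brute force in NumPy, assuming cosine similarity as the metric. It is not how Pinecone is implemented; the function name `top_k_cosine` and the 2-dimensional vectors are hypothetical.

```python
import numpy as np

def top_k_cosine(query, ids, vectors, top_k=5):
    """Return the top_k (id, score) pairs ranked by cosine similarity."""
    vectors = np.asarray(vectors, dtype=float)
    query = np.asarray(query, dtype=float)
    scores = (vectors @ query) / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(query)
    )
    order = np.argsort(-scores)[:top_k]  # indices of the highest scores first
    return [(ids[i], float(scores[i])) for i in order]

# Made-up stored vectors and a query pointing in the same direction as "a".
ids = ["a", "b", "c"]
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

results = top_k_cosine([2.0, 0.0], ids, vectors, top_k=2)
print(results)
```

A score of 1.0 means the two vectors point in exactly the same direction, regardless of their length.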

In [15]:
# define questions to query the vector index
query_questions = [
    "What is best way to make money online?",
    "How can i build an e-commerce website?"
]

# extract embeddings for the questions
query_vectors = model.encode(query_questions).tolist()

# query pinecone
query_results = [index.query(vector=xq, top_k=5) for xq in query_vectors]

# show the results
for question, res in zip(query_questions, query_results):
    print("\n\n\n Original question : " + str(question))
    print("\n Most similar questions based on pinecone vector search: \n")

    ids = [match.id for match in res.matches]
    scores = [match.score for match in res.matches]
    df_result = pd.DataFrame(
        {
            "id": ids,
            "question": [
                df[df.qid1 == _id].question1.values[0] for _id in ids
            ],
            "score": scores,
        }
    )
    print(df_result)




 Original question : What is best way to make money online?

 Most similar questions based on pinecone vector search: 

       id                                             question     score
0      57               What is best way to make money online?  1.000000
1  297469           What is the best way to make money online?  1.000000
2   55585        What is the best way for making money online?  0.989930
3   28280         What are the best ways to make money online?  0.981526
4  157045  What is the best way to make money on the internet?  0.978538



 Original question : How can i build an e-commerce website?

 Most similar questions based on pinecone vector search: 

       id                                                   question     score
0  119383                   How can I develop an e-commerce website?  0.925466
1    1713                 How would I develop an e-commerce website?  0.925466
2    1714                     How do I create an e-commerce website?  0.919407


## Delete the Index

Delete the index once you are sure that you no longer need it. Once it is deleted, you cannot reuse it.


In [16]:
pinecone.delete_index(index_name)