[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/multitask/instructor-multitask.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/multitask/instructor-multitask.ipynb)

# Using **Pinecone** Vector Database with Multitask Embedding Model - [InstructOR](https://huggingface.co/hkunlp/instructor-large)


Text embeddings represent discrete text inputs (e.g., sentences, documents, and code) as fixed-sized vectors that can be used in many downstream tasks. These tasks include semantic search, document retrieval for question-answering, prompt retrieval for in-context learning and beyond.

However, most existing embeddings can have *significantly degraded performance when applied to new tasks or domains*. Moreover, existing embeddings usually perform poorly when applied to the same type of task but in different domains such as medicine and finance.

In this notebook, we will demonstrate how text embeddings (even for the same text input) can be adjusted to different downstream applications using **task and domain descriptions**, *without* further task- or domain-specific finetuning using the multitask embedding model - [InstructOR](https://huggingface.co/hkunlp/instructor-large).

First, we need to install the `InstructorEmbedding` library and other dependencies.

In [1]:
!pip install -qU \
    InstructorEmbedding \
    sentence-transformers \
    pinecone-client


## Initialization

Now we can instantiate our InstructOR model using the `InstructorEmbedding` library we installed above. We just need to specify the [Hugging Face repository name](https://huggingface.co/hkunlp/instructor-large) for the model.

In [2]:
from InstructorEmbedding import INSTRUCTOR

repository_name = 'hkunlp/instructor-large'

model = INSTRUCTOR(repository_name)

load INSTRUCTOR_Transformer
max_seq_length  512


Let's see how the InstructOR model can create embeddings by providing an `input_text` sentence and an `instruction` to it.


In [3]:
input_text = (
    "Exploring the Impact of Climate Change on Biodiversity: A Comprehensive "
    "Analysis of Species Distribution Shifts and Ecosystem Resilience"
)
instruction = "Represent the Science title:"

embeddings = model.encode([[instruction, input_text]])

print(embeddings.shape)

(1, 768)


Here, the InstructOR model embeds `input_text` with respect to the task described by `instruction`. If we encoded the same `input_text` with a different instruction, such as `"Represent the sentiment:"`, we would get a very different embedding, because the meaning under the new instruction is very different.

The output is a `768`-dimensional embedding, which is the embedding dimensionality of the InstructOR model. We'll need this value later when creating our Pinecone index.

In [4]:
embeddings_dim = embeddings.shape[1]

You can also encode multiple combinations of `input_texts` and `instructions` in a single batched call.

In [5]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

input_texts = ["Sentence 1", "Sentence 2", "Sentence 3", "Sentence 4"]
instructions = ["Represent something:"] * len(input_texts)

inputs = [[instruction, input_text] for instruction, input_text in zip(instructions, input_texts)]

batch_size = 2

embeddings = model.encode(
    sentences=inputs,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True,
    device=device
)

print(embeddings.shape)

Batches:   0%|          | 0/2 [00:00<?, ?it/s]

(4, 768)


Let's see how we can use the InstructOR model for various use cases.

## Use cases

If you want to calculate customized embeddings for specific sentences, you may follow the unified template to write instructions:

> "Represent the [**domain**] [**text_type**] for [**task_objective**]:"

Here are some examples:

- "Represent the **Science** **sentence**:"
- "Represent the **Financial** **statement**:"
- "Represent the **Wikipedia** **document** for **retrieval**:"
- "Represent the **Wikipedia** **question** for **retrieving supporting documents**:"
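
The template above can be filled in programmatically. The following is a minimal sketch; the helper names `build_instruction` and `pair_with_instruction` are hypothetical and not part of the `InstructorEmbedding` library. The second helper produces the `[[instruction, text], ...]` input format used with `model.encode` in this notebook.

```python
from typing import List, Optional


def build_instruction(domain: str, text_type: str, task_objective: Optional[str] = None) -> str:
    """Fill the 'Represent the [domain] [text_type] for [task_objective]:' template."""
    instruction = f"Represent the {domain} {text_type}"
    if task_objective:
        # The task objective is optional, as in "Represent the Science sentence:"
        instruction += f" for {task_objective}"
    return instruction + ":"


def pair_with_instruction(instruction: str, texts: List[str]) -> List[List[str]]:
    """Pair one instruction with every text, giving [[instruction, text], ...]."""
    return [[instruction, text] for text in texts]


print(build_instruction("Science", "sentence"))
print(build_instruction("Wikipedia", "document", "retrieval"))
print(pair_with_instruction(build_instruction("Financial", "statement"), ["Sentence 1", "Sentence 2"]))
```

Keeping the instruction in one place makes it easier to use exactly the same wording for every document in a corpus.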

### First use case - Sentence Similarity Search

Let's see how we can use the model to compute the semantic similarity between two groups of sentences, following the customized embedding template.

In [6]:
from sklearn.metrics.pairwise import cosine_similarity

sentences_a = [['Represent the Pharmaceutical definition: ', 'Aspirin: Aspirin is a widely-used over-the-counter medication known for its anti-inflammatory and analgesic properties. It is commonly used to relieve pain, reduce fever, and alleviate minor aches and pains.'],
               ['Represent the Artistic definition: ', "Impressionism: Impressionism is an art movement that emerged in the late 19th century, characterized by the use of short brush strokes and the depiction of light and color to capture the fleeting effects of a scene. It emphasizes the artist's immediate perception and emotional response to the subject."]]
sentences_b = [['Represent the Pharmaceutical definition: ', 'Amoxicillin: Amoxicillin is an antibiotic medication commonly prescribed to treat various bacterial infections, such as respiratory, ear, throat, and urinary tract infections. It belongs to the penicillin class of antibiotics and works by inhibiting bacterial cell wall synthesis.'],
               ['Represent the Artistic definition: ', "Sculpture: Sculpture is a form of visual art that involves creating three-dimensional objects by carving, modeling, or molding materials such as stone, wood, metal, clay, or other materials. Sculptures can be representational or abstract and are often displayed in galleries, museums, or public spaces."]]

embeddings_a = model.encode(sentences_a)
embeddings_b = model.encode(sentences_b)

similarities = cosine_similarity(embeddings_a,embeddings_b)

print(similarities)

[[0.8799274  0.7748538 ]
 [0.7468935  0.82635736]]


We can see that sentences from the same domain have a higher cosine similarity (about *0.88 and 0.83*) than sentences from different domains (about *0.77 and 0.75*).
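
To make the layout of the similarity matrix concrete, here is a small pure-Python sketch of the calculation: entry `[i][j]` compares row `i` of the first group with row `j` of the second. The helper names and the toy 3-dimensional vectors are illustrative assumptions, not model output.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_matrix(rows_a, rows_b):
    """Pairwise similarities; result[i][j] compares rows_a[i] with rows_b[j]."""
    return [[cosine(u, v) for v in rows_b] for u in rows_a]


# Toy vectors standing in for real embeddings (illustrative only)
group_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
group_b = [[2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]

for row in cosine_matrix(group_a, group_b):
    print([round(value, 4) for value in row])
```

In the matrix printed by the notebook, the diagonal therefore holds the same-domain pairs and the off-diagonal entries hold the cross-domain pairs.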

### Second use case - Question Answering

Question-answering is very similar to semantic search in that we're comparing texts and calculating a similarity metric between them. However, it differs in that we are not looking for the direct semantic similarity between two chunks of text. We're instead looking for the relevance between a question and an answer (or a chunk of text that may contain the answer).

Unlike semantic similarity, our question may have a very different semantic meaning to the context (text containing our answer) that we'd like to retrieve. For example, in a pure semantic search the sentences:

```
"tell me the name of the capital of France"
```

and

```
"Paris is the capital and most populous city of France, with an official estimated population of 2.1M residents as of January 2023 in an area of more than 105km^2."
```

would not be similar: they have very different meanings despite both being about Paris.

In question-answering, these two sentences would be a perfect match and should have a very high similarity. Let's see how we apply this idea to our InstructOR embeddings.

In the example below, we define a `question` with the corresponding instruction, and then we create a `corpus` of various sentences with some object definitions and their instructions. Afterward, we utilize `cosine_similarity` to find the most similar document (sentence) from the corpus that we can use to answer our question.

In [7]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

question  = [['Represent the Wikipedia question for retrieving supporting documents: ',
           'When were pocket watches popular?']]

corpus = [['Represent the Wikipedia document for retrieval: ',
           'A canvas painting is artwork created on a canvas surface using various painting techniques and mediums like oil, acrylic, or watercolor. It is popular in traditional and contemporary art, displayed in galleries, museums, and homes.'],
          ['Represent the Wikipedia document for retrieval: ',
           'A cinema, also known as a movie theater or movie house, is a venue where films are shown to an audience for entertainment. It typically consists of a large screen, seating arrangements, and audio-visual equipment to project and play movies.'],
          ['Represent the Wikipedia document for retrieval: ',
           'A pocket watch is a small, portable timekeeping device with a clock face and hands, designed to be carried in a pocket or attached to a chain. It is typically made of materials such as metal, gold, or silver and was popular during the 18th and 19th centuries.'],
          ['Represent the Wikipedia document for retrieval: ',
           'A laptop is a compact and portable computer with a keyboard and screen, ideal for various tasks on the go. It offers versatility for browsing, word processing, multimedia, gaming, and professional work.']]

question_embeddings = model.encode(question)
corpus_embeddings = model.encode(corpus)

similarities = cosine_similarity(question_embeddings, corpus_embeddings)
retrieved_doc_id = np.argmax(similarities)

print(retrieved_doc_id)

2


We can see that the document containing the most important information is at index 2, which is exactly what we expected.
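
Note that `np.argmax` on the `(1, 4)` similarity array returns an index into the flattened array, which equals the document index only because there is a single question. When more than one candidate is useful, the scores can be ranked instead. The sketch below uses a hypothetical helper `rank_documents` and made-up scores, not the values computed above.

```python
def rank_documents(scores, top_k=2):
    """Return (doc_index, score) pairs for the top_k highest scores."""
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


# Made-up similarity scores for a four-document corpus (illustrative only)
toy_scores = [0.71, 0.69, 0.90, 0.72]

for doc_index, score in rank_documents(toy_scores):
    print(doc_index, score)
```

With real embeddings, `scores` would be one row of the matrix returned by `cosine_similarity`, for example `similarities[0]`.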

## Using **Pinecone** as a vector database for storing InstructOR embeddings

Now we can instantiate the Pinecone object which we are going to use to create our index. You can get your credentials from the [Pinecone Console](https://app.pinecone.io/).



In [9]:
import os
from pinecone import Pinecone

# get api key from app.pinecone.io
api_key = os.environ.get('PINECONE_API_KEY') or 'YOUR_PINECONE_API_KEY'
# find your environment next to the api key in pinecone console
env = os.environ.get('PINECONE_ENVIRONMENT') or 'YOUR_PINECONE_ENVIRONMENT'

# create the client; the instance is named `pinecone` because the cells
# below call `pinecone.list_indexes()`, `pinecone.create_index()`, etc.
pinecone = Pinecone(api_key=api_key)

Let's see how we can enrich the previously described use cases with the **Pinecone** vector database, which lets us store a large number of embeddings and still retrieve relevant results quickly.
<br><br>
Below, we are going to use a small, custom-made corpus of sentences, but in a real use-case, you can utilize datasets of much larger size by following the same syntax.

### First use case - Semantic Search with Pinecone

Here, we are defining the index name that we will use in the initialization process.
<br>
Additionally, the index `dimension` must match the InstructOR model's embedding dimension, which we stored earlier in `embeddings_dim`.

In [10]:
index_name = "instructor-semantic-search"

In [19]:
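import time
from pinecone import PodSpec

# only create index if it doesn't exist
if index_name not in pinecone.list_indexes().names():
    pinecone.create_index(
        name=index_name,
        dimension=embeddings_dim,
        metric='cosine',
        # assumption: a pod-based index in the environment read above;
        # serverless projects would pass a ServerlessSpec instead
        spec=PodSpec(environment=env)
    )
    # wait until the index is ready
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)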

# now connect to the index
index = pinecone.Index(index_name)

We need to create a corpus of `text_input` and `instruction` pairs, as shown below. Each instruction follows the template we described above (*Represent the ...*), while `text_input` contains the actual content.
<br><br>
We are adding definitions from different domains so that we can test later on how our similarity search is going to match these examples.

In [20]:
corpus = [['Represent the Pharmaceutical definition: ','Aspirin: Aspirin is a widely-used over-the-counter medication known for its anti-inflammatory and analgesic properties. It is commonly used to relieve pain, reduce fever, and alleviate minor aches and pains.'],
          ['Represent the Pharmaceutical definition: ','Amoxicillin: Amoxicillin is an antibiotic medication commonly prescribed to treat various bacterial infections, such as respiratory, ear, throat, and urinary tract infections. It belongs to the penicillin class of antibiotics and works by inhibiting bacterial cell wall synthesis.'],
          ['Represent the Pharmaceutical definition: ','Atorvastatin: Atorvastatin is a lipid-lowering medication used to manage high cholesterol levels and reduce the risk of cardiovascular events. It belongs to the statin class of drugs and works by inhibiting an enzyme involved in cholesterol production in the liver.'],
          ['Represent the Financial definition: ', "Asset Allocation: Asset allocation is a financial strategy that involves distributing an investment portfolio across various asset classes, such as stocks, bonds, cash, and real estate, to achieve the desired risk-return balance based on an individual's financial goals and risk tolerance."],
          ['Represent the Financial definition: ', 'Capital Gains: Capital gains refer to the profits realized from the sale of a capital asset, such as stocks, real estate, or mutual funds, at a price higher than its original purchase price. These gains are subject to capital gains taxes, which vary based on the holding period and tax laws of the country.'],
          ['Represent the Financial definition: ', "Debt-to-Equity Ratio: The debt-to-equity ratio is a financial metric used to assess a company's financial leverage. It is calculated by dividing the total debt (long-term and short-term liabilities) of a company by its total shareholders' equity. A higher ratio indicates a higher level of debt financing relative to equity, which may signify higher financial risk."],
          ['Represent the Artistic definition: ', "Impressionism: Impressionism is an art movement that emerged in the late 19th century, characterized by the use of short brush strokes and the depiction of light and color to capture the fleeting effects of a scene. It emphasizes the artist's immediate perception and emotional response to the subject."],
          ['Represent the Artistic definition: ', "Sculpture: Sculpture is a form of visual art that involves creating three-dimensional objects by carving, modeling, or molding materials such as stone, wood, metal, clay, or other materials. Sculptures can be representational or abstract and are often displayed in galleries, museums, or public spaces."],
          ['Represent the Artistic definition: ', "Abstract Expressionism: Abstract Expressionism is an art movement that developed in the mid-20th century, characterized by non-representational and spontaneous artworks conveying the artist's emotions and subconscious thoughts. It often features large-scale canvases with bold brushwork and a focus on the artist's gestural movements."]]

In [21]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 4

vectors = model.encode(
    sentences=corpus,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True,
    device=device
)

Batches:   0%|          | 0/3 [00:00<?, ?it/s]

Here, we put the records into a format that Pinecone accepts: `(id, vector, metadata)` tuples.

In [22]:
ids = [str(x) for x in range(len(corpus))]
metadata = [{'text': record[1]} for record in corpus]
# materialize the tuples so the records can be inspected or reused
records = list(zip(ids, vectors.tolist(), metadata))

Now we need to upsert these vectors.
<br>
Afterward, we can move on to the query phase.

In [23]:
index.upsert(vectors=records)

{'upserted_count': 9}
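
With nine records a single upsert call is enough, but larger corpora are usually sent in batches. The sketch below shows one way to do that, assuming the dictionary record format (`id`, `values`, `metadata`) that Pinecone also accepts. The helper names `to_records` and `chunked`, the batch size, and the toy data are illustrative assumptions, and the actual `index.upsert` call is left as a comment so the example runs without a live index.

```python
from itertools import islice


def to_records(ids, vectors, texts):
    """Build record dicts from parallel lists of ids, vectors, and texts."""
    return [
        {"id": record_id, "values": vector, "metadata": {"text": text}}
        for record_id, vector, text in zip(ids, vectors, texts)
    ]


def chunked(items, batch_size):
    """Yield consecutive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy stand-ins for the real ids, embeddings, and texts (illustrative only)
toy_ids = ["0", "1", "2", "3", "4"]
toy_vectors = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
toy_texts = ["doc a", "doc b", "doc c", "doc d", "doc e"]

toy_records = to_records(toy_ids, toy_vectors, toy_texts)
for batch in chunked(toy_records, batch_size=2):
    # With a live index, this is where `index.upsert(vectors=batch)` would go.
    print([record["id"] for record in batch])
```

A suitable batch size depends on the vector dimension and metadata size, so treat the value here as a placeholder.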

In the query phase, we are preparing one definition of post-impressionism and then checking which documents in the index are the most semantically similar.


In [24]:
query  = [['Represent the Artistic definition: ','Post-Impressionism: Post-Impressionism is an art movement that developed in the late 19th and early 20th centuries as a reaction to Impressionism. Artists associated with Post-Impressionism sought to explore new ways of expressing emotions and ideas through their art. While retaining some aspects of Impressionism, they moved towards more symbolic and abstract representations, emphasizing the use of color, form, and brushwork to convey deeper meaning and subjective experiences.']]

# create the query embedding
query_embedding = model.encode(query)

# now query
# `encode` returns shape (1, 768); pass the single row as a flat list of floats
result = index.query(vector=query_embedding[0].tolist(), top_k=3, include_metadata=True)
result

{'matches': [{'id': '6',
              'metadata': {'text': 'Impressionism: Impressionism is an art '
                                   'movement that emerged in the late 19th '
                                   'century, characterized by the use of short '
                                   'brush strokes and the depiction of light '
                                   'and color to capture the fleeting effects '
                                   "of a scene. It emphasizes the artist's "
                                   'immediate perception and emotional '
                                   'response to the subject.'},
              'score': 0.927976,
              'values': []},
             {'id': '8',
              'metadata': {'text': 'Abstract Expressionism: Abstract '
                                   'Expressionism is an art movement that '
                                   'developed in the mid-20th century, '
                                   'characterized by non-rep

We received great results! The most similar definition is the one about Impressionism, and the following two are both from the artistic domain. This suggests that the InstructOR embeddings capture the domain and content of each definition, and that the Pinecone index returns relevant similarity search results.

In [25]:
pinecone.delete_index(index_name)

### Second use case - Question-Answering with Pinecone

Let's see how InstructOR and Pinecone will behave in our second use-case: question-answering.


Again, we are preparing our index by setting the index name, creating embeddings, and moving them into the right format.

In [26]:
index_name = "instructor-information-retrieval"

In [27]:
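import time
from pinecone import PodSpec

# only create index if it doesn't exist
if index_name not in pinecone.list_indexes().names():
    pinecone.create_index(
        name=index_name,
        dimension=embeddings_dim,
        metric='cosine',
        # assumption: a pod-based index; serverless projects would pass
        # a ServerlessSpec instead
        spec=PodSpec(environment=env)
    )
    # wait until the index is ready before connecting
    while not pinecone.describe_index(index_name).status['ready']:
        time.sleep(1)

# now connect to the index
index = pinecone.Index(index_name)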

In this use-case, we have a slightly different corpus. Our instructions now specify that we need to use these Wikipedia documents for retrieval. This will assist us in the query phase, where we will provide the question and attempt to find the answer within the documents from the corpus.

In [28]:
corpus = [['Represent the Wikipedia document for retrieval: ',
           'A canvas painting is artwork created on a canvas surface using various painting techniques and mediums like oil, acrylic, or watercolor. It is popular in traditional and contemporary art, displayed in galleries, museums, and homes.'],
          ['Represent the Wikipedia document for retrieval: ',
           'A cinema, also known as a movie theater or movie house, is a venue where films are shown to an audience for entertainment. It typically consists of a large screen, seating arrangements, and audio-visual equipment to project and play movies.'],
          ['Represent the Wikipedia document for retrieval: ',
           'A pocket watch is a small, portable timekeeping device with a clock face and hands, designed to be carried in a pocket or attached to a chain. It is typically made of materials such as metal, gold, or silver and was popular during the 18th and 19th centuries.'],
          ['Represent the Wikipedia document for retrieval: ',
           'A laptop is a compact and portable computer with a keyboard and screen, ideal for various tasks on the go. It offers versatility for browsing, word processing, multimedia, gaming, and professional work.']]


In [29]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
batch_size = 4

vectors = model.encode(
    sentences=corpus,
    batch_size=batch_size,
    show_progress_bar=True,
    convert_to_numpy=True,
    device=device
)

Batches:   0%|          | 0/1 [00:00<?, ?it/s]

In [30]:
ids = [str(x) for x in range(len(corpus))]
metadata = [{'text': record[1]} for record in corpus]
# materialize the tuples so the records can be inspected or reused
records = list(zip(ids, vectors.tolist(), metadata))

In [32]:
index.upsert(vectors=records)

{'upserted_count': 4}

Let's try it with a simple question first.

In [35]:
question  = [['Represent the Wikipedia question for retrieving supporting documents: ',
           'Is there any other name for a cinema?']]

# create the question embedding
question_embedding = model.encode(question)

# now query
# `encode` returns shape (1, 768); pass the single row as a flat list of floats
result = index.query(vector=question_embedding[0].tolist(), top_k=1, include_metadata=True)
result

{'matches': [{'id': '1',
              'metadata': {'text': 'A cinema, also known as a movie theater or '
                                   'movie house, is a venue where films are '
                                   'shown to an audience for entertainment. It '
                                   'typically consists of a large screen, '
                                   'seating arrangements, and audio-visual '
                                   'equipment to project and play movies.'},
              'score': 0.901540339,
              'values': []}],
 'namespace': ''}

Perfect! We have found the document containing the information that answers our question. Now let's check one more.

In [36]:
question  = [['Represent the Wikipedia question for retrieving supporting documents: ',
           'When were pocket watches popular?']]

# create the question embedding
question_embedding = model.encode(question)

# now query
# `encode` returns shape (1, 768); pass the single row as a flat list of floats
result = index.query(vector=question_embedding[0].tolist(), top_k=1, include_metadata=True)
result

{'matches': [{'id': '2',
              'metadata': {'text': 'A pocket watch is a small, portable '
                                   'timekeeping device with a clock face and '
                                   'hands, designed to be carried in a pocket '
                                   'or attached to a chain. It is typically '
                                   'made of materials such as metal, gold, or '
                                   'silver and was popular during the 18th and '
                                   '19th centuries.'},
              'score': 0.910102367,
              'values': []}],
 'namespace': ''}

Again, we received relevant results.
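
The responses printed above are nested, so a compact view of each match can be easier to read. The sketch below assumes the response can be read like the dictionary shown in the output (`matches`, each with `id`, `score`, and `metadata`). The helper `summarize_matches` is hypothetical, and `sample_response` is a hand-written stand-in that copies the id and score from the cinema query with a shortened text.

```python
def summarize_matches(response, max_chars=60):
    """Return (id, rounded score, text snippet) for every match in a response."""
    rows = []
    for match in response["matches"]:
        text = match["metadata"]["text"]
        # Shorten long texts so each match fits on one line
        snippet = text if len(text) <= max_chars else text[:max_chars] + "..."
        rows.append((match["id"], round(match["score"], 3), snippet))
    return rows


# Hand-written stand-in shaped like the printed query response (text shortened)
sample_response = {
    "matches": [
        {
            "id": "1",
            "score": 0.901540339,
            "metadata": {"text": "A cinema, also known as a movie theater or movie house, "
                                 "is a venue where films are shown to an audience."},
            "values": [],
        }
    ],
    "namespace": "",
}

for match_id, score, snippet in summarize_matches(sample_response):
    print(match_id, score, snippet)
```

With a live index, the same helper could be tried on `result` directly, provided the response object supports this dictionary-style access.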

In [30]:
pinecone.delete_index(index_name)

## Summary

This example showcased the [InstructOR](https://huggingface.co/hkunlp/instructor-large) model's ability to create embeddings effectively across different domains and use-cases. Additionally, the demonstration highlighted how **Pinecone** serves as a valuable vector database, enabling us to store a large number of documents and get appropriate results when querying the database.