## InstructOR - A multitask custom embedding model for task-based applications, made easier with LanceDB
![instruct](https://github.com/lancedb/vectordb-recipes/blob/main/examples/instruct-multitask/embeddings11.png?raw=1)

### Installing all dependencies

In [1]:
!pip install lancedb


In [2]:
!pip install InstructorEmbedding sentence-transformers torch pandas


To compute embeddings customized for a specific task, write instructions that follow this unified template:

"Represent the [**domain**] [**text_type**] for [**task_objective**]:"

Here are some examples:

- "Represent the **Science** **sentence**:"
- "Represent the **Financial** **statement**:"
- "Represent the **Wikipedia** **document** for **retrieval**:"
- "Represent the **Wikipedia** **question** for **retrieving supporting documents**:"
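
The template can be filled in programmatically. The sketch below is illustrative only: `build_instruction` is a hypothetical helper, not part of LanceDB or InstructorEmbedding. The examples above omit the task objective; treating the domain as optional too is an assumption of this sketch.

```python
from typing import Optional


def build_instruction(
    text_type: str,
    domain: Optional[str] = None,
    task_objective: Optional[str] = None,
) -> str:
    """Fill in 'Represent the [domain] [text_type] for [task_objective]:'.

    Hypothetical helper for illustration; only text_type is required here.
    """
    parts = ["Represent the"]
    if domain:
        parts.append(domain)
    parts.append(text_type)
    if task_objective:
        parts.append(f"for {task_objective}")
    # The template always ends with a colon.
    return " ".join(parts) + ":"


print(build_instruction("sentence", domain="Science"))
print(build_instruction("document", domain="Wikipedia", task_objective="retrieval"))
```

The two calls reproduce the first and third examples in the list above.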

### Importing necessary libraries

In [3]:
import lancedb
from lancedb.pydantic import LanceModel, Vector
from lancedb.embeddings import get_registry
from lancedb.embeddings import InstructorEmbeddingFunction

### Calling the embedding model through LanceDB's embedding API

In [4]:
instructor = (
    get_registry()
    .get("instructor")
    .create(
        source_instruction="represent the document for retreival",
        query_instruction="represent the document for most similar definition",
    )
)


class Schema(LanceModel):
    vector: Vector(instructor.ndims()) = instructor.VectorField()
    text: str = instructor.SourceField()


# Creating LanceDB table
db = lancedb.connect("~/.lancedb")
tbl = db.create_table("intruct-multitask", schema=Schema, mode="overwrite")


load INSTRUCTOR_Transformer
max_seq_length  512
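
Conceptually, the two instructions play different roles: the source instruction accompanies each document when rows are added, and the query instruction accompanies the search string. The following is a minimal sketch of that pairing, assuming the model consumes `[instruction, text]` pairs. `pair_with_instruction` is a hypothetical helper and the sample document is invented; this is not LanceDB's internal code.

```python
from typing import List

# Instruction strings copied verbatim from the cell above.
SOURCE_INSTRUCTION = "represent the document for retreival"
QUERY_INSTRUCTION = "represent the document for most similar definition"


def pair_with_instruction(instruction: str, texts: List[str]) -> List[List[str]]:
    """Return one [instruction, text] pair per input text (hypothetical helper)."""
    return [[instruction, text] for text in texts]


# Documents are paired with the source instruction at ingestion time ...
doc_pairs = pair_with_instruction(SOURCE_INSTRUCTION, ["Aspirin is a medication."])
# ... while search strings are paired with the query instruction.
query_pairs = pair_with_instruction(QUERY_INSTRUCTION, ["amoxicillin"])

print(doc_pairs)
print(query_pairs)
```

Because the same text can be paired with different instructions, one model can produce task-specific embeddings without retraining.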


### Adding Data to the Table

In [5]:
data_f1 = [
    {
        "text": "Aspirin is a widely-used over-the-counter medication known for its anti-inflammatory and analgesic properties. It is commonly used to relieve pain, reduce fever, and alleviate minor aches and pains."
    },
    {
        "text": "Amoxicillin is an antibiotic medication commonly prescribed to treat various bacterial infections, such as respiratory, ear, throat, and urinary tract infections. It belongs to the penicillin class of antibiotics and works by inhibiting bacterial cell wall synthesis."
    },
    {
        "text": "Atorvastatin is a lipid-lowering medication used to manage high cholesterol levels and reduce the risk of cardiovascular events. It belongs to the statin class of drugs and works by inhibiting an enzyme involved in cholesterol production in the liver."
    },
    {
        "text": "The Theory of Relativity is a fundamental physics theory developed by Albert Einstein, consisting of the special theory of relativity and the general theory of relativity. It revolutionized our understanding of space, time, and gravity."
    },
    {
        "text": "Photosynthesis is a vital biological process by which green plants, algae, and some bacteria convert light energy into chemical energy in the form of glucose, using carbon dioxide and water."
    },
    {
        "text": "The Big Bang Theory is the prevailing cosmological model that describes the origin of the universe. It suggests that the universe began as a singularity and has been expanding for billions of years."
    },
    {
        "text": "Compound Interest is the addition of interest to the principal sum of a loan or investment, resulting in the interest on interest effect over time."
    },
    {
        "text": "Stock Market is a financial marketplace where buyers and sellers trade ownership in companies, typically in the form of stocks or shares."
    },
    {
        "text": "Inflation is the rate at which the general level of prices for goods and services is rising and subsequently purchasing power is falling."
    },
    {
        "text": "Diversification is an investment strategy that involves spreading your investments across different asset classes to reduce risk."
    },
    {
        "text": "Liquidity refers to how easily an asset can be converted into cash without a significant loss of value. It's a key consideration in financial management."
    },
    {
        "text": "401(k) is a retirement savings plan offered by employers, allowing employees to save and invest a portion of their paycheck before taxes."
    },
    {
        "text": "Ballet is a classical dance form that originated in the Italian Renaissance courts of the 15th century and later developed into a highly technical art."
    },
    {
        "text": "Rock and Roll is a genre of popular music that originated and evolved in the United States during the late 1940s and early 1950s, characterized by a strong rhythm and amplified instruments."
    },
    {
        "text": "Cuisine is a style or method of cooking, especially as characteristic of a particular country, region, or establishment."
    },
    {"text": "Renaissance was a cultural, artistic, and intellectual movement that"},
    {
        "text": "Neutrino is subatomic particles with very little mass and no electric charge. They are produced in various nuclear reactions, including those in the Sun, and play a significant role in astrophysics and particle physics."
    },
    {
        "text": "Higgs Boson is a subatomic particle that gives mass to other elementary particles. Its discovery was a significant achievement in particle physics."
    },
    {
        "text": "Quantum Entanglement is a quantum physics phenomenon where two or more particles become connected in such a way that the state of one particle is dependent on the state of the other(s), even when they are separated by large distances."
    },
    {
        "text": "Genome Sequencing is the process of determining the complete DNA sequence of an organism's genome. It has numerous applications in genetics, biology, and medicine."
    },
]

tbl.add(data_f1)


## First use case - Semantic Search with LanceDB

In [6]:
# The table's registered embedding function embeds the query string automatically, so there is no need to embed the query manually
query = "amoxicillin"
result = tbl.search(query).limit(1).to_pandas()


In [7]:
# printing the output
print(result)

                                              vector  \
0  [-0.024510663, 0.0005563084, 0.028840268, 0.08...   

                                                text  _distance  
0  Amoxicillin is an antibiotic medication common...   0.163671  
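
To make the `_distance` column concrete, here is a brute-force sketch of what a `limit(1)` vector search does in principle: score every row against the query vector and keep the closest one. The 3-dimensional vectors are invented for illustration, and squared Euclidean distance is an assumption of this sketch rather than a statement about how the table above is configured.

```python
from typing import Dict, List, Tuple


def squared_l2(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(query_vec: List[float], rows: List[Dict], limit: int = 1) -> List[Tuple[str, float]]:
    """Return the `limit` rows closest to query_vec as (text, distance) tuples."""
    scored = [(row["text"], squared_l2(query_vec, row["vector"])) for row in rows]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:limit]


# Toy rows: made-up vectors standing in for real embeddings.
rows = [
    {"text": "amoxicillin entry", "vector": [0.9, 0.1, 0.0]},
    {"text": "inflation entry", "vector": [0.0, 0.2, 0.9]},
    {"text": "ballet entry", "vector": [0.1, 0.9, 0.1]},
]
query_vec = [1.0, 0.0, 0.0]

top = nearest(query_vec, rows, limit=1)
print(top)
```

A smaller distance means a closer match, which is why the amoxicillin row ranks first in this toy example.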


## Same Input Data with Different Instruction Pair

In [8]:
instructor = (
    get_registry()
    .get("instructor")
    .create(
        source_instruction="represent the captions",
        query_instruction="represent the captions for retrieving duplicate captions",
    )
)


class Schema(LanceModel):
    vector: Vector(instructor.ndims()) = instructor.VectorField()
    text: str = instructor.SourceField()


db = lancedb.connect("~/.lancedb")
tbl = db.create_table("intruct-multitask", schema=Schema, mode="overwrite")

load INSTRUCTOR_Transformer
max_seq_length  512


In [9]:
data_f2 = [
    {
        "text": "Aspirin is a widely-used over-the-counter medication known for its anti-inflammatory and analgesic properties. It is commonly used to relieve pain, reduce fever, and alleviate minor aches and pains."
    },
    {
        "text": "Amoxicillin is an antibiotic medication commonly prescribed to treat various bacterial infections, such as respiratory, ear, throat, and urinary tract infections. It belongs to the penicillin class of antibiotics and works by inhibiting bacterial cell wall synthesis."
    },
    {
        "text": "Atorvastatin is a lipid-lowering medication used to manage high cholesterol levels and reduce the risk of cardiovascular events. It belongs to the statin class of drugs and works by inhibiting an enzyme involved in cholesterol production in the liver."
    },
    {
        "text": "The Theory of Relativity is a fundamental physics theory developed by Albert Einstein, consisting of the special theory of relativity and the general theory of relativity. It revolutionized our understanding of space, time, and gravity."
    },
    {
        "text": "Photosynthesis is a vital biological process by which green plants, algae, and some bacteria convert light energy into chemical energy in the form of glucose, using carbon dioxide and water."
    },
    {
        "text": "The Big Bang Theory is the prevailing cosmological model that describes the origin of the universe. It suggests that the universe began as a singularity and has been expanding for billions of years."
    },
    {
        "text": "Compound Interest is the addition of interest to the principal sum of a loan or investment, resulting in the interest on interest effect over time."
    },
    {
        "text": "Stock Market is a financial marketplace where buyers and sellers trade ownership in companies, typically in the form of stocks or shares."
    },
    {
        "text": "Inflation is the rate at which the general level of prices for goods and services is rising and subsequently purchasing power is falling."
    },
    {
        "text": "Diversification is an investment strategy that involves spreading your investments across different asset classes to reduce risk."
    },
    {
        "text": "Liquidity refers to how easily an asset can be converted into cash without a significant loss of value. It's a key consideration in financial management."
    },
    {
        "text": "401(k) is a retirement savings plan offered by employers, allowing employees to save and invest a portion of their paycheck before taxes."
    },
    {
        "text": "Ballet is a classical dance form that originated in the Italian Renaissance courts of the 15th century and later developed into a highly technical art."
    },
    {
        "text": "Rock and Roll is a genre of popular music that originated and evolved in the United States during the late 1940s and early 1950s, characterized by a strong rhythm and amplified instruments."
    },
    {
        "text": "Cuisine is a style or method of cooking, especially as characteristic of a particular country, region, or establishment."
    },
    {"text": "Renaissance was a cultural, artistic, and intellectual movement that"},
    {
        "text": "Neutrino is subatomic particles with very little mass and no electric charge. They are produced in various nuclear reactions, including those in the Sun, and play a significant role in astrophysics and particle physics."
    },
    {
        "text": "Higgs Boson is a subatomic particle that gives mass to other elementary particles. Its discovery was a significant achievement in particle physics."
    },
    {
        "text": "Quantum Entanglement is a quantum physics phenomenon where two or more particles become connected in such a way that the state of one particle is dependent on the state of the other(s), even when they are separated by large distances."
    },
    {
        "text": "Genome Sequencing is the process of determining the complete DNA sequence of an organism's genome. It has numerous applications in genetics, biology, and medicine."
    },
]

tbl.add(data_f2)


In [10]:
# same query, but against the differently embedded data
query = "amoxicillin"
result = tbl.search(query).limit(1).to_pandas()


In [11]:
# showing the result
print(result)

                                              vector  \
0  [-0.02448329, 0.00093284156, 0.033273738, 0.07...   

                                                text  _distance  
0  Amoxicillin is an antibiotic medication common...    0.18135  


### The **_distance** values differ between the two instruction pairs (0.163671 vs. 0.18135), which indicates that the instructions affect the embeddings
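
The size of the gap can be quantified with a short calculation. This sketch only does arithmetic on the two `_distance` values printed above; the variable names are chosen here for illustration.

```python
# _distance values copied from the two printed results above.
distance_retrieval = 0.163671  # first instruction pair (retrieval)
distance_captions = 0.18135    # second instruction pair (captions)

absolute_gap = distance_captions - distance_retrieval
relative_gap = absolute_gap / distance_retrieval

print(f"absolute gap: {absolute_gap:.6f}")
print(f"relative gap: {relative_gap:.1%}")
```

The same query and the same top document end up roughly a tenth further apart under the caption instructions than under the retrieval instructions.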

## Second use case - **Question Answering** with LanceDB

Calling the embedding model with a different instruction pair

In [12]:
instructor = (
    get_registry()
    .get("instructor")
    .create(
        source_instruction="represent the docuement for retreival",
        query_instruction="represent the query for retrieval",
    )
)


class Schema(LanceModel):
    vector: Vector(instructor.ndims()) = instructor.VectorField()
    text: str = instructor.SourceField()


db = lancedb.connect("~/.lancedb-qa")
tbl = db.create_table("intruct-multitask-qa", schema=Schema, mode="overwrite")

load INSTRUCTOR_Transformer
max_seq_length  512


In [13]:
data_qa = [
    {
        "text": "A canvas painting is artwork created on a canvas surface using various painting techniques and mediums like oil, acrylic, or watercolor. It is popular in traditional and contemporary art, displayed in galleries, museums, and homes."
    },
    {
        "text": "A cinema, also known as a movie theater or movie house, is a venue where films are shown to an audience for entertainment. It typically consists of a large screen, seating arrangements, and audio-visual equipment to project and play movies."
    },
    {
        "text": "A pocket watch is a small, portable timekeeping device with a clock face and hands, designed to be carried in a pocket or attached to a chain. It is typically made of materials such as metal, gold, or silver and was popular during the 18th and 19th centuries."
    },
    {
        "text": "A laptop is a compact and portable computer with a keyboard and screen, ideal for various tasks on the go. It offers versatility for browsing, word processing, multimedia, gaming, and professional work."
    },
]

tbl.add(data_qa)


In [14]:
query = "what is a cinema"
result = tbl.search(query).limit(1).to_pandas()


In [15]:
print(result)

                                              vector  \
0  [0.02184453, 0.0017777232, 0.022723947, 0.0497...   

                                                text  _distance  
0  A cinema, also known as a movie theater or mov...   0.131036  


For more examples like this one, visit the [LanceDB vectordb-recipes repository](https://github.com/lancedb/vectordb-recipes/tree/main).