# Text Embeddings with `sentence-transformers`

#### We'll start off by installing some dependencies: `sentence-transformers` for the models and `milvus` for the vector database. Milvus is known for its scalability and wide adoption among organizations, but we have an "embedded" version too, which is what we'll run locally here.

In [1]:
!pip install -U sentence_transformers
!pip install -U milvus



##### We'll go over the basics first: specifying a model and computing its embeddings.

In [22]:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("llmrails/ember-v1")
model.max_seq_length = 1024

In [23]:
embedding_0 = model.encode("Zilliz is an awesome vector database.")
embedding_0.shape  # a single sentence yields one 1024-dimensional vector

(1024,)

In [24]:
from sentence_transformers.util import cos_sim
sentences = ["Zilliz is a vector data store that is amazing.",
             "Unstructured data can be semantically represented with embeddings.",
             "Singular value decomposition factorizes the input matrix into three other matrices.",
             "My favorite chess opening is the King's Gambit.",
             "It doesn't matter if a cat is black or white, so long as it catches mice."]
embeddings = model.encode(sentences)
print(cos_sim(embedding_0, embeddings))

tensor([[0.9391, 0.5762, 0.5002, 0.3822, 0.3003]])
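The scores above are cosine similarities: the dot product of two vectors divided by the product of their norms. As a minimal sketch of what `cos_sim` computes for a single pair, here is a pure-Python version on small made-up vectors (the helper name `cosine_similarity` and the toy vectors are illustrative and not part of `sentence-transformers`):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (illustrative helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors standing in for real 1024-dimensional embeddings.
query = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]   # query scaled by 2
orthogonal = [3.0, 0.0, -1.0]      # dot product with query is 0

print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

Because the score depends only on direction, scaling a vector leaves it unchanged; only the angle between the two embeddings matters.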


In [14]:
cos_sim(model.encode("I like green eggs and ham."), model.encode("I like green eggs and ham."))

tensor([[1.0000]])

In [15]:
cos_sim(model.encode("Let's eat, Chris."), model.encode("Let's eat Chris!"))

tensor([[0.8867]])

#### Now let's check out how to fine-tune our model.

In [9]:
from sentence_transformers import InputExample
train_examples = [
    InputExample(texts=["Give me a quote on pragmatism.", "Whether the cat is black or white doesn't matter, so long as it catches mice."], label=1.0),
    InputExample(texts=["Y Combinator's 7th birthday was March 11.", "As usual we were so busy we didn't notice till a few days after."], label=1.0)
]

In [10]:
from sentence_transformers import losses
from torch.utils.data import DataLoader
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=1)
train_loss = losses.CosineSimilarityLoss(model)

In [11]:
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=0)

Epoch:   0%|          | 0/1 [00:00<?, ?it/s]

Iteration:   0%|          | 0/2 [00:00<?, ?it/s]
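`CosineSimilarityLoss` compares the cosine similarity of each pair's embeddings with its `label`; by default the penalty is the mean squared error between the two. Below is a rough pure-Python sketch of that objective for a single pair, using made-up 2-dimensional vectors in place of real embeddings. The function name `cosine_similarity_loss` is hypothetical, and the real loss operates on batched tensors rather than Python lists.

```python
import math

def cosine_similarity_loss(u, v, label):
    """Squared error between cos(u, v) and the target label (single-pair sketch)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    cosine = dot / (norm_u * norm_v)
    return (cosine - label) ** 2

# Toy vectors: the first pair already points the same way, the second does not.
aligned_loss = cosine_similarity_loss([1.0, 0.0], [2.0, 0.0], label=1.0)
orthogonal_loss = cosine_similarity_loss([1.0, 0.0], [0.0, 3.0], label=1.0)

print(aligned_loss)
print(orthogonal_loss)
```

With `label=1.0` on both training pairs, training nudges the model so that each prompt and its quote end up with embeddings pointing in nearly the same direction.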

#### How about inserting into a vector database?

In [16]:
from milvus import default_server
default_server.start()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [17]:
from pymilvus import connections
connections.connect(alias="default",
                    host="127.0.0.1", 
                    port=default_server.listen_port,
                    show_startup_banner=True)

In [18]:
from pymilvus import utility, FieldSchema, DataType, Collection, CollectionSchema

if utility.has_collection("default"):
    utility.drop_collection("default")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=1024)
]
schema = CollectionSchema(fields=fields)

collection = Collection(name="default", schema=schema)

index_params = {
    "index_type": "HNSW",
    "metric_type": "COSINE",
    # HNSW build-time parameters; "ef" is a query-time parameter, so it is not set here
    "params": {"M": 64, "efConstruction": 32}
}
collection.create_index(field_name="embedding", index_params=index_params)
collection.load()

In [19]:
collection.insert([{"embedding": e} for e in embeddings])

(insert count: 5, delete count: 0, upsert count: 0, timestamp: 447271173950799876, success count: 5, err count: 0)
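A natural next step, while the server is still running, is to query the collection. The sketch below only assembles the keyword arguments for pymilvus's `Collection.search`, so it runs without a server; `build_search_request` is a hypothetical helper, and the `ef` value of 32 is an assumed choice rather than a recommendation. For HNSW, `ef` is a query-time parameter, and Milvus expects it to be at least as large as the number of results requested.

```python
def build_search_request(query_vector, top_k=3, ef=32):
    """Assemble keyword arguments for Collection.search (hypothetical helper)."""
    if ef < top_k:
        raise ValueError("HNSW search expects ef >= top_k")
    return {
        "data": [list(query_vector)],   # one query vector per search row
        "anns_field": "embedding",      # vector field defined in the schema
        "param": {"metric_type": "COSINE", "params": {"ef": ef}},
        "limit": top_k,                 # number of nearest neighbours to return
    }

# Placeholder query; in this notebook it would be embedding_0 (1024 floats).
request = build_search_request([0.0] * 1024, top_k=3)
print(request["param"], request["limit"])

# With the server running and the collection loaded, the call would look like:
# results = collection.search(**request)
```

Keeping the request as a plain dictionary makes it easy to inspect the metric and `ef` setting before sending anything to the server.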

In [20]:
default_server.stop()
default_server.cleanup()