# Embeddings

You've hopefully heard that encoding is not encryption, and the same is true of embeddings. An embedding is a lossy form of encoding: it maps the semantics of an input to a point in vector space.
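
To make the "encoding is not encryption" point concrete, here is a minimal sketch using Python's standard `base64` module. The sample string is only an illustrative value. The encoded form looks scrambled, but anyone can decode it because no key is involved.

```python
import base64

secret = "ILoveTurtles"  # illustrative value only

# Encoding changes the representation, not the confidentiality.
encoded = base64.b64encode(secret.encode("utf-8")).decode("ascii")

# Decoding needs no key or password, only knowledge of the scheme.
decoded = base64.b64decode(encoded).decode("utf-8")

print(encoded)
print(decoded)
```

An embedding differs in that it is lossy and has no built-in decoder, but the lack of a key is the same.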

In [1]:
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
import numpy
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
import vec2text
import transformers

load_dotenv()
client = OpenAI()

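def get_embedding(text, model="text-embedding-3-small"):
    # Newlines are replaced with spaces before the text is sent to the API.
    text = text.replace("\n", " ")
    response = client.embeddings.create(input=[text], model=model)
    # Return shape (1, n) so the result can be passed straight to cosine_similarity.
    return numpy.array(response.data[0].embedding).reshape(1, -1)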


Semantically similar inputs will be closer in vector-space than dissimilar inputs.

In [2]:
terms = [
    "Yesterday, I went to the grocery store.",
    "I bought turtles.",
    "I bought food.",
    "I bought food earlier this week.",
    "I bought food yesterday from a restaurant.",
    "I bought food yesterday.",
    "I bought groceries a day ago.",
]
embeddings = {term: get_embedding(term) for term in terms}
reference = embeddings[terms[0]]  # every sentence is compared against the first one
for term, term_embedding in embeddings.items():
    similarity = cosine_similarity(reference, term_embedding)[0][0]
    print(f"{round(similarity, 3):<5} {term}")

1.0   Yesterday, I went to the grocery store.
0.228 I bought turtles.
0.509 I bought food.
0.57  I bought food earlier this week.
0.574 I bought food yesterday from a restaurant.
0.643 I bought food yesterday.
0.697 I bought groceries a day ago.
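
The scores above are cosine similarities: the cosine of the angle between two vectors, which ignores their lengths. As a rough sketch of what `cosine_similarity` computes for a single pair, here is a NumPy-only version run on hand-picked toy vectors (not real embeddings). The helper name `cosine` is mine, not part of any library.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors chosen so the results are easy to check by hand.
same_direction = cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])                # unrelated directions
opposite = cosine([1.0, 0.0], [-1.0, 0.0])                 # opposing directions

print(round(same_direction, 3), round(orthogonal, 3), round(opposite, 3))
# prints: 1.0 0.0 -1.0
```

Values near 1 mean the vectors point in nearly the same direction, which is why the paraphrase about groceries scores highest and the sentence about turtles scores lowest.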


Although embeddings are unintelligible to us at first glance, the goal of embedding is to retain as much information about the input as possible when it is mapped into vector space. Because of this, you should assume that the process can be reversed. In fact, [The Information Engineering Lab](https://huggingface.co/ielabgroup) has published a model that excels at reversing embeddings back to their original text.

In [3]:
# Inversion model: produces an initial text guess from an embedding.
inversion_model = vec2text.models.InversionModel.from_pretrained(
    "ielabgroup/vec2text_gtr-base-st_inversion"
)
# Corrector model: refines that guess.
corrector_model = vec2text.models.CorrectorEncoderModel.from_pretrained(
    "ielabgroup/vec2text_gtr-base-st_corrector"
)

# No datasets are passed because the trainers are used for inference only.
inversion_trainer = vec2text.trainers.InversionTrainer(
    model=inversion_model,
    train_dataset=None,
    eval_dataset=None,
    data_collator=transformers.DataCollatorForSeq2Seq(
        inversion_model.tokenizer,
        label_pad_token_id=-100,
    ),
)

corrector_model.config.dispatch_batches = None
corrector = vec2text.trainers.Corrector(
    model=corrector_model,
    inversion_trainer=inversion_trainer,
    args=None,
    data_collator=vec2text.collator.DataCollatorForCorrection(
        tokenizer=inversion_trainer.model.tokenizer
    ),
)

# Embedding model whose vectors the inversion and corrector models above target.
model = SentenceTransformer('sentence-transformers/gtr-t5-base')

Here, I encode my favorite password into an embedding and then show how it can be retrieved again.

In [7]:
# 'mps' is the Apple Silicon GPU backend; substitute the device available on your machine.
embeddings = model.encode(
    ["ILoveTurtles"],
    convert_to_tensor=True,
).to('mps')


In [8]:
embeddings = embeddings.clone().requires_grad_(True)
recovered = vec2text.invert_embeddings(
    embeddings=embeddings,
    corrector=corrector,
    num_steps=20,
)

print(recovered)

['           iLoveTurtles ']


You should secure embedded data to the same extent that you would secure the original data. Embedding does NOT reduce the sensitivity of information. Don't assume that a vector contains no sensitive information just because it isn't readable at first glance!
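
A trained inversion model is not the only way a vector can leak its input. The sketch below shows a simpler technique, candidate matching: an attacker who can run the same embedding function embeds a list of guesses and keeps the one closest to the leaked vector. This is not how vec2text works, and `toy_embed` is a hypothetical stand-in (hashed character trigrams) rather than a real embedding model. The names `DIM`, `best_guess`, and the candidate list are also made up for illustration.

```python
import zlib
import numpy as np

DIM = 64  # hypothetical toy dimensionality

def toy_embed(text):
    """Hypothetical stand-in for an embedding model:
    hashed character-trigram counts, L2-normalized."""
    vec = np.zeros(DIM)
    padded = f"  {text}  "
    for i in range(len(padded) - 2):
        # crc32 is deterministic across runs, unlike Python's built-in hash().
        bucket = zlib.crc32(padded[i:i + 3].encode("utf-8")) % DIM
        vec[bucket] += 1.0
    return vec / np.linalg.norm(vec)

def best_guess(leaked, candidates):
    """Return the candidate whose embedding is closest to the leaked vector."""
    # Vectors are unit length, so the dot product equals cosine similarity.
    scores = {c: float(np.dot(leaked, toy_embed(c))) for c in candidates}
    return max(scores, key=scores.get), scores

# The attacker sees only the vector, never the original string.
leaked_vector = toy_embed("ILoveTurtles")

candidates = [
    "hunter2",
    "ILoveTurtles",
    "correct horse battery staple",
    "ILoveTortoises",
]
guess, scores = best_guess(leaked_vector, candidates)
print(guess)
```

Even this crude approach confirms a correct guess with a similarity of essentially 1, which is one more reason to treat stored vectors like the data they were computed from.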