# Text Embeddings

LangChain supports many text embedding models, each of which converts a text string into a vector representation.

We can store these embedding vectors in a vector store and run similarity searches that compare newly embedded strings or documents against it.

We need to choose our embedding model carefully because vectors from different embedding models are not comparable. A cosine similarity computed between two embedding vectors that came from different models is not meaningful, and it is not even defined when the models produce vectors of different lengths. If we have a vector store built with one embedding model and we want to switch models for future data, we need to re-embed all of the historical documents. This also means we need access to the raw historical data, because we cannot recover the original string from an embedding vector.
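To make the comparison concrete, here is a minimal sketch of cosine similarity in plain Python. The function name `cosine_similarity` and the three-dimensional toy vectors are illustrative assumptions, not part of LangChain. Note that the length check only catches models with different output sizes: two models that happen to share a dimension would pass it and still produce a meaningless score.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    if len(a) != len(b):
        # Vectors from models with different output sizes cannot be compared at all.
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
same_direction = cosine_similarity([1.0, 0.0, 1.0], [2.0, 0.0, 2.0])
orthogonal = cosine_similarity([1.0, 0.0, 1.0], [0.0, 1.0, 0.0])
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0.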

In [2]:
from langchain.embeddings import OpenAIEmbeddings

In [5]:
import os
openai_api_key = os.getenv("OPENAI_API_KEY")

In [6]:
embeddings = OpenAIEmbeddings()

In [8]:
embeddings.model

'text-embedding-ada-002'

In [9]:
text = "Some normal text to send to OpenAI to be embedded into a vector"

In [10]:
embedded_text = embeddings.embed_query(text)

In [13]:
len(embedded_text)

1536

In [11]:
type(embedded_text)

list

## Embed Documents

Instead of single strings, we can embed entire documents.

In [14]:
from langchain.document_loaders import CSVLoader

In [15]:
loader = CSVLoader('penguins.csv')

In [16]:
data = loader.load()

In [17]:
type(data)

list

In [18]:
type(data[0])

langchain_core.documents.base.Document

In [20]:
# collect the page_content of every Document (one per CSV row) and embed them in one call
embedded_docs = embeddings.embed_documents([doc.page_content for doc in data])

In [23]:
len(embedded_docs)

344

In [24]:
len(embedded_docs[0])

1536

In [25]:
embedded_docs[0][:10]

[-0.012749632519686768,
 -0.010540504691173527,
 -0.02000338536990946,
 -0.030577565753264004,
 -0.006903525279011112,
 0.035265228810383714,
 -0.0368547244615598,
 -0.0006945620243483338,
 -0.01399563499302944,
 -0.03289445784566611]

We have 344 vectors, one per CSV row, each with 1536 dimensions.
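A list of vectors like this is what a similarity search runs against. The sketch below shows one possible brute-force version in plain Python: it ranks stored vectors by cosine similarity to a query vector. The helper name `top_k_similar` and the three-dimensional toy vectors are assumptions made for illustration. In this notebook the stored vectors would be `embedded_docs` and the query vector would come from `embeddings.embed_query`, using the same model for both.

```python
import math


def top_k_similar(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k stored vectors closest to query_vec.

    Hypothetical helper: a brute-force scan, not how a real vector store indexes data.
    """
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy stand-ins for embedded_docs and an embedded query.
toy_docs = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
toy_query = [1.0, 0.0, 0.0]
results = top_k_similar(toy_query, toy_docs, k=2)
```

Each returned index points back into the original list, so `data[index].page_content` would give the matching row's text. A dedicated vector store does the same job with an index that avoids scanning every vector.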