# Text embedding models

---

Alejandro Ricciardi (Omegapy)  
created date: 01/23/2024   
[GitHub](https://github.com/Omegapy)  

Credit: [LangChain](https://python.langchain.com/docs/expression_language/)

<br>

--- 

 
Project Description:  
**LangChain** is a framework for developing applications powered by language models.  
**In this project:** A series of Jupyter Notebook tutorials on LangChain text embedding models for LLMs.  
The tutorials are a series of LangChain Python code examples from the https://python.langchain.com/ website.

Specifically from the section [Text embedding models](https://python.langchain.com/docs/modules/data_connection/text_embedding/)

⚠️ **Info**: Head to [Integrations](https://python.langchain.com/docs/integrations/text_embedding/) for documentation on built-in integrations with text embedding model providers.

The ```Embeddings``` class is designed for interfacing with text embedding models. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.); this class provides a standard interface for all of them.

Embeddings create a vector representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space.
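To make "most similar in the vector space" concrete, here is a minimal sketch that ranks texts by cosine similarity. The three-dimensional vectors are made up for illustration (real embedding models return far more dimensions), and `cosine_similarity` is a helper written for this example, not a LangChain function.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings"; the values are invented for this example.
toy_vectors = {
    "Hi there!": [0.9, 0.1, 0.0],
    "Oh, hello!": [0.8, 0.2, 0.1],
    "Stock prices fell today": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # stand-in for an embedded greeting query

# Rank the texts from most to least similar to the query.
ranked = sorted(
    toy_vectors,
    key=lambda text: cosine_similarity(query_vector, toy_vectors[text]),
    reverse=True,
)
print(ranked)
```

With these toy values the two greetings rank ahead of the unrelated sentence, which is the behavior semantic search relies on.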

The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
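The two-method shape described above can be sketched with a stand-in class. `ToyEmbeddings` is hypothetical and does not subclass anything from LangChain; its vectors are derived from a hash and carry no semantic meaning. It only illustrates the input and output types of the two methods.

```python
import hashlib
from typing import List


class ToyEmbeddings:
    """Hypothetical stand-in with the same two-method shape as the Embeddings interface."""

    def __init__(self, size: int = 8):
        self.size = size

    def _embed(self, text: str) -> List[float]:
        # Deterministic pseudo-embedding from a SHA-256 digest (no semantic meaning).
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        return [byte / 255.0 for byte in digest[: self.size]]

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Many texts in, one vector per text out.
        return [self._embed(text) for text in texts]

    def embed_query(self, text: str) -> List[float]:
        # One text in, one vector out. A real provider may treat queries differently.
        return self._embed(text)


toy = ToyEmbeddings()
doc_vectors = toy.embed_documents(["Hi there!", "Oh, hello!"])
query_vector = toy.embed_query("Hi there!")
print(len(doc_vectors), len(doc_vectors[0]), len(query_vector))
```

In this toy both methods share one code path; keeping them separate in the interface leaves room for providers that embed documents and queries differently.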

<p></p>
<b style="font-size:15;">
⚠️ This project requires an OpenAI key.
</b>


##### Project Map  
- [API Keys](#api-keys)  
- [Get started](#get-started)
- [CacheBackedEmbeddings](#cachebackedembeddings)
    - [Using with a Vector Store](#using-with-a-vector-store)
    - [Swapping the ByteStore](#swapping-the-bytestore)

<br>

---


#### API Keys

In [4]:
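import os
from dotenv import load_dotenv, find_dotenv

# Load variables from the nearest .env file into the process environment.
load_dotenv(find_dotenv())
# Read the key stored under the name OPEN_AI_KEY (returns None if it is not set).
OPENAI_API_KEY = os.environ.get("OPEN_AI_KEY")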

---
## Get started



<br>

---

In [5]:
#!pip install langchain-openai

In [6]:
from langchain_openai import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [7]:
embeddings = embeddings_model.embed_documents(
    [
        "Hi there!",
        "Oh, hello!",
        "What's your name?",
        "My friends call me World",
        "Hello World!"
    ]
)
len(embeddings), len(embeddings[0])

(5, 1536)

##### embed_query
Embed a single piece of text so it can be compared with other embedded texts.

In [8]:
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
embedded_query[:5]

[0.005354681365594308,
 -0.0005715346531097276,
 0.038875909934336914,
 -0.0029596003572924627,
 -0.008966285328704282]

[Project Map](#project-map)

---

---
## CacheBackedEmbeddings

Embeddings can be stored or temporarily cached to avoid needing to recompute them.

Caching embeddings can be done using a ```CacheBackedEmbeddings```. The cache backed embedder is a wrapper around an embedder that caches embeddings in a key-value store. The text is hashed and the hash is used as the key in the cache.

The main supported way to initialize a ```CacheBackedEmbeddings``` is ```from_bytes_store```. This takes in the following parameters:

- underlying_embedder: The embedder to use for embedding.
- document_embedding_cache: Any [ByteStore](https://python.langchain.com/docs/integrations/stores/) for caching document embeddings.
- namespace: (optional, defaults to ```""```) The namespace to use for the document cache. It is used to avoid collisions with other caches. For example, set it to the name of the embedding model used.

**Attention**: Be sure to set the ```namespace``` parameter to avoid collisions when the same text is embedded using different embedding models.
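The role of the namespace can be illustrated with a small sketch. The `cache_key` function and the model names below are hypothetical, and the scheme (namespace prefix plus a SHA-256 hash of the text) is an assumption chosen for clarity; LangChain's internal key derivation may differ.

```python
import hashlib


def cache_key(namespace: str, text: str) -> str:
    # Illustrative scheme only: namespace prefix + SHA-256 hex digest of the text.
    return namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()


text = "Hello World!"

# With distinct namespaces, the same text gets a separate cache entry per model.
key_a = cache_key("model-a", text)  # hypothetical model names
key_b = cache_key("model-b", text)
print(key_a != key_b)

# With the default empty namespace, every model would map this text to the same key,
# so a vector cached by one model could be returned for another.
key_unset = cache_key("", text)
print(cache_key("", text) == key_unset)
```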

<br>

---

In [9]:
from langchain.embeddings import CacheBackedEmbeddings

### Using with a Vector Store

First, let’s see an example that uses the local file system for storing embeddings and a FAISS vector store for retrieval.

In [19]:
# langchain-openai was not found on the conda channels tried here (defaults, conda-forge), so install with pip instead.
!pip install langchain-openai faiss-cpu

In [12]:
from langchain.storage import LocalFileStore
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

underlying_embeddings = OpenAIEmbeddings()

store = LocalFileStore("./cache/")

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)

In [13]:
list(store.yield_keys())

[]

In [15]:
raw_documents = TextLoader("data/state_of_the_union.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
documents = text_splitter.split_documents(raw_documents)

In [16]:
%%time
db = FAISS.from_documents(documents, cached_embedder)

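If the ```faiss``` package is not installed in the active environment, this cell fails with:

ImportError: Could not import faiss python package. Please install it with `pip install faiss-gpu` (for CUDA supported GPU) or `pip install faiss-cpu` (depending on Python version).

Install ```faiss-cpu``` with pip and re-run the cell before continuing.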

If we try to create the vector store again, it’ll be much faster since it does not need to re-compute any embeddings.

In [None]:
%%time
db2 = FAISS.from_documents(documents, cached_embedder)

Here are some of the cache keys for the embeddings that were created:

In [None]:
list(store.yield_keys())[:5]
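The speed-up on the second pass comes from skipping the underlying embedder for texts that are already cached. The sketch below shows the idea offline. `CountingEmbedder` and `ToyCachedEmbedder` are hypothetical classes written for this example; they are not LangChain's ```CacheBackedEmbeddings```, and a plain dict stands in for the store.

```python
import hashlib
from typing import Dict, List


class CountingEmbedder:
    """Hypothetical embedder that counts how many texts it is asked to embed."""

    def __init__(self) -> None:
        self.calls = 0

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        self.calls += len(texts)
        return [[float(len(text))] for text in texts]  # placeholder vectors


class ToyCachedEmbedder:
    """Illustrative cache wrapper keyed by namespace + hash of the text."""

    def __init__(self, underlying, store: Dict[str, List[float]], namespace: str = ""):
        self.underlying = underlying
        self.store = store
        self.namespace = namespace

    def _key(self, text: str) -> str:
        return self.namespace + hashlib.sha256(text.encode("utf-8")).hexdigest()

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        keys = [self._key(text) for text in texts]
        # Only texts whose key is absent from the store reach the underlying embedder.
        missing = [text for text, key in zip(texts, keys) if key not in self.store]
        if missing:
            for text, vector in zip(missing, self.underlying.embed_documents(missing)):
                self.store[self._key(text)] = vector
        return [self.store[key] for key in keys]


embedder = CountingEmbedder()
cached = ToyCachedEmbedder(embedder, store={}, namespace="toy-model")
chunks = ["first chunk", "second chunk", "third chunk"]

first = cached.embed_documents(chunks)
calls_after_first = embedder.calls   # 3: every chunk was computed
second = cached.embed_documents(chunks)
calls_after_second = embedder.calls  # still 3: all chunks came from the cache
print(calls_after_first, calls_after_second)
```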

[Project Map](#project-map)

---

### Swapping the ByteStore

To use a different ```ByteStore```, pass it when creating your ```CacheBackedEmbeddings```. Below, we create an equivalent cached embeddings object, except using the non-persistent ```InMemoryByteStore``` instead:

In [None]:
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import InMemoryByteStore

store = InMemoryByteStore()

cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings, store, namespace=underlying_embeddings.model
)
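Stores are interchangeable because they expose the same small set of key-value methods. As a rough sketch of that shape, the hypothetical `DictByteStore` below mirrors the ```mget```, ```mset```, ```mdelete``` and ```yield_keys``` methods with simplified signatures. It is not a LangChain class and is not meant to be passed to ```from_bytes_store```.

```python
from typing import Dict, Iterator, List, Optional, Sequence, Tuple


class DictByteStore:
    """Hypothetical dict-backed store with a simplified key-value interface."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def mget(self, keys: Sequence[str]) -> List[Optional[bytes]]:
        # Missing keys yield None, which lets a caller tell what still needs computing.
        return [self._data.get(key) for key in keys]

    def mset(self, key_value_pairs: Sequence[Tuple[str, bytes]]) -> None:
        for key, value in key_value_pairs:
            self._data[key] = value

    def mdelete(self, keys: Sequence[str]) -> None:
        for key in keys:
            self._data.pop(key, None)

    def yield_keys(self, prefix: Optional[str] = None) -> Iterator[str]:
        for key in self._data:
            if prefix is None or key.startswith(prefix):
                yield key


toy_store = DictByteStore()
toy_store.mset([("toy-model:abc", b"[0.1, 0.2]")])  # made-up key and serialized vector
values = toy_store.mget(["toy-model:abc", "missing"])
keys = list(toy_store.yield_keys())
print(values, keys)

toy_store.mdelete(["toy-model:abc"])
keys_after_delete = list(toy_store.yield_keys())
```

Whether the data lives in a dict, on disk, or in a remote service, code that only relies on these methods does not need to change.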

[Project Map](#project-map)

---