### Embeddings

#### Embeddings create a vector representation of a piece of text. This is useful because it lets us reason about text in a vector space and do things like semantic search, where we look for the pieces of text whose vectors are most similar to the vector of the query.
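
To make "most similar in the vector space" concrete, here is a minimal sketch of semantic search with cosine similarity. The three-dimensional vectors and the `cosine_similarity` helper are made up for illustration and are not produced by any model; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" (hypothetical values chosen by hand)
corpus = {
    "wizard school": [0.9, 0.1, 0.0],
    "stock market": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Semantic search: pick the corpus entry whose vector is closest to the query
best_match = max(corpus, key=lambda k: cosine_similarity(corpus[k], query_vector))
print(best_match)
```

With real embeddings the idea is the same: embed the query, embed the candidate texts, and rank the candidates by similarity.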

### OPENAI EMBEDDINGS

In [1]:
import os
from langchain.embeddings import OpenAIEmbeddings

In [2]:
with open("../../openai_api_key.txt") as f:
    api_key = f.read().strip()  # strip the trailing newline so the key is valid
os.environ['OPENAI_API_KEY'] = api_key

In [3]:
embeddings = OpenAIEmbeddings()

In [17]:
text = "The scar had not pained Harry for nineteen years. All was well."

In [18]:
# Embed the text, then show the first five values and the vector length

embedded_text = embeddings.embed_query(text)
print(embedded_text[:5])
print(len(embedded_text))

[-0.006067691294778975, -0.006654051049083575, 0.03343223365953213, -0.02039625470103048, -0.008338620781671906]
1536

##### If we have n lines in LangChain Document format

In [8]:
from langchain.docstore.document import Document

doc_lines = [
    Document(page_content=text, metadata = {"source":"Harry Potter"}),
    Document(page_content="It is our choices, Harry, that show what we truly are, far more than our abilities", metadata = {"source":"Harry Potter"})
]
doc_lines

[Document(page_content='The scar had not pained Harry for nineteen years. All was well.', metadata={'source': 'Harry Potter'}),
 Document(page_content='It is our choices, Harry, that show what we truly are, far more than our abilities', metadata={'source': 'Harry Potter'})]

In [9]:
# Let's extract the page content
line_List = [doc.page_content for doc in doc_lines]
line_List

['The scar had not pained Harry for nineteen years. All was well.',
 'It is our choices, Harry, that show what we truly are, far more than our abilities']

In [10]:
import numpy as np
# embed_documents embeds the whole list in one call
embedded_docs = embeddings.embed_documents(line_List)

np.array(embedded_docs).shape

(2, 1536)

### Let's explore some open-source embedding models

#### BGE Embeddings (BAAI General Embedding, from the Beijing Academy of Artificial Intelligence)


In [11]:
!pip install sentence-transformers





In [12]:
import numpy as np
from langchain.embeddings import HuggingFaceEmbeddings

model_name = "BAAI/bge-base-en-v1.5"
model_kwargs = {"device":'cpu'}
encode_kwargs = {'normalize_embeddings':True}

hf = HuggingFaceEmbeddings(
    model_name = model_name,
    model_kwargs = model_kwargs,
    encode_kwargs = encode_kwargs
)



In [13]:
import numpy as np
# embed_documents embeds the whole list in one call
embedded_docs = hf.embed_documents(line_List)

np.array(embedded_docs).shape

(2, 768)
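
Setting `normalize_embeddings` to `True` means each returned vector has unit length, so a plain dot product between two vectors already equals their cosine similarity. The sketch below checks that property with NumPy on hand-picked two-dimensional vectors; the `l2_normalize` helper and the values are assumptions for illustration, not output from the BGE model.

```python
import numpy as np

def l2_normalize(vectors):
    """Scale each row to unit L2 norm."""
    vectors = np.asarray(vectors, dtype=float)
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

# Hypothetical raw vectors (not real model output)
raw = np.array([[3.0, 4.0], [1.0, 0.0]])
unit = l2_normalize(raw)

# Every normalized row has length 1
norms = np.linalg.norm(unit, axis=1)

# Dot product of the normalized rows ...
dot = float(unit[0] @ unit[1])
# ... matches the cosine similarity of the raw rows
cosine = float(raw[0] @ raw[1] / (np.linalg.norm(raw[0]) * np.linalg.norm(raw[1])))

print(norms, dot, cosine)
```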

### FAKE EMBEDDINGS

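#### For testing purposes, or if you have hardware restrictions, you can use FakeEmbeddings from LangChain. It returns random vectors of the requested size without loading a model or calling an API, so the values carry no semantic meaning.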

In [14]:
from langchain_community.embeddings import FakeEmbeddings

fake_embeddings = FakeEmbeddings(size = 300) # embedding size

fake_embedding_record = fake_embeddings.embed_query("This is a random text")
fake_embedding_records = fake_embeddings.embed_documents(["This is a random text"])

In [15]:
# Single record
np.array(fake_embedding_record).shape

(300,)

In [16]:
# Multiple records
np.array(fake_embedding_records).shape

(1, 300)
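
The two shapes above reflect the two methods used throughout this notebook: `embed_query` returns one vector, and `embed_documents` returns one vector per input text. As a rough sketch of that interface, the block below defines a hypothetical `RandomEmbeddings` class. It is not part of LangChain and is assumed here only for illustration; it needs nothing but NumPy.

```python
import numpy as np

class RandomEmbeddings:
    """Hypothetical stand-in that mimics the embed_query / embed_documents interface."""

    def __init__(self, size, seed=0):
        self.size = size
        self._rng = np.random.default_rng(seed)

    def embed_query(self, text):
        # One text in, one vector of length `size` out (the text is ignored)
        return self._rng.normal(size=self.size).tolist()

    def embed_documents(self, texts):
        # Many texts in, one vector per text out
        return [self.embed_query(t) for t in texts]

stub = RandomEmbeddings(size=300)
single = stub.embed_query("This is a random text")
batch = stub.embed_documents(["This is a random text", "Another text"])
print(np.array(single).shape, np.array(batch).shape)
```

Because the vectors are random, a stub like this is only suitable for checking shapes and plumbing, not for judging search quality.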