In [None]:
!pip install sentence_transformers

# Retrieval-Augmented Generation (RAG)

Using Sentence Transformers and Vector Databases for Natural Language Search

# Sentence Transformers

Sentence Transformers are transformer-based neural networks that produce sentence embeddings: dense, fixed-length vectors in which sentences with similar meanings lie close together. These embeddings support natural language processing (NLP) tasks such as semantic search, clustering, and paraphrase detection. The models start from pre-trained transformers such as BERT and RoBERTa and are fine-tuned so that the resulting vectors are meaningful for whole sentences or longer text spans.



In [6]:
import warnings
from tqdm import TqdmExperimentalWarning
warnings.filterwarnings("ignore", category=TqdmExperimentalWarning)
from sentence_transformers import SentenceTransformer

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define some sentences
sentences = [
    "This is an example sentence.",
    "Each sentence is converted into a vector representation.",
    "Sentence Transformers provide high-quality embeddings."
]

# Generate embeddings for the sentences
embeddings = model.encode(sentences)

# Print the embeddings
for sentence, embedding in zip(sentences, embeddings):
    print(f"Sentence: {sentence}")
    print(f"Embedding: {embedding}\n")

Sentence: This is an example sentence.
Embedding: [ 9.81245711e-02  6.78127110e-02  6.25232309e-02  9.50847939e-02
  3.66476364e-02 -3.98458866e-03  7.47759966e-03 -1.32313929e-02
  6.28837124e-02  2.24955045e-02  7.26956800e-02 -3.12743150e-02
  4.63550873e-02 -1.25545394e-02  4.78147976e-02 -4.91032843e-03
  4.94200438e-02 -6.41093552e-02 -9.69658643e-02  3.28887068e-02
  5.41043878e-02  3.53286453e-02  3.30506228e-02  1.46993687e-02
 -3.34306844e-02 -2.56157573e-02 -5.07921092e-02  7.32545778e-02
  1.10274017e-01 -2.96618957e-02 -6.75570890e-02 -3.05714905e-02
  3.95602100e-02  4.54760306e-02  1.59962233e-02  3.85503471e-02
 -1.09540550e-02  8.48357081e-02 -4.42870818e-02 -6.79645035e-03
  9.42566060e-03  5.07504992e-05  1.30359922e-03 -1.19697684e-02
  1.36451218e-02 -8.41742828e-02 -1.65131452e-04  5.48379449e-03
  2.56151389e-02 -3.15452851e-02 -1.07344717e-01 -4.57877778e-02
 -9.11749303e-02 -2.51047732e-03  1.79983862e-02  4.94016483e-02
  6.18480612e-03  5.97963221e-02  2.7002

## Applications of Sentence Transformers

1. Semantic Search: Finding sentences or documents that are semantically similar to a query sentence.
2. Clustering: Grouping sentences or documents based on their semantic similarity.
3. Paraphrase Detection: Identifying whether two sentences have the same meaning.
4. Textual Entailment: Determining if one sentence logically follows from another, typically with a classifier trained on top of the embeddings.
5. Summarization: Supporting extractive summaries by selecting the sentences that best represent a document's main ideas. The embeddings themselves do not generate text.
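
Most of these applications reduce to comparing embedding vectors. The following is a minimal sketch of paraphrase detection, assuming toy 3-dimensional vectors in place of real model output and a hypothetical threshold of 0.8. The names `is_paraphrase` and `threshold` are illustrative, and a real threshold would need tuning for the model and data.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def is_paraphrase(a, b, threshold=0.8):
    """Hypothetical rule: treat two sentences as paraphrases when their
    embeddings point in nearly the same direction."""
    return cosine_similarity(a, b) >= threshold

# Toy vectors standing in for sentence embeddings (assumed values)
fox = [0.9, 0.1, 0.0]      # "The quick brown fox jumps over the lazy dog."
animal = [0.8, 0.2, 0.1]   # "A fast, brown animal leaps over a sleeping canine."
ai = [0.0, 0.2, 0.9]       # "Artificial intelligence ... transforming the world."

print(is_paraphrase(fox, animal))  # True
print(is_paraphrase(fox, ai))      # False
```

Clustering and semantic search use the same similarity score, grouping or ranking by it rather than comparing it to a fixed threshold.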

# Vector Databases

A vector database is a specialized type of database designed to store and manage high-dimensional vectors efficiently. These vectors are typically used to represent data in a numerical format that captures the essential features and relationships of the data, often derived from machine learning models or other data processing techniques. Vector databases are particularly useful for tasks that involve similarity search, clustering, recommendation systems, and other applications where comparing high-dimensional data points is crucial.

<img width="425" alt="Vector Space" src="https://github.com/user-attachments/assets/3439b580-8b77-44ff-926d-913c973fef7e">
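
To make the idea concrete, here is a minimal in-memory sketch of what a vector database does at its core: store vectors under IDs and return the nearest ones to a query. `TinyVectorStore` is a hypothetical class written for illustration. It scans every stored vector, whereas production systems typically use approximate nearest-neighbor indexes to avoid a full scan.

```python
import math

def _unit(vector):
    """Scale a vector to length 1 so cosine similarity becomes a dot product."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return [x / norm for x in vector]

class TinyVectorStore:
    """Hypothetical brute-force store, for illustration only."""

    def __init__(self):
        self._items = []  # list of (item_id, unit_vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, _unit(vector)))

    def query(self, vector, k=1):
        """Return the k stored items most similar to `vector`, best first."""
        q = _unit(vector)
        scored = [
            (sum(a * b for a, b in zip(q, stored)), item_id)
            for item_id, stored in self._items
        ]
        scored.sort(reverse=True)  # highest cosine similarity first
        return [(item_id, score) for score, item_id in scored[:k]]

store = TinyVectorStore()
store.add("doc-a", [1.0, 0.0])
store.add("doc-b", [0.0, 1.0])
store.add("doc-c", [0.7, 0.7])

# The query points mostly along the first axis, so doc-a ranks first, then doc-c
print(store.query([0.9, 0.1], k=2))
```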

# Semantic Search

Semantic search is a search technique that aims to improve search accuracy by understanding the intent and contextual meaning of the search query, rather than relying solely on keyword matching. It uses natural language processing (NLP) and machine learning (ML) techniques to interpret the semantics of the query and the content of the documents, enabling more relevant and meaningful search results.

<img width="485" alt="image" src="https://github.com/user-attachments/assets/52c8a2b0-d9a0-4fb0-995a-92de1af4c7aa">

In [7]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained Sentence Transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define some documents
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A fast, brown animal leaps over a sleeping canine.",
    "Artificial intelligence and machine learning are transforming the world.",
    "Self-driving cars use machine learning for navigation."
]

# Encode the documents to get their embeddings
document_embeddings = model.encode(documents, convert_to_tensor=True)

# Define a query
query = "What technologies are used in autonomous vehicles?"

# Encode the query to get its embedding
query_embedding = model.encode(query, convert_to_tensor=True)

# Compute the cosine similarity between the query and the documents
cosine_scores = util.cos_sim(query_embedding, document_embeddings)  # cos_sim is the current name for pytorch_cos_sim

# Get the most similar document
most_similar_idx = cosine_scores.argmax().item()
most_similar_doc = documents[most_similar_idx]

print("Query:", query)
print("Most similar document:", most_similar_doc)

Query: What technologies are used in autonomous vehicles?
Most similar document: Self-driving cars use machine learning for navigation.
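
The score computed in the cell above is the cosine of the angle between the query embedding $\mathbf{q}$ and a document embedding $\mathbf{d}$, where $n$ is the embedding dimension:

```latex
\[
\operatorname{cos\_sim}(\mathbf{q}, \mathbf{d})
  = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}
  = \frac{\sum_{i=1}^{n} q_i d_i}{\sqrt{\sum_{i=1}^{n} q_i^2} \, \sqrt{\sum_{i=1}^{n} d_i^2}}
\]
```

The value ranges from -1 to 1, and a higher value means the two vectors point in more similar directions. Taking the `argmax` over the scores therefore selects the document whose embedding is best aligned with the query.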


# Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a method that combines the strengths of information retrieval (IR) and natural language generation (NLG) to produce high-quality, contextually relevant text. It integrates a retrieval system with a generative model, allowing the model to access external knowledge sources during the generation process. This approach enhances the model’s ability to generate accurate and informative responses by grounding the generated text in retrieved information.

<img width="1098" alt="image" src="https://github.com/user-attachments/assets/8125d6c8-6a41-4896-aaec-e45f16faa8bf">
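
The flow described above can be sketched in three steps: retrieve, build a prompt, generate. This is a toy illustration under stated assumptions, not the pipeline used later in this notebook. `retrieve` ranks by shared words instead of embeddings, and `generate` is a placeholder that stands in for the LLM call. All three function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question, corpus, k=2):
    """Toy retriever: rank documents by how many words they share with the
    question. A real pipeline would rank by embedding similarity instead."""
    q_words = tokenize(question)
    ranked = sorted(corpus, key=lambda doc: len(q_words & tokenize(doc)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages):
    """Place the retrieved passages ahead of the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"

def generate(prompt):
    """Placeholder for the LLM call; it only reports what it was given."""
    return f"[LLM would answer here, given {len(prompt)} prompt characters]"

corpus = [
    "Solar panels produce the most energy in summer.",
    "A quick brown fox jumps over a lazy dog.",
    "Batteries store excess solar energy for use at night.",
]
question = "When do solar panels produce the most energy?"

passages = retrieve(question, corpus)      # step 1: retrieval
prompt = build_prompt(question, passages)  # step 2: augmentation
print(prompt)
print(generate(prompt))                    # step 3: generation
```

Because the generator only sees what the retriever returns, the quality of the final answer depends heavily on retrieval quality.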

## Encode Documents into Vector DB

In [None]:
!pip install sentence-transformers chromadb
import os
import re
import string
from html import unescape
import uuid
import httpx
import chromadb
from chromadb.config import Settings

In [118]:
# ChromaDB Configuration Settings for Data
print("ChromaDB starting...")
DIR = "."
DB_PATH = os.path.join(DIR, 'data')
chroma_client = chromadb.PersistentClient(path=DB_PATH, settings=Settings(allow_reset=True, anonymized_telemetry=False))
sample_collection = chroma_client.get_or_create_collection(name="jasonacox")

ChromaDB starting...


In [119]:
# Initialize arrays
documents = []
metadatas = []
ids = []

In [120]:
# Read in blog data from jasonacox.com
tag_re = re.compile('<.*?>')  # regex to remove HTML tags
feed = "https://www.jasonacox.com/wordpress/feed/json"
print(f"Pulling blog json feed content from {feed}...")
response = httpx.get(feed, timeout=None)  # timeout=None waits indefinitely
response.raise_for_status()  # fail early on an HTTP error status
data = response.json()
print(f"Size: {len(data['items'])}")

Pulling blog json feed content from https://www.jasonacox.com/wordpress/feed/json...
Size: 196


In [121]:
# Loop to read in all articles
print("Indexing blog articles...")
for item in data["items"]:
    uid = str(uuid.uuid1().int)[:32]  # first 32 digits of a time-based UUID
    title = item["title"]
    url = item["url"]
    meta = {'title': title, 'url': url}
    body = tag_re.sub('', item["content_html"])  # strip HTML tags
    body = unescape(body)  # decode HTML entities such as &amp;
    body = ''.join(char for char in body if char in string.printable)  # keep printable ASCII only
    documents.append(body)
    metadatas.append(meta)
    ids.append(uid)
print(f"Loaded {len(documents)} documents")

Indexing blog articles...
Loaded 196 documents
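
The cleaning steps in the loop above can be collected into one helper, which makes their effect easier to check on a small input. This is a sketch using the same three steps. The name `clean_html` and the sample string are illustrative assumptions.

```python
import re
import string
from html import unescape

TAG_RE = re.compile('<.*?>')      # same tag-matching pattern as the cell above
PRINTABLE = set(string.printable)  # printable ASCII characters and whitespace

def clean_html(raw_html):
    """Strip tags, decode entities, and keep printable ASCII only."""
    text = TAG_RE.sub('', raw_html)
    text = unescape(text)
    return ''.join(ch for ch in text if ch in PRINTABLE)

sample = "<p>Solar &amp; storage \u2013 a <b>2021</b> update</p>"
print(repr(clean_html(sample)))  # 'Solar & storage  a 2021 update'
```

Note the double space in the result: the en dash is not in `string.printable`, so it is dropped without a replacement. The same filter removes accented letters and other non-ASCII text, which is worth keeping in mind for non-English content.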


In [122]:
# Add documents to the collection; with no embedding function specified, Chroma embeds them using its default
sample_collection.add(documents=documents, metadatas=metadatas, ids=ids)
print(sample_collection)

Collection(id=cebd494f-f45c-4c2b-bd1b-f213fbe4a7c4, name=jasonacox)
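
Because the IDs above come from `uuid.uuid1()`, every run generates fresh IDs, so re-running the notebook against the same persistent collection stores each article again. One possible alternative, sketched here under the assumption that every article URL is unique, is to derive the ID from the URL. The helper `stable_id` and the example URLs are hypothetical.

```python
import hashlib

def stable_id(url):
    """Derive a repeatable 32-character ID from a document URL."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:32]

first = stable_id("https://example.com/post-1")
again = stable_id("https://example.com/post-1")
other = stable_id("https://example.com/post-2")

print(first == again)  # True: same URL, same ID
print(first == other)  # False: different URL, different ID
```

With repeatable IDs, the collection's `upsert` method can be used in place of `add` so that a re-run overwrites existing entries instead of duplicating them.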


In [127]:
# Query the collection - TEST
prompt = "Give me some facts about solar in California."
query_result = sample_collection.query(query_texts=prompt, n_results=5)
print(len(query_result['metadatas'][0]))

5


In [128]:
docs = []
print("")
print("Prompt: " + prompt)
print(f"Top {len(query_result['metadatas'][0])} Documents found:")
# Print titles and collect each document's title and text
for meta, text in zip(query_result['metadatas'][0], query_result['documents'][0]):
    print(" * " + meta['title'])
    docs.append({'title': meta['title'], 'text': text})

# Concatenate titles and texts into one context string for the LLM prompt
context_str = "\n".join([f"{doc['title']}\n{doc['text']}" for doc in docs])


Prompt: Give me some facts about solar in California.
Top 5 Documents found:
 * California Solar and Net Metering
 * 23.5 Degrees
 * An Ocean of Science
 * Halfway Out of the Dark
 * Dog Days of Summer
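
Five full blog posts can add up to a long context. The sketch below builds the same kind of context string from a query result while capping the length of each document. The helper `build_context`, the `max_chars` parameter, and the fake result are assumptions for illustration. The nested-list shape mirrors how `query_result` is indexed above.

```python
def build_context(query_result, max_chars=2000):
    """Join title and text for each hit, truncating each text to max_chars."""
    titles = [meta["title"] for meta in query_result["metadatas"][0]]
    texts = query_result["documents"][0]
    return "\n".join(f"{title}\n{text[:max_chars]}" for title, text in zip(titles, texts))

# Fake result with the same nesting as the query result used above
fake_result = {
    "metadatas": [[{"title": "Post A"}, {"title": "Post B"}]],
    "documents": [["alpha text", "beta text"]],
}
print(build_context(fake_result, max_chars=5))
```

A character cap is a blunt tool, since it can cut a document mid-sentence. It is shown here only to illustrate bounding the prompt size.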


In [117]:
# Remove DB: reset() deletes all collections and their data (requires allow_reset=True)
chroma_client.reset()

True

## Use LLM to Provide NL Answer

In [129]:
prompt = "Give me some facts about solar."
rag = f"""You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. 
If you don't know the answer, just say that you don't know. Back up your answer using facts from the following context.
Context: {context_str}
Question: {prompt}
Answer:
"""

In [None]:
!pip install openai

In [130]:
import openai
api_key = "API_KEY"  # placeholder; replace with the key your server expects
base_url = "http://localhost:8000/v1"  # OpenAI-compatible endpoint
llm_model = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # separate name so the embedding `model` is not overwritten

llm = openai.OpenAI(api_key=api_key, base_url=base_url)
response = llm.chat.completions.create(
    model=llm_model,
    max_tokens=2000,
    stream=False,
    temperature=0.0,
    messages=[{"role": "user", "content": rag}],
)
# Print the answer
print("LLM:")
print(response.choices[0].message.content)

LLM:
Here are some facts about solar energy based on the provided context:

1. **Solar Energy Generation**: Solar energy is generated by harnessing the power of the sun's rays, which can be used to produce electricity. In the context, the author mentions that their solar array and batteries were installed in 2021, and they have been observing the energy production from their solar panels.

2. **Solar Energy Year**: The author notes that the solar energy year is affected by the tilt of the Earth (23.5 degrees) and the elliptical orbit around the sun. This results in varying energy production throughout the year, with peak production in the summer and minimal production in the winter.

3. **Solar Energy Storage**: The author emphasizes the importance of energy storage devices (ESDs) like batteries to store excess energy generated by solar panels during the day for use during the night or periods of low energy production.

4. **Solar Duck Curve**: The author mentions the "solar duck curve