<a href="https://github.com/labrijisaad/LLM-RAG/blob/main/notebooks/vectorization_experiments.ipynb" target="_blank">
  <img src="https://img.shields.io/badge/Open%20in-GitHub-blue.svg" alt="Open In GitHub"/>
</a>

## <center><a><span style="color:red">`OpenAI LLM` - Text Vectorization Experiments</span></a></center>

#### Setup and Configuration

In [1]:
import pandas as pd
import numpy as np
import requests
import openai
import faiss
import yaml
import re
import os

from tqdm.auto import tqdm

# Load OpenAI API Key
with open("../secrets/credentials.yml", "r") as stream:
    config = yaml.safe_load(stream)
OPENAI_CREDENTIALS = config["OPENAI_CREDENTIALS"]

# Constants
EMBEDDING_MODEL = "text-embedding-ada-002"
PATH_TO_MARKDOWN_FILE = "../data/raw/mock_markdown.md"

### <a><span style="color:green">Read and Preprocess `Markdown File`</span></a>

In [2]:
def read_and_process_markdown(file_path):
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()
    sections = re.split(r"\n(#{1,3} .*)\n", text)
    processed = [sections[0]] + [
        sections[i] + sections[i + 1] for i in range(1, len(sections), 2)
    ]
    return processed


texts = read_and_process_markdown(PATH_TO_MARKDOWN_FILE)

# Display sections
df_sections = pd.DataFrame({"Sections": texts})
df_sections.head()

|   | Sections |
|---|----------|
| 0 | `## Healthcare\n\nIn healthcare, AI is being us...` |
| 1 | `## Finance\nThe finance sector leverages AI fo...` |
| 2 | `# Ethical Considerations\nAs AI continues to e...` |
| 3 | `## Privacy and Surveillance\nWith the increasi...` |
| 4 | `## Bias and Fairness\nAI systems can inherit b...` |
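
The split relies on a detail of `re.split`: because the heading pattern sits inside a capturing group, the matched headings are kept in the result and alternate with the section bodies. The sketch below runs the same pattern on a made-up two-section string (the sample text is hypothetical, not taken from `mock_markdown.md`). It also shows why the first section keeps its own heading: a heading at the very start of the file has no preceding newline, so the pattern does not match it.

```python
import re

# Hypothetical sample text, standing in for the real markdown file.
sample = "## First\n\nBody one.\n\n## Second\n\nBody two.\n"

# The capturing group keeps each matched heading in the output list.
parts = re.split(r"\n(#{1,3} .*)\n", sample)
# parts alternates: [leading text, heading, body, heading, body, ...]

# Re-attach every heading to the body that follows it.
sections = [parts[0]] + [
    parts[i] + parts[i + 1] for i in range(1, len(parts), 2)
]

for section in sections:
    print(repr(section))
```

Note that the pattern consumes the newline right after each heading. A blank line after the heading (as in the sample) leaves one newline between heading and body; without it, the two would be joined directly.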


### <a><span style="color:green">Generate `Embeddings` for Each Section</span></a>

In [3]:
def query_openai_embedding(api_key, text, model=EMBEDDING_MODEL):
    """
    Queries OpenAI's embedding model for a single text and returns the embedding.

    :param api_key: OpenAI API key.
    :param text: Text to generate embedding for.
    :param model: Embedding model to use.
    :return: An embedding as a numpy array or a structured error message.
    """
    url = "https://api.openai.com/v1/embeddings"
    headers = {"Authorization": f"Bearer {api_key}"}
    payload = {
        "input": text,
        "model": model,
    }

    try:
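        # A timeout keeps the notebook from hanging; requests.Timeout is a
        # RequestException, so it is handled by the except clause below.
        response = requests.post(url, headers=headers, json=payload, timeout=30)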
        if response.status_code == 200:
            data = response.json()
            embedding = np.array(data["data"][0]["embedding"], dtype="float32")
            return embedding
        else:
            error_message = f"HTTP Error {response.status_code}"
            try:
                error_details = response.json().get("error", {})
                message = error_details.get("message", "An unspecified error occurred")
            except ValueError:
                message = "Error details unavailable"
            error_message += f": {message}"
            return {"error": error_message}

    except requests.RequestException as e:
        return {"error": f"Connection error: {e}"}
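
The function above sends one request per section. The embeddings endpoint also accepts a list of strings as `input` and returns one item per string, each carrying an `index` field. The following is a hedged sketch of the payload and response handling for that batched form. `build_batch_payload` and `parse_batch_response` are hypothetical helper names, and the response dictionary is a hand-written stand-in with tiny vectors, not real API output.

```python
import numpy as np


def build_batch_payload(texts, model="text-embedding-ada-002"):
    """Build a request body that embeds several texts in one call."""
    return {"input": list(texts), "model": model}


def parse_batch_response(data):
    """Return a (n_texts, dim) float32 array, ordered by each item's `index`."""
    items = sorted(data["data"], key=lambda item: item["index"])
    return np.array([item["embedding"] for item in items], dtype="float32")


# Hand-written stand-in for a response body (3-d vectors for readability).
fake_response = {
    "data": [
        {"index": 1, "embedding": [0.0, 1.0, 0.0]},
        {"index": 0, "embedding": [1.0, 0.0, 0.0]},
    ]
}

payload = build_batch_payload(["first section", "second section"])
matrix = parse_batch_response(fake_response)
print(matrix.shape)
```

Sorting by `index` keeps row `i` of the matrix aligned with the `i`-th input text, which matters later when FAISS row positions are mapped back to sections.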

### <a><span style="color:green">Create `Embeddings`</span></a>
##### (Uncomment on the First Run)

In [4]:
# # Get embeddings
# embeddings = []
# for text in tqdm(texts):
#     embedding = query_openai_embedding(OPENAI_CREDENTIALS, text)
#     if not isinstance(embedding, dict):
#         embeddings.append(embedding)
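#     else:
#         # Note: a skipped section shifts all later rows, so index positions
#         # would no longer match positions in `texts`.
#         print("Error retrieving embedding:", embedding["error"])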

# # Convert the list of embeddings into a numpy array
# embeddings = np.array(embeddings)
# embeddings

### <a><span style="color:green">Create `FAISS Index`</span></a>
##### (Uncomment on the First Run)

In [5]:
# dimension = embeddings.shape[1]
# index = faiss.IndexFlatL2(dimension)
# index.add(embeddings)
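
`faiss.IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The NumPy sketch below reproduces that computation on toy vectors to show what the index returns. `flat_l2_search` is a hypothetical helper written for illustration and is not part of FAISS.

```python
import numpy as np


def flat_l2_search(stored, queries, k):
    """Exact k-nearest-neighbour search using squared L2 distance."""
    # Pairwise differences: shape (n_queries, n_stored, dim).
    diffs = queries[:, None, :] - stored[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    # Positions of the k smallest distances per query, closest first.
    indices = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    distances = np.take_along_axis(sq_dists, indices, axis=1)
    return distances, indices


# Three toy 2-d "embeddings" and one query vector.
stored = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
query = np.array([[0.9, 0.0]], dtype="float32")

distances, indices = flat_l2_search(stored, query, k=2)
print(indices, distances)
```

As with `index.search`, the result has one row per query, and each returned position refers to a row of the stored matrix.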

### <a><span style="color:green">Saving the `FAISS Index`</span></a>
##### (Uncomment on the First Run)

In [6]:
# # Save the index to a file
# faiss.write_index(index, "../faiss_index.bin")

### <a><span style="color:green">Loading the `FAISS Index`</span></a>

In [7]:
# Load the index from the file
index = faiss.read_index("../faiss_index.bin")
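
A loaded index contains only vectors. Search results are row positions that get mapped back through `texts`, so the mapping holds only if `texts` is rebuilt in the same order as when the index was created. One way to guard against drift is to store the sections next to the index and compare counts on load. The sketch below assumes that approach; the helper names and the JSON sidecar file are hypothetical, and `n_vectors` stands in for `index.ntotal`.

```python
import json
import os
import tempfile


def save_texts(texts, path):
    """Write the section texts to a JSON sidecar file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(texts, f, ensure_ascii=False)


def load_texts(path, n_vectors):
    """Load the texts and check they match the number of indexed vectors."""
    with open(path, "r", encoding="utf-8") as f:
        texts = json.load(f)
    if len(texts) != n_vectors:
        raise ValueError(
            f"{len(texts)} texts but {n_vectors} vectors: "
            "index and texts are out of sync"
        )
    return texts


# Round trip in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    sidecar = os.path.join(tmp, "texts.json")
    save_texts(["## A\nalpha", "## B\nbeta"], sidecar)
    restored = load_texts(sidecar, n_vectors=2)

print(restored)
```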

### <a><span style="color:green">Querying</span></a>

In [8]:
def search_similar_sections(query_text, index, texts, api_key, num_results=2):
    """
    Searches for sections most similar to the given query text, using a FAISS index.

    Parameters:
    - query_text: The text to query against the indexed sections.
    - index: The FAISS index containing the embeddings of the sections.
    - texts: The original texts corresponding to the embeddings in the FAISS index.
    - api_key: The API key for OpenAI.
    - num_results: The number of similar sections to return.

    Returns:
    - A list of dictionaries, each containing the 'index' and 'text' of the similar sections.
    """
    # Generate the embedding for the query text
    query_embedding = query_openai_embedding(api_key, query_text)
    if isinstance(query_embedding, dict):
        # query_openai_embedding returns {"error": ...} on failure
        raise RuntimeError(query_embedding["error"])

    # Search the FAISS index for the nearest neighbors
    distances, indices = index.search(query_embedding.reshape(1, -1), num_results)

    # Retrieve the most similar sections; FAISS pads missing results with -1
    results = [
        {"index": int(idx), "text": texts[idx]} for idx in indices[0] if idx != -1
    ]

    return results
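
The search ranks sections by squared L2 distance. OpenAI documents its embeddings as normalized to unit length, and for unit vectors the squared L2 distance equals `2 - 2 * cosine_similarity`, so this ranking should match a cosine-similarity ranking. The sketch below checks that identity with NumPy on random unit vectors, which are stand-ins for real embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors scaled to unit length, standing in for real embeddings.
vectors = rng.normal(size=(5, 8))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = vectors[0]

# Cosine similarity reduces to a dot product for unit vectors.
cosine = vectors @ query
sq_l2 = ((vectors - query) ** 2).sum(axis=1)

# Ascending distance and descending similarity give the same order.
order_by_distance = np.argsort(sq_l2)
order_by_cosine = np.argsort(-cosine)

print(np.allclose(sq_l2, 2 - 2 * cosine))
```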

#### <a><span style="color:blue">Example `1`</span></a>

In [9]:
num_results = 4
query_text = "Artificial Intelligence"
results = search_similar_sections(
    query_text, index, texts, OPENAI_CREDENTIALS, num_results
)

print("Top similar sections to the query:")
for result in results:
    print(f"\nSection {result['index']+1}: {result['text']}")

Top similar sections to the query:

Section 12: # Introduction to AI
Artificial intelligence (AI) has rapidly become a key technology in many industries, revolutionizing processes and efficiency.


Section 46: ## The Rise of Artificial Intelligence
AI is transforming business, healthcare, and daily life, offering new possibilities in automation and smart technology.


Section 13: ## History of AI
The concept of artificial intelligence has been around for centuries, but it wasn't until the 20th century that it became a field of study. Alan Turing, a British mathematician and logician, laid the groundwork for modern computing and theorized about machines that could think.


Section 15: ## Machine Learning
Machine learning, a subset of AI, involves the development of algorithms that allow computers to learn from and make predictions or decisions based on data.



#### <a><span style="color:blue">Example `2`</span></a>

In [10]:
num_results = 3
query_text = "healthcare"
results = search_similar_sections(
    query_text, index, texts, OPENAI_CREDENTIALS, num_results
)

print("Top similar sections to the query:")
for result in results:
    print(f"\nSection {result['index']+1}: {result['text']}")

Top similar sections to the query:

Section 1: ## Healthcare

In healthcare, AI is being used to make more accurate diagnoses, predict patient outcomes, and personalize patient treatment plans.


Section 30: # Comprehensive Guide on Health and Fitness
Health and fitness have become central to modern lifestyle, emphasizing the importance of regular exercise, balanced diet, and mental well-being.


Section 46: ## The Rise of Artificial Intelligence
AI is transforming business, healthcare, and daily life, offering new possibilities in automation and smart technology.



#### <a><span style="color:blue">Example `3`</span></a>

In [11]:
num_results = 1
query_text = "Social Media"
results = search_similar_sections(
    query_text, index, texts, OPENAI_CREDENTIALS, num_results
)

print("Top similar sections to the query:")
for result in results:
    print(f"\nSection {result['index']+1}: {result['text']}")

Top similar sections to the query:

Section 26: ## The Rise of Social Media
Platforms like Facebook, Twitter, and Instagram have changed the way we connect with others and consume information.



## <center><a><span style="color:red">`OpenAI LLM` - Text Vectorization Experiments - `Class Based`</span></a></center>

In [1]:
import faiss
import numpy as np
import pandas as pd
import requests
import yaml
import os
import re
from tqdm.auto import tqdm

class OpenAIEmbeddings:
    def __init__(self, credentials_path="../secrets/credentials.yml", embedding_model="text-embedding-ada-002"):
        self.credentials_path = credentials_path
        self.embedding_model = embedding_model
        self.api_key = self.load_credentials()
        self.texts = []
        self.embeddings = None
        self.faiss_index = None

    def load_credentials(self):
        with open(self.credentials_path, "r") as stream:
            config = yaml.safe_load(stream)
        return config["OPENAI_CREDENTIALS"]

    def read_and_process_markdown(self, file_path):
        with open(file_path, "r", encoding="utf-8") as file:
            text = file.read()
        sections = re.split(r"\n(#{1,3} .*)\n", text)
        self.texts = [sections[0]] + [sections[i] + sections[i + 1] for i in range(1, len(sections), 2)]
        return self.texts

    def query_openai_embedding(self, text):
        url = "https://api.openai.com/v1/embeddings"
        headers = {"Authorization": f"Bearer {self.api_key}"}
        payload = {
            "input": text,
            "model": self.embedding_model,
        }
        # A timeout keeps a stalled request from blocking indefinitely
        response = requests.post(url, headers=headers, json=payload, timeout=30)
        if response.status_code == 200:
            data = response.json()
            embedding = np.array(data["data"][0]["embedding"], dtype="float32")
            return embedding
        else:
            return None

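    def generate_embeddings(self):
        # query_openai_embedding returns None on failure; drop failed sections
        # from self.texts as well so row i of the index still maps to self.texts[i]
        embeddings, kept_texts = [], []
        for text in tqdm(self.texts):
            embedding = self.query_openai_embedding(text)
            if embedding is not None:
                embeddings.append(embedding)
                kept_texts.append(text)
        self.texts = kept_texts
        self.embeddings = np.array(embeddings, dtype="float32")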

    def create_faiss_index(self):
        dimension = self.embeddings.shape[1]
        self.faiss_index = faiss.IndexFlatL2(dimension)
        self.faiss_index.add(self.embeddings)

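    def search_similar_sections(self, query_text, num_results=2):
        query_embedding = self.query_openai_embedding(query_text)
        if query_embedding is None:
            raise RuntimeError("Failed to retrieve an embedding for the query text")
        distances, indices = self.faiss_index.search(query_embedding.reshape(1, -1), num_results)
        # FAISS pads missing results with -1; skip those entries
        return [{"index": int(idx), "text": self.texts[idx]} for idx in indices[0] if idx != -1]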

    def save_faiss_index(self, index_path="../faiss_index.bin"):
        faiss.write_index(self.faiss_index, index_path)

    def load_faiss_index(self, index_path="../faiss_index.bin"):
        self.faiss_index = faiss.read_index(index_path)

In [2]:
# Initialize the class
embedder = OpenAIEmbeddings(credentials_path="../secrets/credentials.yml", 
                            embedding_model="text-embedding-ada-002")

In [4]:
# Load and process a markdown file
texts = embedder.read_and_process_markdown("../data/raw/mock_markdown.md")
texts[0]

'## Healthcare\n\nIn healthcare, AI is being used to make more accurate diagnoses, predict patient outcomes, and personalize patient treatment plans.\n'

In [5]:
# Generate embeddings for the processed texts
embedder.generate_embeddings()

# Create a FAISS index for the embeddings
embedder.create_faiss_index()

  0%|          | 0/50 [00:00<?, ?it/s]

In [7]:
# Query for similar sections
query_text = "Artificial Intelligence"
results = embedder.search_similar_sections(query_text, num_results=4)

for result in results:
    print(f"\n{result['index']+1}: {result['text']}")


12: # Introduction to AI
Artificial intelligence (AI) has rapidly become a key technology in many industries, revolutionizing processes and efficiency.


46: ## The Rise of Artificial Intelligence
AI is transforming business, healthcare, and daily life, offering new possibilities in automation and smart technology.


13: ## History of AI
The concept of artificial intelligence has been around for centuries, but it wasn't until the 20th century that it became a field of study. Alan Turing, a British mathematician and logician, laid the groundwork for modern computing and theorized about machines that could think.


15: ## Machine Learning
Machine learning, a subset of AI, involves the development of algorithms that allow computers to learn from and make predictions or decisions based on data.



## Connect with me 🌐
<div align="center">
  <a href="https://www.linkedin.com/in/labrijisaad/">
    <img src="https://img.shields.io/badge/LinkedIn-%230077B5.svg?&style=for-the-badge&logo=linkedin&logoColor=white" alt="LinkedIn" style="margin-bottom: 5px;"/>
  </a>
  <a href="https://github.com/labrijisaad">
    <img src="https://img.shields.io/badge/GitHub-100000?style=for-the-badge&logo=github&logoColor=white" alt="GitHub" style="margin-bottom: 5px;"/>
  </a>
</div>