# Simplified Retrieval Augmented Generation (RAG) Demonstration

This is based on: https://www.youtube.com/watch?v=P8tOjiYEFqU&ab_channel=DonWoodlock

"""
This demonstration shows how little code is needed to implement Retrieval Augmented Generation (RAG) with the OpenAI API, using an embedding model for retrieval and a chat model for generation.
We'll construct a minimal, or "micro," dataset to serve as the knowledge base for our Large Language Model (LLM).
The core concept of RAG involves retrieving relevant information from this external dataset based on the user's query and
incorporating it into the LLM's response. We will also visualize the similarity between the query and dataset embeddings
using the dot product, highlighting how proximity in embedding space correlates with semantic relevance.

Key Concepts:

* **RAG (Retrieval Augmented Generation):** A technique that enhances LLMs by allowing them to access and incorporate information from external knowledge sources.
  This improves accuracy and reduces hallucinations by grounding responses in real data.
* **Embeddings:** Numerical representations of text, capturing semantic meaning. These are vectors in a high-dimensional space where similar texts have closer vectors.
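* **Dot Product Similarity:** A measure of similarity between two vectors. In the context of embeddings, a higher dot product indicates greater similarity. For unit-length vectors, which is how OpenAI returns its embeddings, the dot product equals the cosine similarity.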
  The dot product of two vectors $\mathbf{a}$ and $\mathbf{b}$ is calculated as:
    $$ \mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i $$
    where $n$ is the dimension of the vectors.
* **Visualization:** We will plot the dot product results to visually represent the similarity scores, aiding in understanding how the query's embedding relates to the dataset embeddings.

Detailed Steps:

1.  **Environment Setup and API Configuration:**
    * Install the necessary Python libraries: `openai` for accessing the OpenAI API, `pandas` for loading the dataset, `numpy` for the dot product, and optionally `matplotlib` for plotting.
    * Configure the OpenAI API key to enable communication with the API. This typically involves setting an environment variable or loading the key from a configuration file.
2.  **Dataset Embedding Generation:**
    * Create a small dataset of text snippets.
    * Use the OpenAI API's embedding model (e.g., `text-embedding-3-small`, the default in the code below) to generate embeddings for each text snippet in the dataset.
      This transforms the text into numerical vectors. Store these embeddings along with the corresponding text.
3.  **Query Embedding Generation:**
    * Formulate a user query.
    * Generate the embedding for the query using the same embedding model used for the dataset. This ensures consistency in the embedding space.
4.  **Similarity Calculation and Visualization:**
    * Calculate the dot product between the query embedding and each dataset embedding. This yields a similarity score for each dataset entry.
    * Store the similarity scores.
    * Visualize the similarity scores. This can be done by creating a bar plot, where the x-axis represents the dataset entries, and the y-axis represents the dot product similarity scores.
      This visualization will clearly show which dataset entries are most relevant to the query.
5.  **RAG Integration:**
    * Provide the LLM with the user query, and the most relevant information retrieved from the dataset, based on the highest dot product scores.
    * Instruct the LLM to use the retrieved information to generate a response.
    * Observe the response generated by the LLM, noting how the retrieved information influences the output.
"""

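As a small worked example of the dot product formula from the Key Concepts, the sketch below computes it in plain Python on toy 3-dimensional vectors. The vectors and the helper names `dot` and `normalize` are made up for illustration; real embeddings have far more dimensions.

```python
import math

def dot(a, b):
    """Dot product: sum of a_i * b_i over matching positions."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length so that the dot product equals cosine similarity."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy vectors (hypothetical values, not real embeddings).
query = normalize([1.0, 2.0, 2.0])
close = normalize([1.0, 2.0, 1.0])   # points in nearly the same direction as query
far = normalize([-2.0, 1.0, 0.0])    # perpendicular to query

print("query . close =", dot(query, close))
print("query . far   =", dot(query, far))
```

The vector pointing in nearly the same direction as the query scores close to 1, while the perpendicular one scores close to 0, which is the property the retrieval step relies on.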
In [110]:
# Install necessary libraries if they are not already installed.
# We will check if the libraries are installed, and install them if they are not.

try:
    import pandas as pd
    print("pandas is already installed.")
except ImportError:
    print("pandas is not installed. Installing...")
    !pip install pandas
    import pandas as pd

try:
    # Import both the module and the client class; later cells use each of them.
    import openai
    from openai import OpenAI
    print("openai is already installed.")
except ImportError:
    print("openai is not installed. Installing...")
    !pip install openai
    import openai
    from openai import OpenAI

try:
    import numpy as np
    print("numpy is already installed.")
except ImportError:
    print("numpy is not installed. Installing...")
    !pip install numpy
    import numpy as np

try:
    import pdp
    print("pdp is already installed.")
except ImportError:
    print("pdp is not installed. Installing...")
    !pip install pdp
    import pdp

# os and ast are standard library modules and should be available without pip install.
import os
import ast

print("All necessary libraries are installed.")


pandas is already installed.
openai is already installed.
numpy is already installed.
pdp is already installed.
All necessary libraries are installed.


In [111]:
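# Instantiate the OpenAI client. The key is read from the OPENAI_API_KEY
# environment variable rather than being hard-coded in the notebook.
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))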
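One way to make a missing key fail early with a readable message is a small loader like the sketch below. The helper name `load_api_key` is hypothetical; only `os.environ` is used, so no OpenAI call is involved.

```python
import os

def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in env_var, or raise if it is missing or empty."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"{env_var} is not set. Export it in your shell before starting the notebook."
        )
    return key
```

It could then be used as `client = OpenAI(api_key=load_api_key())`, so a configuration problem surfaces here and not in the first embedding request.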

In [102]:
# User query (real-time information request - ChatGPT will likely refuse)
question = "How is the weather in Darbhanga on Monday, 7th?"

In [112]:
import pandas as pd

# Load the pre-saved 10-day weather data for Darbhanga, which will serve as context.
weather_data_path = "/darbhanga_weather.txt"

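try:
    df = pd.read_csv(weather_data_path)
except FileNotFoundError:
    # Every later cell depends on df, so stop with a clear error instead of calling exit(),
    # which would shut down the notebook kernel.
    raise FileNotFoundError(f"Weather data file not found at '{weather_data_path}'.")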

# Convert each row of the DataFrame into a comma-separated string for context.
context = [", ".join(row.astype(str)) for _, row in df.iterrows()]

# Display the first two context strings for quick verification.
print("First context string:", context[0])
print("Second context string:", context[1])

# Print the entire context list (for debugging or inspection).
print("\nFull context list:", context)

First context string: Tue 01,  Sunny,  37°/19°,  1%,  SW 9 mph, nan
Second context string: Wed 02,  Mostly Cloudy,  35°/22°,  0%,  WSW 10 mph, nan

Full context list: ['Tue 01,  Sunny,  37°/19°,  1%,  SW 9 mph, nan', 'Wed 02,  Mostly Cloudy,  35°/22°,  0%,  WSW 10 mph, nan', 'Thu 03,  Cloudy,  35°/22°,  0%,  W 11 mph, nan', 'Fri 04,  Mostly Sunny,  36°/21°,  1%,  W 13 mph, nan', 'Sat 05,  Sunny,  37°/21°,  1%,  W 11 mph, nan', 'Sun 06,  Sunny,  37°/23°,  1%,  SSE 7 mph, nan', 'Mon 07,  Sunny,  37°/25°,  2%,  ESE 13 mph, nan', 'Tue 08,  Mostly Sunny,  38°/24°,  4%,  E 13 mph, nan', 'Wed 09,  Mostly Sunny,  37°/24°,  7%,  E 10 mph, nan', 'Thu 10,  Mostly Sunny,  36°/23°,  17%,  E 13 mph, nan', 'Fri 11,  Partly Cloudy,  35°/23°,  15%,  E 13 mph, nan', 'Sat 12,  Isolated T-Storms,  34°/23°,  32%,  E 11 mph, nan', 'Sun 13,  Partly Cloudy,  34°/23°,  24%,  E 11 mph, nan', 'Mon 14,  Partly Cloudy,  35°/24°,  16%,  E 10 mph, nan']
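Every context string above ends in `nan` and has a double space after each comma. A likely cause, judging only from the printed output, is that the file has six header names but five fields per row, with a blank after each comma. The sketch below shows one way to clean this up; the `raw` text is a hypothetical reconstruction of the first lines of the file, and the names `demo_df` and `demo_context` are illustrative.

```python
import io
import pandas as pd

# Hypothetical reconstruction of the weather file, inferred from the printed
# output: six header names, but only five fields in each data row.
raw = """Day, Date, Feel like, Temperature, Humidity, Wind direction and speed
Tue 01, Sunny, 37°/19°, 1%, SW 9 mph
Wed 02, Mostly Cloudy, 35°/22°, 0%, WSW 10 mph
"""

# skipinitialspace=True strips the blank that follows each comma.
demo_df = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Drop columns that are empty in every row (here, the unmatched sixth header).
demo_df = demo_df.dropna(axis=1, how="all")

# Join the remaining fields of each row into one context string.
demo_context = [", ".join(row) for row in demo_df.astype(str).to_numpy()]
print(demo_context[0])
```

Cleaner strings keep the stray `nan` token out of both the embeddings and the prompt. Note that the header names in the file do not describe the values beneath them (for example, the "Date" column holds the sky condition), so correcting the header would be a further improvement.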


In [113]:
# Load the weather data from the CSV file.
df = pd.read_csv("/darbhanga_weather.txt")

# Create a new 'text_data' column by converting each row into a comma-separated string.
df['text_data'] = df.apply(lambda row: ", ".join(row.astype(str)), axis=1)

# Display the 'text_data' column.
print(df[['text_data']])

                                            text_data
0       Tue 01,  Sunny,  37°/19°,  1%,  SW 9 mph, nan
1   Wed 02,  Mostly Cloudy,  35°/22°,  0%,  WSW 10...
2      Thu 03,  Cloudy,  35°/22°,  0%,  W 11 mph, nan
3   Fri 04,  Mostly Sunny,  36°/21°,  1%,  W 13 mp...
4       Sat 05,  Sunny,  37°/21°,  1%,  W 11 mph, nan
5      Sun 06,  Sunny,  37°/23°,  1%,  SSE 7 mph, nan
6     Mon 07,  Sunny,  37°/25°,  2%,  ESE 13 mph, nan
7   Tue 08,  Mostly Sunny,  38°/24°,  4%,  E 13 mp...
8   Wed 09,  Mostly Sunny,  37°/24°,  7%,  E 10 mp...
9   Thu 10,  Mostly Sunny,  36°/23°,  17%,  E 13 m...
10  Fri 11,  Partly Cloudy,  35°/23°,  15%,  E 13 ...
11  Sat 12,  Isolated T-Storms,  34°/23°,  32%,  E...
12  Sun 13,  Partly Cloudy,  34°/23°,  24%,  E 11 ...
13  Mon 14,  Partly Cloudy,  35°/24°,  16%,  E 10 ...


In [117]:
# Initialize the OpenAI client (ensure your API key is set in the OPENAI_API_KEY environment variable)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def get_embedding(text, model="text-embedding-3-small"):
    """
    Generates an embedding for the given text using the specified OpenAI embedding model.

    Args:
        text (str): The input text to be embedded.
        model (str, optional): The OpenAI embedding model to use. Defaults to "text-embedding-3-small".

    Returns:
        list: The embedding vector as a list of floats.
    """
    # Replace newlines with spaces to avoid potential issues with the API
    text = text.replace("\n", " ")

    # Print the input text (for debugging or logging purposes)
    print(f"Generating embedding for: '{text}'")

    # Generate the embedding using the OpenAI API
    try:
        embedding_response = client.embeddings.create(input=[text], model=model)
        embedding = embedding_response.data[0].embedding
        return embedding
    except Exception as e:
        print(f"An error occurred while generating embedding: {e}")
        return None  # or raise the exception.

# Example usage
# example_text = "This is an example sentence for embedding."
# example_embedding = get_embedding(example_text)

# if example_embedding:
#     print("Example embedding:", example_embedding[:5], "...") # Print the first 5 elements as an example.

In [119]:
question_embedding = get_embedding(question)

if question_embedding is None:
    print("Failed to generate question embedding. Please check the question and API key.")
else:
    print("Question embedding generated successfully.")
    # You can now use the question_embedding variable.

Generating embedding for: 'How is the weather in darbhanga on Monday, 7th?'
Question embedding generated successfully.


In [122]:
# Apply the get_embedding function to each 'text_data' entry and store the embeddings in a new 'embedding' column.
df['embedding'] = df['text_data'].apply(get_embedding)

# Display the DataFrame with the new 'embedding' column.
print(df.head())

Generating embedding for: 'Tue 01,  Sunny,  37°/19°,  1%,  SW 9 mph, nan'
Generating embedding for: 'Wed 02,  Mostly Cloudy,  35°/22°,  0%,  WSW 10 mph, nan'
Generating embedding for: 'Thu 03,  Cloudy,  35°/22°,  0%,  W 11 mph, nan'
Generating embedding for: 'Fri 04,  Mostly Sunny,  36°/21°,  1%,  W 13 mph, nan'
Generating embedding for: 'Sat 05,  Sunny,  37°/21°,  1%,  W 11 mph, nan'
Generating embedding for: 'Sun 06,  Sunny,  37°/23°,  1%,  SSE 7 mph, nan'
Generating embedding for: 'Mon 07,  Sunny,  37°/25°,  2%,  ESE 13 mph, nan'
Generating embedding for: 'Tue 08,  Mostly Sunny,  38°/24°,  4%,  E 13 mph, nan'
Generating embedding for: 'Wed 09,  Mostly Sunny,  37°/24°,  7%,  E 10 mph, nan'
Generating embedding for: 'Thu 10,  Mostly Sunny,  36°/23°,  17%,  E 13 mph, nan'
Generating embedding for: 'Fri 11,  Partly Cloudy,  35°/23°,  15%,  E 13 mph, nan'
Generating embedding for: 'Sat 12,  Isolated T-Storms,  34°/23°,  32%,  E 11 mph, nan'
Generating embedding for: 'Sun 13,  Partly Clou
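The cell above sends one request per row. The embeddings endpoint also accepts a list of strings, so the whole dataset can be embedded in a single request. The sketch below assumes a client object shaped like the `client` created earlier; the helper name `embed_batch` is hypothetical.

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed many texts with a single request; returns one vector per input text."""
    # Same newline handling as the per-row helper.
    cleaned = [t.replace("\n", " ") for t in texts]
    response = client.embeddings.create(input=cleaned, model=model)
    # Each returned item carries the index of the input it belongs to,
    # so sort on it to guarantee the output order matches the input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

In this notebook it could be used as `df['embedding'] = embed_batch(client, df['text_data'].tolist())`. For a 14-row dataset the saving is small, but the pattern matters once the knowledge base grows.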

In [123]:
def get_distance(text_emb):
    """
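    Calculates the dot product between a text embedding and the question embedding.

    Despite the function name, the result is a similarity score, not a distance:
    a larger value means the text is closer to the question.

    Args:
        text_emb (list): The embedding vector for a text entry.

    Returns:
        float: The dot product, representing the similarity between the embeddings.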
    """
    if question_embedding is None:
        print("Error: Question embedding is None. Cannot calculate distance.")
        return None  # Or raise an exception

    # Convert embeddings to NumPy arrays for efficient dot product calculation.
    text_emb_array = np.array(text_emb)
    question_emb_array = np.array(question_embedding)

    # Calculate and return the dot product.
    return np.dot(text_emb_array, question_emb_array)
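The same scores can be computed for all rows at once with a matrix-vector product, which also makes it easy to pick the top entries. This is a minimal sketch with NumPy; the function name `top_k_by_dot` and the parameter `k` are illustrative and not part of the original notebook.

```python
import numpy as np

def top_k_by_dot(query_vec, doc_vecs, k=3):
    """Return (indices, scores) of the k rows of doc_vecs with the largest dot product."""
    matrix = np.asarray(doc_vecs, dtype=float)   # shape (n_docs, dim)
    query = np.asarray(query_vec, dtype=float)   # shape (dim,)
    scores = matrix @ query                      # one dot product per row
    order = np.argsort(scores)[::-1][:k]         # positions sorted by score, highest first
    return order, scores[order]
```

With the notebook's variables this would be called as `idx, scores = top_k_by_dot(question_embedding, df['embedding'].tolist())`, after which `df.iloc[idx]` gives the best-matching rows.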

In [124]:
# Compare embeddings and rank by similarity:
df['distance'] = df['embedding'].apply(get_distance)

# Sort by the dot-product score in descending order (despite the column name, higher means more similar).
df.sort_values('distance', ascending=False, inplace=True)

# Display the top 5 most similar entries.
print(df.head())

       Day            Date  Feel like  Temperature     Humidity  \
8   Wed 09    Mostly Sunny    37°/24°           7%     E 10 mph   
6   Mon 07           Sunny    37°/25°           2%   ESE 13 mph   
0   Tue 01           Sunny    37°/19°           1%     SW 9 mph   
13  Mon 14   Partly Cloudy    35°/24°          16%     E 10 mph   
7   Tue 08    Mostly Sunny    38°/24°           4%     E 13 mph   

     Wind direction and speed  \
8                         NaN   
6                         NaN   
0                         NaN   
13                        NaN   
7                         NaN   

                                            text_data  \
8   Wed 09,  Mostly Sunny,  37°/24°,  7%,  E 10 mp...   
6     Mon 07,  Sunny,  37°/25°,  2%,  ESE 13 mph, nan   
0       Tue 01,  Sunny,  37°/19°,  1%,  SW 9 mph, nan   
13  Mon 14,  Partly Cloudy,  35°/24°,  16%,  E 10 ...   
7   Tue 08,  Mostly Sunny,  38°/24°,  4%,  E 13 mp...   

                                            embedding  
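The visualization described in step 4 of the Detailed Steps can be sketched with `matplotlib` as a bar chart with one bar per dataset entry. The helper name `plot_scores` and its parameters are assumptions for illustration; it needs `matplotlib` installed, which the install cell above does not cover.

```python
import matplotlib.pyplot as plt

def plot_scores(labels, scores, title="Dot-product similarity to the query"):
    """Draw one bar per dataset entry; returns the figure and axes."""
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.bar(range(len(scores)), scores)
    ax.set_xticks(range(len(labels)))
    ax.set_xticklabels(labels, rotation=45, ha="right")
    ax.set_ylabel("dot product")
    ax.set_title(title)
    fig.tight_layout()
    return fig, ax
```

In this notebook a call such as `plot_scores(df['text_data'].str[:6], df['distance'])` would label each bar with the day prefix (for example `Mon 07`). Since the top-ranked row in the output above is `Wed 09` and not `Mon 07`, a plot like this helps show how close the leading scores are.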

In [125]:
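# Retrieval step: keep only the rows that scored highest against the question.
retrieved = "\n".join(df['text_data'].head(3))

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are an AI assistant that answers questions about weather using provided context."},
        # Supply the retrieved context alongside the question in the user turn, not as an assistant reply.
        {"role": "user", "content": f"Use this context about the weather to answer the question.\n\nContext:\n{retrieved}\n\nQuestion: {question}"}
    ]
)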

# Access the response content
answer = response.choices[0].message.content
print(answer)

The weather in Darbhanga on Monday, 7th is forecasted to be sunny with a high of 37°C and a low of 25°C. There is a 2% chance of precipitation with an east-southeasterly wind at 13 mph.
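To make step 5 of the Detailed Steps reusable, the prompt assembly can be separated from the API call. The sketch below builds the message list from a question and a handful of retrieved rows; the helper name `build_rag_messages` and the exact prompt wording are assumptions, not the notebook's original prompt.

```python
def build_rag_messages(question, retrieved_rows):
    """Assemble chat messages that place the retrieved rows next to the question."""
    context_block = "\n".join(f"- {row}" for row in retrieved_rows)
    system = (
        "You answer questions about weather using only the provided context. "
        "If the context does not contain the answer, say so."
    )
    user = f"Context:\n{context_block}\n\nQuestion: {question}"
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```

It could be passed to the chat call as `messages=build_rag_messages(question, df['text_data'].head(3))` once `df` has been sorted by score. Keeping the context in the user turn and restricting it to the top-ranked rows keeps the prompt short and makes the effect of retrieval easier to observe.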


In [129]:
# Print the top 3 rows by similarity score; the answer should be derivable from these rows.
print(df['text_data'].head(3))

8    Wed 09,  Mostly Sunny,  37°/24°,  7%,  E 10 mp...
6      Mon 07,  Sunny,  37°/25°,  2%,  ESE 13 mph, nan
0        Tue 01,  Sunny,  37°/19°,  1%,  SW 9 mph, nan
Name: text_data, dtype: object
