<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Grant Glass](https://glassgrant.com) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email grantg@unc.edu.<br />
____

# Large Language Models and Embeddings for Retrieval Augmented Generation: Day 2 7/17/24

This is lesson `2` of 3 in the educational series on `Large Language Models (LLMs) and Retrieval Augmented Generation (RAG)`. This notebook focuses on understanding embeddings and introducing the concept of RAG.

**Skills:** 
* Data analysis
* Machine learning
* Text analysis
* Language models
* Vector embeddings
* Retrieval Augmented Generation

**Audience:** `Learners`

**Use case:** `Tutorial`

This tutorial guides users through the process of creating and using embeddings, and introduces the concept of Retrieval Augmented Generation.



**Difficulty:** `Intermediate`

Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.


**Completion time:** `90 minutes`

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)
* Basic understanding of machine learning concepts
* Familiarity with LLMs (covered in Day 1)

**Knowledge Recommended:**
* Experience with natural language processing (NLP) concepts
* Familiarity with vector operations

**Learning Objectives:**
After this lesson, learners will be able to:
1. Explain the concept of embeddings and their role in NLP
2. Generate and visualize embeddings using pre-trained models
3. Implement basic similarity search using embeddings
4. Describe the principles of Retrieval Augmented Generation (RAG)
5. Develop a simple RAG system using embeddings and an LLM

**Research Pipeline:**
1. Introduction to LLMs and their applications (Day 1)
2. **Exploring embeddings and introduction to RAG**
3. Optimizing RAG systems (Day 3)
4. Applying RAG in research contexts

___

# Required Python Libraries

* [OpenAI](https://github.com/openai/openai-python) for generating embeddings and interacting with GPT models
* [Pandas](https://pandas.pydata.org/) for data manipulation
* [NumPy](https://numpy.org/) for numerical operations
* [Matplotlib](https://matplotlib.org/) for data visualization
* [Scikit-learn](https://scikit-learn.org/) for dimensionality reduction and similarity calculations

## Install Required Libraries

In [None]:
### Install Libraries ###
!pip install openai pandas numpy matplotlib scikit-learn

In [None]:
### Import Libraries ###

# Import the openai library for accessing OpenAI's API functionalities
import openai
# Import the OpenAI client class, which is used below to create the API client
from openai import OpenAI
# Import pandas, a powerful data manipulation and analysis library, and use 'pd' as its alias
import pandas as pd
# Import numpy, a library for numerical operations on large, multi-dimensional arrays and matrices, using 'np' as its alias
import numpy as np
# Import pyplot from matplotlib, a plotting library, and use 'plt' as its alias for creating static, interactive, and animated visualizations
import matplotlib.pyplot as plt
# Import TSNE from sklearn.manifold, a tool for dimensionality reduction suitable for visualization of high-dimensional datasets
from sklearn.manifold import TSNE
# Import cosine_similarity from sklearn.metrics.pairwise, a method to compute similarity between pairs of elements using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
# Import the os module, a standard library module providing a way to use operating system dependent functionality like reading or writing to the file system
import os


# Required Data

We'll continue using the texts from Day 1.



In [None]:

# Load the dataset
df = pd.read_csv('day1_dataset.csv')  # Assuming we saved the DataFrame from Day 1



# Introduction

In this lesson, we'll explore the concept of embeddings, which are dense vector representations of text that capture semantic meaning. We'll then use these embeddings to implement a basic Retrieval Augmented Generation (RAG) system, which combines the power of LLMs with the ability to retrieve relevant information from a knowledge base.

Key topics we'll cover:
1. Understanding and generating embeddings
2. Visualizing embeddings
3. Implementing similarity search using embeddings
4. Introduction to Retrieval Augmented Generation (RAG)
5. Building a simple RAG system

Let's begin by setting up our OpenAI API access:

## Configure the OpenAI client

To set up the client, we need an API key to send with each request. Skip these steps if you already have an API key.

You can get an API key by following these steps:

1. [Create a new project](https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects)
2. [Generate an API key in your project](https://platform.openai.com/api-keys)
3. (RECOMMENDED, BUT NOT REQUIRED) [Set up your API key for all projects as an env var](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)

In [None]:
# Method 1: Directly paste the API key (not recommended for production or shared code)
client = OpenAI(api_key="your_actual_openai_api_key_here")

# Method 2: Use an environment variable (recommended for most use cases)
# Ensure the environment variable OPENAI_API_KEY is set in your environment before running the script
#client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Method 3: Use a configuration file (alternative for keeping keys out of code)
# Create a file named config.py (or similar) and define OPENAI_API_KEY in it, then import it here
# Both lines stay commented out unless config.py exists; otherwise the import raises an error
#from config import OPENAI_API_KEY
#client = OpenAI(api_key=OPENAI_API_KEY)

# Method 4: Use Python's built-in `getpass` module to securely input the API key at runtime (useful for notebooks or temporary scripts)
#from getpass import getpass
#api_key = getpass("Enter your OpenAI API key: ")
#client = OpenAI(api_key=api_key)

In [None]:
# Define a function to get the embedding of a given text using a specified model
def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newline characters with spaces in the text to ensure it's on a single line
    text = text.replace("\n", " ")
    # Use the OpenAI API client to create an embedding for the text using the specified model
    # The function returns the embedding of the first (and only) input text
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Define a function to get a completion for a given prompt using a specified model
def get_completion(prompt, model="gpt-3.5-turbo"):
    # Prepare the prompt as a message from the user
    messages = [{"role": "user", "content": prompt}]
    # Use the OpenAI API client to create a chat completion using the specified model
    # The temperature parameter is set to 0 for more consistent, less random responses
    # (output is not guaranteed to be identical on every run)
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    # Return the content of the first (and only) completion message
    return response.choices[0].message.content

# Print a message indicating that the OpenAI API client is ready for use
print("OpenAI API is ready.")

# Lesson

## 1. Understanding and Generating Embeddings

Embeddings are vector representations of text that capture semantic meaning. Let's generate embeddings for our book summaries:

In [None]:
# Generate embeddings for each row in the dataframe by applying the get_embedding function
# The get_embedding function is applied to the first 1000 characters of the 'fullText' column for each row
# This is done for brevity and to ensure the embedding process is manageable and efficient
df['embedding'] = df['fullText'].apply(lambda x: get_embedding(x[:1000]))

# Print the first 5 rows of the dataframe showing only the 'title' and 'embedding' columns
# This provides a quick overview of the embeddings generated for the initial texts
print(df[['title', 'embedding']].head())

# Save the dataframe to a CSV file named 'day2_dataset.csv'
# The index=False parameter is used to indicate that the dataframe index should not be written to the file
# This results in a CSV file that contains only the data columns
df.to_csv('day2_dataset.csv', index=False)
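
One detail to keep in mind when saving embeddings this way: a CSV file stores every cell as text, so each embedding list is written out as a string and comes back as a string when the file is read again. The sketch below illustrates the round trip on a made-up two-row DataFrame (the titles, the three-number "embeddings", and the name `toy_df` are placeholders, not part of the Day 1 dataset) and shows one way to turn the strings back into lists with `ast.literal_eval`.

```python
import ast
import io

import pandas as pd

# Toy DataFrame standing in for df; titles and 3-number embeddings are made up
toy_df = pd.DataFrame({
    "title": ["Text A", "Text B"],
    "embedding": [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
})

# Write to an in-memory buffer instead of a file on disk, then read it back
buffer = io.StringIO()
toy_df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After the round trip, each embedding is a string such as "[0.1, 0.2, 0.3]"
type_after_reload = type(reloaded["embedding"].iloc[0])

# ast.literal_eval parses that string back into a Python list of floats
reloaded["embedding"] = reloaded["embedding"].apply(ast.literal_eval)
type_after_parsing = type(reloaded["embedding"].iloc[0])

print(type_after_reload, type_after_parsing)
```

If you reload `day2_dataset.csv` in a later session, applying the same parsing step to the `embedding` column should restore lists that `np.array` and `cosine_similarity` can work with.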

## 2. Visualizing Embeddings

To visualize high-dimensional embeddings, we'll use t-SNE to reduce them to 2D:

In [None]:
# Convert the list of embeddings from the dataframe into a 2D numpy array for numerical operations
embeddings_array = np.array(df['embedding'].tolist())

# Determine the perplexity value for t-SNE, ensuring it's less than the number of samples to avoid errors
# The perplexity value influences how t-SNE balances attention between local and global aspects of your data
# The minimum function ensures the perplexity is not greater than the number of samples minus one
perplexity_value = min(30, len(embeddings_array) - 1)

# Initialize the t-SNE model with two components (for 2D visualization), a fixed random state for reproducibility,
# and the calculated perplexity value
tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity_value)

# Fit the t-SNE model on the embeddings array to reduce its dimensionality to two dimensions
embeddings_2d = tsne.fit_transform(embeddings_array)

# Create a figure for plotting with a specified size
plt.figure(figsize=(10, 8))

# Scatter plot of the two-dimensional embeddings
# Each point represents an embedding, plotted according to its t-SNE reduced dimensions
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])

# Annotate each point in the scatter plot with its corresponding title from the dataframe
# This loop goes through each embedding and places a text label at its location
for i, title in enumerate(df['title']):
    plt.annotate(title, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

# Set the title of the plot to give context about what is being visualized
plt.title('2D Visualization of Embeddings')

# Display the plot
# This visual representation helps in understanding the relationship between different embeddings
plt.show()

## 3. Implementing Similarity Search

Let's implement a simple similarity search using cosine similarity:

In [None]:
# Define a function to find the most similar text in a dataframe to a given query
def find_most_similar(query, df):
    # Generate an embedding for the query text using the get_embedding function
    query_embedding = get_embedding(query)
    # Calculate the cosine similarity between the query embedding and each embedding in the dataframe
    # The similarity score is stored in a new column 'similarity'
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity([query_embedding], [x])[0][0])
    # Sort the dataframe by the similarity score in descending order and select the top row
    # This row represents the text most similar to the query
    return df.sort_values('similarity', ascending=False).iloc[0]

# Example usage of the find_most_similar function
# Define a query text
query = "A story about Frederick Douglass"
# Call the find_most_similar function with the query and the dataframe
most_similar = find_most_similar(query, df)
# Print the results, including the title, summary (full text), and similarity score of the most similar text
print(f"Most similar to '{query}':")
print(f"Title: {most_similar['title']}")
print(f"Summary: {most_similar['fullText']}")
print(f"Similarity score: {most_similar['similarity']:.4f}")
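
Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it compares the direction of the vectors and ignores their magnitude. The sketch below computes it by hand with NumPy on made-up three-dimensional vectors (real `text-embedding-ada-002` embeddings have 1536 dimensions). The helper name `cosine` and the toy vectors are illustrative assumptions, not part of the lesson's dataset.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up query vector and document vectors
toy_query = [1.0, 0.0, 1.0]
toy_docs = {
    "same direction": [2.0, 0.0, 2.0],   # twice as long, same direction as the query
    "partly related": [1.0, 1.0, 0.0],   # shares one dimension with the query
    "unrelated": [0.0, 1.0, 0.0],        # shares no dimension with the query
}

# Score every document against the query, then rank from most to least similar
scores = {label: cosine(toy_query, vec) for label, vec in toy_docs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)

print(scores)
print(ranked)
```

The vector labelled "same direction" scores close to 1 even though it is twice as long as the query, which is why cosine similarity is a common choice for comparing embeddings of texts with different lengths.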

## 4. Introduction to Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a technique that combines the power of large language models with the ability to retrieve relevant information from a knowledge base. The basic steps of RAG are:

1. Convert the query into an embedding
2. Find the most similar documents in the knowledge base
3. Use the retrieved documents to augment the prompt sent to the LLM

Let's implement a simple RAG system using our data:

In [None]:
# Define a function to generate a response to a query using a Retrieval-Augmented Generation (RAG) approach
def rag_response(query, df, model="gpt-3.5-turbo"):
    # Generate an embedding for the query using the get_embedding function
    query_embedding = get_embedding(query)
    # Calculate the cosine similarity between the query embedding and each document's embedding in the dataframe
    # Store these similarity scores in a new column 'similarity'
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity([query_embedding], [x])[0][0])
    # Find the document with the highest similarity score to the query
    most_similar = df.sort_values('similarity', ascending=False).iloc[0]
    
    # Construct an augmented prompt that includes the context (the most similar document's text, truncated to 1000 characters)
    # followed by the query, asking for an informed answer based on the given context
    augmented_prompt = f"""
    Given the following context and question, provide an informed answer:
    
    Context: {most_similar['fullText'][:1000]}...
    
    Question: {query}
    
    Answer:
    """
    
    # Use the get_completion function to generate a response from the language model (LLM) based on the augmented prompt
    return get_completion(augmented_prompt, model)

# Example usage of the rag_response function
# Define a query regarding the main themes in Frederick Douglass' works
query = "What are the main themes in Frederick Douglass' works?"
# Call the rag_response function with the query and the dataframe containing document embeddings
response = rag_response(query, df)
# Print the original query and the generated response
print(f"Query: {query}")
print(f"RAG Response: {response}")
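
To trace the three RAG steps without spending API credits, the sketch below swaps in a toy embedding and stops just before the LLM call. Everything in it is a stand-in: `toy_embedding` counts a few hand-picked words instead of calling `get_embedding`, the three one-line documents are invented, and `build_rag_prompt` is a hypothetical helper that returns the augmented prompt rather than sending it to a model.

```python
import numpy as np

# Hypothetical stand-in for get_embedding: counts of a few hand-picked words
VOCAB = ["slavery", "freedom", "whale", "sea", "love", "marriage"]

def toy_embedding(text):
    words = text.lower().split()
    return np.array([words.count(term) for term in VOCAB], dtype=float)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Return 0.0 when either vector is all zeros to avoid dividing by zero
    return float(a @ b / denom) if denom else 0.0

# Invented one-line documents standing in for the knowledge base
documents = [
    "a narrative of slavery and the long road to freedom",
    "a captain hunts a white whale across the sea",
    "a novel of love and marriage in the english countryside",
]

def build_rag_prompt(query, documents):
    # Step 1: convert the query into an embedding
    query_vec = toy_embedding(query)
    # Step 2: find the most similar document in the knowledge base
    scores = [cosine(query_vec, toy_embedding(doc)) for doc in documents]
    best = documents[int(np.argmax(scores))]
    # Step 3: augment the prompt with the retrieved document
    return f"Context: {best}\n\nQuestion: {query}\n\nAnswer:"

prompt = build_rag_prompt("what does the author say about freedom from slavery", documents)
print(prompt)
```

Printing the augmented prompt like this is also a quick way to check which document was retrieved before any text is sent to the LLM.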

## 5. Comparing RAG to Standard LLM Responses

Let's compare the RAG response to a standard LLM response without context:

In [None]:
# Generate a response to the query using a standard Language Model (LLM) without any additional context or augmentation
standard_response = get_completion(query)

# Print the response generated by the standard LLM
print("Standard LLM Response:")
print(standard_response)

# Print a newline for better readability between the two responses
print("\nRAG Response:")

# Print the response generated by the Retrieval-Augmented Generation (RAG) method
# This response is expected to be more informed or context-aware due to the use of relevant document(s) during generation
print(response)

# Exercises

1. Generate embeddings for longer passages (e.g., first chapters) from the three books we downloaded. Visualize these embeddings and compare them to the summary embeddings.

2. Implement a more sophisticated RAG system that retrieves multiple relevant documents and combines their information in the prompt.

3. Experiment with different similarity metrics (e.g., Euclidean distance, Manhattan distance) and compare their performance to cosine similarity.

4. Create a simple chatbot that uses the RAG system to answer questions about the books in our dataset.

# Conclusion

In this lesson, we've explored the concept of embeddings and how they can be used to capture semantic meaning in text. We've also introduced Retrieval Augmented Generation (RAG) and implemented a basic RAG system using embeddings and an LLM.

In the next lesson, we'll focus on optimizing RAG systems for better performance and explore more advanced techniques in this field.

# References

1. Pennington, J., Socher, R., & Manning, C. D. (2014). [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf). In Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). arXiv preprint arXiv:1810.04805.
3. Lewis, P., et al. (2020). [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401). arXiv preprint arXiv:2005.11401.

___
[Proceed to next lesson: LLMs with RAG Workshop: Day 3 - Optimizing RAG for Enhanced Performance ->](./rag_advanced.ipynb)