<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Grant Glass](https://glassgrant.com) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email grantg@unc.edu.<br />
____

# Large Language Models and Embeddings for Retrieval Augmented Generation: Day 2 7/17/24

This is lesson `2` of 3 in the educational series on `Large Language Models (LLMs) and Retrieval Augmented Generation (RAG)`. This notebook focuses on understanding embeddings and introducing the concept of RAG.

**Skills:** 
* Data analysis
* Machine learning
* Text analysis
* Language models
* Vector embeddings
* Retrieval Augmented Generation

**Audience:** `Learners`

**Use case:** `Tutorial`

This tutorial guides users through the process of creating and using embeddings, and introduces the concept of Retrieval Augmented Generation.



**Difficulty:** `Intermediate`

Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.


**Completion time:** `90 minutes`

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)
* Basic understanding of machine learning concepts
* Familiarity with LLMs (covered in Day 1)

**Knowledge Recommended:**
* Experience with natural language processing (NLP) concepts
* Familiarity with vector operations

**Learning Objectives:**
After this lesson, learners will be able to:
1. Explain the concept of embeddings and their role in NLP
2. Generate and visualize embeddings using pre-trained models
3. Implement basic similarity search using embeddings
4. Describe the principles of Retrieval Augmented Generation (RAG)
5. Develop a simple RAG system using embeddings and an LLM

**Research Pipeline:**
1. Introduction to LLMs and their applications (Day 1)
2. **Exploring embeddings and introduction to RAG**
3. Optimizing RAG systems (Day 3)
4. Applying RAG in research contexts

___

# Required Python Libraries

* [OpenAI](https://github.com/openai/openai-python) for generating embeddings and interacting with GPT models
* [Pandas](https://pandas.pydata.org/) for data manipulation
* [NumPy](https://numpy.org/) for numerical operations
* [Matplotlib](https://matplotlib.org/) for data visualization
* [Scikit-learn](https://scikit-learn.org/) for dimensionality reduction and similarity calculations
* [Gensim](https://radimrehurek.com/gensim/) for LDA topic modeling
* [NLTK](https://www.nltk.org/) for stopword removal and lemmatization
* [SciPy](https://scipy.org/) for additional distance and correlation measures

## Install Required Libraries

In [None]:
### Install Libraries ###
!pip install openai pandas numpy matplotlib scikit-learn gensim nltk scipy

In [None]:
### Import Libraries ###

# Import the openai library for accessing OpenAI's API functionalities
import openai
# Import the OpenAI client class, which is instantiated below to make API calls
from openai import OpenAI
# Import pandas, a powerful data manipulation and analysis library, and use 'pd' as its alias
import pandas as pd
# Import numpy, a library for numerical operations on large, multi-dimensional arrays and matrices, using 'np' as its alias
import numpy as np
# Import pyplot from matplotlib, a plotting library, and use 'plt' as its alias for creating static, interactive, and animated visualizations
import matplotlib.pyplot as plt
# Import TSNE from sklearn.manifold, a tool for dimensionality reduction suitable for visualization of high-dimensional datasets
from sklearn.manifold import TSNE
# Import cosine_similarity from sklearn.metrics.pairwise, a method to compute similarity between pairs of elements using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
# Import the os module, a standard library module providing a way to use operating system dependent functionality like reading or writing to the file system
import os


# Required Data

We'll continue using the texts from Day 1.



In [None]:

# Load the dataset
df = pd.read_csv('day1_dataset.csv')  # Assuming we saved the DataFrame from Day 1

df.head()

In [None]:
# Count the words and make a new column
df['word_count'] = df['fullText'].apply(lambda x: len(str(x).split()))

In [None]:
# Count the unique words and make a new column
df['unique_word_count'] = df['fullText'].apply(lambda x: len(set(str(x).split())))

In [None]:
# Count the significant LDA topics in each document and make a new column
from gensim import corpora, models
import gensim
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer


nltk.download('stopwords')
nltk.download('wordnet')
stop = set(stopwords.words('english'))
exclude = set('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
lemma = WordNetLemmatizer()

def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    return normalized

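# Clean and tokenize each document (str() guards against missing values, as in the word-count cells above)
doc_clean = [clean(str(doc)).split() for doc in df['fullText']]
dictionary = corpora.Dictionary(doc_clean)
corpus = [dictionary.doc2bow(text) for text in doc_clean]

# Apply LDA (random_state is fixed so the topics are reproducible between runs)
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=5, id2word=dictionary, passes=50, random_state=42)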

# Function to count significant topics for a document
def count_significant_topics(doc_bow, threshold=0.1):
    topic_distribution = ldamodel.get_document_topics(doc_bow)
    # Count topics with a distribution above the threshold
    return sum(1 for _, prob in topic_distribution if prob > threshold)

# Apply the function to each document in the corpus
df['significant_topics'] = [count_significant_topics(doc) for doc in corpus]

In [None]:
df

# Introduction

In this lesson, we'll explore the concept of embeddings, which are dense vector representations of text that capture semantic meaning. We'll then use these embeddings to implement a basic Retrieval Augmented Generation (RAG) system, which combines the power of LLMs with the ability to retrieve relevant information from a knowledge base.

Key topics we'll cover:
1. Understanding and generating embeddings
2. Visualizing embeddings
3. Implementing similarity search using embeddings
4. Introduction to Retrieval Augmented Generation (RAG)
5. Building a simple RAG system

Let's begin by setting up our OpenAI API access:

## Configure the OpenAI client

To set up the client, we need an API key to send with each request. Skip these steps if you already have one.

You can get an API key by following these steps:

1. [Create a new project](https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects)
2. [Generate an API key in your project](https://platform.openai.com/api-keys)
3. (RECOMMENDED, BUT NOT REQUIRED) [Set up your API key for all projects as an env var](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)

In [None]:
# Method 1: Directly paste the API key (not recommended for production or shared code)
client = OpenAI(api_key="your_actual_openai_api_key_here")

# Method 2: Use an environment variable (recommended for most use cases)
# Ensure the environment variable OPENAI_API_KEY is set in your environment before running the script
#client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Method 3: Use a configuration file (alternative for keeping keys out of code)
# Create a file named config.py (or similar) and define OPENAI_API_KEY in it, then import it here
#from config import OPENAI_API_KEY
#client = OpenAI(api_key=OPENAI_API_KEY)

# Method 4: Use Python's built-in `getpass` module to securely input the API key at runtime (useful for notebooks or temporary scripts)
#from getpass import getpass
#api_key = getpass("Enter your OpenAI API key: ")
#client = OpenAI(api_key=api_key)



## Text Embedding Models Overview

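Text embedding models convert text into numerical vectors, making it easier for machines to compare and process natural language. The overview below follows OpenAI's older tiered naming (`ada`, `babbage`, `curie`, `davinci`); model names and availability change over time, so check OpenAI's current model list before choosing one. The code in this notebook uses `text-embedding-ada-002`.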

### 1. `text-embedding-ada-002`
- **Description**: A smaller, more efficient model suitable for tasks that require less precision and can benefit from faster processing times. It's part of the GPT-3 family but optimized for embedding tasks.
- **Use Cases**: Quick prototyping, mobile applications, or any scenario where speed is more critical than the absolute precision of embeddings.

### 2. `text-embedding-babbage-001`
- **Description**: A mid-range model that offers a balance between performance and computational efficiency. It provides more detailed embeddings than `ada` and is suitable for a wide range of applications.
- **Use Cases**: General-purpose embedding tasks where there is a need for a balance between precision and computational resources.

### 3. `text-embedding-curie-001`
- **Description**: A high-performance model that generates more nuanced and detailed embeddings. It's significantly larger than `ada` and `babbage`, making it more computationally intensive.
- **Use Cases**: Complex natural language understanding tasks, such as sentiment analysis, summarization, or when the highest quality embeddings are required.

### 4. `text-embedding-davinci-002`
- **Description**: The most advanced and largest model in the series, offering the highest quality embeddings with a deep understanding of context and nuances in the text.
- **Use Cases**: High-stakes applications where the quality of the embeddings directly impacts the outcome, such as legal document analysis, medical research, and advanced natural language understanding tasks.

### Choosing the Right Model
Selecting the right model depends on your specific needs:
- **For rapid development and lower resource consumption**: Consider `ada`.
- **For a balance between performance and efficiency**: `babbage` is a good choice.
- **When quality cannot be compromised**: Opt for `curie` or `davinci`, depending on the level of sophistication and resource availability you have.

Each model has its strengths and is optimized for different scenarios, so the choice should be based on the specific requirements of your application, including factors like computational resources, the complexity of the task, and the need for precision in the embeddings.

In [None]:
# Define a function to get the embedding of a given text using a specified model
def get_embedding(text, model="text-embedding-ada-002"):
    # Replace newline characters with spaces in the text to ensure it's on a single line
    text = text.replace("\n", " ")
    # Use the OpenAI API client to create an embedding for the text using the specified model
    # The function returns the embedding of the first (and only) input text
    return client.embeddings.create(input=[text], model=model).data[0].embedding

# Define a function to get a completion for a given prompt using a specified model
def get_completion(prompt, model="gpt-3.5-turbo"):
    # Prepare the prompt as a message from the user
    messages = [{"role": "user", "content": prompt}]
    # Use the OpenAI API client to create a chat completion using the specified model
    # The temperature parameter is set to 0 for deterministic, less random responses
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    # Return the content of the first (and only) completion message
    return response.choices[0].message.content

# Print a message indicating that the OpenAI API client is ready for use
print("OpenAI API is ready.")

# Lesson

## 1. Understanding and Generating Embeddings

Embeddings are vector representations of text that capture semantic meaning. Let's generate embeddings for a sample of our text:

In [None]:
# Generate embeddings for each row in the dataframe by applying the get_embedding function
# The get_embedding function is applied to the first 1000 characters of the 'fullText' column for each row
# This is done for brevity and to ensure the embedding process is manageable and efficient
df['embedding'] = df['fullText'].apply(lambda x: get_embedding(x[:1000]))

# Print the first 5 rows of the dataframe showing only the 'title' and 'embedding' columns
# This provides a quick overview of the embeddings generated for the initial texts
print(df[['title', 'embedding']].head())

# Save the dataframe to a CSV file named 'day2_dataset.csv'
# The index=False parameter is used to indicate that the dataframe index should not be written to the file
# Note: each embedding list is written as its string representation, so it must be parsed
# (for example with ast.literal_eval) when the file is read back in
df.to_csv('day2_dataset.csv', index=False)



## Explanation of the Embedding Code


1. **DataFrame Column Access**: `df['fullText']` accesses the `fullText` column of the DataFrame `df`. This column is expected to contain text data for which embeddings will be generated.

2. **Apply Function**: The `.apply()` method calls a function on each element of the `fullText` column, that is, once per row.

3. **Lambda Function**: A lambda function is defined inline to process each value `x` (representing the text in each row of the `fullText` column). The lambda function is used here for its conciseness and the ability to define a function in a single line.

4. **Text Truncation**: Inside the lambda function, `x[:1000]` truncates the text to the first 1000 characters. Truncation does not change the size of the embedding, because the model returns a vector of the same length for any input. It keeps each request short, which reduces cost and processing time and keeps the input within the model's token limit.

5. **Embedding Generation**: `get_embedding(x[:1000])` calls the `get_embedding` function defined earlier in this notebook. That function sends the truncated text to the OpenAI embeddings endpoint (using `text-embedding-ada-002` by default) and returns the resulting embedding as a list of numbers.

6. **Storing Embeddings**: The result of the `get_embedding` function, which is the embedding for each row's text, is then stored in a new column in the DataFrame named `embedding`. This effectively adds a new column to `df`, where each row contains the embedding of the text from the `fullText` column.
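
Truncating to 1000 characters discards most of a long document. One alternative is to split the text into chunks and embed each chunk separately. The sketch below shows only the splitting step, using the standard library's `textwrap.wrap`. The helper name `split_into_chunks` and the sample text are illustrative assumptions, not part of the original lesson; each resulting chunk could then be passed to `get_embedding`.

```python
import textwrap

def split_into_chunks(text, chunk_size=1000):
    """Split text into chunks of at most chunk_size characters without breaking words."""
    # textwrap.wrap collapses whitespace (including newlines) into single spaces.
    # With break_long_words=False it never cuts a word in half; a single word
    # longer than chunk_size would be kept whole in its own chunk.
    return textwrap.wrap(text, width=chunk_size, break_long_words=False)

# Hypothetical sample text standing in for one row of df['fullText']
sample_text = "Embeddings map text to vectors so that similar passages end up close together. " * 20

chunks = split_into_chunks(sample_text, chunk_size=200)

print(f"{len(sample_text)} characters -> {len(chunks)} chunks")
for i, chunk in enumerate(chunks[:3], start=1):
    # Show the length and the start of the first few chunks
    print(i, len(chunk), repr(chunk[:40]))
```

Splitting on character counts is the simplest option. Splitting on sentences or paragraphs would keep each chunk more coherent, at the cost of uneven chunk sizes.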




In [None]:
# We could look at different ways of chunking up the text for processing
#import textwrap
#from typing import List

#def get_chunk_embeddings(text: str, chunk_size: int = 1000) -> List:
#    # Split the text into chunks without splitting words
#    chunks = textwrap.wrap(text, width=chunk_size, break_long_words=False)
#
#    # Get embedding for each chunk
#    embeddings = [get_embedding(chunk) for chunk in chunks]
#
#    return embeddings

# Apply the function to each row in the dataframe
#df['embeddings'] = df['fullText'].apply(lambda x: get_chunk_embeddings(x))

## 2. Visualizing Embeddings

To visualize high-dimensional embeddings, we'll use t-SNE to reduce them to 2D:

In [None]:
# Convert the list of embeddings from the dataframe into a 2D numpy array for numerical operations
embeddings_array = np.array(df['embedding'].tolist())

# Determine the perplexity value for t-SNE, ensuring it's less than the number of samples to avoid errors
# The perplexity value influences how t-SNE balances attention between local and global aspects of your data
# The minimum function ensures the perplexity is not greater than the number of samples minus one
perplexity_value = min(30, len(embeddings_array) - 1)

# Initialize the t-SNE model with two components (for 2D visualization), a fixed random state for reproducibility,
# and the calculated perplexity value
tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity_value)

# Fit the t-SNE model on the embeddings array to reduce its dimensionality to two dimensions
embeddings_2d = tsne.fit_transform(embeddings_array)

# Create a figure for plotting with a specified size
plt.figure(figsize=(10, 8))

# Scatter plot of the two-dimensional embeddings
# Each point represents an embedding, plotted according to its t-SNE reduced dimensions
plt.scatter(embeddings_2d[:, 0], embeddings_2d[:, 1])

# Annotate each point in the scatter plot with its corresponding title from the dataframe
# This loop goes through each embedding and places a text label at its location
for i, title in enumerate(df['title']):
    plt.annotate(title, (embeddings_2d[i, 0], embeddings_2d[i, 1]))

# Set the title of the plot to give context about what is being visualized
plt.title('2D Visualization of Embeddings')

# Display the plot
# This visual representation helps in understanding the relationship between different embeddings
plt.show()

A 2D plot of embeddings serves several purposes:

- **Simplify complex data**: Embeddings often exist in high-dimensional spaces that are difficult to visualize or understand intuitively. Reducing these to 2D allows for easy visualization.
- **Discover patterns**: By visualizing embeddings in 2D, one can identify clusters, outliers, or patterns that indicate how the embeddings relate to each other. This can reveal the structure of the data or the effectiveness of the embedding process.
- **Facilitate analysis**: It makes it easier for humans to analyze and interpret the relationships between data points. For instance, in natural language processing (NLP), embeddings that are close together in the 2D space might represent words or documents with similar meanings.
- **Debugging and improvement**: Visualizing the embeddings can help in debugging or improving the model by identifying whether similar items are indeed clustered together as expected.
- **Communication**: It provides a straightforward way to communicate complex ideas or relationships in the data to a broader audience, including those without a deep technical background.

## 3. Implementing Similarity Search

Let's implement a simple similarity search using cosine similarity:

In [None]:
# Define a function to find the most similar text in a dataframe to a given query
def find_most_similar(query, df):
    # Generate an embedding for the query text using the get_embedding function
    query_embedding = get_embedding(query)
    # Calculate the cosine similarity between the query embedding and each embedding in the dataframe
    # The similarity score is stored in a new column 'similarity'
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity([query_embedding], [x])[0][0])
    # Sort the dataframe by the similarity score in descending order and select the top row
    # This row represents the text most similar to the query
    return df.sort_values('similarity', ascending=False).iloc[0]

# Example usage of the find_most_similar function
# Define a query text
query = "A story about Frederick Douglass"
# Call the find_most_similar function with the query and the dataframe
most_similar = find_most_similar(query, df)
# Print the results, including the title, the first 500 characters of the full text, and the similarity score
print(f"Most similar to '{query}':")
print(f"Title: {most_similar['title']}")
print(f"Excerpt: {most_similar['fullText'][:500]}...")
print(f"Similarity score: {most_similar['similarity']:.4f}")

In [None]:
# Importing distance metrics from scipy library
from scipy.spatial.distance import euclidean  # Euclidean distance
from scipy.spatial.distance import cityblock  # Manhattan (city block) distance

# Importing similarity and correlation metrics
from sklearn.metrics import jaccard_score  # Jaccard similarity for binary data
from scipy.stats import pearsonr  # Pearson correlation coefficient
from scipy.stats import spearmanr  # Spearman rank correlation

# Importing NLP model and similarity measure from gensim library
from gensim.models import Word2Vec  # Word2Vec model for word embeddings
from gensim.similarities import WmdSimilarity  # Word Mover's Distance similarity

In [None]:
# Try your own similarity measure here

## 4. Introduction to Retrieval Augmented Generation (RAG)

Retrieval Augmented Generation (RAG) is a technique that combines the power of large language models with the ability to retrieve relevant information from a knowledge base. The basic steps of RAG are:

1. Convert the query into an embedding
2. Find the most similar documents in the knowledge base
3. Use the retrieved documents to augment the prompt sent to the LLM
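
The three steps above can be sketched end to end without any API calls. In the sketch below, `toy_embed` is a hypothetical stand-in for `get_embedding` (it only counts a few hand-picked vocabulary words), the two documents are made up for illustration, and the final prompt is printed rather than sent to an LLM.

```python
import math

# Hypothetical toy vocabulary; a real embedding model learns its own dimensions
VOCAB = ["slavery", "freedom", "whale", "ship", "sea"]

def toy_embed(text):
    """Stand-in for get_embedding: count how often each vocabulary word appears."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Made-up knowledge base of two short documents
knowledge_base = {
    "doc_freedom": "a narrative about slavery and the long road to freedom",
    "doc_whaling": "a ship at sea chasing a white whale",
}

def retrieve(query, docs):
    # Step 1: convert the query into an embedding
    query_vec = toy_embed(query)
    # Step 2: score every document against the query and keep the best match
    scores = {name: cosine(query_vec, toy_embed(text)) for name, text in docs.items()}
    best = max(scores, key=scores.get)
    return best, scores

def build_prompt(query, context):
    # Step 3: augment the prompt with the retrieved text
    return f"Context: {context}\n\nQuestion: {query}\n\nAnswer:"

query = "what does the text say about freedom and slavery"
best_name, scores = retrieve(query, knowledge_base)
prompt = build_prompt(query, knowledge_base[best_name])

print("Retrieved:", best_name)
print(prompt)
```

Only the embedding function and the final LLM call differ in a real system; the retrieve-then-augment structure stays the same.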

Let's implement a simple RAG system using our data:

In [None]:
# Define a function to generate a response to a query using a Retrieval-Augmented Generation (RAG) approach
def rag_response(query, df, model="gpt-3.5-turbo"):
    # Generate an embedding for the query using the get_embedding function
    query_embedding = get_embedding(query)
    # Calculate the cosine similarity between the query embedding and each document's embedding in the dataframe
    # Store these similarity scores in a new column 'similarity'
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity([query_embedding], [x])[0][0])
    # Find the document with the highest similarity score to the query
    most_similar = df.sort_values('similarity', ascending=False).iloc[0]
    
    # Construct an augmented prompt that includes the context (the most similar document's text, truncated to 1000 characters)
    # followed by the query, asking for an informed answer based on the given context
    augmented_prompt = f"""
    Given the following context and question, provide an informed answer:
    
    Context: {most_similar['fullText'][:1000]}...
    
    Question: {query}
    
    Answer:
    """
    
    # Use the get_completion function to generate a response from the language model (LLM) based on the augmented prompt
    return get_completion(augmented_prompt, model)

# Example usage of the rag_response function
# Define a query regarding the main themes in Frederick Douglass' works
query = "What are the main themes in Frederick Douglass' works?"
# Call the rag_response function with the query and the dataframe containing document embeddings
response = rag_response(query, df)
# Print the original query and the generated response
print(f"Query: {query}")
print(f"RAG Response: {response}")

## 5. Comparing RAG to Standard LLM Responses

Let's compare the RAG response to a standard LLM response without context:

In [None]:
# Generate a response to the query using a standard Language Model (LLM) without any additional context or augmentation
standard_response = get_completion(query)

# Print the response generated by the standard LLM
print("Standard LLM Response:")
print(standard_response)

# Print a newline for better readability between the two responses
print("\nRAG Response:")

# Print the response generated by the Retrieval-Augmented Generation (RAG) method
# This response is expected to be more informed or context-aware due to the use of relevant document(s) during generation
print(response)

# Exercises

1. Generate embeddings for longer passages (e.g., first chapters) from the texts we downloaded. Visualize these embeddings and compare them to the embeddings of the 1000-character excerpts used in the lesson.

2. Implement a more sophisticated RAG system that retrieves multiple relevant documents and combines their information in the prompt.

3. Experiment with different similarity metrics (e.g., Euclidean distance, Manhattan distance) and compare their performance to cosine similarity.

4. Create a simple chatbot that uses the RAG system to answer questions about the texts in our dataset.
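
As a possible starting point for exercise 3, the sketch below ranks the same documents under cosine similarity, Euclidean distance, and Manhattan distance. The two-dimensional vectors and their names are made up so the arithmetic is easy to follow; they are assumptions for illustration, not real embeddings. Note that similarities rank from highest to lowest, while distances rank from lowest to highest.

```python
import math

def cosine_sim(a, b):
    """Cosine similarity: compares direction only and ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def euclidean_dist(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_dist(a, b):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Hypothetical toy vectors
query_vec = [1.0, 0.0]
docs = {
    "same_direction_long": [10.0, 0.0],  # same direction as the query, much longer
    "nearby_but_rotated": [1.0, 1.0],    # close in space, different direction
    "opposite": [-1.0, 0.0],             # points the opposite way
}

# Higher similarity is better, so sort in descending order
by_cosine = sorted(docs, key=lambda name: cosine_sim(query_vec, docs[name]), reverse=True)
# Lower distance is better, so sort in ascending order
by_euclidean = sorted(docs, key=lambda name: euclidean_dist(query_vec, docs[name]))
by_manhattan = sorted(docs, key=lambda name: manhattan_dist(query_vec, docs[name]))

print("Cosine:   ", by_cosine)
print("Euclidean:", by_euclidean)
print("Manhattan:", by_manhattan)
```

The rankings disagree here because the toy vectors have very different lengths. OpenAI's documentation states that its embeddings are normalized to length 1, in which case cosine similarity and Euclidean distance produce the same ranking, so any differences you observe on the lesson data are more likely to come from Manhattan distance.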

# Conclusion

In this lesson, we've explored the concept of embeddings and how they can be used to capture semantic meaning in text. We've also introduced Retrieval Augmented Generation (RAG) and implemented a basic RAG system using embeddings and an LLM.

In the next lesson, we'll focus on optimizing RAG systems for better performance and explore more advanced techniques in this field.

# References

1. Pennington, J., Socher, R., & Manning, C. D. (2014). [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf). In Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532-1543).
2. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). arXiv preprint arXiv:1810.04805.
3. Lewis, P., et al. (2020). [Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks](https://arxiv.org/abs/2005.11401). arXiv preprint arXiv:2005.11401.

___
[Proceed to next lesson: LLMs with RAG Workshop: Day 3 - Optimizing RAG for Enhanced Performance ->](./rag_advanced.ipynb)

In [None]:
def process_longer_passage(row, max_chars=10000):
    """
    Generate a single embedding for a longer excerpt of a row's text.
    The excerpt is capped at max_chars characters so that the request
    stays within the embedding model's input limit.
    """
    try:
        # str() guards against missing values in the 'fullText' column
        text = str(row['fullText'])
        # Embed the first max_chars characters (compare with the 1000 used in the lesson)
        return get_embedding(text[:max_chars])
    except openai.OpenAIError as e:
        # Report API failures and leave this row's embedding empty
        print(f"Error processing row: {e}")
        return None

# Apply the modified function to each row of the DataFrame
df['embeddings_long'] = df.apply(process_longer_passage, axis=1)

In [None]:
# Convert the embeddings from the dataframe into 2D numpy arrays for numerical operations
# ('embedding' holds the 1000-character excerpts from the lesson; 'embeddings_long' holds the longer excerpts)
embeddings_short_array = np.array(df['embedding'].tolist())
embeddings_long_array = np.array(df['embeddings_long'].tolist())

# Stack both sets so that a single t-SNE fit places them in the same 2D space;
# two separate fits would produce coordinates that cannot be compared
combined_array = np.vstack([embeddings_short_array, embeddings_long_array])

# Determine the perplexity value for t-SNE, ensuring it's less than the number of samples to avoid errors
perplexity_value = min(30, len(combined_array) - 1)

# Initialize the t-SNE model with two components (for 2D visualization), a fixed random state for reproducibility,
# and the calculated perplexity value
tsne = TSNE(n_components=2, random_state=42, perplexity=perplexity_value)

# Fit once on the combined array, then split the result back into short and long passages
combined_2d = tsne.fit_transform(combined_array)
n_short = len(embeddings_short_array)
embeddings_short_2d = combined_2d[:n_short]
embeddings_long_2d = combined_2d[n_short:]

# Create a figure for plotting with a specified size
plt.figure(figsize=(12, 10))

# Scatter plot of the two-dimensional embeddings for short passages
plt.scatter(embeddings_short_2d[:, 0], embeddings_short_2d[:, 1], color='blue', label='Short Passages')

# Scatter plot of the two-dimensional embeddings for long passages
plt.scatter(embeddings_long_2d[:, 0], embeddings_long_2d[:, 1], color='red', label='Long Passages')

# Optionally, annotate points with titles or specific characteristics
# This example annotates points from the short passages dataset
for i, title in enumerate(df['title']):
    plt.annotate(title, (embeddings_short_2d[i, 0], embeddings_short_2d[i, 1]))

# Set the title of the plot and add a legend
plt.title('2D Visualization of Short vs. Long Passage Embeddings')
plt.legend()

# Display the plot
plt.show()

In [None]:
def improved_rag_response(query, df, top_n=3, model="gpt-3.5-turbo"):
    # Generate an embedding for the query
    query_embedding = get_embedding(query)
    # Calculate cosine similarity between the query embedding and each document's embedding
    df['similarity'] = df['embedding'].apply(lambda x: cosine_similarity([query_embedding], [x])[0][0])
    # Sort the dataframe by similarity and select the top N most similar documents
    top_documents = df.sort_values('similarity', ascending=False).head(top_n)
    
    # Construct the combined context from the top N documents, truncating each to 1000 characters
    # Contexts are numbered by rank (1 to top_n) rather than by DataFrame index
    combined_context = "\n\n".join(
        f"Context {rank}: {doc['fullText'][:1000]}..."
        for rank, (_, doc) in enumerate(top_documents.iterrows(), start=1)
    )
    
    # Construct an augmented prompt with the combined context and the query
    augmented_prompt = f"""
    Given the following contexts and question, provide an informed answer:
    
    {combined_context}
    
    Question: {query}
    
    Answer:
    """
    
    # Generate a response from the language model based on the augmented prompt
    return get_completion(augmented_prompt, model)

# Example usage
query = "Defining adaptation studies as it relates to literature into film"
response = improved_rag_response(query, df)
print(f"Query: {query}")
print(f"Improved RAG Response: {response}")

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assuming the existence of functions `get_embedding`, `get_completion`, and `improved_rag_response` as defined previously



def chatbot():
    print("Hello! I'm a chatbot that can answer questions about specific adaptation study texts. Ask me anything or type 'quit' to exit.")
    
    while True:
        user_input = input("You: ")
        if user_input.lower() == 'quit':
            print("Chatbot: Goodbye!")
            break
        
        # Use the improved RAG response function to generate an answer
        response = improved_rag_response(user_input, df)
        print(f"Chatbot: {response}")

# Example usage
chatbot()