<img align="left" src="https://ithaka-labs.s3.amazonaws.com/static-files/images/tdm/tdmdocs/tapi-logo-small.png" />

This notebook is free for educational reuse under a [Creative Commons CC BY License](https://creativecommons.org/licenses/by/4.0/).

Created by [Grant Glass](https://glassgrant.com) for the 2024 Text Analysis Pedagogy Institute, with support from [Constellate](https://constellate.org).

For questions/comments/improvements, email grantg@unc.edu.<br />
____

# Large Language Models and Embeddings for Retrieval Augmented Generation: Day 1 7/15/24

This is lesson `1` of 3 in the educational series on `Large Language Models (LLMs) and Retrieval Augmented Generation (RAG)`. This notebook is intended to introduce the concepts of LLMs and provide hands-on experience with analyzing their capabilities and limitations.

**Skills:** 
* Data analysis
* Machine learning
* Text analysis
* Language models

**Audience:** `Learners`

**Use case:** `Tutorial`

This tutorial guides users through the process of understanding and analyzing Large Language Models, providing step-by-step instructions and explanations.


**Difficulty:** `Intermediate`

Intermediate assumes users are familiar with Python and have been programming for 6+ months. Code makes up a larger part of the notebook and basic concepts related to Python are not explained.


**Completion time:** `90 minutes`

**Knowledge Required:** 
* Python basics (variables, flow control, functions, lists, dictionaries)
* Basic understanding of machine learning concepts

**Knowledge Recommended:**
* Familiarity with natural language processing (NLP) concepts
* Experience with data manipulation libraries like Pandas

**Learning Objectives:**
After this lesson, learners will be able to:
1. Describe the fundamental concepts and architecture of Large Language Models
2. Implement basic prompts and analyze LLM responses
3. Evaluate LLM performance on various tasks
4. Discuss the strengths and limitations of current LLM technology

**Research Pipeline:**
1. Introduction to LLMs and their applications
2. **Hands-on analysis of LLM capabilities**
3. Exploring embeddings and RAG (Day 2)
4. Optimizing RAG systems (Day 3)

___

# Required Python Libraries
* [OpenAI](https://github.com/openai/openai-python) for interacting with GPT models
* [Pandas](https://pandas.pydata.org/) for data manipulation and analysis
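* [Matplotlib](https://matplotlib.org/) for data visualization
* [tiktoken](https://github.com/openai/tiktoken) for counting tokens
* [Constellate client](https://constellate.org/docs/constellate-client) for downloading and reading datasets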

## Install Required Libraries

In [None]:
### Install Libraries ###
!pip install openai pandas matplotlib tiktoken

In [None]:
### Import Libraries ###
# Import the openai library for accessing OpenAI's API functionalities
import openai
# Import the OpenAI class from the openai library for more specific API interactions
from openai import OpenAI
# Import the os library to interact with the operating system, like reading or writing files
import os
# Import pandas, a powerful data manipulation and analysis library, as 'pd'
import pandas as pd
# Import matplotlib's pyplot to create static, interactive, and animated visualizations in Python, as 'plt'
import matplotlib.pyplot as plt
# Import the constellate library for working with datasets and analytics
import constellate
# Import the dataset_reader function from the constellate library to read and process datasets
from constellate import dataset_reader
# Import time to pause between API requests
import time
# Import tiktoken which helps us count tokens
import tiktoken



## Import your dataset

The next code cell tries to import your dataset using the following method:

* Download a full dataset that has been requested


If you are using a [dataset ID](https://constellate.org/docs/key-terms/#dataset-ID), replace the default dataset ID in the next code cell.

If you don't have a dataset ID, you can:
* Use the sample dataset ID already in the code cell
* [Create a new dataset](https://constellate.org/builder)
* [Use a dataset ID from other pre-built sample datasets](https://constellate.org/dataset/dashboard)

The Constellate client will download datasets automatically using either the `.download()` or `.get_dataset()` method.
* Full datasets are downloaded using the `.download()` method. They must be requested first in the builder environment. See the [Constellate client documentation](https://constellate.org/docs/constellate-client).

* Sampled datasets (1500 items) are downloaded using the `.get_dataset()` method. They are built automatically when a dataset is created.

In [None]:

# Assign a dataset ID to the variable `dataset_id`. You can find this ID in the Constellate dataset builder.
dataset_id = "716ce175-4d27-6e6e-2f3d-defaa1f0d81d"

# Use the constellate library to download the dataset specified by `dataset_id` in JSON Lines format
dataset_file = constellate.download(dataset_id, 'jsonl')

# Initialize empty lists to hold various pieces of information for each document in the dataset
document_ids = []  # To store document IDs
document_titles = []  # To store document titles
document_authors = []  # To store document authors
document_fulltexts = []  # To store the full text of each document

# Loop through each document in the dataset, as read by the `dataset_reader` function from the constellate library
for document in dataset_reader(dataset_file):
    # For each document, extract and append its ID, title, author(s), and full text to their respective lists
    document_ids.append(document.get('id'))  # Extract and append the document's ID
    document_titles.append(document.get('title'))  # Extract and append the document's title
    document_authors.append(document.get('creator'))  # Extract and append the document's author(s)
    document_fulltexts.append(document.get('fullText'))  # Extract and append the document's full text

# Create a pandas DataFrame from the collected lists, organizing the data into columns
df = pd.DataFrame({
    'id': document_ids,  # Column for document IDs
    'title': document_titles,  # Column for document titles
    'author': document_authors,  # Column for document authors
    'fullText': document_fulltexts  # Column for document full texts
})

# Print the first few rows of the DataFrame to get a preview of the data
print(df.head())


# Introduction

Large Language Models (LLMs) have revolutionized natural language processing and artificial intelligence. These powerful models, trained on vast amounts of text data, can generate human-like text, answer questions, and perform a wide range of language-related tasks. Understanding LLMs is crucial for researchers, educators, and professionals working with text analysis and AI applications.

In this lesson, we will:
1. Explore the basic concepts behind LLMs
2. Interact with an LLM (GPT-3.5) using the OpenAI API
3. Analyze LLM performance on various tasks
4. Discuss the strengths and limitations of current LLM technology

By the end of this lesson, you will have hands-on experience working with LLMs and a deeper understanding of their capabilities and potential applications in research and education.

# Lesson

## 1. Understanding Large Language Models

Large Language Models are deep learning models trained on massive amounts of text data. They use transformer architecture and self-attention mechanisms to process and generate text. Some key concepts:

- Transformer architecture
- Self-attention mechanism
- Token-based processing
- Fine-tuning and few-shot learning
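
To make the self-attention idea more concrete, here is a minimal sketch of scaled dot-product attention in plain Python. It assumes the queries, keys, and values are simply the token vectors themselves. Real transformers first pass each token through learned projection matrices, and the toy vectors below are made up for illustration.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output is the weighted average of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors (made-up numbers, not real embeddings)
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)

for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row shows how much one token "attends" to every token in the sequence, including itself. The first token puts more weight on the tokens that share its first dimension than on the one that does not.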

Let's start by setting up our OpenAI API access:

## Configure the OpenAI client

To set up the client, we need an API key to send with each request. Skip these steps if you already have an API key.

You can get an API key by following these steps:

1. [Create a new project](https://help.openai.com/en/articles/9186755-managing-your-work-in-the-api-platform-with-projects)
2. [Generate an API key in your project](https://platform.openai.com/api-keys)
3. (RECOMMENDED, BUT NOT REQUIRED) [Set up your API key for all projects as an environment variable](https://platform.openai.com/docs/quickstart/step-2-set-up-your-api-key)

In [None]:
# Method 1: Directly paste the API key (not recommended for production or shared code)
client = OpenAI(api_key="your_actual_openai_api_key_here")

# Method 2: Use an environment variable (recommended for most use cases)
# Ensure the environment variable OPENAI_API_KEY is set in your environment before running the script
#client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Method 3: Use a configuration file (alternative for keeping keys out of code)
# Create a file named config.py (or similar) and define OPENAI_API_KEY in it, then import it here
#from config import OPENAI_API_KEY
#client = OpenAI(api_key=OPENAI_API_KEY)

# Method 4: Use Python's built-in `getpass` module to securely input the API key at runtime (useful for notebooks or temporary scripts)
#from getpass import getpass
#api_key = getpass("Enter your OpenAI API key: ")
#client = OpenAI(api_key=api_key)


In [None]:
# Define a function `get_completion` that takes a prompt and optionally a model name (defaulting to "gpt-3.5-turbo")
def get_completion(prompt, model="gpt-3.5-turbo"):
    # Create a list of messages where each message is a dictionary with a role (user/system) and the content (the prompt)
    messages = [{"role": "user", "content": prompt}]
    # Call the OpenAI API's chat.completions.create method with the specified model, messages, and a temperature of 0
    # Temperature of 0 makes the model's responses as repeatable as possible (the same input will usually, but not always, produce the same output)
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
    )
    # Return the content of the first message in the response's choices. This is the model's completion of the input prompt.
    return response.choices[0].message.content

# Test the function by defining a test prompt asking to explain what a large language model is in one sentence
test_prompt = "Explain what a large language model is in one sentence."
# Print the result of calling `get_completion` with the test prompt to see the model's response
print(get_completion(test_prompt))
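
The `temperature=0` setting in `get_completion` controls how sharply the model favors its top-scoring next token. Here is a conceptual sketch of that effect, assuming temperature works by dividing the model's raw scores (logits) before they are turned into probabilities. The scores are made up, and this sketch requires a positive temperature; a value of 0 is best thought of as "always pick the highest-scoring token."

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for three candidate next tokens
logits = [2.0, 1.0, 0.1]

low = softmax_with_temperature(logits, 0.1)   # close to greedy selection
high = softmax_with_temperature(logits, 2.0)  # flatter, more varied sampling

print("temperature 0.1:", [round(p, 3) for p in low])
print("temperature 2.0:", [round(p, 3) for p in high])
```

At a low temperature nearly all of the probability lands on the top token, which is why responses become repeatable. At a higher temperature the alternatives become more likely to be sampled.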

## 2. Basic Interaction with LLMs

Now that we have our API set up, let's explore some basic interactions with the LLM:

In [None]:
# Define a list named `prompts` containing three different prompts for the language model
prompts = [
    "Summarize the plot of Pride and Prejudice in 3 sentences.",  # First prompt
    "What are the main themes in Frankenstein?",  # Second prompt
    "Describe Alice's character in Alice in Wonderland.",  # Third prompt
]

# Iterate over each prompt in the `prompts` list
for prompt in prompts:
    # Print the current prompt to the console, formatted with a prefix "Prompt: "
    print(f"Prompt: {prompt}")
    # Call the `get_completion` function with the current prompt, print the response prefixed with "Response: "
    # The `get_completion` function is expected to return a string containing the model's response to the prompt
    print(f"Response: {get_completion(prompt)}\n")  # A newline is added after each response for better readability

## 3. Analyzing LLM Performance

Let's analyze the LLM's performance on various literary analysis tasks:

In [None]:
# Define a function to calculate the number of tokens in a string using a specific encoding
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    # Retrieve the encoding object for the specified encoding name using tiktoken library
    encoding = tiktoken.get_encoding(encoding_name)
    # Encode the string and calculate the number of tokens in the encoded result
    num_tokens = len(encoding.encode(string))
    # Return the number of tokens
    return num_tokens

# Define a function to analyze a text in parts, each part having a maximum number of tokens
def analyze_text_in_parts(text, task, max_tokens=8000):
    # Check if the input text is a list (of strings) or a single string
    if isinstance(text, list):
        # If it's a list, join the elements into a single string separated by spaces
        text = ' '.join(text)
    
    # Initialize an empty list to hold the parts of text
    parts = []
    # Initialize an empty string to accumulate the current part of text
    current_part = ""
    # Split the text into words
    words = text.split()
    
    # Iterate over each word in the text
    for word in words:
        # Check if adding the current word to the current part exceeds the max_tokens limit
        if num_tokens_from_string(current_part + " " + word) > max_tokens:
            # If so, add the current part to the parts list and start a new part with the current word
            parts.append(current_part)
            current_part = word
        else:
            # Otherwise, add the current word to the current part
            current_part += " " + word
    
    # After iterating through all words, add the last part to the parts list if it's not empty
    if current_part:
        parts.append(current_part)
    
    # Initialize an empty string to accumulate the aggregated response from analyzing each part
    aggregated_response = ""
    # Iterate over each part
    for part in parts:
        # Construct a prompt for the analysis task using the current part
        prompt = f"Analyze the following text for {task}: {part}"
        # Get the response for the prompt using the get_completion function
        response = get_completion(prompt)
        # Append the response to the aggregated response, separated by spaces
        aggregated_response += response + " "
        # Sleep for 1 second to avoid hitting rate limits of the API
        time.sleep(1)
    
    # Return the aggregated response, stripped of leading/trailing whitespace
    return aggregated_response.strip()

# Define a list of tasks for analysis
tasks = ["main themes", "writing style", "historical context"]
# Initialize an empty list to store the results
results = []

# Iterate over each row in the dataframe `df`
for _, row in df.iterrows():
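    # Skip documents without full text, since there is nothing to analyze
    if not row['fullText']:
        continue
    # For each task, analyze the full text of the row
    for task in tasks:
        analysis = analyze_text_in_parts(row['fullText'], task)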
        # Append the analysis result to the results list
        results.append({"Text": row['title'], "Task": task, "Analysis": analysis})
    
    # After analyzing all tasks for a row, save the results to a CSV file to prevent data loss
    pd.DataFrame(results).to_csv('day1_dataset_analysis.csv', index=False)

# Convert the results list to a DataFrame and print it
df_results = pd.DataFrame(results)
print(df_results)
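
The word-by-word loop in `analyze_text_in_parts` re-counts the tokens of the growing chunk for every word, which can be slow on long documents. One alternative, sketched below, is to tokenize the text once and slice the resulting list of token IDs. The helper name `chunk_tokens` is hypothetical, and the token IDs are stand-in numbers so the example runs without downloading an encoding.

```python
def chunk_tokens(token_ids, max_tokens):
    """Split a list of token IDs into consecutive chunks of at most max_tokens."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return [token_ids[i:i + max_tokens] for i in range(0, len(token_ids), max_tokens)]

# Stand-in token IDs; with tiktoken these would come from encoding.encode(text)
fake_token_ids = list(range(10))
chunks = chunk_tokens(fake_token_ids, max_tokens=4)
print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

With tiktoken, each chunk could be turned back into text with `encoding.decode(chunk)` before it is placed in a prompt. The trade-off is that a chunk boundary may fall in the middle of a word, which the word-based loop avoids.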

## 4. Visualizing LLM Performance

Let's create a simple visualization to compare the length of LLM responses for different tasks and texts:

In [None]:
# Add a new column 'ResponseLength' to the dataframe 'df_results' that contains the length of each response in the 'Analysis' column
df_results['ResponseLength'] = df_results['Analysis'].str.len()

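# Pivot the dataframe to have 'Text' as the index, 'Task' as the columns, and 'ResponseLength' as the values
# pivot_table averages the lengths when several documents share a title, a case where pivot would raise an error
pivot = df_results.pivot_table(index='Text', columns='Task', values='ResponseLength', aggfunc='mean')
# Plot a bar chart, passing the figure size (12 inches wide by 6 inches tall) directly to plot()
# Calling plt.figure() first would leave an empty extra figure, because plot() creates its own
pivot.plot(kind='bar', figsize=(12, 6))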
# Set the title of the plot to 'LLM Response Length by Text and Task'
plt.title('LLM Response Length by Text and Task')
# Label the x-axis as 'Text'
plt.xlabel('Text')
# Label the y-axis as 'Response Length (characters)'
plt.ylabel('Response Length (characters)')
# Add a legend to the plot with the title 'Task'
plt.legend(title='Task')
# Adjust the layout to make sure everything fits without overlapping
plt.tight_layout()
# Display the plot
plt.show()

In [None]:
# Save initial data to csv for next class
df.to_csv('day1_dataset.csv', index=False)

## 5. Discussing Strengths and Limitations

Based on our experiments, let's discuss some strengths and limitations of LLMs:

Strengths:
1. Versatility in handling different types of questions and tasks
2. Ability to generate coherent and contextually relevant responses
3. Fast processing of large amounts of text

Limitations:
1. Potential for factual inaccuracies or "hallucinations"
2. Limited context window for processing long texts
3. Difficulty with tasks requiring deep reasoning or external knowledge

# Exercises

1. Choose a short passage (about 500 words) from one of the downloaded texts and ask the LLM to perform the following tasks:
   a. Summarize the passage
   b. Identify the mood or tone
   c. List any literary devices used

2. Compare the LLM's analysis with your own interpretation. What similarities and differences do you notice?

3. Experiment with different prompting techniques (e.g., few-shot learning, chain-of-thought) to improve the LLM's performance on a specific task of your choice.

# Conclusion

In this lesson, we've explored the basics of Large Language Models, interacted with an LLM using the OpenAI API, and analyzed its performance on various literary analysis tasks. We've seen both the impressive capabilities and some limitations of current LLM technology.

In the next lesson, we'll dive into embeddings and introduce the concept of Retrieval Augmented Generation (RAG) to enhance LLM performance.

# References

1. Vaswani, A., et al. (2017). [Attention Is All You Need](https://arxiv.org/abs/1706.03762). arXiv preprint arXiv:1706.03762.
2. Brown, T. B., et al. (2020). [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165). arXiv preprint arXiv:2005.14165.
3. OpenAI. (2023). [OpenAI API Documentation](https://platform.openai.com/docs/introduction).

___
[Proceed to next lesson: LLMs with RAG Workshop: Day 2 - Exploring Embeddings and Introduction to RAG ->](./rag_embedding_basics.ipynb)

In [None]:

# Define a function to calculate the number of tokens in a string using a specific encoding
def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    # Retrieve the encoding object for the specified encoding name using tiktoken library
    encoding = tiktoken.get_encoding(encoding_name)
    # Encode the string and calculate the number of tokens in the encoded result
    num_tokens = len(encoding.encode(string))
    # Return the number of tokens
    return num_tokens

# Define a function to analyze a text in parts, each part having a maximum number of tokens
def analyze_text_in_parts(text, task, max_tokens=500):
    # Check if the input text is a list (of strings) or a single string
    if isinstance(text, list):
        # If it's a list, join the elements into a single string separated by spaces
        text = ' '.join(text)
    
    # Initialize an empty list to hold the parts of text
    parts = []
    # Initialize an empty string to accumulate the current part of text
    current_part = ""
    # Split the text into words
    words = text.split()
    
    # Iterate over each word in the text
    for word in words:
        # Check if adding the current word to the current part exceeds the max_tokens limit
        if num_tokens_from_string(current_part + " " + word) > max_tokens:
            # If so, add the current part to the parts list and start a new part with the current word
            parts.append(current_part)
            current_part = word
        else:
            # Otherwise, add the current word to the current part
            current_part += " " + word
    
    # After iterating through all words, add the last part to the parts list if it's not empty
    if current_part:
        parts.append(current_part)
    
    # Initialize an empty string to accumulate the aggregated response from analyzing each part
    aggregated_response = ""
    # Iterate over each part
    for part in parts:
        # Construct a prompt for the analysis task using the current part
        prompt = f"Analyze the following text for {task}: {part}"
        # Send the prompt to the model with the get_completion function defined earlier in this notebook
        response = get_completion(prompt)
        # Append the response to the aggregated response, separated by spaces
        aggregated_response += response + " "
        # Sleep for 1 second to avoid hitting rate limits of the API
        time.sleep(1)
    
    # Return the aggregated response, stripped of leading/trailing whitespace
    return aggregated_response.strip()
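
The fixed `time.sleep(1)` pause above reduces the chance of hitting rate limits but does not recover if a request still fails. A common alternative is to retry with exponentially growing waits. The sketch below is one way to do that; `call_with_backoff` and `flaky` are hypothetical names, and the failure is simulated so the example runs without an API key.

```python
import random
import time

def call_with_backoff(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(); on failure wait roughly base_delay * 2**attempt and retry."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:  # in practice, catch only your client's rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            # Double the wait each time and add up to 10% random jitter
            delay = base_delay * (2 ** attempt) * random.uniform(1.0, 1.1)
            sleep(delay)

# Hypothetical function that fails twice before succeeding
attempts = {"count": 0}

def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"

waits = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, base_delay=0.01, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

In this notebook, the call could look like `call_with_backoff(lambda: get_completion(prompt))`. With the `openai` client you would likely narrow the `except` clause to `openai.RateLimitError` rather than catching every exception.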

In [None]:

def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

def analyze_text_in_parts(text, task, max_tokens=8000, technique="few-shot"):
    if isinstance(text, list):
        text = ' '.join(text)
    
    parts = []
    current_part = ""
    words = text.split()
    
    for word in words:
        if num_tokens_from_string(current_part + " " + word) > max_tokens:
            parts.append(current_part)
            current_part = word
        else:
            current_part += " " + word
    
    if current_part:
        parts.append(current_part)
    
    aggregated_response = ""
    for part in parts:
        if technique == "few-shot":
            prompt = few_shot_prompt(task, part)
        elif technique == "chain-of-thought":
            prompt = chain_of_thought_prompt(task, part)
        else:
            prompt = f"Analyze the following text for {task}: {part}"
        
        response = get_completion(prompt)  # Uses the get_completion function defined earlier in this notebook
        aggregated_response += response + " "
        time.sleep(1)
    
    return aggregated_response.strip()

def few_shot_prompt(task, text):
    # Example of a few-shot prompt with two examples
    return f"""
    Task: {task}
    Example 1: [The text is about the main themes.....]
    Example 2: [The historical context of the text is.....]
    Text: {text}
    Analysis:
    """

def chain_of_thought_prompt(task, text):
    # Example of a chain-of-thought prompt
    return f"""
    Task: {task}
    To analyze the text, consider the following steps:
    1. Identify the key themes.
    2. Note the writing style and tone.
    3. Consider the historical context or relevance.
    Text: {text}
    Analysis:
    """



In [None]:
def get_completion(prompt, model="gpt-3.5-turbo", frequency_penalty=0):
    messages = [{"role": "user", "content": prompt}]
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0,
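        frequency_penalty=frequency_penalty,  # Penalize tokens in proportion to how often they have already appeared
    )
    # Return the text of the first choice, as in the original get_completion
    return response.choices[0].message.content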

Effects of Changing the Frequency Penalty:

The OpenAI API accepts frequency penalty values between -2.0 and 2.0. The default is 0, which applies no penalty.

Higher Frequency Penalty (>0): Penalizes tokens in proportion to how often they have already appeared in the response, which makes the model less likely to repeat the same text verbatim. This can be useful for generating more varied responses. However, setting it too high might lead to less coherent or relevant responses.

Lower Frequency Penalty (closer to 0): Allows more repetition. This can be useful when you want the model to stay focused on a specific topic or when repetition is not a concern. However, the responses may become repetitive.
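
To see why a higher penalty discourages repetition, here is a small sketch that follows the description in OpenAI's documentation: each candidate token's score is lowered by the penalty multiplied by the number of times that token has already been generated. This illustrates the idea and is not the service's actual implementation. The function name `apply_frequency_penalty` and all scores are made up.

```python
from collections import Counter

def apply_frequency_penalty(logits, generated_tokens, frequency_penalty):
    """Lower each token's score in proportion to how often it was already generated."""
    counts = Counter(generated_tokens)
    return {token: score - counts[token] * frequency_penalty
            for token, score in logits.items()}

# Made-up scores for three candidate next tokens
logits = {"the": 2.0, "monster": 1.5, "creature": 1.4}
# Tokens the model has produced so far in this response
already_generated = ["the", "monster", "the", "monster", "monster"]

adjusted = apply_frequency_penalty(logits, already_generated, frequency_penalty=0.5)
print(adjusted)  # {'the': 1.0, 'monster': 0.0, 'creature': 1.4}
```

Before the penalty, "the" had the highest score. After it, the unused word "creature" comes out on top, because the two words that were already repeated have been pushed down.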