<a href="https://colab.research.google.com/github/ha-pu/data_course/blob/Google_Colab/5-llms.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

+ title: Large Language Models (LLMs)
+ author: Harald Puhr
+ date: April 21, 2025

# Load packages and data

In [None]:
install.packages(c("ellmer", "tidyverse"))

In [None]:
library(ellmer)
library(tidyverse)

## Generate dummy data

This code stores ten customer reviews for a bike product in a character vector.
It then creates a tibble (data frame) with the reviews and assigns a unique
Review_ID to each review. Variables:

+ reviews: A character vector containing customer reviews for a bike product.
+ data: A tibble containing the customer reviews and their corresponding Review_IDs.

In [None]:
reviews <- c(
    "My daughter absolutely loves this bike! Easy to assemble and sturdy.",
    "The bike is well-made and looks great, but it arrived with a scratch.",
    "Perfect size for my 5-year-old son. He rides it every day!",
    "The training wheels are a bit flimsy, but otherwise it's a good bike.",
    "Great value for the price. Assembly was simple and instructions were clear.",
    "My son outgrew his old bike, and this was a perfect upgrade. Highly recommend.",
    "Poor quality materials. The chain fell off after one week of use.",
    "We had trouble with the brakes at first, but customer service was very helpful.",
    "Fantastic bike! My daughter loves the color and design.",
    "The seat could be more comfortable, but it's a nice bike overall."
)

data <- tibble(Customer_Review = reviews) %>%
  mutate(Review_ID = row_number())

data

## Initialize LLM

Set the OpenAI API key as an environment variable for authentication.

In [None]:
Sys.setenv(
  OPENAI_API_KEY = "XXX" # Replace with your OpenAI API key.
)
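
A key that is hard-coded in a notebook is easy to leak when the file is shared.
The sketch below checks that a key is available before any request is made. It
assumes the key was set outside the notebook (for example in `.Renviron`);
`check_api_key` is a hypothetical helper, not part of ellmer.

```r
# Hypothetical helper (not part of ellmer): stop early if the key is missing
# or still set to the "XXX" placeholder used above.
check_api_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key) || identical(key, "XXX")) {
    stop("Environment variable ", var, " is not set to a real API key.")
  }
  invisible(TRUE)
}

# Usage: call once before creating a chat object.
# check_api_key()
```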

In [None]:
base_model <- "gpt-4.1-nano" # Replace with your preferred model.

# Test LLM

This code initializes a chat session with the model set in `base_model` (here
GPT-4.1-nano) and asks it "Who created R?". Because `echo = FALSE`, the
response is not printed to the console.

In [None]:
chat <- chat_openai(model = base_model)
. <- chat$chat("Who created R?", echo = FALSE)

Show the chat including the query and response.

In [None]:
chat

`token_usage()` reports how many tokens have been sent and received in the
current session, by model. This matters because API usage is billed per token.

In [None]:
token_usage()

# Create a function to query the LLM

## Function: Get response from OpenAI chat model

This function facilitates interaction with the OpenAI chat model by generating
a response based on the provided system and user prompts.

### Parameters:

+ **system**: A character string that serves as the system prompt to initialize
  the chat model.
+ **user**: A character string that represents the user prompt to which the chat
  model will generate a response.

### Returns:

A character string containing the response generated by the chat model.

### Examples:

```{r}
# system_prompt <- "You are a helpful assistant."
# user_prompt <- "What is the weather like today?"
# response <- get_response(system_prompt, user_prompt)
# print(response)
```

### Define function:

In [None]:
get_response <- function(system, user) {
  chat <- chat_openai(
    model = base_model,
    system_prompt = system,
    echo = "none"
  )
  out <- chat$chat(user)
  return(out)
}
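
API calls can fail (rate limits, network errors), and a single failure inside
`map()` aborts the whole loop. The sketch below is one way to return `NA`
instead of stopping. `safe_call` is a hypothetical helper, and `flaky` is a
stand-in for `get_response` so that the example runs without an API key.

```r
# Hypothetical wrapper: run f(...) and return NA_character_ on error.
safe_call <- function(f, ...) {
  tryCatch(
    f(...),
    error = function(e) {
      message("Request failed: ", conditionMessage(e))
      NA_character_
    }
  )
}

# Stand-in for get_response(); fails for empty input.
flaky <- function(system, user) {
  if (!nzchar(user)) stop("empty review")
  paste("echo:", user)
}

safe_call(flaky, "prompt", "Great bike") # returns the stand-in reply
safe_call(flaky, "prompt", "")           # returns NA and prints a message
```

In this notebook the wrapper would be called as
`safe_call(get_response, system_prompt, review)`.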

# Classification of customer reviews

In [None]:
system_prompt <- "You are a classifier for customer reviews. Assess whether each
customer review is negative or positive. Return only a number between
1 (negative) and 5 (positive)."

## Apply classification

This line of code calls the `get_response` function with two arguments:

+ `system_prompt`: The classification instructions defined above.
+ `data$Customer_Review[[1]]`: The first element of the `Customer_Review`
  column in the `data` dataframe.

The function returns the model's rating of the first review as a character
string.

In [None]:
get_response(system_prompt, data$Customer_Review[[1]])

This code performs the following operations on the `data` dataframe:

+ Creates a new column `Classification` by applying the `get_response`
  function to each element in the `Customer_Review` column. The `get_response`
  function is called with `system_prompt` and each `Customer_Review` as
  arguments. Because `map` returns a list, `Classification` is a list column at
  this point.
+ Flattens the `Classification` list column with `unnest`.
+ Converts the `Classification` column to numeric type. Any response that is
  not a plain number becomes `NA`.

In [None]:
data1 <- data %>%
  mutate(Classification = map(Customer_Review, ~ get_response(system_prompt, .))) %>%
  unnest(Classification) %>%
  mutate(Classification = as.numeric(Classification))

data1
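
`as.numeric()` returns `NA` (with a warning) whenever the model adds any text
around the number. The sketch below is a more tolerant parser, assuming the
reply contains a standalone digit from 1 to 5. `parse_score` is a hypothetical
helper, and the replies are made-up examples, not model output.

```r
# Hypothetical helper: pull the first standalone digit 1-5 out of a reply.
parse_score <- function(x) {
  hit <- regmatches(x, regexpr("\\b[1-5]\\b", x, perl = TRUE))
  if (length(hit) == 0) NA_real_ else as.numeric(hit)
}

# Made-up replies for illustration.
replies <- c("4", "Rating: 5", "positive")
vapply(replies, parse_score, numeric(1), USE.NAMES = FALSE)
```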

This code creates a histogram using ggplot2 in R. It takes the data frame
'data1' and plots a histogram of the 'Classification' variable.

In [None]:
data1 %>%
  ggplot() +
  geom_histogram(aes(x = Classification), bins = 5) +
  labs(x = "Classification", y = "Frequency") +
  ggtitle("Histogram of Classification") +
  theme_bw()

# Show the reviews rated 3 or lower.
filter(data1, Classification <= 3)

## Re-run classification to test stability of results

In [None]:
data2 <- data1 %>%
  mutate(Classification2 = map(Customer_Review, ~ get_response(system_prompt, .))) %>%
  unnest(Classification2) %>%
  mutate(Classification2 = as.numeric(Classification2))

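This code creates a scatter plot using ggplot2 in R. It visualizes the
relationship between two variables: Classification and Classification2 from
the data2 dataset. Points on the dashed 45-degree line received the same rating
in both runs; points off the line show where the model was inconsistent.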

In [None]:
data2 %>%
  ggplot() +
  geom_point(aes(x = Classification, y = Classification2)) +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Classification", y = "Classification2") +
  ggtitle("Classification vs. Classification2") +
  theme_bw()
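
The scatter plot shows stability visually; summary numbers make it easier to
compare runs. The sketch below computes two of them, using made-up score
vectors in place of `data2$Classification` and `data2$Classification2`.

```r
# Made-up scores standing in for two classification runs.
run1 <- c(5, 4, 5, 3, 5, 5, 1, 3, 5, 4)
run2 <- c(5, 4, 5, 3, 4, 5, 1, 4, 5, 4)

# Share of reviews with identical scores in both runs.
agreement <- mean(run1 == run2)

# Average size of the disagreement, in scale points.
mean_abs_diff <- mean(abs(run1 - run2))

c(agreement = agreement, mean_abs_diff = mean_abs_diff)
```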

# Working with structured data in ellmer

This code defines an object `type_score`. The object includes a description of
the task and a type definition for an output variable `customer_score`.

In [None]:
type_score <- type_object(
  "Classify the customer review whether it is negative (1) or positive (5).",
  customer_score = type_number("Customer review classification, ranging from 1 to 5.")
)

## Function: Get numeric from OpenAI chat model

This function interacts with the OpenAI chat model to get a response based on
the provided system and user prompts. The `type_score` object is used to force
the LLM to generate a numeric output for the customer score.

### Parameters:

+ **system**: A string representing the system prompt to guide the chat model.
+ **user**: A string representing the user input to be processed by the chat model.

### Returns:

A numeric value representing the customer score extracted from the chat model's response.

### Examples:

```{r}
# system_prompt <- "You are a helpful assistant."
# user_input <- "How can I improve my coding skills?"
# score <- get_response2(system_prompt, user_input)
# print(score)
```

### Define function:

In [None]:
get_response2 <- function(system, user) {
  chat <- chat_openai(
    model = base_model,
    system_prompt = system,
    echo = "none"
  )
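  # extract_data() returns a named list that follows type_score
  # (newer ellmer releases may expose this method as chat_structured()).
  out <- chat$extract_data(user, type = type_score)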
  return(out$customer_score)
}

This code calls the `get_response2` function with two arguments:

+ `system_prompt`: The classification instructions defined above.
+ `data$Customer_Review[[1]]`: The first element of the `Customer_Review`
  column in the `data` dataframe.

The function returns the score for the first review as a number rather than as
text.

In [None]:
get_response2(system_prompt, data$Customer_Review[[1]])

### Force output to be numeric

This line of code creates a new dataframe `data3` by adding a column
`Classification3` to `data1`. The column is generated by applying the
`get_response2` function to each element in the `Customer_Review` column, with
`system_prompt` and the current review as arguments. Because `get_response2`
uses the `type_score` object to structure the output, each call returns a
single number, so we can use the `map_int` function to create a column of
integers. This relies on the model returning whole numbers; `map_dbl` is the
safer choice if fractional scores are possible.

In [None]:
data3 <- mutate(data1, Classification3 = map_int(Customer_Review, ~ get_response2(system_prompt, .)))

data3
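
The `type_number()` description asks for a value from 1 to 5, but it seems
safer to treat that as an instruction to the model than as an enforced
constraint. The sketch below is a post-check under that assumption, where
scores should be whole numbers from 1 to 5. `validate_score` is a hypothetical
helper.

```r
# Hypothetical helper: keep whole-number scores in [lo, hi], else NA.
validate_score <- function(x, lo = 1, hi = 5) {
  ok <- !is.na(x) & x == round(x) & x >= lo & x <= hi
  ifelse(ok, as.integer(x), NA_integer_)
}

# Only the first value is a valid score; the others become NA.
validate_score(c(4, 4.5, 7, NA))
```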

## Function: Get summary output from OpenAI chat model

This code defines an object `type_summary`. The object includes a description
of the task and a type definition for an output variable `issue`.

In [None]:
type_summary <- type_object(
  "Summarize the main issue mentioned in the review.",
  issue = type_string("Main problem mentioned.")
)

This function interacts with the OpenAI chat model to generate a response
based on the provided system and user prompts. The `type_summary` object is
used to ensure the LLM generates a character output summarizing the main issue
mentioned in the review.

### Parameters:

+ `system`: A character string representing the system prompt to guide the
  chat model.
+ `user`: A character string representing the user's input or query.

### Returns:

A character string containing the extracted data from the chat model's
response, specifically the issue summary.

### Examples:

```{r}
# system_prompt <- "You are a helpful assistant."
# user_input <- "What is the weather like today?"
# get_response3(system_prompt, user_input)
```

### Define function:

In [None]:
get_response3 <- function(system, user) {
  chat <- chat_openai(
    model = base_model,
    system_prompt = system,
    echo = "none"
  )
  out <- chat$extract_data(user, type = type_summary)
  return(out$issue)
}

This line of code calls the `get_response3` function with two arguments:

1. `system_prompt`: The system prompt defined above.
2. `data$Customer_Review[[1]]`: The first element of the `Customer_Review`
  column in the `data` dataframe.

The function returns a short text summary of the main issue in the first
review. Note that `system_prompt` still holds the classification instructions
from above; the `type_summary` object is what asks for an issue summary.

In [None]:
get_response3(system_prompt, data$Customer_Review[[1]])

This code filters the `data1` dataframe to include only rows where the
`Classification` column is less than or equal to 3. It then creates a new
column `Issue` in the resulting dataframe by applying the `get_response3`
function to each element in the `Customer_Review` column. The `get_response3`
function takes `system_prompt` and the customer review as arguments. Because
`get_response3` uses the `type_summary` object to structure the output, we
can use the `map_chr` function to create a character column.

In [None]:
data4 <- data1 %>%
  filter(Classification <= 3) %>%
  mutate(Issue = map_chr(Customer_Review, ~ get_response3(system_prompt, .)))

data4
data4$Issue

# Exercises

+ **Modify the classification system prompt**: Change the description in the
  `type_score` object to classify reviews on a scale of 1 to 10 instead of 1 to
  5. Update the `get_response2` function accordingly.
+ **Add a new feature**: Create a new feature in the `data` dataframe that
  indicates whether the review is generally positive or negative. Compare this
  feature to the initial results.
+ **Visualize the results**: Create a visual analysis that compares the results
  from the previous step to your initial results.
+ **Explore different models**: Experiment with different OpenAI models
  (e.g., `gpt-4.1`, `gpt-4.1-mini`, `gpt-4o-mini`) and compare their performance
  in classifying the reviews. Document any differences you observe.
+ **Multi-class classification**: Extend the classification task to include more
  than two classes (e.g., very negative, negative, neutral, positive, very
  positive) and evaluate the model's performance on this multi-class problem.
+ **Ethical considerations**: Discuss the ethical implications of using LLMs for
  sentiment analysis and classification, including potential biases in the model
  and how to mitigate them.
+ **Real-world application**: Discuss real-world applications for text
  classification with LLMs beyond feedback classification. What
  considerations would you need to take into account when applying LLMs in these
  scenarios?
