# Evaluate LLM-generated PR Titles and Descriptions

In this notebook, we'll demonstrate how to use `gpt-3.5-turbo` to generate accurate titles and descriptions for pull requests (PRs) and evaluate the generated content using the **LastMile Eval library**.

## Notebook Outline
* [Step 1: Install and Setup](#install)
* [Step 2: Fetch Pull Request](#fetch)
* [Step 3: Generate PR Description](#pr_desc)
* [Step 4: Generate PR Title](#pr_title)
* [Step 5: Evaluate LLM-generated Content](#evaluate)
* [Step 6: View Evaluation Results](#view)

<a name="install"></a>

## Step 1: Install and Setup

Before we begin, we need to install the following packages:

- `requests`: Used for making HTTP requests to the GitHub API.
- `openai`: The OpenAI library for interacting with the OpenAI API.
- `lastmile-eval`: The LastMile Eval library for evaluating the generated content.

In [None]:
!pip install requests
!pip install openai
!pip install lastmile-eval --upgrade


We also need the following tokens/keys:

* **LastMile AI API Token:** Create a LastMile AI account, then generate a token on the [LastMile Settings page](https://lastmileai.dev/settings?page=tokens).
* **OpenAI API Key:** Go to the [OpenAI API Keys page](https://platform.openai.com/account/api-keys) to create and access your OpenAI API key.

We're using Google Colab's Secret Manager to set our tokens in this notebook.

In [None]:
import os
from google.colab import userdata

os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API_KEY')
os.environ['LASTMILE_API_TOKEN'] = userdata.get('LASTMILE_API_TOKEN')
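
The cell above only works inside Google Colab. If you run the notebook elsewhere, something like the following sketch could load the same secrets. The helper name `load_secret` is hypothetical and not part of any library used here. It assumes you are happy to fall back to an interactive prompt when a value is missing.

```python
import os
from getpass import getpass


def load_secret(name: str) -> str:
    """Return a secret from the environment, Colab's secret manager, or a prompt.

    `load_secret` is a hypothetical helper written for this notebook.
    """
    # 1. Prefer a value that is already exported in the environment.
    value = os.environ.get(name)
    if value:
        return value

    # 2. Inside Colab, read from the Secret Manager; elsewhere the import fails.
    try:
        from google.colab import userdata
        value = userdata.get(name)
    except ImportError:
        value = None

    # 3. Otherwise ask interactively without echoing the input.
    if not value:
        value = getpass(f"Enter {name}: ")

    os.environ[name] = value
    return value
```

Calling `load_secret('OPENAI_API_KEY')` and `load_secret('LASTMILE_API_TOKEN')` would then set both environment variables in any of the three situations.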

<a name="fetch"></a>
## Step 2: Fetch Pull Request

First, define a function `get_pull_request_diff` that takes a pull request link and fetches the diff by requesting the same URL with a `.diff` suffix, which GitHub serves as plain text.

In [None]:
import requests

# List of merged pull request URLs
merged_prs = [
    "https://github.com/keras-team/keras/pull/19720",
    "https://github.com/keras-team/keras/pull/19728",
    "https://github.com/keras-team/keras/pull/19729"
]

def get_pull_request_diff(pr_link: str) -> str:
    """Fetches the diff of a pull request by requesting its `.diff` URL."""
    diff_url = f"{pr_link}.diff"

    # Fail fast on network stalls and on HTTP errors (e.g. 404 for a bad link)
    response = requests.get(diff_url, timeout=30)
    response.raise_for_status()
    return response.text
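
The function above fetches the `.diff` variant of the PR web page. An alternative is to request the diff from the GitHub REST API with the `application/vnd.github.diff` media type. The sketch below shows one way to do that using only the standard library. The names `pr_url_to_api_url` and `get_pull_request_diff_via_api` are hypothetical, and the sketch assumes a public repository, so no authentication token is sent.

```python
import re
import urllib.request

# Matches URLs such as https://github.com/<owner>/<repo>/pull/<number>
_PR_URL = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/pull/(\d+)/?$")


def pr_url_to_api_url(pr_link: str) -> str:
    """Translate a PR web URL into the matching REST API URL."""
    match = _PR_URL.match(pr_link)
    if match is None:
        raise ValueError(f"Not a GitHub pull request URL: {pr_link}")
    owner, repo, number = match.groups()
    return f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"


def get_pull_request_diff_via_api(pr_link: str, timeout: float = 30.0) -> str:
    """Fetch a PR diff through the REST API (unauthenticated, public repos)."""
    request = urllib.request.Request(
        pr_url_to_api_url(pr_link),
        # Ask for the diff representation instead of the default JSON body.
        headers={"Accept": "application/vnd.github.diff"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Unauthenticated API requests are rate limited, so the simpler `.diff` URL is a reasonable choice for a handful of PRs.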

Let's take a look at the diff of the first PR:

In [None]:
pr_diff = get_pull_request_diff(merged_prs[0])
print(pr_diff)

diff --git a/keras/src/export/export_lib.py b/keras/src/export/export_lib.py
index 1157630da0e..e3749c2b33c 100644
--- a/keras/src/export/export_lib.py
+++ b/keras/src/export/export_lib.py
@@ -621,18 +621,17 @@ def export_model(model, filepath):
             input_signature = [input_signature]
         export_archive.add_endpoint("serve", model.__call__, input_signature)
     else:
-        save_spec = _get_save_spec(model)
-        if not save_spec or not model._called:
+        input_signature = _get_input_signature(model)
+        if not input_signature or not model._called:
             raise ValueError(
                 "The model provided has never called. "
                 "It must be called at least once before export."
             )
-        input_signature = [save_spec]
         export_archive.add_endpoint("serve", model.__call__, input_signature)
     export_archive.write_out(filepath)
 
 
-def _get_save_spec(model):
+def _get_input_signature(model):
     shapes_dict = get

<a name="pr_desc"></a>

## Step 3: Generate Pull Request Description

Next, we use `gpt-3.5-turbo` to generate a description for the pull request based on the diff. Make sure to set your OpenAI API key in the environment variable `OPENAI_API_KEY`.

In [None]:
import openai

client = openai.OpenAI()

# System prompt template for generating PR description
system_prompt_template = (
    "You are a developer who writes amazing code. "
    "You are working on a project and you need to generate a pull request description for a pull request diff. "
    "When given a diff, generate a description for the pull request. Say only the description with formatting"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt_template},
        {"role": "user", "content": pr_diff}
    ]
)

generated_pr_description = response.choices[0].message.content
print(generated_pr_description)

- Updated `_get_save_spec` function to `_get_input_signature` for clarity
- Refactored logic in `_get_input_signature` to handle multiple input shapes
- Added a new test case for a model with multiple inputs in `export_lib_test.py`
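
Large diffs can exceed the model's context window, in which case the request fails. A minimal sketch of one mitigation is to cut the diff at a line boundary before sending it. The helper `truncate_diff` and the 12,000-character default are hypothetical choices for illustration, not values taken from this notebook or from the OpenAI documentation.

```python
def truncate_diff(diff: str, max_chars: int = 12000) -> str:
    """Keep whole lines from the start of `diff` until `max_chars` is reached.

    Hypothetical helper; the character budget is an arbitrary illustration.
    """
    if len(diff) <= max_chars:
        return diff

    kept_lines = []
    used = 0
    for line in diff.splitlines(keepends=True):
        # Stop before the first line that would exceed the budget.
        if used + len(line) > max_chars:
            break
        kept_lines.append(line)
        used += len(line)

    # Mark the cut so the model knows the diff is incomplete.
    return "".join(kept_lines) + "\n[diff truncated]\n"
```

You could then pass `truncate_diff(pr_diff)` as the user message instead of `pr_diff`. Characters are only a rough proxy for tokens, so a tokenizer-based budget would be more precise.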


<a name="pr_title"></a>

## Step 4: Generate Pull Request Title

We can also generate a concise and descriptive title for the pull request based on the generated description.

In [None]:
# System prompt template for generating PR title
system_prompt_template = (
    "You are a developer who writes amazing code. "
    "You are working on a project and you need to generate a pull request title from a pull request description. "
    "When given a description, generate a title for the pull request. Say only the title"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt_template},
        {"role": "user", "content": generated_pr_description}
    ]
)

generated_pr_title = response.choices[0].message.content
print(generated_pr_title)

Refactor input signature handling and add test for multiple input shapes


<a name="evaluate"></a>

## Step 5: Evaluate LLM-generated Content

Finally, we use the **LastMile Eval library** to evaluate the quality of the generated description and title.

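The `calculate_summarization_score` function asks `gpt-3.5-turbo` to judge the summary quality of each input-reference pair. It returns a list of float scores, one per pair, where 1.0 denotes 'good' and 0.0 denotes otherwise. Here, each pair matches a generated text with the text it summarizes: the description with the diff, and the title with the description.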

In [None]:
import pandas as pd
from lastmile_eval.rag.debugger.api.evaluation import (
    create_input_set,
    run_and_store_evaluations,
)
from lastmile_eval.text import calculate_summarization_score

# Create an input set pairing each generated text with the text it summarizes:
# the description with the diff, and the title with the description
input_set_id = create_input_set(
    [generated_pr_description, generated_pr_title],
    "pr_generator",
    [pr_diff, generated_pr_description],
).ids[0]

# Define Summarization Evaluator
def wrap_summarize(df: pd.DataFrame) -> pd.Series:
    def helper(row) -> float:
        return calculate_summarization_score(
            [row["query"]],
            [row["groundTruth"]],
            model_name="gpt-3.5-turbo"
        )[0]

    return df.apply(helper, axis=1)

# Evaluate LLM-generated responses with Summarization Evaluator
run_and_store_evaluations(
    input_set_id,
    project_id=None,
    trace_level_evaluators={"summarize": wrap_summarize},
    dataset_level_evaluators={},
    evaluation_set_name="summarization",
)

CreateEvaluationsResult(success=True, message='{"id":"clwlahpsb00w6qync5y4spkub","createdAt":"2024-05-24T23:04:53.194Z","updatedAt":"2024-05-24T23:04:53.194Z","name":"summarization","paramSet":null,"testSetId":"clwlahj2n00gipet3uod502vp","creatorId":"clh5ugdjv001kpm11cbg1ssmu","projectId":null,"organizationId":null,"visibility":"MEMBER","metadata":null,"active":true}', df_metrics_trace=                   testSetId                 testCaseId metricName  value
0  clwlahj2n00gipet3uod502vp  clwlahj2x00gkpet3lec6i3sh  summarize    1.0
1  clwlahj2n00gipet3uod502vp  clwlahj2x00glpet3l1j9a4m0  summarize    1.0, df_metrics_dataset=                   testSetId       metricName  value
0  clwlahj2n00gipet3uod502vp   summarize_mean    1.0
0  clwlahj2n00gipet3uod502vp    summarize_std    0.0
0  clwlahj2n00gipet3uod502vp  summarize_count    2.0)
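
The result above reports `summarize_mean`, `summarize_std`, and `summarize_count` alongside the per-trace scores. The sketch below shows how such dataset-level figures could be derived from trace-level scores. It is an illustration, not the library's implementation: `summarize_scores` is a hypothetical helper, and the use of the population standard deviation is an assumption, since the output does not say whether the library uses the population or the sample formula.

```python
from statistics import mean, pstdev


def summarize_scores(scores: list[float]) -> dict[str, float]:
    """Aggregate per-trace scores into mean, standard deviation, and count.

    Hypothetical helper; assumes the population standard deviation.
    """
    return {
        "summarize_mean": mean(scores),
        "summarize_std": pstdev(scores),
        "summarize_count": float(len(scores)),
    }


# The two trace-level scores from the run above (description and title).
aggregated = summarize_scores([1.0, 1.0])
print(aggregated)
```

With both test cases scoring 1.0, the mean is 1.0, the standard deviation is 0.0, and the count is 2.0, which matches the reported values.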

<a name="view"></a>

## Step 6: View Evaluation Results

Now you can view the evaluation results in the RAG Debugger UI.

Run this CLI command to access the UI:

`rag-debug launch`

The 'Evaluation Console' is the landing page of RAG Debugger. Here you can see all your Evaluation Sets (including the one we just made):

<img width="973" alt="Screenshot 2024-05-24 at 7 07 37 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/3d49b64b-6263-4345-ad37-b5ce3c696a18"/>


Click 'Evaluation Set' to dig deeper into the results.

<img width="973" alt="Screenshot 2024-05-24 at 7 07 46 PM" src="https://github.com/lastmile-ai/aiconfig/assets/81494782/c1310c4b-d2a1-4dd8-b365-2c418214cd4e"/>

You can see the summarization score for each test case (the generated PR description and the generated PR title) alongside the LLM-generated output.

You can also inspect the traces for each test case on the 'Trace' page.

