# Leveraging AI for Accurate PRs: A Guide to Using OpenAI for Title and Description Generation, and Evaluating with LastMile Eval

In this notebook, we will explore how to leverage OpenAI's language model to generate accurate and descriptive titles and descriptions for pull requests (PRs). We will then use the LastMile Eval library to evaluate the quality of the generated content.

## Prerequisites

Before we begin, make sure you have the following libraries installed:

- `requests`: Used for making HTTP requests to the GitHub API.
- `openai`: The OpenAI library for interacting with the OpenAI API.
- `lastmile-eval`: The LastMile Eval library for evaluating the generated content.

You can install these libraries using the following commands:

In [1]:
# !pip install requests
# !pip install openai
# !pip install lastmile-eval

## Step 1: Fetching Pull Request Diffs

We start by defining a function `get_pull_request_diff` that takes a pull request link and fetches the PR's diff by requesting the link with a `.diff` suffix, which GitHub serves as plain text.

In [2]:
import requests

# List of merged pull request URLs
merged_prs = [
    "https://github.com/keras-team/keras/pull/19720",
    "https://github.com/keras-team/keras/pull/19728",
    "https://github.com/keras-team/keras/pull/19729"
]

def get_pull_request_diff(pr_link: str) -> str:
    """Fetches the diff of a pull request by appending `.diff` to its URL."""
    diff_url = f"{pr_link}.diff"

    response = requests.get(diff_url, timeout=30)
    # Fail loudly on HTTP errors instead of returning an error page as the diff
    response.raise_for_status()
    return response.text
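
The function above builds its URL by appending `.diff` to the pull request link. GitHub's REST API can also return a diff when the pull request endpoint is requested with the `Accept: application/vnd.github.diff` header. The sketch below only builds the URL and headers for such a request. `build_diff_request` is a hypothetical helper name, and the optional token is an assumption for private repositories or higher rate limits.

```python
import re
from typing import Dict, Optional, Tuple

# Matches links of the form https://github.com/<owner>/<repo>/pull/<number>
_PR_LINK_RE = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/pull/(\d+)/?$")


def build_diff_request(
    pr_link: str, token: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return the REST API URL and headers that request a PR as a diff.

    Hypothetical helper for illustration; not part of any library.
    """
    match = _PR_LINK_RE.match(pr_link)
    if match is None:
        raise ValueError(f"Not a GitHub pull request URL: {pr_link}")
    owner, repo, number = match.groups()

    url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    # This media type asks GitHub to return the diff instead of JSON
    headers = {"Accept": "application/vnd.github.diff"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return url, headers


url, headers = build_diff_request("https://github.com/keras-team/keras/pull/19720")
```

The result can then be passed to `requests.get(url, headers=headers, timeout=30)`.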

Let's take a look at the diff of the first PR:

In [3]:
pr_diff = get_pull_request_diff(merged_prs[0])
print(pr_diff)

diff --git a/keras/src/export/export_lib.py b/keras/src/export/export_lib.py
index 1157630da0e..e3749c2b33c 100644
--- a/keras/src/export/export_lib.py
+++ b/keras/src/export/export_lib.py
@@ -621,18 +621,17 @@ def export_model(model, filepath):
             input_signature = [input_signature]
         export_archive.add_endpoint("serve", model.__call__, input_signature)
     else:
-        save_spec = _get_save_spec(model)
-        if not save_spec or not model._called:
+        input_signature = _get_input_signature(model)
+        if not input_signature or not model._called:
             raise ValueError(
                 "The model provided has never called. "
                 "It must be called at least once before export."
             )
-        input_signature = [save_spec]
         export_archive.add_endpoint("serve", model.__call__, input_signature)
     export_archive.write_out(filepath)
 
 
-def _get_save_spec(model):
+def _get_input_signature(model):
     shapes_dict = get
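
Diffs for large PRs can exceed the model's context window. One option, sketched here under the assumption that a character budget is a good enough proxy for tokens, is to cut the diff at a line boundary before sending it. `truncate_diff`, the marker text, and the 12,000-character default are hypothetical choices for illustration.

```python
def truncate_diff(diff: str, max_chars: int = 12_000) -> str:
    """Return the diff unchanged if it fits, else cut it at the last
    complete line within the budget and append a marker.

    Hypothetical helper; the budget is an arbitrary illustrative value.
    """
    if len(diff) <= max_chars:
        return diff
    # Find the last newline inside the budget so no line is cut in half
    cut = diff.rfind("\n", 0, max_chars)
    if cut == -1:
        # No newline within the budget: fall back to a hard cut
        cut = max_chars
    return diff[:cut] + "\n[diff truncated]"
```

A marker line like this lets the model know the input is incomplete, although how a given model reacts to it is not guaranteed.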

## Step 2: Generating Pull Request Description

Next, we use OpenAI's language model to generate a description for the pull request based on the diff. Make sure to set your OpenAI API key in the environment variable `OPENAI_API_KEY`.

In [4]:
import openai

client = openai.OpenAI()

# System prompt template for generating PR description
system_prompt_template = (
    "You are a developer who writes amazing code. "
    "You are working on a project and you need to generate a pull request description for a pull request diff. "
    "When given a diff, generate a description for the pull request. Say only the description with formatting"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt_template},
        {"role": "user", "content": pr_diff}
    ]
)

generated_pr_description = response.choices[0].message.content 
print(generated_pr_description)

- Updated `export_lib.py` to use `_get_input_signature` instead of `_get_save_spec`.
- Introduced `_get_input_signature` to handle models with multiple inputs by returning a list of input signatures.
- Added a new test in `export_lib_test.py` to validate the model with multiple inputs.
- Successfully tested the model with different batch sizes to ensure correctness.
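
The cells above process only the first entry of `merged_prs`. A sketch of extending the same steps to every PR in the list follows. `describe_pull_requests` is a hypothetical helper, and the fetch and generation steps are passed in as callables so the loop can be exercised without network access or an API key.

```python
from typing import Callable, Dict, Iterable


def describe_pull_requests(
    pr_links: Iterable[str],
    fetch_diff: Callable[[str], str],
    generate_description: Callable[[str], str],
) -> Dict[str, str]:
    """Map each PR link to a description generated from its diff."""
    descriptions: Dict[str, str] = {}
    for link in pr_links:
        diff = fetch_diff(link)
        descriptions[link] = generate_description(diff)
    return descriptions


# Stand-in callables for illustration only. In the notebook these would be
# get_pull_request_diff and a function wrapping the chat completion call.
def fake_fetch(link: str) -> str:
    return f"diff for {link}"


def fake_generate(diff: str) -> str:
    return f"Summary of: {diff}"


result = describe_pull_requests(["pr-1", "pr-2"], fake_fetch, fake_generate)
```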


## Step 3: Generating Pull Request Title

We can also generate a concise and descriptive title for the pull request based on the generated description.

In [5]:
# System prompt template for generating PR title
system_prompt_template = (
    "You are a developer who writes amazing code. "
    "You are working on a project and you need to generate a pull request title from a pull request description. "
    "When given a description, generate a title for the pull request. Say only the title"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt_template},
        {"role": "user", "content": generated_pr_description}
    ]
)

generated_pr_title = response.choices[0].message.content 
print(generated_pr_title)

Refactor export_lib to use _get_input_signature for multiple inputs
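
Chat models sometimes wrap a title in quotation marks or return more than one line, so a light post-processing step can help before the title is used. A sketch, where `clean_title` is a hypothetical helper and the 72-character limit is an assumption rather than a GitHub requirement:

```python
def clean_title(raw_title: str, max_length: int = 72) -> str:
    """Normalize a model-generated title to a single, bounded line.

    Hypothetical helper; max_length is an illustrative default.
    """
    # Keep only the first non-empty line of the response
    lines = [line.strip() for line in raw_title.splitlines() if line.strip()]
    title = lines[0] if lines else ""

    # Remove one pair of matching surrounding quotes, if present
    if len(title) >= 2 and title[0] == title[-1] and title[0] in "\"'":
        title = title[1:-1].strip()

    # Shorten overly long titles and mark the cut with an ellipsis
    if len(title) > max_length:
        title = title[: max_length - 3].rstrip() + "..."
    return title
```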


## Step 4: Evaluating Generated Content

Finally, we use the LastMile Eval library to evaluate the quality of the generated description and title. The `calculate_summarization_score` function asks GPT-3.5 to generate a list of float scores indicating the summary quality of each input-reference pair, where 1.0 denotes 'good' and 0.0 denotes otherwise.

In [6]:
from lastmile_eval.text import calculate_summarization_score

description_score = calculate_summarization_score(
    [generated_pr_description],
    [pr_diff],
    model_name="gpt-3.5-turbo"
)

title_score = calculate_summarization_score(
    [generated_pr_title],
    [generated_pr_description],
    model_name="gpt-3.5-turbo"
)

print(f"Description score: {description_score}")
print(f"Title score: {title_score}")

🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s

🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s

Description score: [1.0]
Title score: [1.0]


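A score of 1.0 for each means the evaluator model rated the generated description as a good summary of the diff, and the generated title as a good summary of the description. Because the score comes from a single LLM judgment on one PR, treat it as a quick sanity check rather than a thorough measure of quality.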
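
When the same evaluation is run over several PRs, the per-item scores can be condensed into a short summary. A minimal sketch, assuming the scores are a plain list of floats like the ones printed above; `summarize_scores` and the 0.5 threshold are hypothetical choices for illustration.

```python
from statistics import mean
from typing import Dict, List, Union


def summarize_scores(
    scores: List[float], threshold: float = 0.5
) -> Dict[str, Union[float, List[int]]]:
    """Return the mean score and the indices of items scoring below threshold.

    Hypothetical helper; the threshold is an illustrative default.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    failing = [i for i, score in enumerate(scores) if score < threshold]
    return {"mean": mean(scores), "failing_indices": failing}


summary = summarize_scores([1.0, 0.0])
```

The failing indices point back to the PRs whose generated text is worth inspecting by hand.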

That's it! You now have a notebook that demonstrates how to generate accurate and descriptive pull request titles and descriptions using OpenAI, and evaluate their quality using LastMile Eval.

## Creating Test Sets and Evaluation Runs with lastmile-eval

In [None]:
import pandas as pd
from lastmile_eval.rag.debugger.api.evaluation import create_input_set
from lastmile_eval.rag.debugger.api.evaluation import run_and_store_evaluations
from lastmile_eval.text import calculate_summarization_score


test_set_id = create_input_set(
    [generated_pr_description, generated_pr_title],
    "pr_generator",
    [pr_diff, generated_pr_description],
).ids[0]

def wrap_summarize(df: pd.DataFrame) -> list[float]:
    def helper(row) -> float:
        return calculate_summarization_score(
            [row["query"]],
            [row["groundTruth"]],
            model_name="gpt-3.5-turbo"
        )[0]
    
    return df.apply(helper, axis=1)

run_and_store_evaluations(
    test_set_id,
    project_id=None,
    trace_level_evaluators={"summarize": wrap_summarize},
    dataset_level_evaluators={},
    evaluation_set_name="summarization",
)

Open the Debugger UI to see the generated evaluation scores.

<img width="959" alt="Screenshot 2024-05-20 at 5 34 01 PM" src="https://github-production-user-asset-6210df.s3.amazonaws.com/141073967/332206801-b5fdadc0-88e5-4602-b84f-e1997283d81f.png?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAVCODYLSA53PQK4ZA%2F20240520%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20240520T213720Z&X-Amz-Expires=300&X-Amz-Signature=bd2acd3267b56463c279d159c2784a27b176e2386760c1ae86476b4e5a16e842&X-Amz-SignedHeaders=host&actor_id=141073967&key_id=0&repo_id=768880246">


In [None]:
!rag-debug launch