# Leveraging AI for Accurate PRs: A Guide to Using OpenAI for Title and Description Generation, and Evaluating with LastMile Eval

In this notebook, we will explore how to leverage OpenAI's language model to generate accurate and descriptive titles and descriptions for pull requests (PRs). We will then use the LastMile Eval library to evaluate the quality of the generated content.

## Prerequisites

Before we begin, make sure you have the following libraries installed:

- `requests`: Used for making HTTP requests to GitHub to download pull request diffs.
- `openai`: The OpenAI library for interacting with the OpenAI API.
- `lastmile-eval`: The LastMile Eval library for evaluating the generated content.

You can install these libraries using the following commands:

In [None]:
!pip install requests
!pip install openai
!pip install lastmile-eval

## Step 1: Fetching Pull Request Diffs

We start by defining a function `get_pull_request_diff` that takes a pull request link and downloads the diff of the pull request. GitHub serves the raw diff of a public PR when `.diff` is appended to the PR's URL, so no API token is needed here.

In [8]:
import requests

merged_prs = [
    "https://github.com/keras-team/keras/pull/19720",
    "https://github.com/keras-team/keras/pull/19728",
    "https://github.com/keras-team/keras/pull/19729"
]

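def get_pull_request_diff(pr_link: str) -> str:
    """Return the diff text for a GitHub pull request URL."""
    # Appending ".diff" to a PR's web URL makes GitHub serve the raw diff
    diff_url = f"{pr_link}.diff"

    response = requests.get(diff_url, timeout=30)
    # Fail on HTTP errors instead of passing an error page on as a diff
    response.raise_for_status()
    return response.text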
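
If you would rather fetch the diff through the GitHub REST API (for example, to send a token for a private repository), the request can be built from the same PR link. The sketch below assumes the REST API's pull request endpoint and its `application/vnd.github.diff` media type; `build_diff_api_request` and its `token` parameter are hypothetical names, not part of the notebook's own code.

```python
import re
from typing import Dict, Optional, Tuple

# Matches web URLs of the form https://github.com/<owner>/<repo>/pull/<number>
_PR_URL_PATTERN = re.compile(r"^https://github\.com/([^/]+)/([^/]+)/pull/(\d+)/?$")


def build_diff_api_request(
    pr_link: str, token: Optional[str] = None
) -> Tuple[str, Dict[str, str]]:
    """Return (api_url, headers) for requesting a PR diff via the REST API.

    Hypothetical helper: it only builds the request and does no network I/O.
    """
    match = _PR_URL_PATTERN.match(pr_link)
    if match is None:
        raise ValueError(f"Not a GitHub pull request URL: {pr_link}")
    owner, repo, number = match.groups()

    api_url = f"https://api.github.com/repos/{owner}/{repo}/pulls/{number}"
    # This media type asks the API for the diff instead of the JSON representation
    headers = {"Accept": "application/vnd.github.diff"}
    if token:
        # Assumption: a personal access token sent as a bearer token
        headers["Authorization"] = f"Bearer {token}"
    return api_url, headers
```

The returned pair can be passed to `requests.get(url, headers=headers, timeout=30)`, after which `response.text` should hold the diff.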

Let's take a look at the diff of the first PR:

In [15]:
pr_diff = get_pull_request_diff(merged_prs[0])
print(pr_diff)

diff --git a/keras/src/export/export_lib.py b/keras/src/export/export_lib.py
index 1157630da0e..e3749c2b33c 100644
--- a/keras/src/export/export_lib.py
+++ b/keras/src/export/export_lib.py
@@ -621,18 +621,17 @@ def export_model(model, filepath):
...


## Step 2: Generating Pull Request Description

Next, we use OpenAI's language model to generate a description for the pull request based on the diff. Make sure to set your OpenAI API key in the environment variable `OPENAI_API_KEY`.

In [25]:
import openai

client = openai.OpenAI()
system_prompt_template = "You are a developer who writes amazing code. You are working on a project and you need to generate a pull request description for a pull request diff. When given a diff, generate a description for the pull request. Say only the description with formatting"
response = client.chat.completions.create(model="gpt-3.5-turbo", messages=[{"role": "system", "content": system_prompt_template}, {"role": "user", "content": pr_diff}])

generated_pr_description = response.choices[0].message.content
print(generated_pr_description)

- Refactored `_get_save_spec` to `_get_input_signature` in `export_lib.py`
- Updated the function to return a list comprehension for input signatures based on shapes dictionary values
- Added a new test `test_model_with_multiple_inputs` in `export_lib_test.py` to test the model with multiple inputs and batch sizes
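
To reuse this step for other PRs, the call can be wrapped in a function. The following is a minimal sketch rather than part of the original notebook: `generate_pr_description` and `max_diff_chars` are hypothetical names, and the character cap is an assumed, rough guard against diffs too large for the model's context window (it counts characters, not tokens).

```python
from typing import Any

DESCRIPTION_SYSTEM_PROMPT = (
    "You are a developer who writes amazing code. You are working on a project "
    "and you need to generate a pull request description for a pull request diff. "
    "When given a diff, generate a description for the pull request. "
    "Say only the description with formatting"
)


def generate_pr_description(
    client: Any,
    pr_diff: str,
    model: str = "gpt-3.5-turbo",
    max_diff_chars: int = 40_000,  # assumed budget, not a documented limit
) -> str:
    """Ask the chat model for a PR description based on a diff."""
    # Cut very large diffs and mark the cut so the model knows input is partial
    if len(pr_diff) > max_diff_chars:
        pr_diff = pr_diff[:max_diff_chars] + "\n[diff truncated]"

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": DESCRIPTION_SYSTEM_PROMPT},
            {"role": "user", "content": pr_diff},
        ],
    )
    return response.choices[0].message.content
```

With the `client` created above, `generate_pr_description(client, pr_diff)` would stand in for the inline call. Taking the client as a parameter also makes the function easy to exercise with a stub in tests.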


## Step 3: Generating Pull Request Title

We can also generate a concise and descriptive title for the pull request based on the generated description.

In [28]:
system_prompt_template = "You are a developer who writes amazing code. You are working on a project and you need to generate a pull request title from a pull request description. When given a description, generate a title for the pull request. Say only the title"
response = client.chat.completions.create(model="gpt-3.5-turbo", messages=[{"role": "system", "content": system_prompt_template}, {"role": "user", "content": generated_pr_description}])
generated_pr_title = response.choices[0].message.content
print(generated_pr_title)

Refactor input signature naming convention and add test for multiple inputs and batch sizes
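
Even when told to say only the title, a model may wrap it in quotes, add extra lines, or run long; the title generated above is longer than many projects prefer. A small post-processing sketch follows. `normalize_pr_title` is a hypothetical helper, and the 72-character default is an assumed convention, not a GitHub limit.

```python
def normalize_pr_title(raw_title: str, max_length: int = 72) -> str:
    """Reduce model output to a single, unquoted, length-limited title line."""
    # Keep only the first non-empty line of the model's reply
    lines = [line.strip() for line in raw_title.splitlines() if line.strip()]
    if not lines:
        return ""

    # Drop quotes or backticks the model may have wrapped around the title
    title = lines[0].strip("\"'`").strip()

    # Shorten overly long titles, marking the cut with an ellipsis
    if len(title) > max_length:
        title = title[: max_length - 3].rstrip() + "..."
    return title
```

Calling `normalize_pr_title(generated_pr_title)` before using the title keeps downstream steps from depending on how closely the model followed the prompt.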


## Step 4: Evaluating Generated Content

Finally, we use the LastMile Eval library to evaluate the quality of the generated description and title. The `calculate_summarization_score` function asks GPT-3.5 to judge each pair of texts it is given and returns a list of float scores, one per pair, where 1.0 denotes a 'good' summary and 0.0 denotes otherwise. Here the description is scored against the diff, and the title is scored against the generated description.

In [35]:
from lastmile_eval.text import calculate_summarization_score

description_score = calculate_summarization_score([generated_pr_description], [pr_diff], model_name="gpt-3.5-turbo")
title_score = calculate_summarization_score([generated_pr_title], [generated_pr_description], model_name="gpt-3.5-turbo")

print(f"Description score: {description_score}")
print(f"Title score: {title_score}")

🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:01<00:00 |  1.55s/it
llm_classify |██████████| 1/1 (100.0%) | ⏳ 00:00<00:00 |  1.25it/s

Description score: [1.0]
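Title score: [1.0]

A score of 1.0 for each means the evaluator judged the description to be a good summary of the pull request diff, and the title to be a good summary of the description. The title is scored against the generated description rather than the diff, so its score reflects the diff only indirectly.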
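
Because `calculate_summarization_score` returns one score per pair, scoring several PRs at once (for instance, all three links in `merged_prs`) produces longer lists that are easier to read in aggregate. The sketch below uses a hypothetical helper, `summarize_scores`, and assumes each score is either 1.0 or 0.0 as described above.

```python
from typing import Dict, Sequence, Union


def summarize_scores(scores: Sequence[float]) -> Dict[str, Union[int, float]]:
    """Aggregate a list of 1.0/0.0 summarization scores (hypothetical helper)."""
    if not scores:
        raise ValueError("scores must not be empty")

    # Count the pairs the evaluator labelled as good summaries
    good = sum(1 for score in scores if score == 1.0)
    return {
        "count": len(scores),
        "good": good,
        "fraction_good": good / len(scores),
    }
```

For example, `summarize_scores(description_score)` summarizes the description scores, and the same call works on `title_score`.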

That's it! You now have a notebook that demonstrates how to generate accurate and descriptive pull request titles and descriptions using OpenAI, and evaluate their quality using LastMile Eval.