### Table of Content
- [Load dataset](#load-dataset)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Setup LLM](#Setup-LLM)
- [Analyze repo issues](#Analyze-repo-issues)
- [Retrieve relevant codes](#Retrieve-relevant-codes)
- [Generate patches](#Generate-patches)
- [Validate patches](#Validate-patches)

In [1]:
import os
import pandas as pd

# Define paths
comp_dir = "konwinski-prize"
comp_kaggle_evaluation_dir = os.path.join(comp_dir, "kaggle_evaluation")
comp_kprize_setup_dir = os.path.join(comp_dir, "kprize_setup")

comp_data_zip_path = os.path.join(comp_dir, "data.a_zip")
comp_data_dir = os.path.join(comp_dir, "data")
comp_data_parquet_path = os.path.join(comp_data_dir, "data.parquet")
comp_conda_packages_dir = os.path.join(comp_data_dir, "conda_packages")
comp_pip_packages_dir = os.path.join(comp_data_dir, "pip_packages")
comp_repo_configs_dir = os.path.join(comp_data_dir, "repo_configs")
comp_repos_dir = os.path.join(comp_data_dir, "repos")

### Load dataset

From the competition readme and our earlier investigation we know that the dataframe contains the following:

**instance_id (string)**
- Unique string identifier for each instance (GitHub issue)

**repo (string)**
- The GitHub repository relevant to the issue
- Also accessible through the evaluation API

**problem_statement (string)**
- Textual description of the issue
- Also accessible through the evaluation API

**patch (string)**
- The patch that resolves the issue
- Only provided in the train set

**test_patch (string)**
- The patch that resolves the issue
- Only provided in the train set

**pull_number (int)**
- The pull request number that resolved the issue

**base_commit (string)**
- The commit used as the foundation for the provided repository copy

**issue_numbers (int)**
- The original ID number of the GitHub issue

**[PASS_TO_PASS/FAIL_TO_PASS] (list)**
- Lists containing unit tests to be executed for this issue


In [2]:
# Load dataset
kprize_df = pd.read_parquet(comp_data_parquet_path)
kprize_df

Unnamed: 0,instance_id,repo,problem_statement,patch,test_patch,pull_number,base_commit,PASS_TO_PASS,FAIL_TO_PASS,issue_numbers
0,pylint-dev__astroid-2496,pylint-dev/astroid,TypeError: unsupported format string passed to...,diff --git a/ChangeLog b/ChangeLog\nindex 4560...,diff --git a/tests/test_inference.py b/tests/t...,2496,8d3cdbbe6685fd8cf211816bec56c90f38f1859e,[tests/test_inference.py::InferenceUtilsTest::...,[tests/test_inference.py::test_formatted_fstri...,[2492]
1,pylint-dev__astroid-2468,pylint-dev/astroid,Pylint checks against incorrect type with prop...,diff --git a/ChangeLog b/ChangeLog\nindex fdbb...,diff --git a/tests/test_inference.py b/tests/t...,2468,6db3a60553ff538a936d5dda23d67a3924a57f45,[tests/test_inference.py::InferenceUtilsTest::...,[tests/test_inference.py::InferenceTest::test_...,[2467]
2,astropy__astropy-17048,astropy/astropy,QTable cannot take `dimensionless_unscaled` wh...,diff --git a/astropy/table/table.py b/astropy/...,diff --git a/astropy/table/tests/test_table.py...,17048,d60f6b72cd525262bfd179331d9fe4474177918f,[astropy/table/tests/test_table.py::TestSetTab...,[astropy/table/tests/test_table.py::test_qtabl...,[17047]
3,astropy__astropy-16898,astropy/astropy,BUG: tables do not deal well with zero-sized s...,diff --git a/astropy/io/registry/core.py b/ast...,diff --git a/astropy/io/fits/tests/test_connec...,16898,ee6d087baf301c1d08db92e6e5b6d909d57e6fac,[astropy/io/fits/tests/test_connect.py::TestSi...,[astropy/io/fits/tests/test_connect.py::test_z...,[16897]
4,astropy__astropy-16830,astropy/astropy,KeyError: 'version_1_3_or_later' when parsing ...,diff --git a/astropy/io/votable/tree.py b/astr...,diff --git a/astropy/io/votable/tests/test_tre...,16830,e39f486fec48d87aa3677326167954370d7a7bf9,[astropy/io/votable/tests/test_tree.py::test_c...,[astropy/io/votable/tests/test_tree.py::test_v...,"[16825, 16826]"
5,astropy__astropy-16812,astropy/astropy,Provide a way to make a copy of a model with d...,diff --git a/astropy/modeling/core.py b/astrop...,diff --git a/astropy/modeling/tests/test_core....,16812,c241103c11954d3c1cfe3c1840b1ece72479c522,[astropy/modeling/tests/test_core.py::test_Mod...,[astropy/modeling/tests/test_core.py::test_res...,[16593]


### Exploratory Data Analysis

In [3]:
from rich import print as rprint

In [4]:
rprint(f"{kprize_df.shape=}\n")

In [5]:
rprint("\nAny Missing Values?\n")
rprint(kprize_df.isnull().sum())

In [6]:
rprint("\nDatatypes?\n")
kprize_df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   instance_id        6 non-null      object
 1   repo               6 non-null      object
 2   problem_statement  6 non-null      object
 3   patch              6 non-null      object
 4   test_patch         6 non-null      object
 5   pull_number        6 non-null      int64 
 6   base_commit        6 non-null      object
 7   PASS_TO_PASS       6 non-null      object
 8   FAIL_TO_PASS       6 non-null      object
 9   issue_numbers      6 non-null      object
dtypes: int64(1), object(9)
memory usage: 608.0+ bytes


In [7]:
rprint("\nRepo Distribution?\n")
rprint(kprize_df['repo'].value_counts())

In [8]:
# Fixes can reference more than one GitHub issue.
rprint("\nNumber of Issues per PR\n")
kprize_df["issue_numbers"].apply(len).value_counts()

1    5
2    1
Name: issue_numbers, dtype: int64

In [9]:
kprize_df['problem_statement_length'] = kprize_df['problem_statement'].apply(lambda x: len(x.split()))
rprint("\nProblem Statement Lengths\n")
display(kprize_df['problem_statement_length'].describe())

count      6.000000
mean     297.166667
std      162.149828
min       72.000000
25%      171.750000
50%      358.000000
75%      407.000000
max      462.000000
Name: problem_statement_length, dtype: float64

In [10]:
rprint("\nPatch Lengths\n")
kprize_df['patch_length'] = kprize_df['patch'].apply(lambda x: len(x))
kprize_df['test_patch_length'] = kprize_df['test_patch'].apply(lambda x: len(x))
display(kprize_df[['patch_length', 'test_patch_length']].describe())

Unnamed: 0,patch_length,test_patch_length
count,6.0,6.0
mean,2337.833333,2255.833333
std,1380.618328,629.710542
min,912.0,1339.0
25%,1382.25,2069.5
50%,2195.0,2214.5
75%,2723.5,2405.25
max,4714.0,3277.0


In [11]:
rprint("\nTest Counts\n")
kprize_df['PASS_TO_PASS_count'] = kprize_df['PASS_TO_PASS'].apply(len)
kprize_df['PASS_TO_PASS_count'] = kprize_df['FAIL_TO_PASS'].apply(len)
display(kprize_df[['PASS_TO_PASS', 'FAIL_TO_PASS', 'PASS_TO_PASS_count', 'PASS_TO_PASS_count']])

Unnamed: 0,PASS_TO_PASS,FAIL_TO_PASS,PASS_TO_PASS_count,PASS_TO_PASS_count.1
0,[tests/test_inference.py::InferenceUtilsTest::...,[tests/test_inference.py::test_formatted_fstri...,2,2
1,[tests/test_inference.py::InferenceUtilsTest::...,[tests/test_inference.py::InferenceTest::test_...,3,3
2,[astropy/table/tests/test_table.py::TestSetTab...,[astropy/table/tests/test_table.py::test_qtabl...,3,3
3,[astropy/io/fits/tests/test_connect.py::TestSi...,[astropy/io/fits/tests/test_connect.py::test_z...,2,2
4,[astropy/io/votable/tests/test_tree.py::test_c...,[astropy/io/votable/tests/test_tree.py::test_v...,1,1
5,[astropy/modeling/tests/test_core.py::test_Mod...,[astropy/modeling/tests/test_core.py::test_res...,2,2


Look at an example problem from the kprize data

In [12]:
import pathlib
from rich.filesize import decimal
from rich.markup import escape
from rich.text import Text
from rich.tree import Tree
from typing import Iterable, Optional


default_ignore_list: Iterable[str] = ".ipynb_checkpoints", ".DS_Store", ".git", ".idea", ".coverage", ".pytest_cache"


def get_directory_tree(
        directory: pathlib.Path,
        tree: Optional[Tree] = None,
        show_hidden: bool = False,
        inplace: bool = False,
        ignore_list_extras: Iterable[str] = (),
) -> Optional[Tree]:
    """Recursively build a Tree with directory contents.

    Args:
        directory (pathlib.Path): The directory to walk.
        tree (Tree, optional):
            The Tree object to build.
            If not provided, a new Tree is created.
        show_hidden (bool, optional):
            Whether to show hidden files.
        inplace (bool, optional):
            Whether to print the tree in place.
            If False, the tree is returned.
        ignore_list_extras (Iterable[str], optional):
            Additional file extensions to ignore.

    Returns:
        If inplace is False, the Tree object with the directory contents.
        Else, None.
    """
    # Get the ignore list that includes the default and any extras
    ignore_list = sorted(set(default_ignore_list) | set(ignore_list_extras))

    # Create a new Tree if one is not provided
    tree = tree or Tree(label=f"[bold]{directory!s}[/bold] File Tree")  # type: ignore

    # Sort dirs first then by filename
    paths = sorted(
        pathlib.Path(directory).iterdir(),
        key=lambda path: (path.is_file(), path.name.lower()),
    )

    # Sort dirs first then by filename
    for path in paths:

        # Remove hidden files if show_hidden is False
        if path.name.startswith(".") and not show_hidden:
            continue

        # Skip files in the ignore list by suffix or name
        if path.is_file() and (path.suffix in ignore_list or path.name in ignore_list):
            continue

        # Skip directories only by name (not suffix)
        if path.is_dir() and path.name in ignore_list:
            continue

        # Add the directory to the tree
        if path.is_dir():

            # Style directories starting with "__" differently
            style = "dim" if path.name.startswith("__") else ""

            # Add the directory to the tree
            branch = tree.add(
                f"[bold magenta]:open_file_folder: [link file://{path}]{escape(path.name)}",
                style=style,
                guide_style=style,
            )
            get_directory_tree(path, branch)

        # Add the file to the tree
        else:
            main_style = "dim green" if path.name.startswith("_") else "green"
            ext_style = "dim red" if path.name.startswith("_") else "bold red"
            file_size_style = "dim blue" if path.name.startswith("_") else "blue"
            text_filename = Text(path.name, main_style)
            text_filename.highlight_regex(r"\..*$", ext_style)
            text_filename.stylize(f"link file://{path}")
            text_filename.append(f" ({decimal(path.stat().st_size)})", file_size_style)
            if path.suffix == ".py":
                icon = "🐍 "
            elif path.suffix == ".ipynb":
                icon = "🐍📓 "
            elif path.suffix == ".sh":
                icon = "🔧 "
            elif ".env" in path.name.lower():
                icon = "🔑 "
            elif path.suffix == ".csv":
                icon = "📊 "
            elif path.suffix in [".yaml", ".yml", ".json"]:
                icon = "📜 "
            elif path.suffix in [".txt", ".md"]:
                icon = "📝 "
            elif path.suffix in [".png", ".jpg", ".jpeg", ".gif", ".svg"]:
                icon = "🖼️ "
            elif path.suffix in [".zip", ".tar", ".gz", ".7z"]:
                icon = "📦 "
            elif path.suffix in [".pdf"]:
                icon = "📰 "
            elif path.suffix in [".mp4", ".avi", ".mov", ".mkv"]:
                icon = "🎥 "
            elif path.suffix in [".mp3", ".wav", ".flac"]:
                icon = "🎵 "
            elif path.suffix in [".html", ".css", ".js"]:
                icon = "🌐 "
            elif path.suffix in [".exe", ".msi"]:
                icon = "🛠️ "
            elif path.suffix in [".docx", ".pptx", ".xlsx"]:
                icon = "📄 "
            elif path.suffix in [".parquet", ".feather"]:
                icon = "🧼 "
            elif path.suffix in [".db", ".sqlite", ".sql", ".jsonl"]:
                icon = "🗄️ "
            else:
                icon = "📄 "

            # Prefix hidden files with a "🤫" emoji
            if path.name.startswith("."):
                icon = "🤫"+icon

            # Add the file to the tree (with icon prefix)
            tree.add(Text(icon) + text_filename)

    # If inplace is False, return the tree... otherwise the Tree object is updated in place
    if not inplace:
        return tree
    return None

In [13]:
idx = 3
row = kprize_df.iloc[idx]
problem_statement = row["problem_statement"]
instance_id = row["instance_id"]
p2p_tests = row["PASS_TO_PASS"]
f2p_tests = row["FAIL_TO_PASS"]
repo_path = os.path.join(comp_repos_dir, f'repo__{instance_id}')

In [14]:
# Display tree
tree = get_directory_tree(repo_path)
rprint(tree)

In [15]:
# Display problem_statement
rprint(problem_statement)

In [16]:
# Display current row
display(pd.DataFrame(row).T)

Unnamed: 0,instance_id,repo,problem_statement,patch,test_patch,pull_number,base_commit,PASS_TO_PASS,FAIL_TO_PASS,issue_numbers,problem_statement_length,patch_length,test_patch_length,PASS_TO_PASS_count
3,astropy__astropy-16898,astropy/astropy,BUG: tables do not deal well with zero-sized s...,diff --git a/astropy/io/registry/core.py b/ast...,diff --git a/astropy/io/fits/tests/test_connec...,16898,ee6d087baf301c1d08db92e6e5b6d909d57e6fac,[astropy/io/fits/tests/test_connect.py::TestSi...,[astropy/io/fits/tests/test_connect.py::test_z...,[16897],315,2203,2463,2


### Setup LLM

In [17]:
import openai

In [18]:
def set_openai_key(path_to_key: str = "/home/loc/Documents/keys/OPENAI_API_KEY.txt") -> None:
    """
    Sets the OpenAI API key from a file to an environment variable.

    Args:
        path_to_key (str): Path to the file containing the OpenAI API key.
                           Default is '/home/loc/Documents/keys/OPENAI_API_KEY.txt'.
    """
    # Check if the path exists
    if os.path.exists(path_to_key):
        with open(path_to_key, "r") as f:
            api_key = f.read().strip()  # Read and strip any extra whitespace/newlines
        os.environ["OPENAI_API_KEY"] = api_key  # Set the environment variable
        print(f"API key set successfully.")
    else:
        raise FileNotFoundError(f"{path_to_key} does not exist!")  # Use a proper exception

In [22]:
def test_openai_api(model: str = "gpt-4") -> None:
    """
    Tests the OpenAI API by generating a chat completion with a simple prompt.

    Args:
        model (str): The name of the model to use. Default is 'gpt-4'.
    """
    try:
        client = openai.OpenAI(
            api_key=os.environ.get("OPENAI_API_KEY"),  # This is the default and can be omitted
        )

        
        response = client.chat.completions.create(
            messages=[
                {
                    "role": "user",
                    "content": "Hello assistant!",
                }
            ],
            model="gpt-4o",
        )

        # Access the assistant's response content
        response_content = response.choices[0].message.content
        print(response_content)
        
    except Exception as e:
        print(f"An error occurred: {e}")

In [23]:
set_openai_key()

API key set successfully.


In [24]:
test_openai_api()

Hello! How can I assist you today?


In [26]:
from typing import Literal, Any

def format_gpt_message(role: Literal["user", "assistant"], content: str) -> dict:
    """Format a single message for the OpenAI GPT API."""
    if role not in ["user", "assistant"]:
        raise ValueError("Role must be either 'user' or 'assistant'")
    return {"role": role, "content": content}

### Analyze repo issues

### Retrieve relevant codes

### Generate patches

### Validate patches

```python
import unidiff

def is_valid_patch_format(patch: str) -> bool:
    """
    A quick check to confirm if a patch could be valid.
    """
    if not(isinstance(patch, str)):
        return False
    try:
        patch_set = unidiff.PatchSet(patch)
        if len(patch_set) == 0:
            return False
    except Exception:
        return False
    return True

# This should demo patch should fail.
is_valid_patch_format('Hullo world')
```