### Table of Content
- [Load dataset](#load-dataset)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [Setup LLM](#Setup-LLM)
- [Analyze repo issues & Retrieve relevant codes](#Analyze-repo-issues-&-Retrieve-relevant-codes)
- [Validate patches](#Validate-patches)

In [1]:
import os
import pandas as pd

# Define paths
comp_dir = "konwinski-prize"
comp_kaggle_evaluation_dir = os.path.join(comp_dir, "kaggle_evaluation")
comp_kprize_setup_dir = os.path.join(comp_dir, "kprize_setup")

comp_data_zip_path = os.path.join(comp_dir, "data.a_zip")
comp_data_dir = os.path.join(comp_dir, "data")
comp_data_parquet_path = os.path.join(comp_data_dir, "data.parquet")
comp_conda_packages_dir = os.path.join(comp_data_dir, "conda_packages")
comp_pip_packages_dir = os.path.join(comp_data_dir, "pip_packages")
comp_repo_configs_dir = os.path.join(comp_data_dir, "repo_configs")
comp_repos_dir = os.path.join(comp_data_dir, "repos")

### Load dataset

From the competition readme and our earlier investigation we know that the dataframe contains the following:

**instance_id (string)**
- Unique string identifier for each instance (GitHub issue)

**repo (string)**
- The GitHub repository relevant to the issue
- Also accessible through the evaluation API

**problem_statement (string)**
- Textual description of the issue
- Also accessible through the evaluation API

**patch (string)**
- The patch that resolves the issue
- Only provided in the train set

**test_patch (string)**
- The patch that resolves the issue
- Only provided in the train set

**pull_number (int)**
- The pull request number that resolved the issue

**base_commit (string)**
- The commit used as the foundation for the provided repository copy

**issue_numbers (int)**
- The original ID number of the GitHub issue

**[PASS_TO_PASS/FAIL_TO_PASS] (list)**
- Lists containing unit tests to be executed for this issue


In [2]:
# Load dataset
kprize_df = pd.read_parquet(comp_data_parquet_path)
kprize_df

Unnamed: 0,instance_id,repo,problem_statement,patch,test_patch,pull_number,base_commit,PASS_TO_PASS,FAIL_TO_PASS,issue_numbers
0,pylint-dev__astroid-2496,pylint-dev/astroid,TypeError: unsupported format string passed to...,diff --git a/ChangeLog b/ChangeLog\nindex 4560...,diff --git a/tests/test_inference.py b/tests/t...,2496,8d3cdbbe6685fd8cf211816bec56c90f38f1859e,[tests/test_inference.py::InferenceUtilsTest::...,[tests/test_inference.py::test_formatted_fstri...,[2492]
1,pylint-dev__astroid-2468,pylint-dev/astroid,Pylint checks against incorrect type with prop...,diff --git a/ChangeLog b/ChangeLog\nindex fdbb...,diff --git a/tests/test_inference.py b/tests/t...,2468,6db3a60553ff538a936d5dda23d67a3924a57f45,[tests/test_inference.py::InferenceUtilsTest::...,[tests/test_inference.py::InferenceTest::test_...,[2467]
2,astropy__astropy-17048,astropy/astropy,QTable cannot take `dimensionless_unscaled` wh...,diff --git a/astropy/table/table.py b/astropy/...,diff --git a/astropy/table/tests/test_table.py...,17048,d60f6b72cd525262bfd179331d9fe4474177918f,[astropy/table/tests/test_table.py::TestSetTab...,[astropy/table/tests/test_table.py::test_qtabl...,[17047]
3,astropy__astropy-16898,astropy/astropy,BUG: tables do not deal well with zero-sized s...,diff --git a/astropy/io/registry/core.py b/ast...,diff --git a/astropy/io/fits/tests/test_connec...,16898,ee6d087baf301c1d08db92e6e5b6d909d57e6fac,[astropy/io/fits/tests/test_connect.py::TestSi...,[astropy/io/fits/tests/test_connect.py::test_z...,[16897]
4,astropy__astropy-16830,astropy/astropy,KeyError: 'version_1_3_or_later' when parsing ...,diff --git a/astropy/io/votable/tree.py b/astr...,diff --git a/astropy/io/votable/tests/test_tre...,16830,e39f486fec48d87aa3677326167954370d7a7bf9,[astropy/io/votable/tests/test_tree.py::test_c...,[astropy/io/votable/tests/test_tree.py::test_v...,"[16825, 16826]"
5,astropy__astropy-16812,astropy/astropy,Provide a way to make a copy of a model with d...,diff --git a/astropy/modeling/core.py b/astrop...,diff --git a/astropy/modeling/tests/test_core....,16812,c241103c11954d3c1cfe3c1840b1ece72479c522,[astropy/modeling/tests/test_core.py::test_Mod...,[astropy/modeling/tests/test_core.py::test_res...,[16593]


### Exploratory Data Analysis

In [3]:
from rich import print as rprint

In [4]:
rprint(f"{kprize_df.shape=}\n")

In [5]:
rprint("\nAny Missing Values?\n")
rprint(kprize_df.isnull().sum())

In [6]:
rprint("\nDatatypes?\n")
kprize_df.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   instance_id        6 non-null      object
 1   repo               6 non-null      object
 2   problem_statement  6 non-null      object
 3   patch              6 non-null      object
 4   test_patch         6 non-null      object
 5   pull_number        6 non-null      int64 
 6   base_commit        6 non-null      object
 7   PASS_TO_PASS       6 non-null      object
 8   FAIL_TO_PASS       6 non-null      object
 9   issue_numbers      6 non-null      object
dtypes: int64(1), object(9)
memory usage: 608.0+ bytes


In [7]:
rprint("\nRepo Distribution?\n")
rprint(kprize_df['repo'].value_counts())

In [8]:
# Fixes can reference more than one GitHub issue.
rprint("\nNumber of Issues per PR\n")
kprize_df["issue_numbers"].apply(len).value_counts()

1    5
2    1
Name: issue_numbers, dtype: int64

In [9]:
kprize_df['problem_statement_length'] = kprize_df['problem_statement'].apply(lambda x: len(x.split()))
rprint("\nProblem Statement Lengths\n")
display(kprize_df['problem_statement_length'].describe())

count      6.000000
mean     297.166667
std      162.149828
min       72.000000
25%      171.750000
50%      358.000000
75%      407.000000
max      462.000000
Name: problem_statement_length, dtype: float64

In [10]:
rprint("\nPatch Lengths\n")
kprize_df['patch_length'] = kprize_df['patch'].apply(lambda x: len(x))
kprize_df['test_patch_length'] = kprize_df['test_patch'].apply(lambda x: len(x))
display(kprize_df[['patch_length', 'test_patch_length']].describe())

Unnamed: 0,patch_length,test_patch_length
count,6.0,6.0
mean,2337.833333,2255.833333
std,1380.618328,629.710542
min,912.0,1339.0
25%,1382.25,2069.5
50%,2195.0,2214.5
75%,2723.5,2405.25
max,4714.0,3277.0


In [11]:
rprint("\nTest Counts\n")
kprize_df['PASS_TO_PASS_count'] = kprize_df['PASS_TO_PASS'].apply(len)
kprize_df['PASS_TO_PASS_count'] = kprize_df['FAIL_TO_PASS'].apply(len)
display(kprize_df[['PASS_TO_PASS', 'FAIL_TO_PASS', 'PASS_TO_PASS_count', 'PASS_TO_PASS_count']])

Unnamed: 0,PASS_TO_PASS,FAIL_TO_PASS,PASS_TO_PASS_count,PASS_TO_PASS_count.1
0,[tests/test_inference.py::InferenceUtilsTest::...,[tests/test_inference.py::test_formatted_fstri...,2,2
1,[tests/test_inference.py::InferenceUtilsTest::...,[tests/test_inference.py::InferenceTest::test_...,3,3
2,[astropy/table/tests/test_table.py::TestSetTab...,[astropy/table/tests/test_table.py::test_qtabl...,3,3
3,[astropy/io/fits/tests/test_connect.py::TestSi...,[astropy/io/fits/tests/test_connect.py::test_z...,2,2
4,[astropy/io/votable/tests/test_tree.py::test_c...,[astropy/io/votable/tests/test_tree.py::test_v...,1,1
5,[astropy/modeling/tests/test_core.py::test_Mod...,[astropy/modeling/tests/test_core.py::test_res...,2,2


In [None]:
idx = 3
row = kprize_df.iloc[idx]
problem_statement = row["problem_statement"]
instance_id = row["instance_id"]
repo_path = os.path.join(comp_repos_dir, f'repo__{instance_id}')

In [15]:
# Display problem_statement
rprint(problem_statement)

In [16]:
# Display current row
display(pd.DataFrame(row).T)

Unnamed: 0,instance_id,repo,problem_statement,patch,test_patch,pull_number,base_commit,PASS_TO_PASS,FAIL_TO_PASS,issue_numbers,problem_statement_length,patch_length,test_patch_length,PASS_TO_PASS_count
3,astropy__astropy-16898,astropy/astropy,BUG: tables do not deal well with zero-sized s...,diff --git a/astropy/io/registry/core.py b/ast...,diff --git a/astropy/io/fits/tests/test_connec...,16898,ee6d087baf301c1d08db92e6e5b6d909d57e6fac,[astropy/io/fits/tests/test_connect.py::TestSi...,[astropy/io/fits/tests/test_connect.py::test_z...,[16897],315,2203,2463,2
