# Mini-Project 2

## Task 2

### Prompts

Name | Prompt | Result
-----|--------|-------
Direct | We have asked developers to find bugs in software and explain the bug to us. Merge the participants’ explanations in a way that minimizes redundant information, while keeping the information that would be necessary for someone else to fix the bug. These are the answers of the participants (one per line): | Answer has intro and outro sentences and extensive formatting (dividers, code snippets, bullet points, ...). Elaborate sentences on what the code does, the error's root cause and its implications, detailed solution steps, and additional remarks regarding further improvements and possible pitfalls
Developer Persona | We have asked developers to find bugs in software and explain the bug to us. Merge the participants’ explanations in a way that minimizes redundant information, while keeping the information that would be necessary for someone else to fix the bug. Formulate your answer like you were one of the developers. These are the answers of the participants (one per line): | Similar to Direct prompt, but without intro and outro text
Short | We have asked developers to find bugs in software and explain the bug to us. Based on the participants’ explanations, give us one very short explanation of the bug, while keeping the information that would be necessary for someone else to fix the bug. Write it in a way one of the participants would. These are the answers of the participants (one per line): | Few well formulated, whole sentences. Reduced to the root cause explanation and the fix. Occasional use of code blocks
No Formatting | We have asked developers to find bugs in software and explain the bug to us. Based on the participants’ explanations, give us one very short explanation of the bug, while keeping the information that would be necessary for someone else to fix the bug. Write it in a way one of the participants would, e.g. do not rely on code formatting to convey semantics/clarity, maybe do not write complete sentences etc. These are the answers of the participants (one per line): | Similar to the Short prompt. Mix of whole sentences and sentence fragments. Most notably: still uses inline code formatting (instead of e.g. quotation marks)

In [4]:
import numpy as np
import pandas as pd

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

In [31]:
# Load Pandas table with all answers, incl. the generated answers
file = 'answerList_LLM.csv'
df_original = pd.read_csv(file)
df_original["Answer.option"] = np.where(df_original["Answer.option"] == "YES", 1, 0)
df_original["isCorrect"] = np.where(df_original["Answer.option"] == df_original["GroundTruth"], 1, 0)
df_original

# Get bug reports
report_ids = df_original["FailingMethod"].unique()

# calculate BLEU
# calculate ROUGE

df_original

Unnamed: 0,Answer.ID,FailingMethod,Question.ID,Answer.duration,Answer.confidence,Answer.difficulty,GroundTruth,TP,TN,FN,...,Worker.ID,Worker.score,Worker.profession,Worker.yearsOfExperience,Worker.age,Worker.gender,Worker.whereLearnedToCode,Worker.country,Worker.programmingLanguage,isCorrect
0,,HIT01_8,1,,,,1,1,0,0,...,LLM,,,,,,,,,1
1,,HIT01_8,1,,,,1,1,0,0,...,LLM-short,,,,,,,,,1
2,261.0,HIT01_8,0,90984.0,4.0,2.0,0,0,1,0,...,832cg-7G1i-462:73eI-8E-2g-985,5.0,Undergraduate_Student,7.0,21.0,Male,High School,United States,Java; C++; C#,1
3,262.0,HIT01_8,0,133711.0,5.0,1.0,0,0,1,0,...,98ce7A-4i-507,4.0,Undergraduate_Student,10.0,25.0,Female,High School;University;Web,United States,c#,1
4,263.0,HIT01_8,0,77696.0,5.0,2.0,0,0,1,0,...,881AC0I2E-625:135cI3E-7e8-86,5.0,Professional_Developer,7.0,24.0,Male,High School;University;Web,United States,C++;Java;PHP,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2587,2316.0,HIT08_54,128,22042.0,2.0,4.0,0,0,1,0,...,1171ea-4g-6C-73-8,3.0,Graduate_Student,2.0,28.0,Male,University,USA,C#,1
2588,2317.0,HIT08_54,128,32279.0,4.0,3.0,0,0,0,0,...,66AC-5a0g-47-9:1443IA-7C-6e967,5.0,Professional_Developer,17.0,39.0,Male,University;Web,USA,C#; VB.NET; Java,0
2589,2318.0,HIT08_54,128,15953.0,5.0,1.0,0,0,1,0,...,106iG8G-9I-9-80:590CG-6G-7i-71-9,4.0,Professional_Developer,10.0,31.0,Male,High School;University;Web;Other work,usa,C++,1
2590,2319.0,HIT08_54,128,68578.0,5.0,1.0,0,0,1,0,...,1221iC8A5A242:495CC9e6a691:11aE2c-4c-9-86,4.0,Undergraduate_Student,4.0,19.0,Male,University;Web;Other FIRST Robotics,United States,C++,1


In [24]:
def llm_answer(df, report_id, worker_id):
    # Generated explanation of the given LLM variant for one bug report ("" if there is none)
    llm_answers = df.loc[(df["Worker.ID"] == worker_id) & (df["FailingMethod"] == report_id), "Answer.explanation"].tolist()
    return llm_answers[0] if len(llm_answers) > 0 else ""

def worker_answers(df, report_id):
    # Reference explanations: human (non-LLM) answers for this bug report with TP == 1
    return df.loc[(df["FailingMethod"] == report_id) & (df["TP"] == 1) & (~df["Worker.ID"].str.contains("LLM")), "Answer.explanation"].tolist()

## BLEU

In [25]:
def calc_bleu(df, report_id, worker_id="LLM"):
    return sentence_bleu(
        [worker_answer.split() for worker_answer in worker_answers(df, report_id)],
        llm_answer(df, report_id, worker_id).split(),
        smoothing_function=SmoothingFunction().method1
    )
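
To make the multi-reference setup in `calc_bleu` more tangible, here is a minimal sketch of the clipped ("modified") n-gram precision that BLEU builds on. It is not NLTK's implementation: `sentence_bleu` additionally combines the precisions for 1- to 4-grams and applies a brevity penalty. The helper names and the example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-grams of a token list, as tuples
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, hypothesis, n):
    # Count each hypothesis n-gram, but clip it to the maximum number of
    # times it occurs in any single reference
    hyp_counts = Counter(ngrams(hypothesis, n))
    if not hyp_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in hyp_counts.items())
    return clipped / sum(hyp_counts.values())

# Hypothetical worker answers (references) and LLM answer (hypothesis)
references = ["index is off by one".split(), "loop uses wrong index".split()]
hypothesis = "the loop index is off by one".split()

p1 = modified_precision(references, hypothesis, 1)  # unigram precision
p2 = modified_precision(references, hypothesis, 2)  # bigram precision
print(round(p1, 3), round(p2, 3))
```

Because the references are pooled, a hypothesis n-gram counts as a match if any worker used it, which is why all worker answers are passed as one list of references.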

In [26]:
bleu_scores = dict(zip(report_ids, [round(calc_bleu(df_original, report_id), 3) for report_id in report_ids]))

bleu_scores

{'HIT01_8': 0.248,
 'HIT02_24': 0.04,
 'HIT03_6': 0.113,
 'HIT04_7': 0.134,
 'HIT05_35': 0.163,
 'HIT06_51': 0.042,
 'HIT07_33': 0.25,
 'HIT08_54': 0.2}

In [32]:
bleu_scores_short = dict(zip(report_ids, [round(calc_bleu(df_original, report_id, "LLM-short"), 3) for report_id in report_ids]))

bleu_scores_short

{'HIT01_8': 0.211,
 'HIT02_24': 0,
 'HIT03_6': 0,
 'HIT04_7': 0,
 'HIT05_35': 0.2,
 'HIT06_51': 0.013,
 'HIT07_33': 0.08,
 'HIT08_54': 0}
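
Several of the scores above are exactly 0. One plausible explanation (an assumption, not verified here) is that no `LLM-short` answer exists for those reports, in which case `llm_answer` returns an empty string and there is nothing to compare. A small sketch for listing such reports, using a hypothetical helper `zero_score_reports` and the values printed above:

```python
def zero_score_reports(scores):
    # Report IDs whose score is exactly zero (hypothetical helper)
    return sorted(report_id for report_id, score in scores.items() if score == 0)

# Values copied from the bleu_scores_short output above
example_scores = {
    'HIT01_8': 0.211, 'HIT02_24': 0, 'HIT03_6': 0, 'HIT04_7': 0,
    'HIT05_35': 0.2, 'HIT06_51': 0.013, 'HIT07_33': 0.08, 'HIT08_54': 0,
}

suspicious = zero_score_reports(example_scores)
print(suspicious)
```

For each listed report, checking whether `llm_answer(df_original, report_id, "LLM-short")` is empty would show whether the zero reflects a missing answer rather than a genuine lack of overlap.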

## ROUGE

In [28]:
def calc_rouge(df, report_id, worker_id="LLM"):
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'])
    return scorer.score_multi(worker_answers(df, report_id), llm_answer(df, report_id, worker_id))

In [33]:
rouge_scores = dict(zip(report_ids, [calc_rouge(df_original, report_id) for report_id in report_ids]))

rouge_scores

{'HIT01_8': {'rouge1': Score(precision=0.3076923076923077, recall=0.64, fmeasure=0.4155844155844156),
  'rouge2': Score(precision=0.13725490196078433, recall=0.2916666666666667, fmeasure=0.18666666666666668),
  'rougeL': Score(precision=0.28846153846153844, recall=0.6, fmeasure=0.3896103896103896)},
 'HIT02_24': {'rouge1': Score(precision=0.3424657534246575, recall=0.5102040816326531, fmeasure=0.40983606557377056),
  'rouge2': Score(precision=0.06944444444444445, recall=0.35714285714285715, fmeasure=0.11627906976744186),
  'rougeL': Score(precision=0.2054794520547945, recall=0.30612244897959184, fmeasure=0.2459016393442623)},
 'HIT03_6': {'rouge1': Score(precision=0.4745762711864407, recall=0.4057971014492754, fmeasure=0.43749999999999994),
  'rouge2': Score(precision=0.10344827586206896, recall=0.42857142857142855, fmeasure=0.16666666666666663),
  'rougeL': Score(precision=0.1694915254237288, recall=0.4166666666666667, fmeasure=0.24096385542168672)},
 'HIT04_7': {'rouge1': Score(preci

In [34]:
rouge_scores_short = dict(zip(report_ids, [calc_rouge(df_original, report_id, "LLM-short") for report_id in report_ids]))

rouge_scores_short

{'HIT01_8': {'rouge1': Score(precision=0.34782608695652173, recall=0.64, fmeasure=0.4507042253521127),
  'rouge2': Score(precision=0.13333333333333333, recall=0.25, fmeasure=0.1739130434782609),
  'rougeL': Score(precision=0.30434782608695654, recall=0.56, fmeasure=0.3943661971830986)},
 'HIT02_24': {'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0),
  'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0),
  'rougeL': Score(precision=0, recall=0, fmeasure=0)},
 'HIT03_6': {'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0),
  'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0),
  'rougeL': Score(precision=0, recall=0, fmeasure=0)},
 'HIT04_7': {'rouge1': Score(precision=0.0, recall=0.0, fmeasure=0.0),
  'rouge2': Score(precision=0.0, recall=0.0, fmeasure=0.0),
  'rougeL': Score(precision=0, recall=0, fmeasure=0)},
 'HIT05_35': {'rouge1': Score(precision=0.2916666666666667, recall=0.7, fmeasure=0.4117647058823529),
  'rouge2': Score(precision=0.14893617021276595, re

## Discussion

The low BLEU scores indicate stark differences between the LLM-generated answers and the explanations written by human workers. For ROUGE, precision is generally rather low, while recall is often higher, especially for ROUGE-1 (overlap of individual words) and ROUGE-L (longest common subsequence). This suggests that the LLM uses many words that the workers do not use (low precision), while still covering many of the relevant words that the workers do use (higher recall).
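
The precision/recall asymmetry can be reproduced with a toy example. The following is a simplified sketch, not the `rouge_score` implementation: it tokenizes by whitespace only, uses a single reference, and the sentences are invented for illustration.

```python
from collections import Counter

def rouge1(reference, candidate):
    # Unigram overlap, clipped by the counts in each text
    overlap = sum((Counter(reference) & Counter(candidate)).values())
    return overlap / len(candidate), overlap / len(reference)  # precision, recall

def lcs_length(a, b):
    # Length of the longest common subsequence via dynamic programming
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(reference, candidate):
    lcs = lcs_length(reference, candidate)
    return lcs / len(candidate), lcs / len(reference)  # precision, recall

# Hypothetical short worker answer vs. longer LLM-style answer
reference = "loop index off by one".split()
candidate = "the bug is an off by one error in the loop condition".split()

r1_precision, r1_recall = rouge1(reference, candidate)
rl_precision, rl_recall = rouge_l(reference, candidate)
print(r1_precision, r1_recall, rl_precision, rl_recall)
```

The longer candidate covers most of the reference words, so recall is high, but its many extra words pull precision down. ROUGE-L is lower than ROUGE-1 here because the shared words do not all appear in the same order.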

A quick qualitative analysis supports this interpretation: the LLM generates longer, well-phrased sentences, while workers either omit parts of sentences to convey only the most relevant aspects, or focus their explanation on a specific topic (e.g. explaining the error without hinting at a solution). The LLM's prompt was arguably broader (explain the error and provide a solution), and the model did not match the workers' level of "efficiency". Explicitly allowing the model to omit parts of sentences yielded unclear results, as whole sentences were still generated in some instances. However, code blocks were no longer generated, which makes the output more similar to the worker answers.

