In [1]:
%load_ext autoreload
%autoreload 2

# Imports

In [2]:
import os
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm

from src.evaluation.metrics import aggregate_scores, calculate_scores
from src.data.data_loading import load_config
from src.data.code_processor import parse_code

In [3]:
# Set plot style
plt.rcParams['mathtext.fontset'] = 'stix'
plt.rcParams['font.family'] = 'STIXGeneral'
plt.rcParams['font.size'] = 12
%config InlineBackend.figure_format = 'retina'
# Set color palette

# Set working directory

In [7]:
# set working directory to project root - EXECUTE ONLY ONCE or RESTART KERNEL
os.chdir('../..')
os.getcwd()

'C:\\Users\\merti\\PycharmProjects\\cadenza-playwright-llm'

# Load configuration

In [8]:
config = load_config(config_path='config/config.yaml')

# Evaluation


## Evaluation Methods Summary

Metrics implemented and used in this project (see `src/evaluation/metrics.py`) in the evaluation process:

### 1. **Weighted BLEU Score** $ \in [0.0, 1.0] $
- **Purpose**: Measures the quality of the generated code by comparing it to the reference (validation) code. The BLEU score, proposed by [Papineni et al. (2002)](https://aclanthology.org/P02-1040.pdf) [1], [2], is a metric that measures the similarity between two sequences of text. The weighted BLEU score is a variant implemented in this project that uses a weighted average of the BLEU score of the precondition part and the BLEU score of the newly generated part of the generated test script.
- **Description**: This method calculates the BLEU score with two components:
  - **Precondition Code Accuracy**: Evaluates how well the generated code matches the precondition code.
  - **New Lines Accuracy**: Assesses how well the generated code meets the goals by comparing the new lines added.
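- **Formula**:

  $ \text{Weighted BLEU} = (1 - \alpha) \cdot \text{BLEU}_{\text{first}} + \alpha \cdot \text{BLEU}_{\text{second}} $

  where $ \text{BLEU}_{\text{first}} $ and $ \text{BLEU}_{\text{second}} $ are the first and second BLEU scores (the two components listed above, in that order), and $ \alpha $ is a weight factor, typically set to 0.5.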
- **Output**: A floating-point score indicating the degree of similarity between the generated and validation code.
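
The weighting can be sketched in a few lines of plain Python. This is a minimal illustration, not the project's implementation in `src/evaluation/metrics.py`: the helper names `simple_bleu` and `weighted_bleu` are hypothetical, the BLEU variant below uses a single reference and no smoothing, and the code is assumed to be already split into precondition tokens and new-line tokens.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(reference, hypothesis, max_n=4):
    """Single-reference BLEU with uniform 1..max_n weights and no smoothing."""
    if not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hypothesis, n))
        ref_counts = Counter(ngrams(reference, n))
        total = sum(hyp_counts.values())
        # Clip each n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: penalize hypotheses shorter than the reference.
    if len(hypothesis) > len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)


def weighted_bleu(ref_pre, hyp_pre, ref_new, hyp_new, alpha=0.5):
    """(1 - alpha) * BLEU(precondition part) + alpha * BLEU(new lines)."""
    first = simple_bleu(ref_pre, hyp_pre)
    second = simple_bleu(ref_new, hyp_new)
    return (1 - alpha) * first + alpha * second


precondition = "await page . goto ( url ) ;".split()
new_reference = "await page . click ( button ) ;".split()
new_prediction = "foo bar baz qux quux".split()
print(weighted_bleu(precondition, precondition, new_reference, new_prediction))
```

In this toy example the precondition is copied perfectly and the new lines share no token with the reference, so the two parts contribute 1.0 and 0.0 before weighting.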

### 2. **Success Rate** $ \in [0.0, 1.0] $
- **Purpose**: Evaluates whether the generated code successfully performs the intended functionality.
- **Description**: The generated code is executed in a testing environment, and the success rate is determined based on the test's result.
- **Procedure**:
  1. Modify the generated code to include screenshot commands.
  2. Save the updated code to a temporary file.
  3. Run the Playwright test.
  4. Return `1` if the test passes, otherwise `0`.
- **Output**: A binary score (`1` or `0`) representing test success.
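
Steps 2 to 4 of the procedure could look roughly like the sketch below. The function name `run_generated_test`, the `command_prefix` parameter, and the file suffix are assumptions made for illustration; how the project actually invokes Playwright is not shown in this notebook, and step 1 (adding screenshot commands) is omitted here.

```python
import os
import subprocess
import sys
import tempfile


def run_generated_test(code, command_prefix, suffix=".spec.ts", timeout=120):
    """Write `code` to a temporary file, run it, and return 1 on success, else 0.

    `command_prefix` is the command that receives the file path as its last
    argument, e.g. ["npx", "playwright", "test"] (an assumed invocation).
    """
    with tempfile.NamedTemporaryFile(
        "w", suffix=suffix, delete=False, encoding="utf-8"
    ) as handle:
        handle.write(code)
        path = handle.name
    try:
        result = subprocess.run(
            command_prefix + [path], capture_output=True, timeout=timeout
        )
        return 1 if result.returncode == 0 else 0
    except (subprocess.TimeoutExpired, OSError):
        # A hanging test or a missing executable counts as a failure.
        return 0
    finally:
        os.remove(path)


# Demonstration with the Python interpreter standing in for the test runner.
print(run_generated_test("import sys; sys.exit(0)", [sys.executable], suffix=".py"))
print(run_generated_test("import sys; sys.exit(1)", [sys.executable], suffix=".py"))
```

Averaging these binary results over all samples of a prediction directory gives the success rate reported in the tables further down.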

### 3. **Levenshtein Distance** $ d(s, t) \in \mathbb{N} $
- **Purpose**: Measures the similarity between the generated code and validation code based on edit distance.
- **Description**: The Levenshtein distance between strings $ s $ and $ t $ is an integer that measures the number of single-character edits (insertions, deletions, or substitutions) needed to change the generated code into the validation code. The distance is normalized by the length of the longer code.
- **Formula**:

  $ d_{\text{norm}}(s, t) = \dfrac{d(s, t)}{\max(|s|, |t|)} $
- **Output**: A floating-point score between 0 and 1, where lower values indicate higher similarity.
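
A minimal, dependency-free sketch of the normalized distance is shown below. The function names are hypothetical, and the project's own implementation in `src/evaluation/metrics.py` may differ (for example by using a library).

```python
def levenshtein(s, t):
    """Number of single-character insertions, deletions, or substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the inner row as short as possible
    previous = list(range(len(t) + 1))
    for i, char_s in enumerate(s, start=1):
        current = [i]
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution (or match)
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein(s, t):
    """Edit distance divided by the length of the longer string, in [0, 1]."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(s, t) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_levenshtein("kitten", "sitting"))
```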

### 4. **Similarity (Cosine Similarity)**
- **Purpose**: Assesses the similarity between screenshots of the generated code and the ground truth.
- **Description**: Uses a pre-trained ResNet-18 model to extract image embeddings and calculates the cosine similarity between embeddings of the predicted and ground truth images.
- **Procedure**:
  1. Load and preprocess the screenshots.
  2. Compute embeddings using the ResNet-18 model.
  3. Calculate cosine similarity between the embeddings.
- **Output**: A floating-point score between 0 and 1, indicating the similarity between the images.
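
Step 3 of the procedure can be sketched without any deep learning dependency. The embeddings below are made-up toy vectors; in the project they would come from the ResNet-18 model, which is not reproduced here, and the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


predicted_embedding = [0.2, 0.0, 0.9, 0.4]      # toy values
ground_truth_embedding = [0.1, 0.1, 0.8, 0.5]   # toy values
print(cosine_similarity(predicted_embedding, ground_truth_embedding))
```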

## Evaluation Storage

- **Evaluation Data**: Evaluation results for different templates and options can be found under `data/scores`.
- **Evaluation Code**: The code used for evaluation is stored in `src/evaluation/metrics.py`.

These metrics collectively provide a comprehensive assessment of the generated code's quality and effectiveness.


## Evaluation Summary

The evaluation of generated code was conducted across various templates and configurations, addressing different aspects such as image presence, HTML inclusion, and model fine-tuning. The following types of evaluations were performed:

### 1. **Templates with Images**
- **Examples**: `pred_test_script_pretr_T4_sc+_html+_single`, `pred_test_script_finetuned_T5_sc+_html+_single`
- **Description**: Evaluated scripts that included image-based components, comparing generated outputs with expected results for scenarios where images were part of the test.

### 2. **Templates without Images**
- **Examples**: `pred_test_script_template_1_no_html_pretrained`, `pred_test_script_pretr_T1_sc-_html-_single`
- **Description**: Focused on templates where no images were included. This evaluated the performance and accuracy of generated code in the absence of image-based validation.

### 3. **HTML vs. No HTML**
- **Examples**: `pred_test_script_pretr_T1_sc+_html-_single`, `pred_test_script_finetuned_T5_sc-_html+_single`
- **Description**: Assessed how well the generated scripts handled scenarios with and without HTML components, testing their effectiveness in different contexts.

### 4. **Single vs. All Configurations**
- **Examples**: `pred_test_script_pretr_T5_sc+_html+_single`, `pred_test_script_pretr_T5_sc+_html+_all`
- **Description**: Included evaluations for both single-instance and all-instance configurations to determine the performance across various levels of complexity and data variety.

### 5. **Pretrained vs. Finetuned Models**
- **Examples**: `pred_test_script_finetuned_T5_sc+_html+_single`, `pred_test_script_pretr_T1_sc-_html+_single`
- **Description**: Compared results from pretrained models against those from finetuned models to assess improvements and differences in performance and accuracy.

### 6. **Different Attribute Lengths and Concatenation Modes**
- **Examples**: `pred_test_script_template_2_html_concat_mode_single_max_attr_length_50_pretrained`, `pred_test_script_template_2_html_concat_mode_all_max_attr_length_50_pretrained`
- **Description**: Evaluated the impact of different attribute lengths and concatenation modes on the performance of generated code.


This evaluation approach ensured a comprehensive assessment of the generated code under multiple conditions and configurations, providing insights into the effectiveness and accuracy of the different methods.


### Weighted BLEU Score  

For scoring our generated predictions, we chose multiple metrics. Our foremost metric is the BLEU score, which is frequently used to evaluate LLMs, so we chose to implement it here as well.
The BLEU score measures the similarity between the n-grams of the sample and those of the references it is given.
The reference in our case is the human-made code for the described test step. We use the default configuration for the BLEU weights, which covers 1-grams up to 4-grams. All n-gram orders are weighted uniformly in the result.

See [src.evaluation.metrics](../) for our implementation of the BLEU score.

Our weighted BLEU score separately evaluates the code from the precondition, which is copied by the LLM, and the newly generated code for the current step. This is intended to decouple the final result from the length of the precondition. Both parts are weighted with 50 percent each.

### Levenshtein Distance
The scoring function for our validation samples is the Levenshtein distance. It measures the distance between two strings as the number of single-character operations needed to turn one string into the other. These operations are deletion, insertion, and substitution. The maximum possible distance is therefore the length of the longer input string. Since later tests involve longer preconditions and descriptions, the chance of mistakes is higher, so the raw Levenshtein distance tends to be larger for longer tests and smaller for shorter ones. To counteract this, we normalize the distance to values between 0 and 1 by dividing by the length of the longer input string. This represents small errors or deviations in long tests much better, since those should score better than very short, error-riddled tests.
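
To illustrate why the normalization matters, consider two hypothetical predictions (the numbers are made up for illustration) that both need 5 single-character edits, one for a script of 50 characters and one for a script of 500 characters:

$$ \frac{5}{50} = 0.10 \qquad \text{vs.} \qquad \frac{5}{500} = 0.01 $$

The raw distance is the same in both cases, but the normalized distance rates the long script with few errors as much closer to its reference than the short script.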

See [src.evaluation.metrics](../) for our implementation of the Levenshtein distance.

# Evaluation Summary

In [24]:
df = pd.read_pickle(config['paths']['scores_dir']+'eval_scores_all_20240721.pkl')

In [26]:
# Keep only the prediction directories relevant to this comparison.
# (The original variable name `list` shadowed the Python built-in.)
pred_dirs = df['prediction_dir'].unique()
rel_list = [
    d for d in pred_dirs
    # drop attribute-length experiments and test-set runs
    if "max_attr" not in d and "test_set" not in d
    # require the naming scheme with screenshot, HTML, and template tags
    and all(tag in d for tag in ("_sc", "_html", "_T"))
]

In [71]:
# Aggregate the results for each prediction directory
g = df[df['prediction_dir'].isin(rel_list)]
results = g.groupby('prediction_dir').agg({'weighted bleu': ['mean', 'std'],
                                           'levenshtein distance': ['mean', 'std'],
                                           'similarity': ['mean', 'std'],
                                           'success rate': ['mean', 'std']})
# round to 4 decimal places
results = results.round(4)
results

| prediction_dir | weighted bleu (mean) | weighted bleu (std) | levenshtein distance (mean) | levenshtein distance (std) | similarity (mean) | similarity (std) | success rate (mean) | success rate (std) |
|---|---|---|---|---|---|---|---|---|
| data/prediction/pred_test_script_finetuned_T1_sc+_html+_single/ | 0.5993 | 0.1202 | 0.3145 | 0.2346 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_finetuned_T1_sc-_html+_single/ | 0.7882 | 0.1359 | 0.0726 | 0.0998 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_finetuned_T5_sc+_html+_single/ | 0.633 | 0.1097 | 0.2088 | 0.2115 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_finetuned_T5_sc-_html+_single/ | 0.7871 | 0.1242 | 0.0629 | 0.0872 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_pretr_T1_sc+_html+_all/ | 0.3976 | 0.1302 | 0.3037 | 0.2029 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_pretr_T1_sc+_html+_single/ | 0.4324 | 0.1118 | 0.2564 | 0.1863 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_pretr_T1_sc+_html-_single/ | 0.2991 | 0.1503 | 0.4102 | 0.2051 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_pretr_T1_sc-_html+_single/ | 0.4189 | 0.1774 | 0.3358 | 0.2229 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_pretr_T1_sc-_html-_single/ | 0.3211 | 0.2021 | 0.384 | 0.2006 | 0.0 | 0.0 | 0.01 | 0.1 |
| data/prediction/pred_test_script_pretr_T2_sc+_html+_single/ | 0.3987 | 0.1322 | 0.3487 | 0.1987 | 0.0 | 0.0 | 0.01 | 0.1 |


In [None]:
# Mean

In [53]:
# Baseline pretrained_model, Template1 with screenshots with html concat-mode: single
base = 'data/prediction/pred_test_script_pretr_T1_sc+_html+_single/'
print("Template 1: \n", f"BLEU: {results.loc[base]['weighted bleu']['mean']}\n", f"Levenshtein: {results.loc[base]['levenshtein distance']['mean']}\n", f"Similarity: {results.loc[base]['similarity']['mean']}\n", f"Success Rate: {results.loc[base]['success rate']['mean']}")

Template 1: 
 BLEU: 0.4324
 Levenshtein: 0.2564
 Similarity: 0.0
 Success Rate: 0.01


This is our baseline for this evaluation: template 1 with screenshots, HTML, and concat mode `single`.
* The success rate is very low because of many tiny errors in the generated code. A single missing space or bracket is enough to make a test fail, as is a duplicated or missing line. The similarity is low as well, because it depends on the success rate: similarity is calculated by comparing the screenshots produced by the generated code and by the human-made code. If the test fails, no screenshot of the actual outcome of the website action is taken, and the similarity is therefore 0.

### Template variations

In [55]:
score_dic = {}
for i in rel_list:
    score_dic[i] = f"BLEU: {results.loc[i]['weighted bleu']['mean']}, Levenshtein: {results.loc[i]['levenshtein distance']['mean']}, Similarity: {results.loc[i]['similarity']['mean']}, Success Rate: {results.loc[i]['success rate']['mean']}"

In [56]:
temp2 = 'data/prediction/pred_test_script_pretr_T2_sc+_html+_single/'
temp3 = 'data/prediction/pred_test_script_pretr_T3_sc+_html+_single/'
temp4 = 'data/prediction/pred_test_script_pretr_T4_sc+_html+_single/'
temp5 = 'data/prediction/pred_test_script_pretr_T5_sc+_html+_single/'

print("Template 1: ", score_dic[base])
print("Template 2: ", score_dic[temp2])
print("Template 3: ", score_dic[temp3])
print("Template 4: ", score_dic[temp4])
print("Template 5: ", score_dic[temp5])

Template 1:  BLEU: 0.4324, Levenshtein: 0.2564, Similarity: 0.0, Success Rate: 0.01
Template 2:  BLEU: 0.3987, Levenshtein: 0.3487, Similarity: 0.0, Success Rate: 0.01
Template 3:  BLEU: 0.3551, Levenshtein: 0.3976, Similarity: 0.0, Success Rate: 0.01
Template 4:  BLEU: 0.3796, Levenshtein: 0.3435, Similarity: 0.0, Success Rate: 0.01
Template 5:  BLEU: 0.4694, Levenshtein: 0.2318, Similarity: 0.0, Success Rate: 0.01


As shown above, template 1 (the base case) and template 5 perform better than the other templates on both BLEU and Levenshtein distance.
There is also a noticeable gap between templates 1 and 5, in favor of template 5 (BLEU 0.4694 vs. 0.4324).

### Variations with contextual information

In [57]:
no_scr = 'data/prediction/pred_test_script_pretr_T1_sc-_html+_single/'
no_html = 'data/prediction/pred_test_script_pretr_T1_sc+_html-_single/'
no_scrNhtml = 'data/prediction/pred_test_script_pretr_T1_sc-_html-_single/'

print("Template 1: ", score_dic[base])
print("No Screenshot: ", score_dic[no_scr])
print("No HTML: ", score_dic[no_html])
print("Neither: ", score_dic[no_scrNhtml])

Template 1:  BLEU: 0.4324, Levenshtein: 0.2564, Similarity: 0.0, Success Rate: 0.01
No Screenshot:  BLEU: 0.4189, Levenshtein: 0.3358, Similarity: 0.0, Success Rate: 0.01
No HTML:  BLEU: 0.2991, Levenshtein: 0.4102, Similarity: 0.0, Success Rate: 0.01
Neither:  BLEU: 0.3211, Levenshtein: 0.384, Similarity: 0.0, Success Rate: 0.01


As expected, the base case with the most contextual information performs best. It is worth noting, however, that the case without the screenshot is not far behind in BLEU (0.4189 vs. 0.4324).

### Variations within processing parameters

In [58]:
T1_concat_all = 'data/prediction/pred_test_script_pretr_T1_sc+_html+_all/'
T5_concat_single = 'data/prediction/pred_test_script_pretr_T5_sc+_html+_single/'
T5_concat_all = 'data/prediction/pred_test_script_pretr_T5_sc+_html+_all/'
print("T1 - Base/Single: ", score_dic[base])
print("T1 - All: ", score_dic[T1_concat_all])

print("T5 - Single: ", score_dic[T5_concat_single])
print("T5 - All: ", score_dic[T5_concat_all])

T1 - Base/Single:  BLEU: 0.4324, Levenshtein: 0.2564, Similarity: 0.0, Success Rate: 0.01
T1 - All:  BLEU: 0.3976, Levenshtein: 0.3037, Similarity: 0.0, Success Rate: 0.01
T5 - Single:  BLEU: 0.4694, Levenshtein: 0.2318, Similarity: 0.0, Success Rate: 0.01
T5 - All:  BLEU: 0.4483, Levenshtein: 0.2563, Similarity: 0.0, Success Rate: 0.01


The "single" concat mode also outperforms the "all" concat mode for both of our top-performing templates.

### Pretrained vs Finetuned

In [59]:
base = 'data/prediction/pred_test_script_pretr_T1_sc+_html+_single/'
pre_t5 = 'data/prediction/pred_test_script_pretr_T5_sc+_html+_single/'

print("Pre - Template 1: ", score_dic[base])
print("Pre - Template 5: ", score_dic[pre_t5])

ft_t1 = 'data/prediction/pred_test_script_finetuned_T1_sc+_html+_single/'
ft_t5  = 'data/prediction/pred_test_script_finetuned_T5_sc+_html+_single/'

print("Ft - Template 1: ", score_dic[ft_t1])
print("Ft - Template 5: ", score_dic[ft_t5])

Pre - Template 1:  BLEU: 0.4324, Levenshtein: 0.2564, Similarity: 0.0, Success Rate: 0.01
Pre - Template 5:  BLEU: 0.4694, Levenshtein: 0.2318, Similarity: 0.0, Success Rate: 0.01
Ft - Template 1:  BLEU: 0.5993, Levenshtein: 0.3145, Similarity: 0.0, Success Rate: 0.01
Ft - Template 5:  BLEU: 0.633, Levenshtein: 0.2088, Similarity: 0.0, Success Rate: 0.01


Finetuning improved our performance considerably. Both templates reach a BLEU score above 0.5 (0.5993 for template 1 and 0.633 for template 5), which suggests that they accurately reproduce the precondition and also manage to generate reasonably fitting new code.

<br><img src="./finetunevspretrained.png" width="1000" ><br>
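
The size of the change can be computed directly from the scores printed above. The snippet below only restates those reported means; the dictionary layout is chosen here for illustration.

```python
# (BLEU, Levenshtein) means copied from the output above
scores = {
    "T1": {"pretrained": (0.4324, 0.2564), "finetuned": (0.5993, 0.3145)},
    "T5": {"pretrained": (0.4694, 0.2318), "finetuned": (0.6330, 0.2088)},
}

deltas = {}
for template, s in scores.items():
    bleu_delta = round(s["finetuned"][0] - s["pretrained"][0], 4)
    lev_delta = round(s["finetuned"][1] - s["pretrained"][1], 4)
    deltas[template] = (bleu_delta, lev_delta)
    print(f"{template}: BLEU {bleu_delta:+.4f}, Levenshtein {lev_delta:+.4f}")
```

Note that a positive BLEU change is an improvement, whereas a positive Levenshtein change means the prediction moved further from the reference. For template 1 with screenshots, BLEU improves while the Levenshtein distance gets worse.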

### Finetuning with or without screenshots

In [60]:
t1_w = 'data/prediction/pred_test_script_finetuned_T1_sc+_html+_single/'
t5_w  = 'data/prediction/pred_test_script_finetuned_T5_sc+_html+_single/'

t1_wo = 'data/prediction/pred_test_script_finetuned_T1_sc-_html+_single/'
t5_wo = 'data/prediction/pred_test_script_finetuned_T5_sc-_html+_single/'

print("With Screenshot - Template 1: ", score_dic[t1_w])
print("With Screenshot - Template 5: ", score_dic[t5_w])

print("Without Screenshot - Template 1: ", score_dic[t1_wo])
print("Without Screenshot - Template 5: ", score_dic[t5_wo])

With Screenshot - Template 1:  BLEU: 0.5993, Levenshtein: 0.3145, Similarity: 0.0, Success Rate: 0.01
With Screenshot - Template 5:  BLEU: 0.633, Levenshtein: 0.2088, Similarity: 0.0, Success Rate: 0.01
Without Screenshot - Template 1:  BLEU: 0.7882, Levenshtein: 0.0726, Similarity: 0.0, Success Rate: 0.01
Without Screenshot - Template 5:  BLEU: 0.7871, Levenshtein: 0.0629, Similarity: 0.0, Success Rate: 0.01


# Conclusion

The finetuned model that used the templates without screenshots performed best by far, independent of the template.<br>
This seems to indicate that the most valuable information can be obtained without the screenshot, and that it is easier for the model to use once the noise introduced by the screenshot is removed.

In summary, the finetuned models without screenshots perform best by far. Looking only at the BLEU score, template 1 is slightly favored (0.7882 vs. 0.7871).<br>
In terms of how close we got to the human-made test script, however, template 5 is slightly ahead, as indicated by the Levenshtein distance (0.0629 vs. 0.0726).
So we cannot identify a clear winner, although template 5 seems to perform best on average over all observed configurations.
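
As a rough check of the "on average" statement, the means printed in this notebook can be averaged over the four configurations for which both templates were evaluated (pretrained single, pretrained all, finetuned with screenshots, finetuned without screenshots). This is an unweighted average of reported means and only a sketch; it ignores the standard deviations.

```python
from statistics import mean

# Order: pretrained/single, pretrained/all, finetuned/sc+, finetuned/sc-
t1_bleu = [0.4324, 0.3976, 0.5993, 0.7882]
t5_bleu = [0.4694, 0.4483, 0.6330, 0.7871]
t1_lev = [0.2564, 0.3037, 0.3145, 0.0726]
t5_lev = [0.2318, 0.2563, 0.2088, 0.0629]

summary = {
    "T1": (mean(t1_bleu), mean(t1_lev)),
    "T5": (mean(t5_bleu), mean(t5_lev)),
}
for template, (bleu, lev) in summary.items():
    print(f"{template}: mean BLEU {bleu:.4f}, mean Levenshtein {lev:.4f}")
```

On these four configurations, template 5 has the higher average BLEU and the lower average Levenshtein distance.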
