# Evaluation


## Evaluation Methods Summary

Metrics implemented and used in this project (see `src/evaluation/metrics.py`) in the evaluation process:

### 1. **Weighted BLEU Score** $ \in [0.0, 1.0] $
- **Purpose**: The BLEU score proposed by [Papineni et al. (2002)](https://aclanthology.org/P02-1040.pdf) [1], [2] is a metric that measures the similarity between two sequences of text. The weighted BLEU score is a variant implementd in this project that uses a weighted average of the BLEU scores of the precondition part and the actual generated additional part in teh generated test script. Measures the quality of the generated code by comparing it to reference (validation) code.
- **Description**: This method calculates the BLEU score with two components:
  - **Precondition Code Accuracy**: Evaluates how well the generated code matches the precondition code.
  - **New Lines Accuracy**: Assesses how well the generated code meets the goals by comparing the new lines added.
- **Formula**: 
  
  Weighted BLEU Score = (1 - α) * First BLEU Score + α * Second BLEU Score
  
  Where \(\alpha\) is a weight factor, typically set to 0.5.
- **Output**: A floating-point score indicating the degree of similarity between the generated and validation code.

### 2. **Success Rate** $ \in [0.0, 1.0] $
- **Purpose**: Evaluates whether the generated code successfully performs the intended functionality.
- **Description**: The generated code is executed in a testing environment, and the success rate is determined based on the test's result.
- **Procedure**:
  1. Modify the generated code to include screenshot commands.
  2. Save the updated code to a temporary file.
  3. Run the Playwright test.
  4. Return `1` if the test passes, otherwise `0`.
- **Output**: A binary score (`1` or `0`) representing test success.

### 3. **Levenshtein Distance** $ d(s, t) \in \mathbb{N} $
- **Purpose**: Measures the similarity between the generated code and validation code based on edit distance.
- **Description**: The Levenshtein distance between strings $ s $ and $ t $ is an integer that measures the number of single-character edits (insertions, deletions, or substitutions) needed to change the generated code into the validation code. The distance is normalized by the length of the longer code.
- **Formula**:
  Levenshtein Distance = Edit Distance / Max Length of Generated and Validation Code
- **Output**: A floating-point score between 0 and 1, where lower values indicate higher similarity.

### 4. **Similarity (Cosine Similarity)**
- **Purpose**: Assesses the similarity between screenshots of the generated code and the ground truth.
- **Description**: Uses a pre-trained ResNet-18 model to extract image embeddings and calculates the cosine similarity between embeddings of the predicted and ground truth images.
- **Procedure**:
  1. Load and preprocess the screenshots.
  2. Compute embeddings using the ResNet-18 model.
  3. Calculate cosine similarity between the embeddings.
- **Output**: A floating-point score between 0 and 1, indicating the similarity between the images.

## Evaluation Storage

- **Evaluation Data**: Evaluation results for different templates and options can be found under `data/scores`.
- **Evaluation Code**: The code used for evaluation is stored in `src/eval/metrics.py`.

These metrics collectively provide a comprehensive assessment of the generated code's quality and effectiveness.


## Evaluation Summary

The evaluation of generated code was conducted across various templates and configurations, addressing different aspects such as image presence, HTML inclusion, and model fine-tuning. The following types of evaluations were performed:

### 1. **Templates with Images**
- **Examples**: `pred_test_script_pretr_T4_sc+_html+_single`, `pred_test_script_finetuned_T5_sc+_html+_single`
- **Description**: Evaluated scripts that included image-based components, comparing generated outputs with expected results for scenarios where images were part of the test.

### 2. **Templates without Images**
- **Examples**: `pred_test_script_template_1_no_html_pretrained`, `pred_test_script_pretr_T1_sc-_html-_single`
- **Description**: Focused on templates where no images were included. This evaluated the performance and accuracy of generated code in the absence of image-based validation.

### 3. **HTML vs. No HTML**
- **Examples**: `pred_test_script_pretr_T1_sc+_html-_single`, `pred_test_script_finetuned_T5_sc-_html+_single`
- **Description**: Assessed how well the generated scripts handled scenarios with and without HTML components, testing their effectiveness in different contexts.

### 4. **Single vs. All Configurations**
- **Examples**: `pred_test_script_pretr_T5_sc+_html+_single`, `pred_test_script_pretr_T5_sc+_html+_all`
- **Description**: Included evaluations for both single-instance and all-instance configurations to determine the performance across various levels of complexity and data variety.

### 5. **Pretrained vs. Finetuned Models**
- **Examples**: `pred_test_script_finetuned_T5_sc+_html+_single`, `pred_test_script_pretr_T1_sc-_html+_single`
- **Description**: Compared results from pretrained models against those from finetuned models to assess improvements and differences in performance and accuracy.

### 6. **Different Attribute Lengths and Concatenation Modes**
- **Examples**: `pred_test_script_template_2_html_concat_mode_single_max_attr_length_50_pretrained`, `pred_test_script_template_2_html_concat_mode_all_max_attr_length_50_pretrained`
- **Description**: Evaluated the impact of different attribute lengths and concatenation modes on the performance of generated code.


This evaluation approach ensured a comprehensive assessment of the generated code under multiple conditions and configurations, providing insights into the effectiveness and accuracy of the different methods.


In [None]:
"""  
##Evaluation
In this section, we evaluate the generated code by comparing it with a validation code.
We use several metrics such as weighted BLEU score, success rate, and Levenshtein distance.
"""
import os
import esprima
from Levenshtein import distance 
from src.evaluation.metrics import strip_script_code
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction  


def calculate_scores(test_cases: List[dict], config: dict, metrics: list = None) -> dict:
    """Calculate scores for given metrics across multiple test cases.

    :param test_cases: List of dictionaries, each containing 'generated_code', 'validation_code',
                       'precondition_code', etc.
    :param config: The configuration dictionary.
    :param metrics: List of metrics to calculate the scores for, e.g., ['weighted bleu', 'success rate', 'levenshtein distance']
    :return: Dictionary containing scores for each metric across all test cases.
    """
    if metrics is None:
        metrics = ['weighted bleu', 'success rate', 'levenshtein distance', "similarity"]
    scores = {metric: [] for metric in metrics}

    for test_case in test_cases:
        test = test_case.get('test_case', '')
        test_step = test_case.get('test_step', '')
        generated_code = test_case.get('generated_code', '')
        validation_code = test_case.get('validation_code', '')
        precondition_code = test_case.get('precondition_code', '')

        image_folder_pred = os.path.normpath(config['paths']['eval_run_dir'])
        image_folder_gt = os.path.normpath(config['paths']['gt_images'])
        logger.debug(f"Calculating scores for test case {test}_{test_step}...")

        for metric in metrics:
            try:
                match metric:
                    case 'weighted bleu':
                        scores[metric].append(
                            calculate_weighted_bleu_score(generated_code, validation_code, precondition_code)
                        )
                    case 'success rate':
                        file_name = test + "_" + test_step + ".spec.ts"
                        scores[metric].append(
                            calculate_success_rate(generated_code, file_name=file_name, config=config)
                        )
                    case 'similarity':
                        file_name = test + "_" + test_step
                        screenshot_path_pred = os.path.join(image_folder_pred, f"{file_name}.png")
                        screenshot_path_gt = os.path.join(image_folder_gt, f"{file_name}.png")

                        model = models.resnet18(weights=ResNet18_Weights.DEFAULT)
                        model.eval()

                        if not os.path.exists(screenshot_path_gt):
                            scores[metric].append(0)
                        else:
                            # Define your preprocessing steps here
                            preprocess = transforms.Compose([
                                transforms.Resize((224, 224)),  # Resize the images to the size expected by the model
                                transforms.ToTensor(),  # Convert the image to a PyTorch tensor
                                transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
                                # Normalize the tensor
                            ])
                            scores[metric].append(
                                encode_and_calculate_similarity(screenshot_path_pred, screenshot_path_gt, model=model,
                                                                preprocess=preprocess)
                            )
                    case 'levenshtein distance':
                        scores[metric].append(
                            calculate_levenshtein_distance(generated_code, validation_code)
                        )
                    case _:
                        scores[metric].append(None)
                        logger.warning(f"Unknown metric: {metric}. Skipping...")
            except Exception as e:
                scores[metric].append(None)
                logger.error(f"Error calculating {metric} for test case {test}_{test_step}: {e}")
    return scores


def encode_and_calculate_similarity(pred_img_path, gt_img_path, model, preprocess):
    # Check if the prediction image path exists
    if not os.path.exists(pred_img_path):
        return 0

    # Load and preprocess the images
    image1 = Image.open(pred_img_path)
    image2 = Image.open(gt_img_path)
    input_tensor1 = preprocess(image1).unsqueeze(0)  # Create a mini-batch as expected by the model
    input_tensor2 = preprocess(image2).unsqueeze(0)

    # Encode the images
    with torch.no_grad():
        embedding1 = model(input_tensor1)
        embedding2 = model(input_tensor2)

    # Calculate cosine similarity
    cos_sim = cosine_similarity(embedding1.numpy(), embedding2.numpy())
    return round(cos_sim.item(), 4)


### Weighted BLEU Score  

For the scoring of our generated predictions, we chose multiple scores. Our foremost score is the BLEU score, which is a frequent metric for LLMs, so we chose to implement it here as well.
The BLEU score uses a similarity measure between the n-grams of the sample compared to the references he gets.<br>
The reference in our case is the human-made code for the described test step. We chose to use the default configuration for the BLEU weights, which utilizes 1-grams up to 4-grams. All n-grams are uniformly weighted and have equal weight in the result.

Our weighted Bleu score separately evaluates the code from the precondition, which is copied by the LLM, and the newly generated code for the current step. This is intended to decouple the final result from the length of the precondition. Both parts are evaluated with 50 percent each.

In [None]:

def calculate_weighted_bleu_score(generated_code: str, validation_code: str, precondition_code: str,
                                  alpha: float = 0.5) -> float:
    """ This method returns the BLEU score of the given generated code.

    :param generated_code: The generated code from the LLM as Python or TypeScript playwright script.
    :param validation_code: Examples for validation as Python or TypeScript playwright script.
    :param precondition_code: The precondition of the step as Python or TypeScript playwright script
    :param alpha: The weight of the second part of the BLEU score.
    :return: The BLEU score of the given generated code.
    """
    generated_code_tokens = esprima.tokenize(generated_code)
    validation_code_tokens = esprima.tokenize(validation_code)
    precondition_code_tokens = esprima.tokenize(precondition_code)

    # Convert tokens to string
    generated_code_tokens = [str(elem) for elem in generated_code_tokens]
    validation_code_tokens = [str(elem) for elem in validation_code_tokens]
    precondition_code_tokens = [str(elem) for elem in precondition_code_tokens]

    precondition_code_length = len(precondition_code_tokens)
    precondition_code_length_without_end_lines = -1
    for i in range(precondition_code_length):
        if validation_code_tokens[i] != precondition_code_tokens[i]:
            precondition_code_length_without_end_lines = i
            break

    # Define a smoothing function for BLEU score calculation
    smoothing_function = SmoothingFunction().method1

    # The first part: Has the LLM correctly copied the precondition code?
    first_bleu_score = sentence_bleu(references=[validation_code_tokens[:precondition_code_length_without_end_lines]],
                                     hypothesis=generated_code_tokens[:precondition_code_length_without_end_lines],
                                     smoothing_function=smoothing_function)

    # The second part: Has the LLM correctly added the new lines to reach the given goal?
    second_bleu_score = sentence_bleu(references=[validation_code_tokens[precondition_code_length_without_end_lines:]],
                                      hypothesis=generated_code_tokens[precondition_code_length_without_end_lines:],
                                      smoothing_function=smoothing_function)

    return (1 - alpha) * first_bleu_score + alpha * second_bleu_score


def calculate_success_rate(generated_code: str, file_name: str, config: dict):
    """Returns the success rate of the given generated code."""
    try:
        # Normalize paths
        eval_run_dir = os.path.normpath(config['paths']['eval_run_dir'])
        screen_shot_dir = os.path.join(eval_run_dir, 'screenshots')

        # Logging paths and current working directory
        logger.debug(f"Current working directory: {os.getcwd()}")
        logger.debug(f"Screenshot directory: {screen_shot_dir}")
        logger.debug(f"Evaluation run directory: {eval_run_dir}")

        # Create directories for screenshots
        os.makedirs(screen_shot_dir, exist_ok=True)
        file_name_png = file_name.split(".")[0]  # remove .spec.ts
        screenshot_path = os.path.join(screen_shot_dir, f"{file_name_png}.png")
        # Replace backslashes with forward slashes
        screenshot_path = screenshot_path.replace("\\", "/")
        logger.debug(f"Screenshot path: {screenshot_path}")

        screenshot_code = f"  await page.screenshot({{ path: '{screenshot_path}' }});\n"
        time_out = 30000
        time_out_code = f"  test.setTimeout({time_out});\n"

        generated_code = generated_code.split("\n")

        #### insert screenshot code
        # Find the position to insert the screenshot command
        assert type(generated_code) == type([])
        insert_position = 0
        for i, line in enumerate(generated_code):
            if 'async' in line and 'test(' in line:
                insert_position = i + 1
                break

        # Insert the timeout code and screenshot command
        generated_code.insert(insert_position, time_out_code)
        found_position = 0
        last_await_position = 0
        for i, line in enumerate(generated_code):
            if "await page.close()" in line or "await context.close()" in line or "await browser.close()" in line:
                found_position = 1
                insert_position = i
                break
            if "await" in line:
                last_await_position = i + 1  # position after the last "await" line

        if found_position == 0:
            insert_position = last_await_position

        # Insert the screenshot and HTML extraction commands
        generated_code.insert(insert_position, screenshot_code)
        ####

        # Ensure the directory for the temp_path exists
        temp_dir = os.path.join(eval_run_dir, "test_script")
        os.makedirs(temp_dir, exist_ok=True)
        logger.debug(f"Created temp directory: {temp_dir}")

        temp_path = os.path.join(temp_dir, file_name)
        logger.debug(f"Temp file path: {temp_path}")

        # Save updated test code to a temporary file
        try:
            with open(temp_path, 'w', encoding="utf-8") as file:
                file.write("\n".join(generated_code))
            logger.debug(f"File {temp_path} created successfully.")
        except Exception as e:
            logger.error(f"Failed to create the file {temp_path}. Error: {e}")
            return 0

        # Small delay to account for file system delays
        time.sleep(1)

        # Run the Playwright test
        try:
            logger.debug(f"Current working directory: {os.getcwd()}")
            result = os.system(f"npx playwright test {temp_path} --config=config/playwright.config.ts")
            score = 1 if result == 1 else 0
        except Exception as e:
            logger.error(f"Failed to run Playwright test. Error: {e}")
            return 0

        # Delete temp file after test run if defined in config
        if config.get('evaluation', {}).get('delete_temp_files', False):
            try:
                os.remove(temp_path)
                logger.debug(f"Deleted temp file {temp_path}.")
            except Exception as e:
                logger.error(f"Failed to delete temp file {temp_path}. Error: {e}")

        logger.debug(f"Playwright test result: {result}")
        return score

    except Exception as e:
        logger.error(f"An error occurred: {e}")
        return 0


### Levenshtein Distance
The scoring function for our validation samples is the Levenshtein distance. This measures the distance between two strings by the amount of necessary single-character operations to turn one string into another. These operations are remove, add, and replace. So the maximum distance this measure can calculate is the maximum length of input strings. Since later tests involve longer preconditions and descriptions the chance for mistakes is higher, so the Levenshtein distance would always increase for the longer test and be lower for the shorter tests. To counteract this, we normalize the distance to values between 0 and 1 by dividing with the maximum length of the input strings. This represents small errors or deviations in tests much better since those should perform better than very short and error-riddled tests.

In [None]:
def calculate_levenshtein_distance(generated_code, validation_code):
    """ This method returns the Levenshtein distance of the given generated code.

    :param generated_code: The generated code from the LLM as TypeScript playwright script.
    :param validation_code: Examples for validation TypeScript playwright script.
    :return: The Levenshtein distance over the length of the max code length of the given generated code.
    """
    gen_script = ' '.join(strip_script_code(generated_code))
    vd_script = ' '.join(strip_script_code(validation_code))

    len_gen = len(generated_code)
    len_valid = len(validation_code)
    max_len = max(len_gen, len_valid)

    score = distance(gen_script, vd_script) / max_len

    return score

# Validate the generated code (if validation_path is provided)
if validation_path:
    validation_code = parse_code(validation_path)
    scores = calculate_scores(generated_code=generated_code, validation_code=validation_code, precondition_code=precondition_text, programming_language='Python')  # Adjust language as needed
    print("Validation Scores:", scores)


# Evaluation Summary

In [1]:
import pandas as pd
import os

In [4]:
df = pd.read_csv('./../../data/scores/eval_scores_all.csv')

In [6]:
list = df['prediction_dir'].unique()
excluded_list = []
for i in list:
    if "max_attr" in i:
        excluded_list.append(i)
    elif "test_set" in i:
        excluded_list.append(i)
    elif "_sc" not in i:
        excluded_list.append(i)
    elif "_html" not in i:
        excluded_list.append(i)
    elif "_T" not in i:
        excluded_list.append(i)
rel_list = [x for x in list if x not in excluded_list]

df_dic = {}
score_dic = {}
for el in rel_list:
    df_dic[el] = df[df['prediction_dir'] == el]
for el in rel_list:
    bleu = df_dic[el]['weighted bleu'].mean()
    succ_rate = df_dic[el]['success rate'].mean()
    lev_dist = df_dic[el]['levenshtein distance'].mean()
    sim = df_dic[el]['similarity'].mean()
    
    score_dic[el] = [bleu, succ_rate, lev_dist, sim]

In [7]:
for el in rel_list:
    print(el,": ", score_dic[el],"\n")# order BLEU,SucRate,LevDist, SimScore

data/prediction/pred_test_script_finetuned_T1_sc+_html+_single/ :  [0.6238876998306156, 1.0, 0.3144528211805267, 0.0] 

data/prediction/pred_test_script_finetuned_T1_sc-_html+_single/ :  [0.7978877506485907, 1.0, 0.07263106238627116, 0.0] 

data/prediction/pred_test_script_finetuned_T5_sc+_html+_single/ :  [0.6539843409967431, 1.0, 0.20876644679554532, 0.0] 

data/prediction/pred_test_script_finetuned_T5_sc-_html+_single/ :  [0.7951108638978999, 1.0, 0.06291960074643649, 0.0] 

data/prediction/pred_test_script_pretr_T1_sc+_html+_all/ :  [0.410849679497375, 1.0, 0.3037133970071445, 0.0] 

data/prediction/pred_test_script_pretr_T1_sc+_html+_single/ :  [0.44735627392871585, 1.0, 0.25635185155034884, 0.0] 

data/prediction/pred_test_script_pretr_T1_sc+_html-_single/ :  [0.2985081180354515, 1.0, 0.4101583529458054, 0.0] 

data/prediction/pred_test_script_pretr_T1_sc-_html+_single/ :  [0.43623929414048274, 1.0, 0.3357728404355431, 0.0] 

data/prediction/pred_test_script_pretr_T1_sc-_html-_si

General observation is that all our generated scripts seemed to work with playwright without generating any fatal errors as the success rate is "1.0" for all.

In [8]:
# Baseline pretrained_model, Template1 with screenshots with html concat-mode: single
base = 'data/prediction/pred_test_script_pretr_T1_sc+_html+_single/'
print("Template 1: ", score_dic[base])

Template 1:  [0.44735627392871585, 1.0, 0.25635185155034884, 0.0]


This is our baseline performance for this evaluation

### Template variations

In [9]:
temp2 = 'data/prediction/pred_test_script_pretr_T2_sc+_html+_single/'
temp3 = 'data/prediction/pred_test_script_pretr_T3_sc+_html+_single/'
temp4 = 'data/prediction/pred_test_script_pretr_T4_sc+_html+_single/'
temp5 = 'data/prediction/pred_test_script_pretr_T5_sc+_html+_single/'

print("Template 1: ", score_dic[base])
print("Template 2: ", score_dic[temp2])
print("Template 3: ", score_dic[temp3])
print("Template 4: ", score_dic[temp4])
print("Template 5: ", score_dic[temp5])

Template 1:  [0.44735627392871585, 1.0, 0.25635185155034884, 0.0]
Template 2:  [0.3994676776884143, 1.0, 0.348674571339635, 0.0]
Template 3:  [0.39687346488140346, 1.0, 0.39756964672408657, 0.0]
Template 4:  [0.3926198363216162, 1.0, 0.3434960266372642, 0.0]
Template 5:  [0.47075838428413214, 1.0, 0.23181610064026512, 0.0]


As we see above, the base case with template 1 and template 5 perform far better than the other templates.
Though it is to be mentioned that there is also large gap between template 1 und 5 in favor of template 5

### Variations with contextual informations

In [10]:
no_scr = 'data/prediction/pred_test_script_pretr_T1_sc-_html+_single/'
no_html = 'data/prediction/pred_test_script_pretr_T1_sc+_html-_single/'
no_scrNhtml = 'data/prediction/pred_test_script_pretr_T1_sc-_html-_single/'

print("Template 1: ", score_dic[base])
print("No Screenshot: ", score_dic[no_scr])
print("No HTML: ", score_dic[no_html])
print("Neither: ", score_dic[no_scrNhtml])

Template 1:  [0.44735627392871585, 1.0, 0.25635185155034884, 0.0]
No Screenshot:  [0.43623929414048274, 1.0, 0.3357728404355431, 0.0]
No HTML:  [0.2985081180354515, 1.0, 0.4101583529458054, 0.0]
Neither:  [0.32615808305155664, 1.0, 0.3839561739841793, 0.0]


As expected the base case with the most contextual information performs the best. But important to notice the case without the screenshot is not far behind. 

### Variations within processing parameters

In [11]:
T1_concat_all = 'data/prediction/pred_test_script_pretr_T1_sc+_html+_all/'
T5_concat_single = 'data/prediction/pred_test_script_pretr_T5_sc+_html+_single/'
T5_concat_all = 'data/prediction/pred_test_script_pretr_T5_sc+_html+_all/'
print("T1 - Base/Single: ", score_dic[base])
print("T1 - All: ", score_dic[T1_concat_all])

print("T5 - Single: ", score_dic[T5_concat_single])
print("T5 - All: ", score_dic[T5_concat_all])

T1 - Base/Single:  [0.44735627392871585, 1.0, 0.25635185155034884, 0.0]
T1 - All:  [0.410849679497375, 1.0, 0.3037133970071445, 0.0]
T5 - Single:  [0.47075838428413214, 1.0, 0.23181610064026512, 0.0]
T5 - All:  [0.45370072478415674, 1.0, 0.2562852252312223, 0.0]


Also the "single"-concat mode seems to outperform "all"-concat mode for both our top performing templates.

### Pretrained vs Finetuned

In [12]:
base = 'data/prediction/pred_test_script_pretr_T1_sc+_html+_single/'
pre_t5 = 'data/prediction/pred_test_script_pretr_T5_sc+_html+_single/'

print("Pre - Template 1: ", score_dic[base])
print("Pre - Template 5: ", score_dic[pre_t5])

ft_t1 = 'data/prediction/pred_test_script_finetuned_T1_sc+_html+_single/'
ft_t5  = 'data/prediction/pred_test_script_finetuned_T5_sc+_html+_single/'

print("Ft - Template 1: ", score_dic[ft_t1])
print("Ft - Template 5: ", score_dic[ft_t5])

Pre - Template 1:  [0.44735627392871585, 1.0, 0.25635185155034884, 0.0]
Pre - Template 5:  [0.47075838428413214, 1.0, 0.23181610064026512, 0.0]
Ft - Template 1:  [0.6238876998306156, 1.0, 0.3144528211805267, 0.0]
Ft - Template 5:  [0.6539843409967431, 1.0, 0.20876644679554532, 0.0]


Our finetuning did improve our performance quite a bit. Both templates reach a BLEU score above 50 which means they are able to accurately return the precondition and also manage to generate somewhat fitting new code.

### Finetuning with or without screenshots

In [13]:
t1_w = 'data/prediction/pred_test_script_finetuned_T1_sc+_html+_single/'
t5_w  = 'data/prediction/pred_test_script_finetuned_T5_sc+_html+_single/'

t1_wo = 'data/prediction/pred_test_script_finetuned_T1_sc-_html+_single/'
t5_wo = 'data/prediction/pred_test_script_finetuned_T5_sc-_html+_single/'

print("With Screenshot - Template 1: ", score_dic[t1_w])
print("With Screenshot - Template 5: ", score_dic[t5_w])

print("Without Screenshot - Template 1: ", score_dic[t1_wo])
print("Without Screenshot - Template 5: ", score_dic[t5_wo])

With Screenshot - Template 1:  [0.6238876998306156, 1.0, 0.3144528211805267, 0.0]
With Screenshot - Template 5:  [0.6539843409967431, 1.0, 0.20876644679554532, 0.0]
Without Screenshot - Template 1:  [0.7978877506485907, 1.0, 0.07263106238627116, 0.0]
Without Screenshot - Template 5:  [0.7951108638978999, 1.0, 0.06291960074643649, 0.0]


# Conclusion

The version of the finetuned model which used the templates without screenshots seemed to perform the best by far. Independent of the template.<br>
Which seems to indicate that the most valuable data can be gained without screenshot and can be easier understood if the noise generated by the screenshot is removed.

Thus, in summary the finetuned models without screenshots perform by far the best. Though if we were to only look at the BLEU Score, template 1 seems to be slightly favored.<br>
But in regard to how close we got to the human-made test script, template 5 seems to be slightly ahead as indicated by the Levenshtein distance.
So we can't identify a clear winner, template 5 seems to perform on average the best over all observed configurations.  
