# Pipeline
This notebook describes our pipeline for generating UI tests with a multimodal large language model (LLM). It gives a short overview of the data preparation, the finetuning of the model, and the actual test generation. The pipeline is based on LLaVA, a multimodal language model that can generate code from combined text and image input.

We managed all configurations and settings in a ```config.yaml``` file. Our final pipeline and its components are shown in the image below. The pipeline consists of the following steps:

<br>
<img src="./pipeline_overview.png" width="1500" placeholder="Pipeline Overview" title="Pipeline Overview">
<br>

1. ✅ HTML Processing: Extract relevant information from the HTML file.
    * Use **BeautifulSoup** or lxml in Python to parse and extract information from the HTML file.
➡️ see src.data.html_processor.py
2. ✅ (Skip) Image Processing: Pass the image file to the multimodal model. For text-based LLMs, we also implemented generating an image description with the Tesseract OCR engine.
➡️ see src.data.image_processor.py
(Optional: summarize the image information, the HTML information, and the prompt from the Playwright test code using a T5 model.)
3. ✅ Precondition Parser: Parse the given Playwright test code of the previous step as the precondition.
➡️ see src.data.code_processor.py
4. ✅ Input Prompt Generation: Combine the extracted information from the HTML and the image with the selected template prompt for the LLM model.
➡️ see src.data.input_combiner.py and config/config.yaml
5. ✅ Test Generation: Pass the combined input to the language model for generating the UI test code. This also includes the finetuning of the model.
➡️ see main.py and also in this notebook
------ See other summary notebook for following steps ------
6. ✅ Evaluation: Evaluate the generated test code using the Playwright test framework.
➡️ see src.evaluation.metrics.py and src.scripts.evaluation.py

# Data Preparation

### Changes to general database

#### Adding the precondition to the Database
As described in the Config section, our prompts include contextual information from the previous step to improve the code generated by the model. 
The precondition of every first step of a test case is that the user is logged in. This login is given in the test context "0_1".  
We therefore add this step to each test case as the first step "X_0", so that each first step also has a precondition to rely on. 

First, we need to identify all test cases that need the login step as their first step. We do this by checking the test_scripts for the test case numbers.

Afterwards, we create the first entry for each test case from the 0_1 test script.

In [None]:
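import sqlite3

# Example values; the original source is ./notebooks/240717_Chris_JsonSplit.ipynb.
tc_ids = "1,13,14,15,19".split(",")

# Assumption for this example: an SQLite file with an existing `tests` table of six columns.
# Replace the connection with the project's own connector.
conn = sqlite3.connect("tests.db")  # hypothetical path
cursor = conn.cursor()

for tc_id in tc_ids:
    step_id = f"{tc_id}.0"
    steps = ""
    expectation = ""
    html = ".\\html\\0_1.html"
    screenshot = ".\\screenshot\\0_1.png"
    test_script = ".\\test_script\\0_1.spec.ts"

    # Parameterized query instead of string formatting, so that quotes in values cannot break the SQL
    cursor.execute(
        "INSERT INTO tests VALUES (?, ?, ?, ?, ?, ?)",
        (step_id, steps, expectation, html, screenshot, test_script),
    )

# Commit once after all rows are inserted, then close the connection
conn.commit()
conn.close()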

After this, our database fits the input generation, and all test steps work with their preconditions.
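The identification of the test cases described above can be sketched as follows. This is a minimal sketch, not the project's code: it assumes that test scripts are named `<test_case>_<step>.spec.ts` (inferred from names such as `0_1.spec.ts`), and the helpers `find_test_case_ids` and `login_step_row` are hypothetical.

```python
import re

# Assumed naming scheme "<test_case>_<step>.spec.ts" (inferred, not confirmed by the project)
SCRIPT_NAME = re.compile(r"^(\d+)_(\d+)\.spec\.ts$")


def find_test_case_ids(script_names):
    """Return the sorted test case numbers found in the given script file names."""
    ids = set()
    for name in script_names:
        match = SCRIPT_NAME.match(name)
        # Test context 0 is the shared login step, not a test case of its own
        if match and match.group(1) != "0":
            ids.add(int(match.group(1)))
    return sorted(ids)


def login_step_row(tc_id):
    """Build the database row for the artificial login step of one test case."""
    return (
        f"{tc_id}.0",
        "",
        "",
        ".\\html\\0_1.html",
        ".\\screenshot\\0_1.png",
        ".\\test_script\\0_1.spec.ts",
    )


names = ["0_1.spec.ts", "1_1.spec.ts", "1_2.spec.ts", "13_1.spec.ts", "notes.txt"]
print(find_test_case_ids(names))
```

Files that do not match the pattern are ignored, so stray files in the folder do not create test cases.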

### General prompt structure

The prompt is divided into 5 sections:
-	Simplified HTML Content: This section contains a list of the extracted HTML elements.
-	Playwright Test Precondition: This section contains the code that must be executed to get to the point from which the current step has to be executed.
-	UI Test Description: This section contains the test description for the current step, which comes from the user.
-	Screenshot: This section contains the image.
-	Task: The task for the LLM is described in this section. There are various templates for this section.

In the following, the input prompt with Template_1 is given for 01_01.


In [7]:
print("### Simplified HTML Content: Buttons: {\"id\": \"navigationTrigger\", \"class\": \"button button-icon button-borderless\"} {\"id\": \"workbook-create\", \"class\": \"button workbook-create button-icon\"} {\"id\": \"RDxYr2vFytOijWjelj7P1\", \"class\": \"button navigation-menu button-icon\"} {\"text\": \"Arbeitsmappe importieren\", \"class\": \"button\"} {\"text\": \"Repository neu einlesen …\", \"class\": \"button\"} Inputs: {\"class\": \"select2-search__field\", \"aria-label\": \"Suchen nach …\", \"type\": \"search\", \"placeholder\": \"Suchen nach …\"} Links: {\"text\": \"Zum Navigatorbaum springen\", \"id\": \"skip-to-navigator\", \"class\": \"button button-primary\"} {\"text\": \"Zum Hauptbereich springen\", \"id\": \"skip-to-content\", \"class\": \"button button-primary\"} {\"text\": \"Startseite\", \"id\": \"home\", \"class\": \"button button-icon button-borderless\"} {\"text\": \"Karte\"} {\"text\": \"Verzeichnis Tutorial\", \"id\": \"d-nav-tree-node_ROOT-Tutorial_firstContent\", \"class\": \"d-nav-tree-node--main d-hover-context\"} {\"text\": \"Verzeichnis Gewässergüte\", \"id\": \"d-nav-tree-node_ROOT-Gewässergüte_firstContent\", \"class\": \"d-nav-tree-node--main d-hover-context\"} {\"text\": \"Verzeichnis Automobile\", \"id\": \"d-nav-tree-node_ROOT-Automobile_firstContent\", \"class\": \"d-nav-tree-node--main d-hover-context\"} {\"text\": \"Verzeichnis Ergänzende Geodaten\", \"class\": \"d-nav-tree-node--main d-hover-context\"} {\"text\": \"Verzeichnis Zentrale Dienste\", \"id\": \"d-nav-tree-node_ROOT-Zentrale-Dienste_firstContent\", \"class\": \"d-nav-tree-node--main d-hover-context\"} {\"text\": \"Verzeichnis Meine Arbeitsmappen\", \"class\": \"d-nav-tree-node--main d-hover-context\"} {\"text\": \"Arbeitsmappe Zugangsdaten\", \"class\": \"d-nav-tree-node--main d-hover-context\"} {\"text\": \"Zugangsdaten\", \"class\": \"d-nav-tree-node--text ellipsis\"} {\"text\": \"disy Cadenza\"} {\"text\": \"Tutorials\"} {\"text\": \"Lernmodulen\"} 
{\"text\": \"Onlinehilfe\"} {\"text\": \"Webseite\"} {\"text\": \"Lernmodulen\"} {\"text\": \"1\", \"class\": \"startpage-section-navigation-item\"} {\"text\": \"2\", \"class\": \"startpage-section-navigation-item\"} {\"text\": \"3\", \"class\": \"startpage-section-navigation-item\"} {\"text\": \"4\", \"class\": \"startpage-section-navigation-item\"} {\"text\": \"5\", \"class\": \"startpage-section-navigation-item\"} {\"text\": \"6\", \"class\": \"startpage-section-navigation-item\"} {\"text\": \"7\", \"class\": \"startpage-section-navigation-item\"} {\"text\": \"disy Cadenza v9.4.71\", \"class\": \"userSpecificLink ellipsis hidden-xs\"} {\"text\": \"© Disy Informationssysteme GmbH\", \"class\": \"userSpecificLink ellipsis hidden-xs\"} {\"text\": \"Über Disy\", \"class\": \"userSpecificLink ellipsis\"} \n\n### Playwright Test Precondition: import { test, expect } from '@playwright/test'; import { writeFileSync } from 'fs'; test('test', async ({ page }) => { await page.goto('http://localhost:8080/cadenza/'); await page.getByRole('link', { name: 'Anmelden' }).click(); await page.getByLabel('Benutzername *').click(); await page.getByLabel('Benutzername *').fill('Admin'); await page.getByLabel('Benutzername *').press('Tab'); await page.getByPlaceholder(' ').fill('Admin'); await page.getByRole('button', { name: 'Anmelden' }).click(); }); \n\n### UI Test Description: Öffne die Arbeitsmappe \"Übersicht Messstellen\" im Ordner \"Gewässergüte\". \n\n### Screenshot: <image> \n\n### Task: You are a test automation script writer. I will describe a UI test in German and you will generate Playwright test code for the given webpage. Strictly follow these instructions: 1. Don't explain the code, just generate the code block itself. 2. You get some HTML elements and its attributes from the website. You can use playwright locators to find the elements by their attributes. 3. Use the precondition code to set up the initial state. You must continue the code. 4. 
Follow the steps in the ui test description to perform actions on the website. 5. Use the screenshot to understand the context of the test. Assistant:")

### Simplified HTML Content: Buttons: {"id": "navigationTrigger", "class": "button button-icon button-borderless"} {"id": "workbook-create", "class": "button workbook-create button-icon"} {"id": "RDxYr2vFytOijWjelj7P1", "class": "button navigation-menu button-icon"} {"text": "Arbeitsmappe importieren", "class": "button"} {"text": "Repository neu einlesen …", "class": "button"} Inputs: {"class": "select2-search__field", "aria-label": "Suchen nach …", "type": "search", "placeholder": "Suchen nach …"} Links: {"text": "Zum Navigatorbaum springen", "id": "skip-to-navigator", "class": "button button-primary"} {"text": "Zum Hauptbereich springen", "id": "skip-to-content", "class": "button button-primary"} {"text": "Startseite", "id": "home", "class": "button button-icon button-borderless"} {"text": "Karte"} {"text": "Verzeichnis Tutorial", "id": "d-nav-tree-node_ROOT-Tutorial_firstContent", "class": "d-nav-tree-node--main d-hover-context"} {"text": "Verzeichnis Gewässergüte", "id": "d-nav-tre
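How the five sections could be combined into one prompt string is sketched below. The function `build_prompt` and its parameters are hypothetical; the project's own implementation is in src.data.input_combiner.py and may differ in its details. The sketch only mirrors the section titles and the order visible in the prompt above.

```python
def build_prompt(html_summary, precondition, description, task,
                 include_html=True, include_screenshot=True):
    """Join the prompt sections in the order used in this notebook."""
    sections = []
    if include_html:
        sections.append(("Simplified HTML Content", html_summary))
    sections.append(("Playwright Test Precondition", precondition))
    sections.append(("UI Test Description", description))
    if include_screenshot:
        # The multimodal model replaces this token with the image features
        sections.append(("Screenshot", "<image>"))
    sections.append(("Task", task))
    body = "\n\n".join(f"### {title}: {text}" for title, text in sections)
    return body + " Assistant:"


prompt = build_prompt(
    html_summary='Buttons: {"id": "workbook-create"}',
    precondition="test('test', async ({ page }) => { });",
    description='Öffne die Arbeitsmappe "Übersicht Messstellen".',
    task="You are a test automation script writer.",
)
print(prompt)
```

The two flags correspond to the `include_html` and `include_screenshot` settings, so the same function can build text-only and multimodal prompts.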

### Config 

We vary the config to test different scenarios.
In the preprocessing settings, we configure the maximum attribute length and the concatenation mode of the HTML processing.

In the input settings, we use the following parameters for our tests:
- template_name: The prompt template to use. We have 5 templates, 3 in English and 2 in German.
- include_html: Whether the input includes the processed HTML file.
- include_screenshot: Whether the model is given a screenshot as input.


These are essentially the testing parameters. The following is copied from: ```config/config.yaml```
```yaml
preprocessing:
  html:
    concat_mod: "single" # or all
    max_attr_length: 50
  image:
    max_length: 100

input_settings:
  model_type: "multimodal" # or 'text_based'
  template_name: "template_5"
  include_html: True
  include_screenshot: True
```
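The file and model names in this notebook contain tags such as `T5_sc+_html+_single`, which appear to encode these settings. The following sketch derives such a tag from the config values. The naming scheme is inferred from those names and the helper `run_tag` is hypothetical; the config is written as a plain dict here so that the example has no dependencies.

```python
# The settings from config/config.yaml, written as a plain dict for this example
config = {
    "preprocessing": {
        "html": {"concat_mod": "single", "max_attr_length": 50},
        "image": {"max_length": 100},
    },
    "input_settings": {
        "model_type": "multimodal",
        "template_name": "template_5",
        "include_html": True,
        "include_screenshot": True,
    },
}


def run_tag(cfg):
    """Build a short tag such as 'T5_sc+_html+_single' (assumed naming scheme)."""
    inputs = cfg["input_settings"]
    template_number = inputs["template_name"].rsplit("_", 1)[-1]
    screenshot = "+" if inputs["include_screenshot"] else "-"
    html = "+" if inputs["include_html"] else "-"
    concat_mode = cfg["preprocessing"]["html"]["concat_mod"]
    return f"T{template_number}_sc{screenshot}_html{html}_{concat_mode}"


print(run_tag(config))
```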


### Simplified HTML Content

* In Simplified HTML Content we extract relevant information from the HTML file.
* We use **BeautifulSoup** to parse and extract information from the HTML file. ➡️ see ```src.data.html_processor.py``` for more information.
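The extraction idea can be sketched as follows. To keep the example free of dependencies, it uses Python's standard-library `html.parser` in place of BeautifulSoup, so it is not the project's implementation. The list of kept attributes and the rule that attribute values longer than `max_attr_length` are dropped are assumptions for illustration.

```python
from html.parser import HTMLParser

# Attributes kept in the simplified output (assumed selection)
KEPT_ATTRS = ("id", "class", "aria-label", "type", "placeholder")
GROUPS = {"button": "Buttons", "input": "Inputs", "a": "Links"}


class ElementCollector(HTMLParser):
    """Collect buttons, inputs, and links with a few of their attributes."""

    def __init__(self, max_attr_length=50):
        super().__init__()
        self.max_attr_length = max_attr_length
        self.groups = {"Buttons": [], "Inputs": [], "Links": []}
        self._open = None  # entry that is currently collecting text

    def handle_starttag(self, tag, attrs):
        group = GROUPS.get(tag)
        if group is None:
            return
        entry = {
            key: value
            for key, value in attrs
            if key in KEPT_ATTRS and value and len(value) <= self.max_attr_length
        }
        self.groups[group].append(entry)
        if tag != "input":  # <input> is a void element without text content
            self._open = entry

    def handle_data(self, data):
        text = data.strip()
        if self._open is not None and text:
            previous = self._open.get("text", "")
            self._open["text"] = f"{previous} {text}".strip()

    def handle_endtag(self, tag):
        if tag in ("button", "a"):
            self._open = None


collector = ElementCollector(max_attr_length=50)
collector.feed(
    '<button id="workbook-create" class="button">Neu</button>'
    '<input type="search" placeholder="Suchen nach …">'
    '<a id="home" href="/cadenza/">Startseite</a>'
)
for group, entries in collector.groups.items():
    print(group, entries)
```

Attributes outside `KEPT_ATTRS`, such as `href`, are discarded, which keeps the prompt short.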

### Task

Different templates can be selected for the task description: 
* Template_1:
  * Instruction 1 keeps the LLM from explaining the generated code.
  * Instruction 2 tells the LLM which elements it receives from the HTML.
  * Instruction 3 makes the LLM continue the precondition code.
  * Instruction 4 points the LLM to the UI test description for the steps to be added.
  * Instruction 5 tells the LLM to also use the screenshot.
* Template_2:
  * This template is a predecessor of Template_1 and has no subdivision. It contains most of the points from Template_1.
* Template_3 and Template_4:
  * These templates are the German translations of Template_1 and Template_2.
* Template_5:
    * Idea: the LLM often generated too much new code (see the Template_1 output below).
    * It extends Template_1 with the instruction to add no more than 3 new lines of code.
    * Instructions 2 and 4 were slightly modified.


Output for 01_01 with Template_1.

In [None]:
import { test, expect } from '@playwright/test';
import { writeFileSync } from 'fs';
test('test', async ({ page }) => {
  await page.goto('http://localhost:8080/cadenza/');
  await page.getByRole('link', { name: 'Anmelden' }).click();
  await page.getByLabel('Benutzername *').click();
  await page.getByLabel('Benutzername *').fill('Admin');
  await page.getByLabel('Benutzername *').press('Tab');
  await page.getByPlaceholder(' ').fill('Admin');
  await page.getByRole('button', { name: 'Anmelden' }).click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren').click();
  await page.getByLabel('Repository neu einlesen').click();
  await page.getByLabel('Arbeitsmappe importieren

Output for 01_01 with Template_5.

In [None]:
import { test, expect } from '@playwright/test';
import { writeFileSync } from 'fs';

test('test', async ({ page }) => {
  await page.goto('http://localhost:8080/cadenza/');
  await page.getByRole('link', { name: 'Anmelden' }).click();
  await page.getByLabel('Benutzername *').click();
  await page.getByLabel('Benutzername *').fill('Admin');
  await page.getByRole('button', { name: 'Anmelden' }).click();

  // Add code to retrieve the links from the website
  const links = await page.evaluate(() => {
    const links = document.querySelectorAll('a');
    return Array.from(links).map((link) => link.href);
  });

  // Add code to set up the initial state
  await page.waitForSelector('startpage-section-navigation-item');

  // Add code to perform the UI test
  const selectedLink = await page.select('startpage-section-navigation-item');
  await page.click(selectedLink);

  // Add code to take a screenshot
  await page.screenshot({ path: 'cadenz_screenshot.png' });

  // Add code to save the screenshot
  writeFileSync('cadenz_screenshot.png', 'image/png');
});
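Whether a generated script respects the limit on new lines from Template_5 can be checked with a small helper. The sketch below is hypothetical and not part of the project: it counts every non-empty line of the output that does not already occur in the precondition, so comments and repeated statements count as new lines too.

```python
def added_lines(precondition, generated):
    """Return the non-empty lines of `generated` that are not in `precondition`."""
    known = {line.strip() for line in precondition.splitlines() if line.strip()}
    return [
        line.strip()
        for line in generated.splitlines()
        if line.strip() and line.strip() not in known
    ]


def respects_limit(precondition, generated, max_new_lines=3):
    """Check that the model added at most `max_new_lines` lines."""
    return len(added_lines(precondition, generated)) <= max_new_lines


precondition = (
    "test('test', async ({ page }) => {\n"
    "  await page.goto('http://localhost:8080/cadenza/');\n"
    "});"
)
generated = (
    "test('test', async ({ page }) => {\n"
    "  await page.goto('http://localhost:8080/cadenza/');\n"
    "  await page.getByText('Verzeichnis Gewässergüte', { exact: true }).click();\n"
    "  await page.getByText('Arbeitsmappe Übersicht Messstellen').click();\n"
    "});"
)
print(added_lines(precondition, generated))
print(respects_limit(precondition, generated))
```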

### Selection of model

The originally used model llava-hf/llava-1.5-7b-hf cannot be finetuned with the code base https://github.com/haotian-liu/LLaVA, because the pre-trained weights are not loaded. We also did not find any other code base made explicitly for this model.

We first tested the models liuhaotian/llava-v1.6-vicuna-7b and liuhaotian/llava-v1.6-mistral-7b, but encountered errors for which we could not find any solution online.

Finally, we used the liuhaotian/llava-v1.5-7b model, for which we had to fix many bugs. One of the biggest is discussed in the section "Bash-Scripts".

One problem with the new model is that it reacts differently to the prompts, which is why we tested different prompt templates again.

# Finetuning


### Training, Validation and Test dataset

#### JSONs for finetuning of the model
To finetune our model, we need to provide the training data as a JSON file with a specific structure. The following paragraphs explain this structure and the adjustments we made to improve performance. 

Each JSON file is essentially a list of conversations. A conversation is structured like this:

In [None]:
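# Source for the code below: src/finetuning.py, create_finetuning_data_sample()
# Placeholder values so that this cell runs on its own; in the pipeline they come from the database.
sample_id, image_path, validation_path = "1", "1_1.png", "1_2.spec.ts"
prompt = "<combined input prompt>"  # the human input shown further below

def parse_code_with_demark(path):
    # Placeholder stub: the real function reads the test script and adds the JavaScript code-fence markers.
    return f"<fenced contents of {path}>"

conversation = {
        "id": f"{sample_id}",
        "image": image_path,
        "conversations": [
            {
                "from": "human",
                "value": prompt
            },
            {
                "from": "gpt",
                "value": parse_code_with_demark(validation_path) # parse_code(validation_path)
            }
        ]
    } 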

A short note on parse_code_with_demark: it is an extension of parse_code that wraps the code in JavaScript code-fence markers, so that the model does not unlearn them during finetuning. The input from the human looks like this (see the output of the print statement, which is formatted for better readability):

In [5]:
print("### Simplified HTML Content:\nButtons: \n{\"id\": \"navigationTrigger\", \"class\": \"button button-icon button-borderless\"}\n{\"id\": \"workbook-create\", \"class\": \"button workbook-create button-icon\"}\n{\"id\": \"RDxYr2vFytOijWjelj7P1\", \"class\": \"button navigation-menu button-icon\"}\n{\"text\": \"Arbeitsmappe importieren\", \"class\": \"button\"}\n{\"text\": \"Repository neu einlesen …\", \"class\": \"button\"}\nInputs: \n{\"class\": \"select2-search__field\", \"aria-label\": \"Suchen nach …\", \"type\": \"search\", \"placeholder\": \"Suchen nach …\"}\nLinks: \n{\"text\": \"Zum Navigatorbaum springen\", \"id\": \"skip-to-navigator\", \"class\": \"button button-primary\"}\n{\"text\": \"Zum Hauptbereich springen\", \"id\": \"skip-to-content\", \"class\": \"button button-primary\"}\n{\"text\": \"Startseite\", \"id\": \"home\", \"class\": \"button button-icon button-borderless\"}\n{\"text\": \"Karte\"}\n{\"text\": \"Verzeichnis Tutorial\", \"id\": \"d-nav-tree-node_ROOT-Tutorial_firstContent\", \"class\": \"d-nav-tree-node--main d-hover-context\"}\n{\"text\": \"Verzeichnis Gewässergüte\", \"id\": \"d-nav-tree-node_ROOT-Gewässergüte_firstContent\", \"class\": \"d-nav-tree-node--main d-hover-context\"}\n{\"text\": \"Verzeichnis Automobile\", \"id\": \"d-nav-tree-node_ROOT-Automobile_firstContent\", \"class\": \"d-nav-tree-node--main d-hover-context\"}\n{\"text\": \"Verzeichnis Ergänzende Geodaten\", \"class\": \"d-nav-tree-node--main d-hover-context\"}\n{\"text\": \"Verzeichnis Zentrale Dienste\", \"id\": \"d-nav-tree-node_ROOT-Zentrale-Dienste_firstContent\", \"class\": \"d-nav-tree-node--main d-hover-context\"}\n{\"text\": \"Verzeichnis Meine Arbeitsmappen\", \"class\": \"d-nav-tree-node--main d-hover-context\"}\n{\"text\": \"Arbeitsmappe Zugangsdaten\", \"class\": \"d-nav-tree-node--main d-hover-context\"}\n{\"text\": \"Zugangsdaten\", \"class\": \"d-nav-tree-node--text ellipsis\"}\n{\"text\": \"disy Cadenza\"}\n{\"text\": \"Tutorials\"}\n{\"text\": 
\"Lernmodulen\"}\n{\"text\": \"Onlinehilfe\"}\n{\"text\": \"Webseite\"}\n{\"text\": \"Lernmodulen\"}\n{\"text\": \"1\", \"class\": \"startpage-section-navigation-item\"}\n{\"text\": \"2\", \"class\": \"startpage-section-navigation-item\"}\n{\"text\": \"3\", \"class\": \"startpage-section-navigation-item\"}\n{\"text\": \"4\", \"class\": \"startpage-section-navigation-item\"}\n{\"text\": \"5\", \"class\": \"startpage-section-navigation-item\"}\n{\"text\": \"6\", \"class\": \"startpage-section-navigation-item\"}\n{\"text\": \"7\", \"class\": \"startpage-section-navigation-item\"}\n{\"text\": \"disy Cadenza v9.4.71\", \"class\": \"userSpecificLink ellipsis hidden-xs\"}\n{\"text\": \"© Disy Informationssysteme GmbH\", \"class\": \"userSpecificLink ellipsis hidden-xs\"}\n{\"text\": \"Über Disy\", \"class\": \"userSpecificLink ellipsis\"}\n\n\n### Playwright Test Precondition:\nimport { test, expect } from '@playwright/test';\nimport { writeFileSync } from 'fs';\n\n\ntest('test', async ({ page }) => {\n  await page.goto('http://localhost:8080/cadenza/');\n  await page.getByRole('link', { name: 'Anmelden' }).click();\n  await page.getByLabel('Benutzername *').click();\n  await page.getByLabel('Benutzername *').fill('Admin');\n  await page.getByLabel('Benutzername *').press('Tab');\n  await page.getByPlaceholder(' ').fill('Admin');\n  await page.getByRole('button', { name: 'Anmelden' }).click();\n\n});\n\n### UI Test Description:\nÖffne die Arbeitsmappe \"Übersicht Messstellen\" im Ordner \"Gewässergüte\".\n\n### Screenshot:\n <image>\n\n### Task:\nYou are a test automation script writer. I will describe a UI test in German and you will generate Playwright test code for the given webpage. Strictly follow these instructions:\n1. Don't explain the code, just generate the code block itself.\n2. You get some HTML elements and its attributes from the website. You can use playwright locators to find the elements by their attributes.\n3. 
Use the precondition code to set up the initial state. You must continue the code.\n4. Follow the steps in the ui test description to perform actions on the website.\n5. Use the screenshot to understand the context of the test.\nAssistant: ")

### Simplified HTML Content:
Buttons: 
{"id": "navigationTrigger", "class": "button button-icon button-borderless"}
{"id": "workbook-create", "class": "button workbook-create button-icon"}
{"id": "RDxYr2vFytOijWjelj7P1", "class": "button navigation-menu button-icon"}
{"text": "Arbeitsmappe importieren", "class": "button"}
{"text": "Repository neu einlesen …", "class": "button"}
Inputs: 
{"class": "select2-search__field", "aria-label": "Suchen nach …", "type": "search", "placeholder": "Suchen nach …"}
Links: 
{"text": "Zum Navigatorbaum springen", "id": "skip-to-navigator", "class": "button button-primary"}
{"text": "Zum Hauptbereich springen", "id": "skip-to-content", "class": "button button-primary"}
{"text": "Startseite", "id": "home", "class": "button button-icon button-borderless"}
{"text": "Karte"}
{"text": "Verzeichnis Tutorial", "id": "d-nav-tree-node_ROOT-Tutorial_firstContent", "class": "d-nav-tree-node--main d-hover-context"}
{"text": "Verzeichnis Gewässergüte", "id": "d-nav-

The expected response in the "gpt" role looks like this (again, see the output of the print statement, which is formatted for better readability):

In [6]:
print("```javascript\nimport { test, expect } from '@playwright/test';\nimport { writeFileSync } from 'fs';\n\n\ntest('test', async ({ page }) => {\n  await page.goto('http://localhost:8080/cadenza/');\n  await page.getByRole('link', { name: 'Anmelden' }).click();\n  await page.getByLabel('Benutzername *').click();\n  await page.getByLabel('Benutzername *').fill('Admin');\n  await page.getByLabel('Benutzername *').press('Tab');\n  await page.getByPlaceholder(' ').fill('Admin');\n  await page.getByRole('button', { name: 'Anmelden' }).click();\n  await page.getByText('Verzeichnis Gewässergüte', { exact: true }).click();\n  const parentElement = await page.getByText('Arbeitsmappe Übersicht Messstellen').locator('..');\n  await parentElement.locator('.d-icon.d-icon-bold.status-icon').click(); \n\n});\n```")

```javascript
import { test, expect } from '@playwright/test';
import { writeFileSync } from 'fs';


test('test', async ({ page }) => {
  await page.goto('http://localhost:8080/cadenza/');
  await page.getByRole('link', { name: 'Anmelden' }).click();
  await page.getByLabel('Benutzername *').click();
  await page.getByLabel('Benutzername *').fill('Admin');
  await page.getByLabel('Benutzername *').press('Tab');
  await page.getByPlaceholder(' ').fill('Admin');
  await page.getByRole('button', { name: 'Anmelden' }).click();
  await page.getByText('Verzeichnis Gewässergüte', { exact: true }).click();
  const parentElement = await page.getByText('Arbeitsmappe Übersicht Messstellen').locator('..');
  await parentElement.locator('.d-icon.d-icon-bold.status-icon').click(); 

});
```
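Adding the code-fence markers can be sketched as follows. The helper names `add_demarcation` and `strip_demarcation` are hypothetical; the project's parse_code_with_demark is not shown in this notebook and may work differently. The second helper shows how the markers could be removed again before a generated script is executed.

```python
# Built programmatically so that this example nests cleanly inside Markdown
FENCE = "`" * 3


def add_demarcation(code, language="javascript"):
    """Wrap code in a Markdown code fence with a language tag."""
    return f"{FENCE}{language}\n{code.strip()}\n{FENCE}"


def strip_demarcation(text):
    """Remove a surrounding code fence, if there is one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines)


script = "await page.getByRole('button', { name: 'Anmelden' }).click();"
wrapped = add_demarcation(script)
print(wrapped)
print(strip_demarcation(wrapped) == script)
```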


Combined, a single conversation looks like this:

In [4]:
with open("./../data/finetuning/s61_finetuning_data_train_finetuned_T1_sc+_html+_single_20240719-194343.json", encoding="utf-8") as input_file:
    head = [next(input_file) for _ in range(15)]
for line in head:
    print(line, end="")  # each line already ends with a newline

[
    {
        "id": "01_01",
        "image": "./data/raw/./screenshot/0_1.png",
        "conversations": [
            {
                "from": "human",
                "value": "### Simplified HTML Content:\nButtons: \n{\"id\": \"navigationTrigger\", \"class\": \"button button-icon button-borderless\"}\n{\"id\": \"workbook-create\", \"class\": \"button workbook-create button-icon\"}\n{\"id\": \"RDxYr2vFytOijWjelj7P1\", \"class\": \"button navigation-menu button-icon\"}\n{\"text\": \"Arbeitsmappe importieren\", \"class\": \"button\"}\n{\"text\": \"Repository neu einlesen …\", \"class\": \"button\"}\nInputs: \n{\"class\": \"select2-search__field\", \"aria-label\": \"Suchen nach …\", \"type\": \"search\", \"placeholder\": \"Suchen nach …\"}\nLinks: \n{\"text\": \"Zum Navigatorbaum springen\", \"id\": \"skip-to-navigator\", \"class\": \"button button-primary\"}\n{\"text\": \"Zum Hauptbereich springen\", \"id\": \"skip-to-content\", \"class\": \"button button-primary\"}\n{

For each test step in our sample, we generate one such conversation.
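Building the complete list and serializing it could look like the sketch below. The helpers `make_sample` and `dump_dataset` are hypothetical and the values are shortened placeholders; only the key names follow the structure shown above.

```python
import json


def make_sample(sample_id, image_path, prompt, answer):
    """Build one conversation entry in the structure shown above."""
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
    }


def dump_dataset(samples):
    """Serialize the samples; ensure_ascii=False keeps characters like … and ä readable."""
    return json.dumps(samples, indent=4, ensure_ascii=False)


samples = [
    make_sample("01_01", "./screenshot/0_1.png", "Suchen nach …", "<expected code for step 1>"),
    make_sample("01_02", "./screenshot/1_1.png", "Öffne die Arbeitsmappe", "<expected code for step 2>"),
]
dataset_json = dump_dataset(samples)
print(dataset_json)
```

When writing the string to disk, opening the file with `encoding="utf-8"` keeps the non-ASCII characters intact on every platform.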

#### WandB

For the test generation, we split our data into multiple groups.
First, we split the test case data into training data and test data; then we further split the training data into train data and validation data. As a result, 20% of our total data is test data and 20% of our training data is validation data.
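The two-stage split can be sketched as follows. This is an illustration under assumptions: the notebook does not state whether the split is done per test case or per test step, so the sketch simply splits a flat list, and the function name and the seed are hypothetical.

```python
import random


def split_dataset(items, test_fraction=0.2, val_fraction=0.2, seed=42):
    """Split items into train, validation, and test lists in two stages."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed for a reproducible split

    # Stage 1: hold out the test data from the total data
    n_test = round(len(shuffled) * test_fraction)
    test, training = shuffled[:n_test], shuffled[n_test:]

    # Stage 2: hold out the validation data from the remaining training data
    n_val = round(len(training) * val_fraction)
    val, train = training[:n_val], training[n_val:]
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))
```

With 100 items this gives 20 test items, 16 validation items (20% of the remaining 80), and 64 train items.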

The first eval/loss results looked promising.

<img src="./W&B Chart 20.7.2024, 12_37_21_5-epoch.png" width="1000">
<img src="./W&B Chart 20.7.2024, 12_37_21_30-epoch.png" width="1000">

Unfortunately, these low evaluation loss values are misleading: the outputs of models trained for more epochs were at times nonsensical and a severe step back from the earlier results, despite the better eval/loss.
We therefore stuck with fewer epochs for the finetuning, which lets the LLM adapt to our data and prompt inputs without leading it too far astray.


### Changes in the Code-Base Llava

* The code base was adapted in three places:
    * llava/train/train.py: DataArguments was extended to pass an additional validation data set. The make_supervised_data_module method also had to be adapted for this. In addition, the data module is saved in the train method for debugging.
    * llava/model/language_model/llava_llama.py: The parameter cache_position=None was added to the forward method of the LlavaLlamaForCausalLM class to fix an error.
    * llava/eval/run_llava.py: The eval_model method was adapted to print the model's input and output for easier debugging.
* The evaluation runs via the main method in llava_model.py:
    * The main method calls main.py in the cadenza-playwright-llm folder for each sample, which then executes the normal pipeline.
    * The file also contains the class LLaVAModel.
    * Its code is largely taken from the eval_model method in llava/eval/run_llava.py.
    * The method was turned into a class so that the model only has to be loaded once per run.
    * Furthermore, the method was rewritten so that it also works when no image is passed.
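The idea of loading the model only once per run can be sketched with a small wrapper class. This is a generic pattern and not the LLaVAModel class from the project: `LoadOnceModel` and the loader callable are hypothetical, and a dummy function stands in for the real model.

```python
class LoadOnceModel:
    """Load an expensive model lazily and reuse it for every sample."""

    def __init__(self, loader):
        self._loader = loader  # callable that returns the loaded model
        self._model = None

    def generate(self, prompt, image=None):
        """Generate output for one sample; the image is optional."""
        if self._model is None:
            # Only the first call pays the loading cost
            self._model = self._loader()
        return self._model(prompt, image)


load_calls = []


def fake_loader():
    """Stand-in for loading the real model weights."""
    load_calls.append("load")
    return lambda prompt, image: f"{prompt} (image given: {image is not None})"


model = LoadOnceModel(fake_loader)
print(model.generate("step 1"))
print(model.generate("step 2", image=b"png bytes"))
print(len(load_calls))
```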


### Bash-Scripts

The bash script job_eval_all_tests.sh, shown in the code cell below, generates tests with the LLM for all entries in the database and stores them in a newly created folder pred_test_script inside the cadenza-playwright-llm folder. The script calls the main method in llava_model.py. The MODEL_PATH parameter specifies the model to use.

In [None]:
#!/bin/bash
#SBATCH --output=/pfs/data5/home/kit/tm/ge6778/PSDA/output/eval/%j.out
#SBATCH --error=/pfs/data5/home/kit/tm/ge6778/PSDA/error/eval/%j.err
#SBATCH --time=30:00  # Set the time limit to 30 minutes
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --partition=sdil
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=10

export USERNAME=ge6778          # Your username
export ENV=llava                # Name of your conda environment
export PATH_TO_MAIN_FOLDER=PSDA # Folder with data and model
export MODEL_PATH=liuhaotian/llava-v1.5-7b
#export MODEL_PATH=/pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/llava-v1.5-7b-finetune_lora-test-template-1-gpu-2_merged
export MODEL_PATH=/pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/llava-v1.5-7b-finetune_lora-test-T5_sc-_html+_single_epoch-5_merged

module purge
module load devel/cuda/11.7
#module load devel/cudnn/10.2
module load devel/miniconda/23.9.0-py3.9.15-bwHPC-channel

# select and activate our environment
conda activate ${ENV}

# go to llava directory and set pathvariable
cd /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/cadenza-playwright-llm
export PYTHONPATH=/pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/LLaVA:$PYTHONPATH
# enable the automatic boosting of GPU clock speeds
nvidia-smi --auto-boost-default=ENABLED


export LD_LIBRARY_PATH=/opt/bwhpc/common/toolkit/nvidia_hpc_sdk/23.9/Linux_x86_64/23.9/comm_libs/nccl/lib/:$LD_LIBRARY_PATH
export NCCL_DEBUG=INFO
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
# export NCCL_BLOCKING_WAIT=1
export TORCH_DISTRIBUTED_DEBUG=INFO

HEAD_ADDRESS=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# MASTER_IP=$(srun -N1 -w $HEAD_ADDRESS bash -c "ip -o -4 addr show ib0 | awk '{print \$4}' | cut -d/ -f1")
MASTER_IP=$(srun -N1 -w $HEAD_ADDRESS bash -c "ip -o -4 addr show ens3f0 | awk '{print \$4}' | cut -d/ -f1")
WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

if [ -z "$MASTER_IP" ]; then
  echo "No valid IP address found for ens3f0. Exiting."
  exit 1
fi

export HEAD_ADDRESS MASTER_IP WORLD_SIZE


NCCL_SOCKET_IFNAME=ens3f0 # will automap to ib for data transfer. 
# NCCL_SOCKET_IFNAME=ib0
export NCCL_SOCKET_IFNAME

# check if network is working
echo "Verifying network settings..."
echo "Worldsize:"
echo $WORLD_SIZE
echo "Slurm nnodes:"
echo $SLURM_NNODES
echo "Slurm nodeid:"
echo $SLURM_NODEID
echo "Hostname:"
srun hostname
echo "All network interfaces on the master node ($HEAD_ADDRESS):"
srun -N1 -w $HEAD_ADDRESS ip a
echo "Using network interface: $NCCL_SOCKET_IFNAME for NCCL communication"
srun ip a show $NCCL_SOCKET_IFNAME
echo "Testing connection to master ip"
srun ping -c 4 $MASTER_IP
echo "Testing connection among nodes"
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    echo "Testing connectivity from $node to $MASTER_IP..."
    srun -N1 -w $node ping -c 4 $MASTER_IP
done

python ../cadenza-playwright-llm/src/models/llm/llava_model.py \
    --model-path ${MODEL_PATH}

echo "Training"
echo "Finished at: $(date)"

The bash script for finetuning is job_finetune.sh, shown in the code cell below. In this script, the MODEL_PATH parameter specifies the model to be finetuned and the OUTPUT_DIR parameter specifies the output folder. 
   * In addition, a validation data set is passed via validation_data_path.
   * 2 GPUs are requested, as one GPU does not have enough memory.
   * Important: The parameter "--mm_projector_type mlp2x_gelu" must be added for DeepSpeed; otherwise, the model cannot be loaded correctly later. Without it, some weights of the model were initialized randomly and the output of the finetuned model was empty.


In [None]:
#!/bin/bash
#SBATCH --output=/pfs/data5/home/kit/tm/ge6778/PSDA/output/%j.out
#SBATCH --error=/pfs/data5/home/kit/tm/ge6778/PSDA/error/%j.err
#SBATCH --time=45:00  # Set the time limit to 45 minutes
#SBATCH --ntasks-per-node=1
#SBATCH --nodes=1
#SBATCH --partition=sdil
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=10

export USERNAME=ge6778          # Your username
export ENV=llava                # Name of your conda environment
export PATH_TO_MAIN_FOLDER=PSDA # Folder with data and model
export MODEL_PATH=/pfs/data5/home/kit/tm/ge6778/PSDA/llava-v1.5-7b
#export MODEL_PATH=llava-hf/llava-1.5-7b-hf
export OUTPUT_DIR=llava-v1.5-7b-finetune_lora-test-T5_sc+_html+_single_epoch-5-REDO

module purge
module load devel/cuda/11.7
#module load devel/cudnn/10.2
module load devel/miniconda/23.9.0-py3.9.15-bwHPC-channel

# select and activate our environment
conda activate ${ENV}

# go to llava directory and set pathvariable
# TODO: Changed variable 
cd /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/LLaVA
#cd /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/cadenza-playwright-llm/LLaVA
export PYTHONPATH=/pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/LLaVA:$PYTHONPATH
# enable the automatic boosting of GPU clock speeds
nvidia-smi --auto-boost-default=ENABLED


export LD_LIBRARY_PATH=/opt/bwhpc/common/toolkit/nvidia_hpc_sdk/23.9/Linux_x86_64/23.9/comm_libs/nccl/lib/:$LD_LIBRARY_PATH
export NCCL_DEBUG=INFO
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
# export NCCL_BLOCKING_WAIT=1
export TORCH_DISTRIBUTED_DEBUG=INFO

HEAD_ADDRESS=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
# MASTER_IP=$(srun -N1 -w $HEAD_ADDRESS bash -c "ip -o -4 addr show ib0 | awk '{print \$4}' | cut -d/ -f1")
MASTER_IP=$(srun -N1 -w $HEAD_ADDRESS bash -c "ip -o -4 addr show ens3f0 | awk '{print \$4}' | cut -d/ -f1")
WORLD_SIZE=$(($SLURM_NNODES * $SLURM_NTASKS_PER_NODE))

if [ -z "$MASTER_IP" ]; then
  echo "No valid IP address found for ens3f0. Exiting."
  exit 1
fi

export HEAD_ADDRESS MASTER_IP WORLD_SIZE


NCCL_SOCKET_IFNAME=ens3f0 # will automap to ib for data transfer. 
# NCCL_SOCKET_IFNAME=ib0
export NCCL_SOCKET_IFNAME

# check if network is working
echo "Verifying network settings..."
echo "Worldsize:"
echo $WORLD_SIZE
echo "Slurm nnodes:"
echo $SLURM_NNODES
echo "Slurm nodeid:"
echo $SLURM_NODEID
echo "Hostname:"
srun hostname
echo "All network interfaces on the master node ($HEAD_ADDRESS):"
srun -N1 -w $HEAD_ADDRESS ip a
echo "Using network interface: $NCCL_SOCKET_IFNAME for NCCL communication"
srun ip a show $NCCL_SOCKET_IFNAME
echo "Testing connection to master ip"
srun ping -c 4 $MASTER_IP
echo "Testing connection among nodes"
for node in $(scontrol show hostnames "$SLURM_JOB_NODELIST"); do
    echo "Testing connectivity from $node to $MASTER_IP..."
    srun -N1 -w $node ping -c 4 $MASTER_IP
done

deepspeed llava/train/train_mem.py \
    --deepspeed ./scripts/zero3.json \
    --lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
    --model_name_or_path ${MODEL_PATH} \
    --version v1 \
    --data_path /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/data/finetune/s61_finetuning_data_train_finetuned_T5_sc+_html+_single_20240719-194726.json \
    --validation_data_path /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/data/finetune/s18_finetuning_data_val_finetuned_T5_sc+_html+_single_20240719-194726.json \
    --image_folder /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/cadenza-playwright-llm \
    --vision_tower openai/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/${OUTPUT_DIR} \
    --num_train_epochs 5 \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "steps" \
    --save_strategy "epoch" \
    --save_total_limit 1 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True \
    --dataloader_num_workers 4 \
    --report_to wandb


python scripts/merge_lora_weights.py \
    --model-path /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/${OUTPUT_DIR} \
    --model-base ${MODEL_PATH} \
    --save-model-path /pfs/data5/home/kit/tm/${USERNAME}/${PATH_TO_MAIN_FOLDER}/${OUTPUT_DIR}_merged


echo "Training"
echo "Finished at: $(date)"
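Since a missing projector type led to an unusable model, a quick check of the merged model folder before the evaluation can save a job run. The sketch below assumes that the folder contains a Hugging Face style config.json with the key `mm_projector_type`; the helper `projector_type` is hypothetical, and a temporary folder stands in for the real model folder.

```python
import json
import tempfile
from pathlib import Path


def projector_type(model_dir):
    """Read the projector type from the model's config.json (None if it is missing)."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("mm_projector_type")


# Demo with a temporary folder instead of a real merged model
with tempfile.TemporaryDirectory() as tmp:
    demo_config = {"mm_projector_type": "mlp2x_gelu"}
    Path(tmp, "config.json").write_text(json.dumps(demo_config), encoding="utf-8")
    demo_result = projector_type(tmp)

print(demo_result)
if demo_result != "mlp2x_gelu":
    print("Warning: unexpected projector type, the model may not load correctly.")
```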
