In [1]:
%load_ext autoreload
%autoreload 2

Structure of the notebook:

In [25]:
import sqlite3
from bs4 import BeautifulSoup

from src.main import main, main_cluster_multimodal_model
from src.utils.helpers import truncate_text, clean_string
from src.utils.logger import setup_logger
from src.llm.access_2_cluster import Access2Cluster

import logging
from src.data.html_processor import extract_html_info
from src.data.code_processor import parse_code
from src.data.image_processor import extract_text_from_image
from src.input_builder import create_input
from src.ui_tests.test_generation import generate_code
from src.evaluation import calculate_scores


logger = setup_logger(__name__, level='DEBUG') # Change to 'INFO' for less verbosity

# Load Data

In [3]:
# Connect to the database
conn = sqlite3.connect('../data/raw/playwright_script.db')
cursor = conn.cursor()

res = cursor.execute("SELECT * FROM tests")
items = res.fetchall()

print(f"Loaded {len(items)} rows from the tests table.")

Loaded 100 rows from the tests table.


In [4]:
# Check the first item
items[0]

('1.1',
 '[1.1] Öffne die Arbeitsmappe "Übersicht Messstellen" im Ordner "Gewässergüte".',
 '[1.1] Expected result: Die Arbeitsmappe wird geöffnet, der Analysekontext ist nicht sichtbar.',
 '.\\html\\1_1.html',
 '.\\screenshot\\1_1.png',
 '.\\test_script\\1_1.spec.ts')
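Each row comes back as a plain tuple, so fields are addressed by position, and the file paths use Windows-style backslashes. The sketch below shows one way to name the fields and normalize the paths. The field names (`step_id`, `instruction`, and so on) and the helper `to_posix` are assumptions for illustration only; the actual column names of the `tests` table are not shown in this notebook.

```python
from pathlib import PureWindowsPath
from typing import NamedTuple


class StepRecord(NamedTuple):
    # Hypothetical field names; only the positional order is taken from the row above.
    step_id: str
    instruction: str
    expected_result: str
    html_path: str
    screenshot_path: str
    script_path: str


def to_posix(path: str) -> str:
    """Convert a Windows-style relative path to forward slashes."""
    return PureWindowsPath(path).as_posix()


# Stand-in row with the same shape as items[0]; the texts are shortened placeholders.
row = ('1.1', 'instruction text', 'expected result text',
       '.\\html\\1_1.html', '.\\screenshot\\1_1.png', '.\\test_script\\1_1.spec.ts')

step = StepRecord(*row)
print(step.step_id, to_posix(step.html_path))
```

Named fields make later cells less dependent on the column order, and the normalized paths can be opened on Linux or macOS as well.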

# Approach

Use a pre-trained LLM:
* GPT-3, GPT-3.5, or a lighter model such as GPT-2, all of which are suited to text generation tasks
* Fine-tuning or adapting the model for specific tasks remains possible later

Steps:
1. ✅ HTML Processing: Extract relevant information from the HTML file.
    * Use **BeautifulSoup** or lxml in Python to parse the HTML file and extract information from it. ➡️ see src.data.html_processor.py
2. ✅ Image Processing: Extract relevant information from the image.
    * Use an OCR engine such as **Tesseract** (through the pytesseract wrapper) to extract text from the image. ➡️ see src.data.image_processor.py
    * Use OpenCV or PIL (Pillow) in Python to process the image and extract relevant information.
3. (Optional) Summarize the image information, the HTML information, and the prompt from the Playwright test code using a T5 model.
4. ✅ Python Processing: Parse the Playwright test code of the previous step, which serves as the precondition. ➡️ see src.data.python_processor.py
5. ✅ Combine the information extracted from the HTML and the image with the prompt for the language model. ➡️ see src.data.input_combiner.py
6. ✅ Pass the combined input to the language model to generate the UI test code. ➡️ see src.ui_tests.test_generation.py

➡️ Run locally, either from this notebook or as a script via src.main.py
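Step 5 can be pictured with a minimal sketch like the one below. It is not the project's implementation: the function name `build_model_input`, the section titles, and the `max_chars` limit are all assumptions chosen for illustration.

```python
def build_model_input(task: str, html_elements: str, image_text: str,
                      previous_code: str = "", max_chars: int = 4000) -> str:
    """Join the extracted pieces into one prompt, skipping empty sections."""
    # Section titles are hypothetical; any consistent labeling would work.
    sections = [
        ("Task", task),
        ("HTML elements", html_elements),
        ("Screenshot text (OCR)", image_text),
        ("Previous test code", previous_code),
    ]
    parts = [f"### {title}\n{body.strip()}" for title, body in sections if body.strip()]
    prompt = "\n\n".join(parts)
    # Hard cap so the prompt cannot grow without bound.
    return prompt[:max_chars]


example = build_model_input(
    task="Open the workbook",
    html_elements="Button: OK - ID: confirm",
    image_text="OK Cancel",
)
print(example)
```

Labeling each section lets the model tell the task description apart from the page context, and empty sections (for example, no previous test code for the first step) are simply left out.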

# HTML Processing

In [5]:
# load example file
html_path = './html/0_1.html'

Option 1: Parse all HTML content by extracting the text from it using BeautifulSoup:

In [6]:
def parse_html(html_path: str, max_length: int = 200) -> str:
    """Parse an HTML file with BeautifulSoup, extract its text content, and truncate it.

    :param html_path: The path to the HTML file.
    :param max_length: The maximum length of the returned text.
    :return: The truncated text content of the HTML file.
    """
    # Read the file as UTF-8 so umlauts and other non-ASCII characters decode correctly
    with open(html_path, "r", encoding="utf-8") as file:
        html_content = file.read()

    # Parse the HTML content
    soup = BeautifulSoup(html_content, 'html.parser')

    # Get the text content
    html_text = soup.get_text(strip=True)

    # Truncate the text to the maximum length
    html_text = truncate_text(html_text, max_length=max_length)

    logger.debug(f"HTML content parsed successfully. - Lines of Code: {len(html_text.splitlines())}")

    return html_text
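A note on reading the file: when `open` is called without an `encoding` argument, Python uses the platform default, which on Windows is often cp1252. The short sketch below, using only the standard library and a sample German sentence, shows what happens when UTF-8 bytes are decoded with the wrong codec.

```python
original = "Ihr Browser unterstützt kein JavaScript"
raw_bytes = original.encode("utf-8")  # how the text is stored in a UTF-8 HTML file

garbled = raw_bytes.decode("cp1252")  # decoded with a mismatched codec
correct = raw_bytes.decode("utf-8")   # decoded with the matching codec

print(garbled)
print(correct)
```

The two-byte UTF-8 sequence for `ü` is read as two separate cp1252 characters, which is why passing `encoding="utf-8"` explicitly is the safer choice for these HTML files.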

In [7]:
html_text = parse_html(html_path, max_length=2000)
print(html_text)

2024-06-25 12:11:53 [__main__:21] [DEBUG] >>>> HTML content parsed successfully. - Lines of Code: 3
Startseite - disy CadenzaAchtung!Ihr Browser unterstützt kein JavaScript oder JavaScript wurde in Ihrem Browser deaktiviert.Bitte verwenden Sie einen Browser, der JavaScript unterstützt, oder aktivieren Sie JavaScript in Ihrem Browser. Ohne aktiviertes JavaScript ist die Anwendung nicht nutzbar.OfflineVerbinden …Zum Navigatorbaum springenZum Hauptbereich springenStartseiteKartedisy Cadenza[{"printName":"Lernmodule – Tutorials und mehr","url":"/help-learning/index.html","targetFrame":"_blank","id":"help","type":"help","webApplication":false},{"printName":"Hilfe","url":"/help/","targetFrame":"_blank","id":"help","type":"help","webApplication":false},{"printName":"Hilfe zu Classic","url":"/help-classic/","targetFrame":"_blank","id":"help-classic","type":"help-classic","webApplication":false}]Admin[{"printName":"Profil","url":"/pages/access/userprofile.xhtml","targetFr

Option 2: Extract only Elements (input fields, buttons, links) from the HTML file using BeautifulSoup.

In [8]:
def extract_html_info(file_path):
    """Extract relevant information from an HTML file.

    :param file_path: The path to the HTML file.
    :return: A formatted string containing the extracted HTML elements.
    """
    with open(file_path, 'r', encoding='utf-8') as file:
        html_content = file.read()

    soup = BeautifulSoup(html_content, 'html.parser')

    buttons = soup.find_all('button')
    inputs = soup.find_all('input')
    links = soup.find_all('a')

    html_elements = "HTML Elements:\n"

    for button in buttons:
        button_text = clean_string(button.text)
        button_id = button.get("id", "No ID")
        button_class = ' '.join(button.get("class", []))  # Convert list to string with space separator
        button_type = button.get("type", "button")
        html_elements += f'Button: {button_text} - ID: {button_id} - Class: {button_class} - Type: {button_type}\n'

    for input_field in inputs:
        input_name = input_field.get("name", "No Name")
        input_type = input_field.get("type", "text")
        input_value = input_field.get("value", "")
        input_placeholder = clean_string(input_field.get("placeholder", ""))
        input_id = input_field.get("id", "No ID")
        input_class = ' '.join(input_field.get("class", []))  # Convert list to string with space separator
        input_label = input_field.get("aria-label", "")
        html_elements += f'Input: {input_name} - Type: {input_type} - Value: {input_value} - Placeholder: {input_placeholder} - ID: {input_id} - Class: {input_class} - Label: {input_label}\n'

    for link in links:
        link_text = clean_string(link.text)
        link_href = link.get("href", "#")
        link_id = link.get("id", "No ID")
        link_class = ' '.join(link.get("class", []))  # Convert list to string with space separator
        html_elements += f'Link: {link_text} - Href: {link_href} - ID: {link_id} - Class: {link_class}\n'

    logger.debug(f"HTML elements extracted successfully. - Number of Elements: {len(html_elements.splitlines())}")

    return html_elements.strip()

In [9]:
html_elements = extract_html_info(html_path)
print(html_elements)

2024-06-25 12:11:53 [__main__:42] [DEBUG] >>>> HTML elements extracted successfully. - Number of Elements: 48
HTML Elements:
Button:  - ID: navigationTrigger - Class: button button-icon button-borderless - Type: button
Button: [{"printName":"Lernmodule – Tutorials und mehr","url":"/help-learning/index.html","targetFrame":"_blank","id":"help","type":"help","webApplication":false},{"printName":"Hilfe","url":"/help/","targetFrame":"_blank","id":"help","type":"help","webApplication":false},{"printName":"Hilfe zu Classic","url":"/help-classic/","targetFrame":"_blank","id":"help-classic","type":"help-classic","webApplication":false}] - ID: No ID - Class: d-help-menu button button-icon button-borderless d-topnav-dropdown - Type: button
Button: Admin [{"printName":"Profil","url":"/pages/access/userprofile.xhtml","targetFrame":"_self","id":"userprofile","type":"userprofile","webApplication":false},{"printName":"Abmelden","url":"/logout","targetFrame":"_self","id":"logout","type
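Some button texts in the output above carry long embedded JSON menu definitions, which inflate the model input. One possible mitigation, sketched here, is to cap the length of every element line. The helper name `shorten_lines` and the default width of 120 characters are assumptions, not part of the project.

```python
def shorten_lines(html_elements: str, width: int = 120) -> str:
    """Cap each line at `width` characters, marking cut lines with '...'."""
    return "\n".join(
        line if len(line) <= width else line[: width - 3] + "..."
        for line in html_elements.splitlines()
    )


# Stand-in input with one short and one overly long element line.
sample = "HTML Elements:\nButton: " + "x" * 300 + " - ID: No ID"
print(shorten_lines(sample, width=50))
```

Cutting at a fixed width can drop the ID and class at the end of a long line, so an alternative would be to shorten only the text field before the line is assembled.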

# Image Processing

In [None]:
# Imports for Image Processing
from PIL import Image
import pytesseract
from typing import Union

# Utility functions
from src.utils.helpers import truncate_text
from src.utils.logger import setup_logger

# Setup Logger
logger = setup_logger(__name__, level='DEBUG')

# Define the function to extract text from image
def extract_text_from_image(image_path: str, max_length: Union[int, None] = 200) -> str:
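    """Extract text from an image file using Tesseract OCR.

    :param image_path: The path to the image file.
    :param max_length: The maximum length of the text; None disables truncation.
    :return: The extracted text from the image.
    """
    # Point pytesseract at the Tesseract binary (default Windows install path; adjust on other systems)
    pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

    # Open the image and extract its text; the context manager closes the file afterwards
    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image)

    # Truncate the text to the maximum length (skipped when max_length is None or 0)
    if max_length:
        text = truncate_text(text, max_length=max_length)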

    logger.debug(f"Text extracted from image successfully. - Characters: {len(text)}")

    return text

# Demonstrate the function with an example image (replace the placeholder path with a real screenshot)
image_path = 'path/to/image/file.png'
extracted_text = extract_text_from_image(image_path, max_length=300)
print("Extracted Text from Image:", extracted_text)
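The function above hard-codes a Windows install path for the Tesseract binary. A minimal sketch of a more portable lookup follows, using only the standard library. The environment variable name `TESSERACT_CMD` and the helper `resolve_tesseract_cmd` are assumptions introduced for illustration.

```python
import os
import shutil

# Fallback taken from the path used in the cell above.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def resolve_tesseract_cmd(env_var: str = "TESSERACT_CMD") -> str:
    """Pick the Tesseract binary: environment variable first, then PATH, then the Windows default."""
    return os.environ.get(env_var) or shutil.which("tesseract") or DEFAULT_WINDOWS_PATH


print(resolve_tesseract_cmd())
```

The result could then be assigned once at import time with `pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd()` instead of on every call.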

