In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from llama_index.core import PromptTemplate

DEFAULT_WORLD_MODEL_EXAMPLES = """
Objective:  Go to the first issue you can find
Previous instructions: 
- Click on 'Issues' with the number '28' next to it.
- [FAILED] Click on 'Build and share place where people can suggest their use cases and results #225'
- [FAILED] Click on 'Build and share place where people can suggest their use cases and results #225'
Last engine: Navigation Engine
Current state: 
external_observations:
  vision: '[SCREEENSHOT]'
internal_state:
  agent_outputs: []
  user_inputs: []

Thoughts:
- The current screenshot shows the issues page of the GitHub repository 'lavague-ai/LaVague'.
- The objective is to go to the first issue.
- Previous instructions have been unsuccessful. A new approach should be used.
- The '#225' seems not to be clickable and it might be relevant to devise an instruction that does not include it. 
Next engine: Navigation Engine
Instruction: Click on the first issue, with title 'Build and share place where people can suggest their use cases and results'
-----
Objective: Find when Llama 3 was released
Previous instructions:
- Click on 'meta-llama/Meta-Llama-3-8B'
Last engine: Navigation Engine
Current state:
external_observations:
  vision: '[SCREEENSHOT]'
internal_state:
  agent_outputs: []
  user_inputs: []

Thoughts:
- The current screenshot shows the model page for 'meta-llama/Meta-Llama-3-8B' on Hugging Face.
- Hugging Face is a hub for AI models and datasets, where users can explore and interact with a variety of AI models.
- I am therefore on the right page to find information about the release date of 'Meta-Llama-3-8B'.
- To find the release date, I need to locate the relevant information in the content of the page.
- Therefore the best next step is to use the Python Engine to extract the release date from the content of the page.
Next engine: Python Engine
Instruction: Extract the release date of 'Meta-Llama-3-8B' from the textual content of the page.
-----
Objective: Provide the code to get started with Gemini API
Previous instructions:
- Click on 'Read API docs'
- Click on 'Gemini API quickstart' on the menu
Last engine: Navigation Engine
Current state:
external_observations:
  vision: '[SCREEENSHOT]'
internal_state:
  agent_outputs: []
  user_inputs: []

Thoughts:
- The current screenshot shows the getting-started documentation page for the Gemini API.
- I am therefore on the right page to find the code to get started with the Gemini API.
- The next step is to provide the code to get started with the Gemini API.
- Therefore I need to use the Python Engine to generate the code to extract the code to get started with the Gemini API from this page.
Next engine: Python Engine
Instruction: Extract the code to get started with the Gemini API from the content of the page.
-----
Objective: Show what is the cheapest product
Previous instructions: [NONE]
Last engine: [NONE]
Current state:
external_observations:
  vision: '[SCREEENSHOT]'
internal_state:
  agent_outputs: []
  user_inputs: []
  
Thoughts:
- The screenshot shows an e-commerce website with various products.
- To find the cheapest product, I need to identify the product with the lowest price.
- There seem to be selectors for sorting products by price on the left side of the page.
- The screenshot only shows part of the selectors for price. I should scroll down to see the full list of products and prices.
Next engine: Navigation Controls
Instruction: SCROLL_DOWN
-----
Objective: What tech stack do we use?
Previous instructions: 
- [FAILED] Locate and click on the "Technology Solutions" link or section to find information about the tech stack.
- [FAILED] Click on the "Technology Solutions" section to explore detailed information about the tech stack.
Last engine: Navigation Engine
Current state: 
external_observations:
  vision: '[SCREEENSHOT]'
internal_state:
  agent_outputs: []
  user_inputs: []

Thoughts:
- The screenshot shows a Notion webpage with information about a company called ACME INC.
- It has information about the company, their services, and departments.
- Previous instructions tried to click on "Technology Solutions" without success. This probably means that "Technology Solutions" is not clickable or reachable.
- Other strategies have to be pursued to reach the objective.
- There seems to be information at the end of the screen about departments, with mention of a 'Software development' section that could be promising.
- The best next step is to scroll down to gather more information.
Next engine: Navigation Controls
Instruction: SCROLL_DOWN
-----
Objective: Find the description of the author
Previous instructions:
- Click on 'About the author'
- Extract the description of the author from the content of the page.
Last engine: Python Engine
Current state:
external_observations:
  vision: '[SCREEENSHOT]'
internal_state:
  agent_outputs: [
    'The author is a software engineer with a passion for AI and machine learning. He has worked on various projects and has a blog where he shares his knowledge and experience.'
  ]
  user_inputs: []

Thoughts:
- The screenshot shows the description of the author.
- The description of the author has been successfully extracted from the content of the page.
- The objective has been reached.
Next engine: STOP
Instruction: STOP
"""

WORLD_MODEL_PROMPT_TEMPLATE = PromptTemplate("""
You are an AI system specialized in high level reasoning. Your goal is to generate instructions for other specialized AIs to perform web actions to reach objectives given by humans.
Your inputs are:
- objective ('str'): a high level description of the goal to achieve.
- previous_instructions ('str'): a list of previous steps taken to reach the objective.
- last_engine ('str'): the engine used in the previous step.
- current_state ('dict'): the state of the environment in YAML to use to perform the next step.

Your outputs are:
- thoughts ('str'): a list of thoughts in bullet points detailing your reasoning.
- next_engine ('str'): the engine to use for the next step.
- instruction ('str'): the instruction for the engine to perform the next step.

Here are the engines at your disposal:
- Python Engine: This engine is used when the task requires computation on the current state of the agent. 
It does not impact the outside world and does not navigate.
- Navigation Engine: This engine is used when the next step of the task requires further navigation to reach the goal. 
For instance it can be used to click on a link or to fill a form on a webpage. This engine is heavy and will do complex processing of the current HTML to decide which element to interact with.
- Navigation Controls: This is a simpler engine for commands that do not require reasoning, which are 'SCROLL_DOWN', 'SCROLL_UP' and 'WAIT'.

Here are guidelines to follow:

# General guidelines
- The instruction should be as detailed as possible and only contain the next step. 
- If the objective is already achieved in the screenshot, or the current state contains the requested information, provide the next engine and instruction 'STOP'.
- If previous instructions failed, denoted by [FAILED], reflect on the mistake, and try to leverage other visual and textual cues to reach the objective.

# Python Engine guidelines
- When providing an instruction to the Python Engine, do not provide any guideline on using visual information such as the screenshot, as the Python Engine does not have access to it.
- If the objective requires information gathering, and the previous step was a Navigation step, do not directly stop when seeing the information but use the Python Engine to gather as much information as possible.

# Navigation guidelines
- If the screenshot provides information but seems insufficient, use navigation controls to further explore the page.
- When providing information for the Navigation Engine, focus on elements that are most likely interactable, such as buttons, links, or forms, and be precise in your description of the element to avoid ambiguity.
- If several steps have to be taken, provide them in bullet points.

Here are previous examples:
{examples}

Here is the next objective:
Objective: {objective}
Previous instructions: 
{previous_instructions}
Last engine: {last_engine}
Current state: 
{current_state}

Thought:
"""
)


In [3]:
# from llama_index.llms.groq import Groq
# from llama_index.multi_modal_llms.openai import OpenAIMultiModal
# from llama_index.embeddings.gemini import GeminiEmbedding
# from llama_index.llms.gemini import Gemini

# from lavague.contexts.openai import OpenaiContext

# mm_llm = OpenAIMultiModal(model="gpt-4o")
# action_engine_llm = Groq(model="llama3-8b-8192")
# embed_model = GeminiEmbedding(model_name="models/text-embedding-004")
# python_engine_llm = Gemini(model_name="models/gemini-1.5-flash-latest")

# context = OpenaiContext()
# context.llm = action_engine_llm
# context.embedding = embed_model

# prompt_template = WORLD_MODEL_PROMPT_TEMPLATE.partial_format(
#     examples=world_model_examples
# )

In [4]:
from llama_index.core.llms import LLM
from lavague.core import Context, get_default_context
from typing import Optional

REWRITER_PROMPT_TEMPLATE = PromptTemplate("""
You are an AI expert.
You are given a high level instruction on a generic action to perform.
Your output is an instruction of the action, rewritten to be more specific on the capabilities at your disposal to perform the action.
Here are your capabilities:
{capabilities}

Here are previous examples:
{examples}

Here is the next instruction to rewrite:
Original instruction: {original_instruction}
""")

DEFAULT_PYTHON_ENGINE_CAPABILITIES = """
- Answer questions using the content of an HTML page using llama index and trafilatura
"""

DEFAULT_PYTHON_ENGINE_REWRITER_EXAMPLES = """
Original instruction: Use the content of the HTML page to answer the question 'How was falcon-11B trained?'
Capability: Answer questions using the content of an HTML page using llama index and trafilatura
Rewritten instruction: Extract the content of the HTML page and use llama index to answer the question 'How was falcon-11B trained?'
"""

class Rewriter:
    def __init__(self, llm: LLM, capabilities: str = DEFAULT_PYTHON_ENGINE_CAPABILITIES, examples: str = DEFAULT_PYTHON_ENGINE_REWRITER_EXAMPLES):
        self.llm = llm
        self.prompt_template = REWRITER_PROMPT_TEMPLATE.partial_format(
            capabilities=capabilities,
            examples=examples
        )      
    def rewrite_instruction(self, original_instruction: str) -> str:
        prompt = self.prompt_template.format(original_instruction=original_instruction)
        rewritten_instruction = self.llm.complete(prompt).text
        return rewritten_instruction



In [5]:
from lavague.core import Context, get_default_context
from llama_index.core.base.llms.base import BaseLLM
import copy

from lavague.core.extractors import BaseExtractor
from lavague.core import PythonFromMarkdownExtractor
from lavague.contexts.openai import OpenaiContext

PYTHON_ENGINE_PROMPT_TEMPLATE = PromptTemplate("""
You are an AI system specialized in Python code generation to answer user queries.
The inputs are: an instruction, and the current state of the local variables available to the environment where your code will be executed.
Your output is the code that will perform the action described in the instruction, using the variables available in the environment.
You can import libraries and use any variables available in the environment.
Detail thoroughly the steps to perform the action in the code you generate with comments.
The last line of your code should be an assignment to the variable 'output' containing the result of the action.

Here are previous examples:
{examples}

Instruction: {instruction}
State:
{state_description}
Code:
""")

with open("python_engine_examples.txt") as f:
    DEFAULT_PYTHON_ENGINE_EXAMPLES = f.read()

class PythonEngine:
    llm: BaseLLM    
    prompt_template: PromptTemplate
    rewriter: Rewriter

    def __init__(self, llm: LLM, extractor: BaseExtractor = PythonFromMarkdownExtractor(), examples: str = DEFAULT_PYTHON_ENGINE_EXAMPLES):
        self.llm = llm
        self.extractor = extractor
        self.prompt_template = PYTHON_ENGINE_PROMPT_TEMPLATE.partial_format(
            examples=examples
        )
        self.rewriter = Rewriter(llm=llm)
        
    def generate_code(self, instruction: str, state: dict) -> str:
        rewriter = self.rewriter
        rewritten_instruction = rewriter.rewrite_instruction(instruction)
        
        state_description = self.get_state_description(state)
        prompt = self.prompt_template.format(instruction=rewritten_instruction, state_description=state_description)
        response = self.llm.complete(prompt).text
        code = self.extractor.extract(response)
        if not code:
            raise ValueError(f"No code generated or extracted. Here is the original response: {response}")
        return code
    
    def execute_code(self, code: str, state: dict):
        local_scope = copy.deepcopy(state)
        exec(code, local_scope, local_scope)
        output = local_scope["output"]
        return output
        
    def get_state_description(self, state: dict) -> str:
        """TO DO: provide more complex state descriptions"""
        state_description = """
html ('str'): The content of the HTML page being analyzed"""
        return state_description
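
A minimal sketch of the contract between the prompt and `execute_code`: the generated code runs with the state as its namespace and must assign its result to `output`. The helper name `run_generated_code` and the sample HTML are hypothetical and only serve the illustration; the explicit check for a missing `output` is an addition that `PythonEngine.execute_code` does not perform.

```python
import copy


def run_generated_code(code: str, state: dict):
    """Execute generated code in a copy of the state and return its 'output'."""
    # Work on a deep copy so the generated code cannot mutate the caller's state
    scope = copy.deepcopy(state)
    exec(code, scope, scope)
    if "output" not in scope:
        raise KeyError("The generated code did not assign a variable named 'output'.")
    return scope["output"]


# Hypothetical code, shaped like what the prompt asks the LLM to produce
generated = """
import re
# Pull the <title> text out of the HTML string provided in the state
match = re.search(r"<title>(.*?)</title>", html)
output = match.group(1) if match else None
"""

result = run_generated_code(generated, {"html": "<html><title>Falcon 11B</title></html>"})
print(result)
```

Passing the same dictionary as both globals and locals lets imports and helper functions defined by the generated code see each other, which is presumably why `execute_code` does the same.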

In [6]:
# Too lazy to finish

# Thoughts:
# - The current page is the Azure Pricing Calculator.
# - To calculate the cost of an AKS (Azure Kubernetes Service) cluster, we need to add the AKS service to the calculator.
# - The current screen does not show the options to add services.
# - The best approach is to scroll down to find the section where services can be added to the calculator.
# - Because this is a simple scrolling action, the best next step is to use the Navigation Controls engine to scroll down.
# Next engine: Navigation Controls
# Instruction: Scroll down by one full screen to continue exploring the current page.
# -----
# Objective: Find the definition of 'Diffusers'
# Previous instructions: 
# - Click on 'Diffusers' link
# Last engine: Navigation Engine
# Current state: [SCREENSHOT]

# Thought:
# - The current page is the documentation page of Hugging Face.
# - Hugging Face is a platform for AI models and datasets, where users can explore and interact with latest AI resources.
# - The definition of 'Diffusers' is provided in the documentation.
# - No further action is needed to achieve the objective.
# Next engine: STOP
# Instruction: STOP

In [7]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=chrome_options)

width = 1024
height = 1024

driver.set_window_size(width, height)
viewport_height = driver.execute_script("return window.innerHeight;")

height_difference = height - viewport_height
driver.set_window_size(width, height + height_difference)

driver.get("https://huggingface.co/meta-llama/Meta-Llama-3-8B")

In [8]:
import re

def extract_world_model_instruction(text):
    # Use a regular expression to find the content after "Instruction:"
    instruction_patterns = [
        r"Instruction:\s*((?:- .*\n?)+)",  # For multi-line hyphenated instructions
        r"### Instruction:\s*((?:- .*\n?)+)",  # For multi-line hyphenated instructions with ### prefix
        r"Instruction:\s*((?:\d+\.\s.*\n?)+)",  # For multi-line numbered instructions
        r"### Instruction:\s*((?:\d+\.\s.*\n?)+)",  # For multi-line numbered instructions with ### prefix
        r"Instruction:\s*(.*)",  # For single-line instructions
        r"### Instruction:\s*(.*)"  # For single-line instructions with ### prefix
    ]
    
    for pattern in instruction_patterns:
        instruction_match = re.search(pattern, text, re.MULTILINE)
        if instruction_match:
            instruction_text = instruction_match.group(1).strip()
            # Check if the instruction is multi-line or single-line
            if '\n' in instruction_text:
                # Remove newlines and extra spaces for multi-line instructions
                instruction_str = ' '.join(line.strip() for line in instruction_text.split('\n'))
            else:
                instruction_str = instruction_text
            return instruction_str
        
    raise ValueError("No instruction found in the text.")

def extract_next_engine(text):
    # Use a regular expression to find the content after "Next engine:"
    
    next_engine_patterns = [
        r"Next engine:\s*(.*)",
        r"### Next Engine:\s*(.*)"
    ]
    
    for pattern in next_engine_patterns:
        next_engine_match = re.search(pattern, text)
        if next_engine_match:
            return next_engine_match.group(
                1
            ).strip()
    raise ValueError("No next engine found in the text.")
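
As a rough illustration of what the two extractors above are expected to return, here is a condensed, self-contained sketch. `parse_fields` and `sample_output` are hypothetical (the response text is hand-written, not taken from a real run), and only the bulleted and single-line instruction patterns are covered.

```python
import re
from typing import Tuple

# Hand-written sample in the format the world model prompt asks for
sample_output = """Thoughts:
- The screenshot shows a job application form.
Next engine: Navigation Engine
Instruction:
- Fill out the Full Name field with 'John Doe'.
- Click on the 'Apply' button.
"""


def parse_fields(text: str) -> Tuple[str, str]:
    """Return (next_engine, instruction) parsed from a world model response."""
    engine = re.search(r"Next engine:\s*(.*)", text)
    # Prefer a bulleted multi-line instruction, then fall back to a single line
    instruction = (re.search(r"Instruction:\s*((?:- .*\n?)+)", text)
                   or re.search(r"Instruction:\s*(.*)", text))
    if not engine or not instruction:
        raise ValueError("Missing 'Next engine:' or 'Instruction:' field.")
    # Collapse a multi-line instruction into one line, as the extractor above does
    lines = instruction.group(1).strip().split("\n")
    return engine.group(1).strip(), " ".join(line.strip() for line in lines)


next_engine, instruction = parse_fields(sample_output)
print(next_engine)
print(instruction)
```

Trying the multi-line pattern before the single-line one matters: the single-line pattern also matches a bulleted instruction but would capture only its first bullet.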

In [9]:
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# TO DO: Generalize to BaseDriver to support other drivers

def scroll_down_one_viewport(driver):
    viewport_height = driver.execute_script("return window.innerHeight")

    body = driver.find_element(By.TAG_NAME, "body")
    num_scrolls = viewport_height // 50  # Assuming each arrow key press scrolls 50 pixels
    for _ in range(num_scrolls):
        body.send_keys(Keys.ARROW_DOWN)
        time.sleep(0.05)  
        
def scroll_up_one_viewport(driver):
    viewport_height = driver.execute_script("return window.innerHeight")

    body = driver.find_element(By.TAG_NAME, "body")
    num_scrolls = viewport_height // 50  # Assuming each arrow key press scrolls 50 pixels
    for _ in range(num_scrolls):
        body.send_keys(Keys.ARROW_UP)
        time.sleep(0.05)
    
class NavigationControl:
    def __init__(self, driver) -> None:
        self.driver = driver
        
    def execute_instruction(self, instruction):
        if 'SCROLL_DOWN' in instruction:
            driver = self.driver
            scroll_down_one_viewport(driver)
        elif 'SCROLL_UP' in instruction:
            driver = self.driver
            scroll_up_one_viewport(driver)
        elif 'WAIT' in instruction:
            time.sleep(2)
        else:
            raise ValueError(f"Unknown instruction: {instruction}")
        

In [10]:
from typing import Optional
from llama_index.core import PromptTemplate, SimpleDirectoryReader
from llama_index.core.multi_modal_llms import MultiModalLLM

from lavague.core import Context, get_default_context

class WorldModel:
    """Abstract class for WorldModel"""

    mm_llm: MultiModalLLM
    prompt_template: PromptTemplate

    def __init__(self, mm_llm: Optional[MultiModalLLM] = None, examples: str = DEFAULT_WORLD_MODEL_EXAMPLES):
        self.mm_llm = mm_llm
        self.prompt_template = WORLD_MODEL_PROMPT_TEMPLATE.partial_format(
            examples=examples
        )
        
    def get_instruction(self, objective: str, previous_instructions, last_engine, current_state, image_documents) -> str:
        """Use GPT*V to generate instruction from the current state and objective."""
        mm_llm = self.mm_llm
        
        prompt = self.prompt_template.format(
            objective=objective, previous_instructions=previous_instructions, 
            last_engine=last_engine, current_state=current_state)
        
        mm_llm_output = mm_llm.complete(prompt, image_documents=image_documents).text

        return mm_llm_output

In [11]:

from lavague.core import ActionEngine
from selenium.webdriver.remote.webdriver import WebDriver
from llama_index.core import SimpleDirectoryReader
import yaml
from lavague.drivers.selenium import SeleniumDriver

DEFAULT_N_ATTEMPTS = 5
DEFAULT_N_STEPS = 10
DEFAULT_TIME_BETWEEN_ACTIONS = 1.5

class WebAgent:
    """
    Web agent class, for now only works with selenium.
    """
    def __init__(self, action_engine: ActionEngine, python_engine: PythonEngine, world_model: WorldModel,
                 n_attempts: int = DEFAULT_N_ATTEMPTS, n_steps: int = DEFAULT_N_STEPS, 
                 time_between_actions: float = DEFAULT_TIME_BETWEEN_ACTIONS):
        driver = action_engine.driver
        
        self.driver: SeleniumDriver = driver
        self.action_engine: ActionEngine = action_engine
        self.world_model: WorldModel = world_model
        self.navigation_control: NavigationControl = NavigationControl(driver.driver)
        self.python_engine: PythonEngine = python_engine
        
        self.n_attempts = n_attempts
        self.n_steps = n_steps
        self.time_between_actions = time_between_actions

    def get(self, url):
        self.driver.goto(url)
        
    def run(self, objective: str, user_data = None, display_in_notebook: bool = False):
        world_model = self.world_model
        action_engine = self.action_engine
        driver: WebDriver = self.driver.driver
        python_engine = self.python_engine
        navigation_control = self.navigation_control
        
        n_steps = self.n_steps
        n_attempts = self.n_attempts
        time_between_actions = self.time_between_actions
        
        previous_instructions = "[NONE]"
        last_engine = "[NONE]"
        
        current_state = {
            "external_observations": {
                "vision": "[SCREEENSHOT]",
            },
            "internal_state": {
                "user_inputs": [],
                "agent_outputs": [],
            }
        }
        
        if user_data:
            current_state["internal_state"]["user_inputs"].append(user_data)

        # TO DO: Don't save on disk the screenshot but do it in memory
        driver.save_screenshot("screenshots/output.png")
        image_documents = SimpleDirectoryReader("./screenshots").load_data()
        
        for i in range(n_steps):
            current_state_str = yaml.dump(current_state, default_flow_style=False)

            world_model_output = world_model.get_instruction(objective, previous_instructions, last_engine, current_state_str, image_documents)

            print(world_model_output)

            next_engine = extract_next_engine(world_model_output)
            instruction = extract_world_model_instruction(world_model_output)
            
            if next_engine == "Navigation Engine":
                
                query = instruction
                nodes = action_engine.get_nodes(query)
                llm_context = "\n".join(nodes)
                
                success = False

                for _ in range(n_attempts):
                    try:
                        action = action_engine.get_action_from_context(llm_context, query)
                        # Keep {action} at column 0 so its first line is not parsed as an unexpected indent
                        action_code = f"""
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
{action}""".strip()

                        local_scope = {"driver": driver}
                        exec(action_code, local_scope, local_scope)
                        
                        success = True
                        break
                    except Exception as e:
                        print(f"Action execution failed. Retrying...")
                        print("Error: ", e)
                        pass
                if not success:
                    instruction = "[FAILED] " + instruction
                time.sleep(time_between_actions)
                driver.save_screenshot("screenshots/output.png")
                image_documents = SimpleDirectoryReader("./screenshots").load_data()
                    
            elif "Python Engine" in next_engine:
                state = {
                    "html": driver.page_source
                }
                success = False

                for _ in range(n_attempts):
                    try:
                        python_code = python_engine.generate_code(instruction, state)
                        output = python_engine.execute_code(python_code, state)
                        
                        if output:
                            current_state["internal_state"]["agent_outputs"].append(output)
                            success = True
                            break
                        else:
                            print("Empty output of Python engine")
                            print("Code generated by Python Engine: ", python_code)
                            print("Output generated by Python Engine: ", output)
                            pass
                    except Exception as e:
                        print(f"Python engine execution failed. Retrying...")
                        print("Error: ", e)
                        pass
                
                if not success:
                    instruction = "[FAILED] " + instruction
                
                
            elif "Navigation Controls" in next_engine:
                navigation_control.execute_instruction(instruction)
                driver.save_screenshot("screenshots/output.png")
                image_documents = SimpleDirectoryReader("./screenshots").load_data()
                
            elif next_engine == "STOP" or instruction == "STOP":
                print("Objective reached. Stopping...")
                break
            
            if previous_instructions == "[NONE]":
                previous_instructions = f"""
- {instruction}"""
            else:
                previous_instructions += f"""
- {instruction}"""
                
            last_engine = next_engine

        output = current_state["internal_state"]["agent_outputs"]
        return output
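
The bookkeeping at the end of each step in `run` (tagging failed steps and growing the bullet list the world model sees as `Previous instructions`) can be sketched in isolation. `record_step` is a hypothetical helper, not part of LaVague, and unlike the loop above it does not emit a leading newline before the first bullet.

```python
def record_step(history: str, instruction: str, success: bool = True) -> str:
    """Append an instruction to the bullet-list history shown to the world model."""
    # Failed steps are tagged so the world model can reflect on them
    if not success:
        instruction = "[FAILED] " + instruction
    entry = f"- {instruction}"
    # "[NONE]" is the placeholder used before any step has been taken
    return entry if history == "[NONE]" else f"{history}\n{entry}"


history = "[NONE]"
history = record_step(history, "Click on 'Issues'")
history = record_step(history, "Click on the first issue", success=False)
print(history)
```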

In [17]:
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.llms.groq import Groq
from lavague.core import ActionEngine
from lavague.drivers.selenium import SeleniumDriver

from lavague.contexts.openai import OpenaiContext

context = OpenaiContext()

# embed_model = GeminiEmbedding(model_name="models/text-embedding-004")
# llm = Groq(model="llama3-8b-8192")
# mm_llm = OpenAIMultiModal(model="gpt-4o", temperature=0.0)

selenium_driver = SeleniumDriver()
selenium_driver.driver = driver
action_engine = ActionEngine(selenium_driver, context=context)

mm_llm = context.mm_llm
# llm = context.llm
llm = Gemini(model_name="models/gemini-1.5-flash-latest")

world_model = WorldModel(mm_llm=mm_llm)
python_engine = PythonEngine(llm=llm)
agent = WebAgent(action_engine, python_engine, world_model)

# url = "https://huggingface.co/"
# objective = "Provide code to run falcon 11b"

# url = "https://form.jotform.com/241363523875359"
# objective = "Fill out this form"


url = "https://maize-paddleboat-93e.notion.site/Welcome-to-ACME-INC-0ac66cd290e3453b93a993e1a3ed272f"
objective = "Who is in the Software Quality Assurance team?"

user_data = [
    {
        "job": "product lead",
        "name": "john doe",
        "email": "john.doe@gmail.com",
        "phone": "555-123-4567",
        "cover letter": "Excited to work with you!"
    }
]

agent.get(url)
output = agent.run(objective)
print(output)

Thoughts:
- The screenshot is currently blank, indicating that no visual information is available.
- The objective is to find out who is in the Software Quality Assurance team.
- Since there is no previous instruction, I need to start by navigating to a relevant section that might contain information about the team.
- Typically, information about teams can be found under sections like "About Us," "Our Team," or "Departments" on a website.

Next engine: Navigation Engine
Instruction: Look for and click on a link or section that might lead to information about the team, such as "About Us," "Our Team," or "Departments."
Thoughts:
- The screenshot shows the homepage of ACME INC. with sections about the company, services, and departments.
- The objective is to find information about the Software Quality Assurance team.
- The "Our Departments" section lists several departments, including "Software Development," "Sales and Marketing Department," "Customer Support Department," and "Human Resou

In [16]:
print(output[0])

To run Falcon 11B, you can use the following code snippets:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import transformers
import torch

model = "tiiuae/falcon-11B"
tokenizer = AutoTokenizer.from_pretrained(model)

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

sequences = pipeline(
    "Can you explain the concepts of Quantum Computing?",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```


In [13]:

print(output[0])

IndexError: list index out of range

In [None]:
from llama_index.core import SimpleDirectoryReader
import yaml

N_ATTEMPTS = 3
N_STEPS = 13
TIME_BETWEEN_ACTIONS = 1.5

# url = "https://maize-paddleboat-93e.notion.site/Welcome-to-ACME-INC-0ac66cd290e3453b93a993e1a3ed272f"
# objective = "What's the name of our Lead Software Dev?"

url = "https://maize-paddleboat-93e.notion.site/Welcome-to-ACME-INC-0ac66cd290e3453b93a993e1a3ed272f"
objective = "Who is in the Software Quality Assurance team?"

# url = "https://huggingface.co"
# objective = "Provide the code to use Falcon 11B"

# url = "https://huggingface.co"
# objective = "Provide information Waifu diffusion"

# url = "https://huggingface.co"
# objective = "Provide code to use DeepSeek-V2-Chat with transformers"

# url = "https://huggingface.co/docs"
# objective = "Provide the code to install PEFT"

# url = "https://huggingface.co/"
# objective = "Print the code to use the hottest model on Hugging Face"



# current_state = {
#     "external_observations": {
#         "vision": "[SCREEENSHOT]",
#     },
#     "internal_state": {
#         "user_inputs": [],
#         "agent_outputs": [],
#     }
# }

world_model = WorldModel(mm_llm=mm_llm)

url = "https://form.jotform.com/241363523875359"
objective = "Fill out this form"

user_data = [
    {
        "job": "product lead",
        "name": "john doe",
        "email": "john.doe@gmail.com",
        "phone": "555-123-4567",
        "cover letter": "Excited to work with you!"
    }
]
current_state = {
    "external_observations": {
        "vision": "[SCREEENSHOT]",
    },
    "internal_state": {
        "user_inputs": [user_data],
        "agent_outputs": [],
    }
}

driver.get(url)

previous_instructions = "[NONE]"
last_engine = "[NONE]"

driver.save_screenshot("screenshots/output.png")
image_documents = SimpleDirectoryReader("./screenshots").load_data()

for i in range(N_STEPS):
    current_state_str = yaml.dump(current_state, default_flow_style=False)

    # WorldModel.get_instruction formats the prompt itself, so pass it the YAML string
    world_model_output = world_model.get_instruction(objective, previous_instructions, last_engine, current_state_str, image_documents)

    print(world_model_output)

    next_engine = extract_next_engine(world_model_output)
    instruction = extract_world_model_instruction(world_model_output)
    
    if "Navigation Engine" in next_engine:
        
        query = instruction
        nodes = action_engine.get_nodes(query)
        llm_context = "\n".join(nodes)
        
        success = False

        for _ in range(N_ATTEMPTS):
            try:
                action = action_engine.get_action_from_context(llm_context, query)
                # Keep {action} at column 0 so its first line is not parsed as an unexpected indent
                action_code = f"""
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
{action}""".strip()

                local_scope = {"driver": driver}
                exec(action_code, local_scope, local_scope)
                
                success = True
                break
            except Exception as e:
                print(f"Action execution failed. Retrying...")
                pass
        if not success:
            instruction = "[FAILED] " + instruction
        time.sleep(TIME_BETWEEN_ACTIONS)
        driver.save_screenshot("screenshots/output.png")
        image_documents = SimpleDirectoryReader("./screenshots").load_data()
            
    elif "Python Engine" in next_engine:
        state = {
            "html": driver.page_source
        }
        success = False

        for _ in range(N_ATTEMPTS):
            try:
                python_code = python_engine.generate_code(instruction, state)
                print("Code generated by Python Engine: ", python_code)
                output = python_engine.execute_code(python_code, state)
                print("Output generated by Python Engine: ", output)
                
                current_state["internal_state"]["agent_outputs"].append(output)
                success = True
                break
            except Exception as e:
                print(f"Python engine execution failed. Retrying...")
                print("Error: ", e)
                
                pass
        
        if not success:
            instruction = "[FAILED] " + instruction
        
    elif "Navigation Controls" in next_engine:
        agent.navigation_control.execute_instruction(instruction)
        driver.save_screenshot("screenshots/output.png")
        image_documents = SimpleDirectoryReader("./screenshots").load_data()
        
    elif next_engine == "STOP" or instruction == "STOP":
        print("Objective reached. Stopping...")
        break
    
    if previous_instructions == "[NONE]":
        previous_instructions = f"""
- {instruction}"""
    else:
        previous_instructions += f"""
- {instruction}"""
        
    last_engine = next_engine


Thoughts:
- The current screenshot shows a job application form for the position of Operations Manager at ACME Inc.
- The form requires the following fields to be filled: Full Name, Email Address, Phone Number, and Cover Letter.
- The internal state contains the necessary information to fill out the form: job, name, email, phone, and cover letter.
- The next step is to fill out the form with the provided information.

Next engine: Navigation Engine
Instruction: Fill out the form with the following details:
- Full Name: John Doe
- Email Address: john.doe@gmail.com
- Phone Number: 555-123-4567
- Cover Letter: Excited to work with you!
Thoughts:
- The current screenshot shows a job application form for the position of Operations Manager at ACME Inc.
- The form fields include Full Name, Email Address, Phone Number, and Cover Letter.
- The objective is to fill out the form with the provided details.
- The provided details are: 
  - Job: Product Lead
  - Name: John Doe
  - Email: john.doe@gm

In [None]:
instruction

'Fill in the "Email Address" field with "john.doe@gmail.com".'

- Fill out the Full Name field with 'John Doe'. - Fill out the Phone Number field with '555-123-4567'. - Fill out the Cover Letter field with 'Excited to work with you!'. - Click on the 'Apply' button.
Fill out the Full Name field with 'John Doe'.


In [None]:
extract_instruction(text)

"- Fill out the Full Name field with 'John Doe'."