In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from llama_index.core import PromptTemplate

world_model_examples = """
Objective: What's the name of our Lead Developer?
Previous instructions: [NONE]
Last engine: [NONE]
Current relevant state: 
- screenshot: [SCREENSHOT]
- page_content_summary: The organization is comprised of several teams, each with a distinct focus. The Software Development Team creates custom applications, while the IT Support Team ensures the technology infrastructure runs smoothly. The Quality Assurance Team rigorously tests products to maintain high standards of quality. The Research and Development Team explores new technologies and methodologies, and the Systems Engineering Team designs and maintains complex systems that power the organization's services.

Thoughts:
- The screenshot shows a Notion page.
- The page content summary provides an overview of the organization's structure and the roles of different teams.
- The "Our Departments" section lists various departments, including "Engineering Department."
- To find the name of the Lead Developer, I should look for information under the "Engineering Department" section.
- The "Engineering Department" section is likely to contain details about the team, including the Lead Developer.
Next engine: Navigation Engine
Instruction: Locate the link titled "Engineering Department" and click on it.
-----
Objective:  Go to the first issue you can find
Previous instructions: 
- Click on 'Issues' with the number '28' next to it.
Last engine: Navigation Engine
Current relevant state: 
- screenshot: [SCREENSHOT]
- page_content_summary: The provided text appears to be a list of issues or tasks related to the LaVague project. The issues are categorized into different types, including bugs, enhancements, and documentation improvements. The issues range from debugging guides to documentation improvements, and even feature requests. Some issues are labeled as "help wanted," indicating that they require additional attention or expertise. The text also includes information about the status of the issues, such as whether they are open, in progress, or have been closed.

Thoughts:
- The current screenshot shows the issues page of the repository 'lavague-ai/LaVague'.
- The summary indicates that the page contains a list of issues related to the LaVague project.
- The objective is to go to the first issue.
- The first issue is highlighted in the list of issues.
Next engine: Navigation Engine
Instruction: Click on the issue, with title 'Build and share place where people can suggest their use cases and results #225'
-----
Objective: Find when Llama 3 was released
Previous instructions:
- Click on 'meta-llama/Meta-Llama-3-8B'
Last engine: Navigation Engine
Current relevant state: 
- screenshot: [SCREENSHOT]
- page_content_summary: The provided text discusses the development and release of Llama 3, a large language model designed for various applications. The model's development prioritizes responsible AI development, limiting misuse and harm, and supporting the open-source community. 
The model's safety features include safeguards such as Meta Llama Guard 2 and Code Shield, which drastically reduce residual risks while maintaining helpfulness. The model's safety is ensured through extensive testing, adversarial evaluations, and implemented safety mitigations. 
The text also highlights the importance of responsible use, emphasizing the need for developers to implement safety best practices and tailor safeguards to their specific use case and audience. The model's refusals to benign prompts have been improved, reducing the likelihood of falsely refusing to answer prompts. 
The release of Llama 3 followed a rigorous process, including extra measures against misuse and critical risks. The model's safety has been assessed in various areas, including CBRNE threats, cyber security, and child safety. The company is actively contributing to open consortiums, promoting safety standardization and transparency in the development of generative AI.

Thoughts:
- The current screenshot shows the model page for 'meta-llama/Meta-Llama-3-8B' on Hugging Face.
- The page content summary provides detailed information about the development and release of Llama 3, emphasizing its safety features and responsible AI development.
- Hugging Face is a hub for AI models and datasets, where users can explore and interact with a variety of AI models.
- I am therefore on the right page to find information about the release date of 'Meta-Llama-3-8B'.
- To find the release date, I need to locate the relevant information in the content of the page.
- Therefore the best next step is to use the Python Engine to extract the release date from the content of the page.
Next engine: Python Engine
Instruction: Extract the release date of 'Meta-Llama-3-8B' from the textual content of the page.
-----
Objective: Provide the code to get started with Gemini API
Previous instructions:
- Click on 'Read API docs'
- Click on 'Gemini API quickstart' on the menu
Last engine: Navigation Engine
Current relevant state:
- screenshot: [SCREENSHOT]
- page_content_summary: The text provides an introduction to the Gemini API, a family of Google's most capable AI models. It offers a comprehensive guide to help users get started with the API. The fastest way to start using Gemini is through Google AI Studio, a web-based tool that allows users to prototype, run prompts, and get started with the API. The text also provides a quickstart guide to help users begin. Additionally, it emphasizes the importance of using LLMs safely and responsibly, directing users to safety settings and safety guidance documentation. The text also mentions that the Gemini API and Google AI Studio are available in over 180 countries. Finally, it provides further reading resources for users to learn more about the models provided by the Gemini API.

Thoughts:
- The current screenshot shows the getting-started documentation page for the Gemini API.
- The page content summary provides an overview of the Gemini API, highlighting its capabilities and the code to get started.
- I am therefore on the right page to find the code to get started with the Gemini API.
- The next step is to provide the code to get started with the Gemini API.
- Therefore I need to use the Python Engine to extract the code to get started with the Gemini API from this page.
Next engine: Python Engine
Instruction: Extract the code to get started with the Gemini API from the content of the page.
-----
Objective: Show what is the cheapest product
Previous instructions: [NONE]
Last engine: [NONE]
Current relevant state:
- screenshot: [SCREENSHOT]
- page_content_summary: The text provides an overview of the products available on the website. The products are categorized into different sections, including electronics, clothing, accessories, and home goods. Each product listing includes details such as the product name, price, and description. 

Thoughts:
- The screenshot shows an e-commerce website with various products.
- The page content summary describes the different product categories available on the website.
- To find the cheapest product, I need to identify the product with the lowest price.
- There seem to be selectors for sorting products by price on the left side of the page.
- The screenshot only shows part of the selectors for price. I should scroll down to see the full list of products and prices.
Next engine: Navigation Controls
Instruction: SCROLL_DOWN
"""

WORLD_MODEL_PROMPT_TEMPLATE = PromptTemplate("""
You are an AI system specialized in high level reasoning. Your goal is to generate instructions for other specialized AIs to perform web actions to reach objectives given by humans.
Your inputs are:
- objective ('str'): a high level description of the goal to achieve.
- previous_instructions ('str'): a list of previous steps taken to reach the objective.
- last_engine ('str'): the engine used in the previous step.
- current_relevant_state ('obj'): the state of the environment to use to perform the next step. This can be a screenshot if the previous engine was a NavigationEngine, or a description of variables if the previous engine was a PythonEngine.

Your outputs are:
- thoughts ('str'): a list of thoughts in bullet points detailing your reasoning.
- next_engine ('str'): the engine to use for the next step.
- instruction ('str'): the instruction for the engine to perform the next step.

Here are the engines at your disposal:
- Python Engine: This engine is used when the task requires computation using the current state of the agent. 
It does not impact the outside world and does not navigate.
- Navigation Engine: This engine is used when the next step of the task requires further navigation to reach the goal. 
For instance it can be used to click on a link or to fill a form on a webpage. This engine is heavy and will do complex processing of the current HTML to decide which element to interact with.
- Navigation Controls: This is a simpler engine for commands that do not require reasoning, which are 'SCROLL_DOWN', 'SCROLL_UP' and 'WAIT'.

The instruction should be as detailed as possible and only contain the next step. 
When providing an instruction to the Python Engine, do not provide any guideline on using visual information such as the screenshot, as the Python Engine does not have access to it.
If the screenshot provides information but seems insufficient, use navigation controls to further explore the page.
If the objective is already achieved in the screenshot, provide the next engine and instruction 'STOP'.

Here are previous examples:
{examples}

Here is the next objective:
Objective: {objective}
Previous instructions: 
{previous_instructions}
Last engine: {last_engine}
Current relevant state: {current_state}

Thoughts:
"""
)


In [3]:
"""
- IO Engine: This engine is used when the task requires interacting with the outside world, such as at the end of an agent run, to send the result of the agent to the user.
For instance, it can be used to send an email with the result of the agent, or pull data from a database to update the agent state.
"""

'\n- IO Engine: This engine is used when the task requires interacting with the outside world, such as at the end of an agent run, to send the result of the agent to the user.\nFor instance, it can be used to send an email with the result of the agent, or pull data from a database to update the agent state.\n'

In [4]:


from lavague.contexts.openai import OpenaiContext
openai_context = OpenaiContext(llm="gpt-4o")

from lavague.contexts.gemini import GeminiContext
context = GeminiContext()

context.mm_llm = openai_context.mm_llm

mm_llm = context.mm_llm
prompt_template = WORLD_MODEL_PROMPT_TEMPLATE.partial_format(
    examples=world_model_examples
)



In [5]:

from typing import Optional
from lavague.core import Context, get_default_context
from llama_index.core.base.llms.base import BaseLLM
import copy

PYTHON_ENGINE_PROMPT_TEMPLATE = PromptTemplate("""
You are an AI system specialized in Python code generation to answer user queries.
The inputs are: an instruction, and the current state of the local variables available to the environment where your code will be executed.
Your output is the code that will perform the action described in the instruction, using the variables available in the environment.
You can import libraries and use any variables available in the environment.
Detail thoroughly the steps to perform the action in the code you generate with comments.
The last line of your code should be an assignment to the variable 'output' containing the result of the action.

Here are previous examples:
{examples}

Instruction: {instruction}
State:
{state_description}
Code:
""")

class PythonEngine:
    llm: BaseLLM    
    prompt_template: PromptTemplate

    def __init__(self, examples: str, context: Optional[Context] = None):
        if context is None:
            context = get_default_context()
        self.llm = context.llm
        self.extractor = context.extractor
        self.prompt_template = PYTHON_ENGINE_PROMPT_TEMPLATE.partial_format(
            examples=examples
        )
        
    def generate_code(self, instruction: str, state: dict) -> str:
        state_description = self.get_state_description(state)
        prompt = self.prompt_template.format(instruction=instruction, state_description=state_description)
        response = self.llm.complete(prompt).text
        code = self.extractor.extract(response)
        return code
    
    def execute_code(self, code: str, state: dict):
        local_scope = copy.deepcopy(state)
        exec(code, local_scope, local_scope)
        output = local_scope["output"]
        return output
        
    def get_state_description(self, state: dict) -> str:
        """TO DO: provide more complex state descriptions"""
        state_description = """
    html ('str'): The content of the HTML page being analyzed"""
        return state_description
    
with open("python_engine_examples.txt") as f:
    python_engine_examples = f.read()
    
python_engine = PythonEngine(python_engine_examples, context)
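
The `execute_code` method above runs the generated code against a deep copy of the state and reads the result from a variable named `output`. The sketch below isolates that pattern so it can be tried without an LLM. The helper `run_generated_code` and the sample snippet are hypothetical illustrations, not part of LaVague.

```python
import copy


def run_generated_code(code: str, state: dict):
    """Run `code` against a copy of `state` and return its `output` variable."""
    # Deep-copy so the generated code cannot mutate the caller's state.
    scope = copy.deepcopy(state)
    exec(code, scope, scope)
    if "output" not in scope:
        raise KeyError("generated code did not assign to 'output'")
    return scope["output"]


# Hypothetical snippet standing in for LLM-generated code:
# it counts the opening <a> tags in the `html` variable.
generated = """
output = html.count("<a ")
"""

state = {"html": "<p><a href='#'>one</a> <a href='#'>two</a></p>"}
result = run_generated_code(generated, state)
print(result)
```

Because the code runs on a copy, `state` itself never gains an `output` key. Note that `exec` offers no sandboxing, so this is only suitable for code you are willing to run locally.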

In [6]:
REWRITER_PROMPT_TEMPLATE = PromptTemplate("""
You are an AI expert.
You are given a high level instruction on a generic action to perform.
Your output is an instruction of the action, rewritten to be more specific on the capabilities at your disposal to perform the action.
Here are your capabilities:
{capabilities}

Here are previous examples:
{examples}

Here is the next instruction to rewrite:
Original instruction: {original_instruction}
""")

DEFAULT_CAPABILITIES = """
- Answer questions using the content of an HTML page using llama index and trafilatura
"""

DEFAULT_EXAMPLES = """
Original instruction: Use the content of the HTML page to answer the question 'How was falcon-11B trained?'
Capability: Answer questions using the content of an HTML page using llama index and trafilatura
Rewritten instruction: Extract the content of the HTML page and use llama index to answer the question 'How was falcon-11B trained?'
"""

class Rewriter:
    def __init__(self, capabilities: str = DEFAULT_CAPABILITIES, examples: str = DEFAULT_EXAMPLES, context: Optional[Context] = None):
        if context is None:
            context = get_default_context()
        self.llm = context.llm
        self.prompt_template = REWRITER_PROMPT_TEMPLATE.partial_format(
            capabilities=capabilities,
            examples=examples
        )      
    def rewrite_instruction(self, original_instruction: str) -> str:
        prompt = self.prompt_template.format(original_instruction=original_instruction)
        rewritten_instruction = self.llm.complete(prompt).text
        return rewritten_instruction
    
rewriter = Rewriter(capabilities=DEFAULT_CAPABILITIES, examples=DEFAULT_EXAMPLES, context=context)

In [7]:
# Too lazy to finish

# Thoughts:
# - The current page is the Azure Pricing Calculator.
# - To calculate the cost of an AKS (Azure Kubernetes Service) cluster, we need to add the AKS service to the calculator.
# - The current screen does not show the options to add services.
# - The best approach is to scroll down to find the section where services can be added to the calculator.
# - Because this is a simple scrolling action, the best next step is to use the Navigation Controls engine to scroll down.
# Next engine: Navigation Controls
# Instruction: Scroll down by one full screen to continue exploring the current page.
# -----
# Objective: Find the definition of 'Diffusers'
# Previous instructions: 
# - Click on 'Diffusers' link
# Last engine: Navigation Engine
# Current state: [SCREENSHOT]

# Thought:
# - The current page is the documentation page of Hugging Face.
# - Hugging Face is a platform for AI models and datasets, where users can explore and interact with the latest AI resources.
# - The definition of 'Diffusers' is provided in the documentation.
# - No further action is needed to achieve the objective.
# Next engine: STOP
# Instruction: STOP

In [8]:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=chrome_options)

width = 1024
height = 1024

driver.set_window_size(width, height)
viewport_height = driver.execute_script("return window.innerHeight;")

height_difference = height - viewport_height
driver.set_window_size(width, height + height_difference)

driver.get("https://huggingface.co/meta-llama/Meta-Llama-3-8B")
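
The cell above sets the window size, measures `window.innerHeight`, and then enlarges the window by the difference so that the viewport, rather than the outer window, ends up at the target height. The arithmetic can be sketched as follows. The function name and the measured value of 890 are hypothetical, and the sketch assumes the browser chrome keeps the same height after resizing.

```python
def corrected_window_height(target_height: int, measured_viewport_height: int) -> int:
    """Outer window height needed for the viewport to reach `target_height`."""
    # The browser chrome (tabs, address bar) accounts for the difference.
    chrome_height = target_height - measured_viewport_height
    return target_height + chrome_height


# Hypothetical measurement: a 1024px window that only exposes 890px of viewport.
new_height = corrected_window_height(1024, 890)
print(new_height)  # 1158
```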

In [9]:
from llama_index.core import Document, VectorStoreIndex
from llama_index.core import Settings
import trafilatura

class PageSummarizer:
    def __init__(self, llm, embed_model):
        self.llm = llm
        self.embed_model = embed_model
        
    def summarize(self, html: str) -> str:
        Settings.llm = self.llm
        Settings.embed_model = self.embed_model
        
        page_content = trafilatura.extract(html)
        
        documents = [Document(text=page_content)]
        index = VectorStoreIndex.from_documents(documents)
        query_engine = index.as_query_engine()
        instruction = "Provide a detailed summary of this text"
        page_content_summary = query_engine.query(instruction).response
        return page_content_summary

from llama_index.llms.groq import Groq

model = "llama3-8b-8192"
llm = Groq(model=model, temperature=0.1)
embed_model = context.embedding

page_summarizer = PageSummarizer(llm, embed_model)

In [10]:
import re

def extract_instruction(text):
    # Use a regular expression to find the content after "Instruction:"
    instruction_patterns = [
        r"Instruction:\s*(.*)",
        r"### Instruction:\s*(.*)"
    ]
    for pattern in instruction_patterns:
        instruction_match = re.search(pattern, text)
        if instruction_match:
            return instruction_match.group(
                1
            ).strip()  # Return the matched group, stripping any excess whitespace
        
    raise ValueError("No instruction found in the text.")

def extract_next_engine(text):
    # Use a regular expression to find the content after "Next engine:"
    
    next_engine_patterns = [
        r"Next engine:\s*(.*)",
        r"### Next Engine:\s*(.*)"
    ]
    
    for pattern in next_engine_patterns:
        next_engine_match = re.search(pattern, text)
        if next_engine_match:
            return next_engine_match.group(
                1
            ).strip()
    raise ValueError("No next engine found in the text.")
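
The two extractors above try a list of patterns in order. One possible way to fold them into a single helper is sketched below: it matches the label case-insensitively, accepts optional markdown heading markers, and takes the value from the same line or the next non-empty one. The name `extract_field` is hypothetical, and the sample outputs are made up for illustration.

```python
import re


def extract_field(text: str, label: str) -> str:
    """Return the value that follows `label:` in a model output."""
    # Optional '#' heading markers, then the label, then the value.
    # `\s*` after the colon may span a newline, so heading-style outputs work too.
    pattern = rf"^\s*(?:#+\s*)?{re.escape(label)}:\s*(.+)$"
    match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
    if match is None:
        raise ValueError(f"No '{label}' found in the text.")
    return match.group(1).strip()


inline_output = (
    "Thoughts:\n- ok\n"
    "Next engine: Navigation Engine\n"
    "Instruction: Click on the 'PEFT' link."
)
heading_output = (
    "### Next Engine:\nNavigation Engine\n\n"
    "### Instruction:\nLocate and click on the PEFT link.\n"
)

for sample in (inline_output, heading_output):
    print(extract_field(sample, "Next engine"), "|", extract_field(sample, "Instruction"))
```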

In [11]:
from lavague.core import ActionEngine
from lavague.drivers.selenium import SeleniumDriver

selenium_driver = SeleniumDriver()
selenium_driver.driver = driver
navigation_engine = ActionEngine(selenium_driver)
action_engine = navigation_engine

In [12]:
import time
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# TO DO: Generalize to BaseDriver to support other drivers

class NavigationControl:
    def __init__(self, driver) -> None:
        self.driver = driver
        
    def execute_instruction(self, instruction):
        if instruction == 'SCROLL_DOWN':
            driver = self.driver
            body = driver.find_element(By.TAG_NAME, 'body')
            body.send_keys(Keys.PAGE_DOWN)
        elif instruction == 'SCROLL_UP':
            driver = self.driver
            body = driver.find_element(By.TAG_NAME, 'body')
            body.send_keys(Keys.PAGE_UP)
        elif instruction == 'WAIT':
            time.sleep(2)
        else:
            raise ValueError(f"Unknown instruction: {instruction}")
        

navigation_control = NavigationControl(driver)
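
The `if`/`elif` chain in `NavigationControl` maps three fixed commands to actions. A table-driven variant is sketched below as one possible alternative. To keep it runnable without a browser, `FakeBody` and the string constants stand in for the Selenium `body` element and `Keys.PAGE_DOWN` / `Keys.PAGE_UP`; all names here are hypothetical.

```python
import time

# Stand-ins for selenium's Keys.PAGE_DOWN / Keys.PAGE_UP (assumption for this sketch).
PAGE_DOWN, PAGE_UP = "PAGE_DOWN", "PAGE_UP"


class FakeBody:
    """Records the keys it receives instead of sending them to a page."""

    def __init__(self):
        self.sent = []

    def send_keys(self, key):
        self.sent.append(key)


def make_controls(body, wait_seconds=2.0):
    # One entry per supported command; adding a command means adding a row.
    return {
        "SCROLL_DOWN": lambda: body.send_keys(PAGE_DOWN),
        "SCROLL_UP": lambda: body.send_keys(PAGE_UP),
        "WAIT": lambda: time.sleep(wait_seconds),
    }


def execute_instruction(controls, instruction):
    # Tolerate stray whitespace and lowercase in the model's output.
    command = instruction.strip().upper()
    if command not in controls:
        raise ValueError(f"Unknown instruction: {instruction}")
    controls[command]()


body = FakeBody()
controls = make_controls(body, wait_seconds=0.0)
for instruction in ["SCROLL_DOWN", " scroll_up ", "WAIT"]:
    execute_instruction(controls, instruction)
print(body.sent)
```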

In [13]:
from llama_index.core import SimpleDirectoryReader

# url = "https://maize-paddleboat-93e.notion.site/Welcome-to-ACME-INC-0ac66cd290e3453b93a993e1a3ed272f"
# objective = "What's the name of our Head of Software?"

# url = "https://huggingface.co/models"
# objective = "Provide the code to use Falcon 11B"

url = "https://huggingface.co/docs"
objective = "Provide the code to install PEFT"

driver.get(url)

import time
time.sleep(3)

previous_instructions = "[NONE]"
last_engine = "[NONE]"

N_ATTEMPTS = 5
N_STEPS = 5

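import os
os.makedirs("screenshots", exist_ok=True)  # make sure the target directory exists before saving
driver.save_screenshot("screenshots/output.png")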
image_documents = SimpleDirectoryReader("./screenshots").load_data()

html = driver.page_source
page_content_summary = page_summarizer.summarize(html)

current_state = f"""
- screenshot: [SCREENSHOT]
- page_content_summary: {page_content_summary}
"""

for i in range(N_STEPS):
    prompt = prompt_template.format(
        objective=objective, previous_instructions=previous_instructions, 
        last_engine=last_engine, current_state=current_state)

    mm_llm_output = mm_llm.complete(prompt, image_documents=image_documents).text

    print(mm_llm_output)

    next_engine = extract_next_engine(mm_llm_output)
    instruction = extract_instruction(mm_llm_output)
    
    if next_engine == "Navigation Engine":
        
        query = instruction
        nodes = action_engine.get_nodes(query)
        llm_context = "\n".join(nodes)

        for _ in range(N_ATTEMPTS):
            try:
                action = action_engine.get_action_from_context(llm_context, query)
                # Keep {action} at column 0 so its first line is not parsed with an unexpected indent
                code = f"""
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
{action}""".strip()

                local_scope = {"driver": driver}
                exec(code, local_scope, local_scope)
                break
            except Exception as e:
                # Report the cause so repeated failures can be diagnosed
                print(f"Action execution failed ({e}). Retrying...")
            
        driver.save_screenshot("screenshots/output.png")
        image_documents = SimpleDirectoryReader("./screenshots").load_data()

        html = driver.page_source
        page_content_summary = page_summarizer.summarize(html)

        current_state = f"""
- screenshot: [SCREENSHOT]
- page_content_summary: {page_content_summary}
        """
            
    elif next_engine == "Python Engine":
        rewritten_instruction = rewriter.rewrite_instruction(instruction)
        state = {
            "html": driver.page_source
        }

        code = python_engine.generate_code(rewritten_instruction, state)
        output = python_engine.execute_code(code, state)
        print("Output generated by Python Engine: ", output)
        print("Code generated by Python Engine: ", code)
        
        current_state = f"""
- output: {output}
"""
    elif next_engine == "Navigation Controls":
        navigation_control.execute_instruction(instruction)
        driver.save_screenshot("screenshots/output.png")
        image_documents = SimpleDirectoryReader("./screenshots").load_data()
        
    # elif next_engine == "IO Engine":
    #     print("IO Engine")
    #     break
    elif next_engine == "STOP" or instruction == "STOP":
        print("Objective reached. Stopping...")
        break
    
    if previous_instructions == "[NONE]":
        previous_instructions = f"""
- {instruction}"""
    else:
        previous_instructions += f"""
- {instruction}"""
        
    last_engine = next_engine


Thoughts:
- The screenshot shows the documentation page of Hugging Face.
- The page content summary provides an overview of various products and services offered by Hugging Face.
- The objective is to provide the code to install PEFT.
- The "PEFT" section is visible in the screenshot.
- I need to navigate to the PEFT section to find the installation code.

Next engine: Navigation Engine
Instruction: Click on the "PEFT" section to navigate to its documentation.
Thoughts:
- The current screenshot shows the documentation page for the PEFT library on Hugging Face.
- The page content summary provides an overview of the PEFT library and its integration with other libraries.
- To provide the code to install PEFT, I need to locate the installation instructions within the documentation.
- The installation instructions are likely to be found in the "Quickstart" or "Installation" section of the documentation.
- The "Quickstart" section is visible in the left-hand menu.

Next engine: Navigation En

KeyboardInterrupt: 
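
The loop in the cell above asks the world model for the next engine and instruction, dispatches to that engine, and appends the instruction to the history that is fed back into the prompt. Its control flow can be sketched without a browser or an LLM as shown below. The scripted world model, the `run_agent` and `format_history` helpers, and the recording lambdas are all hypothetical stand-ins.

```python
def format_history(instructions):
    """Render past instructions as '[NONE]' or one '- ' bullet per line."""
    if not instructions:
        return "[NONE]"
    return "\n" + "\n".join(f"- {step}" for step in instructions)


def run_agent(world_model, engines, objective, max_steps=5):
    history, last_engine = [], "[NONE]"
    for _ in range(max_steps):
        next_engine, instruction = world_model(
            objective, format_history(history), last_engine
        )
        if next_engine == "STOP" or instruction == "STOP":
            break
        # Unlike the cell above, an unknown engine name raises KeyError here.
        engines[next_engine](instruction)
        history.append(instruction)
        last_engine = next_engine
    return history


# Scripted stand-in for the multimodal LLM: returns fixed (engine, instruction) pairs.
script = iter([
    ("Navigation Engine", "Click on 'PEFT'"),
    ("Navigation Controls", "SCROLL_DOWN"),
    ("STOP", "STOP"),
])
calls = []
engines = {
    "Navigation Engine": lambda instruction: calls.append(("nav", instruction)),
    "Navigation Controls": lambda instruction: calls.append(("ctl", instruction)),
}

history = run_agent(lambda *args: next(script), engines, "Provide the code to install PEFT")
print(format_history(history))
```

Keeping the history as a list and formatting it only when building the prompt avoids the special case for the initial `"[NONE]"` string.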

In [None]:
mm_llm

GeminiMultiModal(model_name='models/gemini-1.5-flash-latest', temperature=0.1, max_tokens=8192, generate_kwargs={})

In [None]:
import re

# Sample text
text = """
### Next Engine:
Navigation Engine

### Instruction:
Locate and click on the "PEFT" link to navigate to its documentation or installation guide.
"""

# Define the patterns to extract the required parts


# Print the results
print("Navigation Engine:", navigation_engine)
print("Instruction:", instruction)


Navigation Engine: <lavague.core.action_engine.ActionEngine object at 0x7f70c9a9ab00>
Instruction: STOP


In [None]:
html = driver.page_source
page_content_summary = page_summarizer.summarize(html)


In [None]:
print(output)

Thoughts:
- The Python Engine has already provided the answer to the objective.
- The objective is already achieved.
Next engine: STOP
Instruction: STOP
