### Desktop Assistant



1. Install dependencies...

In [1]:
%pip install -qU pyautogui Pillow langchain langchain_core langchain_openai langchain_google_genai langchain_experimental Pydantic

Note: you may need to restart the kernel to use updated packages.


2. Constants and tokens:

In [2]:
from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import PromptTemplate
import getpass
import os

if not os.environ.get("OPENAI_API_KEY"):
  os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter API key for OpenAI: ")
if not os.environ.get("GOOGLE_API_KEY"):
  os.environ["GOOGLE_API_KEY"] = getpass.getpass("Enter Gemini API key: ")

OPENAI = ChatOpenAI(
    model = "gpt-4o",
    api_key = os.getenv("OPENAI_API_KEY")
)
GEMINI = ChatGoogleGenerativeAI(
    model = "gemini-1.5-pro-latest",
    google_api_key = os.getenv("GOOGLE_API_KEY")
)
MODELS = {
    "gemini": GEMINI,
    "openai": OPENAI
}


PREFIX = """
YOU ARE AN EXPERT AUTOMATION AGENT WITH FULL ACCESS TO THE PyAutoGUI LIBRARY in the variable `pg`, SPECIALIZED IN PERFORMING PRECISE AND EFFICIENT SYSTEM ACTIONS ON BEHALF OF THE USER. YOU MUST FOLLOW THE USER'S COMMANDS TO AUTOMATE KEYBOARD, MOUSE, AND SCREEN INTERACTIONS, WHILE ENSURING SAFETY AND ACCURACY IN EVERY TASK. YOU ARE RESPONSIBLE FOR COMPLETING TASKS SWIFTLY, AVOIDING ERRORS, AND HANDLING POTENTIAL EXCEPTIONS GRACEFULLY.

INSTRUCTIONS

- You MUST use the variable `pg` of PyAutoGUI library to perform system actions such as moving the mouse, clicking, typing, taking screenshots, and automating window actions as directed by the user.
- The target computer runs macOS.
- Always EXECUTE tasks with maximum precision to avoid unintentional actions.
- You MUST IMPLEMENT a logical chain of thoughts to approach every task methodically, ensuring the user's commands are carried out one action at a time.
- ONLY perform one action at a time from the chain of thoughts. DO NOT write code to perform all actions all at once.
- After each action, use the `get_screen_info` tool to get the information of the screen, coordinates to click, and plan the next actions to be taken.
- ALWAYS CATCH ERRORS or unexpected situations, and inform the user about potential issues.

FOLLOW this process to AUTOMATE each task effectively:

1. Thought:
    1.1. THOROUGHLY READ the user's request and IDENTIFY the specific system action they want to automate.
    1.2. EVALUATE whether the task is feasible using PyAutoGUI, considering any constraints related to screen size, active windows, or input permissions.

2. Action Input:
    2.0. OPEN the app in the user's request from Spotlight Search by running `pg.hotkey('command', 'space')`, then `pg.write(<app_name>)`, then `pg.press('enter')`. DO NOT SKIP THIS STEP.
    2.1. INITIATE the appropriate PyAutoGUI functions (e.g., mouse movement, typing, clicking) based on the user's request.
    2.2. MAKE USE of `pyautogui` commands such as `moveTo`, `click`, `write`, `press`, `screenshot`, etc., while confirming coordinates and actions.
    2.3. MAKE USE of `get_screen_info` tool to validate whether the previous step is successfully completed or not.
    2.4. HANDLE task dependencies (e.g., waiting for certain screens, pauses, or timeouts) by pausing with `pg.sleep` or `time.sleep` (after `import time`).
    2.5. ALWAYS wait for 5 seconds after each action to ensure the system has time to process the action.
    2.6. ONLY perform one action at a time and do not write code to perform all actions at once.

3. VERIFY THE OUTCOME:
    3.0. Call the `get_screen_info` tool after every action and plan the next action accordingly.
    3.1. PROVIDE FEEDBACK to the user, confirming the successful completion of the task.
    3.2. If an error occurs (e.g., the screen changes unexpectedly or coordinates are incorrect), IMPLEMENT error handling and INFORM the user clearly.


OBEY these rules to avoid common pitfalls:
- ALWAYS open the app in the user's request through Spotlight Search (Command + Space) by searching for that app. DO NOT SKIP this step.
- ALWAYS call the `get_screen_info` tool to verify the previous step has been successfully completed or not. DO NOT SKIP THIS STEP
- NEVER PERFORM DANGEROUS SYSTEM ACTIONS (e.g., force quitting critical applications, deleting system files) unless the user explicitly requests it and you have confirmed their intent.
- DO NOT MAKE ASSUMPTIONS about user intent—always follow their exact request, and if unclear, ASK for clarification.
- AVOID MOVING THE MOUSE OR TYPING without calculating the correct screen coordinates or target window using the `get_screen_info` tool.
- NEVER IGNORE ERRORS—if PyAutoGUI fails to perform an action (e.g., window not found), INFORM the user and PROVIDE an alternative solution.
- DO NOT OVERUSE SYSTEM RESOURCES—ensure that frequent or complex automation tasks are performed efficiently, minimizing system load.

"""
EXAMPLES = """#### Example 1: Open a Website in Google Chrome
User: "Open YouTube in Google Chrome"
Agent:
Thought: User wants to open YouTube on Google Chrome. For this I need to perform the following tasks.\n1. Open Google Chrome, if not already opened.\n2. Search for https://youtube.com/ in a new tab.
Action: get_screen_info
Action Input: Is Google Chrome open?
Observation: Yes
Thought: Chrome is open. Open a new tab and go to YouTube.
Action: python_repl_ast
Action Input:
```python
import time  # `time` is not preloaded in the REPL, only `pg` is

# Open Spotlight Search
pg.hotkey('command', 'space')
time.sleep(2) # Wait for Spotlight to open
# open chrome browser
pg.write('chrome')
time.sleep(2) 
pg.press('enter')
time.sleep(2)
# next, open a new tab
pg.hotkey("command", "t") # Open new tab
time.sleep(2)
pg.write("https://youtube.com") # Open YouTube
time.sleep(2)
pg.press('enter')
print("I have opened the YouTube page on Google Chrome")
```
Verify: YouTube successfully Opened in a new tab. Report success.


#### Example 2: Type a Message in a Text Editor
User: "Open TextEdit and type 'Hello, World!'."
Agent:
Thought: User wants TextEdit opened and text typed.
Action: python_repl_ast
Action Input:
```python
import time  # `time` is not preloaded in the REPL, only `pg` is

# Open Spotlight Search
pg.hotkey('command', 'space') 
time.sleep(1)  # Wait for Spotlight to open
# Type 'TextEdit' to search for TextEdit
pg.write('TextEdit') 
time.sleep(1)  # Wait for the search results to appear
pg.press('enter') # Press 'enter' to open TextEdit
time.sleep(2)  # Wait for TextEdit to open
pg.hotkey('command', 'n') # If a new document window does not open automatically, open one
time.sleep(1)  # Wait for the new document window to open
pg.write('Hello, World!') # Type the message
print("I have written 'Hello, World!' in TextEdit")
```
VERIFY: TextEdit opened, text written. Report completion.
"""

SUFFIX = """
User's input: {input}

You have access to the following tools: {tools}

Carefully use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}], it should only contain the tool name and nothing else
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!


Question: {input}
Thought:{agent_scratchpad}
"""

_template = PREFIX + "\n\n" + EXAMPLES + "\n\n"  + SUFFIX
prompt = PromptTemplate(input_variables=['agent_scratchpad', 'tool_names', 'input', 'tools'], template=_template)
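
Because `PromptTemplate` uses f-string style formatting by default, any stray `{...}` in `PREFIX` or `EXAMPLES` would be read as an extra input variable. A minimal sketch of a sanity check, using only the standard library, is shown below. The helper `template_fields` and the shortened `demo_template` are hypothetical and only illustrate the idea; in the notebook you would pass `_template` instead.

```python
from string import Formatter


def template_fields(template: str) -> set[str]:
    """Return the named replacement fields in a str.format-style template."""
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion)
    return {name for _, name, _, _ in Formatter().parse(template) if name}


# Hypothetical miniature of the SUFFIX template, for illustration only
demo_template = (
    "You have access to the following tools: {tools}\n"
    "Action: one of [{tool_names}]\n"
    "Question: {input}\n"
    "Thought:{agent_scratchpad}\n"
)

expected = {"agent_scratchpad", "tool_names", "input", "tools"}
found = template_fields(demo_template)
print("missing:", sorted(expected - found))
print("unexpected:", sorted(found - expected))
```

If either list is non-empty, the template and `input_variables` disagree and the prompt is worth fixing before the agent is created.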

3. Working with images - getting screenshot, and defining a tool to analyze the screenshot.

In [3]:
from pydantic import BaseModel, Field
from langchain_core.messages import HumanMessage, SystemMessage
from langchain.tools import tool

from PIL import Image, ImageDraw, ImageFont
import pyautogui as pg
import base64

pg.PAUSE = 2

def get_ruled_screenshot():
    image = pg.screenshot()
    # Get the image dimensions
    width, height = image.size

    # Create a new image for the semi-transparent layer
    overlay = Image.new("RGBA", (width, height), (255, 255, 255, 0))  # Transparent layer
    draw = ImageDraw.Draw(overlay)

    # Set the line color (yellow) and line opacity (adjusting the alpha value)
    line_color = (200, 200, 0, 128)  # The last value (128) controls opacity, 0 = fully transparent, 255 = fully opaque

    # Load a font for the labels (you can specify any TTF font you have)
    try:
        font = ImageFont.truetype("arial.ttf", 25)
    except IOError:
        font = ImageFont.load_default()

    # Draw vertical and horizontal lines every 50 pixels and label every second line (every 100 pixels)
    for x in range(0, width, 50):
        draw.line([(x, 0), (x, height)], fill=line_color, width=1)
        # Add labels at the top and bottom for vertical lines
        if x % 100 == 0:
            draw.text((x + 5, 5), str(x), font=font, fill=(250, 250, 0, 128))
            draw.text((x, height - 25), str(x), font=font, fill=(250, 250, 0, 128))

    for y in range(0, height, 50):
        draw.line([(0, y), (width, y)], fill=line_color, width=1)
        # Add labels on the left and right for horizontal lines
        if y % 100 == 0:
            draw.text((5, y + 5), str(y), font=font, fill=(0, 250, 250, 128))
            text_width = 35  # Rough label width, keeps the right-hand label inside the image
            draw.text((width - text_width - 5, y + 5), str(y), font=font, fill=(0, 250, 250, 128))

    # Convert screenshot to RGBA for proper merging
    image = image.convert("RGBA")
    # Merge the overlay (with lines and labels) back onto the original image
    combined = Image.alpha_composite(image, overlay)
    combined.save("screenshot.png")

class ScreenInfo(BaseModel):
    # The field name must match the tool function's parameter name
    question: str = Field(description="should be a question about the screenshot of the current screen. Should always be in text.")

@tool(args_schema=ScreenInfo)
def get_screen_info(question: str) -> str:
    """Tool to get the information about the current screen on the basis of the question of the user. The tool will take the screenshot of the screen to understand the contents of the screen and give answer based on the agent's questions. Do not write code to take screenshot."""
    try:
        get_ruled_screenshot()
        # Read the saved screenshot and base64-encode it for the multimodal message
        with open("screenshot.png", "rb") as image_file:
            image = base64.b64encode(image_file.read()).decode("utf-8")
            messages = [
                SystemMessage(
                content="""You are a Computer agent that is responsible for answering questions based on the input provided to you. You will have access to the screenshot of the current screen of the user along with a grid marked with true coordinates of the screen. The size of the screen is 1920 x 1080 px.
                        ONLY rely on the coordinates marked in the screen. DO NOT create an assumption of the coordinates. 
                        Here's how you can help:
                        1. Find out coordinates of a specific thing. You have to be super specific about the coordinates. These coordinates will be passed to PyAutoGUI Agent to perform further tasks. Refer to the grid lines to get the accurate coordinates.
                        2. Give information on the contents of the screen.
                        3. Analyse the screen to give instructions to perform further steps.
                        
                    """
                ),
                HumanMessage(
                    content=[
                        {
                            "type": "text",
                            "text": question
                        },
                        {
                            "type": "image_url",
                            # The screenshot is saved as PNG, so declare it as PNG
                            "image_url": {"url": f"data:image/png;base64,{image}"}
                        }
                    ]
                )
            ]
            image_model = MODELS["gemini"]
            response = image_model.invoke(messages)
            return response.content
        
    except Exception as e:
        # Return the error as text so the agent can read it as an observation
        return f"Error: {e}"
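
The grid labels are drawn in screenshot pixels, while PyAutoGUI moves the mouse in screen coordinates. On HiDPI displays these two may differ, so coordinates read off the grid might need rescaling before they are passed to `pg.moveTo` or `pg.click`. The sketch below shows one way to do that; `to_screen_coords` is a hypothetical helper, not part of the notebook, and the sizes in the demo are assumed values.

```python
def to_screen_coords(x, y, image_size, screen_size):
    """Map a pixel position in the screenshot to a position in screen coordinates."""
    image_w, image_h = image_size
    screen_w, screen_h = screen_size
    # Scale each axis by the ratio between the screen and the captured image
    return round(x * screen_w / image_w), round(y * screen_h / image_h)


# Assumed sizes: a 2x capture (3840 x 2160) of a 1920 x 1080 screen
print(to_screen_coords(340, 2000, (3840, 2160), (1920, 1080)))  # (170, 1000)
```

In the notebook, `pg.screenshot().size` and `pg.size()` could supply the two sizes. When they are equal the function returns the input unchanged.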
    

4. Create React Agent

In [4]:
from langchain.agents import AgentExecutor, create_react_agent
from langchain_experimental.tools import PythonAstREPLTool
import pyautogui as pg

pg.PAUSE = 2

def create_agent(model, prompt):
    print("============================================\nInitialising Agent\n============================================")
    df_locals = {}
    df_locals["pg"] = pg
    tools = [PythonAstREPLTool(locals=df_locals), get_screen_info]
    agent = create_react_agent(model, tools, prompt)
    agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True, handle_parsing_errors=True, return_intermediate_steps=True)
    return agent_executor
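
With `return_intermediate_steps=True`, the executor's result is a dictionary that carries the final `output` along with a list of `(action, observation)` pairs. The sketch below shows one way to turn that into a short, readable trace. `summarize_steps` and the `FakeAction` stand-in are hypothetical and exist only so the example runs without calling a model; real actions expose the same `tool` and `tool_input` attributes.

```python
from collections import namedtuple


def summarize_steps(result: dict) -> str:
    """Build a one-line-per-step summary from an executor result dictionary."""
    lines = []
    steps = result.get("intermediate_steps", [])
    for i, (action, observation) in enumerate(steps, start=1):
        obs = str(observation).strip().replace("\n", " ")
        if len(obs) > 80:
            obs = obs[:77] + "..."  # Keep long observations to a single short line
        lines.append(f"{i}. {action.tool}: {obs or '(no output)'}")
    lines.append(f"Final answer: {result.get('output', '')}")
    return "\n".join(lines)


# Stand-in for the agent's action objects (illustration only)
FakeAction = namedtuple("FakeAction", ["tool", "tool_input"])

demo_result = {
    "input": "open chrome",
    "output": "Chrome is open.",
    "intermediate_steps": [
        (FakeAction("python_repl_ast", "pg.hotkey('command', 'space')"), ""),
        (FakeAction("get_screen_info", "Is Google Chrome open?"), "Yes, Google Chrome is open."),
    ],
}
print(summarize_steps(demo_result))
```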

5. GUI window with command interface

In [5]:
from tkinter import *
from PIL import Image, ImageDraw, ImageFont
import time

# Create the agent executor
agent_executor = create_agent(MODELS["openai"], prompt)

# Initialize the GUI
root = Tk()
root.title("Desktop Agent")

window_width = 1200
window_height = 1000
root.geometry(f"{window_width}x{window_height}")

# Define the send function
def send():
    # Keep the original casing so text the user wants typed is not altered
    user_input = e.get().strip()
    txt.insert(END, f"\nIN> {user_input}")
    if user_input.lower() == "screenshot":
        get_ruled_screenshot()
        response = "Screenshot taken and saved as 'screenshot.png'"
    else:
        time.sleep(1.5)
        # invoke() returns a dict; "output" holds the agent's final answer
        response = agent_executor.invoke({"input": user_input})["output"]
    txt.insert(END, f"\nOUT> {response}")
    e.delete(0, END)

# Set up the GUI components
Label(root, fg="black", text="Welcome to Desktop AI Agent", font=("Helvetica", 12, "bold"), pady=10, width=30, height=2).pack()
txt_frame = Frame(root, bg="lightgray")
txt_frame.pack(pady=10)
txt = Text(txt_frame, bg="white", fg="black", font=("Helvetica", 10), width=180, height=75)
txt.pack(side=LEFT)
scrollbar = Scrollbar(txt_frame, command=txt.yview)
scrollbar.pack(side=RIGHT, fill=Y)
txt.config(yscrollcommand=scrollbar.set)
entry_frame = Frame(root)
entry_frame.pack(pady=10)
e = Entry(entry_frame, bg="white", fg="black", font=("Helvetica", 10), width=140)
e.pack(side=LEFT, padx=5)
Button(entry_frame, text="Send", font=("Helvetica", 10, "bold"), fg="black", bg="gray", command=send).pack(side=LEFT)

# Set window attributes and start the main loop
root.attributes('-topmost', 0)
root.mainloop()
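
Calling the agent directly inside `send` blocks the Tk event loop until the agent finishes, so the window cannot repaint in the meantime. One possible alternative, sketched below under that assumption, is to run the call in a worker thread and pass the result back through a queue. `run_in_background`, `drain`, and `slow_task` are hypothetical names used for illustration; `slow_task` stands in for `agent_executor.invoke`.

```python
import queue
import threading


def run_in_background(task, arg, results: queue.Queue) -> threading.Thread:
    """Run task(arg) in a daemon thread and put ('ok' | 'error', value) on the queue."""
    def worker():
        try:
            results.put(("ok", task(arg)))
        except Exception as exc:
            results.put(("error", str(exc)))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def drain(results: queue.Queue) -> list:
    """Return every item currently on the queue without blocking."""
    items = []
    while True:
        try:
            items.append(results.get_nowait())
        except queue.Empty:
            return items


def slow_task(text):
    # Stand-in for the agent call (illustration only)
    return text.upper()


results = queue.Queue()
run_in_background(slow_task, "open chrome", results).join()
print(drain(results))  # [('ok', 'OPEN CHROME')]
```

In the GUI, `send` could start the worker and a function scheduled with `root.after(...)` could call `drain` periodically and insert the results into `txt`, so that widgets are only touched from the main thread.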

Initialising Agent


2025-01-14 20:33:29.599 python[42192:17930693] +[IMKClient subclass]: chose IMKClient_Modern
2025-01-14 20:33:29.599 python[42192:17930693] +[IMKInputSession subclass]: chose IMKInputSession_Modern




> Entering new AgentExecutor chain...
To accomplish the task, I need to perform the following actions:

1. Open the Google Chrome application.
2. Navigate to CNN's website by entering the URL https://cnn.com.
3. Analyze the front page of CNN and summarize its contents.

First, I'll start by opening Google Chrome.

Action: python_repl_ast
Action Input:
```python
# Open Spotlight Search
import time
pg.hotkey('command', 'space')
time.sleep(2)  # Wait for Spotlight to open
# Type 'chrome' to search for Google Chrome
pg.write('chrome')
time.sleep(2)  # Wait for search results
pg.press('enter')
time.sleep(5)  # Allow time for Chrome to open
```[0m[36;1m[1;3m[0m[32;1m[1;3mAction: get_screen_info
Action Input: Is Google Chrome open?[0m[33;1m[1;3mYes, Google Chrome is open and visible in the dock at the bottom of the screen. It's the 2nd icon from the left, located within the approximate coordinates (170, 1000) to (240, 1070) on the screen.[0m[32;1m[1;3mNow th