### MultiOn x Pydantic: The holy grail of data extraction?

This notebook presents a small proof of concept for a Pydantic-based approach to querying web data. Since I currently have no access to MultiOn's source code or internals, I could only build a software-level abstraction on top of the client: a new method, `retrieve_with_model`, that accepts a Pydantic model as the query.

In [14]:
from multion.types.retrieve_output import RetrieveOutput
from pydantic import BaseModel, ValidationError
from multion.client import MultiOn


class MultiOnPydantic:  # Used during MultiOn() object assignment to provide type hinting.
    def retrieve_with_model(self, *args, **kwargs) -> RetrieveOutput | BaseModel:
        raise NotImplementedError()


def retrieve_with_model(
    self, output_schema: type[BaseModel] | None = None, *args, **kwargs
) -> RetrieveOutput | BaseModel:
    # output_schema is a model class (not an instance), hence type[BaseModel].

    if output_schema:
        schema = output_schema.model_json_schema()

        # Parse for field names and their respective types, description, examples.
        fields: list[tuple[str, str, str, str]] = [
            (
                name,
                field.get("type", "any"),  # Optional/union fields carry "anyOf" instead of "type".
                field.get("description", "No description provided."),
                field.get("examples", ["No example provided."])[0],
            )
            for name, field in schema["properties"].items()
        ]

        # Use the docstring as the main body of cmd and append each field's name, type, description, and example.
        cmd = f"{output_schema.__doc__}\nPlease ensure proper typing of the outputs:\n{fields}"

        # Build the list of requested fields from the schema's required fields.
        args_list = list(schema.get("required", []))

        # Call to original Retrieval API.
        response = self.retrieve(*args, cmd=cmd, fields=args_list, **kwargs)

        # Schema is constructed and validated.
        try:
            return output_schema.model_validate(response.data[0])
        except ValidationError:
            # TODO Handle potential re-request to model to fix the error.
            raise  # Re-raise as-is; Pydantic's ValidationError cannot be constructed with no arguments.

    return self.retrieve(*args, **kwargs)  # API call in case output_schema is not used.


# Monkey-patching: attach the new method to the existing class without editing its source code.
setattr(MultiOn, "retrieve_with_model", retrieve_with_model)
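
The TODO in the `except` branch above could be handled with a small retry loop. The following is only a sketch under stated assumptions: `validate_with_retry` and its `fetch` parameter are hypothetical names, and `fetch` stands in for whatever call returns one raw record (for example a thin wrapper around `self.retrieve`). On failure, the validation report is appended to the command so the next request has a chance to correct itself.

```python
from typing import Callable, TypeVar

from pydantic import BaseModel, ValidationError

ModelT = TypeVar("ModelT", bound=BaseModel)


def validate_with_retry(
    fetch: Callable[[str], dict],  # hypothetical: takes a command, returns one raw record
    schema: type[ModelT],
    cmd: str,
    max_attempts: int = 2,
) -> ModelT:
    """Validate fetch(cmd) against schema, re-asking with the error text on failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    prompt = cmd
    for attempt in range(1, max_attempts + 1):
        raw = fetch(prompt)
        try:
            return schema.model_validate(raw)
        except ValidationError as exc:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last validation error
            # Append the validation report so the next request can correct itself.
            prompt = f"{cmd}\nThe previous answer failed validation:\n{exc}"
    raise AssertionError("unreachable")
```

Whether a second request actually repairs the output depends on the agent; the loop only guarantees that a bad record never gets past validation.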

In [None]:
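# Annotate the client with the stub class so type checkers know about retrieve_with_model.
# The key below is a placeholder; avoid committing a real API key to a notebook.
client: MultiOnPydantic = MultiOn(api_key="YOUR_MULTION_API_KEY")  # type: ignore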

Now we can use a Pydantic class as the query. The docstring carries the natural-language instruction, and the fields define the expected outputs.

In [11]:
from pydantic import BaseModel


class Query(BaseModel):
    """Please extract social links and data from my personal website."""

    email: str
    twitter: str
    github: str
    telegram: str
    linkedin: str


client.retrieve_with_model(
    url="https://keellorenz.com", output_schema=Query
)  # Outputs Query instance filled with data.

Query(email='https://www.keellorenz.com/mailto:bogdan122305@gmail.com', twitter='https://twitter.com/keell0renz', github='https://github.com/keell0renz/', telegram='https://t.me/keellorenz', linkedin='https://www.linkedin.com/in/bohdan-agarkov-87937a276/')
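
To see roughly what gets sent for a model like this, the prompt-building steps of `retrieve_with_model` can be run offline. This is a sketch rather than part of the MultiOn client: `build_command` and the `Links` model are hypothetical names introduced for illustration, and `.get` is used defensively when reading the schema.

```python
from pydantic import BaseModel, Field


def build_command(output_schema: type[BaseModel]) -> tuple[str, list[str]]:
    """Follow the prompt-building steps of retrieve_with_model, without any API call."""
    schema = output_schema.model_json_schema()
    fields = [
        (
            name,
            field.get("type", "any"),
            field.get("description", "No description provided."),
            field.get("examples", ["No example provided."])[0],
        )
        for name, field in schema["properties"].items()
    ]
    cmd = f"{output_schema.__doc__}\nPlease ensure proper typing of the outputs:\n{fields}"
    return cmd, list(schema.get("required", []))


class Links(BaseModel):  # hypothetical model, for illustration only
    """Please extract social links."""

    email: str
    github: str = Field(description="GitHub profile URL.")


cmd, required = build_command(Links)
print(cmd)       # docstring, then a list of (name, type, description, example) tuples
print(required)  # names of the fields without defaults
```

Note that the types appear under their JSON Schema names (`string`, `boolean`, and so on), since they are read from `model_json_schema()` rather than from the Python annotations.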

We can also use Pydantic's `Field` utility to give more context for a specific output value.

In [17]:
from pydantic import BaseModel, Field


class Query(BaseModel):
    """Please extract various data from my website."""  # Intentionally vague instruction.

    name: str = Field(
        description="Name on the website, probably name of the person described."
    )
    university: str = Field(description="Where this person studied?")
    motto: str = Field(
        description="Written on the top, in big letters.", examples=["I love..."]
    )
    description: str = Field(description="Description of the person.")
    is_technical: str = Field(
        description="Please answer this question based on the content of the website: is the author a technical person (software engineer)?",
        examples=["Yes", "No"],
    )


client.retrieve_with_model(
    url="https://keellorenz.com", output_schema=Query
)  # Outputs Query instance filled with data.

Query(name='Bohdan Agarkov', university='Universiteit van Amsterdam', motto='I love building crazy shit.', description='Generative AI and full-stack web developer, software development intern at Tiny Fish, business student at Universiteit van Amsterdam. Curious about autonomous AI and AGI.', is_technical='Yes')

What I especially like about the MultiOn scraping API is that the AI can do its own thinking and answer questions whose answers are inferred from the content of the website, such as `is_technical`. This could be a new paradigm of web scraping, where you scrape not only for data but for specific answers, like "Does this page promote my competitor's product?", which frees the user from implementing complex logic by hand.
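
As an illustration of that idea, the competitor question could be phrased as a query model. This is a sketch only: `CompetitorCheck` and its fields are hypothetical, and the `raw` dictionary simulates an agent reply, so no request is made here.

```python
from pydantic import BaseModel, Field


class CompetitorCheck(BaseModel):  # hypothetical query model
    """Please read the page and answer the questions below."""

    promotes_competitor: bool = Field(
        description="Does this page promote my competitor's product?",
        examples=["Yes", "No"],
    )
    evidence: str = Field(
        description="A short quote from the page that supports the answer."
    )


# Simulated agent reply; in the notebook this would come from retrieve_with_model.
raw = {"promotes_competitor": "No", "evidence": "Only our own product is mentioned."}
check = CompetitorCheck.model_validate(raw)
print(check.promotes_competitor, check.evidence)
```

Asking for a supporting quote next to the yes/no answer is a design choice that makes the agent's judgment easier to spot-check.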

In [20]:
from pydantic import BaseModel, Field, field_validator


class Query(BaseModel):
    """Please extract various data from my website."""  # Intentionally vague instruction.

    name: str = Field(
        description="Name on the website, probably name of the person described."
    )
    university: str = Field(description="Where this person studied?")
    motto: str = Field(
        description="Written on the top, in big letters.", examples=["I love..."]
    )
    description: str = Field(description="Description of the person.")
    is_technical: bool = Field(
        description="Please answer this question based on the content of the website: is the author a technical person (software engineer)?",
        examples=["Yes", "No"],
    )

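    @field_validator("is_technical", mode="before")
    @classmethod
    def convert_is_technical(cls, v):
        # mode="before" runs this ahead of bool parsing, so a "Yes"/"No" string is converted here.
        if isinstance(v, str):
            return v.strip().lower() == "yes"
        return v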


client.retrieve_with_model(
    url="https://keellorenz.com", output_schema=Query
)  # Outputs Query instance filled with data.

Query(name='Bohdan Agarkov', university='Universiteit van Amsterdam', motto='I love building crazy shit.', description='Generative AI and full-stack web developer, software development intern at Tiny Fish, business student at Universiteit van Amsterdam. Curious about autonomous AI and AGI.', is_technical=True)

This also shows how validators can be used to validate and transform the data returned by the MultiOn agent.
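
Validators can clean up string fields in the same way. In the first result above, the `email` value came back as a page URL with a `mailto:` link appended to it. A validator along the following lines could reduce it to the bare address. This is a sketch: the `Contact` model is hypothetical, and the input uses an `example.com` address with the same shape as that output.

```python
from pydantic import BaseModel, field_validator


class Contact(BaseModel):  # hypothetical model, for illustration only
    """Please extract the contact email."""

    email: str

    @field_validator("email", mode="before")
    @classmethod
    def strip_mailto_prefix(cls, v):
        # Keep only what follows "mailto:", if that marker is present.
        if isinstance(v, str) and "mailto:" in v:
            return v.split("mailto:", 1)[1]
        return v


raw = {"email": "https://example.com/mailto:user@example.com"}
print(Contact.model_validate(raw).email)
```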