## Load environment variables

In [1]:
from dotenv import load_dotenv

# Load environment variables
load_dotenv(dotenv_path=".env", override=True)

True
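The `True` above only tells us that something was loaded from `.env`, not that every key the notebook needs is present. A minimal sketch of a follow-up check is shown below. The variable names in `REQUIRED_VARS` are assumptions about what this notebook relies on, and `missing_env_vars` is a hypothetical helper, not part of `python-dotenv`.

```python
import os

# Assumed variable names; adjust to whatever your .env actually defines.
REQUIRED_VARS = ("OPENAI_API_KEY", "TAVILY_API_KEY", "LANGSMITH_API_KEY")


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Passing a plain dict as `env` makes the helper easy to try without touching the real environment.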

## Create the AI application

### Setup

As always, let's define our prompt and give our application access to the web.

In [2]:
# Initialize the web search tool.
from langchain_community.tools.tavily_search import TavilySearchResults

web_search_tool = TavilySearchResults(max_results=1)

# Define the prompt template
prompt = """You are a teacher and an expert at explaining complex topics in a way that is easy to understand.
Your job is to answer the given question so that even a 5-year-old can understand it.
You have been given the context needed to answer the question.

Question: {question} 

Context: {context}

Answer:"""


### Define the application logic

The logic here is the same as in the tracing module. We define a search step that queries the web and an explanation step in which a language model summarizes the results that were found.

In [3]:
from openai import OpenAI
from langsmith import traceable
from langsmith.wrappers import wrap_openai


# Create the application
openai_client = wrap_openai(OpenAI())

@traceable
def search(question):
    web_docs = web_search_tool.invoke({"query": question})
    web_results = "\n".join([d["content"] for d in web_docs])
    return web_results
    
@traceable
def explain(question, context):
    formatted = prompt.format(question=question, context=context)
    
    completion = openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": formatted},
            {"role": "user", "content": question},
        ],
        model="o3-mini",
    )
    return completion.choices[0].message.content

@traceable
def eli5(question):
    context = search(question)
    answer = explain(question, context)
    return answer
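To experiment with the two-step flow without calling Tavily or OpenAI, the same chaining can be reproduced with stand-in functions. This is only a sketch: `eli5_offline`, `fake_search`, and `fake_explain` are hypothetical names introduced here, and the fake documents only mimic the `"content"` key that `search` reads.

```python
def eli5_offline(question, search_fn, explain_fn):
    """Same two-step flow as `eli5`, with the steps passed in as functions."""
    context = search_fn(question)
    return explain_fn(question, context)


# Hypothetical stand-in for the web search step.
def fake_search(question):
    docs = [{"content": "Sound is a vibration."}, {"content": "It travels through air."}]
    return "\n".join(d["content"] for d in docs)


# Hypothetical stand-in for the model call; it just reports what it received.
def fake_explain(question, context):
    return f"Q: {question} | context lines: {len(context.splitlines())}"


result = eli5_offline("What is sound?", fake_search, fake_explain)
print(result)
```

Keeping the steps injectable like this makes it possible to check the plumbing separately from the quality of the search results or the model.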


## Experiment setup

We are now ready to run experiments and test how our application performs on our dataset.

### Import the LangSmith client

First, let's create a LangSmith client to use the SDK and specify the dataset we want to run our experiment on.

In [4]:
from langsmith import Client

client = Client()
#dataset_name = "eli5-silver"
dataset_name = "ds-new-crystallography-60"

### Define evaluators

#### Custom code evaluator

First, we'll define a custom code evaluator, which is useful for measuring deterministic or closed-ended metrics.

In [5]:
def conciseness(outputs: dict) -> bool:
    words = outputs["output"].split(" ")
    return len(words) <= 200

This custom code evaluator is simply a Python function that checks whether our application produces answers of 200 words or fewer.
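One thing to be aware of is that `split(" ")` only breaks on single spaces, so words separated by newlines are counted as one. If that matters for your answers, a variant that splits on any run of whitespace might look like the sketch below. The name `conciseness_any_whitespace` is hypothetical, and this is an alternative, not the evaluator used in the experiment that follows.

```python
def conciseness_any_whitespace(outputs: dict) -> bool:
    """Variant of `conciseness` that splits on any run of whitespace."""
    words = outputs["output"].split()
    return len(words) <= 200


# Compare the two ways of counting on a small multi-line sample.
sample = {"output": "one\ntwo\nthree four"}
single_space_count = len(sample["output"].split(" "))
any_whitespace_count = len(sample["output"].split())
print(single_space_count, any_whitespace_count)
```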

#### LLM-as-a-Judge evaluator

For open-ended metrics, it can be very powerful to use an LLM to score the responses.

Let's use an LLM to check whether our application produces correct results. First, let's define a scoring schema that our LLM must follow in its response.

In [None]:
from pydantic import BaseModel, Field

# Define a scoring schema that our LLM must conform to.
class CorrectnessScore(BaseModel):
    """Correctness score of the answer when compared to the reference answer."""
    score: int = Field(description="The score of the correctness of the answer, from 0 to 1")

We'll define a function to give an LLM our application's outputs, alongside the reference outputs stored in our dataset. 

The LLM will then be able to reference the "right" output to judge if our application's answer meets our accuracy standards.

In [7]:
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage


def correctness(inputs: dict, outputs: dict, reference_outputs: dict) -> bool:
    prompt = """
    You are an expert data labeler evaluating model outputs for correctness. Your task is to assign a score based on the following rubric:

    <Rubric>
        A correct answer:
        - Provides accurate information
        - Uses suitable analogies and examples
        - Contains no factual errors
        - Is logically consistent

        When scoring, you should penalize:
        - Factual errors
        - Incoherent analogies and examples
        - Logical inconsistencies
    </Rubric>

    <Instructions>
        - Carefully read the input and output
        - Use the reference output to determine if the model output contains errors
        - Focus on whether the model output uses accurate analogies and is logically consistent
    </Instructions>

    <Reminder>
        The analogies in the output do not need to match the reference output exactly. Focus on logical consistency.
    </Reminder>

    <input>
        {}
    </input>

    <output>
        {}
    </output>

    Use the reference outputs below to help you evaluate the correctness of the response:
    <reference_outputs>
        {}
    </reference_outputs>
    """.format(inputs["question"], outputs["output"], reference_outputs["output"])
    structured_llm = ChatOpenAI(model_name="gpt-4o", temperature=0).with_structured_output(CorrectnessScore)
    generation = structured_llm.invoke([HumanMessage(content=prompt)])
    return generation.score == 1
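The judge prompt above fills its three `{}` slots by position, so the order of the arguments to `.format` has to match the order of the tags. As a hedged alternative, named fields make that mapping explicit. The sketch below uses a shortened template; `JUDGE_TEMPLATE` and `build_judge_prompt` are hypothetical names, and the rubric text is omitted for brevity.

```python
# Shortened, illustrative template; the full rubric from the evaluator is left out.
JUDGE_TEMPLATE = """<input>
{question}
</input>

<output>
{answer}
</output>

<reference_outputs>
{reference}
</reference_outputs>"""


def build_judge_prompt(inputs: dict, outputs: dict, reference_outputs: dict) -> str:
    """Fill the template by name so the argument order cannot be mixed up."""
    return JUDGE_TEMPLATE.format(
        question=inputs["question"],
        answer=outputs["output"],
        reference=reference_outputs["output"],
    )


judge_prompt = build_judge_prompt(
    {"question": "What is sound?"},
    {"output": "ANSWER-TEXT"},
    {"output": "REFERENCE-TEXT"},
)
print(judge_prompt)
```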


### Define Run Function

We'll define a function to run our application on the example inputs of our dataset. This is the function that will be called when we run our experiment.

In [8]:
# Define a function to run your application
def run(inputs: dict):
    return eli5(inputs["question"])
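To make the data flow concrete, here is a rough offline approximation of what happens to each example: the target is called with the example's inputs, and each evaluator receives whichever of `inputs`, `outputs`, and `reference_outputs` it declares. This is a sketch, not LangSmith's implementation. `run_local` and the toy evaluators are hypothetical, and wrapping a plain string under the `"output"` key is an assumption inferred from how the evaluators above read `outputs["output"]`.

```python
import inspect


def run_local(target, examples, evaluators):
    """Call the target on each example and score it with each evaluator."""
    rows = []
    for example in examples:
        result = target(example["inputs"])
        # Assumption: a non-dict return value is wrapped under the "output" key.
        outputs = result if isinstance(result, dict) else {"output": result}
        available = {
            "inputs": example["inputs"],
            "outputs": outputs,
            "reference_outputs": example["outputs"],
        }
        feedback = {}
        for evaluator in evaluators:
            # Pass only the arguments this evaluator declares in its signature.
            wanted = inspect.signature(evaluator).parameters
            kwargs = {name: available[name] for name in wanted if name in available}
            feedback[evaluator.__name__] = evaluator(**kwargs)
        rows.append({"outputs": outputs, "feedback": feedback})
    return rows


# Toy evaluators and a single made-up example, for illustration only.
def short_enough(outputs: dict) -> bool:
    return len(outputs["output"].split()) <= 5


def matches_reference(outputs: dict, reference_outputs: dict) -> bool:
    return outputs["output"] == reference_outputs["output"]


examples = [{"inputs": {"question": "What is sound?"}, "outputs": {"output": "A vibration."}}]
rows = run_local(lambda inputs: "A vibration.", examples, [short_enough, matches_reference])
print(rows[0]["feedback"])
```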

## Run Experiment

We have all the necessary components, so let's run our experiment! 

In [9]:
from langsmith import evaluate

evaluate(
    run,
    data=dataset_name,
    evaluators=[correctness, conciseness],
    experiment_prefix="eli5-o3-mini"
)



View the evaluation results for experiment: 'eli5-o3-mini-61a07e79' at:
https://smith.langchain.com/o/11afda74-8804-4cb8-8ad4-c2a9d67d44e5/datasets/0a5a0302-cfaf-44b4-9de9-e4b312e514fd/compare?selectedSessions=8d191364-db83-488c-a737-6a5f8a44bd6f




10it [01:49, 10.94s/it]


| | inputs.question | outputs.output | error | reference.output | feedback.correctness | feedback.conciseness | execution_time | example_id | id |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Qué es la macroeconomía? | Imagínate que la economía es como un enorme ro... | | La macroeconomía es como una lupa gigante que ... | True | True | 7.366929 | 15d24932-295f-473e-b1d2-086f1bd2f32d | 019a9ed9-85d0-763c-bb46-778f015b7991 |
| 1 | Why is the sky blue? | Imagine you have a box with lots of colored pe... | | Alright! Imagine the sky is like a big bowl of... | True | True | 10.03559 | 1af9dae1-8130-46b1-9efd-2b9a381f2b0f | 019a9ed9-b613-721b-860f-856d0931ffdf |
| 2 | How does string theory work? | Imagine that everything in the universe is mad... | | Okay! Imagine that everything in the universe,... | True | False | 11.949423 | 4054da81-5869-484d-916c-e12c19059078 | 019a9ed9-e157-74c9-be41-7562da2ffc48 |
| 3 | How does photosynthesis work? | Imagine that a plant is like a little chef in ... | | Okay! Imagine plants are like tiny chefs, and ... | True | True | 11.735456 | 8edadc77-fb59-4aa1-9bb3-4a0e369c50af | 019a9eda-1376-732b-ab95-2d8f0b5d46c9 |
| 4 | What is trustcall library? | Imagine you have a big box of LEGOs that you u... | | Alright, imagine you have a toy box where each... | False | True | 9.853126 | a661d0ce-9672-4a25-b813-104ea367db23 | 019a9eda-47d5-71a7-b6fb-b4d515be7d42 |
| 5 | How does a democracy work? | Imagine you and your friends need to choose wh... | | Okay! Imagine you and your friends want to dec... | True | True | 6.848612 | acd456da-cd68-4842-bcfe-60049549e0eb | 019a9eda-72f8-7778-b5f3-1c1ac59e2739 |
| 6 | What is LangSmith by LangChain? | Imagine you have a very smart toy robot that t... | | Okay! Imagine you have a big box of toys that ... | True | True | 8.322277 | b80f2c35-2d23-400d-9963-83cfdb4a0026 | 019a9eda-909a-7055-86b0-7ac733be8468 |
| 7 | What is the Langchain framework? | Imagine you have a big box of colorful LEGO br... | | Okay! Imagine you want to build a really cool ... | True | True | 9.396697 | bb1147ea-5e7a-4445-bba4-fcafb900d5dc | 019a9eda-b503-757f-bd9d-71c345e378c3 |
| 8 | What is sound? | Imagine you’re playing with a drum. When you h... | | Okay! Imagine you have a drum. When you hit it... | True | True | 8.593642 | bf115918-1d03-48c0-a4ef-a33ee567533f | 019a9eda-ddae-74fe-8fc4-b369cd7ada2d |
| 9 | What is LangGraph? | Imagine you have a big box of LEGO pieces, and... | | Okay, imagine you have a big box of LEGO brick... | True | True | 10.463656 | d7a5d945-7662-45ab-9099-634e1544c56c | 019a9edb-0246-776d-a1c9-a8ce5ca8d223 |
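The per-example feedback can be rolled up into a quick summary. The sketch below copies the `feedback.correctness`, `feedback.conciseness`, and `execution_time` values from the results above into plain tuples and computes pass rates and the mean execution time. It uses only the standard library, and the variable names are illustrative.

```python
# (correctness, conciseness, execution_time) copied from the results above, rows 0 to 9.
results = [
    (True, True, 7.366929),
    (True, True, 10.03559),
    (True, False, 11.949423),
    (True, True, 11.735456),
    (False, True, 9.853126),
    (True, True, 6.848612),
    (True, True, 8.322277),
    (True, True, 9.396697),
    (True, True, 8.593642),
    (True, True, 10.463656),
]

n = len(results)
correctness_rate = sum(1 for r in results if r[0]) / n
conciseness_rate = sum(1 for r in results if r[1]) / n
mean_execution_time = sum(r[2] for r in results) / n

print(f"correctness: {correctness_rate:.0%}")
print(f"conciseness: {conciseness_rate:.0%}")
print(f"mean execution time: {mean_execution_time:.2f}s")
```

With only ten examples, a single failure moves either rate by ten percentage points, so these numbers are best read as a rough signal rather than a precise measurement.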
