<a href="https://colab.research.google.com/github/jayusctrojan/about_me/blob/master/Jay%20Bajaj%20-%20ik_agenteval_exercise1_llm_judge_homework_non_swe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Agent Evaluation Homework for non-SWE

This module demonstrates how to use LangSmith to monitor your agent in production. You need a LangSmith account and your own OpenAI API key.


Observability is important for any software application, but especially so for LLM applications. LLMs are non-deterministic by nature, meaning they can produce unexpected results. This makes them trickier than normal to debug.

Note that observability is important throughout all stages of application development - from prototyping, to beta testing, to production. There are different considerations at all stages, but they are all intricately tied together. In this tutorial we walk through the natural progression.

Let's assume that we're building a simple RAG application using the OpenAI SDK. The application we're adding observability to is defined in Part 1 below.

Link to homework document: https://docs.google.com/document/d/1IrgnVLXtHcx0TJ0IJlUzrHsqseM23eGh7DdpYEkzJvU/edit?tab=t.0


In [None]:
# Install dependencies first so the imports below succeed
!pip install openai langsmith langchain-core

import os
from openai import OpenAI
from langsmith.wrappers import wrap_openai
from langsmith import traceable
from google.colab import userdata

# Enable tracing (older and newer variable names) and pick the target project
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGSMITH_TRACING"] = "true"
os.environ["LANGSMITH_PROJECT"] = "default"

# Option 1: load the keys from Colab Secrets
#os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
#os.environ["LANGSMITH_API_KEY"] = userdata.get('LANGSMITH_API_KEY')
# Option 2: paste your own keys here (never commit real keys)
os.environ["OPENAI_API_KEY"] = "<enter-your-own-key-here>"
os.environ["LANGSMITH_API_KEY"] = "<enter-your-own-key-here>"

# Verify they're set (note: the placeholder string also passes this check)
print("OpenAI key set:", "OPENAI_API_KEY" in os.environ and len(os.environ["OPENAI_API_KEY"]) > 0)
print("LangSmith key set:", "LANGSMITH_API_KEY" in os.environ and len(os.environ["LANGSMITH_API_KEY"]) > 0)

OpenAI key set: True
LangSmith key set: True
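
The check above only confirms that each variable is non-empty, so the `<enter-your-own-key-here>` placeholder also prints `True`. A minimal sketch of a stricter check follows. The helper name `find_unset_keys` and the `PLACEHOLDER` constant are illustrative and not part of LangSmith or the OpenAI SDK.

```python
# Placeholder string used in the setup cell above
PLACEHOLDER = "<enter-your-own-key-here>"

def find_unset_keys(env, names):
    """Return the names whose value is missing, blank, or still the placeholder."""
    bad = []
    for name in names:
        value = env.get(name, "").strip()
        if not value or value == PLACEHOLDER:
            bad.append(name)
    return bad

# Demo on a plain dict so the example never touches real credentials
demo_env = {"OPENAI_API_KEY": "sk-example", "LANGSMITH_API_KEY": PLACEHOLDER}
print(find_unset_keys(demo_env, ["OPENAI_API_KEY", "LANGSMITH_API_KEY"]))
```

In the notebook you would pass `os.environ` instead of `demo_env` and stop early if the returned list is not empty.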


# **Part 1 - Running the first Basic App**

Let's first build a basic LLM application whose OpenAI calls are wrapped by LangSmith. Run the cell below, call `rag` with a question, and check what shows up in the LangSmith portal.

In [3]:
openai_client = wrap_openai(OpenAI())

def retriever(query: str):
    results = ["Harrison worked at Kensho"]
    return results

def rag(question):
    docs = retriever(question)
    system_message = """Answer the users question using only the provided information below:

    {docs}""".format(docs="\n".join(docs))

    response = openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-4o-mini",
    )
    return response.choices[0].message.content
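
To see exactly what the model receives, the prompt assembly inside `rag` can be pulled out on its own. This is only a sketch of the string formatting step: `build_system_message` is a hypothetical helper that does not exist in the notebook, and no API call is made.

```python
def build_system_message(docs):
    """Join the retrieved documents and place them under the instruction line."""
    template = (
        "Answer the users question using only the provided information below:\n\n"
        "{docs}"
    )
    return template.format(docs="\n".join(docs))

message = build_system_message(["Harrison worked at Kensho"])
print(message)
```

The template in the cell above also carries the four spaces of indentation that precede `{docs}` inside the triple-quoted string; this sketch leaves them out.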

In [4]:
# Deep LangSmith Debugging
from google.colab import userdata
import os
import requests
import time  # imported up front so every test below can call time.sleep
import uuid

print("🔍 Deep LangSmith Debugging...")

# Load fresh from secrets
langsmith_key = userdata.get('LANGSMITH_API_KEY')
os.environ["LANGSMITH_API_KEY"] = langsmith_key

print(f"✅ Using Personal Access Token: {langsmith_key[:15]}...")

# Test 1: Check LangSmith API directly
print(f"\n🧪 Test 1: Direct API Health Check")
try:
    headers = {"Authorization": f"Bearer {langsmith_key}"}
    response = requests.get("https://api.smith.langchain.com/info", headers=headers)
    print(f"API Status: {response.status_code}")
    if response.status_code == 200:
        print("✅ LangSmith API is accessible")
    else:
        print(f"❌ API issue: {response.status_code} - {response.text}")
except Exception as e:
    print(f"❌ API request failed: {e}")

# Test 2: Check Client initialization
print(f"\n🧪 Test 2: Client Initialization")
try:
    from langsmith import Client
    ls_client = Client()
    print("✅ LangSmith Client created successfully")

    # Check if we can access basic info
    try:
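        # list_runs needs a filter such as project_name; an unfiltered query
        # is rejected with HTTP 400 (the recorded output comes from such a call)
        runs_iter = ls_client.list_runs(
            project_name=os.environ.get("LANGSMITH_PROJECT", "default"),
            limit=1,
        )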
        runs = list(runs_iter)
        print(f"✅ Can list runs: {len(runs)} runs found")
    except Exception as e:
        print(f"❌ Can't list runs: {e}")

except Exception as e:
    print(f"❌ Client creation failed: {e}")

# Test 3: Create a minimal trace and try feedback
print(f"\n🧪 Test 3: Minimal Trace + Feedback Test")
try:
    from langsmith import traceable

    @traceable
    def simple_test(input_text):
        return f"Processed: {input_text}"

    # Create a trace
    test_run_id = str(uuid.uuid4())
    print(f"Creating trace with run_id: {test_run_id}")

    result = simple_test(
        "test input",
        langsmith_extra={"run_id": test_run_id}
    )
    print(f"✅ Trace created: {result}")

    # Wait a moment for the trace to be processed
    import time
    time.sleep(2)

    # Try to add feedback
    print(f"🧪 Attempting to add feedback to run_id: {test_run_id}")

    try:
        feedback_response = ls_client.create_feedback(
            test_run_id,
            key="test-score",
            score=1.0,
        )
        print("✅ Feedback creation successful!")
        print(f"Feedback response: {feedback_response}")

    except Exception as feedback_error:
        print(f"❌ Feedback creation failed: {feedback_error}")

        # Let's try to get more details about the error
        if hasattr(feedback_error, 'response'):
            print(f"Response status: {feedback_error.response.status_code}")
            print(f"Response text: {feedback_error.response.text}")

        # Check if the run actually exists
        print(f"\n🔍 Checking if run exists...")
        try:
            run_info = ls_client.read_run(test_run_id)
            print(f"✅ Run exists: {run_info.id}")
        except Exception as read_error:
            print(f"❌ Run doesn't exist or can't be read: {read_error}")

except Exception as e:
    print(f"❌ Trace creation failed: {e}")

# Test 4: Check project settings
print(f"\n🧪 Test 4: Project Settings")
try:
    project_name = os.environ.get("LANGSMITH_PROJECT", "default")
    print(f"Current project: {project_name}")

    # Try to list projects
    try:
        # Note: This might not be available in all client versions
        print("Attempting to check project access...")
    except Exception as e:
        print(f"Can't check projects: {e}")

except Exception as e:
    print(f"Project check failed: {e}")

# Test 5: Alternative feedback approach
print(f"\n🧪 Test 5: Alternative Feedback Approach")
print("Let's try the REST API directly for feedback...")

try:
    # Create another simple trace first
    alt_run_id = str(uuid.uuid4())

    @traceable
    def alt_test():
        return "alternative test"

    alt_result = alt_test(langsmith_extra={"run_id": alt_run_id})
    time.sleep(1)  # Wait for processing

    # Try direct API call for feedback
    feedback_url = f"https://api.smith.langchain.com/feedback"
    headers = {
        "Authorization": f"Bearer {langsmith_key}",
        "Content-Type": "application/json"
    }

    feedback_data = {
        "run_id": alt_run_id,
        "key": "direct-test",
        "score": 0.8
    }

    response = requests.post(feedback_url, headers=headers, json=feedback_data)
    print(f"Direct API feedback response: {response.status_code}")

    if response.status_code == 200:
        print("✅ Direct API feedback worked!")
    else:
        print(f"❌ Direct API feedback failed: {response.text}")

except Exception as e:
    print(f"❌ Direct API approach failed: {e}")

print(f"\n📊 Debug Summary Complete")
print("Check the results above to identify the specific issue.")

🔍 Deep LangSmith Debugging...
✅ Using Personal Access Token: lsv2_pt_ae537ac...

🧪 Test 1: Direct API Health Check
API Status: 200
✅ LangSmith API is accessible

🧪 Test 2: Client Initialization
✅ LangSmith Client created successfully
❌ Can't list runs: Failed to POST /runs/query in LangSmith API. HTTPError('400 Client Error: Bad Request for url: https://api.smith.langchain.com/runs/query', '{"detail":"At least one of \'session\', \'id\', \'parent_run\', \'trace\' or \'reference_example\' must be specified"}')

🧪 Test 3: Minimal Trace + Feedback Test
Creating trace with run_id: 26d189bd-e010-4f0f-896b-0915946484a9
✅ Trace created: Processed: test input
🧪 Attempting to add feedback to run_id: 26d189bd-e010-4f0f-896b-0915946484a9
✅ Feedback creation successful!
Feedback response: id=UUID('acb47dc6-6a77-4912-803e-833ee6992eee') created_at=datetime.datetime(2025, 9, 18, 2, 10, 47, 68634, tzinfo=datetime.timezone.utc) modified_at=datetime.datetime(2025, 9, 18, 2, 10, 47, 68640, tzinfo=dateti

# **Part 2 - Run the LangSmith enhanced version**

The first thing you might want to trace is all your OpenAI calls. After all, this is where the LLM is actually being called, so it is the most important part! We've tried to make this as easy as possible with LangSmith by introducing a dead-simple OpenAI wrapper. The enhanced cell later in this part makes three changes to the Part 1 code:

Notice how we import `wrap_openai` (`from langsmith.wrappers import wrap_openai`) and use it to wrap the OpenAI client (`openai_client = wrap_openai(OpenAI())`).

Notice how we import `traceable` (`from langsmith import traceable`) and use it to decorate the overall function (`@traceable`).

Notice how we use the same `traceable` import to decorate the retriever function, this time with a run type (`@traceable(run_type="retriever")`).
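
To build intuition for what decorating nested functions gives you, here is a toy stand-in that records parent/child relationships locally. It is not LangSmith's implementation: `toy_traceable`, `TRACE_LOG`, and the default run type `"chain"` are assumptions made for this sketch, and nothing is sent over the network.

```python
import functools
import uuid

TRACE_LOG = []   # finished runs, in completion order (local only)
_STACK = []      # runs currently executing, innermost last

def toy_traceable(run_type="chain"):
    """Toy tracing decorator: records each call and which call contained it."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            run = {
                "id": str(uuid.uuid4()),
                "name": func.__name__,
                "run_type": run_type,
                # The enclosing run, if any, becomes the parent
                "parent": _STACK[-1]["name"] if _STACK else None,
            }
            _STACK.append(run)
            try:
                run["output"] = func(*args, **kwargs)
                return run["output"]
            finally:
                _STACK.pop()
                TRACE_LOG.append(run)
        return wrapper
    return decorator

@toy_traceable(run_type="retriever")
def retriever(query):
    return ["Harrison worked at Kensho"]

@toy_traceable()
def rag(question):
    docs = retriever(question)
    return "\n".join(docs)  # stands in for the LLM call

rag("where did harrison work")
for run in TRACE_LOG:
    print(run["name"], run["run_type"], "parent:", run["parent"])
```

The inner `retriever` run finishes first and lists `rag` as its parent, which is the same nesting you should expect to see in a trace view.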

What happens if you call it in the following way?

# Beta Testing
The next stage of LLM application development is beta testing your application. This is when you release it to a few initial users. Having good observability set up here is crucial because you often don't know exactly how users will actually use your application, and observability gives you insight into how they do. You will probably want to adjust your tracing setup to support that. This extends the observability you set up in the previous section.

# Collecting Feedback
A huge part of having good observability during beta testing is collecting feedback. What feedback you collect is often application specific, but at the very least a simple thumbs up/down is a good start. After logging that feedback, you need to be able to easily associate it with the run that caused it. Luckily LangSmith makes it easy to do that.

First, you need to log the feedback from your app. An easy way to do this is to generate a run ID for each run, pass it to the traced call, and reuse it when logging feedback. The RAG cell further down does this with `langsmith_extra={"run_id": run_id}` and `ls_client.create_feedback(run_id, ...)`.
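
As a local illustration of that pattern, the sketch below creates the run ID before the call and later files feedback under the same ID. Everything here is hypothetical: `answer`, `record_feedback`, and the `feedback_by_run` dict stand in for the traced app and the feedback backend, and no LangSmith call is made.

```python
import uuid

# Hypothetical in-memory stand-in for a feedback backend, keyed by run ID
feedback_by_run = {}

def answer(question, run_id):
    """Placeholder for the traced app call; a real app would call rag() here."""
    return f"(answer to: {question})"

def record_feedback(run_id, key, score):
    """Attach a feedback entry to the run that produced the answer."""
    feedback_by_run.setdefault(run_id, []).append({"key": key, "score": score})

# 1. Create the ID before the call so the caller already knows it
run_id = str(uuid.uuid4())
reply = answer("where did harrison work", run_id)

# 2. Later, when the user clicks thumbs up, reuse the same ID
record_feedback(run_id, key="user-score", score=1.0)
print(feedback_by_run[run_id])
```

Generating the ID on the caller's side means the UI can hold on to it while the user decides how to rate the answer.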

# Logging Metadata
It is also a good idea to start logging metadata. This allows you to start keeping track of different attributes of your app, which is important for knowing what version or variant of your app produced a given result.

For this example, we will log the LLM used. Oftentimes you may be experimenting with different LLMs, so having that information as metadata can be useful for filtering. In order to do that, we can add it as such:

In [5]:
# Working LangSmith RAG Application

from google.colab import userdata
import os
from openai import OpenAI
from langsmith.wrappers import wrap_openai
from langsmith import traceable, Client
import uuid

# Load API keys from Colab Secrets
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ["LANGSMITH_API_KEY"] = userdata.get('LANGSMITH_API_KEY')

print("🔑 Loaded API keys from Colab Secrets")

# Initialize clients
openai_client = wrap_openai(OpenAI())
ls_client = Client()

@traceable(run_type="retriever")
def retriever(query: str):
    results = ["Harrison worked at Kensho"]
    return results

@traceable(metadata={"llm": "gpt-4o-mini"})
def rag(question):
    docs = retriever(question)
    system_message = """Answer the users question using only the provided information below:

    {docs}""".format(docs="\n".join(docs))

    response = openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-4o-mini",
    )

    return response.choices[0].message.content

print("🚀 Starting RAG application tests...")

# Test 1: Successful run
print("\n✅ Test 1: Successful run")
run_id = str(uuid.uuid4())
print(f"Run ID: {run_id}")

result = rag(
    "where did harrison work",
    langsmith_extra={"run_id": run_id, "metadata": {"user_id": "harrison"}}
)
print(f"Response: {result}")

# Add positive feedback
feedback_response = ls_client.create_feedback(
    run_id,
    key="user-score",
    score=1.0,
)
print(f"✅ Positive feedback added successfully")

# Test 2: "Failed" run (will still work but gets negative feedback)
print("\n❌ Test 2: Failed run (negative feedback)")
run_id = str(uuid.uuid4())
print(f"Run ID: {run_id}")

result = rag(
    "where did peter work",
    langsmith_extra={"run_id": run_id, "metadata": {"user_id": "peter"}}
)
print(f"Response: {result}")

# Add negative feedback
feedback_response = ls_client.create_feedback(
    run_id,
    key="user-score",
    score=0.0,
)
print(f"✅ Negative feedback added successfully")

print(f"\n🎉 Both tests completed successfully!")
print(f"📊 Now you can proceed with the homework assignment:")
print(f"1. Go to https://smith.langchain.com/projects")
print(f"2. Click on your recent runs")
print(f"3. Explore the trace view, metadata, and feedback as described in the homework")
print(f"4. Use the feedback tab to filter by positive/negative feedback")
print(f"5. Check out the monitoring dashboard")

print(f"\n🔍 Your traces should show:")
print(f"- Full RAG → Retriever → OpenAI call structure")
print(f"- Custom metadata (user_id, llm model)")
print(f"- Feedback scores (1.0 for Harrison, 0.0 for Peter)")

🔑 Loaded API keys from Colab Secrets
🚀 Starting RAG application tests...

✅ Test 1: Successful run
Run ID: 87dfe73d-4ce0-42d2-a0fd-ec8d326563b8
Response: Harrison worked at Kensho.
✅ Positive feedback added successfully

❌ Test 2: Failed run (negative feedback)
Run ID: 12db72d1-8aac-40d7-beaa-4b2a6db51390
Response: The provided information does not include where Peter worked.
✅ Negative feedback added successfully

🎉 Both tests completed successfully!
📊 Now you can proceed with the homework assignment:
1. Go to https://smith.langchain.com/projects
2. Click on your recent runs
3. Explore the trace view, metadata, and feedback as described in the homework
4. Use the feedback tab to filter by positive/negative feedback
5. Check out the monitoring dashboard

🔍 Your traces should show:
- Full RAG → Retriever → OpenAI call structure
- Custom metadata (user_id, llm model)
- Feedback scores (1.0 for Harrison, 0.0 for Peter)
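
The traces above carry two kinds of metadata: the `llm` value fixed in the decorator and the `user_id` passed per call through `langsmith_extra`. A rough sketch of how the two sets can be thought of as one combined dict follows. `combine_metadata` is a hypothetical helper, and the rule that the per-call value wins on a key collision is an assumption of this sketch, not documented LangSmith behavior.

```python
# Metadata fixed at decoration time (same for every call)
STATIC_METADATA = {"llm": "gpt-4o-mini"}

def combine_metadata(static, per_call):
    """Return a new dict holding both sets of keys.

    Assumption of this sketch: on a key collision the per-call value wins.
    """
    merged = dict(static)
    merged.update(per_call)
    return merged

print(combine_metadata(STATIC_METADATA, {"user_id": "harrison"}))
```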


In [6]:
# Generate Fresh LangSmith Traces
# This will create new traces each time you run it

from google.colab import userdata
import os
from openai import OpenAI
from langsmith.wrappers import wrap_openai
from langsmith import traceable, Client
import uuid
import time

# Load API keys
os.environ["OPENAI_API_KEY"] = userdata.get('OPENAI_API_KEY')
os.environ["LANGSMITH_API_KEY"] = userdata.get('LANGSMITH_API_KEY')

# Initialize clients
openai_client = wrap_openai(OpenAI())
ls_client = Client()

@traceable(run_type="retriever")
def retriever(query: str):
    results = ["Harrison worked at Kensho"]
    return results

@traceable(metadata={"llm": "gpt-4o-mini"})
def rag(question):
    docs = retriever(question)
    system_message = """Answer the users question using only the provided information below:

    {docs}""".format(docs="\n".join(docs))

    response = openai_client.chat.completions.create(
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": question},
        ],
        model="gpt-4o-mini",
    )

    return response.choices[0].message.content

# Method 1: Try different questions
questions = [
    "where did harrison work",
    "what company did harrison join",
    "tell me about harrison's job",
    "where was harrison employed",
    "what is harrison's workplace"
]

print("🔄 Generating traces with different questions...")

for i, question in enumerate(questions, 1):
    run_id = str(uuid.uuid4())
    print(f"\n{i}. Testing: '{question}'")
    print(f"   Run ID: {run_id}")

    result = rag(
        question,
        langsmith_extra={
            "run_id": run_id,
            "metadata": {
                "user_id": f"user_{i}",
                "question_type": "harrison_work",
                "test_number": i
            }
        }
    )
    print(f"   Response: {result}")

    # Add varied feedback
    score = 1.0 if i % 2 == 1 else 0.5  # Alternate between high and medium scores
    ls_client.create_feedback(
        run_id,
        key="user-score",
        score=score,
    )
    print(f"   Feedback: {score}")

    # Small delay to ensure traces are processed
    time.sleep(1)

# Method 2: Try questions about Peter (will get negative responses)
peter_questions = [
    "where did peter work",
    "what is peter's job",
    "tell me about peter's career"
]

print(f"\n🔄 Generating traces for Peter questions...")

for i, question in enumerate(peter_questions, 1):
    run_id = str(uuid.uuid4())
    print(f"\n{i}. Testing: '{question}'")
    print(f"   Run ID: {run_id}")

    result = rag(
        question,
        langsmith_extra={
            "run_id": run_id,
            "metadata": {
                "user_id": f"peter_user_{i}",
                "question_type": "peter_work",
                "test_number": i
            }
        }
    )
    print(f"   Response: {result}")

    # Add negative feedback since Peter isn't in our data
    ls_client.create_feedback(
        run_id,
        key="user-score",
        score=0.0,
    )
    print(f"   Feedback: 0.0 (negative)")

    time.sleep(1)

print(f"\n🎉 Generated {len(questions) + len(peter_questions)} new traces!")
print(f"🔍 Check your LangSmith dashboard: https://smith.langchain.com/projects")
print(f"💡 Each trace has a unique run_id and different metadata for easy identification")

🔄 Generating traces with different questions...

1. Testing: 'where did harrison work'
   Run ID: eb9029ac-c9ed-44e6-8fdc-0d79833553cb
   Response: Harrison worked at Kensho.
   Feedback: 1.0

2. Testing: 'what company did harrison join'
   Run ID: e6861f34-1431-4371-86e5-d7a9631e54ba
   Response: Harrison joined Kensho.
   Feedback: 0.5

3. Testing: 'tell me about harrison's job'
   Run ID: f24668c1-3bb2-4e11-9dd6-ed429c0393ed
   Response: Harrison worked at Kensho.
   Feedback: 1.0

4. Testing: 'where was harrison employed'
   Run ID: 7a1a1198-db15-4ad4-80ec-753358a0398c
   Response: Harrison was employed at Kensho.
   Feedback: 0.5

5. Testing: 'what is harrison's workplace'
   Run ID: aff7d1eb-f0ba-42df-a2ce-28d341b9cc02
   Response: Harrison's workplace is Kensho.
   Feedback: 1.0

🔄 Generating traces for Peter questions...

1. Testing: 'where did peter work'
   Run ID: e16d22b3-bf7a-49e0-b6d2-5ed823870175
   Response: The provided information does not specify where Peter worked.


# Observing what happens

You have now run an app with LangSmith integrated for 1) tracing, 2) extra metadata, and 3) feedback.

After successfully running this, let's go to the LangSmith website to check out:
1. the traces from RAG to retriever to the OpenAI call
2. the metadata that we added for user_id and the LLM model
3. the feedback added for each trace.


Next, let's try to query by feedback that is positive

Next, let's try to query the feedback that is negative.
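
If it helps to picture what these two queries select, the sketch below applies the same idea to a few local records. The records and the `by_feedback` helper are hypothetical; LangSmith has its own filter controls and query syntax, which are not reproduced here.

```python
# Hypothetical local records mirroring scores assigned in the cells above
runs = [
    {"question": "where did harrison work", "score": 1.0},
    {"question": "what company did harrison join", "score": 0.5},
    {"question": "where did peter work", "score": 0.0},
]

def by_feedback(records, min_score=None, max_score=None):
    """Keep records whose score lies within the optional inclusive bounds."""
    kept = []
    for rec in records:
        if min_score is not None and rec["score"] < min_score:
            continue
        if max_score is not None and rec["score"] > max_score:
            continue
        kept.append(rec)
    return kept

positive = by_feedback(runs, min_score=1.0)  # thumbs up
negative = by_feedback(runs, max_score=0.0)  # thumbs down
print([r["question"] for r in positive])
print([r["question"] for r in negative])
```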

Last, let's look at the monitoring dashboard.
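
As a rough local analogue of the kind of aggregate a dashboard might chart, the sketch below averages the `user-score` values per `question_type`, using the same scoring rule as the trace-generation cell. The grouping helper `average_by_group` is hypothetical, and the dashboard's actual charts are not reproduced here.

```python
from collections import defaultdict
from statistics import mean

# Same rule as the loop above: odd-numbered questions get 1.0, even get 0.5
harrison_scores = [1.0 if i % 2 == 1 else 0.5 for i in range(1, 6)]
# Every Peter question was given 0.0
peter_scores = [0.0] * 3

records = (
    [("harrison_work", s) for s in harrison_scores]
    + [("peter_work", s) for s in peter_scores]
)

def average_by_group(pairs):
    """Average the scores for each group label."""
    groups = defaultdict(list)
    for group, score in pairs:
        groups[group].append(score)
    return {group: mean(scores) for group, scores in groups.items()}

averages = average_by_group(records)
print(averages)
```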