Install Required Dependencies

This cell installs the Python packages required by the evaluation pipeline:
- python-dotenv: Loads environment variables from a .env file (imported as dotenv)
- langsmith: LangSmith client for dataset management and experiment tracking
- requests: HTTP library for API calls
- deepeval: Framework for evaluating LLM outputs with various metrics
- openai, langchain, langchain-openai: LLM interactions and chains
- langchain_community: Vector database and community integrations
- pandas, tabulate: Result tables and CSV export used by the evaluation cells


In [None]:
%pip install python-dotenv langsmith
%pip install requests
%pip install deepeval==3.6.6
%pip install openai langchain langchain-openai langchain_community
%pip install pandas tabulate
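
Because deepeval is pinned to a specific version, it can help to confirm what actually got installed before moving on. The following is a minimal sketch using the standard library's importlib.metadata; the helper name installed_version is illustrative and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package: str):
    """Return the installed version string for `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Distribution names as passed to pip in the install cell.
for name in ["langsmith", "requests", "deepeval", "openai", "langchain"]:
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

If deepeval reports anything other than 3.6.6, restarting the kernel after the install cell is usually worth trying first.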

Import Core Libraries

Imports essential modules for:
- Environment variable management (dotenv)
- Azure OpenAI client initialization
- LangSmith client for experiment tracking and tracing
- Wrapper utilities to integrate OpenAI with LangSmith tracing


In [None]:
from dotenv import load_dotenv
import os
from openai import AzureOpenAI
from langsmith import Client, traceable
from langsmith.wrappers import wrap_openai

Set LangSmith API Key
Configures the LangSmith API key as an environment variable.
This key is required for authenticating with LangSmith services and is used for tracking experiments, storing datasets, and logging evaluation results.


In [None]:
os.environ["LANGSMITH_API_KEY"] = "*********"  # Add your LangSmith API key
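
Hardcoding a key in a notebook makes it easy to leak. One alternative is to export the key outside the notebook (or keep it in a .env file and call load_dotenv(), which the import cell already brings in) and fail early when it is missing. The sketch below assumes that approach; require_env and the DEMO_API_KEY variable are hypothetical names used only for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo with a throwaway variable so the sketch runs without real credentials.
os.environ["DEMO_API_KEY"] = "dummy-value"
print("Key loaded:", bool(require_env("DEMO_API_KEY")))
```

In the notebook itself, the same helper would be called as require_env("LANGSMITH_API_KEY") once the key has been provided through the environment.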

Initialize LangSmith Client
Creates a LangSmith client instance using the API key from the environment variables.
This client will be used to:
- Fetch datasets
- Create and manage experiments
- Log evaluation results

Prints a confirmation message once the client object has been created.


In [None]:
ls = Client(api_key=os.environ["LANGSMITH_API_KEY"])
print("✅ LangSmith client initialized.")

Configure Azure OpenAI Environment
Sets up environment variables for Azure OpenAI service:
- API Key: Authentication credential for Azure OpenAI
- API Base: Endpoint URL for your Azure OpenAI resource
- API Version: Specifies the API version to use
- Deployment Name: The specific GPT model deployment to use

Then initializes the DeepEval Azure OpenAI model wrapper with:
- temperature=1: Controls randomness in model outputs
- All Azure-specific configuration parameters

This model will be used by DeepEval metrics for evaluation.

In [None]:
import os

# Set environment variables (ensure names match what you will use later)
os.environ["AZURE_OPENAI_API_KEY"]        = "**********"   # Add your Azure OpenAI API key
os.environ["AZURE_OPENAI_API_BASE"]       = "***********"  # Add your Azure OpenAI API base (endpoint URL)
os.environ["AZURE_OPENAI_API_VERSION"]    = "***********"  # Add your Azure OpenAI API version
os.environ["AZURE_OPENAI_LLM_DEPLOYMENT"] = "gpt-5-mini"

# Import the model class
from deepeval.models import AzureOpenAIModel

# Initialize the model
azure_client = AzureOpenAIModel(
    model_name="gpt-5-mini",
    deployment_name=os.environ["AZURE_OPENAI_LLM_DEPLOYMENT"],
    azure_openai_api_key=os.environ["AZURE_OPENAI_API_KEY"],
    openai_api_version=os.environ["AZURE_OPENAI_API_VERSION"],
    azure_endpoint=os.environ["AZURE_OPENAI_API_BASE"],
    temperature=1
)
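
Before constructing the model, it can be useful to check that none of the four settings is empty or still a masked placeholder. The following is a rough sketch of such a check; missing_azure_vars is a hypothetical helper, and the endpoint in the demo dictionary is an illustrative value, not a real resource.

```python
import os

# Variable names taken from the configuration cell.
REQUIRED_VARS = (
    "AZURE_OPENAI_API_KEY",
    "AZURE_OPENAI_API_BASE",
    "AZURE_OPENAI_API_VERSION",
    "AZURE_OPENAI_LLM_DEPLOYMENT",
)


def missing_azure_vars(env=None):
    """Return names of required variables that are unset, blank, or only asterisks."""
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "").strip()
        if not value or set(value) == {"*"}:
            missing.append(name)
    return missing


demo_env = {
    "AZURE_OPENAI_API_KEY": "**********",                 # placeholder left unchanged
    "AZURE_OPENAI_API_BASE": "https://example.invalid/",  # illustrative value
    "AZURE_OPENAI_API_VERSION": "",                       # left blank
    "AZURE_OPENAI_LLM_DEPLOYMENT": "gpt-5-mini",
}
print(missing_azure_vars(demo_env))
# → ['AZURE_OPENAI_API_KEY', 'AZURE_OPENAI_API_VERSION']
```

Calling missing_azure_vars() with no argument checks os.environ, so it can be run right after the environment variables are set.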


Fetch and Display Dataset from LangSmith
Purpose: Verify dataset contents before running evaluation
Steps:
1. Connect to LangSmith using the API key
2. Fetch all examples from the specified dataset ID
3. Loop through each example and display:
    - Question: The input query
    - Actual Answer: The model's response
    - Expected Answer: The ground truth/reference answer

This is useful for:
- Verifying dataset structure
- Checking data quality before evaluation
- Understanding the format of inputs and outputs

In [None]:
# ✅ Imports
import os
import pandas as pd
from langsmith import Client
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, SummarizationMetric, ToxicityMetric
from tabulate import tabulate
from IPython.display import display, HTML

# -----------------------------
# 1️⃣ Connect to LangSmith
# -----------------------------
client = Client(api_key="****************")  # Add your LangSmith API key here
dataset_id = "*****************"  # Add your dataset ID here

# Fetch all examples
examples = client.list_examples(dataset_id=dataset_id)

# Loop through all examples and print their content
for ex in examples:
    question = ex.inputs.get("Question", "")
    actual_answer = ex.outputs.get("Actual_Answer", "")
    expected_answer = ex.outputs.get("Expected_Answer", "")

    print("Question:", question)
    print("Actual Answer:", actual_answer)
    print("Expected Answer:", expected_answer)
    print("-" * 50)
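
Printing every example works for small datasets, but a structural check scales better. The sketch below flags examples that are missing any of the three fields the loop reads. It uses SimpleNamespace stand-ins rather than real LangSmith examples, and find_incomplete_examples is a hypothetical helper. It also tolerates outputs being None, which may occur for examples stored without reference outputs.

```python
from types import SimpleNamespace


def find_incomplete_examples(examples):
    """Return (index, missing_fields) for each example lacking a non-empty required field."""
    problems = []
    for i, ex in enumerate(examples):
        inputs = ex.inputs or {}
        outputs = ex.outputs or {}
        missing = []
        if not inputs.get("Question"):
            missing.append("Question")
        for key in ("Actual_Answer", "Expected_Answer"):
            if not outputs.get(key):
                missing.append(key)
        if missing:
            problems.append((i, missing))
    return problems


# Stand-ins exposing the same .inputs / .outputs attributes the display loop reads.
fake_examples = [
    SimpleNamespace(inputs={"Question": "What is 2 + 2?"},
                    outputs={"Actual_Answer": "4", "Expected_Answer": "4"}),
    SimpleNamespace(inputs={"Question": "Capital of France?"},
                    outputs={"Actual_Answer": "Paris"}),
    SimpleNamespace(inputs={}, outputs=None),
]
print(find_incomplete_examples(fake_examples))
# → [(1, ['Expected_Answer']), (2, ['Question', 'Actual_Answer', 'Expected_Answer'])]
```

Since list_examples returns an iterator, wrapping it in list(...) first lets the same examples be both checked and printed.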



Run DeepEval Metrics and Save Results
This is the main evaluation cell; it runs the complete evaluation pipeline.
Process:

CONNECT TO LANGSMITH
- Initializes client and fetches dataset examples
- Prepares evaluation cases with input, actual output, and expected output

SELECT METRICS
- AnswerRelevancyMetric: Measures how relevant the answer is to the question
- SummarizationMetric: Evaluates quality of summarization
- ToxicityMetric: Detects toxic or harmful content

DEFINE EVALUATION FUNCTION
- Creates an LLMTestCase for each example
- Runs all metrics on each test case
- Captures scores and reasoning for each metric
- Displays results in HTML table format

EVALUATE ALL EXAMPLES
- Loops through each example in the dataset
- Runs evaluate_case() for each
- Collects all results in all_results_list

SAVE RESULTS TO CSV
- Converts results to DataFrame
- Saves to CSV file for later use
- Each row contains: Question, Metric, Score, Reason, Actual_Answer, Expected_Answer


In [None]:
# ✅ Imports
import os
import pandas as pd
from langsmith import Client
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, SummarizationMetric, ToxicityMetric
from tabulate import tabulate
from IPython.display import display, HTML

# -----------------------------
# 1️⃣ Connect to LangSmith
# -----------------------------
client = Client(api_key="****************")  # Add your LangSmith API key here
dataset_id = "*****************"  # Add your dataset ID here


# Fetch all examples
examples = client.list_examples(dataset_id=dataset_id)


# Prepare evaluation cases
eval_cases = [
    {
        "input": ex.inputs.get("Question", ""),
        "actual_output": ex.outputs.get("Actual_Answer", ""),
        "expected_output": ex.outputs.get("Expected_Answer", "")
    }
    for ex in examples
]


# -----------------------------
# 2️⃣ Pick metrics
# -----------------------------
metrics_list = [AnswerRelevancyMetric, SummarizationMetric, ToxicityMetric]

# -----------------------------
# 3️⃣ Evaluate function
# -----------------------------
def evaluate_case(case_data, model=None, metrics_list=None):
    # Return an empty list (not None) so the caller can always iterate over the result
    if metrics_list is None or model is None:
        return []

    test_case = LLMTestCase(
        input=case_data["input"],
        expected_output=case_data["expected_output"],
        actual_output=case_data["actual_output"],
        retrieval_context=None,
        context=None
    )

    all_results = []
    for metric in metrics_list:
        try:
            m = metric(model=model)
            m.measure(test_case)
            all_results.append({
                "Metric": type(m).__name__,
                "Score": m.score,
                "Reason": getattr(m, "reason", None)
            })
        except Exception as e:
            # `metric` is the metric class itself, so read its name directly
            all_results.append({
                "Metric": metric.__name__,
                "Score": None,
                "Reason": f"Error: {e}"
            })

    print(f"\n📘 Results for: {case_data['input']}\n")
    display(HTML(tabulate(
        [[r["Metric"], r["Score"], r["Reason"]] for r in all_results],
        headers=["Metric", "Score", "Reason"],
        tablefmt="html"
    )))

    return all_results


# -----------------------------
# 4️⃣ Evaluate all examples
# -----------------------------
all_results_list = []

for case in eval_cases:
    results = evaluate_case(case, model=azure_client, metrics_list=metrics_list)
    for r in results:
        all_results_list.append({
            "Question": case["input"],
            "Expected_Answer": case["expected_output"],
            "Actual_Answer": case["actual_output"],
            "Metric": r["Metric"],
            "Score": r["Score"],
            "Reason": r["Reason"]
        })

# -----------------------------
# 5️⃣ Save results to CSV (optional)
# -----------------------------
df_results = pd.DataFrame(all_results_list)
df_results.to_csv("deepeval_results_updated_final.csv", index=False)
print("✅ Results saved to deepeval_results_updated_final.csv")
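
With one row per question and metric, a per-metric summary gives a quick read on the run before anything is uploaded. The sketch below uses only the standard library and a few made-up rows; summarize_by_metric is a hypothetical helper, and the demo scores are not real results.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_metric(rows):
    """Return {metric: {"count", "errors", "mean_score"}} from result rows."""
    scores = defaultdict(list)
    errors = defaultdict(int)
    for row in rows:
        if row["Score"] is None:
            errors[row["Metric"]] += 1  # the metric raised and no score was recorded
        else:
            scores[row["Metric"]].append(float(row["Score"]))

    summary = {}
    for metric in sorted(set(scores) | set(errors)):
        values = scores.get(metric, [])
        summary[metric] = {
            "count": len(values),
            "errors": errors.get(metric, 0),
            "mean_score": mean(values) if values else None,
        }
    return summary


# Made-up rows in the same shape as the entries of all_results_list.
demo_rows = [
    {"Metric": "AnswerRelevancyMetric", "Score": 1.0},
    {"Metric": "AnswerRelevancyMetric", "Score": 0.5},
    {"Metric": "ToxicityMetric", "Score": 0.0},
    {"Metric": "ToxicityMetric", "Score": None},  # a failed measurement
]
for metric, stats in summarize_by_metric(demo_rows).items():
    print(metric, stats)
```

In the notebook, the equivalent call would be summarize_by_metric(all_results_list); a pandas groupby on the Metric column would give a similar view.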


Upload Evaluation Results to LangSmith UI
This cell uploads the saved DeepEval results to LangSmith for visualization.
Purpose: Display all evaluation metrics in the LangSmith web interface

DEFINE WRAPPER FUNCTION
- Resultslogback_wrapper(): Takes inputs (question) and returns outputs (answer + metrics)
- Looks up results from the CSV for each question
- Returns the actual answer together with every metric's score and reason

DEFINE EVALUATOR
- dynamic_metrics_evaluator: Reads the metrics returned by the wrapper
- Returns one entry per metric with: key (lowercased metric name), score (float), comment (reason)

RUN EVALUATION & UPLOAD
- Uses ls_client.evaluate() to create an experiment in LangSmith
- Processes each example with the wrapper function
- Applies the evaluator to capture the score for each metric
- Uploads results to the LangSmith dashboard

Result: All metrics visible in LangSmith UI with scores and explanations


In [None]:
# ✅ Imports
import os
import pandas as pd
from langsmith import Client
from datetime import datetime

# -----------------------------
# 1️⃣ Connect to LangSmith
# -----------------------------
ls_client = Client(api_key="****************")  # Add your LangSmith API key here
dataset_id = "*****************"  # Add your dataset ID here
dataset = ls_client.read_dataset(dataset_id=dataset_id)
print(f"✅ Using existing dataset: {dataset.name} (ID: {dataset.id})")

# -----------------------------
# 3️⃣ Load DeepEval results from CSV
# -----------------------------
# Same file the evaluation cell wrote; prefix with /content/ if the working directory differs on Colab
df_results = pd.read_csv("deepeval_results_updated_final.csv")
print(f"✅ Loaded {len(df_results)} results from CSV")
print(f"📊 Unique questions: {df_results['Question'].nunique()}")

# -----------------------------
# 4️⃣ Prepare model wrapper function
# -----------------------------
def Resultslogback_wrapper(inputs: dict) -> dict:
    """
    Returns actual answer and all metrics for a given question.
    """
    question = inputs.get("Question", "")
    rows = df_results[df_results["Question"] == question]

    if rows.empty:
        return {"answer": None, "metrics": {}}

    actual_answer = rows.iloc[0]["Actual_Answer"]
    metrics = {}
    for _, row in rows.iterrows():
        metrics[row["Metric"]] = {"Score": row["Score"], "Reason": row["Reason"]}

    return {"answer": actual_answer, "metrics": metrics}

# -----------------------------
# 5️⃣ Dynamic evaluator for all metrics
# -----------------------------
def dynamic_metrics_evaluator(run, example):
    outputs = run.outputs
    metrics = outputs.get("metrics", {})

    results = []
    for metric_name, data in metrics.items():
        results.append({
            "key": metric_name.lower(),
            "score": float(data.get("Score", 0.0)),
            "comment": data.get("Reason", "")
        })
    return results


# -----------------------------
# 6️⃣ Run evaluation and upload to LangSmith
# -----------------------------
experiment_name = f"deepeval-metrics-{datetime.now().strftime('%Y%m%d_%H%M%S')}"
print(f"\n📤 Starting LangSmith evaluation: {experiment_name}\n")

results = ls_client.evaluate(
    Resultslogback_wrapper,      # Model wrapper that returns answer + all metrics
    data=dataset_id,             # Dataset ID
    evaluators=[dynamic_metrics_evaluator],
    experiment_prefix=experiment_name,
    description="DeepEval metrics uploaded dynamically",
    max_concurrency=1,
    num_repetitions=1,
)

print("\n✅ Evaluation complete!")
print(f"📊 Results object: {results}")
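
Rows whose metric failed are saved with an empty Score, which pandas reads back as NaN. The sketch below shows one way the evaluator's conversion step could treat those values as missing instead of passing NaN along. safe_score and metrics_to_feedback are hypothetical helpers, the run object is a SimpleNamespace stand-in, and whether a score of None suits your LangSmith setup is an assumption worth verifying.

```python
import math
from types import SimpleNamespace


def safe_score(value):
    """Convert a CSV score to float, returning None for missing or NaN values."""
    try:
        score = float(value)
    except (TypeError, ValueError):
        return None
    return None if math.isnan(score) else score


def metrics_to_feedback(metrics):
    """Turn {metric_name: {"Score", "Reason"}} into a list of feedback dicts."""
    feedback = []
    for name, data in metrics.items():
        feedback.append({
            "key": name.lower(),
            "score": safe_score(data.get("Score")),
            "comment": str(data.get("Reason", "")),
        })
    return feedback


# Stand-in for the run object passed to the evaluator; values are made up.
run = SimpleNamespace(outputs={"answer": "Paris", "metrics": {
    "AnswerRelevancyMetric": {"Score": 0.9, "Reason": "On topic"},
    "ToxicityMetric": {"Score": float("nan"), "Reason": "Error: timeout"},
}})
feedback = metrics_to_feedback(run.outputs["metrics"])
for item in feedback:
    print(item)
```

Running a dry check like this against a few rows of the CSV can catch formatting problems without creating an experiment in LangSmith.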
