# Testing Responding in JSON

This notebook covers how to benchmark chains that should respond in JSON. We will:

1. Create a dataset of test examples
2. Upload that dataset to LangSmith
3. Create multiple chains
4. Define some evaluation criteria
5. Run some tests!

In [5]:
import os
os.environ["LANGCHAIN_TRACING_V2"]="true"
os.environ["LANGCHAIN_ENDPOINT"]="https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"]="..."

In [6]:
from langsmith import Client

client = Client()
dataset_name = "Structured JSON Dataset"

# Storing inputs in a dataset lets us
# run chains and LLMs over a shared set of examples.
dataset = client.create_dataset(
    dataset_name=dataset_name, description="Extracting structured JSON",
)

## Create a dataset of test examples

Let's create a dataset of examples. Suppose we want to extract structured information from unstructured input and return it as JSON. Specifically, we want to extract a person's name and age.

In [9]:
import json

examples = [
    # Standard example
    ("Julie is 13", json.dumps({"name": "Julie", "age": 13})),
    # Example with name in lower case
    ("ben is 9", json.dumps({"name": "Ben", "age": 9})),
    # Example with age spelled out
    ("Sam is thirty four", json.dumps({"name": "Sam", "age": 34})),
    # Examples without ground truth
    ("Bob is 17", ),
    ("Molly is 2", ),
]
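
Before uploading, it can help to sanity-check the examples locally. The following is a minimal sketch, not part of the original notebook: the helper name `check_example` is hypothetical, and the check simply assumes every reference should be a JSON object with a string `name` and an integer `age`.

```python
import json

# Same examples as in the cell above.
examples = [
    ("Julie is 13", json.dumps({"name": "Julie", "age": 13})),
    ("ben is 9", json.dumps({"name": "Ben", "age": 9})),
    ("Sam is thirty four", json.dumps({"name": "Sam", "age": 34})),
    ("Bob is 17",),
    ("Molly is 2",),
]


def check_example(example: tuple) -> bool:
    """Return True if the example has a string input and, when present, a well-formed reference."""
    if len(example) not in (1, 2) or not isinstance(example[0], str):
        return False
    if len(example) == 1:
        return True  # no ground truth to check
    reference = json.loads(example[1])
    return (
        isinstance(reference, dict)
        and isinstance(reference.get("name"), str)
        and isinstance(reference.get("age"), int)
    )


with_reference = [e for e in examples if len(e) == 2]
print(all(check_example(e) for e in examples))
print(len(with_reference), "of", len(examples), "examples have a reference")
```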

## Upload dataset to LangSmith

In [13]:
for example in examples:
    # Each example must be unique and have inputs defined.
    # Outputs are optional
    if len(example) == 1:
        client.create_example(
            inputs={"input": example[0]},
            outputs=None,
            dataset_id=dataset.id,
        )
    elif len(example) == 2:
        client.create_example(
            inputs={"input": example[0]},
            outputs={"output": example[1]},
            dataset_id=dataset.id,
        )
    else:
        raise ValueError(f"Expected 1 or 2 elements per example, got {len(example)}")

## Create Multiple Chains

For now, let's just compare OpenAI and Anthropic.

In [32]:
from langchain.chat_models import ChatAnthropic, ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema import SystemMessage
from langchain.schema.output_parser import StrOutputParser
from langchain.chains import LLMChain

instructions = """Convert any user messages into valid json. You should only respond with 

```json
...
```

Do NOT include any words before or after.

For each user input, you should extract the name and age of the person in question. \
Use the keys `name` and `age` to hold that information. \
`name` should always be properly capitalized, and `age` should always be an integer.

For example, the input `Jim is 10` would get a response of:

```json
{"name": "Jim", "age": 10}
```"""

prompt = ChatPromptTemplate.from_messages([
    SystemMessage(content=instructions),
    ("human", "{input}")
])

In [33]:
def create_openai():
    return LLMChain(
        prompt=prompt, 
        llm=ChatOpenAI(temperature=0, model="gpt-4"),
        output_parser=StrOutputParser()
    )

def create_anthropic():
    return LLMChain(
        prompt=prompt,
        llm=ChatAnthropic(temperature=0, model="claude-2"),
        output_parser=StrOutputParser()
    )

## Define Custom Evaluation Criteria

We can now define some custom evaluation criteria. Let's define a few!

1. Whether, after some parsing, the output is exactly the same as the expected output
2. Whether any words were returned before the opening ```json fence
3. Whether the JSON that was returned is valid JSON
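
The three criteria can be sketched as plain Python functions, independent of LangChain, which makes them easy to try on a sample string. The helper names (`extract_json_block`, `is_verbose`, `is_valid_json`, `parsed_equal`) and the two sample predictions are hypothetical; the fence-splitting follows the same idea as the evaluators in this notebook.

```python
import json
from typing import Optional

# Build the fence marker programmatically so this snippet can sit inside a fenced block.
FENCE = "`" * 3
OPEN = FENCE + "json"


def extract_json_block(prediction: str) -> Optional[str]:
    """Return the text between the opening json fence and the next fence, or None if absent."""
    if OPEN not in prediction:
        return None
    after_open = prediction.split(OPEN, 1)[1]
    return after_open.split(FENCE, 1)[0].strip()


def is_verbose(prediction: str) -> bool:
    """Criterion 2: True if any non-whitespace text precedes the opening fence."""
    return len(prediction.split(OPEN, 1)[0].strip()) > 0


def is_valid_json(prediction: str) -> bool:
    """Criterion 3: True if the fenced block parses as JSON."""
    block = extract_json_block(prediction)
    if block is None:
        return False
    try:
        json.loads(block)
        return True
    except json.JSONDecodeError:
        return False


def parsed_equal(prediction: str, reference: str) -> bool:
    """Criterion 1: True if the fenced JSON equals the reference JSON after parsing both."""
    if not is_valid_json(prediction):
        return False
    return json.loads(extract_json_block(prediction)) == json.loads(reference)


# Hypothetical predictions: one clean, one with extra words before the fence.
good = " " + OPEN + '\n{"name": "Julie", "age": 13}\n' + FENCE
chatty = "Sure! Here you go:\n" + OPEN + '\n{"name": "Julie", "age": 13}\n' + FENCE
reference = json.dumps({"name": "Julie", "age": 13})

print(is_verbose(good), is_valid_json(good), parsed_equal(good, reference))
print(is_verbose(chatty), is_valid_json(chatty), parsed_equal(chatty, reference))
```

Keeping the checks as pure functions means they can be unit-tested without calling a model or LangSmith.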

In [39]:
import re
from typing import Any, Optional

from langchain.evaluation import StringEvaluator


class ParsedEquality(StringEvaluator):
    """Score whether the fenced JSON in the prediction matches the reference."""

    @property
    def requires_input(self) -> bool:
        return False

    @property
    def requires_reference(self) -> bool:
        return True

    @property
    def evaluation_name(self) -> str:
        return "parsed_equality"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
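        # Extract the text between the opening ```json fence and the closing fence.
        parsed = prediction.split("```json")[1].split("```")[0].strip()
        result = json.loads(parsed)
        # Compare parsed values (not the raw string) so key order and whitespace do not matter.
        return {"score": result == json.loads(reference)}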
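
Comparing parsed values rather than serialized strings makes an equality check insensitive to key order and whitespace. A minimal sketch of the difference, using a made-up model output for illustration:

```python
import json

reference = json.dumps({"name": "Julie", "age": 13})

# Hypothetical model output: same data, different key order and spacing.
model_json = '{"age":13,"name":"Julie"}'

# A raw string comparison fails even though the data is identical.
string_match = model_json == reference
# Parsing both sides first compares the underlying values.
parsed_match = json.loads(model_json) == json.loads(reference)

print(string_match)
print(parsed_match)
```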

In [36]:
import re
from typing import Any, Optional

from langchain.evaluation import StringEvaluator


class IsVerbose(StringEvaluator):
    """Score whether any non-whitespace text precedes the opening ```json fence."""

    @property
    def requires_input(self) -> bool:
        return False

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "is_verbose"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
        parsed = prediction.split("```json")[0]
        return {"score": len(parsed.strip()) > 0}

In [37]:
import re
from typing import Any, Optional

from langchain.evaluation import StringEvaluator


class IsValidJSON(StringEvaluator):
    """Score whether the fenced block in the prediction parses as JSON."""

    @property
    def requires_input(self) -> bool:
        return False

    @property
    def requires_reference(self) -> bool:
        return False

    @property
    def evaluation_name(self) -> str:
        return "is_valid_json"

    def _evaluate_strings(
        self,
        prediction: str,
        input: Optional[str] = None,
        reference: Optional[str] = None,
        **kwargs: Any
    ) -> dict:
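        try:
            parsed = prediction.split("```json")[1].split("```")[0].strip()
            json.loads(parsed)
            return {"score": 1}
        except (IndexError, json.JSONDecodeError):
            # IndexError: no ```json fence found; JSONDecodeError: fenced text is not valid JSON.
            return {"score": 0}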

## Run evaluation

Now we can run evaluation!

In [41]:
from langchain.smith import RunEvalConfig, run_on_dataset

evaluation_config = RunEvalConfig(
    custom_evaluators = [
        IsVerbose(), 
        IsValidJSON(), 
        # ParsedEquality()
    ],
)
run_on_dataset(
    client,
    "Structured JSON Dataset",
    create_anthropic,
    evaluation=evaluation_config,
)

View the evaluation results for project '2023-08-01-17-18-39-LLMChain' at:
https://smith.langchain.com/projects/p/51cf0351-e754-40cf-b6d8-ec7ccaf5e594?eval=true


{'project_name': '2023-08-01-17-18-39-LLMChain',
 'results': {'e6c57ab7-a74e-4d8c-aa1e-95b3d7a8dd9c': [' ```json\n{"name": "Molly", "age": 2}\n```'],
  '36b6eb46-e7a7-479c-9eb2-69647c8b970e': [' ```json\n{"name": "Bob", "age": 17}\n```'],
  'f8bc4d86-ceb2-4acb-961b-df7eda182ec0': [' ```json\n{"name": "Sam", "age": 34}\n```'],
  '035228dd-3180-4296-83d5-c486d35e092b': [' ```json\n{"name": "Ben", "age": 9}\n```'],
  'e7d56cbe-0ac1-4928-a0f5-6d319319bb26': [' ```json\n{"name": "Julie", "age": 13}\n```']}}
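
The predictions in the output above can also be inspected locally. The sketch below rebuilds the five prediction strings shown there (leading space, json fence, body, closing fence) and applies the verbosity and validity checks in plain Python. The helper names `fenced` and `score` are hypothetical, and this does not reproduce the scores recorded in the LangSmith project.

```python
import json

# Build the fence marker programmatically so this snippet can sit inside a fenced block.
FENCE = "`" * 3
OPEN = FENCE + "json"


def fenced(payload: dict) -> str:
    """Rebuild a prediction string in the shape shown in the run output."""
    return " " + OPEN + "\n" + json.dumps(payload) + "\n" + FENCE


# The five predictions from the run output, in the order they were printed.
predictions = [
    fenced({"name": "Molly", "age": 2}),
    fenced({"name": "Bob", "age": 17}),
    fenced({"name": "Sam", "age": 34}),
    fenced({"name": "Ben", "age": 9}),
    fenced({"name": "Julie", "age": 13}),
]


def score(prediction: str) -> dict:
    """Apply the verbosity and validity checks to one prediction."""
    prefix, _, rest = prediction.partition(OPEN)
    body = rest.split(FENCE, 1)[0].strip()
    try:
        parsed = json.loads(body)
        valid = True
    except json.JSONDecodeError:
        parsed, valid = None, False
    return {"is_verbose": bool(prefix.strip()), "is_valid_json": valid, "parsed": parsed}


scores = [score(p) for p in predictions]
print(sum(s["is_verbose"] for s in scores), "verbose")
print(sum(s["is_valid_json"] for s in scores), "valid")
```

On these five strings the sketch reports no verbose responses and five valid JSON blocks; the leading space before the fence is whitespace only, so it does not count as verbosity.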