In [None]:
%pip install langchain langchain_openai pandas tqdm --upgrade --quiet

## Generate Ground Truth Data with GPT-4 API

Creating known labels, or ground truth data, can be time-consuming and expensive. You can use GPT-4 to _generate the ground truth data_ for you. This data is useful both for training your own models and for evaluating other models: you can use these evals to test whether open-source or smaller, faster, cheaper models perform as well as the larger, slower, more expensive ones.

In [None]:
import pandas as pd
from tqdm import tqdm
import requests
import io

# Dataset URL:
url = "https://storage.googleapis.com/oreilly-content/transaction_data_with_expanded_descriptions.csv"

# Download the file from the URL:
downloaded_file = requests.get(url)

# Load the transactions dataset and only look at 20 transactions:
df = pd.read_csv(io.StringIO(downloaded_file.text))[:20]
df.head()

In [None]:
df

In [None]:
from google.colab import userdata
import os
os.environ['OPENAI_API_KEY'] = userdata.get('OPENAI_API')

In [None]:
from langchain_openai import ChatOpenAI

# Define GPT-4 Model for ground truth generation
gpt4_model = ChatOpenAI(model="gpt-4")

## Part 1: Set Up the OpenAI GPT-4 Model for Ground Truth Generation

#### Defining a Structured Output Model

In [None]:
from langchain.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field
from typing import Union, Literal

# Define Pydantic model for transaction classification
class EnrichedTransactionInformation(BaseModel):
    transaction_type: Union[
        Literal["Purchase", "Withdrawal", "Deposit", "Bill Payment", "Refund"], None
    ]
    transaction_category: Union[
        Literal["Food", "Entertainment", "Transport", "Utilities", "Rent", "Other"],
        None,
    ]


To improve our model's accuracy, we'll define a more structured output model using Pydantic. This model will specify the exact values that our transaction types and categories can take, helping to constrain the model's outputs.

In this code block, we've defined a Pydantic model `EnrichedTransactionInformation` that specifies the exact values our `transaction_type` and `transaction_category` fields can take.

The use of `Union` and `Literal` types is crucial here. They serve to indicate the specific values these fields can take on, effectively constraining the model's output. Without these constraints, the model had only a 7.5% accuracy because it was free to generate arbitrary names for transaction types and categories.

By providing this structure:
1. We guide the model to choose from a predefined set of options.
2. We make it easier for the parser to validate the model's output.
3. We reduce the likelihood of the model generating creative but incorrect category names.


This approach should significantly improve our accuracy, as the model now has a clear, limited set of options to choose from, rather than an open-ended text generation task. It aligns the model's output more closely with our expected categorization scheme, which should lead to better matching with our reference data.
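To make the effect of the `Literal` constraint concrete, here is a minimal standard-library sketch of the membership check it implies. Pydantic performs this validation itself when the parser builds `EnrichedTransactionInformation`; the helper `is_allowed_type` is hypothetical and only illustrates the idea.

```python
from typing import Literal, Optional, get_args

# Same allowed values as the transaction_type field above.
TransactionType = Literal["Purchase", "Withdrawal", "Deposit", "Bill Payment", "Refund"]


def is_allowed_type(value: Optional[str]) -> bool:
    """Hypothetical helper: True if value is None or one of the allowed literals."""
    return value is None or value in get_args(TransactionType)


print(is_allowed_type("Refund"))           # an allowed value
print(is_allowed_type("Online Shopping"))  # a free-form label the model might invent
print(is_allowed_type(None))               # None is permitted by the Union
```

A free-form label such as "Online Shopping" fails the check, which is the kind of output the constraint is meant to rule out.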

In the next steps, we'll update our prompt to incorporate this new structure and re-run our analysis to see the improvement in accuracy.

In [None]:
# Create an output parser
parser = PydanticOutputParser(pydantic_object=EnrichedTransactionInformation)

# Define the prompt template
template = """
You are an expert financial assistant. Categorize the following transaction description.

Transaction: {transaction_description}

Provide the transaction_type and transaction_category.

{format_instructions}
"""

# Get format instructions
format_instructions = parser.get_format_instructions()

# Create the prompt
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["transaction_description"],
    template=template,
    partial_variables={"format_instructions": format_instructions}
)

In [None]:
prompt.pretty_print()

### Creating a Structured Prompt with Output Parser

Now that we have defined our `EnrichedTransactionInformation` model, we'll create a structured prompt using a `PydanticOutputParser`. This will help guide the language model to produce outputs that conform to our specified structure.

In this code block, we've set up a structured prompt that incorporates our `EnrichedTransactionInformation` model:

1. We created a `PydanticOutputParser` using our `EnrichedTransactionInformation` model. This parser will help ensure that the model's output adheres to our defined structure.

2. We defined a prompt template that includes placeholders for the transaction description and format instructions.

3. We obtained the format instructions from the parser. These instructions will guide the model on how to structure its output.

4. Finally, we created a `PromptTemplate` that combines our template with the format instructions.

This structured approach offers several benefits:

- It provides clear guidance to the language model on the expected output format.
- It steers the model's responses toward our predefined transaction types and categories; the parser rejects any value outside them.
- It simplifies the parsing of the model's output, as we can expect it to conform to our `EnrichedTransactionInformation` structure.

By using this structured prompt, we're likely to see a significant improvement in the accuracy of our transaction categorization. The model now has a clear framework for its responses, which should align much more closely with our expected categories and types.
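The way the template's two placeholders are filled can be sketched with plain `str.format`. This is only an illustration of the substitution, not LangChain's implementation, and the `format_instructions` text here is a made-up stand-in for what `parser.get_format_instructions()` returns.

```python
from functools import partial

template = """
You are an expert financial assistant. Categorize the following transaction description.

Transaction: {transaction_description}

Provide the transaction_type and transaction_category.

{format_instructions}
"""

# Made-up stand-in for parser.get_format_instructions().
format_instructions = "Return a JSON object with keys transaction_type and transaction_category."

# Fix format_instructions once, similar in spirit to partial_variables in PromptTemplate.
render_prompt = partial(template.format, format_instructions=format_instructions)

# Only the transaction description changes per call.
example_description = "Coffee shop payment by card"  # hypothetical description
rendered = render_prompt(transaction_description=example_description)
print(rendered)
```

Binding the format instructions up front means each call only has to supply the transaction description.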


In [None]:
# Create the chain
gpt4_chain = prompt | gpt4_model | parser

# Apply the chain to each transaction
transaction_types = []
transaction_categories = []

for _, row in df.iterrows():
    input_variables = {
        "transaction_description": row["Transaction Description"]
    }
    result = gpt4_chain.invoke(input_variables)
    transaction_types.append(result.transaction_type)
    transaction_categories.append(result.transaction_category)

# Add results to dataframe
df["transaction_type"] = transaction_types
df["transaction_category"] = transaction_categories
df.head()

In the next steps, we'll use this prompt to generate categorizations for our transactions and evaluate the improvement in accuracy.

## Applying GPT-4 Model with Structured Output

Now that we have our structured prompt and parser, we'll use them with the GPT-4 model to categorize our transactions. This approach should yield more accurate and consistent results compared to our previous attempt.

In this code block, we've applied our structured approach using the GPT-4 model:

1. We created a chain that combines our prompt, the GPT-4 model, and our parser. This chain will process each transaction description and output structured results.

2. We iterated through each row in our dataframe, using the transaction description as input for our chain.

3. For each transaction, we invoked the chain, which prompted GPT-4 with our structured prompt and parsed the output according to our `EnrichedTransactionInformation` model.

4. We collected the results (transaction types and categories) and added them as new columns to our dataframe.

5. Finally, we displayed the first few rows of the updated dataframe to verify the results.
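One practical caveat with the loop described above: if the model returns text that the parser cannot validate, `invoke` raises and the whole loop stops. A hedged sketch of a more defensive loop follows. `fake_chain_invoke` is a hypothetical stand-in for `gpt4_chain.invoke` so the example runs without an API key, and catching a bare `Exception` is a simplification.

```python
from types import SimpleNamespace


def fake_chain_invoke(inputs):
    """Hypothetical stand-in for gpt4_chain.invoke; fails on one input to simulate a parse error."""
    description = inputs["transaction_description"]
    if "garbled" in description:
        raise ValueError("could not parse model output")
    return SimpleNamespace(transaction_type="Purchase", transaction_category="Food")


descriptions = ["Lunch at a cafe", "garbled response case", "Grocery store"]  # made-up inputs

transaction_types = []
transaction_categories = []
for description in descriptions:
    try:
        result = fake_chain_invoke({"transaction_description": description})
        transaction_types.append(result.transaction_type)
        transaction_categories.append(result.transaction_category)
    except Exception as error:  # simplification: real code should catch the parser's specific error
        print(f"Skipping {description!r}: {error}")
        # Append placeholders so the lists stay aligned with the input rows.
        transaction_types.append(None)
        transaction_categories.append(None)

print(transaction_types)
```

Appending `None` on failure keeps the result lists the same length as the dataframe, so they can still be assigned as columns.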


This approach offers several advantages:

- The use of GPT-4, combined with our structured prompt and parser, should provide more accurate and consistent categorizations.
- Any output that parses successfully conforms to our predefined transaction types and categories; arbitrary or unexpected values cause a validation error instead of slipping through.
- We can directly compare these results with our previous attempts or with any reference data we might have.

In the next steps, we'll evaluate the accuracy of these new categorizations. We should expect to see a significant improvement compared to our earlier results, given the more structured approach and the use of the more capable GPT-4 model.

In [None]:
# Define GPT-3.5 Turbo Model for evaluation
gpt35_model = ChatOpenAI(model="gpt-3.5-turbo")

## Part 3: Set Up the OpenAI GPT-3.5 Turbo Model for Evaluation

## Part 4: Evaluate Model with GPT-3.5 Turbo

### Applying GPT-3.5-Turbo Model to Transactions

We will now use the GPT-3.5-Turbo model to categorize our transactions. This process involves creating a chain that combines our prompt, the GPT-3.5-Turbo model, and a parser. We'll then apply this chain to each transaction in our dataset and store the results.

In [None]:
# Create the chain
gpt35_chain = prompt | gpt35_model | parser

# Apply the chain to each transaction
transaction_types = []
transaction_categories = []

for _, row in df.iterrows():
    input_variables = {
        "transaction_description": row["Transaction Description"]
    }
    result = gpt35_chain.invoke(input_variables)
    transaction_types.append(result.transaction_type)
    transaction_categories.append(result.transaction_category)

#  Add results to dataframe
df["gpt35_transaction_type"] = transaction_types
df["gpt35_transaction_category"] = transaction_categories
df.head()

In this code block, we created a chain combining our prompt, the GPT-3.5-Turbo model, and a parser. We then applied this chain to each transaction in our dataset. For each transaction, we used its description as input and obtained the predicted transaction type and category.

The results were stored in two new columns in our dataframe: `gpt35_transaction_type` and `gpt35_transaction_category`. We displayed the first few rows of the updated dataframe to verify the results.

In [None]:
# Evaluate answers using LangChain evaluators
from langchain.evaluation import load_evaluator, EvaluatorType

evaluator = load_evaluator(EvaluatorType.EXACT_MATCH)


This process allows us to compare the categorization performance of GPT-3.5-Turbo with other models or reference data we might have.

### LangChain Evaluators

LangChain provides a set of evaluation tools to assess the performance of language models and chains. The `load_evaluator` function and `EvaluatorType` enum are key components of this evaluation framework.

#### load_evaluator Function

The `load_evaluator` function loads a specific evaluator based on the provided type. Its basic structure is `load_evaluator(evaluator, llm=None, **kwargs)`.

This function takes the following parameters:

- `evaluator`: The type of evaluator to load (an `EvaluatorType` enum value)
- `llm`: (Optional) A language model to use for evaluation -- e.g. to compare predicted vs desired outputs for similarity
- `**kwargs`: Additional keyword arguments specific to the evaluator

The function returns an instance of the requested evaluator, which can be used to perform evaluations.

#### EvaluatorType

`EvaluatorType` is an enumeration that defines various types of evaluators available in LangChain. Some of the key evaluator types include:

- `EXACT_MATCH`: Compares predictions to a reference answer using exact matching
- `QA`: Question answering evaluator that grades answers using an LLM
- `COT_QA`: Chain of thought question answering evaluator
- `CONTEXT_QA`: Question answering evaluator that incorporates context in the response
- `CRITERIA`: Evaluates a model based on custom criteria without reference labels
- `LABELED_CRITERIA`: Evaluates a model based on custom criteria with a reference label
- `STRING_DISTANCE`: Compares predictions to a reference using string edit distances
- `EMBEDDING_DISTANCE`: Compares predictions using embedding distances
- `JSON_VALIDITY`: Checks if a prediction is valid JSON
- `REGEX_MATCH`: Compares predictions to a reference using regular expressions

Each evaluator type is designed for specific evaluation tasks, allowing you to choose the most appropriate method for your use case.

In the example provided, `EvaluatorType.EXACT_MATCH` is used, which will create an evaluator that checks for exact matches between predictions and reference answers. This is particularly useful for assessing the accuracy of categorical predictions, such as transaction types and categories.

See https://api.python.langchain.com/en/latest/langchain/evaluation.html
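As an illustration of what an exact-match comparison amounts to, here is a plain-Python stand-in. It is not LangChain's implementation; the function `exact_match` is hypothetical and only mimics the shape of the result used later in this notebook, a dictionary with a `score` key.

```python
def exact_match(prediction, reference):
    """Hypothetical stand-in: score is 1 only when both values are identical."""
    return {"score": int(prediction == reference)}


print(exact_match("Food", "Food"))
print(exact_match("food", "Food"))       # differs only by case, still counts as a miss
print(exact_match("Transport", "Food"))
```

Because the comparison is strict, even a difference in capitalization counts as a miss, which is one reason constraining the model to a fixed set of labels matters for this metric.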

We will now evaluate the accuracy of our model's predictions for transaction types and categories using the exact match evaluator.

In [None]:
# Loop through the dataframe and evaluate the predictions
transaction_types = []
transaction_categories = []

for _, row in tqdm(df.iterrows(), total=len(df)):
    # Compare the predicted type against the reference type
    transaction_type_score = evaluator.evaluate_strings(
        prediction=row.gpt35_transaction_type,
        reference=row.transaction_type,
    )

    transaction_category_score = evaluator.evaluate_strings(
        prediction=row.gpt35_transaction_category,
        reference=row.transaction_category,
    )

    transaction_types.append(transaction_type_score)
    transaction_categories.append(transaction_category_score)

df["transaction_type_score"] = transaction_types
df["transaction_category_score"] = transaction_categories

In this cell, we iterated through each row of our dataframe, comparing the model's predictions (`gpt35_transaction_type` and `gpt35_transaction_category`) against the reference values (`transaction_type` and `transaction_category`) using the exact match evaluator. We then added two new columns to our dataframe, `transaction_type_score` and `transaction_category_score`, which contain the evaluation scores for each prediction. These scores indicate whether the model's predictions exactly match the reference values.

## Calculating Overall Accuracy

We will now calculate the overall accuracy of our model's predictions. This score will combine the accuracy of both transaction type and category predictions, giving us a single metric to evaluate our model's performance.

In [None]:
accuracy_score = 0

for transaction_type_score, transaction_category_score in zip(
    transaction_types, transaction_categories
):
    accuracy_score += transaction_type_score['score'] + transaction_category_score['score']

accuracy_score = accuracy_score / (len(transaction_types) * 2)
print(f"Accuracy score: {accuracy_score}")

The accuracy score we obtained is surprisingly low:

Accuracy score: 0.075

This score indicates that our model's predictions match the reference values only 7.5% of the time, which is far from ideal. Such a low accuracy suggests that there might be significant issues with our current approach. Here are some potential next steps to investigate and improve our model:

1. **Error Analysis**: Examine a sample of misclassified transactions to understand where the model is failing. Look for patterns in the errors.

2. **Data Quality Check**: Verify the quality and consistency of our reference data. Ensure that the 'transaction_type' and 'transaction_category' columns are correctly labeled.
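
A starting point for the error analysis in step 1 might look like the sketch below. It uses made-up rows in place of the dataframe and reports accuracy per field along with the most frequent (prediction, reference) mismatches. The key names mirror the columns used above, but the data is purely illustrative and `field_report` is a hypothetical helper.

```python
from collections import Counter

# Made-up rows standing in for the dataframe; not real results.
rows = [
    {"transaction_type": "Purchase", "gpt35_transaction_type": "Purchase",
     "transaction_category": "Food", "gpt35_transaction_category": "Food"},
    {"transaction_type": "Purchase", "gpt35_transaction_type": "Bill Payment",
     "transaction_category": "Utilities", "gpt35_transaction_category": "Other"},
    {"transaction_type": "Refund", "gpt35_transaction_type": "Refund",
     "transaction_category": "Other", "gpt35_transaction_category": "Entertainment"},
]


def field_report(rows, field):
    """Return (accuracy, Counter of (prediction, reference) mismatches) for one field."""
    mismatches = Counter()
    correct = 0
    for row in rows:
        prediction, reference = row[f"gpt35_{field}"], row[field]
        if prediction == reference:
            correct += 1
        else:
            mismatches[(prediction, reference)] += 1
    return correct / len(rows), mismatches


type_accuracy, type_mismatches = field_report(rows, "transaction_type")
category_accuracy, category_mismatches = field_report(rows, "transaction_category")
print(f"type accuracy: {type_accuracy:.2f}, mismatches: {type_mismatches.most_common(3)}")
print(f"category accuracy: {category_accuracy:.2f}, mismatches: {category_mismatches.most_common(3)}")
```

Reporting the two fields separately makes it easier to see whether errors are concentrated in types or in categories, which a single combined score hides.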