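# Using reasoning for data validation

This guide explores how to use the o1 models, specifically o1-preview, to perform data validation through reasoning. Working through a practical example with a synthetic medical dataset, we show how to evaluate the model's accuracy at identifying issues in the data.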

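## Overview

Data validation is a critical step in ensuring the quality and reliability of datasets, especially in sensitive fields such as healthcare. Traditional validation methods often rely on predefined rules and patterns. Advanced models like o1, however, can understand context and reason about the data, which offers a more flexible and intelligent approach to validation.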
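
To make the contrast concrete, here is a minimal sketch of a rule-based check. The function name `rule_based_allergy_check` and the dictionary keys are assumptions chosen for illustration and are not part of this guide's pipeline. An exact string match catches a Penicillin prescription for a Penicillin-allergic patient, but it misses Amoxicillin (a penicillin-class antibiotic) unless someone adds that rule by hand.

```python
def rule_based_allergy_check(row: dict) -> bool:
    """Return True if no listed medication exactly matches a listed allergy."""
    meds = {m.strip().lower() for m in row["Current Medications"].split(";")}
    allergies = {a.strip().lower() for a in row["Allergies"].split(";")}
    allergies.discard("none")  # "None" means the patient has no known allergies
    return meds.isdisjoint(allergies)

obvious = {"Current Medications": "Penicillin", "Allergies": "Penicillin"}
subtle = {"Current Medications": "Amoxicillin", "Allergies": "Penicillin"}

print(rule_based_allergy_check(obvious))  # False: flagged as invalid
print(rule_based_allergy_check(subtle))   # True: the exact-match rule misses this case
```

A reasoning model can, in principle, connect Amoxicillin to the Penicillin allergy without an explicit rule, which is the behavior this guide evaluates.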

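In this tutorial, we will:
- Generate a synthetic dataset of medical data that contains inconsistencies
- Define a function that takes a row of data and validates its accuracy
- Run the validation process and compute accuracy metrics
- Analyze and interpret the results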

In [3]:
from openai import OpenAI
import json
from IPython.display import display, HTML
from sklearn.metrics import precision_score, recall_score, f1_score
from concurrent.futures import ThreadPoolExecutor, as_completed
import csv
import pandas as pd

client = OpenAI()
MODEL = 'o1-preview'

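## Synthetic data generation

We will use many of the principles described in the [Synthetic Data Generation](https://cookbook.openai.com/examples/sdg1) cookbook to create the foundation of our dataset.

For our use case, we prompt the model to generate a set of medical data. We give the model detailed instructions on how to create the dataset, what format to follow, and how to introduce inaccuracies. We also provide a few rows of sample data to get it started.

Each row in the dataset has the following fields:
- Patient ID: A randomly generated patient ID
- Date of Birth: The patient's date of birth
- Gender: M/F
- Medical History: Past diagnoses
- Current Medications: Medications the patient is taking
- Allergies: Identified allergies
- Lab Results (Glucose mg/dL): Lab result for blood glucose, in mg/dL
- Diagnoses: Current diagnosis
- Treatment Plan: Current treatment plan
- Is Valid: Whether the current row of data is valid (True/False)
- Issue: If the row of data is not valid, what the issue is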
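
As a minimal sketch of this layout, the snippet below parses the header and one of the sample rows used in this guide's generation prompt with `csv.DictReader`, then separates the nine input fields from the two label columns. The variable names are illustrative only.

```python
import csv
import io

sample = (
    "Patient ID,Date of Birth,Gender,Medical History,Current Medications,"
    "Allergies,Lab Results (Glucose mg/dL),Diagnoses,Treatment Plan,Is Valid,Issue\n"
    "P004,2000-03-10,M,None,Amoxicillin,Penicillin,95,Infection,"
    "Prescribe Amoxicillin,False,Prescribed Amoxicillin despite Penicillin allergy\n"
)

# DictReader maps each value to its column name from the header row
record = next(csv.DictReader(io.StringIO(sample)))

label_columns = ("Is Valid", "Issue")
features = {k: v for k, v in record.items() if k not in label_columns}
is_valid = record["Is Valid"] == "True"

print(len(features))              # 9 input fields
print(is_valid, record["Issue"])  # label columns kept separately
```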

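Examples of inaccuracies that may be present in the data:
- Prescribing a medication that the patient is allergic to
- Current medications that do not match the medical history
- A treatment plan that does not match the diagnosis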

In [2]:
def generate_data():
    messages = [
        {
            "role": "user",
            "content": """
You are a helpful assistant designed to generate data. You will be given a format for the data to generate and some examples of the data.

When generating Patient IDs, use the format 'P' followed by a three-digit number (e.g., P006, P941, P319).

Intentionally make some mistakes in the data generation and document them in the appropriate columns ('Is Valid' and 'Issue') if the row of data is invalid.

The types of mistakes to include are:

- **Allergy Contradictions**: Prescribing a medication that the patient is allergic to (e.g., prescribing Penicillin to a patient allergic to Penicillin).
- **Medical History and Medication Mismatch**: A patient with a medical condition not receiving appropriate medication (e.g., a diabetic patient not prescribed any diabetes medication).
- **Lab Results and Diagnosis Mismatch**: Lab results that do not support the diagnosis (e.g., normal glucose levels but diagnosed with Diabetes Type 2).
- **Other Plausible Mistakes**: Any other realistic errors that could occur in medical records, such as incorrect gender entries, impossible dates of birth, or inconsistent treatment plans.

Ensure that when 'Is Valid' is 'False', the 'Issue' column clearly explains the problem.

Return 100 rows of data for the user. Your response should strictly be in the format of a valid CSV.

Generate Synthetic Medical Records Dataset with the following columns:
    - Patient ID: A randomly generated patient id
    - Date of Birth: Date of birth of the patient
    - Gender: M/F
    - Medical History: Past diagnoses
    - Current Medications: Medication the patient is taking
    - Allergies: Identified allergies
    - Lab Results (Glucose mg/dL)
    - Diagnoses: Current diagnosis
    - Treatment Plan: Current treatment plan
    - Is Valid: Whether or not the current row of data is valid (True/False)
    - Issue: If the row of data is not valid, what the issue is

Patient ID,Date of Birth,Gender,Medical History,Current Medications,Allergies,Lab Results (Glucose mg/dL),Diagnoses,Treatment Plan,Is Valid,Issue
P001,1980-05-14,M,Hypertension,Lisinopril,None,110,Hypertension,Continue Lisinopril,True,
P002,1975-11-30,F,Diabetes Type 2,Metformin,Penicillin,90,Diabetes Type 2,Continue Metformin,True,
P003,1990-07-22,F,Asthma,Albuterol,Aspirin,85,Asthma,Prescribe Albuterol,True,
P004,2000-03-10,M,None,Amoxicillin,Penicillin,95,Infection,Prescribe Amoxicillin,False,Prescribed Amoxicillin despite Penicillin allergy
P005,1985-09-18,F,Hyperlipidemia,Atorvastatin,None,200,Hyperlipidemia,Continue Atorvastatin,True,
P006,1978-12-05,M,Hypertension; Diabetes Type 2,Lisinopril; Insulin,None,55,Diabetes Type 2,Adjust insulin dosage,False,Low glucose level not properly addressed
            """
        }
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )

    return response.choices[0].message.content.replace('```csv', '').replace('```', '')

In [3]:
# Generate one batch of data using the generate_data function defined above
generated_data = []
data = generate_data()
generated_data.extend(data.strip().split('\n'))

# Append the generated data to the medicalData.csv file.
# csv.reader parses each line so quoted fields that contain commas stay intact.
with open('../data/medicalData.csv', 'a', newline='') as csvfile:
    csvwriter = csv.writer(csvfile)
    for row in csv.reader(generated_data):
        csvwriter.writerow(row)

print("Synthetic data generation and appending completed.")


Synthetic data generation and appending completed.


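## Data validation

Now that the dataset is ready, we prompt the reasoning model to review each row of data and determine whether it contains an issue. We ask the model to output whether the row has an issue and to provide an explanation of that issue.

Once the model has determined which rows are invalid, we pass its results to a model grader to assess two metrics:
- How accurately the model identifies that a row contains an issue
- For the subset of rows where an issue was correctly identified, how accurately the model pinpoints the specific issue

Because this second task is much narrower, the faster gpt-4o model could be used to compute the accuracy. Note that the `validate_issue` function later in this guide calls o1-preview.

Note: These models are still in beta, so their rate limits are significantly reduced. Adjust the number of concurrent workers accordingly.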
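
One way to adjust concurrency is sketched below, assuming a small fixed worker cap and a simple exponential backoff. `MAX_WORKERS`, `MAX_RETRIES`, `call_with_retry`, `run_bounded`, and the `flaky_double` stand-in are all hypothetical names introduced here, and the values are placeholders to tune against your own rate limits. The sketch retries on any exception to stay self-contained. In practice you would likely catch only the client's rate-limit error.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_WORKERS = 2   # assumed value; tune to your own rate limits
MAX_RETRIES = 3   # assumed value

def call_with_retry(fn, arg, retries=MAX_RETRIES, base_delay=0.01):
    """Call fn(arg), retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return fn(arg)
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)

def run_bounded(fn, items, max_workers=MAX_WORKERS):
    """Apply fn to every item with a capped thread pool, preserving input order."""
    results = [None] * len(items)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(call_with_retry, fn, item): i
                   for i, item in enumerate(items)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results

# Stand-in for an API call: fails once per item, then succeeds.
_seen = set()
def flaky_double(x):
    if x not in _seen:
        _seen.add(x)
        raise RuntimeError("simulated rate limit")
    return x * 2

print(run_bounded(flaky_double, [1, 2, 3]))  # [2, 4, 6]
```

In the validation loop, the same idea amounts to passing `max_workers` to `ThreadPoolExecutor` instead of relying on its default.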

In [4]:
def validate_data(input_data):
    messages = [
        {
            "role": "user",
            "content": f"""
You are a helpful assistant designed to validate the quality of medical datasets. You will be given a single row of medical data, and your task is to determine whether the data is valid.

- Carefully analyze the data for any inconsistencies, contradictions, missing values, or implausible information.
- Consider the logical relationships between different fields (e.g., treatments should be appropriate for the diagnoses, medications should not conflict with allergies, lab results should be consistent with diagnoses, etc.).
- Use your general medical knowledge to assess the validity of the data.
- Focus solely on the information provided without making assumptions beyond the given data.

**Return only a JSON object** with the following two properties:

- `"is_valid"`: a boolean (`true` or `false`) indicating whether the data is valid.
- `"issue"`: if `"is_valid"` is `false`, provide a brief explanation of the issue; if `"is_valid"` is `true`, set `"issue"` to `null`.

Both JSON properties must always be present.

Do not include any additional text or explanations outside the JSON object.

MEDICAL DATA:
{input_data}
            """
        }
    ]

    response = client.chat.completions.create(
        model=MODEL,
        messages=messages
    )

    response_content = response.choices[0].message.content.replace('```json', '').replace('```', '').strip()
    
    try:
        # The API returns the message content as a string, so parse it as JSON
        return json.loads(response_content)
    except json.JSONDecodeError:
        print(f"Failed to decode JSON response: {response_content}")
        raise
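
Removing code-fence markers works when the model wraps its JSON in a fence, but it fails if the model adds any other text around the object. A more defensive variant is sketched below, assuming the response contains exactly one JSON object. `extract_json_object` is a hypothetical helper, not part of the original notebook.

```python
import json
import re

def extract_json_object(text: str) -> dict:
    """Parse the span from the first '{' to the last '}' and check the expected keys."""
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        raise ValueError(f"No JSON object found in: {text!r}")
    parsed = json.loads(match.group(0))
    missing = {"is_valid", "issue"} - set(parsed)
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")
    return parsed

raw = 'Here is the result:\n```json\n{"is_valid": false, "issue": "Allergy conflict"}\n```'
print(extract_json_object(raw))
```

Checking for both keys up front turns a later `KeyError` inside the thread pool into an error that names the malformed response.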

In [5]:
# Read the CSV file and exclude the last two columns
input_data = []
with open('../data/medicalData.csv', 'r') as file:
    reader = csv.reader(file)
    headers = next(reader)
    for row in reader:
        input_data.append(row[:-2])  # Exclude "Is Valid" and "Issue" columns

# Initialize lists to store true labels
true_is_valid = []
true_issues = []

# Extract true labels from the CSV file
with open('../data/medicalData.csv', 'r') as file:
    reader = csv.reader(file)
    headers = next(reader)
    for row in reader:
        true_is_valid.append(row[-2] == 'True')
        true_issues.append(row[-1])

# Function to validate a single row of data
def validate_row(row):
    input_str = ','.join(row)
    result_json = validate_data(input_str)
    return result_json

# Validate data rows and collect results
pred_is_valid = [False] * len(input_data)
pred_issues = [''] * len(input_data)

with ThreadPoolExecutor() as executor:
    futures = {executor.submit(validate_row, row): i for i, row in enumerate(input_data)}
    
    for future in as_completed(futures):
        i = futures[future]  # Get the index of the current row
        result_json = future.result()
        pred_is_valid[i] = result_json['is_valid']
        pred_issues[i] = result_json['issue']

Now that we have the model's results, we can compare them against the ground-truth labels to determine the system's accuracy.

In [6]:
# Convert predicted and true 'is_valid' labels to boolean if they aren't already
pred_is_valid_bool = [bool(val) if isinstance(val, bool) else val == 'True' for val in pred_is_valid]
true_is_valid_bool = [bool(val) if isinstance(val, bool) else val == 'True' for val in true_is_valid]

# Calculate precision, recall, and f1 score for the 'is_valid' prediction
precision = precision_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)
recall = recall_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)
f1 = f1_score(true_is_valid_bool, pred_is_valid_bool, pos_label=True)

# Initialize issue_matches_full with False
issue_matches_full = [False] * len(true_is_valid)

In [7]:
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1: {f1:.2f}")

Precision: 0.82
Recall: 0.87
F1: 0.84
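
To show how these three numbers relate, the sketch below computes them from confusion-matrix counts. The counts TP=27, FP=6, FN=4 are an assumption: they are the smallest integers consistent with the unrounded scores printed at the end of this guide (0.818182, 0.870968, 0.84375), not values read from the notebook. Because the code uses `pos_label=True`, the positive class here is a *valid* row.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)   # share of predicted positives that are correct
    recall = tp / (tp + fn)      # share of actual positives that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Assumed counts, inferred from the reported scores
p, r, f1 = precision_recall_f1(tp=27, fp=6, fn=4)
print(f"Precision: {p:.2f}")  # 0.82
print(f"Recall: {r:.2f}")     # 0.87
print(f"F1: {f1:.2f}")        # 0.84
```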


## Issue identification

Next, we determine how accurately the model classifies the specific issue in the data.

In [8]:
def validate_issue(model_generated_answer, correct_answer):
    messages = [
        {
            "role": "user",
            "content": f"""
You are a medical expert assistant designed to validate the quality of an LLM-generated answer.

The model was asked to review a medical dataset row to determine if the data is valid. If the data is not valid, it should provide a justification explaining why.

Your task:

    •	Compare the model-generated justification with the correct reason provided.
    •	Determine if they address the same underlying medical issue or concern, even if phrased differently.
    •	Focus on the intent, medical concepts, and implications rather than exact wording.

Instructions:

    •	If the justifications have the same intent or address the same medical issue, return True.
    •	If they address different issues or concerns, return False.
    •	Only respond with a single word: True or False.

Examples:

    1.	Example 1:
    •	Model Generated Response: “The patient is allergic to penicillin”
    •	Correct Response: “The patient was prescribed penicillin despite being allergic”
    •	Answer: True
    2.	Example 2:
    •	Model Generated Response: “The date of birth of the patient is incorrect”
    •	Correct Response: “The patient was prescribed penicillin despite being allergic”
    •	Answer: False


Model Generated Response: {model_generated_answer}
Correct Response:  {correct_answer}
            """
        }
    ]

    response = client.chat.completions.create(
        model="o1-preview",
        messages=messages
    )

    # Strip surrounding whitespace so the later comparison with 'True' is not thrown off
    result = response.choices[0].message.content.strip()

    return result
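
The grader is asked to reply with a single word, and the code that consumes its reply compares it to the exact string `'True'`. A reply such as `True.` or `true` would then count as a mismatch. A minimal sketch of a more tolerant parser follows. `parse_verdict` is a hypothetical helper and is not used in the original notebook.

```python
def parse_verdict(text: str) -> bool:
    """Map a grader response such as ' True.' or 'false' to a boolean."""
    cleaned = text.strip().strip(".\"'").lower()
    if cleaned == "true":
        return True
    if cleaned == "false":
        return False
    # Anything else is surfaced instead of being silently counted as a mismatch
    raise ValueError(f"Unexpected verdict: {text!r}")

print(parse_verdict(" True.\n"))  # True
print(parse_verdict("false"))     # False
```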

In [9]:
# Validate issues for rows where both true and predicted 'is_valid' are False
validation_results = []

with ThreadPoolExecutor() as executor:
    futures = {
        executor.submit(validate_issue, pred_issues[i], true_issues[i]): i
        for i in range(len(pred_is_valid_bool))
        if not pred_is_valid_bool[i] and not true_is_valid_bool[i]
    }
    
    for future in as_completed(futures):
        i = futures[future]  # Get the original index
        issue_match = future.result()
        issue_matches_full[i] = (issue_match == 'True')
        validation_results.append({
            "index": i,
            "predicted_issue": pred_issues[i],
            "true_issue": true_issues[i],
            "issue_match": issue_matches_full[i]
        })
    
    # Calculate issue accuracy
    issue_accuracy = sum([i['issue_match'] for i in validation_results]) / len(validation_results)
    
    # Store the results in the dictionary
    model_results = {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "issue_accuracy": issue_accuracy
    }

# Create a DataFrame to store the results
df_results = pd.DataFrame([model_results])

# Create a DataFrame to store the validation results for each row
df_validation_results = pd.DataFrame(validation_results)

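Below, we display the subset of rows that the model correctly identified as containing an issue. For each row, we show the predicted issue, the true issue, and whether the two match.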

In [10]:
def display_formatted_dataframe(df):
    def format_text(text):
        return text.replace('\n', '<br>')

    df_formatted = df.copy()
    df_formatted['predicted_issue'] = df_formatted['predicted_issue'].apply(format_text)
    df_formatted['true_issue'] = df_formatted['true_issue'].apply(format_text)
    
    display(HTML(df_formatted.to_html(escape=False, justify='left')))
    
display_formatted_dataframe(pd.DataFrame(validation_results))

| | index | predicted_issue | true_issue | issue_match |
|---|---|---|---|---|
| 0 | 39 | Amoxicillin is prescribed to a patient with Penicillin allergy. | Prescribed Amoxicillin despite Penicillin allergy | True |
| 1 | 50 | Patient diagnosed with Type 1 Diabetes is not on any medications and the treatment field lists the diagnosis instead of appropriate treatment. | Diabetes Type 1 patient not receiving insulin | True |
| 2 | 51 | Lab result of 300 indicates hyperglycemia but no diagnosis or treatment is recorded. | Extremely high glucose level not diagnosed or treated | True |
| 3 | 26 | The patient is being prescribed penicillin despite having an allergy to penicillin. | Prescribed Penicillin despite Penicillin allergy | True |
| 4 | 31 | The patient's age (88) is inconsistent with the date of birth (1996-11-05). | Osteoporosis patient not receiving treatment | False |
| 5 | 24 | The 'Treatment Plan' field should not be 'Depression'; it should specify the treatment prescribed for depression. | Depression patient not receiving treatment | True |
| 6 | 3 | Patient is allergic to Penicillin but is prescribed Amoxicillin. | Prescribed Amoxicillin despite Penicillin allergy | True |
| 7 | 28 | The treatment field contains 'Asthma', which is a diagnosis, not a treatment. | Asthma patient not prescribed any medication | False |
| 8 | 7 | Patient with asthma and low lab result (100) is treated only with lifestyle modifications without medications, which is inappropriate. | Asthma patient not prescribed any medication | True |
| 9 | 16 | The patient's age (86) does not match the date of birth (1955-10-10). | COPD patient not receiving treatment | False |
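
Because `to_html(escape=False)` renders cell text as raw HTML, any markup in model-generated text would be interpreted by the browser. One way to handle this is sketched below, assuming it is acceptable to escape the text first and only then convert newlines to `<br>`. `format_text_safe` is a hypothetical replacement for `format_text`. It also tolerates `None`, which the validation prompt allows for the `issue` field.

```python
import html

def format_text_safe(text) -> str:
    """Escape HTML special characters, then turn newlines into <br> tags."""
    if text is None:
        return ""
    return html.escape(str(text)).replace("\n", "<br>")

print(format_text_safe("glucose < 70\nsee <b>note</b>"))
# glucose &lt; 70<br>see &lt;b&gt;note&lt;/b&gt;
```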


In [11]:
# Display the DataFrame
print(df_results)

   precision    recall       f1  issue_accuracy
0   0.818182  0.870968  0.84375        0.615385


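## Conclusion

These results show that the model achieves high precision and recall when deciding whether a row is valid (precision 0.82, recall 0.87, F1 0.84), along with reasonable accuracy when pinpointing the exact issue in the data (about 0.62).

This should help streamline data validation for evaluation sets across a variety of domains.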