# Mistral API: Information Extraction with JSON Mode

This notebook offers a straightforward comparison of the performance of models in the [Mistral family](https://docs.mistral.ai/getting-started/models/), specifically focusing on extracting information using JSON mode.

1. Import libraries
2. Introduction
3. Data
4. Model assessments

----
## 1. Import libraries

In [22]:
# file handling
import os

# json handling
import io
import json

# data manipulation and cleaning
import re
import pandas as pd

# API
from time import sleep
from helper import load_mistral_api_key
load_mistral_api_key()
from helper import mistral

## 2. Introduction

We assess the performance of models in the [Mistral family](https://docs.mistral.ai/getting-started/models/) on information extraction with JSON mode. Our focus is on how well these models extract structured information from unstructured text and present it in a consistent JSON format that follows a JSON schema.

In [2]:
models = [
    "open-mistral-7b"
    ,"open-mixtral-8x7b"
    ,"open-mixtral-8x22b"
    ,"mistral-small-latest"
    ,"mistral-medium-latest"
    ,"mistral-large-latest"
    ]

----
## 3. Data

### 3.1 Fictitious Medical Records

For this assessment, we've created a dataset of fictitious medical records. These records contain various types of information typically found in patient files, including personal details, diagnoses, and treatment plans.

In [39]:
data_medical_notes = pd.read_json(os.path.join("..","data","fictitious-medical-notes.json"))
data_medical_notes.info()
print(json.loads(data_medical_notes.iloc[0:6].to_json()))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 1 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   medical_notes  20 non-null     object
dtypes: object(1)
memory usage: 288.0+ bytes
{'medical_notes': {'0': {'note': "A 45-year-old female patient, Ms. Thompson, presented with complaints of persistent headaches, dizziness, and blurred vision. Upon examination, her blood pressure was found to be significantly elevated. Ms. Thompson's weight is 165 lbs and her height is 5'6. She was diagnosed with hypertension and prescribed Lisinopril to be taken once daily. It was noted that the patient is a non-smoker but leads a sedentary lifestyle."}, '1': {'note': "Mr. Rodriguez, a 52-year-old male, came in with symptoms of shortness of breath, chest pain, and fatigue during physical activity. After a thorough cardiac evaluation, he was diagnosed with coronary artery disease. Mr. Rodriguez weighs 195 lbs and 

### 3.2 JSON schema

We've defined a specific JSON schema for our information extraction task. This schema outlines the structure and expected data types for each field we want the models to extract. The schema includes fields such as:

- Patient ID
- Age
- Gender
- Height
- Weight
- BMI (to be calculated)
- Smoker status
- Diagnosis
- Treatment

In [4]:
json_schema = f"""
{{
        "patient_id": {{
            "type": "integer"
        }},
        "age": {{
            "type": "integer"
        }},
        "gender": {{
            "type": "string",
            "enum": ["male", "female", "other"]
        }},
        "weight": {{
            "type": "integer"
        }},
        "height": {{
            "type": "integer"
        }},
        "bmi": {{
            "type": "number"
        }},
        "smoker": {{
            "type": "string",
            "enum": ["yes", "no"]
        }},
        "diagnosis": {{
            "type": "string"
        }},
        "treatment": {{
            "type": "string"
        }}
}}
"""

This standardized schema allows us to consistently evaluate the models' performance across different records and ensures that the extracted information is structured uniformly for further analysis or integration into other systems.
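
A response can also be checked against this structure programmatically. The sketch below is a minimal, dependency-free check rather than a full JSON Schema validator. The names `EXPECTED_TYPES`, `ALLOWED_VALUES`, and `validate_record` are hypothetical and are not part of the notebook; accepting both integers and floats for `bmi` is an assumption about the intent of the schema.

```python
import json

# Hypothetical mirror of the schema above: field name -> accepted Python types.
EXPECTED_TYPES = {
    "patient_id": int,
    "age": int,
    "gender": str,
    "weight": int,
    "height": int,
    "bmi": (int, float),  # assumption: any JSON number is acceptable for BMI
    "smoker": str,
    "diagnosis": str,
    "treatment": str,
}

# Fields restricted to a fixed set of values (the "enum" entries in the schema).
ALLOWED_VALUES = {
    "gender": {"male", "female", "other"},
    "smoker": {"yes", "no"},
}


def validate_record(raw_response):
    """Return a list of problems found in one JSON response (empty if none)."""
    try:
        record = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value is not a JSON object"]

    problems = []
    for key in sorted(EXPECTED_TYPES.keys() - record.keys()):
        problems.append(f"missing key: {key}")
    for key in sorted(record.keys() - EXPECTED_TYPES.keys()):
        problems.append(f"unexpected key: {key}")

    for key, expected in EXPECTED_TYPES.items():
        if key not in record:
            continue
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(f"wrong type for {key}: {type(value).__name__}")
        elif key in ALLOWED_VALUES and value not in ALLOWED_VALUES[key]:
            problems.append(f"value not allowed for {key}: {value!r}")
    return problems
```

An empty list means the response has exactly the expected keys, types, and enumerated values. A value such as `"former"` for `smoker` would be reported, because it is outside the schema's `["yes", "no"]` enumeration.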

## 4. Model assessments

Our model assessment is divided into two main parts, each focusing on different aspects of the Mistral family models' performance in information extraction tasks.

In section 4.1, "A single string," we assess how the models handle extracting information from a single medical record, examining their output consistency and content accuracy against human-classified benchmarks. This section provides insights into the models' capabilities in structured information extraction from unstructured text.

Section 4.2, "Handling Datasets," addresses the challenges encountered when attempting to process larger datasets, highlighting technical limitations of the current API and suggesting future strategies to overcome these obstacles.

Together, these sections offer an overview of the models' strengths in single-record processing and their potential limitations in larger-scale data extraction tasks.

### 4.1 A single string

#### 4.1.1. Setup

**Single string data**

For our initial assessment, we used a single string of text representing one fictitious medical record. This string contains all the information outlined in our JSON schema, but in an unstructured format similar to a doctor's notes.

In [5]:
data_baseline_string = data_medical_notes["medical_notes"][0]["note"]
data_baseline_string

"A 45-year-old female patient, Ms. Thompson, presented with complaints of persistent headaches, dizziness, and blurred vision. Upon examination, her blood pressure was found to be significantly elevated. Ms. Thompson's weight is 165 lbs and her height is 5'6. She was diagnosed with hypertension and prescribed Lisinopril to be taken once daily. It was noted that the patient is a non-smoker but leads a sedentary lifestyle."

**Benchmark: Human-classified single string data**

To establish a baseline for comparison, we had human experts manually extract the information from our single string data according to the JSON schema. This human-classified data serves as our benchmark for assessing the models' performance.

In [6]:
# lbs to kg
baseline_weight = 165 / 2.205
# 5'6 = 66 inches to meters
baseline_height = 66 / 39.37
baseline_bmi_who = round(baseline_weight / (baseline_height ** 2), 2)
baseline_bmi_cdc = round(165 / (66**2) * 703, 2)

data_human_baseline = pd.DataFrame(data={
        "patient_id":1,
        "age":45,
        "gender": "female",
        "weight":165,
        "height":66,
        "bmi": baseline_bmi_cdc,
        "smoker": "no",
        "diagnosis": "hypertension",
        "treatment": "Lisinopril to be taken once daily"
        }
    ,index=[0]
    )
data_human_baseline

| | patient_id | age | gender | weight | height | bmi | smoker | diagnosis | treatment |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 45 | female | 165 | 66 | 26.63 | no | hypertension | Lisinopril to be taken once daily |


**Prompt**

We designed a prompt that instructs the models to extract information from the given text and format it according to our JSON schema. The prompt includes:

1. A context section outlining the required fields (patient_id, age, gender, weight, height, bmi, smoker, diagnosis, treatment)
2. The medical notes as a single string
3. Detailed instructions for information extraction and formatting
4. The JSON schema structure

This prompt is used consistently across all models to ensure a fair comparison. 

In [7]:
prompt = f"""
sections are delimited by XML tags
<context>
1. patient_id: unique identifier
2. age
3. gender
4. weight
5. height
6. bmi
7. smoker
8. diagnosis
9. treatment
</context>
<medical_notes>
{data_baseline_string}
</medical_notes>
<instructions>
Instruction 1: Extract or infer information in <context> from <medical_notes>.
Instruction 2: if `patient_id` is not available, assign an integer as unique identifier.
Instruction 3: if `bmi` is not available, calculate it.
Instruction 4: Use "Not provided" if it is not possible to infer information.
Instruction 5: Return the extracted information in json format following this JSON schema: {json_schema}
Instruction 6: Do not return references to <context> or to <instructions> in the output.
Instruction 7: Do not return any other key-value pair that is not included in the JSON schema.
</instructions>
"""

#### 4.1.2. Performance Assessment

We examine two key aspects: output consistency and output content. Output consistency measures how well the returned output conforms to the required JSON schema. Output content compares the extracted values with the human-classified benchmark to assess accuracy.

##### 4.1.2.1. Output Consistency Analysis 

To assess output consistency, we queried each model five times and visually inspected the results.

Most models adhered perfectly to the JSON schema across all five iterations. However, two models showed notable inconsistencies. The `open-mixtral-8x7b` model adhered in only two out of five iterations, while the `mistral-medium-latest` model adhered in just one iteration, and even that output required cleaning because of stray backticks around keys.

In [8]:
list_baseline = []
for model in models:
    for _ in range(5):
        response_baseline = mistral(prompt, model=model, is_json=True)
        list_baseline.extend([model, response_baseline])
        sleep(1.1)
list_baseline

['open-mistral-7b',
 '{"patient_id": 1, "age": 45, "gender": "female", "weight": 165, "height": 66, "bmi": 28.23, "smoker": "no", "diagnosis": "hypertension", "treatment": "Lisinopril once daily"}',
 'open-mistral-7b',
 '{"patient_id": 1, "age": 45, "gender": "female", "weight": 165, "height": 66, "bmi": 27.8, "smoker": "no", "diagnosis": "hypertension", "treatment": "Lisinopril"}',
 'open-mistral-7b',
 '{"patient_id": 1, "age": 45, "gender": "female", "weight": 165, "height": 66, "bmi": 28.23, "smoker": "no", "diagnosis": "hypertension", "treatment": "Lisinopril once daily"}',
 'open-mistral-7b',
 '{"patient_id": 1, "age": 45, "gender": "female", "weight": 165, "height": 66, "bmi": 27.5, "smoker": "no", "diagnosis": "hypertension", "treatment": "Lisinopril once daily"}',
 'open-mistral-7b',
 '{"patient_id": 1, "age": 45, "gender": "female", "weight": 165, "height": 66, "bmi": 28.23, "smoker": "no", "diagnosis": "hypertension", "treatment": "Lisinopril once daily"}',
 'open-mixtral-8x7

Given these inconsistencies, we excluded these two models from further assessment to maintain data quality.

In [9]:
models.remove("open-mixtral-8x7b")
models.remove("mistral-medium-latest")
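
As a complement to visual inspection, schema adherence can be tallied programmatically. The sketch below assumes a flat list that alternates model name and raw response, which is how `list_baseline` is built above. `EXPECTED_KEYS` and `adherence_by_model` are hypothetical names, and "adherent" is defined here, as an assumption, as "parses as a JSON object with exactly the nine expected keys".

```python
import json
from collections import defaultdict

# The nine keys required by the schema.
EXPECTED_KEYS = {
    "patient_id", "age", "gender", "weight", "height",
    "bmi", "smoker", "diagnosis", "treatment",
}


def adherence_by_model(flat_results, expected_keys=EXPECTED_KEYS):
    """Return {model: (adherent_count, total_count)}.

    `flat_results` alternates model name and raw response string:
    [model, response, model, response, ...].
    """
    counts = defaultdict(lambda: [0, 0])  # model -> [adherent, total]
    for model, raw in zip(flat_results[0::2], flat_results[1::2]):
        counts[model][1] += 1
        try:
            parsed = json.loads(raw)
        except (TypeError, json.JSONDecodeError):
            continue  # unparseable output counts as non-adherent
        if isinstance(parsed, dict) and set(parsed) == expected_keys:
            counts[model][0] += 1
    return {model: tuple(pair) for model, pair in counts.items()}
```

Calling `adherence_by_model(list_baseline)` would give one `(adherent, total)` pair per model, which makes a threshold for excluding models explicit. Keys wrapped in backticks count as non-adherent under this definition.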

##### 4.1.2.2. Output content analysis

Our assessment shows that all models demonstrate consistency in extracting both quantitative and qualitative information from the given data.

However, we observe a common challenge across all models: the calculation of Body Mass Index (BMI). We intentionally did not provide the BMI formula, expecting the models to infer it. This approach led to inconsistencies in BMI calculations across all models.

These findings suggest that while the models perform well in direct information extraction, they may struggle with tasks requiring the application of inferred knowledge or formulas.

In [13]:
data_baseline = pd.DataFrame()
for model in models:
    for i in range(5):
        response_baseline = mistral(prompt, model=model, is_json=True)
        data_response_baseline = pd.DataFrame(json.loads(response_baseline), index=[0])
        data_response_baseline.columns = [re.sub(r'[^\x00-\x7F]+','', col) for col in data_response_baseline.columns]
        data_response_baseline.columns = [re.sub(r'`','', col) for col in data_response_baseline.columns]
        data_response_baseline["model"] = model
        data_response_baseline["iteration"] = i
        data_baseline = pd.concat([data_baseline, data_response_baseline])
        sleep(1.1)
data_baseline.reset_index(drop=True, inplace=True)

In [12]:
data_baseline.info()
data_baseline.head(20)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20 entries, 0 to 19
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   patient_id  20 non-null     int64  
 1   age         20 non-null     int64  
 2   gender      20 non-null     object 
 3   weight      20 non-null     int64  
 4   height      20 non-null     int64  
 5   bmi         20 non-null     float64
 6   smoker      20 non-null     object 
 7   diagnosis   20 non-null     object 
 8   treatment   20 non-null     object 
 9   model       20 non-null     object 
 10  iteration   20 non-null     int64  
dtypes: float64(1), int64(5), object(5)
memory usage: 1.8+ KB


| | patient_id | age | gender | weight | height | bmi | smoker | diagnosis | treatment | model | iteration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 45 | female | 165 | 66 | 27.8 | no | hypertension | Lisinopril once daily | open-mistral-7b | 0 |
| 1 | 1 | 45 | female | 165 | 66 | 29.31 | no | hypertension | Lisinopril once daily | open-mistral-7b | 1 |
| 2 | 1 | 45 | female | 165 | 66 | 28.23 | no | hypertension | Lisinopril once daily | open-mistral-7b | 2 |
| 3 | 1 | 45 | female | 165 | 66 | 29.23 | no | hypertension | Lisinopril once daily | open-mistral-7b | 3 |
| 4 | 1 | 45 | female | 165 | 66 | 28.8 | no | hypertension | Lisinopril once daily | open-mistral-7b | 4 |
| 5 | 1 | 45 | female | 165 | 66 | 26.05 | no | hypertension | Lisinopril to be taken once daily | open-mixtral-8x22b | 0 |
| 6 | 1 | 45 | female | 165 | 66 | 26.42 | no | hypertension | Lisinopril once daily | open-mixtral-8x22b | 1 |
| 7 | 1 | 45 | female | 165 | 66 | 26.03 | no | hypertension | Lisinopril once daily | open-mixtral-8x22b | 2 |
| 8 | 1 | 45 | female | 165 | 66 | 26.22 | no | hypertension | Lisinopril | open-mixtral-8x22b | 3 |
| 9 | 1 | 45 | female | 165 | 66 | 26.22 | no | hypertension | Lisinopril to be taken once daily | open-mixtral-8x22b | 4 |


**Consistency**

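Our analysis reveals a high level of consistency across repeated runs. Three of the four models (`mistral-large-latest`, `mistral-small-latest`, and `open-mistral-7b`) return identical values in 8 out of 9 columns (approximately 89%), while `open-mixtral-8x22b` is consistent in 7 out of 9 columns (approximately 78%). The columns that vary between runs are `bmi` and `treatment`.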

In [11]:
data_baseline_consistency = data_baseline.drop(columns=["iteration"]).groupby(["model"]).nunique().eq(1)
data_baseline_consistency["prcnt"] = round(
    (data_baseline_consistency.sum(axis=1) / data_baseline_consistency.shape[1]) * 100, 2)
data_baseline_consistency

| model | patient_id | age | gender | weight | height | bmi | smoker | diagnosis | treatment | prcnt |
|---|---|---|---|---|---|---|---|---|---|---|
| mistral-large-latest | True | True | True | True | True | True | True | True | False | 88.89 |
| mistral-small-latest | True | True | True | True | True | False | True | True | True | 88.89 |
| open-mistral-7b | True | True | True | True | True | False | True | True | True | 88.89 |
| open-mixtral-8x22b | True | True | True | True | True | False | True | True | False | 77.78 |


**Human benchmarks**

Comparing the model outputs with the human benchmark shows that the models extracted most fields successfully; differences appear only in the `bmi` and `treatment` columns.

For the string column "treatment", minor variations were observed, such as "Lisinopril to be taken once daily" versus "Lisinopril once daily". Despite these slight differences, the overall extraction appeared successful.
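
Such near-matches can be scored instead of being treated as plain mismatches. The sketch below uses `difflib.SequenceMatcher` from the standard library; the function names and the `0.7` threshold are illustrative assumptions, not values used in this assessment.

```python
from difflib import SequenceMatcher


def treatment_similarity(human, model):
    """Return a 0-1 similarity ratio, ignoring case and extra whitespace."""
    a = " ".join(human.lower().split())
    b = " ".join(model.lower().split())
    return SequenceMatcher(None, a, b).ratio()


def same_treatment(human, model, threshold=0.7):
    """Treat two treatment strings as equivalent above an assumed threshold."""
    return treatment_similarity(human, model) >= threshold
```

With this threshold, "Lisinopril to be taken once daily" and "Lisinopril once daily" count as the same treatment. The threshold would need tuning on real data, since a purely character-based score cannot tell whether an omitted detail (such as dosing frequency) matters clinically.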

However, a notable shortcoming was observed across all models: they failed to accurately calculate BMI using the [imperial-based CDC formula](https://www.cdc.gov/nccdphp/dnpao/growthcharts/training/bmiage/page5_2.html). This issue persisted even when using the metric-based WHO formula, although those results were not reported in detail.
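
Because BMI is a deterministic function of weight and height, one option is to recompute it after extraction rather than rely on the model's arithmetic. The sketch below implements both formulas; the function and constant names are hypothetical, and the inputs are assumed to be in the units stated in each docstring.

```python
LB_TO_KG = 0.45359237  # kilograms per pound (exact by definition)
IN_TO_M = 0.0254       # meters per inch (exact by definition)


def bmi_cdc(weight_lb, height_in):
    """Imperial formula: weight (lb) / height (in)^2 * 703."""
    return weight_lb / height_in ** 2 * 703


def bmi_who(weight_kg, height_m):
    """Metric formula: weight (kg) / height (m)^2."""
    return weight_kg / height_m ** 2


def bmi_from_imperial_record(record):
    """Recompute BMI for a record with `weight` in lb and `height` in inches."""
    return round(bmi_cdc(record["weight"], record["height"]), 2)
```

For the benchmark record (165 lb, 66 in), both `bmi_cdc(165, 66)` and `bmi_who(165 * LB_TO_KG, 66 * IN_TO_M)` round to 26.63, which matches the human baseline above.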

In [None]:
data_baseline_compared = pd.DataFrame()
for i in range(len(data_baseline)):
    data_baseline_current = data_baseline.iloc[i:i+1]
    iteration = data_baseline_current["iteration"].unique().tolist()
    model = data_baseline_current["model"].unique().tolist()
    data_baseline_current = data_baseline_current.drop(
        columns=["model","iteration"]).reset_index(drop=True).copy()
    data_baseline_current.index = data_human_baseline.index
    data_compared = data_human_baseline.compare(data_baseline_current, result_names=("human", "model"))
    data_compared.insert(0, "model", model)
    data_compared.insert(1, "iteration", iteration)
    data_baseline_compared = pd.concat([data_baseline_compared,data_compared], ignore_index=True)
data_baseline_compared.head(20)

| | model | iteration | bmi (human) | bmi (model) | treatment (human) | treatment (model) |
|---|---|---|---|---|---|---|
| 0 | open-mistral-7b | 0 | 26.63 | 28.4 | Lisinopril to be taken once daily | Lisinopril once daily |
| 1 | open-mistral-7b | 1 | 26.63 | 27.8 | Lisinopril to be taken once daily | Lisinopril once daily |
| 2 | open-mistral-7b | 2 | 26.63 | 26.9 | Lisinopril to be taken once daily | Lisinopril once daily |
| 3 | open-mistral-7b | 3 | 26.63 | 26.4 | Lisinopril to be taken once daily | Lisinopril once daily |
| 4 | open-mistral-7b | 4 | 26.63 | 27.5 | Lisinopril to be taken once daily | Lisinopril once daily |
| 5 | open-mixtral-8x22b | 0 | 26.63 | 26.1 | Lisinopril to be taken once daily | Lisinopril once daily |
| 6 | open-mixtral-8x22b | 1 | 26.63 | 25.98 | Lisinopril to be taken once daily | Lisinopril once daily |
| 7 | open-mixtral-8x22b | 2 | 26.63 | 26.18 | Lisinopril to be taken once daily | Lisinopril once daily |
| 8 | open-mixtral-8x22b | 3 | 26.63 | 25.98 | Lisinopril to be taken once daily | Lisinopril once daily |
| 9 | open-mixtral-8x22b | 4 | 26.63 | 26.01 | | |


### 4.2. Handling Datasets

Our intention was to extend our assessment to larger datasets containing multiple medical records. However, we encountered significant technical challenges when attempting to process these datasets through the API.

In [64]:
json_medical_notes = json.loads(data_medical_notes["medical_notes"].iloc[0:6].to_json())
json_medical_notes

{'0': {'note': "A 45-year-old female patient, Ms. Thompson, presented with complaints of persistent headaches, dizziness, and blurred vision. Upon examination, her blood pressure was found to be significantly elevated. Ms. Thompson's weight is 165 lbs and her height is 5'6. She was diagnosed with hypertension and prescribed Lisinopril to be taken once daily. It was noted that the patient is a non-smoker but leads a sedentary lifestyle."},
 '1': {'note': "Mr. Rodriguez, a 52-year-old male, came in with symptoms of shortness of breath, chest pain, and fatigue during physical activity. After a thorough cardiac evaluation, he was diagnosed with coronary artery disease. Mr. Rodriguez weighs 195 lbs and stands at 5'10. He is a former smoker who quit 5 years ago. He was prescribed Aspirin and a statin, and referred to a cardiologist for further management."},
 '2': {'note': "A 38-year-old female patient, Ms. Chen, presented with complaints of joint pain, stiffness, and swelling in her hands a

**Prompt**

We adjust the aforementioned prompt as follows:
- We replace the single string with a subset of 6 rows from the medical data, because passing all 20 entries generally yields an error. We discuss this below.
- We scope the main instruction to "each entry in medical_notes".
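
One detail worth noting when embedding a dictionary in a prompt: interpolating a Python `dict` into an f-string inserts its `repr`, which uses Python quoting rather than JSON. As a hedged alternative (not what this notebook does), the subset could be serialized with `json.dumps` first. `notes_to_prompt_block` is a hypothetical helper name.

```python
import json


def notes_to_prompt_block(notes):
    """Serialize the notes as JSON text suitable for embedding in a prompt."""
    # ensure_ascii=False keeps any non-ASCII characters readable in the prompt.
    return json.dumps(notes, ensure_ascii=False, indent=2)
```

The result round-trips through `json.loads`, whereas `str(notes)` generally does not, because Python renders dictionary keys and most strings with single quotes. Whether this changes model behavior was not tested here.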

In [65]:
prompt = f"""
sections are delimited by XML tags
<context>
1. patient_id: unique identifier
2. age
3. gender
4. weight
5. height
6. bmi
7. smoker
8. diagnosis
9. treatment
</context>
<medical_notes>
{json_medical_notes}
</medical_notes>
<instructions>
Instruction 1: Extract or infer information in <context> for each entry in <medical_notes>.
Instruction 2: if `patient_id` is not available, assign an integer as unique identifier.
Instruction 3: if `bmi` is not available, calculate it.
Instruction 4: Use "Not provided" if it is not possible to infer information.
Instruction 5: Return the extracted information in json format following this JSON schema: {json_schema}
Instruction 6: Do not return references to <context> or to <instructions> in the output.
Instruction 7: Do not return any other key-value pair that is not included in the JSON schema.
</instructions>
"""

In [66]:
data_datasets = pd.DataFrame()
counter_loop = 0
while counter_loop < 5:
    try:
        # collect results per attempt so that a retry does not duplicate rows
        frames = []
        for model in models:
            response_datasets = mistral(prompt, model=model, is_json=True)
            try:
                # response keyed by entry, e.g. {"0": {...}, "1": {...}}
                data_response_datasets = pd.read_json(io.StringIO(response_datasets), orient="index")
            except Exception:
                # fall back to pandas' default orientation
                data_response_datasets = pd.read_json(io.StringIO(response_datasets))
            # strip non-ASCII characters and stray backticks from column names
            data_response_datasets.columns = [re.sub(r'[^\x00-\x7F]+','', col) for col in data_response_datasets.columns]
            data_response_datasets.columns = [re.sub(r'`','', col) for col in data_response_datasets.columns]
            data_response_datasets["model"] = model
            frames.append(data_response_datasets)
            sleep(1.2)
        data_datasets = pd.concat(frames, ignore_index=True)
        break
    except Exception as e:
        # report the failure, wait, and retry (at most 5 attempts)
        print(f"An error occurred: {e.__class__.__name__} - {e}")
        print("Model in loop:", model)
        print("Iteration:", counter_loop)
        counter_loop += 1
        sleep(1.2)

An error occurred: MistralException - Unexpected exception (ReadTimeout): The read operation timed out
Model in loop: open-mistral-7b
Iteration: 0
An error occurred: MistralException - Unexpected exception (ReadTimeout): The read operation timed out
Model in loop: open-mistral-7b
Iteration: 1


In [68]:
data_datasets.info()
data_datasets.head(24)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24 entries, 0 to 23
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   patient_id  24 non-null     int64  
 1   age         24 non-null     int64  
 2   gender      24 non-null     object 
 3   weight      24 non-null     int64  
 4   height      24 non-null     int64  
 5   bmi         24 non-null     float64
 6   smoker      24 non-null     object 
 7   diagnosis   24 non-null     object 
 8   treatment   24 non-null     object 
 9   model       24 non-null     object 
dtypes: float64(1), int64(4), object(5)
memory usage: 2.0+ KB


| | patient_id | age | gender | weight | height | bmi | smoker | diagnosis | treatment | model |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 45 | female | 165 | 60 | 26.7 | no | hypertension | Lisinopril | open-mistral-7b |
| 1 | 2 | 52 | male | 195 | 60 | 30.6 | former | coronary artery disease | Aspirin and statin | open-mistral-7b |
| 2 | 3 | 38 | female | 130 | 54 | 23.8 | no | rheumatoid arthritis | Methotrexate | open-mistral-7b |
| 3 | 4 | 70 | male | 180 | 60 | 27.3 | no | early-stage Alzheimer's disease | Donepezil | open-mistral-7b |
| 4 | 5 | 28 | female | 110 | 52 | 20.4 | no | hyperthyroidism | Methimazole | open-mistral-7b |
| 5 | 6 | 55 | male | 220 | 60 | 32.8 | yes | Not provided | physical therapy and NSAIDs, advised to quit s... | open-mistral-7b |
| 6 | 0 | 45 | female | 165 | 66 | 26.31 | no | hypertension | Lisinopril | open-mixtral-8x22b |
| 7 | 1 | 52 | male | 195 | 70 | 28.08 | former | coronary artery disease | Aspirin and a statin | open-mixtral-8x22b |
| 8 | 2 | 38 | female | 130 | 64 | 21.47 | no | rheumatoid arthritis | Methotrexate | open-mixtral-8x22b |
| 9 | 3 | 70 | male | 180 | 72 | 24.22 | no | Alzheimer's disease | Donepezil | open-mixtral-8x22b |


#### 4.2.1. Technical Limitations

We consistently encountered the following error when passing JSON datasets to the API: `An error occurred: MistralException - Unexpected exception (ReadTimeout)`

This error suggests that a read timeout is exceeded when a single request contains multiple records. Requests with all 20 entries generally failed, so our assessment on multiple records is limited to a 6-entry subset, and even that subset timed out on some attempts.

Moreover, it's worth noting that we are using the API through Mistral's free trial, which may have stricter rate limits or timeout thresholds compared to paid tiers, potentially contributing to the challenges we faced in processing datasets.
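
When failures are intermittent, as in the log above where two attempts timed out before one succeeded, retrying with an increasing delay is a common mitigation. The sketch below is generic: `call_with_backoff` is a hypothetical helper, and catching every `Exception` is a simplifying assumption; in practice it would be narrowed to the client's timeout error.

```python
import time


def call_with_backoff(call, max_attempts=5, base_delay=1.2, sleep_fn=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each failure.

    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:  # assumption: every failure is retryable
            if attempt == max_attempts - 1:
                raise
            delay = base_delay * 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc.__class__.__name__}); "
                  f"retrying in {delay:.1f}s")
            sleep_fn(delay)
```

In this notebook it could wrap a single request, for example `call_with_backoff(lambda: mistral(prompt, model=model, is_json=True))`. Backoff helps with transient timeouts and rate limits; it would not help if a request is simply too large to complete within the timeout.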

#### 4.2.2. Implications and Future Work

The inability to process large datasets directly through the API has the following implications for our assessment:

1. **Scale limitations:** Our current analysis is limited to a small subset (n=6), which may not fully represent the models' capabilities in handling larger datasets.

To address these limitations, we will continue to explore alternative batch processing strategies that stay within the API's time constraints.
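
One such strategy is to split the records into small batches and send one request per batch. The sketch below shows only the batching logic. `chunked` and `extract_in_batches` are hypothetical helpers, and `extract` stands for any callable that takes a list of notes and returns a list of extracted records (for instance, a function that builds the prompt, calls the API, and parses the JSON). The batch size of 2 is an arbitrary assumption.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def extract_in_batches(notes, extract, batch_size=2):
    """Run `extract` on each batch and concatenate the returned records."""
    records = []
    for batch in chunked(notes, batch_size):
        records.extend(extract(batch))
    return records
```

Smaller batches keep each request short at the cost of more API calls, so the batch size would need to be balanced against rate limits.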

Work in progress...