# General Structured Data Field Extraction
#### Extract structured fields from unstructured input into a Python dictionary


## Simple Minimal Data-Structured Task with Mistral API
This is a Colab notebook for testing data extraction with a Mistral cloud model through the Mistral cloud API. It can run on any online device with a web browser (phone, tablet, laptop, desktop, etc.). It uses the hosted Mistral API, not Mistral models run locally in a local pipeline (e.g. .gguf files with llama.cpp).

## Steps:
1. Configure the API key: https://console.mistral.ai/api-keys/
2. Select a model: https://docs.mistral.ai/getting-started/models/models_overview/
3. Set the "parameters" (or leave the defaults): https://docs.mistral.ai/capabilities/completion/sampling
4. Define what fields you want in a 'dictionary'
5. Describe those fields (e.g. their data types, ranges, etc.)
6. Configure the extraction task in the "Prompt"
7. Run all cells (use the "Run all" button above)
8. (Optional) Download the session log. Use the folder icon on the left to see the session log files; right-click a file and choose download.




### Note
Key parts of the structured-data extraction process include:
1. forcing a structure, pattern, or delimiter that can be detected in the output
2. using the pattern or marker to extract the data
3. retrying in case of failure
4. retrying to check consistency and self-agreement (outlier check, sanity check, etc.)
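
Item 4 is not implemented in the code further down. As a minimal sketch of what a self-agreement check could look like, assuming several extraction runs have already produced dicts with the same keys, a per-field majority vote can be taken. The name `majority_vote_fields` and the `min_agreement` threshold are hypothetical and introduced only for illustration.

```python
from collections import Counter


def majority_vote_fields(result_dicts, min_agreement=0.5):
    """Return (agreed, disputed) for a list of extraction results.

    agreed:   field -> value chosen by more than min_agreement of the runs
    disputed: field -> list of every value seen, when no value wins
    """
    agreed, disputed = {}, {}
    if not result_dicts:
        return agreed, disputed

    for field in result_dicts[0]:
        # repr() makes unhashable values (lists, dicts) countable
        votes = Counter(repr(d.get(field)) for d in result_dicts)
        top_repr, top_count = votes.most_common(1)[0]
        if top_count / len(result_dicts) > min_agreement:
            # recover the original value that produced the winning repr
            agreed[field] = next(
                d.get(field) for d in result_dicts
                if repr(d.get(field)) == top_repr
            )
        else:
            disputed[field] = [d.get(field) for d in result_dicts]
    return agreed, disputed


# hypothetical results from three runs on the same input
runs = [
    {"species": "Siamese cat", "wearing": "shoes"},
    {"species": "Siamese cat", "wearing": "boots"},
    {"species": "Siamese cat", "wearing": "hat"},
]
agreed, disputed = majority_vote_fields(runs)
print("agreed:", agreed)
print("disputed:", disputed)
```

Fields that land in `disputed` are candidates for another retry or for manual review.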

Note: Escape-characters can become a problem with some formats (markdown-json is not always a viable format).

These behaviors can vary significantly between models and model sizes.

# Set Raw Input

In [None]:
raw_text_blob = """
Carl is a Siamese cat. He eats blue milk and wears shoes.
"""

# Define Output Structure (Python Dictionary)
e.g.
```python
target_schema_model_dict = {
    "animal_or_not": None,
    "favorite_food": None,
    "species": None,
    "wearing": None,
}
```

In [None]:
# this gets added to prompt
target_schema_model_dict = {
    "animal_or_not":  False, # boolean
    "probability_is_animal": 0.0, # float, confidence level

    "favorite_food":  "", # string
    "probability_of_food": 0.0, # float, confidence level

    "species":  "", # string
    "wearing":  "", # string

    "noise_level_in_input": 0, # int, 0-10: 10 is all noise, 0 is clean
    "description_comment": "", # string
}
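
The default values in the schema double as type hints (`False` for boolean, `0.0` for float, and so on). The extraction code later in this notebook compares only the keys, so the sketch below shows one possible way to also compare value types against those defaults. `find_type_mismatches` is a hypothetical helper, not part of the notebook's pipeline.

```python
def find_type_mismatches(candidate, schema):
    """Return a list of (key, expected_type_name, actual_type_name)."""
    mismatches = []
    for key, default in schema.items():
        if key not in candidate:
            mismatches.append((key, type(default).__name__, "missing"))
            continue
        value = candidate[key]
        expected = type(default)
        if expected is float:
            # JSON has one number type, so accept ints for float fields,
            # but not bools (bool is a subclass of int in Python)
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        elif expected is int:
            ok = isinstance(value, int) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            mismatches.append((key, expected.__name__, type(value).__name__))
    return mismatches


# small illustrative schema and reply (not the notebook's variables)
example_schema = {
    "animal_or_not": False,
    "probability_is_animal": 0.0,
    "species": "",
    "noise_level_in_input": 0,
}
example_reply = {
    "animal_or_not": "yes",
    "probability_is_animal": 1,
    "species": "Siamese cat",
    "noise_level_in_input": 0,
}
print(find_type_mismatches(example_reply, example_schema))
```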

# Describe the Data Schema
- data types
- what the meaning and context of the field is

(note: make field names very clear)

e.g.
```
data_schema_doc_blurb = """
- id or not is boolean
- confidence level is 0-1 float, like a probability
- comment/description must not be more than 100 characters
"""
```
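
The rules in the blurb are only instructions to the model; nothing in the notebook enforces them on the reply. As a rough sketch, the same rules could be checked in Python after extraction. The function name `find_rule_violations` is hypothetical, and the field names and the 0-10 noise range are taken from the schema comments in the previous cell.

```python
def find_rule_violations(result):
    """Return a list of readable problems; an empty list means all rules passed."""
    problems = []
    for key, value in result.items():
        # bool is a subclass of int in Python, so exclude it explicitly
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if key.startswith("probability_"):
            if not (is_number and 0.0 <= value <= 1.0):
                problems.append(f"{key}: expected a number from 0 to 1, got {value!r}")
        elif key == "noise_level_in_input":
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not (is_int and 0 <= value <= 10):
                problems.append(f"{key}: expected an int from 0 to 10, got {value!r}")
        elif key == "description_comment":
            if not (isinstance(value, str) and len(value) <= 100):
                problems.append(f"{key}: expected a string of at most 100 characters")
    return problems


# illustrative reply with two rule violations
example_result = {
    "probability_is_animal": 1.4,
    "noise_level_in_input": 3,
    "description_comment": "x" * 101,
}
for problem in find_rule_violations(example_result):
    print(problem)
```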


In [None]:
# this gets added to prompt
data_schema_doc_blurb = """
- id or not is boolean
- confidence level is 0-1 float, like a probability
- comment/description must not be more than 100 characters
"""

# Set 'system prompt'

In [None]:
def build_extraction_prompt(
    raw_text_input,
    data_schema,
    description_of_data_schema,
):

    full_prompt = f"""

    The task is to extract fields of structured data from an unstructured string.

    The fields and schema are:
    {data_schema}

    Descriptions of this including datatypes:
    {description_of_data_schema}

    The original unstructured text is:
    {raw_text_input}

    Output must be in correct markdown json format
    starting with ```json
    ending with ```

    Return a structured json object in markdown format,
    No other comments or output.
    """

    return full_prompt


In [None]:
# set extraction_prompt_string
extraction_prompt_string = build_extraction_prompt(
    raw_text_blob,
    target_schema_model_dict,
    data_schema_doc_blurb,
)
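
Because `build_extraction_prompt()` interpolates the dict with an f-string, the schema appears in the prompt in Python's `repr` form (single quotes, `False`). One option, offered as a sketch rather than as this notebook's method, is to serialise the schema with `json.dumps()` first so the prompt shows the same JSON syntax the model is asked to return. `schema_as_json_text` is a hypothetical helper name.

```python
import json


def schema_as_json_text(schema_dict):
    """Serialise the schema as indented JSON text for use inside a prompt."""
    return json.dumps(schema_dict, indent=2)


# illustrative schema (not the notebook's variable)
example_schema = {"animal_or_not": False, "species": ""}

print("Python repr form:")
print(str(example_schema))
print("JSON form:")
print(schema_as_json_text(example_schema))
```

To try it, pass `schema_as_json_text(target_schema_model_dict)` as the `data_schema` argument. This may reduce how often the reply contains Python literals such as `False`, though that depends on the model.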

# The Core API call:


- Select style-prompt to experiment with personality of answer (optional)
  - Modify the personality = "" text to describe the personality you want.
  - Specifying the language of the reply can be done in the system prompt

## Notes:

Mistral model names are updated and changed fairly often; check the web for the current list.
Older models will hopefully remain available somewhere online, if not on Hugging Face.

### Note: Colabs are slower
Free colabs are amazing for easily sharing and running code in a portable way,
but they are slower. Code in production, or run locally, will be faster than a colab.

#### From Mistral docs, See:
- https://docs.mistral.ai/
- https://docs.mistral.ai/getting-started/models/models_overview/
- https://docs.mistral.ai/platform/client/

```
curl --location "https://api.mistral.ai/v1/chat/completions" \
     --header 'Content-Type: application/json' \
     --header 'Accept: application/json' \
     --header "Authorization: Bearer $MISTRAL_API_KEY" \
     --data '{
    "model": "mistral-small-latest",
    "messages": [{"role": "user", "content": "Who is the most renowned French painter?"}]
  }'
```


## mistral-small-2312 (the original mistral-small) = Mixtral 8x7B
https://mistral.ai/news/mixtral-of-experts/

### response = requests.post(endpoint_url, headers=headers, json=request_body)


# Imports

In [None]:
from datetime import datetime
import json
import re
import requests

# login

In [None]:
"""
.env: get your environment variables:
  Using the Google Secrets system (like .env)
  built into Colab in the left menu: the 'key' icon.
"""
from google.colab import userdata
mistral_api_key = userdata.get('mistral_api_key')


"""
Python Dot-env
"""
# from dotenv import load_dotenv
# import os

# load_dotenv()
# mistral_api_key = os.getenv("mistral_api_key")


"""
Hard Code (not the best idea)
"""
# mistral_api_key = 'xxx'

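
Outside Colab, a plain environment variable is another common option. The sketch below assumes the key is stored under `MISTRAL_API_KEY` (the name used in the curl example above); `load_api_key` is a hypothetical helper that fails with a clear message rather than sending an empty key to the API.

```python
import os


def load_api_key(env_var_name="MISTRAL_API_KEY"):
    """Read the API key from an environment variable; raise if it is absent."""
    key = os.environ.get(env_var_name, "").strip()
    if not key:
        raise RuntimeError(
            f"No API key found in environment variable {env_var_name!r}."
        )
    return key


# usage (commented out so this cell does not fail when the variable is unset):
# mistral_api_key = load_api_key()
```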

# Setup

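Each assignment below overwrites the previous one, so the last uncommented `use_this_model = ...` line decides which model is used. Comment out the models you don't want to use.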

In [None]:
# Select Model
"""
https://docs.mistral.ai/api/

open-mistral-7b

open-mixtral-8x22b
open-mixtral-8x22b-2404

codestral-latest
codestral-2405


open-mistral-7b
(aka mistral-tiny-2312)
renamed from mistral-tiny
The endpoint mistral-tiny will be deprecated


Feb. 26, 2024

API endpoints: We renamed 3 API endpoints and added 2 model endpoints.

open-mistral-7b (aka mistral-tiny-2312): renamed from mistral-tiny. The endpoint mistral-tiny will be deprecated in three months.
open-mixtral-8x7B (aka mistral-small-2312): renamed from mistral-small. The endpoint mistral-small will be deprecated in three months.
mistral-small-latest (aka mistral-small-2402): new model.
mistral-medium-latest (aka mistral-medium-2312): old model. The previous mistral-medium has been dated and tagged as mistral-medium-2312. The endpoint mistral-medium will be deprecated in three months.
mistral-large-latest (aka mistral-large-2402): our new flagship model with leading performance.

"""

##################
# Open Mistral 7b
##################
# previously "tiny"
use_this_model = "open-mistral-7b"


###################
# Open Mixtral 8x7
###################
# previously "small"
use_this_model = "open-mixtral-8x7B"


######################
# open mixtral 8x22b
######################
# ...was 'medium'?
use_this_model = "open-mixtral-8x22b"


#######################
# Small, Medium, Large  (no 'tiny')
#######################
use_this_model = "mistral-small-latest"
use_this_model = "mistral-medium-latest"
use_this_model = "mistral-large-latest"

##############
# Codestral
##############
use_this_model = "codestral-latest"


use_this_model = "open-mistral-7b"

# Parameters
https://docs.mistral.ai/capabilities/completion/sampling

In [None]:
TEMPERATURE = 0.8 # higher number gives a more ~creative/variable answer

"""
Nucleus sampling: a higher number accepts a larger range of options,
including options with lower probabilities.
0.5 samples only from the most likely tokens whose
probabilities add up to 50%.
"""
TOP_P = 0.5  # cumulative probability mass (as a fraction) considered for tokens

"""
Range: [-2, 2]
Default: 0
"""
PRESENCE_PENALTY = 0

FREQUENCY_PENALTY = 0
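
To make the `TOP_P` setting concrete, here is a toy illustration of the nucleus (top-p) idea on a made-up four-token distribution. It is a simplified sketch of the concept, not Mistral's implementation; `nucleus_filter` and the probabilities are invented for the example.

```python
def nucleus_filter(token_probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(token_probs.items(), key=lambda item: item[1], reverse=True)
    kept = []
    cumulative = 0.0
    for token, prob in ranked:
        kept.append(token)
        cumulative += prob
        if cumulative >= top_p:
            break
    return kept


# made-up next-token probabilities
toy_distribution = {"cat": 0.45, "dog": 0.30, "fish": 0.15, "shoe": 0.10}

print(nucleus_filter(toy_distribution, 0.4))  # only the single most likely token
print(nucleus_filter(toy_distribution, 0.5))  # "cat" alone is below 0.5, so "dog" is added
print(nucleus_filter(toy_distribution, 1.0))  # every token stays available
```

For extraction tasks, a lower temperature or a lower `top_p` generally makes repeated runs agree more often, at the cost of less varied output.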


# Code

In [None]:
# import requests
# import json
# import os
# import re
# from google.colab import userdata

"""
# mistral_api_key = userdata.get('mistral_api_key')

# Define the endpoint URL
endpoint_url = "https://api.mistral.ai/v1/chat/completions"

# Set the headers
headers = {
  "Content-Type": "application/json",
  "Accept": "application/json",
  "Authorization": f"Bearer {mistral_api_key}"
}

# mode: [{"role": "user", "content": "say yes"}]

    # Define the request body
    request_body = {
      "model": "mistral-small-latest",
      "messages": [{"role": "user", "content": user_input}]
    }

    # Send the request
    response = requests.post(endpoint_url, headers=headers, json=request_body)
"""


def add_to_context_history(role, comment):
    """
    Build one chat message dict for the 'messages' list.
    role must be 'user', 'assistant', or 'system'.
    """

    if role not in ('user', 'assistant', 'system'):
        # An unknown role would produce an invalid message, so fail loudly.
        print("add_to_context_history(role, comment)")
        print(role, comment)
        raise ValueError(f"Unknown role: {role!r}")

    return {"role": role, "content": comment}


def ask_mistral_api(context_history, use_this_model):


    # Define the endpoint URL
    endpoint_url = "https://api.mistral.ai/v1/chat/completions"

    # Set the headers
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"Bearer {mistral_api_key}"
    }

    # Define the request body
    request_body = {
        "model": use_this_model,
        "messages": context_history,

        "temperature": TEMPERATURE,
        "top_p": TOP_P,
        "presence_penalty": PRESENCE_PENALTY,
        "frequency_penalty": FREQUENCY_PENALTY,
    }

    #################
    #################
    # Hit the ai api
    #################
    #################
    # Send the request
    response = requests.post(endpoint_url, headers=headers, json=request_body)

    # Check the response status code
    if response.status_code != 200:
        raise Exception(f"Error: {response.status_code} {response.text}")

    return response


def simple_ask_mistral_cloud(input_string, use_this_model):
    """
    you have: a string
    you need: a response

    1. make minimal history context
    2. make a generic system instruction, for show
    3. make system-user context: string input
    4. ask mistral for that model
    5. extract just the response string
    6. return only reply (no 'history')
    """

    # 1. make minimal history context
    context_history = []

    # 2. make a generic system instruction
    generic_system_instruction = "You are helpful and answer accurately."
    context_history.append( add_to_context_history("system", generic_system_instruction) )

    # 3. make system-user context: string input
    context_history.append( add_to_context_history("user", input_string) )

    # 4. ask mistral for that model
    response = ask_mistral_api(context_history, use_this_model)


    # Get the response data
    response_data = response.json()


    # 5. extract just the response string

    ##
    ##
    # Turn this print on to see full return data
    ##
    ##
    """
    e.g.
    {
      "id": "635cb8d445ujhe5546bb64e5e7",
      "object": "chat.completion",
      "created": 170hrjfjf7084,
      "model": "open-mistral-7b",
      "choices": [
        {
          "index": 0,
          "message": {
            "role": "assistant",
            "content": "Enjoy your cup of tea!"
          },
          "finish_reason": "stop",
          "logprobs": null
        }
      ],
      "usage": {
        "prompt_tokens": 575,
        "total_tokens": 629,
        "completion_tokens": 54
      }
    }
    """
    # print(json.dumps(response_data, indent=2))
    # print(type(response_data))

    output = response_data
    # print(type(output))
    # print(type(output["choices"][0]))

    # extract just the 'what they said' part out
    assistant_says = output["choices"][0]['message']['content']

    # 6. return only reply (no 'history')
    return assistant_says
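
`ask_mistral_api()` raises on any non-200 status, which stops the run on a temporary rate limit or server hiccup. One possible mitigation, sketched here under the assumption that simply waiting and calling again is acceptable, is a small exponential-backoff wrapper. `call_with_backoff` is a hypothetical helper; the demo uses a fake function instead of the real API so it runs offline.

```python
import time


def call_with_backoff(call, max_attempts=4, base_delay_seconds=1.0, sleep=time.sleep):
    """Call call() until it succeeds or attempts run out.

    Waits base_delay_seconds * 2**n between attempts and re-raises
    the last exception when every attempt has failed.
    """
    for attempt_index in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt_index == max_attempts - 1:
                raise
            sleep(base_delay_seconds * (2 ** attempt_index))


# demo with a fake call that fails twice, then succeeds
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise Exception("Error: 429 rate limited")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = call_with_backoff(flaky_call, sleep=waits.append)
print(result, waits)
```

With the real API this would be used as `call_with_backoff(lambda: ask_mistral_api(context_history, use_this_model))`. Note that this sketch retries every exception, including ones that will never succeed (for example a wrong API key), so a production version would want to inspect the status code first.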


# Extraction Code

In [None]:
import json

# Helper Function
def extract_json_as_pydict_from_unstructured_text(dict_str, model_dict):
    """
    Extracts the JSON block enclosed between ```json and ``` markers
    from the model's reply and loads it as a Python dict.

    This function CAN fail and should fail
    if the AI needs to retry at a task.
    Do not stop the server when this triggers an exception.

    edge case: before there is a populated output_log

    Parameters:
    - dict_str (str): The raw reply text containing the JSON block.
    - model_dict (dict): The schema dict whose keys the result must match.

    Returns:
    - dict: The parsed dict, if a JSON block is found, loads cleanly,
      and has exactly the same keys as model_dict.
    - False: On any failure (no block found, malformed JSON, or wrong keys).
    """
    print(f"\n\n Starting check_function_description_keys, dict_str -> {dict_str}")

    ########################
    # Check Json Formatting
    ########################
    try:
        pattern = r'```json\n([\s\S]*?)\n```'
        match = re.search(pattern, dict_str)
        dict_str =  match.group(1) if match else ''

    except Exception as e:
        print(f"\nTRY AGAIN: check_function_description_keys() extraction from markdown failed: {e}")
        print(f"Failed dict_str -> {dict_str}")
        return False

    print(f"\n extracted from markdown ->{dict_str}")

    # clean
    try:
        # try safety cleaning
        dict_str = dict_str.replace("True", "true")
        dict_str = dict_str.replace("False", "false")
        dict_str = dict_str.replace("None", "null")

        # # This conflicted with free language in description section...
        # dict_str = dict_str.replace("'", '"')

        # remove a trailing delimiter comma before a closing } or ]
        # (note: this would also alter such a sequence inside a string value)
        dict_str = re.sub(r',\s*([}\]])', r'\1', dict_str)

    except Exception as e:
        print(f"\nTRY AGAIN:try safety cleaning: {e}")
        print(f"Failed dict_str -> {dict_str}")
        return False

    # load
    try:
        # try converting
        dict_str = json.loads(dict_str)

    except Exception as e:
        print(f"\nTRY AGAIN: trying json.loads(dict_str) Dictionary load failed: {e}")
        print(f"Failed dict_str -> {dict_str}")
        return False


    # check if keys are the same
    try:
        result = dict_str.keys() == model_dict.keys()
        if result is False:
            print(f"Failed: keys are not the same.")
            print(f"Failed dict_str -> {dict_str}")
            return False

    except Exception as e:
        print(f"\nTRY AGAIN: Failed with Exception: keys are not the same: {e}")
        print(f"Failed dict_str -> {dict_str}")
        return False

    # if ok...
    return dict_str
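
The pattern above expects a newline directly after the opening fence and before the closing one. Some models omit the fence or add commentary around the JSON. As a hedged sketch of a more tolerant approach, the hypothetical helper `find_json_block` below tries a fenced block first and then falls back to the text between the first `{` and the last `}`. It does not replace the key check done above.

```python
import json
import re


def find_json_block(reply_text):
    """Return the first parseable JSON object found in reply_text, else None."""
    candidates = []

    # 1. fenced block, with or without the "json" tag, any whitespace after it
    fenced = re.search(r"```(?:json)?\s*([\s\S]*?)```", reply_text)
    if fenced:
        candidates.append(fenced.group(1))

    # 2. fallback: everything from the first "{" to the last "}"
    start, end = reply_text.find("{"), reply_text.rfind("}")
    if start != -1 and end > start:
        candidates.append(reply_text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


# illustrative replies in three different shapes
fenced_reply = "```json\n{\"species\": \"Siamese cat\"}\n```"
chatty_reply = "Sure! Here it is: {\"species\": \"Siamese cat\"} Hope that helps."
print(find_json_block(fenced_reply))
print(find_json_block(chatty_reply))
print(find_json_block("no json here"))
```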



In [None]:
# import json
import re
import traceback
# from datetime import datetime


def initialize_session_log(
    raw_text_blob: str,
    target_schema_model_dict: dict,
    use_this_model: str,
) -> str:
    """
    Creates a new timestamped session log file and writes
    the opening configuration block to it.

    Everything printed during the session will also be
    appended to this file via print_and_log().

    Parameters:
    -----------
    raw_text_blob : str
        The raw input text to be processed.
    target_schema_model_dict : dict
        The schema dict defining expected output fields.
    use_this_model : str
        The model name string used for this session.

    Returns:
    --------
    str
        The filepath of the created session log file.
    """

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S_%f')
    session_log_filepath = f"session_log_{timestamp}.txt"

    opening_block = (
        f"\n{'='*60}\n"
        f"SESSION LOG\n"
        f"{'='*60}\n"
        f"Timestamp      : {timestamp}\n"
        f"Model          : {use_this_model}\n"
        f"Schema         :\n{json.dumps(target_schema_model_dict, indent=2)}\n"
        f"Raw Input Blob :\n{raw_text_blob}\n"
        f"{'='*60}\n"
    )

    try:
        with open(session_log_filepath, 'w') as log_file:
            log_file.write(opening_block)
        print(opening_block)

    except Exception as e:
        print(f"WARNING: Could not create session log file: {e}")
        print(traceback.format_exc())

    return session_log_filepath


def print_and_log(
    message: str,
    session_log_filepath: str,
) -> None:
    """
    Prints a message to stdout AND appends it to the session log file.

    This is the single point of output throughout the pipeline.
    Nothing should be printed directly — always use this function
    so that stdout and the log file stay in sync.

    Parameters:
    -----------
    message : str
        The string to print and append to the log.
    session_log_filepath : str
        Path to the active session log file.

    Returns:
    --------
    None
    """

    print(message)

    try:
        with open(session_log_filepath, 'a') as log_file:
            log_file.write(message + "\n")

    except Exception as e:
        # Print warning to stdout only — do not recurse into print_and_log
        print(f"WARNING: Could not append to session log file: {e}")
        print(traceback.format_exc())


def save_extraction_results(
    raw_text_blob: str,
    target_schema_model_dict: dict,
    use_this_model: str,
    result_dict: dict | None,
    attempts_made: int,
    session_log_filepath: str,
) -> None:
    """
    Saves the final extraction results to a timestamped JSON file.

    Called on success (with the validated dict) and on total failure
    (with result_dict=None). Captures all inputs, settings, and output
    together in one file for reproducibility and inspection.

    Parameters:
    -----------
    raw_text_blob : str
        The original raw input text.
    target_schema_model_dict : dict
        The schema dict.
    use_this_model : str
        Model name string.
    result_dict : dict or None
        The validated extracted dict on success, or None on failure.
    attempts_made : int
        How many attempts were made before this result.
    session_log_filepath : str
        Path to the session log file (recorded here for cross-reference).

    Returns:
    --------
    None
    """

    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S_%f')
    results_filepath = f"extraction_results_{timestamp}.json"

    results_payload = {
        "timestamp": timestamp,
        "model_used": use_this_model,
        "attempts_made": attempts_made,
        "raw_text_blob": raw_text_blob,
        "target_schema_model_dict": target_schema_model_dict,
        "extraction_result": result_dict,
        "session_log_filepath": session_log_filepath,
    }

    try:
        with open(results_filepath, 'w') as results_file:
            json.dump(results_payload, results_file, indent=2)

        print_and_log(
            f"\nResults saved to: {results_filepath}",
            session_log_filepath,
        )

    except Exception as e:
        print_and_log(
            f"WARNING: Could not save results file: {e}\n{traceback.format_exc()}",
            session_log_filepath,
        )


def run_extraction_with_retry(
    raw_text_blob: str,
    target_schema_model_dict: dict,
    extraction_prompt_string: str,
    use_this_model: str,
    max_retries: int = 3,
) -> dict | None:
    """
    Main extraction pipeline loop.

    Takes a fully pre-built prompt string (already assembled by
    build_extraction_prompt() before this function is called),
    sends it to the Mistral API, and attempts to extract a validated
    Python dict matching the target schema.

    Retries up to max_retries times on JSON extraction failure.

    All output — raw API responses, extraction attempts, successes,
    failures — is printed to stdout AND appended to a timestamped
    session log file via print_and_log().

    Final results (success or failure) are saved to a separate
    timestamped JSON file via save_extraction_results().

    Context history is constructed ONCE before the loop.
    The prompt does not change between retries.

    API-level exceptions (non-200 status from ask_mistral_api())
    are NOT caught here — they propagate up and stop execution.
    JSON extraction failures are caught and retried.

    Parameters:
    -----------
    raw_text_blob : str
        The original raw input text. Used for saving results.
        The prompt string is already built before this is called.
    target_schema_model_dict : dict
        The schema dict defining expected output fields and structure.
        Passed to extract_json_as_pydict_from_unstructured_text()
        for key validation.
    extraction_prompt_string : str
        The fully assembled prompt string from build_extraction_prompt().
    use_this_model : str
        The Mistral model name string.
    max_retries : int
        Maximum number of API call + extraction attempts. Default 3.

    Returns:
    --------
    dict
        Validated Python dict matching the schema on success.
    None
        If all retries are exhausted without a valid extraction.
    """

    # --- Initialize session log (opened once, appended throughout) ---
    session_log_filepath = initialize_session_log(
        raw_text_blob,
        target_schema_model_dict,
        use_this_model,
    )

    # ----------------------------------------------------------------
    # Construct context_history ONCE before the loop.
    # Same history is sent on every retry attempt.
    # ----------------------------------------------------------------
    context_history = []

    generic_system_instruction = (
        "You are a precise data extraction assistant. "
        "Follow the instructions exactly. "
        "Return only valid JSON in markdown format. "
        "No other text or commentary."
    )

    # system message
    context_history.append(
        add_to_context_history("system", generic_system_instruction)
    )

    # user message: the full pre-built prompt
    context_history.append(
        add_to_context_history("user", extraction_prompt_string)
    )

    print_and_log(
        (
            f"\nContext history constructed."
            f"\nStarting extraction loop."
            f"\nMax retries: {max_retries}\n"
        ),
        session_log_filepath,
    )

    # ----------------------------------------------------------------
    # Retry loop
    # ----------------------------------------------------------------
    for attempt_number in range(1, max_retries + 1):

        print_and_log(
            f"\n{'='*50}\nATTEMPT {attempt_number} of {max_retries}\n{'='*50}",
            session_log_filepath,
        )

        # --- Call the API ---
        # Note: ask_mistral_api() raises on non-200 — let it propagate.
        response = ask_mistral_api(context_history, use_this_model)

        # --- Extract raw reply string from response object ---
        try:
            response_data = response.json()
            raw_reply_string = response_data["choices"][0]["message"]["content"]

        except Exception as e:
            print_and_log(
                (
                    f"\nATTEMPT {attempt_number}: Could not extract reply string"
                    f" from response object.\n{e}\n{traceback.format_exc()}"
                ),
                session_log_filepath,
            )
            continue

        # --- Print and log the complete raw API reply ---
        print_and_log(
            f"\nRAW API RESPONSE (attempt {attempt_number}):\n{raw_reply_string}",
            session_log_filepath,
        )

        # --- Attempt JSON extraction and schema key validation ---
        print_and_log(
            f"\nAttempting JSON extraction and validation (attempt {attempt_number})...",
            session_log_filepath,
        )

        try:
            validated_dict = extract_json_as_pydict_from_unstructured_text(
                raw_reply_string,
                target_schema_model_dict,
            )

        except Exception as e:
            print_and_log(
                (
                    f"\nATTEMPT {attempt_number}: extract_json_as_pydict raised"
                    f" an exception.\n{e}\n{traceback.format_exc()}"
                ),
                session_log_filepath,
            )
            continue

        # --- Check result: False means malformed or wrong keys ---
        if validated_dict is False:
            print_and_log(
                (
                    f"\nATTEMPT {attempt_number} FAILED:"
                    f" extraction returned False (malformed JSON or wrong keys)."
                    f" Retrying...\n"
                ),
                session_log_filepath,
            )
            continue

        # ------------------------------------------------------------
        # SUCCESS
        # ------------------------------------------------------------
        print_and_log(
            (
                f"\nATTEMPT {attempt_number} SUCCEEDED.\n"
                f"Extracted dict:\n{json.dumps(validated_dict, indent=2)}"
            ),
            session_log_filepath,
        )

        save_extraction_results(
            raw_text_blob=raw_text_blob,
            target_schema_model_dict=target_schema_model_dict,
            use_this_model=use_this_model,
            result_dict=validated_dict,
            attempts_made=attempt_number,
            session_log_filepath=session_log_filepath,
        )

        return validated_dict

    # ----------------------------------------------------------------
    # Loop exhausted — all attempts failed
    # ----------------------------------------------------------------
    print_and_log(
        (
            f"\nWARNING: All {max_retries} attempts exhausted."
            f" Extraction failed."
            f" Returning None.\n"
        ),
        session_log_filepath,
    )

    save_extraction_results(
        raw_text_blob=raw_text_blob,
        target_schema_model_dict=target_schema_model_dict,
        use_this_model=use_this_model,
        result_dict=None,
        attempts_made=max_retries,
        session_log_filepath=session_log_filepath,
    )

    return None
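
The loop above sends the identical history on every attempt and relies on sampling variation to get a different answer. An alternative, shown only as a sketch and not tested against the API here, is to feed the failed reply back with a short correction request. `build_retry_history` and its wording are hypothetical.

```python
def build_retry_history(base_history, failed_reply, failure_reason):
    """Return a NEW message list: base history + failed reply + correction request.

    base_history is not modified, so the original prompt stays reusable.
    """
    correction_request = (
        "Your previous reply could not be used: "
        f"{failure_reason} "
        "Reply again with only the corrected JSON in markdown format."
    )
    return list(base_history) + [
        {"role": "assistant", "content": failed_reply},
        {"role": "user", "content": correction_request},
    ]


# illustrative two-message history, in the same shape as context_history
base_history = [
    {"role": "system", "content": "You are a precise data extraction assistant."},
    {"role": "user", "content": "(the extraction prompt)"},
]
retry_history = build_retry_history(
    base_history,
    failed_reply="{species: cat}",
    failure_reason="the JSON did not parse.",
)
for message in retry_history:
    print(message["role"], "->", message["content"])
```

Whether this helps depends on the model; it also makes each retry longer and therefore slower and more costly in tokens.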

# Run

In [None]:
"""
Input Fields

def run_extraction_with_retry(
    raw_text_blob: str,
    target_schema_model_dict: dict,
    extraction_prompt_string: str,
    use_this_model: str,
    max_retries: int = 3,
)
"""

run_extraction_result = run_extraction_with_retry(
    raw_text_blob,
    target_schema_model_dict,
    extraction_prompt_string,
    use_this_model,
    max_retries=3,
)

print(f"run_extraction_result -> {run_extraction_result}")


SESSION LOG
Timestamp      : 20260220_204636_959162
Model          : open-mistral-7b
Schema         :
{
  "animal_or_not": false,
  "probability_is_animal": 0.0,
  "favorite_food": "",
  "probability_of_food": 0.0,
  "species": "",
  "wearing": "",
  "noise_level_in_input": 0,
  "description_comment": ""
}
Raw Input Blob :

Carl is a siamese cat, who eats blue milk and wears shoes



Context history constructed.
Starting extraction loop.
Max retries: 3


ATTEMPT 1 of 3

RAW API RESPONSE (attempt 1):
```json
{
  "animal_or_not": true,
  "probability_is_animal": 1.0,
  "favorite_food": "blue milk",
  "probability_of_food": 1.0,
  "species": "Siamese cat",
  "wearing": "shoes",
  "noise_level_in_input": 0,
  "description_comment": ""
}
```

Attempting JSON extraction and validation (attempt 1)...


 Starting check_function_description_keys, dict_str -> ```json
{
  "animal_or_not": true,
  "probability_is_animal": 1.0,
  "favorite_food": "blue milk",
  "probability_of_food": 1.0,
  "speci

# Test API


In [None]:
## Optional Test of api system / connection
# simple_ask_mistral_cloud("Hello world", use_this_model)