# Getting Structured Information with Validation using a State Monad

In this notebook, structured information is extracted from free text and then validated. If validation fails, a second attempt is made to get a valid answer. If that attempt also fails, the answer is ignored and a fallback value is used instead. A state monad keeps track of the list of messages, i.e. the conversation with the LLM.

If you want to know what a state monad is and how it works, see the excellent blog post and video by [Scott Wlaschin](https://fsharpforfunandprofit.com/posts/monadster/).
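For a rough intuition, a state monad can be sketched by hand as a function from a state to a pair of a result and a new state. The snippet below is a standalone illustration, not the `State` type from FSharpPlus that the rest of this notebook uses; the `say` helper and the plain string messages are assumptions made for the example.

```fsharp
// A hand-rolled state monad: a function from a state to (value, new state).
type State<'S, 'T> = State of ('S -> 'T * 'S)

module State =
    // run a stateful computation with an initial state
    let run (State f) s = f s
    // read the current state
    let get<'S> : State<'S, 'S> = State (fun s -> s, s)
    // replace the current state
    let put s = State (fun _ -> (), s)

type StateBuilder() =
    member _.Return x = State (fun s -> x, s)
    member _.Bind (m, f) =
        State (fun s ->
            let x, s' = State.run m s
            State.run (f x) s')

let state = StateBuilder()

// Hypothetical helper: append a message to the conversation
// and return the new number of messages.
let say (msg: string) =
    state {
        let! msgs = State.get
        let msgs = msgs @ [ msg ]
        do! State.put msgs
        return List.length msgs
    }

let count, conversation =
    State.run
        (state {
            let! _ = say "What is the unit?"
            let! n = say "mg"
            return n
        })
        [ "You are an expert." ]
```

Here `count` is the value returned by the last step and `conversation` is the final state, so the conversation grows without being passed around by hand.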

## First set up the libraries and open the namespaces

In [1]:
#r "nuget: FSharpPlus"
#r "nuget: Newtonsoft.Json"
#r "nuget: NJsonSchema"

#r "../../Informedica.Utils.Lib/bin/Debug/net8.0/Informedica.Utils.Lib.dll"

#load "../Types.fs"
#load "../Utils.fs"
#load "../Texts.fs"
#load "../Prompts.fs"
#load "../Message.fs"
#load "../OpenAI.fs"
#load "../Fireworks.fs"
#load "../Ollama.fs"


open Newtonsoft.Json

open FSharpPlus
open FSharpPlus.Data
open Informedica.Utils.Lib.BCL

open Informedica.OpenAI.Lib

## Adjust the model settings

In [2]:
Ollama.options.temperature <- 0.
Ollama.options.penalize_newline <- true
Ollama.options.top_k <- 10
Ollama.options.top_p <- 0.95

## An initial system prompt and a generic method to extract structured data

The `systemMsg` function primes the LLM as a medical expert that can extract structured information from free text.

The `extract` function takes a model (the name of the LLM), an initial `zero` structure, and the request message, and extracts information from the text as JSON. The function is written as a `monad` computation expression. This allows the use of a `State` monad that keeps track of the state, i.e. the list of messages that make up the "conversation" with the LLM.

In [3]:
let systemMsg text = [ text |> Texts.systemDoseQuantityExpert2 |> Message.system ]


let inline extract (model: string) zero msg =
    monad {
        // get the current list of messages
        let! msgs = State.get
        // get the structured extraction along with
        // the updated list of messages
        let msgs, res =
            msg
            |> Ollama.validate2
                model
                msgs
            |> Async.RunSynchronously
            |> function
                | Ok (result, msgs) -> msgs, result
                | Error (_, msgs)   -> msgs, zero
        // refresh the state with the updated list of messages
        do! State.put msgs
        // return the structured extraction
        return res
    }
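The internals of `Ollama.validate2` are not shown in this notebook. As a rough sketch of the behaviour described in the introduction (ask, validate, retry once with feedback), the following standalone example uses a hypothetical `Ask` function as a stand-in for the LLM call and plain strings for messages. It is an assumption about the general pattern, not the library's implementation.

```fsharp
// Hypothetical stand-in for an LLM call: takes the conversation so far
// and returns the model's answer.
type Ask = string list -> string

// Ask once, validate the answer, and retry once with the validation
// error as feedback. Returns the result and the updated conversation.
let askWithRetry
    (ask: Ask)
    (validate: string -> Result<string, string>)
    (question: string)
    (msgs: string list)
    =
    let attempt msgs =
        let answer = ask msgs
        validate answer, msgs @ [ answer ]

    match attempt (msgs @ [ question ]) with
    | Ok answer, msgs -> Ok answer, msgs
    | Error err, msgs ->
        // feed the validation error back and try a second time
        attempt (msgs @ [ $"Invalid answer: {err}. Please try again." ])

// A fake model that answers wrongly first and correctly on the retry.
let fakeAsk: Ask =
    fun msgs -> if List.length msgs < 3 then "milligram" else "mg"

let validate answer =
    if answer = "mg" then Ok answer
    else Error $"'{answer}' is not a valid unit"

let result, conversation =
    askWithRetry fakeAsk validate "What is the unit?" [ "system prompt" ]
```

If the second attempt also fails, the caller receives an `Error` and can substitute a fallback, which is the role `zero` plays in `extract`.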

In [4]:
let unitValidator<'Unit> text get validUnits s =
    let isValidUnit s =
        if validUnits |> List.isEmpty then true
        else
            validUnits
            |> List.exists (String.equalsCapInsens s)
    try
        let un = JsonConvert.DeserializeObject<'Unit>(s)
        match un |> get |> String.split "/" with
        | [u] when u |> isValidUnit -> 
            if text |> String.containsCapsInsens u then Ok s
            else
                $"{u} is not mentioned in the text"
                |> Error 
        | _ -> 
            if validUnits |> List.isEmpty then $"{s} is not a valid unit, the unit should not contain '/'"
            else
                $"""
{s} is not a valid unit, the unit should not contain '/' and the unit should be one of the following:
{validUnits |> String.concat ", "}
"""
            |> Error
    with
    | e ->
        e.ToString()
        |> Error
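To make the validation rules easier to follow, here is a simplified, dependency-free sketch that works on a plain unit string instead of a JSON document. The name `validateUnit` and the use of the .NET `String` methods are assumptions for the example; the notebook itself relies on `Informedica.Utils.Lib` and `Newtonsoft.Json`.

```fsharp
open System

// Simplified version of the unit check, operating on a plain unit string.
let validateUnit (text: string) (validUnits: string list) (candidate: string) =
    let equalsIgnoreCase (a: string) (b: string) =
        String.Equals(a, b, StringComparison.OrdinalIgnoreCase)

    // an empty list of valid units means that any unit is allowed
    let isValidUnit u =
        List.isEmpty validUnits || validUnits |> List.exists (equalsIgnoreCase u)

    match candidate.Split('/') |> Array.toList with
    | [ u ] when isValidUnit u ->
        // the unit has to occur somewhere in the source text
        if text.Contains(u, StringComparison.OrdinalIgnoreCase) then Ok u
        else Error $"{u} is not mentioned in the text"
    | _ -> Error $"{candidate} is not a valid unit"

let text = "max 0,05 mg/kg/dag in 3 doses"

let accepted = validateUnit text [] "mg"                 // single unit found in the text
let combined = validateUnit text [] "mg/kg"              // rejected: contains '/'
let notAllowed = validateUnit text [ "kg"; "m2" ] "g"    // rejected: not in the allowed list
let notInText = validateUnit text [] "ml"                // rejected: not found in the text
```

Note that in this sketch the text check is a substring match, so a short unit such as "g" would count as mentioned in a text that only contains "mg".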

## Functions that extract different pieces of structured information

Using the general extract function, specific functions can be created that extract small pieces of validated structured information. This allows breaking up a 
large task into smaller/simpler tasks for the LLM to execute. 

Each function has a `zero` structure, which is used as a fallback when extraction fails. Each extraction function also has a `validator`, a function that validates the 
extracted structure.

In [5]:
let extractSubstanceUnit model text =
    let zero = {| substanceUnit = "" |}
    let validator = unitValidator text (fun (u: {| substanceUnit: string |}) -> u.substanceUnit)  []

    $"""
Use Schema {"{| substanceUnit: string |}" |> Utils.anonymousTypeStringToJson}
Extract the unit of measurement for the medication, the substance unit in the text between '''

Examples:
- mg/kg/dag: "{{ substanceUnit = "mg" }}"
- mg/kg: "{{ substanceUnit = "mg" }}"
- g/m2/dag: "{{ substanceUnit = "g" }}"
- IE/m2: "{{ substanceUnit = "IE" }}"


Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero


let extractAdjustUnit model text =
    let zero = {| adjustUnit = "" |}
    let validator = 
        ["kg"; "m2"; "mˆ2"]
        |> unitValidator text (fun (u: {| adjustUnit: string |}) -> u.adjustUnit)

    $"""
Use Schema {"{| adjustUnit: string |}" |> Utils.anonymousTypeStringToJson}
Extract the unit by which a medication dose is adjusted, the patient weight or body surface area, the adjust unit in the text between '''

Examples:
- mg/kg/dag: "{{ adjustUnit = "kg" }}"
- mg/kg: "{{ adjustUnit = "kg" }}"
- mg/m2/dag: "{{ adjustUnit = "m2" }}"
- mg/m2: "{{ adjustUnit = "m2" }}"

Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero


let extractTimeUnit model text =
    let zero = {| timeUnit = "" |}
    let validator = 
        [
            "dag"
            "week"
            "maand"
        ]
        |> unitValidator text (fun (u: {| timeUnit: string |}) -> u.timeUnit)

    $"""
Use Schema {"{| timeUnit: string |}" |> Utils.anonymousTypeStringToJson}
Extract time unit in the text between '''


Examples:
- mg/kg/dag: "{{ timeUnit = "dag" }}"
- mg/kg: "{{ timeUnit = "" }}"
- mg/m2/week: "{{ timeUnit = "week" }}"
- mg/2 dagen: "{{ timeUnit = "2 dagen" }}"


Respond in JSON
"""
    |> Message.userWithValidator validator
    |> extract model zero

## Combine the extractions to create a larger structured extraction

The separate extraction functions can be combined to create a larger extraction structure. With the `monad` computation expression, the state, i.e. the conversation (list of messages), is automatically passed from one extraction to the next. The LLM therefore gets the "full picture", which hopefully improves the extraction process and the validity of the extractions.

In [6]:
let createDoseUnits model text =
    monad {
        let! substanceUnit = extractSubstanceUnit model text
        let! adjustUnit = extractAdjustUnit model text
        let! timeUnit = extractTimeUnit model text

        return
            {|
                substanceUnit = substanceUnit.substanceUnit
                adjustUnit = adjustUnit.adjustUnit
                timeUnit = timeUnit.timeUnit
            |}
    }
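To see what the computation expression saves, the same flow can be written with the conversation threaded through by hand. This is a standalone sketch: `step` is a hypothetical extraction step with canned answers and plain string messages, not one of the notebook's extraction functions.

```fsharp
// Hypothetical extraction step: takes the conversation and returns
// an extracted value together with the extended conversation.
let step (question: string) (answer: string) (msgs: string list) =
    answer, msgs @ [ $"user: {question}"; $"assistant: {answer}" ]

// The same three extractions, with the state passed along explicitly.
let createDoseUnitsExplicit (msgs0: string list) =
    let substanceUnit, msgs1 = step "substance unit?" "mg" msgs0
    let adjustUnit, msgs2 = step "adjust unit?" "kg" msgs1
    let timeUnit, msgs3 = step "time unit?" "dag" msgs2

    let units =
        {|
            substanceUnit = substanceUnit
            adjustUnit = adjustUnit
            timeUnit = timeUnit
        |}

    units, msgs3

let units, msgs = createDoseUnitsExplicit [ "system: you are an expert" ]
```

Every step has to receive the right `msgsN` and hand on the next one; the `State` monad removes that bookkeeping and the chance of passing a stale conversation.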

## Finally the State monad can run the whole process of extraction

Running the extraction process returns the extracted structure and the full list of messages.

In [12]:
let un, msgs =
    let text = Texts.testTexts[0]
    State.run
        (createDoseUnits Ollama.Models.llama2 text)
        (systemMsg text)

printfn $"## The final extracted structure:\n{un}\n\n"

printfn "## The full conversation"
msgs
|> List.iter Message.print

## The final extracted structure:
{ adjustUnit = "kg"
  substanceUnit = "mg"
  timeUnit = "dag" }


## The full conversation

## System:
You are an expert on medication prescribing, preparation and administration. You will give
exact answers. If there is no possible answer return an empty string.
You have to answer questions about a free text between ''' that describes the dosing of a medication.
You will be asked to extract structured information from the following text:

'''
alprazolam
6 jaar tot 18 jaar Startdosering: 0,125 mg/dag, éénmalig. Onderhoudsdosering: Op geleide van klinisch beeld verhogen met stappen van 0,125-0,25 mg/dosis tot max 0,05 mg/kg/dag in 3 doses. Max: 3 mg/dag. Advies inname/toediening: De dagdosis indien mogelijk verdelen over 3 doses.Bij plotselinge extreme slapeloosheid: alleen voor de nacht innemen; dosering op geleide van effect ophogen tot max 0,05 mg/kg, maar niet hoger dan 3 mg/dag.De effectiviteit bij de behandeling van acute angst is discutabel.
'''

## Another attempt using a different LLM

In [11]:
let un, msgs =
    let text = Texts.testTexts[0]
    State.run
        (createDoseUnits Ollama.Models.openhermes text)
        (systemMsg text)

printfn $"## The final extracted structure:\n{un}\n\n"

printfn "## The full conversation"
msgs
|> List.iter Message.print

## The final extracted structure:
{ adjustUnit = "kg"
  substanceUnit = "mg"
  timeUnit = "dag" }


## The full conversation

## System:
You are an expert on medication prescribing, preparation and administration. You will give
exact answers. If there is no possible answer return an empty string.
You have to answer questions about a free text between ''' that describes the dosing of a medication.
You will be asked to extract structured information from the following text:

'''
alprazolam
6 jaar tot 18 jaar Startdosering: 0,125 mg/dag, éénmalig. Onderhoudsdosering: Op geleide van klinisch beeld verhogen met stappen van 0,125-0,25 mg/dosis tot max 0,05 mg/kg/dag in 3 doses. Max: 3 mg/dag. Advies inname/toediening: De dagdosis indien mogelijk verdelen over 3 doses.Bij plotselinge extreme slapeloosheid: alleen voor de nacht innemen; dosering op geleide van effect ophogen tot max 0,05 mg/kg, maar niet hoger dan 3 mg/dag.De effectiviteit bij de behandeling van acute angst is discutabel.
'''

## Testing different LLMs

In [7]:
let test model =
    [
        for (text, exp) in Texts.testUnitTexts do
            let un, _ =
                State.run
                    (createDoseUnits model text)
                    (systemMsg text)
            if un = exp then 1 else 0
    ]
    |> List.sum

Run the tests:

In [9]:
[
    Ollama.Models.llama2
    Ollama.Models.gemma
    Ollama.Models.openhermes
    Ollama.Models.mistral
    Ollama.Models.``llama-pro``
    Ollama.Models.``openchat:7b``
    Ollama.Models.``llama2:13b-chat``
]
|> List.map (fun model -> 
    printf $"- Testing: {model}: "
    let s = model |> test
    printfn $"score: {s}"
    model, s
)
|> List.maxBy snd
|> fun (m, s) -> printfn $"\n\n## And the winner is: {m} with a high score: {s} from {Texts.testUnitTexts |> List.length}"

- Testing: llama2: score: 1
- Testing: gemma: score: 0
- Testing: openhermes: score: 4
- Testing: mistral: score: 4
- Testing: llama-pro: score: 1
- Testing: openchat:7b: score: 4
- Testing: llama2:13b-chat: score: 2


## And the winner is: openhermes with a high score: 4 from 6
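The scores count exact matches of the whole structure only. To see which field a model gets wrong, a per-field comparison could be sketched as follows. The `DoseUnits` abbreviation, the `mismatches` function and the sample values are assumptions for the example, based on the shape returned by `createDoseUnits`.

```fsharp
// Assumed shape of both the expected and the extracted structure.
type DoseUnits = {| substanceUnit: string; adjustUnit: string; timeUnit: string |}

// List the fields where the extracted units differ from the expected ones.
let mismatches (expected: DoseUnits) (actual: DoseUnits) =
    [
        "substanceUnit", expected.substanceUnit, actual.substanceUnit
        "adjustUnit", expected.adjustUnit, actual.adjustUnit
        "timeUnit", expected.timeUnit, actual.timeUnit
    ]
    |> List.filter (fun (_, want, got) -> want <> got)
    |> List.map (fun (name, want, got) -> $"{name}: expected '{want}' but got '{got}'")

// Made-up sample values, not results from an actual model run.
let expected: DoseUnits =
    {| substanceUnit = "mg"; adjustUnit = "kg"; timeUnit = "dag" |}

let actual: DoseUnits =
    {| substanceUnit = "mg"; adjustUnit = "mg"; timeUnit = "dag" |}

let report = mismatches expected actual
```

An empty `report` corresponds to a score of 1 in `test`; a non-empty one shows where prompt examples or validators may need attention.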
