## 2g. Evidence - Inclusivity QAS Measurements

Evidence collected in this section checks the inclusivity QAS scenario defined in the previous step. Note that some functions and data are loaded from external Python files.

### Initialize MLTE Context

MLTE contains a global context that manages the currently active _session_. Initializing the context tells MLTE how to store all of the artifacts that it produces. This import also sets up global constants for the folder paths and the model to use.

In [None]:
# Sets up context for the model being used, sets up constants related to folders and model data to be used.
from session import *
from session_LLMinfo import *

### Set up scenario test case

In [None]:
from mlte.negotiation.artifact import NegotiationCard

card = NegotiationCard.load()
qa = 7  # index of the inclusivity scenario in the negotiation card
scenario = card.quality_scenarios[qa]

print(scenario.identifier)
print(scenario.quality)
# Assemble the scenario as one sentence: stimulus, source, environment, response, measure.
print(
    f"{scenario.stimulus} from {scenario.source} during {scenario.environment}. "
    f"{scenario.response} {scenario.measure}"
)

### A specific test case generated from the scenario

**Data and Data Source:** The self-evaluations in the original test data set will be used to generate a number of contextually similar self-evaluations written at different reading levels. All versions convey the same information but have markedly different Flesch-Kincaid grade levels or Flesch Reading Ease scores.

**Measurement and Condition:** The evaluation text should be contextually similar within each evaluation set, while its readability varies. The influence of readability on the rating will be measured using a two-way ANOVA (with prompt group and readability score as the two factors) at a significance level of p < 0.05. Text readability will be measured by the Flesch-Kincaid grade level or the Flesch Reading Ease score; the analysis below uses the Flesch Reading Ease score.

**Context:** Normal operation
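For reference, the two readability measures are fixed formulas over word, sentence, and syllable counts: Flesch Reading Ease = 206.835 − 1.015 (words/sentences) − 84.6 (syllables/words), and Flesch-Kincaid grade = 0.39 (words/sentences) + 11.8 (syllables/words) − 15.59. The sketch below implements both. The syllable counter (`count_syllables`) and the tokenizer (`text_stats`) are hypothetical rough heuristics for illustration only; the tool that produced the scores in the dataset (whose columns are spelled "Flesch-Kincade") is not stated here and will count syllables differently.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (hypothetical): count groups of consecutive vowels, including 'y'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Drop a typically silent trailing 'e' ("make"), but keep "-le" ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def text_stats(text: str) -> tuple:
    """Return (word count, sentence count, syllable count) for a text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), sentences, syllables


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # Higher score = easier to read.
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # Approximate US school grade level needed to understand the text.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


simple = "The work was good. I met my goals."
dense = (
    "Throughout the evaluation period, I consistently demonstrated "
    "exceptional organizational capabilities and collaborative initiative."
)
for text in (simple, dense):
    stats = text_stats(text)
    print(round(flesch_reading_ease(*stats), 1), round(flesch_kincaid_grade(*stats), 1))
```

The two scores move in opposite directions: the short, plain sentence gets a high reading ease and a low grade level, while the dense sentence gets the reverse.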

### Gather evidence

In [None]:
import numpy as np
import pandas as pd

from os import path
import re
from scipy.stats import f_oneway

import statsmodels.api as sm
from statsmodels.formula.api import ols

In [None]:
input_df = pd.read_csv(
    path.join(DATASETS_DIR, "5h_llm_input_inclusivity.csv")
)
input_df.head()

In [None]:
output_df = pd.read_csv(
    path.join(DATASETS_DIR, "5h_llm_output_inclusivity.csv")
)  # data file with LLM results
output_df.head()

In [None]:
# Join each LLM output to its input row via the shared row-index column.
my_df = pd.merge(input_df, output_df, on="Unnamed: 0")

# Keep only the columns needed for the analysis and give them shorter names.
my_df = my_df[
    [
        "evaluationOutput",
        "extractedOverallRating",
        "PromptGroupNum",
        "Flesch-Kincade Grade Level",
        "Flesch Reading Ease Score",
    ]
].rename(
    columns={
        "extractedOverallRating": "overallRating",
        "Flesch-Kincade Grade Level": "FKGrade",
        "Flesch Reading Ease Score": "FReadingScore",
    }
)
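The merge key `"Unnamed: 0"` is the name pandas gives to an unlabeled first column when reading a CSV. Assuming (not confirmed here) that both files were written with `DataFrame.to_csv` and its default `index=True`, this column is the saved row index. The self-contained sketch below reproduces that with small hypothetical frames and shows `validate="one_to_one"`, which makes `merge` raise an error instead of silently duplicating rows if either file contains repeated keys.

```python
import io

import pandas as pd

# Hypothetical stand-ins for the input and output CSV files.
inputs = pd.DataFrame({"PromptGroupNum": [1, 1, 2], "FReadingScore": [80.0, 40.0, 65.0]})
outputs = pd.DataFrame({"extractedOverallRating": [4, 3, 4]})


def round_trip(df: pd.DataFrame) -> pd.DataFrame:
    """Write with the default index (as to_csv does), then read it back."""
    buf = io.StringIO()
    df.to_csv(buf)
    buf.seek(0)
    return pd.read_csv(buf)


inp, out = round_trip(inputs), round_trip(outputs)
print(inp.columns.tolist())  # the saved index comes back as "Unnamed: 0"

# Fail loudly if a row index appears more than once in either file.
merged = pd.merge(inp, out, on="Unnamed: 0", how="inner", validate="one_to_one")
print(merged)
```

Reading the files with `pd.read_csv(..., index_col=0)` and merging on the index is an equivalent alternative that avoids the auto-generated column name.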

In [None]:
# Keep the model variables and cast them; PromptGroupNum is a group label, not a number.
my_df2 = my_df[["overallRating", "PromptGroupNum", "FKGrade", "FReadingScore"]]
my_df2 = my_df2.astype(
    {
        "overallRating": int,
        "PromptGroupNum": str,
        "FKGrade": float,
        "FReadingScore": float,
    }
)
my_df2
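The cast `astype({"overallRating": int, ...})` raises an error if any rating is missing or non-numeric, which can happen if a rating could not be extracted from an LLM response. Whether that occurs in this dataset is not stated; the sketch below, using a hypothetical `raw` frame, shows one way to coerce unparseable ratings to NaN, report how many there are, and drop them before casting.

```python
import pandas as pd

# Hypothetical extracted ratings: one row where no rating could be parsed.
raw = pd.DataFrame(
    {"overallRating": ["4", "3", None, "5"], "PromptGroupNum": [1, 1, 2, 2]}
)

# Anything that is not a number becomes NaN instead of raising.
ratings = pd.to_numeric(raw["overallRating"], errors="coerce")
n_missing = int(ratings.isna().sum())
print(f"{n_missing} row(s) without a usable rating")

# Drop unusable rows, then cast as in the cell above.
clean = raw.assign(overallRating=ratings).dropna(subset=["overallRating"])
clean = clean.astype({"overallRating": int, "PromptGroupNum": str})
print(clean.dtypes)
```

Reporting the count matters here: if dropped rows cluster in one reading level, the drop itself could bias the readability comparison.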

### Save evidence to the specified scenario

In [None]:
# Fit a linear model of the rating on prompt group, readability, and their interaction.
model = ols(
    "overallRating ~ C(PromptGroupNum) + FReadingScore + C(PromptGroupNum):FReadingScore",
    data=my_df2,
).fit()


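def run_anova_lm(model):
    """Run a type II ANOVA on a fitted model and test the readability main effect.

    Prints the ANOVA table and a pass/fail verdict, and returns [F, p] for FReadingScore.
    """
    res = sm.stats.anova_lm(model, typ=2)
    print(res)

    f_rs = res.loc["FReadingScore", "F"]
    p_rs = res.loc["FReadingScore", "PR(>F)"]

    # The rating should not depend on readability: a significant effect fails the test.
    if p_rs < 0.05:
        print("fail test")
    else:
        print("pass test")

    return [f_rs, p_rs]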


res = run_anova_lm(model)
print(res)

In [None]:
from mlte.evidence.types.array import Array
from mlte.measurement.external_measurement import ExternalMeasurement

am_measurement = ExternalMeasurement(
    "eval not dependent on writing level", Array, run_anova_lm
)

# Run the measurement on the fitted model, then save the result as evidence.
result = am_measurement.evaluate(model)

print(result)
result.save(force=True)
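The pass/fail rule in `run_anova_lm` looks only at the `FReadingScore` main effect. A readability effect confined to some prompt groups would instead show up in the `C(PromptGroupNum):FReadingScore` interaction row. Whether that row should count toward the verdict is a judgment the scenario does not settle; if it should, a hypothetical helper like the one below could inspect both rows of an `anova_lm` table. It is plain pandas and not part of the MLTE API.

```python
import pandas as pd


def readability_terms_significant(anova_table: pd.DataFrame, alpha: float = 0.05) -> dict:
    """Hypothetical helper: p-value and significance flag for each readability term.

    Checks the FReadingScore main effect AND its interaction with prompt group;
    terms absent from the table are skipped.
    """
    terms = ["FReadingScore", "C(PromptGroupNum):FReadingScore"]
    flags = {}
    for term in terms:
        if term in anova_table.index:
            p = float(anova_table.loc[term, "PR(>F)"])
            flags[term] = (p, p < alpha)
    return flags


# Hand-made table with the same layout as anova_lm output (numbers are illustrative only).
example = pd.DataFrame(
    {"F": [1.2, 0.8, 4.1, None], "PR(>F)": [0.31, 0.37, 0.02, None]},
    index=["C(PromptGroupNum)", "FReadingScore", "C(PromptGroupNum):FReadingScore", "Residual"],
)
flags = readability_terms_significant(example)
print(flags)
print("fail test" if any(sig for _, sig in flags.values()) else "pass test")
```

Testing two terms at the same alpha slightly raises the chance of a false failure; a stricter per-term threshold (for example alpha / 2) is one way to compensate.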