## Generating Data for Ground Truth Evaluation

To generate ground truth summaries for our data, we first need an input dataset. In this case we use threads from the [Thunderbird public mailing list](https://thunderbird.topicbox.com/latest). To generate the ground truth and later evaluate the model, we need at least 100 samples to start with, where a sample is a single email or a single email conversation.

Our selection criteria: 

+ Collect 100 samples of email thread conversations, as recent as possible and fairly complete so they can be evaluated
+ Clean them of email formatting such as `>`
+ Keep the inputs short: BART, the baseline model we're using, accepts a 1024-token context window as input, so we want email threads of at most approximately 1000 words. A word often maps to more than one token, so we stay on the conservative side
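
The cleaning and length criteria above can be sketched as follows. This is a minimal illustration and not the preprocessing used to build the dataset: the helper names `strip_quote_markers` and `fits_word_budget` are hypothetical, and the whitespace word count is only a rough proxy for the model's token count.

```python
import re

# Matches one or more leading ">" quote markers, with optional spaces between them.
QUOTE_PREFIX = re.compile(r"^\s*(?:>\s?)+")


def strip_quote_markers(text: str) -> str:
    """Remove leading '>' markers from each line and collapse blank-line runs."""
    cleaned = [QUOTE_PREFIX.sub("", line).rstrip() for line in text.splitlines()]
    out = []
    for line in cleaned:
        # Keep non-empty lines, and at most one blank line between them.
        if line or (out and out[-1]):
            out.append(line)
    return "\n".join(out).strip()


def fits_word_budget(text: str, max_words: int = 1000) -> bool:
    """Rough length check based on whitespace-delimited words (assumed budget)."""
    return len(text.split()) <= max_words


sample = "Hi all,\n> > old quote\n>\n>\n> reply text\nThanks"
print(strip_quote_markers(sample))
```

Only markers at the start of a line are removed, so a `>` that appears inside a sentence is left alone.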

Once we've collected them, we'd like to take a look at the data before we generate summaries. 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import os
import requests
import json

# wrap columns for inspection
pd.set_option('display.max_colwidth', 0)
# stylesheet for visibility
plt.style.use("fast")

In [None]:
APP_URL = os.environ.get('APP_URL')
LOCAL_APP_URL = os.environ.get('LOCAL_APP_URL')

In [None]:
dataset_id = "db7ff8c2-a255-4d75-915d-77ba73affc53"
r = requests.get(f"{APP_URL}/api/v1/datasets/{dataset_id}")
print(json.dumps(r.json(), indent=2))

In [None]:
# load into pandas
df = pd.read_csv('thunderbird_samples.csv')

In [None]:
# Examine a single sample
# the text of each thread is stored in the 'examples' column
df['examples'].iloc[0]

In [None]:
# Add a character-count column as a rough proxy for model input length

df['char_count'] = df['examples'].str.len()

In [None]:
# Show dataframe 
df.head()

In [None]:
# Preprocess data 
df['preprocessed'] = df['examples'].apply(lambda x: x.replace("'", "").lower().strip())
df.head()

In [None]:
df['char_count'].describe()

In [None]:
# Plot character counts 
fig, ax = plt.subplots(figsize=(12, 6))
ax.hist(df['char_count'], bins=30)
ax.set_xlabel('Character Count')
ax.set_ylabel('Frequency')

stats = df['char_count'].describe().apply(lambda x: f"{x:.0f}")

# Add text boxes for statistics
plt.text(1.05, 0.95, stats.to_string(), 
         transform=ax.transAxes, verticalalignment='top')

# Adjust layout
plt.tight_layout()
fig.subplots_adjust(right=0.75)

plt.show()

We can see from the chart that about half of our email threads are on the shorter side; however, 50% are longer than 1300 characters, which may be an issue for the model. This is something to watch out for as we begin running inference. If we wanted to be precise, we could tokenize each row with the [BART tokenizer](https://huggingface.co/docs/transformers/en/model_doc/bart#transformers.BartTokenizer) to get the true token counts that are input into the model.
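
A minimal sketch of that token-count check is shown below. The helpers `token_counts` and `over_limit` are hypothetical names, and the tokenizer is passed in as a plain callable so the logic can run without downloading a model. The commented lines show one way the callable might be built with Hugging Face `transformers`; the checkpoint name there is an assumption and should match whichever BART model is actually deployed.

```python
from typing import Callable, Iterable, List

BART_MAX_TOKENS = 1024  # context window stated earlier in this notebook


def token_counts(texts: Iterable[str], encode: Callable[[str], List[int]]) -> List[int]:
    """Return the number of tokens `encode` produces for each text."""
    return [len(encode(text)) for text in texts]


def over_limit(counts: List[int], limit: int = BART_MAX_TOKENS) -> List[bool]:
    """Flag inputs whose token count exceeds the model limit."""
    return [count > limit for count in counts]


# With transformers installed, `encode` could be built like this
# (the checkpoint name is an assumption, not taken from this notebook):
#
#   from transformers import AutoTokenizer
#   tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
#   encode = lambda text: tok(text)["input_ids"]
#   df["token_count"] = token_counts(df["preprocessed"], encode)
#   df["too_long"] = over_limit(df["token_count"].tolist())
```

Rows flagged as too long would be truncated by the model or would need to be shortened before inference, so it is worth counting them up front.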

In [None]:
headers = {"Content-Type": "application/json"}

def remove_single_quotes(text):
    return text.replace("'", "")

responses = []

# generate 10 instances of ground truth 
for string in df['preprocessed'][0:10]:
    response = requests.post(f"{APP_URL}/api/v1/ground-truth/deployments/97fd0628-e9b6-49e9-9a67-f36bab0fb3aa", headers=headers, data=json.dumps({"text": string}))
    print(string, response.text)
    responses.append((string, response.text))

results_df = pd.DataFrame(responses, columns=['Original', 'Response'])

In [None]:
results_df

In [None]:
responses

In [None]:
import pandas as pd
import json


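# Extract the result value from the JSON string; returns None when the
# response is not valid JSON or lacks the expected keys
def extract_result(json_string):
    try:
        return json.loads(json_string)['deployment_response']['result']
    except (json.JSONDecodeError, KeyError, TypeError):
        return None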

# Create the DataFrame
responses_clean = pd.DataFrame(responses, columns=['first_value', 'json_string'])
responses_clean['result'] = responses_clean['json_string'].apply(extract_result)

# Drop the intermediate 'json_string' column
responses_clean = responses_clean[['first_value', 'result']]

print(responses_clean)
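
Because a failed extraction yields an empty value rather than an error, it can help to check which responses did not parse before saving. The sketch below is one way to do that with the standard library only. `parse_result` and `failed_indices` are hypothetical helpers, and the payloads in `sample` are made up for illustration; only the `deployment_response` and `result` keys come from the code above.

```python
import json
from typing import List, Optional, Tuple


def parse_result(raw: str) -> Optional[str]:
    """Return deployment_response.result, or None if the payload is malformed."""
    try:
        return json.loads(raw)["deployment_response"]["result"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None


def failed_indices(pairs: List[Tuple[str, str]]) -> List[int]:
    """Indices of (input, raw_response) pairs whose result could not be extracted."""
    return [i for i, (_, raw) in enumerate(pairs) if parse_result(raw) is None]


# Hypothetical payloads, for illustration only
sample = [
    ("first thread", '{"deployment_response": {"result": "a short summary"}}'),
    ("second thread", '{"detail": "Internal Server Error"}'),
    ("third thread", "not json"),
]
print(failed_indices(sample))  # prints [1, 2]
```

Calling `failed_indices(responses)` on the list built earlier would show which of the requests need to be retried.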

In [None]:
responses_clean.to_csv('output.csv', index=False)