# Data preparation

Before we start training the model, we should take a look at the labelled data.

In [None]:
import json
with open("data/labelled_data.json") as f:
    data = json.load(f)

In [None]:
# visualize the data

text_sample = [" ".join(d["tokens"]) for d in data]
print(f"Text Sample: {text_sample[:5]}")
label_sample = [d["bio_tags"] for d in data]
print(f"Label Sample: {label_sample[:5]}")

In [None]:
# Let's see the distribution of the labels

from collections import Counter

label_counter = Counter()
for labels in [d["bio_tags"] for d in data]:
    label_counter.update(labels)

print(label_counter)

In [None]:
# look at data with LOC tag

[d for d in data if "B-LOC" in d["bio_tags"]]

There are too few LOC tags in the data for the model to learn to recognize them in any meaningful way.

It's best to drop these data points and perhaps revisit location labels once more data is available.

In [None]:
data = [d for d in data if "B-LOC" not in d["bio_tags"]]

In [None]:
# Generate the span labels format for the data to help evaluation later

from src.utils import extract_spans

for d in data:
    d["labels"] = extract_spans(d["bio_tags"])

data[:2]
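`extract_spans` is imported from `src.utils`, and its source is not shown in this notebook. Judging from how the spans are used (`d["tokens"][span[0]:span[1]]` for the text and `span[2]` for the label), each span appears to be a `(start, end, label)` tuple with an exclusive end index. The sketch below shows one way such a conversion could work. The name `bio_to_spans` is hypothetical, and treating a stray `I-` tag as the start of a new span is an assumption, not necessarily what `extract_spans` does.

```python
def bio_to_spans(bio_tags):
    """Convert BIO tags to (start, end, label) spans with an exclusive end.

    Hypothetical stand-in for src.utils.extract_spans, which is not shown.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(bio_tags):
        # An I- tag that continues the currently open span needs no action.
        if tag.startswith("I-") and label == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if label is not None:
            spans.append((start, i, label))
            start, label = None, None
        # B- opens a new span; a stray I- is treated like B- (an assumption).
        if tag.startswith(("B-", "I-")):
            start, label = i, tag[2:]
    # Close a span that runs to the end of the sentence.
    if label is not None:
        spans.append((start, len(bio_tags), label))
    return spans


tags = ["B-SVC", "I-SVC", "O", "O", "O", "O", "B-ENV", "I-ENV", "I-ENV"]
print(bio_to_spans(tags))  # [(0, 2, 'SVC'), (6, 9, 'ENV')]
```

With an exclusive end index, `tokens[start:end]` recovers the span text directly, which is why the slicing in the cells that follow needs no `+ 1`.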


In [None]:
# Look at the most common spans and all the unique spans for SVC and ENV

span_counter = Counter()
for d in data:
    for span in d["labels"]:
        text = " ".join(d["tokens"][span[0]:span[1]]).lower()
        span_counter.update([f"{text}, {span[2]}"])

print(span_counter.most_common(5))

svc_set = set()
env_set = set()

for d in data:
    for span in d["labels"]:
        text = " ".join(d["tokens"][span[0]:span[1]]).lower()
        if span[2] == "SVC":
            svc_set.add(text)
        elif span[2] == "ENV":
            env_set.add(text)



print(f"SVC: {svc_set}")
print(f"ENV: {env_set}")



### Training and Test Sets in Machine Learning


During development of an ML model, the data is split into a **training set** and a **test set**, used for model learning and model evaluation respectively.


#### **1. Training Set:**

- **Purpose:** The training set is used to **train** or fit the model. It includes both the input data and the corresponding correct outputs.

- **Why:** The model learns to recognize patterns and make predictions by adjusting its parameters based on the training data.



#### **2. Test Set:**

- **Purpose:** The test set is used to **evaluate** the performance of the model. It is kept separate from the training data and also includes the correct outputs.

- **Why:** Using a separate test set helps to assess how well the model generalizes to new, unseen data. This is crucial for understanding the model's effectiveness and ensuring it isn't just memorizing the training data (a problem known as overfitting).


In zero-shot learning, you don't need a training set, but you should still have a test set.

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.1, random_state=42)

# save the split data
with open("data/train.json", "w") as f:
    json.dump(train, f)
with open("data/test.json", "w") as f:
    json.dump(test, f)

# Evaluation

We want to be able to measure how good our model is at completing the task.

The evaluation process uses data that the model has not seen during training. We judge how well the model is doing by calculating evaluation metrics on that data.


There are typically two types of metrics:

- Online: measure the AI/ML system's performance in the real world or in production
- Offline: measure the AI/ML system's performance using existing data while building the model


### Offline evaluation

Decide which offline metrics matter most for measuring model performance:

- Precision, recall, and F1
- Favor high recall when false negatives are costly, e.g. fraud detection
- Favor high precision when false positives are costly, e.g. legal use cases
- Favor a balance of the two (F1) when both matter
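For reference, these are the standard definitions of the three metrics. Here $TP$, $FP$, and $FN$ are counts of true positives, false positives, and false negatives. Counting them at the span level (a predicted span is a true positive only if both its boundaries and its label match a gold span) is an assumption about how this notebook scores predictions.

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

As a worked example, a prediction with one correct span, one wrong span, and one missed gold span has $TP = 1$, $FP = 1$, $FN = 1$, so precision, recall, and $F_1$ are all $0.5$.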

**Question**:  What kind of mix do we want in this case?


### Online evaluation
In online evaluation, we can measure success in two ways:
- Implicit: a user clicking on a suggestion (click-through rate, CTR) or adding a suggestion to an incident implies the labels are good
- Explicit: suggest labels for the user to accept or reject

We can then use these signals as a proxy for how well the model is doing in production (e.g. a 5% CTR on suggestions).

We compare this value either to a goal we set ourselves or, if we have an existing solution, to the value for that solution, using A/B testing or shadow scoring.

A/B testing: randomly assign which model is used in production for each user or request

Shadow scoring: run the new model on production inputs, but don't show its outputs to end users
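The two rollout strategies can be sketched as follows. All names here (`assign_variant`, `shadow_score`, `treatment_share`, and the model callables) are hypothetical and only illustrate the idea. The hash-based bucketing is one common way to get a stable pseudo-random assignment per user, and is an assumption rather than something this notebook prescribes.

```python
import hashlib


def assign_variant(user_id, treatment_share=0.5):
    """A/B testing: pick a model for a user, stable across requests."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "new_model" if bucket < treatment_share else "old_model"


def shadow_score(text, old_model, new_model, shadow_log):
    """Shadow scoring: serve the old model, record the new one for comparison."""
    served = old_model(text)
    # The new model's output is only logged, never returned to the user.
    shadow_log.append({"input": text, "served": served, "shadow": new_model(text)})
    return served


# Toy models standing in for real NER systems (hypothetical).
old = lambda text: [(0, 2, "SVC")]
new = lambda text: [(0, 2, "SVC"), (6, 9, "ENV")]

log = []
result = shadow_score("google drive service high latency for prod us 1", old, new, log)
print(result)            # what the user sees: output of the old model
print(log[0]["shadow"])  # what the new model would have returned
```

Hashing the user ID rather than drawing a fresh random number per request keeps each user in the same group, which makes metrics such as CTR easier to compare between the two groups.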

In [None]:
# Offline evaluation

sample_text = "google drive service high latency for prod us 1"

prediction_1 = [
    ("google drive", "SVC"),  # correct
    ("prod us 1", "ENV"),     # correct
]
# 2/2

prediction_2 = [
    ("google drive", "SVC"),  # correct
    ("prod us 1", "SVC"),     # incorrect
]
# 1/2

prediction_3 = [
    ("google drive", "SVC"),  # correct
]
# 1/2

prediction_4 = [
    ("service", "SVC"),       # incorrect
]
# 0/2

prediction_5 = [
]
# 0/2

# A prediction is considered correct if both the span and the label are correct

In [None]:
from src.utils import evaluate_ner

evaluate_ner(
    [{"tokens": sample_text.split(), "labels": [(0, 2, 'SVC'), (6, 9, 'ENV')]}],
    [{"tokens": sample_text.split(), "labels": [(0, 2, 'SVC'), (6, 9, 'SVC')]}],
)
# evaluate_ner(prediction_1, sample_text)
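`evaluate_ner` comes from `src.utils`, and neither its implementation nor its return format is shown here. As a rough illustration of the exact-match rule stated above (a prediction counts only if both the span and the label are correct), the sketch below computes span-level precision, recall, and F1. The function name `span_prf` and the returned dictionary are assumptions, and the real `evaluate_ner` may differ, for example by reporting per-label scores.

```python
def span_prf(gold_docs, pred_docs):
    """Exact-match span scoring: a hypothetical sketch, not src.utils.evaluate_ner."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_docs, pred_docs):
        gold_spans = set(map(tuple, gold["labels"]))
        pred_spans = set(map(tuple, pred["labels"]))
        tp += len(gold_spans & pred_spans)   # span and label both match
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


tokens = "google drive service high latency for prod us 1".split()
gold = [{"tokens": tokens, "labels": [(0, 2, "SVC"), (6, 9, "ENV")]}]
pred = [{"tokens": tokens, "labels": [(0, 2, "SVC"), (6, 9, "SVC")]}]

scores = span_prf(gold, pred)
print(scores)  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Here the mislabeled `prod us 1` span counts twice: once as a false positive (a predicted `SVC` span that is not in the gold data) and once as a false negative (the gold `ENV` span that was missed).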