In [None]:
import json

with open("data/data.json") as f:
    data = json.load(f)

In [None]:
# visualize the data

text_sample = [" ".join(d["tokens"]) for d in data]
print(f"Text Sample: {text_sample[:5]}")
label_sample = [d["bio_tags"] for d in data]
print(f"Label Sample: {label_sample[:5]}")

- Look at the data breakdown
- Split into train and test sets
- Text data representation (simple embedding vs. contextual embedding)

In [None]:
# Let's see the distribution of the labels

from collections import Counter

label_counter = Counter()
for labels in [d["bio_tags"] for d in data]:
    label_counter.update(labels)

print(label_counter)

In [None]:
# look at data with LOC tag

[d for d in data if "B-LOC" in d["bio_tags"]]

In [None]:
# There are not enough LOC tags in the data, so we drop every example that contains one
data = [d for d in data if "B-LOC" not in d["bio_tags"]]

In [None]:
from src.utils import extract_spans

# Let's see the distribution of the spans using a different data format

for d in data:
    spans = extract_spans(d["bio_tags"])
    d["labels"] = spans


data[:2]
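
The implementation of `extract_spans` lives in `src.utils` and is not shown here. As a rough illustration of what a BIO-to-span conversion can look like, here is a minimal sketch. The function name `bio_to_spans` is hypothetical, and the sketch assumes spans are `(start, end, label)` tuples with an exclusive `end`, which matches how the spans are sliced in the next cell. It also assumes that an `I-` tag with no matching open span is ignored; the real helper may handle that case differently.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags to (start, end, label) spans with an exclusive end."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an I- tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if not continues:
            # Close the open span, if any, at the current position.
            if label is not None:
                spans.append((start, i, label))
                start, label = None, None
            # A B- tag opens a new span; O and stray I- tags open nothing.
            if tag.startswith("B-"):
                start, label = i, tag[2:]
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((start, len(tags), label))
    return spans


example_tags = ["B-SVC", "I-SVC", "O", "O", "O", "O", "B-ENV", "I-ENV", "I-ENV"]
print(bio_to_spans(example_tags))  # [(0, 2, 'SVC'), (6, 9, 'ENV')]
```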


In [None]:

span_counter = Counter()
for d in data:
    for span in d["labels"]:
        text = " ".join(d["tokens"][span[0]:span[1]]).lower()
        span_counter.update([f"{text}, {span[2]}"])

print(span_counter.most_common(5))


### Training and Testing Sets in Machine Learning



In machine learning, data is split into **training** and **testing sets** for two main reasons: model learning and model evaluation.



#### **1. Training Set:**

- **Purpose:** The training set is used to **train** or fit the model. It includes both the input data and the corresponding correct outputs.

- **Why:** The model learns to recognize patterns and make predictions by adjusting its parameters based on the training data.



#### **2. Testing Set:**

- **Purpose:** The testing set is used to **evaluate** the performance of the model. It is separate from the training data and also includes the correct outputs.

- **Why:** Using a separate testing set helps to assess how well the model can generalize to new, unseen data. This is crucial for understanding the model's effectiveness and ensuring it isn't just memorizing the training data (a problem known as overfitting).



In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=42)

# save the split data
with open("data/train.json", "w") as f:
    json.dump(train, f)
with open("data/test.json", "w") as f:
    json.dump(test, f)

# Evaluation

Online vs. offline metrics:

Online: measure the AI/ML system's performance in the real world.

Offline: measure the AI/ML system's performance on existing data while building the model.

Challenge: these two don’t always agree.

In offline evaluation, we can measure how “good” the keywords we extract from our text are.

In online evaluation, we can measure success in two ways:
- Implicit: a user clicking through, or adding a suggestion to an incident, implies good labels
- Explicit: suggest labels for the user to accept or reject

“Good” can be subjective:

- Precision, recall, and F1 (small illustration in notebook)
- High recall + low precision: preferred when false negatives are costly, e.g. fraud detection
- High precision + low recall: preferred when false positives are costly, e.g. legal
- Balanced: both are important

**Question**:  What kind of mix do we want in this case?
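
For reference, the standard definitions in terms of true positives ($TP$), false positives ($FP$), and false negatives ($FN$), here assumed to be counted over predicted spans:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$

As a small worked case: if a text has two true entities and the model predicts only one of them, correctly, then $TP = 1$, $FP = 0$, $FN = 1$, so precision is $1$, recall is $0.5$, and $F_1 = \frac{2 \cdot 1 \cdot 0.5}{1.5} \approx 0.67$.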

In [None]:
# Offline evaluation

sample_text = "google drive service high latency for prod us 1"

prediction_1 = [
    ("google drive", "SVC"),  # correct
    ("prod us 1", "ENV"),     # correct
]
# 2/2

prediction_2 = [
    ("google drive", "SVC"),  # correct
    ("prod us 1", "SVC"),     # incorrect
]
# 1/2

prediction_3 = [
    ("google drive", "SVC"),  # correct
]
# 1/2

prediction_4 = [
    ("service", "SVC"),       # incorrect
]
# 0/2

prediction_5 = [
]
# 0/2

# A prediction is considered correct if both the span and the label are correct

In [None]:
from src.utils import evaluate_ner

evaluate_ner(
    [{"tokens": sample_text.split(), "labels": [(0, 2, 'SVC'), (6, 9, 'ENV')]}],
    [{"tokens": sample_text.split(), "labels": [(0, 2, 'SVC'), (6, 9, 'SVC')]}],
)
# evaluate_ner(prediction_1, sample_text)
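
`evaluate_ner` comes from `src.utils`, and its internals and return format are not shown here. To make the exact-match rule concrete, here is a minimal sketch of span-level scoring. The function name `span_prf` and its return format are hypothetical. A predicted span counts as correct only if its start, end, and label all match a gold span. Spans are converted to tuples first, because spans reloaded from JSON come back as lists.

```python
from typing import Dict, List


def span_prf(gold: List[dict], pred: List[dict]) -> Dict[str, float]:
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        # Tuples are hashable, so spans can be compared as sets.
        gold_spans = set(map(tuple, g["labels"]))
        pred_spans = set(map(tuple, p["labels"]))
        tp += len(gold_spans & pred_spans)   # span and label both match
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [{"labels": [(0, 2, "SVC"), (6, 9, "ENV")]}]
pred = [{"labels": [(0, 2, "SVC"), (6, 9, "SVC")]}]
print(span_prf(gold, pred))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

In this example the mislabeled span is counted twice: once as a false positive (the wrong `SVC` prediction) and once as a false negative (the missed `ENV` span). This is why a single wrong label lowers both precision and recall.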