# Span Evaluator Example
This notebook demonstrates how to use the `SpanEvaluator` class to create spans from a DataFrame of tokens and evaluate predictions.

In [1]:
# Import required libraries
import pandas as pd
from presidio_evaluator.evaluation.span_evaluator import SpanEvaluator
from presidio_evaluator.data_objects import Span



## Example DataFrame
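Below is a sample DataFrame representing tokenized text, with one row per token and columns for the sentence id (`sentence_id`), the token text (`token`), the gold label (`annotation`), the predicted label (`prediction`), and the token's starting character offset (`start`). Tokens that belong to no entity are labeled `O`.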

In [5]:
# Create a sample DataFrame with partially matching annotations and predictions
sample_data = [
    {"sentence_id": 1, "token": "John", "annotation": "PERSON", "prediction": "PERSON", "start": 0},
    {"sentence_id": 1, "token": "Doe", "annotation": "PERSON", "prediction": "O", "start": 5},
    {"sentence_id": 1, "token": "lives", "annotation": "O", "prediction": "O", "start": 9},
    {"sentence_id": 1, "token": "in", "annotation": "O", "prediction": "O", "start": 15},
    {"sentence_id": 1, "token": "New", "annotation": "LOCATION", "prediction": "LOCATION", "start": 18},
    {"sentence_id": 1, "token": "York", "annotation": "LOCATION", "prediction": "O", "start": 22},
    {"sentence_id": 1, "token": "City", "annotation": "LOCATION", "prediction": "LOCATION", "start": 27},
    {"sentence_id": 1, "token": ".", "annotation": "O", "prediction": "O", "start": 31},
    {"sentence_id": 2, "token": "Jane", "annotation": "PERSON", "prediction": "O", "start": 33},
    {"sentence_id": 2, "token": "Smith", "annotation": "PERSON", "prediction": "PERSON", "start": 38},
    {"sentence_id": 2, "token": "visited", "annotation": "O", "prediction": "O", "start": 44},
    {"sentence_id": 2, "token": "Paris", "annotation": "LOCATION", "prediction": "LOCATION", "start": 52},
    {"sentence_id": 2, "token": "last", "annotation": "O", "prediction": "O", "start": 58},
    {"sentence_id": 2, "token": "summer", "annotation": "O", "prediction": "O", "start": 63},
    {"sentence_id": 2, "token": ".", "annotation": "O", "prediction": "O", "start": 69},
]

df = pd.DataFrame(sample_data)
df

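| | sentence_id | token | annotation | prediction | start |
|---|---|---|---|---|---|
| 0 | 1 | John | PERSON | PERSON | 0 |
| 1 | 1 | Doe | PERSON | O | 5 |
| 2 | 1 | lives | O | O | 9 |
| 3 | 1 | in | O | O | 15 |
| 4 | 1 | New | LOCATION | LOCATION | 18 |
| 5 | 1 | York | LOCATION | O | 22 |
| 6 | 1 | City | LOCATION | LOCATION | 27 |
| 7 | 1 | . | O | O | 31 |
| 8 | 2 | Jane | PERSON | O | 33 |
| 9 | 2 | Smith | PERSON | PERSON | 38 |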
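The `start` column holds character offsets into the original text. As a rough illustration of where such offsets come from, the sketch below recomputes them from a sentence string. The `text` value and the `token_starts` helper are assumptions made for this example: the notebook does not show the raw text, so it is reconstructed here by joining the tokens with single spaces.

```python
def token_starts(text, tokens):
    """Return the character offset at which each token begins in `text`."""
    starts = []
    cursor = 0
    for token in tokens:
        # Search from the cursor so repeated tokens (such as ".")
        # map to successive occurrences rather than the first one.
        index = text.index(token, cursor)
        starts.append(index)
        cursor = index + len(token)
    return starts


# Assumed raw text, reconstructed from the tokens in the sample data.
text = "John Doe lives in New York City. Jane Smith visited Paris last summer."
tokens = ["John", "Doe", "lives", "in", "New", "York", "City", ".",
          "Jane", "Smith", "visited", "Paris", "last", "summer", "."]

starts = token_starts(text, tokens)
print(starts)
```

Under that assumption, the computed offsets agree with the `start` values in `sample_data`.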


## Create Spans from Tokens
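Use the `_create_spans` method to merge adjacent tokens with the same entity label into spans. The method is called once per label column: `annotation` for the gold spans and `prediction` for the predicted spans. The leading underscore follows the Python convention for a private helper, so its signature may change between releases.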

In [6]:
# Initialize the SpanEvaluator
span_evaluator = SpanEvaluator()

# Create annotation spans
annotation_spans = span_evaluator._create_spans(df, "annotation")

# Create prediction spans
prediction_spans = span_evaluator._create_spans(df, "prediction")

# Display the created spans
print("Annotation Spans:")
for span in annotation_spans:
    print(span)

print("\nPrediction Spans:")
for span in prediction_spans:
    print(span)

Annotation Spans:
Span(type: PERSON, value: ['John', 'Doe'], char_span: [0: 9])
Span(type: LOCATION, value: ['New', 'York', 'City'], char_span: [18: 31])
Span(type: PERSON, value: ['Jane', 'Smith'], char_span: [33: 44])
Span(type: LOCATION, value: ['Paris'], char_span: [52: 58])

Prediction Spans:
Span(type: PERSON, value: ['John'], char_span: [0: 5])
Span(type: LOCATION, value: ['New'], char_span: [18: 22])
Span(type: LOCATION, value: ['City'], char_span: [27: 31])
Span(type: PERSON, value: ['Smith'], char_span: [38: 44])
Span(type: LOCATION, value: ['Paris'], char_span: [52: 58])
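To make the merging step concrete, here is a minimal pure-Python sketch of the idea, not the library's implementation. The `merge_tokens` helper and its dictionary output are hypothetical. It also assumes a span ends at the last token's start plus its length, which need not match the `char_span` convention printed above.

```python
from itertools import groupby


def merge_tokens(rows, label_key):
    """Merge runs of adjacent tokens that share a non-"O" label."""
    spans = []
    # Group by sentence first so a span never crosses a sentence boundary,
    # then by label so each run of identical labels forms one group.
    for _, sentence in groupby(rows, key=lambda r: r["sentence_id"]):
        for label, run in groupby(sentence, key=lambda r: r[label_key]):
            run = list(run)
            if label == "O":
                continue  # tokens outside any entity produce no span
            last = run[-1]
            spans.append({
                "type": label,
                "tokens": [r["token"] for r in run],
                "start": run[0]["start"],
                # Assumption: end = start of the last token + its length.
                "end": last["start"] + len(last["token"]),
            })
    return spans


# Three rows taken from the sample data above.
rows = [
    {"sentence_id": 1, "token": "New", "annotation": "LOCATION", "prediction": "LOCATION", "start": 18},
    {"sentence_id": 1, "token": "York", "annotation": "LOCATION", "prediction": "O", "start": 22},
    {"sentence_id": 1, "token": "City", "annotation": "LOCATION", "prediction": "LOCATION", "start": 27},
]

annotation = merge_tokens(rows, "annotation")
prediction = merge_tokens(rows, "prediction")
print(annotation)
print(prediction)
```

Because the prediction for "York" is `O`, the single annotated location splits into two predicted spans ("New" and "City"), mirroring the pattern in the output above.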


## Evaluate Predictions
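Use the `evaluate` method to compute precision, recall, and F1 score for the predictions. The result reports overall scores, a per-entity-type breakdown under `per_type`, and an `error_analysis` entry.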

In [8]:
# Evaluate the predictions
results = span_evaluator.evaluate(df)
results

{'precision': 1.0,
 'recall': 1.0,
 'f1': 1.0,
 'per_type': {'PERSON': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0},
  'LOCATION': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0}},
 'error_analysis': {}}
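For reference, precision, recall, and F1 over spans can be sketched as follows. This is an illustrative exact-match version only: the `span_prf` helper and the example spans are hypothetical, and `SpanEvaluator` may apply a different matching rule, so this is not a reimplementation of `evaluate` and its numbers should not be expected to agree with the output above.

```python
def span_prf(gold, predicted):
    """Precision, recall and F1 over spans, counting only exact matches."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical spans as (type, start, end) tuples.
gold = [("PERSON", 0, 8), ("LOCATION", 18, 31)]
predicted = [("PERSON", 0, 8), ("LOCATION", 18, 21)]

scores = span_prf(gold, predicted)
print(scores)
```

Here one of the two predicted spans matches a gold span exactly, so precision, recall, and F1 all come out to 0.5. A partial-overlap criterion would score the truncated location span differently.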

You have now seen how to use the `SpanEvaluator` to create spans and evaluate predictions from a token-level DataFrame.