### Token-Level Evaluation for Named Entity Recognition (NER)

* Token-level evaluation is common in NER model assessment, but it scores each token independently and so does not measure whether whole entities were recognized correctly.
* A named entity can span multiple tokens, so a prediction can get most tokens right and still get the entity wrong.
* The following is based on the [nervaluate](https://github.com/MantisAI/nervaluate/) library documentation.


In [2]:
pip install nervaluate

Collecting nervaluate
  Downloading nervaluate-0.1.8-py3-none-any.whl (24 kB)
Installing collected packages: nervaluate
Successfully installed nervaluate-0.1.8

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [6]:
from nervaluate import Evaluator
true = [
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O'],
]

pred = [
    ['O', 'O', 'B-PER', 'I-PER', 'O'],
    ['O', 'B-LOC', 'I-LOC', 'B-LOC', 'I-LOC', 'O'],
]

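# Compare the gold and predicted tag sequences for the LOC and PER entity types.
evaluator = Evaluator(true, pred, tags=['LOC', 'PER'], loader="list")

# evaluate() returns the overall results and a per-tag breakdown.
results, results_by_tag = evaluator.evaluate()
results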

{'ent_type': {'correct': 3,
  'incorrect': 0,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'possible': 3,
  'actual': 3,
  'precision': 1.0,
  'recall': 1.0,
  'f1': 1.0},
 'partial': {'correct': 3,
  'incorrect': 0,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'possible': 3,
  'actual': 3,
  'precision': 1.0,
  'recall': 1.0,
  'f1': 1.0},
 'strict': {'correct': 3,
  'incorrect': 0,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'possible': 3,
  'actual': 3,
  'precision': 1.0,
  'recall': 1.0,
  'f1': 1.0},
 'exact': {'correct': 3,
  'incorrect': 0,
  'partial': 0,
  'missed': 0,
  'spurious': 0,
  'possible': 3,
  'actual': 3,
  'precision': 1.0,
  'recall': 1.0,
  'f1': 1.0}}

In [7]:
results_by_tag

{'LOC': {'ent_type': {'correct': 2,
   'incorrect': 0,
   'partial': 0,
   'missed': 0,
   'spurious': 0,
   'possible': 2,
   'actual': 2,
   'precision': 1.0,
   'recall': 1.0,
   'f1': 1.0},
  'partial': {'correct': 2,
   'incorrect': 0,
   'partial': 0,
   'missed': 0,
   'spurious': 0,
   'possible': 2,
   'actual': 2,
   'precision': 1.0,
   'recall': 1.0,
   'f1': 1.0},
  'strict': {'correct': 2,
   'incorrect': 0,
   'partial': 0,
   'missed': 0,
   'spurious': 0,
   'possible': 2,
   'actual': 2,
   'precision': 1.0,
   'recall': 1.0,
   'f1': 1.0},
  'exact': {'correct': 2,
   'incorrect': 0,
   'partial': 0,
   'missed': 0,
   'spurious': 0,
   'possible': 2,
   'actual': 2,
   'precision': 1.0,
   'recall': 1.0,
   'f1': 1.0}},
 'PER': {'ent_type': {'correct': 1,
   'incorrect': 0,
   'partial': 0,
   'missed': 0,
   'spurious': 0,
   'possible': 1,
   'actual': 1,
   'precision': 1.0,
   'recall': 1.0,
   'f1': 1.0},
  'partial': {'correct': 1,
   'incorrect': 0,
   'parti

### Challenges in Token-Level Evaluation: I

**Scenario I**: Full Match

| Token   | Gold   | Prediction |
| ------- | ------ | ---------- |
| in      | O      | O          |
| New     | B-LOC  | B-LOC      |
| York    | I-LOC  | I-LOC      |
| .       | O      | O          |

- In this scenario, the gold standard and the prediction match both in terms of the surface string and entity type.
- Token-level metrics like precision and recall work well in this case.




### Challenges in Token-Level Evaluation: II

**Scenario II**: System Hypothesized an Incorrect Entity

| Token   | Gold   | Prediction |
| ------- | ------ | ---------- |
| an      | O      | O          |
| Awful   | O      | B-ORG      |
| Headache| O      | I-ORG      |
| in      | O      | O          |

- Here, the NER system predicts an organization entity (B-ORG, I-ORG) where the gold standard has none: both tokens are ordinary words (O).


### Challenges in Token-Level Evaluation: III

**Scenario III**: System Misses an Entity

| Token   | Gold   | Prediction |
| ------- | ------ | ---------- |
| in      | O      | O          |
| Palo    | B-LOC  | O          |
| Alto    | I-LOC  | O          |
| ,       | O      | O          |

- In this scenario, the NER system completely misses the named entity (Palo Alto) that exists in the gold standard.
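
To make Scenarios I to III concrete, here is a minimal sketch of token-level precision and recall over the three tag sequences above. The helper name `token_level_scores` is hypothetical, and returning 0.0 when a denominator is zero is an assumption of this sketch, not a convention taken from any library.

```python
def token_level_scores(gold, pred):
    """Token-level precision and recall.

    A token counts as a hit only when both tags are identical and not 'O'.
    Assumption: a score with a zero denominator is reported as 0.0.
    """
    hits = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    predicted = sum(1 for p in pred if p != "O")  # tokens the system tagged as entity
    actual = sum(1 for g in gold if g != "O")     # tokens tagged as entity in the gold data
    precision = hits / predicted if predicted else 0.0
    recall = hits / actual if actual else 0.0
    return precision, recall


# Gold and predicted tags copied from the three scenario tables.
scenarios = {
    "I (full match)": (["O", "B-LOC", "I-LOC", "O"], ["O", "B-LOC", "I-LOC", "O"]),
    "II (spurious)": (["O", "O", "O", "O"], ["O", "B-ORG", "I-ORG", "O"]),
    "III (missed)": (["O", "B-LOC", "I-LOC", "O"], ["O", "O", "O", "O"]),
}

scores = {name: token_level_scores(g, p) for name, (g, p) in scenarios.items()}
for name, (precision, recall) in scores.items():
    print(f"Scenario {name}: precision={precision:.2f}, recall={recall:.2f}")
```

Scenarios II and III both score zero here, even though one is a spurious entity and the other a missed one; token-level numbers alone do not tell the two error types apart.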


### Example Receipt

```
XYZ Mart
1234 Main Lp.
Anytown, USA

Receipt: #123456789
Date: 2023-10-21 16:30:00

--------------------------------------------------
Description        |  Qty   |  Price   |  Total
--------------------------------------------------

Apples (2 lb)       |   1    |  $2.99   |  $2.99
Milk (1 gal)        |   2    |  $3.49   |  $6.98
Bread (Whole Wheat) |   1    |  $2.29   |  $2.29
Toilet Paper (6 pk) |   2    | $4.99    |  $9.98
Dish Soap (16 oz)   |   1    |  $1.99   |  $1.99
Charging Cable      |   3    |  $5.99   | $17.97

--------------------------------------------------
Subtotal:                      $41.20
Sales Tax (7%):                $2.88
Total:                         $44.08

Payment Method: Credit Card
Card Ending In: **** 1234
Authorization Code: 987654

Thank you for shopping with us!

```

### Example Receipt: Annotation

```
[ORG]: XYZ Mart
[ADDRESS]: 1234 Main Lp.
[ADDRESS]: Anytown, USA

Receipt: #123456789
[RECEIPT_NUMBER]: #123456789
[DATE]: 2023-10-21 16:30:00

--------------------------------------------------
Description        |  Qty   |  Price   |  Total
--------------------------------------------------

Apples (2 lb)       |   1    |  $2.99   |  $2.99
[PRODUCT]: Apples
[QUANTITY]: (2 lb)
[PRICE]: $2.99
[TOTAL_PRICE]: $2.99

Milk (1 gal)        |   2    |  $3.49   |  $6.98
[PRODUCT]: Milk
[QUANTITY]: (1 gal)
[PRICE]: $3.49
[TOTAL_PRICE]: $6.98

Bread (Whole Wheat) |   1    |  $2.29   |  $2.29
[PRODUCT]: Bread (Whole Wheat)
[QUANTITY]: 1
[PRICE]: $2.29
[TOTAL_PRICE]: $2.29

Toilet Paper (6 pk) |   2    |  $4.99   |  $9.98
[PRODUCT]: Toilet Paper (6 pk)
[QUANTITY]: 2
[PRICE]: $4.99
[TOTAL_PRICE]: $9.98

Dish Soap (16 oz)   |   1    |  $1.99   |  $1.99
[PRODUCT]: Dish Soap (16 oz)
[QUANTITY]: 1
[PRICE]: $1.99
[TOTAL_PRICE]: $1.99

Charging Cable      |   3    |  $5.99   | $17.97
[PRODUCT]: Charging Cable
[QUANTITY]: 3
[PRICE]: $5.99
[TOTAL_PRICE]: $17.97

--------------------------------------------------
[SUBTOTAL]:                       $41.20
[SALES_TAX]: (7%):                $2.88
[TOTAL_AMOUNT]:                   $44.08

[PAYMENT_METHOD]: Credit Card
[CARD_ENDING]: **** 1234
[AUTHORIZATION_CODE]: 987654
```

### Example Receipt: Annotation Comparison
| Token        | Gold                | Prediction     |
|------------|-------------------|--------------------|
| MART         | I-ORG               | I-ORG               |
| 1234         | B-ADDRESS           | B-ADDRESS           |
| Main         | I-ADDRESS           | I-ADDRESS           |
| Lp           | I-ADDRESS           | O                   |
| .            | O                   | O                   |
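
The gap between token-level and entity-level views of this table can be sketched as follows. The helper `bio_to_spans` is hypothetical (it is not part of nervaluate), and it assumes that an `I-` tag with no matching open span simply starts a new span, which is needed here because the excerpt begins in the middle of the ORG entity.

```python
def bio_to_spans(tags):
    """Collect (type, start, end) spans from BIO tags; end is inclusive.

    Assumption: an I- tag without an open span of the same type opens a new span.
    """
    spans = []
    current = None  # [type, start, end] of the span being built
    for i, tag in enumerate(tags):
        if tag == "O":
            if current:
                spans.append(tuple(current))
            current = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "I" and current and current[0] == etype:
            current[2] = i  # extend the open span
        else:
            if current:
                spans.append(tuple(current))
            current = [etype, i, i]
    if current:
        spans.append(tuple(current))
    return spans


# Tokens and tags copied from the comparison table.
tokens = ["MART", "1234", "Main", "Lp", "."]
gold = ["I-ORG", "B-ADDRESS", "I-ADDRESS", "I-ADDRESS", "O"]
pred = ["I-ORG", "B-ADDRESS", "I-ADDRESS", "O", "O"]

token_accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
gold_spans = bio_to_spans(gold)
pred_spans = bio_to_spans(pred)
exact_matches = set(gold_spans) & set(pred_spans)

print(f"token accuracy: {token_accuracy:.2f}")
print("gold spans:", gold_spans)
print("pred spans:", pred_spans)
print("exact span matches:", exact_matches)
```

Four of the five tokens agree, yet the predicted ADDRESS span stops one token short, so only the ORG fragment matches exactly at the entity level.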


### Precision and Recall in NER

- **Precision**: The percentage of named entities identified by the system that are correct.
- **Recall**: The percentage of named entities in the corpus that the system finds.
- **F1-score**: The harmonic mean of precision and recall.

* These metrics are valuable for NER evaluation, but they have important limitations:
  * They ignore partial matches: only exact matches count, so partially overlapping entities receive no credit.
  * They provide no entity-level evaluation, which is crucial in real-world applications.
  * They don't address other scenarios, such as nested entities or entities that cross sentence boundaries.
* We may need to adapt evaluation schemas to specific NER tasks and requirements beyond the CoNLL-2003 metrics.


### Challenges in Token-Level Evaluation: IV

**Scenario IV**: System Assigns the Wrong Entity Type

|Token|Gold|Prediction|
|---|---|---|
|I|O|O|
|live|O|O|
|in|O|O|
|Palo|B-LOC|B-ORG|
|Alto|I-LOC|I-ORG|
|,|O|O|


### Challenges in Token-Level Evaluation: V

**Scenario V**: System Gets the Boundaries Wrong

|Token|Gold|Prediction|
|---|---|---|
|Unless|O|B-PER|
|Karl|B-PER|I-PER|
|Smith|I-PER|I-PER|
|resigns|O|O|


### Challenges in Token-Level Evaluation: VI

**Scenario VI**: System Gets the Boundaries and Entity Type Wrong

|Token|Gold|Prediction|
|---|---|---|
|Unless|O|B-ORG|
|Karl|B-PER|I-ORG|
|Smith|I-PER|I-ORG|
|resigns|O|O|

### MUC Evaluation Metrics for NER

* MUC (Message Understanding Conference) introduced comprehensive metrics for evaluating Named Entity Recognition (NER) systems.
* These metrics assess different categories of errors by comparing system output to golden annotations.

* **Correct (COR)**: Both system output and golden annotation are identical.
* **Incorrect (INC)**: System output and golden annotation don't match.
* **Partial (PAR)**: System and golden annotation are somewhat similar but not identical.
* **Missing (MIS)**: A golden annotation is not captured by the system.
* **Spurious (SPU)**: The system produces a response that doesn't exist in the golden annotation.

* These metrics go beyond strict classification and allow for partial matching, offering a more nuanced evaluation of NER systems.
* They cover various scenarios encountered in NER, including recognizing differences in surface strings and entity types.

### Variants of Precision/Recall/F1-Score

* The workshop [SemEval’13](https://www.aclweb.org/portal/content/semeval-2013-international-workshop-semantic-evaluation) introduced four variants of precision, recall, and F1-score metrics based on the MUC framework.
  * These variants provide different ways to evaluate NER performance.


|Evaluation schema|Explanation|
|:---|:---|
|Strict|exact boundary string match and entity type|
|Exact|exact boundary match over the string, regardless of the type|
|Partial|partial boundary match over the string, regardless of the type|
|Type|entity type matches, and some overlap between the system-tagged entity and the gold annotation is required|

### Understanding the New Variants

* Each variant of precision, recall, and F1-score assesses NER performance differently.
  * They account for correct, incorrect, partial, missed, and spurious entity recognition in unique ways.

| Scenario | Gold Entity Type | Gold Surface String | Predicted Entity Type | Predicted Surface String | Type | Partial | Exact | Strict |
|:---|:---|:---|:---|:---|:---|:---|:---|:---|
| III      | brand                     | TIKOSYN                  |                          |                       | MIS  | MIS     | MIS   | MIS    |
| II       |                           |                          | brand                    | healthy               | SPU  | SPU     | SPU   | SPU    |
| V        | drug                      | warfarin                 | drug                     | of warfarin           | COR  | PAR     | INC   | INC    |
| IV       | drug                      | propranolol              | brand                    | propranolol           | INC  | COR     | COR   | INC    |
| I        | drug                      | phenytoin                | drug                     | phenytoin             | COR  | COR     | COR   | COR    |
| I        | drug                      | theophylline             | drug                     | theophylline          | COR  | COR     | COR   | COR    |
| VI       | group                     | contraceptives           | drug                     | oral contraceptives   | INC  | PAR     | INC   | INC    |
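
The rows of this table can be reproduced with a small classifier. This is a sketch under stated assumptions, not the nervaluate implementation: the `Entity` type and `classify` function are hypothetical, the token offsets are invented for illustration, and a gold/prediction pair is assumed to overlap by at least one token whenever both are present.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    etype: str
    start: int  # index of the first token
    end: int    # index of the last token (inclusive)


def classify(gold: Optional[Entity], pred: Optional[Entity]) -> dict:
    """Return the MUC category under each schema for one gold/prediction pair.

    Assumption: when both entities are given, they overlap by at least one token.
    """
    schemas = ("type", "partial", "exact", "strict")
    if pred is None:
        return dict.fromkeys(schemas, "MIS")
    if gold is None:
        return dict.fromkeys(schemas, "SPU")
    same_type = gold.etype == pred.etype
    same_bounds = (gold.start, gold.end) == (pred.start, pred.end)
    return {
        "type": "COR" if same_type else "INC",
        "partial": "COR" if same_bounds else "PAR",
        "exact": "COR" if same_bounds else "INC",
        "strict": "COR" if same_type and same_bounds else "INC",
    }


# Offsets are made up; only the relative overlap between gold and prediction matters.
row_iii = classify(Entity("brand", 0, 0), None)                 # "TIKOSYN" missed
row_ii = classify(None, Entity("brand", 0, 0))                  # "healthy" spurious
row_v = classify(Entity("drug", 1, 1), Entity("drug", 0, 1))    # "warfarin" vs "of warfarin"
row_iv = classify(Entity("drug", 0, 0), Entity("brand", 0, 0))  # "propranolol", wrong type
row_vi = classify(Entity("group", 1, 1), Entity("drug", 0, 1))  # "contraceptives" vs "oral contraceptives"

for name, row in [("III", row_iii), ("II", row_ii), ("V", row_v), ("IV", row_iv), ("VI", row_vi)]:
    print(name, row)
```

A full evaluator also has to decide which predicted entity to pair with which gold entity; that alignment step is left out of this sketch.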



### Computing the Actual Precision/Recall/F1-Score
```
ACTUAL (ACT) = COR + INC + PAR + SPU = TP + FP
POSSIBLE (POS) = COR + INC + PAR + MIS = TP + FN
```

* Precision: the percentage of entities produced by the NER system that are correct.
* Recall: the percentage of entities in the gold standard annotations that the NER system correctly retrieves.
* The formulas differ depending on whether we require an exact match (strict and exact) or allow a partial match (partial and type).

__Exact Match (i.e., strict and exact)__
```
Precision = (COR / ACT) = TP / (TP + FP)
Recall = (COR / POS) = TP / (TP+FN)
```
__Partial Match (i.e., partial and type)__
```
Precision = (COR + 0.5 × PAR) / ACT = TP / (TP + FP)
Recall = (COR + 0.5 × PAR) / POS = TP / (TP + FN)
```
```
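
The formulas above translate directly into a small function. This is a minimal sketch: the name `muc_scores` and its `partial_credit` flag are hypothetical, the counts in the usage lines are invented purely to exercise both variants, and reporting 0.0 for a zero denominator is an assumption of the sketch.

```python
def muc_scores(cor, inc, par, mis, spu, partial_credit=False):
    """Precision, recall and F1 from MUC category counts.

    With partial_credit=True each partial match counts as half a correct one
    (the partial-match formulas); otherwise only COR counts (exact-match formulas).
    Assumption: a score with a zero denominator is reported as 0.0.
    """
    actual = cor + inc + par + spu    # ACT: entities the system produced
    possible = cor + inc + par + mis  # POS: entities in the gold annotation
    hits = cor + 0.5 * par if partial_credit else cor
    precision = hits / actual if actual else 0.0
    recall = hits / possible if possible else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts, not taken from any dataset.
exact_p, exact_r, exact_f1 = muc_scores(cor=6, inc=2, par=0, mis=2, spu=0)
partial_p, partial_r, partial_f1 = muc_scores(
    cor=3, inc=0, par=2, mis=1, spu=3, partial_credit=True
)

print(f"exact:   P={exact_p:.2f} R={exact_r:.2f} F1={exact_f1:.2f}")
print(f"partial: P={partial_p:.2f} R={partial_r:.2f} F1={partial_f1:.2f}")
```

Keeping ACT and POS as separate denominators is what lets precision and recall diverge when the numbers of spurious and missed entities differ.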


### Putting it All Together
|Measure|Type|Partial|Exact|Strict|
|---|---|---|---|---|
|Correct|3|3|3|2|
|Incorrect|2|0|2|3|
|Partial|0|2|0|0|
|Missed|1|1|1|1|
|Spurious|1|1|1|1|
|Precision|0.5|0.67|0.5|0.33|
|Recall|0.5|0.67|0.5|0.33|
|F1|0.5|0.67|0.5|0.33|
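
The count rows of this table come from tallying the per-scenario labels in the comparison table under "Understanding the New Variants". A short sketch of that tally, with the labels copied from that table in row order (III, II, V, IV, I, I, VI):

```python
from collections import Counter

# One label per scenario row, copied column by column from the comparison table.
labels = {
    "Type":    ["MIS", "SPU", "COR", "INC", "COR", "COR", "INC"],
    "Partial": ["MIS", "SPU", "PAR", "COR", "COR", "COR", "PAR"],
    "Exact":   ["MIS", "SPU", "INC", "COR", "COR", "COR", "INC"],
    "Strict":  ["MIS", "SPU", "INC", "INC", "COR", "COR", "INC"],
}
categories = ["COR", "INC", "PAR", "MIS", "SPU"]

# Counter returns 0 for categories that never occur in a column.
tallies = {schema: Counter(column) for schema, column in labels.items()}

for schema, tally in tallies.items():
    counts = [tally[c] for c in categories]
    actual = tally["COR"] + tally["INC"] + tally["PAR"] + tally["SPU"]
    possible = tally["COR"] + tally["INC"] + tally["PAR"] + tally["MIS"]
    print(f"{schema:8s} {dict(zip(categories, counts))} ACT={actual} POS={possible}")
```

ACT and POS both come out to 6 under every schema, since there is exactly one missed and one spurious entity, which is why precision and recall coincide in each column.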