[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jianlins/BMI_NLP_2025/blob/main/Module%208%20Named%20Entity%20Recognition.ipynb)

# Named Entity Recognition

We will continue to use the [UUDeCART](https://github.com/UUDeCART/decart_rule_based_nlp) dataset. Instead of converting the labels into sentence-level labels, we will keep the original concept labels and convert them into [BIO format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Your exercise will start from there.
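As a quick illustration of BIO tagging, here is a minimal sketch. The sentence, the span offsets, and the helper `to_bio` are made up for illustration and are not part of the dataset or of medspacy_io. The first token of an entity gets `B-<label>`, the following tokens of the same entity get `I-<label>`, and everything else gets `O`.

```python
# Hypothetical example sentence and entity span, for illustration only.
tokens = ["Mother", "had", "colon", "cancer", "."]
# Each span is (start, end, label) in token indices, with end exclusive.
spans = [(2, 4, "FAM_COLON_CA")]


def to_bio(tokens, spans):
    """Return one BIO tag per token for the given entity spans."""
    tags = ["O"] * len(tokens)  # default: outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


bio_tags = to_bio(tokens, spans)
for token, tag in zip(tokens, bio_tags):
    print(token, tag)
```

Here "colon" is tagged `B-FAM_COLON_CA`, "cancer" is tagged `I-FAM_COLON_CA`, and the other tokens are tagged `O`.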

## Download the dataset

In [None]:
%%capture
!wget https://github.com/jianlins/FHI_Hands_on/raw/master/data/cc_train.zip

In [None]:
%%capture
!wget https://github.com/jianlins/FHI_Hands_on/raw/master/img/cc_test.zip

In [None]:
!ls

cc_test.zip  cc_train.zip  sample_data


In [None]:
%%capture
!unzip cc_train.zip -d cc_train

In [None]:
%%capture
!unzip cc_test.zip -d cc_test

## Install & import the packages

In [None]:
%%capture
!pip install quicksectx git+https://github.com/medspacy/medspacy_io
!pip install git+https://github.com/MeMartijn/updated-sklearn-crfsuite.git#egg=sklearn_crfsuite

In [None]:
from spacy.lang.en import English
from medspacy_io.reader import BratDocReader
from medspacy_io.reader import BratDirReader
import spacy
from pathlib import Path
from medspacy_io.vectorizer import Vectorizer
from spacy.tokens import Doc
from typing import List
import pandas as pd
from sklearn_crfsuite.metrics import flatten, flat_classification_report

from sklearn.model_selection import train_test_split
from sklearn_crfsuite import CRF
from sklearn_crfsuite import metrics
from sklearn import preprocessing
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_validate
import sklearn_crfsuite
from sklearn.metrics import classification_report


In [None]:
config='''[entities]
FAM_COLON_CA_DOC
FAM_COLON_CA
ANATOM
NEGATED_DOC
COLON_CA
Pos_Doc
NEG_DOC
PastConcept
[attributes]
Negation	Arg:COLON_CA,	Value:affirm
Experiencer	Arg:COLON_CA,	Value:patient
Certainty	Arg:COLON_CA,	Value:certain
Note	Arg:FAM_COLON_CA|PossibleConcept|NegatedConcept|PastConcept
Temporality	Arg:COLON_CA,	Value:present
Section	Arg:COLON_CA,	Value:SourceDocumentInformation
[relations]
[events]'''

In [None]:
Path('annotation.conf').write_text(config)

403

## Now read the data as spaCy Doc objects.

In [None]:
# set up the Brat reader
nlp=spacy.load("en_core_web_sm", disable=['ner'])
dir_reader = BratDirReader(nlp=nlp, support_overlap=True, recursive=True, schema_file='annotation.conf')

found annotation.conf file


In [None]:
train_docs = dir_reader.read(txt_dir='cc_train')
test_docs = dir_reader.read(txt_dir='cc_test')

In [None]:
len(test_docs), len(train_docs)

(40, 60)

In [None]:
pdocs=[d for d in train_docs if len(d.spans)>1]

In [None]:
pdocs[0].spans

{'FAM_COLON_CA_DOC': [Admission], 'FAM_COLON_CA': [colon cancer]}

## Convert to BIO

I've provided the conversion function to save you time. Its output string follows the same format that the book uses, so you can use the output string to train your NER models.

In [None]:
# I revised this function to also return a DataFrame, so that the annotations are easier to review
def spans_to_bio(doc:Doc, anno_types:List[str], abbr:bool=True)->tuple:
  """
  Converts spans in a spaCy Doc object to a BIO-formatted string, with an option
  to abbreviate the entity labels. It adds an empty line between sentences to improve
  readability.

  Parameters:
  - doc (Doc): The spaCy Doc object containing the text and its annotations, including
                entities and sentence boundaries.
  - anno_types (List[str]): A list of annotation types to include in the output. These
                            types should correspond to the keys in `doc.spans`.
  - abbr (bool, optional): If True, entity labels are abbreviated to their initials.
                            Defaults to True.

  Returns:
  - tuple: (bio_text, bio_df). bio_text is a string where each token is followed by its BIO tag
          (with the entity label if applicable), formatted as "token B-entity" or "token I-entity"
          for tokens within entities, and "token O" for tokens outside any entities; sentences are
          separated by an empty line. bio_df is a pandas DataFrame with the columns sentence_id,
          token, and label, one row per kept token.
  """
  # Initialize a dictionary to hold BIO tags for each token index
  bio_tags = {token.i: 'O' for token in doc}  # Default to 'O' for outside any entity

  # Preprocess spans to assign BIO tags
  for anno_type in anno_types:
    for span in doc.spans.get(anno_type, []):
      if span:  # Check if span is not empty
        label=span.label_
        if abbr:
          label=''.join([w[0] for w in label.split('_')])
        bio_tags[span.start] = f"B-{label}"  # Begin tag for the first token in the span
        for token in span[1:]:  # Inside tags for the rest of the tokens in the span
          bio_tags[token.i] = f"I-{label}"

  # Generate BIO format string
  bio_text = []
  bio_data={'sentence_id':[],'token':[],'label':[]}
  for s, sent in enumerate(doc.sents):
    for i,token in enumerate(sent):
      # trim the whitespaces on both sides of a sentence
      if (i==0 or i==len(sent)-1) and str(token).strip()=='':
        bio_text.append('')
        continue
      elif str(token).strip()=='':
        # clean up extra whitespaces within a sentence.
        bio_text.append(f' \t{bio_tags[token.i]}')
        bio_data['label'].append(bio_tags[token.i])
      else:
        bio_text.append(f"{token.text} {bio_tags[token.i]}")
        bio_data['label'].append(bio_tags[token.i])
      bio_data['token'].append(token)
      bio_data['sentence_id'].append(s)
    bio_text.append('')  # Empty line between sentences
  return '\n'.join(bio_text), pd.DataFrame(bio_data)

In [None]:
data, train_df=spans_to_bio(train_docs[1], anno_types=['FAM_COLON_CA','COLON_CA'], abbr=False)

In [None]:
train_df[train_df.label!='O']

Unnamed: 0,sentence_id,token,label
453,37,colon,B-FAM_COLON_CA
454,37,cancer,I-FAM_COLON_CA


## CRF model for NER.

In [None]:
# We will focus on two types of concepts here
def convert_docs(docs:List[Doc], anno_types=['FAM_COLON_CA','COLON_CA']):
  all_conll=[]
  offset=0
  dfs=[]
  for d in docs:
    data, df=spans_to_bio(d, anno_types=anno_types)
    all_conll.append(data)
    df['sentence_id']+=offset
    offset+=df.shape[0]
    dfs.append(df)
  return '\n\n'.join(all_conll), pd.concat(dfs)


In [None]:
train_data, train_df=convert_docs(train_docs)
test_data, test_df=convert_docs(test_docs)


In [None]:
train_df.shape

(120511, 3)

In [None]:
def word2features(sent, i):
    word = sent[i]
    postag = word.pos_
    word=str(word)

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1]
        postag1 = word1.pos_
        word1=str(word1)
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1]
        postag1 = word1.pos_
        word1=str(word1)
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]
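
To see what a single feature dictionary looks like, here is a reduced, string-only sketch of the same idea. `simple_word2features` is a hypothetical helper written for this illustration: it works on plain strings, drops the POS-tag features, and keeps only a few of the features used above. The example words are made up.

```python
def simple_word2features(words, i):
    """Reduced, string-only variant of the feature template (no POS tags)."""
    word = words[i]
    features = {
        "bias": 1.0,
        "word.lower()": word.lower(),
        "word[-3:]": word[-3:],
        "word.istitle()": word.istitle(),
    }
    if i > 0:
        # Context feature from the previous token
        features["-1:word.lower()"] = words[i - 1].lower()
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(words) - 1:
        # Context feature from the next token
        features["+1:word.lower()"] = words[i + 1].lower()
    else:
        features["EOS"] = True  # end of sentence
    return features


words = ["Mother", "had", "colon", "cancer"]
first = simple_word2features(words, 0)
last = simple_word2features(words, 3)
print(first)
print(last)
```

The first token carries `BOS` and a `+1:` feature but no `-1:` feature, and the last token carries `EOS` and a `-1:` feature. The CRF receives one such dictionary per token, grouped into one list per sentence.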



### Generate features

In [None]:
X_train=[sent2features(list(sdf['token'])) for id,sdf in train_df.groupby('sentence_id')]
X_test=[sent2features(list(sdf['token'])) for id,sdf in test_df.groupby('sentence_id')]

In [None]:
y_train=[list(sdf['label']) for id,sdf in train_df.groupby('sentence_id')]
y_test=[list(sdf['label']) for id,sdf in test_df.groupby('sentence_id')]

In [None]:
len(X_train), len(y_train), len(X_test), len(y_test)


(6762, 6762, 4502, 4502)

In [None]:
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)

In [None]:
y_pred = crf.predict(X_test)

In [None]:
new_classes = train_df.label.unique().tolist()
new_classes.remove('O')

In [None]:
print(flat_classification_report(y_test, y_pred, labels = new_classes))

              precision    recall  f1-score   support

       B-FCC       0.61      0.82      0.70        17
       I-FCC       0.61      0.74      0.67        19
        B-CC       1.00      0.12      0.22         8
        I-CC       1.00      0.12      0.22         8

   micro avg       0.62      0.58      0.60        52
   macro avg       0.80      0.45      0.45        52
weighted avg       0.73      0.58      0.54        52
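
As a reading aid for the report, the B-CC row can be reproduced by hand. The counts below are an inference from the reported precision (1.00), recall (0.12), and support (8), not values printed by the notebook: 1 of the 8 gold B-CC tokens was found, with no false positives. The helper `prf` is a hypothetical name used only here.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall > 0 else 0.0
    return precision, recall, f1


# Inferred counts for B-CC: 8 gold tokens, 1 found, no false alarms.
precision, recall, f1 = prf(tp=1, fp=0, fn=7)
print(precision, recall, round(f1, 4))
```

With these counts the recall is 1/8 and the F1 is about 0.22, which agrees with the row above. Perfect precision can hide very low recall, which is why the macro-averaged F1 is much lower than the macro-averaged precision.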



## Span-level evaluation
The scores above are computed at the token level. In most cases, however, we don't need to be that strict about every token. For example, annotating "**diagnosed with dementia**" versus "**dementia**" in the sentence "The patient was **diagnosed with dementia** 10 years ago" makes little practical difference. We therefore often use a lenient measure to compute precision, recall, and F1: in the function below, a predicted span counts as a true positive if it has the same type as a gold span and overlaps it by at least one token.
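
One way to make the lenient criterion concrete is overlap matching, which is the condition `compute_metrics_and_averages` applies to spans of the same type. The sketch below is a simplified illustration: `spans_overlap` is a hypothetical helper, and the token indices are assigned by hand for the example sentence.

```python
def spans_overlap(a, b):
    """a and b are (start, end) token indices, both inclusive."""
    return not (a[1] < b[0] or a[0] > b[1])


# "The patient was diagnosed with dementia 10 years ago"
# Token indices: The=0 patient=1 was=2 diagnosed=3 with=4 dementia=5 10=6 years=7 ago=8
gold = (5, 5)  # "dementia"
pred = (3, 5)  # "diagnosed with dementia"

exact_match = gold == pred                  # strict: boundaries must be identical
lenient_match = spans_overlap(gold, pred)   # lenient: any shared token is enough
print(exact_match, lenient_match)
```

Under exact matching this prediction would count as both a false positive and a false negative. Under overlap matching it counts as a single true positive.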

In [2]:
def compute_metrics_and_averages(y_true, y_pred):
    def extract_entities(sentence_tags, row_id):
        entities = []
        current_entity = None
        for i, tag in enumerate(sentence_tags):
            if tag.startswith('B-'):
                if current_entity:
                    entities.append(current_entity)
                current_entity = {'type': tag[2:], 'start': i, 'end': i, 'row_id': row_id}
            elif tag.startswith('I-') and current_entity and current_entity['type'] == tag[2:]:
                current_entity['end'] = i
            else:
                if current_entity:
                    entities.append(current_entity)
                    current_entity = None
        if current_entity:
            entities.append(current_entity)
        return entities

    # Initialize containers
    metrics = {}
    FP_ids = {}
    FN_ids = {}

    for row_id, (true_tags, pred_tags) in enumerate(zip(y_true, y_pred)):
        true_entities = extract_entities(true_tags, row_id)
        pred_entities = extract_entities(pred_tags, row_id)
        for entity in true_entities + pred_entities:
            entity_type = entity['type']
            if entity_type not in metrics:
                metrics[entity_type] = {'TP': 0, 'FP': 0, 'FN': 0}
                FP_ids[entity_type] = []
                FN_ids[entity_type] = []

        for pred_entity in pred_entities:
            matched = False
            for true_entity in true_entities:
                if pred_entity['type'] == true_entity['type'] and not (pred_entity['end'] < true_entity['start'] or pred_entity['start'] > true_entity['end']):
                    metrics[pred_entity['type']]['TP'] += 1
                    matched = True
                    true_entities.remove(true_entity)
                    break
            if not matched:
                metrics[pred_entity['type']]['FP'] += 1
                FP_ids[pred_entity['type']].append(pred_entity['row_id'])

        for true_entity in true_entities:
            metrics[true_entity['type']]['FN'] += 1
            FN_ids[true_entity['type']].append(true_entity['row_id'])

    # Calculate micro and macro averages
    total_TP = sum(metrics[etype]['TP'] for etype in metrics)
    total_FP = sum(metrics[etype]['FP'] for etype in metrics)
    total_FN = sum(metrics[etype]['FN'] for etype in metrics)

    micro_precision = total_TP / (total_TP + total_FP) if total_TP + total_FP > 0 else 0
    micro_recall = total_TP / (total_TP + total_FN) if total_TP + total_FN > 0 else 0
    micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall) if micro_precision + micro_recall > 0 else 0

    precisions = [metrics[etype]['TP'] / (metrics[etype]['TP'] + metrics[etype]['FP']) if metrics[etype]['TP'] + metrics[etype]['FP'] > 0 else 0 for etype in metrics]
    recalls = [metrics[etype]['TP'] / (metrics[etype]['TP'] + metrics[etype]['FN']) if metrics[etype]['TP'] + metrics[etype]['FN'] > 0 else 0 for etype in metrics]
    macro_precision = sum(precisions) / len(metrics) if metrics else 0
    macro_recall = sum(recalls) / len(metrics) if metrics else 0
    macro_f1 = 2 * macro_precision * macro_recall / (macro_precision + macro_recall) if macro_precision + macro_recall > 0 else 0

    # Prepare DataFrame
    # Per-type F1 from the precision/recall lists above; guarding on p + r avoids a
    # ZeroDivisionError for a type that was never predicted or never annotated.
    f1s = [2 * p * r / (p + r) if p + r > 0 else 0 for p, r in zip(precisions, recalls)]
    data = {
        'Entity Type': list(metrics.keys()) + ['Micro Average', 'Macro Average'],
        'Precision': precisions + [micro_precision, macro_precision],
        'Recall': recalls + [micro_recall, macro_recall],
        'F1': f1s + [micro_f1, macro_f1]
    }

    results_df = pd.DataFrame(data)
    return results_df, FP_ids, FN_ids

# Assignment 1
Use the above function to recalculate the performance scores.

# Assignment 2

Use this [BERT model](https://huggingface.co/google-bert) to implement a token classifier for this NER task. You can follow the token classification [tutorial](https://huggingface.co/docs/transformers/en/tasks/token_classification).

Note: you will need to write a different `spans_to_bio` function, or add another function, to align the labels. Why?