[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jianlins/BMI_NLP_2025/blob/main/Module%208%20Named%20Entity%20Recognition.ipynb)

# Named Entity Recognition

We will continue to use the [UUDeCART](https://github.com/UUDeCART/decart_rule_based_nlp) dataset. Instead of converting the labels into sentence-level labels, we will keep the original concept labels and convert them into [BIO format](https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)). Your exercise starts from there.
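
As a quick reminder of what BIO tagging looks like, here is a minimal sketch. The tokens and the annotated span are made up for illustration and do not come from the dataset; only the label name is taken from it.

```python
# Toy illustration of BIO tagging; the sentence and span below are invented.
tokens = ["Findings", "suggest", "right", "lower", "lobe", "pneumonia", "."]
# One annotated concept as (start, end, label), with an exclusive end index.
spans = [(2, 6, "EVIDENCE_OF_PNEUMONIA")]

tags = ["O"] * len(tokens)          # every token starts Outside any entity
for start, end, label in spans:
    tags[start] = f"B-{label}"      # first token of the span: Begin
    for i in range(start + 1, end):
        tags[i] = f"I-{label}"      # remaining tokens of the span: Inside

for token, tag in zip(tokens, tags):
    print(token, tag)
```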

## Download the dataset

In [None]:
%%capture
!wget https://github.com/UUDeCART/decart_rule_based_nlp/raw/master/data/training_v2.zip

In [None]:
%%capture
!wget https://github.com/UUDeCART/decart_rule_based_nlp/raw/master/data/test_v2.zip

In [None]:
!ls

sample_data  test_v2.zip  training_v2.zip


In [None]:
%%capture
!unzip training_v2.zip

In [None]:
%%capture
!unzip test_v2.zip

In [None]:
!ls

sample_data  test_v2  test_v2.zip  training_v2	training_v2.zip


## Install & import the packages

In [None]:
%%capture
!pip install quicksectx git+https://github.com/medspacy/medspacy_io

In [None]:
from spacy.lang.en import English
from medspacy_io.reader import BratDocReader
from medspacy_io.reader import BratDirReader
import spacy
from pathlib import Path
from medspacy_io.vectorizer import Vectorizer
from spacy.tokens import Doc
from typing import List

In [None]:
# The dataset files do not include a schema configuration, so let's create one
concepts=['EVIDENCE_OF_PNEUMONIA', 'PNEUMONIA_DOC_NO', 'PNEUMONIA_DOC_YES']
lines=['[entities]']+concepts
Path('annotation.conf').write_text('\n'.join(lines))

67

## Read the data as spaCy Doc objects

In [None]:
# set up the Brat reader
nlp=spacy.load("en_core_web_sm", disable=['ner'])
dir_reader = BratDirReader(nlp=nlp, support_overlap=True, recursive=True, schema_file='annotation.conf')

In [None]:
train_docs = dir_reader.read(txt_dir='training_v2')
test_docs = dir_reader.read(txt_dir='test_v2')

## Convert to BIO

I've provided the conversion function to save you time. Its output string follows the same format the book uses, so you can use it directly to train your NER models.

In [None]:
def spans_to_bio(doc:Doc, anno_types:List[str], abbr:bool=True)->str:
  """
  Converts spans in a spaCy Doc object to a BIO-formatted string, with an option
  to abbreviate the entity labels. It adds an empty line between sentences to improve
  readability.

  Parameters:
  - doc (Doc): The spaCy Doc object containing the text and its annotations, including
                entities and sentence boundaries.
  - anno_types (List[str]): A list of annotation types to include in the output. These
                            types should correspond to the keys in `doc.spans`.
  - abbr (bool, optional): If True, entity labels are abbreviated to their initials.
                            Defaults to True.

  Returns:
  - str: A string where each token is followed by its BIO tag (with the entity label if applicable),
          formatted as "token B-entity" or "token I-entity" for tokens within entities, and
          "token O" for tokens outside any entities. Sentences are separated by an empty line.
  """
  # Initialize a dictionary to hold BIO tags for each token index
  bio_tags = {token.i: 'O' for token in doc}  # Default to 'O' for outside any entity

  # Preprocess spans to assign BIO tags
  for anno_type in anno_types:
    for span in doc.spans.get(anno_type, []):
      if span:  # Check if span is not empty
        label=span.label_
        if abbr:
          label=''.join([w[0] for w in label.split('_')])
        bio_tags[span.start] = f"B-{label}"  # Begin tag for the first token in the span
        for token in span[1:]:  # Inside tags for the rest of the tokens in the span
          bio_tags[token.i] = f"I-{label}"

  # Generate BIO format string
  bio_text = []
  for sent in doc.sents:
    for i,token in enumerate(sent):
      # trim the whitespaces on both sides of a sentence
      if (i==0 or i==len(sent)-1) and str(token).strip()=='':
        bio_text.append('')
      elif str(token).strip()=='':
        # clean up extra whitespaces within a sentence.
        bio_text.append(f' \t{bio_tags[token.i]}')
      else:
        bio_text.append(f"{token.text} {bio_tags[token.i]}")
    bio_text.append('')  # Empty line between sentences
  return '\n'.join(bio_text)
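
With `abbr=True`, each label is reduced to the initials of its underscore-separated parts. The sketch below repeats that one step in isolation so you know which tags to expect; `abbreviate` is a hypothetical helper name used only here.

```python
def abbreviate(label: str) -> str:
    """Hypothetical helper mirroring the abbreviation step in spans_to_bio."""
    return ''.join(w[0] for w in label.split('_'))

concepts = ['EVIDENCE_OF_PNEUMONIA', 'PNEUMONIA_DOC_NO', 'PNEUMONIA_DOC_YES']
abbreviations = {c: abbreviate(c) for c in concepts}
print(abbreviations)
```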

In [None]:
bio_str=spans_to_bio(train_docs[0], ['EVIDENCE_OF_PNEUMONIA'])

In [None]:
# Show an excerpt that contains an EVIDENCE_OF_PNEUMONIA annotation
print(bio_str[1516:1805])




A B-EOP
right I-EOP
IJ I-EOP
line I-EOP
, I-EOP
NGT I-EOP
, I-EOP
and I-EOP
ETT I-EOP
are I-EOP
 	I-EOP
unchanged I-EOP
as I-EOP
are I-EOP
the I-EOP
parenchymal I-EOP
changes I-EOP
in I-EOP
the I-EOP
lungs I-EOP
compared O
to O
the O
earlier O
 	O
chest O
x O
- O
ray O
this O
morning O
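
Most NER training code expects sentences as lists of (token, tag) pairs rather than one long string. Below is a minimal sketch of parsing the output of `spans_to_bio` back into that shape. `parse_bio` is a hypothetical helper, not part of the book or of medspacy_io, and it assumes you want to drop the whitespace-only token lines.

```python
from typing import List, Tuple

def parse_bio(bio_str: str) -> List[List[Tuple[str, str]]]:
    """Split a BIO string into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in bio_str.split('\n'):
        if not line.strip():
            # A blank line marks a sentence boundary.
            if current:
                sentences.append(current)
                current = []
            continue
        tag = line.split()[-1]            # the tag is the last field on the line
        token = line[:-len(tag)].strip()  # everything before the tag
        if token:                         # skip whitespace-only tokens
            current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences

# Small hand-written sample in the same layout as the output above.
sample = "A B-EOP\nright I-EOP\n \tI-EOP\nlungs I-EOP\n\nNo O\nchange O\n"
parsed = parse_bio(sample)
print(parsed)
```

Dropping whitespace tokens keeps the sequences clean, but it also means token positions no longer line up with the original `Doc`; keep them if you need to map predictions back to character offsets.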


## Your NER solution
Your task is to implement an NER solution and evaluate it. Please refer to Chapter 5 for detailed guidance. Note that the chapter provides only the key functions needed for the implementation; you will need to understand how these functions work in order to integrate them into your own system.
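
For the evaluation step, one common choice is exact-match, entity-level precision, recall, and F1. The sketch below is one possible way to compute them from gold and predicted BIO tag sequences; it is not necessarily the method Chapter 5 uses, and `bio_to_spans` and `entity_prf` are hypothetical helper names.

```python
from typing import List, Set, Tuple

def bio_to_spans(tags: List[str]) -> Set[Tuple[int, int, str]]:
    """Collect (start, end, label) entity spans from a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(tags + ['O']):  # sentinel 'O' closes a trailing entity
        # Close the open entity unless this tag continues it with a matching I- tag.
        if start is not None and not (tag.startswith('I-') and tag[2:] == label):
            spans.add((start, i, label))
            start, label = None, None
        # Open a new entity on B-, or on an I- tag that has nothing to continue.
        if tag.startswith('B-') or (tag.startswith('I-') and start is None):
            start, label = i, tag[2:]
    return spans

def entity_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 over entity spans."""
    g, p = bio_to_spans(gold), bio_to_spans(pred)
    tp = len(g & p)  # a prediction counts only if boundaries and label both match
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy tag sequences, invented for illustration.
gold = ['B-EOP', 'I-EOP', 'O', 'O', 'B-EOP', 'O']
pred = ['B-EOP', 'I-EOP', 'O', 'O', 'O', 'O']
precision, recall, f1 = entity_prf(gold, pred)
print(precision, recall, f1)
```

Exact matching is strict: a prediction that overlaps a gold entity but misses one token earns no credit. Since the annotated spans in this dataset can be long, you may also want to report token-level scores alongside it.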