[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jianlins/BMI_NLP_2024/blob/main/Module%209%20Name%20Entity%20Recognition%20Using%20Deep%20Learning.ipynb)

# Named Entity Recognition Using Deep Learning

This week we will use a different demo dataset to identify patients' family history of colon cancer. The task is to determine whether a mention of colon cancer refers to the patient or to one of the patient's relatives. This demo dataset was also created from the MIMIC demo dataset used in the previous BMI NLP class.

This dataset will be somewhat more challenging than the previous one we used. If you're wondering why, I highly recommend taking a look at the actual annotations using Brat (as I demonstrated in class).

## Download the dataset

In [1]:
%%capture
!wget https://github.com/jianlins/FHI_Hands_on/raw/master/data/cc_train.zip

In [2]:
%%capture
!wget https://github.com/jianlins/FHI_Hands_on/raw/master/img/cc_test.zip

In [3]:
!ls

cc_test.zip  cc_train.zip  sample_data


In [4]:
%%capture
!unzip cc_train.zip -d cc_train

In [5]:
%%capture
!unzip cc_test.zip -d cc_test

In [6]:
!ls

cc_test  cc_test.zip  cc_train	cc_train.zip  sample_data


## Install the packages

In [7]:
!pip install quicksectx git+https://github.com/medspacy/medspacy_io

Collecting git+https://github.com/medspacy/medspacy_io
  Cloning https://github.com/medspacy/medspacy_io to /tmp/pip-req-build-37snx69q
  Running command git clone --filter=blob:none --quiet https://github.com/medspacy/medspacy_io /tmp/pip-req-build-37snx69q
  Resolved https://github.com/medspacy/medspacy_io to commit b8104f8a998c641da04aef6eaa00a2dd8cae5f6a
  Preparing metadata (setup.py) ... done
Collecting quicksectx
  Downloading quicksectx-0.3.9-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (408 kB)
Collecting Cython<=3.0.8,>=0.25 (from quicksectx)
  Downloading Cython-3.0.8-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
Building wheels for collected packages: medspacy-io
  Building wheel for 

## Import packages

In [50]:
from spacy.lang.en import English
from medspacy_io.reader import BratDocReader
from medspacy_io.reader import BratDirReader
import spacy
from pathlib import Path
from medspacy_io.vectorizer import Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
import re
from spacy.tokens import Doc
from typing import List
import pandas as pd

## Read data

In [9]:
config='''[entities]
FAM_COLON_CA_DOC
FAM_COLON_CA
ANATOM
NEGATED_DOC
COLON_CA
Pos_Doc
NEG_DOC
PastConcept
[attributes]
Negation	Arg:COLON_CA,	Value:affirm
Experiencer	Arg:COLON_CA,	Value:patient
Certainty	Arg:COLON_CA,	Value:certain
Note	Arg:FAM_COLON_CA|PossibleConcept|NegatedConcept|PastConcept
Temporality	Arg:COLON_CA,	Value:present
Section	Arg:COLON_CA,	Value:SourceDocumentInformation
[relations]
[events]'''

In [10]:
# The dataset files do not include a schema configuration, so let's create one
Path('annotation.conf').write_text(config)


403

In [11]:
# set up the Brat reader
nlp=spacy.load("en_core_web_sm", disable=['ner'])
dir_reader = BratDirReader(nlp=nlp, support_overlap=False, recursive=True, schema_file=Path('annotation.conf'))

found annotation.conf file


In [12]:
train_docs= dir_reader.read(txt_dir='cc_train')
test_docs= dir_reader.read(txt_dir='cc_test')

## Generate BIO tags

We will reuse the spans_to_bio function from last week to generate BIO tags first.
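
As a reminder of what BIO tagging produces, here is a minimal, spaCy-free sketch. The helper name `tag_tokens`, the example sentence, and the span offsets are made up for illustration; only the `FCC` abbreviation (for `FAM_COLON_CA`) comes from this notebook.

```python
from typing import List, Tuple


def tag_tokens(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Assign BIO tags from (start, end, label) token spans; `end` is exclusive."""
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the entity
    return tags


# Hypothetical sentence: tokens 2..3 ("colon CA") form one FAM_COLON_CA mention.
tokens = ["Father", "had", "colon", "CA", "."]
tags = tag_tokens(tokens, [(2, 4, "FCC")])
for token, tag in zip(tokens, tags):
    print(token, tag)
```

The function below does the same thing on a spaCy `Doc`, and additionally handles sentence boundaries and whitespace tokens.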

In [60]:
def spans_to_bio(doc:Doc, anno_types:List[str], abbr:bool=True)->tuple[str, pd.DataFrame]:
  """
  Converts the entity spans in a spaCy Doc object to BIO format, with an option
  to abbreviate the entity labels. In the string output, an empty line is added
  between sentences to improve readability.

  Parameters:
  - doc (Doc): The spaCy Doc object containing the text and its annotations, including
                entities and sentence boundaries.
  - anno_types (List[str]): A list of annotation types of interest. Note that the current
                            implementation reads spans from `doc.ents` and does not use
                            this list to filter the labels.
  - abbr (bool, optional): If True, entity labels are abbreviated to their initials.
                            Defaults to True.

  Returns:
  - str: A string where each token is followed by its BIO tag (with the entity label if applicable),
          formatted as "token B-entity" or "token I-entity" for tokens within entities, and
          "token O" for tokens outside any entities. Sentences are separated by an empty line.
  - pd.DataFrame: One row per retained token, with the columns `sentence_id`, `token`,
          and `lab` (the BIO tag).
  """
  # Initialize a dictionary to hold BIO tags for each token index
  bio_tags = {token.i: 'O' for token in doc}  # Default to 'O' for outside any entity

  # Preprocess spans to assign BIO tags
  for anno_type in anno_types:
    for span in doc.ents:
      if span:  # Check if span is not empty
        label=span.label_
        if abbr:
          label=''.join([w[0] for w in label.split('_')])
        bio_tags[span.start] = f"B-{label}"  # Begin tag for the first token in the span
        for token in span[1:]:  # Inside tags for the rest of the tokens in the span
          bio_tags[token.i] = f"I-{label}"

  # Generate BIO format string
  bio_text = []
  bio_data={'sentence_id':[],'token':[],'lab':[]}
  for s,sent in enumerate(doc.sents):
    for i,token in enumerate(sent):
      # trim the whitespaces on both sides of a sentence
      if (i==0 or i==len(sent)-1) and str(token).strip()=='':
        bio_text.append('')
        continue
      elif str(token).strip()=='':
        # clean up extra whitespaces within a sentence.
        bio_text.append(f' \t{bio_tags[token.i]}')
        bio_data['lab'].append(bio_tags[token.i])
      else:
        bio_text.append(f"{token.text} {bio_tags[token.i]}")
        bio_data['lab'].append(bio_tags[token.i])
      bio_data['token'].append(token)
      bio_data['sentence_id'].append(s)
    bio_text.append('')  # Empty line between sentences
  return '\n'.join(bio_text), pd.DataFrame(bio_data)

In [61]:
# We will focus on two types of concepts here
data, df=spans_to_bio(train_docs[7], anno_types=['FAM_COLON_CA','COLON_CA'])

In [62]:
df[df['lab']!='O']

     sentence_id      token     lab
0              0  Admission  B-FCCD
929           44      colon   B-FCC
930           44         CA   I-FCC


## Assignment 1
Let's begin with a simpler task: train an NER model using the CRF approach and evaluate its performance.
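
If it helps to get started, CRF toolkits typically take one feature dictionary per token. The sketch below shows one possible feature function. The feature names and the example sentence are illustrative assumptions, not a required design. A library such as sklearn-crfsuite (an assumption, since no CRF library is named here) accepts a list of sentences, each a list of such dictionaries, together with a parallel list of tag sequences.

```python
from typing import Dict, List


def token_features(sentence: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for the i-th token, using the word and its neighbors."""
    word = sentence[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "lower": word.lower(),
        "is_upper": word.isupper(),   # e.g. abbreviations such as "CA"
        "is_title": word.istitle(),
        "is_digit": word.isdigit(),
        "suffix3": word[-3:].lower(),
    }
    if i > 0:
        feats["prev_lower"] = sentence[i - 1].lower()
    else:
        feats["BOS"] = True           # beginning of sentence
    if i < len(sentence) - 1:
        feats["next_lower"] = sentence[i + 1].lower()
    else:
        feats["EOS"] = True           # end of sentence
    return feats


def sentence_features(sentence: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token in one sentence."""
    return [token_features(sentence, i) for i in range(len(sentence))]


# Hypothetical sentence used only to show the shape of the features.
X = sentence_features(["Father", "had", "colon", "CA"])
print(X[2])
```

The `sentence_id` column of the DataFrame returned by `spans_to_bio` can be used to group tokens and tags into sentences before calling a function like this.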

## Assignment 2
Plot the distribution of all tags using a histogram, bar chart, or pie chart. Do you notice any issues? What steps could you take to potentially enhance performance? Please implement your solution here.
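
Before plotting, it can be useful to look at the raw proportions. The following is a small sketch using only the standard library. The function name `tag_distribution` and the label list are made up for illustration and are not taken from the dataset.

```python
from collections import Counter
from typing import Dict, Iterable


def tag_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the share of each tag, ordered from most to least frequent."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.most_common()}


# Made-up labels; in the notebook you would pass the `lab` column instead.
labels = ["O"] * 97 + ["B-FCC", "I-FCC", "B-FCCD"]
dist = tag_distribution(labels)
for tag, share in dist.items():
    print(f"{tag:8s}{share:6.1%}")
```

With the DataFrame returned by `spans_to_bio`, the same counts are available through `df['lab'].value_counts()`, and `df['lab'].value_counts().plot(kind='bar')` draws a bar chart when matplotlib is installed.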

## Assignment 3
Implement a bi-LSTM model for this task.
* Don't know what a bi-LSTM is? Check the lecture recording from the last deep learning module.
* Don't know how to implement a bi-LSTM? Try asking ChatGPT. You will learn to communicate with ChatGPT more efficiently as you use it more or learn from how others use it.

## Assignment 4
Use this [BERT model](https://huggingface.co/google-bert/bert-base-uncased) to implement a token classifier for this NER task.
You can follow the [token classification tutorial here](https://huggingface.co/docs/transformers/en/tasks/token_classification).

**Note**: you will need to either reimplement the spans_to_bio function or write another function to align the labels. **Why**?
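
As a hint, here is a minimal sketch of one common alignment strategy, along the lines of the linked tutorial: only the first subword of each word keeps the word's label, and all other positions get `-100`, the default index that PyTorch's cross-entropy loss ignores. The `word_ids` list below is hand-written for illustration. In practice it would come from a fast tokenizer's `word_ids()` method, and the exact subword split depends on the tokenizer.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto subword positions."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special tokens such as [CLS] and [SEP]
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)          # later subwords of the same word
        previous = word_id
    return aligned


# Hypothetical example: 4 words, where the third word is split into 2 subwords.
word_ids = [None, 0, 1, 2, 2, 3, None]
word_labels = [0, 0, 1, 2]  # e.g. O, O, B-FCC, I-FCC mapped to integer ids
aligned = align_labels(word_ids, word_labels)
print(aligned)
```

Labeling every subword with its word's label is another reasonable choice. Whichever you pick, apply it consistently during training and evaluation.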