[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jianlins/BMI_NLP_2025/blob/main/Module%2011%20Attribute%20Classification.ipynb)

# Attribute classification

We will continue to use the [UUDeCART](https://github.com/UUDeCART/decart_rule_based_nlp) dataset from before. This dataset was created from the MIMIC demo dataset and was labeled by Dr. Barbara E. Jones. It is relatively small and was not annotated by a second annotator, so it should only be used for learning or demonstration purposes.

This dataset originally included modifiers (attributes). Unfortunately, that attribute information was lost during an earlier conversion that simplified the dataset for learning purposes. Here, I will use a straightforward conversion method to simulate an attribute classification dataset. The result is intended only for this exercise.

## Download the dataset

In [None]:
%%capture
!wget https://github.com/UUDeCART/decart_rule_based_nlp/raw/master/data/training_v2.zip

In [None]:
%%capture
!wget https://github.com/UUDeCART/decart_rule_based_nlp/raw/master/data/test_v2.zip

In [None]:
!ls

sample_data  test_v2.zip  training_v2.zip


In [None]:
%%capture
!unzip training_v2.zip

In [None]:
%%capture
!unzip test_v2.zip

In [None]:
!ls

sample_data  test_v2  test_v2.zip  training_v2	training_v2.zip


## Install & import the packages

In [None]:
!pip install quicksectx git+https://github.com/medspacy/medspacy_io

In [None]:
from spacy.lang.en import English
from medspacy_io.reader import BratDocReader
from medspacy_io.reader import BratDirReader
import spacy
from pathlib import Path
from medspacy_io.vectorizer import Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
import pandas as pd
pd.set_option('display.max_colwidth', None)


In [None]:
# The dataset files do not include a schema configuration, so let's create one
concepts=['EVIDENCE_OF_PNEUMONIA', 'PNEUMONIA_DOC_NO', 'PNEUMONIA_DOC_YES']
lines=['[entities]']+concepts
Path('annotation.conf').write_text('\n'.join(lines))

67

In [None]:
# set up the Brat reader
nlp=spacy.load("en_core_web_sm", disable=['ner'])
dir_reader = BratDirReader(nlp=nlp, support_overlap=True, recursive=True, schema_file='annotation.conf')

found annotation.conf file


In [None]:
Vectorizer.docs_to_sents_df?

In [None]:
# This function reads brat annotation files and converts the snippet annotations into a sentence-labelled dataframe
def convert2df(data_folder):
  # read brat annotation into spaCy doc object.
  docs = dir_reader.read(txt_dir=data_folder)
  # convert snippet label into sentence-level labels and generate pandas dataframe
  df = Vectorizer.docs_to_sents_df(docs, track_doc_name=True, sent_window=2)
  # remove document-level labels
  df=df[~df['y'].str.contains('_DOC_')]
  return df[['X','y']]



In [None]:
train_df=convert2df('training_v2')
test_df=convert2df('test_v2')

In [None]:
def to_attr_classify(df: pd.DataFrame):
    keywords = ['opacity', 'infiltrate', 'pneumonia', 'effusion', 'consolidation']

    # Find the offsets of the first occurrence of each keyword in a sentence
    def find_offsets(sentence, keywords):
        offsets = []
        for keyword in keywords:
            start = sentence.find(keyword)
            if start != -1:
                end = start + len(keyword)
                offsets.append((keyword, start, end))
        return offsets

    # Build one row per (sentence, keyword) pair; lowercasing makes the match case-insensitive
    rows = []
    for _, row in df.iterrows():
        offsets = find_offsets(row['X'].lower(), keywords)
        for keyword, start, end in offsets:
            rows.append({'X': row['X'], 'Keyword': keyword, 'Start': start, 'End': end, 'y': row['y']})

    # Create a new dataframe from the rows with keywords found
    # (explicit columns keep the 'y' lookup below valid even if no keyword matched)
    filtered_df = pd.DataFrame(rows, columns=['X', 'Keyword', 'Start', 'End', 'y'])
    # 'NEG' sentences become 'NotAffirmed'; every other label becomes 'Affirmed'
    filtered_df['y'] = filtered_df['y'].map(lambda x: 'NotAffirmed' if x == 'NEG' else 'Affirmed')
    return filtered_df

In [None]:
train_adf=to_attr_classify(train_df)
test_adf=to_attr_classify(test_df)

In [None]:
train_adf[30:60]

Unnamed: 0,X,Keyword,Start,End,y
30,No infiltrates\n or consolidations are present. There is no pneumothorax.,consolidation,23,36,NotAffirmed
31,IMPRESSION:\n \n 1) Tubes and lines as described above.\n \n 2) No acute infiltrate or consolidation.\n\n,infiltrate,86,96,NotAffirmed
32,IMPRESSION:\n \n 1) Tubes and lines as described above.\n \n 2) No acute infiltrate or consolidation.\n\n,consolidation,100,113,NotAffirmed
33,There has been interval worsening of the left lower lobe opacity.\n There is a small left pleural effusion.,opacity,57,64,Affirmed
34,There has been interval worsening of the left lower lobe opacity.\n There is a small left pleural effusion.,effusion,102,110,Affirmed
35,The NG tube courses below the\n diaphragm. There has been interval worsening of the left lower lobe opacity.\n,opacity,105,112,Affirmed
36,IMPRESSION:\n Worsening consolidation in the left lower lobe.\n\n,consolidation,28,41,Affirmed
37,"[**2616-12-10**] 5:27 AM\n CHEST (PORTABLE AP) Clip # [**Clip Number (Radiology) 11900**]\n Reason: assess interval change \n Admitting Diagnosis: TRACHEAL OBSTRUCTION\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 80 year old woman with COPD, pneumonia, tracheomalacia s/p LMSB stent removal. \n",pneumonia,413,422,NotAffirmed
38,"[**Clip Number (Radiology) 11900**]\n Reason: assess interval change \n Admitting Diagnosis: TRACHEAL OBSTRUCTION\n ______________________________________________________________________________\n UNDERLYING MEDICAL CONDITION:\n 80 year old woman with COPD, pneumonia, tracheomalacia s/p LMSB stent removal. \n Please assess interval change. \n",pneumonia,312,321,NotAffirmed
39,"assess interval change \n ______________________________________________________________________________\n FINAL REPORT\n INDICATION: Assess for change COPD, pneumonia, tracheomalacia, and post left\n main stem stent removal.\n \n",pneumonia,258,267,NotAffirmed
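Note that `find_offsets` relies on `str.find`, which reports only the first occurrence of each keyword in a sentence. If you want one row per occurrence instead, a variant based on `re.finditer` might look like the sketch below. The name `find_all_offsets` and the example sentence are made up for illustration and are not part of the dataset or the notebook.

```python
import re

def find_all_offsets(sentence, keywords):
    """Return (keyword, start, end) for every occurrence of every keyword.

    Hypothetical variant of find_offsets; matching is case-insensitive
    because the sentence is lowercased before searching.
    """
    offsets = []
    lowered = sentence.lower()
    for keyword in keywords:
        # re.escape guards against keywords containing regex metacharacters
        for match in re.finditer(re.escape(keyword), lowered):
            offsets.append((keyword, match.start(), match.end()))
    return offsets

# Illustrative sentence (not taken from the dataset)
sentence = "Opacity in the left base. The opacity has worsened."
offsets = find_all_offsets(sentence, ["opacity", "effusion"])
print(offsets)
```

Whether multiple rows per sentence help or hurt depends on the task: in this simulated dataset every keyword in a sentence inherits the same sentence-level label anyway.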


## Assignment 1
Let's assume all the keywords identified above are correct NER labels. Now you want to find out which of these identified concepts appear in "Affirmed" statements and which do not.

Let's see if a BERT sequence classifier can perform well on this task. You can try following the tutorial here:
https://huggingface.co/docs/transformers/en/tasks/sequence_classification
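Before following the tutorial, the string labels need to be turned into integer ids, since sequence classification models are trained on integer class labels. The sketch below shows one possible way to do that. The helper name `to_records` and the `text`/`label` field names are assumptions chosen to resemble the tutorial's data layout; only the two label strings come from this notebook.

```python
# Label strings come from to_attr_classify; the id assignment is an arbitrary choice.
LABELS = ["NotAffirmed", "Affirmed"]
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}

def to_records(texts, labels):
    """Pair each text with its integer label id (hypothetical helper)."""
    return [{"text": t, "label": label2id[y]} for t, y in zip(texts, labels)]

# Two sentences copied from the dataframe preview above
records = to_records(
    [
        "No acute infiltrate or consolidation.",
        "Worsening consolidation in the left lower lobe.",
    ],
    ["NotAffirmed", "Affirmed"],
)
print(records)
```

With the notebook's dataframes this would be called as `to_records(train_adf['X'], train_adf['y'])`. The two mappings can then be passed to the model as `id2label` and `label2id`, as the tutorial does, so that predictions are reported with readable label names.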

In [None]:
# Your solution goes here

## Assignment 2

Now, you will want to identify the errors your model made. List the errors observed. What are your thoughts on the potential causes and possible solutions?
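One simple way to start is to line up the gold labels with the predicted labels and keep only the mismatches. The sketch below assumes you already have your model's predictions as a list of label strings; the helper name `collect_errors` is hypothetical, and the predictions in the example are invented purely to show the mechanics, not actual model output.

```python
def collect_errors(texts, gold, predicted):
    """Return one dict per example whose predicted label differs from the gold label."""
    errors = []
    for text, g, p in zip(texts, gold, predicted):
        if g != p:
            errors.append({"text": text, "gold": g, "predicted": p})
    return errors

# Sentences copied from the dataframe preview above
texts = [
    "No acute infiltrate or consolidation.",
    "Worsening consolidation in the left lower lobe.",
]
gold = ["NotAffirmed", "Affirmed"]
# Invented predictions, for illustration only
predicted = ["Affirmed", "Affirmed"]

errors = collect_errors(texts, gold, predicted)
for err in errors:
    print(err)
```

Loading the result into a dataframe with `pd.DataFrame(errors)` makes it easier to read the misclassified sentences side by side and to look for patterns.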

In [None]:
# Your code goes here

## Assignment 3

Now, let's try a simple approach that has proven effective elsewhere. I'm not certain whether it will make a difference on the MIMIC dataset, but we'll see :)

Instead of using the X directly as the input, let's try inserting the "[SEP]" token both before and after the keyword (What is "[SEP]"?). Then, use this revised string as the input to train your model.
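As a starting point, the string manipulation itself can be sketched with the `Start` and `End` columns computed earlier. The helper name `mark_keyword` and the choice to pad the keyword with single spaces are assumptions, not a prescribed format. The offsets were computed on the lowercased sentence, which for ordinary English text has the same length as the original, so they are applied here to the original string.

```python
def mark_keyword(text, start, end, marker="[SEP]"):
    """Wrap text[start:end] in marker tokens, padded with spaces (hypothetical helper)."""
    return f"{text[:start]}{marker} {text[start:end]} {marker}{text[end:]}"

# Sentence copied from the dataframe preview above
text = "No acute infiltrate or consolidation."
start = text.lower().find("infiltrate")
end = start + len("infiltrate")

marked = mark_keyword(text, start, end)
print(marked)
```

With the notebook's dataframe, the same helper could be applied row by row, for example `train_adf.apply(lambda r: mark_keyword(r['X'], r['Start'], r['End']), axis=1)`, and the resulting column used in place of `X`.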

In [None]:
# Your solution here