[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jianlins/BMI_NLP_2025/blob/main/Module%205%20text%20classification%20demo.ipynb)

# Sentence classification

We will use the same [UUDeCART](https://github.com/UUDeCART/decart_rule_based_nlp) dataset as before. It was created from the MIMIC demo dataset and labeled by Dr. Barbara E. Jones. It is relatively small and was not annotated by a second annotator, so it should only be used for learning or demonstration purposes.

We will start with a very simple implementation, just to get you familiar with the process of training and evaluating an ML model. Then you will try some extra exercises to see how you can improve on the baseline.

## Download the dataset

In [None]:
!wget https://github.com/UUDeCART/decart_rule_based_nlp/raw/master/data/training_v2.zip

In [None]:
!wget https://github.com/UUDeCART/decart_rule_based_nlp/raw/master/data/test_v2.zip

In [None]:
!ls

sample_data  test_v2.zip  training_v2.zip


In [None]:
!unzip training_v2.zip

Archive:  training_v2.zip
caution: filename not matched:  test_v2.zip


In [None]:
!unzip test_v2.zip

In [None]:
!ls

sample_data  test_v2  test_v2.zip  training_v2	training_v2.zip


## Install & import the packages

In [None]:
!pip install quicksectx git+https://github.com/medspacy/medspacy_io

In [None]:
from spacy.lang.en import English
from medspacy_io.reader import BratDocReader
from medspacy_io.reader import BratDirReader
import spacy
from pathlib import Path
from medspacy_io.vectorizer import Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

In [None]:
# The dataset files do not include a schema configuration, so let's create one
concepts=['EVIDENCE_OF_PNEUMONIA', 'PNEUMONIA_DOC_NO', 'PNEUMONIA_DOC_YES']
lines=['[entities]']+concepts
Path('annotation.conf').write_text('\n'.join(lines))

67

In [None]:
# set up the Brat reader
nlp=spacy.load("en_core_web_sm", disable=['ner'])
dir_reader = BratDirReader(nlp=nlp, support_overlap=True, recursive=True, schema_file='annotation.conf')

In [None]:
# This function reads the brat annotation files and converts the snippet annotations into a sentence-labeled dataframe
def convert2df(data_folder):
  # read brat annotation into spaCy doc object.
  docs = dir_reader.read(txt_dir=data_folder)
  # convert snippet label into sentence-level labels and generate pandas dataframe
  df = Vectorizer.docs_to_sents_df(docs, track_doc_name=True)
  # remove document-level labels
  df=df[~df['y'].str.contains('_DOC_')]
  return df[['X','y']]
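
The filtering step inside `convert2df` can be tried on its own. The sketch below uses a small hand-made dataframe as a stand-in for the output of `Vectorizer.docs_to_sents_df`; the rows are hypothetical and only the column names `X` and `y` and the label names come from the code above.

```python
import pandas as pd

# Hypothetical rows, for illustration only (not taken from the dataset)
toy_df = pd.DataFrame({
    "X": ["There is a right lower lobe infiltrate.",
          "No acute distress.",
          "Placeholder for a whole document."],
    "y": ["EVIDENCE_OF_PNEUMONIA", "NEG", "PNEUMONIA_DOC_YES"],
})

# True for rows that carry a document-level label
is_doc_label = toy_df["y"].str.contains("_DOC_")

# Keep only the sentence-level rows, as convert2df does
sentence_df = toy_df[~is_doc_label][["X", "y"]]
print(sentence_df["y"].tolist())
```

Dropping the `~` would keep the document-level rows instead, which is the starting point for a document classification variant.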



In [None]:
train_df=convert2df('training_v2')

In [None]:
# Let's check the EVIDENCE_OF_PNEUMONIA annotations
train_df[train_df['y']!='NEG']

In [None]:
# Take a look at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# and see what you can configure for this vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_df['X'])

In [None]:
%%time
# Now we train an SVM model
from sklearn.svm import SVC
model = SVC()
model.fit(X, train_df['y'])
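
As an optional alternative, the vectorizer and the classifier can be chained with scikit-learn's `make_pipeline`, so that `fit` and `predict` accept raw text directly. The sketch below is not part of the original baseline; the sentences and the variable names (`toy_texts`, `text_clf`) are hypothetical and used for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy sentences and labels, for illustration only
toy_texts = [
    "right lower lobe consolidation consistent with pneumonia",
    "no focal consolidation or effusion",
    "patchy opacity concerning for pneumonia",
    "heart size is normal",
]
toy_labels = ["EVIDENCE_OF_PNEUMONIA", "NEG", "EVIDENCE_OF_PNEUMONIA", "NEG"]

# The pipeline vectorizes the text and then fits the SVM in one call
text_clf = make_pipeline(CountVectorizer(), SVC())
text_clf.fit(toy_texts, toy_labels)

# Raw text goes in; the fitted vectorizer is applied automatically
toy_preds = text_clf.predict(["no evidence of pneumonia"])
print(toy_preds)
```

A pipeline makes it harder to accidentally refit the vectorizer on new data, because the fitted vectorizer travels with the model.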

## Evaluation

In [None]:
# Let's see how it does on the training set. This comparison is usually not considered an evaluation,
# but it can give us an impression of whether the model complexity is sufficient, whether the model is overfitting, etc.
predictions = model.predict(X)

In [None]:
print(classification_report(train_df['y'], predictions))

Now let's take a look at the test set.

In [None]:
test_df=convert2df('test_v2')

In [None]:
# Note: you have to use "transform" here instead of "fit_transform". Why?
X_test = vectorizer.transform(test_df['X'])
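
One way to explore the question above is to compare feature matrix shapes on a toy example. The sketch below uses made-up sentences; it shows that `transform` reuses the vocabulary learned from the training text, while refitting on the test text builds a different vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up sentences, for illustration only
train_texts = ["patchy infiltrate in the left lung", "lungs are clear"]
test_texts = ["new infiltrate with effusion"]

vec = CountVectorizer()
X_train_toy = vec.fit_transform(train_texts)  # learns the vocabulary
X_test_ok = vec.transform(test_texts)         # reuses that vocabulary

# Refitting on the test text builds a new, unrelated vocabulary
X_test_refit = CountVectorizer().fit_transform(test_texts)

print(X_train_toy.shape[1], X_test_ok.shape[1], X_test_refit.shape[1])
```

The first two column counts match, so a model trained on `X_train_toy` can score `X_test_ok`. The refit matrix has a different number of columns, and even when the counts happen to agree, the columns would stand for different words.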

In [None]:
y_test_preds=model.predict(X_test)

In [None]:
print(classification_report(test_df['y'], y_test_preds))

**Compare** the training and test performance above. What does the difference tell you?

Now let's take a closer look at the errors.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', None)


In [None]:
test_df['pred']=y_test_preds

In [None]:
test_df[test_df['y']!=test_df['pred']][:5]

## Check a few more errors.
What have you found? What are the possible causes of these errors?
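
When reviewing errors, it can help to separate missed positives from false alarms. The following is a minimal sketch on a hand-made dataframe that reuses the column names `X`, `y`, and `pred` from above; the rows themselves are hypothetical placeholders.

```python
import pandas as pd

# Hypothetical gold labels and predictions, for illustration only
toy_results = pd.DataFrame({
    "X": ["s1", "s2", "s3", "s4", "s5"],
    "y": ["NEG", "EVIDENCE_OF_PNEUMONIA", "NEG", "EVIDENCE_OF_PNEUMONIA", "NEG"],
    "pred": ["NEG", "NEG", "EVIDENCE_OF_PNEUMONIA", "EVIDENCE_OF_PNEUMONIA", "NEG"],
})

# Rows are gold labels, columns are predicted labels
print(pd.crosstab(toy_results["y"], toy_results["pred"]))

errors = toy_results[toy_results["y"] != toy_results["pred"]]
# Positive sentences the model missed
false_negatives = errors[errors["y"] == "EVIDENCE_OF_PNEUMONIA"]
# Negative sentences the model flagged as evidence
false_positives = errors[errors["pred"] == "EVIDENCE_OF_PNEUMONIA"]
print(len(false_negatives), len(false_positives))
```

Applying the same two filters to `test_df` lets you read each group of errors separately, since the causes often differ between them.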


## Now let's try applying TF-IDF
Read this page and its examples:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

## Exercise:
* Implement your TF-IDF solution.
* Try at least three more techniques that you think would be effective (e.g., stemming or normalization), and check whether they improve the performance.
* Instead of sentence classification, try document classification. (Hint: inside the function "convert2df", we filtered out the document-level annotations. For this task, you will use these labels and disregard 'EVIDENCE_OF_PNEUMONIA'.)


You will be asked to demonstrate and explain your work during the class.
