# Sentence classification

We will use the previous [UUDeCART](https://github.com/UUDeCART/decart_rule_based_nlp) dataset. This dataset was created using the MIMIC demo dataset and was labeled by Dr. Barbara E. Jones. It is relatively small and was not annotated by a second annotator. Therefore, it should only be used for learning or demonstration purposes.

We will start with a very simple implementation, just to get you familiar with the process of training and evaluating an ML model. Then you will try some extra exercises to see how you can improve on the baseline.

## Download the dataset

In [None]:
!wget https://github.com/UUDeCART/decart_rule_based_nlp/raw/master/data/training_v2.zip

In [None]:
!wget https://github.com/UUDeCART/decart_rule_based_nlp/raw/master/data/test_v2.zip

In [3]:
!ls

sample_data  test_v2.zip  training_v2.zip


In [8]:
!unzip training_v2.zip

Archive:  training_v2.zip
caution: filename not matched:  test_v2.zip


In [None]:
!unzip test_v2.zip

In [9]:
!ls

sample_data  test_v2  test_v2.zip  training_v2	training_v2.zip


## Install & import the packages

In [None]:
!pip install quicksectx git+https://github.com/medspacy/medspacy_io

In [16]:
from spacy.lang.en import English
from medspacy_io.reader import BratDocReader
from medspacy_io.reader import BratDirReader
import spacy
from pathlib import Path
from medspacy_io.vectorizer import Vectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report

In [94]:
# The dataset files do not include a schema configuration, so let's create one
concepts=['EVIDENCE_OF_PNEUMONIA', 'PNEUMONIA_DOC_NO', 'PNEUMONIA_DOC_YES']
lines=['[entities]']+concepts
Path('annotation.conf').write_text('\n'.join(lines))

67

In [95]:
# set up the Brat reader
nlp=spacy.load("en_core_web_sm", disable=['ner'])
dir_reader = BratDirReader(nlp=nlp, support_overlap=True, recursive=True, schema_file='annotation.conf')

In [99]:
# This function reads brat annotation files and converts the snippet annotations into a sentence-labelled dataframe
def convert2df(data_folder):
  # read brat annotation into spaCy doc object.
  docs = dir_reader.read(txt_dir=data_folder)
  # convert snippet label into sentence-level labels and generate pandas dataframe
  df = Vectorizer.docs_to_sents_df(docs, track_doc_name=True)
  # remove document-level labels
  df=df[~df['y'].str.contains('_DOC_')]
  return df[['X','y']]
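The filtering step inside `convert2df` can be tried out on its own. The sketch below uses a made-up dataframe (`toy_df` and its rows are hypothetical, and the real layout produced by `docs_to_sents_df` may differ) to show how the `_DOC_` mask separates sentence-level rows from document-level rows.

```python
import pandas as pd

# Hypothetical rows, only meant to mimic the 'X' / 'y' columns used above
toy_df = pd.DataFrame({
    "X": ["No focal opacity.",
          "Right upper lobe pneumonia.",
          "(document-level row)",
          "(document-level row)"],
    "y": ["NEG", "EVIDENCE_OF_PNEUMONIA", "PNEUMONIA_DOC_YES", "PNEUMONIA_DOC_NO"],
})

# Boolean mask: True where the label is a document-level label
is_doc_label = toy_df["y"].str.contains("_DOC_")

# '~' inverts the mask, keeping only the sentence-level labels
sentence_rows = toy_df[~is_doc_label]
# Using the mask directly keeps only the document-level labels
document_rows = toy_df[is_doc_label]

print(sentence_rows["y"].tolist())
print(document_rows["y"].tolist())
```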



In [100]:
train_df=convert2df('training_v2')

In [101]:
# Let's check the EVIDENCE_OF_PNEUMONIA annotations
train_df[train_df['y']!='NEG']

Unnamed: 0,X,y
39,There is opacity in the\n right upper lobe...,EVIDENCE_OF_PNEUMONIA
40,Findings consistent with right upper lobe pneu...,EVIDENCE_OF_PNEUMONIA
63,There is evidence of patchy\n opacity note...,EVIDENCE_OF_PNEUMONIA
102,"Since\n the prior study, the patchy bilate...",EVIDENCE_OF_PNEUMONIA
103,Emphysema with pulmonary edema and multifocal\...,EVIDENCE_OF_PNEUMONIA
...,...,...
998,Patchy opacities are seen throughout the right...,EVIDENCE_OF_PNEUMONIA
1016,Again seen are patchy opacities in the left lu...,EVIDENCE_OF_PNEUMONIA
1055,"A right IJ line, NGT, and ETT are\n unchan...",EVIDENCE_OF_PNEUMONIA
1095,There has been no significant change in the pa...,EVIDENCE_OF_PNEUMONIA


In [105]:
# Take a look at https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html
# and see what you can configure for this vectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(train_df['X'])
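To get a feel for what the vectorizer options do, here is a minimal sketch on three made-up sentences (not taken from the dataset). It compares the default settings with `ngram_range=(1, 2)`, one of several documented options; whether bigrams actually help on this task is something you would need to test yourself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy sentences, invented for illustration only
toy_sentences = [
    "Patchy opacity in the right upper lobe.",
    "No focal opacity or effusion.",
    "Right upper lobe pneumonia.",
]

# Default settings: lowercased single-word tokens
unigram_vectorizer = CountVectorizer()
unigram_matrix = unigram_vectorizer.fit_transform(toy_sentences)

# Unigrams plus bigrams, so phrases such as "upper lobe" become features too
bigram_vectorizer = CountVectorizer(ngram_range=(1, 2))
bigram_matrix = bigram_vectorizer.fit_transform(toy_sentences)

# Same number of rows (sentences), more columns (features) with bigrams
print(unigram_matrix.shape, bigram_matrix.shape)
print(sorted(bigram_vectorizer.vocabulary_)[:10])
```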

In [98]:
%%time
# Now we train an SVM model
from sklearn.svm import SVC
model = SVC()
model.fit(X, train_df['y'])

CPU times: user 83.3 ms, sys: 3.5 ms, total: 86.8 ms
Wall time: 181 ms


## Evaluation

In [80]:
# Let's see how it does on the training set. This comparison is usually not considered evaluation,
# but it can give us an impression of whether the model complexity is sufficient, whether the model is overfitting, etc.
predictions = model.predict(X)

In [107]:
print(classification_report(train_df['y'], predictions))

                       precision    recall  f1-score   support

EVIDENCE_OF_PNEUMONIA       0.98      0.93      0.95        68
                  NEG       0.99      1.00      1.00       980

             accuracy                           0.99      1048
            macro avg       0.99      0.96      0.98      1048
         weighted avg       0.99      0.99      0.99      1048



Now we take a look at the test set.

In [112]:
test_df=convert2df('test_v2')

In [91]:
# Note: you will have to use "transform" here instead of "fit_transform". Why?
X_test = vectorizer.transform(test_df['X'])
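One way to explore the question above is to compare both calls on a toy example. In this sketch (the sentences and the `toy_` names are made up), `transform` reuses the vocabulary learned from the training texts, while refitting on the test texts produces a different set of columns that a trained model would not know how to read.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy texts, invented for illustration only
toy_train_texts = ["patchy opacity in the right lung", "no acute findings"]
toy_test_texts = ["new consolidation in the left lung"]

toy_vectorizer = CountVectorizer()
X_toy_train = toy_vectorizer.fit_transform(toy_train_texts)

# transform: same columns as training; unseen words are simply ignored
X_toy_test = toy_vectorizer.transform(toy_test_texts)

# Refitting on the test texts builds a new vocabulary with different columns
refit_vectorizer = CountVectorizer()
X_toy_refit = refit_vectorizer.fit_transform(toy_test_texts)

print("train features:", X_toy_train.shape[1])
print("test via transform:", X_toy_test.shape[1])
print("test via fit_transform:", X_toy_refit.shape[1])
```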

In [108]:
y_test_preds=model.predict(X_test)

In [109]:
print(classification_report(test_df['y'], y_test_preds))

                       precision    recall  f1-score   support

EVIDENCE_OF_PNEUMONIA       0.82      0.25      0.38        36
                  NEG       0.94      1.00      0.97       433

             accuracy                           0.94       469
            macro avg       0.88      0.62      0.68       469
         weighted avg       0.93      0.94      0.92       469



**Compare** the two performance reports above. What does the difference tell you?
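When reading the reports, keep in mind that the classes are very unbalanced, so accuracy alone can look good while the minority class is mostly missed. The sketch below uses made-up labels (not the real predictions) to show how accuracy and minority-class recall can diverge.

```python
from sklearn.metrics import accuracy_score, recall_score

POS = "EVIDENCE_OF_PNEUMONIA"
NEG = "NEG"

# Made-up ground truth: 10 positive sentences, 90 negative ones
toy_y_true = [POS] * 10 + [NEG] * 90
# Made-up predictions: only 2 of the 10 positives are found, all negatives are correct
toy_y_pred = [POS] * 2 + [NEG] * 8 + [NEG] * 90

toy_accuracy = accuracy_score(toy_y_true, toy_y_pred)
toy_recall = recall_score(toy_y_true, toy_y_pred, pos_label=POS)

print(f"accuracy: {toy_accuracy:.2f}")
print(f"recall for {POS}: {toy_recall:.2f}")
```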

Now let's take a closer look at the errors.

In [116]:
import pandas as pd
pd.set_option('display.max_colwidth', None)


In [114]:
test_df['pred']=y_test_preds

In [120]:
test_df[test_df['y']!=test_df['pred']][:5]

Unnamed: 0,X,y,pred
14,There is bilateral mild peribronchial thickening with increased\n interstitial markings and perihilar haziness.,EVIDENCE_OF_PNEUMONIA,NEG
30,There remains increased opacity in\n the right lower lobe predominantly in the retrocardiac region as well as some\n increased patchy opacity in the right mid lung zone.,EVIDENCE_OF_PNEUMONIA,NEG
31,There remains increased opacity in\n the right lower lobe predominantly in the retrocardiac region as well as some\n increased patchy opacity in the right mid lung zone.,EVIDENCE_OF_PNEUMONIA,NEG
32,"Right mid and right lower lung zone opacities are unchanged, and are\n probably related to the previously provided history of pneumonia.\n",EVIDENCE_OF_PNEUMONIA,NEG
80,"Ill-defined multifocal\n opacities are in the left upper lobe and in both bases, right greater than\n left.",EVIDENCE_OF_PNEUMONIA,NEG



## Check a few more errors.
What have you found? What's the possible cause of these errors?


## Now let's try applying TF-IDF
Read this page and examples
https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
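Before applying it to the dataset, it may help to see what `TfidfVectorizer` does on a tiny example. This sketch uses three made-up sentences with the default settings: a word that appears in every sentence gets the lowest inverse document frequency, a rarer word gets a higher one, and each row is scaled to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sentences, invented for illustration only
toy_reports = [
    "opacity in the right upper lobe",
    "no opacity in the left lung",
    "the heart size is normal",
]

tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform(toy_reports)

# Look up the learned inverse document frequency of two words
vocab = tfidf.vocabulary_
idf_the = tfidf.idf_[vocab["the"]]      # appears in all three sentences
idf_heart = tfidf.idf_[vocab["heart"]]  # appears in only one sentence

# With the default norm='l2', each row vector has length 1
row_norms = np.sqrt(tfidf_matrix.multiply(tfidf_matrix).sum(axis=1))

print("idf('the'):", idf_the)
print("idf('heart'):", idf_heart)
```

It can be used as a drop-in replacement for `CountVectorizer`, since both expose `fit_transform` and `transform`.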

## Exercise:
* Implement your solution.
* Try at least 3 more tricks that you think would be effective and see if these methods can help improve the performance, e.g. stemming, normalization, etc.
* Instead of sentence classification, try document classification. (Hint: inside the function "convert2df", we filtered out the document-level annotations. For this task, you will use these labels and disregard 'EVIDENCE_OF_PNEUMONIA'.)
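As a starting point for the normalization idea, here is a minimal sketch that plugs a custom function into `CountVectorizer` through its `preprocessor` parameter. The function `normalize_text` and the choice to mask digits with `num` are hypothetical examples, not a recommended recipe. Note that a custom preprocessor replaces the built-in lowercasing, so the function lowercases the text itself.

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

def normalize_text(text):
    """Hypothetical normalizer: lowercase, collapse whitespace, mask digits."""
    text = text.lower()
    text = re.sub(r"\s+", " ", text)    # line breaks and repeated spaces become one space
    text = re.sub(r"\d+", "num", text)  # every run of digits becomes the token 'num'
    return text.strip()

# Toy sentences, invented for illustration only
toy_texts = [
    "Opacity measuring 3 cm\n   in the right lobe.",
    "Opacity measuring 12 cm in the RIGHT lobe.",
]

normalized_vectorizer = CountVectorizer(preprocessor=normalize_text)
normalized_matrix = normalized_vectorizer.fit_transform(toy_texts)

print(normalize_text(toy_texts[0]))
print(sorted(normalized_vectorizer.vocabulary_))
```

After normalization the two toy sentences produce identical feature vectors, which is the kind of effect to check against the test-set report.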


You will be asked to demonstrate and explain your work during the class.
