# Medical Notes Classification

Medical notes are a useful source of information for extracting patient data, and classifying them is an important task in medical NLP. Techniques for this problem range from traditional methods (Logistic Regression, SVM, ...) to state-of-the-art models (Transformers).

The code below is a baseline model for a text classification problem in the medical domain.

* Input: a corpus of medical transcriptions.
* Output: the type of each note.

In this problem, we classify each note into one of five labels:
* Surgery
* Consult - History and Phy.
* Cardiovascular / Pulmonary
* Orthopedic
* Other

The train-test split is already defined; please don't change it.

Metric to evaluate: `f1_macro`
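
Macro F1 is the unweighted mean of the per-class F1 scores, so each of the five labels counts equally regardless of how many notes it has. A minimal pure-Python sketch of the computation follows; the helper name `macro_f1` and the toy labels are illustrative assumptions, not part of the original notebook. The result should agree with `sklearn.metrics.f1_score(..., average='macro')` on the same inputs.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (hypothetical helper for illustration)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data: "Orthopedic" is never predicted, so its F1 is 0 and drags the mean down
y_true = ["Other", "Other", "Other", "Surgery", "Orthopedic"]
y_pred = ["Other", "Other", "Surgery", "Surgery", "Other"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.4444
```

Because a class with zero correct predictions contributes a full 0 to the average, macro F1 penalizes models that ignore the smaller specialties.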

# Library & Data Loading

In [None]:
!pip install -U transformers simpletransformers
!pip install tensorboardX

Collecting transformers
  Downloading transformers-4.11.2-py3-none-any.whl (2.9 MB)
Collecting simpletransformers
  Downloading simpletransformers-0.62.0-py3-none-any.whl (230 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.46-py3-none-any.whl (895 kB)
Collecting huggingface-hub>=0.0.17
  Downloading huggingface_hub-0.0.17-py3-none-any.whl (52 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading tokenizers-0.10.3-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (3.3 MB)

Collecting tensorboardX
  Downloading tensorboardX-2.4-py2.py3-none-any.whl (124 kB)


In [None]:
import os
import pandas as pd
import numpy as np
import string
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from simpletransformers.classification import ClassificationModel, ClassificationArgs

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'


### PLEASE DON'T CHANGE ANYTHING IN THIS SECTION ###
DATA = "https://github.com/socd06/private_nlp/raw/master/data/mtsamples.csv"

filtered_labels = [
    "Surgery",
    "Consult - History and Phy.",
    "Cardiovascular / Pulmonary",
    "Orthopedic",
]
data = pd.read_csv(DATA, usecols=['medical_specialty', 'transcription']).dropna()
data.columns = ['labels', 'text']
data['labels'] = [i.strip() if (i.strip() in filtered_labels) else 'Other' for i in data.labels.to_list()]
train, test = train_test_split(data, test_size=0.4, stratify=data.labels, random_state=0)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
### END ###

# Preprocessing

In [10]:
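def preprocess(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()  # Lowercase
    text = re.sub(r'[^\w\s]+', ' ', text)  # Replace punctuation with a space (\w already covers digits)
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of whitespace
    return text.strip()

# Requires `import re` from the import cell above to have been run first
preprocess('This is   VEF\'s academy!')  # -> 'this is vef s academy'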

In [None]:
train['text'] = train.text.apply(preprocess)
test['text'] = test.text.apply(preprocess)
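
To make the effect of this cleaning step concrete, here is a small self-contained sketch that applies the same three operations to two made-up strings (the sample texts are assumptions for illustration, not taken from the dataset). It suggests two things worth keeping in mind: numeric values such as `120/80` are split into separate tokens, and underscores survive because `\w` matches them.

```python
import re


def preprocess(text):
    # Same steps as the notebook cell: lowercase, punctuation -> space, squeeze spaces
    text = text.lower()
    text = re.sub(r'[^\w\s]+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip()


# Hypothetical snippets in the style of a clinical note
samples = [
    "BP: 120/80 mmHg; HR=72.",
    "Patient_ID 42 -- post-op day #3",
]
cleaned = [preprocess(s) for s in samples]
for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

Whether splitting measurements this way helps or hurts probably depends on the tokenizer of the downstream model, so it may be worth comparing results with and without this step.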

In [None]:
le = LabelEncoder()
train['labels'] = le.fit_transform(train.labels)
test['labels'] = le.transform(test.labels)
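
`LabelEncoder` assigns integer ids to the class names in sorted order, which is why the classification report later lists "Cardiovascular / Pulmonary" first and "Surgery" last. The sketch below emulates that mapping in plain Python so the ids are easy to inspect; it is an illustration of the idea under that assumption, not scikit-learn's actual implementation, and the names `to_id`, `encoded`, and `decoded` are hypothetical.

```python
# A few label strings as they appear after filtering in the data-loading cell
labels = [
    "Surgery", "Other", "Orthopedic", "Other",
    "Consult - History and Phy.", "Cardiovascular / Pulmonary",
]

# Sorted unique class names play the role of `le.classes_`
classes = sorted(set(labels))
to_id = {name: i for i, name in enumerate(classes)}

encoded = [to_id[name] for name in labels]   # analogous to le.transform
decoded = [classes[i] for i in encoded]      # analogous to le.inverse_transform

for name, i in to_id.items():
    print(i, name)
```

Fitting the encoder on the training split only, as the cell above does, works here because the stratified split keeps all five labels in both splits.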

# Baseline Model

In [None]:
model_args = ClassificationArgs()
model_args.num_train_epochs = 3
model_args.learning_rate = 1e-5
model_args.reprocess_input_data = True
model_args.overwrite_output_dir = True

In [None]:
model = ClassificationModel('bert', 'microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext',
                            args=model_args, num_labels=len(le.classes_))

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Ber

In [None]:
model.train_model(train)

(1119, 0.9379104327633511)
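
The tuple returned by `train_model` appears to be the total number of optimizer steps followed by the average training loss. The step count can be sanity-checked with a little arithmetic: the training run reported 2979 examples and 373 batches per epoch over 3 epochs. The sketch below assumes a training batch size of 8 (believed to be the simpletransformers default, since the notebook does not set it) and no gradient accumulation.

```python
import math

n_train = 2979      # training examples reported by the training run
batch_size = 8      # assumption: default train_batch_size, not set in the notebook
epochs = 3          # model_args.num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)  # 373 1119
```

The result matches the first element of the returned tuple, which is consistent with the assumed batch size.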

In [None]:
y_pred, outputs = model.predict(test.text.tolist())

In [None]:
y_test = test.labels.tolist()
print(metrics.f1_score(y_test, y_pred, average='macro'))
print(metrics.classification_report(y_test, y_pred, target_names=list(le.classes_)))

0.3729330560342061
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.35      0.39      0.37       148
Consult - History and Phy.       0.32      0.06      0.10       207
                Orthopedic       0.39      0.14      0.21       142
                     Other       0.66      0.74      0.70      1055
                   Surgery       0.43      0.57      0.49       435

                  accuracy                           0.56      1987
                 macro avg       0.43      0.38      0.37      1987
              weighted avg       0.54      0.56      0.53      1987
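
The summary rows of the report can be reproduced from the per-class rows, which helps explain why the macro average (0.37) sits well below the weighted average (0.53). The sketch below uses the rounded F1 scores and supports printed above; the variable names are illustrative.

```python
# Per-class F1 and support, copied from the classification report above
f1 = {
    "Cardiovascular / Pulmonary": 0.37,
    "Consult - History and Phy.": 0.10,
    "Orthopedic": 0.21,
    "Other": 0.70,
    "Surgery": 0.49,
}
support = {
    "Cardiovascular / Pulmonary": 148,
    "Consult - History and Phy.": 207,
    "Orthopedic": 142,
    "Other": 1055,
    "Surgery": 435,
}

total = sum(support.values())

# Macro average: every class counts equally
macro = sum(f1.values()) / len(f1)

# Weighted average: each class counts in proportion to its support
weighted = sum(f1[name] * support[name] for name in f1) / total

# Share of the test set taken by the largest class
majority_share = max(support.values()) / total

print(round(macro, 2), round(weighted, 2), round(majority_share, 2))  # 0.37 0.53 0.53
```

"Other" makes up about 53% of the test set, so its relatively high F1 dominates the weighted average, while the weak scores on the three smallest specialties pull the macro average down.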



# My Model


### Encode Text


In [None]:
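# NOTE: `doc2vec` is not, as far as we know, a method of simpletransformers' ClassificationModel.
# `model` here is assumed to be a separate document-embedding object that exposes `doc2vec`.
X_train = np.asarray(model.doc2vec(train['text']))
X_test = np.asarray(model.doc2vec(test['text']))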


In [None]:
# One-hot encode the integer labels
from sklearn.preprocessing import OneHotEncoder

y_train = train['labels']

# Binary (one-hot) encode; `sparse=False` returns a dense array
# (scikit-learn 1.2+ renames this parameter to `sparse_output`)
onehot_encoder = OneHotEncoder(sparse=False)
Y_train = onehot_encoder.fit_transform(np.array(y_train).reshape(-1, 1))
Y_test = onehot_encoder.transform(np.array(test['labels']).reshape(-1, 1))

# Invert the first example back to its original label name
inverted = le.inverse_transform([np.argmax(Y_train[0, :])])
print(inverted)
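
The one-hot round trip above can be illustrated with NumPy alone. This is a minimal sketch assuming the labels are integer ids `0..4` and that every id occurs in the training data, in which case column `j` of the encoded matrix corresponds to id `j`; the toy array `y` is made up for the example.

```python
import numpy as np

n_classes = 5
y = np.array([4, 3, 2, 3, 1, 0])   # toy integer labels, as produced by a label encoder

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[y]

# argmax along each row recovers the original integer id
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)   # (6, 5)
print(recovered)       # [4 3 2 3 1 0]
```

The same `argmax` step is what turns the softmax outputs of a network trained on one-hot targets back into class ids before scoring.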

### Train with LSTM