# Medical Notes Classification

Medical notes are a useful source of information for extracting patient data, and classifying them is an important task in medical NLP. Techniques for this problem range from traditional methods (Logistic Regression, SVM, ...) to state-of-the-art Transformer models.

The code below is a baseline model for a text classification problem in the medical domain.

* Input: the corpus of medical transcriptions.
* Output: the type of each note.

In this problem, we classify each note into one of five labels:
* Surgery
* Consult - History and Phy.
* Cardiovascular / Pulmonary
* Orthopedic
* Other

The train-test split is already defined; please don't change it.

Metric to evaluate: `f1_macro`
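
As a small illustration of the metric, the sketch below computes macro F1 on made-up labels (the `y_true` and `y_pred` values are toy data, not taken from this dataset). Macro F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one.

```python
import numpy as np
from sklearn.metrics import f1_score

# Toy labels for three classes (illustrative only)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# One F1 score per class, in class order
per_class = f1_score(y_true, y_pred, average=None)

# Macro F1: the plain average of the per-class scores
macro = f1_score(y_true, y_pred, average="macro")

print(per_class)
print(macro)
```

Because every class has the same weight, a model that ignores the small classes is penalised heavily even if its accuracy looks acceptable.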

# Baseline Model Result


0.3729330560342061

                                precision    recall  f1-score   support

    Cardiovascular / Pulmonary       0.35      0.39      0.37       148
    Consult - History and Phy.       0.32      0.06      0.10       207
                    Orthopedic       0.39      0.14      0.21       142
                         Other       0.66      0.74      0.70      1055
                       Surgery       0.43      0.57      0.49       435

                      accuracy                           0.56      1987
                     macro avg       0.43      0.38      0.37      1987
                  weighted avg       0.54      0.56      0.53      1987

# Library & Data Loading

In [1]:
!pip install -U transformers simpletransformers tensorboardX

Defaulting to user installation because normal site-packages is not writeable
You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.


In [2]:
import os
import pandas as pd
import numpy as np
import string
import re
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from simpletransformers.classification import ClassificationModel, ClassificationArgs

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'


### PLEASE DON'T CHANGE ANYTHING IN THIS SECTION ###
DATA = "https://github.com/socd06/private_nlp/raw/master/data/mtsamples.csv"

filtered_labels = [
    "Surgery",
    "Consult - History and Phy.",
    "Cardiovascular / Pulmonary",
    "Orthopedic",
]
data = pd.read_csv(DATA, usecols=['medical_specialty', 'transcription']).dropna()
data.columns = ['labels', 'text']
data['labels'] = [i.strip() if (i.strip() in filtered_labels) else 'Other' for i in data.labels.to_list()]
train, test = train_test_split(data, test_size=0.4, stratify=data.labels, random_state=0)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
### END ###

2021-10-05 16:02:50.028532: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-05 16:02:50.028564: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
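
To make the loading cell easier to follow, here is a minimal sketch of the same two steps on a toy frame: collapsing every specialty outside a kept set into `Other`, then splitting with `stratify` so each class keeps its proportion in both parts. The toy frame, the `kept` set, and the leading spaces in the raw labels are assumptions for illustration; the real cell above is unchanged.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data; raw labels are given leading spaces to show why strip() is needed
kept = {"Surgery", "Orthopedic"}
toy = pd.DataFrame({
    "labels": [" Surgery"] * 10 + [" Orthopedic"] * 5 + [" Radiology"] * 3 + [" Neurology"] * 2,
    "text": [f"note {i}" for i in range(20)],
})

# Keep the chosen specialties, map everything else to "Other"
toy["labels"] = [s.strip() if s.strip() in kept else "Other" for s in toy["labels"]]

# Stratified 60/40 split: class proportions are preserved in both parts
train_toy, test_toy = train_test_split(
    toy, test_size=0.4, stratify=toy["labels"], random_state=0
)

print(train_toy["labels"].value_counts().to_dict())
print(test_toy["labels"].value_counts().to_dict())
```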


# Preprocessing

# My Model


In [3]:
# Encode the string labels as integers
le = LabelEncoder()
train['labels'] = le.fit_transform(train.labels)
test['labels'] = le.transform(test.labels)
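
For reference, `LabelEncoder` assigns integer codes in sorted order of the class names. The short sketch below shows this on the five label names used here (the `names` list is a hand-written sample, not the real training column).

```python
from sklearn.preprocessing import LabelEncoder

# Hand-written sample of the five label names (illustrative only)
names = [
    "Surgery",
    "Other",
    "Orthopedic",
    "Consult - History and Phy.",
    "Cardiovascular / Pulmonary",
    "Other",
]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

print(list(encoder.classes_))           # class names in sorted order
print(list(codes))                      # integer code of each sample
print(encoder.inverse_transform([3]))   # map a code back to its name
```

The sorted order is also the order in which the classes appear in the classification reports in this notebook.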

### Encode Text


In [4]:
#create model
import TopicAllocate as ta
model = ta.Topic_Allocate()

[nltk_data] Downloading package wordnet to /home/kienanh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/kienanh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
#encode texts into matrix
X_train = np.asarray(model.doc2vec(train['text'], vector_size = 500))
X_test = np.asarray(model.doc2vec(test['text'], vector_size = 500))


  return array(a, dtype, copy=False, order=order)


In [6]:
print(train['text'][0])
print(X_train[0].shape)

REVIEW OF SYSTEMS,GENERAL:  Negative weakness, negative fatigue, native malaise, negative chills, negative fever, negative night sweats, negative allergies.,INTEGUMENTARY:  Negative rash, negative jaundice.,HEMATOPOIETIC:  Negative bleeding, negative lymph node enlargement, negative bruisability.,NEUROLOGIC:  Negative headaches, negative syncope, negative seizures, negative weakness, negative tremor.  No history of strokes, no history of other neurologic conditions.,EYES:  Negative visual changes, negative diplopia, negative scotomata, negative impaired vision.,EARS:  Negative tinnitus, negative vertigo, negative hearing impairment.,NOSE AND THROAT:  Negative postnasal drip, negative sore throat.,CARDIOVASCULAR:  Negative chest pain, negative dyspnea on exertion, negative palpations, negative edema.  No history of heart attack, no history of arrhythmias, no history of hypertension.,RESPIRATORY:  No history of shortness of breath, no history of asthma, no history of chronic obstructive 

In [7]:
# One-hot encode the labels
from sklearn.preprocessing import OneHotEncoder

y_train = train['labels']

# sparse=False returns a dense array; scikit-learn >= 1.2 renames this argument to sparse_output
onehot_encoder = OneHotEncoder(sparse=False)
Y_train = onehot_encoder.fit_transform(np.array(y_train).reshape(-1, 1))
Y_test = onehot_encoder.transform(np.array(test['labels']).reshape(-1, 1))

# invert first example
inverted = le.inverse_transform([np.argmax(Y_train[0, :])])
print(inverted)

['Consult - History and Phy.']
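
The one-hot step can also be pictured without scikit-learn. The sketch below is an equivalent NumPy formulation on toy integer labels (the `labels` array is made up): each label selects one row of the identity matrix, and `argmax` recovers the original code.

```python
import numpy as np

# Toy integer labels for a five-class problem (illustrative only)
labels = np.array([1, 3, 0, 4, 3])

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(5)[labels]

# argmax along each row inverts the encoding
recovered = one_hot.argmax(axis=1)

print(one_hot)
print(recovered)
```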


### Train with LSTM

In [8]:
#%% Import libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout

In [9]:
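# Length (number of rows) of the longest training document
max_size = max(x.shape[0] for x in X_train)

def fill_zeros(x, vector_size):
    # Truncate documents longer than max_size so the padding length is never negative
    x = x[:max_size]
    missing = max_size - x.shape[0]
    fill_in = np.zeros((missing, vector_size))
    return np.vstack((x, fill_in))

# Pad every document to shape (max_size, 500) so they stack into one 3-D array
func = lambda x: fill_zeros(x, 500)
X_train_lstm = np.array([func(x) for x in X_train])
X_test_lstm = np.array([func(x) for x in X_test])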
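
The padding step can be checked in isolation on small arrays. The sketch below uses a hypothetical helper, `pad_to_length`, and toy documents of 2, 4, and 1 rows; it is a self-contained illustration of the same idea, not the notebook's exact function.

```python
import numpy as np

def pad_to_length(x, length):
    """Zero-pad (or truncate) a (n_rows, dim) matrix to exactly `length` rows."""
    x = x[:length]
    padding = np.zeros((length - x.shape[0], x.shape[1]))
    return np.vstack((x, padding))

# Three toy "documents" with different numbers of rows and 3 features each
docs = [np.ones((2, 3)), np.ones((4, 3)), np.ones((1, 3))]
max_len = max(d.shape[0] for d in docs)

# After padding, all documents share one shape and can be stacked
batch = np.stack([pad_to_length(d, max_len) for d in docs])

print(batch.shape)
```

Note that the padded rows are all zeros and the LSTM still processes them as time steps unless masking is applied.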


In [14]:
#%% Build the LSTM model
regressor = Sequential()
# units: dimensionality of the LSTM hidden state (and of its output)
# return_sequences=True: pass the full output sequence (one vector per time step) to the next layer
regressor.add(LSTM(units=50, return_sequences=True, input_shape=(X_train_lstm.shape[1], X_train_lstm.shape[2])))
# Dropout: randomly drop 20% of the activations to reduce overfitting
regressor.add(Dropout(0.2))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units=50, return_sequences=True))
regressor.add(Dropout(0.2))
regressor.add(LSTM(units=50))
regressor.add(Dropout(0.2))
regressor.add(Dense(units=5, activation="softmax"))
regressor.compile(optimizer="adam", loss="categorical_crossentropy",  metrics=["accuracy"])
regressor.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_8 (LSTM)                (None, 182, 50)           110200    
_________________________________________________________________
dropout_8 (Dropout)          (None, 182, 50)           0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 182, 50)           20200     
_________________________________________________________________
dropout_9 (Dropout)          (None, 182, 50)           0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 182, 50)           20200     
_________________________________________________________________
dropout_10 (Dropout)         (None, 182, 50)           0         
_________________________________________________________________
lstm_11 (LSTM)               (None, 50)               
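
The parameter counts in the summary above can be checked by hand. A standard LSTM layer has four gates, each with an input kernel, a recurrent kernel, and a bias, which gives 4 * (units * (input_dim + units) + units) parameters. The sketch below applies this formula (the helper name `lstm_param_count` is made up for illustration) to the first layer, which receives 500-dimensional inputs, and to a stacked layer, which receives the 50-dimensional output of the previous one.

```python
def lstm_param_count(input_dim, units):
    # 4 gates, each with: input kernel (input_dim x units),
    # recurrent kernel (units x units), and bias (units)
    return 4 * (units * (input_dim + units) + units)

first_layer = lstm_param_count(input_dim=500, units=50)
stacked_layer = lstm_param_count(input_dim=50, units=50)

print(first_layer)    # 110200
print(stacked_layer)  # 20200
```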

In [15]:
regressor.fit(X_train_lstm, Y_train, epochs=200, batch_size=32, validation_split= 0.1)

2021-10-05 16:27:57.113354: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 975884000 exceeds 10% of free system memory.


Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200

KeyboardInterrupt: 

In [16]:
X_test_lstm = np.array([func(x) for x in X_test])

In [18]:
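from sklearn.metrics import confusion_matrix
y_pred = np.argmax(regressor.predict(X_test_lstm), axis=1)  # predicted class index per test note
y_true = np.argmax(Y_test, axis=1)  # true class index, recovered from the one-hot labels
# Arguments are passed as (predicted, true), so rows are predicted classes and columns are true classes
print(confusion_matrix(y_pred, y_true))
print(metrics.f1_score(y_true, y_pred, average='macro'))
print(metrics.classification_report(y_true, y_pred, target_names=list(le.classes_)))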

2021-10-05 16:43:20.699405: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 723268000 exceeds 10% of free system memory.


[[   0    0    0    0    0]
 [   0    0    0    0    0]
 [   0    0    0    0    0]
 [ 148  207  142 1055  435]
 [   0    0    0    0    0]]
0.13872452333990795
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.00      0.00      0.00       148
Consult - History and Phy.       0.00      0.00      0.00       207
                Orthopedic       0.00      0.00      0.00       142
                     Other       0.53      1.00      0.69      1055
                   Surgery       0.00      0.00      0.00       435

                  accuracy                           0.53      1987
                 macro avg       0.11      0.20      0.14      1987
              weighted avg       0.28      0.53      0.37      1987



  _warn_prf(average, modifier, msg_start, len(result))
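
The report above shows every test note being assigned to `Other` (class index 3). The sketch below rebuilds that situation from the support counts in the report, with no model involved, to show where the macro F1 of about 0.139 comes from: only `Other` has a non-zero F1, and that single score is averaged with four zeros.

```python
import numpy as np
from sklearn.metrics import f1_score

# Support counts copied from the classification report, in class order
supports = [148, 207, 142, 1055, 435]

# True labels reconstructed from the supports; predictions are always class 3 ("Other")
y_true = np.repeat(np.arange(5), supports)
y_pred = np.full_like(y_true, 3)

# zero_division=0 scores classes that are never predicted as 0 without emitting a warning
per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)

print(per_class)
print(macro)
```

The F1 for `Other` is 2 * 1055 / (1987 + 1055), roughly 0.69, and dividing by five classes gives the reported macro score, which is well below the baseline's 0.37.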
