# Medical Notes Classification

Medical notes are a useful source of information for extracting patient data, and classifying them is an important task in medical NLP. Techniques for this problem range from traditional methods (logistic regression, SVM, ...) to state-of-the-art Transformer models.
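To illustrate the traditional end of that range, here is a minimal sketch of a TF-IDF plus logistic regression classifier. It is not this notebook's baseline: the toy sentences, `toy_labels`, and `clf` are invented for the example, and the real corpus is loaded further down.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-ins for medical notes (hypothetical, not taken from the dataset).
toy_texts = [
    "patient underwent laparoscopic appendectomy under general anesthesia",
    "incision closed with sutures after the surgical procedure",
    "chest pain with shortness of breath, echocardiogram ordered",
    "coronary artery disease, stress test and cardiac catheterization",
]
toy_labels = [
    "Surgery",
    "Surgery",
    "Cardiovascular / Pulmonary",
    "Cardiovascular / Pulmonary",
]

# Bag-of-words features weighted by TF-IDF, followed by a linear classifier.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(toy_texts, toy_labels)

toy_pred = clf.predict(["cardiac stress test for chest pain"])
print(toy_pred)
```

With only four training sentences the prediction itself means little; the point is the shape of the pipeline.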

The code below is a baseline model for a text classification problem in the medical domain.

* Input: a corpus of medical transcriptions.
* Output: the type of each note.

In this problem, we classify each note into one of five labels:
* Surgery
* Consult - History and Phy.
* Cardiovascular / Pulmonary
* Orthopedic
* Other

The train-test split is already defined; please don't change it.

Metric to evaluate: `f1_macro`
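Macro F1 averages the per-class F1 scores with equal weight, so rare classes count as much as the dominant one. A small sketch on made-up labels (the `y_true` and `y_pred` lists below are invented for illustration) shows how it can sit well below plain accuracy:

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: the classifier over-predicts the majority class "Other".
y_true = ["Surgery", "Surgery", "Other", "Other", "Other", "Orthopedic"]
y_pred = ["Surgery", "Other", "Other", "Other", "Other", "Other"]

class_order = ["Orthopedic", "Other", "Surgery"]
# Per-class F1; zero_division=0 scores a class that is never predicted as 0.
per_class = f1_score(y_true, y_pred, average=None, labels=class_order, zero_division=0)
# Macro F1 is the unweighted mean of the per-class scores.
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
accuracy = accuracy_score(y_true, y_pred)

print(dict(zip(class_order, per_class.round(3))))
print("macro F1:", round(macro, 3), "accuracy:", round(accuracy, 3))
```

The never-predicted class contributes an F1 of 0 to the mean, which is what pulls the macro score down.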

# Baseline Model Result


0.3729330560342061

                                precision    recall  f1-score   support

    Cardiovascular / Pulmonary       0.35      0.39      0.37       148
    Consult - History and Phy.       0.32      0.06      0.10       207
                    Orthopedic       0.39      0.14      0.21       142
                         Other       0.66      0.74      0.70      1055
                       Surgery       0.43      0.57      0.49       435

                      accuracy                           0.56      1987
                     macro avg       0.43      0.38      0.37      1987
                  weighted avg       0.54      0.56      0.53      1987

# Library & Data Loading

In [1]:
!pip install -U transformers simpletransformers
!pip install tensorboardX



In [2]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn import metrics
from simpletransformers.classification import ClassificationModel, ClassificationArgs

os.environ['CUDA_LAUNCH_BLOCKING'] = '1'


### PLEASE DON'T CHANGE ANYTHING IN THIS SECTION ###
DATA = "https://github.com/socd06/private_nlp/raw/master/data/mtsamples.csv"

filtered_labels = [
    "Surgery",
    "Consult - History and Phy.",
    "Cardiovascular / Pulmonary",
    "Orthopedic",
]
data = pd.read_csv(DATA, usecols=['medical_specialty', 'transcription']).dropna()
data.columns = ['labels', 'text']
data['labels'] = [i.strip() if (i.strip() in filtered_labels) else 'Other' for i in data.labels.to_list()]
train, test = train_test_split(data, test_size=0.4, stratify=data.labels, random_state=0)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)
### END ###

2021-10-06 08:51:45.271496: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2021-10-06 08:51:45.271519: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.


# Preprocessing

# My Model


In [3]:
# Encode the string labels as integers (classes are sorted alphabetically)
le = LabelEncoder()
train['labels'] = le.fit_transform(train.labels)
test['labels'] = le.transform(test.labels)
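`LabelEncoder` assigns integer codes in sorted order of the class names, which is why the classification reports list the classes alphabetically. The sketch below reproduces that mapping for the five labels; `label_names`, `demo_le`, and `mapping` are illustrative names, not part of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# The five labels used in this problem, in arbitrary order.
label_names = [
    "Surgery",
    "Consult - History and Phy.",
    "Cardiovascular / Pulmonary",
    "Orthopedic",
    "Other",
]

demo_le = LabelEncoder().fit(label_names)
# classes_ holds the sorted label names; the index of each name is its code.
mapping = {name: code for code, name in enumerate(demo_le.classes_)}
print(mapping)

# inverse_transform turns integer codes back into label names.
decoded = list(demo_le.inverse_transform([1, 4]))
print(decoded)
```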

### Encode Text


In [4]:
#create model
import TopicAllocate as ta
model = ta.Topic_Allocate()

[nltk_data] Downloading package wordnet to /home/kienanh/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /home/kienanh/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [5]:
#encode texts into matrix
X_train = np.asarray(model.doc2vec_s2v(train['text'], vector_size = 500, fit = True))
X_test = np.asarray(model.doc2vec_s2v(test['text'], vector_size = 500))


In [None]:
X_train.shape

In [6]:
print(train['text'][0])
print(X_train[0].shape)

REVIEW OF SYSTEMS,GENERAL:  Negative weakness, negative fatigue, native malaise, negative chills, negative fever, negative night sweats, negative allergies.,INTEGUMENTARY:  Negative rash, negative jaundice.,HEMATOPOIETIC:  Negative bleeding, negative lymph node enlargement, negative bruisability.,NEUROLOGIC:  Negative headaches, negative syncope, negative seizures, negative weakness, negative tremor.  No history of strokes, no history of other neurologic conditions.,EYES:  Negative visual changes, negative diplopia, negative scotomata, negative impaired vision.,EARS:  Negative tinnitus, negative vertigo, negative hearing impairment.,NOSE AND THROAT:  Negative postnasal drip, negative sore throat.,CARDIOVASCULAR:  Negative chest pain, negative dyspnea on exertion, negative palpations, negative edema.  No history of heart attack, no history of arrhythmias, no history of hypertension.,RESPIRATORY:  No history of shortness of breath, no history of asthma, no history of chronic obstructive 

In [7]:
# One-hot encode the integer labels
from sklearn.preprocessing import OneHotEncoder

y_train = train['labels']

# Dense one-hot matrix; scikit-learn >= 1.2 renames `sparse` to `sparse_output`
onehot_encoder = OneHotEncoder(sparse=False)
Y_train = onehot_encoder.fit_transform(np.array(y_train).reshape(-1, 1))
Y_test = onehot_encoder.transform(np.array(test['labels']).reshape(-1, 1))

# Invert the first example back to its label name
inverted = le.inverse_transform([np.argmax(Y_train[0, :])])
print(inverted)

['Consult - History and Phy.']
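Because the labels are already integers from 0 to 4, the same one-hot matrix can also be built with plain NumPy indexing. This is an alternative sketch, not what the notebook uses; `one_hot` is a hypothetical helper.

```python
import numpy as np

def one_hot(int_labels, num_classes):
    """Return a (n_samples, num_classes) matrix with a single 1.0 per row."""
    int_labels = np.asarray(int_labels, dtype=int)
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes, dtype=float)[int_labels]

demo = one_hot([1, 4, 3], num_classes=5)
print(demo)

# argmax along each row recovers the integer labels.
recovered = demo.argmax(axis=1)
print(recovered)
```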


### Train with LSTM

#### sequence -> vec

In [8]:
#%% import library
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout

In [9]:
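# Longest sequence (number of rows) among the training documents
max_size = max(x.shape[0] for x in X_train)

def fill_zeros(x, vector_size):
    # Left-pad x with zero rows so that it has max_size rows.
    # Assumes x has at most max_size rows; a longer test document would raise an error.
    missing = max_size - x.shape[0]
    fill_in = np.zeros((missing, vector_size))
    return np.vstack((fill_in, x))

X_train_lstm_s2v = np.array([fill_zeros(x, 500) for x in X_train])
X_test_lstm_s2v = np.array([fill_zeros(x, 500) for x in X_test])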
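The padding above takes its target length from the training set, so a test document with more rows than any training document would produce a negative padding size. One possible safeguard, sketched here under the assumption that keeping the last rows of an over-long document is acceptable, is to truncate as well as pad. `pad_or_truncate` is a hypothetical helper, not part of the notebook.

```python
import numpy as np

def pad_or_truncate(x, max_len):
    """Return x with exactly max_len rows: left-padded with zeros or truncated."""
    x = np.asarray(x, dtype=float)
    if x.shape[0] >= max_len:
        # Too long (or exact): keep only the last max_len rows.
        return x[-max_len:]
    # Too short: prepend zero rows, matching the left-padding used above.
    padding = np.zeros((max_len - x.shape[0], x.shape[1]))
    return np.vstack((padding, x))

short_doc = np.ones((2, 3))
long_doc = np.arange(18, dtype=float).reshape(6, 3)

padded = pad_or_truncate(short_doc, max_len=4)
truncated = pad_or_truncate(long_doc, max_len=4)
print(padded.shape, truncated.shape)
```

Both results have the same shape, so they can be stacked into a single array for the LSTM.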


In [10]:
#%% Build the LSTM model
lstm_model = Sequential()
# units: dimensionality of the LSTM hidden state and output
# return_sequences=True: pass the output at every timestep to the next layer
lstm_model.add(LSTM(units=50, return_sequences=True, input_shape=(X_train_lstm_s2v.shape[1], X_train_lstm_s2v.shape[2])))
# Dropout: drop 20% of the activations to reduce overfitting
lstm_model.add(Dropout(0.2))
# lstm_model.add(LSTM(units=50, return_sequences=True))
# lstm_model.add(Dropout(0.2))
# lstm_model.add(LSTM(units=50, return_sequences=True))
# lstm_model.add(Dropout(0.2))
lstm_model.add(LSTM(units=50))
lstm_model.add(Dropout(0.2))
lstm_model.add(Dense(units=5, activation="softmax"))
lstm_model.compile(optimizer="adam", loss="categorical_crossentropy",  metrics=["accuracy"])
lstm_model.summary()

2021-10-06 08:52:40.271061: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-10-06 08:52:40.271093: W tensorflow/stream_executor/cuda/cuda_driver.cc:326] failed call to cuInit: UNKNOWN ERROR (303)
2021-10-06 08:52:40.271122: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (LoG): /proc/driver/nvidia/version does not exist
2021-10-06 08:52:40.271440: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm (LSTM)                  (None, 1, 50)             110200    
_________________________________________________________________
dropout (Dropout)            (None, 1, 50)             0         
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                20200     
_________________________________________________________________
dropout_1 (Dropout)          (None, 50)                0         
_________________________________________________________________
dense (Dense)                (None, 5)                 255       
Total params: 130,655
Trainable params: 130,655
Non-trainable params: 0
_________________________________________________________________


In [11]:
lstm_model.fit(X_train_lstm_s2v, Y_train, epochs=50, batch_size=32, validation_split= 0.1)

2021-10-06 08:52:41.423360: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:176] None of the MLIR Optimization Passes are enabled (registered 2)
2021-10-06 08:52:41.443798: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 1999965000 Hz


Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x7f0d54bb4b20>

In [12]:
from sklearn.metrics import confusion_matrix

y_pred = np.argmax(lstm_model.predict(X_test_lstm_s2v), axis=1)
y_true = np.argmax(Y_test, axis=1)
# Predictions are passed first, so rows are predicted labels and columns are true labels
# (the transpose of scikit-learn's usual confusion_matrix(y_true, y_pred) layout)
print(confusion_matrix(y_pred, y_true))
print(metrics.f1_score(y_true, y_pred, average='macro'))
print(metrics.classification_report(y_true, y_pred, target_names=list(le.classes_)))

[[  5   1   0   5   0]
 [  4   8   2   2   0]
 [  0   0   0   4   0]
 [ 63 198  60 759  51]
 [ 76   0  80 285 384]]
0.28771696008014286
                            precision    recall  f1-score   support

Cardiovascular / Pulmonary       0.45      0.03      0.06       148
Consult - History and Phy.       0.50      0.04      0.07       207
                Orthopedic       0.00      0.00      0.00       142
                     Other       0.67      0.72      0.69      1055
                   Surgery       0.47      0.88      0.61       435

                  accuracy                           0.58      1987
                 macro avg       0.42      0.33      0.29      1987
              weighted avg       0.54      0.58      0.51      1987
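The per-class figures in the report can be recomputed from the confusion matrix printed above. In that printout the rows are predicted labels and the columns are true labels, because the predictions were passed as the first argument. The sketch below copies the matrix by hand; `cm_pred_by_true` is an illustrative name.

```python
import numpy as np

# Confusion matrix as printed above: rows = predicted label, columns = true label.
cm_pred_by_true = np.array([
    [5, 1, 0, 5, 0],
    [4, 8, 2, 2, 0],
    [0, 0, 0, 4, 0],
    [63, 198, 60, 759, 51],
    [76, 0, 80, 285, 384],
])

support = cm_pred_by_true.sum(axis=0)          # true examples per class
predicted_count = cm_pred_by_true.sum(axis=1)  # predictions per class
true_pos = np.diag(cm_pred_by_true)

recall = true_pos / support
# Guard against classes that are never predicted (not the case here).
precision = np.divide(true_pos, predicted_count,
                      out=np.zeros(len(true_pos)), where=predicted_count > 0)

print("support:  ", support)
print("recall:   ", recall.round(2))
print("precision:", precision.round(2))
```

The column sums match the `support` column of the report, and the last two classes receive almost all of the predictions (1131 and 825 of 1987), which explains the low recall on the three smaller classes.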



#### word -> vec

In [13]:
a = list([])
a.append("a")

In [14]:
X_train_lstm_w2v = model.doc2vec_w2v(train['text'], vector_size = 500)
X_test_lstm_w2v = model.doc2vec_w2v(test['text'], vector_size = 500)

KeyboardInterrupt: 

In [None]:
X_train_lstm_w2v

KeyboardInterrupt: 

In [None]:
X_train_lstm_w2v = np.asarray([])
model.w2v["he"]

AttributeError: 'Topic_Allocate' object has no attribute 'w2v'