<a href="https://colab.research.google.com/github/luca-martial/medical-specialty/blob/main/medical_specialty_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Medical Specialty Prediction with Spark NLP

The goal of this project was to predict medical specialties (surgery, internal medicine, medical records, other) from a corpus of 4999 medical transcriptions using Spark NLP. The corpus was scraped by [Tara Boyle](https://github.com/terrah27) from the [Transcribed Medical Transcription Sample Reports and Examples website](https://mtsamples.com/) and published on [Kaggle](https://www.kaggle.com/tboyle10/medicaltranscriptions). This project uses the version compiled by [Carlos Salgado](https://github.com/socd06) for natural language processing, which combines the scraped corpus with custom-generated clinical stop words and vocabulary. The compiled version was published on [GitHub](https://github.com/socd06/medical-nlp) and is free to use.

## Set-Up

### Installing Spark NLP

In [1]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

Saving spark_nlp_for_healthcare_v3.3-1.json to spark_nlp_for_healthcare_v3.3-1.json
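
The cells that follow read `SECRET`, `PUBLIC_VERSION` and `JSL_VERSION` from the uploaded license file. A minimal sketch of checking for those keys before installing anything is shown below. The `load_license` helper and the placeholder values are hypothetical, and the sketch assumes the license file is a flat JSON object.

```python
import json

# Keys that the install and session cells in this notebook rely on.
REQUIRED_KEYS = ("SECRET", "PUBLIC_VERSION", "JSL_VERSION")


def load_license(text):
    """Parse license JSON and fail early if an expected key is missing."""
    keys = json.loads(text)
    missing = [key for key in REQUIRED_KEYS if key not in keys]
    if missing:
        raise KeyError(f"license file is missing keys: {missing}")
    # os.environ only accepts strings, so cast every value.
    return {key: str(value) for key, value in keys.items()}


# Placeholder content for illustration only, not a real license.
example = '{"SECRET": "xxx", "PUBLIC_VERSION": "3.3.0", "JSL_VERSION": "3.3.0"}'
license_example = load_license(example)
print(sorted(license_example))
```

The returned dictionary can be passed to `os.environ.update` in the same way as `license_keys` above.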


In [2]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET


In [3]:
# Import libraries and start session
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

print("Spark NLP version", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'], 
                           gpu=True, 
                           params=params)

spark

Spark NLP version 3.3.0
Spark NLP_JSL Version : 3.3.0


In [4]:
# Import auxiliary libraries
import pandas as pd
from sklearn.metrics import classification_report
from pyspark.sql.functions import explode
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.feature import StringIndexer, CountVectorizer, HashingTF, IDF

### Reading in Data

In [5]:
# Get datasets
! wget -q https://raw.githubusercontent.com/socd06/medical-nlp/master/data/train.csv
! wget -q https://raw.githubusercontent.com/socd06/medical-nlp/master/data/test.csv

In [6]:
labelDict = {'1':'Surgery', '2':'Medical Records', '3':'Internal Medicine', '4':'Other'}

In [7]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("train.csv") \
      .replace(labelDict, subset=['label'])

In [8]:
trainDataset.printSchema()

root
 |-- label: string (nullable = true)
 |-- description: string (nullable = true)
 |-- text: string (nullable = true)



In [9]:
trainDataset.show(10, truncate=50)

+-----------------+-------------------------------------+--------------------------------------------------+
|            label|                          description|                                              text|
+-----------------+-------------------------------------+--------------------------------------------------+
|  Medical Records|                         2-D Doppler |2-D STUDY,1. Mild aortic stenosis, widely calci...|
|          Surgery|                         Gastroscopy |PREOPERATIVE DIAGNOSES: , Dysphagia and esophag...|
|  Medical Records|       Three-Week Postpartum Checkup |CHIEF COMPLAINT:,  The patient comes for three-...|
|          Surgery|             Radiofrequency Ablation |PROCEDURE: , Bilateral L5, S1, S2, and S3 radio...|
|  Medical Records|               Discharge Summary - 3 |DISCHARGE DIAGNOSES:,1. Chronic obstructive pul...|
|Internal Medicine| Heart Catheterization & Angiography |INDICATION:,  Coronary artery disease, severe a...|
|  Medical Records|

In [10]:
trainDataset.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+-----------------+-----+
|            label|count|
+-----------------+-----+
|          Surgery| 1442|
|  Medical Records| 1126|
|Internal Medicine| 1040|
|            Other|  891|
+-----------------+-----+



In [11]:
testDataset = spark.read \
      .option("header", True) \
      .csv("test.csv") \
      .replace(labelDict, subset=['label'])

In [12]:
testDataset.show(10, truncate=50)

+-----------------+-------------------------------------------+--------------------------------------------------+
|            label|                                description|                                              text|
+-----------------+-------------------------------------------+--------------------------------------------------+
|  Medical Records|      Hemiarthroplasty - Discharge Summary |ADMISSION DIAGNOSES:  ,Fracture of the right fe...|
|          Surgery|                        Plantar Fasciotomy |PREOPERATIVE DIAGNOSIS:,  Plantar fascitis, lef...|
|  Medical Records|      Hysterectomy - Discharge Summary - 2 |ADMISSION DIAGNOSIS: , Microinvasive carcinoma ...|
|            Other|        Total Knee Arthoplasty - Right - 1 |PREOPERATIVE DIAGNOSIS:,  Severe degenerative j...|
|          Surgery|         Breast Radiation Therapy Followup |DIAGNOSIS: , Left breast adenocarcinoma stage T...|
|            Other|                         Hamstring Release |PREOPERATIVE DIAG

In [13]:
testDataset.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+-----------------+-----+
|            label|count|
+-----------------+-----+
|          Surgery|  198|
|Internal Medicine|  109|
|  Medical Records|  102|
|            Other|   91|
+-----------------+-----+



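The classes are imbalanced in both the training and test sets. Surgery is the largest class (1442 of 4499 training rows, 198 of 500 test rows) and Other is the smallest (891 and 91 rows).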
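
To put numbers on the imbalance, the sketch below computes each class's share and the accuracy of a majority-class baseline from the counts printed above. It is plain Python on the reported counts rather than a Spark job, and the helper names are illustrative.

```python
# Class counts copied from the groupBy("label").count() outputs above.
train_counts = {"Surgery": 1442, "Medical Records": 1126,
                "Internal Medicine": 1040, "Other": 891}
test_counts = {"Surgery": 198, "Internal Medicine": 109,
               "Medical Records": 102, "Other": 91}


def class_shares(counts):
    """Return each class's share of the total row count."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_baseline(counts):
    """Accuracy of a classifier that always predicts the largest class."""
    return max(counts.values()) / sum(counts.values())


for name, counts in [("train", train_counts), ("test", test_counts)]:
    shares = class_shares(counts)
    summary = ", ".join(f"{label}: {share:.1%}" for label, share in shares.items())
    print(f"{name}: {summary} | majority baseline: {majority_baseline(counts):.3f}")
```

The test-set baseline (always predicting Surgery) is a useful reference point when reading the accuracy figures in the reports that follow.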

## DL Classifiers with Sentence Embeddings

### DL Classification with Universal Sentence Encoder



In [27]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(10)\
    .setLr(0.001)\
    .setBatchSize(64)\
    .setEnableOutputLogs(True)

use_clf_pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [15]:
use_pipelineModel = use_clf_pipeline.fit(trainDataset)

In [16]:
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 10 - learning_rate: 0.001 - batch_size: 64 - training_examples: 4499 - classes: 4
Epoch 0/10 - 0.61s - loss: 88.89431 - acc: 0.506344 - batches: 71
Epoch 1/10 - 0.31s - loss: 86.799675 - acc: 0.53000474 - batches: 71
Epoch 2/10 - 0.30s - loss: 86.150055 - acc: 0.5322369 - batches: 71
Epoch 3/10 - 0.31s - loss: 85.857346 - acc: 0.53357613 - batches: 71
Epoch 4/10 - 0.30s - loss: 85.60088 - acc: 0.5342458 - batches: 71
Epoch 5/10 - 0.29s - loss: 85.382484 - acc: 0.53491545 - batches: 71
Epoch 6/10 - 0.44s - loss: 85.17749 - acc: 0.53580827 - batches: 71
Epoch 7/10 - 0.33s - loss: 84.98671 - acc: 0.5360315 - batches: 71
Epoch 8/10 - 0.29s - loss: 84.81304 - acc: 0.53692436 - batches: 71
Epoch 9/10 - 0.28s - loss: 84.65803 - acc: 0.53692436 - batches: 71
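
The `batches: 71` figure in the log is consistent with splitting 4499 training examples into batches of 64 and counting the final partial batch. The sketch below checks that arithmetic and parses two log lines copied from the output above. The regular expression is an assumption based on the lines shown here, not a documented log format.

```python
import math
import re

# Pattern inferred from the log lines printed above.
LOG_LINE = re.compile(
    r"Epoch (\d+)/(\d+) - ([\d.]+)s - loss: ([\d.]+) - acc: ([\d.]+) - batches: (\d+)"
)


def batches_per_epoch(n_examples, batch_size):
    """Number of batches when the last, partial batch is also counted."""
    return math.ceil(n_examples / batch_size)


def parse_epoch(line):
    """Extract epoch index, loss, accuracy and batch count from a log line."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    epoch, _total, _seconds, loss, acc, batches = match.groups()
    return {"epoch": int(epoch), "loss": float(loss),
            "acc": float(acc), "batches": int(batches)}


first = parse_epoch("Epoch 0/10 - 0.61s - loss: 88.89431 - acc: 0.506344 - batches: 71")
last = parse_epoch("Epoch 9/10 - 0.28s - loss: 84.65803 - acc: 0.53692436 - batches: 71")

print(batches_per_epoch(4499, 64))
print(f"training accuracy gain over 10 epochs: {last['acc'] - first['acc']:.4f}")
```

Training accuracy moves by only about three percentage points across the ten epochs, which suggests the classifier plateaus early with these settings.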



In [17]:
# Evaluate model on test set
use_df = use_pipelineModel.transform(testDataset).select('label', 'text', 'class.result').toPandas()
use_df['result'] = use_df['result'].apply(lambda x: str(x[0])).replace(labelDict)
print(classification_report(use_df.label, use_df.result))

                   precision    recall  f1-score   support

Internal Medicine       0.33      0.03      0.05       109
  Medical Records       0.41      0.90      0.57       102
            Other       0.00      0.00      0.00        91
          Surgery       0.65      0.88      0.75       198

         accuracy                           0.54       500
        macro avg       0.35      0.45      0.34       500
     weighted avg       0.41      0.54      0.42       500
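
The `macro avg` and `weighted avg` rows can be reproduced from the per-class F1 scores and supports. The sketch below does this for the report above. Because it starts from the rounded values printed in the report, it can differ from scikit-learn's own result in the last digit.

```python
# (f1-score, support) per class, copied from the report above.
report = {
    "Internal Medicine": (0.05, 109),
    "Medical Records": (0.57, 102),
    "Other": (0.00, 91),
    "Surgery": (0.75, 198),
}


def macro_f1(rows):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1 for f1, _ in rows.values()) / len(rows)


def weighted_f1(rows):
    """Mean of the per-class F1 scores weighted by class support."""
    total = sum(support for _, support in rows.values())
    return sum(f1 * support for f1, support in rows.values()) / total


print(f"macro avg F1:    {macro_f1(report):.2f}")
print(f"weighted avg F1: {weighted_f1(report):.2f}")
```

The weighted average is higher than the macro average here because Surgery, the largest class, is also the best-predicted one, while the zero score on Other counts fully in the macro average.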





### DL Classification with BERT Sentence Embeddings (Compact Version)

In [18]:
bert_sent = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_512")\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

bert_sent_clf_pipeline = Pipeline(
    stages = [
        document,
        bert_sent,
        classifierdl
    ])

sent_small_bert_L8_512 download started this may take some time.
Approximate size to download 149.1 MB
[OK!]


In [19]:
bert_sent_pipelineModel = bert_sent_clf_pipeline.fit(trainDataset)

In [21]:
# Evaluate model on test set
bert_sent_df = bert_sent_pipelineModel.transform(testDataset).select('label', 'text', 'class.result').toPandas()
bert_sent_df['result'] = bert_sent_df['result'].apply(lambda x: str(x[0])).replace(labelDict)
print(classification_report(bert_sent_df.label, bert_sent_df.result))

                   precision    recall  f1-score   support

Internal Medicine       0.40      0.16      0.23       109
  Medical Records       0.41      0.84      0.55       102
            Other       0.00      0.00      0.00        91
          Surgery       0.65      0.81      0.73       198

         accuracy                           0.53       500
        macro avg       0.37      0.45      0.37       500
     weighted avg       0.43      0.53      0.45       500





### DL Classification with BioBERT (Clinical) Sentence Embeddings

In [22]:
biobert_clin = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

biobert_clin_clf_pipeline = Pipeline(
    stages = [
        document,
        biobert_clin,
        classifierdl
    ])

sent_biobert_clinical_base_cased download started this may take some time.
Approximate size to download 386.6 MB
[OK!]


In [23]:
biobert_clin_pipelineModel = biobert_clin_clf_pipeline.fit(trainDataset)

In [25]:
# Evaluate model on test set
biobert_clin_df = biobert_clin_pipelineModel.transform(testDataset).select('label', 'text', 'class.result').toPandas()
biobert_clin_df['result'] = biobert_clin_df['result'].apply(lambda x: str(x[0])).replace(labelDict)
print(classification_report(biobert_clin_df.label, biobert_clin_df.result))

                   precision    recall  f1-score   support

Internal Medicine       0.44      0.14      0.21       109
  Medical Records       0.43      0.87      0.57       102
            Other       0.00      0.00      0.00        91
          Surgery       0.65      0.84      0.73       198

         accuracy                           0.54       500
        macro avg       0.38      0.46      0.38       500
     weighted avg       0.44      0.54      0.45       500





### DL Classification with BioBERT (MedNLI) Sentence Embeddings

In [15]:
biobert_med = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

biobert_med_clf_pipeline = Pipeline(
    stages = [
        document,
        biobert_med,
        classifierdl
    ])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [16]:
biobert_med_pipelineModel = biobert_med_clf_pipeline.fit(trainDataset)

In [17]:
log_file_name = os.listdir("/root/annotator_logs")[0]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 10 - learning_rate: 0.001 - batch_size: 64 - training_examples: 4499 - classes: 4
Epoch 0/10 - 0.87s - loss: 93.51379 - acc: 0.4838581 - batches: 71
Epoch 1/10 - 0.34s - loss: 91.20874 - acc: 0.5130992 - batches: 71
Epoch 2/10 - 0.32s - loss: 91.125656 - acc: 0.5206885 - batches: 71
Epoch 3/10 - 0.29s - loss: 89.788895 - acc: 0.5241189 - batches: 71
Epoch 4/10 - 0.27s - loss: 88.48932 - acc: 0.5276316 - batches: 71
Epoch 5/10 - 0.27s - loss: 88.01813 - acc: 0.53231907 - batches: 71
Epoch 6/10 - 0.27s - loss: 87.811005 - acc: 0.5341048 - batches: 71
Epoch 7/10 - 0.29s - loss: 87.71872 - acc: 0.5376762 - batches: 71
Epoch 8/10 - 0.27s - loss: 87.64678 - acc: 0.5399084 - batches: 71
Epoch 9/10 - 0.28s - loss: 87.587204 - acc: 0.5434798 - batches: 71
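
`os.listdir` returns entries in arbitrary order, so indexing it with `[0]` only finds the right log when the directory holds a single file. A minimal sketch of picking the most recently modified log instead is shown below. The `latest_log_path` helper and the file names in the demo are hypothetical.

```python
import os
import tempfile


def latest_log_path(log_dir):
    """Return the most recently modified regular file in log_dir."""
    paths = [os.path.join(log_dir, name) for name in os.listdir(log_dir)]
    files = [path for path in paths if os.path.isfile(path)]
    if not files:
        raise FileNotFoundError(f"no log files found in {log_dir}")
    return max(files, key=os.path.getmtime)


# Demo on a temporary directory with two made-up log files.
with tempfile.TemporaryDirectory() as log_dir:
    for name, mtime in [("run_old.log", 1_000), ("run_new.log", 2_000)]:
        path = os.path.join(log_dir, name)
        with open(path, "w") as handle:
            handle.write(name)
        os.utime(path, (mtime, mtime))  # set access and modification times
    newest = os.path.basename(latest_log_path(log_dir))

print(newest)
```

In the notebook, `latest_log_path("/root/annotator_logs")` could replace the `os.listdir(...)[0]` lookup when several models are trained in one session.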



In [18]:
# Evaluate model on test set
biobert_med_df = biobert_med_pipelineModel.transform(testDataset).select('label', 'text', 'class.result').toPandas()
biobert_med_df['result'] = biobert_med_df['result'].apply(lambda x: str(x[0])).replace(labelDict)
print(classification_report(biobert_med_df.label, biobert_med_df.result))

                   precision    recall  f1-score   support

Internal Medicine       0.36      0.24      0.29       109
  Medical Records       0.40      0.68      0.50       102
            Other       0.00      0.00      0.00        91
          Surgery       0.62      0.80      0.70       198

         accuracy                           0.51       500
        macro avg       0.34      0.43      0.37       500
     weighted avg       0.41      0.51      0.44       500





## ML Classifiers with Sentence Embeddings

### Logistic Regression with Universal Sentence Encoder

In [64]:
embeddings_finisher = EmbeddingsFinisher() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCols(["finished_embeddings"]) \
      .setOutputAsVector(True)\
      .setCleanAnnotations(False)

label_stringIdx = StringIndexer(inputCol = "label", outputCol = "class")

ml_pipeline = Pipeline(
      stages=[
        document,
        use,
        embeddings_finisher,
        label_stringIdx]
      )

In [65]:
# Fit the pipeline on the training set, then transform both sets
ml_model = ml_pipeline.fit(trainDataset)
ml_train = ml_model.transform(trainDataset)
ml_test = ml_model.transform(testDataset)

In [66]:
# Explode sentence embeddings for train and test sets
ml_train = ml_train.withColumn("features", explode(ml_train.finished_embeddings))
ml_test = ml_test.withColumn("features", explode(ml_test.finished_embeddings))

In [67]:
# Fit logreg
lr = LogisticRegression(labelCol="class", maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(ml_train)

# Get test set preds
lrPredictions = lrModel.transform(ml_test)

In [68]:
# Evaluate performance
logreg_df = lrPredictions.select('text','label','class','prediction').toPandas()
print(classification_report(logreg_df["class"], logreg_df.prediction))

              precision    recall  f1-score   support

         0.0       0.66      0.70      0.68       198
         1.0       0.41      0.53      0.46       102
         2.0       0.39      0.28      0.32       109
         3.0       0.40      0.36      0.38        91

    accuracy                           0.51       500
   macro avg       0.47      0.47      0.46       500
weighted avg       0.50      0.51      0.50       500
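
The report above uses the numeric indices produced by `StringIndexer` instead of the label names. By default `StringIndexer` orders labels by descending frequency in the data it was fitted on, so index 0.0 is the most frequent training label. The sketch below rebuilds that mapping in plain Python from the training counts. Breaking ties alphabetically is an assumption and does not matter here, since no two counts are equal.

```python
# Training-set class counts copied from the groupBy output earlier.
train_counts = {"Surgery": 1442, "Medical Records": 1126,
                "Internal Medicine": 1040, "Other": 891}


def frequency_desc_index(counts):
    """Assign index 0.0 to the most frequent label, 1.0 to the next, and so on."""
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {float(index): label for index, label in enumerate(ordered)}


index_to_label = frequency_desc_index(train_counts)
for index, label in index_to_label.items():
    print(index, label)
```

The mapping agrees with the supports in the report: 0.0 is Surgery (198), 1.0 is Medical Records (102), 2.0 is Internal Medicine (109) and 3.0 is Other (91). In the notebook itself, the fitted indexer stage exposes the same ordering through its `labels` attribute.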



### Random Forest with Universal Sentence Encoder

In [69]:
rf = RandomForestClassifier(labelCol="class",
                            featuresCol="features",
                            numTrees=100,
                            maxDepth=4,
                            maxBins=32)

# Train model with Training Data, get predictions on test set
rfModel = rf.fit(ml_train)
predictions_rf = rfModel.transform(ml_test)

In [70]:
# Evaluate performance
rf_df = predictions_rf.select("class", "prediction").toPandas()
print(classification_report(rf_df["class"], rf_df.prediction))

              precision    recall  f1-score   support

         0.0       0.63      0.87      0.73       198
         1.0       0.42      0.85      0.56       102
         2.0       0.14      0.01      0.02       109
         3.0       0.50      0.08      0.13        91

    accuracy                           0.54       500
   macro avg       0.42      0.45      0.36       500
weighted avg       0.46      0.54      0.43       500



## ML Classifiers with Feature Vectorization Methods

### Logistic Regression with CountVectorizer


In [15]:
tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
      
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

finisher = Finisher() \
      .setInputCols(["stem"]) \
      .setOutputCols(["token_features"]) \
      .setOutputAsArray(True) \
      .setCleanAnnotations(False)

countVectors = CountVectorizer(inputCol="token_features", outputCol="features", vocabSize=10000, minDF=5)

nlp_pipeline = Pipeline(
    stages=[document, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            finisher,
            countVectors,
            label_stringIdx])
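
For readers unfamiliar with `vocabSize` and `minDF`, the sketch below mimics the idea behind `CountVectorizer` in plain Python: keep terms that occur in at least `min_df` documents, retain the `vocab_size` most frequent of them, and count those terms per document. It is a simplified illustration on made-up stemmed tokens, not Spark's implementation.

```python
from collections import Counter


def build_vocabulary(docs, vocab_size, min_df):
    """Select up to vocab_size terms that occur in at least min_df documents."""
    term_freq = Counter()
    doc_freq = Counter()
    for tokens in docs:
        term_freq.update(tokens)      # total occurrences across the corpus
        doc_freq.update(set(tokens))  # number of documents containing the term
    eligible = [term for term in term_freq if doc_freq[term] >= min_df]
    eligible.sort(key=lambda term: (-term_freq[term], term))
    return eligible[:vocab_size]


def count_vector(tokens, vocabulary):
    """Dense count vector for one document over the fitted vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Made-up stemmed tokens for illustration.
docs = [
    ["patient", "pain", "knee"],
    ["patient", "knee", "surgeri"],
    ["patient", "histori"],
]
vocabulary = build_vocabulary(docs, vocab_size=2, min_df=2)
print(vocabulary)
print(count_vector(["knee", "knee", "patient", "pain"], vocabulary))
```

With `vocabSize=10000` and `minDF=5` as in the cell above, terms seen in fewer than five training documents never enter the vocabulary, so they contribute nothing to the feature vectors.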

In [48]:
# Fit the pipeline on the training set, then transform both sets
nlp_model = nlp_pipeline.fit(trainDataset)
countvec_train = nlp_model.transform(trainDataset)
countvec_test = nlp_model.transform(testDataset)

In [17]:
countvec_train.printSchema()

root
 |-- label: string (nullable = true)
 |-- description: string (nullable = true)
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |  

In [49]:
# Fit logreg
lr = LogisticRegression(labelCol="class", maxIter=10, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(countvec_train)

# Get test set preds
lrPredictions = lrModel.transform(countvec_test)

In [50]:
# Evaluate performance
logreg_df = lrPredictions.select('text','label','class','prediction').toPandas()
print(classification_report(logreg_df["class"], logreg_df.prediction))

              precision    recall  f1-score   support

         0.0       0.60      0.63      0.61       198
         1.0       0.23      0.26      0.24       102
         2.0       0.23      0.19      0.21       109
         3.0       0.23      0.21      0.22        91

    accuracy                           0.38       500
   macro avg       0.32      0.32      0.32       500
weighted avg       0.38      0.38      0.38       500



### Logistic Regression with TF-IDF


In [51]:
hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures", numFeatures=10000)

idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)  # minDocFreq: ignore terms that appear in fewer than 5 documents

nlp_pipeline_tf = Pipeline(
    stages=[document, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            finisher,
            hashingTF,
            idf,
            label_stringIdx])
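
The two stages added here can be sketched in plain Python. `hashing_tf` maps each token to one of `num_features` buckets and counts occurrences, and `idf_weights` applies the smoothed formula `log((N + 1) / (df + 1))` that Spark ML documents for `IDF`, giving weight 0 to buckets seen in fewer than `min_doc_freq` documents. `zlib.crc32` is only a stand-in hash (Spark's `HashingTF` uses MurmurHash3), and the toy inputs are made up.

```python
import math
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Map tokens to a sparse {bucket: count} dict via the hashing trick."""
    counts = Counter()
    for token in tokens:
        # Stand-in hash for illustration; not the hash Spark uses.
        bucket = zlib.crc32(token.encode("utf-8")) % num_features
        counts[bucket] += 1
    return dict(counts)


def idf_weights(tf_docs, min_doc_freq):
    """Smoothed IDF per bucket; buckets below min_doc_freq get weight 0."""
    n_docs = len(tf_docs)
    doc_freq = Counter(bucket for doc in tf_docs for bucket in doc)
    return {
        bucket: math.log((n_docs + 1) / (df + 1)) if df >= min_doc_freq else 0.0
        for bucket, df in doc_freq.items()
    }


def tf_idf(tf_doc, weights):
    """Scale raw term frequencies by the fitted IDF weights."""
    return {bucket: count * weights.get(bucket, 0.0) for bucket, count in tf_doc.items()}


# Toy term-frequency documents keyed by bucket index.
tf_docs = [{0: 1, 1: 2}, {0: 1, 1: 1}, {0: 3}, {2: 1}]
weights = idf_weights(tf_docs, min_doc_freq=2)
print(weights)
print(tf_idf(tf_docs[0], weights))
print(hashing_tf(["knee", "pain", "knee"], num_features=10000))
```

Unlike `CountVectorizer`, hashing needs no fitted vocabulary, but different tokens can collide in the same bucket, which is one reason `numFeatures` is kept large.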

In [52]:
# Fit the pipeline on the training set, then transform both sets
tfidf_model = nlp_pipeline_tf.fit(trainDataset)
tfidf_train = tfidf_model.transform(trainDataset)
tfidf_test = tfidf_model.transform(testDataset)

In [57]:
# Fit logreg
lr = LogisticRegression(labelCol="class", maxIter=10, regParam=0.3, elasticNetParam=0)
lrModel_tf = lr.fit(tfidf_train)

# Get test set preds
lrPredictions_tf = lrModel_tf.transform(tfidf_test)

In [58]:
# Evaluate performance
logreg_tf_df = lrPredictions_tf.select('class','prediction').toPandas()
print(classification_report(logreg_tf_df["class"], logreg_tf_df.prediction))

              precision    recall  f1-score   support

         0.0       0.58      0.60      0.59       198
         1.0       0.25      0.31      0.28       102
         2.0       0.16      0.12      0.14       109
         3.0       0.20      0.20      0.20        91

    accuracy                           0.36       500
   macro avg       0.30      0.31      0.30       500
weighted avg       0.35      0.36      0.36       500



### Random Forest with CountVectorizer

In [62]:
rf = RandomForestClassifier(labelCol="class",
                            featuresCol="features",
                            numTrees=100,
                            maxDepth=4,
                            maxBins=32)

# Train model and get predictions on test set
rfModel_cv = rf.fit(countvec_train)
predictions_rf_cv = rfModel_cv.transform(countvec_test)

In [63]:
# Evaluate performance
rf_cv_df = predictions_rf_cv.select("class", "prediction").toPandas()
print(classification_report(rf_cv_df["class"], rf_cv_df.prediction))

              precision    recall  f1-score   support

         0.0       0.66      0.86      0.75       198
         1.0       0.41      0.98      0.58       102
         2.0       1.00      0.01      0.02       109
         3.0       0.00      0.00      0.00        91

    accuracy                           0.54       500
   macro avg       0.52      0.46      0.34       500
weighted avg       0.56      0.54      0.42       500
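
The row for class 2.0 (precision 1.00, recall 0.01) looks odd until it is read as counts. With a support of 109, those rounded values are consistent with a single correct prediction and no false positives. The sketch below recomputes the metrics from those counts. The counts are inferred from the rounded report, not read from a confusion matrix.

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class precision, recall and F1, returning 0.0 where undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Class 2.0: inferred as 1 true positive, 0 false positives, 108 misses.
p2, r2, f2 = precision_recall_f1(tp=1, fp=0, fn=108)
print(f"class 2.0 -> precision {p2:.2f}, recall {r2:.2f}, f1 {f2:.2f}")

# Class 3.0: assumed never predicted, so all 91 test rows are missed.
p3, r3, f3 = precision_recall_f1(tp=0, fp=0, fn=91)
print(f"class 3.0 -> precision {p3:.2f}, recall {r3:.2f}, f1 {f3:.2f}")
```

In other words, this random forest assigns almost every test document to the two largest classes, which is why its accuracy matches the other models while its macro F1 stays low.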





### Random Forest with TF-IDF

In [60]:
# Train model and get predictions on test set
rfModel_tf = rf.fit(tfidf_train)
predictions_rf_tf = rfModel_tf.transform(tfidf_test)

In [61]:
# Evaluate performance
rf_tf_df = predictions_rf_tf.select("class", "prediction").toPandas()
print(classification_report(rf_tf_df["class"], rf_tf_df.prediction))

              precision    recall  f1-score   support

         0.0       0.65      0.87      0.75       198
         1.0       0.40      0.92      0.56       102
         2.0       0.00      0.00      0.00       109
         3.0       0.00      0.00      0.00        91

    accuracy                           0.53       500
   macro avg       0.26      0.45      0.33       500
weighted avg       0.34      0.53      0.41       500





## Conclusion



To compare the models while accounting for the class imbalance, we focus on the support-weighted average F1 score on the test set. The top three models were:

1. Logistic Regression with Universal Sentence Encoder (0.50)
2. DL Classification with BERT Sentence Embeddings (0.45)
3. DL Classification with BioBERT Clinical Sentence Embeddings (0.45)

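All models performed poorly overall, possibly because of the small sample size and the class imbalance: no model exceeded 0.54 test accuracy, and six of the ten had zero recall on the Other class.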
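
The ranking can be reproduced from the classification reports above. The sketch below collects the weighted and macro average F1 scores of all ten models, copied from those reports, and sorts them. The short model names are labels chosen for this example.

```python
# (weighted avg F1, macro avg F1) copied from the classification reports above.
scores = {
    "DL + Universal Sentence Encoder": (0.42, 0.34),
    "DL + BERT (sent_small_bert_L8_512)": (0.45, 0.37),
    "DL + BioBERT Clinical": (0.45, 0.38),
    "DL + BioBERT MedNLI": (0.44, 0.37),
    "Logistic Regression + USE": (0.50, 0.46),
    "Random Forest + USE": (0.43, 0.36),
    "Logistic Regression + CountVectorizer": (0.38, 0.32),
    "Logistic Regression + TF-IDF": (0.36, 0.30),
    "Random Forest + CountVectorizer": (0.42, 0.34),
    "Random Forest + TF-IDF": (0.41, 0.33),
}


def rank_by(scores, position):
    """Sort model names by the score at the given tuple position, best first."""
    return sorted(scores, key=lambda name: -scores[name][position])


by_weighted = rank_by(scores, 0)
by_macro = rank_by(scores, 1)

for rank, name in enumerate(by_weighted[:3], start=1):
    weighted, macro = scores[name]
    print(f"{rank}. {name}: weighted F1 {weighted:.2f}, macro F1 {macro:.2f}")
```

Logistic Regression with the Universal Sentence Encoder leads on both averages, so the choice of averaging method does not change the top result.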