<a href="https://colab.research.google.com/github/luca-martial/medical-specialty/blob/main/medical_specialty_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Medical Specialty Prediction with Spark NLP

The goal of this project was to predict the medical specialty (Surgery, Medical Records, Internal Medicine, or Other) of each document in a corpus of 4999 medical transcriptions using Spark NLP. The corpus was scraped by [Tara Boyle](https://github.com/terrah27) from the [Transcribed Medical Transcription Sample Reports and Examples website](https://mtsamples.com/) and published on [Kaggle](https://www.kaggle.com/tboyle10/medicaltranscriptions). The version used in this project was compiled by [Carlos Salgado](https://github.com/socd06) for natural language processing; it combines the scraped corpus with custom-generated clinical stop words and vocabulary. This compiled version was published on [GitHub](https://github.com/socd06/medical-nlp) and is free to use.

## Set-Up

### Installing Spark NLP

In [1]:
import json
import os

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

# Defining license key-value pairs as local variables
locals().update(license_keys)

# Adding license key-value pairs to environment variables
os.environ.update(license_keys)

Saving spark_nlp_for_healthcare_spark_ocr_3138.json to spark_nlp_for_healthcare_spark_ocr_3138.json


In [2]:
# Installing pyspark and spark-nlp
! pip install --upgrade -q pyspark==3.1.2 spark-nlp==$PUBLIC_VERSION

# Installing Spark NLP Healthcare
! pip install --upgrade -q spark-nlp-jsl==$JSL_VERSION  --extra-index-url https://pypi.johnsnowlabs.com/$SECRET


In [3]:
# Import libraries and start session
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

print("Spark NLP version", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

params = {"spark.driver.memory":"16G", 
          "spark.kryoserializer.buffer.max":"2000M", 
          "spark.driver.maxResultSize":"2000M"} 

spark = sparknlp_jsl.start(license_keys['SECRET'], 
                           gpu=True, 
                           params=params)

spark

Spark NLP version 3.3.1
Spark NLP_JSL Version : 3.3.1


In [4]:
# Import auxiliary libraries
import pandas as pd
from sklearn.metrics import classification_report
from pyspark.sql.functions import explode
from pyspark.sql.functions import lit
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml.feature import StringIndexer, CountVectorizer, HashingTF, IDF
from pyspark.sql import functions as F
from collections import Counter
from sklearn.ensemble import RandomForestClassifier as skl_RandomForestClassifier
from sklearn.linear_model import LogisticRegression as skl_LogisticRegression

### Reading in Data

In [5]:
# Get datasets
! wget -q https://raw.githubusercontent.com/socd06/medical-nlp/master/data/train.csv
! wget -q https://raw.githubusercontent.com/socd06/medical-nlp/master/data/test.csv

In [6]:
labelDict = {'1':'Surgery', '2':'Medical Records', '3':'Internal Medicine', '4':'Other'}

In [7]:
trainDataset = spark.read \
      .option("header", True) \
      .csv("train.csv") \
      .replace(labelDict, subset=['label'])

In [None]:
trainDataset.printSchema()

root
 |-- label: string (nullable = true)
 |-- description: string (nullable = true)
 |-- text: string (nullable = true)



In [None]:
trainDataset.show(10, truncate=50)

+-----------------+-------------------------------------+--------------------------------------------------+
|            label|                          description|                                              text|
+-----------------+-------------------------------------+--------------------------------------------------+
|  Medical Records|                         2-D Doppler |2-D STUDY,1. Mild aortic stenosis, widely calci...|
|          Surgery|                         Gastroscopy |PREOPERATIVE DIAGNOSES: , Dysphagia and esophag...|
|  Medical Records|       Three-Week Postpartum Checkup |CHIEF COMPLAINT:,  The patient comes for three-...|
|          Surgery|             Radiofrequency Ablation |PROCEDURE: , Bilateral L5, S1, S2, and S3 radio...|
|  Medical Records|               Discharge Summary - 3 |DISCHARGE DIAGNOSES:,1. Chronic obstructive pul...|
|Internal Medicine| Heart Catheterization & Angiography |INDICATION:,  Coronary artery disease, severe a...|
|  Medical Records|

In [None]:
trainDataset.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+-----------------+-----+
|            label|count|
+-----------------+-----+
|          Surgery| 1442|
|  Medical Records| 1126|
|Internal Medicine| 1040|
|            Other|  891|
+-----------------+-----+



In [8]:
testDataset = spark.read \
      .option("header", True) \
      .csv("test.csv") \
      .replace(labelDict, subset=['label'])

In [None]:
testDataset.show(10, truncate=50)

+-----------------+-------------------------------------------+--------------------------------------------------+
|            label|                                description|                                              text|
+-----------------+-------------------------------------------+--------------------------------------------------+
|  Medical Records|      Hemiarthroplasty - Discharge Summary |ADMISSION DIAGNOSES:  ,Fracture of the right fe...|
|          Surgery|                        Plantar Fasciotomy |PREOPERATIVE DIAGNOSIS:,  Plantar fascitis, lef...|
|  Medical Records|      Hysterectomy - Discharge Summary - 2 |ADMISSION DIAGNOSIS: , Microinvasive carcinoma ...|
|            Other|        Total Knee Arthoplasty - Right - 1 |PREOPERATIVE DIAGNOSIS:,  Severe degenerative j...|
|          Surgery|         Breast Radiation Therapy Followup |DIAGNOSIS: , Left breast adenocarcinoma stage T...|
|            Other|                         Hamstring Release |PREOPERATIVE DIAG

In [None]:
testDataset.groupBy("label") \
    .count() \
    .orderBy(col("count").desc()) \
    .show()

+-----------------+-----+
|            label|count|
+-----------------+-----+
|          Surgery|  198|
|Internal Medicine|  109|
|  Medical Records|  102|
|            Other|   91|
+-----------------+-----+
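
The classes are imbalanced, so a useful reference point for the accuracy figures reported later is a classifier that always predicts the most frequent class. The sketch below computes that baseline in plain Python from the test-set counts shown above; the variable names are illustrative and not part of the notebook.

```python
# Test-set class counts copied from the table above.
test_counts = {
    "Surgery": 198,
    "Internal Medicine": 109,
    "Medical Records": 102,
    "Other": 91,
}

total = sum(test_counts.values())
majority_label = max(test_counts, key=test_counts.get)

# Accuracy of a classifier that always predicts the most frequent class.
baseline_accuracy = test_counts[majority_label] / total

print(majority_label, total, round(baseline_accuracy, 3))
```

Any model whose test accuracy is close to this value is doing little more than predicting the majority class.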



## DL Classifiers with Sentence Embeddings

### DL Classification with Universal Sentence Encoder



In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
    
use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(50)\
    .setLr(0.001)\
    .setBatchSize(8)\
    .setEnableOutputLogs(True)

use_clf_pipeline = Pipeline(
    stages = [
        document,
        use,
        classifierdl
    ])

tfhub_use download started this may take some time.
Approximate size to download 923.7 MB
[OK!]


In [None]:
use_pipelineModel = use_clf_pipeline.fit(trainDataset)

In [None]:
# Evaluate model on test set
use_df = use_pipelineModel.transform(testDataset).select('label', 'text', 'class.result').toPandas()
use_df['result'] = use_df['result'].apply(lambda x: str(x[0])).replace(labelDict)
print(classification_report(use_df.label, use_df.result))

                   precision    recall  f1-score   support

Internal Medicine       0.33      0.03      0.05       109
  Medical Records       0.41      0.90      0.57       102
            Other       0.00      0.00      0.00        91
          Surgery       0.65      0.88      0.75       198

         accuracy                           0.54       500
        macro avg       0.35      0.45      0.34       500
     weighted avg       0.41      0.54      0.42       500
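
To make the two summary rows of the report easier to read, the sketch below recomputes the macro and weighted F1 averages from the per-class values printed above. Because the inputs are already rounded to two decimals, the results only approximately reproduce the report; the variable names are illustrative.

```python
# Per-class (F1 score, support) pairs copied from the report above.
report = {
    "Internal Medicine": (0.05, 109),
    "Medical Records": (0.57, 102),
    "Other": (0.00, 91),
    "Surgery": (0.75, 198),
}

f1_scores = [f1 for f1, _ in report.values()]
supports = [n for _, n in report.values()]

# Macro average: unweighted mean over classes.
macro_f1 = sum(f1_scores) / len(f1_scores)

# Weighted average: mean over classes, weighted by class support.
weighted_f1 = sum(f1 * n for f1, n in report.values()) / sum(supports)

print(f"macro F1: {macro_f1:.4f}, weighted F1: {weighted_f1:.4f}")
```

The macro average treats all four classes equally, so the zero score for Other pulls it well below the weighted average, which is dominated by Surgery.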





### DL Classification with BERT Sentence Embeddings (Compact Version)

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

bert_sent = BertSentenceEmbeddings.pretrained("sent_small_bert_L8_512")\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(15)\
    .setLr(1e-4)\
    .setBatchSize(8)\
    .setEnableOutputLogs(True)

bert_sent_clf_pipeline = Pipeline(
    stages = [
        document,
        bert_sent,
        classifierdl
    ])

sent_small_bert_L8_512 download started this may take some time.
Approximate size to download 149.1 MB
[OK!]


In [None]:
bert_sent_pipelineModel = bert_sent_clf_pipeline.fit(trainDataset)

In [None]:
# Evaluate model on test set
bert_sent_df = bert_sent_pipelineModel.transform(testDataset).select('label', 'text', 'class.result').toPandas()
bert_sent_df['result'] = bert_sent_df['result'].apply(lambda x: str(x[0])).replace(labelDict)
print(classification_report(bert_sent_df.label, bert_sent_df.result))

                   precision    recall  f1-score   support

Internal Medicine       0.00      0.00      0.00       109
  Medical Records       0.40      0.81      0.53       102
            Other       0.00      0.00      0.00        91
          Surgery       0.60      0.88      0.71       198

         accuracy                           0.51       500
        macro avg       0.25      0.42      0.31       500
     weighted avg       0.32      0.51      0.39       500





### DL Classification with BioBERT (Clinical) Sentence Embeddings

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

biobert_clin = BertSentenceEmbeddings.pretrained("sent_biobert_clinical_base_cased", "en")\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(200)\
    .setLr(3e-4)\
    .setBatchSize(8)\
    .setEnableOutputLogs(True)

biobert_clin_clf_pipeline = Pipeline(
    stages = [
        document,
        biobert_clin,
        classifierdl
    ])

sent_biobert_clinical_base_cased download started this may take some time.
Approximate size to download 386.6 MB
[OK!]


In [None]:
biobert_clin_pipelineModel = biobert_clin_clf_pipeline.fit(trainDataset)

In [None]:
# os.listdir returns entries in arbitrary order, so index 2 may not be the most recent run
log_file_name = os.listdir("/root/annotator_logs")[2]

with open("/root/annotator_logs/"+log_file_name, "r") as log_file :
    print(log_file.read())

Training started - epochs: 100 - learning_rate: 3.0E-4 - batch_size: 8 - training_examples: 4499 - classes: 4
Epoch 0/100 - 1.66s - loss: 712.29126 - acc: 0.48658067 - batches: 563
Epoch 1/100 - 1.53s - loss: 695.0834 - acc: 0.5099348 - batches: 563
Epoch 2/100 - 1.53s - loss: 693.25885 - acc: 0.5139383 - batches: 563
Epoch 3/100 - 1.58s - loss: 691.974 - acc: 0.5139383 - batches: 563
Epoch 4/100 - 1.59s - loss: 690.9701 - acc: 0.5157177 - batches: 563
Epoch 5/100 - 1.58s - loss: 690.14355 - acc: 0.51771945 - batches: 563
Epoch 6/100 - 1.57s - loss: 689.4407 - acc: 0.5194988 - batches: 563
Epoch 7/100 - 1.57s - loss: 688.8167 - acc: 0.5208334 - batches: 563
Epoch 8/100 - 1.57s - loss: 688.26056 - acc: 0.5212782 - batches: 563
Epoch 9/100 - 1.59s - loss: 687.7543 - acc: 0.52216786 - batches: 563
Epoch 10/100 - 1.58s - loss: 687.2945 - acc: 0.5226127 - batches: 563
Epoch 11/100 - 1.59s - loss: 686.87646 - acc: 0.5237248 - batches: 563
Epoch 12/100 - 1.58s - loss: 686.4867 - acc: 0.525504
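
The log is plain text, so the per-epoch loss and accuracy can be pulled out with a regular expression for plotting or comparison. The sketch below assumes every epoch line follows the layout shown above; `parse_epochs` and `EPOCH_LINE` are hypothetical names, not part of Spark NLP.

```python
import re

# Two lines copied from the training log above.
sample_log = """\
Epoch 0/100 - 1.66s - loss: 712.29126 - acc: 0.48658067 - batches: 563
Epoch 1/100 - 1.53s - loss: 695.0834 - acc: 0.5099348 - batches: 563
"""

# Assumed line layout: "Epoch i/n - <t>s - loss: <x> - acc: <y> ..."
EPOCH_LINE = re.compile(
    r"Epoch (?P<epoch>\d+)/(?P<total>\d+) - (?P<seconds>[\d.]+)s"
    r" - loss: (?P<loss>[\d.]+) - acc: (?P<acc>[\d.]+)"
)

def parse_epochs(log_text):
    """Return one dict per epoch line; lines that do not match are skipped."""
    rows = []
    for line in log_text.splitlines():
        match = EPOCH_LINE.match(line)
        if match:
            rows.append({
                "epoch": int(match["epoch"]),
                "loss": float(match["loss"]),
                "acc": float(match["acc"]),
            })
    return rows

history = parse_epochs(sample_log)
print(history)
```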

In [None]:
# Evaluate model on test set
biobert_clin_df = biobert_clin_pipelineModel.transform(testDataset).select('label', 'text', 'class.result').toPandas()
biobert_clin_df['result'] = biobert_clin_df['result'].apply(lambda x: str(x[0])).replace(labelDict)
print(classification_report(biobert_clin_df.label, biobert_clin_df.result))

                   precision    recall  f1-score   support

Internal Medicine       0.42      0.16      0.23       109
  Medical Records       0.44      0.85      0.58       102
            Other       0.00      0.00      0.00        91
          Surgery       0.64      0.85      0.73       198

         accuracy                           0.54       500
        macro avg       0.38      0.46      0.38       500
     weighted avg       0.44      0.54      0.46       500





### DL Classification with BioBERT (MedNLI) Sentence Embeddings

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

biobert_med = BertSentenceEmbeddings.pretrained("sbiobert_base_cased_mli","en","clinical/models")\
      .setInputCols(["document"])\
      .setOutputCol("sentence_embeddings")

classifierdl = ClassifierDLApproach()\
    .setInputCols(["sentence_embeddings"])\
    .setOutputCol("class")\
    .setLabelColumn("label")\
    .setMaxEpochs(50)\
    .setLr(3e-4)\
    .setBatchSize(8)\
    .setEnableOutputLogs(True)

biobert_med_clf_pipeline = Pipeline(
    stages = [
        document,
        biobert_med,
        classifierdl
    ])

sbiobert_base_cased_mli download started this may take some time.
Approximate size to download 384.3 MB
[OK!]


In [None]:
biobert_med_pipelineModel = biobert_med_clf_pipeline.fit(trainDataset)

In [None]:
# Evaluate model on test set
biobert_med_df = biobert_med_pipelineModel.transform(testDataset).select('label', 'text', 'class.result').toPandas()
biobert_med_df['result'] = biobert_med_df['result'].apply(lambda x: str(x[0])).replace(labelDict)
print(classification_report(biobert_med_df.label, biobert_med_df.result))

                   precision    recall  f1-score   support

Internal Medicine       0.36      0.24      0.29       109
  Medical Records       0.40      0.68      0.50       102
            Other       0.00      0.00      0.00        91
          Surgery       0.62      0.80      0.70       198

         accuracy                           0.51       500
        macro avg       0.34      0.43      0.37       500
     weighted avg       0.41      0.51      0.44       500





## ML Classifiers with Sentence Embeddings

### Logistic Regression with Universal Sentence Encoder

In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

use = UniversalSentenceEncoder.pretrained()\
    .setInputCols(["document"])\
    .setOutputCol("sentence_embeddings")

embeddings_finisher = EmbeddingsFinisher() \
      .setInputCols(["sentence_embeddings"]) \
      .setOutputCols(["finished_embeddings"]) \
      .setOutputAsVector(True)\
      .setCleanAnnotations(False)

label_stringIdx = StringIndexer(inputCol = "label", outputCol = "class")

ml_pipeline = Pipeline(
      stages=[
        document,
        use,
        embeddings_finisher,
        label_stringIdx]
      )

In [None]:
# Fit pipeline on the train set, then transform both train and test sets
ml_model = ml_pipeline.fit(trainDataset)
ml_train = ml_model.transform(trainDataset)
ml_test = ml_model.transform(testDataset)

In [None]:
# Explode sentence embeddings for train and test sets
ml_train = ml_train.withColumn("features", explode(ml_train.finished_embeddings))
ml_test = ml_test.withColumn("features", explode(ml_test.finished_embeddings))

In [None]:
# Fit logreg
lr = LogisticRegression(labelCol="class", maxIter=20, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(ml_train)

# Get test set preds
lrPredictions = lrModel.transform(ml_test)

In [None]:
# Evaluate performance
logreg_df = lrPredictions.select('text','label','class','prediction').toPandas()
print(classification_report(logreg_df["class"], logreg_df.prediction))

              precision    recall  f1-score   support

         0.0       0.66      0.70      0.68       198
         1.0       0.41      0.53      0.46       102
         2.0       0.39      0.28      0.32       109
         3.0       0.40      0.36      0.38        91

    accuracy                           0.51       500
   macro avg       0.47      0.47      0.46       500
weighted avg       0.50      0.51      0.50       500



### Random Forest with Universal Sentence Encoder

In [None]:
rf = RandomForestClassifier(labelCol="class", \
                            featuresCol="features", \
                            numTrees = 100, \
                            maxDepth = 4, \
                            maxBins = 32)

# Train model with Training Data, get predictions on test set
rfModel = rf.fit(ml_train)
predictions_rf = rfModel.transform(ml_test)

In [None]:
# Evaluate performance
rf_df = predictions_rf.select("class", "prediction").toPandas()
print(classification_report(rf_df["class"], rf_df.prediction))

              precision    recall  f1-score   support

         0.0       0.63      0.87      0.73       198
         1.0       0.42      0.85      0.56       102
         2.0       0.14      0.01      0.02       109
         3.0       0.50      0.08      0.13        91

    accuracy                           0.54       500
   macro avg       0.42      0.45      0.36       500
weighted avg       0.46      0.54      0.43       500



## ML Classifiers with Feature Vectorization Methods

### Logistic Regression with CountVectorizer


In [None]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = Tokenizer() \
      .setInputCols(["document"]) \
      .setOutputCol("token")
      
normalizer = Normalizer() \
      .setInputCols(["token"]) \
      .setOutputCol("normalized")

stopwords_cleaner = StopWordsCleaner()\
      .setInputCols("normalized")\
      .setOutputCol("cleanTokens")\
      .setCaseSensitive(False)

stemmer = Stemmer() \
      .setInputCols(["cleanTokens"]) \
      .setOutputCol("stem")

finisher = Finisher() \
      .setInputCols(["stem"]) \
      .setOutputCols(["token_features"]) \
      .setOutputAsArray(True) \
      .setCleanAnnotations(False)

countVectors = CountVectorizer(inputCol="token_features", outputCol="features", vocabSize=10000, minDF=5)

label_stringIdx = StringIndexer(inputCol = "label", outputCol = "class")

nlp_pipeline = Pipeline(
    stages=[document, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            finisher,
            countVectors,
            label_stringIdx])
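
As a rough illustration of what the `CountVectorizer` stage does with `vocabSize` and `minDF`, the sketch below builds a vocabulary and count vectors in plain Python. It is a simplified stand-in under stated assumptions, not Spark's implementation: the function names, the tie-breaking rule, and the sample tokens are all made up for this example.

```python
from collections import Counter

def fit_vocabulary(docs, vocab_size, min_df):
    """Pick up to vocab_size terms that occur in at least min_df documents."""
    term_freq = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = [t for t in term_freq if doc_freq[t] >= min_df]
    # Most frequent terms first; ties broken alphabetically (an assumption).
    kept.sort(key=lambda t: (-term_freq[t], t))
    return {term: idx for idx, term in enumerate(kept[:vocab_size])}

def count_vector(doc, vocabulary):
    """Map a token list to a dense count vector over the vocabulary."""
    vec = [0] * len(vocabulary)
    for tok in doc:
        if tok in vocabulary:
            vec[vocabulary[tok]] += 1
    return vec

# Made-up, already stemmed token lists.
docs = [
    ["patient", "knee", "pain", "knee"],
    ["patient", "discharg", "summari"],
    ["knee", "surgeri", "patient"],
]
vocab = fit_vocabulary(docs, vocab_size=10, min_df=2)
vectors = [count_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

With `min_df=2`, only tokens that appear in at least two of the three documents survive, which is the same filtering idea as `minDF=5` in the pipeline above.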

In [None]:
# Fit pipeline on the train set, then transform both train and test sets
nlp_model = nlp_pipeline.fit(trainDataset)
countvec_train = nlp_model.transform(trainDataset)
countvec_test = nlp_model.transform(testDataset)

In [None]:
countvec_train.printSchema()

root
 |-- label: string (nullable = true)
 |-- description: string (nullable = true)
 |-- text: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- token: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |  

In [None]:
# Fit logreg
lr = LogisticRegression(labelCol="class", maxIter=10, regParam=0.3, elasticNetParam=0)
lrModel = lr.fit(countvec_train)

# Get test set preds
lrPredictions = lrModel.transform(countvec_test)

In [None]:
# Evaluate performance
logreg_df = lrPredictions.select('text','label','class','prediction').toPandas()
print(classification_report(logreg_df["class"], logreg_df.prediction))

              precision    recall  f1-score   support

         0.0       0.60      0.63      0.61       198
         1.0       0.23      0.26      0.24       102
         2.0       0.23      0.19      0.21       109
         3.0       0.23      0.21      0.22        91

    accuracy                           0.38       500
   macro avg       0.32      0.32      0.32       500
weighted avg       0.38      0.38      0.38       500



### Logistic Regression with TF-IDF


In [None]:
hashingTF = HashingTF(inputCol="token_features", outputCol="rawFeatures", numFeatures=10000)

# minDocFreq: terms that appear in fewer than 5 documents get an IDF of 0
idf = IDF(inputCol="rawFeatures", outputCol="features", minDocFreq=5)

nlp_pipeline_tf = Pipeline(
    stages=[document, 
            tokenizer,
            normalizer,
            stopwords_cleaner, 
            stemmer, 
            finisher,
            hashingTF,
            idf,
            label_stringIdx])
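
The two stages added here can be pictured as a hashing step followed by a reweighting step. The sketch below is a plain-Python approximation under stated assumptions: it uses `zlib.crc32` as a stand-in hash (Spark's `HashingTF` uses MurmurHash3), applies the smoothed formula log((m + 1) / (df + 1)) that Spark MLlib documents for `IDF`, and runs on made-up tokens. The function names are hypothetical.

```python
import math
import zlib

def hashing_tf(tokens, num_features):
    """Hash each token into one of num_features buckets and count hits."""
    vec = [0.0] * num_features
    for tok in tokens:
        # crc32 is a stand-in hash for this sketch, not Spark's hash function.
        bucket = zlib.crc32(tok.encode("utf-8")) % num_features
        vec[bucket] += 1.0
    return vec

def fit_idf(tf_vectors, min_doc_freq):
    """Smoothed IDF per bucket; buckets seen in too few documents get 0."""
    m = len(tf_vectors)
    num_features = len(tf_vectors[0])
    idf = []
    for j in range(num_features):
        df = sum(1 for vec in tf_vectors if vec[j] > 0)
        idf.append(math.log((m + 1) / (df + 1)) if df >= min_doc_freq else 0.0)
    return idf

def tfidf(tf_vector, idf):
    """Scale each term-frequency bucket by its IDF weight."""
    return [tf * w for tf, w in zip(tf_vector, idf)]

# Made-up, already stemmed token lists.
docs = [["patient", "knee", "pain"], ["patient", "discharg"], ["patient", "knee"]]
tf_vectors = [hashing_tf(d, num_features=16) for d in docs]
idf = fit_idf(tf_vectors, min_doc_freq=1)
weighted = [tfidf(v, idf) for v in tf_vectors]
patient_bucket = zlib.crc32(b"patient") % 16
print(idf[patient_bucket])
```

A token that occurs in every document, such as `patient` here, ends up with a weight of zero, which is why TF-IDF downplays terms that do not discriminate between classes. Unlike `CountVectorizer`, hashing needs no vocabulary pass, at the cost of possible bucket collisions.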

In [None]:
# Fit pipeline on the train set, then transform both train and test sets
tfidf_model = nlp_pipeline_tf.fit(trainDataset)
tfidf_train = tfidf_model.transform(trainDataset)
tfidf_test = tfidf_model.transform(testDataset)

In [None]:
# Fit logreg
lr = LogisticRegression(labelCol="class", maxIter=10, regParam=0.3, elasticNetParam=0)
lrModel_tf = lr.fit(tfidf_train)

# Get test set preds
lrPredictions_tf = lrModel_tf.transform(tfidf_test)

In [None]:
# Evaluate performance
logreg_tf_df = lrPredictions_tf.select('class','prediction').toPandas()
print(classification_report(logreg_tf_df["class"], logreg_tf_df.prediction))

              precision    recall  f1-score   support

         0.0       0.58      0.60      0.59       198
         1.0       0.25      0.31      0.28       102
         2.0       0.16      0.12      0.14       109
         3.0       0.20      0.20      0.20        91

    accuracy                           0.36       500
   macro avg       0.30      0.31      0.30       500
weighted avg       0.35      0.36      0.36       500



### Random Forest with CountVectorizer

In [None]:
rf = RandomForestClassifier(labelCol="class", \
                            featuresCol="features", \
                            numTrees = 100, \
                            maxDepth = 4, \
                            maxBins = 32)

# Train model and get predictions on test set
rfModel_cv = rf.fit(countvec_train)
predictions_rf_cv = rfModel_cv.transform(countvec_test)

In [None]:
# Evaluate performance
rf_cv_df = predictions_rf_cv.select("class", "prediction").toPandas()
print(classification_report(rf_cv_df["class"], rf_cv_df.prediction))

              precision    recall  f1-score   support

         0.0       0.66      0.86      0.75       198
         1.0       0.41      0.98      0.58       102
         2.0       1.00      0.01      0.02       109
         3.0       0.00      0.00      0.00        91

    accuracy                           0.54       500
   macro avg       0.52      0.46      0.34       500
weighted avg       0.56      0.54      0.42       500





### Random Forest with TF-IDF

In [None]:
# Train model and get predictions on test set
rfModel_tf = rf.fit(tfidf_train)
predictions_rf_tf = rfModel_tf.transform(tfidf_test)

In [None]:
# Evaluate performance
rf_tf_df = predictions_rf_tf.select("class", "prediction").toPandas()
print(classification_report(rf_tf_df["class"], rf_tf_df.prediction))

              precision    recall  f1-score   support

         0.0       0.65      0.87      0.75       198
         1.0       0.40      0.92      0.56       102
         2.0       0.00      0.00      0.00       109
         3.0       0.00      0.00      0.00        91

    accuracy                           0.53       500
   macro avg       0.26      0.45      0.33       500
weighted avg       0.34      0.53      0.41       500





## Feature Engineering

### Featurizer Functions

In [9]:
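# Function to get counts of each assertion status in one row
def get_assertion_stats(assertion):
    # `assertion` is the list of assertion statuses for a single row;
    # the unused intermediate list from the original version has been dropped
    return dict(Counter(assertion))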
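
The helper above amounts to counting strings with `collections.Counter`. A minimal standalone illustration, using a hypothetical function name and made-up statuses:

```python
from collections import Counter

def count_statuses(statuses):
    """Count how often each assertion status occurs in one row."""
    return dict(Counter(statuses))

# Made-up statuses for a single document.
example = ["present", "absent", "present", "associated_with_someone_else"]
counts = count_statuses(example)
print(counts)
# {'present': 2, 'absent': 1, 'associated_with_someone_else': 1}
```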

In [10]:
# Function to extract features out of df
def create_features(spark_df, pipeline):
    if pipeline == 'full':
        # Exploding entities and icd codes
        pandas_df = spark_df.select('id','label','class','assertion','source',
                  F.explode(F.arrays_zip('clinical_ner_chunk.result',"clinical_ner_chunk.metadata",
                                         'bio_ner_chunk.result',"bio_ner_chunk.metadata",
                                         'posology_ner_chunk.result',"posology_ner_chunk.metadata",
                                         'risk_ner_chunk.result',"risk_ner_chunk.metadata",
                                         'jsl_ner_chunk.result',"jsl_ner_chunk.metadata",
                                         'icd10cm_code.result','icd10pcs_code.result')).alias("cols")) \
              .select('id','label','class','source',
                      F.expr("assertion.result").alias("assertion"),
                      F.expr("cols['0']").alias("clinical_token"),
                      F.expr("cols['1'].entity").alias("clinical_entity"),
                      F.expr("cols['2']").alias("bionlp_token"),
                      F.expr("cols['3'].entity").alias("bionlp_entity"),
                      F.expr("cols['4']").alias("posology_token"),
                      F.expr("cols['5'].entity").alias("posology_entity"),
                      F.expr("cols['6']").alias("risk_token"),
                      F.expr("cols['7'].entity").alias("risk_entity"),
                      F.expr("cols['8']").alias("jsl_token"),
                      F.expr("cols['9'].entity").alias("jsl_entity"),
                      F.expr("cols['10']").alias("icd10cm_code"),
                      F.expr("cols['11']").alias("icd10pcs_code")).toPandas()

        pids = pandas_df['id'].unique()

        stats_list=[]

        for i in pids:
            # Getting counts of each entity, icd code and assertion status
            temp_df = pandas_df[pandas_df.id==i]
            temp_dict = temp_df['clinical_entity'].value_counts().to_dict()
            temp_dict.update(temp_df['bionlp_entity'].value_counts().to_dict())
            temp_dict.update(temp_df['posology_entity'].value_counts().to_dict())
            temp_dict.update(temp_df['jsl_entity'].value_counts().to_dict())
            temp_dict.update(temp_df['risk_entity'].value_counts().to_dict())
            temp_dict.update(temp_df['icd10cm_code'].value_counts().to_dict())
            temp_dict.update(temp_df['icd10pcs_code'].value_counts().to_dict())
            adf = temp_df['assertion'].apply(lambda x: get_assertion_stats(x)).value_counts().reset_index()

            dic = {'present':0,
            'absent':0,
            'associated_with_someone_else':0}

            for j, row in adf.iterrows():
                try:
                    k = list(row['index'].keys())[0]
                    dic[k]= dic[k]+row['index'][k]*row['assertion']
                except:
                    pass

            # Finalizing dictionary
            temp_dict.update(dic)
            temp_dict['id']=i
            stats_list.append(temp_dict)

    elif pipeline == 'no_resolver':
        # Exploding entities and icd codes
        pandas_df = spark_df.select('id','label','class','assertion','source',
                  F.explode(F.arrays_zip('clinical_ner_chunk.result',"clinical_ner_chunk.metadata",
                                         'bio_ner_chunk.result',"bio_ner_chunk.metadata",
                                         'posology_ner_chunk.result',"posology_ner_chunk.metadata",
                                         'risk_ner_chunk.result',"risk_ner_chunk.metadata",
                                         'jsl_ner_chunk.result',"jsl_ner_chunk.metadata")).alias("cols")) \
              .select('id','label','class','source',
                      F.expr("assertion.result").alias("assertion"),
                      F.expr("cols['0']").alias("clinical_token"),
                      F.expr("cols['1'].entity").alias("clinical_entity"),
                      F.expr("cols['2']").alias("bionlp_token"),
                      F.expr("cols['3'].entity").alias("bionlp_entity"),
                      F.expr("cols['4']").alias("posology_token"),
                      F.expr("cols['5'].entity").alias("posology_entity"),
                      F.expr("cols['6']").alias("risk_token"),
                      F.expr("cols['7'].entity").alias("risk_entity"),
                      F.expr("cols['8']").alias("jsl_token"),
                      F.expr("cols['9'].entity").alias("jsl_entity")).toPandas()

        pids = pandas_df['id'].unique()

        stats_list=[]

        for i in pids:
            # Getting counts of each entity and assertion status (no ICD codes in this branch)
            temp_df = pandas_df[pandas_df.id==i]
            temp_dict = temp_df['clinical_entity'].value_counts().to_dict()
            temp_dict.update(temp_df['bionlp_entity'].value_counts().to_dict())
            temp_dict.update(temp_df['posology_entity'].value_counts().to_dict())
            temp_dict.update(temp_df['jsl_entity'].value_counts().to_dict())
            temp_dict.update(temp_df['risk_entity'].value_counts().to_dict())
            adf = temp_df['assertion'].apply(lambda x: get_assertion_stats(x)).value_counts().reset_index()

            dic = {'present':0,
            'absent':0,
            'associated_with_someone_else':0}

            for j, row in adf.iterrows():
                try:
                    k = list(row['index'].keys())[0]
                    dic[k]= dic[k]+row['index'][k]*row['assertion']
                except:
                    pass

            # Finalizing dictionary
            temp_dict.update(dic)
            temp_dict['id']=i
            stats_list.append(temp_dict)

    # Converting to dataframe
    stats_df = pd.DataFrame(stats_list)

    # Renaming cols
    stats_df.columns = ['entity_{}'.format(c) for c in stats_df.columns]
      
    # Merging counts with unique ids and corresponding labels
    model_df = pandas_df[['id','label','class', 'source']].drop_duplicates().merge(stats_df, left_on='id', right_on='entity_id').fillna(0)

    return model_df
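
Stripped of the Spark and pandas plumbing, the core of `create_features` is: count how often each entity label occurs per document, prefix the column names with `entity_`, and fill missing labels with 0. The sketch below shows that idea in plain Python on made-up data; `entity_count_rows` is a hypothetical helper, and it leaves out the ICD codes and assertion counts that the real function also adds.

```python
from collections import Counter

def entity_count_rows(records):
    """Build one {entity_<label>: count} row per document id."""
    per_doc = {}
    for doc_id, entity in records:
        per_doc.setdefault(doc_id, Counter())[entity] += 1

    # Union of all entity labels, so every row has the same columns.
    columns = sorted({e for counts in per_doc.values() for e in counts})
    rows = []
    for doc_id, counts in per_doc.items():
        row = {"entity_id": doc_id}
        # Labels missing from a document become 0, mirroring the fillna(0) step.
        row.update({f"entity_{c}": counts.get(c, 0) for c in columns})
        rows.append(row)
    return rows

# Made-up (document id, entity label) pairs.
records = [(0, "PROBLEM"), (0, "TEST"), (0, "PROBLEM"), (1, "DRUG")]
rows = entity_count_rows(records)
print(rows)
```

Each resulting row is a fixed-width count vector per document, which is the shape a downstream tabular classifier expects.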

In [11]:
# Create monotonically increasing ids 
trainDataset = trainDataset.withColumn("id", F.monotonically_increasing_id())
testDataset = testDataset.withColumn("id", F.monotonically_increasing_id())

### Full Pipeline

#### Create Pipeline

In [12]:
document = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

token = Tokenizer()\
    .setInputCols(['sentence'])\
    .setOutputCol('tok_checked')

embeddings = WordEmbeddingsModel.pretrained('embeddings_clinical', 'en', 'clinical/models')\
    .setInputCols(["sentence", "tok_checked"])\
    .setOutputCol("embeddings")

# Detect clinical problems, tests and treatments
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large", "en", "clinical/models") \
  .setInputCols(["sentence", "tok_checked", "embeddings"]) \
  .setOutputCol("clinical_ner")

clinical_converter = NerConverter()\
  .setInputCols(["sentence", "tok_checked", "clinical_ner"])\
  .setOutputCol("clinical_ner_chunk")

# Detect clinical entities
jsl_ner = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models") \
  .setInputCols(["sentence", "tok_checked", "embeddings"]) \
  .setOutputCol("jsl_ner")

jsl_converter = NerConverter() \
  .setInputCols(["sentence", "tok_checked", "jsl_ner"]) \
  .setOutputCol("jsl_ner_chunk")

# Detect cancer genetics
bio_ner = MedicalNerModel.pretrained("ner_bionlp", "en", "clinical/models") \
  .setInputCols(["sentence", "tok_checked", "embeddings"]) \
  .setOutputCol("bio_ner")

bio_converter = NerConverter()\
  .setInputCols(["sentence", "tok_checked", "bio_ner"])\
  .setOutputCol("bio_ner_chunk")

# Detect drugs and posology entities
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "tok_checked", "embeddings"]) \
    .setOutputCol("posology_ner")

posology_converter = NerConverter()\
  .setInputCols(["sentence", "tok_checked", "posology_ner"])\
  .setOutputCol("posology_ner_chunk")    

# Detect risk factors
risk_ner = MedicalNerModel.pretrained("ner_risk_factors", "en", "clinical/models") \
  .setInputCols(["sentence", "tok_checked", "embeddings"]) \
  .setOutputCol("risk_ner")

risk_converter = NerConverter()\
  .setInputCols(["sentence", "tok_checked", "risk_ner"])\
  .setOutputCol("risk_ner_chunk")    

# Detect assertion status of risk entities
risk_assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "risk_ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")

# Take clinical NER chunk, convert to doc
chunk_doc = Chunk2Doc()\
      .setInputCols("clinical_ner_chunk")\
      .setOutputCol("ner_chunk_doc")

# Get sentence-chunk Bert embeddings
sentence_chunk_embeddings = BertSentenceEmbeddings.pretrained('sbiobert_base_cased_mli', 'en','clinical/models')\
      .setInputCols(["ner_chunk_doc"])\
      .setOutputCol("sentence_embeddings")

# Sentence Entity Resolver for ICD10-CM codes
icd_cm_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10cm_augmented","en", "clinical/models") \
     .setInputCols(["clinical_ner_chunk", "sentence_embeddings"]) \
     .setOutputCol("icd10cm_code")\
     .setDistanceFunction("EUCLIDEAN")

# Sentence Entity Resolver for ICD10-PCS
icd_pcs_resolver = SentenceEntityResolverModel.pretrained("sbiobertresolve_icd10pcs","en", "clinical/models") \
     .setInputCols(["clinical_ner_chunk", "sentence_embeddings"]) \
     .setOutputCol("icd10pcs_code")\
     .setDistanceFunction("EUCLIDEAN")

# Get numeric class
label_stringIdx = StringIndexer(inputCol = "label", outputCol = "class", handleInvalid="keep")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 367.3 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical_large download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
ner_jsl download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
ner_bionlp download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]
ner_risk_factors download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]


In [None]:
full_pipeline = Pipeline(
    stages = [
        document,
        sentenceDetector,
        token,
        embeddings,
        clinical_ner,
        clinical_converter,
        jsl_ner,
        jsl_converter,
        bio_ner,
        bio_converter,
        posology_ner,
        posology_converter,
        risk_ner,
        risk_converter,
        risk_assertion,
        chunk_doc,
        sentence_chunk_embeddings,
        icd_cm_resolver,
        icd_pcs_resolver,
        label_stringIdx
    ])
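
The order of `stages` matters because each annotator reads columns written by earlier stages. As a sanity check, sketched in plain Python rather than with any Spark NLP API, the input and output column names from the definitions above can be walked in order to confirm that every input exists by the time it is needed. The tuple layout and the helper `find_missing_inputs` are illustrative, and only a subset of the stages is listed.

```python
# (stage name, input columns, output column), copied from the definitions above.
STAGES = [
    ("document", ["text"], "document"),
    ("sentenceDetector", ["document"], "sentence"),
    ("token", ["sentence"], "tok_checked"),
    ("embeddings", ["sentence", "tok_checked"], "embeddings"),
    ("clinical_ner", ["sentence", "tok_checked", "embeddings"], "clinical_ner"),
    ("clinical_converter", ["sentence", "tok_checked", "clinical_ner"], "clinical_ner_chunk"),
    ("risk_ner", ["sentence", "tok_checked", "embeddings"], "risk_ner"),
    ("risk_converter", ["sentence", "tok_checked", "risk_ner"], "risk_ner_chunk"),
    ("risk_assertion", ["sentence", "risk_ner_chunk", "embeddings"], "assertion"),
    ("chunk_doc", ["clinical_ner_chunk"], "ner_chunk_doc"),
    ("sentence_chunk_embeddings", ["ner_chunk_doc"], "sentence_embeddings"),
    ("icd_cm_resolver", ["clinical_ner_chunk", "sentence_embeddings"], "icd10cm_code"),
    ("label_stringIdx", ["label"], "class"),
]


def find_missing_inputs(stages, initial_columns):
    """Return (stage, missing columns) for stages whose inputs are not yet available."""
    available = set(initial_columns)
    problems = []
    for name, inputs, output in stages:
        missing = [col for col in inputs if col not in available]
        if missing:
            problems.append((name, missing))
        available.add(output)
    return problems


# The raw dataset provides the "text" and "label" columns.
print(find_missing_inputs(STAGES, ["text", "label"]))
```

An empty list means the ordering is consistent; moving, say, `chunk_doc` ahead of `clinical_converter` would be reported as a missing `clinical_ner_chunk` input.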

In [None]:
# Save unfit pipeline to disk
full_pipeline.save("/content/full_pipeline")

In [None]:
import shutil
shutil.make_archive('/content/full_pipeline', 'zip', '/content/full_pipeline')
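
Zipping the saved folder presumably makes it easier to download from Colab and restore later. A minimal round-trip sketch follows, using a temporary directory and a placeholder file instead of a real saved pipeline; the folder contents and the `restored_pipeline` name are made up for illustration.

```python
import os
import shutil
import tempfile

# Throwaway directory standing in for a saved pipeline folder (contents are made up).
workdir = tempfile.mkdtemp()
source_dir = os.path.join(workdir, "full_pipeline")
os.makedirs(os.path.join(source_dir, "stages"))
with open(os.path.join(source_dir, "stages", "placeholder.txt"), "w") as f:
    f.write("placeholder")

# make_archive appends the ".zip" extension itself and returns the archive path.
archive_path = shutil.make_archive(source_dir, "zip", source_dir)

# Restore the folder elsewhere, for example in a fresh session.
restored_dir = os.path.join(workdir, "restored_pipeline")
shutil.unpack_archive(archive_path, restored_dir)

print(os.path.basename(archive_path))
print(os.path.exists(os.path.join(restored_dir, "stages", "placeholder.txt")))
```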

In [None]:
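# Load unfit pipeline (saved with Pipeline.save, so it is loaded with Pipeline.load;
# PipelineModel.load is for fitted pipelines and the result could not be fit again)
full_pipeline = Pipeline.load("/content/full_pipeline")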

#### Fit Pipeline

In [None]:
# Limit dataset, add train/test differentiator
limit_train = trainDataset.limit(500).withColumn("source", lit("train"))
limit_test = testDataset.limit(500).withColumn("source", lit("test"))

# Concat dataframes
full_unionDF = limit_train.union(limit_test)

In [None]:
# Fit pipeline
full_feature_model = full_pipeline.fit(full_unionDF)

In [None]:
# Transform train and test set
full_df = full_feature_model.transform(full_unionDF)

In [None]:
full_df.show(1)

+---------------+-------------+--------------------+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+---------+--------------------+--------------------+--------------------+--------------------+-----+
|          label|  description|                text| id|            document|            sentence|         tok_checked|          embeddings|        clinical_ner|  clinical_ner_chunk|             jsl_ner|       jsl_ner_chunk|             bio_ner|       bio_ner_chunk|        posology_ner|  posology_ner_chunk|            risk_ner|risk_ner_chunk|assertion|       ner_chunk_doc| sentence_embeddings|        icd10cm_code|       icd10pcs_code|class|
+---------------+-------------+--------------------+---+--------------------+--------------------+------------

#### Create Feature Dataframe

In [None]:
# Create features
full_features = create_features(full_df, pipeline='full')

In [None]:
full_features.head()

In [None]:
# Separate train/test
full_train_features = full_features[full_features["source"] == 'train']
full_test_features = full_features[full_features["source"] == 'test']

In [None]:
# Save features
full_train_features.to_csv("/content/full_train_features.csv", index=False)
full_test_features.to_csv("/content/full_test_features.csv", index=False)

In [None]:
# Load features
full_train_features = pd.read_csv("/content/full_train_features.csv")
full_test_features = pd.read_csv("/content/full_test_features.csv")

#### Random Forest

In [None]:
# Create train and test set
X_train = full_train_features.drop(['label', 'class', 'entity_id', 'id', 'source'], axis=1)
X_test = full_test_features.drop(['label', 'class', 'entity_id', 'id', 'source'], axis=1)
y_train = full_train_features['class']
y_test = full_test_features['class']

In [None]:
# Random forest classifier
clf = skl_RandomForestClassifier(n_estimators=100)

# Fit to training set
clf.fit(X_train, y_train)

# Predict on test set
y_pred_rf = clf.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_rf))

In [None]:
feature_imp = pd.Series(clf.feature_importances_,index=X_train.columns).sort_values(ascending=False)
feature_imp[:20]

---

#### Logistic Regression

In [None]:
# Logistic regression
lr_clf = skl_LogisticRegression(max_iter=10000)

# Fit to training set
lr_model = lr_clf.fit(X_train, y_train)

# Predict on test set
y_pred_lr = lr_model.predict(X_test)

In [None]:
print(classification_report(y_test, y_pred_lr))

### No-resolver Pipeline

#### Create Pipeline

In [13]:
noresolver_pipeline = Pipeline(
    stages = [
        document,
        sentenceDetector,
        token,
        embeddings,
        clinical_ner,
        clinical_converter,
        jsl_ner,
        jsl_converter,
        bio_ner,
        bio_converter,
        posology_ner,
        posology_converter,
        risk_ner,
        risk_converter,
        risk_assertion,
        label_stringIdx
    ])

In [None]:
# Save unfit pipeline to disk
noresolver_pipeline.save("/content/noresolver_pipeline")

In [None]:
import shutil
shutil.make_archive('/content/noresolver_pipeline', 'zip', '/content/noresolver_pipeline')

'/content/noresolver_pipeline.zip'

In [None]:
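# Load unfit pipeline (saved with Pipeline.save, so it is loaded with Pipeline.load;
# PipelineModel.load is for fitted pipelines and the result could not be fit again)
noresolver_pipeline = Pipeline.load("/content/noresolver_pipeline")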

#### Fit Pipeline

In [14]:
# Limit dataset, add train/test differentiator
limit_train = trainDataset.limit(500).withColumn("source", lit("train"))
limit_test = testDataset.limit(500).withColumn("source", lit("test"))

# Concat dataframes
noresolver_unionDF = limit_train.union(limit_test)

In [15]:
# Fit pipeline
noresolver_feature_model = noresolver_pipeline.fit(noresolver_unionDF)

In [16]:
# Transform train and test set
noresolver_df = noresolver_feature_model.transform(noresolver_unionDF)

In [None]:
noresolver_df.show(1)

+---------------+-------------+--------------------+---+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------+---------+-----+
|          label|  description|                text| id|source|            document|            sentence|         tok_checked|          embeddings|        clinical_ner|  clinical_ner_chunk|             jsl_ner|       jsl_ner_chunk|             bio_ner|       bio_ner_chunk|        posology_ner|  posology_ner_chunk|            risk_ner|risk_ner_chunk|assertion|class|
+---------------+-------------+--------------------+---+------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+------------

#### Create Feature Dataframe

In [17]:
# Create features
noresolver_features = create_features(noresolver_df, pipeline='no_resolver')

In [18]:
noresolver_features.head()

Unnamed: 0,id,label,class,source,entity_PROBLEM,entity_TREATMENT,entity_TEST,entity_Multi-tissue_structure,entity_Organism,entity_Organ,entity_Simple_chemical,entity_Organism_subdivision,entity_Gene_or_gene_product,entity_Organism_substance,entity_DRUG,entity_Modifier,entity_Disease_Syndrome_Disorder,entity_Internal_organ_or_component,entity_Direction,entity_Symptom,entity_Heart_Disease,entity_Test,entity_Injury_or_Poisoning,entity_Clinical_Dept,entity_Test_Result,entity_Section_Header,entity_Gender,entity_Procedure,entity_Admission_Discharge,entity_Date,entity_External_body_part_or_region,entity_EKG_Findings,entity_Diet,entity_Medical_Device,entity_RelativeDate,entity_Drug_Ingredient,entity_PHI,entity_MEDICATION,entity_present,entity_absent,...,entity_VS_Finding,entity_Treatment,entity_Oxygen_Therapy,entity_O2_Saturation,entity_Time,entity_Psychological_Condition,entity_Substance,entity_CAD,entity_Developing_anatomical_structure,entity_Allergen,entity_Vaccine,entity_Diabetes,entity_Obesity,entity_Relationship_Status,entity_OBESE,entity_SMOKER,entity_DIABETES,entity_Temperature,entity_Cerebrovascular_Disease,entity_ImagingFindings,entity_Hypertension,entity_Communicable_Disease,entity_Social_History_Header,entity_Alcohol,entity_HYPERTENSION,entity_Imaging_Technique,entity_Fetus_NewBorn,entity_Kidney_Disease,entity_Death_Entity,entity_Hyperlipidemia,entity_Total_Cholesterol,entity_Height,entity_LDL,entity_Triglycerides,entity_HDL,entity_HYPERLIPIDEMIA,entity_Birth_Entity,entity_Sexually_Active_or_Sexual_Orientation,entity_BMI,entity_Overweight
0,0,Medical Records,1.0,train,32.0,13.0,5.0,17.0,5.0,5.0,3.0,2.0,1.0,1.0,2.0,13.0,8.0,6.0,6.0,6.0,5.0,4.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,0,51,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0,Medical Records,1.0,test,32.0,13.0,5.0,17.0,5.0,5.0,3.0,2.0,1.0,1.0,2.0,13.0,8.0,6.0,6.0,6.0,5.0,4.0,4.0,4.0,4.0,4.0,3.0,3.0,3.0,2.0,2.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,0,51,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1,Surgery,0.0,train,41.0,49.0,8.0,24.0,9.0,13.0,9.0,28.0,0.0,1.0,14.0,9.0,14.0,30.0,17.0,10.0,0.0,0.0,0.0,6.0,1.0,7.0,3.0,3.0,0.0,0.0,15.0,0.0,1.0,24.0,0.0,7.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1,Surgery,0.0,test,41.0,49.0,8.0,24.0,9.0,13.0,9.0,28.0,0.0,1.0,14.0,9.0,14.0,30.0,17.0,10.0,0.0,0.0,0.0,6.0,1.0,7.0,3.0,3.0,0.0,0.0,15.0,0.0,1.0,24.0,0.0,7.0,0.0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2,Medical Records,1.0,train,48.0,27.0,15.0,14.0,23.0,9.0,5.0,4.0,2.0,3.0,11.0,9.0,10.0,10.0,0.0,24.0,0.0,6.0,0.0,1.0,4.0,6.0,48.0,10.0,1.0,4.0,2.0,0.0,1.0,2.0,2.0,3.0,5.0,0.0,130,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [19]:
# Separate train/test
noresolver_train_features = noresolver_features[noresolver_features["source"] == 'train']
noresolver_test_features = noresolver_features[noresolver_features["source"] == 'test']

In [20]:
# Save features
noresolver_train_features.to_csv("/content/noresolver_train_features.csv", index=False)
noresolver_test_features.to_csv("/content/noresolver_test_features.csv", index=False)

In [None]:
# Load features
noresolver_train_features = pd.read_csv("/content/noresolver_train_features.csv")
noresolver_test_features = pd.read_csv("/content/noresolver_test_features.csv")

#### Random Forest

In [21]:
# Create train and test set
X_train = noresolver_train_features.drop(['label', 'class', 'entity_id', 'id', 'source'], axis=1)
X_test = noresolver_test_features.drop(['label', 'class', 'entity_id', 'id', 'source'], axis=1)
y_train = noresolver_train_features['class']
y_test = noresolver_test_features['class']

In [22]:
# Random forest classifier
clf = skl_RandomForestClassifier(n_estimators=100)

# Fit to training set
clf.fit(X_train, y_train)

# Predict on test set
y_pred_rf = clf.predict(X_test)

In [23]:
print(classification_report(y_test, y_pred_rf))

              precision    recall  f1-score   support

         0.0       0.42      0.33      0.37       198
         1.0       0.24      0.31      0.27       102
         2.0       0.19      0.18      0.18       109
         3.0       0.19      0.21      0.20        91

    accuracy                           0.27       500
   macro avg       0.26      0.26      0.26       500
weighted avg       0.29      0.27      0.28       500
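
For reading the report, it may help to recall that each per-class F1 score is the harmonic mean of that class's precision and recall. The sketch below recomputes it from the rounded values printed above; small differences from the reported column are expected because the inputs are already rounded. The helper name `f1_from_precision_recall` is illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# class -> (precision, recall, reported f1), copied from the report above.
REPORTED = {
    0: (0.42, 0.33, 0.37),
    1: (0.24, 0.31, 0.27),
    2: (0.19, 0.18, 0.18),
    3: (0.19, 0.21, 0.20),
}

for label, (precision, recall, reported_f1) in REPORTED.items():
    recomputed = f1_from_precision_recall(precision, recall)
    print(f"class {label}: recomputed {recomputed:.3f}, reported {reported_f1:.2f}")
```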



In [24]:
feature_imp = pd.Series(clf.feature_importances_,index=X_train.columns).sort_values(ascending=False)
feature_imp[:20]

entity_TREATMENT                       0.032925
entity_Medical_Device                  0.032232
entity_Organ                           0.026649
entity_Direction                       0.024935
entity_Internal_organ_or_component     0.024422
entity_Gender                          0.022633
entity_Procedure                       0.022547
entity_TEST                            0.022206
entity_Simple_chemical                 0.021396
entity_Symptom                         0.021187
entity_Multi-tissue_structure          0.021168
entity_PROBLEM                         0.020336
entity_External_body_part_or_region    0.020122
entity_Modifier                        0.019545
entity_Tissue                          0.018786
entity_Disease_Syndrome_Disorder       0.018508
entity_Test                            0.018285
entity_Organism                        0.018002
entity_DRUG                            0.017398
entity_Injury_or_Poisoning             0.017363
dtype: float64
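
scikit-learn normalizes a random forest's impurity-based importances so that they sum to 1 across all features, so adding up the values listed above gives the share of total importance carried by the top 20 features. The sketch below does that arithmetic with the printed numbers only; no model is refit.

```python
# Top-20 importances copied from the output above, in the same order.
top20 = [
    0.032925, 0.032232, 0.026649, 0.024935, 0.024422,
    0.022633, 0.022547, 0.022206, 0.021396, 0.021187,
    0.021168, 0.020336, 0.020122, 0.019545, 0.018786,
    0.018508, 0.018285, 0.018002, 0.017398, 0.017363,
]

share = sum(top20)
print(f"Top-20 share of total importance: {share:.3f}")
print(f"Largest single importance: {max(top20):.3f}")
```

The top 20 features together carry under half of the total importance and no single feature exceeds about 3%, so the importance is spread thinly over many entity counts rather than concentrated in a few.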

#### Logistic Regression

In [None]:
# Logistic regression
lr_clf = skl_LogisticRegression(max_iter=10000)

# Fit to training set
lr_model = lr_clf.fit(X_train, y_train)

# Predict on test set
y_pred_lr = lr_model.predict(X_test)

In [26]:
print(classification_report(y_test, y_pred_lr))

              precision    recall  f1-score   support

         0.0       0.51      0.47      0.49       198
         1.0       0.29      0.37      0.33       102
         2.0       0.21      0.18      0.19       109
         3.0       0.25      0.24      0.24        91

    accuracy                           0.35       500
   macro avg       0.31      0.32      0.31       500
weighted avg       0.35      0.35      0.35       500
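
To put the accuracy figures in context, the support column gives the class distribution of the 500-row test set, from which two simple reference points follow: uniform guessing over four classes, and always predicting the largest class. The sketch below only does this arithmetic on the numbers printed in the two no-resolver reports; the dictionary names are illustrative.

```python
# Support per class, copied from the reports above.
support = {0: 198, 1: 102, 2: 109, 3: 91}
total = sum(support.values())

uniform_baseline = 1 / len(support)                # guess each class equally often
majority_baseline = max(support.values()) / total  # always predict the largest class

# Accuracies reported for the no-resolver pipeline.
reported_accuracy = {"random_forest": 0.27, "logistic_regression": 0.35}

print(f"uniform baseline:  {uniform_baseline:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
for name, accuracy in reported_accuracy.items():
    print(name, accuracy, "beats majority baseline:", accuracy > majority_baseline)
```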



## Conclusion



To compare the performance of the different models and account for the class imbalance, we can focus on the weighted average of the F1 score. The top 3 performing models were:

1. Logistic Regression with Universal Sentence Encoder (0.50)
2. DL Classification with BERT Sentence Embeddings (0.45)
3. DL Classification with BioBERT Clinical Sentence Embeddings (0.45)

All models performed poorly overall, possibly because of the small sample size and the slight class imbalance.

The final step in the analysis was to create features using 5 different clinical NER models, a clinical risk assertion model and 2 clinical entity resolvers. This step was memory-intensive and slow enough that it could not be run on the full train and test sets, so the results presented are for limited (500-row) train and test sets. On these limited sets, the no-resolver pipeline reached a weighted-average F1 of 0.28 with random forest and 0.35 with logistic regression.
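
The weighted-average F1 used for the comparison weights each class's F1 by its support, whereas the macro average treats all classes equally. A minimal sketch of both, using the rounded per-class values from the two no-resolver reports above (so results can differ slightly from the printed averages); the helper name `weighted_average` is illustrative.

```python
def weighted_average(values, weights):
    """Average of `values` where each entry counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Per-class support and F1 scores, copied from the no-resolver reports.
support = [198, 102, 109, 91]
f1_random_forest = [0.37, 0.27, 0.18, 0.20]
f1_logistic_regression = [0.49, 0.33, 0.19, 0.24]

for name, f1 in [("random forest", f1_random_forest),
                 ("logistic regression", f1_logistic_regression)]:
    macro = sum(f1) / len(f1)
    weighted = weighted_average(f1, support)
    print(f"{name}: macro {macro:.3f}, weighted {weighted:.3f}")
```

The weighted figure is higher than the macro figure for both models because class 0, the largest class, also has the best F1.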