<a href="https://colab.research.google.com/github/hasanabbas21/spark-nlp/blob/main/CS777_Term_Project_Spark_NLP_Clinical_Text_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Upload the license keys for the Spark NLP Healthcare models used in this project.
These keys are valid for only one month, until 5/24.

In [1]:
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

Saving workshop_license_keys_365.json to workshop_license_keys_365 (1).json
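
Before going further, it can help to confirm that the uploaded file contains the entries this notebook reads later. The sketch below is a minimal check, assuming only `SECRET` and `JSL_VERSION` matter (they are the only keys referenced in this notebook). The helper name `missing_license_keys` and the sample values are hypothetical.

```python
import json

# Assumption: only SECRET and JSL_VERSION are read later in this notebook,
# so only those two keys are checked here.
REQUIRED_KEYS = ("SECRET", "JSL_VERSION")


def missing_license_keys(keys, required=REQUIRED_KEYS):
    """Return the required entries that are absent or empty in the license dict."""
    return [name for name in required if not keys.get(name)]


# Hypothetical stand-in for the uploaded file's content (values are fake).
sample = json.loads('{"SECRET": "xxxx", "JSL_VERSION": "3.0.1"}')
print(missing_license_keys(sample))              # []
print(missing_license_keys({"SECRET": "xxxx"}))  # ['JSL_VERSION']
```

In the notebook itself, the call would be `missing_license_keys(license_keys)`, which avoids printing the secret values.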


Check the version of Spark NLP Healthcare (spark-nlp-jsl, from John Snow Labs) recorded in the license file.

In [2]:
license_keys['JSL_VERSION']

'3.0.1'

Using the license keys, install spark-nlp-jsl into the Colab runtime. The setup script is provided on the Spark NLP getting-started page and is available on the John Snow Labs GitHub page.

Also install the Spark NLP display library (spark-nlp-display) to visualize the models' results on text (paragraph annotations).

In [4]:
%%capture
for k,v in license_keys.items(): 
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh

! pip install spark-nlp-display

Organize all the necessary imports.
We will use:
1. pyspark ML
2. sparknlp
3. sparknlp_jsl

In [5]:
import json
import os
from pyspark.ml import Pipeline,PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

1. Initialize the spark-nlp-jsl session from the license keys
2. Start the session with the configuration
3. Verify the versions of Spark NLP and Spark NLP JSL

In [9]:
params = {
    "spark.driver.memory":"16G",
    "spark.kryoserializer.buffer.max":"2000M",
    "spark.driver.maxResultSize":"2000M"
      }

spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

Spark NLP Version : 3.0.1
Spark NLP_JSL Version : 3.0.1


We will first run pre-trained clinical models on clinical text and observe the inferences.
Then, we will train our own clinical model and run a test.
Three separate clinical use cases are explored in this project:


1.   **Named Entity Recognition** (problem, treatment and test classifier)
2.   **Posology** (medication strength, frequency, duration classifier)
3.   **Clinical Assertion** (presence of a problem)


We will use pipelines throughout to chain the clinical models, feeding the results of one into the next.

The training and test data are in CoNLL format, an NLP format used to train models.

To train a Named Entity Recognition DL annotator, we need the "CoNLL format" data as a Spark dataframe.
The text to process should be in a column of the dataframe (e.g. the **text** column in the assembler below).

ref: https://nlp.johnsnowlabs.com/docs/en/training
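
To make the CoNLL layout concrete, here is a minimal pure-Python sketch that groups CoNLL-style lines into sentences of (token, tag) pairs. It assumes the common four-column layout (token, POS, chunk, NER tag) with blank lines between sentences. The sample lines are hand-written for illustration and are not taken from any data set. Within Spark NLP, the `CoNLL().readDataset(spark, path)` helper from `sparknlp.training` is the usual way to load such a file into a dataframe.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (token, ner_tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers end the current sentence.
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
            continue
        columns = line.split()
        # First column is the token, last column is the NER tag.
        current.append((columns[0], columns[-1]))
    if current:
        sentences.append(current)
    return sentences


# Hand-written sample; the POS and chunk columns are placeholders.
sample = """-DOCSTART- -X- -X- O

He NN NN O
denies NN NN O
fever NN NN B-PROBLEM
. NN NN O

Aspirin NN NN B-TREATMENT
daily NN NN O
""".splitlines()

for sentence in read_conll(sample):
    print(sentence)
```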


In [10]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
assembler = DocumentAssembler().setInputCol("text").setOutputCol("document")

The assembler converts data from the **text** column into **document** annotations for Spark NLP to process.
ref: https://nlp.johnsnowlabs.com/docs/en/transformers#documentassembler-getting-data-in

In [11]:
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 363.9 KB
[OK!]


ref: https://nlp.johnsnowlabs.com/docs/en/annotators#sentencedetector

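The SentenceDetector annotator referenced above finds sentence bounds in raw text by applying rules from Pragmatic Segmenter. The cell above uses SentenceDetectorDLModel, a pretrained deep-learning model, for the same task.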


In [12]:
tokenizer = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")

The Tokenizer identifies tokens using open tokenization standards.

ref: https://nlp.johnsnowlabs.com/docs/en/annotators#tokenizer

In [13]:
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("embeddings")

embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]


The WordEmbeddings model:
1. takes the sentences and tokens in the text and returns an embedding (vector) for every token
2. is the pretrained "embeddings_clinical" model, trained on a PubMed dataset
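
As a rough picture of what "one vector per token" means, the sketch below performs a toy embedding lookup in plain Python. The three-dimensional vectors are made up, and the lowercasing and zero-vector fallback for unknown tokens are simplifying choices for this sketch, not a description of how `embeddings_clinical` behaves.

```python
def embed_tokens(tokens, vectors, dim):
    """Return one vector per token; unknown tokens get a zero vector."""
    zero = [0.0] * dim
    return [vectors.get(token.lower(), zero) for token in tokens]


# Made-up 3-dimensional vectors, for illustration only.
toy_vectors = {"fever": [0.9, 0.1, 0.0], "chills": [0.8, 0.2, 0.1]}

result = embed_tokens(["Fever", "and", "chills"], toy_vectors, dim=3)
for token, vector in zip(["Fever", "and", "chills"], result):
    print(token, vector)
```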

In [65]:
medical_ner = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")

posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner")

ner_clinical_large download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


In [68]:
posology_ner_greedy = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_greedy")

ner_posology_greedy download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


In [69]:
ner_converter = NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

ner_converter_greedy = NerConverter()\
    .setInputCols(["sentence","token","ner_greedy"])\
    .setOutputCol("ner_chunk_greedy")

In [56]:
medical_ner.getClasses()

['O',
 'B-TREATMENT',
 'I-TREATMENT',
 'B-PROBLEM',
 'I-PROBLEM',
 'B-TEST',
 'I-TEST']

In [55]:
posology_ner.getClasses()

['O',
 'B-DOSAGE',
 'B-STRENGTH',
 'I-STRENGTH',
 'B-ROUTE',
 'B-FREQUENCY',
 'I-FREQUENCY',
 'B-DRUG',
 'I-DRUG',
 'B-FORM',
 'I-DOSAGE',
 'B-DURATION',
 'I-DURATION',
 'I-FORM',
 'I-ROUTE']

**MedicalNerModel** is a pretrained model that recognizes named entities in clinical text passed as input. The two models loaded above cover the clinical classes and the posology (dosage) classes listed above.

It takes word embeddings as **input**
and provides NER (Named Entity Recognition) tags as **output**.

Further, a **NerConverter** merges consecutive tagged tokens (a B- tag followed by its I- tags) into entity chunks.
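
The idea of turning B-/I- tags into chunks can be sketched in a few lines of plain Python. This is an illustration of the IOB convention only, not NerConverter's actual implementation, and the tags below are written by hand rather than produced by a model.

```python
def iob_to_chunks(tokens, tags):
    """Merge B-/I- tagged tokens into (entity_label, chunk_text) pairs."""
    chunks, words, label = [], [], None
    for token, tag in zip(tokens, tags):
        prefix, _, entity = tag.partition("-")
        # An I- tag with the same label extends the chunk that is open.
        if prefix == "I" and entity == label:
            words.append(token)
            continue
        # Any other tag closes the chunk that is currently open.
        if words:
            chunks.append((label, " ".join(words)))
        if prefix in ("B", "I"):
            words, label = [token], entity
        else:
            words, label = [], None
    if words:
        chunks.append((label, " ".join(words)))
    return chunks


# Hand-written tags using labels from the posology class list above.
tokens = ["Warfarin", "2.5", "mg", "q.d", "."]
tags = ["B-DRUG", "B-STRENGTH", "I-STRENGTH", "B-FREQUENCY", "O"]
print(iob_to_chunks(tokens, tags))
# [('DRUG', 'Warfarin'), ('STRENGTH', '2.5 mg'), ('FREQUENCY', 'q.d')]
```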




In [79]:
clinical_pipeline = Pipeline(stages=[
                          assembler,
                          sentenceDetector,
                          tokenizer,
                          word_embeddings,
                          medical_ner,
                          ner_converter])

Finally, the pipeline is constructed as a sequence of pretrained models. Each successive stage takes the output of the previous steps as input and passes its own output to the next stage.
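
Because each stage reads columns written by earlier stages, a mismatch in column names is an easy mistake to make. The sketch below checks that wiring in plain Python, with the column names copied from the cells above. The helper `check_wiring` and the tuple layout are hypothetical and are not part of Spark NLP.

```python
def check_wiring(stages, available=("text",)):
    """Verify each stage's input columns are produced by an earlier stage.

    `stages` is a list of (name, input_columns, output_column) tuples.
    Returns a list of (stage_name, missing_column) problems.
    """
    produced = set(available)
    problems = []
    for name, inputs, output in stages:
        for column in inputs:
            if column not in produced:
                problems.append((name, column))
        produced.add(output)
    return problems


# Input and output columns as configured in the cells above.
clinical_stages = [
    ("assembler", ["text"], "document"),
    ("sentenceDetector", ["document"], "sentence"),
    ("tokenizer", ["sentence"], "token"),
    ("word_embeddings", ["sentence", "token"], "embeddings"),
    ("medical_ner", ["sentence", "token", "embeddings"], "ner"),
    ("ner_converter", ["sentence", "token", "ner"], "ner_chunk"),
]
print(check_wiring(clinical_stages))  # []

# Leaving out the tokenizer makes every later stage report a missing "token" column.
broken = [stage for stage in clinical_stages if stage[0] != "tokenizer"]
print(check_wiring(broken))
```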

In [74]:
nlp_df = spark.createDataFrame([[""]]).toDF("text")


As mentioned above, training requires the data to be in the text column of a dataframe. Here, since we are using pretrained models, there is nothing to train and we only need to transform. Hence, we fit the pipeline on a dummy nlp_df that holds a single empty text row.

In [81]:
clinical_nlp_model = clinical_pipeline.fit(nlp_df)

In [82]:
clinical_nlp_model.stages

[DocumentAssembler_3dcf3b305df8,
 SentenceDetectorDLModel_d2546f0acfe2,
 REGEX_TOKENIZER_6e72e8efc412,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NerConverter_de9392b034cc]

The stages are exactly as described before

In [26]:
!wget -q https://storage.googleapis.com/mirza-cs777-bucket/term_project/mtsamples.csv

Source: https://www.kaggle.com/tboyle10/medicaltranscriptions

I have saved a copy of a medical transcription data set at a public Google storage URL; the cell above retrieves it for today's work.

In [27]:
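import pyspark.sql.functions as F

# NOTE: pubmed_sample_text_small.csv is not downloaded anywhere in this notebook and
# pubmed_df is not used below; this read fails if the file is absent from the runtime.
pubmed_df = spark.read.option("header", "true").csv("pubmed_sample_text_small.csv")
# Medical transcriptions downloaded above with wget
med_transcript_data = spark.read.option("header", "true").csv("mtsamples.csv")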

In [28]:
med_transcript_data.show(2)

+---+--------------------+--------------------+--------------------+--------------------+--------------------+
|_c0|         description|   medical_specialty|         sample_name|       transcription|            keywords|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+
|  0| A 23-year-old wh...| Allergy / Immuno...|  Allergic Rhinitis |SUBJECTIVE:,  Thi...|allergy / immunol...|
|  1| Consult for lapa...|          Bariatrics| Laparoscopic Gas...|PAST MEDICAL HIST...|bariatrics, lapar...|
+---+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows



In [29]:
med_transcript_data.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- description: string (nullable = true)
 |-- medical_specialty: string (nullable = true)
 |-- sample_name: string (nullable = true)
 |-- transcription: string (nullable = true)
 |-- keywords: string (nullable = true)



In [37]:
just_transcript = med_transcript_data.select("transcription")

In [43]:
just_transcript = just_transcript.withColumnRenamed("transcription","text")
just_transcript.show(1)

+--------------------+
|                text|
+--------------------+
|SUBJECTIVE:,  Thi...|
+--------------------+
only showing top 1 row



In [83]:
pred = clinical_nlp_model.transform(just_transcript.limit(10))

In [45]:
pred.show(2)

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|SUBJECTIVE:,  Thi...|[{document, 0, 13...|[{document, 0, 80...|[{token, 0, 9, SU...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 71, 79, ...|
|PAST MEDICAL HIST...|[{document, 0, 24...|[{document, 0, 15...|[{token, 0, 3, PA...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 30, 55, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows



In [48]:
pred.select('token.result','ner.result').show(truncate=150)

+------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                result|                                                                                                                                                result|
+------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------+
|[SUBJECTIVE, :,, This, 23-year-old, white, female, presents, with, complaint, of, aller

Let's annotate using a light pipeline.
The text below was taken at random from the medical transcription data loaded above.

In [49]:
text = '''
MGUS.  His bone marrow biopsy showed a normal cellular bone marrow; however, there were 10% plasma cells and we proceeded with the workup for a plasma cell dyscrasia.  All his tests came back as consistent with an MGUS.", Hematology - Oncology, MGUS Followup ,"CHIEF COMPLAINT: , MGUS.,HISTORY OF PRESENT ILLNESS:,  This is an extremely pleasant 86-year-old gentleman, who I follow for his MGUS.  I initially saw him for thrombocytopenia when his ANC was 1300.  A bone marrow biopsy was obtained.  Interestingly enough, at the time of his bone marrow biopsy, his hemoglobin was 13.0 and his white blood cell count was 6.5 with a platelet count of 484,000.  His bone marrow biopsy showed a normal cellular bone marrow; however, there were 10% plasma cells and we proceeded with the workup for a plasma cell dyscrasia.  All his tests came back as consistent with an MGUS.,Overall, he is doing well.  Since I last saw him, he tells me that he has had onset of atrial fibrillation.  He has now started going to the gym two times per week, and has lost over 10 pounds.  He has a good energy level and his ECOG performance status is 0.  He denies any fever, chills, or night sweats.  No lymphadenopathy.  No nausea or vomiting.  No change in bowel or bladder habits.,CURRENT MEDICATIONS: , Multivitamin q.d., aspirin one tablet q.d., Lupron q. three months, Flomax  0.4 mg q.d., and Warfarin 2.5 mg q.d.,ALLERGIES:  ,No known drug allergies.,REVIEW OF SYSTEMS: , As per the HPI, otherwise negative.,PAST MEDICAL HISTORY:,1.  He is status post left inguinal hernia repair.,2.  Prostate cancer diagnosed in December 2004, which was a Gleason 3+4.  He is now receiving Lupron.,SOCIAL HISTORY: , He has a very remote history of tobacco use.  He has one to two alcoholic drinks per day.  He is married.,FAMILY HISTORY: , His brother had prostate cancer.,PHYSICAL EXAM:,VIT:",
'''

In [84]:
clinical_nlp_model_light = LightPipeline(clinical_nlp_model)

pred_light = clinical_nlp_model_light.fullAnnotate(text)
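
`fullAnnotate` returns one dictionary per input text, keyed by output column, whose values are lists of annotation objects with `begin`, `end`, `result` and `metadata` fields. The sketch below flattens the chunk annotations into plain tuples. To keep it runnable without Spark, it uses a hypothetical `FakeAnnotation` stand-in, and the sample chunks, labels and offsets are invented rather than real model output.

```python
from collections import namedtuple

# Stand-in for Spark NLP's annotation objects, keeping only the fields used here.
FakeAnnotation = namedtuple("FakeAnnotation", "begin end result metadata")


def chunk_rows(annotations):
    """Turn chunk annotations into (text, entity, begin, end) tuples."""
    return [(a.result, a.metadata.get("entity"), a.begin, a.end) for a in annotations]


# Invented examples; labels and offsets are not real model output.
fake_chunks = [
    FakeAnnotation(0, 3, "MGUS", {"entity": "PROBLEM"}),
    FakeAnnotation(11, 28, "bone marrow biopsy", {"entity": "TEST"}),
]
for row in chunk_rows(fake_chunks):
    print(row)
```

In the notebook, the equivalent call would be `chunk_rows(pred_light[0]['ner_chunk'])`.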

Using spark-nlp-display, we will visualize the results of the text processing done by the clinical pipeline.

In [85]:
from sparknlp_display import NerVisualizer

vis = NerVisualizer()

vis.display(pred_light[0], label_col='ner_chunk', document_col='document')

Now we repeat the same for a posology pipeline.

In [91]:
posology_pipeline = Pipeline(stages=[
                          assembler,
                          sentenceDetector,
                          tokenizer,
                          word_embeddings,
                          posology_ner,
                          posology_ner_greedy,
                          ner_converter,
                          ner_converter_greedy])

In [92]:
posology_nlp_model = posology_pipeline.fit(nlp_df)

In [98]:
posology_nlp_model_light = LightPipeline(posology_nlp_model)

posology_pred_light = posology_nlp_model_light.fullAnnotate(text)

In [99]:
vis = NerVisualizer()

vis.display(posology_pred_light[0], label_col='ner_chunk', document_col='document')

In [100]:
vis.display(posology_pred_light[0], label_col='ner_chunk_greedy', document_col='document')