**Study Date: 10 May 2021**

# **Use Case For Deloitte UK**

* Extract Clinical Entities From the Text
* Get Assertion Status of Symptom Entities
* Extract Posology Entities From the Text
* Extract Relations Between Clinical Entities
* Create a Clinical Spell Checker Pipeline

**Setting up environment by uploading license key**

In [None]:
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

In [None]:
%%capture
for k,v in license_keys.items(): 
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh

! pip install spark-nlp-display
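
Outside Colab, `files.upload()` is not available. The sketch below performs the same step from a local file, assuming the license is a flat JSON object of names and values (as the `license_keys.items()` loop above implies). The function name `load_license_keys` and the file path in the usage comment are hypothetical.

```python
import json
import os


def load_license_keys(path):
    """Read a license JSON file and export every entry as an environment variable."""
    with open(path) as f:
        keys = json.load(f)
    for name, value in keys.items():
        # Same effect as `%set_env $k=$v` in the notebook cell above.
        os.environ[name] = str(value)
    return keys


# Usage (hypothetical path):
# license_keys = load_license_keys("spark_nlp_for_healthcare.json")
```

The returned dictionary can then be used exactly like `license_keys` in the cells that follow.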

**Import Libraries**

In [None]:
import json
import os
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F

import pandas as pd

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np

**Start Spark Session**

In [None]:
params = {"spark.driver.memory":"16G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"2000M"}


spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.0.3
Spark NLP_JSL Version : 3.0.3


**Preparing Sample Data**

We have three sample texts in a list. We load them into a Spark dataframe and give each one a patient id.

In [None]:
df = spark.createDataFrame([
               (1, """Find upper GI endoscopy report on this patient from late 2016 and blood test results. I would be grateful if he could be seen to exclude GI malignancy. Adam has been feeling unwell for the past 2 weeks with abdominal pain, lethargy and concern from his family that he looks jaundiced. From his original referral, he has a long history of alcohol abuse, drinking more than 30 beers a week for more than 10 years. He has been alcohol free for the past several months now. In view of the drop in haemoglobin to 80 and his deranged U's & E's I would welcome an urgent opinion to exclude GI malignancy. Thank you for your help. Yours faithfully"""),
               (2, """Dear Doctor, Thanks for seeing Lily. She initially had heartburn in the form of indigestion post mealtime and then this developed into reflux and then dysphagia. She has been suffering from this for about 2 months and isn't able to take in any solids anymore. She used to be able to swallow these with some liquid but it is hard for her to even swallow water now. She is passing less urine than before. This is significantly affecting her work as a lab assistant. HP stool was negative and blood tests were fine. PPI tablets are causing regurgitation so she is taking the oral dispersible type. Ranitidine has no effect. There is a low risk of malignancy from my point of view but her symptoms are getting worse day by day. She has lost 30 kg and although she has been asked to lose weight, this method is not healthy. I would be grateful if she could be seen as soon as possible. Dr M. Christie GP."""),
               (3, """I would be grateful if you could see this 79 y/o woman for a non-urgent OGD. She appears to have been suffering intermittently from dyspepsia and epigastric pain for the course of the last 3 months. She is keen to get it checked. Her pain relief comes in the form of bicarbonate of soda/ gaviscon but it is recently getting worse and she has had 2 episodes of dark vomit. She does not smoke and drinks only occasionally. H Pylori faecal tests have been requested and omeprazole has been prescribed. No weight loss or dysphagia has been reported but due to her age and symptoms, an OGD would provide good reassurance. Best regards, Dr Smith""")
               ]).toDF("patient","text")

In [None]:
df.show(truncate=False)

+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|patient|text                                                                             

# **1. Use Case: Extract Clinical Entities From The Text**

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP 
tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("embeddings")

# Pretrained clinical NER model; its entity classes are listed further below
clinical_ner_jsl = MedicalNerModel.pretrained("jsl_ner_wip_clinical","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("jsl_ner")

ner_converter_jsl = NerConverter()\
        .setInputCols(["sentence","token","jsl_ner"])\
        .setOutputCol("ner_chunk_jsl")
       #.setWhiteList(["SYMPTOM"])  #filter entities

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner_jsl,
        ner_converter_jsl])

empty_data = spark.createDataFrame([[""]]).toDF("text")

jsl_model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 363.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
jsl_ner_wip_clinical download started this may take some time.
Approximate size to download 14.5 MB
[OK!]


- **Let's take a look at the model stages**

In [None]:
jsl_model.stages

[DocumentAssembler_a8c16b5d0b38,
 SentenceDetectorDLModel_d2546f0acfe2,
 REGEX_TOKENIZER_4a163b9898ae,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_0a2cb03c3989,
 NerConverter_d6c49056b9e2]

- **We have used a pretrained NER model named `jsl_ner_wip_clinical` in the pipeline. It labels each token with one of the following classes:**

In [None]:
sorted(clinical_ner_jsl.getClasses())

['B-Admission_Discharge',
 'B-Age',
 'B-Alcohol',
 'B-Allergen',
 'B-BMI',
 'B-Birth_Entity',
 'B-Blood_Pressure',
 'B-Cerebrovascular_Disease',
 'B-Clinical_Dept',
 'B-Communicable_Disease',
 'B-Date',
 'B-Death_Entity',
 'B-Diabetes',
 'B-Diet',
 'B-Direction',
 'B-Disease_Syndrome_Disorder',
 'B-Dosage',
 'B-Drug_BrandName',
 'B-Drug_Ingredient',
 'B-Duration',
 'B-EKG_Findings',
 'B-Employment',
 'B-External_body_part_or_region',
 'B-Family_History_Header',
 'B-Fetus_NewBorn',
 'B-Form',
 'B-Frequency',
 'B-Gender',
 'B-HDL',
 'B-Heart_Disease',
 'B-Height',
 'B-Hyperlipidemia',
 'B-Hypertension',
 'B-ImagingFindings',
 'B-Imaging_Technique',
 'B-Injury_or_Poisoning',
 'B-Internal_organ_or_component',
 'B-Kidney_Disease',
 'B-LDL',
 'B-Labour_Delivery',
 'B-Medical_Device',
 'B-Medical_History_Header',
 'B-Modifier',
 'B-O2_Saturation',
 'B-Obesity',
 'B-Oncological',
 'B-Overweight',
 'B-Oxygen_Therapy',
 'B-Pregnancy',
 'B-Procedure',
 'B-Psychological_Condition',
 'B-Pulse',
 'B

- **Let's check the parameters of our `clinical_ner_jsl` annotator**

In [None]:
clinical_ner_jsl.extractParamMap()

{Param(parent='MedicalNerModel_0a2cb03c3989', name='batchSize', doc='Size of every batch'): 8,
 Param(parent='MedicalNerModel_0a2cb03c3989', name='classes', doc='get the tags used to trained this MedicalNerModel'): ['O',
  'B-Injury_or_Poisoning',
  'B-Direction',
  'B-Test',
  'I-Route',
  'B-Admission_Discharge',
  'B-Death_Entity',
  'I-Oxygen_Therapy',
  'I-Drug_BrandName',
  'B-Relationship_Status',
  'B-Duration',
  'I-Alcohol',
  'I-Triglycerides',
  'I-Date',
  'B-Respiration',
  'B-Hyperlipidemia',
  'I-Test',
  'B-Birth_Entity',
  'I-VS_Finding',
  'B-Age',
  'I-Social_History_Header',
  'B-Labour_Delivery',
  'I-Medical_Device',
  'B-Family_History_Header',
  'B-BMI',
  'I-Fetus_NewBorn',
  'I-BMI',
  'B-Temperature',
  'I-Section_Header',
  'I-Communicable_Disease',
  'I-ImagingFindings',
  'I-Psychological_Condition',
  'I-Obesity',
  'I-Sexually_Active_or_Sexual_Orientation',
  'I-Modifier',
  'B-Alcohol',
  'I-Temperature',
  'I-Vaccine',
  'I-Symptom',
  'B-Kidney_Disea

- **After creating the pipeline, we transform the dataframe with the fitted model.**

In [None]:
jsl_result = jsl_model.transform(df)

In [None]:
jsl_result.show()

+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|patient|                text|            document|            sentence|               token|          embeddings|             jsl_ner|       ner_chunk_jsl|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      1|Find upper GI end...|[{document, 0, 63...|[{document, 0, 84...|[{token, 0, 3, Fi...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 5, 22, u...|
|      2|Dear Doctor, Than...|[{document, 0, 89...|[{document, 0, 35...|[{token, 0, 3, De...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 5, 10, D...|
|      3|I would be gratef...|[{document, 0, 63...|[{document, 0, 75...|[{token, 0, 0, I,...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 42, 47, ...|
+-------+--------------------+--------------------+-------

- **Show the labels of the tokens**

In [None]:
jsl_result_df = jsl_result.select(F.explode(F.arrays_zip('token.result', 'jsl_ner.result')).alias("cols")) \
                          .select(F.expr("cols['0']").alias("token"),
                                  F.expr("cols['1']").alias("ner_label"))

jsl_result_df.show(50, truncate=100)

+----------+-------------+
|     token|    ner_label|
+----------+-------------+
|      Find|            O|
|     upper|  B-Procedure|
|        GI|  I-Procedure|
| endoscopy|  I-Procedure|
|    report|            O|
|        on|            O|
|      this|            O|
|   patient|            O|
|      from|            O|
|      late|            O|
|      2016|            O|
|       and|            O|
|     blood|       B-Test|
|      test|       I-Test|
|   results|            O|
|         .|            O|
|         I|            O|
|     would|            O|
|        be|            O|
|  grateful|            O|
|        if|            O|
|        he|     B-Gender|
|     could|            O|
|        be|            O|
|      seen|            O|
|        to|            O|
|   exclude|            O|
|        GI|B-Oncological|
|malignancy|I-Oncological|
|         .|            O|
|      Adam|            O|
|       has|            O|
|      been|            O|
|   feeling|            O|
|

- **Count the labels of the tokens** 

In [None]:
jsl_result_df.select("token", "ner_label").groupby("ner_label").count().orderBy('count', ascending=False).show(truncate=False)

+-----------------+-----+
|ner_label        |count|
+-----------------+-----+
|O                |307  |
|B-Gender         |25   |
|B-Symptom        |19   |
|I-Duration       |15   |
|I-Symptom        |10   |
|I-RelativeDate   |7    |
|I-Test           |6    |
|B-Drug_Ingredient|5    |
|B-Alcohol        |5    |
|B-Test           |5    |
|B-Duration       |4    |
|B-Oncological    |3    |
|B-Test_Result    |3    |
|I-Frequency      |3    |
|I-Procedure      |3    |
|B-Frequency      |2    |
|B-Procedure      |2    |
|I-Oncological    |2    |
|B-Employment     |2    |
|B-Smoking        |1    |
+-----------------+-----+
only showing top 20 rows
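
The `B-` and `I-` prefixes in the labels above mark where an entity begins and continues, and the `NerConverter` stage merges them into chunks. The sketch below shows that grouping idea in plain Python. It is an illustration, not the library's implementation: the function name `bio_to_chunks` is hypothetical, and tokens are joined with a single space instead of using character offsets.

```python
def bio_to_chunks(tokens, labels):
    """Group BIO-tagged tokens into (chunk_text, entity) pairs."""
    chunks = []
    current_tokens, current_entity = [], None
    for token, label in zip(tokens, labels):
        prefix, _, entity = label.partition("-")
        if prefix == "B" or (prefix == "I" and entity != current_entity):
            # A new chunk starts: flush the one being built, if any.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_entity))
            current_tokens, current_entity = [token], entity
        elif prefix == "I":
            # Continuation of the chunk currently being built.
            current_tokens.append(token)
        else:
            # "O": the token is outside any entity, so close the open chunk.
            if current_tokens:
                chunks.append((" ".join(current_tokens), current_entity))
            current_tokens, current_entity = [], None
    if current_tokens:
        chunks.append((" ".join(current_tokens), current_entity))
    return chunks


# Tokens and labels copied from the token-level output shown earlier.
tokens = ["Find", "upper", "GI", "endoscopy", "report", "and", "blood", "test", "results"]
labels = ["O", "B-Procedure", "I-Procedure", "I-Procedure", "O", "O", "B-Test", "I-Test", "O"]
print(bio_to_chunks(tokens, labels))
# [('upper GI endoscopy', 'Procedure'), ('blood test', 'Test')]
```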



- **List the entities with their labels**

In [None]:
jsl_result.select("patient", F.explode(F.arrays_zip('ner_chunk_jsl.result', 'ner_chunk_jsl.metadata')).alias("cols")) \
          .select("patient", F.expr("cols['0']").alias("chunk"),
                             F.expr("cols['1']['entity']").alias("ner_label")).show(50, truncate=False)

+-------+---------------------------+---------------+
|patient|chunk                      |ner_label      |
+-------+---------------------------+---------------+
|1      |upper GI endoscopy         |Procedure      |
|1      |blood test                 |Test           |
|1      |he                         |Gender         |
|1      |GI malignancy              |Oncological    |
|1      |unwell                     |Symptom        |
|1      |for the past 2 weeks       |Duration       |
|1      |abdominal pain             |Symptom        |
|1      |lethargy                   |Symptom        |
|1      |his                        |Gender         |
|1      |he                         |Gender         |
|1      |jaundiced                  |Symptom        |
|1      |his                        |Gender         |
|1      |he                         |Gender         |
|1      |alcohol                    |Alcohol        |
|1      |drinking                   |Alcohol        |
|1      |beers              
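
The `F.arrays_zip` plus `F.explode` pattern used above pairs up the i-th elements of several array columns and then emits one row per pair. The sketch below is a plain-Python analogue meant only to make the row-shaping visible. The helper name `zip_and_explode` is hypothetical, and the metadata dictionaries are reduced to the `entity` key.

```python
from itertools import zip_longest


def zip_and_explode(row_id, *arrays):
    """Yield (row_id, a[i], b[i], ...) for each position i across the arrays."""
    # zip_longest pads shorter arrays with None, mirroring how arrays_zip
    # fills missing positions with null.
    for values in zip_longest(*arrays):
        yield (row_id, *values)


# Values taken from the first rows of the table above; the real metadata
# maps hold more keys than "entity".
chunks = ["upper GI endoscopy", "blood test", "he"]
metadata = [{"entity": "Procedure"}, {"entity": "Test"}, {"entity": "Gender"}]

rows = [
    (patient, chunk, meta["entity"])
    for patient, chunk, meta in zip_and_explode(1, chunks, metadata)
]
print(rows)
# [(1, 'upper GI endoscopy', 'Procedure'), (1, 'blood test', 'Test'), (1, 'he', 'Gender')]
```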

### **Show the entities on the raw text**

- **We can show the entities by using the `sparknlp_display` library with a `LightPipeline`.**

In [None]:
from sparknlp_display import NerVisualizer

light_model = LightPipeline(jsl_model)

for index, row in enumerate(df.select("text").collect()):

    print("\n", "*"*50, f'Sample Text {index+1}', "*"*50, "\n")
    
    # collect() returns Row objects, so pass the text field itself
    light_result = light_model.fullAnnotate(row["text"])
    visualiser = NerVisualizer()

    # change color of an entity label
    visualiser.set_label_colors({'SYMPTOM':'#eb0033'})
    
    visualiser.display(light_result[0], 
                       label_col='ner_chunk_jsl', 
                       document_col='document')


 ************************************************** Sample Text 1 ************************************************** 




 ************************************************** Sample Text 2 ************************************************** 




 ************************************************** Sample Text 3 ************************************************** 



# **2. Use Case: Get Assertion Status of Symptom Entities**

- **We will create a new pipeline to get the assertion status of symptom entities. As the NER model, we use `jsl_ner_wip_greedy_clinical`.**

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")
 
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models") \
    .setInputCols(["document"]) \
    .setOutputCol("sentence")
 
tokenizer = Tokenizer()\
    .setInputCols(["sentence"])\
    .setOutputCol("token")
 
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentence", "token"])\
    .setOutputCol("embeddings")
 
# Clinical Terminology
jsl_ner = MedicalNerModel.pretrained("jsl_ner_wip_greedy_clinical", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("jsl_ner")
 
jsl_ner_converter = NerConverter() \
    .setInputCols(["sentence", "token", "jsl_ner"]) \
    .setOutputCol("jsl_ner_chunk")\
    .setWhiteList(["Symptom"]) # get only symptoms
 
assertion = AssertionDLModel.pretrained("assertion_dl", "en", "clinical/models") \
    .setInputCols(["sentence", "jsl_ner_chunk", "embeddings"]) \
    .setOutputCol("assertion")
 
 
assertion_pipeline = Pipeline(
    stages = [
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        jsl_ner,
        jsl_ner_converter,
        assertion
    ])
empty_data = spark.createDataFrame([['']]).toDF("text")
assertion_model = assertion_pipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 363.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
jsl_ner_wip_greedy_clinical download started this may take some time.
Approximate size to download 14.5 MB
[OK!]
assertion_dl download started this may take some time.
Approximate size to download 1.3 MB
[OK!]


In [None]:
assertion_res = assertion_model.transform(df)

- **Let's create a pandas dataframe that contains the assertion status of the entities**

In [None]:
assertion_df = assertion_res.select("patient", F.explode(F.arrays_zip('jsl_ner_chunk.result', 'jsl_ner_chunk.metadata', 'assertion.result')).alias("cols"))\
                            .select("patient", F.expr("cols['0']").alias("chunk"),
                                               F.expr("cols['1']['entity']").alias("entity"),
                                               F.expr("cols['2']").alias("assertion")).toPandas()

In [None]:
assertion_df

   patient           chunk   entity  assertion
0        1  feeling unwell  Symptom    present
1        1  abdominal pain  Symptom    present
2        1        lethargy  Symptom    present
3        1       jaundiced  Symptom    present
4        1        deranged  Symptom    present
5        2       heartburn  Symptom    present
6        2     indigestion  Symptom    present
7        2          reflux  Symptom    present
8        2       dysphagia  Symptom    present
9        2   regurgitation  Symptom    present


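- **As you can see, each symptom entity now carries an assertion status. The rows displayed above are all `present`; negated mentions, such as "No weight loss or dysphagia" in the third letter, are the kind the model labels `absent`.**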
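
To summarise the statuses per patient, a simple count is enough. The sketch below uses only the rows displayed in the table above, copied by hand into a list, so it is an illustration of the aggregation and not a replacement for working on `assertion_df` directly.

```python
from collections import Counter

# (patient, chunk, assertion) rows copied from the table above.
rows = [
    (1, "feeling unwell", "present"),
    (1, "abdominal pain", "present"),
    (1, "lethargy", "present"),
    (1, "jaundiced", "present"),
    (1, "deranged", "present"),
    (2, "heartburn", "present"),
    (2, "indigestion", "present"),
    (2, "reflux", "present"),
    (2, "dysphagia", "present"),
    (2, "regurgitation", "present"),
]

# Count how many symptoms each patient has per assertion status.
status_counts = Counter((patient, assertion) for patient, _, assertion in rows)

for (patient, assertion), n in sorted(status_counts.items()):
    print(f"patient {patient}: {n} {assertion}")
# patient 1: 5 present
# patient 2: 5 present
```

On the pandas dataframe itself, `assertion_df.groupby(["patient", "assertion"]).size()` gives the same kind of summary.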

# **3. Use Case: Extract Posology Entities From the Text**

- **We will create a pipeline that uses the `ner_posology` model to extract `DRUG`, `STRENGTH`, `DURATION`, `FREQUENCY`, `FORM`, `DOSAGE` and `ROUTE` entities.**

In [None]:

# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP 
tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

# Clinical word embeddings trained on PubMED dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
posology_ner = MedicalNerModel.pretrained("ner_posology", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_posology")

ner_converter_pos = NerConverter()\
    .setInputCols(["sentence","token","ner_posology"])\
    .setOutputCol("ner_chunk")

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 363.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_posology download started this may take some time.
Approximate size to download 13.8 MB
[OK!]


- **We can also extract the posology entities in greedy form by using the `ner_posology_greedy` model. It differs from `ner_posology` in that it chunks together drugs, dosage, form, strength, and route when they appear together, resulting in a bigger chunk.**

In [None]:
posology_ner_greedy = MedicalNerModel.pretrained("ner_posology_greedy", "en", "clinical/models") \
    .setInputCols(["sentence", "token", "embeddings"]) \
    .setOutputCol("ner_posology_greedy")

ner_converter_greedy = NerConverter()\
    .setInputCols(["sentence","token","ner_posology_greedy"])\
    .setOutputCol("ner_chunk_greedy")


ner_posology_greedy download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


- **We can use these two different posology models in the same pipeline.**

In [None]:
posPipeline = Pipeline(stages=[
    documentAssembler, 
    sentenceDetector,
    tokenizer,
    word_embeddings,
    posology_ner,
    ner_converter_pos,
    posology_ner_greedy,
    ner_converter_greedy])

empty_data = spark.createDataFrame([[""]]).toDF("text")

posology_model = posPipeline.fit(empty_data)

- **Let's take a look at the labels of the `ner_posology` model.**

In [None]:
posology_ner.getClasses()

['O',
 'B-DOSAGE',
 'B-STRENGTH',
 'I-STRENGTH',
 'B-ROUTE',
 'B-FREQUENCY',
 'I-FREQUENCY',
 'B-DRUG',
 'I-DRUG',
 'B-FORM',
 'I-DOSAGE',
 'B-DURATION',
 'I-DURATION',
 'I-FORM',
 'I-ROUTE']
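
The class list mixes `B-` and `I-` variants of the same entity. A short sketch that strips the prefixes to recover the distinct entity types, using the list printed above as input:

```python
# Class list copied from the getClasses() output above.
classes = [
    "O", "B-DOSAGE", "B-STRENGTH", "I-STRENGTH", "B-ROUTE", "B-FREQUENCY",
    "I-FREQUENCY", "B-DRUG", "I-DRUG", "B-FORM", "I-DOSAGE", "B-DURATION",
    "I-DURATION", "I-FORM", "I-ROUTE",
]

# Drop the "O" (outside) tag and the B-/I- prefix, then de-duplicate.
entity_types = sorted({label.split("-", 1)[1] for label in classes if label != "O"})
print(entity_types)
# ['DOSAGE', 'DRUG', 'DURATION', 'FORM', 'FREQUENCY', 'ROUTE', 'STRENGTH']
```

These are the seven entity types the pipeline is expected to extract.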

- **Transform the dataframe with the fitted model**

In [None]:
pos_result = posology_model.transform(df)

In [None]:
pos_result.show()

+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|patient|                text|            document|            sentence|               token|          embeddings|        ner_posology|           ner_chunk| ner_posology_greedy|    ner_chunk_greedy|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      1|Find upper GI end...|[{document, 0, 63...|[{document, 0, 84...|[{token, 0, 3, Fi...|[{word_embeddings...|[{named_entity, 0...|                  []|[{named_entity, 0...|                  []|
|      2|Dear Doctor, Than...|[{document, 0, 89...|[{document, 0, 35...|[{token, 0, 3, De...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 513, 515...|[{named_entity, 0...|[{chunk, 513, 523...|
|    

- **We can show the NER labels of the entities found by these models**

In [None]:
pos_result_df = pos_result.select(F.explode(F.arrays_zip('token.result', 'ner_posology.result')).alias('cols'))\
                          .select(F.expr("cols['0']").alias('token'),
                                  F.expr("cols['1']").alias('ner_posology_label'))\
                          .filter("ner_posology_label!= 'O'")

pos_result_df.show()

+----------+------------------+
|     token|ner_posology_label|
+----------+------------------+
|       PPI|            B-DRUG|
|   tablets|            B-FORM|
|      oral|           B-ROUTE|
|Ranitidine|            B-DRUG|
|omeprazole|            B-DRUG|
+----------+------------------+



In [None]:
pos_greedy_result_df = pos_result.select(F.explode(F.arrays_zip('token.result', 'ner_posology_greedy.result')).alias('cols'))\
                          .select(F.expr("cols['0']").alias('token'),
                                  F.expr("cols['1']").alias('ner_posology_greedy_label'))\
                          .filter("ner_posology_greedy_label!= 'O'")

pos_greedy_result_df.show()

+-----------+-------------------------+
|      token|ner_posology_greedy_label|
+-----------+-------------------------+
|        PPI|                   B-DRUG|
|    tablets|                   I-DRUG|
|       oral|                  B-ROUTE|
| Ranitidine|                   B-DRUG|
|bicarbonate|                   B-DRUG|
|         of|                   I-DRUG|
|      soda/|                   I-DRUG|
|   gaviscon|                   I-DRUG|
| omeprazole|                   B-DRUG|
+-----------+-------------------------+



In [None]:
pos_result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.begin', 'ner_chunk.end', 'ner_chunk.metadata')).alias("cols")) \
          .select(F.expr("cols['0']").alias("chunk"),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']['entity']").alias("ner_label"))\
                  .filter("ner_label!='O'")\
                  .show(truncate=False)

+----------+-----+---+---------+
|chunk     |begin|end|ner_label|
+----------+-----+---+---------+
|PPI       |513  |515|DRUG     |
|tablets   |517  |523|FORM     |
|oral      |572  |575|ROUTE    |
|Ranitidine|595  |604|DRUG     |
|omeprazole|467  |476|DRUG     |
+----------+-----+---+---------+



In [None]:
pos_result.select(F.explode(F.arrays_zip('ner_chunk_greedy.result', 'ner_chunk_greedy.begin', 'ner_chunk_greedy.end', 'ner_chunk_greedy.metadata')).alias("cols")) \
          .select(F.expr("cols['0']").alias("chunk"),
                  F.expr("cols['1']").alias("begin"),
                  F.expr("cols['2']").alias("end"),
                  F.expr("cols['3']['entity']").alias("ner_label"))\
                  .filter("ner_label!='O'")\
                  .show(truncate=False)

+-----------------------------+-----+---+---------+
|chunk                        |begin|end|ner_label|
+-----------------------------+-----+---+---------+
|PPI tablets                  |513  |523|DRUG     |
|oral                         |572  |575|ROUTE    |
|Ranitidine                   |595  |604|DRUG     |
|bicarbonate of soda/ gaviscon|267  |295|DRUG     |
|omeprazole                   |467  |476|DRUG     |
+-----------------------------+-----+---+---------+



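**As you can see, the greedy model merges a drug and its form into one bigger chunk (`PPI tablets`), where `ner_posology` returns `PPI` and `tablets` separately. It also returns `bicarbonate of soda/ gaviscon`, which `ner_posology` did not tag.**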
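
One way to compare the two outputs is to check, by character offsets, which `ner_posology` chunks fall inside each greedy chunk. The sketch below does this on the rows from the two tables above. The patient ids are not shown in those tables and are inferred here from the sample texts. They matter because offsets are only comparable within the same document. The helper name `contained_chunks` is hypothetical.

```python
# (patient, chunk, begin, end, label); offsets copied from the tables above,
# patient ids inferred from which sample text mentions each drug.
fine = [
    (2, "PPI", 513, 515, "DRUG"),
    (2, "tablets", 517, 523, "FORM"),
    (2, "oral", 572, 575, "ROUTE"),
    (2, "Ranitidine", 595, 604, "DRUG"),
    (3, "omeprazole", 467, 476, "DRUG"),
]
greedy = [
    (2, "PPI tablets", 513, 523, "DRUG"),
    (2, "oral", 572, 575, "ROUTE"),
    (2, "Ranitidine", 595, 604, "DRUG"),
    (3, "bicarbonate of soda/ gaviscon", 267, 295, "DRUG"),
    (3, "omeprazole", 467, 476, "DRUG"),
]


def contained_chunks(greedy_chunk, fine_chunks):
    """Return the fine-grained chunk texts whose span lies inside the greedy chunk."""
    patient, _, g_begin, g_end, _ = greedy_chunk
    return [
        text
        for p, text, begin, end, _ in fine_chunks
        if p == patient and g_begin <= begin and end <= g_end
    ]


for chunk in greedy:
    print(chunk[1], "->", contained_chunks(chunk, fine))
```

An empty list for a greedy chunk means `ner_posology` found nothing in that span.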

### **Show the posology entities in raw text** 

In [None]:
light_model = LightPipeline(posology_model)

for index, row in enumerate(df.select("text").collect()):

    print("\n", "*"*50, f'Sample Text {index+1}', "*"*50, "\n")
    
    # collect() returns Row objects, so pass the text field itself
    light_result = light_model.fullAnnotate(row["text"])
    visualiser = NerVisualizer()
    visualiser.display(light_result[0], 
                       label_col='ner_chunk', 
                       document_col='document')


 ************************************************** Sample Text 1 ************************************************** 




 ************************************************** Sample Text 2 ************************************************** 




 ************************************************** Sample Text 3 ************************************************** 



- **Let's show our entities in greedy form.**

In [None]:
light_model = LightPipeline(posology_model)

for index, row in enumerate(df.select("text").collect()):

    print("\n", "*"*50, f'Sample Text {index+1}', "*"*50, "\n")
    
    # collect() returns Row objects, so pass the text field itself
    light_result = light_model.fullAnnotate(row["text"])
    visualiser = NerVisualizer()
    visualiser.display(light_result[0], 
                       label_col='ner_chunk_greedy', 
                       document_col='document')


 ************************************************** Sample Text 1 ************************************************** 




 ************************************************** Sample Text 2 ************************************************** 




 ************************************************** Sample Text 3 ************************************************** 



# **4. Use Case: Extract Relations Between Clinical Entities**

The set of relations defined in the 2010 i2b2 relation challenge:

TrIP: A certain treatment has improved or cured a medical problem (eg, ‘infection resolved with antibiotic course’)

TrWP: A patient's medical problem has deteriorated or worsened because of or in spite of a treatment being administered (eg, ‘the tumor was growing despite the drain’)

TrCP: A treatment caused a medical problem (eg, ‘penicillin causes a rash’)

TrAP: A treatment administered for a medical problem (eg, ‘Dexamphetamine for narcolepsy’)

TrNAP: The administration of a treatment was avoided because of a medical problem (eg, ‘Ralafen which is contra-indicated because of ulcers’)

TeRP: A test has revealed some medical problem (eg, ‘an echocardiogram revealed a pericardial effusion’)

TeCP: A test was performed to investigate a medical problem (eg, ‘chest x-ray done to rule out pneumonia’)

PIP: Two problems are related to each other (eg, ‘Azotemia presumed secondary to sepsis’)


- **We will use the `ner_clinical` model here to extract `PROBLEM`, `TREATMENT` and `TEST` entities, and the `re_clinical` model to extract the relations between them.**

In [None]:
documenter = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentences")

tokenizer = sparknlp.annotators.Tokenizer()\
    .setInputCols(["sentences"])\
    .setOutputCol("tokens")

words_embedder = WordEmbeddingsModel()\
    .pretrained("embeddings_clinical", "en", "clinical/models")\
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("embeddings")

pos_tagger = PerceptronModel()\
    .pretrained("pos_clinical", "en", "clinical/models") \
    .setInputCols(["sentences", "tokens"])\
    .setOutputCol("pos_tags")

clinical_ner_tagger = MedicalNerModel.pretrained("ner_clinical", "en", "clinical/models")\
    .setInputCols("sentences", "tokens", "embeddings")\
    .setOutputCol("ner_tags")     

ner_chunker = NerConverterInternal()\
    .setInputCols(["sentences", "tokens", "ner_tags"])\
    .setOutputCol("ner_chunks")

dependency_parser = DependencyParserModel()\
    .pretrained("dependency_conllu", "en")\
    .setInputCols(["sentences", "pos_tags", "tokens"])\
    .setOutputCol("dependencies")

clinical_re_Model = RelationExtractionModel()\
    .pretrained("re_clinical", "en", 'clinical/models')\
    .setInputCols(["embeddings", "pos_tags", "ner_chunks", "dependencies"])\
    .setOutputCol("relations")\
    .setMaxSyntacticDistance(7)\
    .setRelationPairs(["problem-test", "problem-treatment"]) # we can set the possible relation pairs (if not set, all the relations will be calculated)

rel_pipeline = Pipeline(stages=[
    documenter,
    sentenceDetector,
    tokenizer, 
    words_embedder, 
    pos_tagger, 
    clinical_ner_tagger,
    ner_chunker,
    dependency_parser,
    clinical_re_Model
])

empty_data = spark.createDataFrame([[""]]).toDF("text")
rel_model = rel_pipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 363.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
pos_clinical download started this may take some time.
Approximate size to download 1.5 MB
[OK!]
ner_clinical download started this may take some time.
Approximate size to download 13.9 MB
[OK!]
dependency_conllu download started this may take some time.
Approximate size to download 16.7 MB
[OK!]
re_clinical download started this may take some time.
Approximate size to download 6 MB
[OK!]


- **Transform dataframe using fitted model**

In [None]:
rel_df = rel_model.transform(df)

In [None]:
rel_df.show()

+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|patient|                text|            document|           sentences|              tokens|          embeddings|            pos_tags|            ner_tags|          ner_chunks|        dependencies|           relations|
+-------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|      1|Find upper GI end...|[{document, 0, 63...|[{document, 0, 84...|[{token, 0, 3, Fi...|[{word_embeddings...|[{pos, 0, 3, JJ, ...|[{named_entity, 0...|[{chunk, 5, 22, u...|[{dependency, 0, ...|[{category, 481, ...|
|      2|Dear Doctor, Than...|[{document, 0, 89...|[{document, 0, 35...|[{token, 0, 3, De...|[{word_embeddings...|[{pos,

- **Create a pandas dataframe to show relations**

In [None]:
clinical_rel_df = rel_df.select("patient", F.explode(F.arrays_zip('relations.result', 'relations.metadata')).alias("cols"))\
                        .select("patient", F.expr("cols['0']").alias("relation"),
                                           F.expr("cols['1']['entity1']").alias("entity1"),
                                           F.expr("cols['1']['chunk1']").alias("chunk1"),
                                           F.expr("cols['1']['entity2']").alias("entity2"),
                                           F.expr("cols['1']['chunk2']").alias("chunk2"),
                                           F.expr("cols['1']['confidence']").alias("confidence")\
                                           ).filter("relation != 'O'").toPandas()
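
What the `arrays_zip` / `explode` step above does can be mirrored in plain Python. This is only a sketch of the reshaping, not of Spark's implementation: the `rows` structure is a hypothetical stand-in for the `relations.result` and `relations.metadata` arrays, and the second relation in it is invented for illustration.

```python
# Plain-Python analogue of F.arrays_zip + F.explode on the `relations` column.
# `rows` is a hypothetical stand-in: one list of labels and one parallel list
# of metadata dicts per patient. The "O" entry is invented for illustration.
rows = [
    {
        "patient": 1,
        "result": ["PIP", "O"],
        "metadata": [
            {"entity1": "PROBLEM", "chunk1": "the drop",
             "entity2": "TEST", "chunk2": "haemoglobin",
             "confidence": "0.9999993"},
            {"entity1": "PROBLEM", "chunk1": "abdominal pain",
             "entity2": "TEST", "chunk2": "blood test",
             "confidence": "0.9"},
        ],
    },
]


def explode_relations(rows):
    """Pair each label with its metadata and emit one flat record per pair."""
    flat = []
    for row in rows:
        # zip plays the role of arrays_zip; the inner loop plays explode
        for label, meta in zip(row["result"], row["metadata"]):
            if label == "O":  # same filter as .filter("relation != 'O'")
                continue
            flat.append({
                "patient": row["patient"],
                "relation": label,
                "entity1": meta["entity1"],
                "chunk1": meta["chunk1"],
                "entity2": meta["entity2"],
                "chunk2": meta["chunk2"],
                "confidence": meta["confidence"],
            })
    return flat


flat_relations = explode_relations(rows)
for record in flat_relations:
    print(record)
```

Each surviving record corresponds to one row of the pandas dataframe built by the cell above.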

In [None]:
clinical_rel_df

Unnamed: 0,patient,relation,entity1,chunk1,entity2,chunk2,confidence
0,1,PIP,PROBLEM,the drop,TEST,haemoglobin,0.9999993
1,1,TeRP,TEST,haemoglobin,PROBLEM,his deranged U's,0.99998546
2,1,TeRP,TEST,haemoglobin,PROBLEM,GI malignancy,1.0
3,2,TrNAP,TREATMENT,PPI tablets,PROBLEM,regurgitation,0.9776462
4,2,TeRP,PROBLEM,lost 30 kg,TEST,this method,0.75753903
5,3,TrAP,TREATMENT,bicarbonate,PROBLEM,dark vomit,0.9999956
6,3,TrAP,TREATMENT,gaviscon,PROBLEM,dark vomit,0.99473035
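
The relation codes in the table are terse, and the confidence values arrive as strings because they come from the annotation metadata. The sketch below spells the codes out and drops low-confidence rows. It assumes `re_clinical` uses the 2010 i2b2 relation label set; the `min_confidence` threshold of 0.9 is an arbitrary illustrative choice, and the rows are copied from the table above.

```python
# Assumption: re_clinical follows the 2010 i2b2 relation labels.
I2B2_RELATIONS = {
    "TrIP": "treatment improves problem",
    "TrWP": "treatment worsens problem",
    "TrCP": "treatment causes problem",
    "TrAP": "treatment is administered for problem",
    "TrNAP": "treatment is not administered because of problem",
    "TeRP": "test reveals problem",
    "TeCP": "test conducted to investigate problem",
    "PIP": "problem indicates problem",
}

# (patient, relation, chunk1, chunk2, confidence) copied from the table above.
relations = [
    (1, "PIP", "the drop", "haemoglobin", "0.9999993"),
    (1, "TeRP", "haemoglobin", "his deranged U's", "0.99998546"),
    (1, "TeRP", "haemoglobin", "GI malignancy", "1.0"),
    (2, "TrNAP", "PPI tablets", "regurgitation", "0.9776462"),
    (2, "TeRP", "lost 30 kg", "this method", "0.75753903"),
    (3, "TrAP", "bicarbonate", "dark vomit", "0.9999956"),
    (3, "TrAP", "gaviscon", "dark vomit", "0.99473035"),
]


def describe(relations, min_confidence=0.9):
    """Keep relations at or above the threshold and attach a readable label."""
    kept = []
    for patient, label, chunk1, chunk2, confidence in relations:
        score = float(confidence)  # metadata values are strings
        if score < min_confidence:
            continue
        meaning = I2B2_RELATIONS.get(label, "unknown label")
        kept.append((patient, label, meaning, chunk1, chunk2, score))
    return kept


confident = describe(relations)
for patient, label, meaning, chunk1, chunk2, score in confident:
    print(f"patient {patient}: {chunk1!r} / {chunk2!r} -> {label} ({meaning}), {score:.3f}")
```

With this threshold the `lost 30 kg` / `this method` pair, the least confident row in the table, is filtered out.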


### **Show the relations on the raw text**

- **We can visualize the relations between entities by using `RelationExtractionVisualizer`.**

In [None]:
from sparknlp_display import RelationExtractionVisualizer

light_model = LightPipeline(rel_model)

visualiser = RelationExtractionVisualizer()

# collect() returns Row objects, so pass the text field itself to fullAnnotate
for index, row in enumerate(df.select("text").collect()):

    print("\n", "*"*50, f'Sample Text {index+1}', "*"*50, "\n")

    light_result = light_model.fullAnnotate(row.text)
    visualiser.display(light_result[0], 'relations', show_relations=True)


 ************************************************** Sample Text 1 ************************************************** 






 ************************************************** Sample Text 2 ************************************************** 




 ************************************************** Sample Text 3 ************************************************** 



# **5. Use Case: Create a Clinical Spell Checker Pipeline**

- **We will create a pipeline that corrects misspelled words in clinical text.**

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\
    .setPrefixes(["\"", "(", "[", "\n"])\
    .setSuffixes([".", ",", "?", ")","!", "'s", "/"])

clinical_spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_clinical', 'en', 'clinical/models')\
    .setInputCols("token")\
    .setOutputCol("checked_clinical")\
    .setErrorThreshold(30)\
    .setTradeoff(30)


pipeline = Pipeline(
    stages = [
    documentAssembler,
    tokenizer,
    clinical_spellModel
    ])


empty_ds = spark.createDataFrame([[""]]).toDF("text")

light_model = LightPipeline(pipeline.fit(empty_ds))

spellcheck_clinical download started this may take some time.
Approximate size to download 142.2 MB
[OK!]


- **Let's take one of the clinical notes and deliberately introduce spelling errors.**

In [None]:
broken_text = [
               "I would be crateful if you could see this 79 y/o woman for a non - urgnt OGD.",\
               "She apprs to have been suferig intermitntly from dyspepia and etigstric pain for the course of the lasd 3 months.",\
               "She is meen to get it checked.",\
               "Her pain relief comes in the form of bicbonate of soda/ gavscon but it is recntly getting wrse and she has had 2 episdes of dark vomt.",\
               "She does not smok and drinks only occasiolly.",\
               "H Plori faecel tests have been requsted and omeprezile has been prescbed.",\
               "No weigt loss or dysphag has been reportd but due to her age and symtos, an OGD would provide good reassurnce.",\
               "Best regards, Dr Smith"
]

- **Now we will compare the broken tokens and checked ones**

In [None]:
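# For each sentence, keep the (original, corrected) pairs where the spell checker changed the token
token_list = []
for sample in broken_text:
  sample_checked = light_model.annotate(sample)
  token_list.append([(a,b) for a, b in zip(sample_checked['token'], sample_checked['checked_clinical']) if a.lower()!=b.lower() ])

# Flatten the per-sentence lists into a single list of pairs
new_list=[]
for sample in token_list:
  for token in sample:
    new_list.append((token[0], token[1]))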
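
The comparison above relies on `annotate` returning parallel lists of original and checked tokens. The sketch below shows that shape on a hand-written stand-in for the third sentence. The dictionary is hypothetical (no model is called): the tokenization is an assumption, and only the `meen` to `been` change is taken from the notebook's results.

```python
# Hypothetical stand-in for light_model.annotate("She is meen to get it checked.").
# Tokenization is assumed; only the meen -> been change comes from the notebook.
sample_checked = {
    "token": ["She", "is", "meen", "to", "get", "it", "checked", "."],
    "checked_clinical": ["She", "is", "been", "to", "get", "it", "checked", "."],
}


def corrections(annotated):
    """Return (original, corrected) pairs that differ, ignoring case."""
    pairs = zip(annotated["token"], annotated["checked_clinical"])
    return [(orig, fixed) for orig, fixed in pairs if orig.lower() != fixed.lower()]


def corrected_sentence(annotated):
    """Rebuild the sentence from the checked tokens, joined by single spaces."""
    return " ".join(annotated["checked_clinical"])


print(corrections(sample_checked))
print(corrected_sentence(sample_checked))
```

Joining on spaces leaves punctuation detached from the preceding word; a real pipeline would need token offsets to restore the original spacing.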

In [None]:
pd.set_option("display.max_rows", 50)
pd_df = pd.DataFrame(new_list, columns=["Token", "spellcheck_clinical"])
pd_df

Unnamed: 0,Token,spellcheck_clinical
0,crateful,careful
1,urgnt,urgent
2,apprs,appears
3,suferig,suffering
4,intermitntly,intermittently
5,dyspepia,dyspepsia
6,etigstric,epigastric
7,lasd,last
8,meen,been
9,bicbonate,bicarbonate


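- **The model recovers most misspelled clinical tokens (for example `dyspepia` to `dyspepsia` and `etigstric` to `epigastric`), but some everyday words are miscorrected: `crateful` becomes `careful`, where `grateful` was presumably intended.**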
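
One way to put a number on the table above is to compare each correction with the word the author most likely meant. In the sketch below, `model_output` is copied from the table, while `intended` is an assumption about the original wording (in particular `grateful` and `keen` are guesses), so the resulting score is indicative only.

```python
# Corrections copied from the spell checker output table.
model_output = {
    "crateful": "careful",
    "urgnt": "urgent",
    "apprs": "appears",
    "suferig": "suffering",
    "intermitntly": "intermittently",
    "dyspepia": "dyspepsia",
    "etigstric": "epigastric",
    "lasd": "last",
    "meen": "been",
    "bicbonate": "bicarbonate",
}

# Assumption: the words the note presumably contained before it was broken.
intended = dict(model_output, crateful="grateful", meen="keen")

# Tokens where the model's correction differs from the assumed intended word.
misses = {
    token: (model_output[token], intended[token])
    for token in model_output
    if model_output[token] != intended[token]
}
accuracy = 1 - len(misses) / len(model_output)

print(f"{len(model_output) - len(misses)}/{len(model_output)} corrections match the intended word")
for token, (got, wanted) in misses.items():
    print(f"{token}: got {got!r}, expected {wanted!r}")
```

Under these assumptions the two misses are both everyday words rather than clinical terms.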

End of Notebook