<a href="https://colab.research.google.com/github/muhammetsnts/AGILE/blob/main/Deloitte.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Use Case For Deloitte UK

* Clinical NER
* Clinical Spell Checker

In [None]:
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

In [None]:
%%capture
for k,v in license_keys.items(): 
    %set_env $k=$v

!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp-workshop/master/jsl_colab_setup.sh
!bash jsl_colab_setup.sh

! pip install spark-nlp-display

## Importing Libraries

In [None]:
import json
import os
from pyspark.ml import Pipeline, PipelineModel
from pyspark.sql import SparkSession

from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl
import sparknlp

from sparknlp.util import *
from sparknlp.pretrained import ResourceDownloader
from pyspark.sql import functions as F

import pandas as pd

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('max_colwidth', None)

import string
import numpy as np

## Starting Spark Session

In [None]:
params = {"spark.driver.memory":"16G",
"spark.kryoserializer.buffer.max":"2000M",
"spark.driver.maxResultSize":"2000M"}


spark = sparknlp_jsl.start(license_keys['SECRET'],params=params)

print ("Spark NLP Version :", sparknlp.version())
print ("Spark NLP_JSL Version :", sparknlp_jsl.version())

spark

Spark NLP Version : 3.0.3
Spark NLP_JSL Version : 3.0.3


## Preparing Sample Data

We have three sample texts in a list. We load them into a Spark DataFrame.

In [None]:
sample_text = ["""Find upper GI endoscopy report on this patient from late 2016 and blood test results. I would be grateful if he could be seen to exclude GI malignancy. Adam has been feeling unwell for the past 2 weeks with abdominal pain, lethargy and concern from his family that he looks jaundiced. From his original referral, he has a long history of alcohol abuse, drinking more than 30 beers a week for more than 10 years. He has been alcohol free for the past several months now. In view of the drop in haemoglobin to 80 and his deranged U's & E's I would welcome an urgent opinion to exclude GI malignancy. Thank you for your help. Yours faithfully""",
               """Dear Doctor, Thanks for seeing Lily. She initially had heartburn in the form of indigestion post mealtime and then this developed into reflux and then dysphagia. She has been suffering from this for about 2 months and isn't able to take in any solids anymore. She used to be able to swallow these with some liquid but it is hard for her to even swallow water now. She is passing less urine than before. This is significantly affecting her work as a lab assistant. HP stool was negative and blood tests were fine. PPI tablets are causing regurgitation so she is taking the oral dispersible type. Ranitidine has no effect. There is a low risk of malignancy from my point of view but her symptoms are getting worse day by day. She has lost 30 kg and although she has been asked to lose weight, this method is not healthy. I would be grateful if she could be seen as soon as possible. Dr M. Christie GP.""",
               """I would be grateful if you could see this 79 y/o woman for a non-urgent OGD. She appears to have been suffering intermittently from dyspepsia and epigastric pain for the course of the last 3 months. She is keen to get it checked. Her pain relief comes in the form of bicarbonate of soda/ gaviscon but it is recently getting worse and she has had 2 episodes of dark vomit. She does not smoke and drinks only occasionally. H Pylori faecal tests have been requested and omeprazole has been prescribed. No weight loss or dysphagia has been reported but due to her age and symptoms, an OGD would provide good reassurance. Best regards, Dr Smith"""
              ]

In [None]:
df = spark.createDataFrame(pd.DataFrame({"text":sample_text}))

In [None]:
df.show(truncate=False, vertical=True)

-RECORD 0-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 text | Find upper GI endoscopy report on this patient from late 2016 and blood test result

## Creating a Pipeline for NER

In [None]:
# Annotator that transforms a text column from dataframe into an Annotation ready for NLP
documentAssembler = DocumentAssembler()\
        .setInputCol("text")\
        .setOutputCol("document")

# Sentence Detector annotator, processes various sentences per line
sentenceDetector = SentenceDetectorDLModel.pretrained("sentence_detector_dl_healthcare","en","clinical/models")\
        .setInputCols(["document"])\
        .setOutputCol("sentence")

# Tokenizer splits words in a relevant format for NLP 
tokenizer = Tokenizer()\
        .setInputCols(["sentence"])\
        .setOutputCol("token")

# Clinical word embeddings trained on the PubMed dataset
word_embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical","en","clinical/models")\
        .setInputCols(["sentence","token"])\
        .setOutputCol("embeddings")

# NER model trained on i2b2 (sampled from MIMIC) dataset
clinical_ner = MedicalNerModel.pretrained("ner_clinical_large","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")

ner_converter = NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")
       #.setWhiteList(["PROBLEM"])  #filter entities

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

sentence_detector_dl_healthcare download started this may take some time.
Approximate size to download 363.9 KB
[OK!]
embeddings_clinical download started this may take some time.
Approximate size to download 1.6 GB
[OK!]
ner_clinical_large download started this may take some time.
Approximate size to download 13.9 MB
[OK!]


- Let's take a look at the model stages.

In [None]:
model.stages

[DocumentAssembler_2e7532991f4e,
 SentenceDetectorDLModel_d2546f0acfe2,
 REGEX_TOKENIZER_4e7f64a149c0,
 WORD_EMBEDDINGS_MODEL_9004b1d00302,
 MedicalNerModel_1a8637089929,
 NerConverter_550625fe573b]

- We used a pretrained NER model named `ner_clinical_large` in the pipeline. It labels each chunk as one of the following:
    * TEST
    * PROBLEM
    * TREATMENT

In [None]:
clinical_ner.getClasses()

['O',
 'B-TREATMENT',
 'I-TREATMENT',
 'B-PROBLEM',
 'I-PROBLEM',
 'B-TEST',
 'I-TEST']
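
The `B-` and `I-` prefixes follow the IOB scheme: `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity. The sketch below shows how such tags can be merged into chunks. It is a simplified illustration of the idea behind the chunking step, not the `NerConverter` implementation, and the token list is a shortened, hypothetical excerpt of the first sample text.

```python
def iob_to_chunks(tokens, labels):
    """Group IOB-tagged tokens into (chunk_text, entity) pairs."""
    chunks = []
    current_tokens, current_entity = [], None

    def close_chunk():
        # Store the open chunk, if there is one
        if current_tokens:
            chunks.append((" ".join(current_tokens), current_entity))

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A B- tag always starts a new chunk
            close_chunk()
            current_tokens, current_entity = [token], label[2:]
        elif label.startswith("I-") and current_entity == label[2:]:
            # An I- tag of the same entity type extends the open chunk
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open chunk,
            # closes it (stray I- tokens are dropped in this sketch)
            close_chunk()
            current_tokens, current_entity = [], None
    close_chunk()
    return chunks


# Hypothetical shortened excerpt of the first sample text
tokens = ["Find", "upper", "GI", "endoscopy", "report", "to", "exclude", "GI", "malignancy", "."]
labels = ["O", "B-TEST", "I-TEST", "I-TEST", "O", "O", "O", "B-PROBLEM", "I-PROBLEM", "O"]

chunks = iob_to_chunks(tokens, labels)
print(chunks)
```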

- Let's check the parameters of our `clinical_ner` annotator.

In [None]:
clinical_ner.extractParamMap()

{Param(parent='MedicalNerModel_1a8637089929', name='batchSize', doc='Size of every batch'): 64,
 Param(parent='MedicalNerModel_1a8637089929', name='classes', doc='get the tags used to trained this MedicalNerModel'): ['O',
  'B-TREATMENT',
  'I-TREATMENT',
  'B-PROBLEM',
  'I-PROBLEM',
  'B-TEST',
  'I-TEST'],
 Param(parent='MedicalNerModel_1a8637089929', name='includeConfidence', doc='whether to include confidence scores in annotation metadata'): True,
 Param(parent='MedicalNerModel_1a8637089929', name='inputCols', doc='previous annotations columns, if renamed'): ['sentence',
  'token',
  'embeddings'],
 Param(parent='MedicalNerModel_1a8637089929', name='lazyAnnotator', doc='Whether this AnnotatorModel acts as lazy in RecursivePipelines'): False,
 Param(parent='MedicalNerModel_1a8637089929', name='outputCol', doc='output annotation column. can be left default.'): 'ner',
 Param(parent='MedicalNerModel_1a8637089929', name='storageRef', doc='unique reference name for identification'): 'cl

- After creating the pipeline, we transform the DataFrame with the fitted model.

In [None]:
result = model.transform(df)

In [None]:
result.show()

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|          embeddings|                 ner|           ner_chunk|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|Find upper GI end...|[{document, 0, 63...|[{document, 0, 84...|[{token, 0, 3, Fi...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 5, 22, u...|
|Dear Doctor, Than...|[{document, 0, 89...|[{document, 0, 35...|[{token, 0, 3, De...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 55, 63, ...|
|I would be gratef...|[{document, 0, 63...|[{document, 0, 75...|[{token, 0, 0, I,...|[{word_embeddings...|[{named_entity, 0...|[{chunk, 59, 74, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+

- Show the labels of the tokens

In [None]:
result_df = result.select(F.explode(F.arrays_zip('token.result', 'ner.result')).alias("cols")) \
                  .select(F.expr("cols['0']").alias("token"),
                          F.expr("cols['1']").alias("ner_label"))

result_df.show(50, truncate=100)

+----------+---------+
|     token|ner_label|
+----------+---------+
|      Find|        O|
|     upper|   B-TEST|
|        GI|   I-TEST|
| endoscopy|   I-TEST|
|    report|        O|
|        on|        O|
|      this|        O|
|   patient|        O|
|      from|        O|
|      late|        O|
|      2016|        O|
|       and|        O|
|     blood|        O|
|      test|        O|
|   results|        O|
|         .|        O|
|         I|        O|
|     would|        O|
|        be|        O|
|  grateful|        O|
|        if|        O|
|        he|        O|
|     could|        O|
|        be|        O|
|      seen|        O|
|        to|        O|
|   exclude|        O|
|        GI|B-PROBLEM|
|malignancy|I-PROBLEM|
|         .|        O|
|      Adam|        O|
|       has|        O|
|      been|        O|
|   feeling|        O|
|    unwell|B-PROBLEM|
|       for|        O|
|       the|        O|
|      past|        O|
|         2|        O|
|     weeks|        O|
|      with

- Count the labels of the tokens 

In [None]:
result_df.select("token", "ner_label").groupby("ner_label").count().orderBy('count', ascending=False).show(truncate=False)

+-----------+-----+
|ner_label  |count|
+-----------+-----+
|O          |370  |
|B-PROBLEM  |26   |
|I-PROBLEM  |17   |
|I-TEST     |8    |
|B-TEST     |6    |
|B-TREATMENT|6    |
|I-TREATMENT|2    |
+-----------+-----+



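The `B-` tags mark the first token of each entity, so their counts give the number of detected entity chunks:

- 26 PROBLEM
- 6 TEST
- 6 TREATMENT

These are chunk occurrences, not unique entities; for example, `GI malignancy` appears twice.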
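
The chunk totals can be derived from the label counts alone. A minimal sketch, assuming the counts from the table above are copied into a plain dictionary (the `entity_counts` helper is a hypothetical name, not part of Spark NLP):

```python
from collections import Counter

# Label counts copied from the table above
label_counts = {
    "O": 370,
    "B-PROBLEM": 26,
    "I-PROBLEM": 17,
    "I-TEST": 8,
    "B-TEST": 6,
    "B-TREATMENT": 6,
    "I-TREATMENT": 2,
}


def entity_counts(label_counts):
    """Count entity chunks per type: each chunk has exactly one B- token."""
    counts = Counter()
    for label, n in label_counts.items():
        if label.startswith("B-"):
            counts[label[2:]] += n
    return dict(counts)


entities = entity_counts(label_counts)
print(entities)                 # chunks per entity type
print(sum(entities.values()))   # total number of chunks
```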

- List the entities with their labels

In [None]:
result.select(F.explode(F.arrays_zip('ner_chunk.result', 'ner_chunk.metadata')).alias("cols")) \
.select(F.expr("cols['0']").alias("chunk"),
        F.expr("cols['1']['entity']").alias("ner_label")).show(truncate=False)

+------------------+---------+
|chunk             |ner_label|
+------------------+---------+
|upper GI endoscopy|TEST     |
|GI malignancy     |PROBLEM  |
|unwell            |PROBLEM  |
|abdominal pain    |PROBLEM  |
|lethargy          |PROBLEM  |
|jaundiced         |PROBLEM  |
|alcohol abuse     |PROBLEM  |
|the drop          |PROBLEM  |
|haemoglobin       |TEST     |
|his deranged U's  |PROBLEM  |
|GI malignancy     |PROBLEM  |
|heartburn         |PROBLEM  |
|indigestion       |PROBLEM  |
|reflux            |PROBLEM  |
|dysphagia         |PROBLEM  |
|some liquid       |TREATMENT|
|HP stool          |TEST     |
|blood tests       |TEST     |
|PPI tablets       |TREATMENT|
|regurgitation     |PROBLEM  |
+------------------+---------+
only showing top 20 rows



## Show the entities in raw text

- We can highlight the entities in the raw text with the `sparknlp_display` library and a `LightPipeline`.

In [None]:
from sparknlp_display import NerVisualizer

light_model = LightPipeline(model)

for index, text in enumerate(sample_text):
    print("*"*50)
    print(f'Sample Text {index+1}')
    print("*"*50)
    
    light_result = light_model.fullAnnotate(text)
    visualiser = NerVisualizer()
    visualiser.display(light_result[0], 
                       label_col='ner_chunk', 
                       document_col='document')

**************************************************
Sample Text 1
**************************************************


**************************************************
Sample Text 2
**************************************************


**************************************************
Sample Text 3
**************************************************
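
The chunk annotations can also be flattened into plain rows, which is handy when a text table is preferred over the rendered highlight. The sketch below is an assumption-laden illustration: `chunks_to_rows` is a hypothetical helper, and `SimpleNamespace` objects stand in for annotation objects exposing `begin`, `end`, `result` and `metadata`, so the example runs without a Spark session.

```python
from types import SimpleNamespace


def chunks_to_rows(annotations):
    """Flatten chunk annotations into dictionaries, one per entity chunk."""
    return [
        {
            "chunk": a.result,
            "begin": a.begin,
            "end": a.end,  # inclusive end offset
            "label": a.metadata.get("entity"),
        }
        for a in annotations
    ]


def fake_chunk(text, chunk, label):
    """Build a stand-in annotation with offsets computed from the text."""
    begin = text.index(chunk)
    return SimpleNamespace(begin=begin, end=begin + len(chunk) - 1,
                           result=chunk, metadata={"entity": label})


text = ("Find upper GI endoscopy report on this patient from late 2016 and "
        "blood test results. I would be grateful if he could be seen to "
        "exclude GI malignancy.")

fake_chunks = [
    fake_chunk(text, "upper GI endoscopy", "TEST"),
    fake_chunk(text, "GI malignancy", "PROBLEM"),
]

rows = chunks_to_rows(fake_chunks)
for row in rows:
    print(row)
```

With the objects from the cell above, the call would presumably be `chunks_to_rows(light_result[0]['ner_chunk'])`.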


- In Spark NLP, we have many pretrained NER models for different purposes. Let's use another NER model this time.

In [None]:
clinical_ner = MedicalNerModel.pretrained("ner_jsl","en","clinical/models")\
        .setInputCols(["sentence","token","embeddings"])\
        .setOutputCol("ner")

ner_converter = NerConverter()\
        .setInputCols(["sentence","token","ner"])\
        .setOutputCol("ner_chunk")

nlpPipeline = Pipeline(stages=[
        documentAssembler,
        sentenceDetector,
        tokenizer,
        word_embeddings,
        clinical_ner,
        ner_converter])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = nlpPipeline.fit(empty_data)

ner_jsl download started this may take some time.
Approximate size to download 14 MB
[OK!]


In [None]:
from sparknlp_display import NerVisualizer

light_model = LightPipeline(model)

for index, text in enumerate(sample_text):
    print("*"*50)
    print(f'Sample Text {index+1}')
    print("*"*50)
    
    light_result = light_model.fullAnnotate(text)
    visualiser = NerVisualizer()
    visualiser.display(light_result[0], 
                       label_col='ner_chunk', 
                       document_col='document')

**************************************************
Sample Text 1
**************************************************


**************************************************
Sample Text 2
**************************************************


**************************************************
Sample Text 3
**************************************************


# Spell Checking

In [None]:
documentAssembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

tokenizer = RecursiveTokenizer()\
    .setInputCols(["document"])\
    .setOutputCol("token")\
    .setPrefixes(["\"", "(", "[", "\n"])\
    .setSuffixes([".", ",", "?", ")","!", "'s"])

spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_dl')\
    .setInputCols("token")\
    .setOutputCol("checked")\
    .setErrorThreshold(4.0)\
    .setTradeoff(6.0)

clinical_spellModel = ContextSpellCheckerModel\
    .pretrained('spellcheck_clinical', 'en', 'clinical/models')\
    .setInputCols("token")\
    .setOutputCol("checked_clinical")

spellcheck_dl download started this may take some time.
Approximate size to download 111.4 MB
[OK!]
spellcheck_clinical download started this may take some time.
Approximate size to download 142.2 MB
[OK!]


In [None]:
spellModel.getErrorThreshold()

4.0

In [None]:
def spell_check(trade_off=18, threshold=4) :
  
  clinical_spellModel.setTradeoff(trade_off)
  spellModel.setTradeoff(trade_off)

  clinical_spellModel.setErrorThreshold(threshold)
  spellModel.setErrorThreshold(threshold)

  print("Trade off: ", clinical_spellModel.getTradeoff())
  print("Threshold: ", clinical_spellModel.getErrorThreshold())

  pipeline = Pipeline(
      stages = [
      documentAssembler,
      tokenizer,
      clinical_spellModel,
      spellModel
      ])

  empty_ds = spark.createDataFrame([[""]]).toDF("text")

  return LightPipeline(pipeline.fit(empty_ds))

In [None]:
# Build the LightPipeline with the default trade-off and threshold
lp = spell_check()
sample_new = lp.annotate(sample_text[0])

In [None]:
" ".join(sample_new['checked']).replace(" .", ".").replace(" '","'")

"Find upper II endoscopy report on this patient from late 2016 and blood test results. I would be careful if he could be seen to exclude I malignancy. Adams has been feeling unwell for the past 2 weeks with abdominal pain , lethargy and concern from his family that he looks jaundiced. From his original referral , he has a long history of alcohol abuse , drinking more than 30 beers a week for more than 10 years. He has been alcohol free for the past several months now. In view of the drop in haemoglobin to 80 and his arranged U's & E's I would welcome an urgent opinion to exclude I malignancy. Thank you for your help. hours faithfully"

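The chained `replace` calls above re-attach full stops and `'s`, but commas keep a leading space (`pain , lethargy`). A regex-based detokenizer is one possible alternative. This is a minimal sketch with a hypothetical `detokenize` helper and a hand-picked token list; it is not part of Spark NLP.

```python
import re


def detokenize(tokens):
    """Join tokens and re-attach punctuation that the tokenizer split off."""
    text = " ".join(tokens)
    # Remove whitespace before closing punctuation
    text = re.sub(r"\s+([.,;:?!)])", r"\1", text)
    # Re-attach the possessive/plural suffix 's to the preceding token
    text = re.sub(r"\s+('s)\b", r"\1", text)
    return text


# Hypothetical token list in the shape produced by the tokenizer above
tokens = ["abdominal", "pain", ",", "lethargy", "and", "his", "deranged",
          "U", "'s", "&", "E", "'s", "."]
print(detokenize(tokens))
```
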
In [None]:
for trade_off in list(range(24, 2, -3)) :
  for threshold in list(range(8,1,-1)):

    print("*"*30)
    lp = spell_check(trade_off, threshold)
    token_list = []
    for sample in sample_text:
      sample_checked = lp.annotate(sample)
      token_list.append([(a,b) for a, b in zip(sample_checked['token'], sample_checked['checked_clinical']) if a!=b ])
 
    print(token_list)

******************************
Trade off:  24.0
Threshold:  8.0
[[('GI', 'I'), ('grateful', 'careful'), ('GI', 'I'), ('Adam', 'Adams'), ('deranged', 'arranged'), ('GI', 'I'), ('Yours', 'Your')], [('Dear', 'near'), ('Lily', 'oily'), ('mealtime', 'meantime'), ("isn't", "don't"), ('dispersible', 'dispersive'), ('grateful', 'careful'), ('Christie', 'Cristae')], [('grateful', 'careful'), ('non-urgent', 'concurrent'), ('keen', 'seen'), ('soda/', 'soda'), ('gaviscon', 'Gaviscon'), ('reassurance', 'assurance')]]
******************************
Trade off:  24.0
Threshold:  7.0
[[('GI', 'I'), ('grateful', 'careful'), ('GI', 'I'), ('Adam', 'Adams'), ('deranged', 'arranged'), ('GI', 'I'), ('Yours', 'Your')], [('Dear', 'near'), ('Lily', 'oily'), ('mealtime', 'meantime'), ("isn't", "don't"), ('dispersible', 'dispersive'), ('grateful', 'careful'), ('Christie', 'Cristae')], [('grateful', 'careful'), ('non-urgent', 'concurrent'), ('keen', 'seen'), ('soda/', 'soda'), ('gaviscon', 'Gaviscon'), ('reassuran
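
The printed lists are hard to compare across settings. One way to summarise a single run is to count the changes per sample text and see which original tokens are altered most often. The sketch below uses the differences printed for trade-off 24.0 and threshold 8.0; the helper names are hypothetical.

```python
from collections import Counter

# Differences reported for trade-off 24.0 and threshold 8.0 (one list per sample text)
token_list = [
    [("GI", "I"), ("grateful", "careful"), ("GI", "I"), ("Adam", "Adams"),
     ("deranged", "arranged"), ("GI", "I"), ("Yours", "Your")],
    [("Dear", "near"), ("Lily", "oily"), ("mealtime", "meantime"), ("isn't", "don't"),
     ("dispersible", "dispersive"), ("grateful", "careful"), ("Christie", "Cristae")],
    [("grateful", "careful"), ("non-urgent", "concurrent"), ("keen", "seen"),
     ("soda/", "soda"), ("gaviscon", "Gaviscon"), ("reassurance", "assurance")],
]


def changes_per_text(token_list):
    """Number of altered tokens in each sample text."""
    return [len(pairs) for pairs in token_list]


def most_altered(token_list):
    """How often each original token was changed, across all sample texts."""
    return Counter(original for pairs in token_list for original, _ in pairs)


print(changes_per_text(token_list))
print(most_altered(token_list).most_common(2))
```

Every change in this run replaces a correctly spelled word, so a lower count is better for these texts.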

In [None]:
clinical_spellModel.setTradeoff(18)

SPELL_ee443bf328dc

In [None]:
token_list = []
for sample in sample_text:
  sample_checked = lp.annotate(sample)
  # Keep tokens that at least one of the two spell checkers changed
  token_list.append([(a,b,c) for a, b, c in zip(sample_checked['token'], sample_checked['checked'], sample_checked['checked_clinical']) if a != b or a != c])

In [None]:
new_list=[]
for sample in token_list:
  for token in sample:
    if token[2].lower() == token[0].lower() :
      selected = token[2]
    elif token[1].lower() == token[0].lower():
      selected = token[1]
    else:
      selected = "**"
    new_list.append((token[0], token[1], token[2], selected))

In [None]:
pd.set_option("display.max_rows", 100)

In [None]:
df = pd.DataFrame(new_list, columns=["Token", "spellcheck_dl", "spellcheck_clinical", "selected"])
df

|   | Token | spellcheck_dl | spellcheck_clinical | selected |
|---|---|---|---|---|
| 0 | Find | Find | bind | Find |
| 1 | GI | GI | II | GI |
| 2 | I | I | , | I |
| 3 | grateful | grateful | careful | grateful |
| 4 | exclude | exclude | include | exclude |
| 5 | GI | GI | I | GI |
| 6 | Adam | Adam | day | Adam |
| 7 | unwell | well | well | ** |
| 8 | 2 | 2 | , | 2 |
| 9 | lethargy | Bethany | lethargy | lethargy |
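
The `selected` column follows a simple rule: keep the token when at least one checker left it unchanged (ignoring case), and flag it with `**` when both checkers changed it. A minimal sketch of that rule as a standalone function, where `select_correction` is a hypothetical name and the inputs are rows from the table above:

```python
def select_correction(token, dl, clinical, flag="**"):
    """Pick a spelling for one token from the two checker outputs.

    token    -- original token
    dl       -- suggestion from spellcheck_dl
    clinical -- suggestion from spellcheck_clinical
    """
    # Prefer the clinical checker when it agrees with the original token
    if clinical.lower() == token.lower():
        return clinical
    # Otherwise fall back to the general checker if it agrees
    if dl.lower() == token.lower():
        return dl
    # Both checkers changed the token: flag it for manual review
    return flag


rows = [
    ("Find", "Find", "bind"),
    ("unwell", "well", "well"),
    ("lethargy", "Bethany", "lethargy"),
]
for token, dl, clinical in rows:
    print(token, "->", select_correction(token, dl, clinical))
```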


In [None]:
for sample in token_list:
  for token in sample:
    print(token, end="-")
    if token[2] == token[0]:
      print(token[2])
    elif token[1] == token[0]:
      print(token[1])
    else:
      print("*********")

('Find', 'Find', 'bind')-Find
('GI', 'GI', 'II')-GI
('I', 'I', ',')-I
('grateful', 'grateful', 'careful')-grateful
('exclude', 'exclude', 'include')-exclude
('GI', 'GI', 'I')-GI
('Adam', 'Adam', 'day')-Adam
('unwell', 'well', 'well')-*********
('2', '2', ',')-2
('lethargy', 'Bethany', 'lethargy')-lethargy
('looks', 'looks', 'locus')-looks
('jaundiced', 'Candice', 'jaundiced')-jaundiced
('He', 'He', 'The')-He
('been', 'been', 'bean')-been
('haemoglobin', 'hemoglobin', 'haemoglobin')-haemoglobin
('deranged', 'deranged', 'ranged')-deranged
('U', 'U', ',')-U
('E', ',', 'E')-E
("'s", 'as', "'s")-'s
('welcome', 'welcome', 'become')-welcome
('exclude', 'exclude', 'include')-exclude
('GI', 'GI', 'I')-GI
('Thank', 'Thank', 'Than')-Thank
('Yours', 'Your', 'hours')-*********
('Dear', 'near', 'near')-*********
('Doctor', 'Doctor', 'Doctar')-Doctor
('Thanks', 'Thanks', 'Than')-Thanks
('Lily', 'Lily', 'only')-Lily
('She', 'the', 'the')-*********
('heartburn', 'Dearborn', 'heartburn')-heartburn
('ind

In [None]:
from IPython.utils.text import columnize
beautify = lambda annotations: [columnize(sent['checked']) for sent in annotations]

In [None]:
sample = 'We are going to meet Jovita in the city hall.'
beautify([lp.annotate(sample)])

['We  are  going  to  meet  Avita  in  the  city  half  .\n']

In [None]:
# annotate() returns a single dict for a single string, so zip its columns directly
pairs = lp.annotate(sample_text[0])
print(list(zip(pairs['token'], pairs['checked'])))

In [None]:
for sample in sample_text:
  print(beautify([lp.annotate(sample)]))

["Find       careful     2          original  for      view         an        \nupper      if          weeks      referral  more     of           urgent    \nII         he          with       ,         than     the          opinion   \nendoscopy  could       abdominal  he        10       drop         to        \nreport     be          pain       has       years    in           exclude   \non         seen        ,          a         .        haemoglobin  I         \nthis       to          lethargy   long      He       to           malignancy\npatient    exclude     and        history   has      80           .         \nfrom       I           concern    of        been     and          Thank     \nlate       malignancy  from       alcohol   alcohol  his          you       \n2016       .           his        abuse     free     arranged     for       \nand        Adams       family     ,         for      U            your      \nblood      has         that       drinking  the      's       

In [None]:
sample_text

["Find upper GI endoscopy report on this patient from late 2016 and blood test results. I would be grateful if he could be seen to exclude GI malignancy. Adam has been feeling unwell for the past 2 weeks with abdominal pain, lethargy and concern from his family that he looks jaundiced. From his original referral, he has a long history of alcohol abuse, drinking more than 30 beers a week for more than 10 years. He has been alcohol free for the past several months now. In view of the drop in haemoglobin to 80 and his deranged U's & E's I would welcome an urgent opinion to exclude GI malignancy. Thank you for your help. Yours faithfully",
 "Dear Doctor, Thanks for seeing Lily. She initially had heartburn in the form of indigestion post mealtime and then this developed into reflux and then dysphagia. She has been suffering from this for about 2 months and isn't able to take in any solids anymore. She used to be able to swallow these with some liquid but it is hard for her to even swallow w

In [None]:
def spell_check2(threshold=4):
  
  clinical_spellModel.setErrorThreshold(threshold)
  print(clinical_spellModel.getErrorThreshold())
  pipeline = Pipeline(
      stages = [
      documentAssembler,
      tokenizer,
      clinical_spellModel,
      spellModel
      ])

  empty_ds = spark.createDataFrame([[""]]).toDF("text")

  return LightPipeline(pipeline.fit(empty_ds))

In [None]:
clinical_spellModel.getErrorThreshold()

10.0

In [None]:
clinical_spellModel.getTradeoff()

18.0

In [None]:
for threshold in list(range(24, 2, -3)) :
  print("*"*30)
  lp = spell_check2(threshold)
  token_list = []
  for sample in sample_text:
    sample_checked = lp.annotate(sample)
    token_list.append([(a,b) for a, b in zip(sample_checked['token'], sample_checked['checked_clinical']) if a!=b ])
 
  print(token_list)

******************************
24.0
[[('GI', 'II'), ('grateful', 'careful'), ('GI', 'I'), ('Adam', 'Adams'), ('deranged', 'ranged'), ('GI', 'I'), ('Yours', 'hours')], [('Dear', 'near'), ('Lily', 'oily'), ('mealtime', 'meantime'), ("isn't", "wasn't"), ('dispersible', 'dispersive'), ('grateful', 'careful'), ('Christie', 'Christian')], [('grateful', 'careful'), ('non-urgent', 'concurrent'), ('keen', 'been'), ('soda/', 'soda'), ('gaviscon', 'Gaviscon'), ('reassurance', 'assurance')]]
******************************
21.0
[[('GI', 'II'), ('grateful', 'careful'), ('GI', 'I'), ('Adam', 'Adams'), ('deranged', 'ranged'), ('GI', 'I'), ('Yours', 'hours')], [('Dear', 'near'), ('Lily', 'oily'), ('mealtime', 'meantime'), ("isn't", "wasn't"), ('dispersible', 'dispersive'), ('grateful', 'careful'), ('Christie', 'Christian')], [('grateful', 'careful'), ('non-urgent', 'concurrent'), ('keen', 'been'), ('soda/', 'soda'), ('gaviscon', 'Gaviscon'), ('reassurance', 'assurance')]]
******************************

In [None]:
from sparknlp.pretrained import PretrainedPipeline

In [None]:
spell_checker_pip = PretrainedPipeline('check_spelling', lang='en')


check_spelling download started this may take some time.
Approx size to download 913.5 KB
[OK!]


In [None]:
spell_checker_pip.model.stages

[document_b92f0aaaa5d0,
 SENTENCE_942cd730cc6e,
 REGEX_TOKENIZER_7b0eba20d829,
 SPELL_73aa38a2cec0]

In [None]:
result = spell_checker_pip.annotate(sample_text[0])

list(zip(result['token'], result['checked']))

[('Find', 'Find'),
 ('upper', 'upper'),
 ('GI', 'GI'),
 ('endoscopy', 'endoscopy'),
 ('report', 'report'),
 ('on', 'on'),
 ('this', 'this'),
 ('patient', 'patient'),
 ('from', 'from'),
 ('late', 'late'),
 ('2016', '2016'),
 ('and', 'and'),
 ('blood', 'blood'),
 ('test', 'test'),
 ('results', 'results'),
 ('.', '.'),
 ('I', 'I'),
 ('would', 'would'),
 ('be', 'be'),
 ('grateful', 'grateful'),
 ('if', 'if'),
 ('he', 'he'),
 ('could', 'could'),
 ('be', 'be'),
 ('seen', 'seen'),
 ('to', 'to'),
 ('exclude', 'exclude'),
 ('GI', 'GI'),
 ('malignancy', 'malignancy'),
 ('.', '.'),
 ('Adam', 'Adam'),
 ('has', 'has'),
 ('been', 'been'),
 ('feeling', 'feeling'),
 ('unwell', 'unwell'),
 ('for', 'for'),
 ('the', 'the'),
 ('past', 'past'),
 ('2', '2'),
 ('weeks', 'weeks'),
 ('with', 'with'),
 ('abdominal', 'abdominal'),
 ('pain', 'pain'),
 (',', ','),
 ('lethargy', 'lethargy'),
 ('and', 'and'),
 ('concern', 'concern'),
 ('from', 'from'),
 ('his', 'his'),
 ('family', 'family'),
 ('that', 'that'),
 ('he