<a href="https://colab.research.google.com/github/kgoz12/JSL/blob/master/Clinical_Drug_Normalizer_Possible_Bug.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![JohnSnowLabs](https://nlp.johnsnowlabs.com/assets/images/logo.png)

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/23.Drug_Normalizer.ipynb)

# 23. Clinical Drug Normalizer

### New Annotator that transforms text to the format used in the RxNorm and SNOMED standards

It takes annotated documents of type Array\[AnnotatorType\](DOCUMENT) as input and returns annotated documents of type AnnotatorType.DOCUMENT.

Parameters are:
- inputCol: input column name string which targets a column of type Array(AnnotatorType.DOCUMENT).
- outputCol: output column name string which targets a column of type AnnotatorType.DOCUMENT.
- lowercase: whether to convert strings to lowercase. Default is False.
- policy: rule to remove patterns from text.  Valid policy values are:  
  + **"all"**,   
  + **"abbreviations"**,   
  + **"dosages"** 
   
The default is "all". The "abbreviations" policy expands common drug abbreviations; the "dosages" policy converts drug dosages and values to the standard form (see the examples below).

#### Examples of transformation:
    
1) "Sodium Chloride/Potassium Chloride 13bag"  >>>  "Sodium Chloride / Potassium Chloride **13 bag**" : adds spaces around the slash and between the quantity and the form

2) "interferon alfa-2b 10 million unit ( 1 ml ) injec" >>> "interferon alfa - 2b 10000000 unt ( 1 ml ) injection " : converts **10 million unit** to **10000000 unt** and replaces **injec** with **injection**

3) "aspirin 10 meq/ 5 ml oral sol" >>> "aspirin 2 meq/ml oral solution" : normalizes **10 meq/ 5 ml** to **2 meq/ml** and expands the abbreviation **oral sol** to **oral solution**

4) "adalimumab 54.5 + 43.2 gm" >>> "adalimumab 97700 mg" : sums **54.5 + 43.2** and converts **gm** to **mg**

5) "Agnogenic one  half cup" >>> "Agnogenic 0.5 oral solution" : replaces **one  half** with **0.5** and normalizes **cup** to **oral solution**
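
The arithmetic behind examples 3 and 4 can be checked by hand. The sketch below is not the annotator's implementation: `per_unit` and `grams_sum_to_mg` are hypothetical helpers written only for this illustration, and they handle just the two input shapes shown above.

```python
import re
from decimal import Decimal


def fmt(value):
    # Fixed-point text without trailing zeros, e.g. Decimal("97700.0") -> "97700"
    return format(value.normalize(), "f")


def per_unit(dose):
    # Hypothetical helper: reduce "<amount> <unit>/ <volume> <unit>" to a per-unit ratio.
    m = re.fullmatch(r"\s*([\d.]+)\s*([a-z]+)\s*/\s*([\d.]+)\s*([a-z]+)\s*", dose)
    if m is None:
        raise ValueError(f"unsupported dose format: {dose!r}")
    amount, unit, volume, volume_unit = m.groups()
    return f"{fmt(Decimal(amount) / Decimal(volume))} {unit}/{volume_unit}"


def grams_sum_to_mg(dose):
    # Hypothetical helper: add two gram quantities and express the total in mg.
    m = re.fullmatch(r"\s*([\d.]+)\s*\+\s*([\d.]+)\s*gm\s*", dose)
    if m is None:
        raise ValueError(f"unsupported dose format: {dose!r}")
    first, second = m.groups()
    total_mg = (Decimal(first) + Decimal(second)) * 1000
    return f"{fmt(total_mg)} mg"


print(per_unit("10 meq/ 5 ml"))           # 2 meq/ml
print(grams_sum_to_mg("54.5 + 43.2 gm"))  # 97700 mg
```

`Decimal` is used instead of `float` so that 54.5 + 43.2 is exactly 97.7 before the conversion to milligrams.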

In [1]:
import json

from google.colab import files

license_keys = files.upload()

with open(list(license_keys.keys())[0]) as f:
    license_keys = json.load(f)

license_keys.keys()

Saving jsl_keys.json to jsl_keys.json


dict_keys(['SECRET', 'SPARK_NLP_LICENSE', 'AWS_ACCESS_KEY_ID', 'AWS_SECRET_ACCESS_KEY', 'JSL_VERSION', 'PUBLIC_VERSION'])

In [2]:
license_keys['JSL_VERSION']

'2.7.3'

In [3]:
import os

# Install java
! apt-get install -y openjdk-8-jdk-headless -qq > /dev/null
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["PATH"] = os.environ["JAVA_HOME"] + "/bin:" + os.environ["PATH"]
! java -version

secret = license_keys['SECRET']

os.environ['SPARK_NLP_LICENSE'] = license_keys['SPARK_NLP_LICENSE']
os.environ['AWS_ACCESS_KEY_ID']= license_keys['AWS_ACCESS_KEY_ID']
os.environ['AWS_SECRET_ACCESS_KEY'] = license_keys['AWS_SECRET_ACCESS_KEY']
jsl_version = license_keys['JSL_VERSION']
version = license_keys['PUBLIC_VERSION']

! pip install --ignore-installed -q pyspark==2.4.7

! python -m pip install --upgrade spark-nlp-jsl==$jsl_version  --extra-index-url https://pypi.johnsnowlabs.com/$secret

! pip install --ignore-installed -q spark-nlp==2.7.3

import sparknlp

print (sparknlp.version())

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession


from sparknlp.annotator import *
from sparknlp_jsl.annotator import *
from sparknlp.base import *
import sparknlp_jsl

params = {"spark.driver.memory": "16G",
          "spark.kryoserializer.buffer.max": "2000M",
          "spark.driver.maxResultSize": "2000M"}
spark = sparknlp_jsl.start(secret, params=params)

openjdk version "1.8.0_282"
OpenJDK Runtime Environment (build 1.8.0_282-8u282-b08-0ubuntu1~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)
  Building wheel for pyspark (setup.py) ... done
Looking in indexes: https://pypi.org/simple, https://pypi.johnsnowlabs.com/2.7.3-3f5059a2258ea6585a0bd745ca84dac427bca70c
Collecting spark-nlp-jsl==2.7.3
  Downloading https://pypi.johnsnowlabs.com/2.7.3-3f5059a2258ea6585a0bd745ca84dac427bca70c/spark-nlp-jsl/spark_nlp_jsl-2.7.3-py3-none-any.whl (50kB)
Collecting spark-nlp==2.7.3
  Downloading https://files.pythonhosted.org/packages/32/9e/2f43d668eefea486e7417c1e83554c72a41e0786976e9429846b753f5014/spark_nlp-2.7.3-py2.py3-none-any.whl (138kB)
Installing collected

In [4]:
spark

In [5]:
import sys, os, time
from sparknlp.base import *
from sparknlp.annotator import *
from sparknlp.util import *
from sparknlp_jsl.annotator import *

from sparknlp.pretrained import ResourceDownloader

from pyspark.sql import functions as F
from pyspark.ml import Pipeline, PipelineModel

In [7]:
documentAssembler = DocumentAssembler()\
  .setInputCol("text")\
  .setOutputCol("document")

drug_normalizer = DrugNormalizer() \
    .setInputCols("document") \
    .setOutputCol("document_normalized") \
    .setPolicy("all") # issue is with "all" and "dosages"; "abbreviations" is fine

pipeline_ner = Pipeline(
    stages=[
        documentAssembler,
        drug_normalizer,
    ])

In [8]:
# single-row DataFrame holding an empty string, used only to fit the pipeline
empty_df = spark.createDataFrame([[""]]).toDF("text")

In [9]:
from sparknlp.base import LightPipeline
LightPipeline(pipeline_ner.fit(empty_df)).annotate("amox trihydrate clavulanate k 125mg")

{'document': ['amox trihydrate clavulanate k 125mg'],
 'document_normalized': ['amox trihydrate clavulanate k 125 mg']}

In [10]:
LightPipeline(pipeline_ner.fit(empty_df)).annotate("amox trihydrate + clavulanate k 125mg")

Py4JJavaError: ignored
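
To narrow down which inputs trigger the error, it may help to run several strings through the same fitted pipeline and record which ones raise. A minimal sketch, assuming a hypothetical helper named `probe` (not part of Spark NLP) that accepts any callable, such as the `annotate` method of a `LightPipeline`:

```python
def probe(annotate_fn, texts):
    """Call annotate_fn on each text; record "ok" or the exception class name."""
    report = {}
    for text in texts:
        try:
            annotate_fn(text)
            report[text] = "ok"
        except Exception as exc:  # Py4JJavaError is a subclass of Exception
            report[text] = type(exc).__name__
    return report


# Usage with the objects defined in this notebook (requires the Spark session above):
# light = LightPipeline(pipeline_ner.fit(empty_df))
# probe(light.annotate, [
#     "amox trihydrate clavulanate k 125mg",
#     "amox trihydrate + clavulanate k 125mg",
# ])
```

Rebuilding `drug_normalizer` with `.setPolicy("dosages")` or `.setPolicy("abbreviations")` and calling `probe` again would show how the result varies by policy.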