# Automated Interpretation of Medical Text
This pipeline runs as follows:
- Step 1. Identify complex words in a document with the CWI tool
- Step 2. Use spaCy and scispaCy to perform basic NLP tasks
- Step 3. Iterate over tokens that are considered complex
- Step 4. Find the hypernyms in WordNet that correspond to the identified complex words
- Step 5. If no hypernym for a complex word exists in WordNet, use our TF-IDF search engine on the pre-built corpus hypernym tree from the UMLS lookup file
- Step 6. Replace complex words in the sentence with hypernyms
- Step 7. Grade the readability of the document before and after substitution

**The code uses and exemplifies each function from CWI in the `Complexity_labeller` class, from the CWI method first described in:**
*Complex Word Identifier from the paper: Complex Word Identification as a Sequence Labelling Task, 2019,* Authors: Gooding, Sian and Kochmar, Ekaterina


**This code uses a sequence labeling method first described in:**
*Semi-supervised multitask learning for sequence labeling, 2017,* Authors: Rei, Marek

## Using the complex word sequence labeller
To use the complex word models, you must download the sequence labeller files available [here](https://github.com/marekrei/sequence-labeler). Please cite both the sequence labeller paper and the CWI sequence labelling paper if you use these models for research.

Additionally, the CWI method uses TensorFlow < 2.0.0, so if you install from the git source above, you must open the `labeler.py` script and replace `import tensorflow` with the following:

`import tensorflow.compat.v1 as tf`
`tf.disable_v2_behavior()`

Notes:
- If you see warnings from TF, this is because of the above: we are using the TF API from >1.0.0 but <2.0.0, which TF treats as deprecated behavior
- If you edit this script, you must restart the cluster; otherwise TF will break because the word embeddings are already present
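
The import swap described above can also be applied programmatically. The following is a minimal sketch, not part of the original repository: `patch_tf_import` is a hypothetical helper that rewrites the first plain `import tensorflow` (or `import tensorflow as tf`) line in a source string. It assumes the import sits on its own line at the top level of `labeler.py`.

```python
import re

# Replacement lines taken from the instructions above.
COMPAT_IMPORT = "import tensorflow.compat.v1 as tf\ntf.disable_v2_behavior()"

# Matches a plain top-level TensorFlow import on its own line.
TF_IMPORT = re.compile(r"^import tensorflow(?: as tf)?[ \t]*$", flags=re.MULTILINE)


def patch_tf_import(source: str) -> str:
    """Swap the first plain TensorFlow import for the compat.v1 lines (hypothetical helper)."""
    return TF_IMPORT.sub(lambda _match: COMPAT_IMPORT, source, count=1)


# Illustrative stand-in for the contents of the script, not the real file.
example = "import collections\nimport tensorflow as tf\nimport numpy\n"
patched = patch_tf_import(example)
print(patched)
```

Because the pattern only matches a plain import line, running the helper a second time on already patched text leaves it unchanged.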

## There are two options when converting text to CoNLL-type tab-separated format:
- `convert_format_string`, `convert_format_token`
- `Complexity_labeller.convert_format_string(model, 'You can convert a string like this')`
- `Complexity_labeller.convert_format_token(model, ['You','can','convert','tokens','like','this'])`
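
To make "CoNLL-type tab-separated format" concrete, here is a rough sketch of what such a file typically looks like: one token per line, a tab, then a label column, with a blank line closing each sentence. The helper `tokens_to_conll` and the placeholder label `N` are assumptions for illustration; the exact columns that `Complexity_labeller` writes to its temp file are not shown in this notebook.

```python
def tokens_to_conll(tokens, placeholder_label="N"):
    """Render one sentence as CoNLL-style lines: token<TAB>label (hypothetical helper)."""
    lines = [f"{token}\t{placeholder_label}" for token in tokens]
    # A blank line marks the end of the sentence.
    return "\n".join(lines) + "\n\n"


conll_text = tokens_to_conll(['You', 'can', 'convert', 'tokens', 'like', 'this'])
print(conll_text)
```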

## Once the text has been converted there are three methods to access complexity information:
- `get_dataframe`, `get_bin_labels`, `get_prob_labels`
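
As a sketch of how these outputs can be consumed, the snippet below filters (word, binary label, class probabilities) triples down to the words that are confidently complex. `select_complex_words` is a hypothetical helper, the 0.9 cut-off mirrors the threshold used in the substitution loop later in this notebook, and plain lists stand in for the NumPy arrays the labeller returns. The sample values are copied from the notebook output.

```python
def select_complex_words(triples, min_prob=0.9):
    """Keep words labelled complex (label == 1) whose complex-class probability is at least min_prob."""
    selected = []
    for word, label, probs in triples:
        # probs is [P(not complex), P(complex)]
        if label == 1 and float(probs[1]) >= min_prob:
            selected.append(word)
    return selected


sample = [
    ("strategies", 1, [0.08728024, 0.9127197]),
    ("used", 0, [0.99852604, 0.00147404]),
    ("glucose", 1, [0.45347735, 0.5465227]),
]
confident = select_complex_words(sample)
print(confident)
```

With this threshold, `glucose` is labelled complex but is not selected, because its complex-class probability is only about 0.55.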

### After identifying complex words with the CWI:
**This script uses various tools from Explosion's spaCy and AllenAI's scispaCy in combination with WordNet to substitute complex words with their less complex hypernyms.**

In [1]:
import sys
sys.path.insert(0, './sequence-labeler-master')

from complex_labeller import Complexity_labeller
model_path = './cwi_seq.model'
temp_path = './temp_file.txt'

Instructions for updating:
non-resource variables are not supported in the long term


In [2]:
model = Complexity_labeller(model_path, temp_path)

Instructions for updating:
Please use `keras.layers.Bidirectional(keras.layers.RNN(cell))`, which is equivalent to this API
Instructions for updating:
Please use `keras.layers.RNN(cell)`, which is equivalent to this API
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Please use `rate` instead of `keep_prob`. Rate should be set to `rate = 1 - keep_prob`.


2021-08-29 09:26:22.252943: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-08-29 09:26:22.255264: I tensorflow/core/common_runtime/process_util.cc:146] Creating new thread pool with default inter op setting: 2. Tune using inter_op_parallelism_threads for best performance.
2021-08-29 09:26:22.278296: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-08-29 09:26:22.312521: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 3792965000 Hz


In [3]:
#Converting example sentence/document
test_document = 'strategies used for regulating blood glucose levels. such strategies include administration of insulin; dietary modification; and exercise. "'
Complexity_labeller.convert_format_string(model,test_document)

In [4]:
#The `get_dataframe` method returns a dataframe containing the original tokenized sentence, binary complexity labels and complex class probabilities.
#If a word receives a binary label = 1, it has been classified as a complex word.
dataframe = Complexity_labeller.get_dataframe(model)

#Access binary labeling information from the dataframe format:
cw_list = list(zip(dataframe['sentences'].values[0],dataframe['labels'].values[0],dataframe['probs'].values[0]))

#get_bin_labels returns the binary complexity labels for the input
#bin_label_list = Complexity_labeller.get_bin_labels(model)

#The `get_prob_labels` method returns the probability of each token belonging to the complex class.
#prob_label_list = Complexity_labeller.get_prob_labels(model)


In [5]:
dataframe

| | index | sentences | labels | probs |
|---|---|---|---|---|
| 0 | 0 | [strategies, used, for, regulating, blood, glu... | [1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, ... | [[0.087280236, 0.9127197], [0.99852604, 0.0014... |


In [6]:
cw_list

#prob_label_list
#bin_label_list

[('strategies', 1, array([0.08728024, 0.9127197 ], dtype=float32)),
 ('used', 0, array([0.99852604, 0.00147404], dtype=float32)),
 ('for', 0, array([9.9995494e-01, 4.5058394e-05], dtype=float32)),
 ('regulating', 1, array([0.0708539, 0.9291461], dtype=float32)),
 ('blood', 0, array([0.99519855, 0.00480145], dtype=float32)),
 ('glucose', 1, array([0.45347735, 0.5465227 ], dtype=float32)),
 ('levels', 0, array([0.961756  , 0.03824396], dtype=float32)),
 ('.', 0, array([9.9996638e-01, 3.3573986e-05], dtype=float32)),
 ('such', 0, array([9.995981e-01, 4.019722e-04], dtype=float32)),
 ('strategies', 1, array([0.11672881, 0.88327116], dtype=float32)),
 ('include', 0, array([0.9947866 , 0.00521339], dtype=float32)),
 ('administration', 1, array([0.05484764, 0.94515234], dtype=float32)),
 ('of', 0, array([9.9995613e-01, 4.3843611e-05], dtype=float32)),
 ('insulin', 1, array([0.40310553, 0.59689444], dtype=float32)),
 (';', 0, array([9.9995375e-01, 4.6290221e-05], dtype=float32)),
 ('dietary', 

In [7]:
import scispacy
import spacy
nlp_med = spacy.load("en_core_sci_scibert")
doc_med = nlp_med(test_document)




In [8]:
from nltk.corpus import wordnet as wn

In [9]:
new_document = test_document
# doc_med.ents holds entity spans (possibly multi-word), not single tokens
for ent in doc_med.ents:
    ent_words = str(ent).split(' ')
    print('\n')
    print('token: ', ent)

    for cw, cw_label, cw_probs in cw_list:
        # probability that the word belongs to the complex class
        cw_prob_complex = float(cw_probs[1])

        # substitute only words labelled complex with probability >= 0.9
        if cw in ent_words and cw_label == 1 and cw_prob_complex >= 0.9:
            print(cw)
            synsets = wn.synsets(cw)
            try:
                print(synsets[0])
                hypernym = synsets[0].hypernyms()
                # NOTE: strip(".n.01") removes any of the characters '.', 'n', '0', '1'
                # from both ends, not the literal suffix; this is why the output
                # below shows 'plan_of_actio' and 'control.n.05'
                hypernym = str(hypernym).split("'")[1].strip(".n.01")
                print(hypernym)
                # str.replace substitutes every occurrence of cw in the document
                new_document = new_document.replace(cw, hypernym)
                print(new_document)
                print('-------')
            except IndexError:
                # no synset found, or the first synset has no hypernym
                print('no synset')
                print('-------')


#print(doc_med.ents)
#for token in doc_med.ents:
#    print(token)



token:  strategies
strategies
Synset('scheme.n.01')
plan_of_actio
plan_of_actio used for regulating blood glucose levels. such plan_of_actio include administration of insulin; dietary modification; and exercise. "
-------


token:  regulating blood glucose levels
regulating
Synset('regulation.n.06')
control.n.05
plan_of_actio used for control.n.05 blood glucose levels. such plan_of_actio include administration of insulin; dietary modification; and exercise. "
-------


token:  administration
administration
Synset('administration.n.01')
management
plan_of_actio used for control.n.05 blood glucose levels. such plan_of_actio include management of insulin; dietary modification; and exercise. "
-------


token:  insulin


token:  dietary modification


token:  exercise
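
The truncated `plan_of_actio` and the leftover `control.n.05` in the output above come from `str.strip`, which removes a set of characters from both ends rather than a literal suffix. A minimal sketch of one alternative follows: `lemma_from_synset_name` is a hypothetical helper that splits a synset name such as `plan_of_action.n.01` on its last two dots. It is an illustration, not the notebook's code. NLTK also exposes `Synset.name()` and `Synset.lemma_names()`, which avoid parsing the printed representation at all.

```python
def lemma_from_synset_name(synset_name: str) -> str:
    """Return the lemma part of a '<lemma>.<pos>.<sense>' synset name (hypothetical helper)."""
    lemma, _pos, _sense = synset_name.rsplit(".", 2)
    # WordNet joins multi-word lemmas with underscores.
    return lemma.replace("_", " ")


for name in ["plan_of_action.n.01", "control.n.05"]:
    stripped = name.strip(".n.01")  # behaviour of the original cell
    print(f"{name}: strip -> {stripped!r}, rsplit -> {lemma_from_synset_name(name)!r}")
```

Substituting these cleaner hypernyms would change the substituted document, so the readability scores reported further down would need to be recomputed.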


In [11]:
import readability_v_ks
nlp_read = readability_v_ks.spacy.load("en_core_web_sm")
nlp_read.add_pipe('readability')


<readability_v_ks.ReadabilityComponent at 0x7f49fb81d510>

In [12]:
print(test_document)
doc_orig = nlp_read(test_document)  # separate name so the scispaCy doc_med is not overwritten
print(doc_orig._.flesch_kincaid_grade_level)
print(doc_orig._.flesch_kincaid_reading_ease)


strategies used for regulating blood glucose levels. such strategies include administration of insulin; dietary modification; and exercise. "
16.183823529411768
-5.8277941176470165


In [13]:
print(new_document)
doc_new = nlp_read(new_document)
print(doc_new._.flesch_kincaid_grade_level)
print(doc_new._.flesch_kincaid_reading_ease)

plan_of_actio used for control.n.05 blood glucose levels. such plan_of_actio include management of insulin; dietary modification; and exercise. "
13.407352941176473
14.07808823529416
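
For reference, the two scores can be reproduced with the standard Flesch-Kincaid grade level and Flesch reading ease formulas. This is a sketch under assumptions: the word, sentence, and syllable counts below are not printed by the notebook; they were inferred by working backwards from the reported scores, and how `readability_v_ks` actually counts tokens and syllables is not shown here.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid grade level formula."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch reading ease formula."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Counts inferred from the reported scores (an assumption, not notebook output).
original = dict(words=17, sentences=2, syllables=41)
substituted = dict(words=17, sentences=2, syllables=37)

for label, counts in [("original", original), ("substituted", substituted)]:
    grade = flesch_kincaid_grade(**counts)
    ease = flesch_reading_ease(**counts)
    print(f"{label}: grade={grade:.4f}, ease={ease:.4f}")
```

Under these counts the only difference between the two documents is four fewer syllables, which is enough to lower the grade level by almost three grades.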
