<h1>Task 2 Data Wrangling</h1>

The main goal of this task is to bring our datasets into a form that knodle can use for building a sequence tagger. Since the MIMIC III training data does not come with any labels, we rely on a technique named weak annotation: we build labeling functions from keyword lists and apply them to the data, so that every matching datapoint receives a label. The labels denote classes of clinical events such as departments, problems and treatments, and can then be used for automatically detecting such sequences in other clinical texts. To validate the sequence tagger we use the original labels of our test data (taken from the 2012 i2b2 challenge) as a proper test suite.
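
To make the idea of weak annotation concrete, here is a minimal sketch of a keyword-based labeling function. The keyword sets and the helper name `weak_label` are made up for illustration; the real lists used in this notebook are loaded from files further below.

```python
# Toy keyword lists (hypothetical); the notebook loads its real lists from files.
KEYWORDS = {
    "CLINICAL_DEPT": {"cardiology", "radiology"},
    "PROBLEM": {"denies", "without"},
}

def weak_label(token):
    """Return every label whose keyword list contains the token exactly."""
    token = token.lower()
    return sorted(label for label, words in KEYWORDS.items() if token in words)

tokens = "Patient transferred to Cardiology without complications".split()
weak_labels = {tok: weak_label(tok) for tok in tokens}
print(weak_labels)
```

Tokens that match no keyword list simply receive no label, which is why the resulting annotation is called weak: it is cheap to produce but incomplete and noisy.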

The idea behind this kind of sequence tagging is called named entity recognition (NER), which is the task of locating spans in a given text that refer to entities and assigning each of them to a predefined category (in our case, a class of clinical event).

In [1]:
import pandas as pd
import os
from nltk.corpus import stopwords
from nltk.util import ngrams
import string
import scipy.sparse # importing the submodule explicitly, since scipy.sparse is used below
import numpy as np
import torch
import tensorflow as tf
import joblib
import xml.etree.ElementTree as ET
import re
from typing import List
from torch.utils.data import TensorDataset
from transformers import AutoTokenizer

pd.set_option('display.max_rows', 10)


<h2>MIMIC III</h2>

First we load the MIMIC III dataset, which will function as our training dataset. Before we can use the data for building our weak supervision models we still have to do some preprocessing of our text column.

In order to access the MIMIC III dataset, a PhysioNet account and a completion of the necessary credentialing process are needed. This involves passing the "Data or Specimens Only Research" course from the "Human Research" curriculum offered by the Collaborative Institutional Training Initiative (CITI Program). This can entail some waiting time.

MIMIC III ("Medical Information Mart for Intensive Care") is a large medical database containing detailed information on patients admitted to critical care units at a large tertiary care hospital. It includes, among other things, data on vital signs, medications, laboratory measurements, observations and notes. In its original form it is a relational database consisting of 26 tables.

Events such as notes and laboratory tests are stored in a series of ‘events’ tables. For example the NOTEEVENTS table contains all clinical notes related to an event for a given patient.

For the purpose of this task, we will solely focus on the clinical notes, stored as text in NOTEEVENTS.


In [2]:
df_notes = pd.read_csv('./data/train/mimic-iii-clinical-database-1.4/NOTEEVENTS.csv', low_memory=False)

In [5]:
df_notes.shape

(2083180, 11)

In [5]:
df_notes.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,CHARTDATE,CHARTTIME,STORETIME,CATEGORY,DESCRIPTION,CGID,ISERROR,TEXT
0,174,22532,167853.0,2151-08-04,,,Discharge summary,Report,,,Admission Date: [**2151-7-16**] Dischar...
1,175,13702,107527.0,2118-06-14,,,Discharge summary,Report,,,Admission Date: [**2118-6-2**] Discharg...
2,176,13702,167118.0,2119-05-25,,,Discharge summary,Report,,,Admission Date: [**2119-5-4**] D...
3,177,13702,196489.0,2124-08-18,,,Discharge summary,Report,,,Admission Date: [**2124-7-21**] ...
4,178,26880,135453.0,2162-03-25,,,Discharge summary,Report,,,Admission Date: [**2162-3-3**] D...


<h3>Preprocessing</h3>

In the following section we examine the MIMIC data, single out the relevant sections and preprocess them further into lists of strings.
To do this we remove stopwords, punctuation and numbers, and normalize the input format.
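
The cleaning steps can be sketched on a single string as follows. This is only an illustration: the sample sentence, the function name `clean_note` and the tiny stopword set are assumptions, whereas the notebook itself uses NLTK's English stopword list and operates on a pandas column.

```python
import re
import string

# Hypothetical mini stopword list, only for this example.
TOY_STOPWORDS = {"the", "was", "to", "on", "with", "a", "of"}

def clean_note(text):
    text = re.sub(r"\[\*\*.*?\*\*\]", "", text)   # drop [** ... **] placeholders
    text = re.sub(r"[0-9]+", "", text)            # drop digits
    # remove punctuation, but keep slashes
    keep_slashes = string.punctuation.replace("/", "").replace("\\", "")
    text = text.translate(str.maketrans("", "", keep_slashes))
    # lowercase, split on whitespace and filter stopwords
    return " ".join(w for w in text.lower().split() if w not in TOY_STOPWORDS)

cleaned = clean_note(
    "Admission Date: [**2151-7-16**] The patient was admitted to the ICU "
    "with N/V and 2 episodes of fever."
)
print(cleaned)
```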

N.B.: The pipeline processes most of the data in batches, given the large dimensions of the MIMIC III dataset. It was designed with scalability in mind and can therefore also be applied to the whole MIMIC data, given enough computational power. Nevertheless, we decided in the end to sample our data down to a hundredth to provide a fully functioning proof of concept without frying our RAM (and brains).
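
As a small illustration of batch processing, the sketch below reads a CSV in chunks with pandas. The in-memory CSV is toy data standing in for a large file; only `pd.read_csv(..., chunksize=...)` is taken from the notebook.

```python
import io
import pandas as pd

# In-memory stand-in for a large CSV file (hypothetical toy data).
csv_data = io.StringIO("TEXT\nfever and cough\nno acute distress\nchest pain\n")

token_count = 0
n_chunks = 0
# With chunksize set, read_csv returns an iterator of DataFrames instead of one frame.
for chunk in pd.read_csv(csv_data, chunksize=2):
    n_chunks += 1
    token_count += chunk["TEXT"].str.split().str.len().sum()

print(n_chunks, token_count)
```

Only one chunk is held in memory at a time. Note that such a reader can be iterated only once; a second pass requires calling `pd.read_csv` again.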

In [3]:
df_notes = df_notes.sample(frac=0.01, random_state=19) # sampling the data for a proof of concept because putting the whole data through the pipeline would not be feasible

In [4]:
notes_text = df_notes.TEXT # slicing only the text column

In [5]:
del df_notes # removing variables from memory for tidying up RAM usage

In [6]:
notes_text = notes_text.str.replace(r'\[\*\*(.*?)\*\*\]', '', regex=True) # remove de-identification placeholders such as [**2151-7-16**]


In [7]:
notes_text = notes_text.str.replace(r'[0-9]+', '', regex=True) # remove all digits


In [8]:
punctuation = string.punctuation.replace('/', '')
punctuation = punctuation.replace('\\', '') # excluding slashes from punctuation removal, since we need them to match with some disease

notes_text = notes_text.str.translate(str.maketrans('', '', punctuation))

In [9]:
stop_words = set(stopwords.words('english')) - {'and', 'or'} # 'and' and 'or' stay in the text; the new name avoids shadowing the nltk stopwords module

In [10]:
notes_text = notes_text.apply(lambda x: ' '.join([word for word in x.lower().split() if word not in stop_words]))

In [11]:
notes_text.to_csv('./data/train/notes_cleaned_sample.csv', index=False)

In [12]:
del notes_text

In [13]:
notes_text = pd.read_csv('./data/train/notes_cleaned_sample.csv', low_memory=False, chunksize = 10000, index_col = False)

For feeding the MIMIC III data to our labeling functions we opted for an n-gram approach, i.e. every note entry is split into its n-grams, each of which becomes a separate row. Given the exploding dimensions of our keyword lists we decided to only use unigrams for now.

In theory the code also makes it possible to generate every n-gram size from unigrams upwards, in order to then apply our labeling functions to every subset and achieve better results through the usage of more data and processing power. The function get_ngrams() takes a parameter n, which is the number of words per n-gram it should produce for a given dataset. Note that the chunked reader notes_text is consumed by a single pass, so it has to be re-created with pd.read_csv before each further call.

Since many of the sequences of interest consist of more than one word, widening the range of n may result in better accuracy and prediction quality in the final model. On the other hand, this may entail considerable performance costs.
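
To show what different values of n produce, here is a dependency-free sketch. The helper `word_ngrams` is a hypothetical stand-in for the NLTK-based logic used in get_ngrams() below.

```python
def word_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "shortness of breath".split()
unigrams = word_ngrams(tokens, 1)
bigrams = word_ngrams(tokens, 2)
too_long = word_ngrams(tokens, 4)   # a note shorter than n yields no n-grams

print(unigrams)
print(bigrams)
```

A multi-word keyword such as "shortness of breath" can only be matched exactly once trigrams are generated, which is the motivation for going beyond unigrams.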

In [14]:
# processing ngrams in chunks for performance reasons

def get_ngrams(file_path, n, df):
    """splits every note of a chunked reader into n-grams (one row per n-gram) and appends the result to a csv file"""
    try:
        os.remove(file_path) # start from a clean output file
    except OSError:
        pass

    for subset in df:
        subset = subset.dropna().copy() # copy, so that new columns are not assigned to a view of the chunk
        subset['ngrams'] = subset['TEXT'].str.split().apply(lambda x: list(map(' '.join, ngrams(x, n=n))))
        subset = (subset.assign(count=subset['ngrams'].str.len())
                  .explode('ngrams')
                  .query('count > 0')) # drop notes with fewer than n words
        subset['index_notes'] = subset.index # remember which note each n-gram stems from
        subset = subset.drop(['count', 'TEXT'], axis=1)
        if not os.path.isfile(file_path):
            subset.to_csv(file_path, index=False)
        else:
            subset.to_csv(file_path, index=False, mode='a', header=False)
    return None

In [15]:
get_ngrams('./data/train/ngrams/1grams_sample.csv', 1, notes_text)

In [None]:
get_ngrams('./data/train/ngrams/2grams.csv', 2, notes_text)

In [None]:
get_ngrams('./data/train/ngrams/3grams.csv', 3, notes_text)

In [None]:
get_ngrams('./data/train/ngrams/4grams.csv', 4, notes_text)

In [None]:
get_ngrams('./data/train/ngrams/5grams.csv', 5, notes_text)

In [None]:
get_ngrams('./data/train/ngrams/6grams.csv', 6, notes_text)

In [None]:
get_ngrams('./data/train/ngrams/7grams.csv', 7, notes_text)

<h3>Labeling Functions</h3>

As a next step we load a text file containing a list of the most important clinical departments for our first labeling function. This list was scraped from different sources, which are also provided in the text file. The input from the text file is then transformed into a list of n-grams, each denoting a clinical department. This list is then applied to our n-grams from the MIMIC dataset, where every exact match gets assigned the label "CLINICAL_DEPT".

In [16]:
with open("./data/keywords/departments_list.txt") as dataFile:
    departments = {}
    for line in dataFile:
        if line.strip() == '': # stop at the first blank line; nothing after it is parsed
            break
        # each line has the form "department name: keyword, keyword; keyword"
        key, _, raw_keywords = line.partition(':')
        value = [i.strip() for i in raw_keywords.replace(';', ',').split(',')]
        departments[key] = value

list(departments.items())[0:3]

[('Department of Admissions',
  ['Admission', 'Admissions', 'Admitting Department']),
 ('Department of Anaesthesia, Intensive Care Medicine and Pain Medicine',
  ['Anesthetics', 'Anesthesiology']),
 ('Department of Blood Group Serology and Transfusion Medicine',
  ['Serology', 'Transfusion Medicine'])]

In [17]:
departments_keywords = list(departments.values())
departments_keywords = [item.lower() for sublist in departments_keywords for item in sublist]

In [18]:
departments_keywords[0:3]

['admission', 'admissions', 'admitting department']

Another keywords list which we will use contains a collection of terms denoting clinical events - here every exact match in the MIMIC dataset gets assigned the label "EVIDENTIAL".

Source: National Cancer Institute Thesaurus: NCI9d - Thesaurus.txt: NCI Thesaurus Version 22.09d - flattened from the owl format 
from their own webpage: https://evs.nci.nih.gov/evs-download/thesaurus-downloads

In [19]:
events_keywords_df = pd.read_csv('./data/keywords/Thesaurus.txt', delimiter='\t')

In [20]:
events_keywords_df = events_keywords_df.drop(events_keywords_df.columns[[0,1,2,5,6]], axis=1)
events_keywords_df = events_keywords_df.rename(columns={events_keywords_df.columns[0]:'Keywords', events_keywords_df.columns[1]:'Desc', events_keywords_df.columns[2]:'Label'})
events_keywords_df = events_keywords_df.drop(events_keywords_df[events_keywords_df.Label != 'Event'].index)

In [21]:
events_keywords = events_keywords_df.Keywords.str.replace(r'\|.+', '', regex=True) # drop everything from the first '|' onwards
events_keywords = events_keywords.str.replace('/', ' ', regex=False)


In [22]:
events_keywords = events_keywords.tolist()
events_keywords = [x.lower() for x in events_keywords]

In order to also extract evidence of problematic clinical events/occurrences we create a list of negative phrases which will then be matched to the label "PROBLEM".

Source: The UMLS dataset (Unified Medical Language System),
accessible after authentication at: https://www.nlm.nih.gov/research/umls/licensedcontent/umlsknowledgesources.html

In [23]:
with open("./data/keywords/negCueWords.txt") as dataFile:
    problems_keywords = []
    for line in dataFile:
        res = line.split('|', 1)
        problems_keywords.append(res[0].strip()) # keep the phrase before the first '|', without surrounding whitespace

In [24]:
def ngrams_keywords(n, input_ls):
    """function for extracting subset of ngrams from a given keyword list"""
    output_ls = []
    for i in input_ls:
        ws = i.count(' ')
        if ws == n-1:
            output_ls.append(i)
    return output_ls

In [55]:
def labeling_function(file_path, df, keyword_list):
    """labeling function takes as input a chunked dataframe reader and a list of keywords and adds one 0/1 column per keyword to every ngram (1 if exact match, else 0)"""
    try:
        os.remove(file_path)
    except OSError:
        pass

    for subset in df:
        subset = subset.dropna().copy()
        for keyword in keyword_list:
            subset[keyword] = (subset['ngrams'] == keyword).astype(int)
        # write inside the loop, so that every chunk is saved and not only the last one
        if not os.path.isfile(file_path):
            subset.to_csv(file_path, index=False)
        else:
            subset.to_csv(file_path, index=False, mode='a', header=False)
    return None

In [62]:
notes_ngrams = pd.read_csv('./data/train/ngrams/1grams_sample.csv', low_memory=False, chunksize = 500000, index_col = False)

In [60]:
departments_keywords_unigrams = ngrams_keywords(1, departments_keywords)
events_keywords_unigrams = ngrams_keywords(1, events_keywords)
problems_keywords_unigrams = ngrams_keywords(1, problems_keywords)
keywords_list = departments_keywords_unigrams + events_keywords_unigrams + problems_keywords_unigrams
keywords_list = list(set(keywords_list))

In [63]:
labeling_function('./data/train/ngrams/notes_1grams_labeled_sample.csv', notes_ngrams, keywords_list)



Again, the labeling part can still be expanded, depending on the available resources and the expectations of the final output. This could be done, for example, by providing more keyword lists covering different classes within the medical sphere and/or by extending the existing keyword lists with more synonyms.

The following function explode_ngrams() is not used for our proof of concept at the moment. If the amount of data is scaled up and more/different sizes of n-grams are computed, the function would be needed in addition, in order to normalise the dimensions of the training data after the labeling process is done.

In [13]:
def explode_ngrams(file_path, df):
    """takes as input a dataframe of ngrams and creates for every word within a ngram a separate row, then the columns of the newly generated dataframe are also rearranged and renamed for clearer formatting"""
    try:
        os.remove(file_path)
    except OSError:
        pass

    for subset in df:
        subset['words'] = subset['ngrams'].str.split(' ')
        subset = subset.explode('words')

        cols = subset.columns.tolist()
        cols = cols[-1:] + cols[:-1]

        subset = subset[cols]

        subset = subset.rename(columns={'ngrams':'keywords'})

        if not os.path.isfile(file_path):
            subset.to_csv(file_path, index=False)
        else:
            subset.to_csv(file_path, index=False, mode='a', header=False)
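
The core of explode_ngrams() can be illustrated on a toy frame. The two rows below are invented for this example; the column names follow the ones used in the notebook.

```python
import pandas as pd

# Toy n-gram frame (hypothetical data) with the note index each n-gram came from.
toy = pd.DataFrame({"ngrams": ["chest pain", "fever"], "index_notes": [0, 1]})

# Split each n-gram into its words and create one row per word.
toy["words"] = toy["ngrams"].str.split(" ")
exploded = toy.explode("words").reset_index(drop=True)
print(exploded)
```

Every word keeps the n-gram and note index it came from, so labels assigned at the n-gram level can be carried over to the individual words.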

<h2>i2b2 2012</h2>

<h3>Preprocessing</h3>

As our testing data we will use the i2b2 set from 2012 which consists of multiple clinical notes and annotations corresponding to different types of clinical events. Since all the annotated clinical notes are in .xml-format we first have to parse the separate files, clean them and finally concatenate the relevant attributes into a single dataframe. For these preprocessing steps we mostly adhere to the same pipeline we have already used for the MIMIC III data.

Source: https://www.i2b2.org/ through the DBMI Data Portal (authentication required)
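
The parsing step can be sketched on a miniature document. The XML below is a hypothetical, heavily shortened imitation of an annotated note: it only assumes what the code in this section relies on, namely that the first child of the root holds the note text and the second child holds the tags with `text` and `type` attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of one annotated note; real files carry more attributes.
toy_xml = """<ClinicalNarrativeTemporalAnnotation>
<TEXT>Admission for chemotherapy on 2012-01-01 .</TEXT>
<TAGS>
<EVENT id="E0" text="Admission" type="OCCURRENCE" />
<EVENT id="E1" text="chemotherapy" type="TREATMENT" />
<TIMEX3 id="T0" text="2012-01-01" type="DATE" />
</TAGS>
</ClinicalNarrativeTemporalAnnotation>"""

root = ET.fromstring(toy_xml)
note_text = root[0].text                 # first child: the raw note
events = [
    (tag.attrib.get("text"), tag.attrib.get("type"))
    for tag in root[1]                   # second child: the annotations
    if tag.tag == "EVENT"                # skip time expressions
]
print(events)
```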

In [9]:
punctuation = string.punctuation.replace('/', '')
punctuation = punctuation.replace('\\', '')

stopwords = stopwords.words('english')
stopwords.remove('and')
stopwords.remove('or')

In [34]:
ls_text = []
ls_note_id = []
ls_class = []

for filename in os.listdir('./data/test/2012-08-08.test-data.event-timex-groundtruth/xml/'):
    mytree = ET.parse('./data/test/2012-08-08.test-data.event-timex-groundtruth/xml/' + filename)
    myroot = mytree.getroot()
    for i in myroot[1]:
        if i.tag == 'EVENT': # filtering out time expressions, since they are not relevant for the given task
            if i.attrib.get('text').count(' ') == 0:
                ls_text.append(i.attrib.get('text'))
                ls_class.append(i.attrib.get('type'))
                ls_note_id.append(int(filename.replace('.xml', '')))

In [35]:
for filename in os.listdir('./data/test/2012-08-08.test-data.event-timex-groundtruth/xml/'):
    mytree = ET.parse('./data/test/2012-08-08.test-data.event-timex-groundtruth/xml/' + filename)
    myroot = mytree.getroot()
    t = myroot[0].text
    t = re.sub(r'[0-9]+', '', t)
    t = t.translate(str.maketrans('', '', punctuation))
    t = t.split(' ')
    t = [item.lower() for item in t if item]
    t = [item for item in t if item not in stopwords]
    for el in t:
        if el not in ls_text: # checking if words are already contained in the labeled subset, and if not append them to the existing arrays
            ls_text.append(el)
            ls_class.append('OTHER')
            ls_note_id.append(int(filename.replace('.xml', '')))

In [36]:
df_test = pd.DataFrame()
df_test['notes_id'] = ls_note_id
df_test['words'] = ls_text
df_test['classes'] = ls_class

df_test = df_test.dropna()

In [37]:
df_test.head()

Unnamed: 0,notes_id,words,classes
0,221,ADMISSION,OCCURRENCE
1,221,chemotherapy,TREATMENT
2,221,Cisplatin,TREATMENT
3,221,tolerated,OCCURRENCE
4,221,chemotherapy,TREATMENT


In [38]:
df_test = df_test[df_test.classes.isin(['CLINICAL_DEPT', 'EVIDENTIAL', 'PROBLEM', 'OTHER'])] # keep only the classes covered by our keyword lists, plus 'OTHER'

In [39]:
joblib.dump(df_test, './data/test/df_test.lib')

['./data/test/df_test.lib']

<h2>Knodle Matrices</h2>

Before we can finally put our generated data into knodle and build our models, we still have to create the necessary input matrices.

In [64]:
df_T = pd.DataFrame(keywords_list)
df_T = df_T.rename(columns={df_T.columns[0]:'Keywords'})
df_T['CLINICAL_DEPT'] = df_T['Keywords'].map(lambda x: 1 if x in departments_keywords else 0)
df_T['EVIDENTIAL'] = df_T['Keywords'].map(lambda x: 1 if x in events_keywords else 0)
df_T['PROBLEM'] = df_T['Keywords'].map(lambda x: 1 if x in problems_keywords else 0)

In [65]:
t_matrix = df_T.set_index('Keywords')
t_matrix = scipy.sparse.csr_matrix(t_matrix)
t_matrix = t_matrix.tocoo()

In [66]:
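t_matrix = torch.sparse_coo_tensor(
    torch.tensor(np.vstack((t_matrix.row, t_matrix.col)), dtype=torch.long),
    torch.tensor(t_matrix.data, dtype=torch.long),
    size=t_matrix.shape,
) # replaces the legacy torch.sparse.LongTensor constructor; passing size keeps all-zero trailing rows/columns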

First, we create the T-matrix with the dimensions keywords x labels and save it to the joblib format for later use.
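
A toy version of the T-matrix makes the keywords x labels layout visible. The keyword lists below are invented for illustration; the label names are the ones used in this notebook.

```python
import numpy as np
import scipy.sparse

labels = ["CLINICAL_DEPT", "EVIDENTIAL", "PROBLEM"]
# Hypothetical keyword lists, one per label.
keyword_lists = {
    "CLINICAL_DEPT": ["cardiology", "admission"],
    "EVIDENTIAL": ["admission", "report"],
    "PROBLEM": ["denies"],
}
keywords = sorted({kw for kws in keyword_lists.values() for kw in kws})

# rows = keywords, columns = labels; 1 if the keyword belongs to the label's list
t_dense = np.array(
    [[int(kw in keyword_lists[lab]) for lab in labels] for kw in keywords]
)
t_sparse = scipy.sparse.csr_matrix(t_dense)
print(keywords)
print(t_dense)
```

A keyword that appears in several lists ("admission" here) gets a 1 in more than one column, so a single keyword can vote for several labels.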

In [67]:
joblib.dump(t_matrix, './data/train/t_matrix_sample.lib')

['./data/train/t_matrix_sample.lib']

In [79]:
notes_1grams_matched = pd.read_csv('./data/train/ngrams/notes_1grams_labeled_sample.csv', low_memory=False, chunksize=500000, index_col = False) # chunked reader; the loop below iterates over its chunks

In [80]:
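ls_matrix = []
for subset in notes_1grams_matched:
    # keep only the 0/1 keyword columns: the ngram text becomes the index, the note index is dropped
    subset = subset.set_index('ngrams')
    subset = subset.drop('index_notes', axis=1)
    temp_matrix = scipy.sparse.csr_matrix(subset)
    ls_matrix.append(temp_matrix) # errors are no longer silently swallowed by a bare except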

In [81]:
joblib.dump(ls_matrix, './data/train/ls_matrix_sample.lib')

['./data/train/ls_matrix_sample.lib']

In [82]:
ls_matrix = joblib.load('./data/train/ls_matrix_sample.lib')

In [83]:
z_matrix = scipy.sparse.vstack(ls_matrix)

In [84]:
z_matrix = z_matrix.tocoo()
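z_matrix = torch.sparse_coo_tensor(
    torch.tensor(np.vstack((z_matrix.row, z_matrix.col)), dtype=torch.long),
    torch.tensor(z_matrix.data, dtype=torch.long),
    size=z_matrix.shape,
) # passing size keeps trailing ngrams without any keyword match as all-zero rows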

Secondly, we create the Z-matrix with the dimensions ngrams x keywords. The ngram text and the index of the clinical note from which a given ngram stems are removed beforehand, so that only the 0/1 match columns remain.
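
How the two matrices fit together can be sketched with dense toy arrays. This assumes, as we understand the weak supervision setup, that Z (instances x keywords) and T (keywords x labels) are combined via a matrix product to obtain label votes per instance; the data below is invented.

```python
import numpy as np

keywords = ["admission", "cardiology", "denies"]
labels = ["CLINICAL_DEPT", "EVIDENTIAL", "PROBLEM"]
tokens = ["patient", "denies", "admission"]

# Z: one row per token, one column per keyword (1 on exact match)
z = np.array([[int(tok == kw) for kw in keywords] for tok in tokens])
# T: one row per keyword, one column per label (toy assignment)
t = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 0, 1]])

votes = z @ t   # per-token label votes, shape (tokens, labels)
print(votes)
```

A token without any keyword match ("patient") ends up with an all-zero vote row, which is why the row count of Z has to be preserved exactly.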

In [85]:
joblib.dump(z_matrix, './data/train/z_matrix_sample.lib')

['./data/train/z_matrix_sample.lib']

Finally, we also create the X-matrices and Y-matrix for our train and test data.

In [2]:
df_test = joblib.load('./data/test/df_test.lib')

In [3]:
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)


Here we use two small helper functions. They rely on the transformers and scipy packages to create our X- and Y-matrices.

The X-matrix contains the individual words from the test data.

The Y-matrix contains the classes occurring in the test data.
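
The class-to-integer mapping in this notebook is derived from the order in which the classes first appear in the test data. As an alternative, a fixed mapping could be used; the sketch below assumes a label order that mirrors the T-matrix columns, with 'OTHER' appended. Whether the downstream model expects this order, or zero-based ids at all, is an assumption that would need to be checked.

```python
# Assumed label order (hypothetical): T-matrix columns first, then 'OTHER'.
LABEL_ORDER = ["CLINICAL_DEPT", "EVIDENTIAL", "PROBLEM", "OTHER"]
label_to_id = {label: idx for idx, label in enumerate(LABEL_ORDER)}

classes = ["OTHER", "PROBLEM", "CLINICAL_DEPT", "OTHER"]
class_ids = [label_to_id[c] for c in classes]
print(class_ids)
```

A fixed mapping stays the same across runs and data samples, whereas a mapping based on first appearance changes whenever the row order of the test data changes.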

In [4]:
def convert_text_to_transformer_input(tokenizer, texts: List[str]) -> TensorDataset:
    encoding = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    input_ids = encoding.get('input_ids')
    attention_mask = encoding.get('attention_mask')

    input_values_x = TensorDataset(input_ids, attention_mask)

    return input_values_x


def np_array_to_tensor_dataset(x: np.ndarray) -> TensorDataset:
    if isinstance(x, scipy.sparse.csr_matrix):
        x = x.toarray()
    x = torch.from_numpy(x)
    x = TensorDataset(x)
    
    return x

In [5]:
x_test = convert_text_to_transformer_input(tokenizer, df_test['words'].tolist())

In [6]:
unique_list = df_test['classes'].unique()

# create dict mapping for unique classes: ids start at 1, in order of first appearance
dic_count = {el: count for count, el in enumerate(unique_list, start=1)}
    

In [7]:
df_test['classes_int'] = df_test['classes'].apply(lambda x: dic_count.get(x))

In [8]:
y_test = np_array_to_tensor_dataset(df_test['classes_int'].values)

In [50]:
joblib.dump(x_test, './data/test/x_test.lib')
joblib.dump(y_test, './data/test/y_test.lib')

['./data/test/y_test.lib']

In [2]:
notes_1grams = pd.read_csv('./data/train/ngrams/1grams_sample.csv', low_memory=False, index_col = False)

In [5]:
notes_1grams = notes_1grams.dropna()
x_train = convert_text_to_transformer_input(tokenizer, notes_1grams['ngrams'].tolist())

In [6]:
joblib.dump(x_train, './data/train/x_train_sample.lib')

['./data/train/x_train_sample.lib']

Now that we have created all our necessary matrices from the original data, we will put all of it into knodle to build our models. For the modeling part please refer to the notebook "data_modeling".