<h1>Task 2 Data Wrangling</h1>

In [1]:
import pandas as pd
import os
from nltk.corpus import stopwords
from nltk.util import ngrams
import string

pd.set_option('display.max_rows', 10)

<h2>MIMIC III</h2>

First we load the MIMIC III dataset, which serves as our training/development set. Before the data can be used to build our weak supervision models, the free-text TEXT column of the notes still has to be preprocessed.

In [2]:
df_notes = pd.read_csv('./data/train/mimic-iii-clinical-database-1.4/NOTEEVENTS.csv', low_memory=False)

In [5]:
df_notes.shape

(2083180, 11)

In [5]:
df_notes.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,CHARTDATE,CHARTTIME,STORETIME,CATEGORY,DESCRIPTION,CGID,ISERROR,TEXT
0,174,22532,167853.0,2151-08-04,,,Discharge summary,Report,,,Admission Date: [**2151-7-16**] Dischar...
1,175,13702,107527.0,2118-06-14,,,Discharge summary,Report,,,Admission Date: [**2118-6-2**] Discharg...
2,176,13702,167118.0,2119-05-25,,,Discharge summary,Report,,,Admission Date: [**2119-5-4**] D...
3,177,13702,196489.0,2124-08-18,,,Discharge summary,Report,,,Admission Date: [**2124-7-21**] ...
4,178,26880,135453.0,2162-03-25,,,Discharge summary,Report,,,Admission Date: [**2162-3-3**] D...


<h2>Preprocessing</h2>

In [3]:
notes_text = df_notes.TEXT

In [4]:
del df_notes # free memory; only the TEXT column is needed from here on

In [5]:
notes_text = notes_text.str.replace(r'\[\*\*(.*?)\*\*\]', '', regex=True) # remove de-identification placeholders such as [**2151-7-16**]


In [6]:
notes_text = notes_text.str.replace(r'[0-9]+', '', regex=True) # remove all digits


In [10]:
# slashes are excluded from punctuation removal because some disease keywords contain them
punctuation = string.punctuation.replace('/', '')
punctuation = punctuation.replace('\\', '')

notes_text = notes_text.str.translate(str.maketrans('', '', punctuation))

In [12]:
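stop_words = set(stopwords.words('english')) # separate name so the imported nltk stopwords object is not overwritten
stop_words -= {'and', 'or'} # keep 'and'/'or', which occur inside multi-word keywords

In [13]:
notes_text = notes_text.apply(lambda x: ' '.join(word for word in x.lower().split() if word not in stop_words))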
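
To make the effect of the cleaning steps above easier to follow, here is a minimal sketch that applies the same four steps to a single string using only the standard library. The function name clean_note, the example sentence and the tiny stop word set are assumptions for illustration; the notebook itself works on the full pandas Series with NLTK's English stop word list.

```python
import re
import string

# Hypothetical mini stop word list; the notebook uses NLTK's English list minus 'and'/'or'.
STOP_WORDS = {"the", "was", "to", "a", "of", "on"}

# Keep '/' and '\' so that keywords containing slashes can still match later.
PUNCTUATION = string.punctuation.replace("/", "").replace("\\", "")


def clean_note(text):
    """Apply the notebook's cleaning steps to one note (illustrative only)."""
    text = re.sub(r"\[\*\*(.*?)\*\*\]", "", text)  # drop de-identification placeholders
    text = re.sub(r"[0-9]+", "", text)              # drop digits
    text = text.translate(str.maketrans("", "", PUNCTUATION))  # drop punctuation
    return " ".join(w for w in text.lower().split() if w not in STOP_WORDS)


example = "Admission Date: [**2151-7-16**] The patient was transferred to the ICU on day 3."
print(clean_note(example))
```

Note that the order matters: the placeholders have to be removed before punctuation is stripped, otherwise the brackets and asterisks that the pattern relies on are already gone.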

In [14]:
notes_text.to_csv('./data/train/notes_cleaned.csv', index=False)

In [15]:
del notes_text

In [30]:
notes_text = pd.read_csv('./data/train/notes_cleaned.csv', low_memory=False, chunksize = 10000, index_col = False)

To feed the MIMIC III data to our labeling functions we opted for an n-gram approach, i.e. every note is split into its n-grams and each n-gram becomes one row. Given the length of the entries in our keyword lists, we generate all n-grams from unigrams up to 7-grams and then apply our labeling functions to each of these sets.
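
As a small illustration of the n-gram approach described above, the following sketch builds all n-grams from unigrams to 7-grams for one short note in plain Python. The helper make_ngrams and the example note are assumptions for illustration; the notebook uses nltk.util.ngrams on pandas chunks instead.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical cleaned note, not taken from MIMIC III.
note = "patient transferred intensive care unit"
tokens = note.split()

rows = []
for n in range(1, 8):  # unigrams up to 7-grams
    for gram in make_ngrams(tokens, n):
        rows.append((n, gram))

for n, gram in rows:
    print(n, gram)
```

A note with fewer than n tokens simply contributes no rows for that n, which is why the 6-gram and 7-gram passes produce nothing for this five-token example.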

In [29]:
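# processing ngrams in chunks for performance reasons

def get_ngrams(file_path, n, df):
    # df is a chunked reader (pd.read_csv(..., chunksize=...)); it can only be iterated once
    try:
        os.remove(file_path) # start from a clean output file
    except OSError:
        pass

    for subset in df:
        subset = subset.dropna().copy() # copy so the assignment below does not write to a view
        # split each note into tokens and join every window of n tokens into one string
        subset['ngrams'] = subset['TEXT'].str.split().apply(lambda x: list(map(' '.join, ngrams(x, n=n))))
        subset = (subset.assign(count=subset['ngrams'].str.len())
                  .explode('ngrams')
                  .query('count > 0')) # drop notes with fewer than n tokens
        subset['index_notes'] = subset.index # index of the note each n-gram came from
        subset = subset.drop(['count', 'TEXT'], axis=1)
        # write the header only for the first chunk, then append
        if not os.path.isfile(file_path):
            subset.to_csv(file_path, index=False)
        else:
            subset.to_csv(file_path, index=False, mode='a', header=False)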
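
The write pattern used in get_ngrams (header for the first chunk, plain append for every later chunk) can be sketched in isolation with the standard csv module. The helper append_rows, the temporary file and the toy chunks are assumptions for illustration only.

```python
import csv
import os
import tempfile


def append_rows(file_path, header, rows):
    """Append rows to a CSV file, writing the header only when the file is new."""
    is_new = not os.path.isfile(file_path)
    with open(file_path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(header)
        writer.writerows(rows)


# Hypothetical chunks of (ngram, index_notes) rows written to a throwaway file.
out_path = os.path.join(tempfile.mkdtemp(), "demo_ngrams.csv")
chunks = [[("intensive care", 1)], [("operating room", 5), ("critical care", 7)]]
for chunk in chunks:
    append_rows(out_path, ["ngrams", "index_notes"], chunk)

with open(out_path, newline="") as handle:
    written = list(csv.reader(handle))
print(written)
```

Because only one chunk is held in memory at a time, the size of the output file is not limited by available RAM.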

In [None]:
# a chunked reader is exhausted after one pass, so it is re-created for every n;
# the output names match the notes_<n>grams.csv files read in the labeling section
for n in range(1, 8):
    notes_text = pd.read_csv('./data/train/notes_cleaned.csv', low_memory=False, chunksize = 10000, index_col = False)
    get_ngrams(f'./data/train/ngrams/notes_{n}grams.csv', n, notes_text)

<h2>Labeling Functions</h2>

As a next step we load a text file containing a list of the most important clinical departments for our first labeling function. The list was compiled from several sources, which are also given in the text file. The content of the file is transformed into a list of n-grams, each denoting a clinical department. This list is then matched against our n-grams from the MIMIC dataset, and every exact match is assigned the label "DEP".
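
The matching step described above can be sketched on toy data as follows. The function names, the keywords and the bigram rows are assumptions for illustration; they are not taken from the real keyword file or from MIMIC III.

```python
def keywords_of_length(n, keywords):
    """Keep only the keywords that consist of exactly n space-separated words."""
    return {kw for kw in keywords if kw.count(" ") == n - 1}


def label_ngrams(ngram_rows, label, keywords):
    """Return (ngram, note_index, label) for every n-gram that exactly matches a keyword."""
    return [(gram, idx, label) for gram, idx in ngram_rows if gram in keywords]


# Hypothetical toy data.
keywords = ["surgery", "intensive care", "infection control"]
bigram_rows = [
    ("admitted intensive", 1),
    ("intensive care", 1),
    ("care unit", 1),
    ("infection control", 4),
]

matches = label_ngrams(bigram_rows, "DEP", keywords_of_length(2, keywords))
print(matches)
```

Restricting the keyword list to entries with n words before matching keeps the lookup set small, and using a set makes each membership test a constant-time operation.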

In [2]:
departments = {}
with open("./data/keywords/departments_list.txt") as data_file:
    for line in data_file:
        if line.strip() == '': # stop reading at the first blank line
            break
        # the part before the first colon is the department name, the rest are its keywords
        key, _, raw = line.partition(':')
        # keywords are separated by ',' or ';'
        departments[key] = [i.strip() for i in raw.replace(';', ',').split(',')]

departments

{'Department of Admissions': ['Admission',
  'Admissions',
  'Admitting Department'],
 'Department of Anaesthesia, Intensive Care Medicine and Pain Medicine': ['Anesthetics',
  'Anesthesiology'],
 'Department of Blood Group Serology and Transfusion Medicine': ['Serology',
  'Transfusion Medicine'],
 'Department of Cardiac Surgery': ['Cardiology'],
 'Department of Clinical Pathology': ['Clinical Pathology',
  'Medical Laboratory'],
 'Department of Dermatology': ['Dermatology'],
 'Department of Ear, Nose and Throat Diseases': ['Otolaryngology',
  'ENT Department',
  'Ear',
  'Nose and Throat Diseases'],
 'Department of Emergency Medicine': ['Accident and emergency',
  'A&E',
  'Casualty Department'],
 'Department of Gastroenterology': ['Gastroenterology'],
 'Department of General Surgery': ['General Surgery', 'Surgery'],
 'Department of Geriatry': ['Geriatric Department', 'Geriatrics'],
 'Department of Haematology': ['Haematology'],
 'Department of Hospital Hygiene and Infection Control'

In [3]:
departments_keywords = list(departments.values())
departments_keywords = [item.lower() for sublist in departments_keywords for item in sublist]

In [4]:
departments_keywords

['admission',
 'admissions',
 'admitting department',
 'anesthetics',
 'anesthesiology',
 'serology',
 'transfusion medicine',
 'cardiology',
 'clinical pathology',
 'medical laboratory',
 'dermatology',
 'otolaryngology',
 'ent department',
 'ear',
 'nose and throat diseases',
 'accident and emergency',
 'a&e',
 'casualty department',
 'gastroenterology',
 'general surgery',
 'surgery',
 'geriatric department',
 'geriatrics',
 'haematology',
 'central sterile services department',
 'cssd',
 'sterile processing department',
 'spd',
 'sterile processing',
 'infection control',
 'pharmacy',
 'medicine department',
 'neurology',
 'nursing department',
 'nutrition department',
 'dietetics',
 'gynaecology',
 'gynecology',
 'obstetrics',
 'ophthalmology',
 'optometry',
 'oral surgery',
 'maxillofacial surgery',
 'orthopedics',
 'orthopaedics',
 'pediatrics',
 'paediatrics',
 'plastic surgery department',
 'aesthetic surgery department',
 'reconstructive surgery department',
 'physiotherapy',

In [5]:
# returns the keywords from input_ls that consist of exactly n words

def ngrams_keywords(n, input_ls):
    return [i for i in input_ls if i.count(' ') == n-1]

In [6]:
# labeling function takes a chunked dataframe reader, a label and a list of keywords,
# and writes every row whose n-gram exactly matches a keyword, together with the label, to file_path

def labeling_function(file_path, label, df, keyword_list):
    try:
        os.remove(file_path)
    except OSError:
        pass

    keywords = set(keyword_list) # set lookup is much faster than scanning a list for every row

    for subset in df:
        subset = subset.dropna().drop_duplicates()
        subset = subset[subset['ngrams'].isin(keywords)].copy() # keep exact matches only
        subset['label'] = label
        if not os.path.isfile(file_path):
            subset.to_csv(file_path, index=False)
        else:
            subset.to_csv(file_path, index=False, mode='a', header=False)

In [7]:
notes_ngrams = pd.read_csv('./data/train/ngrams/notes_1grams.csv', low_memory=False, chunksize = 10000, index_col = False)

In [49]:
departments_keywords_unigrams = ngrams_keywords(1, departments_keywords)

In [50]:
labeling_function('./data/train/ngrams/notes_1grams_DEP.csv', 'DEP', notes_ngrams, departments_keywords_unigrams)

In [7]:
notes_ngrams = pd.read_csv('./data/train/ngrams/notes_2grams.csv', low_memory=False, chunksize = 10000, index_col = False)

In [8]:
departments_keywords_2grams = ngrams_keywords(2, departments_keywords)

In [9]:
labeling_function('./data/train/ngrams/notes_2grams_DEP.csv', 'DEP', notes_ngrams, departments_keywords_2grams)

In [None]:
notes_ngrams = pd.read_csv('./data/train/ngrams/notes_3grams.csv', low_memory=False, chunksize = 10000, index_col = False)

In [None]:
departments_keywords_3grams = ngrams_keywords(3, departments_keywords)

In [None]:
labeling_function('./data/train/ngrams/notes_3grams_DEP.csv', 'DEP', notes_ngrams, departments_keywords_3grams)

In [11]:
notes_2grams_matched = pd.read_csv('./data/train/ngrams/notes_2grams_DEP.csv', low_memory=False, index_col = False)

In [12]:
notes_2grams_matched = notes_2grams_matched.drop_duplicates()
# one row per word of each matched keyword
notes_2grams_matched['words'] = notes_2grams_matched['ngrams'].str.split(' ')
notes_2grams_matched = notes_2grams_matched.explode('words')

# move the new 'words' column to the front
cols = notes_2grams_matched.columns.tolist()
cols = cols[-1:] + cols[:-1]

notes_2grams_matched = notes_2grams_matched[cols]

notes_2grams_matched = notes_2grams_matched.rename(columns={'ngrams':'keywords'})

notes_2grams_matched.to_csv('./data/train/ngrams/notes_ngrams_DEP.csv', index=False)

In [13]:
notes_2grams_matched

Unnamed: 0,words,keywords,index_notes,label
0,intensive,intensive care,1,DEP
0,care,intensive care,1,DEP
1,intensive,intensive care,3,DEP
1,care,intensive care,3,DEP
2,operating,operating room,5,DEP
...,...,...,...,...
39563,control,infection control,2078792,DEP
39564,critical,critical care,2080809,DEP
39564,care,critical care,2080809,DEP
39565,intensive,intensive care,2081655,DEP


In [6]:

# file_path = './data/train/ngrams/notes_ngrams_DEP.csv'

# try:
#     os.remove(file_path)
# except OSError:
#     pass

# for subset in notes_2grams_matched:
#     subset = subset.drop_duplicates()
#     subset['words'] = subset['ngrams'].str.split(' ')
#     subset = subset.explode('words')

#     cols = subset.columns.tolist()
#     cols = cols[-1:] + cols[:-1]

#     subset = subset[cols]

#     subset = subset.rename(columns={'ngrams':'keywords'})

#     if not os.path.isfile(file_path):
#         subset.to_csv(file_path, index=False)
#     else:
#         subset.to_csv(file_path, index=False, mode='a', header=False)

In [17]:
# TODO:
# - make knodle matrices
# - preprocess test data
# - feed everything into knodle

<h2>Knodle Matrices</h2>