<h1>Task 2 Data Wrangling</h1>

In [1]:
import pandas as pd
import os
from nltk.corpus import stopwords
from nltk.util import ngrams
import string

pd.set_option('display.max_rows', 10)

<h2>MIMIC III</h2>

First, we load the MIMIC III dataset, which serves as our training/development dataset. Before the data can be used to build our weak supervision models, some columns still need preprocessing.

In [2]:
df_notes = pd.read_csv('./data/train/mimic-iii-clinical-database-1.4/NOTEEVENTS.csv', low_memory=False)

In [5]:
df_notes.shape

(2083180, 11)

In [5]:
df_notes.head()

Unnamed: 0,ROW_ID,SUBJECT_ID,HADM_ID,CHARTDATE,CHARTTIME,STORETIME,CATEGORY,DESCRIPTION,CGID,ISERROR,TEXT
0,174,22532,167853.0,2151-08-04,,,Discharge summary,Report,,,Admission Date: [**2151-7-16**] Dischar...
1,175,13702,107527.0,2118-06-14,,,Discharge summary,Report,,,Admission Date: [**2118-6-2**] Discharg...
2,176,13702,167118.0,2119-05-25,,,Discharge summary,Report,,,Admission Date: [**2119-5-4**] D...
3,177,13702,196489.0,2124-08-18,,,Discharge summary,Report,,,Admission Date: [**2124-7-21**] ...
4,178,26880,135453.0,2162-03-25,,,Discharge summary,Report,,,Admission Date: [**2162-3-3**] D...


<h2>Preprocessing</h2>

In [15]:
notes_text = df_notes.TEXT

In [16]:
notes_text = notes_text.str.replace(r'\[\*\*.*?\*\*\]', '', regex=True) # remove de-identification placeholders such as [**2151-7-16**]

In [17]:
punctuation = string.punctuation.replace('/', '')
punctuation = punctuation.replace('\\', '')

notes_text = notes_text.str.translate(str.maketrans('', '', punctuation))
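To illustrate the two cleaning steps above on a single string, the following sketch uses a made-up note fragment (not taken from MIMIC III) and assumes the intent is to drop everything between `[**` and `**]`. It relies only on the standard library, so it mirrors the pandas calls rather than reproducing them.

```python
import re
import string

# Made-up fragment in the style of a de-identified note (an assumption for illustration).
sample = "Admission Date: [**2151-7-16**] Pt c/o chest pain, seen by Dr. [**Last Name**]."

# Remove de-identification placeholders of the form [** ... **].
without_tags = re.sub(r"\[\*\*.*?\*\*\]", "", sample)

# Strip all punctuation except '/' and '\', as in the cell above.
punctuation = string.punctuation.replace("/", "").replace("\\", "")
cleaned = without_tags.translate(str.maketrans("", "", punctuation))

print(cleaned.split())
```

Abbreviations such as `c/o` survive because the slash is excluded from the punctuation table, while the placeholders and the remaining punctuation are gone.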

In [31]:
stop_words = set(stopwords.words('english')) # separate name so the imported nltk `stopwords` module is not shadowed
stop_words -= {'and', 'or'} # keep 'and' and 'or' in the notes

In [32]:
notes_text = notes_text.apply(lambda x: ' '.join(word for word in x.lower().split() if word not in stop_words))
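A small sketch of the stopword filter on one sentence. It uses a hand-picked stopword set so that it runs without downloading the NLTK corpus; that set and the helper name `remove_stopwords` are assumptions for illustration, and only the exclusion of 'and' and 'or' follows the cell above.

```python
# Hypothetical mini stopword list; the notebook uses NLTK's English list instead.
stop_words = {"the", "was", "to", "a", "of", "and", "or"}

# 'and' and 'or' are taken out of the stopword list, so they stay in the text.
stop_words -= {"and", "or"}

def remove_stopwords(text, stop_words):
    """Lowercase the text and drop every token found in stop_words."""
    return " ".join(word for word in text.lower().split() if word not in stop_words)

print(remove_stopwords("The patient was transferred to Accident and Emergency", stop_words))
```

Keeping 'and' means that multi-word keywords such as 'accident and emergency' can still be found in the cleaned notes.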

In [17]:
notes_text.to_csv('./data/train/notes_cleaned.csv', index=False)

In [2]:
chunksize = 10000
notes_text = pd.read_csv('./data/train/notes_cleaned.csv', low_memory=False, chunksize = chunksize, index_col = False)

To feed the MIMIC III data to our labeling functions, we opted for an n-gram approach: every note is split into its word n-grams, and each n-gram becomes one row. As a starting point we use n = 3, which can still be adapted if a given subtask requires fine-tuning.
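As a minimal sketch of this splitting, the hypothetical helper `word_ngrams` below builds the n-grams with plain Python. For a list of tokens it is meant to produce the same windows as `nltk.util.ngrams`, joined into strings; the example note is made up.

```python
def word_ngrams(text, n=3):
    """Return all runs of n consecutive whitespace-separated tokens in text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

note = "patient admitted cardiology chest pain"
for row in word_ngrams(note):
    print(row)
```

A note with k tokens yields k - n + 1 rows, so a note with fewer than n tokens contributes no rows at all.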

In [3]:
# processing ngrams in chunks for performance reasons

file_path = './data/train/notes_preprocessed.csv'

for subset in notes_text:
    subset = subset.dropna()
    # build one row per trigram; notes with fewer than three tokens yield no rows
    trigrams = (subset['TEXT'].str.split()
                .apply(lambda tokens: [' '.join(gram) for gram in ngrams(tokens, n=3)])
                .explode(ignore_index=True)
                .dropna()
                .to_frame('TEXT'))
    # append to the output file and write the header only when the file is created
    trigrams.to_csv(file_path, index=False, mode='a', header=not os.path.isfile(file_path))

<h2>Labeling Functions</h2>

As a next step, we load a text file containing a list of the most important clinical departments. The list was scraped from several sources, which are also given in the text file. The content of the file is transformed into a dictionary in which each key denotes a clinical department and the corresponding value is a list of possible names and acronyms for that department. This dictionary will later be used as an input for our labeling functions.

In [12]:
departments = {}
with open("./data/keywords/departments_list.txt") as data_file:
    for line in data_file:
        if line.strip() == '': # stop reading at the first blank line
            break
        # text before the first ':' is the key; the rest lists names separated by ',' or ';'
        key, _, names = line.partition(':')
        departments[key] = [name.strip() for name in names.replace(';', ',').split(',')]

departments

{'Department of Admissions': ['Admission',
  'Admissions',
  'Admitting Department'],
 'Department of Anaesthesia, Intensive Care Medicine and Pain Medicine': ['Anesthetics',
  'Anesthesiology'],
 'Department of Blood Group Serology and Transfusion Medicine': ['Serology',
  'Transfusion Medicine'],
 'Department of Cardiac Surgery': ['Cardiology'],
 'Department of Clinical Pathology': ['Clinical Pathology',
  'Medical Laboratory'],
 'Department of Dermatology': ['Dermatology'],
 'Department of Ear, Nose and Throat Diseases': ['Otolaryngology',
  'ENT Department',
  'Ear',
  'Nose and Throat Diseases'],
 'Department of Emergency Medicine': ['Accident and emergency',
  'A&E',
  'Casualty Department'],
 'Department of Gastroenterology': ['Gastroenterology'],
 'Department of General Surgery': ['General Surgery', 'Surgery'],
 'Department of Geriatry': ['Geriatric Department', 'Geriatrics'],
 'Department of Haematology': ['Haematology'],
 'Department of Hospital Hygiene and Infection Control'

In [13]:
departments_keywords = list(departments.values())
departments_keywords = [item.lower() for sublist in departments_keywords for item in sublist]
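How the flattened keyword list might feed a labeling function can be sketched as follows. This is an assumption about the later steps, not the notebook's implementation: the function name `lf_department`, the label constants, and the shortened keyword list are hypothetical, and no weak supervision library is used. Matching on word boundaries avoids hits inside longer words, for example 'ear' inside 'year'.

```python
import re

DEPARTMENT, ABSTAIN = 1, -1  # hypothetical label values

# Short excerpt of the lowercased keyword list, for illustration only.
departments_keywords = ["cardiology", "ear", "a&e", "general surgery"]

# One alternation pattern; the lookarounds act as word boundaries and also work for 'a&e'.
keyword_pattern = re.compile(
    r"(?<!\w)(?:" + "|".join(re.escape(k) for k in departments_keywords) + r")(?!\w)"
)

def lf_department(ngram):
    """Return DEPARTMENT if the n-gram mentions a department keyword, else ABSTAIN."""
    return DEPARTMENT if keyword_pattern.search(ngram.lower()) else ABSTAIN

print(lf_department("transferred to cardiology"))
print(lf_department("last year patient"))
```

Because the preprocessing above strips '&' along with the other punctuation, a keyword such as 'a&e' would only match text that kept its punctuation.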

In [18]:
departments_keywords

['admission',
 'admissions',
 'admitting department',
 'anesthetics',
 'anesthesiology',
 'serology',
 'transfusion medicine',
 'cardiology',
 'clinical pathology',
 'medical laboratory',
 'dermatology',
 'otolaryngology',
 'ent department',
 'ear',
 'nose and throat diseases',
 'accident and emergency',
 'a&e',
 'casualty department',
 'gastroenterology',
 'general surgery',
 'surgery',
 'geriatric department',
 'geriatrics',
 'haematology',
 'central sterile services department',
 'cssd',
 'sterile processing department',
 'spd',
 'sterile processing',
 'infection control',
 'pharmacy',
 'medicine department',
 'neurology',
 'nursing department',
 'nutrition department',
 'dietetics',
 'gynaecology',
 'gynecology',
 'obstetrics',
 'ophthalmology',
 'optometry',
 'oral surgery',
 'maxillofacial surgery',
 'orthopedics',
 'orthopaedics',
 'pediatrics',
 'paediatrics',
 'plastic surgery department',
 'aesthetic surgery department',
 'reconstructive surgery department',
 'physiotherapy',