-process_mimic.py: This script processes MIMIC-III CSV files to generate a dataset to run Med2Vec.
-README.md: Added instructions to use process_mimic.py
mp2893 · Mar 12, 2017 · 69d40c7
1 parent 4cafce2
commit 69d40c7

Showing 2 changed files with 183 additions and 3 deletions.
diff --git a/README.md b/README.md
@@ -21,7 +21,30 @@ Med2Vec implements an algorithm introduced in the following:
 
 3. Download/clone the Med2Vec code  
 
-**STEP 2: Preparing training data**  
+**STEP 2: Fast way to test Med2Vec with MIMIC-III**
+This step describes how to run Med2Vec on MIMIC-III in as few steps as possible.
+
+0. You will first need to request access to [MIMIC-III](https://mimic.physionet.org/gettingstarted/access/), a publicly available dataset of electronic health records collected from ICU patients over 11 years.
+
+1. You can use "process_mimic.py" to process the MIMIC-III dataset and generate a training dataset suitable for Med2Vec.
+Place the script in the same directory as the MIMIC-III CSV files and run it.
+The execution command is `python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv <output file>`.
+Further instructions are given in the comments at the top of the script.
+
+2. Run Med2Vec using the ".seqs" file generated by process_mimic.py, using the following command.
+`python med2vec.py <seqs file> 4894 <output path>`
+where 4894 is the number of unique ICD9 diagnosis codes in the dataset.
+As described in the paper, however, it is a good idea to use the grouped codes for training the Softmax component of Med2Vec. Therefore we recommend using the following command instead.
+`python med2vec.py <seqs file> 4894 <output path> --label_file <3digitICD9.seqs file> --n_output_codes 942`
+where 942 is the number of unique 3-digit ICD9 diagnosis codes in the dataset.
+If you are interested in learning the representation of 3-digit ICD9 codes only, you can also train directly on the ".3digitICD9.seqs" file, using the following command.
+`python med2vec.py <3digitICD9.seqs file> 942 <output path>`
+
+3. As suggested in STEP 4, you might want to adjust the hyper-parameters.
+I recommend decreasing `--batch_size` to 100 or so, since the default value of 1,000 is too large for the small number of patients in the MIMIC-III dataset.
+Only 7500 patients have more than one visit, and most of them have only two.
+
+**STEP 3: Preparing training data**  
 
 1. Med2Vec training data need to be a Python Pickled list of list of medical codes (e.g. diagnosis codes, medication codes, or procedure codes). 
 First, medical codes need to be converted to an integer. Then a single visit can be converted as a list of integers. 
@@ -51,7 +74,7 @@ We will refer to this file as the "demo file".
 6. Similar to step 2, you will need to remember the size of the demographics vector if you plan to use the demo file. 
 In the example of step 5, the size of the demographics vector is 7.
 
-**STEP 3: Running Med2Vec**  
+**STEP 4: Running Med2Vec**  
 
 1. The minimum input you need to run Med2Vec is the visit file, the number of unique medical codes and the output path
 `python med2vec <path/to/visit_file> <the number of unique medical codes> <path/to/output>`  
@@ -60,7 +83,7 @@ In the example of step 5, the size of the demographics vector is 7.
 
 3. Additional options can be specified such as the size of the code representation, the size of the visit representation and the number of epochs. Detailed information can be accessed by `python med2vec --help`
 
-**STEP 4: Looking at your results**  
+**STEP 5: Looking at your results**  
 
 Med2Vec produces a model file after each epoch. The model file is generated by [numpy.savez_compressed](http://docs.scipy.org/doc/numpy-1.10.1/reference/generated/numpy.savez_compressed.html).
 

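The README changes above pass the number of unique codes (4894, or 942 for the 3-digit file) to `med2vec.py` by hand. A minimal sketch of how such a count could be checked against a ".seqs" file follows. It is written for Python 3 and uses toy data, not MIMIC-III; the helper name `count_unique_codes` is hypothetical and not part of this commit.

```python
import os
import pickle
import tempfile


def count_unique_codes(seqs):
    """Count distinct integer codes in a flat visit list.

    The [-1] entries are patient separators, not medical codes,
    so they are skipped. (Hypothetical helper, for illustration only.)
    """
    codes = set()
    for visit in seqs:
        if visit == [-1]:
            continue
        codes.update(visit)
    return len(codes)


# Toy data in the same layout as a ".seqs" file: two patients,
# two visits each, separated by [-1].
toy_seqs = [[0, 1, 2], [1, 3], [-1], [2, 4], [0, 4, 5]]

# Round-trip through a pickle file, as one would with a real ".seqs" file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toy.seqs')
    with open(path, 'wb') as f:
        pickle.dump(toy_seqs, f, protocol=2)
    with open(path, 'rb') as f:
        loaded = pickle.load(f)

n_codes = count_unique_codes(loaded)
print(n_codes)
```

Because process_mimic.py assigns each new code the id `len(types)`, the same number should also equal the length of the dictionary stored in the matching ".types" file.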
diff --git a/process_mimic.py b/process_mimic.py
@@ -0,0 +1,157 @@
+# This script processes the MIMIC-III dataset and builds longitudinal diagnosis records for patients with at least two visits.
+# The output data are cPickled, and suitable for training Med2Vec
+# Written by Edward Choi (mp2893@gatech.edu)
+# Usage: Put this script in the folder where the MIMIC-III CSV files are located. Then execute the command below.
+# python process_mimic.py ADMISSIONS.csv DIAGNOSES_ICD.csv <output file> 
+
+# Output files
+# <output file>.seqs: Dataset that follows the format described in the README.md.
+# <output file>.types: Python dictionary that maps string diagnosis codes to integer diagnosis codes.
+# <output file>.3digitICD9.seqs: Dataset that follows the format described in the README.md. This uses only the first 3 digits of the ICD9 diagnosis code.
+# <output file>.3digitICD9.types: Python dictionary that maps 3-digit string diagnosis codes to integer diagnosis codes.
+
+import sys
+import cPickle as pickle
+from datetime import datetime
+
+def convert_to_icd9(dxStr):
+	if dxStr.startswith('E'):
+		if len(dxStr) > 4: return dxStr[:4] + '.' + dxStr[4:]
+		else: return dxStr
+	else:
+		if len(dxStr) > 3: return dxStr[:3] + '.' + dxStr[3:]
+		else: return dxStr
+
+def convert_to_3digit_icd9(dxStr):
+	if dxStr.startswith('E'):
+		if len(dxStr) > 4: return dxStr[:4]
+		else: return dxStr
+	else:
+		if len(dxStr) > 3: return dxStr[:3]
+		else: return dxStr
+
+if __name__ == '__main__':
+	admissionFile = sys.argv[1]
+	diagnosisFile = sys.argv[2]
+	outFile = sys.argv[3]
+
+	print 'Building pid-admission mapping, admission-date mapping'
+	pidAdmMap = {}
+	admDateMap = {}
+	infd = open(admissionFile, 'r')
+	infd.readline()
+	for line in infd:
+		tokens = line.strip().split(',')
+		pid = int(tokens[1])
+		admId = int(tokens[2])
+		admTime = datetime.strptime(tokens[3], '%Y-%m-%d %H:%M:%S')
+		admDateMap[admId] = admTime
+		if pid in pidAdmMap: pidAdmMap[pid].append(admId)
+		else: pidAdmMap[pid] = [admId]
+	infd.close()
+
+	print 'Building admission-dxList mapping'
+	admDxMap = {}
+	admDxMap_3digit = {}
+	infd = open(diagnosisFile, 'r')
+	infd.readline()
+	for line in infd:
+		tokens = line.strip().split(',')
+		admId = int(tokens[2])
+		dxStr = 'D_' + convert_to_icd9(tokens[4][1:-1]) # Full ICD9 code with the decimal point restored
+		dxStr_3digit = 'D_' + convert_to_3digit_icd9(tokens[4][1:-1])
+
+		if admId in admDxMap: 
+			admDxMap[admId].append(dxStr)
+		else: 
+			admDxMap[admId] = [dxStr]
+
+		if admId in admDxMap_3digit: 
+			admDxMap_3digit[admId].append(dxStr_3digit)
+		else: 
+			admDxMap_3digit[admId] = [dxStr_3digit]
+	infd.close()
+
+	print 'Building pid-sortedVisits mapping'
+	pidSeqMap = {}
+	pidSeqMap_3digit = {}
+	for pid, admIdList in pidAdmMap.iteritems():
+		if len(admIdList) < 2: continue
+
+		sortedList = sorted([(admDateMap[admId], admDxMap[admId]) for admId in admIdList])
+		pidSeqMap[pid] = sortedList
+
+		sortedList_3digit = sorted([(admDateMap[admId], admDxMap_3digit[admId]) for admId in admIdList])
+		pidSeqMap_3digit[pid] = sortedList_3digit
+
+	print 'Building pids, dates, strSeqs'
+	pids = []
+	dates = []
+	seqs = []
+	for pid, visits in pidSeqMap.iteritems():
+		pids.append(pid)
+		seq = []
+		date = []
+		for visit in visits:
+			date.append(visit[0])
+			seq.append(visit[1])
+		dates.append(date)
+		seqs.append(seq)
+
+	print 'Building pids, dates, strSeqs for 3digit ICD9 code'
+	seqs_3digit = []
+	for pid, visits in pidSeqMap_3digit.iteritems():
+		seq = []
+		for visit in visits:
+			seq.append(visit[1])
+		seqs_3digit.append(seq)
+
+	print 'Converting strSeqs to intSeqs, and making types'
+	types = {}
+	newSeqs = []
+	for patient in seqs:
+		newPatient = []
+		for visit in patient:
+			newVisit = []
+			for code in visit:
+				if code in types:
+					newVisit.append(types[code])
+				else:
+					types[code] = len(types)
+					newVisit.append(types[code])
+			newPatient.append(newVisit)
+		newSeqs.append(newPatient)
+
+	print 'Converting strSeqs to intSeqs, and making types for 3digit ICD9 code'
+	types_3digit = {}
+	newSeqs_3digit = []
+	for patient in seqs_3digit:
+		newPatient = []
+		for visit in patient:
+			newVisit = []
+			for code in set(visit):
+				if code in types_3digit:
+					newVisit.append(types_3digit[code])
+				else:
+					types_3digit[code] = len(types_3digit)
+					newVisit.append(types_3digit[code])
+			newPatient.append(newVisit)
+		newSeqs_3digit.append(newPatient)
+
+	print 'Re-formatting to Med2Vec dataset'
+	seqs = []
+	for patient in newSeqs:
+		seqs.extend(patient)
+		seqs.append([-1])
+	seqs = seqs[:-1]
+
+	seqs_3digit = []
+	for patient in newSeqs_3digit:
+		seqs_3digit.extend(patient)
+		seqs_3digit.append([-1])
+	seqs_3digit = seqs_3digit[:-1]
+
+	pickle.dump(seqs, open(outFile+'.seqs', 'wb'), -1)
+	pickle.dump(types, open(outFile+'.types', 'wb'), -1)
+	pickle.dump(seqs_3digit, open(outFile+'.3digitICD9.seqs', 'wb'), -1)
+	pickle.dump(types_3digit, open(outFile+'.3digitICD9.types', 'wb'), -1)
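
The final re-formatting step in process_mimic.py concatenates every patient's visits into one flat list and places `[-1]` between patients. The sketch below isolates that step and shows how the flat list could be split back into per-patient sequences. It is written for Python 3 with toy data; `flatten_patients` and `split_patients` are hypothetical names that do not appear in the script.

```python
def flatten_patients(patients):
    """Concatenate per-patient visit lists, separated by [-1].

    Mirrors the 'Re-formatting to Med2Vec dataset' loop: a separator is
    appended after each patient and the trailing one is dropped.
    """
    flat = []
    for visits in patients:
        flat.extend(visits)
        flat.append([-1])
    return flat[:-1]


def split_patients(flat):
    """Inverse of flatten_patients for a non-empty patient list."""
    patients = []
    current = []
    for visit in flat:
        if visit == [-1]:
            # A separator closes the current patient's sequence.
            patients.append(current)
            current = []
        else:
            current.append(visit)
    patients.append(current)
    return patients


# Toy data: patient A has two visits, patient B has three.
toy_patients = [[[0, 1], [2]], [[1, 3], [0], [4]]]

flat = flatten_patients(toy_patients)
restored = split_patients(flat)
print(flat)
```

Splitting on the separator is useful when inspecting the output, for example to confirm that every patient in the file has at least two visits.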