# Reformatting Data Files
`localglobalembed.py` reads in a dataset for *one* abbreviation and reformats it as desired for training. Our dataset consists of a set of `AbbrRep` objects, each of which has attributes for its label, its fastText embedding, its original source, and its surrounding word context.
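
As a rough illustration of the kind of record described above, here is a minimal sketch. The class name `AbbrRepSketch` and all field names and types are assumptions made for illustration; the actual `AbbrRep` class may be laid out differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AbbrRepSketch:
    """Hypothetical stand-in for AbbrRep; field names and types are assumptions."""
    label: str              # the sense assigned to this occurrence
    embedding: List[float]  # fastText embedding
    source: str             # original source of the sample
    context: List[str]      # surrounding words


# Toy record with made-up values, only to show the shape of the data.
example = AbbrRepSketch(
    label="sense_0",
    embedding=[0.1, -0.2, 0.3],
    source="mimic",
    context=["patient", "has", "elevated", "levels"],
)
```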

We can use `localglobalembed.py` to reformat our dataset with a different local context length (the `-window` argument) and to choose whether the local context is kept variable length (the `-variable_local` argument) or aggregated into a fixed-size vector.
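
The two options can be pictured with the following sketch. It is not the code in `localglobalembed.py`: the helper names are hypothetical, the window is assumed to count words on each side of the abbreviation, and the aggregation is assumed to be a plain average.

```python
def local_context(tokens, center, window):
    """Return up to `window` tokens on each side of position `center`."""
    left = tokens[max(0, center - window):center]
    right = tokens[center + 1:center + 1 + window]
    return left + right


def embed_context(words, vectors, dim, variable_local):
    """Keep one vector per word (variable length) or average them (fixed size)."""
    rows = [vectors[w] for w in words if w in vectors]
    if variable_local:
        return rows  # list of per-word vectors; length depends on the context
    if not rows:
        return [0.0] * dim  # assumption: an empty context maps to a zero vector
    return [sum(row[i] for row in rows) / len(rows) for i in range(dim)]


# Toy sentence and 2-dimensional vectors, made up for illustration.
tokens = ["patient", "has", "elevated", "fsh", "levels", "today"]
toy_vectors = {"elevated": [1.0, 0.0], "levels": [3.0, 2.0]}

words = local_context(tokens, center=3, window=1)
variable = embed_context(words, toy_vectors, dim=2, variable_local=True)
fixed = embed_context(words, toy_vectors, dim=2, variable_local=False)
```

The variable-length form keeps word order and position, while the averaged form always has `dim` entries regardless of how many context words were found.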

`create_var_embeddings.py` is a Python shell script which calls `localglobalembed.py` to reformat datasets for all abbreviations.

# Convolutional Neural Network
The file `train_cnn_models.py` is a Python shell script which trains a CNN model (defined in `cnn_model.py`) on each abbreviation.

### Reformatting Data for CNNs
`cnn_model.py` is designed to train on variable-length embeddings. The following calls generate the embedding files that the CNN can train on, for context lengths of 3, 5, and 8. Note that each reformatting takes up a significant amount of disk space, so one typically reformats the data for context length 3, runs all experiments on it, deletes the reformatted data, then reformats the original data for context length 5, runs those experiments, and so on.

In [None]:
!python create_var_embeddings.py -prefix="var_w3_" -window=3 -variable_local -g

In [None]:
!python create_var_embeddings.py -prefix="var_w5_" -window=5 -variable_local -g

In [None]:
!python create_var_embeddings.py -prefix="var_w8_" -window=8 -variable_local -g
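
The reformat, train, delete cycle described above could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the command-line flags are copied from the calls in this notebook, and the cleanup step assumes the reformatted files sit in `data_dir` and match the pattern `var_w<window>_*.pickle`, which should be checked before deleting anything.

```python
import glob
import os
import subprocess


def commands_for_window(window, sample_caps=(100, 500, 1000), num_epochs=50):
    """Build the reformat command and the (train command, log name) pairs."""
    prefix = f"var_w{window}"
    reformat = ["python", "create_var_embeddings.py", f"-prefix={prefix}_",
                f"-window={window}", "-variable_local", "-g"]
    train = [
        (["python", "train_cnn_models.py", f"-prefix={prefix}",
          f"-num_epochs={num_epochs}", f"-ns={ns}"],
         f"ne{num_epochs}_ns{ns}_w{window}.txt")
        for ns in sample_caps
    ]
    return reformat, train


def run_window(window, data_dir="."):
    """Reformat, train at each sample cap, then delete the reformatted files."""
    reformat, train = commands_for_window(window)
    subprocess.run(reformat, check=True)
    for command, log_name in train:
        with open(log_name, "w") as log:
            subprocess.run(command, check=True, stdout=log)
    # Assumption: reformatted files live in data_dir and start with the prefix.
    for path in glob.glob(os.path.join(data_dir, f"var_w{window}_*.pickle")):
        os.remove(path)
```

Calling `run_window(3)`, then `run_window(5)`, then `run_window(8)` would keep only one reformatted copy of the data on disk at a time.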

### Training CNNs
The following commands train CNN models on each of the above reformatted datasets, for varying maximum numbers of samples per label. The corresponding output is listed immediately below each command.

In [None]:
!python train_cnn_models.py -prefix="var_w3" -num_epochs=50 -ns=100 > ne50_ns100_w3.txt

In [1]:
!cat ne50_ns100_w3.txt

Dataset: var_w3_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 2.8601, Val Loss: 2.4698
Epoch: 5, Train Loss: 0.8711, Val Loss: 0.9437
Epoch: 10, Train Loss: 0.4320, Val Loss: 0.6393
Epoch: 15, Train Loss: 0.2397, Val Loss: 0.5323
Epoch: 20, Train Loss: 0.1550, Val Loss: 0.4858
Epoch: 25, Train Loss: 0.1023, Val Loss: 0.4607
Epoch: 30, Train Loss: 0.0698, Val Loss: 0.4458
Epoch: 35, Train Loss: 0.0522, Val Loss: 0.4470
Epoch: 40, Train Loss: 0.0378, Val Loss: 0.4518
Epoch: 45, Train Loss: 0.0260, Val Loss: 0.4463
Histogram of MIMIC train labels:	 ['0.06', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.07', '0.00', '0.06', '0.00', '0.07', '0.06', '0.01', '0.06', '0.01', '0.00', '0.06', '0.00', '0.06', '0.06', '0.06', '0.06', '0.00', '0.05', '0.05', '0.06', '0.00']
Histogram of MIMIC val labels:		 ['0.05', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.05', '0.00', '0.06', '0.00', '0.05', '0.07', '0.01', '0.07', '0.01', '0.00', '0.05', '0.00', '0.

In [None]:
!python train_cnn_models.py -prefix="var_w3" -num_epochs=50 -ns=500 > ne50_ns500_w3.txt

In [2]:
!cat ne50_ns500_w3.txt

Dataset: var_w3_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 1.8683, Val Loss: 1.0245
Epoch: 5, Train Loss: 0.3273, Val Loss: 0.3638
Epoch: 10, Train Loss: 0.1732, Val Loss: 0.3218
Epoch: 15, Train Loss: 0.1013, Val Loss: 0.3182
Epoch: 20, Train Loss: 0.0612, Val Loss: 0.3310
Epoch: 25, Train Loss: 0.0450, Val Loss: 0.3428
Epoch: 30, Train Loss: 0.0337, Val Loss: 0.3554
Epoch: 35, Train Loss: 0.0247, Val Loss: 0.3655
Epoch: 40, Train Loss: 0.0219, Val Loss: 0.3940
Epoch: 45, Train Loss: 0.0159, Val Loss: 0.3933
Histogram of MIMIC train labels:	 ['0.07', '0.00', '0.06', '0.00', '0.07', '0.00', '0.00', '0.00', '0.06', '0.00', '0.06', '0.00', '0.06', '0.05', '0.00', '0.04', '0.00', '0.00', '0.07', '0.00', '0.05', '0.07', '0.07', '0.07', '0.00', '0.06', '0.06', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.06', '0.00', '0.07', '0.00', '0.06', '0.00', '0.00', '0.00', '0.07', '0.00', '0.07', '0.00', '0.07', '0.06', '0.00', '0.05', '0.00', '0.00', '0.07', '0.00', '0.

In [None]:
!python train_cnn_models.py -prefix="var_w3" -num_epochs=50 -ns=1000 > ne50_ns1000_w3.txt

In [3]:
!cat ne50_ns1000_w3.txt

Dataset: var_w3_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 1.4449, Val Loss: 0.6478
Epoch: 5, Train Loss: 0.2359, Val Loss: 0.2923
Epoch: 10, Train Loss: 0.1261, Val Loss: 0.2668
Epoch: 15, Train Loss: 0.0800, Val Loss: 0.2743
Epoch: 20, Train Loss: 0.0515, Val Loss: 0.2913
Epoch: 25, Train Loss: 0.0339, Val Loss: 0.2912
Epoch: 30, Train Loss: 0.0286, Val Loss: 0.3044
Epoch: 35, Train Loss: 0.0245, Val Loss: 0.3238
Epoch: 40, Train Loss: 0.0228, Val Loss: 0.3358
Epoch: 45, Train Loss: 0.0193, Val Loss: 0.3380
Histogram of MIMIC train labels:	 ['0.08', '0.00', '0.07', '0.00', '0.08', '0.00', '0.00', '0.00', '0.08', '0.00', '0.07', '0.00', '0.08', '0.03', '0.00', '0.02', '0.00', '0.00', '0.07', '0.00', '0.03', '0.08', '0.08', '0.05', '0.00', '0.06', '0.04', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.07', '0.00', '0.07', '0.00', '0.07', '0.00', '0.00', '0.00', '0.08', '0.00', '0.06', '0.00', '0.07', '0.03', '0.00', '0.03', '0.00', '0.00', '0.08',

In [None]:
!python train_cnn_models.py -prefix="var_w5" -num_epochs=50 -ns=100 > ne50_ns100_w5.txt

In [4]:
!cat ne50_ns100_w5.txt

Dataset: var_fsh_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 0.9661, Val Loss: 1.0515
Epoch: 5, Train Loss: 0.3156, Val Loss: 0.8759
Epoch: 10, Train Loss: 0.1432, Val Loss: 0.8380
Epoch: 15, Train Loss: 0.0367, Val Loss: 0.7532
Epoch: 20, Train Loss: 0.0297, Val Loss: 0.6766
Epoch: 25, Train Loss: 0.0115, Val Loss: 0.6092
Epoch: 30, Train Loss: 0.0055, Val Loss: 0.5637
Epoch: 35, Train Loss: 0.0028, Val Loss: 0.5360
Epoch: 40, Train Loss: 0.0055, Val Loss: 0.5199
Epoch: 45, Train Loss: 0.0026, Val Loss: 0.5115
Histogram of MIMIC train labels:	 ['0.75', '0.25', '0.00']
Histogram of MIMIC val labels:		 ['0.33', '0.67', '0.00']
Histogram of CASI test labels:		 ['0.98', '0.02', '0.00']
Final MIMIC Train Accuracy:		1.0000 (4 of 4)
Final MIMIC Val Accuracy:		0.6667 (2 of 3)
CASI Accuracy:				0.9850 (262 of 266)
Dataset: var_ama_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 2.1559, Val Loss: 1.6820
Epoch: 5, Train Loss: 0.5439, Val Loss: 0.6

In [None]:
!python train_cnn_models.py -prefix="var_w5" -num_epochs=50 -ns=500 > ne50_ns500_w5.txt

In [5]:
!cat ne50_ns500_w5.txt

Dataset: var_fsh_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 1.0081, Val Loss: 1.0447
Epoch: 5, Train Loss: 0.3673, Val Loss: 0.9668
Epoch: 10, Train Loss: 0.1224, Val Loss: 0.9416
Epoch: 15, Train Loss: 0.0758, Val Loss: 0.8524
Epoch: 20, Train Loss: 0.0286, Val Loss: 0.7363
Epoch: 25, Train Loss: 0.0128, Val Loss: 0.6536
Epoch: 30, Train Loss: 0.0081, Val Loss: 0.5995
Epoch: 35, Train Loss: 0.0044, Val Loss: 0.5687
Epoch: 40, Train Loss: 0.0041, Val Loss: 0.5572
Epoch: 45, Train Loss: 0.0033, Val Loss: 0.5567
Histogram of MIMIC train labels:	 ['0.75', '0.25', '0.00']
Histogram of MIMIC val labels:		 ['0.33', '0.67', '0.00']
Histogram of CASI test labels:		 ['0.98', '0.02', '0.00']
Final MIMIC Train Accuracy:		1.0000 (4 of 4)
Final MIMIC Val Accuracy:		0.6667 (2 of 3)
CASI Accuracy:				0.9850 (262 of 266)
Dataset: var_ama_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 1.9855, Val Loss: 1.4986
Epoch: 5, Train Loss: 0.5124, Val Loss: 0.5

In [None]:
!python train_cnn_models.py -prefix="var_w5" -num_epochs=50 -ns=1000 > ne50_ns1000_w5.txt

In [6]:
!cat ne50_ns1000_w5.txt

Dataset: var_fsh_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 0.9399, Val Loss: 1.1930
Epoch: 5, Train Loss: 0.3709, Val Loss: 1.0673
Epoch: 10, Train Loss: 0.1284, Val Loss: 1.0115
Epoch: 15, Train Loss: 0.0540, Val Loss: 0.9062
Epoch: 20, Train Loss: 0.0212, Val Loss: 0.8112
Epoch: 25, Train Loss: 0.0139, Val Loss: 0.7380
Epoch: 30, Train Loss: 0.0106, Val Loss: 0.6923
Epoch: 35, Train Loss: 0.0050, Val Loss: 0.6623
Epoch: 40, Train Loss: 0.0051, Val Loss: 0.6438
Epoch: 45, Train Loss: 0.0026, Val Loss: 0.6387
Histogram of MIMIC train labels:	 ['0.75', '0.25', '0.00']
Histogram of MIMIC val labels:		 ['0.33', '0.67', '0.00']
Histogram of CASI test labels:		 ['0.98', '0.02', '0.00']
Final MIMIC Train Accuracy:		1.0000 (4 of 4)
Final MIMIC Val Accuracy:		0.6667 (2 of 3)
CASI Accuracy:				0.9850 (262 of 266)
Dataset: var_ama_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 2.0071, Val Loss: 1.4983
Epoch: 5, Train Loss: 0.5098, Val Loss: 0.5126
Epoch: 10, Trai

In [None]:
!python train_cnn_models.py -prefix="var_w8" -num_epochs=50 -ns=100 > ne50_ns100_w8.txt

In [7]:
!cat ne50_ns100_w8.txt

Dataset: var_w8_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 2.8648, Val Loss: 2.4685
Epoch: 5, Train Loss: 0.9728, Val Loss: 1.0433
Epoch: 10, Train Loss: 0.5277, Val Loss: 0.7419
Epoch: 15, Train Loss: 0.3052, Val Loss: 0.6265
Epoch: 20, Train Loss: 0.1897, Val Loss: 0.5738
Epoch: 25, Train Loss: 0.1221, Val Loss: 0.5472
Epoch: 30, Train Loss: 0.0811, Val Loss: 0.5347
Epoch: 35, Train Loss: 0.0606, Val Loss: 0.5289
Epoch: 40, Train Loss: 0.0445, Val Loss: 0.5238
Epoch: 45, Train Loss: 0.0355, Val Loss: 0.5188
Histogram of MIMIC train labels:	 ['0.06', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.07', '0.00', '0.06', '0.00', '0.07', '0.06', '0.01', '0.06', '0.01', '0.00', '0.06', '0.00', '0.06', '0.06', '0.06', '0.06', '0.00', '0.05', '0.05', '0.06', '0.00']
Histogram of MIMIC val labels:		 ['0.05', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.05', '0.00', '0.06', '0.00', '0.05', '0.07', '0.01', '0.07', '0.01', '0.00', '0.05', '0.00', '0.

In [None]:
!python train_cnn_models.py -prefix="var_w8" -num_epochs=50 -ns=500 > ne50_ns500_w8.txt

In [8]:
!cat ne50_ns500_w8.txt

Dataset: var_w8_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 1.9507, Val Loss: 1.1602
Epoch: 5, Train Loss: 0.3798, Val Loss: 0.3899
Epoch: 10, Train Loss: 0.1827, Val Loss: 0.3070
Epoch: 15, Train Loss: 0.1039, Val Loss: 0.2946
Epoch: 20, Train Loss: 0.0610, Val Loss: 0.2837
Epoch: 25, Train Loss: 0.0353, Val Loss: 0.2894
Epoch: 30, Train Loss: 0.0279, Val Loss: 0.2902
Epoch: 35, Train Loss: 0.0222, Val Loss: 0.3078
Epoch: 40, Train Loss: 0.0168, Val Loss: 0.3047
Epoch: 45, Train Loss: 0.0137, Val Loss: 0.3227
Histogram of MIMIC train labels:	 ['0.07', '0.00', '0.06', '0.00', '0.07', '0.00', '0.00', '0.00', '0.06', '0.00', '0.06', '0.00', '0.06', '0.05', '0.00', '0.04', '0.00', '0.00', '0.07', '0.00', '0.05', '0.07', '0.07', '0.07', '0.00', '0.06', '0.06', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.06', '0.00', '0.07', '0.00', '0.06', '0.00', '0.00', '0.00', '0.07', '0.00', '0.07', '0.00', '0.07', '0.06', '0.00', '0.05', '0.00', '0.00', '0.07',

In [None]:
!python train_cnn_models.py -prefix="var_w8" -num_epochs=50 -ns=1000 > ne50_ns1000_w8.txt

In [9]:
!cat ne50_ns1000_w8.txt

Dataset: var_w8_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Epoch: 0, Train Loss: 1.5377, Val Loss: 0.7720
Epoch: 5, Train Loss: 0.2569, Val Loss: 0.2945
Epoch: 10, Train Loss: 0.1254, Val Loss: 0.2430
Epoch: 15, Train Loss: 0.0667, Val Loss: 0.2369
Epoch: 20, Train Loss: 0.0409, Val Loss: 0.2438
Epoch: 25, Train Loss: 0.0280, Val Loss: 0.2555
Epoch: 30, Train Loss: 0.0190, Val Loss: 0.2593
Epoch: 35, Train Loss: 0.0166, Val Loss: 0.2727
Epoch: 40, Train Loss: 0.0143, Val Loss: 0.2826
Epoch: 45, Train Loss: 0.0148, Val Loss: 0.2813
Histogram of MIMIC train labels:	 ['0.08', '0.00', '0.07', '0.00', '0.08', '0.00', '0.00', '0.00', '0.08', '0.00', '0.07', '0.00', '0.08', '0.03', '0.00', '0.02', '0.00', '0.00', '0.07', '0.00', '0.03', '0.08', '0.08', '0.05', '0.00', '0.06', '0.04', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.07', '0.00', '0.07', '0.00', '0.07', '0.00', '0.00', '0.00', '0.08', '0.00', '0.06', '0.00', '0.07', '0.03', '0.00', '0.03', '0.00', '0.00', '0.08',

# Fully Connected Network
The file `train_fcn_models.py` is a Python shell script which trains an FCN model (defined in `fcn_model.py`) on each abbreviation.

### Reformatting Data for FCNs
`fcn_model.py` is designed to train on aggregated local embeddings. The following calls generate the embedding files that the FCN can train on, for context lengths of 3, 5, and 8. Note that each reformatting takes up a significant amount of disk space, so one typically reformats the data for context length 3, runs all experiments on it, deletes the reformatted data, then reformats the original data for context length 5, runs those experiments, and so on.

In [None]:
!python create_var_embeddings.py -prefix="w3_" -window=3 -g

In [None]:
!python create_var_embeddings.py -prefix="w5_" -window=5 -g

In [None]:
!python create_var_embeddings.py -prefix="w8_" -window=8 -g

### Training FCNs
The following commands train FCN models on each of the above reformatted datasets, for varying maximum numbers of samples per label. The corresponding output is listed immediately below each command.

In [None]:
!python train_fcn_models.py -prefix="w3" -num_epochs=50 -ns=100 > fcn_ne50_ns100_w3.txt

In [10]:
!cat fcn_ne50_ns100_w3.txt

Dataset: w3_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.06', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.07', '0.00', '0.06', '0.00', '0.07', '0.06', '0.01', '0.06', '0.01', '0.00', '0.06', '0.00', '0.06', '0.06', '0.06', '0.06', '0.00', '0.05', '0.05', '0.06', '0.00']
Histogram of MIMIC val labels:		 ['0.05', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.05', '0.00', '0.06', '0.00', '0.05', '0.07', '0.01', '0.07', '0.01', '0.00', '0.05', '0.00', '0.07', '0.07', '0.05', '0.05', '0.00', '0.09', '0.08', '0.06', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		1.0000 (1157 of 1157)
Final MIMIC Val Accuracy:		0.8310 (413 of 497)
CASI Accuracy:				0.8827 (143 of 162)
Data

In [None]:
!python train_fcn_models.py -prefix="w3" -num_epochs=50 -ns=500 > fcn_ne50_ns500_w3.txt

In [11]:
!cat fcn_ne50_ns500_w3.txt

Dataset: w3_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.07', '0.00', '0.06', '0.00', '0.07', '0.00', '0.00', '0.00', '0.06', '0.00', '0.06', '0.00', '0.06', '0.05', '0.00', '0.04', '0.00', '0.00', '0.07', '0.00', '0.05', '0.07', '0.07', '0.07', '0.00', '0.06', '0.06', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.06', '0.00', '0.07', '0.00', '0.06', '0.00', '0.00', '0.00', '0.07', '0.00', '0.07', '0.00', '0.07', '0.06', '0.00', '0.05', '0.00', '0.00', '0.07', '0.00', '0.05', '0.06', '0.06', '0.06', '0.00', '0.07', '0.06', '0.06', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.9855 (5303 of 5381)
Final MIMIC Val Accuracy:		0.8444 (1948 of 2307)
CASI Accuracy:				0.8395 (136 of 162)
Dataset: 

In [None]:
!python train_fcn_models.py -prefix="w3" -num_epochs=50 -ns=1000 > fcn_ne50_ns1000_w3.txt

In [12]:
!cat fcn_ne50_ns1000_w3.txt

Dataset: w3_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.08', '0.00', '0.07', '0.00', '0.08', '0.00', '0.00', '0.00', '0.08', '0.00', '0.07', '0.00', '0.08', '0.03', '0.00', '0.02', '0.00', '0.00', '0.07', '0.00', '0.03', '0.08', '0.08', '0.05', '0.00', '0.06', '0.04', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.07', '0.00', '0.07', '0.00', '0.07', '0.00', '0.00', '0.00', '0.08', '0.00', '0.06', '0.00', '0.07', '0.03', '0.00', '0.03', '0.00', '0.00', '0.08', '0.00', '0.03', '0.07', '0.08', '0.05', '0.00', '0.07', '0.04', '0.08', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.9848 (8878 of 9015)
Final MIMIC Val Accuracy:		0.8636 (3337 of 3864)
CASI Accuracy:				0.8704 (141 of 162)
Da

In [None]:
!python train_fcn_models.py -prefix="w5" -num_epochs=50 -ns=100 > fcn_ne50_ns100_w5.txt

In [13]:
!cat fcn_ne50_ns100_w5.txt

Dataset: ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.06', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.07', '0.00', '0.06', '0.00', '0.07', '0.06', '0.01', '0.06', '0.01', '0.00', '0.06', '0.00', '0.06', '0.06', '0.06', '0.06', '0.00', '0.05', '0.05', '0.06', '0.00']
Histogram of MIMIC val labels:		 ['0.05', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.05', '0.00', '0.06', '0.00', '0.05', '0.07', '0.01', '0.07', '0.01', '0.00', '0.05', '0.00', '0.07', '0.07', '0.05', '0.05', '0.00', '0.09', '0.08', '0.06', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.9991 (1156 of 1157)
Final MIMIC Val Accuracy:		0.8249 (410 of 497)
CASI Accuracy:				0.7407 (120 of 162)
Dataset

In [None]:
!python train_fcn_models.py -prefix="w5" -num_epochs=50 -ns=500 > fcn_ne50_ns500_w5.txt

In [14]:
!cat fcn_ne50_ns500_w5.txt

Dataset: ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.07', '0.00', '0.06', '0.00', '0.07', '0.00', '0.00', '0.00', '0.06', '0.00', '0.06', '0.00', '0.06', '0.05', '0.00', '0.04', '0.00', '0.00', '0.07', '0.00', '0.05', '0.07', '0.07', '0.07', '0.00', '0.06', '0.06', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.06', '0.00', '0.07', '0.00', '0.06', '0.00', '0.00', '0.00', '0.07', '0.00', '0.07', '0.00', '0.07', '0.06', '0.00', '0.05', '0.00', '0.00', '0.07', '0.00', '0.05', '0.06', '0.06', '0.06', '0.00', '0.07', '0.06', '0.06', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.9790 (5268 of 5381)
Final MIMIC Val Accuracy:		0.8461 (1952 of 2307)
CASI Accuracy:				0.8519 (138 of 162)
Datas

In [None]:
!python train_fcn_models.py -prefix="w5" -num_epochs=50 -ns=1000 > fcn_ne50_ns1000_w5.txt

In [15]:
!cat fcn_ne50_ns1000_w5.txt

Dataset: ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.08', '0.00', '0.07', '0.00', '0.08', '0.00', '0.00', '0.00', '0.08', '0.00', '0.07', '0.00', '0.08', '0.03', '0.00', '0.02', '0.00', '0.00', '0.07', '0.00', '0.03', '0.08', '0.08', '0.05', '0.00', '0.06', '0.04', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.07', '0.00', '0.07', '0.00', '0.07', '0.00', '0.00', '0.00', '0.08', '0.00', '0.06', '0.00', '0.07', '0.03', '0.00', '0.03', '0.00', '0.00', '0.08', '0.00', '0.03', '0.07', '0.08', '0.05', '0.00', '0.07', '0.04', '0.08', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.9740 (8781 of 9015)
Final MIMIC Val Accuracy:		0.8698 (3361 of 3864)
CASI Accuracy:				0.7284 (118 of 162)
Datas

In [None]:
!python train_fcn_models.py -prefix="w8" -num_epochs=50 -ns=100 > fcn_ne50_ns100_w8.txt

In [16]:
!cat fcn_ne50_ns100_w8.txt

Dataset: w8_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.06', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.07', '0.00', '0.06', '0.00', '0.07', '0.06', '0.01', '0.06', '0.01', '0.00', '0.06', '0.00', '0.06', '0.06', '0.06', '0.06', '0.00', '0.05', '0.05', '0.06', '0.00']
Histogram of MIMIC val labels:		 ['0.05', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.05', '0.00', '0.06', '0.00', '0.05', '0.07', '0.01', '0.07', '0.01', '0.00', '0.05', '0.00', '0.07', '0.07', '0.05', '0.05', '0.00', '0.09', '0.08', '0.06', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.9706 (1123 of 1157)
Final MIMIC Val Accuracy:		0.7907 (393 of 497)
CASI Accuracy:				0.8272 (134 of 162)
Data

In [None]:
!python train_fcn_models.py -prefix="w8" -num_epochs=50 -ns=500 > fcn_ne50_ns500_w8.txt

In [17]:
!cat fcn_ne50_ns500_w8.txt

Dataset: w8_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.07', '0.00', '0.06', '0.00', '0.07', '0.00', '0.00', '0.00', '0.06', '0.00', '0.06', '0.00', '0.06', '0.05', '0.00', '0.04', '0.00', '0.00', '0.07', '0.00', '0.05', '0.07', '0.07', '0.07', '0.00', '0.06', '0.06', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.06', '0.00', '0.07', '0.00', '0.06', '0.00', '0.00', '0.00', '0.07', '0.00', '0.07', '0.00', '0.07', '0.06', '0.00', '0.05', '0.00', '0.00', '0.07', '0.00', '0.05', '0.06', '0.06', '0.06', '0.00', '0.07', '0.06', '0.06', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.9758 (5251 of 5381)
Final MIMIC Val Accuracy:		0.8375 (1932 of 2307)
CASI Accuracy:				0.8457 (137 of 162)
Da

Histogram of MIMIC val labels:		 ['0.00', '0.00', '0.00', '0.01', '0.00', '0.17', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.16', '0.19', '0.07', '0.00', '0.00', '0.20', '0.20', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
Histogram of CASI test labels:		 ['0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '1.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		1.0000 (1803 of 1803)
Final MIMIC Val Accuracy:		0.8862 (685 of 773)
CASI Accuracy:				0.6486 (299 of 461)
Dataset: w8_otc_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.00', '1.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
Histogram of MIMIC val labels:		 ['0.00', '1.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
Histogram of CASI test labels:		 ['0.06', '0.94', '0.00', '0.00', '0.00'

In [None]:
!python train_fcn_models.py -prefix="w8" -num_epochs=50 -ns=1000 > fcn_ne50_ns1000_w8.txt

In [18]:
!cat fcn_ne50_ns1000_w8.txt

Dataset: w8_ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.08', '0.00', '0.07', '0.00', '0.08', '0.00', '0.00', '0.00', '0.08', '0.00', '0.07', '0.00', '0.08', '0.03', '0.00', '0.02', '0.00', '0.00', '0.07', '0.00', '0.03', '0.08', '0.08', '0.05', '0.00', '0.06', '0.04', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.07', '0.00', '0.07', '0.00', '0.07', '0.00', '0.00', '0.00', '0.08', '0.00', '0.06', '0.00', '0.07', '0.03', '0.00', '0.03', '0.00', '0.00', '0.08', '0.00', '0.03', '0.07', '0.08', '0.05', '0.00', '0.07', '0.04', '0.08', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.9688 (8734 of 9015)
Final MIMIC Val Accuracy:		0.8411 (3250 of 3864)
CASI Accuracy:				0.7407 (120 of 162)
Da

# Most Frequent Sense
The file `train_mfs_models.py` is a Python shell script which trains an MFS classifier.

### Reformatting Data for MFS
The MFS classifier does not use the embeddings to make predictions, so no additional reformatting of data is required.
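
For reference, a most-frequent-sense baseline can be sketched in a few lines. This is a generic illustration, not the code in `train_mfs_models.py`: the class and function names are hypothetical, and it assumes the most frequent sense is taken from the training labels only.

```python
from collections import Counter


class MostFrequentSense:
    """Baseline that always predicts the most common training label."""

    def fit(self, train_labels):
        # Assumption: the "most frequent sense" is computed on training labels.
        self.sense_ = Counter(train_labels).most_common(1)[0][0]
        return self

    def predict(self, n_samples):
        # The input features are ignored entirely, so no embeddings are needed.
        return [self.sense_] * n_samples


def accuracy(predicted, true_labels):
    """Fraction of positions where the prediction matches the true label."""
    correct = sum(p == t for p, t in zip(predicted, true_labels))
    return correct / len(true_labels)


# Toy labels, made up for illustration.
model = MostFrequentSense().fit(["sense_a", "sense_a", "sense_b"])
predictions = model.predict(2)
score = accuracy(predictions, ["sense_a", "sense_b"])
```

Because the prediction never depends on the context, accuracy on a test set equals the share of that set carrying the most frequent training label.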

### Training MFS
The following commands train MFS models on each of the datasets, for varying maximum numbers of samples per label. The corresponding output is listed immediately below each command.

In [None]:
!python train_mfs_models.py -ns=100 > mfs_ns100.txt

In [19]:
!cat mfs_ns100.txt

Dataset: ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.06', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.07', '0.00', '0.06', '0.00', '0.07', '0.06', '0.01', '0.06', '0.01', '0.00', '0.06', '0.00', '0.06', '0.06', '0.06', '0.06', '0.00', '0.05', '0.05', '0.06', '0.00']
Histogram of MIMIC val labels:		 ['0.05', '0.00', '0.06', '0.00', '0.06', '0.01', '0.00', '0.00', '0.05', '0.00', '0.06', '0.00', '0.05', '0.07', '0.01', '0.07', '0.01', '0.00', '0.05', '0.00', '0.07', '0.07', '0.05', '0.05', '0.00', '0.09', '0.08', '0.06', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.0666 (77 of 1157)
Final MIMIC Val Accuracy:		0.0463 (23 of 497)
CASI Accuracy:				0.0000 (0 of 162)
Dataset: ald

In [None]:
!python train_mfs_models.py -ns=500 > mfs_ns500.txt

In [20]:
!cat mfs_ns500.txt

Dataset: ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.07', '0.00', '0.06', '0.00', '0.07', '0.00', '0.00', '0.00', '0.06', '0.00', '0.06', '0.00', '0.06', '0.05', '0.00', '0.04', '0.00', '0.00', '0.07', '0.00', '0.05', '0.07', '0.07', '0.07', '0.00', '0.06', '0.06', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.06', '0.00', '0.07', '0.00', '0.06', '0.00', '0.00', '0.00', '0.07', '0.00', '0.07', '0.00', '0.07', '0.06', '0.00', '0.05', '0.00', '0.00', '0.07', '0.00', '0.05', '0.06', '0.06', '0.06', '0.00', '0.07', '0.06', '0.06', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.0680 (366 of 5381)
Final MIMIC Val Accuracy:		0.0581 (134 of 2307)
CASI Accuracy:				0.0000 (0 of 162)
Dataset: 

Histogram of MIMIC val labels:		 ['0.00', '0.00', '0.00', '0.01', '0.00', '0.17', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.16', '0.19', '0.07', '0.00', '0.00', '0.20', '0.20', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
Histogram of CASI test labels:		 ['0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '1.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.1958 (353 of 1803)
Final MIMIC Val Accuracy:		0.1902 (147 of 773)
CASI Accuracy:				1.0000 (461 of 461)
Dataset: otc_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.00', '1.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
Histogram of MIMIC val labels:		 ['0.00', '1.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00']
Histogram of CASI test labels:		 ['0.06', '0.94', '0.00', '0.00', '0.00', '0

In [None]:
!python train_mfs_models.py -ns=1000 > mfs_ns1000.txt

In [21]:
!cat mfs_ns1000.txt

Dataset: ac_mimic_casi_w5_ns1000_g_20190408.pickle
Histogram of MIMIC train labels:	 ['0.08', '0.00', '0.07', '0.00', '0.08', '0.00', '0.00', '0.00', '0.08', '0.00', '0.07', '0.00', '0.08', '0.03', '0.00', '0.02', '0.00', '0.00', '0.07', '0.00', '0.03', '0.08', '0.08', '0.05', '0.00', '0.06', '0.04', '0.07', '0.00']
Histogram of MIMIC val labels:		 ['0.07', '0.00', '0.07', '0.00', '0.07', '0.00', '0.00', '0.00', '0.08', '0.00', '0.06', '0.00', '0.07', '0.03', '0.00', '0.03', '0.00', '0.00', '0.08', '0.00', '0.03', '0.07', '0.08', '0.05', '0.00', '0.07', '0.04', '0.08', '0.00']
Histogram of CASI test labels:		 ['0.02', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.01', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.00', '0.06', '0.00', '0.00', '0.00', '0.91', '0.00', '0.00', '0.00']
Final MIMIC Train Accuracy:		0.0795 (717 of 9015)
Final MIMIC Val Accuracy:		0.0732 (283 of 3864)
CASI Accuracy:				0.0247 (4 of 162)
Dataset: 

# Summarizing Results
To summarize the above results into statistics which we report in the tables, we have developed simple scripts for parsing the above output files and computing statistics from them. The following calls will generate these statistics, and immediately below them is the corresponding summary output, which was transcribed into the table used

In [None]:
!python parse_cnn_results.py > parsed_cnn_results.txt

In [22]:
!cat parsed_cnn_results.txt

Parsing ne50_ns100_w3.txt
MIMIC Train Macro: 0.9975
MIMIC Train Micro: 0.9960
MIMIC Val Macro: 0.8487
MIMIC Val Micro: 0.7939
Casi Macro: 0.6127
Casi Micro: 0.6277
Parsing ne50_ns100_w5.txt
MIMIC Train Macro: 0.9990
MIMIC Train Micro: 0.9984
MIMIC Val Macro: 0.8502
MIMIC Val Micro: 0.7978
Casi Macro: 0.6075
Casi Micro: 0.6146
Parsing ne50_ns100_w8.txt
MIMIC Train Macro: 0.9998
MIMIC Train Micro: 0.9998
MIMIC Val Macro: 0.8494
MIMIC Val Micro: 0.7927
Casi Macro: 0.6025
Casi Micro: 0.6075
Parsing ne50_ns500_w3.txt
MIMIC Train Macro: 0.9961
MIMIC Train Micro: 0.9946
MIMIC Val Macro: 0.8950
MIMIC Val Micro: 0.8667
Casi Macro: 0.6476
Casi Micro: 0.6809
Parsing ne50_ns500_w5.txt
MIMIC Train Macro: 0.9987
MIMIC Train Micro: 0.9984
MIMIC Val Macro: 0.9084
MIMIC Val Micro: 0.8774
Casi Macro: 0.6443
Casi Micro: 0.6734
Parsing ne50_ns500_w8.txt
MIMIC Train Macro: 0.9995
MIMIC Train Micro: 0.9994
MIMIC Val Macro: 0.9113
MIMIC Val Micro: 0.8793
Casi Macro: 0.

In [None]:
!python parse_fcn_results.py > parsed_fcn_results.txt

In [23]:
!cat parsed_fcn_results.txt

Parsing fcn_ne50_ns100_w3.txt
MIMIC Train Macro: 0.9921
MIMIC Train Micro: 0.9822
MIMIC Val Macro: 0.8185
MIMIC Val Micro: 0.7268
Casi Macro: 0.5488
Casi Micro: 0.5630
Parsing fcn_ne50_ns100_w5.txt
MIMIC Train Macro: 0.9886
MIMIC Train Micro: 0.9719
MIMIC Val Macro: 0.8477
MIMIC Val Micro: 0.7566
Casi Macro: 0.5367
Casi Micro: 0.5396
Parsing fcn_ne50_ns100_w8.txt
MIMIC Train Macro: 0.9922
MIMIC Train Micro: 0.9834
MIMIC Val Macro: 0.8175
MIMIC Val Micro: 0.7246
Casi Macro: 0.5305
Casi Micro: 0.5429
Parsing fcn_ne50_ns500_w3.txt
MIMIC Train Macro: 0.9818
MIMIC Train Micro: 0.9647
MIMIC Val Macro: 0.8620
MIMIC Val Micro: 0.7902
Casi Macro: 0.5925
Casi Micro: 0.6142
Parsing fcn_ne50_ns500_w5.txt
MIMIC Train Macro: 0.9826
MIMIC Train Micro: 0.9659
MIMIC Val Macro: 0.8840
MIMIC Val Micro: 0.8130
Casi Macro: 0.5962
Casi Micro: 0.6166
Parsing fcn_ne50_ns500_w8.txt
MIMIC Train Macro: 0.9767
MIMIC Train Micro: 0.9497
MIMIC Val Macro: 0.8605
MIMIC Val Micro

In [None]:
!python parse_mfs_results.py > parsed_mfs_results.txt

In [24]:
!cat parsed_mfs_results.txt

Parsing mfs_ns100.txt
MIMIC Train Macro: 0.2784
MIMIC Train Micro: 0.0980
MIMIC Val Macro: 0.2542
MIMIC Val Micro: 0.0784
Casi Macro: 0.2360
Casi Micro: 0.2335
Parsing mfs_ns500.txt
MIMIC Train Macro: 0.3568
MIMIC Train Micro: 0.1393
MIMIC Val Macro: 0.3402
MIMIC Val Micro: 0.1283
Casi Macro: 0.2443
Casi Micro: 0.2737
Parsing mfs_ns1000.txt
MIMIC Train Macro: 0.4000
MIMIC Train Micro: 0.1713
MIMIC Val Macro: 0.3896
MIMIC Val Micro: 0.1652
Casi Macro: 0.2612
Casi Micro: 0.3018
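
The macro and micro figures above can be understood through the following sketch of a log parser. It is not the code in the `parse_*_results.py` scripts: the names are hypothetical, and it assumes the usual definitions, where macro is the unweighted mean of per-dataset accuracies and micro is total correct over total samples. The sample lines are accuracy lines copied from the `mfs_ns500.txt` output shown earlier.

```python
import re

# Matches lines such as "CASI Accuracy:    0.9850 (262 of 266)".
ACCURACY_LINE = re.compile(
    r"^(Final MIMIC Train|Final MIMIC Val|CASI) Accuracy:\s+[\d.]+ \((\d+) of (\d+)\)"
)


def summarize(text):
    """Return {split: (macro, micro)} from the accuracy lines of a training log."""
    counts = {}
    for line in text.splitlines():
        match = ACCURACY_LINE.match(line)
        if match:
            split = match.group(1).replace("Final ", "")
            pair = (int(match.group(2)), int(match.group(3)))
            counts.setdefault(split, []).append(pair)
    summary = {}
    for split, pairs in counts.items():
        pairs = [(c, t) for c, t in pairs if t > 0]  # skip empty datasets
        macro = sum(c / t for c, t in pairs) / len(pairs)
        micro = sum(c for c, _ in pairs) / sum(t for _, t in pairs)
        summary[split] = (macro, micro)
    return summary


sample_log = """\
Final MIMIC Train Accuracy:    0.0680 (366 of 5381)
Final MIMIC Val Accuracy:      0.0581 (134 of 2307)
CASI Accuracy:                 0.0000 (0 of 162)
Final MIMIC Train Accuracy:    0.1958 (353 of 1803)
Final MIMIC Val Accuracy:      0.1902 (147 of 773)
CASI Accuracy:                 1.0000 (461 of 461)
"""

stats = summarize(sample_log)
```

Macro weights every abbreviation equally, while micro is dominated by abbreviations with many samples, which is why the two can differ substantially.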
