特征矩阵(feature matrix)

Building the basic MIMIC feautre matrix for NLD work.
We take minimum and maximum values of each variable during four hour sample window (four hours prior to callout)
For 'single_feature_variables' we take only the final value (variables which are infrequently recorded will have the same value for min and max and therefore using both min and max as features for these variables does not make sense).
We find that there are a lot of missing values when using the four hour sample window. So, where values are missing we look back over an extended sample window of 36 hours and take the final value during that time as the feature value.
The above procedure results in imbalanced class sizes (there are relatively few patients with negative outcomes following callout). To balance the class sizes we sample from icustays using the same method described above, but at a sample time that is an integer multiple of 24 hours prior to callout (but not within 24 hours of icu admission). We specify that this sample must be taken at least 72 hours prior to callout, under the assumption that a patient is deintitely not ready for discharge at this point in time.
These extra instances of the negative class are assigned unique identifier and treated as separate patients, however we save the id mapping so we can tell which stay each sample was taken from and at what time point.

为NLD工作建立基本的MIMIC特征矩阵。
在四个小时的样本窗口中(调用前四个小时)，我们取每个变量的最小值和最大值。
对于'single_feature_variables'，我们只取最终的值(不经常被记录的变量的min和max值相同，因此使用min和max作为这些变量的特征是没有意义的)。
我们发现在使用4小时的样本窗口时，存在大量的缺失值。因此，当缺少值时，我们回顾了一个扩展的36小时的样本窗口，并将这段时间内的最终值作为特征值。
上述程序导致了班级规模的不平衡(在调出后出现不良结果的患者相对较少)。为了平衡班级规模，我们使用上述相同的方法对icustay进行了采样，但采样时间是调出前24小时
(但不是在icu入院24小时内)的整数倍。我们规定，该样本必须在callout前至少72小时采集，假设患者此时确实没有准备好出院。
负类的这些额外实例被分配了唯一的标识符，并作为单独的患者对待，但是我们保存了id映射，这样我们就可以知道每个样本是从哪个位置以及在什么时间点采集的。


First we configure this script:
首先我们配置这个脚本:

In [100]:
#我们之前从mimic中提取和清理的数据存储在哪里
DATA_PATH = 'mimic_all_data_CLEANED'   ## where the data we previously extracted and cleaned up from mimic is stored
#允许运行此脚本而不自动覆盖以前的特征矩阵
SAVE_FLAG = True    ## allows running this script without autmatically overwriting previous feature matrices
#原始样本窗口(NLD指定4小时)
HOURS_BEFORE_RFD = 24    ## original sample window (four hours specified by NLD)
#扩展到样本窗口时，没有值记录在原始窗口(最终值采取)
MAXIMUM_LOOKBACK = 36   ## extension to sample window when no value recorded in original window (final valule taken)

## variables for which to only use single feature column:
#只使用单一特性列的变量:
single_feature_variables = ['k', 'na', 'bun', 'creatinine', 'hco3', 'haemoglobin', 'fio2', 'airway', 'pco2', 'po2', 'pain',
                           'ALB','ALT','WBC','PLT','TBIL','PTT','PT','CL','LA'
                           ]

In [101]:
# import graphlab
import numpy as np
import pickle
import datetime
import shutil
from collections import OrderedDict
import matplotlib.pyplot as plt
import matplotlib.dates as dates
import pandas as pd
# import graphlab.aggregate as agg
%matplotlib inline



Load the data that we have previously extracted and pre-processed from MIMIC:
只使用单一特性列的变量:

In [102]:
# all_data = graphlab.SFrame(DATA_PATH)
all_data = pd.read_pickle(DATA_PATH)


Create an SFrame to store summary of missing data:
创建一个SFrame来存储缺失数据的摘要:

In [103]:
# missing_data_summary = graphlab.SFrame()
missing_data_summary = pd.DataFrame()


Select only measurments taken in HOURS_BEFORE_DISCHARGE:
仅选择HOURS_BEFORE_DISCHARGE中的测量值:（前几个小时的放电）

In [104]:
#取 0 到 4小时 之间的数据
all_data = all_data[all_data['hrs_bRFD']<=HOURS_BEFORE_RFD]
all_data = all_data[all_data['hrs_bRFD']>=0]
print ("There are %d icu satys with variables recorded in the final %d hours before CALLOUT." \
      %(len(all_data['C.ICUSTAY_ID'].unique()),HOURS_BEFORE_RFD))

There are 4070 icu satys with variables recorded in the final 24 hours before CALLOUT.


In [105]:
all_data['VARIABLE'].unique()

array(['temp', 'hr', 'resp', 'spo2', 'haemoglobin', 'WBC', 'CL',
       'creatinine', 'GLU', 'na', 'bun', 'k', 'hco3', 'PLT', 'bp', 'PT',
       'PTT', 'INR', 'pain', 'ALB', 'ALT', 'TBIL', 'po2', 'pco2', 'NEUT',
       'fio2', 'peep', 'airway'], dtype=object)

Produce data summary:
（4小时）生成数据汇总:

In [106]:
# summary = all_data.groupby(key_columns=['C.ICUSTAY_ID'], operations={'outcome':agg.SELECT_ONE('outcome'),
#                                                                      'cohort':agg.SELECT_ONE('cohort'),
#                                                                      'readmit':agg.SELECT_ONE('readmit'),
#                                                                      'in_h_death':agg.SELECT_ONE('in_h_death'),
#                                                                      'in_icu_death':agg.SELECT_ONE('in_icu_death'),
#                                                                      'II.LOS':agg.SELECT_ONE('II.LOS'),
#                                                                      'II.OUTTIME':agg.SELECT_ONE('II.OUTTIME'),
#                                                                      'II.INTIME':agg.SELECT_ONE('II.INTIME')
#                                                                     })
all_data1 = all_data.groupby(by=['C.ICUSTAY_ID'])


In [107]:
# 选择 8 个 需要的数据
all_data2=all_data1.agg(
    outcome=('outcome',np.random.choice),
    cohort=('cohort',np.random.choice),
    readmit=('readmit',np.random.choice),
    in_h_death=('in_h_death',np.random.choice),
    in_icu_death=('in_icu_death',np.random.choice),
    LOS=('II.LOS',np.random.choice),
    OUTTIME=('II.OUTTIME',np.random.choice),
    INTIME=('II.INTIME',np.random.choice)
).rename(columns={
    'LOS':'II.LOS','OUTTIME':'II.OUTTIME','INTIME':'II.INTIME'
})

In [108]:
# all_data2['VARIABLE'].unique()

In [109]:
summary=all_data2

In [110]:
print( "There are %d icu stays in the cohort." %sum(summary['cohort']==1))
cohort_summary = summary[summary['cohort']==1]

There are 3757 icu stays in the cohort.


We now begin to construct the feature matrix.
现在我们开始构造特征矩阵。

In [111]:
# features = all_data.groupby(key_columns=['C.ICUSTAY_ID', 'VARIABLE'],
#                             operations={'min':agg.MIN('C.VALUENUM'),
#                                         'max':agg.MAX('C.VALUENUM'),
#                                         'values':agg.CONCAT('C.VALUENUM'),
#                                         'count':agg.COUNT('C.VALUENUM'),
#                                         'cohort':agg.SELECT_ONE('cohort'),
#                                         'outcome':agg.SELECT_ONE('outcome')
#                                         })

In [112]:
# all_data3=all_data.groupby(by=['C.ICUSTAY_ID', 'VARIABLE'])
all_data3=all_data.groupby(by=['C.ICUSTAY_ID', 'VARIABLE'],as_index=False)

In [113]:
# all_data3.loc[all_data3['VARIABLE']=='bp']

In [114]:
a1=all_data3.count()
a1['VARIABLE'].unique()

array(['CL', 'GLU', 'PLT', 'WBC', 'bp', 'bun', 'creatinine',
       'haemoglobin', 'hco3', 'hr', 'k', 'na', 'pain', 'pco2', 'po2',
       'resp', 'spo2', 'temp', 'ALT', 'INR', 'NEUT', 'PT', 'PTT', 'TBIL',
       'ALB', 'fio2', 'peep', 'airway'], dtype=object)

In [115]:
all_data['VARIABLE'].unique()

array(['temp', 'hr', 'resp', 'spo2', 'haemoglobin', 'WBC', 'CL',
       'creatinine', 'GLU', 'na', 'bun', 'k', 'hco3', 'PLT', 'bp', 'PT',
       'PTT', 'INR', 'pain', 'ALB', 'ALT', 'TBIL', 'po2', 'pco2', 'NEUT',
       'fio2', 'peep', 'airway'], dtype=object)

In [116]:
def yuanzhi(value):
    return value
yuanzhi ([1234])

[1234]

In [117]:
##all_data['C.VALUENUM']=all_data['C.VALUENUM'].astype(float)
all_data4=all_data3.agg(
    min=('C.VALUENUM','min'),
    max=('C.VALUENUM','max'),
    values=('C.VALUENUM',np.random.choice),
    count=('C.VALUENUM','count'),
    cohort=('cohort',np.random.choice),
    outcome=('outcome',np.random.choice)
)
# all_data4['values'] = all_data3['C.VALUENUM']
all_data4

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,min,max,values,count,cohort,outcome
0,200001.0,CL,98.000000,98.000000,98.000000,1,1,1
1,200001.0,GLU,99.000000,99.000000,99.000000,1,1,1
2,200001.0,PLT,168.000000,168.000000,168.000000,1,1,1
3,200001.0,WBC,3.400000,3.400000,3.400000,1,1,1
4,200001.0,bp,84.000000,135.000000,118.000000,29,1,1
...,...,...,...,...,...,...,...,...
77931,299998.0,peep,5.000000,5.000000,5.000000,6,0,0
77932,299998.0,po2,13.863200,22.394400,13.863200,2,0,0
77933,299998.0,resp,0.000000,21.000000,17.000000,35,0,0
77934,299998.0,spo2,94.000000,100.000000,100.000000,22,0,0


In [118]:
features =all_data4
features = features.sort_values(ascending=True, by='C.ICUSTAY_ID')

In [119]:
features['VARIABLE'].unique()

array(['CL', 'temp', 'spo2', 'resp', 'po2', 'pain', 'na', 'k', 'hr',
       'pco2', 'haemoglobin', 'creatinine', 'bun', 'bp', 'WBC', 'PLT',
       'GLU', 'hco3', 'ALT', 'INR', 'NEUT', 'PT', 'PTT', 'TBIL', 'ALB',
       'fio2', 'peep', 'airway'], dtype=object)

We define a function that takes the above dataframe and splits it into feature columns:
我们定义了一个函数来接受上述数据帧并将其拆分为特性列:

In [120]:
# 已经在下面实现了
def _split_features_to_columns(features,all_variables,all_stay_ids,thisFM, single_feature_vars=single_feature_variables):
    ''' Note: sort features by ICUSTAY_ID before using this function.'''

    # all_variables = features['VARIABLE'].unique()
    # all_stay_ids = features['C.ICUSTAY_ID'].unique().sort()
    #
    # # thisFM = graphlab.SFrame()
    # thisFM = pd.DataFrame()
    # thisFM['ICUSTAY_ID'] = all_stay_ids

    for var in all_variables:
        print (var)

        var_min = []
        var_max = []
        var_values = []
        var_count = []

        _subset = features[features['VARIABLE']==var]
        N = len(_subset)-1

        rid = 0
        for sid in all_stay_ids:
            if rid<=N and _subset[rid]['C.ICUSTAY_ID']==sid:
                ## add data and move on a row
                row = _subset[rid]
                var_min.append(row['min'])
                var_max.append(row['max'])
                var_values.append(row['values'])
                rid += 1
            else:
                ## add None values and stay in place
                var_min.append(None)
                var_max.append(None)
                var_values.append([])

        if var not in single_feature_vars:
            thisFM[var + '_min'] = var_min
            thisFM[var + '_max'] = var_max

        else:
            ## if it is a single variable feature - take final value only
            thisFM[var] = [val[-1]  if len(val)>0 else None for val in var_values]

    return thisFM



In [121]:
all_variables = features['VARIABLE'].unique()
# # all_stay_ids = features['C.ICUSTAY_ID'].unique().sort()
all_stay_ids = features['C.ICUSTAY_ID'].unique()
all_stay_ids.sort()

In [122]:
# # thisFM = graphlab.SFrame()
thisFM = pd.DataFrame()
thisFM['ICUSTAY_ID'] = all_stay_ids
# thisFM['variables'] = all_variables

In [123]:
all_variables

array(['CL', 'temp', 'spo2', 'resp', 'po2', 'pain', 'na', 'k', 'hr',
       'pco2', 'haemoglobin', 'creatinine', 'bun', 'bp', 'WBC', 'PLT',
       'GLU', 'hco3', 'ALT', 'INR', 'NEUT', 'PT', 'PTT', 'TBIL', 'ALB',
       'fio2', 'peep', 'airway'], dtype=object)

In [124]:
all_stay_ids

array([200001., 200010., 200038., ..., 299872., 299901., 299998.])

In [125]:
len(all_stay_ids)

4070

In [126]:
all_data['VARIABLE'].unique()

array(['temp', 'hr', 'resp', 'spo2', 'haemoglobin', 'WBC', 'CL',
       'creatinine', 'GLU', 'na', 'bun', 'k', 'hco3', 'PLT', 'bp', 'PT',
       'PTT', 'INR', 'pain', 'ALB', 'ALT', 'TBIL', 'po2', 'pco2', 'NEUT',
       'fio2', 'peep', 'airway'], dtype=object)

In [127]:
all_data4['VARIABLE'].unique()
# FM = _split_features_to_columns(features,all_variables,all_stay_ids,thisFM, single_feature_variables)

array(['CL', 'GLU', 'PLT', 'WBC', 'bp', 'bun', 'creatinine',
       'haemoglobin', 'hco3', 'hr', 'k', 'na', 'pain', 'pco2', 'po2',
       'resp', 'spo2', 'temp', 'ALT', 'INR', 'NEUT', 'PT', 'PTT', 'TBIL',
       'ALB', 'fio2', 'peep', 'airway'], dtype=object)

In [128]:
all_data['VARIABLE'].unique()

array(['temp', 'hr', 'resp', 'spo2', 'haemoglobin', 'WBC', 'CL',
       'creatinine', 'GLU', 'na', 'bun', 'k', 'hco3', 'PLT', 'bp', 'PT',
       'PTT', 'INR', 'pain', 'ALB', 'ALT', 'TBIL', 'po2', 'pco2', 'NEUT',
       'fio2', 'peep', 'airway'], dtype=object)

In [129]:
for var in all_variables:
        print (var)

        var_min = []
        var_max = []
        var_values = []
        var_count = []

        _subset = features[features['VARIABLE']==var].reset_index(drop=True)
        N = len(_subset)-1

CL
temp
spo2
resp
po2
pain
na
k
hr
pco2
haemoglobin
creatinine
bun
bp
WBC
PLT
GLU
hco3
ALT
INR
NEUT
PT
PTT
TBIL
ALB
fio2
peep
airway


In [130]:
for var in all_variables:
    if var in single_feature_variables:
        print (var)


CL
po2
pain
na
k
pco2
haemoglobin
creatinine
bun
WBC
PLT
hco3
ALT
PT
PTT
TBIL
ALB
fio2
airway


In [131]:
all_variables

array(['CL', 'temp', 'spo2', 'resp', 'po2', 'pain', 'na', 'k', 'hr',
       'pco2', 'haemoglobin', 'creatinine', 'bun', 'bp', 'WBC', 'PLT',
       'GLU', 'hco3', 'ALT', 'INR', 'NEUT', 'PT', 'PTT', 'TBIL', 'ALB',
       'fio2', 'peep', 'airway'], dtype=object)

In [132]:
all_data4.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77936 entries, 0 to 77935
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   C.ICUSTAY_ID  77936 non-null  float64
 1   VARIABLE      77936 non-null  object 
 2   min           77936 non-null  float64
 3   max           77936 non-null  float64
 4   values        77936 non-null  float64
 5   count         77936 non-null  int64  
 6   cohort        77936 non-null  int64  
 7   outcome       77936 non-null  int64  
dtypes: float64(4), int64(3), object(1)
memory usage: 4.8+ MB


In [133]:
features

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,min,max,values,count,cohort,outcome
0,200001.0,CL,98.000000,98.000000,98.000000,1,1,1
17,200001.0,temp,35.277778,36.777778,35.277778,6,1,1
16,200001.0,spo2,87.000000,100.000000,97.000000,29,1,1
15,200001.0,resp,15.000000,29.000000,25.000000,29,1,1
14,200001.0,po2,14.929600,14.929600,14.929600,1,1,1
...,...,...,...,...,...,...,...,...
77912,299998.0,GLU,130.000000,206.000000,184.000000,3,0,0
77911,299998.0,CL,107.000000,109.000000,108.000000,3,0,0
77934,299998.0,spo2,94.000000,100.000000,100.000000,22,0,0
77922,299998.0,creatinine,88.420000,97.262000,97.262000,3,0,0


In [134]:
features_count1=features[features['count']==1]
features_count1

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,min,max,values,count,cohort,outcome
0,200001.0,CL,98.0000,98.0000,98.0000,1,1,1
14,200001.0,po2,14.9296,14.9296,14.9296,1,1,1
11,200001.0,na,135.0000,135.0000,135.0000,1,1,1
10,200001.0,k,4.2000,4.2000,4.2000,1,1,1
13,200001.0,pco2,5.1987,5.1987,5.1987,1,1,1
...,...,...,...,...,...,...,...,...
77878,299872.0,creatinine,229.8920,229.8920,229.8920,1,0,0
77917,299998.0,PTT,28.9000,28.9000,28.9000,1,0,0
77916,299998.0,PT,12.9000,12.9000,12.9000,1,0,0
77914,299998.0,NEUT,77.0000,77.0000,77.0000,1,0,0


In [135]:
len(features_count1['C.ICUSTAY_ID'].unique())

3853

In [136]:
features_count1['C.ICUSTAY_ID'].unique()

array([200001., 200010., 200038., ..., 299866., 299872., 299998.])

In [137]:
len(features[features['VARIABLE']=='CL'].reset_index(drop=True)['C.ICUSTAY_ID'].unique())

3963

In [138]:
features[features['VARIABLE']=='CL'].reset_index(drop=True)

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,min,max,values,count,cohort,outcome
0,200001.0,CL,98.0,98.0,98.0,1,1,1
1,200010.0,CL,108.0,108.0,108.0,1,1,1
2,200038.0,CL,108.0,108.0,108.0,1,1,1
3,200049.0,CL,100.0,100.0,100.0,2,1,0
4,200094.0,CL,109.0,111.0,109.0,2,1,1
...,...,...,...,...,...,...,...,...
3958,299856.0,CL,101.0,103.0,101.0,2,1,1
3959,299866.0,CL,106.0,106.0,106.0,1,1,1
3960,299872.0,CL,88.0,88.0,88.0,1,0,0
3961,299901.0,CL,112.0,119.0,119.0,2,1,1


In [139]:
thisFM

Unnamed: 0,ICUSTAY_ID
0,200001.0
1,200010.0
2,200038.0
3,200049.0
4,200094.0
...,...
4065,299856.0
4066,299866.0
4067,299872.0
4068,299901.0


In [148]:
for var in all_variables:
        print (var)

        var_min = []
        var_max = []
        var_values = []
        var_count = []

        _subset = features[features['VARIABLE']==var].reset_index(drop=True)
        N = len(_subset)-1

        rid = 0
        for sid in all_stay_ids:
            # if rid<=N and _subset[rid]['C.ICUSTAY_ID']==sid:
            if rid<=N and _subset.loc[rid]['C.ICUSTAY_ID']==sid:
                ## add data and move on a row
                row = _subset.loc[rid]
                var_min.append(row['min'])
                var_max.append(row['max'])
                var_values.append(row['values'])
                rid += 1
            else:
                ## add None values and stay in place
                var_min.append(None)
                var_max.append(None)
                var_values.append(None)

        if var not in  single_feature_variables:
            thisFM[var + '_min'] = var_min
            thisFM[var + '_max'] = var_max

        else:
            ## if it is a single variable feature - take final value only
            #如果它是单个变量特性—只取最终值
            thisFM[var] = var_values

CL
temp
spo2
resp
po2
pain
na
k
hr
pco2
haemoglobin
creatinine
bun
bp
WBC
PLT
GLU
hco3
ALT
INR
NEUT
PT
PTT
TBIL
ALB
fio2
peep
airway


In [149]:
thisFM

Unnamed: 0,ICUSTAY_ID,CL,temp_min,temp_max,spo2_min,spo2_max,resp_min,resp_max,po2,pain,...,NEUT_min,NEUT_max,PT,PTT,TBIL,ALB,fio2,peep_min,peep_max,airway
0,200001.0,98.0,35.277778,36.777778,87.0,100.0,15.0,29.0,14.9296,0.0,...,,,,,,,,,,
1,200010.0,108.0,36.055556,36.666667,98.0,100.0,6.0,22.0,,3.0,...,66.6,66.6,12.9,25.2,0.4,,,,,
2,200038.0,108.0,36.555556,36.888889,93.0,99.0,12.0,30.0,,0.0,...,,,14.9,28.4,,,,,,
3,200049.0,100.0,36.444444,37.388889,88.0,100.0,11.0,27.0,,5.0,...,,,20.8,,4.1,,,,,
4,200094.0,109.0,37.222222,37.611111,92.0,99.0,13.0,26.0,,3.0,...,,,12.9,26.4,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4065,299856.0,101.0,35.833333,39.444444,81.0,100.0,20.0,43.0,,0.0,...,77.9,77.9,,,,,50.0,,,
4066,299866.0,106.0,36.000000,37.888889,94.0,100.0,12.0,25.0,,7.0,...,,,,,0.4,,,,,
4067,299872.0,88.0,35.777778,37.277778,84.0,100.0,11.0,21.0,,0.0,...,,,34.9,150.0,,,1.0,,,
4068,299901.0,119.0,35.000000,36.000000,93.0,98.0,17.0,28.0,,,...,80.0,82.0,18.5,44.7,4.3,3.4,,,,


In [150]:
# 查看缺失值
thisFM.isnull().sum()

ICUSTAY_ID        0
CL              107
temp_min         19
temp_max         19
spo2_min         16
spo2_max         16
resp_min         17
resp_max         17
po2            3452
pain            812
na               96
k                95
hr_min           12
hr_max           12
pco2           3358
haemoglobin     151
creatinine      110
bun             109
bp_min           15
bp_max           15
WBC             151
PLT             151
GLU_min         102
GLU_max         102
hco3            110
ALT            2651
INR_min        1499
INR_max        1499
NEUT_min       3156
NEUT_max       3156
PT             1499
PTT            1559
TBIL           2632
ALB            3262
fio2           3109
peep_min       3717
peep_max       3717
airway         4057
dtype: int64

In [153]:
# 删除缺失值过多的列
columns_drop = ['po2','pco2','ALT','INR_min','INR_max','NEUT_min','NEUT_max','PT','PTT','TBIL','ALB','fio2','peep_min','peep_max','airway']
thisFM=thisFM.drop(columns=columns_drop,axis=1)
thisFM

Unnamed: 0,ICUSTAY_ID,CL,temp_min,temp_max,spo2_min,spo2_max,resp_min,resp_max,pain,na,...,haemoglobin,creatinine,bun,bp_min,bp_max,WBC,PLT,GLU_min,GLU_max,hco3
0,200001.0,98.0,35.277778,36.777778,87.0,100.0,15.0,29.0,0.0,135.0,...,8.4,256.418,20.7118,84.0,135.0,3.4,168.0,99.0,99.0,28.0
1,200010.0,108.0,36.055556,36.666667,98.0,100.0,6.0,22.0,3.0,143.0,...,8.6,79.578,3.9281,115.0,157.0,10.0,227.0,127.0,127.0,26.0
2,200038.0,108.0,36.555556,36.888889,93.0,99.0,12.0,30.0,0.0,138.0,...,9.9,88.420,6.7849,111.0,206.0,10.6,208.0,120.0,120.0,23.0
3,200049.0,100.0,36.444444,37.388889,88.0,100.0,11.0,27.0,5.0,137.0,...,8.9,433.258,16.4266,95.0,143.0,17.1,105.0,187.0,209.0,26.0
4,200094.0,109.0,37.222222,37.611111,92.0,99.0,13.0,26.0,3.0,139.0,...,6.9,503.994,24.2828,137.0,166.0,1.3,70.0,105.0,136.0,18.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4065,299856.0,101.0,35.833333,39.444444,81.0,100.0,20.0,43.0,0.0,134.0,...,11.0,79.578,4.6423,94.0,139.0,6.0,120.0,105.0,135.0,27.0
4066,299866.0,106.0,36.000000,37.888889,94.0,100.0,12.0,25.0,7.0,138.0,...,9.7,79.578,7.4991,122.0,180.0,12.9,170.0,111.0,111.0,27.0
4067,299872.0,88.0,35.777778,37.277778,84.0,100.0,11.0,21.0,0.0,132.0,...,9.1,229.892,20.3547,93.0,148.0,14.2,288.0,52.0,52.0,33.0
4068,299901.0,119.0,35.000000,36.000000,93.0,98.0,17.0,28.0,,149.0,...,12.7,123.788,13.2127,108.0,147.0,21.9,122.0,93.0,144.0,19.0


In [193]:
# 重新赋值（丢弃了部分列）
single_feature_variables = ['k', 'na', 'bun', 'creatinine', 'hco3', 'haemoglobin', 'fio2', 'airway',  'pain',
                           'WBC','PLT','CL','LA'
                           ]

In [155]:
thisFM=FM

In [154]:
FM = thisFM

How many missing values are there in this feature matrix?
在这个特征矩阵中有多少缺失值?

In [156]:
nrows = float(len(FM))

# missing_data = graphlab.SFrame()
missing_data = pd.DataFrame()
vname = []
miss_freq = []

In [157]:
# for col in FM.column_names():
for col in FM.columns:
    vname.append(col)
    # miss_freq.append(sum(FM[col]==None)/nrows)
    miss_freq.append((FM['k'].isnull().sum())/nrows)

missing_data['variable'] = vname
missing_data['fraction missing'] = miss_freq

missing_data = missing_data.sort_values(by='fraction missing', ascending=True)
# missing_data.print_rows(num_rows=25)
rows = missing_data.iloc[0:25]
print(rows)

       variable  fraction missing
0    ICUSTAY_ID          0.023342
20      GLU_min          0.023342
19          PLT          0.023342
18          WBC          0.023342
17       bp_max          0.023342
16       bp_min          0.023342
15          bun          0.023342
14   creatinine          0.023342
13  haemoglobin          0.023342
12       hr_max          0.023342
21      GLU_max          0.023342
11       hr_min          0.023342
9            na          0.023342
8          pain          0.023342
7      resp_max          0.023342
6      resp_min          0.023342
5      spo2_max          0.023342
4      spo2_min          0.023342
3      temp_max          0.023342
2      temp_min          0.023342
1            CL          0.023342
10            k          0.023342
22         hco3          0.023342


In [158]:
missing_data1=missing_data
missing_data1

Unnamed: 0,variable,fraction missing
0,ICUSTAY_ID,0.023342
20,GLU_min,0.023342
19,PLT,0.023342
18,WBC,0.023342
17,bp_max,0.023342
16,bp_min,0.023342
15,bun,0.023342
14,creatinine,0.023342
13,haemoglobin,0.023342
12,hr_max,0.023342


In [159]:
missing_data2=missing_data1.reset_index(drop=True)
missing_data2

Unnamed: 0,variable,fraction missing
0,ICUSTAY_ID,0.023342
1,GLU_min,0.023342
2,PLT,0.023342
3,WBC,0.023342
4,bp_max,0.023342
5,bp_min,0.023342
6,bun,0.023342
7,creatinine,0.023342
8,haemoglobin,0.023342
9,hr_max,0.023342


Some variables have a very high frequency of missing values.
We decided to improve this by relaxing the 4 hour measurement window of the NLD criteria - SEE BLEOW.
In some cases (e.g. laboratory results) this is clearly too short a window and is producing lots of missingness.
有些变量丢失值的频率很高。我们决定通过放宽全国民主联盟标准的4小时测量窗口来改进这一点-见BLEOW。
在某些情况下(如实验室结果)，这显然是一个太短的窗口，并产生许多缺失。

In [160]:
missing_data_summary['variable'] = missing_data['variable']
# missing_data_summary = missing_data_summary.join(how='inner',on='variable',right=missing_data)
missing_data_summary = pd.merge(missing_data_summary,missing_data, how='inner', on='variable')
missing_data_summary = missing_data_summary.rename({'fraction missing':'mimic_4hr'})

# all_data = all_data.join(_stays_join, how='inner', on='C.ICUSTAY_ID')
# all_data = pd.merge(all_data,_stays_join, how='inner', on='C.ICUSTAY_ID')

Here we fill airway since absence -> no ETT at this time and therefore patent airway:
在这里，我们填补了缺失的气道->此时没有ETT，因此气道通畅:
多了一个airway

In [161]:
FM_filled = FM.copy()
# FM_filled = FM_filled.fillna(column='airway', value=0.0)

In [162]:
#dataframe.fillna({'code':'code', 'date':'date'})，
# 第一个code和date分别表示列，后面的表示在该列填充的内容
FM_filled['airway'] =0.0
FM_filled
# FM_filled = FM_filled.fillna({'airway':0.0})

Unnamed: 0,ICUSTAY_ID,CL,temp_min,temp_max,spo2_min,spo2_max,resp_min,resp_max,pain,na,...,creatinine,bun,bp_min,bp_max,WBC,PLT,GLU_min,GLU_max,hco3,airway
0,200001.0,98.0,35.277778,36.777778,87.0,100.0,15.0,29.0,0.0,135.0,...,256.418,20.7118,84.0,135.0,3.4,168.0,99.0,99.0,28.0,0.0
1,200010.0,108.0,36.055556,36.666667,98.0,100.0,6.0,22.0,3.0,143.0,...,79.578,3.9281,115.0,157.0,10.0,227.0,127.0,127.0,26.0,0.0
2,200038.0,108.0,36.555556,36.888889,93.0,99.0,12.0,30.0,0.0,138.0,...,88.420,6.7849,111.0,206.0,10.6,208.0,120.0,120.0,23.0,0.0
3,200049.0,100.0,36.444444,37.388889,88.0,100.0,11.0,27.0,5.0,137.0,...,433.258,16.4266,95.0,143.0,17.1,105.0,187.0,209.0,26.0,0.0
4,200094.0,109.0,37.222222,37.611111,92.0,99.0,13.0,26.0,3.0,139.0,...,503.994,24.2828,137.0,166.0,1.3,70.0,105.0,136.0,18.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4065,299856.0,101.0,35.833333,39.444444,81.0,100.0,20.0,43.0,0.0,134.0,...,79.578,4.6423,94.0,139.0,6.0,120.0,105.0,135.0,27.0,0.0
4066,299866.0,106.0,36.000000,37.888889,94.0,100.0,12.0,25.0,7.0,138.0,...,79.578,7.4991,122.0,180.0,12.9,170.0,111.0,111.0,27.0,0.0
4067,299872.0,88.0,35.777778,37.277778,84.0,100.0,11.0,21.0,0.0,132.0,...,229.892,20.3547,93.0,148.0,14.2,288.0,52.0,52.0,33.0,0.0
4068,299901.0,119.0,35.000000,36.000000,93.0,98.0,17.0,28.0,,149.0,...,123.788,13.2127,108.0,147.0,21.9,122.0,93.0,144.0,19.0,0.0


In [163]:
FM_filled.columns

Index(['ICUSTAY_ID', 'CL', 'temp_min', 'temp_max', 'spo2_min', 'spo2_max',
       'resp_min', 'resp_max', 'pain', 'na', 'k', 'hr_min', 'hr_max',
       'haemoglobin', 'creatinine', 'bun', 'bp_min', 'bp_max', 'WBC', 'PLT',
       'GLU_min', 'GLU_max', 'hco3', 'airway'],
      dtype='object')

We construct a table with the final value before CALLOUT for each variable for each stay:
我们为每个stay变量在CALLOUT之前构造了一个表:

In [164]:
all_data = pd.read_pickle(DATA_PATH)
all_data = all_data[all_data['hrs_bRFD']>=0]
# all_data = all_data.filter_by(column_name='C.ICUSTAY_ID', values=FM_filled['ICUSTAY_ID'])
all_data = all_data[all_data['C.ICUSTAY_ID'].isin(FM_filled['ICUSTAY_ID'])]

In [165]:
all_data['VARIABLE'].unique()

array(['temp', 'hr', 'resp', 'spo2', 'bp', 'haemoglobin', 'WBC', 'CL',
       'creatinine', 'GLU', 'na', 'bun', 'k', 'hco3', 'PLT', 'PT', 'PTT',
       'INR', 'po2', 'pco2', 'ALT', 'TBIL', 'NEUT', 'pain', 'ALB', 'fio2',
       'peep', 'airway'], dtype=object)

In [166]:
# final_values = all_data.groupby(key_columns=['C.ICUSTAY_ID', 'VARIABLE'],
#                                 operations={
#                                             'hrs_bRFD':agg.ARGMAX('C.CHARTTIME', 'hrs_bRFD'),
#                                             'C.VALUENUM':agg.ARGMAX('C.CHARTTIME', 'C.VALUENUM')
#                                             })
final_values9 =  all_data.groupby(by=['C.ICUSTAY_ID', 'VARIABLE'],as_index=False)
# final_values9 =  all_data.groupby(by=['C.ICUSTAY_ID', 'VARIABLE'])

In [167]:
type(final_values9)

pandas.core.groupby.generic.DataFrameGroupBy

In [194]:
final_values9

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002662E3A73C8>

In [169]:
all_data['VARIABLE'].unique()

array(['temp', 'hr', 'resp', 'spo2', 'bp', 'haemoglobin', 'WBC', 'CL',
       'creatinine', 'GLU', 'na', 'bun', 'k', 'hco3', 'PLT', 'PT', 'PTT',
       'INR', 'po2', 'pco2', 'ALT', 'TBIL', 'NEUT', 'pain', 'ALB', 'fio2',
       'peep', 'airway'], dtype=object)

In [170]:
# final_values.loc[0:30]
final_values['VARIABLE'].unique()

array(['ALT', 'CL', 'GLU', 'INR', 'PLT', 'PT', 'PTT', 'TBIL', 'WBC', 'bp',
       'bun', 'creatinine', 'fio2', 'haemoglobin', 'hco3', 'hr', 'k',
       'na', 'pain', 'pco2', 'po2', 'resp', 'spo2', 'temp', 'NEUT', 'ALB',
       'airway', 'peep'], dtype=object)

出问题啦

出问题啦

出问题啦

出问题啦

出问题啦

出问题啦

出问题啦

出问题啦

出问题啦

出问题啦

出问题啦


In [171]:
def argMax1(con):
    ma=0
    id=0
    for i,v in enumerate(con):
        val=final_values.loc[i]['C.CHARTTIME']
        if val>ma:
            ma=val
            id = i
    ma = final_values.loc[id]['hrs_bRFD']
    return ma
    # id=con.idxmax()
    # val=final_values.loc[id]['C.CHARTTIME']
    # # val=final_values.loc[final_values['hrs_bRFD']==final_values['hrs_bRFD'].max()]['C.CHARTTIME'].values[0]
    # return con

In [172]:
def argMax2(con):
    ma=0
    id=0
    for i,v in enumerate(con):
        val=final_values.loc[i]['C.CHARTTIME']
        if val>ma:
            ma=val
            id = i
    ma = final_values.loc[id]['C.VALUENUM']
    return ma
    # id=con.idxmax()
    # val=final_values.loc[id]['C.CHARTTIME']
    # # val=final_values.loc[final_values['C.VALUENUM']==final_values[id]['C.VALUENUM'].max()]['C.CHARTTIME'].values[0]
    # return val
#['actionTime'].apply(lambda x:x.max())

In [173]:
# mean1=final_values9['hrs_bRFD'].apply(lambda i:i.max())
mean1

NameError: name 'mean1' is not defined

In [174]:
mean2=final_values9['C.VALUENUM'].apply(lambda i:i.max())
mean2

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,C.VALUENUM
0,200001.0,ALT,6.000000
1,200001.0,CL,101.000000
2,200001.0,GLU,108.000000
3,200001.0,INR,3.800000
4,200001.0,PLT,168.000000
...,...,...,...
89603,299998.0,peep,5.000000
89604,299998.0,po2,22.394400
89605,299998.0,resp,21.000000
89606,299998.0,spo2,100.000000


In [175]:
vall=final_values9['C.CHARTTIME'].apply(lambda i:i.max())
type(vall)

pandas.core.frame.DataFrame

In [176]:
def mean1(con):
    return final_values9['hrs_bRFD'].apply(lambda i:i.max())
def mean2(con):
    return final_values9['C.VALUENUM'].apply(lambda i:i.max())

In [177]:
#thisFM.bp_min.max()
# turicreate.aggregate.ARGMAX(out_column agg_column)
# 返回 agg_column 最大值 对应的 out_column 值
#GPA   ID
# 2.3  Tina1
# 3.4  Bob1
# 3.6  Lia1
# 2.9  Tina2
# 4.0  Blake1
# 4.5  Conor2
# df.loc[df.GPA == df.GPA.max(), 'ID'].values[0]
#    operations={
#     'hrs_bRFD':agg.ARGMAX('C.CHARTTIME', 'hrs_bRFD'),
#     'C.VALUENUM':agg.ARGMAX('C.CHARTTIME', 'C.VALUENUM')
# )


##########  all_data['C.VALUENUM']=all_data['C.VALUENUM'].astype(float)  相关可用，取均值应该好一些

final_values1 =  final_values9.agg(
    hrs_bRFD=('hrs_bRFD','max'),
    VALUENUM=('C.VALUENUM',np.random.choice)
                                   ).rename({
    'VALUENUM':'C.VALUENUM'
                                   })

In [178]:
final_values1

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,hrs_bRFD,VALUENUM
0,200001.0,ALT,53.676667,6.000000
1,200001.0,CL,53.676667,98.000000
2,200001.0,GLU,53.676667,108.000000
3,200001.0,INR,53.676667,3.800000
4,200001.0,PLT,53.676667,155.000000
...,...,...,...,...
89603,299998.0,peep,24.237500,5.000000
89604,299998.0,po2,22.270833,22.394400
89605,299998.0,resp,24.454167,15.000000
89606,299998.0,spo2,24.454167,96.000000


In [179]:
final_values1

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,hrs_bRFD,VALUENUM
0,200001.0,ALT,53.676667,6.000000
1,200001.0,CL,53.676667,98.000000
2,200001.0,GLU,53.676667,108.000000
3,200001.0,INR,53.676667,3.800000
4,200001.0,PLT,53.676667,155.000000
...,...,...,...,...
89603,299998.0,peep,24.237500,5.000000
89604,299998.0,po2,22.270833,22.394400
89605,299998.0,resp,24.454167,15.000000
89606,299998.0,spo2,24.454167,96.000000


In [180]:
final_values9['C.CHARTTIME'].count()

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,C.CHARTTIME
0,200001.0,ALT,1
1,200001.0,CL,3
2,200001.0,GLU,3
3,200001.0,INR,2
4,200001.0,PLT,3
...,...,...,...
89603,299998.0,peep,7
89604,299998.0,po2,2
89605,299998.0,resp,40
89606,299998.0,spo2,25


In [181]:
final_values = final_values1

In [182]:
final_values = final_values[final_values['hrs_bRFD']<=MAXIMUM_LOOKBACK]
final_values = final_values[final_values['hrs_bRFD']>HOURS_BEFORE_RFD]
final_values = final_values.sort_values(ascending=True, by=['VARIABLE', 'C.ICUSTAY_ID'])
# sort_values(by='fraction missing', ascending=True)

In [183]:
final_values['VARIABLE'].unique()

array(['ALB', 'ALT', 'CL', 'GLU', 'INR', 'NEUT', 'PLT', 'PT', 'PTT',
       'TBIL', 'WBC', 'airway', 'bp', 'bun', 'creatinine', 'fio2',
       'haemoglobin', 'hco3', 'hr', 'k', 'na', 'pain', 'pco2', 'peep',
       'po2', 'resp', 'spo2', 'temp'], dtype=object)

In [184]:
vars_to_use_fv = [var for var in final_values['VARIABLE'].unique()]

In [185]:
vars_to_use_fv

['ALB',
 'ALT',
 'CL',
 'GLU',
 'INR',
 'NEUT',
 'PLT',
 'PT',
 'PTT',
 'TBIL',
 'WBC',
 'airway',
 'bp',
 'bun',
 'creatinine',
 'fio2',
 'haemoglobin',
 'hco3',
 'hr',
 'k',
 'na',
 'pain',
 'pco2',
 'peep',
 'po2',
 'resp',
 'spo2',
 'temp']

In [186]:
FM_filled.columns

Index(['ICUSTAY_ID', 'CL', 'temp_min', 'temp_max', 'spo2_min', 'spo2_max',
       'resp_min', 'resp_max', 'pain', 'na', 'k', 'hr_min', 'hr_max',
       'haemoglobin', 'creatinine', 'bun', 'bp_min', 'bp_max', 'WBC', 'PLT',
       'GLU_min', 'GLU_max', 'hco3', 'airway'],
      dtype='object')

In [187]:
FM.columns

Index(['ICUSTAY_ID', 'CL', 'temp_min', 'temp_max', 'spo2_min', 'spo2_max',
       'resp_min', 'resp_max', 'pain', 'na', 'k', 'hr_min', 'hr_max',
       'haemoglobin', 'creatinine', 'bun', 'bp_min', 'bp_max', 'WBC', 'PLT',
       'GLU_min', 'GLU_max', 'hco3'],
      dtype='object')

In [188]:
FM_filled = FM_filled.sort_values(ascending=True, by='ICUSTAY_ID')


We now replace missing values (from four hour window) with final values from the extended window (where available):
我们现在替换缺失的值(从四个小时窗口)与最终值从扩展窗口(在可用的地方):


In [189]:
def replace_missing_value(subset, variable, FM_filled, N, vtype='min'):
    if vtype=='min':
        suffix = '_min'
    elif vtype=='max':
        suffix = '_max'
    elif vtype=='single':
        suffix = ''
    new_col = []
    _subset_row_counter = 0

    for iid,val in zip(FM_filled['ICUSTAY_ID'], FM_filled[variable+suffix]):
#if rid<=N and _subset.loc[rid]['C.ICUSTAY_ID']==sid:
        if val is None:
            if _subset_row_counter<=N and subset.loc[_subset_row_counter]['C.ICUSTAY_ID']==iid:
                new_col.append(subset.loc[_subset_row_counter]['VALUENUM'])
                _subset_row_counter += 1
            else:
                new_col.append(None)
        else:
            new_col.append(val)

    FM_filled[variable+suffix] = new_col
    return FM_filled

In [190]:
for variable in vars_to_use_fv:
    print (variable)
    _subset2 = final_values[final_values['VARIABLE']==variable].reset_index(drop=True)
    _subset3 = _subset2.sort_values(by='C.ICUSTAY_ID')
    _subset = _subset2
#_subset = features[features['VARIABLE']==var].reset_index(drop=True)
    N = len(_subset)-1
    if variable not in single_feature_variables:
        FM_filled = replace_missing_value(_subset, variable, FM_filled, N, vtype='min')
        FM_filled = replace_missing_value(_subset, variable, FM_filled, N, vtype='max')
    else:
        FM_filled = replace_missing_value(_subset, variable, FM_filled, N, vtype='single')

ALB


KeyError: 'ALB'

In [191]:
_subset

Unnamed: 0,C.ICUSTAY_ID,VARIABLE,hrs_bRFD,VALUENUM
0,200227.0,ALB,35.268611,2.6
1,200349.0,ALB,28.687500,2.9
2,200708.0,ALB,31.309167,2.7
3,200749.0,ALB,31.736667,3.3
4,200806.0,ALB,26.615278,2.0
...,...,...,...,...
309,298599.0,ALB,32.312778,2.8
310,298618.0,ALB,27.541389,2.8
311,298866.0,ALB,34.163056,3.3
312,298894.0,ALB,27.092778,4.4


We see that this has improved the missingness for most variables:
我们看到，这改善了大多数变量的缺失:

In [192]:
nrows = float(len(FM_filled))
for col in FM_filled.columns:
    print( '%s : %.3f' %(col, (FM_filled[col].isnull().sum())/nrows))

# miss_freq.append((FM['k'].isnull().sum())/nrows)
# (FM['k'].isnull().sum())

ICUSTAY_ID : 0.000
CL : 0.026
temp_min : 0.005
temp_max : 0.005
spo2_min : 0.004
spo2_max : 0.004
resp_min : 0.004
resp_max : 0.004
pain : 0.200
na : 0.024
k : 0.023
hr_min : 0.003
hr_max : 0.003
haemoglobin : 0.037
creatinine : 0.027
bun : 0.027
bp_min : 0.004
bp_max : 0.004
WBC : 0.037
PLT : 0.037
GLU_min : 0.025
GLU_max : 0.025
hco3 : 0.027
airway : 0.000


错了  下面有类四的方法，，，在最后的部分，，， 解决方法放在了下面

In [None]:
#                         原来程序错误，使用替代方法
# missing_data_summary['mimic_36hr'] = missing_data_summary.apply\
#     (lambda row: sum(FM_filled[row['variable']].isnull())/nrows
#     if row['variable']!='ICUSTAY' else 0.0)
nrows = float(len(FM_filled))
for i in FM_filled.columns:
    missing_data_summary['mimic_36hr']=\
        missing_data_summary['variable'].apply(lambda i : (FM_filled[i].isnull().sum())/nrows
                                               if i!='ICUSTAY' else 0.0)

In [None]:
# missing_data_summary

In [None]:
missing_data_summary['mimic_36hr']

In [None]:
summary1=summary.reset_index()
# summary1
_sum_sub = summary1[['C.ICUSTAY_ID', 'cohort', 'outcome']]
_sum_sub = _sum_sub.rename(columns={'C.ICUSTAY_ID':'ICUSTAY_ID'})

In [None]:
# FM_filled = FM_filled.join(how='inner', on='ICUSTAY_ID', right=_sum_sub)
FM_filled = pd.merge(FM_filled,_sum_sub ,how='inner', on='ICUSTAY_ID')

We now have a feature matrix. However, the class sizes are highly imbalanced.
We decided to balance the class sizes by creating more instances of the negative class. We do this by sampling the patients at integer multiples of 24 hours prior to their CALLOUT.
现在我们有了一个特征矩阵。然而，班级规模是高度不平衡的。我们决定通过创建更多的负类实例来平衡类的大小。
我们通过在患者CALLOUT前24小时的整数倍采样来做到这一点。

In [None]:
all_data =pd.read_pickle(DATA_PATH)

We look at how many stays are available at Xi days prior to callout,
and how many extra samples would be required to balance the class sizes:
我们查看了在预约前的几天，Xi有多少个住宿名额，以及需要多少额外的样本来平衡班级规模:

In [None]:
# _stay_summary = \
#     all_data.groupby(key_columns='C.ICUSTAY_ID',
#                      operations={'outcome':agg.SELECT_ONE('outcome'),
#                                  'RFD':agg.SELECT_ONE('RFD'),
#                                  'intime':agg.SELECT_ONE('II.INTIME')
#                                  })
_stay_summary = all_data.groupby(by='C.ICUSTAY_ID').agg(
    outcome=('outcome',np.random.choice),
    RFD=('RFD',np.random.choice),
    intime=('II.INTIME',np.random.choice)
)

In [None]:
_stay_summary['len'] = (_stay_summary['RFD'] - _stay_summary['intime'])/pd.Timedelta('1h')

In [None]:
# _stay_summary

In [None]:
# _stay_summary.info()

In [None]:
fx = lambda x:24*x + 4


In [None]:
_stays = []
for i in range(10):
    _stays.append(sum(_stay_summary['len'] >= fx(i)))

In [None]:
_stays

In [None]:
num_rfd0 = sum(FM_filled['outcome'])
num_nrfd0 = len(FM_filled['outcome']) - num_rfd0

print (num_rfd0)
print (num_nrfd0)
print ("We would need at least %d extra samples to balance the classes." %(num_rfd0 - num_nrfd0))



In [None]:
fsa = 15

plt.bar(range(10), _stays, color='r', label='NRFD')
ax = plt.gca()
ax.bar(0,num_rfd0, color='g', label='RFD')
plt.legend(fontsize=fsa)
plt.xlabel('Xi days', fontsize=fsa)
plt.ylabel('number of patients', fontsize=fsa)
plt.xticks(fontsize=fsa-2)
plt.yticks(fontsize=fsa-2)
plt.tight_layout()
plt.savefig('mimic_negex_bars.png')



We determine that sampling at Xi in (3-8) will provide more than enough extra NRFD instances.
我们确定在Xi(3-8)取样将提供足够多的额外NRFD实例。

In [None]:
def sample_at_xi(_all_data, _summary, xi=3, h=4):
    ## selects stays to include in negative extras
    ## xi is integer number of days before RFD flag
    ## h is sample window length (4 hours is NLD standard)
    ##      new: only stays that are at least (xi+1)*24 hours long are included
    ##           so that sample point does not fall within 24 hours of ICU admission (thanks to reviewer 2)

    sids = _summary[_summary['len']>= fx(xi + 1)]['C.ICUSTAY_ID']  ## added +1 to filter out sample points that would be witin 24 hours of admission to ICU
    # sub_data = _all_data.filter_by(column_name='C.ICUSTAY_ID',values=sids, exclude=False)
    sub_data = _all_data[_all_data['C.ICUSTAY_ID'].isin(sids)]

    sub_data = sub_data[sub_data['hrs_bRFD']>= 24*xi]
    sub_data = sub_data[sub_data['hrs_bRFD']<= 24*xi + h]
    return sub_data

In [None]:
Xi_range = [3,4,5,6,7,8]
H = 4
fx = lambda x:24*x + H
_stay_summary=_stay_summary.reset_index()

for Xi in Xi_range:

    sub_data = sample_at_xi(all_data, _stay_summary, Xi, H)
    print ("Xi = %d, ns = %d" %(Xi, len(sub_data['C.ICUSTAY_ID'].unique())))

In [None]:
sub_data

In [None]:
print ("The lowest ICUSTAY_ID in the dataset is: %s." %min(all_data['C.ICUSTAY_ID']))

We create a one-to-one ID mapping
so we can unqileuy identify which patient each new sample comes from:
我们创建了一个一对一的ID映射，这样我们就可以确定每个新样本来自哪个病人:



In [None]:
id_mapping = pd.DataFrame()
largest_id = max(all_data['C.ICUSTAY_ID'])

In [None]:
for Xi in Xi_range:
    sub_data = sample_at_xi(all_data, _stay_summary, Xi, H)

    L = len(sub_data['C.ICUSTAY_ID'].unique())
    _temp_id_mapping = pd.DataFrame()
    _temp_id_mapping['Xi'] = Xi * np.ones(L)
    _temp_id_mapping['ICUSTAY_ID'] = sub_data['C.ICUSTAY_ID'].unique()
    id_mapping = id_mapping.append(_temp_id_mapping)

In [None]:
id_mapping['new_ID'] = np.arange(len(id_mapping))

In [None]:
id_mapping

In [None]:
if SAVE_FLAG:
    id_mapping.to_pickle('mimic_negative_extras_id_mapping1')

Get features for the new samples at Xi = [3,4,5,6,7,8]:

获取Xi =[3,4,5,6,7,8]的新样本特征:

In [None]:
sub_data1 = sample_at_xi(all_data, _stay_summary, Xi_range[0], 4)
sub_data1['C.VALUENUM']

In [None]:
def get_Xi_features(Xi, all_data, _stay_summary, id_mapping, H=4):
    sub_data3 = sample_at_xi(all_data, _stay_summary, Xi, H)
    ## now get features from subset of data:现在从数据子集中获取特征:
    sub_data3['C.VALUENUM']=sub_data3['C.VALUENUM'].astype(float)
    # sub_data['C.VALUENUM']=sub_data['C.VALUENUM'].astype(float)
    features3 = sub_data3.groupby(by=['C.ICUSTAY_ID', 'VARIABLE'])
    features = features3.agg(
        min=('C.VALUENUM','min'),
        max=('C.VALUENUM','max'),
        mean=('C.VALUENUM','mean'),
        var=('C.VALUENUM','var'),
        times=('C.CHARTTIME',np.random.choice),
        values=('C.VALUENUM',np.random.choice),
        count=('C.VALUENUM','count'),
        cohort=('cohort',np.random.choice),
        outcome=('outcome',np.random.choice)
        )
    ids = id_mapping[id_mapping['Xi']==float(Xi)]
    features = features.reset_index()
    features = features.rename(columns={'C.ICUSTAY_ID': 'ICUSTAY_ID'})
    # features = features.join(right=ids, on='ICUSTAY_ID', how='inner')
    features = pd.merge(features,ids, on='ICUSTAY_ID', how='inner')

    # features = features.remove_column('ICUSTAY_ID')
    features = features.drop(columns='ICUSTAY_ID',axis= 1)
    features = features.rename(columns={'new_ID': 'C.ICUSTAY_ID'})

    features = features.sort_values(ascending=True, by='C.ICUSTAY_ID')

    _temp_FM = _split_features_to_columns(features)
    return _temp_FM

In [None]:
_FM_negative = get_Xi_features(Xi_range[0], all_data, _stay_summary, id_mapping)

In [None]:
_FM_negative

从此处开始，是实验过程，很重要，对应上面的定义的函数

In [None]:
# sub_data2 = sample_at_xi(all_data, _stay_summary, Xi_range[0], 4)

In [None]:
# # sub_data['C.VALUENUM']=sub_data['C.VALUENUM'].astype(float)
# sub_data2['C.VALUENUM']=sub_data2['C.VALUENUM'].astype(float)
# features1 = sub_data2.groupby(by=['C.ICUSTAY_ID', 'VARIABLE'])

In [None]:
# features21 = features1.agg(
#     min=('C.VALUENUM','min'),
#     max=('C.VALUENUM','max'),
#     mean=('C.VALUENUM','mean'),
#     var=('C.VALUENUM','var'),
#     times=('C.CHARTTIME',np.random.choice),
#     values=('C.VALUENUM',np.random.choice),
#     count=('C.VALUENUM','count'),
#     cohort=('cohort',np.random.choice),
#     outcome=('outcome',np.random.choice)
# )
#

In [None]:
# # sub_data['C.VALUENUM'].astype(float)

In [None]:
# Xi=Xi_range[0]
# H=4
# # ids = id_mapping.filter_by(column_name='Xi', values=float(Xi))
# ids1 = id_mapping[id_mapping['Xi']==float(Xi)]

In [None]:
# features21 = features21.reset_index()

In [None]:
# features21 = features21.rename(columns={'C.ICUSTAY_ID': 'ICUSTAY_ID'})
# # features = features.join(right=ids, on='ICUSTAY_ID', how='inner')
# features21 = pd.merge(features21,ids1, on='ICUSTAY_ID', how='inner')

In [None]:
# features21

In [None]:
# # features = features.remove_column('ICUSTAY_ID')
# features21 = features21.drop(columns='ICUSTAY_ID',axis= 1)
# features21 = features21.rename(columns={'new_ID': 'C.ICUSTAY_ID'})
#
# features21 = features21.sort_values(ascending=True, by='C.ICUSTAY_ID')

In [None]:
def _split_features_to_columns(features, single_feature_vars=single_feature_variables):
    ''' Note: sort features by ICUSTAY_ID before using this function.'''

    all_variables = features['VARIABLE'].unique()
    # all_stay_ids = features['C.ICUSTAY_ID'].unique().sort()
    all_stay_ids11=features['C.ICUSTAY_ID'].unique()
    all_stay_ids11.sort()
    all_stay_ids=all_stay_ids11

    thisFM = pd.DataFrame()
    thisFM['ICUSTAY_ID'] = all_stay_ids

    for var in all_variables:
        print(var)

        var_min = []
        var_max = []
        var_values = []
        var_count = []

        _subset = features[features['VARIABLE']==var].reset_index(drop=True)
        N = len(_subset)-1

        rid = 0
        for sid in all_stay_ids:
            if rid<=N and _subset.loc[rid]['C.ICUSTAY_ID']==sid:
                ## add data and move on a row
                row = _subset.loc[rid]
                var_min.append(row['min'])
                var_max.append(row['max'])
                var_values.append(row['values'])
                rid += 1
            else:
                ## add None values and stay in place
                var_min.append(None)
                var_max.append(None)
                var_values.append([])

        if var not in single_feature_vars:
            thisFM[var + '_min'] = var_min
            thisFM[var + '_max'] = var_max

        else:
            ## if it is a single variable feature - take final value only .loc
            for val in var_values:
                # if val == 'bun' :
                #     thisFM[var] = val[-1]
                # else:
                #     if len(val)>0:
                #         thisFM[var] = val[-1]
                #     else:
                #         thisFM[var] = None
                try:
                    if len(val)>0:
                        thisFM[var] = val[-1]
                    else:
                        thisFM[var] = None
                except:
                    val=str(val)
                    thisFM[var] = val[-1]
                # if val == 'bun' or len(val)>0:
                #     thisFM[var] = val[-1]
                # else:
                #     thisFM[var] = None
            # thisFM[var] = [val[-1]  if len(val)>0 else None for val in var_values]

    return thisFM

In [None]:
# # _temp_FM = _split_features_to_columns(features)
# _temp_FM1 = _split_features_to_columns(features21)
#

In [None]:
# _FM_negative=_temp_FM

In [None]:
# _temp_FM1=_temp_FM

实验过程结束，中间 的定义的函数，是其中使用到的

In [None]:
for Xi in Xi_range[1:]:
    print (Xi)
    #下面不用管，实验过程

    # H=4
    # sub_data = sample_at_xi(all_data, _stay_summary, Xi, H)
    # sub_data['C.VALUENUM']=sub_data['C.VALUENUM'].astype(float)
    # features3 = sub_data.groupby(by=['C.ICUSTAY_ID', 'VARIABLE'])
    # features4 = features3.agg(
    #     min=('C.VALUENUM','min'),
    #     max=('C.VALUENUM','max'),
    #     mean=('C.VALUENUM','mean'),
    #     var=('C.VALUENUM','var'),
    #     times=('C.CHARTTIME',np.random.choice),
    #     values=('C.VALUENUM',np.random.choice),
    #     count=('C.VALUENUM','count'),
    #     cohort=('cohort',np.random.choice),
    #     outcome=('outcome',np.random.choice)
    # )
    # ids = id_mapping[id_mapping['Xi']==float(Xi)]
    # features4 = features4.rename(columns={'C.ICUSTAY_ID': 'ICUSTAY_ID'})
    # features4 = pd.merge(features4,ids, on='ICUSTAY_ID', how='inner')
    # features4 = features4.drop(columns='ICUSTAY_ID',axis= 1)
    # features4 = features4.rename(columns={'new_ID': 'C.ICUSTAY_ID'})
    # features4 = features4.sort_values(ascending=True, by='C.ICUSTAY_ID')
    # _temp_FM = _split_features_to_columns(features4)
    # _temp_FM=_temp_FM


    _temp_FM = get_Xi_features(Xi, all_data, _stay_summary, id_mapping)
    _FM_negative = _FM_negative.append(_temp_FM)

In [None]:
_FM_negative

How much missing data?
多少缺失的数据?

In [None]:
nrows = float(len(_FM_negative))

missing_data = pd.DataFrame()
vname = []
miss_freq = []

In [None]:
for col in _FM_negative.columns:
        vname.append(col)
        # miss_freq.append(sum(_FM_negative[col]==None)/nrows)
        # ##(FM_filled[col].isnull().sum())
        miss_freq.append((_FM_negative[col].isnull().sum())/nrows)

In [None]:
missing_data['variable'] = vname
missing_data['fraction missing'] = miss_freq

In [None]:
missing_data = missing_data.sort_values(by='fraction missing', ascending=True)
# missing_data.print_rows(num_rows=25)
missing_data.loc[0:]

In [None]:
# missing_data_summary = missing_data_summary.join(how='inner', on='variable', right=missing_data)
missing_data_summary = pd.merge(missing_data_summary,missing_data,how='inner', on='variable')
missing_data_summary = missing_data_summary.rename(columns={'fraction missing':'mimic_negex_4hour'})



Fill the missing data:



In [None]:
_FM_negative_filled = _FM_negative.copy()
# _FM_negative_filled = _FM_negative_filled.fillna(column='airway', value=0.0)
_FM_negative_filled = _FM_negative_filled.fillna({'airway':0.0})
# # ## FM_filled = FM_filled.fillna({'airway':0.0})

In [None]:
def fill_with_final_values(FM_filled, Xi, new_icustays):

    # all_data = graphlab.SFrame(DATA_PATH)
    all_data = pd.read_pickle(DATA_PATH)
    # all_data = all_data[all_data['hrs_bRFD'] >= 24*Xi]
    all_data = all_data.loc[all_data['hrs_bRFD'] >= 24*Xi]

    # new_icustays = new_icustays.rename({'ICUSTAY_ID':'C.ICUSTAY_ID'})
    new_icustays = new_icustays.rename(columns={'ICUSTAY_ID':'C.ICUSTAY_ID'})
    # all_data = all_data.join(right=new_icustays, on='C.ICUSTAY_ID', how='inner')
    all_data = pd.merge(all_data,new_icustays, on='C.ICUSTAY_ID', how='inner')
    # all_data = all_data.remove_column('C.ICUSTAY_ID')
    all_data = all_data.drop(columns='C.ICUSTAY_ID',axis= 1)
    # all_data = all_data.rename({'new_ID':'C.ICUSTAY_ID'})
    all_data = all_data.rename(columns={'new_ID':'C.ICUSTAY_ID'})

    # final_values = all_data.groupby(key_columns=['C.ICUSTAY_ID', 'VARIABLE'],
    #                                 operations={
    #                                             'hrs_bRFD':agg.ARGMAX('C.CHARTTIME', 'hrs_bRFD'),
    #                                             'C.VALUENUM':agg.ARGMAX('C.CHARTTIME', 'C.VALUENUM')
    #                                             })
    all_data['C.VALUENUM']=all_data['C.VALUENUM'].astype(float)
    final_values = all_data.groupby(by=['C.ICUSTAY_ID', 'VARIABLE'])
    final_values = final_values.agg(
        hrs_bRFD=('hrs_bRFD','mean'),
        VALUENUM=('C.VALUENUM','mean')
    )
    #.rename(columns={'VALUENUM':'C.VALUENUM'})
    final_values = final_values.reset_index()

    final_values = final_values[final_values['hrs_bRFD'] <= MAXIMUM_LOOKBACK + (24*Xi)]
    final_values = final_values[final_values['hrs_bRFD'] > H + (24*Xi)]

    # final_values = final_values.sort(ascending=True, sort_columns=['VARIABLE', 'C.ICUSTAY_ID'])
    final_values = final_values.sort_values(ascending=True, by=['VARIABLE', 'C.ICUSTAY_ID'])
    vars_to_use_fv = [var for var in final_values['VARIABLE'].unique()]
    # FM_filled = FM_filled.sort(ascending=True, sort_columns='ICUSTAY_ID')
    FM_filled = FM_filled.sort_values(ascending=True, by='ICUSTAY_ID')

    for variable in vars_to_use_fv:
        print( variable)

        # _subset = final_values[final_values['VARIABLE']==variable]
        _subset = final_values[final_values['VARIABLE']==variable].reset_index()
        # _subset = _subset.sort(sort_columns='C.ICUSTAY_ID', ascending=True)
        _subset = _subset.sort_values(by='C.ICUSTAY_ID', ascending=True)
#_subset = features[features['VARIABLE']==var].reset_index
        N = len(_subset)-1

        if variable not in single_feature_variables:
            FM_filled = replace_missing_value(_subset, variable, FM_filled, N, vtype='min')
            FM_filled = replace_missing_value(_subset, variable, FM_filled, N, vtype='max')
        else:
            FM_filled = replace_missing_value(_subset, variable, FM_filled, N, vtype='single')

    return FM_filled


Fill separately for each Xi and then recombine:
分别填充每个Xi，然后重新组合:


In [None]:
Xi = Xi_range[0]
## filter feature matrix to only contain instances from Xi sample过滤特征矩阵，只包含来自Xi样本的实例
# _temp_FM_filled = _FM_negative_filled.filter_by(column_name='ICUSTAY_ID', values=id_mapping[id_mapping['Xi']==float(Xi)]['new_ID'])
_temp_FM_filled = _FM_negative_filled[_FM_negative_filled['ICUSTAY_ID'].isin(id_mapping[id_mapping['Xi']==float(Xi)]['new_ID'])]

In [None]:
## fill with appropriate final values:填充适当的最终值:
new_icustays = id_mapping[id_mapping['Xi']==float(Xi)]
_temp_FM_filled = fill_with_final_values(FM_filled=_temp_FM_filled, Xi=Xi, new_icustays = new_icustays)
FM_negative_filled = _temp_FM_filled.copy()

In [None]:
for Xi in Xi_range[1:]:
    ## filter feature matrix to only contain instances from Xi sample过滤特征矩阵，只包含来自Xi样本的实例
    # _temp_FM_filled = _FM_negative_filled.filter_by(column_name='ICUSTAY_ID', values=id_mapping[id_mapping['Xi']==float(Xi)]['new_ID'])
    _temp_FM_filled = _FM_negative_filled[_FM_negative_filled['ICUSTAY_ID'].isin(id_mapping[id_mapping['Xi']==float(Xi)]['new_ID'])]
    ## fill with appropriate final values:填充适当的最终值:
    new_icustays = id_mapping[id_mapping['Xi']==float(Xi)]
    _temp_FM_filled = fill_with_final_values(FM_filled=_temp_FM_filled, Xi=Xi, new_icustays = new_icustays)
    FM_negative_filled = FM_negative_filled.append(_temp_FM_filled)


How much missing data in the feature matrix for these extra NRFD samples:
这些额外的NRFD样本在特征矩阵中缺失了多少数据:

In [None]:
nrows = float(len(FM_negative_filled))
for col in FM_negative_filled.columns:
        print('%s : %.3f' %(col, (FM_negative_filled[col].isnull().sum())/nrows))

#(_FM_negative[col].isnull().sum()

出错了，老错误，缺少 'variable'  FM_negative_filled[row['variable']],,想出了一个替代的方法

出错了，老错误，缺少 'variable'  FM_negative_filled[row['variable']]
for i in FM_negative_filled.columns:
    missing_data_summary['mimic_negex_36hr']=missing_data_summary['variable'].apply(lambda i : (FM_negative_filled[i].isnull().sum())/nrows)

出错了，老错误，缺少 'variable'  FM_negative_filled[row['variable']]

In [None]:
# missing_data_summary['mimic_negex_36hr'] = \
#     missing_data_summary.apply(lambda row: (FM_negative_filled[row['variable']].isnull().sum())/nrows )
# s = pd.DataFrame()
for i in FM_negative_filled.columns:
    missing_data_summary['mimic_negex_36hr']=missing_data_summary['variable'].apply(lambda i : (FM_negative_filled[i].isnull().sum())/nrows)

In [None]:
# n=[]
# for i in FM_negative_filled.columns:
#     print(i)
#     n.append(i)
# n

In [None]:
nrows = float(len(FM_negative_filled))
# s = pd.DataFrame()
for i in FM_negative_filled.columns:
    missing_data_summary['mimic_negex_36hr']=missing_data_summary['variable'].apply(lambda i : (FM_negative_filled[i].isnull().sum())/nrows)

In [None]:
# missing_data_summary1['mimic_negex_36hr']=missing_data_summary['mimic_negex_36hr']

In [None]:
all_data = pd.read_pickle(DATA_PATH)
summary1 = all_data.groupby(by=['C.ICUSTAY_ID'])
summary = summary1.agg(
    outcome=('outcome',np.random.choice),
    cohort=('cohort',np.random.choice),
    readmit=('readmit',np.random.choice),
    in_h_death=('in_h_death',np.random.choice),
    in_icu_death=('in_icu_death',np.random.choice),
    LOS=('II.LOS',np.random.choice),
    OUTTIME=('II.OUTTIME',np.random.choice),
    INTIME=('II.INTIME',np.random.choice),
    RFD=('RFD',np.random.choice)
).rename(columns={
    'LOS':'II.LOS',
    'OUTTIME':'II.OUTTIME',
    'INTIME':'II.INTIME'
})

In [None]:
summary=summary.reset_index()
# summary

In [None]:
summary = summary.rename(columns={'C.ICUSTAY_ID':'ICUSTAY_ID'})
# cohort_info = id_mapping.join(how='inner', on='ICUSTAY_ID', right=summary)
cohort_info = pd.merge(id_mapping,summary,how='inner', on='ICUSTAY_ID')

In [None]:
cohort_info = cohort_info[['new_ID', 'cohort']]
cohort_info = cohort_info.rename(columns={'new_ID':'ICUSTAY_ID'})

# FM_negative_filled = FM_negative_filled.join(how='inner', on='ICUSTAY_ID', right=cohort_info)
FM_negative_filled = pd.merge(FM_negative_filled,cohort_info,how='inner', on='ICUSTAY_ID')

FM_negative_filled['outcome'] = np.zeros(len(FM_negative_filled), dtype=int)

We finally join the extra negative instances to the original feature matrix:
我们最终将额外的负实例加入到原始特征矩阵中:

In [None]:
FM_filled = FM_filled.append(FM_negative_filled)



Save the feature matrix with and without missing values included:
保存包含或不包含缺失值的特征矩阵:


In [None]:
if SAVE_FLAG:
        FM_filled.to_pickle('Feature_Matrix_MIMIC_with_missing_values')
        print(len(FM_filled))

        FM_filled_cc = FM_filled.dropna(axis=0,how='any')
        FM_filled=FM_filled_cc.reset_index(drop=True)
        FM_filled_cc.to_pickle('Feature_Matrix_MIMIC_complete_case')
        print(len(FM_filled))

Produce summary of missing values as recorded throughout this script:
生成缺失值的总结，记录在整个脚本中:



In [None]:
order = pd.DataFrame()
order['variable'] =['hr_max', 'hr_min','spo2_max',
                'spo2_min','airway','resp_max',
                'resp_min','bp_min','bp_max',
                'temp_min','temp_max','gcs_max',
                'gcs_min','pain','hco3',
                'pco2','po2','fio2',
                'haemoglobin','k','bun',
                'creatinine','na']

In [None]:
order['num'] = range(len(order))



In [None]:
# missing_data_summary = missing_data_summary.join(how='inner', on='variable', right=order)
missing_data_summary = pd.merge(missing_data_summary,order,how='inner', on='variable')
missing_data_summary = missing_data_summary.sort_values(by='num')
# missing_data_summary = missing_data_summary.remove_column('num')
missing_data_summary = missing_data_summary.drop(columns='num',axis= 1)



In [None]:
missing_data_summary=missing_data_summary.reset_index(drop=True)
# missing_data_summary.export_csv(header=True, filename='missing_data_summary_mimic.csv')
missing_data_summary.to_csv(header=True, path_or_buf='missing_data_summary_mimic.csv')

In [None]:
# missing_data_summary.print_rows(num_rows=len(missing_data_summary))
missing_data_summary.loc[0:(len(missing_data_summary)-1)]

In [None]:
# FM_filled.export_csv('Feature_Matrix_MIMIC_with_missing_values.csv')
FM_filled.to_csv('Feature_Matrix_MIMIC_with_missing_values.csv')

In [None]:
FM_filled_cc.to_csv('Feature_Matrix_MIMIC_complete_case.csv')

In [None]:
FM_filled1=pd.read_pickle('Feature_Matrix_MIMIC_with_missing_values')



In [None]:
FM_filled1.to_csv('Feature_Matrix_MIMIC_with_missing_values.csv')


#下面为别的文件



In [None]:
all_data_to=pd.read_pickle('mimic_all_data_CLEANED1')
all_data_to

In [None]:
all_data_to.to_csv('mimic_all_data_CLEANED1')



In [None]:
all_data_to[all_data_to['outcome']==0]
