Important notes:

+ min_event_date = 2006-02-01
+ max_event_date = 2017-01-01
+ periods_to_predict = max_event_date - periods_to_predict = 12 (months) = 1 year
+ max_observation_date = 2016-01-01 (because there is no target data for 2017)

A) User **split between train and validation users**:
+ train_users = train_proportion * total_users
+ validation_users = (1 - train_proportion) * total_users

B) For **train and valid users** (all users, in fact):
Considerations:
+ max_train_valid_date = max_observation_date - periods_to_predict = 2015-01-01
+ cut_train_valid_date = minimum(death_date, cancer_dx_date, max_train_valid_date)

To do (same for train and validation):
+ Keep the last 48 observations (months) previous to cut_valid_date
+ Healthy users at max_train_valid_date: events from 2011-01-01 to 2015-01-01 (even if they die or they are cancer diagnosed after 2015-01-01)
+ Death or cancer diagnosed users before max_train_valid_date: 48 observations / months prior to their death or cancer diagnosis date
+ This selection has target data about cancer diagnosis between the min_event_date (2006-02-01) and max_observation_date (2016-01-01)

C) **Create test set** (not all users are in the test set!)
+ Test users: Only those who are healthy at max_obsrvation_date (2016-01-01)
+ This selection has target data about cancer diagnosis between the max_observation_date (2016-01-01) and max_event_date (2017-01-01)


FAQ:
+ Is it possible to find a user both in train and valid sets?
- NO

+ Is it possible to find a user both in train and test sets?
- YES, we will just be predicting a completely different period of their life, with a different but partly overlapped set of events.

In [1]:
import sys
sys.path.append('/TFM/')

In [2]:
%matplotlib inline
from dateutil import relativedelta

from keras.models import Sequential, Model
from keras.layers import LSTM, Dense, TimeDistributed, Input, Embedding, Concatenate, BatchNormalization, Reshape, Dropout, MaxPooling1D, Conv1D
from keras.utils import to_categorical
from keras.regularizers import L1L2


from keras.callbacks import History 
from sklearn.preprocessing import LabelEncoder, Imputer, MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss, f1_score, auc
from keras.optimizers import Adam

from sklearn.externals import joblib
import datetime as dt
import matplotlib.pylab as plt
import numpy as np
import os
import pandas as pd
from TFM.settings import set_col_names
from TFM.feature_transformation import get_all_times, transform_features, expand, adding_columns, generating_catalegs, pivoting_events
from TFM.validation import run_test
import TFM.feature_engineering as FE
from TFM.generate_model import generate_model2
from TFM.load_data import getting_events
import gc

Using TensorFlow backend.


In [3]:
set_col_names(right_ts='dat', user_id='id', target='target', code='new_code')

In [4]:
path = 'MODEL10_ALL/'
num_model = '10'

# Loading events

In [5]:
events_type1 = joblib.load('pickles_ALL2/ALL_events1_moved.pickle')
print('Events of type 1: ', events_type1.size)
display(events_type1.head())

Events of type 1:  104474632


Unnamed: 0,id,dat,val,new_code
309,124,2006-02-01,1.0,f_M01AB05
311,124,2006-09-01,3.0,f_D01AE14
313,124,2007-01-01,3.0,f_D01AE14
314,124,2007-01-01,100.0,m_EK201
315,124,2007-01-01,60.0,m_EK202


In [6]:
events_type2 = joblib.load('pickles_ALL2/ALL_events2_moved.pickle')
print('Events of type 2: ', events_type2.size)
display(events_type2.head())

Events of type 2:  77041104


Unnamed: 0,dat,id,new_code
595,2006-03-01,124,b
596,2006-04-01,124,b
597,2006-09-01,124,d_L29.9
598,2006-10-01,124,d_L29.9
599,2006-11-01,124,d_L29.9


# Loading population

In [7]:
pob = joblib.load('pickles_ALL2/ALL_pob_sex.pickle')
print('Original population size: ',pob['id'].unique().size)

Original population size:  60170


In [8]:
deaths = joblib.load('pickles_ALL2/DEF_deaths_moved.pickle')
print('Original death file size: ',deaths['id'].unique().size)

Original death file size:  8700


# Recurrent functions

In [9]:
def get_intervals(events_type1, events_type2, deaths, p2p):
    
    
    full_range2 = events_type2['dat'].agg(['min', 'max'])
    df = events_type2.groupby('id')['dat'].agg(['min']).reset_index()
    df1 = events_type1.groupby('id')['dat'].agg(['min']).reset_index()
    merg = df.merge(df1, on='id', how = 'outer', copy = False)
    periods = merg.copy()

    periods['dat_min'] = merg[['min_x','min_y']].min(axis=1)
    periods['dat_max_all'] = pd.to_datetime('2017-01-01', format='%Y-%m-%d')
    periods.drop(['min_x','min_y'], axis = 1, inplace = True)

    #periods['dat_max'] = periods['dat_max'].apply(lambda x: x - relativedelta.relativedelta(months=12))
    periods['dat_end'] = periods['dat_max_all'].apply(lambda x: x - relativedelta.relativedelta(months=12+p2p))
    

    periods['dat_end_test'] = periods['dat_max_all'].apply(lambda x: x - relativedelta.relativedelta(months=p2p))
    
    deaths_events = deaths[['id', 'dat']].sort_values('dat').drop_duplicates(['id'], keep='first').set_index('id')['dat'].apply(lambda x: x - relativedelta.relativedelta(months=1)).to_dict()
    #periods['dat_max'] = periods['id'].map(deaths_events).apply(lambda x: x - relativedelta.relativedelta(months=12)).fillna(periods['dat_max'])
    periods['dat_death'] = periods['id'].map(deaths_events)
    periods['temp'] = periods['id'].map(deaths_events)
    ix = periods['temp'].notnull()
    periods.loc[ix, 'end_type'] = 'death'


    m = (events_type2['new_code'].str[:3] == 'd_C')
    melanoma = (events_type2['new_code'].str[:3] == 'd_C44')

    pob_cancer = events_type2.loc[m&~melanoma, ['id', 'dat']].sort_values('dat').drop_duplicates(['id'], keep='first').set_index('id')['dat'].apply(lambda x: x - relativedelta.relativedelta(months=1))
    #pob_cancer['dat'] = pob_cancer['dat'].apply(lambda x: x - relativedelta.relativedelta(months=1))
    #display(pob_cancer.head())
    pob_cancer = pob_cancer.to_dict()
    #periods['dat_max'] = periods['id'].map(pob_cancer).apply(lambda x: x - relativedelta.relativedelta(months=1)).fillna(periods['dat_max'])
    periods['temp'] = periods['id'].map(pob_cancer)
    periods['dat_can'] = periods['id'].map(pob_cancer)
    
    
    ix = periods['temp'].notnull()
    periods.loc[ix, 'end_type'] = 'cancer'
    #np.ma.masked_array(a, np.isnan(a)

    #periods['train_dat_max'] = np.nanmin(periods[['dat_end','dat_can','dat_death']], axis=1)
    periods['train_dat_max'] = np.nanmin(periods[['dat_end','dat_death']], axis=1)
    periods['train_dat_max'].loc[(periods['end_type'] == 'cancer') & (periods['dat_can'] < '2016-01-01')] = periods['dat_can']
    #periods['train_dat_max'] = np.nanmin(periods[['dat_end','dat_death']], axis=1)

    #periods['train_dat_max'] = np.nanmax(periods[['train_dat_max','dat_can']], axis=1)
    
    periods.drop(['temp'], axis = 1, inplace = True)

    periods['diff_month'] = (periods['train_dat_max'].dt.to_period('M') - periods['dat_min'].dt.to_period('M'))#.clip_upper(48)

    m = (periods['diff_month'] >= 48)

    print('Population at first: ', periods['id'].unique().size)

    periods['train'] = m.astype(float)
    
    periods['train_dat_min'] = periods['train_dat_max'].apply(lambda x: x - relativedelta.relativedelta(months=48))
    
    # ----------------------------------------------------------------------------------------------------------
    
   
    periods['temp'] = periods['dat_end'].apply(lambda x: x + relativedelta.relativedelta(months=12))

    #periods['test_dat_max'] = np.nanmin(periods[['temp','dat_can','dat_death']], axis=1)
    periods['test_dat_max'] = np.nanmin(periods[['temp','dat_death']], axis=1)
    periods['test_dat_max'].loc[(periods['end_type'] == 'cancer') & (periods['dat_can'] > '2016-01-01')] = periods['dat_can']
    
    f = periods['test_dat_max'].apply(lambda x: x - relativedelta.relativedelta(months=48)) >= periods['dat_min']
    
    periods['test_dat_min'] = periods.loc[f]['test_dat_max'].apply(lambda x: x - relativedelta.relativedelta(months=48)).fillna(periods['dat_min'])
    
    #np.max(periods[['train_dat_min', 'dat_min']], axis = 1) .apply(lambda x: x + relativedelta.relativedelta(months=12))
    
    
    periods['diff_month_test'] = (periods['test_dat_max'].dt.to_period('M') - periods['test_dat_min'].dt.to_period('M'))

    
    m1 = (periods['dat_death'].isnull()) | (periods['test_dat_max'] < periods['dat_death']).astype(float)
    m2 = (periods['dat_can'].isnull()) | (periods['dat_can'] >= '2016-01-01').astype(float)
    
    m3 = (periods['diff_month_test'] >= 48).astype(float)
    
    periods['test'] = (m1&m2&m3).astype(float)
    
    periods['cancer_year'] = periods['dat_can'].map(lambda x: x.year)

    print('Clipped population test: ', periods.groupby('test')['id'].count())
    print('Clipped population train: ',periods.groupby('train')['id'].count())
    
    
    test = (periods['test'] == 1)
    train = (periods['train'] == 1)
    periods = periods.loc[test|train]
    
    
    print('Clipped population train or test: ', periods['id'].unique().size)
    
    
    periods.drop(['dat_max_all','diff_month','temp', 'diff_month_test','dat_end_test','dat_end'], axis = 1, inplace = True)
    
    periods['temp_can'] = pd.to_datetime('2017-01-01', format='%Y-%m-%d')
    periods['temp_can'] = periods['temp_can'].apply(lambda x: x - relativedelta.relativedelta(months=12))

    del events_type1
    del events_type2
    del deaths
    return periods

In [10]:
def filling_nans(events_clipped):
  
    list_type_MOD = [t for t in list_type1 if t.startswith(('f_', 'b_', 'e_', 'v_'))] + list_type2
    binary_values = dict(zip(list_type_MOD, [0.] * len(list_type_MOD)))
    ffill_cols = set(list_type1) - set(list_type_MOD)
    e1 = FE.fillna_cols(events_clipped, fillna_val=binary_values, ffill_cols=ffill_cols)
    
    vaccines_list = [t for t in list_type1 if t.startswith(('a_'))]
    vaccines = dict(zip(vaccines_list, [0.] * len(vaccines_list)))
    ffill_cols = set(list_type1) - set(list_type_MOD) - set(vaccines_list)
    again_ffill_cols = dict(zip(ffill_cols, [-1.] * len(ffill_cols)))
    merge_lists = {**vaccines, **again_ffill_cols}
    del events_clipped
    e2 = FE.fillna_cols(e1, fillna_val=merge_lists)
    del e1
    return e2 

In [11]:
def cleaning_empty_time(e, p, test = False):
    ev = e.merge(p, on='id', how='left', copy= False)
    if test:
        m = (ev['dat'] > ev['test_dat_min'])
        m2 = (ev['dat'] <= ev['test_dat_max'])
        
    else:
        m = (ev['dat'] > ev['train_dat_min'])
        m2 = (ev['dat'] <= ev['train_dat_max'])
    
    
    ev = ev.loc[m&m2]
    
    ev.drop(['dat_min','dat_death','end_type','dat_can','train_dat_min','train_dat_max','test_dat_min', 'test_dat_max','train','test', 'cancer_year','temp_can'], axis = 1, inplace = True)
    del p
    del e
    return ev

In [12]:
def get_df(chunk, p2p, time_index, test = False, verbose = True):
    start = dt.datetime.now()
    if (verbose):
        print('This chunk has', len(chunk), 'population')
        print("Starting with events of type 2... ")
    
    # Events 2
    events_chunk2 = getting_events(events_type2, chunk)
    events_chunk2['val'] = 1
    events_chunk2 = pivoting_events(events_chunk2, 'count', verbose)
    
    adding_columns(events_chunk2, list_type2, verbose)
    # Events 1
    if (verbose):
        print("Starting with events of type 1... ")
    
    events_chunk1 = getting_events(events_type1, chunk)
    events_chunk1 = pivoting_events(events_chunk1, 'mean', verbose)
    
    adding_columns(events_chunk1, list_type1, verbose)

    if verbose:
        print("Merging events type 1 and 2... ")
    
    events_all_chunk = events_chunk1.merge(events_chunk2, on = ['id','dat'], how ='outer', copy = False)
    del events_chunk2
    del events_chunk1
    if (verbose):
        print(events_all_chunk.shape)
    
    if (verbose):
        print("Expanding events...")
    e_expanded = expand(events_all_chunk, time_index)
    
    if (verbose):
        print(e_expanded.shape)
    del events_all_chunk

    if (verbose):
        print("Cleaning pre - post events...")
    events = cleaning_empty_time(e_expanded, life_periods, test)
    del e_expanded
    if (verbose):
        print(events.shape)
    
    if (verbose):
        print('Adding column target...')
    FE.adding_target(events, life_periods, p2p, test)
    if (verbose):
        print('Target columns ',events[events['target'] == 1]['target'].count())

    if (verbose):
        print('Filling nans...')
    e = filling_nans(events)
    del events
    
    e['weight'] = 1.
    
    e = e.merge(pob, on='id', how='left', copy = False)

    e['age'] = e['dat'].dt.year - pd.to_numeric(e['dnaix'], errors='coerce')
    e.drop(['dnaix'], axis = 1, inplace = True)


    if (verbose):
        print('We have finished with this chunk....')
        
    gc.collect()
    print(dt.datetime.now() - start)
    return e

## Running

In [14]:
list_type1 = generating_catalegs(events_type1)
list_type2 = generating_catalegs(events_type2)

In [15]:
life_periods = get_intervals(events_type1, events_type2, deaths, p2p= 12)
time_index = get_all_times(events_type2)
life_periods.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._setitem_with_indexer(indexer, value)


Population at first:  60169
Clipped population test:  test
0.0    13742
1.0    46427
Name: id, dtype: int64
Clipped population train:  train
0.0     8081
1.0    52088
Name: id, dtype: int64
Clipped population train or test:  53048


Unnamed: 0,id,dat_min,dat_death,end_type,dat_can,train_dat_max,train,train_dat_min,test_dat_max,test_dat_min,test,cancer_year,temp_can
0,124,2006-02-01,NaT,,NaT,2015-01-01,1.0,2011-01-01,2016-01-01,2012-01-01,1.0,,2016-01-01
1,172,2007-11-01,NaT,,NaT,2015-01-01,1.0,2011-01-01,2016-01-01,2012-01-01,1.0,,2016-01-01
2,512,2006-10-01,NaT,,NaT,2015-01-01,1.0,2011-01-01,2016-01-01,2012-01-01,1.0,,2016-01-01
3,730,2006-04-01,NaT,,NaT,2015-01-01,1.0,2011-01-01,2016-01-01,2012-01-01,1.0,,2016-01-01
4,913,2006-09-01,NaT,,NaT,2015-01-01,1.0,2011-01-01,2016-01-01,2012-01-01,1.0,,2016-01-01


In [16]:
test = life_periods[(life_periods['train'] == 1) & (~life_periods['dat_can'].isnull())]

In [18]:
l = {'46140'}
x = get_df(l, 12, time_index, verbose = False)

102
(51, 2)
5600
(70, 80)
0:00:01.718144


In [19]:
x

Unnamed: 0,dat,id,a_A-GRIP,a_A00001,a_A00002,f_A02BC01,f_A03FA01,f_A12AX93,f_A12AX94,f_C07AA05,...,d_I21,d_R49.0,d_D12.6,d_J42,d_R11,target,weight,sexe,qmedea,age
0,2008-11-01,46140,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,73.0
1,2008-12-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,73.0
2,2009-01-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,74.0
3,2009-02-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,74.0
4,2009-03-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,74.0
5,2009-04-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,74.0
6,2009-05-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,74.0
7,2009-06-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,74.0
8,2009-07-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,74.0
9,2009-08-01,46140,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,,74.0


In [23]:
con_cols = ['age', 'sexe']
cat_cols = {'qmedea': 2}

print(len(list(set(list_type1))))
print(len(list(set(list_type2))))      
lstm_list = list(set(list_type1 + list_type2) - set(con_cols) - set(cat_cols))

print(len(lstm_list))
M = [[7, 2]]

model = generate_model2(con_cols, lstm_list, M, cells=24)

549
386
935
937
__________________________________________________________________________________________
Layer (type)                 Output Shape        Param #    Connected to                  
input_1 (InputLayer)         (None, None, 1)     0                                        
__________________________________________________________________________________________
input_2 (InputLayer)         (None, None, 935)   0                                        
__________________________________________________________________________________________
embedding_1 (Embedding)      (None, None, 1, 2)  14         input_1[0][0]                 
__________________________________________________________________________________________
lstm_1 (LSTM)                (None, None, 24)    92160      input_2[0][0]                 
__________________________________________________________________________________________
reshape_1 (Reshape)          (None, None, 2)     0          embedding_1[0]

In [52]:
len(lstm_list)

935

In [24]:
m1 = life_periods['train'] == 1
#m2 = life_periods['dat_can'] <= '2016-01-01'
m2 = life_periods['dat_can'] < life_periods['temp_can']

can_notest_ids = life_periods.loc[m1 & m2, 'id']
nocan_notest_ids = life_periods.loc[m1 & ~m2, 'id']
print(can_notest_ids.size)
print(nocan_notest_ids.size)

2832
49256


In [25]:
train_p = .85

can_train_ids = pd.Series(can_notest_ids).sample(frac=train_p, random_state= 22).tolist()
can_valid_ids = [id for id in can_notest_ids if id not in can_train_ids]

nocan_train_ids = pd.Series(nocan_notest_ids).sample(frac=train_p, random_state= 22).tolist()
nocan_valid_ids = [id for id in nocan_notest_ids if id not in nocan_train_ids]

In [26]:
print(len(can_train_ids))
print(len(nocan_train_ids))
print(len(nocan_valid_ids))
print(len(can_valid_ids))
joblib.dump(can_valid_ids, path + 'can_validation_ids.pickle', compress = 3)
joblib.dump(nocan_valid_ids, path + 'nocan_validation_ids.pickle', compress = 3)
joblib.dump(can_train_ids, path + 'can_train_ids.pickle', compress = 3)
joblib.dump(nocan_train_ids, path + 'nocan_train_ids.pickle', compress = 3)

2407
41868
7388
425


['MODEL10_ALL/nocan_train_ids.pickle']

In [27]:
i_batch_val = np.concatenate((can_valid_ids, nocan_valid_ids))
df_val = get_df(i_batch_val, 12, time_index, verbose = False)
joblib.dump(df_val, path + 'validation_dataset9a.pickle', compress = 3)

302916202
(784757, 386)
284957352
(519048, 549)
0:11:31.799297


['MODEL10_ALL/validation_dataset9a.pickle']

In [28]:
start1 = dt.datetime.now()
res_val, feat_dict_val = transform_features(df_val, con_cols, lstm_list, cat_cols, verbose = False)
con_feats = feat_dict_val['con_feats']
lstm_feats = feat_dict_val['lstm_feats']
M = feat_dict_val['M']
cat_feats = feat_dict_val['cat_feats']
print(dt.datetime.now() - start1)
del df_val
joblib.dump(feat_dict_val,  path + 'validation_feat_dict9.pickle', compress = 3)
joblib.dump(res_val,  path + 'validation_feature_done9.pickle', compress = 3)

2:08:56.664014


['MODEL10_ALL/validation_feature_done9.pickle']

In [53]:
res_val = joblib.load(path + 'validation_feature_done9.pickle')
feat_dict_val = joblib.load(path + 'validation_feat_dict9.pickle')
#res_val.drop(['temp'], axis =1, inplace = True)
con_feats = feat_dict_val['con_feats']
lstm_feats = feat_dict_val['lstm_feats']
M = feat_dict_val['M']
cat_feats = feat_dict_val['cat_feats']

In [21]:
res_val.head()

Unnamed: 0,dat,id,target,weight,con__age,con__sexe,con__d_H10.0,con__d_Z28.0,con__d_H66.9,con__l_CKDEPI,...,con__l_138.8,con__d_Z63.9,con__l_CALCI,con__m_VMDLO,con__d_E11,con__d_M80.9,con__d_B34.9,con__f_D06BX01,con__e_RA00062,cat__qmedea
0,2011-11-01,1317,0.0,1.0,0.041667,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,2011-12-01,1317,0.0,1.0,0.041667,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
2,2012-01-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
3,2012-02-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
4,2012-03-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


In [54]:
lstm_feats

['con__d_H10.0',
 'con__d_Z28.0',
 'con__d_H66.9',
 'con__l_CKDEPI',
 'con__d_R74.0',
 'con__f_R03BA02',
 'con__l_CAC',
 'con__d_I50.9',
 'con__d_J45.9',
 'con__d_M47.8',
 'con__e_PD00127',
 'con__d_I50',
 'con__d_H53.8',
 'con__d_R03.0',
 'con__f_J01FF01',
 'con__f_L02BG06',
 'con__f_S01EE01',
 'con__d_Z01.2',
 'con__d_Z95.0',
 'con__d_M79.1',
 'con__d_L60.0',
 'con__d_Z13.60',
 'con__f_C07AG02',
 'con__m_TT101',
 'con__d_M65.3',
 'con__l_138.0',
 'con__d_J18',
 'con__d_H25',
 'con__d_I48',
 'con__d_J30.4',
 'con__f_R05CB01',
 'con__a_A00020',
 'con__d_E11.9',
 'con__d_Z28.1',
 'con__f_C07AB12',
 'con__d_H57',
 'con__d_M15',
 'con__f_C10AA02',
 'con__d_G45.9',
 'con__f_S01GX09',
 'con__d_D25.9',
 'con__d_M62.4',
 'con__d_R12',
 'con__d_B35.3',
 'con__f_S01AA30',
 'con__e_RA01172',
 'con__d_L21',
 'con__f_A12AA06',
 'con__f_G04CX02',
 'con__d_R10',
 'con__v_ODO_C',
 'con__f_R06AX29',
 'con__f_J05AB01',
 'con__m_VK2021',
 'con__d_G56.0',
 'con__d_R06.0',
 'con__d_Z60.2',
 'con__l_141.3'

In [22]:
res_val[res_val['id'] == 1317]

Unnamed: 0,dat,id,target,weight,con__age,con__sexe,con__d_H10.0,con__d_Z28.0,con__d_H66.9,con__l_CKDEPI,...,con__l_138.8,con__d_Z63.9,con__l_CALCI,con__m_VMDLO,con__d_E11,con__d_M80.9,con__d_B34.9,con__f_D06BX01,con__e_RA00062,cat__qmedea
0,2011-11-01,1317,0.0,1.0,0.041667,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
1,2011-12-01,1317,0.0,1.0,0.041667,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
2,2012-01-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
3,2012-02-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
4,2012-03-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
5,2012-04-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
6,2012-05-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
7,2012-06-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
8,2012-07-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5
9,2012-08-01,1317,0.0,1.0,0.055556,1,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5


In [29]:
i_batch_val = np.concatenate((can_valid_ids, nocan_valid_ids))
temp_size_val = len(i_batch_val)
length = 48
val_x_n = (res_val[lstm_feats]).values.reshape(temp_size_val, length, len(lstm_feats))
val_y = res_val['target'].values.reshape(temp_size_val, 48, 1)
val_w = (res_val['weight']).values.reshape(temp_size_val, 48)
val_x_cc = [res_val[[i]].values.reshape(temp_size_val, 48, 1) for i in cat_feats]
val_x_con = res_val[con_feats].values.reshape(temp_size_val, length, len(con_feats))

In [30]:
m1 = life_periods['test'] == 1
m2 = ~life_periods['dat_can'].isnull()

can_test_ids = life_periods.loc[m1 & m2, 'id']
nocan_test_ids = life_periods.loc[m1 & ~m2, 'id']

print(len(can_test_ids))
print(len(nocan_test_ids))

#nocan_test_sample_ids = pd.Series(nocan_test_ids).sample(frac=0.15).tolist()
#print(len(nocan_test_sample_ids))

412
46015


# Training 12 MONTHS

In [37]:
train_all = nocan_train_ids + can_train_ids

In [32]:
s = []

result_scores_val = []
res_val = []
p2p = 12
epochs = 5
for e in range(epochs):
    print('Starting epoch number: ', e)
    start1 = dt.datetime.now()
    np.random.shuffle(train_all)
    #np.random.shuffle(can_notest_list)
    positive_size = 768
    for i in range(0, len(train_all), positive_size):
        start = dt.datetime.now()
        print('Batch number: ', i)
        i_batch = np.asarray(((train_all[i:min(len(train_all), i + positive_size)])))
        
        df = get_df(i_batch, p2p, time_index, verbose = False)
        
        print('Transform features has started... ')
        temp, feat_dict = transform_features(df, con_cols, lstm_list, cat_cols, verbose=False)
        lstm_feats = feat_dict['lstm_feats']
        con_feats = feat_dict['con_feats']
        cat_feats = feat_dict['cat_feats']
        M = feat_dict['M']
        print('Transform features has finished... ')
        
        del df
        
        x_n = temp[lstm_feats].values.reshape(len(i_batch), 48, len(lstm_feats))
        y = temp['target'].values.reshape(len(i_batch), 48, 1)
        w = (temp['weight']).values.reshape(len(i_batch), 48)
        x_cc = [temp[[i]].values.reshape(len(i_batch), 48, 1) for i in cat_feats]
        x_con= temp[con_feats].values.reshape(len(i_batch), 48, len(con_feats))
        
        del temp
        del feat_dict
        print('Model fitting... ')
        h = model.fit(x_cc + [x_con, x_n], y,  
                              batch_size=len(i_batch), validation_split= 0, sample_weight=w, verbose=False)
        scores_val = model.predict(val_x_cc + [val_x_con, val_x_n]).ravel()
        y_val = val_y.ravel()
        w = val_w.ravel()
        scores_val = scores_val[w > 0]
        y_val = y_val[w > 0]
        
        g = [round(h.history['binary_crossentropy'][0],5), 
             round(log_loss(y_val, scores_val), 5), 
             round(roc_auc_score(y_val, scores_val), 3)]
        
        print(g)
        s.append([e, i] + g)
        
        res_val.append(y_val)
        result_scores_val.append(scores_val)
        
        gc.collect()
        print('Batch computation time: ',dt.datetime.now() - start)
    print('EPOCH TIME: ', dt.datetime.now() - start1)
    print(g)

joblib.dump(s, path + 'results_model7.pickle', compress = 3)
model.save(path + 'model_7.h5')
joblib.dump(res_val, path + 'val_y_model7.pickle', compress = 3)
joblib.dump(result_scores_val, path + 'ALL_scores_val_model7.pickle', compress = 3)

Starting epoch number:  0
Batch number:  0
29533596
(77516, 381)
27781408
(50696, 548)
0:01:02.608125
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.75951, 0.72838, 0.453]
Batch computation time:  0:06:10.749261
Batch number:  768
29495600
(77620, 380)
28349369
(51827, 547)
0:00:59.657918
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.72646, 0.69322, 0.44]
Batch computation time:  0:06:17.143198
Batch number:  1536
29577114
(77427, 382)
27777024
(50688, 548)
0:01:03.917565
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.69537, 0.65884, 0.392]
Batch computation time:  0:05:48.519025
Batch number:  2304
28605404
(75476, 379)
27510696
(50202, 548)
0:00:55.116041
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.65654, 0.6272, 0.321]
Batch computation time:  0:05:57.410823
Batch number:  3072
29740479
(78059, 381)
27

29633859
(77373, 383)
27866202
(51037, 546)
0:00:51.059257
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.11437, 0.10899, 0.243]
Batch computation time:  0:05:42.657294
Batch number:  27648
29453568
(76702, 384)
27019612
(49396, 547)
0:00:54.211339
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.10653, 0.10867, 0.246]
Batch computation time:  0:05:57.802167
Batch number:  28416
29868638
(77986, 383)
27340701
(49983, 547)
0:00:55.644778
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.11256, 0.1082, 0.248]
Batch computation time:  0:06:00.848104
Batch number:  29184
29341572
(77012, 381)
27222448
(49676, 548)
0:00:55.558473
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.11388, 0.10756, 0.251]
Batch computation time:  0:05:56.387077
Batch number:  29952
29701632
(77348, 384)
27559251
(50199, 549)
0:01:00.769949
Tr

[0.09875, 0.08932, 0.354]
Batch computation time:  0:04:33.072946
Batch number:  9216
29612640
(77928, 380)
27231764
(49693, 548)
0:00:41.582210
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.08435, 0.08906, 0.357]
Batch computation time:  0:04:40.206384
Batch number:  9984
29522406
(77082, 383)
27895940
(50905, 548)
0:00:41.555926
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.09855, 0.0888, 0.36]
Batch computation time:  0:04:38.394545
Batch number:  10752
29898512
(78064, 383)
27364356
(49844, 549)
0:00:41.240930
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.08932, 0.08854, 0.364]
Batch computation time:  0:04:42.509330
Batch number:  11520
29299662
(76902, 381)
28700952
(52374, 548)
0:00:41.501829
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.08664, 0.08829, 0.367]
Batch computation time:  0:04:38.46646

Transform features has finished... 
Model fitting... 
[0.08667, 0.0751, 0.758]
Batch computation time:  0:04:31.247244
Batch number:  18432
29588192
(77456, 382)
28374344
(51778, 548)
0:00:41.529105
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.0823, 0.07495, 0.76]
Batch computation time:  0:04:38.012340
Batch number:  19200
29740992
(77856, 382)
28643412
(52269, 548)
0:00:40.347634
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.07268, 0.0748, 0.762]
Batch computation time:  0:04:30.019548
Batch number:  19968
28906600
(76070, 380)
26936136
(49064, 549)
0:00:39.937472
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.07769, 0.07464, 0.764]
Batch computation time:  0:04:36.090095
Batch number:  20736
29366632
(76876, 382)
27793070
(50810, 547)
0:00:41.441317
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.07531, 

29280733
(76451, 383)
27409374
(49926, 549)
0:00:41.493362
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.07771, 0.07097, 0.805]
Batch computation time:  0:04:33.516319
Batch number:  768
29434155
(77255, 381)
26975664
(49136, 549)
0:00:41.460193
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.0735, 0.07087, 0.807]
Batch computation time:  0:04:41.027282
Batch number:  1536
29937002
(77557, 386)
28096108
(51364, 547)
0:00:41.177474
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.07517, 0.07075, 0.808]
Batch computation time:  0:04:38.194656
Batch number:  2304
29022450
(75975, 382)
27178794
(49506, 549)
0:00:41.253826
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.07368, 0.07062, 0.809]
Batch computation time:  0:04:36.619328
Batch number:  3072
29527385
(77095, 383)
27927632
(51056, 547)
0:00:41.256307
Transfo

29365486
(76873, 382)
26595687
(48621, 547)
0:00:44.030436
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.07151, 0.0675, 0.843]
Batch computation time:  0:04:59.105931
Batch number:  27648
29159040
(75935, 384)
27264096
(49752, 548)
0:00:46.535766
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.06885, 0.06726, 0.845]
Batch computation time:  0:05:18.871397
Batch number:  28416
29230640
(76520, 382)
28325830
(51974, 545)
0:00:49.031147
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.06985, 0.06683, 0.847]
Batch computation time:  0:04:54.296636
Batch number:  29184
29187648
(77216, 378)
27243335
(49805, 547)
0:00:44.725831
Transform features has started... 
Transform features has finished... 
Model fitting... 
[0.07308, 0.06635, 0.85]
Batch computation time:  0:05:04.678601
Batch number:  29952
29239680
(76145, 384)
26855512
(49096, 547)
0:00:44.044842
Tra

KeyboardInterrupt: 

In [32]:
df.head()

Unnamed: 0,dat,id,a_A-GRIP,a_A00001,a_A00002,a_A00020,a_A00028,a_A00033,e_AP00791,e_IQ44.13-1,...,s_2.0,d_D25.9,d_D18.0,d_D50,d_R01.1,target,weight,sexe,qmedea,age
0,2011-11-01,2134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,U5,46.0
1,2011-12-01,2134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,U5,46.0
2,2012-01-01,2134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,U5,47.0
3,2012-02-01,2134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,U5,47.0
4,2012-03-01,2134,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0,U5,47.0


In [33]:
joblib.dump(s, path + 'results_model7.pickle', compress = 3)
model.save(path + 'model_7.h5')
joblib.dump(res_val, path + 'val_y_model7.pickle', compress = 3)
joblib.dump(result_scores_val, path + 'ALL_scores_val_model7.pickle', compress = 3)

['MODEL10_ALL/ALL_scores_val_model7.pickle']

In [35]:
temp[temp['id'] == 2134]

Unnamed: 0,dat,id,target,weight,con__age,con__sexe,con__f_A06AD12,con__e_PD00046,con__f_A02BC05,con__m_PAA004,...,con__d_K21.9,con__d_R14,con__d_F41.2,con__f_G04BD07,con__f_N03AX12,con__e_RA01126,con__m_AIVDL,con__l_LEUC_N,con__l_139.7,cat__qmedea
0,2011-11-01,2134,0.0,1.0,0.018519,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
1,2011-12-01,2134,0.0,1.0,0.018519,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
2,2012-01-01,2134,0.0,1.0,0.037037,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
3,2012-02-01,2134,0.0,1.0,0.037037,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
4,2012-03-01,2134,0.0,1.0,0.037037,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
5,2012-04-01,2134,0.0,1.0,0.037037,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
6,2012-05-01,2134,0.0,1.0,0.037037,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
7,2012-06-01,2134,0.0,1.0,0.037037,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
8,2012-07-01,2134,0.0,1.0,0.037037,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6
9,2012-08-01,2134,0.0,1.0,0.037037,0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,6


In [36]:
len(lstm_feats)

935

# Running the test

In [34]:
i_batch_test = np.concatenate((can_test_ids, nocan_test_ids))
ps = 2000


def run_test(i_batch_test, ps, num_model, path):
    res_test = []
    result_scores = []
    print(len(i_batch_test))
    positive_size = ps
    
    for i in range(0, len(i_batch_test), positive_size):
        start = dt.datetime.now()
        print('Iteration number: ', i)
        i_batch = i_batch_test[i:min(len(i_batch_test), i + positive_size)]

        df_test = get_df(i_batch, p2p, time_index, verbose=False, test=True)

        print('Transform features has started... ')
        temp, feat_dict = transform_features(df_test, con_cols, lstm_list, cat_cols, verbose=False)
        lstm_feats = feat_dict['lstm_feats']
        con_feats = feat_dict['con_feats']
        cat_feats = feat_dict['cat_feats']
        M = feat_dict['M']
        print('Transform features has finished... ')

        del df_test
        #temp = R[R['id'].isin(i_batch)]
        #break
        test_x_n = temp[lstm_feats].values.reshape(len(i_batch), 48, len(lstm_feats))
        test_y = temp['target'].values.reshape(len(i_batch), 48, 1)
        test_w = (temp['weight']).values.reshape(len(i_batch), 48)
        test_x_cc = [temp[[i]].values.reshape(len(i_batch), 48, 1) for i in cat_feats]
        test_x_con = temp[con_feats].values.reshape(len(i_batch), 48, len(con_feats))

        scores_test = model.predict(test_x_cc + [test_x_con, test_x_n]).ravel()

        y_test = test_y.ravel()
        w_test = test_w.ravel()
        y_test = y_test[w_test > 0]

        res_test.append(y_test)
        result_scores.append(scores_test)
        print('Batch computation time: ', dt.datetime.now() - start)

    joblib.dump(res_test, path + 'ALL_res_test_model' + num_model + '.pickle', compress=3)
    joblib.dump(result_scores, path + 'ALL_scores_test' + num_model + '.pickle', compress=3)
    
run_test(i_batch_test, ps, num_model, path)

46427
Iteration number:  0
78679152
(203832, 386)
75272732
(137359, 548)
0:01:59.356261
Transform features has started... 
Transform features has finished... 
Batch computation time:  0:13:25.405990
Iteration number:  2000
76619862
(201102, 381)
72579242
(132686, 547)
0:01:58.024549
Transform features has started... 
Transform features has finished... 
Batch computation time:  0:13:01.338837
Iteration number:  4000
77715288
(205596, 378)
74302839
(135837, 547)
0:01:56.621358
Transform features has started... 
Transform features has finished... 
Batch computation time:  0:12:56.782781
Iteration number:  6000
76398688
(203188, 376)
72510264
(132318, 548)
0:01:49.097074
Transform features has started... 
Transform features has finished... 
Batch computation time:  0:12:26.629482
Iteration number:  8000
76635102
(201142, 381)
70500095
(128885, 547)
0:01:50.531362
Transform features has started... 
Transform features has finished... 
Batch computation time:  0:12:10.218921
Iteration number:

['MODEL10_ALL/ALL_scores_test10.pickle']