Import all the necessary libraries for constructing the model.

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, classification_report

import tensorflow as tf #This is to construct a feed forward neural network for our model.
import datetime

Now, let's load the training set by reading the 'train.parquet' file with pandas and storing it in a variable called df_train.

In [22]:
df_train = pd.read_parquet('train.parquet', engine='pyarrow')
#The pyarrow engine is used to read the '.parquet' file into a pandas DataFrame.
# It must be installed beforehand; if it is not, run `pip install pyarrow` in a terminal.
df_train

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,2015-05-16,PRIMARY_DIAGNOSIS
3,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,2018-01-30,SYMPTOM_TYPE_0
4,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,2015-04-22,DRUG_TYPE_0
8,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2016-06-18,DRUG_TYPE_1
...,...,...,...
29080886,a0ee9f75-1c7c-11ec-94c7-16262ee38c7f,2018-07-06,DRUG_TYPE_6
29080897,a0ee1284-1c7c-11ec-a3d5-16262ee38c7f,2017-12-29,DRUG_TYPE_6
29080900,a0ee9b26-1c7c-11ec-8a40-16262ee38c7f,2018-10-18,DRUG_TYPE_10
29080903,a0ee1a92-1c7c-11ec-8341-16262ee38c7f,2015-09-18,DRUG_TYPE_6


We observe that the entries are neither sorted chronologically by 'Date' nor grouped by 'Patient-Uid'. So, we need to arrange the data set more systematically.

**Feature Engineering:** 
Here, we need to choose the right kind of features for our model. For our purpose, I'm looking at two features:

-> **'Elapsed-Time'** - We calculate the number of days between each incident and the patient's first recorded incident. This should give us an idea of how frequently each patient visits the hospital.

-> **'Sequential'** - Here, we keep track of the order in which a patient's incidents happen and record it as a single string. This sequential pattern should be distinctive for each patient and hence should serve as a good feature.

The **'Eligibility'** column is the binary target. In the implementation below, a row is labeled 1 if its date falls within 30 days after the patient's first 'TARGET DRUG' incident, and 0 otherwise.
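To make the two features concrete, here is a minimal sketch on a tiny, made-up data set. The patient IDs, dates and the `toy` variable are hypothetical and only serve as illustration; the column names follow the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy data; the patient IDs and dates are invented for illustration.
toy = pd.DataFrame({
    "Patient-Uid": ["p1", "p1", "p1", "p2", "p2"],
    "Date": ["2020-01-10", "2020-01-01", "2020-02-15", "2019-05-05", "2019-05-01"],
    "Incident": ["DRUG_TYPE_0", "SYMPTOM_TYPE_1", "TARGET DRUG",
                 "DRUG_TYPE_2", "PRIMARY_DIAGNOSIS"],
})
toy["Date"] = pd.to_datetime(toy["Date"])
toy = toy.sort_values(["Patient-Uid", "Date"]).reset_index(drop=True)

# 'Elapsed-Time': days since each patient's first recorded incident.
first_date = toy.groupby("Patient-Uid")["Date"].transform("min")
toy["Elapsed-Time"] = (toy["Date"] - first_date).dt.days

# 'Sequential': the patient's full incident history in chronological order.
toy["Sequential"] = toy.groupby("Patient-Uid")["Incident"].transform(
    lambda s: "->".join(s)
)

print(toy[["Patient-Uid", "Date", "Elapsed-Time", "Sequential"]])
```

Every row of a patient carries the same 'Sequential' string, while 'Elapsed-Time' grows from 0 at the patient's first incident.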

Given below is the implementation of cleaning the given dataset as well as obtaining the required features.

In [21]:
class DataCleaning: 
    def __init__(self,X): 
        self.X = X
        self.num_feat = X.shape[0]
        self.num_col = X.shape[1]

    def Ordering(self):
        self.X['Date'] = pd.to_datetime(self.X['Date']) #This converts the 'Date' column from strings to pandas datetime values.
        self.X = self.X.sort_values(['Patient-Uid','Date']) #Sorts the data with each patient and date in chronological order.
        self.X = self.X.reset_index(drop=True)
        return self.X
        
    def features(self,X):
        X['Elapsed-Time'] = X.groupby('Patient-Uid')['Date'].transform(lambda x: (x - x.min()).dt.days)
        #Here, we subtract x.min() from every date of each patient and express the difference in days.
        #Since the data is already sorted, x.min() corresponds to the first entry of each individual patient.

        #Similarly for the 'Sequential' feature, we have:
        X['Sequential'] = X.groupby('Patient-Uid')['Incident'].transform(lambda x: '->'.join(x))
        #Here, we used the '->' delimiter to keep track of the order of the incidents per patient.
        #'Sequential' is just a string, so the model cannot consume it directly. The line below one-hot encodes
        #the 'Incident' column instead, giving each row an indicator column for its own incident type.
        seq = X['Incident'].str.get_dummies(sep='->') #pandas DataFrame of 0/1 indicator columns
        #We need to make sure that each unique Incident has a column in seq, even if some values are missing in the testing set.

        X['Eligibility']=0
        groups = X.groupby('Patient-Uid')

        for i, grp in groups:
            if 'TARGET DRUG' in grp['Incident'].values:
                date1 = grp.loc[grp['Incident']=='TARGET DRUG','Date'].min()
                date2 = date1 + datetime.timedelta(days=30) #date1 and date2 are the ends of the possible range of dates.
                X.loc[(X['Patient-Uid']==grp['Patient-Uid'].iloc[0])&(X['Date']>=date1)&(X['Date']<=date2),'Eligibility'] = 1

        In_Feature = pd.concat([X[['Elapsed-Time']],seq],axis=1)
        Out_Feature = X['Eligibility']

        return X, In_Feature, Out_Feature
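The per-patient loop above can be slow on millions of rows. As a sketch only, the same labeling rule (rows dated from the patient's first 'TARGET DRUG' up to 30 days later get a 1) could be written without an explicit loop. The function name `label_eligibility` and the parameter `window_days` are hypothetical and not part of the original class.

```python
import pandas as pd

def label_eligibility(df: pd.DataFrame, window_days: int = 30) -> pd.Series:
    """Return 1 for rows dated within `window_days` after the patient's
    first 'TARGET DRUG' incident, and 0 for every other row."""
    # Keep the date only on 'TARGET DRUG' rows; all other rows become NaT.
    target_dates = df["Date"].where(df["Incident"] == "TARGET DRUG")
    # Earliest 'TARGET DRUG' date per patient, broadcast to all of that patient's rows.
    first_target = target_dates.groupby(df["Patient-Uid"]).transform("min")
    window_end = first_target + pd.Timedelta(days=window_days)
    # Comparisons against NaT are False, so patients without the drug stay at 0.
    in_window = (df["Date"] >= first_target) & (df["Date"] <= window_end)
    return in_window.astype(int)

# Hypothetical usage on invented data.
demo = pd.DataFrame({
    "Patient-Uid": ["p1", "p1", "p1", "p1", "p2"],
    "Date": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-02-20",
                            "2020-03-15", "2020-01-05"]),
    "Incident": ["DRUG_TYPE_1", "TARGET DRUG", "DRUG_TYPE_6",
                 "TARGET DRUG", "DRUG_TYPE_2"],
})
demo["Eligibility"] = label_eligibility(demo)
print(demo)
```

Note that a later 'TARGET DRUG' row falling outside the 30-day window is labeled 0, which matches the behavior of the loop.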

In [23]:
df = DataCleaning(df_train)
df_train = df.Ordering()

df_train, X_train, y_train = df.features(df_train)

The final output table looks like this:

In [24]:
df_train    

Unnamed: 0,Patient-Uid,Date,Incident,Elapsed-Time,Sequential,Eligibility
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2015-09-22,DRUG_TYPE_7,0,DRUG_TYPE_7->SYMPTOM_TYPE_2->DRUG_TYPE_7->SYMP...,0
1,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-04-13,SYMPTOM_TYPE_2,934,DRUG_TYPE_7->SYMPTOM_TYPE_2->DRUG_TYPE_7->SYMP...,0
2,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-05-02,DRUG_TYPE_7,953,DRUG_TYPE_7->SYMPTOM_TYPE_2->DRUG_TYPE_7->SYMP...,0
3,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,SYMPTOM_TYPE_0,1158,DRUG_TYPE_7->SYMPTOM_TYPE_2->DRUG_TYPE_7->SYMP...,0
4,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,DRUG_TYPE_9,1158,DRUG_TYPE_7->SYMPTOM_TYPE_2->DRUG_TYPE_7->SYMP...,0
...,...,...,...,...,...,...
3220863,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-06-19,DRUG_TYPE_6,1886,DRUG_TYPE_6->DRUG_TYPE_1->DRUG_TYPE_6->DRUG_TY...,1
3220864,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-09,TARGET DRUG,1906,DRUG_TYPE_6->DRUG_TYPE_1->DRUG_TYPE_6->DRUG_TY...,1
3220865,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-10,DRUG_TYPE_1,1907,DRUG_TYPE_6->DRUG_TYPE_1->DRUG_TYPE_6->DRUG_TY...,1
3220866,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-08-05,TARGET DRUG,1933,DRUG_TYPE_6->DRUG_TYPE_1->DRUG_TYPE_6->DRUG_TY...,0


We can obtain the positive and negative sets from the 'Eligibility' column as shown below:

In [29]:
ps = df_train[df_train['Eligibility']==1] #Positive set: rows within 30 days after a patient's first 'TARGET DRUG'.
ns = df_train[df_train['Eligibility']==0]

In [30]:
ps

Unnamed: 0,Patient-Uid,Date,Incident,Elapsed-Time,Sequential,Eligibility
1784189,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2020-07-08,PRIMARY_DIAGNOSIS,1912,DRUG_TYPE_7->TEST_TYPE_0->DRUG_TYPE_0->DRUG_TY...,1
1784190,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2020-07-08,TARGET DRUG,1912,DRUG_TYPE_7->TEST_TYPE_0->DRUG_TYPE_0->DRUG_TY...,1
1784191,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2020-08-05,PRIMARY_DIAGNOSIS,1940,DRUG_TYPE_7->TEST_TYPE_0->DRUG_TYPE_0->DRUG_TY...,1
1784192,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2020-08-05,TARGET DRUG,1940,DRUG_TYPE_7->TEST_TYPE_0->DRUG_TYPE_0->DRUG_TY...,1
1784330,a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f,2018-04-24,PRIMARY_DIAGNOSIS,1104,PRIMARY_DIAGNOSIS->DRUG_TYPE_2->DRUG_TYPE_2->P...,1
...,...,...,...,...,...,...
3220782,a0f0d553-1c7c-11ec-a70a-16262ee38c7f,2020-07-21,TARGET DRUG,1892,DRUG_TYPE_9->SYMPTOM_TYPE_7->DRUG_TYPE_2->DRUG...,1
3220862,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-06-18,TARGET DRUG,1885,DRUG_TYPE_6->DRUG_TYPE_1->DRUG_TYPE_6->DRUG_TY...,1
3220863,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-06-19,DRUG_TYPE_6,1886,DRUG_TYPE_6->DRUG_TYPE_1->DRUG_TYPE_6->DRUG_TY...,1
3220864,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-09,TARGET DRUG,1906,DRUG_TYPE_6->DRUG_TYPE_1->DRUG_TYPE_6->DRUG_TY...,1


In [26]:
df_test = pd.read_parquet('test.parquet', engine='pyarrow')
df_test

Unnamed: 0,Patient-Uid,Date,Incident
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2016-12-08,SYMPTOM_TYPE_0
1,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-10-17,DRUG_TYPE_0
2,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-12-01,DRUG_TYPE_2
3,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-12-05,DRUG_TYPE_1
4,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-11-04,SYMPTOM_TYPE_0
...,...,...,...
1372854,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-05-11,DRUG_TYPE_13
1372856,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2018-08-22,DRUG_TYPE_2
1372857,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-02-04,DRUG_TYPE_2
1372858,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-09-25,DRUG_TYPE_8


In [27]:
df_t = DataCleaning(df_test)
df_test = df_t.Ordering()

df_test, X_test, y_test = df_t.features(df_test)

In [33]:
X_train.columns

Index(['Elapsed-Time', 'DRUG_TYPE_0', 'DRUG_TYPE_1', 'DRUG_TYPE_10',
       'DRUG_TYPE_11', 'DRUG_TYPE_12', 'DRUG_TYPE_13', 'DRUG_TYPE_14',
       'DRUG_TYPE_15', 'DRUG_TYPE_16', 'DRUG_TYPE_17', 'DRUG_TYPE_18',
       'DRUG_TYPE_2', 'DRUG_TYPE_3', 'DRUG_TYPE_4', 'DRUG_TYPE_5',
       'DRUG_TYPE_6', 'DRUG_TYPE_7', 'DRUG_TYPE_8', 'DRUG_TYPE_9',
       'PRIMARY_DIAGNOSIS', 'SYMPTOM_TYPE_0', 'SYMPTOM_TYPE_1',
       'SYMPTOM_TYPE_10', 'SYMPTOM_TYPE_11', 'SYMPTOM_TYPE_12',
       'SYMPTOM_TYPE_13', 'SYMPTOM_TYPE_14', 'SYMPTOM_TYPE_15',
       'SYMPTOM_TYPE_16', 'SYMPTOM_TYPE_17', 'SYMPTOM_TYPE_18',
       'SYMPTOM_TYPE_19', 'SYMPTOM_TYPE_2', 'SYMPTOM_TYPE_20',
       'SYMPTOM_TYPE_21', 'SYMPTOM_TYPE_22', 'SYMPTOM_TYPE_23',
       'SYMPTOM_TYPE_24', 'SYMPTOM_TYPE_25', 'SYMPTOM_TYPE_26',
       'SYMPTOM_TYPE_27', 'SYMPTOM_TYPE_28', 'SYMPTOM_TYPE_29',
       'SYMPTOM_TYPE_3', 'SYMPTOM_TYPE_4', 'SYMPTOM_TYPE_5', 'SYMPTOM_TYPE_6',
       'SYMPTOM_TYPE_7', 'SYMPTOM_TYPE_8', 'SYMPTOM_TYPE_9', 'TAR

In [50]:
X_test.columns

Index(['Elapsed-Time', 'DRUG_TYPE_0', 'DRUG_TYPE_1', 'DRUG_TYPE_10',
       'DRUG_TYPE_11', 'DRUG_TYPE_12', 'DRUG_TYPE_13', 'DRUG_TYPE_14',
       'DRUG_TYPE_15', 'DRUG_TYPE_16', 'DRUG_TYPE_17', 'DRUG_TYPE_18',
       'DRUG_TYPE_2', 'DRUG_TYPE_3', 'DRUG_TYPE_4', 'DRUG_TYPE_5',
       'DRUG_TYPE_6', 'DRUG_TYPE_7', 'DRUG_TYPE_8', 'DRUG_TYPE_9',
       'PRIMARY_DIAGNOSIS', 'SYMPTOM_TYPE_0', 'SYMPTOM_TYPE_1',
       'SYMPTOM_TYPE_10', 'SYMPTOM_TYPE_11', 'SYMPTOM_TYPE_12',
       'SYMPTOM_TYPE_13', 'SYMPTOM_TYPE_14', 'SYMPTOM_TYPE_15',
       'SYMPTOM_TYPE_16', 'SYMPTOM_TYPE_17', 'SYMPTOM_TYPE_18',
       'SYMPTOM_TYPE_19', 'SYMPTOM_TYPE_2', 'SYMPTOM_TYPE_20',
       'SYMPTOM_TYPE_21', 'SYMPTOM_TYPE_22', 'SYMPTOM_TYPE_23',
       'SYMPTOM_TYPE_24', 'SYMPTOM_TYPE_25', 'SYMPTOM_TYPE_26',
       'SYMPTOM_TYPE_27', 'SYMPTOM_TYPE_28', 'SYMPTOM_TYPE_29',
       'SYMPTOM_TYPE_3', 'SYMPTOM_TYPE_4', 'SYMPTOM_TYPE_5', 'SYMPTOM_TYPE_6',
       'SYMPTOM_TYPE_7', 'SYMPTOM_TYPE_8', 'SYMPTOM_TYPE_9', 'TAR

In [48]:
#The 'DRUG_TYPE_18' and 'TARGET DRUG' columns do not exist in the test set. Hence we insert them at the same
#positions they occupy in X_train and initialize them to 0.
d18 = np.zeros(X_test.shape[0])
td = np.zeros(X_test.shape[0])
X_test.insert(11,'DRUG_TYPE_18',d18)
X_test.insert(51,'TARGET DRUG',td) #Indices 11 and 51 match the column order of X_train.
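Hard-coded insert positions break as soon as the set of incident types changes. One alternative, sketched here on hypothetical miniature frames (`X_train_toy` and `X_test_toy` are stand-ins, not the real data), is to let pandas align the test columns to the training columns with `DataFrame.reindex`.

```python
import pandas as pd

# Hypothetical miniature feature frames standing in for X_train and X_test.
X_train_toy = pd.DataFrame({
    "Elapsed-Time": [0, 5],
    "DRUG_TYPE_18": [1, 0],
    "TARGET DRUG": [0, 1],
})
X_test_toy = pd.DataFrame({"Elapsed-Time": [0, 3]})

# Reorder to the training layout; columns absent from the test frame are filled with 0.
X_test_aligned = X_test_toy.reindex(columns=X_train_toy.columns, fill_value=0)

print(X_test_aligned)
```

Be aware that `reindex` also drops any column that appears in the test frame but not in the training frame, which is usually what a fixed-width model input requires.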

In [49]:
X_test.columns == X_train.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True])

Now, let's go about building our model. For this purpose, I have gone with a feed-forward neural network using TensorFlow.

In [51]:
#The following neural network contains 2 hidden layers with 64 nodes each and one output layer.

model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(64,activation='relu',input_shape=(58,)), #58 is the number of feature columns in X_train and X_test.
    tf.keras.layers.Dense(64,activation='relu'),
    tf.keras.layers.Dense(1,activation='sigmoid') #This activation gives the probabilities of the binary output predicted.
])

model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
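To show what these three Dense layers compute, here is a NumPy-only sketch of the forward pass for the same 58 -> 64 -> 64 -> 1 layout. The weights are random and purely illustrative; they are not the weights Keras initializes or learns, and the helper names (`relu`, `sigmoid`, `forward`) are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Random weights for illustration only (not Keras' initial or trained weights).
W1, b1 = rng.normal(scale=0.1, size=(58, 64)), np.zeros(64)
W2, b2 = rng.normal(scale=0.1, size=(64, 64)), np.zeros(64)
W3, b3 = rng.normal(scale=0.1, size=(64, 1)), np.zeros(1)

def forward(x):
    """Two ReLU hidden layers followed by a single sigmoid output unit."""
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)

batch = rng.random((4, 58))   # four hypothetical rows with 58 features each
probs = forward(batch)        # one probability per row

# Trainable parameters: (58*64 + 64) + (64*64 + 64) + (64*1 + 1)
n_params = sum(w.size for w in (W1, b1, W2, b2, W3, b3))
print(probs.shape, n_params)
```

The parameter count should agree with what `model.summary()` reports for the Keras model defined above.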

In [52]:
#Now, let's fit the model to the training data.
model.fit(X_train,y_train, epochs=10,verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1f014d794e0>

In [53]:
y_pred = model.predict(X_test)
y_pred = np.round(y_pred).flatten() #To convert the probabilities to binary output.
y_pred



array([0., 0., 0., ..., 0., 0., 0.], dtype=float32)

In [54]:
acc = accuracy_score(y_test,y_pred)
acc*100

100.0

In [56]:
rep = classification_report(y_test,y_pred)
print(f"Classification Report:\n{rep}")

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00   1065524

    accuracy                           1.00   1065524
   macro avg       1.00      1.00      1.00   1065524
weighted avg       1.00      1.00      1.00   1065524



We see that the model reaches 100% accuracy on the test set. This is expected: the test set contains no 'TARGET DRUG' incidents, so every row has an 'Eligibility' of 0 and the classification report covers only class 0. The score therefore shows that the model predicts no positives here, but it says little about how well the model identifies eligible patients.
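A quick way to put such a score in context is to compare it with a majority-class baseline. The sketch below uses a hypothetical helper, `class_balance_and_baseline`, which counts the labels and reports the accuracy of always predicting the most frequent class.

```python
import numpy as np

def class_balance_and_baseline(y_true):
    """Return the label counts and the accuracy of always predicting
    the most frequent label."""
    y_true = np.asarray(y_true)
    classes, counts = np.unique(y_true, return_counts=True)
    majority = classes[np.argmax(counts)]
    baseline_acc = float(np.mean(y_true == majority))
    return dict(zip(classes.tolist(), counts.tolist())), baseline_acc

# Hypothetical label vector containing only the negative class, as in the test set here.
balance, baseline = class_balance_and_baseline(np.zeros(10, dtype=int))
print(balance, baseline)
```

When the baseline itself is 1.0, as it is for a single-class test set, accuracy cannot distinguish a trained model from one that always outputs 0.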