# Problem Statement

# Problem 1

The development of drugs is critical in providing therapeutic options
for patients suffering from chronic and terminal illnesses. “Target Drug”, in particular,
is designed to enhance the patient's health and well-being without causing
dependence on other medications that could potentially lead to severe and
life-threatening side effects. These drugs are specifically tailored to treat a particular
disease or condition, offering a more focused and effective approach to treatment,
while minimising the risk of harmful reactions.

# Objective

The objective in this assignment is to develop a predictive model which will predict
whether a patient will be eligible*** for “Target Drug” or not in next 30 days. Knowing
if the patient is eligible or not will help physician treating the patient make informed
decision on the which treatments to give.

# Imported the libraries required for data processing and modelling.

In [7]:
import pandas as pd
import numpy as np

# Loaded the specified dataset's data to study its structure and train.

In [10]:
df = pd.read_parquet("/content/train.parquet")

df.head()

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,2015-05-16,PRIMARY_DIAGNOSIS
3,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,2018-01-30,SYMPTOM_TYPE_0
4,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,2015-04-22,DRUG_TYPE_0
8,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2016-06-18,DRUG_TYPE_1


In [11]:
df.describe()

Unnamed: 0,Patient-Uid,Date,Incident
count,3220868,3220868,3220868
unique,27033,1977,57
top,a0ddfd2c-1c7c-11ec-876d-16262ee38c7f,2019-05-21 00:00:00,DRUG_TYPE_6
freq,1645,3678,561934
first,,2015-04-07 00:00:00,
last,,2020-09-03 00:00:00,


# Performed data cleaning

which may include managing missing values, converting date columns to datetime format, and constructing the target variable based on event dates and "Target Drug" occurrences.

In [12]:
df = df.dropna()

df = df.drop_duplicates()

df = df.drop("Date", axis=1)

df.head()

Unnamed: 0,Patient-Uid,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,PRIMARY_DIAGNOSIS
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,PRIMARY_DIAGNOSIS
3,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,SYMPTOM_TYPE_0
4,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,DRUG_TYPE_0
8,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,DRUG_TYPE_1


# Encoding the data

In [13]:
# One-hot encode the Incident column
one_hot = pd.get_dummies(df['Incident'])

# Concatenate the one-hot encoded column with the original DataFrame
df = pd.concat([df, one_hot], axis=1)

# Group the incidents by Patient-Uid and sum the one-hot encoded columns
grouped = df.groupby('Patient-Uid').sum()

# Display the resulting DataFrame
grouped

Unnamed: 0_level_0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TARGET DRUG,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
Patient-Uid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
a0db1e73-1c7c-11ec-ae39-16262ee38c7f,29,0,0,1,0,0,0,0,0,0,...,1,0,0,0,10,2,0,0,0,0
a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,8,27,0,0,0,0,0,0,0,0,...,0,0,0,0,1,4,0,0,0,0
a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,6,7,0,10,0,0,0,0,0,0,...,0,0,0,0,3,2,0,0,0,0
a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,15,42,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2,45,0,24,0,0,0,0,0,0,...,5,6,0,0,9,27,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,48,9,0,0,0,0,0,0,0,0,...,0,0,0,2,0,0,0,0,0,0
a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,17,23,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
a0f0d523-1c7c-11ec-89d2-16262ee38c7f,8,48,0,3,0,0,0,0,0,0,...,0,0,0,3,0,3,0,0,0,0
a0f0d553-1c7c-11ec-a70a-16262ee38c7f,7,44,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0


In [14]:
grouped['TARGET DRUG'] = grouped['TARGET DRUG'].apply(lambda x: 0 if x == 0 else 1)

grouped

Unnamed: 0_level_0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TARGET DRUG,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
Patient-Uid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
a0db1e73-1c7c-11ec-ae39-16262ee38c7f,29,0,0,1,0,0,0,0,0,0,...,1,0,0,0,10,2,0,0,0,0
a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,8,27,0,0,0,0,0,0,0,0,...,0,0,0,0,1,4,0,0,0,0
a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,6,7,0,10,0,0,0,0,0,0,...,0,0,0,0,3,2,0,0,0,0
a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,15,42,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2,45,0,24,0,0,0,0,0,0,...,5,6,0,0,9,27,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,48,9,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,17,23,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
a0f0d523-1c7c-11ec-89d2-16262ee38c7f,8,48,0,3,0,0,0,0,0,0,...,0,0,0,1,0,3,0,0,0,0
a0f0d553-1c7c-11ec-a70a-16262ee38c7f,7,44,0,0,0,0,0,0,0,0,...,1,0,0,1,0,0,0,0,0,0


# Test data

In [15]:
df1 = pd.read_parquet("/content/test.parquet")

df1.describe()

Unnamed: 0,Patient-Uid,Date,Incident
count,1065524,1065524,1065524
unique,11482,1947,55
top,a0faa6ed-1c7c-11ec-8f6f-16262ee38c7f,2018-03-13 00:00:00,DRUG_TYPE_6
freq,1236,1139,192292
first,,2015-04-07 00:00:00,
last,,2020-08-04 00:00:00,


In [16]:
df1

Unnamed: 0,Patient-Uid,Date,Incident
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2016-12-08,SYMPTOM_TYPE_0
1,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-10-17,DRUG_TYPE_0
2,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-12-01,DRUG_TYPE_2
3,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-12-05,DRUG_TYPE_1
4,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-11-04,SYMPTOM_TYPE_0
...,...,...,...
1372854,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-05-11,DRUG_TYPE_13
1372856,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2018-08-22,DRUG_TYPE_2
1372857,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-02-04,DRUG_TYPE_2
1372858,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-09-25,DRUG_TYPE_8


# Performed data cleaning

In [17]:
df1=df1.dropna()

df1 = df1.drop_duplicates()

df1 = df1.drop("Date", axis=1)

df1

Unnamed: 0,Patient-Uid,Incident
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,SYMPTOM_TYPE_0
1,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,DRUG_TYPE_0
2,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,DRUG_TYPE_2
3,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,DRUG_TYPE_1
4,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,SYMPTOM_TYPE_0
...,...,...
1372854,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,DRUG_TYPE_13
1372856,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,DRUG_TYPE_2
1372857,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,DRUG_TYPE_2
1372858,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,DRUG_TYPE_8


# Encoding the data

In [18]:
# One-hot encode the Incident column
one_hot = pd.get_dummies(df1['Incident'])

# Concatenate the one-hot encoded column with the original DataFrame
df1 = pd.concat([df1, one_hot], axis=1)

# Group the incidents by Patient-Uid and sum the one-hot encoded columns
test_data = df1.groupby('Patient-Uid').sum()

# Display the resulting DataFrame
test_data

Unnamed: 0_level_0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
Patient-Uid,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,8,3,0,1,0,0,0,0,0,0,...,3,0,0,0,2,0,0,0,0,0
a0f9e9f9-1c7c-11ec-b565-16262ee38c7f,2,30,0,0,0,0,0,9,0,0,...,2,0,0,0,0,0,0,1,0,0
a0f9ea43-1c7c-11ec-aa10-16262ee38c7f,4,33,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
a0f9ea7c-1c7c-11ec-af15-16262ee38c7f,2,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
a0f9eab1-1c7c-11ec-a732-16262ee38c7f,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a102720c-1c7c-11ec-bd9a-16262ee38c7f,33,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a102723c-1c7c-11ec-9f80-16262ee38c7f,4,6,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
a102726b-1c7c-11ec-bfbf-16262ee38c7f,14,5,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a102729b-1c7c-11ec-86ba-16262ee38c7f,5,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [19]:
# add a new column 'TARGET DRUG' to the DataFrame with default values of 0
test_data['TARGET DRUG'] = 0
test_data['DRUG_TYPE_18'] =0

test_pred=test_data.drop('TARGET DRUG',axis=1).values

X=grouped.drop('TARGET DRUG',axis=1).values
y=grouped['TARGET DRUG'].values

# Model

Chose an appropriate modeling technique based on the problem requirements and available data.

In [20]:
import pandas as pd
from sklearn.model_selection import train_test_split
# split train data into train and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score,f1_score
rf = RandomForestClassifier(n_estimators= 100, max_depth = 16, max_features='sqrt')
rf.fit(X_train, y_train)
predictions = rf.predict(X_val)
print("Training Accuracy :",rf.score(X_train,y_train))
print("Testing Accuracy :",rf.score(X_val,y_val))
print("F1 Score :", f1_score(y_val,predictions))
print("ROC AUC Score :", roc_auc_score(y_val,predictions))

Training Accuracy : 0.9393322852122445
Testing Accuracy : 0.8071019049380432
F1 Score : 0.7110002770850652
ROC AUC Score : 0.7742814381809238


Analyzed the model's performance and identified the false positives and false negatives. Devised strategies to reduce these errors, which may involve adjusting classification thresholds, applying resampling techniques.

In [21]:
predict = rf.predict(test_pred)

predict

array([0, 0, 0, ..., 0, 0, 0])

In [22]:
df3 = pd.DataFrame(predict)
df3

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
...,...
11477,0
11478,0
11479,0
11480,0


In [23]:
df3[0].unique()

array([0, 1])

In [24]:
df3 = df3.rename(columns={0: 'predict'})
df3

Unnamed: 0,predict
0,0
1,0
2,0
3,0
4,0
...,...
11477,0
11478,0
11479,0
11480,0


In [25]:
test_data = test_data.reset_index()

final_data = pd.concat([test_data["Patient-Uid"], df3["predict"]], axis=1)

final_data

Unnamed: 0,Patient-Uid,predict
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,0
1,a0f9e9f9-1c7c-11ec-b565-16262ee38c7f,0
2,a0f9ea43-1c7c-11ec-aa10-16262ee38c7f,0
3,a0f9ea7c-1c7c-11ec-af15-16262ee38c7f,0
4,a0f9eab1-1c7c-11ec-a732-16262ee38c7f,0
...,...,...
11477,a102720c-1c7c-11ec-bd9a-16262ee38c7f,0
11478,a102723c-1c7c-11ec-9f80-16262ee38c7f,0
11479,a102726b-1c7c-11ec-bfbf-16262ee38c7f,0
11480,a102729b-1c7c-11ec-86ba-16262ee38c7f,0


# Patient eligible for “Target Drug”

In [26]:
fd=final_data[final_data["predict"]==1]

fd

Unnamed: 0,Patient-Uid,predict
130,a0fa02fd-1c7c-11ec-86c6-16262ee38c7f,1
314,a0fa25f4-1c7c-11ec-a5c8-16262ee38c7f,1
394,a0fa3536-1c7c-11ec-a913-16262ee38c7f,1
867,a0fa8f57-1c7c-11ec-8d65-16262ee38c7f,1
1805,a0fb4225-1c7c-11ec-9e36-16262ee38c7f,1
2389,a0fbb144-1c7c-11ec-80e9-16262ee38c7f,1
3062,a0fc32c3-1c7c-11ec-a0b4-16262ee38c7f,1
4168,a0fd03e8-1c7c-11ec-92d0-16262ee38c7f,1
4738,a0fd6fd7-1c7c-11ec-b615-16262ee38c7f,1
5134,a0fdbb86-1c7c-11ec-8124-16262ee38c7f,1


# Extracted the final_submission CSV file

In [27]:
final_submission = final_data[['Patient-Uid', 'predict']].rename(columns={'predict': 'label'})

final_submission.to_csv("final_submission.csv", index=False)