# Akaike Structured Data Assignment

# > Description of data

The dataset in question contains a comprehensive collection of electronic
health records belonging to patients who have been diagnosed with a specific
disease. These health records comprise a detailed log of every aspect of the
patients' medical history, including all diagnoses, symptoms, prescribed drug
treatments, and medical tests that they have undergone. Each row represents a
healthcare record (medical event) for a patient and includes a timestamp,
allowing a chronological view of the patient's medical history.

The data has three main columns:

1) Patient-Uid - Unique alphanumeric identifier for a patient
2) Date - Date on which the patient encountered the event
3) Incident - Describes which event occurred on that date

# > Problem Statement

# > Problem 1

The development of drugs is critical in providing therapeutic options
for patients suffering from chronic and terminal illnesses. “Target Drug”, in particular,
is designed to enhance the patient's health and well-being without causing
dependence on other medications that could potentially lead to severe and
life-threatening side effects. These drugs are specifically tailored to treat a particular
disease or condition, offering a more focused and effective approach to treatment,
while minimising the risk of harmful reactions.

# > Objective of Problem Statement

The primary objective of this assignment is to develop a predictive model that accurately determines a patient's eligibility for the "Target Drug" within the next 30 days. By achieving this objective, the following outcomes are anticipated:

# **Importing Libraries**

In [157]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import xgboost as xgb
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score, accuracy_score

In [158]:
df = pd.read_parquet("/content/drive/MyDrive/dataset/Akaike/train.parquet")
df.head()

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,2015-05-16,PRIMARY_DIAGNOSIS
3,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,2018-01-30,SYMPTOM_TYPE_0
4,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,2015-04-22,DRUG_TYPE_0
8,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2016-06-18,DRUG_TYPE_1


In [159]:
df.shape

(3220868, 3)

In [160]:
len(df['Patient-Uid'].unique())

27033

In [161]:
df['Incident'].unique()

array(['PRIMARY_DIAGNOSIS', 'SYMPTOM_TYPE_0', 'DRUG_TYPE_0',
       'DRUG_TYPE_1', 'DRUG_TYPE_2', 'TEST_TYPE_0', 'DRUG_TYPE_3',
       'DRUG_TYPE_4', 'DRUG_TYPE_5', 'DRUG_TYPE_6', 'DRUG_TYPE_8',
       'DRUG_TYPE_7', 'SYMPTOM_TYPE_1', 'DRUG_TYPE_10', 'SYMPTOM_TYPE_29',
       'SYMPTOM_TYPE_2', 'DRUG_TYPE_11', 'DRUG_TYPE_9', 'DRUG_TYPE_13',
       'SYMPTOM_TYPE_5', 'TEST_TYPE_1', 'SYMPTOM_TYPE_6', 'TEST_TYPE_2',
       'SYMPTOM_TYPE_3', 'SYMPTOM_TYPE_8', 'DRUG_TYPE_14', 'DRUG_TYPE_12',
       'SYMPTOM_TYPE_9', 'SYMPTOM_TYPE_10', 'SYMPTOM_TYPE_7',
       'SYMPTOM_TYPE_11', 'TEST_TYPE_3', 'DRUG_TYPE_15', 'SYMPTOM_TYPE_4',
       'SYMPTOM_TYPE_14', 'SYMPTOM_TYPE_13', 'SYMPTOM_TYPE_16',
       'SYMPTOM_TYPE_17', 'SYMPTOM_TYPE_15', 'SYMPTOM_TYPE_18',
       'SYMPTOM_TYPE_12', 'SYMPTOM_TYPE_20', 'SYMPTOM_TYPE_21',
       'DRUG_TYPE_17', 'SYMPTOM_TYPE_22', 'TEST_TYPE_4',
       'SYMPTOM_TYPE_23', 'DRUG_TYPE_16', 'TEST_TYPE_5',
       'SYMPTOM_TYPE_19', 'SYMPTOM_TYPE_24', 'SYMPTOM_TYPE_25',
   

# 1 > **Data Preprocessing**

## 1.1 > **Data cleaning**

### 1.1.1 **Data type**

In [162]:
# Checking data types
df.dtypes

Patient-Uid            object
Date           datetime64[ns]
Incident               object
dtype: object

```
The columns have the correct data types
```

### 1.1.2 **Data structure**

In [163]:
df.head()

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2019-03-09,PRIMARY_DIAGNOSIS
1,a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,2015-05-16,PRIMARY_DIAGNOSIS
3,a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,2018-01-30,SYMPTOM_TYPE_0
4,a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,2015-04-22,DRUG_TYPE_0
8,a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2016-06-18,DRUG_TYPE_1


```
Data is in structured format
```

### 1.1.3 **Duplicate data**

In [164]:
df.duplicated().sum()

35571

```
There are 35571 duplicate rows. Duplicates carry no additional information, so they can be removed.
```
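A minimal sketch of how `duplicated()` and `drop_duplicates()` behave, using a hypothetical toy frame (the patient IDs and rows are made up, not taken from the dataset). A row is counted only when every column matches an earlier row.

```python
import pandas as pd

# Hypothetical toy event log; values are illustrative only.
toy = pd.DataFrame({
    "Patient-Uid": ["p1", "p1", "p1", "p2"],
    "Date": pd.to_datetime(["2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01"]),
    "Incident": ["DRUG_TYPE_0", "DRUG_TYPE_0", "DRUG_TYPE_0", "DRUG_TYPE_0"],
})

# duplicated() marks the second and later copies of fully identical rows.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first copy of each row.
toy_clean = toy.drop_duplicates()
```

Only the second `p1` row on 2020-01-01 is flagged. The same event on a different date or for a different patient is kept.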

In [165]:
df = df.drop_duplicates()


In [166]:
df.duplicated().sum()

0

```
Duplicate values are removed from the data
```

### 1.1.4 **Missing values**

In [167]:
df.isnull().sum()

Patient-Uid    0
Date           0
Incident       0
dtype: int64

```
There are no missing values in the data
```

## **Creating Positive and Negative set**

In [168]:
# sorting the data based on patient id and date
df_sort = df.sort_values(by=['Patient-Uid', 'Date']).reset_index()
df_sort

Unnamed: 0,index,Patient-Uid,Date,Incident
0,1750087,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2015-09-22,DRUG_TYPE_7
1,1473893,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-04-13,SYMPTOM_TYPE_2
2,1387922,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-05-02,DRUG_TYPE_7
3,223191,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,SYMPTOM_TYPE_0
4,557302,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,DRUG_TYPE_9
...,...,...,...,...
3185292,26581536,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-06-19,DRUG_TYPE_6
3185293,27737944,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-09,TARGET DRUG
3185294,20027927,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-10,DRUG_TYPE_1
3185295,14145873,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-08-05,TARGET DRUG


In [169]:
#removing duplicate index column
df_sort = df_sort.drop('index', axis=1)
df_sort

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2015-09-22,DRUG_TYPE_7
1,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-04-13,SYMPTOM_TYPE_2
2,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-05-02,DRUG_TYPE_7
3,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,SYMPTOM_TYPE_0
4,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,DRUG_TYPE_9
...,...,...,...
3185292,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-06-19,DRUG_TYPE_6
3185293,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-09,TARGET DRUG
3185294,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-10,DRUG_TYPE_1
3185295,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-08-05,TARGET DRUG
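The sort and the index clean-up can also be done in one step. A minimal sketch, assuming a toy frame with the same three columns (the rows are made up): passing `drop=True` to `reset_index` discards the old index instead of adding it as an `index` column.

```python
import pandas as pd

# Hypothetical toy event log; values are illustrative only.
toy = pd.DataFrame({
    "Patient-Uid": ["p2", "p1", "p1"],
    "Date": pd.to_datetime(["2020-03-01", "2020-02-01", "2020-01-01"]),
    "Incident": ["DRUG_TYPE_1", "SYMPTOM_TYPE_0", "DRUG_TYPE_0"],
})

# Sort by patient, then by date; drop=True avoids the extra 'index' column.
toy_sorted = toy.sort_values(by=["Patient-Uid", "Date"]).reset_index(drop=True)
```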


### 1 > **Positive set**

In [170]:
# Getting the IDs of patients who have an incident value of "TARGET DRUG"
positive = df_sort[df_sort['Incident']=='TARGET DRUG']['Patient-Uid'].unique()
positive

array(['a0e9c384-1c7c-11ec-81a0-16262ee38c7f',
       'a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f',
       'a0e9c3e3-1c7c-11ec-a8b9-16262ee38c7f', ...,
       'a0f0d523-1c7c-11ec-89d2-16262ee38c7f',
       'a0f0d553-1c7c-11ec-a70a-16262ee38c7f',
       'a0f0d582-1c7c-11ec-a6c1-16262ee38c7f'], dtype=object)

In [171]:
positive_set = df_sort[df_sort['Patient-Uid'].isin(positive)]
positive_set

Unnamed: 0,Patient-Uid,Date,Incident
1763701,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-04-14,DRUG_TYPE_7
1763702,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-09-07,TEST_TYPE_0
1763703,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-09-07,DRUG_TYPE_0
1763704,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-09-07,DRUG_TYPE_8
1763705,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-09-07,DRUG_TYPE_7
...,...,...,...
3185292,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-06-19,DRUG_TYPE_6
3185293,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-09,TARGET DRUG
3185294,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-10,DRUG_TYPE_1
3185295,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-08-05,TARGET DRUG


In [172]:
incident_pos_freq = pd.get_dummies(positive_set['Incident'])
incident_pos_freq

Unnamed: 0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TARGET DRUG,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
1763701,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763702,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1763703,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763704,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763705,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3185292,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3185293,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3185294,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3185295,0,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [173]:
#combining positive_set and incident_pos_freq
new_pos_df = pd.concat([positive_set,incident_pos_freq],axis =1)
new_pos_df

Unnamed: 0,Patient-Uid,Date,Incident,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,...,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TARGET DRUG,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
1763701,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-04-14,DRUG_TYPE_7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763702,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-09-07,TEST_TYPE_0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1763703,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-09-07,DRUG_TYPE_0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763704,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-09-07,DRUG_TYPE_8,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763705,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,2015-09-07,DRUG_TYPE_7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3185292,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-06-19,DRUG_TYPE_6,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3185293,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-09,TARGET DRUG,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
3185294,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-07-10,DRUG_TYPE_1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3185295,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,2020-08-05,TARGET DRUG,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0


In [174]:
# dropping unwanted columns
new_pos_df.drop(['Incident','Date','TARGET DRUG'],axis=1,inplace=True)

In [175]:
new_pos_df

Unnamed: 0,Patient-Uid,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
1763701,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763702,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
1763703,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763704,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763705,a0e9c384-1c7c-11ec-81a0-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3185292,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3185293,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3185294,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3185295,a0f0d582-1c7c-11ec-a6c1-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [176]:
# getting the feature column names
column=list(new_pos_df.columns[1:])

In [177]:
# getting per-patient frequency counts of each incident
new_positive_set = new_pos_df.groupby('Patient-Uid')[column].sum()

In [178]:
new_positive_set

Patient-Uid,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
a0e9c384-1c7c-11ec-81a0-16262ee38c7f,6,10,0,9,0,0,0,0,0,0,...,5,0,0,0,4,1,1,0,0,0
a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f,19,21,10,0,13,0,0,0,0,0,...,2,0,0,0,0,2,1,0,0,0
a0e9c3e3-1c7c-11ec-a8b9-16262ee38c7f,4,20,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a0e9c414-1c7c-11ec-889a-16262ee38c7f,1,2,0,0,0,0,0,0,0,0,...,1,1,0,0,0,10,0,0,0,0
a0e9c443-1c7c-11ec-9eb0-16262ee38c7f,21,18,0,6,0,0,0,0,0,0,...,3,2,0,0,2,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,48,9,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,17,23,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a0f0d523-1c7c-11ec-89d2-16262ee38c7f,8,48,0,3,0,0,0,0,0,0,...,0,0,0,0,0,3,0,0,0,0
a0f0d553-1c7c-11ec-a70a-16262ee38c7f,7,44,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [179]:
# patients in the positive set received the target drug, so they are labelled 1
new_positive_set['Target'] = 1

In [180]:
new_positive_set

Patient-Uid,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5,Target
a0e9c384-1c7c-11ec-81a0-16262ee38c7f,6,10,0,9,0,0,0,0,0,0,...,0,0,0,4,1,1,0,0,0,1
a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f,19,21,10,0,13,0,0,0,0,0,...,0,0,0,0,2,1,0,0,0,1
a0e9c3e3-1c7c-11ec-a8b9-16262ee38c7f,4,20,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
a0e9c414-1c7c-11ec-889a-16262ee38c7f,1,2,0,0,0,0,0,0,0,0,...,1,0,0,0,10,0,0,0,0,1
a0e9c443-1c7c-11ec-9eb0-16262ee38c7f,21,18,0,6,0,0,0,0,0,0,...,2,0,0,2,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a0f0d4c5-1c7c-11ec-bfec-16262ee38c7f,48,9,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
a0f0d4f4-1c7c-11ec-b144-16262ee38c7f,17,23,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
a0f0d523-1c7c-11ec-89d2-16262ee38c7f,8,48,0,3,0,0,0,0,0,0,...,0,0,0,0,3,0,0,0,0,1
a0f0d553-1c7c-11ec-a70a-16262ee38c7f,7,44,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
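The one-hot-then-sum steps above build a per-patient count of each incident type. A minimal sketch on a hypothetical toy frame (made-up IDs and events) suggests that `pd.crosstab` gives the same table in one call. This is an alternative illustration, not the method used in this notebook.

```python
import pandas as pd

# Hypothetical toy event log; values are illustrative only.
toy = pd.DataFrame({
    "Patient-Uid": ["p1", "p1", "p1", "p2", "p2"],
    "Incident": ["DRUG_TYPE_0", "DRUG_TYPE_0", "TEST_TYPE_1", "DRUG_TYPE_0", "TARGET DRUG"],
})

# Route used in the notebook: one-hot encode each event, then sum per patient.
dummies = pd.get_dummies(toy["Incident"]).astype(int)
via_dummies = pd.concat([toy[["Patient-Uid"]], dummies], axis=1).groupby("Patient-Uid").sum()

# Alternative: crosstab counts (patient, incident) pairs directly.
via_crosstab = pd.crosstab(toy["Patient-Uid"], toy["Incident"])

# Drop the label-defining event so it cannot leak into the features.
features = via_crosstab.drop(columns=["TARGET DRUG"])
```

Both routes produce one row per patient and one column per incident type. The crosstab version avoids materialising a one-hot row for every event, which may matter on a table with millions of rows.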


### 2 > **Negative Set**

In [181]:
# Getting the negative set from the main dataframe
negative_set = df_sort[~df_sort['Patient-Uid'].isin(positive)]
negative_set

Unnamed: 0,Patient-Uid,Date,Incident
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2015-09-22,DRUG_TYPE_7
1,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-04-13,SYMPTOM_TYPE_2
2,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-05-02,DRUG_TYPE_7
3,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,SYMPTOM_TYPE_0
4,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,DRUG_TYPE_9
...,...,...,...
1763696,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,2020-06-17,DRUG_TYPE_1
1763697,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,2020-07-14,PRIMARY_DIAGNOSIS
1763698,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,2020-07-14,DRUG_TYPE_1
1763699,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,2020-08-06,DRUG_TYPE_1


In [182]:
Incident_neg_freq = pd.get_dummies(negative_set['Incident'])
Incident_neg_freq

Unnamed: 0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1763696,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763697,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763698,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763699,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [183]:
new_neg_df = pd.concat([negative_set,Incident_neg_freq],axis =1)
new_neg_df

Unnamed: 0,Patient-Uid,Date,Incident,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2015-09-22,DRUG_TYPE_7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-04-13,SYMPTOM_TYPE_2,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-05-02,DRUG_TYPE_7,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,SYMPTOM_TYPE_0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,2018-11-23,DRUG_TYPE_9,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1763696,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,2020-06-17,DRUG_TYPE_1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763697,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,2020-07-14,PRIMARY_DIAGNOSIS,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763698,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,2020-07-14,DRUG_TYPE_1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763699,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,2020-08-06,DRUG_TYPE_1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [184]:
new_neg_df.drop(['Incident','Date'],axis=1,inplace=True)

In [185]:
new_neg_df

Unnamed: 0,Patient-Uid,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
0,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,a0db1e73-1c7c-11ec-ae39-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1763696,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763697,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763698,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1763699,a0e9c354-1c7c-11ec-84f5-16262ee38c7f,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [186]:
neg_column = list(new_neg_df.columns[1:])


In [187]:
new_negative_set=new_neg_df.groupby('Patient-Uid')[neg_column].sum()

In [188]:
new_negative_set

Patient-Uid,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
a0db1e73-1c7c-11ec-ae39-16262ee38c7f,29,0,0,1,0,0,0,0,0,0,...,0,1,0,0,10,2,0,0,0,0
a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,8,27,0,0,0,0,0,0,0,0,...,2,0,0,0,1,4,0,0,0,0
a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,6,7,0,10,0,0,0,0,0,0,...,8,0,0,0,3,2,0,0,0,0
a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,15,42,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2,45,0,24,0,0,0,0,0,0,...,6,5,6,0,9,27,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a0e9c298-1c7c-11ec-954b-16262ee38c7f,4,41,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a0e9c2c7-1c7c-11ec-9b2e-16262ee38c7f,7,0,0,0,0,0,0,0,0,0,...,0,1,0,0,16,1,0,0,0,0
a0e9c2f7-1c7c-11ec-8bac-16262ee38c7f,0,11,0,0,0,0,0,0,0,0,...,0,1,0,0,0,5,0,0,0,0
a0e9c325-1c7c-11ec-8885-16262ee38c7f,5,10,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [189]:
new_negative_set['Target'] = 0

In [190]:
new_negative_set

Patient-Uid,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5,Target
a0db1e73-1c7c-11ec-ae39-16262ee38c7f,29,0,0,1,0,0,0,0,0,0,...,1,0,0,10,2,0,0,0,0,0
a0dc93f2-1c7c-11ec-9cd2-16262ee38c7f,8,27,0,0,0,0,0,0,0,0,...,0,0,0,1,4,0,0,0,0,0
a0dc94c6-1c7c-11ec-a3a0-16262ee38c7f,6,7,0,10,0,0,0,0,0,0,...,0,0,0,3,2,0,0,0,0,0
a0dc950b-1c7c-11ec-b6ec-16262ee38c7f,15,42,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a0dc9543-1c7c-11ec-bb63-16262ee38c7f,2,45,0,24,0,0,0,0,0,0,...,5,6,0,9,27,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a0e9c298-1c7c-11ec-954b-16262ee38c7f,4,41,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a0e9c2c7-1c7c-11ec-9b2e-16262ee38c7f,7,0,0,0,0,0,0,0,0,0,...,1,0,0,16,1,0,0,0,0,0
a0e9c2f7-1c7c-11ec-8bac-16262ee38c7f,0,11,0,0,0,0,0,0,0,0,...,1,0,0,0,5,0,0,0,0,0
a0e9c325-1c7c-11ec-8885-16262ee38c7f,5,10,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Combine positive and negative set

In [191]:
# Combining positive and negative set data
com_df = pd.concat([new_positive_set,new_negative_set])
com_df

Patient-Uid,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5,Target,DRUG_TYPE_18
a0e9c384-1c7c-11ec-81a0-16262ee38c7f,6,10,0,9,0,0,0,0,0,0,...,0,0,4,1,1,0,0,0,1,
a0e9c3b3-1c7c-11ec-ae8e-16262ee38c7f,19,21,10,0,13,0,0,0,0,0,...,0,0,0,2,1,0,0,0,1,
a0e9c3e3-1c7c-11ec-a8b9-16262ee38c7f,4,20,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,
a0e9c414-1c7c-11ec-889a-16262ee38c7f,1,2,0,0,0,0,0,0,0,0,...,0,0,0,10,0,0,0,0,1,
a0e9c443-1c7c-11ec-9eb0-16262ee38c7f,21,18,0,6,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a0e9c298-1c7c-11ec-954b-16262ee38c7f,4,41,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.0
a0e9c2c7-1c7c-11ec-9b2e-16262ee38c7f,7,0,0,0,0,0,0,0,0,0,...,0,0,16,1,0,0,0,0,0,0.0
a0e9c2f7-1c7c-11ec-8bac-16262ee38c7f,0,11,0,0,0,0,0,0,0,0,...,0,0,0,5,0,0,0,0,0,0.0
a0e9c325-1c7c-11ec-8885-16262ee38c7f,5,10,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0.0


In [192]:
com_df.isnull().sum()

DRUG_TYPE_0             0
DRUG_TYPE_1             0
DRUG_TYPE_10            0
DRUG_TYPE_11            0
DRUG_TYPE_12            0
DRUG_TYPE_13            0
DRUG_TYPE_14            0
DRUG_TYPE_15            0
DRUG_TYPE_16            0
DRUG_TYPE_17            0
DRUG_TYPE_2             0
DRUG_TYPE_3             0
DRUG_TYPE_4             0
DRUG_TYPE_5             0
DRUG_TYPE_6             0
DRUG_TYPE_7             0
DRUG_TYPE_8             0
DRUG_TYPE_9             0
PRIMARY_DIAGNOSIS       0
SYMPTOM_TYPE_0          0
SYMPTOM_TYPE_1          0
SYMPTOM_TYPE_10         0
SYMPTOM_TYPE_11         0
SYMPTOM_TYPE_12         0
SYMPTOM_TYPE_13         0
SYMPTOM_TYPE_14         0
SYMPTOM_TYPE_15         0
SYMPTOM_TYPE_16         0
SYMPTOM_TYPE_17         0
SYMPTOM_TYPE_18         0
SYMPTOM_TYPE_19         0
SYMPTOM_TYPE_2          0
SYMPTOM_TYPE_20         0
SYMPTOM_TYPE_21         0
SYMPTOM_TYPE_22         0
SYMPTOM_TYPE_23         0
SYMPTOM_TYPE_24         0
SYMPTOM_TYPE_25         0
SYMPTOM_TYPE

```
There are 9374 null values in the combined data. The positive set has no DRUG_TYPE_18 events, so after concatenation that column is recorded as null for the positive patients. A null here means the event never occurred, so it can be imputed with 0.
```
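A minimal sketch of why concatenation introduces nulls, using two hypothetical count tables with different column sets (the IDs and counts are made up). A column that is absent from one input is filled with `NaN` for that input's rows, which also upcasts the column to float.

```python
import pandas as pd

# Hypothetical per-patient count tables with different column sets.
pos = pd.DataFrame({"DRUG_TYPE_0": [6, 19], "Target": [1, 1]}, index=["p1", "p2"])
neg = pd.DataFrame({"DRUG_TYPE_0": [29], "DRUG_TYPE_18": [2], "Target": [0]}, index=["n1"])

combined = pd.concat([pos, neg])

# Columns missing from one input are NaN for that input's rows.
n_missing = int(combined["DRUG_TYPE_18"].isnull().sum())

# A missing count means the event never occurred, so 0 is a natural fill;
# casting back to int undoes the float upcast caused by NaN.
combined = combined.fillna(0).astype(int)
```

The new column is appended after the existing ones, which matches DRUG_TYPE_18 appearing after `Target` in the combined table above.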

In [193]:
# Imputing Null value with 0
com_df = com_df.fillna(0)

In [194]:
com_df.isnull().sum()

DRUG_TYPE_0          0
DRUG_TYPE_1          0
DRUG_TYPE_10         0
DRUG_TYPE_11         0
DRUG_TYPE_12         0
DRUG_TYPE_13         0
DRUG_TYPE_14         0
DRUG_TYPE_15         0
DRUG_TYPE_16         0
DRUG_TYPE_17         0
DRUG_TYPE_2          0
DRUG_TYPE_3          0
DRUG_TYPE_4          0
DRUG_TYPE_5          0
DRUG_TYPE_6          0
DRUG_TYPE_7          0
DRUG_TYPE_8          0
DRUG_TYPE_9          0
PRIMARY_DIAGNOSIS    0
SYMPTOM_TYPE_0       0
SYMPTOM_TYPE_1       0
SYMPTOM_TYPE_10      0
SYMPTOM_TYPE_11      0
SYMPTOM_TYPE_12      0
SYMPTOM_TYPE_13      0
SYMPTOM_TYPE_14      0
SYMPTOM_TYPE_15      0
SYMPTOM_TYPE_16      0
SYMPTOM_TYPE_17      0
SYMPTOM_TYPE_18      0
SYMPTOM_TYPE_19      0
SYMPTOM_TYPE_2       0
SYMPTOM_TYPE_20      0
SYMPTOM_TYPE_21      0
SYMPTOM_TYPE_22      0
SYMPTOM_TYPE_23      0
SYMPTOM_TYPE_24      0
SYMPTOM_TYPE_25      0
SYMPTOM_TYPE_26      0
SYMPTOM_TYPE_27      0
SYMPTOM_TYPE_28      0
SYMPTOM_TYPE_29      0
SYMPTOM_TYPE_3       0
SYMPTOM_TYP

```
The null values have been imputed, so the data is ready to use
```

## 1.2 > **Splitting**

In [195]:
# feature columns as x, target column as y
x = com_df.drop(['Target'],axis =1)
y = com_df['Target']
# Splitting into training and test data
X_train,X_test,y_train,y_test = train_test_split(x,y,test_size= 0.2, random_state= 7)
X_train.shape,X_test.shape,y_train.shape,y_test.shape

((21626, 56), (5407, 56), (21626,), (5407,))
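A random split can leave the two partitions with slightly different class proportions. A minimal sketch, assuming synthetic data with 35 positives out of 100 samples (not the patient data): the `stratify` argument of `train_test_split` keeps the positive share equal in both partitions. The notebook's own split does not use it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples, 35 positives (values are illustrative).
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 35 + [0] * 65)

# stratify=y keeps the positive share (35%) equal in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)
```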

## 1.3 > **Check for imbalanced Data**

In [196]:
y_train.value_counts()

0    14075
1     7551
Name: Target, dtype: int64

In [197]:
7551/14075


0.5364831261101244

```
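The positive-to-negative ratio is about 0.54 (roughly 35% of training patients are positive), so the classes are only mildly imbalanced and no resampling is applied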
```
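The ratio computed above can be read in two other ways. The sketch below reuses the class counts from the `y_train.value_counts()` output; the variable names are illustrative. The negatives-per-positive figure is the quantity often suggested as a starting value for XGBoost's `scale_pos_weight` parameter. This notebook leaves that parameter at its default.

```python
# Class counts copied from the y_train.value_counts() output.
n_neg, n_pos = 14075, 7551

# Fraction of training patients labelled 1.
positive_share = n_pos / (n_neg + n_pos)

# Negatives per positive.
neg_to_pos = n_neg / n_pos
```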

## 1.4 > **Scaling**

In [198]:
# scaling the data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
x_train_scaled = scaler.transform(X_train)
x_test_scaled = scaler.transform(X_test)

# 2 > **Task**

```
This is a binary classification problem, so a classification model is appropriate. XGBoost is used here because gradient-boosted trees usually perform well on tabular data
```

# 3 > **Model**

### 1 > **Extreme Gradient Boosting**

#### **Finding best learning rate**

In [199]:
# Finding the best learning rate

high_xg =[]
for lrn in [0.05, 0.1, 0.15,0.2,0.25,0.3,0.35,0.4,0.45,0.5,0.55,0.6,0.65,0.7,0.75,0.8,0.85,0.9,0.95,1]:
  xg = xgb.XGBClassifier(learning_rate = lrn)
  xg.fit(x_train_scaled, y_train)
  train_score = xg.score(x_train_scaled, y_train)
  cv_score = np.mean(cross_val_score(xg, x_train_scaled, y_train, cv = 10))
  # print('learn', lrn, 'train score', train_score, 'cross_val_score', cv_score)
  b= ({'learn' : lrn , 'train_score' : train_score, 'cv_score' : cv_score})
  high_xg.append(b)
df_xg = pd.DataFrame(high_xg)
df_xg1 = df_xg.sort_values(by = 'cv_score', ascending = False).reset_index()
df_xg1

Unnamed: 0,index,learn,train_score,cv_score
0,2,0.15,0.873116,0.816332
1,1,0.1,0.862157,0.814667
2,4,0.25,0.889485,0.813882
3,0,0.05,0.845556,0.813835
4,5,0.3,0.899288,0.81157
5,3,0.2,0.882595,0.8112
6,6,0.35,0.903635,0.80861
7,7,0.4,0.912698,0.804541
8,8,0.45,0.914362,0.802368
9,9,0.5,0.925553,0.800934
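The manual loop over learning rates can also be written as a grid search. A minimal sketch under stated assumptions: the data is synthetic, and scikit-learn's `GradientBoostingClassifier` stands in for `xgb.XGBClassifier` so the example needs only scikit-learn. The grid values, `n_estimators` and `cv=3` are chosen to keep it fast, not taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the patient dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=7)

# GradientBoostingClassifier is a stand-in for xgb.XGBClassifier here.
search = GridSearchCV(
    estimator=GradientBoostingClassifier(n_estimators=20, random_state=7),
    param_grid={"learning_rate": [0.05, 0.15, 0.5]},
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

# Best learning rate and its mean cross-validated accuracy.
best_lr = search.best_params_["learning_rate"]
best_cv_score = search.best_score_
```

`GridSearchCV` refits the best configuration on the full training data by default, so `search.best_estimator_` can be used directly for prediction.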


#### **Implementing XG Boost model**

In [200]:
# fitting the model with the best learning rate
xg1 = xgb.XGBClassifier(learning_rate = 0.15)
xg1.fit(x_train_scaled, y_train)

# 4 > **Evaluation**

#### 1 > **F1_score**

In [201]:
f1_score(y_test, xg1.predict(x_test_scaled))

0.722905027932961

```
F1-score = 0.72
```

#### 2 > **Accuracy**

In [202]:
accuracy_score(y_test, xg1.predict(x_test_scaled))

0.816534122433882

```
Accuracy = 0.82
```
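scikit-learn metric functions take the true labels first and the predictions second. A minimal sketch on made-up label vectors shows where the order matters: swapping the arguments exchanges precision and recall, while binary F1 and accuracy are unaffected.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: 3 true positives, 2 false positives, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

# Swapping the arguments exchanges precision and recall but leaves F1 unchanged.
swapped_precision = precision_score(y_pred, y_true)
swapped_f1 = f1_score(y_pred, y_true)
```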

# Loading TEST dataset

In [203]:
# importing test dataset
df_test = pd.read_parquet('/content/drive/MyDrive/dataset/Akaike/test.parquet')
df_test

Unnamed: 0,Patient-Uid,Date,Incident
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2016-12-08,SYMPTOM_TYPE_0
1,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-10-17,DRUG_TYPE_0
2,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-12-01,DRUG_TYPE_2
3,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-12-05,DRUG_TYPE_1
4,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-11-04,SYMPTOM_TYPE_0
...,...,...,...
1372854,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-05-11,DRUG_TYPE_13
1372856,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2018-08-22,DRUG_TYPE_2
1372857,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-02-04,DRUG_TYPE_2
1372858,a10272c9-1c7c-11ec-b3ce-16262ee38c7f,2017-09-25,DRUG_TYPE_8


## 1 > **Data Cleaning for test data**

### 1.1 > **Data type**

In [204]:
df_test.dtypes

Patient-Uid            object
Date           datetime64[ns]
Incident               object
dtype: object

### 1.2 > **Data Structure**

In [205]:
df_test.head()

Unnamed: 0,Patient-Uid,Date,Incident
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2016-12-08,SYMPTOM_TYPE_0
1,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-10-17,DRUG_TYPE_0
2,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-12-01,DRUG_TYPE_2
3,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2018-12-05,DRUG_TYPE_1
4,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,2017-11-04,SYMPTOM_TYPE_0


In [206]:
df_test.shape

(1065524, 3)

### 1.3 > **Duplicate data**

In [207]:
df_test.duplicated().sum()

12100

In [208]:
df_test = df_test.drop_duplicates()

In [209]:
df_test.duplicated().sum()

0

In [210]:
df_test.shape

(1053424, 3)

### 1.4 > **Missing value**

In [211]:
df_test.isnull().sum()

Patient-Uid    0
Date           0
Incident       0
dtype: int64

In [212]:
df_test.nunique()

Patient-Uid    11482
Date            1947
Incident          55
dtype: int64

In [213]:
train = df_sort['Incident'].unique()
test = df_test['Incident'].unique()

In [214]:
c =[]
for i in train:
  if i not in test:
    c.append(i)

In [215]:
c

['DRUG_TYPE_18', 'TARGET DRUG']

DRUG_TYPE_18 and TARGET DRUG are missing from the test set
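The same comparison can be written with set differences, which also makes it easy to check the opposite direction (categories that appear only in the test data). A minimal sketch on hypothetical incident values:

```python
import pandas as pd

# Hypothetical incident columns; values are illustrative only.
train_incidents = pd.Series(["DRUG_TYPE_0", "DRUG_TYPE_18", "TARGET DRUG", "TEST_TYPE_0"])
test_incidents = pd.Series(["DRUG_TYPE_0", "TEST_TYPE_0", "TEST_TYPE_0"])

# Categories seen in training but absent from the test data.
only_in_train = sorted(set(train_incidents.unique()) - set(test_incidents.unique()))

# Categories seen in the test data but never in training.
only_in_test = sorted(set(test_incidents.unique()) - set(train_incidents.unique()))
```

An empty `only_in_test` means every test feature has a counterpart in the training columns.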

## Exploring Data

In [216]:
df_test[df_test['Incident']== 'TARGET DRUG']

Unnamed: 0,Patient-Uid,Date,Incident


In [217]:
incident_test_freq = pd.get_dummies(df_test['Incident'])
incident_test_freq

Unnamed: 0,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1372854,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1372856,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1372857,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1372858,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [218]:
incident_test_freq.shape

(1053424, 55)

In [219]:
new_test_df = pd.concat([df_test,incident_test_freq], axis = 1)

new_test_df.columns

Index(['Patient-Uid', 'Date', 'Incident', 'DRUG_TYPE_0', 'DRUG_TYPE_1',
       'DRUG_TYPE_10', 'DRUG_TYPE_11', 'DRUG_TYPE_12', 'DRUG_TYPE_13',
       'DRUG_TYPE_14', 'DRUG_TYPE_15', 'DRUG_TYPE_16', 'DRUG_TYPE_17',
       'DRUG_TYPE_2', 'DRUG_TYPE_3', 'DRUG_TYPE_4', 'DRUG_TYPE_5',
       'DRUG_TYPE_6', 'DRUG_TYPE_7', 'DRUG_TYPE_8', 'DRUG_TYPE_9',
       'PRIMARY_DIAGNOSIS', 'SYMPTOM_TYPE_0', 'SYMPTOM_TYPE_1',
       'SYMPTOM_TYPE_10', 'SYMPTOM_TYPE_11', 'SYMPTOM_TYPE_12',
       'SYMPTOM_TYPE_13', 'SYMPTOM_TYPE_14', 'SYMPTOM_TYPE_15',
       'SYMPTOM_TYPE_16', 'SYMPTOM_TYPE_17', 'SYMPTOM_TYPE_18',
       'SYMPTOM_TYPE_19', 'SYMPTOM_TYPE_2', 'SYMPTOM_TYPE_20',
       'SYMPTOM_TYPE_21', 'SYMPTOM_TYPE_22', 'SYMPTOM_TYPE_23',
       'SYMPTOM_TYPE_24', 'SYMPTOM_TYPE_25', 'SYMPTOM_TYPE_26',
       'SYMPTOM_TYPE_27', 'SYMPTOM_TYPE_28', 'SYMPTOM_TYPE_29',
       'SYMPTOM_TYPE_3', 'SYMPTOM_TYPE_4', 'SYMPTOM_TYPE_5', 'SYMPTOM_TYPE_6',
       'SYMPTOM_TYPE_7', 'SYMPTOM_TYPE_8', 'SYMPTOM_TYPE_9', '

In [220]:
new_test_df.drop(['Date','Incident'],axis = 1, inplace = True)

In [221]:
new_test_df.shape

(1053424, 56)

In [222]:
column_test = list(new_test_df.columns[1:])
len(column_test)

55

In [223]:
new_test_df = new_test_df.groupby('Patient-Uid')[column_test].sum()


In [224]:
new_test_df

Patient-Uid,DRUG_TYPE_0,DRUG_TYPE_1,DRUG_TYPE_10,DRUG_TYPE_11,DRUG_TYPE_12,DRUG_TYPE_13,DRUG_TYPE_14,DRUG_TYPE_15,DRUG_TYPE_16,DRUG_TYPE_17,...,SYMPTOM_TYPE_6,SYMPTOM_TYPE_7,SYMPTOM_TYPE_8,SYMPTOM_TYPE_9,TEST_TYPE_0,TEST_TYPE_1,TEST_TYPE_2,TEST_TYPE_3,TEST_TYPE_4,TEST_TYPE_5
a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,8,3,0,1,0,0,0,0,0,0,...,3,0,0,0,2,0,0,0,0,0
a0f9e9f9-1c7c-11ec-b565-16262ee38c7f,2,30,0,0,0,0,0,9,0,0,...,2,0,0,0,0,0,0,1,0,0
a0f9ea43-1c7c-11ec-aa10-16262ee38c7f,4,33,0,0,0,0,0,0,0,0,...,0,0,0,0,0,2,0,0,0,0
a0f9ea7c-1c7c-11ec-af15-16262ee38c7f,2,0,0,1,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
a0f9eab1-1c7c-11ec-a732-16262ee38c7f,5,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
a102720c-1c7c-11ec-bd9a-16262ee38c7f,33,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a102723c-1c7c-11ec-9f80-16262ee38c7f,4,6,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
a102726b-1c7c-11ec-bfbf-16262ee38c7f,14,5,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
a102729b-1c7c-11ec-86ba-16262ee38c7f,5,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [225]:
new_test_df['DRUG_TYPE_18'] = 0

In [226]:
# Transforming the test data
test_scaled = scaler.transform(new_test_df)
label = xg1.predict(test_scaled)
label

array([0, 1, 0, ..., 0, 0, 0])
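The scaler and the model expect the test columns in the same order as the training columns. Here that holds because DRUG_TYPE_18 is the last training feature and is appended last to the test table. A minimal sketch of a more explicit alignment, using hypothetical column lists and counts: `reindex` reorders the columns to match training and fills any absent column with 0. In the notebook, the training order would be `x.columns`.

```python
import pandas as pd

# Hypothetical training feature order and a test table missing one column.
train_columns = ["DRUG_TYPE_0", "DRUG_TYPE_1", "TEST_TYPE_0", "DRUG_TYPE_18"]
test_counts = pd.DataFrame(
    {"TEST_TYPE_0": [2, 0], "DRUG_TYPE_0": [8, 2], "DRUG_TYPE_1": [3, 30]},
    index=["t1", "t2"],
)

# reindex puts columns in the training order and fills absent ones with 0.
aligned = test_counts.reindex(columns=train_columns, fill_value=0)
```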

In [227]:
#Converting into dataframe
final_submission = pd.DataFrame({'Patient-Uid':new_test_df.index, 'Label': label})
final_submission

Unnamed: 0,Patient-Uid,Label
0,a0f9e8a9-1c7c-11ec-8d25-16262ee38c7f,0
1,a0f9e9f9-1c7c-11ec-b565-16262ee38c7f,1
2,a0f9ea43-1c7c-11ec-aa10-16262ee38c7f,0
3,a0f9ea7c-1c7c-11ec-af15-16262ee38c7f,0
4,a0f9eab1-1c7c-11ec-a732-16262ee38c7f,0
...,...,...
11477,a102720c-1c7c-11ec-bd9a-16262ee38c7f,0
11478,a102723c-1c7c-11ec-9f80-16262ee38c7f,0
11479,a102726b-1c7c-11ec-bfbf-16262ee38c7f,0
11480,a102729b-1c7c-11ec-86ba-16262ee38c7f,0


In [228]:
final_submission['Label'].value_counts()

0    8805
1    2677
Name: Label, dtype: int64

In [229]:
# converting dataframe into csv file
final_submission.to_csv('Final_submission.csv',index = False)