Congratulations! You have been hired as Chief Data Scientist of MedCamp, a not-for-profit organization dedicated to improving the health of working professionals. MedCamp was started because its founders saw their families suffer from poor work-life balance and neglected health.

MedCamp organizes health camps in several cities where work-life balance is poor. It reaches out to working people and asks them to register for these camps. Depending on the camp format, attendees can either undergo health checks or raise their awareness by visiting various stalls.

MedCamp has conducted 65 such events over a period of 4 years and sees a high drop-off between the number of registrations and the number of people who actually take tests at the camps. Over these 4 years, it has stored data on ~110,000 registrations.

One of the largest costs in arranging these camps is the inventory that has to be carried. Carrying more inventory than required incurs unnecessarily high costs. Carrying less than the medical checks require leaves people with a bad experience.
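
The link between predicted probabilities and inventory can be sketched as follows: summing the per-registration probabilities within a camp gives the expected number of attendees. This is a minimal illustration, and the frame `preds` and all of its values are made up for the example rather than taken from MedCamp data.

```python
import pandas as pd

# Hypothetical predictions: one row per registration, with the predicted
# probability of a favourable outcome (values invented for illustration).
preds = pd.DataFrame({
    "Health_Camp_ID": [6578, 6578, 6578, 6529, 6529],
    "Outcome": [0.8, 0.5, 0.2, 0.9, 0.1],
})

# Expected attendance per camp is the sum of the individual probabilities.
expected_attendance = preds.groupby("Health_Camp_ID")["Outcome"].sum()
print(expected_attendance)
```

How much buffer to carry on top of the expected count is a business decision that this sketch does not address.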

 

#### The Process:
* MedCamp employees / volunteers reach out to people and drive registrations.
* During the camp, people who “ShowUp” either undergo the medical tests or visit stalls, depending on the format of the health camp.
 

#### Other things to note:
* Since this is a completely voluntary activity for the working professionals, MedCamp usually has little profile information about these people.
* For a few camps, there was a hardware failure, so some information about the date and time of registration was lost.
* MedCamp runs 3 formats of these camps. The first and second formats provide people with an instantaneous health score. The third format provides information about several health issues through various awareness stalls.

#### Favourable outcome:
For the first 2 formats, a favourable outcome is defined as getting a health_score, while in the third format it is defined as visiting at least one stall.
You need to predict the probability of a favourable outcome.
 

#### Data Description
Train.zip contains the following 6 CSV files, along with a data dictionary that defines each variable:

* Health_Camp_Detail.csv – File containing Health_Camp_Id, Camp_Start_Date, Camp_End_Date and Category details of each camp.

* Train.csv – File containing registration details for all the train camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

* Patient_Profile.csv – This file contains Patient profile details like Patient_ID, Online_Follower, Social media details, Income, Education, Age, First_Interaction_Date, City_Type and Employer_Category

* First_Health_Camp_Attended.csv – This file contains details about people who attended health camp of first format. This includes Donation (amount) & Health_Score of the person.

* Second_Health_Camp_Attended.csv - This file contains details about people who attended health camp of second format. This includes Health_Score of the person.

* Third_Health_Camp_Attended.csv - This file contains details about people who attended health camp of third format. This includes Number_of_stall_visited & Last_Stall_Visited_Number.


Test Set

Test.csv – File containing registration details for all the test camps. This includes Patient_ID, Health_Camp_ID, Registration_Date and a few anonymized variables as on registration date.

 

Train / Test split:

Camps that started on or before 31st March 2006 are in Train.
Test data covers all camps conducted on or after 1st April 2006.
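
Because the split is by camp date, a local hold-out that imitates it can be built by cutting on the camp start date as well. The sketch below uses an invented camp table and a hypothetical cutoff (`2005-12-31`); neither comes from the competition data.

```python
import pandas as pd

# Toy camp table; IDs and dates are invented for illustration.
camps = pd.DataFrame({
    "Health_Camp_ID": [1, 2, 3, 4],
    "Camp_Start_Date": ["16-Aug-05", "30-Mar-06", "01-Apr-06", "17-Oct-05"],
})
camps["Camp_Start_Date"] = pd.to_datetime(camps["Camp_Start_Date"],
                                          format="%d-%b-%y")

# Hypothetical cutoff for a local hold-out that mimics the train/test split.
cutoff = pd.Timestamp("2005-12-31")
is_early = camps["Camp_Start_Date"] <= cutoff
fit_camps = camps.loc[is_early, "Health_Camp_ID"].tolist()
holdout_camps = camps.loc[~is_early, "Health_Camp_ID"].tolist()
print(fit_camps, holdout_camps)  # [1, 4] [2, 3]
```

Validating on later camps only is one way to get a local score that behaves more like the leaderboard than a random split would.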


Sample Submission:

Patient_ID: Unique identifier for each patient. This ID is not sequential in nature and cannot be used in modeling.

Health_Camp_ID: Unique identifier for each camp. This ID is not sequential in nature and cannot be used in modeling.

Outcome: Predicted probability of a favourable outcome

In [3]:
# importing the libraries

import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams['figure.figsize']=[15, 6]

pd.set_option('display.max_columns', 50) # Displaying all the data

In [121]:
# Import the dataset
train = pd.read_csv("~/Downloads/Healthcare/Train.csv")
test = pd.read_csv("~/Downloads/Healthcare/test_l0Auv8Q.csv")

fhc = pd.read_csv("~/Downloads/Healthcare/First_Health_Camp_Attended.csv")
shc = pd.read_csv("~/Downloads/Healthcare/Second_Health_Camp_Attended.csv")
thc = pd.read_csv("~/Downloads/Healthcare/Third_Health_Camp_Attended.csv")
hcdetail = pd.read_csv("~/Downloads/Healthcare/Health_Camp_Detail.csv")
pp = pd.read_csv("~/Downloads/Healthcare/Patient_Profile.csv")

In [122]:
# Combine the Train and Test
combined = pd.concat([train, test], ignore_index = True)

In [123]:
combined.shape

(110527, 8)

In [124]:
thc.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Number_of_stall_visited,Last_Stall_Visited_Number
0,517875,6527,3,1
1,504692,6578,1,1


In [125]:
combined.head(2)

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5
0,489652,6578,10-Sep-05,4,0,0,0,2
1,507246,6578,18-Aug-05,45,5,0,0,7


In [126]:
# merge fhc with combined
combined = pd.merge(left = combined, right = fhc, 
                    on = ["Patient_ID", "Health_Camp_ID"], 
                    how = "left")

In [127]:
# merge shc with combined
combined = pd.merge(left = combined, right = shc, 
                    on = ["Patient_ID", "Health_Camp_ID"], 
                    how = "left")

In [128]:
# merge thc with combined
combined = pd.merge(left = combined, right = thc, 
                    on = ["Patient_ID", "Health_Camp_ID"], 
                    how = "left")

In [129]:
# merge healthcamp detail with combined
combined = pd.merge(left = combined, right = hcdetail, 
                    on = "Health_Camp_ID", 
                    how = "left")

In [130]:
# merge Patient Profile detail with combined
combined = pd.merge(left = combined, right = pp, 
                    on = "Patient_ID", 
                    how = "left")
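
The left merges above rely on each key appearing at most once on the right-hand side, otherwise registrations get duplicated. A minimal sketch of how that could be checked is shown below on toy frames; `regs` and `attended` are stand-ins, and whether the real files satisfy the `many_to_one` assumption is not verified here.

```python
import pandas as pd

# Toy stand-ins for the registration table and one attendance table.
regs = pd.DataFrame({"Patient_ID": [1, 2, 3], "Health_Camp_ID": [10, 10, 11]})
attended = pd.DataFrame({"Patient_ID": [1, 3], "Health_Camp_ID": [10, 11],
                         "Health_Score": [0.4, 0.8]})

# validate="many_to_one" raises MergeError if a key pair is duplicated on
# the right side; indicator=True records whether each row found a match.
merged = pd.merge(regs, attended, on=["Patient_ID", "Health_Camp_ID"],
                  how="left", validate="many_to_one", indicator=True)

# A left merge against unique right-hand keys keeps the row count unchanged.
assert len(merged) == len(regs)
print(merged["_merge"].value_counts())
```

The `_merge` column would need to be dropped before the next merge that also uses `indicator=True`.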

### Feature Engineering

In [134]:
# code to convert string into date

combined["Registration_Date"]=pd.to_datetime(combined['Registration_Date'], 
                                             dayfirst = True)

combined["Registration_Month"]= combined.Registration_Date.dt.month
combined["Registration_Year"]= combined.Registration_Date.dt.year
combined["Registration_Day"]= combined.Registration_Date.dt.day

In [135]:
combined.Registration_Date.isnull().sum()

334
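
The 334 missing dates are consistent with the hardware failure mentioned in the problem statement. A minimal sketch of parsing with an explicit format and keeping a flag for the lost values follows; the column `Registration_Missing` is a hypothetical extra feature, not something the notebook creates.

```python
import pandas as pd

# Toy registration dates in the same DD-Mon-YY style; None marks a lost value.
reg = pd.DataFrame({"Registration_Date": ["10-Sep-05", None, "18-Aug-05"]})

# An explicit format avoids day/month guessing; missing entries become NaT.
reg["Registration_Date"] = pd.to_datetime(reg["Registration_Date"],
                                          format="%d-%b-%y", errors="coerce")

# Hypothetical extra feature: flag rows whose registration date was lost.
reg["Registration_Missing"] = reg["Registration_Date"].isna().astype(int)
print(reg)
```

Keeping the flag lets a model treat "date unknown" differently from the imputed value used later in the notebook.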

In [34]:
# Camp Start and End Date

combined["Camp_Start_Month"]= pd.DatetimeIndex(combined.Camp_Start_Date).month
combined["Camp_Start_Year"]= pd.DatetimeIndex(combined.Camp_Start_Date).year
combined["Camp_Start_Day"]= pd.DatetimeIndex(combined.Camp_Start_Date).day

combined["Camp_End_Month"]= pd.DatetimeIndex(combined.Camp_End_Date).month
combined["Camp_End_Year"]= pd.DatetimeIndex(combined.Camp_End_Date).year
combined["Camp_End_Day"]= pd.DatetimeIndex(combined.Camp_End_Date).day

In [36]:
# Creating Online_Presence using Patient Profiling Data

# Parentheses let the sum continue across lines
combined["Online_Presence"] = (combined["Online_Follower"] +
                               combined["LinkedIn_Shared"] +
                               combined["Twitter_Shared"] +
                               combined["Facebook_Shared"])

In [37]:
# Dropping the columns above
combined.drop(["LinkedIn_Shared", "Twitter_Shared", "Online_Follower",
              "Facebook_Shared"], axis = 1, inplace = True)

In [39]:
del pp

In [40]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Registration_Month,Registration_Year,Registration_Day,Camp_Start_Month,Camp_Start_Year,Camp_Start_Day,Camp_End_Month,Camp_End_Year,Camp_End_Day,Online_Presence
0,489652,6578,2005-09-10,4,0,0,0,2,,,,,2.0,1.0,16-Aug-05,14-Oct-05,Third,G,2,,,,06-Dec-04,,,9.0,2005.0,10.0,8,2005,16,10,2005,14,0
1,507246,6578,2005-08-18,45,5,0,0,7,,,,,,,16-Aug-05,14-Oct-05,Third,G,2,1.0,75.0,40.0,08-Sep-04,C,Others,8.0,2005.0,18.0,8,2005,16,10,2005,14,0
2,523729,6534,2006-04-29,0,0,0,0,0,,,,0.402054,,,17-Oct-05,07-Nov-07,Second,A,2,,,,22-Jun-04,,,4.0,2006.0,29.0,10,2005,17,11,2007,7,0
3,524931,6535,2004-02-07,0,0,0,0,0,,,,,,,01-Feb-04,18-Feb-04,First,E,2,,,,07-Feb-04,I,,2.0,2004.0,7.0,2,2004,1,2,2004,18,0
4,521364,6529,2006-02-28,15,1,0,0,7,,,,0.845597,,,30-Mar-06,03-Apr-06,Second,A,2,1.0,70.0,40.0,04-Jul-03,I,Technology,2.0,2006.0,28.0,3,2006,30,4,2006,3,1


In [51]:
# Camp Duration from CampEndDate - Camp Start Date

combined["Camp_Start_Date"]=  pd.to_datetime(combined.Camp_Start_Date, 
                                             dayfirst = True)

combined["Camp_End_Date"]=  pd.to_datetime(combined.Camp_End_Date, 
                                           dayfirst = True)

combined["Camp_Duration"] =(combined["Camp_End_Date"] - 
                            combined["Camp_Start_Date"]).dt.days

In [55]:
# Converting First Interaction Date into datetime format

combined["First_Interaction"]=pd.to_datetime(combined.First_Interaction, 
                                             dayfirst = True)

# Days_interaction_diff = registration date - first interaction
combined["Int_Days_Diff"] = (combined["Registration_Date"] - 
                             combined["First_Interaction"]).dt.days 

In [57]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Registration_Month,Registration_Year,Registration_Day,Camp_Start_Month,Camp_Start_Year,Camp_Start_Day,Camp_End_Month,Camp_End_Year,Camp_End_Day,Online_Presence,Camp_Duration,Int_Days_Diff
0,489652,6578,2005-09-10,4,0,0,0,2,,,,,2.0,1.0,2005-08-16,2005-10-14,Third,G,2,,,,2004-12-06,,,9.0,2005.0,10.0,8,2005,16,10,2005,14,0,59,278.0
1,507246,6578,2005-08-18,45,5,0,0,7,,,,,,,2005-08-16,2005-10-14,Third,G,2,1.0,75.0,40.0,2004-09-08,C,Others,8.0,2005.0,18.0,8,2005,16,10,2005,14,0,59,344.0
2,523729,6534,2006-04-29,0,0,0,0,0,,,,0.402054,,,2005-10-17,2007-11-07,Second,A,2,,,,2004-06-22,,,4.0,2006.0,29.0,10,2005,17,11,2007,7,0,751,676.0
3,524931,6535,2004-02-07,0,0,0,0,0,,,,,,,2004-02-01,2004-02-18,First,E,2,,,,2004-02-07,I,,2.0,2004.0,7.0,2,2004,1,2,2004,18,0,17,0.0
4,521364,6529,2006-02-28,15,1,0,0,7,,,,0.845597,,,2006-03-30,2006-04-03,Second,A,2,1.0,70.0,40.0,2003-07-04,I,Technology,2.0,2006.0,28.0,3,2006,30,4,2006,3,1,4,970.0


In [61]:
# Donation vs Healthcamp
combined.groupby("Health_Camp_ID")["Donation"].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Health_Camp_ID,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
6523,0.0,,,,,,,
6524,54.0,25.555556,10.580628,10.0,20.0,20.0,30.0,60.0
6525,0.0,,,,,,,
6526,140.0,31.214286,24.034384,10.0,10.0,20.0,40.0,150.0
6527,0.0,,,,,,,
...,...,...,...,...,...,...,...,...
6583,0.0,,,,,,,
6584,0.0,,,,,,,
6585,203.0,40.985222,26.921499,10.0,20.0,30.0,50.0,140.0
6586,600.0,30.683333,19.527808,10.0,20.0,30.0,40.0,140.0


In [60]:
combined.Health_Camp_ID.nunique()

65

In [69]:
# Unique Patient Per HC

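# Number of distinct camps each patient registered for
combined["HC_per_patient"] = combined.groupby("Patient_ID")[
    "Health_Camp_ID"].transform('nunique')
# Number of distinct patients registered for each camp
combined["Patient_Per_HC"] = combined.groupby("Health_Camp_ID")[
    "Patient_ID"].transform('nunique')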
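
To make the `transform('nunique')` step concrete, here is a small sketch on a toy frame (the values are invented). It shows that the result has one entry per original row, which is why it can be assigned straight back as a column.

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient_ID":     [1, 1, 1, 2, 2],
    "Health_Camp_ID": [10, 11, 11, 10, 10],
})

# transform("nunique") returns one value per original row, so the result
# can be assigned directly as a new column without a merge.
toy["HC_per_patient"] = (toy.groupby("Patient_ID")["Health_Camp_ID"]
                            .transform("nunique"))
print(toy["HC_per_patient"].tolist())  # [2, 2, 2, 1, 1]
```

Note that in the notebook these counts are computed on the combined train and test registrations, so they use information from both sets.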

In [70]:
# Lets Check the Desc Stats
combined.HC_per_patient.describe()

count    110527.000000
mean          7.501036
std           6.888289
min           1.000000
25%           2.000000
50%           5.000000
75%          11.000000
max          40.000000
Name: HC_per_patient, dtype: float64

In [71]:
combined.Patient_Per_HC.describe()

count    110527.000000
mean       2904.425480
std        1330.060994
min          44.000000
25%        1993.000000
50%        2763.000000
75%        3809.000000
max        6543.000000
Name: Patient_Per_HC, dtype: float64

In [75]:
# Unique patients per camp start year
combined.groupby("Camp_Start_Year")["Patient_ID"].transform('nunique')

0         22359
1         22359
2         22359
3         10902
4         16175
          ...  
110522    16175
110523    16175
110524    16175
110525     2579
110526    16175
Name: Patient_ID, Length: 110527, dtype: int64

In [80]:
combined["Patients_Per_Year"] = combined.groupby("Camp_Start_Year")[
    "Patient_ID"].transform('nunique')

combined.groupby("Camp_Start_Year")["Patient_ID"].nunique()

Camp_Start_Year
2003     3081
2004    10902
2005    22359
2006    16175
2007     2579
Name: Patient_ID, dtype: int64

In [81]:
combined["Patients_Per_Month"] = combined.groupby("Camp_Start_Month")[
    "Patient_ID"].transform('nunique')

combined.groupby("Camp_Start_Month")["Patient_ID"].nunique()

Camp_Start_Month
1      9076
2      8717
3      3823
4      6164
5      1903
6      8208
7      3726
8     10445
9     11210
10     8242
11     8757
12     6774
Name: Patient_ID, dtype: int64

In [84]:
combined["Patients_Per_End_Year"] = combined.groupby("Camp_End_Year")[
    "Patient_ID"].transform('nunique')

combined["Patients_Per_End_Month"] = combined.groupby("Camp_End_Month")[
    "Patient_ID"].transform('nunique')

In [91]:
combined.groupby("Camp_Start_Year")["Patient_ID"].nunique()

Camp_Start_Year
2003     3081
2004    10902
2005    22359
2006    16175
2007     2579
Name: Patient_ID, dtype: int64

In [92]:
combined.groupby("Camp_End_Year")["Patient_ID"].nunique()

Camp_End_Year
2003     1753
2004     6319
2005    17330
2006    14842
2007    14379
Name: Patient_ID, dtype: int64

In [93]:
combined.head()

Unnamed: 0,Patient_ID,Health_Camp_ID,Registration_Date,Var1,Var2,Var3,Var4,Var5,Donation,Health_Score,Unnamed: 4,Health Score,Number_of_stall_visited,Last_Stall_Visited_Number,Camp_Start_Date,Camp_End_Date,Category1,Category2,Category3,Income,Education_Score,Age,First_Interaction,City_Type,Employer_Category,Registration_Month,Registration_Year,Registration_Day,Camp_Start_Month,Camp_Start_Year,Camp_Start_Day,Camp_End_Month,Camp_End_Year,Camp_End_Day,Online_Presence,Camp_Duration,Int_Days_Diff,HC_per_patient,Patient_Per_HC,Patients_Per_Year,Patients_Per_Month,Patients_Per_End_Year,Patients_Per_End_Month
0,489652,6578,2005-09-10,4,0,0,0,2,,,,,2.0,1.0,2005-08-16,2005-10-14,Third,G,2,,,,2004-12-06,,,9.0,2005.0,10.0,8,2005,16,10,2005,14,0,59,278.0,11,2837,22359,10445,17330,6109
1,507246,6578,2005-08-18,45,5,0,0,7,,,,,,,2005-08-16,2005-10-14,Third,G,2,1.0,75.0,40.0,2004-09-08,C,Others,8.0,2005.0,18.0,8,2005,16,10,2005,14,0,59,344.0,26,2837,22359,10445,17330,6109
2,523729,6534,2006-04-29,0,0,0,0,0,,,,0.402054,,,2005-10-17,2007-11-07,Second,A,2,,,,2004-06-22,,,4.0,2006.0,29.0,10,2005,17,11,2007,7,0,751,676.0,7,3597,22359,8242,14379,14643
3,524931,6535,2004-02-07,0,0,0,0,0,,,,,,,2004-02-01,2004-02-18,First,E,2,,,,2004-02-07,I,,2.0,2004.0,7.0,2,2004,1,2,2004,18,0,17,0.0,6,1882,10902,8717,6319,11010
4,521364,6529,2006-02-28,15,1,0,0,7,,,,0.845597,,,2006-03-30,2006-04-03,Second,A,2,1.0,70.0,40.0,2003-07-04,I,Technology,2.0,2006.0,28.0,3,2006,30,4,2006,3,1,4,970.0,23,3823,16175,3823,14842,6519


### Target Variable

In [95]:
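def outcome(a, b, c, d):
    # 1 if any outcome column is positive; comparisons with NaN are False,
    # so registrations with no attendance record get 0
    if (a > 0) or (b > 0) or (c > 0) or (d > 0):
        return 1
    else:
        return 0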

In [99]:
# generate the target variable...

combined["Target"] = combined.apply(lambda x:outcome(x["Health_Score"], 
                                x["Health Score"], 
                                x["Number_of_stall_visited"], 
                                x["Last_Stall_Visited_Number"]), axis = 1)

In [100]:
combined.Target.value_counts()

0    89993
1    20534
Name: Target, dtype: int64
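
The row-wise `apply` works, but the same rule can be written as a single vectorized expression. The sketch below uses a toy frame with the four outcome columns and is an alternative formulation, not the code the notebook ran.

```python
import numpy as np
import pandas as pd

# Toy frame with the four outcome columns; NaN means no attendance record.
toy = pd.DataFrame({
    "Health_Score": [0.4, np.nan, np.nan, np.nan],
    "Health Score": [np.nan, 0.8, np.nan, np.nan],
    "Number_of_stall_visited": [np.nan, np.nan, 2.0, np.nan],
    "Last_Stall_Visited_Number": [np.nan, np.nan, 1.0, np.nan],
})

# Comparisons with NaN are False, so rows without any record get 0.
toy["Target"] = toy.gt(0).any(axis=1).astype(int)
print(toy["Target"].tolist())  # [1, 1, 1, 0]
```

On roughly 110,000 rows the vectorized form avoids one Python function call per row.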

In [103]:
print(combined.columns)

Index(['Patient_ID', 'Health_Camp_ID', 'Registration_Date', 'Var1', 'Var2',
       'Var3', 'Var4', 'Var5', 'Donation', 'Health_Score', 'Unnamed: 4',
       'Health Score', 'Number_of_stall_visited', 'Last_Stall_Visited_Number',
       'Camp_Start_Date', 'Camp_End_Date', 'Category1', 'Category2',
       'Category3', 'Income', 'Education_Score', 'Age', 'First_Interaction',
       'City_Type', 'Employer_Category', 'Registration_Month',
       'Registration_Year', 'Registration_Day', 'Camp_Start_Month',
       'Camp_Start_Year', 'Camp_Start_Day', 'Camp_End_Month', 'Camp_End_Year',
       'Camp_End_Day', 'Online_Presence', 'Camp_Duration', 'Int_Days_Diff',
       'HC_per_patient', 'Patient_Per_HC', 'Patients_Per_Year',
       'Patients_Per_Month', 'Patients_Per_End_Year', 'Patients_Per_End_Month',
       'Target'],
      dtype='object')


In [106]:
# Drop the Unnecessary Variables

new = combined.drop(['Patient_ID', 'Health_Camp_ID', 'Registration_Date',
                     'Donation', 'Health_Score', 'Unnamed: 4',
                     'Health Score', 'Number_of_stall_visited',
                     'Last_Stall_Visited_Number',
                     'Camp_Start_Date', 'Camp_End_Date', 'Income',
                     'Education_Score',
                     'First_Interaction', 'City_Type',
                     'Age', 'Employer_Category'], axis = 1)

In [105]:
del combined, fhc, shc, hcdetail, thc

In [108]:
new.shape

(110527, 27)

In [109]:
new.head()

Unnamed: 0,Var1,Var2,Var3,Var4,Var5,Category1,Category2,Category3,Registration_Month,Registration_Year,Registration_Day,Camp_Start_Month,Camp_Start_Year,Camp_Start_Day,Camp_End_Month,Camp_End_Year,Camp_End_Day,Online_Presence,Camp_Duration,Int_Days_Diff,HC_per_patient,Patient_Per_HC,Patients_Per_Year,Patients_Per_Month,Patients_Per_End_Year,Patients_Per_End_Month,Target
0,4,0,0,0,2,Third,G,2,9.0,2005.0,10.0,8,2005,16,10,2005,14,0,59,278.0,11,2837,22359,10445,17330,6109,1
1,45,5,0,0,7,Third,G,2,8.0,2005.0,18.0,8,2005,16,10,2005,14,0,59,344.0,26,2837,22359,10445,17330,6109,0
2,0,0,0,0,0,Second,A,2,4.0,2006.0,29.0,10,2005,17,11,2007,7,0,751,676.0,7,3597,22359,8242,14379,14643,1
3,0,0,0,0,0,First,E,2,2.0,2004.0,7.0,2,2004,1,2,2004,18,0,17,0.0,6,1882,10902,8717,6319,11010,0
4,15,1,0,0,7,Second,A,2,2.0,2006.0,28.0,3,2006,30,4,2006,3,1,4,970.0,23,3823,16175,3823,14842,6519,1


In [113]:
new.Category2.value_counts()

E    28488
F    28115
G    21327
A    19041
D     8742
B     2426
C     2388
Name: Category2, dtype: int64

In [112]:
mapped = {"First":1, "Second":2, "Third":3}
new["Category1"] = new.Category1.map(mapped)

In [116]:
# Category 2
new["Categor2"] = pd.factorize(new.Category2, sort = True)[0]

In [117]:
new.drop(["Category2"], axis = 1, inplace = True)
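
As a quick illustration of what `pd.factorize(..., sort = True)` returns, the sketch below encodes a toy series of labels (the values are invented).

```python
import pandas as pd

cats = pd.Series(["G", "A", "E", "A"])

# sort=True assigns codes following the sorted order of the unique labels.
codes, uniques = pd.factorize(cats, sort=True)
print(codes.tolist())    # [2, 0, 1, 0]
print(uniques.tolist())  # ['A', 'E', 'G']
```

The integer codes imply an ordering between categories. Tree-based models usually cope with that, while a linear model such as logistic regression may be better served by one-hot encoding (for example `pd.get_dummies`).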

In [118]:
new.shape

(110527, 27)

In [119]:
new.isnull().sum()

Var1                        0
Var2                        0
Var3                        0
Var4                        0
Var5                        0
Category1                   0
Category3                   0
Registration_Month        334
Registration_Year         334
Registration_Day          334
Camp_Start_Month            0
Camp_Start_Year             0
Camp_Start_Day              0
Camp_End_Month              0
Camp_End_Year               0
Camp_End_Day                0
Online_Presence             0
Camp_Duration               0
Int_Days_Diff             334
HC_per_patient              0
Patient_Per_HC              0
Patients_Per_Year           0
Patients_Per_Month          0
Patients_Per_End_Year       0
Patients_Per_End_Month      0
Target                      0
Categor2                    0
dtype: int64

In [155]:
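# Assign the result back instead of calling fillna(inplace=True) on a
# column selection, which newer pandas versions warn about
new["Registration_Year"] = new["Registration_Year"].fillna(
    new.Registration_Year.mode()[0])

In [154]:
new["Registration_Month"] = new["Registration_Month"].fillna(
    new.Registration_Month.mode()[0])

In [152]:
new["Registration_Day"] = new["Registration_Day"].fillna(
    new.Registration_Day.mode()[0])

In [147]:
new["Int_Days_Diff"] = new["Int_Days_Diff"].fillna(
    new.Int_Days_Diff.median())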

In [156]:
new.isnull().sum()

Var1                      0
Var2                      0
Var3                      0
Var4                      0
Var5                      0
Category1                 0
Category3                 0
Registration_Month        0
Registration_Year         0
Registration_Day          0
Camp_Start_Month          0
Camp_Start_Year           0
Camp_Start_Day            0
Camp_End_Month            0
Camp_End_Year             0
Camp_End_Day              0
Online_Presence           0
Camp_Duration             0
Int_Days_Diff             0
HC_per_patient            0
Patient_Per_HC            0
Patients_Per_Year         0
Patients_Per_Month        0
Patients_Per_End_Year     0
Patients_Per_End_Month    0
Target                    0
Categor2                  0
dtype: int64

### Split the Data back in Train and Test and delete target in Test

In [157]:
train.shape, test.shape

((75278, 8), (35249, 8))

In [227]:
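# .copy() makes independent frames, so the later in-place drop on newtest
# does not operate on a view of new
newtrain = new.loc[0:train.shape[0]-1, :].copy()
newtest = new.loc[train.shape[0]:, :].copy()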

In [228]:
newtrain.shape, newtest.shape

((75278, 27), (35249, 27))

In [229]:
# Drop Target from test
newtest.drop("Target", axis =1, inplace = True)

In [230]:
newtrain.shape, newtest.shape

((75278, 27), (35249, 26))

### Model Building

1. Logistic Regression Model
2. Random Forest Model
3. Gradient Boosting Model
4. Extreme Gradient Boosting (XGBoost) Model
5. CatBoost Model

In [231]:
# Data in X and y
X = newtrain.drop("Target", axis = 1)
y = newtrain.Target

In [164]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

In [184]:
# Models
lr = LogisticRegression()
rf = RandomForestClassifier()
gbm = GradientBoostingClassifier()
xgb = XGBClassifier(eval_metric = "auc")
cboost = CatBoostClassifier(eval_metric = "AUC")

In [187]:
# Lets Build a Voting Classifier Model
from sklearn.ensemble import VotingClassifier
vc=VotingClassifier(estimators=[('lr', lr), ('rf', rf), 
                                ('gbm', gbm), ('xgb', xgb), 
                                ('cb', cboost)], voting = "soft")
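
With `voting = "soft"` and no weights, the ensemble probability is the plain average of the members' predicted probabilities. The sketch below checks this on synthetic data with two small scikit-learn models; the data and the reduced model list are assumptions made to keep the example light.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic data stands in for the camp features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     random_state=0)

vc_demo = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=20,
                                              random_state=0))],
    voting="soft",
)
vc_demo.fit(X_demo, y_demo)

# Unweighted soft voting is the mean of the fitted members' probabilities.
manual = np.mean([est.predict_proba(X_demo) for est in vc_demo.estimators_],
                 axis=0)
print(np.allclose(manual, vc_demo.predict_proba(X_demo)))  # True
```

Because the average treats every member equally, a weak member pulls the ensemble toward its own predictions unless `weights` is set.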

In [188]:
pred = vc.fit(X, y).predict_proba(newtest)

Learning rate set to 0.065204
0:	total: 8.02ms	remaining: 8.01s
1:	total: 15.4ms	remaining: 7.67s
2:	total: 23.8ms	remaining: 7.89s
3:	total: 31.3ms	remaining: 7.79s
4:	total: 38.5ms	remaining: 7.66s
...

In [190]:
pred[:, 1]

array([0.53714874, 0.407947  , 0.19134794, ..., 0.38144899, 0.26111715,
       0.44184495])

In [191]:
# Submission DataFrame
submission = pd.DataFrame({"Patient_ID": test.Patient_ID,
                          "Health_Camp_ID":test.Health_Camp_ID, 
                          "Outcome":pred[:, 1]})

In [194]:
submission.to_csv("VotingModel.csv", index = False) #  0.6385


In [220]:
X.columns

Index(['Var1', 'Var2', 'Var3', 'Var4', 'Var5', 'Category1', 'Category3',
       'Registration_Month', 'Registration_Year', 'Registration_Day',
       'Online_Presence', 'Camp_Duration', 'Int_Days_Diff', 'HC_per_patient',
       'Patient_Per_HC', 'Patients_Per_Year', 'Patients_Per_Month',
       'Patients_Per_End_Year', 'Patients_Per_End_Month', 'Categor2'],
      dtype='object')

In [217]:
newtest = newtest.loc[:, X.columns]

In [222]:
X.shape, newtest.shape

((75278, 20), (35249, 20))

In [218]:
# Lightgbm
from lightgbm import LGBMClassifier
lgbm = LGBMClassifier(n_estimators = 500, 
                      max_depth = 10, random_state = 42, 
                      learning_rate=0.01, 
                     scale_pos_weight = 3)

In [223]:
pred_lgbm = lgbm.fit(X, y).predict_proba(newtest)

In [224]:
# Submission DataFrame
submission = pd.DataFrame({"Patient_ID": test.Patient_ID,
                          "Health_Camp_ID":test.Health_Camp_ID, 
                          "Outcome":pred_lgbm[:, 1]})

submission.to_csv("LGBM_Model.csv", index = False) #  0.6316

### RFECV and Cross Validation

In [235]:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()

from sklearn.feature_selection import RFECV

rfe = RFECV(estimator = dtree, step = 1, 
           min_features_to_select= 5, cv =  5, verbose = 5)

In [236]:
rfe.fit(X, y)
features = list(rfe.get_feature_names_out())
print(features)

Fitting estimator with 26 features.
Fitting estimator with 25 features.
Fitting estimator with 24 features.
Fitting estimator with 23 features.
Fitting estimator with 22 features.
Fitting estimator with 21 features.
Fitting estimator with 20 features.
Fitting estimator with 19 features.
Fitting estimator with 18 features.
Fitting estimator with 17 features.
Fitting estimator with 16 features.
Fitting estimator with 15 features.
Fitting estimator with 14 features.
Fitting estimator with 13 features.
Fitting estimator with 12 features.
Fitting estimator with 11 features.
Fitting estimator with 10 features.
Fitting estimator with 9 features.
Fitting estimator with 8 features.
Fitting estimator with 7 features.
Fitting estimator with 6 features.
...

In [237]:
features

['Var1',
 'Category1',
 'Registration_Month',
 'Registration_Day',
 'Online_Presence',
 'Camp_Duration',
 'Int_Days_Diff',
 'HC_per_patient',
 'Patient_Per_HC']
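
For reference, the sketch below runs `RFECV` on synthetic data and inspects the fitted selector. It passes `scoring="roc_auc"` explicitly, which is an assumption for illustration; the cell above leaves `scoring` at its default.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with only a few informative columns.
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=3, random_state=0)

selector = RFECV(estimator=DecisionTreeClassifier(random_state=0), step=1,
                 min_features_to_select=2, cv=3, scoring="roc_auc")
selector.fit(X_demo, y_demo)

# support_ is a boolean mask over the input columns;
# n_features_ is the number of columns that were kept.
print(selector.n_features_, selector.support_)
```

The selected subset depends on the estimator used for ranking, so features chosen with a single decision tree are not guaranteed to suit a boosted model.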

In [239]:
# RFE Features
rfe_input = X.loc[:, features]
rfe_test = newtest.loc[:, features]

In [240]:
rfe_input.shape, rfe_test.shape

((75278, 9), (35249, 9))

In [241]:
lgbm = LGBMClassifier(n_estimators = 500, 
                      max_depth = 10, random_state = 42, 
                      learning_rate=0.01, 
                     scale_pos_weight = 3)

pred_lgbm = lgbm.fit(rfe_input, y).predict_proba(rfe_test)

# Submission DataFrame
submission = pd.DataFrame({"Patient_ID": test.Patient_ID,
                          "Health_Camp_ID":test.Health_Camp_ID, 
                          "Outcome":pred_lgbm[:, 1]})

submission.to_csv("RFE_Model.csv", index = False) #  0.6316


### Cross Validation for LGBM

In [249]:
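from sklearn.model_selection import KFold
kfold = KFold(n_splits=5, shuffle = True, random_state = 42)
lgbm = LGBMClassifier(n_estimators = 500, 
                      max_depth = 10, random_state = 42, 
                      learning_rate=0.01, 
                      scale_pos_weight = 3)

pred_df = pd.DataFrame()

# Iterate over the splitter once so that each round trains on a different
# set of 4 folds; calling next(kfold.split(X)) inside the loop restarts the
# generator and only ever returns its first split
for i, (train_idx, _) in enumerate(kfold.split(X)):
    xtrain = X.iloc[train_idx] # training rows for this fold
    ytrain = y.iloc[train_idx] # matching target values
    lgbm.fit(xtrain, ytrain)
    pred_df[i] = lgbm.predict_proba(newtest)[:, 1]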

In [262]:
pred_df

Unnamed: 0,0,1,2,3,4
0,0.834854,0.868591,0.862783,0.857990,0.860368
1,0.658681,0.625407,0.669577,0.590478,0.648944
2,0.387994,0.382216,0.347617,0.331157,0.377994
3,0.571716,0.585202,0.617777,0.637673,0.586138
4,0.131144,0.131048,0.124423,0.163304,0.114202
...,...,...,...,...,...
35244,0.734100,0.751343,0.733299,0.658999,0.697985
35245,0.434850,0.451803,0.550005,0.412618,0.440168
35246,0.735216,0.687731,0.664526,0.626814,0.672619
35247,0.343776,0.386784,0.380612,0.423357,0.368626


In [254]:
median_prob = pred_df.median(axis = 1)

In [255]:
# Submission DataFrame
submission = pd.DataFrame({"Patient_ID": test.Patient_ID,
                          "Health_Camp_ID":test.Health_Camp_ID, 
                          "Outcome":median_prob})

submission.to_csv("Median_Prob.csv", index = False) #  0.6316

#### Summary

* Feature engineering has a drastic impact on model performance.
* LGBM appeared to be the best model for this competition, placing #2 on the Public Leaderboard.
* Features selected using RFECV did not work well, so we discarded that model.

* Parameter tuning of LGBM can take model performance to a whole new level.

* The cross-validated LGBM model did a fantastic job, earning us a couple of brownie points on the Public Leaderboard.

* We learnt that the target variable can be hidden in the dataset and has to be extracted by studying the problem statement meticulously.
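
The parameter-tuning point above can be sketched with a randomized search. To stay dependency-light, this example swaps LGBM for scikit-learn's `GradientBoostingClassifier` and uses synthetic data; the search ranges are arbitrary illustrations, not recommended values.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=8,
                                     random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(20, 60),
        "max_depth": randint(2, 5),
        "learning_rate": uniform(0.01, 0.2),  # samples from [0.01, 0.21]
    },
    n_iter=4, cv=3, scoring="roc_auc", random_state=0,
)
search.fit(X_demo, y_demo)
print(search.best_params_, search.best_score_)
```

The same pattern should carry over to `LGBMClassifier`, since it follows the scikit-learn estimator interface, with parameter names adjusted to that library.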