OBJECTIVE: CLASSIFY PATIENTS AS SURVIVED OR NOT SURVIVED ONE YEAR AFTER TREATMENT

Data Description:
    ID_Patient_Care_Situation: Care situation of a patient during treatment
    Diagnosed_Condition: The diagnosed condition of the patient
    Patient_ID: Patient identifier number
    Treated_with_drugs: Class of drugs used during treatment
    Survived_1_year: If the patient survived after one year (0 means did not survive; 1 means survived)
    Patient_Age: Age of the patient
    Patient_Body_Mass_Index: A calculated value based on the patient’s weight, height, etc.
    Patient_Smoker: If the patient was a smoker or not
    Patient_Rural_Urban: If the patient stayed in Rural or Urban part of the country
    Previous_Condition: Condition of the patient before the start of the treatment. This variable is split into 8 columns: A, B, C, D, E, F, Z and Number_of_prev_cond. A, B, C, D, E, F and Z are the previous conditions of the patient. If the entry in column A is 1, the patient previously had condition A; if the patient did not have that condition, the entry is 0, and the same holds for the other conditions. If a patient had previous conditions A and C, columns A and C both contain 1, while columns B, D, E, F and Z contain 0. The column Number_of_prev_cond then contains 2, i.e. 1 + 0 + 1 + 0 + 0 + 0 + 0 = 2.
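
The relation between the indicator columns and Number_of_prev_cond can be sketched on a few toy rows. The values below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows (illustrative values only)
prev = pd.DataFrame(
    {
        "A": [1, 0, 1],
        "B": [0, 0, 1],
        "C": [1, 0, 0],
        "D": [0, 0, 0],
        "E": [0, 0, 0],
        "F": [0, 0, 1],
        "Z": [0, 1, 0],
    }
)
condition_cols = ["A", "B", "C", "D", "E", "F", "Z"]

# Number_of_prev_cond is the row-wise sum of the seven indicator columns
prev["Number_of_prev_cond"] = prev[condition_cols].sum(axis=1)
print(prev)
```

The first toy row has conditions A and C, so its count is 2, as in the description above.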

In [1]:
# IMPORT REQUIRED PACKAGES

import pandas as pd 
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
import time

from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler

from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier
from xgboost import XGBClassifier 

import matplotlib.pyplot as plt
import seaborn as sn

In [2]:
# IMPORT DATASET
pharma_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Training_set_begs.csv')
pharma_data.head(2)

Unnamed: 0,ID_Patient_Care_Situation,Diagnosed_Condition,Patient_ID,Treated_with_drugs,Patient_Age,Patient_Body_Mass_Index,Patient_Smoker,Patient_Rural_Urban,Patient_mental_condition,A,B,C,D,E,F,Z,Number_of_prev_cond,Survived_1_year
0,22374,8,3333,DX6,56,18.479385,YES,URBAN,Stable,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0,0
1,18164,5,5740,DX2,36,22.945566,YES,RURAL,Stable,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1


In [None]:
pharma_data.head()

In [None]:
pharma_data.shape

In [None]:
#CHECK FOR NULL VALUES
pharma_data.isnull().sum()

In [None]:
#EXPLORE DATASET TO FIND OUT THE SPREAD
pharma_data.nunique()

In [None]:
#MORE EXPLORATION
pharma_data.describe()


In [None]:
#NON-NUMERIC VALUES
pharma_data.describe(exclude=np.number)

In [None]:
#FIND THE RELATIONSHIP BETWEEN TARGET AND PREDICTOR VARIABLES (NUMERIC COLUMNS ONLY)
pharma_data.corr(numeric_only=True)

ANALYSIS

In [None]:
from sklearn.preprocessing import MinMaxScaler
mm=MinMaxScaler()
from sklearn.preprocessing import StandardScaler
sc=StandardScaler()
rs = RobustScaler()
from sklearn.preprocessing import LabelEncoder

In [None]:
data=pharma_data.copy()   # work on a copy so that pharma_data keeps the raw values

In [None]:
data.shape

In [None]:
#DEFINE THE DIFFERENT SET OF COLUMNS WE WILL USE
cols= data.columns
number_cols = data.select_dtypes(include=np.number).columns
cat_cols= list(set(cols) - set(number_cols)) 

In [None]:
#FILL NULL VALUES WHERE NECESSARY
#FIRST DROP EVERY COLUMN WITH 20% OR MORE MISSING VALUES
data = data.loc[:, data.isnull().mean() < 0.2].copy()
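
A small sketch of the missing-value threshold on a toy frame may help. The column names here are hypothetical; only the `isnull().mean() < 0.2` rule comes from the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with 10 rows; the column names are made up for illustration
toy = pd.DataFrame(
    {
        "mostly_missing": [np.nan] * 6 + [1.0, 2.0, 3.0, 4.0],    # 60% missing
        "few_missing": [np.nan] + [float(i) for i in range(9)],   # 10% missing
        "complete": [float(i) for i in range(10)],                # 0% missing
    }
)

missing_share = toy.isnull().mean()        # fraction of missing values per column
kept = toy.loc[:, missing_share < 0.2]     # keep columns below the 20% threshold
print(missing_share)
print(list(kept.columns))
```

Note that a column with exactly 20% missing values is dropped too, because the comparison is strict.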

In [None]:
#CHECK WHICH OF THE REMAINING COLUMNS STILL HAVE MISSING VALUES
#(every remaining feature has less than 20% null/NA values)
data1=data.columns[data.isnull().mean() > 0]
data1

In [None]:
# IMPUTE NUMERICAL VARIABLES WITH MEAN OF THE VARIABLE
data.loc[:,number_cols]=data.loc[:,number_cols].fillna(data.loc[:,number_cols].mean())

In [None]:
data[cat_cols]=data[cat_cols].ffill()
data.head(2)

In [None]:
data.isnull().sum()

In [None]:
# FREQUENCY-ENCODE THE CATEGORICAL VARIABLES
# (each category is replaced by its relative frequency in the column)
for j in cat_cols:    
    encoding = data.groupby(j).size()
    encoding = encoding/len(data)   #calculates frequency
    data[j] = data[j].map(encoding)
data.head(2)
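
Frequency encoding has to be repeated on any new data with the same mapping. A minimal sketch, assuming we want to learn the frequencies once on the training rows and reuse them later; the helper names `fit_frequency_maps` and `apply_frequency_maps` are hypothetical and not part of this notebook.

```python
import pandas as pd

def fit_frequency_maps(frame, columns):
    """Return {column: Series mapping each category to its relative frequency}."""
    return {col: frame[col].value_counts(normalize=True) for col in columns}

def apply_frequency_maps(frame, maps, unseen=0.0):
    """Encode a copy of `frame`; values without a learned frequency get `unseen`."""
    encoded = frame.copy()
    for col, freq in maps.items():
        encoded[col] = encoded[col].map(freq).fillna(unseen)
    return encoded

# Toy data: three of four training patients are non-smokers
train_toy = pd.DataFrame({"Patient_Smoker": ["YES", "NO", "NO", "NO"]})
test_toy = pd.DataFrame({"Patient_Smoker": ["YES", "NO", "UNKNOWN"]})

maps = fit_frequency_maps(train_toy, ["Patient_Smoker"])
encoded_test = apply_frequency_maps(test_toy, maps)
print(encoded_test)
```

Keeping the mapping in a dictionary means the same numbers can be applied to a later test file, instead of recomputing frequencies from that file.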

In [None]:
# CHECK THE CLASS COUNTS OF THE TARGET VARIABLE AND PLOT THEM
data['Survived_1_year'].value_counts()
#1    14603
#0    8494

In [None]:
plt.figure(figsize=(5, 3))
data['Survived_1_year'].value_counts().plot.bar()
plt.show()

In [None]:
#USE HEATMAP TO SEE RELATIONSHIP (correlation) BETWEEN TARGET AND PREDICTION VARIABLES
plt.figure(figsize=(9, 6))
sn.heatmap(data[number_cols].corr(), annot=True, fmt='.3f', yticklabels=number_cols, xticklabels=number_cols)
plt.title('Correlation matrix of the numeric features')
plt.tight_layout()
plt.show()

 #Analysis:
'''From the correlation matrix:
        Patient_Body_Mass_Index and D (Number_of_prev_cond) have the strongest (negative) correlation with Survived_1_year.
        The matrix suggests that the lower these two values, the higher the chance of survival.
        Other features related to survival include ['Number_of_prev_cond', 'D', 'Patient_Age'].
        'ID_Patient_Care_Situation', 'Diagnosed_Condition' and 'Patient_ID' show little or no correlation, so we remove them.
'''
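
Reading the heatmap by eye can be backed up by sorting the correlations with the target. This is a sketch on synthetic data; the column names `bmi_like` and `noise` are made up and only stand in for real features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({"bmi_like": rng.normal(size=n), "noise": rng.normal(size=n)})
# Synthetic target: a lower `bmi_like` makes a positive label more likely
toy["target"] = (toy["bmi_like"] + 0.5 * rng.normal(size=n) < 0).astype(int)

# Correlation of every feature with the target, ordered by absolute strength
corr_with_target = toy.corr()["target"].drop("target")
order = corr_with_target.abs().sort_values(ascending=False).index
ranking = corr_with_target.reindex(order)
print(ranking)
```

Sorting by absolute value keeps strongly negative features, such as the ones discussed above, at the top of the list.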

In [None]:
plt.figure(figsize=(12, 6))
sn.heatmap(data.corr(), annot=True, fmt='.3f', yticklabels=data.columns, xticklabels=data.columns)
plt.title('Correlation matrix of all features after encoding')
plt.tight_layout()
plt.show()

 #Analysis 2:
'''The correlation matrix of the entire dataset after encoding:
        Other features of importance: ['Patient_Smoker', 'Treated_with_drugs', 'Patient_Rural_Urban']
'''

In [None]:
# SELECT FEATURES THAT ARE NOT HIGHLY CORRELATED WITH EACH OTHER AND DRAW THE CORRELATION MATRIX AGAIN

col2=['Treated_with_drugs', 'Patient_Age', 'Patient_Body_Mass_Index','Patient_Smoker', 'Patient_Rural_Urban', 'D', 
'Number_of_prev_cond','Survived_1_year']
data_reduced=data[col2]

In [None]:
#Correlation matrix for the features that impact survival
plt.figure(figsize=(10, 5))
sn.heatmap(data_reduced.corr(), annot=True, fmt='.2f', yticklabels=col2, xticklabels=col2)
plt.title('Correlation matrix of the reduced features')
plt.tight_layout()
plt.show()

In [None]:
#'Patient_Smoker' HAS A POSITIVE RELATION WITH THE TARGET, SO WE PLOT IT AGAINST Survived_1_year, SPLIT BY Patient_Rural_Urban

p = sn.catplot(x="Survived_1_year", y="Patient_Smoker", hue="Patient_Rural_Urban", kind="bar", data=data, height=4, aspect=7/4)
plt.title('Patient_Smoker related Survival by Patient_Rural_Urban')
plt.show()

In [None]:
data['Treated_with_drugs'].value_counts().plot.bar()
plt.show()

More patients are treated with a single drug (DX6, DX5, DX1, DX3, DX4) than with a combination of drugs.

In [None]:
 # 'Number_of_prev_cond'  HAS A NEGATIVE RELATION WITH THE TARGET

p = sn.catplot(x="Number_of_prev_cond", y='Survived_1_year', hue="Patient_Rural_Urban", kind="bar", data=data, height=4, aspect=12/4)
plt.title('Number_of_prev_cond related to Survived_1_year by Patient_Rural_Urban')
plt.show()

In [None]:
#DISTRIBUTION OF THE VARIABLES
p = sn.catplot(data = data,height=4, aspect=16/4)
plt.title('Range (min-max) of data distribution within features')
plt.show()

In [None]:
#DISTRIBUTION OF THE VARIABLES WITH REDUCED COLUMNS
p = sn.catplot(data = data_reduced,height=4, aspect=16/4)
plt.title('Range (min-max) of data distribution within reduced features')
plt.show()

Notice from the plot above that Patient_Age has the most outliers, which was not evident with the full feature set, and that
Patient_Body_Mass_Index has a skewed distribution.

In [None]:
#PLOT OF THE DISTRIBUTION OF DATA IN FEATURES
                                    
data.hist(alpha=0.5, figsize=(9, 5), bins=10, facecolor='blue')
plt.tight_layout()
plt.show()

Notice that 'ID_Patient_Care_Situation', 'Diagnosed_Condition' and 'Patient_ID' have very irregular distributions, which may hurt the model.

In [None]:
# PLOT OF THE DISTRIBUTION OF DATA IN SELECTED FEATURES
                                    
data_reduced.hist(alpha=0.5, figsize=(8, 4), bins=10, facecolor='blue')
plt.tight_layout()
plt.show()

Most of the features are not Gaussian (they are strongly skewed), and there are outliers.

In [None]:
# BOXPLOT OF THE ACTUAL DATASET TO UNDERSTAND THE ABOVE
plt.figure(figsize=(18, 4))
data.boxplot()
plt.show()

In [None]:
#BOXPLOT OF THE DATASET WITH REDUCED COLUMNS
plt.figure(figsize=(18, 4))
data_reduced.boxplot()
plt.show()

The boxplot shows the distribution of Patient_Body_Mass_Index clearly: most of its outliers are at the low end, whereas for Patient_Age most of the outliers are at the high end.

REDUCED FEATURES DATASET

In [None]:
#DEFINE X AND Y FOR THE REDUCED FEATURES FROM NUMBER 14 (OPTIONAL)
y_reduced=data_reduced['Survived_1_year']
y_reduced2 = LabelEncoder().fit_transform(y_reduced.astype('str'))
x_reduced=data_reduced.drop('Survived_1_year',axis =1)

In [None]:
#SCALE THE DATASET ON REDUCED FEATURES (STANDARD SCALER)
Xrs=sc.fit_transform(x_reduced)
Xrs=pd.DataFrame(data=Xrs, columns=x_reduced.columns)   # keep the real column order (a set has no fixed order)

In [None]:
#SCALE THE DATASET ON REDUCED FEATURES (ROBUST SCALER)
Xrr=rs.fit_transform(x_reduced)
Xrr=pd.DataFrame(data=Xrr, columns=x_reduced.columns)   # keep the real column order (a set has no fixed order)

In [None]:
#SCALE THE DATASET ON REDUCED FEATURES (MIN-MAX SCALER)
Xrm=mm.fit_transform(x_reduced)
Xrm=pd.DataFrame(data=Xrm, columns=x_reduced.columns)   # keep the real column order (a set has no fixed order)

In [None]:
plt.figure(figsize=(18, 4))
Xrs.boxplot()
plt.show()

With the StandardScaler the features get a more even spread, which helps with this skewed dataset.

In [None]:
plt.figure(figsize=(18, 4))
Xrr.boxplot()
plt.show()

The RobustScaler seems to scale the reduced features better, although both scalers worked fine. The RobustScaler is, however, the better choice when there are outliers.

In [None]:
plt.figure(figsize=(18, 4))
Xrm.boxplot()
plt.show()

The min-max scaler did not seem to perform well on the reduced-column dataset, but we will try it on the complete feature set next.
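
To see why the three scalers behave so differently when outliers are present, here is a minimal sketch on one toy feature with a single extreme value. The numbers are illustrative and not from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One feature with a single extreme value (illustrative numbers)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = {
    "standard": StandardScaler().fit_transform(values).ravel(),  # zero mean, unit variance
    "robust": RobustScaler().fit_transform(values).ravel(),      # centred on the median, scaled by the IQR
    "minmax": MinMaxScaler().fit_transform(values).ravel(),      # squeezed into [0, 1]
}
for name, column in scaled.items():
    print(name, np.round(column, 3))
```

With min-max scaling the outlier takes the value 1 and pushes the four ordinary values close to 0, which is consistent with the squeezed boxplots above. The robust scaler keeps the ordinary values spread around 0 because the median and the interquartile range barely react to the outlier.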

COMPLETE FEATURES DATASET

In [None]:
# DEFINE X AND Y FOR THE COMPLETE FEATURES
y=data['Survived_1_year'].values
y = LabelEncoder().fit_transform(y.astype('str'))
x=data.drop('Survived_1_year', axis =1)

In [None]:
#SCALE THE DATASET (MIN-MAX SCALER)
Xm=mm.fit_transform(x)
Xm=pd.DataFrame(data=Xm, columns=x.columns)   # keep the real column order (a set has no fixed order)

PLOTTING BOX PLOTS TO COMPARE SCALERS FOR THE COMPLETE FEATURE DATASET

In [None]:
plt.figure(figsize=(18, 4))
Xm.boxplot()
plt.show()

In [None]:
#SCALE THE DATASET ON (ROBUST SCALER)
Xr = rs.fit_transform(x)
Xr=pd.DataFrame(data=Xr, columns=x.columns)   # keep the real column order (a set has no fixed order)

In [None]:
plt.figure(figsize=(18, 4))
Xr.boxplot()
plt.show()

In [None]:
#SCALE THE DATASET ON (STANDARD SCALER)
Xs= sc.fit_transform(x)
Xs=pd.DataFrame(data=Xs, columns=x.columns)   # keep the real column order (a set has no fixed order)

In [None]:
plt.figure(figsize=(18, 4))
Xs.boxplot()
plt.show()

As the boxplots for both the full and the reduced feature sets show, the min-max scaler is not a good fit for this dataset.

SPLITTING

In [None]:
# Splitting the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
#Splitting full dataset
Xf_train,Xf_test,yf_train,yf_test=train_test_split(x,y,test_size=0.3,random_state=0)#full unscaled

Xm_train,Xm_test,ym_train,ym_test=train_test_split(Xm,y,test_size=0.3,random_state=0)
Xr_train,Xr_test,yr_train,yr_test=train_test_split(Xr,y,test_size=0.3,random_state=0)
Xs_train,Xs_test,ys_train,ys_test=train_test_split(Xs,y,test_size=0.3,random_state=0)

#Splitting reduced column dataset
Xrs_train,Xrs_test,yrs_train,yrs_test=train_test_split(Xrs,y_reduced2,test_size=0.3,random_state=0)
Xrr_train,Xrr_test,yrr_train,yrr_test=train_test_split(Xrr,y_reduced2,test_size=0.3,random_state=0)
Xrm_train,Xrm_test,yrm_train,yrm_test=train_test_split(Xrm,y_reduced2,test_size=0.3,random_state=0)
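
One possible variation on the splits above, sketched on synthetic data: split first with `stratify` so that both parts keep the class ratio, then fit the scaler on the training part only. This is an alternative under stated assumptions, not what the cells above do (they scale before splitting).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50, scale=10, size=(200, 3))   # synthetic features
y_toy = np.array([0] * 80 + [1] * 120)                # synthetic, imbalanced labels

# stratify keeps the 40/60 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

scaler = StandardScaler().fit(X_tr)     # statistics come from the training part only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)    # the test part reuses the training statistics
print(y_tr.mean(), y_te.mean())
```

Fitting the scaler after the split keeps information about the test rows out of the training features.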

CROSS-CHECKING SELECTED COLUMNS BEFORE MODELING CONTINUES

In [None]:
# SELECTING FEATURES WITH SKLEARN SELECTOR
clf_lr = LogisticRegression(random_state = 1)
from sklearn.feature_selection import RFE 
rfe = RFE(estimator= clf_lr , step = 1)
fit = rfe.fit(Xs, y)

print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
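
The boolean mask in `support_` is easier to read once it is mapped back to column names. A minimal sketch on synthetic data; the feature names `f0` to `f5` and the choice of three features are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical feature names f0..f5
X_arr, y_arr = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)
X_df = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

selector = RFE(
    estimator=LogisticRegression(max_iter=1000), n_features_to_select=3, step=1
)
selector.fit(X_df, y_arr)

# Translate the boolean mask and the ranking back to column names
selected_names = X_df.columns[selector.support_].tolist()
ranking = pd.Series(selector.ranking_, index=X_df.columns).sort_values()
print(selected_names)
print(ranking)
```

Selecting by name rather than by position stays correct even if the column order of the DataFrame changes later.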

In [None]:
#COLUMNS WE SELECTED
#['Treated_with_drugs','Patient_Age','Patient_Body_Mass_Index','Patient_Smoker','Patient_Rural_Urban','D','Number_of_prev_cond','Survived_1_year']

#COLUMNS SELECTED BY RFE ON STANDARDSCALER (selected by name instead of by position)
rfe_cols=['Treated_with_drugs','Patient_Body_Mass_Index','Patient_Smoker','Patient_Rural_Urban', 'A', 'B', 'D', 'Z']
Xsfe=data[rfe_cols]

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=1)
Xlda=lda.fit_transform(Xs, y)

#COLUMNS SELECTED BY LDA ON SC
#Xlda

In [None]:
rf_clf = RandomForestClassifier(n_estimators=10000, random_state=1, n_jobs=-1)
rf_clf.fit(Xs, y)

#plot graph of feature importances for better visualization
feat_importances = pd.Series(rf_clf.feature_importances_, index=Xs.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

In [None]:
#COLUMNS SELECTED BY RANDOM FOREST
c=['Diagnosed_Condition','Patient_Rural_Urban','Treated_with_drugs','Patient_ID','Z','C','A','Patient_Smoker']
xz=data.loc[:, c]
Xz=data[xz.columns]

In [None]:
from xgboost import plot_importance
xgb_model = XGBClassifier(random_state = 0, use_label_encoder=False, eval_metric='logloss')
xgb_model.fit(Xs, y)
#print("Feature Importances : ", xgb_model.feature_importances_)
# plot feature importance
plot_importance(xgb_model)
plt.show()

In [None]:
#COLUMNS SELECTED BY XGBOOST
d=['Diagnosed_Condition','Treated_with_drugs','Patient_Rural_Urban','Patient_ID','Z','C','A','Patient_Smoker']
Xgb=data.loc[:, d]
Xgb=data[Xgb.columns]

In [None]:
#Splitting reduced column dataset from other selection models
Xrfe_train,Xrfe_test,yrfe_train,yrfe_test=train_test_split(Xsfe,y,test_size=0.3,random_state=0)#RFE
Xlda_train,Xlda_test,ylda_train,ylda_test=train_test_split(Xlda,y,test_size=0.3,random_state=0)#LDA
Xzf_train,Xzf_test,yzf_train,yzf_test=train_test_split(Xz,y,test_size=0.3,random_state=0)#RF
Xgb_train,Xgb_test,ygb_train,ygb_test=train_test_split(Xgb,y,test_size=0.3,random_state=0)#XGB

RUNNING THE MODELS

#HYPERPARAMETERS FOR ALL MODELS

classifiers=[LogisticRegression(C=1, penalty='l2'), BernoulliNB(alpha=5), 
GradientBoostingClassifier(n_estimators=5, learning_rate=0.05, max_depth=4, random_state=0), SVC(C=1.0, gamma =0.7, kernel='rbf'),
   DecisionTreeClassifier(criterion='entropy',max_depth=3,max_features='log2', min_samples_leaf =3, min_samples_split=10, random_state=0), 
   MLPClassifier(hidden_layer_sizes= (10,), activation= 'relu', solver='adam', alpha=0.001,learning_rate= 'constant'),
    KNeighborsClassifier(leaf_size=10, p=1, n_neighbors=5), RandomForestClassifier(criterion='entropy',max_depth=2,max_features='log2',
    min_samples_leaf=3,min_samples_split=8, n_estimators=5,n_jobs=1, random_state=0)]

In [None]:
#START WITH THE BASELINE MODELS (TESTING 10 MODELS)
classifiers=[GradientBoostingClassifier(random_state = 0),XGBClassifier(random_state = 0, use_label_encoder=False, eval_metric='logloss'),
CatBoostClassifier(random_state = 0), LogisticRegression(random_state = 0), BernoulliNB(),  SVC(random_state = 0), 
DecisionTreeClassifier(random_state = 0),MLPClassifier(random_state = 0), KNeighborsClassifier(), RandomForestClassifier(random_state = 0)]

In [None]:
model_lisdt=[]   # fitted models, in the same order as `classifiers`

def squared_error(actual, pred):
    return (pred - actual) ** 2


def evaluate(actual, pred):
    # 1 for a correct prediction, 0 otherwise
    return int(actual == pred)

def modeld(classifiers, X_train, y_train, X_test, y_test):
    # empty the list first so that model_lisdt[i] always matches classifiers[i] of the latest run
    model_lisdt.clear()
    for classers in classifiers:

        t1=time.time()
        classify = classers.fit(X_train, y_train)
        model_lisdt.append(classify)
        time_taken = time.time() - t1   # training time in seconds

        predicted=np.ravel(classify.predict(X_test))

        # for 0/1 labels the mean squared error equals the misclassification rate
        Mse=np.mean(squared_error(np.asarray(y_test), predicted))

        conf_mat=confusion_matrix(y_test, predicted)

        st=str(classers)
        print(' \n ############## {} ################'.format(st))
        print(" Duration =:\t {:0.2f} s".format(time_taken))
        print(" Mean squared Error \t {}".format(Mse))
        # the sklearn metrics return fractions, so multiply by 100 to report percentages
        print(" Accuracy =: \t {:0.2f} %".format(100 * accuracy_score(y_test, predicted)))
        print(" F_Score =: \t {:0.2f} %".format(100 * f1_score(y_test, predicted)))
        print(" Precision =: \t {:0.2f} %".format(100 * precision_score(y_test, predicted)))
        print(" Recall =: \t {:0.2f} %".format(100 * recall_score(y_test, predicted)))
        print(conf_mat)
    return(model_lisdt)
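
Printed reports are hard to compare across ten models. A minimal sketch of collecting the same metrics into one table, assuming synthetic data and two stand-in classifiers; the helper name `metrics_row` is hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def metrics_row(model, X_train, y_train, X_test, y_test):
    """Fit `model` and return its test metrics as one dictionary (one table row)."""
    predicted = model.fit(X_train, y_train).predict(X_test)
    return {
        "model": type(model).__name__,
        "accuracy": accuracy_score(y_test, predicted),
        "f1": f1_score(y_test, predicted),
        "precision": precision_score(y_test, predicted),
        "recall": recall_score(y_test, predicted),
    }

# Synthetic stand-in data
X_m, y_m = make_classification(n_samples=300, n_features=8, random_state=0)
X_a, X_b, y_a, y_b = train_test_split(X_m, y_m, test_size=0.3, random_state=0)

candidates = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]
summary = pd.DataFrame([metrics_row(m, X_a, y_a, X_b, y_b) for m in candidates])
summary = summary.sort_values("f1", ascending=False).reset_index(drop=True)
print(summary)
```

A table like this can be sorted by any metric, which makes it easier to back up statements about which classifier performed best.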

RUN ALL MODELS ON COMPLETE COLUMNS (USING ALL THE FEATURES)

In [None]:
#unscaled_full=modeld(classifiers, Xf_train,yf_train, Xf_test,yf_test)

In [None]:
#normalise_full= modeld(classifiers, Xm_train,ym_train, Xm_test,ym_test)

In [None]:
#robustscaled_full= modeld(classifiers, Xr_train,yr_train,Xr_test,yr_test)

In [None]:
standarsised_full=modeld(classifiers, Xs_train,ys_train,Xs_test,ys_test)

RUN ALL MODELS ON DATASET USING REDUCED FEATURES

In [None]:
#normalise_reduced= modeld(classifiers, Xrm_train,yrm_train,Xrm_test,yrm_test)

In [None]:
#robustscaled_reduced= modeld(classifiers, Xrr_train,yrr_train,Xrr_test,yrr_test)

In [None]:
#standarsised_reduced=modeld(classifiers, Xrs_train,yrs_train,Xrs_test,yrs_test)

RUN ALL THE MODELS ON DATASET USING AUTO-SELECTED COLUMNS

In [None]:
#standarsised_rfe=modeld(classifiers, Xrfe_train,yrfe_train,Xrfe_test,yrfe_test)#RFE REDUCED

In [None]:
#standarsised_lda=modeld(classifiers, Xlda_train,ylda_train,Xlda_test,ylda_test)#LDA REDUCED

In [None]:
#standarsised_rf=modeld(classifiers, Xzf_train,yzf_train,Xzf_test,yzf_test)#RANDOMFOREST REDUCED

In [None]:
#standarsised_xgb=modeld(classifiers, Xgb_train,ygb_train,Xgb_test,ygb_test)#XGBOOST REDUCED

EVALUATING GIVEN TEST DATASET

**After testing the different versions of the dataset with the different scalers, these three classifiers performed best:
GradientBoostingClassifier(), XGBClassifier() and CatBoostClassifier().

**Across all classifiers and scalers, CatBoostClassifier() was the best.

**Using the full feature set also performed better than any form of feature reduction we carried out.

**The standard scaler performed best, while the min-max scaler was very poor on every version of the dataset.

**Our manually reduced feature set performed better than the feature sets chosen by the automatic selectors (RFE, LDA, random forest, XGBoost).

**So we select the full-feature, standardised version of the dataset.

**Next, below, we check the log loss of the three best-performing classifiers. CatBoostClassifier() was still the best, so we select that classifier.

In [None]:
from sklearn.metrics import log_loss

In [None]:
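#model_lisdt follows the order of `classifiers`: index 2 = CatBoost, 1 = XGBoost, 0 = GradientBoosting
models=model_lisdt[2]
pre=models.predict_proba(Xs_test)   #log loss needs class probabilities, not hard 0/1 labels
ll = log_loss(ys_test, pre)
ll

In [None]:
models2=model_lisdt[1]
pr=models2.predict_proba(Xs_test)
ll1 = log_loss(ys_test, pr)
ll1

In [None]:
models3=model_lisdt[0]
pred=models3.predict_proba(Xs_test)
ll2 = log_loss(ys_test, pred)
ll2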
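
Log loss is defined on predicted probabilities. The sketch below, with made-up probabilities, checks sklearn's value against the formula and shows what happens when hard 0/1 labels are passed instead.

```python
import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
proba_pos = np.array([0.9, 0.2, 0.6, 0.4])      # P(class 1), illustrative values
hard_labels = (proba_pos >= 0.5).astype(int)    # [1, 0, 1, 0], one mistake

# Log loss from probabilities, and the same value computed by hand
loss_from_proba = log_loss(y_true, proba_pos)
manual = -np.mean(
    y_true * np.log(proba_pos) + (1 - y_true) * np.log(1 - proba_pos)
)

# Hard labels act like probabilities of exactly 0 or 1, so one mistake is punished heavily
loss_from_labels = log_loss(y_true, hard_labels)
print(loss_from_proba, manual, loss_from_labels)
```

With hard labels the score mostly counts mistakes, so it cannot separate a confident model from a hesitant one. That is the reason to feed `predict_proba` output into `log_loss`.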

In [None]:
models.feature_importances_
feat_importances = pd.Series(models.feature_importances_, index=Xs.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()

                        ##OPTIMIZATION
                        
                        F1 score before optimization = 87.0

In [None]:
from sklearn.metrics import f1_score
def fisc(models, xtest, ytest):    
    preds=models.predict(xtest)
    f1=f1_score(ytest, preds)
    return(f1)

In [None]:
from sklearn.model_selection import RepeatedStratifiedKFold
cvs = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

FIRST
Let's prepare a dataset that lets CatBoost use its native handling of categorical features, without much encoding

In [None]:
ndata=pharma_data.copy()   # copy, so that the raw data stays untouched
ndata.head(2)

In [None]:
ndata.loc[:,number_cols]=ndata.loc[:,number_cols].fillna(ndata.loc[:,number_cols].mean())

In [None]:
ndata[cat_cols]=ndata[cat_cols].ffill()
ndata.head(2)

In [None]:
#ndata.isnull().sum()

In [None]:
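#scale the numeric features only; the target must stay 0/1
num_features=[c for c in number_cols if c != 'Survived_1_year']
ndata[num_features]=sc.fit_transform(ndata[num_features])
ny=ndata['Survived_1_year']
nx=ndata.drop('Survived_1_year', axis=1)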

SECOND
LET'S PREPARE THE DATASET BY ENCODING THE CATEGORICAL COLUMNS WITH
A DIFFERENT ENCODING SCHEME FROM THE ONE WE USED FOR ALL THE OTHER MODELS

In [None]:
import category_encoders as ce
cdata=pharma_data.copy()   # copy, so that the raw data stays untouched
cdata.loc[:,number_cols]=cdata.loc[:,number_cols].fillna(cdata.loc[:,number_cols].mean())
cdata[cat_cols]=cdata[cat_cols].ffill()

In [None]:
cdata.head(3)

In [None]:
#cdata.isnull().sum()

In [None]:
import category_encoders as ce
enc= ce.OrdinalEncoder(cols=cat_cols)
cdata[cat_cols]=enc.fit_transform(cdata[cat_cols])

In [None]:
cdata['Treated_with_drugs']=cdata['Treated_with_drugs'].ffill()

In [None]:
ency=cdata['Survived_1_year']
encx=cdata.drop('Survived_1_year', axis=1)

In [None]:
# SPLITTING THE NEW DATASETS
Xenc_train,Xenc_test,yenc_train,yenc_test=train_test_split(encx,ency,test_size=0.3,random_state=0)
Xcat_train,Xcat_test,ycat_train,ycat_test=train_test_split(nx,ny,test_size=0.3,random_state=0)

SEARCHING BEST MODEL

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

def grid_search(Xs_train,ys_train, Xs_test,ys_test):
    #NOTE: despite its name, this function runs CatBoost's built-in randomized search
    f=fisc(models, Xs_test,ys_test)
    print ('INITIAL F1-SCORE=: \t\t {:0.2f}, for {}'.format(f, str(models)))

    print('################## AFTER OPTIMIZATION #######################')
    grid = {'iterations': [500],
            'learning_rate': [0.03, 0.1, 0.5],
            'depth': [4, 6, 10, 12],
            'l2_leaf_reg': [1, 3, 5, 7, 9],
            }

    m=CatBoostClassifier(random_state = 0)
    #randomized_search returns a dict with the keys 'params' (best parameters) and 'cv_results'
    best_model = m.randomized_search(param_distributions=grid, X=Xs_train, y=ys_train, verbose=100, plot=True)
    #best_model= RandomizedSearchCV(m, grid, scoring='accuracy', cv =cvs)
    #GridSearchCV(clf, param_grid, cv=cv, scoring='accuracy', verbose=10)
    return(best_model)

best_model=grid_search(Xs_train,ys_train, Xs_test,ys_test)
#best_model2=grid_search(Xcat_train, ycat_train, Xcat_test, ycat_test)
#best_model3=grid_search(Xenc_train, yenc_train, Xenc_test,yenc_test)
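
The commented-out `RandomizedSearchCV` line hints at the sklearn route. A minimal sketch of that alternative, assuming synthetic data and using GradientBoostingClassifier as a stand-in so that the example runs without CatBoost; the parameter values are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data
X_s, y_s = make_classification(n_samples=200, n_features=6, random_state=0)

# Illustrative search space, not tuned for the pharma dataset
param_space = {
    "learning_rate": [0.03, 0.1, 0.5],
    "max_depth": [2, 3, 4],
    "n_estimators": [20, 40],
}

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions=param_space,
    n_iter=4,                      # number of sampled parameter combinations
    scoring="f1",                  # the metric this section tries to improve
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1),
    random_state=0,
)
search.fit(X_s, y_s)

best_params = search.best_params_      # dict of the winning parameters
tuned_model = search.best_estimator_   # model refitted on all of X_s with those parameters
print(best_params, round(search.best_score_, 3))
```

Unlike CatBoost's own `randomized_search`, which returns a plain dictionary, the sklearn search object exposes `best_params_`, `best_score_` and `best_estimator_`.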

In [None]:
b=best_model['params']
b

In [None]:
#{'depth': 4, 'l2_leaf_reg': 1, 'iterations': 1000, 'learning_rate': 0.5} using category encoding
#{'depth': 6, 'l2_leaf_reg': 3, 'iterations': 1000, 'learning_rate': 0.03}using category encoding
#{'depth': 6, 'l2_leaf_reg': 3, 'iterations': 500, 'learning_rate': 0.1}# original dataset
clf=CatBoostClassifier(iterations=203, 
                        learning_rate=0.1,
                        depth=6,
                        l2_leaf_reg=3,
                        eval_metric='F1')

In [None]:
def evaluate_model(clf,Xs_train,ys_train, Xs_test,ys_test):
    #named evaluate_model so that it does not overwrite the evaluate() helper defined for modeld
    clf.fit(Xs_train,ys_train)
    acc= clf.score(Xs_test,ys_test)
    print('FINAL MODEL ACCURACY =: \t{:0.2f}'.format(acc))

    f1=fisc(clf, Xs_test,ys_test)
    print ('FINAL F1-SCORE=: \t\t {:0.2f}'.format(f1))
    #print('Improvement of {:0.2f}%.'.format( 100 * (f1)))
    return(clf)

main_model=evaluate_model(clf,Xs_train,ys_train, Xs_test,ys_test)
#catdata=evaluate_model(clf, Xcat_train, ycat_train, Xcat_test, ycat_test)
#codeddata=evaluate_model(clf, Xenc_train, yenc_train, Xenc_test,yenc_test)

In [None]:
#VALIDATE SELECTED MODEL
from sklearn.model_selection import cross_val_score
print('VALIDATION BEFORE OPTIMIZATION')
#scoring='f1' so that the printed value really is an F1 score
scores = cross_val_score(models, Xs_train,ys_train, scoring='f1', cv=cvs, n_jobs=-1, error_score='raise')
print('VALIDATION F1_SCORE AND STD =:\t %.3f (%.3f)' % (np.mean(scores), np.std(scores)))

print('VALIDATION AFTER OPTIMIZATION')
#best_model is the dict returned by randomized_search, so we validate the tuned classifier clf instead
scores2 = cross_val_score(clf, Xs_train,ys_train, scoring='f1', cv=cvs, n_jobs=-1, error_score='raise')
print('VALIDATION F1_SCORE AND STD =:\t %.3f (%.3f)' % (np.mean(scores2), np.std(scores2)))
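
How repeated stratified cross-validation produces its scores can be sketched on synthetic data. LogisticRegression is only a stand-in here so the example runs quickly, and the fold counts are smaller than the 10 x 3 plan used above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in data
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=0)

# 5 folds repeated 2 times gives 10 fitted models and 10 scores
cv_plan = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=1)

f1_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X_cv, y_cv, scoring="f1", cv=cv_plan
)
print("F1 mean and std: %.3f (%.3f)" % (np.mean(f1_scores), np.std(f1_scores)))
```

The number of scores is always `n_splits * n_repeats`, and the `scoring` argument decides which metric the printed mean refers to.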

TESTING THE EVALUATION DATASET

In [None]:
main_model

In [None]:
test_data = pd.read_csv('https://raw.githubusercontent.com/dphi-official/Datasets/master/pharma_data/Testing_set_begs.csv')

In [None]:
#test_data.isnull().sum()

In [None]:
dff=test_data 

In [None]:
dff.shape

In [None]:
cat_cols2=dff.select_dtypes(object).columns
cat_cols2

In [None]:
# FREQUENCY-ENCODE THE CATEGORICAL VARIABLES OF THE TEST SET
# (note: the frequencies are computed on the test set itself)
for j in cat_cols2:    
    encoding = dff.groupby(j).size()
    encoding = encoding/len(dff)   #calculates frequency
    dff[j] = dff[j].map(encoding)
dff.head(2)

In [None]:
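#scale the test set with statistics learned from the training features x (same column order)
scaled_df=StandardScaler().fit(x).transform(dff[x.columns])
scaled_df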

In [None]:
predicted=models.predict(scaled_df)
predicted

In [None]:
#scaled_df['prediction']=predicted

In [None]:
dframes=pd.DataFrame(data=predicted, columns=['predicted'])
dframes

In [None]:
# CONVERT FINAL DATAFRAME TO CSV
filename='glsam_datathonCATS'
dframes.to_csv('G:\\My Drive\\2022 Course DPHI\\{}'.format(filename + '.csv'), sep=',', header=True, index=False)

USING AND SAVING THE MODEL

In [None]:
###########    USING THE MODEL
def use_Model(arr):
   #assumes the row has no missing values and is already frequency-encoded
   #reuse the statistics of the training features x: fitting a scaler on a single row would turn every value into 0
   sett=StandardScaler().fit(x.values).transform(arr)
   predict= models.predict(sett)

   if (np.ravel(predict)[0]==1):
      print('The model predicts that the patient survives the first year')
   else:
      print('The model predicts that the patient does not survive the first year')

print('ENTER DETAILS TO CHECK STATUS')
print('\n#####################')
use_Model([[19150,40,3709,0.080297,16,29.443894,0.573256,0.698914,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,2.0]])

In [None]:
# SAVE THE MODEL FOR FUTURE USE
import joblib

# save the model to disk
f_name = 'liverdata.sav'
joblib.dump(models, f_name)

In [None]:
#LOAD MODEL LATER
# load the model from disk
saved_model = joblib.load(f_name)
newx=Xs_test.iloc[:1]      # placeholder input: replace with new rows prepared like Xs
new_pred = saved_model.predict(newx)
#result = saved_model.score(newx, newy)
#print(result)
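
A saved classifier is only usable if new data is scaled the same way as the training data. One way to keep both together, sketched on synthetic data: wrap scaler and model in a sklearn pipeline and save that. LogisticRegression and the file name are stand-ins chosen for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data
X_p, y_p = make_classification(n_samples=200, n_features=5, random_state=0)

# The pipeline stores the fitted scaler together with the fitted model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_p, y_p)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline_sketch.joblib")   # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline scales and predicts raw rows in one call
same_predictions = np.array_equal(pipeline.predict(X_p[:5]), restored.predict(X_p[:5]))
print(same_predictions)
```

With this layout the code that loads the model does not need access to the training data or to a separately fitted scaler.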