# Business Analytics - Analytics Cup 21

**Status Quo (January 31):**   
XGB Classifier - balanced accuracy: 0.7136  
Logistic Regression - balanced accuracy: 0.5359  
Naive Bayes - balanced accuracy: 0.5297  

General background on imbalanced classification:  
https://towardsdatascience.com/guide-to-classification-on-imbalanced-datasets-d6653aa5fa23

# Importing Libraries

In [None]:
# installing a library to learn from imbalanced data sets via the terminal
#! sudo pip install imbalanced-learn 

In [None]:
import os
project_folder = '/Users/Manu/Documents/GitHub/analytics_cup_21/submission_4'
os.chdir(project_folder)

In [None]:
# importing libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, RepeatedStratifiedKFold
from sklearn.metrics import confusion_matrix, balanced_accuracy_score, roc_curve, roc_auc_score
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

from collections import Counter

from scipy.stats import loguniform, randint, uniform

import imblearn
print(imblearn.__version__)

from imblearn.over_sampling import RandomOverSampler
from sklearn.utils import class_weight


In [None]:
# hiding warnings
import warnings
warnings.filterwarnings('ignore')

## Reading in the data

In [None]:
train = pd.read_csv("train_physicians_df_22_F.csv", sep =";")

In [None]:
test = pd.read_csv("test_physicians_df_22_F.csv", sep =";")

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train = train.set_index('Physician_ID')

In [None]:
train["Ownership_Indicator"].value_counts()

In [None]:
# labels: the first 4000 rows are used for training, the remaining rows are held out for validation
y_train = train["Ownership_Indicator"][0:4000]

y_test = train["Ownership_Indicator"][4000:5000]  # start at 4000 so that row 4000 is not dropped

print(y_train, y_test)

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

In [None]:
train.columns

In [None]:
# features (manually); the target 'Ownership_Indicator' is excluded to avoid label leakage
feature_cols = ['Primary_Specialty', 'total_payments',
       'number_of_payments', 'top_nature', 'total_of_top_nature',
       'range_count', 'range_total', 'top_company', 'pay_count', 'std',
       'top_rpi', 'rpi_count', 'cash', 'services', 'stock', 'stock_opt',
       'any_ownership', 'dividend', 'stock_or_other', 'top_fop', 'fop_count']

X_train = train[feature_cols][0:4000]

X_test = train[feature_cols][4000:5000]

# Preprocessing

# Ordinal/One-hot Encoding 

 
Link: https://machinelearningmastery.com/one-hot-encoding-for-categorical-data/

Many machine learning models require all input and output variables to be numeric.

This means that if your data contains categorical data, you must encode it as numbers before you can fit and evaluate such a model.  

Types of data:    
**Numerical data**. A variable or feature composed only of numbers, such as integers or floating-point values.  
**Nominal/Categorical Variable**. A variable comprising a finite set of discrete values with no ordering between the values.  
**Ordinal Variable**. A variable comprising a finite set of discrete values with a ranked ordering between the values.

A numerical variable can be converted to an ordinal variable by dividing the range of the numerical variable into bins and assigning values to each bin. For example, a numerical variable between 1 and 10 can be divided into an ordinal variable with 5 labels with an ordinal relationship: 1-2, 3-4, 5-6, 7-8, 9-10. This is called **discretization**.
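As a minimal sketch of the discretization described above, the following bins a hypothetical variable with values from 1 to 10 into the five ordered labels. The series `values` and the bin edges are illustrative assumptions, not columns of the physician data.

```python
import pandas as pd

# Hypothetical numerical variable with values between 1 and 10
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Five right-closed bins: (0, 2], (2, 4], (4, 6], (6, 8], (8, 10]
bins = [0, 2, 4, 6, 8, 10]
labels = ["1-2", "3-4", "5-6", "7-8", "9-10"]

# pd.cut returns an ordered categorical, so the bins keep their ranking
binned = pd.cut(values, bins=bins, labels=labels)
codes = binned.cat.codes  # integer rank of each bin, starting at 0

print(binned.tolist())
print(codes.tolist())
```

The integer codes preserve the ordering of the bins, which is what distinguishes an ordinal variable from a nominal one.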


The following variants exist:
* One-hot encoding (each category value is converted into a new column that holds a 1 or 0, i.e. true/false, for every row)
* Ordinal encoding (= integer encoding) (each unique category value is assigned an integer value; for example, “red” is 1, “green” is 2, and “blue” is 3)
* Label encoding (the same integer mapping as ordinal encoding, but in scikit-learn's `LabelEncoder` it is intended for the target variable rather than for input features)
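The first two variants can be compared on a toy column. This is only a sketch: the `colors` frame and the chosen category order are hypothetical, and pandas is used here instead of the scikit-learn encoders to keep the example short.

```python
import pandas as pd

# Hypothetical categorical column
colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One-hot encoding: one 0/1 column per category (columns are sorted alphabetically)
one_hot = pd.get_dummies(colors["color"], prefix="color", dtype=int)

# Ordinal (integer) encoding with an explicit order: red=1, green=2, blue=3
order = ["red", "green", "blue"]
ordinal = pd.Categorical(colors["color"], categories=order).codes + 1

print(one_hot)
print(ordinal.tolist())
```

Each one-hot row contains exactly one 1, whereas the ordinal encoding compresses the same information into a single integer column and thereby implies an order between the categories.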






**However, some algorithms can work with categorical data directly.**

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

In [None]:
train.head()

In [None]:
train.shape

In [None]:
# checking the column data types to figure out which attributes are categorical (object) and which ones are numerical (int/float)
train.dtypes

In [None]:
cat_feature_mask = train.dtypes == object # boolean test (mask) of whether the col is of "object type" or not
cat_feature_mask

In [None]:
# filter categorical columns using the mask and turn it into a list
cat_list = train.columns[cat_feature_mask].to_list()

cat_list

In [None]:
# removing a list element
cat_list.remove("State")

cat_list

In [None]:
train["Primary_Specialty"].unique()

In [None]:
train["Primary_Specialty"].unique().shape

In [None]:
train["top_nature"].unique()

In [None]:
train["top_nature"].unique().shape

Imputing NaN values

In [None]:
# detecting and counting missing values (NaNs)
train.isnull().sum()

In [None]:
# mode class = Internal Medicine 
train["Primary_Specialty"].value_counts()

In [None]:
# imputation of NaN values with the most frequent class
train["Primary_Specialty"] = train["Primary_Specialty"].fillna('Internal Medicine')

In [None]:
train["std"].describe()

In [None]:
train["std"].median()

In [None]:
# imputation of NaN values with the median value of the feature
train["std"] = train["std"].fillna(train["std"].median())

In [None]:
train.isnull().sum() # check

In [None]:
# transforming the categorical columns to numerical values using a ONE HOT ENCODER

# The encoder's fit_transform() method takes a dataframe as input and returns the encoded values
# as a NumPy array (because sparse=False). Only the categorical columns in cat_list are passed in.


# define one hot encoding
# (scikit-learn 1.2 renamed the `sparse` argument to `sparse_output`)
onehot_encoder = OneHotEncoder(sparse=False)

# one hot encoding primary specialty
##ps_enc = onehot_encoder.fit_transform(train["Primary_Specialty"])

data = train[cat_list]

# transform data
result_oh = pd.DataFrame(onehot_encoder.fit_transform(data))
print(result_oh)


In [None]:
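# colnames of encoded features
# (get_feature_names() was removed in scikit-learn 1.2; newer versions provide get_feature_names_out() instead)
onehot_encoder.get_feature_names()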

In [None]:
# renaming the encoded columns 
result_oh.columns = onehot_encoder.get_feature_names()

result_oh

In [None]:
# transforming the categorical columns to numerical values using an ORDINAL ENCODER

# The encoder has a method called fit_transform(). It takes a dataframe as an input, and will spit out the 
# transformed dataframe. Remember to specify the columns you need to transform (you have a list for this). 


# define ordinal encoding
##ordinal_encoder = OrdinalEncoder()

# one hot encoding primary specialty
##ps_enc = onehot_encoder.fit_transform(train["Primary_Specialty"])

#data = train[cat_list]

# transform data
##result = pd.DataFrame(ordinal_encoder.fit_transform(data))
##print(result)



Adding the encoded columns to the training and test data

In [None]:
# merging the two dfs
train_final = pd.concat([train.reset_index(drop=True), result_oh], axis=1)

train_final

In [None]:
# dropping the categorical features that have been encoded
train_final = train_final.drop(["Primary_Specialty","top_nature","top_rpi", "top_fop", "State"], axis=1)



In [None]:
train_final.head()

train test split

In [None]:
# labels 
y_train = train_final["Ownership_Indicator"][0:4000]

y_test = train_final["Ownership_Indicator"][4000:5000]  # start at 4000 so that row 4000 is not dropped

print(y_train, y_test)

In [None]:
# features: all columns except the target
X_train = train_final.drop(["Ownership_Indicator"], axis=1)[0:4000]

X_test = train_final.drop(["Ownership_Indicator"], axis=1)[4000:5000]
print(X_train, X_test)

encoding test set

In [None]:
test.head()

In [None]:
cat_feature_mask_ts = test.dtypes == object # boolean test (mask) of whether the col is of "object type" or not
cat_feature_mask_ts

In [None]:
# filter categorical columns using the mask and turn it into a list
cat_list_ts = test.columns[cat_feature_mask_ts].to_list()

cat_list_ts

In [None]:
cat_list_ts.remove("State")


In [None]:
test[cat_list_ts]

In [None]:
# detecting and counting missing values (NaNs)
test.isnull().sum()

In [None]:
# mode class = Internal Medicine 
test["Primary_Specialty"].value_counts()

In [None]:
# imputation of NaN values with the most frequent class
test["Primary_Specialty"] = test["Primary_Specialty"].fillna('Internal Medicine')

In [None]:
test["std"].describe()

In [None]:
# imputation of NaN values with the median value of the feature
test["std"] = test["std"].fillna(test["std"].median())

In [None]:
test.isnull().sum() # check

In [None]:
# transforming the categorical columns to numerical values using a ONE HOT ENCODER

# note: a second encoder is fitted on the test data here, so the resulting columns depend on the
# categories present in the test set and can differ from the training columns (compared further below)


# define one hot encoding
onehot_encoder = OneHotEncoder(sparse=False)

# one hot encoding primary specialty
##ps_enc = onehot_encoder.fit_transform(train["Primary_Specialty"])

data = test[cat_list_ts]



# transform data
result_test_oh = pd.DataFrame(onehot_encoder.fit_transform(data))
print(result_test_oh)


In [None]:
# colnames of encoded features
onehot_encoder.get_feature_names()

In [None]:
# renaming the encoded columns 
result_test_oh.columns = onehot_encoder.get_feature_names()

result_test_oh

In [None]:
# merging the two dfs
test_final = pd.concat([test.reset_index(drop=True), result_test_oh], axis=1,)

test_final

In [None]:
# dropping the categorical features that have been encoded
test_final = test_final.drop(["Primary_Specialty","top_nature","top_rpi", "top_fop", "State"], axis=1)

test_final


In [None]:
enc_features = test_final.columns[1:]  # all columns except the first one

enc_features

In [None]:

~X_train.columns.isin(X_test.columns)

In [None]:
~X_test.columns.isin(X_train.columns)
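The two masks above check whether the separately encoded frames ended up with different columns. One common way to avoid such mismatches is to fit a single encoder on the training data and reuse it for the test data. The sketch below is an assumption about how that could look, not the approach used in this notebook; `train_toy` and `test_toy` are hypothetical frames, and `handle_unknown="ignore"` makes categories unseen during fitting encode as all-zero rows.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frames; "Phlebology" appears only in the test data
train_toy = pd.DataFrame({"Primary_Specialty": ["Internal Medicine", "Surgery", "Surgery"]})
test_toy = pd.DataFrame({"Primary_Specialty": ["Surgery", "Phlebology"]})

# Fit on the training data only, then apply the same mapping to the test data
encoder = OneHotEncoder(handle_unknown="ignore")
train_enc = encoder.fit_transform(train_toy).toarray()
test_enc = encoder.transform(test_toy).toarray()

print(list(encoder.categories_[0]))
print(train_enc.shape, test_enc.shape)
print(test_enc)
```

Both encoded arrays have the same number of columns in the same order, so no column has to be dropped by hand afterwards.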

End encoding & preprocessing

In [None]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

In [None]:
X_train.head()

# Random Resampling

Things to try:
**Random Oversampling**:  
https://machinelearningmastery.com/random-oversampling-and-undersampling-for-imbalanced-classification/

The simplest strategy is to choose examples for the transformed dataset randomly, called random resampling.  

Resampling involves creating a new transformed version of the training dataset in which the selected examples have a different class distribution.  

There are two main approaches to random resampling for imbalanced classification; they are oversampling and undersampling.

**Random Oversampling**: Randomly duplicate examples in the minority class.  
**Random Undersampling**: Randomly delete examples in the majority class
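
Both strategies can be sketched with NumPy alone. The labels and the single dummy feature below are hypothetical; the notebook itself uses `RandomOverSampler` from imbalanced-learn for the oversampling step.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(123)

# Hypothetical imbalanced labels: 8 negatives, 2 positives
y = np.array([0] * 8 + [1] * 2)
X = np.arange(len(y)).reshape(-1, 1)  # one dummy feature per row

minority_idx = np.flatnonzero(y == 1)
majority_idx = np.flatnonzero(y == 0)

# Random oversampling: draw minority rows with replacement until the classes match
extra = rng.choice(minority_idx, size=len(majority_idx) - len(minority_idx), replace=True)
over_idx = np.concatenate([majority_idx, minority_idx, extra])
X_over, y_over = X[over_idx], y[over_idx]

# Random undersampling: keep only as many majority rows as there are minority rows
kept = rng.choice(majority_idx, size=len(minority_idx), replace=False)
under_idx = np.concatenate([kept, minority_idx])
X_under, y_under = X[under_idx], y[under_idx]

print(Counter(y_over.tolist()), Counter(y_under.tolist()))
```

Oversampling keeps all original rows and adds duplicates, while undersampling discards majority rows and therefore information.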

In [None]:
# summarize class distribution
print(Counter(y_train))
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_train_over, y_train_over = oversample.fit_resample(X_train, y_train)
# summarize class distribution
print(Counter(y_train_over))

In [None]:
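# summarize class distribution
# note: oversampling the hold-out set changes the class distribution the model is scored on;
# evaluation is normally done on the untouched X_test / y_test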
print(Counter(y_test))
# define oversampling strategy
oversample = RandomOverSampler(sampling_strategy='minority')
# fit and apply the transform
X_test_over, y_test_over = oversample.fit_resample(X_test, y_test)
# summarize class distribution
print(Counter(y_test_over))

Random oversampling has raised the minority class of the training data from 241 to 3759 instances, so both classes now contain 3759 instances.

**Using the XGB Classifier with randomly resampled data:**



In [None]:
# initializing the XGBClassifier as an object (using all the default hyperparameters)
# no scale_pos_weight is set here because the classes were already balanced by oversampling
xgbc_os = xgb.XGBClassifier()



In [None]:
# CV type 3: repeated stratified k-fold CV (due to imbalance in the data set) 
# maintains the ratio of instances in each class for each fold
# Repeated stratified k-fold CV involves simply repeating the cross-validation procedure multiple times 
# and reporting the mean result across all folds from all runs

n_folds=3 # no. of folds
n_repeats=3 # no. of runs

rskfold = RepeatedStratifiedKFold(n_splits=n_folds,n_repeats=n_repeats,
                                 random_state=123)

In [None]:
# estimator = xgbc_os is the xgb classifier model; "cv" determines the cross validation splitting strategy; "scoring" determines the evaluation metric
clf_rscv_scores_os = cross_val_score(xgbc_os, X_train_over, y_train_over, cv=rskfold, scoring='balanced_accuracy')
print("Repeated Stratified K-fold CV average score: %.2f" % clf_rscv_scores_os.mean())

clf_rscv_scores_os # score per fold and run
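
Cross-validating on data that was oversampled beforehand can place copies of the same minority row in both the training and the validation part of a split, which tends to inflate the score. A hedged alternative is to oversample inside each training fold only. The sketch below assumes synthetic data from `make_classification` and uses `LogisticRegression` as a stand-in model; imbalanced-learn also offers a `Pipeline` class that is meant for this purpose.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced toy data (about 10% positives)
X, y = make_classification(n_samples=600, n_features=8, weights=[0.9, 0.1], random_state=123)
rng = np.random.default_rng(123)
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=123)

scores = []
for train_idx, val_idx in cv.split(X, y):
    # oversample the minority class inside the training fold only
    pos = train_idx[y[train_idx] == 1]
    neg = train_idx[y[train_idx] == 0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    fit_idx = np.concatenate([neg, pos, extra])

    model = LogisticRegression(max_iter=1000).fit(X[fit_idx], y[fit_idx])
    # the validation fold keeps its original class distribution
    scores.append(balanced_accuracy_score(y[val_idx], model.predict(X[val_idx])))

print(np.round(scores, 3))
```

Because the validation folds are never resampled, the resulting scores estimate performance on data with the original class imbalance.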

In [None]:
# define a search space for the RSCV
space = dict()
# Log-uniform is useful for searching penalty values as we often explore values at different orders of magnitude, at least as a first step.
space['reg_lambda'] = loguniform(1e-5, 100)
space['reg_alpha'] = loguniform(1e-5, 100) # loguniform distribution
space['max_depth'] = [3, 4, 5, 6, 7, 8, 9] # fixed set of values
space['min_child_weight'] = [1, 2, 3, 4]
space['colsample_bytree'] = uniform(0.5, 0.4) # uniform distribution with lower bound 0.5 and range 0.4 so between 0.5 and 0.9
space['subsample'] = uniform(0.5, 0.4)
space['n_estimators'] = randint(150, 1000) # returns a pseudo random integer number from the given range 
space['learning_rate'] = uniform(0.01, 0.6)


In [None]:
# Randomized Search cross validation (the scoring parameter sets the evaluation metric used to rank the candidates)
# n_iter: sets the number of iterations (the # of random combinations from the search space to try)

xgbc_rscv_os = RandomizedSearchCV(estimator = xgbc_os, 
                               n_iter= 20, 
                               scoring = 'balanced_accuracy', 
                               param_distributions = space, 
                               verbose=2, 
                               cv = rskfold, 
                               n_jobs=-1, 
                               random_state=123,
                               return_train_score = True)

In [None]:
# fitting the model multiple times to find the best hyperparameter combination
##xgbc_rscv_os.fit(X_train_over, y_train_over)

In [None]:
# obtaining the best hyperparameter values from the Randomized Search CV
##xgbc_rscv_os.best_params_

In [None]:
# best score 
##print(xgbc_rscv_os.scoring, xgbc_rscv_os.best_score_)

In [None]:
# setting up the optimal xgb classifier model using the configuration of hyperparameters from the GSCV or RSCV
# balanced accuracy is not a training objective; it is computed after prediction with balanced_accuracy_score
xgbc_opt_os = xgb.XGBClassifier(objective="binary:logistic", scale_pos_weight=1, booster="gbtree",
                                learning_rate=0.44467319491638113, colsample_bytree=0.8397727176311158,
                                max_depth=7, min_child_weight=1, n_estimators=330, subsample=0.6495205707200266,
                                reg_alpha=0.008326637080293384, reg_lambda=0.3828680728500237)

In [None]:
# fitting the XGB classifier model to the training data (final model with optimal hyperparameters from the gscv)
xgbc_opt_os.fit(X_train_over, y_train_over)

In [None]:
# making predictions using the test data
y_pred_xgbc_os = xgbc_opt_os.predict(X_test_over)


In [None]:
# unique values (of a numpy array) of the predictions
np.unique(y_pred_xgbc_os, return_counts=True)


In [None]:
np.unique(y_test_over, return_counts=True)

In [None]:
# computing the balanced accuracy score for the predictions
balanced_accuracy_score(y_test_over, y_pred_xgbc_os)

# Cost-Sensitive Learning



Besides resampling, class imbalance can be handled with cost-sensitive learning. The two approaches are analogous: instead of duplicating or deleting samples, cost-sensitive learning changes the relative weight that each sample carries in the loss function.

* **Upweighting**. Upweighting is analogous to over-sampling and works by increasing the weight of one of the classes keeping the weight of the other class at one.  

* **Down-weighting**. Down-weighting is analogous to under-sampling and works by decreasing the weight of one of the classes keeping the weight of the other class at one.

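The function compute_class_weight() from sklearn.utils.class_weight computes such weights. scikit-learn classifiers that support weighting take them as a dictionary through the class_weight constructor argument, while Keras accepts the same kind of dictionary in model.fit()

    from sklearn.utils import class_weight
    weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
    class_weights = dict(zip(np.unique(y_train), weights))
    model.fit(X_train, y_train, class_weight=class_weights)  # Keras-style fit call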

In [None]:
# Computing the class weights of the data 
##weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y_train), y=y_train)
##class_weights = dict(zip(np.unique(y_train), weights))
##model.fit(X_train, y_train, class_weight=class_weights) # Keras-style call; scikit-learn classifiers take class_weight in the constructor instead

Here the weights are set to ‘balanced’, meaning that each class is weighted inversely to its frequency in the training data. This is a sensible default unless there is a good reason to set the values manually.
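
The ‘balanced’ heuristic documented by scikit-learn is `n_samples / (n_classes * count of each class)`. As a small sketch, it can be reproduced with NumPy on hypothetical labels (90 negatives, 10 positives); the variable names are illustrative.

```python
import numpy as np

# Hypothetical labels: 90 negatives and 10 positives
y = np.array([0] * 90 + [1] * 10)

classes, counts = np.unique(y, return_counts=True)

# "balanced" heuristic: n_samples / (n_classes * count of each class)
weights = len(y) / (len(classes) * counts)
class_weight_dict = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight_dict)
```

The rare class receives a weight of 5 and the frequent class a weight of roughly 0.56, so both classes contribute the same total weight (50 each) to the loss.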

XGBClassifier’s "scale_pos_weight" parameter is used to train a class-weighted XGBoost classifier for imbalanced data

scale_pos_weight = (number of negative instances, i.e. 0s) / (number of positive instances, i.e. 1s)

Generally, **scale_pos_weight is the ratio of the number of negative class instances to the number of positive class instances**.

Suppose, the dataset has 90 observations of the negative class and 10 observations of the positive class, then the ideal value of scale_pos_weight should be 9.

In [None]:
# count # of examples/data points in each class
counter = Counter(y_train)

counter


In [None]:
# estimate scale_pos_weight value, assuming the class labels are 0 and 1
weight = counter[0] / counter[1]

weight


# XGB Boosting Classifier

Blogpost about using an XGB Classifier for imbalanced classification data
https://towardsdatascience.com/how-to-effectively-predict-imbalanced-classes-in-python-e8cd3b5720c4

In [None]:
# initializing the XGBClassifier as an object (default hyperparameters except for scale_pos_weight)
# here we initialize an XGBClassifier with a scale_pos_weight for imbalanced classification data
xgbc = xgb.XGBClassifier(scale_pos_weight= weight)



When to use which splitting strategy:

Generally, k-fold cross validation is the gold-standard for evaluating the performance of a machine learning algorithm on unseen data with k set to 3, 5, or 10.  
Use stratified cross validation to enforce class distributions when there are a large number of classes or an imbalance in instances for each class.

There are 3 different APIs in scikit-learn for evaluating the quality of a model’s predictions:

* Estimator score method: Estimators have a score method providing a default evaluation criterion for the problem they are designed to solve (mean accuracy for classifiers).  

* Scoring parameter: Model-evaluation tools using cross-validation (such as model_selection.cross_val_score and model_selection.GridSearchCV) rely on an internal scoring strategy, which is selected through their scoring argument (here 'balanced_accuracy').  

* Metric functions: The sklearn.metrics module implements functions assessing prediction error for specific purposes, for example balanced_accuracy_score and confusion_matrix for classification.


The list of available scoring metrics is the following one:
https://scikit-learn.org/stable/modules/model_evaluation.html

"binary:logistic": logistic regression for binary classification, output probability

"binary:hinge": hinge loss for binary classification. This makes predictions of 0 or 1, rather than producing probabilities.

Balanced Accuracy = (TP/(TP+FN) + TN/(TN+FP)) / 2
 <->
Balanced Accuracy = (Sensitivity + Specificity) / 2

* When the outcome classes are the same size, accuracy and balanced accuracy are the same but the two metrics differ if the classes are imbalanced   

* Balanced accuracy is a good measure when you have imbalanced data and you are indifferent between correctly predicting the negative and positive classes
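
A worked example makes the difference between the two metrics concrete. The confusion-matrix counts below are hypothetical: a classifier that labels every one of 90 negatives and 10 positives as negative.

```python
def accuracy(tp, fn, tn, fp):
    """Share of all predictions that are correct."""
    return (tp + tn) / (tp + fn + tn + fp)


def balanced_accuracy(tp, fn, tn, fp):
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Hypothetical counts: everything is predicted as the negative class
tp, fn, tn, fp = 0, 10, 90, 0

print(accuracy(tp, fn, tn, fp))
print(balanced_accuracy(tp, fn, tn, fp))
```

Plain accuracy rewards the trivial majority-class prediction with 0.9, whereas balanced accuracy stays at 0.5, the value of a classifier with no skill.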

In [None]:
# CV type 1: k-fold CV
##kfold = KFold(n_splits=5, shuffle=True) 

Stratified Cross Validation

In [None]:
# CV type 2: stratified k-fold CV (due to imbalance in the data set) -> maintains the ratio of instances in each class for each fold
skfold = StratifiedKFold(n_splits=3, shuffle=True, random_state = 123) 
# printing the skfold CV folds
##for train_index, test_index in skfold.split(X_train, y_train):
##    print("TRAIN:", train_index, "TEST:", test_index)

In [None]:
# estimator = xgbc is the xgb classifier model; "cv" determines the cross validation splitting strategy; "scoring" determines the evaluation metric
clf_cv_scores = cross_val_score(xgbc, X_train, y_train, cv=skfold,scoring='balanced_accuracy')
print("Stratified K-fold CV average score: %.2f" % clf_cv_scores.mean())

clf_cv_scores # score per fold


# Hyperparameter Tuning

**Hyperparameter tuning** is the process of determining the right combination of hyperparameters that allows the model to maximize model performance

* **Model parameters**: These are the parameters that are estimated by the model from the given data (e.g. the split points and leaf values of the trees)
* **Model hyperparameters**: These are settings that are not learned from the data but chosen before training (e.g. max_depth or learning_rate). They control how the model parameters are estimated

*Methods:*
* **Random Search**. Define a search space as a bounded domain of hyperparameter values, randomly select a combination of hyperparameters from that domain in each iteration and record the corresponding model performance. After the last iteration, return the best performing combination of hyperparameters.
* **Grid Search**. Define a search space as a grid of hyperparameter values and evaluate every position in the grid. It fits the model on each and every combination of hyperparameter possible and records the model performance. Finally, it returns the best model with the best hyperparameters.

**Pros/Cons:**  

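* Grid search is great for spot-checking combinations that are known to perform well generally, but the number of fits grows multiplicatively with every added hyperparameter.   
* Random search is great for discovery and getting hyperparameter combinations that you would not have guessed intuitively; its cost is fixed by the number of sampled combinations (n_iter) rather than by the size of the search space.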
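
The difference in cost can be illustrated without fitting any model. In this sketch the grid values and sampling ranges are hypothetical; it only counts the candidate combinations each method would evaluate.

```python
import itertools
import random

# Hypothetical grid: two candidate values for each of eight hyperparameters
grid = {
    "colsample_bytree": [0.7, 0.9],
    "subsample": [0.7, 0.9],
    "min_child_weight": [1, 3],
    "learning_rate": [0.01, 0.1],
    "max_depth": [5, 10],
    "n_estimators": [100, 250],
    "reg_lambda": [0.8, 1],
    "reg_alpha": [0.0, 0.2],
}

# Grid search evaluates every combination: 2**8 candidates
grid_candidates = [dict(zip(grid, combo)) for combo in itertools.product(*grid.values())]
print(len(grid_candidates))

# Random search evaluates a fixed number of sampled candidates
rng = random.Random(123)
n_iter = 20
random_candidates = [
    {"learning_rate": rng.uniform(0.01, 0.61), "max_depth": rng.randint(3, 9)}
    for _ in range(n_iter)
]
print(len(random_candidates))
```

With 3-fold cross-validation the grid above would require 768 model fits, while the random search needs 60 regardless of how many hyperparameters are included.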

*Useful links:*  
https://machinelearningmastery.com/hyperparameter-optimization-with-random-search-and-grid-search/  
https://neptune.ai/blog/hyperparameter-tuning-in-python-a-complete-guide-2020

## Grid Search CV

learning_rate: shrinks the contribution of each new tree (step size of the boosting update); smaller values usually need more trees  
min_child_weight: Defines the minimum sum of instance weights required in a child node. Too high values can lead to underfitting  
max_depth: The maximum depth of a tree  
subsample: Denotes the fraction of observations randomly sampled for each tree.  
colsample_bytree: Denotes the fraction of columns randomly sampled for each tree.  
reg_lambda: L2 regularization term on weights (analogous to Ridge regression). Combats overfitting.   
reg_alpha: L1 regularization term on weights (analogous to Lasso regression)  
scale_pos_weight: weight of the positive class, used to deal with class imbalance  
n_estimators: number of boosting rounds, i.e. base learners (usually trees)

In [None]:
# gridsearch cross validation (the scoring parameter sets the evaluation metric used to rank the candidates)

xgbc_gscv = GridSearchCV(estimator = xgbc, 
                         scoring = 'balanced_accuracy', 
                         param_grid = {'colsample_bytree':[0.7,0.9], # [0.6,0.7,0.8,0.9]
                                       'subsample':[0.7,0.9], # [0.6,0.7,0.8,0.9]
                                       "min_child_weight":[1,3], # default = 1.0
                                       'learning_rate': [0.01,0.1] ,  # [0.001,0.01,0.1,0.3]
                                       'max_depth': [5,10], # [2,5,10,20]
                                       'n_estimators': [100,250], # [100,250,500]
                                       'reg_lambda': [0.8,1], # default = 1
                                       'reg_alpha':[0.0,0.2]}, # default = 0
                         verbose=1, 
                         cv = skfold, 
                         n_jobs=-1)

In [None]:
# fitting the model multiple times to find the best hyperparameter combination
##xgbc_gscv.fit(X_train, y_train)

In [None]:
# obtaining the best hyperparameter values from the Gridsearch CV
##xgbc_gscv.best_params_

In [None]:
##print(xgbc_gscv.scoring, xgbc_gscv.best_score_)

In [None]:
##xgbc_gscv.best_estimator_

## Randomized Search CV 

In contrast to GridSearchCV, not all parameter values are tried out, but
rather a fixed number of parameter settings is sampled from the specified
distributions. The number of parameter settings that are tried is
given by n_iter.

**Repeated stratified k-fold CV** involves simply repeating the cross-validation procedure multiple times and reporting the mean result across all folds from all runs  
Averaging over more folds and runs reduces the variance of the estimate, so the mean is expected to be a more reliable estimate of the model's true performance on the dataset; its uncertainty can be quantified with the standard error.
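
A small sketch shows what the splitter produces. The labels are hypothetical (27 negatives, 3 positives) and chosen so that the class counts divide evenly by the number of folds.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# Hypothetical labels: 27 negatives, 3 positives
y = np.array([0] * 27 + [1] * 3)
X = np.zeros((len(y), 1))  # the features are irrelevant for the splitting itself

rskf = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=123)
splits = list(rskf.split(X, y))

# 3 folds x 3 repeats = 9 train/validation splits
print(len(splits))

# stratification keeps the 9:1 class ratio in every validation fold
fold_positives = [int(y[val_idx].sum()) for _, val_idx in splits]
print(fold_positives)
```

Every validation fold contains exactly one positive and nine negatives, and each repeat reshuffles which rows end up together.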

In [None]:
# CV type 3: repeated stratified k-fold CV (due to imbalance in the data set) 
# maintains the ratio of instances in each class for each fold
# Repeated stratified k-fold CV involves simply repeating the cross-validation procedure multiple times 
# and reporting the mean result across all folds from all runs

n_folds=3 # no. of folds
n_repeats=3 # no. of runs

rskfold = RepeatedStratifiedKFold(n_splits=n_folds,n_repeats=n_repeats,
                                 random_state=123)

In [None]:
# estimator = xgbc is the xgb classifier model; "cv" determines the cross validation splitting strategy; "scoring" determines the evaluation metric
clf_rscv_scores = cross_val_score(xgbc, X_train, y_train, cv=rskfold, scoring='balanced_accuracy')
print("Repeated Stratified K-fold CV average score: %.2f" % clf_rscv_scores.mean())

clf_rscv_scores # score per fold and run

The Search Space is a dictionary where names are arguments to the model and values are distributions from which to draw samples

In [None]:
# define a search space for the RSCV
space = dict()
# Log-uniform is useful for searching penalty values as we often explore values at different orders of magnitude, at least as a first step.
space['reg_lambda'] = loguniform(1e-5, 100)
space['reg_alpha'] = loguniform(1e-5, 100) # loguniform distribution
space['max_depth'] = [3, 4, 5, 6, 7, 8, 9] # fixed set of values
space['min_child_weight'] = [1, 2, 3, 4]
space['colsample_bytree'] = uniform(0.5, 0.4) # uniform distribution with lower bound 0.5 and range 0.4 so between 0.5 and 0.9
space['subsample'] = uniform(0.5, 0.4)
space['n_estimators'] = randint(150, 1000) # returns a pseudo random integer number from the given range 
space['learning_rate'] = uniform(0.01, 0.6)


In [None]:
# Randomized Search cross validation (the scoring parameter sets the evaluation metric used to rank the candidates)
# n_iter: sets the number of iterations (the # of random combinations from the search space to try)

xgbc_rscv = RandomizedSearchCV(estimator = xgbc, 
                               n_iter= 20, 
                               scoring = 'balanced_accuracy', 
                               param_distributions = space, 
                               verbose=1, 
                               cv = rskfold, 
                               n_jobs=-1, 
                               random_state=123,
                               return_train_score = True)

In [None]:
# fitting the model multiple times to find the best hyperparameter combination
##xgbc_rscv.fit(X_train, y_train)

In [None]:
# obtaining the best hyperparameter values from the Randomized Search CV
##xgbc_rscv.best_params_

In [None]:
# best score 
##print(xgbc_rscv.scoring, xgbc_rscv.best_score_)

## Fitting the Optimal Model

{'colsample_by_tree': 0.7,
 'learning_rate': 0.01,
 'max_depth': 5,
 'min_child_weight': 1,
 'n_estimators': 250,
 'reg_alpha': 0.0,
 'reg_lambda': 0.8,
 'subsample': 0.7}

In [None]:
# setting up the optimal xgb classifier model using the configuration of hyperparameters from the GSCV or RSCV
# balanced accuracy is not a training objective; it is computed after prediction with balanced_accuracy_score
xgbc_opt = xgb.XGBClassifier(objective="binary:logistic", scale_pos_weight=weight, booster="gbtree",
                             learning_rate=0.44864382150673426, colsample_bytree=0.6577480215811099,
                             max_depth=5, min_child_weight=4, n_estimators=557, subsample=0.8304820510130102,
                             reg_alpha=81.7237442395881, reg_lambda=0.002791929884483455)

In [None]:
# fitting the XGB classifier model to the training data (final model with optimal hyperparameters from the gscv)
xgbc_opt.fit(X_train, y_train)

In [None]:
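# evals_result() only holds metrics if an eval_set was passed to fit(); without one there is nothing to report
##xgbc_opt.evals_result()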

In [None]:
# making predictions using the test data
y_pred_xgbc = xgbc_opt.predict(X_test)


In [None]:
y_pred_xgbc[0:10]

In [None]:
# transforming the true/false values into 0/1 values
##y_pred_xgbc = y_pred_xgbc.astype(int)

##y_pred_xgbc[0:10]

In [None]:
# unique values (of a numpy array) of the predictions
np.unique(y_pred_xgbc, return_counts=True)


In [None]:
np.unique(y_test, return_counts=True)

Model Evaluation

In [None]:
# evaluating the predicted classifications using a confusion matrix

conf_matrix = confusion_matrix(y_test, y_pred_xgbc)

conf_matrix

In [None]:
# computing an accuracy score for the predictions (evaluating predictions)
accuracy_score(y_test, y_pred_xgbc)

In [None]:
# computing the balanced accuracy score for the predictions
balanced_accuracy_score(y_test, y_pred_xgbc)

In [None]:
# feature scores
xgbc_opt.get_booster().get_fscore()

In [None]:
# feature importance plot
fig, ax = plt.subplots(figsize=(12, 12))  # set the figure size before plotting so that it takes effect
xgb.plot_importance(xgbc_opt, ax=ax)
plt.show()

# Logistic Regression Model

Logistic Regression is one of the simplest and most commonly used Machine Learning algorithms for two-class classification. It is easy to implement and can serve as a baseline for binary classification problems.  
  
Estimation: Maximum Likelihood

In [None]:
# creating a logistic regression classifier object 
log_reg = LogisticRegression()

Regular Logistic regression without Gridsearch

In [None]:
# fitting the logistic regression model to the training data (regular LR model without gridsearch)
log_reg.fit(X_train, y_train)

In [None]:
# predicting on the test set
y_pred_lr = log_reg.predict(X_test)

In [None]:
# unique values (of a numpy array) of the predictions
np.unique(y_pred_lr, return_counts=True)

In [None]:
# unique values (of a numpy array) of the validation targets
np.unique(y_test, return_counts=True)

In [None]:
y_pred_lr.shape

Model evaluation regular Log Reg

In [None]:
# computing an accuracy score for the predictions (evaluating predictions)
accuracy_score(y_test, y_pred_lr)

In [None]:
# computing the balanced accuracy score for the predictions
balanced_accuracy_score(y_test, y_pred_lr)

In [None]:
# evaluating the predicted classifications using a confusion matrix

conf_matrix = confusion_matrix(y_test, y_pred_lr)

conf_matrix

The Receiver Operating Characteristic (ROC) curve is a plot of the true positive rate against the false positive rate at varying classification thresholds. It shows the tradeoff between sensitivity and specificity.

In [None]:
# Plotting the Receiver Operating Characteristic (ROC) curve
y_pred_proba = log_reg.predict_proba(X_test)[:, 1]  # predicted probability of the positive class
fpr, tpr, _ = roc_curve(y_test,  y_pred_proba)
auc = roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

# Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes’ theorem with the “naive” assumption of conditional independence between every pair of features given the value of the class variable.
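
The independence assumption means that the class-conditional likelihood factorizes into one term per feature. The sketch below reproduces this by hand for Gaussian features on synthetic data and compares the result with `GaussianNB`. The data, the helper `manual_gnb_proba` and the variance-smoothing constant (chosen to mirror scikit-learn's default `var_smoothing=1e-9`) are assumptions made for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# synthetic two-class data with two features (illustrative assumption)
rng = np.random.default_rng(0)
X_nb = np.vstack([rng.normal(0.0, 1.0, size=(80, 2)),
                  rng.normal(1.5, 1.0, size=(20, 2))])
y_nb = np.array([0] * 80 + [1] * 20)


def manual_gnb_proba(X_fit, y_fit, X_new, var_smoothing=1e-9):
    """Posterior class probabilities under the Gaussian naive Bayes model."""
    eps = var_smoothing * X_fit.var(axis=0).max()  # small stabilizing term
    log_joint = []
    for c in np.unique(y_fit):
        Xc = X_fit[y_fit == c]
        mean, var = Xc.mean(axis=0), Xc.var(axis=0) + eps
        log_prior = np.log(len(Xc) / len(X_fit))
        # independence assumption: sum the per-feature Gaussian log-densities
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var)
                                + (X_new - mean) ** 2 / var, axis=1)
        log_joint.append(log_prior + log_lik)
    log_joint = np.column_stack(log_joint)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # avoid underflow
    proba = np.exp(log_joint)
    return proba / proba.sum(axis=1, keepdims=True)


manual_proba = manual_gnb_proba(X_nb, y_nb, X_nb)
sklearn_proba = GaussianNB().fit(X_nb, y_nb).predict_proba(X_nb)
print(np.allclose(manual_proba, sklearn_proba, atol=1e-6))
```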

In [None]:
# initializing a Gaussian Naive Bayes Classifier
gnb = GaussianNB()

In [None]:
# fitting the model to the training data
gnb.fit(X_train, y_train)

In [None]:
# obtaining predictions
y_pred_nb = gnb.predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
% (X_test.shape[0], (y_test != y_pred_nb).sum()))

In [None]:
# unique values (of a numpy array) of the predictions
np.unique(y_pred_nb, return_counts=True)

Model Evaluation

In [None]:
# computing the balanced accuracy score for the predictions
balanced_accuracy_score(y_test, y_pred_nb)

In [None]:
# computing an accuracy score for the predictions (evaluating predictions)
accuracy_score(y_test, y_pred_nb)

In [None]:
# evaluating the predicted classifications using a confusion matrix

conf_matrix = confusion_matrix(y_test, y_pred_nb)

conf_matrix

# Final model with all Training data for Server submission

In [None]:
# we train a final model on 5000 data points
# labels 
y_train = train_final["Ownership_Indicator"]


# y_test is unknown

print(y_train)

In [None]:
# comparing column names of the final training data and the final test data
train_final.columns.isin(test_final.columns)

In [None]:
# returning the columns only present in the training set but not in the test set
train_final.columns[~train_final.columns.isin(test_final.columns)]

In [None]:
all_features_test = test_final.columns[1:test_final.columns.shape[0]]

all_features_test = enc_features.drop("x0_Phlebology") # dropping the feature column not present in the training data

all_features_test

In [None]:
# features (manually) 5000 data points
X_train = train_final.drop(['Ownership_Indicator', 'x0_Chiropractic Providers',
       'x0_Medical Genetics', 'x0_Neuromusculoskeletal Medicine & OMM',
       'x0_Neuromusculoskeletal Medicine, Sports Medicine',
       'x0_Nuclear Medicine', 'x0_Oral & Maxillofacial Surgery',
       'x1_Charitable Contribution',
       'x1_Compensation for serving as faculty or as a speaker for an accredited or certified continuing education program',
       'x1_Current or prospective ownership or investment interest',
       'x2_Combination'], axis=1)
#X_train = X_train[enc_features] # subsetting features

# 1000 data points for test set features
X_test = test_final[all_features_test]
print(X_train, X_test)

print(X_train.shape, X_test.shape)

In [None]:
X_train.columns

In [None]:
X_test.columns

Fitting an XGBClassifier model on 5000 data points

In [None]:
# count # of examples/data points in each class
counter = Counter(y_train)

counter


In [None]:
# estimate scale_pos_weight value, assuming the class labels are 0 and 1
weight = counter[0] / counter[1]

weight
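
As a quick sanity check of the ratio above, here is a toy example with invented labels (not the competition data): with 900 negatives and 100 positives, each positive example should count nine times as much as a negative one.

```python
from collections import Counter

# toy labels: 900 negatives and 100 positives (illustrative assumption)
labels_demo = [0] * 900 + [1] * 100

counts_demo = Counter(labels_demo)

# ratio of negative to positive examples, used as scale_pos_weight
scale_pos_weight_demo = counts_demo[0] / counts_demo[1]

print(counts_demo)
print(scale_pos_weight_demo)
```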


In [None]:
# initializing the XGBClassifier as an object (using the default hyperparameters)
# here we initialize an XGBClassifier with a scale_pos_weight for imbalanced classification data
xgbc_final = xgb.XGBClassifier(scale_pos_weight=weight)



## Final Random Search & Model fit

In [None]:
# CV type 3: repeated stratified k-fold CV (due to imbalance in the data set) 
# maintains the ratio of instances in each class for each fold
# Repeated stratified k-fold CV involves simply repeating the cross-validation procedure multiple times 
# and reporting the mean result across all folds from all runs

n_folds=3 # no. of folds
n_repeats=3 # no. of runs

rskfold = RepeatedStratifiedKFold(n_splits=n_folds,n_repeats=n_repeats,
                                 random_state=123)
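
To see what the splitter does, the following sketch runs it on made-up labels (an assumption for illustration) and checks the share of positives in every test fold. With 3 folds and 3 repeats there are 9 train/test splits in total.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

# toy labels: 90 negatives and 30 positives, i.e. 25% positives (illustrative assumption)
y_cv_demo = np.array([0] * 90 + [1] * 30)
X_cv_demo = np.zeros((len(y_cv_demo), 1))  # the features are irrelevant for the split itself

cv_demo = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=123)

# share of positive labels in each test fold
positive_share = [y_cv_demo[test_idx].mean()
                  for _, test_idx in cv_demo.split(X_cv_demo, y_cv_demo)]

print(len(positive_share))
print(positive_share)
```

Every test fold keeps the overall class ratio, which is what makes the balanced accuracy estimates comparable across folds.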

In [None]:
# estimator: xgbc_final (the XGB classifier); "cv" sets the cross-validation splitting strategy; "scoring" sets the evaluation metric
clf_rscv_scores = cross_val_score(xgbc_final, X_train, y_train, cv=rskfold, scoring='balanced_accuracy')
print("Repeated Stratified K-fold CV average score: %.2f" % clf_rscv_scores.mean())

clf_rscv_scores # score per fold and run

In [None]:
# define a search space for the RSCV
space = dict()
# Log-uniform is useful for searching penalty values as we often explore values at different orders of magnitude, at least as a first step.
space['reg_lambda'] = loguniform(1e-5, 100)
space['reg_alpha'] = loguniform(1e-5, 100) # loguniform distribution
space['max_depth'] = [3, 4, 5, 6, 7, 8, 9] # fixed set of values
space['min_child_weight'] = [1, 2, 3, 4]
space['colsample_bytree'] = uniform(0.5, 0.4) # uniform distribution with lower bound 0.5 and width 0.4, i.e. values between 0.5 and 0.9
space['subsample'] = uniform(0.5, 0.4)
space['n_estimators'] = randint(150, 1000) # returns a pseudo random integer number from the given range 
space['learning_rate'] = uniform(0.01, 0.6) # values between 0.01 and 0.61
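
The scipy distributions are parameterized in ways that are easy to misread: `uniform(loc, scale)` covers `[loc, loc + scale]`, and the upper bound of `randint(low, high)` is exclusive. A short sketch that draws samples and prints the observed ranges (the variable names and sample sizes are arbitrary choices for illustration):

```python
from scipy.stats import loguniform, randint, uniform

# draw samples from the same kinds of distributions used in a search space
samples_uniform = uniform(0.5, 0.4).rvs(size=1000, random_state=0)
samples_log = loguniform(1e-5, 100).rvs(size=1000, random_state=0)
samples_int = randint(150, 1000).rvs(size=1000, random_state=0)

print("uniform(0.5, 0.4):", samples_uniform.min(), samples_uniform.max())
print("loguniform(1e-5, 100):", samples_log.min(), samples_log.max())
print("randint(150, 1000):", samples_int.min(), samples_int.max())
```

Because the log-uniform distribution is uniform on a logarithmic scale, roughly as many samples fall between 1e-5 and 1e-4 as between 10 and 100.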


In [None]:
# Randomized search cross-validation (the scoring parameter sets the evaluation metric used to rank the candidates)
# n_iter: sets the number of iterations (the # of random combinations from the search space to try)

xgbc_rscv = RandomizedSearchCV(estimator = xgbc_final, 
                               n_iter= 20, 
                               scoring = 'balanced_accuracy', 
                               param_distributions = space, 
                               verbose=1, 
                               cv = rskfold, 
                               n_jobs=-1, 
                               random_state=123,
                               return_train_score = True)

In [None]:
# fitting the model multiple times to find the best hyperparameter combination
##xgbc_rscv.fit(X_train, y_train)

In [None]:
# obtaining the best hyperparameter values from the Randomized Search CV
##xgbc_rscv.best_params_

In [None]:
# best score 
##print(xgbc_rscv.scoring, xgbc_rscv.best_score_)

## Fitting the Optimal Model

{'colsample_by_tree': 0.7,
 'learning_rate': 0.01,
 'max_depth': 5,
 'min_child_weight': 1,
 'n_estimators': 250,
 'reg_alpha': 0.0,
 'reg_lambda': 0.8,
 'subsample': 0.7}

In [None]:
# setting up the optimal xgb classifier model using the configuration of hyperparameters from the GSCV or RSCV
# note: balanced accuracy is an evaluation metric, not a training objective, so it is not passed to the constructor
xgbc_final_opt = xgb.XGBClassifier(
    objective="binary:logistic",
    booster="gbtree",
    scale_pos_weight=weight,
    learning_rate=0.44864382150673426,
    colsample_bytree=0.6577480215811099,
    max_depth=5,
    min_child_weight=4,
    n_estimators=557,
    subsample=0.8304820510130102,
    reg_alpha=81.72374423958811,
    reg_lambda=0.002791929884483455,
)

In [None]:
# fitting the XGB classifier model to the training data (final model with optimal hyperparameters from the gscv)
xgbc_final_opt.fit(X_train, y_train)

In [None]:
# making predictions using the test data
y_pred_xgbc_final = xgbc_final_opt.predict(X_test)


In [None]:
# unique values (of a numpy array) of the predictions
np.unique(y_pred_xgbc_final, return_counts=True)


# Exporting the predictions for a submission

Export predictions into csv file  
• Format: id, prediction  
• Predictions must be 0 or 1 (not 0.5, not ‘Yes’, not ‘FALSE’)  
• Must contain all instances of the original test dataset
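
The format rules above can be checked automatically before exporting. The helper below is a hypothetical sketch (the function `check_submission` and the toy data frames are not part of the competition tooling) that returns a list of detected problems, which is empty for a valid submission.

```python
import pandas as pd


def check_submission(submission, expected_ids):
    """Return a list of format problems; an empty list means the checks passed."""
    problems = []
    if list(submission.columns) != ["id", "prediction"]:
        problems.append("columns must be exactly ['id', 'prediction']")
        return problems
    if not submission["prediction"].isin([0, 1]).all():
        problems.append("predictions must be 0 or 1")
    if submission["id"].duplicated().any():
        problems.append("ids must be unique")
    if set(submission["id"]) != set(expected_ids):
        problems.append("ids must match the original test data set")
    return problems


# toy examples (illustrative assumption)
good_submission = pd.DataFrame({"id": [1, 2, 3], "prediction": [0, 1, 0]})
bad_submission = pd.DataFrame({"id": [1, 2, 3], "prediction": [0, 0.5, 1]})

print(check_submission(good_submission, expected_ids=[1, 2, 3]))
print(check_submission(bad_submission, expected_ids=[1, 2, 3]))
```

For the notebook's own data, the expected ids would be the `Physician_ID` column of the test set.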

In [None]:
test_final.columns

In [None]:
final_preds = y_pred_xgbc_final

In [None]:
# storing the ids and the corresponding predictions as a combined dataframe

submission = pd.DataFrame()

submission["id"] = test_final["Physician_ID"]

submission["prediction"] = final_preds


submission



In [None]:
# storing the submission data in a csv file
submission.to_csv('submission_team_sgs_4.csv', index=False) 