Your model will be evaluated using two metrics: profit @ top-20 and AUC. The reason for this is to stay in line with a more realistic setting. E.g., one can imagine data scientists in a team arguing to use AUC and optimize for that. However, as seen in the course, for this scenario we also imagine management arguing that there is not enough budget (in terms of time and money) to contact a lot of people (or hand out a lot of promotions). Hence, they have come up with the following: based on the top-k would-be churners as predicted by your model, sum some proxy of "retained profitability" in case the customer was indeed a churner, or zero otherwise.

As a proxy of profitability, the feature average cost min was deemed to be a good value. Based on the size of the test set, k=20 was deemed to be a good choice. Hence, management cares about optimizing this metric.
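
The metric described above can be sketched as a small function. This is only an illustration under assumptions: the name `profit_at_top_k`, the tie handling (ties keep input order) and the toy numbers are made up here, and the leaderboard's actual implementation may differ.

```python
import numpy as np


def profit_at_top_k(y_true, y_score, value, k=20):
    """Sum `value` over the true churners among the k highest-scored customers."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    value = np.asarray(value, dtype=float)
    # Indices of the k largest scores; a stable sort keeps ties in input order.
    top_k = np.argsort(-y_score, kind="stable")[:k]
    # Non-churners in the top k contribute zero.
    return float(np.sum(value[top_k] * (y_true[top_k] == 1)))


# Toy data: five customers, k=3 (hypothetical values)
y_true = [1, 0, 1, 0, 1]
y_score = [0.9, 0.8, 0.7, 0.2, 0.1]
value = [0.10, 0.50, 0.30, 0.20, 0.40]
result = profit_at_top_k(y_true, y_score, value, k=3)
print(result)
```

Only the ordering of the scores matters for this metric, and only the first k positions of that ordering.
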
Note that only about half of the test set is used for the "public" leaderboard. That means that the score you see on the leaderboard is computed using this part of the test set only (you don't know which half). Later in the semester, submissions are frozen and the results on the "hidden" part will be revealed.

Also, whilst you can definitely try, the goal is not to "win", but to help you reflect on your model's results, see how others are doing, etc.

Objectives:

Some groups prefer to write their final report using Jupyter Notebook, which is fine too, as long as it is readable top-to-bottom

You can use any predictive technique/approach you want, though focus on the whole process: general setup, critical thinking, and the ability to get and validate an outcome

You're free to use unsupervised technique for your data exploration part, too. When you decide to build a black box model, including some interpretability techniques to explain it is a plus

Any other assumptions, insights or thoughts can be included as well: the idea is to take what we've seen in class, get your hands dirty and try out what we've seen

Perform a critical review of the evaluation metric chosen by management. How in line is it with AUC? What would you have picked instead? Were there particular issues with this chosen metric, in your view?

In [None]:
# !pip install shap

In [None]:
import pandas as pd
import os
import plotly.graph_objects as go
import matplotlib.pyplot as plt
import numpy as np
import shap
import seaborn as sns

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
from sklearn.model_selection import cross_val_score, GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.metrics import roc_auc_score, RocCurveDisplay
from sklearn.inspection import permutation_importance
from imblearn.ensemble import BalancedRandomForestClassifier

from sklearn.svm import SVC

import xgboost as xgb


pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [None]:
# Initialising
TRAIN_SET_FRAC = 0.8
SEED = 42
TARGET_VAR = "target"
DROP_VARS = ['Connect_Date', 'id'] # TBC
KFOLD = 5

**Loading Data**

In [None]:
# GitHub URLs to fetch data from
url_train = 'https://raw.githubusercontent.com/hello-bob/AA_P1/main/data/train.csv'
url_test = 'https://raw.githubusercontent.com/hello-bob/AA_P1/main/data/test.csv'

# Read train and test data
train_data = pd.read_csv(url_train, sep = ',', skipinitialspace = True, engine = 'python')
train_data = train_data.drop(columns=DROP_VARS)
test_data  = pd.read_csv(url_test, sep = ',', skipinitialspace = True, engine = 'python')

**Data exploration**

In [None]:
train_data.head()

In [None]:
# Check data types
train_data.info()
test_data.info()

In [None]:
# Basic descriptives
train_data.describe(include='all')

**Data Cleaning**

In [None]:
# Impute missing data before modelling: Can quantitate and put it on the report since 4/5k samples
# Apply on the test set. Train set is ok.

train_data.isnull().any().sort_values(ascending=False) # Columns with missing values: Dropped_calls_ratio, Usage_Band, call_cost_per_min.
train_data[train_data.isnull().any(axis=1)] # 4 cases, 2 churners

imputer_compiled = ColumnTransformer(
    [("numeric_imputer", SimpleImputer(strategy="median",), ["Dropped_calls_ratio", "call_cost_per_min"]),
     ("cat_imputer", SimpleImputer(strategy="most_frequent"), ["Usage_Band"])]
)

# Impute the median for numeric variables first, because the "most_frequent" strategy would impute both numeric and categorical data
train_data[["Dropped_calls_ratio", "call_cost_per_min", "Usage_Band"]] = imputer_compiled.fit_transform(train_data)
test_data[["Dropped_calls_ratio", "call_cost_per_min", "Usage_Band"]] = imputer_compiled.transform(test_data)

# Correcting dtype
train_data[["Dropped_calls_ratio", "call_cost_per_min"]] = train_data[["Dropped_calls_ratio", "call_cost_per_min"]].astype(float)
test_data[["Dropped_calls_ratio", "call_cost_per_min"]] = test_data[["Dropped_calls_ratio", "call_cost_per_min"]].astype(float)
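
A minimal sketch of the same imputation pattern on toy data, assuming made-up column names (`ratio`, `band`, `other`). It shows that the `ColumnTransformer` returns only the listed columns, in transformer order, and that the statistics learned on the training frame are reused for the test frame.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Toy frames with hypothetical column names
train = pd.DataFrame({
    "ratio": [0.1, np.nan, 0.3, 0.5],
    "band": pd.Series(["Low", "Med", np.nan, "Med"], dtype=object),
    "other": [1, 2, 3, 4],  # not listed below, so it is dropped from the output
})
test = pd.DataFrame({
    "ratio": [np.nan],
    "band": pd.Series([np.nan], dtype=object),
    "other": [9],
})

imputer = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["ratio"]),
    ("cat", SimpleImputer(strategy="most_frequent"), ["band"]),
])

cols = ["ratio", "band"]  # same order as the transformers above
train[cols] = imputer.fit_transform(train)  # learn median / mode on train
test[cols] = imputer.transform(test)        # reuse them on test

# The stacked output is an object array, so restore the numeric dtype
train["ratio"] = train["ratio"].astype(float)
test["ratio"] = test["ratio"].astype(float)
print(test)
```
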


**Exploratory**

In [None]:
# [For report] Pie chart about class imbalance (train set) + Percentage churn in categorical variable


In [None]:
# [For report] correlation plot
corr = train_data.corr(numeric_only=True)

fig = go.Figure()
fig.add_trace(
    go.Heatmap(
        x = corr.columns,
        y = corr.index,
        z = np.array(corr),
        text=corr.values,
        texttemplate='%{text:.2f}'
    )
)
fig.update_layout(
    autosize=False,
    width=800,
    height=800,
)
fig.show()

In [None]:
# Print correlations with absolute value > 0.7
"""
All_calls_mins highly correlated to National minutes. It's maybe a sum of National minutes + International mins. This may indicate that
the majority of calls are national. This corresponds to how Nat_call_cost_Sum is highly correlated to actual call cost. Not sure what the
difference between actual call cost and total call cost is. Nat_call_cost_Sum could be an adjustment of actual call cost, correlation 0.999.

Total_call_cost most strongly correlated with International_mins_Sum and Total_Cost. This may suggest that costs are largely driven by 
international calls

Not sure what total calls indicate, maybe it's cost from other telco-related services e.g. broadband, cable tv etc.

Peak_mins_Sum highly correlated with all_calls_mins and national mins. This may make sense since the more minutes of calling, the higher
likelihood of calling during the peak period?

"""
corr_long = train_data.corr(numeric_only=True).melt(ignore_index=False).reset_index(drop=False)
corr_long = corr_long[(corr_long['index'] != corr_long['variable']) & (abs(corr_long['value']) >0.7)]
corr_long.sort_values(by=['index', 'variable'], ascending=True)
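
The melted table above lists every pair twice (A-B and B-A). As an optional variation, the sketch below keeps only the upper triangle of the correlation matrix so each pair shows up once. The data and column names (`a`, `b`, `c`) are synthetic and only for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.1, size=200),  # nearly a copy of "a"
    "c": rng.normal(size=200),                 # unrelated noise
})

corr = df.corr(numeric_only=True)
# Keep only the strict upper triangle so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().reset_index()
pairs.columns = ["var_1", "var_2", "corr"]
# Rows with NaN (if any survive the stack) fail this comparison and drop out.
high = pairs[pairs["corr"].abs() > 0.7]
print(high)
```
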

In [None]:
(round((train_data['International_mins_Sum'] + train_data['National mins']),2) == round(train_data['All_calls_mins'], 2)).sum() #almost all
train_data[['All_calls_mins', 'National mins']]

In [None]:
train_data.select_dtypes('object').columns.to_list()

In [None]:
# [For report] Correlation between categorical variables, using cramer's V
# Generally, 
from scipy.stats.contingency import association

cat_corr_cols = train_data.select_dtypes('object').columns.to_list()
while len(cat_corr_cols) > 0:
    col = cat_corr_cols.pop(0)
    print(f'Correlation for {col}')
    for other_col in cat_corr_cols:
        contingency_tbl = pd.crosstab(train_data[col], train_data[other_col])
        cramer_V = association(contingency_tbl, method="cramer")
        print(f'Association with {other_col}: {cramer_V}')
    print('\n')
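
To help read the Cramér's V values printed above, here is a small sanity check on toy data (column names are made up): a column fully determined by another gives 1, and a column spread evenly within each level gives 0.

```python
import pandas as pd
from scipy.stats.contingency import association

df = pd.DataFrame({
    "x": ["a"] * 10 + ["b"] * 10,
    "same": ["u"] * 10 + ["v"] * 10,       # fully determined by x
    "indep": (["u"] * 5 + ["v"] * 5) * 2,  # evenly spread within each x level
})

v_same = association(pd.crosstab(df["x"], df["same"]), method="cramer")
v_indep = association(pd.crosstab(df["x"], df["indep"]), method="cramer")
print(v_same, v_indep)
```
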
    


In [None]:
pd.crosstab(train_data['tariff'], train_data['Usage_Band'])
pd.crosstab(train_data['Gender'], train_data['Handset'])
pd.crosstab(train_data['high Dropped calls'], train_data['Handset'])

In [None]:
# On the metric by management
train_data.sort_values(by='average cost min', ascending=False).head(20)

In [None]:
# Identifying outliers via isolation forest
from sklearn.ensemble import IsolationForest

outlier_df = (train_data.select_dtypes(include='number')
              .drop(columns=TARGET_VAR)
              .dropna()
              .copy())

iso_forest = IsolationForest(random_state=SEED, n_jobs=-1, contamination=0.05).fit(outlier_df)

In [None]:
outlier_df['outlier_score'] = iso_forest.decision_function(outlier_df) # more negative indicates higher outlier-ness
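
A small illustration of how `decision_function` behaves, on synthetic 2-D data with one planted outlier (everything here is made up for the example). Lower scores mean more anomalous, and `predict` returns -1 for points flagged as outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_demo = np.vstack([inliers, [[8.0, 8.0]]])  # last row is an obvious outlier

forest = IsolationForest(random_state=42, contamination=0.05).fit(X_demo)
scores = forest.decision_function(X_demo)  # lower = more anomalous
labels = forest.predict(X_demo)            # -1 for outliers, 1 for inliers
most_anomalous = int(np.argmin(scores))
print(most_anomalous, scores[most_anomalous], labels[most_anomalous])
```
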

In [None]:
# https://stats.stackexchange.com/questions/404017/how-to-get-top-features-that-contribute-to-anomalies-in-isolation-forest
"""
Outstanding variables contributing to outlier (to the left of 0 on x axis) are AveOffPeak, average cost min, AveWeekend, AveNational and 
Dropped_calls_ratio. These tend to indicate the higher the values, the more of an outlier they are.
"""

# Create shap values and plot them
shap_values = shap.TreeExplainer(iso_forest).shap_values(outlier_df)
shap.summary_plot(shap_values, outlier_df, plot_type='violin')

In [None]:
# Average shap per variable Top few: Weekend_calls_sum, nat_call_cost_sum, dropped_calls, average cost min
# a global measure of feature importance (https://shap.readthedocs.io/en/latest/example_notebooks/overviews/An%20introduction%20to%20explainable%20AI%20with%20Shapley%20values.html)
# These values seem low compared to the values I saw online.
explainer = shap.Explainer(iso_forest, outlier_df)
shap_values = explainer(outlier_df)
shap.plots.bar(shap_values)

In [None]:
# 3 of the top 10 outliers are churners. No discernible pattern between churn and non-churn. Propose to keep all and compare 
# between models which are robust against outliers, and those not.
outlier_df[TARGET_VAR] = train_data[TARGET_VAR].copy()
outlier_df.sort_values(by='outlier_score', ascending=True).head(10).sort_values(by='target')
outlier_df.groupby('target').mean()

**Modelling**

In [None]:
X = train_data.drop(columns=TARGET_VAR)
y = train_data[TARGET_VAR] 

NUM_VARS = train_data.select_dtypes(include='number').drop(columns=TARGET_VAR).columns
CAT_VARS = train_data.select_dtypes(include='object').columns

X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=1/KFOLD, stratify=y, random_state=SEED)

In [None]:
print(NUM_VARS)
print(CAT_VARS)

In [None]:
# Define preprocessors for numerical and categorical features
numerical_preprocessor = Pipeline([
    ("scaler", MinMaxScaler(clip=True))
])

categorical_preprocessor = Pipeline([
    ("onehot", OneHotEncoder(drop="if_binary"))
])

In [None]:
# Combine preprocessors and model
model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("numerical", numerical_preprocessor, NUM_VARS),
        ("categorical", categorical_preprocessor, CAT_VARS)
    ])),
    ("model", SVC(probability=True, random_state=SEED, max_iter=25000))
])
model

In [None]:
%%time
# For SVM

parameters = {'model__kernel':['linear', 'rbf', 'sigmoid'], 
              'model__C':[0.001, 0.01, 0.1, 1, 10, 100, 1000], 
              'model__gamma':[0.001, 0.01, 0.1, 1, 10, 100, 1000],
             'model__class_weight':[None, 'balanced']} # rmb to add the double underscores to allow gridsearch to fit on pipelines
svc_gs_est = GridSearchCV(estimator=model, param_grid=parameters,cv=KFOLD,
                      scoring="roc_auc",n_jobs=-1, refit=True, verbose=10)
svc_gs_est.fit(X_train, y_train)

print("Done")
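
A side note on the grid above: `gamma` has no effect on the linear kernel, so a single dict refits the same linear model once per `gamma` value. The sketch below, on synthetic data with a much smaller, hypothetical grid, passes a list of dicts so that `gamma` is only varied for the RBF kernel.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([("scaler", MinMaxScaler()), ("model", SVC())])
# "<step name>__<parameter>" routes each value to the matching pipeline step.
grid = [
    {"model__kernel": ["linear"], "model__C": [0.1, 1, 10]},
    {"model__kernel": ["rbf"], "model__C": [0.1, 1, 10], "model__gamma": [0.01, 0.1]},
]

search = GridSearchCV(pipe, param_grid=grid, cv=3, scoring="roc_auc")
search.fit(X_demo, y_demo)
best = search.best_params_
n_candidates = len(search.cv_results_["params"])  # 3 linear + 6 rbf
print(n_candidates, best)
```
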

In [None]:
svc_gs_results = pd.DataFrame(data=svc_gs_est.cv_results_)
os.makedirs('output', exist_ok=True) # make sure the export folder exists
svc_gs_results.sort_values(by='rank_test_score', ascending = True).to_csv('output/svc_gridsearchcv_minmaxScaler.csv')
svc_gs_results.sort_values(by='rank_test_score', ascending = True)

In [None]:
# Best params
svc_gs_est.best_params_

In [None]:
# Test on test set
"""
Goes up fast, but then struggles to improve TPR without increasing FPR
"""
predicted_probabilities = svc_gs_est.predict_proba(X_test)
auc_score = roc_auc_score(y_test, predicted_probabilities[:, 1])
print(f"AUC Score: {round(auc_score, 3)}") # 0.9265646948649914

RocCurveDisplay.from_estimator(svc_gs_est, X_test, y_test)

In [None]:
# Feature importance: To note, some numerical variables share permutation importance due to high correlation
# Handset, international mins sum and usage band are at the top. As above:

"""
All_calls_mins highly correlated to National minutes. It's a sum of National minutes + International mins maybe. [International and all call mins quite impt]
Nat_call_cost_Sum could be an adjustment of actual call cost, correlation 0.999. [Actual call cost much more important than nat call cost]
Total_call_cost most strongly correlated with International_mins_Sum and Total_Cost [all 3 relatively impt]
Peak_mins_Sum highly correlated with all_calls_mins and national mins [these 3 all low feature importance]

"""
perm_importance = permutation_importance(svc_gs_est, X_test, y_test)

features = svc_gs_est.feature_names_in_

sorted_idx = perm_importance.importances_mean.argsort()
plt.barh(features[sorted_idx], perm_importance.importances_mean[sorted_idx])
plt.xlabel("Permutation Importance")
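
For reference, a self-contained sketch of permutation importance on synthetic data, where one feature carries all the signal and the other is noise. The model (`LogisticRegression`), `n_repeats` and `random_state` are choices made for this example, not taken from the analysis above.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
signal = rng.normal(size=500)
noise = rng.normal(size=500)
X_demo = np.column_stack([signal, noise])
y_demo = (signal > 0).astype(int)  # target depends on the first column only

clf = LogisticRegression().fit(X_demo, y_demo)
result = permutation_importance(
    clf, X_demo, y_demo, scoring="roc_auc", n_repeats=10, random_state=0
)
# Mean drop in AUC when each column is shuffled
importances = result.importances_mean
print(importances)
```

Fixing `random_state` makes the shuffles, and therefore the plot, reproducible between runs.
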

In [None]:
# Plotting predicted probabilities against metric avg cost min
# 50% seems to be a clean cut-off
plt.scatter(y=predicted_probabilities[:, 1], x=X_test['average cost min'], alpha=0.6, )
plt.xlabel("average cost min")
plt.ylabel("pred probability")

In [None]:
# Plotting residuals against metric avg cost min
# Most individuals where the algorithm predicts within a limit of +/-0.5 do not have high average cost per minute.
# The high value customers were "false positives" in a sense, if we were to use a decision boundary of 0.5.
plt.scatter(y=(y_test-predicted_probabilities[:, 1]), x=X_test['average cost min'], alpha=0.6)
plt.xlabel("average cost min")
plt.ylabel("residuals")

In [None]:
# Using 25% as threshold 
pd.crosstab((predicted_probabilities[:, 1]>=0.25).astype(int), y_test)

In [None]:
# Retrain best model on all train data: Set up the params accordingly
numerical_preprocessor = Pipeline([
    ("scaler", MinMaxScaler())
])

categorical_preprocessor = Pipeline([
    ("onehot", OneHotEncoder(drop="if_binary"))
])

best_svc_model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("numerical", numerical_preprocessor, NUM_VARS),
        ("categorical", categorical_preprocessor, CAT_VARS)
    ])),
    ("model", SVC(probability=True, random_state=SEED, C=1000, gamma=0.001, class_weight='balanced', kernel="rbf"))
])

best_svc_model.fit(X, y)

In [None]:
# For submission
test_data_sub = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':best_svc_model.predict_proba(test_data)[:,1]})
test_data_sub.to_csv('output/svc_submission.csv', header=True, index=False)

In [None]:
# For submission: Adjust for those cases with high average cost min
test_data_sub_adjusted = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':best_svc_model.predict_proba(test_data)[:,1],
                                           'average cost min':test_data['average cost min']})
test_data_sub_adjusted.plot(y="PRED", x="average cost min", kind="scatter")
test_data_sub_adjusted['quantile'] = (pd.qcut(test_data_sub_adjusted["average cost min"].values, 10,labels=False) + 1) / 10
test_data_sub_adjusted['PRED'] = test_data_sub_adjusted['PRED'] * test_data_sub_adjusted['quantile']
test_data_sub_adjusted['PRED'].describe()
test_data_sub_adjusted[['ID', 'PRED']].to_csv('output/svc_submission_adjusted1.csv', header=True, index=False)
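
A toy version of the decile adjustment above, with synthetic values. Every customer gets the same predicted probability so the effect of the weighting is easy to see. Note that the adjusted score is a ranking score and no longer a probability.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    "PRED": np.full(100, 0.8),  # identical churn probability for everyone
    "average cost min": np.linspace(0.05, 0.50, 100),
})
# Decile rank mapped to 0.1, 0.2, ..., 1.0
demo["quantile"] = (pd.qcut(demo["average cost min"], 10, labels=False) + 1) / 10
demo["PRED_adjusted"] = demo["PRED"] * demo["quantile"]

# With equal probabilities, the top of the ranking is the highest-value decile
top = demo.sort_values("PRED_adjusted", ascending=False).head(10)
print(top[["average cost min", "PRED_adjusted"]])
```
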

In [None]:
# For submission: use prediction threshold of 25%.  
test_data_sub_hardadjusted = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':best_svc_model.predict_proba(test_data)[:,1]})
test_data_sub_hardadjusted['PRED'] = (test_data_sub_hardadjusted['PRED'] >=0.25).astype(int)
test_data_sub_hardadjusted.to_csv('output/svc_submission_hardadjusted.csv', header=True, index=False)

In [None]:
# For submission: use prediction threshold of 25%. Sort values by average cost min 
test_data_sub_hardadjusted_sorted = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':best_svc_model.predict_proba(test_data)[:,1],
                                           'average cost min':test_data['average cost min']})
test_data_sub_hardadjusted_sorted['PRED'] = (test_data_sub_hardadjusted_sorted['PRED'] >=0.25).astype(int)
test_data_sub_hardadjusted_sorted = test_data_sub_hardadjusted_sorted.sort_values(by=['PRED', 'average cost min'], ascending=False)
test_data_sub_hardadjusted_sorted = test_data_sub_hardadjusted_sorted.drop(columns=['average cost min'])
test_data_sub_hardadjusted_sorted.to_csv('output/svc_submission_hardadjusted_sorted.csv', header=True, index=False)

In [None]:
# For submission: use prediction threshold of 25% + adjust for cases with high average cost min. 
test_data_sub_adjusted2 = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':best_svc_model.predict_proba(test_data)[:,1],
                                           'average cost min':test_data['average cost min']})
test_data_sub_adjusted2['PRED'] = (test_data_sub_adjusted2['PRED'] >=0.25).astype(int)
test_data_sub_adjusted2['quantile'] = (pd.qcut(test_data_sub_adjusted2["average cost min"].values, 10,labels=False) + 1) / 10
test_data_sub_adjusted2['PRED'] = test_data_sub_adjusted2['PRED'] * test_data_sub_adjusted2['quantile'] 

test_data_sub_adjusted2.plot(x="quantile", y="PRED", kind="scatter")
test_data_sub_adjusted2 = test_data_sub_adjusted2.drop(columns=['quantile', 'average cost min'])
test_data_sub_adjusted2.to_csv('output/svc_submission_adjusted2.csv', header=True, index=False)

**XGBoost**

In [None]:
# Basic preprocessing
numeric_transformer = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy="median"))
    ]
)

categorical_transformer = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder(handle_unknown='ignore'))
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, NUM_VARS),
        ("cat", categorical_transformer, CAT_VARS)
    ]
)

X_preprocessed = preprocessor.fit_transform(X)
# Alternatively split train-test before, do preprocessing on training data (fit_transform) then transform test data


In [None]:
X_train, X_test, y_train, y_test= train_test_split(X_preprocessed, y, test_size=0.2, stratify=y, random_state=420)

Grid search

In [None]:
# Calculate class ratio (negatives / positives). Passed as scale_pos_weight to offset class imbalance during training
ratio = float(y_train.value_counts()[0]) / y_train.value_counts()[1]
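
The ratio is simply the number of negatives per positive. A quick check on a made-up 90/10 target:

```python
import pandas as pd

y_demo = pd.Series([0] * 90 + [1] * 10)  # toy target: 10% churners
counts = y_demo.value_counts()
ratio_demo = counts[0] / counts[1]  # negatives per positive
print(ratio_demo)
```
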

In [None]:
# First do a random search over a large hyperparameter space, trying 1000 random models 
gbm_param_grid_large = {  'n_estimators': np.arange(5, 101, 1)
                        , 'max_depth': range(2, 13)
                        , 'learning_rate': np.arange(0.001, 5, 0.01)
                        , 'subsample': [0.2, 0.4, 0.6, 0.8, 1]
                        , 'colsample_bytree': [0.2, 0.5, 0.8, 1]
                        , 'reg_lambda': [0, 1, 5, 10, 100]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=1000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_train, y_train)
print("Best parameters found: ",randomized_auc.best_params_)
print("Best mean CV AUC found: ", randomized_auc.best_score_)
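
As an alternative to enumerating values with `np.arange`, `RandomizedSearchCV` also accepts scipy distributions in `param_distributions`. The sketch below only draws a few candidates with `ParameterSampler` to show what gets sampled. The ranges are hypothetical, in particular the log-uniform learning rate between 0.001 and 1.

```python
from scipy.stats import loguniform, randint
from sklearn.model_selection import ParameterSampler

param_distributions = {
    "n_estimators": randint(5, 101),         # integers 5..100
    "max_depth": randint(2, 13),             # integers 2..12
    "learning_rate": loguniform(1e-3, 1.0),  # hypothetical range
    "subsample": [0.6, 0.8, 1.0],            # lists are sampled uniformly
}

samples = list(ParameterSampler(param_distributions, n_iter=5, random_state=420))
for s in samples:
    print(s)
```

A log-uniform draw spends as many trials between 0.001 and 0.01 as between 0.1 and 1, which a linear grid does not.
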

In [None]:
# Now do another random search over a smaller hyperparameter space around the previously found "best" values
gbm_param_grid_medium = {  'n_estimators': np.arange(50, 120, 1)
                         , 'max_depth': range(6, 15)
                         , 'learning_rate': np.arange(0.001, 5, 0.01)
                         , 'subsample': [0.6, 0.8, 1]
                         , 'colsample_bytree': [0.8, 1]
                         , 'reg_lambda': [0, 1, 2, 5]
                         }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_medium
                                    , n_iter=1000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_train, y_train)
print("Best parameters found: ",randomized_auc.best_params_)
print("Best mean CV AUC found: ", randomized_auc.best_score_)

In [None]:
# We got quite different results, but different hyperparameter combinations can give similar results.
# Finally a grid-search that is not random around the previous "best" values.
gbm_param_grid = {  'n_estimators': [70, 80, 100, 120]
                  , 'max_depth': [8, 10, 12]
                  , 'learning_rate': [0.05, 0.1, 0.15, 0.2]
                  , 'subsample': [0.8, 1]
                  , 'colsample_bytree': [0.8, 1]
                  , 'reg_lambda': [0, 1, 2]
                  }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio)
grid_auc = GridSearchCV(  estimator=gbm
                        , param_grid=gbm_param_grid
                        , scoring='roc_auc'
                        , cv=5
                        , n_jobs=-1
                        , verbose=1
                        )

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
grid_auc.fit(X_train, y_train)
print("Best parameters found: ", grid_auc.best_params_)
print("Best mean CV AUC found: ", grid_auc.best_score_)

In [None]:
# Save an output based on the previous (last) grid search
pd.DataFrame(grid_auc.cv_results_).sort_values(by='rank_test_score', ascending = True).to_csv('output/xgb_gridsearch.csv')

In [None]:
# Get an idea about how the model performs on the test set
# Test AUC is close to the "best" model AUC on the cross-validated training set which is a good indication of not suffering from overfitting
predicted_probabilities = grid_auc.predict_proba(X_test)
auc_score = roc_auc_score(y_test, predicted_probabilities[:, 1])
auc_score

In [None]:
# Re-train the tuned model on the entire training data (not just on the 80% of it)
best_model_xgb = Pipeline([
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUM_VARS),
            ("cat", categorical_transformer, CAT_VARS)
            ]
    )),
    ("xgboost model", xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio, colsample_bytree=1, learning_rate=0.05, max_depth=12, n_estimators=100, reg_lambda=0, subsample=0.8))
])


best_model_xgb.fit(X, y)
pred_xgb = pd.DataFrame(best_model_xgb.predict_proba(test_data), columns=["0", "1"])

# Creating data for submission
test_data_sub = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':pred_xgb["1"]})
test_data_sub

In [None]:
# Exporting results
test_data_sub.to_csv('output/xgboost_pred_submission_v2.csv', header=True, index=False)

Sample weights based on average cost min (?) - experimental code, can be deleted later, not sure exactly how it works

In [None]:
# First do a random search over a large hyperparameter space, trying 1000 random models 
gbm_param_grid_large = {  'n_estimators': np.arange(5, 101, 1)
                        , 'max_depth': range(2, 13)
                        , 'learning_rate': np.arange(0.001, 5, 0.01)
                        , 'subsample': [0.2, 0.4, 0.6, 0.8, 1]
                        , 'colsample_bytree': [0.2, 0.5, 0.8, 1]
                        , 'reg_lambda': [0, 1, 5, 10, 100]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=1000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_train, y_train, sample_weight=X_train[:,24])
print("Best parameters found: ",randomized_auc.best_params_)
print("Best mean CV AUC found: ", randomized_auc.best_score_)

In [None]:
# Now do another random search over a smaller hyperparameter space around the previously found "best" values
gbm_param_grid_large = {  'n_estimators': np.arange(20, 60, 1)
                        , 'max_depth': range(5, 11)
                        , 'learning_rate': np.arange(0.05, 5.05, 0.05)
                        , 'subsample': [0.6, 0.8, 1]
                        , 'colsample_bytree': [0.6, 0.8, 1]
                        , 'reg_lambda': [0, 1, 5, 10]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=1000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_train, y_train, sample_weight=X_train[:,24])
print("Best parameters found: ",randomized_auc.best_params_)
print("Best mean CV AUC found: ", randomized_auc.best_score_)

In [None]:
# Finally a grid-search that is not random around the previous "best" values.
gbm_param_grid = {  'n_estimators': [30, 50, 100]
                  , 'max_depth': [8, 10, 12]
                  , 'learning_rate': [0.01, 0.05, 0.1, 0.2]
                  , 'subsample': [0.8, 1]
                  , 'colsample_bytree': [0.8, 1]
                  , 'reg_lambda': [0, 1, 2]
                  }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio)
grid_auc = GridSearchCV(  estimator=gbm
                        , param_grid=gbm_param_grid
                        , scoring='roc_auc'
                        , cv=5
                        , n_jobs=-1
                        , verbose=1
                        )

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
grid_auc.fit(X_train, y_train, sample_weight=X_train[:,24])
print("Best parameters found: ", grid_auc.best_params_)
print("Best mean CV AUC found: ", grid_auc.best_score_)

In [None]:
# Save an output based on the previous (last) grid search
pd.DataFrame(grid_auc.cv_results_).sort_values(by='rank_test_score', ascending = True).to_csv('output/xgb_gridsearch_weighted.csv')

In [None]:
# Get an idea about how the model performs on the test set
# Test AUC is close to the "best" model AUC on the cross-validated training set which is a good indication of not suffering from overfitting
predicted_probabilities = grid_auc.predict_proba(X_test)
auc_score = roc_auc_score(y_test, predicted_probabilities[:, 1])
auc_score

In [None]:
# Re-train the tuned model on the entire training data (not just on the 80% of it)
best_model_xgb = Pipeline([
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUM_VARS),
            ("cat", categorical_transformer, CAT_VARS)
            ]
    )),
    ("xgboost_model", grid_auc.best_estimator_)
])


best_model_xgb.fit(X, y) # , xgboost_model__sample_weight=np.array(X['average cost min'])
pred_xgb = pd.DataFrame(best_model_xgb.predict_proba(test_data), columns=["0", "1"])

# Creating data for submission
test_data_sub = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':pred_xgb["1"]})
test_data_sub

In [None]:
# Exporting results
test_data_sub.to_csv('output/xgboost_pred_submission_v3_weighted.csv', header=True, index=False)

**Random_Forest**

In [None]:
# Creating the balanced Random Forest classifier
rf_classifier = BalancedRandomForestClassifier(random_state=42, sampling_strategy="all", replacement=True, bootstrap=False)

Grid search

In [None]:
# Let's start with a broader grid search to find the best settings
n_estimators=np.arange(10,200,10)
criterion=["entropy","gini"]
max_features=["log2","sqrt"]
class_weight=[None, "balanced", "balanced_subsample"]
max_depth=[None,3,7,11]
min_samples_split=[2,3,4] 
min_samples_leaf=[1,2]

param_grid={"n_estimators":n_estimators,"criterion":criterion,"max_features":max_features,"class_weight":class_weight,"max_depth":max_depth,"min_samples_split":min_samples_split,"min_samples_leaf":min_samples_leaf}

# to get the results
rf_grid=GridSearchCV(estimator=rf_classifier,param_grid=param_grid,cv=3,verbose=0,n_jobs=4)
# Fit once, using column 24 of the preprocessed matrix (intended to be 'average cost min') as sample weights
rf_grid.fit(X_train, y_train, sample_weight=X_train[:,24])
print("Best parameters found: ",rf_grid.best_params_)
print("best score found: ", rf_grid.best_score_)

In [None]:
# now we do it one more time around the best parameters
n_estimators=np.arange(110,130,1)

param_grid2={"n_estimators":n_estimators}

# Apart from the criterion, all the best settings found are the default ones
rf_classifier2 = BalancedRandomForestClassifier(criterion="entropy", random_state=42, sampling_strategy="all", replacement=True, bootstrap=False)

# to get the results
rf_grid2 = GridSearchCV(estimator=rf_classifier2,param_grid=param_grid2,cv=3,verbose=0,n_jobs=4)
# Fit once, using column 24 of the preprocessed matrix (intended to be 'average cost min') as sample weights
rf_grid2.fit(X_train, y_train, sample_weight=X_train[:,24])
print("Best parameters found: ",rf_grid2.best_params_)
print("best score found: ", rf_grid2.best_score_)

In [None]:
# So now we train the tuned model on the full data
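# X and test_data hold raw (unencoded) columns, so the preprocessing has to happen inside the pipeline
best_model_random_forest = Pipeline([
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUM_VARS),
            ("cat", categorical_transformer, CAT_VARS)
            ]
    )),
    ("random_forest_model", BalancedRandomForestClassifier(n_estimators=116, criterion="entropy", random_state=42, sampling_strategy="all", replacement=True, bootstrap=False))
])

best_model_random_forest.fit(X, y)
pred_random_forest = pd.DataFrame(best_model_random_forest.predict_proba(test_data), columns=["0", "1"])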

# Creating data for submission
test_data_sub2 = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':pred_random_forest["1"]})
test_data_sub2

In [None]:
# Exporting the results
test_data_sub2.to_csv('output/balanced_random_forest_weighted_pred_submission.csv', header=True, index=False)

**Training separate models on different tiers of customers (based on profitability)**

In [None]:
# Let's look at the density plot of 'average cost min'
# We see that most customers have around 0.15 average cost min. Anything above 0.3 is pretty high (very top customers), and anything above 0.2 is relatively high (top customers)
sns.kdeplot(data=train_data, x='average cost min', fill=True)

In [None]:
# Around the top 12% of customers have higher 'average cost min' than 0.2 and around the top 3.3% of customers have higher 'average cost min' than 0.3
print(train_data['average cost min'].quantile(0.88))
print(train_data['average cost min'].quantile(0.967))
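
The quantile levels above were picked to land near 0.2 and 0.3. The share of customers above a threshold can also be computed directly, as in this sketch on made-up values:

```python
import pandas as pd

# Toy values standing in for 'average cost min'
cost = pd.Series([0.10, 0.12, 0.15, 0.15, 0.18, 0.19, 0.22, 0.25, 0.31, 0.40])

share_above_02 = (cost > 0.2).mean()  # fraction of customers above 0.2
share_above_03 = (cost > 0.3).mean()  # fraction of customers above 0.3
print(share_above_02, share_above_03)
```
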

**Creating 3 models**

1st on customers with 'average cost min' > 0.3 ==> Gold customers

2nd on customers with 0.2 < 'average cost min' <= 0.3 ==> Silver customers 

3rd is the rest ('average cost min' <= 0.2) ==> Bronze customers 

In [None]:
gold_data   = train_data[train_data['average cost min'] > 0.3]
silver_data = train_data[(train_data['average cost min'] > 0.2) & (train_data['average cost min'] <= 0.3)]
bronze_data = train_data[train_data['average cost min'] <= 0.2]

In [None]:
X_gold   = gold_data.drop(columns=TARGET_VAR)
y_gold   = gold_data[TARGET_VAR]
X_silver = silver_data.drop(columns=TARGET_VAR)
y_silver = silver_data[TARGET_VAR]
X_bronze = bronze_data.drop(columns=TARGET_VAR)
y_bronze = bronze_data[TARGET_VAR]

X_gold_preprocessed = preprocessor.fit_transform(X_gold)
X_silver_preprocessed = preprocessor.fit_transform(X_silver)
X_bronze_preprocessed = preprocessor.fit_transform(X_bronze)

X_gold_train, X_gold_test, y_gold_train, y_gold_test= train_test_split(X_gold_preprocessed, y_gold, test_size=0.2, stratify=y_gold, random_state=420)
X_silver_train, X_silver_test, y_silver_train, y_silver_test= train_test_split(X_silver_preprocessed, y_silver, test_size=0.2, stratify=y_silver, random_state=420)
X_bronze_train, X_bronze_test, y_bronze_train, y_bronze_test= train_test_split(X_bronze_preprocessed, y_bronze, test_size=0.2, stratify=y_bronze, random_state=420)

In [None]:
# Calculate the class ratio (negatives / positives). It is passed to XGBoost as scale_pos_weight to offset class imbalance during training
ratio_gold = float(y_gold_train.value_counts()[0]) / y_gold_train.value_counts()[1]
ratio_silver = float(y_silver_train.value_counts()[0]) / y_silver_train.value_counts()[1]
ratio_bronze = float(y_bronze_train.value_counts()[0]) / y_bronze_train.value_counts()[1]
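
The ratio computed above is the number of non-churners divided by the number of churners, which is the value the XGBoost documentation suggests for `scale_pos_weight`. A small sketch of the same calculation wrapped in a helper (the name `neg_pos_ratio` is hypothetical) that also guards against a tier with no churners:

```python
import pandas as pd

def neg_pos_ratio(y: pd.Series) -> float:
    """Return count(y == 0) / count(y == 1) (hypothetical helper)."""
    counts = y.value_counts()
    n_neg = counts.get(0, 0)
    n_pos = counts.get(1, 0)
    if n_pos == 0:
        # A small tier could end up without any positive examples
        raise ValueError("no positive (churn) examples in y")
    return n_neg / n_pos

# Toy labels: 6 non-churners and 2 churners
y_toy = pd.Series([0, 0, 0, 0, 0, 0, 1, 1])
ratio_toy = neg_pos_ratio(y_toy)
print(ratio_toy)
```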

Gold customers

In [None]:
# First do a random search over a large hyperparameter space, trying 1000 random models 
gbm_param_grid_large = {  'n_estimators': np.arange(5, 101, 1)
                        , 'max_depth': range(2, 13)
                        , 'learning_rate': np.arange(0.001, 5, 0.01)
                        , 'subsample': [0.6, 0.8, 1]
                        , 'colsample_bytree': [0.5, 0.8, 1]
                        , 'reg_lambda': [0, 1, 5, 10, 100]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_gold)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=1000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_gold_train, y_gold_train)
print("Best parameters found: ",randomized_auc.best_params_)
print("Best AUC found: ", randomized_auc.best_score_)

In [None]:
# Now do another random search (with more random hyperparameter combinations) over a smaller hyperparameter space around the previously found "best" values
gbm_param_grid_large = {  'n_estimators': np.arange(50, 110, 1)
                        , 'max_depth': range(5, 20)
                        , 'learning_rate': np.arange(0.05, 5.05, 0.01)
                        , 'subsample': [0.8, 1]
                        , 'colsample_bytree': [0.5, 0.8, 1]
                        , 'reg_lambda': [10, 100, 200]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_gold)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=5000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_gold_train, y_gold_train)
print("Best parameters found: ",randomized_auc.best_params_)
print("Best AUC found: ", randomized_auc.best_score_)

In [None]:
# Finally a grid-search that is not random around the previous "best" values.
gbm_param_grid = {  'n_estimators': [70, 73, 74, 75, 80]
                  , 'max_depth': [15, 18, 19, 20, 25]
                  , 'learning_rate': [0.5, 0.6, 0.67, 1]
                  , 'subsample': [0.8, 1]
                  , 'colsample_bytree': [0.5, 0.8, 1]
                  , 'reg_lambda': [100, 200, 300]
                  }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_gold)
grid_auc = GridSearchCV(  estimator=gbm
                        , param_grid=gbm_param_grid
                        , scoring='roc_auc'
                        , cv=5
                        , n_jobs=-1
                        , verbose=1
                        )

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
grid_auc.fit(X_gold_train, y_gold_train)
print("Best parameters found: ", grid_auc.best_params_)
print("Best AUC found: ", grid_auc.best_score_)

In [None]:
# Save an output based on the previous (last) grid search
pd.DataFrame(grid_auc.cv_results_).sort_values(by='rank_test_score', ascending = True).to_csv('output/xgb_gridsearch_gold.csv')

In [None]:
# Get an idea of how the model performs on the held-out test split of the gold ("top") customers
# Test AUC is lower than the best cross-validated AUC on the training split, but the gap is not dramatic. Note that the 'gold' subset also has the fewest data points.
predicted_probabilities = grid_auc.predict_proba(X_gold_test)
auc_score = roc_auc_score(y_gold_test, predicted_probabilities[:, 1])
auc_score

In [None]:
# Re-train the tuned model on the entire training data (not just on the 80% of it)
best_model_xgb = Pipeline([
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUM_VARS),
            ("cat", categorical_transformer, CAT_VARS)
            ]
    )),
    ("xgboost model", xgb.XGBClassifier(  random_state=420
                                        , scale_pos_weight=ratio_gold
                                        , colsample_bytree=0.5
                                        , learning_rate=0.67
                                        , max_depth=15
                                        , n_estimators=74
                                        , reg_lambda=200
                                        , subsample=1))
                        # Could have just used grid_auc.best_estimator_, but that needs to be saved into memory and if we close the notebook we would need to rerun the
                        # gridsearch to get it into memory again
])


best_model_xgb.fit(X_gold, y_gold)
pred_xgb = pd.DataFrame(best_model_xgb.predict_proba(test_data), columns=["0", "1"])

# Appending predictions to the test data
test_data['gold_pred'] = pred_xgb["1"]

# Creating data for submission
test_data_sub = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':pred_xgb["1"]})
test_data_sub
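
The comment in the cell above notes that `grid_auc.best_estimator_` only lives in memory, which is why the tuned values were copied by hand. One possible alternative, sketched here under assumptions, is to write `best_params_` to a small JSON file after the search and read it back later. The helper names and the file name are hypothetical, and `best_params` below is a stand-in dictionary holding the gold values from the cell above.

```python
import json
import os
import tempfile

import numpy as np

def to_builtin(value):
    """Convert NumPy scalars (e.g. drawn from np.arange grids) to plain Python types."""
    return value.item() if isinstance(value, np.generic) else value

def save_params(params: dict, path: str) -> None:
    with open(path, "w") as f:
        json.dump({k: to_builtin(v) for k, v in params.items()}, f, indent=2)

def load_params(path: str) -> dict:
    with open(path) as f:
        return json.load(f)

# Stand-in for grid_auc.best_params_ (values taken from the gold model above)
best_params = {"colsample_bytree": 0.5, "learning_rate": np.float64(0.67),
               "max_depth": 15, "n_estimators": np.int64(74),
               "reg_lambda": 200, "subsample": 1}

# Hypothetical file location; the notebook itself writes to 'output/'
path = os.path.join(tempfile.gettempdir(), "xgb_best_params_gold.json")
save_params(best_params, path)
restored = load_params(path)
print(restored)
```

The restored dictionary could then be unpacked into the estimator, for example `xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_gold, **restored)`, without rerunning the grid search.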

In [None]:
# Exporting results
test_data_sub.to_csv('output/xgboost_submission_gold.csv', header=True, index=False)

Silver customers

In [None]:
# First do a random search over a large hyperparameter space, trying 1000 random models 
gbm_param_grid_large = {  'n_estimators': np.arange(5, 101, 1)
                        , 'max_depth': range(2, 13)
                        , 'learning_rate': np.arange(0.001, 5, 0.01)
                        , 'subsample': [0.6, 0.8, 1]
                        , 'colsample_bytree': [0.5, 0.8, 1]
                        , 'reg_lambda': [0, 1, 5, 10, 100]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_silver)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=1000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_silver_train, y_silver_train)
print("Best parameters found: ",randomized_auc.best_params_)
print("Best AUC found: ", randomized_auc.best_score_)

In [None]:
# Now do another random search (with more random hyperparameter combinations) over a smaller hyperparameter space around the previously found "best" values
gbm_param_grid_large = {  'n_estimators': np.arange(5, 30, 1)
                        , 'max_depth': range(5, 20)
                        , 'learning_rate': np.arange(0.05, 5.05, 0.01)
                        , 'subsample': [0.8, 1]
                        , 'colsample_bytree': [0.8, 1]
                        , 'reg_lambda': [10, 100, 200]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_silver)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=5000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_silver_train, y_silver_train)
print("Best parameters found: ",randomized_auc.best_params_)
print("Best AUC found: ", randomized_auc.best_score_)

In [None]:
# Finally a grid-search that is not random around the previous "best" values.
gbm_param_grid = {  'n_estimators': [8, 10, 15, 20]
                  , 'max_depth': [5, 6, 7, 10, 15]
                  , 'learning_rate': [0.5, 0.79, 1]
                  , 'subsample': [0.8, 1]
                  , 'colsample_bytree': [0.8, 1]
                  , 'reg_lambda': [50, 100, 200]
                  }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_silver)
grid_auc = GridSearchCV(  estimator=gbm
                        , param_grid=gbm_param_grid
                        , scoring='roc_auc'
                        , cv=5
                        , n_jobs=-1
                        , verbose=1
                        )

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
grid_auc.fit(X_silver_train, y_silver_train)
print("Best parameters found: ", grid_auc.best_params_)
print("Best AUC found: ", grid_auc.best_score_)

In [None]:
# Save an output based on the previous (last) grid search
pd.DataFrame(grid_auc.cv_results_).sort_values(by='rank_test_score', ascending = True).to_csv('output/xgb_gridsearch_silver.csv')

In [None]:
# Get an idea of how the model performs on the held-out test split of the silver customers
# Test AUC is close to the best cross-validated AUC on the training split, which suggests the model is not overfitting
predicted_probabilities = grid_auc.predict_proba(X_silver_test)
auc_score = roc_auc_score(y_silver_test, predicted_probabilities[:, 1])
auc_score

In [None]:
# Re-train the tuned model on the entire training data (not just on the 80% of it)
best_model_xgb = Pipeline([
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUM_VARS),
            ("cat", categorical_transformer, CAT_VARS)
            ]
    )),
    ("xgboost model", xgb.XGBClassifier(  random_state=420
                                        , scale_pos_weight=ratio_silver
                                        , colsample_bytree=1
                                        , learning_rate=0.79
                                        , max_depth=6
                                        , n_estimators=10
                                        , reg_lambda=100
                                        , subsample=0.8))
                        # Could have just used grid_auc.best_estimator_, but that needs to be saved into memory and if we close the notebook we would need to rerun the
                        # gridsearch to get it into memory again
])


best_model_xgb.fit(X_silver, y_silver)
pred_xgb = pd.DataFrame(best_model_xgb.predict_proba(test_data), columns=["0", "1"])

# Appending predictions to the test data
test_data['silver_pred'] = pred_xgb["1"]

# Creating data for submission
test_data_sub = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':pred_xgb["1"]})
test_data_sub

In [None]:
# Exporting results
test_data_sub.to_csv('output/xgboost_submission_silver.csv', header=True, index=False)

Bronze customers

In [None]:
# First do a random search over a large hyperparameter space, trying 1000 random models 
gbm_param_grid_large = {  'n_estimators': np.arange(5, 101, 1)
                        , 'max_depth': range(2, 13)
                        , 'learning_rate': np.arange(0.001, 5, 0.01)
                        , 'subsample': [0.6, 0.8, 1]
                        , 'colsample_bytree': [0.5, 0.8, 1]
                        , 'reg_lambda': [0, 1, 5, 10, 100]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_bronze)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=1000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_bronze_train, y_bronze_train)
print("Best parameters found: ",randomized_auc.best_params_)
print("Best AUC found: ", randomized_auc.best_score_)

In [None]:
# Now do another random search (again with 1000 random hyperparameter combinations) over a smaller hyperparameter space around the previously found "best" values
gbm_param_grid_large = {  'n_estimators': np.arange(15, 50, 1)
                        , 'max_depth': range(5, 20)
                        , 'learning_rate': np.arange(0.05, 5.05, 0.01)
                        , 'subsample': [0.8, 1]
                        , 'colsample_bytree': [0.8, 1]
                        , 'reg_lambda': [0, 1, 10, 20]
                        }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_bronze)
randomized_auc = RandomizedSearchCV(  estimator=gbm
                                    , param_distributions=gbm_param_grid_large
                                    , n_iter=1000
                                    , scoring='roc_auc'
                                    , cv=5
                                    , n_jobs=-1
                                    , verbose=1
                                    , random_state=420)

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
randomized_auc.fit(X_bronze_train, y_bronze_train)
print("Best parameters found: ",randomized_auc.best_params_)
print("Best AUC found: ", randomized_auc.best_score_)

In [None]:
# Finally a grid-search that is not random around the previous "best" values.
gbm_param_grid = {  'n_estimators': [15, 20, 21, 22, 30]
                  , 'max_depth': [5, 8, 9, 10, 15]
                  , 'learning_rate': [0.05, 0.1, 0.13, 0.2]
                  , 'subsample': [0.8, 1]
                  , 'colsample_bytree': [0.8, 1]
                  , 'reg_lambda': [0, 1, 2, 5]
                  }

gbm = xgb.XGBClassifier(random_state=420, scale_pos_weight=ratio_bronze)
grid_auc = GridSearchCV(  estimator=gbm
                        , param_grid=gbm_param_grid
                        , scoring='roc_auc'
                        , cv=5
                        , n_jobs=-1
                        , verbose=1
                        )

# See how the cross-validation performed (on the "training" data) and the best tuned hyperparameter values
grid_auc.fit(X_bronze_train, y_bronze_train)
print("Best parameters found: ", grid_auc.best_params_)
print("Best AUC found: ", grid_auc.best_score_)

In [None]:
# Save an output based on the previous (last) grid search
pd.DataFrame(grid_auc.cv_results_).sort_values(by='rank_test_score', ascending = True).to_csv('output/xgb_gridsearch_bronze.csv')

In [None]:
# Get an idea of how the model performs on the held-out test split of the bronze customers
# Test AUC is close to the best cross-validated AUC on the training split, which suggests the model is not overfitting
predicted_probabilities = grid_auc.predict_proba(X_bronze_test)
auc_score = roc_auc_score(y_bronze_test, predicted_probabilities[:, 1])
auc_score

In [None]:
# Re-train the tuned model on the entire training data (not just on the 80% of it)
best_model_xgb = Pipeline([
    ("preprocessor", ColumnTransformer(
        transformers=[
            ("num", numeric_transformer, NUM_VARS),
            ("cat", categorical_transformer, CAT_VARS)
            ]
    )),
    ("xgboost model", xgb.XGBClassifier(  random_state=420
                                        , scale_pos_weight=ratio_bronze
                                        , colsample_bytree=0.8
                                        , learning_rate=0.13
                                        , max_depth=10
                                        , n_estimators=15
                                        , reg_lambda=1
                                        , subsample=0.8))
                        # Could have just used grid_auc.best_estimator_, but that needs to be saved into memory and if we close the notebook we would need to rerun the
                        # gridsearch to get it into memory again, and it can take 5-10 minutes
])


best_model_xgb.fit(X_bronze, y_bronze)
pred_xgb = pd.DataFrame(best_model_xgb.predict_proba(test_data), columns=["0", "1"])

# Appending predictions to the test data
test_data['bronze_pred'] = pred_xgb["1"]

# Creating data for submission
test_data_sub = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':pred_xgb["1"]})
test_data_sub

In [None]:
# Exporting results
test_data_sub.to_csv('output/xgboost_submission_bronze.csv', header=True, index=False)

Creating the final predictions from the 3 models

In [None]:
# Create an "operational prediction" column
def create_operational_pred(row):
    if row['average cost min'] > 0.3:
        return row['gold_pred']
    elif 0.2 < row['average cost min'] <= 0.3:
        return row['silver_pred']
    else:
        return row['bronze_pred']

# Apply the function to create the new column 'operational_pred'
test_data['operational_pred'] = test_data.apply(create_operational_pred, axis=1)
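
The row-wise `apply` above can also be written in vectorized form with `np.select`, which evaluates the conditions in order and takes the first match. A minimal sketch on a toy frame (the values are invented; the column names follow the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-in for test_data with the three per-tier prediction columns
toy = pd.DataFrame({
    "average cost min": [0.35, 0.25, 0.10, 0.30],
    "gold_pred":   [0.9, 0.8, 0.7, 0.6],
    "silver_pred": [0.5, 0.4, 0.3, 0.2],
    "bronze_pred": [0.1, 0.2, 0.3, 0.4],
})

cost = toy["average cost min"].to_numpy()

# Conditions are checked in order, so the second one effectively means 0.2 < cost <= 0.3
conditions = [cost > 0.3, cost > 0.2]
choices = [toy["gold_pred"].to_numpy(), toy["silver_pred"].to_numpy()]

# Everything that matches neither condition falls back to the bronze model
toy["operational_pred"] = np.select(conditions, choices,
                                    default=toy["bronze_pred"].to_numpy())
print(toy)
```

Both versions route each customer to the model of its own tier; the vectorized form simply avoids a Python-level loop over the rows.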

In [None]:
# Create a final prediction.
# Among the observations whose 'operational_pred' is higher than 0.5, take the top 40 sorted by 'average cost min' descending (so the most valuable customers)
# and overwrite their predictions with 1, 0.999999, 0.999998, etc.
# This imitates a retention policy: if we predict a customer to churn with more than 50% probability and the customer has a high 'average cost min', we will try to retain that customer. We do
# this by artificially inflating the churn prediction. So even if a customer's "probability" was barely above 0.5, a large 'average cost min' 
# still yields a high "probability" (close to 1) of churning. This should help raise our top-20 evaluation metric.

# Sort the DataFrame by 'average cost min' in descending order
df_sorted = test_data.sort_values(by='average cost min', ascending=False)

# The (at most) 40 most valuable customers with 'operational_pred' above 0.5.
# Computed once here instead of once per row inside the function.
top_records = df_sorted[df_sorted['operational_pred'] > 0.5].head(40)

# Define a function to calculate 'final_pred' based on the specified conditions
def create_final_pred(row):
    if row.name in top_records.index:
        # Starting from 1 and decreasing by 0.000001 for each subsequent record
        return 1 - (top_records.index.get_loc(row.name) / 1000000)
    # All other rows keep the prediction of their tier's model
    return row['operational_pred']

# Apply the function to create the new column 'final_pred'
test_data['final_pred'] = test_data.apply(create_final_pred, axis=1)
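
To check locally whether this re-ranking helps, the profit @ top-k metric can be approximated on a labelled hold-out set. The sketch below follows the assignment's description (sum the profitability proxy over the k highest-scored customers who actually churned); the function name and the tie handling are assumptions, and the leaderboard's exact implementation is not known.

```python
import numpy as np

def profit_at_top_k(y_true, y_score, profit, k=20):
    """Sum of `profit` over the k highest-scored rows that are true churners.

    Hypothetical helper: y_true holds 0/1 labels, y_score the predicted churn
    probabilities, and profit the proxy (here 'average cost min').
    """
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score, dtype=float)
    profit = np.asarray(profit, dtype=float)
    # Indices of the k highest scores; a stable sort keeps input order among ties
    top_k = np.argsort(-y_score, kind="stable")[:k]
    # Non-churners in the top k contribute zero
    return float(np.sum(profit[top_k] * (y_true[top_k] == 1)))

# Toy example with k=3: rows 0 and 2 are churners inside the top 3
y_true  = [1, 0, 1, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.2, 0.1]
profit  = [0.30, 0.50, 0.10, 0.40, 0.20]
result = profit_at_top_k(y_true, y_score, profit, k=3)
print(result)
```

Comparing this value for `operational_pred` and `final_pred` on a held-out split of the training data would show whether the inflation step moves more profitable churners into the top 20.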

In [None]:
# Creating data for submission
test_data_sub = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':test_data['final_pred']})

# Exporting results
test_data_sub.to_csv('output/xgboost_submission_3models.csv', header=True, index=False)

test_data_sub