Your model will be evaluated using two metrics: profit @ top-20, and AUC. The reasons for this is to be in line with a more realistic setting. E.g. one can image data scientists in a team arguing to use AUC and optimize for that. However, as seen in the course, for this scenario, we also imagine management arguing that there is not enough budget (in terms of time and money) to contact a lot of people (or hand out a lot of promotions). Hence, they have come up with the following: based on the top-k would-be churners as predicted by your model, sum some proxy of "retained profitability" in case the customer was indeed a churner, or zero otherwise

As a proxy of profitability, the feature average cost min was deemed to be a good value. Based on the size of the test set, k=20 was deemed to be a good choice. Hence, management cares about optimizing this metric
Note that only about half of the test set is used for the "public" leaderboard. That means that the score you will see on the leaderboard is done using this part of the test only (you don't know which half). Later on through the semester, submissions are frozen and the resuls on the "hidden" part will be revealed

Also, whilst you can definitely try, the goal is not to "win", but to help you reflect on your model's results, see how others are doing, etc.

Objectives:

Some groups prefer to write their final report using Jupyter Notebook, which is fine too, as long as it is readable top-to-bottom

You can use any predictive technique/approach you want, though focus on the whole process: general setup, critical thinking, and the ability to get and validate an outcome

You're free to use unsupervised technique for your data exploration part, too. When you decide to build a black box model, including some interpretability techniques to explain it is a plus

Any other assumptions or insights are thoughts can be included as well: the idea is to take what we've seen in class, get your hands dirty and try out what we've seen

Perform a critical review of the evaluation metric chosen by management. How in line is it with AUC? What would you have picked instead? Were there particular issues with this chosen metric, in your view?

In [None]:
import pandas as pd
import os
import plotly.graph_objects as go
import numpy as np

pd.options.display.max_columns = 100

In [None]:
# Initialising
TRAIN_SET_FRAC = 0.8
SEED = 42
TARGET_VAR = "target"
DROP_VARS = ['Connect_Date', 'id'] # TBC
KFOLD = 5

**Loading Data**

In [None]:
# GitHib urls to fetch data from
url_train = 'https://raw.githubusercontent.com/hello-bob/AA_P1/main/data/train.csv'
url_test = 'https://raw.githubusercontent.com/hello-bob/AA_P1/main/data/test.csv'

# Read train and test data
train_data = pd.read_csv(url_train, sep = ',', skipinitialspace = True, engine = 'python')
train_data = train_data.drop(columns=DROP_VARS)
test_data  = pd.read_csv(url_test, sep = ',', skipinitialspace = True, engine = 'python')

**Data exploration**

In [None]:
train_data.head()

In [None]:
# Check data types
train_data.info()
test_data.info()

In [None]:
# Basic descriptives
train_data.describe(include='all')

In [None]:
# Impute missing data before modelling: Can quantitate and put it on the report since 4/5k samples
# Apply on the test set. Train set is ok.
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

train_data.isnull().any().sort_values(ascending=False) # Columns with missing values: Dropped_calls_ratio, Usage_Band, call_cost_per_min.
train_data[train_data.isnull().any(axis=1)] # 4 cases, 2 churners

imputer_compiled = ColumnTransformer(
    [("numeric_imputer", SimpleImputer(strategy="median",), ["Dropped_calls_ratio", "call_cost_per_min"]),
     ("cat_imputer", SimpleImputer(strategy="most_frequent"), ["Usage_Band"])]
)

# Imput median for numeric variables first. Because "most_frequent" strategy will impute for both numeric and categorical data
train_data[["Dropped_calls_ratio", "call_cost_per_min", "Usage_Band"]] = imputer_compiled.fit_transform(train_data)
test_data[["Dropped_calls_ratio", "call_cost_per_min", "Usage_Band"]] = imputer_compiled.transform(test_data)

# Correcting dtype
train_data[["Dropped_calls_ratio", "call_cost_per_min"]] = train_data[["Dropped_calls_ratio", "call_cost_per_min"]].astype(float)
test_data[["Dropped_calls_ratio", "call_cost_per_min"]] = test_data[["Dropped_calls_ratio", "call_cost_per_min"]].astype(float)


In [None]:
# [For report] Pie chart about class inbalance (train set) + Percentage churn in categorical variable


In [None]:
# [For report] correlation plot
corr = train_data.corr(numeric_only=True)

fig = go.Figure()
fig.add_trace(
    go.Heatmap(
        x = corr.columns,
        y = corr.index,
        z = np.array(corr),
        text=corr.values,
        texttemplate='%{text:.2f}'
    )
)
fig.update_layout(
    autosize=False,
    width=800,
    height=800,
)
fig.show()

In [None]:
# [For report] Correlation between categorical variables



**Data preprocessing**

In [None]:
# Imputing missing values
# outliers
from sklearn.ensemble import IsolationForest

outlier_df = (train_data.select_dtypes(include='number')
              .drop(columns=TARGET_VAR)
              .dropna()
              .copy())

iso_forest = IsolationForest(random_state=SEED, n_jobs=-1).fit(outlier_df)
pred = iso_forest.predict(outlier_df)
outlier_df['is_outlier'] = (pred == -1).astype(int)

In [None]:
# Finding what drives outliers: TBC
corr = outlier_df.corr(numeric_only=True)

fig = go.Figure()
fig.add_trace(
    go.Heatmap(
        x = corr.columns,
        y = corr.index,
        z = np.array(corr),
        text=corr.values,
        texttemplate='%{text:.2f}'
    )
)
fig.update_layout(
    autosize=False,
    width=800,
    height=800,
)
fig.show()

In [None]:
# !pip install shap

In [None]:
# https://stats.stackexchange.com/questions/404017/how-to-get-top-features-that-contribute-to-anomalies-in-isolation-forest
import shap

# Create shap values and plot them
shap_values = shap.TreeExplainer(iso_forest).shap_values(outlier_df)
shap.summary_plot(shap_values, outlier_df)

In [None]:
# Decide or not to keep/drop/use the outliers as a feature. To research on churn context

**Modelling**

In [None]:
X = train_data.drop(columns=TARGET_VAR)
y = train_data[TARGET_VAR] 

NUM_VARS = train_data.select_dtypes(include='number').drop(columns=TARGET_VAR).columns
CAT_VARS = train_data.select_dtypes(include='object').columns

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV

# Define preprocessors for numerical and categorical features
numerical_preprocessor = Pipeline([
    ("scaler", StandardScaler())
])

categorical_preprocessor = Pipeline([
    ("onehot", OneHotEncoder(drop="if_binary"))
])

In [None]:
# Combine preprocessors and model
model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("numerical", numerical_preprocessor, NUM_VARS),
        ("categorical", categorical_preprocessor, CAT_VARS)
    ])),
    ("model", SVC(probability=True, random_state=SEED))
])


In [None]:
# For SVM
parameters = {'model__kernel':['linear', 'rbf'], 
              'model__C':[1]} # rmb to add the double underscores to allow gridsearch to fit on pipelines
svc_gs_est = GridSearchCV(estimator=model, param_grid=parameters,cv=KFOLD,
                      scoring="roc_auc",n_jobs=-1, refit=True)
svc_gs_est.fit(X, y)

In [None]:
svc_gs_results = pd.DataFrame(data=svc_gs_est.cv_results_)
svc_gs_results.sort_values(by='rank_test_score', ascending = True)

In [None]:
# Interpretability

**Prediction**

In [None]:
# Retrain best model: Set up the params accordingly
numerical_preprocessor = Pipeline([
    ("scaler", StandardScaler())
])

categorical_preprocessor = Pipeline([
    ("onehot", OneHotEncoder(drop="if_binary"))
])

best_model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("numerical", numerical_preprocessor, NUM_VARS),
        ("categorical", categorical_preprocessor, CAT_VARS)
    ])),
    ("model", SVC(probability=True, random_state=SEED, C=1, kernel="rbf"))
])


best_model.fit(X, y)
pred = pd.DataFrame(best_model.predict_proba(test_data), 
                    columns=["0", "1"])

In [None]:
# For submission
test_data_sub = pd.DataFrame(data={'ID':test_data['id'], 
                                   'PRED':pred["1"]})
test_data_sub

**XGBoost**

In [None]:
# Basic preprocessing
numeric_transformer = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy="median"))
    ]
)

categorical_transformer = Pipeline(
    steps = [
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("encoder", OneHotEncoder())
    ]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, NUM_VARS),
        ("cat", categorical_transformer, CAT_VARS),
    ]
)

X = preprocessor.fit_transform(X)
# Alternatively split train-test before, do preprocessing on training data (fit_transform) then transform test data


In [None]:
X

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test= train_test_split(X, y,
test_size=0.2, random_state=420)

In [None]:
import xgboost as xgb
xg_cl = xgb.XGBClassifier(objective='binary:logistic', n_estimators=10, seed=420)

In [None]:
xg_cl.fit(X_train, y_train)

In [None]:
preds = xg_cl.predict(X_test)

In [None]:
accuracy = float(np.sum(preds==y_test))/y_test.shape[0]
print("accuracy: %f" % (accuracy))

In [None]:
churn_dmatrix = xgb.DMatrix(data=X,label=y)

In [None]:
params={"objective":"binary:logistic","max_depth":4}

In [None]:
cv_results = xgb.cv(dtrain=churn_dmatrix, params=params, nfold=4,
num_boost_round=10, metrics="auc", as_pandas=True, seed = 420)
print(cv_results)

In [None]:
print("AUC: %f" %((cv_results["test-auc-mean"]).iloc[-1]))

In [76]:
# Hyperparameter tuning

gbm_param_grid = {'learning_rate': [0.01,0.1,0.5,0.9],
                  'n_estimators': [50],
                  'subsample': [0.3, 0.5, 0.9]}

gbm = xgb.XGBClassifier()
grid_auc = GridSearchCV(estimator=gbm,param_grid=gbm_param_grid,
scoring='roc_auc', cv=4, verbose=1)

In [77]:
grid_auc.fit(X, y)
print("Best parameters found: ",grid_auc.best_params_)
print("Lowest AUC found: ", np.sqrt(np.abs(grid_auc.best_score_)))

Fitting 4 folds for each of 12 candidates, totalling 48 fits
Best parameters found:  {'eta': 0.1, 'n_estimators': 50, 'subsample': 0.9}
Lowest AUC found:  0.9691073591383579
