*Keep in mind while writing up:*

- *Be concise! Less is more - the full story is in the source code for those interested.*
- *Be deliberate about: What to highlight in which section (e.g., “this dataset was special due to its high number of variables”…)*
- *Work with visuals and only exceptionally with code. Refer to GitHub, dump code there, the technical people will go there. And (hiring) managers will only read the write-up.*
- *Optimize business value, not model performance! Time/Resource constraints, ….*

### Project Report
# **Preventing Customer Churn with Artificial Neural Networks**
*Disclaimer: This mock project report serves educational purposes only. The data used is public (https://www.kdd.org/kdd-cup/view/kdd-cup-2009/Data). All other company information is fictional. The author has no commercial relationship with mentioned parties.* 
***
### **Executive Summary (max. 7 sentences)**
Situation (1 sentence based on 1.)
<br>
Complication (1 sentence based on 1.)
<br>
Solution (1 sentence based on 2.)
<br>
Recommendations including Solutions' Business Value Add (1-3 sentences based on 3.)
"much buzz around ANN, let's test that here"
***
### **Report Structure**
[Include nice + simple process visualization!]
1. Business Problem Statement
2. Technical Solution
<br>    *2.1. Technical Problem Statement*
<br>    *2.2. Exploratory Data Analysis*
<br>    *2.3. Data Preprocessing*
<br>    *2.4. Model Selection (incl. Optimization)*
<br>    *2.5. Final Model Evaluation*
<br>    *2.6. Future Optimization Potentials*
3. Business Recommendations
***

## **1. Business Problem Statement**
For French telecommunication provider Orange, customer retention is critical. This is because retaining customers is much cheaper than the alternative: losing a customer and their revenues *plus* having additional costs for acquiring a new customer. However, Orange lacks an automated, scalable, and data-driven method for predicting customer churn that would allow Orange to initiate retention measures before customers leave. That is, predicting customer churn currently more or less relies on sporadic guessing. Thus, Orange requested a proof-of-concept for a predictive model that can help identify customers who will likely churn. Specifically, encouraged by the enthusiasm surrounding this model class, Orange wants the proof-of-concept to explore the potential of "deep learning".

<br>

## **2. Technical Solution**
#### Main used resources:
- Data: Orange has provided historical customer data (50,000 observations/customers; 230 features) for model optimization, selection, and evaluation. 
- Software: Python 3.8.5; main packages:
    - Pandas, Numpy (for data wrangling)
    - Keras/TensorFlow (for neural network modelling)
    - Scikit-learn (for optimization/gridsearch automation)
    - Matplotlib, Seaborn (for visualization)
- Hardware: standard end-user office notebook (i7-8550U; 4 cores @1.80 GHz)


- for visualization inspirations see here: https://towardsdatascience.com/predict-customer-churn-the-right-way-using-pycaret-8ba6541608ac


### *2.1. Technical Solution: Technical Problem Statement*
The business problem, as put by Orange, is "to predict customer churn". This problem requires translation into a better specified, technical problem before it is solvable using mathematical-statistical methods. Technically put, the problem we solve is to

*maximize the F1-score over the churn predictions of Orange's customers by implementing an artificial neural network with more than one hidden layer and an output layer containing a single neuron with an activation function*.

(TO-DO: determine the F1-score obtained at the churn rate and with random guessing, as a benchmark for improvement/business value add)

Each component of this technical problem statement follows from considering the following three issues in light of the business problem we solve: 

#### Specifying the business problem
It is first important to understand that predicting customer churn is, technically, a binary classification problem: given the data available for any particular customer (e.g., age, gender, purchased services, average call duration), we want our model to assign this customer to one of the two classes "churn"/"no churn". Understanding that we solve a classification problem has important implications for two main elements of the technical problem statement:

#### Choosing an adequate model class
In a typical data science project, we would train models from many different model classes (e.g., logistic regression classifiers, trees, support vector machines) and select the best performing models (or combine them in an *ensemble*) for deployment. In this project, however, the client Orange has specified upfront that they want a "deep learning" model, which in more precise technical terms is widely understood as an artificial neural network (ANN) with more than one hidden layer. Further, since we want the ANN's output to always be either "churn" or "no churn", its output layer must contain a single neuron with a sigmoid activation function. The sigmoid maps the neuron's continuous input to a value between 0 and 1 that can be read as a churn probability and is then thresholded into a binary value (1/0). (ReLU, by contrast, is unbounded above and therefore suited to hidden layers rather than to this output neuron.)
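
To make the role of the single output neuron concrete, here is a minimal sketch of how a sigmoid output can be turned into a class label. The pre-activation values and the 0.5 threshold are assumptions chosen for illustration; they are not taken from the trained model.

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued pre-activation to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical pre-activations of the output neuron for four customers
z = np.array([-3.0, -0.2, 0.4, 2.5])

churn_probability = sigmoid(z)
# Assumed decision threshold of 0.5; it could be tuned to trade precision against recall
churn_label = (churn_probability >= 0.5).astype(int)

print(churn_probability.round(3))
print(churn_label)
```

Lowering the threshold would flag more customers as "churn" (higher recall, typically lower precision), which is one lever for aligning the model with the cost of retention measures.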

#### Choosing adequate evaluation metrics
An evaluation metric enables us to assess how "good" a developed model is and to optimize it. Perhaps the most intuitive metric for a classification model is the *accuracy* of its predictions. Accuracy tells us in which percentage of cases a classification model's predictions ("churn"/"no churn") are true (that is, correctly predict what customers will actually do). However, we can infer from the business context that the classes "churn"/"no churn" we are interested in are *imbalanced*: only a minority of all customers will churn in any given time period. We can thus expect many more customers to be in the "no churn" than in the "churn" class. Accuracy is therefore a bad metric to optimize: the model could 'cheat' by simply predicting "no churn" in 100% of the cases, never detect a single churning customer, and still achieve a very high accuracy. In the presence of class imbalance, a more adequate metric to optimize is the *F1-score*, the harmonic mean of precision and recall. A high F1-score indicates not only that the model is able to detect many of those customers who will indeed churn (high *recall*), but also that the model's "churn"-predictions are typically correct (high *precision*). These are two different things!
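
The 'cheating' argument can be checked on synthetic labels. The sketch below assumes a 1:12 class ratio (roughly the ratio reported in the EDA) and compares two trivial classifiers; the label vector is made up for illustration and is not the Orange data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels (assumption): 1 = "churn", 0 = "no churn", at a 1:12 ratio
y_true = np.array([1] * 1_000 + [0] * 12_000)

always_no_churn = np.zeros_like(y_true)  # never predicts churn
always_churn = np.ones_like(y_true)      # always predicts churn

acc_no_churn = accuracy_score(y_true, always_no_churn)
# zero_division=0 defines precision as 0 when no positive prediction is made
f1_no_churn = f1_score(y_true, always_no_churn, zero_division=0)
f1_all_churn = f1_score(y_true, always_churn)

print(f"always 'no churn': accuracy={acc_no_churn:.3f}, F1={f1_no_churn:.3f}")
print(f"always 'churn':    F1={f1_all_churn:.3f}")
```

The second number can serve as a naive F1 benchmark: any useful model should clearly beat the F1-score of labelling every customer as churning.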

Deliverable:
-  source code in Python that can serve as proof of concept before deploying and putting into production (check vocabulary; for containerization also check this: https://github.com/Azure-Samples/MachineLearningSamples-TDSPUCIAdultIncome/blob/master/docs/deliverable_docs/ProjectReport.md)

<br>

### *2.2. Technical Solution: Exploratory Data Analysis (EDA)*
Now that we have specified the technical problem we solve in this project, we familiarize ourselves with the historical customer data Orange has provided. Exploratory data analysis helps us identify how we need to preprocess this data so that the ANN can better learn from it to predict churn. This typically involves some basic overall checks (overall dataset structure, feature types, missing values), but also analyses more focused on our target variable, that is, the class label vector "churn"/"no churn" (= what we want to predict).

#### Loading the data
We first load the data from a local drive. X is a matrix containing features and observations, y is a vector containing the class labels we want to predict.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

from numpy.random import seed
import tensorflow as tf
seed(3992)
tf.random.set_seed(3992)

import pandas as pd
pd.set_option("float_format", "{:f}".format)
pd.set_option('display.max_columns', None)
X = pd.read_table('data/orange_small_train.data')
y = pd.read_table('data/orange_small_train_churn.labels', header=None, names=['Churn'])
data = pd.concat([X, y], axis=1)

In [None]:
X.shape

<br>

#### Inspecting the overall dataset structure

Our very first analytical step is to take a broad look at the overall dataset structure, including the number of features (columns) and observations (rows = customers), feature names, features' data types, missing value formatting, and some basic descriptive statistics.

In [None]:
X.head()

In [None]:
X.info()

In [None]:
X.describe(include='all')

<br>

#### Inspecting missing values

Since we noted that the data contains missing values ("NaN"), we want to know precisely which percentage of values is missing in the data we have been given, and also how these missing values are distributed across features.

In [None]:
round(X.isna().sum().sum()/(X.shape[0]*X.shape[1]), 3)

In [None]:
import matplotlib.pyplot as plt
temp = X.isna().sum()/(X.shape[0])
plt.bar(range(len(temp)), sorted(temp), color='blue', alpha=0.65)

<br>

#### Checking class balance "churn"/"no churn"

Having acquired an overall impression of the data, we now take a more focused look at our target variable 'Churn', that is, the column containing the class labels our model will need to predict.

In [None]:
plt.hist(y['Churn'], bins=3)

In [None]:
y['Churn'].value_counts()

<br>

#### > EDA key insights:
To summarize, our brief exploratory analysis helped us gain some important insights about the features, observations, and target variable in our data. Going forward, we will need to keep these insights in mind, as they tell us how to properly preprocess the data so that our predictive model can make the best use of it. The key insights we have gleaned are:

- Features: 
	- Most or all features' scales differ.
	- Orange has anonymized data before providing it (likely to protect customers' privacy).
	- The 230 total features include 38 categorical and 192 numerical features.
- Observations:
	- We have been given data on 50,000 customers and their churn behaviour that we can use for optimizing and evaluating our model.
	- Missing values are a big issue in this dataset (around 70%).
- Target variable:
	- As expected, the two classes in our target variable 'Churn' are heavily imbalanced (around 1 churning customer for 12 non-churning customers).
	- The class labels "churn"/"no churn" are represented by the numerical values 1/-1. 


<br>

### *2.3. Technical Solution: Data Preprocessing*
Before the data can be used for effective model training, some preprocessing is highly advisable. This is because the ANN model we implement requires data to be in a particular format to effectively 'learn' how to make good predictions. In addition, since preprocessing steps often have interdependencies, performing them in a meaningful order is important. Otherwise we risk messing up the data used for model training, compromising prediction quality. 

The main preprocessing steps that we infer from our EDA learnings to help improve prediction quality are to remove observations and features with many missing values, encode (= 'make numerical') categorical features, create binary indicator columns for missing values, impute missing values, transform the class labels, and normalize the features' scales.

multicollinearity: https://datascience.stackexchange.com/questions/28328/how-does-multicollinearity-affect-neural-networks

#### Putting aside some test data
A step that is often wrongly postponed until *after* preprocessing is to put aside some test data. This test data (often 20% of the overall data) is used to evaluate the final (= optimized) model. Holding it out lets us simulate the new, unseen data that will only arrive once the model has been deployed (= used productively in some kind of software application); such data is not available during preprocessing and model training. For preprocessing steps that depend on the data they are applied to (e.g., scaling, encoding), leaving the test data 'in' while performing the step would allow the model to learn from information it could not have in real use. This is called 'peeking' or 'data leakage', and it biases the estimated model quality upwards.

As another issue, we also need to account for the class imbalance in our data that EDA has confirmed. Usually we would just randomly split test and training data. However, since we want the test data to have the same class ("churn"/"no churn") distribution as the training data (as also real unseen data would have), we do a *stratified* split that replicates class distribution from training to test data.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)
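
As a quick illustration of what the stratified split does, the sketch below uses a small synthetic dataset (an assumption, not the Orange data) with the same 1:12 class ratio and compares the churn share in the two resulting subsets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 100 churners, 1,200 non-churners
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1_300, 3))
y_toy = np.array([1] * 100 + [0] * 1_200)

# stratify=y_toy keeps the class proportions equal in both subsets
_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0)

share_train = y_tr.mean()
share_test = y_te.mean()
print(f"churn share train: {share_train:.4f}, test: {share_test:.4f}")
```

Without `stratify`, the two shares would differ by chance, and with a rare class this can noticeably distort the evaluation.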

#### Remove missing value-intensive observations and features
EDA has shown that a big issue in the data is the high number of missing values. Observations (rows) and features (columns) with many or even all values missing increase model training time and model complexity (further exacerbating ANNs' general tendency to overfit), but carry little information that could be exploited in model training. In other words, keeping these rows and columns would be a bad deal. We thus remove all observations that have only missing values...

In [None]:
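# Drop rows whose features are all missing, then keep only the matching labels
X_train = X_train.dropna(axis=0, how='all')
y_train = y_train.loc[X_train.index]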

... and all features which have more than 80% missing values.

In [None]:
# Keep only features with at least 20% non-missing values in the training data
X_train = X_train.dropna(axis=1, thresh=int(X_train.shape[0] * 0.2))

We can look once more at the earlier graph to see that the remaining features contain much more information per feature. This will help our model considerably to focus on what's important during training.

In [None]:
temp = X_train.isna().sum()/(X_train.shape[0])
plt.bar(range(len(temp)), sorted(temp), color='blue', alpha=0.65)

#### Encoding categorical features
We have also learned during EDA that there are quite a few categorical features. Our ANN model (like many other model classes), however, can only handle numerical inputs. We thus *one-hot encode* the categorical features. This means that for each unique value in each categorical feature, we create an additional feature which indicates via the values 1 or 0 for each observation the presence or absence of that unique value.

However, since we have an average number of X unique values per categorical feature, regular one-hot encoding would explode the number of features, making our data high-dimensional. This is undesirable for a variety of reasons (see *curse of dimensionality*). To keep dimensionality at bay, we thus modify one-hot encoding: we do not add an additional feature for *each* unique value in a categorical feature, but only the 20 *most frequent* unique values. At the end, we drop the categorical feature.

In [None]:
X_train.shape

In [None]:
features_num_train = list(X_train.select_dtypes(include=['float64', 'int64']).columns)
features_cat_train = list(X_train.select_dtypes(include=['object']).columns)

In [None]:
for clm in features_cat_train:
    # Keep only the 20 most frequent values of this categorical feature
    most_freq_vals = X_train[clm].value_counts().head(20).index.tolist()
    dummy_clms = pd.get_dummies(
        X_train[clm].loc[X_train[clm].isin(most_freq_vals)], prefix=clm, dtype=float)
    # Rows with rarer (or missing) values have no dummy row; the join leaves NaN there
    X_train = X_train.join(dummy_clms, how='left')
    # Set those NaNs to 0 (= none of the frequent values is present)
    X_train[dummy_clms.columns] = X_train[dummy_clms.columns].fillna(0)
    # Drop the original categorical feature
    X_train.drop(clm, axis=1, inplace=True)

This preprocessing step adds 414 features (net), leaving us with 490 total features - all numerical.

In [None]:
X_train.info()
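
The capped one-hot encoding described above can be sketched on toy data. The helper `one_hot_top_k` and the column names are hypothetical and only illustrate the idea; the cap is set to 2 instead of 20 to keep the example readable.

```python
import pandas as pd

def one_hot_top_k(df, column, k):
    """Replace `column` with indicator columns for its k most frequent values."""
    top_values = df[column].value_counts().index[:k]
    # Values outside the top k (and missing values) become NaN and get all-zero indicators
    kept = df[column].where(df[column].isin(top_values))
    dummies = pd.get_dummies(kept, prefix=column, dtype=int)
    return df.drop(columns=column).join(dummies)

# Toy data (assumption): "C" is rare and one value is missing
toy = pd.DataFrame({
    "tariff": ["A", "A", "A", "B", "B", "C", None],
    "minutes": [10, 25, 5, 40, 12, 7, 30],
})
encoded = one_hot_top_k(toy, "tariff", k=2)
print(encoded)
```

Rare values are thus not dropped as observations; they simply end up with zeros in all indicator columns of that feature.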

#### Adding missing value indicator features
We now return once more to the issue of missing values. Since the number of missing values in the data is so high, we want to think some more about how we might exploit that. While a missing value represents an absence of information, the very fact per se that a value is missing might indicate something (e.g., a customer not booking a particular service) and possess predictive power. Thus, for each feature with missing values, we create an additional feature which indicates via the values 1 or 0 for each observation the presence or absence of a missing value.

In [None]:
import numpy as np
# Snapshot the column list first, since we add columns while looping
for clm in list(X_train.columns):
    if X_train[clm].isna().any():
        # 1 where the original value is missing, 0 otherwise
        X_train[f"{clm}_NaNInd"] = X_train[clm].isna().astype(int)

This preprocessing step adds 38 features, leaving us with 528 total features.

In [None]:
X_train.info()

#### Imputing missing values
To add at least some meaningful information to the data where values are missing, we impute missing values with feature-wise medians. (Alternatives we tried that did not result in better model performance, all feature-wise: mean, mode, minimum, maximum, missing value count.) Median imputation has the additional benefit that it is not sensitive to outliers (unlike the mean, minimum, or maximum).

In [None]:
# Use medians of the training data only, so that no test information leaks in
X_train.fillna(X_train.median(), inplace=True)

As we can see, our data now does not contain any missing values anymore:

In [None]:
round(X_train.isna().sum().sum()/(X_train.shape[0]*X_train.shape[1]), 3)
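
The two missing-value steps (indicator columns, then median imputation) eventually have to be repeated on the test data using statistics learned from the training data only. A minimal sketch on toy data follows; the frames, column names, and the helper `add_indicators_and_impute` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (assumption); "usage" contains an outlier and a missing value
train = pd.DataFrame({"usage": [1.0, np.nan, 3.0, 100.0],
                      "age": [30.0, 40.0, 50.0, 60.0]})
test = pd.DataFrame({"usage": [np.nan, 2.0], "age": [np.nan, 35.0]})

# Learn everything from the training data only
cols_with_nan = [c for c in train.columns if train[c].isna().any()]
train_medians = train.median()

def add_indicators_and_impute(df):
    """Add 0/1 missing-value indicators, then fill gaps with training medians."""
    out = df.copy()
    for c in cols_with_nan:
        out[f"{c}_NaNInd"] = out[c].isna().astype(int)
    return out.fillna(train_medians)

train_prep = add_indicators_and_impute(train)
test_prep = add_indicators_and_impute(test)
print(test_prep)
```

Note that the outlier (100.0) barely affects the imputed value, because the median rather than the mean is used.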

#### Transforming class labels
We also bring the class labels into a more intuitive form, mapping -1/1 to 0/1 (0 = "no churn", 1 = "churn").

In [None]:
y_train['Churn'] = ((y_train['Churn'] + 1) / 2).astype(int)
y_train.value_counts()

#### Rescale features
Finally, we rescale the features. We have seen in EDA that the features were measured on different scales. To rule out that this affects our model, we subtract from each feature (except the indicator columns we added) its median and divide by its interquartile range, that is, the range between the 1st and 3rd quartile. Because median and quartiles are robust statistics, this also limits the influence of possible outliers in the data.

In [None]:
from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
X_train[features_num_train] = scaler.fit_transform(X_train[features_num_train])
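
To show what the robust rescaling computes, the sketch below fits a scaler on toy training values and applies it to toy test values, then reproduces the result by hand. The numbers and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

train_vals = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])  # note the outlier
test_vals = np.array([[2.5], [50.0]])

# Median and quartiles are learned from the training data only
toy_scaler = RobustScaler().fit(train_vals)
test_scaled = toy_scaler.transform(test_vals)

# Same computation by hand: (x - median) / (Q3 - Q1)
median = np.median(train_vals)
q1, q3 = np.percentile(train_vals, [25, 75])
manual = (test_vals - median) / (q3 - q1)

print(test_scaled.ravel(), manual.ravel())
```

Calling `transform` (not `fit_transform`) on the test data is what keeps the held-out data from influencing the scaling statistics.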

### *2.4. Technical Solution: Model Selection (incl. Optimization)*

**PRIO 1: IMPLEMENT SO THAT WORKS, PRIO 2: ADD SOME VISUALIZATIONS (e.g., learning curves)**

Now that we have preprocessed our data, we use it for building and optimizing an artificial neural network classification model. Herein, we will benefit from having spelled out earlier a technical problem statement (*maximize the F1-score over the churn predictions of Orange's customers by implementing an artificial neural network with more than one hidden layer and an output layer containing a single neuron with an activation function*). This statement will guide us in the following steps: defining the ANN's architecture, defining an optimization strategy, optimization, and final model training.


#### Define model optimization strategy
The reason for defining a model optimization strategy is simple: resources are limited, but the number of possible models is infinite (literally). This is due to the typically high number of optimizable parameters, and the infinite ranges of possible values for several of these parameters. This implies that simply jumping into optimization will likely make one end up endlessly tuning anything and everything. Generally speaking, model optimization strategies should be tailored to data science projects case by case, and they require grounding in a thorough understanding of the project's stakeholder expectations, the problem to be solved, the model classes used, and the available resources.
<br>
<br>
In our particular case, the prediction of churn among Orange's customers, we will focus our optimization strategy on four basic elements:
- (1) the evaluation metric
- (2) the optimization metric
- (3) the optimized hyperparameters, and
- (4) the optimization procedure. 

(1) _Evaluation metric_: F1-score.
<br>
We use this metric to identify which model among all trained models best solves our problem. We have explained our choice of the F1-score in 2.1. We also pull some further metrics from scikit-learn's metrics module to enable a more comprehensive assessment of the trained models.

In [None]:
import sklearn.metrics
scoring = {
    "F1": sklearn.metrics.make_scorer(sklearn.metrics.f1_score),
    "Accuracy": sklearn.metrics.make_scorer(sklearn.metrics.accuracy_score),
    "Recall": sklearn.metrics.make_scorer(sklearn.metrics.recall_score),
    "Precision": sklearn.metrics.make_scorer(sklearn.metrics.precision_score)
}

(2) _Optimization metric_: Binary cross-entropy loss.
<br>
We use this metric to allow the model to "learn", that is, adjust its coefficients ('weights') during training. Binary cross-entropy is a default optimization metric for binary classification problems, and there is no obvious reason to deviate. (Implementation: see step "Define artificial neural network architecture".)
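
For intuition, binary cross-entropy can be computed by hand on a few made-up predictions. The function below is a plain NumPy sketch (not the Keras implementation); the clipping constant `eps` is an assumption to avoid taking the logarithm of zero.

```python
import numpy as np

def binary_cross_entropy(y_true, p_pred, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    p = np.clip(p_pred, eps, 1 - eps)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1, 0, 0, 1])
confident_right = np.array([0.9, 0.1, 0.2, 0.8])  # hypothetical predictions
confident_wrong = np.array([0.1, 0.9, 0.8, 0.2])  # hypothetical predictions

loss_good = binary_cross_entropy(y_true, confident_right)
loss_bad = binary_cross_entropy(y_true, confident_wrong)
print(f"loss (mostly right): {loss_good:.3f}, loss (mostly wrong): {loss_bad:.3f}")
```

The loss grows sharply when the model is confidently wrong, which is the signal the optimizer uses to adjust the weights.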

(3) _Optimized hyperparameters_:
<br>
We will optimize the following parameters. These are also called "hyper"parameters to distinguish them from model-specific parameters such as coefficients. To be able to optimize these parameters, we have to include them in the ANN's architecture (see step "Define artificial neural network architecture").

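| param name | explanation | listed values |
| --- | --- | --- |
| batch_size | number of observations per weight update | 80 |
| learning_rate | step size of the Adam optimizer | 0.0001 |
| epochs | number of passes over the training data | 5 |
| dropout_rate | share of neurons randomly dropped after each hidden layer | 0.1 |
| noise | standard deviation of the Gaussian noise added after each hidden layer | 0.001 |
| reg | factor of the L2 activity regularization on each hidden layer | 0.001 |
| beta_1 | Adam decay rate for the first-moment estimate | 0.8 |
| beta_2 | Adam decay rate for the second-moment estimate | 0.999 |
| weight_constraint | max-norm limit on each hidden layer's weights | 0.5 |
| deep | "n" = one hidden layer, "y" = four hidden layers | "n", "y" |
| neurons | number of neurons in the first hidden layer | 30, 180, 350 |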

(4) _Optimization procedure_: Grid search with class-weighted, k-fold cross-validation.
<br>
This means we define some values for each optimized hyperparameter, and exhaustively search through the resulting parameter "grid". For each grid node, a model using this grid node's parameter value combination is trained on k training subsets (using the optimization metric) and evaluated on k validation subsets of the training data (using the evaluation metric). To account for class imbalance, weights are assigned to the classes during training so that misclassifying a "churn" customer is penalized more strongly than misclassifying a "no churn" customer (we unsuccessfully tried SMOTETomek resampling as an alternative). The purpose of validation sets is to have an indication of how well a trained model would generalize to unseen data after deployment (similar to the train/test-split logic described above). The k folds allow us to compute the evaluation metric as a mean, and thus make model evaluation more robust against bias resulting from random validation set sampling. (Implementation: see step "Optimization".)
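
How the "balanced" class weights come about can be shown on synthetic labels. The sketch assumes the 1:12 class ratio from the EDA; the label vector is made up for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels (assumption): 100 churners (1), 1,200 non-churners (0)
y_toy = np.array([1] * 100 + [0] * 1_200)
classes = np.unique(y_toy)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# "balanced" corresponds to n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

class_weight_dict = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight_dict)
```

With these weights, an error on a churning customer counts twelve times as much in the loss as an error on a non-churning customer, mirroring the class ratio.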

#### Define artificial neural network architecture
For the "deep learning" proof-of-concept requested by Orange, we choose a simple "feedforward" (as opposed to, e.g., recurrent or convolutional) neural net architecture. In essence, feedforward here means that the outputs of neurons on one network layer are sent ("fed forward") only to neurons of subsequent layers (instead of the same or previous layers). Further, to probe into the potential of "deep" versus "shallow" learning, we make the number of hidden layers variable.

In [None]:
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import Adam
from keras.constraints import maxnorm
from keras.regularizers import l2
from keras.layers import GaussianNoise
from keras.layers import Dropout
from keras.layers import Dense
from keras.models import Sequential
import keras.metrics

def create_model(learning_rate=0.001,
    dropout_rate=0.0,
    noise=0.001,
    reg=0.0,
    beta_1=0.9,
    beta_2=0.999,
    weight_constraint=100.0,
    deep="n",
    neurons=round(X_train.shape[1]**(1/1.2), 0),
):
    model = Sequential()
    model.add(
        Dense(
            neurons,
            activation="relu",
            input_dim=X_train.shape[1],
            kernel_constraint=maxnorm(weight_constraint),
            activity_regularizer=l2(reg),
        )
    )
    model.add(Dropout(dropout_rate))
    model.add(GaussianNoise(stddev=noise))
    if deep == "y":
        model.add(
            Dense(
                round(neurons**(1/1.2), 0),
                activation="relu",
                kernel_constraint=maxnorm(weight_constraint),
                activity_regularizer=l2(reg),
            )
        )
        model.add(Dropout(dropout_rate))
        model.add(GaussianNoise(stddev=noise))
        model.add(
            Dense(
                round(neurons**(1/1.5), 0),
                activation="relu",
                kernel_constraint=maxnorm(weight_constraint),
                activity_regularizer=l2(reg),
            )
        )
        model.add(Dropout(dropout_rate))
        model.add(GaussianNoise(stddev=noise))
        model.add(
            Dense(
                round(neurons**(1/1.8), 0),
                activation="relu",
                kernel_constraint=maxnorm(weight_constraint),
                activity_regularizer=l2(reg),
            )
        )
        model.add(Dropout(dropout_rate))
        model.add(GaussianNoise(stddev=noise))
    model.add(Dense(1, activation="sigmoid"))
    model.compile(
        optimizer=Adam(
            learning_rate=learning_rate,
            beta_1=beta_1,
            beta_2=beta_2),
        loss="binary_crossentropy",
        metrics=[
            "accuracy",
            keras.metrics.Precision(name="precision"),
            keras.metrics.Recall(name="recall")
        ]
    )
    return model
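
The layer widths implied by `create_model` can be worked out without building a network. The helper `hidden_layer_widths` below is hypothetical and only mirrors the exponents used above (1/1.2, 1/1.5, 1/1.8) for the three additional hidden layers of the deep variant.

```python
def hidden_layer_widths(neurons, deep):
    """Return the number of neurons per hidden layer for a given configuration."""
    widths = [int(neurons)]
    if deep == "y":
        # Each additional layer shrinks according to the exponents used in create_model
        widths += [int(round(neurons ** (1 / e))) for e in (1.2, 1.5, 1.8)]
    return widths

for n in (30, 180, 350):
    print(n, hidden_layer_widths(n, "n"), hidden_layer_widths(n, "y"))
```

The deep variant thus forms a funnel: every hidden layer is narrower than the one before it, ending in the single sigmoid output neuron.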

#### Optimization
We now turn to the actual model optimization. To this end, we first set up the optimization procedure we defined as part of our optimization strategy:

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.utils.class_weight import compute_class_weight

grid_name = "param_bundleTEST_grid"  # must exactly match one of the branch names below

if grid_name == "param_bundleTEST_grid":
    param_grid = dict(
        batch_size=[64, 128, 256],
        learning_rate=[0.0001],
        epochs=[5],
    )
elif grid_name == "param_bundle1_grid":
    param_grid = dict(
        batch_size=[16, 32, 64, 128, 256],
        learning_rate=[0.0001, 0.001, 0.01],
        epochs=[5, 10, 20, 40, 80],
    )
elif grid_name == "param_bundle2_grid":
    param_grid = dict(
        batch_size=[X],
        learning_rate=[X],
        epochs=[X],
        dropout_rate=[0.1, 0.5, 0.9],
        noise=[0.001, 0.1, 1],
        reg=[0.001, 0.01, 0.5],
    )
elif grid_name == "param_bundle3_grid":
    param_grid = dict(
        batch_size=[X],
        learning_rate=[X],
        epochs=[X],
        dropout_rate=[X],
        noise=[X],
        reg=[X],
        beta_1=[0.8, 0.9, 0.99],
        beta_2=[0.990, 0.995, 0.999],
        weight_constraint=[0.5, 2.0, 8.0],
    )
elif grid_name == "param_bundle4_grid":
    param_grid = dict(
        batch_size=[X],
        learning_rate=[X],
        epochs=[X],
        dropout_rate=[X],
        noise=[X],
        reg=[X],
        beta_1=[X],
        beta_2=[X],
        weight_constraint=[X],
        deep=["n", "y"],
        neurons=[30, 180, 350],
    )

model = KerasClassifier(build_fn=create_model, verbose=2) #this wrapper allows us to feed the Keras model into Sklearn's GridSearchCV class

grid = GridSearchCV(
    estimator=model, # the Keras ANN wrapped in KerasClassifier above
    param_grid=param_grid,
    verbose=3,
    refit="F1",
    n_jobs=2,
    scoring=scoring,
    return_train_score=True,
    cv=StratifiedKFold(n_splits=3, shuffle=True),
)

class_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(y_train.values),
    y=y_train.values.reshape(-1),
)
class_weights = dict(zip(np.unique(y_train.values), class_weights))

In [None]:
grid_name

All that is now left to do in optimization is to run the grid search on our data ("fit" it to our data, in scikit-learn language and logic) and wait:

In [None]:
grid_result = grid.fit(
    X_train,
    y_train,
    class_weight=class_weights)

#### Model selection
As a basis for model selection, we store the grid search's results to a dataframe:

In [None]:
results = pd.DataFrame(grid_result.cv_results_["params"])
results["means_val_F1"] = grid_result.cv_results_["mean_test_F1"]
results["means_val_Accuracy"] = grid_result.cv_results_["mean_test_Accuracy"]
results["means_val_Recall"] = grid_result.cv_results_["mean_test_Recall"]
results["means_val_Precision"] = grid_result.cv_results_["mean_test_Precision"]
results["means_train_F1"] = grid_result.cv_results_["mean_train_F1"]
results["means_train_Accuracy"] = grid_result.cv_results_["mean_train_Accuracy"]
results["means_train_Recall"] = grid_result.cv_results_["mean_train_Recall"]
results["means_train_Precision"] = grid_result.cv_results_["mean_train_Precision"]

from datetime import datetime
import openpyxl
path = "C:\\Users\\marc.feldmann\\Documents\\data_science_local\\CustomerChurnPrediction\\results\\hyparam_opt\\"
filename = (path + "FNN_clf_GSresults_" + grid_name + datetime.now().strftime("%d_%m_%Y__%H_%M_%S") + ".xlsx")
results.to_excel(filename)


In [None]:
print(grid_name)
results
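
Selecting a model from such a results table can be sketched as follows. The numbers below are invented purely to demonstrate the mechanics and are not results of this project; the column names follow the dataframe built above.

```python
import pandas as pd

# Made-up values for illustration only (NOT actual grid search results)
toy_results = pd.DataFrame({
    "batch_size": [64, 128, 256],
    "means_val_F1": [0.21, 0.24, 0.19],
    "means_train_F1": [0.23, 0.35, 0.20],
})

# A large gap between training and validation F1 hints at overfitting
toy_results["F1_gap"] = toy_results["means_train_F1"] - toy_results["means_val_F1"]

# Rank by the evaluation metric on the validation folds
ranked = toy_results.sort_values("means_val_F1", ascending=False)
best = ranked.iloc[0]
print(ranked)
```

Looking at the gap column alongside the validation score helps to avoid picking a configuration that only looks good because it memorized the training folds.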

    - grid search visualization
    - nod to: model diagnosis: learning curves (train [how well model learns], val [how well model generalizes to new data]) > show that I can interpret and draw conclusions from that

#### Final model training
Finally, we train the model one last time with the optimal hyperparameters (i.e., those that maximize the F1-score on the validation data), this time on the entire training data instead of on subsets as in cross-validation. The result is our "final" model.

store model! pickle

### *2.5. Technical Solution: Final Model Evaluation*
Model Evaluation on 'Unseen' Data (simulate by priorly held out 'Test Data')
- Do the results make sense?


Result interpretation
- in write-up: reflect on fact neural networks / deep learning seem to be overhyped
- see e.g.: Peter Roßbach: "Neural Networks vs. Random Forests – Does it always have to be Deep Learning?
- - make that explicit point of the write-up! "test" that!
- show here that I know how to work with learning curves

when looking at results, come back to earlier point, explain via class imbalance
come back to earlier point: 
- make it one main technical point in the article that high accuracy can be misleading (when? why?) - have to also check other measures
- - include in write-up my reflections for using precision/recall instead of AUC (argue by importance to detect minority class relative to importance of TPs and FPs) (explain in simple language)

- show/compare how accuracy can be misleading


- feature importance chart

### *2.6. Technical Solution: Future Optimization Potentials*
(collect here everything I did not manage to do in time but consider important, to preempt criticism)

Scheme: potential - implementation effort - expected effect of implementation on business metric

- version 2: optimization potentials (versus v1) to explore ceteris paribus:
- not explored/limitations: only individually optimized, due to constraints in processing power and time, optimization dependencies between variables neglected
- only narrow ranges in grid search covered, so there is a sound chance that we only found local optima per parameter
- potential: NaN imputation with means on subsets of rows: one could search for powerful clustering criteria first and then impute cluster means
- also: was using smaller dataset, large dataset with many more variables may allow to increase a classifier's precision/recall
- Make sure to also compare to others' results - I seem to be already working at the upper boundary of what's possible on this dataset with ANNs!
- optimization potential: in practice, one would normally train many different models and select/stack the best; show somehow that I'm aware of that
- (optimization potential: add and compare AUC: simple logistic regression, random forest, 'flat' neural network, XGBoost)
- optimization potential: put data into an AWS instance and run there
- multicollinearity - check whether an issue - we want to have model as simple as possible! will decrease risk of overfitting that ANNs are especially prone to
- outliers: MinMax or Robust scaler made no big difference (see model_comparison Excel), suggesting it is not problematic that we have not removed outliers in data preprocessing; still might contain some potential to increase model performance
- feature selection: have touched (selectKBest), but not exhausted feature selection 
- feature engineering: dimensionality reduction, create new and more 'powerful' features; briefly touch on the curse of dimensionality and on ANNs' overfitting tendency; feature selection would have the benefit of being explainable; however, features are anonymized anyway
- optimization: optimization is often only thought about in terms of tuning hyperparameters; but preprocessing also includes many steps that can be done in different ways, meaning it also has potential for optimization: read the following in conjunction with the "diary" to see what I have optimized here and with what success:
- code cleaning: to increase code reusability and readability, repetitive parts could be wrapped into functions and moved into separate script (e.g., preprocessing steps applied both to training and test data)
- clarify optimization approach: first optimized preprocessing (experimented e.g. with X instead of Y) - resulted in above described procedure; now: hyperparameter tuning:
- optimize network architecture, e.g. LSTM, see "neural network zoo"

## **3. Business Recommendations**
"What do the generated insights/model urge us/allow us to do different next Monday, and which value (business metric!) will that generate?"

Derive directly from the assignment (1.), incl.:
- (after implementing comparative models:) "turns out, deep learning (might) not be best for this kind of problem; best practice for computer vision and very large datasets; here: a tree model such as XGBoost or a simple logistic regression is better"
- based on a feature importance chart for final ANN, identify potential churn drivers: discuss how can be made visible and influenced by which staff groups XY (account managers? service managers?); measure, enable and encourage these staff groups to act on identified drivers  

Good Example: https://www.kaggle.com/code/hamzaben/employee-churn-model-w-strategic-retention-plan/notebook