In [28]:
#Imports

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import accuracy_score
import pandas as pd 
from scipy.stats import chi2_contingency
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score, recall_score
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif



For this binary prediction task, I have decided to use the decision tree classifier. The latter is indeed a 'white box' model, as it is highly interpretable and its functiong and choises for each data point can be deeply investigated and explained.
Furthermore, the decision tree classifier is one of the best model for classification tasks. 
A decision tree is also capable of working with missing values, and this dataset contains a non-negligible number of them.


In [3]:
# This is a custom function aimed at displayig the most importan metrics for a binary classification, together with a confusion matrix
# The input for this function must be binary predictions

def evaluate(y_pred, y_validation):
    cm = confusion_matrix(y_validation, y_pred)
    acc = accuracy_score(y_validation,y_pred)
    tn, fp, fn, tp = cm.ravel()
    recall = tp / (tp + fn)
    beta = 0.5

    if tp == 0 and fp == 0:
        precision = 0.0
        F1 = 0.0
    else:
        precision = tp / (tp + fp)
        F1 = (1+beta**2)*((precision * recall)/((beta**2 * precision) + recall))
        
    print("Confusion Matrix:")
    print("{:>10} {:>10} {:>10}".format("", "Predicted 0", "Predicted 1"))
    print("{:>10} {:>10} {:>10}".format("Actual 0", tn, fp))
    print("{:>10} {:>10} {:>10}".format("Actual 1", fn, tp))
    print("Recall:", round(recall, 3))
    print("Precision:", round(precision, 3))
    print("Accuracy:", round(acc, 4))
    print("F1 score:", round(F1, 4))


In [29]:
df = pd.read_csv("https://raw.githubusercontent.com/leotasso3/Xtream_Tasso/main/datasets/employee-churn/churn.csv")  # Reading the dataset

# isolating the target variable, which is in the last column of the dataset
y = df.iloc[:, -1]

# The predictors will be all the other variables, but for the city (explanation below) and the ID column 
x = df.iloc[:, 1:13]
x=x.drop('city', axis=1)

I have decided to split the dataset into training (70%), validation (15%) and test set (15%). The training set will be indeed used for training the DT to build its nodes and leaves. 
I want to try different configuration of the model, starting from a baseline one with the (almost) untouched dataset, for then moving to hyperparameter optimization, PCA and feature selection. This is why the validation set is here. After training the model, I will test the performance of the latter on the validation data, and as I want to try different configuration, I will do it more than once. 
The metrics obtained on the validation set will be used for selecting the model configuration that yielded the best results. That model will then be tested on the test set: this will give the opportunity to evaluate the generalization of the model on unseen data.
Without the validation set, testing different models configurations on the test set would could lead to overfitting, as doing this would basically mean to adapt the choice of the best model configuration on future data, which of course is not possible in reality. 
As the dataset is slightly imbalanced, I will apply stratification with respect to the target variable, so that the same proportion of the latter will be present in every set

In [30]:
# Set creation and stratification (the randon_state is for reproducibility purposes)
x_train_temp, x_test, y_train_temp, y_test = train_test_split(x, y, test_size=0.15, random_state=42, stratify=y)
x_train, x_val, y_train, y_val = train_test_split(x_train_temp, y_train_temp, test_size=0.18, random_state=42, stratify=y_train_temp)


# Verification of stratification
print("Distribution of the target variable in the training set:", y_train.value_counts(normalize=True))
print("Distribution of the target variable in the validation set:", y_val.value_counts(normalize=True))
print("Distribution of the target variable in the test set:", y_test.value_counts(normalize=True))

Distribuzione della variabile target nel train set: target
0.0    0.750674
1.0    0.249326
Name: proportion, dtype: float64
Distribuzione della variabile target nel validation set: target
0.0    0.750682
1.0    0.249318
Name: proportion, dtype: float64
Distribuzione della variabile target nel test set: target
0.0    0.750522
1.0    0.249478
Name: proportion, dtype: float64


As the DecisionTreeClassifier won't work with 'object' as variable type (categorical), I will apply dummy encoding.
This will increase the dimensionality of the dataset and could therefore increment the computational costs for the model training.
For an eventual future dataset way larger than the one provided here, it is advisable to apply PCA or feature selection (which will in any case tested on this dataset as well).
The reasons why I have decided to remove the 'city' column are mainly 2: 
-This variable has 73 classes, and dummy coding it would result in a severe increase in the dimensionality of the dataset, which of course is not positive either for computational costs and risk of overfitting
-Stratify this variable could be very difficult: if there are very few people coming from one city, it could be that the relative coded column won't be present in all the 3 sets, which of course will lead to an error of model training or model prediction.


In [31]:
categorical_columns = x_train.select_dtypes(include=['object']).columns.tolist() # Storing the categorical variable

# Dummy coding the categorical variables in all the 3 sets 
x_train_encoded = pd.get_dummies(x_train, columns=categorical_columns) 
x_val_encoded = pd.get_dummies(x_val, columns=categorical_columns)
x_test_encoded = pd.get_dummies(x_test, columns=categorical_columns)


The following cell is aimed at computing the baseline results on the validation set. These results will be used as a reference for eventual improvements after parameters tuning

In [None]:
model = DecisionTreeClassifier()
model.fit(x_train_encoded, y_train)
pred = model.predict(x_val_encoded)
evaluate(pred, y_val)

In [35]:
hyperparameters = {
    "criterion":['gini', 'entropy'],
    "max_depth":[10,15,20,30,35,40,45],
    'min_samples_split': [2, 5, 10, 20]
}

grid_search = GridSearchCV(model, param_grid = hyperparameters, cv = 10)

grid_search.fit(x_train_encoded, y_train)

In [36]:
grid_search.best_params_

{'criterion': 'entropy', 'max_depth': 15, 'min_samples_split': 20}

In [37]:
model = DecisionTreeClassifier(criterion='entropy', max_depth=15, min_samples_split=20)
model.fit(x_train_encoded, y_train)
pred = model.predict(x_val_encoded)
evaluate(pred, y_val)

Confusion Matrix:
           Predicted 0 Predicted 1
  Actual 0       1843        358
  Actual 1        329        402
Recall: 0.55
Precision: 0.529
Accuracy: 0.7657
F1 score: 0.533


trying pca

In [12]:
scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train_encoded)
x_val_scaled = scaler.fit_transform(x_val_encoded)

# Applica la PCA sulle variabili dopo l'encoding
pca = PCA(n_components=20)  # Sostituisci 'numero_di_componenti' con il numero desiderato di componenti
x_train_pca = pca.fit_transform(x_train_scaled)
x_val_pca = pca.fit_transform(x_val_scaled)

model = DecisionTreeClassifier(criterion='entropy', max_depth=15, min_samples_split=20)
model.fit(x_train_pca, y_train)
pred = model.predict(x_val_pca)
evaluate(pred, y_val)


Confusion Matrix:
           Predicted 0 Predicted 1
  Actual 0       1663        538
  Actual 1        490        241
Recall: 0.33
Precision: 0.309
Accuracy: 0.6494
F1 score: 0.3132


trying selecting k-best

In [33]:
selector = SelectKBest(score_func=f_classif, k=15)  # Sostituisci 'k' con il numero desiderato di caratteristiche
x_selected = selector.fit_transform(x_train_encoded, y_train)

selected_columns = selector.get_support()

# Ottieni l'elenco delle colonne selezionate utilizzando l'array booleano
all_columns = x_train_encoded.columns.tolist()  # Sostituisci con i nomi reali delle colonne
selected_column_names = [column for column, selected in zip(all_columns, selected_columns) if selected]

x_val_selected = x_val_encoded[selected_column_names]

model = DecisionTreeClassifier(criterion='entropy', max_depth=15, min_samples_split=20)
model.fit(x_selected, y_train)
pred = model.predict(x_val_selected)
evaluate(pred, y_val)



Confusion Matrix:
           Predicted 0 Predicted 1
  Actual 0       1966        235
  Actual 1        421        310
Recall: 0.424
Precision: 0.569
Accuracy: 0.7763
F1 score: 0.5325


