# Feature Engineering and Modelling

---

1. Import packages
2. Load data
3. Modelling

---

## 1. Import packages

In [3]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
from datetime import datetime
import matplotlib.pyplot as plt


# Shows plots in jupyter notebook
%matplotlib inline

# Set plot style
sns.set(color_codes=True)

---
## 2. Load data

In [5]:
df = pd.read_csv('data_for_predictions.csv')
df.drop(columns=["Unnamed: 0"], inplace=True)
df.head()

Unnamed: 0,id,cons_12m,cons_gas_12m,cons_last_month,forecast_cons_12m,forecast_discount_energy,forecast_meter_rent_12m,forecast_price_energy_off_peak,forecast_price_energy_peak,forecast_price_pow_off_peak,...,months_modif_prod,months_renewal,channel_MISSING,channel_ewpakwlliwisiwduibdlfmalxowmwpci,channel_foosdfpfkusacimwkcsosbicdxkicaua,channel_lmkebamcaaclubfxadlmueccxoimlema,channel_usilxuppasemubllopkaafesmlibmsdf,origin_up_kamkkxfxxuwbdslkwifmmcsiusiuosws,origin_up_ldkssxwpmemidmecebumciepifcamkci,origin_up_lxidpiddsbxsbosboudacockeimpuepw
0,24011ae4ebbe3035111d65fa7c15bc57,0.0,4.739944,0.0,0.0,0.0,0.444045,0.114481,0.098142,40.606701,...,2,6,0,0,1,0,0,0,0,1
1,d29c2c54acc38ff3c0614d0a653813dd,3.668479,0.0,0.0,2.28092,0.0,1.237292,0.145711,0.0,44.311378,...,76,4,1,0,0,0,0,1,0,0
2,764c75f661154dac3a6c254cd082ea7d,2.736397,0.0,0.0,1.689841,0.0,1.599009,0.165794,0.087899,44.311378,...,68,8,0,0,1,0,0,1,0,0
3,bba03439a292a1e166f80264c16191cb,3.200029,0.0,0.0,2.382089,0.0,1.318689,0.146694,0.0,44.311378,...,69,9,0,0,0,1,0,1,0,0
4,149d57cf92fc41cf94415803a877cb4b,3.646011,0.0,2.721811,2.650065,0.0,2.122969,0.1169,0.100015,40.606701,...,71,9,1,0,0,0,0,1,0,0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14606 entries, 0 to 14605
Data columns (total 63 columns):
 #   Column                                      Non-Null Count  Dtype  
---  ------                                      --------------  -----  
 0   id                                          14606 non-null  object 
 1   cons_12m                                    14606 non-null  float64
 2   cons_gas_12m                                14606 non-null  float64
 3   cons_last_month                             14606 non-null  float64
 4   forecast_cons_12m                           14606 non-null  float64
 5   forecast_discount_energy                    14606 non-null  float64
 6   forecast_meter_rent_12m                     14606 non-null  float64
 7   forecast_price_energy_off_peak              14606 non-null  float64
 8   forecast_price_energy_peak                  14606 non-null  float64
 9   forecast_price_pow_off_peak                 14606 non-null  float64
 10  has_gas   

---

## 3. Modelling

We now have a dataset containing features that we have engineered and we are ready to start training a predictive model.

In [7]:
from sklearn import metrics
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

### Data sampling

The first step is to split the dataset into training and test samples. Holding out a test sample lets us simulate a real-life situation: we generate predictions for data points the model has never seen, which shows how well it generalises to new data.

Typically 20-30% of the data is reserved for testing. For this example we use a 75-25% split between train and test respectively.

In [8]:
# Make a copy of our data so the original DataFrame stays untouched
train_df = df.copy()

# Separate target variable from independent variables
y = train_df['churn']
X = train_df.drop(columns=['id', 'churn'])
print(X.shape)
print(y.shape)
print(y.value_counts())

(14606, 61)
(14606,)
churn
0    13187
1     1419
Name: count, dtype: int64


In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(10954, 61)
(10954,)
(3652, 61)
(3652,)
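Because churners make up only about 10% of the data, it can help to keep that ratio identical in both samples. The sketch below is an illustration rather than the split used in this notebook: it rebuilds a label vector with the same class counts as `y` and passes `stratify` to `train_test_split`. The names `X_demo` and `y_demo` are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: same class counts as the churn column
y_demo = np.array([0] * 13187 + [1] * 1419)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy single feature

# stratify keeps the churn rate (almost) identical in train and test
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42
)

print("overall churn rate:", round(y_demo.mean(), 4))
print("train churn rate:  ", round(y_tr.mean(), 4))
print("test churn rate:   ", round(y_te.mean(), 4))
```

Without `stratify`, the churn rate in each sample depends on the random draw, which matters more the rarer the positive class is.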


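### Handling imbalance

Only about 10% of customers churn (1,419 of 14,606), so we oversample the minority class with SMOTE. The resampling is applied to the training split only, which leaves the test set at its original class balance.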

In [10]:
from imblearn.over_sampling import SMOTE
sm = SMOTE(random_state=42)
X_train_res, y_train_res = sm.fit_resample(X_train, y_train)
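To give a feel for what SMOTE does, here is a simplified NumPy sketch of its core idea: each synthetic sample is placed at a random point on the line between a minority sample and one of its nearest minority neighbours. This is not imblearn's implementation. The function `smote_like_oversample` and the toy data are hypothetical and for illustration only.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances within the minority class
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, len(X_min), size=n_new)             # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                               # position along the segment
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: 20 samples with 3 features
rng = np.random.default_rng(42)
X_minority = rng.normal(size=(20, 3))
X_synth = smote_like_oversample(X_minority, n_new=80)
print(X_synth.shape)
```

Because every synthetic point lies between two real minority samples, the new points stay inside the region the minority class already occupies.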

### Model training

## About random forest classifier

Once again, we are using a `Random Forest` classifier in this example. A Random Forest sits within the category of `ensemble` algorithms because internally the `Forest` refers to a collection of `Decision Trees` which are tree-based learning algorithms. As the data scientist, you can control how large the forest is (that is, how many decision trees you want to include).

An `ensemble` algorithm is powerful because it combines many weak learners and averages out their individual errors. If we take a single decision tree and give it a sample of data and some parameters, it will learn patterns from the data. It may overfit or it may underfit, and with a single tree that one model is all we have to rely on.

With `ensemble` methods, instead of banking on a single trained model, we can train thousands of decision trees, each using a different sample of the data and learning different patterns. It is like asking 1000 people to learn how to code: you end up with 1000 people with different answers, methods and styles. The weak learner notion applies here too. If you train each learner to pick up weak patterns in the data rather than to overfit, and you have many of these weak learners, together they form a highly predictive pool of knowledge. This is a real-life application of the idea that many brains are better than one.

Instead of relying on a single decision tree for prediction, the random forest pools the views of the entire collection of decision trees. Some ensemble algorithms use a voting approach to decide which prediction is best, while others use averaging.

As we increase the number of learners, the idea is that the random forest's performance should converge to its best possible solution.

Some additional advantages of the random forest classifier include:

- The random forest uses a rule-based approach instead of a distance calculation and so features do not need to be scaled
- It is able to handle non-linear relationships better than linear models

On the flip side, some disadvantages of the random forest classifier include:

- The computational power needed to train a random forest on a large dataset is high, since we need to build a whole ensemble of estimators.
- Training time can be longer due to the increased complexity and size of the ensemble
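The averaging idea described above can be checked directly. In scikit-learn, a fitted `RandomForestClassifier` exposes its individual trees through `estimators_`, and its predicted probabilities are the mean of the trees' predicted probabilities. The sketch below uses a small synthetic dataset (`X_demo`, `y_demo` are hypothetical, not the churn data) to confirm this.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic binary classification problem, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_demo, y_demo)

# Collect each tree's class probabilities, then average across the forest
tree_probas = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])
manual_average = tree_probas.mean(axis=0)

print("number of trees:", len(forest.estimators_))
print("forest output equals tree average:",
      np.allclose(manual_average, forest.predict_proba(X_demo)))
```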

In [None]:
# Use RandomizedSearchCV to tune hyperparameters and improve model performance
param_dist = {
    'n_estimators': [200, 300, 400, 500],
    'max_depth': [None, 15, 25],
    'max_features': ['sqrt', 'log2', 0.3],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}
rf = RandomForestClassifier(class_weight='balanced', random_state=42, n_jobs=-1)
search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=25,   # random combos for speed
    cv=5,
    scoring='f1_macro',  # more robust for imbalance
    random_state=42,
    n_jobs=-1,
    verbose=2
)
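To see how much of the search space `n_iter=25` actually covers, we can count the full grid and draw the sampled candidates ourselves. This sketch uses scikit-learn's `ParameterGrid` and `ParameterSampler` on a copy of the random forest search space. The assumption is that `RandomizedSearchCV` draws its candidates in the same way when given the same `random_state`.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Copy of the random forest search space defined for the search
param_dist = {
    'n_estimators': [200, 300, 400, 500],
    'max_depth': [None, 15, 25],
    'max_features': ['sqrt', 'log2', 0.3],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False],
}

# Every possible combination: 4 * 3 * 3 * 3 * 3 * 2
n_total = len(ParameterGrid(param_dist))

# The random subset a randomized search would try
sampled = list(ParameterSampler(param_dist, n_iter=25, random_state=42))

print("total combinations:", n_total)
print("sampled combinations:", len(sampled))
print("share of grid explored:", round(len(sampled) / n_total, 3))
```

With 5-fold cross-validation, 25 candidates mean 125 fits, compared with 3,240 fits for an exhaustive grid search over the same space.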


In [None]:
search.fit(X_train_res, y_train_res)
best_model = search.best_estimator_
print("Best Params:", search.best_params_)


Fitting 5 folds for each of 25 candidates, totalling 125 fits
Best Params: {'n_estimators': 300, 'min_samples_split': 5, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 25, 'bootstrap': False}


In [None]:
y_pred = best_model.predict(X_test)  # hard class predictions, used for the classification report
y_proba = best_model.predict_proba(X_test)[:, 1]  # predicted churn probability, used for probability-based metrics

### Evaluation

Now let's evaluate how well this trained model is able to predict the values of the test dataset.

In [None]:
# Calculate performance metrics here!
from sklearn.metrics import classification_report, roc_auc_score, f1_score, log_loss

print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print("F1 (macro):", f1_score(y_test, y_pred, average='macro'))
print("Log Loss:", log_loss(y_test, y_proba))


Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.99      0.95      3286
           1       0.58      0.09      0.16       366

    accuracy                           0.90      3652
   macro avg       0.74      0.54      0.55      3652
weighted avg       0.87      0.90      0.87      3652

ROC-AUC: 0.6522982914766737
F1 (macro): 0.5540514609681639
Log Loss: 0.33328866367492505
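One way to read the ROC-AUC figure above: it is the probability that a randomly chosen churner receives a higher predicted score than a randomly chosen non-churner. The sketch below checks that interpretation on made-up scores (`y_true_demo` and `scores_demo` are hypothetical, not the model's output).

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and scores; positives are nudged towards higher scores
rng = np.random.default_rng(0)
y_true_demo = rng.integers(0, 2, size=200)
scores_demo = rng.random(200) + 0.3 * y_true_demo

pos = scores_demo[y_true_demo == 1]
neg = scores_demo[y_true_demo == 0]

# Share of (positive, negative) pairs ranked correctly; ties count as half
pairwise = (pos[:, None] > neg[None, :]).mean() + 0.5 * (pos[:, None] == neg[None, :]).mean()
auc = roc_auc_score(y_true_demo, scores_demo)

print("pairwise ranking probability:", round(pairwise, 4))
print("roc_auc_score:               ", round(auc, 4))
```

On this reading, a ROC-AUC of about 0.65 means the model ranks a churner above a non-churner roughly 65% of the time, where 50% would be random guessing.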


### Now let's try XGBoost in place of Random Forest

In [None]:
from xgboost import XGBClassifier 

# Define the hyperparameter search space
param_dist = {
    'n_estimators': [200, 300, 400, 500],  # Number of boosting rounds (trees)
    'max_depth': [3, 5, 7, 10, 15],  # Maximum depth of each tree
    'learning_rate': [0.01, 0.05, 0.1, 0.2],  # Step size shrinkage to prevent overfitting (regularization)
    'gamma': [0, 0.1, 0.5, 1],  # Minimum loss reduction required to make a further partition
    'min_child_weight': [1, 5, 10],  # Minimum sum of instance weight (hessian) needed in a child
    'subsample': [0.6, 0.8, 1.0],  # Subsample ratio of the training instances
    'colsample_bytree': [0.6, 0.8, 1.0],  # Subsample ratio of columns when constructing each tree
}

# Initialize the XGBoost classifier
# Use 'scale_pos_weight' if you choose to skip SMOTE, but here we assume SMOTE is used.
xgb_model = XGBClassifier(
    random_state=42, 
    n_jobs=-1,
    use_label_encoder=False,  # ignored by this XGBoost version (see the "not used" warning in the fit output)
    eval_metric='logloss' 
)

# Initialize RandomizedSearchCV with the XGBoost model
search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_dist,
    n_iter=25,   # random combos for speed
    cv=5,
    scoring='f1_macro',  # more robust for imbalance
    random_state=42,
    n_jobs=-1,
    verbose=2
)
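The comment in the cell above mentions `scale_pos_weight` as an alternative to SMOTE. A common choice is the ratio of negative to positive samples in the training set. The sketch below computes it from class counts derived from this notebook's outputs: the full counts (13187 / 1419) minus the test support (3286 / 366). That derivation is an assumption, and `y_train_demo` is a hypothetical stand-in for `y_train`.

```python
import numpy as np

# Stand-in for y_train, with counts derived from the outputs shown earlier:
# 13187 - 3286 = 9901 non-churners, 1419 - 366 = 1053 churners
y_train_demo = np.array([0] * 9901 + [1] * 1053)

n_neg = int((y_train_demo == 0).sum())
n_pos = int((y_train_demo == 1).sum())

# Ratio of negatives to positives, to be passed as XGBClassifier(scale_pos_weight=...)
scale_pos_weight = n_neg / n_pos
print("scale_pos_weight:", round(scale_pos_weight, 2))
```

If this route is taken, the model would be fitted on the original `X_train` and `y_train` rather than on the SMOTE-resampled data, so that the imbalance is not corrected twice.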

### Training and fitting

In [20]:
# Fit the randomized search on the resampled training data
search.fit(X_train_res, y_train_res)
best_model = search.best_estimator_
print("Best Params:", search.best_params_)

# Predict on the untouched test set
y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]


Fitting 5 folds for each of 25 candidates, totalling 125 fits


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Best Params: {'subsample': 0.8, 'n_estimators': 200, 'min_child_weight': 1, 'max_depth': 15, 'learning_rate': 0.1, 'gamma': 0, 'colsample_bytree': 0.8}


In [19]:
from sklearn.metrics import classification_report, roc_auc_score, f1_score, log_loss

print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, y_proba))
print("F1 (macro):", f1_score(y_test, y_pred, average='macro'))
print("Log Loss:", log_loss(y_test, y_proba))


Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.98      0.95      3286
           1       0.53      0.16      0.24       366

    accuracy                           0.90      3652
   macro avg       0.72      0.57      0.59      3652
weighted avg       0.87      0.90      0.88      3652

ROC-AUC: 0.6700857088692217
F1 (macro): 0.5938988453768742
Log Loss: 0.381408241435056


### Why I chose these evaluation metrics

1. Simple accuracy is misleading because of the class imbalance.

2. ROC-AUC gives more information about the ability of our model to separate the classes.

3. The classification report provides metrics such as precision and recall, which are better for understanding the performance of our model on each class.

4. Macro F1 treats both classes equally, regardless of the imbalance.
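The first and last points can be made concrete with a baseline that never predicts churn. The sketch below rebuilds labels with the same class counts as the test set (3286 non-churners, 366 churners, taken from the classification report) and scores an all-zeros prediction. The names `y_true_baseline` and `y_majority` are hypothetical.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Labels with the same class counts as the test set in the classification report
y_true_baseline = np.array([0] * 3286 + [1] * 366)

# A "model" that always predicts the majority class (no churn)
y_majority = np.zeros_like(y_true_baseline)

acc = accuracy_score(y_true_baseline, y_majority)
f1_macro = f1_score(y_true_baseline, y_majority, average="macro", zero_division=0)

print("accuracy:  ", round(acc, 4))
print("F1 (macro):", round(f1_macro, 4))
```

This baseline reaches roughly 0.90 accuracy, the same as both trained models, yet its macro F1 is only about 0.47 because it never identifies a single churner. That is why accuracy alone tells us little here.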

## Final thoughts
1. Although XGBoost did not perform as well as hoped, it still did better than the random forest on churn recall (0.16 vs 0.09), macro F1 (0.59 vs 0.55) and ROC-AUC (0.67 vs 0.65). Its log loss was higher, however (0.38 vs 0.33).

2. More hyperparameter tuning is needed, and in future I can try other algorithms too.