# Modeling
The objective of this notebook is to build the ML model, based on the feature engineering performed in
the previous exercise.

The objective is to predict whether a customer will churn, using binary classification. The goal is to maximize predictive performance to support better business decision-making.

In [6]:
import pandas as pd
import joblib  
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

We split the dataset for training and testing purposes.


In [2]:
df = pd.read_csv("../data/processed/engineered_churn_data.csv")

x = df.drop(columns=["Churn","customerID","tenure_group"]) # features
y = df["Churn"] # target variable

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)


`train_test_split` divides the dataset into 80% training data and 20% testing data.
The `stratify` parameter ensures that the distribution of the target variable `Churn`
is maintained in both sets.
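
To illustrate what stratification does, here is a minimal sketch on synthetic data. The arrays `X_demo` and `y_demo` and the 27% positive rate are assumptions made for the example; they are not the churn dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary target (hypothetical data, not the churn dataset)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 270 + [0] * 730)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratify, the positive rate stays (almost) the same in every split
print(f"Full:  {y_demo.mean():.3f}")
print(f"Train: {y_tr.mean():.3f}")
print(f"Test:  {y_te.mean():.3f}")
```

Without `stratify`, the positive rate of a random split can drift away from the full dataset's rate, which makes metrics on the test set harder to compare.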

For the models, Logistic Regression and Random Forest will be used.

**Logistic Regression** is a supervised ML algorithm, useful for binary classification. Unlike Linear Regression, which predicts a continuous value, it predicts the probability that an observation belongs to a category ([IBM, 2025](https://www.ibm.com/think/topics/logistic-regression)).
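
As a rough illustration of how a probability comes out of a linear model, the sketch below applies the sigmoid function to the linear score of a fitted scikit-learn model and compares it with `predict_proba`. The synthetic data from `make_classification` is an assumption for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (hypothetical, not the churn dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = w.x + b, then the sigmoid squashes it into (0, 1)
z = clf.decision_function(X_demo)
p_manual = 1.0 / (1.0 + np.exp(-z))

# predict_proba returns [P(class 0), P(class 1)]; column 1 is the positive class
p_sklearn = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(p_manual, p_sklearn))
```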

**Random Forest** is an ML algorithm that combines the output of multiple decision trees to reach a single result. It handles both classification and regression problems ([IBM, 2025](https://www.ibm.com/think/topics/random-forest)).
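
The following sketch shows one way to see the "many trees, one result" idea in scikit-learn: averaging the class probabilities of the individual trees reproduces the forest's own `predict_proba`. The synthetic data and the choice of 25 trees are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (hypothetical, not the churn dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X_demo, y_demo)

# Each fitted decision tree is available in estimators_
per_tree = np.stack([tree.predict_proba(X_demo) for tree in forest.estimators_])

# Averaging the trees' class probabilities gives the forest's probabilities
avg_proba = per_tree.mean(axis=0)
print(np.allclose(avg_proba, forest.predict_proba(X_demo)))
```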

### ❗Understanding Logistic Regression
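This code trains a Logistic Regression model on the training data.
It sets `max_iter=1000` to give the solver more iterations to converge and sets a random seed for reproducibility.
Even so, the solver still reports that it reached the iteration limit, and the warning in the output recommends scaling the data.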

In [3]:
# Train Logistic Regression model
log_reg = LogisticRegression(max_iter=1000, random_state=42) 
log_reg.fit(x_train, y_train)

STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT

Increase the number of iterations to improve the convergence (max_iter=1000).
You might also want to scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| Parameter | Value |
|---|---|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | 42 |
| solver | 'lbfgs' |
| max_iter | 1000 |
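
The convergence warning in the output suggests scaling the data. A minimal sketch of that idea, assuming synthetic data and a hypothetical `scaled_log_reg` pipeline (this is not the model trained in this notebook), could look like this:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)
X_demo[:, 0] *= 1000.0

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training split only, then reused on the test split
scaled_log_reg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_log_reg.fit(X_tr, y_tr)

# n_iter_ reports how many iterations the solver actually used
n_iter = scaled_log_reg.named_steps["logisticregression"].n_iter_[0]
print(f"Solver iterations used: {n_iter}")
print(f"Test accuracy: {scaled_log_reg.score(X_te, y_te):.4f}")
```

Wrapping the scaler and the model in one pipeline keeps the test data out of the scaling statistics, which avoids leaking information from the test set into training.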


### ❗Understanding Random Forest
This code snippet trains a Random Forest model on the training data.
It uses 100 trees in the forest and sets a random seed for reproducibility.

In [4]:
# Train Random Forest model
rf_clf = RandomForestClassifier(n_estimators=100, random_state=42)
rf_clf.fit(x_train, y_train)


| Parameter | Value |
|---|---|
| n_estimators | 100 |
| criterion | 'gini' |
| max_depth | None |
| min_samples_split | 2 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_features | 'sqrt' |
| max_leaf_nodes | None |
| min_impurity_decrease | 0.0 |
| bootstrap | True |


Next, we compare both models on the following metrics:
- Accuracy
- Precision
- Recall
- F1-Score
- ROC AUC

Recall, or the true positive rate, measures the proportion of actual positive instances that the model correctly identifies ([Google, 2025](https://developers.google.com/machine-learning/crash-course/classification/accuracy-precision-recall#:~:text=of%20the%20word.-,Recall%2C%20or%20true%20positive%20rate,P%20T%20P%20+%20F%20N)).

F1-score combines precision and recall into a single value, their harmonic mean. It provides a balanced measure of a model's performance in classification tasks and is particularly useful when dealing with imbalanced datasets ([V7 Labs, 2022](https://www.v7labs.com/blog/f1-score-guide)).
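
As a worked example, the sketch below computes precision, recall, and F1-score by hand from the confusion matrix counts and checks them against scikit-learn. The labels in `y_true` and `y_pred` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Hypothetical labels: 1 = churn, 0 = no churn
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted churners, how many really churned
recall = tp / (tp + fn)     # of the real churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.4f} (sklearn: {precision_score(y_true, y_pred):.4f})")
print(f"Recall:    {recall:.4f} (sklearn: {recall_score(y_true, y_pred):.4f})")
print(f"F1-score:  {f1:.4f} (sklearn: {f1_score(y_true, y_pred):.4f})")
```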

The area under the ROC curve (AUC) represents the probability that the model, if given a randomly chosen positive and negative example, will rank the positive higher than the negative ([Google, 2025](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)).
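
The ranking interpretation of AUC can be checked directly. This sketch compares every positive example with every negative example and counts how often the positive gets the higher score. The scores in `y_score` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores
y_true = np.array([1, 1, 1, 0, 0, 0, 0])
y_score = np.array([0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive with every negative; ties count as half a win
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(f"Pairwise AUC:  {pairwise_auc:.4f}")
print(f"roc_auc_score: {roc_auc_score(y_true, y_score):.4f}")
```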

In [5]:
def evaluate_model(model, x_test, y_test):
    y_pred = model.predict(x_test)
    # Probability estimates for the positive class (second column of predict_proba)
    y_proba = model.predict_proba(x_test)[:, 1]

    print(f"Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
    print(f"Precision: {precision_score(y_test, y_pred):.4f}")
    print(f"Recall:    {recall_score(y_test, y_pred):.4f}")
    print(f"F1-score:  {f1_score(y_test, y_pred):.4f}")
    print(f"ROC AUC:   {roc_auc_score(y_test, y_proba):.4f}")

# Evaluate Logistic Regression model
print("Logistic Regression Model Evaluation:")
evaluate_model(log_reg, x_test, y_test)
# Evaluate Random Forest model
print("\nRandom Forest Model Evaluation:")
evaluate_model(rf_clf, x_test, y_test)

Logistic Regression Model Evaluation:
Accuracy:  0.8010
Precision: 0.6478
Recall:    0.5508
F1-score:  0.5954
ROC AUC:   0.8413

Random Forest Model Evaluation:
Accuracy:  0.7839
Precision: 0.6215
Recall:    0.4786
F1-score:  0.5408
ROC AUC:   0.8183

Logistic Regression outperforms Random Forest on all five metrics, including ROC AUC (0.8413 vs. 0.8183).


We now export both models using `joblib`.

In [7]:
joblib.dump(log_reg, "../models/logistic_regression_model.pkl")
joblib.dump(rf_clf, "../models/random_forest_model.pkl")
print("Models exported successfully.")

Models exported successfully.
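
To reuse an exported model later, it can be loaded back with `joblib.load`. The sketch below is a self-contained round trip on synthetic data; the temporary directory and the file name `demo_model.pkl` are assumptions for the example and do not refer to the files written above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (hypothetical, not the churn models)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_model.pkl"  # hypothetical file name
    joblib.dump(model, model_path)
    restored = joblib.load(model_path)

# The restored model should make the same predictions as the original
same = np.array_equal(model.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickled models are generally meant to be loaded with the same scikit-learn version that created them, so it helps to record the library version alongside the file.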
