## Predictive Modeling


In [2]:
# Import all required libraries here

# Analytical tools
import pandas as pd
from summarytools import dfSummary
import pycaret.classification as pc
import sweetviz as sv 

# Import custom modules
import custom_pipeline as cp
from data_processing import Input

### General Plan

1. **Data Preprocessing**: Clean and preprocess the data using custom tools.

2. **Metric Selection**: Decide which metric to optimize. Evaluate the suitability of candidate metrics such as Recall, Precision, Accuracy, and F1 Score.

4. **Hyperparameter Tuning**: Tune the hyperparameters of the most suitable models to see whether predictive power can be improved further (if time allows).

5. **Prediction**: Make predictions on the test dataset and save the results to the file `predictions.xlsx`.

6. **Deployment-ready pipeline**: Save the pipelines and make the model deployment-ready. Integrate the model into a (hypothetical) existing system and justify the approach.

7. **Continuous Monitoring**: If time allows, launch MLflow (locally or in the cloud) and add the model to it to track its performance over time.


In [8]:
# Load the data and clean it automatically with custom-built tools
df = Input('data/train_file.csv').preprocess()
dfSummary(df, is_collapsible=True)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,age [int64],Mean (sd) : 40.1 (10.2) min < med < max: 20.0 < 38.0 < 70.0 IQR (CV) : 15.0 (3.9),51 distinct values,,0 (0.0%)
2,job [category],1. admin. 2. blue-collar 3. technician 4. services 5. management 6. retired 7. entrepreneur 8. self-employed 9. housemaid 10. unemployed 11. other,"7,704 (24.7%) 7,043 (22.6%) 5,137 (16.5%) 3,108 (10.0%) 2,283 (7.3%) 1,338 (4.3%) 1,150 (3.7%) 1,090 (3.5%) 843 (2.7%) 791 (2.5%) 702 (2.3%)",,0 (0.0%)
3,marital [category],1. married 2. single 3. divorced,"18,827 (60.4%) 8,760 (28.1%) 3,602 (11.5%)",,0 (0.0%)
4,education [category],1. university.degree 2. high.school 3. basic.9y 4. professional.course 5. basic.4y 6. basic.6y,"9,445 (30.3%) 7,592 (24.3%) 4,767 (15.3%) 4,171 (13.4%) 3,333 (10.7%) 1,881 (6.0%)",,0 (0.0%)
5,contact [category],1. cellular 2. telephone,"19,622 (62.9%) 11,567 (37.1%)",,0 (0.0%)
6,day_of_week [category],1. mon 2. thu 3. wed 4. tue 5. fri,"6,484 (20.8%) 6,461 (20.7%) 6,160 (19.8%) 6,093 (19.5%) 5,991 (19.2%)",,0 (0.0%)
7,campaign [int64],Mean (sd) : 2.5 (2.0) min < med < max: 1.0 < 2.0 < 10.0 IQR (CV) : 2.0 (1.2),10 distinct values,,0 (0.0%)
8,previous [int64],Mean (sd) : 0.2 (0.5) min < med < max: 0.0 < 0.0 < 3.0 IQR (CV) : 0.0 (0.4),4 distinct values,,0 (0.0%)
9,poutcome [category],1. nonexistent 2. failure 3. success,"26,719 (85.7%) 3,371 (10.8%) 1,099 (3.5%)",,0 (0.0%)
10,y [category],1. no 2. yes,"27,507 (88.2%) 3,682 (11.8%)",,0 (0.0%)


### Definitions of Metrics (KPIs used to measure how well the selected model performs)

<details>

**Recall (Sensitivity, True Positive Rate)**:
Recall is the ratio of correctly predicted positive observations to all the actual positives. It answers the question: "Of all the actual positive cases, how many did we correctly identify?"

$$ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} $$

**Precision**:
Precision is the ratio of correctly predicted positive observations to the total predicted positives. It answers the question: "Of all the cases we predicted as positive, how many are actually positive?"

$$ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} $$

**Accuracy**:
Accuracy is the ratio of correctly predicted observations (both positive and negative) to the total observations. It answers the question: "How often is the classifier correct?"

$$ \text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Observations}} $$

**F1 Score**:
The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both concerns. It is particularly useful when you need to balance the trade-off between precision and recall.

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

</details>
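The four definitions above can be checked with a few lines of plain Python. This is a minimal sketch; the confusion-matrix counts are made up for illustration and are not taken from this dataset, and `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute Recall, Precision, Accuracy and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"recall": recall, "precision": precision, "accuracy": accuracy, "f1": f1}


# Made-up counts for illustration only (not from the campaign data)
metrics = classification_metrics(tp=60, fp=40, fn=58, tn=842)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is high while recall is only about one half, which is the typical picture on an imbalanced target.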

##### When to Use Each Metric

| Metric | When to Use |
|--------------|-------------------------------------------------------------------------------------------------|
| **Recall**   | Use when the cost of false negatives is high. For example, when missing an opportunity to invest in a campaign that customers from Group 1 would favor is expensive. |
| **Precision**| Use when the cost of false positives is high. For example, when a wrong investment can lead to serious problems and should therefore be avoided. |
| **Accuracy** | Use when the class distribution is not heavily imbalanced and the cost of false positives and false negatives is similar. |
| **F1 Score** | Use when you need a balance between precision and recall, especially when the class distribution is imbalanced. |

In [29]:
df['y'].value_counts(normalize=True).round(2)

y,proportion
no,0.88
yes,0.12


##### Target Value Distribution and Potential Issues

<details>

**Problem:**

Imbalanced target value distribution, where one class ('no' in this case) has significantly more samples than the other, can lead to biased models that perform poorly on the minority class.

**Potential Solutions:**

1. **Resampling Techniques:**
    * **Synthetic Minority Over-sampling Technique:** SMOTE creates synthetic samples of the minority class to balance the dataset.

2. **Ensemble Methods:**
    * Use ensemble methods like Bagging or Boosting, which combine multiple models to improve overall performance. These methods can be particularly effective in handling imbalanced datasets.

3. **Algorithm Selection:**
    * Choose algorithms that tend to be less sensitive to class imbalance, such as tree-based models. Other algorithms (for example, support vector machines) usually need class weighting to cope with imbalance.

**PyCaret Implementation:**

PyCaret's `setup` function provides options for handling class imbalance with resampling techniques such as SMOTE. Enable this by setting `fix_imbalance=True` and specifying the desired `fix_imbalance_method` (e.g., `'SMOTE'`).

</details>
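To make the SMOTE idea concrete, here is a simplified sketch of its interpolation step in plain Python. It is not the imbalanced-learn implementation that PyCaret relies on: `smote_sketch` is a hypothetical helper, the toy points are invented, and only numeric features are handled.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority samples by squared distance to `base`
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy 2-D minority class (hypothetical values, not from the campaign data)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6)
print(len(minority) + len(new_points), "minority samples after oversampling")
```

Every synthetic point lies on a line segment between two real minority samples, so the minority class grows without exact duplicates.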

#### Use of AutoML library PyCaret for rapid prototyping

**Remark**: I tried different combinations when setting up the data, including:

- removing vs. not removing multicollinearity
- applying vs. not applying SMOTE to fix the imbalance
- excluding vs. keeping outliers
- dropping vs. including certain columns

In all cases, neither the predictive power nor the top-ranked ML models changed significantly, although some minor improvements were noticeable.
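One way to run such combinations systematically is to enumerate them as keyword arguments for `pc.setup`. The sketch below only builds the configurations. The parameter names are real `setup` arguments, but the column passed to `ignore_features` is an illustrative assumption, since the columns that were actually dropped are not listed here.

```python
from itertools import product

# Options to vary; keys are pc.setup() parameter names
grid = {
    "remove_multicollinearity": [True, False],
    "fix_imbalance": [True, False],
    "remove_outliers": [True, False],
    # Illustrative column choice (assumption): the dropped columns are not documented
    "ignore_features": [None, ["day_of_week"]],
}

# One dict of keyword arguments per combination
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs), "configurations")

for config in configs[:2]:
    print(config)
    # Each config could then be passed on, for example:
    # pc.setup(data=df, target="y", session_id=42, **config)
    # pc.compare_models(sort="Recall")
```

Keeping `session_id` fixed across runs makes the comparisons repeatable, so differences come from the configuration rather than from the random split.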

In [13]:
df_clean = pd.read_csv('truncated_train_file.csv')

dfSummary(df_clean, is_collapsible=True)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,age [int64],Mean (sd) : 40.1 (10.2) min < med < max: 20.0 < 38.0 < 70.0 IQR (CV) : 15.0 (3.9),51 distinct values,,0 (0.0%)
2,job [object],1. admin. 2. blue-collar 3. technician 4. services 5. management 6. retired 7. entrepreneur 8. self-employed 9. housemaid 10. unemployed 11. other,"7,695 (24.7%) 7,074 (22.7%) 5,121 (16.4%) 3,121 (10.0%) 2,279 (7.3%) 1,337 (4.3%) 1,148 (3.7%) 1,078 (3.5%) 848 (2.7%) 791 (2.5%) 697 (2.2%)",,0 (0.0%)
3,marital [object],1. married 2. single 3. divorced,"18,843 (60.4%) 8,745 (28.0%) 3,601 (11.5%)",,0 (0.0%)
4,education [object],1. university.degree 2. high.school 3. basic.9y 4. professional.course 5. basic.4y 6. basic.6y,"9,447 (30.3%) 7,597 (24.4%) 4,777 (15.3%) 4,154 (13.3%) 3,322 (10.7%) 1,892 (6.1%)",,0 (0.0%)
5,contact [object],1. cellular 2. telephone,"19,620 (62.9%) 11,569 (37.1%)",,0 (0.0%)
6,day_of_week [object],1. mon 2. thu 3. wed 4. tue 5. fri,"6,487 (20.8%) 6,455 (20.7%) 6,163 (19.8%) 6,093 (19.5%) 5,991 (19.2%)",,0 (0.0%)
7,campaign [int64],1. 1 2. 2 3. 3 4. 4 5. 5 6. 10 7. 6 8. 7 9. 8 10. 9,"12,958 (41.5%) 8,081 (25.9%) 4,171 (13.4%) 2,078 (6.7%) 1,243 (4.0%) 852 (2.7%) 771 (2.5%) 489 (1.6%) 328 (1.1%) 218 (0.7%)",,0 (0.0%)
8,previous [int64],1. 0 2. 1 3. 2 4. 3 5. 4,"26,718 (85.7%) 3,614 (11.6%) 604 (1.9%) 173 (0.6%) 80 (0.3%)",,0 (0.0%)
9,poutcome [object],1. nonexistent 2. failure 3. success,"26,718 (85.7%) 3,370 (10.8%) 1,101 (3.5%)",,0 (0.0%)
10,duration_mins [int64],Mean (sd) : 4.3 (3.9) min < med < max: 0.0 < 3.0 < 20.0 IQR (CV) : 3.0 (1.1),21 distinct values,,0 (0.0%)


In [9]:
classification_setup = pc.setup(
    data=df,
    target='y',
    fold=5,  # Number of folds for cross-validation
    train_size=0.8,  # Fraction of the data used for training (80%)
    data_split_shuffle=True,  # Shuffle rows before splitting
    data_split_stratify=True,  # Stratify the split on the target
    normalize=True,  # Normalize numeric features
    remove_outliers=False,  # Keep outliers (no outlier removal)
    remove_multicollinearity=True,  # Drop highly correlated features
    multicollinearity_threshold=0.9,  # Correlation threshold for dropping features
    fix_imbalance=True,  # Resample the training data to fix class imbalance
    fix_imbalance_method='SMOTE',  # Resampling method
    session_id=42  # Random seed for reproducibility
)

Unnamed: 0,Description,Value
0,Session id,42
1,Target,y
2,Target type,Binary
3,Target mapping,"no: 0, yes: 1"
4,Original data shape,"(31189, 16)"
5,Transformed data shape,"(50248, 42)"
6,Transformed train set shape,"(44010, 42)"
7,Transformed test set shape,"(6238, 42)"
8,Numeric features,4
9,Categorical features,9


In [10]:
# models = ['lightgbm', 'xgboost', 'ada', 'lr', 'rf']
best_model = pc.compare_models(sort='Recall')  # Rank candidate models by Recall

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.8979,0.8911,0.8979,0.8884,0.8918,0.4538,0.4589,1.622
gbc,Gradient Boosting Classifier,0.8888,0.8892,0.8888,0.8919,0.8903,0.4803,0.4806,5.07
rf,Random Forest Classifier,0.8872,0.8603,0.8872,0.8723,0.8773,0.3684,0.3775,1.776
ada,Ada Boost Classifier,0.8847,0.8815,0.8847,0.8886,0.8865,0.464,0.4645,1.258
dummy,Dummy Classifier,0.8819,0.5,0.8819,0.7778,0.8266,0.0,0.0,0.564
et,Extra Trees Classifier,0.8795,0.8335,0.8795,0.8661,0.8714,0.3469,0.352,3.072
dt,Decision Tree Classifier,0.8504,0.6623,0.8504,0.8564,0.8532,0.3095,0.31,0.482
ridge,Ridge Classifier,0.842,0.882,0.842,0.8946,0.8601,0.4375,0.4649,0.444
lda,Linear Discriminant Analysis,0.842,0.882,0.842,0.8946,0.8601,0.4375,0.4649,0.688
knn,K Neighbors Classifier,0.839,0.7499,0.839,0.863,0.8493,0.3314,0.3367,2.44
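In the table above, the Recall column equals the Accuracy column in every row. That is what a class-weighted average of per-class recall produces, since weighted recall is mathematically identical to accuracy. It suggests (an assumption, not confirmed by the output itself) that the reported Recall is averaged over both classes rather than computed for the `yes` class only. A small sketch with made-up confusion counts shows the difference:

```python
def per_class_recall(confusion):
    """confusion[actual][predicted] -> recall for each actual class."""
    return {c: row[c] / sum(row.values()) for c, row in confusion.items()}


# Made-up confusion counts (outer key: actual class, inner key: predicted class)
confusion = {
    "no": {"no": 850, "yes": 32},
    "yes": {"no": 70, "yes": 48},
}

total = sum(sum(row.values()) for row in confusion.values())
recalls = per_class_recall(confusion)

accuracy = sum(confusion[c][c] for c in confusion) / total
# Weight each class recall by that class's share of the data
weighted_recall = sum(recalls[c] * sum(confusion[c].values()) / total for c in confusion)

print(f"accuracy         = {accuracy:.4f}")
print(f"weighted recall  = {weighted_recall:.4f}")
print(f"recall for 'yes' = {recalls['yes']:.4f}")
```

If the goal is to catch as many `yes` customers as possible, the recall of the `yes` class is the figure to watch, and it can be much lower than the weighted value.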


In [11]:
# Print the best model together with its parameters
print(best_model)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)


In [44]:
pc.evaluate_model(best_model)


In [12]:
# Plot identifiers are lowercase in PyCaret; plot="AUC" raised
# "ValueError: Plot Not Available. Please see docstring for list of available Plots."
pc.plot_model(best_model, plot="auc")

In [45]:
# Predict on the hold-out set created by setup() (no `data` argument given)
test_predictions = pc.predict_model(best_model)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Light Gradient Boosting Machine,0.882,0.4976,0.882,0.7779,0.8267,0.0,0.0
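The hold-out row above (AUC close to 0.5, Kappa and MCC of 0.0) matches the Dummy Classifier row of the comparison table. The sketch below shows that a model predicting `no` for every row would score exactly these values under class-weighted averaging. It is a numerical observation under that assumption, not a confirmed diagnosis; inspecting the distribution of predicted labels in `test_predictions` would settle it. The 1,000-row hold-out is illustrative, chosen to mirror the 88.2% majority share.

```python
# Illustrative hold-out of 1,000 rows with an 88.2% / 11.8% class split
n_no, n_yes = 882, 118
total = n_no + n_yes

# A model that predicts "no" for every row
accuracy = n_no / total

# Per-class precision and recall when every prediction is "no"
precision = {"no": n_no / total, "yes": 0.0}  # "yes" is never predicted
recall = {"no": 1.0, "yes": 0.0}
f1 = {
    c: (2 * precision[c] * recall[c] / (precision[c] + recall[c])
        if precision[c] + recall[c] else 0.0)
    for c in precision
}

# Class-weighted averages (weights are the class shares)
support = {"no": n_no / total, "yes": n_yes / total}
weighted_precision = sum(support[c] * precision[c] for c in support)
weighted_f1 = sum(support[c] * f1[c] for c in support)

# Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)
chance = support["no"] * 1.0 + support["yes"] * 0.0
kappa = (accuracy - chance) / (1 - chance)

print(round(accuracy, 4), round(weighted_precision, 4), round(weighted_f1, 4), round(kappa, 4))
```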


### Hyperparameter Tuning

In [18]:
!pip install scikit-optimize


In [19]:
# #Fine-tuning the best model
# tuned_best_model = pc.tune_model(
#     estimator=best_model, 
#     fold=3, 
#     optimize='Recall', 
#     search_library='scikit-optimize', 
#     search_algorithm='bayesian', 
#     n_iter=10  # Limit the number of iterations to reduce time
# )

Time to test on test_file.csv


In [21]:
test_data = Input('data/test_file.csv').preprocess()
test_data.head()

Unnamed: 0,age,job,marital,education,contact,day_of_week,campaign,previous,poutcome,duration_mins,education_level,job_type,income_level,was_contacted_before,deposited_before
0,34,services,married,high.school,telephone,thu,4,0,nonexistent,4,mid,blue-collar,mid,False,False
1,29,blue-collar,single,basic.9y,cellular,thu,1,0,nonexistent,3,low,blue-collar,mid,False,False
2,35,admin.,single,high.school,cellular,wed,2,0,nonexistent,3,mid,white-collar,mid,False,False
3,60,admin.,divorced,high.school,cellular,fri,1,0,nonexistent,3,mid,white-collar,mid,False,False
4,45,management,married,university.degree,telephone,wed,2,0,nonexistent,2,high,white-collar,high,False,False


In [22]:
predictions = pc.predict_model(best_model, data=test_data)

Save file with predictions


In [25]:
predictions.to_excel('data/predictions.xlsx', index=False)


In [82]:
pc.save_model(best_model, 'ml_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['age', 'campaign', 'previous',
                                              'duration_mins'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=Fal...
                  LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                 colsample_bytree=1.0, importance_type='split',
                                 learning_rate=0.1, max_d