In [5]:
!pip install pycaret summarytools seaborn pandas --quiet



In [26]:
# Import all required libraries here

# Analytical tools
import pandas as pd
from summarytools import dfSummary
import pycaret.classification as pc

# Import custom modules
import custom_pipeline as cp
from data_processing import Input

In [27]:
# report = sv.analyze(source=df, target_feat='y')
# report.show_notebook(layout="vertical", scale=0.8)


### General Plan


1. Decide which metric to use
2. Decide which transformations (scaling, encoding, etc.) to apply
3. Use an AutoML library to get an initial idea of which models are likely to perform well
4. Perform prediction, hyperparameter tuning, and subsequent validation
5. Save pipelines and make the model deployment-ready


Load the data and clean it automatically with custom-built tools

In [28]:
df = Input('train_file.csv').preprocess()
dfSummary(df, is_collapsible=True)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,age [int64],Mean (sd) : 40.1 (10.2) min < med < max: 20.0 < 38.0 < 70.0 IQR (CV) : 15.0 (3.9),51 distinct values,,0 (0.0%)
2,job [category],1. admin. 2. blue-collar 3. technician 4. services 5. management 6. retired 7. entrepreneur 8. self-employed 9. housemaid 10. unemployed 11. other,"7,721 (24.7%) 7,061 (22.6%) 5,120 (16.4%) 3,116 (10.0%) 2,280 (7.3%) 1,350 (4.3%) 1,152 (3.7%) 1,079 (3.5%) 847 (2.7%) 793 (2.5%) 699 (2.2%)",,0 (0.0%)
3,marital [category],1. married 2. single 3. divorced,"18,858 (60.4%) 8,755 (28.0%) 3,605 (11.5%)",,0 (0.0%)
4,education [category],1. university.degree 2. high.school 3. basic.9y 4. professional.course 5. basic.4y 6. basic.6y,"9,461 (30.3%) 7,607 (24.4%) 4,801 (15.4%) 4,136 (13.2%) 3,328 (10.7%) 1,885 (6.0%)",,0 (0.0%)
5,contact [category],1. cellular 2. telephone,"19,652 (63.0%) 11,566 (37.0%)",,0 (0.0%)
6,day_of_week [category],1. mon 2. thu 3. wed 4. tue 5. fri,"6,485 (20.8%) 6,474 (20.7%) 6,170 (19.8%) 6,095 (19.5%) 5,994 (19.2%)",,0 (0.0%)
7,campaign [int64],1. 1 2. 2 3. 3 4. 4 5. 5 6. 10 7. 6 8. 7 9. 8 10. 9,"12,965 (41.5%) 8,081 (25.9%) 4,172 (13.4%) 2,080 (6.7%) 1,243 (4.0%) 871 (2.8%) 771 (2.5%) 489 (1.6%) 328 (1.1%) 218 (0.7%)",,0 (0.0%)
8,previous [int64],1. 0 2. 1 3. 2 4. 3,"26,745 (85.7%) 3,616 (11.6%) 604 (1.9%) 253 (0.8%)",,0 (0.0%)
9,poutcome [category],1. nonexistent 2. failure 3. success,"26,745 (85.7%) 3,371 (10.8%) 1,102 (3.5%)",,0 (0.0%)
10,y [category],1. no 2. yes,"27,533 (88.2%) 3,685 (11.8%)",,0 (0.0%)


### Definitions of Metrics (KPIs used to measure how well the selected model performs)

<details>

**Recall (Sensitivity, True Positive Rate)**:
Recall is the ratio of correctly predicted positive observations to all the actual positives. It answers the question: "Of all the actual positive cases, how many did we correctly identify?"

$$ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} $$

**Precision**:
Precision is the ratio of correctly predicted positive observations to the total predicted positives. It answers the question: "Of all the cases we predicted as positive, how many are actually positive?"

$$ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} $$

**Accuracy**:
Accuracy is the ratio of correctly predicted observations (both positive and negative) to the total observations. It answers the question: "How often is the classifier correct?"

$$ \text{Accuracy} = \frac{\text{True Positives (TP)} + \text{True Negatives (TN)}}{\text{Total Observations}} $$

**F1 Score**:
The F1 score is the harmonic mean of precision and recall. It provides a single metric that balances both concerns. It is particularly useful when you need to balance the trade-off between precision and recall.

$$ \text{F1 Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$

</details>
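
As a quick check of the four definitions above, here is a minimal sketch that computes them from confusion-matrix counts. The function name `classification_metrics` and the example counts are hypothetical and are not taken from this dataset.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return recall, precision, accuracy and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    # Guard each ratio against an empty denominator
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}


# Hypothetical counts, chosen only to illustrate the formulas
metrics = classification_metrics(tp=60, fp=40, fn=60, tn=840)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is high while recall is only 0.5, which is the kind of gap the metric choice below is meant to expose.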

##### When to Use Each Metric

| Metric | When to Use |
|--------------|-------------------------------------------------------------------------------------------------|
| **Recall**   | Use when the cost of false negatives is high, for example when missing a campaign that customers from Group 1 would have favored is a costly lost opportunity. |
| **Precision**| Use when the cost of false positives is high, for example when a wrong investment can cause serious problems and should be avoided. |
| **Accuracy** | Use when the class distribution is not heavily imbalanced and the cost of false positives and false negatives is similar. |
| **F1 Score** | Use when you need a balance between precision and recall, especially when the class distribution is imbalanced. |

In [29]:
df['y'].value_counts(normalize=True).round(2)

y,proportion
no,0.88
yes,0.12
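
To see why this split matters for the choice of metric, the sketch below scores a baseline that always predicts `no` on 100 hypothetical rows with the same 88/12 split. The rows are made up for illustration; only the class shares come from the output above.

```python
# 100 hypothetical rows with the 88% / 12% class shares shown above
y_true = ["no"] * 88 + ["yes"] * 12

# A baseline that ignores the features and always predicts the majority class
y_pred = ["no"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the minority class: share of actual 'yes' rows predicted as 'yes'
actual_yes = sum(t == "yes" for t in y_true)
caught_yes = sum(t == "yes" and p == "yes" for t, p in zip(y_true, y_pred))
recall_yes = caught_yes / actual_yes

print(f"accuracy:     {accuracy:.2f}")
print(f"recall (yes): {recall_yes:.2f}")
```

The baseline reaches 0.88 accuracy without identifying a single `yes` case, so accuracy alone says little about the minority class here.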


##### Target Value Distribution and Potential Issues

<details>

**Problem:**

Imbalanced target value distribution, where one class ('no' in this case) has significantly more samples than the other, can lead to biased models that perform poorly on the minority class.

**Potential Solutions:**

1. **Resampling Techniques:**
    * **Synthetic Minority Over-sampling Technique:** SMOTE creates synthetic samples of the minority class to balance the dataset.

2. **Ensemble Methods:**
    * Use ensemble methods like Bagging or Boosting, which combine multiple models to improve overall performance. These methods can be particularly effective in handling imbalanced datasets.

3. **Algorithm Selection:**
    * Choose algorithms that are less sensitive to class imbalance, such as tree-based models or support vector machines.

**PyCaret Implementation:**

PyCaret's `setup` function provides options for handling class imbalance using techniques like SMOTE. You can enable this by setting `fix_imbalance=True` and specifying the desired `fix_imbalance_method` (e.g., `SMOTE()`).

</details>
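
The core idea behind SMOTE can be sketched in a few lines: a synthetic point is placed somewhere on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration in pure Python, not the implementation PyCaret or imbalanced-learn uses; the function `smote_sketch` and the sample points are hypothetical.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=42):
    """Create n_new synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment between base and neighbour
        gap = rng.random()
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Hypothetical two-feature minority samples
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2)]
new_points = smote_sketch(minority, n_new=6, k=2)
for point in new_points:
    print(tuple(round(v, 3) for v in point))
```

Because every synthetic point is an interpolation, it always falls inside the range already covered by the minority class, which is why SMOTE adds no information beyond what the existing minority samples contain.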

#### Use of AutoML library PyCaret for rapid prototyping

**Remark**: I tried different combinations when setting up the data, including:

- removing vs. not removing multicollinearity
- applying vs. not applying SMOTE to fix the imbalance
- excluding vs. keeping outliers
- dropping vs. including certain columns

In all cases, neither the predictive power nor the underlying ML models changed significantly, although some minor improvements were noticeable.
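
One way to organise such trials is to enumerate the on/off combinations up front. The sketch below builds that grid for the three boolean `setup` options used in this notebook; column dropping is left out, and the loop that would actually run PyCaret is shown only as a comment because it assumes `pc` and `df` from the cells here. The variable names are hypothetical.

```python
from itertools import product

# Boolean setup options toggled during experimentation
flags = ["remove_multicollinearity", "fix_imbalance", "remove_outliers"]

# Every on/off combination of the three flags (2 ** 3 = 8 configurations)
setup_grid = [
    dict(zip(flags, values))
    for values in product([True, False], repeat=len(flags))
]

for options in setup_grid:
    print(options)
    # Assumed usage inside this notebook, one experiment per configuration:
    # pc.setup(data=df, target='y', session_id=42, **options)
    # pc.compare_models(sort='Recall')
```

Keeping `session_id` fixed across configurations makes the runs comparable, since the train/test split and fold assignment stay the same.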

In [13]:
df_clean = pd.read_csv('truncated_train_file.csv')

dfSummary(df_clean, is_collapsible=True)

No,Variable,Stats / Values,Freqs / (% of Valid),Graph,Missing
1,age [int64],Mean (sd) : 40.1 (10.2) min < med < max: 20.0 < 38.0 < 70.0 IQR (CV) : 15.0 (3.9),51 distinct values,,0 (0.0%)
2,job [object],1. admin. 2. blue-collar 3. technician 4. services 5. management 6. retired 7. entrepreneur 8. self-employed 9. housemaid 10. unemployed 11. other,"7,695 (24.7%) 7,074 (22.7%) 5,121 (16.4%) 3,121 (10.0%) 2,279 (7.3%) 1,337 (4.3%) 1,148 (3.7%) 1,078 (3.5%) 848 (2.7%) 791 (2.5%) 697 (2.2%)",,0 (0.0%)
3,marital [object],1. married 2. single 3. divorced,"18,843 (60.4%) 8,745 (28.0%) 3,601 (11.5%)",,0 (0.0%)
4,education [object],1. university.degree 2. high.school 3. basic.9y 4. professional.course 5. basic.4y 6. basic.6y,"9,447 (30.3%) 7,597 (24.4%) 4,777 (15.3%) 4,154 (13.3%) 3,322 (10.7%) 1,892 (6.1%)",,0 (0.0%)
5,contact [object],1. cellular 2. telephone,"19,620 (62.9%) 11,569 (37.1%)",,0 (0.0%)
6,day_of_week [object],1. mon 2. thu 3. wed 4. tue 5. fri,"6,487 (20.8%) 6,455 (20.7%) 6,163 (19.8%) 6,093 (19.5%) 5,991 (19.2%)",,0 (0.0%)
7,campaign [int64],1. 1 2. 2 3. 3 4. 4 5. 5 6. 10 7. 6 8. 7 9. 8 10. 9,"12,958 (41.5%) 8,081 (25.9%) 4,171 (13.4%) 2,078 (6.7%) 1,243 (4.0%) 852 (2.7%) 771 (2.5%) 489 (1.6%) 328 (1.1%) 218 (0.7%)",,0 (0.0%)
8,previous [int64],1. 0 2. 1 3. 2 4. 3 5. 4,"26,718 (85.7%) 3,614 (11.6%) 604 (1.9%) 173 (0.6%) 80 (0.3%)",,0 (0.0%)
9,poutcome [object],1. nonexistent 2. failure 3. success,"26,718 (85.7%) 3,370 (10.8%) 1,101 (3.5%)",,0 (0.0%)
10,duration_mins [int64],Mean (sd) : 4.3 (3.9) min < med < max: 0.0 < 3.0 < 20.0 IQR (CV) : 3.0 (1.1),21 distinct values,,0 (0.0%)


In [30]:
classification_setup = pc.setup(
    data=df,
    target='y',
    fold=5,  # Number of cross-validation folds
    train_size=0.8,  # Fraction of the data used for training (80%)
    data_split_shuffle=True,
    data_split_stratify=True,  # Stratify the train/test split
    normalize=True,  # Normalize the features
    remove_outliers=False,  # Keep outliers (outlier removal disabled)
    remove_multicollinearity=True,  # Drop highly correlated features
    multicollinearity_threshold=0.9,  # Correlation threshold for dropping features
    fix_imbalance=True,  # Fix class imbalance
    fix_imbalance_method='SMOTE',  # Method for fixing imbalance
    session_id=42  # Random seed for reproducibility
)

Unnamed: 0,Description,Value
0,Session id,42
1,Target,y
2,Target type,Binary
3,Target mapping,"no: 0, yes: 1"
4,Original data shape,"(31218, 16)"
5,Transformed data shape,"(50296, 42)"
6,Transformed train set shape,"(44052, 42)"
7,Transformed test set shape,"(6244, 42)"
8,Numeric features,4
9,Categorical features,9


In [31]:
# models = ['lightgbm', 'xgboost', 'ada', 'lr', 'rf']
best_model = pc.compare_models(sort='Recall')

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC,TT (Sec)
lightgbm,Light Gradient Boosting Machine,0.8952,0.8938,0.8952,0.8852,0.8889,0.4383,0.4434,6.138
xgboost,Extreme Gradient Boosting,0.8922,0.885,0.8922,0.8797,0.8839,0.4066,0.4143,2.884
gbc,Gradient Boosting Classifier,0.8889,0.8917,0.8889,0.8927,0.8906,0.4834,0.4839,9.552
rf,Random Forest Classifier,0.8879,0.8603,0.8879,0.8726,0.8775,0.3674,0.3775,6.264
ada,Ada Boost Classifier,0.885,0.8825,0.885,0.89,0.8872,0.4697,0.4708,3.136
dummy,Dummy Classifier,0.882,0.5,0.882,0.7778,0.8266,0.0,0.0,0.974
et,Extra Trees Classifier,0.8803,0.8304,0.8803,0.8671,0.8723,0.3511,0.3563,6.964
dt,Decision Tree Classifier,0.8508,0.665,0.8508,0.8573,0.8539,0.3136,0.3142,1.196
ridge,Ridge Classifier,0.845,0.8845,0.845,0.896,0.8626,0.445,0.4719,1.048
lda,Linear Discriminant Analysis,0.845,0.8845,0.845,0.896,0.8626,0.445,0.4719,1.0
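
In the table above, the Recall column matches the Accuracy column for every model. That is what a support-weighted average of per-class recall produces, since the class weights cancel and the result reduces to accuracy. Whether PyCaret averaged the metric this way here is an inference from the numbers, not something confirmed from its settings. The sketch below demonstrates the identity on hypothetical predictions and contrasts it with recall on the `yes` class alone.

```python
def weighted_recall(y_true, y_pred):
    """Average per-class recall, weighting each class by its share of the rows."""
    n = len(y_true)
    total = 0.0
    for label in sorted(set(y_true)):
        support = sum(t == label for t in y_true)
        hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        total += (support / n) * (hits / support)
    return total


def class_recall(y_true, y_pred, label):
    """Recall for a single class."""
    support = sum(t == label for t in y_true)
    hits = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    return hits / support


# Hypothetical predictions: 4 of 12 'yes' rows and 85 of 88 'no' rows are correct
y_true = ["yes"] * 12 + ["no"] * 88
y_pred = ["yes"] * 4 + ["no"] * 8 + ["no"] * 85 + ["yes"] * 3

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy:        {accuracy:.3f}")
print(f"weighted recall: {weighted_recall(y_true, y_pred):.3f}")
print(f"recall (yes):    {class_recall(y_true, y_pred, 'yes'):.3f}")
```

If the goal is to catch `yes` cases, recall on that class alone is the number to watch, and it can be far lower than the weighted figure.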



In [41]:
# Get the best model's parameters
print(best_model)

LGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
               importance_type='split', learning_rate=0.1, max_depth=-1,
               min_child_samples=20, min_child_weight=0.001, min_split_gain=0.0,
               n_estimators=100, n_jobs=-1, num_leaves=31, objective=None,
               random_state=42, reg_alpha=0.0, reg_lambda=0.0, subsample=1.0,
               subsample_for_bin=200000, subsample_freq=0)


In [44]:
pc.evaluate_model(best_model)


In [45]:
# predict on test set
test_predictions = pc.predict_model(best_model)

Unnamed: 0,Model,Accuracy,AUC,Recall,Prec.,F1,Kappa,MCC
0,Light Gradient Boosting Machine,0.882,0.4976,0.882,0.7779,0.8267,0.0,0.0


### Hyperparameter Tuning

In [47]:
# Pick the winner
best_model = pc.automl(optimize='Recall')

# Fine-tune the best model
# tuned_best_model = pc.tune_model(best_model)
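
The tuning step is commented out in the cell above. A minimal sketch of how it could be run is given below, assuming `pc.setup` has already been executed in the session. The wrapper `tune_for_recall` and the `n_iter=25` default are hypothetical choices for illustration, not settings used in this notebook.

```python
def tune_for_recall(model, n_iter=25):
    """Tune `model` for recall inside an already initialised PyCaret experiment."""
    # Imported lazily so that defining the helper does not require pycaret
    import pycaret.classification as pc

    return pc.tune_model(
        model,
        optimize="Recall",   # same metric used to rank models above
        n_iter=n_iter,       # number of random-search candidates
        choose_better=True,  # keep the original model if tuning does not help
    )


# Assumed usage in this notebook:
# tuned_best_model = tune_for_recall(best_model)
```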

Time to test on test_file.csv


In [66]:
test_data = Input('test_file.csv').preprocess()
test_data.head()

Unnamed: 0,age,job,marital,education,contact,day_of_week,campaign,previous,poutcome,duration_mins,education_level,job_type,income_level,was_contacted_before,deposited_before
0,34,services,married,high.school,telephone,thu,4,0,nonexistent,4,mid,blue-collar,mid,False,False
1,29,blue-collar,single,basic.9y,cellular,thu,1,0,nonexistent,3,low,blue-collar,mid,False,False
2,35,admin.,single,high.school,cellular,wed,2,0,nonexistent,3,mid,white-collar,mid,False,False
3,60,admin.,divorced,high.school,cellular,fri,1,0,nonexistent,3,mid,white-collar,mid,False,False
4,45,management,married,university.degree,telephone,wed,2,0,nonexistent,2,high,white-collar,high,False,False


In [78]:
predictions = pc.predict_model(best_model, data = test_data)

Save file with predictions


In [87]:
# Preview the predictions, then save them to an Excel file
# predictions = predictions.drop(columns=['level_0', 'index']).reset_index(drop=True)
predictions.head()
predictions.to_excel('predictions.xlsx', index=False)


In [82]:
pc.save_model(best_model, 'ml_pipeline')

Transformation Pipeline and Model Successfully Saved


(Pipeline(memory=Memory(location=None),
          steps=[('label_encoding',
                  TransformerWrapperWithInverse(exclude=None, include=None,
                                                transformer=LabelEncoder())),
                 ('numerical_imputer',
                  TransformerWrapper(exclude=None,
                                     include=['age', 'campaign', 'previous',
                                              'duration_mins'],
                                     transformer=SimpleImputer(add_indicator=False,
                                                               copy=True,
                                                               fill_value=None,
                                                               keep_empty_features=Fal...
                  LGBMClassifier(boosting_type='gbdt', class_weight=None,
                                 colsample_bytree=1.0, importance_type='split',
                                 learning_rate=0.1, max_d
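
For deployment, the saved pipeline can be loaded back and applied to new data. The sketch below assumes the file written by `pc.save_model(best_model, 'ml_pipeline')` is in the working directory and that the new data has already passed through the same `Input(...).preprocess()` step. The helper `score_new_data` is hypothetical.

```python
def score_new_data(frame, model_name="ml_pipeline"):
    """Load the saved pipeline and return predictions for `frame`."""
    # Imported lazily so that defining the helper does not require pycaret
    import pycaret.classification as pc

    # load_model expects the name without the .pkl extension
    pipeline = pc.load_model(model_name)
    return pc.predict_model(pipeline, data=frame)


# Assumed usage in this notebook:
# new_predictions = score_new_data(test_data)
```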