<p style="font-size: 35px; font-family: 'Montserrat', sans-serif; font-weight: 900; color: #7D7D7D; text-align: center; text-transform: uppercase; letter-spacing: 2px; background: linear-gradient(to right, #D3D3D3, #A9A9A9); color: white; padding: 15px 30px; border-radius: 25px; box-shadow: 0 10px 20px rgba(0, 0, 0, 0.2);">
    Predicting the success of Bank telemarketing </p>

## Important Libraries:

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Libraries for Metrics:

In [2]:
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, f1_score, confusion_matrix


## Preprocessing Libraries:

In [3]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split


## Model Libraries:

In [4]:
from sklearn.linear_model import SGDClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [5]:
# Load the training dataset from the provided file path and store it in the df dataframe
df = pd.read_csv('/kaggle/input/predict-the-success-of-bank-telemarketing/train.csv')

# Display the first 100 rows of the loaded training dataframe to inspect its structure
df.head(100)


Unnamed: 0,last contact date,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome,target
0,2009-04-17,26,blue-collar,married,secondary,no,647,yes,no,cellular,357,2,331,1,other,no
1,2009-10-11,52,technician,married,secondary,no,553,yes,no,telephone,160,1,-1,0,,no
2,2010-11-20,44,blue-collar,married,secondary,no,1397,no,no,cellular,326,1,-1,0,,no
3,2009-09-01,33,admin.,married,secondary,no,394,yes,no,telephone,104,3,-1,0,,no
4,2008-01-29,31,entrepreneur,single,tertiary,no,137,no,no,cellular,445,2,-1,0,,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2008-05-15,34,blue-collar,married,,no,-17,yes,no,,319,7,-1,0,,no
96,2009-05-11,57,technician,married,secondary,no,124,yes,no,telephone,296,1,287,12,failure,no
97,2008-05-07,39,blue-collar,married,secondary,no,-219,yes,no,cellular,101,1,-1,0,,no
98,2010-07-22,53,entrepreneur,married,secondary,no,0,yes,no,cellular,200,1,-1,0,,no


In [6]:
# Drop the 'poutcome' column from the dataframe (df); it is empty for most of the rows shown above
df = df.drop('poutcome', axis=1)

# Display the first few rows of the dataframe after dropping the 'poutcome' column
df.head()


Unnamed: 0,last contact date,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,target
0,2009-04-17,26,blue-collar,married,secondary,no,647,yes,no,cellular,357,2,331,1,no
1,2009-10-11,52,technician,married,secondary,no,553,yes,no,telephone,160,1,-1,0,no
2,2010-11-20,44,blue-collar,married,secondary,no,1397,no,no,cellular,326,1,-1,0,no
3,2009-09-01,33,admin.,married,secondary,no,394,yes,no,telephone,104,3,-1,0,no
4,2008-01-29,31,entrepreneur,single,tertiary,no,137,no,no,cellular,445,2,-1,0,no


In [7]:
# Load the test dataset from the provided file path and store it in the df_test dataframe
df_test = pd.read_csv("/kaggle/input/predict-the-success-of-bank-telemarketing/test.csv")

# Display the first few rows of the loaded test dataframe to inspect its structure
df_test.head()


Unnamed: 0,last contact date,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous,poutcome
0,2009-11-21,36,management,single,tertiary,no,7,no,no,,20,1,-1,0,
1,2010-02-04,30,unemployed,married,tertiary,no,1067,no,no,cellular,78,2,-1,0,
2,2010-07-28,32,blue-collar,single,secondary,no,82,yes,no,cellular,86,4,-1,0,
3,2010-06-09,38,admin.,married,primary,no,1487,no,no,,332,2,-1,0,
4,2008-03-02,59,management,married,tertiary,no,315,no,no,cellular,591,1,176,2,failure


In [8]:
# Drop the 'poutcome' column from the test dataframe df_test
df_test = df_test.drop('poutcome', axis=1)

# Display the first few rows of the updated df_test dataframe
df_test.head()


Unnamed: 0,last contact date,age,job,marital,education,default,balance,housing,loan,contact,duration,campaign,pdays,previous
0,2009-11-21,36,management,single,tertiary,no,7,no,no,,20,1,-1,0
1,2010-02-04,30,unemployed,married,tertiary,no,1067,no,no,cellular,78,2,-1,0
2,2010-07-28,32,blue-collar,single,secondary,no,82,yes,no,cellular,86,4,-1,0
3,2010-06-09,38,admin.,married,primary,no,1487,no,no,,332,2,-1,0
4,2008-03-02,59,management,married,tertiary,no,315,no,no,cellular,591,1,176,2


In [9]:
# Drop the 'target' column from the dataframe and assign it to the variable 'x' (features)
x = df.drop('target', axis=1)

# Assign the 'target' column to the variable 'y' (target variable)
y = df['target']


In [10]:
# Convert the 'last contact date' column to datetime format
x['last contact date'] = pd.to_datetime(x['last contact date'])

# Extract the year from the 'last contact date' and create a new column 'contact_year'
x['contact_year'] = x['last contact date'].dt.year

# Extract the month from the 'last contact date' and create a new column 'contact_month'
x['contact_month'] = x['last contact date'].dt.month

# Extract the day from the 'last contact date' and create a new column 'contact_day'
x['contact_day'] = x['last contact date'].dt.day

# Drop the original 'last contact date' column as we no longer need it
x = x.drop('last contact date', axis=1)


In [11]:
# Step 2: Define numerical and categorical columns
num_cols = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous', 'contact_year', 'contact_month', 'contact_day']
cat_cols = ['job', 'marital', 'default', 'housing', 'loan', 'contact']
edu_col = ['education']  # Education column

# Step 3: Create preprocessing pipelines
# For numerical columns
num_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with the mean of the column
    ('scaler', StandardScaler())  # Scale features (standardize) to mean=0, std=1 for better model performance
])

# For categorical columns (excluding education)
cat_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent category
    ('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'))  # One-hot encoding (avoid dummy variable trap)
])

# For education column using Ordinal Encoding (specific ranking between categories)
edu_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent category
    ('ordinal', OrdinalEncoder(categories=[['primary', 'secondary', 'tertiary']]))  # Ordinal encoding for education (primary < secondary < tertiary)
])

# Step 4: Combine all pipelines into a ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', num_pipeline, num_cols),  # Apply num_pipeline to num_cols (numerical columns)
        ('cat', cat_pipeline, cat_cols),   # Apply cat_pipeline to cat_cols (categorical columns)
        ('edu', edu_pipeline, edu_col)     # Apply edu_pipeline to education column
    ]
)

# Step 5: Split the data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

# Step 6: Create a full pipeline with the preprocessor
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)  # Preprocessing step applied to all features
])

# Step 7: Fit the pipeline on the training data
# Fit the preprocessor and transform the training set
x_train_transformed = pipeline.fit_transform(x_train)

# Transform the test data using the same preprocessing steps
x_test_transformed = pipeline.transform(x_test)


# Convert the transformed NumPy arrays back to DataFrames with proper column names
x_train_transformed = pd.DataFrame(x_train_transformed, columns=preprocessor.get_feature_names_out())
x_test_transformed = pd.DataFrame(x_test_transformed, columns=preprocessor.get_feature_names_out())

# Output the shape of the transformed training and test sets to ensure they are correctly processed
print(x_train_transformed.shape)
print(x_test_transformed.shape)


(31368, 26)
(7843, 26)


In [12]:
z = df_test.copy()  # Competition test data; work on a copy so df_test itself is not modified

In [13]:
# Convert 'last contact date' column to datetime format
z['last contact date'] = pd.to_datetime(z['last contact date'])

# Extract the year from the 'last contact date' and store it in a new column 'contact_year'
z['contact_year'] = z['last contact date'].dt.year

# Extract the month from the 'last contact date' and store it in a new column 'contact_month'
z['contact_month'] = z['last contact date'].dt.month

# Extract the day from the 'last contact date' and store it in a new column 'contact_day'
z['contact_day'] = z['last contact date'].dt.day

# Drop the 'last contact date' column from the dataframe, as it's no longer needed
z = z.drop('last contact date', axis=1)


In [14]:
# Transform the competition test set using the previously fitted pipeline
z_test_transformed = pipeline.transform(z)

# Convert the transformed data into a DataFrame and set the correct column names based on the feature names from the preprocessor
z_test_transformed = pd.DataFrame(z_test_transformed, columns=preprocessor.get_feature_names_out())

# Display the first few rows of the transformed test data to check the result
z_test_transformed.head()


Unnamed: 0,num__age,num__balance,num__duration,num__campaign,num__pdays,num__previous,num__contact_year,num__contact_month,num__contact_day,cat__job_blue-collar,...,cat__job_student,cat__job_technician,cat__job_unemployed,cat__marital_married,cat__marital_single,cat__default_yes,cat__housing_yes,cat__loan_yes,cat__contact_telephone,edu__education
0,-0.483611,-0.331969,-0.54439,-0.41467,-0.454847,-0.266955,-0.000858,1.827486,0.626869,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,2.0
1,-0.955588,-0.266765,-0.468921,-0.313649,-0.454847,-0.266955,1.222368,-1.620194,-1.408964,0.0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0
2,-0.798262,-0.327356,-0.458512,-0.111609,-0.454847,-0.266955,1.222368,0.295184,1.465154,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
3,-0.326285,-0.240929,-0.138422,-0.313649,-0.454847,-0.266955,1.222368,-0.087892,-0.81019,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.325635,-0.313023,0.198583,-0.41467,0.641016,-0.221716,-1.224083,-1.237119,-1.648474,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,2.0


### Confusion Matrix for "Defaulter" and "Non-Defaulter"

|                         | **Defaulter (Predicted: Yes)** | **Non-Defaulter (Predicted: No)** |
|-------------------------|--------------------------------|----------------------------------|
| **Defaulter (Actual: Yes)** | True Positive (TP): Correctly predicted as defaulter | False Negative (FN): Incorrectly predicted as non-defaulter |
| **Non-Defaulter (Actual: No)** | False Positive (FP): Incorrectly predicted as defaulter | True Negative (TN): Correctly predicted as non-defaulter |

### Simple Explanation:

- **True Positive (TP)**: The model correctly predicts someone as a defaulter (they are a defaulter).
- **False Positive (FP)**: The model wrongly predicts someone as a defaulter when they are actually not (this is bad for the customer).
- **False Negative (FN)**: The model wrongly predicts someone as not a defaulter when they actually are (this is bad for the bank).
- **True Negative (TN)**: The model correctly predicts someone as not a defaulter (they are not a defaulter).

---

### Why is this important?

- **False Positives (FP)** affect **Precision**:  
  If the model predicts a customer as a defaulter when they are actually not (False Positive), it will **lower the Precision**. This means that the bank is wrongly identifying people who are **not defaulters** as defaulters. This could cause harm to customers, affecting their credit ratings and causing unnecessary actions like blocking loans.

- **False Negatives (FN)** affect **Recall**:  
  If the model predicts a customer as **not a defaulter** when they are actually a defaulter (False Negative), it will **lower Recall**. This means that the bank is missing out on identifying people who are actual defaulters. This could result in financial loss for the bank, as defaulters would not be caught and might not pay back their debts.

In essence, we need to balance **Precision** and **Recall** to ensure both the **customer's safety** (no false accusations) and the **bank's financial security** (no missed defaulters). 

- **Precision** is important for ensuring that the bank doesn't wrongly flag people as defaulters.  
- **Recall** is important for ensuring that the bank doesn't miss actual defaulters.  

The **F1 Score** is a great way to balance **Precision** and **Recall**. By focusing on improving both, we can make the model more reliable and fair for both the **bank** and the **customers**.
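
As a minimal sketch of how these three metrics follow from the four confusion-matrix cells, the snippet below computes them by hand. The counts are hypothetical and chosen only for illustration; they do not come from this dataset.

```python
# Hypothetical confusion-matrix counts, chosen only for illustration.
tp, fp, fn, tn = 80, 20, 40, 860


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"Precision: {precision:.3f}")  # share of predicted positives that are real
print(f"Recall:    {recall:.3f}")     # share of real positives that were found
print(f"F1 score:  {f1:.3f}")         # harmonic mean of precision and recall
print(f"Accuracy:  {accuracy:.3f}")
```

With these counts accuracy is high even though a third of the actual positives are missed, which is why the notebook tracks F1 alongside accuracy.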


<p style="font-size: 42px; font-family: 'Montserrat', sans-serif; font-weight: 900; color: #7D7D7D; text-align: center; text-transform: uppercase; letter-spacing: 2px; background: linear-gradient(to right, #D3D3D3, #A9A9A9); color: white; padding: 15px 30px; border-radius: 25px; box-shadow: 0 10px 20px rgba(0, 0, 0, 0.2);">
    Model 1 - Logistic Regression </p>


### Why Logistic Regression?

1. **Simplicity and Interpretability**: Logistic Regression is a simple, linear model that's easy to understand and interpret. It's useful in business scenarios where understanding the reasons behind predictions is important.

2. **Probabilistic Output**: It provides probabilities, making it flexible in decision-making. You can set a threshold to determine when to flag a customer as a defaulter, such as when the model's confidence is above 70%.

3. **Efficient for Binary Classification**: It's ideal for binary classification tasks like predicting whether a customer is a defaulter or not (Yes/No).

4. **Less Prone to Overfitting**: Compared to more complex models, Logistic Regression is less prone to overfitting, especially when the dataset is smaller.

5. **Computationally Efficient**: It is less computationally expensive and can handle large datasets efficiently, which is essential for real-time predictions in business environments.

6. **Good Baseline Model**: Logistic Regression provides solid performance and is often used as a baseline model before experimenting with more complex models for classification tasks.
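
The thresholding idea from point 2 can be sketched as follows. The helper `flag_by_threshold` and the probability values are hypothetical; in practice the probabilities would come from something like the positive-class column of `predict_proba`.

```python
def flag_by_threshold(positive_probs, threshold=0.5):
    """Label each probability 'yes' when it is at least `threshold`, else 'no'."""
    return ["yes" if p >= threshold else "no" for p in positive_probs]


# Hypothetical positive-class probabilities, for illustration only.
probs = [0.15, 0.55, 0.72, 0.91]

print(flag_by_threshold(probs))       # default 0.5 cut-off
print(flag_by_threshold(probs, 0.7))  # stricter 70% cut-off
```

Raising the threshold usually trades recall for precision, so the cut-off is a business decision as much as a modelling one.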


In [15]:
# Initialize the Logistic Regression model with a random seed for reproducibility
log_reg_model = LogisticRegression(random_state=42)

# Defining a dictionary with the hyperparameters to tune during RandomizedSearchCV
param_dist_log_reg = {
    'C': [0.01, 0.1, 1, 10, 100],  # This controls the regularization strength
    'penalty': ['l2', 'l1'],  # Type of regularization (either L1 or L2)
    'solver': ['liblinear', 'saga'],  # Solvers to use for optimization
    'max_iter': [100, 200, 500]  # Maximum number of iterations for convergence
}

# Setting up RandomizedSearchCV to perform hyperparameter tuning
random_search_log_reg = RandomizedSearchCV(log_reg_model, 
                                           param_distributions=param_dist_log_reg, 
                                           n_iter=50,  # Number of different hyperparameter combinations to try
                                           scoring='f1_macro',  # Optimizing the F1 score (macro average)
                                           cv=5,  # 5-fold cross-validation to evaluate performance
                                           verbose=2,  # Show detailed output for progress tracking
                                           random_state=42,  # Ensures reproducibility of results
                                           n_jobs=-1)  # Use all available cores for faster processing

# Fitting the RandomizedSearchCV to the training data
random_search_log_reg.fit(x_train_transformed, y_train)

# Get the best Logistic Regression model after hyperparameter tuning
log_reg_model = random_search_log_reg.best_estimator_

# Print the best hyperparameters found by RandomizedSearchCV
print("Best Parameters for Logistic Regression:", random_search_log_reg.best_params_)

# Making predictions for the test and train sets
y_test_pred_log_reg = log_reg_model.predict(x_test_transformed)
y_train_pred_log_reg = log_reg_model.predict(x_train_transformed)

# Evaluating the model by calculating the accuracy and printing the classification report
print("Logistic Regression Training Accuracy:", accuracy_score(y_train, y_train_pred_log_reg))
print("Logistic Regression Test Accuracy:", accuracy_score(y_test, y_test_pred_log_reg))
print("Logistic Regression Classification Report:\n", classification_report(y_test, y_test_pred_log_reg))

# Calculate and print F1 Score (macro average) for both train and test sets
f1_train = f1_score(y_train, y_train_pred_log_reg, average='macro')
f1_test = f1_score(y_test, y_test_pred_log_reg, average='macro')
print(f"Training F1 Score (Macro): {f1_train:.2f}")
print(f"Test F1 Score (Macro): {f1_test:.2f}")

# Calculate and print ROC AUC Score for both train and test sets
roc_auc_train = roc_auc_score(y_train, log_reg_model.predict_proba(x_train_transformed)[:, 1])
roc_auc_test = roc_auc_score(y_test, log_reg_model.predict_proba(x_test_transformed)[:, 1])
print(f"Training ROC AUC Score: {roc_auc_train:.2f}")
print(f"Test ROC AUC Score: {roc_auc_test:.2f}")

# --- Now, let's create the submission file ---
# Predicting the target variable for the test set (for submission purposes)
y_test_pred_submission_log_reg = log_reg_model.predict(z_test_transformed)  # Using the test set (z_test_transformed)

# Creating a DataFrame to store the ID and predictions (required for submission)
submission_log_reg = pd.DataFrame({
    'id': range(0, len(z_test_transformed)),  # Creating an ID column (from 0 to the length of the test data)
    'target': y_test_pred_submission_log_reg  # Storing the predicted target values
})

# Saving the DataFrame to a CSV file named 'submission_log_reg.csv'
submission_log_reg.to_csv('submission_log_reg.csv', index=False)

# Informing that the submission file has been successfully created
print("Submission file created successfully: 'submission_log_reg.csv'")


Fitting 5 folds for each of 50 candidates, totalling 250 fits
Best Parameters for Logistic Regression: {'solver': 'liblinear', 'penalty': 'l2', 'max_iter': 100, 'C': 0.01}
Logistic Regression Training Accuracy: 0.8528755419535833
Logistic Regression Test Accuracy: 0.8489098559224787
Logistic Regression Classification Report:
               precision    recall  f1-score   support

          no       0.86      0.98      0.92      6645
         yes       0.52      0.15      0.23      1198

    accuracy                           0.85      7843
   macro avg       0.69      0.56      0.57      7843
weighted avg       0.81      0.85      0.81      7843

Training F1 Score (Macro): 0.57
Test F1 Score (Macro): 0.57
Training ROC AUC Score: 0.80
Test ROC AUC Score: 0.79
Submission file created successfully: 'submission_log_reg.csv'
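
The gap between the macro and weighted averages in the report above comes from how each one aggregates the per-class scores. A small sketch, using the rounded per-class F1 values and supports copied from that report (so the results are approximate):

```python
# Rounded per-class F1 scores and supports copied from the classification report.
f1_by_class = {"no": 0.92, "yes": 0.23}
support = {"no": 6645, "yes": 1198}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / total

print(f"Macro F1:    {macro_f1:.3f}")
print(f"Weighted F1: {weighted_f1:.3f}")
```

Because the 'no' class is much larger, the weighted average hides the weak 'yes' score, while the macro average exposes it.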


---

**When is Logistic Regression not suitable? (Cons)**
- When the data is highly **non-linear**, more advanced models like **Random Forests** may perform better.

---



<p style="font-size: 42px; font-family: 'Montserrat', sans-serif; font-weight: 900; color: #7D7D7D; text-align: center; text-transform: uppercase; letter-spacing: 2px; background: linear-gradient(to right, #D3D3D3, #A9A9A9); color: white; padding: 15px 30px; border-radius: 25px; box-shadow: 0 10px 20px rgba(0, 0, 0, 0.2);">
    Model 2 - SGD Classifier </p>


# Why SGD Classifier?

- **Efficiency with Large Datasets**: SGD is fast and efficient, especially with large datasets. It updates weights incrementally, making it suitable for large-scale classification problems.
  
- **Flexibility**: SGD works with different loss functions and penalties, making it a versatile model for a wide range of tasks, such as logistic regression and SVM.

- **Memory Efficient**: Since it processes one sample at a time, SGD is memory efficient, which is beneficial when dealing with large datasets.

- **Good for Sparse Data**: SGD performs well with sparse data, such as text classification, where many features are zero.
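
To make the "updates weights incrementally" point concrete, here is a minimal pure-Python sketch of per-sample SGD on the hinge loss. It is an illustration under simplifying assumptions (no regularization, no shuffling, constant learning rate), not the algorithm as implemented in `SGDClassifier`; the function names and toy data are hypothetical.

```python
def sgd_hinge_fit(samples, labels, lr=0.1, epochs=20):
    """Fit a linear classifier with per-sample SGD on the hinge loss.

    Labels must be +1 or -1. Regularization and shuffling are left out
    to keep the sketch short; library implementations typically add both.
    """
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # hinge loss is active, so take a gradient step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Tiny, linearly separable toy data (hypothetical).
X = [(2.0, 2.0), (3.0, 1.0), (-2.0, -1.0), (-1.0, -3.0)]
y = [1, 1, -1, -1]

w, b = sgd_hinge_fit(X, y)
preds = [predict(w, b, x) for x in X]
print("weights:", w, "bias:", b)
print("predictions:", preds)
```

Only one sample is touched per update, which is where the memory efficiency described above comes from.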


In [16]:
from imblearn.over_sampling import SMOTE

# Resample the training set using SMOTE to handle class imbalance
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(x_train_transformed, y_train)

# Define the SGDClassifier
sgd = SGDClassifier(random_state=42)

# Hyperparameter grid with added 'tol' for tolerance
param_grid = {
    'loss': ['hinge'],  # 'hinge' loss corresponds to linear SVM
    'penalty': ['l1'],  # L1 regularization (Lasso)
    'alpha': [0.0001],  # Regularization strength
    'learning_rate': ['adaptive'],  # Adaptive learning rate
    'eta0': [1.0],  # Initial learning rate
    'max_iter': [1000],  # Number of iterations
    'tol': [1e-4, 1e-3, 1e-2]  # Stopping tolerance on loss improvement
}

# Perform GridSearchCV to find the best hyperparameters
grid_search = GridSearchCV(
    estimator=sgd,
    param_grid=param_grid,
    scoring='f1_weighted',  # Use weighted F1-score to handle class imbalance
    cv=5,  # 5-fold cross-validation
    verbose=0,  # Set verbose to 1 to see the progress of the grid search
    n_jobs=-1  # Use all cores to speed up the computation
)

# Fit the model with resampled data
grid_search.fit(x_resampled, y_resampled)

# Best model after grid search
best_sgd = grid_search.best_estimator_
print("Best Parameters:", grid_search.best_params_)

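# Make predictions on the test set
y_test_pred = best_sgd.predict(x_test_transformed)
# With hinge loss there is no predict_proba; decision_function returns signed
# distances to the hyperplane, which are sufficient for ranking metrics like ROC-AUC
y_test_probs = best_sgd.decision_function(x_test_transformed)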

# Evaluate model performance
test_f1 = f1_score(y_test, y_test_pred, average='weighted')
test_roc_auc = roc_auc_score(y_test, y_test_probs)

# Print classification report and confusion matrix
print("Classification Report:\n", classification_report(y_test, y_test_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))

# Display F1 Score and ROC-AUC score
print(f"Test F1 Score: {test_f1:.2f}")
print(f"Test ROC-AUC Score: {test_roc_auc:.2f}")



# Make predictions on the test data for submission
z_test_pred = best_sgd.predict(z_test_transformed)

submission = pd.DataFrame({
    'id': range(len(z_test_pred)),  # Replace with actual ID column if available
    'target': z_test_pred
})

# Save submission to a CSV file
submission.to_csv('submission.csv', index=False)

print("Submission file saved as 'submission.csv'.")




Best Parameters: {'alpha': 0.0001, 'eta0': 1.0, 'learning_rate': 'adaptive', 'loss': 'hinge', 'max_iter': 1000, 'penalty': 'l1', 'tol': 0.0001}
Classification Report:
               precision    recall  f1-score   support

          no       0.94      0.82      0.88      6645
         yes       0.42      0.73      0.53      1198

    accuracy                           0.80      7843
   macro avg       0.68      0.77      0.70      7843
weighted avg       0.86      0.80      0.82      7843

Confusion Matrix:
 [[5426 1219]
 [ 328  870]]
Test F1 Score: 0.82
Test ROC-AUC Score: 0.84
Submission file saved as 'submission.csv'.


---
# Why not SGD?

- **Sensitive to Hyperparameters**: SGD is sensitive to the learning rate and requires careful tuning. If not set correctly, it may lead to poor results.

- **Convergence Issues**: The algorithm may not converge if the learning rate is too high or too low, leading to oscillations or slow progress.

- **Requires Proper Scaling**: Feature scaling (like normalization or standardization) is essential for proper convergence of SGD, especially in models like logistic regression.
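
As a rough sketch of what standardization does, the snippet below rescales a handful of values to zero mean and unit variance. The `standardize` helper is hypothetical; the input values are the `balance` entries from the first five training rows shown earlier. It uses the population standard deviation, which matches the convention `StandardScaler` follows.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (population std)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# 'balance' values from the first five training rows shown earlier.
balances = [647, 553, 1397, 394, 137]
scaled = standardize(balances)

print([round(v, 3) for v in scaled])
```

After scaling, features measured in very different units contribute gradient steps of comparable size, which helps SGD converge.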


---




<p style="font-size: 42px; font-family: 'Montserrat', sans-serif; font-weight: 900; color: #7D7D7D; text-align: center; text-transform: uppercase; letter-spacing: 2px; background: linear-gradient(to right, #D3D3D3, #A9A9A9); color: white; padding: 15px 30px; border-radius: 25px; box-shadow: 0 10px 20px rgba(0, 0, 0, 0.2);">
    Model 3 - Random Forest </p>


# Why Random Forest?

- **Robust to Overfitting**: Random Forest is an ensemble method that reduces the risk of overfitting by averaging multiple decision trees, making it more robust compared to a single decision tree.

- **Handles Non-linear Data**: It can capture complex, non-linear relationships in data, making it a great choice for a variety of tasks.

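- **Handles Missing Data Well**: Some Random Forest implementations can tolerate missing values without imputation, though support depends on the library and version. In this notebook, missing values are imputed in the preprocessing pipeline before the model sees them.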

- **Feature Importance**: It provides feature importance scores, helping to understand which features contribute the most to the predictions.

- **Versatility**: Random Forest can be used for both classification and regression tasks.
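
The "averaging multiple decision trees" idea can be sketched in a few lines. The per-tree probabilities below are hypothetical. scikit-learn's documentation describes its forests as averaging the trees' probabilistic predictions rather than counting hard votes, so both variants are shown for comparison.

```python
def average_tree_probabilities(tree_probs):
    """Average per-class probabilities over all trees (soft voting)."""
    n_trees = len(tree_probs)
    n_classes = len(tree_probs[0])
    return [sum(p[k] for p in tree_probs) / n_trees for k in range(n_classes)]


classes = ["no", "yes"]

# Hypothetical class probabilities from three trees for a single customer.
tree_probs = [
    [0.9, 0.1],
    [0.4, 0.6],
    [0.3, 0.7],
]

# Soft vote: average the probabilities, then pick the most likely class.
avg = average_tree_probabilities(tree_probs)
soft_vote = classes[avg.index(max(avg))]

# Hard vote: each tree picks a class, and the majority wins.
hard_votes = [classes[p.index(max(p))] for p in tree_probs]
hard_vote = max(set(hard_votes), key=hard_votes.count)

print("averaged probabilities:", [round(a, 3) for a in avg])
print("soft vote:", soft_vote)
print("hard vote:", hard_vote)
```

In this example the two schemes disagree: one very confident tree outweighs two mildly confident ones under averaging, which is one way an ensemble smooths out individual trees.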


In [17]:
# Import necessary libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
import numpy as np
import pandas as pd

# Define the parameter grid for RandomizedSearchCV
# Here, we are specifying a range of hyperparameters to tune the Random Forest model.
param_dist = {
    'n_estimators': [50, 100, 200, 300],  # The number of trees in the forest.
    'max_features': ['sqrt', 'log2', None],  # Maximum number of features to consider when splitting a node.
    'max_depth': [10, 20, 30, None],  # The maximum depth of each tree. 'None' means no limit.
    'min_samples_split': [2, 5, 10],  # The minimum number of samples required to split an internal node.
    'min_samples_leaf': [1, 2, 4],  # The minimum number of samples required to be at a leaf node.
    'bootstrap': [True, False],  # Whether to use bootstrap samples when building trees.
    'class_weight': [None, 'balanced', 'balanced_subsample'],  # Class weighting to handle imbalanced classes.
}

# Initialize the RandomForestClassifier model
# We set a fixed random state for reproducibility of the results.
random_f_model = RandomForestClassifier(random_state=42)

# Initialize RandomizedSearchCV
# This is used to perform hyperparameter tuning by randomly sampling combinations from the parameter grid.
random_search = RandomizedSearchCV(
    estimator=random_f_model,
    param_distributions=param_dist,
    n_iter=50,  # Number of different parameter combinations to try.
    scoring='f1_macro',  # Optimize based on the F1 macro score, which treats all classes equally.
    cv=5,  # Perform 5-fold cross-validation to evaluate the model.
    verbose=0,  # Suppress progress output (raise to 1 or 2 for more detail).
    random_state=42,
    n_jobs=-1  # Use all available CPU cores to speed up the computation.
)

# Fit RandomizedSearchCV to the training data
# This step will search for the best combination of hyperparameters based on cross-validation performance.
random_search.fit(x_train_transformed, y_train)

# Print the best hyperparameters and the best F1 score from cross-validation.
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best F1 Macro Score (CV): {random_search.best_score_:.2f}")

# Now, we use the best model found to make predictions on both the training and test sets.
random_f_model = random_search.best_estimator_
y_test_pred = random_f_model.predict(x_test_transformed)
y_train_pred = random_f_model.predict(x_train_transformed)

# Calculate and display the accuracy for both training and test sets.
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Calculate and display the F1 macro score for both training and test sets.
f1_macro_train = f1_score(y_train, y_train_pred, average='macro')
f1_macro_test = f1_score(y_test, y_test_pred, average='macro')

# Print out the accuracy and F1 scores.
print(f"Training Accuracy: {train_accuracy:.2f}")
print(f"Test Accuracy: {test_accuracy:.2f}")
print(f"Training F1 Macro Score: {f1_macro_train:.2f}")
print(f"Test F1 Macro Score: {f1_macro_test:.2f}")

# Print the classification report and confusion matrix for the test set to further evaluate model performance.
print("Classification Report:\n", classification_report(y_test, y_test_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))

# --- Submission Part ---
# Predict on the test set (for submission)
y_test_pred_submission = random_f_model.predict(z_test_transformed) 

# Create a DataFrame with 'id' and the predicted target values
submission = pd.DataFrame({
    'id': range(0, len(z_test_transformed)),
    'target': y_test_pred_submission
})

# Save the submission DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

print("Submission file created successfully: 'submission.csv'")


Best Parameters: {'n_estimators': 100, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 20, 'class_weight': 'balanced', 'bootstrap': True}
Best F1 Macro Score (CV): 0.77
Training Accuracy: 0.91
Test Accuracy: 0.86
Training F1 Macro Score: 0.85
Test F1 Macro Score: 0.77
Classification Report:
               precision    recall  f1-score   support

          no       0.95      0.87      0.91      6645
         yes       0.52      0.77      0.62      1198

    accuracy                           0.86      7843
   macro avg       0.74      0.82      0.77      7843
weighted avg       0.89      0.86      0.87      7843

Confusion Matrix:
 [[5798  847]
 [ 281  917]]
Submission file created successfully: 'submission.csv'


---

# Why not Random Forest?

- **Computationally Expensive**: Training many trees can be computationally expensive, especially with large datasets, and may require considerable memory.

- **Difficult to Interpret**: While individual decision trees are interpretable, Random Forest models, due to their ensemble nature, are difficult to interpret.

- **Slower Predictions**: For real-time applications, the ensemble of trees can lead to slower prediction times, as multiple trees must be queried for each prediction.

---

<p style="font-size: 42px; font-family: 'Montserrat', sans-serif; font-weight: 900; color: #7D7D7D; text-align: center; text-transform: uppercase; letter-spacing: 2px; background: linear-gradient(to right, #D3D3D3, #A9A9A9); color: white; padding: 15px 30px; border-radius: 25px; box-shadow: 0 10px 20px rgba(0, 0, 0, 0.2);">
    Model 4 - Decision Tree 
</p>


# Why Decision Tree?

- **Simple to Understand and Interpret**: Decision trees are easy to visualize and interpret, making them great for explaining model decisions to non-technical stakeholders.

- **No Need for Feature Scaling**: Unlike other models, decision trees don’t require scaling of features, which simplifies preprocessing.

- **Handles Non-linear Data**: Decision trees can capture non-linear relationships between features, making them flexible for various data types.

- **Efficient on Large Datasets**: They are relatively efficient in handling large datasets and can handle both numerical and categorical data.

- **Can Handle Missing Values**: Some decision tree implementations can handle missing data by splitting nodes on the available values, though support depends on the library and version.
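
A decision tree chooses its splits by measuring how mixed the labels in a node are. The sketch below computes two common impurity measures, Gini impurity and Shannon entropy, for a hypothetical node; the helper names and the label list are illustrative assumptions.

```python
from collections import Counter
from math import log2


def class_proportions(labels):
    """Return the fraction of samples belonging to each class in the node."""
    counts = Counter(labels)
    n = len(labels)
    return [c / n for c in counts.values()]


def gini(labels):
    """Gini impurity: 0 for a pure node, larger for more mixed nodes."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Shannon entropy in bits: 0 for a pure node, 1 for a 50/50 binary node."""
    return -sum(p * log2(p) for p in class_proportions(labels) if p > 0)


# Hypothetical node holding three 'no' samples and one 'yes' sample.
node = ["no", "no", "no", "yes"]

print(f"Gini impurity: {gini(node):.3f}")
print(f"Entropy:       {entropy(node):.3f}")
```

A split is considered good when the child nodes are purer than the parent, that is, when the weighted impurity of the children is lower.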


In [18]:
# Initialize the Decision Tree model
dt_model = DecisionTreeClassifier(random_state=42)

# Define parameter grid for GridSearchCV
param_grid_dt = {
    'criterion': ['gini', 'entropy', 'log_loss'],  # Evaluate different impurity measures
    'max_depth': [5, 10, 20, None],  # Depth of the tree, controlling the tree's complexity
    'min_samples_split': [2, 5, 10],  # Minimum samples required to split an internal node
    'min_samples_leaf': [1, 2, 5],  # Minimum samples required to be at a leaf node
    'max_features': [None, 'sqrt', 'log2']  # Strategies for feature selection (None means using all features)
}

# Initialize GridSearchCV for hyperparameter tuning
grid_search_dt = GridSearchCV(
    dt_model,
    param_grid_dt,
    scoring='f1_macro',  # Macro F1 averages the per-class F1 scores, weighting both classes equally
    cv=5,  # 5-fold cross-validation
    verbose=0,  # Suppress progress messages during fitting (increase to show them)
    n_jobs=-1  # Use all available cores to speed up the computation
)

# Fit the model on training data
grid_search_dt.fit(x_train_transformed, y_train)

# Get the best model from GridSearchCV
dt_model = grid_search_dt.best_estimator_

# Output the best parameters found by GridSearchCV
print("Best Parameters for Decision Tree:", grid_search_dt.best_params_)

# Evaluate the model on training data and test data
y_train_pred = dt_model.predict(x_train_transformed)
y_test_pred = dt_model.predict(x_test_transformed)

# Training and Test Accuracy
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

# Display training and test accuracy
print("Decision Tree Training Accuracy:", train_accuracy)
print("Decision Tree Test Accuracy:", test_accuracy)

# Display classification report (precision, recall, f1-score, support)
print("Decision Tree Classification Report:")
print(classification_report(y_test, y_test_pred))

# Confusion Matrix for test data
print("Confusion Matrix (Test Data):")
print(confusion_matrix(y_test, y_test_pred))

# ROC-AUC Score: The area under the receiver operating characteristic curve
roc_auc = roc_auc_score(y_test, dt_model.predict_proba(x_test_transformed)[:, 1])
print("Decision Tree ROC-AUC Score:", roc_auc)

# Handle feature names for Feature Importances (if available)
if hasattr(x_train_transformed, "columns"):  # Check if x_train_transformed is a DataFrame
    feature_names = x_train_transformed.columns
elif hasattr(preprocessor, "get_feature_names_out"):  # If using a preprocessor pipeline
    feature_names = preprocessor.get_feature_names_out()
else:
    feature_names = [f"Feature_{i}" for i in range(x_train_transformed.shape[1])]

# Create a DataFrame for Feature Importances
feature_importances = pd.DataFrame({
    'Feature': feature_names,
    'Importance': dt_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

# Display the feature importances sorted by importance
print("Feature Importances:")
print(feature_importances)

# Optionally, save feature importances to a CSV file
feature_importances.to_csv('feature_importances.csv', index=False)

# Make predictions for submission
z_test_pred = dt_model.predict(z_test_transformed) 

# Create a submission DataFrame with predicted targets
submission = pd.DataFrame({
    'id': range(len(z_test_pred)),
    'target': z_test_pred  # Predictions for the test set
})

# Save the submission DataFrame to a CSV file
submission.to_csv('submission.csv', index=False)

# Confirm that the submission file has been created
print("Submission file created successfully: 'submission.csv'")


Best Parameters for Decision Tree: {'criterion': 'gini', 'max_depth': 10, 'max_features': None, 'min_samples_leaf': 5, 'min_samples_split': 2}
Decision Tree Training Accuracy: 0.9010456516194848
Decision Tree Test Accuracy: 0.8520974117047049
Decision Tree Classification Report:
              precision    recall  f1-score   support

          no       0.90      0.93      0.91      6645
         yes       0.52      0.44      0.48      1198

    accuracy                           0.85      7843
   macro avg       0.71      0.68      0.70      7843
weighted avg       0.84      0.85      0.85      7843

Confusion Matrix (Test Data):
[[6151  494]
 [ 666  532]]
Decision Tree ROC-AUC Score: 0.8433854518001536
Feature Importances:
                   Feature  Importance
2            num__duration    0.499196
4               num__pdays    0.122485
7       num__contact_month    0.066238
22        cat__housing_yes    0.065947
0                 num__age    0.064495
8         num__contact_day    0.0
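
As a cross-check on the report above, the per-class metrics for `yes` can be recomputed directly from the printed confusion matrix. This is a small arithmetic sketch; it assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes, in the sorted label order `no`, `yes`).

```python
import numpy as np

# Confusion matrix copied from the output above.
# Assumed layout: rows = true class, columns = predicted class, order ['no', 'yes'].
cm = np.array([[6151, 494],
               [666, 532]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()       # share of all predictions that are correct
precision_yes = tp / (tp + fp)        # of predicted 'yes', how many are truly 'yes'
recall_yes = tp / (tp + fn)           # of true 'yes', how many were found
f1_yes = 2 * precision_yes * recall_yes / (precision_yes + recall_yes)

print(f"accuracy:        {accuracy:.2f}")       # 0.85
print(f"precision (yes): {precision_yes:.2f}")  # 0.52
print(f"recall (yes):    {recall_yes:.2f}")     # 0.44
print(f"f1 (yes):        {f1_yes:.2f}")         # 0.48
```

The values agree with the classification report: the model misses more than half of the actual `yes` cases (666 of 1198), which the overall accuracy of 0.85 hides because the `no` class dominates.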

---

# Why Not Decision Tree?

- **Prone to Overfitting**: Decision trees can easily overfit if not pruned properly, especially when the tree is very deep.

- **Instability**: Small variations in data can lead to very different trees, making decision trees less stable compared to ensemble methods like Random Forest.

- **Bias Toward Features with More Categories**: Impurity-based splitting tends to favor features with many distinct values, because they offer more candidate split points; this can bias both the splits and the reported feature importances.

- **Limited Performance on Complex Data**: While decision trees are flexible, they may struggle with complex datasets and fail to capture intricate patterns compared to ensemble methods.
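
The overfitting point above can be sketched by comparing an unconstrained tree with a depth-limited one. The data here is synthetic (`make_classification` with 10% label noise), an assumption made purely for illustration; the `max_depth=5` and `min_samples_leaf=5` values are likewise arbitrary choices, not the tuned parameters from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data (assumption: stands in for a real dataset)
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=5, flip_y=0.1, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

models = {
    # No depth limit: the tree keeps splitting until every leaf is pure
    "unconstrained": DecisionTreeClassifier(random_state=42),
    # Pre-pruned: limited depth and a minimum number of samples per leaf
    "constrained": DecisionTreeClassifier(
        max_depth=5, min_samples_leaf=5, random_state=42
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "depth": model.get_depth(),
        "train_acc": model.score(X_train, y_train),
        "test_acc": model.score(X_test, y_test),
    }
    print(name, results[name])
```

A large gap between training and test accuracy for the unconstrained tree is the signature of overfitting; constraining depth and leaf size usually narrows that gap at the cost of a lower training score.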

----
<p style="font-size: 36px; font-family: 'High Tower Text', serif; font-weight: bold; color: #333333; text-align: center; text-transform: uppercase;">
   THANK YOU
</p>


----