## Logistic Regression – Using More Intuitive Feature Selection for Easier Interpretation

In this step, I'm moving from the EDA phase to building a logistic regression model.  
Instead of using the full one-hot encoded dataset with 50+ features, I'm intentionally **selecting a smaller, more interpretable set of features** that are relevant and whose relationship to the target variable (`y`) is easy to explain.

**Rationale for this approach:**
- **Interpretability:** A smaller set of features makes it easier to understand how each variable influences the prediction.
- **Simplicity:** Fewer columns reduce model complexity and avoid the multicollinearity that many dummy variables can introduce.
- **Focus on meaningful predictors:** Every variable kept has a clear business or contextual meaning.

**Features selected:**
- `age` – Age of the individual.
- `previous` – Number of previous contacts with the client.
- `emp.var.rate` – Employment variation rate.
- `cons.price.idx` – Consumer price index.
- `cons.conf.idx` – Consumer confidence index.
- `euribor3m` – 3-month Euribor interest rate.
- `nr.employed` – Number of employees.
- `prior_contact` – Binary flag indicating whether the client had been contacted before.
- `campaign_capped` – Number of contacts made during the current campaign (capped).
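
The two engineered columns above (`prior_contact` and `campaign_capped`) were created earlier, and this section does not show how. As a rough sketch only: assuming the raw data has the usual `pdays` column (where 999 marks a client who was never contacted) and a `campaign` count, they could be derived like this. The cap value `CAMPAIGN_CAP = 10` and the toy rows are hypothetical, chosen purely for illustration.

```python
import pandas as pd

# Toy rows standing in for the raw columns (values are made up).
raw = pd.DataFrame({
    "pdays": [999, 3, 999, 6],   # 999 is assumed to mean "never contacted before"
    "campaign": [1, 2, 35, 12],  # contacts during the current campaign
})

CAMPAIGN_CAP = 10  # hypothetical cap; the real value is not stated in this notebook

derived = pd.DataFrame({
    # 1 if the client was contacted in an earlier campaign, else 0
    "prior_contact": (raw["pdays"] != 999).astype("int64"),
    # Clip extreme contact counts so a few outliers do not dominate
    "campaign_capped": raw["campaign"].clip(upper=CAMPAIGN_CAP),
})

print(derived)
```

Capping keeps the variable numeric and ordered while limiting the pull of rare, very large contact counts.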

**Target variable:**
- `y` – Binary outcome indicating if the client subscribed to the term deposit (1) or not (0).

**Next steps:**
1. Split data into training and testing sets.
2. Scale numeric features for logistic regression.
3. Train the model.
4. Evaluate performance (accuracy, classification report, confusion matrix).
5. Optionally, interpret coefficients to understand feature importance.
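
The five steps above can be sketched end to end as follows. This is a minimal illustration on synthetic data from `make_classification`, not the bank dataset, and the names `X_demo`, `y_demo`, and `model` are placeholders. Wrapping the scaler and the classifier in a pipeline is one reasonable way to make sure the scaler is fit on training rows only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for nine numeric features with an imbalanced binary target.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=9, n_informative=5,
    weights=[0.9, 0.1], random_state=42,
)

# 1. Split, keeping class proportions equal in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# 2-3. Scale and train; the pipeline fits the scaler on the training rows only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# 4. Evaluate on the held-out rows.
y_pred = model.predict(X_te)
acc = accuracy_score(y_te, y_pred)
cm = confusion_matrix(y_te, y_pred)
print("Accuracy:", round(acc, 3))
print(cm)
print(classification_report(y_te, y_pred, zero_division=0))

# 5. Coefficients are on the standardized scale, so their magnitudes are comparable.
coefs = model.named_steps["logisticregression"].coef_.ravel()
print(coefs.round(3))
```

Because the features are standardized before fitting, each coefficient describes the change in log-odds per one standard deviation of that feature.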


In [1]:
# Let's first import pandas and read the cleaned dataset
import pandas as pd


df = pd.read_csv('cleaned_bank.csv')

1. Feature selection (my reduced, intuitive feature list).
2. Train/test split.
3. Scaling.
4. Model fitting.
5. Evaluation.

In [2]:
# Feature selection for logistic regression
# We will use the following nine features for our logistic regression model:
# age, previous, emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m,
# nr.employed, prior_contact, campaign_capped

features_to_keep = [
    'age',
    'previous',
    'emp.var.rate',
    'cons.price.idx',
    'cons.conf.idx',
    'euribor3m',
    'nr.employed',
    'prior_contact',
    'campaign_capped'
]
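
Before indexing the DataFrame with a hand-written list like this, it can help to confirm that every name actually exists, since a typo in a dotted name such as `cons.price.idx` would otherwise raise a `KeyError` later. The helper `missing_columns` below is a hypothetical convenience function, shown on a toy frame rather than the real data.

```python
import pandas as pd

def missing_columns(frame, wanted):
    """Return the names in `wanted` that are not columns of `frame`, in order."""
    return [name for name in wanted if name not in frame.columns]

# Toy frame standing in for the cleaned dataset.
demo = pd.DataFrame({"age": [30, 41], "previous": [0, 1], "y": [0, 1]})
wanted = ["age", "previous", "euribor3m"]

print(missing_columns(demo, wanted))  # ['euribor3m']
```

An empty list means every requested feature is present.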




In [3]:
# Keep only the selected features; the target variable 'y' is held separately
X = df[features_to_keep].copy()  # Features for the model
y = df['y'].astype(int)  # Convert target variable to integer (0 or 1)

# Ensure prior_contact (0/1 flag) and campaign_capped (capped count) are ints, not bool/object
X[['prior_contact', 'campaign_capped']] = X[['prior_contact', 'campaign_capped']].astype(int)

### Identify Column Types
Identify which columns are numeric and which are categorical.

In [4]:
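# Note: this filter matches only 64-bit dtypes. If astype(int) in the previous cell
# produced int32 (as it can on some platforms), prior_contact and campaign_capped
# will be listed as "categorical" below even though they hold integers.
# Filtering on 'number' instead would classify columns by kind rather than by width.
numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_cols = X.select_dtypes(exclude=['int64', 'float64']).columns.tolist()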

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

Numeric columns: ['age', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'euribor3m', 'nr.employed']
Categorical columns: ['prior_contact', 'campaign_capped']
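
The output above lists `prior_contact` and `campaign_capped` as categorical even though both were cast to integers. A likely explanation (an assumption, since the dtypes are not printed here) is that the cast produced `int32`, which the `['int64', 'float64']` filter does not match. The sketch below reproduces that effect on a toy frame and shows that `include='number'` selects every numeric width.

```python
import pandas as pd

# Toy frame; prior_contact is deliberately stored as int32.
demo = pd.DataFrame({
    "age": pd.Series([30, 41], dtype="int64"),
    "euribor3m": pd.Series([4.857, 1.313], dtype="float64"),
    "prior_contact": pd.Series([0, 1], dtype="int32"),
})

# Width-specific filter: misses the int32 column.
narrow = demo.select_dtypes(include=["int64", "float64"]).columns.tolist()
# Kind-based filter: matches any integer or float width.
broad = demo.select_dtypes(include="number").columns.tolist()

print(narrow)  # ['age', 'euribor3m']
print(broad)   # ['age', 'euribor3m', 'prior_contact']
```

Printing `X.dtypes` in the notebook would confirm whether this is what happened.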


Now let's start the train/test split.

In [None]:
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets:
# - test_size=0.2 sends 20% of rows to the test set and 80% to the train set
# - random_state=42 makes the split reproducible
# - stratify=y keeps the class proportions of y the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Confirm the shapes of the resulting splits
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(31462, 9) (7866, 9) (31462,) (7866,)
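
The shapes confirm the row counts, but not that `stratify=y` preserved the class balance. One way to check is to compare the positive rate in each split. The sketch below does this on made-up labels (10% positives); in the notebook the same comparison would use `y_train.mean()` and `y_test.mean()`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels: 100 positives, 900 negatives.
y_demo = np.array([1] * 100 + [0] * 900)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

_, _, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the two positive rates should be (nearly) identical.
print("train positive rate:", y_tr.mean())
print("test positive rate: ", y_te.mean())
```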


"""
========================================
   Train/Test Split — Output Summary
========================================

📊 What the Output Means
------------------------
• X_train → 31,462 rows × 9 features (80% of the data)
• X_test  →  7,866 rows × 9 features (20% of the data)
• y_train → 31,462 labels (matches X_train rows)
• y_test  →  7,866 labels (matches X_test rows)

🧮 Why the Numbers Make Sense
-----------------------------
• Total rows:
    31,462 + 7,866 = 39,328 total rows in dataset.

• 80/20 split check:
    Train: 39,328 × 0.8 = 31,462.4 → 31,462 rows  ✅
    Test:  39,328 × 0.2 =  7,865.6 →  7,866 rows  ✅

• Same features in train/test:
    Both X_train and X_test have exactly 9 columns (features).
========================================
"""
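
The arithmetic in the summary can be reproduced in a few lines. This sketch assumes, as I understand scikit-learn's behavior for a fractional `test_size`, that the test count is rounded up and the training set receives the remaining rows.

```python
import math

n_rows = 39_328   # total rows reported above
test_size = 0.2

# Test count is rounded up; training takes whatever is left.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)  # 31462 7866
```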
