# Feature Engineering: Binarization
---
**Notebook:** Binarization | **Dataset:** Titanic  
**Goal:** Understand how binarizing a continuous/count feature affects model performance.


## What is Binarization?

**Binarization** is a data preprocessing technique that converts numerical features into **binary values (0 or 1)** based on a threshold.

### How it works:
| Condition | Output |
|-----------|--------|
| value ≤ threshold | → **0** |
| value > threshold | → **1** |
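The rule in the table can be sketched on a handful of made-up counts. This is a minimal illustration, assuming scikit-learn's `Binarizer` with its default `threshold=0.0`; the array `counts` is hypothetical and not taken from the Titanic data.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical family-size counts (illustrative values only)
counts = np.array([[0], [1], [3], [0], [5]])

# Default threshold is 0.0: values <= 0 become 0, values > 0 become 1
binarizer = Binarizer(threshold=0.0)
flags = binarizer.fit_transform(counts)

# The same rule written directly in NumPy, for comparison
manual_flags = (counts > 0).astype(int)

print(flags.ravel())
print(manual_flags.ravel())
```

Both arrays should mark the zero counts as 0 and every positive count as 1.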

### When to use Binarization:
- When the **presence/absence** of something matters more than its magnitude  
  *(e.g., Does a passenger have family on board? Yes/No)*
- To convert count data into binary flags
- When you want to reduce noise from outliers in count/magnitude features

### When NOT to use:
- When the actual magnitude of the feature is important for prediction
- On features that have a natural ordinal relationship across many values

### In this notebook:
We binarize the `Family` column (SibSp + Parch = total family members).  
The question becomes: **"Does the passenger travel with family?"** (1=Yes, 0=No)


### Import Libraries

In [130]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import Binarizer
import warnings
warnings.filterwarnings('ignore')

# Binarizer: sklearn utility that applies threshold-based binary transformation
# ColumnTransformer: applies different transformations to different columns


### Load & Explore Dataset

We use the **Titanic dataset** — a classic binary classification problem.  
- **Target:** `Survived` (0 = No, 1 = Yes)  
- **Features we'll use:** Age, Fare, Family size


In [133]:
data = pd.read_csv("train.csv")
data.head(2)  # Preview first 2 rows


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |


### Check for Missing Values

Before preprocessing, always check for null values — they can crash transformers.
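As a small illustration of why the check matters, the sketch below passes a toy column containing one `NaN` to `Binarizer`. The array `values` is made up for the example; the expectation that scikit-learn raises a `ValueError` here rests on its default input validation, which rejects non-finite values.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical column with one missing value (not the Titanic data)
values = np.array([[1.0], [np.nan], [3.0]])

try:
    Binarizer().fit_transform(values)
    raised = False
except ValueError as err:
    # Input validation refuses NaN before any thresholding happens
    raised = True
    print("Binarizer rejected the input:", err)

print("ValueError raised:", raised)
```

Dropping or imputing the missing rows first, as done later in this notebook, avoids the error.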


In [136]:
data.isnull().sum()  # Count nulls per column
# Age: 177 missing, Cabin: 687 missing, Embarked: 2 missing


PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### Feature Engineering

**Key Insight:** Instead of using `SibSp` (siblings/spouses) and `Parch` (parents/children) separately,  
we combine them into a single `Family` column representing the total number of family members aboard.

> `Family = SibSp + Parch`  
> A value of 0 means the passenger traveled alone.


In [139]:
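# Select only relevant columns
# (.copy() gives an independent DataFrame, avoiding SettingWithCopyWarning on the assignment below)
data = data[['Age', 'Fare', 'SibSp', 'Parch', 'Survived']].copy()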

# Engineer 'Family' feature: total family members on board
data['Family'] = data['SibSp'] + data['Parch']

# Drop original columns (they are now encoded in 'Family')
data = data.drop(['SibSp', 'Parch'], axis=1)
data.head(2)


| | Age | Fare | Survived | Family |
|---|---|---|---|---|
| 0 | 22.0 | 7.25 | 0 | 1 |
| 1 | 38.0 | 71.2833 | 1 | 1 |


In [141]:
# Drop rows with missing values
# Among the selected columns only Age has nulls (177 rows), so 891 rows become 714
data.dropna(inplace=True)
print(f'Dataset shape after cleaning: {data.shape}')


Dataset shape after cleaning: (714, 4)


### Split Features & Target: Train/Test Split

- **X** = feature matrix (Age, Fare, Family)  
- **y** = target labels (Survived)  
- **80/20 split** with `random_state=42` for reproducibility


In [144]:
X = data.drop(['Survived'], axis=1)  # Feature matrix
y = data[['Survived']]                # Target variable

X.head(2)  # Preview features


| | Age | Fare | Family |
|---|---|---|---|
| 0 | 22.0 | 7.25 | 1 |
| 1 | 38.0 | 71.2833 | 1 |


In [146]:
y.head(2)  # Preview target


| | Survived |
|---|---|
| 0 | 0 |
| 1 | 1 |


In [148]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print('Shape of X_train : ', X_train.shape)
print('Shape of y_train : ', y_train.shape)
print('Shape of X_test  : ', X_test.shape)
print('Shape of y_test  : ', y_test.shape)


Shape of X_train :  (571, 3)
Shape of y_train :  (571, 1)
Shape of X_test  :  (143, 3)
Shape of y_test  :  (143, 1)


### Baseline Model: Without Binarization

We first train a **DecisionTreeClassifier** on the raw data (no transformation).  
This gives us a **baseline accuracy** to compare against.

The `Family` column here is a raw count: 0, 1, 2, 3, 4, etc.


In [151]:
# No random_state is set on the tree, so scores can vary slightly between runs
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)  # Train on raw (untransformed) features


In [153]:
y_pred = clf.predict(X_test)

# Accuracy on hold-out test set
print('Test Accuracy (without Binarization):', accuracy_score(y_test, y_pred))

# Cross-validated accuracy (more reliable estimate — averages over 10 folds)
print('CV Accuracy (without Binarization):  ', np.mean(cross_val_score(clf, X, y, scoring='accuracy', cv=10)))


Test Accuracy (without Binarization): 0.6293706293706294
CV Accuracy (without Binarization):   0.647085289514867


### With Binarization Applied to `Family`

### Why binarize `Family`?
Instead of using the exact number of family members (which may add noise),  
we ask: **"Does the passenger travel with at least one family member?"**

- `Family = 0` → **0** (traveling alone)  
- `Family ≥ 1` → **1** (traveling with family)

### Pipeline:
We use `ColumnTransformer` to apply `Binarizer` **only** to the `Family` column,  
while leaving `Age` and `Fare` untouched (`remainder='passthrough'`).


In [156]:
# Create a ColumnTransformer:
# - Apply Binarizer (default threshold=0.0) to 'Family' column
# - Pass through Age and Fare unchanged
trf = ColumnTransformer(
    [('bin', Binarizer(copy=False), ['Family'])],
    remainder='passthrough'  # Keep Age and Fare as-is
)


In [158]:
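# IMPORTANT: fit on training data only, then transform both sets
# Binarizer is stateless: its threshold is a fixed parameter, not learned from the data.
# Fitting on train only is still the right habit, because transformers that do learn
# parameters (scalers, imputers) would otherwise leak test information.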
X_train_trf = trf.fit_transform(X_train)
X_test_trf  = trf.transform(X_test)


In [160]:
# Fresh, unseeded tree (same settings as the baseline model)
clf = DecisionTreeClassifier()
clf.fit(X_train_trf, y_train)  # Train on binarized features


In [162]:
y_pred1 = clf.predict(X_test_trf)

# Accuracy on hold-out test set (binarized)
print('Test Accuracy (with Binarization):', accuracy_score(y_test, y_pred1))

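# Cross-validated accuracy (binarized): transform all of X first.
# This is safe here only because Binarizer learns nothing from the data;
# a transformer that fits parameters should be cross-validated inside a Pipeline.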
X_tran = trf.fit_transform(X)
print('CV Accuracy (with Binarization):  ', np.mean(cross_val_score(clf, X_tran, y, scoring='accuracy', cv=10)))


Test Accuracy (with Binarization): 0.6013986013986014
CV Accuracy (with Binarization):   0.6345657276995305


### Results & Conclusion

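### Interpretation:
In this particular run, binarizing `Family` **slightly reduced** accuracy: test accuracy fell from 0.629 to 0.601, and 10-fold CV accuracy from 0.647 to 0.635.  
This suggests that the **exact count of family members** may carry some predictive signal  
that is lost when we reduce it to a binary flag. The gap is small, though, and the decision tree was trained without a fixed `random_state`, so part of it may be run-to-run variation.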
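Because the decision tree in this notebook is created without a `random_state`, its scores can shift a little from run to run. One way to judge whether a gap of a point or two is meaningful is to repeat cross-validation over several seeds and look at the spread. The sketch below is only an illustration: it uses synthetic data from `make_classification` as a stand-in (an assumption, so it runs without `train.csv`), and the names `X_demo`, `y_demo`, and `seed_means` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: not the Titanic dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=0
)

seed_means = []
for seed in range(5):
    # Only the tree's seed changes; the CV folds stay the same
    tree = DecisionTreeClassifier(random_state=seed)
    scores = cross_val_score(tree, X_demo, y_demo, scoring="accuracy", cv=10)
    seed_means.append(scores.mean())

spread = max(seed_means) - min(seed_means)
print("Mean CV accuracy per seed:", np.round(seed_means, 3))
print("Spread across seeds      :", round(spread, 3))
```

If the difference between two feature sets is no larger than this spread, it is hard to attribute it to the transformation.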

### Key Takeaways:
1. Binarization is a quick, simple transformation — but not always beneficial.
2. Always compare accuracy **with and without** the transformation.
3. It works best when the **distinction is binary** in nature (presence vs. absence).
4. Use `cross_val_score` instead of a single train/test split for more robust evaluation.
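For the cross-validation point, a common pattern is to wrap the transformer and the model in a single `Pipeline`, so the transformer is re-fit inside every fold. This is a sketch under stated assumptions rather than the notebook's own method: the data are randomly generated stand-ins with the same column names (`Age`, `Fare`, `Family`), the labels are random, and `pipe`, `X_demo`, and `y_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer
from sklearn.tree import DecisionTreeClassifier

# Random stand-in data (assumption: illustrates the API only, not real results)
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 200, n),
    "Family": rng.integers(0, 6, n),
})
y_demo = pd.Series(rng.integers(0, 2, n), name="Survived")

# Preprocessing and model in one estimator: each CV fold fits both on its own training part
pipe = Pipeline([
    ("prep", ColumnTransformer(
        [("bin", Binarizer(), ["Family"])],
        remainder="passthrough",
    )),
    ("tree", DecisionTreeClassifier(random_state=42)),
])

scores = cross_val_score(pipe, X_demo, y_demo, scoring="accuracy", cv=10)
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy  :", round(scores.mean(), 3))
```

Since the labels here are random, the accuracy should hover around chance; the point is the structure, which carries over unchanged to transformers that do learn parameters.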
