# FeatureUnion and ColumnTransformer

## Overview

Real-world datasets often have **mixed data types**:
- Numerical features (age, price, ratings)
- Categorical features (gender, city, product type)
- Text features (reviews, descriptions)
- Datetime features (timestamps)

scikit-learn provides two tools for handling heterogeneous data:

### FeatureUnion (Legacy)
- Applies multiple transformers in parallel
- Concatenates results horizontally
- Works on the **same input data**
- Useful for combining different feature extraction methods
- Predates `ColumnTransformer` but is still supported (not deprecated)

### ColumnTransformer (Modern, Recommended)
- Applies different transformers to **different columns**
- Perfect for mixed data types
- More intuitive and flexible
- Available since scikit-learn 0.20

**Rule of Thumb**: Use **ColumnTransformer** for column-specific preprocessing!
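
To make the contrast concrete, here is a minimal sketch on a toy frame. The column names `age` and `city` and the variable names are made up for illustration; they are not part of the dataset used later.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Toy data; column names are illustrative only
toy = pd.DataFrame({"age": [25, 40, 58], "city": ["Paris", "Rome", "Paris"]})

# FeatureUnion: every transformer receives the SAME input (here the age column)
union = FeatureUnion([("std", StandardScaler()), ("minmax", MinMaxScaler())])
X_union_toy = union.fit_transform(toy[["age"]])
print(X_union_toy.shape)  # one output column per transformer, side by side

# ColumnTransformer: each transformer receives only ITS columns
ct = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(), ["city"]),
])
X_ct_toy = ct.fit_transform(toy)
print(X_ct_toy.shape)  # 1 scaled column + 2 one-hot columns
```

In both cases the outputs are concatenated horizontally; the difference is only in which slice of the input each transformer sees.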

## Setup and Import

In [None]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, mean_squared_error, r2_score

print("✓ Libraries imported successfully")

## Sample Dataset with Mixed Types

Let's create a realistic dataset:

In [None]:
# Create sample employee dataset
np.random.seed(42)

n_samples = 1000

data = pd.DataFrame({
    'age': np.random.randint(22, 65, n_samples),
    'experience_years': np.random.randint(0, 40, n_samples),
    'department': np.random.choice(['Engineering', 'Sales', 'Marketing', 'HR'], n_samples),
    'education': np.random.choice(['Bachelor', 'Master', 'PhD'], n_samples),
    'skills': [
        np.random.choice(['Python Java SQL', 'Leadership Communication', 
                         'Marketing Analytics SEO', 'HR Management Recruiting'], 1)[0] 
        for _ in range(n_samples)
    ],
    'rating': np.random.uniform(1, 5, n_samples),
    'remote_work': np.random.choice(['Yes', 'No'], n_samples),
})

# Target: High performer (binary classification)
data['high_performer'] = (
    (data['rating'] > 3.5) & 
    (data['experience_years'] > 5)
).astype(int)

print("Employee Dataset:")
print(data.head(10))
print(f"\nShape: {data.shape}")
print(f"Target distribution: {data['high_performer'].value_counts().to_dict()}")

In [None]:
# Identify column types
numeric_features = ['age', 'experience_years', 'rating']
categorical_features = ['department', 'education', 'remote_work']
text_features = ['skills']

print("Feature Types:")
print(f"  Numeric: {numeric_features}")
print(f"  Categorical: {categorical_features}")
print(f"  Text: {text_features}")
print(f"\nData types:\n{data.dtypes}")

## 1. FeatureUnion - Combining Feature Extractors

**Use Case**: Apply multiple transformations to the **same input** and combine results.

Example: Extract both unigrams and bigrams from text data.

In [None]:
# Example: Combine unigrams and bigrams from skills column
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer to select text column
class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.column]

# Create FeatureUnion
text_features_union = FeatureUnion([
    ('unigrams', Pipeline([
        ('selector', TextSelector('skills')),
        ('vectorizer', CountVectorizer(ngram_range=(1, 1), max_features=50))
    ])),
    ('bigrams', Pipeline([
        ('selector', TextSelector('skills')),
        ('vectorizer', CountVectorizer(ngram_range=(2, 2), max_features=50))
    ]))
])

# Transform
X_union = text_features_union.fit_transform(data)

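# Count what each branch actually produced: max_features=50 is only an upper
# bound, and this small vocabulary stays well below it
fitted = dict(text_features_union.transformer_list)
n_unigrams = len(fitted['unigrams'].named_steps['vectorizer'].vocabulary_)
n_bigrams = len(fitted['bigrams'].named_steps['vectorizer'].vocabulary_)

print("FeatureUnion Results:")
print(f"Output shape: {X_union.shape}")
print(f"  - Unigrams: {n_unigrams} features")
print(f"  - Bigrams: {n_bigrams} features")
print(f"  - Total: {X_union.shape[1]} features (concatenated)")
print(f"\nMatrix type: {type(X_union)}")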

## 2. ColumnTransformer - The Modern Approach

**Best for**: Applying different transformations to different column groups.

### Basic Syntax

```python
ColumnTransformer([
    ('name', transformer, columns),
    ('name2', transformer2, columns2),
], remainder='drop')  # or 'passthrough'
```
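
One detail of the `columns` argument is easy to miss: a list such as `['price']` hands the transformer a 2D slice, while a bare string such as `'price'` hands it a 1D column. Text vectorizers expect the 1D form, which is why `'skills'` is passed as a plain string later on. The sketch below checks this with a small helper transformer; `NdimRecorder` is a hypothetical class written only for this demonstration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer


class NdimRecorder(BaseEstimator, TransformerMixin):
    """Hypothetical helper: remembers the dimensionality of what it was fitted on."""

    def fit(self, X, y=None):
        self.ndim_ = np.asarray(X).ndim
        return self

    def transform(self, X):
        # ColumnTransformer requires every transformer to return 2D output
        return np.asarray(X, dtype=float).reshape(len(X), -1)


toy = pd.DataFrame({"price": [100, 200, 150]})

ct_shapes = ColumnTransformer([
    ("as_list", NdimRecorder(), ["price"]),  # list of names -> 2D input
    ("as_str", NdimRecorder(), "price"),     # single string  -> 1D input
])
ct_shapes.fit(toy)

ndim_list = ct_shapes.named_transformers_["as_list"].ndim_
ndim_str = ct_shapes.named_transformers_["as_str"].ndim_
print(ndim_list, ndim_str)
```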

In [None]:
# Basic ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)
    ],
    remainder='drop'  # Drop text column for now
)

# Prepare data
X = data.drop('high_performer', axis=1)
y = data['high_performer']

# Transform
X_transformed = preprocessor.fit_transform(X)

print("ColumnTransformer Results:")
print(f"Original shape: {X.shape}")
print(f"Transformed shape: {X_transformed.shape}")
print(f"\nBreakdown:")
print(f"  - Numeric (scaled): {len(numeric_features)} features")
print(f"  - Categorical (one-hot): {X_transformed.shape[1] - len(numeric_features)} features")
print(f"  - Text (dropped): 0 features")

In [None]:
# Get feature names after transformation
feature_names = preprocessor.get_feature_names_out()

print(f"Feature names after transformation ({len(feature_names)} total):")
print(feature_names[:20])  # Show first 20
print("...")

## 3. Complete Pipeline: All Feature Types

Let's process **numeric, categorical, AND text** features together:

In [None]:
# Complete preprocessing pipeline
from sklearn.pipeline import make_pipeline

complete_preprocessor = ColumnTransformer(
    transformers=[
        # Numeric: Impute missing + scale
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), numeric_features),
        
        # Categorical: Impute + one-hot encode
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
        ]), categorical_features),
        
        # Text: TF-IDF vectorization
        ('text', TfidfVectorizer(max_features=50, stop_words='english'), 'skills')
    ],
    remainder='drop'
)

# Transform data
X_complete = complete_preprocessor.fit_transform(X)

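# Count output features per transformer from the generated names
# (ColumnTransformer prefixes each name with '<transformer name>__')
out_names = complete_preprocessor.get_feature_names_out()
n_num = sum(name.startswith('num__') for name in out_names)
n_cat = sum(name.startswith('cat__') for name in out_names)
n_text = sum(name.startswith('text__') for name in out_names)

print("Complete Preprocessing Results:")
print(f"Original columns: {X.shape[1]}")
print(f"Transformed features: {X_complete.shape[1]}")
print(f"\nFeature breakdown:")
print(f"  - Numeric (scaled): {n_num} features")
print(f"  - Categorical (one-hot): {n_cat} features")
print(f"  - Text (TF-IDF): {n_text} features (max_features=50 is only an upper bound)")
print(f"  - Total: {X_complete.shape[1]} features")
print(f"\nOutput type: {type(X_complete)}")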
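
Whether the combined output is a dense array or a sparse matrix depends on the `sparse_threshold` parameter of `ColumnTransformer` and on how many entries are non-zero. The sketch below, on a made-up three-row frame, forces each outcome; the helper name `build` is hypothetical.

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "rating": [1.0, 3.5, 4.8],
    "skills": ["python sql", "leadership communication", "marketing seo"],
})


def build(sparse_threshold):
    """Same transformers each time; only the sparse_threshold differs."""
    return ColumnTransformer(
        [
            ("num", StandardScaler(), ["rating"]),   # dense output
            ("text", TfidfVectorizer(), "skills"),   # sparse output
        ],
        sparse_threshold=sparse_threshold,
    )


X_dense = build(0.0).fit_transform(toy)   # 0.0: always return a dense array
X_sparse = build(1.0).fit_transform(toy)  # 1.0: stay sparse unless fully dense

print(type(X_dense), type(X_sparse))
```

With the default threshold the result type can change as the data changes, so it is worth checking before handing the output to code that expects one or the other.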

## 4. Full ML Pipeline: Preprocessing + Model

Combine preprocessing and modeling in one pipeline:

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training samples: {X_train.shape[0]}")
print(f"Test samples: {X_test.shape[0]}")

# Create full pipeline
full_pipeline = Pipeline([
    ('preprocessor', complete_preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])

# Train (preprocessor + model in one step!)
print("\nTraining pipeline...")
full_pipeline.fit(X_train, y_train)

# Predict
y_pred = full_pipeline.predict(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"\nTest Accuracy: {accuracy:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

In [None]:
# Cross-validation with full pipeline
cv_scores = cross_val_score(
    full_pipeline, X, y, cv=5, scoring='accuracy'
)

print("5-Fold Cross-Validation Scores:")
print(f"  Scores: {cv_scores}")
print(f"  Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

## 5. Real-World Example: Titanic Dataset

Classic dataset with mixed types:

In [None]:
# Load Titanic dataset
from sklearn.datasets import fetch_openml

print("Loading Titanic dataset...")
titanic = fetch_openml('titanic', version=1, as_frame=True, parser='auto')
df = titanic.frame

# Clean and prepare
df = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'survived']].copy()
df['survived'] = df['survived'].astype(int)

print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())
print(f"\nMissing values:\n{df.isnull().sum()}")
print(f"\nSurvival rate: {df['survived'].mean():.2%}")

In [None]:
# Define features by type
numeric_cols = ['age', 'sibsp', 'parch', 'fare']
categorical_cols = ['pclass', 'sex', 'embarked']

# Create preprocessor
titanic_preprocessor = ColumnTransformer(
    transformers=[
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),  # Handle missing values
            ('scaler', StandardScaler())
        ]), numeric_cols),
        
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
            ('encoder', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'))
        ]), categorical_cols)
    ],
    remainder='drop'
)

# Prepare data
X_titanic = df.drop('survived', axis=1)
y_titanic = df['survived']

# Split
X_train_t, X_test_t, y_train_t, y_test_t = train_test_split(
    X_titanic, y_titanic, test_size=0.2, random_state=42, stratify=y_titanic
)

print(f"Training set: {X_train_t.shape}")
print(f"Test set: {X_test_t.shape}")

In [None]:
# Compare multiple models
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}

print("=" * 60)
print("TITANIC SURVIVAL PREDICTION")
print("=" * 60)

for name, model in models.items():
    # Create pipeline
    pipeline = Pipeline([
        ('preprocessor', titanic_preprocessor),
        ('classifier', model)
    ])
    
    # Train
    pipeline.fit(X_train_t, y_train_t)
    
    # Predict
    y_pred = pipeline.predict(X_test_t)
    
    # Evaluate
    accuracy = accuracy_score(y_test_t, y_pred)
    
    print(f"\n{name}:")
    print(f"  Accuracy: {accuracy:.4f}")
    
    # Cross-validation
    cv_scores = cross_val_score(pipeline, X_titanic, y_titanic, cv=5)
    print(f"  CV Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

## 6. Advanced ColumnTransformer Techniques

### Using `make_column_transformer` (Simpler Syntax)

In [None]:
from sklearn.compose import make_column_transformer

# Simpler syntax: pass (transformer, columns) pairs; step names are generated automatically
preprocessor_simple = make_column_transformer(
    (StandardScaler(), numeric_cols),
    (OneHotEncoder(drop='first', sparse_output=False), categorical_cols),
    remainder='drop'
)

X_simple = preprocessor_simple.fit_transform(X_titanic)
print(f"Transformed shape: {X_simple.shape}")
print(f"Feature names: {preprocessor_simple.get_feature_names_out()}")
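
The generated step names are the lowercased class names of the transformers. A quick way to check them, sketched here with illustrative column names and without fitting anything:

```python
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names are placeholders; nothing is fitted in this sketch
auto_named = make_column_transformer(
    (StandardScaler(), ["age", "fare"]),
    (OneHotEncoder(), ["sex"]),
)

# Each entry of .transformers is a (name, transformer, columns) tuple
step_names = [name for name, _, _ in auto_named.transformers]
print(step_names)
```

These names matter later: they appear as prefixes in `get_feature_names_out()` and in parameter names during hyperparameter search.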

### Using `remainder` Parameter

- `remainder='drop'`: Drop untransformed columns (default)
- `remainder='passthrough'`: Keep untransformed columns as-is

In [None]:
# Compare remainder options
preprocessor_drop = ColumnTransformer(
    [('num', StandardScaler(), numeric_cols)],
    remainder='drop'
)

preprocessor_pass = ColumnTransformer(
    [('num', StandardScaler(), numeric_cols)],
    remainder='passthrough'
)

X_drop = preprocessor_drop.fit_transform(X_titanic)
X_pass = preprocessor_pass.fit_transform(X_titanic)

print("Remainder Comparison:")
print(f"  Original: {X_titanic.shape[1]} columns")
print(f"  remainder='drop': {X_drop.shape[1]} columns (only scaled numeric)")
print(f"  remainder='passthrough': {X_pass.shape[1]} columns (numeric + original categorical)")

### Using Column Selectors (Advanced)

Select columns by data type automatically:

In [None]:
from sklearn.compose import make_column_selector

# Automatic column selection by dtype.
# fetch_openml typically returns 'sex' and 'embarked' as pandas 'category' columns,
# so the categorical selector includes both 'object' and 'category'.
# Note that the integer-coded 'pclass' is picked up as numeric here.
cat_dtypes = ['object', 'category']

preprocessor_auto = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
        ('cat', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), 
         make_column_selector(dtype_include=cat_dtypes))
    ]
)

X_auto = preprocessor_auto.fit_transform(X_titanic)

print("Automatic Column Selection:")
print(f"  Input shape: {X_titanic.shape}")
print(f"  Output shape: {X_auto.shape}")
print(f"\n  Numeric columns (auto-detected): {X_titanic.select_dtypes(include=np.number).columns.tolist()}")
print(f"  Categorical columns (auto-detected): {X_titanic.select_dtypes(include=cat_dtypes).columns.tolist()}")
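
A selector returned by `make_column_selector` is just a callable that maps a DataFrame to a list of column names, so it can be tried out on its own before it goes into a `ColumnTransformer`. The sketch below uses a made-up frame with one `object` column and one `category` column to show how the dtype filter changes the result.

```python
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with explicitly chosen dtypes (illustrative only)
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0],
    "name": pd.Series(["Ann", "Bob", "Cy"], dtype=object),
    "sex": pd.Series(["female", "male", "female"], dtype="category"),
})

# Calling a selector on a DataFrame returns the matching column names
only_object = make_column_selector(dtype_include=object)(toy)
object_or_category = make_column_selector(dtype_include=["object", "category"])(toy)
non_numeric = make_column_selector(dtype_exclude="number")(toy)

print(only_object)
print(object_or_category)
print(non_numeric)
```

Selecting on `object` alone skips `category` columns, so it is worth printing the selection once for any new dataset.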

## 7. Creating Custom Transformers

Build your own transformers for specialized preprocessing:

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

# Custom transformer: Log transformation
class LogTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, offset=1):
        self.offset = offset
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return np.log(X + self.offset)

# Custom transformer: Age binning
class AgeBinner(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_copy = X.copy()
        X_copy['age_group'] = pd.cut(
            X_copy['age'], 
            bins=[0, 18, 30, 50, 100],
            labels=['child', 'young', 'adult', 'senior']
        )
        return X_copy

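# Use custom transformers in pipeline.
# Missing values are imputed inside each branch: fillna(0) on the whole frame
# would write a number into the categorical columns and break the encoder.
custom_preprocessor = ColumnTransformer(
    transformers=[
        ('log_fare', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('log', LogTransformer())
        ]), ['fare']),
        ('standard_age', Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler', StandardScaler())
        ]), ['age']),
        ('cat', Pipeline([
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('encoder', OneHotEncoder(sparse_output=False))
        ]), ['sex', 'embarked'])
    ],
    remainder='drop'
)

# Test custom preprocessor
X_custom = custom_preprocessor.fit_transform(X_titanic)
print(f"Custom preprocessing output shape: {X_custom.shape}")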

## 8. FeatureUnion vs ColumnTransformer Comparison

### When to Use Each

| Scenario | Use |
|----------|-----|
| Different transformations on **different columns** | **ColumnTransformer** |
| Multiple transformations on **same columns** | **FeatureUnion** |
| Mixed data types (numeric, categorical, text) | **ColumnTransformer** |
| Extracting different text features (unigrams + bigrams) | **FeatureUnion** |
| Modern, clean syntax | **ColumnTransformer** |
| Need to select specific columns | **ColumnTransformer** |

### Example Comparison

In [None]:
# Sample text data
text_data = pd.DataFrame({
    'review': [
        'great product amazing quality',
        'terrible waste of money',
        'good value for price'
    ]
})

print("Scenario: Extract both word counts AND character n-grams from same text\n")

# FeatureUnion: Multiple transformations on SAME column
text_union = FeatureUnion([
    ('word_count', CountVectorizer()),
    ('char_ngrams', CountVectorizer(analyzer='char', ngram_range=(2, 3)))
])

X_text = text_union.fit_transform(text_data['review'])
print(f"FeatureUnion output: {X_text.shape}")
print("  → Combines word-level and character-level features from same text\n")

# ColumnTransformer: Different transformations on DIFFERENT columns
mixed_data = pd.DataFrame({
    'price': [100, 200, 150],
    'category': ['A', 'B', 'A'],
    'description': ['great product', 'terrible item', 'good value']
})

col_trans = ColumnTransformer([
    ('scale_price', StandardScaler(), ['price']),
    ('encode_cat', OneHotEncoder(sparse_output=False), ['category']),
    ('vectorize_text', CountVectorizer(), 'description')
])

X_mixed = col_trans.fit_transform(mixed_data)
print(f"ColumnTransformer output: {X_mixed.shape}")
print("  → Different preprocessing for each column type")
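
The two tools also compose: a `FeatureUnion` can be used as one branch of a `ColumnTransformer` when a single column needs several feature extractors. A minimal sketch, reusing the shape of the toy data above (the names `word_and_char` and `combined` are made up):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

# Toy data; values are illustrative only
toy = pd.DataFrame({
    "price": [100, 200, 150],
    "description": ["great product", "terrible item", "good value"],
})

# Two extractors applied to the SAME text column
word_and_char = FeatureUnion([
    ("word", CountVectorizer()),
    ("char", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
])

# The union is just another transformer from ColumnTransformer's point of view
combined = ColumnTransformer([
    ("scale_price", StandardScaler(), ["price"]),
    ("text", word_and_char, "description"),
])
X_combined = combined.fit_transform(toy)
print(X_combined.shape)
```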

## Best Practices

### 1. Always Use Pipelines
```python
# ✓ GOOD: Everything in one pipeline
pipeline = Pipeline([
    ('preprocessor', ColumnTransformer(...)),
    ('model', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)

# ✗ BAD: Manual preprocessing fitted before the split (data leakage!)
X_scaled = scaler.fit_transform(X)  # DON'T DO THIS: test rows influence the scaler
```
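
The leakage is easy to observe directly: a scaler fitted on all rows learns different statistics from one fitted on the training rows only. The sketch below uses synthetic numbers and hypothetical variable names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=50, scale=10, size=(100, 1))
X_train_demo, X_test_demo = train_test_split(X_demo, test_size=0.2, random_state=42)

leaky = StandardScaler().fit(X_demo)        # has seen the test rows
clean = StandardScaler().fit(X_train_demo)  # has seen training rows only

print(f"Mean learned from all rows:      {leaky.mean_[0]:.3f}")
print(f"Mean learned from training rows: {clean.mean_[0]:.3f}")
```

A `Pipeline` does the second variant automatically, including inside every cross-validation fold.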

### 2. Handle Missing Values
```python
Pipeline([
    ('imputer', SimpleImputer(strategy='median')),  # Always impute first
    ('scaler', StandardScaler())
])
```
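
A small sketch of what this two-step pipeline does to a column with a gap (the numbers are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One numeric column with a missing value
X_gap = np.array([[20.0], [np.nan], [40.0], [60.0]])

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
X_filled = num_pipe.fit_transform(X_gap)

# The imputer stores the value it learned for each column
print(num_pipe.named_steps["imputer"].statistics_)
print(X_filled.ravel())
```

The missing entry is replaced by the median of the observed values before scaling, so no NaN reaches the model.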

### 3. Use `handle_unknown='ignore'` for OneHotEncoder
```python
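# Unseen categories are encoded as all zeros instead of raising an error.
# With drop='first', that all-zero row looks the same as the dropped category.
OneHotEncoder(drop='first', handle_unknown='ignore')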
```

### 4. Set `sparse_output=False` When Needed
```python
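# ColumnTransformer can stack sparse and dense outputs itself (see sparse_threshold);
# request dense output when a later step cannot handle sparse matrices:
OneHotEncoder(sparse_output=False)  # Returns dense array (scikit-learn >= 1.2)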
```

### 5. Use `make_column_selector` for Dynamic Selection
```python
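# Include 'category' as well if your categorical columns use the pandas category dtype
ColumnTransformer([
    ('num', StandardScaler(), make_column_selector(dtype_include=np.number)),
    ('cat', OneHotEncoder(), make_column_selector(dtype_include=['object', 'category']))
])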
```

## Key Takeaways

### ColumnTransformer (Modern, Recommended)
- ✓ Apply different transformations to different columns
- ✓ Perfect for mixed data types
- ✓ Clean, intuitive syntax
- ✓ Integrates seamlessly with Pipeline
- ✓ Helps prevent data leakage when fitted inside a Pipeline

### FeatureUnion (Legacy, Specific Use Cases)
- ✓ Multiple transformations on same input
- ✓ Combining different feature extraction methods
- ✗ Less intuitive for mixed data types
- → Use ColumnTransformer instead for most cases

### Complete Workflow Template

```python
# 1. Define column groups
numeric_cols = ['age', 'income']
categorical_cols = ['gender', 'city']
text_cols = 'description'  # a single string, not a list: text vectorizers need 1D input

# 2. Create preprocessor
preprocessor = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]), numeric_cols),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(drop='first', handle_unknown='ignore'))
    ]), categorical_cols),
    ('text', TfidfVectorizer(max_features=100), text_cols)
])

# 3. Create full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])

# 4. Train and evaluate
pipeline.fit(X_train, y_train)
pipeline.score(X_test, y_test)
```

This approach:
- Prevents data leakage
- Makes code reproducible
- Enables easy hyperparameter tuning
- Simplifies deployment
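
As a sketch of the hyperparameter tuning point: every parameter inside the pipeline is reachable through a double-underscore path of step names, so preprocessing and model settings can be searched together. The data, column names, and grid values below are assumptions chosen to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data: the target depends on age only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(22, 65, 120).astype(float),
    "city": rng.choice(["Paris", "Rome", "Oslo"], 120),
})
target = (toy["age"] > 40).astype(int)

pipe = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ])),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Path pattern: <pipeline step>__<transformer name>__<inner step>__<parameter>
param_grid = {
    "preprocessor__num__imputer__strategy": ["mean", "median"],
    "classifier__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(toy, target)

print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Because the whole pipeline is refitted in each fold, the preprocessing choices are evaluated without leaking validation data.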