# Featuristic: Advanced Feature Engineering Pipeline

This notebook demonstrates an optimized `featuristic` pipeline. We improve the results by:
1. **Increasing Search Resources**: Larger population and more generations for deeper exploration.
2. **Cross-Validated Selection**: Using 5-fold cross-validation in the feature selector to ensure features generalize.
3. **Complexity Control**: Tuning the parsimony coefficient to trade expression complexity against the risk of overfitting.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, ConfusionMatrixDisplay
from featuristic import FeatureSynthesizer, FeatureSelector

# Set seed for reproducibility
np.random.seed(42)

## 1. Create a Challenging Dataset

We'll generate a synthetic classification dataset with 20 features, only 7 of which are informative, so that feature selection has irrelevant columns to discard.
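As a quick sanity check on the generator settings used in the next cell, the arithmetic below works out how many columns are pure noise and roughly how much label noise `flip_y` introduces. It relies on scikit-learn's documented behavior that the features left over after the informative, redundant, and repeated ones are random noise, and that `flip_y` is the fraction of samples whose class is assigned at random. The resulting estimate assumes two balanced classes, and the variable names are illustrative only.

```python
# Settings mirrored from the make_classification call in this notebook.
n_features = 20
n_informative = 7
n_redundant = 4
n_repeated = 0      # scikit-learn default, not set explicitly in the notebook
flip_y = 0.1
n_classes = 2       # scikit-learn default

# Columns that are neither informative, redundant, nor repeated are noise.
n_noise = n_features - n_informative - n_redundant - n_repeated

# flip_y assigns a random class to that fraction of samples. With balanced
# classes, a random draw lands on the wrong class (n_classes - 1) / n_classes
# of the time, so roughly this share of labels ends up incorrect.
expected_mislabeled = flip_y * (n_classes - 1) / n_classes

print(f"Noise features: {n_noise}")
print(f"Expected mislabeled fraction: {expected_mislabeled:.3f}")
```

Under these assumptions about 5% of labels are wrong, so in expectation even a model that recovers the noise-free labels perfectly would score about 0.95 on held-out data.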

In [None]:
X_raw, y = make_classification(
    n_samples=1000, 
    n_features=20, 
    n_informative=7, 
    n_redundant=4, 
    flip_y=0.1, 
    random_state=42
)

feature_names = [f"feature_{i}" for i in range(20)]
X = pd.DataFrame(X_raw, columns=feature_names)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"Training set shape: {X_train.shape}")

## 2. Baseline Model (Raw Features)

Standard Logistic Regression on the original 20 features.

In [None]:
baseline_model = LogisticRegression(max_iter=1000)
baseline_model.fit(X_train, y_train)
acc_baseline = accuracy_score(y_test, baseline_model.predict(X_test))
print(f"Baseline Accuracy: {acc_baseline:.4f}")

## 3. Advanced Feature Synthesis

We increase `population_size` and `generations` so the search evaluates more candidate expressions, and lower the `parsimony_coefficient` so that longer expressions are penalized less.
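To illustrate why the parsimony coefficient matters, here is a minimal sketch of additive parsimony pressure, a common convention in genetic programming. The formula `loss + coefficient * length` and the helper `penalized_fitness` are assumptions made for illustration; `featuristic` may compute its penalty differently. The two candidate expressions and their losses are made-up numbers.

```python
def penalized_fitness(raw_loss: float, expression_length: int, coefficient: float) -> float:
    """Lower is better: raw loss plus a penalty proportional to expression size."""
    return raw_loss + coefficient * expression_length

# Hypothetical candidates as (raw_loss, expression_length).
short_expr = (0.500, 5)    # simple, slightly worse fit
long_expr = (0.495, 30)    # complex, slightly better fit

for coefficient in (0.001, 0.0001):
    short_score = penalized_fitness(*short_expr, coefficient)
    long_score = penalized_fitness(*long_expr, coefficient)
    winner = "short" if short_score < long_score else "long"
    print(f"coefficient={coefficient}: short={short_score:.4f}, long={long_score:.4f}, winner={winner}")
```

With the larger coefficient the short expression wins; with the smaller one the long expression does, which is the behavior the lowered setting is meant to allow.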

In [None]:
synth = FeatureSynthesizer(
    n_features=25, 
    population_size=250,   # Increased from 100
    generations=250,        # Increased from 20
    parsimony_coefficient=0.0001, # Lowered from 0.001 to allow more complexity
    verbose=True, 
    random_state=42
)

# Create synthetic features
X_train_synth = synth.fit_transform(X_train, y_train)
X_test_synth = synth.transform(X_test)

# Combine with original features
X_train_combined = np.hstack([X_train, X_train_synth])
X_test_combined = np.hstack([X_test, X_test_synth])

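# Derive the synthetic column count from the data rather than hard-coding 25
n_synth = X_train_combined.shape[1] - len(feature_names)
combined_names = feature_names + [f"synth_{i}" for i in range(n_synth)]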
X_train_combined = pd.DataFrame(X_train_combined, columns=combined_names)
X_test_combined = pd.DataFrame(X_test_combined, columns=combined_names)

print(f"Combined features shape: {X_train_combined.shape}")

In [None]:
synth.plot_convergence()

In [None]:
for prog in synth.get_programs()[:5]:
    print(prog["expression"])

In [None]:
all_features_model = LogisticRegression(max_iter=1000)
all_features_model.fit(X_train_combined, y_train)
acc_all_features = accuracy_score(y_test, all_features_model.predict(X_test_combined))
print(f"All Features Accuracy: {acc_all_features:.4f}")

## 4. Robust Evolutionary Feature Selection

We optimize the selector by using **5-fold cross-validation** within the objective function. This helps prevent selecting features that only work well on the training split.
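For readers who want to see what 5-fold splitting does, the sketch below partitions sample indices into five contiguous folds using only the standard library. It is a simplified illustration: the helper `kfold_indices` is hypothetical, and scikit-learn's `cross_val_score` applies stratified folds for classifiers, which this sketch does not reproduce.

```python
def kfold_indices(n_samples: int, k: int):
    """Yield (train_idx, val_idx) pairs for k contiguous, non-shuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

# 800 matches the training set size after the 80/20 split of 1000 samples.
folds = list(kfold_indices(800, 5))
for i, (train_idx, val_idx) in enumerate(folds):
    print(f"fold {i}: train={len(train_idx)}, validation={len(val_idx)}")
```

Every sample is held out exactly once, so the averaged score reflects performance on data the model did not see during fitting.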

In [None]:
# Define a robust objective: minimize the mean cross-validated log loss
def objective(X_subset, y):
    model = LogisticRegression(max_iter=1000)
    # Use 5-fold cross-validation for a more stable estimate of performance
    scores = cross_val_score(
        model, X_subset, y, cv=5, scoring='neg_log_loss', n_jobs=-1
    )
    # 'neg_log_loss' returns negated losses, so flip the sign to get a value to minimize
    return -scores.mean()

selector = FeatureSelector(
    objective_function=objective, 
    population_size=50,  # Increased for better search
    max_generations=20,    # Increased for better search
    verbose=True, 
    random_state=42
)

X_train_final = selector.fit_transform(X_train_combined, y_train)
X_test_final = selector.transform(X_test_combined)

print(f"Selected {len(selector.selected_features_)} features: {selector.selected_features_}")

In [None]:
selector.plot_convergence()

## 5. Final Comparison

Compare the baseline with the advanced engineered pipeline.
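The final cell reports improvement relative to the baseline accuracy, which is not the same as the difference in percentage points. The sketch below separates the two; the helper names and the accuracy values 0.80 and 0.84 are illustrative, not results from this notebook.

```python
def absolute_gain_points(baseline: float, final: float) -> float:
    """Difference between two accuracies, in percentage points."""
    return (final - baseline) * 100

def relative_gain_percent(baseline: float, final: float) -> float:
    """Change relative to the baseline accuracy, in percent."""
    return (final - baseline) / baseline * 100

# Illustrative values only.
baseline_acc, final_acc = 0.80, 0.84
print(f"Absolute gain: {absolute_gain_points(baseline_acc, final_acc):.2f} points")
print(f"Relative gain: {relative_gain_percent(baseline_acc, final_acc):.2f}%")
```

Because accuracy is below 1, the relative figure is always the larger of the two in magnitude, so it helps to state which one is being quoted.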

In [None]:
final_model = LogisticRegression(max_iter=1000)
final_model.fit(X_train_final, y_train)

y_pred_final = final_model.predict(X_test_final)
acc_final = accuracy_score(y_test, y_pred_final)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

ConfusionMatrixDisplay.from_predictions(y_test, baseline_model.predict(X_test), ax=ax1, cmap='Blues')
ax1.set_title(f'Baseline (20 Features)\nAccuracy: {acc_baseline:.4f}')

ConfusionMatrixDisplay.from_predictions(y_test, y_pred_final, ax=ax2, cmap='Greens')
ax2.set_title(f'Advanced Pipeline ({len(selector.selected_features_)} Features)\nAccuracy: {acc_final:.4f}')

plt.tight_layout()
plt.show()

# Relative change with respect to the baseline accuracy (not percentage points)
improvement = (acc_final - acc_baseline) / acc_baseline * 100
print(f"Relative Accuracy Improvement over Baseline: {improvement:.2f}%")