# What is Featuristic?

![featuristic_logo](_static/logo.png "Featuristic")

Featuristic is an automated feature engineering library based on Evolutionary Feature Synthesis (EFS). It leverages symbolic regression, genetic programming, and information-theoretic pruning to discover high-quality features from raw data without human-crafted heuristics.

The features generated by EFS are model-agnostic and can be used with any downstream estimator (e.g., XGBoost, SVM, Random Forests).

For both classification and regression problems, Featuristic builds new, interpretable features by combining mathematical operations with evolutionary search.

---

## 🧠 Introduction to Evolutionary Feature Synthesis (EFS)

**Evolutionary Feature Synthesis (EFS)** is a symbolic feature engineering framework that automatically discovers high-value, interpretable features using genetic programming and symbolic regression.

EFS generates mathematical expressions (e.g. `log(abs(feature_1 - feature_2)) * sin(feature_3)`) that capture nonlinear patterns and interactions in your data, without manual intervention.

It balances **predictive power** with **interpretability**, helping you build transparent models and understand the transformations behind your features.

---

## What is Feature Synthesis?

Feature synthesis is the process of automatically generating **new features** from existing raw inputs to improve model performance. This typically involves:

- Mathematical transformations (e.g., log, square root)
- Combinations of variables (e.g., ratios, differences, products)
- Interaction terms (e.g., `feature_1 * sin(feature_2)`)
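
To make the three categories above concrete, here is a minimal hand-written sketch in plain Python. The rows, column names, and the `synthesize` helper are hypothetical and are not part of Featuristic; they only show the kind of derived columns that feature synthesis produces.

```python
import math

# Hypothetical toy rows; the column names are illustrative only.
rows = [
    {"feature_1": 4.0, "feature_2": 2.0},
    {"feature_1": 9.0, "feature_2": 3.0},
]


def synthesize(row):
    """Return a copy of the row with a few hand-written derived features."""
    f1, f2 = row["feature_1"], row["feature_2"]
    return {
        **row,
        # Mathematical transformations
        "log_f1": math.log(f1),
        "sqrt_f1": math.sqrt(f1),
        # Combinations of variables
        "ratio_f1_f2": f1 / f2,
        "diff_f1_f2": f1 - f2,
        # Interaction term
        "f1_times_sin_f2": f1 * math.sin(f2),
    }


augmented = [synthesize(row) for row in rows]
```

Writing each of these by hand scales poorly as the number of columns and candidate operations grows, which is the gap automated synthesis aims to close.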

Where traditional approaches rely on manual feature crafting or opaque embeddings, **EFS** builds these transformations **automatically** using **symbolic programs** evolved through **genetic programming**.

---

## Why Symbolic Regression for Feature Engineering?

Symbolic regression is a form of regression that searches the space of **symbolic expressions** to model relationships between inputs and outputs.

Unlike linear models, which fix the functional form in advance, or neural networks, which are hard to inspect, symbolic regression:

- Produces **closed-form equations** you can read and interpret
- Captures **nonlinear** and **combinatorial** relationships
- Can work with small-to-medium tabular datasets
- Is more **transparent** than "black-box" feature generators

EFS leverages symbolic regression not to model the target directly, but to **synthesize new input features** that make your final model more effective.

---

## Core Concepts

### 🧮 Symbolic Programs

EFS represents each candidate feature as a **tree-structured symbolic program**, where:
- Leaf nodes are input features (e.g., `feature_1`)
- Internal nodes are symbolic functions (e.g., `log`, `sin`, `+`, `*`)
- The output is a mathematical expression, like:
  
  ```text
  log(abs(feature_1 - feature_2)) * sin(feature_3)
  ```

Each program is stored as a tree of nested dictionaries and evaluated directly on your dataset.
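
As a rough illustration of what a dictionary-based program tree could look like, the sketch below encodes the expression above and evaluates it recursively on one row. The node layout (`"feature"`, `"op"`, `"children"`) and the `evaluate` and `render` helpers are assumptions made for this example; Featuristic's internal schema may differ.

```python
import math

# Hypothetical operator table; the real library likely offers more functions.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "abs": abs,
    "sin": math.sin,
    "log": math.log,
}


def evaluate(node, row):
    """Recursively evaluate a program tree on a single row of data."""
    if "feature" in node:  # leaf node: look up the input feature
        return row[node["feature"]]
    args = [evaluate(child, row) for child in node["children"]]
    return OPS[node["op"]](*args)


def render(node):
    """Render the tree as a prefix-style expression string."""
    if "feature" in node:
        return node["feature"]
    return f"{node['op']}({', '.join(render(c) for c in node['children'])})"


# log(abs(feature_1 - feature_2)) * sin(feature_3)
program = {
    "op": "mul",
    "children": [
        {"op": "log", "children": [
            {"op": "abs", "children": [
                {"op": "sub", "children": [
                    {"feature": "feature_1"},
                    {"feature": "feature_2"},
                ]},
            ]},
        ]},
        {"op": "sin", "children": [{"feature": "feature_3"}]},
    ],
}

row = {"feature_1": 5.0, "feature_2": 2.0, "feature_3": 0.5}
value = evaluate(program, row)
```

Note that this toy `log` raises an error when `feature_1` equals `feature_2`; a practical implementation would need some guard for such inputs.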

---

### 🧬 Genetic Programming

EFS evolves its population of symbolic programs using genetic programming, an evolutionary algorithm inspired by natural selection.

In each generation, EFS:

- Evaluates each program on the data using a fitness function (e.g., correlation with the target)
- Selects the best-performing programs via tournament selection
- Applies genetic operators to the selected programs:
  - Crossover: combines subtrees from two parents
  - Mutation: replaces random nodes with new subtrees
- Forms a new generation and repeats

This process allows symbolic programs to evolve toward more predictive and concise forms over time.
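
The loop described above can be sketched in a few dozen lines of plain Python. This is a toy under stated assumptions, not Featuristic's implementation: the operator set, the node layout, the `MAX_NODES` size cap, the 70/30 crossover/mutation split, and the use of elitism (carrying the best program forward unchanged) are all choices made for this example.

```python
import copy
import math
import random

UNARY = {"abs": abs, "sin": math.sin}
BINARY = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b,
          "mul": lambda a, b: a * b}
OPS = {**UNARY, **BINARY}
MAX_NODES = 40  # hypothetical size cap that keeps the toy trees small


def random_tree(features, depth, rng):
    """Grow a random program: leaves are features, internal nodes are ops."""
    if depth == 0 or rng.random() < 0.3:
        return {"feature": rng.choice(features)}
    op = rng.choice(sorted(OPS))
    arity = 1 if op in UNARY else 2
    return {"op": op,
            "children": [random_tree(features, depth - 1, rng) for _ in range(arity)]}


def evaluate(node, row):
    if "feature" in node:
        return row[node["feature"]]
    return OPS[node["op"]](*(evaluate(c, row) for c in node["children"]))


def nodes(tree):
    """List every node in the tree so one can be picked at random."""
    found = [tree]
    for child in tree.get("children", []):
        found.extend(nodes(child))
    return found


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx) * math.sqrt(syy)
    return sxy / denom if denom else 0.0


def fitness(tree, rows, target):
    """Absolute Pearson correlation with the target; 0.0 if evaluation fails."""
    try:
        score = abs(pearson([evaluate(tree, r) for r in rows], target))
    except (ValueError, OverflowError):
        return 0.0
    return score if math.isfinite(score) else 0.0


def tournament(population, scores, rng, k=3):
    """Pick k random programs and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: scores[i])]


def graft(tree, donor, rng):
    """Copy tree and overwrite one random node with a copy of donor."""
    child = copy.deepcopy(tree)
    spot = rng.choice(nodes(child))
    spot.clear()
    spot.update(copy.deepcopy(donor))
    return child if len(nodes(child)) <= MAX_NODES else tree


def evolve(rows, target, features, generations=8, pop_size=30, seed=0):
    rng = random.Random(seed)
    population = [random_tree(features, 3, rng) for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        scores = [fitness(p, rows, target) for p in population]
        best = max(range(pop_size), key=lambda i: scores[i])
        history.append(scores[best])
        next_gen = [population[best]]  # elitism (an assumption of this sketch)
        while len(next_gen) < pop_size:
            parent = tournament(population, scores, rng)
            if rng.random() < 0.7:  # crossover: borrow a subtree from a mate
                mate = tournament(population, scores, rng)
                child = graft(parent, rng.choice(nodes(mate)), rng)
            else:  # mutation: replace a random node with a fresh subtree
                child = graft(parent, random_tree(features, 2, rng), rng)
            next_gen.append(child)
        population = next_gen
    return population[0], history


# Toy data: the target is the product of two inputs.
data_rng = random.Random(1)
rows = [{"x1": data_rng.uniform(-2, 2), "x2": data_rng.uniform(-2, 2)}
        for _ in range(50)]
target = [r["x1"] * r["x2"] for r in rows]
best_program, history = evolve(rows, target, ["x1", "x2"])
```

Because the best program is copied into each new generation, the values in `history` never decrease. Without elitism, a good program can be lost to an unlucky round of crossover or mutation.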

---

### ✂️ Parsimony and Overfitting

Symbolic programs can easily grow in size without improving fitness, a problem known as bloat.

To combat this, EFS uses a parsimony coefficient, which penalizes large programs during fitness evaluation:

```text
fitness = raw_score / (program_size ** parsimony_coefficient)
```

This helps ensure that features remain **interpretable**, **efficient**, and **less prone to overfitting**.
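
A small worked example shows how the penalty changes a ranking. The `penalized_fitness` function below simply restates the formula above; the scores, program sizes, and coefficient are made-up values chosen for illustration.

```python
def penalized_fitness(raw_score, program_size, parsimony_coefficient):
    """Divide the raw score by program_size raised to the parsimony coefficient."""
    return raw_score / (program_size ** parsimony_coefficient)


# Hypothetical candidates: the larger program scores slightly higher before the penalty.
small = penalized_fitness(raw_score=0.80, program_size=5, parsimony_coefficient=0.5)
large = penalized_fitness(raw_score=0.82, program_size=25, parsimony_coefficient=0.5)

# With a coefficient of 0 the penalty disappears and the raw score is returned.
unpenalized = penalized_fitness(raw_score=0.82, program_size=25, parsimony_coefficient=0.0)
```

Here the 25-node program has the better raw score, but after the penalty the 5-node program ranks higher, so selection favors the more compact expression.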

---

### 🔍 Maximum Relevance, Minimum Redundancy (mRMR)

After several generations, EFS may have hundreds of candidate features. To select the best ones, it applies **Maximum Relevance Minimum Redundancy (mRMR)**:

- **Relevance**: How strongly a feature correlates with the target
- **Redundancy**: How much a feature overlaps with other features

mRMR selects a subset of features that are both:

- **Highly predictive**
- **Diverse** (i.e., not just duplicates of each other)

This ensures your final feature set is both powerful and complementary.
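
The selection idea can be sketched as a greedy loop that repeatedly picks the feature with the best relevance minus average redundancy. This is a simplified variant under stated assumptions: it uses absolute Pearson correlation for both terms, and the `mrmr` and `pearson` helpers and the toy columns are hypothetical. Featuristic's own scoring may differ.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    denom = math.sqrt(sxx) * math.sqrt(syy)
    return sxy / denom if denom else 0.0


def mrmr(features, target, k):
    """Greedily pick k feature names, trading relevance against redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in features.items()}
    selected, remaining = [], list(features)

    def score(name):
        if not selected:  # first pick: relevance only
            return relevance[name]
        redundancy = sum(abs(pearson(features[name], features[s])) for s in selected)
        return relevance[name] - redundancy / len(selected)

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy columns: "a_scaled" is a rescaled duplicate of "a"; "b" is uncorrelated with "a".
a = [0, 0, 0, 0, 1, 1, 1, 1]
b = [0, 1, 0, 1, 0, 1, 0, 1]
features = {"a": a, "a_scaled": [3 * v + 1 for v in a], "b": b}
target = [2 * x + y for x, y in zip(a, b)]

chosen = mrmr(features, target, k=2)
```

Ranking by relevance alone would keep both `a` and `a_scaled`, since they correlate equally well with the target. The redundancy term makes the second greedy step prefer `b`, which adds information the first pick does not already carry.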
