# 🧩 Genetic Feature Synthesis

The `GeneticFeatureSynthesis` class is the main interface for generating new symbolic features using **Genetic Feature Synthesis (GFS)**.

It acts as a scikit-learn-compatible transformer (`fit`, `transform`, `fit_transform`) that automatically evolves **interpretable mathematical expressions** from your input features and selects the best ones based on predictive performance.

---

## 🔧 Parameters

| Argument | Type | Description | Default Value |
|---|---|---|---|
| `num_features` | `int` | The number of best features to generate. Internally, `3 * num_features` programs are generated, and the best `num_features` are selected via Maximum Relevance Minimum Redundancy (mRMR). | `10` |
| `population_size` | `int` | The number of programs in each generation. A larger population increases the likelihood of finding a good solution but also increases computation time. | `50` |
| `max_generations` | `int` | The maximum number of generations to run. More generations can lead to better solutions but will take longer. | `25` |
| `tournament_size` | `int` | The size of the tournament for parent selection. A larger size increases the chance of selecting fitter programs but requires more computation. | `7` |
| `crossover_proba` | `float` (0-1) | The probability that crossover is applied between selected parents in each generation. | `0.75` |
| `parsimony_coefficient` | `float` | Controls the penalty for larger programs. Higher values encourage smaller, more interpretable features by penalizing complexity. This helps prevent "bloat". | `0.001` |
| `adaptive_parsimony` | `bool` | Whether to dynamically adjust the parsimony coefficient over generations based on the average program size. This can help manage complexity as evolution progresses. | `True` |
| `early_termination_iters` | `int` | If the best score does not improve for this number of consecutive generations, the algorithm will terminate early. | `15` |
| `functions` | `List[str]` or `List[SymbolicFunction]` | A list of functions to use when constructing symbolic programs. If `None`, all built-in functions are used. Function names given as strings must be among those returned by `list_symbolic_functions`. | `None` (all built-in) |
| `custom_functions` | `List[SymbolicFunction]` | A list of user-defined custom functions to include in the programs. Each function must be an instance of the `CustomSymbolicFunction` class. | `None` |
| `fitness_function` | `str` or `Callable` | The metric used to evaluate the fitness of programs (e.g., `"pearson"`, `"r2"`, `"log_loss"`). If `None`, the fitness function is inferred based on the target type (e.g., Pearson correlation for regression, Log Loss for classification). | `None` (inferred) |
| `return_all_features` | `bool` | If `True`, the `transform` method will return both the original input features and the newly synthesized features. If `False`, only the synthesized features will be returned. | `True` |
| `n_jobs` | `int` | The number of CPU cores to use for parallel processing. Set to `-1` to use all available cores. If `1`, computations run serially. | `-1` (all cores) |
| `show_progress_bar` | `bool` | Whether to display a progress bar during the evolutionary process. | `True` |
| `verbose` | `bool` | If `True`, additional information, such as generation progress and the best program found in each generation, will be printed. | `False` |
| `min_constant_val` | `float` | The minimum value for ephemeral random constants generated within programs. | `-10.0` |
| `max_constant_val` | `float` | The maximum value for ephemeral random constants generated within programs. | `10.0` |
| `include_constants` | `bool` | Whether to allow the generation of ephemeral random constants in the symbolic programs. If `False`, programs will only use input features and functions. | `True` |
| `optimize_constants` | `bool` | Whether to optimize the constant values within the generated programs using a numerical optimization routine. | `True` |
| `constant_optimization_maxiter` | `int` | The maximum number of iterations for the constant optimization process. | `100` |
| `const_prob` | `float` | The probability of generating a constant leaf node during program creation. | `0.15` |
| `stop_prob` | `float` | The probability of stopping the program generation when building new programs (influences program size and depth). | `0.8` |
| `max_depth` | `int` | The maximum allowed depth of the generated symbolic programs. Deeper programs can be more complex. | `3` |
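
To make the interplay between `tournament_size` and `parsimony_coefficient` more concrete, here is a minimal, self-contained sketch. It is not the library's implementation: the `Program` class, the "higher is better" fitness convention, and the linear size penalty (`raw_fitness - parsimony_coefficient * size`) are all assumptions chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Program:
    """Hypothetical stand-in for an evolved symbolic program."""
    formula: str
    raw_fitness: float  # assumed convention: higher is better
    size: int           # number of nodes in the expression tree


def penalized_fitness(program, parsimony_coefficient):
    # Assumed penalty form: subtract a cost proportional to program size.
    return program.raw_fitness - parsimony_coefficient * program.size


def tournament_select(population, tournament_size, parsimony_coefficient, rng):
    # Draw `tournament_size` distinct contestants and keep the fittest one.
    k = min(tournament_size, len(population))
    contestants = rng.sample(population, k)
    return max(contestants, key=lambda p: penalized_fitness(p, parsimony_coefficient))


rng = random.Random(0)
population = [
    Program("x0 * x1", raw_fitness=0.80, size=3),
    Program("log(abs(x0 * x1) + x2) - sqrt(abs(x3))", raw_fitness=0.81, size=10),
    Program("x2", raw_fitness=0.40, size=1),
]

# The tournament covers the whole population here, so the result is deterministic.
# Small penalty: the longer, slightly fitter program still wins.
winner_light = tournament_select(population, 3, parsimony_coefficient=0.001, rng=rng)
# Larger penalty: the shorter program overtakes it.
winner_heavy = tournament_select(population, 3, parsimony_coefficient=0.01, rng=rng)

print(winner_light.formula)
print(winner_heavy.formula)
```

Under these assumptions, raising the coefficient from `0.001` to `0.01` is enough to flip the winner from the ten-node program to the three-node one, which is the kind of pressure against "bloat" described in the table.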

---

## ⚙️ How it Works

1. **Initialization**:
   - A random population of symbolic programs is created from your input features and available symbolic functions.

2. **Evolution**:
   - Over multiple generations:
     - Each program is evaluated using a fitness function (e.g., correlation with the target)
     - Top programs are selected via tournament selection
     - Offspring are generated via crossover and mutation

3. **Selection (mRMR)**:
   - After evolution, the best programs are filtered using Maximum Relevance Minimum Redundancy (mRMR) to ensure that selected features are:
     - Highly predictive
     - Minimally redundant

4. **Transformation**:
   - The best programs are used to transform data in `.transform()` or `.fit_transform()`.
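
The mRMR selection step above can be sketched as a greedy loop. This is one common variant (relevance minus mean redundancy, both measured with absolute Pearson correlation), written in plain Python. The function names, the scoring rule, and the toy data are assumptions for illustration and may differ from what the library does internally.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def mrmr_select(candidates, target, k):
    """Greedily pick k feature names, trading relevance against redundancy."""
    relevance = {name: abs(pearson(col, target)) for name, col in candidates.items()}
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            # Mean absolute correlation with the features already chosen.
            redundancy = sum(
                abs(pearson(candidates[name], candidates[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: the target is driven by two independent signals.
target = [1, 2, 3, 4, 4, 5, 6, 7]
candidates = {
    "f_strong": [0, 0, 0, 1, 3, 3, 3, 3],        # most relevant
    "f_strong_copy": [0, 0, 0, 0, 3, 3, 3, 3],   # relevant, but nearly a duplicate
    "f_weak_but_new": [1, 2, 3, 4, 1, 2, 3, 4],  # less relevant, adds new information
}

selected = mrmr_select(candidates, target, k=2)
print(selected)  # ['f_strong', 'f_weak_but_new']
```

Ranking by relevance alone would keep both `f_strong` and its near-duplicate. The redundancy term is what lets the weaker but complementary feature take the second slot.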

---

## 🧪 Methods

### `.fit(X, y)`

Train the feature synthesizer on your dataset. Evolves symbolic expressions predictive of the target `y`.

### `.transform(X)`

Apply the selected symbolic formulas to new input data `X`. Returns the synthesized features, together with the original input features when `return_all_features=True`.

### `.fit_transform(X, y)`

Convenience method that runs `.fit()` and `.transform()` in one step.

### `.get_feature_info()`

Returns a `pandas.DataFrame` with the following columns:

- **name**: Auto-generated feature name
- **formula**: The final simplified symbolic formula
- **raw_formula**: The original (possibly unsimplified) formula
- **fitness**: The final score used for selection

### `.plot_history()`

Generates a line plot of:

- Best fitness per generation
- Parsimony coefficient (if adaptive)
- Early stopping indicator

---

## 📝 Example

```python
from featuristic import GeneticFeatureSynthesis
from featuristic.datasets import fetch_wine_dataset

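# Load the bundled wine dataset as features X and target y.
X, y = fetch_wine_dataset()

# Evolve 5 synthesized features over at most 30 generations.
gfs = GeneticFeatureSynthesis(num_features=5, max_generations=30)
X_new = gfs.fit_transform(X, y)

# Inspect the selected formulas, then plot the fitness history.
print(gfs.get_feature_info())
gfs.plot_history()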
```

---

## 🧠 Tips

- Keep `adaptive_parsimony=True` (the default) to automatically discourage overly complex features as the search progresses.
- Want transparency? Set `verbose=True` to print the best symbolic formula every generation.
- Need to restrict operations? Use `functions=["add", "log", "sqrt"]` to limit what expressions are allowed.
