# What is Featuristic?

![featuristic_logo](_static/logo.png "Featuristic")

**Featuristic** uses Genetic Algorithms to perform **automated feature engineering and feature selection** to optimise your machine learning models and improve their predictions.

## How Does Genetic Feature Synthesis Work?

Featuristic uses a form of [symbolic regression](https://en.wikipedia.org/wiki/Symbolic_regression) to intelligently generate interpretable mathematical formulae that are used to create new features from your dataset.

Featuristic starts by creating a random population of formulae built from standard mathematical operators such as `add`, `subtract`, `sin`, `tan` and `square`.

For example: `(abs(square(feature_1)) - feature_2) * feature_3`

The formulae that generate features that correlate most highly with the target variable are then selected and are recombined together using a genetic algorithm to produce offspring.

![Symbolic Regression Example](_static/symbolic_regression_example.png "Symbolic Regression Example")
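
To make the selection step concrete, here is a minimal sketch of scoring candidate formulae by how strongly their output correlates with the target. It assumes absolute Pearson correlation as the fitness measure and uses made-up data and candidate formulae; Featuristic's internal scoring may differ.

```python
import math
import random

rng = random.Random(0)
n = 200

# Toy data (hypothetical): the target is feature_1 squared minus feature_2, plus noise.
feature_1 = [rng.gauss(0, 1) for _ in range(n)]
feature_2 = [rng.gauss(0, 1) for _ in range(n)]
target = [a * a - b + rng.gauss(0, 0.1) for a, b in zip(feature_1, feature_2)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Candidate formulae, written as plain Python callables for illustration.
candidates = {
    "square(feature_1) - feature_2": lambda a, b: a * a - b,
    "feature_1 + feature_2": lambda a, b: a + b,
    "sin(feature_1)": lambda a, b: math.sin(a),
}

# Fitness of each candidate: absolute correlation of its output with the target.
scores = {}
for name, formula in candidates.items():
    values = [formula(a, b) for a, b in zip(feature_1, feature_2)]
    scores[name] = abs(pearson(values, target))

# Candidates sorted from fittest to least fit; the top ones would become parents.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

The candidate that matches the structure of the target ranks first, so it would be the most likely to be chosen for recombination.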

These offspring can also undergo point mutations, which change the formulae's operators at individual locations.

![Mutation Example](_static/mutation_example.png "Mutation Example")
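
A point mutation can be sketched as follows. This is an illustrative toy, not Featuristic's implementation: the nested-tuple representation, the operator pools, and the `point_mutate` helper are all assumptions made for the example.

```python
import random

# Hypothetical operator pools, grouped by arity so that a swap keeps the tree valid.
UNARY = ["abs", "square", "sin", "tan"]
BINARY = ["add", "subtract", "multiply"]


def point_mutate(node, rng, rate=0.2):
    """Return a copy of the formula tree with some operators swapped for same-arity ones."""
    if isinstance(node, str):  # leaf: a feature name, left unchanged
        return node
    op, *children = node
    if rng.random() < rate:
        pool = UNARY if len(children) == 1 else BINARY
        op = rng.choice([candidate for candidate in pool if candidate != op])
    return (op, *(point_mutate(child, rng, rate) for child in children))


# The formula (abs(square(feature_1)) - feature_2) * feature_3 as a nested tuple.
formula = (
    "multiply",
    ("subtract", ("abs", ("square", "feature_1")), "feature_2"),
    "feature_3",
)

mutant = point_mutate(formula, random.Random(1), rate=0.5)
print(mutant)
```

Only operators change; the shape of the tree and the features at its leaves stay the same, which is what distinguishes a point mutation from recombination.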

This process is then repeated for multiple generations to constantly evolve the population of formulae, with the goal of producing features that are highly correlated with the target variable.

## Quickstart

Below is a simple example of using Featuristic to carry out automated feature engineering and feature selection on the well-known `cars` dataset.

Featuristic works in two steps:

1. The first step is to intelligently evolve new features via **Genetic Feature Synthesis**
2. A **Genetic Feature Selection** algorithm then finds the optimal subset of features within the new feature space, with the aim of maximizing the predictive ability whilst minimizing the number of features required

In [11]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import featuristic as ft
import numpy as np

np.random.seed(8888)

print(ft.__version__)

0.1.1


### Load the Data

In [2]:
X, y = ft.fetch_cars_dataset()

X.head()

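|   | displacement | cylinders | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|
| 0 | 307.0 | 8 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 350.0 | 8 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 318.0 | 8 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 304.0 | 8 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 302.0 | 8 | 140.0 | 3449 | 10.5 | 70 | 1 |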


In [3]:
y.head()

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64

### Genetic Feature Synthesis

Now let's run the Genetic Feature Synthesis to automatically engineer new features from our dataset.

We've set the genetic algorithm to intelligently synthesise 10 new features for us, using a population comprising 100 individuals that are evolved over 50 generations. We also tell the algorithm to terminate early if it goes 15 generations without improving on the best feature found so far, and set `n_jobs` to -1 so that it runs in parallel across all the CPUs in our computer. We then call the `fit` function to run the Genetic Feature Synthesis.

In [4]:
synth = ft.GeneticFeatureSynthesis(
    num_features=10,
    population_size=100,
    max_generations=50,
    early_termination_iters=15,
    n_jobs=-1,
)

synth.fit(X, y)


Creating new features...:  40%|██████████████████████████████████                                                   | 20/50 [00:05<00:05,  5.17it/s]
Pruning feature space...: 100%|████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 228.34it/s]
Creating new features...:  40%|██████████████████████████████████                                                   | 20/50 [00:05<00:07,  3.77it/s]


Next, we call the `transform` function to output a dataframe containing our new features. If we had split our data into train and test sets, we would `fit` the `GeneticFeatureSynthesis` class on the training data only and then call `transform` on both the train and test sets to avoid data leakage.

We could also combine both the `fit` and `transform` functions into one step by calling `fit_transform` instead.

In [5]:
features = synth.transform(X)

features.head()

|   | feature_0 | feature_1 | feature_2 | feature_3 | feature_5 | feature_6 | feature_7 | feature_8 | feature_9 | feature_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1396.0 | 1396.0 | 1526.0 | -3249.249286 | -3360.0 | -3360.0 | -3360.0 | -3360.0 | -3360.0 | -3360.0 |
| 1 | 1207.0 | 1207.0 | 1372.0 | -3497.0 | -3560.75 | -3560.75 | -3560.75 | -3560.75 | -3560.75 | -3560.75 |
| 2 | 1464.0 | 1464.0 | 1614.0 | -3198.568728 | -3315.0 | -3315.0 | -3315.0 | -3315.0 | -3315.0 | -3315.0 |
| 3 | 1467.0 | 1467.0 | 1617.0 | -3173.196503 | -3289.0 | -3289.0 | -3289.0 | -3289.0 | -3289.0 | -3289.0 |
| 4 | 1451.0 | 1451.0 | 1591.0 | -3185.744002 | -3338.75 | -3338.75 | -3338.75 | -3338.75 | -3338.75 | -3338.75 |


Our new features have generic names. However, since Featuristic's features are all synthesised by applying mathematical expressions to the data, we can look at the formula used to create each feature. 

Let's take a look at the formulae for a few of our new features.

In [12]:
info = synth.get_feature_info()
print(info["formula"].iloc[2])

((abs(square(model_year)) - weight) + abs(horsepower))


In [13]:
info = synth.get_feature_info()
print(info["formula"].iloc[1])

(abs(-(square(model_year))) - weight)
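
Because the formulae are plain arithmetic, they can be checked by hand. The sketch below re-applies the two formulae printed above to the first five rows of the `X.head()` output and reproduces the `feature_1` and `feature_2` columns of `features.head()`. It assumes that row `i` of `info` describes the column `feature_i`, which the matching values support.

```python
# First five rows of the cars dataset, copied from the X.head() output above.
rows = [
    {"model_year": 70, "weight": 3504, "horsepower": 130.0},
    {"model_year": 70, "weight": 3693, "horsepower": 165.0},
    {"model_year": 70, "weight": 3436, "horsepower": 150.0},
    {"model_year": 70, "weight": 3433, "horsepower": 150.0},
    {"model_year": 70, "weight": 3449, "horsepower": 140.0},
]


def formula_1(row):
    # (abs(-(square(model_year))) - weight)
    return abs(-(row["model_year"] ** 2)) - row["weight"]


def formula_2(row):
    # ((abs(square(model_year)) - weight) + abs(horsepower))
    return (abs(row["model_year"] ** 2) - row["weight"]) + abs(row["horsepower"])


feature_1 = [formula_1(row) for row in rows]
feature_2 = [formula_2(row) for row in rows]

print(feature_1)  # [1396, 1207, 1464, 1467, 1451]
print(feature_2)  # [1526.0, 1372.0, 1614.0, 1617.0, 1591.0]
```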


### Feature Selection

### Define the Cost Function

We set up a custom cost function that the Genetic Feature Selection algorithm uses to quantify how well a subset of features predicts the target.

In [None]:
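def cost_function(X, y):
    # Score a feature subset with 3-fold cross-validated linear regression.
    model = LinearRegression()
    # scikit-learn negates the MAE, so larger (closer to zero) scores are better.
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    return scores.mean()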
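
To give a feel for how a genetic algorithm can use a score like this to choose a feature subset, here is a toy sketch. It is not Featuristic's implementation: the `USEFULNESS` values, the `PENALTY`, and the `score` and `evolve` helpers are hypothetical stand-ins, and a real run would score each subset with a cost function such as the one above.

```python
import random

# Hypothetical per-feature usefulness, standing in for a real model-based score.
USEFULNESS = [0.9, 0.05, 0.6, 0.02, 0.4, 0.01]
PENALTY = 0.1  # charged per selected feature, to favour small subsets


def score(mask):
    """Higher is better: total usefulness minus a penalty per feature kept."""
    kept = sum(u for u, keep in zip(USEFULNESS, mask) if keep)
    return kept - PENALTY * sum(mask)


def evolve(rng, population_size=20, generations=30, mutation_rate=0.1):
    n = len(USEFULNESS)
    # Each individual is a 0/1 mask saying which features are selected.
    population = [[rng.randint(0, 1) for _ in range(n)] for _ in range(population_size)]
    initial_best = max(population, key=score)
    for _ in range(generations):
        # Keep the better half unchanged as parents, so the best mask is never lost.
        parents = sorted(population, key=score, reverse=True)[: population_size // 2]
        children = []
        while len(parents) + len(children) < population_size:
            mum, dad = rng.sample(parents, 2)
            # Uniform crossover followed by independent bit-flip mutation.
            child = [rng.choice(pair) for pair in zip(mum, dad)]
            child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
            children.append(child)
        population = parents + children
    return initial_best, max(population, key=score)


initial_best, best = evolve(random.Random(0))
print(best, score(best))
```

Keeping the parents unchanged means the best score can never get worse from one generation to the next, while the per-feature penalty plays the role of "minimizing the number of features required".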

In [None]:
features, feature_info = ft.featurize(
    X,
    y,
    selection_cost_func=cost_function,
    selection_bigger_is_better=True,
    n_jobs=-1,
    generate_parsimony_coefficient=0.025,
    selection_early_termination_iters=35
)

### The New Features

Let's print out our new features to see what was generated for us. You can see that `featurize` has kept three of the original features (`displacement`, `cylinders` and `origin`) along with four of the features created via the Genetic Feature Synthesis.

In [None]:
features.head()

In [None]:
feature_info

In [None]:
original = cost_function(X, y)
original

In [None]:
new = cost_function(features, y)
new

In [None]:
print(f"Old: {original}, New: {new}, Improvement: {round((1 - (new / original)) * 100, 1)}%")
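
The improvement formula is worth a second look, because `neg_mean_absolute_error` makes both scores negative. Dividing one negative score by the other gives the ratio of the two mean absolute errors, so the result is the percentage reduction in error. The numbers below are hypothetical and chosen only to illustrate the arithmetic; they are not results from this notebook.

```python
# Hypothetical scores, for illustration only.
original = -4.0  # mean negated MAE using the original features
new = -3.0       # mean negated MAE using the selected features

# new / original == 3.0 / 4.0, the ratio of the two errors.
improvement = round((1 - (new / original)) * 100, 1)
print(f"Improvement: {improvement}%")  # Improvement: 25.0%
```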