# What is Featuristic?

![featurize_logo](_static/logo.png "Featurize")

**Featuristic** uses Genetic Feature Synthesis to perform **automated feature engineering and feature selection** to optimise your machine learning models and improve their predictions.

## Quickstart

Below is a simple example of using Featuristic to carry out automated feature engineering and feature selection on the well-known *cars* dataset (the UCI Auto MPG data).

Featuristic works in two steps:

1. The first step intelligently evolves new features via **Genetic Feature Synthesis**
2. A **Genetic Feature Selection** algorithm then finds the optimal subset of features within the new feature space, with the aim of maximizing the predictive ability whilst minimizing the number of features required
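The trade-off described in step 2 (predictive ability against the number of features) can be illustrated with a toy genetic search over feature subsets. This is a minimal sketch on made-up data, not Featuristic's implementation: the `fitness` and `evolve` functions, the penalty value, and the population settings are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 6 candidate features, only the first two drive the target.
X_toy = rng.normal(size=(200, 6))
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)


def fitness(mask, penalty=0.05):
    """Negative MAE of a least-squares fit, minus a penalty per selected feature."""
    if not mask.any():
        return -np.inf  # an empty subset cannot predict anything
    A = np.column_stack([X_toy[:, mask], np.ones(len(X_toy))])
    coef, *_ = np.linalg.lstsq(A, y_toy, rcond=None)
    mae = np.abs(A @ coef - y_toy).mean()
    return -mae - penalty * mask.sum()


def evolve(pop_size=20, generations=15, mutation_rate=0.1):
    """Evolve boolean feature masks with truncation selection, crossover and mutation."""
    n_features = X_toy.shape[1]
    pop = rng.random((pop_size, n_features)) < 0.5
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        order = np.argsort(scores)[::-1]           # best individuals first
        parents = pop[order[: pop_size // 2]]
        children = []
        while len(children) < pop_size - 1:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n_features)      # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_features) < mutation_rate
            children.append(child ^ flip)          # bit-flip mutation
        # Elitism: the best individual always survives unchanged.
        pop = np.vstack([pop[order[0]], *children])
    scores = np.array([fitness(ind) for ind in pop])
    return pop[np.argmax(scores)]


best_mask = evolve()
print(best_mask)
```

The per-feature penalty is what pushes the search towards small subsets: a noise feature barely lowers the error, so including it costs more than it gains.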

In [1]:
from ucimlrepo import fetch_ucirepo 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import featuristic as ft
import numpy as np

np.random.seed(8888)

### Load the Data

In [2]:
auto_mpg = fetch_ucirepo(id=9) 
  
X = auto_mpg.data.features 
y = auto_mpg.data.targets 

### Prepare the Data

The data contains a few *null* values, which we'll remove for simplicity.

In [3]:
rows_with_nulls = X.isnull().sum(axis=1)
X = X[rows_with_nulls == 0].reset_index(drop=True)
y = y[rows_with_nulls == 0]["mpg"].reset_index(drop=True)
  
X.head()

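| | displacement | cylinders | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|
| 0 | 307.0 | 8 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 350.0 | 8 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 318.0 | 8 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 304.0 | 8 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 302.0 | 8 | 140.0 | 3449 | 10.5 | 70 | 1 |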


In [4]:
y.head()

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64
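The null-removal step above relies on filtering `X` and `y` with the same row mask so they stay aligned. A minimal sketch of that pattern on a made-up frame (the `X_toy` and `y_toy` values are invented for illustration and are not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing horsepower value in the second row.
X_toy = pd.DataFrame(
    {"horsepower": [130.0, np.nan, 150.0, 95.0], "weight": [3504, 3693, 3436, 2372]}
)
y_toy = pd.DataFrame({"mpg": [18.0, 15.0, 18.0, 24.0]})

# True for rows where every feature is present.
complete = X_toy.notna().all(axis=1)

# Apply the same mask to both frames, then reset both indexes so they match.
X_clean = X_toy[complete].reset_index(drop=True)
y_clean = y_toy.loc[complete, "mpg"].reset_index(drop=True)

print(X_clean)
print(y_clean)
```

Computing the mask once, before either frame is filtered, avoids the mistake of filtering `y` with a mask built from an already-filtered `X`.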

### Genetic Feature Synthesis

Now we run the Genetic Feature Synthesis to evolve our new features. We've set the genetic algorithm to intelligently synthesise 10 new features for us, using a population of 100 individuals that are evolved for up to 50 generations, with early termination enabled via `early_termination_iters`.

In [6]:
synth = ft.GeneticFeatureSynthesis(
    num_features=10,
    population_size=100,
    crossover_proba=0.8,
    max_generations=50,
    parsimony_coefficient=0.02,
    early_termination_iters=15,
    n_jobs=-1,
)

X_new = synth.fit_transform(X, y)

X_new.head()

Creating new features...:  68%|█████████████████████████████████████████████████████████                           | 34/50 [00:07<00:03,  4.26it/s]
Pruning feature space...: 100%|███████████████████████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 340.04it/s]


| | feature_9 | feature_0 | feature_10 | feature_11 | feature_1 | feature_5 | feature_6 | feature_7 | feature_8 | feature_15 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10590.555697 | 607713.367291 | 10590.555697 | 10590.555697 | 607713.367291 | 612613.367291 | 612613.367291 | 612613.367291 | 612613.367291 | 617513.367291 |
| 1 | 4567.715181 | 253006.066337 | 4567.715181 | 4567.715181 | 253006.066337 | 257906.066337 | 257906.066337 | 257906.066337 | 257906.066337 | 262806.066337 |
| 2 | 10046.091888 | 681829.300749 | 10046.091888 | 10046.091888 | 681829.300749 | 686729.300749 | 686729.300749 | 686729.300749 | 686729.300749 | 691629.300749 |
| 3 | 10137.876699 | 734143.300749 | 10137.876699 | 10137.876699 | 734143.300749 | 739043.300749 | 739043.300749 | 739043.300749 | 739043.300749 | 743943.300749 |
| 4 | 10714.618422 | 713658.197814 | 10714.618422 | 10714.618422 | 713658.197814 | 718558.197814 | 718558.197814 | 718558.197814 | 718558.197814 | 723458.197814 |


We've now created our new features, but what do they mean? Since Featuristic's features are synthesised by applying mathematical expressions to the data, we can look at how each new feature was created. Let's take a look at the formula for the first feature.

In [15]:
info = synth.get_feature_info()
info["prog"].iloc[0]

'((((square(((acceleration - weight) + square(model_year))) + square(model_year)) / (cylinders + horsepower)) - (abs(displacement) + weight)) - cos(horsepower))'
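To make the formula concrete, it can be evaluated by hand on the first row of `X` shown earlier. The sketch below assumes that `square`, `abs` and `cos` in the printed expression correspond to NumPy's `np.square`, `np.abs` and `np.cos` (with `cos` taking radians); the `first_feature` helper is a hypothetical name introduced here and is not part of Featuristic.

```python
import numpy as np

# First row of X, copied from the X.head() output above.
row = {
    "displacement": 307.0,
    "cylinders": 8,
    "horsepower": 130.0,
    "weight": 3504,
    "acceleration": 12.0,
    "model_year": 70,
}


def first_feature(r):
    """Hand-written translation of the printed expression (assumed NumPy semantics)."""
    inner = np.square((r["acceleration"] - r["weight"]) + np.square(r["model_year"]))
    ratio = (inner + np.square(r["model_year"])) / (r["cylinders"] + r["horsepower"])
    return (ratio - (np.abs(r["displacement"]) + r["weight"])) - np.cos(r["horsepower"])


value = float(first_feature(row))
print(value)
```

Under these assumptions the result agrees with the 10590.555697 shown for `feature_9` in the first row of `X_new`, which suggests the first entry of `info` describes that column.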

### Feature Selection

### Define the Cost Function

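We set up a custom cost function that the Genetic Feature Selection algorithm uses to quantify how well a subset of features predicts the target. It returns the mean negative mean absolute error across three cross-validation folds, so values closer to zero are better.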

In [None]:
def cost_function(X, y):
    model = LinearRegression()
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    return scores.mean()
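Because scikit-learn's `neg_mean_absolute_error` scorer returns negated errors, a larger (less negative) score means a better fit, which is why `selection_bigger_is_better=True` is the matching setting. A small sketch on made-up data, reusing the same `cost_function` definition (the `X_demo` and `y_demo` arrays are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Made-up data: only the first column is related to the target.
X_demo = rng.normal(size=(120, 3))
y_demo = 4.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=120)


def cost_function(X, y):
    model = LinearRegression()
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    return scores.mean()


informative = cost_function(X_demo[:, [0]], y_demo)     # uses the useful column
noise_only = cost_function(X_demo[:, [1, 2]], y_demo)   # uses only noise columns

print(f"informative: {informative:.3f}, noise only: {noise_only:.3f}")
```

Both scores are negative, and the informative subset scores higher than the noise-only subset.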

In [None]:
features, feature_info = ft.featurize(
    X,
    y,
    selection_cost_func=cost_function,
    selection_bigger_is_better=True,
    n_jobs=-1,
    generate_parsimony_coefficient=0.025,
    selection_early_termination_iters=35
)

### The New Features

Let's print out our new features to see what was generated for us. You can see that `featurize` has kept three of the original features (`displacement`, `cylinders`, `origin`) alongside four of the features created via Genetic Feature Synthesis.

In [None]:
features.head()

In [None]:
feature_info

In [None]:
original = cost_function(X, y)
original

In [None]:
new = cost_function(features, y)
new

In [None]:
print(f"Old: {original}, New: {new}, Improvement: {round((1 - (new / original))* 100, 1)}%")
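The improvement formula in the last cell works on negative scores because the ratio `new / original` is positive whenever both scores share a sign. A worked example with made-up scores (the `improvement_pct` helper is a hypothetical name wrapping the same expression):

```python
def improvement_pct(original, new):
    """Same expression as the print statement above, as a reusable function."""
    return round((1 - (new / original)) * 100, 1)


# Made-up scores: error shrinks from 4.0 to 3.0 (scores are negated MAE).
print(improvement_pct(-4.0, -3.0))  # 25.0

# A worse new score yields a negative improvement.
print(improvement_pct(-4.0, -5.0))  # -25.0
```

The expression assumes `original` is non-zero and that both scores have the same sign.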