# What is Featurize?

![featurize_logo](_static/logo.png "Featurize")

**Featurize** uses Genetic Feature Synthesis to perform **automated feature engineering and feature selection** to optimise your machine learning models and improve their predictions.

## Quickstart

Below is a simple example of using Featurize to carry out automated feature engineering and feature selection on the well-known Auto MPG (*cars*) dataset from the UCI Machine Learning Repository.

Featurize works in three steps:

1. The first step is to intelligently evolve new features via **Genetic Feature Synthesis**
2. These new features are then filtered via a **Maximum Relevance Minimum Redundancy** (mRMR) algorithm to find the features that correlate most strongly with the target whilst minimizing their correlation with each other
3. A **Genetic Feature Selection** algorithm then finds the optimal subset of features within the new feature space, with the aim of maximizing the predictive ability whilst minimizing the number of features required
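
To make step 2 more concrete, here is a minimal sketch of a greedy Maximum Relevance Minimum Redundancy filter in plain Python. It is not Featurize's implementation: the helper names `pearson` and `mrmr_select`, the use of absolute Pearson correlation, and the "relevance minus mean redundancy" score are all assumptions chosen for illustration, and the toy data is made up.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_a * var_b)


def mrmr_select(columns, target, k):
    """Greedily pick k column names, trading relevance against redundancy."""
    # Relevance: absolute correlation of each column with the target.
    relevance = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    selected = []
    remaining = list(columns)
    while remaining and len(selected) < k:

        def score(name):
            if not selected:
                return relevance[name]
            # Redundancy: mean absolute correlation with already-picked columns.
            redundancy = sum(
                abs(pearson(columns[name], columns[s])) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): "b" is an exact rescaling of "a", so it is redundant.
toy = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
}
toy_target = [1.1, 2.3, 2.9, 4.2, 4.8]

picked = mrmr_select(toy, toy_target, k=2)
print(picked)
```

In this toy case the filter picks `a` first and then prefers `c` over `b`: although `b` is as relevant as `a`, it adds no new information once `a` has been chosen.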

In [1]:
from ucimlrepo import fetch_ucirepo 
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
import featurize as ft
import numpy as np

np.random.seed(8888)

### Load the Data

In [2]:
auto_mpg = fetch_ucirepo(id=9) 
  
X = auto_mpg.data.features 
y = auto_mpg.data.targets 

### Prepare the Data

The data contains a few *null* values, which we'll remove for simplicity.

In [3]:
rows_with_nulls = X.isnull().sum(axis=1)
X = X[rows_with_nulls == 0].reset_index(drop=True)
y = y[rows_with_nulls == 0]["mpg"].reset_index(drop=True)
  
X.head()

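| | displacement | cylinders | horsepower | weight | acceleration | model_year | origin |
|---|---|---|---|---|---|---|---|
| 0 | 307.0 | 8 | 130.0 | 3504 | 12.0 | 70 | 1 |
| 1 | 350.0 | 8 | 165.0 | 3693 | 11.5 | 70 | 1 |
| 2 | 318.0 | 8 | 150.0 | 3436 | 11.0 | 70 | 1 |
| 3 | 304.0 | 8 | 150.0 | 3433 | 12.0 | 70 | 1 |
| 4 | 302.0 | 8 | 140.0 | 3449 | 10.5 | 70 | 1 |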


In [4]:
y.head()

0    18.0
1    15.0
2    18.0
3    16.0
4    17.0
Name: mpg, dtype: float64

### Define the Cost Function

We set up a custom cost function that the Genetic Feature Selection algorithm uses to quantify how well a subset of features predicts the target.

In [5]:
def cost_function(X, y):
    """Return the mean negated MAE over 3-fold cross-validation (higher is better)."""
    model = LinearRegression()
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_absolute_error")
    return scores.mean()
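
The cost function returns a *negated* mean absolute error, so a value closer to zero means a better fit; this is why `selection_bigger_is_better=True` is passed to Featurize in this example. The sketch below illustrates the sign convention in plain Python. The helper `negated_mae` and both prediction lists are hypothetical; only the `y_true` values come from the `y.head()` output shown earlier.

```python
def negated_mae(y_true, y_pred):
    """Negated mean absolute error: closer to zero means a better fit."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return -sum(errors) / len(errors)


# First five mpg values from y.head(); the predictions are invented.
y_true = [18.0, 15.0, 18.0, 16.0, 17.0]
close_preds = [17.0, 16.0, 18.0, 15.0, 17.0]
far_preds = [12.0, 20.0, 25.0, 10.0, 22.0]

close_score = negated_mae(y_true, close_preds)
far_score = negated_mae(y_true, far_preds)

# The better predictions get the larger (less negative) score.
print(close_score, far_score, close_score > far_score)
```

If a cost function returned a plain error instead (for example an un-negated MAE), smaller would be better and the flag would presumably need to be flipped accordingly.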

### Genetic Feature Synthesis

Now we run Featurize on our data to evolve and select new features via Genetic Feature Synthesis.

In [6]:
features, feature_info = ft.featurize(
    X,
    y,
    selection_cost_func=cost_function,
    selection_bigger_is_better=True,
    n_jobs=-1,
    generate_parsimony_coefficient=0.025,
    selection_early_termination_iters=35
)

Creating new features...:  50%|██████████████████████████████████                                  | 15/30 [00:03<00:02,  5.35it/s]
Pruning feature space...: 100%|███████████████████████████████████████████████████████████████████| 10/10 [00:00<00:00, 255.58it/s]
Creating new features...:  50%|██████████████████████████████████                                  | 15/30 [00:04<00:04,  3.62it/s]
Optimising feature selection...:  72%|███████████████████████████████████████████▏                | 72/100 [00:05<00:02, 12.74it/s]


### The New Features

Let's print the new features to see what was generated for us. Featurize has kept three of the original features ("displacement", "cylinders", "origin") along with four of the features created via Genetic Feature Synthesis.

In [7]:
features.head()

| | displacement | cylinders | origin | feature_2 | feature_3 | feature_6 | feature_9 |
|---|---|---|---|---|---|---|---|
| 0 | 307.0 | 8 | 1 | 1526.0 | -3249.249286 | -3360.0 | -3360.0 |
| 1 | 350.0 | 8 | 1 | 1372.0 | -3497.0 | -3560.75 | -3560.75 |
| 2 | 318.0 | 8 | 1 | 1614.0 | -3198.568728 | -3315.0 | -3315.0 |
| 3 | 304.0 | 8 | 1 | 1617.0 | -3173.196503 | -3289.0 | -3289.0 |
| 4 | 302.0 | 8 | 1 | 1591.0 | -3185.744002 | -3338.75 | -3338.75 |


In [8]:
feature_info

| | name | prog | fitness |
|---|---|---|---|
| 0 | feature_0 | `(abs(abs(square(model_year))) - weight)` | -0.88089 |
| 1 | feature_1 | `(abs(-(square(model_year))) - weight)` | -0.88089 |
| 2 | feature_2 | `((abs(square(model_year)) - weight) + abs(hors...` | -0.87094 |
| 3 | feature_3 | `((square(abs(square(model_year))) / square(dis...` | -0.850201 |
| 4 | feature_5 | `(abs(abs(square(-(acceleration)))) - weight)` | -0.822736 |
| 5 | feature_6 | `(abs(abs(square(abs(abs(acceleration))))) - we...` | -0.822736 |
| 6 | feature_7 | `(abs(abs(square(abs(abs(acceleration))))) - we...` | -0.822736 |
| 7 | feature_8 | `(abs(abs(abs(abs(square(abs(abs(acceleration))...` | -0.822736 |
| 8 | feature_9 | `(abs(-(abs(abs(abs(square(abs(abs(acceleration...` | -0.822736 |
| 9 | feature_10 | `(abs(abs(square(acceleration))) - weight)` | -0.822736 |
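
Each `prog` string is an ordinary arithmetic expression over the original columns, so it can be checked by hand. The sketch below transcribes the program listed for `feature_10` and evaluates it on the first five rows from `X.head()`. It assumes `square(x)` means `x * x`; the Python function names are illustrative and not part of Featurize's API.

```python
def square(x):
    # Assumed meaning of the `square` primitive in the program strings.
    return x * x


def feature_10(acceleration, weight):
    # Direct transcription of the `prog` string listed for feature_10.
    return abs(abs(square(acceleration))) - weight


# (acceleration, weight) for the first five rows, copied from X.head().
rows = [(12.0, 3504), (11.5, 3693), (11.0, 3436), (12.0, 3433), (10.5, 3449)]

values = [feature_10(acc, w) for acc, w in rows]
print(values)  # [-3360.0, -3560.75, -3315.0, -3289.0, -3338.75]
```

These values coincide with the `feature_6` and `feature_9` columns in the `features.head()` output. Their program strings are truncated in the table, so treating them as equivalent to `feature_10` is an inference from the matching values and identical fitness, not something the output states directly.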


In [9]:
original = cost_function(X, y)
original

-3.344987965680754

In [10]:
new = cost_function(features, y)
new

-2.4899437940808404

In [11]:
print(f"Old: {original}, New: {new}, Improvement: {round((1 - (new / original))* 100, 1)}%")

Old: -3.344987965680754, New: -2.4899437940808404, Improvement: 25.6%
