# Classification and regression with genepro
A Scikit-learn compatible classifier and regressor is already provided in genepro. This notebook show how they can be used.


## Classification
At the moment, genepro supports binary classification. The reason why multi-class is not supported is that the output of a tree is its root, hence a multi-tree representation is required to realize multi-class classification. Here's how binary classification can be performed:

In [1]:
import sympy
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from genepro.scikit import GeneProClassifier
from genepro.node_impl import *

# Let's load the Breast Cancer data set from sklearn
X, y = load_breast_cancer(return_X_y=True)

# Create a train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply feature normalization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Set up what nodes genepro should use
internal_nodes = [Plus(), Minus(), Times(), Div(), Log()]
# As leaf nodes, let's set up the possibility to use each feature, plus a constant
# (this is the default if leaf_nodes are not provided)
num_features = X_train.shape[1]
leaf_nodes = [Feature(i) for i in range(num_features)] + [Constant()]

# Set up classifier
gp = GeneProClassifier(score=balanced_accuracy_score, 
  evo_kwargs={'internal_nodes':internal_nodes, 'leaf_nodes':leaf_nodes, 'verbose':True, 'pop_size': 128, 'max_gens':40, 'max_tree_size':35, 'n_jobs':1, })

# Run
gp.fit(X_train, y_train)

# Get test (balanced) accuracy
test_acc = balanced_accuracy_score(y_test, gp.predict(X_test))
print("The balanced accuracy on the test set is {:.3f}".format(test_acc))
# Get the best-found tree (at the last generation) and simplify it
best_tree_repr = sympy.simplify(gp.evo.best_of_gens[-1].get_readable_repr())
print("Obtained by the (simplified) model:", best_tree_repr)

gen: 1,	best of gen fitness: 0.803,	best of gen size: 26


TypeError: Singleton array array(-1) cannot be considered a valid collection.

## Regression


In [1]:
from sklearn.datasets import load_diabetes
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from genepro.scikit import GeneProRegressor
from genepro.node_impl import *
from genepro.util import compute_linear_scaling

# Let's load the Diabetes data set from sklearn
X, y = load_diabetes(return_X_y=True)

# Create a train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Apply feature normalization
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Do not pass a score, internal_nodes nor leaf_nodes: defaults will be used 

# Set up regressor
use_linear_scaling = True
gp = GeneProRegressor(use_linear_scaling=use_linear_scaling, # linear scaling applies a linear layer to the prediction (intercept + slope*prediction) 
  evo_kwargs={'verbose': True, 'pop_size': 128, 'max_gens': 40, 'max_tree_size': 50, 'n_jobs': 4, })

# Run
gp.fit(X_train, y_train)

# Get test mean squared error
test_mse = mean_squared_error(y_test, gp.predict(X_test))
print("The mean squared error on the test set is {:.3f} (respective R^2 score is {:.3f})".format(
  test_mse, 1 - test_mse/np.var(y_train)))
# Get the best-found tree (at the last generation)
best_tree = gp.evo.best_of_gens[-1]
best_tree_repr = sympy.simplify(best_tree.get_readable_repr())
if use_linear_scaling:
  # Linear scaling effectively changes the model, so we should incorporate that in the expression
  # (recall that linear scaling is computed w.r.t. the training set during the evolution)
  slope, intercept = compute_linear_scaling(y_train, best_tree(X_train))
  best_tree_repr = "{:.3f} + {:.3f}*({})".format(intercept, slope, best_tree_repr)
print("Obtained by the (simplified) model:", best_tree_repr)

gen: 1,	best of gen fitness: -14339.709,	best of gen size: 3
gen: 2,	best of gen fitness: -20511.920,	best of gen size: 3
gen: 3,	best of gen fitness: -17692.709,	best of gen size: 3
gen: 4,	best of gen fitness: -6280.806,	best of gen size: 3
gen: 5,	best of gen fitness: -6291.878,	best of gen size: 3
gen: 6,	best of gen fitness: -6291.878,	best of gen size: 3
gen: 7,	best of gen fitness: -6291.878,	best of gen size: 3
gen: 8,	best of gen fitness: -6291.878,	best of gen size: 3
gen: 9,	best of gen fitness: -6202.854,	best of gen size: 5
gen: 10,	best of gen fitness: -6079.380,	best of gen size: 3
gen: 11,	best of gen fitness: -6101.827,	best of gen size: 50
gen: 12,	best of gen fitness: -6067.914,	best of gen size: 50
gen: 13,	best of gen fitness: -5924.788,	best of gen size: 20
gen: 14,	best of gen fitness: -5723.698,	best of gen size: 20
gen: 15,	best of gen fitness: -5547.235,	best of gen size: 18
gen: 16,	best of gen fitness: -5547.235,	best of gen size: 18
gen: 17,	best of gen fit

NameError: name 'sympy' is not defined