# Binary classification with Genetic Programming

Classification in GP evolves programs in exactly the same way as regression. The only difference is that each program's numeric output is passed through a sigmoid function, which maps it to a class probability. In essence, a negative output gives a probability below 0.5 and predicts one class, while a positive output gives a probability above 0.5 and predicts the other.
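
To make this mapping concrete, here is a minimal sketch of how a raw program output could be turned into a probability and then into a class label. The `sigmoid` helper and the sample values in `raw_outputs` are illustrative assumptions and not part of the `gplearn` API.

```python
import numpy as np

def sigmoid(z):
    """Map raw program outputs to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs of an evolved program for three samples
raw_outputs = np.array([-2.0, 0.5, 3.0])

proba = sigmoid(raw_outputs)        # probability of the positive class
labels = (proba > 0.5).astype(int)  # negative output -> 0, positive output -> 1

print(proba)
print(labels)
```

The sign of the raw output alone decides the predicted class; the magnitude only controls how confident the prediction is.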

Here we consider the [Breast Cancer Wisconsin (Diagnostic) dataset](https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic), a well-known binary classification dataset included in the `scikit-learn` library. The task is to distinguish between benign and malignant breast tumors based on features extracted from biopsied breast cells.

We will leverage **Genetic Programming** (GP) to address this problem. Let us import some useful modules. If you are using `conda`, you can install `graphviz` with the following commands:

```
conda install graphviz
conda install python-graphviz
conda install pydot
```

In [324]:
import gplearn.genetic as gp
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
import graphviz

Now explore the dataset and perform some preprocessing, if necessary.

In [325]:
cancer = load_breast_cancer(as_frame=True)
df = cancer.frame
df.head()

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [1]:
# CODE HERE
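
As a starting point for the exploration step, the sketch below checks the table size, the class balance, and the presence of missing values. It is only one possible approach; the variable names `class_counts` and `n_missing` are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer(as_frame=True)
df = cancer.frame

# Size of the table: samples x (features + target)
print(df.shape)

# Class balance; in scikit-learn's encoding 0 = malignant, 1 = benign
class_counts = df["target"].value_counts()
print(class_counts)

# Total number of missing values across all columns
n_missing = int(df.isna().sum().sum())
print(n_missing)

# Summary statistics show that the features live on very different scales
print(df.describe().T[["mean", "std", "min", "max"]].head())
```

Since the dataset has no missing values, little preprocessing is strictly required. Whether to rescale the features is a design choice worth experimenting with.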

Split your dataset into training and validation/test sets.

In [336]:
# CODE HERE
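
One possible way to perform the split is sketched below with `train_test_split`. The 80/20 ratio, the stratification, and the `random_state` value are assumptions made for illustration; the names `X_train`, `X_val`, `y_train`, `y_val` and `feature_names` are chosen to match the ones used in the GP cell later on.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer(as_frame=True)
df = cancer.frame

# Separate the 30 feature columns from the target column
feature_names = list(cancer.feature_names)
X = df[feature_names].to_numpy()
y = df["target"].to_numpy()

# Hold out 20% of the samples for validation (assumed ratio);
# stratify=y keeps the benign/malignant proportions similar in both sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

print(X_train.shape, X_val.shape)
```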


Define a metric for validation; here we use the binary cross-entropy.

In [337]:
def binary_cross_entropy(y_true, y_pred):

    # CODE HERE

    pass
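
A minimal sketch of one way to fill in this function is given below. It computes the mean of `-(y * log(p) + (1 - y) * log(1 - p))` over all samples. The `eps` parameter is an added assumption: it clips the probabilities away from 0 and 1 so that the logarithm never receives zero.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities to [eps, 1 - eps] to avoid log(0)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(losses))

# A completely uninformative prediction of 0.5 costs log(2) per sample
print(binary_cross_entropy([1, 0], [0.5, 0.5]))
```

Lower values are better; confident wrong predictions are penalised much more heavily than hesitant ones.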

Run the actual GP algorithm and play with the parameters. What is the most accurate solution that you can find? Is it interpretable?
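
The GP cell expects several hyperparameters to be defined beforehand. The values below are only an assumed starting configuration to experiment with, not tuned or recommended settings; the variable names match the ones passed to `SymbolicClassifier`.

```python
# Assumed starting values; change them and observe the effect on the validation loss
population_size = 500          # number of programs in each generation
tournament_size = 20           # programs competing in each selection tournament
fset = ("add", "sub", "mul", "div")  # built-in gplearn function names
parsimony_coefficient = 0.001  # penalty on program length to limit bloat

p_crossover = 0.7
p_subtree_mutation = 0.1
p_hoist_mutation = 0.05
p_point_mutation = 0.1

# gplearn requires the four operator probabilities to sum to at most 1;
# the remainder is the probability of copying a program unchanged (reproduction)
p_total = p_crossover + p_subtree_mutation + p_hoist_mutation + p_point_mutation
print(p_total)
```

A smaller function set and a larger parsimony coefficient tend to favour shorter programs, which are usually easier to interpret.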

In [338]:
# Validation loss per generation (binary cross-entropy, despite the "mse" in the name)
mse_history = []
max_gen = 30
sc = gp.SymbolicClassifier(population_size=population_size,
                           tournament_size=tournament_size,
                           function_set=fset,
                           parsimony_coefficient=parsimony_coefficient,
                           p_crossover=p_crossover,
                           p_subtree_mutation=p_subtree_mutation,  # Probability of subtree mutation
                           p_hoist_mutation=p_hoist_mutation,  # Small probability of hoist mutation
                           p_point_mutation=p_point_mutation,  # Small probability of point mutation
                           generations=1,
                           random_state=1,
                           feature_names=feature_names,
                           warm_start=True)
# With warm_start=True each fit() call continues from the previous population,
# so raising `generations` by one evolves exactly one more generation.
for gen in range(1, max_gen + 1):
    sc.set_params(generations=gen)
    sc.fit(X_train, y_train)
    # Predicted probability of the positive class (label 1) on the validation set
    y_score = sc.predict_proba(X_val)[:, 1]
    mse_history.append(binary_cross_entropy(y_val, y_score))
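
To answer the question about accuracy, the predicted probabilities can be thresholded and compared with the true labels. The helper below is a hypothetical sketch (`accuracy_from_scores` and the 0.5 threshold are assumptions), shown here on made-up labels and scores.

```python
import numpy as np

def accuracy_from_scores(y_true, y_score, threshold=0.5):
    """Fraction of samples whose thresholded score matches the true label."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    return float(np.mean(y_pred == y_true))

# Hypothetical labels and scores, for illustration only
y_true_demo = np.array([1, 0, 1, 1])
y_score_demo = np.array([0.9, 0.2, 0.4, 0.7])

acc = accuracy_from_scores(y_true_demo, y_score_demo)
print(acc)
```

In the notebook, the same helper could be called as `accuracy_from_scores(y_val, y_score)` after the training loop. Printing the fitted program with `print(sc._program)` shows the evolved expression as text, which helps when judging its interpretability.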

Plot your results.

In [None]:
plt.plot(mse_history)
plt.ylabel('Validation Loss')
plt.xlabel('Generation')
plt.show()

In [None]:
# Export the best evolved program as a Graphviz tree and render it
dot_data = sc._program.export_graphviz()
graphviz.Source(dot_data)