# Solving supervised machine learning problems (from the perspective of inductive programming).
Inductive program synthesis (aka inductive programming) is a subfield of program synthesis that studies program generation from incomplete information, namely from examples of the program's desired input/output behavior. Genetic programming (GP) is one of the numerous approaches to inductive synthesis; it is characterized by searching the space of syntactically correct programs of a given programming language.

In the context of supervised machine learning (SML) problem-solving, one can define the task of a GP algorithm as the induction of a program/function from input/output examples that approximates the mapping $f: S \to R$ as well as possible, generally measured by the solution's generalization ability on previously unseen data.

Geometric Semantic Genetic Programming (GSGP) is a variant of GP where the standard crossover and mutation operators are replaced by the so-called Geometric Semantic Operators (GSOs).
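
To make the idea of GSOs concrete, the sketch below applies the operators as they are commonly defined in the GSGP literature, directly on output vectors (semantics): crossover returns $T_R \cdot T_1 + (1 - T_R) \cdot T_2$ and mutation returns $T + ms \cdot (T_{R1} - T_{R2})$, where the random programs' outputs are squashed into $[0, 1]$. It works on plain Python lists rather than on GPOL's trees, and every name in it (``gs_crossover``, ``gs_mutation``, ``logistic``) is hypothetical.

```python
import math
import random


def logistic(v):
    """Squashes a raw output into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def gs_crossover(s1, s2, r):
    """Geometric semantic crossover on semantics vectors.

    s1, s2: the parents' outputs on the training cases.
    r: a random program's outputs, already squashed into [0, 1].
    """
    return [ri * a + (1.0 - ri) * b for a, b, ri in zip(s1, s2, r)]


def gs_mutation(s, r1, r2, ms):
    """Geometric semantic mutation: perturbs s by ms * (r1 - r2)."""
    return [a + ms * (x - y) for a, x, y in zip(s, r1, r2)]


rng = random.Random(0)
parent_1 = [1.0, 4.0, -2.0]
parent_2 = [3.0, 0.0, 2.0]
# Stand-ins for the outputs of random trees (assumption: any bounded outputs would do)
r = [logistic(rng.uniform(-5, 5)) for _ in parent_1]
r1 = [logistic(rng.uniform(-5, 5)) for _ in parent_1]
r2 = [logistic(rng.uniform(-5, 5)) for _ in parent_1]

child_xo = gs_crossover(parent_1, parent_2, r)
child_mtn = gs_mutation(parent_1, r1, r2, ms=0.25)
print("Crossover offspring:", child_xo)
print("Mutation offspring:", child_mtn)
```

Each component of the crossover offspring lies between the corresponding components of its parents, and each component of the mutated offspring stays within ``ms`` of its parent; this is the geometric property that gives the operators their name.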

## SMLGS problem type.
Given the definitions above, and in order to enable the automatic induction of programs from input/output examples, we have conceptualized a module called ``inductive_programming``, which contains different problem types, materialized as classes. One of them, ``SMLGS``, a subclass of ``Problem``, supports SML problem-solving, more specifically symbolic regression and binary classification, by means of GSGP.


<img src="https://media.springernature.com/original/springer-static/image/chp%3A10.1007%2F978-3-319-44003-3_1/MediaObjects/393665_1_En_1_Fig6_HTML.gif" alt="Drawing" style="width: 300px;"/>

# 1. Create an instance of ``SMLGS``

Loads the necessary classes and functions.

In [1]:
# Imports utility libraries 
import os
import datetime
import pandas as pd
# Imports PyTorch
import torch
# Imports problems
from gpol.problems.inductive_programming import SMLGS
from gpol.utils.datasets import load_boston
from gpol.utils.utils import train_test_split, rmse
from gpol.utils.inductive_programming import function_map, prm_reconstruct_tree, _execute_tree, _get_tree_depth
# Imports metaheuristics 
from gpol.algorithms.genetic_algorithm import GSGP
# Imports operators
from gpol.operators.initializers import rhh, prm_full
from gpol.operators.selectors import prm_tournament
from gpol.operators.variators import prm_efficient_gs_xo, prm_efficient_gs_mtn

Creates an instance of the ``SMLGS`` problem. The search space (*S*) of an ``SMLGS`` problem instance consists of the following key-value pairs:
- ``"n_dims"`` is the number of input features (aka input dimensions) in the underlying ``SMLGS`` problem's instance;
- ``"function_set"`` is the set of primitive functions;
- ``"constant_set"`` is the set of constants to draw terminals from;
- ``"p_constants"`` is the probability of generating a constant when sampling a terminal; and
- ``"max_init_depth"`` is the trees' maximum depth during the initialization.
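
The role of ``"p_constants"`` can be illustrated with a minimal sketch of terminal sampling. It is not GPOL's initializer: the function ``sample_terminal``, the tuple encoding of terminals, and the value used for the number of features are assumptions made for illustration.

```python
import random


def sample_terminal(n_dims, constant_set, p_constants, rng):
    """Returns ("const", value) with probability p_constants, else ("feature", index)."""
    if rng.random() < p_constants:
        return ("const", rng.choice(constant_set))
    return ("feature", rng.randrange(n_dims))


rng = random.Random(0)
n_dims = 13  # hypothetical number of input features
constant_set = [-1.0, -0.5, 0.5, 1.0]
draws = [sample_terminal(n_dims, constant_set, 0.1, rng) for _ in range(10_000)]

# With p_constants = 0.1, roughly one terminal in ten should be a constant
share_constants = sum(kind == "const" for kind, _ in draws) / len(draws)
print(f"Share of constants among sampled terminals: {share_constants:.3f}")
```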

Besides the traditional triplet ``sspace``, ``ffunction`` and ``min_``, one has to provide the problem's instance with the input data (``X`` and ``y`` tensors), partitions' indexes (``train_indices`` and ``test_indices``), and the size of the batches (``batch_size``).

In [3]:
# Defines the processing device and the random state's seed
device, seed  = 'cpu', 0  # 'cuda' if torch.cuda.is_available() else 'cpu', 0
# Loads and allocates the data on the processing device
X, y = load_boston(X_y=True)
X = X.to(device)
y = y.to(device)
# Defines parameters for the data usage
batch_size, shuffle, p_test = 50, True, 0.3
# Performs train/test split
train_indices, test_indices = train_test_split(X=X, y=y, p_test=p_test, shuffle=shuffle, indices_only=True, seed=seed)
# Characterizes the program elements: function and constant sets
f_set = [function_map["add"], function_map["sub"], function_map["mul"], function_map["div"]]
c_set = torch.tensor([-1.0, -0.5, 0.5, 1.0], device=device)
# Creates the search space
sspace = {"n_dims": X.shape[1], "function_set": f_set, "p_constants": 0.1, "constant_set": c_set, "max_init_depth": 5}
# Creates problem's instance
pi = SMLGS(sspace=sspace, ffunction=rmse, X=X, y=y, train_indices=train_indices, test_indices=test_indices,
           batch_size=batch_size, min_=True)
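
The cell above asks ``train_test_split`` for indices only, so the data tensors stay untouched and the problem instance selects rows by index. The following sketch shows one way such an index-only split could work; ``split_indices`` is a hypothetical helper and the rounding rule is an assumption, not GPOL's implementation.

```python
import random


def split_indices(n, p_test, shuffle=True, seed=0):
    """Returns (train_indices, test_indices) for a dataset with n rows."""
    indices = list(range(n))
    if shuffle:
        # A seeded generator makes the partition reproducible
        random.Random(seed).shuffle(indices)
    n_test = int(round(n * p_test))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(n=20, p_test=0.3, seed=0)
print("Train indices:", train_idx)
print("Test indices:", test_idx)
```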

# 2. Parametrize the GSGP algorithm.

The cell below creates a dictionary, called ``pars``, to store GSGP-specific parameters. The mutation and crossover operators, ``prm_efficient_gs_mtn`` and ``prm_efficient_gs_xo`` respectively, were specifically designed to resemble the implementation proposed in *A C++ framework for geometric semantic genetic programming* by Castelli et al.

The tensor ``ms`` represents the steps of the GSM operator; if it is a single-valued tensor, then the mutation step is always the same and equals ``ms``; if it is a vector, then, at each call of the operator, the mutation step is selected at random from ``ms``. 
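
The rule for choosing the mutation step can be sketched in plain Python. The helpers ``mutation_steps`` and ``pick_step`` are hypothetical, and drawing uniformly from a multi-valued ``ms`` is an assumption; the text above only states that the step is selected at random.

```python
import random


def mutation_steps(to, by):
    """Evenly spaced steps by, 2*by, ..., to (mirrors torch.arange(by, to + by, by))."""
    n = int(round(to / by))
    return [by * (i + 1) for i in range(n)]


def pick_step(ms, rng):
    """Single value: always that step. Several values: one drawn at random."""
    return ms[0] if len(ms) == 1 else rng.choice(ms)


rng = random.Random(0)
ms = mutation_steps(to=5.0, by=0.25)
print("Number of candidate steps:", len(ms))
print("Fixed step:", pick_step([0.5], rng))
print("Random step:", pick_step(ms, rng))
```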

In [5]:
# Defines population's size
pop_size = 100
# Creates single trees' initializer for the GSOs
sp_init = prm_full(sspace)  
# Generates GSM's steps
to, by = 5.0, 0.25  
ms = torch.arange(by, to + by, by, device=device)
# Defines selection's pressure and mutation's probability
pars = {"pop_size": pop_size, "initializer": rhh, "selector": prm_tournament(pressure=0.1),
        "mutator": prm_efficient_gs_mtn(X, sp_init, ms), "crossover": prm_efficient_gs_xo(X, sp_init),
        "p_m": 0.3, "p_c": 0.7, "elitism": True, "reproduction": False}

# 3. Prepares the connection strings.
Following the implementation proposed in *A C++ framework for geometric semantic genetic programming* by Castelli et al., the initial population of trees and the intermediary random trees, generated throughout the evolution when the GSOs are applied, must be stored on disk. The cell below creates the necessary paths:
-  ``path`` is a connection string towards GSGP's main log-folder;
-  ``path_init_pop`` is a connection string towards initial population's repository;
-  ``path_rts`` is a connection string towards random trees' repository; and
-  ``path_hist`` is a connection string towards the history's file (a file that stores solutions' genealogy).

In [6]:
# Creates the experiment's label
experiment_label = "SMLGS"  # SML approached from the perspective of Inductive Programming using GSGP
# A single call to now() keeps all the timestamp's components consistent with one another
time_id = datetime.datetime.now().strftime("%Y-%m-%d_%H_%M_%S")
# Creates general path
path = os.path.join(os.getcwd(), experiment_label + "_" + time_id)
# Defines a connection string to store random trees
path_rts = os.path.join(path, "reconstruct", "rts")
os.makedirs(path_rts, exist_ok=True)
# Defines a connection string to store the initial population
path_init_pop = os.path.join(path, "reconstruct", "init_pop")
os.makedirs(path_init_pop, exist_ok=True)
# Creates a connection string towards the history's file
path_hist = os.path.join(path, "reconstruct", experiment_label + "_seed_" + str(seed) + "_history.csv")

# 4. Executes the experiment.

Defines the computational resources for the experiment: the number of iterations.

In [8]:
n_iter = 30

The code below creates an instance of type ``GSGP`` with the aforementioned parameters. At the end of the search, the method ``write_history`` stores the solutions' genealogy on disk.

Note that besides algorithm-specific parameters, the constructor of an instance of ``GSGP`` algorithm also receives:
-  ``path_init_pop`` is the path where the initial trees will be stored;
-  ``path_rts`` is the path where the random trees generated throughout the evolution will be stored;
-  ``seed`` is used to initialize a pseudorandom number generator; and
-  ``device`` is the specification of the processing device (either CPU or GPU).

The ``solve`` method has the same signature for all the search algorithms and, in this case, includes the following parameters: 
-  ``n_iter``: number of iterations to conduct the search;
-  ``tol``: minimum (training) fitness improvement required to continue the search. If the fitness fails to improve by at least ``tol`` for ``n_iter_tol`` consecutive iterations, the search is interrupted automatically;
-  ``n_iter_tol``: maximum number of consecutive iterations allowed without a ``tol`` improvement;
-  ``test_elite``: a flag indicating whether the best-so-far solution (𝑖) should be evaluated on the test partition (applies to SML-OPs);
-  ``verbose``: verbosity's detail level;
-  ``log``: log files' detail level (if any).
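
The interplay between ``tol`` and ``n_iter_tol`` can be sketched as follows. This is one plausible reading of the stopping rule for a minimization problem, not GPOL's actual code; the function ``iterations_until_stop`` and the fitness trace are hypothetical.

```python
def iterations_until_stop(fitness_per_iter, tol, n_iter_tol):
    """Returns how many iterations run before the tolerance rule stops the search."""
    best = fitness_per_iter[0]
    stalled = 0  # consecutive iterations with an improvement below tol
    for i, fit in enumerate(fitness_per_iter[1:], start=1):
        improvement = best - fit  # minimization: lower fitness is better
        stalled = 0 if improvement >= tol else stalled + 1
        best = min(best, fit)
        if stalled >= n_iter_tol:
            return i + 1
    return len(fitness_per_iter)


# Hypothetical best-so-far training fitness, one value per iteration
trace = [10.0, 9.0, 8.95, 8.94, 8.93, 8.0]
n_run = iterations_until_stop(trace, tol=0.1, n_iter_tol=3)
print("Iterations executed:", n_run)
```

In this trace the improvement stays below ``tol`` for three consecutive iterations, so the search stops before reaching the last value.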

In [10]:
isa = GSGP(pi=pi, path_init_pop=path_init_pop, path_rts=path_rts, seed=seed, device=device, **pars)
isa.solve(n_iter=n_iter, tol=0.1, n_iter_tol=5, test_elite=True, verbose=2, log=0)
isa.write_history(path_hist)
print("Algorithm: {}".format(isa.__name__))
print("Best solution's fitness: {:.3f}".format(isa.best_sol.fit))
print("Best solution's test fitness: {:.3f}".format(isa.best_sol.test_fit))

-------------------------------------------------------------------------------------------------------
           |                    Best solution                      |            Population            |
-------------------------------------------------------------------------------------------------------
Generation | Length   Fitness          Test Fitness         Timing | AVG Fitness           STD Fitness
-------------------------------------------------------------------------------------------------------
0          | 506      11.2839          12.8798               0.055 | 5.2491e+08            5.24537e+09
1          | 506      11.2839          13.2341               0.096 | 19.0067                   2.95073
2          | 506      10.121           10.4062               0.096 | 14.2705                   1.43619
3          | 506      10.121           10.7162               0.090 | 13.3954                     1.231
4          | 506      9.17288          8.26451               0.099 | 

# 5. Reconstructs the tree.
In order to reconstruct the tree, the user needs to:
1.  read the historical records (the solutions' genealogy);
2.  create a parametrized reconstruction function using ``prm_reconstruct_tree``, which receives the historical records as a ``pandas.DataFrame``, the paths towards the initial and random trees, and the processing device that was used to conduct the search;
3.  choose the solution to reconstruct (by specifying its index);
4.  call the reconstruction function, passing the index of the desired solution.

In [None]:
# Reads the history's file (the solutions' genealogy)
history = pd.read_csv(path_hist, index_col=0)
# Creates a reconstruction function
reconstructor = prm_reconstruct_tree(history, path_init_pop, path_rts, device)
# Chooses the most fit individual to reconstruct
start_idx = history["Fitness"].idxmin()
print("Starting index (chosen individual):", start_idx)
print("Individual's info:\n", history.loc[start_idx])
# Reconstructs the individual
ind = reconstructor(start_idx)
print("Automatically reconstructed individual's representation:\n", ind[0:30])

Starting index (chosen individual): 0_29_o2_xo_4760
Individual's info:
 Iter                      29
Operator           crossover
T1           0_28_o1_xo_4540
T2          0_28_o1_mtn_4566
Tr           0_29_rt_xo_4758
ms                        -1
Fitness              4.34947
Name: 0_29_o2_xo_4760, dtype: object
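
The record above shows why the genealogy is sufficient for reconstruction: each individual points to its parents (``T1``, ``T2``) and to the random tree used by the operator (``Tr``). The sketch below walks such a genealogy to list which stored trees a given individual depends on. It uses a plain dictionary with made-up identifiers instead of the real history file, and it assumes that any identifier absent from the history belongs to the initial population; it does not reproduce ``prm_reconstruct_tree``.

```python
def collect_stored_trees(history, idx):
    """Walks the genealogy of idx and returns (initial_trees, random_trees) it depends on."""
    initial, randoms, seen = set(), set(), set()
    stack = [idx]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        record = history.get(current)
        if record is None:
            # Assumption: individuals without a record come from the initial population
            initial.add(current)
            continue
        randoms.add(record["Tr"])
        for parent in (record["T1"], record["T2"]):
            if parent is not None:
                stack.append(parent)
    return initial, randoms


# Hypothetical genealogy: "a" and "b" are initial trees, "rt_*" are random trees
toy_history = {
    "c": {"Operator": "crossover", "T1": "a", "T2": "b", "Tr": "rt_1"},
    "d": {"Operator": "mutation", "T1": "c", "T2": None, "Tr": "rt_2"},
    "e": {"Operator": "crossover", "T1": "d", "T2": "a", "Tr": "rt_3"},
}
initial_trees, random_trees = collect_stored_trees(toy_history, "e")
print("Initial trees needed:", sorted(initial_trees))
print("Random trees needed:", sorted(random_trees))
```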


Prints the individual's length and depth.

In [None]:
print("Length", len(ind))
print("Depth", _get_tree_depth(ind))

Executes the reconstructed individual on the whole dataset. 

In [None]:
y_pred = _execute_tree(ind, X)
print("Individual's RMSE on X: ", rmse(y, y_pred))
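
For reference, the root mean squared error reported above is the square root of the mean squared difference between targets and predictions. The pure-Python function ``rmse_sketch`` below is a hypothetical stand-in that shows the arithmetic on lists; the ``rmse`` imported from ``gpol.utils.utils`` operates on tensors.

```python
import math


def rmse_sketch(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors of 3 and 4 give sqrt((9 + 16) / 2)
error = rmse_sketch([0.0, 0.0], [3.0, 4.0])
print("RMSE:", error)
```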