Everything can be done with the TPOTEstimator class. All other classes (TPOTRegressor, TPOTClassifier, TPOTSymbolicClassifier, TPOTSymbolicRegression, TPOTGeneticFeatureSetSelector, etc.) are actually just different default settings for TPOTEstimator.


By default, TPOT will generate pipelines with a default set of classifiers or regressors as roots (depending on whether `classification` is set to True or False). All other nodes are selected from a default list of selectors and transformers. Note: This differs from the TPOT1 behavior, where by default classifiers and regressors can appear in locations other than the root. You can modify the search space for leaves, inner nodes, and roots (final classifiers or regressors) separately through built-in options or custom configuration dictionaries.

In this tutorial, we will walk through using the built-in configurations, creating custom configurations, and using nested configurations.

# ConfigSpace

Hyperparameter search spaces are defined using the [ConfigSpace package found here](https://github.com/automl/ConfigSpace). More information on how to set up a hyperparameter space can be found in their [documentation here](https://automl.github.io/ConfigSpace/main/guide.html).

In [1]:
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.neighbors import KNeighborsClassifier

knn_configspace = ConfigurationSpace(
    space = {

        'n_neighbors': (1, 10),
        'weights': Categorical("weights", ['uniform', 'distance']),
        'p': (1, 3),
        'metric': Categorical("metric", ['euclidean', 'minkowski']),
        'n_jobs': 1,
    }
)

hyperparameters = dict(knn_configspace.sample_configuration())
print("sampled hyperparameters")
print(hyperparameters)

knn = KNeighborsClassifier(**hyperparameters)

sampled hyperparameters
{'metric': 'euclidean', 'n_jobs': 1, 'n_neighbors': 1, 'p': 1, 'weights': 'uniform'}


# TPOT Search spaces

TPOT allows you to define both hyperparameter search spaces for individual methods and pipeline structure search spaces. For example, TPOT can create linear pipelines, trees, or graphs.

TPOT search spaces are found in the `search_spaces` module. There are two primary kinds of search spaces: node and pipeline. Node search spaces specify the search space of a single sklearn `BaseEstimator`. Pipeline search spaces define the possible structures for a group of node search spaces. They take in node search spaces and produce a pipeline using nodes from those search spaces. Since sklearn Pipelines are also `BaseEstimator` instances, pipeline search spaces are technically node search spaces as well. This means that pipeline search spaces can take in other pipeline search spaces in order to define more complex structures. The primary difference between node and pipeline search spaces is that a pipeline search space must take in another search space as input to supply its individual nodes. Therefore, all search spaces eventually end in a node search space at the lowest level. Note that the parameters of pipeline search spaces differ: some take in only a single search space, some take in a list, and some take in multiple separately defined parameters.

Search spaces can be found in `tpot2.search_spaces.nodes` and `tpot2.search_spaces.pipelines`.

### node search spaces
Found in `tpot2.search_spaces.nodes`.


| Name      | Info       |
| :---        |    :----:   |
| EstimatorNode | Takes in a ConfigSpace along with the class of the method. This node will optimize the hyperparameters for a single method. |
| GeneticFeatureSelectorNode | Uses evolution to optimize a set of features, exports a basic sklearn Selector that simply selects the features chosen by the node. |




### pipeline search spaces

Found in `tpot2.search_spaces.pipelines`.



| Name      | Info       |
| :---        |    :----:   |
| ChoicePipeline | Takes in a list of search spaces. Will select one node from the search space. |
| SequentialPipeline | Takes in a list of search spaces. Will produce a pipeline of fixed length, equal to the number of search spaces provided. Each step in the pipeline corresponds to the search space provided at the same index. |
| DynamicLinearPipeline | Takes in a single search space. Will produce a linear pipeline of variable length. Each step in the pipeline will be pulled from the search space provided. |
| TreePipeline |Generates a pipeline of variable length. Pipeline will have a tree structure similar to TPOT1. |
| GraphPipeline | Generates a directed acyclic graph of variable size. Search spaces for root, leaf, and inner nodes can be defined separately if desired. |
| WrapperPipeline   | This search space is for wrapping a sklearn estimator with a method that takes another estimator and hyperparameters as arguments. For example, this can be used with sklearn.ensemble.BaggingClassifier or sklearn.ensemble.AdaBoostClassifier.        |

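The way pipeline search spaces are built from other search spaces can be illustrated with a small toy model. The sketch below is not TPOT2 code: `ToyNode`, `ToyChoice`, and `ToySequential` are hypothetical stand-ins that only mimic the idea that every search space exposes a `generate()` method and that pipeline search spaces take other search spaces as input.

```python
import random


class ToyNode:
    """Hypothetical stand-in for a node search space: one method, one integer range."""

    def __init__(self, name, low, high):
        self.name, self.low, self.high = name, low, high

    def generate(self, rng):
        # Sample a single hyperparameter value for this method.
        return (self.name, rng.randint(self.low, self.high))


class ToyChoice:
    """Hypothetical stand-in for a choice search space: picks one of its inputs."""

    def __init__(self, search_spaces):
        self.search_spaces = search_spaces

    def generate(self, rng):
        return rng.choice(self.search_spaces).generate(rng)


class ToySequential:
    """Hypothetical stand-in for a sequential search space: one sample per position."""

    def __init__(self, search_spaces):
        self.search_spaces = search_spaces

    def generate(self, rng):
        return [space.generate(rng) for space in self.search_spaces]


toy_rng = random.Random(0)
toy_classifier = ToyChoice([ToyNode("knn", 1, 10), ToyNode("tree", 1, 11)])
# A pipeline search space can take another pipeline search space as one of its steps.
toy_pipeline_space = ToySequential([ToyNode("select_k", 1, 5), toy_classifier])
toy_sample = toy_pipeline_space.generate(toy_rng)
print(toy_sample)
```

Every branch of the toy structure bottoms out in a `ToyNode`, mirroring the point that all search spaces eventually end in a node search space.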

# Estimator node example

In [2]:
import tpot2
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.neighbors import KNeighborsClassifier

knn_configspace = ConfigurationSpace(
    space = {

        'n_neighbors': Integer("n_neighbors", bounds=(1, 10)),
        'weights': Categorical("weights", ['uniform', 'distance']),
        'p': Integer("p", bounds=(1, 3)),
        'metric': Categorical("metric", ['euclidean', 'minkowski']),
        'n_jobs': 1,
    }
)


knn_node = tpot2.search_spaces.nodes.EstimatorNode(
    method = KNeighborsClassifier,
    space = knn_configspace,
)



You can generate an individual with the `generate()` function. This individual is a sample from the search space and also provides mutation and crossover functions to modify the current sample.

Note that ConfigurationSpace does not support None as a parameter. Instead, use the special string "\<NONE\>". TPOT will automatically replace instances of this string with the Python None.

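The effect of the `"<NONE>"` replacement can be sketched in plain Python. This is only an illustration of the described behavior, not TPOT's internal implementation; the helper name `replace_none_sentinel` and the sample dictionary are hypothetical.

```python
NONE_SENTINEL = "<NONE>"


def replace_none_sentinel(hyperparameters):
    """Return a copy of the dict with the "<NONE>" string replaced by Python None."""
    return {
        key: (None if isinstance(value, str) and value == NONE_SENTINEL else value)
        for key, value in hyperparameters.items()
    }


# Hypothetical sampled configuration in which one value stands for None.
sampled = {"max_depth": "<NONE>", "criterion": "gini", "min_samples_split": 2}
cleaned = replace_none_sentinel(sampled)
print(cleaned)  # {'max_depth': None, 'criterion': 'gini', 'min_samples_split': 2}
```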
In [3]:
knn_individual = knn_node.generate()

print("sampled hyperparameters")
print(knn_individual.hyperparameters)
knn_individual.mutate() # mutate the individual
print("mutated hyperparameters")
print(knn_individual.hyperparameters)

sampled hyperparameters
{'metric': 'minkowski', 'n_jobs': 1, 'n_neighbors': 9, 'p': 2, 'weights': 'distance'}
mutated hyperparameters
{'metric': 'minkowski', 'n_jobs': 1, 'n_neighbors': 9, 'p': 3, 'weights': 'uniform'}


In TPOT2, crossover only modifies the individual calling the crossover function; the second individual remains unchanged.

In [4]:
knn_individual1 = knn_node.generate()
knn_individual2 = knn_node.generate()

print("original hyperparameters for individual 1")
print(knn_individual1.hyperparameters)

print("original hyperparameters for individual 2")
print(knn_individual2.hyperparameters)

print()

knn_individual1.crossover(knn_individual2) # crossover the individuals
print("post crossover hyperparameters for individual 1")
print(knn_individual1.hyperparameters)
print("post crossover hyperparameters for individual 2")
print(knn_individual2.hyperparameters)



original hyperparameters for individual 1
{'metric': 'euclidean', 'n_jobs': 1, 'n_neighbors': 1, 'p': 1, 'weights': 'distance'}
original hyperparameters for individual 2
{'metric': 'minkowski', 'n_jobs': 1, 'n_neighbors': 2, 'p': 3, 'weights': 'distance'}

post crossover hyperparameters for individual 1
{'metric': 'minkowski', 'n_jobs': 1, 'n_neighbors': 2, 'p': 1, 'weights': 'distance'}
post crossover hyperparameters for individual 2
{'metric': 'minkowski', 'n_jobs': 1, 'n_neighbors': 2, 'p': 3, 'weights': 'distance'}

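The one-directional behavior seen in the output above can be sketched with plain dictionaries. This is a toy model under stated assumptions: the function name `crossover_in_place` and the rule of copying each value with 50% probability are hypothetical, and TPOT2's actual crossover operator may work differently.

```python
import random


def crossover_in_place(receiver, donor, rng):
    """Copy a random subset of the donor's values into the receiver.

    Only the receiver is modified; the donor is read but never written to.
    """
    for key in receiver:
        if key in donor and rng.random() < 0.5:
            receiver[key] = donor[key]


toy_rng = random.Random(0)
ind1 = {"n_neighbors": 1, "p": 1, "weights": "distance"}
ind2 = {"n_neighbors": 2, "p": 3, "weights": "uniform"}
ind1_before = dict(ind1)
ind2_before = dict(ind2)

crossover_in_place(ind1, ind2, toy_rng)
print("individual 1 after crossover:", ind1)
print("individual 2 after crossover:", ind2)
```

After the call, every value in `ind1` comes from one of the two parents, while `ind2` is identical to its state before the call.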

All individuals generated from search spaces have an `export_pipeline` function that returns an sklearn `BaseEstimator`.

In [5]:
knn_individual1.export_pipeline()

If a dictionary of parameters is passed instead of a ConfigSpace, then the hyperparameters will be fixed and not learned.

In [6]:
import tpot2
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.neighbors import KNeighborsClassifier

space = {

    'n_neighbors':10,
}

knn_node = tpot2.search_spaces.nodes.EstimatorNode(
    method = KNeighborsClassifier,
    space = space,
)

knn_node.generate().export_pipeline()

# Pipeline Search Spaces

## choice search space

The simplest pipeline search space is the ChoicePipeline. This takes in a list of search spaces and simply selects and samples from one. In this example, we will construct a search space that takes in several options for a classifier.

In [7]:
import tpot2
from ConfigSpace import ConfigurationSpace, Integer, Float, Categorical, Normal
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

knn_configspace = ConfigurationSpace(
    space = {

        'n_neighbors': Integer("n_neighbors", bounds=(1, 10)),
        'weights': Categorical("weights", ['uniform', 'distance']),
        'p': Integer("p", bounds=(1, 3)),
        'metric': Categorical("metric", ['euclidean', 'minkowski']),
        'n_jobs': 1,
    }
)

lr_configspace = ConfigurationSpace(
        space = {
            'solver': Categorical("solver", ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']),
            'penalty': Categorical("penalty", ['l1', 'l2']),
            'dual': Categorical("dual", [True, False]),
            'C': Float("C", bounds=(1e-4, 1e4), log=True),
            'class_weight': Categorical("class_weight", ['balanced']),
            'n_jobs': 1,
            'max_iter': 1000,
        }
    )

dt_configspace = ConfigurationSpace(
        space = {
            'criterion': Categorical("criterion", ['gini', 'entropy']),
            'max_depth': Integer("max_depth", bounds=(1, 11)),
            'min_samples_split': Integer("min_samples_split", bounds=(2, 21)),
            'min_samples_leaf': Integer("min_samples_leaf", bounds=(1, 21)),
            'max_features': Categorical("max_features", ['sqrt', 'log2']),
            'min_weight_fraction_leaf': 0.0,
        }
    )

knn_node = tpot2.search_spaces.nodes.EstimatorNode(
    method = KNeighborsClassifier,
    space = knn_configspace,
)

lr_node = tpot2.search_spaces.nodes.EstimatorNode(
    method = LogisticRegression,
    space = lr_configspace,
)

dt_node = tpot2.search_spaces.nodes.EstimatorNode(
    method = DecisionTreeClassifier,
    space = dt_configspace,
)

classifier_node = tpot2.search_spaces.pipelines.ChoicePipeline(
    search_spaces=[
        knn_node,
        lr_node,
        dt_node,
    ]
)


tpot2.search_spaces.pipelines.ChoicePipeline(
    search_spaces = [
        tpot2.search_spaces.nodes.EstimatorNode(
            method = KNeighborsClassifier,
            space = knn_configspace,
            ),
        tpot2.search_spaces.nodes.EstimatorNode(
            method = LogisticRegression,
            space = lr_configspace,
        ),
        tpot2.search_spaces.nodes.EstimatorNode(
            method = DecisionTreeClassifier,
            space = dt_configspace,
        ),
    ]
)

<tpot2.search_spaces.pipelines.choice.ChoicePipeline at 0x3159dca00>

Individuals generated by pipeline search spaces work the same way as those generated by node search spaces. Note that crossover only works when both individuals have sampled the same method.

In [8]:
classifier_individual = classifier_node.generate()

print("sampled pipeline")
classifier_individual.export_pipeline()

sampled pipeline


In [9]:
print("mutated pipeline")
classifier_individual.mutate()
classifier_individual.export_pipeline()

mutated pipeline


TPOT2 also comes with predefined search spaces. The current search spaces were adapted from a combination of the original TPOT package and the search spaces used in [AutoSklearn](https://github.com/automl/auto-sklearn/tree/development/autosklearn/pipeline/components). The helper function `tpot2.config.get_search_space` takes in a string or a list of strings and returns either an EstimatorNode or a ChoicePipeline, respectively.

Strings can correspond to individual methods. There are also special strings that return predefined lists of methods.

Special strings include "selectors", "classifiers", and "transformers".

| Special String     | Included methods      |
| :---        |    :----:   |
| "selectors" | "SelectFwe", "SelectPercentile", "VarianceThreshold", "RFE", "SelectFromModel" |
| "classifiers" | "LogisticRegression", "KNeighborsClassifier", "DecisionTreeClassifier", "SVC", "LinearSVC", "RandomForestClassifier", "GradientBoostingClassifier", "XGBClassifier", "LGBMClassifier", "ExtraTreesClassifier", "SGDClassifier", "MLPClassifier", "BernoulliNB", "MultinomialNB"  |
| "transformers" | "Binarizer", "Normalizer", "PCA", "ZeroCount", "OneHotEncoder", "FastICA", "FeatureAgglomeration", "Nystroem", "RBFSampler" |

In [10]:
#same pipeline search space as before.
classifier_choice = tpot2.config.get_search_space(["KNeighborsClassifier", "LogisticRegression", "DecisionTreeClassifier"])

print("sampled pipeline 1")
classifier_choice.generate().export_pipeline()

sampled pipeline 1


In [11]:
print("sampled pipeline 2")
classifier_choice.generate().export_pipeline()

sampled pipeline 2


In [12]:
#search space for all classifiers
classifier_choice = tpot2.config.get_search_space("classifiers")

print("sampled pipeline 1")
classifier_choice.generate().export_pipeline()

sampled pipeline 1


In [13]:
print("sampled pipeline 2")
classifier_choice.generate().export_pipeline()

sampled pipeline 2


# Sequential Example

SequentialPipelines are of fixed length and sample from a predefined distribution for each step. Here is an example of the form Selector-Transformer-Classifier.

In [14]:
stc_pipeline = tpot2.search_spaces.pipelines.SequentialPipeline([
    tpot2.config.get_search_space("selectors"), 
    tpot2.config.get_search_space("transformers"),
    tpot2.config.get_search_space("classifiers"),
    
])

print("attributes of the first search space")
print(dir(stc_pipeline.search_spaces[0]))


attributes of the first search space
['__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'generate', 'search_spaces']


In [15]:
print("sampled pipeline")
stc_pipeline.generate().export_pipeline()

sampled pipeline

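An individual exported from this search space is an ordinary scikit-learn pipeline. The steps and hyperparameters are sampled at random, so the following is only a hand-written example of the Selector-Transformer-Classifier shape. It uses three methods named in the table above, and the hyperparameter values and the `X_demo`/`y_demo` data are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Small synthetic dataset used only for this illustration.
X_demo, y_demo = make_classification(n_samples=100, n_features=10, random_state=0)

# One step per position: selector, then transformer, then classifier.
demo_pipeline = Pipeline([
    ("selector", VarianceThreshold(threshold=0.0)),
    ("transformer", PCA(n_components=5)),
    ("classifier", LogisticRegression(max_iter=1000)),
])

demo_pipeline.fit(X_demo, y_demo)
demo_predictions = demo_pipeline.predict(X_demo)
print(demo_predictions.shape)
```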

# Optimize Search Space with TPOTEstimator

Once you have constructed a search space, you can use TPOTEstimator to optimize a pipeline within that space.

In [16]:
import tpot2
import numpy as np
import sklearn
import sklearn.datasets
import sklearn.model_selection

# create dummy dataset
X, y = sklearn.datasets.make_classification(n_samples=200, n_features=10, n_classes=2)

# train test split
X_train, X_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.5)



graph_search_space = tpot2.search_spaces.pipelines.SequentialPipeline([
    tpot2.config.get_search_space("imputers"),
    tpot2.config.get_search_space("regressors"),
])


est = tpot2.TPOTEstimator(
    scorers = ['neg_root_mean_squared_error'],
    scorers_weights = [1],
    population_size = 5,
    survival_percentage=1, 
    initial_population_size=5,
    generations=5, 
    n_jobs=5,
    cv= sklearn.model_selection.StratifiedKFold(n_splits=10, shuffle=True, random_state=1),
    verbose=5, 
    max_time_seconds=360000,
    max_eval_time_seconds=60*10, 
    classification = False,
    search_space=graph_search_space,
    preprocessing=False,
)






In [17]:
from sklearn.model_selection import train_test_split
import traceback
import dill as pickle
import os
import time
import openml
import sklearn.datasets
import numpy as np
import random
import sklearn.model_selection
import torch
from scipy import optimize
import pandas as pd

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer, KNNImputer


In [18]:
def add_missing(X, add_missing = 0.05, missing_type = 'MAR'):
    if isinstance(X,np.ndarray):
        X = pd.DataFrame(X)
    # Record which entries were already missing before new NaNs are added
    missing_mask = X.isna()
    X = X.mask(X.isna(), 0)
    T = torch.tensor(X.to_numpy())

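    match missing_type:
        case 'MAR':
            out = MAR(T, [add_missing])
        case 'MCAR':
            out = MCAR(T, [add_missing])
        case 'MNAR':
            out = MNAR_mask_logistic(T, [add_missing])
        case _:
            # fail early instead of hitting an undefined `out` below
            raise ValueError(f"unknown missing_type: {missing_type}")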
    
    masked_set = pd.DataFrame(out['Mask'].numpy())
    missing_combo = (missing_mask | masked_set.isna())
    masked_set = masked_set.mask(missing_combo, True)
    masked_set.columns = X.columns.values
    #masked_set = masked_set.to_numpy()

    missing_set = pd.DataFrame(out['Missing'].numpy())
    missing_set.columns = X.columns.values
    #missing_set = missing_set.to_numpy()

    return missing_set, masked_set

"""BEYOND THIS POINT WRITTEN BY Aude Sportisse, Marine Le Morvan and Boris Muzellec - https://rmisstastic.netlify.app/how-to/python/generate_html/how%20to%20generate%20missing%20values"""

def MCAR(X, p_miss):
    out = {'X': X.double()}
    for p in p_miss: 
        mask = (torch.rand(X.shape) < p).double()
        X_nas = X.clone()
        X_nas[mask.bool()] = np.nan
        model_name = 'Missing'
        mask_name = 'Mask'
        out[model_name] = X_nas
        out[mask_name] = mask
    return out

def MAR(X,p_miss,p_obs=0.5):
    out = {'X': X.double()}
    for p in p_miss:
        n, d = X.shape
        mask = torch.zeros(n, d).bool()
        num_no_missing = max(int(p_obs * d), 1)
        num_missing = d - num_no_missing
        obs_samples = np.random.choice(d, num_no_missing, replace=False)
        copy_samples = np.array([i for i in range(d) if i not in obs_samples])
        len_obs = len(obs_samples)
        len_na = len(copy_samples)
        coeffs = torch.randn(len_obs, len_na).double()
        Wx = X[:, obs_samples].mm(coeffs)
        coeffs /= torch.std(Wx, 0, keepdim=True)
        coeffs.double()
        len_obs, len_na = coeffs.shape
        intercepts = torch.zeros(len_na)
        for j in range(len_na):
            def f(x):
                return torch.sigmoid(X[:, obs_samples].mv(coeffs[:, j]) + x).mean().item() - p
            intercepts[j] = optimize.bisect(f, -50, 50)
        ps = torch.sigmoid(X[:, obs_samples].mm(coeffs) + intercepts)
        ber = torch.rand(n, len_na)
        mask[:, copy_samples] = ber < ps
        X_nas = X.clone()
        X_nas[mask.bool()] = np.nan
        model_name = 'Missing'
        mask_name = 'Mask'
        out[model_name] = X_nas
        out[mask_name] = mask
    return out

def MNAR_mask_logistic(X, p_miss, p_params =.5, exclude_inputs=True):
    """
    Missing not at random mechanism with a logistic masking model. It implements two mechanisms:
    (i) Missing probabilities are selected with a logistic model, taking all variables as inputs. Hence, values that are
    inputs can also be missing.
    (ii) Variables are split into a set of inputs for a logistic model, and a set whose missing probabilities are
    determined by the logistic model. The inputs are then masked MCAR (hence, missing values from the second set will
    depend on masked values).
    In either case, weights are random and the intercept is selected to attain the desired proportion of missing values.
    Parameters
    ----------
    X : torch.DoubleTensor or np.ndarray, shape (n, d)
        Data for which missing values will be simulated.
        If a numpy array is provided, it will be converted to a pytorch tensor.
    p : float
        Proportion of missing values to generate for variables which will have missing values.
    p_params : float
        Proportion of variables that will be used for the logistic masking model (only if exclude_inputs).
    exclude_inputs : boolean, default=True
        True: mechanism (ii) is used, False: (i)
    Returns
    -------
    mask : torch.BoolTensor or np.ndarray (depending on type of X)
        Mask of generated missing values (True if the value is missing).
    """
    out = {'X_init_MNAR': X.double()}
    for p in p_miss: 
        n, d = X.shape
        to_torch = torch.is_tensor(X) ## output a pytorch tensor, or a numpy array
        if not to_torch:
            X = torch.from_numpy(X)
        mask = torch.zeros(n, d).bool() if to_torch else np.zeros((n, d)).astype(bool)
        d_params = max(int(p_params * d), 1) if exclude_inputs else d ## number of variables used as inputs (at least 1)
        d_na = d - d_params if exclude_inputs else d ## number of variables masked with the logistic model
        ### Sample variables that will be parameters for the logistic regression:
        idxs_params = np.random.choice(d, d_params, replace=False) if exclude_inputs else np.arange(d)
        idxs_nas = np.array([i for i in range(d) if i not in idxs_params]) if exclude_inputs else np.arange(d)
        ### Other variables will have NA proportions selected by a logistic model
        ### The parameters of this logistic model are random.
        ### Pick coefficients so that W^Tx has unit variance (avoids shrinking)
        len_obs = len(idxs_params)
        len_na = len(idxs_nas)
        coeffs = torch.randn(len_obs, len_na).double()
        Wx = X[:, idxs_params].mm(coeffs)
        coeffs /= torch.std(Wx, 0, keepdim=True)
        coeffs.double()
        ### Pick the intercepts to have a desired amount of missing values
        len_obs, len_na = coeffs.shape
        intercepts = torch.zeros(len_na)
        for j in range(len_na):
            def f(x):
                return torch.sigmoid(X[:, idxs_params].mv(coeffs[:, j]) + x).mean().item() - p
            intercepts[j] = optimize.bisect(f, -50, 50)
        ps = torch.sigmoid(X[:, idxs_params].mm(coeffs) + intercepts)
        ber = torch.rand(n, d_na)
        mask[:, idxs_nas] = ber < ps
        ## If the inputs of the logistic model are excluded from MNAR missingness,
        ## mask some values used in the logistic model at random.
        ## This makes the missingness of other variables potentially dependent on masked values
        if exclude_inputs:
            mask[:, idxs_params] = torch.rand(n, d_params) < p
        X_nas = X.clone()
        X_nas[mask.bool()] = np.nan
        model_name = 'Missing'
        mask_name = 'Mask'
        out[model_name] = X_nas
        out[mask_name] = mask
    return out

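Of the three mechanisms above, MCAR is the simplest: each entry is masked independently with probability `p`. As a minimal sketch of the same idea without PyTorch, assuming NumPy's `default_rng` and a hypothetical helper name `mcar_numpy` (this is not part of the code above or of TPOT2):

```python
import numpy as np


def mcar_numpy(X, p, seed=None):
    """Mask each entry of X independently with probability p (MCAR)."""
    rng = np.random.default_rng(seed)
    # True where a value will be removed.
    mask = rng.random(X.shape) < p
    # Work on a float copy so the input is left untouched and NaN is representable.
    X_missing = X.astype(float, copy=True)
    X_missing[mask] = np.nan
    return X_missing, mask


X_small = np.arange(20, dtype=float).reshape(5, 4)
X_small_missing, small_mask = mcar_numpy(X_small, p=0.3, seed=0)
print("fraction masked:", small_mask.mean())
```

Because the mask does not depend on the data values, the fraction of masked entries fluctuates around `p` and approaches it as the array grows.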

In [19]:
def load_task(base_save_folder, task_id, r_or_c):
    
    cached_data_path = f"{base_save_folder}/{task_id}.pkl"
    print(cached_data_path)
    if os.path.exists(cached_data_path):
        with open(cached_data_path, "rb") as f:
            d = pickle.load(f)
        X_train, y_train, X_test, y_test = d['X_train'], d['y_train'], d['X_test'], d['y_test']
    else:
        #kwargs = {'force_refresh_cache': True}
        task = openml.datasets.get_dataset(task_id)
        X, y, _, _  = task.get_data(dataset_format="dataframe")
        print(X)
        print(y)
        if y is None: 
            y = X.iloc[:, -1:]
            X = X.iloc[:, :-1]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)
        preprocessing_pipeline = sklearn.pipeline.make_pipeline(
            tpot2.builtin_modules.ColumnSimpleImputer(
                "categorical", strategy='most_frequent'), 
            tpot2.builtin_modules.ColumnSimpleImputer(
                "numeric", strategy='mean'), 
                tpot2.builtin_modules.ColumnOneHotEncoder(
                    "categorical", min_frequency=0.001, handle_unknown="ignore")
            )
        X_train = preprocessing_pipeline.fit_transform(X_train)
        X_test = preprocessing_pipeline.transform(X_test)

        X_train = sklearn.preprocessing.normalize(X_train)
        X_test = sklearn.preprocessing.normalize(X_test)

        if r_or_c =='c':
            le = sklearn.preprocessing.LabelEncoder()
            y_train = le.fit_transform(y_train)
            y_test = le.transform(y_test)

        d = {"X_train": X_train, "y_train": y_train, "X_test": X_test, "y_test": y_test}
        if not os.path.exists(f"{base_save_folder}"):
            os.makedirs(f"{base_save_folder}")
        with open(cached_data_path, "wb") as f:
            pickle.dump(d, f)

    return X_train, y_train, X_test, y_test

In [20]:
import os
import openml

X_train, y_train, X_test, y_test = load_task(base_save_folder='.ImputerExperiments/data', task_id=197, r_or_c= 'r')
for level in [0.01]:
        for type_1 in ['MAR']:
                X_train = pd.DataFrame(X_train)
                X_test = pd.DataFrame(X_test)
                X_train_M, mask_train = add_missing(X_train, add_missing=level, missing_type=type_1)
                X_test_M, mask_test = add_missing(X_test, add_missing=level, missing_type=type_1)
                X_train_n = X_train_M.to_numpy()
                X_test_n = X_test_M.to_numpy()

.ImputerExperiments/data/197.pkl




In [21]:
est.fit(X_train_n, y_train)

Generation:   0%|          | 0/5 [00:00<?, ?it/s]

 <tpot2.search_spaces.pipelines.sequential.SequentialPipelineIndividual object at 0x10e3a4b80> 
 local variable 'X_copy' referenced before assignment 
 Traceback (most recent call last):
  File "/Users/gabrielketron/tpot2_addimputers/tpot2/tpot2/utils/eval_utils.py", line 53, in objective_nan_wrapper
    value = func_timeout.func_timeout(timeout, objective_function, args=[individual], kwargs=objective_kwargs)
  File "/Users/gabrielketron/tpot2_addimputers/env2/lib/python3.10/site-packages/func_timeout/dafunc.py", line 108, in func_timeout
    raise_exception(exception)
  File "/Users/gabrielketron/tpot2_addimputers/env2/lib/python3.10/site-packages/func_timeout/py3_raise.py", line 7, in raise_exception
    raise exception[0] from None
  File "/Users/gabrielketron/tpot2_addimputers/tpot2/tpot2/tpot_estimator/estimator.py", line 623, in objective_function
    return objective_function_generator(
  File "/Users/gabrielketron/tpot2_addimputers/tpot2/tpot2/tpot_estimator/estimator_utils.py"

Generation:  20%|██        | 1/5 [00:20<01:22, 20.60s/it]

Generation:  1
Best root_mean_squared_error score: -2.926744809426277


Generation:  40%|████      | 2/5 [00:37<00:54, 18.18s/it]

Generation:  2
Best root_mean_squared_error score: -2.910752531356383


Generation:  60%|██████    | 3/5 [01:04<00:45, 22.55s/it]

Generation:  3
Best root_mean_squared_error score: -2.825308108329773

Generation:  80%|████████  | 4/5 [01:12<00:16, 16.62s/it]

Generation:  4
Best root_mean_squared_error score: -2.825308108329773

Generation: 100%|██████████| 5/5 [01:47<00:00, 21.52s/it]

Generation:  5
Best root_mean_squared_error score: -2.825308108329773



2024-08-14 10:37:51,567 - distributed.scheduler - ERROR - Removing worker 'tcp://127.0.0.1:50009' caused the cluster to lose scattered data, which can't be recovered: {'ndarray-60bc74e8ea0275b3cc07fcd81ebcbefe', 'DataFrame-4901ef7f6301502e2364b3252b7c95e0'} (stimulus_id='handle-worker-cleanup-1723657071.56739')

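The scores reported during fitting are negative because scikit-learn's `neg_root_mean_squared_error` scorer returns the negated RMSE, so that a larger score always means a better model. A minimal sketch with made-up numbers (the values below are illustrative and not taken from the run above):

```python
import math


def neg_rmse(y_true, y_pred):
    """Negated root mean squared error: closer to zero is better."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return -math.sqrt(mse)


targets = [3.0, 5.0, 7.0]
close_predictions = [3.0, 5.0, 8.0]
far_predictions = [0.0, 5.0, 10.0]

# The more accurate predictions receive the higher (less negative) score.
print(neg_rmse(targets, close_predictions) > neg_rmse(targets, far_predictions))  # True
```

By the same logic, a best score of about -2.83 in a later generation is an improvement over about -2.93 in the first.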

In [30]:
# print the class name of the final step of the best pipeline
print(str(est.fitted_pipeline_[1]).split("(")[0])

XGBRegressor


In [23]:
est.fitted_pipeline_[0].transform(X_test_M)

array([[5.41544294e-06, 0.00000000e+00, 3.17435213e-03, ...,
        1.80514765e-06, 6.96786991e-04, 9.17751504e-01],
       [2.07165749e-06, 1.55374312e-06, 2.19906443e-03, ...,
        5.17914373e-07, 2.18559866e-04, 5.10097492e-01],
       [1.48780084e-05, 1.01440966e-05, 3.16428187e-03, ...,
        2.56983781e-06, 5.92415242e-04, 9.08706821e-01],
       ...,
       [2.75179425e-05, 2.30555734e-05, 8.64955868e-04, ...,
        1.71057480e-06, 2.41711657e-04, 7.91458419e-01],
       [1.04867884e-06, 0.00000000e+00, 3.79097402e-04, ...,
        1.36328250e-06, 1.10478316e-03, 9.52889373e-01],
       [0.00000000e+00, 0.00000000e+00, 8.18237388e-04, ...,
        1.97165636e-06, 1.92236495e-04, 9.99219668e-01]])

In [24]:
# score the model
import sklearn.metrics

# est was fit with classification=False, so use a regression scorer
# on the test data that contains the added missing values
rmse_scorer = sklearn.metrics.get_scorer("neg_root_mean_squared_error")
rmse_score = rmse_scorer(est, X_test_n, y_test)

print("negative RMSE score", rmse_score)

In [None]:
# display the best pipeline
est.fitted_pipeline_

In [None]:
est

# Combined Search Space Example

In [None]:
from tpot2.search_spaces.pipelines import *
from tpot2.config import get_search_space

selectors = get_search_space(["selectors","selectors_classification", "Passthrough"])
estimators = get_search_space(["classifiers"])


# this allows us to wrap the classifiers in the EstimatorTransformer
# this is necessary so that classifiers can be used inside of sklearn pipelines
wrapped_estimators = WrapperPipeline(tpot2.builtin_modules.EstimatorTransformer, {}, estimators)

scalers = get_search_space(["scalers","Passthrough"])

transformers_layer = UnionPipeline([
    ChoicePipeline([
        DynamicUnionPipeline(get_search_space(["transformers"])),
        get_search_space("SkipTransformer"),
    ]),
    get_search_space("Passthrough"),
])

inner_estimators_layer = UnionPipeline([
    ChoicePipeline([
        DynamicUnionPipeline(wrapped_estimators),
        get_search_space("SkipTransformer"),
    ]),
    get_search_space("Passthrough"),
])


search_space = SequentialPipeline(search_spaces=[
                                        scalers,
                                        selectors, 
                                        transformers_layer,
                                        inner_estimators_layer,
                                        estimators,
                                        ])

est = tpot2.TPOTEstimator(
    scorers = ["roc_auc"],
    scorers_weights = [1],
    classification = True,
    cv = 5,
    search_space = search_space,
    population_size= 10,
    generations = 5,
    max_eval_time_seconds = 60*5,
    verbose = 2,
)

est.fit(X_train, y_train)

Generation: 100%|██████████| 5/5 [00:22<00:00,  4.60s/it]


In [None]:
est.fitted_pipeline_