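In this notebook we attempt to use TPOT to search for a regression pipeline that predicts `INPUT_TEFF` from the stellar spectra in `FLUX_CORR`.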

In [1]:
from tpot import TPOTRegressor
from sklearn.model_selection import train_test_split
import stars
from astropy.io import ascii
import numpy as np
import pandas as pd
import multiprocessing

#if __name__ == '__main__':
#    multiprocessing.set_start_method('forkserver')

sl = stars.StarLoader('data/mastarall-v3_1_1-v1_7_7.fits', 'data/mastar-combspec-v3_1_1-v1_7_7-lsfpercent99.5.fits')

## Initial experimentation

First, I copy-pasted the tutorial code for TPOT. It ran for 2 h and crashed after 3 generations (out of the default 100). The best pipeline it had found by then was an ExtraTreesRegressor:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -137065.8257885053
exported_pipeline = ExtraTreesRegressor(bootstrap=False, max_features=0.1, min_samples_leaf=1, min_samples_split=4, n_estimators=100)
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```

Then I exported the data to `data/goodt.csv` and ran TPOT from the CLI:

```bash
tpot data/goodt.csv -is , -target INPUT_TEFF -mode regression -o -njobs -1 -s 1 -v 2
```

And I got the following error almost immediately:

```
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.

The exit codes of the workers are {SIGKILL(-9)}
```

Next, I tried restricting TPOT's config to XGBoost only, just to compare what comes out. There is no specific reason I chose XGBoost; I simply wanted to pick a single algorithm. This time I also used just 4 cores and severely restricted the number of generations and the population size so that I could see a full TPOT run in action.

```bash
tpot data/goodt.csv -is , -target INPUT_TEFF -mode regression -o tpot_exported_pipeline2.py -njobs 4 -s 1 -cf tpot_checkpoints -v 3 -config tpot_xgboost.py -g 5 -p 5 -cv 5
```
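
The contents of `tpot_xgboost.py` are not shown in this notebook. As a rough sketch, TPOT expects a custom config file to define a dictionary named `tpot_config`, and a file restricted to XGBoost might look like the following. The parameter grid is an assumption modelled on TPOT's default regressor config, written with plain Python lists rather than `np.arange` so that it has no dependencies; the actual file used for this run may differ.

```python
# Hypothetical contents of tpot_xgboost.py (the real file is not shown here).
# TPOT looks for a dictionary named `tpot_config` in a custom config file.
tpot_config = {
    'xgboost.XGBRegressor': {
        'n_estimators': [100],
        'max_depth': list(range(1, 11)),
        'learning_rate': [1e-3, 1e-2, 1e-1, 0.5, 1.0],
        # 0.05, 0.10, ..., 1.00
        'subsample': [round(0.05 * i, 2) for i in range(1, 21)],
        'min_child_weight': list(range(1, 21)),
        'n_jobs': [1],
        'verbosity': [0],
        'objective': ['reg:squarederror'],
    },
}
```

With a single estimator and no preprocessors or selectors in the dictionary, TPOT can only vary the XGBoost hyperparameters, which is what keeps a run like this short.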

A checkpoint was saved to `tpot_checkpoints` twice, and the main execution loop took 30 minutes (around 44 s/pipeline, not counting the optimisation step). The score after 3 generations was much worse than that of the ExtraTreesRegressor, so I killed off TPOT's "optimisation" step. The overall best result was:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBRegressor
from tpot.export_utils import set_param_recursive
from sklearn.preprocessing import FunctionTransformer
from copy import copy

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -22709696.973815482
exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy),
        FunctionTransformer(copy)
    ),
    XGBRegressor(learning_rate=0.001, max_depth=7, min_child_weight=7, n_estimators=100, n_jobs=1, objective="reg:squarederror", subsample=0.15000000000000002, verbosity=0)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```
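
The CV scores in these exported files are negative mean squared errors, which makes them hard to read at a glance. The small sketch below converts them to root-mean-squared errors. It assumes the CLI run used TPOT's default regression scoring (`neg_mean_squared_error`), which is not stated above. The helper name `rmse_from_neg_mse` is made up for illustration.

```python
import math


def rmse_from_neg_mse(score: float) -> float:
    """Convert a neg_mean_squared_error score into an RMSE in the target's units."""
    if score > 0:
        raise ValueError("neg_mean_squared_error scores are never positive")
    return math.sqrt(-score)


# CV scores copied from the two exported pipelines above.
cv_scores = {
    "ExtraTreesRegressor": -137065.8257885053,
    "XGBRegressor": -22709696.973815482,
}

for name, score in cv_scores.items():
    print(f"{name}: RMSE ~ {rmse_from_neg_mse(score):.0f}")
```

On these numbers the ExtraTreesRegressor is off by roughly 370 in the units of `INPUT_TEFF`, against roughly 4765 for the XGBoost pipeline.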


As a next step, I decided to go back to notebook-based execution but keep the severe restrictions on generations, population size and algorithms. I also used the forkserver trick, hoping that it would prevent the crashes.
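
The forkserver trick comes from the TPOT documentation on crashes and freezes with `n_jobs > 1`. One wrinkle in a notebook is that `multiprocessing.set_start_method` raises a `RuntimeError` if the start method has already been fixed, for example when a cell is re-run. A minimal sketch of a re-runnable version follows. The helper name `use_forkserver` is hypothetical, and `forkserver` only exists on Unix-like systems, so the sketch falls back to whatever method is already active.

```python
import multiprocessing


def use_forkserver() -> str:
    """Switch to the 'forkserver' start method if possible; return the active method."""
    if "forkserver" in multiprocessing.get_all_start_methods():
        try:
            multiprocessing.set_start_method("forkserver")
        except RuntimeError:
            # The start method was already set (e.g. the cell was re-run); keep it.
            pass
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    print(use_forkserver())
```

Calling the helper once near the top of the notebook, before any TPOT object is created, avoids having to remember which cell already set the start method.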

In [2]:
goodt = sl.stars[sl.stars['INPUT_TEFF']>0]
X_train, X_test, y_train, y_test = train_test_split(goodt['FLUX_CORR'], goodt['INPUT_TEFF'], train_size=0.9, test_size=0.1, random_state=1)

In [83]:
# Customising operators http://epistasislab.github.io/tpot/using/#customizing-tpots-operators-and-parameters
# Defaults from here https://github.com/EpistasisLab/tpot/tree/master/tpot/config
tpot_config = {
    'sklearn.ensemble.ExtraTreesRegressor': {
        'n_estimators': [100],
        'max_features': np.arange(0.05, 1.01, 0.05),
        'min_samples_split': range(2, 21),
        'min_samples_leaf': range(1, 21),
        'bootstrap': [True, False]
    },
}

# http://epistasislab.github.io/tpot/using/#crashfreeze-issue-with-n_jobs-1-under-osx-or-linux
if __name__ == '__main__':
    pipeline_optimizer = TPOTRegressor(random_state=1, verbosity=3, n_jobs=4, generations=5, population_size=5,
        periodic_checkpoint_folder='tpot_checkpoints', cv=5, config_dict=tpot_config, scoring='neg_mean_squared_error')
    pipeline_optimizer.fit(X_train, y_train)
    print(pipeline_optimizer.score(X_test, y_test))
    pipeline_optimizer.export('tpot_exported_pipeline.py')


1 operators have been imported by TPOT.


Optimization Progress:   0%|          | 0/30 [00:00<?, ?pipeline/s]

Skipped pipeline #1 due to time out. Continuing to the next pipeline.
Skipped pipeline #3 due to time out. Continuing to the next pipeline.
Skipped pipeline #5 due to time out. Continuing to the next pipeline.
Skipped pipeline #7 due to time out. Continuing to the next pipeline.
Skipped pipeline #10 due to time out. Continuing to the next pipeline.
Skipped pipeline #12 due to time out. Continuing to the next pipeline.
Skipped pipeline #14 due to time out. Continuing to the next pipeline.
Skipped pipeline #16 due to time out. Continuing to the next pipeline.

Generation 1 - Current Pareto front scores:

-1	-175279.31687513887	ExtraTreesRegressor(input_matrix, ExtraTreesRegressor__bootstrap=True, ExtraTreesRegressor__max_features=0.45, ExtraTreesRegressor__min_samples_leaf=10, ExtraTreesRegressor__min_samples_split=13, ExtraTreesRegressor__n_estimators=100)
Saving periodic pipeline from pareto front to tpot_checkpoints/pipeline_gen_1_idx_0_2022.05.26_10-40-59.py
Skipped pipeline #20 due 



The main execution loop here took 27 minutes (around 44 s/pipeline) and produced the following code. The worse score compared with the first run could be due to the different train/test split.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -155932.94160968057
exported_pipeline = ExtraTreesRegressor(bootstrap=True, max_features=0.55, min_samples_leaf=3, min_samples_split=20, n_estimators=100)
# Fix random state in exported estimator
if hasattr(exported_pipeline, 'random_state'):
    setattr(exported_pipeline, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```
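
The `0/30` total in the progress bar above follows from how TPOT budgets its search: it evaluates `population_size + generations × offspring_size` pipelines, where `offspring_size` defaults to `population_size`. A quick sketch of that arithmetic, combined with the roughly 44 s/pipeline observed here, gives a feel for how long a larger search might take. The function names are made up for illustration, and the time estimate assumes the per-pipeline rate stays constant, which it may well not.

```python
def total_pipelines(population_size, generations, offspring_size=None):
    """Number of pipelines TPOT evaluates over a full run."""
    if offspring_size is None:
        # TPOT uses the population size when no offspring size is given.
        offspring_size = population_size
    return population_size + generations * offspring_size


def estimated_minutes(n_pipelines, seconds_per_pipeline=44.0):
    """Rough wall-clock estimate at a fixed, assumed per-pipeline rate."""
    return n_pipelines * seconds_per_pipeline / 60.0


small = total_pipelines(population_size=5, generations=5)
large = total_pipelines(population_size=20, generations=20)
print(small, round(estimated_minutes(small)))
print(large, round(estimated_minutes(large)))
```

For the restricted run this gives 30 pipelines and about 22 minutes, in the same ballpark as the 27 minutes observed. A search with `generations=20` and `population_size=20` would be 420 pipelines, or roughly 5 hours at the same rate.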



In [3]:
if __name__ == '__main__':
    multiprocessing.set_start_method('forkserver')
    pipeline_optimizer = TPOTRegressor(random_state=1, verbosity=3, n_jobs=4, generations=20, population_size=20,
        periodic_checkpoint_folder='tpot_checkpoints', cv=5, scoring='neg_mean_squared_error', max_eval_time_mins=10)
    pipeline_optimizer.fit(X_train, y_train)
    print(pipeline_optimizer.score(X_test, y_test))
    pipeline_optimizer.export('tpot_exported_pipeline.py')

  from pandas import MultiIndex, Int64Index


30 operators have been imported by TPOT.


Optimization Progress:   0%|          | 0/420 [00:00<?, ?pipeline/s]



Skipped pipeline #8 due to time out. Continuing to the next pipeline.
Skipped pipeline #10 due to time out. Continuing to the next pipeline.
Skipped pipeline #16 due to time out. Continuing to the next pipeline.
Skipped pipeline #18 due to time out. Continuing to the next pipeline.
Skipped pipeline #20 due to time out. Continuing to the next pipeline.
Skipped pipeline #22 due to time out. Continuing to the next pipeline.




Skipped pipeline #31 due to time out. Continuing to the next pipeline.
Skipped pipeline #34 due to time out. Continuing to the next pipeline.
Skipped pipeline #36 due to time out. Continuing to the next pipeline.
Skipped pipeline #38 due to time out. Continuing to the next pipeline.
Skipped pipeline #40 due to time out. Continuing to the next pipeline.
Skipped pipeline #42 due to time out. Continuing to the next pipeline.
Skipped pipeline #47 due to time out. Continuing to the next pipeline.
Skipped pipeline #51 due to time out. Continuing to the next pipeline.
Skipped pipeline #53 due to time out. Continuing to the next pipeline.
Skipped pipeline #55 due to time out. Continuing to the next pipeline.

Generation 1 - Current Pareto front scores:

-1	-138587.751580905	ExtraTreesRegressor(input_matrix, ExtraTreesRegressor__bootstrap=False, ExtraTreesRegressor__max_features=0.5, ExtraTreesRegressor__min_samples_leaf=7, ExtraTreesRegressor__min_samples_split=7, ExtraTreesRegressor__n_estima



The runs kept crashing with the same "worker process was unexpectedly terminated" error mentioned earlier. The longest run I managed was 192 minutes. Maybe we need to run without parallelism.

The only result other than ExtraTreesRegressor was LassoLarsCV:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -62062.1953125
exported_pipeline = make_pipeline(
    Normalizer(norm="l2"),
    LassoLarsCV(normalize=False)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```

In [6]:
tpot_config = {
    'sklearn.linear_model.LassoLarsCV': {
        'normalize': [True, False]
    },
    # Preprocessors
    'sklearn.preprocessing.Binarizer': {
        'threshold': np.arange(0.0, 1.01, 0.05)
    },

    'sklearn.decomposition.FastICA': {
        'tol': np.arange(0.0, 1.01, 0.05)
    },

    'sklearn.cluster.FeatureAgglomeration': {
        'linkage': ['ward', 'complete', 'average'],
        'affinity': ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']
    },

    'sklearn.preprocessing.MaxAbsScaler': {
    },

    'sklearn.preprocessing.MinMaxScaler': {
    },

    'sklearn.preprocessing.Normalizer': {
        'norm': ['l1', 'l2', 'max']
    },

    'sklearn.kernel_approximation.Nystroem': {
        'kernel': ['rbf', 'cosine', 'chi2', 'laplacian', 'polynomial', 'poly', 'linear', 'additive_chi2', 'sigmoid'],
        'gamma': np.arange(0.0, 1.01, 0.05),
        'n_components': range(1, 11)
    },

    'sklearn.decomposition.PCA': {
        'svd_solver': ['randomized'],
        'iterated_power': range(1, 11)
    },

    'sklearn.preprocessing.PolynomialFeatures': {
        'degree': [2],
        'include_bias': [False],
        'interaction_only': [False]
    },

    'sklearn.kernel_approximation.RBFSampler': {
        'gamma': np.arange(0.0, 1.01, 0.05)
    },

    'sklearn.preprocessing.RobustScaler': {
    },

    'sklearn.preprocessing.StandardScaler': {
    },

    'tpot.builtins.ZeroCount': {
    },

    'tpot.builtins.OneHotEncoder': {
        'minimum_fraction': [0.05, 0.1, 0.15, 0.2, 0.25],
        'sparse': [False],
        'threshold': [10]
    },


    # Selectors
    'sklearn.feature_selection.SelectFwe': {
        'alpha': np.arange(0, 0.05, 0.001),
        'score_func': {
            'sklearn.feature_selection.f_regression': None
        }
    },

    'sklearn.feature_selection.SelectPercentile': {
        'percentile': range(1, 100),
        'score_func': {
            'sklearn.feature_selection.f_regression': None
        }
    },

    'sklearn.feature_selection.VarianceThreshold': {
        'threshold': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.2]
    },

    'sklearn.feature_selection.SelectFromModel': {
        'threshold': np.arange(0, 1.01, 0.05),
        'estimator': {
            'sklearn.ensemble.ExtraTreesRegressor': {
                'n_estimators': [100],
                'max_features': np.arange(0.05, 1.01, 0.05)
            }
        }
    }
}

pipeline_optimizer = TPOTRegressor(random_state=1, verbosity=3, generations=20, population_size=20,
    periodic_checkpoint_folder='tpot_checkpoints', cv=5, scoring='neg_mean_squared_error', max_eval_time_mins=10, config_dict=tpot_config)
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline.py')

19 operators have been imported by TPOT.


Optimization Progress:   0%|          | 0/420 [00:00<?, ?pipeline/s]

Skipped pipeline #10 due to time out. Continuing to the next pipeline.
Skipped pipeline #12 due to time out. Continuing to the next pipeline.
Skipped pipeline #14 due to time out. Continuing to the next pipeline.
Skipped pipeline #17 due to time out. Continuing to the next pipeline.
Skipped pipeline #19 due to time out. Continuing to the next pipeline.
Skipped pipeline #21 due to time out. Continuing to the next pipeline.
Skipped pipeline #23 due to time out. Continuing to the next pipeline.
Skipped pipeline #25 due to time out. Continuing to the next pipeline.
Skipped pipeline #27 due to time out. Continuing to the next pipeline.
Skipped pipeline #29 due to time out. Continuing to the next pipeline.
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required by LassoLarsCV..
_pre_test decorator: _random_mutation_operator: num_test=0 manhattan was provided as affinity. Ward can only work with euclidean distan
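
The `_pre_test` messages are TPOT discarding candidate pipelines whose parameters are invalid, such as `FeatureAgglomeration` with `linkage='ward'` and a non-Euclidean affinity. As an illustration, the sketch below enumerates the linkage/affinity pairs from the config above and filters out the ones rejected for that reason. It only encodes the Ward rule quoted in the log; other constraints may exist, and the names `LINKAGES`, `AFFINITIES` and `is_valid` are made up for this example.

```python
from itertools import product

# Values copied from the FeatureAgglomeration entry in tpot_config.
LINKAGES = ['ward', 'complete', 'average']
AFFINITIES = ['euclidean', 'l1', 'l2', 'manhattan', 'cosine']


def is_valid(linkage, affinity):
    """Ward linkage only works with Euclidean distances."""
    return linkage != 'ward' or affinity == 'euclidean'


all_pairs = list(product(LINKAGES, AFFINITIES))
valid_pairs = [pair for pair in all_pairs if is_valid(*pair)]
print(f"{len(valid_pairs)} of {len(all_pairs)} combinations pass the Ward check")
```

Roughly a quarter of the randomly drawn `FeatureAgglomeration` settings are therefore wasted on this one rule, which is harmless but explains why the message shows up so often.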


The kernel crashed after ~4 hours:

```
Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.
```

At this stage I gave up. The longest execution I could get running locally was around 5 hours, the results always seemed to get stuck on the same pipelines without improving, and individual models kept timing out. I'm guessing the timeouts have something to do with the number of features being fed in: each spectrum has about 4k points.

The best solution found was:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -62062.1953125
exported_pipeline = make_pipeline(
    Normalizer(norm="l2"),
    LassoLarsCV(normalize=False)
)
# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)
```

Also see the `tpot_checkpoints` directory for best solutions found in each generation.