This notebook prepares the data for export to Sagemaker Autopilot, then reports on the results of the Sagemaker AutoML run.

In [1]:
import stars
import numpy as np
import pandas as pd

sl = stars.StarLoader('data/mastarall-v3_1_1-v1_7_7.fits', 'data/mastar-combspec-v3_1_1-v1_7_7-lsfpercent99.5.fits')

In [58]:
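# Keep only stars with a valid (positive) input effective temperature.
goodt = sl.stars[sl.stars['INPUT_TEFF'] > 0]

# Stack the flux values and the target into one 2-D array, with teff as the last column.
teff = np.array(goodt['INPUT_TEFF']).reshape(-1, 1)
goodt_array = np.hstack([np.array(goodt['FLUX_CORR']), teff])

# Name the columns flux0, flux1, ... followed by the target column.
colcount = goodt_array.shape[1]
header = ['flux%d' % c for c in range(colcount - 1)] + ['teff']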

df_goodt = pd.DataFrame(goodt_array, columns=header)
df_goodt.to_parquet('data/goodt.parquet')


I uploaded the resulting parquet file and pointed Sagemaker Autopilot at the "teff" column as the target. I set it to generate notebooks only, so that I could step through the optimisation process.

Having limited my spend to a couple of dollars, I could afford at most 10 hours of runtime. This gave me 100 training jobs in approximately 10 hours 35 minutes (I had to stop the training early).

The hyperparameter tuning didn't really seem to improve over time, though, as the graph below shows (it plots log10(MSE) over time):

![](./images/sagemaker_training.png)

The best candidate Sagemaker Autopilot came up with was described as follows:

**[dpp0-xgboost](sagemaker_dpp0.py)**: This data transformation strategy first transforms 'numeric' features using [RobustImputer (converts missing values to nan)](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/impute/base.py). It merges all the generated features and applies [RobustStandardScaler](https://github.com/aws/sagemaker-scikit-learn-extension/blob/master/src/sagemaker_sklearn_extension/preprocessing/data.py). The transformed data will be used to tune a *xgboost* model.

The winner was an XGBoost model with the following hyperparameters:

```
alpha			1.118370287233794
colsample_bytree	0.7327357874854505
eta			0.05108342069354128
gamma			1.4447039174781267
lambda			0.014694710871502958
max_depth		7
min_child_weight	0.2518209934598997
num_round		499
subsample		0.5854222916407865
```
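For anyone wanting to try these values locally, here is a minimal sketch of how they could be passed to the open-source `xgboost` package. The helper names (`split_for_xgb_train`, `train_replica`) are made up for illustration, and the `reg:squarederror` objective is my assumption (teff is continuous and the tuning metric was MSE); Autopilot did not report it. `num_round` is not a booster parameter in the open-source API, so it is passed as `num_boost_round` instead.

```python
# Hyperparameters reported by Autopilot for the winning candidate (copied from above).
autopilot_params = {
    "alpha": 1.118370287233794,
    "colsample_bytree": 0.7327357874854505,
    "eta": 0.05108342069354128,
    "gamma": 1.4447039174781267,
    "lambda": 0.014694710871502958,
    "max_depth": 7,
    "min_child_weight": 0.2518209934598997,
    "num_round": 499,
    "subsample": 0.5854222916407865,
}


def split_for_xgb_train(params):
    """Split the reported values into booster params and the number of boosting rounds."""
    booster_params = {k: v for k, v in params.items() if k != "num_round"}
    # Assumption: squared-error regression objective (not reported by Autopilot).
    booster_params["objective"] = "reg:squarederror"
    return booster_params, params["num_round"]


def train_replica(X, y):
    """Train a local model on features X and target y; requires the xgboost package."""
    import xgboost as xgb  # imported lazily so the helpers above work without it

    booster_params, num_rounds = split_for_xgb_train(autopilot_params)
    dtrain = xgb.DMatrix(X, label=y)
    return xgb.train(booster_params, dtrain, num_boost_round=num_rounds)
```

Here `X` would be the (preprocessed) flux columns and `y` the teff column. There is no guarantee this reproduces the Autopilot score, since the container's preprocessing and data split are not replicated.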

The winning model is implemented by AWS inside their super-secret Docker images, so there is no source.

We could try replicating it, though: these results are far better than what TPOT could come up with for XGBoost (granted, it was crashing a lot). I suspect scaling matters here, so the preprocessing steps AWS applied are worth a closer look.
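As a starting point for experimenting with the scaling, here is a rough numpy-only stand-in for the impute-then-standardise strategy described for dpp0. It is not the `sagemaker-scikit-learn-extension` code: median imputation, zero-mean/unit-variance scaling and the function names are all assumptions on my part.

```python
import numpy as np


def fit_impute_scale(X):
    """Learn per-column medians, means and standard deviations from training data."""
    X = np.asarray(X, dtype=float)
    medians = np.nanmedian(X, axis=0)            # medians computed ignoring NaNs
    filled = np.where(np.isnan(X), medians, X)   # replace NaNs with the column median
    means = filled.mean(axis=0)
    stds = filled.std(axis=0)
    stds[stds == 0] = 1.0  # leave constant columns unscaled instead of dividing by zero
    return {"medians": medians, "means": means, "stds": stds}


def apply_impute_scale(X, stats):
    """Apply previously learned statistics to new data (train or test)."""
    X = np.asarray(X, dtype=float)
    filled = np.where(np.isnan(X), stats["medians"], X)
    return (filled - stats["means"]) / stats["stds"]
```

The statistics should be fitted on the training rows only and then applied unchanged to any held-out rows, and only to the flux columns, not to teff.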