While reading the literature I found some other methods that might work in this situation. The dataset is interesting because it has more features (flux columns) than training examples (labeled stars). I'd like to see how the following methods perform:

* principal component regression: PCA followed by linear regression. Like Lasso it reduces input dimensionality, but the reduction step is unsupervised, i.e. it does not look at the target
* feature engineering: fit a polynomial to each spectrum and use its coefficients to summarise the flux samples in a handful of values. This may work really badly, because a low-order fit will wipe out all absorption lines
* feature engineering: convert the spectrum to XYZ colour coordinates. This may work really badly, because it will probably wipe out some ultraviolet and infrared information
* feature engineering: help the model focus on absorption lines. Perhaps we could calculate the second derivative of the flux. Some thought would be required to cater for the expected redshift
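
To make two of the feature engineering ideas above concrete, here is a minimal sketch of the polynomial summary and the second derivative. It runs on a synthetic spectrum, not on the MaStar data: the wavelength grid, the continuum, the line position and the helper names `polynomial_summary` and `second_derivative` are all assumptions made for illustration.

```python
import numpy as np

# Synthetic stand-in for real spectra: the grid, continuum and absorption
# line below are made up for illustration only.
wavelength = np.linspace(3600.0, 10300.0, 500)      # angstroms (assumed grid)
x_scaled = np.linspace(-1.0, 1.0, wavelength.size)  # rescaled axis for a stable fit
line_centre = 6563.0                                # hypothetical absorption line


def make_spectrum(depth):
    """Linear continuum with one Gaussian absorption dip of the given depth."""
    continuum = 1.0 + 0.1 * x_scaled
    dip = depth * np.exp(-0.5 * ((wavelength - line_centre) / 30.0) ** 2)
    return continuum - dip


flux = np.vstack([make_spectrum(0.3), make_spectrum(0.6)])  # (n_stars, n_flux)


def polynomial_summary(x, flux, degree=4):
    """Fit one polynomial per spectrum and return its coefficients."""
    # np.polyfit fits every column of a 2-D y at once, hence the transposes.
    return np.polyfit(x, flux.T, degree).T  # (n_stars, degree + 1)


def second_derivative(flux, wavelength):
    """Numerical second derivative of flux with respect to wavelength."""
    first = np.gradient(flux, wavelength, axis=1)
    return np.gradient(first, wavelength, axis=1)


poly_features = polynomial_summary(x_scaled, flux)
curvature = second_derivative(flux, wavelength)

# An absorption dip has strong positive curvature at its centre.
strongest = wavelength[np.argmax(curvature, axis=1)]
print(poly_features.shape, strongest)
```

The polynomial summary compresses each spectrum to five numbers and loses the line, while the second derivative is close to zero on the smooth continuum and peaks at the line. Real spectra are noisy, so some smoothing before differentiating would probably be needed; that step is left out here.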

Let's dig in.
  

In [11]:
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
from IPython.display import HTML
import stars

mlflow.sklearn.autolog()
mlflow.set_experiment("/mastar/07_other_models")

<Experiment: artifact_location='file:///Users/x/Sites/mastar/mlruns/2', experiment_id='2', lifecycle_stage='active', name='/mastar/07_other_models', tags={}>

In [26]:
df_goodt = pd.read_parquet('data/goodt.parquet')

In [27]:
seed = 1
training_features_all, testing_features_all, training_target, testing_target = train_test_split(df_goodt, df_goodt['teff'], random_state=seed)

mangaid_training = training_features_all['mangaid']
mangaid_testing = testing_features_all['mangaid']
training_features = np.array(training_features_all.drop(['mangaid', 'teff', 'teff_ext'], axis=1))
testing_features = np.array(testing_features_all.drop(['mangaid', 'teff', 'teff_ext'], axis=1))

In [34]:
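# 50 components, for a dimensionality reduction comparable to the Lasso model
pca = PCA(n_components=50)
lin = LinearRegression()
pipeline = make_pipeline(
    # Normalizer rescales each sample (row, i.e. each spectrum) to unit L2 norm
    Normalizer(norm="l2"),
    pca,
    lin,
    verbose=True
)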
pipeline.fit(training_features, training_target)

results = pipeline.predict(testing_features)
mse = mean_squared_error(testing_target, results)
print('MSE: %.2f' % mse)

2022/06/15 17:25:19 INFO mlflow.utils.autologging_utils: Created MLflow autologging run with ID '72d450261e8d4c6f9240c014e21811e2', which will track hyperparameters, performance metrics, model artifacts, and lineage information for the current sklearn workflow


[Pipeline] ........ (step 1 of 3) Processing normalizer, total=   0.2s
[Pipeline] ............... (step 2 of 3) Processing pca, total=   1.1s
[Pipeline] .. (step 3 of 3) Processing linearregression, total=   0.0s
MSE: 63219.59


This result is comparable to the Lasso result. We could probably improve the MSE by using more components, but I wanted to see the performance at a comparable level of dimensionality reduction.
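
As a rough sketch of how the effect of the number of components could be checked, the loop below refits the same kind of pipeline for several values of `n_components`. It uses synthetic data from `make_regression`, not the MaStar spectra, and the candidate values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Synthetic data with more features than samples, mimicking the shape of the
# problem only; the values have nothing to do with real spectra.
X, y = make_regression(n_samples=200, n_features=500, n_informative=20,
                       noise=1.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

mse_by_components = {}
for n_components in (5, 20, 50, 100):  # arbitrary candidate values
    model = make_pipeline(
        Normalizer(norm="l2"),
        PCA(n_components=n_components, random_state=1),
        LinearRegression(),
    )
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    mse_by_components[n_components] = mean_squared_error(y_test, predictions)

for n_components, mse in mse_by_components.items():
    print('n_components=%d MSE: %.2f' % (n_components, mse))
```

On held-out data the error need not fall monotonically as components are added, so a sweep like this is a check rather than a guarantee.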

In [32]:
df_badt_lim = pd.read_parquet('data/badt_lim.parquet')
target_ext = df_badt_lim['teff_ext']

results_ext = pipeline.predict(np.array(df_badt_lim.drop(['mangaid', 'teff', 'teff_ext'], axis=1)))
mse = mean_squared_error(target_ext, results_ext)
print('External dataset MSE: %.2f' % mse)

External dataset MSE: 30396.76


This is also similar to the Lasso result. We can conclude that Lasso and PCA with linear regression give similar results here. The PCA results, however, are harder to reason about: PCA mixes the original flux columns into components before the regression, so it is harder to figure out which parts of the spectrum are important to the linear model.
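
One partial remedy, offered as a sketch and not as something this notebook does: because PCA (without whitening) and linear regression are both linear maps, their composition can be folded back into a single weight per normalised flux column. The example uses synthetic data, and the names `effective_weights` and `effective_intercept` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Synthetic data, for illustration only.
X, y = make_regression(n_samples=200, n_features=500, n_informative=20,
                       noise=1.0, random_state=1)

pipeline = make_pipeline(
    Normalizer(norm="l2"),
    PCA(n_components=10, random_state=1),
    LinearRegression(),
)
pipeline.fit(X, y)

pca = pipeline.named_steps['pca']
lin = pipeline.named_steps['linearregression']

# PCA computes (x - mean_) @ components_.T and the regression then applies
# coef_ and intercept_, so the two steps collapse into one weight vector.
effective_weights = pca.components_.T @ lin.coef_            # (n_features,)
effective_intercept = lin.intercept_ - pca.mean_ @ effective_weights

# The weights act on the normalised features, not on the raw flux values.
X_normalised = pipeline.named_steps['normalizer'].transform(X)
reconstructed = X_normalised @ effective_weights + effective_intercept

# Indices of the ten columns with the largest absolute weight.
top_columns = np.argsort(np.abs(effective_weights))[::-1][:10]
print(top_columns)
```

Unlike Lasso coefficients, these weights are dense, so they give a ranking of columns but not a sparse selection, and the per-spectrum normalisation still sits between them and the raw flux.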