# Data transformations

This notebook demonstrates how to customize AGNBoost to your data needs through:

1. Creating custom features.
2. Applying transformations to your target variables.

Let's start by importing the necessary libraries and loading our data.

In [1]:
%load_ext autoreload
%autoreload 2

In [5]:
# Set agnboost folder as root
import os
os.chdir(os.path.expanduser("/home/kurt/Documents/agnboost/"))

# Import necessary libraries
import numpy as np
import pandas as pd
from agnboost import dataset, model
#from sklearn.metrics import mean_squared_error

# Set random seed for reproducibility
np.random.seed(123)

print("AGNBoost Basic Usage Tutorial")
print("=" * 40)

AGNBoost Basic Usage Tutorial


## Loading the Data

We'll use the `Catalog` class to load our astronomical dataset. The `cigale_mock_small.csv` file contains photometric measurements and AGN fraction labels for our analysis.

In [6]:
# Load the astronomical data using the Catalog class
catalog = dataset.Catalog(path="data/cigale_mock_small.csv", summarize=False)


Current working directory: /home/kurt/Documents/agnboost
Looking for bands file at: /home/kurt/Documents/agnboost/allowed_bands.json
[INFO] Loaded bands file metadata: This file contains the allowed photometric bands for JWST
[INFO] Loaded 11 allowed bands from agnboost/allowed_bands.json
[INFO] Attempting to load file with delimiter: ','
[INFO] Successfully loaded data with 1000 rows.
[INFO] Found 11 valid band columns:
[INFO]   - jwst.nircam.F115W (F115W): 1.154 μm
[INFO]   - jwst.nircam.F150W (F150W): 1.501 μm
[INFO]   - jwst.nircam.F200W (F200W): 1.988 μm
[INFO]   - jwst.nircam.F277W (F277W): 2.776 μm
[INFO]   - jwst.nircam.F356W (F356W): 3.565 μm
[INFO]   - jwst.nircam.F410M (F410M): 4.083 μm
[INFO]   - jwst.nircam.F444W (F444W): 4.402 μm
[INFO]   - jwst.miri.F770W (F770W): 7.7 μm
[INFO]   - jwst.miri.F1000W (F1000W): 10.0 μm
[INFO]   - jwst.miri.F1500W (F1500W): 15.0 μm
[INFO]   - jwst.miri.F2100W (F2100W): 21.0 μm


There are no NaN rows to remove here because the CIGALE mock data we loaded contains none, but your real data might.
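
If you want to check your own table for missing photometry before handing it to AGNBoost, plain pandas is enough. The sketch below runs on a small made-up dataframe; `toy`, `band_cols`, and `clean` are illustrative names, and it is independent of whatever NaN handling the `Catalog` class itself provides.

```python
import numpy as np
import pandas as pd

# Toy photometry table; the column names mirror two of the bands listed above.
toy = pd.DataFrame({
    "jwst.nircam.F115W": [1.2, np.nan, 0.7],
    "jwst.miri.F770W": [3.4, 2.1, np.nan],
    "agn.fracAGN": [0.1, 0.5, 0.9],
})

band_cols = ["jwst.nircam.F115W", "jwst.miri.F770W"]

# Keep only rows where every band column has a measurement.
clean = toy.dropna(subset=band_cols).reset_index(drop=True)
n_dropped = len(toy) - len(clean)
print(f"Dropped {n_dropped} of {len(toy)} rows with missing photometry")
```

Restricting `dropna` to the band columns avoids discarding rows that are only missing an unrelated column.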

## Creating Features

AGNBoost automatically engineers features from photometric data, including colors and transformations. Let's create the feature dataframe that will be used for modeling.

By default, AGNBoost creates a feature set consisting of the photometric bands, the colors derived from them, and the squares of those colors.

In [14]:
# Create features for modeling
catalog.create_feature_dataframe()

# Get information about created features
features = catalog.get_features()
print(f"Feature engineering complete:")
print(f"  Feature dataframe shape: {features.shape}")


[INFO] Created feature dataframe with 121 columns and 1000 rows.
Feature engineering complete:
  Feature dataframe shape: (1000, 121)
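
One way to see where 121 columns can come from: 11 bands give 55 unordered band pairs, so 11 bands + 55 colors + 55 squared colors = 121. The sketch below reproduces that count on random toy data. It assumes a color is the difference of two distinct bands, taken once per pair; the column naming is made up and this is not AGNBoost's own implementation.

```python
from itertools import combinations

import numpy as np
import pandas as pd

bands = ["F115W", "F150W", "F200W", "F277W", "F356W", "F410M",
         "F444W", "F770W", "F1000W", "F1500W", "F2100W"]

# Random stand-in photometry: 5 sources, one column per band.
rng = np.random.default_rng(123)
phot = pd.DataFrame(rng.normal(size=(5, len(bands))), columns=bands)

# Assumption: one color per unordered pair of distinct bands.
colors = pd.DataFrame(
    {f"{a}-{b}": phot[a] - phot[b] for a, b in combinations(bands, 2)}
)
colors_sq = (colors ** 2).add_suffix("^2")

toy_features = pd.concat([phot, colors, colors_sq], axis=1)
print(toy_features.shape)
```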


In [15]:
print(features)

     jwst.nircam.F115W  jwst.nircam.F150W  jwst.nircam.F200W  \
0            -5.023679          -4.481445          -3.904872   
1            -1.894243          -1.105450          -0.396389   
2            -0.487316          -0.497980          -0.479759   
3            -5.079519          -4.496838          -4.066625   
4            -3.625131          -3.502294          -3.353756   
..                 ...                ...                ...   
995          -1.286971          -1.147989          -0.972435   
996          -1.428387          -1.285736          -1.110556   
997          -2.655916          -2.286756          -1.873670   
998          -3.640091          -3.444642          -3.111316   
999          -0.081750           0.434442           0.832730   

     jwst.nircam.F277W  jwst.nircam.F356W  jwst.nircam.F410M  \
0            -3.300921          -2.936896          -2.782763   
1             0.157887           0.425068           0.523924   
2            -0.601100          -0.6904

In [13]:
# Each entry pairs a function with the group of columns it is applied to.
funcs = [
    (lambda x: np.log10(x), 'phots'),
    # ('colors', 'all'),  # disabled: not a (function, group) pair
    (lambda x: x**2, 'colors'),
]

# TODO: add a function to manually set the feature dataframe for more complex feature engineering.
# You could, for example, create the default feature dataframe, load it, keep only the columns
# you want, and save that back to self.features_df.
#
# This just needs a saved variable to track whether a custom feature dataframe has been set
# (for photometric error propagation). Setting one will disable the built-in photometric
# uncertainty propagation, so users will have to do it themselves.

# Create features for modeling
catalog.create_feature_dataframe( feature_funcs = funcs)

# Get information about created features
features = catalog.get_features()
print(f"Feature engineering complete:")
print(f"  Feature dataframe shape: {features.shape}")


[INFO] Applied custom feature function <lambda>: added 11 features.
[INFO] Applied custom feature function <lambda>: added 11 features.
[INFO] Created feature dataframe with 22 columns and 1000 rows.
Feature engineering complete:
  Feature dataframe shape: (1000, 22)
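
To make the `(function, group)` pattern concrete, here is a minimal sketch of how such a list could be applied to named groups of columns. `apply_feature_funcs`, the `groups` dictionary, and the `f0_`/`f1_` column prefixes are hypothetical and exist only for illustration; AGNBoost's internals may work differently.

```python
import numpy as np
import pandas as pd


def apply_feature_funcs(groups, funcs):
    """Apply each (function, group_name) pair to its column group and
    concatenate the results. Hypothetical helper, for illustration only."""
    pieces = []
    for i, (func, group_name) in enumerate(funcs):
        transformed = func(groups[group_name])
        # Prefix the columns so outputs of different functions never collide.
        pieces.append(transformed.add_prefix(f"f{i}_"))
    return pd.concat(pieces, axis=1)


# Toy inputs: positive values for the log, one made-up color column.
phots = pd.DataFrame({"F115W": [1.0, 10.0], "F150W": [100.0, 1000.0]})
colors = pd.DataFrame({"F115W-F150W": [-0.5, 0.25]})

funcs = [
    (lambda x: np.log10(x), "phots"),
    (lambda x: x**2, "colors"),
]

custom = apply_feature_funcs({"phots": phots, "colors": colors}, funcs)
print(custom)
```

Note that `np.log10` returns NaN for negative inputs, so a log transform only makes sense for columns that are strictly positive.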


## Loading the Pre-trained Model

AGNBoost comes with pre-trained models for common astronomical tasks. We'll load the model specifically trained for AGN fraction estimation (`agn.fracAGN`).

The `load_model()` method automatically:
- Checks for compatible pre-trained models
- Validates feature compatibility between the model and our data
- Loads model metadata including training parameters and performance metrics
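
The feature compatibility check in the list above can be pictured as a comparison of two name lists. The sketch below is a generic illustration in plain Python; `check_feature_compatibility` and the feature names are hypothetical, and this is not the code AGNBoost runs.

```python
def check_feature_compatibility(model_features, data_features):
    """Return (missing, extra): features the model expects but the data
    lacks, and features in the data that the model was not trained on."""
    model_set, data_set = set(model_features), set(data_features)
    missing = sorted(model_set - data_set)
    extra = sorted(data_set - model_set)
    return missing, extra


# Made-up feature lists for illustration.
model_features = ["F115W", "F150W", "F115W-F150W"]
data_features = ["F115W", "F150W", "log10_F115W"]

missing, extra = check_feature_compatibility(model_features, data_features)
if missing:
    print(f"Model expects features not in the data: {missing}")
if extra:
    print(f"Data has features the model has not seen: {extra}")
```

This matters when you customize features: the pre-trained model below reports 121 features, the size of the default feature dataframe, so a 22-column custom dataframe would presumably not be compatible with it.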

In [42]:
# Initialize an AGNBoost model. `target_variables` maps each target column name to the distribution used to model it.
agnboost_m = model.AGNBoost( feature_names = catalog.get_feature_names(),
                          target_variables = {'agn.fracAGN' : 'ZABeta'},
                         )

# Load the pre-trained model. We do not pass a filename, so the most recent fracAGN model is loaded.
agnboost_m.load_model(model_name = 'fracAGN', overwrite = True)

if agnboost_m.models['agn.fracAGN'] is not None:
    print("✅ Pre-trained model loaded successfully!")
    
    # Display model information
    model_info = agnboost_m.model_info.get('agn.fracAGN', {})
    if model_info:
        print("\nModel information:")
        if 'training_timestamp' in model_info:
            print(f"  Trained: {model_info['training_timestamp']}")
        if 'best_score' in model_info:
            print(f"  Best validation score: {model_info['best_score']:.6f}")
        if 'features' in model_info:
            print(f"  Number of features: {len(model_info['features'])}")
else:
    print("❌ No pre-trained models found!")
    print("You may need to train a new model or check the models directory.")



✅ Pre-trained model loaded successfully!

Model information:
  Best validation score: -649218.125000
  Number of features: 121


## Making Predictions

Now we'll use our loaded model to predict AGN fractions for the test set. AGNBoost seamlessly handles the conversion of our catalog data into the format required by the underlying XGBoost model.

The prediction process uses the engineered features (colors, log magnitudes, etc.) that were automatically created from our photometric band data.

In [43]:
# Make predictions on the test set
#agnboost_m.models['agn.fracAGN'].booster.set_param( {'device': 'cpu'})
preds = agnboost_m.predict( data = catalog, split_use = 'test', model_name = 'agn.fracAGN')

print(f"  Mean: {np.mean(preds):.6f}")
print(f"  Std: {np.std(preds):.6f}")
print(f"  Min: {np.min(preds):.6f}")
print(f"  Max: {np.max(preds):.6f}")




  Mean: 0.509088
  Std: 0.318490
  Min: 0.000142
  Max: 0.989471


## Quantifying Prediction Uncertainty

One of AGNBoost's key advantages is its ability to provide robust uncertainty estimates through XGBoostLSS distributional modeling. Rather than just point estimates, we get full uncertainty quantification for each prediction.

The `prediction_uncertainty()` method returns uncertainty estimates that account for both model uncertainty and the inherent variability in the data. This is crucial for astronomical applications where understanding prediction confidence is essential for scientific interpretation.

Since the loaded data is a CIGALE mock catalog with no photometric uncertainty, we will only estimate the model (aleatoric + epistemic) uncertainty for each source.
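
A common way to combine aleatoric and epistemic uncertainty is the law of total variance applied to an ensemble of predictive distributions: the aleatoric part is the average of the per-member variances, and the epistemic part is the variance of the per-member means. The sketch below shows this on made-up numbers. It is a generic illustration, and `prediction_uncertainty()` may compute its estimate differently.

```python
import numpy as np

rng = np.random.default_rng(123)
n_members, n_sources = 20, 4

# Made-up predictive means and variances:
# one row per ensemble member, one column per source.
member_means = rng.uniform(0.2, 0.8, size=(n_members, n_sources))
member_vars = rng.uniform(0.001, 0.01, size=(n_members, n_sources))

# Aleatoric: average variance predicted by the individual members.
aleatoric = member_vars.mean(axis=0)
# Epistemic: disagreement between members about the predicted mean.
epistemic = member_means.var(axis=0)

# Total model uncertainty per source, as a standard deviation.
total_std = np.sqrt(aleatoric + epistemic)
print(total_std)
```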

In [None]:
model_uncertainty = agnboost_m.prediction_uncertainty( uncertainty_type = 'model', model_name = 'agn.fracAGN', catalog = catalog)

print(f"✅ Uncertainty estimates generated")
print(f"Uncertainty statistics:")
print(f"  Mean uncertainty: {np.mean(model_uncertainty):.6f}")
print(f"  Std uncertainty: {np.std(model_uncertainty):.6f}")
print(f"  Min uncertainty: {np.min(model_uncertainty):.6f}")
print(f"  Max uncertainty: {np.max(model_uncertainty):.6f}")

Processing truncated model uncertainty:  36%|▎| 362/1000 [02:56<05:26,  1.95it/s