In this notebook, we'll test out feature engineering for the photometric redshift problem, and take a look at the feature importance. 

Copyright: Viviana Acquaviva (2023); see also other data credits below.

Modifications by Julieta Gruszko (2025)

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)

The problem is inspired by [this paper](https://arxiv.org/abs/1903.08174), for which the data are public and available [here](http://d-scholarship.pitt.edu/36064/).

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', 100)


font = {'size'   : 16}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=14) 
matplotlib.rc('ytick', labelsize=14) 
#matplotlib.rcParams.update({'figure.autolayout': True})
matplotlib.rcParams['figure.dpi'] = 300

In [None]:
%pip install xgboost  # install xgboost into the notebook's environment if you don't have it

In [None]:
import xgboost as xgb

In [None]:
from sklearn.tree import DecisionTreeRegressor 
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor

We read in the selection of data as in previous notebooks.

In [None]:
sel_features = pd.read_csv('../Data/sel_features.csv', sep = '\t')

In [None]:
sel_target = pd.read_csv('../Data/sel_target.csv')

In [None]:
sel_features.shape

In [None]:
sel_target.values.ravel() #changes shape to 1d row-like array

### I'll demonstrate how to check the feature importance using Random Forests as an example.

In [None]:
model = RandomForestRegressor(max_features=4, n_estimators=200) # note: random_state is not set here, so results vary from run to run; pass random_state=<int> to make them reproducible

Once the model has been fit, it has the attribute "feature\_importances\_". These are calculated with the mean decrease in impurity method and are normalized to sum to 1. The cells below fit the model and then display the importances:

In [None]:
model.fit(sel_features, sel_target.values.ravel()) 

#note: this is not doing any train/test split, but fitting the entire data set 

In [None]:
model.feature_importances_

The code below plots the feature importances. You'll need to adapt it to show the results for multiple models, so you can compare them.

In [None]:
importances = model.feature_importances_

indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(sel_features.shape[1]):
    print("%d. feature: %s, %d (%f)" % (f + 1, sel_features.columns[indices[f]], indices[f], importances[indices[f]]))

# Plot the feature importances of the forest
plt.figure(figsize=(16,6))
plt.title("Feature importances")
plt.bar(range(sel_features.shape[1]), importances[indices],
       color="r", align="center")
plt.xticks(range(sel_features.shape[1]), sel_features.columns[indices])
plt.xlim([-1, sel_features.shape[1]])
plt.show()

### Your turn! 

### Step 1: Feature Engineering
In the paper that we used as a reference (https://arxiv.org/abs/1903.08174), the authors actually use colors, not magnitudes, as features (or, to be precise, one magnitude and five colors). Find the exact list of features in the paper (hint: it's in Section 7), and generate the new features to match what is done there. Note: a color is the ratio of the brightness in two bands. Because the brightness in each band is expressed in magnitudes, which are a logarithmic unit (i.e. a magnitude is proportional to the log of the brightness), that ratio becomes a difference: you obtain a color by subtracting the magnitudes in two bands.
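
As a minimal sketch of that subtraction, the cell below builds one magnitude plus five colors from a small table of made-up magnitudes. The column names (`u`, `g`, `r`, `i`, `z`, `y`), the choice of adjacent-band colors, the kept `i` magnitude, and the helper `add_colors` are all placeholders for illustration. They are not the paper's feature list or the actual columns of `sel_features`; substitute what you find in Section 7.

```python
import pandas as pd

# Made-up magnitudes for three objects; the column names are placeholders.
mags = pd.DataFrame({
    "u": [24.0, 23.5, 25.0],
    "g": [23.0, 23.0, 24.0],
    "r": [22.5, 22.0, 23.5],
    "i": [22.0, 21.5, 23.0],
    "z": [21.75, 21.0, 22.5],
    "y": [21.5, 20.75, 22.25],
})

def add_colors(df, bands, keep_mag):
    """Hypothetical helper: keep one magnitude and add adjacent-band colors."""
    out = pd.DataFrame(index=df.index)
    out[keep_mag] = df[keep_mag]
    # A color is the difference of the magnitudes in two bands.
    for blue, red in zip(bands[:-1], bands[1:]):
        out[f"{blue}-{red}"] = df[blue] - df[red]
    return out

colors = add_colors(mags, bands=["u", "g", "r", "i", "z", "y"], keep_mag="i")
print(colors)
```

For Version 3, the same idea applies with the original magnitude columns kept alongside the new color columns, for example by concatenating the two tables with `pd.concat(..., axis=1)` and dropping the duplicated magnitude.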

Then, compare the performance of an **optimized** Random Forest model using 3 options for features:
- Version 1: 6 magnitudes, as demonstrated in Studio 8. There is no need to repeat the work from studio; you can just cite the result you found.
- Version 2: 1 magnitude and 5 colors, as described in the paper.
- Version 3: All the features from Version 1 and Version 2, yielding 6 magnitudes and 5 colors as your full set of features.

Compare the outlier fraction and $\sigma_{NMAD}$ for the 3 versions, taking into account the uncertainties.
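
For reference, here is a sketch of how the two metrics might be computed. It assumes one common set of photo-z definitions: $\Delta z = (z_{phot} - z_{spec})/(1 + z_{spec})$, $\sigma_{NMAD} = 1.48 \times \mathrm{median}(|\Delta z|)$, and an outlier defined by $|\Delta z| > 0.15$. These definitions, the function name `photoz_metrics`, and the toy redshifts are assumptions; check them against what was used in Studio 8 and in the paper before relying on the numbers.

```python
import numpy as np

def photoz_metrics(z_spec, z_phot, outlier_cut=0.15):
    """Return (sigma_NMAD, outlier fraction) under the assumed definitions."""
    z_spec = np.asarray(z_spec, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)
    # Normalized residual between predicted and true redshift.
    dz = (z_phot - z_spec) / (1.0 + z_spec)
    sigma_nmad = 1.48 * np.median(np.abs(dz))
    outlier_frac = np.mean(np.abs(dz) > outlier_cut)
    return sigma_nmad, outlier_frac

# Toy values only, to show the call; these are not real predictions.
z_spec = np.array([0.5, 1.0, 1.0, 0.25, 1.0])
z_phot = np.array([0.5, 1.1, 0.9, 0.25, 2.0])
sigma_nmad, outlier_frac = photoz_metrics(z_spec, z_phot)
print(sigma_nmad, outlier_frac)
```

One way to attach an uncertainty is to evaluate both metrics separately on each cross-validation fold and quote the mean and standard deviation across folds.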

### Step 2: Feature Importance

Using the features from Version 3 in Step 1, calculate the feature importance for each of the 11 features. In a plot, compare the feature importances found using RandomForest, AdaBoost, and XGBoost methods.
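
One possible layout for the comparison is a grouped bar chart, with one group per feature and one bar per model. The sketch below is only a plotting helper: the function name `plot_importance_comparison`, the feature names, and the numbers in `demo` are placeholders and are not results from any fitted model.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_importance_comparison(importances_by_model, feature_names):
    """Hypothetical helper: grouped bars, one group per feature, one bar per model."""
    n_models = len(importances_by_model)
    x = np.arange(len(feature_names))
    width = 0.8 / n_models  # leave a gap between neighboring groups
    fig, ax = plt.subplots(figsize=(16, 6))
    for k, (label, imp) in enumerate(importances_by_model.items()):
        # Center the set of bars for each feature on its tick position.
        offset = (k - (n_models - 1) / 2) * width
        ax.bar(x + offset, np.asarray(imp, dtype=float), width=width, label=label)
    ax.set_xticks(x)
    ax.set_xticklabels(feature_names)
    ax.set_ylabel("Feature importance")
    ax.set_title("Feature importances by model")
    ax.legend()
    return fig, ax

# Placeholder numbers only, to show the call signature.
names = ["f1", "f2", "f3"]
demo = {
    "Model A": [0.5, 0.3, 0.2],
    "Model B": [0.6, 0.1, 0.3],
}
fig, ax = plot_importance_comparison(demo, names)
```

With fitted models, each dictionary value would be that model's `feature_importances_` attribute, which `RandomForestRegressor`, `AdaBoostRegressor`, and `xgb.XGBRegressor` all expose. The libraries do not define importance in exactly the same way, so the relative ranking of features is more comparable across models than the raw values.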

### Acknowledgement statement: