# Create a Random Forest model from BASE-9 data


This notebook performs the following tasks:
- reads in posterior data from BASE-9
- generates features from these data 
- uses these features to train and test a random forest classifier from `scikit-learn`
- and saves the model to a file.  

Here we use data from NGC 2682 (M67) to train the model; these data were hand labelled by Justyce.  If you need access to these data, please contact Aaron Geller.

Most of the "heavy lifting" is done by the code in the `base9_ml_utils.py` file.  See the comments and markdown in that code for more details.


___
*Authors:* Justyce Watson, Aaron Geller\
*Date:* August 2025


## Import all functions from the `base9_ml_utils.py` file

In [1]:
# import functions from .py file
from base9_ml_utils import *

# The lines below are useful if you plan to make changes to the base9_ml_utils.py file.
# They will allow the notebook to refresh when you save changes to the .py file.
#
# %load_ext autoreload
# %autoreload 2


## Read in `.res` files and create the features

The user should specify the data directory on their own computer.  The code assumes that this directory contains one `.res` file per star, with the star ID in the filename.  (If the filename contains additional text, the user can pass the `file_prefix` and/or `file_suffix` arguments so that the code can extract the star ID from the filename correctly.)
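As an illustration of the naming convention described above, the following sketch recovers a star ID by stripping an optional prefix and suffix from a `.res` filename. The helper `star_id_from_filename` and the decorated example filename are hypothetical; the actual parsing inside `create_features` may differ.

```python
from pathlib import Path


def star_id_from_filename(path, file_prefix="", file_suffix=""):
    """Hypothetical helper: strip extension, prefix and suffix to leave the star ID."""
    stem = Path(path).stem  # filename without the ".res" extension
    if file_prefix and stem.startswith(file_prefix):
        stem = stem[len(file_prefix):]
    if file_suffix and stem.endswith(file_suffix):
        stem = stem[:-len(file_suffix)]
    return stem


# A bare filename: the stem is already the star ID
bare_id = star_id_from_filename("608154449852709120.res")
print(bare_id)

# A decorated filename (made-up prefix and suffix, for illustration only)
decorated_id = star_id_from_filename(
    "NGC2682_608154449852709120_run1.res",
    file_prefix="NGC2682_",
    file_suffix="_run1",
)
print(decorated_id)
```

Both calls should yield the same ID string, which is what lets the features be matched to the hand labels by `source_id` later on.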

We will use the `create_features` function imported from `base9_ml_utils.py`.


In [None]:
# run this cell to see information about this function
create_features?

In [3]:
# directory on your computer where the .res data files are stored
directory = "data/NGC2682/jw_output"

# create a DataFrame with features for each star using the 'create_features' function
model_cluster_statistic = create_features(directory)
 
# display the resulting DataFrame in the notebook
model_cluster_statistic

Unnamed: 0,source_id,Width,Upper_bound,Lower_bound,Stdev,SnR,Dip_p,Dip_value,KS_value,KS_p,ESS
0,608154449852709120,0.542966,1.183637,1.103694,0.237557,38.492771,0.000000,0.076963,0.448514,0.000000e+00,10197.520773
1,608303231815505920,0.755195,1.183637,1.103694,0.464201,20.268446,0.003637,0.006149,0.144816,1.462136e-112,9632.750604
2,608141294367793024,1.705887,1.183637,1.103694,0.725193,12.193566,0.000000,0.024109,0.135628,1.572096e-99,9726.076429
3,608068623521152384,1.344996,1.183637,1.103694,0.657152,14.096636,0.000085,0.007507,0.190857,7.239580e-196,9428.233718
4,608038764908563968,1.255172,1.183637,1.103694,0.639640,14.644034,0.000000,0.014813,0.207206,5.233938e-233,9639.878183
...,...,...,...,...,...,...,...,...,...,...,...
1423,604694561637360640,1.555578,1.183637,1.103694,0.709386,12.669910,0.000000,0.021891,0.152263,1.939749e-123,9644.372584
1424,604711505283863808,2.474428,1.183637,1.103694,1.159964,7.203835,0.000000,0.118215,0.216214,8.737144e-250,9341.611448
1425,604703465105196416,1.633171,1.183637,1.103694,0.750779,11.732707,0.000000,0.052372,0.168745,3.548160e-154,9777.814808
1426,604712531781276928,1.240942,1.183637,1.103694,0.596285,15.420267,0.000000,0.012573,0.165878,9.222213e-148,9802.529449
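In the rows displayed above, `Upper_bound` and `Lower_bound` take the same value for every star. A feature that never varies cannot contribute to a split in a tree-based classifier, which would be consistent with the zero importance reported for these two columns later in this notebook. The sketch below checks for constant columns using plain Python on a few rows copied from the output above (only three columns are kept for brevity); it is an illustrative check, not part of `base9_ml_utils.py`.

```python
# A few rows copied from the output above (subset of columns)
rows = [
    {"Width": 0.542966, "Upper_bound": 1.183637, "Lower_bound": 1.103694},
    {"Width": 0.755195, "Upper_bound": 1.183637, "Lower_bound": 1.103694},
    {"Width": 1.705887, "Upper_bound": 1.183637, "Lower_bound": 1.103694},
    {"Width": 1.344996, "Upper_bound": 1.183637, "Lower_bound": 1.103694},
]

# A column is constant if it holds exactly one distinct value across all rows
constant_columns = [
    column for column in rows[0]
    if len({row[column] for row in rows}) == 1
]
print(constant_columns)
```

On the full DataFrame, `model_cluster_statistic.nunique()` reports the number of distinct values per column, so the same check can be run on all stars rather than only the rows shown.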


## Read in data for training and testing the model

This dataset contains hand labelled sampling quality for each star that has a `.res` file in the dataset above.  The labels were created by Justyce Watson by visually inspecting the distributions in the `.res` files.

In this dataset we will use the column `Single Sampling` as our label, and only take rows where a label exists.

In [4]:
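# pandas is used directly in this cell, so import it explicitly
# rather than relying on the wildcard import above
import pandas as pd

# Read in the data
df1 = pd.read_csv('data/NGC2682/NGC2682_Age_Stats.csv', sep=',')

# Select only the rows where a 'Single Sampling' label exists,
# and keep only the relevant columns
sampling_df = df1.loc[df1['Single Sampling'].notna(), ['source_id', 'Single Sampling']]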

# Display this DataFrame in the notebook
sampling_df

Unnamed: 0,source_id,Single Sampling
0,597810107020313344,Bad
1,597830722862488064,Bad
2,598464900553093504,Bad
3,598525408052424960,Bad
4,598543206396991232,Bad
...,...,...
1435,605170688827236736,Bad
1436,603848521800034176,Bad
1438,603868141210083712,Bad
1439,607987427163771520,Good


## Create the model
Here we use the `create_model` function imported from `base9_ml_utils.py`.  This function splits the data into training and testing subsets.  The training set is then further modified so that it contains equal numbers of "Good" and "Bad" labelled stars.
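One common way to obtain equal class counts is to randomly undersample the larger class in the training set. The sketch below shows that idea with plain Python on a toy dataset. It is an assumption for illustration: whether `create_model` undersamples, oversamples, or balances the classes in some other way is documented in `base9_ml_utils.py`, not here. The function name `undersample`, the `label` key, and the toy IDs are all hypothetical.

```python
import random
from collections import Counter


def undersample(rows, label_key="label", seed=42):
    """Randomly drop rows from larger classes until every class matches the smallest."""
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    by_label = {}
    for row in rows:
        by_label.setdefault(row[label_key], []).append(row)
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)  # avoid leaving the rows grouped by class
    return balanced


# Toy training set: 8 "Bad" and 3 "Good" rows (made-up IDs)
train = [{"source_id": i, "label": "Bad"} for i in range(8)]
train += [{"source_id": 100 + i, "label": "Good"} for i in range(3)]

balanced = undersample(train)
counts = Counter(row["label"] for row in balanced)
for label in sorted(counts):
    print(f"There are {counts[label]} training elements with classification = {label}")
```

Only the training subset is rebalanced in this sketch; a test set would keep its natural class proportions so that the reported scores reflect the data the model will actually see.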

In [None]:
# run this cell to see information about this function
create_model?

In [6]:
# create the model (returned as a scikit-learn pipeline object, here we call it "pipe")
pipe, X, y, X_train, y_train, X_test, y_test = create_model(model_cluster_statistic, sampling_df)

There are 161 training elements with classification = Bad
There are 161 training elements with classification = Good


## Use the model to generate labels
Here we use the `make_preds` function imported from `base9_ml_utils.py`.  We pass this function the model from `create_model` and the data to be labelled.  For this step we pass the testing data, together with the known labels for the test data so that we can validate the quality of the model.  (Note that you can use `make_preds` without knowing the labels, as we will do in the `apply_model.ipynb` notebook.)

In [None]:
# run this cell to see information about this function
make_preds?

In [8]:
y_pred = make_preds(pipe, X_test, y_test=y_test, 
    feature_columns=[
        "Width",
        "Upper_bound",
        "Lower_bound",
        "Stdev",
        "SnR",
        "Dip_p",
        "Dip_value",
        "KS_value",
        "KS_p",
        "ESS"])


Accuracy: 0.955607476635514
              precision    recall  f1-score   support

         Bad       1.00      0.95      0.97       346
        Good       0.81      1.00      0.90        82

    accuracy                           0.96       428
   macro avg       0.91      0.97      0.93       428
weighted avg       0.96      0.96      0.96       428

Feature Importance Ranking:
Width          0.362741
SnR            0.245086
Stdev          0.221714
KS_value       0.096424
Dip_value      0.043082
ESS            0.022251
Dip_p          0.004935
KS_p           0.003767
Upper_bound    0.000000
Lower_bound    0.000000
dtype: float64
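The accuracy and per-class scores printed above can be reproduced from a confusion matrix. The counts used below are not printed by the notebook; they are inferred from the report (428 test stars, 346 "Bad" and 82 "Good", with 409 classified correctly) and should be read as a consistency check rather than as output from `make_preds`.

```python
# Counts inferred from the report above (an assumption, not notebook output):
# outer key = true label, inner key = predicted label.
confusion = {
    "Bad": {"Bad": 327, "Good": 19},
    "Good": {"Bad": 0, "Good": 82},
}
labels = list(confusion)

total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in labels)
accuracy = correct / total
print(f"accuracy = {accuracy:.4f}")

scores = {}
for label in labels:
    true_pos = confusion[label][label]
    predicted = sum(confusion[true][label] for true in labels)  # column sum
    support = sum(confusion[label].values())                    # row sum
    precision = true_pos / predicted
    recall = true_pos / support
    f1 = 2 * precision * recall / (precision + recall)
    scores[label] = (precision, recall, f1, support)
    print(f"{label:>4}  precision={precision:.2f}  recall={recall:.2f}  "
          f"f1={f1:.2f}  support={support}")
```

If these inferred counts are right, the model did not miss any "Good" star in the test set, while 19 "Bad" stars were labelled "Good". That would explain why the "Good" precision is lower than its recall.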


## Save the model

You can later read the saved model back in to apply it to other datasets.  Note that in order to use a saved model, you will need to be working with the same version of scikit-learn (and possibly other dependencies).

In [9]:
save_model(pipe, filename="my_model.pkl")
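Because a saved model is tied to the library versions used to create it, it can help to record those versions next to the model. The sketch below uses Python's `pickle` module with a plain dictionary as a stand-in for the fitted pipeline. It does not call `save_model`, whose file format is defined in `base9_ml_utils.py` and is not shown here; the bundle layout, the `versions` key, and the filename `toy_model.pkl` are assumptions for illustration.

```python
import pickle
import sys
import tempfile
from pathlib import Path

# Stand-in for the fitted pipeline (any picklable object behaves the same way)
model = {"name": "toy_model", "threshold": 0.5}

# Bundle the model with the versions it was created under.
# For a real model you would also record e.g. sklearn.__version__.
bundle = {
    "model": model,
    "versions": {"python": sys.version.split()[0]},
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.pkl"
    with open(path, "wb") as f:
        pickle.dump(bundle, f)

    # Later (possibly in another notebook): load the bundle back
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Check the recorded version before trusting the restored model
current = sys.version.split()[0]
if loaded["versions"]["python"] != current:
    print("Warning: model was saved under a different Python version")
restored = loaded["model"]
print(restored)
```

Pickle files can execute arbitrary code when loaded, so only load model files that come from a source you trust.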