| [**Overview**](./00_overview.ipynb) | [**EDA**](./01_EDA.ipynb) | [**Using `sklearn` Models**](./02_UsingModels.ipynb) |
| -- | -- | -- |

# Loading and Using Existing `sklearn` Models to Make Predictions

In this notebook we'll:
* Upload some serialized, pre-trained models to Jupyter Lab
* Load these models into `sklearn` objects using `joblib`
* Use these models with appropriate data to make new predictions
* Serialize these predictions for use elsewhere

Note: If you haven't already, run the download notebook [here](../data/DownloadData.ipynb).

When importing existing models, the key requirement is that the version of `sklearn` used to load a model should ideally match the version used to build it. If you load a model with a different version, you'll get a warning in the best case and may get an error. The model files for the IM4NiS project were built with `scikit-learn v1.1.3` (now a bit out of date); this is the version installed by default in this environment (at least via Binder, or if you used the `environment.yml` file associated with this notebook). We can check the version of `sklearn` we're working with:

In [None]:
import sklearn
sklearn.__version__
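
If you'd rather have the notebook flag a mismatch than compare version strings by eye, a check along these lines may help. This is a minimal sketch: `EXPECTED_SKLEARN` and `same_minor_version` are hypothetical names introduced here, and comparing only the major and minor components is an assumption about what counts as "close enough", not a rule from `sklearn` itself.

```python
from importlib.metadata import PackageNotFoundError, version

# Version the IM4NiS model files were built with (as stated in the text above).
EXPECTED_SKLEARN = "1.1.3"


def same_minor_version(installed, expected):
    """Return True if two version strings share major and minor components."""
    return installed.split(".")[:2] == expected.split(".")[:2]


try:
    installed = version("scikit-learn")
except PackageNotFoundError:
    installed = None  # scikit-learn isn't installed in this environment

if installed is None:
    print("scikit-learn is not installed")
elif not same_minor_version(installed, EXPECTED_SKLEARN):
    print(f"Installed {installed}, but the models were built with {EXPECTED_SKLEARN}")
else:
    print(f"scikit-learn {installed} matches {EXPECTED_SKLEARN} (major.minor)")
```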

Now that we've verified we have the right version of `sklearn`, we can load a model file with `joblib`:

In [None]:
import joblib

# This file contains a classifier, hence the name clf; you could call it
# whatever you like, as long as you're consistent.
clf = joblib.load(
    "../data/MachineLearningModels/Spinel_LAICPMS_Binary_Mineralization_Classifier.joblib"
)

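We can see that the classifier is a histogram-based gradient boosting classifier (an ensemble of decision trees which, unlike a random forest, is built sequentially, with each tree correcting the errors of those before it), along with any parameters set at its instantiation: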

In [None]:
clf

We can also check which features the classifier was trained on, in case we don't have the training dataset handy (in this case we do, but it's still good to check):

In [None]:
clf.feature_names_in_

We can also see what classes the classifier predicts, noting here we expect a binary mineralized/unmineralized class:

In [None]:
clf.classes_

One thing to note about this list of features is that it includes some which are unlikely to be provided as standard in most datasets, so we might have to calculate them. These include the lambdas and the associated REE anomalies, Eu/Eu* and Ce/Ce*.
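
To find out which features would need to be calculated for a given dataset, one option is to compare the model's feature names against the columns you already have. The sketch below uses plain Python so it runs on its own; `missing_features` is a hypothetical helper, and the column names are illustrative rather than the model's actual feature list.

```python
def missing_features(required, available):
    """Return the required feature names absent from `available`, in order."""
    available = set(available)
    return [name for name in required if name not in available]


# Illustrative column names only; in the notebook you'd pass
# clf.feature_names_in_ and the columns of your own DataFrame.
required = ["Mg", "Fe", "Eu/Eu*", "Ce/Ce*"]
available = ["Sample", "Mg", "Fe", "La", "Ce"]

to_calculate = missing_features(required, available)
print(to_calculate)  # ['Eu/Eu*', 'Ce/Ce*']
```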

This classifier exposes a `.predict()` method, which we'll use to classify our new data.

In [None]:
clf.predict?

### Training Data, Introspection and Model Performance

As the training and testing data are provided in DAP, you can independently conduct model introspection and performance evaluation in the same way we did during the project.

In [None]:
from pathlib import Path
import pandas as pd


def get_model_data(name):
    """
    Convenience function to do a quick lookup of data files by name,
    and load the relevant items.
    """
    return {
        p.stem.replace(name + "_", ""): (
            pd.read_csv(p) if p.suffix == ".csv" else joblib.load(p)
        )
        for p in Path("../data/MachineLearningModels").rglob("{}*".format(name))
    }


clf_data = get_model_data("Spinel_LAICPMS_Binary_Mineralization")
clf = clf_data['Classifier']

In [None]:
clf_data.keys()

In [None]:
clf_data['XX_test']

In [None]:
clf.predict(clf_data['XX_test'])

In [None]:
clf.score(clf_data['XX_test'], clf_data['yy_test'])
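
For a classifier, `.score()` returns the mean accuracy on the data provided. A single number can hide which kind of error the model makes, so it can be worth tallying the individual outcomes as well. The sketch below does this in plain Python; `accuracy` and `confusion_counts` are hypothetical helpers, and the labels are made up for illustration (the real class labels are whatever `clf.classes_` reports).

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions which match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Made-up labels for illustration; in the notebook these would come from
# clf_data['yy_test'] and clf.predict(clf_data['XX_test']).
y_true = ["mineralized", "unmineralized", "mineralized", "unmineralized"]
y_pred = ["mineralized", "unmineralized", "unmineralized", "unmineralized"]

print(accuracy(y_true, y_pred))  # 0.75
counts = confusion_counts(y_true, y_pred)
print(counts[("mineralized", "unmineralized")])  # 1
```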

### Classifications on New Data

To make classifications on new data, we typically need to:

* Read the data
* Transform into consistent units
* Do any required geochemical transformation
* Add any extra features (i.e., lambdas)
* Drop any unrequired columns
* *In cases where the model can't handle missing data*: Decide how to eliminate missing data - dropping rows, columns, or both.
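
Some of the steps above can be sketched with `pandas` as follows. This is a minimal illustration under stated assumptions, not the project's actual preprocessing: `prepare_features` is a hypothetical helper, the column names and values are made up, converting wt% to ppm is only an example of a unit transformation, and the geochemical transformations and extra features (e.g., lambdas) are left out entirely.

```python
import numpy as np
import pandas as pd


def prepare_features(df, feature_names, wt_pct_columns=()):
    """Return a model-ready table with the expected columns, in order."""
    out = df.copy()
    # Convert any columns reported in wt% to ppm (1 wt% = 10,000 ppm).
    for col in wt_pct_columns:
        out[col] = out[col] * 10_000
    # Keep only the columns the model expects, in the order it expects them.
    out = out[list(feature_names)]
    # Drop rows with missing values (only needed if the model can't handle NaN).
    return out.dropna(axis="index", how="any")


# Made-up values and column names, purely for illustration.
new_data = pd.DataFrame(
    {
        "Sample": ["A", "B", "C"],
        "Mg": [1.0, 2.0, np.nan],  # wt%
        "Ni": [120.0, 95.0, 310.0],  # ppm
    }
)

X_new = prepare_features(new_data, ["Ni", "Mg"], wt_pct_columns=["Mg"])
print(X_new)
```

In the notebook, `feature_names` would be `clf.feature_names_in_`, and `clf.predict(X_new)` would then return the classifications, which could be written out with `to_csv` for use elsewhere. If the classifier is `sklearn`'s histogram-based gradient boosting implementation, it handles missing values natively, so the final `dropna` step may not be needed.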

