| [**Overview**](./00_overview.ipynb) | [**From Data Exploration to Machine Learning**](./01_EDA.ipynb) | [**Using `sklearn` Models**](./02_LoadModels.ipynb) | [**Making Predictions**](./03_Predictions.ipynb)|
| -- | -- | -- | -- |

# Loading and Using Existing `sklearn` Models

In this notebook we'll:
* Upload some serialized, pre-trained models to Jupyter Lab
* Load these models into `sklearn` objects using `joblib`

Note: If you haven't already, run the download notebook [here](../data/DownloadData.ipynb).

When importing existing models, the `sklearn` version used to load them should ideally match the version used to build them; loading a model with a mismatched version gives a warning at best and may raise an error. The model files for the IM4NiS project were built with `scikit-learn v1.1.3` (now a bit out of date); this is the default version installed in this environment (at least via Binder, or if you used the `environment.yml` file associated with this notebook). We can check the version of `sklearn` we're working with:

In [None]:
import sklearn

sklearn.__version__
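The comparison against the expected version can also be made explicit. The following is a minimal sketch, not part of the project code: the helper name `same_minor_release` is hypothetical, and treating a matching major and minor version as "close enough" is an assumption (an exact match is the safest case).

```python
from importlib.metadata import PackageNotFoundError, version

# Version the model files were built with, as stated above.
EXPECTED_VERSION = "1.1.3"


def same_minor_release(installed, expected):
    """Return True when the major and minor version components agree.

    Hypothetical helper: comparing only major.minor is an assumption.
    """
    return installed.split(".")[:2] == expected.split(".")[:2]


try:
    installed = version("scikit-learn")
except PackageNotFoundError:
    installed = None

if installed is None:
    print("scikit-learn is not installed in this environment")
elif same_minor_release(installed, EXPECTED_VERSION):
    print(f"scikit-learn {installed} is in the {EXPECTED_VERSION} release series")
else:
    print(f"scikit-learn {installed} differs from {EXPECTED_VERSION}; expect warnings on load")
```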

Now that we've verified we have the right version of `sklearn`, we can load a model file with `joblib`:

In [None]:
import joblib

# This is a classifier file, hence the name clf; any name works as long as it's used consistently.
clf = joblib.load(
    "../data/MachineLearningModels/Spinel_LAICPMS_Binary_Mineralization_Classifier.joblib"
)
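Version-mismatch warnings are easy to miss in notebook output. One way to make them explicit is to record any warnings raised during loading. This is a rough sketch under stated assumptions: `load_and_collect_warnings` and `fake_loader` are hypothetical names, and the stand-in loader only imitates a warning so the example runs without a model file.

```python
import warnings


def load_and_collect_warnings(loader, path):
    """Call loader(path) and return the loaded object plus any warning messages."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        obj = loader(path)
    return obj, [str(w.message) for w in caught]


def fake_loader(path):
    """Stand-in for a real loader; it imitates a version-mismatch warning."""
    warnings.warn("model was saved with a different library version", UserWarning)
    return {"path": path}


model, messages = load_and_collect_warnings(fake_loader, "model.joblib")
print(messages)
```

In this notebook, the same helper could be called with `joblib.load` as the loader and the classifier path used above.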

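We can see that the classifier is a histogram-based gradient-boosted classifier (an ensemble of decision trees, like a random forest, but with the trees built sequentially by boosting rather than independently), along with any parameters set at its instantiation: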

In [None]:
clf

We can also check which features the classifier was trained on, in case we don't have the training dataset handy (in this case we do, but it's still good to check). Note that this list includes some features which are unlikely to be provided as standard in most datasets, so we might have to calculate them. For the spinel LAICPMS-based classifier, the list includes `lambdas`, which parameterize REE profiles; you can [read a bit more about them in the pyrolite documentation](https://pyrolite.readthedocs.io/en/main/examples/geochem/lambdas.html), including how to calculate them and associated anomalies should you wish.

In [None]:
clf.feature_names_in_
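Before classifying a new dataset, it can help to list which of these features still need to be calculated. This is a minimal sketch: `missing_features` is a hypothetical helper, and the feature names below are placeholders for illustration, not the classifier's actual feature list.

```python
def missing_features(required, available):
    """Return the required feature names absent from the available columns, in order."""
    available = set(available)
    return [name for name in required if name not in available]


# Placeholder names for illustration only; the real list comes from clf.feature_names_in_.
required = ["Mg", "Fe", "lambda_0", "lambda_1"]
available = ["Mg", "Fe", "Ti"]

print(missing_features(required, available))  # ['lambda_0', 'lambda_1']
```

With a `pandas` table, the call might look like `missing_features(clf.feature_names_in_, df.columns)`, where `df` is assumed to hold the new data.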

We can also see what classes the classifier predicts, noting here we expect a binary mineralized/unmineralized class:

In [None]:
clf.classes_

This classifier exposes a `.predict()` method, which we'll use to classify any new data (next notebook), or in testing our model (below):

In [None]:
clf.predict?

### Training Data, Introspection and Model Performance

As the training and testing data are provided in DAP (which we've also downloaded here), you can independently conduct model introspection and performance evaluation in the same way we did during the project. Here we use a convenience function (`get_model_data()`) which fetches the related data for you and puts it in a dictionary:

In [None]:
from util import get_model_data

clf_data = get_model_data("Spinel_LAICPMS_Binary_Mineralization")

In [None]:
clf_data.keys()

From this, we can pull out the classifier:

In [None]:
clf = clf_data['Classifier']
clf

And some of the data used to train and test it:

In [None]:
clf_data['XX_test']

We can also directly use the classifier, to make predictions or otherwise, e.g. on the test dataset:

In [None]:
clf.predict(clf_data['XX_test'])

Given we know the appropriate labels for this test dataset, we can also use it to score the model (i.e., determine the performance/accuracy):

In [None]:
clf.score(clf_data['XX_test'], clf_data['yy_test'])
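For `sklearn` classifiers, `.score()` reports mean accuracy: the fraction of samples whose predicted label equals the true label. The calculation can be sketched by hand as below; the labels are made up for illustration and are not taken from the test dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]

print(accuracy(y_true, y_pred))  # 0.75
```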

Another, more general and visual way to look at model performance is a confusion matrix; there's a convenience function in `pyrolite` to quickly make one:

In [None]:
from pyrolite.util.skl.vis import plot_confusion_matrix

plot_confusion_matrix(
    clf_data['Classifier'],
    clf_data['XX_test'],
    clf_data['yy_test'],
    normalize=True,
)
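To show what such a plot summarises, here is a small sketch that counts a confusion matrix by hand and normalizes it. The labels are made up, both function names are hypothetical, and normalizing by row (so each true class sums to 1) is an assumption; it may not be exactly what `normalize=True` does in `pyrolite`.

```python
def confusion_counts(y_true, y_pred, labels):
    """Count samples per (true label, predicted label); rows are true labels."""
    index = {label: i for i, label in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    return counts


def normalize_rows(counts):
    """Divide each row by its total so that each true class sums to 1."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Made-up binary labels for illustration only.
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 0, 1, 1, 1]

counts = confusion_counts(y_true, y_pred, labels=[0, 1])
normalized = normalize_rows(counts)
print(counts)  # [[2, 1], [0, 2]]
print(normalized)
```

The diagonal holds correct classifications; off-diagonal entries show which class is mistaken for which.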

We can also look at things like feature importance; here using permutation importance:

In [None]:
from util import plot_permutation_importances

plot_permutation_importances(clf_data)
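The idea behind permutation importance is to shuffle one feature at a time and measure how much the model's score drops. The sketch below illustrates this with a toy model and made-up data. It is not the implementation used by `plot_permutation_importances` (whose internals are not shown here), and all names in it are hypothetical. `sklearn` provides its own version as `sklearn.inspection.permutation_importance`.

```python
import random


def model_accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = model_accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between feature j and the labels
            X_perm = [row[:j] + [value] + row[j + 1:] for row, value in zip(X, column)]
            drops.append(baseline - model_accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_predict(row):
    """Toy model that only looks at the first feature."""
    return int(row[0] > 0.5)


# Made-up data: the first feature determines the label, the second is ignored.
X = [[i % 2, (i * 7) % 5] for i in range(20)]
y = [row[0] for row in X]

importances = permutation_importance_sketch(toy_predict, X, y)
print(importances)
```

The feature the toy model ignores gets an importance of exactly zero, since shuffling it cannot change any prediction.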
