| [**Overview**](./00_overview.ipynb) | [**From Data Exploration to Machine Learning**](./01_EDA.ipynb) | [**Using `sklearn` Models**](./02_LoadModels.ipynb) | [**Making Predictions**](./03_Predictions.ipynb)|
| -- | -- | -- | -- |

# Making Predictions: Classifying New Data

In this notebook, we will:
* Use IM4NiS models with appropriate data to make new predictions
* Demonstrate how to use the training data to make new models
* Plot our predictions on a map
* Serialize these predictions for use elsewhere

Note: If you haven't already, run the download notebook [here](../data/DownloadData.ipynb).

In [None]:
from pathlib import Path

import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import pyrolite.geochem
import pyrolite.plot

from util import get_model_data


First we'll load up the model data for the TIMA binary mineralization classifier:

In [None]:
clf_data = get_model_data("Spinel_TIMA_Binary_Mineralization")

Next we'll load the Heavy Mineral Map of Australia (HMMA) dataset, which includes geospatial information we'll use a bit later. First we'll check that it exists:

In [None]:
assert Path(
    "../data/HMMA_Spinel_Locations_apfu.xlsx"
).exists(), "Missing dataset - you might need to upload it to the 'data' folder"

In [None]:
df = pd.read_excel("../data/HMMA_Spinel_Locations_apfu.xlsx").dropna(how="all")
gdf = gpd.GeoDataFrame(  # turn our dataframe into a geodataframe, which natively knows coordinates
    df, geometry=gpd.points_from_xy(df["LONG_GDA94"], df["LAT_GDA94"]), crs="GDA94"
)
# the TIMA data comes with summary rows we don't need, so we can drop them
gdf = gdf.loc[~gdf.SampleID.isin(["Average concentration", "Standard deviation"])]



To line a new dataset up with our training dataset, we'll roughly need to:
* Read the data
* Transform into consistent units
* Do any required geochemical transformation
* Add any extra features (e.g., lambdas)
* Drop any unrequired columns
* *In cases where the model can't handle missing data*: Decide how to eliminate missing data - dropping rows, columns, or both.
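
The steps above can be sketched on a small made-up table. The column names, values and the `prepare` helper below are hypothetical and only illustrate the order of operations; in this notebook the geochemical conversions themselves are handled by `pyrolite`.

```python
import numpy as np
import pandas as pd

# Hypothetical raw analyses in weight percent, with one non-chemical column.
raw = pd.DataFrame(
    {
        "SampleID": ["A", "A", "B"],
        "Cr": [30.0, 28.0, np.nan],
        "Fe": [20.0, 22.0, 25.0],
        "Comment": ["ok", "ok", "check"],
    }
)


def prepare(df, features):
    """Illustrative helper: select features, rescale wt% to fractions, drop incomplete rows."""
    X = df[features] / 100  # consistent units: fractions rather than percent
    return X.dropna(how="any")  # only needed if the model can't handle missing data


X = prepare(raw, ["Cr", "Fe"])
print(X)
```

Dropping rows is only one option here; dropping sparse columns instead keeps more analyses at the cost of fewer features.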

We already have a model for TIMA compositions of spinel, *but* the features it requires differ from those in the HMMA dataset (they differ in representation as elements or oxides, and in which elements were measured):

In [None]:
df.pyrochem.list_oxides

Which differs from what our classifier expects:

In [None]:
list(clf_data["Classifier"].feature_names_in_)

Because of this mismatch in data, we'll need to build a new model based on the common subset of geochemistry we have in both the training dataset and the prediction dataset. Luckily, we have everything we need already. We can figure out what overlap we have in terms of mineral chemistry; *note that `pyrolite` will spit out warnings for missing geochemical species*:

In [None]:
common_subset = (
    df.pyrochem.convert_chemistry(to=clf_data["Classifier"].feature_names_in_)
    .pyrochem.elements.dropna(how="all", axis=1)
    .columns
)
common_subset
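
Setting the oxide-to-element conversion aside, the overlap itself is just a set intersection. A minimal sketch with hypothetical feature lists (not the actual model or HMMA columns):

```python
import pandas as pd

# Hypothetical feature names, for illustration only.
model_features = pd.Index(["Mg", "Al", "Cr", "Fe", "Zn", "V"])
dataset_features = pd.Index(["Mg", "Al", "Cr", "Fe", "Ti"])

# Keep the model's ordering, retaining only features present in both.
shared = model_features[model_features.isin(dataset_features)]
print(list(shared))
```

Filtering the model's feature list (rather than the dataset's) keeps the columns in the order the training data uses.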

We can use this to define a new subset of data to train a new model on:

In [None]:
X_train = clf_data["XX_train"][common_subset]
y_train = clf_data["yy_train"].iloc[:, 0]  # take just the first column, as this is a 1-column dataframe
X_train.head()

And then train a simple random forest model, here with default parameterization (other than controlling the random seed, so everyone gets the same results):

In [None]:
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=17)
clf.fit(X_train, y_train)
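
Because this model sees fewer features than the original classifier, it may be worth checking how it performs before relying on its predictions. One option is cross-validation on the training data. The sketch below uses synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`, so the scores themselves mean nothing for spinel; the names `X_demo`, `y_demo` and `demo_clf` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the training data (not the spinel dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=17)

demo_clf = RandomForestClassifier(n_estimators=50, random_state=17)
# 5-fold cross-validated accuracy; swap in X_train, y_train in the notebook.
scores = cross_val_score(demo_clf, X_demo, y_demo, cv=5)
print(scores.mean())
```

For imbalanced classes, passing `scoring="balanced_accuracy"` to `cross_val_score` may be more informative than plain accuracy.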

The prediction data come in weight percent, which we'll need to convert to fractional compositions (summing to 1, rather than 100%) to match our training dataset:

In [None]:
X_predict = gdf.pyrochem.convert_chemistry(to=common_subset).pyrochem.elements / 100
X_predict.head()
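
A quick sanity check after rescaling is that no analysis sums to more than 1 (a subset of elements will generally sum to less). A sketch with made-up values:

```python
import pandas as pd

# Made-up element concentrations in weight percent.
wt_pct = pd.DataFrame({"Mg": [8.0, 10.0], "Al": [12.0, 9.5], "Cr": [30.0, 28.0]})

fractions = wt_pct / 100  # fractional composition
totals = fractions.sum(axis=1)
# Tolerate small floating point error in the comparison.
assert (totals <= 1 + 1e-9).all(), "Totals above 1 suggest a units problem"
print(totals)
```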


We also have some missing data in our prediction dataset, and in this instance we won't be making predictions for those analyses (to do so, we'd need to use a different model, or impute the values):

In [None]:
fltr = ~pd.isna(X_predict).any(axis=1)

We can now filter our dataset, and make predictions on analyses which don't have missing data:

In [None]:
gdf.loc[fltr, "Prediction"] = clf.predict(X_predict[fltr])
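
Assigning through `.loc` with a boolean mask only fills the selected rows; the rest are left as missing values. A toy illustration, where a hypothetical threshold rule stands in for `clf.predict`:

```python
import numpy as np
import pandas as pd

X_demo = pd.DataFrame({"Cr": [0.30, np.nan, 0.10], "Fe": [0.20, 0.25, 0.40]})
out = X_demo.copy()

complete = ~X_demo.isna().any(axis=1)  # True where no feature is missing
# Stand-in for clf.predict: a hypothetical threshold on Cr.
labels = np.where(X_demo.loc[complete, "Cr"] > 0.2, "Mineralized", "Unmineralized")
out.loc[complete, "Prediction"] = labels
print(out)
```

The row with a missing `Cr` value keeps a missing `Prediction`, which is why later steps need to cope with samples that have no predictions.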

And have a look at how many grains fall into each predicted class (notably, mostly unmineralized, which is probably to be expected):

In [None]:
gdf.loc[fltr, "Prediction"].value_counts()
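
If proportions are easier to read than raw counts, `value_counts` accepts `normalize=True`. A toy example with made-up labels:

```python
import pandas as pd

# Made-up predictions, for illustration only.
labels = pd.Series(["Unmineralized"] * 3 + ["Mineralized"])
proportions = labels.value_counts(normalize=True)
print(proportions)
```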

While looking at predictions on a grain-by-grain basis provides lots of information, we typically want to look at the data on a sample-by-sample basis (or potentially something coarser). So here we'll aggregate our predictions by sample, and extract both the proportion of grains predicted to be from mineralized hosts and the overall number of predictions made for each sample:

In [None]:
predictions_by_sample = gdf.dissolve(by="SampleID")[["geometry"]]
predictions_by_sample["prop_min"] = (
    gdf["Prediction"]
    .map(dict(Mineralized=1, Unmineralized=0))
    .groupby(gdf["SampleID"])
    .mean()
)
# some samples might not have predictions, so we replace zero counts with NaN to avoid divide by zero/log(0) errors
predictions_by_sample["counts"] = (
    gdf["Prediction"].groupby(gdf["SampleID"]).count().replace(0, np.nan)
)
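
To see what this aggregation produces, here is the same pattern on a toy table (sample names and labels are made up). `count()` ignores missing values, so a sample with no predictions gets a count of 0, which is then replaced with NaN.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "SampleID": ["S1", "S1", "S1", "S2", "S3"],
        "Prediction": ["Mineralized", "Unmineralized", "Unmineralized", "Mineralized", np.nan],
    }
)

# Encode labels as 1/0 so the per-sample mean is the mineralized proportion.
numeric = toy["Prediction"].map({"Mineralized": 1, "Unmineralized": 0})
summary = pd.DataFrame(
    {
        "prop_min": numeric.groupby(toy["SampleID"]).mean(),
        "counts": toy["Prediction"].groupby(toy["SampleID"]).count().replace(0, np.nan),
    }
)
print(summary)
```

Here `S1` has one mineralized grain out of three, and `S3` has no predictions at all, so both of its values are missing.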


As we have the locations for these samples, we can visualize all of this on a map:

In [None]:
from util import plot_sample_predictions

plot_sample_predictions(predictions_by_sample)

We can export the predictions for each individual grain, and also for each sample (here to GeoPackage, but there are a number of potential formats):

In [None]:
gdf.to_file("../data/HMMA_spinel_with_predictions.gpkg")
predictions_by_sample.to_file("../data/HMMA_spinel_predictions_by_sample.gpkg")

We could also export this to a shapefile:

In [None]:
predictions_by_sample.to_file(
    "../data/HMMA_spinel_predictions_by_sample.shp",
)
