# Classification Modeling


## ML in Chemistry

Machine learning (ML) applications in chemistry are becoming more mainstream [@joshi_2023; @baum_2021].

## Examples of ML Algorithms in Chemistry

## Decision Trees

According to @loh_2014 [p. 330], what we would now recognize as a decision tree was first published by @morgan_1963 for regression, with classification following nine years later by @messenger_1972.

An advantage of decision trees over other models is that it is not necessary to scale or center the input data [@géron_2019, p. 177].

## Decision Tree Ensembles

Ensemble methods aggregate the predictions of multiple individual base estimators to improve robustness, taking advantage of a concept termed the *wisdom of the crowd* [@géron_2019, chap. 7, p. 189]: while any one model may fail to adequately solve the problem, combining the strengths and flaws of distinct models hopefully produces a better-performing model than any one alone. A prominent demonstration was the Netflix Prize, launched in 2006, in which the \$1 million USD prize was claimed by a super-group formed from individual competitor groups who ensembled their models together to win the competition [@kunapuli_2023, pt. 1, p. 17].
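
The wisdom-of-the-crowd effect can be illustrated numerically. The sketch below is not taken from the cited sources: it assumes every base classifier is correct with the same probability `p` and, crucially, that their errors are independent, which real models trained on the same data rarely satisfy. The function name `majority_vote_accuracy` is hypothetical.

```python
import math


def majority_vote_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent voters is correct.

    Assumes each base classifier is correct with probability p (0 < p < 1),
    that errors are independent, and that n is odd so ties cannot occur.
    """
    total = 0.0
    for k in range(n // 2 + 1, n + 1):
        # Binomial probability of exactly k correct votes, computed in log space
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p)
        )
        total += math.exp(log_pmf)
    return total


single = majority_vote_accuracy(0.51, 1)    # one weak learner on its own
crowd = majority_vote_accuracy(0.51, 1001)  # majority vote of 1001 weak learners
print(f"single: {single:.3f}, ensemble of 1001: {crowd:.3f}")
```

Under these idealized assumptions the ensemble is noticeably more accurate than any single weak learner; correlated errors would shrink that gain.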

Decision tree ensembles are composed of decision trees trained with the Classification And Regression Tree (CART) algorithm. CART tries to split the training set into the two 'purest' subsets by searching over combinations of a feature $k$ and threshold value $t_k$, with the impurity of each subset weighted by its size. Once split, the same process is applied to each subset, then to their subsets, and so on until a depth defined by the hyperparameter `max_depth` is reached or no split reduces impurity further. It is a greedy algorithm: when splitting at the current depth, it does not consider how that split will affect the purity of later splits. A greedy search is used because finding the optimal tree is an *NP-Complete* problem [@géron_2019, pp. 179-180].
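
A single CART split can be sketched as follows. This is a minimal illustration, assuming Gini impurity as the purity measure and midpoints between consecutive distinct feature values as candidate thresholds; the helper names `gini` and `best_split` and the toy data are hypothetical, and a full implementation would apply `best_split` recursively up to `max_depth`.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(X, y):
    """Return (k, t_k, cost) minimizing the size-weighted impurity of the two subsets."""
    m = len(y)
    best = (None, None, float("inf"))
    for k in range(len(X[0])):
        values = sorted({row[k] for row in X})
        # Candidate thresholds: midpoints between consecutive distinct values
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[k] <= t]
            right = [label for row, label in zip(X, y) if row[k] > t]
            cost = len(left) / m * gini(left) + len(right) / m * gini(right)
            if cost < best[2]:
                best = (k, t, cost)
    return best


# Toy data: feature 0 separates the classes perfectly, feature 1 does not
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 5.5], [4.0, 4.5]]
y = ["a", "a", "b", "b"]
k, t_k, cost = best_split(X, y)
print(k, t_k, cost)
```

The search is greedy in the sense described above: it only scores the immediate split and never looks ahead to the splits that will follow.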

@olson_2018 found that tree-based ensemble methods such as XGBoost significantly outperformed other models in solving bioinformatics classification problems.
 

### Boosting

Boosting (originally *hypothesis boosting*) refers to any ensemble method that combines several weak learners into a strong learner. Predictors are trained sequentially, each trying to correct its predecessor. Examples of boosting include AdaBoost (Adaptive Boosting) and Gradient Boosting. Gradient boosting, first proposed by @breiman_1997 and developed by @friedman_2001, produces an ensemble model by sequentially fitting each new predictor to the residual errors made by its predecessor. When decision trees are the base predictors of the ensemble model, this is referred to as Gradient Tree Boosting, and the resulting model is referred to here as a Gradient Boosted Tree Model (GBTM). Methods for optimizing a GBTM include controlling the growth of the individual trees and of the ensemble itself [@géron_2019].
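
The residual-fitting loop can be sketched in plain Python. This is an illustrative simplification, not the algorithm of any particular library: it assumes a single input feature, squared-error regression (where the negative gradient equals the residual), and depth-1 trees (stumps) as weak learners. The names `fit_stump` and `gradient_boost` and the default `n_estimators` and `learning_rate` values are hypothetical.

```python
def fit_stump(x, r):
    """Fit a depth-1 regression tree to targets r on a single feature x."""
    best = None
    values = sorted(set(x))
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((ri - left_mean) ** 2 for ri in left)
               + sum((ri - right_mean) ** 2 for ri in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= t else right_mean


def gradient_boost(x, y, n_estimators=20, learning_rate=0.5):
    """Sequentially fit stumps to the residuals of the current ensemble."""
    base = sum(y) / len(y)  # initial prediction: the mean of the targets
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)  # new learner corrects its predecessors
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]

    def predict(xi):
        return base + learning_rate * sum(s(xi) for s in stumps)

    return predict


x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]
model = gradient_boost(x, y)
baseline = sum(y) / len(y)
mse_baseline = sum((yi - baseline) ** 2 for yi in y) / len(y)
mse_boosted = sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)
print(mse_baseline, mse_boosted)
```

Limiting the number of stumps and shrinking each contribution with the learning rate are simple instances of controlling the growth of the ensemble.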

## XGBoost

* An improvement to the Decision Tree Ensemble Model (DTEM), first published in 2016 by Tianqi Chen and Carlos Guestrin of the University of Washington [@chen_2016] as part of the Distributed Machine Learning Community (DMLC) [@géron_2019].
* The history of development was as follows: decision tree models were found to perform well but readily over-fit the training data, and researchers found that creating ensembles of decision trees with gradient boosting improved results. Tianqi Chen then optimized DTEMs and dubbed the result eXtreme Gradient Boosting (XGBoost) [@wade_2020].

* XGBoost advantages:
  + can handle **missing values**.
  + **sparsity-aware split finding**. XGBoost uses sparse matrices to handle sparse data, e.g. the result of one-hot encoding, resulting in significant speed increases.
  + enables **parallel computing** by sorting and compressing the data into discrete blocks upon which computations can be performed independently, then combined for the final result.
  + **cache-aware access**. Gradient statistics are prefetched into a per-thread buffer to reduce cache misses.
  + **block compression and sharding**. For data that does not fit in memory, blocks are compressed on disk and can be sharded across multiple disks.
  + **regularization**. XGBoost adds a penalty on model complexity to the learning objective. In this way, XGBoost can be considered a regularized version of Gradient Boosting.

## XGBoost in Chemometrics

* @mustapha_2016 thoroughly reviewed the application of contemporary classification models to seven different biomolecule datasets and found that XGBoost performed best, with accuracies ranging from 94.47% to 98.49%.

### XGBoost in Chromatography

* @tian_2021 used XGBoost to classify liquor and colon cancer datasets (separately) using features extracted by K-means clustering, with a 100% success rate.
* @guan_2023 used XGBoost with LC-MS/MS amino acid data fused with patient demographic information to predict lung cancer occurrence with an accuracy of 75.29%.

### XGBoost in Spectroscopy

* Most recently, @vanwyngaard_2023 compared infrared spectroscopy techniques combined with XGBoost for predicting properties of grapevine organs.
* @yokoyama_2022 compared a selection of ML models, including XGBoost, for predicting the effects of chemical compounds on aquaculture membranes from NMR spectroscopy data, with XGBoost the most performant.
* @zou2023 profiled peanut seeds with hyperspectral imaging and constructed an XGBoost classification model with 80% accuracy.
* @ranaweera-AuthenticationGeographicalOrigin-2021 modelled Cabernet Sauvignon wine A-TEEM data to classify by geographic origin with 100% accuracy.

## XGBoost Algorithm

$$\text{obj}(\theta)=l( \theta ) + \Omega ( \theta ) $$

Here $\text{obj}(\theta)$ is the learning objective, $l( \theta )$ is the loss function, and $\Omega ( \theta )$ is the regularization term. The explicit regularization term sets XGBoost apart from other tree ensemble approaches [@wade_2020; @chen_2016].
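
For reference, the two terms can be written out more explicitly. The expansion below follows the general form given in the XGBoost paper, with $M$ trees $f_m$, $T$ leaves per tree, leaf weights $w_j$, and penalty coefficients $\gamma$ and $\lambda$; the tree index $m$ is a notational choice made here to avoid clashing with the feature index $k$ used earlier.

$$\text{obj}(\theta)=\sum_{i=1}^{n} l(y_i,\hat{y}_i)+\sum_{m=1}^{M}\Omega(f_m),\qquad \Omega(f)=\gamma T+\frac{1}{2}\lambda\sum_{j=1}^{T}w_j^2$$

As a sketch for classification, and assuming the binary logistic loss with raw score $\hat{y}_i$ and predicted probability $p_i$, the per-sample loss and the first and second derivatives used by the boosting step would be:

$$l(y_i,\hat{y}_i)=-\left[y_i\ln p_i+(1-y_i)\ln(1-p_i)\right],\qquad p_i=\frac{1}{1+e^{-\hat{y}_i}}$$

$$g_i=\frac{\partial l}{\partial \hat{y}_i}=p_i-y_i,\qquad h_i=\frac{\partial^2 l}{\partial \hat{y}_i^2}=p_i(1-p_i)$$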

*Note to self (2023-09-25 16:19:55): can't find an example of the derivation of the learning objective for classification, only for regression in [@wade_2020, pp. 77-80]. It will presumably need to be included in the thesis, however it seems like a poor use of time right now.*

## XGBoost in Python

XGBoost is available as a stand-alone Python package [xgboost](https://xgboost.readthedocs.io/en/stable/install.html).

It also provides a Scikit-Learn compatible interface, described [in the XGBoost documentation](https://xgboost.readthedocs.io/en/stable/python/sklearn_estimator.html), with estimators for regression, classification, and ranking. According to the [Scikit-Learn developer guide](https://scikit-learn.org/stable/developers/develop.html#rolling-your-own-estimator), the primary motivation for following the estimator API is integration with Scikit-Learn's `model_selection.GridSearchCV` and `pipeline.Pipeline` tools, as well as cross-library compatibility, as other third-party libraries rely on it.


### XGBoost For Classification


### Dataset

A CUPRAC red wine dataset at 450 nm has been constructed and stored as a Parquet file in [processing_cup_rw_dset](./processing_cup_rw_dset.ipynb). The filepath to the data file is stored in `definitons.RW_CUP_450_PROCESSED`.


### First Run Classification model

A first attempt at a classification model will be undertaken in [xgboost_modeling](./xgboost_modeling.ipynb).
