# Decision Tree Numerical Splitter Bake-off: Louppe (Sklearn) vs. Wright (Ranger)
*© Copyright James Dellinger, 2022-2025. All rights reserved.*

The two most widely-used modern Random Forests implementations are those from [Sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) and [Ranger](https://github.com/imbs-hl/ranger). [Gilles Louppe](https://glouppe.github.io/) designed and wrote the core parts of Sklearn's module, which has emerged as the go-to choice for data scientists hailing from the Python ecosystem. Ranger's creator [Marvin Wright](https://mnwright.github.io/), on the other hand, has geared his Random Forests implementation toward practitioners more comfortable living in the R language ecosystem.

The two libraries' differences run deeper than the choice of API language, and it's arguable that their respective under-the-hood approaches to implementing decision trees are far more disparate, and more interesting, than any of their surface-level dissimilarities. In particular, when it comes to implementing numerical split finding, the central step of the decision tree fitting process, these libraries' approaches could not be more different.

In this notebook I prototype in pure Python and then write Cython re-implementations of both Sklearn's and Ranger's numerical split finding algorithms, highlighting along the way the tradeoffs inherent in each library's design decisions. I also integrate each split-finder into its own full-fledged decision tree classifier and use it to fit a tree model to Kaggle's [Titanic dataset](https://www.kaggle.com/c/titanic). 

Because dataset size might influence the relative speeds of the two candidate numerical split finders, I run a further sanity check of fitting each of my Cython decision tree implementations to the Kaggle Santander Customer Satisfaction competition [dataset](https://www.kaggle.com/competitions/santander-customer-satisfaction/data), which is substantially larger (about 60K rows) than the Titanic dataset (about 800 rows). 

### Spoilers: Louppe is best for most implementations and practitioners
For those who prefer to read the ending first, here are the times that it took for each of my Cython implementations to fit a decision tree classifier model to the Santander data:

|Splitting Algorithm|Pre-sorting Required|Speed on Santander Dataset|
|---|---|---|
|Louppe|No|137 ms|
|Wright SmallQ|No|874 ms|
|Wright LargeQ|Yes|138 ms|
|Wright SmallQ/LargeQ|Yes|139 ms|
|Louppe SmallQ/Wright LargeQ|Yes|129 ms| 

The best results were attained by a hybrid I made that used Louppe's splitter to split nodes where the ratio of `num samples in node`/`num unique training set values for splitting feature` was less than 0.02, and used Wright's LargeQ splitter otherwise.

It's crucial to keep in mind that in order to use Wright's LargeQ splitter during training, the practitioner must pre-sort the unique values of each numerical feature in the training set, and then create a lookup table of shape `(# training rows, # numerical features)` where rows' raw numerical feature values are replaced with the indices of where the original values can be found in the pre-sorted unique values table.

Creating these two tables takes time and extra memory, and my judgement is that the extra effort is not worth the modest ~5% speed-up gained over using Louppe splitting alone. 

An interesting question that's beyond the scope of these experiments is whether Wright's LargeQ sorting could ever be worth the pre-sorting costs of time and memory. Perhaps in situations where tree nodes contain millions or even billions of rows, while numerical features possess a small number of unique values (say, four to ten), Louppe's strategy of relocating each node's rows' values into a contiguous memory buffer to facilitate just-in-time in-place sorting might become prohibitively expensive. 

### In a nutshell, Louppe's Numerical Split Finder: Sort at every split
[Gilles Louppe](https://glouppe.github.io/) outlines the design of Sklearn's numerical split finder on pages 107 to 110 of his doctoral [dissertation](https://arxiv.org/abs/1407.7502):
1. For each training sample (row) in the current node, copy the value held by that row for a given candidate numerical feature into a static, contiguous buffer (in other words, into a new 1-d array).
2. Use introsort to sort these feature values in ascending order (duplicates are kept) and simultaneously do an in-place argsort of the row indices of the node's samples so that each row index is in the same position (relative to the node's start index) as its corresponding feature value (relative to the beginning of the temporary 1-d array of feature values).
3. Iterate through each unique feature value, from smallest to largest, calculating the split criterion score (usually Gini) for the left and right child nodes for each split-point.

### Breiman and Cutler's Alternate Approach: Pre-sort
In his PhD thesis, Louppe acknowledges that an alternate split finding strategy is used by Breiman and Cutler in their [Fortran implementation](https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm) of Random Forests. Before training begins, they create lists of argsorted row indices for each numerical feature.

As Louppe explains on page 110, having numerical features pre-sorted reduces the complexity of best split searches from $O(KNlogN)$ to $O(KN)$ (where $N$ is the number of samples in the parent node and $K$ is the number of candidate features randomly selected for that node). 

It turns out, though, that there is no free lunch here as pre-sorting introduces new complications that can raise complexity in other areas. Specifically, since pre-sorting requires us to store a sorted list of row indices for *each* numerical feature, everytime we split a parent node we'd have to re-organize each numerical feature's argsorted row indices such that the rows belonging to the left child are contiguous and those residing in the right child are next to each other.

This is no doubt a hassle, but the beauty of this process is that as the tree is grown, each numerical feature's argsorted rows will remain properly sorted in all child nodes that have yet to be split. The prospect of not having to run a sorting function on the samples from each candidate numerical feature, each time a row is split, for each tree in a random forest has an undeniable appeal.

### Is Breiman and Cutler's numerical split finding method provably faster?
Louppe goes on to demonstrate that the complexity of ensuring that each feature's row indices remain organized in a manner that's consistent with the decision tree's structure is $O(\rho N)$ (where $\rho$ is the number of numerical features in the training set). The ratio of the complexities of pre-sorting once to doing per-split sorts is $O(\frac{\rho}{KlogN})$, with the takeaway being that depending on the number of features in the training set ($\rho$) as well as the number of features randomly selected for each node's split search ($K$), it is by no means an ironclad certainty that pre-sorting will result in faster performance. 

For this reason, Louppe didn't elect to implement pre-sorting for sklearn's random forests, and one could hardly fault him for not following in Breiman and Cutler's footsteps in this regard. Nevertheless, Louppe's complexity derivation makes it clear that neither approach has an unassailable advantage over the other.

### What about parallel training?
It seems to me, however, that even if there situations where Breiman & Cutler's method is faster than Louppe's, their approach wouldn't be compatible with a Random Forest fitting algorithm that can fit several decision trees in parallel. Each tree would have to have its own copy of the dataset's sorted raw feature values so that these values could be partitioned according to the node structure of each decision tree in the random forest. This would be profoundly expensive in terms of memory use and downright unfeasible when training on a massive dataset.

### In a nutshell, Ranger's Numerical Split Finder: A hybrid approach
One of the things that attracted me to Ranger was the fact that tries to take advantage of the benefits of both Louppe's sort-every-split strategy and Breiman and Cutler's pre-sort method. Furthermore, unlike Breiman & Cutler's presorting, Ranger's pre-sorting strategy is quite compatible with parallel training.

On page three of his [2015 paper](https://arxiv.org/abs/1508.04409) that provides a high level overview of the philosophy behind Ranger's implementation, [Marvin Wright](https://mnwright.github.io/) explains that users can choose to train a model using either "runtime-optimized" mode or "memory efficient" mode.
1. On the surface, "memory efficient" mode looks a lot like Sklearn. No presorting takes place and raw feature values are sorted for each candidate numerical feature for each node that gets split. Under the hood, Ranger has a different way of bypassing identical feature values than Sklearn. As we'll see later on below, Ranger's code to do this looks a little simpler, but Sklearn runs a bit faster.
2. "Runtime-optimized" mode is where things get really cool. Here, pre-sorting does take place before training (argsorts for each numerical feature), but the pre-argsorted row indices are only used for splitting large nodes. Wright [defines](https://github.com/imbs-hl/ranger/blob/ce497711884c783e133fb36750b60de4c140773f/src/TreeClassification.cpp#L172) a node as "large" if the ratio, `Q`, of the number of its samples to the number of unique values held by *all training samples* for the curent numerical feature is [greater than or equal to 0.02](https://github.com/imbs-hl/ranger/blob/ce497711884c783e133fb36750b60de4c140773f/src/globals.h#L106). 

   What's interesting about this definition is that it doesn't just look at the raw number of samples sitting inside a given node, but that it then adjusts this quantity by the total number of unique values in the entire training set that are held by the numerical feature that is currently being investigated (to see if it offers the best split of the node's samples). If a node had only 5 samples, but across the entire training set there were only 4 unique values for the current candidate feature, then the node would be considered to be "large Q." On the other hand, if the node had 1,000 samples, but the current given feature had two million unique training values, the node would be considered to be "small Q."

  Under this standard, if a node is deemed to be "small Q," it will get split with the same method as "memory efficient" mode, where the raw feature values are drawn and sorted real-time.
3. Finally, note that once it's arranged, the table of sorted row indices is never subsequently adjusted and could thus be used when fitting a random forest's trees in parallel.

## Implementing Louppe's Numerical Split Finder
Without futher ado, let's code up a Pythonic representation of Louppe's numerical splitting algorithm and use it to fit a decision tree to the Titanic dataset. A few implementation notes:
* The Sklearn library's decision tree architecture is *very* object-oriented. Far beyond having a simple `DecisionTree` class, Sklearn instantiates a splitter object to conduct split churches, and even uses a criterion object calculate the gini score of each split point. 
* My personal opinion: nailing down the purpose and function of each of these objects (by tabbing from `.pyx` file to `.pyx` file) made it hard for me to quickly attain high-level understanding of how Sklearn does what it does.
* Thus, I prefer a more functional approach for my own implementation. 
* What's more, I prefer to use an on-line algorithm to update my gini score calculation after each subsequent split point is examined (this is what Breiman and Cutler do in their Fortran code). Sklearn happens to eschew this practice -- it's `Criterion` class [calculates the node purity score from scratch](https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_criterion.pyx#L656) at each split point.
* I also employ the following [trick](https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/tree/_criterion.pyx#L430) that I learned from Sklearn:

    * If several samples in the right child are identical, it may be faster to update left and right child stats by temporarily presuming that left child contains all the parent node's samples and then move the last few of these over to the right child. 

      I'll do this when the number of samples that'd go from left to right is less than the number of identically-valued samples that'd otherwise have to go from right to left. 

    ```python
    if pos-prev_pos > n_parent-pos-1:
        l_num, l_den = parent_num, parent_den
        r_num, r_den = 0., 0.
        l_wcc, r_wcc = parent_wcc.copy(), [0.]*n_class
        for row in reversed(rows[pos+node_start: n_parent+node_start]):
            label = labels[row]; w = c_wts[label]
            r_num += w*( 2*r_wcc[label] + w); r_den += w
            l_num += w*(-2*l_wcc[label] + w); l_den -= w 
            r_wcc[label] += w; l_wcc[label] -= w
    ```

#### Cython dual sort/argsort function
Even with my pure Python prototype of Louppe's decision tree algorithm that follows below, I'd still prefer to do my sorting as rapidly as possible.

To that end, the following functions comprise my own home-grown Cython introsort implementation that simultaneously sorts a 1-d array containing floats, and at the same time correspondingly argsorts the portion of the 1-d array containing the row indices of all training samples that contains the row indices of the samples in the current node. 

In designing this implementation, I borrowed ideas that I liked from the introsort implementations found in [Sklearn](https://github.com/scikit-learn/scikit-learn/blob/4c6fc05b2a1f11bedef5784c46b9f5d3e52489c2/sklearn/tree/_splitter.pyx#L440), [Numpy](https://github.com/numpy/numpy/blob/5ffb84c3057a187b01acdeaa628137193df12098/numpy/core/src/npysort/quicksort.cpp#L211), and [libstdc++](https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L78).

In [1]:
%load_ext Cython

In [2]:
%%cython
# cython: wraparound=False, boundscheck=False, cdivision=True, initializedcheck=False

import numpy as np
cimport numpy as np
ctypedef np.float64_t DTYPE_t
ctypedef np.intp_t SIZE_t

from libc.math cimport log as ln
cimport cython

cdef inline void dual_swap(DTYPE_t* items, SIZE_t* rows, SIZE_t i, SIZE_t j) nogil:
    items[i], items[j] = items[j], items[i]
    rows[i], rows[j] = rows[j], rows[i]

# Quicksort helpers

cdef inline void dual_med_three(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Find the median-of-three pivot point of the second through final 
    items of a list of numbers. Once identified, the pivot is moved to 
    the front of the list. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L78
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t middle = <int>(first + (last - first)/2)
    cdef SIZE_t second = first + 1
    last -= 1
    if items[second] < items[middle]:
        if items[middle] < items[last]:
            dual_swap(items, rows, first, middle)    
        elif items[second] < items[last]:
            dual_swap(items, rows, first, last)         
        else:                        
            dual_swap(items, rows, first, second)
    elif items[second] < items[last]:
        dual_swap(items, rows, first, second)
    elif items[middle] < items[last]:
        dual_swap(items, rows, first, last)
    else:
        dual_swap(items, rows, first, middle)

cdef inline SIZE_t dual_partition(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last, SIZE_t pivot) nogil:
    """Group numbers less than the pivot value together on the left and
    those that are greater on the right. Find the index that separates
    these two groups, which will belong to the first item that is greater
    than or equal to the pivot. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L1885
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
        pivot      : Index holding the median pivot value.
        
    Returns:
        Index of cut point used to partition the items into two smaller sequences.
    """
    while True:
        while first < last and items[first] < items[pivot]:
            first += 1                      # Get index of first item greater than or equal to median-of-three pivot. 
        last -= 1
        while items[pivot] < items[last]:
            last -= 1                       # Get index of last item less than or equal to the pivot.
        if not (first < last): 
            return first                    # After swaps are done, return index of first item in right partition.
        
        dual_swap(items, rows, first, last) # Swap the first item greater than or equal to the pivot with the
                                            # last item less than or equal to the pivot. 
        first += 1

cdef inline void dual_insertion_sort(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Follows the spirit of the Numpy implementation at: 
        https://github.com/numpy/numpy/blob/5ffb84c3057a187b01acdeaa628137193df12098/numpy/core/src/npysort/quicksort.cpp#L211
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t i
    cdef SIZE_t j
    cdef SIZE_t k
    cdef DTYPE_t val
    for i in range(first+1, last):
        j = i
        k = i - 1
        val = items[i]
        row = rows[i]
        while (j > first) and val < items[k]:
            items[j] = items[k]
            rows[j] = rows[k]
            j-=1
            k-=1
        items[j] = val
        rows[j] = row

# Heapsort

cdef inline void dual_sift_down(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n, 
                                SIZE_t p, SIZE_t c, DTYPE_t val, SIZE_t row) nogil:
    """Swap a heap item with one of its children if that child's value is 
    greater than or equal to that parent's value. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L61
    
    Arguments:
        items: 1-d array containing numbers.
        rows : Row indices of all training samples.
        start: Index of the first number.
        n    : Quantity of numbers.
        p    : Index of the parent.
        c    : Index of the parent's first (left) child.
        val  : The parent's value.
        row  : The parent's training row index.
    """
    while c < n:    # Look at the descendents of current parent, `p`.
        if c < n-1 and items[start + c] < items[start + c + 1]: # Find larger of the first and second children.
            c += 1
        if val < items[start + c]: # If child greater than parent, swap child and parent.
            items[start + p] = items[start + c]
            rows[start + p] = rows[start + c]
            p = c   # Current greater child becomes the parent.
            c += c  # Look at this child's child, if it exists.
        else:
            break 
    items[start + p] = val
    rows[start + p] = row

cdef inline void dual_sort_heap(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Sort a binary max heap of numbers. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L77
    
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        rows : Row indices of all training samples.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    cdef DTYPE_t val
    cdef SIZE_t row
    while n > 0:
        n -= 1
        val = items[start + n]
        row = rows[start + n]
        items[start + n] = items[start]
        rows[start + n] = rows[start]
        dual_sift_down(items, rows, start, n, 0, 1, val, row)

cdef inline void dual_heapify(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Turn a list of items into a binary max heap. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L59
    
    Arguments:
        items: 1-d array containing numbers.
        rows : Row indices of all training samples.
        start: Index of the first number.
        n    : Quantity of numbers.
    """
    cdef DTYPE_t val
    cdef SIZE_t p
    cdef SIZE_t last_p = (n-2)//2
    for p in range(last_p, -1, -1):
        val = items[start + p] # value of last parent
        row = rows[start + p]
        dual_sift_down(items, rows, start, n, p, 2*p + 1, val, row)

cdef inline void dual_heapsort(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Applies the heapsort algorithm to sort a list of items from least to greatest. 
    From Williams, 1964.
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        rows : Row indices of all training samples.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    dual_heapify(items, rows, start, n)
    dual_sort_heap(items, rows, start, n)
    
# Introsort 

cdef void dual_introsort_loop(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last, int depth) nogil:
    """The recursive heart of the introsort algorithm.
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
        depth      : Current recursion depth.
    """
    cdef int MIN_SIZE_THRESH = 16
    cdef SIZE_t cut
    while last-first > MIN_SIZE_THRESH:
        if depth == 0:
            dual_heapsort(items, rows, first, last-first)
        depth -= 1
        dual_med_three(items, rows, first, last)
        cut = dual_partition(items, rows, first+1, last, first)
        dual_introsort_loop(items, rows, cut, last, depth)
        last = cut

# Log base-2 helper function. From Sklearn's implementation at:
#     https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_utils.pyx#L7
cdef inline DTYPE_t log2(DTYPE_t x) nogil:
    return ln(x) / ln(2.0)

cdef void dual_introsort(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Implementation as described in Musser, 1997. Switches to heapsort
    when max recursion depth exceeded. Otherwise uses median-of-three 
    quicksort (Bentley & McIlroy, 1993) with all the usual optimizations:
        - Swap equal elements.
        - Only process partitions longer than the minimum size threshold.
        - When a new partition is made, recurse on the smaller half and 
          iterate over the larger half.
        - Make a final pass with insertion sort over the entire list.

    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef int max_depth = 2 * <int>log2(last-first)
    dual_introsort_loop(items, rows, first, last, max_depth)
    dual_insertion_sort(items, rows, first, last)
    
# Python wrapper
def dual_sort(np.ndarray[np.float64_t, ndim=1, mode="c"] items, np.ndarray[np.intp_t, ndim=1, mode="c"] rows,
              SIZE_t first, SIZE_t last):
    """Wrapper function for my Cython implementation of `introsort()`. 
    Followed Laura Mendoza's guide on how to have Cython access
    a Numpy array using the C pointer:
        https://members.loria.fr/LMendoza/link/Cython_speedup_notes.html#Working-with-numpy-arrays-in-I/O
    
    Arguments:
        items (Numpy array, float64): The numbers to be sorted.
        rows      (Numpy array, int): Training set row indices of all samples used to 
                                      fit the decision tree.
        first, last            (int): The range to be sorted. 
    """
    cdef DTYPE_t* items_ptr = <DTYPE_t *> items.data
    cdef SIZE_t* rows_ptr = <SIZE_t *> rows.data
    dual_introsort(items_ptr, rows_ptr, first, last)

## Louppe numerical split finder
Once the Sklearn numerical splitter has sorted the node's raw feature values and argsorted the nodes indices to match, the rest is pretty straightforward:
1. At the beginning of the split search, all of the parent node's samples sit in right child and so its weighted class counts are the same as those of the parent. The left child is empty.
2. For each child, the numerator and denominator used to calculate its gini impurity score are updated, on-line, as more samples move from the right child to the left at each unique split point.
3. We start our search by moving all samples that have the lowest raw value for the given feature from the right child over to the left child. We repeat this, value by value, as we proceed through all unique values up to and including the second-to-highest unique value. 
4. The split-point that gets recorded each time we shift a batch from the right to left child is the mid-point of: the value held by the group of samples that just got moved to the left side, and the next-lowest unique value (whose samples still sit in the right child).
5. Sometimes this mid-point will be equal to that next-lowest value. When this happens, we don't use the mid-point but instead record the lower of the two values as the split point.
6. We take advantage of situations where we can more quickly examine the rest of the split-points by reversing direction and begin moving samples from the left child back over to the right child. This can happen when the right child has many identical samples.
7. Finally, for each split point, we keep track of the index (in the `rows` list) that would mark the start of the right child if that split point eventually gets selected as the best split point. This saves us from having to work this out all over again once the best split point is discovered.

In [3]:
def find_num_split(rows, items, labels, node_start, n_parent, n_class, 
                   min_samples_leaf, min_weight_leaf, c_wts, l_wcc, r_wcc,
                   parent_wcc, best_split, current_feat, parent_num, parent_den):
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.
    
    Uses Gilles Louppe's split-finding algorithm.
    
    Saves a split's feature idx, threshold, position, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        rows           (ndarray of int): Indices of all rows in the training set. 
                                         Shape: (n train samples,).
        items      (ndarray of float64): The sorted feature values of the samples in the parent
                                         node (beginning at `node_start`). Shape: (n train samples,).
        labels         (ndarray of int): All training labels. Shape: (n training samples,).
        node_start                (int): Index of the beginning of the parent node in `rows`.
        n_parent                  (int): Number of samples in the parent node.
        n_class                   (int): Number of unique classes in the training set.
        min_samples_leaf          (int): Any leaf will have no fewer than this many samples.
        min_weight_leaf       (float64): Total weight of any leaf's samples will be at least this much.
        c_wts      (ndarray of float64): Class weights. Shape: (`n_class`,).
        l_wcc      (ndarray of float64): Left child's weight class counts. Shape: (`n_class`,).
        r_wcc      (ndarray of float64): Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc (ndarray of float64): Parent node's weight class counts. Shape: (`n_class`,).
        best_split              (Split): Holds the feature, threshold, position, and impurity
                                         score of the parent node's current best split.
        current_feat              (int): Column index of feature under investigation.
        parent_num            (float64): Numerator of parent node's impurity score.
        parent_den            (float64): Denominator of parent node's impurity score.
              
    Returns: 
        int: 1 if feature is constant for eligible split-points. 0, otherwise.
    """
    # So that we can iterate across all feature values in the node.
    prev_pos, pos = node_start, node_start
    node_end = node_start + n_parent
    lowest = items[pos]
    
    # Variables used to calculate proxy gini scores.
    l_num, l_den  = 0., 0.
    r_num, r_den, = parent_num, parent_den
    
    # Whether or not feat is constant within search range permitted
    # by min_samples_leaf and min_weight_leaf (0 if no, 1 if yes).
    current_feat_const = 1
    
    # Find the best split and store its score, threshold, position,
    # as well as it's children's weighted class counts.
    while pos < node_end:
        while items[pos] == lowest: # When consecutive items have the same value.
            if pos == node_end - 1: # When the final few samples all have the same value.
                return current_feat_const
            pos+=1
        next_lowest = items[pos]
        mid = lowest/2. + next_lowest/2. # Split threshold is always the mid-point between two consecutive values.
        if mid == next_lowest: mid = lowest
            
        # Move samples from the left to right child when it's quicker to do so.
        if pos-prev_pos > node_end-pos-1:
            l_num, l_den = parent_num, parent_den
            r_num, r_den = 0., 0.
            l_wcc[:] = parent_wcc
            r_wcc[:] = 0.
            for r in reversed(range(pos, node_end)):
                row = rows[r]
                label = labels[row]; w = c_wts[label]
                r_num += w*( 2*r_wcc[label] + w); r_den += w
                l_num += w*(-2*l_wcc[label] + w); l_den -= w 
                r_wcc[label] += w; l_wcc[label] -= w
        else:
            for r in range(prev_pos, pos):
                row = rows[r]
                label = labels[row]; w = c_wts[label] 
                l_num += w*( 2.*l_wcc[label] + w); l_den += w
                r_num += w*(-2.*r_wcc[label] + w); r_den -= w
                l_wcc[label] += w; r_wcc[label] -= w  
                
        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf.
        if pos - node_start < min_samples_leaf: 
            lowest = next_lowest
            prev_pos = pos; pos+=1 
            continue
        elif node_end - pos < min_samples_leaf:
            return current_feat_const
        # l_den and r_den are left and right children's weighted sample sums.
        elif l_den < min_weight_leaf: 
            lowest = next_lowest
            prev_pos = pos; pos+=1 
            continue
        elif r_den < min_weight_leaf:
            return current_feat_const

        current_feat_const = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Only update best split stats if current score beats all
            # other best found among all other features already explored 
            # at the current node.
            best_split.score, best_split.thresh = score, mid
            best_split.pos, best_split.feat = pos, current_feat
        lowest = next_lowest
        prev_pos = pos; pos+=1
    return current_feat_const

#### Split-making Function

In [4]:
def make_num_split(rows, X, node_info, best_split):
    """Split a decision tree node using a given ordered numerical feature and threshold. 
    
    Uses the similar logic as Gilles Louppe's Cython implementation at: 
        https://github.com/scikit-learn/scikit-learn/blob/47e3358712d483a8e8dcb84d87386eb4f3d49070/sklearn/tree/_splitter.pyx#L605
    
    Arguments: 
        rows  (ndarray of int): Indices of all rows in the training set. 
                                Shape: (n train samples,).
        X (ndarray of float64): The training data. Shape: (n train samples, n features).
        node_info (StackEntry): Stats of the node to be split.
        best_split     (Split): Stats of a node split.
    """
    p, p_end = node_info.start, node_info.end
    while p < p_end:
        if X[rows[p]][best_split.feat] <= best_split.thresh: p+=1
        else: p_end-=1; rows[p], rows[p_end] = rows[p_end], rows[p]

## Louppe Decision Tree Python Version

In [5]:
def get_random_generator(seed=None):
    """Make a new Numpy random generator or use a previous one, if it exists.
    
    Inspired by sklearn's `check_random_state()` function:
        https://github.com/scikit-learn/scikit-learn/blob/62fc8bb94dcd65e72878c0599ff91391d9983424/sklearn/utils/validation.py#L852
    """
    if isinstance(seed, np.random.Generator): 
        return seed
    else:
        return np.random.default_rng(seed)

In [6]:
class StackEntry():
    """Pertinent stats needed to push a decision tree node onto a LIFO priority stack.
    
    Attributes:
        start         (int): Index of node's first sample.
        end           (int): Range of all samples in node.
        node_id       (int): Location of node in a decision tree.
        parent_id     (int): Location of node's parent.
        n_const_feats (int): Num features constant for samples in node.
    """
    def __init__(self, start, end, node_id, parent_id, n_const_feats):
        self.start, self.end = start, end
        self.node_id, self.parent_id = node_id, parent_id
        self.n_const_feats = n_const_feats
        
class Split():
    """Pertinent stats needed to compare node splits.
    
    Attributes:
        feat       (int): Column index of splitting feature.
        thresh (float64): Split threshold.
        pos        (int): Index of first sample in split's right child.
        score  (float64): Impurity score.
    """
    def __init__(self, feat, thresh, pos, score):
        self.feat, self.thresh, self.pos, self.score = feat, thresh, pos, score
        
class Node():
    """A decision tree node.
    
    Attributes:
        l_child    (int): Location of node's left child (-1 if leaf).
        r_child    (int): Location of node's right child (-1 if leaf).
        feat       (int): Column index of best split's feature (-1 if leaf).
        thresh (float64): Threshold of best split (-np.inf if leaf).
        label      (int): Class label if node is a leaf (-1, otherwise).
    """
    def __init__(self, l_child, r_child, feat, thresh, label):
        self.l_child, self.r_child = l_child, r_child
        self.feat, self.thresh = feat, thresh
        self.label = label

In [7]:
class DecisionTreeLouppe():
    """Fit a decision tree classifier using a depth-first algorithm.
    
    Based on page 31 in Louppe, 2015: https://arxiv.org/pdf/1407.7502.pdf
    Keeps track of and avoids features that are constant for a given node's samples.

    Attributes:
            m                                    (int): Number of candidate features randomly selected to try 
                                                        to split each node.
            min_samples_leaf                     (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf         (float64): Total weight of any leaf's samples must comprise this portion 
                                                        of the sum of weights of *all* training samples used to fit 
                                                        the tree.
            class_weights         (ndarray of float64): Sample weight to be used for each class. Shape: (`n_class`,).
            seed                                 (int): Seed of the random number generator used for tree growing.
            rows                      (ndarray of int): Row indices of all training samples. Shape: (`n_samples`,).
            features                  (ndarray of int): Column indices of all training features. Shape: (`n_features`,).
            n_class                              (int): Number of unique classes in the training set.
            n_samples                            (int): Number of samples in the training set.
            n_features                           (int): Number of features used to train.
            mem_capacity                         (int): Max number of tree nodes that can be stored in `self.nodes`
                                                        and `self.weighted_class_counts`.
            min_weight_leaf                  (float64): Total weight of any leaf's samples will be at least this much.
            n_nodes                              (int): Number of nodes in the tree.
            
            Decision Tree data structure
            ----------------------------
            nodes                    (ndarray of Node): All nodes in the decision tree. Shape: (`n_nodes`,).
            weighted_class_counts (ndarray of float64): Weighted class counts of training samples in each
                                                        node. Shape: (`n_nodes` x `n_class`,).
    """
    
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights=[], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try 
                                                to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit 
                                                the tree.
            class_weights (ndarray of float64): Sample weight to be used for each class. Shape: (`n_class`,).
            seed                         (int): Use when reproducibility desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C')
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = np.empty(0, dtype=Node, order='C')
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = np.empty(0, dtype=np.float64, order='C')
        
    @property
    def size(self): return self.n_nodes
    
    @property 
    def left_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].l_child
        return out

    @property 
    def right_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].r_child
        return out

    @property 
    def split_features(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].feat
        return out

    @property 
    def split_thresholds(self): 
        out = np.empty(self.n_nodes, dtype='float64')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].thresh
        return out

    @property 
    def weighted_cc(self):
        out_size = self.n_nodes*self.n_class
        out = np.empty(out_size, dtype='float64')
        for i in range(out_size):
            out[i] = self.weighted_class_counts[i]
        out.resize(self.n_nodes, self.n_class)
        return out

    @property 
    def labels(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].label
        return out
    
    def _increase_mem_capacity(self, new_capacity):
        """Resize ndarrays that hold tree's nodes and weighted class counts.
        
        Arguments:
            new_capacity (int): Amount of nodes that resized arrays will be able to hold.
        """
        self.nodes.resize(new_capacity, refcheck=False)
        self.weighted_class_counts.resize(new_capacity*self.n_class, refcheck=False)
    
    def _make_leaf(self, node_id, wcc, n_classes_node):
        """Set and store the class label of a leaf node.
        
        Break ties at random when multiple classes share the same max weight.
        Doing this avoids a bias towards lower classes that would be a possible
        consequence of using np.argmax (which is what Sklearn does).
        
        Arguments:
            node_id            (int): Location of node in `self.nodes`.
            wcc (ndarray of float64): Node's weighted class counts. Shape: (`self.n_class`,).
            n_classes_node     (int): Number of unique class labels found among
                                      node's training samples.
        """
        if n_classes_node == 1: 
            label = max(enumerate(wcc), key=lambda f: f[1])[0]
        else:              
            label = self._rng.choice(np.argwhere(wcc==np.max(wcc)).flatten())
        self.nodes[node_id] = Node(-1, -1, -1, np.nan, label)
        
    def _grow_tree(self, X, y):
        """Depth-first growth of a decision tree.
        
        Arguments:
            X (ndarray of float64): Training samples. Shape: (n samples, n features).
            y     (ndarray of int): Training labels. Shape: (n samples,).
        """
        # LIFO stack holding all nodes still to be investigated.
        node_stack = []
        
        # Stores the weighted class counts of the current node.
        node_wcc = np.empty(self.n_class, dtype=np.float64)
        
        # For finding the best split.
        l_wcc = np.empty(self.n_class, dtype=np.float64)
        r_wcc = np.empty(self.n_class, dtype=np.float64)
        items = np.empty(self.n_samples, dtype=np.float64)

        # Keeping track of nodes' constant features. Uses
        # same strategy as Sklearn: 
        #     https://github.com/scikit-learn/scikit-learn/blob/4c6fc05b2a1f11bedef5784c46b9f5d3e52489c2/sklearn/tree/_splitter.pyx#L424
        features = self.features.copy()
        constant_features = np.empty(self.n_features, dtype=np.intp)
        
        # Push root node onto the LIFO stack.
        node_stack.append(StackEntry(0, self.n_samples, 0, 0, 0))
        self.n_nodes = 1
        
        while len(node_stack) > 0:
            node_info = node_stack.pop()
            start, end = node_info.start, node_info.end
            node_id, parent_id = node_info.node_id, node_info.parent_id
            n_consts = node_info.n_const_feats
            n_samples_node = end-start
            
            # Tabulate and store the current node's weighted class counts.
            node_wcc[:] = 0.
            for i in range(n_samples_node):
                row = self.rows[start + i]
                label = y[row]
                wt = self.class_weights[label]
                node_wcc[label] += wt 
            self.weighted_class_counts[node_id*self.n_class: (node_id + 1)* self.n_class] = node_wcc
            
            # Make a leaf if required to do so.
            n_classes_node, sum_node_wcc, sum_node_wcc_sqr = 0, 0., 0.
            for c in range(self.n_class):
                wcc = node_wcc[c]
                if wcc > 0: n_classes_node += 1
                # Compute the current node's proxy gini numerator and denominator while we're at it.
                sum_node_wcc_sqr += wcc**2 
                sum_node_wcc += wcc 
            if n_classes_node == 1:                      
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif n_samples_node < 2*self.min_samples_leaf:  
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif sum_node_wcc < 2.*self.min_weight_leaf: 
                self._make_leaf(node_id, node_wcc, n_classes_node)
            
            # Or perform a split.
            else:
                # Initialize stats for best split of node.
                best_split = Split(-1, 0., -1, -np.inf)
                
                # Ensure feats drawn w/out replacement.
                n_drawn_feats = 0
                n_new_consts = 0
                n_total_consts = n_consts
                lb = 0                      # Range in `features` array from which we 
                ub = self.n_features - 1    # randomly select a feature's column index. 
               
                while n_drawn_feats < self.m:
                    n_drawn_feats += 1
                    idx = self._rng.choice(range(lb, ub-n_new_consts+1))
                    
                    # So that we don't draw a known constant feature again this split-search.
                    if idx < n_consts:
                        features[idx], features[lb] = features[lb], features[idx]
                        lb += 1 
                        continue
                        
                    # So that no new const feats get drawn more than once per split-search.
                    idx += n_new_consts

                    feat_idx = features[idx]  
                    # Prepare the rows' feature values for sorting.
                    items[start:end] = X[:,feat_idx][self.rows[start:end]]
                    
                    # Sort feature values and corresponding sample row indices
                    # to prepare for numerical split finding.
                    dual_sort(items, self.rows, start, end)
                    
                    # Make sure the feature not constant for node's samples.
                    if items[start] == items[end-1]:
                        # Move the newly-discovered constant feat to the far right-end
                        # of the left half of `features` list holding the known const
                        # feats as well as any other const feats newly discovered 
                        # during this node's split-search.
                        features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                        n_new_consts += 1
                        n_total_consts += 1
                        continue
                    else:
                        # Initialize weighted class counts of right and left children.
                        # Right child's counts are initially the same as parent node's.
                        r_wcc[:] = node_wcc
                        l_wcc[:] = 0.
                    
                        # If the feature has an impurity score that's better than the best score 
                        # found among all other features visited thus far for this node, find_num_split()
                        # updates the attributes of the struct containing the node's best split info. 
                        # 
                        # But even if a new best score isn't reached, if an impurity score can
                        # be calculated at least once during the feature's split search, the
                        # following indicator will be toggled off, to indicate that the feature
                        # is not constant (1 = is constant; 0 = not constant).
                        current_feat_const = find_num_split(self.rows, items, y, start, n_samples_node, self.n_class, 
                                                            self.min_samples_leaf, self.min_weight_leaf, self.class_weights, 
                                                            l_wcc, r_wcc, node_wcc, best_split, feat_idx, sum_node_wcc_sqr, 
                                                            sum_node_wcc)

                        if current_feat_const:
                            # The feature may be constant within the search range permitted
                            # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                            # the feature is a newly discovered constant.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # The feature is non-constant, so we ensure it's not drawn again
                            # during this split-search.
                            features[idx], features[ub] = features[ub], features[idx]
                            ub -= 1 
                            
                # To ensure that the constant features info is accurate for sibling or child nodes.
                features[0:n_consts] = constant_features[0:n_consts]
                constant_features[n_consts:n_consts+n_new_consts] = features[n_consts:n_consts+n_new_consts]
                
                # Make node a leaf if constant for all randomly drawn feats.
                # (# drawn known constant feats + # drawn new constant feats)
                if lb + n_new_consts == n_drawn_feats: 
                    self._make_leaf(node_id, node_wcc, n_classes_node)
                else: 
                    make_num_split(self.rows, X, node_info, best_split) 

                    # Update info for node that's getting split.
                    l_child_id = self.n_nodes
                    r_child_id = l_child_id + 1
                    self.nodes[node_id] = Node(l_child_id, r_child_id, best_split.feat, best_split.thresh, -1)

                    # Prepare for the left and right child nodes
                    # by increasing tree data memory capacity if
                    # necessary.
                    if self.n_nodes + 2 > self.mem_capacity:
                        # Expand memory capacity geometrically. See "geometric growth" 
                        # part of WhozCraig's SO answer at: 
                        #     https://stackoverflow.com/a/51665863/8628758.
                        # Add one after squaring so that the new capacity can
                        # contain not only a tree of greater depth, but also
                        # the maximum # nodes that that depth could have.
                        new_capacity = 2*self.mem_capacity + 1
                        self._increase_mem_capacity(new_capacity)
                        self.mem_capacity = new_capacity
                    
                    # Push right child info onto the LIFO stack.
                    node_stack.append(StackEntry(best_split.pos, end, r_child_id, node_id, n_total_consts))
                    # Push left child info onto queue.
                    node_stack.append(StackEntry(start, best_split.pos, l_child_id, node_id, n_total_consts))

                    # And update size of the tree.
                    self.n_nodes += 2
    
    def fit(self, X, y, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X (Fortran-style ndarray of float64): Pre-processed training data. 
                                                  Shape: (num train samples, num train features).
            y                   (ndarray of int): Training labels. Shape: (num train samples,).
            rows                          (list): Indices of the rows to be used for training. 
                                                  All rows used if empty.
            features                      (list): Column indices of training features that will be used.
                                                  All features used if empty.                           
        Returns:
            DecisionTreeLouppe: A decision tree object.
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        # Determine num classes found among all training samples.
        root_cc = np.unique(y, return_counts=True)[1] # Root node class counts.
        self.n_class = root_cc.size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.

        self.n_samples = len(self.rows)
        self.n_features = len(self.features)
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        init_capacity = 15
        
         # Allocate tree memory.
        self._increase_mem_capacity(init_capacity)
        self.mem_capacity = init_capacity
        
        # And sum the class weights of all the root node's samples in
        # order to know minimum total weight a leaf must have (which
        # we must know when regularizing by min_weight_fraction_leaf.)
        root_wcc = root_cc*self.class_weights
        self.min_weight_leaf = self.min_weight_fraction_leaf*root_wcc.sum()
        
        # Initialize the random number generator.
        self._rng = get_random_generator(self.seed)
        
        # Initiate tree building.
        self._grow_tree(X, y)
        return self
        
    def _next_node(self, nxt): return self.nodes[nxt]
       
    def _get_leaf_idx(self, i, X):
        root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[:,leaf.feat][i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else:
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
        
    def predict(self, X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        n_preds = X.shape[0]
        preds = np.empty(n_preds, dtype=np.intp)
        for i in range(n_preds):
            preds[i] = self.nodes[self._get_leaf_idx(i, X)].label
        return preds
    
    def predict_probs(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        for i in range(n_probs):
            idx = self._get_leaf_idx(i, X)
            for j in range(self.n_class):
                wcc[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

## Download the Titanic Dataset

In [8]:
import pandas as pd
from pathlib import Path
from zipfile import ZipFile

titanic_dir = Path.home()/'data'/'titanic'

def get_titanic_data():
    """Download and place the kaggle Titanic train and test csv files into Pandas dataframes."""
    titanic_dir.mkdir(parents=True, exist_ok=True)
    train_csv_path, test_csv_path = titanic_dir/'train.csv', titanic_dir/'test.csv'
    if not train_csv_path.exists():
        # Visit https://github.com/Kaggle/kaggle-api for more info on 
        # how to install the kaggle API and generate a kaggle.json key.
        !kaggle competitions download -c titanic --path "$titanic_dir"
        with ZipFile(titanic_dir/'titanic.zip', 'r') as z: z.extractall(titanic_dir) 
    titanic_train_df = pd.read_csv(train_csv_path); titanic_train_df.drop('PassengerId', axis=1, inplace=True)
    titanic_test_df = pd.read_csv(test_csv_path)
    return titanic_train_df, titanic_test_df
    
titanic_train_df, titanic_test_df = get_titanic_data()

In [9]:
titanic_train_df = titanic_train_df.iloc[:,[i for i in range(len(titanic_train_df.columns)) if i not in [2,7,9]]]

In [10]:
def train_val_split(df, label_col_idx, pct_val=20):
    """Randomly draw samples to form a validation set.
    
    Arguments:            
        df (Pandas dataframe): Dataframe containing a dataset
        label_col_idx   (int): Index of the column containing the labels
        pct_val         (int): Percent of the dataset's samples to go into the validation set.
    
    Returns: 
        the train/val inputs and labels
    """
    if pct_val < 0 or pct_val > 100: print('pct_val should be an int or float between 0 and 100'); return
    n = len(df)
    val_idx = np.random.choice(n, size=n*pct_val//100, replace=False)
    val_idx = np.sort(val_idx)
    train_idx = df.index.difference(val_idx)
    labels = df.iloc[:,label_col_idx]
    inputs = df.iloc[:,[i for i in range(len(df.columns)) if i != label_col_idx]]
    return inputs.loc[train_idx], labels.loc[train_idx], inputs.loc[val_idx], labels.loc[val_idx]

In [11]:
np.random.seed(42)
xTrain_titanic, yTrain_titanic, xVal_titanic, yVal_titanic = train_val_split(titanic_train_df, 0, 10)

In [12]:
# Categorical features to be numerically encoded using 
# PCA rank encoding.
cat_feats = [1,6]

## Preprocessing Helper Functions

In [13]:
def wt_cov_matrix(P, w, n):
    """Calculate the unbiased weighted estimated covariance of class probability 
    matrix of a categorical feature's levels.
    
    Parameters: 
        P (Numpy array - floats): class probability matrix of the levels of a given categorical variable.
        w           (int vector): sample weights
        n                  (int): number of samples in the parent node
                
    Returns: 
        the unbiased weighted estimated covariance matrix of P
    """
    p_avg = np.sum(w[:,None]*P, axis=0)/n
    X = P.copy(); X = X.T   # Can't alter the original matrix P; it will be needed later
    X -= p_avg[:,None]
    return X.dot((X*w).T)/(n - np.sum(w**2)/n)

In [14]:
def integer_encode(x):
    """Perform a naive ordinal encoding of a categorical variable's samples. Leaves any NaNs alone.
    
    Arguments: 
        x (vector): All samples in the categorical variable's column.
    
    Returns: 
        1-d array containing the variable's encoded values, 
        list of the cat variable's unique original values
    """
    levels = []; [levels.append(i) for i in x if i not in levels]
    level_map = {l:i for i, l in enumerate(levels)}  # Integer used to encode a level corresponds to the 
    x_enc = [level_map[l] for l in x]                # index where that level appears in `levels` list.
    return x_enc, levels

In [15]:
def get_prob_matrix(x, y, n, ncat, nclass):
    """Construct a class probability matrix for a categorical variable's unique levels.
    Computes contingency table of per-level class counts using fuglede's *wonderful* algorithm: 
        https://stackoverflow.com/questions/51294382/python-contingency-table/51294568#51294568
    
    Arguments: 
        x (Numpy vector of int): Naive ordinally encoded values of categorical variable.
        y (Numpy vector of int): Class labels corresponding to rows in <x>.
        n                 (int): Number of rows/samples.
        ncat              (int): Number of unique categories.
        nclass            (int): Number of unique classes.
        
    Returns: 
        2-d array containing the class probability matrix,
        list of categorical level weights (the number of samples belonging to each level)
    """
    N = np.bincount(nclass * x + y, minlength=ncat*nclass).reshape((ncat, nclass)) # Cat levels' class counts
    w = N.sum(axis=1) # Level weights are number of samples per level.
    P = np.zeros((ncat, nclass))                        # First initialize P as all zeros to prevent the output 
    np.divide(N, w[:,None], where=w[:,None]!=0, out=P)  # of np.divide from being numerically unstable.
    return P, w

In [16]:
def PCA_rank_encode(x_int, y, levels, n, nclass):
    """PCA rank encode a categorical variable using technique from Coppersmith et. al. (1999):
        https://link.springer.com/article/10.1023%2FA%3A1009869804967
        
    Arguments: 
        x (Numpy vector of int): Naive ordinally encoded values of categorical variable.
        y (Numpy vector of int): Class labels corresponding to rows in <x>.
        levels (list): All original unique categorical levels.
        ncat (int): Number of unique categories.
        nclass (int): Number of unique classes.
                
    Returns: 1-d array containing the categorical variable's PCA rank encoded values,
             dict holding the level:PCArank mapping for each unique categorical value
    """
    ncat = len(levels)                           
    P, w = get_prob_matrix(x_int, y, n, ncat, nclass)      # Construct the levels' class prob. matrix.
    Sigma = wt_cov_matrix(P, w, n)                         # Then build the weighted cov matrix.
    v = np.linalg.svd(Sigma)[2][0]                         # Next, get direction of 1st principal component.
    s = P.dot(v)                                           # And project levels' class prob coords. onto 1st PC.
    zipped = list(zip(levels, list(range(ncat))))          # Pair levels with their ordinal encodings.
    sorted = [zipped[i] for i in np.argsort(s)]            # Then sort these pairs according to levels' PC ranks.
    int_pca_map = {z[1]:r for r, z in enumerate(sorted)}   # Map each ordinal encoding to the appropriate PC rank.
    x_pca = [int_pca_map[i] for i in x_int]                # Exchange column's ordinal values with these ranks.
    level_pca_map = {z[0]:r for r, z in enumerate(sorted)} # Finally, map each orig. level to its PC rank.
    return x_pca, level_pca_map

In [17]:
from scipy import stats

def preprocess_train(x, y, cat_feats=[]):
    """Pre-process training data and labels for training via random forests.
    
    Uses these heuristics:
        1. Categorical features by default are preprocessed using PCA encoding from:
               Coppersmith et. al. (1999): https://link.springer.com/article/10.1023%2FA%3A1009869804967
           This done only once at the beginning of training (and not at each split), from:
               Wright and König (2019): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6368971/pdf/peerj-07-6339.pdf
           Avoids the absent levels problem described in:
               Wu (2018): https://dl.acm.org/doi/abs/10.5555/3291125.3309607
        2. Numerical feature NaNs replaced by median value.
        3. Categorical feature NaNs replaced by mode level. Both these NaN strategies
           are Breiman's "current preferred method" from:
               Breiman (2002): https://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf
    
    Arguments: 
        x (Pandas or Numpy array): The original un-preprocessed training data.
        y (Pandas or Numpy array): Numerically encoded training labels.
        cat_feats          (list): Categorical features' column indices.
        
    Returns: 
        The processed training data and labels, a dictionary of values used to fill
        each column's NaN values, and a dictionary containing categorical level-to-PCA maps.
        The contents of both these dicts are stored under column index numbers.
    """
    x, y = np.asfortranarray(x), np.ascontiguousarray(y)
    n, nclass, nfeat, has_cat_feats = len(x), len(np.unique(y)), x.shape[1], len(cat_feats) > 0
    num_feats = [i for i in range(nfeat) if i not in cat_feats]; has_num_feats = nfeat > len(cat_feats)
    nan_fillers, pca_maps = {}, {} # NaN fill values and cat level-to-PCA mappings stored under feat col idxs.
    if has_cat_feats:
        for i in cat_feats:
            values = x[:,i]
            nans = pd.isna(values); has_nans = nans.sum() > 0
            # Step 1: Naive ordinal encode all non-NaN categorical values.
            values[~nans], levels = integer_encode(values[~nans])
            # Step 2: Store modes of all categorical features.
            mode = stats.mode(values.astype(float), nan_policy='omit').mode[0]
            # Step 3: Replace any NaNs in cat cols with cols' modes.
            if has_nans: values[nans] = mode
            # Step 4: PCA rank-encode all categorical features.
            x[:,i], pca_maps[i] = PCA_rank_encode(values.astype(int), y, levels, n, nclass)
            nan_fillers[i] = pca_maps[i][levels[int(mode)]] # Store the PCA rank-encoded mode value for each cat feat.
    if has_num_feats:                                       # The levels list stores orig. level strings in order of 
        values = x[:,num_feats].astype(float)               # their naive ordinal encodings.
        nans = np.isnan(values); has_nans = nans.sum() > 0
        # Step 5: Store medians of all numerical features.
        medians = np.nanmedian(values, axis=0)
        for i,m in enumerate(medians): nan_fillers[num_feats[i]] = m
        # Step 6: Replace any NaNs in num cols with cols' medians.
        if has_nans: x[:,num_feats] = np.where(nans, medians, x[:,num_feats])
    return x.astype('float64'), y, nan_fillers, pca_maps 

In [18]:
def preprocess_test(x, nan_fillers, cat_feats=[], pca_maps=None):
    """Pre-process test (or any unlabelled) data for inference via random forests.
    
    Fills NaNs and encodes categorical labels using values derived from a training set.
    
    Arguments: 
        x (Pandas or Numpy array): The original un-preprocessed test data.
        nan_fillers        (dict): The values (stored under col idxs) used to fill each column's NaN rows.
        cat_feats          (list): Categorical features' column indices.
        pca_maps           (dict): Categorical level-to-PCA maps (each stored under a col's idx).
         
    Returns: 
        The processed test data.
    """
    x, nfeat, has_cat_feats = np.asfortranarray(x), x.shape[1], len(cat_feats) > 0
    if has_cat_feats:
        for i in cat_feats:
            # Step 1: Replace any new levels with NaN.
            levels = list(pca_maps[i].keys())
            new_levels = [l for l in x[:,i] if l not in levels]
            if len(new_levels) > 0: 
                new_levels_rows = np.stack([x[:,i] == l for l in new_levels]).sum(axis=0) > 0
                x[:,i] = np.where(new_levels_rows, np.nan, x[:,i])
            # Step 2: Replace levels in each cat col's non-NaN rows with proper PCA ranks.
            nans = pd.isna(x[:,i])
            pca_enc = [pca_maps[i][l] for l in x[:,i][~nans]]
            x[:,i][~nans] = pca_enc
    # Step 3: Replace all columns' NaNs   
    nan_filler_list = np.array([nan_fillers[i] for i in np.sort(list(nan_fillers.keys()))], dtype=object)
    nans = pd.isna(x); has_nans = nans.sum() > 0
    if has_nans: x = np.where(nans, nan_filler_list, x)
    x = np.asfortranarray(x)
    return x.astype('float64')

#### Preprocess the Titanic data

In [19]:
xTrain_proc, yTrain_proc, nan_fillers, pca_maps = preprocess_train(xTrain_titanic, yTrain_titanic, cat_feats)
xVal_proc = preprocess_test(xVal_titanic, nan_fillers, cat_feats, pca_maps)

## Scoring Helper Functions

In [20]:
def error(preds, labels):
    return np.sum(np.array(preds) != np.array(labels))/len(labels)

In [21]:
def accuracy(preds, labels):
    return 1 - error(preds, labels)

## Python Louppe Tree's Speed on the Titanic Data

In [22]:
m = 4 # number of features randomly selected as candidates for each split.
dt = DecisionTreeLouppe(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc);

In [23]:
dt.size # Number of nodes in the decision tree.

371

In [24]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.752808988764045

In [25]:
%timeit dt.fit(xTrain_proc, yTrain_proc);

63.1 ms ± 967 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [26]:
%timeit dt.predict(xVal_proc)

504 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Louppe Decision Tree Cython Version

In [27]:
%%cython
# cython: wraparound=False, boundscheck=False, cdivision=True, initializedcheck=False
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11

import numpy as np
cimport numpy as np
np.import_array()
ctypedef np.float64_t DTYPE_t
ctypedef np.intp_t SIZE_t # Signed, same as ssize_t in C. See MSeifert's SO answer: https://stackoverflow.com/a/46416257/8628758
cimport cython
from libc.math cimport log as ln
from libc.stdlib cimport realloc, free
from libc.string cimport memcpy
from libc.string cimport memset
from libcpp.stack cimport stack

# For C++ random number generation.
from libc.stdint cimport uint_fast32_t 

# Swap helper func for sorting.
cdef inline void dual_swap(DTYPE_t* items, SIZE_t* rows, SIZE_t i, SIZE_t j) nogil:
    items[i], items[j] = items[j], items[i]
    rows[i], rows[j] = rows[j], rows[i]

# Quicksort helpers

cdef inline void dual_med_three(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Find the median-of-three pivot point of the second through final 
    items of a list of numbers. Once identified, the pivot is moved to 
    the front of the list. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L78
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t middle = <int>(first + (last - first)/2)
    cdef SIZE_t second = first + 1
    last -= 1
    if items[second] < items[middle]:
        if items[middle] < items[last]:
            dual_swap(items, rows, first, middle)    
        elif items[second] < items[last]:
            dual_swap(items, rows, first, last)         
        else:                        
            dual_swap(items, rows, first, second)
    elif items[second] < items[last]:
        dual_swap(items, rows, first, second)
    elif items[middle] < items[last]:
        dual_swap(items, rows, first, last)
    else:
        dual_swap(items, rows, first, middle)

cdef inline SIZE_t dual_partition(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last, SIZE_t pivot) nogil:
    """Group numbers less than the pivot value together on the left and
    those that are greater on the right. Find the index that separates
    these two groups, which will belong to the first item that is greater
    than or equal to the pivot. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L1885
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
        pivot      : Index holding the median pivot value.
        
    Returns:
        Index of cut point used to partition the items into two smaller sequences.
    """
    while True:
        while first < last and items[first] < items[pivot]:
            first += 1                      # Get index of first item greater than or equal to median-of-three pivot. 
        last -= 1
        while items[pivot] < items[last]:
            last -= 1                       # Get index of last item less than or equal to the pivot.
        if not (first < last): 
            return first                    # After swaps are done, return index of first item in right partition.
        
        dual_swap(items, rows, first, last) # Swap the first item greater than or equal to the pivot with the
                                            # last item less than or equal to the pivot. 
        first += 1

cdef inline void dual_insertion_sort(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Follows the spirit of the Numpy implementation at: 
        https://github.com/numpy/numpy/blob/5ffb84c3057a187b01acdeaa628137193df12098/numpy/core/src/npysort/quicksort.cpp#L211
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t i
    cdef SIZE_t j
    cdef SIZE_t k
    cdef DTYPE_t val
    for i in range(first+1, last):
        j = i
        k = i - 1
        val = items[i]
        row = rows[i]
        while (j > first) and val < items[k]:
            items[j] = items[k]
            rows[j] = rows[k]
            j-=1
            k-=1
        items[j] = val
        rows[j] = row

# Heapsort

cdef inline void dual_sift_down(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n, 
                                SIZE_t p, SIZE_t c, DTYPE_t val, SIZE_t row) nogil:
    """Swap a heap item with one of its children if that child's value is 
    greater than or equal to that parent's value. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L61
    
    Arguments:
        items: 1-d array containing numbers.
        rows : Row indices of all training samples.
        start: Index of the first number.
        n    : Quantity of numbers.
        p    : Index of the parent.
        c    : Index of the parent's first (left) child.
        val  : The parent's value.
        row  : The parent's training row index.
    """
    while c < n:    # Look at the descendents of current parent, `p`.
        if c < n-1 and items[start + c] < items[start + c + 1]: # Find larger of the first and second children.
            c += 1
        if val < items[start + c]: # If child greater than parent, swap child and parent.
            items[start + p] = items[start + c]
            rows[start + p] = rows[start + c]
            p = c   # Current greater child becomes the parent.
            c += c  # Look at this child's child, if it exists.
        else:
            break 
    items[start + p] = val
    rows[start + p] = row

cdef inline void dual_sort_heap(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Sort a binary max heap of numbers. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L77
    
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        rows : Row indices of all training samples.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    cdef DTYPE_t val
    cdef SIZE_t row
    while n > 0:
        n -= 1
        val = items[start + n]
        row = rows[start + n]
        items[start + n] = items[start]
        rows[start + n] = rows[start]
        dual_sift_down(items, rows, start, n, 0, 1, val, row)

cdef inline void dual_heapify(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Turn a list of items into a binary max heap. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L59
    
    Arguments:
        items: 1-d array containing numbers.
        rows : Row indices of all training samples.
        start: Index of the first number.
        n    : Quantity of numbers.
    """
    cdef DTYPE_t val
    cdef SIZE_t p
    cdef SIZE_t last_p = (n-2)//2
    for p in range(last_p, -1, -1):
        val = items[start + p] # value of last parent
        row = rows[start + p]
        dual_sift_down(items, rows, start, n, p, 2*p + 1, val, row)

cdef inline void dual_heapsort(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Applies the heapsort algorithm to sort a list of items from least to greatest. 
    From Williams, 1964.
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        rows : Row indices of all training samples.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    dual_heapify(items, rows, start, n)
    dual_sort_heap(items, rows, start, n)
    
# Introsort 

cdef void dual_introsort_loop(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last, int depth) nogil:
    """The recursive heart of the introsort algorithm.
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
        depth      : Current recursion depth.
    """
    cdef int MIN_SIZE_THRESH = 16
    cdef SIZE_t cut
    while last-first > MIN_SIZE_THRESH:
        if depth == 0:
            dual_heapsort(items, rows, first, last-first)
        depth -= 1
        dual_med_three(items, rows, first, last)
        cut = dual_partition(items, rows, first+1, last, first)
        dual_introsort_loop(items, rows, cut, last, depth)
        last = cut

# Log base-2 helper function. From Sklearn's implementation at:
#     https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_utils.pyx#L7
cdef inline DTYPE_t log2(DTYPE_t x) nogil:
    return ln(x) / ln(2.0)

cdef void dual_introsort(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Implementation as described in Musser, 1997. Switches to heapsort
    when max recursion depth exceeded. Otherwise uses median-of-three 
    quicksort (Bentley & McIlroy, 1993) with all the usual optimizations:
        - Swap equal elements.
        - Only process partitions longer than the minimum size threshold.
        - When a new partition is made, recurse on the smaller half and 
          iterate over the larger half.
        - Make a final pass with insertion sort over the entire list.

    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef int max_depth = 2 * <int>log2(last-first)
    dual_introsort_loop(items, rows, first, last, max_depth)
    dual_insertion_sort(items, rows, first, last)
    
# For convenient memory reallocation.
ctypedef fused realloc_t:
    SIZE_t
    DTYPE_t
    Node

cdef inline realloc_t* safe_realloc(realloc_t* ptr, SIZE_t n_items) nogil except *:
    # Inspired by Sklearn's safe_realloc() func. However, thankfully
    # Cython now no longer requires us to send a pointer to a pointer
    # in order to prevent crashes.
    cdef realloc_t elem = ptr[0]
    cdef SIZE_t n_bytes = n_items * sizeof(elem)
    # Make sure we're not trying to allocate too much memory.
    if n_bytes/sizeof(elem) != n_items:
        with gil:
            raise MemoryError(f"Overflow error: unable to allocate {n_bytes} bytes.")       
    cdef realloc_t* res_ptr = <realloc_t *> realloc(ptr, n_bytes)
    with gil:
        if not res_ptr: raise MemoryError()
    return res_ptr

# C++ random number generator. Not yet a part of a Cython release so
# pasted in from: 
#     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/Cython/Includes/libcpp/random.pxd#L1
cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass random_device:
        ctypedef uint_fast32_t result_type
        random_device() except +
        result_type operator()() except +

    cdef cppclass mt19937:
        ctypedef uint_fast32_t result_type
        mt19937() except +
        mt19937(result_type seed) except +
        result_type operator()() except +
        result_type min() except +
        result_type max() except +
        void discard(size_t z) except +
        void seed(result_type seed) except +

    cdef cppclass uniform_int_distribution[T]:
        ctypedef T result_type
        uniform_int_distribution() except +
        uniform_int_distribution(T, T) except +
        result_type operator()[Generator](Generator&) except +
        result_type min() except +
        result_type max() except +
        
# Info for any node that will eventually be split or made into a leaf.
# Similar to what Sklearn does at:
#     https://github.com/scikit-learn/scikit-learn/blob/a2c4d8b1f4471f52a4fcf1026f495e637a472568/sklearn/tree/_tree.pyx#L126
cdef struct StackEntry:
    SIZE_t start
    SIZE_t end
    SIZE_t node_id
    SIZE_t parent_id
    SIZE_t n_const_feats

# To compare node splits.
cdef struct Split:
    SIZE_t feat
    DTYPE_t thresh
    SIZE_t pos
    DTYPE_t score  

# Vital characteristics of a node. Set when it's added to the tree.
cdef struct Node:
    SIZE_t l_child # idx of left child, -1 if leaf
    SIZE_t r_child # idx of right child, -1 if leaf
    SIZE_t feat    # col idx of split feature, -1 if leaf
    DTYPE_t thresh # double split threshold, NAN if leaf
    SIZE_t label   # class label if leaf, -1 if non-leaf.

cdef inline void find_num_split(SIZE_t* rows, DTYPE_t* items, SIZE_t* labels, SIZE_t node_start, SIZE_t n_parent, 
                                SIZE_t n_class, SIZE_t min_samples_leaf, DTYPE_t min_weight_leaf, DTYPE_t* c_wts,
                                DTYPE_t* l_wcc, DTYPE_t* r_wcc, DTYPE_t* parent_wcc, Split* best_split, 
                                SIZE_t current_feat, DTYPE_t parent_num, DTYPE_t parent_den, 
                                bint* current_feat_const) nogil:
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.
    
    Uses Gilles Louppe's split-finding algorithm:
        Page 31 in Louppe, 2015: https://arxiv.org/pdf/1407.7502.pdf
    
    Saves a split's feature idx, threshold, position, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        rows              : Indices of all rows in the training set. Shape: (n train samples,).
        items             : The sorted feature values of the samples in the parent
                            node (beginning at `node_start`). Shape: (n train samples,).
        labels            : All training labels. Shape: (n training samples,).
        node_start        : Index of the beginning of the parent node in `rows`.
        n_parent          : Number of samples in the parent node.
        n_class           : Number of unique classes in the training set.
        min_samples_leaf  : Any leaf will have no fewer than this many samples.
        min_weight_leaf   : Total weight of any leaf's samples will be at least this much.
        c_wts             : Class weights. Shape: (`n_class`,).
        l_wcc             : Left child's weight class counts. Shape: (`n_class`,).
        r_wcc             : Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc        : Parent node's weight class counts. Shape: (`n_class`,).
        best_split        : Holds the feature, threshold, position, and impurity
                            score of the parent node's current best split.
        current_feat      : Column index of feature under investigation.
        parent_num        : Numerator of parent node's impurity score.
        parent_den        : Denominator of parent node's impurity score.
        current_feat_const: Whether current splitting feature is constant for all eligible split 
                            thresholds in the current node. 1 if yes, 0 otherwise.
    """
    # Variables used to calculate proxy gini scores.
    cdef DTYPE_t l_num, l_den, r_num, r_den, w, score
    cdef SIZE_t row, label, r
    
    # To iterate across all the node's samples' feature values.
    cdef SIZE_t prev_pos, pos, node_end
    cdef DTYPE_t lowest, next_lowest, mid
    
    prev_pos, pos = node_start, node_start
    node_end = node_start + n_parent
    lowest = items[pos]
    l_num, l_den = 0., 0.
    r_num, r_den = parent_num, parent_den
    
    # Find the best split and store its score, threshold, position,
    # as well as it's children's weighted class counts.
    while pos < node_end:
        while items[pos] == lowest: # When consecutive items have the same value.
            if pos == node_end - 1: # When the final few samples all have the same value.
                return
            pos+=1
        next_lowest = items[pos]
        mid = lowest/2. + next_lowest/2. # Split threshold is always the mid-point between two consecutive values.
        if mid == next_lowest: mid = lowest

        # Move samples from the left to right child when it's quicker to do so.
        if pos-prev_pos > node_end-pos-1:
            l_num, l_den = parent_num, parent_den
            r_num, r_den = 0., 0.
            memcpy(l_wcc, parent_wcc, n_class*sizeof(DTYPE_t))
            memset(r_wcc, 0, n_class*sizeof(DTYPE_t))
            for r in reversed(range(pos, node_end)):
                row = rows[r]
                label = labels[row]; w = c_wts[label]
                r_num += w*( 2*r_wcc[label] + w); r_den += w
                l_num += w*(-2*l_wcc[label] + w); l_den -= w 
                r_wcc[label] += w; l_wcc[label] -= w
        else:
            for r in range(prev_pos, pos):
                row = rows[r]
                label = labels[row]; w = c_wts[label] 
                l_num += w*( 2.*l_wcc[label] + w); l_den += w
                r_num += w*(-2.*r_wcc[label] + w); r_den -= w
                l_wcc[label] += w; r_wcc[label] -= w

        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf.
        if pos - node_start < min_samples_leaf: 
            lowest = next_lowest
            prev_pos = pos; pos+=1 
            continue
        elif node_end - pos < min_samples_leaf:
            return
        # l_den and r_den are left and right children's weighted sample sums.
        elif l_den < min_weight_leaf: 
            lowest = next_lowest
            prev_pos = pos; pos+=1 
            continue
        elif r_den < min_weight_leaf:
            return

        current_feat_const[0] = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Only update best split stats if current score beats all
            # other best found among all other features already explored 
            # at the current node.
            best_split.score, best_split.thresh = score, mid
            best_split.pos, best_split.feat = pos, current_feat
        lowest = next_lowest
        prev_pos = pos; pos+=1

cdef inline void make_num_split(SIZE_t* rows, DTYPE_t* X, StackEntry* node_info, Split* best_split, 
                                SIZE_t n_samples) nogil:
    cdef SIZE_t p, p_end
    p, p_end = node_info.start, node_info.end
    while p < p_end:
        if X[best_split.feat*n_samples + rows[p]] <= best_split.thresh: p+=1
        else: p_end-=1; rows[p], rows[p_end] = rows[p_end], rows[p] 

# Necessary constants.
cdef DTYPE_t NEG_INF = -np.inf
cdef DTYPE_t NAN = np.nan
            
cdef class _DecisionTree:
    """Fit a decision tree classifier using a depth-first algorithm.
    
    Based on page 31 in Louppe, 2015: https://arxiv.org/pdf/1407.7502.pdf
    Keeps track of and avoids features that are constant for a given node's samples.
    
    Attributes:
        m                       : Number of candidate features randomly selected to try to split each node.
        min_samples_leaf        : Any leaf will have no fewer than this many samples.
        min_weight_fraction_leaf: Total weight of any leaf's samples must comprise this portion 
                                  of the sum of weights of *all* training samples used to fit 
                                  the tree.
        class_weights           : Sample weight to be used for each class. Shape: (`n_class`,).
        seed                    : Seed of the random number generator used for tree growing.
        rng                     : C++ 19937 32bit int random number generator.
        rows                    : Row indices of all training samples. Shape: (`n_samples`,).
        features                : Column indices of all training features. Shape: (`n_features`,).
        n_class                 : Number of unique classes in the training set.
        n_samples               : Number of samples in the training set.
        n_features              : Number of features used to train.
        mem_capacity            : Max number of tree nodes that can be stored in `self.nodes`
                                  and `self.weighted_class_counts`.
        min_weight_leaf         : Total weight of any leaf's samples will be at least this much.
        n_nodes                 : Number of nodes in the tree.

        Decision Tree data structure
        ----------------------------
        nodes                   : All nodes in the decision tree. Shape: (`n_nodes`,).
        weighted_class_counts   : Weighted class counts of training samples in each node.
                                  Shape: (`n_nodes` x `n_class`,).
    """
    
    # Class attributes.
    cdef SIZE_t seed
    cdef mt19937 rng
    cdef SIZE_t mem_capacity
    cdef SIZE_t n_samples
    cdef SIZE_t n_features
    cdef SIZE_t n_class
    cdef SIZE_t m
    cdef SIZE_t min_samples_leaf, 
    cdef DTYPE_t min_weight_fraction_leaf
    cdef DTYPE_t min_weight_leaf
    cdef SIZE_t n_nodes
    cdef SIZE_t* rows
    cdef SIZE_t* features
    cdef DTYPE_t* class_weights
    cdef Node* nodes
    cdef DTYPE_t* weighted_class_counts
    def __cinit__(self, SIZE_t m, SIZE_t min_samples_leaf, DTYPE_t min_weight_fraction_leaf, SIZE_t seed): 
        """
        Arguments:
            m                       : Number of candidate features randomly selected to try to split each node.
            min_samples_leaf        : Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf: Total weight of any leaf's samples must comprise this portion 
                                      of the sum of weights of *all* training samples used to fit the tree.
            seed                    : A seed for the C++ mt19937 32bit int random generator. 
                                      Use when reproducibility is desired.
        """
        self.m, self.min_samples_leaf = m, min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = NULL
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = NULL 
        
    def __dealloc__(self):
        free(self.nodes)
        free(self.weighted_class_counts)
        
    property size:
        def __get__(self):
            return self.n_nodes
    
    property left_children:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].l_child
            return out

    property right_children:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].r_child
            return out
        
    property split_features: 
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].feat
            return out
        
    property split_thresholds:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].thresh
            return out
        
    property weighted_cc:
        def __get__(self):
            cdef SIZE_t out_size = self.n_nodes*self.n_class
            out = np.empty(out_size, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(out_size):
                    out_view[i] = self.weighted_class_counts[i]
            out.resize(self.n_nodes, self.n_class)
            return out
    
    property labels:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].label
            return out
    
    cdef void _increase_mem_capacity(self, SIZE_t new_capacity) nogil:
        self.nodes = safe_realloc(self.nodes, new_capacity)
        self.weighted_class_counts = safe_realloc(self.weighted_class_counts, self.n_class*new_capacity)
    
    cdef void _make_leaf(self, Node* leaf_node, SIZE_t* y, SIZE_t node_start, SIZE_t node_id, 
                         SIZE_t n_classes_node, SIZE_t* max_wt_classes) nogil:
        # Class with largest wcc becomes leaf node's label. Break ties with a random choice.
        cdef SIZE_t label
        cdef DTYPE_t max_wt = 0.
        cdef SIZE_t lb = 0
        cdef SIZE_t ub = -1
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t i, j
        # If all node's samples have the same class.
        if n_classes_node == 1:
            label = y[self.rows[node_start]]
        else:
            # Otherwise find label with max weighted class count for the node.
            for i in range(self.n_class):
                max_wt = max(max_wt, self.weighted_class_counts[node_id*self.n_class + i])
            # See if multiple classes share this max count.
            for i in range(self.n_class):
                if self.weighted_class_counts[node_id*self.n_class + i] == max_wt:
                    ub += 1
                    max_wt_classes[ub] = i
            # If so, randomly choose leaf's label from among those classes.
            if ub > 0:
                dist = uniform_int_distribution[SIZE_t](lb, ub) # Choose an int w/in range lb, ub, inclusive.
                j = dist(self.rng)
                label = max_wt_classes[j]
            else:
                label = max_wt_classes[lb]
        leaf_node.l_child = -1
        leaf_node.r_child = -1    
        leaf_node.feat = -1  
        leaf_node.thresh = NAN
        leaf_node.label = label 

    cdef _grow_tree(self, DTYPE_t* X, SIZE_t* y):
        # LIFO stack holding all nodes still to be investigated.
        cdef stack[StackEntry] node_stack

        #####################################################################
        # Variables containing info of the node currently being investigated.
        #####################################################################
        cdef SIZE_t start, end, node_id, parent_id, n_consts, n_samples_node
        cdef DTYPE_t* node_wcc = NULL
        cdef StackEntry node_info
        cdef Node* node = NULL
        
        # Holds child node info if the current node gets split.
        cdef SIZE_t l_child_id, r_child_id
        cdef Node* l_child_node = NULL
        cdef Node* r_child_node = NULL
        
        #####################################################################
        # For finding the best split.
        #####################################################################
        cdef Split best_split
        cdef DTYPE_t* l_wcc = NULL
        cdef DTYPE_t* r_wcc = NULL
        cdef DTYPE_t sum_node_wcc_sqr, sum_node_wcc # Parent node's proxy Gini score num and den.
        
        # Indicates a feature has been discovered to be constant during a
        # split search within the search range permitted by min_samples_leaf 
        # and min_weight_leaf.
        cdef bint current_feat_const 

        # Create a C-contiguous array of doubles to hold feature values of a 
        # given node's samples. Using Numpy to allocate memory to longer 
        # vectors is often faster than using realloc().
        cdef DTYPE_t[::1] items_buffer = np.empty(self.n_samples, dtype=np.float64)
        cdef DTYPE_t* items = &items_buffer[0]
        cdef SIZE_t r
        
        #####################################################################
        # For random feature selection (w/out replacement) and keeping track 
        # of nodes' constant features. 
        #####################################################################
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t lb, ub, idx, feat_idx, n_drawn_feats, n_new_consts, n_total_consts
        cdef SIZE_t[::1] features_buffer = np.empty(self.n_features, dtype=np.intp) 
        cdef SIZE_t* features = &features_buffer[0]
        cdef SIZE_t[::1] constant_features_buffer = np.empty(self.n_features, dtype=np.intp)
        cdef SIZE_t* constant_features = &constant_features_buffer[0]
        
        #####################################################################
        # For determining whether node should be a leaf.
        #####################################################################
        cdef SIZE_t i, c, cc, n_classes_node, row, label
        cdef DTYPE_t wcc, wt
        # Stores classes that share a leaf's max class wt. When two or more 
        # present, leaf label randomly chosen from these classes
        cdef SIZE_t* max_wt_classes = NULL
        
        with nogil:
            # Allocate memory to pointers.
            l_wcc = safe_realloc(l_wcc, self.n_class)
            r_wcc = safe_realloc(r_wcc, self.n_class)
            node_wcc = safe_realloc(node_wcc, self.n_class)
            max_wt_classes = safe_realloc(max_wt_classes, self.n_class*sizeof(SIZE_t))
            # Fill with feature column indices so we can track constant feats.
            memcpy(features, self.features, self.n_features* sizeof(SIZE_t))
            
            # Push root node onto the LIFO stack.
            node_stack.push({"start": 0, "end": self.n_samples, "node_id": 0, 
                             "parent_id": 0, "n_const_feats": 0})
            self.n_nodes = 1
            while not node_stack.empty():
                node_info = node_stack.top()
                node_stack.pop()
                start, end = node_info.start, node_info.end
                node_id, parent_id = node_info.node_id, node_info.parent_id # TODO: `parent_id` unused; is it necessary?
                n_consts = node_info.n_const_feats
                n_samples_node = end-start
                node = &self.nodes[node_id]
                
                # Tabulate the current node's weighted class counts.
                #
                # Implementation detail #1: I tried storing the l and r child wt class cts
                # of nodes' best splits so that this tabulation wouldn't need to be 
                # performed for each node. But found there was virtually no speed improvement
                # to justify the more complicated code required to store and update these 
                # values during the best split search.
                #
                # Implementation detail #2: Setting aside a block of memory to 
                # store the current node's wt class cts and passing a pointer to
                # this block to the split search function sped up training by 8%
                # compared to passing a ptr to the location of node's wt class cts 
                # in the self.weighted_class_counts array.
                memset(node_wcc, 0, self.n_class*sizeof(DTYPE_t))
                sum_node_wcc, sum_node_wcc_sqr = 0., 0.
                for i in range(n_samples_node):
                    row = self.rows[start + i]
                    label = y[row]
                    wt = self.class_weights[label]
                    # Compute the node's proxy gini numerator and denominator while we're at it.
                    sum_node_wcc_sqr += wt*(2*node_wcc[label] + wt) # numerator
                    sum_node_wcc += wt                              # denominator
                    node_wcc[label] += wt
                memcpy(&self.weighted_class_counts[node_id*self.n_class], node_wcc, self.n_class*sizeof(DTYPE_t))
                
                # Make a leaf if required to do so. 
                n_classes_node = 0
                for c in range(self.n_class):
                    wcc = node_wcc[c]
                    if wcc > 0: n_classes_node += 1
                if n_classes_node == 1:                   
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif n_samples_node < 2*self.min_samples_leaf:  
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif sum_node_wcc < 2.*self.min_weight_leaf: 
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)

                # Otherwise split the node.
                else:
                    # Initialize stats for best split of node.
                    best_split.feat = -1
                    best_split.thresh = 0.
                    best_split.pos = -1
                    best_split.score = NEG_INF

                    # Ensure feats drawn w/out replacement.
                    n_drawn_feats = 0
                    n_new_consts = 0
                    n_total_consts = n_consts
                    lb = 0                      # Range in `features` array from which we 
                    ub = self.n_features - 1    # randomly select a feature's column index. 
                        
                    while n_drawn_feats < self.m:
                        n_drawn_feats += 1

                        # Breiman & Cutler's original Fortran random forests implementation 
                        # allows for known constant features to be drawn during a split-search.
                        # I follow their example, as I believe that doing so allows individual 
                        # trees to be less correlated with each other. Since I don't pre-sort
                        # features, I would prefer not to have to sort any more features than
                        # necessary, and so I've adopted the technique Sklearn uses to track 
                        # constant features:
                        #     https://github.com/scikit-learn/scikit-learn/blob/dbe39454f766ebefc3219f2c1871ac1774316532/sklearn/tree/_splitter.pyx#L310
                        # 
                        # The idea is that feature idxs in `features` are organized into two sections:
                        #
                        #     [<indices of known constant feats>, <indices of non-constant feats>]
                        #
                        # As we begin drawing feature indices from this above list, those two sections
                        # will each be further sub-divided into two sections:
                        # 
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <undrawn non-constant feats>, <drawn non-constant feats>]
                        #
                        # When we choose a feature that happens to be a known constant, we'll re-locate
                        # its idx to the right-end of the first of those four sections. Then we 
                        # increment the lower bound threshold, `lb`, by one so that we don't re-draw 
                        # that feature again.
                        #
                        # Similarly, if we draw a non-constant feature idx, we'll move it to the 
                        # left-end of the last of the four partitions and reduce the upper bound
                        # threshold, `ub`, by one so that the feature idx can't be drawn again
                        # during this split-search. 
                        #
                        # One last important detail: sometimes we'll draw a feature that 
                        # used to be non-constant for ancestor nodes, but will be found to be 
                        # constant for the current node. When this happens, we relocate its 
                        # index so that it sits to the right of the known constant feats section.
                        # This means our `features` list could have up to five partitions:
                        #
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <newly discovered const feats>, <undrawn non-constant feats>, 
                        #      <drawn non-constant feats>]
                        #
                        # Whenever we find a new constant feature, we increment the `n_new_consts`
                        # counter by one. We also increment the `n_total_consts` counter by one. 
                        # During the split-search we have to use `n_total_consts` to keep track of
                        # the total number of constant features. n_consts` mustn't be changed
                        # because it tells us where the <newly discovered const feats> section
                        # of the `features` list begins.

                        # One last wrinkle. We subtract the # of newly discovered const feats from  
                        # the upper bound before we select an index `i` from the `features` array, 
                        # and add it back to `i` after `i` has been genereated. This prevents us from 
                        # re-drawing any of these new const feats again during this split-search.
                        dist = uniform_int_distribution[SIZE_t](lb, ub-n_new_consts)
                        idx = dist(self.rng)

                        # So that we don't draw a known constant feature again this split-search.
                        if idx < n_consts:
                            features[idx], features[lb] = features[lb], features[idx]
                            lb += 1 
                            continue

                        # So that no new const feats get drawn more than once per split-search.
                        idx += n_new_consts

                        feat_idx = features[idx]
                        # Prepare the rows' feature values for sorting.
                        for r in range(start, end):
                            # X is a pointer, so have to index into this 2d array in the C way 
                            # (also keeping in mind that the array is column-major).
                            items[r] = X[feat_idx*self.n_samples + self.rows[r]]

                        # Sort feature values and corresponding sample row indices
                        # to prepare for numerical split finding.
                        dual_introsort(items, self.rows, start, end)

                        # Make sure the feature not constant for node's samples.
                        if items[start] == items[end-1]:
                            # Move the newly-discovered constant feat to the far right-end
                            # of the left half of `features` list holding the known const
                            # feats as well as any other const feats newly discovered 
                            # during this node's split-search.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # Initialize weighted class counts of right and left children.
                            # Right child's counts are initially the same as parent node's.
                            memcpy(r_wcc, node_wcc, self.n_class*sizeof(DTYPE_t))
                            memset(l_wcc, 0, self.n_class*sizeof(DTYPE_t))

                            # If the feature has an impurity score that's better than the best score 
                            # found among all other features visited thus far for this node, find_num_split()
                            # updates the attributes of the struct containing the node's best split info. 
                            # 
                            # But even if a new best score isn't reached, if an impurity score can
                            # be calculated at least once during the feature's split search, the
                            # following indicator will be toggled off, to indicate that the feature
                            # is not constant.
                            current_feat_const = 1 # 1 = is constant; 0 = not constant
                            find_num_split(self.rows, items, y, start, n_samples_node, 
                                           self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                           self.class_weights, l_wcc, r_wcc, node_wcc,
                                           &best_split, feat_idx, sum_node_wcc_sqr, sum_node_wcc,
                                           &current_feat_const)

                            # The feature may be constant only within the search range 
                            # permitted by self.min_samples_leaf and self.min_weight_leaf. 
                            # If so, the feature is a newly discovered constant.
                            if current_feat_const:
                                features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                                n_new_consts += 1
                                n_total_consts += 1
                                continue
                            else:
                                # The feature is non-constant, so we ensure it's not drawn again
                                # during this split-search.
                                features[idx], features[ub] = features[ub], features[idx]
                                ub -= 1 

                    # To ensure that the constant features info is accurate for sibling or child nodes.
                    memcpy(&features[0], &constant_features[0], sizeof(SIZE_t)*n_consts)
                    memcpy(&constant_features[n_consts], &features[n_consts], sizeof(SIZE_t)*n_new_consts)

                    # Make node a leaf if constant for all randomly drawn feats.
                    # (# drawn known constant feats + # drawn new constant feats)
                    if lb + n_new_consts == n_drawn_feats: 
                        self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                    else: 
                        make_num_split(self.rows, X, &node_info, &best_split, self.n_samples) 

                        # Update tree info for node that's getting split.
                        l_child_id = self.n_nodes
                        r_child_id = l_child_id + 1
                        node.l_child = l_child_id
                        node.r_child = r_child_id
                        node.feat    = best_split.feat
                        node.thresh  = best_split.thresh
                        node.label   = -1

                        # Prepare for the left and right child nodes
                        # by increasing tree data memory capacity if
                        # necessary.
                        if self.n_nodes + 2 > self.mem_capacity:
                            # Expand memory capacity geometrically. See "geometric growth" 
                            # part of WhozCraig's SO answer at: 
                            #     https://stackoverflow.com/a/51665863/8628758.
                            # Add one after squaring so that the new capacity can
                            # contain not only a tree of greater depth, but also
                            # the maximum # nodes that that depth could have.
                            new_capacity = 2*self.mem_capacity + 1
                            self._increase_mem_capacity(new_capacity)
                            self.mem_capacity = new_capacity
                        
                        # Push right child info onto the LIFO stack.
                        node_stack.push({"start": best_split.pos, "end": end, "node_id": r_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})
                        # Push left child info onto queue.
                        node_stack.push({"start": start, "end": best_split.pos, "node_id": l_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})

                        # And update size of the tree.
                        self.n_nodes += 2
                        
        free(l_wcc)
        free(r_wcc)
        free(node_wcc)
        free(max_wt_classes)
    
    def fit(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X, np.ndarray[SIZE_t, ndim=1, mode="c"] y, 
            np.ndarray[SIZE_t, ndim=1, mode="c"] rows, np.ndarray[SIZE_t, ndim=1, mode="c"] features,
            np.ndarray[DTYPE_t, ndim=1, mode="c"] class_weights, SIZE_t n_class): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X       (2D Fortran-contiguous array of float64): Pre-processed training data.
            y                 (1D C-contiguous array of int): Training labels.
            rows              (1D C-contiguous array of int): Indices of the rows to be used for training. 
            feats             (1D C-contiguous array of int): Column indices of training features.
            class_weights (1D C-contiguous array of float64): Desired weight for each class. Shape: (`n_class`,).
            n_class                                         : Number of classes in training data.  
        """
        # Casting the raw data to pointers gives a 17% speed-up compared to getting
        # pointer from the ndarray's buffer interface as recommended by DavidW in 
        # his SO answer at: https://stackoverflow.com/a/54832269/8628758. e.g.
        #     cdef DTYPE_t[::1,:] X_buffer = X
        #     cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        # Not worried about unexpected behavior as all ndarrays' contiguousness and
        # memory layout enforced prior to this point.
        cdef DTYPE_t* X_ptr = <DTYPE_t*> X.data
        cdef SIZE_t* y_ptr = <SIZE_t*> y.data
        self.rows = <SIZE_t*> rows.data
        self.features = <SIZE_t*> features.data
        self.class_weights = <DTYPE_t*> class_weights.data
        self.n_class = n_class
        self.n_samples = rows.shape[0]
        self.n_features = features.shape[0]
        cdef random_device rd # Needed when using the C++ mt19937 rng w/out a seed.
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        cdef SIZE_t init_capacity = 15
        
        cdef SIZE_t i, row, label
        cdef DTYPE_t wt
        cdef DTYPE_t sum_wts = 0
        cdef Node* root_node = NULL
        with nogil:
            # Allocate memory for the tree.
            self._increase_mem_capacity(init_capacity)
            self.mem_capacity = init_capacity
 
            # And sum the class weights of all the root node's samples in
            # order to know minimum total weight a leaf must have (which
            # we must know when regularizing by min_weight_fraction_leaf.)
            for i in range(self.n_samples):
                row = self.rows[i]
                label = y_ptr[row]
                wt = self.class_weights[label]
                sum_wts += wt
            self.min_weight_leaf = self.min_weight_fraction_leaf*sum_wts
            
            # Initialize the random number generator. Followed example from:
            #     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/tests/run/cpp_stl_random.pyx#L16
            if self.seed == -1:
                self.rng = mt19937(rd()) # If using the random device engine std::random_device.
            else:
                self.rng = mt19937(self.seed)

        # Initiate tree building.
        self._grow_tree(X_ptr, y_ptr)
    
    cdef Node* _next_node(self, SIZE_t nxt) nogil: 
        return &self.nodes[nxt]
    
    cdef SIZE_t _get_leaf_idx(self, SIZE_t i, Node* leaf, SIZE_t n, DTYPE_t* X) nogil:
        cdef SIZE_t idx
        cdef SIZE_t root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[leaf.feat*n + i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else: 
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
    
    def predict(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_preds = X.shape[0]
        cdef SIZE_t i
        preds = np.empty(n_preds, dtype=np.intp)
        cdef SIZE_t[::1] preds_view = preds
        cdef Node leaf
        with nogil:
            for i in range(n_preds): 
                preds_view[i] = self.nodes[self._get_leaf_idx(i, &leaf, n_preds, X_ptr)].label
        return preds
    
    def predict_probs(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
            """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions. Shape: (`X.size`, `self.n_class`)
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        cdef DTYPE_t[::1] wcc_view = wcc
        cdef Node leaf
        cdef SIZE_t i, j, idx
        with nogil:
            for i in range(n_probs):
                idx = self._get_leaf_idx(i, &leaf, n_probs, X_ptr)
                for j in range(self.n_class):
                    wcc_view[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

class DecisionTreeLouppeCython():
    """Fit a decision tree classifier using a depth-first algorithm.
    
    Based on page 31 in Louppe, 2015: https://arxiv.org/pdf/1407.7502.pdf
    Keeps track of and avoids features that are constant for a given node's samples.

    Attributes:
            m                            (int): Number of candidate features randomly selected to try 
                                                to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit 
                                                the tree.
            seed                         (int): Seed of the random number generator used for tree growing.
            rows              (ndarray of int): Row indices of all training samples. Shape: (n training samples,).
            features          (ndarray of int): Column indices of all training features. Shape: (n features,).
            class_weights (ndarray of float64): Sample weight to be used for each class. Shape: (`n_class`,).
            n_class                      (int): Number of unique classes in the training set.
    """
    
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights = [], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit the tree.
            seed                         (int): Use when reproducibility is desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C') 
        if seed is None:
            self.seed = -1
        else:
            self.seed = seed
        self._tree = _DecisionTree(self.m, self.min_samples_leaf, self.min_weight_fraction_leaf, self.seed)
        
    @property
    def size(self): return self._tree.size
    
    @property
    def left_children(self): return self._tree.left_children
    
    @property
    def right_children(self): return self._tree.right_children
            
    @property 
    def split_features(self): return self._tree.split_features

    @property 
    def split_thresholds(self): return self._tree.split_thresholds
    
    @property
    def weighted_class_counts(self): return self._tree.weighted_cc
    
    @property
    def labels(self): return self._tree.labels
    
    def fit(self, X, y, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X (Fortran-style ndarray of float64): Pre-processed training data. 
                                                  Shape: (num train samples, num train features).
            y                   (ndarray of int): Training labels. Shape: (num train samples,).
            rows                          (list): Indices of the rows to be used for training. 
                                                  All rows used if empty.
            features                      (list): Column indices of training features that will be used.
                                                  All features used if empty.                          
        Returns:
            DecisionTreeLouppeCython: A decision tree object.
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        self.n_class = np.unique(y).size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.
            
        self._tree.fit(X, y, self.rows, self.features, self.class_weights, self.n_class)
        return self

    def predict(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict(X)
    
    def predict_probs(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict_probs(X)

## Cython Louppe Tree's Speed on the Titanic Data

In [28]:
m = 4
dt = DecisionTreeLouppeCython(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc);

In [29]:
dt.size # Tree is different cause Cython uses C++ rng, not Numpy rng as Python version does.

393

In [30]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.8202247191011236

In [31]:
%timeit dt.fit(xTrain_proc, yTrain_proc)

538 µs ± 22.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [32]:
%timeit dt.predict(xVal_proc)

8.54 µs ± 216 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


## Implementing Wright's Numerical Split Finders
To mimic what Ranger does, I'll need to write two split finder functions -- one to for when we split "small Q" nodes, and the other for splitting "large Q" nodes (where the ratio of the number of samples in the node to the number of unique values for a given feature in the training set is larger 0.02). I'll start by writing the "small Q" node split finder. 

Now I could just use the Sklearn numerical split finder to do this, but Marvin Wright designed Ranger to sort and explore split points using a different technique than Sklearn, and I think it'll be interesting to summarize how he approached the problem of cataloging and then iterating through numerical split points. Here's the process he used:
1. Get the sorted unique raw feature values held by the node's samples. 
2. The number of possible split points will ultimately be the length of the above result, minus one.
3. If the above result is one or less, the samples in the node are constant for the given feature, and the next feature can be investigated.
4. Create a 1-d array storing weighted sample class counts per split-point (for each unique value of the feature), and use a for-loop to iterate over all samples in the node to tabulate these counts.
5. At each split-point, we use one more for-loop to update the weighted class counts of the left and right children. A side-effect of shifting multiple samples (having the same raw feature value and class label) from the right to left child at the same time is that we won't be able to use an on-line algorithm to update the numerators used to calculate the proxy gini score. Instead, at each split point we'll have to use the updated left and right weighted class counts to calculate the right and left children's proxy gini numerators from scratch.

Conceptually-speaking, I find Wright's approach to investigating numerical split-points to be more elegant and intuitive than what Louppe does with Sklearn:
* No need to dual sort raw feature values and row indices.
* By only iterating over unique split points, we don't have to keep track of two values ("recent" and "next-most-recent") to cover situations where we iterate over a batch of samples that all have the same feature value.
* Easy to determine whether node is constant for the given feature early-on before iterating over any samples.
* No having to specify what to do in the corner case where the last few samples all have the same value.

Unfortunately, I expect that this simplicity comes at the cost of some speed:
* We have to iterate over all rows to compile the per-split class counts, and then we have to iterate over all unique split-points. It's conceivable that oftentimes this will amount to roughly twice as many iterations as Louppe's method requires (in Sklearn, you just iterate over each sample, once.)
* Can't on-line update numerators of proxy gini scores.

#### Aside: Small Q splits in `memory_saving_splitting` mode
If you were to comb through Ranger's source code, you'd notice that its small Q splitter [has an option](https://github.com/imbs-hl/ranger/blob/ce497711884c783e133fb36750b60de4c140773f/src/TreeClassification.cpp#L221) for what Wright calls `memory_saving_splitting` mode. This setting controls the length of the `split_class_counts` list that's used whenever a split search happens. When the setting is active, the small Q splitter creates the list of split-point class counts from scratch for each new candidate splitting feature. The list's length is no longer than the <# unique feature values of the node's samples> x <# of classes>.

If `memory_saving_splitting` mode is not engaged, before training begins Ranger [will allocate memory](https://github.com/imbs-hl/ranger/blob/ce497711884c783e133fb36750b60de4c140773f/src/TreeClassification.cpp#L37) for the creation of a single 1-d array that's of length <largest # unique values of any numerical feature in the training set> x <# classes>. This same list will be used to compile the split-point class counts for each and every split search, for all of a decision tree's nodes. Whenever a new candidate feature is drawn, the splitter will zero out the first several positions of this list such that the class counts for each unique raw feature value found among a given node's samples can be tabulated from scratch.

I surmise that the advantage of using the same set of memory addresses for each split search is that it avoids the overhead that's necessary to allocate memory for brand new class count lists over and over, for each candidate feature at each node that's to be split. Since my goal in this notebook is to find the fastest numerical splitting algorithm, my smallQ splitter implemention below *will not* use memory-saving mode by default, and will allocate memory for arrays used by the splitter only once at the beginning of training.

## Wright Small Node "Small Q" numerical split finder

In [33]:
def find_num_split_smallQ(X, rows, node_unique_vals_feat, labels, node_start, n_parent, n_class, 
                          min_samples_leaf, min_weight_leaf, c_wts, l_wcc, r_wcc,
                          parent_wcc, best_split, current_feat, parent_num, parent_den,
                          node_n_unique_vals_feat, split_counts_raw, split_class_counts_wt):
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Marvin Wright's SmallQ splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L233
    
    Saves a split's feature idx, threshold, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        X                     (ndarray of float64): Training data. Shape: (n train samples, n features).
        rows                      (ndarray of int): Indices of all rows in the training set. 
                                                    Shape: (n train samples,).
        node_unique_vals_feat (ndarray of float64): The sorted unique feature values of the samples in the parent
                                                    node (beginning index 0). Shape: (n train samples,).
        labels                    (ndarray of int): All training labels. Shape: (n training samples,).
        node_start                           (int): Index of the beginning of the parent node in `rows`.
        n_parent                             (int): Number of samples in the parent node.
        n_class                              (int): Number of unique classes in the training set.
        min_samples_leaf                     (int): Any leaf will have no fewer than this many samples.
        min_weight_leaf                  (float64): Total weight of any leaf's samples will be at least this much.
        c_wts                 (ndarray of float64): Class weights. Shape: (`n_class`,).
        l_wcc                 (ndarray of float64): Left child's weight class counts. Shape: (`n_class`,).
        r_wcc                 (ndarray of float64): Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc            (ndarray of float64): Parent node's weight class counts. Shape: (`n_class`,).
        best_split                         (Split): Holds the feature, threshold, and impurity
                                                    score of the parent node's current best split.
        current_feat                         (int): Column index of feature under investigation.
        parent_num                       (float64): Numerator of parent node's impurity score.
        parent_den                       (float64): Denominator of parent node's impurity score.
        node_n_unique_vals_feat              (int): Number of unique values for one feature found among 
                                                    the node's samples.
        split_counts_raw          (ndarray of int): Stores sample counts found at each unique split point of
                                                    a given feature in a given node. 
                                                    Shape: (<max cardinality of all numerical feats in dataset>,).
        split_class_counts_wt (ndarray of float64): Stores weighted class counts of each class at each unique
                                                    split point of a given feature in a given node. Shape:
                                                    (<max cardinality of all numerical feats in dataset> x `n_class`,)
              
    Returns: 
        int: 1 if feature is constant for eligible split-points. 0, otherwise.
    """
    # Whether or not feat is constant within search range permitted
    # by min_samples_leaf and min_weight_leaf (0 if no, 1 if yes).
    current_feat_const = 1
    
    # Tabulate both the sample counts at all possible split points
    # as well as weighted class counts at each split point.
    split_counts_raw[:node_n_unique_vals_feat] = 0
    split_class_counts_wt[:node_n_unique_vals_feat*n_class] = 0.
    for i in range(n_parent):
        row = rows[node_start + i]
        value = X[row][current_feat]
        label = labels[row]
        split_point_idx = np.searchsorted(node_unique_vals_feat, value, side='left', sorter=None)
        split_counts_raw[split_point_idx] += 1
        split_class_counts_wt[split_point_idx*n_class + label] += c_wts[label] 
        
    # To keep track of num samples in left child.
    n_left = 0    
    # Left child's proxy gini score denominator.
    l_den = 0.
    
    # Search for the threshold of the best split.
    for i in range(node_n_unique_vals_feat - 1):
        n_left += split_counts_raw[i]
        n_right = n_parent - n_left
        
        l_num, r_num = 0., 0. # To calculate numerators of proxy gini scores.
        for j in range(n_class):
            # Can't do the on-line proxy gini update algorithm cause we
            # move all samples from a given class over to the left side 
            # before updating the calculation.
            l_wcc[j] += split_class_counts_wt[i*n_class + j]
            r_wcc[j] -= split_class_counts_wt[i*n_class + j]
            l_num += l_wcc[j]*l_wcc[j]
            l_den += split_class_counts_wt[i*n_class + j]
            r_num += r_wcc[j]*r_wcc[j]
        r_den = parent_den - l_den

        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf
        if n_left < min_samples_leaf: continue
        elif n_right < min_samples_leaf: return current_feat_const
        elif l_den < min_weight_leaf: continue
        elif r_den < min_weight_leaf: return current_feat_const
          
        current_feat_const = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Split threshold is always the mid-point between two consecutive values.
            mid = node_unique_vals_feat[i]/2. + node_unique_vals_feat[i+1]/2. 
            if mid == node_unique_vals_feat[i+1]: mid = node_unique_vals_feat[i]
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat
    return current_feat_const

In [34]:
def make_num_split(rows, X, node_info, best_split):
    """Split a decision tree node using a given ordered numerical feature and threshold. 
    
    Uses the similar logic as Gilles Louppe's Cython implementation at: 
        https://github.com/scikit-learn/scikit-learn/blob/47e3358712d483a8e8dcb84d87386eb4f3d49070/sklearn/tree/_splitter.pyx#L605
    
    Arguments: 
        rows  (ndarray of int): Indices of all rows in the training set. 
                                Shape: (n train samples,).
        X (ndarray of float64): The training data. Shape: (n train samples, n features).
        node_info (StackEntry): Stats of the node to be split.
        best_split     (Split): Stats of a node split.
    
    Returns: 
        Position of first item in split's right child node.
    """
    p, p_end = node_info.start, node_info.end
    while p < p_end:
        if X[rows[p]][best_split.feat] <= best_split.thresh: p+=1
        else: p_end-=1; rows[p], rows[p_end] = rows[p_end], rows[p]
    return p

## Wright SmallQ Decision Tree Python Version

In [35]:
# For all Wright splitting algorithms (SmallQ, LargeQ, hybrids), 
# no need to store best split position in Split() struct.
# Will be returned by make_num_split() instead.
class Split():
    """Pertinent stats needed to compare node splits.
    
    Attributes:
        feat       (int): Column index of splitting feature.
        thresh (float64): Split threshold.
        score  (float64): Impurity score.
    """
    def __init__(self, feat, thresh, score):
        self.feat, self.thresh, self.score = feat, thresh, score

class DecisionTreeSmallQ():
    """Fit a decision tree classifier using a depth-first tree 
    growth algorithm. 
    
    Uses Marvin Wright's SmallQ numerical splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L233
    """
    
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights=[], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try 
                                                to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit 
                                                the tree.
            class_weights (ndarray of float64): Sample weight to be used for each class. Shape: (`n_class`,).
            seed                         (int): Use when reproducibility desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C')
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = np.empty(0, dtype=Node, order='C')
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = np.empty(0, dtype=np.float64, order='C')
        
    @property
    def size(self): return self.n_nodes
    
    @property 
    def left_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].l_child
        return out

    @property 
    def right_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].r_child
        return out

    @property 
    def split_features(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].feat
        return out

    @property 
    def split_thresholds(self): 
        out = np.empty(self.n_nodes, dtype='float64')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].thresh
        return out

    @property 
    def weighted_cc(self):
        out_size = self.n_nodes*self.n_class
        out = np.empty(out_size, dtype='float64')
        for i in range(out_size):
            out[i] = self.weighted_class_counts[i]
        out.resize(self.n_nodes, self.n_class)
        return out

    @property 
    def labels(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].label
        return out
    
    def _increase_mem_capacity(self, new_capacity):
        """Resize ndarrays that hold tree's nodes and weighted class counts.
        
        Arguments:
            new_capacity (int): Amount of nodes that resized arrays will be able to hold.
        """
        self.nodes.resize(new_capacity, refcheck=False)
        self.weighted_class_counts.resize(new_capacity*self.n_class, refcheck=False)
    
    def _make_leaf(self, node_id, wcc, n_classes_node):
        """Set and store the class label of a leaf node.
        
        Break ties at random when multiple classes share the same max weight.
        Doing this avoids a bias towards lower classes that would be a possible
        consequence of using np.argmax (which is what Sklearn does).
        
        Arguments:
            node_id            (int): Location of node in `self.nodes`.
            wcc (ndarray of float64): Node's weighted class counts. Shape: (`self.n_class`,).
            n_classes_node     (int): Number of unique class labels found among
                                      node's training samples.
        """
        if n_classes_node == 1: 
            label = max(enumerate(wcc), key=lambda f: f[1])[0]
        else:              
            label = self._rng.choice(np.argwhere(wcc==np.max(wcc)).flatten())
        self.nodes[node_id] = Node(-1, -1, -1, np.nan, label) 
        
    def _grow_tree(self, X, y):
        """Depth-first growth of a decision tree.
        
        Arguments:
            X (ndarray of float64): Training samples. Shape: (n samples, n features).
            y     (ndarray of int): Training labels. Shape: (n samples,).
        """
        # LIFO stack holding all nodes still to be investigated.
        node_stack = []
        
        # Stores the weighted class counts of the current node.
        node_wcc = np.empty(self.n_class, dtype=np.float64)
        
        ##############################################################
        # For finding the best split.
        ##############################################################
        l_wcc = np.empty(self.n_class, dtype=np.float64)
        r_wcc = np.empty(self.n_class, dtype=np.float64)
        items = np.empty(self.n_samples, dtype=np.float64)
        
        # Make 1-d arrays containing sample counts and weighted class 
        # counts for each unique raw feature value.
        # Raw, non-weighted, sample counts at each split-point.
        split_counts_raw = np.empty(self.max_n_unique_feat_vals, dtype=np.intp) 
        # Weighted class counts for each split-point.
        split_class_counts_wt = np.empty(self.n_class*self.max_n_unique_feat_vals, dtype=np.float64)  

        # Keeping track of nodes' constant features. 
        features = self.features.copy()
        constant_features = np.empty(self.n_features, dtype=np.intp)
        
        # Push root node onto the LIFO stack.
        node_stack.append(StackEntry(0, self.n_samples, 0, 0, 0))
        self.n_nodes = 1
        
        while len(node_stack) > 0:
            node_info = node_stack.pop()
            start, end = node_info.start, node_info.end
            node_id, parent_id = node_info.node_id, node_info.parent_id
            n_consts = node_info.n_const_feats
            n_samples_node = end-start
            
            # Tabulate and store the current node's weighted class counts.
            node_wcc[:] = 0.
            for i in range(n_samples_node):
                row = self.rows[start + i]
                label = y[row]
                wt = self.class_weights[label]
                node_wcc[label] += wt 
            self.weighted_class_counts[node_id*self.n_class: (node_id + 1)* self.n_class] = node_wcc
            
            # Make a leaf if required to do so.
            n_classes_node, sum_node_wcc, sum_node_wcc_sqr = 0, 0., 0.
            for c in range(self.n_class):
                wcc = node_wcc[c]
                if wcc > 0: n_classes_node += 1
                # Compute the current node's proxy gini numerator and denominator while we're at it.
                sum_node_wcc_sqr += wcc**2 
                sum_node_wcc += wcc 
            if n_classes_node == 1:                      
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif n_samples_node < 2*self.min_samples_leaf:  
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif sum_node_wcc < 2.*self.min_weight_leaf: 
                self._make_leaf(node_id, node_wcc, n_classes_node)
            
            # Or perform a split.
            else:
                # Initialize stats for best split of node.
                best_split = Split(-1, 0., -np.inf)
                
                # Ensure feats drawn w/out replacement.
                n_drawn_feats = 0
                n_new_consts = 0
                n_total_consts = n_consts
                lb = 0                      # Range in `features` array from which we 
                ub = self.n_features - 1    # randomly select a feature's column index. 
               
                while n_drawn_feats < self.m:
                    n_drawn_feats += 1
                    idx = self._rng.choice(range(lb, ub-n_new_consts+1))
                    
                    # So that we don't draw a known constant feature again this split-search.
                    if idx < n_consts:
                        features[idx], features[lb] = features[lb], features[idx]
                        lb += 1 
                        continue
                        
                    # So that no new const feats get drawn more than once per split-search.
                    idx += n_new_consts

                    feat_idx = features[idx]  
                    
                    # Prepare the rows' feature values for sorting.
                    items[:n_samples_node] = X[:,feat_idx][self.rows[start:end]]
                    
                    # Make sure the feature not constant for node's samples.
                    node_unique_vals_feat = np.unique(items[:n_samples_node])
                    node_n_unique_vals_feat = len(node_unique_vals_feat)
                    if node_n_unique_vals_feat < 2:
                        # Move the newly-discovered constant feat to the far right-end
                        # of the left half of `features` list holding the known const
                        # feats as well as any other const feats newly discovered 
                        # during this node's split-search.
                        features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                        n_new_consts += 1
                        n_total_consts += 1
                        continue
                    else:
                        # Initialize weighted class counts of right and left children.
                        # Right child's counts are initially the same as parent node's.
                        r_wcc[:] = node_wcc
                        l_wcc[:] = 0.
                    
                        # If the feature has an impurity score that's better than the best score 
                        # found among all other features visited thus far for this node, find_num_split()
                        # updates the attributes of the struct containing the node's best split info. 
                        # 
                        # But even if a new best score isn't reached, if an impurity score can
                        # be calculated at least once during the feature's split search, the
                        # following indicator will be toggled off, to indicate that the feature
                        # is not constant (1 = is constant; 0 = not constant).
                        current_feat_const = find_num_split_smallQ(X, self.rows, node_unique_vals_feat, y, start, n_samples_node, 
                                                                   self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                                                   self.class_weights, l_wcc, r_wcc, node_wcc, best_split, feat_idx, 
                                                                   sum_node_wcc_sqr, sum_node_wcc, node_n_unique_vals_feat, split_counts_raw, 
                                                                   split_class_counts_wt)

                        if current_feat_const:
                            # The feature may be constant within the search range permitted
                            # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                            # the feature is a newly discovered constant.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # The feature is non-constant, so we ensure it's not drawn again
                            # during this split-search.
                            features[idx], features[ub] = features[ub], features[idx]
                            ub -= 1 
                            
                # To ensure that the constant features info is accurate for sibling or child nodes.
                features[0:n_consts] = constant_features[0:n_consts]
                constant_features[n_consts:n_consts+n_new_consts] = features[n_consts:n_consts+n_new_consts]
                
                # Make node a leaf if constant for all randomly drawn feats.
                # (# drawn known constant feats + # drawn new constant feats)
                if lb + n_new_consts == n_drawn_feats: 
                    self._make_leaf(node_id, node_wcc, n_classes_node)
                else: 
                    split_pos = make_num_split(self.rows, X, node_info, best_split) 

                    # Update info for node that's getting split.
                    l_child_id = self.n_nodes
                    r_child_id = l_child_id + 1
                    self.nodes[node_id] = Node(l_child_id, r_child_id, best_split.feat, best_split.thresh, -1)

                    # Prepare for the left and right child nodes
                    # by increasing tree data memory capacity if
                    # necessary.
                    if self.n_nodes + 2 > self.mem_capacity:
                        # Expand memory capacity geometrically. See "geometric growth" 
                        # part of WhozCraig's SO answer at: 
                        #     https://stackoverflow.com/a/51665863/8628758.
                        # Add one after squaring so that the new capacity can
                        # contain not only a tree of greater depth, but also
                        # the maximum # nodes that that depth could have.
                        new_capacity = 2*self.mem_capacity + 1
                        self._increase_mem_capacity(new_capacity)
                        self.mem_capacity = new_capacity
                    
                    # Push right child info onto the LIFO stack.
                    node_stack.append(StackEntry(split_pos, end, r_child_id, node_id, n_total_consts))
                    # Push left child info onto queue.
                    node_stack.append(StackEntry(start, split_pos, l_child_id, node_id, n_total_consts))

                    # And update size of the tree.
                    self.n_nodes += 2
    
    def fit(self, X, y, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X (Fortran-style ndarray of float64): Pre-processed training data. 
                                                  Shape: (num train samples, num train features).
            y                   (ndarray of int): Training labels. Shape: (num train samples,).
            rows                          (list): Indices of the rows to be used for training. 
                                                  All rows used if empty.
            features                      (list): Column indices of training features that will be used.
                                                  All features used if empty.                           
        Returns:
            DecisionTreeSmallQ: A decision tree object.
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        # Determine # classes found among all training samples.
        root_cc = np.unique(y, return_counts=True)[1] 
        self.n_class = root_cc.size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.

        self.n_samples = len(self.rows)
        self.n_features = len(self.features)
        
        # Get the max cardinality of all numerical feats.
        self.max_n_unique_feat_vals = max([np.unique(X[:,i]).size for i in range(X.shape[1])])
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        init_capacity = 15
        
         # Allocate tree memory.
        self._increase_mem_capacity(init_capacity)
        self.mem_capacity = init_capacity
        
        # And sum the class weights of all the root node's samples in
        # order to know minimum total weight a leaf must have (which
        # we must know when regularizing by min_weight_fraction_leaf.)
        root_wcc = root_cc*self.class_weights
        self.min_weight_leaf = self.min_weight_fraction_leaf*root_wcc.sum()
        
        # Initialize the random number generator.
        self._rng = get_random_generator(self.seed)
        
        # Initiate tree building.
        self._grow_tree(X, y)
        return self
        
    def _next_node(self, nxt): return self.nodes[nxt]
       
    def _get_leaf_idx(self, i, X):
        root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[:,leaf.feat][i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else:
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
        
    def predict(self, X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        n_preds = X.shape[0]
        preds = np.empty(n_preds, dtype=np.intp)
        for i in range(n_preds):
            preds[i] = self.nodes[self._get_leaf_idx(i, X)].label
        return preds
    
    def predict_probs(self, X):
       """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        for i in range(n_probs):
            idx = self._get_leaf_idx(i, X)
            for j in range(self.n_class):
                wcc[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

## Python Wright SmallQ Tree's Speed on the Titanic Data

In [36]:
m = 4
dt = DecisionTreeSmallQ(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc);

In [37]:
dt.size # Number of nodes in the decision tree.

371

In [38]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.752808988764045

In [39]:
%timeit dt.fit(xTrain_proc, yTrain_proc)

151 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [40]:
%timeit dt.predict(xVal_proc)

517 µs ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Wright SmallQ Decision Tree Cython Version

In [41]:
%%cython
# cython: wraparound=False, boundscheck=False, cdivision=True, initializedcheck=False
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11

import numpy as np
cimport numpy as np
np.import_array()
ctypedef np.float64_t DTYPE_t
ctypedef np.intp_t SIZE_t # Signed, same as ssize_t in C. See MSeifert's SO answer: https://stackoverflow.com/a/46416257/8628758
cimport cython
from libc.math cimport log as ln
from libc.stdlib cimport realloc, free
from libc.string cimport memcpy
from libc.string cimport memset
from libcpp.stack cimport stack

# For C++ random number generation.
from libc.stdint cimport uint_fast32_t 

# Swap helper func for sorting.
cdef inline void swap(DTYPE_t* items, SIZE_t i, SIZE_t j) nogil:
    items[i], items[j] = items[j], items[i]

# Quicksort helpers

cdef inline void med_three(DTYPE_t* items, SIZE_t first, SIZE_t last) nogil:
    """Find the median-of-three pivot point of the second through final 
    items of a list of numbers. Once identified, the pivot is moved to 
    the front of the list. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L78
    
    Arguments:
        items      : The numbers to be sorted.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t middle = <int>(first + (last - first)/2)
    cdef SIZE_t second = first + 1
    last -= 1
    if items[second] < items[middle]:
        if items[middle] < items[last]:
            swap(items, first, middle)    
        elif items[second] < items[last]:
            swap(items, first, last)         
        else:                        
            swap(items, first, second)
    elif items[second] < items[last]:
        swap(items, first, second)
    elif items[middle] < items[last]:
        swap(items, first, last)
    else:
        swap(items, first, middle)

cdef inline SIZE_t partition(DTYPE_t* items, SIZE_t first, SIZE_t last, SIZE_t pivot) nogil:
    """Group numbers less than the pivot value together on the left and
    those that are greater on the right. Find the index that separates
    these two groups, which will belong to the first item that is greater
    than or equal to the pivot. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L1885
    
    Arguments:
        items      : The numbers to be sorted.
        first, last: The range of items to be sorted. 
        pivot      : Index holding the median pivot value.
        
    Returns:
        Index of cut point used to partition the items into two smaller sequences.
    """
    while True:
        while first < last and items[first] < items[pivot]:
            first += 1                      # Get index of first item greater than or equal to median-of-three pivot. 
        last -= 1
        while items[pivot] < items[last]:
            last -= 1                       # Get index of last item less than or equal to the pivot.
        if not (first < last): 
            return first                    # After swaps are done, return index of first item in right partition.
        
        swap(items, first, last)            # Swap the first item greater than or equal to the pivot with the
                                            # last item less than or equal to the pivot. 
        first += 1

cdef inline void insertion_sort(DTYPE_t* items, SIZE_t first, SIZE_t last) nogil:
    """Follows the spirit of the Numpy implementation at: 
        https://github.com/numpy/numpy/blob/5ffb84c3057a187b01acdeaa628137193df12098/numpy/core/src/npysort/quicksort.cpp#L211
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t i
    cdef SIZE_t j
    cdef SIZE_t k
    cdef DTYPE_t val
    for i in range(first+1, last):
        j = i
        k = i - 1
        val = items[i]
        while (j > first) and val < items[k]:
            items[j] = items[k]
            j-=1
            k-=1
        items[j] = val

# Heapsort

cdef inline void sift_down(DTYPE_t* items, SIZE_t start, int n, SIZE_t p, 
                           SIZE_t c, DTYPE_t val) nogil:
    """Swap a heap item with one of its children if that child's value is 
    greater than or equal to that parent's value. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L61
    
    Arguments:
        items: 1-d array containing numbers.
        start: Index of the first number.
        n    : Quantity of numbers.
        p    : Index of the parent.
        c    : Index of the parent's first (left) child.
        val  : The parent's value.
    """
    while c < n:    # Look at the descendents of current parent, `p`.
        if c < n-1 and items[start + c] < items[start + c + 1]: # Find larger of the first and second children.
            c += 1
        if val < items[start + c]: # If child greater than parent, swap child and parent.
            items[start + p] = items[start + c]
            p = c   # Current greater child becomes the parent.
            c += c  # Look at this child's child, if it exists.
        else:
            break 
    items[start + p] = val

cdef inline void sort_heap(DTYPE_t* items, SIZE_t start, int n) nogil:
    """Sort a binary max heap of numbers. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L77
    
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    cdef DTYPE_t val
    while n > 0:
        n -= 1
        val = items[start + n]
        items[start + n] = items[start]
        sift_down(items, start, n, 0, 1, val)

cdef inline void heapify(DTYPE_t* items, SIZE_t start, int n) nogil:
    """Turn a list of items into a binary max heap. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L59
    
    Arguments:
        items: 1-d array containing numbers.
        start: Index of the first number.
        n    : Quantity of numbers.
    """
    cdef DTYPE_t val
    cdef SIZE_t p
    cdef SIZE_t last_p = (n-2)//2
    for p in range(last_p, -1, -1):
        val = items[start + p] # value of last parent
        sift_down(items, start, n, p, 2*p + 1, val)

cdef inline void heapsort(DTYPE_t* items, SIZE_t start, int n) nogil:
    """Applies the heapsort algorithm to sort a list of items from least to greatest. 
    From Williams, 1964.
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    heapify(items, start, n)
    sort_heap(items, start, n)
    
# Introsort 

cdef void introsort_loop(DTYPE_t* items, SIZE_t first, SIZE_t last, int depth) nogil:
    """The recursive heart of the introsort algorithm.
    
    Arguments:
        items      : The numbers to be sorted.
        first, last: The range of items to be sorted. 
        depth      : Current recursion depth.
    """
    cdef int MIN_SIZE_THRESH = 16
    cdef SIZE_t cut
    while last-first > MIN_SIZE_THRESH:
        if depth == 0:
            heapsort(items, first, last-first)
        depth -= 1
        med_three(items, first, last)
        cut = partition(items, first+1, last, first)
        introsort_loop(items, cut, last, depth)
        last = cut

# Log base-2 helper function. From Sklearn's implementation at:
#     https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_utils.pyx#L7
cdef inline DTYPE_t log2(DTYPE_t x) nogil:
    return ln(x) / ln(2.0)

cdef void introsort(DTYPE_t* items, SIZE_t first, SIZE_t last) nogil:
    """Implementation as described in Musser, 1997. Switches to heapsort
    when max recursion depth exceeded. Otherwise uses median-of-three 
    quicksort (Bentley & McIlroy, 1993) with all the usual optimizations:
        - Swap equal elements.
        - Only process partitions longer than the minimum size threshold.
        - When a new partition is made, recurse on the smaller half and 
          iterate over the larger half.
        - Make a final pass with insertion sort over the entire list.

    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef int max_depth = 2 * <int>log2(last-first)
    introsort_loop(items, first, last, max_depth)
    insertion_sort(items, first, last)
    
cdef SIZE_t sort_unique(DTYPE_t* items, SIZE_t first, SIZE_t last) nogil:
    """Sort a 1-d array of numbers in-place using introsort and 
    place the unique values in consecutive ascending order at 
    the beginning of the array.
    
    Arguments:
        items      : The numbers to be sorted.
        first, last: The range of items to be sorted. 
        
    Returns: 
        Number of unique items.
    """
    cdef SIZE_t i = 1
    cdef SIZE_t j = 1
    introsort(items, first, last)
    while i < last-first:
        if items[i] == items[i-1]:
            i += 1
        else:
            if i - j < 1:
                j += 1
                i += 1
            else:
                items[j] = items[i]
                j += 1
                i += 1
    return j
    
# For convenient memory reallocation.
ctypedef fused realloc_t:
    SIZE_t
    DTYPE_t
    Node

cdef inline realloc_t* safe_realloc(realloc_t* ptr, SIZE_t n_items) nogil except *:
    # Inspired by Sklearn's safe_realloc() func. However, thankfully
    # Cython now no longer requires us to send a pointer to a pointer
    # in order to prevent crashes.
    cdef realloc_t elem = ptr[0]
    cdef SIZE_t n_bytes = n_items * sizeof(elem)
    # Make sure we're not trying to allocate too much memory.
    if n_bytes/sizeof(elem) != n_items:
        with gil:
            raise MemoryError(f"Overflow error: unable to allocate {n_bytes} bytes.")       
    cdef realloc_t* res_ptr = <realloc_t *> realloc(ptr, n_bytes)
    with gil:
        if not res_ptr: raise MemoryError()
    return res_ptr

# C++ random number generator. Not yet a part of a Cython release so
# pasted in from: 
#     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/Cython/Includes/libcpp/random.pxd#L1
cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass random_device:
        ctypedef uint_fast32_t result_type
        random_device() except +
        result_type operator()() except +

    cdef cppclass mt19937:
        ctypedef uint_fast32_t result_type
        mt19937() except +
        mt19937(result_type seed) except +
        result_type operator()() except +
        result_type min() except +
        result_type max() except +
        void discard(size_t z) except +
        void seed(result_type seed) except +

    cdef cppclass uniform_int_distribution[T]:
        ctypedef T result_type
        uniform_int_distribution() except +
        uniform_int_distribution(T, T) except +
        result_type operator()[Generator](Generator&) except +
        result_type min() except +
        result_type max() except +
        
# Info for any node that will eventually be split or made into a leaf.
# Similar to what Sklearn does at:
#     https://github.com/scikit-learn/scikit-learn/blob/a2c4d8b1f4471f52a4fcf1026f495e637a472568/sklearn/tree/_tree.pyx#L126
cdef struct StackEntry:
    SIZE_t start
    SIZE_t end
    SIZE_t node_id
    SIZE_t parent_id
    SIZE_t n_const_feats

# To compare node splits.
cdef struct Split:
    SIZE_t feat
    DTYPE_t thresh
    DTYPE_t score  

# Vital characteristics of a node. Set when it's added to the tree.
cdef struct Node:
    SIZE_t l_child # idx of left child, -1 if leaf
    SIZE_t r_child # idx of right child, -1 if leaf
    SIZE_t feat    # col idx of split feature, -1 if leaf
    DTYPE_t thresh # double split threshold, NAN if leaf
    SIZE_t label   # class label if leaf, -1 if non-leaf.
    
cdef inline SIZE_t find_first(DTYPE_t* items, DTYPE_t value, SIZE_t first, SIZE_t last) nogil:
    """Find first occurrence of an element in a vector of sorted 
       (ascending order) elements.
       
    Uses same algorithm as Python's bisect_left() function:
        https://github.com/python/cpython/blob/8fd2d36c1c6da78b2402fcb8bcefdad8428c8bc3/Lib/bisect.py#L68
        
    Arguments:
        items      : The pre-sorted elements to be searched over.
        value      : The value to search for.
        first, last: The range of items to be searched.
        
    Returns:
        Index of the first element in `items` that equals `value`.
        
        If no such element exists in `items` the returned index
        merely indicates where the element would reside where it
        present in the sorted vector.
    """
    cdef SIZE_t mid
    while first < last:
        mid = (first + last) // 2
        if items[mid] < value:
            first = mid + 1
        else:
            last = mid
    return first

cdef inline void find_num_split_smallQ(DTYPE_t* X, SIZE_t* rows, DTYPE_t* node_unique_vals_feat, 
                                       SIZE_t* labels, SIZE_t node_start, SIZE_t n_parent, 
                                       SIZE_t n_samples, SIZE_t n_class, SIZE_t min_samples_leaf, 
                                       DTYPE_t min_weight_leaf, DTYPE_t* c_wts, DTYPE_t* l_wcc, 
                                       DTYPE_t* r_wcc, DTYPE_t* parent_wcc, Split* best_split, 
                                       SIZE_t current_feat, DTYPE_t parent_num, DTYPE_t parent_den, 
                                       SIZE_t node_n_unique_vals_feat, SIZE_t* split_counts_raw, 
                                       DTYPE_t* split_class_counts_wt, bint* current_feat_const) nogil:
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Marvin Wright's SmallQ splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L233
    
    Saves a split's feature idx, threshold, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        X                      : Training data. Shape: (n train samples, n features).
        rows                   : Indices of all rows in the training set. Shape: (n train samples,).
        node_unique_vals_feat  : The sorted unique feature values of the samples in the parent
                                 node (beginning index 0). Shape: (n train samples,).
        labels                 : All training labels. Shape: (n training samples,).
        node_start             : Index of the beginning of the parent node in `rows`.
        n_parent               : Number of samples in the parent node.
        n_samples              : Number of samples in the training data.
        n_class                : Number of unique classes in the training set.
        min_samples_leaf       : Any leaf will have no fewer than this many samples.
        min_weight_leaf        : Total weight of any leaf's samples will be at least this much.
        c_wts                  : Class weights. Shape: (`n_class`,).
        l_wcc                  : Left child's weight class counts. Shape: (`n_class`,).
        r_wcc                  : Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc             : Parent node's weight class counts. Shape: (`n_class`,).
        best_split             : Holds the feature, threshold and impurity
                                 score of the parent node's current best split.
        current_feat           : Column index of feature under investigation.
        parent_num             : Numerator of parent node's impurity score.
        parent_den             : Denominator of parent node's impurity score.
        node_n_unique_vals_feat: Number of unique values for one feature found among 
                                 the node's samples.
        split_counts_raw       : Stores sample counts found at each unique split point of
                                 a given feature in a given node. 
                                 Shape: (<max cardinality of all numerical feats in dataset>,).
        split_class_counts_wt  : Stores weighted class counts of each class at each unique
                                 split point of a given feature in a given node. Shape:
                                 (<max cardinality of all numerical feats in dataset> x `n_class`,)
        current_feat_const     : Whether current splitting feature is constant for all eligible split 
                                 thresholds in the current node. 1 if yes, 0 otherwise.
    """
    # Variables used while tabulating sample and weighted
    # class counts at all unique split points.
    cdef SIZE_t row, label, split_point_idx
    cdef DTYPE_t value
    
    # Variables to track progress during the split search.
    cdef SIZE_t n_left, n_right, i, j
    
    # Variables used to calculate proxy gini scores.
    cdef DTYPE_t l_num, l_den, r_num, r_den, wt, score, mid
    
    # Tabulate both the sample counts at all possible split points
    # as well as weighted class counts at each split point.
    memset(&split_counts_raw[0], 0, sizeof(SIZE_t)*node_n_unique_vals_feat)
    memset(&split_class_counts_wt[0], 0, sizeof(DTYPE_t)*node_n_unique_vals_feat*n_class)
    for i in range(n_parent):
        row = rows[node_start + i]
        value = X[current_feat*n_samples + row]
        label = labels[row]
        split_point_idx = find_first(node_unique_vals_feat, value, 0, node_n_unique_vals_feat)
        split_counts_raw[split_point_idx] += 1
        split_class_counts_wt[split_point_idx*n_class + label] += c_wts[label] 
    
    # To keep track of num amples in left child.
    n_left = 0    
    # Left child's proxy gini score denominator.
    l_den = 0.
    
    # Search for the threshold of the best split.
    for i in range(node_n_unique_vals_feat - 1):
        n_left += split_counts_raw[i]
        n_right = n_parent - n_left

        l_num, r_num = 0., 0. # To calculate numerators of proxy gini scores.
        for j in range(n_class):
            # Can't do the on-line proxy gini update algorithm cause we
            # move all samples from a given class over to the left side 
            # before updating the calculation.
            wt = split_class_counts_wt[i*n_class + j]
            l_wcc[j] += wt
            r_wcc[j] -= wt
            l_num += l_wcc[j]*l_wcc[j]
            l_den += wt
            r_num += r_wcc[j]*r_wcc[j]
        r_den = parent_den - l_den

        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf
        if n_left < min_samples_leaf: continue
        elif n_right < min_samples_leaf: return
        elif l_den < min_weight_leaf: continue
        elif r_den < min_weight_leaf: return

        current_feat_const[0] = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Split threshold is always the mid-point between two consecutive values.
            mid = node_unique_vals_feat[i]/2. + node_unique_vals_feat[i+1]/2. 
            if mid == node_unique_vals_feat[i+1]: mid = node_unique_vals_feat[i]
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat

cdef inline SIZE_t make_num_split(SIZE_t* rows, DTYPE_t* X, StackEntry* node_info, Split* best_split, 
                                SIZE_t n_samples) nogil:
    cdef SIZE_t p, p_end
    p, p_end = node_info.start, node_info.end
    while p < p_end:
        if X[best_split.feat*n_samples + rows[p]] <= best_split.thresh: p+=1
        else: p_end-=1; rows[p], rows[p_end] = rows[p_end], rows[p] 
    return p

# Necessary constants.
cdef DTYPE_t NEG_INF = -np.inf
cdef DTYPE_t NAN = np.nan
            
cdef class _DecisionTree:
    # Class attributes.
    cdef SIZE_t seed
    cdef mt19937 rng
    cdef SIZE_t mem_capacity
    cdef SIZE_t n_samples
    cdef SIZE_t n_features
    cdef SIZE_t n_class
    cdef SIZE_t max_n_unique_feat_vals
    cdef SIZE_t m
    cdef SIZE_t min_samples_leaf, 
    cdef DTYPE_t min_weight_fraction_leaf
    cdef DTYPE_t min_weight_leaf
    cdef SIZE_t n_nodes
    cdef SIZE_t* rows
    cdef SIZE_t* features
    cdef DTYPE_t* class_weights
    cdef Node* nodes
    cdef DTYPE_t* weighted_class_counts
    def __cinit__(self, SIZE_t m, SIZE_t min_samples_leaf, DTYPE_t min_weight_fraction_leaf, SIZE_t seed): 
        """
        Arguments:
            m                       : Number of candidate features randomly selected to try to split each node.
            min_samples_leaf        : Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf: Total weight of any leaf's samples must comprise this portion 
                                      of the sum of weights of *all* training samples used to fit the tree.
            seed                    : A seed for the C++ mt19937 32bit int random generator. 
                                      Use when reproducibility is desired.
        """
        self.m, self.min_samples_leaf = m, min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = NULL
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = NULL 
        
    def __dealloc__(self):
        free(self.nodes)
        free(self.weighted_class_counts)
        
    property size:
        def __get__(self):
            return self.n_nodes
    
    property left_children:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].l_child
            return out

    property right_children:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].r_child
            return out
        
    property split_features: 
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].feat
            return out
        
    property split_thresholds:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].thresh
            return out
        
    property weighted_cc:
        def __get__(self):
            cdef SIZE_t out_size = self.n_nodes*self.n_class
            out = np.empty(out_size, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(out_size):
                    out_view[i] = self.weighted_class_counts[i]
            out.resize(self.n_nodes, self.n_class)
            return out
    
    property labels:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].label
            return out
    
    cdef void _increase_mem_capacity(self, SIZE_t new_capacity) nogil:
        self.nodes = safe_realloc(self.nodes, new_capacity)
        self.weighted_class_counts = safe_realloc(self.weighted_class_counts, self.n_class*new_capacity)
    
    cdef void _make_leaf(self, Node* leaf_node, SIZE_t* y, SIZE_t node_start, SIZE_t node_id, 
                         SIZE_t n_classes_node, SIZE_t* max_wt_classes) nogil:
        # Class with largest wcc becomes leaf node's label. Break ties with a random choice.
        cdef SIZE_t label
        cdef DTYPE_t max_wt = 0.
        cdef SIZE_t lb = 0
        cdef SIZE_t ub = -1
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t i, j
        # If all node's samples have the same class.
        if n_classes_node == 1:
            label = y[self.rows[node_start]]
        else:
            # Otherwise find label with max weighted class count for the node.
            for i in range(self.n_class):
                max_wt = max(max_wt, self.weighted_class_counts[node_id*self.n_class + i])
            # See if multiple classes share this max count.
            for i in range(self.n_class):
                if self.weighted_class_counts[node_id*self.n_class + i] == max_wt:
                    ub += 1
                    max_wt_classes[ub] = i
            # If so, randomly choose leaf's label from among those classes.
            if ub > 0:
                dist = uniform_int_distribution[SIZE_t](lb, ub) # Choose an int w/in range lb, ub, inclusive.
                j = dist(self.rng)
                label = max_wt_classes[j]
            else:
                label = max_wt_classes[lb]
        leaf_node.l_child = -1
        leaf_node.r_child = -1    
        leaf_node.feat = -1  
        leaf_node.thresh = NAN
        leaf_node.label = label 

    cdef _grow_tree(self, DTYPE_t* X, SIZE_t* y):
        # LIFO stack holding all nodes still to be investigated.
        cdef stack[StackEntry] node_stack

        #####################################################################
        # Variables containing info of the node currently being investigated.
        #####################################################################
        cdef SIZE_t start, end, node_id, parent_id, n_consts, n_samples_node
        cdef DTYPE_t* node_wcc = NULL
        cdef StackEntry node_info
        cdef Node* node = NULL
        
        # Holds child node info if the current node gets split.
        cdef SIZE_t l_child_id, r_child_id
        cdef Node* l_child_node = NULL
        cdef Node* r_child_node = NULL
        
        #####################################################################
        # For finding the best split.
        #####################################################################
        cdef Split best_split
        cdef DTYPE_t* l_wcc = NULL
        cdef DTYPE_t* r_wcc = NULL
        cdef DTYPE_t sum_node_wcc_sqr, sum_node_wcc # Parent node's proxy Gini score num and den.
        cdef SIZE_t split_pos
        
        # Indicates a feature has been discovered to be constant during a
        # split search within the search range permitted by min_samples_leaf 
        # and min_weight_leaf.
        cdef bint current_feat_const 

        # Create a C-contiguous array of doubles to hold feature values of a 
        # given node's samples. Using Numpy to allocate memory to longer 
        # vectors is often faster than using realloc().
        cdef DTYPE_t[::1] items_buffer = np.empty(self.n_samples, dtype=np.float64)
        cdef DTYPE_t* items = &items_buffer[0]
        cdef SIZE_t r
        
        # An array to contain unique split points for a feature at a given node.
        cdef DTYPE_t[::1] node_unique_vals_feat_buffer = np.empty(self.n_samples, dtype=np.float64)
        cdef DTYPE_t* node_unique_vals_feat = &node_unique_vals_feat_buffer[0]
        cdef SIZE_t node_n_unique_vals_feat
        
        # Make 1-d arrays containing sample counts and weighted class 
        # counts for each unique raw feature value.
        # Raw, non-weighted, sample counts at each split-point.
        cdef SIZE_t[::1] split_counts_raw_buffer = np.empty(self.max_n_unique_feat_vals, dtype=np.intp) 
        cdef SIZE_t* split_counts_raw = &split_counts_raw_buffer[0]
        # Weighted class counts for each split-point.
        cdef DTYPE_t[::1] split_class_counts_wt_buffer = np.empty(self.n_class*self.max_n_unique_feat_vals, dtype=np.float64) 
        cdef DTYPE_t* split_class_counts_wt = &split_class_counts_wt_buffer[0]
        
        #####################################################################
        # For random feature selection (w/out replacement) and keeping track 
        # of nodes' constant features. 
        #####################################################################
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t lb, ub, idx, feat_idx, n_drawn_feats, n_new_consts, n_total_consts
        cdef SIZE_t[::1] features_buffer = np.empty(self.n_features, dtype=np.intp) 
        cdef SIZE_t* features = &features_buffer[0]
        cdef SIZE_t[::1] constant_features_buffer = np.empty(self.n_features, dtype=np.intp)
        cdef SIZE_t* constant_features = &constant_features_buffer[0]
        
        #####################################################################
        # For determining whether node should be a leaf.
        #####################################################################
        cdef SIZE_t i, c, cc, n_classes_node, row, label
        cdef DTYPE_t wcc, wt
        # Stores classes that share a leaf's max class wt. When two or more 
        # present, leaf label randomly chosen from these classes
        cdef SIZE_t* max_wt_classes = NULL
        
        with nogil:
            # Allocate memory to pointers.
            l_wcc = safe_realloc(l_wcc, self.n_class)
            r_wcc = safe_realloc(r_wcc, self.n_class)
            node_wcc = safe_realloc(node_wcc, self.n_class)
            max_wt_classes = safe_realloc(max_wt_classes, self.n_class*sizeof(SIZE_t))
            # Fill with feature column indices so we can track constant feats.
            memcpy(features, self.features, self.n_features* sizeof(SIZE_t))
            
            # Push root node onto the LIFO stack.
            node_stack.push({"start": 0, "end": self.n_samples, "node_id": 0, 
                             "parent_id": 0, "n_const_feats": 0})
            self.n_nodes = 1
            while not node_stack.empty():
                node_info = node_stack.top()
                node_stack.pop()
                start, end = node_info.start, node_info.end
                node_id, parent_id = node_info.node_id, node_info.parent_id # TODO: `parent_id` unused; is it necessary?
                n_consts = node_info.n_const_feats
                n_samples_node = end-start
                node = &self.nodes[node_id]
                
                # Tabulate the current node's weighted class counts.
                #
                # Implementation detail #1: I tried storing the l and r child wt class cts
                # of nodes' best splits so that this tabulation wouldn't need to be 
                # performed for each node. But found there was virtually no speed improvement
                # to justify the more complicated code required to store and update these 
                # values during the best split search.
                #
                # Implementation detail #2: Setting aside a block of memory to 
                # store the current node's wt class cts and passing a pointer to
                # this block to the split search function sped up training by 8%
                # compared to passing a ptr to the location of node's wt class cts 
                # in the self.weighted_class_counts array.
                memset(node_wcc, 0, self.n_class*sizeof(DTYPE_t))
                sum_node_wcc, sum_node_wcc_sqr = 0., 0.
                for i in range(n_samples_node):
                    row = self.rows[start + i]
                    label = y[row]
                    wt = self.class_weights[label]
                    # Compute the node's proxy gini numerator and denominator while we're at it.
                    sum_node_wcc_sqr += wt*(2*node_wcc[label] + wt) # numerator
                    sum_node_wcc += wt                              # denominator
                    node_wcc[label] += wt
                memcpy(&self.weighted_class_counts[node_id*self.n_class], node_wcc, self.n_class*sizeof(DTYPE_t))
                
                # Make a leaf if required to do so. 
                n_classes_node = 0
                for c in range(self.n_class):
                    wcc = node_wcc[c]
                    if wcc > 0: n_classes_node += 1
                if n_classes_node == 1:                   
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif n_samples_node < 2*self.min_samples_leaf:  
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif sum_node_wcc < 2.*self.min_weight_leaf: 
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)

                # Otherwise split the node.
                else:
                    # Initialize stats for best split of node.
                    best_split.feat = -1
                    best_split.thresh = 0.
                    best_split.score = NEG_INF

                    # Ensure feats drawn w/out replacement.
                    n_drawn_feats = 0
                    n_new_consts = 0
                    n_total_consts = n_consts
                    lb = 0                      # Range in `features` array from which we 
                    ub = self.n_features - 1    # randomly select a feature's column index. 
                        
                    while n_drawn_feats < self.m:
                        n_drawn_feats += 1

                        # Breiman & Cutler's original Fortran random forests implementation 
                        # allows for known constant features to be drawn during a split-search.
                        # I follow their example, as I believe that doing so allows individual 
                        # trees to be less correlated with each other. Since I don't pre-sort
                        # features, I would prefer not to have to sort any more features than
                        # necessary, and so I've adopted the technique Sklearn uses to track 
                        # constant features:
                        #     https://github.com/scikit-learn/scikit-learn/blob/dbe39454f766ebefc3219f2c1871ac1774316532/sklearn/tree/_splitter.pyx#L310
                        # 
                        # The idea is that feature idxs in `features` are organized into two sections:
                        #
                        #     [<indices of known constant feats>, <indices of non-constant feats>]
                        #
                        # As we begin drawing feature indices from this above list, those two sections
                        # will each be further sub-divided into two sections:
                        # 
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <undrawn non-constant feats>, <drawn non-constant feats>]
                        #
                        # When we choose a feature that happens to be a known constant, we'll re-locate
                        # its idx to the right-end of the first of those four sections. Then we 
                        # increment the lower bound threshold, `lb`, by one so that we don't re-draw 
                        # that feature again.
                        #
                        # Similarly, if we draw a non-constant feature idx, we'll move it to the 
                        # left-end of the last of the four partitions and reduce the upper bound
                        # threshold, `ub`, by one so that the feature idx can't be drawn again
                        # during this split-search. 
                        #
                        # One last important detail: sometimes we'll draw a feature that 
                        # used to be non-constant for ancestor nodes, but will be found to be 
                        # constant for the current node. When this happens, we relocate its 
                        # index so that it sits to the right of the known constant feats section.
                        # This means our `features` list could have up to five partitions:
                        #
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <newly discovered const feats>, <undrawn non-constant feats>, 
                        #      <drawn non-constant feats>]
                        #
                        # Whenever we find a new constant feature, we increment the `n_new_consts`
                        # counter by one. We also increment the `n_total_consts` counter by one. 
                        # During the split-search we have to use `n_total_consts` to keep track of
                        # the total number of constant features. n_consts` mustn't be changed
                        # because it tells us where the <newly discovered const feats> section
                        # of the `features` list begins.

                        # One last wrinkle. We subtract the # of newly discovered const feats from  
                        # the upper bound before we select an index `i` from the `features` array, 
                        # and add it back to `i` after `i` has been genereated. This prevents us from 
                        # re-drawing any of these new const feats again during this split-search.
                        dist = uniform_int_distribution[SIZE_t](lb, ub-n_new_consts)
                        idx = dist(self.rng)

                        # So that we don't draw a known constant feature again this split-search.
                        if idx < n_consts:
                            features[idx], features[lb] = features[lb], features[idx]
                            lb += 1 
                            continue

                        # So that no new const feats get drawn more than once per split-search.
                        idx += n_new_consts

                        feat_idx = features[idx]
                        # Place all samples' feat values into contiguous storage.
                        for r in range(n_samples_node):
                            # X is a pointer, so have to index into this 2d array in the C way 
                            # (also keeping in mind that the array is column-major).
                            items[r] = X[feat_idx*self.n_samples + self.rows[start + r]]

                        # Place all unique split points, in ascending order, inside the 
                        # first <node_n_unique_vals_feat> indices of the items array.
                        node_n_unique_vals_feat = sort_unique(items, 0, n_samples_node)

                        # Make sure the feature not constant for node's samples.
                        if node_n_unique_vals_feat < 2:
                            # Move the newly-discovered constant feat to the far right-end
                            # of the left half of `features` list holding the known const
                            # feats as well as any other const feats newly discovered 
                            # during this node's split-search.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # Initialize weighted class counts of right and left children.
                            # Right child's counts are initially the same as parent node's.
                            memcpy(r_wcc, node_wcc, self.n_class*sizeof(DTYPE_t))
                            memset(l_wcc, 0, self.n_class*sizeof(DTYPE_t))

                            # If the feature has an impurity score that's better than the best score 
                            # found among all other features visited thus far for this node, find_num_split()
                            # updates the attributes of the struct containing the node's best split info. 
                            # 
                            # But even if a new best score isn't reached, if an impurity score can
                            # be calculated at least once during the feature's split search, the
                            # following indicator will be toggled off, to indicate that the feature
                            # is not constant.
                            current_feat_const = 1 # 1 = is constant; 0 = not constant
                            find_num_split_smallQ(X, self.rows, items, y, start, n_samples_node, self.n_samples, 
                                                  self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                                  self.class_weights, l_wcc, r_wcc, node_wcc,
                                                  &best_split, feat_idx, sum_node_wcc_sqr, sum_node_wcc,
                                                  node_n_unique_vals_feat, split_counts_raw, split_class_counts_wt,
                                                  &current_feat_const)

                            if current_feat_const:
                                # The feature may be constant within the search range permitted
                                # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                                # the feature is a newly discovered constant.
                                features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                                n_new_consts += 1
                                n_total_consts += 1
                                continue
                            else:
                                # The feature is non-constant, so we ensure it's not drawn again
                                # during this split-search.
                                features[idx], features[ub] = features[ub], features[idx]
                                ub -= 1 

                    # To ensure that the constant features info is accurate for sibling or child nodes.
                    memcpy(&features[0], &constant_features[0], sizeof(SIZE_t)*n_consts)
                    memcpy(&constant_features[n_consts], &features[n_consts], sizeof(SIZE_t)*n_new_consts)

                    # Make node a leaf if constant for all randomly drawn feats.
                    # (# drawn known constant feats + # drawn new constant feats)
                    if lb + n_new_consts == n_drawn_feats: 
                        self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                    else: 
                        split_pos = make_num_split(self.rows, X, &node_info, &best_split, self.n_samples) 

                        # Update tree info for node that's getting split.
                        l_child_id = self.n_nodes
                        r_child_id = l_child_id + 1
                        node.l_child = l_child_id
                        node.r_child = r_child_id
                        node.feat    = best_split.feat
                        node.thresh  = best_split.thresh
                        node.label   = -1

                        # Prepare for the left and right child nodes
                        # by increasing tree data memory capacity if
                        # necessary.
                        if self.n_nodes + 2 > self.mem_capacity:
                            # Expand memory capacity geometrically. See "geometric growth" 
                            # part of WhozCraig's SO answer at: 
                            #     https://stackoverflow.com/a/51665863/8628758.
                            # Add one after squaring so that the new capacity can
                            # contain not only a tree of greater depth, but also
                            # the maximum # nodes that that depth could have.
                            new_capacity = 2*self.mem_capacity + 1
                            self._increase_mem_capacity(new_capacity)
                            self.mem_capacity = new_capacity
                        
                        # Push right child info onto the LIFO stack.
                        node_stack.push({"start": split_pos, "end": end, "node_id": r_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})
                        # Push left child info onto queue.
                        node_stack.push({"start": start, "end": split_pos, "node_id": l_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})

                        # And update size of the tree.
                        self.n_nodes += 2
                        
        free(l_wcc)
        free(r_wcc)
        free(node_wcc)
        free(max_wt_classes)
    
    def fit(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X, np.ndarray[SIZE_t, ndim=1, mode="c"] y, 
            np.ndarray[SIZE_t, ndim=1, mode="c"] rows, np.ndarray[SIZE_t, ndim=1, mode="c"] features,
            np.ndarray[DTYPE_t, ndim=1, mode="c"] class_weights, SIZE_t n_class): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X       (2D Fortran-contiguous array of float64): Pre-processed training data.
            y                 (1D C-contiguous array of int): Training labels.
            rows              (1D C-contiguous array of int): Indices of the rows to be used for training. 
            feats             (1D C-contiguous array of int): Column indices of training features.
            class_weights (1D C-contiguous array of float64): Desired weight for each class. Shape: (`n_class`,).
            n_class                                         : Number of classes in training data.   
        """
        # Casting the raw data to pointers gives a 17% speed-up compared to getting
        # pointer from the ndarray's buffer interface, as recommended by DavidW in 
        # his SO answer at: https://stackoverflow.com/a/54832269/8628758. e.g.
        #     cdef DTYPE_t[::1,:] X_buffer = X
        #     cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        # Not worried about unexpected behavior as all ndarrays' contiguousness and
        # memory layout enforced prior to this point.
        cdef DTYPE_t* X_ptr = <DTYPE_t*> X.data
        cdef SIZE_t* y_ptr = <SIZE_t*> y.data
        self.rows = <SIZE_t*> rows.data
        self.features = <SIZE_t*> features.data
        self.class_weights = <DTYPE_t*> class_weights.data
        self.n_class = n_class
        self.n_samples = rows.shape[0]
        self.n_features = features.shape[0]
        cdef random_device rd # Needed when using the C++ mt19937 rng w/out a seed.
        
        # Get the max cardinality of all numerical feats.
        self.max_n_unique_feat_vals = max([np.unique(X[:,i]).size for i in range(X.shape[1])])
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        cdef SIZE_t init_capacity = 15
        
        cdef SIZE_t i, row, label
        cdef DTYPE_t wt
        cdef DTYPE_t sum_wts = 0
        cdef Node* root_node = NULL
        with nogil:
            # Allocate memory for the tree.
            self._increase_mem_capacity(init_capacity)
            self.mem_capacity = init_capacity
 
            # And sum the class weights of all the root node's samples in
            # order to know minimum total weight a leaf must have (which
            # we must know when regularizing by min_weight_fraction_leaf.)
            for i in range(self.n_samples):
                row = self.rows[i]
                label = y_ptr[row]
                wt = self.class_weights[label]
                sum_wts += wt
            self.min_weight_leaf = self.min_weight_fraction_leaf*sum_wts
            
            # Initialize the random number generator. Followed example from:
            #     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/tests/run/cpp_stl_random.pyx#L16
            if self.seed == -1:
                self.rng = mt19937(rd()) # If using the random device engine std::random_device.
            else:
                self.rng = mt19937(self.seed)

        # Initiate tree building.
        self._grow_tree(X_ptr, y_ptr)
    
    cdef Node* _next_node(self, SIZE_t nxt) nogil: 
        return &self.nodes[nxt]
    
    cdef SIZE_t _get_leaf_idx(self, SIZE_t i, Node* leaf, SIZE_t n, DTYPE_t* X) nogil:
        cdef SIZE_t idx
        cdef SIZE_t root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[leaf.feat*n + i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else: 
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
    
    def predict(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_preds = X.shape[0]
        cdef SIZE_t i
        preds = np.empty(n_preds, dtype=np.intp)
        cdef SIZE_t[::1] preds_view = preds
        cdef Node leaf
        with nogil:
            for i in range(n_preds): 
                preds_view[i] = self.nodes[self._get_leaf_idx(i, &leaf, n_preds, X_ptr)].label
        return preds
    
    def predict_probs(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions. Shape: (`X.size`, `self.n_class`)
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        cdef DTYPE_t[::1] wcc_view = wcc
        cdef Node leaf
        cdef SIZE_t i, j, idx
        with nogil:
            for i in range(n_probs):
                idx = self._get_leaf_idx(i, &leaf, n_probs, X_ptr)
                for j in range(self.n_class):
                    wcc_view[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

class DecisionTreeSmallQCython():
    """Fit a decision tree using a depth-first algorithm.
    
    Uses Marvin Wright's SmallQ numerical splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L233
    """
    
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights = [], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit the tree.
            seed                         (int): Use when reproducibility is desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C') 
        if seed is None:
            self.seed = -1
        else:
            self.seed = seed
        self._tree = _DecisionTree(self.m, self.min_samples_leaf, self.min_weight_fraction_leaf, self.seed)
        
    @property
    def size(self): return self._tree.size
    
    @property
    def left_children(self): return self._tree.left_children
    
    @property
    def right_children(self): return self._tree.right_children
            
    @property 
    def split_features(self): return self._tree.split_features

    @property 
    def split_thresholds(self): return self._tree.split_thresholds
    
    @property
    def weighted_class_counts(self): return self._tree.weighted_cc
    
    @property
    def labels(self): return self._tree.labels
    
    def fit(self, X, y, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X (Fortran-style ndarray of float64): Pre-processed training data. 
                                                  Shape: (num train samples, num train features).
            y                   (ndarray of int): Training labels. Shape: (num train samples,).
            rows                          (list): Indices of the rows to be used for training. 
                                                  All rows used if empty.
            features                      (list): Column indices of training features that will be used.
                                                  All features used if empty.                          
        Returns:
            DecisionTreeSmallQCython: A decision tree object.
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        self.n_class = np.unique(y).size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.
            
        self._tree.fit(X, y, self.rows, self.features, self.class_weights, self.n_class)
        return self

    def predict(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict(X)
    
    def predict_probs(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict_probs(X)

## Cython Wright SmallQ Tree's Speed on the Titanic Data

In [42]:
m = 4
dt = DecisionTreeSmallQCython(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc);

In [43]:
dt.size

393

In [44]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.8202247191011236

In [45]:
%timeit dt.fit(xTrain_proc, yTrain_proc) #beat 989us

1.03 ms ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [46]:
%timeit dt.predict(xVal_proc)

7.27 µs ± 82.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Unsurprisingly, and likely for the reasons described above (possibly twice as much iteration required), the Wright's Small Q splitter runs nearly 100% slower than Sklearn's numerical splitter. 

## Wright Large Node "Large Q" numerical split finder
The "large Q" split finder will look quite similar to the "small Q" split finder. The only difference is that when searching for the best split, instead of iterating through the sorted, unique raw feature values of just *the node's samples*, we will use the pre-sorted data that holds the unique values of *all training samples*. And so, most of our work here will be to write some pre-processing helper functions that sort the training data's numerical features before training begins.

Here's how Wright designed Ranger's pre-sorting to work:
1. Create new arrays that contain all the unique values for each numerical feature in the training data, and then sort these arrays in ascending order. Once sorted, these unique values lists will represent all possible split points whenever a particular feature is evaluated to see whether or not it should be used to split a node.
2. Create a look-up table of size <# training rows> x <# numerical features>. For each numerical feature, each row will contain the index at which that row's raw feature value can be found in the appropriate unique values list.

In order to iterate through the split-points of a particular numerical feature, the "large Q" numerical splitter will use the look-up table to build a somewhat longer version of `split_class_counts_wt` list that was constructed by the "small Q" splitter. This time around, the per-split class counts list will be long enough to contain all of a feature's split-points (the unique values) that are found inside the entire training set. 

To fill in this list's class counts, the splitter will iterate through each row in the node, grab its raw feature value, and then go to the look-up table to find the position of that value in the appropriate feature's unique values list. This result tells us the location of the row's split point. 

In other words, once we know where the row's raw feature value ranks amongst all values of that feature observed in the training set, we'll be able to figure out the position in the `split_class_counts_wt` list at which we should increment the counter that corresponds to the row's class label.

Once this list is tabulated, the splitter can iterate through each split point, from left to right, adjusting the left and right child weight class counts and calculating the proxy gini score as it goes. Many times the class counts for a given split-point will all be zero because the node has no samples that have that split-point's raw feature value. To easily skip these split points, we'll also compile a list of `split_weighted_sample_counts` inside the smae for-loop we use to tabulate the `split_class_counts_wt` list. As the splitter proceeds through split-points, it will skip past all splits where the weighted sample count is zero.

#### Ranger's difference to Breiman & Cutler's pre-sorting
What stood out to me most about Ranger's pre-sorting is that it gets the a lot of the benefit that Breiman & Cutler's pre-sorting strategy provides, but without constant upkeep that their approach requires. To be clear, however, when it comes to the specific act of iterating through the pre-sorted values, Breiman's method will be faster because each node's values for every numerical feature will already have been pre-sorted and stored contiguously. The downside of this, of course, is that everytime a node is split its samples' pre-sorted values for all other numerical feature's *that weren't* used to split the node must also then be partitioned into left and right children.

With Ranger, split-finding Wright's pre-sorted values will be a bit slower because the values of the node's samples aren't stored contiguously. Instead, the index of each row's split point must be looked up, one-by-one, inside a for-loop. On the other hand, the benefit of using such a for-loop is that unlike with Breiman and Cutler, no other extra upkeep need be done to update the ordering of all other numerical features' pre-sorted values.

Indeed, in datasets with thousands or tens of thousands of features, I would imagine that Breiman and Cutler's strategy simply wouldn't be feasible.

In his paper that introduced Ranger, Wright was clear that it was only after extensive runtime profiling that he landed on the strategy of alternating between "small Q" and "large Q" numerical splitting algorithms depending on the number of samples in the node as well as the number of unique values found in the training set under the candidate feature. I'm still hopeful that this strategy will ultimately prove competitive with Louppe's numerical splitter.

#### Re-implementing `preprocess_train` to perform "large Q" preprocesssing
Without further ado, let's write up logic that will generate the numerical feature unique values lists, and the index table of numerical split points for each training row. I'll add this code to revised version of my training data pre-processing function.

In [47]:
def preprocess_train(x, y, cat_feats=[], largeQ=False):
    """Pre-process training data and labels for training via random forests.
    
    Uses these heuristics:
        1. Categorical features by default are preprocessed using PCA encoding from:
               Coppersmith et. al. (1999): https://link.springer.com/article/10.1023%2FA%3A1009869804967
           This done only once at the beginning of training (and not at each split), from:
               Wright and König (2019): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6368971/pdf/peerj-07-6339.pdf
           Avoids the absent levels problem described in:
               Wu (2018): https://dl.acm.org/doi/abs/10.5555/3291125.3309607
        2. Numerical feature NaNs replaced by median value.
        3. Categorical feature NaNs replaced by mode level. Both these NaN strategies
           are Breiman's "current preferred method" from:
               Breiman (2002): https://www.stat.berkeley.edu/~breiman/Using_random_forests_V3.1.pdf
    
    Arguments: 
        x (Pandas or Numpy array): The original un-preprocessed training data.
        y (Pandas or Numpy array): Numerically encoded training labels.
        cat_feats          (list): Categorical features' column indices.
        largeQ             (bool): Whether tree fitting will use "large Q" 
                                   numerical splitting.
        
    Returns: 
        The processed training data and labels, a dictionary of values used to fill
        each column's NaN values, and a dictionary containing categorical level-to-PCA maps.
        The contents of both these dicts are stored under column index numbers.
        
        Also returns if "large Q" numerical splitting is enabled:
            - Column-major Numpy array where each row holds the index (amongst a 
              feature's ordered, ascending unique values) at which that row's raw value 
              for that feature would be located. These indices hold split-point locations 
              used for LargeQ numerical splitting.
              
            - Column-major Numpy array where first n rows of each column contains the
              feature's n unique values in ascending order. Shape of array is
              <max cardinality (of all feats)> x <nfeats>.
              
            - 1-d array holding cardinality (num unique vals) of each feature.
    """
    x, y = np.asfortranarray(x), np.ascontiguousarray(y)
    n, nclass, nfeat, has_cat_feats = len(x), len(np.unique(y)), x.shape[1], len(cat_feats) > 0
    num_feats = [i for i in range(nfeat) if i not in cat_feats]; has_num_feats = nfeat > len(cat_feats)
    nan_fillers, pca_maps = {}, {} # NaN fill values and cat level-to-PCA mappings stored under feat col idxs.
    if has_cat_feats:
        for i in cat_feats:
            values = x[:,i]
            nans = pd.isna(values); has_nans = nans.sum() > 0
            # Step 1: Naive ordinal encode all non-NaN categorical values.
            values[~nans], levels = integer_encode(values[~nans])
            # Step 2: Store modes of all categorical features.
            mode = stats.mode(values.astype(float), nan_policy='omit').mode[0]
            # Step 3: Replace any NaNs in cat cols with cols' modes.
            if has_nans: values[nans] = mode
            # Step 4: PCA rank-encode all categorical features.
            x[:,i], pca_maps[i] = PCA_rank_encode(values.astype(int), y, levels, n, nclass)
            nan_fillers[i] = pca_maps[i][levels[int(mode)]] # Store the PCA rank-encoded mode value for each cat feat.
    if has_num_feats:                                       # The levels list stores orig. level strings in order of 
        values = x[:,num_feats].astype(float)               # their naive ordinal encodings.
        nans = np.isnan(values); has_nans = nans.sum() > 0
        # Step 5: Store medians of all numerical features.
        medians = np.nanmedian(values, axis=0)
        for i,m in enumerate(medians): nan_fillers[num_feats[i]] = m
        # Step 6: Replace any NaNs in num cols with cols' medians.
        if has_nans: x[:,num_feats] = np.where(nans, medians, x[:,num_feats])
    if largeQ:
        unique_vals_feats = [np.unique(x[:,i]) for i in range(nfeat)]
        split_point_idxs = np.asfortranarray([[np.searchsorted(unique_vals_feats[i], x[j][i], 
                                                      side='left', sorter=None) 
                                      for j in range(x.shape[0])]
                                      for i in range(x.shape[1])], dtype='int').T
        n_unique_vals_feats = np.array([len(arr) for arr in unique_vals_feats], dtype=np.intp, order='C')
        unique_vals_feats_out = np.empty((n_unique_vals_feats.max(), nfeat), dtype=np.float64, order='F')
        for i in range(nfeat): unique_vals_feats_out[:n_unique_vals_feats[i],i] = unique_vals_feats[i]
        return (x.astype('float64'), y, nan_fillers, pca_maps, np.asfortranarray(split_point_idxs), 
                np.asfortranarray(unique_vals_feats_out), n_unique_vals_feats)
    else:
        return x.astype('float64'), y, nan_fillers, pca_maps 

In [48]:
def find_num_split_largeQ(rows, split_point_idxs, unique_vals_feats, labels, node_start, n_parent, 
                          n_class, min_samples_leaf, min_weight_leaf, c_wts, l_wcc, r_wcc,
                          parent_wcc, best_split, current_feat, parent_num, parent_den, 
                          n_unique_vals_feat, split_counts_raw, split_counts_wt, split_class_counts_wt):
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Marvin Wright's LargeQ splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L316
    
    Saves a split's feature idx, threshold, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        rows                      (ndarray of int): Indices of all rows in the training set. 
                                                    Shape: (n train samples,).
        split_point_idxs          (ndarray of int): All numerical feature split-point locations for all rows.
                                                    Shape: (n training samples, n features).
        unique_vals_feats     (ndarray of float64): Columns contain sorted unique values for all features. 
                                                    Shape: (max cardinality of all feats, n features).
        labels                    (ndarray of int): All training labels. Shape: (n training samples,).
        node_start                           (int): Index of the beginning of the parent node in `rows`.
        n_parent                             (int): Number of samples in the parent node.
        n_class                              (int): Number of unique classes in the training set.
        min_samples_leaf                     (int): Any leaf will have no fewer than this many samples.
        min_weight_leaf                  (float64): Total weight of any leaf's samples will be at least this much.
        c_wts                 (ndarray of float64): Class weights. Shape: (`n_class`,).
        l_wcc                 (ndarray of float64): Left child's weight class counts. Shape: (`n_class`,).
        r_wcc                 (ndarray of float64): Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc            (ndarray of float64): Parent node's weight class counts. Shape: (`n_class`,).
        best_split                         (Split): Holds the feature, threshold, and impurity
                                                    score of the parent node's current best split.
        current_feat                         (int): Column index of feature under investigation.
        parent_num                       (float64): Numerator of parent node's impurity score.
        parent_den                       (float64): Denominator of parent node's impurity score.
        n_unique_vals_feat                   (int): Number of unique values for one feature found among 
                                                    all training samples.
        split_counts_raw          (ndarray of int): Stores sample counts found at each unique split point of
                                                    a given feature for a given node. 
                                                    Shape: (max cardinality of all features,).
        split_counts_wt       (ndarary of float64): Stores weighted sample counts found at each unique split point
                                                    of a given feature for a given node. 
                                                    Shape: (max cardinality of all features,).
        split_class_counts_wt (ndarray of float64): Stores weighted class counts of each class at each unique
                                                    split point of a given feature for a given node. Shape:
                                                    (<max cardinalty of all features> x `n_class`,)
              
    Returns: 
        int: 1 if feature is constant for eligible split-points. 0, otherwise.
    """
    # Whether or not feat is constant within search range permitted
    # by min_samples_leaf and min_weight_leaf (0 if no, 1 if yes).
    current_feat_const = 1
    
    # Tabulate sample counts, weighted counts, and weighted class counts at 
    # each split point. Values at split points not belonging to node's rows
    # will remain zero.
    split_counts_raw[:n_unique_vals_feat] = 0
    split_counts_wt[:n_unique_vals_feat] = 0.
    split_class_counts_wt[:n_unique_vals_feat*n_class] = 0.
    for i in range(n_parent):
        row = rows[node_start + i]
        label = labels[row]
        wt = c_wts[label]
        split_point_idx = split_point_idxs[row][current_feat]
        split_counts_raw[split_point_idx] += 1
        split_counts_wt[split_point_idx] += wt
        split_class_counts_wt[split_point_idx*n_class + label] += wt
        
    # If feat is constant for the node.
    if (split_counts_raw[:n_unique_vals_feat] > 0).sum() < 2: 
        return current_feat_const
      
    # To keep track of num samples in left child.
    n_left = 0    
    # Left child's proxy gini score denominator.
    l_den = 0.
    
    # Search for the threshold of the best split.
    for i in range(n_unique_vals_feat - 1):
        if split_counts_raw[i] == 0: continue # Move to next split-point if no samples at this one.
            
        n_left += split_counts_raw[i]
        n_right = n_parent - n_left
        if n_right == 0: return current_feat_const # Stop search when right child empty.
        
        # Calculate denominators of proxy gini scores.
        l_den += split_counts_wt[i]
        r_den = parent_den - l_den
        
        # Calculate numerators of proxy gini scores.
        l_num, r_num = 0., 0. 
        for j in range(n_class):
            # Can't do the on-line proxy gini update algorithm cause we
            # move all samples from a given class over to the left side 
            # before updating the calculation.
            wt = split_class_counts_wt[i*n_class + j]
            l_wcc[j] += wt
            r_wcc[j] -= wt
            l_num += l_wcc[j]*l_wcc[j]
            r_num += r_wcc[j]*r_wcc[j]
        
        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf
        if n_left < min_samples_leaf: continue
        elif n_right < min_samples_leaf: return current_feat_const
        elif l_den < min_weight_leaf: continue
        elif r_den < min_weight_leaf: return current_feat_const
        
        current_feat_const = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Find raw feature value of sample(s) at next-closest split-point.
            k = i+1
            while split_counts_raw[k] == 0: k+=1
            # Split threshold is always the mid-point between two consecutive values.
            mid = unique_vals_feats[i][current_feat]/2. + unique_vals_feats[k][current_feat]/2. 
            if mid == unique_vals_feats[k][current_feat]: mid = unique_vals_feats[i][current_feat]
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat
    return current_feat_const

## Wright LargeQ Decision Tree Python Version
To attain the fastest speed, Wright designed Ranger to have the tree-growing algorithm alternate between "small Q" and "large Q" node splitting, depending on the size of the node and number of unique values held by the candidate splitting feature.

I plan to implement this as well, but first I'd like to see how the exclusive use of Ranger's "large Q" style splitting compares to the "small Q" method, as well as to Louppe/Sklearn-style splitting.

In [49]:
class DecisionTreeLargeQ():
    """Fit a decision tree classifier using a depth-first tree 
    growth algorithm. 
    
    Uses Marvin Wright's LargeQ numerical splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L316
    """

    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights=[], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try 
                                                to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit 
                                                the tree.
            class_weights (ndarray of float64): Sample weight to be used for each class. Shape: (`n_class`,).
            seed                         (int): Use when reproducibility desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C')
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = np.empty(0, dtype=Node, order='C')
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = np.empty(0, dtype=np.float64, order='C')
        
    @property
    def size(self): return self.n_nodes
    
    @property 
    def left_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].l_child
        return out

    @property 
    def right_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].r_child
        return out

    @property 
    def split_features(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].feat
        return out

    @property 
    def split_thresholds(self): 
        out = np.empty(self.n_nodes, dtype='float64')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].thresh
        return out

    @property 
    def weighted_cc(self):
        out_size = self.n_nodes*self.n_class
        out = np.empty(out_size, dtype='float64')
        for i in range(out_size):
            out[i] = self.weighted_class_counts[i]
        out.resize(self.n_nodes, self.n_class)
        return out

    @property 
    def labels(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].label
        return out
    
    def _increase_mem_capacity(self, new_capacity):
        """Resize ndarrays that hold tree's nodes and weighted class counts.
        
        Arguments:
            new_capacity (int): Amount of nodes that resized arrays will be able to hold.
        """
        self.nodes.resize(new_capacity, refcheck=False)
        self.weighted_class_counts.resize(new_capacity*self.n_class, refcheck=False)
    
    def _make_leaf(self, node_id, wcc, n_classes_node):
        """Set and store the class label of a leaf node.
        
        Break ties at random when multiple classes share the same max weight.
        Doing this avoids a bias towards lower classes that would be a possible
        consequence of using np.argmax (which is what Sklearn does).
        
        Arguments:
            node_id            (int): Location of node in `self.nodes`.
            wcc (ndarray of float64): Node's weighted class counts. Shape: (`self.n_class`,).
            n_classes_node     (int): Number of unique class labels found among
                                      node's training samples.
        """
        if n_classes_node == 1: 
            label = max(enumerate(wcc), key=lambda f: f[1])[0]
        else:              
            label = self._rng.choice(np.argwhere(wcc==np.max(wcc)).flatten())
        self.nodes[node_id] = Node(-1, -1, -1, np.nan, label) 
        
    def _grow_tree(self, X, y, split_point_idxs, unique_feat_vals):
        """Depth-first growth of a decision tree.
        
        Arguments:
            X                     (ndarray of float64): Training samples. Shape: (n samples, n features).
            y                         (ndarray of int): Training labels. Shape: (n samples,).
            split_point_idxs          (ndarray of int): All numerical feature split-point locations for all rows.
                                                        Shape: (n training samples, n features).
            unique_vals_feats     (ndarray of float64): Columns contain sorted unique values for all features.
                                                        Shape: (max cardinality of all feats, n features).
        """
        # LIFO stack holding all nodes still to be investigated.
        node_stack = []
        
        # Stores the weighted class counts of the current node.
        node_wcc = np.empty(self.n_class, dtype=np.float64)
        
        ##############################################################
        # For finding the best split.
        ##############################################################
        l_wcc = np.empty(self.n_class, dtype=np.float64)
        r_wcc = np.empty(self.n_class, dtype=np.float64)
        
        # Make 1-d arrays containing raw and weighted sample counts, as
        # well as weighted class counts for each unique raw feature value.
        #
        # Raw, non-weighted, sample counts at each split-point.
        split_counts_raw = np.empty(self.max_n_unique_feat_vals, dtype=np.intp) 
        # Weighted sample counts at each split-point.
        split_counts_wt = np.empty(self.max_n_unique_feat_vals, dtype=np.float64)
        # Weighted class counts for each split-point.
        split_class_counts_wt = np.empty(self.n_class*self.max_n_unique_feat_vals, dtype=np.float64)  

        # Keeping track of nodes' constant features. 
        features = self.features.copy()
        constant_features = np.empty(self.n_features, dtype=np.intp)
        
        # Push root node onto the LIFO stack.
        node_stack.append(StackEntry(0, self.n_samples, 0, 0, 0))
        self.n_nodes = 1
        
        while len(node_stack) > 0:
            node_info = node_stack.pop()
            start, end = node_info.start, node_info.end
            node_id, parent_id = node_info.node_id, node_info.parent_id
            n_consts = node_info.n_const_feats
            n_samples_node = end-start
            
            # Tabulate and store the current node's weighted class counts.
            node_wcc[:] = 0.
            for i in range(n_samples_node):
                row = self.rows[start + i]
                label = y[row]
                wt = self.class_weights[label]
                node_wcc[label] += wt 
            self.weighted_class_counts[node_id*self.n_class: (node_id + 1)* self.n_class] = node_wcc
            
            # Make a leaf if required to do so.
            n_classes_node, sum_node_wcc, sum_node_wcc_sqr = 0, 0., 0.
            for c in range(self.n_class):
                wcc = node_wcc[c]
                if wcc > 0: n_classes_node += 1
                # Compute the current node's proxy gini numerator and denominator while we're at it.
                sum_node_wcc_sqr += wcc**2 
                sum_node_wcc += wcc 
            if n_classes_node == 1:                      
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif n_samples_node < 2*self.min_samples_leaf:  
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif sum_node_wcc < 2.*self.min_weight_leaf: 
                self._make_leaf(node_id, node_wcc, n_classes_node)
            
            # Or perform a split.
            else:
                # Initialize stats for best split of node.
                best_split = Split(-1, 0., -np.inf)
                
                # Ensure feats drawn w/out replacement.
                n_drawn_feats = 0
                n_new_consts = 0
                n_total_consts = n_consts
                lb = 0                      # Range in `features` array from which we 
                ub = self.n_features - 1    # randomly select a feature's column index. 
               
                while n_drawn_feats < self.m:
                    n_drawn_feats += 1
                    idx = self._rng.choice(range(lb, ub-n_new_consts+1))
                    
                    # So that we don't draw a known constant feature again this split-search.
                    if idx < n_consts:
                        features[idx], features[lb] = features[lb], features[idx]
                        lb += 1 
                        continue
                        
                    # So that no new const feats get drawn more than once per split-search.
                    idx += n_new_consts
                    
                    feat_idx = features[idx]
                  
                    # Num split points found among training samples for given feat.
                    n_unique_vals_feat = self.n_unique_vals_feats[feat_idx]
                    
                    # Initialize weighted class counts of right and left children.
                    # Right child's counts are initially the same as parent node's.
                    r_wcc[:] = node_wcc
                    l_wcc[:] = 0.
                    
                    # If the feature has an impurity score that's better than the best score 
                    # found among all other features visited thus far for this node, find_num_split()
                    # updates the attributes of the struct containing the node's best split info. 
                    # 
                    # But even if a new best score isn't reached, if an impurity score can
                    # be calculated at least once during the feature's split search, the
                    # following indicator will be toggled off, to indicate that the feature
                    # is not constant (1 = is constant; 0 = not constant).
                    current_feat_const = find_num_split_largeQ(self.rows, split_point_idxs, unique_feat_vals, y, start, n_samples_node, 
                                                       self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                                       self.class_weights, l_wcc, r_wcc, node_wcc, best_split, feat_idx, 
                                                       sum_node_wcc_sqr, sum_node_wcc, n_unique_vals_feat, split_counts_raw, 
                                                       split_counts_wt, split_class_counts_wt)

                    if current_feat_const:
                        # The feature may be constant within the search range permitted
                        # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                        # the feature is a newly discovered constant.
                        features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                        n_new_consts += 1
                        n_total_consts += 1
                        continue
                    else:
                        # The feature is non-constant, so we ensure it's not drawn again
                        # during this split-search.
                        features[idx], features[ub] = features[ub], features[idx]
                        ub -= 1 
                            
                # To ensure that the constant features info is accurate for sibling or child nodes.
                features[0:n_consts] = constant_features[0:n_consts]
                constant_features[n_consts:n_consts+n_new_consts] = features[n_consts:n_consts+n_new_consts]
                
                # Make node a leaf if constant for all randomly drawn feats.
                # (# drawn known constant feats + # drawn new constant feats)
                if lb + n_new_consts == n_drawn_feats: 
                    self._make_leaf(node_id, node_wcc, n_classes_node)
                else: 
                    split_pos = make_num_split(self.rows, X, node_info, best_split) 

                    # Update info for node that's getting split.
                    l_child_id = self.n_nodes
                    r_child_id = l_child_id + 1
                    self.nodes[node_id] = Node(l_child_id, r_child_id, best_split.feat, best_split.thresh, -1)

                    # Prepare for the left and right child nodes
                    # by increasing tree data memory capacity if
                    # necessary.
                    if self.n_nodes + 2 > self.mem_capacity:
                        # Expand memory capacity geometrically. See "geometric growth" 
                        # part of WhozCraig's SO answer at: 
                        #     https://stackoverflow.com/a/51665863/8628758.
                        # Add one after squaring so that the new capacity can
                        # contain not only a tree of greater depth, but also
                        # the maximum # nodes that that depth could have.
                        new_capacity = 2*self.mem_capacity + 1
                        self._increase_mem_capacity(new_capacity)
                        self.mem_capacity = new_capacity
                    
                    # Push right child info onto the LIFO stack.
                    node_stack.append(StackEntry(split_pos, end, r_child_id, node_id, n_total_consts))
                    # Push left child info onto queue.
                    node_stack.append(StackEntry(start, split_pos, l_child_id, node_id, n_total_consts))

                    # And update size of the tree.
                    self.n_nodes += 2
    
    def fit(self, X, y, split_point_idxs, unique_vals_feats, n_unique_vals_feats, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X    (Fortan-style ndarray of float64): Pre-processed training data.
            y                     (ndarray of int): Training labels.
            split_point_idxs      (ndarray of int): All numerical feature split-point locations for all rows.
                                                    Shape: (n training samples, n features).
            unique_vals_feats (ndarray of float64): Columns contain sorted unique values for all features.
                                                    Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats      (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows                            (list): Indices of the rows to be used for training. 
                                                    All rows used if empty.
            features                        (list): Column indices of training features that will be used.
                                                    All features used if empty.    
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        # Determine # classes found among all training samples.
        root_cc = np.unique(y, return_counts=True)[1] 
        self.n_class = root_cc.size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.

        self.n_samples = len(self.rows)
        self.n_features = len(self.features)
        
        # Store the num unique vals for each numerical feat and
        # find the maximum cardinality.
        self.n_unique_vals_feats = n_unique_vals_feats
        self.max_n_unique_feat_vals = self.n_unique_vals_feats.max()
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        init_capacity = 15
        
         # Allocate tree memory.
        self._increase_mem_capacity(init_capacity)
        self.mem_capacity = init_capacity
        
        # And sum the class weights of all the root node's samples in
        # order to know minimum total weight a leaf must have (which
        # we must know when regularizing by min_weight_fraction_leaf.)
        root_wcc = root_cc*self.class_weights
        self.min_weight_leaf = self.min_weight_fraction_leaf*root_wcc.sum()
        
        # Initialize the random number generator.
        self._rng = get_random_generator(self.seed)
        
        # Initiate tree building.
        self._grow_tree(X, y, split_point_idxs, unique_vals_feats)
        return self
        
    def _next_node(self, nxt): return self.nodes[nxt]
       
    def _get_leaf_idx(self, i, X):
        root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[:,leaf.feat][i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else:
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
        
    def predict(self, X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        n_preds = X.shape[0]
        preds = np.empty(n_preds, dtype=np.intp)
        for i in range(n_preds):
            preds[i] = self.nodes[self._get_leaf_idx(i, X)].label
        return preds
    
    def predict_probs(self, X):
       """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        for i in range(n_probs):
            idx = self._get_leaf_idx(i, X)
            for j in range(self.n_class):
                wcc[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

## Python Wright LargeQ Tree's Speed on the Titanic Data

In [50]:
(xTrain_proc, yTrain_proc, nan_fillers, 
 pca_maps, split_point_idxs, unique_vals_feats, 
 n_unique_vals_feats) = preprocess_train(xTrain_titanic, yTrain_titanic, cat_feats, largeQ=True)
xVal_proc = preprocess_test(xVal_titanic, nan_fillers, cat_feats, pca_maps)

In [51]:
m = 4
dt = DecisionTreeLargeQ(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [52]:
dt.size # Number of nodes in the decision tree.

371

In [53]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.752808988764045

In [54]:
%timeit dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

103 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [55]:
%timeit dt.predict(xVal_proc)

503 µs ± 9.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Wright LargeQ Decision Tree Cython Version

In [56]:
%%cython
# cython: wraparound=False, boundscheck=False, cdivision=True, initializedcheck=False
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11

import numpy as np
cimport numpy as np
np.import_array()
ctypedef np.float64_t DTYPE_t
ctypedef np.intp_t SIZE_t # Signed, same as ssize_t in C. See MSeifert's SO answer: https://stackoverflow.com/a/46416257/8628758
cimport cython
from libc.math cimport log as ln
from libc.stdlib cimport realloc, free
from libc.string cimport memcpy
from libc.string cimport memset
from libcpp.stack cimport stack

# For C++ random number generation.
from libc.stdint cimport uint_fast32_t 
    
# For convenient memory reallocation.
ctypedef fused realloc_t:
    SIZE_t
    DTYPE_t
    Node

cdef inline realloc_t* safe_realloc(realloc_t* ptr, SIZE_t n_items) nogil except *:
    # Inspired by Sklearn's safe_realloc() func. However, thankfully
    # Cython now no longer requires us to send a pointer to a pointer
    # in order to prevent crashes.
    cdef realloc_t elem = ptr[0]
    cdef SIZE_t n_bytes = n_items * sizeof(elem)
    # Make sure we're not trying to allocate too much memory.
    if n_bytes/sizeof(elem) != n_items:
        with gil:
            raise MemoryError(f"Overflow error: unable to allocate {n_bytes} bytes.")       
    cdef realloc_t* res_ptr = <realloc_t *> realloc(ptr, n_bytes)
    with gil:
        if not res_ptr: raise MemoryError()
    return res_ptr

# C++ random number generator. Not yet a part of a Cython release so
# pasted in from: 
#     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/Cython/Includes/libcpp/random.pxd#L1
cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass random_device:
        ctypedef uint_fast32_t result_type
        random_device() except +
        result_type operator()() except +

    cdef cppclass mt19937:
        ctypedef uint_fast32_t result_type
        mt19937() except +
        mt19937(result_type seed) except +
        result_type operator()() except +
        result_type min() except +
        result_type max() except +
        void discard(size_t z) except +
        void seed(result_type seed) except +

    cdef cppclass uniform_int_distribution[T]:
        ctypedef T result_type
        uniform_int_distribution() except +
        uniform_int_distribution(T, T) except +
        result_type operator()[Generator](Generator&) except +
        result_type min() except +
        result_type max() except +
        
# Info for any node that will eventually be split or made into a leaf.
# Similar to what Sklearn does at:
#     https://github.com/scikit-learn/scikit-learn/blob/a2c4d8b1f4471f52a4fcf1026f495e637a472568/sklearn/tree/_tree.pyx#L126
cdef struct StackEntry:
    SIZE_t start
    SIZE_t end
    SIZE_t node_id
    SIZE_t parent_id
    SIZE_t n_const_feats

# To compare node splits.
cdef struct Split:
    SIZE_t feat
    DTYPE_t thresh
    DTYPE_t score  

# Vital characteristics of a node. Set when it's added to the tree.
cdef struct Node:
    SIZE_t l_child # idx of left child, -1 if leaf
    SIZE_t r_child # idx of right child, -1 if leaf
    SIZE_t feat    # col idx of split feature, -1 if leaf
    DTYPE_t thresh # double split threshold, NAN if leaf
    SIZE_t label   # class label if leaf, -1 if non-leaf.

cdef inline void find_num_split_largeQ(SIZE_t* rows, SIZE_t* split_point_idxs, DTYPE_t* unique_vals_feats, 
                                       SIZE_t* labels, SIZE_t node_start, SIZE_t n_parent, SIZE_t n_samples,
                                       SIZE_t n_class, SIZE_t min_samples_leaf, DTYPE_t min_weight_leaf, 
                                       DTYPE_t* c_wts, DTYPE_t* l_wcc, DTYPE_t* r_wcc, DTYPE_t* parent_wcc, 
                                       Split* best_split, SIZE_t current_feat, DTYPE_t parent_num, 
                                       DTYPE_t parent_den, SIZE_t n_unique_vals_feat, SIZE_t max_n_unique_vals, 
                                       SIZE_t* split_counts_raw, DTYPE_t* split_counts_wt, 
                                       DTYPE_t* split_class_counts_wt, bint* current_feat_const) nogil:
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Marvin Wright's SmallQ splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L316
    
    Saves a split's feature idx, threshold, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        rows                   : Indices of all rows in the training set. Shape: (n train samples,).
        split_point_idxs       : All numerical feature split-point locations for all rows.
                                 Shape: (n training samples, n features).
        unique_vals_feats      : Columns contain sorted unique values for all features. 
                                 Shape: (max cardinality of all feats, n features).
                                 node (beginning index 0). Shape: (n train samples,).
        labels                 : All training labels. Shape: (n training samples,).
        node_start             : Index of the beginning of the parent node in `rows`.
        n_parent               : Number of samples in the parent node.
        n_samples              : Number of samples in the training data.
        n_class                : Number of unique classes in the training set.
        min_samples_leaf       : Any leaf will have no fewer than this many samples.
        min_weight_leaf        : Total weight of any leaf's samples will be at least this much.
        c_wts                  : Class weights. Shape: (`n_class`,).
        l_wcc                  : Left child's weight class counts. Shape: (`n_class`,).
        r_wcc                  : Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc             : Parent node's weight class counts. Shape: (`n_class`,).
        best_split             : Holds the feature, threshold and impurity
                                 score of the parent node's current best split.
        current_feat           : Column index of feature under investigation.
        parent_num             : Numerator of parent node's impurity score.
        parent_den             : Denominator of parent node's impurity score.
        n_unique_vals_feat     : Number of unique values for one feature found among 
                                 all training samples.
        max_n_unique_vals      : Maximum cardinality of all features in dataset.
        split_counts_raw       : Stores sample counts found at each split point of
                                 a given feature in a given node. 
                                 Shape: (<max cardinality of all numerical feats in dataset>,).
        split_counts_wt        : Stores weighted sample counts found at each split point of
                                 a given feature in a given node. 
                                 Shape: (<max cardinality of all numerical feats in dataset>,).
        split_class_counts_wt  : Stores weighted class counts of each class at each
                                 split point of a given feature in a given node. Shape:
                                 (<max cardinality of all numerical feats in dataset> x `n_class`,)
        current_feat_const     : Whether current splitting feature is constant for all eligible split 
                                 thresholds in the current node. 1 if yes, 0 otherwise.
    """
    # Variables used while tabulating raw and weighted sample counts, 
    # as well as weighted class counts at all unique split points.
    cdef SIZE_t row, label, split_point_idx
    
    # Make sure node's samples aren't all constant for feature.
    cdef SIZE_t n_splits_node = 0
    
    # Variables to track progress during the split search.
    cdef SIZE_t n_left, n_right, i, j, k
    
    # Variables used to calculate proxy gini scores.
    cdef DTYPE_t l_num, l_den, r_num, r_den, wt, score, mid
    
    # Tabulate sample counts, weighted counts, and weighted class counts at 
    # each split point. Values at split points not belonging to node's rows
    # will remain zero.
    memset(split_counts_raw, 0, sizeof(SIZE_t)*n_unique_vals_feat)
    memset(split_counts_wt, 0, sizeof(DTYPE_t)*n_unique_vals_feat)
    memset(split_class_counts_wt, 0, sizeof(SIZE_t)*n_unique_vals_feat*n_class)
    for i in range(n_parent):
        row = rows[node_start + i]
        label = labels[row]
        wt = c_wts[label]
        split_point_idx = split_point_idxs[n_samples*current_feat + row]
        split_counts_raw[split_point_idx] += 1
        split_counts_wt[split_point_idx] += wt
        split_class_counts_wt[split_point_idx*n_class + label] += wt
        
    # If feat is constant for the node.
    for i in range(n_unique_vals_feat):
        if split_counts_raw[i] > 0: n_splits_node += 1
    if n_splits_node < 2: return
    
    # To keep track of num samples in left child.
    n_left = 0    
    # Left child's proxy gini score denominator.
    l_den = 0.
    
    # Search for the threshold of the best split.
    for i in range(n_unique_vals_feat - 1):
        if split_counts_raw[i] == 0: continue # Move to next split-point if no samples at this one.
        
        n_left += split_counts_raw[i]
        n_right = n_parent - n_left
        if n_right == 0: return # Make sure to stop search when right child empty.
        
        # Calculate denominators of proxy gini scores.
        l_den += split_counts_wt[i]
        r_den = parent_den - l_den

        # Calculate numerators of proxy gini scores.
        l_num, r_num = 0., 0. 
        for j in range(n_class):
            # Can't do the on-line proxy gini update algorithm cause we
            # move all samples from a given class over to the left side 
            # before updating the calculation.
            wt = split_class_counts_wt[i*n_class + j]
            l_wcc[j] += wt
            r_wcc[j] -= wt
            l_num += l_wcc[j]*l_wcc[j]
            r_num += r_wcc[j]*r_wcc[j]

        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf
        if n_left < min_samples_leaf: continue
        elif n_right < min_samples_leaf: return
        elif l_den < min_weight_leaf: continue
        elif r_den < min_weight_leaf: return

        current_feat_const[0] = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Find raw feature value of sample(s) at next-closest split-point.
            k = i+1
            while split_counts_raw[k] == 0: k+=1
            # Split threshold is always the mid-point between two consecutive values.
            mid = (unique_vals_feats[current_feat*max_n_unique_vals + i]/2. + 
                   unique_vals_feats[current_feat*max_n_unique_vals + k]/2.) 
            if mid == unique_vals_feats[current_feat*max_n_unique_vals + k]: 
                mid = unique_vals_feats[current_feat*max_n_unique_vals + i]
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat

cdef inline SIZE_t make_num_split(SIZE_t* rows, DTYPE_t* X, StackEntry* node_info, Split* best_split, 
                                SIZE_t n_samples) nogil:
    cdef SIZE_t p, p_end
    p, p_end = node_info.start, node_info.end
    while p < p_end:
        if X[best_split.feat*n_samples + rows[p]] <= best_split.thresh: p+=1
        else: p_end-=1; rows[p], rows[p_end] = rows[p_end], rows[p] 
    return p

# Necessary constants.
cdef DTYPE_t NEG_INF = -np.inf
cdef DTYPE_t NAN = np.nan
            
cdef class _DecisionTree:
    # Class attributes.
    cdef SIZE_t seed
    cdef mt19937 rng
    cdef SIZE_t mem_capacity
    cdef SIZE_t n_samples
    cdef SIZE_t n_features
    cdef SIZE_t n_class
    cdef SIZE_t m
    cdef SIZE_t min_samples_leaf, 
    cdef DTYPE_t min_weight_fraction_leaf
    cdef DTYPE_t min_weight_leaf
    cdef SIZE_t n_nodes
    cdef SIZE_t max_n_unique_feat_vals
    cdef SIZE_t* n_unique_vals_feats
    cdef SIZE_t* rows
    cdef SIZE_t* features
    cdef DTYPE_t* class_weights
    cdef Node* nodes
    cdef DTYPE_t* weighted_class_counts
    def __cinit__(self, SIZE_t m, SIZE_t min_samples_leaf, DTYPE_t min_weight_fraction_leaf, SIZE_t seed): 
        """
        Arguments:
            m                       : Number of candidate features randomly selected to try to split each node.
            min_samples_leaf        : Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf: Total weight of any leaf's samples must comprise this portion 
                                      of the sum of weights of *all* training samples used to fit the tree.
            seed                    : A seed for the C++ mt19937 32bit int random generator. 
                                      Use when reproducibility is desired.
        """
        self.m, self.min_samples_leaf = m, min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = NULL
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = NULL 
        
    def __dealloc__(self):
        free(self.nodes)
        free(self.weighted_class_counts)
        
    property size:
        def __get__(self):
            return self.n_nodes
    
    property left_children:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].l_child
            return out

    property right_children:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].r_child
            return out
        
    property split_features: 
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].feat
            return out
        
    property split_thresholds:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].thresh
            return out
        
    property weighted_cc:
        def __get__(self):
            cdef SIZE_t out_size = self.n_nodes*self.n_class
            out = np.empty(out_size, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(out_size):
                    out_view[i] = self.weighted_class_counts[i]
            out.resize(self.n_nodes, self.n_class)
            return out
    
    property labels:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].label
            return out
    
    cdef void _increase_mem_capacity(self, SIZE_t new_capacity) nogil:
        self.nodes = safe_realloc(self.nodes, new_capacity)
        self.weighted_class_counts = safe_realloc(self.weighted_class_counts, self.n_class*new_capacity)
    
    cdef void _make_leaf(self, Node* leaf_node, SIZE_t* y, SIZE_t node_start, SIZE_t node_id, 
                         SIZE_t n_classes_node, SIZE_t* max_wt_classes) nogil:
        # Class with largest wcc becomes leaf node's label. Break ties with a random choice.
        cdef SIZE_t label
        cdef DTYPE_t max_wt = 0.
        cdef SIZE_t lb = 0
        cdef SIZE_t ub = -1
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t i, j
        # If all node's samples have the same class.
        if n_classes_node == 1:
            label = y[self.rows[node_start]]
        else:
            # Otherwise find label with max weighted class count for the node.
            for i in range(self.n_class):
                max_wt = max(max_wt, self.weighted_class_counts[node_id*self.n_class + i])
            # See if multiple classes share this max count.
            for i in range(self.n_class):
                if self.weighted_class_counts[node_id*self.n_class + i] == max_wt:
                    ub += 1
                    max_wt_classes[ub] = i
            # If so, randomly choose leaf's label from among those classes.
            if ub > 0:
                dist = uniform_int_distribution[SIZE_t](lb, ub) # Choose an int w/in range lb, ub, inclusive.
                j = dist(self.rng)
                label = max_wt_classes[j]
            else:
                label = max_wt_classes[lb]
        leaf_node.l_child = -1
        leaf_node.r_child = -1    
        leaf_node.feat = -1  
        leaf_node.thresh = NAN
        leaf_node.label = label 

    cdef _grow_tree(self, DTYPE_t* X, SIZE_t* y, SIZE_t* split_point_idxs, DTYPE_t* unique_vals_feats):
        # LIFO stack holding all nodes still to be investigated.
        cdef stack[StackEntry] node_stack

        #####################################################################
        # Variables containing info of the node currently being investigated.
        #####################################################################
        cdef SIZE_t start, end, node_id, parent_id, n_consts, n_samples_node
        cdef DTYPE_t* node_wcc = NULL
        cdef StackEntry node_info
        cdef Node* node = NULL
        
        # Holds child node info if the current node gets split.
        cdef SIZE_t l_child_id, r_child_id
        cdef Node* l_child_node = NULL
        cdef Node* r_child_node = NULL
        
        #####################################################################
        # For finding the best split.
        #####################################################################
        cdef Split best_split
        cdef DTYPE_t* l_wcc = NULL
        cdef DTYPE_t* r_wcc = NULL
        cdef DTYPE_t sum_node_wcc_sqr, sum_node_wcc # Parent node's proxy Gini score num and den.
        cdef SIZE_t split_pos
        
        # Indicates a feature has been discovered to be constant during a
        # split search within the search range permitted by min_samples_leaf 
        # and min_weight_leaf.
        cdef bint current_feat_const 
        
        # Num unique vals for given feat found among all training samples.
        cdef SIZE_t n_unique_vals_feat
        
        # Make 1-d arrays containing raw and weighted sample counts, as
        # well as weighted class counts for each unique raw feature value.
        #
        # Raw, non-weighted, sample counts at each split-point.
        cdef SIZE_t[::1] split_counts_raw_buffer = np.empty(self.max_n_unique_feat_vals, dtype=np.intp) 
        cdef SIZE_t* split_counts_raw = &split_counts_raw_buffer[0]
        # Weighted sample counts at each split-point.
        cdef DTYPE_t[::1] split_counts_wt_buffer = np.empty(self.max_n_unique_feat_vals, dtype=np.float64)
        cdef DTYPE_t* split_counts_wt = &split_counts_wt_buffer[0]
        # Weighted class counts for each split-point.
        cdef DTYPE_t[::1] split_class_counts_wt_buffer = np.empty(self.n_class*self.max_n_unique_feat_vals, dtype=np.float64) 
        cdef DTYPE_t* split_class_counts_wt = &split_class_counts_wt_buffer[0]
        
        #####################################################################
        # For random feature selection (w/out replacement) and keeping track 
        # of nodes' constant features. 
        #####################################################################
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t lb, ub, idx, feat_idx, n_drawn_feats, n_new_consts, n_total_consts
        cdef SIZE_t[::1] features_buffer = np.empty(self.n_features, dtype=np.intp) 
        cdef SIZE_t* features = &features_buffer[0]
        cdef SIZE_t[::1] constant_features_buffer = np.empty(self.n_features, dtype=np.intp)
        cdef SIZE_t* constant_features = &constant_features_buffer[0]
        
        #####################################################################
        # For determining whether node should be a leaf.
        #####################################################################
        cdef SIZE_t i, c, cc, n_classes_node, row, label
        cdef DTYPE_t wcc, wt
        # Stores classes that share a leaf's max class wt. When two or more 
        # present, leaf label randomly chosen from these classes
        cdef SIZE_t* max_wt_classes = NULL
        
        with nogil:
            # Allocate memory to pointers.
            l_wcc = safe_realloc(l_wcc, self.n_class)
            r_wcc = safe_realloc(r_wcc, self.n_class)
            node_wcc = safe_realloc(node_wcc, self.n_class)
            max_wt_classes = safe_realloc(max_wt_classes, self.n_class*sizeof(SIZE_t))
            # Fill with feature column indices so we can track constant feats.
            memcpy(features, self.features, self.n_features* sizeof(SIZE_t))
            
            # Push root node onto the LIFO stack.
            node_stack.push({"start": 0, "end": self.n_samples, "node_id": 0, 
                             "parent_id": 0, "n_const_feats": 0})
            self.n_nodes = 1
            while not node_stack.empty():
                node_info = node_stack.top()
                node_stack.pop()
                start, end = node_info.start, node_info.end
                node_id, parent_id = node_info.node_id, node_info.parent_id # TODO: `parent_id` unused; is it necessary?
                n_consts = node_info.n_const_feats
                n_samples_node = end-start
                node = &self.nodes[node_id]
                
                # Tabulate the current node's weighted class counts.
                #
                # Implementation detail #1: I tried storing the l and r child wt class cts
                # of nodes' best splits so that this tabulation wouldn't need to be 
                # performed for each node. But found there was virtually no speed improvement
                # to justify the more complicated code required to store and update these 
                # values during the best split search.
                #
                # Implementation detail #2: Setting aside a block of memory to 
                # store the current node's wt class cts and passing a pointer to
                # this block to the split search function sped up training by 8%
                # compared to passing a ptr to the location of node's wt class cts 
                # in the self.weighted_class_counts array.
                memset(node_wcc, 0, self.n_class*sizeof(DTYPE_t))
                sum_node_wcc, sum_node_wcc_sqr = 0., 0.
                for i in range(n_samples_node):
                    row = self.rows[start + i]
                    label = y[row]
                    wt = self.class_weights[label]
                    # Compute the node's proxy gini numerator and denominator while we're at it.
                    sum_node_wcc_sqr += wt*(2*node_wcc[label] + wt) # numerator
                    sum_node_wcc += wt                              # denominator
                    node_wcc[label] += wt
                memcpy(&self.weighted_class_counts[node_id*self.n_class], node_wcc, self.n_class*sizeof(DTYPE_t))
                
                # Make a leaf if required to do so. 
                n_classes_node = 0
                for c in range(self.n_class):
                    wcc = node_wcc[c]
                    if wcc > 0: n_classes_node += 1
                if n_classes_node == 1:                   
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif n_samples_node < 2*self.min_samples_leaf:  
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif sum_node_wcc < 2.*self.min_weight_leaf: 
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)

                # Otherwise split the node.
                else:
                    # Initialize stats for best split of node.
                    best_split.feat = -1
                    best_split.thresh = 0.
                    best_split.score = NEG_INF

                    # Ensure feats drawn w/out replacement.
                    n_drawn_feats = 0
                    n_new_consts = 0
                    n_total_consts = n_consts
                    lb = 0                      # Range in `features` array from which we 
                    ub = self.n_features - 1    # randomly select a feature's column index. 
                        
                    while n_drawn_feats < self.m:
                        n_drawn_feats += 1

                        # Breiman & Cutler's original Fortran random forests implementation 
                        # allows for known constant features to be drawn during a split-search.
                        # I follow their example, as I believe that doing so allows individual 
                        # trees to be less correlated with each other. Since I don't pre-sort
                        # features, I would prefer not to have to sort any more features than
                        # necessary, and so I've adopted the technique Sklearn uses to track 
                        # constant features:
                        #     https://github.com/scikit-learn/scikit-learn/blob/dbe39454f766ebefc3219f2c1871ac1774316532/sklearn/tree/_splitter.pyx#L310
                        # 
                        # The idea is that feature idxs in `features` are organized into two sections:
                        #
                        #     [<indices of known constant feats>, <indices of non-constant feats>]
                        #
                        # As we begin drawing feature indices from this above list, those two sections
                        # will each be further sub-divided into two sections:
                        # 
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <undrawn non-constant feats>, <drawn non-constant feats>]
                        #
                        # When we choose a feature that happens to be a known constant, we'll re-locate
                        # its idx to the right-end of the first of those four sections. Then we 
                        # increment the lower bound threshold, `lb`, by one so that we don't re-draw 
                        # that feature again.
                        #
                        # Similarly, if we draw a non-constant feature idx, we'll move it to the 
                        # left-end of the last of the four partitions and reduce the upper bound
                        # threshold, `ub`, by one so that the feature idx can't be drawn again
                        # during this split-search. 
                        #
                        # One last important detail: sometimes we'll draw a feature that 
                        # used to be non-constant for ancestor nodes, but will be found to be 
                        # constant for the current node. When this happens, we relocate its 
                        # index so that it sits to the right of the known constant feats section.
                        # This means our `features` list could have up to five partitions:
                        #
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <newly discovered const feats>, <undrawn non-constant feats>, 
                        #      <drawn non-constant feats>]
                        #
                        # Whenever we find a new constant feature, we increment the `n_new_consts`
                        # counter by one. We also increment the `n_total_consts` counter by one. 
                        # During the split-search we have to use `n_total_consts` to keep track of
                        # the total number of constant features. n_consts` mustn't be changed
                        # because it tells us where the <newly discovered const feats> section
                        # of the `features` list begins.

                        # One last wrinkle. We subtract the # of newly discovered const feats from  
                        # the upper bound before we select an index `i` from the `features` array, 
                        # and add it back to `i` after `i` has been genereated. This prevents us from 
                        # re-drawing any of these new const feats again during this split-search.
                        dist = uniform_int_distribution[SIZE_t](lb, ub-n_new_consts)
                        idx = dist(self.rng)

                        # So that we don't draw a known constant feature again this split-search.
                        if idx < n_consts:
                            features[idx], features[lb] = features[lb], features[idx]
                            lb += 1 
                            continue

                        # So that no new const feats get drawn more than once per split-search.
                        idx += n_new_consts

                        feat_idx = features[idx]
                        
                        # Num split points found among training samples for given feat.
                        n_unique_vals_feat = self.n_unique_vals_feats[feat_idx]
                        
                        # Initialize weighted class counts of right and left children.
                        # Right child's counts are initially the same as parent node's.
                        memcpy(r_wcc, node_wcc, self.n_class*sizeof(DTYPE_t))
                        memset(l_wcc, 0, self.n_class*sizeof(DTYPE_t))

                        # If the feature has an impurity score that's better than the best score 
                        # found among all other features visited thus far for this node, find_num_split()
                        # updates the attributes of the struct containing the node's best split info. 
                        # 
                        # But even if a new best score isn't reached, if an impurity score can
                        # be calculated at least once during the feature's split search, the
                        # following indicator will be toggled off, to indicate that the feature
                        # is not constant.
                        current_feat_const = 1 # 1 = is constant; 0 = not constant
                        find_num_split_largeQ(self.rows, split_point_idxs, unique_vals_feats, y, start, 
                                              n_samples_node, self.n_samples, self.n_class, self.min_samples_leaf, 
                                              self.min_weight_leaf, self.class_weights, l_wcc, r_wcc, node_wcc, 
                                              &best_split, feat_idx, sum_node_wcc_sqr, sum_node_wcc, n_unique_vals_feat, 
                                              self.max_n_unique_feat_vals, split_counts_raw, split_counts_wt, 
                                              split_class_counts_wt, &current_feat_const)

                        if current_feat_const:
                            # The feature may be constant within the search range permitted
                            # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                            # the feature is a newly discovered constant.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # The feature is non-constant, so we ensure it's not drawn again
                            # during this split-search.
                            features[idx], features[ub] = features[ub], features[idx]
                            ub -= 1 

                    # To ensure that the constant features info is accurate for sibling or child nodes.
                    memcpy(&features[0], &constant_features[0], sizeof(SIZE_t)*n_consts)
                    memcpy(&constant_features[n_consts], &features[n_consts], sizeof(SIZE_t)*n_new_consts)

                    # Make node a leaf if constant for all randomly drawn feats.
                    # (# drawn known constant feats + # drawn new constant feats)
                    if lb + n_new_consts == n_drawn_feats: 
                        self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                    else: 
                        split_pos = make_num_split(self.rows, X, &node_info, &best_split, self.n_samples) 

                        # Update tree info for node that's getting split.
                        l_child_id = self.n_nodes
                        r_child_id = l_child_id + 1
                        node.l_child = l_child_id
                        node.r_child = r_child_id
                        node.feat    = best_split.feat
                        node.thresh  = best_split.thresh
                        node.label   = -1

                        # Prepare for the left and right child nodes
                        # by increasing tree data memory capacity if
                        # necessary.
                        if self.n_nodes + 2 > self.mem_capacity:
                            # Expand memory capacity geometrically. See "geometric growth" 
                            # part of WhozCraig's SO answer at: 
                            #     https://stackoverflow.com/a/51665863/8628758.
                            # Add one after squaring so that the new capacity can
                            # contain not only a tree of greater depth, but also
                            # the maximum # nodes that that depth could have.
                            new_capacity = 2*self.mem_capacity + 1
                            self._increase_mem_capacity(new_capacity)
                            self.mem_capacity = new_capacity
                        
                        # Push right child info onto the LIFO stack.
                        node_stack.push({"start": split_pos, "end": end, "node_id": r_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})
                        # Push left child info onto queue.
                        node_stack.push({"start": start, "end": split_pos, "node_id": l_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})

                        # And update size of the tree.
                        self.n_nodes += 2
                        
        free(l_wcc)
        free(r_wcc)
        free(node_wcc)
        free(max_wt_classes)
    
    def fit(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X, np.ndarray[SIZE_t, ndim=1, mode="c"] y,
            np.ndarray[SIZE_t, ndim=2, mode="fortran"] split_point_idxs, 
            np.ndarray[DTYPE_t, ndim=2, mode="fortran"] unique_vals_feats,
            np.ndarray[SIZE_t, ndim=1, mode="c"] n_unique_vals_feats,
            np.ndarray[SIZE_t, ndim=1, mode="c"] rows, np.ndarray[SIZE_t, ndim=1, mode="c"] features,
            np.ndarray[DTYPE_t, ndim=1, mode="c"] class_weights, SIZE_t n_class): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X       (2D Fortran-contiguous array of float64): Pre-processed training data.
            y                 (1D C-contiguous array of int): Training labels.
            split_point_idxs                (ndarray of int): All numerical feature split-point locations for 
                                                              all rows. Shape: (n training samples, n features).
            unique_vals_feats           (ndarray of float64): Columns contain sorted unique values for all features.
                                                              Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats                (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows              (1D C-contiguous array of int): Indices of the rows to be used for training. 
            feats             (1D C-contiguous array of int): Column indices of training features.
            class_weights (1D C-contiguous array of float64): Desired weight for each class. Shape: (`n_class`,).
            n_class                                         : Number of classes in training data.   
        """
        # Casting the raw data to pointers gives a 17% speed-up compared to getting
        # pointer from the ndarray's buffer interface, as recommended by DavidW in 
        # his SO answer at: https://stackoverflow.com/a/54832269/8628758. e.g.
        #     cdef DTYPE_t[::1,:] X_buffer = X
        #     cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        # Not worried about unexpected behavior as all ndarrays' contiguousness and
        # memory layout enforced prior to this point.
        cdef DTYPE_t* X_ptr = <DTYPE_t*> X.data
        cdef SIZE_t* y_ptr = <SIZE_t*> y.data
        cdef SIZE_t* split_point_idxs_ptr = <SIZE_t*> split_point_idxs.data
        cdef DTYPE_t* unique_vals_feats_ptr = <DTYPE_t*> unique_vals_feats.data
        self.n_unique_vals_feats = <SIZE_t*> n_unique_vals_feats.data
        self.rows = <SIZE_t*> rows.data
        self.features = <SIZE_t*> features.data
        self.class_weights = <DTYPE_t*> class_weights.data
        self.n_class = n_class
        self.n_samples = rows.shape[0]
        self.n_features = features.shape[0]
        cdef random_device rd # Needed when using the C++ mt19937 rng w/out a seed.
        
        # Get the max cardinality of all numerical feats.
        self.max_n_unique_feat_vals = n_unique_vals_feats.max()
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        cdef SIZE_t init_capacity = 15
        
        cdef SIZE_t i, row, label
        cdef DTYPE_t wt
        cdef DTYPE_t sum_wts = 0
        cdef Node* root_node = NULL
        with nogil:
            # Allocate memory for the tree.
            self._increase_mem_capacity(init_capacity)
            self.mem_capacity = init_capacity
 
            # And sum the class weights of all the root node's samples in
            # order to know minimum total weight a leaf must have (which
            # we must know when regularizing by min_weight_fraction_leaf.)
            for i in range(self.n_samples):
                row = self.rows[i]
                label = y_ptr[row]
                wt = self.class_weights[label]
                sum_wts += wt
            self.min_weight_leaf = self.min_weight_fraction_leaf*sum_wts
            
            # Initialize the random number generator. Followed example from:
            #     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/tests/run/cpp_stl_random.pyx#L16
            if self.seed == -1:
                self.rng = mt19937(rd()) # If using the random device engine std::random_device.
            else:
                self.rng = mt19937(self.seed)

        # Initiate tree building.
        self._grow_tree(X_ptr, y_ptr, split_point_idxs_ptr, unique_vals_feats_ptr)
    
    cdef Node* _next_node(self, SIZE_t nxt) nogil: 
        return &self.nodes[nxt]
    
    cdef SIZE_t _get_leaf_idx(self, SIZE_t i, Node* leaf, SIZE_t n, DTYPE_t* X) nogil:
        cdef SIZE_t idx
        cdef SIZE_t root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[leaf.feat*n + i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else: 
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
    
    def predict(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_preds = X.shape[0]
        cdef SIZE_t i
        preds = np.empty(n_preds, dtype=np.intp)
        cdef SIZE_t[::1] preds_view = preds
        cdef Node leaf
        with nogil:
            for i in range(n_preds): 
                preds_view[i] = self.nodes[self._get_leaf_idx(i, &leaf, n_preds, X_ptr)].label
        return preds
    
    def predict_probs(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions. Shape: (`X.size`, `self.n_class`)
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        cdef DTYPE_t[::1] wcc_view = wcc
        cdef Node leaf
        cdef SIZE_t i, j, idx
        with nogil:
            for i in range(n_probs):
                idx = self._get_leaf_idx(i, &leaf, n_probs, X_ptr)
                for j in range(self.n_class):
                    wcc_view[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

class DecisionTreeLargeQCython():
    """Fit a decision tree using a depth-first algorithm.
    
    Uses Marvin Wright's LargeQ numerical splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L316
    """
    
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights = [], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit the tree.
            seed                         (int): Use when reproducibility is desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C') 
        if seed is None:
            self.seed = -1
        else:
            self.seed = seed
        self._tree = _DecisionTree(self.m, self.min_samples_leaf, self.min_weight_fraction_leaf, self.seed)
        
    @property
    def size(self): return self._tree.size
    
    @property
    def left_children(self): return self._tree.left_children
    
    @property
    def right_children(self): return self._tree.right_children
            
    @property 
    def split_features(self): return self._tree.split_features

    @property 
    def split_thresholds(self): return self._tree.split_thresholds
    
    @property
    def weighted_class_counts(self): return self._tree.weighted_cc
    
    @property
    def labels(self): return self._tree.labels
    
    def fit(self, X, y, split_point_idxs, unique_feat_vals, n_unique_vals_feats, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X       (2D Fortran-contiguous array of float64): Pre-processed training data.
            y                 (1D C-contiguous array of int): Training labels.
            split_point_idxs                (ndarray of int): All numerical feature split-point locations for 
                                                              all rows. Shape: (n training samples, n features).
            unique_vals_feats           (ndarray of float64): Columns contain sorted unique values for all features.
                                                              Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats                (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows              (1D C-contiguous array of int): Indices of the rows to be used for training. 
                                                              All rows used if empty.
            feats             (1D C-contiguous array of int): Column indices of training features.
                                                              All rows used if empty.
                                                              
        Returns:
            DecisionTreeLargeQCython: A decision tree object.
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        self.n_class = np.unique(y).size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.
            
        self._tree.fit(X, y, split_point_idxs, unique_feat_vals, n_unique_vals_feats, 
                       self.rows, self.features, self.class_weights, self.n_class)
        return self

    def predict(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict(X)
    
    def predict_probs(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict_probs(X)

## Cython Wright LargeQ Tree's Speed on the Titanic Data

In [57]:
m = 4
dt = DecisionTreeLargeQCython(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [58]:
dt.size # Number of nodes in the decision tree.

393

In [59]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.8202247191011236

In [60]:
%timeit dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

315 µs ± 9.87 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [61]:
%timeit dt.predict(xVal_proc)

8.11 µs ± 715 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Although my Cython version of Wright's SmallQ splitting was well over twice as slow as my Cython version of Louppe's splitting technique, Wright's LargeQ Cython implementation is dangerousy close to being twice as fast as Louppe's!

## Wright SmallQ/LargeQ Decision Tree Python Version
Now, let's do things just as Wright intended for the Ranger library and see what happens when we fit a decision tree using the "small Q" splitter on nodes whose ratio, `q`, of <# samples in the node> to <current feature's # unique values in the dataset> is [less than 0.02](https://github.com/imbs-hl/ranger/blob/ce497711884c783e133fb36750b60de4c140773f/src/globals.h#L106), and using the "large Q" splitter for all other nodes.

In [62]:
Q_THRESHOLD = 0.02

class DecisionTreeSmallLargeQ():
    """Fit a decision tree classifier using a depth-first tree 
    growth algorithm. 
    
    Uses Marvin Wright's SmallQ and LargeQ numerical splitting algorithms:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L173
    """
        
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights=[], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try 
                                                to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit 
                                                the tree.
            class_weights (ndarray of float64): Sample weight to be used for each class. Shape: (`n_class`,).
            seed                         (int): Use when reproducibility desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C')
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = np.empty(0, dtype=Node, order='C')
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = np.empty(0, dtype=np.float64, order='C')
        
    @property
    def size(self): return self.n_nodes
    
    @property 
    def left_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].l_child
        return out

    @property 
    def right_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].r_child
        return out

    @property 
    def split_features(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].feat
        return out

    @property 
    def split_thresholds(self): 
        out = np.empty(self.n_nodes, dtype='float64')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].thresh
        return out

    @property 
    def weighted_cc(self):
        out_size = self.n_nodes*self.n_class
        out = np.empty(out_size, dtype='float64')
        for i in range(out_size):
            out[i] = self.weighted_class_counts[i]
        out.resize(self.n_nodes, self.n_class)
        return out

    @property 
    def labels(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].label
        return out
    
    def _increase_mem_capacity(self, new_capacity):
        """Resize ndarrays that hold tree's nodes and weighted class counts.
        
        Arguments:
            new_capacity (int): Amount of nodes that resized arrays will be able to hold.
        """
        self.nodes.resize(new_capacity, refcheck=False)
        self.weighted_class_counts.resize(new_capacity*self.n_class, refcheck=False)
    
    def _make_leaf(self, node_id, wcc, n_classes_node):
        """Set and store the class label of a leaf node.
        
        Break ties at random when multiple classes share the same max weight.
        Doing this avoids a bias towards lower classes that would be a possible
        consequence of using np.argmax (which is what Sklearn does).
        
        Arguments:
            node_id            (int): Location of node in `self.nodes`.
            wcc (ndarray of float64): Node's weighted class counts. Shape: (`self.n_class`,).
            n_classes_node     (int): Number of unique class labels found among
                                      node's training samples.
        """
        if n_classes_node == 1: 
            label = max(enumerate(wcc), key=lambda f: f[1])[0]
        else:              
            label = self._rng.choice(np.argwhere(wcc==np.max(wcc)).flatten())
        self.nodes[node_id] = Node(-1, -1, -1, np.nan, label) 
        
    def _grow_tree(self, X, y, split_point_idxs, unique_feat_vals):
        """Depth-first growth of a decision tree.
        
        Arguments:
            X                     (ndarray of float64): Training samples. Shape: (n samples, n features).
            y                         (ndarray of int): Training labels. Shape: (n samples,).
            split_point_idxs          (ndarray of int): All numerical feature split-point locations for all rows.
                                                        Shape: (n training samples, n features).
            unique_vals_feats     (ndarray of float64): Columns contain sorted unique values for all features.
                                                        Shape: (max cardinality of all feats, n features).
        """
        # LIFO stack holding all nodes still to be investigated.
        node_stack = []
        
        # Stores the weighted class counts of the current node.
        node_wcc = np.empty(self.n_class, dtype=np.float64)
        
        ##############################################################
        # For finding the best split.
        ##############################################################
        l_wcc = np.empty(self.n_class, dtype=np.float64)
        r_wcc = np.empty(self.n_class, dtype=np.float64)
        
        # For SmallQ Splitting, when we sort just the unique values for a feature
        # that are found inside a single node.
        items = np.empty(self.n_samples, dtype=np.float64)
        
        # 1-d arrays containing raw and weighted sample counts, as
        # well as weighted class counts for each unique raw feature value:
        #
        # Raw, non-weighted, sample counts at each split-point (used by SmallQ and LargeQ).
        split_counts_raw = np.empty(self.max_n_unique_feat_vals, dtype=np.intp) 
        # Weighted sample counts at each split-point (just used by LargeQ).
        split_counts_wt = np.empty(self.max_n_unique_feat_vals, dtype=np.float64)
        # Weighted class counts for each split-point (used by SmallQ and LargeQ).
        split_class_counts_wt = np.empty(self.n_class*self.max_n_unique_feat_vals, dtype=np.float64)  

        # Keeping track of nodes' constant features. 
        features = self.features.copy()
        constant_features = np.empty(self.n_features, dtype=np.intp)
        
        # Push root node onto the LIFO stack.
        node_stack.append(StackEntry(0, self.n_samples, 0, 0, 0))
        self.n_nodes = 1
        
        while len(node_stack) > 0:
            node_info = node_stack.pop()
            start, end = node_info.start, node_info.end
            node_id, parent_id = node_info.node_id, node_info.parent_id
            n_consts = node_info.n_const_feats
            n_samples_node = end-start
            
            # Tabulate and store the current node's weighted class counts.
            node_wcc[:] = 0.
            for i in range(n_samples_node):
                row = self.rows[start + i]
                label = y[row]
                wt = self.class_weights[label]
                node_wcc[label] += wt 
            self.weighted_class_counts[node_id*self.n_class: (node_id + 1)* self.n_class] = node_wcc
            
            # Make a leaf if required to do so.
            n_classes_node, sum_node_wcc, sum_node_wcc_sqr = 0, 0., 0.
            for c in range(self.n_class):
                wcc = node_wcc[c]
                if wcc > 0: n_classes_node += 1
                # Compute the current node's proxy gini numerator and denominator while we're at it.
                sum_node_wcc_sqr += wcc**2 
                sum_node_wcc += wcc 
            if n_classes_node == 1:                      
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif n_samples_node < 2*self.min_samples_leaf:  
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif sum_node_wcc < 2.*self.min_weight_leaf: 
                self._make_leaf(node_id, node_wcc, n_classes_node)
            
            # Or perform a split.
            else:
                # Initialize stats for best split of node.
                best_split = Split(-1, 0., -np.inf)
                
                # Ensure feats drawn w/out replacement.
                n_drawn_feats = 0
                n_new_consts = 0
                n_total_consts = n_consts
                lb = 0                      # Range in `features` array from which we 
                ub = self.n_features - 1    # randomly select a feature's column index. 
               
                while n_drawn_feats < self.m:
                    n_drawn_feats += 1
                    idx = self._rng.choice(range(lb, ub-n_new_consts+1))
                    
                    # So that we don't draw a known constant feature again this split-search.
                    if idx < n_consts:
                        features[idx], features[lb] = features[lb], features[idx]
                        lb += 1 
                        continue
                        
                    # So that no new const feats get drawn more than once per split-search.
                    idx += n_new_consts
                    
                    feat_idx = features[idx]
                  
                    # Num split points found among training samples for given feat.
                    n_unique_vals_feat = self.n_unique_vals_feats[feat_idx]
                    
                    q = n_samples_node/n_unique_vals_feat
                    
                    if q < Q_THRESHOLD:
                        self.num_small_Q += 1
                        items[:n_samples_node] = X[:,feat_idx][self.rows[start:end]]
                        
                        # Make sure the feature not constant for node's samples.
                        node_unique_vals_feat = np.unique(items[:n_samples_node])
                        node_n_unique_vals_feat = len(node_unique_vals_feat)
                        if node_n_unique_vals_feat < 2:
                            # Move the newly-discovered constant feat to the far right-end
                            # of the left half of `features` list holding the known const
                            # feats as well as any other const feats newly discovered 
                            # during this node's split-search.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # Initialize weighted class counts of right and left children.
                            # Right child's counts are initially the same as parent node's.
                            r_wcc[:] = node_wcc
                            l_wcc[:] = 0.

                            # If the feature has an impurity score that's better than the best score 
                            # found among all other features visited thus far for this node, find_num_split()
                            # updates the attributes of the struct containing the node's best split info. 
                            # 
                            # But even if a new best score isn't reached, if an impurity score can
                            # be calculated at least once during the feature's split search, the
                            # following indicator will be toggled off, to indicate that the feature
                            # is not constant (1 = is constant; 0 = not constant).
                            current_feat_const = find_num_split_smallQ(X, self.rows, node_unique_vals_feat, y, start, n_samples_node, 
                                                                       self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                                                       self.class_weights, l_wcc, r_wcc, node_wcc, best_split, feat_idx, 
                                                                       sum_node_wcc_sqr, sum_node_wcc, node_n_unique_vals_feat, split_counts_raw, 
                                                                       split_class_counts_wt)
                    else:
                        self.num_large_Q += 1
                        r_wcc[:] = node_wcc
                        l_wcc[:] = 0.
                        current_feat_const = find_num_split_largeQ(self.rows, split_point_idxs, unique_vals_feats, y, start, n_samples_node, 
                                                           self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                                           self.class_weights, l_wcc, r_wcc, node_wcc, best_split, feat_idx, 
                                                           sum_node_wcc_sqr, sum_node_wcc, n_unique_vals_feat, split_counts_raw, 
                                                           split_counts_wt, split_class_counts_wt)

                    if current_feat_const:
                        # The feature may be constant within the search range permitted
                        # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                        # the feature is a newly discovered constant.
                        features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                        n_new_consts += 1
                        n_total_consts += 1
                        continue
                    else:
                        # The feature is non-constant, so we ensure it's not drawn again
                        # during this split-search.
                        features[idx], features[ub] = features[ub], features[idx]
                        ub -= 1 
                            
                # To ensure that the constant features info is accurate for sibling or child nodes.
                features[0:n_consts] = constant_features[0:n_consts]
                constant_features[n_consts:n_consts+n_new_consts] = features[n_consts:n_consts+n_new_consts]
                
                # Make node a leaf if constant for all randomly drawn feats.
                # (# drawn known constant feats + # drawn new constant feats)
                if lb + n_new_consts == n_drawn_feats: 
                    self._make_leaf(node_id, node_wcc, n_classes_node)
                else: 
                    split_pos = make_num_split(self.rows, X, node_info, best_split) 

                    # Update info for node that's getting split.
                    l_child_id = self.n_nodes
                    r_child_id = l_child_id + 1
                    self.nodes[node_id] = Node(l_child_id, r_child_id, best_split.feat, best_split.thresh, -1)

                    # Prepare for the left and right child nodes
                    # by increasing tree data memory capacity if
                    # necessary.
                    if self.n_nodes + 2 > self.mem_capacity:
                        # Expand memory capacity geometrically. See "geometric growth" 
                        # part of WhozCraig's SO answer at: 
                        #     https://stackoverflow.com/a/51665863/8628758.
                        # Add one after squaring so that the new capacity can
                        # contain not only a tree of greater depth, but also
                        # the maximum # nodes that that depth could have.
                        new_capacity = 2*self.mem_capacity + 1
                        self._increase_mem_capacity(new_capacity)
                        self.mem_capacity = new_capacity
                    
                    # Push right child info onto the LIFO stack.
                    node_stack.append(StackEntry(split_pos, end, r_child_id, node_id, n_total_consts))
                    # Push left child info onto queue.
                    node_stack.append(StackEntry(start, split_pos, l_child_id, node_id, n_total_consts))

                    # And update size of the tree.
                    self.n_nodes += 2
    
    def fit(self, X, y, split_point_idxs, unique_feat_vals, n_unique_vals_feats, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X    (Fortan-style ndarray of float64): Pre-processed training data.
            y                     (ndarray of int): Training labels.
            split_point_idxs      (ndarray of int): All numerical feature split-point locations for all rows.
                                                    Shape: (n training samples, n features).
            unique_vals_feats (ndarray of float64): Columns contain sorted unique values for all features.
                                                    Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats      (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows                            (list): Indices of the rows to be used for training. 
                                                    All rows used if empty.
            features                        (list): Column indices of training features that will be used.
                                                    All features used if empty.    
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        # Determine # classes found among all training samples.
        root_cc = np.unique(y, return_counts=True)[1] 
        self.n_class = root_cc.size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.

        self.n_samples = len(self.rows)
        self.n_features = len(self.features)
        
        # Store the num unique vals for each numerical feat and
        # find the maximum cardinality of all features.
        self.n_unique_vals_feats = n_unique_vals_feats
        self.max_n_unique_feat_vals = self.n_unique_vals_feats.max()
        
        # To track how often the "small Q" and "large Q" splitters are used.
        self.num_small_Q = 0
        self.num_large_Q = 0
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        init_capacity = 15
        
         # Allocate tree memory.
        self._increase_mem_capacity(init_capacity)
        self.mem_capacity = init_capacity
        
        # And sum the class weights of all the root node's samples in
        # order to know minimum total weight a leaf must have (which
        # we must know when regularizing by min_weight_fraction_leaf.)
        root_wcc = root_cc*self.class_weights
        self.min_weight_leaf = self.min_weight_fraction_leaf*root_wcc.sum()
        
        # Initialize the random number generator.
        self._rng = get_random_generator(self.seed)
        
        # Initiate tree building.
        self._grow_tree(X, y, split_point_idxs, unique_feat_vals)
        return self
        
    def _next_node(self, nxt): return self.nodes[nxt]
       
    def _get_leaf_idx(self, i, X):
        root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[:,leaf.feat][i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else:
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
        
    def predict(self, X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        n_preds = X.shape[0]
        preds = np.empty(n_preds, dtype=np.intp)
        for i in range(n_preds):
            preds[i] = self.nodes[self._get_leaf_idx(i, X)].label
        return preds
    
    def predict_probs(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        for i in range(n_probs):
            idx = self._get_leaf_idx(i, X)
            for j in range(self.n_class):
                wcc[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

## Python Wright SmallQ/Large Q Tree's Speed on the Titanic Data

In [63]:
m = 4
dt = DecisionTreeSmallLargeQ(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [64]:
dt.size # Number of nodes in the decision tree.

371

In [65]:
dt.num_small_Q # Number of times the "small Q" splitter called during tree fitting.

24

In [66]:
dt.num_large_Q # Number of times the "large Q" splitter called during tree fitting.

527

In [67]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.752808988764045

In [68]:
%timeit dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

109 ms ± 6.77 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [69]:
%timeit dt.predict(xVal_proc)

503 µs ± 6.99 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Wright SmallQ/LargeQ Decision Tree Cython Version

In [70]:
%%cython
# cython: wraparound=False, boundscheck=False, cdivision=True, initializedcheck=False
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11

import numpy as np
cimport numpy as np
np.import_array()
ctypedef np.float64_t DTYPE_t
ctypedef np.intp_t SIZE_t # Signed, same as ssize_t in C. See MSeifert's SO answer: https://stackoverflow.com/a/46416257/8628758
cimport cython
from libc.math cimport log as ln
from libc.stdlib cimport realloc, free
from libc.string cimport memcpy
from libc.string cimport memset
from libcpp.stack cimport stack

# For C++ random number generation.
from libc.stdint cimport uint_fast32_t 

# Swap helper func for sorting.
cdef inline void swap(DTYPE_t* items, SIZE_t i, SIZE_t j) nogil:
    items[i], items[j] = items[j], items[i]

# Quicksort helpers

cdef inline void med_three(DTYPE_t* items, SIZE_t first, SIZE_t last) nogil:
    """Find the median-of-three pivot point of the second through final 
    items of a list of numbers. Once identified, the pivot is moved to 
    the front of the list. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L78
    
    Arguments:
        items      : The numbers to be sorted.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t middle = <int>(first + (last - first)/2)
    cdef SIZE_t second = first + 1
    last -= 1
    if items[second] < items[middle]:
        if items[middle] < items[last]:
            swap(items, first, middle)    
        elif items[second] < items[last]:
            swap(items, first, last)         
        else:                        
            swap(items, first, second)
    elif items[second] < items[last]:
        swap(items, first, second)
    elif items[middle] < items[last]:
        swap(items, first, last)
    else:
        swap(items, first, middle)

cdef inline SIZE_t partition(DTYPE_t* items, SIZE_t first, SIZE_t last, SIZE_t pivot) nogil:
    """Group numbers less than the pivot value together on the left and
    those that are greater on the right. Find the index that separates
    these two groups, which will belong to the first item that is greater
    than or equal to the pivot. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L1885
    
    Arguments:
        items      : The numbers to be sorted.
        first, last: The range of items to be sorted. 
        pivot      : Index holding the median pivot value.
        
    Returns:
        Index of cut point used to partition the items into two smaller sequences.
    """
    while True:
        while first < last and items[first] < items[pivot]:
            first += 1                      # Get index of first item greater than or equal to median-of-three pivot. 
        last -= 1
        while items[pivot] < items[last]:
            last -= 1                       # Get index of last item less than or equal to the pivot.
        if not (first < last): 
            return first                    # After swaps are done, return index of first item in right partition.
        
        swap(items, first, last)            # Swap the first item greater than or equal to the pivot with the
                                            # last item less than or equal to the pivot. 
        first += 1

cdef inline void insertion_sort(DTYPE_t* items, SIZE_t first, SIZE_t last) nogil:
    """Follows the spirit of the Numpy implementation at: 
        https://github.com/numpy/numpy/blob/5ffb84c3057a187b01acdeaa628137193df12098/numpy/core/src/npysort/quicksort.cpp#L211
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t i
    cdef SIZE_t j
    cdef SIZE_t k
    cdef DTYPE_t val
    for i in range(first+1, last):
        j = i
        k = i - 1
        val = items[i]
        while (j > first) and val < items[k]:
            items[j] = items[k]
            j-=1
            k-=1
        items[j] = val

# Heapsort

cdef inline void sift_down(DTYPE_t* items, SIZE_t start, int n, SIZE_t p, 
                           SIZE_t c, DTYPE_t val) nogil:
    """Swap a heap item with one of its children if that child's value is 
    greater than or equal to that parent's value. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L61
    
    Arguments:
        items: 1-d array containing numbers.
        start: Index of the first number.
        n    : Quantity of numbers.
        p    : Index of the parent.
        c    : Index of the parent's first (left) child.
        val  : The parent's value.
    """
    while c < n:    # Look at the descendents of current parent, `p`.
        if c < n-1 and items[start + c] < items[start + c + 1]: # Find larger of the first and second children.
            c += 1
        if val < items[start + c]: # If child greater than parent, swap child and parent.
            items[start + p] = items[start + c]
            p = c   # Current greater child becomes the parent.
            c += c  # Look at this child's child, if it exists.
        else:
            break 
    items[start + p] = val

cdef inline void sort_heap(DTYPE_t* items, SIZE_t start, int n) nogil:
    """Sort a binary max heap of numbers. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L77
    
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    cdef DTYPE_t val
    while n > 0:
        n -= 1
        val = items[start + n]
        items[start + n] = items[start]
        sift_down(items, start, n, 0, 1, val)

cdef inline void heapify(DTYPE_t* items, SIZE_t start, int n) nogil:
    """Turn a list of items into a binary max heap. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L59
    
    Arguments:
        items: 1-d array containing numbers.
        start: Index of the first number.
        n    : Quantity of numbers.
    """
    cdef DTYPE_t val
    cdef SIZE_t p
    cdef SIZE_t last_p = (n-2)//2
    for p in range(last_p, -1, -1):
        val = items[start + p] # value of last parent
        sift_down(items, start, n, p, 2*p + 1, val)

cdef inline void heapsort(DTYPE_t* items, SIZE_t start, int n) nogil:
    """Applies the heapsort algorithm to sort a list of items from least to greatest. 
    From Williams, 1964.
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    heapify(items, start, n)
    sort_heap(items, start, n)
    
# Introsort 

cdef void introsort_loop(DTYPE_t* items, SIZE_t first, SIZE_t last, int depth) nogil:
    """The recursive heart of the introsort algorithm.
    
    Arguments:
        items      : The numbers to be sorted.
        first, last: The range of items to be sorted. 
        depth      : Current recursion depth.
    """
    cdef int MIN_SIZE_THRESH = 16
    cdef SIZE_t cut
    while last-first > MIN_SIZE_THRESH:
        if depth == 0:
            heapsort(items, first, last-first)
        depth -= 1
        med_three(items, first, last)
        cut = partition(items, first+1, last, first)
        introsort_loop(items, cut, last, depth)
        last = cut

# Log base-2 helper function. From Sklearn's implementation at:
#     https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_utils.pyx#L7
cdef inline DTYPE_t log2(DTYPE_t x) nogil:
    return ln(x) / ln(2.0)

cdef void introsort(DTYPE_t* items, SIZE_t first, SIZE_t last) nogil:
    """Implementation as described in Musser, 1997. Switches to heapsort
    when max recursion depth exceeded. Otherwise uses median-of-three 
    quicksort (Bentley & McIlroy, 1993) with all the usual optimizations:
        - Swap equal elements.
        - Only process partitions longer than the minimum size threshold.
        - When a new partition is made, recurse on the smaller half and 
          iterate over the larger half.
        - Make a final pass with insertion sort over the entire list.

    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef int max_depth = 2 * <int>log2(last-first)
    introsort_loop(items, first, last, max_depth)
    insertion_sort(items, first, last)
    
cdef SIZE_t sort_unique(DTYPE_t* items, SIZE_t first, SIZE_t last) nogil:
    """Sort a 1-d array of numbers in-place using introsort and 
    place the unique values in consecutive ascending order at 
    the beginning of the array.
    
    Arguments:
        items      : The numbers to be sorted.
        first, last: The range of items to be sorted. 
        
    Returns: 
        Number of unique items.
    """
    cdef SIZE_t i = 1
    cdef SIZE_t j = 1
    introsort(items, first, last)
    while i < last-first:
        if items[i] == items[i-1]:
            i += 1
        else:
            if i - j < 1:
                j += 1
                i += 1
            else:
                items[j] = items[i]
                j += 1
                i += 1
    return j
    
# For convenient memory reallocation.
ctypedef fused realloc_t:
    SIZE_t
    DTYPE_t
    Node

cdef inline realloc_t* safe_realloc(realloc_t* ptr, SIZE_t n_items) nogil except *:
    # Inspired by Sklearn's safe_realloc() func. However, thankfully
    # Cython now no longer requires us to send a pointer to a pointer
    # in order to prevent crashes.
    cdef realloc_t elem = ptr[0]
    cdef SIZE_t n_bytes = n_items * sizeof(elem)
    # Make sure we're not trying to allocate too much memory.
    if n_bytes/sizeof(elem) != n_items:
        with gil:
            raise MemoryError(f"Overflow error: unable to allocate {n_bytes} bytes.")       
    cdef realloc_t* res_ptr = <realloc_t *> realloc(ptr, n_bytes)
    with gil:
        if not res_ptr: raise MemoryError()
    return res_ptr

# C++ random number generator. Not yet a part of a Cython release so
# pasted in from: 
#     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/Cython/Includes/libcpp/random.pxd#L1
cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass random_device:
        ctypedef uint_fast32_t result_type
        random_device() except +
        result_type operator()() except +

    cdef cppclass mt19937:
        ctypedef uint_fast32_t result_type
        mt19937() except +
        mt19937(result_type seed) except +
        result_type operator()() except +
        result_type min() except +
        result_type max() except +
        void discard(size_t z) except +
        void seed(result_type seed) except +

    cdef cppclass uniform_int_distribution[T]:
        ctypedef T result_type
        uniform_int_distribution() except +
        uniform_int_distribution(T, T) except +
        result_type operator()[Generator](Generator&) except +
        result_type min() except +
        result_type max() except +
        
# Info for any node that will eventually be split or made into a leaf.
# Similar to what Sklearn does at:
#     https://github.com/scikit-learn/scikit-learn/blob/a2c4d8b1f4471f52a4fcf1026f495e637a472568/sklearn/tree/_tree.pyx#L126
cdef struct StackEntry:
    SIZE_t start
    SIZE_t end
    SIZE_t node_id
    SIZE_t parent_id
    SIZE_t n_const_feats

# To compare node splits.
cdef struct Split:
    SIZE_t feat
    DTYPE_t thresh
    DTYPE_t score  

# Vital characteristics of a node. Set when it's added to the tree.
cdef struct Node:
    SIZE_t l_child # idx of left child, -1 if leaf
    SIZE_t r_child # idx of right child, -1 if leaf
    SIZE_t feat    # col idx of split feature, -1 if leaf
    DTYPE_t thresh # double split threshold, NAN if leaf
    SIZE_t label   # class label if leaf, -1 if non-leaf.
    
cdef inline SIZE_t find_first(DTYPE_t* items, DTYPE_t value, SIZE_t first, SIZE_t last) nogil:
    """Find first occurrence of an element in a vector of sorted 
       (ascending order) elements.
       
    Uses same algorithm as Python's bisect_left() function:
        https://github.com/python/cpython/blob/8fd2d36c1c6da78b2402fcb8bcefdad8428c8bc3/Lib/bisect.py#L68
        
    Arguments:
        items      : The pre-sorted elements to be searched over.
        value      : The value to search for.
        first, last: The range of items to be searched.
        
    Returns:
        Index of the first element in `items` that equals `value`.
        
        If no such element exists in `items` the returned index
        merely indicates where the element would reside where it
        present in the sorted vector.
    """
    cdef SIZE_t mid
    while first < last:
        mid = (first + last) // 2
        if items[mid] < value:
            first = mid + 1
        else:
            last = mid
    return first

cdef inline void find_num_split_smallQ(DTYPE_t* X, SIZE_t* rows, DTYPE_t* node_unique_vals_feat, 
                                       SIZE_t* labels, SIZE_t node_start, SIZE_t n_parent, 
                                       SIZE_t n_samples, SIZE_t n_class, SIZE_t min_samples_leaf, 
                                       DTYPE_t min_weight_leaf, DTYPE_t* c_wts, DTYPE_t* l_wcc, 
                                       DTYPE_t* r_wcc, DTYPE_t* parent_wcc, Split* best_split, 
                                       SIZE_t current_feat, DTYPE_t parent_num, DTYPE_t parent_den, 
                                       SIZE_t node_n_unique_vals_feat, SIZE_t* split_counts_raw, 
                                       DTYPE_t* split_class_counts_wt, bint* current_feat_const) nogil:
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Marvin Wright's SmallQ splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L233
    
    Saves a split's feature idx, threshold, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        X                      : Training data. Shape: (n train samples, n features).
        rows                   : Indices of all rows in the training set. Shape: (n train samples,).
        node_unique_vals_feat  : The sorted unique feature values of the samples in the parent
                                 node (beginning index 0). Shape: (n train samples,).
        labels                 : All training labels. Shape: (n training samples,).
        node_start             : Index of the beginning of the parent node in `rows`.
        n_parent               : Number of samples in the parent node.
        n_samples              : Number of samples in the training data.
        n_class                : Number of unique classes in the training set.
        min_samples_leaf       : Any leaf will have no fewer than this many samples.
        min_weight_leaf        : Total weight of any leaf's samples will be at least this much.
        c_wts                  : Class weights. Shape: (`n_class`,).
        l_wcc                  : Left child's weight class counts. Shape: (`n_class`,).
        r_wcc                  : Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc             : Parent node's weight class counts. Shape: (`n_class`,).
        best_split             : Holds the feature, threshold and impurity
                                 score of the parent node's current best split.
        current_feat           : Column index of feature under investigation.
        parent_num             : Numerator of parent node's impurity score.
        parent_den             : Denominator of parent node's impurity score.
        node_n_unique_vals_feat: Number of unique values for one feature found among 
                                 the node's samples.
        split_counts_raw       : Stores sample counts found at each unique split point of
                                 a given feature in a given node. 
                                 Shape: (<max cardinality of all numerical feats in dataset>,).
        split_class_counts_wt  : Stores weighted class counts of each class at each unique
                                 split point of a given feature in a given node. Shape:
                                 (<max cardinality of all numerical feats in dataset> x `n_class`,)
        current_feat_const     : Whether current splitting feature is constant for all eligible split 
                                 thresholds in the current node. 1 if yes, 0 otherwise.
    """
    # Variables used while tabulating sample and weighted
    # class counts at all unique split points.
    cdef SIZE_t row, label, split_point_idx
    cdef DTYPE_t value
    
    # Variables to track progress during the split search.
    cdef SIZE_t n_left, n_right, i, j
    
    # Variables used to calculate proxy gini scores.
    cdef DTYPE_t l_num, l_den, r_num, r_den, wt, score, mid
    
    # Tabulate both the sample counts at all possible split points
    # as well as weighted class counts at each split point.
    memset(&split_counts_raw[0], 0, sizeof(SIZE_t)*node_n_unique_vals_feat)
    memset(&split_class_counts_wt[0], 0, sizeof(DTYPE_t)*node_n_unique_vals_feat*n_class)
    for i in range(n_parent):
        row = rows[node_start + i]
        value = X[current_feat*n_samples + row]
        label = labels[row]
        split_point_idx = find_first(node_unique_vals_feat, value, 0, node_n_unique_vals_feat)
        split_counts_raw[split_point_idx] += 1
        split_class_counts_wt[split_point_idx*n_class + label] += c_wts[label] 
    
    # To keep track of num amples in left child.
    n_left = 0    
    # Left child's proxy gini score denominator.
    l_den = 0.
    
    # Search for the threshold of the best split.
    for i in range(node_n_unique_vals_feat - 1):
        n_left += split_counts_raw[i]
        n_right = n_parent - n_left

        l_num, r_num = 0., 0. # To calculate numerators of proxy gini scores.
        for j in range(n_class):
            # Can't do the on-line proxy gini update algorithm cause we
            # move all samples from a given class over to the left side 
            # before updating the calculation.
            wt = split_class_counts_wt[i*n_class + j]
            l_wcc[j] += wt
            r_wcc[j] -= wt
            l_num += l_wcc[j]*l_wcc[j]
            l_den += wt
            r_num += r_wcc[j]*r_wcc[j]
        r_den = parent_den - l_den

        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf
        if n_left < min_samples_leaf: continue
        elif n_right < min_samples_leaf: return
        elif l_den < min_weight_leaf: continue
        elif r_den < min_weight_leaf: return

        current_feat_const[0] = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Split threshold is always the mid-point between two consecutive values.
            mid = node_unique_vals_feat[i]/2. + node_unique_vals_feat[i+1]/2. 
            if mid == node_unique_vals_feat[i+1]: mid = node_unique_vals_feat[i]
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat

cdef inline void find_num_split_largeQ(SIZE_t* rows, SIZE_t* split_point_idxs, DTYPE_t* unique_vals_feats, 
                                       SIZE_t* labels, SIZE_t node_start, SIZE_t n_parent, SIZE_t n_samples,
                                       SIZE_t n_class, SIZE_t min_samples_leaf, DTYPE_t min_weight_leaf, 
                                       DTYPE_t* c_wts, DTYPE_t* l_wcc, DTYPE_t* r_wcc, DTYPE_t* parent_wcc, 
                                       Split* best_split, SIZE_t current_feat, DTYPE_t parent_num, 
                                       DTYPE_t parent_den, SIZE_t n_unique_vals_feat, SIZE_t max_n_unique_vals, 
                                       SIZE_t* split_counts_raw, DTYPE_t* split_counts_wt, 
                                       DTYPE_t* split_class_counts_wt, bint* current_feat_const) nogil:
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Marvin Wright's LargeQ splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L316
    
    Saves a split's feature idx, threshold, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        rows                   : Indices of all rows in the training set. Shape: (n train samples,).
        split_point_idxs       : All numerical feature split-point locations for all rows.
                                 Shape: (n training samples, n features).
        unique_vals_feats      : Columns contain sorted unique values for all features. 
                                 Shape: (max cardinality of all feats, n features).
                                 node (beginning index 0). Shape: (n train samples,).
        labels                 : All training labels. Shape: (n training samples,).
        node_start             : Index of the beginning of the parent node in `rows`.
        n_parent               : Number of samples in the parent node.
        n_samples              : Number of samples in the training data.
        n_class                : Number of unique classes in the training set.
        min_samples_leaf       : Any leaf will have no fewer than this many samples.
        min_weight_leaf        : Total weight of any leaf's samples will be at least this much.
        c_wts                  : Class weights. Shape: (`n_class`,).
        l_wcc                  : Left child's weight class counts. Shape: (`n_class`,).
        r_wcc                  : Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc             : Parent node's weight class counts. Shape: (`n_class`,).
        best_split             : Holds the feature, threshold and impurity
                                 score of the parent node's current best split.
        current_feat           : Column index of feature under investigation.
        parent_num             : Numerator of parent node's impurity score.
        parent_den             : Denominator of parent node's impurity score.
        n_unique_vals_feat     : Number of unique values for one feature found among 
                                 all training samples.
        max_n_unique_vals      : Maximum cardinality of all features in dataset.
        split_counts_raw       : Stores sample counts found at each split point of
                                 a given feature in a given node. 
                                 Shape: (<max cardinality of all numerical feats in dataset>,).
        split_counts_wt        : Stores weighted sample counts found at each split point of
                                 a given feature in a given node. 
                                 Shape: (<max cardinality of all numerical feats in dataset>,).
        split_class_counts_wt  : Stores weighted class counts of each class at each
                                 split point of a given feature in a given node. Shape:
                                 (<max cardinality of all numerical feats in dataset> x `n_class`,)
        current_feat_const     : Whether current splitting feature is constant for all eligible split 
                                 thresholds in the current node. 1 if yes, 0 otherwise.
    """
    # Variables used while tabulating raw and weighted sample counts, 
    # as well as weighted class counts at all unique split points.
    cdef SIZE_t row, label, split_point_idx
    
    # Make sure node's samples aren't all constant for feature.
    cdef SIZE_t n_splits_node = 0
    
    # Variables to track progress during the split search.
    cdef SIZE_t n_left, n_right, i, j, k
    
    # Variables used to calculate proxy gini scores.
    cdef DTYPE_t l_num, l_den, r_num, r_den, wt, score, mid
    
    # Tabulate sample counts, weighted counts, and weighted class counts at 
    # each split point. Values at split points not belonging to node's rows
    # will remain zero.
    memset(split_counts_raw, 0, sizeof(SIZE_t)*n_unique_vals_feat)
    memset(split_counts_wt, 0, sizeof(DTYPE_t)*n_unique_vals_feat)
    memset(split_class_counts_wt, 0, sizeof(SIZE_t)*n_unique_vals_feat*n_class)
    for i in range(n_parent):
        row = rows[node_start + i]
        label = labels[row]
        wt = c_wts[label]
        split_point_idx = split_point_idxs[n_samples*current_feat + row]
        split_counts_raw[split_point_idx] += 1
        split_counts_wt[split_point_idx] += wt
        split_class_counts_wt[split_point_idx*n_class + label] += wt
        
    # If feat is constant for the node.
    for i in range(n_unique_vals_feat):
        if split_counts_raw[i] > 0: n_splits_node += 1
    if n_splits_node < 2: return
    
    # To keep track of num samples in left child.
    n_left = 0    
    # Left child's proxy gini score denominator.
    l_den = 0.
    
    # Search for the threshold of the best split.
    for i in range(n_unique_vals_feat - 1):
        if split_counts_raw[i] == 0: continue # Move to next split-point if no samples at this one.
        
        n_left += split_counts_raw[i]
        n_right = n_parent - n_left
        if n_right == 0: return # Make sure to stop search when right child empty.
        
        # Calculate denominators of proxy gini scores.
        l_den += split_counts_wt[i]
        r_den = parent_den - l_den

        # Calculate numerators of proxy gini scores.
        l_num, r_num = 0., 0. 
        for j in range(n_class):
            # Can't do the on-line proxy gini update algorithm cause we
            # move all samples from a given class over to the left side 
            # before updating the calculation.
            wt = split_class_counts_wt[i*n_class + j]
            l_wcc[j] += wt
            r_wcc[j] -= wt
            l_num += l_wcc[j]*l_wcc[j]
            r_num += r_wcc[j]*r_wcc[j]

        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf
        if n_left < min_samples_leaf: continue
        elif n_right < min_samples_leaf: return
        elif l_den < min_weight_leaf: continue
        elif r_den < min_weight_leaf: return

        current_feat_const[0] = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Find raw feature value of sample(s) at next-closest split-point.
            k = i+1
            while split_counts_raw[k] == 0: k+=1
            # Split threshold is always the mid-point between two consecutive values.
            mid = (unique_vals_feats[current_feat*max_n_unique_vals + i]/2. + 
                   unique_vals_feats[current_feat*max_n_unique_vals + k]/2.) 
            if mid == unique_vals_feats[current_feat*max_n_unique_vals + k]: 
                mid = unique_vals_feats[current_feat*max_n_unique_vals + i]
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat

cdef inline SIZE_t make_num_split(SIZE_t* rows, DTYPE_t* X, StackEntry* node_info, Split* best_split, 
                                SIZE_t n_samples) nogil:
    cdef SIZE_t p, p_end
    p, p_end = node_info.start, node_info.end
    while p < p_end:
        if X[best_split.feat*n_samples + rows[p]] <= best_split.thresh: p+=1
        else: p_end-=1; rows[p], rows[p_end] = rows[p_end], rows[p] 
    return p

# Necessary constants.
cdef DTYPE_t NEG_INF = -np.inf
cdef DTYPE_t NAN = np.nan
cdef DTYPE_t Q_THRESHOLD = 0.02
            
cdef class _DecisionTree:
    # Class attributes.
    cdef SIZE_t seed
    cdef mt19937 rng
    cdef SIZE_t mem_capacity
    cdef SIZE_t n_samples
    cdef SIZE_t n_features
    cdef SIZE_t n_class
    cdef SIZE_t m
    cdef SIZE_t min_samples_leaf, 
    cdef DTYPE_t min_weight_fraction_leaf
    cdef DTYPE_t min_weight_leaf
    cdef SIZE_t n_nodes
    cdef SIZE_t max_n_unique_feat_vals
    cdef SIZE_t* n_unique_vals_feats
    cdef SIZE_t* rows
    cdef SIZE_t* features
    cdef DTYPE_t* class_weights
    cdef Node* nodes
    cdef DTYPE_t* weighted_class_counts
    def __cinit__(self, SIZE_t m, SIZE_t min_samples_leaf, DTYPE_t min_weight_fraction_leaf, SIZE_t seed): 
        """
        Arguments:
            m                       : Number of candidate features randomly selected to try to split each node.
            min_samples_leaf        : Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf: Total weight of any leaf's samples must comprise this portion 
                                      of the sum of weights of *all* training samples used to fit the tree.
            seed                    : A seed for the C++ mt19937 32bit int random generator. 
                                      Use when reproducibility is desired.
        """
        self.m, self.min_samples_leaf = m, min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = NULL
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = NULL 
        
    def __dealloc__(self):
        free(self.nodes)
        free(self.weighted_class_counts)
        
    property size:
        def __get__(self):
            return self.n_nodes
    
    property left_children:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].l_child
            return out

    property right_children:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].r_child
            return out
        
    property split_features: 
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].feat
            return out
        
    property split_thresholds:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].thresh
            return out
        
    property weighted_cc:
        def __get__(self):
            cdef SIZE_t out_size = self.n_nodes*self.n_class
            out = np.empty(out_size, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(out_size):
                    out_view[i] = self.weighted_class_counts[i]
            out.resize(self.n_nodes, self.n_class)
            return out
    
    property labels:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].label
            return out
    
    cdef void _increase_mem_capacity(self, SIZE_t new_capacity) nogil:
        self.nodes = safe_realloc(self.nodes, new_capacity)
        self.weighted_class_counts = safe_realloc(self.weighted_class_counts, self.n_class*new_capacity)
    
    cdef void _make_leaf(self, Node* leaf_node, SIZE_t* y, SIZE_t node_start, SIZE_t node_id, 
                         SIZE_t n_classes_node, SIZE_t* max_wt_classes) nogil:
        # Class with largest wcc becomes leaf node's label. Break ties with a random choice.
        cdef SIZE_t label
        cdef DTYPE_t max_wt = 0.
        cdef SIZE_t lb = 0
        cdef SIZE_t ub = -1
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t i, j
        # If all node's samples have the same class.
        if n_classes_node == 1:
            label = y[self.rows[node_start]]
        else:
            # Otherwise find label with max weighted class count for the node.
            for i in range(self.n_class):
                max_wt = max(max_wt, self.weighted_class_counts[node_id*self.n_class + i])
            # See if multiple classes share this max count.
            for i in range(self.n_class):
                if self.weighted_class_counts[node_id*self.n_class + i] == max_wt:
                    ub += 1
                    max_wt_classes[ub] = i
            # If so, randomly choose leaf's label from among those classes.
            if ub > 0:
                dist = uniform_int_distribution[SIZE_t](lb, ub) # Choose an int w/in range lb, ub, inclusive.
                j = dist(self.rng)
                label = max_wt_classes[j]
            else:
                label = max_wt_classes[lb]
        leaf_node.l_child = -1
        leaf_node.r_child = -1    
        leaf_node.feat = -1  
        leaf_node.thresh = NAN
        leaf_node.label = label 

    cdef _grow_tree(self, DTYPE_t* X, SIZE_t* y, SIZE_t* split_point_idxs, DTYPE_t* unique_vals_feats):
        # LIFO stack holding all nodes still to be investigated.
        cdef stack[StackEntry] node_stack

        #####################################################################
        # Variables containing info of the node currently being investigated.
        #####################################################################
        cdef SIZE_t start, end, node_id, parent_id, n_consts, n_samples_node
        cdef DTYPE_t* node_wcc = NULL
        cdef StackEntry node_info
        cdef Node* node = NULL
        
        # Holds child node info if the current node gets split.
        cdef SIZE_t l_child_id, r_child_id
        cdef Node* l_child_node = NULL
        cdef Node* r_child_node = NULL
        
        #####################################################################
        # For finding the best split.
        #####################################################################
        cdef Split best_split
        cdef DTYPE_t* l_wcc = NULL
        cdef DTYPE_t* r_wcc = NULL
        cdef DTYPE_t sum_node_wcc_sqr, sum_node_wcc # Parent node's proxy Gini score num and den.
        cdef SIZE_t split_pos
        
        # Indicates a feature has been discovered to be constant during a
        # split search within the search range permitted by min_samples_leaf 
        # and min_weight_leaf.
        cdef bint current_feat_const 
        
        # Determines whether to use SmallQ or LargeQ splitting. 
        cdef DTYPE_t q
        cdef SIZE_t n_unique_vals_feat

        # Following two buffers used for SmallQ splitting.
        
        # Create a C-contiguous array of doubles to hold feature values of a 
        # given node's samples. Using Numpy to allocate memory to longer 
        # vectors is often faster than using realloc().
        cdef DTYPE_t[::1] items_buffer = np.empty(self.n_samples, dtype=np.float64)
        cdef DTYPE_t* items = &items_buffer[0]
        cdef SIZE_t r
        
        # An array to contain unique split points for a feature at a given node.
        cdef DTYPE_t[::1] node_unique_vals_feat_buffer = np.empty(self.n_samples, dtype=np.float64)
        cdef DTYPE_t* node_unique_vals_feat = &node_unique_vals_feat_buffer[0]
        cdef SIZE_t node_n_unique_vals_feat
        
        # Three 1-d arrays containing raw and weighted sample counts, as
        # well as weighted class counts for each unique raw feature value.
        # Used for both SmallQ and LargeQ splitting, except for split_counts_wt,
        # which is just used for LargeQ.
        
        # Raw, non-weighted, sample counts at each split-point.
        cdef SIZE_t[::1] split_counts_raw_buffer = np.empty(self.max_n_unique_feat_vals, dtype=np.intp) 
        cdef SIZE_t* split_counts_raw = &split_counts_raw_buffer[0]
        # Weighted sample counts at each split-point.
        cdef DTYPE_t[::1] split_counts_wt_buffer = np.empty(self.max_n_unique_feat_vals, dtype=np.float64)
        cdef DTYPE_t* split_counts_wt = &split_counts_wt_buffer[0]
        # Weighted class counts for each split-point.
        cdef DTYPE_t[::1] split_class_counts_wt_buffer = np.empty(self.n_class*self.max_n_unique_feat_vals, dtype=np.float64) 
        cdef DTYPE_t* split_class_counts_wt = &split_class_counts_wt_buffer[0]
        
        #####################################################################
        # For random feature selection (w/out replacement) and keeping track 
        # of nodes' constant features. 
        #####################################################################
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t lb, ub, idx, feat_idx, n_drawn_feats, n_new_consts, n_total_consts
        cdef SIZE_t[::1] features_buffer = np.empty(self.n_features, dtype=np.intp) 
        cdef SIZE_t* features = &features_buffer[0]
        cdef SIZE_t[::1] constant_features_buffer = np.empty(self.n_features, dtype=np.intp)
        cdef SIZE_t* constant_features = &constant_features_buffer[0]
        
        #####################################################################
        # For determining whether node should be a leaf.
        #####################################################################
        cdef SIZE_t i, c, cc, n_classes_node, row, label
        cdef DTYPE_t wcc, wt
        # Stores classes that share a leaf's max class wt. When two or more 
        # present, leaf label randomly chosen from these classes
        cdef SIZE_t* max_wt_classes = NULL
        
        with nogil:
            # Allocate memory to pointers.
            l_wcc = safe_realloc(l_wcc, self.n_class)
            r_wcc = safe_realloc(r_wcc, self.n_class)
            node_wcc = safe_realloc(node_wcc, self.n_class)
            max_wt_classes = safe_realloc(max_wt_classes, self.n_class*sizeof(SIZE_t))
            # Fill with feature column indices so we can track constant feats.
            memcpy(features, self.features, self.n_features* sizeof(SIZE_t))
            
            # Push root node onto the LIFO stack.
            node_stack.push({"start": 0, "end": self.n_samples, "node_id": 0, 
                             "parent_id": 0, "n_const_feats": 0})
            self.n_nodes = 1
            while not node_stack.empty():
                node_info = node_stack.top()
                node_stack.pop()
                start, end = node_info.start, node_info.end
                node_id, parent_id = node_info.node_id, node_info.parent_id # TODO: `parent_id` unused; is it necessary?
                n_consts = node_info.n_const_feats
                n_samples_node = end-start
                node = &self.nodes[node_id]
                
                # Tabulate the current node's weighted class counts.
                #
                # Implementation detail #1: I tried storing the l and r child wt class cts
                # of nodes' best splits so that this tabulation wouldn't need to be 
                # performed for each node. But found there was virtually no speed improvement
                # to justify the more complicated code required to store and update these 
                # values during the best split search.
                #
                # Implementation detail #2: Setting aside a block of memory to 
                # store the current node's wt class cts and passing a pointer to
                # this block to the split search function sped up training by 8%
                # compared to passing a ptr to the location of node's wt class cts 
                # in the self.weighted_class_counts array.
                memset(node_wcc, 0, self.n_class*sizeof(DTYPE_t))
                sum_node_wcc, sum_node_wcc_sqr = 0., 0.
                for i in range(n_samples_node):
                    row = self.rows[start + i]
                    label = y[row]
                    wt = self.class_weights[label]
                    # Compute the node's proxy gini numerator and denominator while we're at it.
                    sum_node_wcc_sqr += wt*(2*node_wcc[label] + wt) # numerator
                    sum_node_wcc += wt                              # denominator
                    node_wcc[label] += wt
                memcpy(&self.weighted_class_counts[node_id*self.n_class], node_wcc, self.n_class*sizeof(DTYPE_t))
                
                # Make a leaf if required to do so. 
                n_classes_node = 0
                for c in range(self.n_class):
                    wcc = node_wcc[c]
                    if wcc > 0: n_classes_node += 1
                if n_classes_node == 1:                   
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif n_samples_node < 2*self.min_samples_leaf:  
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif sum_node_wcc < 2.*self.min_weight_leaf: 
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)

                # Otherwise split the node.
                else:
                    # Initialize stats for best split of node.
                    best_split.feat = -1
                    best_split.thresh = 0.
                    best_split.score = NEG_INF

                    # Ensure feats drawn w/out replacement.
                    n_drawn_feats = 0
                    n_new_consts = 0
                    n_total_consts = n_consts
                    lb = 0                      # Range in `features` array from which we 
                    ub = self.n_features - 1    # randomly select a feature's column index. 
                        
                    while n_drawn_feats < self.m:
                        n_drawn_feats += 1

                        # Breiman & Cutler's original Fortran random forests implementation 
                        # allows for known constant features to be drawn during a split-search.
                        # I follow their example, as I believe that doing so allows individual 
                        # trees to be less correlated with each other. Since I don't pre-sort
                        # features, I would prefer not to have to sort any more features than
                        # necessary, and so I've adopted the technique Sklearn uses to track 
                        # constant features:
                        #     https://github.com/scikit-learn/scikit-learn/blob/dbe39454f766ebefc3219f2c1871ac1774316532/sklearn/tree/_splitter.pyx#L310
                        # 
                        # The idea is that feature idxs in `features` are organized into two sections:
                        #
                        #     [<indices of known constant feats>, <indices of non-constant feats>]
                        #
                        # As we begin drawing feature indices from this above list, those two sections
                        # will each be further sub-divided into two sections:
                        # 
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <undrawn non-constant feats>, <drawn non-constant feats>]
                        #
                        # When we choose a feature that happens to be a known constant, we'll re-locate
                        # its idx to the right-end of the first of those four sections. Then we 
                        # increment the lower bound threshold, `lb`, by one so that we don't re-draw 
                        # that feature again.
                        #
                        # Similarly, if we draw a non-constant feature idx, we'll move it to the 
                        # left-end of the last of the four partitions and reduce the upper bound
                        # threshold, `ub`, by one so that the feature idx can't be drawn again
                        # during this split-search. 
                        #
                        # One last important detail: sometimes we'll draw a feature that 
                        # used to be non-constant for ancestor nodes, but will be found to be 
                        # constant for the current node. When this happens, we relocate its 
                        # index so that it sits to the right of the known constant feats section.
                        # This means our `features` list could have up to five partitions:
                        #
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <newly discovered const feats>, <undrawn non-constant feats>, 
                        #      <drawn non-constant feats>]
                        #
                        # Whenever we find a new constant feature, we increment the `n_new_consts`
                        # counter by one. We also increment the `n_total_consts` counter by one. 
                        # During the split-search we have to use `n_total_consts` to keep track of
                        # the total number of constant features. n_consts` mustn't be changed
                        # because it tells us where the <newly discovered const feats> section
                        # of the `features` list begins.

                        # One last wrinkle. We subtract the # of newly discovered const feats from  
                        # the upper bound before we select an index `i` from the `features` array, 
                        # and add it back to `i` after `i` has been genereated. This prevents us from 
                        # re-drawing any of these new const feats again during this split-search.
                        dist = uniform_int_distribution[SIZE_t](lb, ub-n_new_consts)
                        idx = dist(self.rng)

                        # So that we don't draw a known constant feature again this split-search.
                        if idx < n_consts:
                            features[idx], features[lb] = features[lb], features[idx]
                            lb += 1 
                            continue

                        # So that no new const feats get drawn more than once per split-search.
                        idx += n_new_consts

                        feat_idx = features[idx]
                        
                        # Num split points found among training samples for given feat.
                        n_unique_vals_feat = self.n_unique_vals_feats[feat_idx]
                        
                        q = n_samples_node/n_unique_vals_feat
                        
                        # SmallQ Splitting.
                        if q < Q_THRESHOLD:
                            # Place all samples' feat values into contiguous storage.
                            for r in range(n_samples_node):
                                # X is a pointer, so have to index into this 2d array in the C way 
                                # (also keeping in mind that the array is column-major).
                                items[r] = X[feat_idx*self.n_samples + self.rows[start + r]]
                                
                            # Place all unique split points, in ascending order, inside the 
                            # first <node_n_unique_vals_feat> indices of the items array.
                            node_n_unique_vals_feat = sort_unique(items, 0, n_samples_node)

                            # Make sure the feature not constant for node's samples.
                            if node_n_unique_vals_feat < 2:
                                # Move the newly-discovered constant feat to the far right-end
                                # of the left half of `features` list holding the known const
                                # feats as well as any other const feats newly discovered 
                                # during this node's split-search.
                                features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                                n_new_consts += 1
                                n_total_consts += 1
                                continue
                            else:
                                # Initialize weighted class counts of right and left children.
                                # Right child's counts are initially the same as parent node's.
                                memcpy(r_wcc, node_wcc, self.n_class*sizeof(DTYPE_t))
                                memset(l_wcc, 0, self.n_class*sizeof(DTYPE_t))

                                # If the feature has an impurity score that's better than the best score 
                                # found among all other features visited thus far for this node, find_num_split()
                                # updates the attributes of the struct containing the node's best split info. 
                                # 
                                # But even if a new best score isn't reached, if an impurity score can
                                # be calculated at least once during the feature's split search, the
                                # following indicator will be toggled off, to indicate that the feature
                                # is not constant.
                                current_feat_const = 1 # 1 = is constant; 0 = not constant
                                find_num_split_smallQ(X, self.rows, items, y, start, n_samples_node, self.n_samples, 
                                                      self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                                      self.class_weights, l_wcc, r_wcc, node_wcc,
                                                      &best_split, feat_idx, sum_node_wcc_sqr, sum_node_wcc,
                                                      node_n_unique_vals_feat, split_counts_raw, split_class_counts_wt,
                                                      &current_feat_const)
                                
                        # LargeQ Splitting.
                        else:
                            memcpy(r_wcc, node_wcc, self.n_class*sizeof(DTYPE_t))
                            memset(l_wcc, 0, self.n_class*sizeof(DTYPE_t))
                            current_feat_const = 1 # 1 = is constant; 0 = not constant
                            find_num_split_largeQ(self.rows, split_point_idxs, unique_vals_feats, y, start, 
                                                  n_samples_node, self.n_samples, self.n_class, self.min_samples_leaf, 
                                                  self.min_weight_leaf, self.class_weights, l_wcc, r_wcc, node_wcc, 
                                                  &best_split, feat_idx, sum_node_wcc_sqr, sum_node_wcc, n_unique_vals_feat, 
                                                  self.max_n_unique_feat_vals, split_counts_raw, split_counts_wt, 
                                                  split_class_counts_wt, &current_feat_const)

                        if current_feat_const:
                            # The feature may be constant within the search range permitted
                            # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                            # the feature is a newly discovered constant.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # The feature is non-constant, so we ensure it's not drawn again
                            # during this split-search.
                            features[idx], features[ub] = features[ub], features[idx]
                            ub -= 1 

                    # To ensure that the constant features info is accurate for sibling or child nodes.
                    memcpy(&features[0], &constant_features[0], sizeof(SIZE_t)*n_consts)
                    memcpy(&constant_features[n_consts], &features[n_consts], sizeof(SIZE_t)*n_new_consts)

                    # Make node a leaf if constant for all randomly drawn feats.
                    # (# drawn known constant feats + # drawn new constant feats)
                    if lb + n_new_consts == n_drawn_feats: 
                        self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                    else: 
                        split_pos = make_num_split(self.rows, X, &node_info, &best_split, self.n_samples) 

                        # Update tree info for node that's getting split.
                        l_child_id = self.n_nodes
                        r_child_id = l_child_id + 1
                        node.l_child = l_child_id
                        node.r_child = r_child_id
                        node.feat    = best_split.feat
                        node.thresh  = best_split.thresh
                        node.label   = -1

                        # Prepare for the left and right child nodes
                        # by increasing tree data memory capacity if
                        # necessary.
                        if self.n_nodes + 2 > self.mem_capacity:
                            # Expand memory capacity geometrically. See "geometric growth" 
                            # part of WhozCraig's SO answer at: 
                            #     https://stackoverflow.com/a/51665863/8628758.
                            # Add one after squaring so that the new capacity can
                            # contain not only a tree of greater depth, but also
                            # the maximum # nodes that that depth could have.
                            new_capacity = 2*self.mem_capacity + 1
                            self._increase_mem_capacity(new_capacity)
                            self.mem_capacity = new_capacity
                        
                        # Push right child info onto the LIFO stack.
                        node_stack.push({"start": split_pos, "end": end, "node_id": r_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})
                        # Push left child info onto queue.
                        node_stack.push({"start": start, "end": split_pos, "node_id": l_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})

                        # And update size of the tree.
                        self.n_nodes += 2
                        
        free(l_wcc)
        free(r_wcc)
        free(node_wcc)
        free(max_wt_classes)
    
    def fit(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X, np.ndarray[SIZE_t, ndim=1, mode="c"] y,
            np.ndarray[SIZE_t, ndim=2, mode="fortran"] split_point_idxs, 
            np.ndarray[DTYPE_t, ndim=2, mode="fortran"] unique_vals_feats,
            np.ndarray[SIZE_t, ndim=1, mode="c"] n_unique_vals_feats,
            np.ndarray[SIZE_t, ndim=1, mode="c"] rows, np.ndarray[SIZE_t, ndim=1, mode="c"] features,
            np.ndarray[DTYPE_t, ndim=1, mode="c"] class_weights, SIZE_t n_class): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X       (2D Fortran-contiguous array of float64): Pre-processed training data.
            y                 (1D C-contiguous array of int): Training labels.
            split_point_idxs                (ndarray of int): All numerical feature split-point locations for 
                                                              all rows. Shape: (n training samples, n features).
            unique_vals_feats           (ndarray of float64): Columns contain sorted unique values for all features.
                                                              Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats                (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows              (1D C-contiguous array of int): Indices of the rows to be used for training. 
            feats             (1D C-contiguous array of int): Column indices of training features.
            class_weights (1D C-contiguous array of float64): Desired weight for each class. Shape: (`n_class`,).
            n_class                                         : Number of classes in training data. 
        """
        # Casting the raw data to pointers gives a 17% speed-up compared to getting
        # pointer from the ndarray's buffer interface, as recommended by DavidW in 
        # his SO answer at: https://stackoverflow.com/a/54832269/8628758. e.g.
        #     cdef DTYPE_t[::1,:] X_buffer = X
        #     cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        # Not worried about unexpected behavior as all ndarrays' contiguousness and
        # memory layout enforced prior to this point.
        cdef DTYPE_t* X_ptr = <DTYPE_t*> X.data
        cdef SIZE_t* y_ptr = <SIZE_t*> y.data
        cdef SIZE_t* split_point_idxs_ptr = <SIZE_t*> split_point_idxs.data
        cdef DTYPE_t* unique_vals_feats_ptr = <DTYPE_t*> unique_vals_feats.data
        self.n_unique_vals_feats = <SIZE_t*> n_unique_vals_feats.data
        self.rows = <SIZE_t*> rows.data
        self.features = <SIZE_t*> features.data
        self.class_weights = <DTYPE_t*> class_weights.data
        self.n_class = n_class
        self.n_samples = rows.shape[0]
        self.n_features = features.shape[0]
        cdef random_device rd # Needed when using the C++ mt19937 rng w/out a seed.
        
        # Get the max cardinality of all numerical feats.
        self.max_n_unique_feat_vals = n_unique_vals_feats.max()
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        cdef SIZE_t init_capacity = 15
        
        cdef SIZE_t i, row, label
        cdef DTYPE_t wt
        cdef DTYPE_t sum_wts = 0
        cdef Node* root_node = NULL
        with nogil:
            # Allocate memory for the tree.
            self._increase_mem_capacity(init_capacity)
            self.mem_capacity = init_capacity
 
            # And sum the class weights of all the root node's samples in
            # order to know minimum total weight a leaf must have (which
            # we must know when regularizing by min_weight_fraction_leaf.)
            for i in range(self.n_samples):
                row = self.rows[i]
                label = y_ptr[row]
                wt = self.class_weights[label]
                sum_wts += wt
            self.min_weight_leaf = self.min_weight_fraction_leaf*sum_wts
            
            # Initialize the random number generator. Followed example from:
            #     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/tests/run/cpp_stl_random.pyx#L16
            if self.seed == -1:
                self.rng = mt19937(rd()) # If using the random device engine std::random_device.
            else:
                self.rng = mt19937(self.seed)

        # Initiate tree building.
        self._grow_tree(X_ptr, y_ptr, split_point_idxs_ptr, unique_vals_feats_ptr)
    
    cdef Node* _next_node(self, SIZE_t nxt) nogil: 
        return &self.nodes[nxt]
    
    cdef SIZE_t _get_leaf_idx(self, SIZE_t i, Node* leaf, SIZE_t n, DTYPE_t* X) nogil:
        cdef SIZE_t idx
        cdef SIZE_t root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[leaf.feat*n + i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else: 
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
    
    def predict(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_preds = X.shape[0]
        cdef SIZE_t i
        preds = np.empty(n_preds, dtype=np.intp)
        cdef SIZE_t[::1] preds_view = preds
        cdef Node leaf
        with nogil:
            for i in range(n_preds): 
                preds_view[i] = self.nodes[self._get_leaf_idx(i, &leaf, n_preds, X_ptr)].label
        return preds
    
    def predict_probs(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions. Shape: (`X.size`, `self.n_class`)
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        cdef DTYPE_t[::1] wcc_view = wcc
        cdef Node leaf
        cdef SIZE_t i, j, idx
        with nogil:
            for i in range(n_probs):
                idx = self._get_leaf_idx(i, &leaf, n_probs, X_ptr)
                for j in range(self.n_class):
                    wcc_view[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

class DecisionTreeSmallQLargeQCython():
    """Fit a decision tree classifier using a depth-first tree 
    growth algorithm. 
    
    Uses Marvin Wright's SmallQ and LargeQ numerical splitting algorithms:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L173
    """
    
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights = [], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit the tree.
            seed                         (int): Use when reproducibility is desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C') 
        if seed is None:
            self.seed = -1
        else:
            self.seed = seed
        self._tree = _DecisionTree(self.m, self.min_samples_leaf, self.min_weight_fraction_leaf, self.seed)
        
    @property
    def size(self): return self._tree.size
    
    @property
    def left_children(self): return self._tree.left_children
    
    @property
    def right_children(self): return self._tree.right_children
            
    @property 
    def split_features(self): return self._tree.split_features

    @property 
    def split_thresholds(self): return self._tree.split_thresholds
    
    @property
    def weighted_class_counts(self): return self._tree.weighted_cc
    
    @property
    def labels(self): return self._tree.labels
    
    def fit(self, X, y, split_point_idxs, unique_feat_vals, n_unique_vals_feats, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X       (2D Fortran-contiguous array of float64): Pre-processed training data.
            y                 (1D C-contiguous array of int): Training labels.
            split_point_idxs                (ndarray of int): All numerical feature split-point locations for 
                                                              all rows. Shape: (n training samples, n features).
            unique_vals_feats           (ndarray of float64): Columns contain sorted unique values for all features.
                                                              Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats                (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows              (1D C-contiguous array of int): Indices of the rows to be used for training. 
                                                              All rows used if empty.
            feats             (1D C-contiguous array of int): Column indices of training features.
                                                              All rows used if empty.
                                                              
        Returns:
            DecisionTreeSmallQLargeQCython: A decision tree object.
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        self.n_class = np.unique(y).size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.
            
        self._tree.fit(X, y, split_point_idxs, unique_feat_vals, n_unique_vals_feats, 
                       self.rows, self.features, self.class_weights, self.n_class)
        return self

    def predict(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict(X)
    
    def predict_probs(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict_probs(X)

## Cython Wright SmallQ/LargeQ Decision Tree's Speed on the Titanic Data

In [71]:
m = 4
dt = DecisionTreeSmallQLargeQCython(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [72]:
dt.size

393

In [73]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.8202247191011236

In [74]:
%timeit dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

500 µs ± 16.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [75]:
%timeit dt.predict(xVal_proc)

7.26 µs ± 52.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Wright's SmallQ/LargeQ splitting runs about 50% slower than LargeQ alone, but is still around 7% faster than Louppe's splitting function. 

It appears that, at least for the Titanic dataset, using just LargeQ splitting will provide superior speed. However, before I move on to observing whether LargeQ also enjoys a performance advantage on the larger Santander dataset, on a final lark I'd like to see if there's any benefit to combining both Louppe's and Wright's approaches.

Let's see what happens when we use Louppe's splitter for SmallQ splits and Wright's LargeQ splitter for all other splits.

## Louppe SmallQ/Wright LargeQ Decision Tree Python Version

In [76]:
Q_THRESHOLD = 0.02

def find_num_split_Louppe(rows, items, labels, node_start, n_parent, n_class, 
                          min_samples_leaf, min_weight_leaf, c_wts, l_wcc, r_wcc,
                          parent_wcc, best_split, current_feat, parent_num, parent_den):
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Gilles Louppe's split-finding algorithm:
        Page 31 in Louppe, 2015: https://arxiv.org/pdf/1407.7502.pdf
    
    Saves a split's feature idx, threshold, position, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        rows           (ndarray of int): Indices of all rows in the training set. 
                                         Shape: (n train samples,).
        items      (ndarray of float64): The sorted feature values of the samples in the parent
                                         node (beginning at `node_start`). Shape: (n train samples,).
        labels         (ndarray of int): All training labels. Shape: (n training samples,).
        node_start                (int): Index of the beginning of the parent node in `rows`.
        n_parent                  (int): Number of samples in the parent node.
        n_class                   (int): Number of unique classes in the training set.
        min_samples_leaf          (int): Any leaf will have no fewer than this many samples.
        min_weight_leaf       (float64): Total weight of any leaf's samples will be at least this much.
        c_wts      (ndarray of float64): Class weights. Shape: (`n_class`,).
        l_wcc      (ndarray of float64): Left child's weight class counts. Shape: (`n_class`,).
        r_wcc      (ndarray of float64): Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc (ndarray of float64): Parent node's weight class counts. Shape: (`n_class`,).
        best_split              (Split): Holds the feature, threshold, position, and impurity
                                         score of the parent node's current best split.
        current_feat              (int): Column index of feature under investigation.
        parent_num            (float64): Numerator of parent node's impurity score.
        parent_den            (float64): Denominator of parent node's impurity score.
              
    Returns: 
        int: 1 if feature is constant for eligible split-points. 0, otherwise.
    """
    # So that we can iterate across all feature values in the node.
    prev_pos, pos = node_start, node_start
    node_end = node_start + n_parent
    lowest = items[pos]
    
    # Variables used to calculate proxy gini scores.
    l_num, l_den  = 0., 0.
    r_num, r_den, = parent_num, parent_den
    
    # Whether or not feat is constant within search range permitted
    # by min_samples_leaf and min_weight_leaf (0 if no, 1 if yes).
    current_feat_const = 1
    
    # Find the best split and store its score, threshold, position,
    # as well as it's children's weighted class counts.
    while pos < node_end:
        while items[pos] == lowest: # When consecutive items have the same value.
            if pos == node_end - 1: # When the final few samples all have the same value.
                return current_feat_const
            pos+=1
        next_lowest = items[pos]
        mid = lowest/2. + next_lowest/2. # Split threshold is always the mid-point between two consecutive values.
        if mid == next_lowest: mid = lowest
            
        # Move samples from the left to right child when it's quicker to do so.
        if pos-prev_pos > node_end-pos-1:
            l_num, l_den = parent_num, parent_den
            r_num, r_den = 0., 0.
            l_wcc[:] = parent_wcc
            r_wcc[:] = 0.
            for r in reversed(range(pos, node_end)):
                row = rows[r]
                label = labels[row]; w = c_wts[label]
                r_num += w*( 2*r_wcc[label] + w); r_den += w
                l_num += w*(-2*l_wcc[label] + w); l_den -= w 
                r_wcc[label] += w; l_wcc[label] -= w
        else:
            for r in range(prev_pos, pos):
                row = rows[r]
                label = labels[row]; w = c_wts[label] 
                l_num += w*( 2.*l_wcc[label] + w); l_den += w
                r_num += w*(-2.*r_wcc[label] + w); r_den -= w
                l_wcc[label] += w; r_wcc[label] -= w  
                
        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf.
        if pos - node_start < min_samples_leaf: 
            lowest = next_lowest
            prev_pos = pos; pos+=1 
            continue
        elif node_end - pos < min_samples_leaf:
            return current_feat_const
        # l_den and r_den are left and right children's weighted sample sums.
        elif l_den < min_weight_leaf: 
            lowest = next_lowest
            prev_pos = pos; pos+=1 
            continue
        elif r_den < min_weight_leaf:
            return current_feat_const

        current_feat_const = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Only update best split stats if current score beats all
            # other best found among all other features already explored 
            # at the current node.
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat
        lowest = next_lowest
        prev_pos = pos; pos+=1
    return current_feat_const

class DecisionTreeLouppeWright():
    """Fit a decision tree classifier using a depth-first tree 
    growth algorithm. 
    
    Uses Louppe's splitter for SmallQ splits and Wright's splitter for LargeQ splitting.
    """
        
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights=[], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try 
                                                to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit 
                                                the tree.
            class_weights (ndarray of float64): Sample weight to be used for each class. Shape: (`n_class`,).
            seed                         (int): Use when reproducibility desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C')
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = np.empty(0, dtype=Node, order='C')
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = np.empty(0, dtype=np.float64, order='C')
        
    @property
    def size(self): return self.n_nodes
    
    @property 
    def left_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].l_child
        return out

    @property 
    def right_children(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].r_child
        return out

    @property 
    def split_features(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].feat
        return out

    @property 
    def split_thresholds(self): 
        out = np.empty(self.n_nodes, dtype='float64')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].thresh
        return out

    @property 
    def weighted_cc(self):
        out_size = self.n_nodes*self.n_class
        out = np.empty(out_size, dtype='float64')
        for i in range(out_size):
            out[i] = self.weighted_class_counts[i]
        out.resize(self.n_nodes, self.n_class)
        return out

    @property 
    def labels(self): 
        out = np.empty(self.n_nodes, dtype='int')
        for i in range(self.n_nodes):
            out[i] = self.nodes[i].label
        return out
    
    def _increase_mem_capacity(self, new_capacity):
        """Resize ndarrays that hold tree's nodes and weighted class counts.
        
        Arguments:
            new_capacity (int): Amount of nodes that resized arrays will be able to hold.
        """
        self.nodes.resize(new_capacity, refcheck=False)
        self.weighted_class_counts.resize(new_capacity*self.n_class, refcheck=False)
    
    def _make_leaf(self, node_id, wcc, n_classes_node):
        """Set and store the class label of a leaf node.
        
        Break ties at random when multiple classes share the same max weight.
        Doing this avoids a bias towards lower classes that would be a possible
        consequence of using np.argmax (which is what Sklearn does).
        
        Arguments:
            node_id            (int): Location of node in `self.nodes`.
            wcc (ndarray of float64): Node's weighted class counts. Shape: (`self.n_class`,).
            n_classes_node     (int): Number of unique class labels found among
                                      node's training samples.
        """
        if n_classes_node == 1: 
            label = max(enumerate(wcc), key=lambda f: f[1])[0]
        else:              
            label = self._rng.choice(np.argwhere(wcc==np.max(wcc)).flatten())
        self.nodes[node_id] = Node(-1, -1, -1, np.nan, label) 
        
    def _grow_tree(self, X, y, split_point_idxs, unique_feat_vals):
        """Depth-first growth of a decision tree.
        
        Arguments:
            X                     (ndarray of float64): Training samples. Shape: (n samples, n features).
            y                         (ndarray of int): Training labels. Shape: (n samples,).
            split_point_idxs          (ndarray of int): All numerical feature split-point locations for all rows.
                                                        Shape: (n training samples, n features).
            unique_vals_feats     (ndarray of float64): Columns contain sorted unique values for all features.
                                                        Shape: (max cardinality of all feats, n features).
        """
        # LIFO stack holding all nodes still to be investigated.
        node_stack = []
        
        # Stores the weighted class counts of the current node.
        node_wcc = np.empty(self.n_class, dtype=np.float64)
        
        ##############################################################
        # For finding the best split.
        ##############################################################
        l_wcc = np.empty(self.n_class, dtype=np.float64)
        r_wcc = np.empty(self.n_class, dtype=np.float64)
        
        # For SmallQ splitting using Louppe.
        items = np.empty(self.n_samples, dtype=np.float64)
        
        # 1-d arrays containing raw and weighted sample counts, as
        # well as weighted class counts for each unique raw feature value.
        # For Wright LargeQ splitting.
        
        # Raw, non-weighted, sample counts at each split-point.
        split_counts_raw = np.empty(self.max_n_unique_feat_vals, dtype=np.intp) 
        # Weighted sample counts at each split-point.
        split_counts_wt = np.empty(self.max_n_unique_feat_vals, dtype=np.float64)
        # Weighted class counts for each split-point.
        split_class_counts_wt = np.empty(self.n_class*self.max_n_unique_feat_vals, dtype=np.float64)  

        # Keeping track of nodes' constant features. 
        features = self.features.copy()
        constant_features = np.empty(self.n_features, dtype=np.intp)
        
        # Push root node onto the LIFO stack.
        node_stack.append(StackEntry(0, self.n_samples, 0, 0, 0))
        self.n_nodes = 1
        
        while len(node_stack) > 0:
            node_info = node_stack.pop()
            start, end = node_info.start, node_info.end
            node_id, parent_id = node_info.node_id, node_info.parent_id
            n_consts = node_info.n_const_feats
            n_samples_node = end-start
            
            # Tabulate and store the current node's weighted class counts.
            node_wcc[:] = 0.
            for i in range(n_samples_node):
                row = self.rows[start + i]
                label = y[row]
                wt = self.class_weights[label]
                node_wcc[label] += wt 
            self.weighted_class_counts[node_id*self.n_class: (node_id + 1)* self.n_class] = node_wcc
            
            # Make a leaf if required to do so.
            n_classes_node, sum_node_wcc, sum_node_wcc_sqr = 0, 0., 0.
            for c in range(self.n_class):
                wcc = node_wcc[c]
                if wcc > 0: n_classes_node += 1
                # Compute the current node's proxy gini numerator and denominator while we're at it.
                sum_node_wcc_sqr += wcc**2 
                sum_node_wcc += wcc 
            if n_classes_node == 1:                      
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif n_samples_node < 2*self.min_samples_leaf:  
                self._make_leaf(node_id, node_wcc, n_classes_node)
            elif sum_node_wcc < 2.*self.min_weight_leaf: 
                self._make_leaf(node_id, node_wcc, n_classes_node)
            
            # Or perform a split.
            else:
                # Initialize stats for best split of node.
                best_split = Split(-1, 0., -np.inf)
                
                # Ensure feats drawn w/out replacement.
                n_drawn_feats = 0
                n_new_consts = 0
                n_total_consts = n_consts
                lb = 0                      # Range in `features` array from which we 
                ub = self.n_features - 1    # randomly select a feature's column index. 
               
                while n_drawn_feats < self.m:
                    n_drawn_feats += 1
                    idx = self._rng.choice(range(lb, ub-n_new_consts+1))
                    
                    # So that we don't draw a known constant feature again this split-search.
                    if idx < n_consts:
                        features[idx], features[lb] = features[lb], features[idx]
                        lb += 1 
                        continue
                        
                    # So that no new const feats get drawn more than once per split-search.
                    idx += n_new_consts
                    
                    feat_idx = features[idx]
                  
                    # Num split points found among training samples for given feat.
                    n_unique_vals_feat = self.n_unique_vals_feats[feat_idx]
                    
                    q = n_samples_node/n_unique_vals_feat
                    
                    # Louppe numerical splitting.
                    if q < Q_THRESHOLD:
                        # Prepare the rows' feature values for sorting.
                        items[start:end] = X[:,feat_idx][self.rows[start:end]]

                        # Sort feature values and corresponding sample row indices
                        # to prepare for numerical split finding.
                        dual_sort(items, self.rows, start, end)

                        # Make sure the feature not constant for node's samples.
                        if items[start] == items[end-1]:
                            # Move the newly-discovered constant feat to the far right-end
                            # of the left half of `features` list holding the known const
                            # feats as well as any other const feats newly discovered 
                            # during this node's split-search.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # Initialize weighted class counts of right and left children.
                            # Right child's counts are initially the same as parent node's.
                            r_wcc[:] = node_wcc
                            l_wcc[:] = 0.

                            # If the feature has an impurity score that's better than the best score 
                            # found among all other features visited thus far for this node, find_num_split()
                            # updates the attributes of the struct containing the node's best split info. 
                            # 
                            # But even if a new best score isn't reached, if an impurity score can
                            # be calculated at least once during the feature's split search, the
                            # following indicator will be toggled off, to indicate that the feature
                            # is not constant (1 = is constant; 0 = not constant).
                            current_feat_const = find_num_split_Louppe(self.rows, items, y, start, n_samples_node, self.n_class, 
                                                                       self.min_samples_leaf, self.min_weight_leaf, self.class_weights, 
                                                                       l_wcc, r_wcc, node_wcc, best_split, feat_idx, sum_node_wcc_sqr, 
                                                                       sum_node_wcc)
                    # Wright LargeQ numerical splitting.
                    else:
                        r_wcc[:] = node_wcc
                        l_wcc[:] = 0.
                        current_feat_const = find_num_split_largeQ(self.rows, split_point_idxs, unique_vals_feats, y, start, n_samples_node, 
                                                           self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                                           self.class_weights, l_wcc, r_wcc, node_wcc, best_split, feat_idx, 
                                                           sum_node_wcc_sqr, sum_node_wcc, n_unique_vals_feat, split_counts_raw, 
                                                           split_counts_wt, split_class_counts_wt)

                    if current_feat_const:
                        # The feature may be constant within the search range permitted
                        # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                        # the feature is a newly discovered constant.
                        features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                        n_new_consts += 1
                        n_total_consts += 1
                        continue
                    else:
                        # The feature is non-constant, so we ensure it's not drawn again
                        # during this split-search.
                        features[idx], features[ub] = features[ub], features[idx]
                        ub -= 1 
                            
                # To ensure that the constant features info is accurate for sibling or child nodes.
                features[0:n_consts] = constant_features[0:n_consts]
                constant_features[n_consts:n_consts+n_new_consts] = features[n_consts:n_consts+n_new_consts]
                
                # Make node a leaf if constant for all randomly drawn feats.
                # (# drawn known constant feats + # drawn new constant feats)
                if lb + n_new_consts == n_drawn_feats: 
                    self._make_leaf(node_id, node_wcc, n_classes_node)
                else: 
                    split_pos = make_num_split(self.rows, X, node_info, best_split) 

                    # Update info for node that's getting split.
                    l_child_id = self.n_nodes
                    r_child_id = l_child_id + 1
                    self.nodes[node_id] = Node(l_child_id, r_child_id, best_split.feat, best_split.thresh, -1)

                    # Prepare for the left and right child nodes
                    # by increasing tree data memory capacity if
                    # necessary.
                    if self.n_nodes + 2 > self.mem_capacity:
                        # Expand memory capacity geometrically. See "geometric growth" 
                        # part of WhozCraig's SO answer at: 
                        #     https://stackoverflow.com/a/51665863/8628758.
                        # Add one after squaring so that the new capacity can
                        # contain not only a tree of greater depth, but also
                        # the maximum # nodes that that depth could have.
                        new_capacity = 2*self.mem_capacity + 1
                        self._increase_mem_capacity(new_capacity)
                        self.mem_capacity = new_capacity
                    
                    # Push right child info onto the LIFO stack.
                    node_stack.append(StackEntry(split_pos, end, r_child_id, node_id, n_total_consts))
                    # Push left child info onto queue.
                    node_stack.append(StackEntry(start, split_pos, l_child_id, node_id, n_total_consts))

                    # And update size of the tree.
                    self.n_nodes += 2
    
    def fit(self, X, y, split_point_idxs, unique_feat_vals, n_unique_vals_feats, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X    (Fortan-style ndarray of float64): Pre-processed training data.
            y                     (ndarray of int): Training labels.
            split_point_idxs      (ndarray of int): All numerical feature split-point locations for all rows.
                                                    Shape: (n training samples, n features).
            unique_vals_feats (ndarray of float64): Columns contain sorted unique values for all features.
                                                    Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats      (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows                            (list): Indices of the rows to be used for training. 
                                                    All rows used if empty.
            features                        (list): Column indices of training features that will be used.
                                                    All features used if empty.    
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        # Determine # classes found among all training samples.
        root_cc = np.unique(y, return_counts=True)[1] 
        self.n_class = root_cc.size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.

        self.n_samples = len(self.rows)
        self.n_features = len(self.features)
        
        # Store the num unique vals for each numerical feat and
        # find the maximum cardinality of all features.
        self.n_unique_vals_feats = n_unique_vals_feats
        self.max_n_unique_feat_vals = self.n_unique_vals_feats.max()
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        init_capacity = 15
        
         # Allocate tree memory.
        self._increase_mem_capacity(init_capacity)
        self.mem_capacity = init_capacity
        
        # And sum the class weights of all the root node's samples in
        # order to know minimum total weight a leaf must have (which
        # we must know when regularizing by min_weight_fraction_leaf.)
        root_wcc = root_cc*self.class_weights
        self.min_weight_leaf = self.min_weight_fraction_leaf*root_wcc.sum()
        
        # Initialize the random number generator.
        self._rng = get_random_generator(self.seed)
        
        # Initiate tree building.
        self._grow_tree(X, y, split_point_idxs, unique_feat_vals)
        return self
        
    def _next_node(self, nxt): return self.nodes[nxt]
       
    def _get_leaf_idx(self, i, X):
        root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[:,leaf.feat][i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else:
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
        
    def predict(self, X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        n_preds = X.shape[0]
        preds = np.empty(n_preds, dtype=np.intp)
        for i in range(n_preds):
            preds[i] = self.nodes[self._get_leaf_idx(i, X)].label
        return preds
    
    def predict_probs(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        for i in range(n_probs):
            idx = self._get_leaf_idx(i, X)
            for j in range(self.n_class):
                wcc[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

## Python Louppe SmallQ/Wright LargeQ Tree's Speed on the Titanic Data

In [77]:
m = 4
dt = DecisionTreeLouppeWright(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [78]:
dt.size

371

In [79]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.752808988764045

In [80]:
%timeit dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

106 ms ± 2.87 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [81]:
%timeit dt.predict(xVal_proc)

527 µs ± 10.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


## Cython Louppe SmallQ/Wright LargeQ

In [82]:
%%cython
# cython: wraparound=False, boundscheck=False, cdivision=True, initializedcheck=False
# distutils: language = c++
# distutils: extra_compile_args = -std=c++11

import numpy as np
cimport numpy as np
np.import_array()
ctypedef np.float64_t DTYPE_t
ctypedef np.intp_t SIZE_t # Signed, same as ssize_t in C. See MSeifert's SO answer: https://stackoverflow.com/a/46416257/8628758
cimport cython
from libc.math cimport log as ln
from libc.stdlib cimport realloc, free
from libc.string cimport memcpy
from libc.string cimport memset
from libcpp.stack cimport stack

# For C++ random number generation.
from libc.stdint cimport uint_fast32_t 

# Swap helper func for sorting.
cdef inline void dual_swap(DTYPE_t* items, SIZE_t* rows, SIZE_t i, SIZE_t j) nogil:
    items[i], items[j] = items[j], items[i]
    rows[i], rows[j] = rows[j], rows[i]

# Quicksort helpers

cdef inline void dual_med_three(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Find the median-of-three pivot point of the second through final 
    items of a list of numbers. Once identified, the pivot is moved to 
    the front of the list. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L78
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t middle = <int>(first + (last - first)/2)
    cdef SIZE_t second = first + 1
    last -= 1
    if items[second] < items[middle]:
        if items[middle] < items[last]:
            dual_swap(items, rows, first, middle)    
        elif items[second] < items[last]:
            dual_swap(items, rows, first, last)         
        else:                        
            dual_swap(items, rows, first, second)
    elif items[second] < items[last]:
        dual_swap(items, rows, first, second)
    elif items[middle] < items[last]:
        dual_swap(items, rows, first, last)
    else:
        dual_swap(items, rows, first, middle)

cdef inline SIZE_t dual_partition(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last, SIZE_t pivot) nogil:
    """Group numbers less than the pivot value together on the left and
    those that are greater on the right. Find the index that separates
    these two groups, which will belong to the first item that is greater
    than or equal to the pivot. Borrows from libstdc++ implementation at: 
        https://github.com/gcc-mirror/gcc/blob/d9375e490072d1aae73a93949aa158fcd2a27018/libstdc%2B%2B-v3/include/bits/stl_algo.h#L1885
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
        pivot      : Index holding the median pivot value.
        
    Returns:
        Index of cut point used to partition the items into two smaller sequences.
    """
    while True:
        while first < last and items[first] < items[pivot]:
            first += 1                      # Get index of first item greater than or equal to median-of-three pivot. 
        last -= 1
        while items[pivot] < items[last]:
            last -= 1                       # Get index of last item less than or equal to the pivot.
        if not (first < last): 
            return first                    # After swaps are done, return index of first item in right partition.
        
        dual_swap(items, rows, first, last) # Swap the first item greater than or equal to the pivot with the
                                            # last item less than or equal to the pivot. 
        first += 1

cdef inline void dual_insertion_sort(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Follows the spirit of the Numpy implementation at: 
        https://github.com/numpy/numpy/blob/5ffb84c3057a187b01acdeaa628137193df12098/numpy/core/src/npysort/quicksort.cpp#L211
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef SIZE_t i
    cdef SIZE_t j
    cdef SIZE_t k
    cdef DTYPE_t val
    for i in range(first+1, last):
        j = i
        k = i - 1
        val = items[i]
        row = rows[i]
        while (j > first) and val < items[k]:
            items[j] = items[k]
            rows[j] = rows[k]
            j-=1
            k-=1
        items[j] = val
        rows[j] = row

# Heapsort

cdef inline void dual_sift_down(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n, 
                                SIZE_t p, SIZE_t c, DTYPE_t val, SIZE_t row) nogil:
    """Swap a heap item with one of its children if that child's value is 
    greater than or equal to that parent's value. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L61
    
    Arguments:
        items: 1-d array containing numbers.
        rows : Row indices of all training samples.
        start: Index of the first number.
        n    : Quantity of numbers.
        p    : Index of the parent.
        c    : Index of the parent's first (left) child.
        val  : The parent's value.
        row  : The parent's training row index.
    """
    while c < n:    # Look at the descendents of current parent, `p`.
        if c < n-1 and items[start + c] < items[start + c + 1]: # Find larger of the first and second children.
            c += 1
        if val < items[start + c]: # If child greater than parent, swap child and parent.
            items[start + p] = items[start + c]
            rows[start + p] = rows[start + c]
            p = c   # Current greater child becomes the parent.
            c += c  # Look at this child's child, if it exists.
        else:
            break 
    items[start + p] = val
    rows[start + p] = row

cdef inline void dual_sort_heap(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Sort a binary max heap of numbers. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L77
    
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        rows : Row indices of all training samples.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    cdef DTYPE_t val
    cdef SIZE_t row
    while n > 0:
        n -= 1
        val = items[start + n]
        row = rows[start + n]
        items[start + n] = items[start]
        rows[start + n] = rows[start]
        dual_sift_down(items, rows, start, n, 0, 1, val, row)

cdef inline void dual_heapify(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Turn a list of items into a binary max heap. From Williams, 1964.
    Modeled after Numpy's implementation at:
        https://github.com/numpy/numpy/blob/084d05a5d1ef3efe79474b09b42594ee9ef086cb/numpy/core/src/npysort/heapsort.cpp#L59
    
    Arguments:
        items: 1-d array containing numbers.
        rows : Row indices of all training samples.
        start: Index of the first number.
        n    : Quantity of numbers.
    """
    cdef DTYPE_t val
    cdef SIZE_t p
    cdef SIZE_t last_p = (n-2)//2
    for p in range(last_p, -1, -1):
        val = items[start + p] # value of last parent
        row = rows[start + p]
        dual_sift_down(items, rows, start, n, p, 2*p + 1, val, row)

cdef inline void dual_heapsort(DTYPE_t* items, SIZE_t* rows, SIZE_t start, int n) nogil:
    """Applies the heapsort algorithm to sort a list of items from least to greatest. 
    From Williams, 1964.
    Arguments:
        items: 1-d array containing the numbers to be sorted.
        rows : Row indices of all training samples.
        start: Index of the first number to be sorted.
        n    : Quantity of numbers to be sorted
    """
    dual_heapify(items, rows, start, n)
    dual_sort_heap(items, rows, start, n)
    
# Introsort 

cdef void dual_introsort_loop(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last, int depth) nogil:
    """The recursive heart of the introsort algorithm.
    
    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
        depth      : Current recursion depth.
    """
    cdef int MIN_SIZE_THRESH = 16
    cdef SIZE_t cut
    while last-first > MIN_SIZE_THRESH:
        if depth == 0:
            dual_heapsort(items, rows, first, last-first)
        depth -= 1
        dual_med_three(items, rows, first, last)
        cut = dual_partition(items, rows, first+1, last, first)
        dual_introsort_loop(items, rows, cut, last, depth)
        last = cut

# Log base-2 helper function. From Sklearn's implementation at:
#     https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_utils.pyx#L7
cdef inline DTYPE_t log2(DTYPE_t x) nogil:
    return ln(x) / ln(2.0)

cdef void dual_introsort(DTYPE_t* items, SIZE_t* rows, SIZE_t first, SIZE_t last) nogil:
    """Implementation as described in Musser, 1997. Switches to heapsort
    when max recursion depth exceeded. Otherwise uses median-of-three 
    quicksort (Bentley & McIlroy, 1993) with all the usual optimizations:
        - Swap equal elements.
        - Only process partitions longer than the minimum size threshold.
        - When a new partition is made, recurse on the smaller half and 
          iterate over the larger half.
        - Make a final pass with insertion sort over the entire list.

    Arguments:
        items      : The numbers to be sorted.
        rows       : Row indices of all training samples.
        first, last: The range of items to be sorted. 
    """
    cdef int max_depth = 2 * <int>log2(last-first)
    dual_introsort_loop(items, rows, first, last, max_depth)
    dual_insertion_sort(items, rows, first, last)
    
# For convenient memory reallocation.
ctypedef fused realloc_t:
    SIZE_t
    DTYPE_t
    Node

cdef inline realloc_t* safe_realloc(realloc_t* ptr, SIZE_t n_items) nogil except *:
    # Inspired by Sklearn's safe_realloc() func. However, thankfully
    # Cython now no longer requires us to send a pointer to a pointer
    # in order to prevent crashes.
    cdef realloc_t elem = ptr[0]
    cdef SIZE_t n_bytes = n_items * sizeof(elem)
    # Make sure we're not trying to allocate too much memory.
    if n_bytes/sizeof(elem) != n_items:
        with gil:
            raise MemoryError(f"Overflow error: unable to allocate {n_bytes} bytes.")       
    cdef realloc_t* res_ptr = <realloc_t *> realloc(ptr, n_bytes)
    with gil:
        if not res_ptr: raise MemoryError()
    return res_ptr

# C++ random number generator. Not yet a part of a Cython release so
# pasted in from: 
#     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/Cython/Includes/libcpp/random.pxd#L1
cdef extern from "<random>" namespace "std" nogil:
    cdef cppclass random_device:
        ctypedef uint_fast32_t result_type
        random_device() except +
        result_type operator()() except +

    cdef cppclass mt19937:
        ctypedef uint_fast32_t result_type
        mt19937() except +
        mt19937(result_type seed) except +
        result_type operator()() except +
        result_type min() except +
        result_type max() except +
        void discard(size_t z) except +
        void seed(result_type seed) except +

    cdef cppclass uniform_int_distribution[T]:
        ctypedef T result_type
        uniform_int_distribution() except +
        uniform_int_distribution(T, T) except +
        result_type operator()[Generator](Generator&) except +
        result_type min() except +
        result_type max() except +
        
# Info for any node that will eventually be split or made into a leaf.
# Similar to what Sklearn does at:
#     https://github.com/scikit-learn/scikit-learn/blob/a2c4d8b1f4471f52a4fcf1026f495e637a472568/sklearn/tree/_tree.pyx#L126
cdef struct StackEntry:
    SIZE_t start
    SIZE_t end
    SIZE_t node_id
    SIZE_t parent_id
    SIZE_t n_const_feats

# To compare node splits.
cdef struct Split:
    SIZE_t feat
    DTYPE_t thresh
    DTYPE_t score  

# Vital characteristics of a node. Set when it's added to the tree.
cdef struct Node:
    SIZE_t l_child # idx of left child, -1 if leaf
    SIZE_t r_child # idx of right child, -1 if leaf
    SIZE_t feat    # col idx of split feature, -1 if leaf
    DTYPE_t thresh # double split threshold, NAN if leaf
    SIZE_t label   # class label if leaf, -1 if non-leaf.

# Louppe numerical splitting.
cdef inline void find_num_split(SIZE_t* rows, DTYPE_t* items, SIZE_t* labels, SIZE_t node_start, SIZE_t n_parent, 
                                SIZE_t n_class, SIZE_t min_samples_leaf, DTYPE_t min_weight_leaf, DTYPE_t* c_wts,
                                DTYPE_t* l_wcc, DTYPE_t* r_wcc, DTYPE_t* parent_wcc, Split* best_split, 
                                SIZE_t current_feat, DTYPE_t parent_num, DTYPE_t parent_den, 
                                bint* current_feat_const) nogil:
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Gilles Louppe's split-finding algorithm:
        Page 31 in Louppe, 2015: https://arxiv.org/pdf/1407.7502.pdf
    
    Saves a split's feature idx, threshold, position, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        rows              : Indices of all rows in the training set. Shape: (n train samples,).
        items             : The sorted feature values of the samples in the parent
                            node (beginning at `node_start`). Shape: (n train samples,).
        labels            : All training labels. Shape: (n training samples,).
        node_start        : Index of the beginning of the parent node in `rows`.
        n_parent          : Number of samples in the parent node.
        n_class           : Number of unique classes in the training set.
        min_samples_leaf  : Any leaf will have no fewer than this many samples.
        min_weight_leaf   : Total weight of any leaf's samples will be at least this much.
        c_wts             : Class weights. Shape: (`n_class`,).
        l_wcc             : Left child's weight class counts. Shape: (`n_class`,).
        r_wcc             : Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc        : Parent node's weight class counts. Shape: (`n_class`,).
        best_split        : Holds the feature, threshold, position, and impurity
                            score of the parent node's current best split.
        current_feat      : Column index of feature under investigation.
        parent_num        : Numerator of parent node's impurity score.
        parent_den        : Denominator of parent node's impurity score.
        current_feat_const: Whether current splitting feature is constant for all eligible split 
                            thresholds in the current node. 1 if yes, 0 otherwise.
    """
    # Variables used to calculate proxy gini scores.
    cdef DTYPE_t l_num, l_den, r_num, r_den, w, score
    cdef SIZE_t row, label, r
    
    # To iterate across all the node's samples' feature values.
    cdef SIZE_t prev_pos, pos, node_end
    cdef DTYPE_t lowest, next_lowest, mid
    
    prev_pos, pos = node_start, node_start
    node_end = node_start + n_parent
    lowest = items[pos]
    l_num, l_den = 0., 0.
    r_num, r_den = parent_num, parent_den
    
    # Find the best split and store its score, threshold, position,
    # as well as it's children's weighted class counts.
    while pos < node_end:
        while items[pos] == lowest: # When consecutive items have the same value.
            if pos == node_end - 1: # When the final few samples all have the same value.
                return
            pos+=1
        next_lowest = items[pos]
        mid = lowest/2. + next_lowest/2. # Split threshold is always the mid-point between two consecutive values.
        if mid == next_lowest: mid = lowest

        # Move samples from the left to right child when it's quicker to do so.
        if pos-prev_pos > node_end-pos-1:
            l_num, l_den = parent_num, parent_den
            r_num, r_den = 0., 0.
            memcpy(l_wcc, parent_wcc, n_class*sizeof(DTYPE_t))
            memset(r_wcc, 0, n_class*sizeof(DTYPE_t))
            for r in reversed(range(pos, node_end)):
                row = rows[r]
                label = labels[row]; w = c_wts[label]
                r_num += w*( 2*r_wcc[label] + w); r_den += w
                l_num += w*(-2*l_wcc[label] + w); l_den -= w 
                r_wcc[label] += w; l_wcc[label] -= w
        else:
            for r in range(prev_pos, pos):
                row = rows[r]
                label = labels[row]; w = c_wts[label] 
                l_num += w*( 2.*l_wcc[label] + w); l_den += w
                r_num += w*(-2.*r_wcc[label] + w); r_den -= w
                l_wcc[label] += w; r_wcc[label] -= w

        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf.
        if pos - node_start < min_samples_leaf: 
            lowest = next_lowest
            prev_pos = pos; pos+=1 
            continue
        elif node_end - pos < min_samples_leaf:
            return
        # l_den and r_den are left and right children's weighted sample sums.
        elif l_den < min_weight_leaf: 
            lowest = next_lowest
            prev_pos = pos; pos+=1 
            continue
        elif r_den < min_weight_leaf:
            return

        current_feat_const[0] = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Only update best split stats if current score beats all
            # other best found among all other features already explored 
            # at the current node.
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat
        lowest = next_lowest
        prev_pos = pos; pos+=1

cdef inline void find_num_split_largeQ(SIZE_t* rows, SIZE_t* split_point_idxs, DTYPE_t* unique_vals_feats, 
                                       SIZE_t* labels, SIZE_t node_start, SIZE_t n_parent, SIZE_t n_samples,
                                       SIZE_t n_class, SIZE_t min_samples_leaf, DTYPE_t min_weight_leaf, 
                                       DTYPE_t* c_wts, DTYPE_t* l_wcc, DTYPE_t* r_wcc, DTYPE_t* parent_wcc, 
                                       Split* best_split, SIZE_t current_feat, DTYPE_t parent_num, 
                                       DTYPE_t parent_den, SIZE_t n_unique_vals_feat, SIZE_t max_n_unique_vals, 
                                       SIZE_t* split_counts_raw, DTYPE_t* split_counts_wt, 
                                       DTYPE_t* split_class_counts_wt, bint* current_feat_const) nogil:
    """Calculates the impurity score of each eligible split threshold in a 
    decision tree node that belongs to a single numerical feature.

    Uses Marvin Wright's LargeQ splitting algorithm:
        https://github.com/imbs-hl/ranger/blob/5f71872d7b552fd2cf652daab92416f52976df86/src/TreeClassification.cpp#L316
    
    Saves a split's feature idx, threshold, and impurity score if the
    score is a new best for the node.
    
    Arguments:
        rows                   : Indices of all rows in the training set. Shape: (n train samples,).
        split_point_idxs       : All numerical feature split-point locations for all rows.
                                 Shape: (n training samples, n features).
        unique_vals_feats      : Columns contain sorted unique values for all features. 
                                 Shape: (max cardinality of all feats, n features).
                                 node (beginning index 0). Shape: (n train samples,).
        labels                 : All training labels. Shape: (n training samples,).
        node_start             : Index of the beginning of the parent node in `rows`.
        n_parent               : Number of samples in the parent node.
        n_samples              : Number of samples in the training data.
        n_class                : Number of unique classes in the training set.
        min_samples_leaf       : Any leaf will have no fewer than this many samples.
        min_weight_leaf        : Total weight of any leaf's samples will be at least this much.
        c_wts                  : Class weights. Shape: (`n_class`,).
        l_wcc                  : Left child's weight class counts. Shape: (`n_class`,).
        r_wcc                  : Right child's weight class counts. Shape: (`n_class`,).
        parent_wcc             : Parent node's weight class counts. Shape: (`n_class`,).
        best_split             : Holds the feature, threshold and impurity
                                 score of the parent node's current best split.
        current_feat           : Column index of feature under investigation.
        parent_num             : Numerator of parent node's impurity score.
        parent_den             : Denominator of parent node's impurity score.
        n_unique_vals_feat     : Number of unique values for one feature found among 
                                 all training samples.
        max_n_unique_vals      : Maximum cardinality of all features in dataset.
        split_counts_raw       : Stores sample counts found at each split point of
                                 a given feature in a given node. 
                                 Shape: (<max cardinality of all numerical feats in dataset>,).
        split_counts_wt        : Stores weighted sample counts found at each split point of
                                 a given feature in a given node. 
                                 Shape: (<max cardinality of all numerical feats in dataset>,).
        split_class_counts_wt  : Stores weighted class counts of each class at each
                                 split point of a given feature in a given node. Shape:
                                 (<max cardinality of all numerical feats in dataset> x `n_class`,)
        current_feat_const     : Whether current splitting feature is constant for all eligible split 
                                 thresholds in the current node. 1 if yes, 0 otherwise.
    """
    # Variables used while tabulating raw and weighted sample counts, 
    # as well as weighted class counts at all unique split points.
    cdef SIZE_t row, label, split_point_idx
    
    # Make sure node's samples aren't all constant for feature.
    cdef SIZE_t n_splits_node = 0
    
    # Variables to track progress during the split search.
    cdef SIZE_t n_left, n_right, i, j, k
    
    # Variables used to calculate proxy gini scores.
    cdef DTYPE_t l_num, l_den, r_num, r_den, wt, score, mid
    
    # Tabulate sample counts, weighted counts, and weighted class counts at 
    # each split point. Values at split points not belonging to node's rows
    # will remain zero.
    memset(split_counts_raw, 0, sizeof(SIZE_t)*n_unique_vals_feat)
    memset(split_counts_wt, 0, sizeof(DTYPE_t)*n_unique_vals_feat)
    memset(split_class_counts_wt, 0, sizeof(SIZE_t)*n_unique_vals_feat*n_class)
    for i in range(n_parent):
        row = rows[node_start + i]
        label = labels[row]
        wt = c_wts[label]
        split_point_idx = split_point_idxs[n_samples*current_feat + row]
        split_counts_raw[split_point_idx] += 1
        split_counts_wt[split_point_idx] += wt
        split_class_counts_wt[split_point_idx*n_class + label] += wt
        
    # If feat is constant for the node.
    for i in range(n_unique_vals_feat):
        if split_counts_raw[i] > 0: n_splits_node += 1
    if n_splits_node < 2: return
    
    # To keep track of num samples in left child.
    n_left = 0    
    # Left child's proxy gini score denominator.
    l_den = 0.
    
    # Search for the threshold of the best split.
    for i in range(n_unique_vals_feat - 1):
        if split_counts_raw[i] == 0: continue # Move to next split-point if no samples at this one.
        
        n_left += split_counts_raw[i]
        n_right = n_parent - n_left
        if n_right == 0: return # Make sure to stop search when right child empty.
        
        # Calculate denominators of proxy gini scores.
        l_den += split_counts_wt[i]
        r_den = parent_den - l_den

        # Calculate numerators of proxy gini scores.
        l_num, r_num = 0., 0. 
        for j in range(n_class):
            # Can't do the on-line proxy gini update algorithm cause we
            # move all samples from a given class over to the left side 
            # before updating the calculation.
            wt = split_class_counts_wt[i*n_class + j]
            l_wcc[j] += wt
            r_wcc[j] -= wt
            l_num += l_wcc[j]*l_wcc[j]
            r_num += r_wcc[j]*r_wcc[j]

        # Only investigate split-points that satisfy min_samples_leaf and min_weight_leaf
        if n_left < min_samples_leaf: continue
        elif n_right < min_samples_leaf: return
        elif l_den < min_weight_leaf: continue
        elif r_den < min_weight_leaf: return

        current_feat_const[0] = 0 # If we can compute a score, current feat not constant.
        score = (l_num/l_den) + (r_num/r_den) # Proxy gini score.
        if score > best_split.score: 
            # Find raw feature value of sample(s) at next-closest split-point.
            k = i+1
            while split_counts_raw[k] == 0: k+=1
            # Split threshold is always the mid-point between two consecutive values.
            mid = (unique_vals_feats[current_feat*max_n_unique_vals + i]/2. + 
                   unique_vals_feats[current_feat*max_n_unique_vals + k]/2.) 
            if mid == unique_vals_feats[current_feat*max_n_unique_vals + k]: 
                mid = unique_vals_feats[current_feat*max_n_unique_vals + i]
            best_split.score, best_split.thresh, best_split.feat = score, mid, current_feat

cdef inline SIZE_t make_num_split(SIZE_t* rows, DTYPE_t* X, StackEntry* node_info, Split* best_split, 
                                SIZE_t n_samples) nogil:
    cdef SIZE_t p, p_end
    p, p_end = node_info.start, node_info.end
    while p < p_end:
        if X[best_split.feat*n_samples + rows[p]] <= best_split.thresh: p+=1
        else: p_end-=1; rows[p], rows[p_end] = rows[p_end], rows[p] 
    return p

# Necessary constants.
cdef DTYPE_t NEG_INF = -np.inf
cdef DTYPE_t NAN = np.nan
cdef DTYPE_t Q_THRESHOLD = 0.02
            
cdef class _DecisionTree:
    # Class attributes.
    cdef SIZE_t seed
    cdef mt19937 rng
    cdef SIZE_t mem_capacity
    cdef SIZE_t n_samples
    cdef SIZE_t n_features
    cdef SIZE_t n_class
    cdef SIZE_t m
    cdef SIZE_t min_samples_leaf, 
    cdef DTYPE_t min_weight_fraction_leaf
    cdef DTYPE_t min_weight_leaf
    cdef SIZE_t n_nodes
    cdef SIZE_t max_n_unique_feat_vals
    cdef SIZE_t* n_unique_vals_feats
    cdef SIZE_t* rows
    cdef SIZE_t* features
    cdef DTYPE_t* class_weights
    cdef Node* nodes
    cdef DTYPE_t* weighted_class_counts
    def __cinit__(self, SIZE_t m, SIZE_t min_samples_leaf, DTYPE_t min_weight_fraction_leaf, SIZE_t seed): 
        """
        Arguments:
            m                       : Number of candidate features randomly selected to try to split each node.
            min_samples_leaf        : Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf: Total weight of any leaf's samples must comprise this portion 
                                      of the sum of weights of *all* training samples used to fit the tree.
            seed                    : A seed for the C++ mt19937 32bit int random generator. 
                                      Use when reproducibility is desired.
        """
        self.m, self.min_samples_leaf = m, min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.seed = seed
        
        # The Decision Tree data structure: a 1-d array of nodes. Index of 
        # each node in this array is its "node id." Root node's id is 0.
        # Each `Node` object in the array contains that node's:
        #     - left child node id
        #     - right child node id
        #     - split feature column index
        #     - numerical split threshold
        #     - class label
        self.nodes = NULL
        
        # Tree nodes' weighted class counts. Will ultimately be a 
        # 1-d array of length: n_nodes * n_class.
        self.weighted_class_counts = NULL 
        
    def __dealloc__(self):
        free(self.nodes)
        free(self.weighted_class_counts)
        
    property size:
        def __get__(self):
            return self.n_nodes
    
    property left_children:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].l_child
            return out

    property right_children:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].r_child
            return out
        
    property split_features: 
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].feat
            return out
        
    property split_thresholds:
        def __get__(self):
            out = np.empty(self.n_nodes, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].thresh
            return out
        
    property weighted_cc:
        def __get__(self):
            cdef SIZE_t out_size = self.n_nodes*self.n_class
            out = np.empty(out_size, dtype='float64')
            cdef DTYPE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(out_size):
                    out_view[i] = self.weighted_class_counts[i]
            out.resize(self.n_nodes, self.n_class)
            return out
    
    property labels:
        def __get__(self): 
            out = np.empty(self.n_nodes, dtype='int')
            cdef SIZE_t[::1] out_view = out
            cdef SIZE_t i
            with nogil:
                for i in range(self.n_nodes):
                    out_view[i] = self.nodes[i].label
            return out
    
    cdef void _increase_mem_capacity(self, SIZE_t new_capacity) nogil:
        self.nodes = safe_realloc(self.nodes, new_capacity)
        self.weighted_class_counts = safe_realloc(self.weighted_class_counts, self.n_class*new_capacity)
    
    cdef void _make_leaf(self, Node* leaf_node, SIZE_t* y, SIZE_t node_start, SIZE_t node_id, 
                         SIZE_t n_classes_node, SIZE_t* max_wt_classes) nogil:
        # Class with largest wcc becomes leaf node's label. Break ties with a random choice.
        cdef SIZE_t label
        cdef DTYPE_t max_wt = 0.
        cdef SIZE_t lb = 0
        cdef SIZE_t ub = -1
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t i, j
        # If all node's samples have the same class.
        if n_classes_node == 1:
            label = y[self.rows[node_start]]
        else:
            # Otherwise find label with max weighted class count for the node.
            for i in range(self.n_class):
                max_wt = max(max_wt, self.weighted_class_counts[node_id*self.n_class + i])
            # See if multiple classes share this max count.
            for i in range(self.n_class):
                if self.weighted_class_counts[node_id*self.n_class + i] == max_wt:
                    ub += 1
                    max_wt_classes[ub] = i
            # If so, randomly choose leaf's label from among those classes.
            if ub > 0:
                dist = uniform_int_distribution[SIZE_t](lb, ub) # Choose an int w/in range lb, ub, inclusive.
                j = dist(self.rng)
                label = max_wt_classes[j]
            else:
                label = max_wt_classes[lb]
        leaf_node.l_child = -1
        leaf_node.r_child = -1    
        leaf_node.feat = -1  
        leaf_node.thresh = NAN
        leaf_node.label = label 

    cdef _grow_tree(self, DTYPE_t* X, SIZE_t* y, SIZE_t* split_point_idxs, DTYPE_t* unique_vals_feats):
        # LIFO stack holding all nodes still to be investigated.
        cdef stack[StackEntry] node_stack

        #####################################################################
        # Variables containing info of the node currently being investigated.
        #####################################################################
        cdef SIZE_t start, end, node_id, parent_id, n_consts, n_samples_node
        cdef DTYPE_t* node_wcc = NULL
        cdef StackEntry node_info
        cdef Node* node = NULL
        
        # Holds child node info if the current node gets split.
        cdef SIZE_t l_child_id, r_child_id
        cdef Node* l_child_node = NULL
        cdef Node* r_child_node = NULL
        
        #####################################################################
        # For finding the best split.
        #####################################################################
        cdef Split best_split
        cdef DTYPE_t* l_wcc = NULL
        cdef DTYPE_t* r_wcc = NULL
        cdef DTYPE_t sum_node_wcc_sqr, sum_node_wcc # Parent node's proxy Gini score num and den.
        cdef SIZE_t split_pos
        
        # Indicates a feature has been discovered to be constant during a
        # split search within the search range permitted by min_samples_leaf 
        # and min_weight_leaf.
        cdef bint current_feat_const 
        
        # Determines whether to use SmallQ or LargeQ splitting. 
        cdef DTYPE_t q
        cdef SIZE_t n_unique_vals_feat

        # Using Louppe splitting for SmallQ. A C-contiguous array of doubles to 
        # hold feature values of a given node's samples. Using Numpy to allocate 
        # memory to longer vectors is often faster than using realloc().
        cdef DTYPE_t[::1] items_buffer = np.empty(self.n_samples, dtype=np.float64)
        cdef DTYPE_t* items = &items_buffer[0]
        cdef SIZE_t r

        # Three 1-d arrays containing raw and weighted sample counts, as
        # well as weighted class counts for each unique raw feature value.
        # Used for both SmallQ and LargeQ splitting, except for split_counts_wt,
        # which is just used for LargeQ.
        
        # Raw, non-weighted, sample counts at each split-point.
        cdef SIZE_t[::1] split_counts_raw_buffer = np.empty(self.max_n_unique_feat_vals, dtype=np.intp) 
        cdef SIZE_t* split_counts_raw = &split_counts_raw_buffer[0]
        # Weighted sample counts at each split-point.
        cdef DTYPE_t[::1] split_counts_wt_buffer = np.empty(self.max_n_unique_feat_vals, dtype=np.float64)
        cdef DTYPE_t* split_counts_wt = &split_counts_wt_buffer[0]
        # Weighted class counts for each split-point.
        cdef DTYPE_t[::1] split_class_counts_wt_buffer = np.empty(self.n_class*self.max_n_unique_feat_vals, dtype=np.float64) 
        cdef DTYPE_t* split_class_counts_wt = &split_class_counts_wt_buffer[0]
        
        #####################################################################
        # For random feature selection (w/out replacement) and keeping track 
        # of nodes' constant features. 
        #####################################################################
        cdef uniform_int_distribution[SIZE_t] dist
        cdef SIZE_t lb, ub, idx, feat_idx, n_drawn_feats, n_new_consts, n_total_consts
        cdef SIZE_t[::1] features_buffer = np.empty(self.n_features, dtype=np.intp) 
        cdef SIZE_t* features = &features_buffer[0]
        cdef SIZE_t[::1] constant_features_buffer = np.empty(self.n_features, dtype=np.intp)
        cdef SIZE_t* constant_features = &constant_features_buffer[0]
        
        #####################################################################
        # For determining whether node should be a leaf.
        #####################################################################
        cdef SIZE_t i, c, cc, n_classes_node, row, label
        cdef DTYPE_t wcc, wt
        # Stores classes that share a leaf's max class wt. When two or more 
        # present, leaf label randomly chosen from these classes
        cdef SIZE_t* max_wt_classes = NULL
        
        with nogil:
            # Allocate memory to pointers.
            l_wcc = safe_realloc(l_wcc, self.n_class)
            r_wcc = safe_realloc(r_wcc, self.n_class)
            node_wcc = safe_realloc(node_wcc, self.n_class)
            max_wt_classes = safe_realloc(max_wt_classes, self.n_class*sizeof(SIZE_t))
            # Fill with feature column indices so we can track constant feats.
            memcpy(features, self.features, self.n_features* sizeof(SIZE_t))
            
            # Push root node onto the LIFO stack.
            node_stack.push({"start": 0, "end": self.n_samples, "node_id": 0, 
                             "parent_id": 0, "n_const_feats": 0})
            self.n_nodes = 1
            while not node_stack.empty():
                node_info = node_stack.top()
                node_stack.pop()
                start, end = node_info.start, node_info.end
                node_id, parent_id = node_info.node_id, node_info.parent_id # TODO: `parent_id` unused; is it necessary?
                n_consts = node_info.n_const_feats
                n_samples_node = end-start
                node = &self.nodes[node_id]
                
                # Tabulate the current node's weighted class counts.
                #
                # Implementation detail #1: I tried storing the l and r child wt class cts
                # of nodes' best splits so that this tabulation wouldn't need to be 
                # performed for each node. But found there was virtually no speed improvement
                # to justify the more complicated code required to store and update these 
                # values during the best split search.
                #
                # Implementation detail #2: Setting aside a block of memory to 
                # store the current node's wt class cts and passing a pointer to
                # this block to the split search function sped up training by 8%
                # compared to passing a ptr to the location of node's wt class cts 
                # in the self.weighted_class_counts array.
                memset(node_wcc, 0, self.n_class*sizeof(DTYPE_t))
                sum_node_wcc, sum_node_wcc_sqr = 0., 0.
                for i in range(n_samples_node):
                    row = self.rows[start + i]
                    label = y[row]
                    wt = self.class_weights[label]
                    # Compute the node's proxy gini numerator and denominator while we're at it.
                    sum_node_wcc_sqr += wt*(2*node_wcc[label] + wt) # numerator
                    sum_node_wcc += wt                              # denominator
                    node_wcc[label] += wt
                memcpy(&self.weighted_class_counts[node_id*self.n_class], node_wcc, self.n_class*sizeof(DTYPE_t))
                
                # Make a leaf if required to do so. 
                n_classes_node = 0
                for c in range(self.n_class):
                    wcc = node_wcc[c]
                    if wcc > 0: n_classes_node += 1
                if n_classes_node == 1:                   
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif n_samples_node < 2*self.min_samples_leaf:  
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                elif sum_node_wcc < 2.*self.min_weight_leaf: 
                    self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)

                # Otherwise split the node.
                else:
                    # Initialize stats for best split of node.
                    best_split.feat = -1
                    best_split.thresh = 0.
                    best_split.score = NEG_INF

                    # Ensure feats drawn w/out replacement.
                    n_drawn_feats = 0
                    n_new_consts = 0
                    n_total_consts = n_consts
                    lb = 0                      # Range in `features` array from which we 
                    ub = self.n_features - 1    # randomly select a feature's column index. 
                        
                    while n_drawn_feats < self.m:
                        n_drawn_feats += 1

                        # Breiman & Cutler's original Fortran random forests implementation 
                        # allows for known constant features to be drawn during a split-search.
                        # I follow their example, as I believe that doing so allows individual 
                        # trees to be less correlated with each other. Since I don't pre-sort
                        # features, I would prefer not to have to sort any more features than
                        # necessary, and so I've adopted the technique Sklearn uses to track 
                        # constant features:
                        #     https://github.com/scikit-learn/scikit-learn/blob/dbe39454f766ebefc3219f2c1871ac1774316532/sklearn/tree/_splitter.pyx#L310
                        # 
                        # The idea is that feature idxs in `features` are organized into two sections:
                        #
                        #     [<indices of known constant feats>, <indices of non-constant feats>]
                        #
                        # As we begin drawing feature indices from this above list, those two sections
                        # will each be further sub-divided into two sections:
                        # 
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <undrawn non-constant feats>, <drawn non-constant feats>]
                        #
                        # When we choose a feature that happens to be a known constant, we'll re-locate
                        # its idx to the right-end of the first of those four sections. Then we 
                        # increment the lower bound threshold, `lb`, by one so that we don't re-draw 
                        # that feature again.
                        #
                        # Similarly, if we draw a non-constant feature idx, we'll move it to the 
                        # left-end of the last of the four partitions and reduce the upper bound
                        # threshold, `ub`, by one so that the feature idx can't be drawn again
                        # during this split-search. 
                        #
                        # One last important detail: sometimes we'll draw a feature that 
                        # used to be non-constant for ancestor nodes, but will be found to be 
                        # constant for the current node. When this happens, we relocate its 
                        # index so that it sits to the right of the known constant feats section.
                        # This means our `features` list could have up to five partitions:
                        #
                        #     [<drawn known constant feats>, <undrawn known constant feats>, 
                        #      <newly discovered const feats>, <undrawn non-constant feats>, 
                        #      <drawn non-constant feats>]
                        #
                        # Whenever we find a new constant feature, we increment the `n_new_consts`
                        # counter by one. We also increment the `n_total_consts` counter by one. 
                        # During the split-search we have to use `n_total_consts` to keep track of
                        # the total number of constant features. n_consts` mustn't be changed
                        # because it tells us where the <newly discovered const feats> section
                        # of the `features` list begins.

                        # One last wrinkle. We subtract the # of newly discovered const feats from  
                        # the upper bound before we select an index `i` from the `features` array, 
                        # and add it back to `i` after `i` has been genereated. This prevents us from 
                        # re-drawing any of these new const feats again during this split-search.
                        dist = uniform_int_distribution[SIZE_t](lb, ub-n_new_consts)
                        idx = dist(self.rng)

                        # So that we don't draw a known constant feature again this split-search.
                        if idx < n_consts:
                            features[idx], features[lb] = features[lb], features[idx]
                            lb += 1 
                            continue

                        # So that no new const feats get drawn more than once per split-search.
                        idx += n_new_consts

                        feat_idx = features[idx]
                        
                        # Num split points found among training samples for given feat.
                        n_unique_vals_feat = self.n_unique_vals_feats[feat_idx]
                        
                        q = n_samples_node/n_unique_vals_feat
                        
                        # SmallQ Splitting.
                        if q < Q_THRESHOLD:
                            # Prepare the rows' feature values for sorting.
                            for r in range(start, end):
                                # X is a pointer, so have to index into this 2d array in the C way 
                                # (also keeping in mind that the array is column-major).
                                items[r] = X[feat_idx*self.n_samples + self.rows[r]]

                            # Sort feature values and corresponding sample row indices
                            # to prepare for numerical split finding.
                            dual_introsort(items, self.rows, start, end)

                            # Make sure the feature not constant for node's samples.
                            if items[start] == items[end-1]:
                                # Move the newly-discovered constant feat to the far right-end
                                # of the left half of `features` list holding the known const
                                # feats as well as any other const feats newly discovered 
                                # during this node's split-search.
                                features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                                n_new_consts += 1
                                n_total_consts += 1
                                continue
                            else:
                                # Initialize weighted class counts of right and left children.
                                # Right child's counts are initially the same as parent node's.
                                memcpy(r_wcc, node_wcc, self.n_class*sizeof(DTYPE_t))
                                memset(l_wcc, 0, self.n_class*sizeof(DTYPE_t))

                                # If the feature has an impurity score that's better than the best score 
                                # found among all other features visited thus far for this node, find_num_split()
                                # updates the attributes of the struct containing the node's best split info. 
                                # 
                                # But even if a new best score isn't reached, if an impurity score can
                                # be calculated at least once during the feature's split search, the
                                # following indicator will be toggled off, to indicate that the feature
                                # is not constant.
                                current_feat_const = 1 # 1 = is constant; 0 = not constant
                                find_num_split(self.rows, items, y, start, n_samples_node, 
                                               self.n_class, self.min_samples_leaf, self.min_weight_leaf, 
                                               self.class_weights, l_wcc, r_wcc, node_wcc,
                                               &best_split, feat_idx, sum_node_wcc_sqr, sum_node_wcc,
                                               &current_feat_const)
                                
                        # LargeQ Splitting.
                        else:
                            memcpy(r_wcc, node_wcc, self.n_class*sizeof(DTYPE_t))
                            memset(l_wcc, 0, self.n_class*sizeof(DTYPE_t))
                            current_feat_const = 1 # 1 = is constant; 0 = not constant
                            find_num_split_largeQ(self.rows, split_point_idxs, unique_vals_feats, y, start, 
                                                  n_samples_node, self.n_samples, self.n_class, self.min_samples_leaf, 
                                                  self.min_weight_leaf, self.class_weights, l_wcc, r_wcc, node_wcc, 
                                                  &best_split, feat_idx, sum_node_wcc_sqr, sum_node_wcc, n_unique_vals_feat, 
                                                  self.max_n_unique_feat_vals, split_counts_raw, split_counts_wt, 
                                                  split_class_counts_wt, &current_feat_const)

                        if current_feat_const:
                            # The feature may be constant within the search range permitted
                            # by self.min_samples_leaf and self.min_weight_leaf. If so, 
                            # the feature is a newly discovered constant.
                            features[idx], features[n_total_consts] = features[n_total_consts], features[idx]
                            n_new_consts += 1
                            n_total_consts += 1
                            continue
                        else:
                            # The feature is non-constant, so we ensure it's not drawn again
                            # during this split-search.
                            features[idx], features[ub] = features[ub], features[idx]
                            ub -= 1 

                    # To ensure that the constant features info is accurate for sibling or child nodes.
                    memcpy(&features[0], &constant_features[0], sizeof(SIZE_t)*n_consts)
                    memcpy(&constant_features[n_consts], &features[n_consts], sizeof(SIZE_t)*n_new_consts)

                    # Make node a leaf if constant for all randomly drawn feats.
                    # (# drawn known constant feats + # drawn new constant feats)
                    if lb + n_new_consts == n_drawn_feats: 
                        self._make_leaf(node, y, start, node_id, n_classes_node, max_wt_classes)
                    else: 
                        split_pos = make_num_split(self.rows, X, &node_info, &best_split, self.n_samples) 

                        # Update tree info for node that's getting split.
                        l_child_id = self.n_nodes
                        r_child_id = l_child_id + 1
                        node.l_child = l_child_id
                        node.r_child = r_child_id
                        node.feat    = best_split.feat
                        node.thresh  = best_split.thresh
                        node.label   = -1

                        # Prepare for the left and right child nodes
                        # by increasing tree data memory capacity if
                        # necessary.
                        if self.n_nodes + 2 > self.mem_capacity:
                            # Expand memory capacity geometrically. See "geometric growth" 
                            # part of WhozCraig's SO answer at: 
                            #     https://stackoverflow.com/a/51665863/8628758.
                            # Add one after squaring so that the new capacity can
                            # contain not only a tree of greater depth, but also
                            # the maximum # nodes that that depth could have.
                            new_capacity = 2*self.mem_capacity + 1
                            self._increase_mem_capacity(new_capacity)
                            self.mem_capacity = new_capacity
                        
                        # Push right child info onto the LIFO stack.
                        node_stack.push({"start": split_pos, "end": end, "node_id": r_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})
                        # Push left child info onto queue.
                        node_stack.push({"start": start, "end": split_pos, "node_id": l_child_id, 
                                         "parent_id": node_id, "n_const_feats": n_total_consts})

                        # And update size of the tree.
                        self.n_nodes += 2
                        
        free(l_wcc)
        free(r_wcc)
        free(node_wcc)
        free(max_wt_classes)
    
    def fit(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X, np.ndarray[SIZE_t, ndim=1, mode="c"] y,
            np.ndarray[SIZE_t, ndim=2, mode="fortran"] split_point_idxs, 
            np.ndarray[DTYPE_t, ndim=2, mode="fortran"] unique_vals_feats,
            np.ndarray[SIZE_t, ndim=1, mode="c"] n_unique_vals_feats,
            np.ndarray[SIZE_t, ndim=1, mode="c"] rows, np.ndarray[SIZE_t, ndim=1, mode="c"] features,
            np.ndarray[DTYPE_t, ndim=1, mode="c"] class_weights, SIZE_t n_class): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X       (2D Fortran-contiguous array of float64): Pre-processed training data.
            y                 (1D C-contiguous array of int): Training labels.
            split_point_idxs                (ndarray of int): All numerical feature split-point locations for 
                                                              all rows. Shape: (n training samples, n features).
            unique_vals_feats           (ndarray of float64): Columns contain sorted unique values for all features.
                                                              Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats                (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows              (1D C-contiguous array of int): Indices of the rows to be used for training. 
            feats             (1D C-contiguous array of int): Column indices of training features.
            class_weights (1D C-contiguous array of float64): Desired weight for each class. Shape: (`n_class`,).
            n_class                                         : Number of classes in training data. 
        """
        # Casting the raw data to pointers gives a 17% speed-up compared to getting
        # pointer from the ndarray's buffer interface, as recommended by DavidW in 
        # his SO answer at: https://stackoverflow.com/a/54832269/8628758. e.g.
        #     cdef DTYPE_t[::1,:] X_buffer = X
        #     cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        # Not worried about unexpected behavior as all ndarrays' contiguousness and
        # memory layout enforced prior to this point.
        cdef DTYPE_t* X_ptr = <DTYPE_t*> X.data
        cdef SIZE_t* y_ptr = <SIZE_t*> y.data
        cdef SIZE_t* split_point_idxs_ptr = <SIZE_t*> split_point_idxs.data
        cdef DTYPE_t* unique_vals_feats_ptr = <DTYPE_t*> unique_vals_feats.data
        self.n_unique_vals_feats = <SIZE_t*> n_unique_vals_feats.data
        self.rows = <SIZE_t*> rows.data
        self.features = <SIZE_t*> features.data
        self.class_weights = <DTYPE_t*> class_weights.data
        self.n_class = n_class
        self.n_samples = rows.shape[0]
        self.n_features = features.shape[0]
        cdef random_device rd # Needed when using the C++ mt19937 rng w/out a seed.
        
        # Get the max cardinality of all numerical feats.
        self.max_n_unique_feat_vals = n_unique_vals_feats.max()
        
        # Why initialize tree memory to hold 15 nodes? For a given 
        # depth, d >= 1, a tree will have a maximum of d^2 - 1 nodes. 
        # i.e. at d=1 a tree only has its root node. When d = 2, the 
        # tree has 3 nodes. If d=3, a tree will have 2^3 - 1 = 7 nodes, 
        # etc. 15 is the max # of nodes a tree of depth=4 could have. 
        cdef SIZE_t init_capacity = 15
        
        cdef SIZE_t i, row, label
        cdef DTYPE_t wt
        cdef DTYPE_t sum_wts = 0
        cdef Node* root_node = NULL
        with nogil:
            # Allocate memory for the tree.
            self._increase_mem_capacity(init_capacity)
            self.mem_capacity = init_capacity
 
            # And sum the class weights of all the root node's samples in
            # order to know minimum total weight a leaf must have (which
            # we must know when regularizing by min_weight_fraction_leaf.)
            for i in range(self.n_samples):
                row = self.rows[i]
                label = y_ptr[row]
                wt = self.class_weights[label]
                sum_wts += wt
            self.min_weight_leaf = self.min_weight_fraction_leaf*sum_wts
            
            # Initialize the random number generator. Followed example from:
            #     https://github.com/cython/cython/blob/9341e73aceface39dd7b48bf46b3f376cde33296/tests/run/cpp_stl_random.pyx#L16
            if self.seed == -1:
                self.rng = mt19937(rd()) # If using the random device engine std::random_device.
            else:
                self.rng = mt19937(self.seed)

        # Initiate tree building.
        self._grow_tree(X_ptr, y_ptr, split_point_idxs_ptr, unique_vals_feats_ptr)
    
    cdef Node* _next_node(self, SIZE_t nxt) nogil: 
        return &self.nodes[nxt]
    
    cdef SIZE_t _get_leaf_idx(self, SIZE_t i, Node* leaf, SIZE_t n, DTYPE_t* X) nogil:
        cdef SIZE_t idx
        cdef SIZE_t root_idx = 0
        leaf = self._next_node(root_idx)
        while leaf.label == -1:
            if X[leaf.feat*n + i] <= leaf.thresh:
                idx = leaf.l_child
                leaf = self._next_node(idx)
            else: 
                idx = leaf.r_child
                leaf = self._next_node(idx)
        return idx
    
    def predict(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate class predictions for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of int: Class predictions. Shape: (`X.size`,).
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_preds = X.shape[0]
        cdef SIZE_t i
        preds = np.empty(n_preds, dtype=np.intp)
        cdef SIZE_t[::1] preds_view = preds
        cdef Node leaf
        with nogil:
            for i in range(n_preds): 
                preds_view[i] = self.nodes[self._get_leaf_idx(i, &leaf, n_preds, X_ptr)].label
        return preds
    
    def predict_probs(self, np.ndarray[DTYPE_t, ndim=2, mode="fortran"] X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D Fortran-contiguous ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions. Shape: (`X.size`, `self.n_class`)
        """
        cdef DTYPE_t[::1,:] X_buffer = X
        cdef DTYPE_t* X_ptr = &X_buffer[0,0]
        cdef SIZE_t n_probs = X.shape[0]
        wcc = np.empty(n_probs*self.n_class, dtype=np.float64)
        cdef DTYPE_t[::1] wcc_view = wcc
        cdef Node leaf
        cdef SIZE_t i, j, idx
        with nogil:
            for i in range(n_probs):
                idx = self._get_leaf_idx(i, &leaf, n_probs, X_ptr)
                for j in range(self.n_class):
                    wcc_view[i*self.n_class + j] = self.weighted_class_counts[idx*self.n_class + j]
        wcc.resize(n_probs, self.n_class)
        sums = np.sum(wcc, axis=1)[:,None]
        return np.divide(wcc, sums)

class DecisionTreeLouppeWrightCython():
    """Fit a decision tree classifier using a depth-first tree 
    growth algorithm. 
    
    Uses Louppe's splitter for SmallQ splits and Wright's splitter for LargeQ splitting.
    """
    
    def __init__(self, m, min_samples_leaf=1, min_weight_fraction_leaf=0., class_weights = [], seed=None): 
        """
        Arguments:
            m                            (int): Number of candidate features randomly selected to try to split each node.
            min_samples_leaf             (int): Any leaf will have no fewer than this many samples.
            min_weight_fraction_leaf (float64): Total weight of any leaf's samples must comprise this portion 
                                                of the sum of weights of *all* training samples used to fit the tree.
            seed                         (int): Use when reproducibility is desired.
        """
        self.m = m
        self.min_samples_leaf, self.min_weight_fraction_leaf = min_samples_leaf, min_weight_fraction_leaf
        self.class_weights = np.array(class_weights, dtype=np.float64, order='C') 
        if seed is None:
            self.seed = -1
        else:
            self.seed = seed
        self._tree = _DecisionTree(self.m, self.min_samples_leaf, self.min_weight_fraction_leaf, self.seed)
        
    @property
    def size(self): return self._tree.size
    
    @property
    def left_children(self): return self._tree.left_children
    
    @property
    def right_children(self): return self._tree.right_children
            
    @property 
    def split_features(self): return self._tree.split_features

    @property 
    def split_thresholds(self): return self._tree.split_thresholds
    
    @property
    def weighted_class_counts(self): return self._tree.weighted_cc
    
    @property
    def labels(self): return self._tree.labels
    
    def fit(self, X, y, split_point_idxs, unique_feat_vals, n_unique_vals_feats, rows=[], features=[]): 
        """Fit a decision tree classifier model.
        
        Arguments:
            X       (2D Fortran-contiguous array of float64): Pre-processed training data.
            y                 (1D C-contiguous array of int): Training labels.
            split_point_idxs                (ndarray of int): All numerical feature split-point locations for 
                                                              all rows. Shape: (n training samples, n features).
            unique_vals_feats           (ndarray of float64): Columns contain sorted unique values for all features.
                                                              Shape: (max cardinality of all feats, n features).
            n_unique_vals_feats                (ndarray int): Cardinality of each feature. Shape: (n features,).
            rows              (1D C-contiguous array of int): Indices of the rows to be used for training. 
                                                              All rows used if empty.
            feats             (1D C-contiguous array of int): Column indices of training features.
                                                              All rows used if empty.
                                                              
        Returns:
            DecisionTreeLouppeWrightCython: A decision tree object.
        """
        if len(rows) > 0:
            self.rows = np.array(rows, dtype='int', order='C')
        else:
            self.rows = np.arange(0, X.shape[0], 1)
            
        if len(features) > 0:
            self.features = np.array(features, dtype='int', order='C')
        else:
            self.features = np.arange(0, X.shape[1], 1)
        
        self.n_class = np.unique(y).size
        if len(self.class_weights) == 0: 
            self.class_weights.resize(self.n_class, refcheck=False)
            self.class_weights[:] = 1.
            
        self._tree.fit(X, y, split_point_idxs, unique_feat_vals, n_unique_vals_feats, 
                       self.rows, self.features, self.class_weights, self.n_class)
        return self

    def predict(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict(X)
    
    def predict_probs(self, X):
        """Generate prediction probabilities for one or more test inputs.
        
        Arguments:
            X (2D ndarray of float64): Pre-processed test samples.
            
        Returns:
            ndarray of float64: Class probability predictions.
                                Shape: (`X.size`, `self.n_class`)
        """
        return self._tree.predict_probs(X)

## Cython Louppe SmallQ/Wright LargeQ Tree's Speed on the Titanic Data

In [83]:
m = 4
dt = DecisionTreeLouppeWrightCython(m, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [84]:
dt.size

393

In [85]:
preds = dt.predict(xVal_proc)
accuracy(preds, yVal_titanic)

0.8202247191011236

In [86]:
%timeit dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

346 µs ± 5.15 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [87]:
%timeit dt.predict(xVal_proc)

7.4 µs ± 103 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


There you have it, using Louppe for SmallQ splitting speeds things up compared to a SmallQ/LargeQ hybrid that uses Wright's SmallQ algorithm. However, the winner for the Titanic dataset is still a tree that only makes splits using Wright's LargeQ algorithm:

|Splitting Algorithm|Speed on Titanic Dataset|
|---|---|
|Louppe|538 µs|
|Wright SmallQ|1.03 ms|
|Wright LargeQ|315 µs|
|Wright SmallQ/LargeQ|500 µs|
|Louppe SmallQ/Wright LargeQ|346 µs| 

Before I call it a day, I'd like to see if the above gaps in relative performance hold true when these splitting algorithms are faced with a much larger dataset. Kaggle's Santander Customer Satisfaction competition [dataset](https://www.kaggle.com/competitions/santander-customer-satisfaction/data) fits the bill nicely, given that it clocks in at 369 numerical features and over 60K rows.

## Downloading the Santander Competition's Dataset

In [88]:
santander_dir = Path.home()/'data'/'santander_cust_sat'

def get_santander_data():
    """Download and place the kaggle Santander train and test csv files into Pandas dataframes."""
    santander_dir.mkdir(parents=True, exist_ok=True)
    train_csv_path, test_csv_path = santander_dir/'train.csv', santander_dir/'test.csv'
    if not train_csv_path.exists():
        # Visit https://github.com/Kaggle/kaggle-api for more info on 
        # how to install the kaggle API and generate a kaggle.json key.
        !kaggle competitions download -c santander-customer-satisfaction --path "$santander_dir"
        with ZipFile(santander_dir/'santander-customer-satisfaction.zip', 'r') as z: z.extractall(santander_dir) 
    santander_train_df = pd.read_csv(train_csv_path); santander_train_df.drop('ID', axis=1, inplace=True)
    santander_test_df  = pd.read_csv(test_csv_path)
    return santander_train_df, santander_test_df
    
santander_train_df, _ = get_santander_data()

In [89]:
np.random.seed(42)
xTrain_santander, yTrain_santander, xVal_santander, yVal_santander = train_val_split(santander_train_df, 369)

In [90]:
xTrain_santander.shape

(60816, 369)

#### Helper function to calculate an ROC AUC score
This competition's evaluation metric was the area under the ROC curve. This is because a good model needs to be able to find all the unsatisfied customers without incorrectly labelling too many satisfied customers as "unsatisfied." ROC will do a better job of punishing a model for committing false-positive errors than a metric like accuracy would.

In [91]:
def binary_roc_auc(probs, labels, pos_label=1):
    """Calculate area under the receiver operating characteristic curve for a 2-class prediction task.
    
    To write this function, I first studied the logic spread across four sklearn files:
        https://github.com/scikit-learn/scikit-learn/blob/8479a74af207d857da4188b75375ce9d24c7ef90/sklearn/metrics/_ranking.py#L324
        https://github.com/scikit-learn/scikit-learn/blob/8479a74af207d857da4188b75375ce9d24c7ef90/sklearn/metrics/_ranking.py#L827
        https://github.com/scikit-learn/scikit-learn/blob/8479a74af207d857da4188b75375ce9d24c7ef90/sklearn/metrics/_ranking.py#L653
        https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/metrics/_ranking.py#L41
    
   And then I picked out the necessary steps and condensed into a concise function that 
   generates an ROC curve and calculates its area.
    
    Arguments:
        probs (Two-column Numpy array): Prediction probabilities of both classes for n samples. 
        labels          (list of ints): Ground-truth labels.
        pos_label                (int): Class label (either 0 or 1) of the positive class.
        
    Returns:
        The area under the ROC curve.
    
    """
    n = len(labels); row_idxs = np.array(list(range(n)))
    pos_probs = np.array(probs[:,pos_label])                    # Use pred probs from positive class.
    dual_sort(pos_probs, row_idxs, 0, n)                        # Sort pred probs and corresponding  
    pos_probs, row_idxs = np.flip(pos_probs), np.flip(row_idxs) # row indices in decreasing order.
    pos_labels = (np.array(labels)==pos_label)
    pos_labels = pos_labels[row_idxs]                           # Keep order consistent with the now-sorted pred probs.
    threshold_idxs = np.where(np.diff(pos_probs))[0]            # We only care about unique pred probs.
    threshold_idxs = np.append(threshold_idxs, n-1)             # Include idx of final sample.
    TP = np.cumsum(pos_labels)[threshold_idxs]                  # Count all true positives under each unique pred prob.
    FP = 1 + threshold_idxs - TP
    TP, FP = np.append(0, TP), np.append(0, FP)                 # Make sure ROC curve will contain origin coord.
    TPR, FPR = TP/TP[-1], FP/FP[-1]
    AUC = np.trapz(TPR, FPR)                                    # TPRs are on y-axis, FPRs on x-axis.
    return AUC

In [92]:
(xTrain_proc, yTrain_proc, nan_fillers, 
 _, split_point_idxs, unique_vals_feats, 
 n_unique_vals_feats) = preprocess_train(xTrain_santander, yTrain_santander, largeQ=True)
xVal_proc = preprocess_test(xVal_santander, nan_fillers)

## Cython Louppe Tree's Speed on the Santander Data

#### Aside: Regularizing with `min_samples_leaf`
I've found that when using a single decision tree, I get beter results on the Santander satisfaction data by regularizing my model using the `min_samples_leaf` parameter. This forces the growth of shallower tree models that generalize a bit better to unseen samples.

In [93]:
m = 10
dt = DecisionTreeLouppeCython(m, min_samples_leaf=5, seed=42)
dt.fit(xTrain_proc, yTrain_proc);

In [94]:
dt.size

1049

In [95]:
probs = dt.predict_probs(xVal_proc)
binary_roc_auc(probs, yVal_santander)

0.6986164474454197

In [96]:
%timeit -n 100 dt.fit(xTrain_proc, yTrain_proc)

137 ms ± 1.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [97]:
%timeit dt.predict_probs(xVal_proc)

5.7 ms ± 811 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Cython Wright SmallQ Tree's Speed on the Santander Data

In [98]:
m = 10
dt = DecisionTreeSmallQCython(m, min_samples_leaf=5, seed=42)
dt.fit(xTrain_proc, yTrain_proc);

In [99]:
dt.size

1049

In [100]:
probs = dt.predict_probs(xVal_proc)
binary_roc_auc(probs, yVal_santander)

0.6986164474454197

In [101]:
%timeit -n 100 dt.fit(xTrain_proc, yTrain_proc)

874 ms ± 40.5 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [102]:
%timeit dt.predict_probs(xVal_proc)

4.35 ms ± 904 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Cython Wright LargeQ Tree's Speed on the Santander Data

In [103]:
m = 10
dt = DecisionTreeLargeQCython(m, min_samples_leaf=5, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [104]:
dt.size

1049

In [105]:
probs = dt.predict_probs(xVal_proc)
binary_roc_auc(probs, yVal_santander)

0.6986164474454197

In [106]:
%timeit -n 100 dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

138 ms ± 2.35 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [107]:
%timeit dt.predict_probs(xVal_proc)

4.61 ms ± 639 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Cython Wright SmallQ/LargeQ Tree's Speed on the Santander Data

In [108]:
m = 10
dt = DecisionTreeSmallQLargeQCython(m, min_samples_leaf=5, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [109]:
dt.size

1049

In [110]:
probs = dt.predict_probs(xVal_proc)
binary_roc_auc(probs, yVal_santander)

0.6986164474454197

In [111]:
%timeit -n 100 dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

139 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [112]:
%timeit dt.predict_probs(xVal_proc)

7.25 ms ± 290 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Cython Louppe SmallQ/Wright LargeQ Tree's Speed on the Santander Data

In [113]:
m = 10
dt = DecisionTreeLouppeWrightCython(m, min_samples_leaf=5, seed=42)
dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats);

In [114]:
dt.size

1049

In [115]:
probs = dt.predict_probs(xVal_proc)
binary_roc_auc(probs, yVal_santander)

0.6986164474454197

In [116]:
%timeit -n 100 dt.fit(xTrain_proc, yTrain_proc, split_point_idxs, unique_vals_feats, n_unique_vals_feats)

129 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [117]:
%timeit dt.predict_probs(xVal_proc)

4.09 ms ± 581 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Discussion: The Louppe/Wright hybrid is fastest, but usually Louppe alone will be best
Recall that on the Titanic dataset, Wright LargeQ splitting was the fastest, closely followed by the Louppe SmallQ/Wright LargeQ hybrid. Wright's SmallQ/LargeQ hybrid and then Louppe's splitter took third and fourth place, respectively, with fairly similar speeds. Finally, the Wright SmallQ splitting brought up the rear clocking in nearly twice as slow as Louppe's splitter, and three times as slow as the best-forming Wright LargeQ splitter.

|Splitting Algorithm|Speed on Santander Dataset|
|---|---|
|Louppe|137 ms|
|Wright SmallQ|874 ms|
|Wright LargeQ|138 ms|
|Wright SmallQ/LargeQ|139 ms|
|Louppe SmallQ/Wright LargeQ|129 ms| 

However, on the much larger Santander dataset, with the exception of Wright's SmallQ splitter, the splitting algorithms' relative speeds are virtually identical. The Louppe/Wright hybrid is now best, but I don't believe that its marginal 5% speed improvement over it's closest rival (Louppe) is a large enough carrot to compensate for the cost, both in time and memory, necessary to pre-sort the data for LargeQ splitting.

Louppe splitting doesn't require any pre-sorting, and its implementation is far more straightforward than either of the two hybrid SmallQ/LargeQ approaches I experimented with. Therefore, it's my belief that the Louppe splitting algorithm is the best for most random forest implementations and the practitioners who will use those them.

#### More research necessary to determine whether Louppe will *always* be sufficient
There could conceivably be scenarios where pre-sorting/LargeQ splitting is preferable to Louppe's just-in-time sorting method. As I hypothesized at this notebook's outset, I could imagine this being the case if one were trying to fit a decision tree to billions of rows of training data that consists of low-cardinality numerical features. The costs of relocating potentially billions of floats to a contiguous memory buffer for each candidate feature, for each node, may place sole use Louppe's method at a significant disadvantage to LargeQ or a LargeQ hybrid strategy.

## Sklearn Library Tree's Speed on the Santander Data
To verify that the Cython code I've written is well-tuned and performant, let's compare the speed of the Sklearn library's [`DecisionTreeClassifier`](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html), which also uses a Cython implementation of Louppe's splitter, with the performance of my own Cython implementation of Louppe.

In [118]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion='gini', splitter='best', min_samples_leaf=5, 
                            max_features=10, random_state=79)
dt.fit(xTrain_proc, yTrain_proc);

In [119]:
dt.tree_.node_count # Number of nodes in the sklearn decision tree.

1049

In [120]:
probs = dt.predict_proba(xVal_proc)
binary_roc_auc(probs, yVal_santander)

0.6786673008005639

In [121]:
%timeit -n 100 dt.fit(xTrain_proc, yTrain_proc)

128 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [122]:
%timeit dt.predict_proba(xVal_proc)

15.3 ms ± 652 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


#### My Cython Louppe implementation is about as fast as Sklearn during training
Gratifyingly, my Cython implementation runs roughly as fast as Sklearn's. I should emphasize that despite choosing random seeds such that both Sklearn's and my own trees grew to the same size (1,049 nodes), these trees are not identical. The crucial implication of this is that any comparison of my and Sklearn's relative speeds will never be perfectly apples-to-apples.

The first reason for this is that our two tree-growing implementations use different random number generator implementations. Sklearn employs a [home-grown](https://github.com/scikit-learn/scikit-learn/blob/f9fd3b535af47986d20ad1ad06be35de10221cc2/sklearn/utils/_random.pxd#L24) version of George Marsaglia's [Xorshift RNG algorithm](http://www.jstatsoft.org/v08/i14/paper), while I rely on the C++ standard library's [version](https://cplusplus.com/reference/random/mersenne_twister_engine/) of Makoto Matsumoto's and Takuji Nishimura's [Mersenne Twister algorithm](https://en.wikipedia.org/wiki/Mersenne_Twister). 

But even if my and Sklearn's decision tree implementations used identical RNG implementations, it'd still be impossible to compare their respective speeds under identical conditions. In other words, it wouldn't be possible to force each algorithm to grow the same decision tree in the same way (i.e. drawing and sorting the same features for each node's best split search). This is because it is highly likely that for any given node split, my and Sklearn's RNGs will *not* be called the same number of times. There are two reasons for this:
1. Sklearn [will draw beyond](https://github.com/scikit-learn/scikit-learn/blob/213d21fe719ce5778726203893c78251b8af34fa/sklearn/tree/_splitter.pyx#L305) `m` number of features if the first `m` features that were drawn are constant. Conversely, my implementation would turn such a node into a leaf upon drawing `m` constant features.
2. When setting a node as a leaf, my implementation will randomly choose the node's label on occasions where more than one class label have identical weighted class counts among the leaf node's samples, and this random choice happens during model training. Sklearn, on the other hand, does not set leaf labels during training, and instead defers this task to inference time. 

In other words, at a given tree node, my and Sklearn's implementations will almost certainly choose different batches of candidate splitting features. If one implementation got unlucky and happened to more frequently pick features that took longer to sort, that implementation would take longer to train, even if it were coded to be as performant as the other implementation.

#### My Cython is way faster during inference
At test-time, however, it is possible to make an apples-apples comparison of the inference speeds of my and Sklearn's implementations. My Cython code takes all of about 4ms to run inference on all samples in the validation set while Sklearn is over 200% slower, taking nearly 15ms to do the same task. 

I believe this discrepancy in performance is explained by the fact that for each test input, my code only has to travel through the tree's node structure to the appropriate leaf and then grab that leaf's label, which had already been determined at training. Sklearn, on the other hand, has to [execute a call](https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_classes.py#L432) to `np.argmax` inside a pure Python for-loop that iterates over each test sample. To be fair, I believe the for-loop is necesssary because Sklearn [supports multi-label](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html) classification problems, which my implementation does not.

#### Aside: A quibble regarding bias that Sklearn may introduce at test-time
When its tree's `predict()` function is called and Sklearn does finally determine leaf labels, [it uses](https://github.com/scikit-learn/scikit-learn/blob/95d4f0841d57e8b5f6b2a570312e9d832e69debc/sklearn/tree/_classes.py#L434) `np.argmax` to select the class label with the maximum weighted class count. Using `np.argmax` in this way introduces a possible bias toward leaves being given lower class labels, as Numpy's argmax function [only returns](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) the index *of the first* element that contains the max value of an array's elements. If ever a leaf's samples's class labels are equally weighted, Sklearn will always set the leaf's class label to the lower class. e.g. class `0` instead of class `1`.

It would be a pity to train a decision tree model that tends to predict lower classes solely due to a quirk of `np.argmax`. For this reason, my implementation breaks ties by randomly choosing the label class from all classes that share the maximum weighted sample sum of a given leaf node.

## Testing My Cython Louppe Implementation for Memory Leaks
This final sanity check ensures that my implementation properly frees all allocated memory. I used the code from [this gist](https://gist.github.com/raghavrv/c5a147220509d872e3627830967dff1b) created by Sklearn contributor [Venkat Rajagopalan](https://gist.github.com/raghavrv).

In [123]:
# Adapted from raghavrv's wonderful gist at: https://gist.github.com/raghavrv/c5a147220509d872e3627830967dff1b

import os, time, gc, psutil
from pympler import tracker
import numpy as np

tracker.memory_tracker = tracker.SummaryTracker()
def get_mem():
    return "{:.0f}MB".format(p.memory_info().rss / 1e6)

p = psutil.Process()

def sleep_1s_and_print_mem(title):
    time.sleep(1)
    tracker.memory_tracker.print_diff()
    print("\n\n" + title + " : " + get_mem() + "   " + "=" * 50 + "\n\n")

sleep_1s_and_print_mem("Initial memory")

for i in range(7):
    dt = DecisionTreeLouppeCython(m, min_samples_leaf=5, seed=47).fit(xTrain_proc, yTrain_proc)
    del dt
    gc.collect()

    sleep_1s_and_print_mem("After iteration %d" % i)

                                              types |   # objects |   total size
                          pandas.core.series.Series |        1141 |    643.16 MB
                                               list |       19822 |      1.64 MB
                                                str |       19064 |      1.36 MB
                                               dict |        3473 |    368.22 KB
                                                int |        5431 |    148.59 KB
                                      numpy.ndarray |        1151 |    119.83 KB
  pandas.core.internals.managers.SingleBlockManager |        1141 |    106.97 KB
                                              tuple |        2148 |    106.15 KB
              pandas._libs.internals.BlockPlacement |        1141 |     80.23 KB
                                              slice |        1141 |     62.40 KB
              pandas.core.internals.blocks.IntBlock |         789 |     55.48 KB
            pandas.core.inte