# CSE Selection

In this notebook we will be exploring CSE selection.  We will be comparing a few different mechanisms for selecting CSEs and how they do relative to each other.  We will start by comparing the current JIT heuristic for CSE selection to both random CSE selection and the perfscore we get from choosing no CSEs at all.  We will then revisit how our initial, simple reinforcement learning model did.  Finally, we will build a classification model using neural networks.

For our neural network approach, we will start with a basic model:  Just picking *individual* CSEs and trying to predict which CSEs will result in a positive or negative perfscore.  This isn't quite the general case of choosing a sequence of CSEs to JIT (in fact we found this simple single-cse model does not do that well for this task).  We will explore trying to generalize to choosing the full CSE sequence in future work.

Before getting started, we need to set up some parameters and define a function to calculate `geomean` which we'll use to compare results:

In [1]:
# --------------------------------------------------------------------------------------------
# Constants and parameters

# Update these values to point at your local build
MCH_FILE = "~/git/runtime/artifacts/spmi/mch/43854594-cd60-45df-a89f-5b7697586f46.linux.x64/libraries_tests_no_tiered_compilation.run.linux.x64.Release.mch"
CORE_ROOT = "~/git/runtime/artifacts/bin/coreclr/linux.x64.Checked/"

# At what perf_score would we want to select a CSE?  We don't want to select any CSE which
# is >0.0, as we are creating new temporaries for no value.
# I've arbitrarily selected this minimum perf_score improvement for a "successful" CSE.
CSE_SUCCESS_THRESHOLD = -5.0

# We will only retain features which have a correlation of at least 1% with the change in
# perf_score.  This is to reduce the number of features we have to consider.  Selecting a
# value as high as 15% would still be reasonable, but since there are so few features we
# can afford to allow a lower threshold.
CORRELATION_THRESHOLD = 0.01

# The reinforcement learning trained model to compare against.
RL_MODEL = "../models/rl/ppo.zip"
MODEL_ALGORITHM = 'PPO'

# Single CSE Classification Model parameters
OPTIMIZER = 'adam'
LOSS = 'binary_crossentropy'
METRICS = ['accuracy']
EPOCHS = 100
BATCH_SIZE = 256

# --------------------------------------------------------------------------------------------
# Setup and imports

# resolve '~' to home directory
import os
MCH_FILE = os.path.expanduser(MCH_FILE)
CORE_ROOT = os.path.expanduser(CORE_ROOT)

import sys
from tqdm import tqdm
from pandas import DataFrame
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Dense # type: ignore
from sklearn.preprocessing import StandardScaler

# Add the parent directory to the path so that we can import the jitml module
sys.path.append(os.path.dirname(os.getcwd()))

from jitml import SuperPmiContext, MethodKind, SuperPmiCache, SuperPmi, get_individual_cse_perf, JitType

ctx = SuperPmiContext(mch=MCH_FILE, core_root=CORE_ROOT)
cache : SuperPmiCache = ctx.create_cache()
spmi : SuperPmi = ctx.create_superpmi()
spmi.start()

def calculate_geomean(scores, baselines):
    ratios = [score / baseline for score, baseline in zip(scores, baselines)]
    log_ratios = np.log(ratios)
    mean_log_ratios = np.mean(log_ratios)
    geomean = np.exp(mean_log_ratios)
    return geomean

def print_difference(scores, baseline, name : str, baseline_name : str):
    if not scores or not baseline:
        print(f"No data for {name} or {baseline_name}")
        return

    same_as_no_cse = 0
    bad_change = []
    good_change = []

    for i in range(len(scores)):
        if scores[i] is None:
            continue

        diff = scores[i] - baseline[i]
        if np.isclose(diff, 0):
            same_as_no_cse += 1
        elif diff < 0:
            good_change.append(diff / scores[i] * 100.0)
        else:
            bad_change.append(diff / scores[i] * 100.0)

    print(f"Geomean of {name} vs {baseline_name}: {calculate_geomean(scores, baseline):.2f}")
    print()
    print(f"% of time same score:          {100 * same_as_no_cse / len(scores):.2f}%")
    print(f"% of time {baseline_name} is better:    {100 * len(bad_change) / len(scores):.2f}%")
    print(f"% of time {name} is better: {100 * len(good_change) / len(scores):.2f}%")
    print()
    print(f"Average improvement when {name} is better: {np.mean(good_change):.2f}%")
    print(f"Average degradation when {baseline_name} is better:     {np.mean(bad_change):.2f}%")
    print()

class TqdmCallback(tf.keras.callbacks.Callback):
    def __init__(self, epochs):
        self.epochs = epochs
        self._progress = None

    def on_train_begin(self, logs=None):
        self._progress = tqdm(total=self.epochs, desc='Epochs', unit='epoch', ncols=120)

    def on_epoch_end(self, epoch, logs=None):
        self._progress.update(1)
        self._progress.set_postfix(acc=logs.get('accuracy'), val_acc=logs.get('val_accuracy'))

    def on_train_end(self, logs=None):
        self._progress.close()


## How often are CSEs a net benefit to enable?

Let's first take a quick look at how often enabling an individual CSE is a good decision or not.  To do this, we've generated the `individual_cse_perf` dataset.  This dataset was generated by first JIT'ing every method with no CSEs enabled as a baseline, then taking every viable CSE and JIT'ing it individually (no other CSEs enabled) then comparing the resulting perfscore to the baseline.

This tells us a simplified story of what we are looking for:  Whether or not each individual CSE is a net improvement or detriment to overall perfscore when we enable it (and by what magnitude).  This is not the complete picture, since each CSE decision will affect every other CSE decision, but it is a good starting point for understanding the problem.

We will query this dataset for all viable CSEs.  We will count the number of times each individual CSE was a net benefit to perfscore as both a raw number and as a percentage.  We will then take a look at how often the JIT heuristic selected individual CSEs.

The result seems to be that each CSE decision is a coinflip:  Around 46% of the time, enabling a CSE is a good decision.  The heuristic seems to be slightly more conservative than that, selecting only 32% of viable CSE candidates.

In [2]:
individual_cse = get_individual_cse_perf(MCH_FILE, CORE_ROOT)

def print_cse_info():
    grouped_by_method = individual_cse.loc[individual_cse['viable']].groupby('method')

    good_cses = []
    good_cses_pct = []

    heuristic_cses = []
    heuristic_cses_pct = []
    for _, method_info in grouped_by_method:
        num_cses = len(method_info)
        num_good = (method_info['diff'] < 0).sum()
        good_cses.append(num_good)
        good_cses_pct.append(num_good / num_cses)

        # count method_infos with 'heuristic_selected' == True
        num_heuristic = method_info['heuristic_selected'].sum()
        heuristic_cses.append(num_heuristic)
        heuristic_cses_pct.append(num_heuristic / num_cses)

    print(f"Avg good CSEs:   {np.mean(good_cses)}")
    print(f"Avg good CSEs %: {np.mean(good_cses_pct) * 100:.2f}%")
    print()
    print(f"Avg heuristic CSEs:   {np.mean(heuristic_cses)}")
    print(f"Avg heuristic CSEs %: {np.mean(heuristic_cses_pct) * 100:.2f}%")

print_cse_info()

Avg good CSEs:   1.9211512105984467
Avg good CSEs %: 46.08%

Avg heuristic CSEs:   1.5503121668950814
Avg heuristic CSEs %: 31.88%


## The Current CSE Heuristic (Linux x64)

Let's next take a look at the quality of the current hand-written heuristic for selecting CSEs by comparing it to doing nothing at all (selecting no CSEs).  Our methodology here is simple:  JIT the method once with no CSEs enabled, then JIT the method with the JIT's default heuristic and see if the result is positive or negative.

We run this experiment below.  As you can see, the Geomean of using no CSEs versus using the default heuristic on the 'libraries_tests_no_tiered_compilation' mch file is 0.98.  That is, using the heuristic is slightly better than using no CSEs at all, but not by a very large factor.  Digging in a little further, we find that the perfscore is the same 26% of the time, the heuristic is *worse* than using no CSEs 22% of the time, and the heuristic is better than choosing no CSEs 52% of the time.  The heuristic degrades performance by 5% when it chooses the wrong CSEs, and improves performance by 8% when it does so correctly.

Essentially, when the current heuristic guesses wrong it degrades performance slightly less than it improves performance when it guesses right (+5% vs -8%), and it improves performance about twice as often as it hurts it.  **[IMPORTANT]** This is a key finding, because much of my focus in improving the current JIT heuristic has been on finding CSEs to enable that will improve performance.  This research actually says that we can improve the JIT's current heuristic by finding places where it enables a CSE when it shouldn't and preventing that.
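To make the geomean figure concrete, here is a minimal sketch using made-up perf scores (the numbers are illustrative and not taken from the dataset).  It mirrors the log-space computation in `calculate_geomean`, using only the standard library.  A ratio below 1.0 means the first set of scores is lower (better) than the baseline.

```python
import math

def geomean_of_ratios(scores, baselines):
    """Geometric mean of score/baseline, computed in log space."""
    log_ratios = [math.log(s / b) for s, b in zip(scores, baselines)]
    return math.exp(sum(log_ratios) / len(log_ratios))

# Hypothetical perf scores for three methods (lower is better).
no_cse_scores = [100.0, 200.0, 50.0]
heuristic_scores = [90.0, 200.0, 55.0]   # one win, one tie, one loss

ratio = geomean_of_ratios(heuristic_scores, no_cse_scores)
print(f"{ratio:.3f}")   # about 0.997
```

Here one method improves by 10% and another regresses by 10%, so the geomean lands just under 1.0.  The geomean rewards consistent small wins rather than a single large outlier.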

### Is it better than random chance?

Another way to get a baseline of comparison is to generate random selections of CSE choices and see how well this does against our two methods (no-CSEs or the JIT heuristic).  We (of course) expect the JIT heuristic to be better than random choice, and we will certainly want any model we generate to also be better than random choice.  Let's validate that here.

To do this, we take the average number of CSEs the heuristic selects (as a percentage of each method's viable CSEs).  We will then randomly select that many CSEs out of all viable CSEs for each method and compare it to the JIT's heuristic.
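The selection step can be sketched as follows.  This is a simplified stand-in that uses the standard library's `random.sample` rather than `np.random.choice(..., replace=False)`.  The helper name `pick_random_cses` and the candidate indices are hypothetical.  The fraction 0.5856 is the average CSE usage reported in the output further down.

```python
import random

def pick_random_cses(viable_indices, avg_fraction, rng):
    """Pick int(len(viable) * avg_fraction) distinct CSE indices at random."""
    count = int(len(viable_indices) * avg_fraction)   # truncates toward zero
    return sorted(rng.sample(viable_indices, count))

rng = random.Random(0)          # fixed seed so the sketch is repeatable
viable = [0, 2, 3, 5, 7, 8]     # hypothetical viable CSE indices for one method
chosen = pick_random_cses(viable, 0.5856, rng)
print(chosen)                   # three distinct indices drawn from `viable`
```

Because the count is truncated, small methods can end up with zero or one pick.  The notebook's cell additionally skips any method where this count is not greater than 1.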

Knowing what we now know about CSEs (46% of them are good, 54% of them are bad), the results versus enabling no CSEs are not surprising at all.  As expected, random selection of CSEs is roughly equal to choosing no CSEs at all.  The geomean between random selection and no CSEs at all is **1.00**.

The comparison vs the JIT heuristic tells the same story with different numbers.  Random chance is worse than the current heuristic with a geomean of **1.01** compared to it.  The heuristic chooses better 62% of the time, and random chance chooses better CSEs than the heuristic 31% of the time.

It's a bit surprising that choosing random CSEs to enable outperforms the standard heuristic 31% of the time, but this turns out to be a common theme.  We will see a similar number below when we try to use a neural network classifier.

The only notable thing about random chance vs the current JIT heuristic is the average improvement/degradation of performance.  On average, random chance improves performance by 4.4% when choosing correctly and degrades performance by 3.9% when it selects poorly.  This is notable because the default JIT heuristic degrades performance by 5.1% when it guesses wrong.  This again suggests that we may be able to find places where the JIT incorrectly enables bad CSEs for a big benefit.

In [3]:
# Wrapped as a method so we don't carry around these variables
def print_cse_vs_heuristic_vs_random():
    methods = []
    heuristic_scores = []
    no_cse_scores = []
    pct_cses_used = []

    # Loop through all methods in the .mch.  We will JIT each method with no CSEs enabled,
    # discarding any methods which have no viable CSEs.  We will then JIT the method with
    # the heuristic enabled and compare the perf_scores.
    for method_id in cache.all_methods:
        no_cse = cache.jit_method(spmi, method_id, MethodKind.NO_CSE)
        if not any(True for candidate in no_cse.cse_candidates if candidate.viable):
            continue

        heuristic = cache.jit_method(spmi, method_id, MethodKind.HEURISTIC)

        # If we got a perfscore of 0, ignore this method
        if np.isclose(heuristic.perf_score, 0.0) or np.isclose(no_cse.perf_score, 0.0):
            continue

        heuristic_scores.append(heuristic.perf_score)
        no_cse_scores.append(no_cse.perf_score)
        pct_cses_used.append(heuristic.num_cse/ sum(1 for x in no_cse.cse_candidates if x.viable))
        methods.append((no_cse, heuristic))

    avg_cses_used = np.mean(pct_cses_used)
    print(f"Average cse usage: {avg_cses_used* 100:.2f}%")
    print()

    print_difference(heuristic_scores, no_cse_scores, "heuristic", "no-cse")

    # Now we will compare the heuristic to a random selection of CSEs
    no_cse_baseline = []
    heuristic_baseline = []
    random_scores = []
    for no_cse, heuristic in tqdm(methods, ncols=120, desc="Random choices"):
        viable = [x.index for x in no_cse.cse_candidates if x.viable]
        count = int(len(viable) * avg_cses_used)
        if count > 1:
            choice = list(np.random.choice(viable, count, replace=False))
            random = cache.jit_method(spmi, no_cse.index, choice)

            # Unfortunately, we sometimes fail to JIT methods.  I haven't debugged this yet
            if random is not None and not np.isclose(random.perf_score, 0.0):
                random_scores.append(random.perf_score)
                no_cse_baseline.append(no_cse.perf_score)
                heuristic_baseline.append(heuristic.perf_score)


    print_difference(random_scores, no_cse_baseline, "random", "no-cse")
    print_difference(random_scores, heuristic_baseline, "random", "heuristic")

print_cse_vs_heuristic_vs_random()


Average cse usage: 58.56%

Geomean of heuristic vs no-cse: 0.98

% of time same score:          25.73%
% of time no-cse is better:    22.44%
% of time heuristic is better: 51.83%

Average improvement when heuristic is better: -7.75%
Average degradation when no-cse is better:     5.12%



Random choices: 100%|██████████████████████████████████████████████████████████| 131369/131369 [15:43<00:00, 139.16it/s]


Geomean of random vs no-cse: 1.00

% of time same score:          3.10%
% of time no-cse is better:    44.77%
% of time random is better: 52.13%

Average improvement when random is better: -4.39%
Average degradation when no-cse is better:     3.85%

Geomean of random vs heuristic: 1.01

% of time same score:          7.59%
% of time heuristic is better:    61.65%
% of time random is better: 30.76%

Average improvement when random is better: -3.31%
Average degradation when heuristic is better:     3.78%



## Comparison with the Simplified Reinforcement Learning Model

We previously built a reinforcement learning based model to attempt to tackle a simplified version of this problem.  This model has *several* limitations and is not meant to be the real solution that we can drop into the JIT.  Rather it was meant as a starting point to prove that the approach can work.  Here are the major limitations of the (intentionally simple) RL model we built:

* It only works on methods with 3-16 CSE candidates (it cannot handle anything with more than 16 CSE candidates).
* It cannot choose to enable no CSEs, even if all CSEs candidates are bad ones.
* It JITs the method at every iteration to get the most recent info about CSE candidates.
* The model can choose to apply CSEs which are already applied or are non-viable.
* Using the model is somewhat complicated, because it requires us to handle the case where the model wants to apply a CSE which is already applied or non-viable.  That is, we have to make some choices in how this is handled and that decision making is outside of the model itself.

Note that all of these limitations can be fixed and improved.  We simply built the most expedient model possible to prove that the approach works.
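As an illustration of the last two limitations, here is one way the "already applied or non-viable" case could be handled outside the model:  mask the disallowed actions and take the most likely remaining one.  This is a sketch of one reasonable policy, not a description of what `get_most_likley_allowed_action` actually does.  The function name and the probabilities are hypothetical.

```python
def most_likely_allowed(probabilities, allowed):
    """Return the index of the highest-probability allowed action, or None."""
    candidates = [i for i, ok in enumerate(allowed) if ok]
    if not candidates:
        return None
    return max(candidates, key=lambda i: probabilities[i])

# Hypothetical model output over four CSE candidates.
probs = [0.50, 0.30, 0.15, 0.05]
allowed = [False, True, True, False]   # 0 is already applied, 3 is non-viable
action = most_likely_allowed(probs, allowed)
print(action)   # 1
```

The model's top choice (candidate 0) is skipped because it is already applied, so the next most likely allowed candidate is used instead.  That fallback is exactly the kind of decision that lives outside the model.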

With all of that said, it's still useful to compare this simplified model to the underlying JIT heuristic/no cse/random chance that we have before.

### Setting up the Comparison

The code that we have below requires that the reinforcement learning model we built be trained.  We do not check these models into the repo.  If you want to run this script below, you will need to train the model with:

```bash
./train.py models/rl/ ~/path/to/tests.mch --core_root ~/path/to/core_root --test-percent 0.2 --iterations N --parallel M
```

Here `--iterations` takes the number of iterations to train for, and `--parallel` sets how many parallel processes to use (the default is 1).

The model used below was trained for 10,000,000 iterations.

### Results

The output below shows that our reinforcement learning trained model performs very similarly on both the training data and test data it's never seen before.  For the test data, we have a `geomean` of 0.97 compared to methods with no CSEs selected.  We have a `geomean` of 0.99 when compared to the JIT's current heuristic on these same methods.  So our reinforcement learning trained model performs slightly better than the current JIT heuristic on methods with up to 16 CSEs, despite this model's limitations.

The RL trained model is better than random chance as well (1.02 geomean comparing random to the RL model).  The surprising finding here is that randomly selecting CSEs to enable is better than the RL model 31% of the time (same as the heuristic based approach).

## Conclusions about an RL Approach

While the particular model architecture we used in our first attempt at building a model using reinforcement learning isn't suitable for use in the JIT, it does show that this approach was successful.  For the limited scope of methods that this model handles, it performs better than the current JIT CSE heuristic.

It's still worrisome that a third of the time, random chance selects better CSEs than this model, but this is no worse than the current CSE heuristic in the JIT.

Before continuing down the path of using reinforcement learning to build a model for JIT CSE heuristics, we will first try to use some more standard and straightforward ML techniques, such as using neural networks to try to classify whether CSEs should be enabled or not.  If we find this simpler approach does not work, we can go back to investigating reinforcement learning based models instead.

In [4]:
from jitml import JitCseModel, MethodContext
from evaluate import get_most_likley_allowed_action

def predict_cses(model : JitCseModel, method : MethodContext) -> MethodContext:
    # We can't be done on the first iteration, this is a limitation of the current RL model.
    # It also may look odd that we iterate over the method and jit it multiple times.  This
    # also is how the simple RL model works, which is not how the next version of this model
    # would be built if we continue down this path.
    can_terminate = False
    cses = []
    while (action := get_most_likley_allowed_action(model, method, can_terminate)) is not None:
        cses.append(action)
        method = cache.jit_method(spmi, method.index, cses)
        if method is None:
            break

        can_terminate = True
        if not any(True for candidate in method.cse_candidates if candidate.viable):
            break

    return method

# Implemented as a function so we don't carry these variables forward
def evaluate_rl_model(algorithm, model_path):
    model = JitCseModel(algorithm)
    model.load(model_path)

    failures = 0
    rl_scores = {}
    no_cse_scores = {}
    heuristic_scores = {}
    pct_cses = []
    no_cse_methods = []

    training_methods = set(cache.train_methods)
    print(f"Training methods: {len(training_methods)}, Test methods: {len(cache.test_methods)}")

    # This test/train split is for the RL model, which is different from the
    # previous set of methods.  This is because the RL model can only handle
    # methods with 3-16 CSEs.
    all_methods = cache.test_methods + cache.train_methods
    for method_index in tqdm(all_methods, ncols=120, desc="Evaluating RL model"):
        # JIT the method with no CSEs, and then JIT the method with the heuristic
        no_cse = cache.jit_method(spmi, method_index, MethodKind.NO_CSE)
        heuristic = cache.jit_method(spmi, method_index, MethodKind.HEURISTIC)

        # JIT the method with the RL model
        rl_result = predict_cses(model, no_cse)
        if rl_result is None:
            failures += 1
            continue

        if np.isclose(rl_result.perf_score, 0.0) or \
           np.isclose(no_cse.perf_score, 0.0) or \
           np.isclose(heuristic.perf_score, 0.0):
            continue

        rl_scores[method_index] = rl_result.perf_score
        no_cse_scores[method_index] = no_cse.perf_score
        heuristic_scores[method_index] = heuristic.perf_score
        pct_cses.append(len(rl_result.cses_chosen) / sum(1 for x in no_cse.cse_candidates if x.viable))
        no_cse_methods.append(no_cse)

    avg_cses_used = np.mean(pct_cses)

    print(f"Failures: {failures}")
    print(f"Average cse usage: {avg_cses_used:.2f}")
    print()

    # Print out the results of the RL model on test/training data vs baselines
    print("Training data results:")
    rl = [value for key, value in rl_scores.items() if key in training_methods]
    no_cse = [value for key, value in no_cse_scores.items() if key in training_methods]
    heuristic = [value for key, value in heuristic_scores.items() if key in training_methods]
    print_difference(rl, no_cse, "rl", "no-cse")
    print_difference(rl, heuristic, "rl", "heuristic")

    print("Test data results:")
    rl = [value for key, value in rl_scores.items() if key not in training_methods]
    no_cse = [value for key, value in no_cse_scores.items() if key not in training_methods]
    heuristic = [value for key, value in heuristic_scores.items() if key not in training_methods]
    print_difference(rl, no_cse, "rl", "no-cse")
    print_difference(rl, heuristic, "rl", "heuristic")

    # calculate random baseline
    rl_for_random = []
    random_scores = []
    for no_cse in tqdm(no_cse_methods, ncols=100, desc="Random choices"):
        viable = [x.index for x in no_cse.cse_candidates if x.viable]
        count = int(len(viable) * avg_cses_used)
        if count > 1 and count <= len(viable):
            choice = list(np.random.choice(viable, count, replace=False))
            random = cache.jit_method(spmi, no_cse.index, choice)

            # Sometimes JIT'ing fails, so we only consider successful JITs
            if random is not None and not np.isclose(random.perf_score, 0.0):
                random_scores.append(random.perf_score)
                rl_for_random.append(rl_scores[no_cse.index])

    print_difference(rl_for_random, random_scores, "rl", "random")

evaluate_rl_model(MODEL_ALGORITHM, RL_MODEL)



Training methods: 55261, Test methods: 6140


Evaluating RL model: 100%|████████████████████████████████████████████████████████| 61401/61401 [29:10<00:00, 35.07it/s]


Failures: 104
Average cse usage: 0.36

Training data results:
Geomean of rl vs no-cse: 0.97

% of time same score:          9.78%
% of time no-cse is better:    30.64%
% of time rl is better: 59.59%

Average improvement when rl is better: -5.90%
Average degradation when no-cse is better:     1.77%

Geomean of rl vs heuristic: 0.99

% of time same score:          20.33%
% of time heuristic is better:    36.88%
% of time rl is better: 42.79%

Average improvement when rl is better: -4.10%
Average degradation when heuristic is better:     2.98%

Test data results:
Geomean of rl vs no-cse: 0.97

% of time same score:          9.50%
% of time no-cse is better:    31.01%
% of time rl is better: 59.49%

Average improvement when rl is better: -5.80%
Average degradation when no-cse is better:     1.81%

Geomean of rl vs heuristic: 0.99

% of time same score:          19.55%
% of time heuristic is better:    37.00%
% of time rl is better: 43.45%

Average improvement when rl is better: -4.08%
Aver

Random choices: 100%|████████████████████████████████████████| 61285/61285 [06:30<00:00, 156.90it/s]


Geomean of rl vs random: 0.98

% of time same score:          7.24%
% of time random is better:    31.25%
% of time rl is better: 61.51%

Average improvement when rl is better: -3.77%
Average degradation when random is better:     1.86%



## Neural Networks for a single CSE Decision

So reinforcement learning can work for this problem, but it's very difficult and complicated to train an RL model, and the one we have has a lot of limitations.  Let's take a look at some simpler approaches in deep learning.  We'll look at whether a neural network can predict whether enabling a CSE is a good or bad decision.  If we can do that, we may just be able to use it to tell us what CSEs to enable.

We will use the `individual_cse_perf` dataset again.  This tells us a simplified story of what we are looking for:  Whether or not each individual CSE is a net improvement or detriment to overall perfscore when we enable it (and by what magnitude).  This is not the complete picture, since each CSE decision will affect every other CSE decision.  However, this is a good starting dataset for our model to see if it will even train, and if it trains then whether it makes for good CSE decision making.

### Neural Network Architecture

Our input to the network will be a list of features that the JIT provides in `CSE_HeuristicRLHook::GetFeatures` (optcse.cpp).  We use scikit-learn's `StandardScaler` to scale the inputs effectively.  Some of the data (like the weights) might benefit from further thought into feature normalization.  Our output will be true/false of whether an individual CSE improves the perfscore or not (actually it's whether it improves the perfscore by at least `CSE_SUCCESS_THRESHOLD` as enabling a CSE that does nothing isn't helpful).  That means our output node will be a single neuron with `sigmoid` activation to get a probability.

Since we are doing binary classification, our loss function will be `binary_crossentropy`, which is the standard choice for binary classification.  Similarly we will use the `adam` optimizer since it does a good job with this type of task, though this was chosen semi-arbitrarily.  Others like `RMSprop` would work well too.
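To illustrate how the label, the sigmoid output, and the loss fit together, here is a small standard-library sketch.  The `diffs` and `logits` values are made up, and the loss function is a plain re-implementation of the textbook binary cross-entropy formula rather than Keras's own code.

```python
import math

def sigmoid(z):
    """Squash a raw network output into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clamps probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

CSE_SUCCESS_THRESHOLD = -5.0
diffs = [-12.0, -1.0, 3.5]            # hypothetical perf_score changes
labels = [int(d < CSE_SUCCESS_THRESHOLD) for d in diffs]   # [1, 0, 0]
logits = [2.0, -1.0, -3.0]            # hypothetical pre-activation outputs
probs = [sigmoid(z) for z in logits]
loss = binary_crossentropy(labels, probs)
print(labels, [round(p, 3) for p in probs], round(loss, 3))
```

Note that the second CSE improves the perfscore slightly (-1.0) but is still labeled 0, because it does not clear the threshold.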

For our hidden layers, we've chosen 3 dense layers with 64 neurons each.  This choice of layers and neurons was chosen by experimenting with different architectures with [classification.py](../classification.py).  This program lets you specify the neural network architecture, optimizer, loss, etc and it will calculate the loss and accuracy of that network for this problem.

Through experimenting we found that smaller networks of 8 neurons underfit the data, and 2-3 layers of 16, 32, or 64 neurons worked nicely.  `[64, 64, 64, 1]` produced the best result and didn't overfit, so we use it here.  Something like `[16, 16, 1]` is likely a more sensible choice with fewer parameters to train and had similar (but slightly worse) results.  We also experimented with mixed densities, like `[64, 32, 16, 1]` but they did not seem to provide any advantages.
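To get a feel for the size difference between these architectures, the sketch below counts trainable parameters (weights plus biases) for a stack of dense layers.  The input length of 20 is a placeholder assumption.  The real value depends on how many feature columns survive preprocessing.

```python
def dense_param_count(input_len, layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    total = 0
    prev = input_len
    for size in layer_sizes:
        total += prev * size + size   # weight matrix + bias vector
        prev = size
    return total

INPUT_LEN = 20   # hypothetical feature count, not taken from the dataset
for layers in ([64, 64, 64, 1], [16, 16, 1], [64, 32, 16, 1]):
    print(layers, dense_param_count(INPUT_LEN, layers))
# [64, 64, 64, 1] 9729
# [16, 16, 1] 625
# [64, 32, 16, 1] 3969
```

Under that assumption, `[16, 16, 1]` has roughly one fifteenth the parameters of `[64, 64, 64, 1]`, which is why it may be the more sensible choice if the accuracy gap is small.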

Below we train with 100 epochs since that seemed to give good results without seeing evidence of overfitting.  This number is also arbitrary, and can be increased to see how far we can push it, or decreased for a small penalty in accuracy.

### A Quick Note about Test/Train split

With this dataset we have to be careful about how we build our train/test split.  We cannot simply use sklearn's `train_test_split` to do this for us because multiple rows of data can come from the same method (each method can have multiple viable CSEs, after all).  We don't want to pollute the validation data with CSE decisions from methods that are in the training data.  This may not even be sufficient since some method bodies might be identical even if they came from different places.  We will have to fix that in a future version.
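A group-aware split can be sketched in a few lines.  The notebook itself relies on `cache.train_methods` and `cache.test_methods`.  The helper below is a hypothetical stand-in that shows the idea:  split the *methods* first, then assign each row according to its method.

```python
import random

def split_by_method(rows, test_fraction, seed=0):
    """Split rows so that every row of a given method lands on one side only."""
    methods = sorted({row["method"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(methods)
    n_test = max(1, int(len(methods) * test_fraction))
    test_methods = set(methods[:n_test])
    train = [r for r in rows if r["method"] not in test_methods]
    test = [r for r in rows if r["method"] in test_methods]
    return train, test

# Hypothetical rows: one per (method, CSE) pair, three CSEs per method.
rows = [{"method": m, "cse": c} for m in range(10) for c in range(3)]
train, test = split_by_method(rows, test_fraction=0.2)
print(len(train), len(test))   # 24 6
```

scikit-learn's `GroupShuffleSplit` offers a similar group-aware split if you would rather not hand-roll it.  Neither approach addresses the identical-method-body concern mentioned above.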

### Single CSE Results

As you can see from the output below, we hit ~98% accuracy on the training data and 97% accuracy on test data that the model has never seen before.  This means that for any individual CSE, this model can tell you with very high accuracy whether enabling just that *one* CSE and no others will have a positive or negative impact on perfscore.  (Actual numbers may vary from run to run, depending on the last time this script was run.)

So, this model does successfully train and seems to have good results on test data it has never seen before.

In [5]:
def create_model(input_len : int):
    model = tf.keras.models.Sequential([
        Dense(64, activation='relu', input_shape=(input_len,)),
        Dense(64, activation='relu'),
        Dense(64, activation='relu'),
        Dense(1, activation='sigmoid')
    ])

    model.compile(optimizer=OPTIMIZER, loss=LOSS, metrics=METRICS)
    return model

def split_and_scale(df : DataFrame):
    train_mask = df['method'].isin(cache.train_methods)
    test_mask = df['method'].isin(cache.test_methods)
    x_train, y_train = df[train_mask].drop(columns=['target', 'method']), df[train_mask]['target']
    x_test, y_test = df[test_mask].drop(columns=['target', 'method']), df[test_mask]['target']

    scaler = StandardScaler()
    x_train = scaler.fit_transform(x_train)
    x_test = scaler.transform(x_test)
    return scaler, x_train, x_test, y_train, y_test, x_train.shape[1]

# see notebooks/00_random_forest.ipynb for more information on the approach here
def sanitize_data(df : DataFrame, threshold : float) -> DataFrame:
    # Don't modify the original dataframe
    result = df.copy()

    # One-hot encode the type column
    for member in JitType:
        result[f"type_{member.name}"] = result['type'] == member

    result.drop(columns=['type'], inplace=True)
    result['selected'] = result['selected'].apply(len)
    result['target'] = result['diff'] < threshold

    # where result is viable
    result = result[result['viable']]

    # Drop columns we don't want to use as features
    to_drop = ['cse_index', 'cse_score', 'no_cse_score', 'heuristic_score', 'heuristic_selected',
               'index', 'applied', 'viable', 'diff']
    to_drop = [x for x in to_drop if x in result.columns]

    result.drop(columns=to_drop, inplace=True)
    return result

def train_single_cse_model(data : DataFrame, threshold : float):
    # Loads a dataset that is all the CSEs in the .mch file individually JIT'ed.
    normalized = sanitize_data(data, threshold)
    scalar, x_train, x_test, y_train, y_test, feature_len = split_and_scale(normalized)

    print(f"Training on {len(x_train)} CSE decisions.")
    print(f"Validating on {len(x_test)} CSE decisions.")

    # Create and train the model
    model = create_model(feature_len)
    model.fit(x_train, y_train, epochs=EPOCHS, batch_size=BATCH_SIZE, validation_data=(x_test, y_test), verbose=0,
              callbacks=[TqdmCallback(EPOCHS)])

    # Evaluate the model
    train_loss, train_acc = model.evaluate(x_train, y_train, verbose=0)
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)

    print()
    print(f"Train Accuracy: loss:{train_loss:.4f} accuracy:{train_acc:.4f}")
    print(f"Test Accuracy:  loss:{test_loss:.4f} accuracy:{test_acc:.4f}")

    return scalar, model

single_scalar, single_model = train_single_cse_model(individual_cse, CSE_SUCCESS_THRESHOLD)


Training on 291304 CSE decisions.
Validating on 32401 CSE decisions.




Train Accuracy: loss:0.0473 accuracy:0.9808
Test Accuracy:  loss:0.0751 accuracy:0.9740


## Does our Neural Network make for a better CSE heuristic?

Just because the model trains and generalizes to test data doesn't mean that it will do better than the current heuristic.  If the output of the model were to be believed, we should be correctly predicting that CSE coinflip more than 95% of the time.  That's too good to be true.  We also trained from a dataset of individual CSE decisions, but CSEs work in concert with each other.  So let's test and see what the outcome actually is.

Now that we have a trained model, we can take the CSEs for a method and use `model.predict` to tell us whether each CSE for the method is predicted to be below our success threshold.  We set `SELECTION_PROBABILITY` to 0.5, the same cutoff TensorFlow uses by default for binary classification.  That is, if the output of our neural network is above 0.5 we will select the CSE.
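
As a small illustration of this selection rule, the sketch below applies the 0.5 cutoff to a made-up vector of model outputs and orders the surviving candidates from most to least confident.  The probabilities and the `select_cses` helper are hypothetical; they only show the mechanics and are not part of the notebook's pipeline.

```python
import numpy as np

SELECTION_PROBABILITY = 0.5

def select_cses(y_pred, selection_probability=SELECTION_PROBABILITY):
    """Return positions whose prediction exceeds the cutoff, most confident first."""
    y_pred = np.asarray(y_pred, dtype=float)
    # Positions of candidates the model considers likely to be successful
    above = np.where(y_pred > selection_probability)[0]
    # Order those positions by descending predicted probability
    return above[np.argsort(-y_pred[above])].tolist()

# Hypothetical model outputs for five CSE candidates in one method
example_predictions = [0.20, 0.90, 0.60, 0.40, 0.75]
selected = select_cses(example_predictions)
print(selected)  # [1, 4, 2]
```

Note that a prediction of exactly 0.5 is not selected, since the comparison is strict.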

We will then compare our neural network to JIT'ing the method with no CSEs enabled (which is very similar to random chance) and also compare it to the current JIT heuristic, hoping for a better outcome.
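
To make the comparison concrete, here is a minimal sketch of a geomean over perfscore ratios.  It assumes the ratio is taken as candidate over baseline, so a value below 1.0 favors the candidate (lower perfscore is better); the `geomean_ratio` name and the sample scores are made up for illustration and may differ from the `geomean` helper defined earlier in this notebook.

```python
import math

def geomean_ratio(candidate_scores, baseline_scores):
    """Geometric mean of candidate/baseline perfscore ratios."""
    # Skip pairs with non-positive scores, since log() is undefined for them
    ratios = [c / b for c, b in zip(candidate_scores, baseline_scores) if c > 0 and b > 0]
    if not ratios:
        raise ValueError("no valid score pairs")
    # exp(mean(log(r))) is the geometric mean of the ratios
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Hypothetical perfscores for three methods
model_scores = [90.0, 100.0, 50.0]
baseline_scores = [100.0, 100.0, 40.0]
result = geomean_ratio(model_scores, baseline_scores)
print(f"{result:.4f}")
```

In this toy example one method improves by 10%, one is unchanged, and one regresses by 25%, so the geomean lands above 1.0 even though only one of three methods got worse.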

### Results

As you can see from the output below, the neural network we trained did not live up to our expectations.  Despite predicting the results of single CSE decisions with a high degree of accuracy, we unfortunately see that using this to make full CSE decisions for a method doesn't quite model reality.

On the positive side, this model matched the current JIT CSE heuristic (geomean of 1.00), though it does so in a very different way.  Because we set our selection threshold to be -5.0 perfscore or better, the model is very conservative.  We rarely make a "wrong" choice, but we choose very few CSEs overall.  (As an aside, this also gives more evidence that we may be able to improve the current heuristic by choosing fewer "bad" CSEs.)

With this in mind, let's explore what happens when we are less strict with our threshold for a successful CSE.

In [7]:
SELECTION_PROBABILITY = 0.5

def predict_cse_with_model(scalar, model, threshold):
    model_scores = []
    no_cse_scores = []
    heuristic_scores = []
    chosen = []
    chosen_pct = []

    grouped = individual_cse.groupby('method')
    for method_id in tqdm(cache.test_methods, ncols=120, desc="Predicting CSEs"):
        # Sanitize data
        if method_id not in grouped.groups:
            continue

        row = grouped.get_group(method_id)
        row_viable = row[row['viable']]
        features = sanitize_data(row_viable, threshold)
        x = scalar.transform(features.drop(columns=['target', 'method']))

        # Predict what CSEs to use
        y_pred = model.predict(x, verbose=0).ravel()
        above_threshold = np.where(y_pred > SELECTION_PROBABILITY)[0]
        cses_chosen = above_threshold[np.argsort(-y_pred[above_threshold])].tolist()

        # Map positions in row_viable back to the JIT's CSE indices
        cses_chosen = [row_viable.iloc[i].cse_index for i in cses_chosen]

        # JIT the baselines (no CSEs, current heuristic) for comparison
        no_cse = cache.jit_method(spmi, method_id, MethodKind.NO_CSE)
        heuristic = cache.jit_method(spmi, method_id, MethodKind.HEURISTIC)

        if cses_chosen:
            method = cache.jit_method(spmi, method_id, cses_chosen)
            if method is None or np.isclose(method.perf_score, 0.0):
                continue
        else:
            method = no_cse

        model_scores.append(method.perf_score)
        no_cse_scores.append(no_cse.perf_score)
        heuristic_scores.append(heuristic.perf_score)
        chosen.append(len(cses_chosen))
        chosen_pct.append(len(cses_chosen) / sum(1 for x in no_cse.cse_candidates if x.viable))

    print(f"Average CSEs chosen:   {np.mean(chosen):.2f}")
    print(f"Average CSEs chosen %: {np.mean(chosen_pct) * 100:.2f}%")
    print()
    print("VS No CSE")
    print_difference(model_scores, no_cse_scores, "model", "no-cse")
    print()
    print("VS Heuristic")
    print_difference(model_scores, heuristic_scores, "model", "heuristic")

print(f"Results from selecting CSEs with predicted perfscore < {CSE_SUCCESS_THRESHOLD}:")
predict_cse_with_model(single_scalar, single_model, CSE_SUCCESS_THRESHOLD)

Results from selecting CSEs with predicted perfscore < -5.0:


Predicting CSEs: 100%|██████████████████████████████████████████████████████████████| 6140/6140 [05:41<00:00, 17.98it/s]


Average CSEs chosen:   0.49
Average CSEs chosen %: 8.82%

VS No CSE
Geomean of model vs no-cse: 0.98

% of time same score:          77.00%
% of time no-cse is better:    0.93%
% of time model is better: 22.07%

Average improvement when model is better: -11.10%
Average degradation when no-cse is better:     1.42%


VS Heuristic
Geomean of model vs heuristic: 1.00

% of time same score:          17.46%
% of time heuristic is better:    48.85%
% of time model is better: 33.69%

Average improvement when model is better: -5.78%
Average degradation when heuristic is better:     3.30%



## A Second Attempt at an Individual CSE Based Neural Network

In this attempt, we'll take the exact same neural network architecture and hyperparameters but instead try to train it to detect when there is any improvement in perfscore at all.  This uses 0 for our success threshold instead of -5.0.  We will expect this new model to select more CSEs, but also likely make more mistakes in choosing CSEs which regress perfscore.  We hope that, on balance, this will be a net improvement over the heuristic.

One interesting side effect of setting the threshold to 0 is that the model is not as accurate on this data.  We hit 92% accuracy with this model (compared to 97% before).  This is most likely because many CSEs are clustered around the 0.0 perfscore change mark.  We might be able to increase the accuracy by using a threshold somewhere in `[-5, 0]`, but at the end of the day the important thing is how well the model does on the more general scenario of picking multiple CSEs, not in training.
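
The effect of the threshold on the labels can be sketched with a toy list of perfscore changes.  The values below are invented, and `label_fraction` and `near_boundary` are hypothetical helpers; the labeling rule itself (`diff < threshold`) is the one used in `sanitize_data`.

```python
def label_fraction(diffs, threshold):
    """Fraction of CSEs labeled successful under the rule diff < threshold."""
    labels = [d < threshold for d in diffs]
    return sum(labels) / len(labels)

def near_boundary(diffs, threshold, margin=0.5):
    """Count CSEs whose perfscore change lies within `margin` of the threshold."""
    return sum(1 for d in diffs if abs(d - threshold) <= margin)

# Hypothetical perfscore changes, clustered around 0.0
example_diffs = [-12.0, -6.5, -1.2, -0.3, 0.0, 0.4, 1.1, 8.0]

strict = label_fraction(example_diffs, -5.0)    # 0.25
loose = label_fraction(example_diffs, 0.0)      # 0.5
close_strict = near_boundary(example_diffs, -5.0)  # 0
close_loose = near_boundary(example_diffs, 0.0)    # 3
print(strict, loose, close_strict, close_loose)
```

With the threshold at 0.0, more examples sit right next to the decision boundary, which is one plausible reason a classifier would find them harder to separate.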

The final outcome here is a better one than the previous model, despite the lower accuracy on the training data.  We managed to beat the current JIT heuristic by a non-trivial margin, with a geomean of 0.98.  This still isn't a huge win though, just incremental progress.

One interesting finding here is that the percentage of CSEs chosen by this model has risen to 46%.  This is the same percentage of the time that we found individual CSEs to be beneficial.  This is a good sign that our model is well calibrated to the training data we gave it.  That gives us good hope that if we can feed a neural network better data and more relevant features, we should see even better results.  Hopefully.

In [8]:
NEW_THRESHOLD = 0.0
scalar_zero, model_zero = train_single_cse_model(individual_cse, NEW_THRESHOLD)

print()
print(f"Results from selecting CSEs with predicted perfscore < {NEW_THRESHOLD}:")
predict_cse_with_model(scalar_zero, model_zero, NEW_THRESHOLD)

Training on 291304 CSE decisions.
Validating on 32401 CSE decisions.


Epochs: 100%|████████████████████████████████████████████| 100/100 [03:34<00:00,  2.14s/epoch, acc=0.931, val_acc=0.921]



Train Accuracy: loss:0.1769 accuracy:0.9315
Test Accuracy:  loss:0.2143 accuracy:0.9213

Results from selecting CSEs with predicted perfscore < 0.0:


Predicting CSEs: 100%|██████████████████████████████████████████████████████████████| 6140/6140 [06:42<00:00, 15.25it/s]


Average CSEs chosen:   2.31
Average CSEs chosen %: 45.67%

VS No CSE
Geomean of model vs no-cse: 0.96

% of time same score:          25.59%
% of time no-cse is better:    3.00%
% of time model is better: 71.41%

Average improvement when model is better: -6.66%
Average degradation when no-cse is better:     1.41%


VS Heuristic
Geomean of model vs heuristic: 0.98

% of time same score:          28.64%
% of time heuristic is better:    5.42%
% of time model is better: 65.94%

Average improvement when model is better: -3.86%
Average degradation when heuristic is better:     1.92%



## Thoughts and Conclusions

We took a wide walk through CSE decision making in this notebook.  We were able to build three ML models which are as good as or better than the current JIT heuristic:  The reinforcement learning trained model we built a few weeks ago, and two classification models based on the `individual_cse` dataset.

Our first attempt at building a classic neural network (without RL) matched the JIT heuristic.  However, it does not generalize as well as we expected it to based on its classification success.  Our second attempt was not as accurate on the test or training data, but resulted in a better overall model which better fit the real-world scenario...though still nowhere near as good as we might hope.

We may be able to train a better network in the future by giving it data about multiple CSE selections instead of individual ones.  One of our next steps will be trying to generate better training data from multiple CSE decisions.

Here are some other quick takeaways:

* Individual CSE decisions are essentially a coinflip:  46% of the time an individual CSE will be a positive change in perfscore (if no others are enabled).
* This "individual" CSE decision data doesn't seem to completely generalize to selecting multiple CSEs.
* The current hand-written JIT heuristic does quite well, but is *worse* than choosing no CSEs (or random chance) 26% of the time.
* We may be able to improve the hand-written JIT heuristic by eliminating places where it guesses wrong instead of trying to find new places to say yes.
* The initial simplified reinforcement learning trained model we built does do better than the current JIT heuristic for functions with 3-16 CSEs.
* The `individual_cse` dataset can be learned and generalized by a neural network.
* The resulting neural network trained with `CSE_SUCCESS_THRESHOLD = -5.0` matches the current JIT heuristic (geomean of 1.00) and does better than our simplified RL model.  However, it does not live up to its "97% success" rate when it goes from picking individual CSEs to a series of them.
* We can change our `CSE_SUCCESS_THRESHOLD` to 0 for better performance.  The neural network guesses individual CSE perfscore less accurately, but the resulting network generalizes better to selecting CSEs for a full method.

Lastly, note that this is another step in a series of explorations.  This is not the final outcome of this work.

### Next Steps

The next project here is to build a better dataset of CSE decisions and results.  We need to be careful not to double-count methods with the same method body, since that would leak data between the training and test sets.  I think this can be accomplished with the method hash that comes from the JIT.  Once that's complete we can try to build a better, more realistic model using that data.
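
One way to avoid that kind of leakage is to split by method hash rather than by row.  The sketch below is only an illustration under assumptions:  the `method_hash` field, the `split_by_method_hash` helper, and the 10% default test fraction are all hypothetical, and the real dataset may expose the JIT's method hash differently.

```python
import random
from collections import defaultdict

def split_by_method_hash(rows, test_fraction=0.1, seed=0):
    """Split rows so every row sharing a method hash lands on the same side."""
    # Group rows by the (assumed) method hash field
    groups = defaultdict(list)
    for row in rows:
        groups[row["method_hash"]].append(row)

    # Shuffle the unique hashes deterministically, then carve off the test hashes
    hashes = sorted(groups)
    random.Random(seed).shuffle(hashes)
    n_test = max(1, round(len(hashes) * test_fraction))
    test_hashes = set(hashes[:n_test])

    train = [r for h in hashes if h not in test_hashes for r in groups[h]]
    test = [r for h in hashes if h in test_hashes for r in groups[h]]
    return train, test

# Hypothetical rows: several CSE decisions per method body
rows = [
    {"method_hash": "a1", "cse_index": 1},
    {"method_hash": "a1", "cse_index": 2},
    {"method_hash": "b2", "cse_index": 1},
    {"method_hash": "c3", "cse_index": 1},
    {"method_hash": "c3", "cse_index": 2},
    {"method_hash": "d4", "cse_index": 1},
]
train, test = split_by_method_hash(rows, test_fraction=0.25)
train_hashes = {r["method_hash"] for r in train}
test_hashes = {r["method_hash"] for r in test}
print(train_hashes & test_hashes)  # set()
```

Because the split is made over unique hashes, two methods with identical bodies can never end up on opposite sides of the train/test boundary.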

Here is a series of other follow-ups from this work:

* Build a better dataset that incorporates multiple CSE decisions.
* Attempt to train a new neural network on a multi-CSE dataset and see if it's better or worse.
* Attempt to identify bad CSE decisions by the current JIT heuristic and see if we can update the heuristic to work better.
* Take a closer look at the features the JIT provides for CSE decisions.  What changes after we enable a CSE?
* Consider adding features for basic block data (like variables live across the block).
* Attempt to build a regression model to predict high value CSEs, instead of just the true/false classification model.