## Situation and Task
- **Reason**: to price bonds with embedded options in real time, prices need to be converted to option-adjusted spread (OAS) quickly, but the traditional OAS calculator is too slow.
- **Difficulty**: the OAS calculator is a complex function of a high-dimensional input.
- **Existing solution**: pre-compute and cache OAS calculator results, but this is expensive.
- **Task**: build an OAS approximator with guaranteed precision

## Action
- **Initial solution**: Build a machine learning model (such as a neural network), trained on the inputs $X$ and true output $y$ from the OAS calculator. The trained model can then serve as the approximator.
- **Further difficulty**: training samples are expensive to generate, especially when we need to guarantee precision.
- **Refinement**: Drew a connection to hyperparameter search for neural networks and solved the sampling problem by adapting Bayesian optimization with a Gaussian process.
    - The problem is similar to **crude oil extraction**: we do not know where the oil is, and it is expensive to drill holes.
    - So we drill at a few trial locations, analyze the soil samples to decide which regions to drill next, and iterate. In the end, the holes actually drilled cluster in the areas where oil is most likely.
    - The adapted Bayesian optimization for our problem is similarly an automatic way to search.
        - It is an **iterative process**: given the model errors on the current samples, we determine in which regions the model is more likely to do poorly (i.e. large model errors so far), then sample more in those regions and re-evaluate.
        - In this way, we **avoid sampling many unnecessary training points**: in areas where the OAS approximator is already doing a good job we can afford to sample less, just as we can ignore areas of ground where we are highly certain there is no oil underneath.

## Resolution
- **Achievement**: 
    - The machine learning solution plus Bayesian optimization **saves 100x compute time and 1000x storage**
    - It makes a business proposal viable: it is a **0 to 1 breakthrough**
- **Take-away**: it is also remarkable intellectually, as it showcases the **power of making conceptual connections and adapting solutions**
    - Machine learning algorithms are used to, e.g., identify cats in pictures, but at heart they approximate complex logic and functions, which makes them suitable for OAS approximation.
    - Bayesian optimization is traditionally used to find the maximum of expensive-to-evaluate functions, but it can also serve as a systematic way of searching, which makes it suitable for choosing where to sample training data.

## Technical Details

### Input space - high dimensional

- Inputs: price, volatility, coupon, non-call period in years, maturity, and the tenor values of the base curve

- The goal is to approximate the OAS calculator with a **guarantee of precision** over a range of the above inputs,
    - where the range of price, volatility and curve indicate what **market environment** fits the calibration, 
    - while the range of coupon, non-call year and maturity indicate **which bonds can use the approximator**.

- It is clear that a simple gridding and caching approach will quickly run afoul of the **curse of dimensionality**: with 10 grid points per input, $d$ inputs require $10^d$ calculator runs.

- We also tried reducing the dimensionality of the **base curve by using Nelson-Siegel components** (level, slope and curvature factors, similar in spirit to the leading PCA components of the curve).
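
The curve-compression idea can be sketched as follows, assuming the standard three-factor Nelson-Siegel form with a fixed decay parameter. The tenor grid, the decay value `lam` and the function names are illustrative assumptions, not the setup actually used here.

```python
import numpy as np


def nelson_siegel_basis(tenors, lam=0.5):
    """Level, slope and curvature loadings at the given tenors (in years, > 0)."""
    t = np.asarray(tenors, dtype=float)
    x = lam * t
    slope = (1.0 - np.exp(-x)) / x
    curvature = slope - np.exp(-x)
    return np.column_stack([np.ones_like(t), slope, curvature])


def fit_nelson_siegel(tenors, yields, lam=0.5):
    """Least-squares fit of the three factors for a fixed decay parameter."""
    basis = nelson_siegel_basis(tenors, lam)
    betas, *_ = np.linalg.lstsq(basis, np.asarray(yields, dtype=float), rcond=None)
    return betas


# Hypothetical curve: seven tenor values are compressed into three model inputs.
tenors = np.array([0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0])
yields = np.array([0.031, 0.032, 0.034, 0.036, 0.039, 0.041, 0.042])
betas = fit_nelson_siegel(tenors, yields)
fitted = nelson_siegel_basis(tenors) @ betas
print("factors:", betas)
print("max fit error:", np.max(np.abs(fitted - yields)))
```

The three fitted factors would replace the individual tenor values as inputs, at the cost of whatever curve shape the three factors cannot reproduce.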

### The machine learning model - MLP

- We tried MLPs with one and two hidden layers, using `tanh` activations.

- A **number of neurons** of 30-50 is found to be sufficient.
    - The small and simple network **does not require us to spend much time on hyperparameter tuning**: empirically, **larger networks have a bigger chance of getting stuck at saddle points**

- We find that `MLPRegressor` in `sklearn` with `lbfgs` as optimizer works reasonably well.

- We specify **no regularization**, as the implicit assumption is that the true OAS calculator is not noisy (when its precision is sufficiently high; see below).

- The loss function is defined to be the **p-mean** of the errors, where the **larger the p, the more heavily large errors are penalized**. This is a way to guarantee precision, at least empirically.

- The main headaches are the **precision of the true OAS calculator** and the need to **sample smartly** (see the Gaussian process sections).
    - The two issues compound each other: higher precision makes each run of the true OAS calculator more expensive, which makes it all the more important to sample only the points that count.
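
A minimal sketch of the model setup described above, under two assumptions. First, "p-mean" is read as the power mean of absolute errors, $(\frac{1}{n}\sum_i |e_i|^p)^{1/p}$. Second, since `MLPRegressor` itself optimizes squared error, the p-mean appears here only as an evaluation metric; how it was wired into training is not stated above. The toy target function, `p_mean_error` and `fit_approximator` are hypothetical names for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor


def p_mean_error(y_true, y_pred, p=8):
    """Power mean of absolute errors; approaches the max error as p grows."""
    abs_err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    largest = abs_err.max()
    if largest == 0.0:
        return 0.0
    # Factor out the largest error so that high powers do not underflow.
    return float(largest * np.mean((abs_err / largest) ** p) ** (1.0 / p))


def fit_approximator(X, y, n_hidden=40, seed=0):
    """Small tanh MLP trained with L-BFGS and no L2 penalty (alpha=0)."""
    model = MLPRegressor(
        hidden_layer_sizes=(n_hidden,),
        activation="tanh",
        solver="lbfgs",
        alpha=0.0,
        max_iter=2000,
        random_state=seed,
    )
    return model.fit(X, y)


# Toy stand-in for the OAS calculator (hypothetical smooth function).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(300, 3))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]

model = fit_approximator(X, y)
pred = model.predict(X)
for p in (1, 2, 8, 32):
    print(f"p={p:>2}: p-mean error = {p_mean_error(y, pred, p):.2e}")
```

As `p` increases the metric moves from the mean absolute error towards the worst-case error, which is why a large `p` is the relevant choice when precision has to hold everywhere in the range.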


### The Gaussian process optimization

- In each iteration, train the MLP on the current data set and obtain its errors.
- Train a Gaussian process on the errors at those data points.
- Generate extra candidate points in the input space, especially around points whose errors are significantly non-zero. Check which of the newly sampled points the Gaussian process predicts to have significantly non-zero error, and add those to the data set on which the MLP is re-trained in the next iteration.
    - We measure significance as the absolute value of the posterior mean divided by the posterior standard deviation. It is not clear whether this performs better or worse than expected improvement, but we use it for simplicity.
- Repeat until no data is added or the number of data points reaches a threshold.
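
The loop above can be sketched roughly as follows. This is a toy under stated assumptions, not the production procedure: `slow_calculator` is a hypothetical two-input stand-in for the true OAS calculator, the plain `RBF` kernel, the significance threshold of 2, the candidate-generation scheme and the rescaling of errors to unit RMS are all illustrative choices.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.neural_network import MLPRegressor


def slow_calculator(X):
    """Hypothetical stand-in for the expensive true OAS calculator."""
    return np.sin(4.0 * X[:, 0]) * np.exp(-X[:, 1])


def make_mlp(seed):
    return MLPRegressor(hidden_layer_sizes=(30,), activation="tanh",
                        solver="lbfgs", alpha=0.0, max_iter=2000,
                        random_state=seed)


def adaptive_sampling(n_init=40, n_candidates=200, threshold=2.0,
                      max_points=150, max_rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_init, 2))
    y = slow_calculator(X)
    for _ in range(max_rounds):
        # 1. Train the MLP on the current data and obtain its errors.
        errors = make_mlp(seed).fit(X, y).predict(X) - y
        rms = np.sqrt(np.mean(errors ** 2))
        if rms == 0.0:
            break
        # 2. Fit a GP to the errors, rescaled to unit RMS so that the
        #    unit-variance RBF prior is on a sensible scale.
        kernel = RBF(length_scale=0.3, length_scale_bounds=(0.05, 5.0))
        gp = GaussianProcessRegressor(kernel=kernel).fit(X, errors / rms)
        # 3. Candidates: half uniform, half jittered around the worst points.
        worst = X[np.argsort(-np.abs(errors))[:10]]
        centers = worst[rng.integers(0, len(worst), size=n_candidates // 2)]
        local = np.clip(centers + rng.normal(0.0, 0.05, size=centers.shape), 0.0, 1.0)
        uniform = rng.uniform(0.0, 1.0, size=(n_candidates - len(local), 2))
        candidates = np.vstack([uniform, local])
        # 4. Keep candidates whose predicted error is significantly non-zero:
        #    |posterior mean| / posterior std above the threshold.
        mean, std = gp.predict(candidates, return_std=True)
        score = np.abs(mean) / np.maximum(std, 1e-12)
        picked = candidates[score > threshold][: max_points - len(X)]
        if len(picked) == 0:
            break  # nothing significant left to add
        X = np.vstack([X, picked])
        y = np.concatenate([y, slow_calculator(picked)])
        if len(X) >= max_points:
            break  # data budget reached
    # Final model trained on everything collected.
    return make_mlp(seed).fit(X, y), X, y
```

Calling `adaptive_sampling()` returns the final model together with the sampled inputs and outputs; plotting the returned inputs shows where the loop chose to spend calculator runs.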

#### Specifying the Gaussian process

- The kernel is **isotropic**, i.e. only depends on the distance between the two input points. 
    - The **distance between input points is weighted Euclidean**, reflecting our belief about which dimensions the error is likely to be more sensitive to.
    - To customize the kernel this way, we needed to **override the `__call__` method of the kernel class in `sklearn`**.

- We **do not specify extra noise** (i.e. `alpha` is left at its very small default value in the API). The implicit assumption is that we trust the precision of the true OAS calculator we are approximating.
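
One way such a kernel could look is sketched below, assuming it is built by subclassing `sklearn`'s `RBF` and rescaling the inputs inside `__call__`. The class name `WeightedRBF`, the `weights` parameter and the example weights are hypothetical; the actual customization may differ.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF


class WeightedRBF(RBF):
    """RBF kernel on a weighted Euclidean distance (illustrative sketch)."""

    def __init__(self, weights, length_scale=1.0, length_scale_bounds=(1e-5, 1e5)):
        super().__init__(length_scale=length_scale,
                         length_scale_bounds=length_scale_bounds)
        # Stored unchanged so that sklearn's clone() can rebuild the kernel.
        self.weights = weights

    def __call__(self, X, Y=None, eval_gradient=False):
        # Scaling each column by sqrt(w_d) turns the plain Euclidean distance
        # into sqrt(sum_d w_d * (x_d - y_d)**2).
        scale = np.sqrt(np.asarray(self.weights, dtype=float))
        Xs = np.atleast_2d(X) * scale
        Ys = None if Y is None else np.atleast_2d(Y) * scale
        return super().__call__(Xs, Ys, eval_gradient=eval_gradient)


# Toy data: the target is more sensitive to the first input than the second.
rng = np.random.default_rng(0)
X = rng.uniform(size=(25, 2))
y = np.sin(3.0 * X[:, 0]) + 0.1 * X[:, 1]

kernel = WeightedRBF(weights=(4.0, 1.0), length_scale=0.5)
gp = GaussianProcessRegressor(kernel=kernel).fit(X, y)  # alpha left at its default
mean, std = gp.predict(X[:3], return_std=True)
print(gp.kernel_, mean, std)
```

The weights stay fixed and only the single length scale is optimized during fitting. An `RBF` with one length scale per dimension would instead learn the relative weights from the data, which is a different design choice from encoding a prior belief.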

#### Next steps of refining the Gaussian process

- GP is probably reaching the **limit on the number of data points** (a few thousand), since training requires inverting the covariance matrix, an $O(n^3)$ operation.

- GP is probably reaching the **limit on the number of features** as well: it should not be more than a dozen.

- A better criterion for judging the significance of errors at the next points may be needed; see ‘expected improvement’ in Bayesian optimization: https://towardsdatascience.com/bayesian-optimization-a-step-by-step-approach-a1cb678dd2ec
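
For reference, the textbook expected-improvement criterion for a maximization problem can be sketched as below, given a GP posterior mean and standard deviation. How it would be adapted to the error-sampling problem is left open above; one hypothetical option would be to fit the GP to absolute errors and take the largest error seen so far as `best`. The function name and the `xi` exploration parameter are illustrative.

```python
import numpy as np
from scipy.stats import norm


def expected_improvement(mean, std, best, xi=0.0):
    """Standard EI for maximization: E[max(f - best - xi, 0)] under N(mean, std^2)."""
    mean = np.atleast_1d(np.asarray(mean, dtype=float))
    std = np.atleast_1d(np.asarray(std, dtype=float))
    gain = mean - best - xi
    # With zero posterior std the improvement is deterministic.
    ei = np.maximum(gain, 0.0)
    pos = std > 0
    z = gain[pos] / std[pos]
    ei[pos] = gain[pos] * norm.cdf(z) + std[pos] * norm.pdf(z)
    return ei


# Three hypothetical candidates with the same mean but different uncertainty.
print(expected_improvement(mean=[0.5, 0.5, 0.5], std=[0.0, 0.1, 1.0], best=0.6))
```

Unlike the mean-over-standard-deviation score, this criterion rewards uncertain candidates as well as candidates with a high predicted value, so it trades off exploring unknown regions against refining known problem areas.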