# Accelerate Inference Throughput via Cascaded Multi-layer Ensemble
:label:`sec_cascade_infer`

Below is a preliminary API design.

`def predictor.fit_cascade()`
-----------------
Builds a cascade of previously trained models to achieve different trade-offs between `eval_metric` and inference throughput, as specified by the user. It is invoked after the `fit()` stage.  
The cascade consists of two key components: the sequence of trained models and the short-circuit (early-exit) thresholds attached to each model member. Both components are found via a hyperparameter optimization (HPO) process on the validation dataset.
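
The early-exit mechanism can be illustrated with a minimal sketch. It assumes a classification setting in which each member returns class probabilities and a member short-circuits when its top-class probability reaches its threshold. The exit rule and every name below (`cascade_predict_row`, `ProbaFn`, the toy models) are illustrative assumptions, not the AutoGluon implementation.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type: a model maps one feature row to class probabilities.
ProbaFn = Callable[[Sequence[float]], List[float]]


def cascade_predict_row(
    row: Sequence[float],
    models: Sequence[ProbaFn],
    thresholds: Sequence[float],
) -> Tuple[int, int]:
    """Return (predicted class index, index of the member that answered)."""
    if len(thresholds) != len(models) - 1:
        raise ValueError("expected one threshold per model except the last")
    for i, model in enumerate(models[:-1]):
        proba = model(row)
        confidence = max(proba)
        # Assumed exit rule: stop as soon as a member is confident enough.
        if confidence >= thresholds[i]:
            return proba.index(confidence), i
    # No early exit happened, so the last member always answers.
    proba = models[-1](row)
    return proba.index(max(proba)), len(models) - 1


# Toy members: a fast model that is unsure on small inputs, and a slow one.
fast = lambda row: [0.55, 0.45] if row[0] < 1.0 else [0.05, 0.95]
slow = lambda row: [0.90, 0.10]

easy = cascade_predict_row([2.0], [fast, slow], [0.8])  # answered by `fast`
hard = cascade_predict_row([0.5], [fast, slow], [0.8])  # falls through to `slow`
```

Rows that exit early never pay the cost of the slower members, which is where the throughput gain comes from. Lower thresholds trade accuracy for speed.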

```Python
Parameters:
    infer_limit: float, default = None
        The target inference time limit, in seconds per row. This is a target and is not guaranteed.
    infer_limit_batch_size: int, default = None
        The batch size to use when predicting in bulk to estimate per-row inference time. Must be an integer greater than 0. If None and infer_limit is specified, will default to 10000.
    build_cascade_algo: str, default = 'F2S+'
        Valid values are: ['F2S', 'F2S+', 'Greedy', 'Greedy+']. 'F2S+' and 'Greedy+' are recommended.
        'F2S' denotes the Fast-to-Slow algorithm, where the models in the cascade are ordered by *pred_time_val_marginal*. 'Greedy' denotes the Greedy Search algorithm, where the cascade model sequence is determined by a greedy search process. 'F2S+' and 'Greedy+' denote the Plus versions of these algorithms. In a non-Plus cascade, each model member is a previously trained model, e.g. ['KNN', 'Cat', 'RF', 'NN', 'WE_L2']. In the corresponding Plus version, each model member is replaced by a newly fitted weighted_ensemble model of the trained model and its predecessors, e.g. ['KNN', 'WE_L2-KNN_Cat', 'WE_L2-KNN_Cat_RF', 'WE_L2-KNN_Cat_RF_NN', 'WE_L2'].
    hyperparameter_tune_kwargs: str or dict, default = None
        HPO strategy and kwargs (e.g. how many HPO trials to run).
        num_trials: int, default=1000
        searcher: str, default='TPE'
            Valid values are: ['TPE', 'Random'].
        hpo_score_func: str, default='ag_goodness'
            Valid values are: ['ag_goodness', 'eval_metric'].
            'ag_goodness' is an AutoGluon pre-defined score: a weighted sum of eval_metric and inference throughput, with practice-based penalties that push the built cascade to be both strong (penalizing severely when eval_metric is close to random guessing) and fast (penalizing a cascade that is too fast, so that the search focuses more on eval_metric) for real-life applications. 'eval_metric' is a score function that maximizes eval_metric within the specified `infer_limit`.
        Greedy_algo_config_num_trials: int, default=50
            The Greedy Search cascade-building algorithm runs a series of HPO trials for each candidate configuration during the search. This argument determines the number of trials to run per configuration. A value below 80 is recommended to keep the search duration reasonable.

Returns: CascadeConfig
```
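
The Fast-to-Slow ordering and the Plus naming pattern described in `build_cascade_algo` can be sketched as follows. This is an illustration under stated assumptions: the helper names `order_fast_to_slow` and `plus_member_names` and the timing values are hypothetical, and appending the final `WE_L2` ensemble last simply mirrors the example lists in the parameter description.

```python
from typing import Dict, List


def order_fast_to_slow(pred_time_val_marginal: Dict[str, float]) -> List[str]:
    """Order model names by ascending marginal validation prediction time."""
    return sorted(pred_time_val_marginal, key=lambda name: pred_time_val_marginal[name])


def plus_member_names(models: List[str], prefix: str = "WE_L2") -> List[str]:
    """Derive Plus-variant member names following the documented example.

    The first and last members keep their names. Each member in between
    becomes a weighted ensemble of itself and all of its predecessors.
    """
    names = []
    for i, name in enumerate(models):
        if i == 0 or i == len(models) - 1:
            names.append(name)
        else:
            names.append(f"{prefix}-" + "_".join(models[: i + 1]))
    return names


# Hypothetical marginal prediction times, in seconds per row.
times = {"RF": 3e-4, "KNN": 1e-5, "NN": 9e-4, "Cat": 8e-5}
f2s = order_fast_to_slow(times) + ["WE_L2"]  # assumed: final ensemble goes last
f2s_plus = plus_member_names(f2s)
```

Here `f2s` reproduces the non-Plus example sequence and `f2s_plus` the Plus one. Fitting the intermediate weighted ensembles themselves is outside the scope of this sketch.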

`CascadeConfig` is a class that carries all the information needed to describe a built cascade.
```Python
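from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CascadeConfig:
    model: Tuple[str, ...]         # cascade model sequence. Length=N.  TODO: stores names, but a name such as WE_L2_cascadeBuilt2 is not self-explanatory.
    thresholds: Tuple[float, ...]  # corresponding short-circuit/early-exit thresholds. Length=N-1
    pred_time_val: float           # prediction time of the cascade on validation data
    score_val: float               # eval_metric score of the cascade on validation data
    hpo_score: float               # score assigned by the HPO score function
    hpo_func_name: Optional[str] = None  # name of the HPO score function, e.g. 'ag_goodness'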
```
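
As a usage sketch, the snippet below builds a `CascadeConfig` by hand and checks the length relation between `model` and `thresholds`. The helper `check_cascade_config` is hypothetical and all field values are made-up placeholders rather than measured results. In practice the object would be returned by `fit_cascade()`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class CascadeConfig:
    model: Tuple[str, ...]
    thresholds: Tuple[float, ...]
    pred_time_val: float
    score_val: float
    hpo_score: float
    hpo_func_name: Optional[str] = None


def check_cascade_config(config: CascadeConfig) -> None:
    """Hypothetical helper: enforce N models and N-1 thresholds."""
    if len(config.thresholds) != len(config.model) - 1:
        raise ValueError(
            f"{len(config.model)} models need {len(config.model) - 1} thresholds, "
            f"got {len(config.thresholds)}"
        )


# Placeholder values for illustration only.
config = CascadeConfig(
    model=("KNN", "Cat", "WE_L2"),
    thresholds=(0.9, 0.8),
    pred_time_val=2e-4,
    score_val=0.91,
    hpo_score=0.88,
    hpo_func_name="ag_goodness",
)
check_cascade_config(config)  # passes: 3 models, 2 thresholds
```

Because the dataclass is frozen, a built cascade cannot be modified in place. A changed threshold requires creating a new `CascadeConfig`, for example with `dataclasses.replace`.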
