In [None]:
from mymodels.data_engineer import data_engineer
from mymodels.pipeline import MyPipeline


"""
# Global settings for font
import matplotlib.pyplot as plt
plt.rcParams['font.family'] = 'Times New Roman'
"""

"""
# For debugging
import logging
logging.basicConfig(
    level = logging.DEBUG,
    format = "%(asctime)s - %(levelname)s - %(message)s"
)
"""


## Construct an object for workflow

### Parameters

- **results_dir**: Directory where your results will be stored. Accepts either a string or a `pathlib.Path` object. The directory is created if it doesn't exist.

- **random_state**: Random seed for the entire pipeline (data splitting, model tuning, etc.). (Default is 0)

- **show**: Whether to display the figure on the screen. (Default is `False`)

- **plot_format**: Output format for figures. (Default is jpg)

- **plot_dpi**: Resolution of output figures, in dots per inch. (Default is 500)
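
The `results_dir` behaviour described above (string or `pathlib.Path`, created on demand) can be illustrated with a minimal sketch. The helper name `ensure_results_dir` is hypothetical and is not part of `mymodels`; it only shows one common way to get this behaviour with the standard library.

```python
import tempfile
from pathlib import Path


def ensure_results_dir(results_dir):
    """Hypothetical helper: accept a str or Path and create the directory if needed."""
    path = Path(results_dir)
    # parents=True creates intermediate folders; exist_ok=True makes reruns safe
    path.mkdir(parents=True, exist_ok=True)
    return path


# Work inside a temporary folder so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    from_str = ensure_results_dir(f"{tmp}/results/titanic")
    from_path = ensure_results_dir(Path(tmp) / "results" / "titanic")
    same_target = from_str == from_path  # both inputs resolve to the same folder
    created = from_str.is_dir()
```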

In [2]:
mymodel = MyPipeline(
    results_dir = "results/titanic",
    random_state = 0,
    show = False,
    plot_format = "jpg",
    plot_dpi = 500
)

## Load data

### Parameters

- **file_path**: Path to the input data file. **.csv format is mandatory**. 

- **y**: The target you want to predict. Either a `str` (the column name) or an `int` (the column index) is acceptable.

- **x_list**: A `list` (or `tuple`) of the independent variables. Each element must be a `str` (the column name) or an `int` (the column index).

- **index_col**: An `int` object or `str` object representing the index column. (Default is `None`)

    > It's STRONGLY RECOMMENDED to set the index column if you want to output the raw data and the shap values. Also, it's acceptable to provide a `list` object (or a `tuple` object) for representing multiple index columns. 

- **test_ratio**: The proportion of test data. (Default is 0.3)

- **inspect**: Whether to display the y column or the independent variables you chose in the terminal. (Default is `True`)
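
To make these parameters concrete, here is a rough sketch of an equivalent loading and splitting step written directly with pandas and scikit-learn. It is not the internal implementation of `load()`; the in-memory CSV and its values are made up, and the use of `train_test_split` is an assumption.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# A tiny in-memory CSV standing in for data/titanic.csv (values are made up)
csv_text = """PassengerId,Survived,Pclass,Sex,Age
1,0,3,male,22
2,1,1,female,38
3,1,3,female,26
4,1,1,female,35
5,0,3,male,35
6,0,3,male,
7,0,1,male,54
8,0,3,male,2
9,1,3,female,27
10,1,2,female,14
"""

# index_col keeps the passenger ID attached to every row after splitting
data = pd.read_csv(io.StringIO(csv_text), index_col="PassengerId")

y = data["Survived"]               # the target column (y)
x = data[["Pclass", "Sex", "Age"]]  # the independent variables (x_list)

# test_size plays the role of test_ratio; random_state fixes the split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)
```

Because the index column is preserved, every row in `x_test` can later be traced back to its `PassengerId`, which is why setting `index_col` is recommended when raw data and SHAP values are exported.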

In [3]:
mymodel.load(
    file_path = "data/titanic.csv",
    y = "Survived",
    x_list = ["Pclass", "Sex", "Embarked", "Age", "SibSp", "Parch", "Fare"],
    index_col = ["PassengerId"],
    test_ratio = 0.3,
    inspect = False
)

## Diagnose the data

### Parameters

- **sample_k**: Sampling ratio for the diagnostic data.

The `mymodels.diagnose()` method provides visual data diagnostics, including:

- Data types

- Missing data counts

- Distribution, count, and proportion of categorical variables (displayed as bar charts)

- Distribution of continuous variables (displayed as violin plots and box plots)

- Correlation of continuous variables, displayed as heatmaps of Spearman and Pearson correlation coefficients (**strongly recommended to review before SHAP analysis**)

It's strongly recommended to run the workflow step by step: first run `mymodels.diagnose()` to diagnose the data, then set the feature engineering parameters or customize the feature engineering based on the results.

**Note: This method only diagnoses the training set after data splitting.**
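
As a small illustration of why both correlation measures are shown, the sketch below computes Pearson and Spearman coefficients with pandas on made-up numbers. It is independent of how `diagnose()` builds its heatmaps.

```python
import pandas as pd

# Made-up numbers: y grows monotonically with x, but not linearly (y = x ** 3)
frame = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [1, 8, 27, 64, 125]})

# Pearson measures linear association; Spearman measures monotonic (rank) association
pearson = frame.corr(method="pearson").loc["x", "y"]
spearman = frame.corr(method="spearman").loc["x", "y"]
```

Here the Spearman coefficient is 1 because the ranks agree perfectly, while the Pearson coefficient is lower because the relationship is not a straight line.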


In [None]:
mymodel.diagnose(sample_k=None)

## Data Engineering

### Parameters

- **outlier_cols**: The columns that contain outliers. (Default is `None`)

  > It's not supported currently, but will be added in the future.

- **missing_values_cols**: A `list` (or `tuple`) of the columns that have missing values. (Default is `None`)

- **impute_method**: A `str`, or a `list` (or `tuple`), specifying the imputation methods. (Default is `None`)

  All imputation strategies of `sklearn.impute.SimpleImputer` are supported. 

  If a `str` is given, every column in `missing_values_cols` is imputed with that method. If a `list` (or `tuple`) is given, its length must match `missing_values_cols`. The two parameters must either both be provided or both be set to `None`.

- **cat_features**: A `list` (or `tuple`) of the categorical columns. (Default is `None`)

- **encode_method**: A `str` naming a single encoding method, or a `list` (or `tuple`) of encoding methods. (Default is `None`)

  If a `str` is given, every column in `cat_features` is encoded with that method. If a `list` (or `tuple`) is given, its length must match `cat_features`. The two parameters must either both be provided or both be set to `None`.

  > A full list of supported encode methods can be found at the end.

- **scale_cols**: A `list` (or `tuple`) of the columns that need scaling. (Default is `None`)

- **scale_method**: A `str` naming a single scaling method, or a `list` (or `tuple`) of scaling methods. (Default is `None`)

  Currently `sklearn.preprocessing.StandardScaler` and `sklearn.preprocessing.MinMaxScaler` are supported.

  If a `str` is given, every column in `scale_cols` is scaled with that method. If a `list` (or `tuple`) is given, its length must match `scale_cols`. The two parameters must either both be provided or both be set to `None`.

- **n_jobs**: Number of parallel jobs used in data engineering. Speeds up processing of large datasets. (Default is `1`)

- **verbose**: Whether to print information during the transformation. (Default is `False`)
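
The pairing rule shared by the column and method parameters (a single `str` applies to every column, a sequence must match in length, and both must be given or both be `None`) can be sketched as follows. The helper `pair_columns_with_methods` is hypothetical and only mirrors the rule as described; the package may validate its inputs differently.

```python
def pair_columns_with_methods(cols, methods):
    """Hypothetical helper mirroring the column/method pairing rule."""
    if cols is None and methods is None:
        return []
    if cols is None or methods is None:
        raise ValueError("columns and methods must both be given or both be None")
    if isinstance(methods, str):
        # A single method name is applied to every column
        methods = [methods] * len(cols)
    if len(methods) != len(cols):
        raise ValueError("the number of methods must match the number of columns")
    return list(zip(cols, methods))


pairs_from_str = pair_columns_with_methods(["Sex", "Embarked"], "onehot")
pairs_from_list = pair_columns_with_methods(
    ["Age", "Embarked"], ["mean", "most_frequent"]
)
```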


### Note

The `mymodels.data_engineer()` method returns a `sklearn.pipeline.Pipeline` object, which is passed to the `mymodels.optimize()` method below. This `Pipeline` preprocesses the data before model training, as follows:

- In each fold of cross-validation, it is fitted on the training split and transforms it, then transforms the validation split. Each fold creates an independent pipeline object that doesn't affect the others.

- After hyperparameter optimization, a new pipeline object is created to fit and transform all training data, then transform the test set.

Users can define their own pipeline objects for feature engineering (e.g., adding feature selection steps), as long as they conform to the `sklearn.pipeline.Pipeline` class. However, users must test these themselves, and this project takes no responsibility for any issues that arise.

The pipeline will be exported to a `data_engineer_pipeline.joblib` file in the specified result path, which users can manually load and reuse.
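
The per-fold behaviour described in this note can be sketched with plain scikit-learn. The pipeline below is an assumption about what a comparable preprocessing pipeline might look like (mean imputation plus one-hot encoding on made-up data); it is not the object returned by `data_engineer()`.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up training data with one numeric and one categorical column
x_train = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0, 2.0],
    "Sex": ["male", "female", "female", "male", "male", "female"],
})

preprocess = Pipeline(steps=[
    ("columns", ColumnTransformer(transformers=[
        ("impute_age", SimpleImputer(strategy="mean"), ["Age"]),
        ("encode_sex", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
    ])),
])

fold_shapes = []
for train_idx, valid_idx in KFold(n_splits=3).split(x_train):
    fold_pipeline = clone(preprocess)  # fresh, unfitted copy for this fold
    # Fit on the training split only, so nothing leaks from the validation split
    fitted_train = fold_pipeline.fit_transform(x_train.iloc[train_idx])
    fitted_valid = fold_pipeline.transform(x_train.iloc[valid_idx])
    fold_shapes.append((fitted_train.shape, fitted_valid.shape))
```

Cloning the pipeline for every fold keeps the imputation means and encoder categories learned in one fold from influencing another.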


In [5]:
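"""
# User-defined pipeline (the scaler step is only a placeholder; add your own steps)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
self_defined_data_engineer_pipeline = Pipeline(steps=[("scaler", StandardScaler())])
"""


# `data_engineer()` returns a `sklearn.pipeline.Pipeline` instance
# Users can also define their own pipeline instead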
data_engineer_pipeline = data_engineer(
    outlier_cols = None,
    missing_values_cols = ["Age", "Embarked"],
    impute_method = ["mean", "most_frequent"],
    cat_features = ["Sex", "Embarked"],
    encode_method = ["onehot", "onehot"],
    # scale_cols = ["Fare"],
    # scale_method = ["standard"],
    n_jobs = 5,
    verbose = False
)

## Execute the optimization

### Parameters

- **model_name**: The model you want to use. In this example, `xgbc` denotes the XGBoost classifier; another name such as `catr` denotes the CatBoost regressor. A full list of model names for the different models and tasks can be found at the end.

- **data_engineer_pipeline**: A `sklearn.pipeline.Pipeline` object for data engineering.

- **cv**: Number of cross-validation folds used in the tuning process. (Default is 5)

- **trials**: Number of trials in the Bayesian tuning process (based on [Optuna](https://optuna.org/)). The 10 trials used here are for demonstration only; set a larger value for better hyperparameter optimization. (Default is 50)

- **n_jobs**: Number of cores used in the cross-validation process. It's recommended to use the same value as `cv`. (Default is 5)

- **cat_features**: A `list` (or `tuple`) of `str` column names specifying the categorical features, for **CatBoost models ONLY**. (Default is `None`)

  > If the model_name is neither `catc` nor `catr` (which represent CatBoost models), this parameter must be set to `None`; otherwise, an assertion error will occur.

- **optimize_history**: Whether to save the optimization history. (Default is `True`)

- **save_optimal_params**: Whether to save the best parameters. (Default is `True`)

- **save_optimal_model**: Whether to save the optimal model. (Default is `True`)

### Output

Several files will be output in the results directory:

- `params.yml` will document the best parameters.

- `mapping.json` will document the mapping relationship between the categorical features and the encoded features.

- `optimal-model.joblib` will save the optimal model from sklearn.

- `optimal-model.cbm` will save the optimal model from CatBoost.

- `optimal-model.txt` will save the optimal model from LightGBM.

- `optimal-model.json` will save the optimal model from XGBoost.

- `optimal-model.pkl` will save all types of optimal model for compatibility.

In [None]:
mymodel.optimize(
    model_name = "xgbc",
    data_engineer_pipeline = data_engineer_pipeline,
    cv = 5,
    trials = 10,
    n_jobs = 5,
    cat_features = None,  # For CatBoost ONLY
    optimize_history = True,
    save_optimal_params = True,
    save_optimal_model = True
)

## Evaluate the model's accuracy

### Parameters

- **save_raw_data**: Whether to save the raw prediction data. (Default is `True`)

### Output

The accuracy results are written to the results directory you defined above:

- A `.yml` file named `accuracy` documents the model's accuracy metrics.

- For classification tasks, a figure named `roc_curve_plot` shows the classification accuracy.

- For regression tasks, a figure named `accuracy_plot` (a scatter plot) is produced instead.
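
As background for the ROC figure, the sketch below computes the points of a ROC curve and its area under the curve with scikit-learn on four made-up predictions. Which metrics `evaluate()` actually writes to the `accuracy` file is not shown here.

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for the positive class
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# False positive rate and true positive rate at each decision threshold
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under the ROC curve: 1.0 is a perfect ranking, 0.5 is chance level
auc = roc_auc_score(y_true, y_score)
```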

In [None]:
mymodel.evaluate(save_raw_data = True)

## Explain the model using SHAP (SHapley Additive exPlanations)

### Parameters

- **select_background_data**: The data used for **background value calculation**. (Default is `"train"`)

    `"train"` means that all data in the training set will be used. `"test"` means that all data in the test set will be used. `"all"` means that all data in the training and test sets will be used. 

- **select_shap_data**: The data used for **calculating SHAP values**. `"test"` means that all data in the test set will be used. `"all"` means that all data in the training and test sets will be used. (Default is `"test"`)

- **sample_background_data_k**: Number or proportion of samples drawn from the training set for **background value calculation**. (Default is `None`)
    
    `None` means that all data in the training set will be used. An `int` is an absolute number of samples, while a `float` (e.g., 0.5) is a proportion of the training set. 

- **sample_shap_data_k**: Same meaning as `sample_background_data_k`, but applied to the test set used for **SHAP value calculation**. (Default is `None`)

- **output_raw_data**: Whether to save the raw data. (Default is `False`)

> SHAP currently doesn't support multi-class classification tasks when using **GBDT** models. This limitation may affect the interpretability results and users should verify compatibility with their use case.
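
The three accepted forms of the sampling parameters (`None`, an `int`, or a `float`) can be sketched with the standard library. The helper `sample_rows` is hypothetical, and rounding a proportion down with `int()` is an assumption; the package may round differently.

```python
import random


def sample_rows(rows, k, random_state=0):
    """Hypothetical helper: None keeps everything, int is a count, float is a proportion."""
    rows = list(rows)
    if k is None:
        return rows
    if isinstance(k, float):
        n = int(len(rows) * k)  # proportion of the available rows (rounded down)
    else:
        n = int(k)              # absolute number of rows
    n = min(n, len(rows))       # never ask for more rows than exist
    return random.Random(random_state).sample(rows, n)


rows = list(range(200))
all_rows = sample_rows(rows, None)
fifty_rows = sample_rows(rows, 50)
half_rows = sample_rows(rows, 0.5)
```

Sampling the background and SHAP data, as in the call with `50` below, mainly trades some precision for a shorter computation time.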

### Output

The figures (Summary plot, Dependence plots) will be output to the directory you defined above.


In [None]:
mymodel.explain(
    select_background_data = "train",
    select_shap_data = "test",
    sample_background_data_k = 50,
    sample_shap_data_k = 50,
    output_raw_data = True
)