In [1]:
import numpy as np
import pandas as pd


from mymodels import data_engineer
from mymodels import MyModel




## 1. Construct an object for workflow

- **random_state**: Random seed for the entire pipeline (data splitting, model tuning, etc.). (Default is 0)

In [2]:
mymodel = MyModel(random_state = 0)

## 2. Data Engineering

### Note

The `mymodels.data_engineer()` function returns a `sklearn.pipeline.Pipeline` object, which is passed to `mymodel.load()` below and applied during `mymodel.optimize()`. This `Pipeline` preprocesses the data before model training. It is used in two ways:

- Being called in each fold of cross-validation to fit and transform the training set, then transform the validation set. Each fold will create an independent pipeline object that doesn't affect others.

- After hyperparameter optimization, a new pipeline object will be created to fit and transform all training data, then transform the test set.
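The per-fold behaviour described above can be sketched with plain scikit-learn. This is an illustration of the general pattern, not the package's internal code: the toy data, the `template` pipeline, and the use of `sklearn.base.clone` are all assumptions made for the example.

```python
import numpy as np
from sklearn.base import clone
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric data with one missing value (illustrative only).
X = np.array([[1.0], [2.0], [np.nan], [4.0], [5.0], [6.0]])

# Unfitted template; it is never fitted itself.
template = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

fold_means = []
splitter = KFold(n_splits=3, shuffle=True, random_state=0)
for train_idx, valid_idx in splitter.split(X):
    fold_pipeline = clone(template)                       # independent copy per fold
    X_train = fold_pipeline.fit_transform(X[train_idx])   # fit on the training fold only
    X_valid = fold_pipeline.transform(X[valid_idx])       # reuse the training statistics
    fold_means.append(fold_pipeline.named_steps["impute"].statistics_[0])
```

Because each fold works on its own clone, statistics learned in one fold (here, the imputation mean) never leak into another fold or into the validation data.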

Users can define their own pipeline objects for feature engineering (e.g., adding feature selection steps), as long as they are instances of `sklearn.pipeline.Pipeline`. Custom pipelines are not tested by this project, so users are responsible for validating them.

The pipeline will be exported to a `data_engineer_pipeline.joblib` file in the specified result path, which users can manually load and reuse.
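As a rough idea of what a user-defined pipeline might look like, here is a minimal sketch in plain scikit-learn that imputes and one-hot encodes the Titanic columns used in this notebook. The step names and the `self_defined_pipeline` variable are hypothetical, and whether `mymodels` places further requirements on custom pipelines (for example on the output type) is not stated here, so treat it as a starting point to test rather than a supported recipe.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Numeric column: fill missing values with the mean.
numeric = Pipeline([("impute", SimpleImputer(strategy="mean"))])

# Categorical columns: fill with the most frequent value, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

self_defined_pipeline = Pipeline([
    ("columns", ColumnTransformer(
        transformers=[
            ("num", numeric, ["Age"]),
            ("cat", categorical, ["Sex", "Embarked"]),
        ],
        remainder="passthrough",  # keep the remaining columns unchanged
    )),
])

# Tiny made-up frame to check that the pipeline runs end to end.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", np.nan, "C"],
    "Fare": [7.25, 71.28, 8.05],
})
transformed = self_defined_pipeline.fit_transform(toy)
```

Running the pipeline on a few rows like this is a quick way to catch shape or dtype problems before handing it to the workflow.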


### Parameters

- **outlier_cols**: Specify the columns with outliers. (Default is `None`)

  > It's not supported currently, but will be added in the future.

- **missing_values_cols**: A `list` (or `tuple`) of the columns that have missing values. (Default is `None`)

- **impute_method**: A `str`, or a `list` (or `tuple`) of `str`, naming the imputation methods. (Default is `None`)

  All imputation strategies of `sklearn.impute.SimpleImputer` are supported.

  If a `str` is given, every column in `missing_values_cols` is imputed with that method. If a `list` (or `tuple`) is given, its length must match `missing_values_cols`. The two parameters must either both be provided or both be `None`.

- **cat_features**: A `list` (or `tuple`) of the categorical columns. (Default is `None`)

- **encode_method**: A `str` naming the encoding method, or a `list` (or `tuple`) of encoding methods. (Default is `None`)

  If a `str` is given, every column in `cat_features` is encoded with that method. If a `list` (or `tuple`) is given, its length must match `cat_features`. The two parameters must either both be provided or both be `None`.

  > A full list of supported encode methods can be found at the end.

- **scale_cols**: A `list` (or `tuple`) of the columns that need scaling. (Default is `None`)

- **scale_method**: A `str` naming the scaling method, or a `list` (or `tuple`) of scaling methods. (Default is `None`)

  Currently `sklearn.preprocessing.StandardScaler` and `sklearn.preprocessing.MinMaxScaler` are supported.

  If a `str` is given, every column in `scale_cols` is scaled with that method. If a `list` (or `tuple`) is given, its length must match `scale_cols`. The two parameters must either both be provided or both be `None`.

- **n_jobs**: Number of parallel jobs used in data engineering, which speeds up processing of large datasets. (Default is `1`)

- **verbose**: Whether to print information during the transformation. (Default is `False`)
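The pairing rule shared by the `*_cols` and `*_method` parameters (a single `str` applies to every column, a `list` must match in length, and both must be given or both be `None`) can be sketched as follows. `pair_columns_with_methods` is a hypothetical helper written for this illustration; it is not part of `mymodels`, and the exact errors the package raises may differ.

```python
def pair_columns_with_methods(cols, methods):
    """Hypothetical helper: return a {column: method} mapping."""
    # Both omitted: nothing to do.
    if cols is None and methods is None:
        return {}
    # Exactly one omitted: the parameters must come as a pair.
    if cols is None or methods is None:
        raise ValueError("columns and methods must both be provided or both be None")
    # A single string is broadcast to every column.
    if isinstance(methods, str):
        return {col: methods for col in cols}
    # A list or tuple must line up one-to-one with the columns.
    if len(methods) != len(cols):
        raise ValueError("methods must have the same length as columns")
    return dict(zip(cols, methods))


broadcast = pair_columns_with_methods(["Sex", "Embarked"], "onehot")
paired = pair_columns_with_methods(["Age", "Embarked"], ["mean", "most_frequent"])
```

Here `broadcast` maps both categorical columns to `"onehot"`, while `paired` assigns `"mean"` to `Age` and `"most_frequent"` to `Embarked`, mirroring the call in the next cell.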


In [3]:
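# Optional: a user-defined pipeline can be used instead.
# `Pipeline` requires a list of (name, transformer) steps, elided here.
# from sklearn.pipeline import Pipeline
# self_defined_data_engineer_pipeline = Pipeline(steps=[...])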


data_engineer_pipeline = data_engineer(
    outlier_cols = None,
    missing_values_cols = ["Age", "Embarked"],
    impute_method = ["mean", "most_frequent"],
    cat_features = ["Sex", "Embarked"],
    encode_method = ["onehot", "onehot"],
    # scale_cols = ["Fare"],
    # scale_method = ["standard"],
    n_jobs = 5,
    verbose = False
)

## 3. Load data and configurations

- **data**: Use pandas for data input. 

    > It's STRONGLY RECOMMENDED to set the index column if you want to output the raw data and the SHAP values. A `list` (or `tuple`) of several index columns is also accepted.

### Parameters

- **model_name**: The model to use. For example, `xgbc` denotes the XGBoost classifier and `catr` the CatBoost regressor; the cell below uses `rfc`. A full list of model names for the different models and tasks can be found at the end.

- **input_data**: A `pd.DataFrame` holding the input data.

- **y**: The target to predict. Either a `str` giving the column name or an `int` giving the column index.

- **x_list**: A `list` (or `tuple`) of the independent variables. Each element must be either a `str` giving the column name or an `int` giving the column index.

- **test_ratio**: The proportion of test data. (Default is 0.3)

- **stratify**: Whether or not to split the data in a stratified fashion. (Default is False)


- **data_engineer_pipeline**: A `sklearn.pipeline.Pipeline` object for data engineering.

- **cat_features**: A `list` (or a `tuple`) of categorical features to specify for **CatBoost model ONLY**. A `list` (or a `tuple`) of `str` representing the column names. (Default is `None`)

  > If the model_name is neither `catc` nor `catr` (which represent CatBoost models), this parameter must be set to `None`; otherwise, an assertion error will occur.


- **model_configs_path**: Path to the configuration file that defines the hyperparameter tuning space (`model_configs.yml` in this example). Users can edit the hyperparameters in this file to fit their needs.
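To make `test_ratio` and `stratify` concrete, the sketch below splits 891 made-up rows with scikit-learn's `train_test_split`. That `mymodel.load()` splits the data this way internally is an assumption, although the 623 training rows reported by the diagnosis output later in this notebook are consistent with a 0.3 test ratio on 891 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 891 rows with imbalanced binary labels (illustrative only).
X = np.arange(891).reshape(-1, 1)
y = np.array([0] * 549 + [1] * 342)

# Plain random split, as with stratify = False.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=None
)

# Stratified split: class proportions are preserved in both parts.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(len(X_train), len(X_test))          # sizes of the two parts
print(y.mean(), y_test_s.mean())          # positive rate overall vs. stratified test set
```

Stratifying is mostly useful for classification tasks with imbalanced classes, where a purely random split can leave the test set with a noticeably different class balance.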


In [4]:
data = pd.read_csv("data/titanic.zip", encoding="utf-8",
                   na_values=np.nan, index_col=["PassengerId"])

mymodel.load(
    model_name = "rfc",
    input_data = data,
    y = "Survived",
    x_list = ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"],
    test_ratio = 0.3,
    stratify = False,
    data_engineer_pipeline = data_engineer_pipeline,
    cat_features = None,  # must be None unless model_name is "catc" or "catr"
    model_configs_path = "model_configs.yml"
)



## 4. Format the visualization and output

### Parameters

- **results_dir**: Directory path where your results will be stored. Accepts either a `str` or a `pathlib.Path` object. The directory will be created if it doesn't exist.

- **show**: Whether to display the figure on the screen. (Default is `False`)

- **plot_format**: Output format for figures. (Default is `jpg`)

- **plot_dpi**: Resolution of the output figures. (Default is `500`)

- **save_optimal_model**: Whether to save the optimal model. (Default is `False`)

    If `save_optimal_model` is `True`:

    - `optimal-model.joblib` will save the optimal model from sklearn.

    - `optimal-model.cbm` will save the optimal model from CatBoost.

    - `optimal-model.txt` will save the optimal model from LightGBM.

    - `optimal-model.json` will save the optimal model from XGBoost.

    - `optimal-model.pkl` will save all types of optimal model for compatibility.

- **save_raw_data**: Whether to save the raw data. (Default is `False`)

- **save_shap_values**: Whether to export the SHAP values. (Default is `False`)
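A saved model can later be reloaded from the results directory. The sketch below assumes that `optimal-model.pkl` is an ordinary `pickle` file, which is not confirmed here; `load_optimal_model` is a hypothetical helper, and the demonstration pickles a stand-in dictionary rather than a real model so that it runs on its own.

```python
import pickle
import tempfile
from pathlib import Path


def load_optimal_model(results_dir):
    """Hypothetical helper: load `optimal-model.pkl` from a results directory."""
    model_path = Path(results_dir) / "optimal-model.pkl"
    if not model_path.exists():
        raise FileNotFoundError(f"No saved model found at {model_path}")
    with model_path.open("rb") as f:
        return pickle.load(f)


# Self-contained demonstration with a stand-in object instead of a real model.
with tempfile.TemporaryDirectory() as tmp_dir:
    results_dir = Path(tmp_dir) / "results" / "titanic"
    results_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    with (results_dir / "optimal-model.pkl").open("wb") as f:
        pickle.dump({"stand_in": "model"}, f)
    restored = load_optimal_model(results_dir)
```

Only unpickle files that you created yourself or otherwise trust, since loading a pickle can execute arbitrary code.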

In [5]:
mymodel.format(
    results_dir = "results/titanic",
    show = False,
    plot_format = "jpg",
    plot_dpi = 500,
    save_optimal_model = True,
    save_raw_data = True,
    save_shap_values = True
)

## 5. Diagnose the data

The `mymodel.diagnose()` method provides visual data diagnostics, including:

- Data types

- Missing data counts

- Distribution, count, and proportion of categorical variables (displayed as bar charts)

- Distribution of continuous variables (displayed as violin plots and box plots)

- Correlation of continuous variables, **strongly recommended to review before SHAP analysis** (displayed as heatmaps using Spearman and Pearson correlation tests)

**Note: This method only diagnoses the training set after data splitting.**
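The two correlation measures mentioned above can differ noticeably, which is why reviewing both is worthwhile. The following sketch uses `pandas.DataFrame.corr` on made-up data; it is not the plotting code used by `diagnose()`.

```python
import pandas as pd

# Made-up data: y grows monotonically with x, but not linearly.
df = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
df["y"] = df["x"] ** 3

pearson = df.corr(method="pearson")    # measures linear association
spearman = df.corr(method="spearman")  # measures monotonic (rank) association

print(pearson.loc["x", "y"])
print(spearman.loc["x", "y"])
```

Spearman reports a perfect monotonic relationship here while Pearson does not, because the relationship is not linear. Strongly correlated features tend to share credit in SHAP attributions, so spotting them beforehand helps when reading the explanation plots.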

### Parameters

- **sample_k**: The sampling ratio for the diagnostic data. (`None` is passed in this example.)


In [6]:
mymodel.diagnose(sample_k=None)


Data diagnosis is performed on TRAINING DATASET ONLY.


Categorical Features Statistics:
Feature Name  Count  Null Count Null Ratio  Unique Count Unique Ratio
         Sex    623           0      0.00%             2        0.32%
    Embarked    623           2      0.32%             3        0.48%

Numerical Features Statistics:
Feature Name  Count  Null Count Null Ratio  Min   25% Median   75%    Max  Mean   Std Kurtosis Skewness
      Pclass    623           0      0.00% 1.00  1.50   3.00  3.00   3.00  2.29  0.84    -1.34    -0.58
         Age    623         121     19.42% 0.67 21.00  29.00 38.00  80.00 29.92 14.51     0.25     0.34
       SibSp    623           0      0.00% 0.00  0.00   0.00  1.00   8.00  0.53  1.16    19.25     3.91
       Parch    623           0      0.00% 0.00  0.00   0.00  0.00   6.00  0.39  0.83     9.77     2.75
        Fare    623           0      0.00% 0.00  7.92  15.00 31.39 512.33 32.46 48.26    35.27     4.84


## 6. Optimizing

### Parameters

- **strategy**: The hyperparameter search strategy.
  - `tpe`: `TPESampler` of Optuna.
  - `random`: `RandomSampler` of Optuna.

- **cv**: Number of cross-validation folds used during tuning. (Default is 5)

- **trials**: Number of trials in the tuning process (based on [Optuna](https://optuna.org/)). The 10 trials used here are for demonstration only; set a larger value for better hyperparameter optimization. (Default is 100)

- **n_jobs**: Number of cores used in the cross-validation process. It's recommended to use the same value as `cv`. (Default is 5)

- **direction**: The optimization direction, either `"minimize"` or `"maximize"`. (Default is `"maximize"`)

- **eval_function**: A user-defined evaluation function. **It must be a callable object and be compatible with the `direction`**. (Default is `None`)

    ```python
    >>> from sklearn.metrics import cohen_kappa_score
    >>> ……
    >>> mymodel.optimize(
    >>>     ……,
    >>>     direction = "maximize",
    >>>     eval_function = cohen_kappa_score,
    >>>     ……
    >>> )
    ```
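Being compatible with `direction` means that a score where higher is better goes with `"maximize"`, and a loss where lower is better goes with `"minimize"`. As a sketch, a hand-written loss could look like this; it assumes the function is called as `eval_function(y_true, y_pred)`, matching the signature of `cohen_kappa_score`, which is not confirmed here.

```python
import numpy as np


def error_rate(y_true, y_pred):
    """Fraction of misclassified samples; lower is better."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float(np.mean(y_true != y_pred))


# One of four labels is wrong in this toy example.
toy_error = error_rate([0, 1, 1, 0], [0, 1, 0, 0])
print(toy_error)
```

Since a lower error rate is better, such a function would be paired with `direction = "minimize"`.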

In [7]:
mymodel.optimize(
    strategy = "tpe",
    cv = 5,
    trials = 10,
    n_jobs = 5,
    direction = "maximize",
    eval_function = None
)

Best trial: 4. Best value: 0.826658: 100%|██████████| 10/10 [00:28<00:00,  2.84s/it]


## 7. Evaluate the model's accuracy

### Parameters

- **show_train**: Whether to show the accuracy on training set. (Default is `False`)

- **dummy**: Whether to use a dummy estimator for comparison. (Default is `False`)

- **eval_metric**: A user-defined evaluation metric. (Default is `None`)
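A dummy estimator gives a baseline that any useful model should beat. The sketch below uses scikit-learn's `DummyClassifier` with the `most_frequent` strategy on made-up labels; which strategy `mymodels` uses for its dummy is an assumption, although a Kappa of 0.0 in the dummy results further down is what a constant predictor would produce.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Made-up data: 6 negatives and 4 positives; the features are ignored by the dummy.
X = np.zeros((10, 1))
y = np.array([0] * 6 + [1] * 4)

dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_dummy = dummy.predict(X)  # always predicts the majority class

dummy_accuracy = accuracy_score(y, y_dummy)   # equals the majority-class share
dummy_kappa = cohen_kappa_score(y, y_dummy)   # no agreement beyond chance
print(dummy_accuracy, dummy_kappa)
```

Accuracy alone can look respectable for such a baseline on imbalanced data, which is why chance-corrected metrics such as Kappa are reported alongside it.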

In [8]:
mymodel.evaluate(
    show_train = True,
    dummy = True,
    eval_metric = None
)



{
    "model": {
        "test": {
            "Overall Accuracy": 0.8208955223880597,
            "Precision": 0.8202415294475999,
            "Recall": 0.8208955223880597,
            "F1": 0.8205208492234899,
            "Kappa": 0.6155868993545301,
            "Matthews Correlation Coefficient": 0.6156658792988534,
            "Specificity": 0.8630952380952381
        },
        "train": {
            "Overall Accuracy": 0.898876404494382,
            "Precision": 0.8985556501894498,
            "Recall": 0.898876404494382,
            "F1": 0.8984200568507754,
            "Kappa": 0.7853779904306221,
            "Matthews Correlation Coefficient": 0.7859300160237404,
            "Specificity": 0.931758530183727
        }
    },
    "dummy": {
        "test": {
            "Overall Accuracy": 0.6268656716417911,
            "Precision": 0.3929605702829138,
            "Recall": 0.6268656716417911,
            "F1": 0.4830891414487197,
            "Kappa": 0.0,
            "Matthews

## 8. Explaining

### Parameters

- **select_background_data**: The data used for **background value calculation**. (Default is `"train"`)

    `"train"` uses all data in the training set, `"test"` uses all data in the test set, and `"all"` uses all data in both the training and test sets.

- **select_shap_data**: The data used for **calculating SHAP values**. `"test"` uses all data in the test set, and `"all"` uses all data in both the training and test sets. (Default is `"test"`)

- **sample_background_data_k**: How many samples to draw from the training set for **background value calculation**. (Default is `None`)

    `None` means that all data in the training set is used. An integer gives the number of samples to draw, while a float (e.g., 0.5) gives the proportion of the training set to draw.

- **sample_shap_data_k**: Same meaning as `sample_background_data_k`, but applied to the test set used for **SHAP value calculation**. (Default is `None`)

> SHAP currently doesn't support multi-class classification tasks when using **GBDT** models. This limitation may affect the interpretability results and users should verify compatibility with their use case.
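The `None` / integer / float convention of the two sampling parameters can be sketched as below. `resolve_sample_size` is a hypothetical helper written for illustration; in particular, rounding a proportion down to a whole number of rows is an assumption, and the package may round differently.

```python
def resolve_sample_size(k, n_rows):
    """Hypothetical helper: turn a sampling argument into a row count."""
    if k is None:
        return n_rows                      # use every available row
    if isinstance(k, bool):
        raise TypeError("k must be None, an int, or a float")
    if isinstance(k, int):
        if not 0 < k <= n_rows:
            raise ValueError("an integer k must be between 1 and n_rows")
        return k                           # absolute number of rows
    if isinstance(k, float):
        if not 0.0 < k <= 1.0:
            raise ValueError("a float k must be in the interval (0, 1]")
        return max(1, int(k * n_rows))     # proportion of the rows, rounded down
    raise TypeError("k must be None, an int, or a float")


all_rows = resolve_sample_size(None, 623)
fixed_rows = resolve_sample_size(50, 623)
half_rows = resolve_sample_size(0.5, 623)
print(all_rows, fixed_rows, half_rows)
```

Sampling a small background set, as in the next cell, mainly trades some stability of the SHAP estimates for shorter computation time.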

In [9]:
mymodel.explain(
    select_background_data = "train",
    select_shap_data = "test",
    sample_background_data_k = 50,
    sample_shap_data_k = 50
)

## 9. Prediction

Finally, the optimized model and data engineering pipeline can be utilized to generate predictions on the test dataset.  

In this step, the final predictions are saved to a file named `prediction.csv`, which has two columns: the index and the predicted values.

The `prediction.csv` file can then be uploaded to the Kaggle platform to obtain a score for the predictions.


In [None]:
data_pred = pd.read_csv("data/titanic_test.csv", encoding = "utf-8",
                        na_values = np.nan, index_col = ["PassengerId"])

data_pred = data_pred.loc[:, ["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Embarked"]]

y_pred = mymodel.predict(data = data_pred)
y_pred.name = "Survived"
y_pred.to_csv("results/titanic/prediction.csv", encoding = "utf-8", index = True)
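Before uploading, it can be worth checking that the exported file has the expected two-column layout. The sketch below rebuilds that layout from a made-up `pandas.Series` and reads it back from an in-memory buffer. It assumes that `mymodel.predict()` returns a `Series` whose index keeps the `PassengerId` name, which the cell above suggests but does not guarantee.

```python
import io

import pandas as pd

# Made-up predictions standing in for the output of mymodel.predict().
y_pred_demo = pd.Series(
    [0, 1, 1],
    index=pd.Index([892, 893, 894], name="PassengerId"),
    name="Survived",
)

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
y_pred_demo.to_csv(buffer, index=True)
buffer.seek(0)
submission = pd.read_csv(buffer)

print(list(submission.columns))
print(len(submission))
```

If the index name is missing in the real output, the first column header would be empty, and setting `y_pred.index.name` before calling `to_csv` would restore it.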