# Introduction to AutoXGB
![cover](/Deepnote_AutoXGB.png)

[<img alt="Github" src="https://img.shields.io/badge/AutoXGB-0.2.2-1e90ff?logo=github&logoColor=white&style=for-the-badge" />](https://github.com/abhishekkrthakur/autoxgb)

**XGBoost + Optuna: no brainer**


AutoXGB is a simple but effective AutoML tool for training models directly on tabular datasets. It uses [XGBoost](https://xgboost.readthedocs.io/en/stable/) to train the model, [Optuna](https://optuna.org/) for hyperparameter optimization, and [FastAPI](https://fastapi.tiangolo.com/) to serve the model as a web API.

* auto train xgboost directly from CSV files
* auto tune xgboost using Optuna
* auto serve best xgboost model using FastAPI

```shell
pip install autoxgb
```

# Dataset
The dataset is available on Kaggle: [Adult Census Income](https://www.kaggle.com/uciml/adult-census-income) under [CC0: Public Domain](https://creativecommons.org/publicdomain/zero/1.0/). It was extracted from the [1994 Census bureau database](http://www.census.gov/en.html) by Ronny Kohavi and Barry Becker (Data Mining and Visualization, Silicon Graphics). The prediction task is to determine whether a person makes over $50K a year.
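
Before training, it can help to check how balanced the two income classes are. The sketch below uses only the standard library. The helper name `class_balance` is hypothetical, and the inline rows are a synthetic sample for illustration, not rows from the real dataset.

```python
import csv
import io
from collections import Counter


def class_balance(csv_file, target="income"):
    """Return the share of rows per label in the `target` column."""
    counts = Counter(row[target] for row in csv.DictReader(csv_file))
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Synthetic sample for illustration only; not taken from the real dataset.
sample = io.StringIO(
    "age,hours.per.week,income\n"
    "39,40,<=50K\n"
    "52,45,>50K\n"
    "28,40,<=50K\n"
    "44,60,<=50K\n"
)
balance = class_balance(sample)
print(balance)
```

With the real file, the same helper could be called as `class_balance(open("binary_classification.csv", newline=""))`, assuming the target column is named `income` as in this notebook.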

# Initializing
- **train_filename** -> path to training data
- **output** -> path to output folder to store artifacts
- **test_filename** -> path to test data. If not specified, only OOF predictions will be saved
- **task** -> if not specified, the task will be inferred automatically
    - task = "classification"
    - task = "regression"
- **idx** -> if not specified, the id column will be generated automatically with the name `id`
- **targets** -> if not specified, the target column is assumed to be named `target` and the problem will be treated as one of: binary classification, multiclass classification, or single column regression
    * targets = ["target"]
    * targets = ["target1", "target2"]
- **features** -> if not specified, all columns except `id`, `targets` & `kfold` columns will be used
    - features = ["col1", "col2"]
- **categorical_features** -> if not specified, categorical columns will be inferred automatically
    - categorical_features = ["col1", "col2"]
- **use_gpu** -> if not specified, GPU is not used
    - use_gpu = True
    - use_gpu = False
- **num_folds** -> number of folds to use for cross-validation
- **seed** -> random seed for reproducibility


- **num_trials** -> number of Optuna trials to run
    - default is 1000
    - num_trials = 1000
- **time_limit** -> time limit for Optuna trials in seconds
    - if not specified, timeout is not set and all trials are run
    - time_limit = None

- **fast** -> if set to True, hyperparameter tuning uses only one fold. However, the final model is still trained on all folds to generate OOF predictions and test predictions
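
The `num_folds`, `seed`, and OOF (out-of-fold) terms above can be illustrated with a small standard-library sketch. This is not AutoXGB's implementation: the helpers `make_folds` and `oof_predictions` are hypothetical, the "model" is simply the mean of the training targets, and AutoXGB's actual fold creation may differ (for example, by stratifying on the target).

```python
import random


def make_folds(n_rows, num_folds, seed):
    """Assign each row index to one of `num_folds` folds after a seeded shuffle."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [0] * n_rows
    for position, row in enumerate(indices):
        folds[row] = position % num_folds
    return folds


def oof_predictions(y, folds, num_folds):
    """Predict every row with a stand-in model fitted only on the other folds."""
    oof = [None] * len(y)
    for fold in range(num_folds):
        # "Train" on all rows outside the current fold (here: take their mean).
        train = [y[i] for i in range(len(y)) if folds[i] != fold]
        fold_mean = sum(train) / len(train)
        # Predict only the held-out rows, so no row sees its own target.
        for i in range(len(y)):
            if folds[i] == fold:
                oof[i] = fold_mean
    return oof


y = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]  # toy binary targets
folds = make_folds(len(y), num_folds=5, seed=42)
oof = oof_predictions(y, folds, num_folds=5)
print(folds)
print(oof)
```

Because each prediction comes from a model that never saw that row, OOF predictions give an honest estimate of performance on the training set, and the fixed seed makes the fold assignment reproducible.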

In [None]:
from autoxgb import AutoXGB


# Input tabular data and output artifacts
train_filename = "binary_classification.csv"
output = "output"

# optional parameters
test_filename = None
task = None
idx = None
targets = ["income"]
features = None
categorical_features = None
use_gpu = False
num_folds = 5
seed = 42
num_trials = 100
time_limit = 360
fast = False

# Training & Optimization
It's time to define the model with `AutoXGB()` and pass in the previously defined parameters. Finally, we call `axgb.train()` to start training. This runs XGBoost with Optuna and saves the artifacts (**model, predictions, results, config, params, encoders**) in the output folder.

In [None]:
axgb = AutoXGB(
    train_filename=train_filename,
    output=output,
    test_filename=test_filename,
    task=task,
    idx=idx,
    targets=targets,
    features=features,
    categorical_features=categorical_features,
    use_gpu=use_gpu,
    num_folds=num_folds,
    seed=seed,
    num_trials=num_trials,
    time_limit=time_limit,
    fast=fast,
)
axgb.train()

2022-02-09 17:36:52.310 | INFO     | autoxgb.autoxgb:__post_init__:42 - Output directory: output
2022-02-09 17:36:52.314 | INFO     | autoxgb.autoxgb:_process_data:149 - Reading training data
2022-02-09 17:36:52.403 | INFO     | autoxgb.utils:reduce_memory_usage:48 - Mem. usage decreased to 2.64 Mb (29.2% reduction)
2022-02-09 17:36:52.427 | INFO     | autoxgb.autoxgb:_determine_problem_type:140 - Problem type: binary_classification
2022-02-09 17:36:52.428 | INFO     | autoxgb.autoxgb:_create_folds:58 - Creating folds
2022-02-09 17:36:52.472 | INFO     | autoxgb.autoxgb:_process_data:170 - Encoding target(s)
2022-02-09 17:36:52.486 | INFO     | autoxgb.autoxgb:_process_data:195 - Found 8 categorical features.
2022-02-09 17:36:52.490 | INFO     | autoxgb.autoxgb:_process_data:198 - Encoding categorical features
2022-02-09 17:36:53.034 | INFO     | autoxgb.autoxgb:_process_data:236 - Model config: train_filename='binary_classification.csv' test_filename=None idx='id' targets=['income'] p

2022-02-09 18:04:11.459 | INFO     | autoxgb.utils:predict_model:333 - Fold 2 done!
2022-02-09 18:04:11.460 | INFO     | autoxgb.utils:predict_model:238 - Training and predicting for fold 3
Parameters: { "colsample_bytree", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter actually being used
  but getting flagged wrongly here. Please open an issue if you find any such cases.


2022-02-09 18:07:48.436 | INFO     | autoxgb.utils:predict_model:333 - Fold 3 done!
2022-02-09 18:07:48.442 | INFO     | autoxgb.utils:predict_model:238 - Training and predicting for fold 4
Parameters: { "colsample_bytree", "max_depth", "subsample", "tree_method" } might not be used.

  This could be a false alarm, with some parameters getting used by language bindings but
  then being mistakenly passed down to XGBoost core, or some parameter
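
To make the roles of `num_trials` and `time_limit` concrete, here is a minimal sketch of a tuning loop that stops at whichever budget is exhausted first. It uses plain random search as a stand-in for Optuna's sampler, and the `tune` helper, the search ranges, and the toy objective are all hypothetical; AutoXGB's real search space and objective are not shown here.

```python
import random
import time


def tune(objective, num_trials, time_limit=None, seed=42):
    """Random search that stops after `num_trials` trials or `time_limit` seconds."""
    rng = random.Random(seed)
    start = time.monotonic()
    best_params, best_score, trials_run = None, float("inf"), 0
    for _ in range(num_trials):
        # Stop early once the time budget is used up.
        if time_limit is not None and time.monotonic() - start >= time_limit:
            break
        params = {
            "learning_rate": 10 ** rng.uniform(-2, -0.6),  # log-uniform sample
            "max_depth": rng.randint(1, 9),
        }
        score = objective(params)
        trials_run += 1
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score, trials_run


def toy_objective(params):
    """Hypothetical loss with its minimum at learning_rate=0.1, max_depth=5."""
    return (params["learning_rate"] - 0.1) ** 2 + 0.01 * (params["max_depth"] - 5) ** 2


best_params, best_score, trials_run = tune(toy_objective, num_trials=100, time_limit=360)
print(trials_run, best_params, best_score)
```

With `time_limit=None` the loop always runs all `num_trials` trials, which mirrors the behavior described for AutoXGB when no time limit is set.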

## Training with CLI

We can also train the model from the terminal using the `autoxgb train` command. The parameters are the same as above.

```
autoxgb train \
 --train_filename binary_classification.csv \
 --output output
```

# Web API
Running `autoxgb serve` from the CLI starts a local FastAPI server.

![Picture title](/image-20220210-172802.png)


## AutoXGB Serve Parameters
- model_path -> Path to model
- port -> Port to serve on
- host -> Host to serve on
- workers -> Number of workers
- debug -> Display logs of error and success
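
As a rough sketch, the parameters above can be assembled into a command line programmatically. The helper `build_serve_command` and its defaults are hypothetical. The `--model_path`, `--host`, `--port`, and `--debug` flags appear in this notebook, while the `--workers` spelling is an assumption based on the parameter name.

```python
import shlex


def build_serve_command(model_path, host="0.0.0.0", port=8080, workers=None, debug=False):
    """Assemble the argument list for `autoxgb serve` without running it."""
    command = ["autoxgb", "serve", "--model_path", model_path,
               "--host", host, "--port", str(port)]
    if workers is not None:
        # Assumption: the workers option is spelled `--workers`.
        command += ["--workers", str(workers)]
    if debug:
        command.append("--debug")
    return command


command = build_serve_command("/work/output", debug=True)
print(shlex.join(command))
```

The resulting list could be passed to `subprocess.run(command)` to start the server from a script instead of a notebook cell.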

## Deepnote Public Server
To expose a local server from the cloud, Deepnote uses `ngrok`. We just need to turn on the option and use port 8080.

![Picture title](/image-20220210-171603.png)

Our API is running smoothly, and you can access it at `https://8d3ae411-c6bc-4cad-8a14-732f8e3f13b7.deepnoteproject.com/docs`. We provided just the model path, host IP, and port number to start the server.

In [None]:
!autoxgb serve --model_path /work/output --host 0.0.0.0 --port 8080 --debug

INFO:     Will watch for changes in these directories: ['/work']
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO:     Started reloader process [153] using watchgod
INFO:     Started server process [163]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     172.3.167.43:41136 - "GET /favicon.ico HTTP/1.1" 404 Not Found
INFO:     172.3.167.43:41278 - "GET / HTTP/1.1" 404 Not Found
INFO:     172.3.188.123:38366 - "GET /doc HTTP/1.1" 404 Not Found
INFO:     172.3.161.55:40498 - "GET /doc HTTP/1.1" 404 Not Found
INFO:     172.3.161.55:40628 - "GET /docs HTTP/1.1" 200 OK
INFO:     172.3.188.123:38788 - "GET /openapi.json HTTP/1.1" 200 OK
INFO:     172.3.167.43:48326 -

# Prediction

We can enter arbitrary inputs to predict income. This example uses the interactive GUI that FastAPI exposes at `/docs`.

## Input
Open the FastAPI GUI by appending `/docs` to the server URL, for example `172.3.167.43:39118/docs`, then enter the following values:
- workclass: "Private"
- education: "HS-grad"
- marital.status: "Widowed"
- occupation: "Transport-moving"
- relationship: "Unmarried"
- race: "White"
- sex: "Male"
- native.country: "United-States"
- age: 20
- fnlwgt: 313986
- education.num: 9
- capital.gain: 0
- capital.loss: 0
- hours.per.week: 40

![Picture title](/image-20220210-173410.png)

## Outcome
The result is `<=50K` with a probability of 97.6% and `>50K` with a probability of 2.4%.

![Picture title](/image-20220210-173502.png)

## Test with Requests
You can also call the API with `requests` in Python. Pass the parameters as a dictionary and read the response as JSON.

In [None]:
import requests

params = {
    "workclass": "Private",
    "education": "HS-grad",
    "marital.status": "Widowed",
    "occupation": "Transport-moving",
    "relationship": "Unmarried",
    "race": "White",
    "sex": "Male",
    "native.country": "United-States",
    "age": 20,
    "fnlwgt": 313986,
    "education.num": 9,
    "capital.gain": 0,
    "capital.loss": 0,
    "hours.per.week": 40,
}


# POST the feature dictionary as a JSON body to the /predict endpoint
response = requests.post(
    "https://8d3ae411-c6bc-4cad-8a14-732f8e3f13b7.deepnoteproject.com/predict",
    json=params,
)

data_dict = response.json()
print(data_dict)
## {'id': 0, '<=50K': 0.9762147068977356, '>50K': 0.023785298690199852}
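
The response maps each class label to a probability, so picking the predicted class is a matter of taking the largest value. The sketch below does that for the output shown above. The helper `top_class` is hypothetical, and it assumes every key other than `id` is a class label.

```python
def top_class(prediction, id_key="id"):
    """Return (label, probability) for the most likely class in one prediction."""
    # Assumption: every key except `id_key` is a class label with a probability.
    scores = {key: value for key, value in prediction.items() if key != id_key}
    label = max(scores, key=scores.get)
    return label, scores[label]


# Response copied from the output above.
data_dict = {"id": 0, "<=50K": 0.9762147068977356, ">50K": 0.023785298690199852}
label, probability = top_class(data_dict)
print(f"{label} ({probability:.1%})")
```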
