### Regression on the California Housing Dataset

In this Notebook I walk through the steps needed to develop a **regression model** and save it as a JSON file.

After the model has been saved, I will deploy it as a **REST** service.

The Notebook is based on the **California Housing Dataset**, loaded through scikit-learn.

I have **split** the initial demo into two Notebooks:
* this one, where I train the model (using XGBoost) and save it in JSON format
* a second Notebook, where the trained model, serialized as JSON, is saved to the Model Catalog and then deployed as a REST service

In [1]:
import pandas as pd
import numpy as np
from pandas.api.types import is_numeric_dtype

# the dataset used for the example
from sklearn.datasets import fetch_california_housing

from sklearn.model_selection import train_test_split

# the GBM used
import xgboost as xgb

### Some utility functions

In [2]:
# utility functions
def get_general_info(data_df):
    """Print the number of columns, the sorted column names and the number of records."""
    print(f"There are: {len(data_df.columns)} columns in the dataset")
    print()
    print(
        "The list of column names, in alphabetical order:",
        sorted(data_df.columns),
    )
    print()
    print(f"There are {data_df.shape[0]} records in the dataset")
    print()

# a column is treated as categorical when its number of distinct values
# is below this fraction of the number of rows
FRAC = 0.1

def analyze_df(data_df):
    # number of missing values per column (isnull is an alias of isna)
    missing_val = data_df.isna().sum()

    # cardinality threshold: fewer distinct values than this means "categorical"
    THR = data_df.shape[0] * FRAC

    list_card = []
    list_cat = []
    list_dtypes = []

    for col in data_df.columns:
        # count the # of distinct values
        n_distinct = data_df[col].nunique()
        list_card.append(n_distinct)
        
        # the column is flagged as categorical by this rule
        if n_distinct < THR:
            # categorical
            list_cat.append("Yes")
        else:
            list_cat.append("No")

        list_dtypes.append(data_df[col].dtype)

    # build the results DF
    result_df = pd.DataFrame(
        {
            "col_name": list(data_df.columns),
            "missing_vals": missing_val,
            "cardinality": list_card,
            "is_categorical": list_cat,
            "data_type": list_dtypes,
        },
        index=None,
    )

    # missing_val is a Series indexed by column name, so result_df inherits that index;
    # replace it with a plain integer index
    result_df.reset_index(drop=True, inplace=True)

    return result_df

### Load the dataset

In [3]:
# load the dataset
housing = fetch_california_housing(as_frame=True)

orig_df = housing.frame

In [4]:
orig_df.head()

| | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.3252 | 41.0 | 6.984127 | 1.02381 | 322.0 | 2.555556 | 37.88 | -122.23 | 4.526 |
| 1 | 8.3014 | 21.0 | 6.238137 | 0.97188 | 2401.0 | 2.109842 | 37.86 | -122.22 | 3.585 |
| 2 | 7.2574 | 52.0 | 8.288136 | 1.073446 | 496.0 | 2.80226 | 37.85 | -122.24 | 3.521 |
| 3 | 5.6431 | 52.0 | 5.817352 | 1.073059 | 558.0 | 2.547945 | 37.85 | -122.25 | 3.413 |
| 4 | 3.8462 | 52.0 | 6.281853 | 1.081081 | 565.0 | 2.181467 | 37.85 | -122.25 | 3.422 |


### Some EDA

In [5]:
get_general_info(orig_df)

analyze_df(orig_df)

There are: 9 columns in the dataset

The list of column names, in alphabetical order: ['AveBedrms', 'AveOccup', 'AveRooms', 'HouseAge', 'Latitude', 'Longitude', 'MedHouseVal', 'MedInc', 'Population']

There are 20640 records in the dataset



| | col_name | missing_vals | cardinality | is_categorical | data_type |
|---|---|---|---|---|---|
| 0 | MedInc | 0 | 12928 | No | float64 |
| 1 | HouseAge | 0 | 52 | Yes | float64 |
| 2 | AveRooms | 0 | 19392 | No | float64 |
| 3 | AveBedrms | 0 | 14233 | No | float64 |
| 4 | Population | 0 | 3888 | No | float64 |
| 5 | AveOccup | 0 | 18841 | No | float64 |
| 6 | Latitude | 0 | 862 | Yes | float64 |
| 7 | Longitude | 0 | 844 | Yes | float64 |
| 8 | MedHouseVal | 0 | 3842 | No | float64 |
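
The `is_categorical` column follows directly from the threshold rule in `analyze_df`. As a rough check, the rule can be replayed by hand using only the numbers printed above. The dictionary below simply copies the cardinalities from the table, and `flag_categorical` is a hypothetical helper name, not part of the notebook:

```python
# Replays the cardinality rule from analyze_df on the numbers shown in the table.
N_ROWS = 20640   # records reported by get_general_info
FRAC = 0.1       # same fraction as in the notebook

# cardinalities copied from the analyze_df output
cardinality = {
    "MedInc": 12928,
    "HouseAge": 52,
    "AveRooms": 19392,
    "AveBedrms": 14233,
    "Population": 3888,
    "AveOccup": 18841,
    "Latitude": 862,
    "Longitude": 844,
    "MedHouseVal": 3842,
}


def flag_categorical(card, n_rows, frac):
    """Return the columns whose distinct-value count is below n_rows * frac."""
    threshold = n_rows * frac
    return [col for col, n_distinct in card.items() if n_distinct < threshold]


flagged = flag_categorical(cardinality, N_ROWS, FRAC)
print(flagged)
```

With 20640 rows the threshold is about 2064 distinct values, which is why Latitude and Longitude are flagged even though they are continuous coordinates. The rule is only a heuristic; in what follows only HouseAge is treated as categorical.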


In [6]:
# In this example I'll use all the columns except the target (MedHouseVal) as features;
# Latitude and Longitude are also dropped, to simplify

TARGET = "MedHouseVal"
all_cols = list(orig_df.columns)
cols_to_drop = ['Latitude', 'Longitude']

cat_cols = ['HouseAge']

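# note: the feature names are sorted, so the column order differs from the original DataFrame
FEATURES = sorted(set(all_cols) - {TARGET} - set(cols_to_drop))

# positional indexes of the categorical columns (for libraries that need them, e.g. LightGBM)
cat_columns_idxs = [i for i, col in enumerate(FEATURES) if col in cat_cols]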

FEATURES

['AveBedrms', 'AveOccup', 'AveRooms', 'HouseAge', 'MedInc', 'Population']
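
Because `FEATURES` is sorted, the position of each column in the feature matrix is its alphabetical rank, not its position in the original DataFrame. A small sketch of the same set arithmetic, using only the column names shown above (the lowercase names `features` and `cat_idxs` are introduced here for illustration), makes the resulting index explicit:

```python
# Column names in the order returned by the dataset (from orig_df.head()).
all_cols = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms", "Population",
    "AveOccup", "Latitude", "Longitude", "MedHouseVal",
]
TARGET = "MedHouseVal"
cols_to_drop = ["Latitude", "Longitude"]
cat_cols = ["HouseAge"]

# Same logic as the cell above: drop target and unwanted columns, then sort.
features = sorted(set(all_cols) - {TARGET} - set(cols_to_drop))

# Positional index of each categorical column inside the sorted feature list.
cat_idxs = [i for i, col in enumerate(features) if col in cat_cols]

print(features)
print(cat_idxs)
```

HouseAge ends up at index 3 here, whereas it is the second column (index 1) of the original frame, so anything that later sends data to the model has to supply the features in this sorted order.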

In [7]:
# the only important point is that we have one categorical column: HouseAge

# we will encode the categorical values as integers starting from zero
# here it is easy: the minimum is 1, so we only need to subtract 1

In [8]:
# make a copy before any changes
used_df = orig_df.copy()

used_df['HouseAge'] = used_df['HouseAge'] - 1.

used_df['HouseAge'] = used_df['HouseAge'].astype(int)
used_df['HouseAge'] = used_df['HouseAge'].astype("category")
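
Subtracting 1 yields codes 0, 1, 2, ... without gaps only when the values are contiguous integers starting at 1. As a more general sketch (this is not what the notebook does, and `encode_zero_based` is a hypothetical helper), the codes can be derived from the sorted distinct values instead, which gives the same result as "value - 1" in the contiguous case:

```python
def encode_zero_based(values):
    """Map each distinct value to an integer code 0..k-1, in sorted order."""
    categories = sorted(set(values))
    code_of = {value: code for code, value in enumerate(categories)}
    return [code_of[v] for v in values], categories


# toy input: the HouseAge values of the first rows shown by orig_df.head()
ages = [41.0, 21.0, 52.0, 52.0, 52.0]
codes, categories = encode_zero_based(ages)
print(codes)
print(categories)

# with contiguous values 1..52 the codes equal "value - 1", as in the notebook
contiguous = [float(a) for a in range(1, 53)]
codes_contig, _ = encode_zero_based(contiguous)
assert codes_contig == [int(a - 1) for a in contiguous]
```

The mapping has to be stored alongside the model in the general case, since the same value-to-code table must be applied at prediction time.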

In [9]:
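# let's make a simple train/test split
# note: .values returns plain NumPy arrays, so the category dtype set on HouseAge is not passed on to the model
X = used_df[FEATURES].values
y = used_df[TARGET].values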

TEST_SIZE = 0.2

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=TEST_SIZE, random_state=1)
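
To make the effect of `TEST_SIZE = 0.2` concrete, here is a minimal standard-library sketch of a shuffled hold-out split. It is not scikit-learn's implementation and will not reproduce the same row assignment as `train_test_split`; `simple_split` is a hypothetical helper used only to show the idea and the resulting sizes:

```python
import random


def simple_split(n_rows, test_size, seed):
    """Shuffle row indices and cut them into train and test index lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the split is repeatable
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = simple_split(n_rows=20640, test_size=0.2, seed=1)
print(len(train_idx), len(test_idx))
```

On the 20640 records of this dataset, a 20% hold-out leaves 16512 rows for training and 4128 for testing. Fixing the seed, as `random_state=1` does in the cell above, keeps the split identical across runs.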

### Train with best params

In [10]:
%%time
### train with best params

params = {
    "n_estimators": 600,
    "learning_rate": 0.008,
    "max_depth": 7
}

model = xgb.XGBRegressor(**params)

model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=100)

[0]	validation_0-rmse:1.90045
[100]	validation_0-rmse:1.03403
[200]	validation_0-rmse:0.73735
[300]	validation_0-rmse:0.65741
[400]	validation_0-rmse:0.63848
[500]	validation_0-rmse:0.63392
[599]	validation_0-rmse:0.63315
CPU times: user 1h 25min 27s, sys: 12.5 s, total: 1h 25min 39s
Wall time: 3min 1s


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, enable_categorical=False,
             gamma=0, gpu_id=-1, importance_type=None,
             interaction_constraints='', learning_rate=0.008, max_delta_step=0,
             max_depth=7, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=600, n_jobs=32,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)
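
The `validation_0-rmse` values in the log are root mean squared errors on the evaluation set, which here is the test split passed through `eval_set`. As a reminder of what is being measured, here is a minimal plain-Python version of the metric; the sample numbers are illustrative only and are not taken from the model:

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# illustrative values only
y_true = [3.0, 1.0, 2.0, 4.0]
y_pred = [2.0, 1.0, 4.0, 4.0]
print(rmse(y_true, y_pred))
```

The error is expressed in the units of the target. In the scikit-learn dataset MedHouseVal is given in units of 100,000 dollars, so a final RMSE of about 0.63 corresponds to roughly 63,000 dollars.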

### Save the model in JSON format

In [11]:
MODEL_FILE_NAME = 'housing.json'

model.save_model(MODEL_FILE_NAME)

### Test reloading it

In [13]:
model_loaded = xgb.Booster()

model_loaded.load_model(MODEL_FILE_NAME)

In [20]:
model_loaded.attributes()

{'best_iteration': '599',
 'best_ntree_limit': '600',
 'scikit_learn': '{"n_estimators": 600, "objective": "reg:squarederror", "max_depth": 7, "learning_rate": 0.008, "verbosity": null, "booster": null, "tree_method": null, "gamma": null, "min_child_weight": null, "max_delta_step": null, "subsample": null, "colsample_bytree": null, "colsample_bylevel": null, "colsample_bynode": null, "reg_alpha": null, "reg_lambda": null, "scale_pos_weight": null, "base_score": null, "missing": NaN, "num_parallel_tree": null, "random_state": null, "n_jobs": null, "monotone_constraints": null, "interaction_constraints": null, "importance_type": null, "gpu_id": null, "validate_parameters": null, "predictor": null, "enable_categorical": false, "evals_result_": {"validation_0": {"rmse": [1.900449, 1.887128, 1.873942, 1.860858, 1.84788, 1.835065, 1.822358, 1.809725, 1.797241, 1.78485, 1.772612, 1.760471, 1.748413, 1.73648, 1.724705, 1.713, 1.701428, 1.689916, 1.678595, 1.667275, 1.656116, 1.645022, 1.63406,