# Prediction model training

With analysis of the data and relationships between the data, the dataset is defined. The next step is the creation of a job that can create and regularly update datasets in the gold layer of the medallion architecture. This will keep the dataset up to date and ready for the model-retraining process.

Creating a model by applying algorithms to data is done through the following steps: 
1) Load the dataset and select relevant features/columns
2) If features aren't ready, preprocess and apply techniques to prepare them for the training process
3) Split the dataset into train/test or train/val/test groups
4) Train model
5) Evaluate the model on a relevant set of metrics
6) Save model

In [160]:
import pickle
from pathlib import Path

import numpy as np
import polars as pl
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

## Paths and dataset

Defines paths and loads dataset from the gold layer.

In [161]:
ROOT_PATH = Path(__name__).absolute().parent.parent
DATA_PATH = ROOT_PATH / "data"
MODEL_PATH = ROOT_PATH / "model"

GOLD_PATH = DATA_PATH / "storage" / "gold"

In [5]:
dataset_path = GOLD_PATH / "dataset.parquet"
dataset_df = pl.read_parquet(source=dataset_path)

dataset_df.head(1)

customer_id,book_id,name,gender,education,occupation,age_category,price,pages,label,price_standardized,pages_standardized
str,str,str,u32,u32,u32,u32,f32,i32,u32,f32,f32
"""04814766f679bd3db04cb360a40eea…","""mR9eAgAAQBAJ""","""Willard Plant""",0,0,0,0,312.0,640,1,0.032734,0.292535


Select only relevant column that will later be part of the training process. Because the data is already prepared in front, the focus of the notebook is only on the model training.

In [6]:
dataset_for_train = dataset_df.select(
    [
        "gender",
        "education",
        "occupation",
        "age_category",
        "price_standardized",
        "pages_standardized",
        "label",
    ]
)

In [7]:
print(f"Dataset length: {len(dataset_for_train)}")

Dataset length: 1413


There are 1413 data points available. This is a very small sample and this training is the showcase. For any real-world use cases we need as much **quality** data as we can.

In [None]:
# lock seed for reproducability

In [8]:
# shuffle dataset
dataset_for_train = dataset_for_train.sample(fraction=1, shuffle=True)

In [10]:
dataset = dataset_for_train.to_pandas()

We could continue with the Polars Dataframes for training but Pandas Dataframes are going to be used for simplicity and "better" compliance with sklearn library. 
Approach with Polars would be different, train/test split would be done in advance and if volume of data is bigger. In that case LazyFrame with Tensorflow/PyTorch would be better option.

In [11]:
train_df, test_df = train_test_split(dataset, test_size=0.3)

In [19]:
# separate features and labels

train_features, train_labels = train_df.iloc[:, :-1], train_df.iloc[:, [-1]]
test_features, test_labels = test_df.iloc[:, :-1], test_df.iloc[:, [-1]]

In [28]:
# transform labels from 1d dataframe into 1d array

train_labels = train_labels.values.ravel()
test_labels = test_labels.values.ravel()

In [120]:
def evaluation(predictions, true_labels):
    accuracy = round(accuracy_score(true_labels, predictions) * 100, 2)
    precision, recall, fbeta, _ = precision_recall_fscore_support(
        true_labels, predictions
    )

    print(f"Accuracy: {accuracy}")
    print(f"Precision score: {precision}")
    print(f"Recall score: {recall}")
    print(f"FBeta score: {fbeta}")

## Training different models 

Library problem is presented as binary classification problem. We can compare a couple of "traditional" machine learning algorithm which should be sufficient for this type of problem and data sample without need to go with something heavy such as neural networks.

In [143]:
logistic_regression = LogisticRegression()
random_forest = RandomForestClassifier(random_state=123)

### Fitting logistic regression model

In [144]:
logistic_regression.fit(train_features, train_labels)

In [145]:
preds = logistic_regression.predict(test_features)

print("Evaluating logistic regression predictor performance: \n")
evaluation(preds, test_labels)

Evaluating logistic regression predictor performance: 

Accuracy: 71.93
Precision score: [0.         0.71933962]
Recall score: [0. 1.]
FBeta score: [0.         0.83676269]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


### Fitting Random Forest model

In [137]:
random_forest.fit(train_features, train_labels)

In [138]:
preds = random_forest.predict(test_features)

print("Evaluating Random Forest predictor performance: \n")
evaluation(preds, test_labels)

Evaluating Random Forest predictor performance: 

Accuracy: 71.46
Precision score: [0.45454545 0.72885572]
Recall score: [0.08403361 0.96065574]
FBeta score: [0.14184397 0.82885431]


Results are bad. Predictor only predicts positive class and it seems that categorical features are playing small role in determining label. This was somehow conclusion after EDA.
Accuracy is a falls metrics here as it predicts mostly dominant class. Models are failing a lot with the negative class. Data imbalance is definitely a problem and the features are not separating space clearly.  

## Training with class weights

In [146]:
weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(train_labels), y=train_labels
)

In [147]:
weights

array([2.19777778, 0.64725131])

In [154]:
logistic_regression_with_weights = LogisticRegression(
    class_weight={0: 2.1977, 1: 0.647}
)
random_forest_with_weights = RandomForestClassifier(
    random_state=123, class_weight={0: 2.1977, 1: 0.647}
)

In [155]:
logistic_regression_with_weights.fit(train_features, train_labels)

In [156]:
preds = logistic_regression.predict(test_features)

print("Evaluating logistic regression predictor performance with weights: \n")
evaluation(preds, test_labels)

Evaluating logistic regression predictor performance with weights: 

Accuracy: 71.93
Precision score: [0.         0.71933962]
Recall score: [0. 1.]
FBeta score: [0.         0.83676269]


  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))


In [157]:
random_forest_with_weights.fit(train_features, train_labels)

In [159]:
preds = random_forest_with_weights.predict(test_features)

print("Evaluating Random Forest predictor performance with weights: \n")
evaluation(preds, test_labels)

Evaluating Random Forest predictor performance with weights: 

Accuracy: 70.28
Precision score: [0.31578947 0.72098765]
Recall score: [0.05042017 0.95737705]
FBeta score: [0.08695652 0.82253521]


Having weights makes performance more worse.

After these results we should go back to EDA, explore more, find out more data and try to improve performance.

## Saving the model

In [166]:
print("Saving Random Forest model")
with open(f"{MODEL_PATH}/model.pkl", "wb") as model:
    pickle.dump(random_forest, model)

Saving Random Forest model
