Merge code from branch refactor/xg-boost-model to refactor/major_change_with_k1st #186

Merged
merged 59 commits into refactor/major_change_with_k1st from refactor/xg-boost-model on Feb 23, 2023
Commits (59)
11ac728
feat: add user code to user Oracle
phamhoangtuan Oct 5, 2022
515dbca
refactor: create OracleModel class, split students and ensemble to mo…
phamhoangtuan Oct 5, 2022
632adf5
fix: add condition to build_model func in modeler
phamhoangtuan Oct 6, 2022
f8ee2c0
fix: update modeler build_model function and model predict function
phamhoangtuan Oct 6, 2022
cd5be61
fix: update persist and load function of OracleModel
phamhoangtuan Oct 6, 2022
95fe02a
fix: fix bug version is overlapped, remove unused Oracle class
phamhoangtuan Oct 6, 2022
33b7966
refactor: change param name to be clearer, split persist and load fun…
phamhoangtuan Oct 7, 2022
28f05c4
fix: rename ensemble to ensembler, add Oracle as backward compability…
phamhoangtuan Oct 7, 2022
299aea1
fix: fix FuzzyModel process rules function
phamhoangtuan Oct 7, 2022
ff540fb
fix: update modeler build_model func comment, update test_oracle UT file
phamhoangtuan Oct 7, 2022
a823b36
feat: add evaluation func to modeler
phamhoangtuan Oct 7, 2022
2a4f5d8
fix: use logger from loguru, not logging; comment out UT for ts oracl…
phamhoangtuan Oct 7, 2022
337cccf
feat: add ipython notebook to run Oracle with iris data
phamhoangtuan Oct 8, 2022
f4501b8
feat: add ipython notebook to run oracle with azure iot data
phamhoangtuan Oct 10, 2022
91c600d
fix: update tutorial to use new Oracle code
phamhoangtuan Oct 10, 2022
e1880fc
fix: remove unused lib in TimeSeriesOracle
phamhoangtuan Oct 10, 2022
e1650c7
Merge branch 'main' into feat/oracle_with_fuzzy_refactor
phamhoangtuan Oct 18, 2022
a0fc3ec
feat: add loguru to pyproject
phamhoangtuan Oct 19, 2022
4c6b508
feat: create multi_model wrapper and kCollaborator (#173)
nubs01 Jan 4, 2023
070ecb4
fix: change OracleModel to Oracle
phamhoangtuan Jan 4, 2023
4b59262
feat: add xgboost model wrapper
phamhoangtuan Jan 4, 2023
b6b8e11
fix: remove tensorflow lib
phamhoangtuan Jan 5, 2023
dbb16bf
fix: change TimeSeriesOracleModel to TimeSeriesOracle
phamhoangtuan Jan 5, 2023
54fa9eb
fix: update Model and Modeler code for XGBoost based on Roshan code
phamhoangtuan Jan 6, 2023
ae8533c
fix: change x to X in k-collaborator, reformat python code
phamhoangtuan Jan 6, 2023
54d851f
feat: add Logical OR Ensemble model and modeler
phamhoangtuan Jan 9, 2023
948597d
feat: create oracle similar to collaborator, add hyperparam ingestion…
nubs01 Jan 11, 2023
77012cf
fix: update var type in xgboost modeler
phamhoangtuan Jan 11, 2023
f39cb45
fix: add name attribute to XGBoost model
phamhoangtuan Jan 11, 2023
dfca63a
fix: add name attribute to FuzzyModel, fix approach to create datafra…
phamhoangtuan Jan 11, 2023
26825ff
fix: fix input data in k-collaborator, fix build_model of ensemble
phamhoangtuan Jan 11, 2023
9bd5ca9
fix: init ensemble class
phamhoangtuan Jan 11, 2023
b92edeb
Feat/improved models for build pipeline (#179)
nubs01 Jan 12, 2023
f327074
fix: fix typo, add name to Model classes, change xgboost xgrid search…
phamhoangtuan Jan 13, 2023
f81d869
fix: add name property to Model, not Modeler
phamhoangtuan Jan 13, 2023
7d43a6f
fix: get student train data
phamhoangtuan Jan 13, 2023
4c1e712
fix: k-colab can deal with prepared_data is None then only produce fu…
phamhoangtuan Jan 14, 2023
e116bdc
fix: deal with K-Oracle without prepared data then return only teache…
phamhoangtuan Jan 14, 2023
e24d466
fix: return empty df if no df to concat
phamhoangtuan Jan 16, 2023
75d7be5
fix: add teacher model to Oracle
phamhoangtuan Jan 16, 2023
99d4339
fix: XGBoost if no result key then get first column of y_train data a…
phamhoangtuan Jan 16, 2023
c859704
fix: remove xgb grid search for hyperparam
phamhoangtuan Jan 17, 2023
768fd6e
fix: add input_features to stats when building model
phamhoangtuan Jan 17, 2023
2948493
fix: add input_features as columns of prepared_data X_train
phamhoangtuan Jan 17, 2023
4483a84
fix: deal with None prepared_data in k-colab when saving input_featur…
phamhoangtuan Jan 17, 2023
18b29af
fix: order input_features when persist multimodel
phamhoangtuan Jan 18, 2023
77f1703
fix: revert student_models.py
phamhoangtuan Jan 18, 2023
6324f6b
fix: fix NA value when concat list of Series
phamhoangtuan Jan 18, 2023
37a5e9b
fix: convert numpy.ndarray to pandas.DataFrame when running LogisticR…
phamhoangtuan Jan 18, 2023
755391b
fix: set default value for xgb
phamhoangtuan Jan 19, 2023
0cc1557
fix: add support parallel running when building k-colab and k-oracle
phamhoangtuan Jan 19, 2023
047a4f8
feat: Refactor XGB Model
vophihungvn Feb 2, 2023
8453e91
fix: log
vophihungvn Feb 2, 2023
185dae0
fix: evaluate func
vophihungvn Feb 2, 2023
1311283
refactor: Random forest model
vophihungvn Feb 2, 2023
047c22a
feat: update random forest student model
vophihungvn Feb 2, 2023
916b939
refactor: Logistic regression model
vophihungvn Feb 2, 2023
6ad1d13
fix(multi-model): fix logic to append pred result
phamhoangtuan Feb 10, 2023
33e8a35
merge: fix conflict when merging code
phamhoangtuan Feb 22, 2023
1 change: 0 additions & 1 deletion MIGRATION_NOTES.md
@@ -9,4 +9,3 @@ Due to changes to the package structure, please update your H1st code to:
- The `Modeler` class is introduced with a `build_model` method, which is used to build the corresponding `Model` instance. The `Model` class' `load_data`, `explore`, and `evaluate` now belong to the `Modeler` class, with `explore` renamed to `explore_data` and `evaluate` renamed to `evaluate_model`.
- The `Model` class' `predict` method is renamed to `process`. Its `PredictiveModel` subclass, which is in turn inherited by `RuleBasedModel` and `MLModel`, still possesses the `predict` method, which simply calls `process`.
- The `Model` class' `load` is renamed to `load_params`.

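For readers updating code against these migration notes, here is a minimal, hypothetical sketch of the renamed API; `MyMLModeler` and the data variables are placeholders, not part of this PR.

# Hypothetical sketch only; MyMLModeler stands in for any user-defined MLModeler subclass.
my_modeler = MyMLModeler()

data = my_modeler.load_data()                      # load_data now lives on the Modeler
my_modeler.explore_data(data)                      # formerly Model.explore
model = my_modeler.build_model(data)               # the Modeler builds the corresponding Model
metrics = my_modeler.evaluate_model(data, model)   # formerly Model.evaluate

output = model.process({'X': data['X_test']})      # Model.predict was renamed to process
version = model.persist()                          # persist the trained model
model.load_params(version)                         # Model.load was renamed to load_params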
73 changes: 73 additions & 0 deletions h1st/model/k1st/collaborator.py
@@ -0,0 +1,73 @@
import logging

import pandas as pd

from h1st.model.wrapper.multi_model import MultiModel
from h1st.model.ml_model import MLModel


class kCollaboratorModel(MultiModel):

    name: str = 'k-Collaborator'
    data_key = 'X'
    output_key = 'predictions'

    def __init__(self):
        '''
        Model for running predictions with multiple models and combining the outputs.
        Also handles persistence and loading of submodels.

        Knowledge and ML models must take the same input and output a dict
        with the "predictions" key and a value that is a list of dicts or a dict
        (matching the input format with key "X"),
        e.g. model.predict({"X": {'k1': 12, 'k2': 22}}) ->
            {"predictions": {'output': 10}}
        '''
        super().__init__()
        self.ensemble = None

    def predict(self, input_data: dict) -> dict:
        submodel_out = super().predict(input_data)[self.output_key]

        # Inject the original X values into the input features of the ensembler
        if (
            isinstance(self.ensemble, MLModel)
            and (
                isinstance(input_data["X"], pd.DataFrame)
                or isinstance(input_data["X"], pd.Series)
            )
            and self.stats["inject_x_in_ensembler"]
        ):
            submodel_out = pd.concat([submodel_out, input_data['X']], axis=1)

        if self.ensemble is None:
            return {self.output_key: submodel_out}

        ensemble_input = {'X': submodel_out}
        ensemble_key = getattr(self.ensemble, 'output_key', 'predictions')
        ensemble_pred = self.ensemble.predict(ensemble_input)[ensemble_key]
        return {self.output_key: ensemble_pred}

    def persist(self, version=None):
        if self.ensemble is not None:
            ensemble_version = self.ensemble.persist(version)
            self.stats['ensemble'] = {
                'version': ensemble_version,
                'model_class': self.ensemble.__class__,
            }
        else:
            self.stats['ensemble'] = None

        version = super().persist(version)
        return version

    def load(self, version=None):
        super().load(version)

        if self.stats['ensemble'] is not None:
            ensemble_version = self.stats['ensemble']['version']
            ensemble_class = self.stats['ensemble']['model_class']
            self.ensemble = ensemble_class().load(ensemble_version)
        else:
            self.ensemble = None
        return self
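As a rough illustration of the contract documented in the docstring above, a kCollaboratorModel might be driven as follows; rule_model and xgb_model are hypothetical prebuilt submodels that already follow the {'X': ...} / {'predictions': ...} convention, and this sketch is not part of the diff.

import pandas as pd

# Hypothetical usage sketch; rule_model and xgb_model are assumed prebuilt submodels.
collab = kCollaboratorModel()
collab.stats['inject_x_in_ensembler'] = False      # normally set by the modeler
collab.add_model(rule_model, name='rule-based')
collab.add_model(xgb_model, name='xgboost')

X = pd.DataFrame({'k1': [12, 3], 'k2': [22, 7]})
out = collab.predict({'X': X})                     # runs every submodel via MultiModel.predict
print(out['predictions'])                          # combined submodel outputs (no ensemble attached yet)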
101 changes: 101 additions & 0 deletions h1st/model/k1st/collaborator_modeler.py
@@ -0,0 +1,101 @@
import logging
from typing import List

import pandas as pd

from h1st.model.wrapper.multi_modeler import MultiModeler
from h1st.model.oracle.ensembler_models import MajorityVotingEnsembleModel
from h1st.model.rule_based_modeler import RuleBasedModeler
from h1st.model.k1st.collaborator import kCollaboratorModel
from h1st.model.predictive_model import PredictiveModel
from h1st.model.modeler import Modeler
from h1st.model.model import Model
from h1st.model.ml_modeler import MLModeler


class kCollaboratorModeler(MultiModeler):

    model_class = kCollaboratorModel

    def __init__(self):
        super().__init__()

    def build_model(
        self,
        prepared_data: dict,
        modelers: List[MLModeler] = [],
        ensemble_modeler: Modeler = RuleBasedModeler(MajorityVotingEnsembleModel),
        models: List[PredictiveModel] = None,
        inject_x_in_ensembler: bool = False,
        parallel: bool = False,
    ) -> kCollaboratorModel:
        '''
        prepared_data must be in the format expected by the given modelers.
        '''
        self.stats['inject_x_in_ensembler'] = inject_x_in_ensembler
        model = super().build_model(prepared_data, modelers, parallel)
        for i, m in enumerate(models or []):
            model.add_model(m, name=f'prebuilt-{model.__class__.__name__}-{i}')

        if prepared_data is None:
            model.stats['input_features'] = models[0].stats['input_features']
            return model
        else:
            model.stats['input_features'] = list(prepared_data['X_train'].columns)

        # train ensemble
        raw_pred = model.predict({model.data_key: prepared_data['X_train']})[
            model.output_key
        ]

        # If there is labeled_data and ensemble_modeler is an MLModeler,
        # then prepare the training data of the ensembler.
        labeled_data = prepared_data.get("labeled_data", prepared_data)
        if isinstance(ensemble_modeler, MLModeler) and labeled_data is None:
            raise ValueError("No data to train the machine-learning-based ensembler")

        ensembler_data = {}
        if labeled_data:
            ensembler_train_input = model.predict(
                {model.data_key: labeled_data["X_train"]}
            )[model.output_key]

            ensembler_test_input = model.predict(
                {model.data_key: labeled_data["X_test"]}
            )[model.output_key]

            ensembler_data = {
                'X_train': ensembler_train_input,
                'y_train': labeled_data['y_train'],
                'X_test': ensembler_test_input,
                'y_test': labeled_data['y_test'],
            }
        else:
            ensembler_data = None

        ensemble = ensemble_modeler.build_model(ensembler_data)
        model.ensemble = ensemble

        # Generate metrics of all submodels (teacher, student, ensembler).
        if labeled_data:
            test_data = {"X": labeled_data["X_test"], "y": labeled_data["y_test"]}
            try:
                model.metrics = self.evaluate_model(test_data, model)
            except Exception as e:
                logging.error(
                    "Couldn't complete the submodel evaluation. "
                    "Got the following error."
                )
                logging.error(e)
            else:
                logging.info("Evaluated all submodels successfully.")

        return model

    def evaluate_model(self, test_data: dict, model: Model):
        submodel_metrics = super().evaluate_model(test_data, model)
        metrics = {'submodel_metrics': submodel_metrics}
        # TODO: Compute overall model metrics
        return metrics
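A plausible call into this modeler, assuming the student modelers and train/test splits shown here exist (all names are illustrative only, not defined in this PR), could look like:

# Illustrative sketch; XGBClassifierModeler / RandomForestClassifierModeler are
# placeholder names for the student MLModelers added elsewhere in this PR.
modeler = kCollaboratorModeler()
collab = modeler.build_model(
    prepared_data={
        'X_train': X_train, 'y_train': y_train,
        'X_test': X_test, 'y_test': y_test,
    },
    modelers=[XGBClassifierModeler(), RandomForestClassifierModeler()],
    models=[],                          # optionally pass prebuilt PredictiveModels
    inject_x_in_ensembler=False,
)
version = collab.persist()              # persists the ensemble and submodels together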
74 changes: 74 additions & 0 deletions h1st/model/k1st/oracle.py
@@ -0,0 +1,74 @@
import logging

import pandas as pd

from h1st.model.wrapper.multi_model import MultiModel
from h1st.model.ml_model import MLModel


class kOracleModel(MultiModel):

    name: str = 'k-Oracle'
    data_key = 'X'
    output_key = 'predictions'

    def __init__(self):
        '''
        Model for running predictions with multiple models and combining the outputs.
        Also handles persistence and loading of submodels.

        Knowledge and ML models must take the same input and output a dict
        with the "predictions" key and a value that is a list of dicts or a dict
        (matching the input format with key "X"),
        e.g. model.predict({"X": {'k1': 12, 'k2': 22}}) ->
            {"predictions": {'output': 10}}
        '''
        super().__init__()
        self.ensemble = None

    def predict(self, input_data: dict) -> dict:
        submodel_out = super().predict(input_data)[self.output_key]

        # Inject the original X values into the input features of the ensembler
        if (
            isinstance(self.ensemble, MLModel)
            and (
                isinstance(input_data["X"], pd.DataFrame)
                or isinstance(input_data["X"], pd.Series)
            )
            and self.stats["inject_x_in_ensembler"]
        ):
            submodel_out = pd.concat([submodel_out, input_data['X']], axis=1)

        if self.ensemble is None:
            return {self.output_key: submodel_out}

        ensemble_input = {'X': submodel_out}
        ensemble_key = getattr(self.ensemble, 'output_key', 'predictions')
        ensemble_pred = self.ensemble.predict(ensemble_input)[ensemble_key]
        return {self.output_key: ensemble_pred}

    def persist(self, version=None):
        if self.ensemble:
            ensemble_version = self.ensemble.persist()
            self.stats['ensemble'] = {
                'version': ensemble_version,
                'model_class': self.ensemble.__class__,
            }
        else:
            self.stats['ensemble'] = None

        version = super().persist(version)
        return version

    def load(self, version=None):
        super().load(version)

        if self.stats['ensemble']:
            ensemble_version = self.stats['ensemble']['version']
            ensemble_class = self.stats['ensemble']['model_class']
            self.ensemble = ensemble_class().load(ensemble_version)
        else:
            self.ensemble = None

        return self
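The persist/load pair above records the ensemble's version and class under stats['ensemble'] and re-instantiates it on load. A minimal sketch, assuming a configured model repository and hypothetical submodel/ensemble instances:

# Sketch assuming a configured model repository; teacher_model is hypothetical and
# MajorityVotingEnsembleModel is the rule-based ensembler used elsewhere in this PR.
oracle = kOracleModel()
oracle.add_model(teacher_model, name='prebuilt-Teacher')
oracle.ensemble = MajorityVotingEnsembleModel()

version = oracle.persist()                 # also persists the ensemble; stats['ensemble']
                                           # holds its version and class
restored = kOracleModel().load(version)    # rebuilds the ensemble from stats['ensemble']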
115 changes: 115 additions & 0 deletions h1st/model/k1st/oracle_modeler.py
@@ -0,0 +1,115 @@
import logging
from typing import List

from h1st.model.wrapper.multi_modeler import MultiModeler
from h1st.model.oracle.ensembler_models import MajorityVotingEnsembleModel
from h1st.model.rule_based_modeler import RuleBasedModeler
from h1st.model.k1st.oracle import kOracleModel
from h1st.model.predictive_model import PredictiveModel
from h1st.model.modeler import Modeler
from h1st.model.model import Model
from h1st.model.ml_modeler import MLModeler


class kOracleModeler(MultiModeler):

    model_class = kOracleModel

    def __init__(self):
        super().__init__()

    def build_model(
        self,
        prepared_data: dict,
        modelers: List[MLModeler] = [],
        ensemble_modeler: Modeler = RuleBasedModeler(MajorityVotingEnsembleModel),
        teacher: PredictiveModel = None,
        inject_x_in_ensembler: bool = False,
        parallel: bool = False,
    ) -> kOracleModel:
        '''
        prepared_data must be in the format expected by the given modelers.
        '''
        if prepared_data is None:
            model = kOracleModel()
            model.add_model(teacher, f'prebuilt-{teacher.__class__.__name__}')
            model.stats['input_features'] = teacher.stats['input_features']
            return model

        teacher_data_key = getattr(teacher, 'data_key', 'X')
        teacher_output_key = getattr(teacher, 'output_key', 'predictions')
        teacher_pred = teacher.predict(
            {teacher_data_key: prepared_data['X_teacher_train']}
        )[teacher_output_key]
        student_training_data = prepared_data.copy()
        student_training_data['y_train'] = teacher_pred
        if 'X_test' in prepared_data.keys():
            student_training_data['y_test'] = teacher.predict(
                {teacher_data_key: prepared_data['X_test']}
            )[teacher_output_key]

        self.stats['inject_x_in_ensembler'] = inject_x_in_ensembler
        model = super().build_model(student_training_data, modelers, parallel)

        # Add teacher to MultiModel
        model.add_model(teacher, f'prebuilt-{teacher.__class__.__name__}')

        # train ensemble
        raw_pred = model.predict({model.data_key: prepared_data['X_train']})[
            model.output_key
        ]

        # If there is labeled_data and ensemble_modeler is an MLModeler,
        # then prepare the training data of the ensembler.
        labeled_data = prepared_data.get("labeled_data", None)
        if isinstance(ensemble_modeler, MLModeler) and labeled_data is None:
            raise ValueError("No data to train the machine-learning-based ensembler")

        ensembler_data = {}
        if labeled_data:
            ensembler_train_input = model.predict(
                {model.data_key: labeled_data["X_train"]}
            )[model.output_key]

            ensembler_test_input = model.predict(
                {model.data_key: labeled_data["X_test"]}
            )[model.output_key]

            ensembler_data = {
                'X_train': ensembler_train_input,
                'y_train': labeled_data['y_train'],
                'X_test': ensembler_test_input,
                'y_test': labeled_data['y_test'],
            }
        else:
            ensembler_data = None

        ensemble = ensemble_modeler.build_model(ensembler_data)
        model.ensemble = ensemble

        # Generate metrics of all submodels (teacher, student, ensembler).
        if labeled_data:
            test_data = {"X": labeled_data["X_test"], "y": labeled_data["y_test"]}
            try:
                model.metrics = self.evaluate_model(test_data, model)
            except Exception as e:
                logging.error(
                    "Couldn't complete the submodel evaluation. "
                    "Got the following error."
                )
                logging.error(e)
            else:
                logging.info("Evaluated all submodels successfully.")

        model.stats['input_features'] = list(prepared_data['X_train'].columns)

        return model

    def evaluate_model(self, test_data: dict, model: Model):
        submodel_metrics = super().evaluate_model(test_data, model)
        metrics = {'submodel_metrics': submodel_metrics}
        # TODO: Compute overall model metrics
        return metrics
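Taken together, a hypothetical end-to-end use of the k-Oracle pieces in this PR might look like the sketch below; the teacher, the student modeler, and the data splits are placeholders, not part of the diff.

# Illustrative only: MyRuleBasedTeacher and XGBClassifierModeler are assumed names.
teacher = MyRuleBasedTeacher()               # any PredictiveModel exposing stats['input_features']
modeler = kOracleModeler()
oracle = modeler.build_model(
    prepared_data={
        'X_teacher_train': X_unlabeled,      # the teacher labels these rows for the students
        'X_train': X_unlabeled,
        'labeled_data': {                    # optional; required if the ensembler is an MLModeler
            'X_train': X_train, 'y_train': y_train,
            'X_test': X_test, 'y_test': y_test,
        },
    },
    modelers=[XGBClassifierModeler()],
    teacher=teacher,
)
pred = oracle.predict({'X': X_test})['predictions']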