Commit
0 parents
commit 7785cc2
Showing
11 changed files
with
2,713 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,21 @@ | ||
MIT License | ||
|
||
Copyright (c) 2017 Marios Michailidis | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in all | ||
copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE | ||
SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,90 @@ | ||
## About | ||
|
||
`pystacknet` is a light python version of [StackNet](https://github.com/kaz-Anova/StackNet) which was originally made in Java. | ||
|
||
It supports many of the original features, with some new elements. | ||
|
||
|
||
## Installation | ||
|
||
``` | ||
git clone https://github.com/h2oai/pystacknet | ||
cd pystacknet | ||
python setup.py install | ||
``` | ||
|
||
## New features | ||
|
||
`pystacknet`'s main object is a 2-dimensional list of sklearn type of models. This list defines the StackNet structure. This is the equivalent of [parameters](https://github.com/kaz-Anova/StackNet#parameters-file) in the Java version. A representative example could be: | ||
|
||
``` | ||
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier | ||
from sklearn.linear_model import LogisticRegression | ||
models=[ | ||
######## First level ######## | ||
[RandomForestClassifier (n_estimators=100, criterion="entropy", max_depth=5, max_features=0.5, random_state=1), | ||
ExtraTreesClassifier (n_estimators=100, criterion="entropy", max_depth=5, max_features=0.5, random_state=1), | ||
GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, max_features=0.5, random_state=1), | ||
LogisticRegression(random_state=1) | ||
], | ||
######## Second level ######## | ||
[RandomForestClassifier (n_estimators=200, criterion="entropy", max_depth=5, max_features=0.5, random_state=1)] | ||
] | ||
``` | ||
|
||
`pystacknet` is not as strict as the `Java` version and allows `Regressors`, `Classifiers` or even `Transformers` at any level of StackNet. In other words, the following could work just fine: | ||
|
||
``` | ||
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier,GradientBoostingRegressor | ||
from sklearn.linear_model import LogisticRegression, Ridge | ||
from sklearn.decomposition import PCA | ||
models=[ | ||
[RandomForestClassifier (n_estimators=100, criterion="entropy", max_depth=5, max_features=0.5, random_state=1), | ||
ExtraTreesRegressor (n_estimators=100, max_depth=5, max_features=0.5, random_state=1), | ||
GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, max_features=0.5, random_state=1), | ||
LogisticRegression(random_state=1), | ||
PCA(n_components=4,random_state=1) | ||
], | ||
[RandomForestClassifier (n_estimators=200, criterion="entropy", max_depth=5, max_features=0.5, random_state=1)] | ||
] | ||
``` | ||
|
||
**Note** that not all transformers are meaningful in this context and you should use them at your own risk. | ||
|
||
|
||
## Parameters | ||
|
||
A typical usage for classification could be: | ||
|
||
``` | ||
from pystacknet.pystacknet import StackNetClassifier | ||
model=StackNetClassifier(models, metric="auc", folds=4, | ||
restacking=False,use_retraining=True, use_proba=True, | ||
random_state=12345,n_jobs=1, verbose=1) | ||
model.fit(x,y) | ||
preds=model.predict_proba(x_test) | ||
``` | ||
Where: | ||
|
||
|
||
Parameter | Explanation | ||
--- | --- | ||
models | List of models. This should be a 2-dimensional list. The first dimension should define the stacking level and each entry within a level is a model. | ||
metric | Can be "auc", "logloss", "accuracy", "f1", "matthews" or your own custom metric, as long as it implements `metric(y_true, y_pred, sample_weight=None)` | ||
folds | This can be either an integer defining the number of folds used in `StackNet` or an iterable yielding train/test splits. | ||
restacking | True for [restacking](https://github.com/kaz-Anova/StackNet#restacking-mode) else False | ||
use_proba | When evaluating the metric, it will use probabilities instead of class predictions if `use_proba==True` | ||
use_retraining | If `True`, one model is refit on the whole training data in order to score the test data. Otherwise the predictions of all models fitted in the folds are averaged (however, this takes more memory and there is no guarantee that it will work better). | ||
random_state | Integer seed for randomised procedures | ||
n_jobs | Number of models to run in parallel. This is independent of any extra threads allocated by the selected algorithms, e.g. it is possible to run 4 models in parallel where one is a random forest that runs on 10 threads (if selected). | ||
verbose | Integer value higher than zero to allow printing at the console. |
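
The `metric` row above allows a custom callable. As a minimal sketch of what such a callable might look like, here is a weighted Brier score; the name `brier_score` is hypothetical and is not part of `pystacknet`. The README does not say what shape of predictions is handed to the metric or whether a custom metric is treated as higher-is-better, so the flattening below assumes a single column of probabilities.

```python
import numpy as np


def brier_score(y_true, y_pred, sample_weight=None):
    """Weighted mean squared difference between labels and probabilities.

    Hypothetical custom metric; it only follows the documented signature
    metric(y_true, y_pred, sample_weight=None).
    """
    # Assumption: one column of probabilities, so flattening is safe.
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if sample_weight is not None:
        sample_weight = np.asarray(sample_weight, dtype=float).ravel()
    return float(np.average((y_true - y_pred) ** 2, weights=sample_weight))


# Small column-vector inputs with a plain-list sample_weight,
# similar in shape to what a signature check might pass in.
y_true = np.array([[1], [0], [1]])
y_pred = np.array([[0.4], [1], [0.2]])
weighted = brier_score(y_true, y_pred, sample_weight=[1, 0.5, 1])
unweighted = brier_score(y_true, y_pred)
```

Under these assumptions it could then be supplied as `metric=brier_score` when constructing `StackNetClassifier`.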
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
__version__ = '0.0.1' |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,157 @@ | ||
# -*- coding: utf-8 -*- | ||
""" | ||
Created on Fri Aug 31 18:33:58 2018 | ||
@author: Marios Michailidis | ||
metrics and method to check metrics used within StackNet | ||
""" | ||
|
||
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score , mean_squared_log_error #regression metrics | ||
from sklearn.metrics import roc_auc_score, log_loss ,accuracy_score, f1_score ,matthews_corrcoef | ||
import numpy as np | ||
|
||
valid_regression_metrics=["rmse","mae","rmsle","r2","mape","smape"] | ||
valid_classification_metrics=["auc","logloss","accuracy","f1","matthews"] | ||
|
||
############ classification metrics ############ | ||
|
||
def auc(y_true, y_pred, sample_weight=None): | ||
return roc_auc_score(y_true, y_pred, sample_weight=sample_weight) | ||
|
||
def logloss(y_true, y_pred, sample_weight=None): | ||
return log_loss(y_true, y_pred, sample_weight=sample_weight) | ||
|
||
def accuracy(y_true, y_pred, sample_weight=None): | ||
return accuracy_score(y_true, y_pred, sample_weight=sample_weight) | ||
|
||
def f1(y_true, y_pred, sample_weight=None): | ||
return f1_score(y_true, y_pred, sample_weight=sample_weight) | ||
|
||
def matthews(y_true, y_pred, sample_weight=None): | ||
return matthews_corrcoef(y_true, y_pred, sample_weight=sample_weight) | ||
|
||
############ regression metrics ############ | ||
|
||
def rmse(y_true, y_pred, sample_weight=None): | ||
return np.sqrt(mean_squared_error(y_true, y_pred, sample_weight=sample_weight)) | ||
|
||
def mae(y_true, y_pred, sample_weight=None): | ||
return mean_absolute_error(y_true, y_pred, sample_weight=sample_weight) | ||
|
||
def rmsle (y_true, y_pred, sample_weight=None): | ||
return np.sqrt(mean_squared_log_error(y_true, y_pred, sample_weight=sample_weight)) | ||
|
||
def r2(y_true, y_pred, sample_weight=None): | ||
return r2_score(y_true, y_pred, sample_weight=sample_weight) | ||
|
||
|
||
def mape(y_true, y_pred, sample_weight=None): | ||
y_true = y_true.ravel() | ||
y_pred = y_pred.ravel() | ||
if sample_weight is not None: | ||
sample_weight = sample_weight.ravel() | ||
eps = 1E-15 | ||
ape = np.abs((y_true - y_pred) / (y_true + eps)) * 100 | ||
ape[y_true == 0] = 0 | ||
return np.average(ape, weights=sample_weight) | ||
|
||
|
||
def smape(y_true, y_pred, sample_weight=None): | ||
|
||
y_true = y_true.ravel() | ||
y_pred = y_pred.ravel() | ||
if sample_weight is not None: | ||
sample_weight = sample_weight.ravel() | ||
eps = 1E-15 | ||
sape = (np.abs(y_true - y_pred) / (0.5 * (np.abs(y_true) + np.abs(y_pred)) + eps)) * 100 | ||
sape[(y_true == 0) & (y_pred == 0)] = 0 | ||
return np.average(sape, weights=sample_weight) | ||
|
||
|
||
""" | ||
metric: string or callable that returns a metric given (y_true, y_pred, sample_weight=None) | ||
Currently supported metrics are "rmse","mae","rmsle","r2","mape","smape" | ||
""" | ||
|
||
|
||
def check_regression_metric(metric): | ||
|
||
if metric is None: | ||
raise Exception ("metric cannot be None") | ||
if isinstance(metric, str) : | ||
if metric not in valid_regression_metrics: | ||
raise Exception ("The regression metric has to be one of %s " % (", ".join([str(k) for k in valid_regression_metrics]))) | ||
if metric=="rmse": | ||
return rmse,metric | ||
elif metric=="mae": | ||
return mae,metric | ||
elif metric=="rmsle": | ||
return rmsle,metric | ||
elif metric=="r2": | ||
return r2,metric | ||
elif metric=="mape": | ||
return mape,metric | ||
elif metric=="smape": | ||
return smape,metric | ||
else : | ||
raise Exception ("The metric %s is not recognised " % (metric) ) | ||
else : #custom metric is given | ||
try: | ||
y_true_temp=[[1],[2],[3]] | ||
y_pred_temp=[[2],[1],[3]] | ||
y_true_temp=np.array(y_true_temp) | ||
y_pred_temp=np.array(y_pred_temp) | ||
sample_weight_temp=[1,0.5,1] | ||
metric(y_true_temp,y_pred_temp, sample_weight=sample_weight_temp ) | ||
return metric,"custom" | ||
|
||
except Exception: | ||
raise Exception ("The custom metric has to implement metric(y_true, y_pred, sample_weight=None)" ) | ||
|
||
|
||
""" | ||
metric: string or callable that returns a metric given (y_true, y_pred, sample_weight=None) | ||
Currently supported metrics are "auc","logloss","accuracy","f1","matthews" | ||
""" | ||
|
||
|
||
def check_classification_metric(metric): | ||
|
||
if metric is None: | ||
raise Exception ("metric cannot be None") | ||
if isinstance(metric, str) : | ||
if metric not in valid_classification_metrics: | ||
raise Exception ("The classification metric has to be one of %s " % (", ".join([str(k) for k in valid_classification_metrics]))) | ||
if metric=="auc": | ||
return auc,metric | ||
elif metric=="logloss": | ||
return logloss,metric | ||
elif metric=="accuracy": | ||
return accuracy,metric | ||
elif metric=="f1": | ||
return f1,metric | ||
elif metric=="matthews": | ||
return matthews,metric | ||
else : | ||
raise Exception ("The metric %s is not recognised " % (metric) ) | ||
else : #custom metric is given | ||
try: | ||
y_true_temp=[[1],[0],[1]] | ||
y_pred_temp=[[0.4],[1],[0.2]] | ||
y_true_temp=np.array(y_true_temp) | ||
y_pred_temp=np.array(y_pred_temp) | ||
sample_weight_temp=[1,0.5,1] | ||
metric(y_true_temp,y_pred_temp, sample_weight=sample_weight_temp ) | ||
return metric,"custom" | ||
|
||
except Exception: | ||
raise Exception ("The custom metric has to implement metric(y_true, y_pred, sample_weight=None)" ) | ||
|
||
|
||
|
||
|
||
|
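
The `mape` and `smape` functions in this file handle zero targets differently, which is easy to miss when reading the code. The sketch below restates the two formulas in condensed form and evaluates them on made-up values. The constant name `EPS` and the sample arrays are illustrative only and are not part of the committed file.

```python
import numpy as np

EPS = 1e-15  # same guard value as in the file, named here for readability


def mape(y_true, y_pred, sample_weight=None):
    # Absolute percentage error; rows with a zero target contribute 0.
    ape = np.abs((y_true - y_pred) / (y_true + EPS)) * 100
    ape[y_true == 0] = 0
    return np.average(ape, weights=sample_weight)


def smape(y_true, y_pred, sample_weight=None):
    # Symmetric variant; only rows where both values are zero contribute 0.
    denom = 0.5 * (np.abs(y_true) + np.abs(y_pred)) + EPS
    sape = np.abs(y_true - y_pred) / denom * 100
    sape[(y_true == 0) & (y_pred == 0)] = 0
    return np.average(sape, weights=sample_weight)


# Illustrative data: the third target is zero but its prediction is not.
y_true = np.array([100.0, 200.0, 0.0])
y_pred = np.array([110.0, 180.0, 5.0])

mape_value = mape(y_true, y_pred)    # per-row terms: 10, 10, 0
smape_value = smape(y_true, y_pred)  # third row reaches the 200 ceiling
```

With these inputs the zero-target row is ignored by `mape` but dominates `smape` (about 6.67 versus about 73.35), so the two metrics can rank models quite differently when targets contain zeros.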