pystacknet

h2oai · Sep 4, 2018 · 7785cc2
commit 7785cc2

Showing 11 changed files with 2,713 additions and 0 deletions.
diff --git a/LICENSE.txt b/LICENSE.txt
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2017 Marios Michailidis
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
@@ -0,0 +1,90 @@
+## About
+
+`pystacknet` is a lightweight Python version of [StackNet](https://github.com/kaz-Anova/StackNet), which was originally written in Java.
+
+It supports many of the original features, with some new elements. 
+
+
+## Installation
+
+```
+git clone https://github.com/h2oai/pystacknet
+cd pystacknet
+python setup.py install
+```
+
+## New features
+
+`pystacknet`'s main object is a 2-dimensional list of sklearn-style models. This list defines the StackNet structure and is the equivalent of the [parameters](https://github.com/kaz-Anova/StackNet#parameters-file) file in the Java version. A representative example could be:
+
+```
+from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
+from sklearn.linear_model import LogisticRegression
+
+models = [
+    ######## First level ########
+    [RandomForestClassifier(n_estimators=100, criterion="entropy", max_depth=5, max_features=0.5, random_state=1),
+     ExtraTreesClassifier(n_estimators=100, criterion="entropy", max_depth=5, max_features=0.5, random_state=1),
+     GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, max_features=0.5, random_state=1),
+     LogisticRegression(random_state=1)
+     ],
+    ######## Second level ########
+    [RandomForestClassifier(n_estimators=200, criterion="entropy", max_depth=5, max_features=0.5, random_state=1)]
+]
+```
+
+`pystacknet` is not as strict as the `Java` version and allows `Regressors`, `Classifiers` or even `Transformers` at any level of StackNet. In other words, the following could work just fine:
+
+```
+from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, ExtraTreesClassifier, ExtraTreesRegressor, GradientBoostingClassifier,GradientBoostingRegressor
+from sklearn.linear_model import LogisticRegression, Ridge
+from sklearn.decomposition import PCA
+models = [
+
+    [RandomForestClassifier(n_estimators=100, criterion="entropy", max_depth=5, max_features=0.5, random_state=1),
+     ExtraTreesRegressor(n_estimators=100, max_depth=5, max_features=0.5, random_state=1),
+     GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5, max_features=0.5, random_state=1),
+     LogisticRegression(random_state=1),
+     PCA(n_components=4, random_state=1)
+     ],
+
+    [RandomForestClassifier(n_estimators=200, criterion="entropy", max_depth=5, max_features=0.5, random_state=1)]
+
+
+]
+```
+
+**Note** that not all transformers are meaningful in this context; use them at your own risk.
+
+
+## Parameters
+
+A typical usage for classification could be:
+
+```
+from pystacknet.pystacknet import StackNetClassifier
+
+model=StackNetClassifier(models, metric="auc", folds=4,
+	restacking=False,use_retraining=True, use_proba=True, 
+	random_state=12345,n_jobs=1, verbose=1)
+
+model.fit(x,y)
+preds=model.predict_proba(x_test)
+
+
+```
+Where :
+
+
+Parameter | Explanation
+--- | ---
+models  |  List of models. This should be a 2-dimensional list. The first dimension should define the stacking level and each entry is a model.
+metric  | Can be "auc", "logloss", "accuracy", "f1", "matthews" or your own custom metric, as long as it is a callable of the form `metric(y_true, y_pred, sample_weight=None)`.
+folds   |  Either an integer defining the number of folds used in `StackNet` or an iterable yielding train/test splits.
+restacking   |  True for [restacking](https://github.com/kaz-Anova/StackNet#restacking-mode) else False
+use_proba   |  When evaluating the metric, it will use probabilities instead of class predictions if `use_proba==True`
+use_retraining   |  If `True`, it fits one model on the whole training data in order to score the test data. Otherwise it takes the average of all models used in the folds (however, this takes more memory and there is no guarantee that it will work better).
+random_state   |  Integer for randomised procedures
+n_jobs   |  Number of models to run in parallel. This is independent of any extra threads allocated by the selected algorithms, e.g. it is possible to run 4 models in parallel where one is a random forest that runs on 10 threads (if selected).
+verbose   |  Integer value higher than zero to allow printing at the console.
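
The `folds` row of the README table says the parameter may be an iterable yielding train/test splits, but gives no example. The sketch below shows one way such an iterable could be built with NumPy alone. It assumes, without confirmation from this commit, that each item is a `(train_indices, test_indices)` pair of integer index arrays, as in scikit-learn's `cv` convention. `contiguous_folds` is a hypothetical helper name, not part of `pystacknet`.

```python
import numpy as np


def contiguous_folds(n_samples, n_folds):
    """Yield (train_idx, test_idx) index arrays for contiguous, unshuffled folds.

    Hypothetical helper for illustration only; it is not part of pystacknet.
    """
    indices = np.arange(n_samples)
    # np.array_split tolerates n_samples that is not divisible by n_folds.
    for test_idx in np.array_split(indices, n_folds):
        # Everything outside the held-out block is used for training.
        train_idx = np.setdiff1d(indices, test_idx)
        yield train_idx, test_idx


# Materialise the generator so the splits can be iterated more than once.
splits = list(contiguous_folds(n_samples=10, n_folds=4))

for fold_number, (train_idx, test_idx) in enumerate(splits):
    print(fold_number, "train size:", len(train_idx), "test size:", len(test_idx))
```

Under the assumption above, the list would presumably be passed as `folds=splits`. Materialising it is a deliberate choice: a bare generator is exhausted after a single pass, which matters if the splits are reused across stacking levels.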
diff --git a/pystacknet/__init__.py b/pystacknet/__init__.py
@@ -0,0 +1 @@
+__version__ = '0.0.1'
diff --git a/pystacknet/metrics.py b/pystacknet/metrics.py
@@ -0,0 +1,157 @@
+# -*- coding: utf-8 -*-
+"""
+Created on Fri Aug 31 18:33:58 2018
+
+@author: Marios Michailidis
+
+metrics and method to check metrics used within StackNet
+
+"""
+
+from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score , mean_squared_log_error  #regression metrics
+from sklearn.metrics import roc_auc_score, log_loss ,accuracy_score, f1_score ,matthews_corrcoef
+import numpy as np
+
+valid_regression_metrics=["rmse","mae","rmsle","r2","mape","smape"]
+valid_classification_metrics=["auc","logloss","accuracy","f1","matthews"]
+
+############ classification metrics ############
+
+def auc(y_true, y_pred, sample_weight=None):    
+    return roc_auc_score(y_true, y_pred, sample_weight=sample_weight)
+
+def logloss(y_true, y_pred, sample_weight=None):    
+    return log_loss(y_true, y_pred, sample_weight=sample_weight)
+
+def accuracy(y_true, y_pred, sample_weight=None):    
+    return accuracy_score(y_true, y_pred, sample_weight=sample_weight)
+
+def f1(y_true, y_pred, sample_weight=None):    
+    return f1_score(y_true, y_pred, sample_weight=sample_weight)
+
+def matthews(y_true, y_pred, sample_weight=None):    
+    return matthews_corrcoef(y_true, y_pred, sample_weight=sample_weight)
+
+############ regression metrics ############
+
+def rmse(y_true, y_pred, sample_weight=None):    
+    return np.sqrt(mean_squared_error(y_true, y_pred, sample_weight=sample_weight))
+
+def mae(y_true, y_pred, sample_weight=None):    
+    return mean_absolute_error(y_true, y_pred, sample_weight=sample_weight)
+
+def rmsle (y_true, y_pred, sample_weight=None):    
+    return np.sqrt(mean_squared_log_error(y_true, y_pred, sample_weight=sample_weight))
+
+def r2(y_true, y_pred, sample_weight=None):    
+    return r2_score(y_true, y_pred, sample_weight=sample_weight)
+
+
+def mape(y_true, y_pred, sample_weight=None):
+    y_true = y_true.ravel()
+    y_pred = y_pred.ravel()
+    if sample_weight is not None:
+        sample_weight = sample_weight.ravel()
+    eps = 1E-15
+    ape = np.abs((y_true - y_pred) / (y_true + eps)) * 100
+    ape[y_true == 0] = 0
+    return np.average(ape, weights=sample_weight)
+
+
+def smape(y_true, y_pred, sample_weight=None):
+
+    y_true = y_true.ravel()
+    y_pred = y_pred.ravel()
+    if sample_weight is not None:
+        sample_weight = sample_weight.ravel()
+    eps = 1E-15
+    sape = (np.abs(y_true - y_pred) / (0.5 * (np.abs(y_true) + np.abs(y_pred)) + eps)) * 100
+    sape[(y_true == 0) & (y_pred == 0)] = 0
+    return np.average(sape, weights=sample_weight)         
+
+
+"""
+metric: string or callable that returns a score given (y_true, y_pred, sample_weight=None)
+Currently supported metrics are "rmse","mae","rmsle","r2","mape","smape"
+"""
+
+
+def check_regression_metric(metric):
+
+    if metric is None:
+        raise Exception ("metric cannot be None")
+    if isinstance(metric, str)  :
+        if metric not in valid_regression_metrics:
+            raise Exception ("The regression metric has to be one of %s " % (", ".join([str(k) for k in valid_regression_metrics])))
+        if metric=="rmse":
+            return rmse,metric
+        elif metric=="mae":
+            return mae,metric
+        elif metric=="rmsle":
+            return rmsle,metric       
+        elif metric=="r2":
+            return r2,metric      
+        elif metric=="mape":
+            return mape,metric      
+        elif metric=="smape":
+            return smape,metric    
+        else :
+            raise Exception ("The metric %s is not recognised " % (metric) ) 
+    else : # a custom metric is given
+        try:
+            y_true_temp=[[1],[2],[3]]
+            y_pred_temp=[[2],[1],[3]]
+            y_true_temp=np.array(y_true_temp)
+            y_pred_temp=np.array(y_pred_temp)            
+            sample_weight_temp=[1,0.5,1]
+            metric(y_true_temp,y_pred_temp,  sample_weight=sample_weight_temp )
+            return metric,"custom"
+
+        except Exception:
+            raise Exception ("The custom metric has to implement metric(y_true, y_pred, sample_weight=None)" ) 
+
+
+"""
+metric: string or callable that returns a score given (y_true, y_pred, sample_weight=None)
+Currently supported metrics are "auc","logloss","accuracy","f1","matthews"
+"""
+
+
+def check_classification_metric(metric):
+
+    if metric is None:
+        raise Exception ("metric cannot be None")
+    if isinstance(metric, str)  :
+        if metric not in valid_classification_metrics:
+            raise Exception ("The classification metric has to be one of %s " % (", ".join([str(k) for k in valid_classification_metrics])))
+        if metric=="auc":
+            return auc,metric
+        elif metric=="logloss":
+            return logloss,metric
+        elif metric=="accuracy":
+            return accuracy,metric       
+        # "r2" is a regression metric; it is rejected by the
+        # valid_classification_metrics check above, so it has no branch here
+        elif metric=="f1":
+            return f1,metric      
+        elif metric=="matthews":
+            return matthews,metric    
+        else :
+            raise Exception ("The metric %s is not recognised " % (metric) ) 
+    else : # a custom metric is given
+        try:
+            y_true_temp=[[1],[0],[1]]
+            y_pred_temp=[[0.4],[1],[0.2]]
+            y_true_temp=np.array(y_true_temp)
+            y_pred_temp=np.array(y_pred_temp)
+            sample_weight_temp=[1,0.5,1]
+            metric(y_true_temp,y_pred_temp,  sample_weight=sample_weight_temp )
+            return metric,"custom"
+
+        except Exception:
+            raise Exception ("The custom metric has to implement metric(y_true, y_pred, sample_weight=None)" ) 
+
+
+
+
+
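
`check_classification_metric` accepts any callable that survives a probe call with two `(3, 1)` NumPy arrays and a plain Python list as `sample_weight`. The sketch below shows a custom metric shaped to pass that probe. `weighted_brier` is a hypothetical name chosen for illustration. Whether `pystacknet` treats a custom score as higher-is-better or lower-is-better is not shown in this section, so the direction of the score is left open.

```python
import numpy as np


def weighted_brier(y_true, y_pred, sample_weight=None):
    """Weighted mean squared difference between labels and predicted probabilities.

    Hypothetical custom metric; it follows the
    metric(y_true, y_pred, sample_weight=None) form the checker probes for.
    """
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    if sample_weight is not None:
        # The probe passes a plain list, so convert before calling ravel().
        sample_weight = np.asarray(sample_weight, dtype=float).ravel()
    return float(np.average((y_true - y_pred) ** 2, weights=sample_weight))


# The same probe inputs that check_classification_metric builds internally.
y_true_probe = np.array([[1], [0], [1]])
y_pred_probe = np.array([[0.4], [1], [0.2]])
probe_score = weighted_brier(y_true_probe, y_pred_probe, sample_weight=[1, 0.5, 1])
unweighted_score = weighted_brier(y_true_probe, y_pred_probe)
print(probe_score, unweighted_score)
```

Converting inputs with `np.asarray` is a defensive choice: a metric that calls `sample_weight.ravel()` directly, as the built-in `mape` and `smape` do, would raise on the list used by the probe and be reported as not implementing the required signature.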