
Issue with custom stacking pipeline #62

Closed

gsvijayraajaa opened this issue Apr 18, 2017 · 5 comments

Comments

@gsvijayraajaa commented Apr 18, 2017

Hi,

I have created a pipeline by stacking a bunch of models together. The pipeline looks like:

pipe_stacking = make_pipeline(min_max_scaler, pca, EnsembleClassifier(classifiers=[modelLogit, modelRF, modelXGB, linear_classifier, dnnClassifier], meta_classifier=gridGB_high))

The idea is to build a meta classifier on top of the probability scores of one of the class labels from the base classifiers.

The ensemble classifier looks like:


import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

class EnsembleClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, classifiers=None, meta_classifier=None):
        self.classifiers = classifiers
        self.meta_classifier = meta_classifier

    def fit(self, X, y):
        # Base classifiers and the meta classifier are trained beforehand,
        # so fitting is a no-op here.
        return self

    def predict_proba(self, X):
        """
        Create a vector of probability scores from the 5 base classifiers:
        [clf1_proba, clf2_proba, clf3_proba, clf4_proba, clf5_proba]
        """
        self.prob_result = [self.classifiers[0].predict_proba(X)[0][0],
                            self.classifiers[1].predict_proba(X)[0][0],
                            self.classifiers[2].predict_proba(X)[0][0],
                            np.asarray(list(self.classifiers[3].predict_proba(X)))[0][0],
                            np.asarray(list(self.classifiers[4].predict_proba(X)))[0][0]]

        self.cols_df = ['Logit_df', 'RF_df', 'XGB_df', 'TLinear_df', 'TDNN_df']
        self.vector = pd.DataFrame(data=[self.prob_result], columns=self.cols_df)

        # Retrieve the probability score from the meta classifier, which is trained already
        prob = self.meta_classifier.predict_proba(self.vector)
        return prob

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.meta_classifier.predict(X))

    def predict(self, x):
        return self.meta_classifier.fit_predict(x)

The pipeline call pipe_stacking.predict_proba(test_data) works perfectly fine.

I am trying to use the LimeTabularExplainer on this stacking model:

explainer = LimeTabularExplainer(data.as_matrix(), feature_names=features, class_names=class_names)

exp = explainer.explain_instance(test_data, pipe_stacking.predict_proba, num_features=153, num_samples=44)

I get this error log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1644-ff3b6d617477> in <module>()
----> 1 exp = explainer.explain_instance(test_pipe.iloc[32], pipe_stacking.predict_proba, num_features=153,num_samples=44)

/Users/raajaa/anaconda/lib/python2.7/site-packages/lime/lime_tabular.pyc in explain_instance(self, data_row, classifier_fn, labels, top_labels, num_features, num_samples, distance_metric, model_regressor)
    276                 scaled_data, yss, distances, label, num_features,
    277                 model_regressor=model_regressor,
--> 278                 feature_selection=self.feature_selection)
    279         return ret_exp
    280 

/Users/raajaa/anaconda/lib/python2.7/site-packages/lime/lime_base.pyc in explain_instance_with_data(self, neighborhood_data, neighborhood_labels, distances, label, num_features, feature_selection, model_regressor)
    151                                                weights,
    152                                                num_features,
--> 153                                                feature_selection)
    154 
    155         if model_regressor is None:

/Users/raajaa/anaconda/lib/python2.7/site-packages/lime/lime_base.pyc in feature_selection(self, data, labels, weights, num_features, method)
    100                 n_method = 'highest_weights'
    101             return self.feature_selection(data, labels, weights,
--> 102                                           num_features, n_method)
    103 
    104     def explain_instance_with_data(self,

/Users/raajaa/anaconda/lib/python2.7/site-packages/lime/lime_base.pyc in feature_selection(self, data, labels, weights, num_features, method)
     73         elif method == 'highest_weights':
     74             clf = sklearn.linear_model.Ridge(alpha=0, fit_intercept=True)
---> 75             clf.fit(data, labels, sample_weight=weights)
     76             feature_weights = sorted(zip(range(data.shape[0]),
     77                                          clf.coef_ * data[0]),

/Users/raajaa/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.pyc in fit(self, X, y, sample_weight)
    640         self : returns an instance of self.
    641         """
--> 642         return super(Ridge, self).fit(X, y, sample_weight=sample_weight)
    643 
    644 

/Users/raajaa/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.pyc in fit(self, X, y, sample_weight)
    463     def fit(self, X, y, sample_weight=None):
    464         X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
--> 465                          multi_output=True, y_numeric=True)
    466 
    467         if ((sample_weight is not None) and

/Users/raajaa/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    529         y = y.astype(np.float64)
    530 
--> 531     check_consistent_length(X, y)
    532 
    533     return X, y

/Users/raajaa/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_consistent_length(*arrays)
    179     if len(uniques) > 1:
    180         raise ValueError("Found input variables with inconsistent numbers of"
--> 181                          " samples: %r" % [int(l) for l in lengths])
    182 
    183 

ValueError: Found input variables with inconsistent numbers of samples: [44, 1] 

It works only if num_samples = 1.

I am not sure what the issue is. Any direction will be greatly appreciated.

Regards,
Vijay Raajaa GS

@marcotcr (Owner)

Are you sure pipe_stacking.predict_proba(test_data) works perfectly fine? In this line:

self.prob_result = [self.classifiers[0].predict_proba(X)[0][0],
                    self.classifiers[1].predict_proba(X)[0][0],
                    self.classifiers[2].predict_proba(X)[0][0],
                    np.asarray(list(self.classifiers[3].predict_proba(X)))[0][0],
                    np.asarray(list(self.classifiers[4].predict_proba(X)))[0][0]]

you seem to be taking the first row and first column ([0][0]) of the prediction probabilities for each classifier, regardless of the size of the input. That is, I think you'll always output one row from predict_proba, even if the input is 10 rows.
LimeTabularExplainer assumes the X passed to predict_proba can be a 2d array, not only a 1d array.
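A minimal sketch of what goes wrong (the function and numbers below are hypothetical, not from your notebook): explain_instance generates num_samples perturbed rows and calls classifier_fn on all of them at once, so a classifier_fn that collapses its input to a single row reproduces exactly this error.

import numpy as np
from sklearn.linear_model import Ridge

def collapsing_predict_proba(X):
    # Deliberately flawed: always returns a single (1, 2) row,
    # which is effectively what indexing with [0][0] does.
    return np.array([[0.3, 0.7]])

neighborhood = np.random.rand(44, 153)  # 44 perturbed samples, 153 features
yss = collapsing_predict_proba(neighborhood)

# LIME then fits a Ridge model on the neighborhood against one class's
# probabilities; here X has 44 rows but y has only 1:
Ridge(alpha=0, fit_intercept=True).fit(neighborhood, yss[:, 1])
# ValueError: Found input variables with inconsistent numbers of samples: [44, 1]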

@gsvijayraajaa (Author) commented Apr 18, 2017

This is a use-case-specific implementation, wherein the pipeline is used only for prediction against a single input vector; hence the size of the input is one here. X in predict_proba comes after PCA, which is a 2d array. I used the same training input matrix for initialization, and the same input data point worked fine earlier against a lime explainer built on a random forest. It doesn't seem to work for the custom stacking implementation.

The error log marks this as the error:

--> 531     check_consistent_length(X, y)
ValueError: Found input variables with inconsistent numbers of samples: [44, 1] 

@marcotcr (Owner)

Can you share a notebook with the error?

@gsvijayraajaa (Author)

Sure, I have shared it to your inbox.

@gsvijayraajaa (Author)

Thanks for pointing me in the right direction. Your previous observation was exactly right: the stacking predict_proba function was returning only the first output. I have changed it to work on an array of any given size. Since the sample in my implementation was initialized with a size of 44, the correct output of predict_proba from my stacking implementation is (44, 2).
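In case it helps someone else, the corrected method looks roughly like this (a sketch of the change, not the exact notebook code; column names as in my original post):

def predict_proba(self, X):
    # Probability of the first class from each base classifier,
    # computed for every row of X rather than only X[0].
    meta_features = np.column_stack(
        [np.asarray(list(clf.predict_proba(X)))[:, 0]
         for clf in self.classifiers])
    cols_df = ['Logit_df', 'RF_df', 'XGB_df', 'TLinear_df', 'TDNN_df']
    vector = pd.DataFrame(data=meta_features, columns=cols_df)
    # One probability row per input row: shape (44, 2) when num_samples=44.
    return self.meta_classifier.predict_proba(vector)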
