
Issue with custom stacking pipeline #62

Closed

gsvijayraajaa opened this issue Apr 18, 2017 · 5 comments

Comments

@gsvijayraajaa commented Apr 18, 2017

Hi,

I have created a pipeline by stacking a bunch of models together. The pipeline looks like:

pipe_stacking = make_pipeline(min_max_scaler, pca, EnsembleClassifier(classifiers=[modelLogit, modelRF, modelXGB, linear_classifier, dnnClassifier], meta_classifier=gridGB_high))

The idea is to build a meta classifier on top of the probability scores of one of the class labels from the base classifiers.

The ensemble classifier looks like:


import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, ClassifierMixin

class EnsembleClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, classifiers=None, meta_classifier=None):
        self.classifiers = classifiers
        self.meta_classifier = meta_classifier

    def fit(self, X, y):
        # Base classifiers and the meta classifier are trained beforehand,
        # so fitting is a no-op here.
        return self

    def predict_proba(self, X):
        """
        Create a vector of probability scores from the 5 base classifiers:
        [clf1_proba, clf2_proba, clf3_proba, clf4_proba, clf5_proba]
        """
        self.prob_result = [self.classifiers[0].predict_proba(X)[0][0],
                            self.classifiers[1].predict_proba(X)[0][0],
                            self.classifiers[2].predict_proba(X)[0][0],
                            np.asarray(list(self.classifiers[3].predict_proba(X)))[0][0],
                            np.asarray(list(self.classifiers[4].predict_proba(X)))[0][0]]

        self.cols_df = ['Logit_df', 'RF_df', 'XGB_df', 'TLinear_df', 'TDNN_df']
        self.vector = pd.DataFrame(data=[self.prob_result], columns=self.cols_df)

        # Retrieve the probability score from the meta classifier, which is trained already
        prob = self.meta_classifier.predict_proba(self.vector)
        return prob

    def transform(self, X, **transform_params):
        return pd.DataFrame(self.meta_classifier.predict(X))

    def predict(self, x):
        return self.meta_classifier.fit_predict(x)

The pipeline call pipe_stacking.predict_proba(test_data) works perfectly fine.

I am trying to use the LimeTabularExplainer on this stacking model:

explainer = LimeTabularExplainer(data.as_matrix(), feature_names=features, class_names=class_names)

exp = explainer.explain_instance(test_data, pipe_stacking.predict_proba, num_features=153, num_samples=44)

I get this error log:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-1644-ff3b6d617477> in <module>()
----> 1 exp = explainer.explain_instance(test_pipe.iloc[32], pipe_stacking.predict_proba, num_features=153,num_samples=44)

/Users/raajaa/anaconda/lib/python2.7/site-packages/lime/lime_tabular.pyc in explain_instance(self, data_row, classifier_fn, labels, top_labels, num_features, num_samples, distance_metric, model_regressor)
    276                 scaled_data, yss, distances, label, num_features,
    277                 model_regressor=model_regressor,
--> 278                 feature_selection=self.feature_selection)
    279         return ret_exp
    280 

/Users/raajaa/anaconda/lib/python2.7/site-packages/lime/lime_base.pyc in explain_instance_with_data(self, neighborhood_data, neighborhood_labels, distances, label, num_features, feature_selection, model_regressor)
    151                                                weights,
    152                                                num_features,
--> 153                                                feature_selection)
    154 
    155         if model_regressor is None:

/Users/raajaa/anaconda/lib/python2.7/site-packages/lime/lime_base.pyc in feature_selection(self, data, labels, weights, num_features, method)
    100                 n_method = 'highest_weights'
    101             return self.feature_selection(data, labels, weights,
--> 102                                           num_features, n_method)
    103 
    104     def explain_instance_with_data(self,

/Users/raajaa/anaconda/lib/python2.7/site-packages/lime/lime_base.pyc in feature_selection(self, data, labels, weights, num_features, method)
     73         elif method == 'highest_weights':
     74             clf = sklearn.linear_model.Ridge(alpha=0, fit_intercept=True)
---> 75             clf.fit(data, labels, sample_weight=weights)
     76             feature_weights = sorted(zip(range(data.shape[0]),
     77                                          clf.coef_ * data[0]),

/Users/raajaa/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.pyc in fit(self, X, y, sample_weight)
    640         self : returns an instance of self.
    641         """
--> 642         return super(Ridge, self).fit(X, y, sample_weight=sample_weight)
    643 
    644 

/Users/raajaa/anaconda/lib/python2.7/site-packages/sklearn/linear_model/ridge.pyc in fit(self, X, y, sample_weight)
    463     def fit(self, X, y, sample_weight=None):
    464         X, y = check_X_y(X, y, ['csr', 'csc', 'coo'], dtype=np.float64,
--> 465                          multi_output=True, y_numeric=True)
    466 
    467         if ((sample_weight is not None) and

/Users/raajaa/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    529         y = y.astype(np.float64)
    530 
--> 531     check_consistent_length(X, y)
    532 
    533     return X, y

/Users/raajaa/anaconda/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_consistent_length(*arrays)
    179     if len(uniques) > 1:
    180         raise ValueError("Found input variables with inconsistent numbers of"
--> 181                          " samples: %r" % [int(l) for l in lengths])
    182 
    183 

ValueError: Found input variables with inconsistent numbers of samples: [44, 1] 

It works only if num_samples = 1.

I am not sure what the issue is. Any direction will be greatly appreciated.

Regards,
Vijay Raajaa GS

@marcotcr (Owner)

Are you sure pipe_stacking.predict_proba(test_data) works perfectly fine? In this line:

self.prob_result = [self.classifiers[0].predict_proba(X)[0][0],
                    self.classifiers[1].predict_proba(X)[0][0],
                    self.classifiers[2].predict_proba(X)[0][0],
                    np.asarray(list(self.classifiers[3].predict_proba(X)))[0][0],
                    np.asarray(list(self.classifiers[4].predict_proba(X)))[0][0]]

you seem to be taking the first row and first column ([0][0]) of the prediction probabilities for each classifier, regardless of the size of the input. That is, I think you'll always output one row from predict_proba, even if the input is 10 rows.
LimeTabularExplainer assumes the X passed to predict_proba can be a 2d array, not only a 1d array.
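A minimal sketch of what goes wrong (the function and numbers below are hypothetical, not from your notebook): explain_instance generates num_samples perturbed rows and calls classifier_fn on all of them at once, so a classifier_fn that collapses its input to a single row reproduces exactly this error.

import numpy as np
from sklearn.linear_model import Ridge

def collapsing_predict_proba(X):
    # Deliberately flawed: always returns a single (1, 2) row,
    # which is effectively what indexing with [0][0] does.
    return np.array([[0.3, 0.7]])

neighborhood = np.random.rand(44, 153)  # 44 perturbed samples, 153 features
yss = collapsing_predict_proba(neighborhood)

# LIME then fits a Ridge model on the neighborhood against one class's
# probabilities; here X has 44 rows but y has only 1:
Ridge(alpha=0, fit_intercept=True).fit(neighborhood, yss[:, 1])
# ValueError: Found input variables with inconsistent numbers of samples: [44, 1]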

@gsvijayraajaa (Author) commented Apr 18, 2017

This is a use-case-specific implementation, wherein the pipeline is used only for prediction against a single input vector; hence the size of the input is one here. X in predict_proba comes after PCA, which is a 2d array. I used the same training input matrix for initialization, and the same input data point worked fine earlier against a lime explainer built on a random forest. It doesn't seem to work for the custom stacking implementation.

The error log marks this as the error:

--> 531     check_consistent_length(X, y)
ValueError: Found input variables with inconsistent numbers of samples: [44, 1] 

@marcotcr (Owner)

Can you share a notebook with the error?

@gsvijayraajaa (Author)

Sure, I have shared it to your inbox.

@gsvijayraajaa (Author)

Thanks for pointing me in the right direction. Your previous observation was exactly right: the stacking predict_proba function was returning only the first output. I have changed it to work on an array of any given size. Since the sample in my implementation was initialized with a size of 44, the correct output of predict_proba from my stacking implementation is (44, 2).
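In case it helps someone else, the corrected method looks roughly like this (a sketch of the change, not the exact notebook code; column names as in my original post):

def predict_proba(self, X):
    # Probability of the first class from each base classifier,
    # computed for every row of X rather than only X[0].
    meta_features = np.column_stack(
        [np.asarray(list(clf.predict_proba(X)))[:, 0]
         for clf in self.classifiers])
    cols_df = ['Logit_df', 'RF_df', 'XGB_df', 'TLinear_df', 'TDNN_df']
    vector = pd.DataFrame(data=meta_features, columns=cols_df)
    # One probability row per input row: shape (44, 2) when num_samples=44.
    return self.meta_classifier.predict_proba(vector)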
