
Found input variables with inconsistent numbers of samples: [5000, 1] #35

Closed
courageon opened this issue Nov 2, 2016 · 16 comments

@courageon

Not sure if a new version of scikit-learn is causing this, but I get this error when trying to run an explanation:

Found input variables with inconsistent numbers of samples: [5000, 1]

The outer-error occurs in lime_base.py here:
https://github.com/marcotcr/lime/blob/master/lime/lime_base.py#L75

The inner error is thrown in scikit-learn here:
https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L180

I have tried to follow the multi-class notebook example as closely as I could but I do not see anything I could change to make my data look any more like the one in the example. That is, all of my classifier outputs look exactly like what's given in the example.

Any suggestions?

Thanks!

@marcotcr
Owner

marcotcr commented Nov 3, 2016

Hello,
You're getting this error running the multi-class notebook? If so, a few questions:

  • what line in the notebook?
  • what's your version of python?
  • what's your version of sklearn?
  • what's your operating system?

Thanks!

@marcotcr

I just tried all of the notebooks with the newest version of sklearn, and they work, so it's probably not related to sklearn.

@nikodrum

Same problem. @courageon how did you solve it?

@courageon
Author

Sorry for such a long delay in response. I got pulled off the task to look at something else for a while. Thank you for looking into it @marcotcr.

LIME works fine; I had made some wrong assumptions about how the predict callback was supposed to work. @nikodrum, take a look at the predict callback in your explain_instance call. explain_instance sends a list of items to be predicted by the predict callback. I was expecting a single item at a time and was therefore only returning a single prediction. The error appears because I was returning one prediction instead of the requested 5000.

@marcotcr Would it be possible to update the docstring of lime_text's explain_instance function to point this out? Not being very familiar with scikit-learn's predict_proba function, it wasn't clear (to me at least) that this was the expected behavior.

Once that was fixed though, everything else fell into place and I started getting some interesting results from my model. Very cool!
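To make the fix described above concrete, here is a minimal sketch of a classifier_fn that satisfies LIME's shape contract. predict_one is a hypothetical stand-in for a real model that scores one string at a time; the key point is that the function LIME calls must accept the full list of perturbed texts and return one row of class probabilities per text:

```python
import numpy as np

# Hypothetical stand-in for a real model that scores ONE string at a time.
def predict_one(text):
    return 0.5  # P(positive) for this text

# explain_instance calls this with a list of num_samples perturbed texts,
# so it must return an array of shape (len(texts), num_classes).
def classifier_fn(texts):
    probs_pos = np.array([predict_one(t) for t in texts])
    return np.column_stack([1 - probs_pos, probs_pos])
```

With that shape contract satisfied, the [5000, 1] mismatch disappears, because each of the 5000 perturbed inputs gets its own prediction.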

@marcotcr

Yeah, the comments were definitely wrong, thanks for pointing it out.

@DianeBouchacourt

@courageon did you feed multiple samples in the end?

I am having the same error in my own code, and I don't get it when I use num_samples = 1 in the call to explain_instance. But then I get this warning from python3.6/site-packages/sklearn/linear_model/ridge.py:

Singular matrix in solving dual problem. Using least-squares solution instead.

which comes from the fact that I have only 1 sample, I guess? Any idea why the notebook runs fine in comparison, even though it seems to use 1 sample too?

Thanks!

@DianeBouchacourt

OK, I think I understand now: the 5000 samples are the perturbed inputs that explain_instance creates.

@whyisyoung

@DianeBouchacourt Hi, how did you solve the following warning?

Singular matrix in solving dual problem. Using least-squares solution instead.

I changed num_samples in the explain_instance() call to a smaller number (e.g., 32 or 500) and saw the warning, but num_samples = 1000 didn't produce it. I'm using ~2,500 samples for training and ~1,300 samples for testing, so I guess the default num_samples = 5000 is too big for my case.

@craigmassie

Similar issue here. With num_samples=100 I get the inconsistency [100, 10]; with num_samples=1000, the inconsistency is [1000, 100]. Any ideas? Code is here: https://github.com/craigmassie/MachineLearningParadigm/blob/master/VisAndExplain.ipynb

@sheisjw

sheisjw commented Jan 14, 2020

@courageon I have exactly the same issue. I defined a prediction function because model.predict only returns one value per data point. I encountered Found input variables with inconsistent numbers of samples: [5000, 1] when using explain_instance. Can you help here?

```python
def pred_f():
    def func2(text_sample):
        proba_yes = pipeline.predict(text_sample)[0][0]  # probability of "yes"
        proba_no = 1 - proba_yes
        print(proba_no, proba_yes)
        return np.array([[proba_no, proba_yes]])
    return func2

F = pred_f()
cnn_features = explainer.explain_instance(preprocess_text_minimal(text_sample), F, num_features=2)
```

@elliottash

elliottash commented Feb 3, 2020

Hi all,
I am also having this issue, using a binary classification. My pipeline is a text encoder followed by sklearn logit. And I get the listed error.
Any help would be appreciated and thanks!


EDIT: I fixed it. The problem was that my text encoder wouldn't work on lists of inputs.
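For anyone hitting the same wall with an encoder-plus-sklearn pipeline, one way to adapt an encoder that only handles a single string is to encode each perturbed text in a loop and stack the results before calling the classifier. Here encode_one and clf are hypothetical stand-ins for the actual encoder and fitted model:

```python
import numpy as np

def make_classifier_fn(encode_one, clf):
    """Wrap a single-string encoder and a fitted classifier into the
    batched callable that explain_instance expects."""
    def classifier_fn(texts):
        # Encode each perturbed text individually, then stack into a
        # (n_samples, n_features) matrix for the sklearn step.
        X = np.vstack([encode_one(t) for t in texts])
        return clf.predict_proba(X)
    return classifier_fn
```

Passing the resulting classifier_fn to explain_instance keeps the encoder single-input while still returning one probability row per perturbed sample.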

@tharix

tharix commented Sep 22, 2020

Found input variables with inconsistent number of samples: [123, 491]

I need help with this please

@13Ashu

13Ashu commented Dec 3, 2020

I faced the same issue.
It arises because our custom predict function often accepts input either as a str or as a list.
To solve it, you need to write a predict function that handles both types of input and returns the probabilities the same way predict_proba does.

My model always needs the input as a list. I have designed it so that passing the "predict_proba" hyperparameter gives the same output as model.predict_proba in sklearn models.

```python
def predict_prob(sent):
    if isinstance(sent, list):
        return model.predict(sent, predict_proba=True)
    elif isinstance(sent, str):
        return model.predict([sent], predict_proba=True)
    else:
        return "Some ERRORRRR"
```

@GladiatorX

Hello,
I too am facing a similar issue.
I am working on a text classification setup where my function predictSentiment outputs a probability distribution over 3 classes.

A sample output value of predictSentiment is [[7.97884561e-01 1.07981933e-01 2.67660452e-04]]; it is a numpy array, similar to the output of scikit-learn's predict_proba() function.

```python
def predictSentiment(sentence):
    # code to predict sentiment score
    return output

explainer = lime.lime_text.LimeTextExplainer(
    split_expression=lambda s: re.split(r'\W+', s),
    class_names=["NEGATIVE", "NEUTRAL", "POSITIVE"]
)

exp = explainer.explain_instance(
    sentence,  # the review to explain
    classifier_fn=predictSentiment,
    top_labels=1,
    num_features=4,
    num_samples=1000
)
```

My error: Found input variables with inconsistent numbers of samples: [1000, 1]
Kindly help. Thanks!

@marcotcr

predictSentiment should take as input a list of sentences rather than a single sentence.
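A batched version of predictSentiment might look like the sketch below, where score_one is a hypothetical stand-in for the model call that returns the three class probabilities for a single sentence:

```python
import numpy as np

def score_one(sentence):
    # Hypothetical single-sentence model call; returns
    # [P(NEGATIVE), P(NEUTRAL), P(POSITIVE)].
    return np.array([0.8, 0.15, 0.05])

def predictSentiment(sentences):
    # LIME passes all num_samples perturbed sentences at once, so return
    # one row of class probabilities per sentence: shape (n, 3).
    return np.vstack([score_one(s) for s in sentences])
```

With num_samples=1000, LIME will call this once with 1000 sentences and expect a (1000, 3) array back, which resolves the [1000, 1] mismatch.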

@deweihu96

Got a similar issue. I have a binary text classification model in PyTorch. When I set num_samples to 1, everything is OK, except that the model learns nothing and the weights of all features are zero. With num_samples at any other value I get this error: "ValueError: Found input variables with inconsistent numbers of samples", no matter how I tune it (e.g., 2, 32, 500, 1000, ...).

My code is here, could anyone help me? Thanks in advance: )
https://github.com/DeweiHu66666/Share/blob/main/caml_lime.ipynb
