Missing fit feature #155

Closed
jbesomi opened this issue Aug 20, 2020 · 5 comments
Labels
bug Something isn't working discussion To discuss new improvements

Comments

@jbesomi
Owner

jbesomi commented Aug 20, 2020

In any ML task, the assumption is that the test data are not available during training and only become available in the prediction phase.

Assume someone wants to categorize reviews using tfidf + Naive Bayes. The required steps would be the following:

  1. Split train and test
  2. Fit tfidf on the train part and generate (the transform part in scikit-learn) the tfidf values on the train part
  3. Train the model
  4. Generate the tf-idf values on the test part, this time using the already fitted tfidf model
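For reference, the four steps above can be sketched with scikit-learn as follows. This is only an illustration of the stateful workflow, not Texthero code; the toy reviews and labels are invented for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration only.
reviews = [
    "great product, works well",
    "terrible quality, broke quickly",
    "really great value",
    "awful, terrible experience",
    "works great, very happy",
    "broke after a day, awful",
]
labels = [1, 0, 1, 0, 1, 0]

# 1. Split train and test.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=2, random_state=0, stratify=labels
)

# 2. Fit tf-idf on the train part and transform the train part.
vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(X_train)

# 3. Train the model.
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)

# 4. Transform the test part with the already fitted vectorizer
#    (no refitting, so vocabulary and idf weights come from train only).
X_test_tfidf = vectorizer.transform(X_test)
predictions = clf.predict(X_test_tfidf)
```

The key point is that `vectorizer` is an object that survives between steps 2 and 4; that object is the state under discussion here.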

The problem is that with the current implementation we don't have any state (which also brings many advantages, such as simplicity). The tfidf functions do not return an already fitted model, but rather the already transformed values.
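To make the consequence concrete, here is a minimal pure-Python sketch. `stateless_tfidf` is a hypothetical helper written for this illustration, not a Texthero function, and it uses a simplified tf-idf formula. It learns the vocabulary and idf weights from whatever it is given and keeps nothing afterwards, so calling it separately on train and test data produces feature columns that do not line up.

```python
import math
from collections import Counter


def stateless_tfidf(docs):
    """Hypothetical stateless helper (simplified tf-idf, whitespace tokens).

    The vocabulary and idf weights are learned from `docs` on every call.
    The column names are returned only so the mismatch can be inspected.
    """
    tokenized = [doc.lower().split() for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Number of documents in which each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log(n_docs / doc_freq[tok]) + 1.0 for tok in vocabulary}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[tok] * idf[tok] for tok in vocabulary])
    return vocabulary, rows


train_cols, train_rows = stateless_tfidf(["good film", "bad film"])
test_cols, test_rows = stateless_tfidf(["good plot"])

print(train_cols)  # ['bad', 'film', 'good']
print(test_cols)   # ['good', 'plot']
```

A classifier trained on the three train columns cannot be applied to the two test columns, which is why the fitted vocabulary and idf weights would need to be kept somewhere.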

We need to take a clear position with respect to this point. Having the exact same approach as scikit-learn would probably not make sense; still, we need to consider this fact. Opinions?

@jbesomi jbesomi added discussion To discuss new improvements bug Something isn't working labels Aug 20, 2020
@mk2510
Collaborator

mk2510 commented Aug 22, 2020

Could you explain what you mean? 🐙 🍺

@jbesomi
Owner Author

jbesomi commented Sep 8, 2020

In sklearn and similar toolkits, we generally call model.fit_transform on the train data and only model.transform on the test data. In Texthero we cannot do that, as we don't have any model object (but that's fine somehow; we just need to have a clear position on that).

@mk2510
Collaborator

mk2510 commented Sep 13, 2020

So far I haven't started thinking about the missing fit feature, as fit is generally called on models that can be fitted to your dataset (at least in my experience so far).
In those cases where we work with models that can be fitted, like in 'pca' or 'kmeans', I think we can mention on the getting started page that we call fit_transform every time and don't provide the option to store the fitted pca model. But the way the API is designed, the user probably wouldn't assume it anyway, I guess 😬 🙈 :octocat:

@jbesomi
Owner Author

jbesomi commented Sep 14, 2020

Agree. Yes, we can mention that in the (future) FAQ page.

@harshraj-wadhwani

@mk2510 Any plans to add a fit method?
Or is there any way I can fit on train data, save to a pickle file, and transform on test data?
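One possible workaround, sketched here under the assumption that using scikit-learn directly (outside Texthero's API) is acceptable, is to fit a `TfidfVectorizer` on the train data, pickle it, and load it later to transform the test data. The texts and the file name are made up for the example.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, invented for illustration only.
train_texts = ["great product", "terrible quality", "works really well"]
test_texts = ["great quality", "never seen words"]

# Fit on the train data only.
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(train_texts)

# Save the fitted vectorizer to a pickle file (temporary directory here).
path = os.path.join(tempfile.mkdtemp(), "tfidf_vectorizer.pkl")
with open(path, "wb") as f:
    pickle.dump(vectorizer, f)

# Later, possibly in another process: load it and transform the test data.
with open(path, "rb") as f:
    loaded_vectorizer = pickle.load(f)
test_tfidf = loaded_vectorizer.transform(test_texts)

# Same number of columns as the train matrix; unseen words are ignored.
print(train_tfidf.shape, test_tfidf.shape)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn object is generally only safe to reload with the same scikit-learn version that wrote it.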

3 participants