Does not seem to work for NLP pipeline from sklearn #164

dcshapiro · 2020-06-23T15:31:17Z

Hi,

I tried to convert a model pipeline containing 3 steps:

[('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('clf-svm', SGDClassifier())]

I get the following error:

MissingConverter: Unable to find converter for model type <class 'sklearn.feature_extraction.text.CountVectorizer'>.
It usually means the pipeline being converted contains a
transformer or a predictor with no corresponding converter implemented.
Please fill an issue at https://github.com/microsoft/hummingbird.

I'm guessing the system does not yet handle the first 2 parts...

ksaur · 2020-06-23T17:53:58Z

Hi @dcshapiro, thanks for bringing this up! Yes, the first two parts are not yet handled.

CountVectorizer is on our roadmap; we have it implemented but need to clean it up and release it.
TfidfTransformer is a bit more difficult to implement.
The problem with text featurizers is tokenization. Since tensors are statically sized, we need to know the number of words in a document. This means we must basically tokenize the document upfront.

tfidf is not implemented yet but should be possible once the text is tokenized! We'll keep that in mind for future features!
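
For anyone picking this up, here is a rough sketch of the arithmetic tf-idf needs once the text is already a numeric count matrix. It is plain Python, not Hummingbird code, and `tfidf_from_counts` is a hypothetical name. It assumes scikit-learn's default `TfidfTransformer` settings (smoothed idf, L2 row normalization); a real converter would express the same steps as tensor operations.

```python
import math


def tfidf_from_counts(counts):
    """Hypothetical helper: turn a dense count matrix into tf-idf weights.

    counts is a list of rows (one per document), each a list of term counts.
    Assumes smoothed idf and L2 normalization, as in sklearn's defaults.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of documents in which each term appears.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf: ln((1 + n) / (1 + df)) + 1.
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    result = []
    for row in counts:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        # Leave all-zero rows as zeros rather than dividing by zero.
        result.append([v / norm if norm > 0 else 0.0 for v in weighted])
    return result


weights = tfidf_from_counts([[1, 0], [1, 1]])
```

The idf vector depends only on the fitted data, so it could be computed once at conversion time and stored as a constant.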

Hemantr05 · 2020-09-10T22:12:05Z

@ksaur I would like to contribute by writing the code for tf-idf, if possible.

interesaaat · 2020-09-10T22:21:53Z

Please! Having tf-idf would be fantastic. Can you please open an issue specific to tf-idf so that we can discuss the implementation? I don't think it will be trivial because, for example, PyTorch does not support string data types, so we will have to transform the input data into some numeric form (we had to do the same for one-hot encoding).
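
To make the "numeric form" point concrete, here is a minimal sketch of one possible encoding, not what Hummingbird actually does. It assumes whitespace tokenization and a fixed maximum length; `build_vocab` and `encode` are hypothetical names. Each document becomes a fixed-length list of integer ids, which could then be loaded into a statically sized tensor.

```python
def build_vocab(docs):
    """Map each distinct lowercase token to an integer id, starting at 1."""
    vocab = {}
    for doc in docs:
        for tok in doc.lower().split():
            # Id 0 is reserved for padding and unknown tokens.
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode(docs, vocab, max_len):
    """Convert documents to fixed-length id lists (truncate or zero-pad)."""
    encoded = []
    for doc in docs:
        ids = [vocab.get(tok, 0) for tok in doc.lower().split()][:max_len]
        encoded.append(ids + [0] * (max_len - len(ids)))
    return encoded


docs = ["the cat sat", "the dog"]
vocab = build_vocab(docs)
encoded = encode(docs, vocab, max_len=4)
```

Every row has the same length regardless of how many words the document has, which is what a statically sized tensor requires.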

This was referenced Sep 11, 2020

tf-idf implementation #293

Open

CountVectorizer implementation #203

Open
