sklearn ColumnTransformer with TfidfVectorizer requires columns to be referenced by positional integer index #995

Open
sharathts14 opened this issue May 26, 2023 · 4 comments

Comments

@sharathts14

If a column is referenced by its name (a string) in the ColumnTransformer, the conversion fails with the RuntimeError below:

RuntimeError: Unable to find column name 'subject' among names ['input']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. This may happen because a ColumnTransformer follows a transformer without any mapped converter in a pipeline.

The error can be reproduced with the example from https://onnx.ai/sklearn-onnx/auto_examples/plot_tfidfvectorizer.html by converting the training dataset to a pandas DataFrame and referencing the columns by name in the ColumnTransformer.

Code to reproduce:

import ssl
# Work around SSL certificate errors when downloading the dataset.
ssl._create_default_https_context = ssl._create_unverified_context
import numpy as np
import pandas as pd
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import StringTensorType

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
try:
    from sklearn.datasets._twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
except ImportError:
    # scikit-learn < 0.24
    from sklearn.datasets.twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression


# limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
                           subset='train',
                           categories=categories,
                           )
test = fetch_20newsgroups(random_state=1,
                          subset='test',
                          categories=categories,
                          )


class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
  """Extract the subject & body from a usenet post in a single pass.
  Takes a sequence of strings and produces a dict of sequences. Keys are
  `subject` and `body`.
  """

  def fit(self, x, y=None):
    return self

  def transform(self, posts):
    # construct object dtype array with two columns
    # first column = 'subject' and second column = 'body'
    features = np.empty(shape=(len(posts), 2), dtype=object)
    for i, text in enumerate(posts):
      headers, _, bod = text.partition('\n\n')
      bod = strip_newsgroup_footer(bod)
      bod = strip_newsgroup_quoting(bod)
      features[i, 1] = bod

      prefix = 'Subject:'
      sub = ''
      for line in headers.split('\n'):
        if line.startswith(prefix):
          sub = line[len(prefix):]
          break
      features[i, 0] = sub

    return features


train_data = SubjectBodyExtractor().fit_transform(train.data)
test_data = SubjectBodyExtractor().fit_transform(test.data)

# Convert the training data to a DataFrame so that column names
# can be used instead of column indices.
train_df = pd.DataFrame(train_data, columns=['subject', 'body'])
print(train_df.head(1))

pipeline = Pipeline([
    ('union', ColumnTransformer(
        [
            ('subject', TfidfVectorizer(min_df=50, max_features=500), 'subject'),  # 0, is replaced with column name 'subject'

            ('body_bow', Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ]), 'body'),  # 1, is replaced with column name 'body'

            # Removed from the original example as
            # it requires a custom converter.
            # ('body_stats', Pipeline([
            #   ('stats', TextStats()),  # returns a list of dicts
            #   ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            # ]), 1),
        ],

        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            # 'body_stats': 1.0,
        }
    )),

    # Use a LogisticRegression classifier on the combined features.
    # Instead of LinearSVC (not fully ready in onnxruntime).
    ('logreg', LogisticRegression()),
])

pipeline.fit(train_df, train.target)
print(pipeline.steps)
#print(classification_report(pipeline.predict(test_data), test.target))

seps = {
    TfidfVectorizer: {
        "separators": [
            ' ', '.', '\\?', ',', ';', ':', '!',
            '\\(', '\\)', '\n', '"', "'",
            "-", "\\[", "\\]", "@"
        ]
    }
}
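# Fails with the RuntimeError above: the single input is named "input",
# but the ColumnTransformer refers to columns "subject" and "body".
model_onnx = convert_sklearn(
    pipeline, "tfidf",
    initial_types=[("input", StringTensorType([None, 2]))],
    # options=seps,
    target_opset=12)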
@sharathts14
Author

I am not sure whether this is a bug, or whether referencing columns by name (string) is simply not supported at the moment, so that columns have to be specified by their positional integer index.

@sharathts14
Author

I realize the problem in the code above is the definition of "initial_types", which has to change once the data is converted to a DataFrame.

with
initial_types=[("subject", StringTensorType([None, 1])), ("body", StringTensorType([None, 1]))],

the code works fine, so the issue as filed here is invalid.
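
The fix above amounts to declaring one input per DataFrame column, named after that column. A minimal sketch of building such a list, assuming every column holds strings as in this example. `per_column_initial_types` and `fake_tensor` are hypothetical names made up for illustration, not part of skl2onnx; the tensor type is passed in as an argument so the sketch runs without skl2onnx installed.

```python
def per_column_initial_types(column_names, tensor_type):
    """Build one (name, tensor) pair per column, each of shape [None, 1].

    Hypothetical helper for illustration; `tensor_type` is any callable
    that takes a shape list, e.g. skl2onnx's StringTensorType.
    """
    return [(str(name), tensor_type([None, 1])) for name in column_names]


# Stand-in factory (an assumption for this sketch): it only records the shape.
def fake_tensor(shape):
    return ("string_tensor", tuple(shape))


initial_types = per_column_initial_types(["subject", "body"], fake_tensor)
print([name for name, _ in initial_types])

# With skl2onnx installed, the equivalent call would be:
#   from skl2onnx.common.data_types import StringTensorType
#   initial_types = per_column_initial_types(train_df.columns, StringTensorType)
```

Columns with other dtypes would need a different tensor type per column, which this sketch does not cover.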

However, my internal example, which I cannot share here, fails with the error below:

RuntimeError: Unable to find column name 'command_normalized' among names ['variable']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. This may happen because a ColumnTransformer follows a transformer without any mapped converter in a pipeline.

Somehow the input variables I defined are converted to [Variable('variable', 'variable6', type=FloatTensorType(shape=[]))] in _parse_sklearn_column_transformer of _parse.py.

Further debugging is in progress.
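
The error text suggests that each column name used in the ColumnTransformer is looked up among the names of the inputs that reach it. Below is a rough sketch of that lookup for debugging purposes. `find_missing_columns` is a hypothetical helper written for illustration, not the actual code in _parse.py.

```python
def find_missing_columns(referenced_columns, input_names):
    """Return the referenced column names that are absent from input_names."""
    available = set(input_names)
    return [col for col in referenced_columns if col not in available]


# Case from the first comment: a single input called 'input'.
first_case = find_missing_columns(["subject", "body"], ["input"])
print(first_case)

# One input per column, named after the column: nothing is missing.
fixed_case = find_missing_columns(["subject", "body"], ["subject", "body"])
print(fixed_case)
```

In the internal example the only available name is 'variable', which would be consistent with the hint in the error message that an earlier step in the pipeline replaced the named inputs.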

@sharathts14
Author

Ah, I see that the 'note' section in https://onnx.ai/sklearn-onnx/api_summary.html mentions exactly this.
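
If column names turn out not to be usable, one possible workaround, assuming positional indices are accepted as in the original example, is to translate names into positions before building the ColumnTransformer. A small sketch: `to_positional` is a hypothetical helper and only handles single-name column references.

```python
def to_positional(transformers, column_names):
    """Replace string column references with their integer positions.

    `transformers` is a list of (name, estimator, column) tuples as passed
    to ColumnTransformer; non-string references are left unchanged.
    """
    positions = {name: idx for idx, name in enumerate(column_names)}
    converted = []
    for name, estimator, column in transformers:
        if isinstance(column, str):
            column = positions[column]  # raises KeyError for unknown names
        converted.append((name, estimator, column))
    return converted


# Estimators are replaced by placeholder strings to keep the sketch
# free of dependencies; real code would pass the estimator objects.
spec = [("subject", "tfidf", "subject"), ("body_bow", "tfidf+svd", "body")]
positional_spec = to_positional(spec, ["subject", "body"])
print(positional_spec)
```

With integer columns, the single two-column input from the original example, ("input", StringTensorType([None, 2])), would presumably apply again.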

@jjasont

jjasont commented Dec 12, 2023

@sharathts14 Hi there, I recently stumbled upon this limitation as well. How did you resolve it, and what is your workaround?
