When a ColumnTransformer column is referenced by its column name (a string), conversion fails with the RuntimeError below:
RuntimeError: Unable to find column name 'subject' among names ['input']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. This may happen because a ColumnTransformer follows a transformer without any mapped converter in a pipeline.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
import pandas as pd
import matplotlib.pyplot as plt
import os
# from onnx.tools.net_drawer import GetPydotGraph, GetOpNodeProducer
import numpy
import onnxruntime as rt
from skl2onnx.common.data_types import StringTensorType
from skl2onnx import convert_sklearn
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import fetch_20newsgroups
try:
    from sklearn.datasets._twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
except ImportError:
    # scikit-learn < 0.24
    from sklearn.datasets.twenty_newsgroups import (
        strip_newsgroup_footer, strip_newsgroup_quoting)
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression

# limit the list of categories to make running this example faster.
categories = ['alt.atheism', 'talk.religion.misc']
train = fetch_20newsgroups(random_state=1,
                           subset='train',
                           categories=categories,
                           )
test = fetch_20newsgroups(random_state=1,
                          subset='test',
                          categories=categories,
                          )

class SubjectBodyExtractor(BaseEstimator, TransformerMixin):
    """Extract the subject & body from a usenet post in a single pass.
    Takes a sequence of strings and produces a dict of sequences.
    Keys are `subject` and `body`.
    """

    def fit(self, x, y=None):
        return self

    def transform(self, posts):
        # construct object dtype array with two columns
        # first column = 'subject' and second column = 'body'
        features = np.empty(shape=(len(posts), 2), dtype=object)
        for i, text in enumerate(posts):
            headers, _, bod = text.partition('\n\n')
            bod = strip_newsgroup_footer(bod)
            bod = strip_newsgroup_quoting(bod)
            features[i, 1] = bod

            prefix = 'Subject:'
            sub = ''
            for line in headers.split('\n'):
                if line.startswith(prefix):
                    sub = line[len(prefix):]
                    break
            features[i, 0] = sub

        return features


train_data = SubjectBodyExtractor().fit_transform(train.data)
test_data = SubjectBodyExtractor().fit_transform(test.data)

# convert training data to dataframe so that column name can be used instead of column index
train_df = pd.DataFrame(train_data, columns=['subject', 'body'])
print(train_df.head(1))

pipeline = Pipeline([
    ('union', ColumnTransformer(
        [
            ('subject', TfidfVectorizer(min_df=50, max_features=500), 'subject'),  # 0, is replaced with column name 'subject'
            ('body_bow', Pipeline([
                ('tfidf', TfidfVectorizer()),
                ('best', TruncatedSVD(n_components=50)),
            ]), 'body'),  # 1, is replaced with column name 'body'

            # Removed from the original example as
            # it requires a custom converter.
            # ('body_stats', Pipeline([
            #     ('stats', TextStats()),  # returns a list of dicts
            #     ('vect', DictVectorizer()),  # list of dicts -> feature matrix
            # ]), 1),
        ],
        transformer_weights={
            'subject': 0.8,
            'body_bow': 0.5,
            # 'body_stats': 1.0,
        }
    )),

    # Use a LogisticRegression classifier on the combined features.
    # Instead of LinearSVC (not fully ready in onnxruntime).
    ('logreg', LogisticRegression()),
])

pipeline.fit(train_df, train.target)
print(pipeline.steps)
# print(classification_report(pipeline.predict(test_data), test.target))

seps = {
    TfidfVectorizer: {
        "separators": [
            ' ', '.', '\\?', ',', ';', ':', '!',
            '\\(', '\\)', '\n', '"', "'",
            "-", "\\[", "\\]", "@"
        ]
    }
}
model_onnx = convert_sklearn(
    pipeline, "tfidf",
    initial_types=[("input", StringTensorType([None, 2]))],
    # options=seps,
    target_opset=12)
I realize the problem in the code above is the definition of "initial_types", which has to change once the data is converted to a DataFrame: each column then needs its own named input.
With initial_types=[("subject", StringTensorType([None, 1])), ("body", StringTensorType([None, 1]))],
the code works fine, so the issue as filed here is invalid.
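The fix described above, one named input per DataFrame column, can be sketched as two small helpers. This is a minimal sketch and not part of sklearn-onnx: `per_column_initial_types` and `undeclared_columns` are hypothetical names introduced here for illustration, and the `[None, 1]` shape is taken from the working `initial_types` quoted above.

```python
def per_column_initial_types(column_names, tensor_type):
    """Build one (name, type) pair per column, each declared with shape [None, 1].

    `tensor_type` is any callable taking a shape, e.g. skl2onnx's StringTensorType.
    """
    return [(name, tensor_type([None, 1])) for name in column_names]


def undeclared_columns(referenced_columns, initial_types):
    """Return the referenced column names that no initial_types entry declares."""
    declared = {name for name, _ in initial_types}
    return [name for name in referenced_columns if name not in declared]


# Usage, assuming sklearn-onnx is installed and `pipeline` is the fitted
# pipeline from the reproduction code:
#   from skl2onnx import convert_sklearn
#   from skl2onnx.common.data_types import StringTensorType
#   initial_types = per_column_initial_types(["subject", "body"], StringTensorType)
#   model_onnx = convert_sklearn(pipeline, "tfidf",
#                                initial_types=initial_types, target_opset=12)
```

Checking `undeclared_columns(["subject", "body"], [("input", None)])` before converting returns both names, which matches the mismatch the RuntimeError reports for the single "input" declaration.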
However, my internal example, which I cannot share here, fails with the error below:
RuntimeError: Unable to find column name 'command_normalized' among names ['variable']. Make sure the input names specified with parameter initial_types fits the column names specified in the pipeline to convert. This may happen because a ColumnTransformer follows a transformer without any mapped converter in a pipeline.
Somehow the input variables I defined are being converted to [Variable('variable', 'variable6', type=FloatTensorType(shape=[]))] in _parse_sklearn_column_transformer of _parse.py.
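One possible cause, going only by the last sentence of the error message, is that the ColumnTransformer is not the first step of the pipeline. A quick check is to list the steps that run before it. The sketch below is an assumption-laden illustration: `steps_before_column_transformer` is a hypothetical helper, and the example pipeline is invented for the demonstration (it is not the internal example, which was not shared); only the column name comes from the error message.

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def steps_before_column_transformer(pipeline):
    """Return the names of the steps that precede the first ColumnTransformer.

    An empty list means the ColumnTransformer comes first (or there is none).
    """
    names = []
    for name, step in pipeline.steps:
        if isinstance(step, ColumnTransformer):
            return names
        names.append(name)
    return []


# Hypothetical pipeline, for illustration only.
example = Pipeline([
    ("scale", StandardScaler()),
    ("union", ColumnTransformer(
        [("keep", "passthrough", ["command_normalized"])])),
    ("logreg", LogisticRegression()),
])

print(steps_before_column_transformer(example))  # prints ['scale']
```

If the list is non-empty, the ColumnTransformer is fed by the output of an earlier step rather than by the inputs declared in initial_types, which would be consistent with the unexpected name 'variable' in the error.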
The above error can be reproduced with the same example as in https://onnx.ai/sklearn-onnx/auto_examples/plot_tfidfvectorizer.html when the training dataset is converted to a pandas DataFrame and the ColumnTransformer references its columns by name.
The code to reproduce is the listing above.