As I was working to keep my promise from a
<a href="{{ site.baseurl }}{% link _posts/2024-03-09-stack-overflow-api.html %}">previous entry</a>,
I came across a scenario that I thought was worth a blog post.
I was using the
[`mlxtend`](https://rasbt.github.io/mlxtend/)
package to show how one might perform a basket analysis on question tags when I discovered that a feature I expected to exist didn't.
I'll elaborate.

# The Missing Feature

I connected to the API as I had previously written about and pulled questions.

In [1]:
from os import getenv

from stackapi import StackAPI


key = getenv("STACK_API_KEY")
SITE = StackAPI("stackoverflow", key=key)
questions = SITE.fetch("questions")

questions["items"][0]

{'tags': ['html', 'css', 'flexbox', 'responsive-design', 'centering'],
 'owner': {'account_id': 26330658,
  'reputation': 73,
  'user_id': 19991177,
  'user_type': 'registered',
  'profile_image': 'https://www.gravatar.com/avatar/1379e1c185626a10b0ddac93c5326254?s=256&d=identicon&r=PG',
  'display_name': 'TheNickster',
  'link': 'https://stackoverflow.com/users/19991177/thenickster'},
 'is_answered': True,
 'view_count': 18,
 'answer_count': 2,
 'score': 0,
 'last_activity_date': 1711251161,
 'creation_date': 1711235354,
 'question_id': 78212821,
 'content_license': 'CC BY-SA 4.0',
 'link': 'https://stackoverflow.com/questions/78212821/how-do-i-center-score-text-for-a-basketball-scoreboard',
 'title': 'How do I Center Score Text for a Basketball Scoreboard?'}

In the question items there's a field called "tags", which I want to use for the analysis.
The tags are presented as a list of strings.
To keep them tied to their questions and make analysis a bit easier,
I decided to convert the list of question items to a `pandas.DataFrame`.

In [2]:
import pandas as pd

# Configuration settings
pd.options.display.expand_frame_repr = False
pd.options.display.max_columns = 6


df = pd.DataFrame(questions["items"])
# Question Ids are unique to the row.
df = df.set_index("question_id")
# Results may vary as the most recent questions are returned each call.
print(df.tags.head())

question_id
78212821    [html, css, flexbox, responsive-design, center...
76143172                                 [php, symfony, twig]
35707320            [ruby-on-rails, mongodb, ruby-on-rails-4]
48057197               [php, apache, xampp, php-7.1, php-7.2]
49476559    [java, compiler-errors, java-9, java-module, m...
Name: tags, dtype: object


Preprocessing of the tags would be handled by the `mlxtend` library.
I chose to use the
[`TransactionEncoder`](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.preprocessing/#transactionencoder), which is similar to a [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html),
but it converts item lists (think lists of lists; nested lists) into transaction data, with one boolean column per item, rather than converting an array (one value per cell) into columns.
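
To make the idea concrete, here's a minimal pure-Python sketch of what a transaction encoding looks like. The helper name `encode_transactions` and the tag lists are made up for illustration; this is not how `mlxtend` implements it, only the shape of the result.

```python
def encode_transactions(transactions):
    """Return (columns, rows): one boolean per unique item, per transaction."""
    # Columns are the sorted unique items seen across every transaction.
    columns = sorted({item for transaction in transactions for item in transaction})
    # Each row flags which of those items appear in that transaction.
    rows = [[column in transaction for column in columns] for transaction in transactions]
    return columns, rows


# Made-up tag lists, purely for illustration.
tags = [["html", "css"], ["php"], ["css", "php"]]
columns, rows = encode_transactions(tags)
print(columns)
for row in rows:
    print(row)
```

Every question becomes a row, and every tag seen anywhere becomes a column, which is why the real data ends up so wide and so sparse.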

In [3]:
from mlxtend.preprocessing.transactionencoder import TransactionEncoder


encoder = TransactionEncoder()
tag_encodings = encoder.fit_transform(df.tags)
tag_encodings

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

The returned results are an array.
No problem with that.
But while browsing
[the example in the User Guide](https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/),
I noticed how they converted the array into a
[`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [4]:
tag_df = pd.DataFrame(
    tag_encodings,
    index=df.index,  # I added the index to align with the input data.
    columns=encoder.columns_,
)
print(tag_df.head())

              .net  .net-6.0  .net-attributes  ...  zooming    zsh  zustand
question_id                                    ...                         
78212821     False     False            False  ...    False  False    False
76143172     False     False            False  ...    False  False    False
35707320     False     False            False  ...    False  False    False
48057197     False     False            False  ...    False  False    False
49476559     False     False            False  ...    False  False    False

[5 rows x 900 columns]


There's nothing wrong with how this was done, but I wondered why the
[`set_output`](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html)
method wasn't taken advantage of.
That's when I realized it's not exposed in `mlxtend`.

In [5]:
try:
    encoder = TransactionEncoder().set_output(transform="pandas")

except Exception as e:
    print(repr(e))

AttributeError("This 'TransactionEncoder' has no attribute 'set_output'")


"That's odd," I thought.
I'm pretty sure [`scikit-learn`](https://scikit-learn.org/stable/index.html) is a requirement for `mlxtend`.
Surely the supported version is at least 1.2, the release that introduced `set_output`?
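
One quick way to check what's installed locally is to ask the package metadata. The sketch below uses only the standard library; the helper names `major_minor` and `supports_set_output` are hypothetical, and the only fact it relies on is that `set_output` arrived in `scikit-learn` 1.2.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_minor(version_string):
    """Return the (major, minor) part of a version string such as '1.2.2'."""
    major, minor = re.findall(r"\d+", version_string)[:2]
    return int(major), int(minor)


def supports_set_output(version_string):
    # `set_output` was introduced in scikit-learn 1.2.
    return major_minor(version_string) >= (1, 2)


try:
    installed = version("scikit-learn")
    print(installed, supports_set_output(installed))
except PackageNotFoundError:
    print("scikit-learn is not installed")
```

Comparing integer tuples avoids the classic string-comparison trap where `"1.10"` sorts before `"1.2"`.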

After looking at the [requirements.txt](https://github.com/rasbt/mlxtend/blob/master/requirements.txt) file,
I was relieved to see that the package only pins a minimum version of `scikit-learn` (1.0.2), so the newest release can be installed alongside it.
But why didn't `set_output` work?

The reason wasn't obvious after digging through the `TransactionEncoder`'s source code.
Switching to how `set_output` works in `scikit-learn`, I found what I was looking for in the documentation for the
[`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#sklearn.base.TransformerMixin) class:

> Mixin class for all transformers in scikit-learn.
> 
> This mixin defines the following functionality:
> 
> - a `fit_transform` method that delegates to `fit` and `transform`;
> - a `set_output` method to output `X` as a specific container type.
> 
> If [`get_feature_names_out`](https://scikit-learn.org/stable/glossary.html#term-get_feature_names_out) is defined,
> then [`BaseEstimator`](https://scikit-learn.org/stable/modules/generated/sklearn.base.BaseEstimator.html#sklearn.base.BaseEstimator)
> will automatically wrap `transform` and `fit_transform` to follow the `set_output` API.
> See the [Developer API for `set_output`](https://scikit-learn.org/stable/developers/develop.html#developer-api-set-output)
> for details.
> 
> [`OneToOneFeatureMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.OneToOneFeatureMixin.html#sklearn.base.OneToOneFeatureMixin) and [`ClassNamePrefixFeaturesOutMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.ClassNamePrefixFeaturesOutMixin.html#sklearn.base.ClassNamePrefixFeaturesOutMixin) are helpful mixins for defining [`get_feature_names_out`](https://scikit-learn.org/stable/glossary.html#term-get_feature_names_out).

The current version of `TransactionEncoder` *does* inherit from `scikit-learn`'s `TransformerMixin`,
but *does not* define the `get_feature_names_out` method.
Implementing the method would expose `set_output`, allowing the `TransactionEncoder` to return a `pandas.DataFrame` directly.
I'm up for the challenge 😎.

# New Issue (Feature)

If you haven't contributed to an open source project before, here are some general guidelines I like to follow:
1. Check if a related issue has already been logged.
Nobody wants to deal with closing duplicate tickets.
Or worse, not closing them and having to deal with duplicate work that's already been completed.
2. Read the package's contribution guidelines and code of conduct.
If there's an existing process in place, follow it.

I usually perform a few searches over the open issues with various keywords to see if anything comes up.
For this particular issue I tried
["set_output"](https://github.com/rasbt/mlxtend/issues?q=is%3Aissue+is%3Aopen+set_output),
["TransactionEncoder"](https://github.com/rasbt/mlxtend/issues?q=is%3Aissue+is%3Aopen+TransactionEncoder), and
["get_feature_names_out"](https://github.com/rasbt/mlxtend/issues?q=is%3Aissue+is%3Aopen+get_feature_names_out).
The first and third yielded no results, and the second returned a few issues, none of them related to the format of the output.
I'm good to proceed.
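
If you'd rather generate those search links than click through the UI, the pattern is easy to reproduce. This is a small sketch using only the standard library; `open_issue_search_url` is a hypothetical helper name, and the query format is simply copied from the links above.

```python
from urllib.parse import urlencode

ISSUES_URL = "https://github.com/rasbt/mlxtend/issues"


def open_issue_search_url(keyword):
    """Build a GitHub search URL for open issues matching `keyword`."""
    # `urlencode` percent-encodes the colons and turns spaces into `+`.
    query = urlencode({"q": f"is:issue is:open {keyword}"})
    return f"{ISSUES_URL}?{query}"


for keyword in ("set_output", "TransactionEncoder", "get_feature_names_out"):
    print(open_issue_search_url(keyword))
```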

`mlxtend`'s [issue template](https://github.com/rasbt/mlxtend/issues/new/choose) has five major categories:
- Bug report
- Documentation improvement
- Feature request
- Other
- Usage question

Since the `get_feature_names_out` method doesn't exist in the `TransactionEncoder`,
I think this should be a feature request.

I started off with a title: ***"Integrate scikit-learn's `set_output` method into `TransactionEncoder`."***
I want my feature request to be specific and small enough that it can be easily merged,
as well as not break any preexisting code (though I do foresee a `scikit-learn` version bump).

Next, I need to fill out the following four sections:
- Describe the workflow you want to enable
- Describe your proposed solution
- Describe alternatives you've considered, if relevant
- Additional context

Here's what I put for each:

### Describe the workflow you want to enable

In `scikit-learn` [version 1.2](https://scikit-learn.org/1.2/whats_new/v1.2.html#id8),
[the `set_output` API was introduced](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html).
I would like to expose the API inside of the
[`mlxtend.preprocessing.transactionencoder.TransactionEncoder`](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.preprocessing/#transactionencoder) class.
This would allow the user to set the output of :method:`TransactionEncoder.fit_transform` and :method:`TransactionEncoder.transform` to a
[`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) by default,
rather than having to manually create the object after transformation.

### Describe your proposed solution

My proposed solution is to define the
[:method:`get_feature_names_out`](https://scikit-learn.org/stable/glossary.html#term-get_feature_names_out)
in :class:`TransactionEncoder` as this is required to expose the :method:`set_output`.
See [:class:`TransformerMixin`](https://scikit-learn.org/stable/modules/generated/sklearn.base.TransformerMixin.html#sklearn.base.TransformerMixin)
and [Developer API for `set_output`](https://scikit-learn.org/stable/developers/develop.html#developer-api-set-output)
for more details.

### Describe alternatives you've considered, if relevant

Continue using the method described in the
[User Guide](https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/)
—convert the output of the transformer to a `pandas.DataFrame` manually.

### Additional context

- This would require the minimum version of `scikit-learn` to increase from 1.0.2 to 1.2.2.
- I'm willing to take on the PR for this work.

# Submit

After doing my due diligence, I submitted the feature request/issue.
You can keep tabs on it here 👉 
[Integrate scikit-learn's `set_output` method into `TransactionEncoder`](https://github.com/rasbt/mlxtend/issues/1085).
While I wait for one of the package maintainers to green-light my request,
I'll scope out how difficult it will be to implement the `get_feature_names_out` method.
I should also see if I need to write or update any [unit tests](https://en.wikipedia.org/wiki/Unit_testing).
Catch you in part deuce ✌️.