As I was working to keep my promise from a
<a href="{{ site.baseurl }}{% link _posts/2024-03-09-stack-overflow-api.html %}">previous post</a>,
I came across a scenario that I thought was worth a blog post.
I was using the
[`mlxtend`](https://rasbt.github.io/mlxtend/)
package to show how one might perform a basket analysis on question tags when I discovered that a feature I expected to exist didn't.
I'll elaborate.

# The Missing Feature

I connected to the API as described in that earlier post and pulled questions.

In [1]:
from os import getenv

from stackapi import StackAPI


key = getenv("STACK_API_KEY")
SITE = StackAPI("stackoverflow", key=key)
questions = SITE.fetch("questions")

questions["items"][0]

{'tags': ['google-apps-script', 'google-drive-api'],
 'owner': {'account_id': 19491597,
  'reputation': 3,
  'user_id': 14259706,
  'user_type': 'registered',
  'profile_image': 'https://lh3.googleusercontent.com/a-/AOh14Gj-MrYaFm4cOQkGDPB9QR-s_AqAWO_11zTcsKpR4g=k-s256',
  'display_name': 'Lorenlis Hernandez',
  'link': 'https://stackoverflow.com/users/14259706/lorenlis-hernandez'},
 'is_answered': True,
 'view_count': 1019,
 'accepted_answer_id': 63845724,
 'answer_count': 2,
 'score': 0,
 'last_activity_date': 1710209200,
 'creation_date': 1599819696,
 'last_edit_date': 1599837995,
 'question_id': 63845187,
 'content_license': 'CC BY-SA 4.0',
 'link': 'https://stackoverflow.com/questions/63845187/do-not-send-notifications-when-share-a-file',
 'title': 'DO NOT send notifications when share a file'}

In the question items there's a field called "tags", which I want to use for the analysis.
The tags are presented as a list of strings.
To keep them tied to their questions and make analysis a bit easier,
I decided to convert the list of question items to a `pandas.DataFrame`.

In [2]:
import pandas as pd

# Configuration settings
pd.options.display.expand_frame_repr = False
pd.options.display.max_columns = 8


df = pd.DataFrame(questions["items"])
# Question IDs are unique, so each one identifies a single row.
df = df.set_index("question_id")
# Results may vary as the most recent questions are returned on each call.
print(df.tags.head())

question_id
78135146          [tensorflow, deep-learning, neural-network]
78132659    [java, android, firebase, firebase-realtime-da...
78137900                 [ios, xcode, google-cloud-firestore]
78133446                                                [sql]
78137890                                       [excel-online]
Name: tags, dtype: object


Preprocessing of the tags would be handled by the `mlxtend` library.
I chose to use the
[`TransactionEncoder`](https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.preprocessing/#transactionencoder), which is similar to a [`OneHotEncoder`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html),
but it turns a list of item lists (transactions) into a boolean array with one column per unique item, rather than encoding the columns of an existing array.
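
To make that encoding concrete, here is a rough pure-Python sketch of the idea. It is not `mlxtend`'s implementation; the function name `encode_transactions` and the sample tags are made up for illustration.

```python
def encode_transactions(transactions):
    """Return (columns, rows) for a list of item lists.

    columns: the sorted unique items across all transactions.
    rows: one list of booleans per transaction, True where the item is present.
    """
    columns = sorted({item for transaction in transactions for item in transaction})
    rows = []
    for transaction in transactions:
        present = set(transaction)
        rows.append([column in present for column in columns])
    return columns, rows


# Hypothetical tag lists standing in for real question tags.
tags = [["python", "pandas"], ["sql"], ["python", "sql"]]
columns, rows = encode_transactions(tags)

print(columns)
for row in rows:
    print(row)
```

Each row lines up with one question and each column with one tag, which is the shape the basket analysis needs.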

In [3]:
from mlxtend.preprocessing.transactionencoder import TransactionEncoder


encoder = TransactionEncoder()
tag_encodings = encoder.fit_transform(df.tags)
tag_encodings

array([[False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       ...,
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False],
       [False, False, False, ..., False, False, False]])

The returned result is a NumPy array.
No problem with that.
But while browsing
[the example in the User Guide](https://rasbt.github.io/mlxtend/user_guide/preprocessing/TransactionEncoder/),
I noticed that it converts the array into a
[`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html).

In [4]:
tag_df = pd.DataFrame(
    tag_encodings,
    index=df.index,  # I added the index to align with the input data.
    columns=encoder.columns_,
)
print(tag_df.head())

              .net  .net-6.0  .net-8.0     3d  ...  zephyr-rtos  zeroconf    zip    zod
question_id                                    ...                                     
78135146     False     False     False  False  ...        False     False  False  False
78132659     False     False     False  False  ...        False     False  False  False
78137900     False     False     False  False  ...        False     False  False  False
78133446     False     False     False  False  ...        False     False  False  False
78137890     False     False     False  False  ...        False     False  False  False

[5 rows x 830 columns]


There's nothing wrong with how this was done, but I wondered why the
[`set_output`](https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_set_output.html)
method wasn't used instead.
That's when I realized it's not exposed on `mlxtend`'s `TransactionEncoder`.

In [5]:
try:
    encoder = TransactionEncoder().set_output(transform="pandas")

except Exception as e:
    print(repr(e))

AttributeError("This 'TransactionEncoder' has no attribute 'set_output'")


"That's odd," I thought.
I'm pretty sure [`scikit-learn`](https://scikit-learn.org/stable/index.html) is a requirement for `mlxtend`.
Surely the supported version is at least 1.2, the release that introduced `set_output`?

After looking at the [requirements.txt](https://github.com/rasbt/mlxtend/blob/master/requirements.txt) file,
I was relieved to see that the package did in fact require a recent enough version of `scikit-learn`.
But why didn't `set_output` work?

When I dug through the `TransactionEncoder`'s source code, the reason wasn't obvious at first.
Then I saw it: the
[`columns_` attribute](https://github.com/rasbt/mlxtend/blob/b3b81f4dd603e0ad9c8f3133f1b2bf2f5177cc9d/mlxtend/preprocessing/transactionencoder.py#L21).
As of `scikit-learn` version 1.0:
> [All estimators store `feature_names_in_` when fitted on pandas DataFrames.](https://scikit-learn.org/1.0/whats_new/v1.0.html#:~:text=All%20estimators%20store%20feature_names_in_%20when%20fitted%20on%20pandas%20Dataframes.)

Seeing as the `TransactionEncoder` was developed prior to this API change in `scikit-learn`,
it was probably missed and therefore not integrated.
I believe it *should* be integrated, so I'm logging an issue for a feature enhancement and covering the process in this post.

# New Issue

If you haven't logged an issue to an open source project before, here are some general guidelines I like to follow:
1. Check if a related issue has already been logged. Nobody wants to close duplicate tickets or, worse, leave them open and repeat work that's already been completed.
2. Read the package's contribution guidelines and code of conduct. If there's an existing process in place, follow it.

I usually perform a few searches over the open issues with various keywords to see if anything comes up.
For this particular issue I tried
["set_output"](https://github.com/rasbt/mlxtend/issues?q=is%3Aissue+is%3Aopen+set_output) and
["TransactionEncoder"](https://github.com/rasbt/mlxtend/issues?q=is%3Aissue+is%3Aopen+TransactionEncoder).
The first yielded no results, and the second had results unrelated to the format of the output.
I figured I was good to proceed.
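
If you run these searches often, the URLs can be built with the standard library. This is only a convenience sketch; the helper name `issue_search_url` is my own, and it mirrors the query format of the two links above.

```python
from urllib.parse import urlencode


def issue_search_url(repo, keyword):
    """Build a GitHub URL that searches the open issues of `repo` for `keyword`."""
    query = urlencode({"q": f"is:issue is:open {keyword}"})
    return f"https://github.com/{repo}/issues?{query}"


for keyword in ("set_output", "TransactionEncoder"):
    print(issue_search_url("rasbt/mlxtend", keyword))
```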

`mlxtend`'s [issue template](https://github.com/rasbt/mlxtend/issues/new/choose) has five categories:
- Bug report
- Documentation improvement
- Feature request
- Other
- Usage question

Since the `set_output` method isn't available on `TransactionEncoder`, I think this issue should be a feature request.

I start off with a title: ***"Integrate scikit-learn's `set_output` method into `TransactionEncoder`."***
I want my feature request to be specific and small enough that it can be easily merged,
and I don't want it to break any preexisting code (though I do foresee some
[`DeprecationWarning`](https://docs.python.org/3/library/exceptions.html#DeprecationWarning)s).

Next, I need to fill out the following four sections:
- Describe the workflow you want to enable
- Describe your proposed solution
- Describe alternatives you've considered, if relevant
- Additional context