Conversation

@mbaak
Contributor

@mbaak mbaak commented Sep 1, 2024

Note: we are using the sklearn v1.4.2 version of TfidfTransformer to ensure compatibility between the
pandas and Spark versions of emm. In sklearn v1.5+, TfidfTransformer no longer has the _idf_diag attribute,
which that compatibility depends on.

@mbaak mbaak force-pushed the copy_skl14_tfidf branch 2 times, most recently from 3f891b8 to 102cc91 Compare September 1, 2024 13:07
… skl>=1.5

Adapt to the TfidfTransformer of sklearn v1.5 to ensure compatibility between the
pandas and Spark versions of emm. In sklearn v1.5+, TfidfTransformer no longer has the _idf_diag attribute,
which is needed to set up that compatibility.
@mbaak mbaak changed the title from "WIP: Copy skl14 tfidf code into repo to ensure pandas spark compatibility" to "FIX: ensure sklearn and spark tfidf compatibility for sklearn >= 1.5" Sep 3, 2024
@mbaak mbaak requested a review from twalen September 3, 2024 07:50
Collaborator

@twalen twalen left a comment

Looks good to me. In future versions, idf_diag could be calculated only for sklearn < 1.5, since max_idf_square can be calculated directly from idf_.
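
For illustration only, here is a minimal sketch of reading the value from the public idf_ attribute rather than from _idf_diag. It assumes that max_idf_square means the square of the largest IDF weight, which this thread does not spell out. The helper name and the SimpleNamespace stand-in for a fitted transformer are hypothetical and not taken from emm.

```python
from types import SimpleNamespace


def max_idf_square(transformer):
    """Return the square of the largest IDF weight.

    Assumption: `transformer` exposes the public `idf_` attribute
    (a 1-D sequence of per-term IDF weights), as a fitted
    TfidfTransformer with use_idf=True does. The private `_idf_diag`
    attribute is not touched.
    """
    return float(max(transformer.idf_)) ** 2


# Hypothetical stand-in for a fitted transformer, so the sketch runs
# without sklearn installed.
fitted = SimpleNamespace(idf_=[1.0, 1.5, 2.0])
result = max_idf_square(fitted)
```

Because the helper only reads idf_, it does not depend on which sklearn version produced the fitted object.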

@mbaak mbaak merged commit 80a5683 into main Sep 5, 2024
@mbaak mbaak deleted the copy_skl14_tfidf branch September 5, 2024 14:07