Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change representation_series to DataFrame #156

Merged

Conversation

mk2510
Copy link
Collaborator

@mk2510 mk2510 commented Aug 21, 2020

  • all functions, which previously dealt with representation series now handle only the dataframe instead. 🚀

  • rm all functions like flatten, as they are not needed anymore

  • adopted docstrings and tests

-> further stuff to do:

  • add those examples into the tutorials, readme, getting started

mk2510 and others added 8 commits August 18, 2020 22:06
suport MultiIndex as function parameter

returns MultiIndex, where Representation was returned

* missing: correct test


Co-authored-by: Henri Froese <hf2000510@gmail.com>
*missing: test adopting for new types


Co-authored-by: Henri Froese <hf2000510@gmail.com>
@henrifroese
Copy link
Collaborator

Will review soon

Copy link
Collaborator

@henrifroese henrifroese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall: looks great; nice that we're close to getting this done 🚀 ! General comments that should be addressed:

  • macOS build is failing in Travis. From the log we can see that this is due to the DocumentTermDF not being printed the same in macOS consoles. We probably do not want to not test this at all, so either (a) look at somehow passing a #doctest: +some_command_to_solve_this or (b) skip it with #doctest: +SKIP and add a unittest instead where we manually compare the series (probably much easier)

  • in general: for dimensionality reduction and clustering, as far as I can see we are not testing this at all. Of course we haven't tested it before, but this is probably the best time to add at least one unittest for all the functions (we're skipping all the doctests at the moment). Should be relatively quick to implement this in test_representation.py

tests/test_representation.py Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
tests/test_representation.py Outdated Show resolved Hide resolved
@henrifroese
Copy link
Collaborator

I'll address everything I can myself right now (doctests, unittests, ...)

@henrifroese
Copy link
Collaborator

Was able to address most small comments myself, will do the rest later with @mk2510

@henrifroese
Copy link
Collaborator

henrifroese commented Aug 21, 2020

We have now addressed the remaining issues. We're skipping doctests in representation.py and implemented doctests for every representation function in test_representation.py instead.

From our side, this can be merged now @jbesomi 🙏 🚀 🐳 🤞

@henrifroese henrifroese marked this pull request as ready for review August 21, 2020 17:50
@jbesomi
Copy link
Owner

jbesomi commented Sep 3, 2020

Looks great (even too hard to understand, at least quickly).

The main question (and sorry for the late review; will catch up faster now): why do we return a MultiIndex sparse DataFrame? Why not simply a (sparse) DataFrame? This should simplify things a bit and probably is the most natural type users expect (i.e we wrap the scipy sparse matrix on a DF)

@jbesomi jbesomi marked this pull request as draft September 8, 2020 09:04
@jbesomi
Copy link
Owner

jbesomi commented Sep 8, 2020

As far as I know, we will not have to deal anymore with "RepresentationSeries" as there are no functions that return such object. tfidf and company returns a DataFrame that we can convert into a VectorSeries. Is that right?

Then, for instance, normalize does not need to handle the RepresentationSeries case, rather VectorSeries and DataFrame. It seems to me that _check_is_valid_DataFrame is not useful as-it-is, rather, we can move the code inside the normalize function to detect if input_matrix is a VectorSeries (need to check is valid) or simply a DataFrame.

There are still many part of the code that mention "DocumentTermDF", this might not be necessary, right? i.e pca can be applied to any DataFrame, not only document-term ...

  • plus you need to fix minor issues, such as docstring longer than 75 characters ... (i.e line 917), pca and nmf not having the same summary sentence (one is missing "on the given input.") [I haven't checked the others]

☝️ @mk2510 for the next time, make sure to set "ready for review" when you double-checked the whole PR and you are sure on all changes. It makes spare time to both you and me :) 👍

@vercel vercel bot temporarily deployed to Preview September 9, 2020 20:46 Inactive
@mk2510
Copy link
Collaborator Author

mk2510 commented Sep 9, 2020

However we use _check_is_valid_DataFrame in multiple different functions, like pca, mnf, etc. It is necessary to convert the given data into the right function input format. This works differently for VectorSeries and for DataFrames. Hence we use this function quite often. If only used in normalize, I totally agree with you, that it would be better to move it inside the function.

There are still many part of the code that mention "DocumentTermDF", this might not be necessary, right? i.e pca can be applied to any DataFrame, not only document-term ...

I totally miss those 🤦 but now all unnecessary Document Term mentions should be gone.

pca and nmf not having the same summary sentence (one is missing "on the given input.") [I haven't checked the others]

Those summary sentences should now be the same and the lengths also under 76 everywhere 🤞

As far as I know, we will not have to deal anymore with "RepresentationSeries" as there are no functions that return such object. tfidf and company returns a DataFrame that we can convert into a VectorSeries. Is that right?

That is absolutly right. The representation Series will be removed in the next PR, where we worked on the hero types. #157 🏎️

@mk2510 mk2510 marked this pull request as ready for review September 9, 2020 20:55
Copy link
Collaborator

@henrifroese henrifroese left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just went through everything once more. Will fix Fixed the very small stuff I found. It's now ready to merge in my opinion

texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
texthero/representation.py Outdated Show resolved Hide resolved
@jbesomi
Copy link
Owner

jbesomi commented Sep 12, 2020

Looks almost perfect 😍

I just noticed how we don't have a strict rule for how we define the default value in the docstring. I believe we can stick to :

max_features : int, optional, default=None

Can you please make sure in all functions we write it the same way?

@henrifroese
Copy link
Collaborator

henrifroese commented Sep 12, 2020

Can you please make sure in all functions we write it the same way?

Yes, I'll do that and add the "British/American English" and "number of default components" to CONTRIBUTING.md and change it in all files later.

EDIT: now decided to already do this in representation as this representation version is so different from the master.

@henrifroese
Copy link
Collaborator

henrifroese commented Sep 12, 2020

Just incorporated the suggested changes from the review 🌩️

@henrifroese
Copy link
Collaborator

As we can see, the DF doctest fails in macOS, so I'll skip it again

@jbesomi
Copy link
Owner

jbesomi commented Sep 12, 2020

Ok, see here

@jbesomi
Copy link
Owner

jbesomi commented Sep 12, 2020

Let's go! 🎉 🎉 Good job.

This was referenced Sep 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support "Pandas Series Representation"
3 participants