Change representation_series to DataFrame #156
support MultiIndex as function parameter; returns MultiIndex where Representation was returned * missing: correct test Co-authored-by: Henri Froese <hf2000510@gmail.com>
* missing: adapting tests for new types Co-authored-by: Henri Froese <hf2000510@gmail.com>
Will review soon |
Overall: looks great; nice that we're close to getting this done 🚀 ! General comments that should be addressed:
- The macOS build is failing in Travis. From the log we can see that this is because DocumentTermDF is not printed the same way in macOS consoles. We probably still want to test this somehow, so either (a) look at passing a # doctest: +some_command_to_solve_this directive, or (b) skip it with # doctest: +SKIP and add a unittest instead where we manually compare the series (probably much easier).
- In general: for dimensionality reduction and clustering, as far as I can see we are not testing this at all. Of course we haven't tested it before, but this is probably the best time to add at least one unittest for each function (we're skipping all the doctests at the moment). This should be relatively quick to implement in test_representation.py.
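Option (b) could look roughly like the sketch below: instead of relying on how a DataFrame is printed in the console, the test builds the expected frame by hand and compares values with pandas.testing.assert_frame_equal. The function count_terms is a hypothetical stand-in written for this example, not one of the library's representation functions.

```python
from collections import Counter

import pandas as pd
from pandas.testing import assert_frame_equal


def count_terms(s: pd.Series) -> pd.DataFrame:
    """Hypothetical stand-in for a representation function.

    Returns one row per document and one column per whitespace-separated term.
    """
    # One Counter per document; terms missing from a document become NaN.
    counts = [Counter(doc.split()) for doc in s]
    df = pd.DataFrame(counts, index=s.index).fillna(0).astype("int64")
    # Sort the columns so the layout does not depend on the order of first appearance.
    return df.sort_index(axis=1)


def test_count_terms():
    s = pd.Series(["a b a", "b c"])
    expected = pd.DataFrame({"a": [2, 0], "b": [1, 1], "c": [0, 1]})
    # Compares values, dtypes, index and columns; console formatting plays no role.
    assert_frame_equal(count_terms(s), expected)
```

Because the comparison is on the data itself, the same test should behave identically on Linux and macOS runners.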
I'll address everything I can myself right now (doctests, unittests, ...) |
Was able to address most small comments myself, will do the rest later with @mk2510 |
We have now addressed the remaining issues. We're skipping doctests in representation.py and implemented doctests for every representation function in test_representation.py instead. From our side, this can be merged now @jbesomi 🙏 🚀 🐳 🤞 |
Looks great (even too hard to understand, at least quickly). The main question (and sorry for the late review; will catch up faster now): why do we return a MultiIndex sparse DataFrame? Why not simply a (sparse) DataFrame? This should simplify things a bit and probably is the most natural type users expect (i.e we wrap the scipy sparse matrix on a DF) |
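For reference, wrapping a scipy sparse matrix in a plain (non-MultiIndex) sparse DataFrame might look like the sketch below. It uses pandas' DataFrame.sparse.from_spmatrix; the matrix values, document labels and term names are made up for illustration and are not taken from this PR.

```python
import pandas as pd
from scipy.sparse import csr_matrix

# Illustrative document-term counts: 2 documents, 3 terms (made-up data).
matrix = csr_matrix([[2, 1, 0], [0, 1, 1]])

# Each column becomes a pandas SparseArray, so zeros are not stored densely.
df = pd.DataFrame.sparse.from_spmatrix(
    matrix,
    index=["doc0", "doc1"],
    columns=["a", "b", "c"],
)

# Fraction of entries that are actually stored (non-fill values).
density = df.sparse.density

# The frame can be converted back to a scipy COO matrix when needed.
round_trip = df.sparse.to_coo()
```

Users can then index it like any other DataFrame, for example df.loc["doc0", "a"].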
As far as I know, we will no longer have to deal with "RepresentationSeries", as there are no functions that return such an object. Then, for instance, there are still many parts of the code that mention "DocumentTermDF"; this might not be necessary, right? i.e.
|
However we use
I totally missed those 🤦, but now all unnecessary Document Term mentions should be gone.
Those summary sentences should now be the same and the lengths also under 76 everywhere 🤞
That is absolutely right. The representation Series will be removed in the next PR, where we worked on the hero types. #157 🏎️
Just went through everything once more. Fixed the very small stuff I found. It's now ready to merge, in my opinion.
Looks almost perfect 😍 I just noticed how we don't have a strict rule for how we define the default value in the docstring. I believe we can stick to :
Can you please make sure in all functions we write it the same way? |
Yes, I'll do that and add the "British/American English" and "number of default components" to EDIT: now decided to already do this in representation as this representation version is so different from the master. |
Just incorporated the suggested changes from the review 🌩️ |
As we can see, the DF doctest fails in macOS, so I'll skip it again |
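As a reminder of what skipping does, here is a minimal sketch using only the standard doctest module: an example marked with the SKIP directive is not executed, while the other examples in the same docstring still run. The function describe and its docstring are hypothetical and exist only for this illustration.

```python
import doctest


def describe():
    """Hypothetical function whose first example has platform-dependent output.

    >>> print(1 / 0)  # doctest: +SKIP
    platform-dependent output
    >>> 1 + 1
    2
    """


def run_examples():
    """Run the doctests of describe() and return doctest's summary."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    for test in finder.find(describe, "describe", globs={}):
        runner.run(test)
    return runner.summarize(verbose=False)


results = run_examples()
```

The skipped example would raise ZeroDivisionError if it were executed, so a run without failures shows that it was left out.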
Ok, see here |
Let's go! 🎉 🎉 Good job. |
all functions that previously dealt with the representation series now handle only the DataFrame instead. 🚀
removed functions like flatten, as they are no longer needed
adapted docstrings and tests
-> further stuff to do: