Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Data class for data pre-processing #2727

Closed
wants to merge 9 commits into from

Conversation

heisenbuug
Copy link
Contributor

As discussed in #2722, I implemented the data class. Describe method works only on numeric attribute types.

.info() will give output:
info

.describe() will give output:
describe

@heisenbuug heisenbuug changed the title Data class Adding Data class for data pre-processing Nov 22, 2020
Copy link
Contributor

@rxng8 rxng8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this looks good. The only thing I notice is that should we have one file for header and one file for header implementation?

@rxng8
Copy link
Contributor

rxng8 commented Nov 22, 2020

Referring to #2715, I think one should also notice the user if they cannot load the data. I.e :

if (!Data::Load()):
     FAIL(<message>)

@heisenbuug
Copy link
Contributor Author

I think this looks good. The only thing I notice is that should we have one file for header and one file for header implementation?

Yea, I am planning to convert it into two files, this way we can maintain consistency over styling.

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @heisenbuug, thanks for taking the initiative and implementing this! It actually looks like what you are implementing is a generic dataframe-like class... which is great! Do you think that we could use some of the existing functionality we have with data::DatasetInfo? For instance, consider adding a constructor that can take an already-constructed matrix and DatasetInfo.

Also, do you think you could add some tests and fix the style issues? We don't need to check the exact format of the output (since we may change how things are printed later), but we should at least check that the various operations work and compile correctly.

I know I am writing a lot; sorry about that. This is the beginning of some pretty important support for arbitrary datatypes, so I want to make sure we get it started while thinking about all the eventual uses we might have. :) I could see this being the precursor of a separate C++ dataframe library, for instance. Do you think we could use support in this class to replace the implenentations inside the bindings in methods/preprocess/?

src/mlpack/core/data/pre_processing.hpp Show resolved Hide resolved

#include <mlpack/core.hpp>
#include <fstream>
#include <vector>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these two are already included by core.hpp. Also, if you can, try to include only mlpack/prereqs.hpp since it will include fewer other files and help keep compilation fast. 👍

src/mlpack/core/data/pre_processing.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/pre_processing.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/pre_processing_impl.hpp Outdated Show resolved Hide resolved
src/mlpack/core/data/pre_processing_impl.hpp Outdated Show resolved Hide resolved
void Data:: info()
{
// Returns Index range, list of headers
std::cout << typeid(data).name() << std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't guaranteed to be a readable description of the name---it might be tedious, but we should probably find another solution here.

src/mlpack/core/data/pre_processing_impl.hpp Outdated Show resolved Hide resolved

std::cout.setf(std::ios::fixed);
std::cout.width(5);
std::cout << "Count: " << std::endl << stats.count() << std::endl;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should prefix the values with the name of the dimension?

@heisenbuug
Copy link
Contributor Author

Hello @rcurtin, thank you for the detailed look. I noticed many of the mistakes myself and I am working on them.

I know I am writing a lot; sorry about that. This is the beginning of some pretty important support for arbitrary datatypes, so I want to make sure we get it started while thinking about all the eventual uses we might have. :) I could see this being the precursor of a separate C++ dataframe library, for instance. Do you think we could use support in this class to replace the implenentations inside the bindings in methods/preprocess/?

That's what my plan is, to keep adding the functionalities once we get the basic implementation ready.
I will soon make a PR for this which will include all the changes and some other functionalities that I have added.

Also, I have some concerns regarding the implementation of certain features, if you can look at them, shall I add it to the #2722
or maybe we can use IRC for that?

@rcurtin
Copy link
Member

rcurtin commented Dec 1, 2020

Up to you, maybe easier to add to #2722 so that it's all in one place 👍

Copy link
Member

@rcurtin rcurtin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @heisenbuug, I took a quick look. Hopefully my comments are helpful. I didn't look too deeply (yet) because I still think there are some high level design questions to be addressed. 👍 More details in my comments. Thanks!

* http://www.opensource.org/licenses/BSD-3-Clause for more information.
*/
#ifndef MLPACK_CORE_DATA_DATA_PRE_PROCESSING_HPP
#define MLPACK_CORE_DATA_DATA_PRE_PROCESSING_HPP
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you've doubled DATA in the macro name here. :)

* @param numericAttributes Bool array to indicate numeric attributes
* @param numericData Arma mat of numeric attributes only
*/
class Data {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So my primary suggestion here is that this class is basically storing data and the set of dimensions that are categorical. But we already have basically this in the data::DatasetInfo class. How can we redo this code so that it fits (or adapts) that existing support that we already have? I don't want to end up with two very different ways of achieving the same support---that makes the library feel schizophrenic and not unified. 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heyy, first of all, sorry for the late reply and not following up, okay, I will try and redo the code so that it can fit here...

@mlpack-bot
Copy link

mlpack-bot bot commented Feb 4, 2021

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added s: stale and removed s: stale labels Feb 4, 2021
@heisenbuug
Copy link
Contributor Author

@rcurtin, we have two ways to load data...
One that is implemented in mlpack::data::Load and other one is arma::mat.load()
We already have two implementations to do same thing, how should I implement new dataframe-like class then? Should it have it's own load method or something else, any suggestions?

@rcurtin
Copy link
Member

rcurtin commented Mar 6, 2021

We generally don't use arma::mat.load() inside of mlpack. All data loading is done via data::Load(), and when we are loading categorical data, we use the overload that also takes a data::DatasetInfo (which encodes whether a dimension is categorical or numeric, and any mappings that are used to represent the categories as values inside an Armadillo matrix). My suggestion would be that if we are going to add a dataframe-type class, it should take over all the functionality that's currently being provided by passing around both an arma::mat and a data::DatasetInfo.

@mlpack-bot
Copy link

mlpack-bot bot commented Apr 5, 2021

This issue has been automatically marked as stale because it has not had any recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions! 👍

@mlpack-bot mlpack-bot bot added the s: stale label Apr 5, 2021
@mlpack-bot mlpack-bot bot closed this Apr 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants