Make it easier to work with missing data #564

Closed
ryancerf opened this issue Aug 4, 2019 · 6 comments
Labels
core in core sub-project

Comments

@ryancerf
Collaborator

ryancerf commented Aug 4, 2019

Useful for working with missing data. Similar to pandas fillna, but for a single column at a time.

Usage:

myDoubleColumn.fillMissingWith(1.0);

I am not sure what the easiest way to do this is right now.

Will send a PR if we think this is worthwhile.
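To make the proposed behaviour concrete, here is a minimal sketch in plain Java. It does not use Tablesaw: `MissingFillSketch` and its static `fillMissingWith` helper are hypothetical, and treating `NaN` as the "missing" marker for doubles is an assumption made for illustration.

```java
import java.util.Arrays;

// Hypothetical stand-in for a double column; this is not Tablesaw's DoubleColumn.
final class MissingFillSketch {

    // Replaces every NaN (assumed here to mean "missing") with the given value, in place.
    static double[] fillMissingWith(double[] values, double replacement) {
        for (int i = 0; i < values.length; i++) {
            if (Double.isNaN(values[i])) {
                values[i] = replacement;
            }
        }
        return values;
    }

    public static void main(String[] args) {
        double[] column = {2.5, Double.NaN, 4.0, Double.NaN};
        // prints [2.5, 1.0, 4.0, 1.0]
        System.out.println(Arrays.toString(fillMissingWith(column, 1.0)));
    }
}
```

Only the missing cells change; values that are already present are left alone.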

@benmccann
Collaborator

You should be able to do that today with col.set(NumberPredicates.isMissing, 1.0). This is just off the top of my head, so the syntax might be slightly off.

ryancerf changed the title from "add fillMissingWith method to Column" to "Make it easier to work with missing data" on Aug 4, 2019
@ryancerf
Collaborator Author

ryancerf commented Aug 4, 2019

Another method that might be useful for working with missing data is a getOrDefault method on Row.

// Proposed
row.getLongOrDefault("col1", 1L);

It could simplify iterating over a table with missing data.

    // Current
    for (Row row : table) {
        long value1 = row.getLong("col1");
        int value2 = row.getInt("col2");
        value1 = LongColumn.valueIsMissing(value1) ? 1L : value1;
        value2 = IntColumn.valueIsMissing(value2) ? 1 : value2;

        // do something with value1 and value2
    }

    // Proposed
    for (Row row : table) {
        long value1 = row.getLongOrDefault("col1", 1L);
        int value2 = row.getIntOrDefault("col2", 1);
        // do something with value1 and value2
    }
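One way the proposed accessors could behave is sketched below in plain Java. Everything here is hypothetical: `OrDefaultSketch`, `longOrDefault`, and `intOrDefault` are illustrative names, and the sentinel constants are assumed stand-ins for the library's missing-value indicators rather than values taken from this thread.

```java
// Hypothetical helpers illustrating the "get or default" idea; not part of Row.
final class OrDefaultSketch {

    // Assumed sentinels standing in for the missing-value indicators.
    static final long MISSING_LONG = Long.MIN_VALUE;
    static final int MISSING_INT = Integer.MIN_VALUE;

    // Returns defaultValue when the cell holds the missing sentinel.
    static long longOrDefault(long value, long defaultValue) {
        return value == MISSING_LONG ? defaultValue : value;
    }

    static int intOrDefault(int value, int defaultValue) {
        return value == MISSING_INT ? defaultValue : value;
    }

    public static void main(String[] args) {
        long[] col1 = {10L, MISSING_LONG, 30L};
        int[] col2 = {MISSING_INT, 2, 3};
        for (int r = 0; r < col1.length; r++) {
            long value1 = longOrDefault(col1[r], 1L);
            int value2 = intOrDefault(col2[r], 1);
            System.out.println(value1 + " " + value2);
        }
    }
}
```

The missing check and the substitution collapse into one call per cell, which is what removes the two extra lines from the "Current" loop.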

@lwhite1
Collaborator

lwhite1 commented Aug 4, 2019

I have mixed feelings on whether it's worth having a new method.
As @benmccann points out, it's not too hard.

col.set(col.isMissing(), 1.0);
will do it.

On the other hand, there's a reason why they made it easier in pandas. I feel like it's customary in data analysis tools and libraries to provide very good support for missing values. I would not be averse to:

Column<?>::setMissing(? value)...
I wouldn't call it "fill" though, I think fill in Tablesaw is more likely to refer to a series of appends than a series of updates. Update methods generally start with set.

@ryancerf
Collaborator Author

ryancerf commented Aug 4, 2019

Using set is fine with me, but I would like to point out we do have a fillWith method on DoubleColumn (I am not sure I like this method. map is a good alternative).

    // This method was added recently.
    @Override
    public DoubleColumn fillWith(double d) {
        for (int r = 0; r < size(); r++) {
            set(r, d);
        }
        return this;
    }

lwhite1 added the "core" (in core sub-project) label on Aug 4, 2019
@lwhite1
Collaborator

lwhite1 commented Aug 6, 2019

@ryancerf You're right about fillWith. I forgot how it was implemented.

I think the intent there was to populate new columns, with every value set by this one method. Although it uses set() instead of append, I think the intent is the same.

I do think set() is more appropriate for a method that is being selective about what it's updating, where fill suggests more of a bulk/batch process.
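The distinction can be sketched with a toy column in plain Java. `SetVersusFillSketch` is hypothetical and is not the Tablesaw implementation; using a `BitSet` as the row selection and `NaN` as the missing marker are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.BitSet;

// Toy column contrasting a selective "set" with a bulk "fill".
final class SetVersusFillSketch {
    private final double[] data;

    SetVersusFillSketch(double... data) {
        this.data = data.clone();
    }

    // Selection of the rows whose value is missing (NaN in this sketch).
    BitSet isMissing() {
        BitSet rows = new BitSet(data.length);
        for (int r = 0; r < data.length; r++) {
            if (Double.isNaN(data[r])) {
                rows.set(r);
            }
        }
        return rows;
    }

    // Selective update: only the selected rows are overwritten.
    SetVersusFillSketch set(BitSet rows, double value) {
        for (int r = rows.nextSetBit(0); r >= 0; r = rows.nextSetBit(r + 1)) {
            data[r] = value;
        }
        return this;
    }

    // Convenience form of set(isMissing(), value).
    SetVersusFillSketch setMissing(double value) {
        return set(isMissing(), value);
    }

    // Bulk update: every row is overwritten, whether or not it was missing.
    SetVersusFillSketch fillWith(double value) {
        Arrays.fill(data, value);
        return this;
    }

    double[] toArray() {
        return data.clone();
    }

    public static void main(String[] args) {
        double[] raw = {2.5, Double.NaN, 4.0};
        // prints [2.5, 1.0, 4.0]
        System.out.println(Arrays.toString(new SetVersusFillSketch(raw).setMissing(1.0).toArray()));
        // prints [1.0, 1.0, 1.0]
        System.out.println(Arrays.toString(new SetVersusFillSketch(raw).fillWith(1.0).toArray()));
    }
}
```

In this sketch `setMissing` is just a shorthand for the existing selection-plus-set idiom, while `fillWith` discards whatever the column held before.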

lwhite1 added a commit that referenced this issue Aug 6, 2019
plus test, plus fix for a bug in a DoubleColumn.create()
lwhite1 added a commit that referenced this issue Aug 6, 2019
* Fix for Make it easier to work with missing data #564

plus test, plus fix for a bug in a DoubleColumn.create()

* Made create method safe for null values in input data

* made append methods that take Objects handle null by adding the missing value indicator

* Updated DoubleColumn and IntColumn to take advantage of improved append method
@lwhite1
Collaborator

lwhite1 commented Aug 6, 2019

closing. @ryancerf please lmk if you have issues with the resolution.

3 participants