WIP: support FeatureTransforms.jl #44

Closed
wants to merge 13 commits into from

Conversation

@bencottier bencottier commented Apr 7, 2021

Part of #12

Implements is_transformable, apply, apply!, apply_append methods for KeyedDataset.

The scope of my thinking has been mostly limited to one-to-one Transforms. LinearCombination won't work yet. Also, AbstractScaling is inconvenient and could have its own special method (see further below).

Some further comments to discuss or note are in the diff.


Scaling is currently complicated (I might be missing an easier way to access the data, but nonetheless):

julia> only(ds(:train, :price).data)
(:train, :price) => [-2.0 4.0; 3.0 2.0; -1.0 -1.0]

julia> scaling = MeanStdScaling(only(ds(:train, :price).data)[2]; dims=:id, inds=[2]);

julia> r = scaling(ds; dims=:id, inds=[2])
KeyedDataset with:
  2 components
    (:train, :price) => 3x1 KeyedArray{Float64} with dimension time[1], id[2]
    (:predict, :price) => 3x1 KeyedArray{Float64} with dimension time[1], id[2]
  2 constraints
    [1] (:__, :time)  3-element UnitRange{Int64}
    [2] (:__, :id)  1-element Vector{Symbol}

julia> only(r(:train, :price).data)[2]
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
   time  3-element UnitRange{Int64}
   id  1-element Vector{Symbol}
And data, 3×1 Matrix{Float64}:
       (:b)
 (1)   0.9271726499455306
 (2)   0.13245323570650436
 (3)  -1.0596258856520353
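
For reference, the scaled values printed above are consistent with standardising the second `id` column by its mean and sample standard deviation. The following is only a minimal sketch on a plain matrix using the `Statistics` standard library; it does not use `MeanStdScaling` or `KeyedDataset`, and the variable names are illustrative.

```julia
using Statistics

# Values of the (:train, :price) component shown above; rows are `time`, columns are `id`.
prices = [-2.0 4.0; 3.0 2.0; -1.0 -1.0]

# inds=[2] along the `id` dimension corresponds to the second column here.
col = prices[:, 2]

# Standardise: subtract the mean, then divide by the sample standard deviation.
μ, σ = mean(col), std(col)
scaled = (col .- μ) ./ σ
```

Here `std` uses the default corrected (n - 1) estimator, which is what reproduces the printed numbers.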

In addition (or as an alternative) to implementing MeanStdScaling(::KeyedDataset; dims), addressing the issues below would help improve the above:

invenia/FeatureTransforms.jl#59
invenia/FeatureTransforms.jl#56

codecov bot commented Apr 7, 2021

Codecov Report

Merging #44 (6a26438) into main (d39b647) will decrease coverage by 1.98%.
The diff coverage is 92.00%.

@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
- Coverage   98.88%   96.90%   -1.99%     
==========================================
  Files           7        8       +1     
  Lines         270      291      +21     
==========================================
+ Hits          267      282      +15     
- Misses          3        9       +6     
Impacted Files             Coverage Δ
src/AxisSets.jl            25.00% <ø> (-50.00%) ⬇️
src/featuretransforms.jl   90.47% <90.47%> (ø)
src/impute.jl              100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d39b647...6a26438. Read the comment docs.

Member

glennmoy commented Apr 7, 2021

Is this roughly what we want apply(::KeyedDataset) to do?

Pretty much. Setting the awkwardness of MeanStdScaling aside I think this is a pretty good start.

In this context, does it make sense to allow :_ at the end of a Pattern, translating to dims=: in FeatureTransforms.apply? Currently, allowing Colon() at the end of a Pattern wouldn't work this way; it wouldn't match any dimspaths.

I think this makes sense, given that Patterns are the expected interface with KeyedDatasets. dims=: only "makes sense" in the context of arrays.

How should LinearCombination work? Suppose we wanted to append the result to the dataset - will there be problems with constraints?

Intuitively, I would imagine LinearCombination as acting over a collection of component arrays. Although this is very unlikely to be invoked, operations like this should follow the same principles we've set up in FeatureTransforms. In this instance you'd have to write a specific apply method for it 👎

However, I'll note that the apply methods in FeatureTransforms could be made more generic if we used traits (invenia/FeatureTransforms.jl#75), so maybe we can hold off on supporting LinearCombination directly until we get those in place? I have an idea of what it involves; I just need to write it up.

Impute = "0.6"
NamedDims = "0.2"
OrderedCollections = "1"
ReadOnlyArrays = "0.1"
-julia = "1.3"
+julia = "1.5"
Author

Comply with FeatureTransforms compat

Member

Kind of annoying that FeatureTransforms only supports 1.5, but I guess all our packages should support it anyways?

Comment on lines +136 to +139
    if inner  # batched apply_append on each component
        return map(ds, patterns...) do a
            FeatureTransforms.apply_append(a, t; dims=dims, kwargs...)
        end
Author

Just thought this could be handy and worth including.

Member

I could see this being useful, but I'm not sure we have enough use-cases yet. That being said, it wouldn't be too hard to deprecate if it isn't useful.

Member

I'm not sure how useful this is right away either, and it might complicate the code and tests if we have to support both batch=true/false.

Given the implementation is rather easy (just a map over the components), it should be straightforward for users to do it themselves, so we can hold off doing it here until we know it's worth doing.
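
To illustrate how small the do-it-yourself version could be, here is a rough sketch. It assumes a plain `Dict` of matrices as a stand-in for a `KeyedDataset` and a made-up `square` transform; neither is the AxisSets or FeatureTransforms API, and the `(:predict, :price)` values are invented for the example.

```julia
# Stand-in for a KeyedDataset (an assumption): component name => plain matrix.
components = Dict(
    (:train, :price) => [-2.0 4.0; 3.0 2.0; -1.0 -1.0],
    (:predict, :price) => [0.5 -1.0; -5.0 -2.0; 0.0 1.0],
)

# Hypothetical element-wise transform.
square(a) = a .^ 2

# "Batched" append: every component gets its own transformed columns appended.
appended = Dict(k => hcat(a, square(a)) for (k, a) in components)
```

Each component keeps its original columns and gains the transformed ones, which is the per-component behaviour the `inner` branch describes.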

selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns))

# construct keys of new transformed components
new_keys = [(k[1:end-1]..., component_name) for k in selected]
Author

By using a single component name this assumes, but does not enforce, that there is only one kind of component being transformed e.g. :price. It could still be multiple components e.g. (:train, :price) and (:predict, :price).

But we also discussed the idea of passing in a full dimspath, which I'm open to.

Member

Might be good to throw an argument error if that assumption doesn't hold for now. Passing multiple dimpaths does seem noisy, and hard to justify without use-cases.

Author

So the condition would be that the last part of each dimpath is the same? Off the top of my head:

length(unique([dpath[end] for dpath in selected])) == 1

Member

Yeah, though that could probably be simplified to only(last.(selected))?
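
One caveat worth sketching: `Base.only` throws an `ArgumentError` unless the collection has exactly one element, so with two components that share a name (as in the `:train`/`:predict` case) the names would need deduplicating first. A minimal sketch, assuming `selected` holds tuples whose last entry is the component name; the values below are illustrative, not taken from the PR.

```julia
# Hypothetical selection: dimpaths with the trailing dimension name dropped.
selected = [(:train, :price), (:predict, :price)]

# Deduplicate the trailing names, then require exactly one distinct name.
component = only(unique(last.(selected)))

# A selection that mixes component names fails the check with an ArgumentError.
mixed = [(:train, :price), (:train, :load)]
failed = try
    only(unique(last.(mixed)))
    false
catch err
    err isa ArgumentError
end
```

Letting `only` raise the error keeps the check to one line, at the cost of a generic error message.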

Member

this assumption is fine IMO

_pattern(dims::Pattern) = dims
_pattern(dims::Tuple) = Pattern(dims)
_pattern(dims) = Pattern(:__, dims)
_impute_pattern(dims::Pattern) = dims
Author

Having to distinguish _impute_pattern and _transform_pattern could be a code smell? But there is a difference in what dims means for Impute vs. FT.

_transform_pattern is closer to what mapslices does:

function Base.mapslices(f::Function, ds::KeyedDataset, keys...; dims)
    patterns = if isempty(keys)
        dims isa Symbol ? Pattern[(:__, dims)] : Pattern[(:__, d) for d in dims]
    else
        Pattern[keys...]
    end

Ideally we'd standardise how to handle patterns/dims everywhere.
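
As a rough idea of what a shared helper might look like, the sketch below normalises `keys` and `dims` into a list of patterns in one place. It models patterns as plain tuples rather than the real `Pattern` type, and `normalise_patterns` is a hypothetical name, so treat it as an illustration of the idea only.

```julia
# Hypothetical helper: patterns are modelled as plain tuples, with :__ standing
# in for "any leading path segments" as in the snippets above.
normalise_patterns(keys::Tuple, dims::Symbol) =
    isempty(keys) ? [(:__, dims)] : collect(keys)
normalise_patterns(keys::Tuple, dims) =
    isempty(keys) ? [(:__, d) for d in dims] : collect(keys)

by_dim = normalise_patterns((), :time)           # a single dimension name
by_dims = normalise_patterns((), (:time, :id))   # several dimension names
by_key = normalise_patterns(((:train, :price, :_),), :id)  # explicit keys win
```

Explicit keys take precedence over `dims`, mirroring the `isempty(keys)` branch in the `mapslices` excerpt.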

Member

Yeah, maybe add a comment to invenia/Impute.jl#66? If we're gonna change that in Impute.jl it'd be good to do that before a 1.0 release?

@@ -0,0 +1,154 @@
FeatureTransforms.is_transformable(::KeyedDataset) = true

_transform_pattern(keys, dims) = isempty(keys) ? _transform_pattern(dims) : Pattern[keys...]
Author

See comment in impute.jl

selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns))

# construct keys of new transformed components
new_keys = [(k[1:end-1]..., component_name) for k in selected]
Author

Maybe shouldn't call it keys to avoid confusion with KeyedArray keys. keys means dimspaths here.

Member

Yeah, I'd probably use a variable like _dimpaths or dpaths.

Member

@rofinn rofinn left a comment

Seems like a reasonable start. I think even if we can't support the full FeatureTransforms.jl API this might be enough to gather more use-cases.

@@ -88,5 +89,6 @@ include("dataset.jl")
include("indexing.jl")
include("functions.jl")
include("impute.jl")
Member

#46

(:train, :price2) => [4.0 16.0; 9.0 4.0; 1.0 1.0]
(:predict, :price2) => [0.25 1.0; 25.0 4.0; 0.0 1.0]
```
"""
Member

These docstrings are a bit verbose (i.e., several sentences are duplicated between them). Could we simplify the apply_append and append docstrings to reference the apply! method?

Author

Yeah that's a good idea

The transform can be applied to a subselection of components via a [`Pattern`](@ref) `key`.
Otherwise, components are selected by the desired `dims`.

If `inner=true`, perform `FeatureTransforms.apply_append` on each component,
Member

I'm not sure inner is the right term for this as it feels like it overlaps with things like inner joins. Maybe batch would be more appropriate?

@test is_transformable(ds)
end

# TODO: use fake Transforms
Member

What are fake transforms?

Author

Forthcoming as part of test utils in FeatureTransforms. See this POC PR https://github.com/invenia/FeatureTransforms.jl/pull/77/files#diff-4c5e126be8af5fe14f9784e4cedac0f729e29553a4fc76c5ca47fbd1c7e0a4d8R1
(Glenn is breaking it up into multiple PRs at the moment)

    ds::KeyedDataset, t::Transform, keys...;
    dims=:, kwargs...
)
    return map(ds, _transform_pattern(keys, dims)...) do a
Member

Should this be a for-loop, with the output being just return ds? Otherwise, doesn't it only return the components that were affected?
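
A sketch of the for-loop alternative, under the assumption that a plain `Dict` of matrices stands in for the `KeyedDataset`; `apply_selected!` is a hypothetical name and not part of AxisSets or FeatureTransforms. Only the selected components are mutated, but the whole container is returned.

```julia
# Stand-in for a KeyedDataset (an assumption): component name => matrix.
ds = Dict(
    (:train, :price) => [-2.0 4.0; 3.0 2.0],
    (:train, :load) => [1.0 2.0; 3.0 4.0],
)

# Hypothetical in-place apply: mutate only the selected components, return everything.
function apply_selected!(f, ds::Dict, selected)
    for k in selected
        ds[k] .= f.(ds[k])
    end
    return ds
end

result = apply_selected!(abs, ds, [(:train, :price)])
```

Returning `ds` itself means untouched components such as `(:train, :load)` remain available to the caller.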

@test !isequal(ds, expected)
end

@testset "inds" begin
Member

I'm not sure if we have to test inds here as that's part of the FeatureTransforms implementation. It's not affected directly in this package so we should just be able to trust that it works downstream.

@test !isequal(ds, expected)
end

@testset "outer" begin
Member

Case in point on inner vs. outer: a whole other testset to keep on top of.

@@ -0,0 +1,154 @@
FeatureTransforms.is_transformable(::KeyedDataset) = true
Member

This is now automatic when an apply method is defined for a type

https://github.com/invenia/FeatureTransforms.jl/releases/tag/v0.3.4

Member

rofinn commented May 12, 2021

Closing as I believe #50 covers this now? Feel free to open if there's still functionality here we want to be included.

@rofinn rofinn closed this May 12, 2021
Author

Closing as I believe #50 covers this now? Feel free to open if there's still functionality here we want to be included.

Yep this is fine, for #53 we should make a new PR rather than reopen, but some code here may be useful to copy.
