-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: support FeatureTransforms.jl #44
Conversation
Codecov Report
@@ Coverage Diff @@
## main #44 +/- ##
==========================================
- Coverage 98.88% 96.90% -1.99%
==========================================
Files 7 8 +1
Lines 270 291 +21
==========================================
+ Hits 267 282 +15
- Misses 3 9 +6
Continue to review full report at Codecov.
|
Pretty much. Setting the awkwardness of
I think this makes sense, given that
Intuitively, I would imagine However, I'll note that the |
Also update docstrings of apply and _apply_paths
- Map simplifies things - don't need the custom _apply_paths() anymore - Returning full dataset seems more appropriate after talking to Rory
Impute = "0.6" | ||
NamedDims = "0.2" | ||
OrderedCollections = "1" | ||
ReadOnlyArrays = "0.1" | ||
julia = "1.3" | ||
julia = "1.5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comply with FeatureTransforms compat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of annoying that FeatureTransforms only supports 1.5, but I guess all our packages should support it anyways?
if inner # batched apply_append on each component | ||
return map(ds, patterns...) do a | ||
FeatureTransforms.apply_append(a, t; dims=dims, kwargs...) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just thought this could be handy and worth including.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see this being useful, but I'm not sure we have enough use-cases yet. That being said, it wouldn't be too hard to deprecated if it isn't useful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how useful this is right away either, and it might complicate the code and tests if we have to support both batch=true/false
.
given the implementation is rather easy (just a map
over the components) it should be straightforward for users to do it themselves and hold off doing it here until we know it's worth doing.
selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns)) | ||
|
||
# construct keys of new transformed components | ||
new_keys = [(k[1:end-1]..., component_name) for k in selected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By using a single component name this assumes, but does not enforce, that there is only one kind of component being transformed e.g. :price
. It could still be multiple components e.g. (:train, :price)
and (:predict, :price)
.
But we also discussed the idea of passing in a full dimspath, which I'm open to.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to throw an argument error if that assumption doesn't hold for now. Passing multiple dimpaths does seem noisy, and hard to justify without use-cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the condition would be that the last part of each dimpath is the same? Off the top of my head:
length(unique([dpath[end] for dpath in selected])) == 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, though that could probably be simplified to only(last.(selected))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this assumption is fine IMO
_pattern(dims::Pattern) = dims | ||
_pattern(dims::Tuple) = Pattern(dims) | ||
_pattern(dims) = Pattern(:__, dims) | ||
_impute_pattern(dims::Pattern) = dims |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having to distinguish _impute_pattern
and _transform_pattern
could be a code smell? But there is a difference in what dims
means for Impute vs. FT.
_transform_pattern
is closer to what mapslices
does:
Lines 79 to 84 in d39b647
function Base.mapslices(f::Function, ds::KeyedDataset, keys...; dims) | |
patterns = if isempty(keys) | |
dims isa Symbol ? Pattern[(:__, dims)] : Pattern[(:__, d) for d in dims] | |
else | |
Pattern[keys...] | |
end |
Ideally we'd standardise how to handle patterns/dims everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe add a comment to invenia/Impute.jl#66? If we're gonna change that in Impute.jl it'd be good to do that before a 1.0 release?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -0,0 +1,154 @@ | |||
FeatureTransforms.is_transformable(::KeyedDataset) = true | |||
|
|||
_transform_pattern(keys, dims) = isempty(keys) ? _transform_pattern(dims) : Pattern[keys...] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment in impute.jl
selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns)) | ||
|
||
# construct keys of new transformed components | ||
new_keys = [(k[1:end-1]..., component_name) for k in selected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe shouldn't call it keys
to avoid confusion with KeyedArray keys. keys means dimspaths here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd probably use a variable like _dimpaths
or dpaths
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a reasonable start. I think even if we can't support the full FeatureTransforms.jl API this might be enough to gather more use-cases.
Impute = "0.6" | ||
NamedDims = "0.2" | ||
OrderedCollections = "1" | ||
ReadOnlyArrays = "0.1" | ||
julia = "1.3" | ||
julia = "1.5" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kind of annoying that FeatureTransforms only supports 1.5, but I guess all our packages should support it anyways?
@@ -88,5 +89,6 @@ include("dataset.jl") | |||
include("indexing.jl") | |||
include("functions.jl") | |||
include("impute.jl") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(:train, :price2) => [4.0 16.0; 9.0 4.0; 1.0 1.0] | ||
(:predict, :price2) => [0.25 1.0; 25.0 4.0; 0.0 1.0] | ||
``` | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docstrings are a bit verbose (ie: several duplicate sentences between them). Could we simplify the apply_append
and append
docstrings to reference the apply!
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that's a good idea
The transform can be applied to a subselection of components via a [`Pattern`](@ref) `key`. | ||
Otherwise, components are selected by the desired `dims`. | ||
|
||
If `inner=true`, perform `FeatureTransforms.apply_append` on each component, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure inner
is the right term for this as it feels like it overlaps with things like inner joins. Maybe batch
would be more appropriate?
if inner # batched apply_append on each component | ||
return map(ds, patterns...) do a | ||
FeatureTransforms.apply_append(a, t; dims=dims, kwargs...) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could see this being useful, but I'm not sure we have enough use-cases yet. That being said, it wouldn't be too hard to deprecated if it isn't useful.
selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns)) | ||
|
||
# construct keys of new transformed components | ||
new_keys = [(k[1:end-1]..., component_name) for k in selected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be good to throw an argument error if that assumption doesn't hold for now. Passing multiple dimpaths does seem noisy, and hard to justify without use-cases.
selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns)) | ||
|
||
# construct keys of new transformed components | ||
new_keys = [(k[1:end-1]..., component_name) for k in selected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I'd probably use a variable like _dimpaths
or dpaths
.
_pattern(dims::Pattern) = dims | ||
_pattern(dims::Tuple) = Pattern(dims) | ||
_pattern(dims) = Pattern(:__, dims) | ||
_impute_pattern(dims::Pattern) = dims |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, maybe add a comment to invenia/Impute.jl#66? If we're gonna change that in Impute.jl it'd be good to do that before a 1.0 release?
@test is_transformable(ds) | ||
end | ||
|
||
# TODO: use fake Transforms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are fake transforms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forthcoming as part of test utils in in FeatureTransforms. See this POC PR https://github.com/invenia/FeatureTransforms.jl/pull/77/files#diff-4c5e126be8af5fe14f9784e4cedac0f729e29553a4fc76c5ca47fbd1c7e0a4d8R1
(Glenn is breaking up into multiple PRs at the moment)
ds::KeyedDataset, t::Transform, keys...; | ||
dims=:, kwargs... | ||
) | ||
return map(ds, _transform_pattern(keys, dims)...) do a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be a for-loop and the output be just return ds
? otherwise doesn't it just return the components that were affected?
selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns)) | ||
|
||
# construct keys of new transformed components | ||
new_keys = [(k[1:end-1]..., component_name) for k in selected] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this assumption is fine IMO
if inner # batched apply_append on each component | ||
return map(ds, patterns...) do a | ||
FeatureTransforms.apply_append(a, t; dims=dims, kwargs...) | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how useful this is right away either, and it might complicate the code and tests if we have to support both batch=true/false
.
given the implementation is rather easy (just a map
over the components) it should be straightforward for users to do it themselves and hold off doing it here until we know it's worth doing.
@test !isequal(ds, expected) | ||
end | ||
|
||
@testset "inds" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if we have to test inds here as that's part of the FeatureTransforms implementation. It's not affected directly in this package so we should just be able to trust that it works downstream.
@test !isequal(ds, expected) | ||
end | ||
|
||
@testset "outer" begin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case in point on inner
vs outer
, a whole other testset to keep on top of
@@ -0,0 +1,154 @@ | |||
FeatureTransforms.is_transformable(::KeyedDataset) = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now automatic when an apply
method is defined for a type
https://github.com/invenia/FeatureTransforms.jl/releases/tag/v0.3.4
Closing as I believe #50 covers this now? Feel free to open if there's still functionality here we want to be included. |
Part of #12
Implements
is_transformable
,apply
,apply!
,apply_append
methods forKeyedDataset
.The scope of my thinking has been mostly limited to one-to-one
Transform
s.LinearCombination
won't work yet. Also,AbstractScaling
is inconvenient and could have its own special method (see further below).Some further comments to discuss or note are in the diff.
Scaling is currently complicated (I might be missing an easier way to access the data, but nonetheless):
In addition (or alternative) to implementing
MeanStdScaling(::KeyedDataset; dims)
, addressing the below issues would help improve the above:invenia/FeatureTransforms.jl#59
invenia/FeatureTransforms.jl#56