WIP: support FeatureTransforms.jl #44

bencottier · 2021-04-07T13:51:17Z

Part of #12

Implements is_transformable, apply, apply!, apply_append methods for KeyedDataset.

The scope of my thinking has been mostly limited to one-to-one Transforms. LinearCombination won't work yet. Also, AbstractScaling is inconvenient and could have its own special method (see further below).

Some further comments to discuss or note are in the diff.

Scaling is currently complicated (I might be missing an easier way to access the data, but nonetheless):

julia> only(ds(:train, :price).data)
(:train, :price) => [-2.0 4.0; 3.0 2.0; -1.0 -1.0]

julia> scaling = MeanStdScaling(only(ds(:train, :price).data)[2]; dims=:id, inds=[2]);

julia> r = scaling(ds; dims=:id, inds=[2])
KeyedDataset with:
  2 components
    (:train, :price) => 3x1 KeyedArray{Float64} with dimension time[1], id[2]
    (:predict, :price) => 3x1 KeyedArray{Float64} with dimension time[1], id[2]
  2 constraints
    [1] (:__, :time) ∈ 3-element UnitRange{Int64}
    [2] (:__, :id) ∈ 1-element Vector{Symbol}

julia> only(r(:train, :price).data)[2]
2-dimensional KeyedArray(NamedDimsArray(...)) with keys:
↓   time ∈ 3-element UnitRange{Int64}
→   id ∈ 1-element Vector{Symbol}
And data, 3×1 Matrix{Float64}:
        (:b)
 (1)   0.9271726499455306
 (2)   0.13245323570650436
 (3)  -1.0596258856520353
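
As a sanity check on the numbers above, here is a minimal sketch that redoes the scaling by hand with only the Statistics standard library. It assumes the statistics are taken over the second column of the (:train, :price) data alone, and that the sample (n - 1) standard deviation is used; both assumptions are inferred from the printed values, not from the FeatureTransforms source.

```julia
using Statistics

# Second column of the (:train, :price) data printed above
# ([-2.0 4.0; 3.0 2.0; -1.0 -1.0]), i.e. the column labelled :b.
x = [4.0, 2.0, -1.0]

# Mean/std scaling by hand: subtract the mean, then divide by the
# standard deviation. `std` uses the corrected (n - 1) estimator by default.
scaled = (x .- mean(x)) ./ std(x)

println(scaled)
```

The result agrees with the three values in the KeyedArray output to within floating-point rounding, so the awkward part is only how the data is accessed, not the arithmetic.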

In addition (or as an alternative) to implementing MeanStdScaling(::KeyedDataset; dims), addressing the issues below would help simplify the above:

invenia/FeatureTransforms.jl#59
invenia/FeatureTransforms.jl#56

codecov · 2021-04-07T13:55:28Z

Codecov Report

Merging #44 (6a26438) into main (d39b647) will decrease coverage by 1.98%.
The diff coverage is 92.00%.

@@            Coverage Diff             @@
##             main      #44      +/-   ##
==========================================
- Coverage   98.88%   96.90%   -1.99%     
==========================================
  Files           7        8       +1     
  Lines         270      291      +21     
==========================================
+ Hits          267      282      +15     
- Misses          3        9       +6

Impacted Files	Coverage Δ
src/AxisSets.jl	`25.00% <ø> (-50.00%)`	⬇️
src/featuretransforms.jl	`90.47% <90.47%> (ø)`
src/impute.jl	`100.00% <100.00%> (ø)`

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d39b647...6a26438.

glennmoy · 2021-04-07T14:25:42Z

Is this roughly what we want apply(::KeyedDataset) to do?

Pretty much. Setting the awkwardness of MeanStdScaling aside I think this is a pretty good start.

In this context, does it make sense to allow :_ at the end of a Pattern, translating to dims=: in FeatureTransforms.apply? Currently, allowing Colon() at the end of a Pattern wouldn't work this way; it wouldn't match any dimspaths.

I think this makes sense, given that Patterns are the expected interface with KeyedDatasets. dims=: only "makes sense" in the context of arrays.

How should LinearCombination work? Suppose we wanted to append the result to the dataset - will there be problems with constraints?

Intuitively, I would imagine LinearCombination as acting over a collection of component arrays. Although this is very unlikely to be invoked, operations like this should follow the same principles we've set up in FeatureTransforms. In this instance you'd have to write a specific apply method for it 👎

However, I'll note that the apply methods in FeatureTransforms could be made more generic if we used traits (invenia/FeatureTransforms.jl#75), so maybe we can hold off on supporting LinearCombination directly until we get those in place? I have an idea of what it involves; I just need to write it up.

bencottier · 2021-04-15T18:06:11Z

Project.toml

 Impute = "0.6"
 NamedDims = "0.2"
 OrderedCollections = "1"
 ReadOnlyArrays = "0.1"
-julia = "1.3"
+julia = "1.5"


Comply with FeatureTransforms compat

Kind of annoying that FeatureTransforms only supports 1.5, but I guess all our packages should support it anyways?

bencottier · 2021-04-15T18:08:01Z

src/featuretransforms.jl

+    if inner  # batched apply_append on each component
+        return map(ds, patterns...) do a
+            FeatureTransforms.apply_append(a, t; dims=dims, kwargs...)
+        end


Just thought this could be handy and worth including.

I could see this being useful, but I'm not sure we have enough use-cases yet. That being said, it wouldn't be too hard to deprecate if it isn't useful.

I'm not sure how useful this is right away either, and it might complicate the code and tests if we have to support both batch=true/false.

Given that the implementation is rather easy (just a map over the components), it should be straightforward for users to do it themselves, so we can hold off adding it here until we know it's worth doing.

bencottier · 2021-04-15T18:13:26Z

src/featuretransforms.jl

+        selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns))
+
+        # construct keys of new transformed components
+        new_keys = [(k[1:end-1]..., component_name) for k in selected]


By using a single component name this assumes, but does not enforce, that there is only one kind of component being transformed e.g. :price. It could still be multiple components e.g. (:train, :price) and (:predict, :price).

But we also discussed the idea of passing in a full dimspath, which I'm open to.

Might be good to throw an argument error if that assumption doesn't hold for now. Passing multiple dimpaths does seem noisy, and hard to justify without use-cases.

So the condition would be that the last part of each dimpath is the same? Off the top of my head:

length(unique([dpath[end] for dpath in selected])) == 1

Yeah, though that could probably be simplified to only(last.(selected))?

this assumption is fine IMO
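
For reference, a minimal sketch of the check discussed in this thread, using plain tuples as stand-ins for the component paths held in selected. The helper name single_component_name is hypothetical and not part of AxisSets.jl. One caveat on the shorter form: Base.only throws unless its argument has exactly one element, so applying it directly to last.(selected) would also reject two paths that share a name; the sketch therefore deduplicates first.

```julia
# Hypothetical helper, not part of AxisSets.jl: returns the single component
# name shared by all paths, or throws an ArgumentError if the names differ.
function single_component_name(paths)
    names = unique(last.(paths))   # last element of each path is the component name
    length(names) == 1 || throw(ArgumentError(
        "expected exactly one component name, got $(names)"
    ))
    return only(names)
end

# Two components of the same kind pass the check.
selected = [(:train, :price), (:predict, :price)]
name = single_component_name(selected)
println(name)
```

Passing paths such as (:train, :price) and (:train, :load) (the latter being a made-up component) would raise the ArgumentError instead.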

bencottier · 2021-04-15T18:16:19Z

src/impute.jl

-_pattern(dims::Pattern) = dims
-_pattern(dims::Tuple) = Pattern(dims)
-_pattern(dims) = Pattern(:__, dims)
+_impute_pattern(dims::Pattern) = dims


Having to distinguish _impute_pattern and _transform_pattern could be a code smell? But there is a difference in what dims means for Impute vs. FT.

_transform_pattern is closer to what mapslices does:

AxisSets.jl/src/functions.jl

Lines 79 to 84 in d39b647

function Base.mapslices(f::Function, ds::KeyedDataset, keys...; dims)
    patterns = if isempty(keys)
        dims isa Symbol ? Pattern[(:__, dims)] : Pattern[(:__, d) for d in dims]
    else
        Pattern[keys...]
    end

Ideally we'd standardise how to handle patterns/dims everywhere.

Yeah, maybe add a comment to invenia/Impute.jl#66? If we're gonna change that in Impute.jl it'd be good to do that before a 1.0 release?

invenia/Impute.jl#66 (comment)
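
To make the "standardise how to handle patterns/dims" idea concrete, here is a rough sketch of a single shared helper that follows the mapslices convention quoted above: explicit keys take priority, otherwise each dim becomes a (:__, dim) pattern. It uses plain tuples of Symbols as stand-ins for the real Pattern type, and the names to_patterns and select_patterns are hypothetical. Note that it deliberately differs from the old _pattern(dims::Tuple) in impute.jl, which treats a tuple as one full pattern; which convention to adopt is exactly the open question.

```julia
# Stand-in: a "pattern" here is just a tuple of Symbols, not AxisSets.Pattern.
# A single dim becomes one wildcard-prefixed pattern.
to_patterns(dims::Symbol) = [(:__, dims)]
# Several dims become one pattern each (the mapslices convention).
to_patterns(dims::Tuple{Vararg{Symbol}}) = [(:__, d) for d in dims]
to_patterns(dims::AbstractVector{Symbol}) = [(:__, d) for d in dims]

# Explicit keys win over dims; otherwise fall back to patterns built from dims.
select_patterns(keys::Tuple, dims) = isempty(keys) ? to_patterns(dims) : collect(keys)

println(select_patterns((), :time))
println(select_patterns((), (:time, :id)))
println(select_patterns(((:train, :price, :_),), :time))
```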

bencottier · 2021-04-15T18:16:33Z

src/featuretransforms.jl

@@ -0,0 +1,154 @@
+FeatureTransforms.is_transformable(::KeyedDataset) = true
+
+_transform_pattern(keys, dims) = isempty(keys) ? _transform_pattern(dims) : Pattern[keys...]


See comment in impute.jl

bencottier · 2021-04-15T18:31:10Z

src/featuretransforms.jl

+        selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns))
+
+        # construct keys of new transformed components
+        new_keys = [(k[1:end-1]..., component_name) for k in selected]


Maybe shouldn't call it keys to avoid confusion with KeyedArray keys. keys means dimspaths here.

Yeah, I'd probably use a variable like _dimpaths or dpaths.

rofinn

Seems like a reasonable start. I think even if we can't support the full FeatureTransforms.jl API this might be enough to gather more use-cases.

rofinn · 2021-04-15T19:56:16Z

Project.toml

 Impute = "0.6"
 NamedDims = "0.2"
 OrderedCollections = "1"
 ReadOnlyArrays = "0.1"
-julia = "1.3"
+julia = "1.5"


Kind of annoying that FeatureTransforms only supports 1.5, but I guess all our packages should support it anyways?

rofinn · 2021-04-15T20:01:10Z

src/AxisSets.jl

@@ -88,5 +89,6 @@ include("dataset.jl")
 include("indexing.jl")
 include("functions.jl")
 include("impute.jl")


rofinn · 2021-04-15T20:05:28Z

src/featuretransforms.jl

+   (:train, :price2) => [4.0 16.0; 9.0 4.0; 1.0 1.0]
+ (:predict, :price2) => [0.25 1.0; 25.0 4.0; 0.0 1.0]
+```
+"""


These docstrings are a bit verbose (i.e. several sentences are duplicated between them). Could we simplify the apply_append and append docstrings to reference the apply! method?

Yeah that's a good idea
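
For readers without the full docstring, a toy sketch of what the doctest output above illustrates: the transformed data is stored as a new component whose key swaps the last path element for the new component name, mirroring the (k[1:end-1]..., component_name) construction in the diff. This uses a plain Dict in place of a KeyedDataset, and toy_apply_append is a hypothetical name, so it says nothing about constraints or the real implementation.

```julia
# Toy stand-in for a dataset: component path => data.
components = Dict{Tuple{Symbol,Symbol},Matrix{Float64}}(
    (:train, :price) => [-2.0 4.0; 3.0 2.0; -1.0 -1.0],
)

# Hypothetical helper: apply `f` to every component and store the result under
# a new key that replaces the last path element with `component_name`.
function toy_apply_append(f, comps, component_name)
    out = copy(comps)                       # keep the original components
    for (k, v) in comps
        out[(k[1:end-1]..., component_name)] = f(v)
    end
    return out
end

# Squaring the prices, as the :price2 doctest output suggests.
r = toy_apply_append(x -> x .^ 2, components, :price2)
println(r[(:train, :price2)])
```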

rofinn · 2021-04-15T20:08:54Z

src/featuretransforms.jl

+The transform can be applied to a subselection of components via a [`Pattern`](@ref) `key`.
+Otherwise, components are selected by the desired `dims`.
+
+If `inner=true`, perform `FeatureTransforms.apply_append` on each component,


I'm not sure inner is the right term for this as it feels like it overlaps with things like inner joins. Maybe batch would be more appropriate?

rofinn · 2021-04-15T20:09:51Z

src/featuretransforms.jl

+    if inner  # batched apply_append on each component
+        return map(ds, patterns...) do a
+            FeatureTransforms.apply_append(a, t; dims=dims, kwargs...)
+        end


I could see this being useful, but I'm not sure we have enough use-cases yet. That being said, it wouldn't be too hard to deprecate if it isn't useful.

rofinn · 2021-04-15T20:11:51Z

src/featuretransforms.jl

+        selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns))
+
+        # construct keys of new transformed components
+        new_keys = [(k[1:end-1]..., component_name) for k in selected]


Might be good to throw an argument error if that assumption doesn't hold for now. Passing multiple dimpaths does seem noisy, and hard to justify without use-cases.

rofinn · 2021-04-15T20:12:56Z

src/featuretransforms.jl

+        selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns))
+
+        # construct keys of new transformed components
+        new_keys = [(k[1:end-1]..., component_name) for k in selected]


Yeah, I'd probably use a variable like _dimpaths or dpaths.

rofinn · 2021-04-15T20:14:38Z

src/impute.jl

-_pattern(dims::Pattern) = dims
-_pattern(dims::Tuple) = Pattern(dims)
-_pattern(dims) = Pattern(:__, dims)
+_impute_pattern(dims::Pattern) = dims


Yeah, maybe add a comment to invenia/Impute.jl#66? If we're gonna change that in Impute.jl it'd be good to do that before a 1.0 release?

rofinn · 2021-04-15T20:15:05Z

test/featuretransforms.jl

+        @test is_transformable(ds)
+    end
+
+    # TODO: use fake Transforms


What are fake transforms?

Forthcoming as part of the test utils in FeatureTransforms. See this POC PR: https://github.com/invenia/FeatureTransforms.jl/pull/77/files#diff-4c5e126be8af5fe14f9784e4cedac0f729e29553a4fc76c5ca47fbd1c7e0a4d8R1
(Glenn is breaking it up into multiple PRs at the moment)

glennmoy · 2021-04-16T18:10:26Z

src/featuretransforms.jl

+    ds::KeyedDataset, t::Transform, keys...;
+    dims=:, kwargs...
+)
+    return map(ds, _transform_pattern(keys, dims)...) do a


Should this be a for-loop that ends with just return ds? Otherwise, doesn't it return only the components that were affected?

glennmoy · 2021-04-16T18:26:07Z

src/featuretransforms.jl

+        selected = unique(x[1:end-1] for x in dimpaths(ds) if any(p -> x in p, patterns))
+
+        # construct keys of new transformed components
+        new_keys = [(k[1:end-1]..., component_name) for k in selected]


this assumption is fine IMO

glennmoy · 2021-04-16T18:28:03Z

src/featuretransforms.jl

+    if inner  # batched apply_append on each component
+        return map(ds, patterns...) do a
+            FeatureTransforms.apply_append(a, t; dims=dims, kwargs...)
+        end


I'm not sure how useful this is right away either, and it might complicate the code and tests if we have to support both batch=true/false.

Given that the implementation is rather easy (just a map over the components), it should be straightforward for users to do it themselves, so we can hold off adding it here until we know it's worth doing.

glennmoy · 2021-04-16T18:36:19Z

test/featuretransforms.jl

+                @test !isequal(ds, expected)
+            end
+
+            @testset "inds" begin


I'm not sure if we have to test inds here as that's part of the FeatureTransforms implementation. It's not affected directly in this package so we should just be able to trust that it works downstream.

glennmoy · 2021-04-16T18:36:56Z

test/featuretransforms.jl

+                @test !isequal(ds, expected)
+            end
+
+            @testset "outer" begin


Case in point on inner vs. outer: a whole other testset to keep on top of.

glennmoy · 2021-04-20T15:03:43Z

src/featuretransforms.jl

@@ -0,0 +1,154 @@
+FeatureTransforms.is_transformable(::KeyedDataset) = true


This is now automatic when an apply method is defined for a type

https://github.com/invenia/FeatureTransforms.jl/releases/tag/v0.3.4

glennmoy · 2021-04-22T22:35:20Z

https://github.com/invenia/AxisSets.jl/releases/tag/v0.1.7

rofinn · 2021-05-12T14:57:21Z

Closing as I believe #50 covers this now? Feel free to open if there's still functionality here we want to be included.

bencottier · 2021-05-13T11:11:52Z

Closing as I believe #50 covers this now? Feel free to open if there's still functionality here we want to be included.

Yep, this is fine. For #53 we should make a new PR rather than reopen this one, but some code here may be useful to copy.

bencottier added 3 commits April 7, 2021 14:15

Implement initial FeatureTransforms support

b71c197

Only return transformed components in new dataset

f3bd036

Bump julia compat to 1.5

80bd4b3

For FeatureTransforms compat

glennmoy mentioned this pull request Apr 7, 2021

Use traits to generalise apply methods invenia/FeatureTransforms.jl#75

Closed

bencottier added 10 commits April 9, 2021 17:59

Move constructing paths to separate function

de3304f

Implement apply! and apply_append

525ccc1

Also update docstrings of apply and _apply_paths

Allow colon for single dims

c840873

Update tests for all apply methods

b73a80d

Put _pattern back into impute.jl

f85a135

Make apply methods return full dataset and use map

8af6a70

- Map simplifies things - don't need the custom _apply_paths() anymore - Returning full dataset seems more appropriate after talking to Rory

Use common function for pattern in apply methods

5c71193

Implement outer apply_append method

230ea65

Simplify component_name

c7423b4

Update docstrings and doctests

6a26438

bencottier commented Apr 15, 2021

bencottier requested review from rofinn and glennmoy April 15, 2021 18:32

rofinn approved these changes Apr 15, 2021

bencottier mentioned this pull request Apr 16, 2021

API simplification invenia/Impute.jl#66

Open

glennmoy reviewed Apr 16, 2021

glennmoy reviewed Apr 20, 2021

glennmoy mentioned this pull request Apr 21, 2021

Support FeatureTransforms #50

Merged

rofinn closed this May 12, 2021
