GroupBy with multiple variables #145

Closed
evveric opened this issue Apr 9, 2019 · 4 comments

evveric commented Apr 9, 2019

This is not an issue, just a question about how to do something. (Sorry, I don't know where else to post questions like this. Please let me know if there is a better place to ask.)

I saw that "GroupBy" can group by only one stat, and that
"Group" is a collection that holds multiple operations applied to a vector.

My question: how can I combine the two?
Simple example: I want to group by x, where y is a matrix with 5 columns.
For each group of x, I want to show the Average (first column of y), Variance (second column of y), Extrema (third column), etc.

x = rand(1:10, 100); y = x .+ randn(100, 5)

How can I do that? If it is not possible now, do I have to write new code?
Thank you.

Owner

joshday commented Apr 9, 2019

This repo doesn't get too many issues, so it's not a problem to post it here for now. You may get quicker responses (more eyes on the question) if you post on Julia's Slack.

Using your example data, I believe this does what you're trying to do:

julia> stat = Group(Mean(), Variance(), Extrema(), Extrema(), Extrema());

julia> o = GroupBy(Int, stat);

julia> fit!(o, zip(x, OnlineStats.eachrow(y)))

joshday closed this as completed Apr 9, 2019

Author

evveric commented Apr 9, 2019

Thank you. In the end I might need to write a customised groupby function myself, since my y contains invalid or missing values. I need to run fit! on each y[x], and if y[x] is not valid, skip it but continue with fit! on y[x+1].

Also, this runs the zip function as well as the eachrow function. Will that create unnecessary memory allocations? I am running this online algorithm on data with 10 million rows, so I need to make sure the memory usage of each loop iteration is O(1).

Thanks

Author

evveric commented Apr 9, 2019

Also, I am not able to use Julia's Slack. My company network blocks that website, but the company firewall does allow GitHub.

Owner

joshday commented Apr 9, 2019

zip and eachrow are lazy, but they still allocate, about one small allocation per row in this run:

julia> x = rand(1:10, 10^7); y = x .+ randn(10^7, 5);

julia> stat = Group(Mean(), Variance(), Extrema(), Extrema(), Extrema());

julia> o = GroupBy(Int, stat);

julia> @time fit!(o, zip(x, OnlineStats.eachrow(y)));
  0.624278 seconds (10.00 M allocations: 305.176 MiB, 3.40% gc time)

I don't completely follow what you're doing, but you can take a look at FTSeries to filter observations.
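
For the skip-invalid-values requirement raised earlier in the thread, one possible fallback is a hand-rolled grouped accumulator. The following is only a minimal sketch in plain Julia and does not use OnlineStats: the names RunningMoments, update!, variance, isvalid_obs and grouped_fit! are hypothetical, it tracks a single value column rather than five, and it uses Welford's running mean/variance update as a stand-in for Mean() and Variance(). Each observation is handled with a constant amount of work, and invalid values are skipped while the loop continues.

```julia
# Hand-rolled sketch, NOT OnlineStats API: per-group running mean and variance
# (Welford's update), skipping values that are missing or non-finite.

mutable struct RunningMoments   # hypothetical helper type
    n::Int                      # number of accepted observations
    mean::Float64               # running mean
    m2::Float64                 # running sum of squared deviations from the mean
end
RunningMoments() = RunningMoments(0, 0.0, 0.0)

# One constant-time update per accepted observation.
function update!(s::RunningMoments, v::Real)
    s.n += 1
    delta = v - s.mean
    s.mean += delta / s.n
    s.m2 += delta * (v - s.mean)
    return s
end

# Sample variance; NaN until at least two observations have been accepted.
variance(s::RunningMoments) = s.n > 1 ? s.m2 / (s.n - 1) : NaN

# An observation is usable if it is neither missing nor NaN/Inf.
isvalid_obs(v) = !ismissing(v) && isfinite(v)

# Stream (key, value) pairs; invalid values are skipped and the loop continues.
function grouped_fit!(groups::Dict{Int,RunningMoments}, ks, vs)
    for (k, v) in zip(ks, vs)
        isvalid_obs(v) || continue
        update!(get!(() -> RunningMoments(), groups, k), v)
    end
    return groups
end

ks = [1, 1, 2, 2, 1]
vs = [1.0, 3.0, missing, 10.0, NaN]
groups = grouped_fit!(Dict{Int,RunningMoments}(), ks, vs)
```

In this toy run, group 1 keeps the two valid values 1.0 and 3.0, and group 2 keeps only 10.0. The state is one small struct per distinct key, so memory grows with the number of groups rather than with the number of rows.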
