
Merge StreamStats, add to JuliaStats #19

Closed · 6 tasks done

joshday opened this issue Jun 29, 2015 · 29 comments

@joshday (Owner) commented Jun 29, 2015

See StreamStats.jl issue #24.

To prepare for adding OnlineStats into JuliaStats, we should add the following functionality from StreamStats:

  • HyperLogLog
  • online bootstrap
  • Adagrad OLS
  • Adagrad Logistic regression
  • Adagrad L2 Logistic Reg
  • Adagrad Ridge Reg
@tbreloff (Collaborator) commented:

I'm going to try to move one or two of these over... I'll start with HyperLogLog and let you know how it goes.

@tbreloff (Collaborator) commented:

Just pushed HyperLogLog on the tom branch... I think it's working, but I could use some additional documentation/tests from @johnmyleswhite (as there wasn't any documentation to begin with, I had to start out by learning what the hell HyperLogLog was :)
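
For anyone else coming to this cold, the core idea is small. A minimal sketch of HyperLogLog (illustrative only; the names and constants here are mine, not the code on the tom branch):

const B = 10                  # register-index bits
const M = 2^B                 # number of registers

function hll_update!(registers::Vector{UInt8}, value)
    h = hash(value)                         # 64-bit hash of the observation
    idx = Int(h & UInt64(M - 1)) + 1        # low B bits pick a register
    w = h >> B                              # remaining bits
    rank = UInt8(trailing_zeros(w) + 1)     # position of the first 1-bit
    registers[idx] = max(registers[idx], rank)
    return registers
end

function hll_estimate(registers::Vector{UInt8})
    α = 0.7213 / (1 + 1.079 / M)            # bias correction for large M
    s = 0.0
    for r in registers
        s += 2.0^(-Int(r))                  # harmonic-mean accumulator
    end
    return α * M^2 / s
end

The estimate is a bias-corrected harmonic mean of 2^register over all registers; the full algorithm also applies small- and large-range corrections, which I've left out of this sketch.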

@tbreloff (Collaborator) commented:

@johnmyleswhite: do you have any resources you can recommend for reading up on the bootstrapping and adagrad methods you implemented? I want to understand them a little more thoroughly before moving them over.

@johnmyleswhite commented:

Online bootstrapping: http://statweb.stanford.edu/~owen/reports/multi.pdf

AdaGrad: http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf

@tbreloff (Collaborator) commented:

Thanks John. I'm trying to think through whether it's feasible to use AdaGrad as a Weighting, instead of reimplementing each individual algorithm. Certainly it would be a nice bonus if this could be applied to other stats, but I'm not sure how easy it is to generalize.
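
To make sure we're talking about the same update: the AdaGrad step is per-coordinate, roughly this sketch (η and ε are tuning constants I'm assuming, not names from the code):

# One AdaGrad step: G accumulates squared gradients, so each coordinate of β
# gets its own effective learning rate η / sqrt(G[i]).
function adagrad_step!(β, G, g; η = 1.0, ε = 1e-8)
    for i in eachindex(β)
        G[i] += g[i]^2
        β[i] -= η * g[i] / (sqrt(G[i]) + ε)
    end
    return β
end

The per-coordinate scaling is the part that smells like a Weighting; the gradient g is the part that's specific to each model.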

@johnmyleswhite commented:

I'd be cautious using AdaGrad as a weighting since it could only be generalized if you made the idea of a gradient part of the OnlineStats type system.
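
To make that concrete: a generic AdaGrad weighting would need to ask any stat for its gradient, which implies something like this purely hypothetical sketch (not actual OnlineStats types):

# Hypothetical: stats opting in to gradient-based updates would subtype this
# and implement a grad method returning the gradient of their loss at the
# current coefficients. A single generic AdaGrad update! could then be
# written once against the interface.
abstract GradientStat <: OnlineStat

# grad(o::SomeStat, x, y) -> Vector{Float64}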

@tbreloff (Collaborator) commented:

To quote Dumb and Dumber: "...more like 1 in a million." "So you're saying there's a chance!" (Yes, in this case I'm "dumber".)

Anyways... I don't doubt it would require some careful thought to get right. If it can be generalized (and we decide it's worthwhile to adjust the type system), then do you see value in attempting it? If you have 15 minutes sometime to discuss a little further over Skype or Hangouts, let me know. @joshday should join as well.

@johnmyleswhite commented:

Yeah, I could find some time tomorrow afternoon to chat.

If we could make it general it would be useful. We'd want to think about how a general solution relates to other SGD tools that people have tried building.

@tbreloff (Collaborator) commented:

Perfect... tomorrow afternoon works. In the meantime, do you know offhand the other SGD tools/packages out there I should familiarize myself with?

@johnmyleswhite commented:

How about chatting at 11 AM Pacific time?

@johnmyleswhite commented:

I'd look at https://github.com/lindahua/SGDOptim.jl for some ideas.

@joshday (Owner) commented Jul 1, 2015

I'd love to be a part of that conversation. I'll be available at 11am Pacific.

@tbreloff (Collaborator) commented Jul 1, 2015

Sounds good. I'm in.

@tbreloff (Collaborator) commented Jul 2, 2015

@johnmyleswhite: I'm confusing myself regarding the Logistic Loss function you have in StreamStats, and I'm hoping you can help. I expect the standard loss functions to look like this:

  • Least-squares: f(w,(x, y)) = (y−w'x)^2
  • Hinge loss: f(w,(x, y)) = max{0,1−y(w'x)}
  • Logistic regression: f(w,(x, y)) = log(1+exp(−yw'x))

And the gradient would be df/dwi (i.e., the derivative of the loss function with respect to the ith weight). For least squares, you have the derivative as (-2 * xi * err), which makes sense.

In your ApproxLogit update function, you have essentially the same derivative (but you dropped the 2 and swapped signs with your update to beta... effectively making it the same as the least squares case). Is that correct? Is there some simple math that I'm being dense about?

I think the derivative is something more complicated than just (-xi * err) for the logistic case, involving exp at a minimum. I didn't work out the full equation, so I guess it's possible something cancels out.
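
For reference, working the logistic derivative out directly (standard calculus, using the y ∈ {-1, 1} form above):

$$\frac{\partial}{\partial w_i} \log\left(1 + e^{-y\,w^\top x}\right) = \frac{-y\,x_i\,e^{-y\,w^\top x}}{1 + e^{-y\,w^\top x}} = \frac{-y\,x_i}{1 + e^{y\,w^\top x}}$$

So the exp doesn't fully cancel; it turns into the logistic function itself.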

@tbreloff (Collaborator) commented Jul 2, 2015

Also I pushed up a first draft for Adagrad to the tom branch... comments welcome.

@johnmyleswhite commented:

The code I wrote maximizes the logistic log-likelihood rather than minimizing a specific cost function. It should be closely related to minimizing log loss.

@joshday (Owner) commented Jul 2, 2015

One thing to beware of is how the response is coded. Tom, I believe the loss function you wrote above has y coded as (-1, 1), whereas StreamStats has it as (0, 1). Here's the log-likelihood version John is talking about, apart from what I think is a typo (a missing minus sign in the gradient):

[screenshot: log-likelihood and gradient for logistic regression]
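
For readers without the image, here is my reconstruction of the standard form (y coded as {0, 1}, with the typo fixed):

$$\ell(\beta) = \sum_i \left[ y_i\,x_i^\top\beta - \log\left(1 + e^{x_i^\top\beta}\right) \right], \qquad \nabla\ell(\beta) = \sum_i x_i (y_i - p_i), \qquad p_i = \frac{1}{1 + e^{-x_i^\top\beta}}$$

Gradient ascent on this log-likelihood is what the StreamStats logistic code is doing.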

@tbreloff (Collaborator) commented Jul 3, 2015

Thanks for these comments. In your experience, is it better/easier to deal with maximizing likelihood vs. minimizing cost? The formulas I referenced came from section 1.1 of http://www.jmlr.org/papers/volume11/xiao10a/xiao10a.pdf, which is focused on minimizing, and I'd like to attempt the Adagrad/RDA hybrid referenced in section 3 of http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf (also minimizing, I believe).

Also John: now I understand why you have "beta -= ..." in the OLS version and "beta += ..." in the logistic... I was a little confused there. Thanks!

@joshday (Owner) commented Jul 3, 2015

I just went through the math: they're equivalent. The gradient ends up being the same in either case (sketch after the list). It just depends what we want to enforce on users: {-1, 1} or {0, 1}.

  • {0, 1} will be more familiar to statisticians
  • {-1, 1} will be more familiar to machine learning folks
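
Sketch of the equivalence, writing σ(z) = 1/(1 + e^{-z}): with y ∈ {-1, 1}, the per-observation gradient of the loss is

$$\nabla_\beta \log\left(1 + e^{-y\,x^\top\beta}\right) = -y\,x\,\sigma(-y\,x^\top\beta)$$

Plugging in y = 1 gives -x(1 - σ(x'β)), and y = -1 gives -x(0 - σ(x'β)); in both cases that's -x(ỹ - σ(x'β)) with ỹ = (y + 1)/2 ∈ {0, 1}, which is exactly the (negated) {0, 1} log-likelihood gradient.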

@tbreloff (Collaborator) commented Jul 3, 2015

Sweet... you're one step ahead of me. Thanks.

@joshday (Owner) commented Jul 3, 2015

Adagrad stuff looks cool 👍. It works!

I think you switched link and invlink. The link function relates the expected value to the linear predictor:

  • link function: g(E(y)) = Xb
  • inverse link: h(Xb) = E(y)

I'm wondering if some penalties will need their own update! method. Updating with the L1 penalty should include the extra soft-thresholding step; right now, instead of setting coefficients to exactly 0, it pushes them negative (see the sketch below).
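
For reference, the soft-thresholding step I mean is just this (a sketch; the names are mine, not code from the branch):

# Soft-thresholding: shrink toward zero and stop at zero, rather than
# letting the penalty push a coefficient past it.
soft_threshold(z, γ) = sign(z) * max(abs(z) - γ, 0.0)

# Applied elementwise after the gradient step, e.g.
#   for i in 1:length(β); β[i] = soft_threshold(β[i], η * λ); end
# where η is the step size and λ the L1 penalty.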

@tbreloff (Collaborator) commented Jul 3, 2015

Agreed on the link/invlink... I fixed it.

Regarding the thresholding... is that specific to L1 or is it common for other regularizers? When does the thresholding happen? In the update to beta, or the update to an intermediate calc?

@joshday (Owner) commented Jul 3, 2015

Soft thresholding shows up in a few different implementations of the LASSO that I've seen. I'm not sure where else it pops up (probably also elastic net).

It would be in the update to beta, but I think I spoke too soon. After looking at the papers a bit more, it doesn't look like it fits into the general Adagrad scheme, and that's why the second paper you referenced does the Adagrad/RDA hybrid. I need to think about it more.

@tbreloff (Collaborator) commented Jul 3, 2015

Ok that lines up with what I'm thinking too. I'll probably reorg a little when I build out RDA.

With regards to Adagrad... the 2 tests I have for OLS with and without L2 reg are both passing, but the logistic isn't. The logistic with NoReg is close to correct... I'm not sure how close I should expect it to be. The logistic with L2 reg is just wrong. Another set of eyes would be awesome.

@joshday (Owner) commented Jul 3, 2015

Logistic with L2 penalty may be correct, just very sensitive to the choice of λ. I think it's because of the rather large coefficients used to generate the data. A beta of 10 would be huge in logistic regression with standardized data: a single standard deviation change in the associated variable would swing the probability essentially from 0 to 1.

julia> 1 / (1 + exp(-5))
0.9933071490757153

julia> 1 / (1 + exp(5))
0.0066928509242848554
julia>  o = Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss())

julia>  o2 = Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss(), reg = L2Reg(.0001))

julia> coef(o) - coef(o2)
10-element Array{Float64,1}:
 0.0509798
 0.103378
 0.153715
 0.205846
 0.257928
 0.310924
 0.362245
 0.416003
 0.466223
 0.517923

@tbreloff (Collaborator) commented Jul 3, 2015

Thanks, that was helpful... I also realized after re-running the tests that I didn't have my y mapped to {0,1}. Now both logistic tests are similar, but the whole beta vector is scaled down:

OnlineStats.Adagrad{β=[0.7217354581720843,1.4348008374414074,2.146884897634276,2.8766786770306774,3.5859435927149845,4.3038226266476745,5.059988427489082,5.757408227532339,6.491600366175806,7.219898540409098] nobs=1000000}: β=[1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]
OnlineStats.Adagrad{β=[1.0584635582815747,1.0584635582815747,2.14490825297699,2.8505958359494468,3.5604315976752616,4.262512514113438,4.987637119365928,5.708458455240219,6.420254076403213,7.142296659800167] nobs=1000000}: β=[1.5,1.5,3.0,4.0,5.0,6.0,7.0,8.0,9.0,10.0]

Note that the first beta should be [1,2,3,...,10] and the second beta should be [1.5,1.5,3,...,10]

@tbreloff (Collaborator) commented:

@johnmyleswhite: Adagrad should be fully ported now. I have tests for all 4 versions implemented by StreamStats, comparing both results and speed. Example usage:

β = coef(Adagrad(x, y; link=LogisticLink(), loss=LogisticLoss(), reg=L2Reg(λ)))

@joshday: are you going to take a look at the Bootstrapping code?

@joshday (Owner) commented Jul 14, 2015

👍 . Yes, I should get a chance to look at Bootstrapping later this week.

@joshday (Owner) commented Jul 17, 2015

@johnmyleswhite Online bootstrap stuff is ported over. Since OnlineStats' state function operates differently than StreamStats', I added a field for a function that specifies how to generate the cached_state:

type BernoulliBootstrap{S <: OnlineStat} <: Bootstrap
    replicates::Vector{S}            # replicates of base stat
    cached_state::Vector{Float64}    # cache of replicate states
    f::Function                      # function to generate state. Ex: mean, var, std
    n::Int                           # number of observations
    cache_is_dirty::Bool
end

With this setup, there's a little more flexibility in that you can use the same object with different functions:

v = Variance()
o_mean = BernoulliBootstrap(v, mean, 1000)
o_std = BernoulliBootstrap(v, std, 1000)
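
For context, the update itself follows the online-bootstrap idea from the paper John linked: each replicate sees each observation a random number of times. A sketch of the Bernoulli ("double or nothing") version, illustrative rather than the exact ported code:

function update!(b::BernoulliBootstrap, x)
    for replicate in b.replicates
        if rand() < 0.5
            update!(replicate, x)   # each replicate sees x twice with
            update!(replicate, x)   # probability 1/2, else not at all,
        end                         # so the expected count is 1
    end
    b.n += 1
    b.cache_is_dirty = true
    return b
end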
