Reorganization/rewrite #48

Closed
joshday opened this issue Dec 9, 2015 · 3 comments
@joshday
Owner

joshday commented Dec 9, 2015

I started toying around with changes to OnlineStats in a separate package, https://github.com/joshday/OnlineStatistics.jl. (It exists as a separate package only so I can run comparisons between the old and the new. It is not, nor will it be, a replacement.) What was meant to be just a few performance tests turned into a major rewrite (details on the changes are below). I've managed some performance improvements, some marginal and some orders of magnitude:

julia> include("test/performance.jl")
WARNING: replacing module Performance

  =======================================
  Performance on 10 million observations
  =======================================

                Mean new :  0.040254 seconds (5 allocations: 192 bytes)
                Mean old :  0.042746 seconds (5 allocations: 192 bytes)

        Mean (batch) new :  0.005795 seconds (6 allocations: 224 bytes)
        Mean (batch) old :  0.005977 seconds (6 allocations: 224 bytes)

            Variance new :  0.042452 seconds (5 allocations: 208 bytes)
            Variance old :  0.060829 seconds (5 allocations: 192 bytes)

             Extrema new :  0.049575 seconds (4 allocations: 160 bytes)
             Extrema old :  0.051199 seconds (4 allocations: 160 bytes)

         QuantileSGD new :  0.618786 seconds (8 allocations: 448 bytes)
         QuantileSGD old :  1.510063 seconds (94 allocations: 76.310 MB, 0.25% gc time)

          QuantileMM new :  0.772390 seconds (10 allocations: 656 bytes)
          QuantileMM old :  1.720187 seconds (96 allocations: 76.310 MB, 1.33% gc time)

             Moments new :  0.099977 seconds (6 allocations: 288 bytes)
             Moments old :  0.071233 seconds (5 allocations: 208 bytes)


  ============================================
  Performance on .2 million × 500 observations
  ============================================

               Means new :  0.010688 seconds (17 allocations: 2.469 KB)
        Means old (VERY SLOW) :

       Means (batch) new :  0.009797 seconds (16 allocations: 1.578 KB)
       Means (batch) old :  0.009701 seconds (17 allocations: 1.609 KB)

           Variances new :  0.041235 seconds (40 allocations: 7.766 KB)
    Variances old (VERY SLOW) :

   Variances (batch) new :  0.042492 seconds (37 allocations: 5.109 KB)
   Variances (batch) old :  0.037979 seconds (38 allocations: 5.156 KB)

           CovMatrix new :  2.819448 seconds (200.00 k allocations: 9.155 MB)
    CovMatrix old (VERY SLOW) :

   CovMatrix (batch) new :  0.072263 seconds (23 allocations: 237.063 KB)
   CovMatrix (batch) old :  0.065741 seconds (19 allocations: 80.625 KB)



  ===========================================
  Performance on 1 million × 5 design matrix
  ===========================================

              LinReg new :  0.033935 seconds (35 allocations: 45.779 MB, 9.72% gc time)
              LinReg old :  0.053115 seconds (37 allocations: 45.779 MB, 43.14% gc time)
           SparseReg old :  0.092843 seconds (33 allocations: 45.779 MB, 71.34% gc time)

Changes:

  • remove state(o) and statenames(o), replace with value(o)
    • value(o) returns only the statistic; nonessential information (e.g. a Vector of quantiles) should show up in Base.show methods
  • Change Weighting to Weight, EqualWeighting to EqualWeight, etc.
  • change update! to StatsBase.fit!
  • faster and easier-to-understand sweep! operator, with an additional method that takes a placeholder vector to avoid gc. sweep! now also stores the upper triangular matrix rather than the lower.
  • Rather than let each Distribution have its own type, fitting distributions can all be done through FitDistribution and FitMvDistribution types
  • Remove SparseReg; its functionality (coefficients from penalized likelihood) is now handled by LinReg
  • cleanup
    • I had too many files floating around. For example, everything in summary/ is now in one file, summary.jl.
  • rename StochasticModel to StatLearn. I think hinting at statistical learning is better than "algorithms based on a stochastic subgradient". If anyone has a better name for something that incorporates SVMs and linear, logistic, poisson, huber, l1 loss, and quantile regression, I'm all ears.
  • probably other things I forgot
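
To illustrate the sweep! item above, here is a rough sketch of a sweep operator that reads and writes only the upper triangle and takes a caller-supplied buffer so that repeated sweeps do not allocate. It is a textbook-style version and not the code in the package: the name sketch_sweep!, the argument order, and the sign convention (the pivot becomes -1/pivot) are all assumptions made for the example.

```julia
# Hypothetical sketch, not the OnlineStats implementation.
# Sweep the symmetric matrix A on pivot k, touching only the upper triangle.
# `v` is a preallocated buffer of length size(A, 1); reusing it avoids garbage.
function sketch_sweep!(A::Matrix{Float64}, k::Int, v::Vector{Float64})
    n = size(A, 1)
    p = A[k, k]                      # pivot element
    # Copy row/column k out of the upper triangle, scaled by 1/pivot.
    for j in 1:n
        v[j] = (j <= k ? A[j, k] : A[k, j]) / p
    end
    # Rank-one update of the upper triangle: A[i, j] -= A[i, k] * A[k, j] / p.
    for j in 1:n, i in 1:j
        A[i, j] -= p * v[i] * v[j]
    end
    # Overwrite row/column k with the scaled values, then set the pivot.
    for i in 1:k-1
        A[i, k] = v[i]
    end
    for j in k+1:n
        A[k, j] = v[j]
    end
    A[k, k] = -1 / p
    return A
end

# Usage: the upper triangle holds [X'X X'y; . y'y] for a toy regression.
# The lower triangle is never read, so it is left as zeros here.
A = [2.0 0.0 2.0;
     0.0 4.0 8.0;
     0.0 0.0 30.0]
buf = zeros(3)
sketch_sweep!(A, 1, buf)
sketch_sweep!(A, 2, buf)
# Under this convention, A[1:2, 3] holds the least-squares coefficients
# and A[3, 3] holds the residual sum of squares.
```

With this convention, sweeping on every predictor index leaves the coefficients in the last column, which is one way a LinReg-style type could recover estimates from stored cross-products.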

I wanted to get this out in the open before I start moving things over to OnlineStats. The change with the biggest impact is how weightings are handled.
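
For a feel of how the renamed pieces could fit together, here is a minimal self-contained sketch of a running mean driven by a weight object, updated with fit! and read with value. Every name in it (SketchWeight, SketchEqualWeight, SketchMean, weight) is a hypothetical stand-in, and fit! and value are defined locally rather than extending StatsBase.fit!, so this is not the package's actual code or weighting scheme.

```julia
# Hypothetical sketch, not the OnlineStats source.
abstract type SketchWeight end

# Equal weighting: the n-th observation gets weight 1/n.
struct SketchEqualWeight <: SketchWeight end
weight(::SketchEqualWeight, n::Int) = 1 / n

# A running mean that stores its estimate, observation count, and weight.
mutable struct SketchMean{W<:SketchWeight}
    mu::Float64
    n::Int
    w::W
end
SketchMean(w::SketchWeight = SketchEqualWeight()) = SketchMean(0.0, 0, w)

# Single-observation update: move the estimate toward y by the current weight.
function fit!(o::SketchMean, y::Real)
    o.n += 1
    g = weight(o.w, o.n)
    o.mu += g * (y - o.mu)
    return o
end

# Update with several observations, one at a time.
function fit!(o::SketchMean, ys::AbstractVector{<:Real})
    for y in ys
        fit!(o, y)
    end
    return o
end

# value(o) returns only the statistic.
value(o::SketchMean) = o.mu

o = SketchMean()
fit!(o, [1.0, 2.0, 3.0, 4.0])
value(o)   # approximately 2.5, the sample mean
```

Swapping in a different SketchWeight subtype with its own weight method would change how quickly old observations are discounted without touching the update code.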

@tbreloff
Collaborator

tbreloff commented Dec 9, 2015

It'll take me a little while to go through this in detail, but just based on your summary: 👍

@joshday
Owner Author

joshday commented Jan 4, 2016

These changes are now in master. New docs coming soon.

@joshday
Owner Author

joshday commented Jan 9, 2016

Changes are now in METADATA. Docs mostly moved to README to make them easier to maintain.

@joshday joshday closed this as completed Jan 9, 2016