joshday/Telperion.jl

Telperion: Parse Statistical Formulas into Feature Columns.

using DataFrames, StatsBase, Telperion

df = DataFrame(y=rand(100), a=1:100, b=randn(100), c=randn(100), d=rand(1:5, 100))

x, y = @xy df log.(y) ~ 1 + a + zscore(b) + abs.(sin.(c)) + dummy(d)

julia> x
OrderedDict{String,Any} with 8 entries:
  "1"             => [1, 1, 1, 1, 1, 1, 1, 1, 1, 1  …  1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  "a"             => [1, 2, 3, 4, 5, 6, 7, 8, 9, 10  …  91, 92, 93, 94, 95, 96, 97, 98, 99, 100]
  "zscore(b)"     => [1.13036, -0.280105, 2.29973, -0.267989, -0.240071, -0.797709, -0.315514, -0.322103, 0.0217353, -1.67589  …  1.45323, -0.363556, -0.650576, -1.543…
  "abs.(sin.(c))" => [0.753822, 0.992965, 0.41306, 0.733578, 0.21487, 0.958583, 0.163681, 0.238074, 0.166078, 0.920199  …  0.407876, 0.277916, 0.0207317, 0.572013, 0.2…
  "dummy(d) [2]"  => Bool[0, 0, 0, 0, 0, 0, 1, 0, 0, 0  …  0, 1, 0, 0, 0, 0, 1, 0, 0, 0]
  "dummy(d) [3]"  => Bool[0, 0, 0, 0, 0, 1, 0, 1, 0, 1  …  0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
  "dummy(d) [4]"  => Bool[0, 1, 0, 1, 0, 0, 0, 0, 0, 0  …  0, 0, 1, 0, 1, 0, 0, 0, 0, 0]
  "dummy(d) [5]"  => Bool[1, 0, 1, 0, 1, 0, 0, 0, 0, 0  …  1, 0, 0, 1, 0, 0, 0, 0, 0, 1]

To build the data matrix, concatenate the columns with reduce(hcat, values(x)).
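
As a rough illustration of that last step, the sketch below lays a few columns side by side with `reduce(hcat, ...)`. It does not call Telperion: a plain vector of `name => column` pairs with made-up values stands in for the `OrderedDict` that `@xy` returns, so that only Base is needed.

```julia
# Stand-in for the ordered name => column mapping (hypothetical data).
n = 5
cols = [
    "1" => fill(1, n),
    "a" => collect(1:n),
    "b" => [0.5, -1.0, 2.0, 0.0, 1.5],
]

# Same idea as reduce(hcat, values(x)): place the columns side by side.
X = reduce(hcat, last.(cols))

size(X)    # (5, 3): one row per observation, one column per term
eltype(X)  # Float64, because the integer columns are promoted
```

Note that mixing integer and floating-point columns promotes the whole matrix to a common element type.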

Why does this exist?

I wanted to try my own take on StatsModels, where each term is generated by valid Julia code rather than a DSL. The formula syntax is the same (e.g. y ~ 1 + term2 + term3).

  • Numbers are the only thing given special treatment: they are turned into vectors, e.g. 1 becomes fill(1, size(df, 1)).
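
The special treatment of numbers could look something like the sketch below. `expand_term` is a hypothetical helper written for this illustration and is not taken from Telperion's source; it only shows the idea of expanding a literal into a constant column while leaving every other term alone.

```julia
# Hypothetical helper, not part of Telperion: a literal number becomes a
# constant column with one entry per observation; anything else is returned as is.
expand_term(t::Number, nrows::Integer) = fill(t, nrows)
expand_term(t, nrows::Integer) = t

nrows = 4
intercept = expand_term(1, nrows)                   # [1, 1, 1, 1]
column = expand_term([0.2, 0.4, 0.6, 0.8], nrows)   # returned unchanged
```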

Benefits

  • Simplicity (this README is roughly as long as the package's source code).
  • Terms can be any Julia code that creates:
    • An AbstractVector or iterable of the correct length.
    • An OrderedDict of AbstractVectors or iterables (for terms that create multiple columns).
  • Works out of the box with many data structures, for example an IndexedTables table:

using IndexedTables

t = table((x=rand(10), y=rand(10)))

x, y = @xy rows(t) y ~ 1 + x
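
To make the point about multi-column terms concrete, here is a minimal sketch of a function that produces several named columns from one variable. `indicator_columns` is a hypothetical name and is not Telperion's `dummy`; dropping the first level is an assumption based on the labels in the example output. The README says such terms return an `OrderedDict`; the sketch returns a vector of `name => column` pairs instead, to stay within Base.

```julia
# Hypothetical sketch, not Telperion's `dummy`: one indicator column per
# level of `v`, skipping the first (reference) level.
function indicator_columns(v::AbstractVector, name::AbstractString)
    lvls = sort(unique(v))
    # Label each column "name [level]", mirroring the labels shown earlier.
    return ["$name [$l]" => (v .== l) for l in lvls[2:end]]
end

d = [1, 3, 2, 3, 1, 2]
cols = indicator_columns(d, "dummy(d)")

first.(cols)    # ["dummy(d) [2]", "dummy(d) [3]"]
last(cols[1])   # true exactly where d == 2
```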

Special Thanks

I could not have written this package without StatsModels.jl, DataFramesMeta.jl, and StatsPlots.jl, which are all fantastic.
