Better Integration with StatsBase.RegressionModel #19

Nosferican · 2017-07-17T19:32:52Z

Hi @gragusa. I am finishing some of the decisions for my longitudinal analysis package. The structs will all be children of StatsBase.RegressionModel and inherit from StatsBase.StatisticalModel. All methods for both abstract types will be implemented such that StatsBase.residuals(model) will return the residuals and so on with coef, dof, dof_residual, etc. In addition, all other fields which do not have a StatsBase.RegressionModel method to access are fully accessible through Base.get(model, value::Symbol). For example, if one wants to get the MRSS one just needs to call get(model, :MRSS), etc. Let me know if this scheme will work for you.

The text was updated successfully, but these errors were encountered:

nalimilan · 2017-07-19T10:42:13Z

I don't think the get approach is very Julian. Why not just define functions for these? Depending on how general they are, we could imagine defining some of them in StatsBase.

Nosferican · 2017-07-19T16:08:28Z

I have the get since seemed like a more flexible approach than getfield when the structure or names of the fields is undetermined. It works similar to the get for nullable which returns the value. It expands Base.get on a struct it is not defined for so it respects Base. Some of the fields I used get for where:

X: for the design matrix (in 2SLS models it refers to the exogenous variables and the predicted values of the endogenous variables)
Bread: inv(X' * X) used in many calculation and especially for the variance-covariance estimations. It is cached as to eliminate the need to inverse the matrix repeatedly.
MRSS: RSS / rdf (I believe StatsBase could have some for unbiased deviance measure or so.
N: Number of observations used in the estimation. It differs from nobs for certain models such as the between estimator or first-difference.
n: Number of panels (could be adapted to number of Primary Survey Units as well)
Estimator: The estimator used.
PID: Indicators of which observations belong to each panel
TID: Indicators of which observations belong to each time period
T: Length of panels
Idiosyncratic: Idiosyncratic error component (only random effects use this one)
Individual: Individual error component (only random effects use this one)
θ: Parameter for quasi-demeaning transformation (only random effects use this one)
z: Matrix of column wise endogenous variables
Z: Matrix of instruments
Some refer to the DataFrames framework such as:
Formula: The DataFrames.Formula
IV: DataFrames.Formula (endogenous ~ instruments)
Varlist: Number of variables estimated (excludes those that were dropped due to multicolinearity)
Intercept: If the model has an intercept. Some estimators force the intercept even if it is requested in the formula to not to be used. It is not used in the intermediate steps, but added in the final model.

Some of these could potentially be added to StatsBase and I would delighted to have those. However, some are not shared for all StatsBase.RegressionModel and would not want to impose values on further development if they are more harmful than beneficial. Technically every regression model has a PID and TID which in unique measurements models is just the number of the observation for PID and ones for TID. In that framework T would just be ones for each panel/PSU.

Maybe some that could be added to StatsBase are: MRSS (unbiased deviance?), N (in order to have nobs better defined*), X (design matrix?)... Bread could be something to consider since many implementation do cache this value and its quite useful for many applications. Estimator can be generalized. For example, GLM (defines and estimator as a Family and Link).

Formula and Varlists are the elements kept from DataFrames. If using the DataFrames.DataFramesRegressionModel then those could potentially be functions of the DataFrames.ModelFrame and design matrix from DataFrames.ModelMatrix.

nalimilan · 2017-07-19T16:58:53Z

Thanks for the list. get is not a good solution since it will be type-unstable.

Some of the cases you listed could be variants of existing functions, e.g. n and N could be obtained via nobs with a symbol argument. intercept could be obtained using coef or params. Others would deserve new functions in the API, like model matrix, formula... Not sure about others.

nalimilan · 2017-07-19T17:04:33Z

Also, it looks like lots of these concepts are common with MixedModels.jl. Have you looked ta that package? Cc: @dmbates

gragusa · 2017-07-19T17:54:33Z

I agree with @nalimilan that get is not the best solution here. @Nosferican take a look at https://github.com/gragusa/InstrumentalVariables.jl. There I have a small IV model where only three new functions have to be defined to use CovarianceMatrices. I understand that more complex models have more complex requirements. The one that I find pressing is nobs that means different things in different context (weighted/unweighted, data size, estimation sample size, etc.).

Wouldn't be enough to have instead of nobs something wrknobs or effnobs which would return "Number of observations used in the estimation"?

More generally, I think I am not getting part of the discussion (totally my fault). Suppose you have a linear model type. I would define the type as follow

struct MyLinPredModel <: LinPredModel
end

struct MyModel 
 model::MyLinPredModel
 ...
end

You need then to define the following methods for MyLinPredModel

function ModelMatrix(::MyLinPredModel)
end

function wrkresid(::MyLinPredModel)
end

function wrkwts(::MyLinPredModel)
end

function nobs(::MyLinPredModel)
end

vcov(x::MyModel, k::RobustVariance) = vcov(x.model, k)

Then everything should work. Could this potentially work?

Nosferican · 2017-07-19T17:55:05Z

My implementation has a very naive way of making get type-stable, but I am sure there are better ways. My get implementation is:

abstract type ModelStatistic end
struct ModelStatistic_RSS
    value::Float64
end
struct Model <: StatsBase.RegressionModel
    model_statistic::Dict{Symbol,ModelStatistic}
end
function Base.get(obj::ModelStatistic)
    getfield(obj, :value)
end
function Base.get(obj::Model, value::Symbol)
    model_stats = getfield(obj, :model_stats)
    get(model_stats, value)
end

I came up with this solution which has its pros and drawbacks. For example, it is strict in the sense no longer I can make calculations between RSS and rdf (dof_residuals) since these are ModelStatistics. Ipso facto,

struct ModelStatistic_MRSS <: ModelStatistic
    value::Float64
    function ModelStatistic_MRSS(RSS::ModelStatistic_RSS, rdf::ModelStatistic_rdf)
        RSS = get(RSS)
        rdf = get(rdf)
        RSS / rdf
    end
end

The pros is that I am less likely to make an error, and can use the inner constructors to verify properties of the statistics. It also allows the model_stats dictionary to be more efficient than using {Symbol, Any}. Any thoughts on this approach are super welcomed since I think I have a very basic understanding of Julia in general (basically what I read from the manual).

Aye. MixedModels is very similar to panel econometrics. Some of the estimators are especial cases. The difference for those estimators is on how to estimate the error components. The statistics way use MLE such as in MixedModels while in econometrics we tend to use GLS framework since it requires less strict assumptions. the two approaches also differ quite a bit in unbalanced panels which are very common in economics.
The MixedModels uses a modified Formula for including endogenous variables. I decided to keep them as two different DataFrames.Formula, but if a decision has been reached on expanding DataFrames.Formula or StatsModels.Formula I could adapt to follow it.

The individual, idiosyncratic, and thetas are bunched in a dictionary as ErrorComponent in R plm package. I could follow that strategy and implement the functions to access them only for the structures that have it.

Nosferican · 2017-07-19T18:30:14Z

@gragusa I think that's a good overall scheme. The only issues I can think of at least for HC estimators is for example with HC1 or HC4 which use uses the dof value and that one differs from the one implied by size(ModelMatrix, 2). However, for fixed effects, the HC estimators are not consistent so I only allow users to specify :OLS or :PID (clustered at panels); this decision is the one implemented in Stata since Stata 12 or 13 if not mistaken. For CRVE (HC0/HC1), the code could adjust the dof by adding the ncluster - 1 to it (this would change if I implement two-way errors or allow clustering at a higher dimension in the future so is not super robust). I am not familiar with HAC for panels since people usually care more about the independence assumption and clustering is heteroscedastic, autocorrelation, and cluster robust already. Still in the extreme case that independence is valid, but heteroscedasticity and autocorrelation are present why not offer a solution for that case.

How about adding named arguments for dof and cluster ids? Then it could work something like:

vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64};
    dof::Int64 = size(x,2),
    cl::Vector{UnitRange{Int64}} = map(elem -> elem:elem, 1:size(x, 1)))

Then I could make StatsBase.vcov(model::MyModel, variant = VCE) a wrapper around that? It would also work quite nicely with GLM by taking the values and then passing those to the same function.

gragusa · 2017-07-19T18:46:52Z

@Nosferican In HC4 I use _nobs(l::LinPredModel) = length(l.rr.y) you could define this for your model.

You do not need arguments if you can get dof and cl from types. As @nalimilan, this would be extremely un-julian thing to do.....

I can modify CRVE to use dof. Would this be enough?

For panel with fixed effect, HC is inconsistent, but could be corrected (and this can be done inside the API (obtain the HC and the apply the bias adjustment).

Nosferican · 2017-07-19T19:01:44Z

@gragusa dof cannot be implied from the model matrix for within estimators. The size of the model matrix will give you the incorrect model degree of freedom. The adjustment factor is the number of panels - 1 (to account for the degrees of freedom used for the group means). The clusters are also not implied from the model matrix. The alternative would be to dispatch different methods:

vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64})
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    dof::Int64)
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    cl::Vector{UnitRange{Int64}})
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    dof::Int64,
    cl::Vector{UnitRange{Int64}})

would this be the Julian-way? If such, I can adapt my code to eschew named arguments with default implied values and define different methods.

I think addressing the dof and cl should be sufficient to port the VCE estimates.

Nosferican · 2017-07-19T19:09:47Z

@nalimilan @gragusa...
Considering other packages besides mine and maximum code re-use... what about

vcov(model::StatsBase.RegressionModel, variant::Symbol)
vcov(model::StatsBase.RegressionModel, variant::Symbol, cl::Vector{UnitRange{Int64}})

where

ModelMatrix = StatsBase.designmatrix(model) # or design_matrix, etc.
wrkresid = StatsBase.residuals(model)
wrkwts = StatsBase.weights(model) # or the weights naming convention used by StatsBase
_nobs = StatsBase.nobs(model)
dof = StatsBase.dof(model)

Nosferican · 2017-11-10T16:16:17Z

For Julia 0.7 the migration to a DataFrames and StatsModels ecosystem will be completed. As part of the changes coming is the shift from the DataFrames.DataFrameRegressionModel which GLM and this package rely to StatsBase.RegressionModel inheritance StatsModels#32. I have added a few methods to StatsBase which will help the transition like coefnames and model_matrix (Commit #301). I would be happy to help in the transition so more people can use the package.

@lbittarello if Microeconometrics inheritances from StatsBase.RegressionModel it should play well likewise.

gragusa · 2017-11-10T17:17:13Z

It would be great to have some help. I have not touch CovarianceMatrices in a long time and I have not followed StatsModels.jl evolution. I will try to make a overhaul of the types dependency and also improve the API. But feel free to chime in ....

On Fri, Nov 10, 2017 at 5:16 PM José Bayoán Santiago Calderón (史志鼎) < ***@***.***> wrote: For Julia 0.7 the migration to a DataFrames and StatsModels ecosystem will be completed. As part of the changes coming is the shift from the DataFrames.DataFrameRegressionModel which GLM and this package rely to StatsBase.RegressionModel inheritance StatsModels#32 <JuliaStats/StatsModels.jl#32>. I have added a few methods to StatsBase which will help the transition like coefnames and model_matrix (Commit #301 <JuliaStats/StatsBase.jl#301>). I would be happy to help in the transition so more people can use the package. @lbittarello <https://github.com/lbittarello> if Microeconometrics <http://Microeconometrics.jl> inheritances from StatsBase.RegressionModel it should play well likewise. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#19 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAfRgsKB5l4vdCHWWnWx93uKiPTyId1sks5s1HbRgaJpZM4OabhV> .

-- Giuseppe Ragusa

Nosferican · 2017-11-10T19:05:06Z

I can help with the API and HC / CRVE... HAC is outside my area of expertise so I rather let you handle everything there for now.
I am envisioning something in the likes of:

mutable struct MyPkgStruct<: StatsBase.RegressionModel
    this has the various fields for each package, but has the following methods...
end

X = StatsBase.model_matrix(obj::StatsBase.RegressionModel)
û = StatsBase.residuals(obj::StatsBase.RegressionModel)
rdf = StatsBase.dof_residual(obj::StatsBase.RegressionModel)

vce = get_vce(obj::StatsBase.RegressionModel)
clusters = clusters(obj::StatsBase.RegressionModel)
weights = get_weights(obj::StatsBase.RegressionModel)
# Should the matrix already be weighted?
# These could be requested through the new StatsBase.params(obj::StatsBase.RegressionModel)

where

vce::CovarianceMatrices.RobustVariance # Maybe include OLS
X::AbstractMatrix
û::AbstractVector
rdf::Real
clusters::AbstractMatrix
weights::AbstractVector

Then we can provide a function for packages to use:

function estimate_vcov(obj::StatsBase.RegressionModel)
    X = StatsBase.model_matrix(obj)
    û = StatsBase.residuals(obj)
    rdf = StatsBase.dof_residual(obj)
    vce = get_vce(obj)
    clusters = clusters(obj)
    weights = get_weights(obj)
    return vcov(vce, X, û, weights, clusters, rdf)
end
function estimate_vcov!(obj::StatsBase.RegressionModel, fieldname::Symbol)
    setfield!(obj, fieldname, estimate_vcov(obj))
    return
end

which updates their struct with the vcov::AbstractMatrix.

What are your thoughts?

* add params * add params\! * update docstrings

Nosferican · 2019-03-12T21:10:34Z

Would be nice to have before the transition in StatsModels.jl Terms 2.0 era.

Can you list the requirements you might need?

weights
informationmatrix
residuals
leverage
dof_residual
deviance
Do the HAC estimators require a time indicator?

gragusa · 2019-06-03T14:56:29Z

Late reply —- It depends on how we want to implement this. The minimal interface would be for statsmodel to provide - pseudohessian - pseudoscore Or something like this (this are the bread and meat of robust variances). Then packages should take care of implementing those. Alternatively, the methods you propose are going to be useful if CovarianceMatrices has to implement variances for methods.

On Tue, 12 Mar 2019 at 22:10, José Bayoán Santiago Calderón (史志鼎) < ***@***.***> wrote: Would be nice to have before the transition in StatsModels.jl Terms 2.0 era. Can you list the requirements you might need? - weights - informationmatrix - residuals - leverage - dof_residual - deviance Do the HAC estimators require a time indicator? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#19 (comment)>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAfRgvWO4_739VUv0jfvysU4CXy-KtLqks5vWBfLgaJpZM4OabhV> .

-- Giuseppe Ragusa

Nosferican · 2019-06-03T15:03:23Z

The API has the methods,
score(obj::StatisticalModel) and informationmatrix(model::StatisticalModel; expected::Bool = true). That should be the minimum interface. The additional accessors dof_residual, weights, and deviance should be more than enough for most cases (i.e., not including time or clusters).

gragusa · 2019-06-03T15:07:26Z

Does GLM implement those? If not we should write them and make a PR.

On Mon, 3 Jun 2019 at 17:03, José Bayoán Santiago Calderón < ***@***.***> wrote: The API has the methods, score(obj::StatisticalModel) and informationmatrix(model::StatisticalModel; expected::Bool = true). That should be the minimum interface. The additional accessors dof_residual, weights, deviance should be more than enough for most cases (e.g., not including time or clusters). — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#19?email_source=notifications&email_token=AAD5DASXRZ46FTAGEGO7MYTPYUXDXA5CNFSM4DTJXBK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODWZWCMQ#issuecomment-498295090>, or mute the thread <https://github.com/notifications/unsubscribe-auth/AAD5DAURXDYR7OTT6W5BHW3PYUXDXANCNFSM4DTJXBKQ> .

-- Giuseppe Ragusa

Nosferican mentioned this issue Aug 30, 2017

Possible Playmate lbittarello/Microeconometrics.jl#1

Closed

Nosferican referenced this issue in JuliaStats/StatsBase.jl Nov 10, 2017

add params (#274)

6df21f0

* add params * add params\! * update docstrings

Nosferican mentioned this issue Nov 10, 2017

add params JuliaStats/StatsBase.jl#274

Merged

Nosferican mentioned this issue Mar 16, 2019

Implementing missing RegressionModel API JuliaStats/GLM.jl#292

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Better Integration with StatsBase.RegressionModel #19

Better Integration with StatsBase.RegressionModel #19

Nosferican commented Jul 17, 2017

nalimilan commented Jul 19, 2017

Nosferican commented Jul 19, 2017 •

edited

Loading

nalimilan commented Jul 19, 2017 •

edited

Loading

nalimilan commented Jul 19, 2017

gragusa commented Jul 19, 2017

Nosferican commented Jul 19, 2017

Nosferican commented Jul 19, 2017

gragusa commented Jul 19, 2017

Nosferican commented Jul 19, 2017

Nosferican commented Jul 19, 2017

Nosferican commented Nov 10, 2017

gragusa commented Nov 10, 2017 via email

Nosferican commented Nov 10, 2017 •

edited

Loading

Nosferican commented Mar 12, 2019

gragusa commented Jun 3, 2019 via email

Nosferican commented Jun 3, 2019 •

edited

Loading

gragusa commented Jun 3, 2019 via email

Better Integration with StatsBase.RegressionModel #19

Better Integration with StatsBase.RegressionModel #19

Comments

Nosferican commented Jul 17, 2017

nalimilan commented Jul 19, 2017

Nosferican commented Jul 19, 2017 • edited Loading

nalimilan commented Jul 19, 2017 • edited Loading

nalimilan commented Jul 19, 2017

gragusa commented Jul 19, 2017

Nosferican commented Jul 19, 2017

Nosferican commented Jul 19, 2017

gragusa commented Jul 19, 2017

Nosferican commented Jul 19, 2017

Nosferican commented Jul 19, 2017

Nosferican commented Nov 10, 2017

gragusa commented Nov 10, 2017 via email

Nosferican commented Nov 10, 2017 • edited Loading

Nosferican commented Mar 12, 2019

gragusa commented Jun 3, 2019 via email

Nosferican commented Jun 3, 2019 • edited Loading

gragusa commented Jun 3, 2019 via email

Nosferican commented Jul 19, 2017 •

edited

Loading

nalimilan commented Jul 19, 2017 •

edited

Loading

Nosferican commented Nov 10, 2017 •

edited

Loading

Nosferican commented Jun 3, 2019 •

edited

Loading