Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better Integration with StatsBase.RegressionModel #19

Open
Nosferican opened this issue Jul 17, 2017 · 17 comments
Open

Better Integration with StatsBase.RegressionModel #19

Nosferican opened this issue Jul 17, 2017 · 17 comments

Comments

@Nosferican
Copy link
Contributor

Hi @gragusa. I am finishing some of the decisions for my longitudinal analysis package. The structs will all be children of StatsBase.RegressionModel and inherit from StatsBase.StatisticalModel. All methods for both abstract types will be implemented such that StatsBase.residuals(model) will return the residuals and so on with coef, dof, dof_residual, etc. In addition, all other fields which do not have a StatsBase.RegressionModel method to access are fully accessible through Base.get(model, value::Symbol). For example, if one wants to get the MRSS one just needs to call get(model, :MRSS), etc. Let me know if this scheme will work for you.

@nalimilan
Copy link

I don't think the get approach is very Julian. Why not just define functions for these? Depending on how general they are, we could imagine defining some of them in StatsBase.

@Nosferican
Copy link
Contributor Author

Nosferican commented Jul 19, 2017

I have the get since seemed like a more flexible approach than getfield when the structure or names of the fields is undetermined. It works similar to the get for nullable which returns the value. It expands Base.get on a struct it is not defined for so it respects Base. Some of the fields I used get for where:

  • X: for the design matrix (in 2SLS models it refers to the exogenous variables and the predicted values of the endogenous variables)
  • Bread: inv(X' * X) used in many calculation and especially for the variance-covariance estimations. It is cached as to eliminate the need to inverse the matrix repeatedly.
  • MRSS: RSS / rdf (I believe StatsBase could have some for unbiased deviance measure or so.
  • N: Number of observations used in the estimation. It differs from nobs for certain models such as the between estimator or first-difference.
  • n: Number of panels (could be adapted to number of Primary Survey Units as well)
  • Estimator: The estimator used.
  • PID: Indicators of which observations belong to each panel
  • TID: Indicators of which observations belong to each time period
  • T: Length of panels
  • Idiosyncratic: Idiosyncratic error component (only random effects use this one)
  • Individual: Individual error component (only random effects use this one)
  • θ: Parameter for quasi-demeaning transformation (only random effects use this one)
  • z: Matrix of column wise endogenous variables
  • Z: Matrix of instruments
    Some refer to the DataFrames framework such as:
  • Formula: The DataFrames.Formula
  • IV: DataFrames.Formula (endogenous ~ instruments)
  • Varlist: Number of variables estimated (excludes those that were dropped due to multicolinearity)
  • Intercept: If the model has an intercept. Some estimators force the intercept even if it is requested in the formula to not to be used. It is not used in the intermediate steps, but added in the final model.

Some of these could potentially be added to StatsBase and I would delighted to have those. However, some are not shared for all StatsBase.RegressionModel and would not want to impose values on further development if they are more harmful than beneficial. Technically every regression model has a PID and TID which in unique measurements models is just the number of the observation for PID and ones for TID. In that framework T would just be ones for each panel/PSU.

Maybe some that could be added to StatsBase are: MRSS (unbiased deviance?), N (in order to have nobs better defined*), X (design matrix?)... Bread could be something to consider since many implementation do cache this value and its quite useful for many applications. Estimator can be generalized. For example, GLM (defines and estimator as a Family and Link).

Formula and Varlists are the elements kept from DataFrames. If using the DataFrames.DataFramesRegressionModel then those could potentially be functions of the DataFrames.ModelFrame and design matrix from DataFrames.ModelMatrix.

@nalimilan
Copy link

nalimilan commented Jul 19, 2017

Thanks for the list. get is not a good solution since it will be type-unstable.

Some of the cases you listed could be variants of existing functions, e.g. n and N could be obtained via nobs with a symbol argument. intercept could be obtained using coef or params. Others would deserve new functions in the API, like model matrix, formula... Not sure about others.

@nalimilan
Copy link

Also, it looks like lots of these concepts are common with MixedModels.jl. Have you looked ta that package? Cc: @dmbates

@gragusa
Copy link
Owner

gragusa commented Jul 19, 2017

I agree with @nalimilan that get is not the best solution here. @Nosferican take a look at https://github.com/gragusa/InstrumentalVariables.jl. There I have a small IV model where only three new functions have to be defined to use CovarianceMatrices. I understand that more complex models have more complex requirements. The one that I find pressing is nobs that means different things in different context (weighted/unweighted, data size, estimation sample size, etc.).

Wouldn't be enough to have instead of nobs something wrknobs or effnobs which would return "Number of observations used in the estimation"?

More generally, I think I am not getting part of the discussion (totally my fault). Suppose you have a linear model type. I would define the type as follow

struct MyLinPredModel <: LinPredModel
end

struct MyModel 
 model::MyLinPredModel
 ...
end

You need then to define the following methods for MyLinPredModel

function ModelMatrix(::MyLinPredModel)
end

function wrkresid(::MyLinPredModel)
end

function wrkwts(::MyLinPredModel)
end

function nobs(::MyLinPredModel)
end

vcov(x::MyModel, k::RobustVariance) = vcov(x.model, k)

Then everything should work. Could this potentially work?

@Nosferican
Copy link
Contributor Author

My implementation has a very naive way of making get type-stable, but I am sure there are better ways. My get implementation is:

abstract type ModelStatistic end
struct ModelStatistic_RSS
    value::Float64
end
struct Model <: StatsBase.RegressionModel
    model_statistic::Dict{Symbol,ModelStatistic}
end
function Base.get(obj::ModelStatistic)
    getfield(obj, :value)
end
function Base.get(obj::Model, value::Symbol)
    model_stats = getfield(obj, :model_stats)
    get(model_stats, value)
end

I came up with this solution which has its pros and drawbacks. For example, it is strict in the sense no longer I can make calculations between RSS and rdf (dof_residuals) since these are ModelStatistics. Ipso facto,

struct ModelStatistic_MRSS <: ModelStatistic
    value::Float64
    function ModelStatistic_MRSS(RSS::ModelStatistic_RSS, rdf::ModelStatistic_rdf)
        RSS = get(RSS)
        rdf = get(rdf)
        RSS / rdf
    end
end

The pros is that I am less likely to make an error, and can use the inner constructors to verify properties of the statistics. It also allows the model_stats dictionary to be more efficient than using {Symbol, Any}. Any thoughts on this approach are super welcomed since I think I have a very basic understanding of Julia in general (basically what I read from the manual).

Aye. MixedModels is very similar to panel econometrics. Some of the estimators are especial cases. The difference for those estimators is on how to estimate the error components. The statistics way use MLE such as in MixedModels while in econometrics we tend to use GLS framework since it requires less strict assumptions. the two approaches also differ quite a bit in unbalanced panels which are very common in economics.
The MixedModels uses a modified Formula for including endogenous variables. I decided to keep them as two different DataFrames.Formula, but if a decision has been reached on expanding DataFrames.Formula or StatsModels.Formula I could adapt to follow it.

The individual, idiosyncratic, and thetas are bunched in a dictionary as ErrorComponent in R plm package. I could follow that strategy and implement the functions to access them only for the structures that have it.

@Nosferican
Copy link
Contributor Author

@gragusa I think that's a good overall scheme. The only issues I can think of at least for HC estimators is for example with HC1 or HC4 which use uses the dof value and that one differs from the one implied by size(ModelMatrix, 2). However, for fixed effects, the HC estimators are not consistent so I only allow users to specify :OLS or :PID (clustered at panels); this decision is the one implemented in Stata since Stata 12 or 13 if not mistaken. For CRVE (HC0/HC1), the code could adjust the dof by adding the ncluster - 1 to it (this would change if I implement two-way errors or allow clustering at a higher dimension in the future so is not super robust). I am not familiar with HAC for panels since people usually care more about the independence assumption and clustering is heteroscedastic, autocorrelation, and cluster robust already. Still in the extreme case that independence is valid, but heteroscedasticity and autocorrelation are present why not offer a solution for that case.

How about adding named arguments for dof and cluster ids? Then it could work something like:

vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64};
    dof::Int64 = size(x,2),
    cl::Vector{UnitRange{Int64}} = map(elem -> elem:elem, 1:size(x, 1)))

Then I could make StatsBase.vcov(model::MyModel, variant = VCE) a wrapper around that? It would also work quite nicely with GLM by taking the values and then passing those to the same function.

@gragusa
Copy link
Owner

gragusa commented Jul 19, 2017

@Nosferican In HC4 I use _nobs(l::LinPredModel) = length(l.rr.y) you could define this for your model.

You do not need arguments if you can get dof and cl from types. As @nalimilan, this would be extremely un-julian thing to do.....

I can modify CRVE to use dof. Would this be enough?

For panel with fixed effect, HC is inconsistent, but could be corrected (and this can be done inside the API (obtain the HC and the apply the bias adjustment).

@Nosferican
Copy link
Contributor Author

@gragusa dof cannot be implied from the model matrix for within estimators. The size of the model matrix will give you the incorrect model degree of freedom. The adjustment factor is the number of panels - 1 (to account for the degrees of freedom used for the group means). The clusters are also not implied from the model matrix. The alternative would be to dispatch different methods:

vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64})
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    dof::Int64)
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    cl::Vector{UnitRange{Int64}})
vcov(x::Matrix{Float64},
    k::RobustVariance,
    û::Vector{Float64},
    wrkwts::Vector{Float64},
    dof::Int64,
    cl::Vector{UnitRange{Int64}})

would this be the Julian-way? If such, I can adapt my code to eschew named arguments with default implied values and define different methods.

I think addressing the dof and cl should be sufficient to port the VCE estimates.

@Nosferican
Copy link
Contributor Author

@nalimilan @gragusa...
Considering other packages besides mine and maximum code re-use... what about

vcov(model::StatsBase.RegressionModel, variant::Symbol)
vcov(model::StatsBase.RegressionModel, variant::Symbol, cl::Vector{UnitRange{Int64}})

where

ModelMatrix = StatsBase.designmatrix(model) # or design_matrix, etc.
wrkresid = StatsBase.residuals(model)
wrkwts = StatsBase.weights(model) # or the weights naming convention used by StatsBase
_nobs = StatsBase.nobs(model)
dof = StatsBase.dof(model)

@Nosferican
Copy link
Contributor Author

For Julia 0.7 the migration to a DataFrames and StatsModels ecosystem will be completed. As part of the changes coming is the shift from the DataFrames.DataFrameRegressionModel which GLM and this package rely to StatsBase.RegressionModel inheritance StatsModels#32. I have added a few methods to StatsBase which will help the transition like coefnames and model_matrix (Commit #301). I would be happy to help in the transition so more people can use the package.

@lbittarello if Microeconometrics inheritances from StatsBase.RegressionModel it should play well likewise.

@gragusa
Copy link
Owner

gragusa commented Nov 10, 2017 via email

@Nosferican
Copy link
Contributor Author

Nosferican commented Nov 10, 2017

I can help with the API and HC / CRVE... HAC is outside my area of expertise so I rather let you handle everything there for now.
I am envisioning something in the likes of:

mutable struct MyPkgStruct<: StatsBase.RegressionModel
    this has the various fields for each package, but has the following methods...
end

X = StatsBase.model_matrix(obj::StatsBase.RegressionModel)
û = StatsBase.residuals(obj::StatsBase.RegressionModel)
rdf = StatsBase.dof_residual(obj::StatsBase.RegressionModel)

vce = get_vce(obj::StatsBase.RegressionModel)
clusters = clusters(obj::StatsBase.RegressionModel)
weights = get_weights(obj::StatsBase.RegressionModel)
# Should the matrix already be weighted?
# These could be requested through the new StatsBase.params(obj::StatsBase.RegressionModel)

where

vce::CovarianceMatrices.RobustVariance # Maybe include OLS
X::AbstractMatrix
û::AbstractVector
rdf::Real
clusters::AbstractMatrix
weights::AbstractVector

Then we can provide a function for packages to use:

function estimate_vcov(obj::StatsBase.RegressionModel)
    X = StatsBase.model_matrix(obj)
    û = StatsBase.residuals(obj)
    rdf = StatsBase.dof_residual(obj)
    vce = get_vce(obj)
    clusters = clusters(obj)
    weights = get_weights(obj)
    return vcov(vce, X, û, weights, clusters, rdf)
end
function estimate_vcov!(obj::StatsBase.RegressionModel, fieldname::Symbol)
    setfield!(obj, fieldname, estimate_vcov(obj))
    return
end

which updates their struct with the vcov::AbstractMatrix.

What are your thoughts?

Nosferican referenced this issue in JuliaStats/StatsBase.jl Nov 10, 2017
* add params

* add params\!

* update docstrings
@Nosferican
Copy link
Contributor Author

Would be nice to have before the transition in StatsModels.jl Terms 2.0 era.

Can you list the requirements you might need?

  • weights
  • informationmatrix
  • residuals
  • leverage
  • dof_residual
  • deviance
    Do the HAC estimators require a time indicator?

@gragusa
Copy link
Owner

gragusa commented Jun 3, 2019 via email

@Nosferican
Copy link
Contributor Author

Nosferican commented Jun 3, 2019

The API has the methods,
score(obj::StatisticalModel) and informationmatrix(model::StatisticalModel; expected::Bool = true). That should be the minimum interface. The additional accessors dof_residual, weights, and deviance should be more than enough for most cases (i.e., not including time or clusters).

@gragusa
Copy link
Owner

gragusa commented Jun 3, 2019 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants