Variance Inflation Factor #11
@@ -24,3 +24,17 @@ adjr2
````markdown
```@docs
icc
```

## Variance Inflation Factor

```@docs
vif
```

```@docs
termnames
```

```@docs
gvif
```
````
@@ -0,0 +1,124 @@
```julia
_rename_intercept(s) = strip(s) == "1" ? "(Intercept)" : s

"""
    termnames(model)

Return the names associated with (fixed effects) terms in a model.

For models with only continuous predictors, this is the same as
`coefnames(model)`.
For models with categorical predictors, the returned names reflect
the categorical predictor and not the coefficients resulting from
the choice of contrast coding.
"""
termnames(model) = _rename_intercept.(string.(_terms(model)))

_terms(model) = collect(formula(model).rhs.terms)
_terms(model::MixedModel) = collect(formula(model).rhs[1].terms)

"""
    vif(m::RegressionModel)

Compute the variance inflation factor (VIF).

Returns a vector of inflation factors computed for each coefficient except
for the intercept.
In other words, the corresponding coefficients are `coefnames(m)[2:end]`.

The [variance inflation factor (VIF)](https://en.wikipedia.org/wiki/Variance_inflation_factor) measures
the increase in the variance of a parameter's estimate in a model with multiple parameters relative to
the variance of a parameter's estimate in a model containing only that parameter.

See also [`coefnames`](@ref), [`gvif`](@ref).

!!! warning
    This method will fail if there is (numerically) perfect multicollinearity,
    i.e. rank deficiency (in the fixed effects). In that case, though, the VIF
    isn't particularly informative anyway.
"""
function vif(m::RegressionModel)
    mm = Statistics.cov2cor!(vcov(m), stderror(m))
    all(==(1), view(modelmatrix(m), :, 1)) ||
        throw(ArgumentError("VIF only defined for models with an intercept term"))
    mm = @view mm[2:end, 2:end]
    size(mm, 2) > 1 ||
        throw(ArgumentError("VIF not meaningful for models with only one non-intercept term"))
    # NB: The correlation matrix is positive definite and hence invertible
    # unless there is perfect rank deficiency, hence the warning.
    # NB: The determinant technique used for the GVIF could also be applied
    # columnwise (instead of term-wise) here, but it wouldn't really
    # be any more efficient because this is a Symmetric matrix and computing
    # all those determinants has its cost. The determinants also hint at
    # how you could show equivalency, if you remember that inversion is solving
    # a linear system and that Cramer's rule -- which uses determinants --
    # can also solve a linear system.
    # So we want diag(inv(mm)), but directly computing inverses is generally bad.
    # That said, these are typically small-ish matrices and this is simple.
    return diag(inv(mm))
end
```
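As a usage sketch of the new function (hypothetical simulated data; the variable names `x1`, `x2`, `y` are made up, and `vif` is assumed to be available from this package):

```julia
using DataFrames, GLM, LinearAlgebra, MixedModelsExtras, Random, Statistics

rng = MersenneTwister(42)
x1 = randn(rng, 100)
x2 = x1 .+ 0.5 .* randn(rng, 100)   # deliberately collinear with x1
y = 1 .+ x1 .+ x2 .+ randn(rng, 100)
df = DataFrame(; y, x1, x2)

m = lm(@formula(y ~ 1 + x1 + x2), df)
vif(m)   # one inflation factor per non-intercept coefficient

# for an OLS model this agrees with the diagonal of the inverted
# predictor correlation matrix (cf. the test suite below)
diag(inv(cor(modelmatrix(m)[:, 2:end])))
```

With only two (collinear) predictors, both VIFs equal `1 / (1 - r^2)` for their pairwise correlation `r`, so the two entries coincide.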
```julia
"""
    gvif(m::RegressionModel; scale=false)

Compute the generalized variance inflation factor (GVIF).

If `scale=true`, then each GVIF is scaled by the degrees of freedom
for (number of coefficients associated with) the predictor: ``GVIF^{1/(2 \\cdot df)}``.

Returns a vector of inflation factors computed for each term except
for the intercept.
In other words, the corresponding terms are `termnames(m)[2:end]`.

The [generalized variance inflation factor (GVIF)](https://doi.org/10.2307/2290467)
measures the increase in the variance of a (group of) parameter's estimate in a model
with multiple parameters relative to the variance of a parameter's estimate in a
model containing only that parameter. For continuous, numerical predictors, the GVIF
is the same as the VIF, but for categorical predictors, the GVIF provides a single
number for the entire group of contrast-coded coefficients associated with a
categorical predictor.

See also [`termnames`](@ref), [`vif`](@ref).

!!! warning
    This method will fail if there is (numerically) perfect multicollinearity,
    i.e. rank deficiency (in the fixed effects). In that case, though, the GVIF
    isn't particularly informative anyway.

## References

Fox, J., & Monette, G. (1992). Generalized Collinearity Diagnostics.
Journal of the American Statistical Association, 87(417), 178. doi:10.2307/2290467
"""
function gvif(m::RegressionModel; scale=false)
    mm = StatsBase.cov2cor!(vcov(m), stderror(m))

    StatsModels.hasintercept(formula(m)) ||
        throw(ArgumentError("GVIF only defined for models with an intercept term"))
    mm = @view mm[2:end, 2:end]
    size(mm, 2) > 1 ||
        throw(ArgumentError("GVIF not meaningful for models with only one non-intercept term"))

    tn = @view termnames(m)[2:end]
    trms = @view _terms(m)[2:end]

    df = width.(trms)
    vals = zeros(eltype(mm), length(tn))
    # benchmarking shows that adding in Symmetric() here before det()
    # actually slows things down a bit
    detmm = det(mm)
    acc = 0
    for idx in axes(vals, 1)
        wt = df[idx]
        trm_only = [acc < i <= (acc + wt) for i in axes(mm, 2)]
        trm_excl = .!trm_only
        vals[idx] = det(view(mm, trm_only, trm_only)) *
                    det(view(mm, trm_excl, trm_excl)) /
                    detmm
        acc += wt
    end

    if scale
        vals .= vals .^ (1 ./ (2 .* df))
    end
    return vals
end
```

> **Review comment on lines +114 to +116:** I'd think this is prone to over- and underflow. Probably safer to compute it as
>
> ```julia
> exp(logdet(view(mm, trm_only, trm_only)) +
>     logdet(view(mm, trm_excl, trm_excl)) -
>     logdetmm)
> ```
>
> **Owner/Author:** good point, changed in #16
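The reviewer's overflow concern can be illustrated numerically: each determinant in the ratio can under- or overflow for larger matrices even when the ratio itself is moderate, while the `logdet` formulation stays in a safe range. A standalone toy sketch (not the package code; the equicorrelated matrix here is made up for illustration):

```julia
using LinearAlgebra

# toy SPD "correlation" matrix whose determinant underflows in Float64:
# eigenvalues are 0.1 (399 times) and 1 + 399 * 0.9, so det ≈ 1e-396
n, ρ = 400, 0.9
mm = fill(ρ, n, n) + (1 - ρ) * I
sub = view(mm, 1:(n - 1), 1:(n - 1))

# ratio via plain determinants: both det(sub) and det(mm) underflow
# to 0.0, so the ratio is NaN
det(sub) / det(mm)

# the same ratio via logdet stays finite (about 10 for this matrix)
exp(logdet(sub) - logdet(mm))
```

This is exactly the failure mode the `exp(logdet(...) + logdet(...) - logdet(...))` form in #16 avoids.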
@@ -1,5 +1,9 @@
```toml
[deps]
DataFrames = "a93c6f00-e57d-5684-b7b6-d8193f3e46c0"
GLM = "38e38edf-8417-5370-95a0-9cbb8c7f171a"
LinearAlgebra = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e"
MixedModels = "ff71e718-51f3-5ec2-a782-8ffcbfa3c316"
RDatasets = "ce6b1742-4840-55fa-b093-852dadbb1d8b"
Statistics = "10745b16-79ce-11e8-11f9-7d13ad32a3b2"
StatsBase = "2913bbd2-ae8a-5f71-8c99-4fb6c76f3a91"
Test = "8dfed614-e22c-5e08-85e1-65c5234f0b40"
```
@@ -8,3 +8,7 @@ end
```julia
@testset "r2" begin
    include("r2.jl")
end

@testset "VIF" begin
    include("vif.jl")
end
```
@@ -0,0 +1,61 @@
```julia
using GLM
using LinearAlgebra
using MixedModels
using MixedModelsExtras
using StatsBase
using Test

using MixedModels: dataset
using RDatasets: dataset as rdataset

progress = false

@testset "LMM" begin
    fm0 = fit(MixedModel, @formula(reaction ~ 0 + days + (1 | subj)), dataset(:sleepstudy);
              progress)
    fm1 = fit(MixedModel, @formula(reaction ~ 1 + days + (1 | subj)), dataset(:sleepstudy);
              progress)
    fm2 = fit(MixedModel, @formula(reaction ~ 1 + days * days^2 + (1 | subj)),
              dataset(:sleepstudy); progress)

    ae_int = ArgumentError("VIF only defined for models with an intercept term")
    ae_nterm = ArgumentError("VIF not meaningful for models with only one non-intercept term")
    @test_throws ae_int vif(fm0)
    @test_throws ae_nterm vif(fm1)

    mm = @view cor(modelmatrix(fm2))[2:end, 2:end]
    # this is a slightly different way to do the same
    # computation. I've verified that doing it this way
    # in R gives the same answers as car::vif.
    # Note that computing the same model from scratch in R
    # gives different VIFs ([70.41934, 439.10096, 184.00107]),
    # but that model is a slightly different fit than the Julia
    # one and that has knock-on effects.
    @test isapprox(vif(fm2), diag(inv(mm)))

    # since these are all continuous predictors, the results should be the same,
    # but gvif uses a very different computational method!
    @test isapprox(vif(fm2), gvif(fm2))
    # the scale factor is gvif^(1/(2df)),
    # so just sqrt.(vif) when everything is continuous
    @test isapprox(gvif(fm2; scale=true),
                   sqrt.(gvif(fm2)))
end

@testset "GVIF and RegressionModel" begin
    duncan = rdataset("car", "Duncan")

    lm1 = lm(@formula(Prestige ~ 1 + Income + Education), duncan)
    @test termnames(lm1) == coefnames(lm1)
    vif_lm1 = vif(lm1)

    # values here taken from car
    @test isapprox(vif_lm1, [2.1049, 2.1049]; atol=1e-5)
    @test isapprox(vif_lm1, gvif(lm1))

    lm2 = lm(@formula(Prestige ~ 1 + Income + Education + Type), duncan)
    @test termnames(lm2) == ["(Intercept)", "Income", "Education", "Type"]
    @test isapprox(gvif(lm2), [2.209178, 5.297584, 5.098592]; atol=1e-5)
    @test isapprox(gvif(lm2; scale=true),
                   [1.486330, 2.301648, 1.502666]; atol=1e-5)
end
```

> **Contributor:** FWIW all of your […]
>
> **Owner/Author:** I know, but JuliaFormatter struggles with the kwarg in that case. 😄
> **Reviewer:** Just using `Symmetric` will make `det` use the Bunch-Kaufman factorization for indefinite matrices instead of the LU, and typically the LU will be more optimized. What should be faster is `det(cholesky(Symmetric(mm)))`.
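The reviewer's suggestion can be checked with a standalone benchmark sketch (timings are machine-dependent and not from this PR; BenchmarkTools is assumed to be available, and the matrix here is a made-up well-conditioned correlation matrix):

```julia
using BenchmarkTools, LinearAlgebra, Random

# build a small SPD correlation-like matrix from a tall random matrix
# (tall => well-conditioned), as a stand-in for the mm in gvif
mm = let A = randn(MersenneTwister(1), 200, 50)
    C = A'A
    s = sqrt.(diag(C))
    Symmetric(C ./ (s * s'))
end

# all three routes compute the same determinant
@assert det(Matrix(mm)) ≈ det(mm) ≈ det(cholesky(mm))

@btime det(Matrix($mm))     # LU on a plain dense copy
@btime det($mm)             # Bunch-Kaufman, triggered by the Symmetric wrapper
@btime det(cholesky($mm))   # Cholesky, valid here because mm is SPD
```

Since the matrix is positive definite by construction, the Cholesky route is admissible; the point of the comparison is which factorization `det` ends up dispatching to.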