# Application: Heterogeneous Effect of Gender on Wage Using Double Lasso

We use US census data from 2012 to jointly analyse the effect of gender on wages and the interaction of gender with other variables. The dependent variable is the logarithm of the wage, and the target variable is *female*, both on its own and interacted with the other variables. The remaining variables describe other socio-economic characteristics, e.g. marital status, education, and experience. For a detailed description of the variables, we refer to the help page.



This analysis allows a closer look at how gender discrimination relates to other socio-economic variables.
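
In symbols, the specification estimated in this notebook can be sketched as follows. The notation is introduced here for illustration and is not part of the dataset documentation: $x_{ij}$ stands for the $j$-th socio-economic characteristic of person $i$, and $g(\cdot)$ collects the control terms (in the formula used further down, the characteristics and their pairwise interactions, with $p = 15$).

$$
\log(\text{wage}_i) = \alpha\,\text{female}_i + \sum_{j=1}^{p} \beta_j\,(\text{female}_i \times x_{ij}) + g(x_i) + \varepsilon_i
$$

The target parameters are $\alpha$ and $\beta_1, \dots, \beta_p$, that is, one coefficient for *female* and one for each of its interactions.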



In [44]:
using RData, LinearAlgebra, GLM, DataFrames, Statistics, Random, Distributions, 
DataStructures, NamedArrays, PrettyTables, StatsModels, Combinatorics

import CodecBzip2

In [56]:
# Importing .Rdata file

cps2012 = load("../../../data/cps2012.RData")

Dict{String, Any} with 1 entry:
  "data" => 29217×23 DataFrame…

In [57]:
keys(cps2012)   # list the keys of the loaded Dict

KeySet for a Dict{String, Any} with 1 entry. Keys:
  "data"

In [59]:
cps2012 = cps2012["data"]

names(cps2012)

23-element Vector{String}:
 "year"
 "lnw"
 "female"
 "widowed"
 "divorced"
 "separated"
 "nevermarried"
 "hsd08"
 "hsd911"
 "hsg"
 "cg"
 "ad"
 "mw"
 "so"
 "we"
 "exp1"
 "exp2"
 "exp3"
 "exp4"
 "weight"
 "married"
 "ne"
 "sc"

In [60]:
# All combinations of the terms in x, of sizes 1 through n
combinations_upto(x, n) = Iterators.flatten(combinations(x, i) for i in 1:n)

# Expand (a + b + ...)^n into main effects and interactions of distinct terms
# (a term is never interacted with itself)
expand_exp(args, deg::ConstantTerm) =
    tuple(((&)(terms...) for terms in combinations_upto(args, deg.n))...)

# Make apply_schema expand `^` inside a @formula using the rule above
StatsModels.apply_schema(t::FunctionTerm{typeof(^)}, sch::StatsModels.Schema, ctx::Type) =
    apply_schema.(expand_exp(t.args_parsed...), Ref(sch), ctx)
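
To make the effect of this expansion concrete, the sketch below enumerates main effects and pairwise interactions for three of the control names using only Base Julia. The helper `pairwise_terms` is hypothetical: it stands in for what `combinations_upto(x, 2)` enumerates and does not touch StatsModels.

```julia
# Hypothetical helper (not part of the notebook): every single term plus every
# unordered pair of distinct terms, mimicking combinations_upto(x, 2).
function pairwise_terms(vars::Vector{String})
    terms = copy(vars)                              # main effects
    for i in 1:length(vars)-1, j in i+1:length(vars)
        push!(terms, vars[i] * " & " * vars[j])     # interaction of two distinct variables
    end
    return terms
end

terms = pairwise_terms(["widowed", "divorced", "exp1"])
println(terms)

# With 15 controls this gives 15 main effects + 105 pairs = 120 terms
println(length(pairwise_terms(string.("x", 1:15))))
```

Together with `female` and its 15 interactions, those 120 terms account for the 136 coefficient names reported further down.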

In [61]:
# Basic model 


reg = @formula(lnw ~ -1 + female + female&(widowed + divorced + separated + nevermarried +
hsd08 + hsd911 + hsg + cg + ad + mw + so + we + exp1 + exp2 + exp3) + (widowed +
divorced + separated + nevermarried + hsd08 + hsd911 + hsg + cg + ad + mw + so +
we + exp1 + exp2 + exp3)^2 )


formula_basic = apply_schema(reg, schema(reg, cps2012))


FormulaTerm
Response:
  lnw(continuous)
Predictors:
  0
  female(continuous)
  widowed(continuous)
  divorced(continuous)
  separated(continuous)
  nevermarried(continuous)
  hsd08(continuous)
  hsd911(continuous)
  hsg(continuous)
  cg(continuous)
  ad(continuous)
  mw(continuous)
  so(continuous)
  we(continuous)
  exp1(continuous)
  exp2(continuous)
  exp3(continuous)
  widowed(continuous) & divorced(continuous)
  widowed(continuous) & separated(continuous)
  widowed(continuous) & nevermarried(continuous)
  widowed(continuous) & hsd08(continuous)
  widowed(continuous) & hsd911(continuous)
  widowed(continuous) & hsg(continuous)
  widowed(continuous) & cg(continuous)
  widowed(continuous) & ad(continuous)
  widowed(continuous) & mw(continuous)
  widowed(continuous) & so(continuous)
  widowed(continuous) & we(continuous)
  widowed(continuous) & exp1(continuous)
  widowed(continuous) & exp2(continuous)
  widowed(continuous) & exp3(continuous)
  divorced(continuous) & separated(continuo

In [63]:
coefnames(formula_basic)

("lnw", Any["female", "widowed", "divorced", "separated", "nevermarried", "hsd08", "hsd911", "hsg", "cg", "ad"  …  "female & hsd911", "female & hsg", "female & cg", "female & ad", "female & mw", "female & so", "female & we", "female & exp1", "female & exp2", "female & exp3"])

In [64]:
Y = select(cps2012,:lnw)  # outcome variable
control = coefnames(formula_basic)[2]  # regressor names
names_col = Symbol.(control)  # convert strings to Symbols to use as column names

136-element Vector{Symbol}:
 :female
 :widowed
 :divorced
 :separated
 :nevermarried
 :hsd08
 :hsd911
 :hsg
 :cg
 :ad
 :mw
 :so
 :we
 ⋮
 Symbol("female & nevermarried")
 Symbol("female & hsd08")
 Symbol("female & hsd911")
 Symbol("female & hsg")
 Symbol("female & cg")
 Symbol("female & ad")
 Symbol("female & mw")
 Symbol("female & so")
 Symbol("female & we")
 Symbol("female & exp1")
 Symbol("female & exp2")
 Symbol("female & exp3")

In [65]:
X = StatsModels.modelmatrix(formula_basic,cps2012)

29217×136 Matrix{Float64}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  22.0  4.84    10.648
 1.0  0.0  0.0  0.0  0.0  0.0  1.0  0.0     0.0  0.0  30.0  9.0     27.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0   0.0  0.0      0.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0     0.0  0.0  14.0  1.96     2.744
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0   0.0  0.0      0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  …  0.0  0.0   0.0  0.0      0.0
 0.0  0.0  0.0  0.0  1.0  0.0  0.0  1.0     0.0  0.0   0.0  0.0      0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0   0.0  0.0      0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0   0.0  0.0      0.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  15.5  2.4025   3.72388
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  1.0  …  0.0  0.0   0.0  0.0      0.0
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0   7.0  0.49     0.343
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0   0.0  0.0      0.0
 ⋮            

In [66]:
X = DataFrame(X, names_col)

 Row │ female   widowed  divorced  separated  nevermarried  hsd08    hsd911   hsg
     │ Float64  Float64  Float64   Float64    Float64       Float64  Float64  Float64
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │     1.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0
   2 │     1.0      0.0       0.0        0.0           0.0      0.0      1.0      0.0
   3 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      1.0
   4 │     1.0      0.0       0.0        0.0           0.0      0.0      0.0      1.0
   5 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0
   6 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      1.0
   7 │     0.0      0.0       0.0        0.0           1.0      0.0      0.0      1.0
   8 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0
   9 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0
  10 │     1.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0


In [67]:
# Collect the indices of constant (zero-variance) columns

cons_column = Int[]

# gather every column whose variance is 0
for i in 1:size(X,2)
    if var(X[!,i]) == 0
        push!(cons_column, i)
    end
end
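
The same check can be written without an explicit loop. The toy matrix below is made up for illustration; a column that never varies carries no information for the regression, which is why such columns are dropped.

```julia
using Statistics

# Toy design matrix (illustrative values): column 2 is constant,
# so its sample variance is exactly zero.
A = [1.0 5.0 0.0;
     0.0 5.0 1.0;
     1.0 5.0 1.0]

# Indices of the columns with zero variance
const_cols = findall(j -> var(A[:, j]) == 0, 1:size(A, 2))

# Keep only the non-constant columns
A_kept = A[:, setdiff(1:size(A, 2), const_cols)]

println(const_cols)
println(size(A_kept))
```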


In [68]:
# Drop constant columns 

names(X)[cons_column]
select!(X, Not(names(X)[cons_column]))

 Row │ female   widowed  divorced  separated  nevermarried  hsd08    hsd911   hsg
     │ Float64  Float64  Float64   Float64    Float64       Float64  Float64  Float64
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │     1.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0
   2 │     1.0      0.0       0.0        0.0           0.0      0.0      1.0      0.0
   3 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      1.0
   4 │     1.0      0.0       0.0        0.0           0.0      0.0      0.0      1.0
   5 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0
   6 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      1.0
   7 │     0.0      0.0       0.0        0.0           1.0      0.0      0.0      1.0
   8 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0
   9 │     0.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0
  10 │     1.0      0.0       0.0        0.0           0.0      0.0      0.0      0.0


In [70]:
# Demean function: subtract each column's mean
function desv_mean(a)
    a = Matrix(a)   # DataFrame to matrix
    A = mean(a, dims = 1)   # 1×p row of column means
    M = zeros(Float64, size(a,1), size(a,2))   # same shape as the input

    for i in 1:size(a,2)
        M[:,i] = a[:,i] .- A[i]
    end

    return M
end


# Model matrix, demeaned

X = DataFrame(desv_mean(X), names(X)) # back to a DataFrame with the original column names

 Row │ female     widowed      divorced   separated   nevermarried  hsd08       hsd911
     │ Float64    Float64      Float64    Float64     Float64       Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │  0.571243  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789
   2 │  0.571243  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072    0.977821
   3 │ -0.428757  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789
   4 │  0.571243  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789
   5 │ -0.428757  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789
   6 │ -0.428757  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789
   7 │ -0.428757  -0.00797481  -0.113393  -0.0165999      0.843653  -0.0041072  -0.0221789
   8 │ -0.428757  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789
   9 │ -0.428757  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789
  10 │  0.571243  -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789
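
Column-wise demeaning can also be expressed with broadcasting instead of a loop. The following is a minimal sketch on a made-up 3×2 matrix, not a replacement for the function used in this notebook.

```julia
using Statistics

# Toy matrix (illustrative values)
A = [1.0 10.0;
     2.0 20.0;
     3.0 30.0]

# mean(A, dims = 1) is a 1×2 row of column means; `.-` broadcasts it down the rows
A_centered = A .- mean(A, dims = 1)

println(A_centered)
println(mean(A_centered, dims = 1))   # column means are now zero
```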


In [71]:
# Indices of the columns whose names contain "female"

index = []

for i in 1:size(X,2)
    if contains(names(X)[i], "female")
        append!(index, i)
    end
end

In [72]:
index

16-element Vector{Any}:
   1
 102
 103
 104
 105
 106
 107
 108
 109
 110
 111
 112
 113
 114
 115
 116
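
The selection of target columns by name can be sketched more compactly with `findall`. The column names below are a shortened, illustrative subset, and `female_idx` and `control_names` are hypothetical names used only in this example.

```julia
# Illustrative subset of column names
col_names = ["female", "widowed", "exp1", "female & widowed", "female & exp1"]

# Positions of every name that mentions "female" (returns a Vector{Int})
female_idx = findall(n -> occursin("female", n), col_names)

# The remaining names play the role of the controls
control_names = col_names[setdiff(eachindex(col_names), female_idx)]

println(female_idx)
println(control_names)
```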

In [73]:
# Control variables 

W = select(X, Not(names(X)[index]))

 Row │ widowed      divorced   separated   nevermarried  hsd08       hsd911      hsg
     │ Float64      Float64    Float64     Float64       Float64     Float64     Float64
─────┼─────────────────────────────────────────────────────────────────────────────────────
   1 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789  -0.247288
   2 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072    0.977821  -0.247288
   3 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789   0.752712
   4 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789   0.752712
   5 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789  -0.247288
   6 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789   0.752712
   7 │ -0.00797481  -0.113393  -0.0165999      0.843653  -0.0041072  -0.0221789   0.752712
   8 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789  -0.247288
   9 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789  -0.247288
  10 │ -0.00797481  -0.113393  -0.0165999     -0.156347  -0.0041072  -0.0221789  -0.247288


In [74]:
include("../hdmjl/hdmjl.jl")

# HDM package

In [136]:
table = NamedArray(zeros(16, 2))   # column 1: lower 95% bound, column 2: upper 95% bound

for i in 1:length(index)

    # Step 1: partial the controls W out of the target regressor D
    D = select(X, names(X)[index[i]])

    D_reg_0 = rlasso_arg( W, D, nothing, true, true, true, false, false,
                        nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )

    D_resid = rlasso(D_reg_0)["residuals"]

    # Step 2: partial the controls W out of the outcome Y
    Y_reg_0 = rlasso_arg( W, Y, nothing, true, true, true, false, false,
                        nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )

    Y_resid = rlasso(Y_reg_0)["residuals"]

    D_resid = reshape(D_resid, length(D_resid), 1)

    # Step 3: regress the outcome residuals on the regressor residuals
    Lasso_HDM = lm(D_resid, Y_resid)

    # Entries 5 and 6 of the coefficient table are the confidence bounds
    table[i,1] = GLM.coeftable(Lasso_HDM).cols[5][1]
    table[i,2] = GLM.coeftable(Lasso_HDM).cols[6][1]

end
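
The three steps in the loop follow the partialling-out logic: residualise the regressor and the outcome on the controls, then regress residual on residual. The sketch below illustrates that logic on simulated data, with plain least squares swapped in for `rlasso` so that it runs with the standard library only. All data, coefficients, and the helper `ols_resid` are assumptions made for this example.

```julia
using LinearAlgebra, Random

Random.seed!(1)
n = 2_000
W = randn(n, 3)                                      # simulated controls
d = W * [0.5, -0.2, 0.1] .+ randn(n)                 # regressor of interest depends on W
y = -0.3 .* d .+ W * [1.0, 0.5, -1.0] .+ randn(n)    # true effect of d is -0.3

# Hypothetical helper: residuals from a least-squares fit of v on W
# (plain OLS here, standing in for the lasso residuals used in the notebook)
ols_resid(v, W) = v .- W * (W \ v)

d_res = ols_resid(d, W)                                # step 1
y_res = ols_resid(y, W)                                # step 2
beta_partial = dot(d_res, y_res) / dot(d_res, d_res)   # step 3

# Full least-squares fit of y on [d W] for comparison
beta_full = ([d W] \ y)[1]

println((beta_partial, beta_full))
```

With least squares in both steps, the two estimates coincide up to floating-point error (the Frisch-Waugh-Lovell result). The double lasso replaces the first two fits with penalised ones so that the approach remains usable when there are many controls.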


In [137]:
# Repeat the three steps for the first target column (female) only

# first step
D = select(X, names(X)[1])
    
D_reg_0  = rlasso_arg( W, D, nothing, true, true, true, false, false, 
                    nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )


D_resid = rlasso(D_reg_0)["residuals"]

#second step
    
Y_reg_0  = rlasso_arg( W, Y, nothing, true, true, true, false, false, 
                    nothing, 1.1, nothing, 5000, 15, 10^(-5), -Inf, true, Inf, true )

Y_resid = rlasso(Y_reg_0)["residuals"]

D_resid = reshape(D_resid, length(D_resid), 1)

# third step
    
Lasso_HDM = lm(D_resid, Y_resid)

table[1,1] = GLM.coeftable(Lasso_HDM).cols[5][1]
table[1,2] = GLM.coeftable(Lasso_HDM).cols[6][1]

-0.2671060318070428

In [135]:
GLM.coeftable(Lasso_HDM).cols

6-element Vector{Any}:
 [-0.28067177854159236]
 [0.0069211397955338035]
 [-40.55282609993071]
 [0.0]
 [-0.29423752527614194]
 [-0.2671060318070428]
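
The six entries of `cols` line up with an estimate, its standard error, the t statistic, the p-value, and the lower and upper 95% confidence bounds. As a quick check under that reading, the sketch below recomputes the t statistic and an approximate interval from the first two entries. The 1.96 multiplier is a normal approximation assumed here rather than the quantile GLM uses, so the last digits differ slightly.

```julia
# Values copied from the output above
estimate  = -0.28067177854159236
std_error = 0.0069211397955338035

t_stat = estimate / std_error

z = 1.96                          # assumption: normal approximation to the 97.5% quantile
lower = estimate - z * std_error
upper = estimate + z * std_error

println((t_stat, lower, upper))
```

The results agree with entries 3, 5, and 6 of the output to about six decimal places, which supports reading the two columns of `table` as lower and upper confidence bounds.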

In [122]:
table

16×2 Named Matrix{Float64}
A ╲ B │          1           2
──────┼───────────────────────
1     │  -0.294238   -0.267106
2     │  -0.392813  -0.0289782
3     │  -0.240157   -0.157788
4     │  -0.385625   -0.173658
5     │  -0.159834  -0.0895803
6     │  -0.534498  -0.0514616
7     │  -0.501689   -0.299386
8     │  -0.328185   -0.271684
9     │  -0.281631    -0.23003
10    │  -0.325477   -0.255887
11    │  -0.303997   -0.253261
12    │  -0.310251   -0.258417
13    │  -0.309586   -0.247186
14    │ -0.0139084  -0.0125946
15    │ -0.0478188  -0.0427854
16    │ -0.0150576   -0.013297