
First, install the required packages:

In [None]:
install.packages(c("xtable","sandwich"))


# An inferential problem: The Gender Wage Gap

In this lab, we analyze data from the March Supplement of the U.S. Current Population Survey (2015) and focus on the following inference question:

What is the difference in predicted wages between men and women with the same job-relevant characteristics?

Thus, we analyze whether men and women are paid differently (the *gender wage gap*). The gender wage gap may partly reflect *discrimination* against women in the labor market and partly a *selection effect*, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching).

To investigate the gender wage gap, we consider the following log-linear regression model

\begin{align}
\log(Y) &= \beta'X + \epsilon\\
&= \beta_1 D  + \beta_2' W + \epsilon,
\end{align}

where $D$ is the indicator of being female ($1$ if female and $0$ otherwise) and the
$W$'s are controls explaining variation in wages. Because the outcome is the logarithm of the wage, $\beta_1$ measures the relative (approximately percentage) difference in pay between women and men.
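
To make the relative-difference reading explicit, here is a short derivation under the model above. The notation $Y(d)$ is an assumption introduced only for this step: it denotes the wage implied by the model when $D = d$, with $W$ and $\epsilon$ held fixed.

\begin{align}
\log Y(1) - \log Y(0) &= \beta_1,\\
\frac{Y(1) - Y(0)}{Y(0)} &= e^{\beta_1} - 1 \approx \beta_1 \quad \text{for small } |\beta_1|.
\end{align}

So a coefficient of, say, $-0.05$ on $D$ corresponds to wages that are roughly $5$\% lower, and the approximation worsens as $|\beta_1|$ grows.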

## Data analysis

We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Load the data set:

In [None]:
load("/content/wage2015_subsample_inference.Rdata")
attach(data)
dim(data)

**Exercise 1:** To start our (causal) analysis, compare the sample means by gender. To do this, calculate the mean of the (log) wage for men and women separately and take the difference. What is the unconditional gender wage gap?

In [None]:
library(xtable)

# Variables to summarize
vars <- c("lwage","sex","shs","hsg","scl","clg","ad","ne","mw","so","we","exp1")
# Columns are returned in the data set's own column order, which the row labels below assume matches `vars`
cols <- which(colnames(data) %in% vars)

Z <- data[cols]

data_female <- data[data$sex==1,]
Z_female <- data_female[cols]

data_male <- data[data$sex==0,]
Z_male <- data_male[cols]

table <- matrix(0, 12, 3)
table[1:12,1]   <- as.numeric(lapply(Z,mean))
table[1:12,2]   <- as.numeric(lapply(Z_male,mean))
table[1:12,3]   <- as.numeric(lapply(Z_female,mean))
rownames(table) <- c("Log Wage","Sex","Less than High School","High School Graduate","Some College","College Graduate","Advanced Degree", "Northeast","Midwest","South","West","Experience")
colnames(table) <- c("All","Men","Women")
tab <- xtable(table, digits = 4)
tab

| | All | Men | Women |
|---|---|---|---|
| Log Wage | 2.9707867 | 2.98782963 | 2.9494849 |
| Sex | 0.44446602 | 0.0 | 1.0 |
| Less than High School | 0.02330097 | 0.03180706 | 0.01266929 |
| High School Graduate | 0.2438835 | 0.29430269 | 0.18086501 |
| Some College | 0.27805825 | 0.273331 | 0.2839668 |
| College Graduate | 0.3176699 | 0.29395316 | 0.34731324 |
| Advanced Degree | 0.13708738 | 0.10660608 | 0.17518567 |
| Northeast | 0.25961165 | 0.25900035 | 0.26037571 |
| Midwest | 0.29650485 | 0.2981475 | 0.29445173 |
| South | 0.2161165 | 0.22090178 | 0.21013543 |


In particular, the table above shows that the difference in average *log wage* between men and women is equal to $0.038$.

In [None]:
mean(data_female$lwage)-mean(data_male$lwage)

Thus, the unconditional gender wage gap is about $3.8$\% for the group of never-married workers (women get paid less on average in our sample). We also observe that never-married working women are relatively more educated than working men and have less work experience.
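
The group comparison can also be sketched with `tapply`. The snippet below is a minimal illustration on simulated data: the data frame `sim` and the numbers used to generate it are made up for the example and are not the CPS sample. It computes the mean log wage by gender and the difference (women minus men).

```r
# Synthetic illustration only: simulated log wages, not the CPS data
set.seed(1)
n <- 1000
sim <- data.frame(sex = rbinom(n, 1, 0.45))
sim$lwage <- 3 - 0.04 * sim$sex + rnorm(n, sd = 0.5)

# Mean log wage by gender (names "0" = men, "1" = women)
group_means <- tapply(sim$lwage, sim$sex, mean)

# Unconditional gap: women minus men
gap <- unname(group_means["1"] - group_means["0"])
print(group_means)
print(gap)
```

On the real data, the same two lines applied to `data$lwage` and `data$sex` should reproduce the difference computed in the cell above.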

**Exercise 2:** Verify by running an OLS regression that the unconditional (predictive) effect of gender calculated above equals the coefficient $\beta$ in the univariate OLS regression of $\log(Y)$ on $D$ (the intercept that `lm` adds by default is left implicit):

\begin{align}
\log(Y) &=\beta D + \epsilon.
\end{align}

We verify this by running an OLS regression in R.

In [None]:
library(sandwich)
nocontrol.fit <- lm(lwage ~ sex, data = data)
nocontrol.est <- summary(nocontrol.fit)$coef["sex",1]
HCV.coefs <- vcovHC(nocontrol.fit, type = 'HC3') # HC3 covariance (close to the jackknife estimate)
nocontrol.se <- sqrt(diag(HCV.coefs))["sex"] # heteroskedasticity-robust std. error for sex


# print unconditional effect of gender and the corresponding standard error
cat("The estimated gender coefficient is",nocontrol.est," and the corresponding robust standard error is",nocontrol.se)


The estimated gender coefficient is -0.03834473  and the corresponding robust standard error is 0.01590824

Note that the standard error is computed with the *R* package *sandwich* to be robust to heteroskedasticity.
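
To see what the robust standard error involves, the following sketch computes the HC3 sandwich covariance by hand in base R and checks that the slope on a dummy equals the difference in group means. It uses simulated data (the variables `d` and `y` and their parameters are hypothetical, not the CPS sample) and is meant as an illustration of the formula, not as a replacement for `vcovHC`.

```r
# Synthetic illustration only: simulated data, not the CPS sample
set.seed(2)
n <- 500
d <- rbinom(n, 1, 0.45)
# Heteroskedastic noise: the error spread differs by group
y <- 3 - 0.04 * d + rnorm(n, sd = 0.4 + 0.3 * d)

fit <- lm(y ~ d)
X <- model.matrix(fit)   # n x 2 design matrix (intercept, d)
e <- resid(fit)          # OLS residuals
h <- hatvalues(fit)      # leverages h_i

# Sandwich: (X'X)^{-1} [sum_i w_i x_i x_i'] (X'X)^{-1} with HC3 weights w_i = e_i^2 / (1 - h_i)^2
bread <- solve(crossprod(X))
meat  <- crossprod(X * (e / (1 - h)))   # row i of X is scaled by e_i / (1 - h_i)
V_hc3 <- bread %*% meat %*% bread
se_hc3 <- sqrt(diag(V_hc3))

# The slope on the dummy equals the difference in group means
slope <- unname(coef(fit)["d"])
mean_diff <- mean(y[d == 1]) - mean(y[d == 0])
print(c(slope = slope, mean_diff = mean_diff, se_hc3_d = unname(se_hc3[2])))
```

Dividing each residual by $1 - h_i$ inflates the contribution of high-leverage observations, which is why HC3 standard errors are never smaller than the unweighted (HC0) version.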


**Exercise 3:** Next, run an OLS regression of $\log(Y)$ on $(D,W)$ to control for the effect of covariates summarized in $W$:

\begin{align}
\log(Y) &=\beta_1 D  + \beta_2' W + \epsilon.
\end{align}

Here, consider the flexible model to account for non-linear relationships:

In [None]:
flex <- lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)

Hence, $W$ controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions.
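
To see how the `*` operator generates main effects plus two-way interactions, here is a small sketch with a toy formula. The names `y`, `d`, `e1`, `e2`, `a`, and `b` are placeholders chosen for the example, not variables from the data set.

```r
# Toy formula with two "experience" terms and two "control" terms
# (placeholder names, not the variables in the data set)
toy <- y ~ d + (e1 + e2) * (a + b)

# terms() expands the formula symbolically; no data are needed
labels <- attr(terms(toy), "term.labels")
print(labels)
```

The expansion contains the five main effects and the four products `e1:a`, `e1:b`, `e2:a`, and `e2:b`. In the `flex` formula the same rule applies to four experience terms and nine controls; if `occ2` and `ind2` are stored as factors (an assumption about the data set), each term involving them expands further into one column per category.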

What is the predictive effect (PE)?

Let us run the OLS regression with controls.

In [None]:
# OLS regression with controls

flex <- lwage ~ sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we)
control.fit <- lm(flex, data=data)
control.est <- summary(control.fit)$coef["sex",1] # coefficient on sex

cat("Coefficient for OLS with controls", control.est)

HCV.coefs <- vcovHC(control.fit, type = 'HC3')
control.se <- sqrt(diag(HCV.coefs))["sex"] # HC3 robust std. error for sex

Coefficient for OLS with controls -0.0695532

The estimated regression coefficient $\hat\beta_1\approx-0.0696$ measures how our linear prediction of the log wage changes if we switch the gender variable $D$ from 0 to 1, holding the controls $W$ fixed.
We can call this the *predictive effect* (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of about $4$\% for women increases to about $7$\% after controlling for worker characteristics.
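
The percentages quoted here read the log-point coefficients directly as percent changes. A quick sketch of the exact conversion, using the two estimates reported above (the vector names are chosen only for this example):

```r
# Estimates reported above, in log points
log_gap <- c(no_controls = -0.03834473, with_controls = -0.0695532)

# Exact relative wage difference implied by a log-point gap: exp(b) - 1
pct_gap <- 100 * (exp(log_gap) - 1)
print(round(pct_gap, 2))
```

The exact figures are slightly smaller in magnitude than the log-point approximations, but the conclusion is unchanged: the gap is larger once the controls are included.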


**Exercise 4:** Next, use the Frisch-Waugh-Lovell theorem from the lecture, partialling out the linear effect of the controls via OLS. Compare your estimated effect with the coefficient from the regression above.

In [None]:
# Partialling-out using OLS
library(sandwich) # provides vcovHC

# models
flex.y <- lwage ~  (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we) # model for Y
flex.d <- sex ~ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+occ2+ind2+mw+so+we) # model for D

# partialling-out the linear effect of W from Y
t.Y <- resid(lm(flex.y, data=data))
# partialling-out the linear effect of W from D
t.D <- resid(lm(flex.d, data=data))

# regression of Y on D after partialling-out the effect of W
partial.fit <- lm(t.Y~t.D)
partial.est <- summary(partial.fit)$coef[2,1]

cat("Coefficient for D via partialling-out", partial.est)

# standard error
HCV.coefs <- vcovHC(partial.fit, type = 'HC3')
partial.se <- sqrt(diag(HCV.coefs))[2]
# Note that jackknife standard errors depend on all the variables in the model and so are not appropriate for the partialled-out regression (without adjustment)

# confidence interval (based on the classical, non-robust standard errors)
confint(partial.fit)[2,]

Coefficient for D via partialling-out -0.0695532

Again, the estimated coefficient measures the linear predictive effect (PE) of $D$ on $\log(Y)$ after taking out the linear effect of $W$ on both of these variables. This coefficient equals the estimated coefficient from the OLS regression with controls.
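
The equality of the two coefficients can be checked on a small simulated example. The sketch below uses made-up variables (`w1`, `w2`, `d`, `y`), not the CPS sample. It also shows that the classical standard errors of the two regressions differ only by a degrees-of-freedom factor, because the partialled-out regression does not count the parameters used up by the controls.

```r
# Synthetic illustration only: simulated data, not the CPS sample
set.seed(3)
n <- 400
w1 <- rnorm(n)
w2 <- rnorm(n)
d  <- rbinom(n, 1, plogis(0.5 * w1))   # binary regressor correlated with w1
y  <- 1 - 0.07 * d + 0.3 * w1 - 0.2 * w2 + rnorm(n, sd = 0.5)

# Full regression of y on d and the controls
full <- lm(y ~ d + w1 + w2)

# Partialling-out: residualize y and d on the controls, then regress
t_y <- resid(lm(y ~ w1 + w2))
t_d <- resid(lm(d ~ w1 + w2))
part <- lm(t_y ~ t_d)

b_full <- unname(coef(full)["d"])
b_part <- unname(coef(part)["t_d"])

# Classical standard errors differ only by a degrees-of-freedom factor
se_full <- summary(full)$coefficients["d", "Std. Error"]
se_part <- summary(part)$coefficients["t_d", "Std. Error"]
k <- length(coef(full))                # number of parameters in the full model
se_part_adj <- se_part * sqrt((n - 2) / (n - k))

print(c(b_full = b_full, b_part = b_part))
print(c(se_full = se_full, se_part = se_part, se_part_adj = se_part_adj))
```

With many controls relative to the sample size, as in the flexible specification, this degrees-of-freedom gap is one reason the standard error from the partialled-out regression can come out smaller than the one from the full regression.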

Next, we summarize the results.

In [None]:
table<- matrix(0, 3, 2)
table[1,1]<- nocontrol.est
table[1,2]<- nocontrol.se
table[2,1]<- control.est
table[2,2]<- control.se
table[3,1]<- partial.est
table[3,2]<- partial.se
colnames(table)<- c("Estimate","Std. Error")
rownames(table)<- c("Without controls", "full reg", "partial reg")
tab<- xtable(table, digits=c(3, 3, 4))
tab

| | Estimate | Std. Error |
|---|---|---|
| Without controls | -0.03834473 | 0.01590824 |
| full reg | -0.0695532 | 0.0156992 |
| partial reg | -0.0695532 | 0.01500873 |


It is worth noting that controlling for worker characteristics increases the gender wage gap from less than 4\% to 7\%. The controls we used in our analysis include 5 educational attainment indicators (less than high school graduates, high school graduates, some college, college graduate, and advanced degree), 4 region indicators (midwest, south, west, and northeast), a quartic term (first, second, third, and fourth power) in experience, and 22 occupation and 23 industry indicators.

Keep in mind that the predictive effect (PE) does not only measure discrimination (the causal effect of being female); it may also reflect
selection effects due to unobserved differences in covariates between men and women in our sample.
