
## Introduction

Q4. This notebook works through four tasks:
a) Compare pooled OLS and fixed effects estimates
b) Examine gender differences in the effect of children
c) Test the validity of random effects assumptions
d) Conduct a formal Hausman test


## 1. Data Loading and Preparation



In [1]:
# Load required packages
library(haven)
library(tidyverse)
library(plm)
library(lmtest)
library(sandwich)

# Load the SOEP dataset
soep <- read_dta("soep_lebensz_en.dta")

# Inspect the data structure
glimpse(soep)

── Attaching core tidyverse packages ──────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Attaching package: ‘plm’

The following objects are masked from 

In [2]:
# Create has_kids binary variable (1 if person has any children, 0 otherwise)
soep <- soep |>
  mutate(has_kids = ifelse(no_kids > 0, 1, 0))

# Prepare categorical variables for regression
soep <- soep |>
  mutate(
    health_cat = as.factor(health_org),
    year_fac = as.factor(year),
    sex_fac = as.factor(sex)
  )

# Verify the variable creation
table(soep$has_kids, soep$no_kids, useNA = "always")

      
          0    1    2    3 <NA>
  0    7428    0    0    0    0
  1       0 2058 1878  652    0
  <NA>    0    0    0    0  906


## 2. Part A: Pooled OLS vs Fixed Effects

### Pooled OLS Regression

We first estimate a pooled OLS regression that ignores the panel structure, treating all observations as independent. We include gender, education, categorical health indicators, and year fixed effects, with standard errors clustered at the individual level.



In [3]:
# Run pooled OLS model
pooled_ols <- lm(satisf_std ~ has_kids + sex_fac + education + health_cat + year_fac, 
                 data = soep)

# Get clustered standard errors at individual level
pooled_ols_clustered <- coeftest(pooled_ols, 
                                  vcov = vcovCL, 
                                  cluster = ~id, 
                                  data = soep)

pooled_ols_clustered


t test of coefficients:

               Estimate Std. Error  t value  Pr(>|t|)    
(Intercept)  -1.4696226  0.1005546 -14.6152 < 2.2e-16 ***
has_kids     -0.1468156  0.0273686  -5.3644 8.293e-08 ***
sex_fac1      0.0614782  0.0274062   2.2432 0.0249029 *  
education     0.0238177  0.0052391   4.5461 5.524e-06 ***
health_cat2   0.7241917  0.0838475   8.6370 < 2.2e-16 ***
health_cat3   1.1849977  0.0831723  14.2475 < 2.2e-16 ***
health_cat4   1.5169296  0.0834787  18.1715 < 2.2e-16 ***
health_cat5   1.8118611  0.0885232  20.4676 < 2.2e-16 ***
year_fac2001 -0.0271567  0.0207891  -1.3063 0.1914810    
year_fac2002 -0.0827616  0.0221974  -3.7284 0.0001937 ***
year_fac2003 -0.1387162  0.0238563  -5.8147 6.251e-09 ***
year_fac2004 -0.2424405  0.0249236  -9.7273 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
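
The clustered standard errors above come from a sandwich formula that sums the score contributions within each person before squaring them. The following is a minimal base R sketch of that idea on simulated data (not the SOEP data). The helper `cluster_vcov()` is a hypothetical name introduced here, and it applies no finite-sample adjustment, so its numbers would differ slightly from what `vcovCL()` reports.

```r
# Cluster-robust covariance: (X'X)^-1 [ sum_g (X_g' u_g)(X_g' u_g)' ] (X'X)^-1
# Hypothetical helper for illustration; no small-sample correction is applied.
cluster_vcov <- function(X, u, cl) {
  bread  <- solve(crossprod(X))
  scores <- rowsum(X * u, group = cl)   # one row of summed scores per cluster
  meat   <- crossprod(scores)
  bread %*% meat %*% bread
}

# Toy data: 50 clusters of 4 observations with a shared cluster-level shock
set.seed(42)
g  <- 50
m  <- 4
cl <- rep(seq_len(g), each = m)
x  <- rnorm(g * m) + rep(rnorm(g), each = m)
y  <- 1 + 0.5 * x + rep(rnorm(g), each = m) + rnorm(g * m)

fit <- lm(y ~ x)
X   <- model.matrix(fit)
u   <- residuals(fit)

V_cl  <- cluster_vcov(X, u, cl)             # clustered by group
V_hc0 <- cluster_vcov(X, u, seq_along(u))   # every observation its own cluster

se_cl  <- sqrt(diag(V_cl))
se_hc0 <- sqrt(diag(V_hc0))
```

Treating every observation as its own cluster reduces the formula to the usual heteroskedasticity-robust (HC0) estimator, which is a convenient sanity check on the implementation.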



### Fixed Effects Regression

Next, we estimate a fixed effects model that controls for all time-invariant individual characteristics by including individual-specific intercepts.



In [4]:
# Create panel data object
soep_panel <- pdata.frame(soep, index = c("id", "year"))

# Run fixed effects model
fe_model <- plm(satisf_std ~ has_kids + education + health_cat + year_fac,
                data = soep_panel,
                model = "within",
                effect = "individual")

# Get clustered standard errors
fe_clustered <- coeftest(fe_model, vcov = vcovHC(fe_model, cluster = "group"))

fe_clustered



t test of coefficients:

              Estimate Std. Error  t value  Pr(>|t|)    
has_kids      0.042085   0.046825   0.8988    0.3688    
education    -0.017705   0.036558  -0.4843    0.6282    
health_cat2   0.388783   0.086123   4.5143 6.453e-06 ***
health_cat3   0.720040   0.089403   8.0539 9.285e-16 ***
health_cat4   0.909834   0.092046   9.8845 < 2.2e-16 ***
health_cat5   1.058753   0.097284  10.8832 < 2.2e-16 ***
year_fac2001 -0.051938   0.020291  -2.5596    0.0105 *  
year_fac2002 -0.126262   0.021982  -5.7439 9.620e-09 ***
year_fac2003 -0.184022   0.023736  -7.7528 1.019e-14 ***
year_fac2004 -0.295234   0.025187 -11.7217 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



### Comparison and Interpretation



In [5]:
# Compare coefficients
data.frame(
  Model = c("Pooled OLS", "Fixed Effects", "Difference"),
  Coefficient = c(
    coef(pooled_ols)["has_kids"],
    coef(fe_model)["has_kids"],
    coef(pooled_ols)["has_kids"] - coef(fe_model)["has_kids"]
  )
)

          Model Coefficient
1    Pooled OLS -0.14681556
2 Fixed Effects  0.04208499
3    Difference -0.18890054


**Interpretation:**

The pooled OLS estimate associates having children with life satisfaction that is **0.147 standard deviations lower** (p < 0.001). The fixed effects estimate, by contrast, is **+0.042** and not statistically significant (p = 0.369).

The large difference of **-0.189** between these estimates points to substantial **selection bias**. The negative correlation in pooled OLS conflates two distinct effects:

1. **Selection effect**: People who have children differ systematically in their baseline life satisfaction
2. **Causal effect**: The actual within-person change in satisfaction from having children

Because the pooled OLS estimate lies below the fixed effects estimate, the usual omitted-variable logic implies that the unobserved individual effect $f_i$ is **negatively correlated** with `has_kids`, conditional on the other regressors. In other words, people who are predisposed to lower life satisfaction are more likely to have children, or parents share other unobserved, time-invariant characteristics associated with lower satisfaction.

The fixed effects estimator removes the bias from such time-invariant unobservables by comparing each person with themselves over time, so the effect is identified from within-person variation only. Confounders that change over time are not removed.
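
The within-person logic can be illustrated on simulated data. The sketch below uses a toy panel (not the SOEP data) in which an unobserved effect `f` is negatively related to a binary regressor `x`. All variable names and the true effect of 0.05 are assumptions made for the illustration.

```r
# Toy panel (simulated, NOT the SOEP data): 200 people observed 5 times each
set.seed(1)
n     <- 200
t_per <- 5
id <- rep(seq_len(n), each = t_per)
f  <- rep(rnorm(n), each = t_per)                   # unobserved individual effect f_i
x  <- as.numeric(rnorm(n * t_per) - 0.8 * f > 0)    # binary regressor, negatively related to f_i
y  <- 0.05 * x + f + rnorm(n * t_per)               # assumed true within-person effect: 0.05

# Pooled OLS ignores f_i, so the coefficient on x absorbs the selection
b_pooled <- unname(coef(lm(y ~ x))["x"])

# Within transformation: subtract each person's own mean from y and x
y_dm <- y - ave(y, id)
x_dm <- x - ave(x, id)
b_within <- unname(coef(lm(y_dm ~ x_dm - 1)))

# Equivalent dummy-variable (LSDV) regression with one intercept per person
b_lsdv <- unname(coef(lm(y ~ x + factor(id)))["x"])

c(pooled = b_pooled, within = b_within, lsdv = b_lsdv)
```

The demeaned regression and the dummy-variable regression give the same slope, while pooled OLS is pulled downward by the negative relationship between `x` and `f`.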

## 3. Part B: Gender Differences in Fixed Effects

### Why Does Gender Disappear in Fixed Effects?

Gender is a **time-invariant** characteristic in this panel: it does not change over time for an individual. In the fixed effects model, all time-invariant variables are absorbed into the individual fixed effects $f_i$ and cannot be separately identified.

The within transformation subtracts each person's time average from every variable, which removes all between-individual variation and leaves only within-individual variation over time. Since gender does not vary within individuals, it drops out of the estimation entirely.
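
A stylized derivation makes this explicit. The symbols $\beta$, $\gamma$ and $\varepsilon_{it}$ are notation assumed here for illustration (only $f_i$ appears in the notebook), and the other controls are omitted for brevity. Start from

$$
y_{it} = \beta\,\text{has\_kids}_{it} + \gamma\,\text{sex}_i + f_i + \varepsilon_{it}
$$

and subtract each person's time average:

$$
y_{it} - \bar{y}_i = \beta\left(\text{has\_kids}_{it} - \overline{\text{has\_kids}}_i\right) + \gamma\underbrace{\left(\text{sex}_i - \text{sex}_i\right)}_{=\,0} + \underbrace{\left(f_i - f_i\right)}_{=\,0} + \left(\varepsilon_{it} - \bar{\varepsilon}_i\right)
$$

Both $\text{sex}_i$ and $f_i$ vanish, so $\gamma$ cannot be estimated. A product such as $\text{has\_kids}_{it} \times \text{sex}_i$ behaves differently: its demeaned value is $\text{sex}_i\left(\text{has\_kids}_{it} - \overline{\text{has\_kids}}_i\right)$, which is non-zero for anyone with $\text{sex}_i = 1$ whose parental status changes during the panel.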

### Fixed Effects with Gender × Has_Kids Interaction

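We can still examine whether the **effect** of having children differs by gender by interacting `has_kids` with gender. The interaction varies within individuals whenever `has_kids` does, so it survives the within transformation: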



In [6]:
# Fixed effects with gender × has_kids interaction
fe_interaction <- plm(satisf_std ~ has_kids * sex_fac + education + health_cat + year_fac,
                      data = soep_panel,
                      model = "within",
                      effect = "individual")

# Get clustered standard errors
fe_interaction_clustered <- coeftest(fe_interaction, 
                                     vcov = vcovHC(fe_interaction, cluster = "group"))

fe_interaction_clustered


t test of coefficients:

                   Estimate Std. Error  t value  Pr(>|t|)    
has_kids           0.133919   0.061482   2.1782   0.02942 *  
education         -0.016807   0.036718  -0.4577   0.64715    
health_cat2        0.387347   0.086237   4.4916 7.176e-06 ***
health_cat3        0.719195   0.089512   8.0347 1.085e-15 ***
health_cat4        0.909248   0.092144   9.8677 < 2.2e-16 ***
health_cat5        1.059830   0.097340  10.8879 < 2.2e-16 ***
year_fac2001      -0.051767   0.020282  -2.5524   0.01072 *  
year_fac2002      -0.125778   0.021969  -5.7252 1.074e-08 ***
year_fac2003      -0.182948   0.023744  -7.7050 1.480e-14 ***
year_fac2004      -0.294420   0.025163 -11.7004 < 2.2e-16 ***
has_kids:sex_fac1 -0.171947   0.090623  -1.8974   0.05782 .  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



### Interpretation of Gender Differences



In [7]:
# Calculate effects by gender
data.frame(
  Gender = c("Women (sex=0)", "Men (sex=1)"),
  Effect = c(
    coef(fe_interaction)["has_kids"],
    coef(fe_interaction)["has_kids"] + coef(fe_interaction)["has_kids:sex_fac1"]
  )
)

         Gender     Effect
1 Women (sex=0)  0.1339186
2   Men (sex=1) -0.0380287


**Findings:**

- **Women** experience a **0.134 standard deviation increase** in life satisfaction when having children (p = 0.029, statistically significant)
- **Men** show an implied effect of **-0.038** standard deviations (0.134 - 0.172), a small decrease. The output above does not report a standard error for this sum, so its significance is not tested directly here
- The **interaction coefficient is -0.172** (p = 0.058, marginally significant at the 10% level)
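
The effect for men is the sum of two coefficients, so its standard error depends on the covariance between them, which the coefficient table does not show. Below is a minimal sketch of how such a linear combination could be tested. `lincom()` is a hypothetical helper written for this illustration, and the numbers passed to it are toy values, not estimates from this notebook.

```r
# Test a linear combination w'b using its variance w' V w (normal approximation)
# Hypothetical helper for illustration only.
lincom <- function(b, V, w) {
  est <- sum(w * b)
  se  <- sqrt(drop(t(w) %*% V %*% w))
  z   <- est / se
  c(estimate = est, se = se, z = z, p = 2 * pnorm(-abs(z)))
}

# Toy inputs (made up): a main effect, an interaction, and their covariance matrix
b_toy <- c(main = 0.10, interaction = -0.15)
V_toy <- matrix(c( 0.004, -0.004,
                  -0.004,  0.008), nrow = 2, byrow = TRUE)

res <- lincom(b_toy, V_toy, w = c(1, 1))
res

# With the notebook's objects this would look like (not run here):
#   V <- vcovHC(fe_interaction, cluster = "group")
#   w <- as.numeric(names(coef(fe_interaction)) %in% c("has_kids", "has_kids:sex_fac1"))
#   lincom(coef(fe_interaction), V, w)
```

The negative covariance between a main effect and its interaction is typical, and it is why the standard error of the sum cannot be read off the two individual standard errors.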

**Interpretation:**

The estimates point to gender differences in how having children affects life satisfaction. For women the effect is positive and significant at the 5% level; for men the point estimate is small and negative. The gap between genders is about 0.17 standard deviations, which is only marginally significant (p = 0.058) but sizeable in practical terms.

This could reflect differences in time allocation, parenting roles, social expectations, or the psychological rewards from parenthood across genders.

## 4. Part C: Random Effects Model

### Estimate Random Effects Model

The random effects model assumes that individual-specific effects are uncorrelated with the regressors: $E(f_i | X) = 0$.



In [8]:
# Run random effects model
re_model <- plm(satisf_std ~ has_kids + sex_fac + education + health_cat + year_fac,
                data = soep_panel,
                model = "random",
                effect = "individual")

summary(re_model)

Oneway (individual) effect Random Effect Model 
   (Swamy-Arora's transformation)

Call:
plm(formula = satisf_std ~ has_kids + sex_fac + education + health_cat + 
    year_fac, data = soep_panel, effect = "individual", model = "random")

Unbalanced Panel: n = 3289, T = 1-5, N = 10659

Effects:
                 var std.dev share
idiosyncratic 0.4539  0.6737 0.547
individual    0.3756  0.6129 0.453
theta:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.2603  0.5183  0.5588  0.5071  0.5588  0.5588 

Residuals:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-4.23270 -0.34209  0.08048  0.00197  0.42633  2.82500 

Coefficients:
               Estimate Std. Error  z-value  Pr(>|z|)    
(Intercept)  -1.2422464  0.0799120 -15.5452 < 2.2e-16 ***
has_kids     -0.1000018  0.0243745  -4.1027 4.083e-05 ***
sex_fac1      0.0410318  0.0264104   1.5536   0.12027    
education     0.0237643  0.0053403   4.4500 8.588e-06 ***
health_cat2   0.5783963  0.0500716  11.5514 < 2.2e-16 ***
health_cat
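
The `theta` values in the summary can be reproduced by hand. The sketch below assumes the standard random effects quasi-demeaning weight, $\theta_i = 1 - \sqrt{\sigma^2_{\varepsilon} / (\sigma^2_{\varepsilon} + T_i \sigma^2_{f})}$, and plugs in the rounded variance components printed above, so small rounding differences are possible.

```r
# Variance components as printed in the summary above (rounded to 4 digits)
sigma2_idio  <- 0.4539   # idiosyncratic
sigma2_indiv <- 0.3756   # individual

# Quasi-demeaning weight for a person observed T_i times
theta <- function(T_i) {
  1 - sqrt(sigma2_idio / (sigma2_idio + T_i * sigma2_indiv))
}

theta_by_T <- setNames(theta(1:5), paste0("T=", 1:5))
round(theta_by_T, 4)
# T=1 gives 0.2603 and T=5 gives 0.5588, matching Min. and Max. in the summary
```

A weight of 0 would reproduce pooled OLS and a weight of 1 the within estimator. With weights around 0.5, the random effects coefficient on `has_kids` (-0.100) lies between the pooled OLS (-0.147) and fixed effects (0.042) estimates.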


### Compare Fixed Effects vs Random Effects



In [9]:
# Compare key coefficients
data.frame(
  Model = c("Fixed Effects", "Random Effects", "Difference"),
  has_kids = c(
    coef(fe_model)["has_kids"],
    coef(re_model)["has_kids"],
    coef(fe_model)["has_kids"] - coef(re_model)["has_kids"]
  )
)

           Model    has_kids
1  Fixed Effects  0.04208499
2 Random Effects -0.10000179
3     Difference  0.14208678


**Findings:**

- **Fixed Effects**: coefficient = 0.042 (p = 0.369, not significant)
- **Random Effects**: coefficient = -0.100 (p < 0.001, highly significant)
- **Difference**: 0.142 (substantial)

### Can We Trust the Random Effects Assumptions?

**No, we cannot trust the RE model assumptions in this context.**

The substantial gap between the FE and RE coefficients suggests that $E(f_i | X) \neq 0$, i.e. that the individual-specific effects are **correlated** with the regressors. That would violate the key assumption of the random effects model; Part D tests this formally.

The evidence shows:
1. The coefficients differ not just in magnitude but also in sign (FE positive, RE negative)
2. They differ in statistical significance (FE insignificant, RE highly significant)
3. The difference of 0.142 is economically meaningful

This pattern suggests that individuals who have children differ systematically in their unobserved characteristics from those who don't. The RE model incorrectly treats these differences as random, leading to biased estimates that conflate selection effects with causal effects—similar to the pooled OLS problem.

## 5. Part D: Hausman Test

The Hausman test provides a formal statistical test of whether the Random Effects assumptions are violated.

- **Null Hypothesis (H₀):** The individual effects are uncorrelated with the regressors (RE model is appropriate)
- **Alternative Hypothesis (H₁):** The individual effects are correlated with the regressors (FE model should be used)
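
The classical form of the statistic compares the two coefficient vectors, weighted by the difference of their covariance matrices: $H = (b_{FE} - b_{RE})' \, [V_{FE} - V_{RE}]^{-1} \, (b_{FE} - b_{RE})$, which is chi-squared under H₀ with as many degrees of freedom as there are compared coefficients. Below is a minimal sketch with toy inputs. `hausman_stat()` is a hypothetical helper and the numbers are made up, not taken from the models in this notebook.

```r
# Classical Hausman statistic for coefficients common to both models
# Hypothetical helper for illustration only.
hausman_stat <- function(b_fe, b_re, V_fe, V_re) {
  d    <- b_fe - b_re
  stat <- drop(t(d) %*% solve(V_fe - V_re) %*% d)
  df   <- length(d)
  c(chisq = stat, df = df, p = pchisq(stat, df = df, lower.tail = FALSE))
}

# Toy inputs (made up): two coefficients with diagonal covariance matrices
b_fe <- c(x1 =  0.04, x2 = 0.40)
b_re <- c(x1 = -0.10, x2 = 0.55)
V_fe <- diag(c(0.0025, 0.0080))   # FE is less efficient, so its variances are larger
V_re <- diag(c(0.0009, 0.0030))

h <- hausman_stat(b_fe, b_re, V_fe, V_re)
h
```

This form relies on the conventional covariance matrices, whereas the coefficient tables earlier in the notebook use clustered standard errors. The plm documentation also describes a regression-based variant (`method = "aux"`) that accepts a robust covariance; whether it changes the conclusion here is not checked in this notebook.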



In [10]:
# Perform Hausman test
hausman_test <- phtest(fe_model, re_model)
hausman_test


	Hausman Test

data:  satisf_std ~ has_kids + education + health_cat + year_fac
chisq = 152.17, df = 10, p-value < 2.2e-16
alternative hypothesis: one model is inconsistent



**Results:**

- **Test Statistic**: χ² = 152.17 (df = 10)
- **P-value**: < 2.2e-16
- **Decision**: **Strongly reject** the null hypothesis

**Interpretation:**

We strongly reject the hypothesis that individual effects are uncorrelated with the regressors. This means:

1. The individual effects **are** significantly correlated with the regressors
2. The Random Effects model is **not appropriate** for this data
3. We should use the **Fixed Effects model** for inference
4. This formally confirms our suspicion from Part C that RE assumptions are violated

**Practical Implication:**

The Fixed Effects estimate (0.042, not significant) is the more reliable estimate of the causal effect of having children on life satisfaction. The pooled OLS and RE estimates are biased due to selection on unobservables—individuals who choose to have children differ systematically from those who don't in ways that affect their baseline life satisfaction.

## 6. Summary of Findings

### Part A: Pooled OLS vs Fixed Effects

| Model | Coefficient | SE | P-value | Interpretation |
|-------|-------------|-----|---------|----------------|
| Pooled OLS | -0.147 | 0.027 | < 0.001 | Biased by selection |
| Fixed Effects | 0.042 | 0.047 | 0.369 | Robust to time-invariant selection |
| Difference | -0.189 | — | — | Reveals negative selection |

**Key Insight**: The unobserved individual effect $f_i$ is negatively correlated with `has_kids`, revealing substantial selection bias. People with lower baseline life satisfaction are more likely to have children.

### Part B: Gender Differences

| Gender | Effect | P-value |
|--------|--------|---------|
| Women | +0.134 | 0.029 |
| Men (sum of coefficients) | -0.038 | not tested above |
| Interaction | -0.172 | 0.058 |

**Key Insight**: Women experience a significant positive effect from having children, while the point estimate for men is small and negative. The gender difference is marginally significant (p = 0.058) and practically meaningful.

### Part C: Random Effects Validity

| Model | Coefficient | P-value |
|-------|-------------|---------|
| Random Effects | -0.100 | < 0.001 |
| Fixed Effects | 0.042 | 0.369 |
| Difference | 0.142 | — |

**Conclusion**: We **cannot trust** the RE assumptions. The substantial difference indicates $E(f_i | X) \neq 0$.

### Part D: Hausman Test

- **χ² = 152.17** (df = 10), **p < 2.2e-16**
- **Strongly reject** the null hypothesis
- **FE model is appropriate**

## Final Conclusion

This analysis demonstrates the critical importance of accounting for unobserved heterogeneity in panel data:

1. **Selection bias is substantial**: The naive comparison (pooled OLS) associates children with life satisfaction that is 0.15 standard deviations lower, but the fixed effects estimate indicates that this gap reflects selection rather than a within-person effect

2. **Fixed effects reveal the causal story**: Within individuals over time, having children has no significant overall effect on life satisfaction (coefficient = 0.042, p = 0.37)

3. **Gender heterogeneity matters**: Women experience a positive effect (+0.13 std), while men do not (-0.04 std). This suggests important differences in how parenthood affects life satisfaction across genders

4. **Random effects are inappropriate**: The Hausman test strongly rejects RE assumptions (χ² = 152, p < 0.001), confirming that individual effects are correlated with the decision to have children

**Methodological Takeaway**: The Fixed Effects model is the appropriate choice for causal inference in this context. Comparing the biased (OLS, RE) to unbiased (FE) estimates reveals the substantial role of selection on unobservables in determining who becomes a parent.

