## Final Exam

Michael Huang

EN.625.603.84

Description: The datafile contains data for 2015 for full-time workers with a high school diploma
or B.A./B.S. as their highest degree. See the pdf attachment for an overview of the data and
variable descriptions. In this exercise, you will investigate the relationship between a worker’s
age and earnings. (Generally, older workers have more job experience, leading to higher
productivity and higher earnings.)


In [None]:
from statsmodels.formula.api import ols

import numpy as np
import pandas as pd

# The original Excel data file was converted to CSV so it can be loaded with pd.read_csv

cps_df = pd.read_csv("CPS2015-1.csv")
cps_df


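| Unnamed: 0 | year | ahe | bachelor | female | age |
|---|---|---|---|---|---|
| 0 | 2015 | 11.778846 | 0 | 0 | 26 |
| 1 | 2015 | 9.615385 | 0 | 1 | 33 |
| 2 | 2015 | 12.019231 | 0 | 0 | 31 |
| 3 | 2015 | 18.376068 | 0 | 0 | 32 |
| 4 | 2015 | 41.836735 | 0 | 0 | 28 |
| ... | ... | ... | ... | ... | ... |
| 7093 | 2015 | 96.153847 | 1 | 0 | 25 |
| 7094 | 2015 | 30.769230 | 1 | 0 | 34 |
| 7095 | 2015 | 9.230769 | 0 | 0 | 27 |
| 7096 | 2015 | 13.653846 | 1 | 1 | 27 |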


##### a. Run a regression of average hourly earnings (AHE) on age (Age), gender (Female), and education (Bachelor). If age increases from 25 to 26, how are earnings expected to change? If age increases from 33 to 34, how are earnings expected to change?

In [32]:

model_a = ols("ahe ~ age + female + bachelor", data=cps_df).fit()
print(f"{model_a.summary()=}")

age_coeff_a = model_a.params["age"]
print(f"{age_coeff_a=}")

model_a.summary()=<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
Dep. Variable:                    ahe   R-squared:                       0.190
Model:                            OLS   Adj. R-squared:                  0.189
Method:                 Least Squares   F-statistic:                     553.4
Date:                Sat, 16 Aug 2025   Prob (F-statistic):          3.46e-323
Time:                        06:05:18   Log-Likelihood:                -27036.
No. Observations:                7098   AIC:                         5.408e+04
Df Residuals:                    7094   BIC:                         5.411e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------

Because this model is linear in age, the coefficient on age is the expected change in average hourly earnings for a one-year increase in age, holding gender and education fixed. The effect is therefore the same whether age increases from 25 to 26 or from 33 to 34. The estimated coefficient shown above is 0.531275239654071, so average hourly earnings are expected to increase by **0.531275239654071** dollars per hour (about $0.53) in both cases.

##### b. Run a regression of the logarithm of average hourly earnings, ln(AHE), on Age, Female, and Bachelor. If age increases from 25 to 26, how are earnings expected to change? If age increases from 33 to 34, how are earnings expected to change?

In [26]:
cps_df["ln_ahe"] = np.log(cps_df["ahe"])

model_b = ols("ln_ahe ~ age + female + bachelor", data=cps_df).fit()
print(f"{model_b.summary()=}")

age_coeff_b = model_b.params["age"]
print(f"{age_coeff_b=}")

model_b.summary()=<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
Dep. Variable:                 ln_ahe   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.208
Method:                 Least Squares   F-statistic:                     622.4
Date:                Sat, 16 Aug 2025   Prob (F-statistic):               0.00
Time:                        05:55:27   Log-Likelihood:                -4821.9
No. Observations:                7098   AIC:                             9652.
Df Residuals:                    7094   BIC:                             9679.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------

In this log-linear model, the coefficient on age, multiplied by 100, approximates the percentage change in earnings for a one-year increase in age, holding gender and education fixed. That percentage change is the same whether age increases from 25 to 26 or from 33 to 34. The estimated coefficient shown above is 0.02419115791114511, so average hourly earnings are expected to increase by approximately **2.419115791114511%** in both cases.
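Reading the coefficient times 100 as a percentage is a first-order approximation. As a minimal sketch (not part of the original analysis), the exact percentage change implied by a log-linear coefficient is 100 * (exp(coefficient) - 1). The helper name `exact_pct_change` is hypothetical, and the coefficient is copied from the output above.

```python
import math

# Age coefficient from the regression in (b), copied from the printed output.
age_coeff_b = 0.02419115791114511


def exact_pct_change(log_coeff, delta=1.0):
    """Exact % change in the level of y when x rises by `delta` in a log-linear model.

    Hypothetical helper for illustration only.
    """
    return 100 * (math.exp(log_coeff * delta) - 1)


approx_pct = 100 * age_coeff_b             # approximation used in the text
exact_pct = exact_pct_change(age_coeff_b)  # exact conversion from log points

print(f"{approx_pct=:.4f}")
print(f"{exact_pct=:.4f}")
```

The two figures differ by only about 0.03 percentage points, so the approximation is adequate for a coefficient this small.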

##### c. Run a regression of the logarithm of average hourly earnings, ln(AHE), on ln(Age), Female, and Bachelor. If age increases from 25 to 26, how are earnings expected to change? If age increases from 33 to 34, how are earnings expected to change?

In [27]:
cps_df["ln_age"] = np.log(cps_df["age"])

model_c = ols("ln_ahe ~ ln_age + female + bachelor", data=cps_df).fit()
print(f"{model_c.summary()=}")

ln_age_coeff_c = model_c.params["ln_age"]
print(f"{ln_age_coeff_c=}")

pct_change_age_25_to_26 = ((26-25)/25) * 100
pct_change_age_33_to_34 = ((34-33)/33) * 100

earnings_change_25_to_26 = ln_age_coeff_c * pct_change_age_25_to_26
earnings_change_33_to_34 = ln_age_coeff_c * pct_change_age_33_to_34

print(f"{earnings_change_25_to_26=}")
print(f"{earnings_change_33_to_34=}")

model_c.summary()=<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
Dep. Variable:                 ln_ahe   R-squared:                       0.209
Model:                            OLS   Adj. R-squared:                  0.208
Method:                 Least Squares   F-statistic:                     623.4
Date:                Sat, 16 Aug 2025   Prob (F-statistic):               0.00
Time:                        05:55:27   Log-Likelihood:                -4820.8
No. Observations:                7098   AIC:                             9650.
Df Residuals:                    7094   BIC:                             9677.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------

This is a log-log model, so the coefficient on ln(Age) is the elasticity of earnings with respect to age: the percentage change in earnings associated with a 1% change in age. The estimated coefficient implies that a 1% increase in age is associated with a 0.7153749610992853% increase in earnings. Going from 25 to 26 is a 4% increase in age, and going from 33 to 34 is roughly a 3.03% increase, so multiplying each percentage change in age by the elasticity gives the approximate percentage change in earnings. Doing this, we find that the change for 25 to 26 is **2.8614998443971413%** and the change for 33 to 34 is **2.1678029124220766%**.
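Multiplying the elasticity by the percentage change in age is a linear approximation. A minimal sketch of the exact calculation, assuming only the log-log functional form: moving from age a0 to age a1 changes ln(AHE) by elasticity * ln(a1 / a0). The helper name `loglog_pct_change` is hypothetical, and the elasticity is copied from the output above.

```python
import math

# Elasticity (coefficient on ln_age) from the regression in (c).
ln_age_coeff_c = 0.7153749610992853


def loglog_pct_change(elasticity, age_from, age_to):
    """Exact % change in earnings implied by a log-log model (hypothetical helper)."""
    log_change = elasticity * math.log(age_to / age_from)
    return 100 * (math.exp(log_change) - 1)


exact_changes = {}
for age_from, age_to in [(25, 26), (33, 34)]:
    # Linear approximation used in the text: elasticity times % change in age.
    approx = ln_age_coeff_c * (age_to - age_from) / age_from * 100
    exact = loglog_pct_change(ln_age_coeff_c, age_from, age_to)
    exact_changes[(age_from, age_to)] = exact
    print(f"{age_from} -> {age_to}: approx {approx:.3f}%, exact {exact:.3f}%")
```

The exact figures come out slightly below the approximations, and the conclusion is unchanged: the implied percentage gain from one more year of age is smaller at 33 than at 25.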

##### d. Run a regression of the logarithm of average hourly earnings, ln(AHE), on Age, Age^2, Female, and Bachelor. If age increases from 25 to 26, how are earnings expected to change? If age increases from 33 to 34, how are earnings expected to change?

In [None]:
cps_df["age_squared"] = cps_df["age"] ** 2

model_d = ols("ln_ahe ~ age + age_squared + female + bachelor", data=cps_df).fit()
print(f"{model_d.summary()=}")

def marginal_age_effect(age):
    age_coeff = model_d.params["age"]
    age_squared_coeff = model_d.params["age_squared"]
    return age_coeff + 2 * age_squared_coeff * age

print(f"{marginal_age_effect(25)=}")
print(f"{marginal_age_effect(33)=}")

model_d.summary()=<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
Dep. Variable:                 ln_ahe   R-squared:                       0.209
Model:                            OLS   Adj. R-squared:                  0.209
Method:                 Least Squares   F-statistic:                     468.6
Date:                Sat, 16 Aug 2025   Prob (F-statistic):               0.00
Time:                        06:17:30   Log-Likelihood:                -4819.1
No. Observations:                7098   AIC:                             9648.
Df Residuals:                    7093   BIC:                             9682.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------

Because the model is quadratic in age, the marginal effect of age depends on the age at which it is evaluated. We find it by differentiating the model with respect to age:

> ln(ahe) = age_coeff * age + age_squared_coeff * age^2 + other terms

> d ln(ahe) / d age = age_coeff + 2 * age_squared_coeff * age

Plugging in ages 25 and 33 gives the marginal effect at each age, and multiplying by 100 expresses it as an approximate percentage change in earnings. Calculating this, we find that the change for 25 to 26 is **4.1100786801177264%** and the change for 33 to 34 is **1.1336188193903682%**.
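The derivative gives the slope at a single age, while the question asks about a full one-year step. The sketch below compares the two under the same quadratic form. The coefficients are hypothetical placeholders chosen for illustration, not the estimates from `model_d`, and the function names are likewise assumptions.

```python
# Hypothetical placeholder coefficients, NOT the estimates from model_d.
age_coeff = 0.10
age_squared_coeff = -0.001


def derivative_effect(age):
    """Slope of ln(ahe) with respect to age, evaluated at `age`."""
    return age_coeff + 2 * age_squared_coeff * age


def discrete_effect(age_from, age_to):
    """Change in predicted ln(ahe) when age moves from `age_from` to `age_to`."""
    return (age_coeff * (age_to - age_from)
            + age_squared_coeff * (age_to ** 2 - age_from ** 2))


for age in (25, 33):
    slope = derivative_effect(age)
    step = discrete_effect(age, age + 1)
    # For a one-year step, the two differ by exactly age_squared_coeff.
    print(f"age {age}: derivative {slope:.4f}, one-year step {step:.4f}")
```

For a one-year step, the discrete change equals the derivative plus the coefficient on the squared term, so the two measures are close whenever that coefficient is small.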

##### e. Do you prefer the regression in (c) to the regression in (b)? Explain.

I would prefer the regression in (c). The regression in (b) forces each additional year of age to raise earnings by the same percentage regardless of the worker's age, which makes it less useful for comparing workers at different ages. The regression in (c) instead holds the elasticity constant, so the percentage return to an additional year varies with age (about 2.86% at age 25 versus about 2.17% at age 33). The elasticity of earnings with respect to age is also intuitive to interpret. The two regressions fit the data almost equally well (R-squared of 0.209 for (c) versus 0.208 for (b)), so this preference rests on interpretation rather than on a clear difference in fit.

##### f. Do you prefer the regression in (d) to the regression in (b)? Explain.

I would prefer the regression in (d). The regression in (b) restricts the percentage effect of an additional year of age to be the same at every age, while the regression in (d) adds the Age^2 term and so lets that effect vary with age (about 4.11% at age 25 versus about 1.13% at age 33). Being able to measure how the marginal effect of age on earnings changes with age is more useful than a single constant effect.

##### g. Do you prefer the regression in (d) to the regression in (c)? Explain.

I would prefer the regression in (d). Both regressions allow the effect of an additional year of age on earnings to change as age increases, but the regression in (c) does so only by imposing an elasticity that is constant across all ages, which seems a bit simplistic. The regression in (d) is more flexible: the quadratic term lets the data determine how quickly the marginal effect of age changes, which is likely more realistic given how complex this relationship is in the real world.
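As a supplement to this comparison, the fit statistics already printed for regressions (b), (c), and (d) can be lined up side by side, since all three share the same dependent variable and sample. This is a minimal sketch using values copied from the summaries above. Information criteria are only one of several reasonable ways to compare specifications, and the dictionary layout is an assumption made for illustration.

```python
# Fit statistics copied from the summaries printed for regressions (b), (c), (d).
fit_stats = {
    "b": {"adj_r2": 0.208, "aic": 9652, "bic": 9679},
    "c": {"adj_r2": 0.208, "aic": 9650, "bic": 9677},
    "d": {"adj_r2": 0.209, "aic": 9648, "bic": 9682},
}

# Lower AIC/BIC is better; BIC penalizes the extra Age^2 parameter more heavily.
best_by_aic = min(fit_stats, key=lambda name: fit_stats[name]["aic"])
best_by_bic = min(fit_stats, key=lambda name: fit_stats[name]["bic"])

for name, stats in fit_stats.items():
    print(f"({name}) adj R2={stats['adj_r2']:.3f} AIC={stats['aic']} BIC={stats['bic']}")
print(f"{best_by_aic=}, {best_by_bic=}")
```

AIC favors (d) while BIC favors (c), and the gaps are small in both cases, so the data alone do not separate the two specifications sharply. The preference for (d) rests mostly on its flexibility.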

##### h. Run a regression of ln(AHE), on Age, Age^2, Female, Bachelor, and the interaction term Female*Bachelor. What does the coefficient on the interaction term measure? Alexis is a 30-year-old female with a bachelor’s degree. What does the regression predict for her value of ln(AHE)? Jane is a 30-year-old female with a high school degree. What does the regression predict for her value of ln(AHE)? What is the predicted difference between Alexis’s and Jane’s earnings? Bob is a 30-year-old male with a bachelor’s degree. What does the regression predict for his value of ln(AHE)? Jim is a 30-year-old male with a high school degree. What does the regression predict for his value of ln(AHE)? What is the predicted difference between Bob’s and Jim’s earnings?

In [45]:
cps_df["female_bachelor"] = cps_df["female"] * cps_df["bachelor"]

model_h = ols("ln_ahe ~ age + age_squared + female + bachelor + female_bachelor", data=cps_df).fit()
print(f"{model_h.summary()=}")

alexis_ln_ahe = (model_h.params["Intercept"] + 
                 model_h.params["age"] * 30 +
                 model_h.params["age_squared"] * 30 ** 2 +
                 model_h.params["female"] * 1 +
                 model_h.params["bachelor"] * 1 +
                 model_h.params["female_bachelor"] * 1)
print(f"{alexis_ln_ahe=}")

jane_ln_ahe = (model_h.params["Intercept"] + 
               model_h.params["age"] * 30 +
               model_h.params["age_squared"] * 30 ** 2 +
               model_h.params["female"] * 1 +
               model_h.params["bachelor"] * 0 +
               model_h.params["female_bachelor"] * 0)
print(f"{jane_ln_ahe=}")

alexis_jane_diff = np.exp(alexis_ln_ahe) - np.exp(jane_ln_ahe)
print(f"{alexis_jane_diff=}")

bob_ln_ahe = (model_h.params["Intercept"] + 
              model_h.params["age"] * 30 +
              model_h.params["age_squared"] * 30 ** 2 +
              model_h.params["female"] * 0 +
              model_h.params["bachelor"] * 1 +
              model_h.params["female_bachelor"] * 0)
print(f"{bob_ln_ahe=}")

jim_ln_ahe = (model_h.params["Intercept"] + 
              model_h.params["age"] * 30 +
              model_h.params["age_squared"] * 30 ** 2 +
              model_h.params["female"] * 0 +
              model_h.params["bachelor"] * 0 +
              model_h.params["female_bachelor"] * 0)
print(f"{jim_ln_ahe=}")

bob_jim_diff = np.exp(bob_ln_ahe) - np.exp(jim_ln_ahe)
print(f"{bob_jim_diff=}")

model_h.summary()=<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
Dep. Variable:                 ln_ahe   R-squared:                       0.209
Model:                            OLS   Adj. R-squared:                  0.209
Method:                 Least Squares   F-statistic:                     375.1
Date:                Sat, 16 Aug 2025   Prob (F-statistic):               0.00
Time:                        06:39:56   Log-Likelihood:                -4818.5
No. Observations:                7098   AIC:                             9649.
Df Residuals:                    7092   BIC:                             9690.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                      coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------

The coefficient on the interaction term measures how the effect of a bachelor's degree on ln(AHE) differs between women and men. Equivalently, it is the additional effect of having a bachelor's degree for women, beyond the separate effects of being a woman and of having a bachelor's degree.

We see from the calculations above the following:
- Alexis's predicted ln(AHE) is **3.057717242724363**
- Jane's predicted ln(AHE) is **2.5821294247778046**
- The difference between Alexis's and Jane's earnings is **$8.053656637447899 per hour**
- Bob's predicted ln(AHE) is **3.2245672291181027**
- Jim's predicted ln(AHE) is **2.7724535719982715**
- The difference between Bob's and Jim's earnings is **$9.144853034268806 per hour**
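The dollar differences above depend on each pair's earnings level, so it can also help to express the degree premium in relative terms. The sketch below uses only the four predicted ln(AHE) values listed above. The helper name `pct_gap` is hypothetical. Because all four people are 30 years old, the difference between the women's and men's log gaps should match the coefficient on `female_bachelor`, which is not shown in the truncated summary.

```python
import math

# Predicted ln(AHE) values copied from the results listed above.
predicted_ln_ahe = {
    "Alexis": 3.057717242724363,
    "Jane": 2.5821294247778046,
    "Bob": 3.2245672291181027,
    "Jim": 2.7724535719982715,
}


def pct_gap(name_high, name_low):
    """Exact % by which predicted earnings of `name_high` exceed `name_low`."""
    log_gap = predicted_ln_ahe[name_high] - predicted_ln_ahe[name_low]
    return 100 * (math.exp(log_gap) - 1)


women_log_gap = predicted_ln_ahe["Alexis"] - predicted_ln_ahe["Jane"]
men_log_gap = predicted_ln_ahe["Bob"] - predicted_ln_ahe["Jim"]

# Difference-in-differences of the log predictions (the interaction effect).
implied_interaction = women_log_gap - men_log_gap

print(f"Women: bachelor's premium of {pct_gap('Alexis', 'Jane'):.1f}%")
print(f"Men:   bachelor's premium of {pct_gap('Bob', 'Jim'):.1f}%")
print(f"{implied_interaction=:.4f}")
```

In relative terms the predicted degree premium is slightly larger for women than for men, even though the dollar difference is larger for men, because the men's predicted earnings start from a higher level.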

##### i. Is the effect of age on earnings different for men than for women? Specify and estimate a regression that you can use to answer this question.

In [46]:
# Interact female with both the linear and the squared age terms
cps_df["female_age"] = cps_df["female"] * cps_df["age"]
cps_df["female_age_squared"] = cps_df["female"] * cps_df["age_squared"]

model_i = ols("ln_ahe ~ age + age_squared + female + bachelor + female_age + female_age_squared", data=cps_df).fit()
print(f"{model_i.summary()=}")

print(f"{model_i.params['female_age']=}")
print(f"{model_i.pvalues['female_age']=}")

print(f"{model_i.params['female_age_squared']=}")
print(f"{model_i.pvalues['female_age_squared']=}")

model_i.summary()=<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
Dep. Variable:                 ln_ahe   R-squared:                       0.210
Model:                            OLS   Adj. R-squared:                  0.209
Method:                 Least Squares   F-statistic:                     313.4
Date:                Sat, 16 Aug 2025   Prob (F-statistic):               0.00
Time:                        06:53:55   Log-Likelihood:                -4816.4
No. Observations:                7098   AIC:                             9647.
Df Residuals:                    7091   BIC:                             9695.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------

After adding the interaction terms female * age and female * age^2, we find that both coefficients are small and have high p-values, so neither is individually statistically significant at the 5% level (p > 0.05). We therefore find no statistically significant evidence that the effect of age on earnings differs between men and women.
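Since the question concerns both interaction terms together, a joint test is a natural complement to the individual p-values. The sketch below is a likelihood-ratio test built from the log-likelihoods printed for regressions (d) and (i). It is a swapped-in technique, not something run in the original cells. It is approximate because the log-likelihoods are rounded to one decimal, and it relies on the nonrobust Gaussian likelihood that the summaries report.

```python
import math

# Log-likelihoods copied from the printed summaries.
llf_restricted = -4819.1    # regression (d): no female-by-age interactions
llf_unrestricted = -4816.4  # regression (i): adds female_age, female_age_squared
num_restrictions = 2        # two interaction coefficients set to zero under H0

# Likelihood-ratio statistic, asymptotically chi-squared with 2 degrees of freedom.
lr_stat = 2 * (llf_unrestricted - llf_restricted)

# With exactly 2 degrees of freedom, the chi-squared upper-tail probability
# has the closed form exp(-x / 2), so no statistics library is needed.
assert num_restrictions == 2
p_value = math.exp(-lr_stat / 2)

print(f"{lr_stat=:.2f}")
print(f"{p_value=:.4f}")
```

The resulting p-value is a little under 0.07, so the two terms are not jointly significant at the 5% level, although they would be at the 10% level. With the fitted models in memory, `model_i.compare_lr_test(model_d)` or `model_i.compare_f_test(model_d)` should give the same comparison without rounding.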

##### j. Is the effect of age on earnings different for high school graduates than for college graduates? Specify and estimate a regression that you can use to answer this question.

In [48]:
# Interact bachelor with both the linear and the squared age terms
cps_df["bachelor_age"] = cps_df["bachelor"] * cps_df["age"]
cps_df["bachelor_age_squared"] = cps_df["bachelor"] * cps_df["age_squared"]

model_j = ols("ln_ahe ~ age + age_squared + female + bachelor + bachelor_age + bachelor_age_squared", data=cps_df).fit()
print(f"{model_j.summary()=}")

print(f"{model_j.params['bachelor_age']=}")
print(f"{model_j.pvalues['bachelor_age']=}")

print(f"{model_j.params['bachelor_age_squared']=}")
print(f"{model_j.pvalues['bachelor_age_squared']=}")

model_j.summary()=<class 'statsmodels.iolib.summary.Summary'>
"""
                            OLS Regression Results                            
Dep. Variable:                 ln_ahe   R-squared:                       0.209
Model:                            OLS   Adj. R-squared:                  0.209
Method:                 Least Squares   F-statistic:                     312.7
Date:                Sat, 16 Aug 2025   Prob (F-statistic):               0.00
Time:                        07:02:04   Log-Likelihood:                -4818.0
No. Observations:                7098   AIC:                             9650.
Df Residuals:                    7091   BIC:                             9698.
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------

After adding the interaction terms bachelor * age and bachelor * age^2, we again find that both coefficients are small and have high p-values, so neither is individually statistically significant at the 5% level (p > 0.05), although these terms look somewhat stronger than the female * age terms. Nonetheless, we find no statistically significant evidence that the effect of age on earnings differs between high school and college graduates.
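A joint F-statistic for the two bachelor-by-age terms can be recovered from the summaries alone, because the Gaussian OLS log-likelihood is a function of the sum of squared residuals (SSR). The sketch below is an approximation based on rounded log-likelihoods and the nonrobust covariance setting, not a calculation from the original cells. The critical value of 3.0 is the usual approximate 5% cutoff for an F distribution with 2 numerator degrees of freedom and a very large denominator.

```python
import math

n_obs = 7098                  # No. Observations
df_resid_unrestricted = 7091  # Df Residuals for regression (j)
num_restrictions = 2          # bachelor_age and bachelor_age_squared

llf_restricted = -4819.1      # regression (d)
llf_unrestricted = -4818.0    # regression (j)

# Gaussian OLS log-likelihood: llf = -(n / 2) * (log(2 * pi) + log(SSR / n) + 1),
# so SSR_restricted / SSR_unrestricted = exp(2 * (llf_u - llf_r) / n).
ssr_ratio = math.exp(2 * (llf_unrestricted - llf_restricted) / n_obs)

# F = ((SSR_r - SSR_u) / q) / (SSR_u / df_u), rewritten in terms of the ratio.
f_stat = (ssr_ratio - 1) * df_resid_unrestricted / num_restrictions

approx_critical_5pct = 3.0    # approximate cutoff, an assumption stated above
print(f"{f_stat=:.3f}")
print(f"jointly significant at 5%: {f_stat > approx_critical_5pct}")
```

The statistic comes out near 1.1, well below the cutoff, which agrees with the conclusion drawn from the individual p-values. With the fitted models available, `model_j.compare_f_test(model_d)` should return the exact statistic and p-value.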

##### k. After running all these regressions, summarize the effect of age on earnings for young workers.

For younger workers, earnings increase with age, but the marginal effect of an additional year declines as age increases. Older workers do tend to earn more than younger workers, but the rate at which earnings grow slows with age. In addition, this effect of age on earnings does not appear to differ much by gender or education. Overall, an additional year of age is worth more, in terms of earnings, to a younger worker than to an older one.

##### Extra Credit (5 points):
In a few sentences, describe something you learned/discovered through this assignment.

I wasn't aware that the marginal effect of age on earnings decreases with age. I had some thoughts about ageism in certain industries and how it affects overall employment, which could mean that the older workers who remain employed tend to be better paid, so that average earnings would rise more steeply at higher ages. The regressions generally suggest that this is not the case. In addition, gender seemed to make little difference to how earnings change with age, which I also found interesting and which could warrant further investigation into further breakdowns such as job categories and career lengths.