Nick Carroll, Yueyin Su
<br>
Unifying Data Science
<br>
Estimating Gender Discrimination in the Workplace

Exercise 1: Loading Data

In [1]:
import pandas as pd
data = pd.read_stata('https://github.com/nickeubank/MIDS_Data/raw/master/Current_Population_Survey/morg18.dta')

Exercise 2: Subsetting Data

The data is subset to respondents who are employed, at work, and usually work full time (at least 35 hours per week).

In [2]:
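# .copy() makes subset an independent DataFrame, so columns added later do not raise SettingWithCopyWarning
subset = data.loc[(data.loc[:, 'lfsr94'] == 'Employed-At Work') & (data.loc[:, 'uhourse'] >= 35), :].copy()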
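
The filtering step can be illustrated on a small invented table. This is only a sketch: the rows, the `toy` and `full_time` names, and the 'Unemployed-Looking' label are assumptions made for the example, not values taken from the CPS file. It combines two conditions into a boolean mask and takes an explicit copy, so that new columns can be added to the result without pandas warning about writing to a slice.

```python
import pandas as pd

# Toy stand-in for the CPS extract; all values here are invented for illustration.
toy = pd.DataFrame({
    "lfsr94": ["Employed-At Work", "Employed-At Work", "Unemployed-Looking", "Employed-At Work"],
    "uhourse": [40, 20, 40, 35],
    "earnwke": [1000.0, 300.0, None, 700.0],
})

# Boolean mask: employed, at work, and at least 35 usual hours per week.
mask = (toy["lfsr94"] == "Employed-At Work") & (toy["uhourse"] >= 35)

# .copy() returns an independent DataFrame rather than a slice of toy,
# so the column assignment below does not touch (or warn about) the original.
full_time = toy.loc[mask].copy()
full_time["hourly"] = full_time["earnwke"] / full_time["uhourse"]

print(full_time)
```

Only the first and last rows pass both conditions, and `toy` itself is left without the new column.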

Exercise 3: Basic Wage Gap Estimate

In [11]:
print(f"In the subset data, there are {subset.loc[:, 'earnwke'].isna().sum(): ,} missing values associated with weekly earnings.  {subset.loc[subset.loc[:, 'sex'] == 1, 'earnwke'].isna().sum(): ,} of those values belong to men, and {subset.loc[subset.loc[:, 'sex'] == 2, 'earnwke'].isna().sum(): ,} belong to women.")

In the subset data, there are  11,211 missing values associated with weekly earnings.   7,946 of those values belong to men, and  3,265 belong to women.


We ignore these missing values in the analysis that follows.
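
As a rough illustration of what ignoring missing values means in practice, the sketch below uses a few invented earnings and hours values (not CPS data). A missing weekly earnings value propagates to a missing hourly rate, and the pandas `mean()` method skips it by default, which is equivalent to dropping those rows before averaging.

```python
import numpy as np
import pandas as pd

# Invented weekly earnings and usual hours; one earnings value is missing.
earn = pd.Series([800.0, np.nan, 1200.0, 1000.0])
hours = pd.Series([40, 40, 40, 50])

# Division propagates the missing value: the second hourly rate is NaN.
hourly = earn / hours

# Series.mean() skips NaN by default (skipna=True) ...
mean_skipping = hourly.mean()
# ... which matches dropping the missing rows explicitly first.
mean_dropped = hourly.dropna().mean()

# count() reports how many observations actually enter the average.
n_used = hourly.count()

print(n_used, mean_skipping, mean_dropped)
```

The average is therefore taken over the observed rows only. It describes the full group only if earnings are missing for reasons unrelated to how much people earn, which is an assumption rather than something the data can confirm.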

In [4]:
%%capture --no-display
subset.loc[:, 'average hourly rate'] = subset.loc[:, 'earnwke'] / subset.loc[:, 'uhourse']

In [5]:
men_subset = subset.loc[subset.loc[:, 'sex'] == 1]
women_subset = subset.loc[subset.loc[:, 'sex'] == 2]

In [17]:
print(f"The average gender wage gap is: ${men_subset.loc[:, 'average hourly rate'].mean() - women_subset.loc[:, 'average hourly rate'].mean(): .2f} per hour, or ${men_subset.loc[:, 'earnwke'].mean() - women_subset.loc[:, 'earnwke'].mean(): .2f} per week.")
print(f"Therefore, on average men tend to make {(men_subset.loc[:, 'average hourly rate'].mean() / women_subset.loc[:, 'average hourly rate'].mean() - 1) * 100: .2f}% more per hour, or {100 * (men_subset.loc[:, 'earnwke'].mean() / women_subset.loc[:, 'earnwke'].mean() - 1): .2f}% more per week than women.")

The average gender wage gap is: $ 4.08 per hour, or $ 219.05 per week.
Therefore, on average men tend to make  17.14% more per hour, or  22.22% more per week than women.


Exercise 4: Comparing Annual Earnings Wage Gap

In [25]:
print(f"Assuming a 40-hour work week and a 48-week work year, the average annual gender wage gap is: ${(men_subset.loc[:, 'average hourly rate'].mean() - women_subset.loc[:, 'average hourly rate'].mean()) * 40 * 48:,.2f}.")
print(f"Therefore, on average men tend to make {(men_subset.loc[:, 'average hourly rate'].mean() / women_subset.loc[:, 'average hourly rate'].mean() - 1) * 100: .2f}% more than women.")

Assuming a 40-hour work week and a 48-week work year, the average annual gender wage gap is: $7,833.93.
Therefore, on average men tend to make  17.14% more than women.
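
The annual figure is the hourly gap scaled by a constant, which can be checked by hand. The sketch below starts from the rounded values printed above ($4.08 per hour and $7,833.93 per year). The two hourly rates in the last step are hypothetical and serve only to show that scaling both groups by the same number of hours leaves the percentage gap unchanged, which is why the 17.14% figure reappears.

```python
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 48
hours_per_year = HOURS_PER_WEEK * WEEKS_PER_YEAR  # 1,920 hours

hourly_gap_rounded = 4.08      # rounded hourly gap printed in Exercise 3
annual_gap_reported = 7833.93  # annual gap printed in Exercise 4

# Scaling the rounded hourly gap gives a slightly smaller number than the
# reported one, because the notebook scales the unrounded gap.
annual_from_rounded = hourly_gap_rounded * hours_per_year
implied_hourly_gap = annual_gap_reported / hours_per_year

# Hypothetical hourly rates, used only to show the ratio is scale-invariant.
rate_a, rate_b = 30.0, 25.0
pct_hourly = rate_a / rate_b - 1
pct_annual = (rate_a * hours_per_year) / (rate_b * hours_per_year) - 1

print(f"{annual_from_rounded:,.2f}", f"{implied_hourly_gap:.4f}")
print(pct_hourly, pct_annual)
```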


Exercise 5: Comparing the Causal Relationship between Gender and Earnings

For this difference in means to be an accurate causal estimate, the two groups must be identical in every attribute that affects earnings other than gender (e.g., education level, experience, motivation, performance), or gender must be as good as randomly assigned so that those other effects cancel out on average.  One reason this may not hold is that the data come from a survey, so the estimate relies on respondents answering accurately and being representative of the actual population.  The full-time filter can also produce groups that are not comparable: for example, many men in their sixties may still consider themselves to be working full-time, while fewer women in their sixties may do so, so the men and women who remain in the sample can differ in ways other than gender.
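
A small simulation can make the concern concrete. Everything in this sketch is invented: the `senior` attribute, the probabilities, and the pay levels are assumptions chosen so that wages depend only on seniority and not on gender. Because the two groups differ in seniority, a raw difference in means still shows a gap, while comparing like with like does not.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical population: men are more often "senior" than women.
male = rng.integers(0, 2, size=n)
p_senior = np.where(male == 1, 0.6, 0.3)
senior = rng.random(n) < p_senior

# Wages depend on seniority only; there is no gender term at all.
wage = 20.0 + 10.0 * senior + rng.normal(0.0, 3.0, size=n)

# Raw difference in means picks up the seniority difference (about 10 * 0.3).
raw_gap = wage[male == 1].mean() - wage[male == 0].mean()

# Comparing men and women with the same seniority removes it.
within_gaps = {
    level: wage[(male == 1) & (senior == level)].mean()
    - wage[(male == 0) & (senior == level)].mean()
    for level in (False, True)
}

print(raw_gap, within_gaps)
```

In real data the direction and size of this bias are unknown, since attributes such as motivation or performance are not observed.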

Exercise 6: Regression of Wage Earnings on Gender and Age

In [59]:
import statsmodels.api as sm

subset.loc[:, 'age squared'] = subset.loc[:, 'age'].astype('int32') ** 2
dropped_subset = subset.loc[:, ['sex', 'age', 'age squared', 'average hourly rate']].dropna()
y = dropped_subset.loc[:, 'average hourly rate'].to_numpy()
X = dropped_subset.loc[:, ['sex', 'age', 'age squared']]
X = sm.add_constant(X)
model = sm.OLS(y, X)
results = model.fit()
print(results.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.073
Model:                            OLS   Adj. R-squared:                  0.073
Method:                 Least Squares   F-statistic:                     3199.
Date:                Fri, 17 Feb 2023   Prob (F-statistic):               0.00
Time:                        16:47:10   Log-Likelihood:            -5.0447e+05
No. Observations:              122603   AIC:                         1.009e+06
Df Residuals:                  122599   BIC:                         1.009e+06
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           2.1715      0.433      5.020      


In [71]:
print(f"According to the linear regression, accounting for age, the average gender wage gap is ${abs(results.params['sex']): .2f} / hour.")

According to the linear regression, accounting for age, the average gender wage gap is $ 4.18 / hour.


After controlling for age, the hourly gender wage gap from the regression ($4.18) is close to the baseline difference in average hourly wages between men and women ($4.08).

Exercise 7: Understanding the Implication of the Regression

The implicit comparison in this regression is between men and women of the same age: the coefficient on sex implies that if an average man were to "become" female while keeping his age, his wage would drop by $4.18 / hour.  For this to be a causal estimate, the two groups would have to be identical in every other way, so that the only difference is that one is male and one is female.
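
The comparison can be sketched with ordinary least squares on synthetic data. This uses NumPy's `lstsq` instead of statsmodels, a 0/1 `female` indicator instead of the 1/2 `sex` coding, and an invented wage equation with a built-in gap of $4 per hour; none of these come from the CPS data. The point is that the coefficient on the gender indicator equals the difference in predicted wages between a woman and a man of the same age.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Synthetic workers: gender indicator, age, and an invented wage equation.
female = rng.integers(0, 2, size=n).astype(float)
age = rng.integers(18, 65, size=n).astype(float)
wage = 5.0 + 1.2 * age - 0.012 * age**2 - 4.0 * female + rng.normal(0.0, 3.0, size=n)

# Design matrix: constant, female, age, age squared.
X = np.column_stack([np.ones(n), female, age, age**2])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)

# Predicted wages for a 40-year-old man and a 40-year-old woman.
x_man = np.array([1.0, 0.0, 40.0, 40.0**2])
x_woman = np.array([1.0, 1.0, 40.0, 40.0**2])
predicted_gap = x_woman @ beta - x_man @ beta

print(beta[1], predicted_gap)
```

The same difference is obtained at any age, because the model has no interaction between gender and age.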

Exercise 8: Labeling Education Levels

In [75]:
# grade92 is a coded attainment level: 39 = high school diploma or GED, 43 = bachelor's degree
subset.loc[:, 'High School Graduate'] = subset.loc[:, 'grade92'].apply(lambda x: 1 if x >= 39 else 0)
subset.loc[:, 'College Graduate'] = subset.loc[:, 'grade92'].apply(lambda x: 1 if x >= 43 else 0)


Exercise 9: Regressing on Gender, Age, and Education Graduate Level

In [81]:
dropped_subset2 = subset.loc[:, ['sex', 'age', 'age squared', 'average hourly rate', 'High School Graduate', 'College Graduate']].dropna()
y2 = dropped_subset2.loc[:, 'average hourly rate'].to_numpy()
X2 = dropped_subset2.loc[:, ['sex', 'age', 'age squared', 'High School Graduate', 'College Graduate']]
X2 = sm.add_constant(X2)
model2 = sm.OLS(y2, X2)
results2 = model2.fit()
print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.264
Model:                            OLS   Adj. R-squared:                  0.264
Method:                 Least Squares   F-statistic:                     8793.
Date:                Fri, 17 Feb 2023   Prob (F-statistic):               0.00
Time:                        17:06:59   Log-Likelihood:            -4.9031e+05
No. Observations:              122603   AIC:                         9.806e+05
Df Residuals:                  122597   BIC:                         9.807e+05
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   -2.5007 

In [82]:
print(f"According to the linear regression, accounting for age and graduated education level, the average gender wage gap is ${abs(results2.params['sex']): .2f} / hour.")

According to the linear regression, accounting for age and graduated education level, the average gender wage gap is $ 5.28 / hour.


Exercise 10: Comparing Education Impact on Wage Gap Regression

Controlling for education raises the estimated wage gap from $4.18 to $5.28 per hour.  This implies that the women in this dataset are, on average, more educated than the men: in the regression that did not account for education, their higher education was offsetting part of the gap.

Exercise 11: Potential Outcomes for Men and Women without Education

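Before adding education as a control, the comparison of potential outcomes for men and women underestimates the wage gap: education is strongly correlated with earnings, and because the women in the sample tend to have more education, omitting it masks part of the gap.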
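
The mechanism can be reproduced in a short simulation. The numbers are invented: a true gap of $5 per hour, a $10 college premium, and college rates of 50% for women and 30% for men are assumptions chosen only to mimic the pattern described above. The regression that omits education understates the gap, and adding the education indicator recovers it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000

# Hypothetical population in which women are more often college graduates.
female = rng.integers(0, 2, size=n).astype(float)
college = (rng.random(n) < np.where(female == 1, 0.5, 0.3)).astype(float)

# Invented wage equation: a gap of 5 for women, a premium of 10 for college.
wage = 20.0 - 5.0 * female + 10.0 * college + rng.normal(0.0, 3.0, size=n)


def ols(columns, y):
    """Least-squares coefficients for a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


# Without education the coefficient absorbs women's education advantage.
gap_without_education = ols([female], wage)[1]
# With education the coefficient is close to the built-in gap.
gap_with_education = ols([female, college], wage)[1]

print(gap_without_education, gap_with_education)
```

Here the omitted-variable term is the college premium times the difference in college rates, 10 * 0.2 = 2, so the short regression shows a gap near 3 instead of 5.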

Exercise 12: Accounting for Fixed Effects

In [101]:
dropped_subset3 = pd.concat([subset.loc[:, ['sex', 'age', 'age squared', 'average hourly rate', 'High School Graduate', 'College Graduate']], pd.get_dummies(subset.loc[:, 'ind02'])], axis = 1).dropna()
y3 = dropped_subset3.loc[:, 'average hourly rate'].to_numpy()
X3 = dropped_subset3.loc[:, dropped_subset3.columns[dropped_subset3.columns != 'average hourly rate']]
X3 = sm.add_constant(X3)
model3 = sm.OLS(y3, X3)
results3 = model3.fit()
print(results3.summary())

                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.311
Model:                            OLS   Adj. R-squared:                  0.310
Method:                 Least Squares   F-statistic:                     210.4
Date:                Fri, 17 Feb 2023   Prob (F-statistic):               0.00
Time:                        17:28:36   Log-Likelihood:            -4.8621e+05
No. Observations:              122603   AIC:                         9.730e+05
Df Residuals:                  122339   BIC:                         9.755e+05
Df Model:                         263                                         
Covariance Type:            nonrobust                                         
                                                                                                                   coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------

In [103]:
print(f"According to the linear regression, accounting for age, graduated education level, and industry, the average gender wage gap is ${abs(results3.params['sex']): .2f} / hour.")

According to the linear regression, accounting for age, graduated education level, and industry, the average gender wage gap is $ 4.54 / hour.


Exercise 13: Understanding the Wage Gap Accounting for Industry

Once industry is included, the counterfactual comparison is between men and women of the same age and education level who work in the same industry.

Exercise 14: Comparing Wage Gap Estimate Accounting for Industry Effects

When accounting for industry in addition to age and education, the estimated wage gap falls from $5.28 to $4.54 per hour.  This suggests that women are more likely to work in lower-paid industries.
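
One way to see what the industry indicators do is to compare a dummy-variable regression with a within-industry comparison on synthetic data. The five industries, their pay levels, and the shares of women in each are all invented for this sketch, with women placed more often in the lower-paid industries and a built-in gap of $4 per hour. Regressing on industry dummies gives the same coefficient as subtracting each industry's mean from wages and from the gender indicator, which is a within-industry comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6_000
n_industries = 5

# Invented industries: base pay rises with the index, share of women falls.
industry = rng.integers(0, n_industries, size=n)
base_pay = np.array([15.0, 20.0, 25.0, 30.0, 35.0])
share_female = np.array([0.7, 0.6, 0.5, 0.4, 0.3])
female = (rng.random(n) < share_female[industry]).astype(float)
wage = base_pay[industry] - 4.0 * female + rng.normal(0.0, 3.0, size=n)

# Raw difference in means mixes the gap with the industry pay differences.
naive_gap = wage[female == 1].mean() - wage[female == 0].mean()

# Dummy-variable regression; one industry is dropped as the reference group.
dummies = (industry[:, None] == np.arange(1, n_industries)).astype(float)
X = np.column_stack([np.ones(n), female, dummies])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
dummy_gap = beta[1]


def demean_by_industry(values):
    """Subtract each observation's industry mean."""
    sums = np.bincount(industry, weights=values, minlength=n_industries)
    counts = np.bincount(industry, minlength=n_industries)
    return values - (sums / counts)[industry]


# Within-industry estimate: regress demeaned wage on demeaned indicator.
female_within = demean_by_industry(female)
wage_within = demean_by_industry(wage)
within_gap = (female_within @ wage_within) / (female_within @ female_within)

print(naive_gap, dummy_gap, within_gap)
```

In this setup the raw gap is roughly twice the within-industry gap, the same direction of change as in the regressions above.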