# Estimating Gender Discrimination in the Workplace

In this exercise we'll use data from the 2018 US Current Population Survey (CPS) to try to estimate the effect of being a woman on workplace compensation.

Note that our focus will be *only* on differential compensation in the workplace, and as a result it is important to bear in mind that our estimates are not estimates of *all* forms of gender discrimination. For example, these analyses will not account for things like gender discrimination in terms of *getting* jobs. We'll discuss this in more detail below.

In [1]:
import warnings
import numpy as np
import pandas as pd

warnings.filterwarnings("ignore")
pd.set_option("mode.copy_on_write", True)



## Exercise 1: 

Begin by downloading and importing 2018 CPS data from [https://github.com/nickeubank/MIDS_Data/tree/master/Current_Population_Survey](https://github.com/nickeubank/MIDS_Data/tree/master/Current_Population_Survey). The file is called `morg18.dta` and is a Stata dataset. Additional information about the dataset can be found by following the links in the README.txt file in the folder, but for the moment it is sufficient to know this is a national survey run in the United States.

The survey does include some survey weights we won't be using (i.e. not everyone in the sample was included with the same probability), so the numbers we estimate will not be perfect estimates of the gender wage gap in the United States, but they are pretty close.
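For reference, here is a minimal sketch of how a weighted mean can differ from an unweighted one, using made-up numbers. The column name `weight` is an assumption for illustration only; check the CPS documentation for the actual weight variable in `morg18.dta`.

```python
import numpy as np
import pandas as pd

# Toy data: `weight` is a hypothetical survey-weight column, not taken from morg18.dta.
toy_weights = pd.DataFrame(
    {
        "earnwke": [500.0, 1000.0, 1500.0],
        "weight": [3.0, 1.0, 1.0],  # the first respondent stands in for 3x as many people
    }
)

unweighted_mean = toy_weights["earnwke"].mean()
weighted_mean = np.average(toy_weights["earnwke"], weights=toy_weights["weight"])

print(f"Unweighted mean: {unweighted_mean:.1f}")
print(f"Weighted mean:   {weighted_mean:.1f}")
```

The further the weights are from equal, the more the two averages can diverge.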

In [2]:
data = pd.read_stata(
    "https://github.com/nickeubank/MIDS_Data/blob/master/Current_Population_Survey/morg18.dta?raw=true"
)

## Exercise 2:

Because our interest is only in-the-workplace wage discrimination among full-time workers, we need to start by subsetting our data for people currently employed (and "at work", not "absent") at the time of this survey using the `lfsr94` variable, who are employed full time (meaning that their usual hours per week—`uhourse`—is 35 or above).

As noted above, this analysis will miss many forms of gender discrimination. For example, in dropping anyone who isn't working, we immediately lose any women who couldn't get jobs, or who chose to leave the workforce because the wages they were offered (which were likely lower than those offered to men) were lower than they were willing or able to accept. And in focusing on full-time employees, we miss the fact that women may not be offered full-time jobs at the same rate as men.
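The subsetting step described above can be sketched on a toy frame. The rows below are invented. The point is that missing hours (`NaN`) and negative values of `uhourse` (which look like special codes rather than real hours, though that reading is an assumption) never satisfy `>= 35`, so they drop out without special handling.

```python
import numpy as np
import pandas as pd

# Invented rows that mimic the two columns used for subsetting.
toy_lf = pd.DataFrame(
    {
        "lfsr94": [
            "Employed-At Work",
            "Employed-Absent",
            "Employed-At Work",
            "Employed-At Work",
            "Retired-Not In Labor Force",
        ],
        "uhourse": [40.0, 40.0, np.nan, -4.0, np.nan],
    }
)

# Combine both conditions in one boolean mask; each comparison needs its own parentheses.
full_time_at_work = (toy_lf["lfsr94"] == "Employed-At Work") & (toy_lf["uhourse"] >= 35)
toy_lf_subset = toy_lf[full_time_at_work]

print(toy_lf_subset)
```

A single mask gives the same rows as filtering in two steps, and keeps the selection rule readable in one place.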

In [3]:
data["lfsr94"].unique()

['Disabled-Not In Labor Force', 'Retired-Not In Labor Force', 'Employed-At Work', 'Unemployed-Looking', 'Employed-Absent', 'Other-Not In Labor Force', NaN, 'Unemployed-On Layoff']
Categories (7, object): ['Employed-At Work' < 'Employed-Absent' < 'Unemployed-On Layoff' < 'Unemployed-Looking' < 'Retired-Not In Labor Force' < 'Disabled-Not In Labor Force' < 'Other-Not In Labor Force']

In [4]:
data["uhourse"].unique()

array([nan, 40., 30., -4., 36., 20., 48., 35., 44., 15., 50., 45., -1.,
       60., 17., 19., 41., 10., 70., 25., 11., 75., 18., 32.,  2., 80.,
        5., 24.,  4.,  3., 38.,  6., 56., 96., 82., 55., 52., 77., 84.,
        0., 12., 16., 42., 43., 54., 23., 53., 33., 28., 65.,  9., 37.,
       72., 26.,  7., 61., 27., 99., 58.,  8., 21., 22., 13., 47., 63.,
        1., 46., 39., 29., 14., 49., 66., 34., 57., 90., 62., 85., 31.,
       87., 68., 69., 76., 83., 98., 88., 51., 78., 64., 67., 92., 95.,
       74., 86., 81., 59., 71., 73., 91., 89., 94.])

In [5]:
subset = data[data["lfsr94"] == "Employed-At Work"]
subset = subset[subset["uhourse"] >= 35]

In [6]:
subset.info()

<class 'pandas.core.frame.DataFrame'>
Index: 133814 entries, 2 to 302329
Data columns (total 98 columns):
 #   Column     Non-Null Count   Dtype   
---  ------     --------------   -----   
 0   hhid       133814 non-null  object  
 1   intmonth   133814 non-null  category
 2   hurespli   133801 non-null  float64 
 3   hrhtype    133814 non-null  category
 4   minsamp    133814 non-null  category
 5   hrlonglk   133814 non-null  category
 6   hrsample   133814 non-null  object  
 7   hrhhid2    133814 non-null  object  
 8   serial     133814 non-null  object  
 9   hhnum      133814 non-null  int8    
 10  stfips     133814 non-null  category
 11  cbsafips   133814 non-null  int32   
 12  county     133814 non-null  int16   
 13  centcity   111313 non-null  float64 
 14  smsastat   132638 non-null  float64 
 15  icntcity   18381 non-null   float64 
 16  smsa04     133814 non-null  int8    
 17  relref95   133814 non-null  int8    
 18  age        133814 non-null  int8    
 19  spouse 

## Exercise 3

Now let's estimate the basic wage gap for the United States!

Earnings per week worked can be found in the `earnwke` variable. Using the variable `sex` (1=Male, 2=Female), estimate the gender wage gap in terms of wages per hour worked!

(You may also find it helpful, for context, to estimate the average hourly pay by dividing weekly pay by `uhourse`.)
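As a rough sketch of the hourly comparison, the snippet below computes hourly pay and compares group means on invented data. The numbers are made up; only the column names `earnwke`, `uhourse`, and `sex` come from the dataset.

```python
import pandas as pd

# Invented respondents; sex is coded 1=Male, 2=Female as in the exercise text.
toy_pay = pd.DataFrame(
    {
        "sex": [1, 1, 2, 2],
        "earnwke": [1200.0, 1000.0, 900.0, 800.0],
        "uhourse": [40.0, 50.0, 40.0, 40.0],
    }
)

# Hourly pay is weekly earnings divided by usual weekly hours.
toy_pay["hourly_wage"] = toy_pay["earnwke"] / toy_pay["uhourse"]

hourly_by_sex = toy_pay.groupby("sex")["hourly_wage"].mean()
hourly_gap = hourly_by_sex.loc[1] - hourly_by_sex.loc[2]

print(hourly_by_sex)
print(f"Hourly gap (men minus women): ${hourly_gap:.2f}")
```

Comparing hourly rather than weekly pay removes the part of the gap that comes purely from differences in hours worked.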

In [7]:
subset["female"] = (subset["sex"] == 2).astype(int)
subset["average_hourly_wage"] = subset["earnwke"] / subset["uhourse"]
female_avg = subset[subset["female"] == 1].earnwke.mean()
male_avg = subset[subset["female"] == 0].earnwke.mean()
average = subset.earnwke.mean()

In [8]:
print(
    f"basic wage gap for the United States is ${male_avg - female_avg: .1f} with the percentage difference of {(male_avg - female_avg)/average*100:.1f}%"
)
print(
    f"to be precise average wage per week for women is ${female_avg: .1f}, in contrast to men which is ${male_avg: .1f}"
)

basic wage gap for the United States is $ 219.1 with the percentage difference of 19.8%
to be precise average wage per week for women is $ 985.7, in contrast to men which is $ 1204.7


## Exercise 4

Assuming 48 work weeks in a year, calculate annual earnings for men and women. Report the difference in dollars and in percentage terms.

In [9]:
subset["yearly_income"] = subset["earnwke"] * 48
female_yearly = subset[subset["female"] == 1].yearly_income.mean()
male_yearly = subset[subset["female"] == 0].yearly_income.mean()
average_yearly = subset.yearly_income.mean()

In [10]:
print(
    f"basic wage gap for the United States is ${male_yearly - female_yearly: .1f} with the percentage difference of {(male_yearly - female_yearly)/average_yearly*100:.1f}%"
)
print(
    f"to be precise average wage per year for women is ${female_yearly: .1f}, in contrast to men which is ${male_yearly: .1f}"
)

basic wage gap for the United States is $ 10514.4 with the percentage difference of 19.8%
to be precise average wage per year for women is $ 47312.8, in contrast to men which is $ 57827.2
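A percentage gap depends on the baseline it is divided by. The cell above divides by the overall mean; the sketch below uses the printed averages to show what the same dollar gap looks like relative to men's or women's mean earnings instead. The averages are the rounded values copied from the output above, so the results are approximate.

```python
# Rounded annual averages copied from the printed output above.
male_mean_printed = 57827.2
female_mean_printed = 47312.8

gap_dollars = male_mean_printed - female_mean_printed

# Same dollar gap, two different baselines.
pct_of_male = gap_dollars / male_mean_printed * 100
pct_of_female = gap_dollars / female_mean_printed * 100

print(f"Gap: ${gap_dollars:.1f}")
print(f"Relative to men's mean:   {pct_of_male:.1f}%")
print(f"Relative to women's mean: {pct_of_female:.1f}%")
```

Whichever baseline is used, it is worth stating it alongside the percentage so that figures from different sources can be compared.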


## Exercise 5

We just compared all full-time working men to all full-time working women. For this to be an accurate *causal* estimate of the effect of being a woman in the workplace, what must be true of these two groups? What is one reason that this may *not* be true?

> This might not be true if men earn more because they work longer hours or have more experience, or because the two groups differ in qualifications such as certifications or education level. Only if those characteristics are equal does the raw difference support the conclusion that there is a pay gap between men and women.

> The comparison is valid if men and women do the same work, so that their potential outcomes are the same.

## Exercise 6

One answer to the second part of Exercise 5 is that working women are likely to be younger, since a larger portion of younger women are entering the workforce as compared to older generations.

To *control* for this difference, let's now regress annual earnings on gender, age, and age-squared (the relationship between age and income is generally non-linear). What is the implied average annual wage difference between women and men? Is it different from your raw estimate? 
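To see why adding a control can move the coefficient on `female`, here is a sketch on simulated data in which women are younger on average and earnings follow a curve in age. Everything here is invented (the coefficients, the age ranges, the noise), and it uses `np.linalg.lstsq` rather than `smf.ols` to keep dependencies light; both produce the same least-squares point estimates.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Simulated assumption: women in the sample are younger on average.
sim_female = rng.integers(0, 2, size=n)
sim_age = np.where(sim_female == 1, rng.uniform(18, 50, n), rng.uniform(25, 65, n))

# Simulated "true" model: concave in age, with a -8,000 effect of being female.
sim_income = (
    10_000 + 2_000 * sim_age - 20 * sim_age**2 - 8_000 * sim_female
    + rng.normal(0, 5_000, n)
)


def ols(columns, y):
    """Return OLS coefficients for y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs


raw_gap = ols([sim_female], sim_income)[1]
controlled_gap = ols([sim_female, sim_age, sim_age**2], sim_income)[1]

print(f"Raw coefficient on female:      {raw_gap:,.0f}")
print(f"Age-controlled coefficient:     {controlled_gap:,.0f}")
```

In this simulation the raw gap overstates the true effect because it also picks up the age difference between the groups; whether the real data behave the same way is what the exercise asks you to check.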

In [16]:
import statsmodels.formula.api as smf

# `age` is stored as int8, so squaring it directly would overflow; widen it first.
subset["age"] = subset["age"].astype("int64")
subset["age_sq"] = subset["age"] ** 2
smf.ols("yearly_income ~ female + age + age_sq", subset).fit().summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | yearly_income | R-squared: | 0.083 |
| Model: | OLS | Adj. R-squared: | 0.083 |
| Method: | Least Squares | F-statistic: | 3710.0 |
| Date: | Tue, 09 Apr 2024 | Prob (F-statistic): | 0.0 |
| Time: | 03:19:17 | Log-Likelihood: | -1442600.0 |
| No. Observations: | 122603 | AIC: | 2885000.0 |
| Df Residuals: | 122599 | BIC: | 2885000.0 |
| Df Model: | 3 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -7102.4067 | 878.728 | -8.083 | 0.000 | -8824.700 | -5380.114 |
| female | -1.074e+04 | 178.919 | -60.000 | 0.000 | -1.11e+04 | -1.04e+04 |
| age | 2730.5944 | 41.676 | 65.519 | 0.000 | 2648.910 | 2812.279 |
| age_sq | -25.8227 | 0.467 | -55.283 | 0.000 | -26.738 | -24.907 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 18004.695 | Durbin-Watson: | 1.739 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 27027.927 |
| Skew: | 1.096 | Prob(JB): | 0.0 |
| Kurtosis: | 3.699 | Cond. No. | 23000.0 |
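One way to read the `age` and `age_sq` coefficients together: the fitted earnings curve is a downward-opening parabola, so predicted earnings peak where the derivative with respect to age is zero, that is, at `-b_age / (2 * b_age_sq)`. A quick check using the coefficients reported above:

```python
# Coefficients copied from the regression table above.
b_age = 2730.5944
b_age_sq = -25.8227

# d(income)/d(age) = b_age + 2 * b_age_sq * age = 0  =>  age = -b_age / (2 * b_age_sq)
peak_age = -b_age / (2 * b_age_sq)

print(f"Predicted earnings peak at about age {peak_age:.1f}")
```

This is only a property of the fitted quadratic, not a causal claim about aging; it simply illustrates why a squared term is needed to capture the non-linear relationship.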


## Exercise 7

In running this regression and interpreting the coefficient on `female`, what is the implicit comparison you are making? In other words, when we run this regression and interpret the coefficient on `female`, we're basically pretending we are comparing two groups and assuming they are counter-factuals for one another. What are these two groups?

> The two groups we are implicitly comparing are men and women of the same age.

## Exercise 8

Now let's add to our regression an indicator variable for whether the respondent has at least graduated high school, and an indicator for whether the respondent at least has a BA. 

In answering this question, use the following table of codes for the variable `grade92`. 

Education is coded as follows:
    
![CPS Educ Codes](../images/cps_educ_codes.png)

In [17]:
subset["high_school_grad"] = subset.grade92 >= 39
subset["BA_grad"] = subset.grade92 >= 43

education_added = smf.ols(
    "yearly_income ~ female + age + age_sq + high_school_grad + BA_grad", subset
).fit()
education_added.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | yearly_income | R-squared: | 0.273 |
| Model: | OLS | Adj. R-squared: | 0.273 |
| Method: | Least Squares | F-statistic: | 9195.0 |
| Date: | Tue, 09 Apr 2024 | Prob (F-statistic): | 0.0 |
| Time: | 03:19:54 | Log-Likelihood: | -1428400.0 |
| No. Observations: | 122603 | AIC: | 2857000.0 |
| Df Residuals: | 122597 | BIC: | 2857000.0 |
| Df Model: | 5 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -1.92e+04 | 839.427 | -22.875 | 0.000 | -2.08e+04 | -1.76e+04 |
| high_school_grad[T.True] | 1.37e+04 | 341.482 | 40.114 | 0.000 | 1.3e+04 | 1.44e+04 |
| BA_grad[T.True] | 2.695e+04 | 166.062 | 162.282 | 0.000 | 2.66e+04 | 2.73e+04 |
| female | -1.304e+04 | 159.947 | -81.518 | 0.000 | -1.34e+04 | -1.27e+04 |
| age | 2210.0675 | 37.239 | 59.348 | 0.000 | 2137.079 | 2283.056 |
| age_sq | -20.0288 | 0.417 | -47.990 | 0.000 | -20.847 | -19.211 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 14162.034 | Durbin-Watson: | 1.851 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 20366.744 |
| Skew: | 0.887 | Prob(JB): | 0.0 |
| Kurtosis: | 3.917 | Cond. No. | 25000.0 |


## Exercise 9 

In running this regression and interpreting the coefficient on `female`, what is the implicit comparison you are making? In other words, when we run this regression and interpret the coefficient on `female`, we are once more basically pretending we are comparing two groups and assuming they are counter-factuals for one another. What are these two groups?

> We are comparing men and women of the same age and education level.

## Exercise 10

Given how the coefficient on `female` has changed between Exercise 6 and Exercise 8, what can you infer about the educational attainment of the women in your survey data (as compared to the educational attainment of men)?

> Without controlling for education, the coefficient on `female` is approximately -10,740, suggesting that women earn on average about $10,740 less than men annually. After controlling for educational attainment in Exercise 8, the gap widens to roughly -13,040. For the gap to grow once education is held constant, the women in the sample must have, on average, higher educational attainment than the men.

## Exercise 11

What does that tell you about the *potential outcomes* of men and women before you added education as a control?

> The initial analysis suggested a gender pay gap that did not account for women's higher educational attainment. When education was factored in, the expectation would be that this gap might decrease, as higher education is associated with higher earnings. However, the opposite is observed - the gender pay gap widened, indicating that despite women's higher educational attainment, they still earned less than men, highlighting a significant disparity in earnings that education alone cannot explain.
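The logic can be illustrated with a small simulation. In the invented data below, women are more likely to hold a BA, a BA raises earnings, and being a woman lowers them. Leaving education out makes the estimated gap look smaller than it is; adding it makes the coefficient more negative, which is the pattern seen between Exercise 6 and Exercise 8. All parameters are assumptions chosen for illustration, and `np.linalg.lstsq` stands in for `smf.ols`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs = 20_000

edu_female = rng.integers(0, 2, size=n_obs)
# Simulated assumption: 45% of women and 35% of men hold a BA.
edu_ba = (rng.random(n_obs) < np.where(edu_female == 1, 0.45, 0.35)).astype(float)
edu_income = (
    40_000 + 25_000 * edu_ba - 13_000 * edu_female + rng.normal(0, 10_000, n_obs)
)


def female_coef(controls):
    """OLS coefficient on female, given a list of extra control columns."""
    X = np.column_stack([np.ones(n_obs), edu_female, *controls])
    coefs, *_ = np.linalg.lstsq(X, edu_income, rcond=None)
    return coefs[1]


without_education = female_coef([])
with_education = female_coef([edu_ba])

print(f"Without education control: {without_education:,.0f}")
print(f"With education control:    {with_education:,.0f}")
```

The direction of the shift depends entirely on which group has more of the omitted characteristic, which is why the change in the coefficient is informative about educational attainment.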

## Exercise 12

Finally, let's include *fixed effects* for the type of job held by each respondent. 

Fixed effects are a method used when we have a nested data structure in which respondents belong to groups, and those groups may all be subject to different pressures. In this context, for example, we can add fixed effects for the industry of each respondent—since wages often vary across industries, controlling for industry is likely to improve our estimates. Use `ind02` to control for industry.

(Note that fixed effects are very similar in principle to hierarchical models. There are some differences [you will read about](../fixed_effects_v_hierarchical.ipynb) for our next class, but they are designed to serve the same role, just with slightly different mechanics). 

When we add fixed effects for groups like this, our interpretation of the other coefficients changes. Whereas in previous exercises we were trying to explain variation in men and women's wages *across all respondents*, we are now effectively comparing men and women's wages *within each employment sector*. Our coefficient on `female`, in other words, now tells us how much less (on average) we would expect a woman to be paid than a man *within the same industry*, not across all respondents. 

(Note that running this regression will result in lots of coefficients popping up you don't care about. We'll introduce some more efficient methods for adding fixed effects that aren't so messy in a later class -- for now, you can ignore those coefficients!)
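To make the "within each sector" interpretation concrete, the sketch below shows on simulated data that including a dummy for every industry gives the same coefficient on `female` as subtracting each industry's mean from both `female` and income and then regressing what is left. The industries, pay levels, and gap are invented, and plain NumPy and pandas are used in place of `smf.ols`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_rows = 6_000

fe = pd.DataFrame({"industry": rng.integers(0, 3, size=n_rows)})
# Simulated assumption: women are concentrated in the lower-paying industries.
share_female = fe["industry"].map({0: 0.7, 1: 0.5, 2: 0.3}).to_numpy()
base_pay = fe["industry"].map({0: 40_000, 1: 55_000, 2: 70_000}).to_numpy()
fe["female"] = (rng.random(n_rows) < share_female).astype(float)
fe["income"] = base_pay - 10_000 * fe["female"] + rng.normal(0, 8_000, n_rows)

# Approach 1: one dummy per industry (these absorb the intercept).
dummies = pd.get_dummies(fe["industry"], dtype=float).to_numpy()
X = np.column_stack([fe["female"].to_numpy(), dummies])
coef_dummies = np.linalg.lstsq(X, fe["income"].to_numpy(), rcond=None)[0][0]

# Approach 2: demean within industry, then run a one-variable regression.
group_means = fe.groupby("industry")[["female", "income"]].transform("mean")
within = fe[["female", "income"]] - group_means
coef_within = within["female"].dot(within["income"]) / within["female"].dot(within["female"])

print(f"Dummy-variable estimate: {coef_dummies:,.1f}")
print(f"Within estimate:         {coef_within:,.1f}")
```

The two numbers agree because both approaches only use variation inside each industry, which is exactly why the coefficient should be read as a within-industry comparison.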

In [18]:
industry_added = smf.ols(
    "yearly_income ~ female + age + age_sq + high_school_grad + BA_grad + C(ind02)",
    subset,
).fit()
industry_added.summary()

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Dep. Variable: | yearly_income | R-squared: | 0.32 |
| Model: | OLS | Adj. R-squared: | 0.319 |
| Method: | Least Squares | F-statistic: | 218.9 |
| Date: | Tue, 09 Apr 2024 | Prob (F-statistic): | 0.0 |
| Time: | 03:20:07 | Log-Likelihood: | -1424300.0 |
| No. Observations: | 122603 | AIC: | 2849000.0 |
| Df Residuals: | 122339 | BIC: | 2852000.0 |
| Df Model: | 263 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -2.03e+04 | 1330.133 | -15.258 | 0.000 | -2.29e+04 | -1.77e+04 |
| high_school_grad[T.True] | 1.122e+04 | 339.591 | 33.051 | 0.000 | 1.06e+04 | 1.19e+04 |
| BA_grad[T.True] | 2.438e+04 | 177.203 | 137.604 | 0.000 | 2.4e+04 | 2.47e+04 |
| C(ind02)[T.Animal production (112)] | -564.0145 | 1688.307 | -0.334 | 0.738 | -3873.069 | 2745.040 |
| C(ind02)[T.Forestry except logging (1131, 1132)] | 166.5881 | 3693.081 | 0.045 | 0.964 | -7071.789 | 7404.965 |
| C(ind02)[T.Logging (1133)] | 5192.1331 | 2975.316 | 1.745 | 0.081 | -639.437 | 1.1e+04 |
| C(ind02)[T.Fishing, hunting, and trapping (114)] | 3436.9737 | 4492.725 | 0.765 | 0.444 | -5368.693 | 1.22e+04 |
| C(ind02)[T.Support activities for agriculture and forestry (115)] | 5832.1722 | 2738.404 | 2.130 | 0.033 | 464.945 | 1.12e+04 |
| C(ind02)[T.Oil and gas extraction (211)] | 3.303e+04 | 3048.017 | 10.835 | 0.000 | 2.71e+04 | 3.9e+04 |

| Statistic | Value | Statistic | Value |
|---|---|---|---|
| Omnibus: | 14330.493 | Durbin-Watson: | 1.865 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 21302.41 |
| Skew: | 0.873 | Prob(JB): | 0.0 |
| Kurtosis: | 4.06 | Cond. No. | 1.16e+16 |


## Exercise 13

Now that we've added industry fixed effects, what groups are we implicitly treating as counter-factuals for one another now?

> Men and women of the same age and education level who work in the same industry.

## Exercise 14

What happened to your estimate of the gender wage gap when you added industry fixed effects? What does that tell you about the industries chosen by women as opposed to men?

> The estimated gap shrank when industry fixed effects were added, which suggests that men are more likely than women to work in higher-paying industries.

In [19]:
# Change in the coefficient on `female` after adding industry fixed effects.
difference = industry_added.params["female"] - education_added.params["female"]
# Express the change as a share of the gap estimated without industry fixed effects.
share = difference / abs(education_added.params["female"])

print(f"Adding industry fixed effects shrinks the estimated pay gap by ${difference:.2f}.")
print(f"That is a {share*100:.1f}% reduction.")

Adding industry fixed effects shrinks the estimated pay gap by $2058.31.
That is a 15.8% reduction.


When you're done, please come read [this discussion](discussion_regressions_incomeineq.ipynb).