# Estimating Gender Discrimination in the Workplace

In this exercise we’ll use data from the 2018 US Current Population Survey (CPS) to estimate the effect of being a woman on workplace compensation. Our focus is only on differential compensation in the workplace, so our estimates do not capture all forms of gender discrimination. For example, these analyses do not account for gender discrimination in hiring.

## Exercise 1

### Explore the data

In [12]:
import pandas as pd

df = pd.read_stata('data/morg18.dta')
df.head()

Unnamed: 0,county,smsastat,age,sex,grade92,race,ethnic,marital,uhourse,earnhre,...,yrcoll,grprof,gr6cor,ms123,occ2012,lfsr94,class94,unioncov,ind02,stfips
0,0,1.0,71,1,42,1,,1,,,...,3.0,,,,,Disabled-Not In Labor Force,,,,AL
1,0,1.0,64,2,40,1,,1,,,...,2.0,,,,,Retired-Not In Labor Force,,,,AL
2,0,1.0,52,2,39,2,,5,40.0,2084.0,...,,,,,5700.0,Employed-At Work,Government - State,,"Residential care facilities, without nursing (...",AL
3,0,1.0,19,2,39,2,,7,40.0,1000.0,...,,,,,5240.0,Employed-At Work,"Private, For Profit",No,Business support services (5614),AL
4,0,1.0,56,2,43,2,,5,40.0,2500.0,...,,,,,3255.0,Employed-At Work,Government - Federal,,Hospitals (622),AL


In [13]:
df.columns

Index(['county', 'smsastat', 'age', 'sex', 'grade92', 'race', 'ethnic',
       'marital', 'uhourse', 'earnhre', 'earnwke', 'chldpres', 'ownchild',
       'ged', 'gedhigr', 'yrcoll', 'grprof', 'gr6cor', 'ms123', 'occ2012',
       'lfsr94', 'class94', 'unioncov', 'ind02', 'stfips'],
      dtype='object')

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 302332 entries, 0 to 302331
Data columns (total 25 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   county    302332 non-null  int16  
 1   smsastat  299460 non-null  float64
 2   age       302332 non-null  int8   
 3   sex       302332 non-null  int8   
 4   grade92   302332 non-null  int8   
 5   race      302332 non-null  int8   
 6   ethnic    39222 non-null   float64
 7   marital   302332 non-null  int8   
 8   uhourse   180100 non-null  float64
 9   earnhre   93700 non-null   float64
 10  earnwke   159855 non-null  float64
 11  chldpres  302332 non-null  int8   
 12  ownchild  302332 non-null  int8   
 13  ged       86165 non-null   float64
 14  gedhigr   8193 non-null    float64
 15  yrcoll    82266 non-null   float64
 16  grprof    0 non-null       float64
 17  gr6cor    0 non-null       float64
 18  ms123     0 non-null       float64
 19  occ2012   193103 non-null  float64
 20  lfsr

In [15]:
df.describe()

Unnamed: 0,county,smsastat,age,sex,grade92,race,ethnic,marital,uhourse,earnhre,earnwke,chldpres,ownchild,ged,gedhigr,yrcoll,grprof,gr6cor,ms123,occ2012
count,302332.0,299460.0,302332.0,302332.0,302332.0,302332.0,39222.0,302332.0,180100.0,93700.0,159855.0,302332.0,302332.0,86165.0,8193.0,82266.0,0.0,0.0,0.0,193103.0
mean,25.626563,1.191885,47.853191,1.52101,40.41597,1.431526,2.569961,3.449734,35.864259,1808.223148,973.528817,1.244827,0.459677,1.095085,6.483217,2.80997,,,,4088.699161
std,61.920623,0.393784,18.749476,0.499559,2.71504,1.275442,2.415717,2.65243,15.080222,1032.46875,697.560813,2.732001,0.944044,0.293334,1.387283,0.990923,,,,2601.576064
min,0.0,1.0,16.0,1.0,31.0,1.0,1.0,1.0,-4.0,17.0,0.0,0.0,0.0,1.0,1.0,1.0,,,,10.0
25%,0.0,1.0,32.0,1.0,39.0,1.0,1.0,1.0,35.0,1150.0,480.0,0.0,0.0,1.0,6.0,2.0,,,,2100.0
50%,0.0,1.0,48.0,2.0,40.0,1.0,1.0,1.0,40.0,1500.0,775.0,0.0,0.0,1.0,7.0,3.0,,,,4220.0
75%,27.0,1.0,63.0,2.0,43.0,1.0,4.0,7.0,40.0,2100.0,1276.0,0.0,0.0,1.0,7.0,3.0,,,,5620.0
max,810.0,2.0,85.0,2.0,46.0,26.0,8.0,7.0,99.0,9999.0,2884.61,15.0,11.0,2.0,8.0,5.0,,,,9840.0


## Exercise 2

In [16]:
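# keep respondents who are employed and at work, with at least 35 usual hours per week
df = df[df.lfsr94 == 'Employed-At Work']
df = df[df.uhourse >= 35].copy()  # copy so later column assignments do not act on a view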
df.head()

Unnamed: 0,county,smsastat,age,sex,grade92,race,ethnic,marital,uhourse,earnhre,...,yrcoll,grprof,gr6cor,ms123,occ2012,lfsr94,class94,unioncov,ind02,stfips
2,0,1.0,52,2,39,2,,5,40.0,2084.0,...,,,,,5700.0,Employed-At Work,Government - State,,"Residential care facilities, without nursing (...",AL
3,0,1.0,19,2,39,2,,7,40.0,1000.0,...,,,,,5240.0,Employed-At Work,"Private, For Profit",No,Business support services (5614),AL
4,0,1.0,56,2,43,2,,5,40.0,2500.0,...,,,,,3255.0,Employed-At Work,Government - Federal,,Hospitals (622),AL
6,97,1.0,48,1,39,1,,7,40.0,1700.0,...,,,,,9130.0,Employed-At Work,"Private, For Profit",No,Truck transportation (484),AL
17,97,1.0,59,1,39,2,,7,40.0,2000.0,...,,,,,9620.0,Employed-At Work,"Private, For Profit",No,****Department stores and discount stores (s45...,AL


## Exercise 3

In [17]:
# change from cents to dollars
df['earnhre'] = df['earnhre']/100

In [18]:
df.groupby('sex')['earnhre'].mean()

sex
1    20.554632
2    18.078692
Name: earnhre, dtype: float64

On average, men earn about $2.48 more per hour than women ($20.55 versus $18.08).

## Exercise 4

In [19]:
df['annual_wage'] = df['earnhre']*(df['uhourse']*52)

In [20]:
df.groupby('sex')['annual_wage'].mean()

sex
1    45105.302454
2    37864.635158
Name: annual_wage, dtype: float64
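
As a quick sanity check on the annual wage construction, the formula can be applied by hand to one respondent. The sketch below uses the first row of the filtered data (index 2), which reports 2084 cents per hour and 40 usual hours per week. It assumes, as the formula above does, 52 paid weeks per year at the usual weekly hours; the variable names are illustrative only.

```python
# Values taken from the row with index 2 in the filtered data shown earlier.
earnhre_cents = 2084.0   # hourly earnings in cents, as stored in the raw file
uhourse = 40.0           # usual hours worked per week

hourly_dollars = earnhre_cents / 100          # convert cents to dollars
annual_wage = hourly_dollars * uhourse * 52   # assumes 52 paid weeks per year

print(f"Annual wage: ${annual_wage:,.2f}")
```

The result is approximately 43,347.20 dollars, which agrees with the annual_wage value reported for this respondent later in the notebook.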

In [21]:
means = df.groupby('sex')['annual_wage'].mean()
print(f"The difference in mean annual wages between men and women is ${means.loc[1] - means.loc[2]:,.2f}")

The difference in mean annual wages between men and women is $7,240.67


## Exercise 5
For the raw difference to be an accurate causal estimate of the effect of being a woman in the workplace, the men and women in the sample would need to be comparable in everything else that affects pay, such as experience and seniority. In reality this is unlikely to hold: people have worked different numbers of years in their fields and positions, and this may bias the estimate. The assumption may also fail because people negotiate different salaries for the same position.
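
One rough way to probe whether the two groups are comparable is to look at average characteristics by sex. The sketch below uses a small made-up table (not the CPS sample) with the same column names as the data; the values are purely illustrative. On the real data the analogous call would be df.groupby('sex')[['age', 'grade92']].mean().

```python
import pandas as pd

# Hypothetical toy data, NOT the CPS sample. sex: 1 = men, 2 = women.
toy = pd.DataFrame({
    "sex":     [1, 1, 1, 2, 2, 2],
    "age":     [30, 45, 60, 25, 35, 45],
    "grade92": [39, 39, 43, 39, 43, 43],
})

# Mean of each characteristic within each sex.
balance = toy.groupby("sex")[["age", "grade92"]].mean()

# Difference in mean age, women minus men.
age_gap = balance.loc[2, "age"] - balance.loc[1, "age"]

print(balance)
print(f"Mean age difference (women - men): {age_gap}")
```

Large differences in such a table would indicate that the raw wage gap mixes the effect of gender with the effect of those other characteristics. Characteristics that are not recorded, such as negotiation behavior, cannot be checked this way.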

## Exercise 6

In [24]:
import statsmodels.formula.api as smf

In [25]:
smf.ols('annual_wage ~ C(sex) + age', df).fit().summary()

0,1,2,3
Dep. Variable:,annual_wage,R-squared:,0.057
Model:,OLS,Adj. R-squared:,0.057
Method:,Least Squares,F-statistic:,1974.0
Date:,"Tue, 15 Feb 2022",Prob (F-statistic):,0.0
Time:,14:38:50,Log-Likelihood:,-752250.0
No. Observations:,65755,AIC:,1504000.0
Df Residuals:,65752,BIC:,1505000.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,3.265e+04,288.243,113.284,0.000,3.21e+04,3.32e+04
C(sex)[T.2],-7557.2936,176.107,-42.913,0.000,-7902.462,-7212.125
age,304.1829,6.406,47.485,0.000,291.627,316.738

0,1,2,3
Omnibus:,33771.778,Durbin-Watson:,1.868
Prob(Omnibus):,0.0,Jarque-Bera (JB):,307310.621
Skew:,2.305,Prob(JB):,0.0
Kurtosis:,12.535,Cond. No.,146.0


Holding age constant, women earn on average about $7,557 less per year than men. The raw difference (about $7,241) is smaller in magnitude than this regression-adjusted difference.

## Exercise 7

The coefficient on C(sex)[T.2] compares the average annual wages of two groups at the same age: women (sex = 2), the indicator group, and men (sex = 1), the reference group.
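
To see mechanically what an indicator group and a reference group are, it helps to drop all other controls: in that case the coefficient on the indicator equals the difference in group means, and the intercept equals the reference group's mean. The sketch below shows this with made-up wages (not CPS data). It uses NumPy least squares instead of statsmodels only to keep the example self-contained; the OLS point estimates are the same.

```python
import numpy as np

# Hypothetical toy wages, NOT CPS data.
wage = np.array([40000.0, 50000.0, 60000.0, 30000.0, 40000.0, 44000.0])
female = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])  # 1 = indicator group, 0 = reference group

# Design matrix: a constant column plus the indicator.
X = np.column_stack([np.ones_like(wage), female])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)
intercept, female_coef = beta

# Difference in group means, indicator group minus reference group.
raw_gap = wage[female == 1].mean() - wage[female == 0].mean()

print(f"Intercept (reference group mean): {intercept:.2f}")
print(f"Indicator coefficient:            {female_coef:.2f}")
print(f"Difference in group means:        {raw_gap:.2f}")
```

Once a control such as age is added, as in the regression above, the coefficient is no longer the simple difference in means but the difference between the groups at a given value of the control.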

## Exercise 8

In [31]:
df["BA"] = df.grade92 >= 43
df["high_school"] = df.grade92 >= 39

In [32]:
df.head()

Unnamed: 0,county,smsastat,age,sex,grade92,race,ethnic,marital,uhourse,earnhre,...,ms123,occ2012,lfsr94,class94,unioncov,ind02,stfips,annual_wage,BA,high_school
2,0,1.0,52,2,39,2,,5,40.0,20.84,...,,5700.0,Employed-At Work,Government - State,,"Residential care facilities, without nursing (...",AL,43347.2,False,True
3,0,1.0,19,2,39,2,,7,40.0,10.0,...,,5240.0,Employed-At Work,"Private, For Profit",No,Business support services (5614),AL,20800.0,False,True
4,0,1.0,56,2,43,2,,5,40.0,25.0,...,,3255.0,Employed-At Work,Government - Federal,,Hospitals (622),AL,52000.0,True,True
6,97,1.0,48,1,39,1,,7,40.0,17.0,...,,9130.0,Employed-At Work,"Private, For Profit",No,Truck transportation (484),AL,35360.0,False,True
17,97,1.0,59,1,39,2,,7,40.0,20.0,...,,9620.0,Employed-At Work,"Private, For Profit",No,****Department stores and discount stores (s45...,AL,41600.0,False,True


In [35]:
smf.ols('annual_wage ~ C(sex) + age + C(BA) + C(high_school)', df).fit().summary()

0,1,2,3
Dep. Variable:,annual_wage,R-squared:,0.152
Model:,OLS,Adj. R-squared:,0.152
Method:,Least Squares,F-statistic:,2946.0
Date:,"Tue, 15 Feb 2022",Prob (F-statistic):,0.0
Time:,14:51:54,Log-Likelihood:,-748740.0
No. Observations:,65755,AIC:,1497000.0
Df Residuals:,65750,BIC:,1498000.0
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,2.116e+04,380.335,55.637,0.000,2.04e+04,2.19e+04
C(sex)[T.2],-9185.1056,168.057,-54.655,0.000,-9514.498,-8855.713
C(BA)[T.True],1.517e+04,207.215,73.191,0.000,1.48e+04,1.56e+04
C(high_school)[T.True],9685.9802,293.781,32.970,0.000,9110.169,1.03e+04
age,309.6835,6.076,50.972,0.000,297.775,321.592

0,1,2,3
Omnibus:,31243.139,Durbin-Watson:,1.897
Prob(Omnibus):,0.0,Jarque-Bera (JB):,274123.061
Skew:,2.098,Prob(JB):,0.0
Kurtosis:,12.08,Cond. No.,234.0


## Exercise 9

The comparison is now between women (the indicator group) and men (the reference group) of the same age and the same education level: for example, women with at least a bachelor's degree versus men with at least a bachelor's degree, or women with at least a high school degree versus men with at least a high school degree.

## Exercise 10

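After controlling for education, the coefficient on being a woman becomes more negative: the estimated gap grows from about $7,557 to about $9,185 per year. This does not imply that education makes women worse off than men. Rather, it suggests that women in this sample have, on average, more education than men, so the estimate without education controls understated the gap between men and women with the same level of education.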
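
One way a gender coefficient can become more negative after adding education controls is if women in the sample are, on average, more educated than men. The sketch below illustrates this with data built by construction (not estimated from the CPS): wages follow 30000 + 15000 * ba - 9000 * female with no noise, and women are more likely to hold a BA. The helper ols is hypothetical and uses NumPy least squares rather than statsmodels to stay self-contained.

```python
import numpy as np

# Hypothetical toy data, NOT CPS data: 6 men followed by 6 women.
female = np.array([0] * 6 + [1] * 6, dtype=float)
# 2 of 6 men hold a BA, 4 of 6 women hold a BA.
ba = np.array([0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1], dtype=float)
# Wages are generated exactly from this rule, with no noise.
wage = 30000 + 15000 * ba - 9000 * female

def ols(columns, y):
    """Return OLS coefficients for y on a constant plus the given columns."""
    X = np.column_stack([np.ones_like(y)] + columns)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

raw_gap = ols([female], wage)[1]             # gap without education control
controlled_gap = ols([female, ba], wage)[1]  # gap holding BA status fixed

print(f"Gap without education control: {raw_gap:.2f}")
print(f"Gap with education control:    {controlled_gap:.2f}")
```

In this constructed example the uncontrolled gap is smaller in magnitude because women's higher education partly offsets their lower pay at each education level. Controlling for education removes that offset and recovers the within-education gap.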