CIVIL-465 Research Project: Women empowerment project
=====

Gender pay gap study
---------------------------------------------------------------------

created by Justine Bourdette on 26.10.2021

# Description

As part of this project on women's empowerment, a first simple study examines the gender pay gap. An Ordinary Least Squares (OLS) regression is used to determine which variables have the largest effect on wages. The data set was taken from the Glassdoor website. The following variables are used:

| Feature name     | Variable Type | Description 
|------------------|---------------|--------------------------------------------------------
| Job Title        | Categorical   | Job name
| Gender           | Categorical   | Male or Female   
| Age              | Continuous    | Age in years
| PerfEval         | Continuous    | Performance evaluation score between 1 and 5
| Education        | Categorical   | Level of education (High School, College, Masters, PhD)
| Dept             | Categorical   | Department  
| Seniority        | Continuous    | Seniority (number of years worked) 
| Base Pay         | Continuous    | Annual basic pay in dollars 
| Bonus            | Continuous    | Annual bonus pay in dollars
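
To make the OLS step concrete, the sketch below fits a linear wage model by least squares on a small synthetic data set. The numbers and the names `seniority`, `is_male` and `base_pay` are made up for illustration and are not taken from the Glassdoor data; `np.linalg.lstsq` stands in here for the regression routine used later in the notebook.

```python
import numpy as np

# Synthetic, noise-free example (illustrative values, not Glassdoor data):
# base_pay = 40000 + 5000 * seniority + 2000 * is_male
seniority = np.array([1, 2, 3, 4, 5, 1, 2, 3], dtype=float)
is_male = np.array([0, 1, 0, 1, 0, 1, 0, 1], dtype=float)
base_pay = 40000 + 5000 * seniority + 2000 * is_male

# Design matrix with an explicit column of ones for the intercept
X = np.column_stack([np.ones_like(seniority), seniority, is_male])

# OLS picks the coefficients that minimise the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, base_pay, rcond=None)
intercept, coef_seniority, coef_male = beta
print(intercept, coef_seniority, coef_male)
```

Because the synthetic wages contain no noise, the fitted coefficients should match the ones used to generate them. On real data, each coefficient is an estimate of the change in pay associated with that variable, holding the others fixed.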

# Import libraries

In [13]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Import data

In [14]:
data = pd.read_csv('Glassdoor_Gender_Pay_Gap.csv')

In [15]:
# See data shape

data.shape

(1000, 9)

In [16]:
# Visualize imported data

data.head()

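|   | JobTitle            | Gender | Age | PerfEval | Education | Dept           | Seniority | BasePay | Bonus
|---|---------------------|--------|-----|----------|-----------|----------------|-----------|---------|-------
| 0 | Graphic Designer    | Female | 18  | 5        | College   | Operations     | 2         | 42363   | 9938
| 1 | Software Engineer   | Male   | 21  | 5        | College   | Management     | 5         | 108476  | 11128
| 2 | Warehouse Associate | Female | 19  | 4        | PhD       | Administration | 5         | 90208   | 9268
| 3 | Software Engineer   | Male   | 20  | 5        | Masters   | Sales          | 4         | 108080  | 10154
| 4 | Graphic Designer    | Male   | 26  | 5        | Masters   | Engineering    | 5         | 99464   | 9319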


# Preprocess data

In [17]:
# Construct age categories of interest

mask_young = data['Age'] < 25
mask_middle = (data['Age'] >= 25) & (data['Age'] <= 45)
mask_old = data['Age'] > 45

# Start from an empty object column, then fill it with .loc;
# chained indexing (data[col][mask] = ...) may write to a copy
data['Age_category'] = None

data.loc[mask_young, 'Age_category'] = 'Young'
data.loc[mask_middle, 'Age_category'] = 'Middle'
data.loc[mask_old, 'Age_category'] = 'Old'
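
The three age bands above can also be built in a single call with `pd.cut`. The sketch below is one possible alternative, not the notebook's method; it uses a toy `ages` frame (a hypothetical name) and assumes ages are whole years, so that `Age < 25` is the same as `Age <= 24`.

```python
import numpy as np
import pandas as pd

# Toy ages (illustrative only); assumes whole-year ages, as in the data preview
ages = pd.DataFrame({'Age': [18, 24, 25, 45, 46, 60]})

# Intervals are closed on the right: (-inf, 24], (24, 45], (45, inf)
ages['Age_category'] = pd.cut(
    ages['Age'],
    bins=[-np.inf, 24, 45, np.inf],
    labels=['Young', 'Middle', 'Old'],
)
print(ages)
```

Note that `pd.cut` returns an ordered categorical column, so a later `pd.get_dummies(..., drop_first=True)` would be expected to drop `Young` (the first label) rather than the alphabetically first `Middle`, which changes the reference group.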

In [18]:
# Select columns of interest

data = data.loc[:,['Gender','Age_category','PerfEval','Education','Seniority','BasePay']]

In [19]:
# Transform multi-category columns into dummy variables

gender_data = pd.get_dummies(
    data, columns=['Gender','Age_category','PerfEval','Education','Seniority'], drop_first=True)

In [20]:
# Visualize modified data

gender_data.head()

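|   | BasePay | Gender_Male | Age_category_Old | Age_category_Young | PerfEval_2 | PerfEval_3 | PerfEval_4 | PerfEval_5 | Education_High School | Education_Masters | Education_PhD | Seniority_2 | Seniority_3 | Seniority_4 | Seniority_5
|---|---------|-------------|------------------|--------------------|------------|------------|------------|------------|-----------------------|-------------------|---------------|-------------|-------------|-------------|------------
| 0 | 42363   | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0
| 1 | 108476  | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1
| 2 | 90208   | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1
| 3 | 108080  | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0
| 4 | 99464   | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1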
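
With `drop_first=True`, the first level of each encoded column has no dummy and becomes the reference group (here `College` for education and `Middle` for age category, since neither appears as a column). The sketch below shows this on a toy frame with made-up pay values. The `dtype=int` argument is an addition not used in the notebook: it keeps the dummies as 0/1 integers, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy frame (illustrative values, not Glassdoor data)
toy = pd.DataFrame({
    'Education': ['College', 'High School', 'Masters', 'PhD'],
    'BasePay': [50000, 40000, 60000, 70000],
})

# drop_first=True omits the first level ('College'), which becomes the baseline
dummies = pd.get_dummies(toy, columns=['Education'], drop_first=True, dtype=int)
print(dummies.columns.tolist())

# The 'College' row is encoded as all zeros across the remaining dummies
college_row = dummies.loc[0, dummies.columns != 'BasePay']
print(college_row.tolist())
```

Each remaining dummy coefficient in the regression is then read relative to the omitted level.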


# Regression analysis

In [21]:
# Import econometrics library

import statsmodels.api as sm

In [22]:
# Separate features and target variable 

X = gender_data.loc[:,gender_data.columns != 'BasePay']
y = gender_data['BasePay']

In [23]:
# Define and run the model

mod = sm.OLS(y, X)     # Describe model

res = mod.fit()        # Fit model

print(res.summary())   # Summarize model

                                 OLS Regression Results                                
Dep. Variable:                BasePay   R-squared (uncentered):                   0.946
Model:                            OLS   Adj. R-squared (uncentered):              0.945
Method:                 Least Squares   F-statistic:                              1232.
Date:                Mon, 25 Oct 2021   Prob (F-statistic):                        0.00
Time:                        18:09:52   Log-Likelihood:                         -11451.
No. Observations:                1000   AIC:                                  2.293e+04
Df Residuals:                     986   BIC:                                  2.300e+04
Df Model:                          14                                                  
Covariance Type:            nonrobust                                                  
                            coef    std err          t      P>|t|      [0.025      0.975]
------------------------------