Challenge Name: BWWC: Hack the Gender Wage Gap - UK Data Challenge

Description: 
This project is sponsored by the Boston Women’s Workforce Council, a public-private partnership between the City of Boston, Boston University, and over 200 companies in Boston.

In 2017 the UK became the first government in the world to require organizations with 250 or more employees to make wage data public. The findings showed a persistent pay gap: 77% of the reporting employers stated that median hourly pay was higher for men than for women in their organisation. We want to better understand these data to inform the creation of policies that could reduce this wage gap.

### This is an incredible opportunity to understand what factors might be driving the overall gender wage gap. We want students to analyze this data, combine it with other public data, and help us understand what the US can learn from the UK open data set.

Examples of possible factors of interest could be:
Industry and Job characteristics (e.g. historically male role, male-dominated industry, role level - e.g. low wage vs. professional, etc.)
Leadership characteristics (e.g. number or percentage of women in executive level or C-suite roles - CEO, etc., number or percentage of women on the board, etc.)
Employer provided benefits (e.g. family leave, etc.)

Note that the UK government has already performed substantial analysis related to characteristics of employees (full time, male/female, etc.). We are looking for correlations specifically to employer practices.

If time allows, we hope these data can be presented in a visually compelling and even interactive way. For access to the datasets visit: https://tinyurl.com/techtogetherbwwc

Criteria: 
Analysis: the hack with the most insightful analysis on employer factors related to the pay gap
Usefulness: the hack provides data in a format that can be used for further analysis
Presentation: the hack presents the data or findings in a compelling manner 
Prizes: $500 award
Judges: BU Spark! Team and BWWC
Award Criteria
Insights and Actionability of Analysis
Presentation of results
Platform reusability (e.g. with additional data integrations)

_________________________________________________________________
The Data

Links to the data needed for this project can be found below:
 
Gender Pay Gap Portal: 
https://gender-pay-gap.service.gov.uk 

https://www.ukdataservice.ac.uk/get-data/key-data
 
UK employment data by sector and gender:

https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/datasets/employmentbyindustryemp13
 
You can filter these by sector and by employer size (any organisation that has 250 employees or more) and, best of all, download it all as a clean CSV!
 
For deeper analysis projects, you can find auxiliary datasets below:
 
U.K. Salary Scales: 

https://www.europeandataportal.eu/data/en/dataset/salary-scales1

https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/bulletins/annualsurveyofhoursandearnings/2017provisionaland2016revisedresults 

(master list contains statistician's commentary on potential pitfalls, if needed)

[Provisional] Earnings broken down by age and sex, but not company.

https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/agegroupashetable6

Earnings by qualifications (broken down by sex):

https://data.gov.uk/dataset/d164c231-36fe-4e2e-82dc-5f4dc46cc3d7/earnings-by-qualification-in-the-uk

You could combine datasets 2 and 3 to study wage gap by qualification and sex.

Combine that further with earnings across industries:

https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/earningsandworkinghours/datasets/averageweeklyearningsbyindustryearn03

NB: Although these data come as spreadsheets, they will all need to be thoroughly cleaned (removing extraneous header information, etc.)
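
As a rough sketch of the cleaning step mentioned above, the snippet below skips title rows that sit above the real header when loading a table with pandas. The sample text, its layout and its column names are made up for illustration; the actual ONS files will need their own `skiprows` value and may need `pd.read_excel` rather than `pd.read_csv`.

```python
import io
import pandas as pd

# Hypothetical stand-in for an exported sheet: two title lines followed
# by the real header row. Real ONS files will be laid out differently.
raw = io.StringIO(
    "Average weekly earnings (illustrative)\n"
    "Source: made-up sample\n"
    "Industry,Men,Women\n"
    "Manufacturing,600,480\n"
    "Retail,400,350\n"
)

# skiprows=2 drops the two title lines so the third line becomes the header.
clean = pd.read_csv(raw, skiprows=2)

# Strip stray whitespace that often surrounds spreadsheet column names.
clean.columns = clean.columns.str.strip()
print(clean)
```

Inspecting the first few raw lines of each file before choosing `skiprows` is usually safer than guessing.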

For an analysis only on London, we also have earnings by borough:

https://data.london.gov.uk/dataset/earnings-workplace-borough
 
Glassdoor data may be useful for the analysis but cannot be used or promoted publicly; see the following resources. Note that using these data is acceptable for the competition and will not hurt your chances of winning:

https://github.com/MatthewChatham/glassdoor-review-scraper

https://www.glassdoor.com/developer/index.htm

https://nycdatascience.com/blog/student-works/web-scraping/glassdoor-web-scraping/

https://nycdatascience.com/blog/student-works/web-scraping-glassdoor-insight-employee-turnover-within-financial-firms/

Additional small datasets which may be used to show correlations:
Maternity benefits at UK universities:

https://warwick.ac.uk/fac/soc/economics/staff/vetroeger/maternity/maternitybenefits_heis.pdf

Or feel free to develop your own benefits-oriented data set through other sources, noting all sources you use.

In [27]:
import numpy as np 
import pandas as pd 

import matplotlib.pyplot as plt 
%matplotlib inline


In [19]:
df_17_18 = pd.read_csv('data/UK Gender Pay Gap Data - 2017 to 2018 (1).csv')
df_18_19 = pd.read_csv('data/UK Gender Pay Gap Data - 2018 to 2019 (1).csv')
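
With both reporting years loaded, one possible next step is to stack them and look at how each employer's gap changed. The sketch below uses tiny made-up frames in place of `df_17_18` and `df_18_19`; the `ReportingYear` column and the `stack_years` helper are hypothetical names, and matching employers on `EmployerName` is an assumption that will miss employers whose names changed between filings.

```python
import pandas as pd

def stack_years(frames_by_year):
    """Concatenate per-year frames, recording the year in a new column."""
    tagged = [df.assign(ReportingYear=year) for year, df in frames_by_year.items()]
    return pd.concat(tagged, ignore_index=True)

# Made-up sample rows standing in for the two real data files.
sample_17_18 = pd.DataFrame({"EmployerName": ["A Ltd", "B Ltd"],
                             "DiffMedianHourlyPercent": [10.0, -2.0]})
sample_18_19 = pd.DataFrame({"EmployerName": ["A Ltd", "B Ltd"],
                             "DiffMedianHourlyPercent": [8.0, 0.0]})

both_years = stack_years({"2017-18": sample_17_18, "2018-19": sample_18_19})

# One row per employer, one column per year, then the year-over-year change.
wide = both_years.pivot(index="EmployerName", columns="ReportingYear",
                        values="DiffMedianHourlyPercent")
change = wide["2018-19"] - wide["2017-18"]
print(change)
```

On the real data the call would be `stack_years({"2017-18": df_17_18, "2018-19": df_18_19})`; duplicate employer names would need to be resolved before the pivot.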

In [28]:
df_18_19.sort_values(by='DiffMedianHourlyPercent').head(10)

Unnamed: 0,EmployerName,Address,CompanyNumber,SicCodes,DiffMeanHourlyPercent,DiffMedianHourlyPercent,DiffMeanBonusPercent,DiffMedianBonusPercent,MaleBonusPercent,FemaleBonusPercent,...,FemaleUpperMiddleQuartile,MaleTopQuartile,FemaleTopQuartile,CompanyLinkToGPGInfo,ResponsiblePerson,EmployerSize,CurrentName,SubmittedAfterTheDeadline,DueDate,DateSubmitted
2190,Realise Futures Cic,"Realise Futures Cic,\r\nLovetofts Drive,\r\nIp...",7828443.0,"56102,\r\n82990,\r\n85590",-39.5,-90.7,0.0,0.0,0.0,0.0,...,78.8,9.5,90.5,https://www.realisefutures.org,Ramon Garcia (Human Resources Manager),250 to 499,Realise Futures Cic,False,05/04/2019 00:00:00,15/03/2019 11:21:14
3205,YELLOW DOT HOLDINGS LIMITED,"2, Crown Court,\r\nCrown Way,\r\nRushden,\r\nN...",11159598.0,64209,-39.7,-80.7,-66.0,-636.4,66.7,64.7,...,98.8,2.4,97.6,https://www.brighthorizons.co.uk/statutory-inf...,John Handley (HR Director),250 to 499,YELLOW DOT HOLDINGS LIMITED,False,05/04/2019 00:00:00,15/03/2019 15:24:35
818,Ducas Ltd,"The Meeting House,\r\nLittle Mount Sion,\r\nTu...",6126794.0,82990,-21.7,-67.5,0.0,0.0,0.0,0.0,...,77.2,15.8,84.2,,Simon Bailey (Managing Director),Less than 250,Ducas Ltd,False,05/04/2019 00:00:00,14/01/2019 12:50:49
2090,PONTRILAS SAWMILLS LIMITED,"Pontrilas,\r\nHerefordshire,\r\nUnited Kingdom...",457573.0,16100,-16.7,-59.2,100.0,100.0,80.3,0.0,...,0.0,77.8,22.2,http://www.pontrilassawmills.co.uk/corporate/h...,Eric Hilton (Finance Director),250 to 499,PONTRILAS SAWMILLS LIMITED,False,05/04/2019 00:00:00,18/03/2019 14:25:28
2949,TURNER BIANCA PLC,"Bell Mill, Claremont Street,\r\nHathershaw,Old...",473824.0,46410,-19.5,-53.9,40.9,-30.0,90.7,88.8,...,66.2,32.4,67.6,,KEITH WALMSLEY (FINANCE AND OPERATIONS DIRECTOR),250 to 499,TURNER BIANCA PLC,False,05/04/2019 00:00:00,10/01/2019 12:19:40
2240,RITTER-COURIVAUD LIMITED,"Equity House, Irthlingborough Road,\r\nWelling...",363411.0,46380,-20.3,-52.5,51.2,72.9,96.6,92.7,...,25.4,71.2,28.8,https://www.tescoplc.com/genderpay/,"Charles Wilson (Chief Executive, Booker)",Less than 250,RITTER-COURIVAUD LIMITED,False,05/04/2019 00:00:00,12/03/2019 12:18:31
2757,The Donaldson Trust,"Preston Road,\r\nLinlithgow,\r\nEH49 6HZ",,1,-10.4,-45.8,-100.0,-100.0,0.0,100.0,...,85.7,16.7,83.3,,,Less than 250,The Donaldson Trust,False,31/03/2019 00:00:00,06/11/2018 17:30:06
3074,WALKERS SNACKS LIMITED,"450 South Oak Way,\r\nGreen Park,\r\nReading,\...",3474989.0,82990,-11.3,-44.3,28.9,-63.7,58.2,48.4,...,48.0,56.0,44.0,http://pepsico.co.uk/docs/album/UK_Corporate_R...,Marisa Lorch (HR Director - Commercial),250 to 499,WALKERS SNACKS LIMITED,False,05/04/2019 00:00:00,04/03/2019 09:33:42
339,BOOKER RETAIL PARTNERS (GB) LIMITED,"Equity House,\r\nIrthlingborough Road,\r\nWell...",6460554.0,46390,-21.6,-41.3,66.0,95.2,80.8,57.8,...,38.3,71.9,28.1,https://www.tescoplc.com/genderpay/,"Charles Wilson (Chief Executive, Booker)",1000 to 4999,BOOKER RETAIL PARTNERS (GB) LIMITED,False,05/04/2019 00:00:00,12/03/2019 12:16:27
11,AB INBEV UK LIMITED,"Porter Tun House,\r\n500 Capability Green,\r\n...",3982132.0,11050,-31.0,-39.0,-49.0,-160.0,60.0,64.0,...,25.0,66.0,34.0,https://ab-inbev.co.uk/about/our-promise/polic...,Claire Richardson (People Director),1000 to 4999,AB INBEV UK LIMITED,False,05/04/2019 00:00:00,22/03/2019 10:45:36
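
The challenge description quotes 77% of employers reporting higher median hourly pay for men. A check of that kind of figure on a loaded year could look like the sketch below. It assumes that a positive `DiffMedianHourlyPercent` means men's median pay is higher; the helper name and the sample values are illustrative only.

```python
import pandas as pd

def share_by_direction(df, column="DiffMedianHourlyPercent"):
    """Percentage of employers whose gap favours men, women, or neither."""
    gaps = df[column].dropna()  # ignore employers with no reported value
    return pd.Series({
        "men_higher": (gaps > 0).mean() * 100,
        "women_higher": (gaps < 0).mean() * 100,
        "no_gap": (gaps == 0).mean() * 100,
    })

# Made-up values; on the real data pass df_18_19 instead.
sample = pd.DataFrame({"DiffMedianHourlyPercent": [12.5, 3.0, 0.0, -4.1, None]})
shares = share_by_direction(sample)
print(shares)
```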


In [62]:
# companies with either zero mean or median diff in hourly percent 

df_18_19[(df_18_19['DiffMeanHourlyPercent']==0)|(df_18_19['DiffMedianHourlyPercent']==0)].to_csv('./special_companies/zerodiff_mean_or_median.csv', sep='\t', encoding='utf-8')

In [63]:

# companies with both zero mean and zero median diff in hourly percent 

df_18_19[(df_18_19['DiffMeanHourlyPercent']==0) & (df_18_19['DiffMedianHourlyPercent']==0)].to_csv('./special_companies/zerodiff_mean_and_median.csv', sep='\t', encoding='utf-8')

In [64]:
# companies with the highest difference in median hourly rate (largest gap in favour of men)

df_18_19.sort_values(by='DiffMedianHourlyPercent').tail(10).to_csv('./special_companies/worst_diff_median_hourly', sep='\t', encoding='utf-8')


In [89]:
df_18_19['Address'].str.split('United Kingdom', expand=True)


Unnamed: 0,0,1,2
0,"Fusion Point,\r\nDumballs Road,\r\nCardiff,\r\n",",\r\nCF10 5BF",
1,"Royal Grammar School, High Street,\r\nGuildfor...",,
2,"8, St. Loyes Street,\r\nBedford,\r\nMK40 1EP",,
3,"Fairview Mill, Ingliston,\r\nNewbridge,\r\nMid...",,
4,"Unit 3 Hedge End Retail Park, Charles Watts Wa...",,
5,"3m, Centre, Cain Road,\r\nBracknell,\r\nBerksh...",,
6,"206 Hurley Common,\r\nHurley,\r\nAtherstone,\r...",",\r\nCV9 2LR",
7,"2nd Floor Bradburn House,\r\n64-68 Northumberl...",",\r\nNE1 7DF",
8,"Addison Road,\r\nChilton Industrial Estate,\r\...",",\r\nCO10 2YW",
9,"Icknield Way, Kentford,\r\nNewmarket,\r\nSuffo...",,
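
Splitting on the literal text "United Kingdom" only helps for addresses that contain it, as the mostly empty columns above show. If the goal is a location key such as the postcode, a regular expression may be more robust. The pattern below is a simplified approximation of the UK postcode format rather than a full validator, and `extract_postcode` is a hypothetical helper name.

```python
import re

# Simplified UK postcode shape: outward code (e.g. "CF10") + inward code (e.g. "5BF").
POSTCODE_RE = re.compile(r"\b([A-Z]{1,2}\d[A-Z\d]?)\s*(\d[A-Z]{2})\b")

def extract_postcode(address):
    """Return the last postcode-like token in an address, or None."""
    if not isinstance(address, str):
        return None  # missing addresses are read in as NaN
    matches = POSTCODE_RE.findall(address.upper())
    if not matches:
        return None
    outward, inward = matches[-1]  # postcodes normally come last
    return f"{outward} {inward}"

examples = [
    "Fusion Point,\r\nDumballs Road,\r\nCardiff,\r\nUnited Kingdom,\r\nCF10 5BF",
    "8, St. Loyes Street,\r\nBedford,\r\nMK40 1EP",
    "Royal Grammar School, High Street",
]
postcodes = [extract_postcode(a) for a in examples]
print(postcodes)
```

On the real frame this could be applied with `df_18_19['Address'].apply(extract_postcode)`.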


In [65]:
# companies with the lowest difference in median hourly rate (most negative, i.e. largest gap in favour of women)

df_18_19.sort_values(by='DiffMedianHourlyPercent').head(10).to_csv('./special_companies/best_diff_median_hourly', sep='\t', encoding='utf-8')
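
Ranking individual employers is one view of the data. Since the challenge asks about employer characteristics, a grouped summary may also be useful. The sketch below summarises the median hourly gap per `EmployerSize` band. The helper name and the sample rows are made up; the column names are those shown in the output above.

```python
import pandas as pd

def gap_by_group(df, group_col="EmployerSize", gap_col="DiffMedianHourlyPercent"):
    """Employer count, median gap and mean gap within each group."""
    return (df.groupby(group_col)[gap_col]
              .agg(employers="count", median_gap="median", mean_gap="mean")
              .sort_values("median_gap", ascending=False))

# Made-up rows; on the real data pass df_18_19 instead.
sample = pd.DataFrame({
    "EmployerSize": ["250 to 499", "250 to 499", "1000 to 4999", "Less than 250"],
    "DiffMedianHourlyPercent": [10.0, 20.0, 5.0, -3.0],
})
summary = gap_by_group(sample)
print(summary)
```

The same helper can be pointed at any other categorical column by changing `group_col`.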

In [71]:
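# The original loop grouped by ['SicCodes', column] and then called
# unstack('DiffMeanHourlyPercent'), which raises
# "KeyError: 'Level DiffMeanHourlyPercent not found'" because that name is
# not a level of the grouped index. Assumed intent: compare the mean hourly
# pay gap across SIC codes, showing only the 20 largest so the plot is readable.
gap_by_sic = df_18_19.groupby('SicCodes')['DiffMeanHourlyPercent'].mean()
gap_by_sic.sort_values().tail(20).plot(kind='bar')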
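
Grouping on the raw `SicCodes` column treats each distinct string as its own category, and some employers list several codes in one cell (for example "56102,\r\n82990,\r\n85590" in the output further up). One way to get a per-industry view is to split those cells so each row holds a single code. This is a sketch on made-up rows. `explode_sic_codes` is a hypothetical helper, and it assumes the column was read as text.

```python
import pandas as pd

def explode_sic_codes(df, column="SicCodes"):
    """Return one row per (employer, SIC code) pair."""
    out = df.dropna(subset=[column]).copy()
    # Codes are separated by commas and line breaks; split on any run of them.
    out[column] = out[column].astype(str).str.split(r"[,\s]+")
    out = out.explode(column)
    out = out[out[column] != ""]  # drop empties left by trailing separators
    return out.reset_index(drop=True)

# Made-up rows; on the real data pass df_18_19 instead.
sample = pd.DataFrame({
    "EmployerName": ["A Ltd", "B Ltd", "C Ltd"],
    "SicCodes": ["56102,\r\n82990", "82990", None],
    "DiffMeanHourlyPercent": [10.0, 20.0, 5.0],
})
exploded = explode_sic_codes(sample)
mean_gap_by_sic = exploded.groupby("SicCodes")["DiffMeanHourlyPercent"].mean()
print(mean_gap_by_sic)
```

An employer with several codes then counts once under each of them, which may or may not suit the analysis.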