# Group 40 Project Proposal

## Introduction

Begin by providing some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.

Clearly state the question you will try to answer with your project. Your question should involve one or more random variables of interest, spread across two or more categories that are interesting to compare. For example, you could consider the annual maxima river flow at two different locations along a river, or perhaps gender diversity at different universities. Of the response variable, identify one location parameter (mean, median, quantile, etc.) and one scale parameter (standard deviation, inter-quartile range, etc.) that would be useful in answering your question. Justify your choices.

UPDATE (Mar 1, 2022): If it doesn’t make sense to infer a scale parameter, you can choose another parameter, or choose a second variable altogether. Ultimately, we’re looking for a comprehensive inference analysis on one parameter spread across 2+ groups (with at least one hypothesis test), plus a bit more (such as an investigation on the variance, a quantile, or a different variable). In total, you should use both bootstrapping and asymptotics somewhere in your report at least once each. Also, your hypothesis test(s) need not be significant: it is perfectly fine to write a report claiming no significant findings (i.e. your p-value is large).

Identify and describe the dataset that will be used to answer the question. Remember, this dataset is allowed to contain more variables than you need – feel free to drop them!

Also, be sure to frame your question/objectives in terms of what is already known in the literature. Be sure to include at least two scientific publications that can help frame your study (you will need to include these in the References section). We have no specific citation style requirements, but be consistent.


The gender pay gap is the difference between the wages earned by men and women. The disparity has long been reported (Maloney 2016). The Equal Pay Act, signed by President John F. Kennedy in 1963 (cite), mandates that women receive equal pay for doing "substantially equal" work. Over the last 50 years, numerous laws have been passed in the hope of diminishing the disparity. However, Statistics Canada reported that in 2020 female employees in Ontario earned $0.75 for every dollar earned by men (Statistics Canada data from the Canadian Income Survey). In this report, we will test whether women earn less than men by comparing the mean and standard deviation of income between the two groups. To do this, we use a dataset generated by scraping Glassdoor, which contains income for various job titles by gender.
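
One way to formalise this question, offered as a sketch rather than the final analysis plan: let $\mu_M$ and $\mu_F$ denote the population mean income of men and women, and $\sigma_M$ and $\sigma_F$ the corresponding standard deviations (these symbols are introduced here for illustration only). The one-sided alternative for the means reflects the question of whether women earn less. The two-sided comparison of spread is an assumption, since the direction of any difference in variability is not stated above.

$$
H_0: \mu_M - \mu_F = 0 \quad \text{vs.} \quad H_A: \mu_M - \mu_F > 0
$$

$$
H_0: \frac{\sigma_M}{\sigma_F} = 1 \quad \text{vs.} \quad H_A: \frac{\sigma_M}{\sigma_F} \neq 1
$$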



**Dataset**:
* Downloaded from <a href="https://www.kaggle.com/datasets/nilimajauhari/glassdoor-analyze-gender-pay-gap" target="_blank">this Kaggle dataset page</a> 

* The data were scraped from the <a href="https://www.glassdoor.com/" target="_blank">Glassdoor website</a> 


## Preliminary Results
Demonstrate that the dataset can be read from the web into R.

Clean and wrangle your data into a tidy format.

Plot the relevant raw data, tailoring your plot in a way that addresses your question.

Compute estimates of the parameter you identified across your groups. Present this in a table. If relevant, include these estimates in your plot.

In [2]:
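# Libraries needed for this project
library(tidyverse)     # attaches ggplot2, dplyr, tidyr, readr, and related packages
library(RColorBrewer)  # colour palettes for plots
library(tidymodels)    # attaches infer and broom, among others
library(repr)          # display options for the notebook
library(cowplot)       # arranging multiple plots in a grid
# Show at most 6 rows when a data frame is printed
options(repr.matrix.max.rows = 6)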

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.4 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.2      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom        1.0.1     ✔ rsample      1.1.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.3     ✔ workflows    1.0.0

### Load data into Jupyter notebook

In [20]:
df <- read_csv("https://raw.githubusercontent.com/kristennli/stat201/main/glassdoor.csv")

Rows: 1000 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): JobTitle, Gender, Education, Dept
dbl (5): Age, PerfEval, Seniority, BasePay, Bonus

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [21]:
summary(df)

   JobTitle            Gender               Age           PerfEval    
 Length:1000        Length:1000        Min.   :18.00   Min.   :1.000  
 Class :character   Class :character   1st Qu.:29.00   1st Qu.:2.000  
 Mode  :character   Mode  :character   Median :41.00   Median :3.000  
                                       Mean   :41.39   Mean   :3.037  
                                       3rd Qu.:54.25   3rd Qu.:4.000  
                                       Max.   :65.00   Max.   :5.000  
  Education             Dept             Seniority        BasePay      
 Length:1000        Length:1000        Min.   :1.000   Min.   : 34208  
 Class :character   Class :character   1st Qu.:2.000   1st Qu.: 76850  
 Mode  :character   Mode  :character   Median :3.000   Median : 93328  
                                       Mean   :2.971   Mean   : 94473  
                                       3rd Qu.:4.000   3rd Qu.:111558  
                                       Max.   :5.000   Max.   :179726  
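
The overall summary above does not separate the groups. Here is a sketch of how the per-gender estimates (mean and standard deviation of `BasePay`) and a matching plot might be produced. The helper names `pay_estimates` and `plot_pay` are hypothetical, and the choice of a boxplot is an assumption rather than the notebook's own method.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: per-gender sample size, mean, and standard deviation
# of base pay. Assumes columns `Gender` and `BasePay` as shown in the summary.
pay_estimates <- function(data) {
  data %>%
    group_by(Gender) %>%
    summarise(
      n        = n(),
      mean_pay = mean(BasePay),
      sd_pay   = sd(BasePay),
      .groups  = "drop"
    )
}

# Hypothetical helper: boxplot of base pay by gender, with each group's
# sample mean overlaid as a diamond so the location estimate is visible.
plot_pay <- function(data) {
  ggplot(data, aes(x = Gender, y = BasePay, fill = Gender)) +
    geom_boxplot(alpha = 0.6) +
    stat_summary(fun = mean, geom = "point", shape = 18, size = 4) +
    labs(x = "Gender", y = "Base pay", title = "Base pay by gender") +
    theme(legend.position = "none")
}
```

Calling `pay_estimates(df)` would return one row per gender, and `plot_pay(df)` would return a ggplot object that can be printed in the notebook.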

## Methods: Plan

The previous sections will carry over to your final report (you’ll be allowed to improve them based on feedback you get). Begin this Methods section with a brief description of “the good things” about this report – specifically, in what ways is this report trustworthy?

Continue by explaining why the plot(s) and estimates that you produced are not enough to give to a stakeholder, and what you should provide in addition to address this gap. Make sure your plans include at least one hypothesis test and one confidence interval. If possible, compare both the bootstrapping and asymptotics methods.
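
For this dataset, the confidence interval and hypothesis test described above might look like the sketch below. It assumes that `Gender` takes the values `"Male"` and `"Female"` (not confirmed by the output shown earlier) and that `BasePay` is the response. The helper names are hypothetical. The bootstrap interval uses the `infer` package attached by `tidymodels`, and the asymptotic counterpart uses Welch's two-sample t-test from base R.

```r
library(dplyr)
library(infer)

# Hypothetical helper: percentile bootstrap confidence interval for the
# difference in mean base pay (Male minus Female).
# Assumes `Gender` has exactly the levels "Male" and "Female".
boot_ci_diff_means <- function(data, reps = 1000, level = 0.95) {
  data %>%
    specify(BasePay ~ Gender) %>%
    generate(reps = reps, type = "bootstrap") %>%
    calculate(stat = "diff in means", order = c("Male", "Female")) %>%
    get_confidence_interval(level = level, type = "percentile")
}

# Hypothetical helper: asymptotic counterpart using Welch's t-test.
# alternative = "greater" tests whether mean male pay exceeds mean female pay;
# with a one-sided alternative, t.test() reports a one-sided interval.
welch_test <- function(data, alternative = "greater") {
  male   <- data$BasePay[data$Gender == "Male"]
  female <- data$BasePay[data$Gender == "Female"]
  t.test(male, female, alternative = alternative, var.equal = FALSE)
}
```

Calling `set.seed()` before `boot_ci_diff_means(df)` makes the resampling reproducible. For a two-sided asymptotic interval that is directly comparable with the bootstrap interval, `welch_test(df, alternative = "two.sided")` can be used instead.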

Finish this section by reflecting on how your final report might play out:

What do you expect to find?

What impact could such findings have?

What future questions could this lead to?


## References
At least two citations of literature relevant to the project. The citation format is your choice – just be consistent. Make sure to cite the source of your data as well.

Maloney, Carolyn B. (April 2016). "Gender Pay Inequity: Consequences for Women, Families and the Economy". Joint Economic Committee.
