# Does faculty salary vary by gender and/or rank?

## Set up

Before getting started, the only additional library you should need to install (one that did not come with the Anaconda Python distribution) is `seaborn`, a visualization package. Run this command in your terminal:

```
pip install seaborn
```

Let's begin by reading in some data from [this course website](http://data.princeton.edu/wws509/datasets/#salary). Columns included are:

- **sx** = Sex, coded 1 for female and 0 for male
- **rk** = Rank, coded
    - 1 for assistant professor,
    - 2 for associate professor, and
    - 3 for full professor
- **yr** = Number of years in current rank
- **dg** = Highest degree, coded 1 if doctorate, 0 if masters
- **yd** = Number of years since highest degree was earned
- **sl** = Academic year salary, in dollars.
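
Numeric codes are easy to misread in tables and plots, so it can help to map them to labels first. The sketch below assumes the columns are coded exactly as listed above (the file may already contain text labels, in which case this step is unnecessary). The `toy` frame holds made-up rows for illustration only, not values from the dataset.

```python
import pandas as pd

# Hypothetical toy rows (NOT real data); they only mimic the coding listed above
toy = pd.DataFrame({
    "sx": [0, 1, 0, 1],
    "rk": [3, 2, 1, 3],
    "dg": [1, 1, 0, 1],
    "sl": [36000, 25000, 17000, 31000],
})

# Lookup tables that follow the coding scheme in the column list
sex_labels = {0: "male", 1: "female"}
rank_labels = {1: "assistant", 2: "associate", 3: "full"}
degree_labels = {0: "masters", 1: "doctorate"}

# .assign returns a new frame, so the coded original stays available
labeled = toy.assign(
    sx=toy["sx"].map(sex_labels),
    rk=toy["rk"].map(rank_labels),
    dg=toy["dg"].map(degree_labels),
)
print(labeled)
```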

In [1]:
# Set up
import numpy as np
import pandas as pd
import seaborn as sns # for visualization
import urllib.request # to load data
from scipy import stats # ANOVA
from scipy.stats import ttest_ind # t-tests
import statsmodels.formula.api as smf # linear modeling
import matplotlib.pyplot as plt # plotting
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline 

In [2]:
# Read data from URL
data = urllib.request.urlopen('http://data.princeton.edu/wws509/datasets/salary.dat')
salary_data = pd.read_table(data, sep=r'\s+')  # raw string so '\s' is not treated as an invalid escape

## Descriptive statistics by gender

Before doing any statistical tests, you should get a basic feel for the gender breakdown in your dataset.
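
As a rough illustration of that kind of breakdown, here is a minimal sketch that counts rows per group and averages salary per group. It uses a made-up `toy` frame with the `sx` and `sl` columns, assuming `sx` is coded 0 for male and 1 for female as described above. The numbers are invented.

```python
import pandas as pd

# Hypothetical toy data (NOT real salaries); sx: 0 = male, 1 = female
toy = pd.DataFrame({
    "sx": [0, 0, 0, 1, 1],
    "sl": [30000, 26000, 22000, 24000, 20000],
})

# Number of rows for each value of sx
counts = toy["sx"].value_counts()

# Mean salary within each sx group
mean_salary = toy.groupby("sx")["sl"].mean()

print(counts)
print(mean_salary)
```

An unbalanced count is worth noting before any testing, because a small group gives a noisier estimate of its mean.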

In [3]:
# What is the number of males/females in the dataset? What does this already tell you...?


In [4]:
# What is the mean salary by sex? Hint: you'll have to groupby sex (`sx`)


In [5]:
# Draw histograms for the distribution of salaries for males and females (separately)
# Hint: you can use the `.hist` method, and specify what you want to separate *by*
# The x and y axes should be consistent between the graphs


In [6]:
# Create a boxplot of the salaries by sex -- don't worry if you get a warning here.

In [7]:
# Show salary distributions for males and females in a stripplot (jittered density plot)
# Use the sns.stripplot method


## Test for a difference in means by gender
Use a t-test to see if there is a significant difference in means.
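
A minimal sketch of an independent two-sample t-test with `scipy.stats.ttest_ind`, run on made-up numbers rather than the real dataset. The `toy` frame and the names `male_sl` and `female_sl` are hypothetical. It uses the function's default setting, which assumes equal variances in the two groups. Passing `equal_var=False` would run Welch's t-test instead.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical toy data (NOT real salaries); sx: 0 = male, 1 = female
toy = pd.DataFrame({
    "sx": [0, 0, 0, 0, 1, 1, 1, 1],
    "sl": [30000, 28000, 26000, 24000, 25000, 23000, 21000, 19000],
})

# One Series of salaries per group
male_sl = toy.loc[toy["sx"] == 0, "sl"]
female_sl = toy.loc[toy["sx"] == 1, "sl"]

# Two-sided test of the null hypothesis that both groups share the same mean
t_stat, p_value = ttest_ind(male_sl, female_sl)
print(t_stat, p_value)
```

The sign of the statistic follows the argument order: a positive value here means the first sample has the larger mean.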

In [8]:
# Separate males and females into different variables


In [9]:
# Test for difference using `ttest_ind`


## Descriptive statistics by rank

In [10]:
# Draw histograms for the distribution of salaries by rank



## Test for differences in means by rank

First, we'll use a **t-test** to test for differences by rank. A t-test compares exactly two groups, so we first need to split the dataset into full professors and everyone else (not-full professors), then perform the test.
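
The split can be sketched with a boolean mask, assuming `rk` is coded 3 for full professor as described in the column list. The `toy` frame below contains invented values and is only there to make the snippet runnable.

```python
import pandas as pd
from scipy.stats import ttest_ind

# Hypothetical toy data (NOT real salaries); rk: 1 = assistant, 2 = associate, 3 = full
toy = pd.DataFrame({
    "rk": [3, 3, 3, 2, 2, 1, 1, 1],
    "sl": [36000, 33000, 30000, 26000, 24000, 19000, 18000, 17000],
})

# True for full professors, False for everyone else
is_full = toy["rk"] == 3

full = toy.loc[is_full, "sl"]
not_full = toy.loc[~is_full, "sl"]  # ~ negates the mask

t_stat, p_value = ttest_ind(full, not_full)
print(t_stat, p_value)
```

Collapsing associate and assistant professors into one group discards information about how those two ranks differ from each other.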

In [11]:
# Separate into different variables by rank (full, not_full)


# Test for difference


Alternatively, we could use an **Analysis of Variance (ANOVA)** test to assess the statistical significance of differences across more than two groups (an extension of the t-test).
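
A minimal one-way ANOVA sketch with `scipy.stats.f_oneway`, which takes one sample per group as separate arguments. The data are invented, and building the `groups` list with `groupby` is just one convenient way to do it.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy data (NOT real salaries); rk: 1 = assistant, 2 = associate, 3 = full
toy = pd.DataFrame({
    "rk": [3, 3, 3, 2, 2, 1, 1, 1],
    "sl": [36000, 33000, 30000, 26000, 24000, 19000, 18000, 17000],
})

# One Series of salaries per rank
groups = [group["sl"] for _, group in toy.groupby("rk")]

# Null hypothesis: all ranks share the same mean salary
f_stat, p_value = stats.f_oneway(*groups)
print(f_stat, p_value)
```

A small p-value says that at least one group mean differs. It does not say which one, so pairwise follow-up comparisons are still needed.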

In [12]:
# Use the ANOVA method to test for differences in means across multiple groups
# Use the `stats.f_oneway` method to perform the test


## How does salary (`sl`) compare to years since degree (`yd`) and years in current rank (`yr`)?
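
One possible layout is two scatterplots side by side with a shared y-axis, so that salary is directly comparable between the panels. The sketch below uses plain matplotlib and a made-up `toy` frame. The figure size and column order are arbitrary choices.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data (NOT real values)
toy = pd.DataFrame({
    "yd": [2, 5, 9, 14, 20, 27],
    "yr": [1, 3, 4, 6, 10, 15],
    "sl": [17000, 19000, 23000, 26000, 31000, 36000],
})

# One row, two panels; sharey=True keeps the salary axis identical in both
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(10, 4))

for ax, col in zip(axes, ["yd", "yr"]):
    ax.scatter(toy[col], toy["sl"])
    ax.set_xlabel(col)

# Label the shared y-axis once, on the left panel
axes[0].set_ylabel("sl")
```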

In [13]:
# Create scatterplots to show how salary compares to years since degree / in current rank
# Show these at the same time


## How does salary vary across rank and sex?
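
As a numeric companion to the plots, a table of mean salary for each combination of rank and sex can make the comparison explicit. This is a sketch using `pivot_table` on invented data, assuming the numeric coding from the column list. It is not part of the plotting exercise itself.

```python
import pandas as pd

# Hypothetical toy data (NOT real salaries)
# rk: 1 = assistant, 2 = associate, 3 = full; sx: 0 = male, 1 = female
toy = pd.DataFrame({
    "rk": [1, 1, 2, 2, 3, 3, 3],
    "sx": [0, 1, 0, 1, 0, 0, 1],
    "sl": [18000, 17000, 25000, 24000, 34000, 32000, 31000],
})

# Rows are ranks, columns are sexes, cells are mean salaries
table = toy.pivot_table(values="sl", index="rk", columns="sx", aggfunc="mean")
print(table)
```

Comparing within a row holds rank fixed, which separates a salary gap inside a rank from a gap caused by the two sexes being distributed differently across ranks.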

In [14]:
# Create stripplots of salary by sex and by rank placed next to one another
# Hint: you can use `sns.PairGrid`


In [15]:
# Create different stripplots of salary (by gender) for each rank




In [None]:
# What does this tell you about gender discrimination on the faculty?