# Does faculty salary vary by gender and/or rank? 
A linear modeling approach.

## Set up
Let's begin by reading in some data from [this course website](http://data.princeton.edu/wws509/datasets/#salary). Columns included are:

- **sx** = Sex, coded 1 for female and 0 for male
- **rk** = Rank, coded
    - 1 for assistant professor,
    - 2 for associate professor, and
    - 3 for full professor
- **yr** = Number of years in current rank
- **dg** = Highest degree, coded 1 if doctorate, 0 if masters
- **yd** = Number of years since highest degree was earned
- **sl** = Academic year salary, in dollars.

Before performing the statistical analysis here, make sure you have completed `part-1.ipynb` to get to know your data.

In [1]:
# Set up
import numpy as np
import pandas as pd
import seaborn as sns # for visualization
import urllib.request # to load data
from scipy import stats # ANOVA
from scipy.stats import ttest_ind # t-tests
import statsmodels.formula.api as smf # linear modeling
import altair as alt
alt.renderers.enable('notebook') # enable altair rendering
import matplotlib.pyplot as plt # plotting (optional)
%matplotlib inline 

In [2]:
# Read data from URL
data = urllib.request.urlopen('http://data.princeton.edu/wws509/datasets/salary.dat')
salary_data = pd.read_table(data, sep=r'\s+') # raw string so `\s` is not read as a string escape

## Simple linear regression: what is the salary increase associated with each additional year in your current position (`yr`)?

In [3]:
# Create a simple linear model using `smf.ols()` that assesses the relationship between 
# years in current rank (`yr`) and salary (`sl`).

# Then, use the `.summary()` method of your model to print out information about your model


**Assess the fit of your model**. In the space provided here, you should use clear and interpretable language to interpret your model. For example, you can use statements like "the beta value of \*\*\* indicates that each unit increase in \*\*\* is associated with a \*\*\* increase in \*\*\*."

Describe the _accuracy of your coefficient estimates_. In doing so, interpret the following: 
- **Coefficient** (beta) in your model:  _(your interpretation here)_
- **Standard errors** of your estimate: _(your interpretation here)_
- **Confidence intervals** around your coefficient: _(your interpretation here)_

Describe the _**accuracy** of your model_. In doing so, interpret the following:
- **R-squared** value: _(your interpretation here)_
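
If it helps to see where each of these quantities lives on a fitted model, here is a minimal sketch. It uses made-up data (the names `toy`, `x`, `y`, and `toy_model` are placeholders, not part of the salary dataset), so it shows the mechanics without giving away the exercise.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (NOT the salary data): y rises by about 2 per unit of x, plus noise
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": np.arange(30, dtype=float)})
toy["y"] = 5.0 + 2.0 * toy["x"] + rng.normal(scale=3.0, size=len(toy))

toy_model = smf.ols("y ~ x", data=toy).fit()

beta = toy_model.params["x"]                     # estimated change in y per one-unit increase in x
std_err = toy_model.bse["x"]                     # standard error of that estimate
ci_low, ci_high = toy_model.conf_int().loc["x"]  # 95% confidence interval (the default level)
r_squared = toy_model.rsquared                   # share of the variance in y explained by the model

print(f"beta={beta:.2f}, se={std_err:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f}), R^2={r_squared:.3f}")
```

The same attributes should be available on the model you fit to `salary_data`; all of these numbers also appear in the `.summary()` table.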

In [5]:
# Create a `predictions` column of your dataframe by 
# making predictions from your linear model. Hint: use the `.predict()` method of your model


In [3]:
# Draw a scatterplot comparing years in current rank (`yr` -- x axis) to salary (`sl` -- y axis).
# Add to that scatterplot your "best fit line" that shows how well the model fits our data
# (this line will have `yr` as the x axis, and your `predictions` on the y axis)


## Multiple Regression

Now you will improve (well, likely improve) your model by predicting your outcome of interest (salary) using **multiple** independent variables.
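
As an aside on formula syntax, the sketch below fits two models on made-up data (the names `toy`, `x1`, `group`, and `y` are hypothetical). Predictors are joined with `+`. An integer-coded column can be entered as a plain number or wrapped in `C()` to be treated as categories. Which choice suits a variable like `rk` is a judgment call that this sketch does not settle.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (NOT the salary data)
rng = np.random.default_rng(1)
n = 60
toy = pd.DataFrame({
    "x1": rng.uniform(0, 10, size=n),
    "group": rng.integers(1, 4, size=n),  # integer codes 1, 2, 3
})
toy["y"] = 1.0 + 0.5 * toy["x1"] + 3.0 * toy["group"] + rng.normal(scale=1.0, size=n)

# Codes treated as a number: a single slope, so each step between codes is assumed equal
numeric_fit = smf.ols("y ~ x1 + group", data=toy).fit()

# Codes treated as categories: one offset per level, relative to the baseline level
categorical_fit = smf.ols("y ~ x1 + C(group)", data=toy).fit()

print(numeric_fit.params)
print(categorical_fit.params)
```

The numeric version estimates one coefficient for `group`, while the categorical version estimates one coefficient for each non-baseline level.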

In [4]:
# Using multiple regression, create a linear model that uses 
# sex, rank, and years in current rank variables to estimate salary

# Then, use the `.summary()` method of your model to print out information about the model.


**Assess the fit of your model**. In the space provided here, you should use clear and interpretable language to interpret your model. For each independent variable, you should write out a sentence explaining the beta and confidence intervals. For example, 

> There was an observed association of BETA (LOWER_BOUND, UPPER_BOUND) salary increase for each unit increase in VARIABLE.


Describe the _**accuracy** of your model_. In doing so, interpret the following:
- **R-squared** value: _(your interpretation here)_

Write down at least one relationship in your model that you find surprising (i.e., one you would not have expected given your analysis up until this point).

In [5]:
# Create a `mult_preds` column of your dataframe by 
# making predictions from your new (multivariate) linear model. Hint: use the `.predict()` method of your model



In [6]:
# Visually compare these predictions (`mult_preds`) to those from your simple linear model (`predictions`) 
# by creating a scatterplot of the two variables


Write down at least one relationship in this graph that you find notable.

## Assessing prediction accuracy

In [7]:
# Make a scatterplot that compares the actual salary data (`sl` -- x axis) 
# to the multivariate predictions (`mult_preds` -- y axis)
# Add a line to this plot showing where the perfect prediction values would be 
# (i.e., a line whose x and y values are both the `sl` column)


Using the R-squared values of each model -- univariate (`predictions`) and multivariate (`mult_preds`) -- describe 
which one explains more variance.
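
As a reminder of what R-squared measures, here is a minimal sketch that computes it by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The helper `r_squared` and the three small arrays are made up for illustration, and the "predictions" here are invented numbers rather than output from a fitted model.

```python
import numpy as np

def r_squared(actual, predicted):
    """Return 1 - SS_residual / SS_total for two equal-length sequences."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_residual = np.sum((actual - predicted) ** 2)   # variation left unexplained
    ss_total = np.sum((actual - actual.mean()) ** 2)  # variation around the mean
    return 1.0 - ss_residual / ss_total

# Made-up values (NOT the salary data)
actual = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
rough = np.array([11.0, 13.0, 16.0, 17.0, 23.0])  # predictions that miss by more
close = np.array([10.5, 12.0, 15.5, 19.0, 23.5])  # predictions that miss by less

print(r_squared(actual, rough), r_squared(actual, close))
```

For a least-squares model with an intercept, this quantity should match the `.rsquared` attribute of the fitted model, so in the exercise you can read the values straight from each model.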


### Residual plots

In [8]:
# For each model, plot the salary (`sl`) vs. the *residuals* (difference between actual and predicted values)
# Add a horizontal line at 0 to help interpret the graph
# (I suggest rendering adjacent plots, though you are welcome to make them separately)


Write at least **one observation** based on the residual plots above. More specifically, describe how each model systematically fits (or _fails_ to fit) the data.
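
When reading residual plots, it can help to know two properties that hold for any least-squares line with an intercept. The sketch below checks them on made-up data (the arrays `x` and `y` are synthetic, and `np.polyfit` stands in for the model fitting done above).

```python
import numpy as np

# Synthetic data (NOT the salary data)
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
y = 3.0 + 1.5 * x + rng.normal(scale=2.0, size=x.size)

# Fit a straight line by least squares and compute residuals
slope, intercept = np.polyfit(x, y, deg=1)
predicted = intercept + slope * x
residuals = y - predicted  # actual minus predicted

# Property 1: the residuals average to zero (up to floating-point error)
print(residuals.mean())

# Property 2: residuals correlate positively with the ACTUAL values,
# and the squared correlation equals 1 - R^2
corr = np.corrcoef(y, residuals)[0, 1]
r2 = 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)
print(corr, r2)
```

Because of the second property, a plot of actual values against residuals will generally tilt upward for any imperfect fit, with a weaker tilt when R-squared is higher. Keep that in mind when deciding which patterns in your plots reflect the data and which reflect this choice of x axis.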