## Inference Practice - In class

Inference lets us make statements about a population of interest based on the sample data we have. Inference is needed only when the data come from a sample; if data are obtained from the entire population, descriptive statistics are sufficient.

### Steps in Inference

1. Generate hypotheses, i.e., the null and alternative hypotheses.
2. Establish an $\alpha$ value
3. Estimate parameters
4. Compute test statistic
5. Generate p-value
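
The five steps above can be sketched end to end on a small simulated data set. This is only an illustration under stated assumptions: the data frame `sim`, its columns `year` and `pay`, and the choice of $\alpha = 0.05$ are hypothetical and are not taken from the nurses data.

```r
# Simulated data (hypothetical, not the nurses data): pay rises about 0.5 per year
set.seed(2021)
sim <- data.frame(year = rep(2000:2009, each = 5))
sim$pay <- 20 + 0.5 * (sim$year - 2000) + rnorm(nrow(sim), sd = 2)

# Step 1: hypotheses. H0: the slope for year is 0; HA: the slope for year is not 0
# Step 2: choose alpha before looking at the results (0.05 is an assumed value)
alpha <- 0.05

# Step 3: estimate parameters (intercept and slope)
sim_mod <- lm(pay ~ year, data = sim)
estimates <- coef(summary(sim_mod))

# Step 4: test statistic = estimate / standard error
t_stat <- estimates["year", "Estimate"] / estimates["year", "Std. Error"]

# Step 5: two-sided p-value from a t distribution with the residual degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = df.residual(sim_mod), lower.tail = FALSE)

# Compare the p-value to alpha to decide whether to reject H0
round(c(t_stat = t_stat, p_value = p_value), 4)
p_value < alpha
```

The hand-computed `t_stat` and `p_value` match the `t value` and `Pr(>|t|)` columns that `summary()` reports for the same model, which is a useful check when working through the steps manually.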


### General Research Question
Does the year in which the data were collected help explain differences in average or median hourly nurse pay?

### Data Description
The data are state-level data about nurse pay for all US states, plus Puerto Rico, Washington DC, Guam, and the US Virgin Islands.

#### Actual data

|attribute                                        |class     |description |
|:-----------------------------------------------|:---------|:-----------|
|State                                           |character | State |
|Year                                            |double    | Year|
|Total Employed RN                               |double    | Total Employed Registered Nurses |
|Employed Standard Error (%)                     |double    | Employed standard error (%) |
|Hourly Wage Avg                                 |double    | Hourly wage average|
|Hourly Wage Median                              |double    | Hourly wage median |
|Annual Salary Avg                               |double    | Annual salary average |
|Annual Salary Median                            |double    | Annual salary median |
|Wage/Salary standard error (%)                  |double    | Wage/salary standard error % |
|Hourly 10th Percentile                          |double    | Hourly 10th Percentile               |
|Hourly 25th Percentile                          |double    | Hourly 25th Percentile                         |
|Hourly 75th Percentile                          |double    | Hourly 75th Percentile                         |
|Hourly 90th Percentile                          |double    | Hourly 90th Percentile                         |
|Annual 10th Percentile                          |double    | Annual 10th Percentile                         |
|Annual 25th Percentile                          |double    | Annual 25th Percentile                         |
|Annual 75th Percentile                          |double    | Annual 75th Percentile                         |
|Annual 90th Percentile                          |double    | Annual 90th Percentile                         |
|Location Quotient                               |double    | Location Quotient                              |
|Total Employed (National)_Aggregate             |double    | Total Employed (National)_Aggregate            |
|Total Employed (Healthcare, National)_Aggregate |double    | Total Employed (Healthcare, National)_Aggregate|
|Total Employed (Healthcare, State)_Aggregate    |double    | Total Employed (Healthcare, State)_Aggregate   |
|Yearly Total Employed (State)_Aggregate         |double    | Yearly Total Employed (State)_Aggregate        |


In [1]:
library(tidyverse)
library(ggformula)

theme_set(theme_bw(base_size = 16))

nurses <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-05/nurses.csv')

head(nurses)

-- Attaching packages --------------------------------------- tidyverse 1.3.2 --
v ggplot2 3.3.6      v purrr   0.3.4 
v tibble  3.1.8      v dplyr   1.0.10
v tidyr   1.2.1      v stringr 1.4.1 
v readr   2.1.2      v forcats 0.5.2 
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
Loading required package: ggstance


Attaching package: 'ggstance'


The following objects are masked from 'package:ggplot2':

    GeomErrorbarh, geom_errorbarh


Loading required package: scales


Attaching package: 'scales'


The following object is masked from 'package:purrr':

    discard


The following object is masked from 'package:readr':

 

State,Year,Total Employed RN,Employed Standard Error (%),Hourly Wage Avg,Hourly Wage Median,Annual Salary Avg,Annual Salary Median,Wage/Salary standard error (%),Hourly 10th Percentile,⋯,Hourly 90th Percentile,Annual 10th Percentile,Annual 25th Percentile,Annual 75th Percentile,Annual 90th Percentile,Location Quotient,Total Employed (National)_Aggregate,"Total Employed (Healthcare, National)_Aggregate","Total Employed (Healthcare, State)_Aggregate",Yearly Total Employed (State)_Aggregate
<chr>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,⋯,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<dbl>
Alabama,2020,48850,2.9,28.96,28.19,60230,58630,0.8,20.75,⋯,38.67,43150,49360,68960,80420,1.2,140019790,8632190,128600,1903210
Alaska,2020,6240,13.0,45.81,45.23,95270,94070,1.4,31.5,⋯,60.7,65530,76830,110890,126260,0.98,140019790,8632190,17730,296300
Arizona,2020,55520,3.7,38.64,37.98,80380,79010,0.9,27.66,⋯,50.14,57530,67760,92920,104290,0.91,140019790,8632190,171010,2835110
Arkansas,2020,25300,4.2,30.6,29.97,63640,62330,1.4,21.47,⋯,39.65,44660,53490,73630,82480,1.0,140019790,8632190,80410,1177860
California,2020,307060,2.0,57.96,56.93,120560,118410,1.0,36.62,⋯,83.35,76180,93970,147830,173370,0.87,140019790,8632190,844740,16430660
Colorado,2020,52330,2.8,37.43,36.78,77860,76500,0.7,26.84,⋯,50.03,55820,64580,90410,104070,0.95,140019790,8632190,144490,2578000


In [2]:
count(nurses, Year)

Year,n
<dbl>,<int>
1998,54
1999,54
2000,54
2001,54
2002,54
2003,54
2004,54
2005,54
2006,54
2007,54


In [3]:
count(nurses, State)

State,n
<chr>,<int>
Alabama,23
Alaska,23
Arizona,23
Arkansas,23
California,23
Colorado,23
Connecticut,23
Delaware,23
District of Columbia,23
Florida,23


## Univariate exploration

First, explore the univariate distribution of the outcome of interest. Fill in the outcome of interest (either median or average hourly wage) in place of "%%" in the code below. You may also want to fill in an appropriate axis label in place of "@@" below. 

+ Summarize any key features of this distribution.

In [None]:
gf_density(~ %%, data = nurses) |>
  gf_labs(x = "@@")
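
Numeric summaries can back up what the density plot shows. The sketch below uses simulated, right-skewed values rather than the nurses data; the vector `wage` and its distribution are assumptions made purely for illustration.

```r
# Simulated hourly wages (hypothetical values, not the nurses data)
set.seed(10)
wage <- rlnorm(200, meanlog = log(30), sdlog = 0.25)

# Center: in a right-skewed distribution the mean tends to sit above the median
center <- c(mean = mean(wage), median = median(wage))

# Spread: standard deviation and interquartile range
spread <- c(sd = sd(wage), iqr = IQR(wage))

# Minimum, quartiles, and maximum in one call
five_num <- quantile(wage, probs = c(0, 0.25, 0.5, 0.75, 1))

round(center, 2)
round(spread, 2)
round(five_num, 2)
```

Comparing the mean with the median, and the distance from the median to each quartile, gives a quick read on skew that can be checked against the shape of the density curve.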

## Bivariate exploration

It is important to explore bivariate relationships before moving to the modeling phase. To do this, fill in the outcome of interest in place of "%%" below and fill in the appropriate predictor in place of "^^". You may also want to fill in appropriate axis labels in place of "@@" below.

+ Summarize the bivariate association.

In [None]:
gf_point(%% ~ ^^, data = nurses, size = 4) |> 
  gf_smooth(method = 'lm', size = 1.5) |>
  gf_smooth(method = 'loess', size = 1.5, color = 'green') |>
  gf_labs(x = "@@",
          y = "@@")

### Jittered bivariate plot

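A jittered plot can be helpful when the predictor takes only a limited set of distinct values, because many points then stack on top of one another. Jitter adds a small amount of random noise to each point's position so that overlapping points become visible.

+ We haven't talked about jitter before; compare the following figure with the scatterplot above.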

In [None]:
gf_jitter(%% ~ ^^, data = nurses, size = 4) |> 
  gf_smooth(method = 'lm', size = 1.5) |>
  gf_smooth(method = 'loess', size = 1.5, color = 'green') |>
  gf_labs(x = "@@",
          y = "@@")
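
To see what jittering does to the values themselves, here is a minimal sketch using base R's `jitter()` on a made-up predictor. The vector `year` and the noise amount of 0.2 are assumptions for illustration; the amount of noise that `gf_jitter()` applies by default may differ.

```r
# Hypothetical predictor with repeated values, as happens with a year variable
set.seed(42)
year <- rep(c(1998, 1999, 2000), each = 4)

# Base R jitter() with amount = 0.2 adds uniform noise between -0.2 and 0.2
year_jittered <- jitter(year, amount = 0.2)

# Each point is displaced slightly, but never by more than 0.2
shift <- year_jittered - year
round(range(shift), 3)

# Repeated values are now distinct, so points no longer sit exactly on top of each other
length(unique(year))
length(unique(year_jittered))
```

Jitter changes only where points are drawn, so it is a display aid; the model should still be fit to the original, unjittered values.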

## Establish Hypotheses

## Set an $\alpha$ value

## Estimate Parameters

To do this, fill in the outcome of interest in place of "%%" below and fill in the appropriate predictor in place of "^^". 

In [None]:
nurse_year <- lm(%% ~ ^^, data = nurses)

coef(nurse_year) |> round(3)
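
As a guide to reading the output of `coef()`, the sketch below fits the same kind of model to simulated data. The data frame `sim` and its columns `year` and `wage` are hypothetical and stand in for whichever outcome and predictor are chosen above.

```r
# Simulated data (hypothetical, not the nurses data)
set.seed(7)
sim <- data.frame(year = rep(1998:2007, each = 6))
sim$wage <- 18 + 0.8 * (sim$year - 1998) + rnorm(nrow(sim), sd = 1.5)

sim_mod <- lm(wage ~ year, data = sim)
est <- coef(sim_mod)

# The slope is the estimated change in mean wage for a one-year increase,
# so predictions for two adjacent years differ by exactly the slope
pred <- predict(sim_mod, newdata = data.frame(year = c(2000, 2001)))
diff_one_year <- unname(pred[2] - pred[1])

# The intercept is the predicted wage at year = 0, far outside the observed years
round(est, 3)
round(diff_one_year, 3)
```

Because the intercept refers to year 0, it is an extrapolation well beyond the data in this setup; the slope is usually the quantity of interest for a question about change over time.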