# Lab 10 Practice: Inference for quantitative data

## Reminder - working with notebooks

#### 1) It is important to save your work, exit the notebook, and log out of syzygy whenever you are finished working on the notebook for that session. Simply closing the window in which you are working will leave the notebook running, which can produce some minor problems when you next try to log in to resume working on the notebook.

- **Select File > Save Notebook or select the Save icon above to save your work.**
- **To exit the notebook, select File > Close and Shutdown Notebook.**
- **Select File > Log Out.**


#### 2) When you resume your work on a notebook, your previous work/output may still be displayed, but none of your previous work is maintained in memory accessible by the notebook. In particular, you will need to load the dataset again in order to continue working with the data. One easy way to refresh your notebook is to go to the notebook cell where you left off and do the following.

- **Select Kernel > Restart Kernel and Run up to Selected Cell.**
#### This will run all of the code in your notebook up to the selected cell.

## Objectives
The objectives of this tutorial/lab are to explore inference procedures such as confidence intervals and hypothesis tests. The R function `inference` will be used to construct the confidence intervals and perform the hypothesis tests. The emphasis in this lab will be on the sample mean and population mean.

* Numerical summaries of quantitative variables
* Graphical summaries of quantitative variables
* Confidence interval for population mean  
* Hypothesis test for population mean  
---

## Load Data: 

In [None]:
download.file("http://www.openintro.org/data/rda/acs12.rda", destfile = "acs12.rda")
load("acs12.rda")

The `download.file` and `load` functions are used to import the dataset that will be used in the tutorial. The data that is available to you is called `acs12`.

In [None]:
source("inference_1770.R")

In addition to loading the data, you must also load the `inference_1770.R` file. This will allow you to use the `inference` function for confidence intervals and hypothesis tests.

## Data Information:

### Data Set:

Today we will be using data from the 2012 American Community Survey. Every year the U.S. Census Bureau contacts over 3.5 million households to participate in the American Community Survey (ACS). The survey provides vital information on a yearly basis about the United States and its people. Information from the survey generates data that help determine how more than $675 billion in federal and state funds are distributed each year. 

The `acs12` dataset contains responses from a random sample of 2000 U.S. adults on 13 variables.

 
#### Name: #### 
* `acs12` - American Community Survey data from 2,000 U.S. adults.

#### Variables: ####
* `income` - Annual income.
* `employment` - Employment status.
* `hrs_work` - Hours worked per week.
* `race` - Race.
* `age` - Age, in years.
* `gender` - Gender.
* `citizen` - Whether the person is a U.S. citizen.
* `time_to_work` - Travel time to work, in minutes.
* `lang` - Language spoken at home.
* `married` - Whether the person is married.
* `edu` - Education level.
* `disability` - Whether the person is disabled.
* `birth_qrtr` - The quarter of the year that the person was born, e.g. Jan thru Mar.

## Getting Started

R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different subject) and each column is a different variable.

Use functions such as `head`, `tail`, `str`, `dim`, `summary`, and `names` to begin exploring the `acs12` data frame.

In [None]:
head(acs12)

In [None]:
dim(acs12)

In [None]:
summary(acs12)

There are many observations that have missing data. Ordinarily one would want to be careful about dealing with missing data. However, we are simply going to use the `na.omit` function to remove all observations with missing data and create a new dataset called `acs`.

In [None]:
acs  = na.omit(acs12)

In [None]:
dim(acs)

The revised dataset includes 783 observations.
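To see what `na.omit` does, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from `acs12`). It shows that `na.omit` drops a row if *any* variable is missing, which is why so many observations are removed, and contrasts this with dropping only the rows where `hrs_work` is missing.

```r
# Hypothetical toy data frame, for illustration only (not the ACS data)
toy <- data.frame(
  income   = c(52000, NA, 31000, 0),
  hrs_work = c(40, 35, NA, 20),
  age      = c(34, 51, 28, 19)
)

# Number of missing values in each variable
colSums(is.na(toy))

# na.omit removes every row that has a missing value in ANY column
toy_complete <- na.omit(toy)
nrow(toy_complete)

# Alternative: remove only the rows where hrs_work itself is missing
toy_hrs <- toy[!is.na(toy$hrs_work), ]
nrow(toy_hrs)
```

In this lab we follow the simpler `na.omit` approach, but the second approach would keep more observations when only one variable is of interest.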

The variable of interest for us is `hrs_work`, which records the hours worked per week.

### Summarizing sample data

#### Numerical summaries

Numerical variables, such as `hrs_work`, may be summarized using the `summary` function, or by using individual functions such as `mean`, `sd`, `median`, and others.

The `summary` function returns the five-number summary plus the mean.

In [None]:
summary(acs$hrs_work)

The `mean` function returns the mean of the data and the `sd` function returns the standard deviation.

In [None]:
mean(acs$hrs_work)

In [None]:
sd(acs$hrs_work)

#### Graphical summaries

Histograms and boxplots are common graphical summaries for numerical variables.  
  
The function `hist` creates a histogram of the data. The function will automatically define bins and display the number of observations within each bin. The number of bins or the bin cutoffs can be specified, but we will just use the default choices for now.

The `hist` function only requires the data to be graphed, but as with the `plot` function, there are some optional arguments that may be used to customize the histogram. Some of these optional arguments include:
* `xlab` - specify the label for the x-axis, e.g. `xlab = "Hours per week"`
* `ylab` - specify the label for the y-axis
* `xlim` - specify the minimum and maximum value for the x-axis, e.g. `xlim = c(minimum, maximum)`
* `ylim` - specify the minimum and maximum value for the y-axis, e.g. `ylim = c(minimum, maximum)`
* `main` - specify a main title for the graph, e.g. `main = "Weekly work hours for U.S. adults"`



In [None]:
hist(acs$hrs_work, xlab = "Hours per week", main = "Weekly work hours for U.S. adults" ) 
#produces histogram of weekly hours worked
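
Although the lab uses the default bins, bin cutoffs can be set with the `breaks` argument of `hist`. The sketch below uses a small made-up vector (`toy_hours` is hypothetical, not the ACS data) and `plot = FALSE` so that the bin counts can be inspected without drawing the graph.

```r
# Hypothetical weekly hours, for illustration only
toy_hours <- c(12, 20, 25, 32, 35, 38, 40, 40, 40, 40,
               42, 45, 45, 50, 55, 60, 72)

# Bins of width 10 from 0 to 80; the breaks must span the range of the data
h <- hist(toy_hours, breaks = seq(0, 80, by = 10), plot = FALSE)

h$breaks   # bin cutoffs
h$counts   # number of observations in each bin
```

Removing `plot = FALSE` (and adding `xlab` and `main` as above) draws the histogram with the same bins.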

The function `boxplot` creates a boxplot of the data.

The `boxplot` function only requires the data to be graphed, but as with the `plot` function, there are some optional arguments that may be used to customize the boxplot. Some of these optional arguments include:
* `xlab` - specify the label for the x-axis
* `ylab` - specify the label for the y-axis, e.g. `ylab = "Hours per week"`
* `main` - specify a main title for the graph, e.g. `main = "Weekly work hours for U.S. adults"`

#### The distribution of weekly hours worked is generally symmetric. It is approximately bell-shaped, but has a much greater concentration of observations in the 30-40 hour range than we would typically expect of bell-shaped data. The centre of the distribution is around 40 hours and almost all respondents worked between 0-80 hours in a week.

In [None]:
boxplot(acs$hrs_work, ylab = "Hours per week", main = "Weekly work hours for U.S. adults" ) 
#produces boxplot of weekly hours worked

#### The boxplot suggests a distribution that is generally symmetric. The median hours worked is around 40 hours and the interquartile range is around 20 hours, since Q1 is approximately 30 hours and Q3 is approximately 50 hours. Quite a few observations are identified as potential outliers, but there are very few, if any, observations that are actual outliers in the sense that they are distinct/removed from the rest of the data. This is also apparent in the histogram.
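
A boxplot flags a point as a potential outlier when it lies more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule by hand to a made-up vector (`toy_hours` is hypothetical) and compares the result with base R's `boxplot.stats`. Note that `boxplot` uses hinges rather than `quantile`, so the two can differ slightly on other data.

```r
# Hypothetical weekly hours, for illustration only
toy_hours <- c(5, 30, 32, 35, 38, 40, 40, 40, 42, 45, 48, 50, 85)

q1  <- quantile(toy_hours, 0.25, names = FALSE)
q3  <- quantile(toy_hours, 0.75, names = FALSE)
iqr <- q3 - q1

# Fences: values outside these limits are flagged as potential outliers
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

flagged <- toy_hours[toy_hours < lower_fence | toy_hours > upper_fence]
flagged

# The values boxplot() would draw as individual points
boxplot.stats(toy_hours)$out
```

Being flagged by this rule does not by itself make an observation an actual outlier; as noted above, that judgement depends on whether the point is clearly separated from the rest of the data.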

### Inference on means

We would like insight into the population parameters. We know the average (mean) hours worked per week for the U.S. adults in  the sample. This represents a **statistic**, $\bar{x}$. We are interested in the average (mean) hours worked per week for all U.S. adults. We need to estimate this **parameter**, $\mu$.

The inferential tools for a population mean are the confidence interval and the hypothesis test.

#### Hypothesis tests

#### Recall: Hypothesis tests for a population mean typically rely upon the Central Limit Theorem

In order for the Central Limit Theorem to hold:
* observations must be independent
* sample size must be sufficiently large, or underlying data must have a nearly normal distribution. 

The sample size is typically considered sufficiently large when there are at least 30 observations, $n \ge 30$.

**Independence: <br>In this study, we have a simple random sample from the population. The most common way for observations to be considered independent is if they are from a simple random sample.** <br> ***We also require the sample size to be less than 10% of the population, which is clearly true in this case: the 783 U.S. adults in our sample (2000 U.S. adults originally sampled) are far fewer than 10% of all U.S. adults.***

**Nearly normal/sample size: <br> The histogram suggests that the population distribution is symmetric. Although it may not be quite normal, we may be comfortable assuming that it is nearly normal. If we are not comfortable with that assumption, the sample size of 783 is clearly large enough to satisfy the necessary condition for the central limit theorem. Additionally there are no extreme outliers of concern.**
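
The role of the sample size condition can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is a made-up, strongly right-skewed exponential distribution (not the hours-worked data), and the names `n`, `reps`, and `sample_means` are chosen for illustration. With $n = 30$, the sample means are centred at the population mean with spread close to $\sigma/\sqrt{n}$.

```r
set.seed(1770)   # makes the simulation reproducible

n    <- 30       # size of each sample
reps <- 5000     # number of simulated samples

# Exponential population with rate 1: mean = 1, sd = 1, right-skewed
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

c(mean_of_means = mean(sample_means),   # should be close to 1
  sd_of_means   = sd(sample_means),     # should be close to theory_se
  theory_se     = 1 / sqrt(n))          # sigma / sqrt(n)

# hist(sample_means) would show a roughly bell-shaped distribution
```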

Suppose we are interested in testing whether or not workers are working a traditional 40-hour workweek. The alternative hypothesis that we have in mind is that the workweek differs from 40 hours. The `inference` function may be used to test the hypotheses

$H_0: \mu = 40$ vs $H_A: \mu \neq 40$

The `inference` function may be used for hypothesis tests of population mean with the appropriate arguments:
* `y` - response variable of interest; in this exercise `acs$hrs_work` records reported weekly hours worked
* `est` - parameter we are interested in: "mean", "median", or "proportion"
* `type` - type of inference; confidence interval, "ci", or hypothesis test, "ht"
* `null` - the value of the parameter stated in the null hypothesis
* `alternative` - the alternative hypothesis is either "greater", "less", or "twosided"
* `method` - method of inference; based upon normal distribution, "theoretical", or based on simulation, "simulation"

#### Note: There is no `success` argument as there is when conducting hypothesis tests for proportions.

In [None]:
inference(y = acs$hrs_work, est = "mean", type = "ht",  null = 40, alternative = "twosided", method = "theoretical")

#### Question: <br><br>Identify the value of the test statistic and the p-value.

#### Answer:<br><br> The test statistic is $Z = -2.245$ with an associated p-value of 0.0248. 
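
The reported values can be checked by hand. The sketch below defines a hypothetical helper, `z_test_mean` (not part of `inference_1770.R`, whose internals are not shown here), that computes $Z = (\bar{x} - \mu_0)/(s/\sqrt{n})$ and the corresponding normal p-value. It then recomputes the two-sided p-value from the reported statistic $Z = -2.245$.

```r
# Hypothetical helper: one-sample Z test for a mean, computed by hand
z_test_mean <- function(x, null,
                        alternative = c("twosided", "less", "greater")) {
  alternative <- match.arg(alternative)
  se <- sd(x) / sqrt(length(x))     # estimated standard error of the mean
  z  <- (mean(x) - null) / se       # test statistic
  p  <- switch(alternative,
               twosided = 2 * pnorm(-abs(z)),
               less     = pnorm(z),
               greater  = pnorm(z, lower.tail = FALSE))
  c(z = z, p_value = p)
}

# Two-sided p-value implied by the reported test statistic
p_reported <- 2 * pnorm(-abs(-2.245))
round(p_reported, 4)
```

Once the data are loaded, `z_test_mean(acs$hrs_work, null = 40)` should give values close to the output of `inference`, assuming `inference` uses the same standard error.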

#### Question: <br><br>Is there convincing evidence that the average weekly hours worked by U.S. adults differs from 40 hours?

#### Answer: <br><br> A p-value of 0.0248 typically represents strong evidence against the null hypothesis and in favour of the alternative hypothesis. We have strong evidence against the claim that the average weekly hours worked is 40 hours and in favour of the claim that it differs from 40 hours.

#### Note: The sample average is 39 hours. Although we have established a statistically significant difference in average hours worked, the difference may not be meaningful in a practical sense.

#### Exercise: <br><br>  Suppose we had been interested in testing whether or not workers are working less than a traditional 40-hour workweek. The alternative hypothesis that we have in mind is that the workweek is less than 40 hours; <br><br> $H_0: \mu = 40$ vs $H_A: \mu < 40$. <br><br> Perform a test of these hypotheses. Report the test statistic and p-value. What is the conclusion of the hypothesis test (in the context of this question)?

<details>

<summary><b>Click to view sample code:</b></summary>


```
inference(y = acs$hrs_work, est = "mean", type = "ht",  null = 40, alternative = "less", method = "theoretical")
```


</details>

#### Answer:   

<details>

<summary><b>Sample Answer:</b></summary>

<br>*The test statistic is Z = -2.245 and the p-value is 0.0124.<br>Since the p-value is so small, p-value = 0.0124, we have strong evidence against the null hypothesis and in favour of the alternative that the average workweek is less than 40 hours.*

</details>

### Why isn't the test statistic a T-statistic? 

#### For large samples (large degrees of freedom), the T-distribution and Z-distribution are virtually indistinguishable. Therefore, working with the T-distribution, as suggested by statistical theory, will produce nearly identical results to working with the Z-distribution.

#### With that information in mind, the `inference` function chooses to use the Z-distribution for confidence intervals and hypothesis tests whenever the sample size is at least 30 ($n\ge30$). 

#### For example, consider a smaller sample of the hours worked data. Suppose we had a sample of 15 observations and conducted the same hypothesis test.

In [None]:
acs_hrs_small_sample = sample(acs$hrs_work,size=15)
# selects a random sample of 15 observations and saves those values as a new object called acs_hrs_small_sample

In [None]:
inference(y = acs_hrs_small_sample, est = "mean", type = "ht",  null = 40, alternative = "twosided", method = "theoretical")

#### Now the test statistic is a T-statistic, as expected, and we are given the degrees of freedom, $df = n-1 = 15 - 1 = 14$, that identifies the T-distribution that is needed for calculating the p-value.
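
How close the two distributions are can be seen by comparing critical values. This sketch uses the base R functions `qnorm` and `qt` for a 95% two-sided procedure, with the degrees of freedom for the small sample ($df = 14$) and for the full sample ($df = 783 - 1 = 782$).

```r
# Critical values for a 95% two-sided test or interval
z_crit       <- qnorm(0.975)          # standard normal (Z)
t_crit_small <- qt(0.975, df = 14)    # T-distribution, n = 15
t_crit_large <- qt(0.975, df = 782)   # T-distribution, n = 783

round(c(z = z_crit, t_df14 = t_crit_small, t_df782 = t_crit_large), 3)
```

With 14 degrees of freedom the T critical value is noticeably larger than the Z value, while with 782 degrees of freedom the two agree to about two decimal places.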

#### Confidence intervals

If the conditions for inference are reasonable, we can either calculate the standard error, $SE_{\bar{x}}$ and construct the interval by hand, or allow the `inference` function to do it for us.  
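
A minimal sketch of the by-hand calculation, $\bar{x} \pm z^* \times SE_{\bar{x}}$ with $SE_{\bar{x}} = s/\sqrt{n}$. The helper `ci_mean` and the simulated vector `sim_hours` are hypothetical names introduced for illustration; the simulated values are not the ACS sample.

```r
# Hypothetical helper: Z-based confidence interval for a population mean
ci_mean <- function(x, conflevel = 0.95) {
  se     <- sd(x) / sqrt(length(x))            # standard error of the mean
  z_star <- qnorm(1 - (1 - conflevel) / 2)     # critical value, e.g. 1.96 for 95%
  mean(x) + c(lower = -1, upper = 1) * z_star * se
}

set.seed(1770)
sim_hours <- rnorm(200, mean = 39, sd = 12)    # simulated data, for illustration

ci_95 <- ci_mean(sim_hours, conflevel = 0.95)
ci_99 <- ci_mean(sim_hours, conflevel = 0.99)
ci_95
ci_99   # wider than the 95% interval, same centre
```

Once the data are loaded, `ci_mean(acs$hrs_work)` can be compared with the interval reported by `inference`.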
  
The `inference` function relies upon many arguments. For constructing confidence intervals for a population mean, the following arguments are important:
* `y` - response variable of interest; in this exercise `acs$hrs_work` records reported weekly hours worked
* `est` - parameter we are interested in: "mean", "median", or "proportion"
* `type` - type of inference; confidence interval, "ci", or hypothesis test, "ht"
* `conflevel` - confidence level desired
* `method` - method of inference; based upon normal distribution, "theoretical", or based on simulation, "simulation"

#### Note: There is no `success` argument as there is when constructing confidence intervals for proportions.

Use the `inference` function to construct a 95% confidence interval for the average weekly hours worked by all U.S. adults.

In [None]:
inference(y = acs$hrs_work, est = "mean", type = "ci", conflevel = 0.95, method = "theoretical")

#### Interpretation: We are 95% confident that U.S. adults work on average between 38.1 hours and 39.9 hours per week.

#### Note: This suggests that 40 hours is not a plausible value for the average hours worked per week by all U.S. adults. This is consistent with the results of our earlier hypothesis test. A 95% confidence interval and a two-sided hypothesis test at the $\alpha = 0.05$ level of significance provide equivalent information on whether any particular value is plausible.
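
The equivalence between a 95% confidence interval and a two-sided test at $\alpha = 0.05$ can be checked numerically. This sketch uses base R's `t.test` (which works with the T-distribution, unlike the Z-based output of `inference` for large samples) on simulated data; `sim_hours`, `nulls`, and `comparison` are hypothetical names and the values are not the ACS sample.

```r
set.seed(1770)
sim_hours <- rnorm(200, mean = 39, sd = 12)    # simulated data, for illustration

# 95% confidence interval for the mean
ci <- t.test(sim_hours, conf.level = 0.95)$conf.int

# Two-sided p-values for a range of hypothesized means
nulls    <- 36:42
p_values <- sapply(nulls, function(m) t.test(sim_hours, mu = m)$p.value)

comparison <- data.frame(
  null      = nulls,
  p_value   = round(p_values, 4),
  inside_ci = nulls >= ci[1] & nulls <= ci[2],
  reject    = p_values < 0.05
)
comparison
```

Every hypothesized value inside the interval is not rejected, and every value outside it is rejected.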

#### Exercise: <br><br>Construct a 99% confidence interval for the average weekly hours worked by all U.S. adults. Report and interpret the confidence interval.

<details>

<summary><b>Click to view sample code:</b></summary>


```
inference(y = acs$hrs_work, est = "mean", type = "ci", conflevel = 0.99, method = "theoretical")
```


</details>

#### Answer:   

<details>

<summary><b>Sample Answer:</b></summary>

<br>*The 99% confidence interval is (37.8498, 40.1476). We are 99% confident that U.S. adults work on average between 37.8 hours and 40.1 hours per week.*

</details>

#### Let’s stop here. 

#### It is important to save your work, exit the notebook, and log out of syzygy when you are done. Simply closing the window in which you are working will leave the notebook running, which can produce some minor problems when you next try to log in.

- **Select File > Save Notebook or select the Save icon above to save your work.**
- **To exit the notebook, select File > Close and Shutdown Notebook.**
- **Select File > Log Out.**