# Worksheet 11: More Inference

## Objectives: ##

Practice with hypothesis tests and confidence intervals in different contexts.

## Instructions: ##
* Do NOT round any of the values unless you are explicitly told to do so in the question.
* You can compute the required values using R as your calculator.

## Formulae: ##
A confidence interval is calculated by finding
$(\text{point estimate}) \pm \text{multiplier} \times SE$

Standard Error

Standard error for $\bar{x}$

$SE(\bar{x})=\frac{\sigma}{\sqrt{n}}$

$SE(\bar{x_1}-\bar{x_2})=\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}$

Standard error for $\hat{p}$

$SE(\hat{p})=\sqrt{\frac{p(1-p)}{n}}$
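As a quick illustration of the formulas so far, here is a minimal R sketch. The helper names (`se_mean`, `se_diff_means`, `se_prop`) and all of the numbers are made up for this example and are not part of the worksheet. In practice the unknown $\sigma$ is replaced by the sample standard deviation $s$.

```r
# Hypothetical helpers; the names are assumptions made for this sketch.
se_mean <- function(s, n) s / sqrt(n)
se_diff_means <- function(s1, n1, s2, n2) sqrt(s1^2 / n1 + s2^2 / n2)
se_prop <- function(p, n) sqrt(p * (1 - p) / n)

se_mean(s = 2, n = 100)                          # 2 / 10 = 0.2
se_diff_means(s1 = 3, n1 = 36, s2 = 4, n2 = 64)  # sqrt(9/36 + 16/64) = sqrt(0.5)
se_prop(p = 0.5, n = 100)                        # sqrt(0.25 / 100) = 0.05

# Confidence interval: (point estimate) +/- multiplier * SE.
# Made-up values: sample mean 10, s = 2, n = 100, 95% confidence.
multiplier <- qt(0.975, df = 100 - 1)
ci <- 10 + c(-1, 1) * multiplier * se_mean(s = 2, n = 100)
ci
```

The interval is centered on the point estimate, and its half-width is the multiplier times the standard error.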


Test statistic 

$t=\frac{\bar{x}-\mu_0}{\frac{s}{\sqrt{n}}}$

$t=\frac{(\bar{x_1}-\bar{x_2})-0}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$

$z=\frac{\hat{p}-p_0}{\sqrt{\frac{p_0(1-p_0)}{n}}}$

Degrees of Freedom

$df=n-1$ 

$df=\min(n_1-1, n_2-1)$
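The test statistics and degrees of freedom can be sketched in R as follows. The function names and the inputs are hypothetical and chosen only so the arithmetic is easy to check by hand. Note that the two-sample denominator uses the sample variances $s_1^2$ and $s_2^2$, matching $SE(\bar{x_1}-\bar{x_2})$.

```r
# Hypothetical helper functions; names and inputs are assumptions for this sketch.
t_one_sample <- function(xbar, mu0, s, n) (xbar - mu0) / (s / sqrt(n))

t_two_sample <- function(xbar1, xbar2, s1, s2, n1, n2) {
  # The null value for the difference in means is 0.
  ((xbar1 - xbar2) - 0) / sqrt(s1^2 / n1 + s2^2 / n2)
}

z_one_prop <- function(phat, p0, n) (phat - p0) / sqrt(p0 * (1 - p0) / n)

# Conservative degrees of freedom for two independent samples.
df_two_sample <- function(n1, n2) min(n1 - 1, n2 - 1)

t_one_sample(xbar = 10.5, mu0 = 10, s = 2, n = 16)  # 0.5 / 0.5 = 1
t_two_sample(7, 6, s1 = 3, s2 = 4, n1 = 36, n2 = 64) # 1 / sqrt(0.5)
z_one_prop(phat = 0.6, p0 = 0.5, n = 100)            # 0.1 / 0.05 = 2
df_two_sample(36, 64)                                # 35
```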

## Tools: ##

To find the area under the t-distribution to the left of `t`, with `df` degrees of freedom, use the code below.

`pt(t,df)`

To find the cutoff that has area `a` to its left, use the code below.

`qt(a,df)`

(Note that these work the same way as `pnorm` and `qnorm` but for t distributions.)
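For example, here is a small sketch of how `pt` and `qt` fit together. The values of `t_stat` and `df` are made up for illustration.

```r
t_stat <- 2.1   # hypothetical t-score
df <- 15        # hypothetical degrees of freedom

left_area  <- pt(t_stat, df)           # area to the left of t_stat
right_area <- 1 - pt(t_stat, df)       # area to the right of t_stat
two_sided  <- 2 * pt(-abs(t_stat), df) # area in both tails, for a two-sided test

cutoff <- qt(0.95, df)  # value with area 0.95 to its left
pt(cutoff, df)          # gives back 0.95, since pt undoes qt
```

Because the t-distribution is symmetric about 0, the two-sided area is twice the area in one tail.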

To find the mean or standard deviation for subsets of a data set we will use the `by` command. I have included the necessary code where it is needed.

`by(dataset$var1,dataset$var2, mean)` would compute the mean of variable 1, for the different groups in variable 2. (Variable 1 should be numerical and variable 2 should be categorical.)
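To see what `by` returns, here is a sketch on a tiny made-up data frame. The data frame `toy` and its columns `score` and `group` are hypothetical and only serve to make the output easy to check by hand.

```r
# Hypothetical toy data: a numerical variable and a categorical variable.
toy <- data.frame(
  score = c(10, 12, 14, 20, 22),
  group = c("a", "a", "a", "b", "b")
)

group_means <- by(toy$score, toy$group, mean)  # group a: 12, group b: 21
group_sds   <- by(toy$score, toy$group, sd)    # group a: 2,  group b: sqrt(2)

group_means
group_sds
```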

Remember that to summarize a categorical variable you can use the `table` command.
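As a sketch, `table` counts how often each category appears, and dividing a count by the sample size gives a sample proportion. The vector `answers` below is made up for illustration.

```r
# Hypothetical categorical data.
answers <- c("yes", "no", "no", "yes", "no")

counts <- table(answers)  # no: 3, yes: 2
counts

# Sample proportion of "yes" responses: 2 / 5 = 0.4
p_hat <- counts[["yes"]] / length(answers)
p_hat
```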

If you would find it useful to have a graph to look at for one of these questions, you can use `normalplot` (defined below). Just remember to run the code block that defines `normalplot` first.
* Recall that to draw a normal curve with mean (m) and standard deviation (sd) that is shaded from min to max, enter the command:
  * `normalplot(m, sd, c(min, max))`
* NOTE: You are not required to graph for any of this week's questions.

In [None]:
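# Draw a normal curve with mean m and standard deviation sd,
# shading the area between region[1] and region[2] (default: no shading).
normalplot<-function(m,sd,region=c(m,m)){
  # x values covering 3.5 standard deviations on each side of the mean
  x<-seq(m-(3.5)*sd,m+(3.5)*sd,length=1000)
  y<-dnorm(x,m,sd)
  plot(x,y,type="l",xlab="",ylab="", bty="n", yaxt="n")
  # Keep only the x values strictly inside the shaded region
  z<-x[x>region[1]]
  z<-z[z<region[2]]
  polygon(c(region[1],z,region[2]),
          c(0,dnorm(z,m,sd),0),col="gray")
  abline(v=m)   # vertical line at the mean
  abline(h=0)}  # horizontal axis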

## Data Information: ##

## North Carolina births

In 2004, the state of North Carolina released a large data set containing information on births recorded in this state. This data set is useful to researchers studying the relation between habits and practices of expectant mothers and the birth of their children. We will work with a random sample of observations from this data set.


There are **809** observations in this dataset.


#### Name: #### 
* `ncbirths` - a random* sample of 1998 births in North Carolina from 2004.

#### Variables: ####
* `fage`	father’s age in years.
* `mage`	mother’s age in years.
* `mature`	maturity status of mother.
* `weeks`	length of pregnancy in weeks.
* `premie`	whether the birth was classified as premature (premie) or full-term.
* `visits`	number of hospital visits during pregnancy.
* `marital`	whether mother is married or not married at birth.
* `gained`	weight gained by mother during pregnancy in pounds.
* `weight`	weight of the baby at birth in pounds.
* `lowbirthweight`	whether baby was classified as low birthweight (low) or not (not low).
* `gender`	gender of the baby, female or male.
* `habit`	status of the mother as a nonsmoker or a smoker.
* `whitemom`	whether mom is white or not white.

If you read the code that loads the data, you can see that this isn't quite a random sample and is a bit different from last week's data. (Feel free to ask **Jana** why in class.)

In [None]:
source("https://www.openintro.org/data/R/ncbirths.R")
# Drop rows with a missing value in gained, weeks, or fage
ncbirths<-ncbirths[!(is.na(ncbirths$gained) | is.na(ncbirths$weeks) | is.na(ncbirths$fage)),]

# Question 1: Is there a difference in the mean weight of babies of smokers vs. nonsmokers?

### Prepare:

* a.  What are the parameters? What are the null and alternative hypotheses?

* b. Will this be paired data? Or are the observations independent?

#### We will use $\alpha=0.01$ for this test.

### Check

We can assume that the sample is random, the data was collected independently, and 809 is less than 10% of the population. 

* c. Make a histogram by running the given code

* d. Do you meet the requirements to perform a valid hypothesis test?

### Calculate

* e.  Calculate the necessary sample statistics. (You need to know the sample mean and sample standard deviation.)

* f. Calculate the standard error and t-score 

* g. What is the estimated df, the degrees of freedom?

* h. Compute the p-value

### Conclude

* i. State your conclusion.
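The Calculate and Conclude steps above can be sketched in R on simulated data. Everything in this block is an assumption made for illustration: the data frame `sim`, the group sizes, and the means and standard deviations used to simulate it are made up and are NOT the `ncbirths` sample, so the numbers it produces are not the answers to this question.

```r
set.seed(11)  # simulated data, not the ncbirths sample
sim <- data.frame(
  weight = c(rnorm(60, mean = 7.2, sd = 1.4), rnorm(25, mean = 6.8, sd = 1.3)),
  habit  = rep(c("nonsmoker", "smoker"), times = c(60, 25))
)

# Sample statistics for each group
means <- by(sim$weight, sim$habit, mean)
sds   <- by(sim$weight, sim$habit, sd)
ns    <- table(sim$habit)

# Standard error and t-score for the difference in means
se <- sqrt(sds[["nonsmoker"]]^2 / ns[["nonsmoker"]] +
           sds[["smoker"]]^2 / ns[["smoker"]])
t_stat <- (means[["nonsmoker"]] - means[["smoker"]]) / se

# Conservative degrees of freedom and two-sided p-value
df <- min(ns) - 1
p_value <- 2 * pt(-abs(t_stat), df)
p_value < 0.01  # TRUE would mean: reject H0 at alpha = 0.01

# Cross-check: t.test() gives the same t-score but uses the Welch
# approximation for df, which is at least as large as min(n1, n2) - 1.
welch <- t.test(weight ~ habit, data = sim)
c(by_hand = t_stat, t_test = unname(welch$statistic))
```

Because the conservative df is never larger than the Welch df, the p-value computed by hand is never smaller than the one `t.test` reports.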

### Answers

### Prepare:

* a.  What are the parameters? What are the null and alternative hypotheses?

$H_0:  $

$H_a:  $

* b. Will this be paired data? Or are the observations independent?

Type your answer here

#### We will use $\alpha=0.01$ for this test.

* c. Make a histogram by running the given code

In [None]:
par(mfrow=c(3,1))
hist(ncbirths$weight[which(ncbirths$habit=="nonsmoker")],
     main="Birth weights of babies whose mothers are nonsmokers.",xlab="Birth Weight in Lbs")
hist(ncbirths$weight[which(ncbirths$habit=="smoker")],
     main="Birth weights of babies whose mothers are smokers.",xlab="Birth Weight in Lbs")
boxplot(ncbirths$weight~ncbirths$habit,horizontal=TRUE,xlab="Birth Weight (lbs)")

We can assume that the sample is random, the data was collected independently, and 809 is less than 10% of the population. 

You can verify below that there are 84 smokers and 725 nonsmokers in our sample.

* d. Do you meet the requirements to perform a valid hypothesis test? Do you have any concerns?



Type your answer here

### Calculate 

* e.  Calculate the necessary sample statistics by running the code cells below.

In [None]:
by(ncbirths$weight,ncbirths$habit,mean)

Calculate the sample standard deviations

In [None]:
by(ncbirths$weight,ncbirths$habit,sd)

You will need these values later

In [None]:
table(ncbirths$habit)

* f. Calculate the standard error and t-score 

* g. What is the estimated df, the degrees of freedom?

* h. Compute the p-value

* i. State your conclusion.

Type your answer here

# Question 2: Confidence interval for the difference between mean mother's age and mean father's age

### Prepare:

* a.  What are the parameters?

* b. Will this be paired data? Or are the observations independent?

### Check

We can assume that the sample is random, the data was collected independently, and 809 is less than 10% of the population. 

* c. Make a histogram for the difference in age.

* d. Do you meet the requirements to compute a valid confidence interval?

### Calculate

* e.  Calculate the necessary sample statistics. (You need to know the sample mean and sample standard deviation.)

* f. Find the t* multiplier

* g. What is df, the degrees of freedom?

* h. Compute the 90% confidence interval.

### Conclude

* i. Write one or two sentences interpreting this confidence interval.
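As a sketch of the calculation steps, assuming the two ages for each birth are treated as a pair, the interval can be built from the differences. The vectors `mage_sim` and `fage_sim` are simulated stand-ins with made-up parameters, not the `ncbirths` columns, so the interval shown is not the answer to this question.

```r
set.seed(2)  # simulated ages, not the ncbirths sample
mage_sim <- rnorm(50, mean = 27, sd = 6)
fage_sim <- mage_sim + rnorm(50, mean = 2.5, sd = 4)

d <- fage_sim - mage_sim       # one difference per pair
n <- length(d)
d_bar <- mean(d)               # point estimate
se <- sd(d) / sqrt(n)          # standard error of the mean difference
df <- n - 1
t_star <- qt(0.95, df)         # 90% confidence leaves 5% in each tail

ci <- d_bar + c(-1, 1) * t_star * se
ci

# Cross-check with the built-in paired t interval
t.test(fage_sim, mage_sim, paired = TRUE, conf.level = 0.90)$conf.int
```

The multiplier uses `qt(0.95, df)` rather than `qt(0.90, df)` because a 90% interval cuts off 5% of the area in each tail.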

### Answers

### Prepare:

* a. What are the parameters?


Type your answer here

* b. Will this be paired data? Or are the observations independent?

type your answer here

* c. Make a histogram and boxplot for the difference in age.

* c. Do you meet the requirements to compute a valid confidence interval?


type your answer here

### Calculate

* d.  Calculate the necessary sample statistics. (You need to know the sample mean and sample standard deviation.)

* e. Find the standard error

In [None]:
4.3120134179754/sqrt(809)

* f.  What is df, the degrees of freedom?

* g. Find the t* multiplier for the 90% confidence interval

* h. Compute the 90% confidence interval.


### Conclude

* i. Write one or two sentences interpreting this confidence interval.

type your conclusion here

# Question 3: Are less than 20% of babies born early (premies)?

### Prepare:

* a. What is the parameter of interest?  What are the hypotheses?


#### We will use $\alpha=0.01$ for this test.

### Check

We can assume that the sample is random, the data was collected independently, and 809 is less than 10% of the population. 


* b. Do you meet the requirements to perform a valid hypothesis test?

### Calculate

* c. Calculate the standard error.

* d.  Calculate the necessary sample statistics. 

* e. Calculate the z score


* f. Compute the p-value

### Conclude

* g. State your conclusion.
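The steps above can be sketched in R for a one-sided, one-proportion z test. The counts below (`n = 200` births, `x = 30` premies) are hypothetical and are not taken from `ncbirths`, so the result is not the answer to this question.

```r
n  <- 200   # hypothetical sample size
x  <- 30    # hypothetical number of premies
p0 <- 0.20  # value of p under the null hypothesis

# Check: expected successes and failures under H0 should both be at least 10
c(n * p0, n * (1 - p0))          # 40 and 160

p_hat <- x / n                   # sample proportion: 0.15
se <- sqrt(p0 * (1 - p0) / n)    # standard error uses p0, not p_hat
z <- (p_hat - p0) / se           # about -1.77

# H_a: p < 0.20, so the p-value is the area to the left of z
p_value <- pnorm(z)              # roughly 0.039
p_value < 0.01                   # FALSE: do not reject H0 at alpha = 0.01
```

With these made-up counts the sample proportion is below 20%, but the evidence is not strong enough to reject the null hypothesis at the 0.01 level.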

### Answers

### Prepare:

* a. What is the parameter of interest?  What are the hypotheses?

$p$: the proportion of babies in the whole population that are classified as premie.


$H_0: $

$H_a: $

#### We will use $\alpha=0.01$ for this test.

### Check

We can assume that the sample is random, the data was collected independently, and 809 is less than 10% of the population. 


* b. Do you meet the requirements to perform a valid hypothesis test?

type your answer here

### Calculate

* c. Calculate the standard error.

* d.  Calculate the necessary sample statistic.

* e. Calculate the z score

* f. Compute the p-value

### Conclude

* g. State your conclusion.

type your answer here