# Homework #6

**See Canvas for the HW #6 assignment and due date**. Complete all of the following problems. Ideally, the theoretical problems should be answered in a Markdown cell directly underneath the question. If you don't know LaTeX/Markdown, you may submit separate handwritten solutions to the theoretical problems, but please see the [class scanning policy](https://docs.google.com/document/d/17y5ksolrn2rEuXYBv_3HeZhkPbYwt48UojNT1OvcB_w/edit?usp=sharing). Please do not turn in messy work. Computational problems should be completed in this notebook (using the R kernel). Computational questions may require code, plots, analysis, interpretation, etc. Working in small groups is allowed, but it is important that you make an effort to master the material and hand in your own work. 



## A. Theoretical Problems

## A.1 [10 points] Approximate Confidence Interval for Proportions

Recall from an earlier assignment that if $np > 5$ and $n(1-p) > 5$, then $X \sim \text{Bin}(n,p)$ is well-approximated by $Y \sim N\left(np, np(1-p)\right)$. Use this approximation in the question below. In particular, an approximate $(1-\alpha)\times 100\%$ confidence interval for a population proportion $p$ is given by

\begin{align*}
\widehat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\widehat{p}(1-\widehat{p})}{n}},
\end{align*}

where $\widehat{p}$ is the sample proportion. 

**Suppose that 12 people in a sample of 95 are members of the Green Party. Calculate an approximate 90% confidence interval for the true proportion of Green Party members in the population. Interpret this interval.**

The sample proportion is $\hat{p} = \frac{12}{95} \approx 0.126$, the number of successes divided by the number of people in the sample. The normal approximation is reasonable here because $n\hat{p} = 12 > 5$ and $n(1-\hat{p}) = 83 > 5$. For a 90% confidence interval, $\alpha = 0.10$ and $z_{1-\alpha/2} = z_{0.95} \approx 1.645$, so I can plug these numbers into the provided formula using the code below. 

In [4]:
n = 95
phat = 12 / 95 
z = 1.645

lower_bound = phat - z * sqrt((phat * (1 - phat)) / n)

cat("The lower bound for a 90% CI is:", lower_bound)

The lower bound for a 90% CI is: 0.07024842

In [5]:
upper_bound = phat + z * sqrt((phat * (1 - phat)) / n)
cat("The upper bound for a 90% CI is:", upper_bound)

The upper bound for a 90% CI is: 0.1823832

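Interpretation: 

We are 90% confident that the true proportion of Green Party members in the population lies between 7.02% and 18.24%. More precisely, if we repeatedly drew samples of 95 people and built an interval this way each time, about 90% of those intervals would contain the true proportion $p$. 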
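
The same calculation can be wrapped in a small helper, as a sketch. The function name `wald_ci` is my own choice (it is not part of the assignment), and it uses `qnorm()` for the critical value instead of the rounded 1.645, so the bounds can differ from the output above in the later decimal places.

```r
# Hypothetical helper (the name is illustrative): approximate confidence
# interval for a proportion, based on the normal approximation to the binomial.
wald_ci <- function(successes, n, conf_level = 0.90) {
  phat <- successes / n
  # Rule-of-thumb check from the problem statement, applied to the estimate
  stopifnot(n * phat > 5, n * (1 - phat) > 5)
  alpha <- 1 - conf_level
  z <- qnorm(1 - alpha / 2)                  # critical value z_{1 - alpha/2}
  margin <- z * sqrt(phat * (1 - phat) / n)  # margin of error
  c(lower = phat - margin, upper = phat + margin)
}

ci_green <- wald_ci(12, 95, conf_level = 0.90)
print(round(ci_green, 4))
```

Changing `conf_level` to another value, such as 0.95, reuses the same formula with a different critical value.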

## A.2 [10 points] Speed of Light

In 1881 Michelson and Newcomb measured the time light took to travel a distance of $7400$ meters. From a study of their experimental setup and a descriptive study of their $64$ measurements, we conclude that the data can be assumed to be i.i.d. These measurements yield the following sample quantities in microseconds for the sample mean $\bar x$ and sample standard deviation $s$:
$$\bar{x}=27.75, s=5.08$$
Construct a 95% confidence interval for the time light takes to travel $7400$ meters.



We use the t-distribution because the population standard deviation is unknown and must be estimated by the sample standard deviation $s$. The t-distribution with $n-1$ degrees of freedom accounts for this extra uncertainty. Fewer degrees of freedom give heavier tails, which produces a larger critical value and therefore a wider interval; as the degrees of freedom grow, the t-distribution approaches the standard normal. The critical value depends on $\alpha$ and on the sample size (through the degrees of freedom). Here, I'm using the `qt()` function in R to find the t-value for me. 

The t-distribution confidence interval formula is: 

$$ \bar{x} \pm t_{1-\alpha/2} \frac{s}{\sqrt{n}} $$

From here, we can plug our numbers in to get the confidence interval.

In [11]:
xbar = 27.75
sd = 5.08
n = 64
df = n - 1
alpha = 0.05

t_val = qt(1 - alpha/2, df)

lower = xbar - (t_val * (sd / sqrt(n)))

cat("The lower bound for the confidence interval is:", lower)

The lower bound for the confidence interval is: 26.48105

In [12]:
upper = xbar + (t_val * (sd / sqrt(n)))
cat("The upper bound for the confidence interval is:", upper)

The upper bound for the confidence interval is: 29.01895
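
As a check on the code output, the arithmetic can be written out by hand. The critical value below is rounded to three decimals, so the final digits are approximate.

\begin{align*}
\frac{s}{\sqrt{n}} &= \frac{5.08}{\sqrt{64}} = 0.635, \\
t_{0.975,\,63} &\approx 1.998, \\
\bar{x} \pm t_{0.975,\,63}\,\frac{s}{\sqrt{n}} &\approx 27.75 \pm 1.998\,(0.635) \approx 27.75 \pm 1.269.
\end{align*}

This gives approximately $(26.48, 29.02)$ microseconds, in agreement with the bounds printed above.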

## A.3 A Change in Confidence

A journal article reports that a sample of size $n = 5$ was used as a basis for calculating a $95\%$ CI for the true average natural frequency (Hz) of delaminated beams of a certain type. The resulting interval was $(229.764, 233.504)$. You decide that a confidence level of 99% is more appropriate than the 95% level used.

**A.3 [14 points] (a) What are the limits of the 99% interval?**


We use the t-distribution here because the sample size is very small ($n=5$) and the population standard deviation is unknown. As a rule of thumb, the t-distribution is preferred over the normal distribution for sample sizes below 30. The t-distribution confidence interval formula is: 

$$ \bar{x} \pm t_{1-\alpha/2} \frac{s}{\sqrt{n}} $$

Given the 95% confidence interval, we can recover $\bar{x}$ as the midpoint of its two endpoints. The margin of error is the distance from $\bar{x}$ to either endpoint; in my code below, I compute it as $233.504 - \bar{x}$. With the margin of error known, I can solve for the sample standard deviation, since I can find the t-value for a 95% confidence interval and I already know the sample size. 

$$ \text{Margin of Error} = t_{1-\alpha/2} \frac{sd}{\sqrt{n}} $$

Rearranging, I'll solve for the standard deviation of the sample:

$$ sd = \frac{\text{Margin of Error} * \sqrt{n}}{t_{1-\alpha/2}} $$

As a reminder, the t-value used in this step is the one for the 95% confidence interval. The standard error $sd/\sqrt{n}$ is a property of the sample and does not depend on the confidence level. Now that I have $\bar{x}$, the sample standard deviation, the sample size $n$, and the t-value for a 99% confidence interval, I can solve for the new confidence interval. 

In [25]:
n1 = 5
df1 = n1 - 1
alpha_95 = 0.05
t_val_95 = qt(1 - alpha_95/2, df1)
xbar1 = (233.504 + 229.764) / 2

marerr = 233.504 - xbar1

sd1 = (marerr * sqrt(n1)) / t_val_95

alpha_99 = 0.01
t_val_99 = qt(1 - alpha_99/2, df1)

lower1 = xbar1 - (t_val_99 * (sd1 / sqrt(n1)))

cat("The lower bound for the confidence interval is:", lower1)

The lower bound for the confidence interval is: 228.533

In [26]:
upper1 = xbar1 + (t_val_99 * (sd1 / sqrt(n1)))
cat("The upper bound for the confidence interval is:", upper1)

The upper bound for the confidence interval is: 234.735
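
Because the standard error is the same for both intervals, the 99% margin of error is just the 95% margin rescaled by the ratio of the two critical values. The sketch below takes that shortcut without solving for the standard deviation; the variable names are my own.

```r
# Reported 95% interval from the journal article
lo95 <- 229.764
hi95 <- 233.504
n_beams <- 5
df_beams <- n_beams - 1

center <- (lo95 + hi95) / 2       # sample mean (midpoint of the interval)
margin95 <- (hi95 - lo95) / 2     # 95% margin of error (half-width)

# Only the critical value changes between confidence levels:
# margin99 = margin95 * t_{0.995, df} / t_{0.975, df}
t_ratio <- qt(0.995, df_beams) / qt(0.975, df_beams)
margin99 <- margin95 * t_ratio

ci99 <- c(lower = center - margin99, upper = center + margin99)
print(round(ci99, 3))
```

This should reproduce the bounds computed above, since the standard deviation cancels out of the calculation.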

## A.4 MLEs

Suppose that $X_1,...,X_n \overset{iid}{\sim}N(\mu, \sigma^2)$, where $\sigma$ is known, and we are ultimately interested in an estimator for $\theta = \mu^2$.

**A.4 (a) [12 points] First, find the MLE of $\mu$.**

First, we need the pdf for a normal distribution: 

$$ f(x | \mu, \sigma) = \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2} \left(\frac{x - \mu}{\sigma}\right)^2} $$

We can write the likelihood function as the product of the individual pdfs: 

$$ L(\mu) = \prod_{i=1}^{n} f(X_{i} | \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sigma \sqrt{2 \pi}} e^{-\frac{1}{2} \left(\frac{X_{i} - \mu}{\sigma}\right)^2} $$

Rewrite it as: 

$$ L(\mu) = \left(\frac{1}{\sigma \sqrt{2 \pi}}\right)^n e^{-\frac{1}{2 \sigma^2} \sum_{i=1}^{n}({X_{i} - \mu})^2} $$

Log-likelihood function: 

$$ \ln(L(\mu)) = \ell(\mu) = \frac{-n}{2} \ln(2 \pi \sigma^2) - \frac{1}{2 \sigma^2} \sum_{i=1}^{n}({X_{i} - \mu})^2 $$

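Differentiate with respect to $\mu$ and set equal to zero:

$$ \frac{d\ell(\mu)}{d\mu} = 0 - \frac{1}{2 \sigma^2} (2)(-1) \sum_{i=1}^{n}({X_{i} - \mu}) = \frac{1}{\sigma^2} \sum_{i=1}^{n}({X_{i} - \mu}) = 0 $$

Multiplying both sides by $\sigma^2$ and expanding the sum:

$$ \sum_{i=1}^{n} X_{i} - n\mu = 0 $$

$$ \sum_{i=1}^{n} X_{i} = n\mu $$

$$ n\bar{X} = n\mu $$

$$ \hat{\mu} = \bar{X} $$

The second derivative, $\frac{d^2\ell(\mu)}{d\mu^2} = -\frac{n}{\sigma^2}$, is negative, so this critical point is a maximum.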
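
As a sanity check on the derivation, the log-likelihood can be maximized numerically and compared with the sample mean. This is only a sketch on simulated data: the true mean, the value of $\sigma$, the sample size, and the seed are all assumptions made for illustration.

```r
set.seed(1)                           # arbitrary seed, for reproducibility
sigma <- 2                            # treated as known (assumed value)
x <- rnorm(50, mean = 3, sd = sigma)  # simulated sample; true mu = 3 is assumed

# Log-likelihood of mu for i.i.d. N(mu, sigma^2) data with sigma known
loglik <- function(mu) {
  sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}

# Maximize numerically over an interval that contains the whole sample
fit <- optimize(loglik, interval = range(x) + c(-1, 1),
                maximum = TRUE, tol = 1e-8)

mu_hat_numeric <- fit$maximum         # numerical maximizer
mu_hat_closed <- mean(x)              # closed-form MLE from the derivation
cat("numeric:", mu_hat_numeric, " closed form:", mu_hat_closed, "\n")
```

The two values should agree up to the optimizer's tolerance.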

**A.4 (b) [4 points] Find the maximum likelihood estimator (MLE) for $\theta$, denoted $\widehat{\theta}$.**

This should be easy!

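From the previous question, we already know that 

$$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_{i} = \bar{X} $$

By the invariance property of MLEs, the MLE of $\theta = \mu^2$ is obtained by plugging in $\hat{\mu}$:

$$ \hat{\theta} = \hat{\mu}^2 = \left(\frac{1}{n} \sum_{i=1}^{n} X_{i}\right)^2 = \bar{X}^2 $$

So, 

$$ \hat{\theta} = \bar{X}^2 $$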

**A.4(c) [10 points] Compute the bias of $\widehat{\theta}$, denoted $\text{Bias}(\widehat{\theta})$. Recall that $Bias(\widehat{\theta}) = E(\widehat{\theta}) - \theta$.**

Since $\hat{\theta} = \bar{X}^2$, we have $E[\hat{\theta}] = E[\bar{X}^2]$. We can use the variance shortcut, $Var(\bar{X}) = E[\bar{X}^2] - (E[\bar{X}])^2$, to solve for $E[\bar{X}^2]$. Rearranging, we get: 

$$ E[\bar{X}^2] = Var(\bar{X}) + (E[\bar{X}])^2 $$

For a sample mean of i.i.d. observations, $E[\bar{X}] = \mu$ and $Var(\bar{X}) = \frac{\sigma^2}{n}$. Plugging into the equation, we get:

$$ E[\bar{X}^2] = \frac{\sigma^2}{n} + \mu^2 = E[\hat{\theta}] $$

We can substitute this back into the given bias equation (recall that $\theta = \mu^2$ from the beginning of the problem):

$$ Bias(\hat{\theta}) = E[\hat{\theta}] - \theta = \frac{\sigma^2}{n} + \mu^2 - \mu^2 $$ 

$$ Bias(\hat{\theta}) = \frac{\sigma^2}{n} $$
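
A quick Monte Carlo experiment can illustrate this result. The sketch below is not part of the required solution; the values of $\mu$, $\sigma$, $n$, the number of replications, and the seed are assumptions chosen for illustration.

```r
set.seed(42)          # arbitrary seed, for reproducibility
mu <- 2               # assumed true mean
sigma <- 3            # assumed known standard deviation
n <- 10               # assumed sample size
reps <- 100000        # number of simulated samples

# Each row is one simulated sample of size n
samples <- matrix(rnorm(reps * n, mean = mu, sd = sigma), nrow = reps)

theta_hat <- rowMeans(samples)^2       # MLE of theta = mu^2 for each sample
bias_mc <- mean(theta_hat) - mu^2      # Monte Carlo estimate of the bias
bias_theory <- sigma^2 / n             # bias derived above

cat("simulated bias:", bias_mc, " theoretical bias:", bias_theory, "\n")
```

The simulated bias should be close to $\sigma^2/n$, with some Monte Carlo error that shrinks as `reps` grows.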

## B. Computational Problems

## B.1 Hubble Data

Load `hubble.csv` into `R`. A description of the variables can be obtained from page 73 of https://cran.r-project.org/web/packages/gamair/gamair.pdf.

In [3]:
hubble = read.csv('hubble.csv')
head(hubble)

| | X | Galaxy | y | x |
|---|---|---|---|---|
| | `<int>` | `<chr>` | `<int>` | `<dbl>` |
| 1 | 1 | NGC0300 | 133 | 2.0 |
| 2 | 2 | NGC0925 | 664 | 9.16 |
| 3 | 3 | NGC1326A | 1794 | 16.14 |
| 4 | 4 | NGC1365 | 1594 | 17.95 |
| 5 | 5 | NGC1425 | 1473 | 21.88 |
| 6 | 6 | NGC2403 | 278 | 3.22 |


**B.1 (a) [20 points] Calculate the $85\%$ confidence interval for the mean of a galaxy's distance from Earth in megaparsecs in `R` by doing the computation explicitly.**

In [22]:
alpha85 = 0.15 
xbar85 = mean(hubble$x)
n_hubble = length(hubble$x)
df_hubble = n_hubble - 1
t85 = qt(1 - alpha85/2, df_hubble)
sd85 = sd(hubble$x)

lower85 = xbar85 - t85 * (sd85/sqrt(n_hubble))

cat("The lower confidence bound for the mean distance (Mega Parsecs) of the galaxy from Earth is:", lower85)

The lower confidence bound for the mean distance (Mega Parsecs) of the galaxy from Earth is: 10.28695

In [23]:
upper85 = xbar85 + t85 * (sd85/sqrt(n_hubble))

cat("The upper confidence bound for the mean distance (Mega Parsecs) of the galaxy from Earth is:", upper85)

The upper confidence bound for the mean distance (Mega Parsecs) of the galaxy from Earth is: 13.82222

**B.1 (b) [10 points] Find a built-in `R` function that does this computation automatically and verify that the built in `R` function does the same thing as the confidence interval formula used in part (a).**

In [17]:
ci85 = t.test(hubble$x, conf.level = 0.85)
ci85


	One Sample t-test

data:  hubble$x
t = 10.156, df = 23, p-value = 5.701e-10
alternative hypothesis: true mean is not equal to 0
85 percent confidence interval:
 10.28695 13.82222
sample estimates:
mean of x 
 12.05458 
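
The agreement can also be checked programmatically rather than by eye. Since `hubble.csv` may not be available outside this notebook, the sketch below uses simulated stand-in data (the vector `dist_mpc` is hypothetical and is not the Hubble distances) and compares the explicit formula with the `conf.int` component returned by `t.test()`.

```r
set.seed(7)                                    # arbitrary seed
# Stand-in data, NOT the real hubble distances: 24 positive values
dist_mpc <- rgamma(24, shape = 4, scale = 3)

conf_level <- 0.85
alpha <- 1 - conf_level
n_obs <- length(dist_mpc)

# Explicit formula: xbar +/- t_{1 - alpha/2, n - 1} * s / sqrt(n)
margin <- qt(1 - alpha / 2, df = n_obs - 1) * sd(dist_mpc) / sqrt(n_obs)
manual_ci <- c(mean(dist_mpc) - margin, mean(dist_mpc) + margin)

# Built-in version: pull the interval out of the t.test() result
builtin_ci <- as.numeric(t.test(dist_mpc, conf.level = conf_level)$conf.int)

print(all.equal(manual_ci, builtin_ci))
```

With the real data, replacing `dist_mpc` by `hubble$x` would run the same comparison.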


**B.1(c) [10 points] Interpret the confidence interval.**

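We are 85% confident that the true mean distance from Earth, for the population of galaxies these were sampled from, lies between 10.29 and 13.82 megaparsecs. Equivalently, if we repeated the sampling many times and built an interval this way each time, about 85% of those intervals would contain the true mean.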
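
The repeated-sampling meaning of a confidence level can be illustrated with a small simulation. This is a sketch under assumed conditions: normally distributed data with a made-up true mean and standard deviation, not the Hubble data itself.

```r
set.seed(123)          # arbitrary seed, for reproducibility
true_mean <- 12        # assumed population mean
true_sd <- 8           # assumed population standard deviation
n_obs <- 24            # same sample size as the hubble data
conf_level <- 0.85
reps <- 10000          # number of simulated samples

# Each row is one simulated sample
samples <- matrix(rnorm(reps * n_obs, mean = true_mean, sd = true_sd),
                  nrow = reps)
xbars <- rowMeans(samples)
sds <- apply(samples, 1, sd)

# t-interval margin of error for every simulated sample
t_crit <- qt(1 - (1 - conf_level) / 2, df = n_obs - 1)
margins <- t_crit * sds / sqrt(n_obs)

# Fraction of intervals that contain the true mean
covered <- (xbars - margins <= true_mean) & (true_mean <= xbars + margins)
coverage <- mean(covered)
cat("Fraction of intervals containing the true mean:", coverage, "\n")
```

The printed fraction should be close to 0.85: the confidence level describes how often the procedure captures the true mean, not a probability statement about any single computed interval.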