# Homework #7: Computational Questions

All computations should be done in this notebook using the R kernel. This is your first opportunity to get familiar with R outside of class, so please take your time on the problems that require it. Working in small groups is allowed, but it is important that you make an effort to master the material and hand in your own work. Follow all instructions very closely or points will be deducted.

#### You will be required to submit this notebook, fully compiled with your solutions, as an Jupyter Notebook and as an HTML file to Canvas by 5pm on Wednesday, March 20.

#### Read and sign the Honor Code Pledge below:

**Honor Code Pledge: _On my honor, as a University of Colorado Boulder student, I have neither given nor received unauthorized assistance on this work._**

### _Juan Vargas-Murillo_

## Problem 1

This problem uses data on velocities and distances of 24 galaxies containing Cepheid stars from the Hubble space telescope. You will find a data frame with 3 columns and 24 rows. The columns are (from left to right):

"Galaxy A" - label identifying the galaxies<br/>
"y" - the galaxy's relative velocity in km/sec<br/>
"x" - the galaxy's distance from earth in Megaparsecs (1 parsec is approximately 30 trillion km)<br/>

#### (a) Load the hubble telescope data found in the file ${\tt hubble-SP19-TAB.txt}$ or ${\tt hubble-SP19-R.csv}$ into R. 

In [1]:
hubba_bubba <- read.csv('hubble-SP19-R.csv')
head(hubba_bubba)

Galaxy,y,x
NGC0300,133,2.0
NGC0925,664,9.16
NGC1326A,1794,16.14
NGC1365,1594,17.95
NGC1425,1473,21.88
NGC2403,278,3.22


#### (b) Note that since the size of the data set is $n=24<40$, and since the distribution of the average distance of a galaxy from earth is assumed to be normally distributed (by the CLT, if we assume independence) and since we don't know the variance, we will use a <ins>t-Distribution Confidence Interval</ins>.

##### Now suppose we wish to determine a 85% t-Confidence Interval for the mean of a galaxy's distance from the Earth (in Megaparsecs), what upper <ins>t critical value</ins> would we use? That is, find the value of $t_{\alpha/2,n-1}$ for a 85% t-Confidence Interval for the mean of a galaxy's distance from the Earth.

In [5]:
# alpha = 1 - CI = 1 - .85 = 0.15; t sub alpha / 2 = 0.075 and in R alpha = .925; 
# and degrees of freedom is n - 1 = 24 - 1 = 23
critical.value <- qt(0.925, 23)
critical.value

#### (c) Now calculate the 85% t-confidence interval for the mean of a galaxy's distance from Earth by doing the computation for the confidence interval formula <ins>explicitly</ins> .

In [50]:
sample.mean <- mean(hubba_bubba$x)
sample.variance <- sum((hubba_bubba$x - sample.mean)**2) / 23 # s = sum(x_i-x_bar)**2 / (n - 1 = 24 - 1 = 23)
# sample.mean
confidence.interval <- c(sample.mean - (critical.value * (sample.variance / sqrt(23))), sample.mean + (critical.value * (sample.variance / sqrt(23))))

confidence.interval


#### (d) Use the built in R function [${\tt t.test()}$](https://www.rdocumentation.org/packages/stats/versions/3.5.2/topics/t.test) to find a 85% t-Confidence Interval for the mean of a galaxy's distance from Earth.

In [32]:
# ?t.test
t.test(hubba_bubba$x, conf.level = .85) #$conf.int[1]


	One Sample t-test

data:  hubba_bubba$x
t = 10.156, df = 23, p-value = 5.701e-10
alternative hypothesis: true mean is not equal to 0
85 percent confidence interval:
 10.28695 13.82222
sample estimates:
mean of x 
 12.05458 


In [33]:
?t.test

#### (e) Interpret the confidence interval.

**STUDENT ANSWER:** The confidence interval is telling us that in repeated sampling of the distances of galaxy's from Earth, 85% of the confidence intervals obtained from all samples have as there limits 10.28695 and 13.82222 (megaparsec).

## Problem 2 - Interpreting Confidence Intervals

#### (a) Draw 100 random samples (of size 24) from a <ins>normal distribution</ins> with mean $\mu=9$ and standard deviation $\sigma=1.43$. Put this data in a matrix with <ins>24 rows</ins> and <ins>100 columns</ins> (so each column represents one sample of 24 data points, use the [${\tt matrix()}$](https://www.rdocumentation.org/packages/base/versions/3.5.2/topics/matrix) command to create the matrix.) Use the ${\tt dim()}$ command on your matrix to show it has the right dimensions.

In [29]:
set.seed(20430)
rows   = 24
cols   = 100

# sample from the normal distribution
dat = rnorm(n*k, mean = 9, sd = 1.43)
x   = matrix(data=dat, nrow=rows, ncol=cols)
dim(x)
head(x)

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20
10.028335,9.546444,10.515212,9.719587,8.141072,9.955483,6.337439,13.178706,9.750329,7.942663,⋯,9.217956,7.125195,8.928547,10.222949,11.16543,7.737488,8.659221,9.815217,10.515025,9.066363
7.452748,6.794591,6.284152,11.588242,7.708659,7.984314,9.610407,9.345248,10.363567,9.680446,⋯,9.523282,8.915744,7.637325,11.98416,8.687178,9.340377,7.470117,6.688072,8.185364,7.914446
10.099379,8.663095,9.696549,9.52582,9.121004,11.179189,8.491143,7.945668,8.572236,7.821948,⋯,6.397463,9.047139,8.012543,9.046531,7.749223,7.854213,9.953238,9.582053,11.729488,10.746724
7.822116,9.207262,10.791697,8.788026,9.510115,7.457688,8.349871,9.498505,12.59329,9.928362,⋯,7.684475,9.213252,7.956641,8.298329,6.337637,5.880875,10.029232,8.943203,10.147852,8.876776
7.55986,7.211397,9.946219,8.571802,10.977203,5.384408,7.821065,11.044525,10.16147,10.060353,⋯,8.32099,7.665498,7.2862,6.587294,9.764433,12.940667,9.867729,8.404407,9.529858,6.688796
8.253048,8.187509,9.166707,9.400143,6.897901,9.807155,7.871724,9.752349,7.736833,8.677976,⋯,6.162928,8.226868,8.943327,10.220301,8.213657,11.644709,7.467402,10.25907,10.102217,10.413646


#### (b) Now construct a 95% t-Confidence Interval for each of the 100 samples. Put the intervals in a matrix called ${\tt intervals}$ with 100 rows and 2 columns. Use the ${\tt head(intervals)}$ and ${\tt dim(intervals)}$ to preview the matrix and show its dimensions. (*Hint:* Note that the command ${\tt t.test(x)\$ conf.int}$ extracts just the confidence interval from the t.test for any dataset $x$.)

In [59]:
# ?apply
get.conf.int = function(x) t.test(x, conf.level = .95)$conf.int
intervals = matrix(apply(x, 2, get.conf.int), nrow = 100, ncol = 2)
head(intervals)
dim(intervals)

0,1
8.463397,8.371739
9.298606,9.400025
7.936146,7.89049
9.127108,9.30878
8.276937,8.882918
9.617759,9.883715


#### (c) Now count the number of confidence intervals that contain $\mu=9$. (Hint: use logical indexing.)

In [60]:
sum(intervals[,1] <= 9 & intervals[,2] >= 9)

#### (d) Interpret what a "95% confidence interval" for the population mean $\mu$ actually "means" (no pun intended).


**STUDENT ANSWER:** The confidence interval is telling us that in repeated sampling, 95% of the confidence intervals obtained from all samples will actually capture the true parameter being estimated, in this case, $\mu$.