-
Notifications
You must be signed in to change notification settings - Fork 0
/
Rinference_mean.Rmd
229 lines (165 loc) · 7.24 KB
/
Rinference_mean.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
---
title: "Inference for means"
author: "Marianne Huebner, Michigan State University"
date: '`r Sys.Date()`'
output: word_document
---
## Confidence Interval for $\mu$
The formula for a confidence interval for the sample mean is
$$
\bar{x} \pm t^*_{n-1} s/\sqrt{n}
$$
where $t^*$ is the critical value from the t-distribution and depends on the confidence level. For example, to calculate a 95% confidence interval for 19 observations (= 18 degrees of freedom), the critical value is
```{r}
qt(.975,18)
```
Note that to get a t interval and t test the same function is used.
Type
````r
?t.test
````
to check what
options are available
#### Example: Detergents
The production of a nationally marketed detergent results in certain workers receiving
prolonged exposure to the bacillus subtilis enzyme. Nineteen workers were tested to
determine the effects of these exposures on various respiratory functions.
The airflow rate, FEV1, is the ratio of a person’s forced expiratory volume to the
vital capacity, VC (max. volume of air a person can exhale after taking a deep breath).
If the enzyme has an effect, it will be to reduce the FEV1/VC ratio. The norm is 0.80 in persons with no lung dysfunction.
```{r}
ratio<-c(0.61, 0.70, 0.63, 0.76, 0.67, 0.72, 0.64, 0.82, 0.88,
0.82, 0.78, 0.84, 0.83, 0.82, 0.74, 0.85, 0.73, 0.85, 0.87)
```
There were measurements from `r length(ratio)` workers.
Assumptions for the t-interval are
* data are from a simple random sample
* independent
* identically distributed
* approximately normal distributed
The workers are likely unrelated and thus the independence assumption is satisfied.
A Q-Q plot indicates the normal distribution of the data.
```{r fig.width=4, fig.height=4}
qqnorm(ratio)
qqline(ratio)
```
A 90% confidence interval (t-interval) can be obtained with
```{r}
temp<-t.test(ratio, mu=0.80, conf.level=0.90)
temp
```
Specifically, the confidence interval is
```{r}
temp$conf.int
```
The sample mean is
```{r}
temp$estimate
sd(ratio)
```
#### Answer:
__A 90% confidence interval for the FEV1/VC ratio is (`r temp$conf.int[1]`, `r temp$conf.int[2]`). This includes
the true (normal) value of `r temp$null.value`.__
A 95% confidence interval would be
```{r}
t.test(ratio, mu=0.80, conf.level=0.95)$conf.int
```
## Hypothesis test for $\mu$
For a hypothesis test on the population mean the null hypothesis is
$$
H_0: \mu = \mu_0
$$
with test statistic
$$
t=\frac{\bar{x}-\mu_0}{s/\sqrt{n}}
$$
The test statistic $t$ has t-distribution with $n-1$ degrees of freedom.
#### Detergent example continued
If the enzyme has an effect, it will be to reduce the FEV1/VC ratio. The norm is 0.80 in persons with no lung dysfunction. Hence the research question is whether the ratio is lower than normal. This is a one sided
hypothesis test, namely the alternative is $H_A: \mu < 0.80.$
```{r}
temp.1sided<-t.test(ratio, mu=0.80, alternative="less")
````
#### Answer:
__The pvalue is `r temp.1sided$p.value`. The data provide evidence that exposure to B. subtilis may reduce the FEV1/VC ratio, but are inconclusive at the 5% significance level.__
## Coverage of confidence intervals
An aspirin manufacturer fills bottles by weight rather than by count. The weight per tablet should be 5 grains. Simulate a sample of 100 tablets taken from a very large lot.
Suppose 20 tablets are sampled from a lot with mean weight 5 grains and standard deviation 0.3 grains. A 95% confidence interval is calculated. Does this confidence interval contain the true mean?
Take `sims =100` such samples of 20 tablets:
```{r}
sims = 100
weights = replicate(sims,rnorm(20,mean=5,sd=0.3))
```
Calculate 95% confidence intervals for each sample:
```{r}
tint<-matrix(NA, nrow=dim(weights)[2], ncol=2)
for(i in 1:dim(weights)[2]){
temp<-t.test(weights[,i], conf.level = 0.95)
tint[i,]<-temp$conf.int
}
colnames(tint)<-c("lcl","ucl")
```
Check the first 10 confidence intervals from these samples:
```{r}
tint[1:10,]
```
How many of these 100 confidence intervals contain the true mean?
```{r}
tint = data.frame(tint)
indx = (tint$lcl <=5) & (tint$ucl>=5)
sum(indx)
```
In this simulation `r sum(indx)` of the 100 confidence intervals (95% confidence level) contain the true mean. How many would you expect?
## Inference for comparing means $\mu_1-\mu_2$
The confidence interval for comparing two means from independent samples is given by
$$
\bar{x}_1 - \bar{x}_2 \pm t^* \sqrt{ \frac {s_1^2}{n_1} + \frac {s_2^2}{n_2} }
$$
The confidence interval for comparing two means from dependent samples is given by
$$
\bar{x}_d \pm t^*\, s_d/\sqrt{n}
$$
where $t$ is from a t-distribution with $n-1$ degrees of freedom. The $n$ differences are treated as one sample, and $\bar{x}$ is denoted as $\bar{x}_d$, $s$ is denoted as $s_d$, and $\mu$ is denoted as $\mu_d$. The index $d$ stands for "difference.""
### Example: Change of internet user per 100 people from 2001 to 2011
Internet users per 100 people downloaded from GAPMINDER (http://www.gapminder.org/data/)
subset of 15 countries.
There are 15 countries in 2001 and 2011.
```{r }
countries<-c("Argentina" ,"Australia","Belgium","Canada", "Egypt","France", "Germany",
"Greece", "India" , "Japan" , "Mexico" ,"Niger","Singapore", "United.Kingdom", "United.States")
Y2011<-c(17.72058337, 63.02693628, 55.47689608, 71.59660113, 11.69839821, 41.39108117,
68.76941828, 24.17107186, 2.388075, 66.19826223, 17.21, 0.221341351, 61.00284566,
69.9749171, 68.26789985)
Y2001<-c(9.780807285, 52.60563888, 31.05933918, 59.97791664, 0.838945611, 25.48284606,
31.66413443, 11.0173514, 0.660146377, 38.15162324, 7.038023117, 0.105185431,
40.08896306, 33.47495977, 49.18000685)
```
__Is there a sigificant change in internet users from 2001 to 2011 in these countries?__
If we consider the internet users per country as dependent samples the corresponding hypotheses are $H_0 : \mu_d = 0$ vs $H_A : \mu_d \ne 0$.
```{r }
t.test(Y2001, Y2011, paired=TRUE)
```
We could argue that a lot changes in 10 years and internet uses in 2001 are a different breed from internet users in 2011 and the number of internet users has increased, then we would be testing independent samples of no change versus the alternative hypothesis
that in 2001 were fewer internet users with hypotheses $H_0 : \mu_{2001} = \mu_{2011}$ vs $H_A : \mu_{2001} < \mu_{2011}$.
```{r }
t.test(Y2001, Y2011, alternative="less")
```
To check the validity of this statistical methods for the data in question, we need to also check whether the data are approximately normally distributed.
```{r setup, include=FALSE}
#knitr::opts_chunk$set(dev = 'pdf')
```
```{r, echo=FALSE, fig.width=5, fig.height=3 }
par(mfrow=c(1,2))
qqnorm(Y2001, main="2001", xlab="", ylab="", cex=0.8); qqline(Y2001)
qqnorm(Y2011, main="2011", xlab="", ylab="", cex=0.8); qqline(Y2011)
par(mfrow=c(1,1))
```
There is a question whether the 2011 internet usage data follow a symmetric distribution
for the selected countries.
Optional: Look at the data. Which countries changed the most?
```{r, fig.width=5, fig.height=5 }
plot(Y2001, Y2011, xlab="Internet users per 100 in 2001", ylab="Internet users per 100 in 2011", xlim=c(-1, 70),
ylim=c(0,80),col= "blue", pch = 19, cex = 1, lty = "solid", lwd = 2)
text(Y2001, Y2011, labels=countries, cex= 0.7, pos=3)
abline(a=0,b=1)
```