In [1]:
import numpy as np
import pandas as pd
from scipy import stats
from IPython.display import Markdown

# Q5.1
For each part, state the null ($H_0$) and alternative ($H_A$) hypotheses.

(a) Has the average community level of suspended particulates for the month
of August exceeded 30 $\mu g/cm^3$?

(b) Does the mean age of onset of a certain acute disease for schoolchildren
differ from 11.5 years?

(c) A psychologist claims that the average IQ of a sample of 60 children is
significantly above the normal IQ of 100.

(d) Is the average cross‐sectional area of the lumen of coronary arteries for
men, ages 40–59 years, less than 31.5% of the total arterial cross‐section?

(e) Is the mean hemoglobin level of high‐altitude workers different from 16 $g/cm^3$?

(f) Does the average speed of 50 cars as checked by radar on a particular
highway differ from 55 mph?

## A5.1
(a) $H_0: \mu = 30$, $H_A: \mu \gt 30$  
(b) $H_0: \mu = 11.5$, $H_A: \mu \ne 11.5$  
(c) $H_0: \mu = 100$, $H_A: \mu \gt 100$, where $\mu$ is the mean IQ of the population from which the 60 children were sampled  
(d) $H_0: \mu = 0.315$, $H_A: \mu \lt 0.315$  
(e) $H_0: \mu = 16$, $H_A: \mu \ne 16$  
(f) $H_0: \mu = 55$, $H_A: \mu \ne 55$, where $\mu$ is the mean speed of all cars on the highway

# Q5.3
E. canis infection is a tick‐borne disease of dogs that is sometimes contracted
by humans. Among infected humans, the distribution of white blood cell counts
has an unknown mean μ and a standard deviation σ. In the general population
the mean white blood count is 7250 per $mm^3$. It is believed that persons infected
with E. canis must on average have a lower white blood cell count. What is the
null hypothesis for the test? Is this a one‐ or two‐sided alternative?

## A5.3
$H_0: \mu = 7250$, $H_A: \mu \lt 7250$

# Q5.4
It is feared that the smoking rate in young females has increased in the last
several years. In 1985, 38% of the females in the 17‐ to 24‐year age group were
smokers. An experiment is to be conducted to gain evidence to support or refute
the increase contention. Set up the appropriate null and alternative hypotheses.
Explain in a practical sense what, if anything, has occurred if a type I or type II
error has been committed.

## A5.4
$H_0: \pi = 0.38$, $H_A: \pi \gt 0.38$  
Type I Error (FP): The test suggests that smoking rates have increased, when in fact they have not.  
Type II Error (FN): The test suggests that smoking rates have not increased, when in fact they have.  

# Q5.5
A group of investigators wishes to explore the relationship between the use of
hair dyes and the development of breast cancer in females. A group of 1000
beauticians 40–49 years of age is identified and followed for five years. After
five years, 20 new cases of breast cancer have occurred. Assume that breast
cancer incidence over this time period for average American women in this age
group is 7 per 1000. We wish to test the hypothesis that using hair dyes increases
the risk of breast cancer. Is a one‐ or two‐sided test appropriate here? Compute
the p value for your choice.

## A5.5
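A one-sided test is appropriate because the hypothesis is specifically that hair dyes *increase* the risk:  
$H_0: \pi = 0.007$, $H_A: \pi \gt 0.007$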

In [2]:
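# One-sample z-test for a proportion (normal approximation), H_A: pi > 0.007
pi = 0.007                          # null incidence: 7 per 1000
se = np.sqrt(pi * (1 - pi) / 1000)  # standard error under H_0, n = 1000
z = (0.02 - pi) / se                # observed proportion: 20 / 1000
pval = stats.norm.sf(z)             # upper-tail probability
print(f"One-sided p-value = {pval :.4} (z-score = {z :.3})")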

One-sided p-value = 4.094e-07 (z-score = 4.93)
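
With only 7 cases expected under $H_0$, the normal approximation is fairly rough, so an exact binomial tail probability makes a useful cross-check. The sketch below assumes the 1000 beauticians can be treated as independent Bernoulli trials with the null incidence 0.007; the variable names are illustrative and not part of the original solution.

```python
from scipy import stats

n = 1000       # beauticians followed for five years
cases = 20     # observed new cases of breast cancer
pi_0 = 0.007   # null incidence (7 per 1000)

# Exact one-sided p-value: Pr(X >= 20 | n, pi_0).
# binom.sf(k, ...) returns Pr(X > k), so pass cases - 1.
pval_exact = stats.binom.sf(cases - 1, n, pi_0)
print(f"Exact one-sided p-value = {pval_exact:.3g}")
```

Because the binomial distribution is right-skewed here, the exact tail is typically larger than the normal-approximation value, but both are far below 0.05 and lead to rejecting $H_0$.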


# Q5.6
Height and weight are often used in epidemiological studies as possible predictors
of disease outcomes. If the people in the study are assessed in a clinic,
heights and weights are usually measured directly. However, if the people are
interviewed at home or by mail, a person’s self‐reported height and weight are
often used instead. Suppose that we conduct a study on 10 people to test the
comparability of these two methods. Data from these 10 people were obtained
using both methods on each person. What is the criterion for the comparison?
What is the null hypothesis? Should a two‐ or a one‐sided test be used here?

## A5.6
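Criterion: the mean of the paired differences between measured and self-reported values (two-sided test, since the two methods could differ in either direction)  
$H_0: \mu_1 - \mu_2 = 0$, $H_A: \mu_1 - \mu_2 \ne 0$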
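
One way to carry out such a comparison is a paired t-test on the 10 within-person differences. The exercise gives no data, so the weights below are HYPOTHETICAL values made up purely for illustration; `stats.ttest_rel` performs a two-sided paired test by default.

```python
import numpy as np
from scipy import stats

# HYPOTHETICAL weights (kg) for 10 people; not taken from the exercise.
measured = np.array([70.2, 82.5, 65.0, 90.1, 58.3, 77.4, 68.9, 85.0, 73.6, 61.2])
reported = np.array([69.0, 81.0, 65.5, 88.0, 58.0, 76.0, 68.0, 83.5, 73.0, 61.0])

diff = measured - reported

# Paired t-test of H0: mean difference = 0 (two-sided by default)
res = stats.ttest_rel(measured, reported)
print(f"mean diff = {diff.mean():.2f}, t = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```

The test operates on the differences only, which is why the criterion is the mean difference rather than the two separate means.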

# Q5.8
A food‐frequency questionnaire was mailed to 20 subjects to assess the intake of
various food groups. The sample standard deviation of vitamin C intake over the
20 subjects was 15 (exclusive of vitamin C supplements). Suppose that we know
from using an in‐person diet interview method in an earlier large study that the
standard deviation is 20. Formulate the null and alternative hypotheses if we want
to test for any differences between the standard deviations of the two methods.

## A5.8
$H_0: \sigma = 20$, $H_A: \sigma \ne 20$
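
Although the exercise only asks for the hypotheses, they could be tested with a chi-square test for a single variance. This is a sketch under two assumptions not stated in the exercise: vitamin C intake is approximately normally distributed, and the value 20 from the earlier large study is treated as a known constant.

```python
from scipy import stats

n = 20          # subjects in the mailed questionnaire
s = 15.0        # sample standard deviation
sigma_0 = 20.0  # standard deviation from the earlier in-person study

# Under H0 and normality, (n - 1) * s^2 / sigma_0^2 follows a
# chi-square distribution with n - 1 degrees of freedom.
chi2_stat = (n - 1) * s**2 / sigma_0**2
dist = stats.chi2(df=n - 1)

# Two-sided p-value: double the smaller tail, capped at 1.
pval = min(1.0, 2 * min(dist.cdf(chi2_stat), dist.sf(chi2_stat)))
print(f"chi-square = {chi2_stat:.4f}, two-sided p-value = {pval:.4f}")
```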

# Q5.9
In Example 5.1 it was assumed that the national smoking rate among men
is 25%. A study is to be conducted for New England states using a sample size
n = 100 and the decision rule:

If $p \le 0.20$, $H_{0}$ is rejected,

where $H_{0}$: $\pi = 0.25$ and $\pi$ and $p$ are population and sample proportions,
respectively, for New England states. Is this a one‐ or a two‐tailed test?

## A5.9
One-tailed (left-tailed): $H_0$ is rejected only for small values of $p$, so $H_A: \pi \lt 0.25$

# Q5.10
In Example 5.1, with the rule:

If $p \le 0.20$, $H_{0}$ is rejected,

it was found that the probabilities of type I and type II errors are:

$\alpha = 0.123$  
$\beta = 0.082$

for $H_{A}$: $\pi = 0.15$. Find α and β if the rule is changed to:

If $p \le 0.18$, $H_{0}$ is rejected.

How does this change affect α and β values?

## A5.10
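$\alpha = Pr(p \le 0.18 \mid \pi = 0.25)$  
$\beta = Pr(p \gt 0.18 \mid \pi = 0.15)$  
Lowering the cutoff from 0.20 to 0.18 shrinks the rejection region, so $\alpha$ decreases (from 0.123 to 0.053) while $\beta$ increases (from 0.082 to 0.200), as the computation below shows.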

In [3]:
pi = 0.25
se = np.sqrt(pi * (1 - pi) / 100)
alpha = stats.norm(pi, se).cdf(0.18)

pi = 0.15
se = np.sqrt(pi * (1 - pi) / 100)
beta = stats.norm(pi, se).sf(0.18)

Markdown(f"$\\alpha = {alpha :.3}$  \n$\\beta = {beta :.3}$")

$\alpha = 0.053$  
$\beta = 0.2$
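
The trade-off between $\alpha$ and $\beta$ is easier to see when both are computed for several cutoffs at once. The helper `error_rates` below is a hypothetical name introduced for illustration; it wraps the same normal-approximation calculation used above, and the extra cutoff 0.16 is included only to extend the pattern.

```python
import numpy as np
from scipy import stats

def error_rates(cutoff, pi_0=0.25, pi_a=0.15, n=100):
    """Return (alpha, beta) for the rule 'reject H0 if p <= cutoff'."""
    se_0 = np.sqrt(pi_0 * (1 - pi_0) / n)       # standard error under H0
    se_a = np.sqrt(pi_a * (1 - pi_a) / n)       # standard error under HA
    alpha = stats.norm(pi_0, se_0).cdf(cutoff)  # Pr(reject | H0 true)
    beta = stats.norm(pi_a, se_a).sf(cutoff)    # Pr(not reject | HA true)
    return alpha, beta

for cutoff in (0.20, 0.18, 0.16):
    a, b = error_rates(cutoff)
    print(f"cutoff = {cutoff:.2f}: alpha = {a:.3f}, beta = {b:.3f}")
```

The values for the 0.20 cutoff may differ slightly in the third decimal from the textbook's 0.123 and 0.082, which were presumably read from a rounded z table.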

# Q5.12
Recalculate the p value in Example 5.2 if it was found that 18 (instead of 15)
men in a sample of n = 100 are smokers.

## A5.12

In [4]:
n = 100
pi = 0.25
diff = (25 - 18) / n
se = np.sqrt(pi * (1 - pi) / n)
N = stats.norm(pi, se)

pval = N.cdf(pi - diff) + N.sf(pi + diff)
print(f"Two-sided p-value = {pval :.4}")

Two-sided p-value = 0.106


# Q5.13
Calculate the 95% confidence interval for π using the sample in Exercise 5.12
and compare the findings to the testing results of Exercise 5.12.

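## A5.13
The 95% confidence interval computed below, [0.105, 0.255], contains the null value $\pi = 0.25$. This agrees with Exercise 5.12, where the two-sided test does not reject $H_0$ at the 5% level (p-value 0.106).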

In [5]:
n = 100
p = 0.18
se = np.sqrt(p * (1 - p) / n)
N = stats.norm(p, se)

p_left, p_right = N.interval(0.95)
print(f"95% CI = [{p_left :.3}, {p_right :.3}]")

95% CI = [0.105, 0.255]
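
The interval and the test do not use the same standard error: the interval estimates it from the sample proportion $p = 0.18$, whereas the test in Exercise 5.12 computes it under $H_0$ with $\pi = 0.25$. The two approaches therefore need not always agree near the boundary. A minimal sketch that checks both conclusions side by side for this sample (variable names are illustrative):

```python
import numpy as np
from scipy import stats

n, p, pi_0 = 100, 0.18, 0.25

# Confidence interval: standard error estimated from the sample proportion
se_p = np.sqrt(p * (1 - p) / n)
lo, hi = stats.norm(p, se_p).interval(0.95)

# z-test: standard error computed under H0
se_0 = np.sqrt(pi_0 * (1 - pi_0) / n)
pval = 2 * stats.norm.sf(abs(p - pi_0) / se_0)

print(f"pi_0 inside 95% CI: {lo <= pi_0 <= hi}")
print(f"H0 rejected at the 5% level: {pval < 0.05}")
```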


# Q5.14
Plasma glucose levels are used to determine the presence of diabetes. Suppose
that the mean log plasma glucose concentration (mg/dL) in 35‐ to 44‐year olds
is 4.86 with standard deviation 0.54. A study of 100 sedentary persons in
this age group is planned to test whether they have higher levels of plasma
glucose than the general population.

(a) Set up the null and alternative hypotheses.

(b) If the real increase is 0.1 log unit, what is the power of such a study if a
two‐sided test is to be used with α = 0.05?

## A5.14

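(a) $H_0: \mu = 4.86$, $H_A: \mu \gt 4.86$ (one-sided, since the question is whether sedentary persons have *higher* levels; part (b) nevertheless asks for the power of a two-sided test)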

In [6]:
alpha = 0.05
n = 100
mu_0 = 4.86
se = 0.54 / np.sqrt(n)
N_0 = stats.norm(mu_0, se)
ci_left = N_0.ppf(alpha / 2)
ci_right = N_0.ppf(1 - alpha / 2)

# beta = Pr(a false H0 is not rejected)
#      = Pr(sample mean falls inside the 95% acceptance region of H0 | true mean is 4.96)
N_A = stats.norm(mu_0 + 0.1, se)
beta = N_A.cdf(ci_right) - N_A.cdf(ci_left)
power = 1 - beta

print(f"(b) Power = {power :.4}")

(b) Power = 0.457
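
The same power calculation can be written on the standardized z scale, which makes it easy to compare a two-sided test with a one-sided one. The function `power_z` is a hypothetical helper, not part of the original solution, and the one-sided branch assumes an upper-tailed alternative with a positive true shift `delta`.

```python
import numpy as np
from scipy import stats

def power_z(delta, sigma, n, alpha=0.05, two_sided=True):
    """Power of a one-sample z-test when the true mean is mu_0 + delta."""
    shift = delta / (sigma / np.sqrt(n))  # true shift in standard-error units
    if two_sided:
        z_crit = stats.norm.ppf(1 - alpha / 2)
        # Reject if Z > z_crit or Z < -z_crit, where Z ~ N(shift, 1)
        return stats.norm.sf(z_crit - shift) + stats.norm.cdf(-z_crit - shift)
    # Upper-tailed test: reject if Z > z_crit
    z_crit = stats.norm.ppf(1 - alpha)
    return stats.norm.sf(z_crit - shift)

print(f"Two-sided power = {power_z(0.1, 0.54, 100):.4}")
print(f"One-sided power = {power_z(0.1, 0.54, 100, two_sided=False):.4}")
```

For a shift in the hypothesized direction, the one-sided test has higher power at the same $\alpha$, because its entire rejection region lies in the upper tail.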


# Q5.15
Suppose that we are interested in investigating the effect of race on level of
blood pressure. The mean and standard deviation of systolic blood pressure
among 25‐ to 34‐year‐old white males were reported as 128.6 and 11.1 mmHg,
respectively, based on a very large sample. Suppose that the actual mean for
black males in the same age group is 135 mmHg. What is the power of the test
(two‐sided, α = 0.05) if n =100 and we assume that the variances are the same
for whites and blacks?

## A5.15

In [7]:
alpha = 0.05
n = 100
mu_0 = 128.6
se = 11.1 / np.sqrt(n)
N_0 = stats.norm(mu_0, se)
ci_left = N_0.ppf(alpha / 2)
ci_right = N_0.ppf(1 - alpha / 2)

N_A = stats.norm(135, se)
beta = N_A.cdf(ci_right) - N_A.cdf(ci_left)
power = 1 - beta

print(f"Power = {power :.4}")

Power = 0.9999
