
In [None]:
install.packages('tidyverse')

In [None]:
library(tidyverse)
wbc <- read_csv("https://raw.githubusercontent.com/lonespear/MA206/main/wisconsin_breast_cancer.csv")
ranger <- read_csv("https://raw.githubusercontent.com/lonespear/MA206/main/ranger_school.csv")
wcgs <- read_csv("https://raw.githubusercontent.com/lonespear/MA206/main/wcgs.csv")

# Lesson 14: Two Means and the Two-Sample T-Test

Last lesson we discussed how to compare two proportions to test for a significant difference.

Today we will extend this to quantitative variables, building on our knowledge of the one-sample t-test.

### 1. Ask a research question
*For example:* Is there a difference in the mean 12-mile ruck time between those who pass Ranger School and those who do not?

### 2. Write your null and alternative hypotheses:
Let $\mu_1$ be the mean ruck time for those who **pass** Ranger School, and $\mu_2$ be the mean ruck time for those who **fail**.

- **Null Hypothesis $H_0$**: $ \mu_1 - \mu_2 = 0 $ (No difference in mean ruck times)
- **Alternative Hypothesis $H_A$**: $ \mu_1 - \mu_2 \neq 0 $ (There is a difference in mean ruck times)

### 3. Explore the data and find your observed statistics:
We will use the difference in our sample means as our observed statistic.

- **Observed Statistic**:
  $\bar{x}_1 - \bar{x}_2$
- **Standardize your Observed Statistic**: $ t = \frac{(\bar{x}_1 - \bar{x}_2) - \mu_{null}}{SD_{null}} $

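In our hypothesis statements we are assuming no difference, so $\mu_{null} = 0$. The formula for the standard deviation of our null distribution is $ \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} $, where $s_1, s_2$ are the sample standard deviations and $n_1, n_2$ are the sample sizes of the two groups.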

Putting that all together our standardized statistic becomes:

$$
 t = \frac{(\bar{x}_1 - \bar{x}_2)}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
$$
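
As a minimal sketch, the standardized statistic can be wrapped in a small R helper. The function name `two_sample_t` and the summary statistics passed to it are illustrative assumptions, not values from the lesson's data.

```r
# Hypothetical helper: standardized statistic for a difference in two means.
# Assumes the null value of the difference defaults to 0.
two_sample_t <- function(xbar_1, xbar_2, s_1, s_2, n_1, n_2, null = 0) {
  sd_null <- sqrt(s_1^2 / n_1 + s_2^2 / n_2)  # SD of the null distribution
  (xbar_1 - xbar_2 - null) / sd_null
}

# Made-up summary statistics, for illustration only
t_stat <- two_sample_t(xbar_1 = 10, xbar_2 = 8, s_1 = 3, s_2 = 4, n_1 = 36, n_2 = 64)
t_stat  # roughly 2.83 standard deviations above the null value
```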

### 4. Interpret your strength of evidence:
We obtain a p-value the same way as in a one-sample t-test. We have converted our observed statistic into a standardized value, which measures how many standard deviations (of the null distribution) the observed statistic lies from the null value.

For a two-sided hypothesis test, like the one in this example:

$$
\begin{align*}
\text{p-value} &= 2 \cdot \mathbb{P}(T > |t|) = 2 \cdot \int_{|t|}^\infty f_T(x) \, dx \\
&= \texttt{2 * (1 - pt(abs(t), df))} \ \ \ \ \text{(R code, with df the degrees of freedom)}
\end{align*}
$$
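
A short sketch of this calculation in R follows. The helper name `two_sided_p` and the input values are assumptions made for illustration. It uses `pt(..., lower.tail = FALSE)` to get the upper-tail area directly, which is equivalent to `1 - pt(...)`.

```r
# Hypothetical helper: two-sided p-value for a t statistic with df degrees of freedom.
two_sided_p <- function(t_stat, df) {
  # pt(..., lower.tail = FALSE) returns the upper-tail area P(T > |t|)
  2 * pt(abs(t_stat), df, lower.tail = FALSE)
}

# Made-up values, for illustration only
p_val <- two_sided_p(t_stat = 2.5, df = 50)
p_val
```

Asking for the upper tail directly avoids the rounding error that `1 - pt(...)` can pick up when the p-value is extremely small.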

### On Confidence Intervals:
For a difference in means we use the same basic confidence interval formula as in the other three tests, now with this test's observed statistic, the difference in sample means. The respective SE formula is identical to $SD_{null}$, just as in the one-sample t-test:

$$
\begin{align*}
\text{Statistic} &\pm \text{Margin of Error} \\
\text{Statistic} &\pm M \cdot SE \\
(\bar{x}_1 - \bar{x}_2) &\pm t_{n-2, 1-\alpha/2} \cdot \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}
\end{align*}
$$

The "M" (multiplier) becomes the critical value of the t-distribution at the chosen significance level $\alpha$. The R code you will use for this is `qt(1-siglevel/2, n-2)`, where $n = n_1 + n_2$ is the total sample size. Notice the degrees of freedom (df) is no longer $n-1$. This is because we have to estimate another parameter in our standard deviation calculation (i.e., we are now estimating two variances, whereas before we were estimating only one).
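
The interval can be sketched as a small R function. The name `diff_means_ci` and the inputs are illustrative assumptions. The degrees of freedom follow this lesson's convention of $n - 2$ with $n = n_1 + n_2$.

```r
# Hypothetical helper: confidence interval for a difference in two means,
# using n - 2 degrees of freedom with n = n_1 + n_2 (this lesson's convention).
diff_means_ci <- function(xbar_1, xbar_2, s_1, s_2, n_1, n_2, siglevel = 0.05) {
  se <- sqrt(s_1^2 / n_1 + s_2^2 / n_2)             # standard error
  multiplier <- qt(1 - siglevel / 2, n_1 + n_2 - 2) # t critical value
  point_est <- xbar_1 - xbar_2                      # observed statistic
  c(lower = point_est - multiplier * se, upper = point_est + multiplier * se)
}

# Made-up summary statistics, for illustration only
ci <- diff_means_ci(xbar_1 = 10, xbar_2 = 8, s_1 = 3, s_2 = 4, n_1 = 36, n_2 = 64)
ci
```

If the interval excludes 0, that agrees with rejecting the null hypothesis of no difference at the same significance level.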

## Example 1: Ranger School Success and 12 Mile Ruck Times

We have a dataset from a Ranger School class that started RAP week together. It aggregates all the physical performance stats each student had from their unit prior to attending. The last column is whether they passed or failed (binary), encoded as a `1` or a `0`.

In [None]:
ranger %>% head

Let's visualize their ruck times in a histogram categorized by pass/fail.

In [None]:
ranger %>% ggplot(aes(x = ruck_min)) + geom_histogram(bins = 30) + facet_grid(Ranger_tab ~ .) + theme_minimal()

Questions about skewness may arise, and they are well placed. If this is the only data we have on hand, any skewness needs to be mentioned in the conclusions we draw from a two-sample test. Remember that, by the Central Limit Theorem, the distribution of sample means becomes approximately normal as the sample size grows, so a two-sample test can still be used with larger samples. How large is large enough remains the question to be answered. In practice, if I am worried about the validity conditions I would use a non-parametric approach; however, this is beyond the scope of the course.
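
A small simulation can illustrate the Central Limit Theorem argument. This is only a sketch: it assumes an exponential population as a stand-in for right-skewed data (it is not the ranger data), and the helper names `skewness` and `sim_means` are made up for illustration.

```r
set.seed(206)  # arbitrary seed so the sketch is reproducible

# Sample skewness (third standardized moment); near 0 for symmetric data
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Simulate many sample means from a right-skewed (exponential) population
sim_means <- function(n, reps = 5000) replicate(reps, mean(rexp(n, rate = 1)))

skew_small <- skewness(sim_means(5))    # sample means with n = 5
skew_large <- skewness(sim_means(100))  # sample means with n = 100
c(n_5 = skew_small, n_100 = skew_large)
```

The sample means based on the larger sample size should show noticeably less skew, which is why the t-test tolerates moderate skewness when the groups are large.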

### Find our observed statistic:

In [None]:
ranger %>%
  group_by(Ranger_tab) %>%
  summarize(mean = mean(ruck_min),
            s = sd(ruck_min),
            n = n())

**VALIDITY CONDITIONS FOR A TWO-SAMPLE T-TEST: AT LEAST 20 OBSERVATIONS IN EACH GROUP, AND THE DATA ARE NOT STRONGLY SKEWED**
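
The sample-size half of this condition is easy to check in code. This is a minimal sketch with a hypothetical helper `enough_obs` and made-up group labels. Skewness still has to be judged from the histograms.

```r
# Hypothetical helper: TRUE when every group has at least min_n observations.
enough_obs <- function(groups, min_n = 20) {
  all(table(groups) >= min_n)
}

# Made-up group labels: 25 in one group, 12 in the other
toy_groups <- c(rep("pass", 25), rep("fail", 12))
table(toy_groups)
enough_obs(toy_groups)  # FALSE, because one group has fewer than 20 observations
```

For the lesson's data the same check would be `enough_obs(ranger$Ranger_tab)`.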

From the above I have everything I need to execute the two-sample t-test:

In [None]:
null = 0      #Enter the value of your Null Hypothesis Parameter
xbar_1 = 170.64    # sample mean of group 1
xbar_2 = 140.69   # sample mean of group 2
s_1 =  34.28     # sample standard deviation of group 1
s_2 = 24.289      # sample standard deviation of group 2
n_1 =  198     # sample size of group 1
n_2 =  91     # sample size of group 2
n = n_1 + n_2   # total sample size
diff = xbar_1 - xbar_2
sd = sqrt(s_1^2/n_1 + s_2^2/n_2)
t = (diff-null)/sd   ; t # standardized statistic

In [None]:
pvalue = 2*(1-pt(abs(t), n-2)) ; pvalue

In [None]:
siglevel = 0.05             #Enter your significance level (alpha)
multiplier = qt(1-siglevel/2, n-2)
se = sqrt(s_1^2/n_1 + s_2^2/n_2) # standard error
CI = c(diff-multiplier*se, diff+multiplier*se) ; CI # confidence interval

**Conclusion** \\
With a p-value of 1.11e-15, we have extremely strong evidence against the null hypothesis that there is no difference in mean ruck times between those who pass and those who fail Ranger School. We are 95% confident that the true difference in mean ruck times between the two groups is between 23.01 and 36.89 minutes.
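
R's built-in `t.test()` can serve as a cross-check on a hand calculation like this one. The sketch below uses simulated ruck times (made-up data, not the ranger dataset) so that it runs on its own. By default `t.test()` applies Welch's approximation for the degrees of freedom (`var.equal = FALSE`) rather than $n - 2$, so its p-value and confidence interval can differ slightly from the hand calculation even though the standardized statistic is the same.

```r
set.seed(14)  # arbitrary seed so the sketch is reproducible

# Simulated (made-up) ruck times in minutes; not the lesson's data
sim <- data.frame(
  group = rep(c("fail", "pass"), times = c(60, 40)),
  ruck_min = c(rnorm(60, mean = 170, sd = 30), rnorm(40, mean = 145, sd = 25))
)

# Welch two-sample t-test (R's default: var.equal = FALSE)
fit <- t.test(ruck_min ~ group, data = sim)

# Hand calculation of the same standardized statistic
m <- tapply(sim$ruck_min, sim$group, mean)
s <- tapply(sim$ruck_min, sim$group, sd)
n <- tapply(sim$ruck_min, sim$group, length)
se_hand <- sqrt(s[["fail"]]^2 / n[["fail"]] + s[["pass"]]^2 / n[["pass"]])
t_hand <- (m[["fail"]] - m[["pass"]]) / se_hand

c(t_test = unname(fit$statistic), by_hand = t_hand)
fit$parameter  # Welch degrees of freedom, at most n_1 + n_2 - 2
```

For the lesson's data the equivalent call would be `t.test(ruck_min ~ Ranger_tab, data = ranger)`.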

### Example 1.1: You try
With a partner, test if there is a significant difference in the average height of a ranger student that passes and fails ranger school.

Write your hypothesis statements:

$H_0$:

$H_A$:

1. Visualize the data you are going to test

2. Tabulate the data to calculate your summary statistics

3. Find your t-statistic

4. Find your p-value

5. Create a confidence interval

**State your conclusions:**

## Example 2: Coronary Heart Disease and Cholesterol

Heart disease is the leading cause of death in the country. A contributing factor is high cholesterol. Let us investigate the following dataset `wcgs` and test whether the average cholesterol level differs significantly between patients with and without CHD.

In [None]:
wcgs %>% ggplot(aes(x=chol)) + geom_histogram(bins=30) + facet_grid(chd ~ .)

## Example 2.1 Cigarettes and CHD

Test whether the average number of cigarettes smoked per ______(figure out the units of `cigs` by typing `help(wcgs)`) is significantly different between people with CHD and people without CHD.

In [None]:
wcgs %>% ggplot(aes(x=cigs)) + geom_histogram(bins=30) + facet_grid(chd ~ .)

## Challenge: Wisconsin Breast Cancer Dataset

The `wbc` dataset is a well-known breast cancer dataset that describes many benign and malignant breast tumors.

In [None]:
wbc %>% head

Use the visualization below to identify a variable that appears to differ between malignant and benign tumors.

In [None]:
options(repr.plot.width = 16, repr.plot.height = 16)
# Pivot the numeric variables into long format
wbc_long <- wbc %>%
  pivot_longer(
    cols = where(is.numeric),
    names_to = "variable",
    values_to = "value"
  )

# Create the boxplot faceted by variable, with diagnosis on the x-axis
ggplot(wbc_long, aes(x = diagnosis, y = value, fill = diagnosis)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(title = "Boxplots of Numeric Variables by Diagnosis",
       x = "Diagnosis",
       y = "Value") +
  theme_bw() +
  theme(legend.position = "none")

Formally test whether your 'hunch' is significant.