In [None]:
library(tidyverse)
library(janitor)
gallen <- read_csv("https://raw.githubusercontent.com/lonespear/MA206/main/gallen.csv")
health <- read_csv("https://raw.githubusercontent.com/lonespear/MA206/main/nhanes.csv")

In [None]:
gallen

# Is there a statistically significant difference in the proportion of Zac Gallen's fastballs thrown to the inside half of the strike zone when facing left-handed batters compared to right-handed batters?

### Parameter of interest.

We start with the two separate proportions we are interested in: $\pi_1$ is the proportion of four-seam fastballs thrown to lefties that are inside, and $\pi_2$ is the proportion of four-seam fastballs thrown to righties that are inside.

Our research question is about the difference between these two proportions. If Zac threw inside at the same rate to both left- and right-handed batters, the difference in these proportions would be zero (our null hypothesis). Our alternative hypothesis is that the difference is not zero.
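
Written in symbols, with $\pi_1$ and $\pi_2$ as defined above, the hypotheses for this two-sided test are:

$$H_0: \pi_1 - \pi_2 = 0 \qquad \text{versus} \qquad H_A: \pi_1 - \pi_2 \neq 0$$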

1.  $\pi=\pi_1 - \pi_2$.

2.  Finding observed statistics:

In [None]:
gallen %>% filter(pitch_type == 'FF') %>% tabyl(in_out, stand) %>% adorn_totals

In [None]:
gallen %>% filter(pitch_type == 'FF') %>% select(stand, in_out) %>% table() %>%
  plot(color=TRUE, main='Mosaic Plot of Pitching Inside/Outside vs. Batter Stance')

3.  Finding the z-statistic:

In [None]:
null = 0           # Enter the value of your Null Hypothesis Parameter
successes_1 = 1259    # number of successes in group 1
successes_2 = 1779    # number of successes in group 2
n_1 = 3055   # sample size of group 1     
n_2 = 3863   # sample size of group 2
n = n_1 + n_2     # total sample size
phat_1 = successes_1/n_1
phat_2 = successes_2/n_2
phat_t = (successes_1 + successes_2)/(n)
diff = phat_1-phat_2 # ensure this matches your null hypothesis order
sd = sqrt(phat_t*(1-phat_t)*(1/n_1 + 1/n_2))
z = (diff-null)/sd  ; z # standardized statistic
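
The cell above computes the pooled two-proportion z-statistic. Writing $x_1$ and $x_2$ for the success counts (`successes_1`, `successes_2`) and $\hat{p}$ for the pooled proportion (`phat_t`), where the symbols $x_1$, $x_2$, and $\hat{p}$ are notation introduced here for illustration, the calculation is:

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2}, \qquad z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\sqrt{\hat{p}\,(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

The pooled proportion appears in the denominator because the null hypothesis assumes both groups share a single proportion.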

4.  Finding the p-value and stating a conclusion at the 0.05 significance level:

In [None]:
2*(1-pnorm(abs(z)))

With a p-value of about 5.6e-05, well below the 0.05 significance level, we reject the null hypothesis: there is very strong evidence that Zac Gallen does not throw the same proportion of fastballs inside to right-handed and left-handed batters.

## Let's make a 95% confidence interval to go along with our conclusion.

In [None]:
siglevel = 0.05             # Enter your significance level (alpha)
multiplier = qnorm(1-siglevel/2)
se = sqrt(phat_1*(1-phat_1)/n_1+phat_2*(1-phat_2)/n_2) # Standard Error
CI = c(diff-multiplier*se, diff+multiplier*se)  ; CI # Confidence Interval
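
Unlike the test statistic, the interval in the cell above uses the unpooled standard error, because a confidence interval does not assume the two proportions are equal. With $z^*$ denoting the multiplier `qnorm(1 - siglevel/2)` (a symbol introduced here for illustration), the interval is:

$$(\hat{p}_1 - \hat{p}_2) \pm z^* \sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$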

Note that because zero is not in our 95% confidence interval and we conducted a two-sided test at the 0.05 level, the interval also leads us to reject the null hypothesis.
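
As a cross-check on the hand calculation, base R's `prop.test()` can run the same comparison. This is a sketch rather than part of the original workflow. With `correct = FALSE` (no continuity correction), its chi-squared statistic should equal $z^2$, its p-value should match the two-sided p-value, and its confidence interval should match the interval computed above. The counts are the ones typed into the fastball cells; the object names `successes`, `totals`, and `check` are made up for this example.

```r
# Counts copied from the fastball cells above
successes <- c(1259, 1779)   # number of successes in group 1 and group 2
totals    <- c(3055, 3863)   # sample size of group 1 and group 2

# Two-sided test of equal proportions, no continuity correction
check <- prop.test(x = successes, n = totals,
                   alternative = "two.sided",
                   conf.level = 0.95, correct = FALSE)

check$statistic        # chi-squared statistic, which should equal z^2
sqrt(check$statistic)  # compare with abs(z)
check$p.value          # compare with 2*(1-pnorm(abs(z)))
check$conf.int         # compare with CI
```

If the numbers disagree, the most likely cause is a mistyped count or a group order that does not match the hand calculation.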

In [None]:
gallen %>% filter(pitch_type == 'KC') %>% mutate(pitch_zone = as.factor(zone)) %>%
  ggplot(aes(x=plate_x, y=plate_z, color=pitch_zone)) + geom_point() + scale_color_viridis_d() + theme_minimal()

In [None]:
# Identify the top 5 events based on frequency
top_events <- gallen %>% 
  filter(!is.na(events)) %>% 
  count(events, sort = TRUE) %>% 
  slice_max(n, n = 5) %>% 
  pull(events)

# Filter the data to only include those top events, then plot
gallen %>% 
  filter(events %in% top_events) %>% 
  ggplot(aes(x = events)) +
  geom_bar() +
  facet_grid(rows = vars(zone), cols = vars(stand)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Is Zac Gallen more confident pitching outside or inside to left- or right-handed batters in a potential walk situation?

To start, let's look at all the pitches thrown in three-ball counts.

In [None]:
gallen %>% filter(balls == 3) %>% ggplot(aes(pitch_type)) + geom_bar() + 
  facet_grid(stand ~ in_out ~ game_year) + theme_bw()

In [None]:
gallen %>% filter(balls == 3 & pitch_type == 'KC') %>% select(stand, in_out) %>% table() %>% plot(color=TRUE)

In [None]:
gallen %>% filter(balls == 3 & strikes == 2) %>% ggplot(aes(x=fct_infreq(events))) + geom_bar() +
  facet_grid(stand ~ .) +
  theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

In [None]:
gallen %>% filter(events == 'walk') %>% select(stand, in_out) %>% table() %>%
  plot(main='Walk producing pitches thrown by Zac Gallen', color=TRUE)

They look really close! Let's do a two-proportion z-test to see whether the difference is statistically significant.

Write our $H_0 \ \text{and} \ H_A$ hypotheses:

$H_0:$

$H_A:$

Then find our observed statistic

In [None]:
gallen %>% filter(events == 'walk') %>% tabyl(in_out, stand) %>% adorn_totals()

In [None]:
null = 0           # Enter the value of your Null Hypothesis Parameter
successes_1 = 73    # number of successes in group 1
successes_2 = 75    # number of successes in group 2
n_1 = 155   # sample size of group 1     
n_2 = 164   # sample size of group 2
n = n_1 + n_2     # total sample size
phat_1 = successes_1/n_1
phat_2 = successes_2/n_2
phat_t = (successes_1 + successes_2)/(n)
diff = phat_1-phat_2 # ensure this matches your null hypothesis order
sd = sqrt(phat_t*(1-phat_t)*(1/n_1 + 1/n_2))
z = (diff-null)/sd  ; z # standardized statistic
2*(1-pnorm(abs(z)))
siglevel = 0.05             # Enter your significance level (alpha)
multiplier = qnorm(1-siglevel/2)
se = sqrt(phat_1*(1-phat_1)/n_1+phat_2*(1-phat_2)/n_2) # Standard Error
CI = c(diff-multiplier*se, diff+multiplier*se)  ; CI # Confidence Interval

Our p-value is about 0.8, so we fail to reject the null hypothesis. Consistent with this, the 95% confidence interval includes zero.
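
The same cell is reused for every comparison in this notebook with only the four counts changed, so it could be wrapped in a function. The sketch below is one way to do that. The name `two_prop_z()` and its argument names are hypothetical, not part of the original notebook or of any package. It returns the observed difference, the z-statistic, the two-sided p-value, and the confidence interval, using the pooled standard deviation for the test and the unpooled standard error for the interval, as in the cells above.

```r
# Hypothetical helper wrapping the two-proportion z-test cells above
two_prop_z <- function(x1, n1, x2, n2, null = 0, siglevel = 0.05) {
  phat_1 <- x1 / n1
  phat_2 <- x2 / n2
  phat_t <- (x1 + x2) / (n1 + n2)   # pooled proportion under the null
  est    <- phat_1 - phat_2         # observed difference (group 1 - group 2)

  # Pooled standard deviation for the test statistic
  sd_null <- sqrt(phat_t * (1 - phat_t) * (1 / n1 + 1 / n2))
  z       <- (est - null) / sd_null
  p_value <- 2 * (1 - pnorm(abs(z)))   # two-sided p-value

  # Unpooled standard error for the confidence interval
  se         <- sqrt(phat_1 * (1 - phat_1) / n1 + phat_2 * (1 - phat_2) / n2)
  multiplier <- qnorm(1 - siglevel / 2)
  ci         <- c(est - multiplier * se, est + multiplier * se)

  list(diff = est, z = z, p_value = p_value, ci = ci)
}

# Walk counts from the cell above
walks <- two_prop_z(x1 = 73, n1 = 155, x2 = 75, n2 = 164)
walks$z
walks$p_value
walks$ci
```

Avoiding the names `sd` and `diff` inside the function keeps the base R functions of the same name usable elsewhere in the session.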

# Drug Use and Hypertension

In [None]:
# Flag hypertension: average systolic >= 140 or average diastolic >= 90
health$hypertensive <- ifelse(health$BPSysAve >= 140 | health$BPDiaAve >= 90, "Yes", "No")

# Filter complete cases
health_spec <- na.omit(health[, c("Gender", "hypertensive", "HardDrugs", "AgeDecade", "Smoke100n")])

In [None]:
health_spec %>% select(hypertensive, HardDrugs) %>% table()

In [None]:
health_spec %>% select(HardDrugs, hypertensive) %>% table() %>% plot()

In [None]:
null = 0           # Enter the value of your Null Hypothesis Parameter
successes_1 = 487    # number of successes in group 1
successes_2 = 124    # number of successes in group 2
n_1 = 487+4133   # sample size of group 1     
n_2 = 124+923   # sample size of group 2
n = n_1 + n_2     # total sample size
phat_1 = successes_1/n_1
phat_2 = successes_2/n_2
phat_t = (successes_1 + successes_2)/(n)
diff = phat_1-phat_2 # ensure this matches your null hypothesis order
sd = sqrt(phat_t*(1-phat_t)*(1/n_1 + 1/n_2))
z = (diff-null)/sd  ; z # standardized statistic
2*(1-pnorm(abs(z)))
siglevel = 0.05             # Enter your significance level (alpha)
multiplier = qnorm(1-siglevel/2)
se = sqrt(phat_1*(1-phat_1)/n_1+phat_2*(1-phat_2)/n_2) # Standard Error
CI = c(diff-multiplier*se, diff+multiplier*se)  ; CI # Confidence Interval

## What about smoking and education level?

In [None]:
health %>% filter(!is.na(Education) & !is.na(Smoke100n)) %>% tabyl(Smoke100n, Education) %>% adorn_totals()


Just looking at college and high school grads:

In [None]:
null = 0           # Enter the value of your Null Hypothesis Parameter
successes_1 = 684       # number of successes in group 1
successes_2 = 755    # number of successes in group 2
n_1 = 2098   # sample size of group 1     
n_2 = 1517   # sample size of group 2
n = n_1 + n_2     # total sample size
phat_1 = successes_1/n_1
phat_2 = successes_2/n_2
phat_t = (successes_1 + successes_2)/(n)
diff = phat_1-phat_2 # ensure this matches your null hypothesis order
sd = sqrt(phat_t*(1-phat_t)*(1/n_1 + 1/n_2))
z = (diff-null)/sd  ; z # standardized statistic
2*(1-pnorm(abs(z)))
siglevel = 0.05             # Enter your significance level (alpha)
multiplier = qnorm(1-siglevel/2)
se = sqrt(phat_1*(1-phat_1)/n_1+phat_2*(1-phat_2)/n_2) # Standard Error
CI = c(diff-multiplier*se, diff+multiplier*se)  ; CI # Confidence Interval