# Workshop #6: Hypothesis Tests

In [None]:
# Loading the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

## Problem 1
*Gulf Real Estate Properties Inc.* is a real estate firm located in southwest Florida. The company, which advertises itself as “expert in the real estate market,” monitors condominium sales by collecting data on location, list price, sale price, and number of days it takes to sell each unit. Each condominium is classified as *Yes* for Gulf view if it is located directly on the Gulf of Mexico or *No* for Gulf view if it is located on the bay or a golf course, near but not on the Gulf. Sample data from the Multiple Listing Service in Naples, Florida, provided recent sales data for some condominiums. The prices are in thousands of dollars. The data are given in `condominiums.csv`.

Construct a 95% confidence interval estimate of the population mean *Sale Price* for condominiums **with Gulf view**, and then the 95% confidence interval for the population mean *Sale Price* for condominiums **without Gulf view**. Based on your results, does it seem that the prices differ?

In [None]:
cond_data = pd.read_csv('condominiums.csv')
print(cond_data.head())

  gulf_view  list_price  sale_price  days_to_sell
0       yes       495.0       475.0           130
1       yes       379.0       350.0            71
2       yes       529.0       519.0            85
3       yes       552.5       534.5            95
4       yes       334.9       334.9           119


In [None]:
gulf_view_data = cond_data[cond_data['gulf_view'] == 'yes']['sale_price']
no_gulf_view_data = cond_data[cond_data['gulf_view'] == 'no']['sale_price']

mean_gulf_view = gulf_view_data.mean()
std_gulf_view = gulf_view_data.std()
n_gulf_view = len(gulf_view_data)

mean_no_gulf_view = no_gulf_view_data.mean()
std_no_gulf_view = no_gulf_view_data.std()
n_no_gulf_view = len(no_gulf_view_data)

#t-score for 95% confidence interval
t_score_gulf_view = stats.t.ppf(0.975, df=n_gulf_view-1)
t_score_no_gulf_view = stats.t.ppf(0.975, df=n_no_gulf_view-1)

#confidence interval for gulf view
margin_of_error_gulf_view = t_score_gulf_view * (std_gulf_view / np.sqrt(n_gulf_view))
ci_gulf_view = (mean_gulf_view - margin_of_error_gulf_view, mean_gulf_view + margin_of_error_gulf_view)

#confidence interval for no gulf view
margin_of_error_no_gulf_view = t_score_no_gulf_view * (std_no_gulf_view / np.sqrt(n_no_gulf_view))
ci_no_gulf_view = (mean_no_gulf_view - margin_of_error_no_gulf_view, mean_no_gulf_view + margin_of_error_no_gulf_view)

# If the two intervals do not overlap, the mean sale prices appear to differ between the groups
print(f"95% Confidence Interval for Gulf View Sale Price: {ci_gulf_view}")
print(f"95% Confidence Interval for No Gulf View Sale Price: {ci_no_gulf_view}")

95% Confidence Interval for Gulf View Sale Price: (392.65233545150653, 515.7926645484936)
95% Confidence Interval for No Gulf View Sale Price: (181.36204825516631, 225.01572952261148)
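
As a cross-check on the manual calculation above, the same t-based interval can be obtained from `scipy.stats.t.interval`. The sketch below is illustrative only: the helper name `mean_confidence_interval` is hypothetical, and the input is just the five sale prices visible in the `head()` output, not the full data set.

```python
import numpy as np
import scipy.stats as stats

def mean_confidence_interval(sample, confidence=0.95):
    """Return the t-based confidence interval for the mean of `sample`."""
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    # Standard error of the mean, using the sample standard deviation (ddof=1),
    # which matches the default of pandas' Series.std()
    sem = sample.std(ddof=1) / np.sqrt(n)
    return stats.t.interval(confidence, n - 1, loc=mean, scale=sem)

# Illustration only: the first five Gulf-view sale prices (thousands of dollars)
example_prices = [475.0, 350.0, 519.0, 534.5, 334.9]
ci_low, ci_high = mean_confidence_interval(example_prices)
print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
```

Applied to `gulf_view_data` and `no_gulf_view_data`, this should reproduce the intervals printed above, since both approaches use the same t quantile and standard error.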


## Problem 2
Triphammer Road is a busy street that passes through a residential neighborhood. Residents there are concerned that vehicles traveling on Triphammer *often **exceed** the posted speed limit* of 30 miles per hour. The local police sometimes place a radar speed detector by the side of the road; as a vehicle approaches, this detector displays the vehicle’s speed to its driver. The local residents are not convinced that such a passive method is helping the problem. They wish to persuade the village to add extra police patrols to encourage drivers to observe the speed limit. To help their case, a resident stood where he could see the detector and recorded the speed of vehicles passing it during a 15-minute period one day. When clusters of vehicles went by, he noted only the speed of the front vehicle. The data are given in `triphammer.csv`.

Is there sufficient evidence to support the residents' concern about the speed of vehicles passing on Triphammer Road? State the hypotheses of the test, and then perform the correct test to reach a conclusion.


In [None]:
trip_data = pd.read_csv('triphammer.csv')
print(trip_data.head())

  vehicle type  speed (mph)
0          car           29
1          SUV           34
2        truck           34
3        truck           28
4        truck           30


In [None]:
mu = 30

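# H0: mean speed = 30 mph; H1: mean speed > 30 mph (one-sided)
# One-sample t-test (returns a two-sided p-value by default)
t_statistic, p_value = stats.ttest_1samp(trip_data['speed (mph)'], mu)

# Convert the two-sided p-value to a one-sided (greater-than) p-value
if t_statistic > 0:
    p_value_p = p_value / 2
else:
    p_value_p = 1 - (p_value / 2)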

print(f"t-statistic: {t_statistic}")
print(f"p-value: {p_value_p}")

#conclusion
a = 0.05
print()
if p_value_p < a:
    print("There is sufficient evidence to support the concern about the speed of vehicles")
else:
    print("There is insufficient evidence to support the concern about the speed of vehicles")

t-statistic: 1.1781136648171595
p-value: 0.12566910367402262

There is insufficient evidence to support the concern about the speed of vehicles
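
The one-sided p-value can also be requested directly. The sketch below assumes SciPy 1.6 or later, where `ttest_1samp` accepts an `alternative` argument, and compares it with the manual conversion from a two-sided p-value. The helper `one_sided_p_greater` is a hypothetical name, and the speeds are only the five rows shown in the `head()` output, used for illustration.

```python
import numpy as np
import scipy.stats as stats

def one_sided_p_greater(sample, mu0):
    """p-value for H1: mean > mu0, derived from the two-sided t-test."""
    t_stat, p_two_sided = stats.ttest_1samp(sample, mu0)
    # Halve the two-sided p-value when the statistic points in the
    # direction of H1; otherwise take the complement of the half.
    return p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2

# Illustration only: the five speeds (mph) printed by head() above
example_speeds = np.array([29, 34, 34, 28, 30])

# Direct one-sided test (assumes SciPy >= 1.6)
t_stat, p_direct = stats.ttest_1samp(example_speeds, 30, alternative='greater')
p_manual = one_sided_p_greater(example_speeds, 30)
print(f"t = {t_stat:.4f}, direct p = {p_direct:.4f}, manual p = {p_manual:.4f}")
```

Both routes should give the same p-value; the `alternative` argument simply avoids the explicit sign check.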


## Problem 3
In an investigation of environmental causes of disease, data were collected on the annual mortality rate (deaths per 100,000) for males in 61 large towns in England and Wales. In addition, the water hardness was recorded as the calcium concentration (parts per million, ppm) in the drinking water. The data set (given in `mortality_rates.csv`) also notes, for each town, whether it was south or north of Derby.

Perform an appropriate hypothesis test to establish whether there is a significant **difference in mortality rates** between the two regions.

In [None]:
mort_data = pd.read_csv('mortality_rates.csv')
print(mort_data.head())

   derby  mortality  calcium
0  South       1702       44
1  South       1309       59
2  South       1259      133
3  North       1427       27
4  North       1724        6


In [None]:
south_data = mort_data[mort_data['derby'] == 'South']['mortality']
north_data = mort_data[mort_data['derby'] == 'North']['mortality']

# Welch's two-sample t-test (equal_var=False: equal variances are not assumed)
t_statistic, p_value = stats.ttest_ind(south_data, north_data, equal_var=False)

print(f"p-value: {p_value}")

#conclusion
alpha = 0.05
print()
if p_value < alpha:
    print("There is a significant difference in mortality rates between the two regions")
else:
    print("There is no significant difference in mortality rates between the two regions")

p-value: 3.1512169364549926e-08

There is a significant difference in mortality rates between the two regions
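
To make explicit what `equal_var=False` computes, here is a minimal sketch of Welch's t statistic and the Welch-Satterthwaite degrees of freedom worked out by hand. The function name `welch_t_test` and the sample values are hypothetical and are not taken from `mortality_rates.csv`.

```python
import numpy as np
import scipy.stats as stats

def welch_t_test(x, y):
    """Two-sided Welch t-test; returns (t statistic, degrees of freedom, p-value)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Squared standard errors of each group mean
    vx = x.var(ddof=1) / x.size
    vy = y.var(ddof=1) / y.size
    t_stat = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (vx + vy) ** 2 / (vx ** 2 / (x.size - 1) + vy ** 2 / (y.size - 1))
    p_value = 2 * stats.t.sf(abs(t_stat), df)
    return t_stat, df, p_value

# Hypothetical mortality rates (deaths per 100,000), for illustration only
south_example = [1702, 1309, 1259, 1466, 1359]
north_example = [1427, 1724, 1668, 1807, 1555]

t_stat, df, p_value = welch_t_test(south_example, north_example)
print(f"t = {t_stat:.4f}, df = {df:.2f}, p = {p_value:.4f}")
```

The result should agree with `stats.ttest_ind(south_example, north_example, equal_var=False)`.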


## Problem 4
A hygiene scientist decided to investigate just how effective washing with soap is in eliminating bacteria. To do this she tested four different methods—washing with water only, washing with regular soap, washing with antibacterial soap (ABS), and spraying hands with antibacterial spray (AS) (containing 65% ethanol as an active ingredient). She suspected that the number of bacteria on her hands before washing might vary considerably from day to day. To help even out the effects of those changes, she generated random numbers to determine the order of the four treatments. Each morning she washed her hands according to the treatment randomly chosen. Then she placed her right hand on a sterile media plate designed to encourage bacteria growth. She incubated each plate for 2 days at 36 °C after which she counted the bacteria colonies. She replicated this procedure 8 times for each of the four treatments. The data are given in `bacterial_counts.csv`.

Is there evidence that the average bacterial counts are different for the four methods she tested?

In [None]:
bact_data = pd.read_csv('bacterial_counts.csv')
print(bact_data.head())

               method  bacterial_count
0               water               74
1                soap               84
2  antibacterial soap               70
3       alcohol spray               51
4               water              135


In [None]:
methods = bact_data['method'].unique()
groups = [bact_data[bact_data['method'] == method]['bacterial_count'] for method in methods]

#ANOVA test
f_stat, p_value = stats.f_oneway(*groups)

print(f'ANOVA test : p-value = {p_value}')

#Tukey HSD test
tukey = pairwise_tukeyhsd(endog=bact_data['bacterial_count'], groups=bact_data['method'], alpha=0.05)

# Rows of the Tukey summary table, skipping the header row
comparisons = tukey.summary().data[1:]

group_a = []
group_b = []

# Note: this two-way split is only meaningful when every significant pair
# shares the same method as its first group
for comparison in comparisons:
    group1, group2, meandiff, p_adj, lower, upper, reject = comparison
    if p_adj < 0.05:  # adjusted p-value below 0.05: the pair differs significantly
        group_a.append(group2)
        group_b.append(group1)

group_a = sorted(set(group_a))
group_b = sorted(set(group_b))

print()
print(f"Tukey HSD test: Group A: {', '.join(group_a)} Group B: {', '.join(group_b)}")

ANOVA test : p-value = 0.0011114593949963585

Tukey HSD test: Group A: antibacterial soap, soap, water Group B: alcohol spray
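
For reference, the F statistic that `stats.f_oneway` reports can be reconstructed from the between-group and within-group sums of squares. The sketch below is a by-hand version under that standard definition. The function name `one_way_anova` and the bacterial counts are hypothetical and are not taken from `bacterial_counts.csv`.

```python
import numpy as np
import scipy.stats as stats

def one_way_anova(groups):
    """One-way ANOVA by hand; returns (F statistic, p-value)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    k = len(groups)       # number of groups
    n = all_values.size   # total number of observations
    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    p_value = stats.f.sf(f_stat, k - 1, n - k)
    return f_stat, p_value

# Hypothetical colony counts for four washing methods, for illustration only
example_groups = [
    [74, 135, 102, 124],   # water
    [84, 51, 110, 67],     # soap
    [70, 164, 88, 111],    # antibacterial soap
    [51, 5, 19, 18],       # alcohol spray
]

f_stat, p_value = one_way_anova(example_groups)
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")
```

The values should match `stats.f_oneway(*example_groups)`; a small p-value only says that at least one mean differs, which is why a post-hoc comparison such as Tukey HSD follows.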


## Problem 5
In July 1991, and again in April 2001, the *Gallup Poll* asked random samples of adults about their opinions on working parents. The data set given in `working_parents.csv` contains responses to the question "Considering the needs of both parents and children, which of the following do you see as the ideal family in today’s society?"

Based on these data, is there evidence that there was a change in people’s attitudes during the 10 years between these polls? (In other words, are the responses independent of the year of the poll?)

In [None]:
work_data = pd.read_csv('working_parents.csv')
print(work_data.head())

     ideal_arrangement  year
0  both_work_full_time  1991
1  both_work_full_time  1991
2  both_work_full_time  1991
3  both_work_full_time  1991
4  both_work_full_time  1991


In [None]:
#create a contingency table
contingency_table = pd.crosstab(work_data['ideal_arrangement'], work_data['year'])

#chi-square test
chi2_stat, p_value, dof, expected = stats.chi2_contingency(contingency_table)

print(f"chi-sq: {chi2_stat}")
print(f"p-value: {p_value}")

#conclusion based on p-value
alpha = 0.05
print()
if p_value < alpha:
    print("There is a significant difference in people's attitudes")
else:
    print("There is no significant difference in people's attitudes")

chi-sq: 4.030209095036681
p-value: 0.4019329311784825

There is no significant difference in people's attitudes
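
The chi-square statistic compares each observed count with the count expected under independence, (row total × column total) / grand total, using (rows − 1) × (columns − 1) degrees of freedom. The sketch below works this out by hand. The function name `chi_square_independence` and the table of counts are hypothetical and are not taken from `working_parents.csv`.

```python
import numpy as np
import scipy.stats as stats

def chi_square_independence(observed):
    """Chi-square test of independence; returns (chi2, p-value, dof, expected)."""
    observed = np.asarray(observed, dtype=float)
    row_totals = observed.sum(axis=1, keepdims=True)   # shape (r, 1)
    col_totals = observed.sum(axis=0, keepdims=True)   # shape (1, c)
    # Expected counts if row and column variables were independent
    expected = row_totals @ col_totals / observed.sum()
    chi2 = ((observed - expected) ** 2 / expected).sum()
    dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
    p_value = stats.chi2.sf(chi2, dof)
    return chi2, p_value, dof, expected

# Hypothetical counts: rows are response categories, columns are poll years
example_table = [
    [40, 30],
    [35, 45],
    [25, 25],
]

chi2, p_value, dof, expected = chi_square_independence(example_table)
print(f"chi-sq = {chi2:.4f}, dof = {dof}, p = {p_value:.4f}")
```

For tables larger than 2×2 this should agree with `stats.chi2_contingency`, which applies a continuity correction only when there is a single degree of freedom.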
