## Autograded Notebook (Canvas & CodeGrade)

This notebook will be automatically graded. It is designed to test your answers and award points for correct ones. Follow the instructions for each Task carefully.

### Instructions

* **Download this notebook** as you would any other ipynb file
* **Upload** to Google Colab or work locally (if you have that set up)
* **Delete `raise NotImplementedError()`**
* Write your code in the `# YOUR CODE HERE` space
* **Execute** the Test cells that contain `assert` statements - these help you check your work (others contain hidden tests that will be checked when you submit through Canvas)
* **Save** your notebook when you are finished
* **Download** as an `ipynb` file (if working in Colab)
* **Upload** your complete notebook to Canvas (there will be additional instructions in Slack and/or Canvas)

# Unit 1 Sprint 2 Module 1

## Hypothesis Testing - One- and Two-Sample t-Tests

### Objectives

* Explain the purpose of a t-test and identify applications
* Use a t-test to test for a statistically significant difference between a sample mean and a reference value, or between the means of two independent samples
* Use a t-test p-value to draw the correct conclusion about the null and alternative hypotheses

#### Total notebook points: 13

### Introduction

Mosquito nets have traditionally been an important tool to prevent mosquito bites in parts of the world where malaria is endemic. However, it may not be practical for an army that is on the move to set up and carry mosquito nets each night and day. Impregnating soldiers’ uniforms with insect repellent solves the mobility problem but also has drawbacks. First, the insect repellent quickly becomes ineffective with repeated washing and ironing and must be frequently reapplied. Second, in hot and humid climates the insect repellent can be absorbed through the skin, and the long-term effects of this exposure are unknown. One compromise is to have soldiers apply patches treated with insect repellent to their clothing. These patches would last longer because they would not be washed or ironed and would not expose the entire body to the insect repellent.

### Dataset Description

The `Mosquito.xlsx` dataset contains data recorded in an experiment conducted on male soldiers in the Indian Army who were stationed in the Tezpur/Solmara garrison in Northeast India. Thirty soldiers were randomly selected to receive one of three types of single mosquito-repellent patch. After giving informed consent, the study participants affixed the patches at predetermined points on their uniforms, and research assistants (who were blinded to the type of repellent used) counted the number of times a mosquito landed on each individual in an hour.

Medical officers with the Indian Army have recorded data on mosquito bites and related illness for many years and can say with authority that the mean number of mosquito touches for soldiers not wearing any mosquito repellent is 8.2 per hour. **We wish to determine if wearing a single repellent patch changes the mean number of mosquito touches for soldiers compared to not wearing any mosquito repellent.**

*Adapted from: A. Bhatnagar and V.K. Mehta (2007). "Efficacy of Deltamethrin and Cyfluthrin Impregnated Cloth Over Uniform Against Mosquito Bites," Medical Journal Armed Forces India, Vol. 63, pp. 120-122.*

**Task 1** - Load the data

Let's load the data! The URL has been provided as well as the imports for pandas and numpy.

* load your CSV file into a DataFrame named `df_mosquito`

In [None]:
# Task 1
import pandas as pd
import numpy as np

# URL for the dataset
data_url = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Mosquito/Mosquito.csv'

# YOUR CODE HERE
df_mosquito = pd.read_csv(data_url)

# Print out your DataFrame
df_mosquito.head()

Unnamed: 0,ID,Mosq_count
0,1,4
1,2,10
2,3,13
3,4,0
4,5,11


**Task 1 Test**

In [None]:
# Task 1 - Test

assert isinstance(df_mosquito, pd.DataFrame), 'Have you created a DataFrame named `df_mosquito`?'
assert len(df_mosquito) == 90


**Task 2** - Calculate the mean

* Calculate the mean number of mosquito touches in the sample. Assign your answer to the variable `mosquito_touch_mean`.

In [None]:
# Task 2

# YOUR CODE HERE
mosquito_touch_mean = df_mosquito['Mosq_count'].mean()

**Task 2 Test**

In [None]:
# Task 2 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 3** - Calculate the standard deviation

* Calculate the standard deviation of the number of mosquito touches in the sample. Assign your answer to `mosquito_touch_std`.
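
The choice of divisor matters when computing a standard deviation from a sample. The sketch below uses a small made-up Series (the name `toy_counts` and its values are hypothetical, not taken from the mosquito data) to show how the pandas default (`ddof=1`, divide by n - 1) differs from the NumPy default (`ddof=0`, divide by n).

```python
import numpy as np
import pandas as pd

# Hypothetical toy sample, NOT taken from the mosquito dataset
toy_counts = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# pandas defaults to ddof=1: divide by n - 1 (sample standard deviation)
sample_std = toy_counts.std()

# NumPy defaults to ddof=0: divide by n (population standard deviation)
# Here the squared deviations sum to 32, and 32 / 8 = 4, so the result is 2
population_std = np.std(toy_counts.to_numpy())

# Passing ddof=1 to NumPy reproduces the pandas default
numpy_sample_std = np.std(toy_counts.to_numpy(), ddof=1)

print(sample_std, population_std, numpy_sample_std)
```

For a t-test, the sample standard deviation (`ddof=1`) is the one that enters the test statistic.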

In [None]:
# Task 3

# YOUR CODE HERE
# pandas uses ddof=1 by default: the sum of squared deviations is divided by n - 1
mosquito_touch_std = df_mosquito['Mosq_count'].std()

**Task 3 Test**

In [None]:
# Task 3 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 4 -** Statistical hypotheses

From the list of choices below, select the null and alternative hypotheses using the experiment information described above.  Specify your answer in the next code block using `Answer = `.  For example, if the correct answer is choice B, you'll type `Answer = 'B'`.

A: $H_0: \mu = 8$ vs. $H_a: \mu = 8$

B: $H_0: \mu \neq 8.2$ vs. $H_a: \mu = 8$

C: $H_0: \mu \neq 8.2$ vs. $H_a: \mu = 8.2$ 

D: $H_0: \mu = 8.2$ vs. $H_a: \mu \neq 8.2$

In [None]:
# Task 4

# YOUR CODE HERE
Answer = 'D'


**Task 4 Test**

In [None]:
# Task 4 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 5** - One-sample t-test

* Conduct a 1-sample t-test to test your hypotheses. Assign the p-value from the t-test to the variable `mosquito_pval`. **This should be a single value**.

*Hint: The `stats.ttest_1samp()` function returns two values; assign the results of the t-test to `_, mosquito_pval`.*
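
To see what the function computes, the following sketch reproduces a one-sample t-test by hand and compares it with SciPy. The data are randomly generated for illustration only (`toy_sample` and `null_mean` are hypothetical names, and the numbers are not the mosquito data), so treat this as a minimal sketch rather than the graded solution.

```python
import numpy as np
from scipy import stats

# Hypothetical counts generated for illustration; NOT the mosquito data
rng = np.random.default_rng(seed=0)
toy_sample = rng.poisson(lam=8, size=30)
null_mean = 8.2  # hypothesized population mean under H0

# t = (sample mean - hypothesized mean) / (s / sqrt(n)), with s using ddof=1
n = toy_sample.size
standard_error = toy_sample.std(ddof=1) / np.sqrt(n)
t_manual = (toy_sample.mean() - null_mean) / standard_error

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# SciPy should agree with the manual calculation
t_scipy, p_scipy = stats.ttest_1samp(toy_sample, popmean=null_mean)
print(t_manual, t_scipy)
print(p_manual, p_scipy)
```

The unpacking on the `ttest_1samp` line is the same pattern the hint describes: the first value is the t-statistic and the second is the p-value.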

In [None]:
# Task 5

# Use `ttest_1samp` from the scipy.stats module
from scipy import stats

# YOUR CODE HERE
mosquito_pval = stats.ttest_1samp(df_mosquito['Mosq_count'], popmean=8.2).pvalue
print(mosquito_pval)

0.5864980356272131


**Task 5 Test**

In [None]:
# Task 5 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 6**

Select the correct conclusion at the 0.05 significance level from the list of choices below. Specify your answer in the next code block using `Answer = `.  For example, if the correct answer is choice B, you'll type `Answer = 'B'`.

A: We reject the null hypothesis at the 0.05 significance level and conclude that a single repellent patch reduces the mean number of mosquito touches.

B: We fail to reject the null hypothesis at the 0.05 significance level and conclude that a single repellent patch reduces the mean number of mosquito touches.

C: We fail to reject the null hypothesis at the 0.05 significance level and conclude that a single repellent patch does not change the mean number of mosquito touches.

D: We reject the null hypothesis at the 0.05 significance level and conclude that a single repellent patch does not increase the mean number of mosquito touches.
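
The decision rule behind these choices can be written as a small helper. This is a sketch under one common convention (reject when the p-value is strictly below the significance level); the function name `decide` and the example p-values are hypothetical.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Return the test decision for a p-value at significance level alpha."""
    if p_value < alpha:
        return "reject the null hypothesis"
    return "fail to reject the null hypothesis"


# Hypothetical p-values chosen only to show both outcomes
print(decide(0.30))
print(decide(0.01))
```

Note that the helper only reports the decision about the null hypothesis; wording the conclusion in terms of the study question is still up to you.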


In [None]:
# Task 6

# YOUR CODE HERE
Answer = 'C'


**Task 6 Test**

In [None]:
# Task 6 - Test
# Hidden tests - you will see the results when you submit to Canvas

## Use the following information to complete Tasks 7-13

### Introduction

More than 14,000 people finished the 2020 Disney Marathon held on January 12. The results by age and gender group are included in the `Disney.csv` dataset.

**We wish to determine if the mean finishing time for male and female marathon runners is the same or if there is a difference in the mean finishing time between male and female marathon runners.**


[Source: Track Shack. 2020 Disney Marathon Race Results](https://www.trackshackresults.com/disneysports/results/wdw/wdw20/mar_results.php)

**Task 7** - Load the next dataset

Let's load the data! The URL has been provided.

* load your CSV file into a DataFrame named `df_disney`

In [None]:
# Task 7

# URL for Disney marathon dataset
data_url2 = 'https://raw.githubusercontent.com/LambdaSchool/data-science-practice-datasets/main/unit_1/Disney_Marathon/Disney.csv'

# YOUR CODE HERE
df_disney = pd.read_csv(data_url2)
df_disney

Unnamed: 0,ID,gender,age,group,time
0,1,M,30,M30-34,2.371944
1,2,M,26,M25-29,2.450556
2,3,M,32,M30-34,2.457778
3,4,M,35,M35-39,2.655833
4,5,M,26,M25-29,2.736111
...,...,...,...,...,...
14101,14102,F,39,F35-39,7.320278
14102,14103,F,54,F50-54,7.340556
14103,14104,M,39,M35-39,7.383333
14104,14105,M,52,M50-54,7.400000


**Task 7 Test**

In [None]:
# Task 7 - Test

assert isinstance(df_disney, pd.DataFrame), 'Have you created a DataFrame named `df_disney`?'
assert len(df_disney) == 14106


**Task 8 -** Statistical hypotheses

From the list of choices below, select the null and alternative hypotheses using the experiment information described above.  Let $\mu_1$ be the mean finishing time for all male runners and $\mu_2$ be the mean finishing time for all female runners.

Specify your answer in the next code block using `Answer = `.  For example, if the correct answer is choice B, you'll type `Answer = 'B'`.

A: $H_0: \mu_1 \neq \mu_2$ vs. $H_a: \mu_1 = \mu_2$

B: $H_0: \mu_1 = \mu_2$ vs. $H_a: \mu_1 \neq \mu_2$

C: $H_0: \mu_1 > \mu_2$ vs. $H_a: \mu_1 < \mu_2$ 

D: $H_0: \mu_1 <  \mu_2$ vs. $H_a: \mu_1 > \mu_2$ 

In [None]:
# Task 8

# YOUR CODE HERE
Answer = 'B'


**Task 8 Test**

In [None]:
# Task 8 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 9** - Create new series from a DataFrame

Create **Series** (a pandas DataFrame column is a Series):

* one containing finishing times for male participants (`male_finish`)
* one containing finishing times for female participants (`female_finish`)

*Hint: Check the size of your resulting Series - it should have only one column!*
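
One way to get a one-dimensional Series is to combine a boolean mask with a single column label in `.loc`. The sketch below uses a tiny made-up DataFrame (`toy_df` and its values are hypothetical; only the column names `gender` and `time` mirror the `df_disney` output).

```python
import pandas as pd

# Hypothetical toy frame reusing the `gender` and `time` column names
toy_df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "time": [3.5, 4.1, 4.8, 5.2, 6.0],
})

# The boolean mask picks the rows; the single column label returns a Series
toy_female = toy_df.loc[toy_df["gender"] == "F", "time"]

print(type(toy_female).__name__)  # Series
print(toy_female.shape)           # (3,)
```

A shape with a single entry, such as `(3,)`, confirms that the result is a Series rather than a DataFrame.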

In [None]:
df_disney.head(n=3)

Unnamed: 0,ID,gender,age,group,time
0,1,M,30,M30-34,2.371944
1,2,M,26,M25-29,2.450556
2,3,M,32,M30-34,2.457778


In [None]:
# Task 9

# YOUR CODE HERE
male_finish = df_disney[df_disney['gender']=='M']['time']
female_finish = df_disney[df_disney['gender']=='F']['time']



In [None]:
# assert male_finish.shape[0] + female_finish.shape[0] == df_disney.shape[0]

**Task 9 Test**

In [None]:
# Task 9 - Test
# Visible testing - use this to check your results!
assert male_finish.shape == (6577,), 'Make sure you selected M and only have a single column.'
assert female_finish.shape == (7529,), 'Make sure you selected F and only have a single column'

# NO hidden tests

**Task 10** - Calculate the mean finishing times

* Calculate the mean finishing time for male and female participants separately. Name your variables `male_finish_mean` and `female_finish_mean`.
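
As a cross-check on per-group statistics, `groupby` can compute the mean and standard deviation for every group in one call. This is an optional sketch on made-up data (`toy_df`, `summary`, and the values are hypothetical); the task itself still expects the separately named variables.

```python
import pandas as pd

# Hypothetical toy frame; values are NOT from the Disney dataset
toy_df = pd.DataFrame({
    "gender": ["M", "F", "F", "M", "F"],
    "time": [3.0, 4.0, 5.0, 5.0, 6.0],
})

# One row per gender, with the mean and the sample std (ddof=1) of `time`
summary = toy_df.groupby("gender")["time"].agg(["mean", "std"])
print(summary)

# Individual values can be read back by row and column label
print(summary.loc["F", "mean"])
```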

In [None]:
# Task 10

# YOUR CODE HERE
male_finish_mean = male_finish.mean()
female_finish_mean = female_finish.mean()

**Task 10 Test**

In [None]:
# Task 10 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 11** - Calculate the standard deviation

* Calculate the standard deviation of the finishing times for male and female participants separately. Name your variables `male_finish_std` and `female_finish_std`.

In [None]:
# Task 11

# YOUR CODE HERE
male_finish_std = male_finish.std()
female_finish_std = female_finish.std()

**Task 11 Test**

In [None]:
# Task 11 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 12** - Conduct a 2-sample t-test

Conduct a 2-sample t-test to test your hypotheses:

* Assign the t-statistic to a variable called `disney_tval`
* Assign the p-value to a variable called `disney_pval`

**Note:** The function returns two values and you can assign them with one line (example):

`variable1, variable2 = some.function(arguments)`
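
The sketch below shows this unpacking with `stats.ttest_ind` on randomly generated data (`group_a` and `group_b` are hypothetical and unrelated to the marathon results). It also shows the `equal_var=False` option, which runs Welch's t-test instead of the default pooled-variance test; whether that option is appropriate for this assignment is not stated here, so it is included only for comparison.

```python
import numpy as np
from scipy import stats

# Hypothetical samples generated for illustration; NOT the marathon data
rng = np.random.default_rng(seed=1)
group_a = rng.normal(loc=4.5, scale=1.0, size=200)
group_b = rng.normal(loc=5.0, scale=1.2, size=250)

# Default: Student's t-test, which assumes equal population variances
t_pooled, p_pooled = stats.ttest_ind(group_a, group_b)

# equal_var=False: Welch's t-test, which does not assume equal variances
t_welch, p_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(t_pooled, p_pooled)
print(t_welch, p_welch)
```

The sign of the t-statistic follows the argument order: it is negative when the first sample has the smaller mean.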

In [None]:
# Task 12

# YOUR CODE HERE
disney_tval, disney_pval = stats.ttest_ind(male_finish, female_finish)
print(disney_tval, disney_pval)

-29.27857393997243 5.485138013952879e-183


In [None]:
# Practice: run the same 2-sample t-test from descriptive statistics only
stats.ttest_ind_from_stats(male_finish_mean, male_finish_std, len(male_finish),
                           female_finish_mean, female_finish_std, len(female_finish))

Ttest_indResult(statistic=-29.27857393997243, pvalue=5.485138013952879e-183)

**Task 12 Test**

In [None]:
# Task 12 - Test
# Hidden tests - you will see the results when you submit to Canvas

**Task 13**

Select the correct conclusion at the 0.05 significance level from the list of choices below. Specify your answer in the next code block using `Answer = `.  For example, if the correct answer is choice B, you'll type `Answer = 'B'`.

A: We reject the null hypothesis at the 0.05 significance level and conclude the mean finishing time for male and female marathon runners is different.

B: We fail to reject the null hypothesis at the 0.05 significance level and conclude the mean finishing time for male and female marathon runners is different.

C: We reject the null hypothesis at the 0.05 significance level and conclude the mean finishing time for male and female marathon runners is the same.

D: We fail to reject the null hypothesis at the 0.05 significance level and conclude the mean finishing time for male and female marathon runners is the same.


In [None]:
# Task 13

# YOUR CODE HERE
Answer = 'A'


**Task 13 Test**

In [None]:
# Task 13 - Test
# Hidden tests - you will see the results when you submit to Canvas