Binomial tests are useful for comparing the frequency of some outcome in a sample to the expected probability of that outcome. For example, if we expect 90% of ticketed passengers to show up for their flight but only 80 of 100 ticketed passengers actually show up, we could use a binomial test to understand whether the observed show-up rate of 80% is significantly different from the expected 90%.
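To make the passenger example concrete, here is a minimal sketch of an exact two-sided binomial test written with only the standard library. The helper names `binom_pmf` and `binom_test_two_sided` are hypothetical, and the two-sided rule used here (add up the probability of every outcome that is no more likely than the observed one) is one common convention, not necessarily the one used later in this walkthrough.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_test_two_sided(k, n, p):
    """Exact two-sided p-value: total probability of all outcomes that are
    no more likely than the observed count k under the null hypothesis."""
    observed = binom_pmf(k, n, p)
    # A small relative tolerance guards against floating-point ties.
    threshold = observed * (1 + 1e-7)
    total = sum(
        prob
        for prob in (binom_pmf(i, n, p) for i in range(n + 1))
        if prob <= threshold
    )
    return min(1.0, total)

# 80 of 100 passengers showed up; the expected show-up probability is 0.9.
p_value = binom_test_two_sided(80, 100, 0.9)
print(p_value)
```

A small p-value here would suggest that seeing 80 of 100 passengers is unlikely if the true show-up probability were really 90%.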

Binomial tests are similar to one-sample t-tests in that they test a sample statistic against some population-level expectation. The difference is that:

- binomial tests are used for binary categorical data to compare a sample frequency to an expected population-level probability

- one-sample t-tests are used for quantitative data to compare a sample mean to an expected population mean.

In [1]:
import numpy as np
import pandas as pd

We will walk through the process of using a binomial test to analyze data from a hypothetical online company, Live-it-LIVE.com — a website that sells all the necessary props and costumes to recreate iconic movie scenes at home!

In [2]:
monthly_report = pd.read_csv('../Datasets/monthly_report.csv')
print(monthly_report)

               timestamp purchase                       item
0    2020-01-17 17:23:06        y  cue cards - love actually
1    2020-01-25 17:09:39        n                        NaN
2    2020-01-25 05:22:01        n                        NaN
3    2020-01-18 04:33:40        y      t-rex - jurassic park
4    2020-01-24 17:24:52        n                        NaN
..                   ...      ...                        ...
495  2020-01-16 08:40:02        n                        NaN
496  2020-01-09 21:11:19        n                        NaN
497  2020-01-31 08:54:51        n                        NaN
498  2020-01-21 19:35:03        n                        NaN
499  2020-01-31 09:48:43        n                        NaN

[500 rows x 3 columns]


Note that the `purchase` column tells us whether a purchase was made; if so, the purchased item is listed in the `item` column.

#### Summarizing the Sample
The marketing department at Live-it-LIVE reports that, during this time of year, about 10% of visitors to Live-it-LIVE.com make a purchase.

The monthly report shows every visitor to the site and whether or not they made a purchase. The checkout page had a small bug this month, so the business department wants to know whether the purchase rate dipped below expectation. They’ve asked us to investigate this question.

In order to run a hypothesis test to address this, we’ll first need to know two things from the data:

- The number of people who visited the website

- The number of people who made a purchase on the website

Assuming each row of our dataset represents a unique site visitor, we can calculate the number of people who visited the website by finding the number of rows in the data frame. We can then find the number who made a purchase by using a conditional statement to add up the total number of rows where a purchase was made.

For example, suppose that the dataset `candy` contains a column named `chocolate` with `'yes'` recorded for every candy that has chocolate in it and `'no'` otherwise. The following code calculates the sample size (the number of candies) and the number of those candies that contain chocolate:

sample size (number of rows): 
`samp_size = len(candy)`
 
number with chocolate: 
`total_with_chocolate = np.sum(candy.chocolate == 'yes')`
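The two snippets above can be tried end to end on a tiny, made-up `candy` data frame. The rows below are invented purely for illustration; only the column name `chocolate` and its `'yes'`/`'no'` coding come from the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical data, invented for illustration only.
candy = pd.DataFrame({
    'name': ['bar A', 'gummy B', 'bar C', 'drop D', 'bar E'],
    'chocolate': ['yes', 'no', 'yes', 'no', 'yes'],
})

# Sample size: one row per candy.
samp_size = len(candy)

# The comparison yields a boolean Series; summing it counts the True values.
total_with_chocolate = np.sum(candy.chocolate == 'yes')

print(samp_size)
print(total_with_chocolate)
```

The comparison `candy.chocolate == 'yes'` produces one `True` or `False` per row, and because `True` counts as 1 when summed, `np.sum` returns the number of matching rows. Calling `.sum()` directly on the boolean Series gives the same count.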

In [3]:
# Calculate and print sample_size
sample_size = len(monthly_report)
print(sample_size)

500


In [4]:
# Calculate and print num_purchased
num_purchased = np.sum(monthly_report['purchase'] == 'y')
print(num_purchased)

41
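With both counts in hand, a quick sketch can compare the observed purchase rate to marketing's expectation. The values 500 and 41 are copied from the output above and 0.10 is the reported 10% rate; the variable names `expected_rate`, `observed_rate`, and `expected_purchases` are introduced here for illustration.

```python
# Values taken from the printed output above.
sample_size = 500
num_purchased = 41

# Marketing's reported purchase rate for this time of year.
expected_rate = 0.10

# Share of visitors who actually made a purchase.
observed_rate = num_purchased / sample_size

# Number of purchases we would expect if the 10% rate held.
expected_purchases = expected_rate * sample_size

print(observed_rate)
print(expected_purchases)
```

The observed rate is 8.2%, which corresponds to 41 purchases where about 50 were expected. Whether a gap of that size could plausibly arise by chance is the question a binomial test is designed to answer.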
