# Discounts Hypothesis Tests
Do discounts have a statistically significant effect on the number of products customers order? If so, at what level(s) of discount?
* On an order as a whole when at least one product has a discount
* For particular products
* For particular products purchased by particular customers with and without a discount

# Imports and Constants

In [36]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import scipy.stats as stats

In [2]:
DB_NAME = 'Northwind_small.sqlite'
RANDOM_STATE = 42

# Connect to Database

In [3]:
conn = sqlite3.connect(DB_NAME)
cur = conn.cursor()

# Do discounts have a significant effect on the number of products customers order when considering orders that have at least one discount on a product?

## Hypothesis
* H0: The mean number of products in orders with at least one discount **does not** differ from the mean number of products across all orders
* HA: The mean number of products in orders with at least one discount **does** differ from the mean number of products across all orders

## Test Type

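Conduct a two-tailed, one-sample t-test comparing the mean total quantity of the sample (orders with at least one discount) against the mean total quantity of all orders, which is treated as the population mean.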
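
The one-sample t statistic is the difference between the sample mean and the reference mean, divided by the standard error of the sample mean. The sketch below works through that formula on a small made-up sample (the values in `sample` and `pop_mean` are hypothetical, not Northwind data) and compares the result with `scipy.stats.ttest_1samp`.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical sample and reference mean, for illustration only
sample = np.array([60, 41, 105, 57, 121, 35, 80, 72], dtype=float)
pop_mean = 62.0

n = sample.size
# Standard error uses the sample standard deviation (ddof=1)
std_err = sample.std(ddof=1) / np.sqrt(n)
t_manual = (sample.mean() - pop_mean) / std_err
# Two-tailed p-value from the t distribution with n - 1 degrees of freedom
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

# The same test through SciPy
t_scipy, p_scipy = stats.ttest_1samp(sample, pop_mean)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

Both pairs of values should agree, which shows that the two-tailed p-value counts deviations in either direction from the reference mean.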

## Set Significance Level 

In [35]:
alpha = 0.05  # significance level (5%); 0.95 is the matching confidence level, not alpha

## Query database for Order Data

In [4]:
q = """
    SELECT * from OrderDetail;
    """

In [5]:
df = pd.DataFrame(cur.execute(q).fetchall(),
                  columns=[description[0] for description in cur.description])

In [6]:
df.head()

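| | Id | OrderId | ProductId | UnitPrice | Quantity | Discount |
|---|---|---|---|---|---|---|
| 0 | 10248/11 | 10248 | 11 | 14.0 | 12 | 0.0 |
| 1 | 10248/42 | 10248 | 42 | 9.8 | 10 | 0.0 |
| 2 | 10248/72 | 10248 | 72 | 34.8 | 5 | 0.0 |
| 3 | 10249/14 | 10249 | 14 | 18.6 | 9 | 0.0 |
| 4 | 10249/51 | 10249 | 51 | 42.4 | 40 | 0.0 |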


## Group Data by OrderId
Group the order lines by `OrderId` so that the number of items per order can be calculated.
* `total_qty` is the total number of items in an order (the sum of `Quantity`)
* `max_discount` is the largest discount applied to any product on an order
    * 0.0 indicates no discounts on the order
    * \> 0.0 indicates at least 1 item on the order had a discount
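
As a minimal sketch of this aggregation, the toy frame below (hypothetical order lines, not rows from the database) shows how summing `Quantity` and taking the maximum `Discount` per `OrderId` produces one row per order. The `has_discount` column is an illustrative addition, not part of the notebook.

```python
import pandas as pd

# Hypothetical order lines, for illustration only
toy = pd.DataFrame({
    'OrderId':  [1, 1, 2, 2, 2],
    'Quantity': [12, 10, 5, 9, 40],
    'Discount': [0.0, 0.0, 0.0, 0.15, 0.05],
})

per_order = toy.groupby('OrderId').agg(
    total_qty=('Quantity', 'sum'),      # items per order
    max_discount=('Discount', 'max'),   # largest discount on any line
)
# An order counts as discounted if any of its lines has a discount
per_order['has_discount'] = per_order['max_discount'] > 0.0
print(per_order)
```

Taking the maximum rather than the sum keeps the value interpretable as a discount rate while still flagging any order with at least one discounted line.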

In [29]:
df_groupby_order_id = (
    df[['OrderId', 'Quantity', 'Discount']]
    .groupby('OrderId')
    .agg(total_qty=('Quantity', 'sum'), max_discount=('Discount', 'max')))

In [30]:
df_groupby_order_id.head()

| OrderId | total_qty | max_discount |
|---|---|---|
| 10248 | 27 | 0.0 |
| 10249 | 49 | 0.0 |
| 10250 | 60 | 0.15 |
| 10251 | 41 | 0.05 |
| 10252 | 105 | 0.05 |


In [31]:
df_groupby_order_id.describe()

| | total_qty | max_discount |
|---|---|---|
| count | 830.0 | 830.0 |
| mean | 61.827711 | 0.066928 |
| std | 50.748158 | 0.087484 |
| min | 1.0 | 0.0 |
| 25% | 26.0 | 0.0 |
| 50% | 50.0 | 0.0 |
| 75% | 81.0 | 0.15 |
| max | 346.0 | 0.25 |


## Separate orders with a discount

In [32]:
orders_with_discount_df = df_groupby_order_id[df_groupby_order_id.max_discount > 0.0]

In [33]:
orders_with_discount_df.head()

| OrderId | total_qty | max_discount |
|---|---|---|
| 10250 | 60 | 0.15 |
| 10251 | 41 | 0.05 |
| 10252 | 105 | 0.05 |
| 10254 | 57 | 0.15 |
| 10258 | 121 | 0.2 |


In [34]:
orders_with_discount_df.describe()

| | total_qty | max_discount |
|---|---|---|
| count | 380.0 | 380.0 |
| mean | 72.944737 | 0.146184 |
| std | 51.403927 | 0.071582 |
| min | 2.0 | 0.05 |
| 25% | 37.0 | 0.1 |
| 50% | 62.5 | 0.15 |
| 75% | 95.0 | 0.2 |
| max | 330.0 | 0.25 |


## Conduct T-Test

In [45]:
results = stats.ttest_1samp(orders_with_discount_df['total_qty'], 
                            df_groupby_order_id['total_qty'].mean())

In [46]:
p_value = results.pvalue  # two-sided p-value of the one-sample t-test

In [47]:
p_value

3.114975153738426e-05

## Results

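Because the p-value (about 3.1e-05) is less than alpha, the null hypothesis is rejected in favor of the alternative hypothesis. Orders with at least one discounted product contain a significantly different number of items than orders overall: they average about 72.9 items, compared with about 61.8 items across all orders. The test establishes a difference in means; on its own it does not show that the discount caused the larger orders.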
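
As a rough cross-check, the test can be reproduced from the summary statistics printed by the two `describe()` calls in this notebook. The sketch assumes a conventional 5% significance level for `alpha`; the standardized difference `effect_size` is an illustrative addition that is not part of the original analysis.

```python
import math
import scipy.stats as stats

# Summary statistics copied from the describe() outputs in this notebook
n = 380                   # orders with at least one discount
sample_mean = 72.944737   # mean total_qty of discounted orders
sample_std = 51.403927    # sample standard deviation (ddof=1)
pop_mean = 61.827711      # mean total_qty of all 830 orders
alpha = 0.05              # assumed 5% significance level

t_stat = (sample_mean - pop_mean) / (sample_std / math.sqrt(n))
# Two-tailed p-value with n - 1 degrees of freedom
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)
reject_h0 = p_value < alpha
# Standardized mean difference (not part of the original analysis)
effect_size = (sample_mean - pop_mean) / sample_std

print(f"t = {t_stat:.3f}, p = {p_value:.3e}, reject H0: {reject_h0}")
print(f"mean difference = {sample_mean - pop_mean:.2f} items, d = {effect_size:.2f}")
```

The p-value computed this way should land close to the value returned by `stats.ttest_1samp`, with small differences due to rounding in the printed statistics.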