# Mod 2 Code Challenge: Northwind

## Hypothesis Test Setup

The business question we are trying to answer is:
> Is the mean quantity of a product ordered greater when the product is discounted, compared to when it is not?

### Hypotheses

1. **Null hypothesis:** the mean quantity ordered of a product when it is discounted is less than or equal to the mean quantity ordered when it is not discounted
2. **Alternative hypothesis:** the mean quantity ordered of a product when it is discounted is greater than the mean quantity ordered when it is not discounted
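
The same pair of hypotheses can be written in symbols. The notation is introduced here for illustration and is not part of the original challenge: $\mu_d$ stands for the population mean quantity ordered when a product is discounted, and $\mu_n$ for the population mean quantity ordered when it is not.

```latex
\begin{aligned}
H_0 &: \mu_d \le \mu_n \\
H_a &: \mu_d > \mu_n
\end{aligned}
```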

### Types of Errors

1. **Type I:** a Type I error would mean rejecting the null hypothesis when the null hypothesis is true. In this case, that would mean concluding that the mean quantity ordered is greater for discounted products when it actually is not.
2. **Type II:** a Type II error would mean failing to reject the null hypothesis when the null hypothesis is false. In this case, that would mean failing to detect that the mean quantity ordered is greater for discounted products when it actually is.
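
To make the Type I error concrete, the following is a minimal simulation sketch. It assumes normally distributed toy data (not Northwind data) in which both groups share the same mean, so the null hypothesis is true by construction. The sample sizes, means, seed, and number of trials are arbitrary choices for illustration. The fraction of trials that reject the null should land close to the chosen significance level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable
alpha = 0.05
n_trials = 2000

false_rejections = 0
for _ in range(n_trials):
    # Both samples come from the same distribution, so the null hypothesis is true.
    a = rng.normal(loc=20, scale=10, size=50)
    b = rng.normal(loc=20, scale=10, size=50)
    t_stat, p_two_sided = stats.ttest_ind(a, b, equal_var=False)
    # Convert to a one-tailed p-value for the alternative "mean of a > mean of b".
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    if p_one_sided < alpha:
        false_rejections += 1  # rejecting a true null is a Type I error

type_1_rate = false_rejections / n_trials
print(type_1_rate)
```

Every rejection counted here is a Type I error, which is why the significance level is often described as the Type I error rate we are willing to tolerate.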

## Importing and Preprocessing Data

In [1]:
import sqlite3
import pandas as pd
import numpy as np
import flatiron_stats
import scipy.stats as stats

In [2]:
conn = sqlite3.connect('northwind_db.sqlite')

The first approach (probably the most common) is to select everything with SQL, then separate discounted from non-discounted rows in pandas:

In [3]:
order_details = pd.read_sql("SELECT * FROM OrderDetail;", conn)
order_details.head()

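| | Id | OrderId | ProductId | UnitPrice | Quantity | Discount |
|---|---|---|---|---|---|---|
| 0 | 10248/11 | 10248 | 11 | 14.0 | 12 | 0.0 |
| 1 | 10248/42 | 10248 | 42 | 9.8 | 10 | 0.0 |
| 2 | 10248/72 | 10248 | 72 | 34.8 | 5 | 0.0 |
| 3 | 10249/14 | 10249 | 14 | 18.6 | 9 | 0.0 |
| 4 | 10249/51 | 10249 | 51 | 42.4 | 40 | 0.0 |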


In [4]:
discounted_df = order_details[order_details["Discount"] > 0]
not_discounted_df = order_details[order_details["Discount"] == 0]

In [5]:
discounted_1 = discounted_df["Quantity"]
not_discounted_1 = not_discounted_df["Quantity"]

The second approach (more likely for students who are not yet comfortable with pandas) is to write more specific SQL queries and not use pandas at all:

In [6]:
cur = conn.cursor()

In [7]:
discounted_query = """
SELECT Quantity
FROM OrderDetail
WHERE Discount > 0
;
"""

In [8]:
cur.execute(discounted_query)
discounted_2 = [x[0] for x in cur.fetchall()]

In [9]:
not_discounted_query = """
SELECT Quantity
FROM OrderDetail
WHERE Discount = 0
;
"""

In [10]:
cur.execute(not_discounted_query)
not_discounted_2 = [x[0] for x in cur.fetchall()]

Evidence that the two approaches produce the same sets of numbers:

In [11]:
np.array_equal(discounted_1, discounted_2)

True

In [12]:
np.array_equal(not_discounted_1, not_discounted_2)

True

Since both approaches give identical values, either can be used from here on. We keep the first:

In [13]:
discounted = discounted_1
not_discounted = not_discounted_1

It is good practice to close the connection once the data has been loaded:

In [14]:
conn.close()
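
As an alternative to calling `close()` by hand, the connection can be wrapped in `contextlib.closing`, which closes it even if a query raises an exception. The sketch below uses a throwaway in-memory database and the hypothetical name `demo_conn` so that it runs on its own. Note that using a `sqlite3` connection directly in a `with` statement only manages the transaction (commit or rollback) and does not close the connection.

```python
import sqlite3
from contextlib import closing

# ":memory:" is a throwaway in-memory database used only for illustration;
# the notebook itself connects to 'northwind_db.sqlite'.
with closing(sqlite3.connect(":memory:")) as demo_conn:
    demo_conn.execute("CREATE TABLE OrderDetail (Quantity INTEGER, Discount REAL)")
    demo_conn.execute("INSERT INTO OrderDetail VALUES (12, 0.0)")
    quantities = [row[0] for row in demo_conn.execute("SELECT Quantity FROM OrderDetail")]

# The connection is closed at this point; using it again raises an error.
try:
    demo_conn.execute("SELECT 1")
    connection_closed = False
except sqlite3.ProgrammingError:
    connection_closed = True

print(quantities, connection_closed)
```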

Check whether the sample mean quantity is greater for discounted products:

In [15]:
discounted.mean()

27.10978520286396

In [16]:
not_discounted.mean()

21.715261958997722
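
The two group means could also be computed directly in SQL with `AVG` and `GROUP BY`, which is a different route from the one taken in this notebook. The sketch below is self-contained: it builds a hypothetical toy `OrderDetail` table in memory (made-up rows, not the Northwind data) and uses the illustrative names `toy_conn`, `IsDiscounted`, and `MeanQuantity`.

```python
import sqlite3

# Hypothetical toy rows (Quantity, Discount) standing in for the real table.
rows = [(12, 0.0), (10, 0.0), (5, 0.0), (40, 0.15), (30, 0.05)]

toy_conn = sqlite3.connect(":memory:")
toy_conn.execute("CREATE TABLE OrderDetail (Quantity INTEGER, Discount REAL)")
toy_conn.executemany("INSERT INTO OrderDetail VALUES (?, ?)", rows)

# In SQLite, the comparison Discount > 0 evaluates to 1 (true) or 0 (false),
# so grouping on it splits the rows into discounted and non-discounted.
query = """
SELECT Discount > 0 AS IsDiscounted, AVG(Quantity) AS MeanQuantity
FROM OrderDetail
GROUP BY Discount > 0
ORDER BY IsDiscounted
;
"""
group_means = dict(toy_conn.execute(query).fetchall())
toy_conn.close()

print(group_means)  # {0: 9.0, 1: 35.0}
```

This gives the means only. The hypothesis test still needs the individual quantities, which is why the notebook selects the raw `Quantity` values.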

## Hypothesis Test Execution

Given array-like variables `discounted` and `not_discounted` and the business question:
> Is the mean quantity of a product ordered greater when the product is discounted, compared to when it is not?

An appropriate test would be a one-tailed, two-sample Welch's t-test.  A two-sample t-test is used for comparing means, which is relevant to the business question.  It is one-tailed rather than two-tailed because we are specifically asking whether the mean quantity is *greater* for discounted products, not just whether it is *different*.  Welch's t-test is specifically appropriate because it does not assume equal population variance.
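
To show what such a test computes, here is a minimal sketch of a one-tailed Welch's t-test written with NumPy and `scipy.stats.t`. It is not the implementation in `flatiron_stats.py`, which is not shown in this notebook. The function name `welch_t_sketch` and the toy samples are assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def welch_t_sketch(a, b):
    """Return (t statistic, degrees of freedom, one-tailed p-value) for H_a: mean(a) > mean(b)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Unbiased sample variance of each group (ddof=1) divided by its sample size.
    va = a.var(ddof=1) / a.size
    vb = b.var(ddof=1) / b.size
    # Difference in sample means scaled by its estimated standard error.
    t_stat = (a.mean() - b.mean()) / np.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (a.size - 1) + vb ** 2 / (b.size - 1))
    # Upper-tail probability, because the alternative is "greater".
    p_one_tailed = stats.t.sf(t_stat, df)
    return t_stat, df, p_one_tailed


# Hypothetical toy samples, not Northwind data.
toy_discounted = [30, 25, 40, 35, 20, 45]
toy_not_discounted = [20, 22, 18, 25, 21, 19, 23]
t_stat, df, p_one_tailed = welch_t_sketch(toy_discounted, toy_not_discounted)
print(t_stat, df, p_one_tailed)
```

Because the two variances are kept separate instead of being pooled, the statistic stays valid when the groups differ in spread or in size.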

## Code

The first approach is to use the functions from `flatiron_stats.py`:

In [17]:
t_statistic_1 = flatiron_stats.welch_t(discounted, not_discounted)
t_statistic_1

6.239069142123973

In [18]:
p_value_1 = flatiron_stats.p_value_welch_ttest(discounted, not_discounted)
p_value_1

2.8282065578366655e-10

The second approach is to use `scipy.stats`:

In [19]:
t_statistic_2, p_value_2 = stats.ttest_ind(discounted, not_discounted, equal_var=False)

In [20]:
t_statistic_2

6.239069142123973

In [21]:
# SciPy returns a two-tailed p-value by default; halving it gives the one-tailed
# value here because the t-statistic is positive (the direction of the alternative)
p_value_2 / 2

2.828207145152165e-10
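
Halving the two-tailed p-value is only valid when the t-statistic points in the direction of the alternative hypothesis. In SciPy 1.6.0 and later, `ttest_ind` accepts an `alternative` argument that requests the one-tailed test directly. The sketch below assumes such a SciPy version and uses hypothetical toy samples, not the Northwind data, to show that the two routes agree when the t-statistic is positive.

```python
import numpy as np
from scipy import stats

# Hypothetical toy samples, not Northwind data.
toy_discounted = [30, 25, 40, 35, 20, 45]
toy_not_discounted = [20, 22, 18, 25, 21, 19, 23]

# Two-tailed test (the default), to be halved by hand.
t_two, p_two = stats.ttest_ind(toy_discounted, toy_not_discounted, equal_var=False)

# One-tailed test requested directly; requires SciPy 1.6.0 or later.
t_one, p_one = stats.ttest_ind(
    toy_discounted, toy_not_discounted, equal_var=False, alternative="greater"
)

print(np.isclose(p_one, p_two / 2))  # True, because the t-statistic is positive
```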

Evidence that the two approaches produce the same statistics:

In [22]:
np.allclose([t_statistic_1, p_value_1], [t_statistic_2, p_value_2 / 2])

True

Since both approaches give matching statistics, either can be used from here on. We keep the first:

In [23]:
t_statistic = t_statistic_1
p_value = p_value_1

### Result

With a significance level of alpha = 0.05 and a one-tailed p-value of about 2.8e-10, far below alpha, we reject the null hypothesis in favor of the alternative hypothesis. In other words, we have statistically significant evidence that the mean quantity ordered is higher when products are discounted than when they are not.
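
The decision rule behind this conclusion can be sketched in a few lines. The p-value is copied from the `flatiron_stats` output earlier in the notebook so that the snippet runs on its own. The variable name `reject_null` is introduced here for illustration.

```python
alpha = 0.05
p_value = 2.8282065578366655e-10  # one-tailed p-value reported earlier in the notebook

# Reject the null hypothesis only when the p-value falls below the significance level.
reject_null = p_value < alpha
print("Reject the null hypothesis" if reject_null else "Fail to reject the null hypothesis")
```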