# Region Sales Performance Hypothesis Test

# Imports and Constants

In [1]:
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import scipy.stats as stats

In [2]:
DB_NAME = 'Northwind_small.sqlite'
RANDOM_STATE = 42

# Connect to Database

In [3]:
conn = sqlite3.connect(DB_NAME)
cur = conn.cursor()

# Is any region doing significantly better than the combination of regions when considering order totals?

## Hypothesis

H0 = For each region, the mean order total is equal to the mean order total of the population (all regions combined)

HA = For Region x, the mean order total is significantly different from the population mean

## Set significance level

In [4]:
alpha = 0.05

## Test Type

One-tailed, one-sample t-test: each region's order totals are compared against the mean order total of all regions.

## Query database for data

In [5]:
q = """
    WITH OrderTotals AS (SELECT `Order`.ID AS OrderID,
                                `Order`.CustomerID,
                                `Order`.EmployeeID,
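                                -- Discounted lines are multiplied by Discount itself, so they contribute
                                -- the discount amount, not the net price UnitPrice * Quantity * (1 - Discount).
                                -- All outputs shown in this notebook were produced with this expression.
                                SUM (CASE 
                                        WHEN OrderDetail.Discount = 0.0
                                        THEN OrderDetail.UnitPrice * OrderDetail.Quantity
                                        ELSE OrderDetail.UnitPrice * OrderDetail.Quantity * OrderDetail.Discount
                                     END) AS OrderTotal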
                         FROM `Order`
                         JOIN OrderDetail
                         ON `Order`.ID = OrderDetail.OrderID
                         GROUP BY `Order`.ID )
    
    SELECT  DISTINCT Region.ID AS RegionId,
                     Region.RegionDescription,
                     OrderTotals.OrderID,
                     OrderTotals.OrderTotal
    FROM Region
    JOIN Territory
    ON Region.ID = Territory.RegionID
    JOIN EmployeeTerritory
    ON Territory.ID = EmployeeTerritory.TerritoryID
    JOIN Employee
    ON EmployeeTerritory.EmployeeID = Employee.ID
    JOIN OrderTotals
    ON Employee.ID = OrderTotals.EmployeeID
    
    """
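The `CASE` expression in the query multiplies discounted lines by `Discount` itself, which gives the discount amount rather than the discounted price. If the intent is net revenue per order (an assumption on my part, not something the notebook states), a single expression `UnitPrice * Quantity * (1 - Discount)` covers both discounted and undiscounted lines. The sketch below contrasts the two on a hypothetical in-memory table; the table contents and the names `conn_demo`, `as_written`, and `net_total` are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory table with the same columns the query uses.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "CREATE TABLE OrderDetail "
    "(OrderID INTEGER, UnitPrice REAL, Quantity INTEGER, Discount REAL)"
)
conn_demo.executemany(
    "INSERT INTO OrderDetail VALUES (?, ?, ?, ?)",
    [
        (1, 10.0, 2, 0.0),   # no discount: 20.00 gross and net
        (1, 20.0, 5, 0.25),  # 25% off 100.00: 75.00 net, 25.00 discount amount
    ],
)

# Expression as written in the notebook query.
as_written = conn_demo.execute(
    """
    SELECT SUM(CASE
                   WHEN Discount = 0.0 THEN UnitPrice * Quantity
                   ELSE UnitPrice * Quantity * Discount
               END)
    FROM OrderDetail
    WHERE OrderID = 1
    """
).fetchone()[0]

# Net-revenue alternative (assumed intent).
net_total = conn_demo.execute(
    "SELECT SUM(UnitPrice * Quantity * (1 - Discount)) "
    "FROM OrderDetail WHERE OrderID = 1"
).fetchone()[0]

print(as_written, net_total)
conn_demo.close()
```

Swapping the expression into the notebook query would change every order total that includes a discounted line, so the t-tests would need to be rerun before drawing conclusions.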

In [7]:
df = pd.DataFrame(cur.execute(q).fetchall(), 
                  columns=[description[0] for description in cur.description])

In [8]:
df.head()

| | RegionId | RegionDescription | OrderID | OrderTotal |
|---|---|---|---|---|
| 0 | 1 | Eastern | 10248 | 440.0 |
| 1 | 2 | Western | 10249 | 1863.4 |
| 2 | 1 | Eastern | 10250 | 337.4 |
| 3 | 4 | Southern | 10251 | 352.74 |
| 4 | 1 | Eastern | 10252 | 1220.1 |


## Conduct T-Tests

In [9]:
pop_mean = df['OrderTotal'].mean()

In [14]:
for region in df['RegionId'].unique():
    print(f'Region {region}')
    print(stats.ttest_1samp(df[df['RegionId'] == region]['OrderTotal'], pop_mean))
    print('')

Region 1
Ttest_1sampResult(statistic=-0.14431850328182574, pvalue=0.8853188562574451)

Region 2
Ttest_1sampResult(statistic=-2.0234843004504097, pvalue=0.04495398800582793)

Region 4
Ttest_1sampResult(statistic=1.3401796046571586, pvalue=0.18259866089274754)

Region 3
Ttest_1sampResult(statistic=-0.017467687922363348, pvalue=0.9860873583063376)
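For reference, the statistic reported by a one-sample t-test is the distance between the sample mean and the reference mean, measured in standard errors. The sketch below computes it by hand with the standard library; the sample values and the names `one_sample_t`, `demo_totals`, and `t_demo` are hypothetical and not taken from the Northwind data.

```python
import math
import statistics

def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (sample std dev / sqrt(n))."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    return (statistics.mean(sample) - mu0) / standard_error

# Made-up order totals for one region, compared against a made-up population mean.
demo_totals = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
t_demo = one_sample_t(demo_totals, mu0=6.0)
print(t_demo)  # negative: the sample mean (5.0) is below mu0 (6.0)
```

A negative statistic means the region's mean is below the reference mean, which is how the sign of `statistic` in the output should be read.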



I can reject H0 for a region if its two-sided p-value is less than alpha (0.05).

Region 2 is the only region that meets this criterion (p ≈ 0.045).

**Is Region 2 doing significantly better or worse?**

* For better, I can reject H0 if p/2 < alpha and t > 0
* For worse, I can reject H0 if p/2 < alpha and t < 0

Region 2 is significantly worse than the population because its test statistic is negative (t ≈ -2.02) and p/2 < alpha (~0.022 < 0.05).
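The p/2 rule can be sketched as a small helper applied to the four results printed earlier. The function name `one_tailed_verdict` and the `results` dictionary are my own scaffolding; the t statistics and p-values are copied from the output of the one-sample tests.

```python
def one_tailed_verdict(t_stat, p_two_sided, alpha=0.05):
    """Classify a two-sided t-test result using the p/2 rule."""
    if p_two_sided / 2 < alpha:
        # The sign of t gives the direction of the difference.
        return "better" if t_stat > 0 else "worse"
    return "not significant"

# (t statistic, two-sided p-value) per RegionId, as printed by ttest_1samp.
results = {
    1: (-0.14431850328182574, 0.8853188562574451),
    2: (-2.0234843004504097, 0.04495398800582793),
    4: (1.3401796046571586, 0.18259866089274754),
    3: (-0.017467687922363348, 0.9860873583063376),
}

verdicts = {region: one_tailed_verdict(t, p) for region, (t, p) in results.items()}
for region, verdict in verdicts.items():
    print(f"Region {region}: {verdict}")
```

Only Region 2 passes the threshold, and its negative t statistic places it on the "worse" side.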

## Conclusion

When considering order totals, Region 2 is underperforming: its mean order total is significantly lower than that of the population.

# Region 3 and Region 4 are the Best Performing Regions with Respect to Mean Order Totals. Are They Performing Significantly Differently?

## Hypothesis

H0 = There is no difference between the two regions' mean order totals

HA = There is a difference between the two regions' mean order totals

## T-Test Type

Two Tailed - Two Sample Independent T-Test

## Set Significance Level

In [20]:
alpha = 0.05

## Perform T-Test

In [21]:
stats.ttest_ind(df[df['RegionId'] == 3]['OrderTotal'], 
                df[df['RegionId'] == 4]['OrderTotal'], 
                equal_var=False)

Ttest_indResult(statistic=-1.126701299848741, pvalue=0.2610755870536371)
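Passing `equal_var=False` selects Welch's t-test, which does not assume the two groups share a variance. As a rough illustration of what that statistic measures, the sketch below computes Welch's t and the Welch-Satterthwaite degrees of freedom by hand. The samples and the names `welch_t`, `sample_a`, and `sample_b` are hypothetical, not the Region 3 and Region 4 data.

```python
import math
import statistics

def welch_t(a, b):
    """Return Welch's t statistic and approximate degrees of freedom."""
    # Squared standard error of each sample mean.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t_stat = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t_stat, dof

# Made-up samples with clearly different spreads.
sample_a = [1.0, 2.0, 3.0, 4.0, 5.0]
sample_b = [2.0, 4.0, 6.0, 8.0, 10.0]
t_stat, dof = welch_t(sample_a, sample_b)
print(t_stat, dof)
```

The degrees of freedom are generally not an integer, which is expected for Welch's test when the group variances differ.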

## Conclusion

I fail to reject H0 because p (≈ 0.26) is greater than alpha (0.05); therefore, I cannot say that Regions 3 and 4 differ significantly in order totals.