# README

In [1]:
import IPython.display as dp
import ipywidgets as widgets

def show_readme(file='README.md'):
    """Render the README file as Markdown inside the notebook."""
    with open(file, 'r') as f:
        readme = f.read()
    return dp.Markdown(readme)

show_readme()


# Module 3 - Final Project Specifications

## Introduction

In this lesson, we'll review all the guidelines and specifications for the final project for Module 3.

## Objectives

* Understand all required aspects of the Final Project for Module 3
* Understand all required deliverables
* Understand what constitutes a successful project

### Final Project Summary

Another module down--you're halfway there!

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-mod-3-project/master/halfway-there.gif'>

For the culmination of Module 3, you just need to complete the final project!

### The Project

For this project, you'll be working with the Northwind database--a free, open-source dataset created by Microsoft containing data from a fictional company. You probably remember the Northwind database from our section on Advanced SQL. Here's the schema for the Northwind database:

<img src='https://raw.githubusercontent.com/learn-co-curriculum/dsc-mod-3-project/master/Northwind_ERD_updated.png'>

The goal of this project is to test your ability to gather information from a real-world database and use your knowledge of statistical analysis and hypothesis testing to generate analytical insights that can be of value to the company.

## The Deliverables

The goal of your project is to query the database to get the data needed to perform a statistical analysis.  In this statistical analysis, you'll need to perform a hypothesis test (or perhaps several) to answer the following question:

**_Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?_**

In addition to answering this question with a hypothesis test, you will also need to come up with **_at least 3 other hypotheses to test on your own_**. These can be anything that you think could be important information for the company.

For each hypothesis, be sure to specify both the **_null hypothesis_** and the **_alternative hypothesis_** for your question. You should also specify whether this is a one-tailed or a two-tailed test.
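
As a rough illustration of how a null hypothesis, an alternative hypothesis, and the choice of tail map onto code, the sketch below runs a two-tailed and a one-tailed Welch's t-test. Everything in it is an assumption made up for the example: the data are synthetic (not Northwind data), the group names are arbitrary, and the `alternative` argument assumes SciPy 1.6 or later.

```python
import numpy as np
from scipy import stats

# Synthetic stand-ins for two groups of order quantities (not Northwind data)
rng = np.random.default_rng(42)
control = rng.normal(loc=20, scale=5, size=200)
treatment = rng.normal(loc=23, scale=5, size=200)

# H0: mean(treatment) == mean(control)
# H1 (two-tailed): mean(treatment) != mean(control)
two_tailed = stats.ttest_ind(treatment, control, equal_var=False)

# H1 (one-tailed): mean(treatment) > mean(control)
one_tailed = stats.ttest_ind(treatment, control, equal_var=False,
                             alternative='greater')

alpha = 0.05
print(f"two-tailed p = {two_tailed.pvalue:.4g}, reject H0: {two_tailed.pvalue < alpha}")
print(f"one-tailed p = {one_tailed.pvalue:.4g}, reject H0: {one_tailed.pvalue < alpha}")
```

When the observed difference points in the direction named by the one-tailed alternative, the one-tailed p-value is half the two-tailed one, which is why the tail should be chosen before looking at the data.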

For online students, there will be four deliverables for this project:

1. A **_Jupyter Notebook_** containing any code you've written for this project. This work will need to be pushed to your GitHub repository in order to submit your project.
2. An organized **README.md** file in the GitHub repository that describes the contents of the repository. This file should be the source of information for navigating through the repository.
3. A **_[Blog Post](https://github.com/learn-co-curriculum/dsc-welcome-blogging)_**.
4. An **_"Executive Summary" PowerPoint Presentation_** that explains the hypothesis tests you ran, your findings, and their relevance to company stakeholders.  

Note: On-campus students may have different deliverables; please speak with your instructor.

### Jupyter Notebook Must-Haves

For this project, your Jupyter Notebook should meet the following specifications:

**_Organization/Code Cleanliness_**

* The notebook should be well organized and easy to follow, with code commented where appropriate.  
<br>  
    * Level Up: The notebook contains well-formatted, professional looking markdown cells explaining any substantial code. All functions have docstrings that act as professional-quality documentation.  
<br>      
* The notebook is written for a technical audience and gives readers a way to both understand your approach and reproduce your results. The target audience for this deliverable is other data scientists looking to validate your findings.  
<br>    
* Any SQL code written to source data should also be included.  

**_Findings_**

* Your notebook should clearly show how you arrived at your results for each hypothesis test, including how you calculated your p-values.   
<br>
* You should also include any other statistics that you find relevant to your analysis, such as effect size.

### Blog Post Must-Haves

Refer back to the [Blogging Guidelines](https://github.com/learn-co-curriculum/dsc-welcome-blogging) for the technical requirements and blog ideas.


### Executive Summary Must-Haves

Your presentation should:

* Contain 5-10 professional-quality slides detailing:
<br>  
    * A high-level overview of your methodology  
    <br>  
    * The results of your hypothesis tests  
    <br>  
    * Any real-world recommendations you would like to make based on your findings (ask yourself--why should the executive team care about what you found? How can your findings help the company?)  
    <br>  
* Take no more than 5 minutes to present  
<br>  
* Avoid technical jargon and explain results in a clear, actionable way for non-technical audiences.  

## Grading Rubric 

Online students can find a PDF of the grading rubric for this project [here](https://github.com/learn-co-curriculum/dsc-mod-3-project/blob/master/module3_project_rubric.pdf). _Note: On-campus students may have different requirements; please speak with your instructor._


# STUDENT NOTEBOOK

- Name: 
- Cohort:
- Instructor:


## ERD

<img src="https://raw.githubusercontent.com/jirvingphd/dsc-mod-3-project-online-ds-ft-100719/master/Northwind_ERD_updated.png">

# Questions to Answer

- Question 1A: Do customers buy higher quantities of discounted products?
    - Question 1B: If so, at which level(s) of discount?
    
- Question 2:


- Question 3:

- Question 4:



# OBTAIN

In [2]:
!pip install -U fsds_100719
from fsds_100719.imports import *
from fsds_100719.ds.flatiron_stats import Cohen_d,find_outliers

fsds_1007219  v0.6.9 loaded.  Read the docs: https://fsds.readthedocs.io/en/latest/ 


| Handle | Package | Description |
|---|---|---|
| dp | IPython.display | Display modules with helpful display and clearing commands. |
| fs | fsds_100719 | Custom data science bootcamp student package |
| mpl | matplotlib | Matplotlib's base OOP module with formatting artists |
| plt | matplotlib.pyplot | Matplotlib's matlab-like plotting module |
| np | numpy | scientific computing with Python |
| pd | pandas | High performance data structures and tools |
| sns | seaborn | High-level data visualization library based on matplotlib |


['[i] Pandas .iplot() method activated.']


In [3]:
fs.ihelp(Cohen_d,0)

------------------------------------------------------------------------------------
------ SOURCE ----------------------------------------------------------------------


```python
def Cohen_d(group1, group2, correction = False):
    """Compute Cohen's d
    d = (group1.mean() - group2.mean()) / sqrt(pooled_variance)
    pooled_variance = (n1 * var1 + n2 * var2) / (n1 + n2)

    Args:
        group1 (Series or NumPy array): group 1 for calculating d
        group2 (Series or NumPy array): group 2 for calculating d
        correction (bool): Apply equation correction if N<50. Default is False. 
    Returns:
        d (float): calculated d value
         
    INTERPRETATION OF COHEN's D: 
    > Small effect = 0.2
    > Medium Effect = 0.5
    > Large Effect = 0.8
    """
    import numpy as np
    N = len(group1)+len(group2)
    diff = group1.mean() - group2.mean()

    n1, n2 = len(group1), len(group2)
    var1 = group1.var()
    var2 = group2.var()

    # Calculate the pooled variance as shown in the docstring
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    
    # Calculate Cohen's d statistic
    d = diff / np.sqrt(pooled_var)
    
    ## Apply correction if needed
    if (N < 50) & (correction==True):
        d=d * ((N-3)/(N-2.25))*np.sqrt((N-2)/N)
    
    return d

```
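
The pooled variance in `Cohen_d` above weights each group's variance by `n`. For comparison, many textbooks pool with `n - 1` weights instead; a minimal sketch of that variant follows. The function name `cohen_d_pooled` is hypothetical and is not part of `fsds_100719`. It also fixes `ddof=1` explicitly, because `.var()` defaults to the sample variance on a pandas Series but to the population variance on a NumPy array.

```python
import numpy as np

def cohen_d_pooled(group1, group2):
    """Cohen's d using the (n - 1)-weighted pooled standard deviation."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    # ddof=1 gives the sample variance regardless of the input type
    var1, var2 = g1.var(ddof=1), g2.var(ddof=1)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Tiny made-up groups, only to show the call
d = cohen_d_pooled([2, 4, 6, 8], [1, 3, 5, 7])
print(round(d, 3))
```

For groups as large as the ones in this project, the two weightings should give nearly identical values; the difference matters mostly for small samples.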

In [4]:
import sqlite3
connect = sqlite3.connect('Northwind_small.sqlite')
cur = connect.cursor()

In [5]:
cur.execute("""SELECT name FROM sqlite_master WHERE type='table';""")
df_tables = pd.DataFrame(cur.fetchall(), columns=['Table'])
df_tables

|   | Table |
|---|---|
| 0 | Employee |
| 1 | Category |
| 2 | Customer |
| 3 | Shipper |
| 4 | Supplier |
| 5 | Order |
| 6 | Product |
| 7 | OrderDetail |
| 8 | CustomerCustomerDemo |
| 9 | CustomerDemographic |
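
Once the table names are known, SQLite's `PRAGMA table_info` can list a table's columns before any rows are pulled. The sketch below uses a throwaway in-memory database with a made-up two-column table so that it runs without `Northwind_small.sqlite`; the helper name `get_columns` and the `demo_` variables are assumptions for illustration only.

```python
import sqlite3

def get_columns(cur, table):
    """Return the column names of `table` using PRAGMA table_info."""
    # Double quotes let reserved words such as Order be used as table names
    cur.execute(f'PRAGMA table_info("{table}");')
    # Each result row is (cid, name, type, notnull, dflt_value, pk)
    return [row[1] for row in cur.fetchall()]

# In-memory stand-in for the real database file
demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute('CREATE TABLE "Order" (Id INTEGER PRIMARY KEY, CustomerId TEXT);')

columns = get_columns(demo_cur, 'Order')
print(columns)
demo_conn.close()
```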


In [6]:
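def get_table(cur, table='employee'):
    """Return every row of `table` as a DataFrame with the table's column names."""
    # Quote the name so reserved words such as Order are valid identifiers
    cur.execute(f'SELECT * FROM "{table}";')
    columns = [desc[0] for desc in cur.description]
    return pd.DataFrame(cur.fetchall(), columns=columns)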
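
pandas can also build a DataFrame directly from a query with `pd.read_sql_query`, which takes the connection rather than the cursor and fills in the column names itself. The sketch below is an alternative under stated assumptions, not the notebook's method: it uses an in-memory table with a few invented rows in place of `Northwind_small.sqlite`, and the `demo_` names are hypothetical.

```python
import sqlite3
import pandas as pd

# In-memory stand-in; the rows are invented for the example
demo_conn = sqlite3.connect(':memory:')
demo_conn.execute('CREATE TABLE OrderDetail (Id TEXT, Quantity INTEGER, Discount REAL);')
demo_conn.executemany(
    'INSERT INTO OrderDetail VALUES (?, ?, ?);',
    [('1/1', 12, 0.0), ('1/2', 10, 0.05), ('2/1', 5, 0.0)],
)

# Filtering in SQL keeps the query itself visible in the notebook
query = 'SELECT Id, Quantity, Discount FROM OrderDetail WHERE Discount > ?;'
df_demo = pd.read_sql_query(query, demo_conn, params=(0,))
print(df_demo)
demo_conn.close()
```

Passing values through `params` instead of formatting them into the string lets the database driver handle quoting.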

In [7]:
df_employees = get_table(cur)
df_employees

|   | Id | LastName | FirstName | Title | TitleOfCourtesy | BirthDate | HireDate | Address | City | Region | PostalCode | Country | HomePhone | Extension | Photo | Notes | ReportsTo | PhotoPath |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Davolio | Nancy | Sales Representative | Ms. | 1980-12-08 | 2024-05-01 | 507 - 20th Ave. E. Apt. 2A | Seattle | North America | 98122 | USA | (206) 555-9857 | 5467 |  | Education includes a BA in psychology from Col... | 2.0 | http://accweb/emmployees/davolio.bmp |
| 1 | 2 | Fuller | Andrew | Vice President, Sales | Dr. | 1984-02-19 | 2024-08-14 | 908 W. Capital Way | Tacoma | North America | 98401 | USA | (206) 555-9482 | 3457 |  | Andrew received his BTS commercial in 1974 and... |  | http://accweb/emmployees/fuller.bmp |
| 2 | 3 | Leverling | Janet | Sales Representative | Ms. | 1995-08-30 | 2024-04-01 | 722 Moss Bay Blvd. | Kirkland | North America | 98033 | USA | (206) 555-3412 | 3355 |  | Janet has a BS degree in chemistry from Boston... | 2.0 | http://accweb/emmployees/leverling.bmp |
| 3 | 4 | Peacock | Margaret | Sales Representative | Mrs. | 1969-09-19 | 2025-05-03 | 4110 Old Redmond Rd. | Redmond | North America | 98052 | USA | (206) 555-8122 | 5176 |  | Margaret holds a BA in English literature from... | 2.0 | http://accweb/emmployees/peacock.bmp |
| 4 | 5 | Buchanan | Steven | Sales Manager | Mr. | 1987-03-04 | 2025-10-17 | 14 Garrett Hill | London | British Isles | SW1 8JR | UK | (71) 555-4848 | 3453 |  | Steven Buchanan graduated from St. Andrews Uni... | 2.0 | http://accweb/emmployees/buchanan.bmp |
| 5 | 6 | Suyama | Michael | Sales Representative | Mr. | 1995-07-02 | 2025-10-17 | Coventry House Miner Rd. | London | British Isles | EC2 7JR | UK | (71) 555-7773 | 428 |  | Michael is a graduate of Sussex University (MA... | 5.0 | http://accweb/emmployees/davolio.bmp |
| 6 | 7 | King | Robert | Sales Representative | Mr. | 1992-05-29 | 2026-01-02 | Edgeham Hollow Winchester Way | London | British Isles | RG1 9SP | UK | (71) 555-5598 | 465 |  | Robert King served in the Peace Corps and trav... | 5.0 | http://accweb/emmployees/davolio.bmp |
| 7 | 8 | Callahan | Laura | Inside Sales Coordinator | Ms. | 1990-01-09 | 2026-03-05 | 4726 - 11th Ave. N.E. | Seattle | North America | 98105 | USA | (206) 555-1189 | 2344 |  | Laura received a BA in psychology from the Uni... | 2.0 | http://accweb/emmployees/davolio.bmp |
| 8 | 9 | Dodsworth | Anne | Sales Representative | Ms. | 1998-01-27 | 2026-11-15 | 7 Houndstooth Rd. | London | British Isles | WG2 7LT | UK | (71) 555-4444 | 452 |  | Anne has a BA degree in English from St. Lawre... | 5.0 | http://accweb/emmployees/davolio.bmp |


In [51]:
## StatFactory Framework
import statsmodels.api as sms
import statsmodels.formula.api as smf
import scipy.stats as stats
import numpy as np
import pandas as pd


class TTester():
    """A class to test the assumptions of hypothesis testing
    and to perform required processing."""

    def __init__(self, verbose=True, show_help=True):
        self._verbose = verbose
        self._assumptions = None
        if show_help:
            self.help()

    def help(self):
        """Display hypothesis testing assumptions and workflow."""
        workflow = """
        1. Instantiate a TTester
        >> tester = TTester()
        2. Fit data
        - If using series/arrays:
        >> tester.fit([group1, group2])
        - If using a df and group/target cols:
        >> tester.fit_df(df, group_col='discounted', target_col='Quantity')

        3. Test Assumptions
        >> tester.check_assumptions()"""
        print(workflow)

    def fit(self, group_vars, names=None):
        """Fit data provided as a list of separate arrays/series."""
        if names is None:
            names = [f"group{i+1}" for i in range(len(group_vars))]

        self._data = list(group_vars)
        self._group_names = list(names)
        self._n = [len(x) for x in self._data]
        self._mean = [np.mean(x) for x in self._data]
        self.group_data = dict(zip(self._group_names, self._data))

    def fit_df(self, df, group_col, target_col):
        """Fit a DataFrame using a grouping column and a target column."""
        # Keep only the target values for each group
        data = {name: grp[target_col].values
                for name, grp in df.groupby(group_col)}
        self.fit(list(data.values()), names=list(data.keys()))

    def test_normality(self, data):
        """Run scipy's normaltest (D'Agostino-Pearson) on one group."""
        return stats.normaltest(data)

    def test_equal_variance(self):
        """Run Levene's test for equal variance across all fitted groups."""
        return stats.levene(*self._data)

    def check_assumptions(self, summary=True):
        """Test normality per group and equal variance across groups."""
        rows = []
        for name, data in zip(self._group_names, self._data):
            stat, p = self.test_normality(data)
            rows.append(['Normality', name, stat, p, p < .05])

        stat, p = self.test_equal_variance()
        rows.append(['Equal Variance', 'All', stat, p, p < .05])

        self._assumptions = pd.DataFrame(
            rows, columns=['Assumption', 'Group', 'Stat', 'p', 'p<.05'])
        if summary:
            return self.summary()

    def summary(self):
        """Return the table built by check_assumptions()."""
        return self._assumptions




## Testing Code

In [52]:
df_orderDetails = get_table(cur,'orderDetail')

In [53]:
df_orderDetails['discounted'] =(df_orderDetails['Discount']>0).map({False:0,True:1})
print(df_orderDetails['discounted'].value_counts())
df_orderDetails

0    1317
1     838
Name: discounted, dtype: int64


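|   | Id | OrderId | ProductId | UnitPrice | Quantity | Discount | discounted |
|---|---|---|---|---|---|---|---|
| 0 | 10248/11 | 10248 | 11 | 14.00 | 12 | 0.00 | 0 |
| 1 | 10248/42 | 10248 | 42 | 9.80 | 10 | 0.00 | 0 |
| 2 | 10248/72 | 10248 | 72 | 34.80 | 5 | 0.00 | 0 |
| 3 | 10249/14 | 10249 | 14 | 18.60 | 9 | 0.00 | 0 |
| 4 | 10249/51 | 10249 | 51 | 42.40 | 40 | 0.00 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2150 | 11077/64 | 11077 | 64 | 33.25 | 2 | 0.03 | 1 |
| 2151 | 11077/66 | 11077 | 66 | 17.00 | 1 | 0.00 | 0 |
| 2152 | 11077/73 | 11077 | 73 | 15.00 | 2 | 0.01 | 1 |
| 2153 | 11077/75 | 11077 | 75 | 7.75 | 4 | 0.00 | 0 |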


In [54]:

quant_disc = df_orderDetails.groupby('discounted').get_group(1)['Quantity'].values
quant_no_disc = df_orderDetails.groupby('discounted').get_group(0)['Quantity'].values
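
The binary `discounted` split addresses Question 1A. For Question 1B, one possible approach is to compare each discount level against the undiscounted orders and correct for the number of comparisons. The sketch below does this with Welch's t-test and a Bonferroni correction on synthetic data. The column names mirror `df_orderDetails`, but the values, the helper name `compare_levels`, and the choice of test and correction are all assumptions, not the project's prescribed method.

```python
import numpy as np
import pandas as pd
from scipy import stats

def compare_levels(df, level_col='Discount', target_col='Quantity', alpha=0.05):
    """Welch's t-test of each nonzero level against level 0, Bonferroni-corrected."""
    baseline = df.loc[df[level_col] == 0, target_col]
    levels = sorted(lvl for lvl in df[level_col].unique() if lvl != 0)
    corrected_alpha = alpha / len(levels)  # Bonferroni: split alpha across tests
    rows = []
    for lvl in levels:
        group = df.loc[df[level_col] == lvl, target_col]
        stat, p = stats.ttest_ind(group, baseline, equal_var=False)
        rows.append({'level': lvl, 'n': len(group), 't': stat, 'p': p,
                     'significant': p < corrected_alpha})
    return pd.DataFrame(rows)

# Synthetic data: only the 0.10 level is given a higher mean quantity
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'Discount': np.repeat([0.0, 0.05, 0.10], 300),
    'Quantity': np.concatenate([
        rng.normal(20, 5, 300),   # no discount
        rng.normal(20, 5, 300),   # 5% discount, same mean as the baseline
        rng.normal(30, 5, 300),   # 10% discount, shifted mean
    ]),
})
results = compare_levels(demo)
print(results)
```

With real data, levels that contain only a handful of orders may need to be dropped or merged before testing, since a t-test on very small groups carries little weight.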

In [55]:
tester = TTester(show_help=False)  # skip the workflow help text

In [56]:
# tester._data[0]

In [57]:
tester.fit([quant_disc,quant_no_disc])

In [58]:
tester

<__main__.TTester at 0x1c1e9695c0>

In [62]:
tester._data[1]

array([12, 10,  5, ...,  1,  4,  2])