# Exploratory Data Analysis

This notebook summarizes discoveries about the dataset that influenced
how it was normalized. I moved this code out of the notebook that creates
and fills the normalized tables because that notebook was getting
crowded.


In [1]:
%load_ext sql
%sql postgres://localhost/ex_superstore_normalize

'Connected: @ex_superstore_normalize'

## What is unique to a given product?

In [2]:
%%sql
WITH multicheck AS (
    SELECT 
        COUNT(DISTINCT product_category) cat_count,
        COUNT(DISTINCT product_subcategory) subcat_count,
        COUNT(DISTINCT product_container) cont_count,
        product_name,
        COUNT(DISTINCT product_base_margin) base_count,
        COUNT(DISTINCT unit_price) price_count
    FROM
        orders
    GROUP BY 
        product_name
)
SELECT 
    *
FROM 
    multicheck
WHERE
        cat_count > 1
    OR
        subcat_count > 1
    OR
        cont_count > 1
    OR
        base_count > 1
    OR
        price_count > 1
    ;

 * postgres://localhost/ex_superstore_normalize
14 rows affected.


cat_count,subcat_count,cont_count,product_name,base_count,price_count
1,1,1,Adesso Programmable 142-Key Keyboard,2,1
1,1,1,"Bevis Round Bullnose 29"" High Table Top",2,1
1,1,1,Bevis Round Conference Table Top & Single Column Base,2,1
1,1,1,"Bevis Round Conference Table Top, X-Base",2,1
1,1,1,BoxOffice By Design Rectangular and Half-Moon Meeting Room Tables,2,1
1,1,1,Bretford CR8500 Series Meeting Room Furniture,2,1
1,1,1,Bush Advantage Collection® Round Conference Table,2,1
1,1,1,"Fellowes Basic 104-Key Keyboard, Platinum",2,1
1,1,1,"Fellowes Smart Design 104-Key Enhanced Keyboard, PS/2 Adapter, Platinum",2,1
1,1,1,Keytronic 105-Key Spanish Keyboard,2,1


For each product name, there is exactly one category, subcategory, and container. However, a single product name can have multiple unit_price and product_base_margin values.
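
The check above can be wrapped in a small helper that lists every key value mapped to more than one distinct value of another column. This is only a minimal sketch: it runs against an in-memory SQLite table with made-up rows rather than the Postgres database, and the helper name `violations` is hypothetical.

```python
import sqlite3


def violations(conn, table, key, attr):
    """Return key values that map to more than one distinct attr value."""
    # Identifiers are interpolated directly, so only pass trusted names.
    query = (
        f"SELECT {key}, COUNT(DISTINCT {attr}) AS n "
        f"FROM {table} GROUP BY {key} "
        f"HAVING COUNT(DISTINCT {attr}) > 1 ORDER BY {key}"
    )
    return conn.execute(query).fetchall()


# Toy data (an assumption, not the Superstore rows).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (product_name TEXT, product_category TEXT, unit_price REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Keyboard", "Technology", 10.0),
        ("Keyboard", "Technology", 12.0),
        ("Table", "Furniture", 99.0),
    ],
)

category_violations = violations(conn, "orders", "product_name", "product_category")
price_violations = violations(conn, "orders", "product_name", "unit_price")
print(category_violations)  # no product has two categories
print(price_violations)     # "Keyboard" has two prices
```

An empty result means the attribute depends only on the key and can live in the key's own table.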

## Customer segment 

It turns out customer segment is not unique per customer!

In [3]:
%%sql
SELECT 
    customer_name,
    COUNT(DISTINCT customer_segment)
FROM
    orders
GROUP BY 
    customer_name
HAVING
    COUNT(DISTINCT customer_segment) > 1
LIMIT 
    10;

 * postgres://localhost/ex_superstore_normalize
10 rows affected.


customer_name,count
Anna Wood,2
Annette McIntyre,2
Arlene Long,2
Benjamin Lam,2
Bonnie Matthews Rowland,2
Bradley Schroeder,2
Cameron Kendall,2
Carlos Hanson,2
Carolyn Greer,2
Christopher Norton Patterson,2


The LIMIT hides most of them: there are actually 58 customers with more than one segment.

However, each customer's order is associated with only one segment (the next
query returns no rows), so I put the customer segment in the new order table.

In [4]:
%%sql
SELECT
    customer_name, 
    order_id, 
    COUNT(DISTINCT customer_segment)
FROM
    orders
GROUP BY 
    customer_name, order_id
HAVING
    COUNT(DISTINCT customer_segment) > 1
;

 * postgres://localhost/ex_superstore_normalize
0 rows affected.


customer_name,order_id,count
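
To make the consequence concrete, here is a rough sketch of a layout where the segment sits on the order rather than on the customer. The table names `customer` and `customer_order`, the composite key, and the sample rows are all assumptions for illustration; the real table definitions are in the other notebook. SQLite is used only so the sketch runs on its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        customer_name TEXT NOT NULL
    );
    -- The segment is stored per (order, customer), not per customer.
    CREATE TABLE customer_order (
        order_id         INTEGER NOT NULL,
        customer_id      INTEGER NOT NULL REFERENCES customer (customer_id),
        customer_segment TEXT NOT NULL,
        PRIMARY KEY (order_id, customer_id)
    );
    INSERT INTO customer VALUES (1, 'Example Customer');
    INSERT INTO customer_order VALUES (100, 1, 'Corporate');
    INSERT INTO customer_order VALUES (101, 1, 'Consumer');
    """
)

# One customer, two orders, two different segments: no conflict.
segments = conn.execute(
    "SELECT order_id, customer_segment FROM customer_order "
    "WHERE customer_id = 1 ORDER BY order_id"
).fetchall()
print(segments)
```

The key is (order_id, customer_id) rather than order_id alone because, as shown later in this notebook, the same order_id can appear with more than one customer_id.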


## Each customer has one name 

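The next two cells show that the distinct counts of customer_id and customer_name match (1130 each), which is consistent with each customer having one name. Also a spot check to see whether any customer_id has more than one address, using city as a proxy: none do.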

In [5]:
%sql SELECT COUNT(DISTINCT customer_id) FROM orders;

 * postgres://localhost/ex_superstore_normalize
1 rows affected.


count
1130


In [6]:
%sql SELECT COUNT(DISTINCT customer_name) FROM orders;

 * postgres://localhost/ex_superstore_normalize
1 rows affected.


count
1130


In [7]:
%%sql 
SELECT 
    COUNT(DISTINCT city) 
FROM 
    orders 
GROUP BY 
    customer_id
HAVING COUNT(DISTINCT city) > 1
;

 * postgres://localhost/ex_superstore_normalize
0 rows affected.


count
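
Matching distinct counts are a necessary condition for a one-to-one mapping but not a sufficient one. A stricter check also counts the distinct (customer_id, customer_name) pairs: the mapping is one-to-one only if all three counts agree. A minimal sketch on made-up rows in SQLite (not the real data), built so that the two simple counts match while the mapping is not one-to-one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, customer_name TEXT)")
# Toy rows: id 1 has two names, and the name "Ann" has two ids.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "Ann"), (1, "Bob"), (2, "Ann")],
)

n_ids, n_names, n_pairs = conn.execute(
    """
    SELECT
        (SELECT COUNT(DISTINCT customer_id) FROM orders),
        (SELECT COUNT(DISTINCT customer_name) FROM orders),
        (SELECT COUNT(*) FROM
            (SELECT DISTINCT customer_id, customer_name FROM orders))
    """
).fetchone()

# One-to-one only when ids, names, and pairs all have the same count.
one_to_one = n_ids == n_names == n_pairs
print(n_ids, n_names, n_pairs, one_to_one)
```

Here the id and name counts are both 2, but there are 3 distinct pairs, so the simple comparison would have missed the problem.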


## Exploring order attributes

- The entire order has one order date
- A specific item in the order can have a different priority than another item in the same order
- Specific items can also have different ship modes
- The same order_id can have multiple customer_id values associated with it
- Two different order_id values can reuse the same row_id for order items


In [8]:
%%sql
SELECT
    order_id, COUNT(DISTINCT order_date)
FROM
    orders
GROUP BY
    order_id
HAVING
    COUNT(DISTINCT order_date) > 1
;

 * postgres://localhost/ex_superstore_normalize
0 rows affected.


order_id,count


In [9]:
%%sql
SELECT
    order_id, COUNT(DISTINCT order_priority)
FROM
    orders
GROUP BY
    order_id
HAVING
    COUNT(DISTINCT order_priority) > 1
;

 * postgres://localhost/ex_superstore_normalize
3 rows affected.


order_id,count
86885,2
88908,2
90540,2


In [10]:
%%sql
SELECT
    order_id, COUNT(DISTINCT ship_mode)
FROM
    orders
GROUP BY
    order_id
HAVING
    COUNT(DISTINCT ship_mode) > 1
LIMIT
    5 /* there are 203 of these */
;

 * postgres://localhost/ex_superstore_normalize
5 rows affected.


order_id,count
962,2
9606,2
12224,2
13959,2
21636,2


In [11]:
%%sql
SELECT
    order_id, COUNT(DISTINCT customer_id)
FROM
    orders
GROUP BY
    order_id
HAVING
    COUNT(DISTINCT customer_id) > 1
LIMIT
    5 /* there are 104 of these */
;

 * postgres://localhost/ex_superstore_normalize
5 rows affected.


order_id,count
85880,2
85966,2
86012,2
86051,2
86075,2


In [12]:
%%sql
SELECT
    row_id, COUNT(DISTINCT order_id)
FROM
    orders
GROUP BY
    row_id
HAVING
    COUNT(DISTINCT order_id) > 1
;

 * postgres://localhost/ex_superstore_normalize
1 rows affected.


row_id,count
22015,2
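
The totals noted in the SQL comments above ("there are 203 of these") can be computed directly by wrapping the grouped query in a COUNT(*). The sketch below does that for several columns in one pass. It is an illustration only: it uses SQLite with made-up rows, and the helper name `count_mixed_orders` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, ship_mode TEXT)"
)
# Toy rows: order 1 mixes ship modes, order 2 does not.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "2015-01-01", "Regular Air"),
        (1, "2015-01-01", "Express Air"),
        (2, "2015-01-02", "Regular Air"),
        (2, "2015-01-02", "Regular Air"),
    ],
)


def count_mixed_orders(conn, column):
    """Count order_ids that carry more than one distinct value of column."""
    # The column name is interpolated, so only pass trusted identifiers.
    query = (
        "SELECT COUNT(*) FROM ("
        "SELECT order_id FROM orders GROUP BY order_id "
        f"HAVING COUNT(DISTINCT {column}) > 1"
        ") AS mixed"
    )
    return conn.execute(query).fetchone()[0]


mixed = {col: count_mixed_orders(conn, col) for col in ("order_date", "ship_mode")}
print(mixed)
```

A count of zero suggests the column belongs to the order as a whole; a nonzero count suggests it belongs to the individual order item.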


## There are some NULLs in the original data for product_base_margin

In [13]:
%%sql
SELECT 
    * 
FROM 
    orders 
WHERE 
    NOT (orders IS NOT NULL) /* row-wise test: true when any column in the row is NULL */;

 * postgres://localhost/ex_superstore_normalize
16 rows affected.


row_id,order_priority,discount,unit_price,shipping_cost,customer_id,customer_name,ship_mode,customer_segment,product_category,product_subcategory,product_container,product_name,product_base_margin,country,region,state,city,postal_code,order_date,ship_date,profit,quantity_ordered_new,sales,order_id
18261,Critical,0.06,276.2,24.49,335,Curtis O'Connell,Regular Air,Corporate,Furniture,Chairs & Chairmats,Large Box,SAFCO Arco Folding Chair,,United States,West,Oregon,Medford,97504,2015-05-04 00:00:00,2015-05-05 00:00:00,2639.47,14,3825.32,87277
18305,Critical,0.01,128.24,12.65,508,Cameron Owens,Regular Air,Corporate,Furniture,Chairs & Chairmats,Medium Box,SAFCO Folding Chair Trolley,,United States,South,Kentucky,Covington,41011,2015-04-18 00:00:00,2015-04-21 00:00:00,140.135,4,554.08,87357
24764,Critical,0.09,349.45,60.0,868,Sharon Ellis,Delivery Truck,Corporate,Furniture,Tables,Jumbo Drum,"SAFCO PlanMaster Heigh-Adjustable Drafting Table Base, 43w x 30d x 30-37h, Black",,United States,Central,Minnesota,Shoreview,55126,2015-03-06 00:00:00,2015-03-07 00:00:00,-2946.05,12,3918.98,91195
19185,High,0.09,349.45,60.0,1178,Sandy Hunt,Delivery Truck,Consumer,Furniture,Tables,Jumbo Drum,"SAFCO PlanMaster Heigh-Adjustable Drafting Table Base, 43w x 30d x 30-37h, Black",,United States,South,Florida,Altamonte Springs,32701,2015-04-09 00:00:00,2015-04-10 00:00:00,-369.11,7,2307.26,89787
20592,Medium,0.03,128.24,12.65,1237,Eva Simpson,Regular Air,Corporate,Furniture,Chairs & Chairmats,Medium Box,SAFCO Folding Chair Trolley,,United States,Central,Texas,Carrollton,75007,2015-01-31 00:00:00,2015-02-02 00:00:00,790.464,9,1145.6,86075
21848,Not Specified,0.08,128.24,12.65,1267,Rosemary Branch,Regular Air,Corporate,Furniture,Chairs & Chairmats,Medium Box,SAFCO Folding Chair Trolley,,United States,South,Florida,Boca Raton,33433,2015-05-12 00:00:00,2015-05-13 00:00:00,-379.344,3,366.44,89515
22125,Low,0.1,238.4,24.49,1281,Pauline Denton,Regular Air,Small Business,Furniture,Chairs & Chairmats,Large Box,Safco Contoured Stacking Chairs,,United States,Central,Indiana,Vincennes,47591,2015-01-24 00:00:00,2015-01-26 00:00:00,875.284,8,1774.5,89112
4125,Low,0.1,238.4,24.49,1282,Dana Sharpe,Regular Air,Small Business,Furniture,Chairs & Chairmats,Large Box,Safco Contoured Stacking Chairs,,United States,East,Pennsylvania,Philadelphia,19134,2015-01-24 00:00:00,2015-01-26 00:00:00,460.676,30,6654.39,29319
22593,High,0.09,349.45,60.0,1739,Edna Pierce,Delivery Truck,Corporate,Furniture,Tables,Jumbo Drum,"SAFCO PlanMaster Heigh-Adjustable Drafting Table Base, 43w x 30d x 30-37h, Black",,United States,South,North Carolina,Goldsboro,27534,2015-05-03 00:00:00,2015-05-04 00:00:00,-90.748,17,5835.41,85867
19914,Not Specified,0.08,95.99,35.0,2211,Anita Hahn,Express Air,Home Office,Office Supplies,Storage & Organization,Large Box,Safco Industrial Wire Shelving,,United States,East,Maryland,Bowie,20715,2015-01-01 00:00:00,2015-01-03 00:00:00,-425.208,2,193.88,88028


In [14]:
# which columns? 
nulls = _.DataFrame()

In [15]:
nulls.isna().any()

row_id                  False
order_priority          False
discount                False
unit_price              False
shipping_cost           False
customer_id             False
customer_name           False
ship_mode               False
customer_segment        False
product_category        False
product_subcategory     False
product_container       False
product_name            False
product_base_margin      True
country                 False
region                  False
state                   False
city                    False
postal_code             False
order_date              False
ship_date               False
profit                  False
quantity_ordered_new    False
sales                   False
order_id                False
dtype: bool

In [16]:
# so all the NULLs are in the product_base_margin column
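
The same per-column answer can also be reached in SQL without pulling rows into pandas, because COUNT(col) skips NULLs while COUNT(*) does not. A minimal sketch, assuming a toy three-column table in SQLite rather than the real orders table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (row_id INTEGER, product_name TEXT, product_base_margin REAL)"
)
# Toy rows: two of the three margins are missing.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Chair", 0.5), (2, "Trolley", None), (3, "Table", None)],
)

# PRAGMA table_info yields one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(orders)")]

# COUNT(*) counts all rows, COUNT(col) only the non-NULL ones.
null_counts = {
    col: conn.execute(f"SELECT COUNT(*) - COUNT({col}) FROM orders").fetchone()[0]
    for col in columns
}
print(null_counts)
```

Only columns with a nonzero count contain NULLs, which matches what `isna().any()` reports column by column.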