# Exploratory Data Analysis

Guiding Questions:
* What customer purchasing patterns can you discover, such as activity on different days of the week, or weekly, monthly, quarterly, or yearly patterns?
* Are there specific days/months/quarters when the sales have been unusually high/low, and what could be the possible reasons? How about the profit and loss margin?
* Which States and which customers made the highest number of orders? Are they the same as the highest spenders?
* Can you make a map showing the 5 States generating the most and least sales revenue?
* Can we see any patterns in the quarterly revenue behavior?
* Can you create a plot showing the growth rate of new customers over the months?
* What do you think about the customers? Are they individuals or wholesalers? Why would you say so?
* Are there any issues with the dataset?
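
One possible way to approach the new-customer growth question is sketched below. It treats each customer's first `OrderDate` as their acquisition date, which is an assumption, and it uses a small made-up `orders` frame instead of the real `dashboard_data`.

```python
import pandas as pd

# Hypothetical toy orders; the real notebook would use dashboard_data instead.
orders = pd.DataFrame({
    "CustomerID": ["A", "B", "A", "C", "D", "B", "E"],
    "OrderDate": pd.to_datetime([
        "2014-01-05", "2014-01-20", "2014-02-03",
        "2014-02-10", "2014-02-25", "2014-03-02", "2014-03-15",
    ]),
})

# Assumption: a customer counts as "new" in the month of their first order.
first_order = orders.groupby("CustomerID")["OrderDate"].min()

# Number of newly acquired customers per calendar month.
new_per_month = first_order.dt.to_period("M").value_counts().sort_index()

# Month-over-month growth rate of new customers (NaN for the first month).
growth_rate = new_per_month.pct_change()

print(new_per_month)
print(growth_rate)
```

Months without any new customer do not appear in `new_per_month`, so on real data it may be worth reindexing to a full monthly range before computing the growth rate.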

In [1]:
import pandas as pd

Some insights

How is it going currently? 
* Sales increased in 2017
* Customer segment
  * Sales increased across all segments
  * Sales ranking: Consumer, Corporate, Home Office 
  * Profit ranking is similar; recently, profits in the Corporate segment have dropped 
  * The distribution across segments stays approximately constant
* Regions, States
  * Sales are highest in California, New York, and Texas, which mirrors population size
  * Profits, however, can be negative, e.g. in Texas and Florida
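
The observation about loss-making states could be checked with a sketch like the following. The `orders` frame is made up for illustration and only borrows the `State`, `Sales` and `Profit` column names from `dashboard_data`; the numbers are not real results.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as dashboard_data.
orders = pd.DataFrame({
    "State": ["California", "Texas", "Texas", "New York", "Florida"],
    "Sales": [100.0, 80.0, 120.0, 90.0, 50.0],
    "Profit": [25.0, -10.0, -30.0, 20.0, -5.0],
})

# Total sales and profit per state.
by_state = orders.groupby("State")[["Sales", "Profit"]].sum()

# Profit margin as a fraction of sales.
by_state["ProfitMargin"] = by_state["Profit"] / by_state["Sales"]

# States with a negative total profit, largest loss first.
loss_making = by_state[by_state["Profit"] < 0].sort_values("Profit")

print(loss_making)
```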

How are regional sales going? 
* What happened in Texas?
  * Customers appear to buy mainly when items are discounted.
  * Try a different discount approach: 
    * Don't give discounts away easily; discounts should not be gameable. Is that currently the case? Do customers only come when discounts are available? 
    * Offer fewer or smaller discounts, so that a profit margin remains.
  * Can customers be segmented into discount buyers and repeat buyers?
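
A first attempt at the segmentation question might look like the sketch below. The two thresholds and the toy `orders` frame are hypothetical and chosen only for illustration; sensible cut-offs would have to be tuned on the real data.

```python
import pandas as pd

# Hypothetical toy line items with the same column names as dashboard_data.
orders = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B", "B", "C"],
    "OrderReference": ["o1", "o2", "o3", "o4", "o5", "o6"],
    "Discount": [0.0, 0.0, 0.2, 0.2, 0.3, 0.0],
})

# Per customer: number of distinct orders and share of discounted line items.
per_customer = (
    orders.assign(discounted=orders["Discount"] > 0)
    .groupby("CustomerID")
    .agg(
        n_orders=("OrderReference", "nunique"),
        discount_share=("discounted", "mean"),
    )
)

# Hypothetical thresholds, not derived from the data.
DISCOUNT_SHARE_THRESHOLD = 0.8
REPEAT_ORDER_THRESHOLD = 2

per_customer["discount_buyer"] = (
    per_customer["discount_share"] >= DISCOUNT_SHARE_THRESHOLD
)
per_customer["repeat_buyer"] = per_customer["n_orders"] >= REPEAT_ORDER_THRESHOLD

print(per_customer)
```

The two flags are deliberately independent, so a customer can be both a discount buyer and a repeat buyer.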

In [42]:
%store -r dashboard_data

In [43]:
dashboard_data.head()

Unnamed: 0,OrderDate,ProductID,CustomerID,ShipmentID,OrderReference,Quantity,Discount,Sales,Profit,DiscountAmount,...,ProductName,Category,SubCategory,ShipDate,ShipMode,PostalCode,City,Country,Region,State
0,2014-09-07 00:00:00,TEC-PH-10000000,DK-13375,0,CA-2014-100006,3,0.0,377.97,109.6113,0.0,...,AT&T EL51110 DECT,Technology,Phones,2014-09-13 00:00:00,Standard Class,10024,New York City,United States,East,New York
1,2014-10-19 00:00:00,TEC-PH-10000001,EH-14125,10,CA-2014-100867,6,0.2,321.552,20.097,80.388,...,RCA Visys Integrated PBX 8-Line Router,Technology,Phones,2014-10-24 00:00:00,Standard Class,90712,Lakewood,United States,West,California
2,2014-11-21 00:00:00,TEC-PH-10000001,JK-15325,4328,US-2014-168501,5,0.2,267.96,16.7475,66.99,...,RCA Visys Integrated PBX 8-Line Router,Technology,Phones,2014-11-27 00:00:00,Standard Class,75220,Dallas,United States,Central,Texas
3,2015-06-16 00:00:00,TEC-PH-10000001,LC-16885,4459,US-2015-163825,2,0.0,133.98,33.495,0.0,...,RCA Visys Integrated PBX 8-Line Router,Technology,Phones,2015-06-19 00:00:00,First Class,10009,New York City,United States,East,New York
4,2017-01-20 00:00:00,TEC-PH-10000001,TH-21100,4011,CA-2017-161809,3,0.2,160.776,10.0485,40.194,...,RCA Visys Integrated PBX 8-Line Router,Technology,Phones,2017-01-26 00:00:00,Standard Class,90045,Los Angeles,United States,West,California


In [5]:
dashboard_data.columns

Index(['OrderDate', 'ProductID', 'CustomerID', 'ShipmentID', 'OrderReference',
       'Quantity', 'Discount', 'Sales', 'Profit', 'DiscountAmount',
       'CustomerName', 'Segment', 'ProductName', 'Category', 'SubCategory',
       'ShipDate', 'ShipMode', 'PostalCode', 'City', 'Country', 'Region',
       'State'],
      dtype='object')

In [11]:
total_revenue = dashboard_data.Sales.sum()
print(total_revenue)

2297200.8603


In [34]:
# How many discount hunters are there?
# A discount hunter is a customer whose every line item carries a discount.
# Selecting the Discount column before apply keeps the grouping column out of the lambda.
customers_only_buy_on_discount = dashboard_data.groupby('CustomerID')['Discount'].apply(lambda d: (d > 0).all()).rename('orders_only_with_discount')
print('fraction of discount hunters of all customers', customers_only_buy_on_discount.mean())

fraction of discount hunters of all customers 0.04287515762925599



In [None]:
# What share of total revenue comes from discount hunters?
discount_hunters = customers_only_buy_on_discount[customers_only_buy_on_discount]
dashboard_data.loc[dashboard_data.CustomerID.isin(discount_hunters.index), 'Sales'].sum() / total_revenue

np.float64(0.01504325564090509)

They account for only a small fraction (about 1.5%) of total revenue. This share understates their weight, because their sales figures are already reduced by the discount.
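
To gauge how much the discount distorts this share, one option is to compare it with the share on a pre-discount basis. The sketch below assumes that `Sales + DiscountAmount` approximates the undiscounted value of a line item; that reading is consistent with the sample rows shown earlier but is not confirmed by the dataset documentation. The `orders` frame and the `ListValue` column are made up for illustration.

```python
import pandas as pd

# Hypothetical toy line items with the same column names as dashboard_data.
orders = pd.DataFrame({
    "CustomerID": ["A", "A", "B", "C"],
    "Discount": [0.2, 0.5, 0.0, 0.0],
    "Sales": [80.0, 50.0, 100.0, 70.0],
    "DiscountAmount": [20.0, 50.0, 0.0, 0.0],
})

# Customers whose every line item carries a discount.
only_discount = (orders["Discount"] > 0).groupby(orders["CustomerID"]).all()
hunters = only_discount[only_discount].index
is_hunter = orders["CustomerID"].isin(hunters)

# Assumption: Sales + DiscountAmount is the undiscounted value of the line item.
orders["ListValue"] = orders["Sales"] + orders["DiscountAmount"]

# Share of discount hunters in actual revenue vs. in pre-discount value.
share_actual = orders.loc[is_hunter, "Sales"].sum() / orders["Sales"].sum()
share_list = orders.loc[is_hunter, "ListValue"].sum() / orders["ListValue"].sum()

print(share_actual, share_list)
```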

In [28]:
# What share of total sales comes from customers whose net profit is zero or negative?
customers_sales_profit = dashboard_data[['CustomerID', 'Sales', 'Profit']].groupby(by='CustomerID').sum().reset_index()
customers_sales_profit.loc[customers_sales_profit.Profit<=0,'Sales'].sum() / total_revenue

np.float64(0.17062086884679892)

In [None]:
# Count discounted line items per customer and state
dashboard_data.loc[dashboard_data.Discount>0, ['CustomerID', 'OrderReference', 'State', 'Sales']].groupby(['CustomerID', 'State']).count()

CustomerID,State
AA-10315,California
AA-10315,Texas
AA-10375,Arizona
AA-10375,California
AA-10375,New York
...,...
ZC-21910,Oregon
ZC-21910,Texas
ZD-21925,California
ZD-21925,Florida


# Shipments
* Do shipments arrive late? 
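
The columns listed above contain `OrderDate` and `ShipDate` but no delivery or promised date, so lateness cannot be measured directly. As a rough proxy, the sketch below computes the number of days between order and shipment per `ShipMode`. It uses four rows copied from the `head()` output, and the `LeadDays` column name is an assumption introduced here.

```python
import pandas as pd

# Four rows copied from the head() output of dashboard_data.
shipments = pd.DataFrame({
    "OrderDate": pd.to_datetime(
        ["2014-09-07", "2014-10-19", "2015-06-16", "2017-01-20"]
    ),
    "ShipDate": pd.to_datetime(
        ["2014-09-13", "2014-10-24", "2015-06-19", "2017-01-26"]
    ),
    "ShipMode": ["Standard Class", "Standard Class", "First Class", "Standard Class"],
})

# Days between placing the order and shipping it (a proxy, not a delivery delay).
shipments["LeadDays"] = (shipments["ShipDate"] - shipments["OrderDate"]).dt.days

# Typical and worst-case lead time per shipping mode.
lead_by_mode = shipments.groupby("ShipMode")["LeadDays"].agg(["mean", "max"])

print(lead_by_mode)
```

Whether a given lead time counts as late depends on what each shipping mode promises, which the dataset does not state.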