# Statistics Challenge (Optional)

Use the `orders.csv` dataset in the same directory to complete this challenge.

**Background**:

There are exactly 100 sneaker shops on a sneaker retailing website, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30-day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.

**Questions**:

- What went wrong with this metric and our analysis? 

- Propose some new metrics that better represent the behavior of the stores' customers. Why are these metrics better? You can propose as many new metrics as you wish, but quality heavily outweighs quantity.

- Find the values of your new metrics.

- Report any other interesting findings.

Show all of your work in this notebook.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [3]:
df = pd.read_csv('orders.csv')
df.head()

Unnamed: 0,order_id,shop_id,user_id,order_value,total_items,payment_method,created_at
0,1,53,746,224,2,cash,2017-03-13 12:36:56
1,2,92,925,90,1,cash,2017-03-03 17:38:52
2,3,44,861,144,1,cash,2017-03-14 4:23:56
3,4,18,935,156,1,credit_card,2017-03-26 12:43:37
4,5,18,883,156,1,credit_card,2017-03-01 4:35:11


In [4]:
df['order_value'].mean()

3145.128

In [5]:
df['order_value'].median()

284.0

In [6]:
# The mean (3145.13) is far larger than the median (284), so a few very large values are skewing the distribution to the right.
df['order_value'].max()

704000

In [9]:
df['order_value'].describe()

count      5000.000000
mean       3145.128000
std       41282.539349
min          90.000000
25%         163.000000
50%         284.000000
75%         390.000000
max      704000.000000
Name: order_value, dtype: float64
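One common way to turn the quartiles above into an outlier threshold is Tukey's 1.5 × IQR rule. This is a swapped-in illustration, not something the notebook itself uses. The helper name `upper_fence` and the toy series are made up for this sketch; only the quartiles 163 and 390 come from the `describe()` output.

```python
import pandas as pd


def upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper Tukey fence (hypothetical helper)."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)


# Fence computed by hand from the quartiles reported by describe() above.
fence_from_describe = 390 + 1.5 * (390 - 163)  # 730.5

# Toy data, invented for illustration (not taken from orders.csv).
toy = pd.Series([90, 150, 160, 280, 300, 390, 400, 704000])
toy_fence = upper_fence(toy)
n_above = int((toy > toy_fence).sum())

print(fence_from_describe, toy_fence, n_above)
```

On the real data, `(df['order_value'] > upper_fence(df['order_value'])).sum()` would count the orders above the fence, assuming `df` is loaded as in the cells above.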

In [15]:
sorted_list = df['order_value'].sort_values(ascending = True)
sorted_list.tail()

2153    704000
1562    704000
1362    704000
520     704000
3332    704000
Name: order_value, dtype: int64

In [32]:
df.sort_values(
    by="order_value",
    ascending=False
).head(18)


Unnamed: 0,order_id,shop_id,user_id,order_value,total_items,payment_method,created_at
2153,2154,42,607,704000,2000,credit_card,2017-03-12 4:00:00
3332,3333,42,607,704000,2000,credit_card,2017-03-24 4:00:00
520,521,42,607,704000,2000,credit_card,2017-03-02 4:00:00
1602,1603,42,607,704000,2000,credit_card,2017-03-17 4:00:00
60,61,42,607,704000,2000,credit_card,2017-03-04 4:00:00
...,...,...,...,...,...,...,...
1419,1420,78,912,25725,1,cash,2017-03-30 12:23:43
3440,3441,78,982,25725,1,debit,2017-03-19 19:02:54
1204,1205,78,970,25725,1,credit_card,2017-03-17 22:32:21
1364,1365,42,797,1760,5,cash,2017-03-10 6:28:21


In [34]:
df.sort_values(
    by="order_value",
    ascending=False
).head(18)

Unnamed: 0,order_id,shop_id,user_id,order_value,total_items,payment_method,created_at
2153,2154,42,607,704000,2000,credit_card,2017-03-12 4:00:00
3332,3333,42,607,704000,2000,credit_card,2017-03-24 4:00:00
520,521,42,607,704000,2000,credit_card,2017-03-02 4:00:00
1602,1603,42,607,704000,2000,credit_card,2017-03-17 4:00:00
60,61,42,607,704000,2000,credit_card,2017-03-04 4:00:00
2835,2836,42,607,704000,2000,credit_card,2017-03-28 4:00:00
4646,4647,42,607,704000,2000,credit_card,2017-03-02 4:00:00
2297,2298,42,607,704000,2000,credit_card,2017-03-07 4:00:00
1436,1437,42,607,704000,2000,credit_card,2017-03-11 4:00:00
4882,4883,42,607,704000,2000,credit_card,2017-03-25 4:00:00
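Dividing `order_value` by `total_items` helps separate large orders from expensive items. The sketch below rebuilds a few rows copied from the outputs above and adds a per-item price. The column name `unit_price` is an assumption introduced here and is not part of `orders.csv`.

```python
import pandas as pd

# Rows copied from the outputs shown above (original indexes 2153, 1419, 1364, 0, 1).
rows = pd.DataFrame(
    {
        "shop_id": [42, 78, 42, 53, 92],
        "user_id": [607, 912, 797, 746, 925],
        "order_value": [704000, 25725, 1760, 224, 90],
        "total_items": [2000, 1, 5, 2, 1],
    }
)

# Per-item price: what one pair of sneakers costs in each order.
rows["unit_price"] = rows["order_value"] / rows["total_items"]

print(rows)
```

In these rows, both shop 42 orders work out to 352 per item, so the 704000 orders are large because of quantity. The shop 78 order is a single item priced at 25725, which may point to a pricing or data-entry issue rather than a bulk purchase.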


In [33]:
df.count()

order_id          5000
shop_id           5000
user_id           5000
order_value       5000
total_items       5000
payment_method    5000
created_at        5000
dtype: int64

In [35]:
# The problem with the AOV metric is that it does not represent the behavior of the typical customer.
# Outliers in this data set pull the mean order value ($3145.13) far above the median ($284).
# In fact, the top 17 order values describe a single, very affluent customer (user_id 607 at shop_id 42) making 17 orders of 2000 items each.
# These data points, as well as the rest of the top 63 order values, skew the mean.
# These top buyers are not representative of the typical buyer, yet they dominate the mean.

# The median order value (MOV) is robust to such extreme values, so it describes the behavior of the typical customer more accurately.

df['order_value'].median()

284.0
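To illustrate why the median is the more robust choice, here is a minimal sketch on made-up data. The "typical" values are invented; only the 704000 figure mirrors the bulk orders seen above. Adding two bulk orders moves the mean by several orders of magnitude while the median barely changes.

```python
import pandas as pd

# Invented "typical" order values, for illustration only.
typical = [90, 156, 224, 284, 352, 390]
# Two bulk orders with the same value as the largest orders in the data set.
bulk = [704000, 704000]

without_bulk = pd.Series(typical)
with_bulk = pd.Series(typical + bulk)

summary = pd.DataFrame(
    {
        "mean": [without_bulk.mean(), with_bulk.mean()],
        "median": [without_bulk.median(), with_bulk.median()],
    },
    index=["without_bulk", "with_bulk"],
)

print(summary)
```

The median shifts from 254 to 318, while the mean jumps from roughly 249 to 176187.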

In [None]:
# $284 seems much more reasonable in describing what the typical customer is paying at a sneake