# Data Science Intern Challenge, Fall 2022
### Pooja Mathur

## Question 1
The average order value (AOV) is calculated by dividing the total value of all orders within a given period (in this case, 30 days) by the total number of orders in that period.
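As a minimal sketch of the formula above, the calculation can be written as a small helper. The function name `average_order_value` and the five sample amounts are assumptions for illustration only, not the challenge data set.

```python
import pandas as pd


def average_order_value(order_amounts: pd.Series) -> float:
    """Total order value divided by the number of orders."""
    return order_amounts.sum() / len(order_amounts)


# Hypothetical order amounts, used only to illustrate the formula.
sample_amounts = pd.Series([224, 90, 144, 156, 156])
sample_aov = average_order_value(sample_amounts)
print(sample_aov)  # 154.0
```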

We will first import all necessary data and libraries, and take a look at how the data is set up.

In [3]:
import pandas as pd
import numpy as np

In [4]:
shoes = pd.read_csv('2019 Winter Data Science Intern Challenge Data Set.csv')
shoes.head()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at
0,1,53,746,224,2,cash,2017-03-13 12:36:56
1,2,92,925,90,1,cash,2017-03-03 17:38:52
2,3,44,861,144,1,cash,2017-03-14 4:23:56
3,4,18,935,156,1,credit_card,2017-03-26 12:43:37
4,5,18,883,156,1,credit_card,2017-03-01 4:35:11


Next, we recalculate and verify the naive AOV.

In [5]:
# shoes.shape[0] is the number of rows in the dataset, i.e. the number of orders.
shoes['order_amount'].sum() / shoes.shape[0]

3145.128

This matches the given AOV.

To determine what might be going wrong with the calculation, we take a look at the summary statistics of the data and check for skews.

In [6]:
shoes['order_amount'].describe()

count      5000.000000
mean       3145.128000
std       41282.539349
min          90.000000
25%         163.000000
50%         284.000000
75%         390.000000
max      704000.000000
Name: order_amount, dtype: float64

The maximum (704,000) is suspiciously high compared with the 75th percentile (390).

Let's break it down further:

In [7]:
shoes['order_amount'].quantile(np.arange(1,26) / 25)

0.04       117.00
0.08       130.00
0.12       137.76
0.16       147.00
0.20       156.00
0.24       162.00
0.28       172.00
0.32       180.00
0.36       196.00
0.40       236.00
0.44       260.00
0.48       276.00
0.52       294.00
0.56       312.00
0.60       322.00
0.64       336.00
0.68       352.00
0.72       374.00
0.76       399.00
0.80       444.00
0.84       474.00
0.88       516.00
0.92       561.00
0.96       692.00
1.00    704000.00
Name: order_amount, dtype: float64

## What could be going wrong with the calculation?

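*The data is heavily **skewed to the right**: the 96th percentile is 692 while the maximum is 704,000, so a small number of very large orders inflate the mean and produce a higher-than-expected AOV.*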
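One way to put a number on the direction of the skew is pandas' `Series.skew()`, which returns a value near zero for symmetric data and a positive value for a right skew. The sketch below uses two hypothetical toy series (`symmetric` and `right_skewed`), not the challenge data, so its output says nothing about the real orders; it only shows the pattern to look for.

```python
import pandas as pd

# Hypothetical toy data, chosen only to contrast the two shapes.
symmetric = pd.Series([100, 200, 300, 400, 500])
right_skewed = pd.Series([100, 150, 200, 250, 300, 50_000])

symmetric_skew = symmetric.skew()        # approximately 0 for symmetric data
right_skewed_skew = right_skewed.skew()  # positive for a long right tail

print(f"symmetric: {symmetric_skew:.2f}, right-skewed: {right_skewed_skew:.2f}")
```

Running the same call on `shoes['order_amount']` would be the analogous check for the real data.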



We can examine the orders above the 98th percentile, which falls between the 0.96 and 1.00 quantiles listed above, to see why some order amounts are extremely high.

In [8]:
percentile_98_sales = shoes.loc[shoes['order_amount'] > shoes['order_amount'].quantile(0.98)]
percentile_98_sales.head()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at
15,16,42,607,704000,2000,credit_card,2017-03-07 4:00:00
60,61,42,607,704000,2000,credit_card,2017-03-04 4:00:00
160,161,78,990,25725,1,credit_card,2017-03-12 5:56:57
490,491,78,936,51450,2,debit,2017-03-26 17:08:19
493,494,78,983,51450,2,cash,2017-03-16 21:39:35


Orders with 2000 items could explain the high order amount, but there may also be orders where each item costs much more than expected.

We can calculate the amount paid per item (order amount divided by total items) to examine this issue.

In [9]:
# .assign() returns a new DataFrame instead of writing into a slice of `shoes`,
# which avoids pandas' SettingWithCopyWarning.
percentile_98_sales = percentile_98_sales.assign(
    amount_per_item=percentile_98_sales['order_amount'] / percentile_98_sales['total_items']
)
percentile_98_sales.head()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at,amount_per_item
15,16,42,607,704000,2000,credit_card,2017-03-07 4:00:00,352.0
60,61,42,607,704000,2000,credit_card,2017-03-04 4:00:00,352.0
160,161,78,990,25725,1,credit_card,2017-03-12 5:56:57,25725.0
490,491,78,936,51450,2,debit,2017-03-26 17:08:19,25725.0
493,494,78,983,51450,2,cash,2017-03-16 21:39:35,25725.0


I may not be much of a sneakerhead myself, but I don't believe it is common for shoes to cost **above $25,000 per pair**.
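To see which shops are behind the unusual values, one option is to compute the price per item for every order and summarise it by `shop_id`. The sketch below is an assumption about how such a check could look, not a step from the original analysis. It rebuilds a small frame from the ten rows displayed in the outputs above (the names `displayed_orders` and `per_shop` are hypothetical), so it says nothing about shops that are not shown.

```python
import pandas as pd

# The ten rows displayed above: the first five orders in the data set and
# the first five orders above the 98th percentile.
displayed_orders = pd.DataFrame({
    "shop_id":      [53, 92, 44, 18, 18, 42, 42, 78, 78, 78],
    "order_amount": [224, 90, 144, 156, 156, 704000, 704000, 25725, 51450, 51450],
    "total_items":  [2, 1, 1, 1, 1, 2000, 2000, 1, 2, 2],
})

# Price paid per item in each order.
displayed_orders["amount_per_item"] = (
    displayed_orders["order_amount"] / displayed_orders["total_items"]
)

# Highest per-item price and largest order size seen for each shop.
per_shop = displayed_orders.groupby("shop_id").agg(
    max_amount_per_item=("amount_per_item", "max"),
    max_total_items=("total_items", "max"),
)
print(per_shop.sort_values("max_amount_per_item", ascending=False))
```

In this sample, shop 78 stands out for its per-item price (25,725) and shop 42 for its order size (2,000 items at 352 each), which matches the two patterns noted above.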

## What metric should be reported?

*Given the skew and the major outliers in the data, it is better to report the **median order value, or MOV**, as it is resistant to skew and better represents a typical sneaker order.*

In [10]:
shoes['order_amount'].quantile(0.5)

284.0

## What is its value?

Calculating the MOV (at the 50th percentile), we find a median of **$284**.
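The robustness of the median can be illustrated with a small sketch: adding a single extreme order to a hypothetical series moves the mean dramatically but shifts the median only slightly. The values and variable names below are assumptions chosen for illustration, not figures computed from the data set.

```python
import pandas as pd

# Hypothetical typical orders.
typical = pd.Series([150, 200, 284, 350, 400])
# The same orders plus one extreme bulk order.
with_outlier = pd.concat([typical, pd.Series([704_000])], ignore_index=True)

print("mean:  ", typical.mean(), "->", with_outlier.mean())
print("median:", typical.median(), "->", with_outlier.median())
```

Here the median moves from 284 to 317, while the mean jumps from under 300 to over 100,000.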