# Summer 2022 Data Science Intern Challenge

Kathy Simon


## Question 1

<ol>On Shopify, we have exactly 100 sneaker shops, and each of these shops sells only one model of shoe. We want to do some analysis of the average order value (AOV). When we look at orders data over a 30 day window, we naively calculate an AOV of $3145.13. Given that we know these shops are selling sneakers, a relatively affordable item, something seems wrong with our analysis.


<ol>a). Think about what could be going wrong with our calculation. Think about a better way to evaluate this data. <br>
b). What metric would you report for this dataset? <br>
c). What is its value?
</ol>



In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.read_csv('./drive/MyDrive/spotify_data.csv')

In [3]:
data.head()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at
0,1,53,746,224,2,cash,2017-03-13 12:36:56
1,2,92,925,90,1,cash,2017-03-03 17:38:52
2,3,44,861,144,1,cash,2017-03-14 4:23:56
3,4,18,935,156,1,credit_card,2017-03-26 12:43:37
4,5,18,883,156,1,credit_card,2017-03-01 4:35:11


In [4]:
data.describe()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items
count,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,50.0788,849.0924,3145.128,8.7872
std,1443.520003,29.006118,87.798982,41282.539349,116.32032
min,1.0,1.0,607.0,90.0,1.0
25%,1250.75,24.0,775.0,163.0,1.0
50%,2500.5,50.0,849.0,284.0,2.0
75%,3750.25,75.0,925.0,390.0,3.0
max,5000.0,100.0,999.0,704000.0,2000.0


In [5]:
round(data['order_amount'].mean(),2)

3145.13

* 75% of orders have an order_amount of 390 dollars or less, yet the maximum order_amount is 704,000 dollars.
* 75% of orders have 3 or fewer total_items, yet the maximum total_items is 2,000.
* A small number of orders with far more total_items than the rest pull the mean up, which is why the naive average order value comes out at 3,145.13 dollars.
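
The effect described above can be reproduced on a toy sample. The sketch below uses made-up order amounts (they are illustrative values, not rows taken from the dataset) and only the standard library, so it runs without the CSV file. It shows how a single bulk order moves the mean by orders of magnitude while the median barely changes.

```python
from statistics import mean, median

# Hypothetical order amounts: nine ordinary sneaker orders and one bulk order.
order_amounts = [90, 144, 156, 156, 224, 284, 312, 352, 390, 704000]

naive_aov = mean(order_amounts)             # dominated by the bulk order
median_order_value = median(order_amounts)  # robust to the bulk order

print(f"mean:   {naive_aov:.2f}")
print(f"median: {median_order_value:.2f}")
```

On the real data the same comparison can be read off `data.describe()`: the mean order_amount is 3145.128, while the 50% row (the median) is 284.0.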

In [6]:
data[data['total_items'] > 8]

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at
15,16,42,607,704000,2000,credit_card,2017-03-07 4:00:00
60,61,42,607,704000,2000,credit_card,2017-03-04 4:00:00
520,521,42,607,704000,2000,credit_card,2017-03-02 4:00:00
1104,1105,42,607,704000,2000,credit_card,2017-03-24 4:00:00
1362,1363,42,607,704000,2000,credit_card,2017-03-15 4:00:00
1436,1437,42,607,704000,2000,credit_card,2017-03-11 4:00:00
1562,1563,42,607,704000,2000,credit_card,2017-03-19 4:00:00
1602,1603,42,607,704000,2000,credit_card,2017-03-17 4:00:00
2153,2154,42,607,704000,2000,credit_card,2017-03-12 4:00:00
2297,2298,42,607,704000,2000,credit_card,2017-03-07 4:00:00


In [7]:
data[data['total_items'] > 8].count()

order_id          17
shop_id           17
user_id           17
order_amount      17
total_items       17
payment_method    17
created_at        17
dtype: int64

Seventeen orders contain more than 8 items. All of them are for 2,000 items, come from user_id 607, and total 704,000 dollars each.

In [8]:
data[data['user_id'] == 607]

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at
15,16,42,607,704000,2000,credit_card,2017-03-07 4:00:00
60,61,42,607,704000,2000,credit_card,2017-03-04 4:00:00
520,521,42,607,704000,2000,credit_card,2017-03-02 4:00:00
1104,1105,42,607,704000,2000,credit_card,2017-03-24 4:00:00
1362,1363,42,607,704000,2000,credit_card,2017-03-15 4:00:00
1436,1437,42,607,704000,2000,credit_card,2017-03-11 4:00:00
1562,1563,42,607,704000,2000,credit_card,2017-03-19 4:00:00
1602,1603,42,607,704000,2000,credit_card,2017-03-17 4:00:00
2153,2154,42,607,704000,2000,credit_card,2017-03-12 4:00:00
2297,2298,42,607,704000,2000,credit_card,2017-03-07 4:00:00





* The AOV is inflated by the 17 orders that have 2,000 total items each. All other orders have 8 or fewer items.
* A better metric would be the average value per unit ordered, that is, order_amount divided by total_items for each order.
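
As a quick check of why a per-unit value is less sensitive to order size, here is a small sketch in plain Python. The first three pairs echo rows shown in `data.head()` and the last mirrors the 2,000-item bulk orders. The variable names are illustrative only.

```python
from statistics import mean

# (order_amount, total_items) pairs; the last pair mirrors the bulk orders.
orders = [(224, 2), (90, 1), (156, 1), (704000, 2000)]

# Value per unit for each order: order_amount / total_items.
unit_values = [amount / items for amount, items in orders]

mean_order_value = mean(amount for amount, _ in orders)
mean_unit_value = mean(unit_values)

print(unit_values)
print(mean_order_value, mean_unit_value)
```

The bulk order works out to 704000 / 2000 = 352 dollars per unit, which sits in the same range as the ordinary orders, so it no longer dominates the average.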

In [9]:
data['average_unit_value'] = data['order_amount'] / data['total_items']

In [10]:
data.head()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,payment_method,created_at,average_unit_value
0,1,53,746,224,2,cash,2017-03-13 12:36:56,112.0
1,2,92,925,90,1,cash,2017-03-03 17:38:52,90.0
2,3,44,861,144,1,cash,2017-03-14 4:23:56,144.0
3,4,18,935,156,1,credit_card,2017-03-26 12:43:37,156.0
4,5,18,883,156,1,credit_card,2017-03-01 4:35:11,156.0


In [11]:
data.describe()

Unnamed: 0,order_id,shop_id,user_id,order_amount,total_items,average_unit_value
count,5000.0,5000.0,5000.0,5000.0,5000.0,5000.0
mean,2500.5,50.0788,849.0924,3145.128,8.7872,387.7428
std,1443.520003,29.006118,87.798982,41282.539349,116.32032,2441.963725
min,1.0,1.0,607.0,90.0,1.0,90.0
25%,1250.75,24.0,775.0,163.0,1.0,133.0
50%,2500.5,50.0,849.0,284.0,2.0,153.0
75%,3750.25,75.0,925.0,390.0,3.0,169.0
max,5000.0,100.0,999.0,704000.0,2000.0,25725.0


In [12]:
round(data['average_unit_value'].mean(),2)

387.74

In [13]:
data[data['average_unit_value'] > 2000].count()

order_id              46
shop_id               46
user_id               46
order_amount          46
total_items           46
payment_method        46
created_at            46
average_unit_value    46
dtype: int64

In [14]:
round(data[data['average_unit_value'] < 2000].mean(numeric_only=True), 2)

order_id              2498.99
shop_id                 49.82
user_id                848.92
order_amount          2717.37
total_items              8.85
average_unit_value     152.48
dtype: float64

* The mean average_unit_value across all orders is 387.74 dollars.
* Of the 5,000 orders, 46 have an average_unit_value above 2,000 dollars (the maximum is 25,725 dollars). These 46 orders represent less than 1% of the orders.
* Excluding those 46 orders, the mean average_unit_value is 152.48 dollars.
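
The filtering step above can be sketched as a small helper. This is a minimal illustration in plain Python under stated assumptions: `mean_below_threshold` is a hypothetical name, the unit values are made up (25725.0 stands in for the high-priced orders), and the 2,000 dollar cut-off is the one used in the cells above.

```python
from statistics import mean

def mean_below_threshold(values, threshold):
    """Return (mean of values strictly below threshold, share of values excluded)."""
    kept = [v for v in values if v < threshold]
    excluded_share = (len(values) - len(kept)) / len(values)
    return mean(kept), excluded_share

# Hypothetical per-order unit values, not taken from the dataset.
unit_values = [112.0, 90.0, 144.0, 156.0, 156.0, 153.0, 169.0, 133.0, 352.0, 25725.0]

filtered_mean, excluded_share = mean_below_threshold(unit_values, threshold=2000)
print(f"filtered mean:  {filtered_mean:.2f}")
print(f"excluded share: {excluded_share:.1%}")
```

Reporting the excluded share next to the filtered mean makes it clear how much data the cut-off removes. For the dataset above that share is 46 / 5000 = 0.92%.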