# Challenge 3

In this challenge we will work on the `Orders` data set. In your work you will apply the thinking process and workflow we showed you in Challenge 2.

You are serving as a Business Intelligence Analyst at the headquarters of an international fashion goods chain store. Today your boss asked you to do two things for her:

**First, identify two groups of customers from the data set.** The first group is **VIP Customers**, whose **aggregated expenses** at your global chain stores are **above the 95th percentile** (i.e. the 0.95 quantile). The second group is **Preferred Customers**, whose **aggregated expenses** are **between the 75th and 95th percentiles**.

**Second, identify which country has the most of your VIP customers, and which country has the most of your VIP+Preferred Customers combined.**

## Q1: How to identify VIP & Preferred Customers?

We start by importing all the required libraries:

In [1]:
# import required libraries
import numpy as np
import pandas as pd

Next, import `Orders` from Ironhack's database into a dataframe variable called `orders`. Print the head of `orders` to get an overview of the data:

In [2]:
# your code here
orders= pd.read_csv('Orders.csv')

orders.head()

Unnamed: 0.1,Unnamed: 0,InvoiceNo,StockCode,year,month,day,hour,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,amount_spent
0,0,536365,85123A,2010,12,3,8,white hanging heart t-light holder,6,2010-12-01 08:26:00,2.55,17850,United Kingdom,15.3
1,1,536365,71053,2010,12,3,8,white metal lantern,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
2,2,536365,84406B,2010,12,3,8,cream cupid hearts coat hanger,8,2010-12-01 08:26:00,2.75,17850,United Kingdom,22.0
3,3,536365,84029G,2010,12,3,8,knitted union flag hot water bottle,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34
4,4,536365,84029E,2010,12,3,8,red woolly hottie white heart.,6,2010-12-01 08:26:00,3.39,17850,United Kingdom,20.34


In [3]:
orders.columns = [c.lower().replace(' ', '_') for c in orders.columns]

orders.columns

Index(['unnamed:_0', 'invoiceno', 'stockcode', 'year', 'month', 'day', 'hour',
       'description', 'quantity', 'invoicedate', 'unitprice', 'customerid',
       'country', 'amount_spent'],
      dtype='object')

In [4]:
orders['customerid'].dtype

dtype('int64')

In [5]:
nan_cols = orders.isna().sum()

nan_cols[nan_cols>0]

Series([], dtype: int64)

In [6]:
orders.shape[0]

397924

In [7]:
orders.amount_spent.value_counts()

amount_spent
15.00     20082
19.80     11033
17.70      9174
16.50      8490
10.20      8028
          ...  
233.10        1
145.53        1
215.80        1
114.10        1
66.36         1
Name: count, Length: 2811, dtype: int64

In [8]:
orders.amount_spent.describe()

count    397924.000000
mean         22.394749
std         309.055588
min           0.000000
25%           4.680000
50%          11.800000
75%          19.800000
max      168469.600000
Name: amount_spent, dtype: float64

---

"Identify VIP and Preferred Customers" is your boss's non-technical goal. You need to translate that goal into the technical language that data analysts use:

## How to label customers whose aggregated `amount_spent` is in a given quantile range?


We break down the main problem into several sub problems:

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

*Note: If you want to break down the main problem in a different way, please feel free to revise the sub problems above.*

Now, in the workspace below, tackle each of the sub problems using the iterative problem-solving workflow. Insert cells as necessary to write your code and explain your steps.

#### Sub Problem 1: How to aggregate the `amount_spent` for unique customers?

In [9]:
customers= orders.groupby('customerid').agg({'amount_spent': 'sum'})
customers= customers.reset_index()

In [10]:
customers = customers.rename(columns={'customerid':'id'})
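
The aggregation, index reset and rename above can also be chained into one expression. The sketch below runs on a made-up `toy_orders` table (the frame and its values are invented for illustration; only the column names mirror the notebook), so the result is easy to check by hand:

```python
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are invented.
toy_orders = pd.DataFrame({
    "customerid": [1, 1, 2, 3, 3, 3],
    "amount_spent": [10.0, 5.0, 20.0, 1.0, 2.0, 3.0],
})

# as_index=False keeps customerid as a regular column,
# so a separate reset_index() call is not needed.
toy_customers = (
    toy_orders.groupby("customerid", as_index=False)["amount_spent"]
    .sum()
    .rename(columns={"customerid": "id"})
)

print(toy_customers)
```

Each row of `toy_customers` holds one customer and the sum of that customer's order lines.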

#### Sub Problem 2: How to select customers whose aggregated `amount_spent` is in a given quantile range?

In [11]:
customers[(customers.amount_spent>customers.amount_spent.quantile(0.75)) & (customers.amount_spent<customers.amount_spent.quantile(0.95))]

Unnamed: 0,id,amount_spent
1,12347,4310.00
2,12348,1797.24
3,12349,1757.55
5,12352,2506.04
9,12356,2811.43
...,...,...
4319,18259,2338.60
4320,18260,2643.20
4328,18272,3078.58
4337,18283,2094.88


#### Sub Problem 3: How to label selected customers as "VIP" or "Preferred"?

In [12]:
customers['group']= np.nan

In [13]:
customers

Unnamed: 0,id,amount_spent,group
0,12346,77183.60,
1,12347,4310.00,
2,12348,1797.24,
3,12349,1757.55,
4,12350,334.40,
...,...,...,...
4334,18280,180.60,
4335,18281,80.82,
4336,18282,178.05,
4337,18283,2094.88,


In [14]:
# This loop is very inefficient (%%time: 5.41 s) because it recomputes both quantiles on every iteration
"""
lst=[]
for e in customers.amount_spent:
    if e>customers.amount_spent.quantile(0.95):
        lst.append('VIP')
    elif e>customers.amount_spent.quantile(0.75) and e<customers.amount_spent.quantile(0.95):
        lst.append('Preferred')
    else:
        lst.append('Regular')

customers['group']= lst

customers.head()"""

In [22]:
%%time

# I tried this version and it is considerably more efficient

indices_VIP = customers[(customers.amount_spent>customers.amount_spent.quantile(0.95))].index
indices_preferred =customers[(customers.amount_spent>customers.amount_spent.quantile(0.75)) & (customers.amount_spent<customers.amount_spent.quantile(0.95))].index

customers.loc[indices_VIP, 'group'] = 'VIP'
customers.loc[indices_preferred, 'group'] = 'Preferred'
customers.loc[[e for e in customers.index if(e not in indices_VIP and e not in indices_preferred)], 'group']='Regular'

CPU times: total: 31.2 ms
Wall time: 33.9 ms
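
Another way to assign all three labels in one vectorized pass is `np.select`, which takes the first matching condition for each row. This is a sketch on invented data, not the notebook's method. Because the conditions are checked in order, the second one effectively covers totals above the 75th percentile up to and including the 95th, which is a slightly different boundary choice from the strict `<` used above:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ten customers with invented totals.
toy_customers = pd.DataFrame({
    "id": range(1, 11),
    "amount_spent": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0, 1000.0],
})

q75, q95 = toy_customers["amount_spent"].quantile([0.75, 0.95])

# Conditions are evaluated in order; the first one that is True wins.
conditions = [
    toy_customers["amount_spent"] > q95,  # VIP
    toy_customers["amount_spent"] > q75,  # Preferred (only reached if not VIP)
]
labels = ["VIP", "Preferred"]

# Rows matching no condition receive the default label.
toy_customers["group"] = np.select(conditions, labels, default="Regular")

print(toy_customers)
```

No placeholder `group` column is needed beforehand, since every row receives a label in the same assignment.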


In [27]:
# Sanity check: pick a random customerid and compare its amount_spent.sum() in orders with its value in customers

cliente_aleatorio= orders.customerid.sample(1)

print("Total spent in orders:", orders[(orders['customerid']==cliente_aleatorio.values[0])].amount_spent.sum(), "\n",
      "amount_spent value in customers: ", customers[(customers['id']==cliente_aleatorio.values[0])].amount_spent.sum(), "\n",
      "Customer group:", customers[(customers['id']==cliente_aleatorio.values[0])].group.unique()
     )

print("95th percentile:", customers.amount_spent.quantile(0.95).round(2), "\n",
      "75th percentile:", customers.amount_spent.quantile(0.75).round(2))

Total spent in orders: 2467.3499999999995 
 amount_spent value in customers:  2467.35 
 Customer group: ['Preferred']
95th percentile: 5840.18 
 75th percentile: 1661.64
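
The spot check above looks at one random customer. A possible extension checks every customer at once, using `np.allclose` so that tiny floating-point differences (such as 2467.3499999999995 versus 2467.35) do not count as mismatches. Both toy frames below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are invented.
toy_orders = pd.DataFrame({
    "customerid": [1, 1, 2, 3, 3, 3],
    "amount_spent": [10.0, 5.0, 20.0, 1.0, 2.0, 3.0],
})
toy_customers = pd.DataFrame({
    "id": [1, 2, 3],
    "amount_spent": [15.0, 20.0, 6.0],
})

# Recompute the per-customer totals directly from the order lines.
totals_from_orders = toy_orders.groupby("customerid")["amount_spent"].sum()

# Align the recomputed totals to toy_customers by customer id.
recomputed = toy_customers["id"].map(totals_from_orders)

# True only if every customer's stored total matches within floating-point tolerance.
all_match = np.allclose(toy_customers["amount_spent"], recomputed)

print(all_match)
```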


In [28]:
customers[(customers.amount_spent>customers.amount_spent.quantile(0.95))].head(10)

Unnamed: 0,id,amount_spent,group
0,12346,77183.6,VIP
10,12357,6207.67,VIP
12,12359,6372.58,VIP
50,12409,11072.67,VIP
55,12415,124914.53,VIP
66,12428,7956.46,VIP
69,12431,6487.45,VIP
71,12433,13375.87,VIP
73,12435,7829.89,VIP
86,12451,9035.52,VIP


Now we'll leave it to you to solve Q2 & Q3, for which you can build on your solution to Q1:

## Q2: How to identify which country has the most VIP Customers?

In [None]:
# your code here

In [34]:
customers['country']= np.nan

In [100]:
for e in customers.id:
    customers.loc[customers.id==e, 'country'] = orders.loc[orders.customerid==e, 'country'].values[0]
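
The loop above filters `orders` once per customer, which gets slow as the number of customers grows. A vectorized sketch of the same idea follows, assuming (as the loop does) that the first country seen for each customer is the one to keep. The toy frames and the `first_country` name are invented for illustration:

```python
import pandas as pd

# Hypothetical toy data: customer 1 appears with two countries on purpose.
toy_orders = pd.DataFrame({
    "customerid": [1, 1, 2, 3],
    "country": ["France", "Germany", "Spain", "France"],
})
toy_customers = pd.DataFrame({"id": [1, 2, 3]})

# Keep the first order row per customer, then build an id -> country lookup.
first_country = (
    toy_orders.drop_duplicates("customerid")
    .set_index("customerid")["country"]
)

# Map each customer id to its country in a single vectorized step.
toy_customers["country"] = toy_customers["id"].map(first_country)

print(toy_customers)
```

Customer 1 receives `France`, the country of its first order row, which matches what `.values[0]` selects in the loop.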

In [118]:
customers_vip= customers[(customers.group=='VIP')]

customers_vip.head()

Unnamed: 0,id,amount_spent,group,country
0,12346,77183.6,VIP,United Kingdom
10,12357,6207.67,VIP,Switzerland
12,12359,6372.58,VIP,Cyprus
50,12409,11072.67,VIP,Switzerland
55,12415,124914.53,VIP,Australia


In [158]:
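# Note: this groups every row of `customers`, so the count below covers all customers per country.
# To count only VIP customers, apply the same groupby to `customers_vip`.
customers.groupby('country').count().sort_values("id", ascending=False).head(1)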

country,id,amount_spent,group
United Kingdom,3921,3921,3921
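
The cell above groups all of `customers`, so its count is the total number of customers per country rather than the number of VIPs. The sketch below restricts the count to VIP rows. The toy data is invented so that the two counts disagree: France has the most customers overall, while Spain has the most VIPs.

```python
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are invented.
toy_customers = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "group": ["VIP", "VIP", "Regular", "VIP", "Preferred"],
    "country": ["Spain", "France", "France", "Spain", "France"],
})

# Keep only VIP rows, then count how many fall in each country.
is_vip = toy_customers["group"] == "VIP"
vip_per_country = toy_customers.loc[is_vip, "country"].value_counts()

# Country with the largest number of VIP customers.
top_vip_country = vip_per_country.idxmax()

print(vip_per_country)
print(top_vip_country)
```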


## Q3: How to identify which country has the most VIP+Preferred Customers combined?

In [None]:
# your code here

In [142]:
customers_vip_preferred= customers[(customers.group=='VIP') | (customers.group=='Preferred')]

customers_vip_preferred.head()

Unnamed: 0,id,amount_spent,group,country
0,12346,77183.6,VIP,United Kingdom
1,12347,4310.0,Preferred,Iceland
2,12348,1797.24,Preferred,Finland
3,12349,1757.55,Preferred,Italy
5,12352,2506.04,Preferred,Norway


In [159]:
customers_vip_preferred.groupby('country').count().sort_values("id", ascending=False).head(1)

country,id,amount_spent,group
United Kingdom,932,932,932
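
Q2 and Q3 can also be answered from a single table. The sketch below uses `pd.crosstab` to count customers per country and group, then adds a combined column. The toy data and the `VIP+Preferred` column name are invented for illustration:

```python
import pandas as pd

# Hypothetical toy data: column names mirror the notebook, values are invented.
toy_customers = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "group": ["VIP", "VIP", "Regular", "VIP", "Preferred", "Preferred"],
    "country": ["Spain", "France", "France", "Spain", "France", "France"],
})

# Rows are countries, columns are groups, cells are customer counts.
counts = pd.crosstab(toy_customers["country"], toy_customers["group"])

# Combined VIP and Preferred count per country.
counts["VIP+Preferred"] = counts["VIP"] + counts["Preferred"]

top_vip_country = counts["VIP"].idxmax()
top_combined_country = counts["VIP+Preferred"].idxmax()

print(counts)
print(top_vip_country, top_combined_country)
```

In this toy table, Spain leads on VIPs alone while France leads on the combined count, so the two questions can have different answers.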
