## TO DO

- Adjust percentage of profit

## Notebook objectives

Classify the clusters in business terms and extract strategies to increase the revenue and profit they generate.


## Agenda

[Business Context](#Business-Context)<br> 
&emsp;[Business Questions](#Business-Questions)<br>

[Imports](#Imports)<br>
&emsp;[Helper Functions and Classes](#Helper-Functions-and-Classes)<br>

[Settings](#Settings)<br>

[Loading Data](#Loading-Data)<br>
&emsp;[df_clusters](#df_clusters)<br>
&emsp;&emsp;[df_kmeans_clusters](#df_clusters)<br>
&emsp;&emsp;[df_dbscan_clusters](#df_clusters)<br>
&emsp;&emsp;[df_agglomerative_clustering](#df_clusters)<br>
&emsp;[df_orders](#df_orders)<br>
&emsp;[df_payments](#df_payments)<br>
&emsp;[df_products](#df_products)<br>
&emsp;[df_order_items](#df_order_items)<br>
&emsp;[df_geolocations](#df_geolocations)<br>
&emsp;[df_customers](#df_customers)<br>


[Analytical Base Table](#Analytical-Base-Table)<br>
&emsp;[ABT Metadata](#ABT-Metadata)<br>
&emsp;[df_payment_abt](#df_payment_abt)<br>
&emsp;[df_orders_abt](#df_orders_abt)<br>
&emsp;[df_geolocation_abt](#df_geolocation_abt)<br>

[Analysis](#Analysis)<br>
&emsp;[K-Means](#K-Means)<br>
&emsp;[DBScan](#DBScan)<br>
&emsp;[AgglomerativeClustering](#AgglomerativeClustering)<br>

[Final Clusters](#Final-Clusters)<br>



## Business Context

E-Mart is a Chinese retailer that discovered e-commerce as a way to sell to the entire world, rather than just to the population of its home city. The company has been growing without much trouble, and it now wants to start using the data collected during its 4 years of e-commerce to keep growing and make more money.

At first, the board of directors expects:

- A Dashboard with KPIs to track their growth.
<br>

- Robust data analysis, together with recommended actions. What is actionable based on the analysis?
<br>

- **An analysis of geolocation and a segmentation by sales, profit and more. They want insights to help increase revenue.**
<br>

- Sales forecast for the next year, in order to enable strategic planning.



### Business Questions

**How can we group markets by geolocation and profit from it? What are their characteristics?**

## Hypothesis


Why are discounts so high in a cluster? Were they applied in the initial sales period to expand the brand and win customers?

## Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### Helper Functions and Classes

In [123]:
def get_snakecase_columns(df):
    """Returns the column names of df converted to snake_case format
    
    df: pandas.DataFrame
    
    Return: list of str
    """
    snakecase = lambda x: str(x).lower().replace(' ', '_').replace('-', '_')
    # A list (rather than a lazy map object) can be reused and inspected
    return list(map(snakecase, df.columns))

def date(str_date):
    """Converts a date string to a pandas Timestamp using pandas.to_datetime

    str_date: date string (or any value accepted by pandas.to_datetime)
    """
    return pd.to_datetime(str_date)

def find_column(df, col_name):
    """Checks if DataFrame contains a 'column name' and returns the matched columns
    
    df: pandas.DataFrame
    col_name: column name or part of column name to search for
    
    Return: DataFrame with column names that match the col_name searched
    """
    
    df_cols = pd.DataFrame(df.columns, columns=['col_name'])

    return df_cols[df_cols['col_name'].str.contains(col_name)].reset_index(drop=True)


def fig(x=15, y=5, set_as_global=False, reset_to_default=False):
    """ Adjust size of matplotlib figure

    x: figure width.
    y: figure height.
    set_as_global: bool.
        If True, then it sets "x" and "y" axis for all subsequent plots.
    reset_to_default: bool.
        If True, then it resets the global figure size back to default.
    """
    if set_as_global:
        plt.rcParams["figure.figsize"] = (x, y)
    elif reset_to_default:
        plt.rcParams["figure.figsize"] = plt.rcParamsDefault["figure.figsize"]
    else:
        plt.figure(figsize=(x,y))
        
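def cluster_stats(df, column, cluster='cluster', quantile_1=.8, quantile_2=.9):
    """Summary statistics of a column per cluster, plus two extra quantiles

    df: pandas.DataFrame
    column: name of the numeric column to describe
    cluster: name of the column holding the cluster labels
    quantile_1, quantile_2: extra quantiles (between 0 and 1) added to the output
    """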
    stats = df.groupby(cluster)[[column]].describe()[column]
    quantiles = pd.concat([df.groupby(cluster)[[column]].quantile(quantile_1), df.groupby(cluster)[[column]].quantile(quantile_2)], axis=1)
    quantile_1_name, quantile_2_name = f'{round(quantile_1*100)}%', f'{round(quantile_2*100)}%'
    quantiles.columns = [quantile_1_name, quantile_2_name]
    stats = pd.concat([stats, quantiles], axis=1)
    return stats[['mean', 'std', 'min', '25%', '50%', '75%', quantile_1_name, quantile_2_name, 'max']]
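
A quick way to see what `cluster_stats` returns is to run it on toy data. The sketch below repeats the helper so that it runs on its own; the toy frame and its `value` column are hypothetical and only meant to show the shape of the result.

```python
import pandas as pd

def cluster_stats(df, column, cluster='cluster', quantile_1=.8, quantile_2=.9):
    # Same logic as the helper defined above, repeated so the example is self-contained
    stats = df.groupby(cluster)[[column]].describe()[column]
    quantiles = pd.concat([df.groupby(cluster)[[column]].quantile(quantile_1),
                           df.groupby(cluster)[[column]].quantile(quantile_2)], axis=1)
    quantile_1_name, quantile_2_name = f'{round(quantile_1*100)}%', f'{round(quantile_2*100)}%'
    quantiles.columns = [quantile_1_name, quantile_2_name]
    stats = pd.concat([stats, quantiles], axis=1)
    return stats[['mean', 'std', 'min', '25%', '50%', '75%', quantile_1_name, quantile_2_name, 'max']]

# Toy data (hypothetical): two clusters with 11 evenly spaced values each
df_toy = pd.DataFrame({
    'cluster': ['a'] * 11 + ['b'] * 11,
    'value': list(range(0, 11)) + list(range(0, 101, 10)),
})

# One row per cluster, with the 80% and 95% quantiles next to the describe() columns
toy_stats = cluster_stats(df_toy, 'value', quantile_2=.95)
print(toy_stats)
```

The extra quantile columns are named after the requested quantiles, so `quantile_2=.95` yields a `95%` column.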

## Settings

In [3]:
pd.set_option('display.max_columns', 500)

fig(12,4, set_as_global=True)

## Loading Data

### df_clusters


In [21]:
df_kmeans_clusters = pd.read_csv("../../data/country_clusters/kmeans_clusters.csv")
df_dbscan_clusters = pd.read_csv("../../data/country_clusters/dbscan_clusters.csv")
df_agglomerative_clustering = pd.read_csv("../../data/country_clusters/agglomerative_clustering.csv")

### df_orders

In [4]:
df_orders = pd.read_csv("https://raw.githubusercontent.com/pauloreisdatascience/datasets/main/e_market/e_mart_orders_table.csv")
df_orders.columns = get_snakecase_columns(df_orders)

df_orders['customer_id'] = df_orders['customer_id'].apply(lambda x: x[2:-2])
df_orders['order_date'] = df_orders['order_date'].apply(date)
df_orders['ship_date'] = df_orders['ship_date'].apply(date)
df_orders['delivery_date'] = df_orders['delivery_date'].apply(date)
df_orders['deadline_date'] = df_orders['deadline_date'].apply(date)
df_orders['order_priority'] = df_orders['order_priority'].apply(lambda x: x[2:-2])
df_orders = df_orders.assign(postal_code=df_orders['postal_code'].apply(lambda x: int(x[2:-2])))
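
The cleaning above strips two characters from each end of some fields and parses the dates row by row. A vectorised version of the same idea might look like the sketch below. The raw values are hypothetical: it assumes fields arrive wrapped like `['PO-8865']`, which is what the `x[2:-2]` slices suggest, and that dates are ISO-formatted strings.

```python
import pandas as pd

# Hypothetical raw values: assumes fields arrive wrapped like "['PO-8865']"
raw = pd.DataFrame({
    'customer_id': ["['PO-8865']", "['EB-4110']"],
    'order_date': ['2019-03-10', '2021-10-14'],
    'postal_code': ["['5137041']", "['51378252664']"],
})

clean = raw.assign(
    # .str[2:-2] drops the leading "['" and trailing "']" for the whole column
    customer_id=raw['customer_id'].str[2:-2],
    # pd.to_datetime accepts a whole Series, so no per-row apply is needed
    order_date=pd.to_datetime(raw['order_date']),
    # int64 is requested explicitly because some codes do not fit in 32 bits
    postal_code=raw['postal_code'].str[2:-2].astype('int64'),
)
print(clean.dtypes)
```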

In [5]:
df_orders.duplicated(subset=['order_id']).sum()

0

### df_payments

In [6]:
df_payments = pd.read_csv("https://raw.githubusercontent.com/pauloreisdatascience/datasets/main/e_market/e_mart_payment_table.csv")
df_payments.columns = get_snakecase_columns(df_payments)

In [7]:
df_payments.duplicated(subset=['order_id']).sum()

0

### df_products

In [8]:
df_products = pd.read_csv("https://raw.githubusercontent.com/pauloreisdatascience/datasets/main/e_market/e_mart_products_table.csv")
df_products.columns = get_snakecase_columns(df_products)

In [9]:
df_products.duplicated(subset=['product_id']).sum()

0

### df_order_items

In [10]:
df_order_items = pd.read_csv("https://raw.githubusercontent.com/pauloreisdatascience/datasets/main/e_market/e_mart_order_items_table.csv")
df_order_items.columns = get_snakecase_columns(df_order_items)

In [11]:
df_order_items.duplicated(subset=['order_item_id']).sum()

35
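
Unlike the other tables, `df_order_items` has repeated `order_item_id` values, and the notebook does not handle them. Whether they are true duplicates or legitimate repeats is not established here. The sketch below, on hypothetical toy data, shows one way to inspect such rows and to check how much they would change an aggregate.

```python
import pandas as pd

# Toy order items (hypothetical values); order_item_id 2 appears twice
items = pd.DataFrame({
    'order_item_id': [1, 2, 2, 3],
    'order_id': ['A', 'A', 'A', 'B'],
    'quantity': [1, 4, 4, 2],
})

# keep=False marks every member of a duplicated group, not only the repeats
dupes = items[items.duplicated(subset=['order_item_id'], keep=False)]
print(dupes)

# Compare an aggregate before and after dropping the repeats
qty_raw = items.groupby('order_id')['quantity'].sum()
qty_dedup = (items.drop_duplicates(subset=['order_item_id'])
                  .groupby('order_id')['quantity'].sum())
print(qty_raw['A'], qty_dedup['A'])
```

If the repeated rows turn out to be identical copies, sums such as `n_products` and `sales` in the ABT may be inflated for the affected orders.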

### df_geolocations

In [12]:
df_geolocations = pd.read_csv("https://raw.githubusercontent.com/pauloreisdatascience/datasets/main/e_market/e_mart_geolocation_table.csv")
df_geolocations.columns = get_snakecase_columns(df_geolocations)

In [13]:
df_geolocations.duplicated(subset=['postal_code']).sum()

0

### df_customers

In [14]:
df_customers = pd.read_csv("https://raw.githubusercontent.com/pauloreisdatascience/datasets/main/e_market/e_mart_customers_table.csv")
df_customers.columns = get_snakecase_columns(df_customers)

In [15]:
df_customers.duplicated(subset=['customer_id']).sum()

0

## Analytical Base Table

### ABT Metadata

[df_orders_abt](#df_orders_abt)<br>



### df_payment_abt

In [16]:
df_payment_abt = df_order_items.merge(df_products, how='left', on=['product_id'])

df_payment_abt['sales'] = (
    (df_payment_abt['quantity']*df_payment_abt['product_price'] *(1-df_payment_abt['discount']))
    + df_payment_abt['shipping_cost']
)

cols = ['order_item_id', 'order_id', 'product_id',
        'quantity', 'product_price', 'discount',
        'shipping_cost', 'sales']
df_payment_abt = df_payment_abt[cols]


df_payment_abt = (
    df_payment_abt.groupby('order_id')
                    .agg(n_products=('quantity', 'sum'),
                         total_discount=('discount', 'sum'),
                         avg_discount=('discount', 'mean'),
                         avg_product_price=('product_price', 'mean'),
                         shipping_cost=('shipping_cost', 'sum'),
                         sales=('sales', 'sum'),
                         max_product_price=('product_price', 'max'),
                         min_product_price=('product_price', 'min')
                        )
)

# Check if Payment and Sales are equal
# df_payment_abt = df_payment_abt.merge(df_payments.rename(columns={'sales':'payment'}),
#                                       how='left', on=['order_id'])

df_payment_abt.head(2)

| order_id | n_products | total_discount | avg_discount | avg_product_price | shipping_cost | sales | max_product_price | min_product_price |
|---|---|---|---|---|---|---|---|---|
| AE-2011-9160-PO-8865 | 8 | 1.4 | 0.7 | 110.775909 | 9.56 | 176.424976 | 193.273579 | 28.278238 |
| AE-2013-1130-EB-4110 | 7 | 1.4 | 0.7 | 52.94744 | 60.18 | 242.152214 | 100.135833 | 5.759048 |
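
The item-level sales formula used for `df_payment_abt` is `quantity * product_price * (1 - discount) + shipping_cost`, summed per order. The worked example below uses hypothetical numbers for a single order to make the arithmetic explicit.

```python
import pandas as pd

# Toy order items (hypothetical values) for a single order
items = pd.DataFrame({
    'order_id': ['X-1', 'X-1'],
    'quantity': [2, 6],
    'product_price': [100.0, 10.0],
    'discount': [0.5, 0.5],
    'shipping_cost': [4.0, 6.0],
})

# Discounted product value plus the shipping charged for each item
items['sales'] = (items['quantity'] * items['product_price'] * (1 - items['discount'])
                  + items['shipping_cost'])

order = items.groupby('order_id').agg(
    n_products=('quantity', 'sum'),
    total_discount=('discount', 'sum'),   # sum of discount rates, not a monetary amount
    avg_discount=('discount', 'mean'),
    sales=('sales', 'sum'),
)
print(order)
```

Here the two items contribute 2 × 100 × 0.5 + 4 = 104 and 6 × 10 × 0.5 + 6 = 36, so the order totals 140. Note that `total_discount` adds up rates across items, which is why it can exceed 1 in the output above.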


### df_orders_abt


In [24]:
df_orders_abt = (df_orders
         .merge(df_geolocations, how='left', on='postal_code')
         .merge(df_customers, how='left', on='customer_id')
         .merge(df_payments, how='left', on=['order_id'])
)

df_orders_abt['order_date_monthly'] = df_orders_abt['order_date'].dt.to_period("M")#.dt.to_timestamp(freq='M')
df_orders_abt['market_region'] = df_orders_abt['market'] +" | " + df_orders_abt['region']
df_orders_abt['delivery_time'] = (df_orders_abt['delivery_date'] - df_orders_abt['order_date']).dt.days
df_orders_abt['expected_delivery_time'] = (df_orders_abt['deadline_date'] - df_orders_abt['order_date']).dt.days
df_orders_abt['days_to_ship'] = (df_orders_abt['ship_date'] - df_orders_abt['order_date']).dt.days
df_orders_abt['delayed_days'] = df_orders_abt['delivery_time'] - df_orders_abt['expected_delivery_time']
df_orders_abt['delivery_on_time'] = df_orders_abt['delayed_days'] <= 0
df_orders_abt['profitable'] = df_orders_abt['profit'] > 0

df_orders_abt = df_orders_abt.merge(df_payment_abt.drop(columns=['sales']),
                                    how='left', on=['order_id'])

df_orders_abt.head(2)

| | order_id | customer_id | order_date | ship_date | delivery_date | ship_mode | postal_code | market | order_priority | deadline_date | order_status | region | country | state | city | customer_name | segment | sales | profit | order_date_monthly | market_region | delivery_time | expected_delivery_time | days_to_ship | delayed_days | delivery_on_time | profitable | n_products | total_discount | avg_discount | avg_product_price | shipping_cost | max_product_price | min_product_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AE-2011-9160-PO-8865 | PO-8865 | 2019-03-10 | 2019-03-29 | 2019-04-01 | Standard Class | 5137041 | EMEA | Medium | 2019-04-19 | Delivered | EMEA | United Arab Emirates | 'Ajman | Ajman | Patrick O'Donnell | Consumer | 176.424976 | -246.078 | 2019-03 | EMEA \| EMEA | 22.0 | 40 | 19.0 | -18.0 | True | False | 8 | 1.4 | 0.7 | 110.775909 | 9.56 | 193.273579 | 28.278238 |
| 1 | AE-2013-1130-EB-4110 | EB-4110 | 2021-10-14 | 2021-10-14 | 2021-10-23 | Same Day | 51378252664 | EMEA | High | 2021-11-23 | Delivered | EMEA | United Arab Emirates | Ra's Al Khaymah | Ras al Khaymah | Eugene Barchas | Consumer | 242.152214 | -236.964 | 2021-10 | EMEA \| EMEA | 9.0 | 40 | 0.0 | -31.0 | True | False | 7 | 1.4 | 0.7 | 52.94744 | 60.18 | 100.135833 | 5.759048 |
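
The delivery columns are plain date differences measured from `order_date`. As a check, the sketch below recomputes them for the first order in the output above, using the four dates shown there.

```python
import pandas as pd

# Dates of the first order shown in the output above
row = pd.DataFrame({
    'order_date': pd.to_datetime(['2019-03-10']),
    'ship_date': pd.to_datetime(['2019-03-29']),
    'delivery_date': pd.to_datetime(['2019-04-01']),
    'deadline_date': pd.to_datetime(['2019-04-19']),
})

row['delivery_time'] = (row['delivery_date'] - row['order_date']).dt.days
row['expected_delivery_time'] = (row['deadline_date'] - row['order_date']).dt.days
row['days_to_ship'] = (row['ship_date'] - row['order_date']).dt.days
# Negative values mean the order arrived before its deadline
row['delayed_days'] = row['delivery_time'] - row['expected_delivery_time']
row['delivery_on_time'] = row['delayed_days'] <= 0

print(row[['delivery_time', 'expected_delivery_time', 'days_to_ship',
           'delayed_days', 'delivery_on_time']])
```

The order was delivered 22 days after it was placed against a 40-day deadline, which gives `delayed_days = -18` and an on-time delivery.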


In [25]:
# df_orders_abt.to_csv("../../data/growth_analysis/orders_abt.csv")

### df_geolocation_abt

In [29]:
df_geolocation_abt = pd.read_csv("../../data/country_clusters/countries_abt.csv")

## Analysis


    Columns to Analyse:
    
            profitable_rate
        Sales:
            avg_sales, avg_profit, avg_discount, avg_product_price, avg_shipping_cost
        Quantity:
            n_orders, n_products, avg_products_per_order
        Delivery:
            avg_delivery_time, avg_days_to_ship, avg_delayed_days, delivery_on_time, on_time_rate
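
The country-level table (`countries_abt.csv`) is loaded from disk and its construction is not shown in this notebook. A minimal sketch of how columns like the ones listed above could be derived from an order-level table follows. The aggregation choices are assumptions (for example, treating `delivery_on_time` as a count and `on_time_rate` as a share), and the toy values are hypothetical.

```python
import pandas as pd

# Toy order-level rows (hypothetical values) using column names from df_orders_abt
orders = pd.DataFrame({
    'country': ['A', 'A', 'A', 'B'],
    'order_id': ['o1', 'o2', 'o3', 'o4'],
    'sales': [100.0, 200.0, 300.0, 50.0],
    'profit': [10.0, -5.0, 25.0, 5.0],
    'profitable': [True, False, True, True],
    'avg_discount': [0.0, 0.5, 0.1, 0.0],
    'n_products': [2, 4, 6, 1],
    'delivery_on_time': [True, True, False, True],
})

# Assumed aggregations: one row per country
countries = orders.groupby('country').agg(
    n_orders=('order_id', 'nunique'),
    n_products=('n_products', 'sum'),
    total_sales=('sales', 'sum'),
    total_profit=('profit', 'sum'),
    avg_sales=('sales', 'mean'),
    avg_profit=('profit', 'mean'),
    avg_discount=('avg_discount', 'mean'),
    profitable_rate=('profitable', 'mean'),        # share of orders with profit > 0
    delivery_on_time=('delivery_on_time', 'sum'),  # number of on-time orders
    on_time_rate=('delivery_on_time', 'mean'),     # share of on-time orders
).reset_index()
countries['avg_products_per_order'] = countries['n_products'] / countries['n_orders']
print(countries)
```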

### K-Means

Initial Cluster Segment Analysis 

    Cluster 0 (Profitable Performance):
        Positive Profit
        No Discounts
    
    Cluster 1 (Bad Performance):
        Negative Profit
        Higher Discounts
    
    Cluster 2 (Extraordinary Performance):
        Positive Profit
        Higher Average Sales
        Lower Discounts
        Higher Shipping Cost
        Higher Quantity of Sold Products 
        Higher Number of Orders
        
        
**Final Group Characteristics**

    Extraordinary Performance:
        11 countries with 
            55% of Total Revenue
            79% of Total Profit
            57% of Total Number of Orders
            and 62% of all Sold Products 
            
            Some level of discount (up to 22%)
            Higher Average Shipping cost Charged to Customers (62)
            Higher Number of Orders (Average of 1351 per Country)
            More Cross-Selling (Average Number of Products per Order 7)
            Higher Average Ticket (6262)
    
    Profitable Performance:
        107 countries with 
            32% of Total Revenue
            51% of Total Profit
            27% of Total Number of Orders
            and 22% of all Sold Products 
            
            Lower Discounts (up to 3%)
            Average Shipping cost (59)
            Average Number of Orders per country (65)
            Average Number of Products per Order (5)
            Average Ticket (3986)
        
    Bad Performance:
        29 countries with
            Higher Discounts (23% up to 70%)
            Negative Profit (average of -135)
            Lower Average Shipping cost Charged to Customers (31)
            
            Average Number of Orders per country (135)
            Average Ticket (2924)

In [130]:
df_aux = df_geolocation_abt.merge(df_kmeans_clusters, how='left', on='country')

df_aux['cluster'] = df_aux['cluster'].map(
        {0:'Profitable Performance',
         1:'Bad Performance',
         2:'Extraordinary Performance',}
    )
df_aux['cluster'] = pd.Categorical(df_aux['cluster'], ["Bad Performance", "Profitable Performance", "Extraordinary Performance"])

df_aux['cluster'].value_counts()

Profitable Performance       107
Bad Performance               29
Extraordinary Performance     11
Name: cluster, dtype: int64
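
The cell above first maps numeric labels to names and then wraps them in a `pd.Categorical` with an explicit category order, which is what makes the later `sort_values("cluster", ascending=False)` calls list the best-performing group first. The sketch below shows both effects on toy labels. The label `3` is hypothetical and stands for any value missing from the mapping.

```python
import pandas as pd

labels = pd.Series([0, 1, 2, 1, 3])   # 3 is a label missing from the mapping (hypothetical)
names = labels.map({0: 'Profitable Performance',
                    1: 'Bad Performance',
                    2: 'Extraordinary Performance'})
names = pd.Series(pd.Categorical(
    names, ['Bad Performance', 'Profitable Performance', 'Extraordinary Performance']))

# Labels absent from the mapping become NaN and drop out of group-level results
n_unmapped = names.isna().sum()
print(n_unmapped)

# Sorting follows the category order given above, not alphabetical order
ordered = names.dropna().sort_values(ascending=False).unique().tolist()
print(ordered)
```

Checking `isna().sum()` after the mapping is a cheap guard against cluster labels changing between clustering runs.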

In [119]:
cols = ['total_sales', 'total_profit', 'n_orders', 'n_products',]
df_aux.groupby("cluster")[cols].sum().sort_values("cluster", ascending=False).round(2)

| cluster | total_sales | total_profit | n_orders | n_products |
|---|---|---|---|---|
| Extraordinary Performance | 76546977.42 | 1166657.38 | 14861 | 111617 |
| Profitable Performance | 44679286.06 | 748699.58 | 6977 | 40652 |
| Bad Performance | 16390928.55 | -447899.67 | 3915 | 26043 |


In [120]:
df_aux.groupby("cluster")[cols].sum().sum().astype(int)

total_sales     137617192
total_profit      1467457
n_orders            25753
n_products         178312
dtype: int32
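
The percentage shares quoted in the K-Means group characteristics can be reproduced by dividing each cluster total by the grand total. The sketch below does this with the cluster totals copied from the output above.

```python
import pandas as pd

# Cluster totals copied from the K-Means output above
totals = pd.DataFrame(
    {'total_sales': [76546977.42, 44679286.06, 16390928.55],
     'total_profit': [1166657.38, 748699.58, -447899.67],
     'n_orders': [14861, 6977, 3915],
     'n_products': [111617, 40652, 26043]},
    index=['Extraordinary Performance', 'Profitable Performance', 'Bad Performance'],
)

# Share of each column's grand total, in percent
shares = (totals / totals.sum() * 100).round(1)
print(shares)
```

Because the Bad Performance cluster has negative total profit, the profit shares of the two profitable clusters add up to more than 100%, and the three shares together still sum to 100%.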

In [121]:
cols = ['profitable_rate', 'avg_sales', 'avg_profit', 'avg_discount', 'avg_product_price', 'avg_shipping_cost',
'n_orders', 'n_products', 'avg_products_per_order', 'avg_delivery_time', 'avg_days_to_ship', 'avg_delayed_days', 'delivery_on_time', 'on_time_rate']

df_aux.groupby("cluster")[cols].mean().sort_values("cluster", ascending=False)

| cluster | profitable_rate | avg_sales | avg_profit | avg_discount | avg_product_price | avg_shipping_cost | n_orders | n_products | avg_products_per_order | avg_delivery_time | avg_days_to_ship | avg_delayed_days | delivery_on_time | on_time_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Extraordinary Performance | 0.798515 | 6262.59077 | 85.582126 | 0.106523 | 864.272209 | 62.57799 | 1351.0 | 10147.0 | 7.49297 | 18.744547 | 13.693868 | -21.255453 | 1224.636364 | 0.905487 |
| Profitable Performance | 0.982469 | 3986.164671 | 117.96627 | 0.005051 | 562.333598 | 59.462045 | 65.205607 | 379.925234 | 5.708846 | 18.502917 | 13.536394 | -21.497083 | 59.158879 | 0.917419 |
| Bad Performance | 0.112196 | 2924.094839 | -135.469907 | 0.505343 | 674.590157 | 31.339149 | 135.0 | 898.034483 | 6.866929 | 18.770639 | 13.531202 | -21.229361 | 123.206897 | 0.90159 |


In [127]:
cluster_stats(df_aux, 'avg_discount', quantile_2=.95).sort_values("cluster", ascending=False)

| cluster | mean | std | min | 25% | 50% | 75% | 80% | 95% | max |
|---|---|---|---|---|---|---|---|---|---|
| Extraordinary Performance | 0.106523 | 0.080441 | 0.018279 | 0.060738 | 0.070584 | 0.146016 | 0.148744 | 0.227213 | 0.297613 |
| Profitable Performance | 0.005051 | 0.022621 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.030504 | 0.176508 |
| Bad Performance | 0.505343 | 0.154859 | 0.234884 | 0.40765 | 0.486848 | 0.7 | 0.7 | 0.7 | 0.7 |


### DBScan

Initial Cluster Segment Analysis 


    Cluster -1 (Profitable Performance):
        Positive and Negative Profit
        Some Level of Discount
        Higher Number of Sold Products 
        Higher Number of Orders 
    
    Cluster 0 (High Margins):
        Positive Profit
        No Discounts
    
    Cluster 1 (Bad Performance):
        Negative Profit
        Higher Discounts
        
**Final Group Characteristics**

    High Margins: (Sales Focused on Profit)
        106 countries with 
            Higher Average Profit Margin (2%)
            Lower Discounts (up to 6%)
            Higher Average Profit (113)
            37% of Total Revenue
            75% of Total Profit
            41% of Total Orders
            
    
    Profitable Performance: (Sales Focused on Revenue and Quantity Sold)
        14 countries with 
            Average Profit Margin (1%)
            Some Level of Discount (most of them lower than 20%, but it can reach 63%)
            Average Profit (85)
            50% of Total Revenue
            48% of Total Profit
            45% of Total Orders
            
        
    Bad Performance:
        27 countries with
            Higher Discounts (23% up to 70%)
            Average Deficit (-135)
            

In [172]:
df_aux = df_geolocation_abt.merge(df_dbscan_clusters, how='left', on='country')

df_aux['cluster'] = df_aux['cluster'].map(
        {-1:'Profitable Performance',
         1:'Bad Performance',
         0:'High Margins',}
    )
df_aux['cluster'] = pd.Categorical(df_aux['cluster'], ["Bad Performance", "Profitable Performance", "High Margins"])

df_aux['cluster'].value_counts()

High Margins              106
Bad Performance            27
Profitable Performance     14
Name: cluster, dtype: int64

In [173]:
cols = ['total_sales', 'total_profit', 'n_orders', 'n_products',]
df_aux.groupby("cluster")[cols].sum().sort_values("cluster", ascending=False).round(2)

| cluster | total_sales | total_profit | n_orders | n_products |
|---|---|---|---|---|
| High Margins | 51327174.06 | 1102208.37 | 10685 | 68289 |
| Profitable Performance | 69995050.61 | 714438.5 | 11834 | 87018 |
| Bad Performance | 16294967.36 | -349189.57 | 3234 | 23005 |


In [174]:
df_aux.groupby("cluster")[cols].sum().sum().astype(int)

total_sales     137617192
total_profit      1467457
n_orders            25753
n_products         178312
dtype: int32
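
The DBScan group characteristics quote profit margins of roughly 2% and 1%. One way to arrive at comparable figures, assuming margin is defined as total profit divided by total sales (the notebook does not state the definition), is sketched below with the cluster totals copied from the output above.

```python
import pandas as pd

# Cluster totals copied from the DBScan output above
totals = pd.DataFrame(
    {'total_sales': [51327174.06, 69995050.61, 16294967.36],
     'total_profit': [1102208.37, 714438.5, -349189.57]},
    index=['High Margins', 'Profitable Performance', 'Bad Performance'],
)

# Assumed definition: margin = total profit / total sales, in percent
totals['profit_margin_pct'] = (totals['total_profit'] / totals['total_sales'] * 100).round(2)
print(totals['profit_margin_pct'])
```

Under this definition the High Margins cluster earns about twice as much profit per unit of revenue as the Profitable Performance cluster, while the Bad Performance cluster loses a similar proportion.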

In [175]:
cols = ['profitable_rate', 'avg_sales', 'avg_profit', 'avg_discount', 'avg_product_price', 'avg_shipping_cost',
'n_orders', 'n_products', 'avg_products_per_order', 'avg_delivery_time', 'avg_days_to_ship', 'avg_delayed_days', 'delivery_on_time', 'on_time_rate']

df_aux.groupby("cluster")[cols].mean().sort_values("cluster", ascending=False)

| cluster | profitable_rate | avg_sales | avg_profit | avg_discount | avg_product_price | avg_shipping_cost | n_orders | n_products | avg_products_per_order | avg_delivery_time | avg_days_to_ship | avg_delayed_days | delivery_on_time | on_time_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| High Margins | 0.972261 | 3297.220153 | 113.922158 | 0.01014 | 430.511863 | 55.635171 | 100.801887 | 644.235849 | 5.659102 | 18.227375 | 13.242645 | -21.772625 | 91.150943 | 0.917443 |
| Profitable Performance | 0.774875 | 10437.751201 | 86.544571 | 0.138388 | 1723.464179 | 84.263361 | 845.285714 | 6215.571429 | 7.489859 | 21.377707 | 16.433218 | -18.622293 | 768.5 | 0.877636 |
| Bad Performance | 0.120507 | 3132.33391 | -135.26688 | 0.494628 | 721.372139 | 32.689514 | 119.777778 | 852.037037 | 6.951383 | 18.480037 | 13.246156 | -21.519963 | 109.37037 | 0.916091 |


In [176]:
cluster_stats(df_aux, 'avg_discount', quantile_2=.95).sort_values("cluster", ascending=False)

| cluster | mean | std | min | 25% | 50% | 75% | 80% | 95% | max |
|---|---|---|---|---|---|---|---|---|---|
| High Margins | 0.01014 | 0.038774 | 0.0 | 0.0 | 0.0 | 0.0 | 3e-05 | 0.063957 | 0.297613 |
| Profitable Performance | 0.138388 | 0.225257 | 0.0 | 0.000106 | 0.049648 | 0.14738 | 0.151971 | 0.635 | 0.7 |
| Bad Performance | 0.494628 | 0.154612 | 0.234884 | 0.406406 | 0.453452 | 0.7 | 0.7 | 0.7 | 0.7 |


### AgglomerativeClustering


Initial Cluster Segment Analysis 

    Cluster 0 (Extraordinary Performance):
        Positive Profit
        Some Level of Discount
        Higher Shipping Cost
        Higher Number of Sold Products
        Higher Number of Orders
        More Orders Delivered On Time
        More Consistency in Delivery Time and Time to Ship Product
    
    Cluster 1 (Bad Performance):
        Negative Profit
        Higher Discounts
    
    Cluster 2 (Profitable Performance):
        Positive Profit
        No Discount
        
**Final Group Characteristics**

    Extraordinary Performance:
        13 countries with 
            
    
    Profitable Performance:
        105 countries with 
            
        
    Bad Performance:
        29 countries with
            

In [177]:
df_aux = df_geolocation_abt.merge(df_agglomerative_clustering, how='left', on='country')

df_aux['cluster'] = df_aux['cluster'].map(
        {0:'Extraordinary Performance',
         1:'Bad Performance',
         2:'Profitable Performance',}
    )
df_aux['cluster'] = pd.Categorical(df_aux['cluster'], ["Bad Performance", "Profitable Performance", "Extraordinary Performance"])

df_aux['cluster'].value_counts()

Profitable Performance       105
Bad Performance               29
Extraordinary Performance     13
Name: cluster, dtype: int64

In [196]:
cols = ['total_sales', 'total_profit', 'n_orders', 'n_products',]
df_aux.groupby("cluster")[cols].sum().sort_values("cluster", ascending=False).round(2)

| cluster | total_sales | total_profit | n_orders | n_products |
|---|---|---|---|---|
| Extraordinary Performance | 77621346.44 | 1237647.78 | 15631 | 116946 |
| Profitable Performance | 43604917.03 | 677709.17 | 6207 | 35323 |
| Bad Performance | 16390928.55 | -447899.67 | 3915 | 26043 |


    Extraordinary Performance:
        13 countries with 
            56% of Total Revenue
            84% of Total Profit
            60% of Number of Orders
    
    Profitable Performance:
        105 countries with 
            31% of Total Revenue
            46% of Total Profit
            24% of Number of Orders
        
    Bad Performance:
        29 countries with
            

In [183]:
df_aux.groupby("cluster")[cols].sum().sum().astype(int)

total_sales     137617192
total_profit      1467457
n_orders            25753
n_products         178312
dtype: int32

In [184]:
cols = ['profitable_rate', 'avg_sales', 'avg_profit', 'avg_discount', 'avg_product_price', 'avg_shipping_cost',
'n_orders', 'n_products', 'avg_products_per_order', 'avg_delivery_time', 'avg_days_to_ship', 'avg_delayed_days', 'delivery_on_time', 'on_time_rate']

df_aux.groupby("cluster")[cols].mean().sort_values("cluster", ascending=False)

| cluster | profitable_rate | avg_sales | avg_profit | avg_discount | avg_product_price | avg_shipping_cost | n_orders | n_products | avg_products_per_order | avg_delivery_time | avg_days_to_ship | avg_delayed_days | delivery_on_time | on_time_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Extraordinary Performance | 0.792094 | 5496.312346 | 85.664725 | 0.10659 | 763.291474 | 62.874026 | 1202.384615 | 8995.846154 | 7.3981 | 18.730512 | 13.672544 | -21.269488 | 1090.230769 | 0.906577 |
| Profitable Performance | 0.986768 | 4037.67674 | 118.572884 | 0.00311 | 569.084763 | 59.366042 | 59.114286 | 336.409524 | 5.686608 | 18.500052 | 13.536035 | -21.499948 | 53.6 | 0.917512 |
| Bad Performance | 0.112196 | 2924.094839 | -135.469907 | 0.505343 | 674.590157 | 31.339149 | 135.0 | 898.034483 | 6.866929 | 18.770639 | 13.531202 | -21.229361 | 123.206897 | 0.90159 |


In [185]:
cluster_stats(df_aux, 'avg_discount', quantile_2=.95).sort_values("cluster", ascending=False)

| cluster | mean | std | min | 25% | 50% | 75% | 80% | 95% | max |
|---|---|---|---|---|---|---|---|---|---|
| Extraordinary Performance | 0.10659 | 0.078731 | 0.018279 | 0.060642 | 0.070584 | 0.148744 | 0.153585 | 0.22495 | 0.297613 |
| Profitable Performance | 0.00311 | 0.015001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.009929 | 0.125 |
| Bad Performance | 0.505343 | 0.154859 | 0.234884 | 0.40765 | 0.486848 | 0.7 | 0.7 | 0.7 | 0.7 |
