In [139]:
# import libraries
import pandas as pd
import sqlite3
from sqlite3 import Error
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from collections import defaultdict
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats

# Executive Summary


TripleTen World Consultancy (TTWC) has conducted an in-depth audit of AtliQ Hardware's sales data to facilitate data-driven decision-making and enhance operational efficiency. The analysis encompassed financial, product, customer, and geographic perspectives, shedding light on key trends and actionable insights.

**Problem Overview:**
AtliQ Hardware, a prominent computer hardware producer, sought TTWC's assistance in conducting a comprehensive audit of their sales data. The aim was to glean insights into revenue, profits, product performance, customer segmentation, and regional dynamics, ultimately enabling informed decision-making and strategic planning.

**Data and Analysis Approach:**
TTWC leveraged a SQLite database provided by AtliQ Hardware for the analysis. The data underwent meticulous cleaning and preprocessing, including handling duplicates and merging relevant tables into a single DataFrame. Key metrics such as revenue, margin, and customer behavior were calculated and visualized to facilitate deeper insights.

**Findings and Recommendations:**

1. **Financial Analysis:**
   - Steady improvement in revenue and margin over the years, with a notable surge in FY2022.
   - Seasonal patterns observed, with peak performance during the 4th quarter (including September).
   - Consistent discount range, suggesting potential for targeted pricing strategies.
   - Margins remained stable, underscoring the company's pricing effectiveness.

   *Recommendation:* Implement tailored pricing strategies during peak periods to maximize profitability.

2. **Geographical Analysis:**
   - Expansion across four regions, with EU exhibiting rapid growth.
   - APAC emerged as the largest region, while LATAM showed lower margins due to higher discounts.

   *Recommendation:* Focus on capitalizing on growth opportunities in EU while optimizing pricing strategies in LATAM.

3. **Sales Channels Analysis:**
   - Diverse sales channels with varied performance metrics.
   - Brick & Mortar retail sales dominated revenue, with consistent growth.
   - Discounts in offline stores remained lower, possibly reflecting higher overhead costs.

   *Recommendation:* Explore channel-specific strategies to maximize sales and optimize pricing.

4. **Customer Analysis:**
   - It's important to note that the customer data provided reflects sales channels rather than individual end customers. As a result, the analysis is somewhat limited in providing granular insights into individual customer behavior and preferences.
   - Moving forward, AtliQ Hardware may consider augmenting its data collection methods to capture more granular customer-level data, enabling deeper analysis and more targeted strategies to enhance customer satisfaction and drive growth.
   - Significant growth in customer base, especially in the local retailers channel.
   - Expansion efforts in EU yielded positive results, while LATAM expansion faced challenges.
   - No clear correlation between discounts and customer growth rates.

   *Recommendation:* Focus on nurturing relationships with local retailers and refining expansion strategies in challenging markets.

5. **Product Analysis:**
   - Limited data available for comprehensive product analysis.
   - Margin stability observed, irrespective of product class.

   *Recommendation:* Conduct a more detailed analysis with a broader dataset to gain deeper insights into product performance and pricing strategies. Higher-class products would normally be expected to carry higher margins.

**Hypothesis Testing:**
- Seasonality: Significant difference in ARPU between the 4th quarter and other months.
- Discount: Inconclusive evidence of increased discounts in Latin America.
- Margins: No significant difference in margins across product classes.

**Conclusion:**
The analysis provides valuable insights into AtliQ Hardware's sales performance, highlighting areas of strength and opportunities for improvement. By leveraging these insights, AtliQ Hardware can refine its pricing strategies, optimize channel management, and tailor its product offerings to enhance competitiveness and drive sustainable growth in the dynamic hardware market.

# Connection to the database


**Note:** At first I load every table in full (`SELECT *`) because, as we will see later, the database is clearly only a small extract of the full one. Where appropriate, I will also use more targeted SQL queries further on, in line with the project rules.
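
As a rough illustration of what a more targeted query could look like, the sketch below aggregates sold quantity per fiscal year in SQL instead of loading the whole table. It runs against a tiny in-memory stand-in for `fact_sales_monthly` filled with a few sample rows, so the totals are illustrative only; `demo_con` and `yearly_quantity` are hypothetical names.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory stand-in for fact_sales_monthly (illustrative rows only)
demo_con = sqlite3.connect(":memory:")
demo_con.execute(
    "CREATE TABLE fact_sales_monthly "
    "(date TEXT, product_code TEXT, customer_code INTEGER, "
    "sold_quantity INTEGER, fiscal_year INTEGER)"
)
demo_con.executemany(
    "INSERT INTO fact_sales_monthly VALUES (?, ?, ?, ?, ?)",
    [
        ("2020-10-01", "A0118150101", 90006156, 90, 2021),
        ("2020-11-01", "A0118150102", 90023024, 105, 2021),
        ("2021-09-01", "A0418150103", 90026205, 31, 2022),
    ],
)

# Aggregate in SQL and pass the filter value as a bound parameter
query = """
    SELECT fiscal_year, SUM(sold_quantity) AS total_quantity
    FROM fact_sales_monthly
    WHERE fiscal_year >= ?
    GROUP BY fiscal_year
    ORDER BY fiscal_year
"""
yearly_quantity = pd.read_sql_query(query, demo_con, params=(2021,))
demo_con.close()
print(yearly_quantity)
```

Binding the value through `params` rather than string concatenation keeps the query safe if the filter ever comes from user input.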

In [140]:
### connect to the database
con = sqlite3.connect('atliq_db.sqlite3')

### check all tables in the database
cursor = con.cursor()
table_names = cursor.execute("SELECT name FROM sqlite_master  WHERE type='table';").fetchall()
print(table_names)

[('dim_customer',), ('dim_product',), ('fact_pre_discount',), ('fact_manufacturing_cost',), ('fact_gross_price',), ('fact_sales_monthly',)]


In [141]:
# download all tables into a dictionary of DataFrames to quickly look through the data
data_types = {
    'customer_code': 'float64',
    'customer':'str',
    'platform':'category',
    'channel':'category',
    'market':'category',
    'sub_zone':'category',
    'region':'category',
    'product_code':'category',
    'division':'category',
    'segment':'category',
    'category':'category',
    'product':'category',
    'variant':'category',
    'fiscal_year':'float64',
    'pre_invoice_discount_pct':'float64',
    'cost_year':'int16',
    'manufacturing_cost':'float64',
    'gross_price':'float64',
    # 'date':,
    'sold_quantity':'float64',
}

data_dates = {
    'date' : '%Y-%m-%d'
}

data_dict = {}
for tbl in table_names:
    query_all = """Select * from """ + str(tbl[0])

    # create list of columns except `date` in the current table to recreate specific datatype dictionary
    col_names = [desc[0] for desc in cursor.execute(query_all).description if desc[0] !='date']
    tbl_types = {col: data_types[col] for col in col_names}

    # read table as element of dictionary of dataframes
    data_dict[tbl[0]] = pd.read_sql_query(query_all, con, dtype=tbl_types, parse_dates=data_dates)
    print(str(tbl[0]))
    data_dict[tbl[0]].info()
    print()

con.close()

dim_customer
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   customer_code  209 non-null    float64 
 1   customer       209 non-null    object  
 2   platform       209 non-null    category
 3   channel        209 non-null    category
 4   market         209 non-null    category
 5   sub_zone       209 non-null    category
 6   region         209 non-null    category
dtypes: category(5), float64(1), object(1)
memory usage: 6.5+ KB

dim_product
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 397 entries, 0 to 396
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   product_code  397 non-null    category
 1   division      397 non-null    category
 2   segment       397 non-null    category
 3   category      397 non-null    category
 4   product       397 non-null 

**Note:** There is only one row with empty cells. It could be dropped.

In [142]:
# drop 1 NaN row
display(data_dict['fact_sales_monthly'].tail(1))
data_dict['fact_sales_monthly'].dropna(inplace=True)
data_dict['fact_sales_monthly'].isna().sum()

Unnamed: 0,date,product_code,customer_code,sold_quantity,fiscal_year
67250,2019-06-01,A0,,,


date             0
product_code     0
customer_code    0
sold_quantity    0
fiscal_year      0
dtype: int64

In [143]:
# now we can convert customer_code and fiscal_year to 'category'

data_dict['dim_customer']['customer_code'] = data_dict['dim_customer']['customer_code'].astype('int64').astype('category')

data_dict['fact_pre_discount']['customer_code'] = data_dict['fact_pre_discount']['customer_code'].astype('int64').astype('category')
data_dict['fact_pre_discount']['fiscal_year'] = data_dict['fact_pre_discount']['fiscal_year'].astype('int64').astype('category')

data_dict['fact_gross_price']['fiscal_year'] = data_dict['fact_gross_price']['fiscal_year'].astype('int64').astype('category')

data_dict['fact_sales_monthly']['customer_code'] = data_dict['fact_sales_monthly']['customer_code'].astype('int64').astype('category')
data_dict['fact_sales_monthly']['fiscal_year'] = data_dict['fact_sales_monthly']['fiscal_year'].astype('int64').astype('category')


In [144]:
# Let's take a look at data
for key, tbl in data_dict.items():
    print(key)
    display(tbl.sample(10))

dim_customer


Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region
24,70014143,Atliq e Store,E-Commerce,Direct,Netherlands,NE,EU
49,90002003,Ezone,Brick & Mortar,Retailer,India,India,APAC
129,90015146,Mbit,Brick & Mortar,Retailer,Norway,NE,EU
82,90008166,Sound,Brick & Mortar,Retailer,Australia,ANZ,APAC
206,90025209,Electricalsbea Stores,Brick & Mortar,Retailer,Columbia,LATAM,LATAM
178,90021092,UniEuro,Brick & Mortar,Retailer,United Kingdom,NE,EU
14,70009134,Atliq e Store,E-Commerce,Direct,Newzealand,ANZ,APAC
118,90013123,Expert,Brick & Mortar,Retailer,Italy,SE,EU
79,90007197,Amazon,E-Commerce,Retailer,South Korea,ROA,APAC
15,70010047,Atliq Exclusive,Brick & Mortar,Direct,Bangladesh,ROA,APAC


dim_product


Unnamed: 0,product_code,division,segment,category,product,variant
49,A1118150201,P & A,Peripherals,Processors,AQ 5000 Series Electron 8 5900X Desktop Processor,Standard
74,A1819150303,P & A,Peripherals,MotherBoard,AQ MB Crossx 2,Plus 2
10,A0418150101,P & A,Peripherals,Graphic Card,AQ Mforce Gen X,Standard 1
100,A2319150305,P & A,Accessories,Mouse,AQ Gamers Ms,Premium 1
201,A4118110102,PC,Notebook,Personal Laptop,AQ Aspiron,Standard Blue
245,A4720110701,PC,Notebook,Personal Laptop,AQ GEN Z,Standard Grey
197,A4021150403,P & A,Accessories,Batteries,AQ Mx NB,Plus 2
82,A2021150503,P & A,Peripherals,MotherBoard,AQ MB Lito 2,Plus 2
208,A4218110202,PC,Notebook,Personal Laptop,AQ Digit,Standard Blue
332,A5820110104,PC,Desktop,Business Laptop,AQ BZ Allin1,Plus 1


fact_pre_discount


Unnamed: 0,customer_code,fiscal_year,pre_invoice_discount_pct
673,90016171,2021,0.2974
586,90013122,2019,0.2057
425,90009127,2018,0.2343
778,90018109,2021,0.2375
662,90015149,2020,0.27
548,90012040,2021,0.1814
538,90012038,2021,0.2785
688,90016174,2021,0.2556
722,90017053,2020,0.1871
576,90013120,2019,0.2213


fact_manufacturing_cost


Unnamed: 0,product_code,cost_year,manufacturing_cost
838,A5018110207,2019,121.7607
164,A1119150203,2019,35.1868
9,A0118150103,2019,5.5306
136,A0921150601,2021,13.3333
1039,A5820110101,2022,220.99
815,A5018110202,2018,123.2546
396,A2821150801,2022,6.4552
544,A3621150803,2022,11.8008
802,A4918110103,2020,120.5517
1043,A5820110103,2022,230.3172


fact_gross_price


Unnamed: 0,product_code,fiscal_year,gross_price
460,A3119150301,2020,12.9355
572,A3818150201,2019,13.8166
924,A5318110108,2019,458.2562
1168,A7219160201,2021,32.9575
379,A2620150605,2021,16.5882
820,A5018110203,2018,409.7719
263,A2118150101,2018,2.9296
540,A3521150705,2022,34.184
783,A4821110802,2022,417.3259
311,A2219150205,2020,6.8929


fact_sales_monthly


Unnamed: 0,date,product_code,customer_code,sold_quantity,fiscal_year
39860,2021-05-01,A0220150203,70012042,36.0,2021
64744,2021-09-01,A0418150103,90026205,31.0,2022
29060,2020-12-01,A0219150201,90009130,22.0,2021
63401,2021-01-01,A0418150103,90009130,17.0,2021
38869,2020-10-01,A0220150203,90019201,7.0,2021
38905,2020-11-01,A0220150203,70005163,26.0,2021
48539,2020-12-01,A0321150303,90023027,43.0,2021
12465,2018-03-01,A0118150103,70008169,20.0,2018
10588,2020-11-01,A0118150102,90023024,105.0,2021
4340,2020-10-01,A0118150101,90006156,90.0,2021


In [145]:
# describe()
for key, tbl in data_dict.items():
    print(key)
    display(tbl.describe().T)

dim_customer


Unnamed: 0,count,unique,top,freq
customer_code,209,209,70002017,1
customer,209,75,Amazon,25
platform,209,2,Brick & Mortar,150
channel,209,3,Retailer,164
market,209,27,India,18
sub_zone,209,7,NE,61
region,209,4,EU,105


dim_product


Unnamed: 0,count,unique,top,freq
product_code,397,397,A0118150101,1
division,397,3,P & A,200
segment,397,6,Notebook,129
category,397,14,Personal Laptop,61
product,397,73,AQ Gen Y,8
variant,397,27,Plus 2,35


fact_pre_discount


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
pre_invoice_discount_pct,1045.0,0.233807,0.058077,0.051,0.2048,0.2439,0.2767,0.3099


fact_manufacturing_cost


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
cost_year,1182.0,2020.57445,1.249199,2018.0,2020.0,2021.0,2022.0,2022.0
manufacturing_cost,1182.0,63.000676,74.015524,0.8654,5.41925,11.4176,122.56035,263.4207


fact_gross_price


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
gross_price,1182.0,211.086558,248.388384,2.8445,18.0776,38.3837,414.7115,890.1364


fact_sales_monthly


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
sold_quantity,67250.0,56.251822,136.970027,0.0,7.0,20.0,52.0,4127.0


# Data cleaning and preprocessing

## Duplicates

### Full duplicates

In [146]:
# full duplicates
print('Number of full duplicates in the table:')
for key, tbl in data_dict.items():
    print(f'{(key + ":"):<25}{tbl.duplicated().sum():>5}')

Number of full duplicates in the table:
dim_customer:                0
dim_product:                 0
fact_pre_discount:           0
fact_manufacturing_cost:     0
fact_gross_price:            0
fact_sales_monthly:          0


### Implicit duplicates

Implicit duplicates are anticipated primarily within DIM tables.

#### dim_customer

In [147]:
# implicit duplicates
display(sorted(data_dict['dim_customer']['channel'].unique()))
display(sorted(data_dict['dim_customer']['platform'].unique()))
display(sorted(data_dict['dim_customer']['region'].unique()))
display(sorted(data_dict['dim_customer']['sub_zone'].unique()))
display(sorted(data_dict['dim_customer']['market'].unique()))

['Direct', 'Distributor', 'Retailer']

['Brick & Mortar', 'E-Commerce']

['APAC', 'EU', 'LATAM', 'NA']

['ANZ', 'India', 'LATAM', 'NA', 'NE', 'ROA', 'SE']

['Australia',
 'Austria',
 'Bangladesh',
 'Brazil',
 'Canada',
 'Chile',
 'China',
 'Columbia',
 'France',
 'Germany',
 'India',
 'Indonesia',
 'Italy',
 'Japan',
 'Mexico',
 'Netherlands',
 'Newzealand',
 'Norway',
 'Pakistan',
 'Philiphines',
 'Poland',
 'Portugal',
 'South Korea',
 'Spain',
 'Sweden',
 'USA',
 'United Kingdom']

In [148]:
# implicit duplicates - customer
sorted(data_dict['dim_customer']['customer'].unique())

['Acclaimed Stores',
 'All-Out',
 'Amazon',
 'Amazon ',
 "Argos (Sainsbury's)",
 'Atlas Stores',
 'Atliq Exclusive',
 'Atliq e Store',
 'BestBuy',
 'Billa',
 'Boulanger',
 'Chip 7',
 'Chiptec',
 'Circuit City',
 'Control',
 'Coolblue',
 'Costco',
 'Croma',
 'Currys (Dixons Carphone)',
 'Digimarket',
 'Ebay',
 'Electricalsara Stores',
 'Electricalsbea Stores',
 'Electricalslance Stores',
 'Electricalslytical',
 'Electricalsocity',
 'Electricalsquipo Stores',
 'Elite',
 'Elkjøp',
 'Epic Stores',
 'Euronics',
 'Expert',
 'Expression',
 'Ezone',
 'Flawless Stores',
 'Flipkart',
 'Fnac-Darty',
 'Forward Stores',
 'Girias',
 'Info Stores',
 'Insight',
 'Integration Stores',
 'Leader',
 'Logic Stores',
 'Lotus',
 'Mbit',
 'Media Markt',
 'Neptune',
 'Nomad Stores',
 'Notebillig',
 'Nova',
 'Novus',
 'Otto',
 'Path',
 'Power',
 'Premium Stores',
 'Propel',
 'Radio Popular',
 'Radio Shack',
 'Reliance Digital',
 'Relief',
 'Sage',
 'Saturn',
 'Sorefoz',
 'Sound',
 'Staples',
 'Surface Stores',


In [149]:
# function to highlight duplicates
def highlight_dupl(tbl, dupl='Amazon ', color='red'):
    '''
    highlight the duplicate in a Series or DataFrame
    '''
    attr = 'background-color: {}'.format(color)
    if tbl.ndim == 1:  # Series from .apply(axis=0) or axis=1
        is_dupl = tbl == dupl
        return [attr if v else '' for v in is_dupl]
    else:  # from .apply(axis=None)
        is_dupl = tbl == dupl
        return pd.DataFrame(np.where(is_dupl, attr, ''),
                            index=tbl.index, columns=tbl.columns)


In [150]:
# Amazon

data_dict['dim_customer'][data_dict['dim_customer']['customer'].isin(['Amazon', 'Amazon '])] \
    .style.apply(highlight_dupl, dupl='Amazon', color='green', axis=1)

Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region
54,90002008,Amazon,E-Commerce,Retailer,India,India,APAC
62,90002016,Amazon,E-Commerce,Retailer,India,India,APAC
64,90003180,Amazon,E-Commerce,Retailer,Indonesia,ROA,APAC
71,90004067,Amazon,E-Commerce,Retailer,Japan,ROA,APAC
72,90004068,Amazon,E-Commerce,Retailer,Japan,ROA,APAC
76,90005162,Amazon,E-Commerce,Retailer,Pakistan,ROA,APAC
78,90006156,Amazon,E-Commerce,Retailer,Philiphines,ROA,APAC
79,90007197,Amazon,E-Commerce,Retailer,South Korea,ROA,APAC
84,90008168,Amazon,E-Commerce,Retailer,Australia,ANZ,APAC
90,90009132,Amazon,E-Commerce,Retailer,Newzealand,ANZ,APAC


In [151]:
# Electricals

data_dict['dim_customer'][data_dict['dim_customer']['customer'].str.contains('electricals', case=False)]


Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region
58,90002012,Electricalsocity,Brick & Mortar,Retailer,India,India,APAC
59,90002013,Electricalslytical,Brick & Mortar,Retailer,India,India,APAC
67,90004063,Electricalsbea Stores,Brick & Mortar,Retailer,Japan,ROA,APAC
83,90008167,Electricalsocity,Brick & Mortar,Retailer,Australia,ANZ,APAC
120,90014135,Electricalslance Stores,Brick & Mortar,Retailer,Netherlands,NE,EU
134,90016171,Electricalsara Stores,Brick & Mortar,Retailer,Poland,NE,EU
141,90017050,Electricalsara Stores,Brick & Mortar,Retailer,Portugal,SE,EU
146,90017055,Electricalslance Stores,Brick & Mortar,Retailer,Portugal,SE,EU
166,90020098,Electricalsquipo Stores,Brick & Mortar,Retailer,Austria,NE,EU
172,90021086,Electricalsquipo Stores,Brick & Mortar,Retailer,United Kingdom,NE,EU


- 'Amazon ' (with a trailing space) is a typo of 'Amazon' and should be corrected.
- 'Electricals..' does not appear to be indicative of implicit duplicates.
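
The visual scan above can be complemented by a small automated check. The sketch below is one possible approach, not the method used in this notebook: it groups distinct raw values that become identical after trimming whitespace and case-folding. The helper name `find_near_duplicates` and the sample Series are hypothetical.

```python
import pandas as pd

def find_near_duplicates(values: pd.Series) -> dict:
    """Group distinct raw values that collide after trimming and case-folding."""
    unique_values = pd.Series(values.dropna().unique(), dtype="object")
    normalized = unique_values.str.strip().str.casefold()
    groups = unique_values.groupby(normalized).agg(list)
    return {key: raw for key, raw in groups.items() if len(raw) > 1}

# Hypothetical sample mimicking some of the customer names listed above
sample = pd.Series(
    ["Amazon", "Amazon ", "Ebay", "Electricalsocity", "Electricalslytical"]
)
near_duplicates = find_near_duplicates(sample)
print(near_duplicates)
```

A check like this only catches whitespace and letter-case differences; names such as the 'Electricals…' family still need a manual look.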

In [152]:
# what to change
customers_dict = {
    # wrong customer : correct customer
    'Amazon ': 'Amazon'
}

customer_codes_dict = dict()

for k,v in customers_dict.items():

    wrong_codes = data_dict['dim_customer'][data_dict['dim_customer']['customer'] == k]
    correct_codes = data_dict['dim_customer'][data_dict['dim_customer']['customer'] == v]

    # map each mistyped customer's code to the code of its correctly spelled twin
    customer_codes_dict_temp = wrong_codes \
        .merge(
            correct_codes,
            on=['platform', 'channel', 'market', 'sub_zone', 'region'],
            how='inner',
            suffixes=['_wrong', '_correct']
        ) \
        .set_index('customer_code_wrong') \
        .loc[:, 'customer_code_correct'] \
        .to_dict()
    customer_codes_dict.update(customer_codes_dict_temp)
    
customer_codes_dict

{90002016: 90002008,
 90004068: 90004067,
 90022082: 90022081,
 90023030: 90023023}

In [153]:
# check if discounts are the same for these customer codes

# .copy() avoids a SettingWithCopyWarning when the new column is added below
wrong_discounts = data_dict['fact_pre_discount'][data_dict['fact_pre_discount']['customer_code'].isin(customer_codes_dict.keys())].copy()
wrong_discounts['correct_codes'] = wrong_discounts['customer_code'].map(customer_codes_dict)
# display(wrong_discounts)
correct_discounts = data_dict['fact_pre_discount'][data_dict['fact_pre_discount']['customer_code'].isin(customer_codes_dict.values())]
# display(correct_discounts)

amazon_discounts = wrong_discounts.merge(correct_discounts, left_on=['correct_codes', 'fiscal_year'], right_on=['customer_code', 'fiscal_year'], suffixes=['_wrong', '_corr'])
display(amazon_discounts)


Unnamed: 0,customer_code_wrong,fiscal_year,pre_invoice_discount_pct_wrong,correct_codes,customer_code_corr,pre_invoice_discount_pct_corr
0,90002016,2018,0.2134,90002008.0,90002008,0.2385
1,90002016,2019,0.229,90002008.0,90002008,0.1862
2,90002016,2020,0.1876,90002008.0,90002008,0.2
3,90002016,2021,0.2933,90002008.0,90002008,0.2207
4,90002016,2022,0.3022,90002008.0,90002008,0.2912
5,90004068,2018,0.2728,90004067.0,90004067,0.2501
6,90004068,2019,0.309,90004067.0,90004067,0.2545
7,90004068,2020,0.2453,90004067.0,90004067,0.2905
8,90004068,2021,0.1977,90004067.0,90004067,0.2672
9,90004068,2022,0.2856,90004067.0,90004067,0.2772


**Conclusion:**

- The Amazon discounts are inconsistent: because of a typo in `dim_customer`, the same customer in the same channel and location appears under two codes with different discounts.
- Despite the inconsistency, the discounts that were actually applied cannot be selectively kept or discarded.
- As an interim solution, we replace 'Amazon ' with 'Amazon' in `dim_customer` so that the customer is treated uniformly in further analysis.
- The impact of this mistake on revenue and margin still needs to be assessed, and corrective measures taken if necessary.
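
A possible first step toward that assessment is to quantify how far apart the two discounts are for each code pair. The sketch below does this for the first pair only, with the values copied from the table above; the variable names are hypothetical, and it covers neither the second pair nor the revenue effect.

```python
import pandas as pd

# Discounts for codes 90002016 (mistyped name) and 90002008, copied from the table above
pair = pd.DataFrame({
    "fiscal_year": [2018, 2019, 2020, 2021, 2022],
    "discount_wrong": [0.2134, 0.2290, 0.1876, 0.2933, 0.3022],
    "discount_corr": [0.2385, 0.1862, 0.2000, 0.2207, 0.2912],
})

# Signed gap per fiscal year and its average magnitude
pair["gap"] = pair["discount_wrong"] - pair["discount_corr"]
mean_abs_gap = pair["gap"].abs().mean()
print(pair)
print(f"Mean absolute gap: {mean_abs_gap:.4f}")
```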


In [154]:
# replace 'Amazon ' with 'Amazon'

data_dict['dim_customer']['customer'] = data_dict['dim_customer']['customer'].replace(customers_dict)

In [155]:
# Amazon check

data_dict['dim_customer'][data_dict['dim_customer']['customer'].str.contains('Amazon')] \
    .style.apply(highlight_dupl, dupl='Amazon ', color='green', axis=1)

Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region
54,90002008,Amazon,E-Commerce,Retailer,India,India,APAC
62,90002016,Amazon,E-Commerce,Retailer,India,India,APAC
64,90003180,Amazon,E-Commerce,Retailer,Indonesia,ROA,APAC
71,90004067,Amazon,E-Commerce,Retailer,Japan,ROA,APAC
72,90004068,Amazon,E-Commerce,Retailer,Japan,ROA,APAC
76,90005162,Amazon,E-Commerce,Retailer,Pakistan,ROA,APAC
78,90006156,Amazon,E-Commerce,Retailer,Philiphines,ROA,APAC
79,90007197,Amazon,E-Commerce,Retailer,South Korea,ROA,APAC
84,90008168,Amazon,E-Commerce,Retailer,Australia,ANZ,APAC
90,90009132,Amazon,E-Commerce,Retailer,Newzealand,ANZ,APAC


#### dim_product

In [156]:
# implicit duplicates
display(sorted(data_dict['dim_product']['division'].unique()))
display(sorted(data_dict['dim_product']['segment'].unique()))
display(sorted(data_dict['dim_product']['category'].unique()))
display(sorted(data_dict['dim_product']['product'].unique()))
display(sorted(data_dict['dim_product']['variant'].unique()))

['N & S', 'P & A', 'PC']

['Accessories', 'Desktop', 'Networking', 'Notebook', 'Peripherals', 'Storage']

['Batteries',
 'Business Laptop',
 'External Solid State Drives',
 'Gaming Laptop',
 'Graphic Card',
 'Internal HDD',
 'Keyboard',
 'MotherBoard',
 'Mouse',
 'Personal Desktop',
 'Personal Laptop',
 'Processors',
 'USB Flash Drives',
 'Wi fi extender']

['AQ 5000 Series Electron 8 5900X Desktop Processor',
 'AQ 5000 Series Electron 9 5900X Desktop Processor',
 'AQ 5000 Series Ultron 8 5900X Desktop Processor',
 'AQ Aspiron',
 'AQ BZ 101',
 'AQ BZ Allin1',
 'AQ BZ Allin1 Gen 2',
 'AQ BZ Compact',
 'AQ BZ Gen Y',
 'AQ BZ Gen Z',
 'AQ Clx1',
 'AQ Clx2',
 'AQ Clx3',
 'AQ Digit',
 'AQ Digit SSD',
 'AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache',
 'AQ Electron 3 3600 Desktop Processor',
 'AQ Electron 4 3600 Desktop Processor',
 'AQ Electron 5 3600 Desktop Processor',
 'AQ Elite',
 'AQ F16',
 'AQ GEN Z',
 'AQ GT 21',
 'AQ Gamer 1',
 'AQ Gamer 2',
 'AQ Gamer 3',
 'AQ Gamers ',
 'AQ Gamers Ms',
 'AQ Gen X',
 'AQ Gen Y',
 'AQ HOME Allin1 Gen 2',
 'AQ Home Allin1',
 'AQ LION x1',
 'AQ LION x2',
 'AQ LION x3',
 'AQ Lite',
 'AQ Lite Ms',
 'AQ Lumina',
 'AQ Lumina Ms',
 'AQ MB Crossx',
 'AQ MB Crossx 2',
 'AQ MB Elite',
 'AQ MB Lito',
 'AQ MB Lito 2',
 'AQ Marquee P3',
 'AQ Marquee P4',
 'AQ Master wired x1',
 'AQ Master wired x1 Ms',

['Plus',
 'Plus 1',
 'Plus 1 ',
 'Plus 2',
 'Plus 3',
 'Plus Black',
 'Plus Blue',
 'Plus Cool Blue',
 'Plus Firey Red',
 'Plus Grey',
 'Plus Red',
 'Premium',
 'Premium 1',
 'Premium 2',
 'Premium Black',
 'Premium Misty Green',
 'Premium Plus',
 'Standard',
 'Standard 1',
 'Standard 2',
 'Standard 3',
 'Standard Black',
 'Standard Blue',
 'Standard Cool Blue',
 'Standard Firey Red',
 'Standard Grey',
 'Standard Red']

**Conclusion:**

The presence of implicit duplicate products cannot be definitively established with 100% certainty.

### Other logical duplicates

In [157]:
# other duplicates ## dim_customer

print('Number of duplicates in dim_customer')
print(f'{"customer_code:":<40}{data_dict["dim_customer"]["customer_code"].duplicated().sum()}')
print(f'{"customer identification:":<40}{data_dict["dim_customer"].loc[:, ["customer", "platform","channel", "market"]].duplicated().sum()}')
display(data_dict["dim_customer"][data_dict["dim_customer"].loc[:, ["customer", "platform","channel", "market"]].duplicated()])

# attribute values that map to more than one value of another attribute
display(data_dict['dim_customer'].groupby(['market'], as_index=False).filter(lambda x: x['sub_zone'].nunique()>1))
display(data_dict['dim_customer'].groupby(['market'], as_index=False).filter(lambda x: x['region'].nunique()>1))
display(data_dict['dim_customer'].groupby(['sub_zone'], as_index=False).filter(lambda x: x['region'].nunique()>1))
display(data_dict['dim_customer'].groupby(['customer'], as_index=False).filter(lambda x: x['platform'].nunique()>1))


Number of duplicates in dim_customer
customer_code:                          0
customer identification:                4


Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region
62,90002016,Amazon,E-Commerce,Retailer,India,India,APAC
72,90004068,Amazon,E-Commerce,Retailer,Japan,ROA,APAC
193,90022082,Amazon,E-Commerce,Retailer,USA,,
203,90023030,Amazon,E-Commerce,Retailer,Canada,,


Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region


Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region


Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region


Unnamed: 0,customer_code,customer,platform,channel,market,sub_zone,region


As expected, only the 'Amazon' codes are not unique in terms of customer, platform, channel and market.

In [158]:
# other duplicates ## dim_product

print('Number of duplicates in dim_product')
print(f'{"product_code:":<20}{data_dict["dim_product"]["product_code"].duplicated().sum()}')
print(f'{"product + variant:":<20}{data_dict["dim_product"].loc[:, ["product","variant"]].duplicated().sum()}')

# the same attribute value mapped to several values of another attribute
display(data_dict['dim_product'].groupby(['product', 'variant'], as_index=False).filter(lambda x: x['product_code'].nunique()>1))
display(data_dict['dim_product'].groupby(['product'], as_index=False).filter(lambda x: x['division'].nunique()>1))
display(data_dict['dim_product'].groupby(['product'], as_index=False).filter(lambda x: x['segment'].nunique()>1))
display(data_dict['dim_product'].groupby(['product'], as_index=False).filter(lambda x: x['category'].nunique()>1))
display(data_dict['dim_product'].groupby(['category'], as_index=False).filter(lambda x: x['segment'].nunique()>1))
display(data_dict['dim_product'].groupby(['category'], as_index=False).filter(lambda x: x['division'].nunique()>1))
display(data_dict['dim_product'].groupby(['segment'], as_index=False).filter(lambda x: x['division'].nunique()>1))


Number of duplicates in dim_product
product_code:       0
product + variant:  0


Unnamed: 0,product_code,division,segment,category,product,variant


Unnamed: 0,product_code,division,segment,category,product,variant


Unnamed: 0,product_code,division,segment,category,product,variant


Unnamed: 0,product_code,division,segment,category,product,variant


Unnamed: 0,product_code,division,segment,category,product,variant
261,A4918110101,PC,Notebook,Business Laptop,AQ BZ 101,Standard Grey
262,A4918110102,PC,Notebook,Business Laptop,AQ BZ 101,Standard Blue
263,A4918110103,PC,Notebook,Business Laptop,AQ BZ 101,Premium Black
264,A4918110104,PC,Notebook,Business Laptop,AQ BZ 101,Premium Misty Green
265,A5018110201,PC,Notebook,Business Laptop,AQ BZ Compact,Standard Grey
266,A5018110202,PC,Notebook,Business Laptop,AQ BZ Compact,Standard Blue
267,A5018110203,PC,Notebook,Business Laptop,AQ BZ Compact,Standard Red
268,A5018110204,PC,Notebook,Business Laptop,AQ BZ Compact,Plus Grey
269,A5018110205,PC,Notebook,Business Laptop,AQ BZ Compact,Plus Blue
270,A5018110206,PC,Notebook,Business Laptop,AQ BZ Compact,Plus Red


Unnamed: 0,product_code,division,segment,category,product,variant


Unnamed: 0,product_code,division,segment,category,product,variant


**Conclusion:** the category 'Business Laptop' falls into both the 'Desktop' and 'Notebook' segments. Let's correct the 'Desktop' segment, which looks like an error for a laptop category. (This is our assumption; in reality we would check it with the data provider.)
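
A more compact way to spot this kind of conflict is to count distinct segments per category. The sketch below uses a hypothetical miniature of `dim_product` with made-up rows; it is an alternative to the `groupby(...).filter(...)` calls above, not a replacement for them.

```python
import pandas as pd

# Hypothetical miniature of dim_product (made-up rows)
products = pd.DataFrame({
    "category": ["Business Laptop", "Business Laptop", "Mouse", "Personal Desktop"],
    "segment": ["Notebook", "Desktop", "Accessories", "Desktop"],
})

# Count distinct segments per category; more than one signals a conflict
segments_per_category = products.groupby("category")["segment"].nunique()
conflicts = segments_per_category[segments_per_category > 1].index.tolist()
print(conflicts)

# Show how many rows fall into each conflicting category/segment pair
conflict_counts = (
    products[products["category"].isin(conflicts)]
    .groupby(["category", "segment"])
    .size()
)
print(conflict_counts)
```

The row counts per pair help decide which segment is the likely error: the minority value is usually the one to question.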

In [159]:
# replace segment for 'Business Laptop'
data_dict['dim_product'].loc[data_dict['dim_product']['category'] == 'Business Laptop', 'segment'] = 'Notebook'
data_dict['dim_product'][data_dict['dim_product']['category'] == 'Business Laptop']

Unnamed: 0,product_code,division,segment,category,product,variant
261,A4918110101,PC,Notebook,Business Laptop,AQ BZ 101,Standard Grey
262,A4918110102,PC,Notebook,Business Laptop,AQ BZ 101,Standard Blue
263,A4918110103,PC,Notebook,Business Laptop,AQ BZ 101,Premium Black
264,A4918110104,PC,Notebook,Business Laptop,AQ BZ 101,Premium Misty Green
265,A5018110201,PC,Notebook,Business Laptop,AQ BZ Compact,Standard Grey
266,A5018110202,PC,Notebook,Business Laptop,AQ BZ Compact,Standard Blue
267,A5018110203,PC,Notebook,Business Laptop,AQ BZ Compact,Standard Red
268,A5018110204,PC,Notebook,Business Laptop,AQ BZ Compact,Plus Grey
269,A5018110205,PC,Notebook,Business Laptop,AQ BZ Compact,Plus Blue
270,A5018110206,PC,Notebook,Business Laptop,AQ BZ Compact,Plus Red


In [160]:
# other duplicates ## Other data where duplication might cause interference
print('Number of duplicates')
print(f'{"customer_code + fiscal_year:":<40}{data_dict["fact_pre_discount"].loc[:, ["customer_code","fiscal_year"]].duplicated().sum()}')
print(f'{"product_code + cost_year:":<40}{data_dict["fact_manufacturing_cost"].loc[:, ["product_code","cost_year"]].duplicated().sum()}')
print(f'{"product_code + fiscal_year:":<40}{data_dict["fact_gross_price"].loc[:, ["product_code","fiscal_year"]].duplicated().sum()}')


Number of duplicates
customer_code + fiscal_year:            0
product_code + cost_year:               0
product_code + fiscal_year:             0


## Logical checks

In [161]:
# check time range of 'fiscal_year'

data_dict['fact_sales_monthly'].groupby('fiscal_year').agg(start_date=('date', 'min'), end_date = ('date', 'max'))

Unnamed: 0_level_0,start_date,end_date
fiscal_year,Unnamed: 1_level_1,Unnamed: 2_level_1
2018,2017-09-01,2018-08-01
2019,2018-09-01,2019-08-01
2020,2019-09-01,2020-08-01
2021,2020-09-01,2021-08-01
2022,2021-09-01,2021-12-01


In [162]:
# check time range of 'cost_year'
data_dict['fact_manufacturing_cost']['cost_year'].value_counts()

2022    345
2021    334
2020    245
2019    171
2018     87
Name: cost_year, dtype: int64

**Conclusion:** as we can see, a fiscal year lasts from September to August, and transactions are dated on a monthly basis.
**Important:** fiscal year 2022 is incomplete (it covers September to December 2021 only).

As no other data are provided, and there are no data in 'fact_manufacturing_cost' for 2017, let's assume that 'cost_year' equals 'fiscal_year'. No additional work is needed.
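
The September-to-August convention observed in the date ranges above can be written as a simple rule, which is handy for cross-checking the `fiscal_year` column against `date`. This is a minimal sketch; the helper name `fiscal_year_from_date` is hypothetical.

```python
import pandas as pd

def fiscal_year_from_date(dates: pd.Series) -> pd.Series:
    """Fiscal year N runs from September of year N-1 through August of year N."""
    return dates.dt.year + (dates.dt.month >= 9).astype(int)

# Boundary dates taken from the fiscal-year ranges shown above
dates = pd.Series(
    pd.to_datetime(["2017-09-01", "2018-08-01", "2021-09-01", "2021-12-01"])
)
derived = fiscal_year_from_date(dates).tolist()
print(derived)  # [2018, 2018, 2022, 2022]
```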

In [163]:
# create calendar year column in sales table and rename 'cost_year' into 'fiscal_year'

data_dict['fact_sales_monthly']['year'] = data_dict['fact_sales_monthly']['date'].dt.year

data_dict['fact_manufacturing_cost'].rename(columns={"cost_year": "fiscal_year"}, inplace=True)

In [164]:
# do all customer codes from fact_sales_monthly exist in dim_customer
set(data_dict['fact_sales_monthly']['customer_code']) - set(data_dict['dim_customer']['customer_code'])

set()

In [165]:
# do all customer codes exist in fact_pre_discount
set(data_dict['fact_sales_monthly']['customer_code']) - set(data_dict['fact_pre_discount']['customer_code'])

set()

In [166]:
# do all pairs customer_code + fiscal_year exist in fact_pre_discount
set(
    data_dict['fact_sales_monthly'].groupby(['customer_code', 'fiscal_year']).groups.keys()
) - set(
    data_dict['fact_pre_discount'].groupby(['customer_code', 'fiscal_year']).groups.keys()
)

set()

In [167]:
# do all product_codes exist in fact_manufacturing_cost
set(data_dict['fact_sales_monthly']['product_code']) - set(data_dict['fact_manufacturing_cost']['product_code'])

set()

In [168]:
# do all pairs product_code + fiscal_year exist in fact_manufacturing_cost
set(
    data_dict['fact_sales_monthly'].groupby(['product_code', 'fiscal_year']).groups.keys()
) - set(
    data_dict['fact_manufacturing_cost'].groupby(['product_code', 'fiscal_year']).groups.keys()
)

set()

In [169]:
# do all pairs product_code + fiscal year exist in fact_gross_price
set(
    data_dict['fact_sales_monthly'].groupby(['product_code', 'fiscal_year']).groups.keys()
) - set(
    data_dict['fact_gross_price'].groupby(['product_code', 'fiscal_year']).groups.keys()
)

set()
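
The set differences above can also be written as an anti-join, which returns the offending key combinations as a DataFrame and works the same way for one or several key columns. A minimal sketch on toy data; the helper name `missing_keys` and the sample tables are assumptions for illustration only:

```python
import pandas as pd


def missing_keys(fact: pd.DataFrame, lookup: pd.DataFrame, keys: list) -> pd.DataFrame:
    """Return the distinct key combinations present in `fact` but absent from `lookup`."""
    fact_keys = fact[keys].drop_duplicates()
    lookup_keys = lookup[keys].drop_duplicates()
    # indicator=True adds a '_merge' column that tells where each row was found
    joined = fact_keys.merge(lookup_keys, on=keys, how='left', indicator=True)
    return joined.loc[joined['_merge'] == 'left_only', keys].reset_index(drop=True)


# hypothetical toy tables: the pair (B2, 2019) has no price
sales = pd.DataFrame({'product_code': ['A1', 'A1', 'B2'], 'fiscal_year': [2018, 2019, 2019]})
prices = pd.DataFrame({'product_code': ['A1', 'A1'], 'fiscal_year': [2018, 2019], 'gross_price': [2.0, 3.0]})

gaps = missing_keys(sales, prices, ['product_code', 'fiscal_year'])
print(gaps)
```

An empty result corresponds to the `set()` outputs above.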

## Merge all data into single DataFrame
**Note:** merging everything in memory is feasible here because the data set is small. With production-size data, query only the rows and columns that are needed.
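
Because every merge below is a left join from the sales table to a lookup table, the row count should stay at the number of sales rows. One way to guard this, sketched here on hypothetical toy tables rather than on `data_dict`, is the `validate` argument of `DataFrame.merge`:

```python
import pandas as pd

# hypothetical toy tables standing in for the fact and dimension tables
sales = pd.DataFrame({'product_code': ['A1', 'A1', 'B2'], 'sold_quantity': [5, 7, 3]})
products = pd.DataFrame({'product_code': ['A1', 'B2'], 'category': ['Internal HDD', 'Graphic Card']})

# validate='many_to_one' raises pandas.errors.MergeError if the right-hand keys are not unique
merged = sales.merge(products, on='product_code', how='left', validate='many_to_one')

# a left join against a unique key preserves the number of fact rows
assert len(merged) == len(sales)

# the same merge against a lookup table with a duplicated key is rejected
duplicated_products = pd.concat([products, products.iloc[[0]]], ignore_index=True)
try:
    sales.merge(duplicated_products, on='product_code', how='left', validate='many_to_one')
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True
print(merge_failed)
```

This complements the duplicate checks made earlier: the merge itself fails loudly instead of silently multiplying rows.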

In [170]:
data = data_dict['fact_sales_monthly'].merge(data_dict['dim_customer'], on='customer_code', how='left')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67250 entries, 0 to 67249
Data columns (total 12 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           67250 non-null  datetime64[ns]
 1   product_code   67250 non-null  category      
 2   customer_code  67250 non-null  category      
 3   sold_quantity  67250 non-null  float64       
 4   fiscal_year    67250 non-null  category      
 5   year           67250 non-null  int64         
 6   customer       67250 non-null  object        
 7   platform       67250 non-null  category      
 8   channel        67250 non-null  category      
 9   market         67250 non-null  category      
 10  sub_zone       67250 non-null  category      
 11  region         67250 non-null  category      
dtypes: category(8), datetime64[ns](1), float64(1), int64(1), object(1)
memory usage: 3.2+ MB


In [171]:
data = data.merge(data_dict['dim_product'], on='product_code', how='left')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67250 entries, 0 to 67249
Data columns (total 17 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           67250 non-null  datetime64[ns]
 1   product_code   67250 non-null  object        
 2   customer_code  67250 non-null  category      
 3   sold_quantity  67250 non-null  float64       
 4   fiscal_year    67250 non-null  category      
 5   year           67250 non-null  int64         
 6   customer       67250 non-null  object        
 7   platform       67250 non-null  category      
 8   channel        67250 non-null  category      
 9   market         67250 non-null  category      
 10  sub_zone       67250 non-null  category      
 11  region         67250 non-null  category      
 12  division       67250 non-null  category      
 13  segment        67250 non-null  category      
 14  category       67250 non-null  category      
 15  product        6725

In [172]:
data = data.merge(data_dict['fact_pre_discount'], on=['customer_code', 'fiscal_year'] , how='left')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67250 entries, 0 to 67249
Data columns (total 18 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      67250 non-null  datetime64[ns]
 1   product_code              67250 non-null  object        
 2   customer_code             67250 non-null  category      
 3   sold_quantity             67250 non-null  float64       
 4   fiscal_year               67250 non-null  category      
 5   year                      67250 non-null  int64         
 6   customer                  67250 non-null  object        
 7   platform                  67250 non-null  category      
 8   channel                   67250 non-null  category      
 9   market                    67250 non-null  category      
 10  sub_zone                  67250 non-null  category      
 11  region                    67250 non-null  category      
 12  division          

In [173]:
data = data.merge(data_dict['fact_manufacturing_cost'], on=['product_code', 'fiscal_year'], how='left')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67250 entries, 0 to 67249
Data columns (total 19 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      67250 non-null  datetime64[ns]
 1   product_code              67250 non-null  object        
 2   customer_code             67250 non-null  category      
 3   sold_quantity             67250 non-null  float64       
 4   fiscal_year               67250 non-null  int64         
 5   year                      67250 non-null  int64         
 6   customer                  67250 non-null  object        
 7   platform                  67250 non-null  category      
 8   channel                   67250 non-null  category      
 9   market                    67250 non-null  category      
 10  sub_zone                  67250 non-null  category      
 11  region                    67250 non-null  category      
 12  division          

In [174]:
data = data.merge(data_dict['fact_gross_price'], on=['product_code', 'fiscal_year'] , how='left')
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67250 entries, 0 to 67249
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      67250 non-null  datetime64[ns]
 1   product_code              67250 non-null  object        
 2   customer_code             67250 non-null  category      
 3   sold_quantity             67250 non-null  float64       
 4   fiscal_year               67250 non-null  object        
 5   year                      67250 non-null  int64         
 6   customer                  67250 non-null  object        
 7   platform                  67250 non-null  category      
 8   channel                   67250 non-null  category      
 9   market                    67250 non-null  category      
 10  sub_zone                  67250 non-null  category      
 11  region                    67250 non-null  category      
 12  division          

In [175]:
# adjust datatypes

data['product_code'] = data['product_code'].astype('category')
data['customer'] = data['customer'].astype('category')
data['product'] = data['product'].astype('category')
data['variant'] = data['variant'].astype('category')
data['fiscal_year'] = data['fiscal_year'].astype('category')
data['customer_code'] = data['customer_code'].astype('category')

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67250 entries, 0 to 67249
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      67250 non-null  datetime64[ns]
 1   product_code              67250 non-null  category      
 2   customer_code             67250 non-null  category      
 3   sold_quantity             67250 non-null  float64       
 4   fiscal_year               67250 non-null  category      
 5   year                      67250 non-null  int64         
 6   customer                  67250 non-null  category      
 7   platform                  67250 non-null  category      
 8   channel                   67250 non-null  category      
 9   market                    67250 non-null  category      
 10  sub_zone                  67250 non-null  category      
 11  region                    67250 non-null  category      
 12  division          

## Add revenue and margin

In [176]:
# calculate new columns
data['gross_revenue'] = data['gross_price'] * data['sold_quantity']
data['discount'] = data['gross_revenue'] * data['pre_invoice_discount_pct']
data['net_revenue'] = data['gross_revenue'] - data['discount']
data['costs'] = data['manufacturing_cost'] * data['sold_quantity']
data['margin'] = data['net_revenue'] - data['costs']
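
As a worked example of the formulas above, the same arithmetic can be applied to a single sales row. The values below are copied from the first row of the sales data displayed later in this notebook (2017-09-01, product A0118150101, customer 70002017); the plain-Python variable names are only for illustration:

```python
# values of one sales row, as displayed later in the notebook
sold_quantity = 51.0
gross_price = 15.3952
pre_invoice_discount_pct = 0.0824
manufacturing_cost = 4.619

gross_revenue = gross_price * sold_quantity                 # list price times units
discount = gross_revenue * pre_invoice_discount_pct         # pre-invoice discount amount
net_revenue = gross_revenue - discount                      # revenue after discount
costs = manufacturing_cost * sold_quantity                  # manufacturing cost of the units sold
margin = net_revenue - costs                                # what remains after manufacturing costs

print(round(gross_revenue, 2), round(discount, 2), round(net_revenue, 2), round(costs, 2), round(margin, 2))
print(f'margin / net revenue:   {margin / net_revenue:.1%}')
print(f'margin / gross revenue: {margin / gross_revenue:.1%}')
```

The two ratios differ only in the denominator. The notebook uses both: the 'Margin %' pivot divides by net revenue, while the monthly and geographical views divide by gross revenue, so their percentages are not directly comparable.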


In [177]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67250 entries, 0 to 67249
Data columns (total 25 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      67250 non-null  datetime64[ns]
 1   product_code              67250 non-null  category      
 2   customer_code             67250 non-null  category      
 3   sold_quantity             67250 non-null  float64       
 4   fiscal_year               67250 non-null  category      
 5   year                      67250 non-null  int64         
 6   customer                  67250 non-null  category      
 7   platform                  67250 non-null  category      
 8   channel                   67250 non-null  category      
 9   market                    67250 non-null  category      
 10  sub_zone                  67250 non-null  category      
 11  region                    67250 non-null  category      
 12  division          

In [178]:
# Revenue
data.pivot_table(
    values='net_revenue',
    index=['platform', 'region'],
    columns=['category','fiscal_year'],
    aggfunc='sum',
    observed=True,
    fill_value=0
).T.style.format('{:,.0f}')


category,fiscal_year,Brick & Mortar / APAC,Brick & Mortar / EU,Brick & Mortar / NA,Brick & Mortar / LATAM,E-Commerce / APAC,E-Commerce / EU,E-Commerce / NA,E-Commerce / LATAM
Graphic Card,2018,637060,56848,158483,0,244222,13342,84675,8595
Graphic Card,2019,686039,201058,214864,0,280751,49060,112775,11511
Graphic Card,2020,648460,275690,222537,504,392667,103207,172192,15403
Graphic Card,2021,460133,219237,154953,766,181129,55222,88422,3481
Graphic Card,2022,1464333,642193,465877,3086,520705,171041,240161,9994
Internal HDD,2018,544728,49380,145125,0,213357,11592,68078,7699
Internal HDD,2019,1584675,462521,500115,0,608946,112236,281861,28811
Internal HDD,2020,2910430,1238728,990912,1935,1765943,480614,823395,70075
Internal HDD,2021,6898928,3291283,2502788,12402,2717845,879077,1333379,54318
Internal HDD,2022,10556177,4898869,3784610,19264,3988851,1270269,1860086,72399


In [179]:
# Margin
data.pivot_table(
    values='margin',
    index=['platform', 'region'],
    columns=['category','fiscal_year'],
    aggfunc='sum',
    observed=True,
    fill_value=0
).T.style.format('{:,.0f}')

category,fiscal_year,Brick & Mortar / APAC,Brick & Mortar / EU,Brick & Mortar / NA,Brick & Mortar / LATAM,E-Commerce / APAC,E-Commerce / EU,E-Commerce / NA,E-Commerce / LATAM
Graphic Card,2018,390176,35103,97682,0,149941,8165,50592,5288
Graphic Card,2019,419539,121798,129767,0,172397,30048,67457,6980
Graphic Card,2020,394343,166698,135096,290,240618,62698,103101,9191
Graphic Card,2021,283082,135019,96287,473,111037,33768,54135,2090
Graphic Card,2022,918018,402720,296368,1964,319998,106091,150072,6138
Internal HDD,2018,330979,30179,88996,0,130162,7047,40282,4703
Internal HDD,2019,984860,284506,307328,0,379993,69810,171357,17807
Internal HDD,2020,1788685,755409,606183,1125,1090525,294577,498266,42192
Internal HDD,2021,4212297,2008595,1536657,7599,1648082,532819,812159,32266
Internal HDD,2022,6475486,3004594,2350891,11989,2395563,770338,1140868,43340


In [180]:
# Margin %
(data.pivot_table(
    values='margin',
    index=['platform', 'region'],
    columns=['category','fiscal_year'],
    aggfunc='sum',
    observed=True,
    fill_value=0
).T / data.pivot_table(
    values='net_revenue',
    index=['platform', 'region'],
    columns=['category','fiscal_year'],
    aggfunc='sum',
    observed=True,
    fill_value=0
).T).style.format('{:,.1%}').background_gradient(cmap='viridis')

category,fiscal_year,Brick & Mortar / APAC,Brick & Mortar / EU,Brick & Mortar / NA,Brick & Mortar / LATAM,E-Commerce / APAC,E-Commerce / EU,E-Commerce / NA,E-Commerce / LATAM
Graphic Card,2018,61.2%,61.7%,61.6%,nan%,61.4%,61.2%,59.7%,61.5%
Graphic Card,2019,61.2%,60.6%,60.4%,nan%,61.4%,61.2%,59.8%,60.6%
Graphic Card,2020,60.8%,60.5%,60.7%,57.5%,61.3%,60.8%,59.9%,59.7%
Graphic Card,2021,61.5%,61.6%,62.1%,61.7%,61.3%,61.1%,61.2%,60.0%
Graphic Card,2022,62.7%,62.7%,63.6%,63.6%,61.5%,62.0%,62.5%,61.4%
Internal HDD,2018,60.8%,61.1%,61.3%,nan%,61.0%,60.8%,59.2%,61.1%
Internal HDD,2019,62.1%,61.5%,61.5%,nan%,62.4%,62.2%,60.8%,61.8%
Internal HDD,2020,61.5%,61.0%,61.2%,58.1%,61.8%,61.3%,60.5%,60.2%
Internal HDD,2021,61.1%,61.0%,61.4%,61.3%,60.6%,60.6%,60.9%,59.4%
Internal HDD,2022,61.3%,61.3%,62.1%,62.2%,60.1%,60.6%,61.3%,59.9%
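
The `nan%` cells in the table above appear where both margin and net revenue are zero (LATAM Brick & Mortar in 2018 and 2019), because 0 / 0 is undefined. The same ratio can be obtained from a single aggregation instead of two pivot tables. A minimal sketch on hypothetical toy numbers:

```python
import pandas as pd

# hypothetical toy data: EU has no sales in 2018
toy = pd.DataFrame({
    'region': ['APAC', 'APAC', 'EU', 'EU'],
    'fiscal_year': [2018, 2019, 2018, 2019],
    'net_revenue': [100.0, 200.0, 0.0, 50.0],
    'margin': [60.0, 124.0, 0.0, 30.0],
})

# aggregate both measures once, then divide
totals = toy.groupby(['fiscal_year', 'region'])[['margin', 'net_revenue']].sum()
# 0 / 0 yields NaN, which marks cells without sales rather than a true 0% margin
totals['margin_pct'] = totals['margin'] / totals['net_revenue']

margin_pct = totals['margin_pct'].unstack('region')
print(margin_pct)
```

Keeping the NaN is arguably the honest choice here, since a 0% margin would suggest sales that broke even.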


## Create new SQLite database after data processing

In [181]:
# new db as function
def create_connection(db_file):
    """ create a database connection to a SQLite database """
    connection = None
    try:
        connection = sqlite3.connect(db_file)
        print(sqlite3.version)
    except Error as e:
        print(e)
    finally:
        if connection:
            connection.close()

In [182]:
# Create new db
path_new_db = r'atliq_full.db'

create_connection(path_new_db)

2.6.0


In [183]:
# Create tables in new db (apply 'replace' method)
new_con = sqlite3.connect(path_new_db)

for key, tbl in data_dict.items():
    tbl.to_sql(key, new_con, if_exists='replace', index=False)

new_cursor = new_con.cursor()
new_table_names = new_cursor.execute("SELECT name FROM sqlite_master  WHERE type='table';").fetchall()
print(new_table_names)

[('dim_customer',), ('dim_product',), ('fact_pre_discount',), ('fact_manufacturing_cost',), ('fact_gross_price',), ('fact_sales_monthly',)]


## Export to CSV for Tableau


In [184]:
for key, tbl in data_dict.items():
    tbl.to_csv('csv/' + key + '.csv', index=False)
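
`DataFrame.to_csv` does not create missing directories, so the loop above assumes that the 'csv' folder already exists. A minimal sketch of an export that creates the folder first; the temporary directory and the `tables` dictionary are hypothetical stand-ins for the project folder and `data_dict`:

```python
import tempfile
from pathlib import Path

import pandas as pd

# hypothetical stand-in for data_dict
tables = {'dim_example': pd.DataFrame({'code': ['A1', 'B2'], 'value': [1, 2]})}

# a temporary location is used here only to keep the example self-contained
out_dir = Path(tempfile.mkdtemp()) / 'csv'
out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

for name, tbl in tables.items():
    tbl.to_csv(out_dir / f'{name}.csv', index=False)

written = sorted(p.name for p in out_dir.glob('*.csv'))
print(written)
```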

# Exploratory Analysis

**Note:** 
For the exploratory data analysis (EDA) we use SQL queries. Because the data set is small, we define a single 'WITH' statement that joins all tables into one comprehensive table and run every query against it. In practice, it is advisable to split queries and separate calculations from data retrieval, so that only the necessary information is fetched.
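
The pattern is plain string concatenation: the 'WITH' part is written once, and each analysis query is appended to it. A minimal sketch with an in-memory database; the table names, columns, and numbers are hypothetical and much smaller than the real schema:

```python
import sqlite3

con = sqlite3.connect(':memory:')
con.executescript('''
CREATE TABLE sales (product_code TEXT, fiscal_year INTEGER, sold_quantity REAL);
CREATE TABLE prices (product_code TEXT, fiscal_year INTEGER, gross_price REAL);
INSERT INTO sales VALUES ('A1', 2018, 10), ('A1', 2019, 20), ('B2', 2018, 5);
INSERT INTO prices VALUES ('A1', 2018, 2.0), ('A1', 2019, 3.0), ('B2', 2018, 4.0);
''')

# shared common table expression, written once
q_cte = '''
WITH all_data AS (
    SELECT s.product_code, s.fiscal_year, s.sold_quantity, p.gross_price
    FROM sales s
        LEFT JOIN prices p ON s.fiscal_year = p.fiscal_year AND s.product_code = p.product_code
)
'''

# one of possibly many queries that read from the shared table
q_by_year = '''
SELECT fiscal_year, SUM(sold_quantity * gross_price) AS gross_revenue
FROM all_data
GROUP BY fiscal_year
ORDER BY fiscal_year
'''

rows = con.execute(q_cte + q_by_year).fetchall()
con.close()
print(rows)
```

The trade-off is that every query re-evaluates the full join, which is acceptable for a data set of this size.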

In [185]:
region_order = ['APAC', 'EU', 'LATAM', 'NA']

In [186]:
# WITH SQL statement for all data

q_all_data = '''
WITH all_data AS 
(
    SELECT
        DATE(fsm.date) AS date,
        fsm.year as year,
        fsm.product_code AS product_code,
        p.division AS division,
        p.segment AS segment,
        p.category AS category,
        p.product AS product,
        p.variant AS variant,
        fsm.customer_code AS customer_code,
        c.customer AS customer,
        c.platform AS platform,
        c.channel AS channel,
        c.market AS market,
        c.sub_zone AS sub_zone,
        c.region AS region,
        fsm.fiscal_year AS fiscal_year,
        fsm.sold_quantity AS sold_quantity,
        fgp.gross_price AS gross_price,
        fpd.pre_invoice_discount_pct AS pre_invoice_discount_pct,
        fmc.manufacturing_cost AS manufacturing_cost
    FROM fact_sales_monthly fsm
        LEFT JOIN fact_gross_price fgp ON fsm.fiscal_year = fgp.fiscal_year AND fsm.product_code = fgp.product_code
        LEFT JOIN fact_pre_discount fpd ON fsm.fiscal_year = fpd.fiscal_year AND fsm.customer_code = fpd.customer_code
        LEFT JOIN fact_manufacturing_cost fmc ON fsm.fiscal_year = fmc.fiscal_year AND fsm.product_code = fmc.product_code
        LEFT JOIN dim_customer c ON fsm.customer_code = c.customer_code
        LEFT JOIN dim_product p ON fsm.product_code = p.product_code
)
'''


In [187]:
# to test the approach we create ULTIMATE dataframe
q_test = q_all_data + 'SELECT * FROM all_data'

new_data_types = {
    'customer_code': 'category',
    'customer':'category',
    'platform':'category',
    'channel':'category',
    'market':'category',
    'sub_zone':'category',
    'region':'category',
    'product_code':'category',
    'division':'category',
    'segment':'category',
    'category':'category',
    'product':'category',
    'variant':'category',
    'fiscal_year':'category',
    'pre_invoice_discount_pct':'float64',
    'year':'int16',
    'manufacturing_cost':'float64',
    'gross_price':'float64',
    # 'date':,
    'sold_quantity':'float64',
}

ultimate_data =  pd.read_sql_query(q_test, new_con, dtype=new_data_types, parse_dates=data_dates)
ultimate_data.info()
display(ultimate_data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 67250 entries, 0 to 67249
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype         
---  ------                    --------------  -----         
 0   date                      67250 non-null  datetime64[ns]
 1   year                      67250 non-null  int16         
 2   product_code              67250 non-null  category      
 3   division                  67250 non-null  category      
 4   segment                   67250 non-null  category      
 5   category                  67250 non-null  category      
 6   product                   67250 non-null  category      
 7   variant                   67250 non-null  category      
 8   customer_code             67250 non-null  category      
 9   customer                  67250 non-null  category      
 10  platform                  67250 non-null  category      
 11  channel                   67250 non-null  category      
 12  market            

Unnamed: 0,date,year,product_code,division,segment,category,product,variant,customer_code,customer,platform,channel,market,sub_zone,region,fiscal_year,sold_quantity,gross_price,pre_invoice_discount_pct,manufacturing_cost
0,2017-09-01,2017,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,70002017,Atliq Exclusive,Brick & Mortar,Direct,India,India,APAC,2018,51.0,15.3952,0.0824,4.619
1,2017-09-01,2017,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,70002018,Atliq e Store,E-Commerce,Direct,India,India,APAC,2018,77.0,15.3952,0.2956,4.619
2,2017-09-01,2017,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,70003181,Atliq Exclusive,Brick & Mortar,Direct,Indonesia,ROA,APAC,2018,17.0,15.3952,0.0536,4.619
3,2017-09-01,2017,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,70003182,Atliq e Store,E-Commerce,Direct,Indonesia,ROA,APAC,2018,6.0,15.3952,0.2378,4.619
4,2017-09-01,2017,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,70006157,Atliq Exclusive,Brick & Mortar,Direct,Philiphines,ROA,APAC,2018,5.0,15.3952,0.1057,4.619


## Dynamics over time

In [188]:
# create dataframe
q_time = '''
SELECT
    date,
    CAST(fiscal_year as str) as fiscal_year,
    sum(sold_quantity) as quantity,
    SUM(sold_quantity * gross_price) AS gross_revenue,
    SUM(sold_quantity * gross_price * pre_invoice_discount_pct) AS discount, 
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct) AS net_revenue,
    SUM(sold_quantity * manufacturing_cost) AS costs,
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct - sold_quantity * manufacturing_cost) AS margin
FROM all_data
GROUP BY date, fiscal_year
'''

monthly_data =  pd.read_sql_query(q_all_data + q_time, new_con, parse_dates=data_dates)
monthly_data['month'] = monthly_data['date'].dt.month_name(locale='English')
monthly_data['year'] = monthly_data['date'].dt.year
monthly_data.info()
display(monthly_data.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           52 non-null     datetime64[ns]
 1   fiscal_year    52 non-null     int64         
 2   quantity       52 non-null     float64       
 3   gross_revenue  52 non-null     float64       
 4   discount       52 non-null     float64       
 5   net_revenue    52 non-null     float64       
 6   costs          52 non-null     float64       
 7   margin         52 non-null     float64       
 8   month          52 non-null     object        
 9   year           52 non-null     int64         
dtypes: datetime64[ns](1), float64(6), int64(2), object(1)
memory usage: 4.2+ KB


Unnamed: 0,date,fiscal_year,quantity,gross_revenue,discount,net_revenue,costs,margin,month,year
0,2017-09-01,2018,11425.0,203560.7803,50011.079142,153549.701158,60487.6785,93062.022658,September,2017
1,2017-10-01,2018,14860.0,264533.7946,60499.327728,204034.466872,78490.0838,125544.383072,October,2017
2,2017-11-01,2018,21012.0,375191.4062,88199.124663,286992.281537,111272.7596,175719.521937,November,2017
3,2017-12-01,2018,21615.0,385598.6583,94842.336255,290756.322045,114595.9086,176160.413445,December,2017
4,2018-01-01,2018,11713.0,208699.9808,50507.739083,158192.241717,61910.378,96281.863717,January,2018


In [189]:
# order of months as per fiscal year
order_of_months = pd.date_range(start='2018-09', freq='M', periods=12).month_name(locale='English').to_list()
order_of_months

['September',
 'October',
 'November',
 'December',
 'January',
 'February',
 'March',
 'April',
 'May',
 'June',
 'July',
 'August']

In [190]:
# gross revenue over time
fig=px.line(
    monthly_data,
    x='month',
    y='gross_revenue',
    color='fiscal_year',
    template='seaborn',
    category_orders={'month':order_of_months},
    title='Gross revenue over time'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="gross revenue"
)

fig.show()

In [191]:
# yearly growth rates by months

monthly_data \
    .groupby(['month','fiscal_year']) \
    .agg(growth_rate_yoy = ('gross_revenue','sum')) \
    .unstack() \
    .reindex(index=order_of_months) \
    .transform(lambda x: x / x.shift(1, axis=1) - 1) \
    .style.format('{:.1%}')

month / fiscal_year,2018,2019,2020,2021,2022
September,nan%,150.3%,124.4%,57.1%,275.7%
October,nan%,134.5%,124.9%,62.2%,272.9%
November,nan%,114.2%,142.0%,55.3%,283.6%
December,nan%,126.8%,136.2%,57.9%,279.0%
January,nan%,136.9%,125.9%,60.0%,nan%
February,nan%,140.8%,118.8%,64.8%,nan%
March,nan%,113.7%,-67.9%,1055.0%,nan%
April,nan%,124.3%,4.4%,251.7%,nan%
May,nan%,137.7%,29.1%,177.4%,nan%
June,nan%,126.5%,104.6%,75.7%,nan%


In [192]:
# net revenue over time
fig=px.line(
    monthly_data,
    x='month',
    y='net_revenue',
    color='fiscal_year',
    template='seaborn',
    category_orders={'month':order_of_months},
    title='Net revenue over time'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="net revenue"
)

fig.show()

In [193]:
# discount % over time
monthly_data['discount_pct'] = monthly_data['discount'] / monthly_data['gross_revenue']
fig=px.line(
    monthly_data,
    x='month',
    y='discount_pct',
    color='fiscal_year',
    template='seaborn',
    category_orders={'month':order_of_months},
    title='Discount over time'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="discount",
    yaxis_tickformat = '.1%'
)

fig.show()

In [194]:
# margin over time
fig=px.line(
    monthly_data,
    x='month',
    y='margin',
    color='fiscal_year',
    template='seaborn',
    category_orders={'month':order_of_months},
    title='Margin over time'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="margin"
)

fig.show()

In [195]:
# margin % over time
monthly_data['margin_pct'] = monthly_data['margin'] / monthly_data['gross_revenue']
fig=px.line(
    monthly_data,
    x='month',
    y='margin_pct',
    color='fiscal_year',
    template='seaborn',
    category_orders={'month':order_of_months},
    title='Margin (%) over time'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="margin",
    yaxis_tickformat = '.1%'
)

fig.show()

In [196]:
# Seasonality index

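# average monthly gross revenue within each fiscal year (any other metric can be substituted)
monthly_avg = monthly_data.groupby(['fiscal_year'])['gross_revenue'].mean()

# gross revenue of each (month, fiscal_year) pair; there is one row per pair, so mean() returns it unchanged
yearly_avg = monthly_data.groupby(['month', 'fiscal_year'])['gross_revenue'].mean()
# seasonality index: each month's revenue relative to the average month of its fiscal year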
seasonality_index = ( yearly_avg /monthly_avg).unstack().T.reindex(columns=order_of_months).style.format('{:.1%}')
seasonality_index

fiscal_year / month,September,October,November,December,January,February,March,April,May,June,July,August
2018,83.0%,107.8%,152.9%,157.2%,85.1%,84.6%,91.4%,89.5%,85.8%,86.9%,85.8%,90.0%
2019,92.1%,112.1%,145.2%,158.0%,89.3%,90.3%,86.6%,89.0%,90.4%,87.3%,80.8%,79.0%
2020,104.2%,127.1%,177.2%,188.1%,101.7%,99.6%,14.0%,46.8%,58.9%,90.0%,94.7%,97.7%
2021,87.4%,110.1%,147.0%,158.7%,87.0%,87.7%,86.4%,88.0%,87.2%,84.5%,87.2%,88.8%
2022,69.0%,86.3%,118.5%,126.3%,nan%,nan%,nan%,nan%,nan%,nan%,nan%,nan%
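
The seasonality index divides each month's revenue by the average monthly revenue of the same fiscal year, so a value above 100% marks a month that is stronger than the year's average. A minimal sketch of the same idea with `groupby(...).transform`, on hypothetical toy numbers:

```python
import pandas as pd

# hypothetical toy data: three months in each of two fiscal years
toy = pd.DataFrame({
    'fiscal_year': [2018, 2018, 2018, 2019, 2019, 2019],
    'month': ['September', 'October', 'November'] * 2,
    'gross_revenue': [80.0, 100.0, 120.0, 150.0, 200.0, 250.0],
})

# mean monthly revenue of the row's fiscal year, broadcast back to every row
year_mean = toy.groupby('fiscal_year')['gross_revenue'].transform('mean')
toy['seasonality_index'] = toy['gross_revenue'] / year_mean

index_table = toy.pivot(index='fiscal_year', columns='month', values='seasonality_index')
index_table = index_table[['September', 'October', 'November']]  # fiscal-year month order
print(index_table.round(3))
```

Note that the average of an incomplete year covers only the months available. FY2022 contains only September to December, including the two peak months, which likely explains why its indices in the table above are lower than in other years.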


### Conclusion
1. Revenue and margin improve steadily from year to year. FY2022 stands out: in the four months available (September to December), gross revenue grew by roughly 270-280% year over year. A more in-depth analysis is needed to identify the drivers of this growth. It could stem from acquiring new clients, expanding into new markets, or introducing new products, or it might simply reflect the way the data are presented. We'll also investigate whether the seasonality pattern holds across all markets and products or is specific to particular segments.
2. There is noticeable seasonality across months: October to December are the peak months, while the rest of the year is relatively flat.
3. A sharp drop is observed in March to May 2020 (March gross revenue fell 67.9% year over year), likely attributable to the consequences of COVID-19.
4. Discounts consistently fall within a narrow range of 22-25% of gross revenue throughout the available time span.
5. Margins remain stable at 45-48% of gross revenue, with occasional fluctuations.

## Geographical drill-down

In [197]:
# create query
q_geo = """
SELECT
    date,
    market,
    sub_zone,
    region,
    SUM(sold_quantity) as quantity,
    SUM(sold_quantity * gross_price) AS gross_revenue,
    SUM(sold_quantity * gross_price * pre_invoice_discount_pct) AS discount, 
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct) AS net_revenue,
    SUM(sold_quantity * manufacturing_cost) AS costs,
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct - sold_quantity * manufacturing_cost) AS margin
FROM all_data
GROUP BY date, market, sub_zone, region
"""
geo_data =  pd.read_sql_query(q_all_data + q_geo, new_con, parse_dates=data_dates)
geo_data['year'] = geo_data['date'].dt.year
geo_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1129 entries, 0 to 1128
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           1129 non-null   datetime64[ns]
 1   market         1129 non-null   object        
 2   sub_zone       1129 non-null   object        
 3   region         1129 non-null   object        
 4   quantity       1129 non-null   float64       
 5   gross_revenue  1129 non-null   float64       
 6   discount       1129 non-null   float64       
 7   net_revenue    1129 non-null   float64       
 8   costs          1129 non-null   float64       
 9   margin         1129 non-null   float64       
 10  year           1129 non-null   int64         
dtypes: datetime64[ns](1), float64(6), int64(1), object(3)
memory usage: 97.2+ KB


In [198]:
# Overview
pd.pivot_table(
    data=geo_data,
    index=[
        'region',
        'sub_zone',
        # 'market'
    ],
    values=['quantity','gross_revenue', 'discount', 'net_revenue', 'costs', 'margin'],
    aggfunc='sum',
    # margins=True
) \
.eval(
    """
      avg_price=gross_revenue / quantity
      discount_pct=discount / gross_revenue
      margin_pct=margin/gross_revenue
    """
) \
.reindex(columns=['quantity','gross_revenue', 'avg_price', 'discount', 'discount_pct', 'net_revenue', 'costs', 'margin', 'margin_pct']) \
.style.format(
    {
        'quantity': '{:,.0f}',
        'gross_revenue': '{:,.0f}',
        'avg_price': '{:,.1f}',
        'discount': '{:,.0f}',
        'discount_pct': '{:.1%}',
        'net_revenue': '{:,.0f}',
        'costs': '{:,.0f}',
        'margin': '{:,.0f}',
        'margin_pct': '{:.1%}',
    }
) \
.background_gradient(cmap='vlag_r')


region,sub_zone,quantity,gross_revenue,avg_price,discount,discount_pct,net_revenue,costs,margin,margin_pct
APAC,ANZ,192601,4424988,23.0,1002594,22.7%,3422395,1313709,2108686,47.7%
APAC,India,1087710,24674547,22.7,5872890,23.8%,18801658,7322732,11478926,46.5%
APAC,ROA,863084,19656591,22.8,4575263,23.3%,15081327,5833159,9248169,47.0%
EU,NE,436805,10236309,23.4,2445063,23.9%,7791245,3038336,4752909,46.4%
EU,SE,376303,8702248,23.1,2012027,23.1%,6690221,2583148,4107073,47.2%
LATAM,LATAM,19539,426877,21.8,106634,25.0%,320244,126808,193435,45.3%
NA,NA,806893,18434349,22.8,4229059,22.9%,14205290,5471746,8733544,47.4%


In [199]:
# countries of presence
pd.pivot_table(
    data=geo_data,
    index='region',
    columns='year',
    values='market',
    aggfunc='nunique',
    margins=True
)

region / year,2017,2018,2019,2020,2021,All
APAC,6,8,10,10,10,10
EU,4,8,11,11,11,11
LATAM,1,2,4,4,4,4
NA,2,2,2,2,2,2
All,13,20,27,27,27,27


In [200]:
# top 3 countries by revenue in every region
geo_data \
    .groupby(['region', 'market'])['gross_revenue'] \
    .sum() \
    .groupby(['region'], group_keys=False) \
    .nlargest(3) \
    .to_frame().style.format('{:,.0f}')

region,market,gross_revenue
APAC,India,24674547
APAC,South Korea,7645632
APAC,Philiphines,4354125
EU,United Kingdom,3817765
EU,France,3367269
EU,Italy,2121230
LATAM,Brazil,194594
LATAM,Mexico,148521
LATAM,Chile,65268
NA,USA,13890448


In [201]:
# growth rates

geo_data_growth_rate = geo_data \
    .groupby(['date' , 'region']) \
    .agg(growth_rate_yoy = ('gross_revenue','sum')) \
    .unstack() \
    .transform(lambda x: x / x.shift(12, axis=0) - 1) \
    .tail(geo_data.date.nunique() - 12) \
    .stack()

In [202]:
# growth rates
fig=px.line(
    geo_data_growth_rate.reset_index(),
    x='date',
    y='growth_rate_yoy',
    color='region',
    template='seaborn',
    title='Growth rate over time by Regions'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="growth rate",
    yaxis_tickformat = '.1%'
)

fig.show()

In [203]:
# growth index

geo_data_index = geo_data \
    .groupby(['date' , 'region']) \
    .agg(growth_rate_yoy = ('gross_revenue','sum')) \
    .unstack() \
    .rolling(12).sum() \
    .transform(lambda x: (x / x.shift(1, axis=0)).cumprod()) \
    .tail(geo_data.date.nunique()-11) \
    .fillna(1) \
    .stack()

geo_data_index

date,region,growth_rate_yoy
2018-08-01,APAC,1.000000
2018-08-01,EU,1.000000
2018-08-01,LATAM,1.000000
2018-08-01,,1.000000
2018-09-01,APAC,1.077332
...,...,...
2021-11-01,,15.478826
2021-12-01,APAC,13.744180
2021-12-01,EU,73.308687
2021-12-01,LATAM,9.218667


In [204]:
# growth index
fig=px.line(
    geo_data_index.reset_index(),
    x='date',
    y='growth_rate_yoy',
    color='region',
    template='seaborn',
    title='Revenue growth index by Regions (LTM, Aug 2018 = 1)'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="growth index",
    yaxis_tickformat = '.1f'
)

fig.show()

In [205]:
# revenue over time
fig = px.histogram(
    geo_data,
    x="date",
    y="gross_revenue",
    color="region",
    title='Gross revenue over time',
    template='seaborn',
    barmode='relative',
    category_orders={'region':region_order},
    nbins=geo_data.date.nunique()
)
fig.update_layout(
    xaxis_title="month", yaxis_title="gross revenue"
)
fig.show()

In [206]:
# revenue over time by sub_zones
fig = px.histogram(
    geo_data,
    x="year",
    y="gross_revenue",
    color="sub_zone",
    title='Gross revenue by sub-zone and year',
    template='seaborn',
    barmode='relative',
    #nbins=geo_data.date.nunique()
)
fig.update_layout(
    xaxis_title="year", yaxis_title="gross revenue"
)
fig.show()

In [207]:
# net revenue over time
fig = px.histogram(
    geo_data,
    x="date",
    y="net_revenue",
    color="region",
    title='Net revenue over time',
    template='seaborn',
    category_orders={'region':region_order},
    nbins=geo_data.date.nunique()
)
fig.update_layout(
    xaxis_title="Month", yaxis_title="net revenue"
)
fig.show()

In [208]:
# costs over time
fig = px.histogram(
    geo_data,
    x="date",
    y="costs",
    color="region",
    title='Costs over time',
    template='seaborn',
    category_orders={'region':region_order},
    nbins=geo_data.date.nunique()
)
fig.update_layout(
    xaxis_title="Month", yaxis_title="costs"
)
fig.show()

In [209]:
# margin over time
fig = px.histogram(
    geo_data,
    x="date",
    y="margin",
    color="region",
    title='Margin over time',
    template='seaborn',
    category_orders={'region':region_order},
    nbins=geo_data.date.nunique()
)
fig.update_layout(
    xaxis_title="Month", yaxis_title="margin"
)
fig.show()

In [210]:
# percentages across regions

geo_margin_pct = geo_data \
    .groupby(['date', 'region'])[['margin', 'net_revenue', 'gross_revenue', 'discount']] \
    .sum() \
    .eval("""
          margin_pct = margin / gross_revenue
          discount_pct = discount / gross_revenue
          """) [['margin_pct', 'discount_pct']] \
    .reset_index()
    
geo_margin_pct

,date,region,margin_pct,discount_pct
0,2017-09-01,APAC,0.460402,0.242375
1,2017-09-01,EU,0.480441,0.222602
2,2017-09-01,LATAM,0.423415,0.280300
3,2017-09-01,,0.439927,0.263106
4,2017-10-01,APAC,0.478381,0.224994
...,...,...,...,...
203,2021-11-01,,0.487887,0.216618
204,2021-12-01,APAC,0.450654,0.253704
205,2021-12-01,EU,0.466108,0.238176
206,2021-12-01,LATAM,0.457852,0.245517


In [211]:
# % discount
fig=px.line(
    geo_margin_pct,
    x='date',
    y='discount_pct',
    color='region',
    template='seaborn',
    title='Discount (%) over time by Regions'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="discount",
    yaxis_tickformat = '.1%'
)

fig.show()

In [212]:
# % margin
fig=px.line(
    geo_margin_pct,
    x='date',
    y='margin_pct',
    color='region',
    template='seaborn',
    title='Margin (%) over time by Regions'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="margin",
    yaxis_tickformat = '.1%'
)

fig.show()

### Conclusion:
1. The company operates in four regions. It expanded from 13 countries in 2017 to 27 in 2019 and has kept that footprint since.
2. On the last-twelve-months growth index, EU is the fastest-growing region: about 73 in December 2021 (August 2018 = 1), against about 14 for APAC and 9 for LATAM.
3. APAC is the largest region by gross revenue and LATAM is the smallest.
4. Seasonal patterns are consistent across all regions.
5. Margin and discount percentages vary little across regions. The exception is LATAM, where discounts have been higher in recent years and margins are the lowest.
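
The growth index in the charts above chains month-over-month ratios of last-twelve-months (LTM) revenue. As a minimal sketch on made-up numbers (the series `monthly` and its values are hypothetical, not AtliQ data), the chained product gives the same result as dividing each LTM value by the first complete 12-month window:

```python
import pandas as pd

# Hypothetical monthly revenue for one region (illustrative values only)
dates = pd.date_range("2017-09-01", periods=16, freq="MS")
monthly = pd.Series([100.0 + 10 * i for i in range(16)], index=dates)

# Last-twelve-months revenue; the first 11 values are NaN
ltm = monthly.rolling(12).sum()

# Chained form, mirroring the notebook: cumulative product of LTM ratios,
# with the first complete window set to 1 as the base
chained = (ltm / ltm.shift(1)).cumprod().tail(len(monthly) - 11).fillna(1)

# Direct form: every LTM value relative to the first complete window
direct = (ltm / ltm.dropna().iloc[0]).tail(len(monthly) - 11)

print(pd.DataFrame({"chained": chained, "direct": direct}).round(4))
```

Both columns should agree, so an index value of 2 simply means that LTM revenue has doubled relative to the base month.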

## Sales channels

In [213]:
# create query

q_channel = """
SELECT
    date,
    platform,
    channel,
    CAST(fiscal_year as str) as fiscal_year,
    SUM(sold_quantity) as quantity,
    SUM(sold_quantity * gross_price) AS gross_revenue,
    SUM(sold_quantity * gross_price * pre_invoice_discount_pct) AS discount, 
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct) AS net_revenue,
    SUM(sold_quantity * manufacturing_cost) AS costs,
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct - sold_quantity * manufacturing_cost) AS margin
FROM all_data
GROUP BY date, platform, channel, fiscal_year
"""
channel_data =  pd.read_sql_query(q_all_data + q_channel, new_con, parse_dates=data_dates)
channel_data['year'] = channel_data['date'].dt.year
channel_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260 entries, 0 to 259
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           260 non-null    datetime64[ns]
 1   platform       260 non-null    object        
 2   channel        260 non-null    object        
 3   fiscal_year    260 non-null    int64         
 4   quantity       260 non-null    float64       
 5   gross_revenue  260 non-null    float64       
 6   discount       260 non-null    float64       
 7   net_revenue    260 non-null    float64       
 8   costs          260 non-null    float64       
 9   margin         260 non-null    float64       
 10  year           260 non-null    int64         
dtypes: datetime64[ns](1), float64(6), int64(2), object(2)
memory usage: 22.5+ KB
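
The `info()` output lists `fiscal_year` as `int64` even though the query casts it `as str`. One likely explanation, offered as an assumption rather than something verified against this database: SQLite has no `str` type, and a type name it does not recognise gets NUMERIC affinity, so an integer value stays an integer. A minimal sketch on an in-memory database (the literal `2018` is illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# 'str' matches none of SQLite's type-name rules (INT, CHAR/CLOB/TEXT, BLOB, REAL/FLOA/DOUB),
# so the cast applies NUMERIC affinity and the value keeps its integer storage class
as_str = con.execute("SELECT typeof(CAST(2018 AS str))").fetchone()[0]

# Casting to TEXT is what produces a string value
as_text = con.execute("SELECT typeof(CAST(2018 AS TEXT))").fetchone()[0]

print(as_str, as_text)
con.close()
```

For this notebook the integer result is convenient, because later cells merge on `fiscal_year` as an integer.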


In [214]:
# Overview
pd.pivot_table(
    data=channel_data,
    index=[
        'channel',
        'platform'
        
    ],
    values=['quantity','gross_revenue', 'discount', 'net_revenue', 'costs', 'margin'],
    aggfunc='sum',
    # margins=True
) \
.eval(
    """
      avg_price=gross_revenue / quantity
      discount_pct=discount / gross_revenue
      margin_pct=margin/gross_revenue
    """
) \
.reindex(columns=['quantity','gross_revenue', 'avg_price', 'discount', 'discount_pct', 'net_revenue', 'costs', 'margin', 'margin_pct']) \
.style.format(
    {
        'quantity': '{:,.0f}',
        'gross_revenue': '{:,.0f}',
        'avg_price': '{:,.1f}',
        'discount': '{:,.0f}',
        'discount_pct': '{:.1%}',
        'net_revenue': '{:,.0f}',
        'costs': '{:,.0f}',
        'margin': '{:,.0f}',
        'margin_pct': '{:.1%}',
    }
) \
.background_gradient(cmap='vlag_r')

channel,platform,quantity,gross_revenue,avg_price,discount,discount_pct,net_revenue,costs,margin,margin_pct
Direct,Brick & Mortar,264399,6042435,22.9,486349,8.0%,5556086,1793772,3762314,62.3%
Direct,E-Commerce,340492,7748695,22.8,1821761,23.5%,5926935,2300488,3626446,46.8%
Distributor,Brick & Mortar,443286,10062149,22.7,2599312,25.8%,7462837,2985763,4477074,44.5%
Retailer,Brick & Mortar,1949942,44857926,23.0,10971857,24.5%,33886070,13314679,20571391,45.9%
Retailer,E-Commerce,784816,17844703,22.7,4364250,24.5%,13480453,5294937,8185516,45.9%


In [215]:
# revenue over time by channels
fig = px.histogram(
    channel_data,
    x="date",
    y="gross_revenue",
    color="channel",
    pattern_shape="platform",
    pattern_shape_sequence=["|", "-"],
    title='Gross revenue over time',
    template='seaborn',
    barmode='relative',
    nbins=channel_data.date.nunique()
)
fig.update_layout(
    xaxis_title="month", yaxis_title="gross revenue"
)
fig.show()

In [216]:
# growth index

channel_data_index = channel_data \
    .groupby(['date' , 'platform', 'channel']) \
    .agg(growth_rate_yoy = ('gross_revenue','sum')) \
    .unstack([1, 2]) \
    .rolling(12).sum() \
    .transform(lambda x: (x / x.shift(1, axis=0)).cumprod()) \
    .tail(channel_data.date.nunique()-11) \
    .fillna(1) \
    .stack([1, 2])

channel_data_index

date,platform,channel,growth_rate_yoy
2018-08-01,Brick & Mortar,Direct,1.000000
2018-08-01,Brick & Mortar,Distributor,1.000000
2018-08-01,Brick & Mortar,Retailer,1.000000
2018-08-01,E-Commerce,Direct,1.000000
2018-08-01,E-Commerce,Retailer,1.000000
...,...,...,...
2021-12-01,Brick & Mortar,Direct,17.132156
2021-12-01,Brick & Mortar,Distributor,13.484397
2021-12-01,Brick & Mortar,Retailer,20.339773
2021-12-01,E-Commerce,Direct,20.495582


In [217]:
# growth index
fig=px.line(
    channel_data_index.reset_index(),
    x='date',
    y='growth_rate_yoy',
    color='channel',
    line_dash='platform',
    template='seaborn',
    title='Growth index over time by Platform and Channel (LTM, Aug 2018 = 1)'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="growth index",
    yaxis_tickformat = '.1f'
)

fig.show()

In [218]:
# percentages across channels

channel_margin_pct = channel_data \
    .groupby(['date', 'platform', 'channel'])[['margin', 'net_revenue', 'gross_revenue', 'discount']] \
    .sum() \
    .eval("""
          margin_pct = margin / gross_revenue
          discount_pct = discount / gross_revenue
          """) [['margin_pct', 'discount_pct']] \
    .reset_index()
    
channel_margin_pct

,date,platform,channel,margin_pct,discount_pct
0,2017-09-01,Brick & Mortar,Direct,0.627451,0.075888
1,2017-09-01,Brick & Mortar,Distributor,0.410943,0.290939
2,2017-09-01,Brick & Mortar,Retailer,0.442412,0.260428
3,2017-09-01,E-Commerce,Direct,0.450892,0.252260
4,2017-09-01,E-Commerce,Retailer,0.454275,0.248823
...,...,...,...,...,...
255,2021-12-01,Brick & Mortar,Direct,0.620312,0.083586
256,2021-12-01,Brick & Mortar,Distributor,0.439818,0.264361
257,2021-12-01,Brick & Mortar,Retailer,0.461139,0.243121
258,2021-12-01,E-Commerce,Direct,0.454861,0.249516


In [219]:
# % discount
fig=px.line(
    channel_margin_pct,
    x='date',
    y='discount_pct',
    color='channel',
    line_dash='platform',
    template='seaborn',
    title='Discount (%) over time by Platform and Channel'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="discount",
    yaxis_tickformat = '.1%'
)

fig.show()

In [220]:
# % margin
fig=px.line(
    channel_margin_pct,
    x='date',
    y='margin_pct',
    color='channel',
    line_dash='platform',
    template='seaborn',
    title='Margin (%) over time by Platform and Channel'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="margin",
    yaxis_tickformat = '.1%'
)

fig.show()

### Conclusion:
1. There are five ways to connect with customers across various points of presence:
    - Own webstore
    - Own offline stores
    - Distributors with local operations
    - Retailers with local stores
    - Sales via partner retailers' webstores (marketplaces)

    **Note for Tableau dashboard:** Detailed analysis should be conducted within each of these specific channels, while only a broad comparison can be made across channels.

2. Retail sales through Brick & Mortar dominate, accounting for just over half of gross revenue (about 52%). The channel is also among the fastest growing: its last-twelve-months index reached about 20.3 by December 2021, on par with the own webstore (about 20.5).
3. Discounts are consistent across channels at roughly 20% to 27% of gross revenue, except in own offline stores (Direct sales in Brick & Mortar), where they are only 7-8%, possibly because of higher overhead costs.
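
The revenue share of each channel can be checked directly from the overview table. The sketch below re-enters the gross revenue totals from that table by hand (the variable names are illustrative) and computes each channel's share of the total:

```python
import pandas as pd

# Gross revenue by channel and platform, copied from the overview table above
gross_revenue = pd.Series(
    {
        ("Direct", "Brick & Mortar"): 6_042_435,
        ("Direct", "E-Commerce"): 7_748_695,
        ("Distributor", "Brick & Mortar"): 10_062_149,
        ("Retailer", "Brick & Mortar"): 44_857_926,
        ("Retailer", "E-Commerce"): 17_844_703,
    }
)

# Share of total gross revenue for each channel/platform pair
share = gross_revenue / gross_revenue.sum()
print(share.map("{:.1%}".format))
```

Retailers in Brick & Mortar come out slightly above one half, with marketplaces (Retailer, E-Commerce) a distant second.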

## Customer perspective

In [221]:
# create query

q_customer = """
SELECT
    date,
    customer_code,
    customer,
    platform,
    channel,
    region,
    market,
    CAST(fiscal_year as str) as fiscal_year,
    SUM(sold_quantity) as quantity,
    SUM(sold_quantity * gross_price) AS gross_revenue,
    SUM(sold_quantity * gross_price * pre_invoice_discount_pct) AS discount, 
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct) AS net_revenue,
    SUM(sold_quantity * manufacturing_cost) AS costs,
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct - sold_quantity * manufacturing_cost) AS margin
FROM all_data
GROUP BY date, customer_code, customer, platform, channel, region, market, fiscal_year
"""
customer_data =  pd.read_sql_query(q_all_data + q_customer, new_con, parse_dates=data_dates)
customer_data['year'] = customer_data['date'].dt.year
customer_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6707 entries, 0 to 6706
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           6707 non-null   datetime64[ns]
 1   customer_code  6707 non-null   object        
 2   customer       6707 non-null   object        
 3   platform       6707 non-null   object        
 4   channel        6707 non-null   object        
 5   region         6707 non-null   object        
 6   market         6707 non-null   object        
 7   fiscal_year    6707 non-null   int64         
 8   quantity       6707 non-null   float64       
 9   gross_revenue  6707 non-null   float64       
 10  discount       6707 non-null   float64       
 11  net_revenue    6707 non-null   float64       
 12  costs          6707 non-null   float64       
 13  margin         6707 non-null   float64       
 14  year           6707 non-null   int64         
dtypes: datetime64[ns](1),

### Overview

In [222]:
# top 5 customers in each channel
customer_data \
    .groupby(['channel','platform','customer'])['gross_revenue'] \
    .sum() \
    .groupby(['channel', 'platform'], group_keys=False) \
    .nlargest(5) \
    .to_frame().style.format('{:,.0f}')

channel,platform,customer,gross_revenue
Direct,Brick & Mortar,Atliq Exclusive,6042435
Direct,E-Commerce,Atliq e Store,7748695
Distributor,Brick & Mortar,Sage,2782356
Distributor,Brick & Mortar,Leader,2753139
Distributor,Brick & Mortar,Neptune,1623534
Distributor,Brick & Mortar,Novus,1533944
Distributor,Brick & Mortar,Synthetic,1369177
Retailer,Brick & Mortar,Electricalsocity,1791628
Retailer,Brick & Mortar,Propel,1613682
Retailer,Brick & Mortar,Electricalslytical,1526938


In [223]:
# number of customers and average metrics per customer
customer_data \
    .groupby(['channel','platform','customer']) \
    .agg(
        {
            'customer':'nunique',
            'customer_code': 'nunique',
            # 'platform':'nunique',
            # 'channel':'nunique',
            'region':'nunique',
            'market':'nunique',
            'gross_revenue':'sum',
        }
    ) \
    .groupby(by=['channel','platform']) \
    .agg(
        {
            'customer':'sum',
            'customer_code': 'sum',
            # 'platform':'mean',
            # 'channel':'mean',
            'region': 'mean',
            'market': 'mean',
            'gross_revenue':'mean',
        }
    ) \
    .style.format(
        {
            'customer': '{:,.0f}',
            'customer_code': '{:,.0f}',
            'region': '{:,.1f}',
            'market': '{:,.1f}',
            'gross_revenue': '{:,.0f}',
        }
)

channel,platform,customer,customer_code,region,market,gross_revenue
Direct,Brick & Mortar,1,16,3.0,16.0,6042435
Direct,E-Commerce,1,24,4.0,24.0,7748695
Distributor,Brick & Mortar,5,5,1.0,1.0,2012430
Retailer,Brick & Mortar,68,129,1.2,1.9,659675
Retailer,E-Commerce,4,35,2.5,7.8,4461176


### Regional overview

Because of the data structure, some "customers" are whole sales channels rather than final buyers, so not all customer metrics are meaningful. Where such analysis does apply, it is relevant only within the specific sales channel.
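
To make this caveat concrete, here is a small sketch on invented rows (customer names, codes, and revenue figures are all hypothetical). When one customer name stands for a whole channel, revenue per customer is just the channel total, whereas for ordinary retailers it is a genuine per-customer average:

```python
import pandas as pd

# Invented rows: one "customer" that is really a whole channel, plus two ordinary retailers
toy = pd.DataFrame(
    {
        "channel": ["Direct", "Direct", "Retailer", "Retailer", "Retailer"],
        "customer": ["Own store", "Own store", "Shop A", "Shop A", "Shop B"],
        "customer_code": ["c1", "c2", "c3", "c4", "c5"],
        "gross_revenue": [500.0, 300.0, 100.0, 60.0, 40.0],
    }
)

summary = toy.groupby("channel").agg(
    customers=("customer", "nunique"),
    customer_codes=("customer_code", "nunique"),
    gross_revenue=("gross_revenue", "sum"),
)

# Revenue per distinct customer name, and customer codes per customer name
summary["revenue_per_customer"] = summary["gross_revenue"] / summary["customers"]
summary["codes_per_customer"] = summary["customer_codes"] / summary["customers"]
print(summary)
```

In the toy "Direct" row the per-customer figure equals the whole channel's revenue, which is why such ratios are only compared within a channel.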

In [224]:
# Function that produces charts for customer metrics of the specific channel
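def customer_metrics(data: pd.DataFrame, platform = None, channel = None):
    """Plot monthly customer metrics by region, optionally filtered by platform and channel.

    `platform` and `channel` accept a single name or a list of names; None keeps all values.
    """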
    if isinstance(platform, str): platform = [platform]
    if isinstance(channel, str): channel = [channel]
    
    filtered_df = data.copy()
    if platform is not None:
        filtered_df = filtered_df[filtered_df['platform'].isin(platform)]
        # join the names so the chart title shows plain text rather than a list repr
        title_paltform = ', '.join(platform)
    else:
        title_paltform = 'all'
    if channel is not None:
        filtered_df = filtered_df[filtered_df['channel'].isin(channel)]
        title_channel = ', '.join(channel)
    else:
        title_channel = 'all'

    filtered_df = filtered_df\
        .groupby(['date' , 'region']) \
        .agg(
            {
                'gross_revenue':'sum',
                'net_revenue' : 'sum',
                'quantity': 'sum',
                'customer':'nunique',
                'customer_code': 'nunique',
                'discount': 'sum',
                'margin': 'sum'
            }
        ) \
        .eval(
            '''
            ARPU = gross_revenue / customer
            net_price_per_unit = net_revenue / quantity
            quantity_per_customer = quantity / customer
            penetration = customer_code / customer
            discount_pct = discount / gross_revenue
            margin_pct = margin / gross_revenue
            '''
        ) \
        .reset_index() \
        .melt(
                id_vars = ['date', 'region'],
                value_vars = ['gross_revenue','customer', 'ARPU', 'penetration', 'quantity_per_customer' , 'discount_pct' ,'net_price_per_unit' , 'margin_pct'],
                var_name = 'Metric'
        )

    fig = px.line(
            data_frame = filtered_df,
            x='date',
            y='value',
            color='region',
            template='seaborn',
            facet_col='Metric', 
            facet_col_wrap=2,
            facet_row_spacing = 0.1,
            facet_col_spacing = 0.05,
            # log_y=True
            title=f'Customer metrics for {title_paltform} platforms and {title_channel} channels',
            height=1000,
            category_orders = {'region':region_order}
        )

    fig.update_yaxes(showticklabels=True, matches=None, title=None)
    fig.update_xaxes(showticklabels=True, title=None)

    fig.update_yaxes(col=2, row=2, tickformat='.1%')
    fig.update_yaxes(col=2, row=1, tickformat='.1%')
    fig.update_yaxes(col=1, row=3, type='log')
    fig.update_yaxes(col=1, row=4, type='log')

    fig.update_layout(
        legend=dict(
            orientation="h",
            yanchor="bottom",
            y=1.02,
            xanchor="center",
            x=0.5
        )
    )

    fig.show()

#### Local retailers

In [225]:
customer_metrics(customer_data, channel='Retailer', platform='Brick & Mortar')

In [226]:
# histogram of applied discounts by regions
px.histogram(
    customer_data[
        (customer_data['channel']=='Retailer') 
        & (customer_data['platform']=='Brick & Mortar')
        ].groupby(['fiscal_year','customer_code', 'region', 'market']) \
        .agg(
            {
                'gross_revenue': 'sum',
                'discount': 'sum'
            }

        ) \
        .eval('discount_pct = discount / gross_revenue') \
        .reset_index(),
    x='discount_pct',
    facet_col='region',
    color='region',
    facet_col_wrap=2,
    template='seaborn',
    # barnorm='fraction',
    # histnorm='percent',
    nbins=10,
    marginal='box',
    category_orders = {'region':region_order},
    orientation='v',
    title='Distribution of discounts applied to customers across regions'
).update_xaxes(title=None, tickformat='.0%')

In [227]:
# growth rates vs discounts  - prepare
growth_by_customer_code = customer_data \
    .groupby(['fiscal_year','region', 'channel', 'platform', 'customer_code' ]) \
    .agg(growth_rate_yoy = ('gross_revenue','mean')) \
    .unstack([1,2, 3, 4]) \
    .transform(lambda x: x / x.shift(1, axis=0) - 1) \
    .stack([1,2,3,4]) \
    .reset_index()
    # .isna().sum()
    # .style.format('{:,.0f}')

growth_by_customer_code

,fiscal_year,region,channel,platform,customer_code,growth_rate_yoy
0,2019,APAC,Direct,Brick & Mortar,70002017,0.823287
1,2019,APAC,Direct,Brick & Mortar,70003181,1.204667
2,2019,APAC,Direct,Brick & Mortar,70006157,2.809565
3,2019,APAC,Direct,Brick & Mortar,70007198,0.578621
4,2019,APAC,Direct,Brick & Mortar,70008169,1.052065
...,...,...,...,...,...,...
682,2022,,Retailer,E-Commerce,90022081,3.152348
683,2022,,Retailer,E-Commerce,90022082,3.156744
684,2022,,Retailer,E-Commerce,90022083,3.426824
685,2022,,Retailer,E-Commerce,90023023,3.570985


In [228]:
# growth rates vs discounts - prepare
q_discounts = '''
SELECT
    CAST(fiscal_year as str) as fiscal_year,
    customer_code, 
    pre_invoice_discount_pct
FROM fact_pre_discount
'''
discount_data =  pd.read_sql_query(q_discounts, new_con)
discount_data['fiscal_year'] = discount_data['fiscal_year'].astype(int)
discount_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1045 entries, 0 to 1044
Data columns (total 3 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   fiscal_year               1045 non-null   int32  
 1   customer_code             1045 non-null   object 
 2   pre_invoice_discount_pct  1045 non-null   float64
dtypes: float64(1), int32(1), object(1)
memory usage: 20.5+ KB


In [229]:
# growth rates vs discounts  - prepare
growth_by_customer_code = growth_by_customer_code.merge(discount_data, on=['fiscal_year', 'customer_code'], how='left')
growth_by_customer_code.query('~(fiscal_year == 2022)')

,fiscal_year,region,channel,platform,customer_code,growth_rate_yoy,pre_invoice_discount_pct
0,2019,APAC,Direct,Brick & Mortar,70002017,0.823287,0.0777
1,2019,APAC,Direct,Brick & Mortar,70003181,1.204667,0.0546
2,2019,APAC,Direct,Brick & Mortar,70006157,2.809565,0.0972
3,2019,APAC,Direct,Brick & Mortar,70007198,0.578621,0.0534
4,2019,APAC,Direct,Brick & Mortar,70008169,1.052065,0.0748
...,...,...,...,...,...,...,...
473,2021,,Retailer,E-Commerce,90022081,0.268161,0.2902
474,2021,,Retailer,E-Commerce,90022082,0.513372,0.2128
475,2021,,Retailer,E-Commerce,90022083,0.575903,0.1927
476,2021,,Retailer,E-Commerce,90023023,0.432468,0.2652


In [230]:
# growth rates vs discounts - final
condition = "~(fiscal_year == 2022) & channel == 'Retailer' & platform == 'Brick & Mortar'"

px.scatter(
    growth_by_customer_code.query(condition),
    x='pre_invoice_discount_pct',
    y='growth_rate_yoy',
    color='region',
    title='Growth rate of gross revenue vs customer discounts',
    template='seaborn',
    category_orders={'region':region_order}
) \
.update_xaxes(title='discount', tickformat='.1%') \
.update_yaxes(title='growth rate y-o-y', tickformat='.1%')

#### Distributors

In [231]:
customer_metrics(customer_data, channel='Distributor', platform='Brick & Mortar')

In [232]:
# do distributors have exclusivity in their markets?
customer_data[customer_data['channel'] == 'Distributor'].loc[:, ['market', 'customer']].drop_duplicates()

,market,customer
20,China,Neptune
21,Philiphines,Synthetic
22,Philiphines,Novus
23,South Korea,Sage
104,South Korea,Leader


In [233]:
# Gross revenue by Distributor
fig = px.line(
    data_frame = customer_data[
        (customer_data['channel']=='Distributor') 
        & (customer_data['platform']=='Brick & Mortar')
    ],
    x='date',
    y='gross_revenue',
    color='customer',
    line_dash='market',
    title='Gross revenue by Distributor',
    template='seaborn'
)

fig.show()

In [234]:
# Margin by Distributor
fig = px.line(
    data_frame = customer_data[
        (customer_data['channel']=='Distributor') 
        & (customer_data['platform']=='Brick & Mortar')
    ],
    x='date',
    y='margin',
    color='customer',
    line_dash='market',
    title='Margin by Distributor',
    template='seaborn'
)

fig.show()

In [235]:
# discount by Distributor
fig = px.line(
    data_frame = customer_data[
        (customer_data['channel']=='Distributor') 
        & (customer_data['platform']=='Brick & Mortar')
    ].eval('discount_pct = discount / gross_revenue'),
    x='date',
    y='discount_pct',
    color='customer',
    line_dash='market',
    title='Discount by Distributor',
    template='seaborn'
)

fig.update_yaxes(tickformat='.1%')

fig.show()

In [236]:
# growth rates vs discounts - final
condition = "~(fiscal_year == 2022) & channel == 'Distributor'"

px.scatter(
    growth_by_customer_code.query(condition),
    x='pre_invoice_discount_pct',
    y='growth_rate_yoy',
    color='customer_code',
    title='Growth rate of gross revenue vs customer discounts',
    template='seaborn',
    category_orders={'region':region_order},
    # trendline='ols',
    # trendline_scope='overall'
) \
.update_xaxes(title='discount', tickformat='.1%') \
.update_yaxes(title='growth rate y-o-y', tickformat='.1%')

#### Marketplaces

In [237]:
customer_metrics(customer_data, channel='Retailer', platform='E-Commerce')

In [238]:
# growth rates vs discounts - final
condition = "~(fiscal_year == 2022) & channel == 'Retailer' & platform == 'E-Commerce'"

px.scatter(
    growth_by_customer_code.query(condition),
    x='pre_invoice_discount_pct',
    y='growth_rate_yoy',
    color='region',
    title='Growth rate of gross revenue vs customer discounts',
    template='seaborn',
    category_orders={'region':region_order},
    # trendline='ols',
    # trendline_scope='overall'
) \
.update_xaxes(title='discount', tickformat='.1%') \
.update_yaxes(title='growth rate y-o-y', tickformat='.1%')

#### Own webstore

In [239]:
customer_metrics(customer_data, channel='Direct', platform='E-Commerce')

In [240]:
# growth rates vs discounts - final
condition = "~(fiscal_year == 2022) & channel == 'Direct' & platform == 'E-Commerce'"

px.scatter(
    growth_by_customer_code.query(condition),
    x='pre_invoice_discount_pct',
    y='growth_rate_yoy',
    color='region',
    title='Growth rate of gross revenue vs customer discounts',
    template='seaborn',
    category_orders={'region':region_order},
    # trendline='ols',
    # trendline_scope='overall'
) \
.update_xaxes(title='discount', tickformat='.1%') \
.update_yaxes(title='growth rate y-o-y', tickformat='.1%')

#### Own local shops

In [241]:
customer_metrics(customer_data, channel='Direct', platform='Brick & Mortar')

In [242]:
# growth rates vs discounts - final
condition = "~(fiscal_year == 2022) & channel == 'Direct' & platform == 'Brick & Mortar'"

px.scatter(
    growth_by_customer_code.query(condition),
    x='pre_invoice_discount_pct',
    y='growth_rate_yoy',
    color='region',
    title='Growth rate of gross revenue vs customer discounts',
    template='seaborn',
    category_orders={'region':region_order},
    # trendline='ols',
    # trendline_scope='overall'
) \
.update_xaxes(title='discount', tickformat='.1%') \
.update_yaxes(title='growth rate y-o-y', tickformat='.1%')

### **Conclusion:**

1. **Local Retailers Channel:**
   - Customer-level metrics are meaningful mainly for the local retailers channel. In the other channels, a customer is essentially an entire channel or a major distributor that resells to end consumers.
   - The number of customers in this channel grew significantly, especially in APAC and EU.
   - These customers expanded into new markets, particularly in the EU, without notable price reductions or higher discounts.
   - Monthly revenue per customer increased substantially, especially in the EU, where it grew more than a hundredfold.
   - Expansion into LATAM from September 2019 appears unsuccessful despite higher discounts; even a single customer struggled to maintain a consistent presence across multiple markets.
   - There is no observable correlation between discounts and customer growth rates.
   - Together, these trends made this segment the largest, with the EU as its most dynamic region.

2. **Distributors:**
   - A niche channel in the APAC region.
   - Each distributor operates in exactly one country.
   - Some countries have more than one distributor (Synthetic and Novus in the Philippines, Sage and Leader in South Korea), which may lead to inefficient competition.
   - Higher discounts are not associated with higher gross revenue growth.

3. **Other Three Channels:**
   - Lack valuable customer-level insights.

4. **Seasonality:**
   - Seasonal patterns follow a 4-month cycle: customer activity peaks every four months.
   - Cycle dates differ across channels, and no hypothesis explains this phenomenon.
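
One way to put a number on the "no observable correlation" statements is a rank correlation between discount and growth rate. The sketch below only illustrates the calculation: it uses the five rows visible at the top of the merged table above (all from a single channel and year), so its result says nothing about the full dataset. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which needs only pandas.

```python
import pandas as pd

# Five (discount, growth) pairs copied from the first rows of the merged table above
sample = pd.DataFrame(
    {
        "pre_invoice_discount_pct": [0.0777, 0.0546, 0.0972, 0.0534, 0.0748],
        "growth_rate_yoy": [0.823287, 1.204667, 2.809565, 0.578621, 1.052065],
    }
)

# Spearman rank correlation = Pearson correlation of the ranked values
ranks = sample.rank()
rho = ranks["pre_invoice_discount_pct"].corr(ranks["growth_rate_yoy"])
print(round(rho, 2))
```

On the notebook's data, the same two lines could be applied to `growth_by_customer_code.query(condition)` for each channel to back the visual reading of the scatter plots with a coefficient.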
    

## Product perspective

In [243]:
# create query

q_product = """
SELECT
    date,
    product_code,
    -- division, # only one division in data
    -- segment, # only one segment in data
    category, 
    product,
    variant,
    CAST(fiscal_year as str) as fiscal_year,
    SUM(sold_quantity) as quantity,
    SUM(sold_quantity * gross_price) AS gross_revenue,
    SUM(sold_quantity * gross_price * pre_invoice_discount_pct) AS discount, 
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct) AS net_revenue,
    SUM(sold_quantity * manufacturing_cost) AS costs,
    SUM(sold_quantity * gross_price - sold_quantity * gross_price * pre_invoice_discount_pct - sold_quantity * manufacturing_cost) AS margin
FROM all_data
GROUP BY date, product_code, category, product, variant, fiscal_year
"""
product_data =  pd.read_sql_query(q_all_data + q_product, new_con, parse_dates=data_dates)
product_data['year'] = product_data['date'].dt.year
product_data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 514 entries, 0 to 513
Data columns (total 13 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   date           514 non-null    datetime64[ns]
 1   product_code   514 non-null    object        
 2   category       514 non-null    object        
 3   product        514 non-null    object        
 4   variant        514 non-null    object        
 5   fiscal_year    514 non-null    int64         
 6   quantity       514 non-null    float64       
 7   gross_revenue  514 non-null    float64       
 8   discount       514 non-null    float64       
 9   net_revenue    514 non-null    float64       
 10  costs          514 non-null    float64       
 11  margin         514 non-null    float64       
 12  year           514 non-null    int64         
dtypes: datetime64[ns](1), float64(6), int64(2), object(4)
memory usage: 52.3+ KB


In [244]:
product_data \
    .groupby(['category', 'product'])[['gross_revenue']] \
    .sum() \
    .style.format('{:,.0f}')
    # .eval("""
    #       margin_pct = margin / gross_revenue
    #       discount_pct = discount / gross_revenue
    #       """) [['margin_pct', 'discount_pct']] \
    # .reset_index() \
    # # .fillna(0) \

category,product,gross_revenue
Graphic Card,AQ Mforce Gen X,12099575
Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM 256 MB Cache,25433048
Internal HDD,AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm,26003960
Internal HDD,AQ Zion Saga,23019326


In [245]:
# revenue over time
fig = px.histogram(
    product_data,
    x="date",
    y="gross_revenue",
    color='product',
    pattern_shape="variant",
    #  pattern_shape_sequence=["|", "-"],
    title='Gross revenue by products over time',
    template='seaborn',
    barmode='relative',
    # category_orders={'region':region_order},
    nbins=product_data.date.nunique()
)
fig.update_layout(
    xaxis_title="month", yaxis_title="gross revenue"
)
fig.show()

In [246]:
# percents by products

products_margin_pct = product_data \
    .groupby(['date', 'product'])[['margin', 'net_revenue', 'gross_revenue', 'discount', 'quantity', 'costs']] \
    .sum() \
    .eval("""
          margin_pct = margin / gross_revenue
          discount_pct = discount / gross_revenue
          avg_price = net_revenue / quantity
          avg_cost = costs / quantity
          """) [['margin_pct', 'discount_pct', 'avg_price', 'avg_cost']] \
    .reset_index() \
    # .fillna(0) \
    
products_margin_pct

Unnamed: 0,date,product,margin_pct,discount_pct,avg_price,avg_cost
0,2017-09-01,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,0.458720,0.242261,14.001092,5.525112
1,2017-09-01,AQ Mforce Gen X,0.455767,0.248780,12.964767,5.099009
2,2017-10-01,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,0.474787,0.226769,14.286914,5.514309
3,2017-10-01,AQ Mforce Gen X,0.474422,0.230296,13.301092,5.102704
4,2017-11-01,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,0.464142,0.237363,14.084146,5.512512
...,...,...,...,...,...,...
167,2021-11-01,AQ Zion Saga,0.474403,0.229688,21.935818,8.426477
168,2021-12-01,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,0.451706,0.242572,18.030163,7.277551
169,2021-12-01,AQ Mforce Gen X,0.470987,0.242900,15.968230,6.034500
170,2021-12-01,AQ WereWolf NAS Internal Hard Drive HDD – 8.89 cm,0.464031,0.242974,19.161601,7.416208


In [247]:
# % discount
fig=px.line(
    products_margin_pct,
    x='date',
    y='discount_pct',
    color='product',
    # line_dash='variant',
    template='seaborn',
    title='Discount (%) over time by Product'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="discount",
    yaxis_tickformat = '.1%'
)

fig.show()

In [248]:
# % margin
fig=px.line(
    products_margin_pct,
    x='date',
    y='margin_pct',
    color='product',
    # line_dash='variant',
    template='seaborn',
    title='Margin (%) over time by Product'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="margin",
    yaxis_tickformat = '.1%'
)

fig.show()

In [249]:
# average net price
fig=px.line(
    products_margin_pct,
    x='date',
    y='avg_price',
    color='product',
    # line_dash='variant',
    template='seaborn',
    title='Average price over time by Product'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="price",
    yaxis_tickformat = '.1f'
)

fig.show()

In [250]:
# average cost
fig=px.line(
    products_margin_pct,
    x='date',
    y='avg_cost',
    color='product',
    # line_dash='variant',
    template='seaborn',
    title='Average cost over time by Product'
)

fig.update_layout(
    xaxis_title="month", yaxis_title="cost",
    yaxis_tickformat = '.1f'
)

fig.show()

In [251]:
# costs
# we take only those products that were sold

q_product_costs = """
SELECT
    fmc.manufacturing_cost,
    fgp.*,
    p.*    
FROM fact_manufacturing_cost fmc
LEFT JOIN fact_gross_price fgp
ON fmc.product_code = fgp.product_code AND fmc.fiscal_year = fgp.fiscal_year
LEFT JOIN dim_product p
ON fmc.product_code = p.product_code
WHERE fmc.product_code IN (
    SELECT DISTINCT product_code FROM fact_sales_monthly
)
"""
costs_data =  pd.read_sql_query(q_product_costs, new_con)
costs_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 52 entries, 0 to 51
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   manufacturing_cost  52 non-null     float64
 1   product_code        52 non-null     object 
 2   fiscal_year         52 non-null     object 
 3   gross_price         52 non-null     float64
 4   product_code        52 non-null     object 
 5   division            52 non-null     object 
 6   segment             52 non-null     object 
 7   category            52 non-null     object 
 8   product             52 non-null     object 
 9   variant             52 non-null     object 
dtypes: float64(2), object(8)
memory usage: 4.2+ KB


In [252]:
costs_data['margin_pct'] = 1 - costs_data['manufacturing_cost'] / costs_data['gross_price']
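As a quick check of the gross margin formula, the sketch below uses the FY2018 figures for product code A0118150101 from the margin table printed later in this notebook (manufacturing cost 4.6190, gross price 15.3952). It also shows that `1 - cost / price` and `(price - cost) / price` are the same quantity.

```python
import math

# FY2018 values for A0118150101, copied from the margin table in this notebook.
manufacturing_cost = 4.6190
gross_price = 15.3952

# Two algebraically equivalent forms of the gross margin percentage.
margin_a = 1 - manufacturing_cost / gross_price
margin_b = (gross_price - manufacturing_cost) / gross_price

print(round(margin_a, 6), math.isclose(margin_a, margin_b))
```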

In [253]:
# average cost
fig=px.line(
    costs_data.sort_values(by='fiscal_year'),
    x='fiscal_year',
    y='manufacturing_cost',
    color='product',
    line_dash='variant',
    template='seaborn',
    title='Average cost over time by Product'
)

fig.update_layout(
    xaxis_title="fiscal year", yaxis_title="cost",
    yaxis_tickformat = '.1f'
)

fig.show()

In [254]:
# gross price
fig=px.line(
    costs_data.sort_values(by='fiscal_year'),
    x='fiscal_year',
    y='gross_price',
    color='product',
    line_dash='variant',
    template='seaborn',
    title='Gross price over time by Product'
)

fig.update_layout(
    xaxis_title="fiscal year", yaxis_title="price",
    yaxis_tickformat = '.1f'
)

fig.show()

In [255]:
# gross margin %
fig=px.line(
    costs_data.sort_values(by='fiscal_year'),
    x='fiscal_year',
    y='margin_pct',
    color='product',
    line_dash='variant',
    template='seaborn',
    title='Gross margin (%) over time by Product'
)

fig.update_layout(
    xaxis_title="year", yaxis_title="margin",
    yaxis_tickformat = '.1%'
)

fig.show()

### Conclusion:

1. The database only contains a small snapshot of sales data for just 4 products out of the entire range. That's why the product analysis may not be very representative.
2. Some product variations are periodically removed from the assortment, and new ones are introduced.
3. The company keeps an eye on cost growth and maintains a gross margin within a narrow range of 69-72%.
4. All the fluctuations are influenced by the actual mix of sales to customers and their activity.

# Hypotheses testing



1. Seasonality: is the average monthly check per customer higher in the 4th quarter than in other months?
2. Discount: is the average discount for clients in Latin America higher than in other regions?
3. Margins: do higher-class products earn higher margins?

In [256]:
# significance level
alpha = 0.05

## 1. Seasonality: is average monthly check per customer in the market in the 4th quarter higher than in other months?

In [257]:
customer_seasonality_data = customer_data.drop(['quantity', 'gross_revenue', 'discount', 'costs', 'margin', 'year'], axis=1).copy()
customer_seasonality_data['month'] = customer_seasonality_data['date'].dt.month
customer_seasonality_data

Unnamed: 0,date,customer_code,customer,platform,channel,region,market,fiscal_year,net_revenue,month
0,2017-09-01,70002017,Atliq Exclusive,Brick & Mortar,Direct,APAC,India,2018,5753.845210,9
1,2017-09-01,70002018,Atliq e Store,E-Commerce,Direct,APAC,India,2018,5220.298538,9
2,2017-09-01,70003181,Atliq Exclusive,Brick & Mortar,Direct,APAC,Indonesia,2018,2310.211802,9
3,2017-09-01,70003182,Atliq e Store,E-Commerce,Direct,APAC,Indonesia,2018,1184.073584,9
4,2017-09-01,70006157,Atliq Exclusive,Brick & Mortar,Direct,APAC,Philiphines,2018,853.920566,9
...,...,...,...,...,...,...,...,...,...,...
6702,2021-12-01,90023028,walmart,Brick & Mortar,Retailer,,Canada,2022,35525.632083,12
6703,2021-12-01,90023030,Amazon,E-Commerce,Retailer,,Canada,2022,55733.629245,12
6704,2021-12-01,90024183,Electricalsbea Stores,Brick & Mortar,Retailer,LATAM,Chile,2022,5654.145486,12
6705,2021-12-01,90024184,Amazon,E-Commerce,Retailer,LATAM,Chile,2022,4396.600093,12


**Local retailers**

In [258]:
# calculate monthly ARPU across markets
# consider local retailers separately
# Exclude FY2022 because it is not complete

ARPU_seasonality = customer_seasonality_data[
    (customer_seasonality_data['platform'] == 'Brick & Mortar') &
    (customer_seasonality_data['channel'] == 'Retailer') &
    ~(customer_seasonality_data['fiscal_year'] == 2022)
] \
.groupby(['fiscal_year', 'market', 'month']) \
.agg(
    total_net_revenue = ('net_revenue', 'sum'),
    number_of_active_customers_codes = ('customer_code', 'nunique')
) \
.eval(
    '''
    ARPU = total_net_revenue / number_of_active_customers_codes
    '''
) \
.reset_index()

ARPU_seasonality

Unnamed: 0,fiscal_year,market,month,total_net_revenue,number_of_active_customers_codes,ARPU
0,2018,Australia,1,4387.093833,3,1462.364611
1,2018,Australia,2,3239.465265,3,1079.821755
2,2018,Australia,3,3855.685250,3,1285.228417
3,2018,Australia,4,3881.866892,3,1293.955631
4,2018,Australia,5,4268.475377,3,1422.825126
...,...,...,...,...,...,...
837,2021,United Kingdom,8,49818.894443,7,7116.984920
838,2021,United Kingdom,9,51187.500863,7,7312.500123
839,2021,United Kingdom,10,66917.807443,7,9559.686778
840,2021,United Kingdom,11,72867.911535,6,12144.651923


In [259]:
ARPU_seasonality['avg_ARPU'] = ARPU_seasonality.groupby(['fiscal_year', 'market'])['ARPU'].transform('mean')
ARPU_seasonality['index_ARPU'] = ARPU_seasonality['ARPU'] / ARPU_seasonality['avg_ARPU']

high_months = [10,11,12]

ARPU_seasonality['busy_season'] = ARPU_seasonality['month'].isin(high_months)

ARPU_seasonality.dropna(inplace=True)
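The normalization above divides each month's ARPU by the mean ARPU of its own market and fiscal year, so markets of very different size become comparable. A minimal sketch on a hypothetical one-market frame (the `toy` data is an assumption made for illustration):

```python
import pandas as pd

# Hypothetical market "A" with four months of ARPU in one fiscal year.
toy = pd.DataFrame({
    'fiscal_year': [2021] * 4,
    'market': ['A'] * 4,
    'month': [9, 10, 11, 12],
    'ARPU': [100.0, 100.0, 200.0, 400.0],
})

# Mean ARPU per (fiscal_year, market), broadcast back to every row.
toy['avg_ARPU'] = toy.groupby(['fiscal_year', 'market'])['ARPU'].transform('mean')

# An index above 1 means the month beat its market's yearly average.
toy['index_ARPU'] = toy['ARPU'] / toy['avg_ARPU']
toy['busy_season'] = toy['month'].isin([10, 11, 12])

print(toy[['month', 'index_ARPU', 'busy_season']])
```

Because the index is relative to each market's own average, a large market cannot dominate the comparison simply by having larger absolute revenue.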

In [260]:
# histogram
fig = px.histogram(
    ARPU_seasonality,
    x="index_ARPU",
    color='busy_season',
    barmode='overlay',
    histnorm='probability',
    # cumulative=True,
    # nbins=25,
    title='Distribution of ARPU in relation to average ARPU in the 4th quarter vs others',
    template='seaborn'
    
)

fig.update_layout(
    xaxis_title="ARPU to Average ARPU",
    yaxis_tickformat = '.1%'
)

fig.show()

In [261]:
fig = px.box(
    ARPU_seasonality,
    color='busy_season',
    y="index_ARPU",
    # color='busy_season',
    # barmode='overlay',
    # histnorm='probability',
    # cumulative=True,
    # nbins=25,
    title='Distribution of ARPU in relation to average ARPU in the 4th quarter vs others',
    template='seaborn'
    
)

fig.update_layout(
    # xaxis_title="ARPU to Average ARPU",
    yaxis_tickformat = '.1%'
)

fig.show()

In [262]:
# check distribution - Busy season

statistic, p_value = stats.shapiro(ARPU_seasonality[(ARPU_seasonality['busy_season'])]['index_ARPU'])

# Print the results
print("Shapiro-Wilk Test: Busy season")
print("Test Statistic:", statistic)
print("P-value:", p_value)

# Check the significance level

if p_value > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")

Shapiro-Wilk Test: Busy season
Test Statistic: 0.9529159069061279
P-value: 1.8621122990225558e-06
Sample does not look Gaussian (reject H0)


In [263]:
# check distribution - low season

statistic, p_value = stats.shapiro(ARPU_seasonality[~(ARPU_seasonality['busy_season'])]['index_ARPU'])

# Print the results
print("Shapiro-Wilk Test: Low season")
print("Test Statistic:", statistic)
print("P-value:", p_value)

# Check the significance level

if p_value > alpha:
    print("Sample looks Gaussian (fail to reject H0)")
else:
    print("Sample does not look Gaussian (reject H0)")

Shapiro-Wilk Test: Low season
Test Statistic: 0.8619232177734375
P-value: 4.600883554940094e-23
Sample does not look Gaussian (reject H0)


Since neither subset looks normally distributed, we apply the non-parametric Mann-Whitney U test to compare the two groups.

In [264]:
# compare the two distributions (two-sided Mann-Whitney U test)

statistic, p_value = stats.mannwhitneyu(
    ARPU_seasonality[~(ARPU_seasonality['busy_season'])]['index_ARPU'],
    ARPU_seasonality[(ARPU_seasonality['busy_season'])]['index_ARPU']
)

# Print the results
print("Mann-Whitney U Test:")
print("Test Statistic:", statistic)
print("P-value:", p_value)

# Check the significance level
if p_value > alpha:
    print("No significant difference between the groups (fail to reject H0)")
    
else:
    print("Significant difference between the groups (reject H0)")
    print(
        'ARPU in the 4th quarter is ',
        (ARPU_seasonality[(ARPU_seasonality['busy_season'])]['index_ARPU'].mean() / ARPU_seasonality[~(ARPU_seasonality['busy_season'])]['index_ARPU'].mean()).round(2),
        ' times higher than in other months on average'
    )

Mann-Whitney U Test:
Test Statistic: 2854.0
P-value: 5.434914528863221e-97
Significant difference between the groups (reject H0)
ARPU in the 4th quarter is  1.85  times higher than in other months on average
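The hypothesis is directional (busy season higher), while the test above uses the default two-sided alternative. A one-sided variant could be sketched as follows, assuming a SciPy version whose `mannwhitneyu` accepts the `alternative` argument; the two samples are hypothetical index values, not the notebook's data.

```python
from scipy import stats

# Hypothetical ARPU index values for busy and other months.
busy = [1.8, 2.0, 1.9, 2.2, 1.7, 2.1]
other = [0.8, 0.9, 1.0, 0.7, 1.1, 0.85]

# H1: values in `busy` tend to be larger than values in `other`.
statistic, p_value = stats.mannwhitneyu(busy, other, alternative='greater')

print("P-value:", p_value)
if p_value > 0.05:
    print("No evidence that the busy season is higher (fail to reject H0)")
else:
    print("Busy season is significantly higher (reject H0)")
```

With the first sample passed as the group expected to be larger, a small p-value supports the direction of the hypothesis rather than just any difference.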


**Conclusion:**

1. The 4th quarter stands out as the high season for local brick-and-mortar retailers: ARPU in each of its months is, on average, 1.85 times higher than in the rest of the year.
   
2. Given this observation, it's advisable for the company to consider implementing a tailored pricing model during these peak periods, as the current pricing structure appears uniform throughout the year.

## 2. Discount in Latin America: is the average discount for clients in Latin America higher than in other regions?

Since the sales data covers only 4 customers, let's test the hypothesis on the entire customer base.

In [265]:
# create dataframe
# exclude discounts in own retail shops

q_discounts = '''
SELECT
    fpd.*,
    c.region,
    c.channel,
    c.platform
FROM fact_pre_discount fpd
LEFT JOIN dim_customer c on fpd.customer_code = c.customer_code
WHERE 
    NOT ((c.platform = 'Brick & Mortar') AND (c.channel = 'Direct'))
'''

discounts_data =  pd.read_sql_query(q_discounts, new_con, parse_dates=data_dates)

discounts_data['is_LATAM'] = discounts_data['region']=='LATAM'

discounts_data.info()
display(discounts_data.head())


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 965 entries, 0 to 964
Data columns (total 7 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_code             965 non-null    object 
 1   fiscal_year               965 non-null    object 
 2   pre_invoice_discount_pct  965 non-null    float64
 3   region                    965 non-null    object 
 4   channel                   965 non-null    object 
 5   platform                  965 non-null    object 
 6   is_LATAM                  965 non-null    bool   
dtypes: bool(1), float64(1), object(5)
memory usage: 46.3+ KB


Unnamed: 0,customer_code,fiscal_year,pre_invoice_discount_pct,region,channel,platform,is_LATAM
0,70002018,2018,0.2956,APAC,Direct,E-Commerce,False
1,70002018,2019,0.2577,APAC,Direct,E-Commerce,False
2,70002018,2020,0.2255,APAC,Direct,E-Commerce,False
3,70002018,2021,0.2061,APAC,Direct,E-Commerce,False
4,70002018,2022,0.2931,APAC,Direct,E-Commerce,False


In [266]:
# histogram
fig = px.histogram(
    discounts_data,
    x="pre_invoice_discount_pct",
    color='is_LATAM',
    barmode='overlay',
    histnorm='probability',
    # cumulative=True,
    nbins=20,
    title='Distribution of discounts',
    template='seaborn'
    
)

fig.update_layout(
    xaxis_title="discount",
    yaxis_tickformat = '.1%',
    xaxis_tickformat = '.1%',
)

fig.show()

In [267]:
# compare the two distributions (two-sided Mann-Whitney U test)

statistic, p_value = stats.mannwhitneyu(
    discounts_data[~(discounts_data['is_LATAM'])]['pre_invoice_discount_pct'],
    discounts_data[(discounts_data['is_LATAM'])]['pre_invoice_discount_pct']
)

# Print the results
print("Mann-Whitney U Test:")
print("Test Statistic:", statistic)
print("P-value:", p_value)

# Check the significance level
if p_value > alpha:
    print("No significant difference between the groups (fail to reject H0)")
    
else:
    print("Significant difference between the groups (reject H0)")


Mann-Whitney U Test:
Test Statistic: 15766.5
P-value: 0.7536521122230235
No significant difference between the groups (fail to reject H0)
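A p-value alone says little about how far apart two groups are. One common companion measure is the probability of superiority, U / (n1 * n2): the chance that a random value from the first group exceeds a random value from the second, with ties counted as half. The helper `prob_superiority` below is a hypothetical illustration written with NumPy only, not part of the original analysis.

```python
import numpy as np

def prob_superiority(x, y):
    """Return U / (n1 * n2) for samples x and y, counting ties as 0.5."""
    x = np.asarray(x, dtype=float)[:, None]   # column vector, shape (n1, 1)
    y = np.asarray(y, dtype=float)[None, :]   # row vector, shape (1, n2)
    # Pairwise comparison of every x with every y gives the U statistic for x.
    u = (x > y).sum() + 0.5 * (x == y).sum()
    return u / (x.size * y.size)

# Hypothetical discount samples for two groups.
effect = prob_superiority([0.20, 0.25, 0.30], [0.25, 0.30, 0.35])
print(effect)
```

A value near 0.5 means neither group tends to have larger values, which is what a non-significant test result would lead one to expect.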


**Conclusion:**

1. The Mann-Whitney U test (p ≈ 0.75) finds no significant difference between discounts in Latin America and in other regions, so the available data does not confirm higher discounts there.
2. It's plausible that insufficient discounts contributed to the lack of success in penetrating the market. However, further investigation and data analysis are required to establish this link.

## 3. Margins of premium products

In [268]:
# margins

q_product_margins = """
SELECT
    fmc.manufacturing_cost,
    fgp.*,
    p.*    
FROM fact_manufacturing_cost fmc
LEFT JOIN fact_gross_price fgp
ON fmc.product_code = fgp.product_code AND fmc.fiscal_year = fgp.fiscal_year
LEFT JOIN dim_product p
ON fmc.product_code = p.product_code
"""
margin_data =  pd.read_sql_query(q_product_margins, new_con)
margin_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1182 entries, 0 to 1181
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   manufacturing_cost  1182 non-null   float64
 1   product_code        1182 non-null   object 
 2   fiscal_year         1182 non-null   object 
 3   gross_price         1182 non-null   float64
 4   product_code        1182 non-null   object 
 5   division            1182 non-null   object 
 6   segment             1182 non-null   object 
 7   category            1182 non-null   object 
 8   product             1182 non-null   object 
 9   variant             1182 non-null   object 
dtypes: float64(2), object(8)
memory usage: 92.5+ KB


In [269]:
margin_data['variant'].value_counts()

Standard               102
Plus 2                 102
Plus 1                  90
Premium                 89
Standard 1              67
Standard 2              66
Premium 1               63
Premium 2               59
Plus                    54
Premium Black           49
Premium Misty Green     44
Standard Blue           41
Standard Grey           41
Standard Red            36
Plus Blue               34
Plus Red                32
Plus Grey               31
Plus 3                  29
Standard 3              20
Plus Cool Blue          20
Standard Firey Red      19
Standard Black          19
Plus Firey Red          19
Plus Black              18
Standard Cool Blue      17
Plus 1                  16
Premium Plus             5
Name: variant, dtype: int64

Let's map the variants into 3 groups (standard, plus, premium) and check whether margins differ across these groups. We expect premium products to earn higher margins.

In [270]:
# mapping
variants = {}

for variant in margin_data['variant'].unique():
    if 'premium' in variant.lower():
        variants[variant] = 'Premium'
        continue
    if 'plus' in variant.lower():
        variants[variant] = 'Plus'
        continue
    if 'standard' in variant.lower():
        variants[variant] = 'Standard'
        continue
    variants[variant] = 'other'

variants

{'Standard': 'Standard',
 'Plus': 'Plus',
 'Premium': 'Premium',
 'Premium Plus': 'Premium',
 'Standard 1': 'Standard',
 'Standard 2': 'Standard',
 'Standard 3': 'Standard',
 'Plus 1': 'Plus',
 'Plus 2': 'Plus',
 'Plus 3': 'Plus',
 'Premium 1': 'Premium',
 'Premium 2': 'Premium',
 'Plus 1 ': 'Plus',
 'Standard Grey': 'Standard',
 'Standard Blue': 'Standard',
 'Standard Red': 'Standard',
 'Plus Grey': 'Plus',
 'Plus Blue': 'Plus',
 'Plus Red': 'Plus',
 'Premium Black': 'Premium',
 'Premium Misty Green': 'Premium',
 'Standard Firey Red': 'Standard',
 'Standard Cool Blue': 'Standard',
 'Standard Black': 'Standard',
 'Plus Firey Red': 'Plus',
 'Plus Cool Blue': 'Plus',
 'Plus Black': 'Plus'}
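The same grouping could also be expressed without an explicit loop. The sketch below is an alternative, not the method used above: it matches the tier word only at the start of the variant name, which holds for every variant listed in this notebook but is an assumption for other data. The sample `variant` series is hypothetical.

```python
import pandas as pd

# Hypothetical variant names, including a trailing space and an unknown tier.
variant = pd.Series(['Standard Grey', 'Plus 1 ', 'Premium Plus', 'premium black', 'Custom'])

variant_group = (
    variant.str.strip()
    # Capture the leading tier word, ignoring case; non-matches become NaN.
    .str.extract(r'(?i)^(premium|plus|standard)', expand=False)
    .str.title()
    .fillna('other')
)

print(variant_group.tolist())
```

As in the loop above, 'Premium Plus' falls into the Premium group, here because the name starts with "Premium".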

In [271]:
margin_data['variant_group'] = margin_data['variant'].map(variants)
margin_data['margin_pct'] = (margin_data['gross_price'] - margin_data['manufacturing_cost']) / margin_data['gross_price']
margin_data

Unnamed: 0,manufacturing_cost,product_code,fiscal_year,gross_price,product_code.1,division,segment,category,product,variant,variant_group,margin_pct
0,4.6190,A0118150101,2018,15.3952,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,Standard,0.699971
1,4.2033,A0118150101,2019,14.4392,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,Standard,0.708897
2,5.0207,A0118150101,2020,16.2323,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,Standard,0.690697
3,5.5172,A0118150101,2021,19.0573,A0118150101,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Standard,Standard,0.710494
4,5.6036,A0118150102,2018,19.5875,A0118150102,P & A,Peripherals,Internal HDD,AQ Dracula HDD – 3.5 Inch SATA 6 Gb/s 5400 RPM...,Plus,Plus,0.713920
...,...,...,...,...,...,...,...,...,...,...,...,...
1177,13.4069,A7321160301,2022,44.6260,A7321160301,N & S,Networking,Wi fi extender,AQ Wi Power Dx3,Standard,Standard,0.699572
1178,12.5670,A7321160302,2021,43.9446,A7321160302,N & S,Networking,Wi fi extender,AQ Wi Power Dx3,Plus,Plus,0.714026
1179,13.1954,A7321160302,2022,46.0399,A7321160302,N & S,Networking,Wi fi extender,AQ Wi Power Dx3,Plus,Plus,0.713392
1180,12.9502,A7321160303,2021,42.8483,A7321160303,N & S,Networking,Wi fi extender,AQ Wi Power Dx3,Premium,Premium,0.697766


In [272]:
# histogram
fig = px.histogram(
    margin_data,
    x="margin_pct",
    color='variant_group',
    barmode='overlay',
    histnorm='probability',
    # cumulative=True,
    # nbins=20,
    title='Distribution of margins',
    template='seaborn'
    
)

fig.update_layout(
    xaxis_title="margin",
    yaxis_tickformat = '.1%',
    xaxis_tickformat = '.1%',
)

fig.show()

In [273]:
# compare margin distributions across the three variant groups
# Perform Kruskal-Wallis test

statistic, p_value = stats.kruskal(
    margin_data[margin_data['variant_group']=='Premium']['margin_pct'],
    margin_data[margin_data['variant_group']=='Plus']['margin_pct'],
    margin_data[margin_data['variant_group']=='Standard']['margin_pct'],
)

# Print the results
print("Kruskal-Wallis Test:")
print("Test Statistic:", statistic)
print("P-value:", p_value)

# Check the significance level

if p_value > alpha:
    print("No significant difference between the groups (fail to reject H0)")
else:
    print("Significant difference between the groups (reject H0)")

Kruskal-Wallis Test:
Test Statistic: 0.3737531500373734
P-value: 0.8295461167810869
No significant difference between the groups (fail to reject H0)
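To see what a non-significant Kruskal-Wallis result looks like in the numbers, a per-group summary can be placed next to the test. This is a minimal sketch on a hypothetical frame with the same column names as `margin_data`; the values are made up.

```python
import pandas as pd

# Hypothetical margins for the three variant groups.
toy = pd.DataFrame({
    'variant_group': ['Standard', 'Standard', 'Plus', 'Plus', 'Premium', 'Premium'],
    'margin_pct': [0.70, 0.71, 0.71, 0.70, 0.69, 0.72],
})

# Count, median, and range of the margin per group.
summary = toy.groupby('variant_group')['margin_pct'].agg(['count', 'median', 'min', 'max'])
print(summary)
```

Applied to the real `margin_data`, near-identical medians across the groups would be consistent with the test result above.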


**Conclusion:**

1. The Kruskal-Wallis test (p ≈ 0.83) finds no significant difference in margins between Standard, Plus, and Premium variants, so higher-class products do not yield higher margins in this data.
   
2. Therefore, it's advisable for the company to review and adjust its pricing model accordingly.