# e-Commerce EDA

Useful links:
- https://userpilot.com/blog/customer-growth/
- https://chartio.com/learn/product-analytics/top-product-metrics/


In [None]:
# Imports
import pandas as pd
import sqlite3
import plotly.express as px

## Load data set

In [None]:
# Connect SQLite database.
db_conn = sqlite3.connect("SuperstoreDB/superstore.db")

An initial EDA and data cleaning have already been performed in superstore_db_preperartion.ipynb, so there is no need to check again for missing or duplicated data.

## Orders

In [None]:
# Orders per year.
orders_yearly = pd.read_sql(
    """
    SELECT Year, COUNT (DISTINCT OrderID) AS OrderCount
    FROM (
        SELECT
            *,
            SUBSTR(OrderDate, 1, 4) AS Year
        FROM Orders
    )
    GROUP BY Year

    """, db_conn)

# Calculate year-on-year percentage growth
orders_yearly['YearOnYearGrowth'] = orders_yearly['OrderCount'].pct_change() * 100

In [None]:
fig = px.bar(orders_yearly,
             x='Year',
             y='OrderCount',
             title='Orders Yearly')
fig.show()

**Observation**: The number of orders per year grew steadily between 2014 and 2017. Compared with the 'Customer Count Per Year' plot in the Customers section below, the order count per year is growing faster.

In [None]:
# Year-on-year growth of order numbers in percentage.
fig = px.bar(orders_yearly,
             x='Year',
             y='YearOnYearGrowth',
             title='Year-on-Year Order Growth (%)')
fig.show()

**Observation**: The year-on-year growth of the order count shows that each year the company grew by more than in the year before, so growth is accelerating and could be described as exponential. This is very good news, especially compared with the relatively stagnant customer count per year. Even though our customer base is not growing fast, its demand still is.
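As a minimal sketch of how the year-on-year figures are derived, the snippet below applies `pct_change()` to a toy table and checks whether growth is accelerating. The counts in `toy_orders` are hypothetical and are not the notebook's data.

```python
import pandas as pd

# Hypothetical order counts, for illustration only (not the notebook's data).
toy_orders = pd.DataFrame({
    "Year": ["2014", "2015", "2016", "2017"],
    "OrderCount": [100, 110, 132, 165],
})

# pct_change() compares each row with the previous one; the first year has
# no predecessor, so its growth is NaN.
toy_orders["YearOnYearGrowth"] = toy_orders["OrderCount"].pct_change() * 100

# Growth is accelerating when every growth rate exceeds the previous one.
growth = toy_orders["YearOnYearGrowth"].dropna()
is_accelerating = bool(growth.diff().dropna().gt(0).all())

print(toy_orders)
print("Accelerating:", is_accelerating)
```

With only three growth rates available for four years of data, such a check can suggest acceleration but cannot by itself establish an exponential trend.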

In [None]:
# Orders per month.
orders_monthly_over_years = pd.read_sql(
    """
    SELECT
    SUBSTR(OrderDate, 1, 4) AS Year,
    SUBSTR(OrderDate, 6, 2) AS Month,
    COUNT(OrderID) AS OrderCount
    FROM Orders
    GROUP BY Year, Month
    """, db_conn)

fig = px.line(orders_monthly_over_years,
              x='Month',
              y='OrderCount',
              color='Year',
              title='Orders Monthly')
fig.show()


**Observation**: The monthly order counts show a recurring pattern of more orders towards the end of the year, with peaks in September, November and December and a small dip in October between them. Demand is lower in January and February.

In [None]:
# Orders per day of week.
orders_per_day_of_week = pd.read_sql(
    """
    SELECT 
    STRFTIME('%w', OrderDate) AS DayOfWeek,
    Segment,
    COUNT(*) AS OrderCount
    FROM Orders
    JOIN Customers
    ON Customers.CustomerID = Orders.CustomerID
    GROUP BY DayOfWeek, Segment;
    """, db_conn)

# STRFTIME('%w', ...) returns the day of week as a number ('0' for Sunday, '1' for Monday, and so on, up to '6' for Saturday),
# so we convert the numbers to names for readability.
# Define a dictionary to map numerical day of the week to day names.
day_names = {
    '0': 'Sunday',
    '1': 'Monday',
    '2': 'Tuesday',
    '3': 'Wednesday',
    '4': 'Thursday',
    '5': 'Friday',
    '6': 'Saturday'
}

# Replace numerical day of the week with day names.
orders_per_day_of_week['DayOfWeek'] = orders_per_day_of_week['DayOfWeek'].map(day_names)

fig = px.bar(orders_per_day_of_week,
             x='DayOfWeek',
             y='OrderCount',
             color='Segment',
             barmode='group',
             title='Orders per Day of Week')
fig.show()

**Observation**: The fewest orders are placed on Wednesday, and the most at the start and end of the workweek (Monday and Friday). Somewhat surprisingly, a lot of orders are placed on the weekend, even by corporate customers.
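Because `%w` starts the week on Sunday, the bars appear Sunday-first. A small sketch, assuming a Monday-first order is preferred for reading the workweek pattern, uses an ordered categorical to control the sort order. The `toy_days` values are hypothetical.

```python
import pandas as pd

# Monday-first ordering; the notebook's SQL output is ordered Sunday-first.
weekday_order = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical counts, for illustration only.
toy_days = pd.DataFrame({
    "DayOfWeek": ["Sunday", "Monday", "Friday", "Wednesday"],
    "OrderCount": [50, 90, 85, 40],
})

# An ordered categorical makes sort_values follow the weekday order
# instead of alphabetical order.
toy_days["DayOfWeek"] = pd.Categorical(
    toy_days["DayOfWeek"], categories=weekday_order, ordered=True)
toy_days = toy_days.sort_values("DayOfWeek").reset_index(drop=True)

print(toy_days)
```

Plotly Express also accepts a `category_orders` argument, e.g. `category_orders={'DayOfWeek': weekday_order}`, which fixes the axis order without re-sorting the DataFrame.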

In [None]:
# Orders per day of month.
orders_per_day_of_month = pd.read_sql(
    """
    SELECT Day, COUNT (DISTINCT OrderID) AS OrderCount
    FROM (
        SELECT
            *,
            SUBSTR(OrderDate, 9, 2) AS Day
        FROM Orders
    )
    GROUP BY Day

    """, db_conn)

fig = px.bar(orders_per_day_of_month,
             x='Day',
             y='OrderCount',
             title='Orders per Day of Month')
fig.show()

**Observation**: There is no visible trend in the number of orders per day of month. The drop at the end of the month can be explained by months having different numbers of days.
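The end-of-month effect can be quantified with the standard library. The sketch below, which assumes the data covers 2014 to 2017 as stated in the Orders observations, counts how many months contain each day number.

```python
import calendar

# Count, for each day of the month, how many months in 2014-2017 contain it.
months_containing_day = {day: 0 for day in range(1, 32)}
for year in range(2014, 2018):
    for month in range(1, 13):
        # monthrange returns (weekday of the 1st, number of days in the month).
        days_in_month = calendar.monthrange(year, month)[1]
        for day in range(1, days_in_month + 1):
            months_containing_day[day] += 1

for day in (28, 29, 30, 31):
    print(day, months_containing_day[day])
```

Dividing each day's order count by the number of months that contain that day would give a per-occurrence average that is comparable across all days.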

In [None]:
# Which states made the highest number of orders?
orders_per_state = pd.read_sql(
    """
    SELECT
    Region,
    State,
    COUNT(*) AS "Orders"
    FROM Orders
    JOIN Addresses
    ON Orders.AddressID = Addresses.AddressID
    GROUP BY Region, State
    """, db_conn)

fig = px.sunburst(orders_per_state,
                  path=['Region', 'State'],
                  values='Orders',
                  title='Number of Orders by State')
fig.show()

In [None]:
# Which states made the highest number of orders? (in percentage)
orders_per_state = pd.read_sql(
    """
    SELECT
    Region,
    State,
    COUNT(*) AS "Orders"
    FROM Orders
    JOIN Addresses
    ON Orders.AddressID = Addresses.AddressID
    GROUP BY Region, State
    """, db_conn)

orders_per_state['Orders (%)'] = round(orders_per_state['Orders'] / orders_per_state['Orders'].sum(), 2) * 100

fig = px.sunburst(orders_per_state,
                  path=['Region', 'State'],
                  values='Orders (%)',
                  title='Number of Orders by State (%)')
fig.show()

**Observation**: The largest shares of orders come from California, New York and Texas; together these 3 states are responsible for over 40% of the orders.

## Sales

In [None]:
# Monthly Sales
sales_monthly_over_years = pd.read_sql(
    """
    SELECT
    SUBSTR(OrderDate, 1, 4) AS Year,
    SUBSTR(OrderDate, 6, 2) AS Month,
    SUM(Sales) / 100.0 AS Sales
    FROM Orders
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Year, Month
    """, db_conn)

fig = px.line(sales_monthly_over_years, x='Month', y='Sales', color='Year', title='Sales Monthly')
fig.show()

**Observation**: The sales numbers follow the number of orders. There is a recurring pattern of more sales towards the end of the year, with peaks in September, November and December and a small dip in October between them. Demand is lower in January and February.

In [None]:
# Which states made the highest sales?
sales_per_state = pd.read_sql(
    """
    SELECT
    Region,
    State,
    SUM(Sales)/100.0 AS "Sales ($)"
    FROM Orders
    JOIN Addresses
    ON Orders.AddressID = Addresses.AddressID
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Region, State
    """, db_conn)

fig = px.sunburst(sales_per_state,
                  path=['Region', 'State'],
                  values='Sales ($)',
                  title='Sales by State')
fig.show()

In [None]:
# Which states made the highest sales? (in percentage)
sales_per_state = pd.read_sql(
    """
    SELECT
    Region,
    State,
    SUM(Sales)/100.0 AS "Sales ($)"
    FROM Orders
    JOIN Addresses
    ON Orders.AddressID = Addresses.AddressID
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Region, State
    """, db_conn)

sales_per_state['Sales (%)'] = round(sales_per_state['Sales ($)'] / sales_per_state['Sales ($)'].sum(), 2) * 100

fig = px.sunburst(sales_per_state,
                  path=['Region', 'State'],
                  values='Sales (%)',
                  title='Sales by State (%)')
fig.show()

**Observation**: Here, too, the top three states are California (20%), New York (14%) and Texas (7%). Interestingly, New York generates more sales revenue per order than average (14% of sales from 11% of orders), while Texas generates less (7% of sales from 10% of orders). California is in the middle, with 20% of sales and 20% of orders.
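The comparison of sales share and order share can be expressed as a single ratio: a state's share of sales divided by its share of orders equals its average order value relative to the company-wide average. A minimal sketch using the rounded percentages quoted in the observation (so the results are approximate):

```python
# Shares quoted in the observation (percent of all sales / all orders).
state_shares = {
    "California": {"sales_pct": 20, "orders_pct": 20},
    "New York": {"sales_pct": 14, "orders_pct": 11},
    "Texas": {"sales_pct": 7, "orders_pct": 10},
}

# A ratio above 1 means the state's average order is worth more than the
# company-wide average order; below 1 means it is worth less.
relative_order_value = {
    state: shares["sales_pct"] / shares["orders_pct"]
    for state, shares in state_shares.items()
}

for state, ratio in relative_order_value.items():
    print(f"{state}: {ratio:.2f}")
```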

## Profit

In [None]:
# Monthly Profit
profit_monthly_over_years = pd.read_sql(
    """
    SELECT
    SUBSTR(OrderDate, 1, 4) AS Year,
    SUBSTR(OrderDate, 6, 2) AS Month,
    SUM(Profit) / 100.0 AS Profit
    FROM Orders
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Year, Month
    """, db_conn)

fig = px.line(profit_monthly_over_years,
              x='Month',
              y='Profit',
              color='Year',
              title='Profit Monthly')
fig.show()

**Observation**: There seems to be no real monthly pattern in the profit data. There have been two months in which the company made a loss: July 2014 and January 2015.

In [None]:
# Which states made the highest profit?
profit_per_state = pd.read_sql(
    """
    SELECT
    Region,
    State,
    SUM(Profit)/100.0 AS "Profit ($)"
    FROM Orders
    JOIN Addresses
    ON Orders.AddressID = Addresses.AddressID
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Region, State
    ORDER BY "Profit ($)"
    """, db_conn)

fig = px.bar(profit_per_state,
             x='State',
             y='Profit ($)',
             color='Region',
             title='Profit by State')
fig.show()

**Observation**: There are some states that have generated a loss. Interestingly, one of them is Texas, which also accounts for a large share of orders and sales. At the other end, the most profitable states are New York and California. Every region has both profitable and loss-making states.

In [None]:
# Which states made the highest profit?
# Did that change over the years?
profit_per_state_yearly = pd.read_sql(
    """
    SELECT
    State,
    SUBSTR(OrderDate, 1, 4) AS Year,
    SUM(Profit)/100.0 AS "Profit ($)"
    FROM Orders
    JOIN Addresses
    ON Orders.AddressID = Addresses.AddressID
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Year, State
    ORDER BY Year, "Profit ($)"
    """, db_conn)

fig = px.bar(profit_per_state_yearly,
             x='State',
             y='Profit ($)',
             color='Year',
             barmode='group',
             title='Profit by State over Years')
fig.show()

**Observation**: There are some states where the company has been making a loss from the beginning: Texas, Pennsylvania, Ohio, Illinois and Oregon. There are also some states with only one year of positive profit: North Carolina, Florida, Arizona and Colorado. About half of the states have very low profit numbers, probably due to low activity.

In [None]:
# Make a map showing the states generating the most and least profit.
profit_per_state = pd.read_sql(
    """
    SELECT
    State,
    SUM(Profit)/100.0 AS "Profit ($)"
    FROM Orders
    JOIN Addresses
    ON Orders.AddressID = Addresses.AddressID
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY State
    ORDER BY "Profit ($)"
    """, db_conn)

# Load the US codes required by the px.choropleth map.
us_codes = pd.read_json("Helpers/states_titlecase.json")

# Merge them into our data.
profit_per_state = profit_per_state.merge(us_codes, left_on='State', right_on='name', how='left')

# Make a map showing the states colored by profit.
fig = px.choropleth(
    profit_per_state,
    scope="usa",
    locations='abbreviation',  # Use the state abbreviation column as location
    locationmode='USA-states',  # Set location mode to USA states
    color='Profit ($)',  # Color based on profit
    color_continuous_scale='RdYlGn',  # Choose color scale
    labels={'Profit ($)': 'Profit ($)'},  # Label for color bar
    title='Profit by State'  # Title of the plot
)

# Show the plot
fig.show()

In [None]:
# Make a map showing the 5 states generating the most and least profit.
# Sort the profit_per_state DataFrame by "Profit ($)" column
profit_per_state_sorted = profit_per_state.sort_values(by="Profit ($)", ascending=False)

# Get the top 5 and bottom 5 states based on profit
top_5_states = profit_per_state_sorted.head(5)
bottom_5_states = profit_per_state_sorted.tail(5)

# Create a new column indicating whether each state is in the top or bottom group
profit_per_state_sorted['Group'] = ''  # Initialize the 'Group' column
profit_per_state_sorted.loc[top_5_states.index, 'Group'] = 'Top 5'
profit_per_state_sorted.loc[bottom_5_states.index, 'Group'] = 'Bottom 5'

# Filter the DataFrame to include only the top 5 and bottom 5 states
selected_states = profit_per_state_sorted[(profit_per_state_sorted['Group'] == 'Top 5') | (profit_per_state_sorted['Group'] == 'Bottom 5')]

# Create the choropleth map using Plotly Express
fig = px.choropleth(
    selected_states,
    scope="usa",
    locations='abbreviation',  # Use the state abbreviation column as location
    locationmode='USA-states',  # Set location mode to USA states
    color='Group',  # Color based on the group column
    color_discrete_map={'Top 5': 'green', 'Bottom 5': 'red'},  # Assign colors to groups
    labels={'Group': 'Group'},  # Label for color legend
    title='Top and Bottom 5 States by Profit',  # Title of the plot
    hover_data={'abbreviation': False, 'Profit ($)': ':,.2f', 'State': True}  # Customize hover information
)

# Show the plot
fig.show()

## Customers

In [None]:
# Unique customers per year.
customers_per_year = pd.read_sql(
    """
    SELECT Year, COUNT (DISTINCT CustomerID) AS CustomerCount
    FROM (
        SELECT
            *,
            SUBSTR(OrderDate, 1, 4) AS Year
        FROM Orders
    )
    GROUP BY Year

    """, db_conn)

fig = px.bar(customers_per_year,
             x='Year',
             y='CustomerCount',
             title='Customer Count Per Year')
fig.show()

**Observation**: The number of active customers is slightly increasing over the years, with a small dip in 2015. We define a customer as active if they placed at least one order in a given year.

In [None]:
# Which customers made the highest number of orders?
orders_per_customer_segment = pd.read_sql(
    """
    SELECT
    Segment,
    COUNT(*) AS "Orders"
    FROM Orders
    JOIN Customers
    ON Orders.CustomerID = Customers.CustomerID
    GROUP BY Segment
    """, db_conn)

fig = px.pie(orders_per_customer_segment,
             values='Orders',
             names='Segment',
             title='Orders by Customer Segment')
fig.show()

**Observation**: Most of the orders come from customers in the consumer segment.

In [None]:
# Which customers made the highest sales?
sales_per_customer_segment = pd.read_sql(
    """
    SELECT
    Segment,
    SUM(Sales)/100.0 AS "Sales ($)"
    FROM Orders
    JOIN Customers
    ON Orders.CustomerID = Customers.CustomerID
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Segment
    """, db_conn)

fig = px.pie(sales_per_customer_segment,
             values="Sales ($)",
             names='Segment',
             title='Sales by Customer Segment')
fig.show()

**Observation**: Most sales were made to customers from the consumer segment. The plot has almost the same proportions as the 'Orders by Customer Segment' plot, indicating a strong correlation between orders and sales per segment.

In [None]:
# Which customers made the highest profit?
profit_per_customer_segment = pd.read_sql(
    """
    SELECT
    Segment,
    SUM(Profit)/100.0 AS "Profit ($)"
    FROM Orders
    JOIN Customers
    ON Orders.CustomerID = Customers.CustomerID
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Segment
    """, db_conn)

fig = px.pie(profit_per_customer_segment,
             values="Profit ($)",
             names='Segment',
             title='Profit by Customer Segment')
fig.show()

**Observation**: Most profit also comes from the consumer segment, but compared with the sales and order numbers, the proportions change a bit. Per order and per sale, consumer customers bring in slightly less profit. Corporate and home office customers bring in higher profit with the same amount of orders/sales.

In [None]:
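# Analyse customer segments on order count, order quantity, profit and sales.
# Note: after the join with OrdersDetails, COUNT(*) counts joined rows
# (order lines), not distinct orders.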
customers_df = pd.read_sql(
    """
    SELECT
    Customers.CustomerID,
    Segment,
    COUNT(*) AS OrderCount,
    SUM(Quantity) AS OrderQuantity,
    SUM(Profit) / 100.0 AS "Profit ($)",
    SUM(Sales) / 100.0 AS "Sales ($)"
    FROM Orders
    JOIN Customers
    ON Orders.CustomerID = Customers.CustomerID
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Customers.CustomerID
    """, db_conn)

In [None]:
# Define color mappings for segments
color_map = {'Consumer': 'blue', 'Corporate': 'green', 'Home Office': 'red'}

for i in ['OrderCount', 'OrderQuantity', 'Profit ($)', 'Sales ($)']:
    # Calculate the average y value
    average_y = customers_df[i].mean()

    # Plot the bar chart with specified color mapping
    fig = px.bar(customers_df.sort_values(i), x='CustomerID', y=i, color='Segment', color_discrete_map=color_map)

    # Add a horizontal line for the average y value
    fig.add_hline(y=average_y, line_dash="dot", annotation_text=f'Average {i}: {average_y:.2f}', annotation_position="bottom right")

    # Show the plot
    fig.show()


**Observation**:

- **Order Count** - What's interesting here is that the vast majority of customers have ordered from our company multiple times. Only a few customers have a count of one. On average each customer has an OrderCount of over 12 in the past 4 years, and this statistic does not even account for the fact that some customers have been with us for less than the 4 years. Because OrderCount is a COUNT(*) over the join with OrdersDetails, it counts order lines, so the number of distinct orders per customer may be lower.

- **Order Quantity** - Shows a very similar line to Order Count. The customers have ordered on average over 47 items.

- **Profit ($)** - This is the most interesting plot, because it shows that there are certain customers who bring us a net loss!

- **Sales ($)** - This again follows the same shape as order count and quantity, which is not surprising.

## Customer Growth

New customer growth rate is the speed at which you gain new customers over defined periods of time. It is usually measured monthly, in which case it is commonly referred to as month-over-month growth.
Calculated correctly, the new customer growth rate helps you understand your overall success in attracting new customers.

In [None]:
# Create a plot showing how many new customers there were in each month.
new_customers_per_month = pd.read_sql(
    """
    SELECT
    SUBSTR(FirstOrderDate, 1, 4) AS Year,
    SUBSTR(FirstOrderDate, 6, 2) AS Month,
    COUNT(CustomerID) AS NewCustomers
    FROM (
        SELECT
        Orders.CustomerID,
        MIN(OrderDate) AS FirstOrderDate
        FROM Orders
        JOIN Customers
        ON Orders.CustomerID = Customers.CustomerID
        JOIN OrdersDetails
        ON Orders.OrderID = OrdersDetails.OrderID
        GROUP BY Orders.CustomerID
    )
    GROUP BY Year, Month
    """, db_conn)

# Plot the number of new customers per month, one line per year
fig = px.line(new_customers_per_month,
              x='Month',
              y='NewCustomers',
              color='Year',
              title='New Customers Monthly')
fig.show()


**Observation**: Most of the customers joined in 2014, when the company was gaining on average over 40 new customers a month. After that, growth dropped year by year, to fewer than 10 new customers a month over the last two years (2016 and 2017), with some months, such as January, February and December 2017, having 0 new customers.

In [None]:
# Create a plot showing the growth rate of new customers over the months.
new_customers_per_month = pd.read_sql(
    """
    SELECT
    SUBSTR(FirstOrderDate, 1, 7) AS YearMonth,
    COUNT(CustomerID) AS NewCustomers
    FROM (
        SELECT
        Orders.CustomerID,
        MIN(OrderDate) AS FirstOrderDate
        FROM Orders
        JOIN Customers
        ON Orders.CustomerID = Customers.CustomerID
        JOIN OrdersDetails
        ON Orders.OrderID = OrdersDetails.OrderID
        GROUP BY Orders.CustomerID
    )
    GROUP BY YearMonth
    """, db_conn)

# Calculate cumulative sum of new customers
new_customers_per_month['CumulativeNewCustomers'] = new_customers_per_month['NewCustomers'].cumsum()

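# Calculate monthly growth rate of customers: new customers in the month
# relative to the customer total at the start of the month (the previous
# cumulative total). The first month has no starting total, so its rate is NaN.
customers_at_start = new_customers_per_month['CumulativeNewCustomers'].shift(1)
new_customers_per_month['NewCustomerGrowthRate'] = (new_customers_per_month['NewCustomers'] / customers_at_start) * 100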

# Split YearMonth column into Year and Month
new_customers_per_month[['Year', 'Month']] = new_customers_per_month['YearMonth'].str.split('-', expand=True)

# Plot the number of new customers over the months
fig = px.line(new_customers_per_month,
              x='YearMonth',
              y='NewCustomers',
              title='New Customers Over Time')
fig.update_xaxes(title='Month')
fig.update_yaxes(title='Number of New Customers')
fig.show()

**Observation**: An alternative way of showing 'New Customers Monthly' would be the graph above, where all years are represented on one line and we can see a significant decline in new customers at the beginning of 2015.

In [None]:
# Plot the growth of new customers over the months.
fig = px.line(new_customers_per_month,
              x='YearMonth',
              y='CumulativeNewCustomers',
              title='New Customers Over Time (Cumulative)')
fig.update_xaxes(title='Month')
fig.update_yaxes(title='Cumulative Number of New Customers')
fig.show()

**Observation**: In the cumulative view of new customers, we can see that the growth of the customer base started plateauing around the beginning of 2015 and almost flattened out in 2017. This suggests that the rate of acquiring new customers slowed down during that period. The plateauing and flattening could indicate a saturation point in customer acquisition efforts or a stabilization of the customer base.

In [None]:
fig = px.line(new_customers_per_month,
              x='Month',
              y='NewCustomerGrowthRate',
              color='Year',
              title='Customer Growth Rate (%) Monthly')
fig.update_yaxes(title='Customer Growth Rate (%)')
fig.show()

**Observation**: The month-over-month customer growth rate shows that after the successful 2014, when most of the customer base was built, monthly growth was stuck at 1 - 3% in 2015, below 1% in 2016 and below 0.5% in 2017.

The formula used is:

Growth Rate = ((Total Customers at End of Period - Total Customers at Start of Period) / Total Customers at Start of Period) * 100%

This formula calculates the percentage change in the number of customers from the start to the end of each period. This does not account for churn, the rate at which existing customers leave your company.
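The formula can be sketched in plain Python. The function name `growth_rates` and the input list are hypothetical; the customer total at the start of each period is taken as the running sum of all earlier new customers, and churn is ignored, as in the formula.

```python
def growth_rates(new_customers_per_period):
    """Return the period-over-period customer growth in percent.

    The first period has no customers at its start, so its rate is
    undefined and reported as None.
    """
    rates = []
    total_at_start = 0
    for new_customers in new_customers_per_period:
        total_at_end = total_at_start + new_customers
        if total_at_start == 0:
            rates.append(None)  # division by zero: rate is undefined
        else:
            rates.append((total_at_end - total_at_start) / total_at_start * 100)
        total_at_start = total_at_end
    return rates


# Hypothetical monthly numbers of new customers, for illustration only.
toy_new_customers = [50, 50, 25, 25]
toy_rates = growth_rates(toy_new_customers)
print(toy_rates)
```

Note that the same number of new customers (25) yields a smaller rate in the last period, because the base it is compared with has grown.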

## Average Order Value

Average order value (AOV) tracks the average dollar amount spent each time a customer places an order.  
AOV = Sales / Number of Orders

In [None]:
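# Calculate Monthly Average Order Value
# AOV = Revenue/TotalOrders
# Note: after the join with OrdersDetails, COUNT(*) counts joined rows
# (order lines), so the value below is the average sales amount per order line.
monthly_aov = pd.read_sql(
    """
    SELECT
    SUBSTR(OrderDate, 1, 4) AS Year,
    SUBSTR(OrderDate, 6, 2) AS Month,
    SUM(Sales) / 100.0 / COUNT(*) AS AverageOrderValue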
    FROM Orders
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    JOIN Products
    ON OrdersDetails.ProductID = Products.ProductID
    GROUP BY Month, Year
    """, db_conn)

fig = px.line(monthly_aov,
              x='Month',
              y='AverageOrderValue',
              color='Year',
              title='Monthly Average Order Value')
fig.show()

**Observation**: The monthly AOV usually peaks in March, which is also when order count and sales show a slight peak after the January and February lows (the February low is also visible in the AOV). For the rest of the year the AOV fluctuates mostly between $170 and $260, with single spikes in the autumn months and, in some years, in January.
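The AOV value depends on what is counted as an order. The sketch below uses an in-memory toy database to show the difference between dividing by `COUNT(*)` after joining order lines and dividing by `COUNT(DISTINCT OrderID)`. The table and column names mirror the notebook's queries, but the rows are made up, and storing `Sales` in cents is an assumption based on the division by 100 in the queries.

```python
import sqlite3

# In-memory toy database; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Orders (OrderID TEXT PRIMARY KEY, OrderDate TEXT);
    CREATE TABLE OrdersDetails (OrderID TEXT, Sales INTEGER);  -- Sales in cents (assumed)
    INSERT INTO Orders VALUES ('A', '2014-01-05'), ('B', '2014-01-20');
    INSERT INTO OrdersDetails VALUES ('A', 10000), ('A', 20000), ('B', 30000);
""")

# Order 'A' has two lines, order 'B' has one: three joined rows, two orders.
per_line, per_order = conn.execute("""
    SELECT
        SUM(Sales) / 100.0 / COUNT(*) AS ValuePerOrderLine,
        SUM(Sales) / 100.0 / COUNT(DISTINCT Orders.OrderID) AS ValuePerOrder
    FROM Orders
    JOIN OrdersDetails ON Orders.OrderID = OrdersDetails.OrderID
""").fetchone()
conn.close()

print(per_line, per_order)
```

Converting to dollars before dividing (`/ 100.0` first) also avoids SQLite's integer division when both operands are integers.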

In [None]:
# Show yearly AOV as the mean of the monthly AOV values.
# This is an unweighted mean: months with few orders count as much as busy ones.
yearly_aov = monthly_aov.groupby('Year')['AverageOrderValue'].mean().reset_index()
fig = px.bar(yearly_aov,
             x='Year',
             y='AverageOrderValue',
             title='Yearly Average Order Value')
fig.show()

**Observation**: The Yearly Average Order Value didn't move much in the 4 years. It peaked a bit in 2016 to drop again a bit in 2017.

## Average Profit Margin

Profit Margin is typically expressed as a percentage, using the formula:

*Profit Margin (%) = Profit/Sales * 100%*

This formula helps to assess the efficiency of a business in generating profits relative to its sales.

This formula can be applied to each transaction or to values aggregated over a period (e.g., monthly, yearly) to gain insights into the profitability of the business operations.
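The two ways of applying the formula generally give different numbers. A minimal sketch with two hypothetical transactions compares the margin of the aggregated totals with the unweighted mean of the per-transaction margins:

```python
# Hypothetical transactions as (sales, profit) pairs in dollars.
transactions = [(1000.0, 100.0), (100.0, 40.0)]

# Margin of the aggregated period: total profit over total sales.
total_sales = sum(sales for sales, _ in transactions)
total_profit = sum(profit for _, profit in transactions)
aggregated_margin = total_profit / total_sales * 100

# Unweighted mean of the per-transaction margins.
per_transaction_margins = [profit / sales * 100 for sales, profit in transactions]
mean_margin = sum(per_transaction_margins) / len(per_transaction_margins)

print(f"Aggregated margin: {aggregated_margin:.2f}%")
print(f"Mean of per-transaction margins: {mean_margin:.2f}%")
```

The aggregated margin weights each transaction by its sales, so large transactions dominate; the mean of margins treats every transaction equally.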

In [None]:
# Monthly Sales and Profit
average_profit_margin = pd.read_sql(
    """
    SELECT
    SUBSTR(OrderDate, 1, 4) AS Year,
    SUBSTR(OrderDate, 6, 2) AS Month,
    SUM(Sales) / 100.0 AS Sales,
    SUM(Profit) / 100.0 AS Profit
    FROM Orders
    JOIN OrdersDetails
    ON Orders.OrderID = OrdersDetails.OrderID
    GROUP BY Year, Month
    """, db_conn)

# Calculate Avg. Profit Margin
average_profit_margin['ProfitMargin'] = average_profit_margin['Profit'] / average_profit_margin['Sales'] * 100

# Combine Year and Month for plotting purposes
average_profit_margin['YearMonth'] = average_profit_margin['Year'] + '-' + average_profit_margin['Month']

In [None]:
# Plot the monthly average profit margin over time.
fig = px.line(average_profit_margin,
              x='YearMonth',
              y='ProfitMargin',
              title='Monthly Average Profit Margin')
fig.update_xaxes(title='Month')
fig.show()

**Observation**: Aside from two dips into negative profit margins in Jul 2014 (-2.48%) and Jan 2015 (-18.05%), the company has been profitable.

## Products Sales vs Profit

In [None]:
# Sales and Profit per Product Category
sales_profit_per_category = pd.read_sql(
    """
    SELECT Category,
    SUM(Sales)/100.0 AS "Sales ($)",
    SUM(Profit)/100.0 AS "Profit ($)"
    FROM OrdersDetails
    JOIN Products ON OrdersDetails.ProductID = Products.ProductID
    GROUP BY Category
    """, db_conn)

# Calculate percentage values for sales and profit
sales_profit_per_category['Sales (%)'] = (sales_profit_per_category['Sales ($)'] / sales_profit_per_category['Sales ($)'].sum()) * 100
sales_profit_per_category['Profit (%)'] = (sales_profit_per_category['Profit ($)'] / sales_profit_per_category['Profit ($)'].sum()) * 100

In [None]:
# Create bar chart with Sales and Profit per Product Category
fig = px.bar(sales_profit_per_category,
             x='Category',
             y=['Sales ($)', 'Profit ($)'],
             barmode='group',
             title='Sales and Profit Comparison')
fig.update_yaxes(title='USD ($)')

# Show the plot
fig.show()

**Observation**: The sales-to-profit ratio for furniture looks quite bad. Investigate further using pie charts.

In [None]:
# Visualize sales by product category in percentage
fig_sales = px.pie(sales_profit_per_category,
                   values='Sales (%)',
                   names='Category',
                   title='Sales (%) by Product Category')
fig_sales.show()

**Observation**: We see that the sales are almost evenly distributed among the 3 categories of products.

In [None]:
# Visualize profit by product category in percentage
fig_profit = px.pie(sales_profit_per_category,
                    values='Profit (%)',
                    names='Category',
                    title='Profit (%) by Product Category')
fig_profit.show()

**Observation**: The profit, on the other hand, is coming mostly from technology, and we make almost no profit selling furniture.

In [None]:
# Sales and Profit per Product SubCategory
sales_profit_per_subcategory = pd.read_sql(
    """
    SELECT
    SubCategory,
    Category,
    SUM(Sales)/100.0 AS "Sales ($)",
    SUM(Profit)/100.0 AS "Profit ($)"
    FROM OrdersDetails
    JOIN Products ON OrdersDetails.ProductID = Products.ProductID
    GROUP BY SubCategory
    ORDER BY "Profit ($)" DESC;
    """, db_conn)

# Calculate percentage values for sales and profit
sales_profit_per_subcategory['Sales (%)'] = (sales_profit_per_subcategory['Sales ($)'] / sales_profit_per_subcategory['Sales ($)'].sum()) * 100
sales_profit_per_subcategory['Profit (%)'] = (sales_profit_per_subcategory['Profit ($)'] / sales_profit_per_subcategory['Profit ($)'].sum()) * 100

In [None]:
# Create bar chart with Sales and Profit per Product SubCategory
fig = px.bar(sales_profit_per_subcategory,
             x='SubCategory',
             y=['Sales ($)', 'Profit ($)'],
             barmode='group',
             title='Sales and Profit Comparison')
fig.update_yaxes(title='USD ($)')

# Show the plot
fig.show()

**Observation**: We make the most profit from Copiers, Phones and Accessories and the least from Tables, Bookcases and Supplies, on which we basically lose money.

## Profit Margin for Product Sub-Categories

Profit margin can also be used to measure the profitability of a product.

In [None]:
# Calculate Profit Margin for Sub-Categories of Products.
sales_profit_per_subcategory['ProfitMargin (%)'] = sales_profit_per_subcategory['Profit ($)']/sales_profit_per_subcategory['Sales ($)'] * 100

In [None]:
# Show Profit Margin for Product Sub-Categories
fig = px.bar(sales_profit_per_subcategory.sort_values('ProfitMargin (%)', ascending=False),
              x='SubCategory',
              y='ProfitMargin (%)',
              color='Category',
              title='Profit Margin (%) for Product Sub-Categories')
fig.show()

**Observation**: The analysis of profit margin confirms that the large differences in profit among the product categories come from how much margin the company has on the product price.

## Discounts

In [None]:
# Discounts
discounts = pd.read_sql(
    """
    SELECT
        Products.Category,
        Products.SubCategory,
        OrdersDetails.Discount,
        COUNT(OrdersDetails.Discount) AS DiscountCount,
        SubCatCount.SubCount,
        ROUND((COUNT(OrdersDetails.Discount) * 100.0) / SubCatCount.SubCount, 2) AS DiscountCountPercentage
    FROM OrdersDetails
    JOIN Products ON OrdersDetails.ProductID = Products.ProductID
    JOIN (
        SELECT
            SubCategory,
            COUNT(SubCategory) AS SubCount
        FROM OrdersDetails
        JOIN Products ON OrdersDetails.ProductID = Products.ProductID
        GROUP BY SubCategory
    ) AS SubCatCount ON SubCatCount.SubCategory = Products.SubCategory
    GROUP BY Products.Category, Products.SubCategory, OrdersDetails.Discount;

    """, db_conn)

In [None]:
fig = px.bar(discounts,
             x='SubCategory',
             y='DiscountCountPercentage',
             color='Discount',
             #barmode='group',
             title='Discounts per Product Sub-Category')
fig.show()