# Plotly

<img src="images/Plotly.png" width="800"/>

Plotly is a Python (and JavaScript) data visualization library that allows you to create interactive, web-based charts and dashboards.

It’s especially known for:

- Interactivity (zoom, pan, hover)
- Publication-quality visuals
- Integration with tools like Dash, Jupyter Notebooks, and web apps



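### Built on JavaScript
Plotly's Python library is a wrapper around plotly.js, so every figure is rendered in the browser (or notebook) by JavaScript, which is where the interactivity comes from.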

### Supports Many Chart Types
- Basic: line, bar, scatter, pie, histogram
- Statistical: box plot, violin, heatmap
- Advanced: 3D plots, maps (Geo/Choropleth), animations, subplots
- Dashboards: custom dashboards with Plotly Dash

In [6]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import nbformat

In [7]:
df_1 = pd.read_excel('./DataSets Folder/online_retail_II.xlsx', sheet_name='Year 2009-2010')
df_2 = pd.read_excel('./DataSets Folder/online_retail_II.xlsx', sheet_name='Year 2010-2011')

df = pd.concat([df_1, df_2], ignore_index=True)
# combines the two dataframes (df_1 and df_2) into a single one (df), 
# resetting the index so it runs from 0 to N without duplicating indices.

print(df.shape)
print(df.columns)

(1067371, 8)
Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'Price', 'Customer ID', 'Country'],
      dtype='object')


In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1067371 entries, 0 to 1067370
Data columns (total 8 columns):
 #   Column       Non-Null Count    Dtype         
---  ------       --------------    -----         
 0   Invoice      1067371 non-null  object        
 1   StockCode    1067371 non-null  object        
 2   Description  1062989 non-null  object        
 3   Quantity     1067371 non-null  int64         
 4   InvoiceDate  1067371 non-null  datetime64[ns]
 5   Price        1067371 non-null  float64       
 6   Customer ID  824364 non-null   float64       
 7   Country      1067371 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 65.1+ MB


# Cleaning & Feature Engineering

In [9]:
df = df.dropna(subset=['InvoiceDate', 'Customer ID'])
# Drop rows from df where either InvoiceDate or Customer ID is missing (NaN).
print(df.shape)

(824364, 8)
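To see what `dropna(subset=...)` does on a small scale, here is a minimal sketch. The four-row frame `toy` is made up for illustration; only the column names come from the dataset above.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame; column names mirror the dataset above.
toy = pd.DataFrame({
    "Invoice": ["489434", "489435", "C489449", "489436"],
    "InvoiceDate": pd.to_datetime(["2009-12-01", "2009-12-01", None, "2009-12-02"]),
    "Customer ID": [13085.0, np.nan, 16321.0, 13078.0],
})

# A row is dropped when ANY of the listed columns is missing (NaN or NaT).
cleaned = toy.dropna(subset=["InvoiceDate", "Customer ID"])
print(cleaned["Invoice"].tolist())
```

The second row is removed because its Customer ID is missing, and the third because its InvoiceDate is missing. Columns not listed in `subset` are ignored by the check.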


In [10]:
cancelled_invoice_Flags = df['Invoice'].str.startswith('C', na=False)
# Invoice values that start with 'C' mark cancelled invoices, so those rows get True.
# na=False makes missing values return False; ordinary invoice numbers also return False.
cancelled_invoice_data = df[cancelled_invoice_Flags]
print(cancelled_invoice_Flags.sum(), ":  rows that have cancelled invoices")
cancelled_invoice_data['Invoice']

18744 :  rows that have cancelled invoices


178        C489449
179        C489449
180        C489449
181        C489449
182        C489449
            ...   
1065910    C581490
1067002    C581499
1067176    C581568
1067177    C581569
1067178    C581569
Name: Invoice, Length: 18744, dtype: object

In [11]:
# Let's remove the cancelled invoices from the main dataset
df = df[~df['Invoice'].astype(str).str.startswith('C')]
# df['Invoice'].astype(str) – ensures the Invoice column is treated as a string
# .str.startswith('C') – returns True for rows where the Invoice starts with 'C'.
# ~ – inverts the condition (i.e., select rows not starting with 'C').
print(df.shape)

(805620, 8)
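The flag-then-invert pattern used in the two cells above can be tried on a tiny Series. This is only a sketch: the values in `invoices` are invented, and a missing value is included to show the effect of `na=False`.

```python
import pandas as pd

# Hypothetical invoice numbers, including one missing value.
invoices = pd.Series(["489434", "C489449", None, "C581490"], name="Invoice")

# True where the invoice starts with 'C'; na=False turns the missing value into False.
is_cancelled = invoices.str.startswith("C", na=False)

# ~ inverts the boolean mask, keeping only the rows that are NOT cancelled.
kept = invoices[~is_cancelled]

print(int(is_cancelled.sum()), "cancelled,", len(kept), "kept")
```

Without `na=False`, the missing value would come back as missing in the mask rather than as False, and inverting or indexing with such a mask is error-prone.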


In [12]:
# data cleaning
# Removes rows where the quantity is 0 or negative
df = df[df['Quantity'] > 0]
# Removes free, zero-priced, or negative-priced items
df = df[df['Price'] > 0]
# Useful for ensuring all revenue calculations are accurate
print(df.shape)

(805549, 8)


In [13]:
# Add new features 

df['Revenue'] = df['Quantity'] * df['Price'] # (feature engineering)
df['InvoiceDate'] = pd.to_datetime(df['InvoiceDate'])
# Ensures InvoiceDate is in datetime format (it already is, per df.info() above, so this is a safeguard).
# Datetime values are essential for time-based filtering (e.g., by year or month), time series analysis, and plotting trends.
df['Month'] = df['InvoiceDate'].dt.to_period('M').astype(str) # (feature engineering)
# Creates a new column Month like "2010-11", "2011-01", etc.
df.head()

   Invoice  StockCode                          Description  Quantity          InvoiceDate  Price  Customer ID         Country  Revenue    Month
0   489434      85048  15CM CHRISTMAS GLASS BALL 20 LIGHTS        12  2009-12-01 07:45:00   6.95      13085.0  United Kingdom     83.4  2009-12
1   489434     79323P                   PINK CHERRY LIGHTS        12  2009-12-01 07:45:00   6.75      13085.0  United Kingdom     81.0  2009-12
2   489434     79323W                  WHITE CHERRY LIGHTS        12  2009-12-01 07:45:00   6.75      13085.0  United Kingdom     81.0  2009-12
3   489434      22041          RECORD FRAME 7" SINGLE SIZE        48  2009-12-01 07:45:00   2.10      13085.0  United Kingdom    100.8  2009-12
4   489434      21232       STRAWBERRY CERAMIC TRINKET BOX        24  2009-12-01 07:45:00   1.25      13085.0  United Kingdom     30.0  2009-12
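The two engineered columns can be reproduced on a few rows. This is a minimal sketch: the frame `toy` is hypothetical, although its first row reuses the quantity and price of the first row shown above.

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from the dataset.
toy = pd.DataFrame({
    "Quantity": [12, 48, 3],
    "Price": [6.95, 2.10, 4.95],
    "InvoiceDate": pd.to_datetime(
        ["2009-12-01 07:45", "2010-01-15 10:00", "2011-12-09 12:50"]
    ),
})

# Revenue is a plain element-wise product of two columns.
toy["Revenue"] = toy["Quantity"] * toy["Price"]

# to_period('M') truncates each timestamp to its month; astype(str) gives 'YYYY-MM'.
toy["Month"] = toy["InvoiceDate"].dt.to_period("M").astype(str)

print(toy[["Revenue", "Month"]])
```

Because `Revenue` is a floating-point product, values such as 3 × 4.95 may print with a tiny rounding tail instead of exactly 14.85.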


In [14]:
monthly_revenue = df.groupby('Month')['Revenue'].sum().sort_index()
# Groups the DataFrame by the Month column created above with .dt.to_period('M').astype(str).
# ['Revenue'].sum() - For each month, it sums all the revenue values.
# .sort_index() - Sorts the 'YYYY-MM' strings alphabetically, which for this zero-padded format is also chronological order (groupby already sorts its keys by default, so this is a safeguard).
monthly_revenue.head()

Month
2009-12    686654.160
2010-01    557319.062
2010-02    506371.066
2010-03    699608.991
2010-04    594609.192
Name: Revenue, dtype: float64
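Since `Month` is stored as text, sorting it is a plain string sort. The sketch below uses made-up month labels to show why the `'YYYY-MM'` format still comes out in calendar order, while a label such as `'Jan 2010'` would not.

```python
# 'YYYY-MM' strings: alphabetical order matches calendar order.
iso_months = ["2010-02", "2009-12", "2010-01"]
iso_sorted = sorted(iso_months)
print(iso_sorted)

# Month-name labels: alphabetical order scrambles the calendar.
named_months = ["Feb 2010", "Dec 2009", "Jan 2010"]
named_sorted = sorted(named_months)
print(named_sorted)
```

The first list sorts to December 2009, January 2010, February 2010. The second sorts to Dec, Feb, Jan, which is why the year-first, zero-padded format is the safer choice for a string column.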

In [15]:
df_sample = df[['Quantity', 'Price', 'Revenue', 'Country', 'Month']].copy()
df_sample.head()

   Quantity  Price  Revenue         Country    Month
0        12   6.95     83.4  United Kingdom  2009-12
1        12   6.75     81.0  United Kingdom  2009-12
2        12   6.75     81.0  United Kingdom  2009-12
3        48   2.10    100.8  United Kingdom  2009-12
4        24   1.25     30.0  United Kingdom  2009-12


In [16]:
monthly_rev = df_sample.groupby('Month')['Revenue'].sum().reset_index()
monthly_rev.head()

     Month     Revenue
0  2009-12  686654.160
1  2010-01  557319.062
2  2010-02  506371.066
3  2010-03  699608.991
4  2010-04  594609.192


# Line Chart

In [27]:
fig = px.line(monthly_rev,
              x='Month', 
              y='Revenue',
              markers=True)

# markers=True: Adds markers (dots) at each data point, making the trend more readable.

fig.update_layout(xaxis_title="Month", 
                  yaxis_title="Revenue",
                  xaxis_tickangle=45, 
                  title='Monthly Revenue Trend',
                  plot_bgcolor='Black')

# update_layout - This function customizes the appearance of the chart.

fig.update_traces(line_color='Blue', marker_color='red')
# customizes the appearance of the line and markers in your Plotly figure 

fig.show()

# Bar Plot
### Top 10 Countries by Revenue

In [29]:
top_countries = df_sample.groupby('Country')['Revenue'].sum().sort_values(ascending=False).head(10).reset_index()
top_countries.head()

          Country     Revenue
0  United Kingdom  14723150.0
1            EIRE    621631.1
2     Netherlands    554232.3
3         Germany    431262.5
4          France    355257.5


In [30]:
fig = px.bar(top_countries,
             x='Country', 
             y='Revenue',
             title='Top 10 Countries by Revenue',
             text='Revenue', # Display revenue as a label on each bar.
             color='Country') # Each bar gets a different color (based on country).

# update_traces - properties of data like, line style, text label, hover info, marker size / color

fig.update_traces(texttemplate='%{text:.3s}', textposition='outside')
# texttemplate='%{text:.3s}' - custom format for the label text.
# '.3s' is a d3-format specifier: round to 3 significant digits and add an SI prefix (e.g. 14723150 -> 14.7M, 621631.1 -> 622k).
# textposition='outside' places the label above the bar (outside the bar's top).
fig.update_layout(xaxis_tickangle=45, showlegend=False)
fig.show()
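The `.3s` in `texttemplate` is a d3-format specifier: it rounds to three significant digits and appends an SI prefix. To make that concrete, here is a rough pure-Python imitation. The helper `si_format` is hypothetical, written only for this note; it is not part of Plotly and it only handles values of 1 or more.

```python
import math

def si_format(value, sig=3):
    """Approximate d3-format's '.{sig}s' for values >= 1 (hypothetical helper)."""
    prefixes = ["", "k", "M", "G", "T"]
    # Round to `sig` significant digits first.
    exponent = math.floor(math.log10(value))
    rounded = round(value, sig - 1 - exponent)
    # Rounding can push the value into the next power of ten, so recompute.
    exponent = math.floor(math.log10(rounded))
    # Pick the SI prefix for the nearest lower multiple of 1000.
    group = min(exponent // 3, len(prefixes) - 1)
    scaled = rounded / 10 ** (3 * group)
    # Keep exactly `sig` significant digits in the printed number.
    decimals = max(0, sig - 1 - (exponent - 3 * group))
    return f"{scaled:.{decimals}f}{prefixes[group]}"

for revenue in [14723150.0, 621631.1, 554232.3]:
    print(revenue, "->", si_format(revenue, 3), "|", si_format(revenue, 2))
```

With three significant digits, the United Kingdom total becomes `14.7M` and EIRE becomes `622k`. With `.2s`, as used in later cells, they become `15M` and `620k`.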

### Axis formatting

In [38]:


fig = px.bar(top_countries.head(5),
             x='Country', y='Revenue',
             text='Revenue',
             color='Country',
             title='Top 5 Countries by Revenue')

fig.update_layout(
    xaxis_title="Country (Top 5)",
    yaxis_title="Revenue (GBP)",  # Online Retail II prices are recorded in pounds sterling, not USD
    xaxis_tickangle=45,
    yaxis_tickformat=",",  # adds commas to big numbers
)

fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.show()

### Custom Colors and Palettes

In [39]:
color_map = {
    'United Kingdom': '#1f77b4',
    'Germany': '#ff7f0e',
    'France': '#2ca02c',
    'EIRE': '#d62728',
    'Netherlands': '#9467bd'
}

fig = px.bar(top_countries.head(5),
             x='Country', y='Revenue',
             text='Revenue',
             color='Country',
             color_discrete_map=color_map) # this will map colours to the Country names

fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.show()

### Background and Grid Styling

In [47]:
color_map = {
    'United Kingdom': '#1f77b4',
    'Germany': '#ff7f0e',
    'France': '#2ca02c',
    'EIRE': '#d62728',
    'Netherlands': '#9467bd'
}

fig = px.bar(top_countries.head(5),
             x='Country', y='Revenue',
             text='Revenue',
             color='Country',
             color_discrete_map=color_map)

fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')

fig.update_layout(
    plot_bgcolor='white',
    paper_bgcolor='#f8f8f8',
    xaxis=dict(showgrid=True, gridcolor='lightgray'),
    yaxis=dict(showgrid=True, gridcolor='lightgray'),
)

fig.show()

### Legend Customization

In [57]:
color_map = {
    'United Kingdom': '#1f77b4',
    'Germany': '#ff7f0e',
    'France': '#2ca02c',
    'EIRE': '#d62728',
    'Netherlands': '#9467bd'
}

fig = px.bar(top_countries.head(5),
             x='Country', y='Revenue',
             text='Revenue',
             color='Country',
             color_discrete_map=color_map)
fig.update_traces(texttemplate='%{text:.2s}', textposition='outside')
fig.update_layout(
    showlegend=True,
    legend_title="Countries",
    legend=dict(                         # Controls how and where the legend appears
        orientation='h',                 # Makes the legend horizontal (default is vertical).
        yanchor='bottom',                # Anchors the legend box vertically using its bottom edge.
        y=1.02,                          # Places the legend slightly above the main chart area (2% above top).
        xanchor='right',                 # Anchors the legend box horizontally using its right edge.
        x=1                              # Aligns the right edge of the legend with the right edge of the plot.
    )
)
fig.show()

# Scatter Plot (Bubble Chart)
### Price vs Quantity with Revenue as Size

In [31]:
fig = px.scatter(df_sample.sample(1000), # Randomly selects 1000 rows from your DataFrame (for performance and clarity).
                 x='Quantity', 
                 y='Price',
                 size='Revenue', 
                 color='Country',
                 hover_data=['Revenue', 'Month'], # When you hover over a bubble, it shows Revenue and Month in the tooltip.
                 title='Price vs Quantity (Bubble = Revenue)')

fig.update_layout(xaxis_type='log', yaxis_type='log')
# This applies logarithmic scaling to both axes.
fig.show()
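A quick way to see why log axes help here: when values span several orders of magnitude, a linear axis squeezes almost everything against zero, while a log axis spaces powers of ten evenly. The sketch below uses invented quantities and computes where each one would sit along an axis, as a fraction of its length.

```python
import math

# Hypothetical quantities spanning four orders of magnitude.
quantities = [1, 10, 100, 1000, 10000]
largest = max(quantities)

# Linear axis: position is proportional to the value itself.
linear_positions = [q / largest for q in quantities]

# Log axis: position is proportional to log10 of the value.
log_positions = [math.log10(q) / math.log10(largest) for q in quantities]

print(linear_positions)
print(log_positions)
```

On the linear axis the first four points all fall within the lowest 10% of the range. On the log axis they are evenly spaced. A log axis cannot show zero or negative values, which is one more reason the earlier cleaning step that keeps only positive `Quantity` and `Price` matters.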

In [36]:
print(df_sample.columns)
print(df_sample.values)
print(df_sample.index)

Index(['Quantity', 'Price', 'Revenue', 'Country', 'Month'], dtype='object')
[[12 6.95 83.4 'United Kingdom' '2009-12']
 [12 6.75 81.0 'United Kingdom' '2009-12']
 [12 6.75 81.0 'United Kingdom' '2009-12']
 ...
 [4 4.15 16.6 'France' '2011-12']
 [3 4.95 14.850000000000001 'France' '2011-12']
 [1 18.0 18.0 'France' '2011-12']]
Index([      0,       1,       2,       3,       4,       5,       6,       7,
             8,       9,
       ...
       1067361, 1067362, 1067363, 1067364, 1067365, 1067366, 1067367, 1067368,
       1067369, 1067370],
      dtype='int64', length=805549)


In [37]:
# hover_data can also be a dict, {'column_name': True/False}, to show or hide individual fields in the tooltip
fig = px.scatter(df_sample.sample(1000),
                 x='Quantity', y='Price',
                 size='Revenue', color='Country',
                 hover_data={'Quantity': True, 'Price':True, 'Revenue':False, 'Country':False, 'Month':True},
                 title='Price vs Quantity (Bubble = Revenue)')

fig.update_layout(xaxis_type='log', yaxis_type='log')
fig.show()

### Fonts, Titles, and Global Style

In [58]:
fig = px.scatter(df_sample.sample(1000),
                 x='Quantity', y='Price',
                 color='Country',
                 size='Revenue',
                 hover_data=['Revenue', 'Month'])

fig.update_layout(
    title=dict(
        text="Styled Price vs Quantity Scatter",
        font=dict(size=20, color='darkblue', family='Arial'),
        x=0.5  # center the title
    ),
    font=dict(family="Verdana", size=14, color="black"),
)

fig.show()


# Grouped Bar chart

In [60]:
filtered_df = df_sample[(df_sample['Revenue'] < 1000) & (df_sample['Revenue'] > 0)]
filtered_df.head()

   Quantity  Price  Revenue         Country    Month
0        12   6.95     83.4  United Kingdom  2009-12
1        12   6.75     81.0  United Kingdom  2009-12
2        12   6.75     81.0  United Kingdom  2009-12
3        48   2.10    100.8  United Kingdom  2009-12
4        24   1.25     30.0  United Kingdom  2009-12


In [61]:
monthly_revenue = (
    filtered_df.groupby(['Country', 'Month'])['Revenue']
    .sum()
    .reset_index()
)

# Focus on top 5 countries only
top_countries = monthly_revenue.groupby('Country')['Revenue'].sum().nlargest(5).index
monthly_revenue_top5 = monthly_revenue[monthly_revenue['Country'].isin(top_countries)]
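The "aggregate, pick the top N, then filter back" pattern in this cell is easier to follow on a few rows. This is a sketch on made-up data (the frame `toy` and its country labels are hypothetical), keeping the top 2 instead of the top 5.

```python
import pandas as pd

# Hypothetical revenue per country and month.
toy = pd.DataFrame({
    "Country": ["UK", "UK", "EIRE", "EIRE", "Spain", "Spain"],
    "Month": ["2010-01", "2010-02"] * 3,
    "Revenue": [500, 400, 90, 60, 20, 10],
})

# Step 1: one row per (Country, Month) pair.
by_country_month = toy.groupby(["Country", "Month"])["Revenue"].sum().reset_index()

# Step 2: total per country, keep the labels of the 2 largest.
top = by_country_month.groupby("Country")["Revenue"].sum().nlargest(2).index

# Step 3: keep only the monthly rows that belong to those countries.
top_rows = by_country_month[by_country_month["Country"].isin(top)]

print(top.tolist())
print(top_rows)
```

Note that `top` is an Index of country names, not a DataFrame, so it is used with `isin` as a filter and is not plotted directly.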

In [64]:
fig = px.bar(
    monthly_revenue_top5,
    x='Month',
    y='Revenue',
    color='Country',
    title='Monthly Revenue (Top 5 Countries, Filtered Revenue < 1000)',
    barmode='group' # Places each country's bar side by side within every month (instead of stacking them).
    # With barmode='stack', the countries' bars would sit on top of each other for each month instead.
)

fig.update_layout(
    xaxis_title='Month',
    yaxis_title='Revenue',
    xaxis_tickangle=45
)

fig.show()

# Pie Chart
### Country Revenue Share
### Each top-5 country's share of their combined revenue

In [72]:
country_rev = df_sample.groupby('Country')['Revenue'].sum().nlargest(5).reset_index() # Data Aggregation
# .nlargest(5) selects the top 5 countries with the highest total revenue.
# country_rev = df_sample.groupby('Country')['Revenue'].sum().sort_values(ascending = False).head(5).reset_index() # can be done like this as well

fig = px.pie(country_rev, 
             names='Country', # Each wedge represents a country.
             values='Revenue', # Size of each wedge is based on that country's total revenue.
             title='Top 5 Countries by Revenue Share',
             hole=0.4)  # Donut Chart

fig.update_traces(textinfo='percent+label')
# Controls what text is shown on each wedge.
# 'percent+label' - shows the country name (label) and its percentage of the plotted total (the top 5 combined, not all countries).

fig.show()
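One thing to keep in mind when reading the pie: the percentages are computed from the rows passed to `px.pie`, so after `nlargest(5)` each wedge is a share of the top-5 total, not of all revenue. The sketch below uses invented totals (the Series `revenue` and its labels are hypothetical) and keeps the top 3 to show the difference.

```python
import pandas as pd

# Hypothetical revenue totals for six countries (sum = 2000).
revenue = pd.Series({"A": 700.0, "B": 200.0, "C": 100.0,
                     "D": 50.0, "E": 50.0, "F": 900.0})

top = revenue.nlargest(3)

# What a pie of `top` would display: share within the plotted rows only.
share_in_chart = top / top.sum() * 100

# Share of the grand total across all countries.
share_of_total = top / revenue.sum() * 100

print(share_in_chart.round(1))
print(share_of_total.round(1))
```

Country F shows as 50% in the chart but accounts for 45% of all revenue. If shares of the full total are wanted, one option is to add an "Other" row holding the remaining revenue before plotting.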

# Treemap
### Revenue by Country and Month

In [73]:
df_tree = df_sample.groupby(['Country', 'Month'])['Revenue'].sum().reset_index()
top_countries = df_tree.groupby('Country')['Revenue'].sum().nlargest(5).index
df_tree = df_tree[df_tree['Country'].isin(top_countries)]

fig = px.treemap(df_tree, path=['Country', 'Month'], values='Revenue',
                 title='Revenue by Country and Month')

# Creates a Treemap — a chart for hierarchical, part-to-whole data.
# path=['Country', 'Month']: First level: Country | Second level: Month
# values='Revenue': Size of each box is determined by total Revenue.

# The treemap shows one big rectangle per country, sized by total revenue.
# Within each country rectangle, smaller boxes show the monthly revenue breakdown.

fig.show()

# Sunburst Chart
### (Same as Treemap, but Circular)

In [74]:
fig = px.sunburst(df_tree, path=['Country', 'Month'], values='Revenue',
                  title='Sunburst of Revenue by Country/Month')
fig.show()

### We have created a separate dashboard project to practice "Dash by Plotly".