Create category weights for each. Distribute points from 1 to 10, with the industry-group median of each data point scored as 5.

"The question of materiality, or in other words, the relative weight, is determined by the relative median value for
a company in that industry group. The relative median values for each industry group to which the data point is
material are compared, and decile ranks are assigned. The decile rank determines the relative weight
assigned to that data point in determining the industry weight – from 1 to 10"
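
The decile-rank weighting described in the quote can be sketched as follows. This is a minimal illustration under assumptions: the toy values, the helper name `industry_weights`, and the choice to rank medians with `ceil(rank * 10 / n)` are hypothetical and not taken from the quoted methodology.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame(
    {
        "industry": ["Energy", "Energy", "Utilities", "Utilities", "Health Care", "Health Care"],
        "ghg_emis_per_MEUR_revenue": [0.52, 0.33, 1.01, 0.62, 0.01, 0.03],
    }
)


def industry_weights(df: pd.DataFrame, value_col: str, group_col: str = "industry") -> pd.Series:
    """Return a 1-10 weight per industry group from the decile rank of its median."""
    medians = df.groupby(group_col)[value_col].median().dropna()
    ranks = medians.rank(method="average")  # 1 = lowest median
    # ceil(rank * 10 / n) maps rank positions onto deciles 1..10
    weights = np.ceil(ranks * 10 / len(medians)).astype(int)
    return weights.rename("weight")


weights = industry_weights(toy, "ghg_emis_per_MEUR_revenue")
print(weights.sort_values())
```

With only three industry groups the weights are spread coarsely (4, 7, 10); with more groups the ranks fill the 1 to 10 range more evenly.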

<center><span style="font-size:30px; font-weight: bold;">Nordic Compass Database</span></center>
<center><span style="font-size:24px;">Analysis of Environmental Performance and CSRD Compliance</span></center>

<center><span style="font-size:22px;"><b>Section 3:</b> Impact analysis </span></center>

## Introduction to this section

## Imports

In [2]:
import pandas as pd
import numpy as np
import sys
import os

pd.set_option("display.max_columns", None)
sys.path.append(os.path.abspath(".."))
# import random

from functions import test_company, show_missing_values
from IPython.display import display

pd.options.display.float_format = "{:,.2f}".format

## Impact (materiality)

In [3]:
impact_df = pd.read_csv("../datasets/impact_df_original.csv")

In [4]:
impact_df.describe()

,year,csrd_2025,csrd_2027,base_year,external_audit_of_ESG_report,revenue_MEUR,energy_consump_GJ,water_withdraw_thm3,ghg_emis_kt,transport_emis_kt,ghg_emis_per_MEUR_revenue,water_withdraw_per_MEUR_revenue
count,1837.0,1837.0,1837.0,1837.0,1837.0,1802.0,1109.0,648.0,1232.0,946.0,1228.0,646.0
mean,2020.43,0.61,0.33,2019.13,0.4,2476.33,11051733.2,259321.08,6633063.26,1290251.09,2396.5,614.96
std,1.11,0.49,0.47,0.51,0.49,7014.54,48349169.42,2979205.87,152416032.97,39340312.81,53105.08,12379.17
min,2019.0,0.0,0.0,2019.0,0.0,0.03,0.05,0.0,0.0,0.0,0.0,0.0
25%,2019.0,0.0,0.0,2019.0,0.0,217.1,60449.4,100.89,1.89,2.62,0.0,0.07
50%,2020.0,1.0,0.0,2019.0,0.0,601.58,423277.92,562.18,16.37,25.03,0.01,0.41
75%,2021.0,1.0,1.0,2019.0,1.0,1912.4,2420142.0,5754.5,137.67,389.67,0.07,4.27
max,2022.0,1.0,1.0,2022.0,1.0,143208.96,489600000.0,63372912.0,4834768000.0,1210000000.0,1653973.29,312843.68


In [5]:
impact_df.head()

,company,ticker,year,csrd_2025,csrd_2027,segment,industry,hq_country,base_year,external_audit_of_ESG_report,revenue_MEUR,energy_consump_GJ,water_withdraw_thm3,ghg_emis_kt,transport_emis_kt,ghg_emis_per_MEUR_revenue,water_withdraw_per_MEUR_revenue
0,Archer Ltd.,ARCHO,2020,1,0,Mid,Energy,Norway,2020,1,735.71,459927.0,,,,,
1,AutoStore Holdings Ltd.,AUTO,2021,0,1,Large,Industrial Goods and Services,Bermuda,2021,0,292.5,,,0.74,371.92,0.0,
2,Avance Gas Holding ltd,AGAS,2019,0,1,Mid,Energy,Norway,2019,0,223.59,,,,,,
3,Avance Gas Holding ltd,AGAS,2020,0,0,Mid,Energy,Norway,2019,1,183.68,5934145.0,,,,,
4,Borr Drilling Ltd,BDRILL,2019,1,0,Mid,Energy,Bermuda,2019,0,291.85,1980428.4,,150.78,43.67,0.52,


## Further cleaning

In [6]:
impact_df.groupby("segment")["revenue_MEUR"].describe()

segment,count,mean,std,min,25%,50%,75%,max
Large,1015.0,3987.52,9022.35,1.95,409.41,1313.8,3726.17,143208.96
Mid,777.0,527.22,992.48,0.03,102.82,304.23,649.11,20718.0
Small,10.0,537.9,428.82,12.28,108.39,602.97,823.31,1194.64


## First, visualise...

In [7]:
from typing import Tuple, Union, Optional, List, Dict


def calculate_bins_for_eda(
    data: pd.Series,
) -> Tuple[Union[float, int], Union[float, int], Union[float, int], int]:
    """
    Calculates bin parameters for numeric and categorical data.

    Parameters:
        data (pd.Series): A Pandas Series (column) to calculate bins for.

    Returns:
        Tuple[Union[float, int], Union[float, int], Union[float, int], int]:
            - bin_start (float or int): The start of the bins (always 0 for numeric data).
            - bin_end (float or int): The end of the bins.
            - bin_size (float or int): The size of each bin (always 1 for categorical data).
            - n_bins (int): The number of bins.
    """
    if isinstance(data.dtype, pd.CategoricalDtype) or pd.api.types.is_object_dtype(
        data
    ):
        unique_values: int = len(data.unique())
        bin_start: int = 0
        bin_end: int = unique_values
        bin_size: int = 1
        n_bins: int = unique_values
    else:
        data = data.dropna()
        if data.empty:
            return 0, 1, 1, 1

        data_min: float = max(0, data.min())
        data_max: float = data.max()
        data_range: float = data_max - data_min

        n_bins: int = min(50, len(data.unique()))  # maximum of 50 bins
        bin_size: float = max(1, round(data_range / n_bins))  # avoid zero-sized bins

        # Adjust bin size to a more rounded number (e.g., 1, 2, 5, 10, etc.)
        if bin_size >= 10:
            scale = 10 ** (len(str(int(bin_size))) - 1)
            bin_size = round(bin_size / scale) * scale

        bin_start: float = 0
        bin_end: float = np.ceil(data_max / bin_size) * bin_size

    return bin_start, bin_end, bin_size, n_bins

In [8]:
import pandas as pd
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Define the columns to plot outside the function for flexibility
columns_to_include = [
    # "year",
    # "segment",
    # "industry",
    # "hq_country",
    # "external_audit_of_ESG_report",
    # "base_year",
    "revenue_MEUR",
    "energy_consump_GJ",
    "water_withdraw_thm3",
    "ghg_emis_kt",
    "transport_emis_kt",
    "ghg_emis_per_MEUR_revenue",
    "water_withdraw_per_MEUR_revenue",
]


def chart_visualisations(
    df: pd.DataFrame,
    columns_to_include: list,
    legend_column: Optional[str] = None,
    n_cols: int = 3,
) -> go.Figure:
    """
    Creates a subplot grid (n_cols columns wide, 3 by default) for the provided
    DataFrame columns, displaying a histogram with equal-width bins for each column.

    Parameters:
    ----------
    df : pd.DataFrame
        The input DataFrame containing the data to visualize.

    columns_to_include : list
        A list of columns to visualize.

    legend_column : str, optional
        Name of a grouping column. It is currently only passed through as the legend title
        (the legend itself is hidden); the traces are not split by group.

    n_cols : int, optional
        The number of columns in the subplot layout.

    Returns:
    -------
    go.Figure
        The Plotly figure object containing the histograms.
    """
    relevant_columns = [col for col in columns_to_include if col in df.columns]
    n_rows = -(-len(relevant_columns) // n_cols)
    subplot_titles = [f"{col}" for col in relevant_columns]

    fig = make_subplots(
        rows=n_rows,
        cols=n_cols,
        subplot_titles=subplot_titles,
    )

    for idx, col in enumerate(relevant_columns):
        row_num = (idx // n_cols) + 1
        col_num = (idx % n_cols) + 1

        data = pd.to_numeric(df[col], errors="coerce").dropna()
        if data.empty:
            continue

        bin_start, bin_end, bin_size, n_bins = calculate_bins_for_eda(data)
        bins = np.arange(bin_start, bin_end + bin_size, bin_size)

        # Count the number of observations that fall into each bin
        counts, _ = np.histogram(data, bins=bins)

        bin_centers = (bins[:-1] + bins[1:]) / 2
        widths = bins[1:] - bins[:-1]

        fig.add_trace(
            go.Bar(
                x=bin_centers,
                y=counts,  # number of observations per bin
                width=widths,
                name=f"{col} Distribution",
                showlegend=False,
            ),
            row=row_num,
            col=col_num,
        )

        fig.update_yaxes(title_text="Count", row=row_num, col=col_num, showgrid=False)

    fig.update_layout(
        bargap=0,
        showlegend=False,
        legend_title_text=legend_column,
        height=400 * n_rows,
        width=1600,
        template="plotly_white",
    )

    return fig

In [9]:
chart_visualisations(impact_df, columns_to_include=columns_to_include)

Star Bulk Carriers: revenue should be 821.365M, not 821,000

Telenor: 9,799M

Cloetta: 649.106

In [10]:
impact_df[impact_df["revenue_MEUR"] > 40000].sort_values(
    by="revenue_MEUR", ascending=False
)

,company,ticker,year,csrd_2025,csrd_2027,segment,industry,hq_country,base_year,external_audit_of_ESG_report,revenue_MEUR,energy_consump_GJ,water_withdraw_thm3,ghg_emis_kt,transport_emis_kt,ghg_emis_per_MEUR_revenue,water_withdraw_per_MEUR_revenue
811,Equinor ASA (formerly Statoil ASA),EQNR,2022,1,0,Large,Energy,Norway,2019,1,143208.96,,6000.0,11400.0,243000.0,0.08,0.04
396,Fortum Oyj,FORTUM,2021,1,0,Large,Utilities,Finland,2019,1,112400.0,399600000.0,12359000.0,69750.7,120228.0,0.62,109.96
810,Equinor ASA (formerly Statoil ASA),EQNR,2021,1,0,Large,Energy,Norway,2019,1,79235.71,212400000.0,8000.0,12100.0,249000.0,0.15,0.1
103,A.P. Møller -Maersk A/S,MAERSK,2022,1,0,Large,Industrial Goods and Services,Denmark,2019,0,77425.45,447345000.0,916.0,34506.0,43451.0,0.45,0.01
808,Equinor ASA (formerly Statoil ASA),EQNR,2019,1,0,Large,Energy,Norway,2019,1,56170.54,252000000.0,12000.0,14900.0,247000.0,0.27,0.21
102,A.P. Møller -Maersk A/S,MAERSK,2021,1,0,Large,Industrial Goods and Services,Denmark,2019,1,55166.96,473188000.0,1834.0,37173.0,28952.0,0.67,0.03
395,Fortum Oyj,FORTUM,2020,1,0,Large,Utilities,Finland,2019,1,49015.0,489600000.0,4967000.0,49632.0,27836.4,1.01,101.34
593,AstraZeneca PLC,AZN,2022,1,0,Large,Health Care,United Kingdom,2019,0,42118.71,588963.6,3750.0,440.24,5884.0,0.01,0.09
1062,Volvo AB,VOLV,2019,1,0,Large,Industrial Goods and Services,Sweden,2019,1,41133.12,7624800.0,5706.0,324.0,,0.01,0.14
809,Equinor ASA (formerly Statoil ASA),EQNR,2020,1,0,Large,Energy,Norway,2019,1,40850.89,234000000.0,8000.0,13500.0,250000.0,0.33,0.2


Create a column: 'GHG per EUR revenue_ranking_all_companies' - This is binned from 1 to 10 (using deciles and calculated using only values from the same year)

Create a column: 'GHG per EUR revenue_ranking_sector' - This is also binned from 1 to 10 (and calculated using only values from the same year)
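
A minimal sketch of the two ranking columns above, assuming a rank-based decile (`ceil(rank * 10 / n)`) computed within each year and within each year and industry. The helper `decile_rank`, the toy values, and the output column names are hypothetical.

```python
import numpy as np
import pandas as pd


def decile_rank(values: pd.Series) -> pd.Series:
    """Map each value to a 1-10 bin by its rank within the group; NaN stays NaN."""
    n = values.notna().sum()
    if n == 0:
        return pd.Series(np.nan, index=values.index)
    ranks = values.rank(method="average")  # ties share the average rank
    return np.ceil(ranks * 10 / n)


# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame(
    {
        "year": [2020] * 4 + [2021] * 4,
        "industry": ["Energy", "Energy", "Utilities", "Utilities"] * 2,
        "ghg_emis_per_MEUR_revenue": [0.1, 0.4, 0.2, np.nan, 0.3, 0.1, 0.5, 0.2],
    }
)

col = "ghg_emis_per_MEUR_revenue"
# Rank against all companies reporting in the same year
toy["ghg_per_rev_rank_all"] = toy.groupby("year")[col].transform(decile_rank)
# Rank against companies in the same industry and year
toy["ghg_per_rev_rank_sector"] = toy.groupby(["year", "industry"])[col].transform(decile_rank)
print(toy)
```

A rank-based bin is used here instead of `pd.qcut` because `qcut` raises on duplicate bin edges unless `duplicates="drop"` is passed, and the `describe()` output earlier shows many values at or near zero in this column.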

Calculate the average GHG per EUR revenue as well as the IQR, apply the outlier transformation, and put all outliers in the '0' bin
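
One way this step might look, assuming the "outlier transformation" means Tukey's 1.5 × IQR fences (an assumption, since the note does not define it). The helper `iqr_outlier_mask` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]; NaN is never flagged."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Hypothetical toy values with one extreme observation.
values = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 50.0])

mask = iqr_outlier_mask(values)
inliers = values[~mask]

# Decile-style bins (1-10) computed on the inliers only
ranks = np.ceil(inliers.rank() * 10 / inliers.notna().sum())

# Re-align with the full index and place the outliers in bin 0
bins = ranks.reindex(values.index)
bins[mask] = 0

print("mean without outliers:", inliers.mean())
print(bins)
```

Computing the bins on the inliers first keeps a single extreme value from compressing every other company into the lowest bins.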

Create a column: 'GHG per EUR revenue_ranking_all_PY' -- This is to compare to the values from the previous year

Create a column: 'GHG per EUR revenue_ranking_sector_PY' -- This is to compare to the values from the previous year
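
The previous-year columns could be built with a self-merge on company and year, sketched below. The column names `rank_all` and `rank_all_PY` and the toy data are hypothetical, and the sketch assumes one row per `ticker` and `year`.

```python
import pandas as pd

# Hypothetical toy data: company BBB has no 2020 report.
toy = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "AAA", "BBB", "BBB"],
        "year": [2019, 2020, 2021, 2019, 2021],
        "rank_all": [3, 4, 6, 8, 7],
    }
)

# Shift each row forward one year so it lines up with the following year's row
prev = toy[["ticker", "year", "rank_all"]].copy()
prev["year"] = prev["year"] + 1
prev = prev.rename(columns={"rank_all": "rank_all_PY"})

toy = toy.merge(prev, on=["ticker", "year"], how="left")
print(toy)
```

A merge is used rather than `groupby(...).shift(1)` because `shift` takes the previous row, which for BBB would wrongly pair 2021 with 2019.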

Create a column: '% change in GHG per EUR revenue vs PY'

Create a column: '% change in GHG emissions vs PY'
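
A hedged sketch of a year-over-year percentage change that could serve both columns above and any other numeric column such as `transport_emis_kt`. The helper `pct_change_vs_py` and the toy data are hypothetical, and the sketch assumes one row per `ticker` and `year`.

```python
import numpy as np
import pandas as pd


def pct_change_vs_py(
    df: pd.DataFrame, value_col: str, id_col: str = "ticker", year_col: str = "year"
) -> pd.Series:
    """Percentage change versus the same company's value in year - 1 (NaN if unavailable)."""
    cur = df[[id_col, year_col, value_col]]
    prev = cur.copy()
    prev[year_col] = prev[year_col] + 1  # align the prior year with the current year
    merged = cur.merge(prev, on=[id_col, year_col], how="left", suffixes=("", "_PY"))
    py = merged[f"{value_col}_PY"].replace(0, np.nan)  # avoid division by zero
    result = (merged[value_col] - py) / py * 100
    result.index = df.index  # the left merge keeps the row order of df
    return result


# Hypothetical toy data: company B has no 2020 report.
toy = pd.DataFrame(
    {
        "ticker": ["A", "A", "A", "B", "B"],
        "year": [2019, 2020, 2021, 2019, 2021],
        "ghg_emis_kt": [100.0, 80.0, 100.0, 50.0, 25.0],
    }
)
toy["ghg_emis_pct_change_vs_PY"] = pct_change_vs_py(toy, "ghg_emis_kt")
print(toy)
```

Rows without a report for the immediately preceding year are left as NaN instead of being compared with an older year.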

Create a column: 'transport emissions as a % of total emissions'

Create a column: '% change in transport emissions vs PY'

Create a column: 'Transport emissions as % of total emissions' (compare to sector)
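
The transport-share columns might be sketched as below. The output column names and toy values are hypothetical, and comparing against the sector median (not the mean) is an assumption. In the rows shown earlier, `transport_emis_kt` sometimes exceeds `ghg_emis_kt`, so the sketch flags shares above 100% for inspection instead of clipping them.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the dataset.
toy = pd.DataFrame(
    {
        "year": [2021] * 4,
        "industry": ["Energy", "Energy", "Utilities", "Utilities"],
        "ghg_emis_kt": [100.0, 200.0, 0.0, 50.0],
        "transport_emis_kt": [10.0, 60.0, 5.0, 75.0],
    }
)

# Treat a zero total as missing so the share is NaN, not infinite
total = toy["ghg_emis_kt"].replace(0, np.nan)
toy["transport_share_pct"] = toy["transport_emis_kt"] * 100 / total

# Difference, in percentage points, from the median share of the same industry and year
sector_median = toy.groupby(["year", "industry"])["transport_share_pct"].transform("median")
toy["transport_share_vs_sector_pp"] = toy["transport_share_pct"] - sector_median

# Shares above 100% suggest a unit or reporting mismatch worth checking
toy["transport_share_over_100"] = toy["transport_share_pct"] > 100
print(toy)
```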


Use the bins only for GHG emissions/EUR, then compare the values in each bin for all columns.

See how bin values vary from year to year

Calculate the number of companies that have migrated from bin to bin
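
Bin migration between two years can be summarised with a transition table, sketched here under assumptions: the column name `bin`, the two years compared, and the toy data are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one bin per company and year.
toy = pd.DataFrame(
    {
        "ticker": ["A", "B", "C", "D", "A", "B", "C", "D"],
        "year": [2020] * 4 + [2021] * 4,
        "bin": [1, 1, 2, 3, 1, 2, 2, 1],
    }
)

# One row per company, one column per year; keep companies present in both years
wide = toy.pivot(index="ticker", columns="year", values="bin").dropna()

# Rows: bin in 2020, columns: bin in 2021, cells: number of companies
transitions = pd.crosstab(
    wide[2020], wide[2021], rownames=["bin_2020"], colnames=["bin_2021"]
)

# Companies whose bin changed between the two years
n_moved = int((wide[2020] != wide[2021]).sum())

print(transitions)
print("companies that changed bin:", n_moved)
```

The diagonal of the table counts companies that stayed in the same bin; everything off the diagonal is a migration.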





#### Bonus columns

'GHG intensity reduction % vs sector-specific targets': normalise so that it is a % above or below target

'GHG intensity reduction % vs others in the sector_CY': also normalise (and consider whether positive is good or bad)
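
A rough sketch of the first bonus column. The sector targets below are placeholders invented for illustration (no targets appear in the dataset), and the column names are hypothetical. The sign convention assumed here is that a positive value means the company reduced intensity by more than its target.

```python
import pandas as pd

# Placeholder targets for illustration only: required % reduction in GHG intensity per year.
sector_targets_pct = {"Energy": 5.0, "Utilities": 4.0}

# Hypothetical toy data: GHG intensity in the previous (PY) and current (CY) year.
toy = pd.DataFrame(
    {
        "industry": ["Energy", "Energy", "Utilities"],
        "intensity_PY": [0.50, 0.40, 1.00],
        "intensity_CY": [0.45, 0.39, 0.95],
    }
)

# Achieved reduction as a % of the previous year's intensity (positive = lower intensity)
toy["reduction_pct"] = (toy["intensity_PY"] - toy["intensity_CY"]) / toy["intensity_PY"] * 100

# Normalise against the sector target: +100 means twice the target, -50 means half of it
toy["target_pct"] = toy["industry"].map(sector_targets_pct)
toy["reduction_vs_target_pct"] = (toy["reduction_pct"] - toy["target_pct"]) / toy["target_pct"] * 100
print(toy)
```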


