## To do:

- Emphasise that this can be used to target new customers--you can immediately see which companies have missing X, Y and Z as part of their gap analysis...
- You can see which data companies find it hardest to collect, etc...

- Show those companies/number of companies that have migrated from one ranking to the next from year to year. (Use the example of a credit scorecard as inspiration)
- Consider grouping them
- Rating vs. Ranking -- give them a ranking relative to their competitors, and a rating based on the number of metrics they are successfully reporting.

Materiality, in other words the relative weight of a data point, is determined by the relative level of
disclosure in each industry group. For each industry group to which the data point is material, the disclosure
percentage is identified and decile ranks are assigned. The decile rank determines the relative weight assigned
to that data point when calculating the industry weight, from 1 to 10.
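
The decile idea described above can be sketched as follows. This is only a rough illustration, not the exact procedure from the methodology being quoted: the function name `decile_weights` is hypothetical, ties are handled with pandas' default average rank, and the disclosure rates are the rounded `ghg_emis_bool` means reported per industry later in this notebook.

```python
import numpy as np
import pandas as pd


def decile_weights(disclosure_pct: pd.Series) -> pd.Series:
    """Map each industry's disclosure rate to a decile rank from 1 to 10 (hypothetical helper)."""
    # Percentile rank of each industry's disclosure rate; ties share their average rank
    pct_rank = disclosure_pct.rank(pct=True)
    # Round up to the next decile and keep the result within 1..10
    return np.ceil(pct_rank * 10).clip(1, 10).astype(int)


# Share of companies reporting ghg_emis_bool in each industry (rounded values)
ghg_disclosure = pd.Series(
    {
        "Basic Materials": 0.88,
        "Consumer Goods and Services": 0.76,
        "Energy and Utilities": 0.66,
        "Finance": 0.68,
        "Health Care": 0.52,
        "Industrial Goods and Services": 0.66,
        "Other": 0.73,
        "Technology": 0.52,
    }
)

weights = decile_weights(ghg_disclosure)
print(weights.sort_values(ascending=False))
```

With only eight industry groups, several deciles are never used, so a coarser banding might be more sensible for this dataset.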

- Add a column that says: "required to comply with CSRD" -- assume that it's only those with turnover above 50 MEUR

#### Summary columns

Summarise results by:

- Segment/Industry

- HQ country

Declarations per year

--check which industry has the highest % of missing values

Percentage of companies in each industry that have their sustainability work audited

Create 'gap analysis: total missing metrics (coverage of metrics)'

Carbon intensive industries = Energy, Materials/Basic Materials, Industrials, Utilities
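
A possible starting point for the "gap analysis: total missing metrics" item and the "highest % of missing values" check is sketched below. It assumes that a Boolean value of 0 means the metric is missing; the function name `gap_analysis` and the toy rows are hypothetical.

```python
import pandas as pd


def gap_analysis(df: pd.DataFrame, metric_cols: list) -> pd.DataFrame:
    """List, for each company/year row, which metrics are not reported (hypothetical helper)."""
    reported = df[metric_cols].eq(1)
    out = df[["company", "year", "industry"]].copy()
    # Number of metrics with a value of 0
    out["missing_count"] = (~reported).sum(axis=1)
    # Names of the metrics with a value of 0
    out["missing_metrics"] = [
        [col for col, ok in zip(metric_cols, row) if not ok]
        for row in reported.to_numpy()
    ]
    return out


# Toy data with hypothetical values
toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "C"],
        "year": [2022, 2022, 2022],
        "industry": ["Finance", "Finance", "Technology"],
        "ghg_emis_bool": [1, 0, 1],
        "water_withdraw_bool": [0, 0, 1],
    }
)
metric_cols = ["ghg_emis_bool", "water_withdraw_bool"]

gaps = gap_analysis(toy_df, metric_cols)

# Average share of metrics missing in each industry
missing_share = 1 - toy_df.groupby("industry")[metric_cols].mean().mean(axis=1)
print(gaps)
print(missing_share.sort_values(ascending=False))
```

Sorting `gaps` by `missing_count` gives a simple list of companies with the largest reporting gaps, which is the kind of view the lead-targeting idea at the top of these notes would need.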

# Reporting: Gap analysis

<center><span style="font-size:30px; font-weight: bold;">Nordic Compass Database</span></center>
<center><span style="font-size:24px;">Analysis of Environmental Performance and CSRD Compliance</span></center>

<center><span style="font-size:22px;"><b>Section 2:</b> Gap analysis </span></center>

## Plan for sector-specific environmental scoring calculation:

1) Calculate the mean value of each parameter for a given industry (pool all years rather than calculating per year, so that scores can be compared between years).
2) Divide each mean by the sum of the mean values of all parameters for that industry (e.g. mean_env_policy + mean_ghg_emis_bool). This gives the magnitude of the parameter, i.e. its weight within the industry.
3) Calculate the value of each parameter for a given company and the associated percentile (e.g. a value of 1 might correspond to a percentile of 0.81).
4) Multiply the company's percentile by the magnitude of the column. This ensures that the most important columns are prioritised.
5) Sum the weighted scores for the company. This gives the company's raw_sector_score.
6) Take the raw_sector_score of every company in that sector and calculate the percentile. This gives the company's adjusted_sector_score, which can be considered its sector-specific E(SG) score.

Note: the advantage of this approach is that scores can be compared within a sector across years. The drawback is that the weights shift whenever another year of data is added. Using the base year of 2019 to calculate the weights would avoid this, but reporting priorities may have changed in recent years, so it is probably best to stick with the overall mean.
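
The six steps of the sector-specific plan can be sketched as below. This is a minimal illustration under stated assumptions rather than a final implementation: the function name `sector_scores` and the toy data are hypothetical, and percentiles are taken with pandas' `rank(pct=True)`, which may differ slightly from other percentile definitions.

```python
import pandas as pd


def sector_scores(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Sketch of the sector-specific scoring plan (hypothetical helper)."""
    # Steps 1-2: industry means pooled over all years, normalised to weights per industry
    means = df.groupby("industry")[cols].mean()
    weights = means.div(means.sum(axis=1), axis=0)

    # Step 3: percentile of each company's value within its industry
    pct = df.groupby("industry")[cols].rank(pct=True)

    # Steps 4-5: weight each percentile by its industry's weight, then sum per row
    row_weights = weights.loc[df["industry"]].to_numpy()
    out = df[["company", "year", "industry"]].copy()
    out["raw_sector_score"] = (pct.to_numpy() * row_weights).sum(axis=1)

    # Step 6: percentile of the raw score within the industry
    out["adjusted_sector_score"] = out.groupby("industry")["raw_sector_score"].rank(
        pct=True
    )
    return out


# Toy data with hypothetical values
toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "C", "D", "E"],
        "year": [2019] * 5,
        "industry": ["Finance", "Finance", "Finance", "Technology", "Technology"],
        "ghg_emis_bool": [1, 0, 1, 1, 0],
        "supplier_guidelines": [1, 1, 0, 1, 0],
    }
)

scores = sector_scores(toy_df, ["ghg_emis_bool", "supplier_guidelines"])
print(scores)
```

Note that an industry in which no company reports any metric would have a zero sum of means, and its weights would be undefined (NaN).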

## Plan for overall environmental scoring calculation:

1) Calculate the mean value for each industry for a given parameter (use overall). Ignore magnitude.
2) This will give you the transparency weighting. The more companies fill in this metric, the more transparent it is.
3) Calculate the transparency of each metric relative to the total: this will give you the weighting of each metric.
4) For each company/year, calculate their overall percentile for each metric.
5) Multiply each percentile by the weighting of the metric it applies to, then sum all of the values.
6) This will give the company's raw_overall_score.
7) Now calculate the percentile for each company relative to others in the same year (or maybe overall). This will give the company's adjusted_overall_score 
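
The overall plan can be sketched in the same way. Again this is only an illustration under assumptions: the function name `overall_scores` and the toy data are hypothetical, the transparency of a metric is taken to be its mean over all rows, and the final percentile is calculated within each year.

```python
import pandas as pd


def overall_scores(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Sketch of the cross-sector scoring plan (hypothetical helper)."""
    # Steps 1-3: transparency of each metric and its weight relative to the total
    transparency = df[cols].mean()
    weights = transparency / transparency.sum()

    # Step 4: percentile of each company/year row for each metric, across all rows
    pct = df[cols].rank(pct=True)

    # Steps 5-6: weighted sum of percentiles
    out = df[["company", "year"]].copy()
    out["raw_overall_score"] = pct.mul(weights, axis=1).sum(axis=1)

    # Step 7: percentile of the raw score relative to other companies in the same year
    out["adjusted_overall_score"] = out.groupby("year")["raw_overall_score"].rank(
        pct=True
    )
    return out


# Toy data with hypothetical values
toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "A", "B"],
        "year": [2019, 2019, 2020, 2020],
        "ghg_emis_bool": [1, 0, 1, 1],
        "water_withdraw_bool": [0, 0, 1, 0],
    }
)

overall = overall_scores(toy_df, ["ghg_emis_bool", "water_withdraw_bool"])
print(overall)
```

Ranking within each year keeps the adjusted score comparable between companies in the same year, while ranking over all rows instead would make it comparable across years.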


^^This score will give the company its reporting score. In the next section, calculate its emission score.
For the overall score, maybe consider doing a 66:33 split for emissions:reporting.
^^ Is it even necessary to do an overall score if I'm using this for business leads/identifying potential businesses to target?





## Introduction to this section

In the previous section, I cleaned the original dataset to ensure that all companies were entered under a single name and had a consistent ticker, segment, industry and country. I then removed any duplicates, transformed any anomalous values in Boolean columns (columns that accept only 0 or 1), and set a base year of 2019. All data prior to 2019 was deleted, new columns were added, and the data frame was divided into a reporting_df and an impact_df.

In this section, I analyse the reporting_df, which shows how well each company is meeting its environmental reporting requirements under CSRD (based on the data available). It is important to note that CSRD reporting obligations first apply to the 2024 financial year, whereas the most recent data in the dataset is from 2022.

## Imports

In [375]:
import pandas as pd
import numpy as np
import sys
import os
import plotly.graph_objects as go
from plotly.subplots import make_subplots

pd.set_option("display.max_columns", None)
sys.path.append(os.path.abspath(".."))
import random

from functions import (
    test_company,
    show_missing_values,
    chart_visualisations,
    summarise_boolean_values,
    parameters_by_year,
)

from IPython.display import display

pd.options.display.float_format = "{:,.2f}".format

# HERE - build the scoring system

In [376]:
scoring_cols = [
    "external_audit_of_ESG_report",
    "ceo_sust_statem",
    "environmental_policy_and_assessment",
    "environmental_performance_targets",
    "reduced_environmental_impact",
    "increased_renewable_energy",
    "disclosure_of_raw_material_use",
    "resource_efficiency_target",
    "disclosure_of_water_discharges",
    "supplier_guidelines",
    "disclosure_of_suppliers_audited",
    "disclosure_of_supplier_evaluation_procedures",
    "supplier_environmental_assessment",
    "energy_consump_bool",
    "water_withdraw_bool",
    "ghg_emis_bool",
    "transport_emis_bool",
]

In [377]:
reporting_df["industry"].value_counts()

industry
Industrial Goods and Services    441
Finance                          387
Consumer Goods and Services      329
Health Care                      204
Technology                       149
Energy and Utilities             137
Other                            102
Basic Materials                   88
Name: count, dtype: int64

In [378]:
import numpy as np
import pandas as pd


def calculate_industry_weights(reporting_df, scoring_cols):
    """
    Calculate industry-level materiality weights for ESG scoring.

    Parameters:
    - reporting_df (pd.DataFrame): The ESG dataset containing company data.
    - scoring_cols (list): List of columns to be used for scoring.

    Returns:
    - pd.DataFrame: A DataFrame with industry-level materiality weights.
    """
    # Step 1: Compute industry-level means across all years
    industry_means = reporting_df.groupby("industry")[scoring_cols].mean()

    # Step 2: Compute industry materiality score (sum of all means per industry)
    industry_means["industry_materiality_score"] = industry_means.sum(axis=1)

    # Step 3: Compute materiality weight for each variable
    materiality_weights = {}
    for col in scoring_cols:
        materiality_weights[f"{col}_materiality_weight"] = (
            industry_means[col] / industry_means["industry_materiality_score"]
        )

    # Create a new DataFrame with interleaved original and materiality weight columns
    interleaved_columns = []
    for col in scoring_cols:
        interleaved_columns.append(col)
        interleaved_columns.append(f"{col}_materiality_weight")

    industry_weights_df = industry_means.assign(**materiality_weights)[
        interleaved_columns + ["industry_materiality_score"]
    ]

    return industry_weights_df

In [379]:
# Keep only materiality weight columns and the materiality score
# industry_weights_df = industry_means[
#     ["industry_materiality_score"]
#     + [f"{col}_materiality_weight" for col in scoring_cols]
# ]

In [380]:
industry_weights_df = calculate_industry_weights(reporting_df, scoring_cols)
industry_weights_df

industry,external_audit_of_ESG_report,external_audit_of_ESG_report_materiality_weight,ceo_sust_statem,ceo_sust_statem_materiality_weight,environmental_policy_and_assessment,environmental_policy_and_assessment_materiality_weight,environmental_performance_targets,environmental_performance_targets_materiality_weight,reduced_environmental_impact,reduced_environmental_impact_materiality_weight,increased_renewable_energy,increased_renewable_energy_materiality_weight,disclosure_of_raw_material_use,disclosure_of_raw_material_use_materiality_weight,resource_efficiency_target,resource_efficiency_target_materiality_weight,disclosure_of_water_discharges,disclosure_of_water_discharges_materiality_weight,supplier_guidelines,supplier_guidelines_materiality_weight,disclosure_of_suppliers_audited,disclosure_of_suppliers_audited_materiality_weight,disclosure_of_supplier_evaluation_procedures,disclosure_of_supplier_evaluation_procedures_materiality_weight,supplier_environmental_assessment,supplier_environmental_assessment_materiality_weight,energy_consump_bool,energy_consump_bool_materiality_weight,water_withdraw_bool,water_withdraw_bool_materiality_weight,ghg_emis_bool,ghg_emis_bool_materiality_weight,transport_emis_bool,transport_emis_bool_materiality_weight,industry_materiality_score
Basic Materials,0.65,0.05,0.67,0.05,0.99,0.08,0.93,0.07,0.91,0.07,0.35,0.03,0.73,0.06,0.69,0.06,0.53,0.04,0.88,0.07,0.6,0.05,0.58,0.05,0.81,0.06,0.9,0.07,0.85,0.07,0.88,0.07,0.52,0.04,12.47
Consumer Goods and Services,0.42,0.04,0.63,0.06,0.96,0.09,0.88,0.08,0.94,0.09,0.45,0.04,0.31,0.03,0.68,0.07,0.15,0.01,0.89,0.09,0.38,0.04,0.47,0.04,0.77,0.07,0.68,0.07,0.46,0.04,0.76,0.07,0.57,0.06,10.38
Energy and Utilities,0.26,0.03,0.59,0.07,0.91,0.11,0.82,0.09,0.85,0.1,0.19,0.02,0.05,0.01,0.54,0.06,0.25,0.03,0.71,0.08,0.38,0.04,0.38,0.04,0.59,0.07,0.66,0.08,0.31,0.04,0.66,0.08,0.49,0.06,8.65
Finance,0.42,0.05,0.57,0.07,0.9,0.11,0.78,0.09,0.87,0.1,0.35,0.04,0.09,0.01,0.46,0.05,0.01,0.0,0.77,0.09,0.2,0.02,0.3,0.04,0.63,0.07,0.59,0.07,0.3,0.04,0.68,0.08,0.63,0.07,8.56
Health Care,0.34,0.05,0.44,0.06,0.82,0.11,0.57,0.08,0.78,0.11,0.27,0.04,0.07,0.01,0.32,0.04,0.03,0.0,0.83,0.11,0.4,0.05,0.29,0.04,0.62,0.08,0.4,0.05,0.33,0.05,0.52,0.07,0.35,0.05,7.41
Industrial Goods and Services,0.4,0.04,0.63,0.07,0.94,0.1,0.84,0.09,0.94,0.1,0.38,0.04,0.12,0.01,0.59,0.06,0.06,0.01,0.85,0.09,0.41,0.04,0.44,0.05,0.69,0.07,0.6,0.06,0.32,0.03,0.66,0.07,0.47,0.05,9.35
Other,0.49,0.06,0.59,0.07,0.94,0.11,0.82,0.09,0.94,0.11,0.25,0.03,0.09,0.01,0.43,0.05,0.01,0.0,0.76,0.09,0.25,0.03,0.39,0.04,0.69,0.08,0.68,0.08,0.2,0.02,0.73,0.08,0.58,0.07,8.83
Technology,0.32,0.04,0.44,0.06,0.78,0.11,0.64,0.09,0.78,0.11,0.26,0.04,0.05,0.01,0.43,0.06,0.01,0.0,0.69,0.1,0.3,0.04,0.28,0.04,0.57,0.08,0.47,0.07,0.2,0.03,0.52,0.07,0.42,0.06,7.15


^^This is good, because then each variable has the same weight from year to year, making it easier to compare companies across years.
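
The point about stable weights can be illustrated with a small comparison of pooled and per-year weights. The toy values and the helper name `normalise_rows` are hypothetical; the normalisation itself mirrors the mean-over-sum calculation in `calculate_industry_weights`.

```python
import pandas as pd

cols = ["ghg_emis_bool", "supplier_guidelines"]

# Toy data with hypothetical values: one industry, two companies, two years
toy_df = pd.DataFrame(
    {
        "industry": ["Finance"] * 4,
        "year": [2019, 2019, 2020, 2020],
        "ghg_emis_bool": [1, 1, 1, 0],
        "supplier_guidelines": [0, 1, 1, 1],
    }
)


def normalise_rows(means: pd.DataFrame) -> pd.DataFrame:
    """Divide each row by its sum so that the weights in a row add up to 1."""
    return means.div(means.sum(axis=1), axis=0)


# Weights pooled over all years: one fixed set of weights per industry
pooled_weights = normalise_rows(toy_df.groupby("industry")[cols].mean())

# Weights recalculated for every year: these move as reporting patterns change
yearly_weights = normalise_rows(toy_df.groupby(["industry", "year"])[cols].mean())

print(pooled_weights)
print(yearly_weights)
```

In this toy case the per-year weights swap between the two metrics from 2019 to 2020, while the pooled weights stay fixed, which is what makes scores comparable across years.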

In [381]:
# import numpy as np
# import pandas as pd


# def calculate_industry_esg_scores(reporting_df, scoring_cols):
#     """
#     Calculate an industry-specific ESG score for each company.

#     Parameters:
#     - reporting_df (pd.DataFrame): The ESG dataset containing company data.
#     - scoring_cols (list): List of columns to be used for scoring.

#     Returns:
#     - pd.DataFrame: A DataFrame with company-specific ESG scores.
#     """

#     # Step 1: Compute industry-level means across all years
#     industry_means = reporting_df.groupby("industry")[scoring_cols].mean()

#     # Step 2: Compute company-level scores
#     company_scores = reporting_df.copy()

#     for col in scoring_cols:
#         # Compute the percentile score for each company within its industry
#         company_scores[f"{col}_percentile"] = company_scores.groupby("industry")[
#             col
#         ].rank(pct=True)

#         # Adjust score by industry-specific magnitude
#         company_scores[f"{col}_adjusted"] = (
#             company_scores[f"{col}_percentile"]
#             * industry_means.loc[company_scores["industry"], col].values
#         )

#     # Step 3: Compute raw industry score for each company (sum of all adjusted scores)
#     company_scores["raw_industry_score"] = company_scores[
#         [f"{col}_adjusted" for col in scoring_cols]
#     ].sum(axis=1)

#     # Step 4: Compute industry-specific percentile score
#     company_scores["adjusted_sector_score"] = company_scores.groupby("industry")[
#         "raw_industry_score"
#     ].rank(pct=True)

#     return company_scores[
#         [
#             "company",
#             "ticker",
#             "year",
#             "industry",
#             "raw_industry_score",
#             "adjusted_sector_score",
#         ]
#     ]

In [382]:
# score_df = calculate_industry_esg_scores(reporting_df, scoring_cols=scoring_cols)

# HERE - back to normal code

In [383]:
reporting_df = pd.read_csv("../datasets/reporting_df_original.csv")

In [384]:
reporting_df.dtypes

company                                         object
ticker                                          object
year                                             int64
csrd_2025                                        int64
csrd_2027                                        int64
segment                                         object
industry                                        object
hq_country                                      object
years_esg_data                                   int64
base_year                                        int64
external_audit_of_ESG_report                     int64
ceo_sust_statem                                  int64
environmental_policy_and_assessment              int64
environmental_performance_targets                int64
reduced_environmental_impact                     int64
increased_renewable_energy                       int64
disclosure_of_raw_material_use                   int64
resource_efficiency_target                       int64
disclosure

In [385]:
reporting_df["industry"].unique()

array(['Energy and Utilities', 'Industrial Goods and Services',
       'Consumer Goods and Services', 'Basic Materials', 'Finance',
       'Other', 'Technology', 'Health Care'], dtype=object)

## Exploratory Data Analysis

I define the columns to be used for visualisation. Some columns, such as 'company' or 'ticker', are excluded.

In [386]:
object_columns = [
    "year",
    "segment",
    "industry",
    "hq_country",
    "years_esg_data",
    "base_year",
]

boolean_columns = [
    "csrd_2025",
    "csrd_2027",
    "external_audit_of_ESG_report",
    "ceo_sust_statem",
    "environmental_policy_and_assessment",
    "environmental_performance_targets",
    "reduced_environmental_impact",
    "increased_renewable_energy",
    "disclosure_of_raw_material_use",
    "resource_efficiency_target",
    "disclosure_of_water_discharges",
    "supplier_guidelines",
    "disclosure_of_suppliers_audited",
    "disclosure_of_supplier_evaluation_procedures",
    "supplier_environmental_assessment",
    "energy_consump_bool",
    "water_withdraw_bool",
    "ghg_emis_bool",
    "transport_emis_bool",
]


columns_for_viz = object_columns + boolean_columns

In [387]:
# chart_visualisations(reporting_df, columns_for_viz)

Visualisations show the distribution of values for each column. I summarise the mean values for each Boolean column in the table below. Columns with high mean values indicate a high number of companies reporting/meeting this metric. 

In [388]:
excel_file_df = summarise_boolean_values(
    reporting_df, boolean_columns, ["year", "industry"]
)

In [389]:
# folder_path = r"C:\Users\james\OneDrive - University of Aberdeen\01 - Turing College\D99 - Capstone Project\Nordic Compass - ESG Performance and CSRD Compliance\datasets"
# excel_file_df.to_csv(f"{folder_path}/excel_file_df.csv")

The chart below visualises these figures for each year (green = 1; beige = 0). 

In [390]:
# parameters_by_year(reporting_df, boolean_columns, display_mode="Count")

## Score calculator

```diff
- According to the CSRD, a first set of European Sustainability Reporting Standards (ESRS) was adopted in 2023. These standards are sector-agnostic, so they apply regardless of the sector in which the company operates. Sector-specific reporting standards were expected to be introduced by June 2026, but at the time of writing (February 2025) it is likely that this requirement will be shelved (see European Commission, 2025).

- I apply the ESG scoring system used by the London Stock Exchange Group (LSEG, 2024), and calculate sector-specific scores...
```

# Insert stuff about scoring here...

I first calculate the number of metrics each company has reported in a given year (max = 17).

In [391]:
metrics = list(set(boolean_columns) - {"csrd_2025", "csrd_2027"})
reporting_df["metrics_reported"] = reporting_df[metrics].sum(axis=1)

### Sector-specific calculations

I then summarise all columns using the mean (and append the median of metrics reported). These values are later used to calculate each company's ESG score for a given year.  

In [392]:
summary_by_industry_df = (
    reporting_df.groupby(["industry", "year"])[metrics + ["metrics_reported"]]
    .mean()  # mean used here because median would either show 0 or 1
    .reset_index()
    .set_index(["industry", "year"])
)

median_count_metrics = (
    reporting_df.groupby(["industry", "year"])["metrics_reported"]
    .median()
    .reset_index()
)

median_count_metrics.rename(
    columns={"metrics_reported": "metrics_reported_median"}, inplace=True
)

# Merge the median with the original summary DataFrame
summary_by_industry_df = summary_by_industry_df.merge(
    median_count_metrics, on=["industry", "year"], how="left"
).set_index(["industry", "year"])

In [393]:
summary_by_industry_df

industry,year,ghg_emis_bool,disclosure_of_water_discharges,disclosure_of_suppliers_audited,supplier_environmental_assessment,disclosure_of_raw_material_use,resource_efficiency_target,energy_consump_bool,external_audit_of_ESG_report,environmental_performance_targets,supplier_guidelines,transport_emis_bool,disclosure_of_supplier_evaluation_procedures,environmental_policy_and_assessment,ceo_sust_statem,increased_renewable_energy,water_withdraw_bool,reduced_environmental_impact,metrics_reported,metrics_reported_median
Basic Materials,2019,0.8,0.56,0.56,0.64,0.64,0.64,0.8,0.64,0.84,0.68,0.4,0.52,1.0,0.76,0.28,0.88,0.84,11.48,13.0
Basic Materials,2020,0.88,0.52,0.52,0.84,0.52,0.76,0.88,0.64,0.96,0.92,0.44,0.4,1.0,0.88,0.32,0.8,0.88,12.16,13.0
Basic Materials,2021,0.89,0.56,0.72,0.94,0.94,0.67,1.0,0.72,1.0,1.0,0.67,0.83,1.0,1.0,0.44,0.89,1.0,14.28,14.0
Basic Materials,2022,0.95,0.5,0.65,0.85,0.9,0.7,0.95,0.6,0.95,0.95,0.65,0.65,0.95,0.0,0.4,0.85,0.95,12.45,13.0
Consumer Goods and Services,2019,0.63,0.17,0.36,0.74,0.34,0.62,0.57,0.38,0.85,0.88,0.44,0.52,0.99,0.73,0.35,0.39,0.92,9.88,11.0
Consumer Goods and Services,2020,0.76,0.2,0.38,0.75,0.29,0.77,0.76,0.43,0.91,0.93,0.56,0.45,0.98,0.82,0.44,0.48,0.94,10.83,11.0
Consumer Goods and Services,2021,0.79,0.08,0.35,0.8,0.32,0.6,0.66,0.54,0.85,0.88,0.61,0.46,0.89,0.84,0.54,0.46,0.93,10.6,12.0
Consumer Goods and Services,2022,0.88,0.13,0.43,0.79,0.28,0.74,0.76,0.31,0.91,0.88,0.71,0.43,0.97,0.0,0.49,0.53,0.97,10.21,11.0
Energy and Utilities,2019,0.47,0.17,0.39,0.5,0.03,0.42,0.58,0.17,0.78,0.64,0.25,0.33,0.94,0.69,0.19,0.28,0.92,7.75,8.0
Energy and Utilities,2020,0.66,0.24,0.41,0.61,0.1,0.8,0.73,0.27,0.9,0.68,0.39,0.29,0.93,0.8,0.12,0.27,0.88,9.1,10.0


### Cross-sector calculations

I apply the same logic as in the previous section, but across all industries.

In [394]:
summary_overall_df = (
    reporting_df.groupby(["year"])[metrics + ["metrics_reported"]]
    .mean()
    .reset_index()
    .set_index(["year"])
)

median_count_metrics = (
    reporting_df.groupby(["year"])["metrics_reported"].median().reset_index()
)

median_count_metrics.rename(
    columns={"metrics_reported": "metrics_reported_median"}, inplace=True
)

# Merge the median with the original summary DataFrame
summary_overall_df = summary_overall_df.merge(
    median_count_metrics, on=["year"], how="left"
).set_index(["year"])

In [395]:
summary_overall_df

year,ghg_emis_bool,disclosure_of_water_discharges,disclosure_of_suppliers_audited,supplier_environmental_assessment,disclosure_of_raw_material_use,resource_efficiency_target,energy_consump_bool,external_audit_of_ESG_report,environmental_performance_targets,supplier_guidelines,transport_emis_bool,disclosure_of_supplier_evaluation_procedures,environmental_policy_and_assessment,ceo_sust_statem,increased_renewable_energy,water_withdraw_bool,reduced_environmental_impact,metrics_reported,metrics_reported_median
2019,0.54,0.1,0.36,0.62,0.15,0.49,0.52,0.42,0.75,0.75,0.35,0.34,0.93,0.65,0.26,0.3,0.88,8.42,8.0
2020,0.61,0.1,0.34,0.67,0.16,0.62,0.62,0.4,0.81,0.83,0.45,0.39,0.94,0.76,0.33,0.32,0.89,9.25,10.0
2021,0.75,0.08,0.36,0.69,0.16,0.46,0.61,0.46,0.78,0.84,0.61,0.43,0.9,0.84,0.38,0.4,0.88,9.63,10.0
2022,0.81,0.09,0.33,0.71,0.15,0.53,0.68,0.31,0.83,0.84,0.68,0.4,0.87,0.0,0.4,0.4,0.9,8.91,10.0


## Evaluating company performance

To determine whether a company has improved its reporting in absolute terms since the previous year, I calculate the change in the number of metrics reported relative to the previous year.

```diff
- Then calculate the relative change vs. the previous year, i.e. the rating change (whether the company has migrated up or not).
- I can also evaluate whether a company is doing better than in the previous year using its ESG score...
```


In [396]:
# Sort by company and year so that each row follows the same company's previous report
reporting_df = reporting_df.sort_values(by=["company", "year"])

# Change in metrics reported vs. the company's previous available year.
# Group by company only: grouping by ["company", "year"] leaves one row per
# group, so shift(1) would return NaN for every row.
reporting_df["metrics_change_from_prev_year"] = reporting_df.groupby("company")[
    "metrics_reported"
].diff()

# Now consider how I am going to calculate my percentiles and rankings...

```diff
- Pillar score: A score for each pillar (within Environmental, Social and Governance). This can change from industry to industry--e.g. the environmental pillar might be worth more in the energy industry than another industry...


Numeric data:
Relative percentile ranking only applied if numeric data point is reported by a company.

Percentile rank is adopted to calculate the 10 category scores;
- How many companies are worse than the current one?
- How many have the same value?
- How many companies have a value at all?

```

$$
\text{score} = \frac{\text{no. of companies with a worse value} + \dfrac{\text{no. of companies with the same value (including the current one)}}{2}}{\text{no. of companies with a value}}
$$
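
The score equation above can be sketched in code as follows. This is a minimal illustration that assumes a higher value is better and that missing values are stored as NaN; the function name `percentile_score` is hypothetical.

```python
import numpy as np
import pandas as pd


def percentile_score(values: pd.Series) -> pd.Series:
    """Percentile score: (worse + same / 2) / companies with a value (hypothetical helper)."""
    valid = values.dropna()
    n_with_value = len(valid)

    def score(value: float) -> float:
        n_worse = (valid < value).sum()
        n_same = (valid == value).sum()  # includes the current company
        return (n_worse + n_same / 2) / n_with_value

    # Companies without a value keep NaN instead of receiving a score
    return values.map(score, na_action="ignore")


# Toy data with hypothetical values: three companies report, one does not
toy_values = pd.Series([1, 1, 0, np.nan], index=["A", "B", "C", "D"])
toy_scores = percentile_score(toy_values)
print(toy_scores)
```

pandas' `rank(pct=True)` with its default average method returns (worse + (same + 1) / 2) / n, so it sits exactly 1 / (2n) above this score for every company with a value. The ordering of companies is the same under both definitions.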

```diff
TRBC group is used as the benchmark.

Materiality matrix created in the form of category weights.

7% transparency threshold (presumably meaning that the column mean must be higher than 0.07).
Then use the industry median (add up all the medians, then take the relative weighting of each).

For Boolean data points, use transparency weights (same principle, but using the magnitude weight).
```

```diff
- For each company, sum all scores (percentile rank for each column). Add all scores for every company, and give each company a percentile rank based on the score equation above.
```

## Materiality calculation

I first calculate how material the metric is to each industry. This is done by calculating the mean value of each column. Because we are analysing Boolean columns, the higher the mean, the more companies report on this metric, and therefore the more important (material) this must be to the industry.

The higher the mean, the higher the weight that metric is given to that industry. 

I calculate the industry median first.

Each industry median is then taken as a share of the sum of all industry medians, which gives the weight of that column in the overall score.
For example, if the industry median for ghg_emis_bool is 0.6 for 'Energy' and 0.1 for 'Consumer Goods', then the weight for Energy is 0.6/(0.6+0.1) ≈ 86%.
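
The arithmetic in this example can be checked with a couple of lines. The two values are the illustrative ones from the text, not figures from the dataset.

```python
import pandas as pd

# Illustrative industry values for ghg_emis_bool, as used in the example in the text
industry_values = pd.Series({"Energy": 0.6, "Consumer Goods": 0.1})

# Each industry's share of the total
weights = industry_values / industry_values.sum()
print(weights.round(3))
```

This normalises one metric across industries. The `calculate_industry_weights` function earlier in the notebook normalises in the other direction (across metrics within one industry), so the two weightings should not be confused.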



Check p. 20 of LSEG document. London Stock Exchange Group (LSEG) method for Booleans is: (no. of companies with worse value + (0.5 * no. of companies with the same value)) / total no. of companies with a value 

In [397]:
from typing import Union


def assign_rating(percentile: float) -> str:
    """
    Assigns a rating (A+ to D-) based on the given percentile.

    Parameters:
    percentile (float): The percentile value (0 to 1).

    Returns:
    str: The corresponding rating.
    """
    if 0.916666 < percentile <= 1:
        return "A+"
    elif 0.833333 < percentile <= 0.916666:
        return "A"
    elif 0.750000 < percentile <= 0.833333:
        return "A-"
    elif 0.666666 < percentile <= 0.750000:
        return "B+"
    elif 0.583333 < percentile <= 0.666666:
        return "B"
    elif 0.500000 < percentile <= 0.583333:
        return "B-"
    elif 0.416666 < percentile <= 0.500000:
        return "C+"
    elif 0.333333 < percentile <= 0.416666:
        return "C"
    elif 0.250000 < percentile <= 0.333333:
        return "C-"
    elif 0.166666 < percentile <= 0.250000:
        return "D+"
    elif 0.083333 < percentile <= 0.166666:
        return "D"
    elif 0.0 <= percentile <= 0.083333:
        return "D-"
    else:
        return "Invalid percentile"
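
The same banding can be written more compactly with a lookup table. The sketch below is one possible alternative, not a replacement agreed anywhere in this notebook: the name `assign_rating_compact` is hypothetical, and the upper bounds are copied from the function above.

```python
from bisect import bisect_left

# Grades from lowest to highest, and the upper bound of every band except the last
GRADES = ["D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]
UPPER_BOUNDS = [
    0.083333, 0.166666, 0.250000, 0.333333, 0.416666, 0.500000,
    0.583333, 0.666666, 0.750000, 0.833333, 0.916666,
]


def assign_rating_compact(percentile: float) -> str:
    """Return the rating band (A+ to D-) for a percentile between 0 and 1."""
    # NaN fails this comparison too, so it is reported as invalid
    if not (0.0 <= percentile <= 1.0):
        return "Invalid percentile"
    # bisect_left counts the bounds strictly below the percentile,
    # so a value equal to a bound stays in the lower band
    return GRADES[bisect_left(UPPER_BOUNDS, percentile)]


for p in (0.0, 0.25, 0.26, 0.5, 0.92, 1.0, 1.5):
    print(p, assign_rating_compact(p))
```

Keeping the bounds in one list makes it easier to change the banding later, for example if fewer rating classes are wanted.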

In [398]:
# Ensure DataFrame is sorted


reporting_df = reporting_df.sort_values(by=["year", "industry", "metrics_reported"])


# Compute industry percentile (within each year & industry)


reporting_df["industry_percentile"] = reporting_df.groupby(["year", "industry"])[
    "metrics_reported"
].rank(pct=True)


# Compute overall percentile (within each year)


reporting_df["overall_percentile"] = reporting_df.groupby("year")[
    "metrics_reported"
].rank(pct=True)


# Assign ratings based on the defined scale


reporting_df["industry_rating"] = reporting_df["industry_percentile"].apply(
    assign_rating
)

reporting_df["overall_rating"] = reporting_df["overall_percentile"].apply(assign_rating)


# Display relevant columns


reporting_df[
    [
        "company",
        "year",
        "industry",
        "metrics_reported",
        "industry_percentile",
        "industry_rating",
        "overall_percentile",
        "overall_rating",
    ]
]

Unnamed: 0,company,year,industry,metrics_reported,industry_percentile,industry_rating,overall_percentile,overall_rating
475,Afarak Group Plc,2019,Basic Materials,1,0.04,D-,0.04,D-
52,Josemaria Resources Inc.,2019,Basic Materials,2,0.08,D-,0.07,D-
56,Lundin Gold Inc.,2019,Basic Materials,5,0.12,D,0.23,D+
160,H+H International A/S,2019,Basic Materials,7,0.16,D,0.37,C
54,Lucara Diamond Corp.,2019,Basic Materials,8,0.20,D+,0.46,C+
...,...,...,...,...,...,...,...,...
1303,Tobii AB,2022,Technology,12,0.89,A,0.79,A-
1764,Garo AB,2022,Technology,12,0.89,A,0.79,A-
338,Nokia Oyj,2022,Technology,13,0.96,A+,0.88,A
1136,Mycronic AB,2022,Technology,13,0.96,A+,0.88,A


In [399]:
import pandas as pd
from typing import Optional


def top_n_companies(
    df: pd.DataFrame,
    n_companies: int = 20,
    industry: Optional[str] = None,
    year: Optional[int] = None,
    hq_country: Optional[str] = None,
    segment: Optional[str] = None,
) -> pd.DataFrame:
    """
    Returns the top N companies based on metrics_reported.

    Parameters:
    df (pd.DataFrame): The reporting DataFrame.
    n_companies (int): Number of top companies to return (default = 20).
    industry (Optional[str]): Industry to filter by (default = None, includes all industries).
    year (Optional[int]): Year to filter by (default = None, includes all years).
    hq_country (Optional[str]): HQ country to filter by (default = None, includes all countries).
    segment (Optional[str]): Segment to filter by (default = None, includes all segments).

    Returns:
    pd.DataFrame: Top N companies sorted by metrics_reported.
    """

    # Create a filtered DataFrame based on user input
    filtered_df = df.copy()

    if industry is not None:
        filtered_df = filtered_df[filtered_df["industry"] == industry]

    if year is not None:
        filtered_df = filtered_df[filtered_df["year"] == year]

    if hq_country is not None:
        filtered_df = filtered_df[filtered_df["hq_country"] == hq_country]

    if segment is not None:
        filtered_df = filtered_df[filtered_df["segment"] == segment]

    # Sort by metrics_reported in descending order
    top_companies = filtered_df.sort_values(
        by="metrics_reported", ascending=False
    ).head(n_companies)

    return top_companies

In [400]:
reporting_df["industry"].unique()

array(['Basic Materials', 'Consumer Goods and Services',
       'Energy and Utilities', 'Finance', 'Health Care',
       'Industrial Goods and Services', 'Other', 'Technology'],
      dtype=object)

In [401]:
desired_columns = [
    "company",
    "year",
    "industry",
    "metrics_reported",  # Move this near the beginning
    "industry_rating",  # Move this near the beginning
    "overall_rating",  # Move this near the beginning
    "industry_percentile",
    "overall_percentile",
]

# Ensure all other columns are preserved in the order
remaining_columns = [col for col in reporting_df.columns if col not in desired_columns]

# Reorder the DataFrame columns
reporting_df = reporting_df[desired_columns + remaining_columns]

In [402]:
reporting_df[reporting_df["years_esg_data"] == 5]

Unnamed: 0,company,year,industry,metrics_reported,industry_rating,overall_rating,industry_percentile,overall_percentile,ticker,csrd_2025,csrd_2027,segment,hq_country,years_esg_data,base_year,external_audit_of_ESG_report,ceo_sust_statem,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,disclosure_of_water_discharges,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool,metrics_change_from_prev_year


In [403]:
pd.set_option("display.max_colwidth", 180)

# Create a DataFrame with unique values for each column
unique_values_df = pd.DataFrame(
    {
        "columns": reporting_df.columns,
        "unique_values": [
            reporting_df[col].unique().tolist() for col in reporting_df.columns
        ],
    }
).set_index("columns")

unique_values_df

columns,unique_values
company,"[Afarak Group Plc, Josemaria Resources Inc., Lundin Gold Inc., H+H International A/S, Lucara Diamond Corp., SP Group A/S, Boliden AB, Ahlstrom-Munksjö Oyj, Elkem ASA, Norske Sk..."
year,"[2019, 2020, 2021, 2022]"
industry,"[Basic Materials, Consumer Goods and Services, Energy and Utilities, Finance, Health Care, Industrial Goods and Services, Other, Technology]"
metrics_reported,"[1, 2, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 3, 4, 6, 0, 17]"
industry_rating,"[D-, D, D+, C-, C, C+, B-, B+, A-, A+, B, A]"
overall_rating,"[D-, D+, C, C+, B-, B, B+, A-, A, A+, D, C-]"
industry_percentile,"[0.04, 0.08, 0.12, 0.16, 0.2, 0.26, 0.34, 0.4, 0.46, 0.58, 0.68, 0.76, 0.92, 0.02247191011235955, 0.0449438202247191, 0.07303370786516854, 0.12359550561797752, 0.18539325842696..."
overall_percentile,"[0.0411522633744856, 0.06995884773662552, 0.2294238683127572, 0.3734567901234568, 0.4609053497942387, 0.5401234567901234, 0.6121399176954733, 0.6831275720164609, 0.758230452674..."
ticker,"[AFAGR, JOSE, LUG, HH, LUC, SPG, BOL, AM1, ELK, NSKOG, LUMI, HPOL, STERV, BRG, HOLM, BEIA, UPM, METSA, NHY, SSAB, OUT1V, KEMIRA, YAR, SKF, BILL, CTM, STAR B, BHG, UIE, BETCO, K..."
csrd_2025,"[0, 1]"


In [404]:
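# Note: "Energy" is not among the labels in reporting_df["industry"]
# (the closest is "Energy and Utilities"), which would explain the empty result below.
top_n_companies(reporting_df, n_companies=None, industry="Energy", year=2021)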

Unnamed: 0,company,year,industry,metrics_reported,industry_rating,overall_rating,industry_percentile,overall_percentile,ticker,csrd_2025,csrd_2027,segment,hq_country,years_esg_data,base_year,external_audit_of_ESG_report,ceo_sust_statem,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,disclosure_of_water_discharges,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool,metrics_change_from_prev_year
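
The empty result above is consistent with `"Energy"` not being one of the labels returned by `reporting_df["industry"].unique()`; the closest label is `"Energy and Utilities"`. A minimal sketch of a guard that raises instead of silently returning no rows is shown below. The helper `require_label` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd


def require_label(df: pd.DataFrame, column: str, value: str) -> None:
    """Raise ValueError if `value` never occurs in df[column] (hypothetical helper)."""
    known = sorted(df[column].dropna().unique())
    if value not in known:
        raise ValueError(f"{value!r} not found in {column!r}; known labels: {known}")


# Toy data for illustration only; not taken from reporting_df.
toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "C"],
        "industry": ["Energy and Utilities", "Finance", "Technology"],
    }
)

# An exact label passes silently.
require_label(toy_df, "industry", "Energy and Utilities")

# A partial label is rejected with a message listing the valid options.
try:
    require_label(toy_df, "industry", "Energy")
    label_error = None
except ValueError as exc:
    label_error = str(exc)
```

Calling such a check before filtering would surface a mistyped industry, segment or country label immediately, rather than as an empty table.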


# Create a summary_df to make it easier to query individual companies
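
One possible shape for such a `summary_df`, sketched on toy data, is to index by `company` and `year` so that a single company can be pulled out with `.loc`. The column subset and the toy values are assumptions for illustration; only the column names follow `reporting_df`.

```python
import pandas as pd

# Toy rows for illustration only; values are invented, column names follow reporting_df.
toy_reporting_df = pd.DataFrame(
    {
        "company": ["Alpha AB", "Alpha AB", "Beta Oyj", "Beta Oyj"],
        "year": [2021, 2022, 2021, 2022],
        "industry": ["Finance", "Finance", "Technology", "Technology"],
        "metrics_reported": [9, 12, 4, 7],
        "industry_rating": ["B", "A-", "D+", "C"],
    }
)

# Index by (company, year) so lookups do not need boolean masks.
summary_df = toy_reporting_df.set_index(["company", "year"]).sort_index()

# All years for one company (a DataFrame indexed by year).
alpha_history = summary_df.loc["Alpha AB"]

# A single company-year row (a Series of that row's columns).
beta_2022 = summary_df.loc[("Beta Oyj", 2022)]
```

Sorting the index after `set_index` keeps `.loc` lookups on the MultiIndex efficient and avoids pandas performance warnings on unsorted indexes.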

## References

European Commission, 2025. Proposal for a Directive of the European Parliament and of the Council amending Directives (EU) 2022/2464 and (EU) 2024/1760 as regards the dates from which Member States are to apply certain corporate sustainability reporting and due diligence requirements. COM(2025) 80 final. Brussels. Available at: https://commission.europa.eu/document/download/0affa9a8-2ac5-46a9-98f8-19205bf61eb5_en?filename=COM_2025_80_EN.pdf (Accessed: 27 February 2025)

LSEG, 2024. Environmental, Social and Governance Scores from LSEG: October 2024. Available at: https://www.lseg.com/content/dam/data-analytics/en_us/documents/methodology/lseg-esg-scores-methodology.pdf (Accessed: 27 February 2025)

## Appendix

In [405]:
# # 2. Create 'consecutive_years_esg_data' by checking consecutive years starting from 2022
# def calculate_consecutive_years(group):
#     # Create a set of years for the current 'comp_name'
#     years = set(group["year"])
#     # Start from 2022 and count consecutive years backwards
#     count = 0
#     for year in range(2022, 2019, -1):  # Checks 2022, 2021, 2020; the stop value 2019 is excluded
#         if year in years:
#             count += 1
#         else:
#             break  # Stop if any year is missing in the consecutive sequence

#     return count


# # Compute the count per 'comp_name', then map it back onto every row by company name
# consecutive_years = df.groupby("comp_name").apply(calculate_consecutive_years)
# df["consecutive_years_esg_data"] = df["comp_name"].map(consecutive_years)
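
The commented-out cell above counts how many consecutive years of data each company has, working backwards from the latest year. A minimal runnable sketch of the same idea on toy data follows. The `latest_year` and `earliest_year` arguments and the toy values are assumptions, and the per-company result is mapped back onto the rows with `Series.map` so that it aligns by company name rather than by row position.

```python
import pandas as pd


def count_consecutive_years(years, latest_year=2022, earliest_year=2019):
    """Count consecutive years present, walking back from latest_year."""
    present = set(years)
    count = 0
    for year in range(latest_year, earliest_year - 1, -1):
        if year not in present:
            break  # Stop at the first gap in the sequence
        count += 1
    return count


# Toy data for illustration only; values are invented.
toy_df = pd.DataFrame(
    {
        "comp_name": ["A", "A", "A", "B", "B", "C"],
        "year": [2020, 2021, 2022, 2019, 2022, 2021],
    }
)

# One count per company, indexed by comp_name.
per_company = toy_df.groupby("comp_name")["year"].agg(count_consecutive_years)

# Broadcast the per-company count back to every row of that company.
toy_df["consecutive_years_esg_data"] = toy_df["comp_name"].map(per_company)
```

In the toy data, company A has 2020 to 2022 and gets 3, company B has a gap at 2021 and gets 1, and company C has no 2022 row and gets 0.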

Metrics change from base year

In [406]:
# reporting_df = reporting_df.merge(
#     reporting_df.loc[
#         reporting_df["year"] == reporting_df["base_year"],
#         ["company", "metrics_reported"],
#     ],
#     on="company",
#     how="left",
#     suffixes=("", "_base_year"),
# )

# # Compute the change from base year
# reporting_df["metrics_change_from_base_year"] = (
#     reporting_df["metrics_reported"] - reporting_df["metrics_reported_base_year"]
# )

# reporting_df.loc[
#     reporting_df["year"] == reporting_df["base_year"], "metrics_change_from_base_year"
# ] = float("nan")

# reporting_df.drop("metrics_reported_base_year", axis=1, inplace=True)
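
The commented-out cell above computes, for each row, the difference between `metrics_reported` and the same company's value in its `base_year`. A minimal sketch of that calculation on toy data is given below. It swaps the merge for a `Series.map` lookup, which avoids the temporary `metrics_reported_base_year` column; the toy values are invented, and the sketch assumes each company has exactly one base-year row.

```python
import pandas as pd

# Toy data for illustration only; values are invented.
toy_df = pd.DataFrame(
    {
        "company": ["A", "A", "A", "B", "B"],
        "year": [2020, 2021, 2022, 2021, 2022],
        "base_year": [2020, 2020, 2020, 2021, 2021],
        "metrics_reported": [5, 8, 11, 10, 7],
    }
)

# Metrics reported in each company's base year, indexed by company.
base_metrics = toy_df.loc[toy_df["year"] == toy_df["base_year"]].set_index(
    "company"
)["metrics_reported"]

# Difference from the base-year value, aligned by company name.
toy_df["metrics_change_from_base_year"] = toy_df["metrics_reported"] - toy_df[
    "company"
].map(base_metrics)

# The base-year row itself has no meaningful change, so set it to NaN.
is_base_year = toy_df["year"] == toy_df["base_year"]
toy_df["metrics_change_from_base_year"] = toy_df[
    "metrics_change_from_base_year"
].mask(is_base_year)
```

If a company could have more than one row for its base year, the lookup would need deduplicating first, since `Series.map` requires a unique index on the mapping Series.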