## To do:

- Emphasise that this can be used to target new customers--you can immediately see which companies have missing X, Y and Z as part of their gap analysis...
- You can see which data companies find it hardest to collect, etc...

- Show those companies/number of companies that have migrated from one ranking to the next from year to year. (Use the example of a credit scorecard as inspiration)
- Consider grouping them
- Coverage rating vs. Ranking -- give them a ranking relative to their competitors, and a coverage rating based on the number of metrics they are successfully reporting.

Materiality, in other words the relative weight of a data point, is determined by its relative level of disclosure
within each industry group. The disclosure percentage is identified for each industry group to which the data point
is material, and decile ranks are assigned. The decile rank determines the relative weight (from 1 to 10) assigned
to that data point when calculating the industry weight.

#### Summary columns

Summarise results by:

- Segment/Industry

- HQ country

Declarations per year

--check which industry has the highest % of missing values

Percentage of companies in each industry that have their sustainability work audited

Create 'gap analysis: total missing metrics (coverage of metrics)'

Carbon intensive industries = Energy, Materials/Basic Materials, Industrials, Utilities

- Map the results of 'ceo_sust_statem' from 2021 onto the 2022 results. I think this is the fairest way.

<center><span style="font-size:30px; font-weight: bold;">Nordic Compass Database</span></center>
<center><span style="font-size:24px;">Analysis of Environmental Performance and CSRD Compliance</span></center>

<center><span style="font-size:22px;"><b>Section 2:</b> Reporting | Gap analysis </span></center>

## Introduction to this section

In the previous section, I cleaned the original dataset to ensure that all companies were entered under a single name and had a consistent ticker, segment, industry and country. I then removed any duplicates, transformed any anomalous values in Boolean columns (columns that accept only 0 or 1), and set a base year of 2019. All data prior to 2019 were deleted, new columns were added, and the data frame was divided into a reporting_df and an impact_df. 

In this section, I analyse the reporting_df, which shows how well each company is meeting its environmental reporting requirements under CSRD, based on the available data. It is important to note that the first CSRD reporting requirements apply from the 2024 financial year, but the most recent data in the dataset are from 2022.

I first explore the dataset to see how reporting varies from one metric to the next. This might be a useful guide to help understand which services might be in demand in the near future. For example, we see that many companies do not disclose details about water discharges. This may be because they find it hard to measure, and therefore will require help to report that metric.

I use this to run a gap analysis, which shows the percentage of metrics reported...

Finally, I create a score calculator, based on the London Stock Exchange Group's methodology (LSEG, 2024), to give each company an environmental score for a given year. This score calculator is limited by the data available: the Nordic Compass dataset has around 17 relevant columns for measuring a company's environmental score, whereas LSEG used 68 metrics. Some industries also contain only a small number of companies, which makes an industry-specific score unreliable, so those industries have been merged with others. This trade-off would not be required if the dataset were larger. My system is a simplified version, but it can be useful for measuring a company's progress on environmental issues and for identifying companies that are performing well, or poorly, relative to their competitors.

To create the score calculator, I first decide what metrics are material to each industry. This is done using materiality weight, which is calculated...

```diff
- Insert here...

```

I then convert that score into an industry-specific rating and an overall rating. The industry-specific rating compares the company's performance in that year to 

## Imports

In [96]:
import pandas as pd
import numpy as np
import sys
import os
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from typing import Dict, List, Optional

pd.set_option("display.max_columns", None)
sys.path.append(os.path.abspath(".."))
import random

from functions import (
    test_company,
    show_missing_values,
    chart_visualisations,
    summarise_boolean_values,
    visualisations_by_year,
)

from IPython.display import display

pd.options.display.float_format = "{:,.2f}".format

In [97]:
reporting_df = pd.read_csv("../datasets/reporting_df_original.csv")

In [98]:
# reporting_df.dtypes

## Exploratory Data Analysis

I define the columns to be used for visualisation. Some columns, such as 'company' or 'ticker', are excluded.

In [99]:
object_columns = [
    "year",
    "segment",
    "industry",
    "hq_country",
    "years_esg_data",
    "base_year",
]

boolean_columns = [
    "csrd_2025",
    "csrd_2027",
    "external_audit_of_ESG_report",
    "ceo_sust_statem",
    "environmental_policy_and_assessment",
    "environmental_performance_targets",
    "reduced_environmental_impact",
    "increased_renewable_energy",
    "disclosure_of_raw_material_use",
    "resource_efficiency_target",
    "disclosure_of_water_discharges",
    "supplier_guidelines",
    "disclosure_of_suppliers_audited",
    "disclosure_of_supplier_evaluation_procedures",
    "supplier_environmental_assessment",
    "energy_consump_bool",
    "water_withdraw_bool",
    "ghg_emis_bool",
    "transport_emis_bool",
]


columns_for_viz = object_columns + boolean_columns

I first visualise all relevant columns to get an idea of the dataset's distribution. I then visualise by year, to see if there is any observable progress from one year to the next. Display mode can be toggled between 'count' and 'percentage'. Data from fewer companies are available in 2022 relative to other years, so percentage may be a more suitable option to compare company performance from year to year.

Notably, some columns seem to be missing data. 'ceo_sust_statem', for example, is almost completely missing from the 2022 data, which suggests a problem with the data collection. This will be handled later.

In [100]:
# chart_visualisations(reporting_df, columns_for_viz)

In [101]:
# visualisations_by_year(reporting_df, boolean_columns, display_mode="percentage")

I group the data by industry and summarise the mean values for each year in the table below. Because the columns are Boolean, a high mean indicates that a large share of companies report that metric.

In [102]:
# summarise_boolean_values(reporting_df, boolean_columns, ["year", "industry"])

# Evaluating company performance

```diff
- Insert text here...
```

## Summary of metrics reported

I first calculate the number of metrics each company has reported in a given year, and the percentage of the total number of metrics to report.

In [103]:
# All Boolean columns are included, except those related to CSRD, which only show
# whether a company must report CSRD data in 2025 or 2027
metrics = [col for col in boolean_columns if col not in {"csrd_2025", "csrd_2027"}]

metrics_reported_df = reporting_df.copy()
metrics_reported_df["metrics_reported"] = metrics_reported_df[metrics].sum(axis=1)
metrics_reported_df["metric_coverage"] = metrics_reported_df["metrics_reported"] / len(
    metrics
)

In [104]:
# metrics_reported_df.head()

To determine whether a company has improved its reporting in absolute terms since the previous year, I calculate the change in the number of metrics reported relative to the previous year.

In [105]:
# Ensure DataFrame is sorted by company and year
metrics_reported_df = metrics_reported_df.sort_values(by=["company", "year"])
metrics_reported_df["metrics_change_from_prev_year"] = metrics_reported_df[
    "metrics_reported"
] - metrics_reported_df.groupby(["company"])["metrics_reported"].shift(1)

In [106]:
metrics_reported_df[
    ["company", "year", "metrics_reported", "metrics_change_from_prev_year"]
]

Unnamed: 0,company,year,metrics_reported,metrics_change_from_prev_year
104,A.P. Møller -Maersk A/S,2019,13,
101,A.P. Møller -Maersk A/S,2020,14,1.00
102,A.P. Møller -Maersk A/S,2021,14,0.00
103,A.P. Møller -Maersk A/S,2022,13,-1.00
1644,AAK AB,2019,12,
...,...,...,...,...
611,Össur hf,2022,7,-6.00
187,Ørsted A/S,2019,12,
188,Ørsted A/S,2020,15,3.00
189,Ørsted A/S,2021,15,0.00


# HERE

In [None]:
# Sort by year, industry and number of metrics reported
metrics_reported_df = metrics_reported_df.sort_values(
    by=["year", "industry", "metrics_reported"]
)

# Compute industry percentile (within each year and industry)
metrics_reported_df["industry_percentile"] = metrics_reported_df.groupby(
    ["year", "industry"]
)["metrics_reported"].rank(pct=True)

# Compute overall percentile (within each year)
metrics_reported_df["overall_percentile"] = metrics_reported_df.groupby("year")[
    "metrics_reported"
].rank(pct=True)

# Assign ratings based on the defined scale
# (commented out here because assign_rating is defined further down the notebook)
# metrics_reported_df["industry_rating"] = metrics_reported_df[
#     "industry_percentile"
# ].apply(assign_rating)

# metrics_reported_df["overall_rating"] = metrics_reported_df["overall_percentile"].apply(
#     assign_rating
# )

# Display relevant columns
metrics_reported_df[
    [
        "company",
        "year",
        "industry",
        "metrics_reported",
        "industry_percentile",
        # "industry_rating",
        "overall_percentile",
        # "overall_rating",
    ]
]

Unnamed: 0,company,year,industry,metrics_reported,industry_percentile,overall_percentile
475,Afarak Group Plc,2019,Basic Materials,1,0.04,0.04
52,Josemaria Resources Inc.,2019,Basic Materials,2,0.08,0.07
56,Lundin Gold Inc.,2019,Basic Materials,5,0.12,0.23
160,H+H International A/S,2019,Basic Materials,7,0.16,0.37
54,Lucara Diamond Corp.,2019,Basic Materials,8,0.20,0.46
...,...,...,...,...,...,...
1764,Garo AB,2022,Technology,12,0.89,0.79
1303,Tobii AB,2022,Technology,12,0.89,0.79
1136,Mycronic AB,2022,Technology,13,0.96,0.88
338,Nokia Oyj,2022,Technology,13,0.96,0.88
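
As a quick check on how these percentiles behave, the sketch below applies the same `groupby(...).rank(pct=True)` call to a small made-up frame. The company names and values are illustrative and are not rows from the dataset. With pandas' default `method="average"`, tied companies share the mean of their ranks, and the top company in each group always receives 1.0.

```python
import pandas as pd

# Toy data: hypothetical companies, not taken from the Nordic Compass dataset
toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "C", "D"],
        "year": [2019, 2019, 2019, 2019],
        "industry": ["Technology"] * 4,
        "metrics_reported": [1, 2, 2, 5],
    }
)

# Same call as in the cell above: percentile rank within each year and industry
toy_df["industry_percentile"] = toy_df.groupby(["year", "industry"])[
    "metrics_reported"
].rank(pct=True)

# B and C tie on 2 metrics, so both get the average of ranks 2 and 3: 2.5 / 4 = 0.625
print(toy_df[["company", "metrics_reported", "industry_percentile"]])
```

This is why the lowest-ranked company in a group of 25 shows 0.04 (1 / 25) rather than 0.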


```diff
- Pillar score: A score for each pillar (within Environmental, Social and Governance). This can change from industry to industry--e.g. the environmental pillar might be worth more in the energy industry than another industry...


Numeric data:
Relative percentile ranking only applied if numeric data point is reported by a company.

Percentile rank is adopted to calculate the 10 category scores;
- How many companies are worse than the current one?
- How many have the same value?
- How many companies have a value at all?

```

$$
\text{score} = \frac{\text{no. of companies with a worse value} + \frac{\text{no. of companies with the same value included in the current one}}{2}}{\text{no. of companies with a value}}
$$
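
A minimal sketch of this formula, read literally, is given below. It assumes that a higher value is better and that companies without a value receive no score; the function name `percentile_score` and the example series are hypothetical and are not part of the LSEG document or this notebook.

```python
import pandas as pd


def percentile_score(values: pd.Series) -> pd.Series:
    """Score = (worse + same / 2) / companies with a value; missing values stay missing."""
    reported = values.dropna()
    n_with_value = len(reported)

    def score(x):
        if pd.isna(x):
            return float("nan")
        worse = (reported < x).sum()
        same = (reported == x).sum()  # includes the company itself
        return (worse + same / 2) / n_with_value

    return values.apply(score)


# Five hypothetical companies: one reports 0, three report 1, one has no value
example = pd.Series([0, 1, 1, 1, None], index=list("ABCDE"))
scores = percentile_score(example)
print(scores)
```

Here company A scores (0 + 1/2) / 4 = 0.125, each of B, C and D scores (1 + 3/2) / 4 = 0.625, and E stays missing because it is excluded from the denominator.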

```diff

TRBC group is used as the benchmark.

Materiality matrix created in the form of category weights.

7% transparency threshold (I guess that means mean is higher than 0.07)
Then use industry median (and add all medians up, then take the relative weighting of each)

For Boolean data points, use transparency weights (same principle, but using magnitude weight)
```

```diff
- For each company, sum all scores (percentile rank for each column). Add all scores for every company, and give each company a percentile rank based on the score equation above.
```

## Materiality calculation

I first calculate how material each metric is to each industry. This is done by calculating the mean value of each column within an industry. Because these are Boolean columns, a higher mean indicates that more companies report on the metric, which I take as a sign that the metric is more important (material) to that industry.

The higher the mean, the higher the weight that metric is given in that industry.

Okay, let's calculate the industry median first.

This is then expressed as a share of the sum of all industry medians, which gives the weight of each column in the overall score.
So if the industry median for ghg_emis_bool is 0.6 for 'Energy' and 0.1 for 'Consumer Goods', then the weight for Energy is going to be 0.6/(0.6+0.1) ≈ 86%



Check p. 20 of LSEG document. London Stock Exchange Group (LSEG) method for Booleans is: (no. of companies with worse value + (0.5 * no. of companies with the same value)) / total no. of companies with a value 

In [107]:
from typing import Union


def assign_rating(percentile: float) -> str:
    """
    Assigns a rating (A+ to D-) based on the given percentile.

    Parameters:
    percentile (float): The percentile value (0 to 1).

    Returns:
    str: The corresponding rating.
    """
    if 0.916666 < percentile <= 1:
        return "A+"
    elif 0.833333 < percentile <= 0.916666:
        return "A"
    elif 0.750000 < percentile <= 0.833333:
        return "A-"
    elif 0.666666 < percentile <= 0.750000:
        return "B+"
    elif 0.583333 < percentile <= 0.666666:
        return "B"
    elif 0.500000 < percentile <= 0.583333:
        return "B-"
    elif 0.416666 < percentile <= 0.500000:
        return "C+"
    elif 0.333333 < percentile <= 0.416666:
        return "C"
    elif 0.250000 < percentile <= 0.333333:
        return "C-"
    elif 0.166666 < percentile <= 0.250000:
        return "D+"
    elif 0.083333 < percentile <= 0.166666:
        return "D"
    elif 0.0 <= percentile <= 0.083333:
        return "D-"
    else:
        return "Invalid percentile"
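
For whole columns, the same twelve-band scale can be applied in one vectorised step. The sketch below is an alternative under stated assumptions, not a drop-in replacement: the name `assign_ratings_vectorised` is hypothetical, the band edges are exact twelfths rather than the truncated constants used above, and missing percentiles come back as missing values instead of "Invalid percentile".

```python
import numpy as np
import pandas as pd

# Labels ordered from the lowest band to the highest band
RATING_LABELS = ["D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]


def assign_ratings_vectorised(percentiles: pd.Series) -> pd.Series:
    """Map percentiles in [0, 1] to ratings using twelve equal-width bands."""
    bins = np.linspace(0, 1, len(RATING_LABELS) + 1)
    # Intervals are (lower, upper], and include_lowest=True puts 0.0 in the D- band
    return pd.cut(percentiles, bins=bins, labels=RATING_LABELS, include_lowest=True)


toy_percentiles = pd.Series([0.0, 0.04, 0.3, 0.55, 0.96, 1.0])
toy_ratings = assign_ratings_vectorised(toy_percentiles)
print(toy_ratings.tolist())
```

The result is an ordered categorical, which also makes it possible to sort or compare ratings directly.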

In [109]:
import pandas as pd
from typing import Optional


def top_n_companies(
    df: pd.DataFrame,
    n_companies: int = 20,
    industry: Optional[str] = None,
    year: Optional[int] = None,
    hq_country: Optional[str] = None,
    segment: Optional[str] = None,
) -> pd.DataFrame:
    """
    Returns the top N companies based on metrics_reported.

    Parameters:
    df (pd.DataFrame): The reporting DataFrame.
    n_companies (int): Number of top companies to return (default = 20).
    industry (Optional[str]): Industry to filter by (default = None, includes all industries).
    year (Optional[int]): Year to filter by (default = None, includes all years).
    hq_country (Optional[str]): HQ country to filter by (default = None, includes all countries).
    segment (Optional[str]): Segment to filter by (default = None, includes all segments).

    Returns:
    pd.DataFrame: Top N companies sorted by metrics_reported.
    """

    # Create a filtered DataFrame based on user input
    filtered_df = df.copy()

    if industry is not None:
        filtered_df = filtered_df[filtered_df["industry"] == industry]

    if year is not None:
        filtered_df = filtered_df[filtered_df["year"] == year]

    if hq_country is not None:
        filtered_df = filtered_df[filtered_df["hq_country"] == hq_country]

    if segment is not None:
        filtered_df = filtered_df[filtered_df["segment"] == segment]

    # Sort by metrics_reported in descending order
    top_companies = filtered_df.sort_values(
        by="metrics_reported", ascending=False
    ).head(n_companies)

    return top_companies

In [110]:
reporting_df["industry"].unique()

array(['Energy and Utilities', 'Industrial Goods and Services',
       'Consumer Goods and Services', 'Basic Materials', 'Finance',
       'Other', 'Technology', 'Health Care'], dtype=object)

In [111]:
desired_columns = [
    "company",
    "year",
    "industry",
    "metrics_reported",  # Move this near the beginning
    "industry_rating",  # Move this near the beginning
    "overall_rating",  # Move this near the beginning
    "industry_percentile",
    "overall_percentile",
]

# The metrics and percentile columns live on metrics_reported_df, not reporting_df.
# Keep only the desired columns that exist (the rating columns are only present
# once assign_rating has been applied).
leading_columns = [col for col in desired_columns if col in metrics_reported_df.columns]

# Ensure all other columns are preserved in the order
remaining_columns = [
    col for col in metrics_reported_df.columns if col not in leading_columns
]

# Reorder the DataFrame columns
metrics_reported_df = metrics_reported_df[leading_columns + remaining_columns]

In [None]:
# "Energy" is not a value in the industry column; the category is "Energy and Utilities".
# metrics_reported_df is used because it contains the metrics_reported column.
top_n_companies(
    metrics_reported_df, n_companies=None, industry="Energy and Utilities", year=2021
)


# Create a summary_df to make it easier to query individual companies

# HERE - build the scoring system

In [None]:
scoring_cols = [
    "external_audit_of_ESG_report",
    "ceo_sust_statem",
    "environmental_policy_and_assessment",
    "environmental_performance_targets",
    "reduced_environmental_impact",
    "increased_renewable_energy",
    "disclosure_of_raw_material_use",
    "resource_efficiency_target",
    "disclosure_of_water_discharges",
    "supplier_guidelines",
    "disclosure_of_suppliers_audited",
    "disclosure_of_supplier_evaluation_procedures",
    "supplier_environmental_assessment",
    "energy_consump_bool",
    "water_withdraw_bool",
    "ghg_emis_bool",
    "transport_emis_bool",
]

In [None]:
reporting_df["industry"].value_counts()

industry
Industrial Goods and Services    441
Finance                          387
Consumer Goods and Services      329
Health Care                      204
Technology                       149
Energy and Utilities             137
Other                            102
Basic Materials                   88
Name: count, dtype: int64

In [None]:
industry_means_all_years_df = reporting_df.groupby("industry")[scoring_cols].mean()
industry_means_all_years_df

industry,external_audit_of_ESG_report,ceo_sust_statem,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,disclosure_of_water_discharges,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool
Basic Materials,0.65,0.67,0.99,0.93,0.91,0.35,0.73,0.69,0.53,0.88,0.6,0.58,0.81,0.9,0.85,0.88,0.52
Consumer Goods and Services,0.42,0.63,0.96,0.88,0.94,0.45,0.31,0.68,0.15,0.89,0.38,0.47,0.77,0.68,0.46,0.76,0.57
Energy and Utilities,0.26,0.59,0.91,0.82,0.85,0.19,0.05,0.54,0.25,0.71,0.38,0.38,0.59,0.66,0.31,0.66,0.49
Finance,0.42,0.57,0.9,0.78,0.87,0.35,0.09,0.46,0.01,0.77,0.2,0.3,0.63,0.59,0.3,0.68,0.63
Health Care,0.34,0.44,0.82,0.57,0.78,0.27,0.07,0.32,0.03,0.83,0.4,0.29,0.62,0.4,0.33,0.52,0.35
Industrial Goods and Services,0.4,0.63,0.94,0.84,0.94,0.38,0.12,0.59,0.06,0.85,0.41,0.44,0.69,0.6,0.32,0.66,0.47
Other,0.49,0.59,0.94,0.82,0.94,0.25,0.09,0.43,0.01,0.76,0.25,0.39,0.69,0.68,0.2,0.73,0.58
Technology,0.32,0.44,0.78,0.64,0.78,0.26,0.05,0.43,0.01,0.69,0.3,0.28,0.57,0.47,0.2,0.52,0.42


In [None]:
import numpy as np
import pandas as pd


def calculate_industry_weights(reporting_df, scoring_cols):
    """
    Calculate industry-level materiality weights for ESG scoring.

    Parameters:
    - reporting_df (pd.DataFrame): The ESG dataset containing company data.
    - scoring_cols (list): List of columns to be used for scoring.

    Returns:
    - pd.DataFrame: A DataFrame with industry-level materiality weights.
    """
    # Step 1: Compute industry-level means across all years
    industry_means = reporting_df.groupby("industry")[scoring_cols].mean()

    # Step 2: Compute industry materiality score (sum of all means per industry)
    industry_means["industry_materiality_score"] = industry_means.sum(axis=1)

    # Step 3: Compute materiality weight for each variable
    materiality_weights = {}
    for col in scoring_cols:
        materiality_weights[f"{col}_materiality_weight"] = (
            industry_means[col] / industry_means["industry_materiality_score"]
        )

    # Create a new DataFrame with interleaved original and materiality weight columns
    interleaved_columns = []
    for col in scoring_cols:
        interleaved_columns.append(col)
        interleaved_columns.append(f"{col}_materiality_weight")

    industry_weights_df = industry_means.assign(**materiality_weights)[
        interleaved_columns + ["industry_materiality_score"]
    ]

    return industry_weights_df
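
To make the weighting step concrete, the sketch below repeats the core calculation on a small made-up frame with two metrics. The data are illustrative only. Each industry's mean for a metric is divided by the sum of that industry's means, so the weights in every row add up to 1.

```python
import pandas as pd

# Toy data: two hypothetical industries with two Boolean metrics each
toy_df = pd.DataFrame(
    {
        "industry": ["Finance", "Finance", "Technology", "Technology"],
        "ghg_emis_bool": [1, 1, 1, 0],
        "water_withdraw_bool": [1, 0, 0, 0],
    }
)
toy_cols = ["ghg_emis_bool", "water_withdraw_bool"]

# Share of companies reporting each metric, per industry
toy_means = toy_df.groupby("industry")[toy_cols].mean()

# Divide each row by its own sum so that the weights sum to 1 within an industry
toy_weights = toy_means.div(toy_means.sum(axis=1), axis=0)
print(toy_weights)
```

In this example, Finance gives ghg_emis_bool a weight of 1.0 / 1.5 (about 0.67), while Technology gives it a weight of 1.0 because no Technology company reports water withdrawal. An industry in which no metric is reported at all would produce a division by zero and therefore missing weights.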

In [None]:
# Keep only materiality weight columns and the materiality score
# industry_weights_df = industry_means[
#     ["industry_materiality_score"]
#     + [f"{col}_materiality_weight" for col in scoring_cols]
# ]

In [None]:
industry_weights_df = calculate_industry_weights(reporting_df, scoring_cols)
industry_weights_df

industry,external_audit_of_ESG_report,external_audit_of_ESG_report_materiality_weight,ceo_sust_statem,ceo_sust_statem_materiality_weight,environmental_policy_and_assessment,environmental_policy_and_assessment_materiality_weight,environmental_performance_targets,environmental_performance_targets_materiality_weight,reduced_environmental_impact,reduced_environmental_impact_materiality_weight,increased_renewable_energy,increased_renewable_energy_materiality_weight,disclosure_of_raw_material_use,disclosure_of_raw_material_use_materiality_weight,resource_efficiency_target,resource_efficiency_target_materiality_weight,disclosure_of_water_discharges,disclosure_of_water_discharges_materiality_weight,supplier_guidelines,supplier_guidelines_materiality_weight,disclosure_of_suppliers_audited,disclosure_of_suppliers_audited_materiality_weight,disclosure_of_supplier_evaluation_procedures,disclosure_of_supplier_evaluation_procedures_materiality_weight,supplier_environmental_assessment,supplier_environmental_assessment_materiality_weight,energy_consump_bool,energy_consump_bool_materiality_weight,water_withdraw_bool,water_withdraw_bool_materiality_weight,ghg_emis_bool,ghg_emis_bool_materiality_weight,transport_emis_bool,transport_emis_bool_materiality_weight,industry_materiality_score
Basic Materials,0.65,0.05,0.67,0.05,0.99,0.08,0.93,0.07,0.91,0.07,0.35,0.03,0.73,0.06,0.69,0.06,0.53,0.04,0.88,0.07,0.6,0.05,0.58,0.05,0.81,0.06,0.9,0.07,0.85,0.07,0.88,0.07,0.52,0.04,12.47
Consumer Goods and Services,0.42,0.04,0.63,0.06,0.96,0.09,0.88,0.08,0.94,0.09,0.45,0.04,0.31,0.03,0.68,0.07,0.15,0.01,0.89,0.09,0.38,0.04,0.47,0.04,0.77,0.07,0.68,0.07,0.46,0.04,0.76,0.07,0.57,0.06,10.38
Energy and Utilities,0.26,0.03,0.59,0.07,0.91,0.11,0.82,0.09,0.85,0.1,0.19,0.02,0.05,0.01,0.54,0.06,0.25,0.03,0.71,0.08,0.38,0.04,0.38,0.04,0.59,0.07,0.66,0.08,0.31,0.04,0.66,0.08,0.49,0.06,8.65
Finance,0.42,0.05,0.57,0.07,0.9,0.11,0.78,0.09,0.87,0.1,0.35,0.04,0.09,0.01,0.46,0.05,0.01,0.0,0.77,0.09,0.2,0.02,0.3,0.04,0.63,0.07,0.59,0.07,0.3,0.04,0.68,0.08,0.63,0.07,8.56
Health Care,0.34,0.05,0.44,0.06,0.82,0.11,0.57,0.08,0.78,0.11,0.27,0.04,0.07,0.01,0.32,0.04,0.03,0.0,0.83,0.11,0.4,0.05,0.29,0.04,0.62,0.08,0.4,0.05,0.33,0.05,0.52,0.07,0.35,0.05,7.41
Industrial Goods and Services,0.4,0.04,0.63,0.07,0.94,0.1,0.84,0.09,0.94,0.1,0.38,0.04,0.12,0.01,0.59,0.06,0.06,0.01,0.85,0.09,0.41,0.04,0.44,0.05,0.69,0.07,0.6,0.06,0.32,0.03,0.66,0.07,0.47,0.05,9.35
Other,0.49,0.06,0.59,0.07,0.94,0.11,0.82,0.09,0.94,0.11,0.25,0.03,0.09,0.01,0.43,0.05,0.01,0.0,0.76,0.09,0.25,0.03,0.39,0.04,0.69,0.08,0.68,0.08,0.2,0.02,0.73,0.08,0.58,0.07,8.83
Technology,0.32,0.04,0.44,0.06,0.78,0.11,0.64,0.09,0.78,0.11,0.26,0.04,0.05,0.01,0.43,0.06,0.01,0.0,0.69,0.1,0.3,0.04,0.28,0.04,0.57,0.08,0.47,0.07,0.2,0.03,0.52,0.07,0.42,0.06,7.15


Because the weights are calculated across all years, each variable has the same weight from year to year, which makes it easier to compare companies across years.

In [None]:
# folder_path = r"C:\Users\james\OneDrive - University of Aberdeen\01 - Turing College\D99 - Capstone Project\Nordic Compass - ESG Performance and CSRD Compliance\datasets"

# industry_weights_df.to_csv(f"{folder_path}/industry_weights_df.csv", index=True)
# industry_means_all_years_df.to_csv(f"{folder_path}/industry_means_df.csv", index=True)

# Calculate company percentiles

In [None]:
import pandas as pd
from typing import List


def calculate_company_percentiles(
    reporting_df: pd.DataFrame, scoring_cols: List[str]
) -> pd.DataFrame:
    """
    Calculate percentile scores for each company, grouped by industry and year.
    - If a company has a value of 1 for a metric, its percentile is calculated as:
        ((# of values less than 1) + (# of values equal to 1)/2) / (total companies in group)
    - If a company has a value of 0 for a metric, its percentile is set to 0.

    Parameters:
      reporting_df (pd.DataFrame): The ESG dataset containing company data.
      scoring_cols (List[str]): List of columns to be used for scoring.

    Returns:
      pd.DataFrame: A DataFrame with percentile scores for each company.
    """
    result_dfs = []

    # Iterate over each group defined by industry and year
    for (industry, year), group in reporting_df.groupby(["industry", "year"]):
        group_copy = group.copy()
        total_count = len(group_copy)
        # Process each scoring column for this group
        for col in scoring_cols:
            ones_count = (group_copy[col] == 1).sum()
            if ones_count == 0:
                group_copy[f"{col}_percentile"] = 0
            else:
                group_copy[f"{col}_percentile"] = group_copy[col].apply(
                    lambda x: (
                        0
                        if x == 0
                        else (
                            (
                                (group_copy[col] < x).sum()
                                + (group_copy[col] == x).sum() / 2
                            )
                            / total_count
                        )
                    )
                )
        result_dfs.append(group_copy)

    company_percentiles = pd.concat(result_dfs, ignore_index=True)
    # Select and order the desired columns
    cols = ["company", "ticker", "year", "industry"] + [
        f"{col}_percentile" for col in scoring_cols
    ]
    return company_percentiles[cols]
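
For a Boolean column, the percentile in this function has a simple closed form. If a share p of the group reports the metric, every reporter scores ((1 - p) + p / 2) = 1 - p / 2 and every non-reporter scores 0. The sketch below illustrates this for a single group, assuming the column contains only 0 and 1 with no missing values; `boolean_percentile` is a hypothetical helper and is not used elsewhere in the notebook.

```python
import pandas as pd


def boolean_percentile(column: pd.Series) -> pd.Series:
    """Percentile scores for a 0/1 column within one industry-year group."""
    share_reporting = (column == 1).mean()  # p: share of companies reporting
    # Reporters: (zeros + ones / 2) / n = (1 - p) + p / 2 = 1 - p / 2; others: 0
    return (column == 1) * (1 - share_reporting / 2)


# Hypothetical group of eight companies, two of which report the metric
toy_group = pd.Series([1, 1, 0, 0, 0, 0, 0, 0])
toy_scores = boolean_percentile(toy_group)
print(toy_scores.tolist())
```

With two reporters out of eight, p = 0.25 and each reporter scores 0.875, the same as (6 + 2 / 2) / 8. The rarer a disclosure is within a group, the more credit a reporting company receives for it.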

In [None]:
company_percentiles_df = calculate_company_percentiles(reporting_df, scoring_cols)

In [None]:
company_percentiles_df[company_percentiles_df["company"] == "Archer Ltd."]

Unnamed: 0,company,ticker,year,industry,external_audit_of_ESG_report_percentile,ceo_sust_statem_percentile,environmental_policy_and_assessment_percentile,environmental_performance_targets_percentile,reduced_environmental_impact_percentile,increased_renewable_energy_percentile,disclosure_of_raw_material_use_percentile,resource_efficiency_target_percentile,disclosure_of_water_discharges_percentile,supplier_guidelines_percentile,disclosure_of_suppliers_audited_percentile,disclosure_of_supplier_evaluation_procedures_percentile,supplier_environmental_assessment_percentile,energy_consump_bool_percentile,water_withdraw_bool_percentile,ghg_emis_bool_percentile,transport_emis_bool_percentile
467,Archer Ltd.,ARCHO,2020,Energy and Utilities,0.87,0.6,0.54,0.55,0.56,0.0,0.0,0.6,0.0,0.66,0.79,0.0,0.0,0.63,0.0,0.0,0.0


# Now combine them... company_percentiles multiplied by the relative weight, to calculate the total ESG score

In [None]:
def calculate_weighted_scores(
    company_percentiles_df: pd.DataFrame,
    industry_weights_df: pd.DataFrame,
    scoring_cols: List[str],
) -> pd.DataFrame:
    """
    Multiply each company's percentile score by the industry materiality weight for each metric.

    Parameters:
    - company_percentiles_df (pd.DataFrame): DataFrame with company percentile scores per metric.
    - industry_weights_df (pd.DataFrame): DataFrame with industry materiality weights per metric.
    - scoring_cols (List[str]): List of columns to be used for scoring.

    Returns:
    - pd.DataFrame: A DataFrame with weighted scores and an overall environmental score for each company.
    """
    weighted_scores_df = company_percentiles_df.copy()

    score_cols = []
    for col in scoring_cols:
        weight_col = f"{col}_materiality_weight"
        score_col = f"{col}_score"

        # Map industry materiality weights to each company's industry
        weighted_scores_df[score_col] = weighted_scores_df.apply(
            lambda row: (
                row[f"{col}_percentile"]
                * industry_weights_df.loc[row["industry"], weight_col]
                if row["industry"] in industry_weights_df.index
                else np.nan
            ),
            axis=1,
        )
        score_cols.append(score_col)

    # Compute overall environmental score as the sum of all individual scores
    weighted_scores_df["overall_environmental_score_raw"] = weighted_scores_df[
        score_cols
    ].sum(axis=1)

    return weighted_scores_df[
        ["company", "ticker", "year", "industry", "overall_environmental_score_raw"]
        + score_cols
    ]
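
The row-wise `apply` above can be slow on larger frames. As an illustration of the same weighted sum, the sketch below aligns the weights to each company's industry with `reindex` and multiplies the two arrays directly. The data are made up, and the variable names are assumptions chosen to mirror the column naming used in this notebook.

```python
import pandas as pd

toy_metrics = ["ghg_emis_bool", "supplier_guidelines"]

# Hypothetical percentile scores for two companies
toy_percentiles = pd.DataFrame(
    {
        "company": ["A", "B"],
        "industry": ["Finance", "Technology"],
        "ghg_emis_bool_percentile": [0.5, 0.0],
        "supplier_guidelines_percentile": [0.75, 0.5],
    }
)

# Hypothetical materiality weights, indexed by industry
toy_weights = pd.DataFrame(
    {
        "ghg_emis_bool_materiality_weight": [0.5, 0.25],
        "supplier_guidelines_materiality_weight": [0.5, 0.75],
    },
    index=pd.Index(["Finance", "Technology"], name="industry"),
)

# One row of weights per company; industries without weights become NaN rows
aligned = toy_weights.reindex(toy_percentiles["industry"])

pct = toy_percentiles[[f"{m}_percentile" for m in toy_metrics]].to_numpy()
wts = aligned[[f"{m}_materiality_weight" for m in toy_metrics]].to_numpy()

# Element-wise product, then sum across metrics (NaN products are skipped)
toy_percentiles["overall_environmental_score_raw"] = pd.DataFrame(pct * wts).sum(axis=1)
print(toy_percentiles[["company", "overall_environmental_score_raw"]])
```

Company A scores 0.5 × 0.5 + 0.75 × 0.5 = 0.625 and company B scores 0.0 × 0.25 + 0.5 × 0.75 = 0.375.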

In [None]:
# Test to see that the function works...

# calc_test_df = reporting_df[
#     (reporting_df["industry"] == "Energy and Utilities")
#     & (reporting_df["year"] == 2020)
# ]

# calc_test_df

# folder_path = r"C:\Users\james\OneDrive - University of Aberdeen\01 - Turing College\D99 - Capstone Project\Nordic Compass - ESG Performance and CSRD Compliance\datasets"

# calc_test_df.to_csv(f"{folder_path}/test_of_scoring_system.csv", index=False)

# Company percentiles are correct. Now just check industry magnitudes are correct and that the weighted scores are therefore correct

In [None]:
weighted_scores_df = calculate_weighted_scores(
    company_percentiles_df, industry_weights_df, scoring_cols
)

In [None]:
weighted_scores_df[
    (weighted_scores_df["industry"] == "Energy and Utilities")
    & (weighted_scores_df["year"] == 2020)
].head()

Unnamed: 0,company,ticker,year,industry,overall_environmental_score_raw,external_audit_of_ESG_report_score,ceo_sust_statem_score,environmental_policy_and_assessment_score,environmental_performance_targets_score,reduced_environmental_impact_score,increased_renewable_energy_score,disclosure_of_raw_material_use_score,resource_efficiency_target_score,disclosure_of_water_discharges_score,supplier_guidelines_score,disclosure_of_suppliers_audited_score,disclosure_of_supplier_evaluation_procedures_score,supplier_environmental_assessment_score,energy_consump_bool_score,water_withdraw_bool_score,ghg_emis_bool_score,transport_emis_bool_score
453,Northern Drilling Ltd,NODL,2020,Energy and Utilities,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
454,DNO ASA,DNO,2020,Energy and Utilities,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
455,RAK Petroleum plc,RAKP,2020,Energy and Utilities,0.04,0.0,0.04,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
456,Seadrill Ltd,SDRL,2020,Energy and Utilities,0.11,0.0,0.0,0.06,0.0,0.06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
457,Questerre Energy Corporation,QEC,2020,Energy and Utilities,0.15,0.0,0.04,0.06,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


The final step is to convert the raw score into a percentile relative to other companies. This percentile will be the reporting score...

In [None]:
import pandas as pd


def calculate_adjusted_env_score(df, raw_column, by_industry=True):
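    """
    Calculates the percentile ranking of each company's raw environmental score
    relative to other companies in the same year and, optionally, the same industry.

    A company's percentile is the share of its group scoring lower, plus half the
    share scoring the same (mid-rank). A raw score of 0 always maps to a percentile of 0.

    Parameters:
        df (pd.DataFrame): The input DataFrame containing company data.
        raw_column (str): The column name for the raw environmental score.
        by_industry (bool): If True, percentiles are calculated within industry-year
            groups; if False, within year groups only.

    Returns:
        pd.DataFrame: The company, ticker, year and industry columns plus the new
            f"{raw_column}_percentile" column.
    """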
    result_dfs = []

    # Group by industry and year if by_industry is True; otherwise, group only by year
    groupby_cols = ["industry", "year"] if by_industry else ["year"]

    for _, group in df.groupby(groupby_cols):
        group_copy = group.copy()
        total_count = len(group_copy)

        if total_count > 0:
            group_copy[f"{raw_column}_percentile"] = group_copy[raw_column].apply(
                lambda x: (
                    0
                    if x == 0
                    else (
                        (
                            (group_copy[raw_column] < x).sum()
                            + (group_copy[raw_column] == x).sum() / 2
                        )
                        / total_count
                    )
                )
            )

        result_dfs.append(group_copy)

    # Concatenate results and return only relevant columns
    final_df = pd.concat(result_dfs, ignore_index=True)
    selected_cols = [
        "company",
        "ticker",
        "year",
        "industry",
        f"{raw_column}_percentile",
    ]

    return final_df[selected_cols]

This looks correct. Only the name needs changing.

Next, clean up this section before applying the same approach to an overall score. Keep the results in separate DataFrames to make them more manageable for the reader.

In [None]:
final_df = calculate_adjusted_env_score(
    weighted_scores_df, "overall_environmental_score_raw"
)

final_df[(final_df["industry"] == "Energy and Utilities") & (final_df["year"] == 2020)]

Unnamed: 0,company,ticker,year,industry,overall_environmental_score_raw_percentile
453,Northern Drilling Ltd,NODL,2020,Energy and Utilities,0.0
454,DNO ASA,DNO,2020,Energy and Utilities,0.0
455,RAK Petroleum plc,RAKP,2020,Energy and Utilities,0.06
456,Seadrill Ltd,SDRL,2020,Energy and Utilities,0.09
457,Questerre Energy Corporation,QEC,2020,Energy and Utilities,0.11
458,Africa Oil Corp.,AOI,2020,Energy and Utilities,0.13
459,Siem Offshore Inc.,SIOFF,2020,Energy and Utilities,0.16
460,FLEX LNG Ltd,FLNG,2020,Energy and Utilities,0.28
461,Tanker Investments Ltd,TNK,2020,Energy and Utilities,0.21
462,Norwegian Energy Company ASA,NOR,2020,Energy and Utilities,0.23


It is debatable whether the external audit should be included in the weighting. An audit should matter equally in every industry, so a missing audit should perhaps discount the score instead...

Consider how to deal with this using the scoring columns.


This is also where I should consider adding unit tests...
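As a starting point, the mid-rank percentile rule could be tested on small hand-checked inputs. The sketch below is only an assumption about how such tests might look: `midrank_percentile` is a hypothetical, vectorised stand-in for the per-group rule inside `calculate_adjusted_env_score`, not the notebook's own function, and the test names are invented. The tests are written so that pytest could collect them, but they also run as plain function calls.

```python
import pandas as pd


def midrank_percentile(scores: pd.Series) -> pd.Series:
    """Hypothetical stand-in for the per-group percentile rule (assumption)."""
    n = len(scores)
    # rank(method="average") equals (#lower + (#equal + 1) / 2), so subtracting
    # 0.5 leaves (#lower + #equal / 2), the numerator used in the notebook.
    pct = (scores.rank(method="average") - 0.5) / n
    # A raw score of 0 always maps to a percentile of 0.
    return pct.where(scores != 0, 0.0)


def test_zero_scores_stay_at_zero():
    scores = pd.Series([0.0, 0.0, 0.04, 0.11])
    assert midrank_percentile(scores).iloc[:2].tolist() == [0.0, 0.0]


def test_ties_share_the_same_percentile():
    scores = pd.Series([0.0, 0.1, 0.1, 0.3])
    result = midrank_percentile(scores)
    # 0.1: one lower value plus half of two ties -> (1 + 1) / 4
    assert result.iloc[1] == result.iloc[2] == 0.5
    # 0.3: three lower values plus half of one tie -> (3 + 0.5) / 4
    assert result.iloc[3] == 0.875
```

The same cases could then be pointed at `calculate_adjusted_env_score` itself by wrapping the toy scores in a DataFrame with the `company`, `ticker`, `year` and `industry` columns it expects.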

## References

European Commission, 2025. Proposal for a Directive of the European Parliament and of the Council amending Directives (EU) 2022/2464 and (EU) 2024/1760 as regards the dates from which Member States are to apply certain corporate sustainability reporting and due diligence requirements. COM(2025) 80 final. Brussels. Available at: https://commission.europa.eu/document/download/0affa9a8-2ac5-46a9-98f8-19205bf61eb5_en?filename=COM_2025_80_EN.pdf (Accessed: 27 February 2025)

LSEG, 2024. Environmental, Social and Governance Scores from LSEG: October 2024. Available at: https://www.lseg.com/content/dam/data-analytics/en_us/documents/methodology/lseg-esg-scores-methodology.pdf (Accessed: 27 February 2025)

## Appendix

## Plan for sector-specific environmental scoring calculation:

1) Calculate the mean value of each parameter for a given industry (use all years combined rather than a single year, so that scores can be compared between years).
2) Compare that to the sum of the mean values of all parameters for that industry (e.g. mean_env_policy + mean_ghg_emis_bool). This gives the magnitude of each parameter.
3) Calculate the value of each parameter for a given company and the associated percentile (e.g. a value of 1 might correspond to a percentile of 0.81111).
4) Multiply the company's percentile by the magnitude of the column, so that the most important columns are prioritised.
5) Sum the resulting scores for the company. This gives the company's raw_sector_score.
6) Take the raw_sector_score of every company in that sector and calculate the percentile. This gives the company's adjusted_sector_score, which can be considered its sector-specific E(SG) score.

Note: the advantage of this approach is that scores within a sector can be compared across years. A problem arises when another year is added: should the 2019 base year be used for the calculation? Priorities may have shifted in recent years, so it is probably best to stick with the overall mean.
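The six steps can be sketched on toy data as follows. This is a minimal sketch under stated assumptions, not the notebook's implementation: the companies, metric values and `toy_df` are invented, step 2 is read as dividing each industry mean by the sum of that industry's means, and pandas' `rank(pct=True)` stands in for the mid-rank percentile (with its zero rule) used in the scoring functions.

```python
import pandas as pd

# Toy data: hypothetical companies and two Boolean reporting metrics
toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "C", "D", "E", "F"],
        "industry": ["Energy"] * 4 + ["Finance"] * 2,
        "year": [2020] * 6,
        "env_policy": [1, 1, 1, 0, 1, 0],
        "ghg_emis_bool": [1, 0, 0, 0, 1, 1],
    }
)
metric_cols = ["env_policy", "ghg_emis_bool"]

# Steps 1-2: industry mean per metric, divided by the sum of the industry's means
# (assumption: this ratio is the "magnitude" of each column)
industry_means = toy_df.groupby("industry")[metric_cols].mean()
industry_weights = industry_means.div(industry_means.sum(axis=1), axis=0)

# Step 3: each company's percentile per metric within its industry
percentiles = toy_df.groupby("industry")[metric_cols].rank(pct=True)

# Steps 4-5: weight each percentile by its industry's magnitude and sum per row
weights_per_row = industry_weights.loc[toy_df["industry"]].to_numpy()
toy_df["raw_sector_score"] = (percentiles.to_numpy() * weights_per_row).sum(axis=1)

# Step 6: percentile of the raw score within the sector
toy_df["adjusted_sector_score"] = toy_df.groupby("industry")[
    "raw_sector_score"
].rank(pct=True)

print(toy_df[["company", "industry", "raw_sector_score", "adjusted_sector_score"]])
```

In this toy example, `env_policy` is reported by three of the four Energy companies and `ghg_emis_bool` by one, so the Energy weights come out at 0.75 and 0.25. An industry in which no company reports any metric would produce a division by zero in step 2 and would need separate handling.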

## Plan for overall environmental scoring calculation:

1) Calculate the mean value for each industry for a given parameter (use overall). Ignore magnitude.
2) This will give you the transparency weighting. The more companies fill in this metric, the more transparent it is.
3) Calculate the transparency of each metric relative to the total: this will give you the weighting of each metric.
4) For each company/year, calculate their overall percentile for each metric.
5) Multiply each percentile by the weighting of the metric it applies to. Sum all of the values.
6) This will give the company's raw_overall_score.
7) Now calculate the percentile for each company relative to others in the same year (or maybe overall). This will give the company's adjusted_overall_score.
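The overall plan can be sketched in the same way. Again this is a minimal sketch with invented data: `toy_df` and its values are hypothetical, percentiles are taken within each year (the "or maybe overall" variant in step 7 would drop the grouping), and `rank(pct=True)` stands in for the notebook's mid-rank percentile.

```python
import pandas as pd

# Toy data: three hypothetical companies observed in two years
toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "C", "A", "B", "C"],
        "year": [2019, 2019, 2019, 2020, 2020, 2020],
        "env_policy": [1, 0, 0, 1, 1, 0],
        "ghg_emis_bool": [0, 0, 0, 1, 0, 0],
    }
)
metric_cols = ["env_policy", "ghg_emis_bool"]

# Steps 1-2: transparency of each metric = share of all records reporting it
transparency = toy_df[metric_cols].mean()

# Step 3: weighting of each metric relative to the total transparency
metric_weights = transparency / transparency.sum()

# Step 4: percentile of each company for each metric (assumption: within year)
percentiles = toy_df.groupby("year")[metric_cols].rank(pct=True)

# Steps 5-6: weighted sum of percentiles gives the raw overall score
toy_df["raw_overall_score"] = (percentiles * metric_weights).sum(axis=1)

# Step 7: percentile of the raw score relative to companies in the same year
toy_df["adjusted_overall_score"] = toy_df.groupby("year")[
    "raw_overall_score"
].rank(pct=True)

print(metric_weights)
print(toy_df[["company", "year", "raw_overall_score", "adjusted_overall_score"]])
```

Because the weights are derived from how often a metric is reported, widely reported metrics dominate the score. Here `env_policy` receives a weight of 0.75 and `ghg_emis_bool` a weight of 0.25.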


The adjusted_overall_score is the company's reporting score. In the next section, calculate its emission score.
For the overall score, maybe consider a split of roughly two-thirds emissions to one-third reporting.
Is it even necessary to calculate an overall score if I'm using this for business leads/identifying potential businesses to target?





In [None]:
# # 2. Create 'consecutive_years_esg_data' by counting consecutive years back from 2022
# def calculate_consecutive_years(group):
#     # Create a set of years for the current 'comp_name'
#     years = set(group["year"])
#     # Start from 2022 and count consecutive years backwards
#     count = 0
#     for year in range(2022, 2019, -1):  # Checks 2022, 2021 and 2020 (2019 is excluded)
#         if year in years:
#             count += 1
#         else:
#             break  # Stop at the first missing year in the sequence

#     return count


# # Compute one count per 'comp_name', then map it back onto every row.
# # (Assigning the groupby result directly would misalign, because it holds
# # one value per company rather than one per record.)
# consecutive_years = df.groupby("comp_name").apply(calculate_consecutive_years)
# df["consecutive_years_esg_data"] = df["comp_name"].map(consecutive_years)

Metrics change from base year

In [None]:
# reporting_df = reporting_df.merge(
#     reporting_df.loc[
#         reporting_df["year"] == reporting_df["base_year"],
#         ["company", "metrics_reported"],
#     ],
#     on="company",
#     how="left",
#     suffixes=("", "_base_year"),
# )

# # Compute the change from base year
# reporting_df["metrics_change_from_base_year"] = (
#     reporting_df["metrics_reported"] - reporting_df["metrics_reported_base_year"]
# )

# reporting_df.loc[
#     reporting_df["year"] == reporting_df["base_year"], "metrics_change_from_base_year"
# ] = float("nan")

# reporting_df.drop("metrics_reported_base_year", axis=1, inplace=True)

ESG score calculation

In [None]:
# import numpy as np
# import pandas as pd


# def calculate_industry_esg_scores(reporting_df, scoring_cols):
#     """
#     Calculate an industry-specific ESG score for each company.

#     Parameters:
#     - reporting_df (pd.DataFrame): The ESG dataset containing company data.
#     - scoring_cols (list): List of columns to be used for scoring.

#     Returns:
#     - pd.DataFrame: A DataFrame with company-specific ESG scores.
#     """

#     # Step 1: Compute industry-level means across all years
#     industry_means = reporting_df.groupby("industry")[scoring_cols].mean()

#     # Step 2: Compute company-level scores
#     company_scores = reporting_df.copy()

#     for col in scoring_cols:
#         # Compute the percentile score for each company within its industry
#         company_scores[f"{col}_percentile"] = company_scores.groupby("industry")[
#             col
#         ].rank(pct=True)

#         # Adjust score by industry-specific magnitude
#         company_scores[f"{col}_adjusted"] = (
#             company_scores[f"{col}_percentile"]
#             * industry_means.loc[company_scores["industry"], col].values
#         )

#     # Step 3: Compute raw industry score for each company (sum of all adjusted scores)
#     company_scores["raw_industry_score"] = company_scores[
#         [f"{col}_adjusted" for col in scoring_cols]
#     ].sum(axis=1)

#     # Step 4: Compute industry-specific percentile score
#     company_scores["adjusted_sector_score"] = company_scores.groupby("industry")[
#         "raw_industry_score"
#     ].rank(pct=True)

#     return company_scores[
#         [
#             "company",
#             "ticker",
#             "year",
#             "industry",
#             "raw_industry_score",
#             "adjusted_sector_score",
#         ]
#     ]

In [None]:
# score_df = calculate_industry_esg_scores(reporting_df, scoring_cols=scoring_cols)

According to the CSRD, a first set of European Sustainability Reporting Standards (ESRS) was adopted in 2023. These standards are sector-agnostic, so they apply regardless of the sector in which a company operates. Sector-specific reporting standards were expected to be introduced by June 2026, but at the time of writing (February 2025) it is likely that this requirement will be shelved (see European Commission, 2025).

I apply the ESG scoring system used by the London Stock Exchange Group (LSEG, 2024), and calculate sector-specific scores...

### Sector-specific calculations

I then summarise all columns using the mean (and append the median of metrics reported). These values are later used to calculate each company's ESG score for a given year.  

In [None]:
# summary_by_industry_df = (
#     metrics_reported_df.groupby(["industry", "year"])[metrics + ["metrics_reported"]]
#     .mean()  # mean used here because median would either show 0 or 1
#     .reset_index()
#     .set_index(["industry", "year"])
# )

# median_count_metrics = (
#     metrics_reported_df.groupby(["industry", "year"])["metrics_reported"]
#     .median()
#     .reset_index()
# )

# median_count_metrics.rename(
#     columns={"metrics_reported": "metrics_reported_median"}, inplace=True
# )

# # Merge the median with the original summary DataFrame
# summary_by_industry_df = summary_by_industry_df.merge(
#     median_count_metrics, on=["industry", "year"], how="left"
# ).set_index(["industry", "year"])

### Cross-sector calculations

I apply the same logic as in the previous section, but across all industries.

In [None]:
# summary_overall_df = (
#     metrics_reported_df.groupby(["year"])[metrics + ["metrics_reported"]]
#     .mean()
#     .reset_index()
#     .set_index(["year"])
# )

# median_count_metrics = (
#     metrics_reported_df.groupby(["year"])["metrics_reported"].median().reset_index()
# )

# median_count_metrics.rename(
#     columns={"metrics_reported": "metrics_reported_median"}, inplace=True
# )

# # Merge the median with the original summary DataFrame
# summary_overall_df = summary_overall_df.merge(
#     median_count_metrics, on=["year"], how="left"
# ).set_index(["year"])

In [None]:
# summary_overall_df

In [None]:
# Show all unique values for each column

# pd.set_option("display.max_colwidth", 180)

# # Create a DataFrame with unique values for each column
# unique_values_df = pd.DataFrame(
#     {
#         "columns": reporting_df.columns,
#         "unique_values": [
#             reporting_df[col].unique().tolist() for col in reporting_df.columns
#         ],
#     }
# ).set_index("columns")

# unique_values_df

# END