## To do:

- Emphasise that this can be used to target new customers--you can immediately see which companies have missing X, Y and Z as part of their gap analysis...
- You can see which data companies find it hardest to collect, etc...

- Show those companies/number of companies that have migrated from one ranking to the next from year to year. (Use the example of a credit scorecard as inspiration)
- Consider grouping them
- Coverage rating vs. Ranking -- give them a ranking relative to their competitors, and a coverage rating based on the number of metrics they are successfully reporting.

Materiality, in other words the relative weight of a data point, is determined by the relative level of
disclosure in each industry group. The disclosure percentage is identified for each industry group to which
the data point is material, and decile ranks are assigned. The decile rank, from 1 to 10, determines the
relative weight assigned to that data point when calculating the industry weight.

#### Summary columns

Declarations per year

--check which industry has the highest % of missing values

Percentage of companies in each industry that have their sustainability work audited

Create 'gap analysis: total missing metrics (coverage of metrics)'

Carbon intensive industries = Energy, Materials/Basic Materials, Industrials, Utilities

Create a summary_df to make it easier to query individual companies

Add some sort of unit tests to show that the results of each df are correct--for example, that the weighted magnitudes all = 1

It is debatable whether the external audit should be included in the weighting. An audit should be equally important in every industry, so its absence should perhaps discount the final score instead...

Consider how to deal with this using the scoring columns.

-	Use an animation to show the average scores for each group over time—use the animation from this video: https://youtu.be/GGL6U0k8WYA?si=9Yckk-Ig9egp8HOA&t=4697

<center><span style="font-size:30px; font-weight: bold;">Nordic Compass Database</span></center>
<center><span style="font-size:24px;">Analysis of Environmental Performance and CSRD Compliance</span></center>

<center><span style="font-size:22px;"><b>Section 2:</b> Reporting | Gap analysis </span></center>

## Introduction to this section

In the previous section, I cleaned the original dataset to ensure that all companies were entered under a single name and had a consistent ticker, segment, industry and country. I then removed any duplicates, transformed any anomalous values in Boolean columns (columns that accept only 0 or 1), and set a base year of 2019. All data prior to 2019 were deleted, new columns were added, and the data frame was divided into a reporting_df and an impact_df. 

In this section, I analyse the reporting_df, which shows how well each company is meeting its environmental reporting requirements under CSRD (based on the data available). It is important to note that CSRD reporting requirements first apply from the 2024 financial year, but the most recent data in the dataset are from 2022.

I first explore the dataset to see how reporting varies from one metric to the next. This might be a useful guide to help understand which services might be in demand in the near future. For example, we see that many companies do not disclose details about water discharges. This may be because they find it hard to measure, and therefore will require help to report that metric.

I use this to run a gap analysis, which shows the percentage of metrics reported...

Finally, I create a score calculator, based on the London Stock Exchange Group's methodology (LSEG, 2024), to give each company an environmental score for a given year. This score calculator is limited by the data available--the Nordic Compass dataset has around 17 relevant columns for measuring a company's environmental score, whereas LSEG used 68 metrics to build its environmental score. I also have a limited number of companies in some industries, so it would be unreasonable to give an industry-specific score in those industries; some industries have therefore been merged with others. This trade-off would not be required if the dataset were larger. My system is a simplified version, but it can be useful for measuring a company's progress on environmental issues and for identifying companies that are performing well or poorly relative to their competitors.

To create the score calculator, I first decide what metrics are material to each industry. This is done using materiality weight, which is calculated...

```diff
- Insert here...

```

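I then convert that score into an industry-specific rating and an overall rating. The industry-specific rating compares the company's performance in that year to that of other companies in the same industry, while the overall rating compares it to all companies, regardless of industry.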

## Imports

In [80]:
import os
import random
import sys
from typing import Dict, List, Optional

import numpy as np
import pandas as pd
import plotly.graph_objects as go
from plotly.subplots import make_subplots

pd.set_option("display.max_columns", None)

# Add the parent directory to the path so that `functions` can be imported
sys.path.append(os.path.abspath(".."))

from functions import (
    test_company,
    show_missing_values,
    chart_visualisations,
    summarise_boolean_values,
    visualisations_by_year,
    map_ceo_statements,
    calculate_industry_weights,
    calculate_company_percentiles,
    calculate_raw_score,
    calculate_adjusted_score,
    assign_rating,
    cast_to_rating_category,
)

from IPython.display import display

pd.options.display.float_format = "{:,.2f}".format

In [81]:
reporting_df = pd.read_csv("../datasets/reporting_df_original.csv")

In [82]:
# reporting_df.dtypes

## Exploratory Data Analysis

I define the columns to be used for visualisation. Some columns, such as 'company' or 'ticker', are excluded.

In [83]:
object_columns = [
    "year",
    "segment",
    "industry",
    "hq_country",
    "years_esg_data",
    "base_year",
]

boolean_columns = [
    "csrd_2025",
    "csrd_2027",
    "external_audit_of_ESG_report",
    "ceo_sust_statem",
    "environmental_policy_and_assessment",
    "environmental_performance_targets",
    "reduced_environmental_impact",
    "increased_renewable_energy",
    "disclosure_of_raw_material_use",
    "resource_efficiency_target",
    "disclosure_of_water_discharges",
    "supplier_guidelines",
    "disclosure_of_suppliers_audited",
    "disclosure_of_supplier_evaluation_procedures",
    "supplier_environmental_assessment",
    "energy_consump_bool",
    "water_withdraw_bool",
    "ghg_emis_bool",
    "transport_emis_bool",
]


columns_for_viz = object_columns + boolean_columns

I first visualise all relevant columns to get an idea of the dataset's distribution. I then visualise by year, to see if there is any observable progress from one year to the next. Display mode can be toggled between 'count' and 'percentage'. Data from fewer companies are available in 2022 relative to other years, so percentage may be a more suitable option to compare company performance from year to year.

Notably, some columns seem to be missing data. 'ceo_sust_statem', for example, is almost completely missing from the 2022 data, suggesting a problem with the data collection.

In [84]:
# chart_visualisations(reporting_df, columns_for_viz)

In [85]:
# visualisations_by_year(reporting_df, boolean_columns, display_mode="percentage")

I group the data by industry and summarise the mean values for each year in the table below. A high mean value indicates that a large share of companies report that metric.

In [86]:
# summarise_boolean_values(reporting_df, boolean_columns, ["year", "industry"])

To handle the missing 2022 data, I extrapolate all 'ceo_sust_statem' values from 2021.

In [87]:
map_ceo_statements(reporting_df)

Unnamed: 0,company,ticker,year,csrd_2025,csrd_2027,segment,industry,hq_country,years_esg_data,base_year,external_audit_of_ESG_report,ceo_sust_statem,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,disclosure_of_water_discharges,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool
0,Archer Ltd.,ARCHO,2020,1,0,Mid,Energy and Utilities,Norway,1,2020,1,1,1,1,1,0,0,1,0,1,1,0,0,1,0,0,0
1,AutoStore Holdings Ltd.,AUTO,2021,0,1,Large,Industrial Goods and Services,Bermuda,1,2021,0,1,1,0,1,0,1,0,0,1,0,1,0,0,0,1,1
2,Avance Gas Holding ltd,AGAS,2019,0,1,Mid,Energy and Utilities,Norway,2,2019,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0
3,Avance Gas Holding ltd,AGAS,2020,0,0,Mid,Energy and Utilities,Norway,2,2019,1,1,1,1,1,0,0,1,0,1,0,0,0,1,0,0,0
4,Borr Drilling Ltd,BDRILL,2019,1,0,Mid,Energy and Utilities,Bermuda,4,2019,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1832,Betsson AB,BETS,2022,1,0,Mid,Other,Sweden,4,2019,0,1,1,1,1,0,0,0,0,1,0,0,1,1,0,0,0
1833,Karnov Group AB,KAR,2020,0,1,Mid,Consumer Goods and Services,Sweden,4,2019,0,0,1,1,1,0,0,0,0,1,0,0,1,0,0,0,0
1834,Blackstone Inc. (formerly: G4S plc),BX,2022,1,0,Large,Finance,United States,1,2022,0,0,1,1,1,0,0,0,0,1,0,0,1,0,0,0,0
1835,Gaming Innovation Group Inc,GIG,2020,0,1,Mid,Technology,Malta,2,2019,1,1,1,1,1,1,0,1,0,0,0,0,0,0,0,0,0


# Evaluating performance

```diff
- Insert text here...
```

I first measure absolute performance by determining the number of metrics each company reports in a given year. I then compare that to the number of metrics the company reported in the previous year.

## Metrics reported

### Metrics reported by company

I first calculate the number of metrics each company has reported in a given year, and the coverage as a percentage of all metrics that could have been reported.

In [88]:
# All Boolean columns are included except the CSRD flags, which only show
# whether a company must report CSRD data in 2025 or 2027
metrics = [col for col in boolean_columns if col not in {"csrd_2025", "csrd_2027"}]

metrics_reported_by_company_df = reporting_df.copy()
metrics_reported_by_company_df["metrics_reported"] = metrics_reported_by_company_df[
    metrics
].sum(axis=1)
metrics_reported_by_company_df["metric_coverage"] = metrics_reported_by_company_df[
    "metrics_reported"
] / len(metrics)

In [89]:
# metrics_reported_by_company_df.head()

To determine whether a company has improved its reporting in absolute terms since the previous year, I calculate the change in the number of metrics reported relative to the previous year.

In [90]:
# Ensure DataFrame is sorted by company and year
metrics_reported_by_company_df = metrics_reported_by_company_df.sort_values(
    by=["company", "year"]
)
metrics_reported_by_company_df["metrics_change_from_prev_year"] = (
    metrics_reported_by_company_df["metrics_reported"]
    - metrics_reported_by_company_df.groupby(["company"])["metrics_reported"].shift(1)
)

In [91]:
metrics_reported_by_company_df[
    ["company", "year", "metrics_reported", "metrics_change_from_prev_year"]
]

Unnamed: 0,company,year,metrics_reported,metrics_change_from_prev_year
104,A.P. Møller -Maersk A/S,2019,13,
101,A.P. Møller -Maersk A/S,2020,14,1.00
102,A.P. Møller -Maersk A/S,2021,14,0.00
103,A.P. Møller -Maersk A/S,2022,14,0.00
1644,AAK AB,2019,12,
...,...,...,...,...
611,Össur hf,2022,8,-5.00
187,Ørsted A/S,2019,12,
188,Ørsted A/S,2020,15,3.00
189,Ørsted A/S,2021,15,0.00
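
As a quick illustration of the year-on-year calculation, the sketch below applies the same `groupby(...).shift(1)` pattern to a small made-up frame. The company names and values are hypothetical and are not taken from the Nordic Compass dataset.

```python
import pandas as pd

# Hypothetical toy data: two companies with a few years each
toy_change_df = pd.DataFrame(
    {
        "company": ["A", "A", "A", "B", "B"],
        "year": [2019, 2020, 2021, 2020, 2021],
        "metrics_reported": [10, 12, 12, 8, 5],
    }
).sort_values(["company", "year"])

# shift(1) within each company aligns every row with that company's previous row
toy_change_df["metrics_change_from_prev_year"] = toy_change_df[
    "metrics_reported"
] - toy_change_df.groupby("company")["metrics_reported"].shift(1)

print(toy_change_df)
```

The first row for each company has no predecessor, so its change is NaN. Note that `shift(1)` compares against the previous available row, so if a company is missing a year, the change spans more than one year.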


In [92]:
# Compute industry percentile (within each year & industry)
metrics_reported_by_company_df["industry_percentile"] = (
    metrics_reported_by_company_df.groupby(["year", "industry"])[
        "metrics_reported"
    ].rank(pct=True)
)


# Compute overall percentile (within each year)
metrics_reported_by_company_df["overall_percentile"] = (
    metrics_reported_by_company_df.groupby("year")["metrics_reported"].rank(pct=True)
)
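
To make the percentile columns concrete, here is a small sketch with made-up values showing how `rank(pct=True)` behaves within a group. With the default `method="average"`, tied companies share the mean of the ranks they occupy, and that rank is then divided by the group size.

```python
import pandas as pd

# Hypothetical toy data: four companies in a single year, two of them tied
toy_rank_df = pd.DataFrame(
    {
        "year": [2020] * 4,
        "company": ["A", "B", "C", "D"],
        "metrics_reported": [5, 9, 9, 14],
    }
)

toy_rank_df["overall_percentile"] = toy_rank_df.groupby("year")[
    "metrics_reported"
].rank(pct=True)

print(toy_rank_df)
```

Under this convention the best company in a group always receives 1.0, and the weakest company never receives exactly 0.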

In [93]:
desired_columns = [
    "company",
    "ticker",
    "year",
    "industry",
    "metrics_reported",
    "metric_coverage",
    "industry_percentile",
    "overall_percentile",
    "metrics_change_from_prev_year",
]

remaining_columns = [
    col for col in metrics_reported_by_company_df.columns if col not in desired_columns
]

metrics_reported_by_company_df = metrics_reported_by_company_df[
    desired_columns + remaining_columns
]

In [94]:
metrics_reported_by_company_df.head()

Unnamed: 0,company,ticker,year,industry,metrics_reported,metric_coverage,industry_percentile,overall_percentile,metrics_change_from_prev_year,csrd_2025,csrd_2027,segment,hq_country,years_esg_data,base_year,external_audit_of_ESG_report,ceo_sust_statem,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,disclosure_of_water_discharges,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool
104,A.P. Møller -Maersk A/S,MAERSK,2019,Industrial Goods and Services,13,0.76,0.85,0.85,,1,0,Large,Denmark,4,2019,1,1,1,1,1,0,0,1,0,1,1,1,1,1,1,1,0
101,A.P. Møller -Maersk A/S,MAERSK,2020,Industrial Goods and Services,14,0.82,0.88,0.89,1.0,1,0,Large,Denmark,4,2019,1,1,1,1,1,1,0,1,0,1,1,0,1,1,1,1,1
102,A.P. Møller -Maersk A/S,MAERSK,2021,Industrial Goods and Services,14,0.82,0.91,0.87,0.0,1,0,Large,Denmark,4,2019,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,1
103,A.P. Møller -Maersk A/S,MAERSK,2022,Industrial Goods and Services,14,0.82,0.86,0.88,0.0,1,0,Large,Denmark,4,2019,0,1,1,1,1,1,0,1,0,1,1,1,1,1,1,1,1
1644,AAK AB,AAK,2019,Consumer Goods and Services,12,0.71,0.63,0.76,,1,0,Large,Sweden,4,2019,1,1,1,1,1,0,0,1,0,1,1,1,1,1,1,0,0


### Metrics reported by industry

In [95]:
metrics_reported_agg_df = metrics_reported_by_company_df.copy().drop(
    columns=["industry_percentile", "overall_percentile"]
)

In [136]:
metrics_reported_by_industry_df = metrics_reported_agg_df.groupby(
    [
        "year",
        "industry",
    ],
    as_index=True,
).agg(
    no_of_companies=("company", "count"),
    metric_coverage_mean=("metric_coverage", "mean"),
    metric_coverage_median=("metric_coverage", "median"),
)

metrics_reported_by_industry_df

Unnamed: 0_level_0,Unnamed: 1_level_0,no_of_companies,metric_coverage_mean,metric_coverage_median
year,industry,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2019,Basic Materials,25,0.68,0.76
2019,Consumer Goods and Services,89,0.58,0.65
2019,Energy and Utilities,36,0.46,0.47
2019,Finance,103,0.46,0.47
2019,Health Care,54,0.41,0.41
2019,Industrial Goods and Services,116,0.52,0.53
2019,Other,27,0.48,0.47
2019,Technology,36,0.37,0.38
2020,Basic Materials,25,0.72,0.76
2020,Consumer Goods and Services,87,0.64,0.65


### Metrics reported by HQ country

In [137]:
metrics_reported_by_country_df = metrics_reported_agg_df.groupby(
    ["year", "hq_country"], as_index=True
).agg(
    no_of_companies=("company", "count"),
    metric_coverage_mean=("metric_coverage", "mean"),
    metric_coverage_median=("metric_coverage", "median"),
)

metrics_reported_by_country_df.query(
    "hq_country in ['Sweden', 'Norway', 'Denmark', 'Finland']"
)

Unnamed: 0_level_0,Unnamed: 1_level_0,no_of_companies,metric_coverage_mean,metric_coverage_median
year,hq_country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
2019,Denmark,59,0.49,0.47
2019,Finland,71,0.58,0.65
2019,Norway,94,0.44,0.44
2019,Sweden,213,0.51,0.53
2020,Denmark,60,0.55,0.59
2020,Finland,72,0.63,0.65
2020,Norway,103,0.49,0.53
2020,Sweden,202,0.55,0.59
2021,Denmark,54,0.6,0.65
2021,Finland,62,0.63,0.71


## Sector-specific environmental score

Metrics reported is a useful measure, but it can be misleading. Not all companies are equally affected by every metric, so each company must determine which environmental factors are material to its operations. (Water discharges, for example, will be far more material for a mining company than for a technology company, and should be weighted accordingly.) I first estimate how material each metric is to each industry by calculating the mean value of each column per industry. Because the columns are Boolean, the higher the mean, the more companies report on the metric, and therefore the more important (material) it is assumed to be to the industry.



The higher the mean, the higher the weight given to that metric within the industry.

I calculate the industry mean first. Each metric's mean is then taken as a share of the sum of all metric means for that industry, which gives the weight of that metric in the industry's overall score.
For example, the mean of ghg_emis_bool for 'Basic Materials' is 0.88 and the means of all 17 scoring columns for that industry sum to 12.65, so the weight of ghg_emis_bool for 'Basic Materials' is 0.88/12.65 ≈ 7%.



### Calculate industry mean

I decide which columns will be included in the environmental score.

```diff
- If external audit is 0, there should be a deduction of 10 points from the final ESG score, I think... Remove it from the scoring columns, but use it at the end.
```

In [98]:
scoring_cols = [
    "external_audit_of_ESG_report",  # decide whether to include this, or perhaps add it to the index
    "ceo_sust_statem",
    "environmental_policy_and_assessment",
    "environmental_performance_targets",
    "reduced_environmental_impact",
    "increased_renewable_energy",
    "disclosure_of_raw_material_use",
    "resource_efficiency_target",
    "disclosure_of_water_discharges",
    "supplier_guidelines",
    "disclosure_of_suppliers_audited",
    "disclosure_of_supplier_evaluation_procedures",
    "supplier_environmental_assessment",
    "energy_consump_bool",
    "water_withdraw_bool",
    "ghg_emis_bool",
    "transport_emis_bool",
]

I calculate the mean using values from 2019 to 2022, rather than calculating the mean for each year. The result will be used to assign materiality weights to each metric, so this method allows each metric to have the same weight from year to year, making it easier to compare companies across years. If I calculated the mean for each year, the materiality weight for each metric would vary from one year to the next, making it harder to determine whether a company has made progress from year to year.

The potential downside of the chosen method is that some metrics may become more material over time, so it may be better to calculate mean values using the most recent year as a benchmark, or the most recent n years. This is something to consider when data is added to the database in future.


In [99]:
industry_means_all_years_df = reporting_df.groupby("industry")[scoring_cols].mean()
industry_means_all_years_df

Unnamed: 0_level_0,external_audit_of_ESG_report,ceo_sust_statem,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,disclosure_of_water_discharges,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool
industry,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1
Basic Materials,0.65,0.85,0.99,0.93,0.91,0.35,0.73,0.69,0.53,0.88,0.6,0.58,0.81,0.9,0.85,0.88,0.52
Consumer Goods and Services,0.42,0.8,0.96,0.88,0.94,0.45,0.31,0.68,0.15,0.89,0.38,0.47,0.77,0.68,0.46,0.76,0.57
Energy and Utilities,0.26,0.74,0.91,0.82,0.85,0.19,0.05,0.54,0.25,0.71,0.38,0.38,0.59,0.66,0.31,0.66,0.49
Finance,0.42,0.75,0.9,0.78,0.87,0.35,0.09,0.46,0.01,0.77,0.2,0.3,0.63,0.59,0.3,0.68,0.63
Health Care,0.34,0.6,0.82,0.57,0.78,0.27,0.07,0.32,0.03,0.83,0.4,0.29,0.62,0.4,0.33,0.52,0.35
Industrial Goods and Services,0.4,0.81,0.94,0.84,0.94,0.38,0.12,0.59,0.06,0.85,0.41,0.44,0.69,0.6,0.32,0.66,0.47
Other,0.49,0.75,0.94,0.82,0.94,0.25,0.09,0.43,0.01,0.76,0.25,0.39,0.69,0.68,0.2,0.73,0.58
Technology,0.32,0.61,0.78,0.64,0.78,0.26,0.05,0.43,0.01,0.69,0.3,0.28,0.57,0.47,0.2,0.52,0.42


### Calculate materiality

Once I have calculated the mean of each column for every industry, I can calculate the weight that should be given to each column. I sum the means of all columns for a given industry to obtain the 'industry_materiality_score', and then divide each column's mean by that score to obtain the column's materiality weight.

Industries with the highest 'industry_materiality_score' are assumed to have the most reporting requirements, because more metrics are material to them than to other industries.

```diff
- This is a flawed approach, because materiality is decided by CSRD regulations rather than by the number of companies in the industry that have reported on a metric, but this approach is still useful given the data available, and is based on the method used by LSEG (2024). 
```



In [100]:
industry_weights_df = calculate_industry_weights(reporting_df, scoring_cols)

In [101]:
# industry_weights_df

In [102]:
# Show only materiality weight columns and the materiality score
industry_weights_df[
    ["industry_materiality_score"]
    + [f"{col}_materiality_weight" for col in scoring_cols]
]

Unnamed: 0_level_0,industry_materiality_score,external_audit_of_ESG_report_materiality_weight,ceo_sust_statem_materiality_weight,environmental_policy_and_assessment_materiality_weight,environmental_performance_targets_materiality_weight,reduced_environmental_impact_materiality_weight,increased_renewable_energy_materiality_weight,disclosure_of_raw_material_use_materiality_weight,resource_efficiency_target_materiality_weight,disclosure_of_water_discharges_materiality_weight,supplier_guidelines_materiality_weight,disclosure_of_suppliers_audited_materiality_weight,disclosure_of_supplier_evaluation_procedures_materiality_weight,supplier_environmental_assessment_materiality_weight,energy_consump_bool_materiality_weight,water_withdraw_bool_materiality_weight,ghg_emis_bool_materiality_weight,transport_emis_bool_materiality_weight
industry,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1
Basic Materials,12.65,0.05,0.07,0.08,0.07,0.07,0.03,0.06,0.05,0.04,0.07,0.05,0.05,0.06,0.07,0.07,0.07,0.04
Consumer Goods and Services,10.55,0.04,0.08,0.09,0.08,0.09,0.04,0.03,0.06,0.01,0.08,0.04,0.04,0.07,0.06,0.04,0.07,0.05
Energy and Utilities,8.8,0.03,0.08,0.1,0.09,0.1,0.02,0.01,0.06,0.03,0.08,0.04,0.04,0.07,0.08,0.03,0.08,0.06
Finance,8.74,0.05,0.09,0.1,0.09,0.1,0.04,0.01,0.05,0.0,0.09,0.02,0.03,0.07,0.07,0.03,0.08,0.07
Health Care,7.57,0.05,0.08,0.11,0.08,0.1,0.04,0.01,0.04,0.0,0.11,0.05,0.04,0.08,0.05,0.04,0.07,0.05
Industrial Goods and Services,9.53,0.04,0.09,0.1,0.09,0.1,0.04,0.01,0.06,0.01,0.09,0.04,0.05,0.07,0.06,0.03,0.07,0.05
Other,8.99,0.05,0.08,0.1,0.09,0.1,0.03,0.01,0.05,0.0,0.09,0.03,0.04,0.08,0.08,0.02,0.08,0.06
Technology,7.32,0.04,0.08,0.11,0.09,0.11,0.03,0.01,0.06,0.0,0.09,0.04,0.04,0.08,0.06,0.03,0.07,0.06
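
The materiality calculation can be sketched on toy data as follows. This illustrates the idea only and is not the body of `calculate_industry_weights`, which is imported from `functions` and not shown in this notebook; the helper name `sketch_industry_weights` and the data are hypothetical. The assertion is the kind of sanity check that could confirm each industry's weights sum to 1.

```python
import numpy as np
import pandas as pd


def sketch_industry_weights(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Hypothetical helper: per-industry means, their sum, and normalised weights."""
    means = df.groupby("industry")[cols].mean()
    out = means.copy()
    out["industry_materiality_score"] = means.sum(axis=1)
    # Divide each column's mean by the industry total (row-wise alignment)
    weights = means.div(out["industry_materiality_score"], axis=0)
    return out.join(weights.add_suffix("_materiality_weight"))


# Made-up data: two industries, two metrics
toy_industry_df = pd.DataFrame(
    {
        "industry": ["Energy", "Energy", "Technology", "Technology"],
        "ghg_emis_bool": [1, 1, 1, 0],
        "water_withdraw_bool": [1, 0, 0, 0],
    }
)
toy_cols = ["ghg_emis_bool", "water_withdraw_bool"]
toy_weights_df = sketch_industry_weights(toy_industry_df, toy_cols)

# Sanity check: the weights for every industry should sum to 1
toy_weight_cols = [f"{col}_materiality_weight" for col in toy_cols]
assert np.allclose(toy_weights_df[toy_weight_cols].sum(axis=1), 1.0)

print(toy_weights_df[["industry_materiality_score"] + toy_weight_cols])
```

An industry in which no company reports any metric would have a materiality score of 0 and produce NaN weights, so a real implementation may need to guard against that case.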


### Calculate company percentiles

Company percentiles are calculated by comparing a company's value for each metric with the values of other companies in the same industry. Because each metric is Boolean (either 1 or 0), only values of 1 are given a relative percentile rank--a company with a value of 0 for a given metric receives a percentile rank of 0 for that metric.

For companies with a value of 1 for a given metric, the percentile rank is calculated as follows: 

$$
\text{score} = \frac{\text{no. of companies with a worse value} + \frac{\text{no. of companies with the same value, inc. current company}}{2}}{\text{total no. of companies}}
$$
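
A minimal sketch of this formula for a single Boolean column is shown below. It is not the implementation of `calculate_company_percentiles` (which is imported from `functions` and not shown here); the helper name `midpoint_percentile` is hypothetical. The sketch relies on the fact that the formula is equivalent to `(average rank - 0.5) / n`, and it sets companies with a value of 0 to a percentile of 0, as described above.

```python
import pandas as pd


def midpoint_percentile(values: pd.Series) -> pd.Series:
    """Hypothetical helper: (worse + same / 2) / total, with zeros forced to 0."""
    n = len(values)
    # The average rank equals (no. worse) + (no. same + 1) / 2, so subtracting
    # 0.5 and dividing by n reproduces the formula
    pct = (values.rank(method="average") - 0.5) / n
    return pct.where(values == 1, 0.0)


# Toy industry of 4 companies: 3 do not report the metric, 1 does
toy_metric = pd.Series([0, 0, 0, 1])
print(midpoint_percentile(toy_metric).tolist())  # [0.0, 0.0, 0.0, 0.875]
```

A rarely reported metric therefore earns a reporting company a higher percentile than a commonly reported one. For example, if 26% of an industry reports a metric, a reporting company scores 0.74 + 0.26/2 = 0.87.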


In [103]:
company_percentiles_by_industry_df = calculate_company_percentiles(
    reporting_df, scoring_cols, by_industry=True
)

In [104]:
company_percentiles_by_industry_df[
    company_percentiles_by_industry_df["company"] == "Archer Ltd."
]

Unnamed: 0,company,ticker,year,industry,external_audit_of_ESG_report_percentile,ceo_sust_statem_percentile,environmental_policy_and_assessment_percentile,environmental_performance_targets_percentile,reduced_environmental_impact_percentile,increased_renewable_energy_percentile,disclosure_of_raw_material_use_percentile,resource_efficiency_target_percentile,disclosure_of_water_discharges_percentile,supplier_guidelines_percentile,disclosure_of_suppliers_audited_percentile,disclosure_of_supplier_evaluation_procedures_percentile,supplier_environmental_assessment_percentile,energy_consump_bool_percentile,water_withdraw_bool_percentile,ghg_emis_bool_percentile,transport_emis_bool_percentile
453,Archer Ltd.,ARCHO,2020,Energy and Utilities,0.87,0.6,0.54,0.55,0.56,0.0,0.0,0.6,0.0,0.66,0.79,0.0,0.0,0.63,0.0,0.0,0.0


### Calculate score

The environmental score for each company is calculated by multiplying the percentile for each metric by that metric's materiality weight. These values are then summed to give a raw score. To calculate the adjusted score, the raw scores are converted to percentiles, giving each company a score that reflects its ranking relative to others in the same industry. This is then converted to a rating, from A+ to D-.

I first calculate the raw score.

In [105]:
sector_specific_scoring_df = calculate_raw_score(
    company_percentiles_by_industry_df, industry_weights_df, scoring_cols
)

In [106]:
sector_specific_scoring_df[
    (sector_specific_scoring_df["industry"] == "Energy and Utilities")
    & (sector_specific_scoring_df["year"] == 2020)
].head()

Unnamed: 0,company,ticker,year,industry,industry_score_raw,external_audit_of_ESG_report_score,ceo_sust_statem_score,environmental_policy_and_assessment_score,environmental_performance_targets_score,reduced_environmental_impact_score,increased_renewable_energy_score,disclosure_of_raw_material_use_score,resource_efficiency_target_score,disclosure_of_water_discharges_score,supplier_guidelines_score,disclosure_of_suppliers_audited_score,disclosure_of_supplier_evaluation_procedures_score,supplier_environmental_assessment_score,energy_consump_bool_score,water_withdraw_bool_score,ghg_emis_bool_score,transport_emis_bool_score
453,Archer Ltd.,ARCHO,2020,Energy and Utilities,0.41,0.03,0.05,0.06,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.03,0.0,0.0,0.05,0.0,0.0,0.0
454,Avance Gas Holding ltd,AGAS,2020,Energy and Utilities,0.37,0.03,0.05,0.06,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.0,0.0,0.0,0.05,0.0,0.0,0.0
455,Borr Drilling Ltd,BDRILL,2020,Energy and Utilities,0.52,0.0,0.05,0.06,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.03,0.04,0.05,0.05,0.0,0.05,0.0
456,BW LPG,BWLPG,2020,Energy and Utilities,0.47,0.0,0.05,0.06,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.03,0.04,0.05,0.05,0.0,0.0,0.0
457,BW Offshore Limited,BWO,2020,Energy and Utilities,0.42,0.0,0.05,0.06,0.05,0.05,0.0,0.0,0.04,0.02,0.05,0.0,0.0,0.05,0.05,0.0,0.0,0.0
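
The weighted sum behind the raw score can be sketched for a single company as below. The percentile and weight values are made up for illustration, and the snippet is not the implementation of `calculate_raw_score`.

```python
import pandas as pd

# Hypothetical percentiles for one company and weights for its industry
toy_percentiles = pd.Series(
    {"ghg_emis_bool": 0.8, "water_withdraw_bool": 0.0, "supplier_guidelines": 0.6}
)
toy_weights = pd.Series(
    {"ghg_emis_bool": 0.5, "water_withdraw_bool": 0.3, "supplier_guidelines": 0.2}
)

# Per-metric contribution, then the raw score as the sum of contributions
toy_metric_scores = toy_percentiles * toy_weights
toy_raw_score = toy_metric_scores.sum()

print(toy_metric_scores.round(2).to_dict(), round(toy_raw_score, 2))
```

Because each percentile is at most 1 and the weights sum to 1, the raw score always lies between 0 and 1.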


I then calculate an adjusted score by converting the raw score into a percentile relative to other companies in the same industry and the same year. I created an option in the function to group by_industry or to compare against all companies, regardless of industry. To calculate the sector-specific score, I group by industry. Later, when calculating the cross-sector score, I will compare to all other companies.

In [107]:
sector_specific_scoring_df = calculate_adjusted_score(
    sector_specific_scoring_df, raw_column="industry_score_raw", by_industry=True
)

In [108]:
sector_specific_scoring_df

Unnamed: 0,company,ticker,year,industry,industry_score_raw,external_audit_of_ESG_report_score,ceo_sust_statem_score,environmental_policy_and_assessment_score,environmental_performance_targets_score,reduced_environmental_impact_score,increased_renewable_energy_score,disclosure_of_raw_material_use_score,resource_efficiency_target_score,disclosure_of_water_discharges_score,supplier_guidelines_score,disclosure_of_suppliers_audited_score,disclosure_of_supplier_evaluation_procedures_score,supplier_environmental_assessment_score,energy_consump_bool_score,water_withdraw_bool_score,ghg_emis_bool_score,transport_emis_bool_score,industry_score_adjusted
0,Josemaria Resources Inc.,JOSE,2019,Basic Materials,0.08,0.00,0.00,0.04,0.00,0.00,0.00,0.00,0.04,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.06
1,Lucara Diamond Corp.,LUC,2019,Basic Materials,0.31,0.03,0.04,0.04,0.00,0.00,0.00,0.00,0.04,0.00,0.00,0.00,0.00,0.00,0.04,0.04,0.04,0.03,0.18
2,Lundin Gold Inc.,LUG,2019,Basic Materials,0.19,0.00,0.04,0.04,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.04,0.04,0.03,0.10
3,Lundin Mining Corporation,LUMI,2019,Basic Materials,0.46,0.03,0.04,0.04,0.04,0.04,0.02,0.00,0.04,0.03,0.05,0.00,0.00,0.00,0.04,0.04,0.04,0.00,0.42
4,H+H International A/S,HH,2019,Basic Materials,0.27,0.00,0.00,0.04,0.04,0.04,0.02,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.04,0.04,0.04,0.00,0.14
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1832,Cint Group AB,CINT,2022,Technology,0.38,0.04,0.05,0.07,0.05,0.06,0.00,0.00,0.00,0.00,0.06,0.00,0.00,0.05,0.00,0.00,0.00,0.00,0.33
1833,Garo AB,GARO,2022,Technology,0.60,0.04,0.05,0.07,0.05,0.06,0.03,0.00,0.05,0.00,0.06,0.00,0.03,0.05,0.04,0.02,0.04,0.00,0.87
1834,Proact IT Group AB,PACT,2022,Technology,0.49,0.04,0.05,0.07,0.05,0.06,0.00,0.00,0.05,0.00,0.06,0.03,0.03,0.05,0.00,0.00,0.00,0.00,0.56
1835,Hexagon AB,HEXA,2022,Technology,0.51,0.00,0.05,0.07,0.00,0.06,0.03,0.00,0.00,0.00,0.06,0.03,0.00,0.05,0.04,0.02,0.04,0.04,0.59


I then give each company an industry rating based on its adjusted score. This rating system is based on LSEG (2024):

| Percentile Range  | Rating |
|-------------------|--------|
| 0.92 < p ≤ 1     | A+     |
| 0.83 < p ≤ 0.92  | A      |
| 0.75 < p ≤ 0.83  | A-     |
| 0.67 < p ≤ 0.75  | B+     |
| 0.58 < p ≤ 0.67  | B      |
| 0.50 < p ≤ 0.58  | B-     |
| 0.42 < p ≤ 0.50  | C+     |
| 0.33 < p ≤ 0.42  | C      |
| 0.25 < p ≤ 0.33  | C-     |
| 0.17 < p ≤ 0.25  | D+     |
| 0.08 < p ≤ 0.17  | D      |
| 0.00 ≤ p ≤ 0.08  | D-     |
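
One way to implement a mapping like this is with `bisect`, as sketched below. This is an assumption about how a function such as `assign_rating` could work, not its actual code; the name `sketch_assign_rating` is hypothetical. The band edges follow the table, with the B band closed at 0.67 so that the bands are contiguous.

```python
from bisect import bisect_left

# Inclusive upper bound of each band, lowest band first
UPPER_BOUNDS = [0.08, 0.17, 0.25, 0.33, 0.42, 0.50, 0.58, 0.67, 0.75, 0.83, 0.92, 1.00]
RATINGS = ["D-", "D", "D+", "C-", "C", "C+", "B-", "B", "B+", "A-", "A", "A+"]


def sketch_assign_rating(p: float) -> str:
    """Hypothetical helper: map a percentile in [0, 1] to a letter rating."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"percentile out of range: {p}")
    # bisect_left finds the first band whose upper bound is >= p
    return RATINGS[bisect_left(UPPER_BOUNDS, p)]


print([sketch_assign_rating(p) for p in (0.06, 0.59, 0.87, 1.0)])
```

Keeping the bounds in a single sorted list makes it easy to confirm that every percentile between 0 and 1 falls into exactly one band.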


In [109]:
sector_specific_scoring_df["industry_rating"] = sector_specific_scoring_df[
    "industry_score_adjusted"
].apply(assign_rating)

sector_specific_scoring_df["industry_rating"] = cast_to_rating_category(
    sector_specific_scoring_df["industry_rating"]
)

In [110]:
# re-order columns
first_cols = [
    "company",
    "ticker",
    "year",
    "industry",
    "industry_rating",
    "industry_score_adjusted",
]

remaining_cols = [
    col for col in sector_specific_scoring_df.columns if col not in first_cols
]
sector_specific_scoring_df = sector_specific_scoring_df[first_cols + remaining_cols]

Finally, I calculate the previous year's rating for each company to be used as a comparison.

In [111]:
sector_specific_scoring_df = sector_specific_scoring_df.sort_values(
    ["company", "year"]
).assign(
    industry_rating_previous_year=lambda df: df.groupby("company")[
        "industry_rating"
    ].shift(1)
)

In [112]:
sector_specific_scoring_df.head()

Unnamed: 0,company,ticker,year,industry,industry_rating,industry_score_adjusted,industry_score_raw,external_audit_of_ESG_report_score,ceo_sust_statem_score,environmental_policy_and_assessment_score,environmental_performance_targets_score,reduced_environmental_impact_score,increased_renewable_energy_score,disclosure_of_raw_material_use_score,resource_efficiency_target_score,disclosure_of_water_discharges_score,supplier_guidelines_score,disclosure_of_suppliers_audited_score,disclosure_of_supplier_evaluation_procedures_score,supplier_environmental_assessment_score,energy_consump_bool_score,water_withdraw_bool_score,ghg_emis_bool_score,transport_emis_bool_score,industry_rating_previous_year
1152,A.P. Møller -Maersk A/S,MAERSK,2019,Industrial Goods and Services,A,0.87,0.59,0.03,0.05,0.05,0.05,0.05,0.0,0.0,0.05,0.0,0.05,0.03,0.04,0.05,0.05,0.03,0.05,0.0,
1268,A.P. Møller -Maersk A/S,MAERSK,2020,Industrial Goods and Services,A,0.88,0.61,0.03,0.05,0.05,0.05,0.05,0.03,0.0,0.04,0.0,0.05,0.04,0.0,0.05,0.04,0.03,0.05,0.04,A
1388,A.P. Møller -Maersk A/S,MAERSK,2021,Industrial Goods and Services,A,0.9,0.59,0.03,0.05,0.05,0.05,0.05,0.03,0.0,0.0,0.0,0.05,0.03,0.04,0.05,0.04,0.03,0.04,0.04,A
1494,A.P. Møller -Maersk A/S,MAERSK,2022,Industrial Goods and Services,A,0.86,0.59,0.0,0.05,0.06,0.05,0.05,0.03,0.0,0.04,0.0,0.05,0.03,0.04,0.04,0.04,0.03,0.04,0.03,A
173,AAK AB,AAK,2019,Consumer Goods and Services,B,0.59,0.5,0.03,0.05,0.05,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.03,0.03,0.05,0.05,0.04,0.0,0.0,


In [113]:
# sector_specific_scoring_df[
#     (sector_specific_scoring_df["industry"] == "Energy and Utilities")
#     & (sector_specific_scoring_df["year"] == 2020)
# ]

## Cross-sector environmental score

```diff
- Add notes in this section
```

I then compare the sector-specific score with a cross-sector score, which ranks all companies against one another regardless of industry.

```diff
- For each company, sum all scores (percentile rank for each column). Add all scores for every company, and give each company a percentile rank based on the score equation above.
```

```diff
- First, calculate the materiality weight for each column, regardless of industry.
- Then, calculate company percentiles for each column.
- Sum percentiles.
- Then take the percentiles for that sum.
```

### Calculate overall mean and materiality

In [114]:
metric_materiality_all_years_df = reporting_df[scoring_cols].mean().reset_index()
metric_materiality_all_years_df.columns = ["", "mean_score"]

In [115]:
metric_materiality_all_years_df["overall_materiality_score"] = (
    metric_materiality_all_years_df["mean_score"].sum()
)

metric_materiality_all_years_df["metric_materiality"] = (
    metric_materiality_all_years_df["mean_score"]
    / metric_materiality_all_years_df["overall_materiality_score"]
)

metric_materiality_all_years_df = metric_materiality_all_years_df.set_index("").T

In [116]:
metric_materiality_all_years_df

Unnamed: 0,external_audit_of_ESG_report,ceo_sust_statem,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,disclosure_of_water_discharges,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool
mean_score,0.4,0.75,0.91,0.79,0.89,0.34,0.16,0.53,0.09,0.81,0.35,0.39,0.67,0.6,0.35,0.67,0.51
overall_materiality_score,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22,9.22
metric_materiality,0.04,0.08,0.1,0.09,0.1,0.04,0.02,0.06,0.01,0.09,0.04,0.04,0.07,0.07,0.04,0.07,0.06


### Calculate company percentiles

In [117]:
company_percentiles_overall_df = calculate_company_percentiles(
    reporting_df, scoring_cols, by_industry=False
)

In [118]:
company_percentiles_overall_df[
    company_percentiles_overall_df["company"] == "Archer Ltd."
]

Unnamed: 0,company,ticker,year,industry,external_audit_of_ESG_report_percentile,ceo_sust_statem_percentile,environmental_policy_and_assessment_percentile,environmental_performance_targets_percentile,reduced_environmental_impact_percentile,increased_renewable_energy_percentile,disclosure_of_raw_material_use_percentile,resource_efficiency_target_percentile,disclosure_of_water_discharges_percentile,supplier_guidelines_percentile,disclosure_of_suppliers_audited_percentile,disclosure_of_supplier_evaluation_procedures_percentile,supplier_environmental_assessment_percentile,energy_consump_bool_percentile,water_withdraw_bool_percentile,ghg_emis_bool_percentile,transport_emis_bool_percentile
486,Archer Ltd.,ARCHO,2020,Energy and Utilities,0.8,0.62,0.53,0.59,0.55,0.0,0.0,0.69,0.0,0.59,0.83,0.0,0.0,0.69,0.0,0.0,0.0


### Calculate score

In [119]:
cross_sector_scoring_df = calculate_raw_score(
    company_percentiles_overall_df,
    metric_materiality_all_years_df,
    scoring_cols,
    by_industry=False,
)

In [120]:
cross_sector_scoring_df[
    (cross_sector_scoring_df["industry"] == "Energy and Utilities")
    & (cross_sector_scoring_df["year"] == 2020)
].head()

Unnamed: 0,company,ticker,year,industry,overall_score_raw,external_audit_of_ESG_report_score,ceo_sust_statem_score,environmental_policy_and_assessment_score,environmental_performance_targets_score,reduced_environmental_impact_score,increased_renewable_energy_score,disclosure_of_raw_material_use_score,resource_efficiency_target_score,disclosure_of_water_discharges_score,supplier_guidelines_score,disclosure_of_suppliers_audited_score,disclosure_of_supplier_evaluation_procedures_score,supplier_environmental_assessment_score,energy_consump_bool_score,water_withdraw_bool_score,ghg_emis_bool_score,transport_emis_bool_score
486,Archer Ltd.,ARCHO,2020,Energy and Utilities,0.41,0.03,0.05,0.05,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.03,0.0,0.0,0.05,0.0,0.0,0.0
487,Avance Gas Holding ltd,AGAS,2020,Energy and Utilities,0.38,0.03,0.05,0.05,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.0,0.0,0.0,0.05,0.0,0.0,0.0
488,Borr Drilling Ltd,BDRILL,2020,Energy and Utilities,0.51,0.0,0.05,0.05,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.03,0.03,0.05,0.05,0.0,0.05,0.0
489,BW LPG,BWLPG,2020,Energy and Utilities,0.46,0.0,0.05,0.05,0.05,0.05,0.0,0.0,0.04,0.0,0.05,0.03,0.03,0.05,0.05,0.0,0.0,0.0
490,BW Offshore Limited,BWO,2020,Energy and Utilities,0.4,0.0,0.05,0.05,0.05,0.05,0.0,0.0,0.04,0.01,0.05,0.0,0.0,0.05,0.05,0.0,0.0,0.0


The overall adjusted score converts the overall raw score into a percentile rank, relative to the other companies' scores in the same year. I chose a within-year comparison rather than ranking against scores from all years, because it better reflects a company's standing among its peers at the time. One potential downside is that a company can have a higher overall_score_raw but a lower overall_score_adjusted than a company in a different year. This happens when the other companies scored higher in that year, which pulls the company's percentile down.
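
A minimal sketch of this effect on hypothetical data. It uses `rank(pct=True)` as a stand-in percentile, which is an assumption; `calculate_adjusted_score` is defined earlier in the notebook and may use a different formula.

```python
import pandas as pd

# Hypothetical raw scores: 2020 is a stronger year overall than 2019.
toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "C", "A", "B", "C"],
        "year": [2019, 2019, 2019, 2020, 2020, 2020],
        "overall_score_raw": [0.20, 0.30, 0.40, 0.45, 0.55, 0.65],
    }
)

# Percentile rank within each year (pandas' default "average" ranking method).
toy_df["overall_score_adjusted"] = toy_df.groupby("year")[
    "overall_score_raw"
].rank(pct=True)

print(toy_df)
```

Company A in 2020 has a higher raw score (0.45) than company C in 2019 (0.40), yet A ranks lowest in its year while C ranks highest in its own.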

In [121]:
cross_sector_scoring_df = calculate_adjusted_score(
    cross_sector_scoring_df, raw_column="overall_score_raw", by_industry=False
)

In [122]:
cross_sector_scoring_df

Unnamed: 0,company,ticker,year,industry,overall_score_raw,external_audit_of_ESG_report_score,ceo_sust_statem_score,environmental_policy_and_assessment_score,environmental_performance_targets_score,reduced_environmental_impact_score,increased_renewable_energy_score,disclosure_of_raw_material_use_score,resource_efficiency_target_score,disclosure_of_water_discharges_score,supplier_guidelines_score,disclosure_of_suppliers_audited_score,disclosure_of_supplier_evaluation_procedures_score,supplier_environmental_assessment_score,energy_consump_bool_score,water_withdraw_bool_score,ghg_emis_bool_score,transport_emis_bool_score,overall_score_adjusted
0,Avance Gas Holding ltd,AGAS,2019,Energy and Utilities,0.22,0.00,0.06,0.05,0.05,0.05,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.19
1,Borr Drilling Ltd,BDRILL,2019,Energy and Utilities,0.35,0.00,0.00,0.05,0.00,0.05,0.00,0.00,0.04,0.00,0.05,0.00,0.00,0.00,0.05,0.00,0.05,0.05,0.39
2,BW LPG,BWLPG,2019,Energy and Utilities,0.16,0.00,0.06,0.05,0.00,0.05,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.13
3,BW Offshore Limited,BWO,2019,Energy and Utilities,0.27,0.00,0.06,0.05,0.00,0.05,0.00,0.00,0.00,0.01,0.05,0.00,0.00,0.00,0.05,0.00,0.00,0.00,0.27
4,FLEX LNG Ltd,FLNG,2019,Energy and Utilities,0.36,0.00,0.06,0.05,0.05,0.05,0.00,0.00,0.04,0.00,0.00,0.00,0.00,0.05,0.05,0.00,0.00,0.00,0.39
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1832,Collector Bank AB (formerly: Collector AB),COLL,2022,Finance,0.33,0.00,0.05,0.06,0.05,0.05,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.00,0.04,0.00,0.04,0.04,0.22
1833,Modern Times Group MTG AB,MTG,2022,Other,0.43,0.00,0.05,0.06,0.05,0.05,0.00,0.00,0.04,0.00,0.05,0.00,0.00,0.05,0.00,0.00,0.04,0.04,0.40
1834,SECTRA AB,SECT,2022,Health Care,0.25,0.00,0.05,0.06,0.00,0.05,0.00,0.00,0.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,0.04,0.00,0.16
1835,Betsson AB,BETS,2022,Other,0.35,0.00,0.05,0.06,0.05,0.05,0.00,0.00,0.00,0.00,0.05,0.00,0.00,0.05,0.04,0.00,0.00,0.00,0.25


In [123]:
cross_sector_scoring_df["overall_rating"] = cross_sector_scoring_df[
    "overall_score_adjusted"
].apply(assign_rating)

cross_sector_scoring_df["overall_rating"] = cast_to_rating_category(
    cross_sector_scoring_df["overall_rating"]
)

In [124]:
# re-order columns
first_cols = [
    "company",
    "ticker",
    "year",
    "industry",
    "overall_rating",
    "overall_score_adjusted",
]

remaining_cols = [
    col for col in cross_sector_scoring_df.columns if col not in first_cols
]
cross_sector_scoring_df = cross_sector_scoring_df[first_cols + remaining_cols]

In [125]:
cross_sector_scoring_df = cross_sector_scoring_df.sort_values(
    ["company", "year"]
).assign(
    overall_rating_previous_year=lambda df: df.groupby("company")[
        "overall_rating"
    ].shift(1)
)

In [126]:
# cross_sector_scoring_df[
#     (cross_sector_scoring_df["industry"] == "Energy and Utilities")
#     & (cross_sector_scoring_df["year"] == 2020)
# ]

In [127]:
cross_sector_scoring_df

Unnamed: 0,company,ticker,year,industry,overall_rating,overall_score_adjusted,overall_score_raw,external_audit_of_ESG_report_score,ceo_sust_statem_score,environmental_policy_and_assessment_score,environmental_performance_targets_score,reduced_environmental_impact_score,increased_renewable_energy_score,disclosure_of_raw_material_use_score,resource_efficiency_target_score,disclosure_of_water_discharges_score,supplier_guidelines_score,disclosure_of_suppliers_audited_score,disclosure_of_supplier_evaluation_procedures_score,supplier_environmental_assessment_score,energy_consump_bool_score,water_withdraw_bool_score,ghg_emis_bool_score,transport_emis_bool_score,overall_rating_previous_year
28,A.P. Møller -Maersk A/S,MAERSK,2019,Industrial Goods and Services,A,0.87,0.60,0.03,0.06,0.05,0.05,0.05,0.00,0.00,0.04,0.00,0.05,0.03,0.03,0.05,0.05,0.03,0.05,0.00,
517,A.P. Møller -Maersk A/S,MAERSK,2020,Industrial Goods and Services,A,0.91,0.61,0.03,0.05,0.05,0.05,0.05,0.03,0.00,0.04,0.00,0.05,0.03,0.00,0.05,0.05,0.03,0.05,0.04,A
998,A.P. Møller -Maersk A/S,MAERSK,2021,Industrial Goods and Services,A,0.87,0.59,0.03,0.05,0.05,0.05,0.05,0.03,0.00,0.00,0.00,0.05,0.03,0.03,0.05,0.05,0.03,0.05,0.04,A
1437,A.P. Møller -Maersk A/S,MAERSK,2022,Industrial Goods and Services,A,0.89,0.60,0.00,0.05,0.06,0.05,0.05,0.03,0.00,0.04,0.00,0.05,0.03,0.03,0.05,0.04,0.03,0.04,0.04,A
462,AAK AB,AAK,2019,Consumer Goods and Services,B+,0.74,0.55,0.03,0.06,0.05,0.05,0.05,0.00,0.00,0.04,0.00,0.05,0.03,0.03,0.05,0.05,0.03,0.00,0.00,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1559,Össur hf,OSSR,2022,Health Care,C-,0.26,0.36,0.00,0.05,0.06,0.00,0.05,0.00,0.00,0.00,0.00,0.05,0.00,0.00,0.00,0.04,0.03,0.04,0.04,B+
51,Ørsted A/S,ORSTED,2019,Energy and Utilities,B+,0.70,0.53,0.03,0.06,0.05,0.05,0.05,0.03,0.00,0.00,0.01,0.05,0.00,0.00,0.00,0.05,0.03,0.05,0.05,
540,Ørsted A/S,ORSTED,2020,Energy and Utilities,A+,0.94,0.63,0.03,0.05,0.05,0.05,0.05,0.03,0.02,0.04,0.00,0.05,0.03,0.00,0.05,0.05,0.03,0.05,0.04,B+
1017,Ørsted A/S,ORSTED,2021,Energy and Utilities,A+,0.93,0.62,0.03,0.05,0.05,0.05,0.05,0.03,0.00,0.04,0.01,0.05,0.03,0.00,0.05,0.05,0.03,0.05,0.04,A+


## Summary of results

```diff
- Create a migration matrix to see how many change from year to year
- Consider what to do with unaudited companies
```
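
One possible way to build the migration matrix is a cross-tabulation of the previous year's rating against the current rating. This is only a sketch on hypothetical rows; the column names mirror those in `cross_sector_scoring_df`, but the data and the choice of `pd.crosstab` are assumptions.

```python
import pandas as pd

# Hypothetical rows mimicking the rating columns of the scoring tables.
toy_df = pd.DataFrame(
    {
        "company": ["Alpha", "Beta", "Gamma", "Delta"],
        "year": [2021, 2021, 2021, 2021],
        "overall_rating_previous_year": ["B", "B", "A", None],
        "overall_rating": ["B", "A", "A", "C"],
    }
)

# Rows with no previous rating (first year of coverage) cannot migrate.
rated_df = toy_df.dropna(subset=["overall_rating_previous_year"])

# Count companies moving from each previous rating (rows)
# to each current rating (columns).
migration_matrix = pd.crosstab(
    rated_df["overall_rating_previous_year"],
    rated_df["overall_rating"],
)

print(migration_matrix)
```

The diagonal counts companies whose rating did not change; off-diagonal cells count upgrades and downgrades.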

In [128]:
reporting_summary_df = sector_specific_scoring_df.merge(
    cross_sector_scoring_df, on=["company", "year"], how="left", suffixes=("", "_drop")
).merge(
    metrics_reported_by_company_df,
    on=["company", "year"],
    how="left",
    suffixes=("", "_drop2"),
)

reporting_summary_df = reporting_summary_df.loc[
    :, ~reporting_summary_df.columns.str.endswith(("_drop", "_drop2"))
]

columns_to_keep = [
    "company",
    "ticker",
    "year",
    "industry",
    "hq_country",
    "segment",
    "industry_rating",
    "overall_rating",
    "metrics_reported",
    "metric_coverage",
    "industry_score_adjusted",
    "industry_score_raw",
    "overall_score_adjusted",
    "overall_score_raw",
    "external_audit_of_ESG_report",
    "csrd_2025",
    "csrd_2027",
    "industry_rating_previous_year",
    "overall_rating_previous_year",
]

reporting_summary_df = reporting_summary_df[columns_to_keep]

In [129]:
reporting_summary_df

Unnamed: 0,company,ticker,year,industry,hq_country,segment,industry_rating,overall_rating,metrics_reported,metric_coverage,industry_score_adjusted,industry_score_raw,overall_score_adjusted,overall_score_raw,external_audit_of_ESG_report,csrd_2025,csrd_2027,industry_rating_previous_year,overall_rating_previous_year
0,A.P. Møller -Maersk A/S,MAERSK,2019,Industrial Goods and Services,Denmark,Large,A,A,13,0.76,0.87,0.59,0.87,0.60,1,1,0,,
1,A.P. Møller -Maersk A/S,MAERSK,2020,Industrial Goods and Services,Denmark,Large,A,A,14,0.82,0.88,0.61,0.91,0.61,1,1,0,A,A
2,A.P. Møller -Maersk A/S,MAERSK,2021,Industrial Goods and Services,Denmark,Large,A,A,14,0.82,0.90,0.59,0.87,0.59,1,1,0,A,A
3,A.P. Møller -Maersk A/S,MAERSK,2022,Industrial Goods and Services,Denmark,Large,A,A,14,0.82,0.86,0.59,0.89,0.60,0,1,0,A,A
4,AAK AB,AAK,2019,Consumer Goods and Services,Sweden,Large,B,B+,12,0.71,0.59,0.50,0.74,0.55,1,1,0,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1832,Össur hf,OSSR,2022,Health Care,Iceland,Large,C,C-,8,0.47,0.36,0.39,0.26,0.36,0,1,0,A,B+
1833,Ørsted A/S,ORSTED,2019,Energy and Utilities,Denmark,Large,A-,B+,12,0.71,0.79,0.54,0.70,0.53,1,1,0,,
1834,Ørsted A/S,ORSTED,2020,Energy and Utilities,Denmark,Large,A,A+,15,0.88,0.91,0.61,0.94,0.63,1,1,0,A-,B+
1835,Ørsted A/S,ORSTED,2021,Energy and Utilities,Denmark,Large,A,A+,15,0.88,0.88,0.61,0.93,0.62,1,1,0,A,A+


## Visualisations

```diff
- Materiality matrix
- Average metrics reported by industry (heatmap across four years)
- Average metrics reported by HQ country (heatmap across four years)
- Main visualisations can come in the emissions section:
-- Box plots
- Average metrics reported by CSRD 2025, CSRD 2027 (this can be better represented on a dashboard)
```



## References

European Commission, 2025. Proposal for a Directive of the European Parliament and of the Council amending Directives (EU) 2022/2464 and (EU) 2024/1760 as regards the dates from which Member States are to apply certain corporate sustainability reporting and due diligence requirements. COM(2025) 80 final. Brussels. Available at: https://commission.europa.eu/document/download/0affa9a8-2ac5-46a9-98f8-19205bf61eb5_en?filename=COM_2025_80_EN.pdf (Accessed: 27 February 2025)

LSEG, 2024. Environmental, Social and Governance Scores from LSEG: October 2024. Available at: https://www.lseg.com/content/dam/data-analytics/en_us/documents/methodology/lseg-esg-scores-methodology.pdf (Accessed: 27 February 2025)

## Appendix

In [130]:
# top_n_companies(
#     cross_sector_scoring_df,
#     n_companies=10,
#     sort_by="overall_rating",
#     industry="Energy and Utilities",
#     year=2021,
# )

## Plan for sector-specific environmental scoring calculation:

1) Calculate the mean value of each parameter for a given industry (use all years rather than a single year, so that scores can be compared between years).
2) Divide each mean by the sum of the mean values of all parameters for that industry (e.g. mean_env_policy + mean_ghg_emis_bool). This gives the magnitude (weight) of each column.
3) For each company, take its value for each parameter and calculate the associated percentile (e.g. a value of 1 = percentile 0.81111).
4) Multiply the company's percentile by the magnitude of the column, so that the most important columns are prioritised.
5) Add up all the weighted scores for the company. This gives the company's raw_sector_score.
6) Take the raw sector score of every company in that sector and calculate the percentile. This gives the company's adjusted_sector_score, which can be considered its sector-specific E(SG) score.

The advantage of using all-year means is that scores within a sector can be compared across years. The drawback is that the weights change whenever another year is added. Using 2019 as a fixed base year would avoid this, but priorities may have shifted in recent years, so it is probably best to stick with the overall means.
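
The six steps above can be sketched as follows. This is a minimal illustration on hypothetical data with made-up metric names, and it uses `rank(pct=True)` as a stand-in for the percentile calculation; the notebook's own `calculate_company_percentiles` may treat ties and non-reporters differently.

```python
import pandas as pd

metric_cols = ["env_policy", "ghg_emis_bool"]  # hypothetical metric names

toy_df = pd.DataFrame(
    {
        "company": ["A", "B", "C", "D"],
        "industry": ["Energy", "Energy", "Finance", "Finance"],
        "env_policy": [1, 1, 1, 0],
        "ghg_emis_bool": [1, 0, 0, 0],
    }
)

# Steps 1-2: industry mean per metric, divided by the sum of that industry's means.
industry_means = toy_df.groupby("industry")[metric_cols].mean()
industry_weights = industry_means.div(industry_means.sum(axis=1), axis=0)

# Step 3: percentile of each company's value within its industry.
percentiles = toy_df.groupby("industry")[metric_cols].rank(pct=True)

# Steps 4-5: weight each percentile by its industry's weight, then sum across metrics.
row_weights = industry_weights.loc[toy_df["industry"]].to_numpy()
toy_df["industry_score_raw"] = (percentiles * row_weights).sum(axis=1)

# Step 6: percentile of the raw score within the industry.
toy_df["industry_score_adjusted"] = toy_df.groupby("industry")[
    "industry_score_raw"
].rank(pct=True)

print(industry_weights)
print(toy_df)
```

A useful sanity check is that the weights in each row of `industry_weights` sum to 1.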

## Plan for overall environmental scoring calculation:

1) Calculate the mean value of each parameter across all companies, regardless of industry (use all years). Ignore magnitude.
2) This gives the transparency weighting: the more companies report a metric, the more transparent it is.
3) Calculate the transparency of each metric relative to the total. This gives the weighting of each metric.
4) For each company and year, calculate the overall percentile for each metric.
5) Multiply each percentile by the weighting of the metric it applies to, then sum all of the values.
6) This gives the company's raw_overall_score.
7) Now calculate the percentile for each company relative to others in the same year (or maybe overall). This gives the company's adjusted_overall_score.
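
Steps 3, 5 and 6 can be checked by hand against the tables printed earlier in the notebook. The sketch below assumes that the rounded mean scores and the rounded Archer Ltd. (2020) percentiles shown in those outputs are close enough to the unrounded values; the variable names are illustrative.

```python
import pandas as pd

# (mean_score, Archer Ltd. 2020 percentile) per metric, copied from the
# rounded values displayed in the notebook outputs.
values = {
    "external_audit_of_ESG_report": (0.40, 0.80),
    "ceo_sust_statem": (0.75, 0.62),
    "environmental_policy_and_assessment": (0.91, 0.53),
    "environmental_performance_targets": (0.79, 0.59),
    "reduced_environmental_impact": (0.89, 0.55),
    "increased_renewable_energy": (0.34, 0.00),
    "disclosure_of_raw_material_use": (0.16, 0.00),
    "resource_efficiency_target": (0.53, 0.69),
    "disclosure_of_water_discharges": (0.09, 0.00),
    "supplier_guidelines": (0.81, 0.59),
    "disclosure_of_suppliers_audited": (0.35, 0.83),
    "disclosure_of_supplier_evaluation_procedures": (0.39, 0.00),
    "supplier_environmental_assessment": (0.67, 0.00),
    "energy_consump_bool": (0.60, 0.69),
    "water_withdraw_bool": (0.35, 0.00),
    "ghg_emis_bool": (0.67, 0.00),
    "transport_emis_bool": (0.51, 0.00),
}
check_df = pd.DataFrame.from_dict(
    values, orient="index", columns=["mean_score", "archer_percentile"]
)

# Step 3: weight of each metric relative to the total of all means.
weights = check_df["mean_score"] / check_df["mean_score"].sum()

# Steps 5-6: weighted sum of percentiles gives the raw overall score.
archer_raw_score = (check_df["archer_percentile"] * weights).sum()

print(round(weights.sum(), 2))      # 1.0
print(round(archer_raw_score, 2))   # 0.41, matching overall_score_raw in the table
```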


^^This score will give the company its reporting score. In the next section, calculate its emission score.
For the overall score, maybe consider doing a 66:33 split for emissions:reporting.
^^ Is it even necessary to do an overall score if I'm using this for business leads/identifying potential businesses to target?





In [131]:
# # 2. Create 'consecutive_years_esg_data' by counting consecutive years backwards from 2022
# def calculate_consecutive_years(group):
#     # Create a set of years for the current 'comp_name'
#     years = set(group["year"])
#     # Start from 2022 and count consecutive years backwards
#     count = 0
#     for year in range(2022, 2018, -1):  # Checks 2022, 2021, 2020, 2019
#         if year in years:
#             count += 1
#         else:
#             break  # Stop at the first missing year in the sequence

#     return count


# # Compute one count per 'comp_name', then map it back onto every row
# consecutive_years = df.groupby("comp_name").apply(calculate_consecutive_years)
# df["consecutive_years_esg_data"] = df["comp_name"].map(consecutive_years)

Metrics change from base year

In [132]:
# reporting_df = reporting_df.merge(
#     reporting_df.loc[
#         reporting_df["year"] == reporting_df["base_year"],
#         ["company", "metrics_reported"],
#     ],
#     on="company",
#     how="left",
#     suffixes=("", "_base_year"),
# )

# # Compute the change from base year
# reporting_df["metrics_change_from_base_year"] = (
#     reporting_df["metrics_reported"] - reporting_df["metrics_reported_base_year"]
# )

# reporting_df.loc[
#     reporting_df["year"] == reporting_df["base_year"], "metrics_change_from_base_year"
# ] = float("nan")

# reporting_df.drop("metrics_reported_base_year", axis=1, inplace=True)

```diff
- According to the CSRD, a first set of European Sustainability Reporting Standards (ESRS) was adopted in 2023. These standards are sector-agnostic, so they apply regardless of the sector in which a company operates. Sector-specific reporting standards were expected to be introduced by June 2026 but, at the time of writing (February 2025), it is likely that this requirement will be shelved (see European Commission, 2025).

- I apply the ESG scoring system used by the London Stock Exchange Group (LSEG, 2024), and calculate sector-specific scores...
```

In [133]:
# shows all unique values for each column

# pd.set_option("display.max_colwidth", 180)

# # Create a DataFrame with unique values for each column
# unique_values_df = pd.DataFrame(
#     {
#         "columns": reporting_df.columns,
#         "unique_values": [
#             reporting_df[col].unique().tolist() for col in reporting_df.columns
#         ],
#     }
# ).set_index("columns")

# unique_values_df

In [134]:
# folder_path = r"C:\Users\james\OneDrive - University of Aberdeen\01 - Turing College\D99 - Capstone Project\Nordic Compass - ESG Performance and CSRD Compliance\datasets"

# industry_weights_df.to_csv(f"{folder_path}/industry_weights_df.csv", index=True)
# industry_means_all_years_df.to_csv(f"{folder_path}/industry_means_df.csv", index=True)

In [135]:
# Test to see that the function works...

# calc_test_df = reporting_df[
#     (reporting_df["industry"] == "Energy and Utilities")
#     & (reporting_df["year"] == 2020)
# ]

# calc_test_df

# folder_path = r"C:\Users\james\OneDrive - University of Aberdeen\01 - Turing College\D99 - Capstone Project\Nordic Compass - ESG Performance and CSRD Compliance\datasets"

# calc_test_df.to_csv(f"{folder_path}/test_of_scoring_system.csv", index=False)

```diff
- Pillar score: A score for each pillar (within Environmental, Social and Governance). This can change from industry to industry, e.g. the environmental pillar might be worth more in the energy industry than in another industry.
```

TRBC group is used as the benchmark.

Materiality matrix created in the form of category weights.

7% transparency threshold (I guess that means mean is higher than 0.07)
Then use industry median (and add all medians up, then take the relative weighting of each)

For Boolean data points, use transparency weights (same principle, but using magnitude weight)



# END