<center><span style="font-size:30px; font-weight: bold;">Nordic Compass Database</span></center>
<center><span style="font-size:24px;">Analysis of Environmental Performance and CSRD Compliance</span></center>

<center><span style="font-size:22px;"><b>Section 1:</b> Preprocessing and cleaning </span></center>

## Imports

In [116]:
import pandas as pd
import numpy as np
from rapidfuzz import fuzz, process
import sys
import os
from typing import Any, List

pd.set_option("display.max_columns", None)
sys.path.append(os.path.abspath(".."))
import random
from functions import (
    display_unique_counts,
    show_missing_values,
    test_filter,
    test_ticker,
    tickers_with_multiple_companies,
    companies_with_multiple_tickers,
    apply_most_recent_company_name,
    apply_most_recent_ticker,
    test_company,
    find_similar_entries,
    map_similar_pairs,
    update_segments_remove_na,
    get_most_recent_values,
    generate_binary_summary,
)

# pd.options.display.float_format = "{:.0f}".format

In [117]:
# load the file
df = pd.read_csv("../datasets/NordicCompass2014_2022.csv")

## Introduction to this section

I previously took the old Python module and am now redoing it to strengthen my Python skills. This time I wanted to apply what I learned to topics that are relevant to my future, rather than practising on datasets for the sake of practice. As a result, I am submitting a project that doesn't match the requirements, and I would be very grateful if this isn't viewed unfavourably! Rather than analysing the Spotify dataset, I chose the Nordic Compass (2022) database. I was curious to see how Nordic companies perform on various ESG (Environmental, Social, Governance) metrics, and I think that the task requirements from Turing can be applied to many datasets.

In this notebook, I first keep only the columns most relevant for my analysis. These are largely environmental metrics, allowing me to reduce the number of columns from around 100 to just 25. I then clean the data to resolve 'duplicates'. I don't want to simply delete them, because the same company may be written in two different ways in different years (e.g. 'Company X Ltd.' vs. 'Company X ltd'). These should not appear as two separate companies, but should instead be integrated into a single company. To identify the correct company name, I use the name from the most recent year of data, assuming that the most recent data is the most likely to be correct. I cross-reference this with the company's stock market ticker, and also do the opposite: where a company has multiple tickers, I merge them into one based on whichever appears most recently. (If this is confusing, it will hopefully become clearer later in the notebook.) For incorrect company/ticker combinations that are harder to spot, I use a library that has a similarity index (rapidfuzz), and finally manually update those that I missed. This is not the most pythonic way, but for a dataset this size, I could get away with it.

I then further clean the data by removing duplicates and standardising each company's segment and industry. To avoid having some companies in 'Financial' and others in 'Financial Services', I create a dictionary of industry names with the keys representing the original name, and the values representing the modified name. I then map the new names into the relevant column. I finally handle any missing values, convert data types and transform any anomalies, before deleting all data older than 2019 (which I define as my base year). This final step could have been done at the beginning, but with a dataset this small, it didn't have much impact on processing power, so I was lazy and left it until the end.

This dataset is then ready for further analysis, which I will present in Sprint 3.

I have stored all of the functions used in this notebook in the functions folder. Note that this is notebook 1 of 3. The second notebook analyses the cleaned data according to relevant environmental reporting metrics (I haven't finished it, but I plan to present this for sprint 3), while the third notebook uses impact metrics (I have just started a new job, so I don't think I will ever get around to finishing this one, but never mind!). I very much look forward to our review.



## Turing sprint requirements (and translation to this sprint)

1. **Download the dataset**  
   Use the [Spotify Top 50 Tracks of 2020 dataset](#) (insert dataset link if available).

```diff
- I downloaded the Nordic Compass dataset from the following location: https://www.hhs.se/en/houseoffinance/data-center/nordic-compass-shofs-esg-database/
 ```

2. **Load the data**  
   Load the dataset using **Pandas**.

---

Perform the following data cleaning tasks:

- Handle missing values  
- Remove duplicate samples and features  
- Treat outliers  

```diff
- I handle missing values and duplicates, but rather than treating outliers, I treat naming inconsistencies. I hope that the result demonstrates similar skills, but I'm happy to discuss the treatment of outliers in the review.
   ```
---

Answer the following questions through your EDA:

- How many **observations** are there in this dataset?  
- How many **features** does this dataset have?  
- Which of the features are **categorical**?  
- Which of the features are **numeric**?  

- Are there any **artists** with more than one popular track? If yes, which ones and how many?  
- Who was the **most popular artist** overall?  
- How many **unique artists** have tracks in the top 50?  
- Are there any **albums** with more than one popular track? If yes, which ones and how many?  
- How many **unique albums** are represented in the top 50?  

- Which tracks have a **danceability score above 0.7**?  
- Which tracks have a **danceability score below 0.4**?  
- Which tracks have their **loudness above -5 dB**?  
- Which tracks have their **loudness below -8 dB**?  
- Which track is the **longest**?  
- Which track is the **shortest**?  

- Which **genre is the most popular**?  
- Which genres have **only one song** in the top 50?  
- How many **genres in total** are represented?  

- Which features are **strongly positively correlated**?  
- Which features are **strongly negatively correlated**?  
- Which features are **not correlated at all**?  

Compare the following features across these genres: **Pop, Hip-Hop/Rap, Dance/Electronic, Alternative/Indie**:

- Danceability  
- Loudness  
- Acoustics  

```diff
- I chose a very different analysis (obviously), but I hope the insights are equally interesting and demonstrate my skills in Pandas. In the section below, I have tried to answer some similar questions, just to demonstrate that I'm capable. If I'm missing something, I can perhaps demonstrate it using the second notebook, which is the analysis of the data I have cleaned.
   
```

---

- Provide **clear explanations** for each step in your notebook.
- Explain what you are analyzing, what results you found, and what the results imply.
- Make sure your insights are easy to understand and well-documented.

```diff

- I think I did this.
   
```

---

- Discuss potential improvements for your analysis.
- Mention any additional data that could enhance the insights.
- Suggest advanced techniques (e.g., clustering, sentiment analysis, etc.) that could be explored in future iterations.

```diff

- I'm not sure this is relevant for my analysis.
   
```

<!-- Below are some examples of equivalent questions that could be asked of the Nordic Compass dataset. -->

In [118]:
# How many observations/features are there in this dataset?
rows, cols = df.shape
# print(f"Observations: {rows}, Features: {cols}")

In [119]:
# Which of the features are categorical/numeric?
# df.dtypes

In [121]:
# Companies with the most years of observations
most_frequent_companies = df["comp_name"].value_counts().loc[lambda x: x == x.max()]
# most_frequent_companies.to_dict()

In [122]:
# Number of unique companies
unique_companies = df["comp_name"].nunique()
# unique_companies

In [129]:
# Number of companies with sales/revenue greater than 200
df["sales"] = pd.to_numeric(df["sales"].replace(",", "", regex=True), errors="coerce")
companies_with_sales_over_x_value = df[df["sales"] > 200]["comp_name"].nunique()
# companies_with_sales_over_x_value

## Data overview

The database is too large to focus on all columns necessary for full compliance with CSRD. I will focus only on columns relating to a company's environmental performance.

In [None]:
# df.columns.tolist()

relevant_columns = [
    "comp_name",
    "ticker",
    "year",
    "segment",
    "industry",
    "hq_country",
    "ceo_sust_statem",
    "sales",
    "num_employees",
    "env_policy",
    "ep_targets",
    "env_impact_red",
    "energy_consump",
    "incr_renew_en",
    "disclosure_raw",
    "resource_target",
    "water_withdraw",
    "water_disclose",
    "ghg_emis",
    "transport_emis",
    "audit_es_report",
    "su_guidelines",
    "su_aud_disclose",
    "su_eva_disclose",
    "su_env_assess",
]

In [None]:
# selects all rows and only relevant columns
df = df.loc[:, relevant_columns]

In [None]:
df.head()

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
0,Archer Ltd.,ARCHER,2016.0,Mid,Oil & Gas,Bermuda,N,841.9,5112,N,Y,Y,ND,N,N,N,ND,N,ND,ND,N,N,N,N,N
1,Archer Ltd.,ARCHER,2017.0,Mid,Oil & Gas,Bermuda,Y,705.7,4785,Y,N,Y,ND,N,N,N,ND,N,ND,ND,N,N,N,N,N
2,Archer Ltd.,ARCHO,2020.0,Mid,Energy,Norway,Y,735.7142857,4556,Y,Y,Y,459927,N,N,Y,ND,N,ND,ND,Y,Y,Y,N,N
3,AutoStore Holdings Ltd.,AUTO,2021.0,Large,Industrials,Bermuda,Y,292.5,578,Y,N,Y,ND,N,Y,N,ND,N,0.7366,371.9243,N,Y,N,Y,N
4,Avance Gas Holding ltd,AVACF,2019.0,Mid,Energy,Norway,Y,223.5901786,271,Y,Y,Y,ND,N,N,N,ND,N,N,N,N,N,N,N,N


Some companies, such as Archer Ltd., have more than one ticker, so this will need to be modified.

In [None]:
display_unique_counts(df)

Unique companies in the database:  782
Unique tickers in the database:  783


## Data cleaning

I want to ensure that companies and tickers match. Some companies currently have multiple tickers, while some tickers are associated with multiple companies. I will then focus on duplicates (e.g. the same company appearing multiple times for the same year). Where data is duplicated, I will prioritise rows where 'GHG emissions' data exists and/or 'sales' are higher (some rows have net income reported by mistake).

### Multiple companies associated with one ticker

I first check tickers that are associated with multiple companies.
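The helper `tickers_with_multiple_companies` lives in the functions folder and is not shown here. A minimal sketch of how such a check might work, assuming a plain `groupby` on `ticker`, is given below. The function name `sketch_tickers_with_multiple_companies` is hypothetical, and the toy rows are copied from the notebook's own output.

```python
from typing import List

import pandas as pd

# Toy rows copied from the notebook output: ACR appears under two names.
toy = pd.DataFrame(
    {
        "comp_name": ["Axactor", "Axactor", "Axactor SE", "AutoStore Holdings Ltd."],
        "ticker": ["ACR", "ACR", "ACR", "AUTO"],
        "year": [2019.0, 2020.0, 2021.0, 2021.0],
    }
)


def sketch_tickers_with_multiple_companies(df: pd.DataFrame) -> List[str]:
    """Hypothetical stand-in: list tickers that map to more than one name."""
    # Count distinct company names per ticker and keep the ambiguous ones.
    name_counts = df.groupby("ticker")["comp_name"].nunique()
    ambiguous_tickers = name_counts[name_counts > 1].index

    summary = []
    for ticker in ambiguous_tickers:
        names = sorted(df.loc[df["ticker"] == ticker, "comp_name"].unique())
        summary.append(f"{ticker}: {', '.join(names)}")
    return summary


result = sketch_tickers_with_multiple_companies(toy)
print(result)
```

The same pattern with `comp_name` and `ticker` swapped would cover the reverse check (companies with multiple tickers).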

In [None]:
# Show rows where a single ticker is associated with multiple companies (e.g. 'ACR' = 'Axactor' and 'Axactor SE')
tickers_with_multiple_companies(df)

Tickers associated with multiple companies:  114



['ACR: Axactor, Axactor SE',
 'ADE: Adevinta, Adevinta ASA',
 'AF: ÅF AB, ÅF Pöyry AB',
 'AKRBP: Aker BP ASA, Aker BP ASA (Det norske oljeselskap ASA)',
 'AKTIA: Aktia Bank PLC (formerly known as Aktia, Aktia Bank Plc (formerly known as Aktia Pankki Oyj), Aktia Bank plc',
 'AM1: Ahlstrom-Munksjö Oyj, Ahlstrom-Munksjö Oyj  (Munksjö Oyj)',
 'ANOD: AddNode Group AB, Addnode Group AB',
 'ARCUS: Arcus ASA, Arcus asa',
 'ARION: Arion Banki hf., Arion Banki hf. SDB',
 'ASSA: ASSA ABLOY AB, Assa Abloy AB',
 'AZTO: ArcticZymes Technologies (Biotec Pharmacon ), ArcticZymes Technologies ASA(formerly Biotec Pharmacon ASA)',
 'BALD: Balder Fastighets AB, Fastighets AB Balder',
 'BDRILL: Borr Drilling Ltd, Borr Drilling Ltd.',
 'BHG: BHG (formerly Bygghemma Group First AB), BHG AB(formerly Bygghemma Group First AB), BHG Group AB (Bygghemma Group First AB)',
 'BITTI: Bittium Oyj, Bittium Oyj  (Elektrobit Oyj)',
 'BO: Bang & Olufsen A/S, Bang & Olufsen Holding A/S',
 'BWLPG: BW LPG, BW LPG Ltd',
 'BWO

In [None]:
test_ticker(df, "ACR")

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
2110,Axactor,ACR,2019.0,Mid,Financials,Norway,N,368.1,1140,Y,N,N,ND,N,N,N,ND,N,N,N,N,Y,N,N,N
2111,Axactor,ACR,2020.0,Mid,Financial Services,Norway,N,201.2,1137,Y,N,Y,5751.542,Y,N,N,ND,N,0.495177,0.1753,N,Y,N,Y,Y
2114,Axactor SE,ACR,2021.0,Mid,Financials,Sweden,Y,195.1,1225,Y,Y,Y,4655.2356,Y,N,N,ND,N,0.707504,0.144031,N,Y,N,N,Y


Where multiple company names are associated with a single ticker, I keep the name from the most recent year and use it to replace all older names.
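As a rough illustration of this rule (not the actual code in the functions folder), the sketch below picks the latest name per ticker and maps it back onto every row. The function name is hypothetical, and it returns a copy, whereas the notebook's helper is called without assignment and so presumably updates `df` in place.

```python
import pandas as pd

# Toy rows copied from the ACR output shown in this notebook.
toy = pd.DataFrame(
    {
        "comp_name": ["Axactor", "Axactor", "Axactor SE", "AutoStore Holdings Ltd."],
        "ticker": ["ACR", "ACR", "ACR", "AUTO"],
        "year": [2019.0, 2020.0, 2021.0, 2021.0],
    }
)


def sketch_apply_most_recent_company_name(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: give every ticker the name from its latest year."""
    # After sorting by year, the last name in each ticker group is the newest.
    latest_name = df.sort_values("year").groupby("ticker")["comp_name"].last()

    out = df.copy()
    # Rows whose ticker is missing get no match, so keep their original name.
    out["comp_name"] = out["ticker"].map(latest_name).fillna(out["comp_name"])
    return out


cleaned = sketch_apply_most_recent_company_name(toy)
print(cleaned)
```

The `fillna` step matters for rows with no ticker at all: without it, their company names would be overwritten with nulls.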

In [None]:
apply_most_recent_company_name(df)

In [None]:
test_ticker(df, "ACR")

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
2110,Axactor SE,ACR,2019.0,Mid,Financials,Norway,N,368.1,1140,Y,N,N,ND,N,N,N,ND,N,N,N,N,Y,N,N,N
2111,Axactor SE,ACR,2020.0,Mid,Financial Services,Norway,N,201.2,1137,Y,N,Y,5751.542,Y,N,N,ND,N,0.495177,0.1753,N,Y,N,Y,Y
2114,Axactor SE,ACR,2021.0,Mid,Financials,Sweden,Y,195.1,1225,Y,Y,Y,4655.2356,Y,N,N,ND,N,0.707504,0.144031,N,Y,N,N,Y


Validate that all tickers have only one associated company.

In [None]:
tickers_with_multiple_companies(df)
display_unique_counts(df)

Tickers associated with multiple companies:  0

Unique companies in the database:  666
Unique tickers in the database:  783


### Multiple tickers associated with one company

In [None]:
companies_with_multiple_tickers(df)

Companies associated with multiple tickers:  111



['A.P. Møller -Maersk A/S: MAERSK, MAERSK A',
 'ABG Sundal Collier Holding ASA: ABG, ASC',
 'Akastor  ASA: AKAST, AKKVF',
 'Aker BP ASA: AKERBP, AKRBP',
 'Ambu A/S: AMBU, AMBU-B',
 'Archer Ltd.: ARCHER, ARCHO',
 'Asetek A/S: ASETEK, ASTK',
 'Avance Gas Holding ltd: AGAS, AVACF',
 'Axactor SE: ACR, AXA',
 'BankNordik P/F: BNORDIK, BNORDIK CSE',
 'Beijer Alma AB: BEIA, BEIA B',
 'Beijer Ref AB: BEIJ, BEIJ B',
 'Belships ASA: BEL, BELO',
 'Bonheur ASA: BON, BONH',
 'Borregaard ASA: BRG, BRGO',
 'Bouvet ASA: BOUV, BOUVET',
 'Carlsberg A/S: CARL, CARL B',
 'Caverion Oyj: CAV, CAV1V',
 'CellaVision AB: CEVI, SEVI',
 'Cloetta AB: CLA, CLA B',
 'Coloplast A/S: COLO, COLO B',
 'ContextVision: CONTX, COVO',
 'Corem Property Group AB: CORE, CORE A',
 'Crayon Group Holding ASA: CRAYN, CRAYNO',
 'DOF ASA: DOF, DOFO',
 'Elanders AB: ELAN, ELAN B',
 'Frontline Ltd: FRO, FROo',
 'Genmab A/S: GEN, GMAB',
 'H. Lundbeck A/S: HLUN, LUN',
 'Hexagon AB: HEXA, HEXA B',
 'Huhtamäki Oyj: HUH, HUH1V',
 'Höegh L

In [None]:
test_company(df, "Archer Ltd.")

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
0,Archer Ltd.,ARCHER,2016.0,Mid,Oil & Gas,Bermuda,N,841.9,5112,N,Y,Y,ND,N,N,N,ND,N,ND,ND,N,N,N,N,N
1,Archer Ltd.,ARCHER,2017.0,Mid,Oil & Gas,Bermuda,Y,705.7,4785,Y,N,Y,ND,N,N,N,ND,N,ND,ND,N,N,N,N,N
2,Archer Ltd.,ARCHO,2020.0,Mid,Energy,Norway,Y,735.7142857,4556,Y,Y,Y,459927,N,N,Y,ND,N,ND,ND,Y,Y,Y,N,N


In [None]:
apply_most_recent_ticker(df)

In [None]:
test_company(df, "Archer Ltd.")

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
0,Archer Ltd.,ARCHO,2016.0,Mid,Oil & Gas,Bermuda,N,841.9,5112,N,Y,Y,ND,N,N,N,ND,N,ND,ND,N,N,N,N,N
1,Archer Ltd.,ARCHO,2017.0,Mid,Oil & Gas,Bermuda,Y,705.7,4785,Y,N,Y,ND,N,N,N,ND,N,ND,ND,N,N,N,N,N
2,Archer Ltd.,ARCHO,2020.0,Mid,Energy,Norway,Y,735.7142857,4556,Y,Y,Y,459927,N,N,Y,ND,N,ND,ND,Y,Y,Y,N,N


Validate that all companies have only one ticker.

In [None]:
companies_with_multiple_tickers(df)
display_unique_counts(df)

Companies associated with multiple tickers:  0

Unique companies in the database:  666
Unique tickers in the database:  665


'ICA Gruppen AB' was missing a ticker, so I filled it. 'Truecaller AB' had a ticker '1', so I manually updated it.

In [None]:
# test_company(df, "ICA Gruppen AB").head(1)
df.loc[df["comp_name"] == "ICA Gruppen AB", "ticker"] = "ICA"

In [None]:
# manually update the ticker for 'Truecaller AB'
df.loc[df["comp_name"] == "Truecaller AB", "ticker"] = "TRUE B"

# df[df["comp_name"] == "Truecaller AB"]

In [None]:
display_unique_counts(df)

Unique companies in the database:  666
Unique tickers in the database:  666


### Other company name errors

I now search for company names that are similar enough to potentially represent the same company.
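To show the idea behind a similarity search, here is a small self-contained sketch. It deliberately swaps in the standard library's `difflib.SequenceMatcher` instead of rapidfuzz so that it runs without extra dependencies, which means the scores will not match the ones in the table below exactly. The function name and the short name list are illustrative only.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple

# A few names that appear in this notebook, used purely as toy input.
names = ["Hagar hf", "Hagar hf.", "Nobia AB", "Nobina AB", "Sampo"]


def sketch_similar_pairs(
    candidates: List[str], threshold: float
) -> List[Tuple[str, str, float]]:
    """Hypothetical stand-in: return name pairs scoring at or above threshold."""
    pairs = []
    # Compare every unordered pair once, ignoring case.
    for first, second in combinations(sorted(set(candidates)), 2):
        score = SequenceMatcher(None, first.lower(), second.lower()).ratio() * 100
        if score >= threshold:
            pairs.append((first, second, round(score, 2)))
    # Highest similarity first, as in the notebook's table.
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


pairs = sketch_similar_pairs(names, 75)
for first, second, score in pairs:
    print(first, "|", second, "|", score)
```

Note that 'Nobia AB' and 'Nobina AB' score highly even though they are different companies, which is why a manual review step follows the similarity search.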

In [None]:
similar_pairs_df = find_similar_entries(df, 75)

print(f"Number of rows: {len(similar_pairs_df)}", end="\n")
similar_pairs_df

Number of rows: 43


Unnamed: 0,entry1,year1,entry2,year2,similarity
0,"SCA, Svenska Cellulosa AB (SCA)",2022.0,SCA. Svenska Cellulosa AB (SCAA),2021.0,95.238095
1,Hagar hf (HAGA),2022.0,Hagar hf. (HAGAR),2015.0,93.75
2,Akastor ASA (AKAST),2020.0,Akastor ASA (AKA),2017.0,91.891892
3,"Ericsson, Telefonab. L M (ERIC)",2022.0,Ericsson Telefonab LM (ERIC-B),2019.0,91.803279
4,SpareBank 1 SR-Bank (SRBANK),2022.0,SpareBank 1 SR-Bank ASA (SRBNK),2020.0,91.525424
5,Tanker Investments Ltd (TNK),2020.0,Tanker Investments Ltd. (TIL),2016.0,91.22807
6,Nobia AB (NOBI),2022.0,Nobina AB (NOBINA),2022.0,90.909091
7,"Hennes & Mauritz AB, H & M (HM)",2022.0,Hennes & Mauritz AB. H&M (HM B),2020.0,90.322581
8,Momentum Group (MMGR),2022.0,Momentum Group AB (MMGR B),2020.0,89.361702
9,Avance Gas Holding ltd (AGAS),2020.0,Avance Gas Holding ltd. (AVANCE),2016.0,88.52459


Not all of the companies above are the same company, so I only want to change the ones that are. I do this manually (although I am sure there must be a more robust solution).

In [None]:
# is there a more robust way of doing this? Probably.
indices_to_keep = [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 14, 15, 16, 19, 21]

similar_pairs_df = similar_pairs_df.loc[indices_to_keep]

similar_pairs_df

Unnamed: 0,entry1,year1,entry2,year2,similarity
0,"SCA, Svenska Cellulosa AB (SCA)",2022.0,SCA. Svenska Cellulosa AB (SCAA),2021.0,95.238095
1,Hagar hf (HAGA),2022.0,Hagar hf. (HAGAR),2015.0,93.75
2,Akastor ASA (AKAST),2020.0,Akastor ASA (AKA),2017.0,91.891892
3,"Ericsson, Telefonab. L M (ERIC)",2022.0,Ericsson Telefonab LM (ERIC-B),2019.0,91.803279
4,SpareBank 1 SR-Bank (SRBANK),2022.0,SpareBank 1 SR-Bank ASA (SRBNK),2020.0,91.525424
5,Tanker Investments Ltd (TNK),2020.0,Tanker Investments Ltd. (TIL),2016.0,91.22807
7,"Hennes & Mauritz AB, H & M (HM)",2022.0,Hennes & Mauritz AB. H&M (HM B),2020.0,90.322581
8,Momentum Group (MMGR),2022.0,Momentum Group AB (MMGR B),2020.0,89.361702
9,Avance Gas Holding ltd (AGAS),2020.0,Avance Gas Holding ltd. (AVANCE),2016.0,88.52459
10,Wärtsilä Oyj Abp (WRT1V),2021.0,Wärtsilä Oyj (WRT1),2020.0,88.372093


I merge those companies that I identify as the same company, replacing the older name with the most recent name.

In [None]:
map_similar_pairs(similar_pairs_df, df)

In [None]:
# # Catella AB and Catena AB should both be in the dataset, but Hagar hf. should be merged into Hagar hf
# test_filter(
#     df, {"comp_name": ["Catella AB", "Catena AB", "Hagar hf", "Hagar hf."]}, "comp_name"
# )

I now want to catch the last companies that are the same, but may have names that weren't caught by the similarity checker. I will do this manually.

In [None]:
# sorted(df['comp_name'].unique().tolist())

In [None]:
# To get the final companies that appear multiple times under different names, I check manually and compile a list
similar_companies_manual = [
    "Kindred Group Plc (formerly Unibet Group)",
    "Kindred Group Plc",
    "Ahlstrom Oyj",
    "Ahlstrom-Munksjö Oyj",
    "Bergman & Beving AB",
    "Bergman & Beving AB  (B&B Tools AB)",
    "F-Secure Corporation",
    "F-Secure Oyj",
    "Kinnevik AB",
    "Kinnevik AB  (Kinnevik Investment AB)",
    "MT Højgaard A/S (formerly known as Højga",
    "MT Højgaard Holding A/S  (Højgaard Holding A/S)",
    "Metso Outotec Oyj",
    "Metso Outotec Oyj  (Outotec Oyj)",
    "Nordnet AB",
    "Nordnet AB publ",
    "Radisson Hospitality AB",
    "Radisson Hospitality AB  (Rezidor Hotel Group AB)",
    "Revenio Group Corporation",
    "Revenio Group Oyj",
    "Raisio Oyj",
    "Raisio Oyj Vaihto-osake",
    "Royal Caribbean Cruises Ltd",
    "Royal Caribbean Group (formerly: Royal Caribbean Cruises Ltd)",
    "TORM A/S",
    "TORM plc",
    "VBG GROUP AB",
    "VBG Group AB",
]

filtered_df = df[df["comp_name"].isin(similar_companies_manual)]

latest_entries = filtered_df.sort_values(by="year", ascending=False).drop_duplicates(
    subset=["comp_name"], keep="first"
)

# Merge back to get ticker and ensure entry1 has the most recent year
similar_pairs_manual = []
for company1, company2 in zip(
    similar_companies_manual[::2], similar_companies_manual[1::2]
):
    entry1 = latest_entries[latest_entries["comp_name"] == company1]
    entry2 = latest_entries[latest_entries["comp_name"] == company2]

    if not entry1.empty and not entry2.empty:
        # Extract relevant details
        year1 = entry1["year"].values[0]
        year2 = entry2["year"].values[0]
        ticker1 = entry1["ticker"].values[0]
        ticker2 = entry2["ticker"].values[0]

        # Ensure entry1 has the most recent year
        if year1 < year2:
            company1, company2 = company2, company1
            year1, year2 = year2, year1
            ticker1, ticker2 = ticker2, ticker1

        # Format the entries
        formatted_entry1 = f"{company1} ({ticker1})"
        formatted_entry2 = f"{company2} ({ticker2})"

        # Append to the list
        similar_pairs_manual.append((formatted_entry1, year1, formatted_entry2, year2))

# Convert to DataFrame
similar_pairs_manual_df = pd.DataFrame(
    similar_pairs_manual, columns=["entry1", "year1", "entry2", "year2"]
)

similar_pairs_manual_df

Unnamed: 0,entry1,year1,entry2,year2
0,Kindred Group Plc (formerly Unibet Group) (KIND),2022.0,Kindred Group Plc (KIND SDB),2019.0
1,Ahlstrom-Munksjö Oyj (AM1),2020.0,Ahlstrom Oyj (AHL1V),2016.0
2,Bergman & Beving AB (BERG),2022.0,Bergman & Beving AB (B&B Tools AB) (BBTO),2015.0
3,F-Secure Corporation (FSECURE),2022.0,F-Secure Oyj (FSC1V),2022.0
4,MT Højgaard A/S (formerly known as Højga (MTHH),2022.0,MT Højgaard Holding A/S (Højgaard Holding A/S...,2020.0
5,Metso Outotec Oyj (METSO),2022.0,Metso Outotec Oyj (Outotec Oyj) (OTE),2019.0
6,Nordnet AB (SAVE),2022.0,Nordnet AB publ (NN),2020.0
7,Radisson Hospitality AB (RADH),2017.0,Radisson Hospitality AB (Rezidor Hotel Group ...,2016.0
8,Revenio Group Corporation (REG1V),2022.0,Revenio Group Oyj (REG),2020.0
9,Raisio Oyj Vaihto-osake (RAIVV),2022.0,Raisio Oyj (RAI),2020.0


I then merge again, replacing the old name with the new name.

In [None]:
map_similar_pairs(similar_pairs_manual_df, df)

In [None]:
# # check that the changes were made correctly
df[df["comp_name"].str.startswith("Sampo")].sort_values(by="comp_name", ascending=True)
# df[df["comp_name"].str.match(r"^Ahlstrom.*")]

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
770,Sampo,SAMPO,2022.0,Large,Financials,Finland,,1863,13475,Y,Y,Y,ND,N,N,Y,ND,N,5.2541,15.586,Y,Y,Y,Y,Y
771,Sampo,SAMPO,2021.0,Large,Insurance,Finland,Y,13451,13326,Y,Y,Y,118313.5194,Y,N,Y,33.653,N,4.3264,6.3149,N,Y,N,N,Y
772,Sampo,SAMPO,2014.0,Large,Financials,Finland,Y,6474,6723,Y,Y,Y,104400,Y,N,Y,ND,N,ND,ND,N,N,N,N,Y
773,Sampo,SAMPO,2015.0,Large,Financials,Finland,Y,6566,6755,Y,N,Y,93600,N,N,Y,ND,N,9,7.8,Y,N,N,N,N
774,Sampo,SAMPO,2016.0,Large,Financials,Finland,Y,6252,6780,Y,N,Y,90000,N,N,N,46,N,1.2,7.1,N,N,N,N,N
775,Sampo,SAMPO,2017.0,Large,Financials,Finland,N,7009,6452,N,N,N,88200,N,N,N,ND,N,1.5,ND,N,Y,N,N,N
776,Sampo,SAMPO,2018.0,Large,Financials,Finland,Y,7907,9509,Y,N,Y,86400,N,N,N,ND,N,1.421,ND,N,N,N,N,N
777,Sampo,SAMPO,2019.0,Large,Financials,Finland,Y,8744,9927,Y,Y,Y,81878.4,Y,Y,Y,60.342,N,4.463,11.951,N,Y,N,N,Y
778,Sampo,SAMPO,2020.0,Large,Financial Services,Finland,Y,9913,13162,Y,Y,Y,74026.8,Y,Y,Y,24.2436,N,4.401,11.9501,N,Y,N,N,Y
1106,Sampo Oyj,SAMAS,2021.0,Large,Financials,Finland,Y,10580,38,Y,Y,Y,ND,Y,N,Y,33.653,N,4.3264,6.3149,N,Y,N,N,Y


### Handle duplicates

For duplicate rows, I decide manually which to drop, based on the presence or absence of important data (e.g. GHG emissions).

In [None]:
# df[df.duplicated(subset=['comp_name', 'year'])]
df[df.duplicated(subset=["ticker", "year"], keep=False)].sort_values(
    by=["comp_name", "year"], ascending=[True, False]
)

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
973,Ahlstrom-Munksjö Oyj,AM1,2016.0,Mid,Basic Materials,Sweden,Y,1085.9,3255,Y,Y,Y,11900837,N,Y,Y,23000,Y,467.2,ND,Y,Y,N,N,Y
1128,Ahlstrom-Munksjö Oyj,AM1,2016.0,Mid,Basic Materials,Sweden,Y,1142.9,2755,Y,Y,Y,ND,N,N,N,ND,N,ND,ND,N,Y,N,N,N
972,Ahlstrom-Munksjö Oyj,AM1,2015.0,Mid,Basic Materials,Finland,N,1074.7,3310,Y,Y,Y,12976250,Y,Y,Y,24000,N,502.2,ND,N,Y,N,N,N
1127,Ahlstrom-Munksjö Oyj,AM1,2015.0,Mid,Basic Materials,Sweden,Y,1130.7,2900,Y,N,Y,1980000,N,N,N,40500,N,337,ND,N,Y,N,Y,Y
971,Ahlstrom-Munksjö Oyj,AM1,2014.0,Mid,Basic Materials,Finland,Y,1001.1,3398,Y,Y,Y,8667000,N,Y,Y,26,N,333.4,ND,N,Y,N,N,Y
1126,Ahlstrom-Munksjö Oyj,AM1,2014.0,Mid,Basic Materials,Sweden,Y,1137.3,1802,Y,Y,Y,5040000,Y,Y,Y,41250,Y,352,ND,N,Y,N,N,Y
1396,Arion Banki hf.,ARION,2019.0,Large,Financials,Iceland,N,430.4688077,791,Y,N,N,29760.2496,N,N,N,136.99,N,0.0634,0.0708,N,N,N,N,N
3433,Arion Banki hf.,ARION,2019.0,Large,Financials,Iceland,Y,354.3743079,735,Y,Y,Y,29760.2496,Y,N,N,136.99,N,0.1342,0.3154,N,Y,N,N,Y
1084,F-Secure Corporation,FSECURE,2022.0,Mid,Technology,Finland,,111.0,376,N,N,N,ND,ND,N,N,ND,N,ND,ND,N,N,ND,N,N
1279,F-Secure Corporation,FSECURE,2022.0,Mid,Technology,Finland,,111.0,357,N,N,N,ND,N,N,N,ND,N,ND,ND,N,N,N,N,N


In [None]:
# Not the most robust approach: the index labels are hard-coded, so this only works if the index above doesn't change!
duplicates_to_drop = [971, 972, 973, 1084, 1396, 2126, 1247]
df = df.drop(index=duplicates_to_drop)
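
A rule-based alternative to hard-coded index labels could look like the sketch below. It implements the priority described earlier (rows with GHG data first, then higher sales) and assumes that 'ND' has already been converted to NaN and that `sales` is numeric. The rows and tickers are made up for illustration, and the rule will not necessarily reproduce the manual choices made above.

```python
import numpy as np
import pandas as pd

# Made-up rows: each (ticker, year) pair appears twice.
toy = pd.DataFrame(
    {
        "ticker": ["XCO", "XCO", "YCO", "YCO"],
        "year": [2020.0, 2020.0, 2021.0, 2021.0],
        "sales": [100.0, 120.0, 50.0, 80.0],
        "ghg_emis": [1.5, np.nan, 2.0, 3.0],
    },
    index=[10, 11, 12, 13],
)


def sketch_drop_duplicates_by_rule(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep one row per (ticker, year) by a fixed rule."""
    # Rank rows so that those with GHG data come first, then higher sales.
    ranked = df.assign(_has_ghg=df["ghg_emis"].notna()).sort_values(
        by=["_has_ghg", "sales"], ascending=[False, False]
    )
    # The first row of each (ticker, year) group is now the preferred one.
    kept = ranked.drop_duplicates(subset=["ticker", "year"], keep="first")
    return kept.drop(columns="_has_ghg").sort_index()


deduped = sketch_drop_duplicates_by_rule(toy)
print(deduped)
```

For XCO the row with GHG data wins despite lower sales; for YCO both rows have GHG data, so the higher-sales row is kept.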

In [None]:
df[df.duplicated(subset=["ticker", "year"], keep=False)].sort_values(
    by=["comp_name", "year"], ascending=[True, False]
)

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess


In [None]:
display_unique_counts(df)

Unique companies in the database:  639
Unique tickers in the database:  639


### Standardise segment and industry

Some companies are missing segment data in some years, while others show different segments from year to year. I fill these by taking the most recent segment for each company and applying it to all other years.

In [None]:
df["segment"].unique().tolist()

['Mid', 'Large', 'Small', nan, 'ND', '0']

In [None]:
companies_missing_segments = (
    df[df["segment"].isin(["ND", "0", np.nan])]["comp_name"].unique().tolist()
)

companies_missing_segments

['Seadrill Ltd',
 'Basware Oyj',
 'Bakkafrost P/F',
 'Onxeo SA',
 'ICA Gruppen AB',
 'Norwegian Finans Holding',
 'Schibsted ASA']

In [None]:
update_segments_remove_na(df)

In [None]:
# # verification that this worked
# df[
#     df["comp_name"].isin(
#         [
#             "Seadrill Ltd",
#             "Basware Oyj",
#             "Bakkafrost P/F",
#             "Onxeo SA",
#             "ICA Gruppen AB",
#             "Norwegian Finans Holding",
#             "Schibsted ASA",
#         ]
#     )
# ]

Now all rows have an associated segment, but some companies appear in different segments from one year to the next. To standardise the segment for each company, I extract data from the most recent year for each company and apply that to all years.
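A minimal sketch of this "most recent value wins" step is shown below, assuming a `groupby` with `transform("last")` after sorting by year. The function name is hypothetical (the real `get_most_recent_values` is in the functions folder), and the rows are taken from the Archer Ltd. output earlier in this notebook.

```python
from typing import List

import pandas as pd

# Rows taken from the Archer Ltd. output shown earlier.
toy = pd.DataFrame(
    {
        "comp_name": ["Archer Ltd."] * 3,
        "year": [2016.0, 2017.0, 2020.0],
        "segment": ["Mid", "Mid", "Mid"],
        "industry": ["Oil & Gas", "Oil & Gas", "Energy"],
        "hq_country": ["Bermuda", "Bermuda", "Norway"],
    }
)


def sketch_get_most_recent_values(
    df: pd.DataFrame, columns_to_update: List[str]
) -> pd.DataFrame:
    """Hypothetical stand-in: copy each company's latest values to all years."""
    # Sort by year so that "last" within each company means "most recent".
    out = df.sort_values("year").copy()
    out[columns_to_update] = out.groupby("comp_name")[columns_to_update].transform(
        "last"
    )
    # Restore the original row order.
    return out.sort_index()


standardised = sketch_get_most_recent_values(
    toy, ["segment", "industry", "hq_country"]
)
print(standardised)
```

Because `last` skips nulls, a company whose latest year has a missing value would fall back to the most recent non-null entry in this sketch.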

In [None]:
get_most_recent_values(df, columns_to_update=["segment", "industry", "hq_country"])

Now each company's data from the most recent year (in the relevant columns) will be applied to all years. 

In [None]:
# verify that the above code did its job
# test_company(df, "Seadrill Ltd")

Some industries have multiple names. Unfortunately, the 'industry' column mixes entries at the economic sector, business sector, and industry group levels (see the TRBC classification scheme for more information; Thomson Reuters, 2012). For example, some companies have 'Energy' (economic sector) as their industry, while others have 'Oil & Gas' (industry group). Economic sector is too broad to compare companies, but converting every economic or business sector to an industry group would be time-consuming and prone to errors. To manage this, I merge industry groups into business and economic sectors, where appropriate, to create a hybrid system.

In [None]:
industry_mapping = {
    "Oil & Gas": "Energy",
    "Oil & Gas Equipment & Services": "Energy",
    "Industrials": "Industrial Goods and Services",
    "Personal Care, Drug and Grocery Stores": "Consumer Goods and Services",
    "Consumer Goods": "Consumer Goods and Services",
    "Consumer Discretionary": "Consumer Goods and Services",
    "Consumer Services": "Consumer Goods and Services",
    "Consumer Staples": "Consumer Goods and Services",
    "Basic Resources": "Basic Materials",
    "Financial Services": "Finance",
    "Financials": "Finance",
    "Healthcare": "Health Care",
}

df["industry"] = df["industry"].replace(industry_mapping)

sorted(df["industry"].unique().tolist())

['Basic Materials',
 'Biotechnology',
 'Consumer Goods and Services',
 'Energy',
 'Finance',
 'Health Care',
 'Industrial Goods and Services',
 'Leisure',
 'Media',
 'Real Estate',
 'Retail',
 'Technology',
 'Telecommunications',
 'Travel and Leisure',
 'Unknown',
 'Utilities']

In [None]:
# ensure that all countries are correctly formatted
df["hq_country"] = df["hq_country"].str.strip().replace({"UK": "United Kingdom"})
sorted(df["hq_country"].unique().tolist())

['Belgium',
 'Bermuda',
 'Canada',
 'Cayman Islands',
 'Chile',
 'Cyprus',
 'Denmark',
 'Estonia',
 'Faroe Islands',
 'Finland',
 'France',
 'Germany',
 'Iceland',
 'Jersey',
 'Luxembourg',
 'Malta',
 'Netherlands',
 'Norway',
 'Sweden',
 'Switzerland',
 'United Arab Emirates',
 'United Kingdom',
 'United States',
 'Virgin Islands, British']

### Handle missing values

I first remove any row with a missing 'year'.

In [None]:
df = df.dropna(subset=["year"])

In [None]:
show_missing_values(df)

Unnamed: 0_level_0,Missing Values,Missing Percentage,'ND' Values,'ND' Percentage
cols,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
comp_name,0,0.0,0,0.0
ticker,0,0.0,0,0.0
year,0,0.0,0,0.0
segment,0,0.0,0,0.0
industry,0,0.0,0,0.0
hq_country,0,0.0,0,0.0
ceo_sust_statem,499,13.12,2,0.05
sales,19,0.5,0,0.0
num_employees,21,0.55,35,0.92
env_policy,23,0.6,2,0.05


In [None]:
# Convert the 'ND' placeholder to NaN so that later aggregations treat it as missing
df.replace("ND", np.nan, inplace=True)

# Convert 'Y' and 'N' to dummy variables
df.replace({"Y": 1, "N": 0}, inplace=True)

In [None]:
# show_missing_values(df)

### Convert data types and transform anomalies

Continuous variables such as revenue or emissions should be strictly positive, so zero or negative entries are treated as invalid and converted to nulls. (A value of exactly 1 is unusual but not necessarily wrong, so it is kept.) Binary columns should contain only 1 or 0. Stray entries are coerced: any positive number, or a 'y' or 't' string, becomes 1, and everything else, including missing values, becomes 0.

In [None]:
generate_binary_summary(df)

Unnamed: 0,Column Name,Data Type,1s,0s,NaNs
0,comp_name,object,0,0,0
1,ticker,object,0,0,0
2,year,float64,0,0,0
3,segment,object,0,0,0
4,industry,object,0,0,0
5,hq_country,object,0,0,0
6,ceo_sust_statem,object,2306,994,501
7,sales,object,0,0,19
8,num_employees,object,0,0,56
9,env_policy,object,3407,369,25


In [None]:
df["su_guidelines"].unique()

array([0, 1, nan, 'T', '0'], dtype=object)

Each column is given a valid data type to remove invalid results.

In [None]:
string_columns = ["comp_name", "ticker", "segment", "industry", "hq_country"]
df[string_columns] = df[string_columns].astype(str)

In [None]:
bool_columns = [
    "ceo_sust_statem",
    "env_policy",
    "ep_targets",
    "env_impact_red",
    "incr_renew_en",
    "disclosure_raw",
    "resource_target",
    "water_disclose",
    "audit_es_report",
    "su_guidelines",
    "su_aud_disclose",
    "su_eva_disclose",
    "su_env_assess",
]

unique_values_bool = []

for col in bool_columns:
    uniques = df[col].unique().tolist()
    unique_values_bool.append([col, uniques])

unique_bool_df = pd.DataFrame(unique_values_bool, columns=["Column", "Value"])

unique_bool_df.head(20)

Unnamed: 0,Column,Value
0,ceo_sust_statem,"[0, 1, nan, y, 0]"
1,env_policy,"[0, 1, nan, y, 0, Y ]"
2,ep_targets,"[1, 0, nan, 0]"
3,env_impact_red,"[1, 0, nan, 1152240, 0]"
4,incr_renew_en,"[0, nan, 1, 0]"
5,disclosure_raw,"[0, 1, nan, 0]"
6,resource_target,"[0, 1, nan, 0, 631.0571121]"
7,water_disclose,"[0, nan, 1, Y , Y?, 0, 1200, N , N, 17.582]"
8,audit_es_report,"[0, 1, nan, 0, y]"
9,su_guidelines,"[0, 1, nan, T, 0]"


In [None]:
def bool_transform(x: Any) -> int:
    """
    Coerce a raw entry to a binary flag:
      - Any numeric value > 0 (including numeric strings)  -> 1
      - 'Y'/'y' or 'T'/'t' (exact match)                    -> 1
      - Everything else (0, 'N', NaN, padded 'Y ', 'Y?')    -> 0
    """
    # Try converting x to a float; NaN fails the > 0 test and so maps to 0.
    try:
        val: float = float(x)
        return 1 if val > 0 else 0
    except (ValueError, TypeError):
        # Not numeric: accept only an exact 'y' or 't' (case-insensitive).
        return 1 if str(x).lower() in ("y", "t") else 0


df[bool_columns] = df[bool_columns].map(bool_transform)
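Because the function above compares exact strings, whitespace-padded entries such as 'Y ' are mapped to 0. If those entries are meant to count as 'yes', a whitespace-tolerant variant could look like the sketch below. `bool_transform_stripped` is a hypothetical name, and treating 'Y ' as 1 is an assumption about intent rather than something the dataset documents.

```python
from typing import Any

import numpy as np


def bool_transform_stripped(x: Any) -> int:
    """Hypothetical variant of bool_transform that ignores surrounding whitespace."""
    try:
        # Numbers and numeric strings: positive -> 1, otherwise (incl. NaN) -> 0.
        return 1 if float(x) > 0 else 0
    except (ValueError, TypeError):
        # Non-numeric: strip whitespace before matching 'y' or 't'.
        return 1 if str(x).strip().lower() in ("y", "t") else 0


samples = [0, 1, np.nan, "y", "Y ", "Y?", "N", "1200", "T"]
results = {repr(s): bool_transform_stripped(s) for s in samples}
print(results)
```

Ambiguous entries such as 'Y?' still map to 0 here, which keeps the conservative reading for anything that is not a clear 'yes'.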

In [None]:
numeric_columns = [
    "sales",
    "num_employees",
    "transport_emis",
    "energy_consump",
    "water_withdraw",
    "ghg_emis",
]

for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    # replace zero and negative values with NaN (existing NaNs stay NaN)
    df[col] = df[col].where(df[col] > 0)
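The effect of this cleaning rule is easiest to see on a small example. The sketch below applies it to a toy Series (the values are made up): `pd.to_numeric(..., errors="coerce")` turns unparseable entries into NaN, and `Series.where` keeps only the strictly positive values.

```python
import pandas as pd

# Toy column mixing numeric strings, a placeholder, invalid numbers and a missing value.
raw = pd.Series(["1200", "ND", -5, 0, 3.5, None])

clean = pd.to_numeric(raw, errors="coerce")  # 'ND' and None become NaN
clean = clean.where(clean > 0)               # zero and negative values become NaN

print(clean.tolist())
```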

I then transform any other anomalies.

In [None]:
df["year"] = pd.to_datetime(df["year"], format="%Y").dt.year

Some industries don't have enough data points to analyse, so they are merged into other industries.

In [None]:
df["industry"] = df["industry"].replace(
    {
        "Unknown": "Technology",
        "Biotechnology": "Technology",
        "Leisure": "Other",
        "Travel and Leisure": "Other",
        "Retail": "Consumer Goods and Services",
        "Telecommunications": "Other",
        "Media": "Other",
        "Utilities": "Energy and Utilities",
        "Energy": "Energy and Utilities",
        "Real Estate": "Other",
    }
)

In [None]:
df["industry"].value_counts()

industry
Industrial Goods and Services    957
Finance                          769
Consumer Goods and Services      694
Health Care                      400
Technology                       302
Energy and Utilities             287
Basic Materials                  199
Other                            196
Name: count, dtype: int64

In [None]:
# np.nanpercentile(df["transport_emis"], 5)

In [None]:
# I still have some 1s in float columns, but this is not necessarily an error. I will leave them.
generate_binary_summary(df)

Unnamed: 0,Column Name,Data Type,1s,0s,NaNs
0,comp_name,object,0,0,0
1,ticker,object,0,0,0
2,year,int32,0,0,0
3,segment,object,0,0,0
4,industry,object,0,0,0
5,hq_country,object,0,0,0
6,ceo_sust_statem,int64,2307,1497,0
7,sales,float64,0,0,66
8,num_employees,float64,1,0,69
9,env_policy,int64,3408,396,0


In [None]:
# show_missing_values(df)

I notice that some revenues seem to be an order of magnitude larger than they should be, so I correct them.

In [None]:
new_revenue_values = {
    "Star Bulk Carriers Corp.": 821.365,
    "Telenor ASA": 9799,
    "Cloetta AB": 649.106,
}

# Update 'sales' (renamed to revenue_MEUR later) based on company name.
# Note: every row (year) of a listed company receives the same value.
df.loc[df["comp_name"].isin(new_revenue_values.keys()), "sales"] = df["comp_name"].map(
    new_revenue_values
)
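The assignment above writes one value to every year of each listed company. If only a single year were affected, a narrower update keyed by company and year might look like the sketch below. The company names, years, and figures are invented for illustration, and `corrections` is a hypothetical name.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "comp_name": ["Company A", "Company A", "Company B"],
        "year": [2019, 2020, 2019],
        "sales": [8213.65, 800.0, 500.0],
    }
)

# Hypothetical corrections keyed by (company, year); the values are made up.
corrections = {("Company A", 2019): 821.365}

for (name, year), value in corrections.items():
    # Restrict the update to the single row that matches both keys.
    mask = (toy["comp_name"] == name) & (toy["year"] == year)
    toy.loc[mask, "sales"] = value

print(toy)
```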

### Remove data older than base year

I drop all data older than 2019. I chose this year because it is the base year for the Science-Based Targets Initiative's Business Ambition for 1.5°C (SBTI, 2024), which increased the number of companies that made climate commitments by over 80%. A common base year makes companies easier to compare, and 2019 also avoids the distorting effect of Covid-19 on business performance. Companies that have no data for 2019 use their earliest year of reporting as the base year.

In [None]:
df = df[df["year"] >= 2019]

df["year"].value_counts()

year
2020    491
2019    486
2021    439
2022    421
Name: count, dtype: int64

I create a base year column based on the earliest year of data for each company.

In [None]:
# define the earliest year for each company
earliest_year_df = df.groupby("comp_name")["year"].min()

# join the earliest_year_df on "comp_name" column
df = df.join(earliest_year_df, on="comp_name", how="left", rsuffix="_base")
df = df.rename(columns={"year_base": "base_year"}).reset_index(drop=True)

I then calculate the number of years of ESG reporting since 2019 for each company.

In [None]:
# 'years_esg_data' = count of rows for each 'comp_name'
df["years_esg_data"] = df.groupby("comp_name")["year"].transform("count")
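Both derived columns can also be produced with `groupby(...).transform`, which returns a result aligned to the original rows and so avoids the join and rename. This is a minimal sketch on toy data, not a change to the notebook's approach.

```python
import pandas as pd

toy = pd.DataFrame({"comp_name": ["A", "A", "B"], "year": [2020, 2019, 2021]})

# Earliest reporting year per company, broadcast back to every row.
toy["base_year"] = toy.groupby("comp_name")["year"].transform("min")

# Number of reported years per company.
toy["years_esg_data"] = toy.groupby("comp_name")["year"].transform("count")

print(toy)
```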

In [None]:
df.head()

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,num_employees,env_policy,ep_targets,env_impact_red,energy_consump,incr_renew_en,disclosure_raw,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess,base_year,years_esg_data
0,Archer Ltd.,ARCHO,2020,Mid,Energy and Utilities,Norway,1,735.714286,4556.0,1,1,1,459927.0,0,0,1,,0,,,1,1,1,0,0,2020,1
1,AutoStore Holdings Ltd.,AUTO,2021,Large,Industrial Goods and Services,Bermuda,1,292.5,578.0,1,0,1,,0,1,0,,0,0.7366,371.9243,0,1,0,1,0,2021,1
2,Avance Gas Holding ltd,AGAS,2019,Mid,Energy and Utilities,Norway,1,223.590179,271.0,1,1,1,,0,0,0,,0,,,0,0,0,0,0,2019,2
3,Avance Gas Holding ltd,AGAS,2020,Mid,Energy and Utilities,Norway,1,183.675,6.0,1,1,1,5934145.0,0,0,1,,0,,,1,1,0,0,0,2019,2
4,Borr Drilling Ltd,BDRILL,2019,Mid,Energy and Utilities,Bermuda,0,291.848552,1936.0,1,0,1,1980428.4,0,0,1,,0,150.784,43.671,0,1,0,0,0,2019,4


### Edit column names and add new columns

I add an indicator column for each continuous variable to flag whether a value was reported (1) or is missing (0).

In [None]:
# List of columns to create boolean indicators
columns_to_boolean = [
    "energy_consump",
    "water_withdraw",
    "ghg_emis",
    "transport_emis",
]

# Creating new boolean columns
for col in columns_to_boolean:
    df[f"{col}_bool"] = df[col].notna().astype(int)

In [None]:
df = df.rename(
    columns={
        "comp_name": "company",
        "sales": "revenue_MEUR",
        "energy_consump": "energy_consump_GJ",
        "water_withdraw": "water_withdraw_thm3",
        "ghg_emis": "ghg_emis_kt",
        "transport_emis": "transport_emis_kt",
        "audit_es_report": "external_audit_of_ESG_report",
        "env_policy": "environmental_policy_and_assessment",
        "ep_targets": "environmental_performance_targets",
        "env_impact_red": "reduced_environmental_impact",
        "incr_renew_en": "increased_renewable_energy",
        "disclosure_raw": "disclosure_of_raw_material_use",
        "resource_target": "resource_efficiency_target",
        "water_disclose": "disclosure_of_water_discharges",
        "su_guidelines": "supplier_guidelines",
        "su_aud_disclose": "disclosure_of_suppliers_audited",
        "su_eva_disclose": "disclosure_of_supplier_evaluation_procedures",
        "su_env_assess": "supplier_environmental_assessment",
    }
)

I calculate greenhouse gas emissions intensity and water withdrawal intensity by dividing each metric by revenue in million euros (MEUR), in line with CSRD.

In [None]:
df["ghg_emis_per_MEUR_revenue"] = df["ghg_emis_kt"] / df["revenue_MEUR"]
df["water_withdraw_per_MEUR_revenue"] = df["water_withdraw_thm3"] / df["revenue_MEUR"]

In [None]:
df.sort_values(by="revenue_MEUR", ascending=False).head()

Unnamed: 0,company,ticker,year,segment,industry,hq_country,ceo_sust_statem,revenue_MEUR,num_employees,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,energy_consump_GJ,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,water_withdraw_thm3,disclosure_of_water_discharges,ghg_emis_kt,transport_emis_kt,external_audit_of_ESG_report,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,base_year,years_esg_data,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool,ghg_emis_per_MEUR_revenue,water_withdraw_per_MEUR_revenue
811,Equinor ASA (formerly Statoil ASA),EQNR,2022,Large,Energy and Utilities,Norway,0,143208.9649,21936.0,1,1,1,,1,0,1,6000.0,0,11400.0,243000.0,1,1,1,0,1,2019,4,0,1,1,1,0.079604,0.041897
396,Fortum Oyj,FORTUM,2021,Large,Energy and Utilities,Finland,1,112400.0,18461.1,1,1,1,399600000.0,1,0,1,12359000.0,1,69750.7,120228.0,1,1,1,1,1,2019,4,1,1,1,1,0.620558,109.955516
810,Equinor ASA (formerly Statoil ASA),EQNR,2021,Large,Energy and Utilities,Norway,1,79235.71429,21115.0,1,1,1,212400000.0,0,0,1,8000.0,1,12100.0,249000.0,1,1,0,0,1,2019,4,1,1,1,1,0.152709,0.100965
103,A.P. Møller -Maersk A/S,MAERSK,2022,Large,Industrial Goods and Services,Denmark,0,77425.45109,104260.0,1,1,1,447345000.0,1,0,1,916.0,0,34506.0,43451.0,0,1,1,1,1,2019,4,1,1,1,1,0.445667,0.011831
808,Equinor ASA (formerly Statoil ASA),EQNR,2019,Large,Energy and Utilities,Norway,1,56170.53571,21412.0,1,1,1,252000000.0,0,0,1,12000.0,1,14900.0,247000.0,1,1,1,1,0,2019,4,1,1,1,1,0.265264,0.213635
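As a sanity check, the intensity values in the first row of the table above (Equinor, 2022) can be reproduced by hand from its emissions, water withdrawal, and revenue figures.

```python
# Figures copied from the Equinor 2022 row of the table.
ghg_kt = 11400.0
water_thm3 = 6000.0
revenue_meur = 143208.9649

ghg_intensity = ghg_kt / revenue_meur        # emissions per MEUR of revenue
water_intensity = water_thm3 / revenue_meur  # water withdrawal per MEUR of revenue

print(round(ghg_intensity, 6), round(water_intensity, 6))
```

Both results agree with the `ghg_emis_per_MEUR_revenue` and `water_withdraw_per_MEUR_revenue` columns for that row.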


## Alignment with new CSRD rules

The Corporate Sustainability Reporting Directive (CSRD) is a European Union directive that strengthens sustainability disclosure requirements for companies operating in the EU. The rules are likely to change following a recent proposed amendment by the European Commission (2025), under which the new criteria for reporting in the 2025 financial year will apply to companies with the following:
- more than 1000 employees on average (up from 250 employees in the previous directive)
- either €50 million in net turnover or €25 million in total assets (on the balance sheet).

Reporting requirements for those companies with fewer than 1000 employees and/or exceeding neither financial metric will be postponed until at least 2027.

This dataset does not include balance sheet data, so I classify companies only by their yearly revenue and number of employees. I assume that companies with a yearly revenue above €50 million and more than 1000 employees will be affected by CSRD. If either metric is missing, the comparison evaluates to false and the company is treated as unaffected. (This is a reasonable approximation rather than a strict reading of the rules, and it can be refined once balance sheet data are added.)

In [None]:
# 'more than 1000 employees' is a strict inequality, matching the criterion above
df["csrd_2025"] = ((df["revenue_MEUR"] > 50) & (df["num_employees"] > 1000)).astype(
    int
)
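The handling of missing values follows from how pandas compares NaN: any comparison with NaN is False, so a row with a missing metric ends up with a flag of 0. The toy frame below illustrates this, using the "more than 1000 employees" wording of the criterion; the figures are illustrative.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "revenue_MEUR": [735.7, 292.5, np.nan, 60.0],
        "num_employees": [4556.0, 578.0, 5000.0, np.nan],
    }
)

# NaN > 50 and NaN > 1000 both evaluate to False, so incomplete rows get 0.
toy["csrd_2025"] = (
    (toy["revenue_MEUR"] > 50) & (toy["num_employees"] > 1000)
).astype(int)

print(toy)
```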

In [None]:
# generate_binary_summary(df)

Companies that are not subject to reporting requirements in 2025 but were expected to start reporting from 1 January 2026 are currently those that meet at least two of the following three criteria:
- more than 10 employees
- more than 700,000 EUR (0.7 MEUR) in turnover
- more than 350,000 EUR (0.35 MEUR) in total assets

These reporting requirements are likely to change, and have been postponed until at least 2027. Nevertheless, I cautiously include companies whose employee count and turnover both exceed these thresholds in 'csrd_2027'.

In [None]:
df["csrd_2027"] = (
    (df["csrd_2025"] == 0) & (df["num_employees"] > 10) & (df["revenue_MEUR"] > 0.7)
).astype(int)
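By construction the two flags should never both be 1 for the same row. A small check, plus a single combined label for later grouping, might look like the sketch below. The column name `csrd_wave` and its labels are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"csrd_2025": [1, 0, 0], "csrd_2027": [0, 1, 0]})

# The flags are meant to be mutually exclusive; fail loudly if they are not.
overlap = (toy["csrd_2025"] == 1) & (toy["csrd_2027"] == 1)
assert not overlap.any()

# Collapse the two flags into one label (hypothetical column name).
toy["csrd_wave"] = np.select(
    [toy["csrd_2025"] == 1, toy["csrd_2027"] == 1],
    ["2025", "2027"],
    default="not flagged",
)

print(toy)
```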

## Export data

The original dataset has been cleaned to remove invalid data, so it is ready to be manipulated. I divide the data into two data frames: one for reporting and another for environmental impact. The reporting_df analyses how well companies comply with their reporting requirements (the equivalent of a gap analysis), while the impact_df analyses companies' emissions and resource use relative to their competitors and their previous performance.

In [None]:
reporting_df_cols = [
    "company",
    "ticker",
    "year",
    "revenue_MEUR",
    "csrd_2025",
    "csrd_2027",
    "segment",
    "industry",
    "hq_country",
    "years_esg_data",
    "base_year",
    "external_audit_of_ESG_report",
    "ceo_sust_statem",
    "environmental_policy_and_assessment",
    "environmental_performance_targets",
    "reduced_environmental_impact",
    "increased_renewable_energy",
    "disclosure_of_raw_material_use",
    "resource_efficiency_target",
    "disclosure_of_water_discharges",
    "supplier_guidelines",
    "disclosure_of_suppliers_audited",
    "disclosure_of_supplier_evaluation_procedures",
    "supplier_environmental_assessment",
    "energy_consump_bool",
    "water_withdraw_bool",
    "ghg_emis_bool",
    "transport_emis_bool",
]

reporting_df = df[df.columns.intersection(reporting_df_cols)]
reporting_df = reporting_df.reindex(columns=reporting_df_cols)
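One thing to be aware of with the intersection-plus-reindex pattern is that a misspelled column name does not raise an error: `reindex` adds it back as an all-NaN column. If a hard failure is preferred, a stricter helper could be used instead. `select_columns` below is a hypothetical name, shown on a toy frame.

```python
from typing import List

import pandas as pd


def select_columns(frame: pd.DataFrame, cols: List[str]) -> pd.DataFrame:
    """Return frame[cols] in the given order, raising if any column is absent."""
    missing = [c for c in cols if c not in frame.columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return frame.loc[:, cols].copy()


toy = pd.DataFrame({"company": ["A"], "year": [2019], "ticker": ["AAA"]})
subset = select_columns(toy, ["ticker", "company"])
print(subset.columns.tolist())
```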

In [None]:
# reporting_df.head()

In [None]:
impact_df_cols = [
    "company",
    "ticker",
    "year",
    "csrd_2025",
    "csrd_2027",
    "segment",
    "industry",
    "hq_country",
    "base_year",
    "external_audit_of_ESG_report",
    "revenue_MEUR",
    "energy_consump_GJ",
    "water_withdraw_thm3",
    "ghg_emis_kt",
    "transport_emis_kt",
    "ghg_emis_per_MEUR_revenue",
    "water_withdraw_per_MEUR_revenue",
]


impact_df = df[df.columns.intersection(impact_df_cols)]
impact_df = impact_df.reindex(columns=impact_df_cols)

In [None]:
# impact_df.sort_values(by="revenue_MEUR", ascending=False)

In [None]:
# impact_df[impact_df["revenue_MEUR"] > 50].sort_values(
#     by="revenue_MEUR", ascending=False
# )

In [None]:
reporting_df.head()

Unnamed: 0,company,ticker,year,revenue_MEUR,csrd_2025,csrd_2027,segment,industry,hq_country,years_esg_data,base_year,external_audit_of_ESG_report,ceo_sust_statem,environmental_policy_and_assessment,environmental_performance_targets,reduced_environmental_impact,increased_renewable_energy,disclosure_of_raw_material_use,resource_efficiency_target,disclosure_of_water_discharges,supplier_guidelines,disclosure_of_suppliers_audited,disclosure_of_supplier_evaluation_procedures,supplier_environmental_assessment,energy_consump_bool,water_withdraw_bool,ghg_emis_bool,transport_emis_bool
0,Archer Ltd.,ARCHO,2020,735.714286,1,0,Mid,Energy and Utilities,Norway,1,2020,1,1,1,1,1,0,0,1,0,1,1,0,0,1,0,0,0
1,AutoStore Holdings Ltd.,AUTO,2021,292.5,0,1,Large,Industrial Goods and Services,Bermuda,1,2021,0,1,1,0,1,0,1,0,0,1,0,1,0,0,0,1,1
2,Avance Gas Holding ltd,AGAS,2019,223.590179,0,1,Mid,Energy and Utilities,Norway,2,2019,0,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0
3,Avance Gas Holding ltd,AGAS,2020,183.675,0,0,Mid,Energy and Utilities,Norway,2,2019,1,1,1,1,1,0,0,1,0,1,0,0,0,1,0,0,0
4,Borr Drilling Ltd,BDRILL,2019,291.848552,1,0,Mid,Energy and Utilities,Bermuda,4,2019,0,0,1,0,1,0,0,1,0,1,0,0,0,1,0,1,1


I save the data frames to file and load them in the next notebook.

In [None]:
# folder_path = r"C:\Users\james\OneDrive - Högskolan Dalarna\01 - Turing College\D99 - Capstone Project\Nordic Compass - ESG Performance and CSRD Compliance\datasets"

# df.to_csv(f"{folder_path}/nordic_compass_df_cleaned_01.csv", index=False)
# reporting_df.to_csv(f"{folder_path}/reporting_df_original.csv", index=False)
# impact_df.to_csv(f"{folder_path}/impact_df_original.csv", index=False)
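A portable alternative to a hard-coded absolute path is to build the output location with `pathlib`. The helper below is a sketch under that assumption; `export_frames` is a hypothetical name, and the demonstration writes a toy frame to a temporary directory rather than to the project folder.

```python
import tempfile
from pathlib import Path
from typing import Dict, List

import pandas as pd


def export_frames(frames: Dict[str, pd.DataFrame], out_dir: Path) -> List[Path]:
    """Write each frame to <out_dir>/<name>.csv and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in frames.items():
        path = out_dir / f"{name}.csv"
        frame.to_csv(path, index=False)
        written.append(path)
    return written


# Demonstration on a throwaway directory with a toy frame.
toy = pd.DataFrame({"company": ["A"], "year": [2019]})
with tempfile.TemporaryDirectory() as tmp:
    paths = export_frames({"reporting_df_original": toy}, Path(tmp) / "datasets")
    written_names = [p.name for p in paths]
    round_trip = pd.read_csv(paths[0])

print(written_names)
```

In the notebook, the same call could be pointed at the relative `../datasets` folder used when loading the source file.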

## References

European Commission, 2025. Proposal for a Directive of the European Parliament and of the Council amending Directives (EU) 2022/2464 and (EU) 2024/1760 as regards the dates from which Member States are to apply certain corporate sustainability reporting and due diligence requirements. COM(2025) 80 final. Brussels. Available at: https://commission.europa.eu/document/download/0affa9a8-2ac5-46a9-98f8-19205bf61eb5_en?filename=COM_2025_80_EN.pdf (Accessed 27 February 2025)

Nordic Compass, 2022. Nordic Compass, Swedish House of Finance's ESG Database. Available at: https://www.hhs.se/en/houseoffinance/data-center/nordic-compass-shofs-esg-database/

SBTI, 2024. Business ambition for 1.5°C campaign: final report. Available at: https://sciencebasedtargets.org/resources/files/SBTi-Business-Ambition-final-report.pdf (Accessed 17 February 2025)

Thomson Reuters, 2012. Thomson Reuters Business Classification (now owned by Refinitiv). Available at: https://www.equidam.com/resources/trbc-fact-sheet.pdf (Accessed 26 February 2025) 

## Appendix

To-do list:
- In the Overview, add some useful tables (e.g. number of companies per industry, number of companies by size, number of companies by HQ, etc...)
- Check the other projects I have done and borrow some functions.
- Create a 'search_company' function to allow for a regex search.
- Change year from float to datetime.
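The 'search_company' item on the to-do list could be sketched as follows. This is one possible shape for such a helper, not an existing function in the project; the default column name assumes the renamed 'company' column.

```python
import pandas as pd


def search_company(
    frame: pd.DataFrame, pattern: str, column: str = "company"
) -> pd.DataFrame:
    """Return the rows whose company name matches a case-insensitive regex."""
    mask = frame[column].astype(str).str.contains(
        pattern, case=False, regex=True, na=False
    )
    return frame[mask]


toy = pd.DataFrame(
    {"company": ["Avance Gas Holding ltd", "Prosafe SE", "Seadrill Ltd."]}
)
hits = search_company(toy, r"^avance|ltd\.$")
print(hits["company"].tolist())
```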

Use the code below to track how some rows change as you apply changes to the whole dataset.

In [None]:
# # edit this to make the companies regexes
# validation_filter = {
#     "comp_name": [
#         "Avance Gas Holding ltd",
#         "Prosafe SE",
#         "Seadrill Ltd.",
#         "Tallink",
#         "ICA Gruppen AB",
#     ]
# }
# validation_cols_to_show = None  # 'None' shows all columns in df by default