To-do list:
- In the Overview, add some useful tables (e.g. number of companies per industry, by size, and by HQ country).
- Check the other projects I have done and borrow some functions.
- Change 'year' from float to datetime.

### Appendix: code validation

Use the code below to track how some rows change as you apply changes to the whole dataset.

In [457]:
# TODO: edit this so that the company names are matched as regexes
validation_filter = {
    "comp_name": [
        "Avance Gas Holding ltd",
        "Prosafe SE",
        "Seadrill Ltd.",
        "Tallink",
        "ICA Gruppen AB",
    ]
}
validation_cols_to_show = None  # 'None' shows all columns in df by default
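The helper that consumes `validation_filter` is not reproduced in this notebook (`test_filter` is imported from `functions`, but its body is not shown). As a rough sketch only, assuming the filter maps column names to lists of accepted values, a hypothetical `show_validation_rows` could select the tracked rows like this:

```python
import pandas as pd


def show_validation_rows(df, row_filter, cols_to_show=None):
    """Return rows whose values match every column/list pair in row_filter.

    Hypothetical helper for illustration; not the notebook's test_filter.
    """
    mask = pd.Series(True, index=df.index)
    for column, allowed_values in row_filter.items():
        mask &= df[column].isin(allowed_values)
    subset = df.loc[mask]
    # None means "show every column", mirroring validation_cols_to_show
    return subset if cols_to_show is None else subset[cols_to_show]


# Toy rows: the years are placeholders, not values from the dataset
toy = pd.DataFrame(
    {
        "comp_name": ["Tallink", "Prosafe SE", "Some Other AS"],
        "year": [2020.0, 2021.0, 2021.0],
        "segment": ["Mid", "Mid", "Small"],
    }
)
tracked = show_validation_rows(
    toy, {"comp_name": ["Tallink", "Prosafe SE"]}, ["comp_name", "year"]
)
print(tracked)
```

Calling the same function before and after each cleaning step gives a quick view of how the tracked rows change.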

<center><span style="font-size:30px; font-weight: bold;">Nordic Compass Database</span></center>
<center><span style="font-size:24px;">Analysis of ESG Performance and CSRD Compliance</span></center>

<center><span style="font-size:22px;"><b>Section 1:</b> Preprocessing and cleaning </span></center>

## Introduction to this section

Insert discussion here...

## Imports

In [458]:
import os
import random
import sys

import numpy as np
import pandas as pd
from rapidfuzz import fuzz, process

# make the parent directory importable so that the 'functions' module can be found
sys.path.append(os.path.abspath(".."))
from functions import (
    display_unique_counts,
    show_missing_values,
    test_filter,
    test_ticker,
    tickers_with_multiple_companies,
    companies_with_multiple_tickers,
    apply_most_recent_company_name,
    apply_most_recent_ticker,
    test_company,
    find_similar_entries,
    map_similar_pairs,
    update_segments_remove_na,
    get_most_recent_values,
    generate_binary_summary,
)

In [459]:
# load the file
df = pd.read_csv("../datasets/NordicCompass2014_2022.csv")

## Data overview

The database is too large to cover every column needed for full CSRD compliance, so I focus only on the columns relating to a company's environmental performance.

In [460]:
# df.columns.tolist()

relevant_columns = [
    "comp_name",
    "ticker",
    "year",
    "segment",
    "industry",
    "hq_country",
    "ceo_sust_statem",
    "sales",
    "env_policy",
    "ep_targets",
    "env_impact_red",
    "energy_consump",
    "incr_renew_en",
    "disclosure_raw",
    "resource_target",
    "water_withdraw",
    "water_disclose",
    "ghg_emis",
    "transport_emis",
    "audit_es_report",
    "su_guidelines",
    "su_aud_disclose",
    "su_eva_disclose",
    "su_env_assess",
]

In [461]:
# selects all rows and only relevant columns
df = df.loc[:, relevant_columns]

In [462]:
df.head()

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,env_policy,ep_targets,...,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
0,Archer Ltd.,ARCHER,2016.0,Mid,Oil & Gas,Bermuda,N,841.9,N,Y,...,N,ND,N,ND,ND,N,N,N,N,N
1,Archer Ltd.,ARCHER,2017.0,Mid,Oil & Gas,Bermuda,Y,705.7,Y,N,...,N,ND,N,ND,ND,N,N,N,N,N
2,Archer Ltd.,ARCHO,2020.0,Mid,Energy,Norway,Y,735.7142857,Y,Y,...,Y,ND,N,ND,ND,Y,Y,Y,N,N
3,AutoStore Holdings Ltd.,AUTO,2021.0,Large,Industrials,Bermuda,Y,292.5,Y,N,...,N,ND,N,0.7366,371.9243,N,Y,N,Y,N
4,Avance Gas Holding ltd,AVACF,2019.0,Mid,Energy,Norway,Y,223.5901786,Y,Y,...,N,ND,N,N,N,N,N,N,N,N


Some companies, such as Archer Ltd., have more than one ticker, so the tickers will need to be standardised.

In [463]:
display_unique_counts(df)

Unique companies in the database:  782
Unique tickers in the database:  783
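The to-do list calls for overview tables such as the number of companies per industry, size, or HQ country. A minimal sketch of one way to build them, assuming a hypothetical helper `companies_per` and toy rows (not values from the dataset); the counts are only meaningful once company names have been cleaned:

```python
import pandas as pd


def companies_per(df, column, company_col="comp_name"):
    """Count distinct companies for each value of `column` (hypothetical helper)."""
    return (
        df.groupby(column)[company_col]
        .nunique()
        .sort_values(ascending=False)
        .rename("n_companies")
    )


# Toy rows: company A appears in two years but is counted once
toy = pd.DataFrame(
    {
        "comp_name": ["A", "A", "B", "C"],
        "year": [2020.0, 2021.0, 2021.0, 2021.0],
        "industry": ["Energy", "Energy", "Energy", "Finance"],
    }
)
per_industry = companies_per(toy, "industry")
print(per_industry)
```

The same call with `"segment"` or `"hq_country"` would give the other two tables mentioned in the to-do list.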


## Data cleaning

I want to ensure that companies and tickers match one-to-one. At present, some companies have several tickers and some tickers are shared by several company names. I will then deal with duplicates (e.g. the same company and year appearing multiple times). Where data is duplicated, I will prioritise rows where 'GHG emissions' data exists and/or 'sales' are higher (some rows have net income reported by mistake).
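The detection helpers used below come from the local `functions` module, whose code is not shown. As an illustration only, a mismatch in either direction can be found with a single group-by; `one_to_many` is a hypothetical name, and the toy rows reuse the Axactor and Archer examples that appear in this notebook:

```python
import pandas as pd


def one_to_many(df, key, value):
    """Map each `key` to its sorted distinct `value`s, keeping keys with more than one."""
    clean = df.dropna(subset=[key, value])
    grouped = clean.groupby(key)[value].unique()
    return {k: sorted(v) for k, v in grouped.items() if len(v) > 1}


toy = pd.DataFrame(
    {
        "comp_name": ["Axactor", "Axactor SE", "Archer Ltd.", "Archer Ltd."],
        "ticker": ["ACR", "ACR", "ARCHER", "ARCHO"],
        "year": [2019.0, 2021.0, 2016.0, 2020.0],
    }
)

# One ticker shared by several names, and one name with several tickers
tickers_to_names = one_to_many(toy, "ticker", "comp_name")
names_to_tickers = one_to_many(toy, "comp_name", "ticker")
print(tickers_to_names)
print(names_to_tickers)
```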

### Multiple companies associated with one ticker

I first check tickers that are associated with multiple companies.

In [464]:
# Show rows where a single ticker is associated with multiple companies (e.g. 'ACR' = 'Axactor' and 'Axactor SE')
tickers_with_multiple_companies(df)

Tickers associated with multiple companies:  114



['ACR: Axactor, Axactor SE',
 'ADE: Adevinta, Adevinta ASA',
 'AF: ÅF AB, ÅF Pöyry AB',
 'AKRBP: Aker BP ASA, Aker BP ASA (Det norske oljeselskap ASA)',
 'AKTIA: Aktia Bank PLC (formerly known as Aktia, Aktia Bank Plc (formerly known as Aktia Pankki Oyj), Aktia Bank plc',
 'AM1: Ahlstrom-Munksjö Oyj, Ahlstrom-Munksjö Oyj  (Munksjö Oyj)',
 'ANOD: AddNode Group AB, Addnode Group AB',
 'ARCUS: Arcus ASA, Arcus asa',
 'ARION: Arion Banki hf., Arion Banki hf. SDB',
 'ASSA: ASSA ABLOY AB, Assa Abloy AB',
 'AZTO: ArcticZymes Technologies (Biotec Pharmacon ), ArcticZymes Technologies ASA(formerly Biotec Pharmacon ASA)',
 'BALD: Balder Fastighets AB, Fastighets AB Balder',
 'BDRILL: Borr Drilling Ltd, Borr Drilling Ltd.',
 'BHG: BHG (formerly Bygghemma Group First AB), BHG AB(formerly Bygghemma Group First AB), BHG Group AB (Bygghemma Group First AB)',
 'BITTI: Bittium Oyj, Bittium Oyj  (Elektrobit Oyj)',
 'BO: Bang & Olufsen A/S, Bang & Olufsen Holding A/S',
 'BWLPG: BW LPG, BW LPG Ltd',
 'BWO

In [465]:
test_ticker(df, "ACR")

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,env_policy,ep_targets,...,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
2110,Axactor,ACR,2019.0,Mid,Financials,Norway,N,368.1,Y,N,...,N,ND,N,N,N,N,Y,N,N,N
2111,Axactor,ACR,2020.0,Mid,Financial Services,Norway,N,201.2,Y,N,...,N,ND,N,0.495177,0.1753,N,Y,N,Y,Y
2114,Axactor SE,ACR,2021.0,Mid,Financials,Sweden,Y,195.1,Y,Y,...,N,ND,N,0.707504,0.144031,N,Y,N,N,Y


Where multiple company names are associated with a single ticker, I keep the name from the most recent year and replace all earlier names with it.
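The "most recent year wins" rule can be sketched as follows. This is not the code of `apply_most_recent_company_name`, which lives in the `functions` module and is not shown; `latest_value_per_group` is a hypothetical stand-in that returns a new frame rather than modifying `df` in place:

```python
import pandas as pd


def latest_value_per_group(df, group_col, value_col, year_col="year"):
    """Overwrite value_col with the value from each group's most recent year."""
    # Row label of the latest year within each group (rows without a year are ignored)
    latest_idx = df.dropna(subset=[year_col]).groupby(group_col)[year_col].idxmax()
    latest = df.loc[latest_idx].set_index(group_col)[value_col]
    out = df.copy()
    # Groups with no usable year keep their original value
    out[value_col] = out[group_col].map(latest).fillna(out[value_col])
    return out


toy = pd.DataFrame(
    {
        "comp_name": ["Axactor", "Axactor", "Axactor SE"],
        "ticker": ["ACR", "ACR", "ACR"],
        "year": [2019.0, 2020.0, 2021.0],
    }
)
renamed = latest_value_per_group(toy, group_col="ticker", value_col="comp_name")
print(renamed)
```

Swapping the arguments (`group_col="comp_name"`, `value_col="ticker"`) applies the same rule in the other direction, which is the idea behind standardising tickers per company.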

In [466]:
apply_most_recent_company_name(df)

In [467]:
test_ticker(df, "ACR")

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,env_policy,ep_targets,...,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
2110,Axactor SE,ACR,2019.0,Mid,Financials,Norway,N,368.1,Y,N,...,N,ND,N,N,N,N,Y,N,N,N
2111,Axactor SE,ACR,2020.0,Mid,Financial Services,Norway,N,201.2,Y,N,...,N,ND,N,0.495177,0.1753,N,Y,N,Y,Y
2114,Axactor SE,ACR,2021.0,Mid,Financials,Sweden,Y,195.1,Y,Y,...,N,ND,N,0.707504,0.144031,N,Y,N,N,Y


Validate that all tickers have only one associated company.

In [468]:
tickers_with_multiple_companies(df)
display_unique_counts(df)

Tickers associated with multiple companies:  0

Unique companies in the database:  666
Unique tickers in the database:  783


### Multiple tickers associated with one company

In [469]:
companies_with_multiple_tickers(df)

Companies associated with multiple tickers:  111



['A.P. Møller -Maersk A/S: MAERSK, MAERSK A',
 'ABG Sundal Collier Holding ASA: ABG, ASC',
 'Akastor  ASA: AKAST, AKKVF',
 'Aker BP ASA: AKERBP, AKRBP',
 'Ambu A/S: AMBU, AMBU-B',
 'Archer Ltd.: ARCHER, ARCHO',
 'Asetek A/S: ASETEK, ASTK',
 'Avance Gas Holding ltd: AGAS, AVACF',
 'Axactor SE: ACR, AXA',
 'BankNordik P/F: BNORDIK, BNORDIK CSE',
 'Beijer Alma AB: BEIA, BEIA B',
 'Beijer Ref AB: BEIJ, BEIJ B',
 'Belships ASA: BEL, BELO',
 'Bonheur ASA: BON, BONH',
 'Borregaard ASA: BRG, BRGO',
 'Bouvet ASA: BOUV, BOUVET',
 'Carlsberg A/S: CARL, CARL B',
 'Caverion Oyj: CAV, CAV1V',
 'CellaVision AB: CEVI, SEVI',
 'Cloetta AB: CLA, CLA B',
 'Coloplast A/S: COLO, COLO B',
 'ContextVision: CONTX, COVO',
 'Corem Property Group AB: CORE, CORE A',
 'Crayon Group Holding ASA: CRAYN, CRAYNO',
 'DOF ASA: DOF, DOFO',
 'Elanders AB: ELAN, ELAN B',
 'Frontline Ltd: FRO, FROo',
 'Genmab A/S: GEN, GMAB',
 'H. Lundbeck A/S: HLUN, LUN',
 'Hexagon AB: HEXA, HEXA B',
 'Huhtamäki Oyj: HUH, HUH1V',
 'Höegh L

In [470]:
test_company(df, "Archer Ltd.")

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,env_policy,ep_targets,...,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
0,Archer Ltd.,ARCHER,2016.0,Mid,Oil & Gas,Bermuda,N,841.9,N,Y,...,N,ND,N,ND,ND,N,N,N,N,N
1,Archer Ltd.,ARCHER,2017.0,Mid,Oil & Gas,Bermuda,Y,705.7,Y,N,...,N,ND,N,ND,ND,N,N,N,N,N
2,Archer Ltd.,ARCHO,2020.0,Mid,Energy,Norway,Y,735.7142857,Y,Y,...,Y,ND,N,ND,ND,Y,Y,Y,N,N


In [471]:
apply_most_recent_ticker(df)

In [472]:
test_company(df, "Archer Ltd.")

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,env_policy,ep_targets,...,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
0,Archer Ltd.,ARCHO,2016.0,Mid,Oil & Gas,Bermuda,N,841.9,N,Y,...,N,ND,N,ND,ND,N,N,N,N,N
1,Archer Ltd.,ARCHO,2017.0,Mid,Oil & Gas,Bermuda,Y,705.7,Y,N,...,N,ND,N,ND,ND,N,N,N,N,N
2,Archer Ltd.,ARCHO,2020.0,Mid,Energy,Norway,Y,735.7142857,Y,Y,...,Y,ND,N,ND,ND,Y,Y,Y,N,N


Validate that all companies have only one ticker.

In [473]:
companies_with_multiple_tickers(df)
display_unique_counts(df)

Companies associated with multiple tickers:  0

Unique companies in the database:  666
Unique tickers in the database:  665


'ICA Gruppen AB' was missing a ticker, so I filled it. 'Truecaller AB' had a ticker '1', so I manually updated it.

In [474]:
# test_company(df, "ICA Gruppen AB").head(1)
df.loc[df["comp_name"] == "ICA Gruppen AB", "ticker"] = "ICA"

In [475]:
# manually update the ticker for 'Truecaller AB'
df.loc[df["comp_name"] == "Truecaller AB", "ticker"] = "TRUE B"

# df[df["comp_name"] == "Truecaller AB"]

In [476]:
display_unique_counts(df)

Unique companies in the database:  666
Unique tickers in the database:  666


### Other company name errors

I now search for company names that are similar enough to potentially represent the same company.

In [477]:
similar_pairs_df = find_similar_entries(df, 75)

print(f"Number of rows: {len(similar_pairs_df)}", end="\n")
similar_pairs_df

Number of rows: 43


Unnamed: 0,entry1,year1,entry2,year2,similarity
0,"SCA, Svenska Cellulosa AB (SCA)",2022.0,SCA. Svenska Cellulosa AB (SCAA),2021.0,95.238095
1,Hagar hf (HAGA),2022.0,Hagar hf. (HAGAR),2015.0,93.75
2,Akastor ASA (AKAST),2020.0,Akastor ASA (AKA),2017.0,91.891892
3,"Ericsson, Telefonab. L M (ERIC)",2022.0,Ericsson Telefonab LM (ERIC-B),2019.0,91.803279
4,SpareBank 1 SR-Bank (SRBANK),2022.0,SpareBank 1 SR-Bank ASA (SRBNK),2020.0,91.525424
5,Tanker Investments Ltd (TNK),2020.0,Tanker Investments Ltd. (TIL),2016.0,91.22807
6,Nobia AB (NOBI),2022.0,Nobina AB (NOBINA),2022.0,90.909091
7,"Hennes & Mauritz AB, H & M (HM)",2022.0,Hennes & Mauritz AB. H&M (HM B),2020.0,90.322581
8,Momentum Group (MMGR),2022.0,Momentum Group AB (MMGR B),2020.0,89.361702
9,Avance Gas Holding ltd (AGAS),2020.0,Avance Gas Holding ltd. (AVANCE),2016.0,88.52459
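The implementation of `find_similar_entries` is not shown here. To illustrate the general idea of pairwise name similarity with a threshold, the sketch below swaps in the standard library's `difflib.SequenceMatcher` for `rapidfuzz`, so its scores will not match the table above exactly; `similar_name_pairs` is a hypothetical name:

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_name_pairs(names, threshold=75.0):
    """Return (name_a, name_b, score) for pairs scoring at or above threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        # ratio() is in [0, 1]; scale to 0-100 to resemble the scores above
        score = 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 2)))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


names = ["Hagar hf", "Hagar hf.", "Nobia AB", "Nobina AB", "Tallink"]
candidates = similar_name_pairs(names)
for name_a, name_b, score in candidates:
    print(name_a, "|", name_b, "|", score)
```

Both the Hagar pair and the Nobia/Nobina pair clear the threshold, although only the first is the same company. A similarity score can therefore only propose candidates, and a manual review step is still needed.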


Not all of the companies above are the same company, so I only want to change the ones that are. I do this manually (although I am sure there must be a more robust solution).

In [478]:
# is there a more robust way of doing this? Probably.
indices_to_keep = [0, 1, 2, 3, 4, 5, 7, 8, 9, 10, 14, 15, 16, 19, 20]

similar_pairs_df = similar_pairs_df.loc[indices_to_keep]

similar_pairs_df

Unnamed: 0,entry1,year1,entry2,year2,similarity
0,"SCA, Svenska Cellulosa AB (SCA)",2022.0,SCA. Svenska Cellulosa AB (SCAA),2021.0,95.238095
1,Hagar hf (HAGA),2022.0,Hagar hf. (HAGAR),2015.0,93.75
2,Akastor ASA (AKAST),2020.0,Akastor ASA (AKA),2017.0,91.891892
3,"Ericsson, Telefonab. L M (ERIC)",2022.0,Ericsson Telefonab LM (ERIC-B),2019.0,91.803279
4,SpareBank 1 SR-Bank (SRBANK),2022.0,SpareBank 1 SR-Bank ASA (SRBNK),2020.0,91.525424
5,Tanker Investments Ltd (TNK),2020.0,Tanker Investments Ltd. (TIL),2016.0,91.22807
7,"Hennes & Mauritz AB, H & M (HM)",2022.0,Hennes & Mauritz AB. H&M (HM B),2020.0,90.322581
8,Momentum Group (MMGR),2022.0,Momentum Group AB (MMGR B),2020.0,89.361702
9,Avance Gas Holding ltd (AGAS),2020.0,Avance Gas Holding ltd. (AVANCE),2016.0,88.52459
10,Wärtsilä Oyj Abp (WRT1V),2021.0,Wärtsilä Oyj (WRT1),2020.0,88.372093


I merge those companies that I identify as the same company, replacing the older name with the most recent name.

In [479]:
map_similar_pairs(similar_pairs_df, df)

In [480]:
# Catella AB and Catena AB should both be in the dataset, but Hagar hf. should be merged into Hagar hf
# test_filter(
#     df, {"comp_name": ["Catella AB", "Catena AB", "Hagar hf", "Hagar hf."]}, "comp_name"
# )

Finally, I want to catch the remaining companies that appear under different names but were not flagged by the similarity checker. I do this manually.

In [481]:
# sorted(df['comp_name'].unique().tolist())

In [482]:
# To get the final companies that appear multiple times under different names, I check manually and compile a list
similar_companies_manual = [
    "Kindred Group Plc (formerly Unibet Group)",
    "Kindred Group plc",
    "Ahlstrom Oyj",
    "Ahlstrom-Munksjö Oyj",
    "Bergman & Beving AB",
    "Bergman & Beving AB  (B&B Tools AB)",
    "Kinnevik AB",
    "Kinnevik AB  (Kinnevik Investment AB)",
    "MT Højgaard A/S (formerly known as Højga",
    "MT Højgaard Holding A/S  (Højgaard Holding A/S)",
    "Metso Outotec Oyj",
    "Metso Outotec Oyj  (Outotec Oyj)",
    "Nordnet AB",
    "Nordnet AB publ",
    "Radisson Hospitality AB",
    "Radisson Hospitality AB  (Rezidor Hotel Group AB)",
    "Revenio Group Corporation",
    "Revenio Group Oyj",
    "Raisio Oyj",
    "Raisio Oyj Vaihto-osake",
    "Royal Caribbean Cruises Ltd.",
    "Royal Caribbean Group (formerly: Royal Caribbean Cruises Ltd)",
    "TORM A/S",
    "TORM plc",
    "VBG GROUP AB",
    "VBG Group AB",
]

filtered_df = df[df["comp_name"].isin(similar_companies_manual)]

latest_entries = filtered_df.sort_values(by="year", ascending=False).drop_duplicates(
    subset=["comp_name"], keep="first"
)

# Merge back to get ticker and ensure entry1 has the most recent year
similar_pairs_manual = []
for company1, company2 in zip(
    similar_companies_manual[::2], similar_companies_manual[1::2]
):
    entry1 = latest_entries[latest_entries["comp_name"] == company1]
    entry2 = latest_entries[latest_entries["comp_name"] == company2]

    if not entry1.empty and not entry2.empty:
        # Extract relevant details
        year1 = entry1["year"].values[0]
        year2 = entry2["year"].values[0]
        ticker1 = entry1["ticker"].values[0]
        ticker2 = entry2["ticker"].values[0]

        # Ensure entry1 has the most recent year
        if year1 < year2:
            company1, company2 = company2, company1
            year1, year2 = year2, year1
            ticker1, ticker2 = ticker2, ticker1

        # Format the entries
        formatted_entry1 = f"{company1} ({ticker1})"
        formatted_entry2 = f"{company2} ({ticker2})"

        # Append to the list
        similar_pairs_manual.append((formatted_entry1, year1, formatted_entry2, year2))

# Convert to DataFrame
similar_pairs_manual_df = pd.DataFrame(
    similar_pairs_manual, columns=["entry1", "year1", "entry2", "year2"]
)

similar_pairs_manual_df

Unnamed: 0,entry1,year1,entry2,year2
0,Ahlstrom-Munksjö Oyj (AM1),2020.0,Ahlstrom Oyj (AHL1V),2016.0
1,Bergman & Beving AB (BERG),2022.0,Bergman & Beving AB (B&B Tools AB) (BBTO),2015.0
2,MT Højgaard A/S (formerly known as Højga (MTHH),2022.0,MT Højgaard Holding A/S (Højgaard Holding A/S...,2020.0
3,Metso Outotec Oyj (METSO),2022.0,Metso Outotec Oyj (Outotec Oyj) (OTE),2019.0
4,Nordnet AB (SAVE),2022.0,Nordnet AB publ (NN),2020.0
5,Radisson Hospitality AB (RADH),2017.0,Radisson Hospitality AB (Rezidor Hotel Group ...,2016.0
6,Revenio Group Corporation (REG1V),2022.0,Revenio Group Oyj (REG),2020.0
7,Raisio Oyj Vaihto-osake (RAIVV),2022.0,Raisio Oyj (RAI),2020.0
8,TORM plc (TRMD),2022.0,TORM A/S (Torm A),2015.0


I then merge again, replacing the old name with the new name.

In [483]:
map_similar_pairs(similar_pairs_manual_df, df)

In [484]:
# # check that the changes were made correctly
# df[df["comp_name"].str.startswith("Rad")].sort_values(by="comp_name", ascending=True)
# # df[df["comp_name"].str.match(r"^Ahlstrom.*")]

### Handle duplicates

For duplicate rows (the same ticker and year appearing more than once), I decide manually which to drop, based on the presence or absence of important data (e.g. GHG emissions).

In [485]:
# df[df.duplicated(subset=['comp_name', 'year'])]
df[df.duplicated(subset=["ticker", "year"], keep=False)].sort_values(
    by=["comp_name", "year"], ascending=[True, False]
)

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,env_policy,ep_targets,...,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
973,Ahlstrom-Munksjö Oyj,AM1,2016.0,Mid,Basic Materials,Sweden,Y,1085.9,Y,Y,...,Y,23000,Y,467.2,ND,Y,Y,N,N,Y
1128,Ahlstrom-Munksjö Oyj,AM1,2016.0,Mid,Basic Materials,Sweden,Y,1142.9,Y,Y,...,N,ND,N,ND,ND,N,Y,N,N,N
972,Ahlstrom-Munksjö Oyj,AM1,2015.0,Mid,Basic Materials,Finland,N,1074.7,Y,Y,...,Y,24000,N,502.2,ND,N,Y,N,N,N
1127,Ahlstrom-Munksjö Oyj,AM1,2015.0,Mid,Basic Materials,Sweden,Y,1130.7,Y,N,...,N,40500,N,337,ND,N,Y,N,Y,Y
971,Ahlstrom-Munksjö Oyj,AM1,2014.0,Mid,Basic Materials,Finland,Y,1001.1,Y,Y,...,Y,26,N,333.4,ND,N,Y,N,N,Y
1126,Ahlstrom-Munksjö Oyj,AM1,2014.0,Mid,Basic Materials,Sweden,Y,1137.3,Y,Y,...,Y,41250,Y,352,ND,N,Y,N,N,Y
1396,Arion Banki hf.,ARION,2019.0,Large,Financials,Iceland,N,430.4688077,Y,N,...,N,136.99,N,0.0634,0.0708,N,N,N,N,N
3433,Arion Banki hf.,ARION,2019.0,Large,Financials,Iceland,Y,354.3743079,Y,Y,...,N,136.99,N,0.1342,0.3154,N,Y,N,N,Y
2377,Industrivärden AB,INDU,2022.0,Large,Financials,Sweden,,-1320.937102,Y,Y,...,Y,ND,N,0.026,0.011,Y,Y,N,N,N
2722,Industrivärden AB,INDU,2022.0,Large,Industrials,Sweden,,2541.587975,Y,Y,...,N,ND,N,19.86,1.685,N,Y,Y,Y,Y


In [486]:
# Not the most robust approach: these are hard-coded row labels, so they are only valid while the index above is unchanged
duplicates_to_drop = [971, 972, 973, 1396, 2126, 1247]
df = df.drop(index=duplicates_to_drop)
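A rule-based alternative to hard-coded row labels is sketched below. It follows the priority stated under Data cleaning (prefer rows with GHG data, then higher sales), so it may keep different rows from the manual list above. `drop_duplicate_years` and the ticker "XYZ" are hypothetical:

```python
import pandas as pd


def drop_duplicate_years(df, keys=("ticker", "year")):
    """Keep one row per key, preferring rows with GHG data, then higher sales."""
    ranked = df.assign(
        # errors="coerce" turns markers such as 'ND' into NaN
        _has_ghg=pd.to_numeric(df["ghg_emis"], errors="coerce").notna(),
        _sales=pd.to_numeric(df["sales"], errors="coerce"),
    )
    ranked = ranked.sort_values(
        ["_has_ghg", "_sales"], ascending=False, na_position="last"
    )
    deduped = ranked.drop_duplicates(subset=list(keys), keep="first")
    return deduped.drop(columns=["_has_ghg", "_sales"]).sort_index()


# Toy rows with made-up values; the index mimics arbitrary row labels
toy = pd.DataFrame(
    {
        "ticker": ["XYZ", "XYZ", "XYZ", "XYZ"],
        "year": [2020.0, 2020.0, 2021.0, 2021.0],
        "sales": [100.0, 90.0, 50.0, 70.0],
        "ghg_emis": ["ND", "12.5", "3.0", "4.0"],
    },
    index=[10, 11, 12, 13],
)
deduped = drop_duplicate_years(toy)
print(deduped)
```

Because the rule is computed from the data, it does not depend on the row labels staying the same between runs.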

In [487]:
df[df.duplicated(subset=["ticker", "year"], keep=False)].sort_values(
    by=["comp_name", "year"], ascending=[True, False]
)

Unnamed: 0,comp_name,ticker,year,segment,industry,hq_country,ceo_sust_statem,sales,env_policy,ep_targets,...,resource_target,water_withdraw,water_disclose,ghg_emis,transport_emis,audit_es_report,su_guidelines,su_aud_disclose,su_eva_disclose,su_env_assess
2377,Industrivärden AB,INDU,2022.0,Large,Financials,Sweden,,-1320.937102,Y,Y,...,Y,ND,N,0.026,0.011,Y,Y,N,N,N
2722,Industrivärden AB,INDU,2022.0,Large,Industrials,Sweden,,2541.587975,Y,Y,...,N,ND,N,19.86,1.685,N,Y,Y,Y,Y
2376,Industrivärden AB,INDU,2021.0,Large,Financials,Sweden,Y,2557.608075,Y,Y,...,N,ND,N,0.024,0.01,Y,Y,N,N,Y
2721,Industrivärden AB,INDU,2021.0,Large,Industrials,Sweden,Y,2067.70139,Y,Y,...,Y,ND,N,19.766,ND,N,Y,N,Y,N
2374,Industrivärden AB,INDU,2019.0,Large,Financials,Sweden,Y,2841.07789,Y,Y,...,N,ND,N,35,0.027,Y,Y,N,N,N
2720,Industrivärden AB,INDU,2019.0,Large,Industrials,Sweden,Y,1753.094649,Y,N,...,N,ND,N,ND,ND,N,Y,N,Y,Y
2373,Industrivärden AB,INDU,2018.0,Large,Financials,Sweden,N,-787.7725118,Y,Y,...,N,ND,N,0.0178,0.0302,N,Y,N,Y,Y
2719,Industrivärden AB,INDU,2018.0,Large,Industrials,Sweden,Y,1596.966825,N,N,...,N,ND,N,ND,ND,N,N,N,N,N
2372,Industrivärden AB,INDU,2017.0,Large,Financials,Sweden,Y,1598.3,Y,Y,...,N,ND,N,0,0.049,Y,N,N,N,N
2718,Industrivärden AB,INDU,2017.0,Large,Industrials,Sweden,Y,1507.3,Y,N,...,N,ND,N,ND,ND,Y,Y,N,N,N


In [488]:
display_unique_counts(df)

Unique companies in the database:  642
Unique tickers in the database:  642


### Standardise segment and industry

Some companies are missing segment data in some years, while others show different segments from year to year. I fill the gaps by taking the most recent segment for each company and applying it to all other years.

In [489]:
df["segment"].unique().tolist()

['Mid', 'Large', 'Small', nan, 'ND', '0']

In [490]:
companies_missing_segments = (
    df[df["segment"].isin(["ND", "0", np.nan])]["comp_name"].unique().tolist()
)

companies_missing_segments

['Seadrill Ltd',
 'Basware Oyj',
 'Bakkafrost P/F',
 'Onxeo SA',
 'ICA Gruppen AB',
 'Norwegian Finans Holding',
 'Schibsted ASA']

In [491]:
update_segments_remove_na(df)

In [492]:
# # verification that this worked
# df[
#     df["comp_name"].isin(
#         [
#             "Seadrill Ltd",
#             "Basware Oyj",
#             "Bakkafrost P/F",
#             "Onxeo SA",
#             "ICA Gruppen AB",
#             "Norwegian Finans Holding",
#             "Schibsted ASA",
#         ]
#     )
# ]

Now all rows have an associated segment, but some companies appear in different segments from one year to the next. To standardise the segment for each company, I extract data from the most recent year for each company and apply that to all years.

In [493]:
get_most_recent_values(df, columns_to_update=["segment", "industry", "hq_country"])

Each company's most recent values in these columns have now been applied to all of its years.
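The two steps above (filling missing segments, then standardising on the most recent value) can be illustrated in one pass. This is a sketch under assumptions, not the code of `update_segments_remove_na` or `get_most_recent_values`: the helper name `standardise_to_latest` and the company names are hypothetical, and "ND" and "0" are treated as missing because they appear among the segment values listed earlier.

```python
import numpy as np
import pandas as pd

PLACEHOLDERS = ["ND", "0"]  # codes treated as missing in this sketch


def standardise_to_latest(df, columns, group_col="comp_name", year_col="year"):
    """Give every row the company's most recent non-missing value per column."""
    out = df.copy()
    out[columns] = out[columns].mask(out[columns].isin(PLACEHOLDERS))
    # Sort so that "last" within each company means "most recent year"
    ordered = out.sort_values(year_col)
    for col in columns:
        # GroupBy "last" skips missing values, so gaps are filled as well
        out[col] = ordered.groupby(group_col)[col].transform("last")
    return out


toy = pd.DataFrame(
    {
        "comp_name": ["Alpha AB", "Alpha AB", "Alpha AB", "Beta ASA", "Beta ASA"],
        "year": [2019.0, 2020.0, 2021.0, 2020.0, 2021.0],
        "segment": ["Mid", "ND", "Large", "Small", np.nan],
    }
)
standardised = standardise_to_latest(toy, ["segment"])
print(standardised)
```

The assignment relies on index alignment, so it assumes the row index is unique.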

In [494]:
# verify that the above code did its job
# test_company(df, "Seadrill Ltd")

Some industries have multiple names. These are merged into one name.

In [495]:
industry_mapping = {
    "Oil & Gas": "Energy",
    "Oil & Gas Equipment & Services": "Energy",
    "Industrials": "Industrial Goods and Services",
    "Personal Care, Drug and Grocery Stores": "Consumer Goods and Services",
    "Consumer Goods": "Consumer Goods and Services",
    "Consumer Discretionary": "Consumer Goods and Services",
    "Consumer Services": "Consumer Goods and Services",
    "Consumer Staples": "Consumer Goods and Services",
    "Basic Resources": "Basic Materials",
    "Financial Services": "Finance",
    "Financials": "Finance",
    "Healthcare": "Health Care",
}

df["industry"] = df["industry"].replace(industry_mapping)

sorted(df["industry"].unique().tolist())

['Basic Materials',
 'Biotechnology',
 'Consumer Goods and Services',
 'Energy',
 'Finance',
 'Health Care',
 'Industrial Goods and Services',
 'Leisure',
 'Media',
 'Real Estate',
 'Retail',
 'Technology',
 'Telecommunications',
 'Travel and Leisure',
 'Unknown',
 'Utilities']

In [496]:
# ensure that all countries are correctly formatted
df["hq_country"] = df["hq_country"].str.strip().replace({"UK": "United Kingdom"})
sorted(df["hq_country"].unique().tolist())

['America',
 'Belgium',
 'Bermuda',
 'Canada',
 'Cayman Islands',
 'Chile',
 'Cyprus',
 'Denmark',
 'Estonia',
 'Faroe Islands',
 'Finland',
 'France',
 'Germany',
 'Iceland',
 'Jersey',
 'Luxembourg',
 'Malta',
 'Netherlands',
 'Norway',
 'Sweden',
 'Switzerland',
 'United Arab Emirates',
 'United Kingdom',
 'United States',
 'Virgin Islands, British']

### Handle missing values

I first remove any row with a missing 'year'.

In [497]:
df = df.dropna(subset=["year"])
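With missing years removed, the to-do item about changing 'year' from float to datetime becomes straightforward. A minimal sketch on toy values, showing two options (the column names `year_int` and `year_dt` are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({"year": [2016.0, 2017.0, 2020.0]})

# Option 1: nullable integer, which keeps the column easy to group and compare
toy["year_int"] = toy["year"].astype("Int64")

# Option 2: a timestamp for 1 January of each year
toy["year_dt"] = pd.to_datetime(toy["year"].astype(int).astype(str), format="%Y")

print(toy.dtypes)
```

An integer year is often sufficient for annual data; the datetime form is mainly useful if later plots or resampling expect a time axis.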

In [498]:
show_missing_values(df)

Unnamed: 0_level_0,Missing Values,Missing Percentage,'ND' Values,'ND' Percentage
cols,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
comp_name,0,0.0,0,0.0
ticker,0,0.0,0,0.0
year,0,0.0,0,0.0
segment,0,0.0,0,0.0
industry,0,0.0,0,0.0
hq_country,0,0.0,0,0.0
ceo_sust_statem,500,13.14,2,0.05
sales,19,0.5,0,0.0
env_policy,23,0.6,2,0.05
ep_targets,23,0.6,3,0.08


In [499]:
# Replace 'ND' with NaN so that later aggregations treat it as missing
df.replace("ND", np.nan, inplace=True)

# Convert 'Y' and 'N' to dummy variables
df.replace({"Y": 1, "N": 0}, inplace=True)

In [500]:
# show_missing_values(df)

### Transform anomalies

Continuous variables should take values considerably higher than 1, so values of 1 or below are treated as invalid and converted to nulls. Binary columns have the opposite problem: entries should be either 1 or 0, so any other entry (including a missing value) is treated as invalid and converted to 0. Non-numeric entries in the numeric columns are also converted to nulls.

In [501]:
generate_binary_summary(df)

Unnamed: 0,Column Name,Data Type,1s,0s
0,comp_name,object,0,0
1,ticker,object,0,0
2,year,float64,0,0
3,segment,object,0,0
4,industry,object,0,0
5,hq_country,object,0,0
6,ceo_sust_statem,object,2306,994
7,sales,object,0,0
8,env_policy,object,3407,370
9,ep_targets,object,2635,1143


Each column is cast to an appropriate data type, which removes the invalid entries.

In [502]:
string_columns = ["comp_name", "ticker", "segment", "industry", "hq_country"]
df[string_columns] = df[string_columns].astype(str)

bool_columns = [
    "ceo_sust_statem",
    "env_policy",
    "ep_targets",
    "env_impact_red",
    "incr_renew_en",
    "disclosure_raw",
    "resource_target",
    "water_disclose",
    "audit_es_report",
    "su_guidelines",
    "su_aud_disclose",
    "su_eva_disclose",
    "su_env_assess",
]

df[bool_columns] = df[bool_columns].map(lambda x: 1 if x == 1 else 0)

# Convert the numeric columns; non-numeric entries and values ≤ 1 become NaN
numeric_columns = [
    "sales",
    "transport_emis",
    "energy_consump",
    "water_withdraw",
    "ghg_emis",
]

for col in numeric_columns:
    df[col] = pd.to_numeric(df[col], errors="coerce")
    # keep values above 1; everything else becomes NaN
    df[col] = df[col].where(df[col] > 1)

In [503]:
generate_binary_summary(df)

Unnamed: 0,Column Name,Data Type,1s,0s
0,comp_name,object,0,0
1,ticker,object,0,0
2,year,float64,0,0
3,segment,object,0,0
4,industry,object,0,0
5,hq_country,object,0,0
6,ceo_sust_statem,int64,2306,1499
7,sales,float64,0,0
8,env_policy,int64,3407,398
9,ep_targets,int64,2635,1170


In [504]:
show_missing_values(df)

Unnamed: 0_level_0,Missing Values,Missing Percentage,'ND' Values,'ND' Percentage
cols,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
comp_name,0,0.0,0,0.0
ticker,0,0.0,0,0.0
year,0,0.0,0,0.0
segment,0,0.0,0,0.0
industry,0,0.0,0,0.0
hq_country,0,0.0,0,0.0
ceo_sust_statem,0,0.0,0,0.0
sales,99,2.6,0,0.0
env_policy,0,0.0,0,0.0
ep_targets,0,0.0,0,0.0


## Export data

I save the cleaned data frame to file so that it can be loaded in the next notebook.

In [None]:
folder_path = r"C:\Users\james\OneDrive - University of Aberdeen\01 - Turing College\D99 - Capstone Project\Nordic Compass - ESG Performance and CSRD Compliance\datasets"

# df.to_csv(f"{folder_path}/nordic_compass_df_cleaned_01.csv")
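As a more portable alternative to an absolute path, the export could be written relative to the notebook. The sketch below is an assumption about how this might be done, not the author's export code; `export_clean` is a hypothetical helper, and `index=False` is a deliberate choice so that reloading the file does not produce an extra 'Unnamed: 0' column.

```python
from pathlib import Path

import pandas as pd


def export_clean(df, folder, filename="nordic_compass_df_cleaned_01.csv"):
    """Write df to folder/filename without the index and return the path."""
    out_path = Path(folder) / filename
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return out_path
```

In this notebook the call would be something like `export_clean(df, "../datasets")`, matching the relative folder used when loading the raw file.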

## Appendix

In [506]:
# columns_to_clean = [
#     "ceo_sust_statem",
#     "env_policy",
#     "env_impact_red",
#     "water_disclose",
#     "audit_es_report",
#     "su_guidelines",
#     "su_env_assess",
# ]

# for col in columns_to_clean:
#     # Convert 'T', 't', 'Y', 'y' to 1
#     df[col] = df[col].apply(lambda x: 1 if str(x).upper() in ["T", "Y"] else x)

#     # Convert blanks to NaN
#     df[col] = df[col].replace(r"^\s*$", np.nan, regex=True)

#     # Convert any value that's not 0 or 1 to NaN
#     df[col] = df[col].apply(lambda x: x if pd.isna(x) or x in [0, 1] else np.nan)

In [507]:
# filtered_df = df[
#     (df["energy_consump"].isin([1, 0]))
#     | (df["water_withdraw"].isin([1, 0]))
#     | (df["ghg_emis"].isin([1, 0]))
#     | (df["transport_emis"].isin([1, 0]))
# ]

# df[["energy_consump", "water_withdraw", "ghg_emis", "transport_emis"]] = df[
#     ["energy_consump", "water_withdraw", "ghg_emis", "transport_emis"]
# ].replace({1: np.nan, 0: np.nan})