# Belfius Alytics (Part 2)
Inspiration:

-https://github.com/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb

-https://www.youtube.com/watch?v=RIWbalZ7sTo

-https://colab.research.google.com/drive/13FpBqmhYa5Ex4smVhivfEhk2k4S5skwG?usp=sharing#scrollTo=RSdomqrHNCUY

-https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb

Future ideas:

- Convert docx to latex (https://www.vertopal.com/en/download#96a5acdd2afa4e3aaf723be0ea7b71ad).

### Handle imports:

In [4]:
# Move to root directory
import os

notebooks_dir = 'notebooks'
if notebooks_dir in os.path.abspath(os.curdir):
    while not os.path.abspath(os.curdir).endswith('notebooks'):
        print(os.path.abspath(os.curdir))
        os.chdir('..')
    os.chdir('..')  # to get to root

print(os.path.abspath(os.curdir))

C:\Users\MD726YR\PycharmProjects\eyalytics


In [5]:
# Supress SSL verification (EY problem):
import requests

from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Suppress the warning from urllib3.
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)

old_send = requests.Session.send

def new_send(*args, **kwargs):
    kwargs['verify'] = False
    return old_send(*args, **kwargs)

requests.Session.send = new_send

In [6]:
# Import relevant libraries for langchain retrieval:
import openai
import tiktoken

from langchain import OpenAI,  LLMChain, PromptTemplate
from langchain.prompts import StringPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS  # facebook ai similarity search 
from langchain.chains import LLMMathChain
from langchain.tools import BaseTool
from langchain.agents import (
    AgentExecutor, LLMSingleActionAgent, AgentOutputParser, 
    AgentType, initialize_agent, Tool
)
from langchain.callbacks import get_openai_callback
from langchain.schema import AgentAction, AgentFinish

**In case you want to use Chroma instead of FAISS:**

`from langchain.vectorstores import Chroma`
    
Note, to use Chroma you will have to install chromadb. This requires having Microsoft Visual C++ 14.0 installed. To install that simply: 

a. Install Microsoft C++ Build Tools: Visit the link provided in the error message (https://visualstudio.microsoft.com/visual-cpp-build-tools/) and install the Microsoft C++ Build Tools.

b. Ensure the Correct Version: Ensure that you have the required version (14.0 or greater) of the C++ build tools installed.

c. Add to PATH: Ensure the tools are added to your system PATH. Usually, the installer should take care of this. But if the problem persists, you might need to verify and add them manually.

d. Restart Your System: Sometimes, after installing such tools, a system restart might be required for the environment variables (like PATH) to update correctly.

**Checks:**
Check if Visual C++ Build Tools is Installed:
- Press Windows + I to open the Settings app.
- Go to "Apps".
- Now in the "Apps & features" tab, search for "Visual Studio".
- Check if there's an installation called "Microsoft Visual Studio" (it might also be "Visual Studio Build Tools").

Check for the Required Components:
- If you find "Microsoft Visual Studio" or "Visual Studio Build Tools" in the list, click on it and then select "Modify".
- This will bring up the Visual Studio Installer.
- Here, ensure that the "Desktop development with C++" workload is checked. Specifically, make sure "MSVC v142 - VS 2019 C++ x64/x86 build tools" (or a similar option) is selected. This provides the C++ compiler that's needed.

In [7]:
# libraries for URL pdf loading
import time
import docx
import pyautogui
import docx2python

from docx.oxml.table import CT_Tbl
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

In [8]:
# Other libraries:
import re
import pickle 
import difflib
import math
import tqdm

# for progress bars in loops
from uuid import uuid4
from tqdm.auto import tqdm
from typing import List, Dict, Any, Union, Optional

In [9]:
# Get API and ENV keys:
from dotenv import load_dotenv

load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise KeyError(
        "You will need an OPENAI_API_KEY to use the LLM models in this notebook."
    )
openai.api_key = os.getenv("OPENAI_API_KEY")

## Commence Langchain Retrieval Augmentation Tool Development:

**FUTURE IMPROVEMENTS**: 
- Trying docx2python may improve information retrival from a docx.
- Trying lxml may also be a better solution than docx.

In [97]:
# _doc_content = docx2python.docx2python(docx_path)

# for elem in _doc_content.body:
#     print(elem)

### Implementing Future Improvements:

- Instead of separately extracting text and tables, we'll extract all content linearly.
- We'll process the content to ensure it's flattened and in a readable format that can be passed through a recursive splitter and an encoding model later.

In [101]:
FIGURE_THRESHOLD = 0.1
EPSILON = 1e-10
REPEAT_THRESHOLD = 4
MAX_CHAR_COUNT_FOR_FIGURE  = 20
FIGURE_RELATED_CHARS = r"[0123456789.%-]"
COMMON_UNITS = ["kg", "m", "s", "h", "g", "cm", "mm", "l", "ml"]
DOCX_NAMESPACE = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
# DICT KEYS 
METADATA_KEY = "metadata"
COMP_KEY = "company"
REPORT_KEY = "report"
FIGURE_KEY = "potential_figure"
FIGURE_SUMMARY_KEY = "figure_summary"
TABLE_KEY = "table"
TABLE_ROWS_KEY = "table"
TABLE_SUMMARY_KEY = "table_summary"
TEXT_KEY = "text"
DOC_CONTENT = {}


# URL -> file name convertor:
def url2fname(url):
    # Split the URL by '/' and get the last segment
    last_segment = url.split('/')[-1]
    
    # Use regex to remove any suffix after the dot and the dot itself
    cleaned_name = re.sub(r'\..*$', '', last_segment)
    
    return cleaned_name
    
    
# Create URL loader:
def wait_for_file(file_path: str, timeout: int = 60) -> bool:
    """
    Wait for a file to be present at a specified path within a given timeout.
    
    Args:
        file_path (str): Path to the file.
        timeout (int): Maximum waiting time in seconds. Default is 60 seconds.

    Returns:
        bool: True if file is found within the timeout, False otherwise.
    """
    start_time = time.time()

    while time.time() - start_time < timeout:
        if os.path.exists(file_path):
            return True
        time.sleep(1)

    return False


def download_pdf_from_url(url: str, save_path: str) -> str:
    """
    Download a PDF from the specified URL and save it to a local path.
    
    Args:
        url (str): URL of the PDF.
        save_path (str): Local path to save the downloaded PDF.

    Returns:
        str: Path to the saved PDF if successful, None otherwise.
    """
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, 'wb') as file:
            file.write(response.content)
        if os.path.exists(save_path):
            return save_path
    return None


def convert_pdf_to_docx(
    pdf_filename: str, driver_path: str, pdf_folder_path: str, docx_folder_path: str
) -> str:
    """
    Convert a PDF to a DOCX using Adobe's online tool.
    
    Args:
        pdf_filename (str): Filename of the PDF.
        driver_path (str): Path to the geckodriver executable.
        pdf_folder_path (str): Directory where the PDF is located.
        docx_folder_path (str): Directory where the converted DOCX should be saved.

    Returns:
        str: Path to the converted DOCX if successful, None otherwise.
    """
    # WebDriver setup and configurations
    firefox_options = webdriver.FirefoxOptions()
    firefox_options.set_preference("browser.download.folderList", 2)
    firefox_options.set_preference("browser.download.dir", docx_folder_path)
    firefox_options.set_preference("browser.download.useDownloadDir", True)
    firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
    
    service = Service(driver_path)
    driver = webdriver.Firefox(service=service, options=firefox_options)
    wait = WebDriverWait(driver, 180)
    driver.get("https://www.adobe.com/be_en/acrobat/online/pdf-to-word.html")

    # Upload the PDF
    upload_btn = wait.until(EC.element_to_be_clickable((By.ID, "lifecycle-nativebutton")))
    upload_btn.click()

    full_pdf_path = os.path.join(pdf_folder_path, pdf_filename)
    if not os.path.exists(full_pdf_path):
        print(f"File path\n{full_pdf_path}\nis not valid.")
        return None
    
    # Wait for the file selection dialog and input the file path using pyautogui
    time.sleep(5)
    # Use the path in pyautogui
    pyautogui.typewrite(full_pdf_path)

    # Add a slight delay and then press 'enter' multiple times
    time.sleep(2)
    for _ in range(3):
        pyautogui.press('enter')
        time.sleep(0.1)
    time.sleep(10)
    
    retries = 3
    while retries > 0:
        try:
            # Check for cookie notification and click if exists
            try:
                cookie_reject_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-reject-all-handler")))
                cookie_reject_btn.click()
            except TimeoutException:  # This exception is more specific to WebDriverWait than a general Exception.
                print("Cookie settings notification not found or failed to click.")

            # Wait and click the download button
            download_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.Download__downloadButton___2qFEa')))
            download_btn.click()
            break  # If successful, break out of the loop
        except TimeoutException:
            retries -= 1
            if retries == 0:
                raise  # Re-raise the exception if all retries are exhausted
            print(f"Attempt {3 - retries} failed. Retrying...")
            time.sleep(10)  # Wait for 20 seconds before retrying
    time.sleep(10)
    driver.quit()

    expected_docx_filename = pdf_filename.replace('.pdf', '.docx')
    expected_docx_filepath = os.path.join(docx_folder_path, expected_docx_filename)

    return expected_docx_filepath if wait_for_file(expected_docx_filepath, 55) else None


def convert_url_pdf_to_docx(
    pdf_url: str, 
    driver_path: str = "./drivers/geckodriver.exe", 
    pdf_folder_path: str = None, 
    docx_folder_path: str = None
) -> str:
    """
    Download a PDF from a URL, convert it to DOCX, and save it locally.
    
    Args:
        pdf_url (str): URL of the PDF.
        driver_path (str): Path to the geckodriver executable. Default is './drivers/geckodriver.exe'.
        pdf_folder_path (str): Directory to save the downloaded PDF. Default is '../data/pdf_db'.
        docx_folder_path (str): Directory to save the converted DOCX. Default is '../data/docx_db'.

    Returns:
        str: Path to the converted DOCX if successful, None otherwise.
    """
    cwd = os.getcwd()
    pdf_folder_path = pdf_folder_path or os.path.join(cwd, "data", "pdf_db")
    docx_folder_path = docx_folder_path or os.path.join(cwd, "data", "docx_db")

    os.makedirs(pdf_folder_path, exist_ok=True)
    os.makedirs(docx_folder_path, exist_ok=True)

    pdf_filename = pdf_url.split('/')[-1]
    pdf_save_path = os.path.join(pdf_folder_path, pdf_filename)

    if download_pdf_from_url(pdf_url, pdf_save_path):
        return convert_pdf_to_docx(pdf_filename, driver_path, pdf_folder_path, docx_folder_path)
    return None


def extract_footnotes_from_para(para, next_para=None):
    """Extract footnote references and actual footnotes from a paragraph."""
    footnotes = []
    
    footnote_refs = para._element.findall('.//w:footnoteReference', namespaces=DOCX_NAMESPACE)

    for ref in footnote_refs:
        footnote_id = ref.get("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id")
        footnote = para.part.footnotes_part.footnote_dict[footnote_id]
        footnotes.append(footnote.text)

    # Check in the next paragraph for footnotes if provided
    if next_para:
        next_footnote_refs = next_para._element.findall('.//w:footnoteReference', namespaces=DOCX_NAMESPACE)
        for ref in next_footnote_refs:
            footnote_id = ref.get("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id")
            footnote = next_para.part.footnotes_part.footnote_dict[footnote_id]
            footnotes.append(footnote.text)

    return footnotes


def process_footnotes(text, footnotes):
    """Process and embed footnotes into the text."""
    
    # Existing replacement for footnotes within brackets
    for idx, footnote in enumerate(footnotes, 1):
        text = re.sub(r"\[{}\]".format(idx), "[{}]".format(footnote), text)

    # New addition: replace footnotes appearing directly after words or at the end of sentences
    for idx, footnote in enumerate(footnotes, 1):
        # This regex will look for a number that doesn't have another number directly before it (to differentiate from normal numbers within the text)
        pattern = r'(?<![0-9])' + str(idx) + r'(?![0-9])'
        replacement = "[{}]".format(footnote)
        text = re.sub(pattern, replacement, text)

    return text


def contains_unit(text):
    """Check if text contains a common unit following a number."""
    for unit in COMMON_UNITS:
        # Check for patterns like '123 kg', '123kg', '0.5 m', '0.5m', etc.
        if re.search(r'\d\s?' + re.escape(unit) + r'(?![a-zA-Z])', text):
            return True
    return False


def is_potential_figure_data(text):
    if text is None:
        return False

    text_count = len(text) - text.count(' ')
    figure_char_count = len(re.findall(FIGURE_RELATED_CHARS, text))
    char_count = text_count - figure_char_count

    # New: Check for units
    contains_common_units = contains_unit(text.lower())  # Convert text to lowercase for this check
    
    # New: Check for percentage patterns
    contains_percentage = "%" in text and any(char.isdigit() for char in text)

    if (text_count == 0) or text.endswith(('.', ':', ';', ',')) or (char_count > MAX_CHAR_COUNT_FOR_FIGURE):
        return False

    if (char_count == 0) or (figure_char_count / (char_count + EPSILON) > FIGURE_THRESHOLD) or contains_common_units or contains_percentage:
        return True
    
    return False


def repeated_artifact_check(line, artifact_dict):
    """Check if a line is a repeated artifact and update its count."""
    if line in artifact_dict:
        artifact_dict[line] += 1
        if artifact_dict[line] > REPEAT_THRESHOLD:
            return True  # It's a repeated artifact
    else:
        artifact_dict[line] = 1
    return False

    
def read_docx(
    file_path: str, 
    comp_name: str = None, 
    report_name: str = None,
) -> dict:
    
    def _is_empty(text):
        if len(text) == 0:
            return True
        return False
    
    doc = docx.Document(file_path)
    
    result = {
        METADATA_KEY: {
            'title': doc.core_properties.title,
            'author': doc.core_properties.author,
            'created': doc.core_properties.created,
            COMP_KEY: comp_name,
            REPORT_KEY: report_name,
        },
        TEXT_KEY: [],
        TABLE_KEY: [],
        FIGURE_KEY: []
    }
    
    figure_data_group = {'title': None, 'data': []}
    current_is_figure_data = False
    previous_text = None
    
    artifact_dict = {}
    
    for current_elem, next_elem in tqdm(zip(doc.element.body, doc.element.body[1:] + [None])):
        
        # Paragraph
        if current_elem.tag.endswith('p'):
            
            current_para = docx.text.paragraph.Paragraph(current_elem, None)
            next_para = docx.text.paragraph.Paragraph(next_elem, None)
            
            processed_text = current_para.text.strip()
            # Ignore empty lines or repeated lines 
            if _is_empty(processed_text) or repeated_artifact_check(processed_text, artifact_dict):
                current_para = None
                continue
            try: 
                next_text = next_para.text
            except AttributeError: 
                next_text = None

            # Process footnotes
            footnotes = extract_footnotes_from_para(current_para, next_para)
            processed_text = process_footnotes(processed_text, footnotes)
            
            # Identify if the current line is potential figure data
            previous_was_figure_data = current_is_figure_data  # Move the window forward
            current_is_figure_data = is_potential_figure_data(processed_text)
            next_is_figure_data = is_potential_figure_data(next_text)
            
            if current_is_figure_data:
                
                # If previous line was also figure data, they belong to the same figure
                if previous_was_figure_data:
                    figure_data_group['data'].append(processed_text)
                else:
                    # If a new figure starts, save the previous figure (if there was any)
                    if figure_data_group['data']:
                        result[FIGURE_KEY].append(figure_data_group)
                        figure_data_group = {'title': None, 'data': []}

                    # Assign the previous line as the title for the current figure
                    figure_data_group['title'] = previous_text
                    figure_data_group['data'].append(processed_text)
                    
            elif not next_is_figure_data:  # neither next or current text is figure
                # Not a figure, add to text
                result[TEXT_KEY].append(processed_text)
            else:  # next text is figure, meaning that current text will be stored as title.
                pass

            # Handles case when text potentially interupts figure.
            # Text is again stored in figure_data_group['data'].
            if previous_was_figure_data and next_is_figure_data:
                current_is_figure_data = True
                
            # only change previous text when current para is not empty or repeated string.
            previous_text = processed_text
                
        # Table
        elif current_elem.tag.endswith('tbl'):
            table_index = [tbl._element for tbl in doc.tables].index(current_elem)
            table = doc.tables[table_index]

            headers = [cell.text.strip() for cell in table.rows[0].cells]

            rows = []
            for row in table.rows[1:]:
                row_data = {headers[j]: cell.text.strip() for j, cell in enumerate(row.cells)}
                rows.append(row_data)

            result[TABLE_KEY].append({
                'title': previous_text,
                'col_headers': headers,
                TABLE_ROWS_KEY: rows
            })
        else:
            print(f"Ignoring {current_elem.tag}.")
            
    return result

  
# Testing the functions

# # COCA COLA:
# comp_name = 'Coca-Cola'
# report_name = 'Sustainability Report (2022)'
# pdf_url = "https://www.coca-colacompany.com/content/dam/company/us/en/reports/coca-cola-business-and-sustainability-report-2022.pdf"
# docx_fname = fr"coca-cola-business-and-sustainability-report-2022.docx"

# # Colruyt Group
# comp_name = "Colruyt_Group"
# year = '2022'
# report_name = "Annual report with sustainability (2022)"
# pdf_url = "https://www.colruytgroup.com/content/dam/colruytgroup/investeren/jaarverslag-met-duurzaamheidsrapportering/pdf/en/annual-report-with-sustainability-reporting-2022-2023.pdf"
# docx_fname = fr"annual-report-with-sustainability-reporting-2022-2023.docx"

# DEME Group
comp_name = "DEME_Group"
year = '2022'
report_name = "Annual report with sustainability (2022)"
pdf_url = "https://www.deme-group.com/sites/default/files/2023-03/DEME_Annual_Report2022.pdf"
docx_fname = fr"DEME_Annual_Report2022.docx"

is_scrape = False  # Set to False if you want to avoid web scraping. Note that in this case a preprapered docx is used. 
fname = url2fname(pdf_url)  # used to save vectordb
docx_dir = r"C:\Users\MD726YR\PycharmProjects\eyalytics\data\docx_db"

try:
    if is_scrape:
        docx_path = convert_url_pdf_to_docx(pdf_url)
    else:
        docx_path = fr"{docx_dir}\{docx_fname}"

    if not docx_path:
        print("Failed to convert PDF to DOCX. Exiting...")
        exit(1)

    doc_contents = read_docx(
        docx_path, comp_name=comp_name, report_name=report_name
    )

    print(
        f"SUMMARY:\n{len(doc_contents[TEXT_KEY])} paragraphs, "
        f"{len(doc_contents[TABLE_KEY])} tables, and {len(doc_contents[FIGURE_KEY])} figures "
        f"were extracted."
    )
    
    # Example usage:
    print("\n\nText Extracted:")
    for para in doc_contents[TEXT_KEY]:
        print(para)
    print('---'*50)

    print("\n\nTables Extracted:")
    for table in doc_contents[TABLE_KEY]:
        print(table)

    print("\n\nFigures Extracted:")
    for figure in doc_contents[FIGURE_KEY]:
        print(figure)
    print('---'*50)
    
except Exception as e:
    print(f"An error occurred: {e}")

0it [00:00, ?it/s]

Ignoring {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt.
Ignoring {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sectPr.
SUMMARY:
3359 paragraphs, 118 tables, and 238 figures were extracted.


Text Extracted:
ANNUAL REPORT
INTRODUCTION
STRATEGY
DEME’s two-dimensional strategy for
sustainable performance	40
SEGMENTS
CORPORATE GOVERNANCE AND RISK
SUSTAINABILITY & QHSE
Consolidated financial statements	182
APPENDIX
Certificates, awards & ratings	173
All definitions for alternative performance measures (APMs) or acronyms used in this report are available in the Glossary (see the Appendix chapter).
CHAPTER
Facing new challenges and achieving our goals are only possible in a safe and healthy working environment.
This is my commitment to all our employees worldwide.
NATALIA DE SOUZA SECCO  |  QHSE-S ENGINEER
Letter
of the CEO & Chairman
Undoubtedly, we are living in a transformative century – the drive for sustainability, the rise in digitalisation and the ne

In [102]:
# ATTEMPTS TO LOAD BODY IN STRUCTURED WAY:

# def flatten_content(content):
#     """Recursive function to flatten nested content."""
#     if isinstance(content, str):
#         return content
#     elif isinstance(content, list):
#         return " ".join(flatten_content(item) for item in content)
#     return ""
#
#
# def read_docx_content(docx_path):
#     _doc_content = docx2python.docx2python(docx_path)
#    
#     # Flatten and join all the content
#     flattened_content = flatten_content(_doc_content.body)
#    
#     return flattened_content
#
#
# def process_content(content):
#     """Recursive function to process and differentiate between text and tables."""
#     processed = []
#     if isinstance(content, str):
#         return content.strip()
#     elif isinstance(content, list):
#         if all(isinstance(item, list) for item in content):
#             # This is likely a table
#             table = []
#             for row in content:
#                 table_row = []
#                 for cell in row:
#                     table_row.append(process_content(cell))
#                 table.append(table_row)
#             return {"table": table}
#         else:
#             # Process each item in the list
#             for item in content:
#                 result = process_content(item)
#                 if result:
#                     processed.append(result)
#             return " ".join(processed) if isinstance(processed[0], str) else processed
#     return ""
# 
# 
# def read_docx_content(docx_path):
#     _doc_content = docx2python.docx2python(docx_path)
#     processed_content = process_content(_doc_content.body)
#     return processed_content
# 
# 
def process_content(content):
    """Recursive function to process and differentiate between text and tables."""
    if isinstance(content, str):
        return content.strip()
    elif isinstance(content, list):
        processed_items = [process_content(item) for item in content]
        
        # Check if it's a table by inspecting the depth of nested lists
        if any(isinstance(item, list) for item in processed_items) and \
           all(isinstance(row, list) for row in processed_items):
            # Flatten the table rows into a single table dictionary
            all_rows = []
            for item in processed_items:
                all_rows.extend(item)  # Add rows from each table section
            return [{"table": all_rows}]
        
        return processed_items
    return content


def read_docx_content(docx_path):
    _doc_content = docx2python.docx2python(docx_path)
    processed_content = process_content(_doc_content.body)
    
    # Flatten the processed content to have a linear structure
    flattened_content = []
    for item in processed_content:
        if isinstance(item, list):
            flattened_content.extend(item)
        else:
            flattened_content.append(item)
    
    return flattened_content

# ___________________________________________________

# Use the library function
raw_content = read_docx_content(docx_path)
print(f'NUMBER OF PARA: {len(raw_content)}\n\n')
for para in raw_content:
    print(f"{para}\n")

NUMBER OF PARA: 1


{'table': [{'table': [{'table': []}]}, {'table': [{'table': ['20']}]}, {'table': [{'table': ['20']}]}, {'table': [{'table': ['22']}]}, {'table': [{'table': ['22']}]}, {'table': [{'table': ['----media/image1.png----', '', '', '----media/image2.png--------media/image2.png----ANNUAL REPORT', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']}]}, {'table': [{'table': ['TABLE OF CONTENTS']}]}, {'table': [{'table': ['TABLE OF CONTENTS']}]}, {'table': [{'table': ['----media/image3.jpeg----CHAPTER 01', 'INTRODUCTION', 'Letter of the CEO & Chairman\t6', 'Company profile\t10', 'Financial & non-financial key figures\t12', 'Highlights 2022\t16', 'DEME people make the difference\t21', 'DEME core values\t22', 'DEME fleet\t24', 'Group performance 2022\t28', '', 'CHAPTER 02', 'STRATEGY', 'Relevant market drivers\t34', 'DEME’s 2030 strategy\t36', 'DEME’s two-dimensional strategy for', 'sustainable performan

**To avoid missing text, something that may arise from using a specific load procedure, we merge the textual parts obtained using our two load procedures. This should safeguard against information loss while allowing us to use varying formatting standards for the text (some more readable than others).**

Note that we do this after summarising tables and figures below!

### Obtain short summaries for tables.
To facilitate the encoding of tables, we will ask an LLM to generate a textual summary of a table's contents. The idea being that this summary will yield better vector encodings than if we simply tried to encode the table. It's an added cost but one that will hopefully yield better context for our LLMs. Note that the table with summaries will be fed to the engine if the tabular chunk is selected as context. The key when summarising is obtaining short summaries!

In [104]:
TABLE_MIN_NUMERIC_PRC = 10  # %

def count_numeric_characters(list_of_elements):
    """
    Counts the number of numeric characters in a list of elements, excluding spaces.

    Args:
    list_of_elements: A list of elements, which can be strings or dictionaries.

    Returns:
    The total number of numeric characters in the list of elements, excluding spaces.
    """

    num_numeric_chars = 0
    for element in list_of_elements:
        if isinstance(element, dict):
          # Iterate over the values of the dictionary.
          for value in element.values():
            num_numeric_chars += len(re.findall(r'\d', value.replace(' ', '')))
        else:
          # The element is a string.
          num_numeric_chars += len(re.findall(r'\d', element.replace(' ', '')))
    return num_numeric_chars


def get_percentage_numeric_characters(list_of_elements):
    """
    Calculates the percentage of numeric characters as proportion
    of all the characters in a list of elements, excluding spaces.

    Args:
    list_of_elements: A list of elements, which can be strings or dictionaries.

    Returns:
    The percentage of numeric characters in the list of elements,
    as a float, excluding spaces.
    """

    num_numeric_chars = count_numeric_characters(list_of_elements)
    total_num_chars = 0
    for element in list_of_elements:
        if isinstance(element, dict):
          # Iterate over the values of the dictionary.
          for value in element.values():
            total_num_chars += len(value.replace(' ', ''))
        else:
          # The element is a string.
          total_num_chars += len(element.replace(' ', ''))
            
    if total_num_chars == 0:
        return 100
    
    return 100 * (num_numeric_chars / total_num_chars)


doc_contents[TABLE_SUMMARY_KEY] = []

In [105]:
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a table summarizer."
    }, {
        "role": "user", 
        "content": None,
    }, 
]

MAX_RETRIES = 5
WAIT_SECONDS = 10

for i, table in tqdm(enumerate(doc_contents[TABLE_KEY], start=1)):
    
    prc_chars =  get_percentage_numeric_characters(table[TABLE_ROWS_KEY])
    if prc_chars < TABLE_MIN_NUMERIC_PRC:
        # Table is descriptive and hence summary not required
        print(f"RAW TABLE (NO SUMMARY NEEDED): {table}\n\n\n")
        doc_contents[TABLE_SUMMARY_KEY].append(
            f"<<T{i}>>: {table}"
        )
        continue
            
    parameters['messages'][-1]['content'] = f"""
        Instructions; As a financial ESG analyst:

        -Assess if the information in the table shared below, from {comp_name}'s {report_name}, is mostly quantitative or qualitative.
        -If the table is quantitative, describe it in no more than 150 words.
        -Else if the table is qualitative, output it without summarization. You may make minor modifications to improve the 
            readability and interpretability of the table.
        -Do not summarize qualitative tables which describe processes, policies, etc.
        
        Table to evaluate:
        {table}
    """
    retries = 0
    success = False
    while retries < MAX_RETRIES and not success:
        try:
            response = openai.ChatCompletion.create(
                **parameters
            )
            # Add <<T{i}>>: in front to facilitate later lookup
            doc_contents[TABLE_SUMMARY_KEY].append(
                f"<<T{i}>>: {response['choices'][0]['message']['content']}"
            )
            print(f"RAW TABLE: {table}\n\nSummary;\n {doc_contents['table_summary'][-1]}\n\n\n")
            success = True
        except (
            openai.error.Timeout, openai.error.APIConnectionError, 
            openai.error.AuthenticationError, openai.error.ServiceUnavailableError,
        ) as e:
            retries += 1
            print(f"Error encountered. Retrying {retries}/{MAX_RETRIES}")
            time.sleep(WAIT_SECONDS)
    
    if retries == MAX_RETRIES:
        print(f"Failed to get a summary for table {i} after {MAX_RETRIES} retries.")

0it [00:00, ?it/s]

RAW TABLE (NO SUMMARY NEEDED): {'title': 'From a geographical perspective the Europe and Africa region showed a decline in 2022 compared to 2021, largely offset by strong wins in the Asia and America region. Europe continues to account for more than half of the orderbook .', 'col_headers': ['million euro compared to', 'our growth ambitions, we are further', 'While Dredging & Infra remained', '', '', '', '', ''], 'table': [{'million euro compared to': '115 million euro a year ago', 'our growth ambitions, we are further': 'expanding our advanced fleet.”', 'While Dredging & Infra remained': 'the main EBITDA contributor in the', '': '+16%'}]}



RAW TABLE: {'title': 'Environmental\t313.4\t255.3\t190.1\t+23%', 'col_headers': ['Geographical breakdown\n(in % of total)', '2022', '2021', '2020 FY22 VS FY21\n(in nominal value)', '2020 FY22 VS FY21\n(in nominal value)'], 'table': [{'Geographical breakdown\n(in % of total)': 'Europe', '2022': '55%', '2021': '62%', '2020 FY22 VS FY21\n(in nominal v

RAW TABLE: {'title': 'Chairman of the Board of Directors\tLuc Bertrand\t50,000\t10,000', 'col_headers': ['Non-Executive Director', 'John-Eric Bertrand', '25,000', '10,000'], 'table': [{'Non-Executive Director': 'Non-Executive Director', 'John-Eric Bertrand': 'Tom Bamelis', '25,000': '25,000', '10,000': '10,000'}, {'Non-Executive Director': 'Non-Executive Director', 'John-Eric Bertrand': 'Piet Dejonghe', '25,000': '25,000', '10,000': '10,000'}, {'Non-Executive Director': 'Non-Executive Director', 'John-Eric Bertrand': 'Koen Janssen', '25,000': '25,000', '10,000': '10,000'}, {'Non-Executive Director': 'Non-Executive Director', 'John-Eric Bertrand': 'Christian Labeyrie', '25,000': '25,000', '10,000': '10,000'}, {'Non-Executive Director': 'Executive Director', 'John-Eric Bertrand': 'Luc Vandenbulcke', '25,000': '-', '10,000': '-'}, {'Non-Executive Director': 'Independent Director', 'John-Eric Bertrand': 'Leen Geirnaerdt', '25,000': '25,000', '10,000': '7,500'}, {'Non-Executive Director': '

RAW TABLE: {'title': "every day ('#everydayforward').", 'col_headers': ['Rating scale', 'Rating scale', 'Rating score 2022', 'Rating score 2021', 'Rating score 2020', 'Sector ranking 2022', 'Sector average rating 2022', 'DEME\nTrend vs 2021'], 'table': [{'Rating scale': '(D<A)', 'Rating score 2022': 'B', 'Rating score 2021': 'C', 'Rating score 2020': '-', 'Sector ranking 2022': '-\tC', 'Sector average rating 2022': '-\tC', 'DEME\nTrend vs 2021': 'Positive'}, {'Rating scale': '(0<100)', 'Rating score 2022': 'Gold (71)', 'Rating score 2021': 'Silver (63)', 'Rating score 2020': '-', 'Sector ranking 2022': '-', 'Sector average rating 2022': '45', 'DEME\nTrend vs 2021': 'Positive'}, {'Rating scale': '(100<0)', 'Rating score 2022': '26,1*** Medium risk', 'Rating score 2021': '27,8*** Medium risk', 'Rating score 2020': '-', 'Sector ranking 2022': '21st Construction & engineering"', 'Sector average rating 2022': '-', 'DEME\nTrend vs 2021': 'Positive'}, {'Rating scale': '(CCC<AAA)', 'Rating sco

RAW TABLE: {'title': 'CONSOLIDATED STATEMENT OF CHANGES IN EQUITY', 'col_headers': ['2022\n(in thousands of EUR)', 'Share capital and share premium', 'Hedging reserve', 'Remeasurement on retirement obligations', 'Retained earnings and other reserves', 'Cumulative translation adjustment', "Shareholders'\nequity", 'Non- controlling interests', 'Group equity'], 'table': [{'2022\n(in thousands of EUR)': 'Ending, December 31, 2021', 'Share capital and share premium': '36,755', 'Hedging reserve': '-25,872', 'Remeasurement on retirement obligations': '-41,283', 'Retained earnings and other reserves': '1,618,824', 'Cumulative translation adjustment': '-8,881', "Shareholders'\nequity": '1,579,543', 'Non- controlling interests': '19,696', 'Group equity': '1,599,239'}, {'2022\n(in thousands of EUR)': 'Impact IFRS amendments', 'Share capital and share premium': '-', 'Hedging reserve': '-', 'Remeasurement on retirement obligations': '-', 'Retained earnings and other reserves': '-', 'Cumulative tran

RAW TABLE: {'title': 'DEME Environmental NV, the parent company of the Environmental segment, is owned for 25.1% by a third party. In the Dredging & Infra segment there are some non-controlling interests in the marine aggregate and maritime services business and in the dredging business only some minor interests are hold by third parties. In the Concessions segment the same external partner holds interests in two subsidiaries of the Group. In the Offshore segment there are no non-controlling interests at December 31, 2022. Reference is made to the consolidated statement of comprehensive income and the consolidated statement of changes in equity for more information about the non-controlling interests.', 'col_headers': ['Name\tCountry', 'Name\tCountry', '2022\n% of Share- holding', '2021\n% of Share- holding', 'Main Operational Segment 2022'], 'table': [{'Name\tCountry': 'Belgium', '2022\n% of Share- holding': '100%', '2021\n% of Share- holding': '100%', 'Main Operational Segment 2022':

RAW TABLE: {'title': 'CONSOLIDATED STATEMENT OF CASH FLOWS COMPARATIVE ANALYSIS', 'col_headers': ['Notes\t2022\t2021\tDELTA', 'Notes\t2022\t2021\tDELTA', 'Notes\t2022\t2021\tDELTA', 'Notes\t2022\t2021\tDELTA', 'Notes\t2022\t2021\tDELTA'], 'table': [{'Notes\t2022\t2021\tDELTA': '-93,305'}, {'Notes\t2022\t2021\tDELTA': '-28,880'}, {'Notes\t2022\t2021\tDELTA': '45,675'}, {'Notes\t2022\t2021\tDELTA': '16,795'}, {'Notes\t2022\t2021\tDELTA': '-214,195'}, {'Notes\t2022\t2021\tDELTA': '-8,247'}, {'Notes\t2022\t2021\tDELTA': '-222,442'}, {'Notes\t2022\t2021\tDELTA': '413,656'}, {'Notes\t2022\t2021\tDELTA': '-101,613'}, {'Notes\t2022\t2021\tDELTA': '-20,422'}, {'Notes\t2022\t2021\tDELTA': '-504'}, {'Notes\t2022\t2021\tDELTA': '291,117'}, {'Notes\t2022\t2021\tDELTA': '85,470'}, {'Notes\t2022\t2021\tDELTA': '1,464'}, {'Notes\t2022\t2021\tDELTA': '-6,371'}]}

Summary;
 <<T24>>: The table provided is a quantitative table titled "CONSOLIDATED STATEMENT OF CASH FLOWS COMPARATIVE ANALYSIS" from DEME_Gr

RAW TABLE: {'title': 'Further on, experience shows that once an agreement has been reached, cancellations or substantial reductions in the scope or size of contracts are quite rare, but they do occur, certainly in markets that are under severe pressure.', 'col_headers': ['Orderbook by segment (in thousands of EUR)', '2022', '2021'], 'table': [{'Orderbook by segment (in thousands of EUR)': 'Offshore Energy', '2022': '3,260,909', '2021': '2,816,564'}, {'Orderbook by segment (in thousands of EUR)': 'Dredging & Infra', '2022': '2,615,713', '2021': '2,833,296'}, {'Orderbook by segment (in thousands of EUR)': 'Environmental', '2022': '313,378', '2021': '255,330'}, {'Orderbook by segment (in thousands of EUR)': 'Concessions', '2022': '-', '2021': '-'}, {'Orderbook by segment (in thousands of EUR)': 'Total orderbook', '2022': '6,190,000', '2021': '5,905,190'}]}

Summary;
 <<T28>>: The table provided is quantitative and represents the orderbook by segment for DEME_Group in thousands of EUR for 

RAW TABLE: {'title': 'OTHER OPERATING EXPENSES', 'col_headers': ['(in thousands of EUR)\t2022\t2021', '(in thousands of EUR)\t2022\t2021', '(in thousands of EUR)\t2022\t2021'], 'table': [{'(in thousands of EUR)\t2022\t2021': '-'}, {'(in thousands of EUR)\t2022\t2021': '10'}, {'(in thousands of EUR)\t2022\t2021': '3,185'}, {'(in thousands of EUR)\t2022\t2021': '1,146'}, {'(in thousands of EUR)\t2022\t2021': '13,013'}, {'(in thousands of EUR)\t2022\t2021': '29,591'}, {'(in thousands of EUR)\t2022\t2021': '46,945'}]}

Summary;
 <<T33>>: The table provided is quantitative and represents the "Other Operating Expenses" for DEME_Group's Annual report with sustainability in 2022 and 2021. The table consists of three columns: "(in thousands of EUR)", "2022", and "2021". The rows represent different categories of expenses.

In 2022, the expenses are as follows:
- "-" (not specified)
- 10,000 EUR
- 3,185,000 EUR
- 1,146,000 EUR
- 13,013,000 EUR
- 29,591,000 EUR
- 46,945,000 EUR

In 2021, the expe

RAW TABLE: {'title': 'An amount of 14.4 million EUR out of the 24.3 million EUR total net book value of intangibles at the end of the year 2022 is related to the purchase price allocation (PPA)-exercise of the SPT Offshore group (at the end of 2020). These intangibles are amortised over the economic lifetime of 10 years.', 'col_headers': ['Net book value at the end of prior year', '3', '24,932', '-', '24,935'], 'table': [{'Net book value at the end of prior year': '', '3': '', '24,932': '', '-': '', '24,935': ''}, {'Net book value at the end of prior year': 'Net book value at the end of the year', '3': '-', '24,932': '22,308', '-': '3,205', '24,935': '25,513'}]}

Summary;
 <<T37>>: The table provided contains quantitative information. It shows the net book value of intangibles for DEME_Group at the end of the prior year and the end of the current year. The net book value at the end of the prior year is not specified in the table. However, at the end of the current year, the net book va

RAW TABLE: {'title': 'NOTE 6 – PROPERTY, PLANT AND EQUIPMENT', 'col_headers': ['Cumulative depreciation and impairment at January 1, 2022', 'Cumulative depreciation and impairment at January 1, 2022', '49,098', '2,385,178', '16,198', '1,732', '-', '2,452,206'], 'table': [{'Cumulative depreciation and impairment at January 1, 2022': 'Depreciation charge of the year', '49,098': '4,483', '2,385,178': '278,818', '16,198': '2,425', '1,732': '422', '-': '-', '2,452,206': '286,147'}, {'Cumulative depreciation and impairment at January 1, 2022': 'Written down after sales and disposals', '49,098': '-17', '2,385,178': '-94,353', '16,198': '-1,591', '1,732': '-996', '-': '-', '2,452,206': '-96,956'}, {'Cumulative depreciation and impairment at January 1, 2022': "Transfer to 'Assets held for Sale'", '49,098': '-', '2,385,178': '-2,316', '16,198': '-', '1,732': '-', '-': '-', '2,452,206': '-2,316'}, {'Cumulative depreciation and impairment at January 1, 2022': 'Transfers from one heading to another

RAW TABLE: {'title': 'NOTE 7 – RIGHT-OF-USE ASSETS', 'col_headers': ['2022\n(in thousands of EUR)', '2022\n(in thousands of EUR)', 'Land and buildings', 'Floating and other construction equipment', 'Furniture and\nvehicles', 'Total Right-of-use\nassets'], 'table': [{'2022\n(in thousands of EUR)': 'Acquisition cost at January 1, 2022', 'Land and buildings': '90,204', 'Floating and other construction equipment': '10,376', 'Furniture and\nvehicles': '34,143', 'Total Right-of-use\nassets': '134,722'}, {'2022\n(in thousands of EUR)': 'Additions, including fixed assets, own production', 'Land and buildings': '19,843', 'Floating and other construction equipment': '13,958', 'Furniture and\nvehicles': '8,252', 'Total Right-of-use\nassets': '42,052'}, {'2022\n(in thousands of EUR)': 'Sales and disposals', 'Land and buildings': '-10,332', 'Floating and other construction equipment': '-3,309', 'Furniture and\nvehicles': '-4,308', 'Total Right-of-use\nassets': '-17,948'}, {'2022\n(in thousands of E

RAW TABLE: {'title': 'The reconciliation of the total net assets to the carrying amount of the Group’s interests in the associates and joint ventures is as follows.', 'col_headers': ['Reconciliation to the carrying amount of associates\n2022\n(in thousands of EUR)', 'Offshore Energy', 'Dredging &\nInfra', 'Environmental', 'Concessions', 'Total'], 'table': [{'Reconciliation to the carrying amount of associates\n2022\n(in thousands of EUR)': 'Net assets of associates: 100% standalone amounts', 'Offshore Energy': '1,506', 'Dredging &\nInfra': '10,949', 'Environmental': '16,224', 'Concessions': '1,031,499', 'Total': '1,060,178'}, {'Reconciliation to the carrying amount of associates\n2022\n(in thousands of EUR)': "Proportion of the Group's ownership interests in the standalone amounts", 'Offshore Energy': '27', 'Dredging &\nInfra': '5,471', 'Environmental': '2,938', 'Concessions': '162,679', 'Total': '171,115'}, {'Reconciliation to the carrying amount of associates\n2022\n(in thousands of 

RAW TABLE: {'title': 'AND RECONCILIATION TO THE CARRYING AMOUNT', 'col_headers': ['Summarised financial information of joint ventures\n2021\n(in thousands of EUR) (100% standalone amounts)', 'Offshore Energy', 'Dredging &\nInfra', 'Environmental', 'Concessions', 'Total'], 'table': [{'Summarised financial information of joint ventures\n2021\n(in thousands of EUR) (100% standalone amounts)': 'Financial position', 'Offshore Energy': 'Financial position', 'Dredging &\nInfra': 'Financial position', 'Environmental': 'Financial position', 'Concessions': 'Financial position', 'Total': 'Financial position'}, {'Summarised financial information of joint ventures\n2021\n(in thousands of EUR) (100% standalone amounts)': 'Non-current assets', 'Offshore Energy': '156,776', 'Dredging &\nInfra': '19,266', 'Environmental': '5,266', 'Concessions': '-', 'Total': '181,308'}, {'Summarised financial information of joint ventures\n2021\n(in thousands of EUR) (100% standalone amounts)': 'Current assets', 'Offs

RAW TABLE: {'title': 'OTHER NON-CURRENT ASSETS', 'col_headers': ['(in thousands of EUR)', '(in thousands of EUR)', '2022', '2021'], 'table': [{'(in thousands of EUR)': 'Balance at January 1', '2022': '4,239', '2021': '3,221'}, {'(in thousands of EUR)': 'Additions', '2022': '7,963', '2021': '1,018'}, {'(in thousands of EUR)': 'Disposals (-)', '2022': '-310', '2021': '-'}, {'(in thousands of EUR)': 'Transfer (to) from other items', '2022': '-', '2021': '-'}, {'(in thousands of EUR)': 'Other movements', '2022': '-', '2021': '-'}, {'(in thousands of EUR)': 'Translation differences', '2022': '-', '2021': '-'}, {'(in thousands of EUR)': 'Balance at December 31', '2022': '11,892', '2021': '4,239'}]}

Summary;
 <<T53>>: The table provided is quantitative and represents the "Other Non-Current Assets" section from DEME_Group's Annual Report with sustainability for the years 2022 and 2021. The table consists of four columns: "(in thousands of EUR)", "(in thousands of EUR)", "2022", and "2021". 



RAW TABLE: {'title': 'Deferred tax assets and liabilities regarding financial derivatives only concern fully consolidated companies, see also the section regarding other comprehensive income.', 'col_headers': ['2022\n(in thousands of EUR) Deferred tax liabilities related to', 'Tangible fixed assets', 'Employee benefits', 'Financial derivatives', 'Reversal statutory provision', 'Long term tax accruals (UTP)', 'Other timing differences', 'Netting', 'Total'], 'table': [{'2022\n(in thousands of EUR) Deferred tax liabilities related to': 'Balance at January 1', 'Tangible fixed assets': '54,217', 'Employee benefits': '-', 'Financial derivatives': '65', 'Reversal statutory provision': '7,577', 'Long term tax accruals (UTP)': '29,627', 'Other timing differences': '9,126', 'Netting': '-26,399', 'Total': '74,213'}, {'2022\n(in thousands of EUR) Deferred tax liabilities related to': 'Recognised in income statement', 'Tangible fixed assets': '-19,975', 'Employee benefits': '-', 'Financial derivati

RAW TABLE: {'title': '(*) The tax netting item reflects the netting of deferred tax assets and liabilities per entity.', 'col_headers': ['2021\n(in thousands of EUR) Deferred tax liabilities related to', 'Tangible fixed assets', 'Employee benefits', 'Financial derivatives', 'Reversal statutory provision', 'Long term tax accruals (UTP)', 'Other timing differences', 'Netting', 'Total'], 'table': [{'2021\n(in thousands of EUR) Deferred tax liabilities related to': 'Balance at January 1', 'Tangible fixed assets': '60,676', 'Employee benefits': '-', 'Financial derivatives': '364', 'Reversal statutory provision': '570', 'Long term tax accruals (UTP)': '36,748', 'Other timing differences': '6,676', 'Netting': '-57,677', 'Total': '47,358'}, {'2021\n(in thousands of EUR) Deferred tax liabilities related to': 'Recognised in income statement', 'Tangible fixed assets': '-6,460', 'Employee benefits': '-', 'Financial derivatives': '-299', 'Reversal statutory provision': '7,007', 'Long term tax accru

RAW TABLE: {'title': 'NOTE 11 – INVENTORIES', 'col_headers': ['(in thousands of EUR)', '2022', '2021'], 'table': [{'(in thousands of EUR)': 'Raw materials', '2022': '2,779', '2021': '2,683'}, {'(in thousands of EUR)': 'Consumables', '2022': '22,917', '2021': '9,485'}, {'(in thousands of EUR)': 'Total inventories', '2022': '25,696', '2021': '12,168'}, {'(in thousands of EUR)': 'Movement of the year recorded in statement of income', '2022': '13,528', '2021': '1,712'}]}

Summary;
 <<T62>>: The table provided is quantitative and represents the inventory information from DEME_Group's Annual report with sustainability for the years 2022 and 2021. The table consists of four columns: "(in thousands of EUR)", "2022", and "2021". The first column represents the different categories of inventories, including "Raw materials" and "Consumables". The second and third columns represent the inventory values for the years 2022 and 2021, respectively.

In 2022, the value of raw materials inventory was EU

RAW TABLE: {'title': 'NOTE 14 – ASSETS HELD FOR SALE', 'col_headers': ['(in thousands of EUR)', '2022', '2021'], 'table': [{'(in thousands of EUR)': 'Assets held for sale', '2022': '31,997', '2021': '32,456'}]}

Summary;
 <<T66>>: The table provided is quantitative and represents the assets held for sale in thousands of EUR for the years 2022 and 2021. The table has three columns: "(in thousands of EUR)", "2022", and "2021". The row in the table represents the assets held for sale, with the corresponding values for each year. In 2022, the value of assets held for sale is 31,997 thousand EUR, while in 2021, it was 32,456 thousand EUR. This table provides a quantitative overview of the changes in assets held for sale over the two years.



RAW TABLE: {'title': 'NOTE 15 – OTHER CURRENT ASSETS', 'col_headers': ['(in thousands of EUR)', '2022', '2021'], 'table': [{'(in thousands of EUR)': 'Deferred charges and accrued income', '2022': '100,950', '2021': '45,710'}, {'(in thousands of EUR)': 

RAW TABLE: {'title': 'Earnings per share, based on the number of ordinary shares at the end of the period (both basic and diluted) in EUR:', 'col_headers': ['Earnings per share from continuing operations (Share of the Group)', '4.45', '4.53', '25.25'], 'table': [{'Earnings per share from continuing operations (Share of the Group)': 'Earnings per share (Share of the Group)', '4.45': '4.45', '4.53': '4.53', '25.25': '25.25'}, {'Earnings per share from continuing operations (Share of the Group)': 'Comprehensive income (Share of the Group) per share', '4.45': '8.50', '4.53': '5.23', '25.25': '29.19'}]}

Summary;
 <<T70>>: The table provided is quantitative as it contains numerical values. It represents the earnings per share (EPS) for DEME_Group based on the number of ordinary shares at the end of the period. The table has three columns: "Earnings per share from continuing operations (Share of the Group)", "4.45", "4.53", and "25.25". 

The first row of the table indicates the title of the

RAW TABLE: {'title': 'DEBT MATURITY SCHEDULE OF TOTAL LONG-TERM FINANCIAL LIABILITIES', 'col_headers': ['(in thousands of EUR)', 'More than 5\nyears', 'Between 1 and\n5 years', 'Less than one\nyear', 'Total'], 'table': [{'(in thousands of EUR)': 'Subordinated loans', 'More than 5\nyears': '-', 'Between 1 and\n5 years': '677', 'Less than one\nyear': '-', 'Total': '677'}, {'(in thousands of EUR)': 'Lease liabilities', 'More than 5\nyears': '35,315', 'Between 1 and\n5 years': '41,067', 'Less than one\nyear': '24,960', 'Total': '101,342'}, {'(in thousands of EUR)': 'Credit institutions', 'More than 5\nyears': '140,494', 'Between 1 and\n5 years': '570,947', 'Less than one\nyear': '227,910', 'Total': '939,351'}, {'(in thousands of EUR)': 'Other long-term loans', 'More than 5\nyears': '-', 'Between 1 and\n5 years': '1,404', 'Less than one\nyear': '-', 'Total': '1,404'}, {'(in thousands of EUR)': 'Total long-term financial liabilities', 'More than 5\nyears': '175,809', 'Between 1 and\n5 years'

RAW TABLE: {'title': 'At closing date, the instruments qualified as cash flow hedges have the following characteristics:', 'col_headers': ['2021\n(in thousands of EUR)', 'Non-current asset', 'Non-current liability', 'Current asset', 'Current liability', 'Total net balance\nfair value'], 'table': [{'2021\n(in thousands of EUR)': 'Interest rate swaps', 'Non-current asset': '-', 'Non-current liability': '-2,608', 'Current asset': '-', 'Current liability': '-1,892', 'Total net balance\nfair value': '-4,500'}]}

Summary;
 <<T77>>: The table provided is quantitative and represents the characteristics of instruments qualified as cash flow hedges at the closing date. The table includes the following columns: "2021 (in thousands of EUR)", "Non-current asset", "Non-current liability", "Current asset", "Current liability", and "Total net balance fair value". 

In the row of the table, it specifies the instrument type as "Interest rate swaps". The values in the subsequent columns indicate the char

RAW TABLE: {'title': 'Similar to 2021, almost the entire Group’s outstanding debt portfolio (short and long-term) has a fixed interest rate character, which limits the exposure of the Group to interest rate fluctuations.', 'col_headers': ['Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products', 'Effective average interest rate after considering derivatives products'], 'table': [{'Effectiv

RAW TABLE: {'title': 'DEME is also exposed to commodity risks and hedges against oil price fluctuations by entering into forward contracts. The fair value variation of these instruments is considered as construction costs. This variation is presented as an operating result. The fair value and notional amount of these instruments can be found below (+ is asset / - is liability):', 'col_headers': ['2021\n(in thousands of EUR)', 'Non-current\nasset', 'Non-current\nliability', 'Current asset', 'Current liability', 'Total net balance fair value', 'Notional amount'], 'table': [{'2021\n(in thousands of EUR)': 'Fuel hedges', 'Non-current\nasset': '500', 'Non-current\nliability': '-', 'Current asset': '2,151', 'Current liability': '-314', 'Total net balance fair value': '2,337', 'Notional amount': '16,292'}]}

Summary;
 <<T86>>: The table provided is quantitative and represents the fair value and notional amount of DEME_Group's fuel hedges for the year 2021. The table consists of six columns: "

RAW TABLE: {'title': 'The aging of trade receivables (net amount and excluding other operating receivables) (note (13)) is as follows:', 'col_headers': ['2021\n(in thousands of EUR)', 'Total', 'Not expired', 'Expired\n<1 month', 'Expired\n<2 months', 'Expired\n<3 months', 'Expired\n<6 months', 'Expired\n<1 year', 'Expired\n> 1 year'], 'table': [{'2021\n(in thousands of EUR)': 'Trade receivables', 'Total': '314,175', 'Not expired': '217,102', 'Expired\n<1 month': '10,470', 'Expired\n<2 months': '14,421', 'Expired\n<3 months': '5,372', 'Expired\n<6 months': '9,573', 'Expired\n<1 year': '10,244', 'Expired\n> 1 year': '46,993'}, {'2021\n(in thousands of EUR)': 'Loss allowance', 'Total': '-18,423', 'Not expired': '-', 'Expired\n<1 month': '-', 'Expired\n<2 months': '-', 'Expired\n<3 months': '-', 'Expired\n<6 months': '-', 'Expired\n<1 year': '-', 'Expired\n> 1 year': '-18,423'}, {'2021\n(in thousands of EUR)': 'Total net amounts', 'Total': '295,752', 'Not expired': '217,102', 'Expired\n<1 

RAW TABLE: {'title': 'Set out below is an overview of the carrying amounts of the Group’s financial instruments that are shown in the financial statements. All fair values mentioned in the table below relate to Level 2. During the reporting periods, there were no transfers between Level 1 and Level 2 fair value measurements, and no transfers into and out of Level 3 fair value measurements.', 'col_headers': ['Non-current liabilities', '26,868', '580,797', '607,665', '', '620,631'], 'table': [{'Non-current liabilities': 'Interest-bearing debt', '26,868': '-', '580,797': '577,970', '607,665': '577,970', '': 'Level 2', '620,631': '590,936'}, {'Non-current liabilities': 'Financial derivatives', '26,868': '26,868', '580,797': '-', '607,665': '26,868', '': 'Level 2', '620,631': '26,868'}, {'Non-current liabilities': 'Other liabilities', '26,868': '-', '580,797': '2,827', '607,665': '2,827', '': 'Level 2', '620,631': '2,827'}, {'Non-current liabilities': 'Current liabilities', '26,868': '12,36

RAW TABLE: {'title': 'DEME’s subsidiaries in the Netherlands operate a number of defined benefit pension schemes. Without exception, these plans are insured with an authorised insurance company in the Netherlands and are closed for new entries and accruals. The net liabilities of the schemes arise from the obligation for the entities to index accrued pension benefits and benefits in payment and/or the obligation to pay guarantee costs to the insurance company.', 'col_headers': ['Employee benefit obligations\n(in thousands of EUR)', '2022', '2021'], 'table': [{'Employee benefit obligations\n(in thousands of EUR)': 'Retirement obligations in Belgium and The Netherlands', '2022': '56,902', '2021': '62,213'}, {'Employee benefit obligations\n(in thousands of EUR)': 'Other retirement obligations', '2022': '3,621', '2021': '3,054'}, {'Employee benefit obligations\n(in thousands of EUR)': 'Balance at December 31', '2022': '60,523', '2021': '65,267'}]}

Summary;
 <<T96>>: The table provided is 

RAW TABLE: {'title': 'MOVEMENT IN RETIREMENT BENEFIT PLAN OBLIGATIONS AND ASSETS', 'col_headers': ['Retirement benefit plan assets balance at January 1', '184,686', '190,074'], 'table': [{'Retirement benefit plan assets balance at January 1': 'Return on plan assets (+) (excluding interest income)', '184,686': '-49,249', '190,074': '-2,976'}, {'Retirement benefit plan assets balance at January 1': 'Interest income on plan assets (+)', '184,686': '1,725', '190,074': '982'}, {'Retirement benefit plan assets balance at January 1': 'Contributions from employer/employees (*)', '184,686': '14,516', '190,074': '13,273'}, {'Retirement benefit plan assets balance at January 1': 'Benefits paid to beneficiaries', '184,686': '-7,344', '190,074': '-14,740'}, {'Retirement benefit plan assets balance at January 1': 'Other movements', '184,686': '-1,773', '190,074': '-1,927'}, {'Retirement benefit plan assets balance at January 1': 'Retirement benefit plan assets balance at December 31', '184,686': '14

RAW TABLE: {'title': 'PROVISIONS', 'col_headers': ['(in thousands of EUR)', 'Warranties', 'Other', '2022', '2021'], 'table': [{'(in thousands of EUR)': 'Balance at January 1', 'Warranties': '37,378', 'Other': '5,932', '2022': '43,310', '2021': '30,297'}, {'(in thousands of EUR)': 'Arising during the year', 'Warranties': '5,153', 'Other': '895', '2022': '6,048', '2021': '13,213'}, {'(in thousands of EUR)': 'Utilised during the year', 'Warranties': '-1,659', 'Other': '-', '2022': '-1,659', '2021': '-200'}, {'(in thousands of EUR)': 'Unused amounts reversed', 'Warranties': '-', 'Other': '-', '2022': '-', '2021': '-'}, {'(in thousands of EUR)': 'Balance at December 31', 'Warranties': '40,872', 'Other': '6,827', '2022': '47,699', '2021': '43,310'}, {'(in thousands of EUR)': 'Current', 'Warranties': '4,714', 'Other': '-', '2022': '4,714', '2021': '3,738'}, {'(in thousands of EUR)': 'Non-current', 'Warranties': '36,158', 'Other': '6,827', '2022': '42,985', '2021': '39,572'}]}

Summary;
 <<T10

RAW TABLE: {'title': 'Transactions with joint ventures and associates are realised in the normal course of business and at arm’s length. None of the related parties have entered into any other transactions with the Group that meet the requirements of IAS 24 related party disclosures.', 'col_headers': ['(in thousands of EUR)', '2022', '2021'], 'table': [{'(in thousands of EUR)': 'Assets related to joint ventures and associates', '2022': '', '2021': ''}, {'(in thousands of EUR)': 'Non-current financial assets', '2022': '24,173', '2021': '25,668'}, {'(in thousands of EUR)': 'Trade and other operating receivables', '2022': '31,465', '2021': '13,889'}, {'(in thousands of EUR)': 'Liabilities related to joint ventures and associates', '2022': '', '2021': ''}, {'(in thousands of EUR)': 'Trade and other current liabilities', '2022': '34,606', '2021': '20,996'}, {'(in thousands of EUR)': 'Expenses and income related to joint ventures and associates (-) is cost and (+) is income', '2022': '', '20

RAW TABLE: {'title': 'as of December 31 (in thousands of EUR) (according to Belgian GAAP and after profit allocation)', 'col_headers': ['LIABILITIES', '2022', '2021'], 'table': [{'LIABILITIES': 'CAPITAL AND RESERVES', '2022': '1,111,845', '2021': '-'}, {'LIABILITIES': 'CAPITAL', '2022': '33,194', '2021': '-'}, {'LIABILITIES': 'Issued capital', '2022': '33,194', '2021': '-'}, {'LIABILITIES': 'Uncalled capital (-)', '2022': '-', '2021': '-'}, {'LIABILITIES': 'SHARE PREMIUM ACCOUNT', '2022': '475,989', '2021': '-'}, {'LIABILITIES': 'REVALUATION SURPLUS', '2022': '487,400', '2021': '-'}, {'LIABILITIES': 'RESERVES', '2022': '6,949', '2021': '-'}, {'LIABILITIES': 'Legal reserves', '2022': '3,319', '2021': '-'}, {'LIABILITIES': 'Reserves not available for distribution', '2022': '-', '2021': '-'}, {'LIABILITIES': 'Untaxed reserves', '2022': '1,716', '2021': '-'}, {'LIABILITIES': 'Reserves available for distribution', '2022': '1,914', '2021': '-'}, {'LIABILITIES': 'PROFIT CARRIED FORWARD', '202

RAW TABLE: {'title': 'but not environmentally sustainable activities (not Taxonomy-aligned activities)', 'col_headers': ['CapEx of Taxonomy-eligible but not environmentally sustainable activities (not', '0,00', '0.00%', '', ''], 'table': [{'CapEx of Taxonomy-eligible but not environmentally sustainable activities (not': 'Taxonomy-aligned activities) (A.2)', '0,00': '', '0.00%': '', '': ''}, {'CapEx of Taxonomy-eligible but not environmentally sustainable activities (not': 'Total (A.1 + A.2)', '0,00': '274,970,745.00', '0.00%': '51.87%', '': '0.00%\t0.00%'}, {'CapEx of Taxonomy-eligible but not environmentally sustainable activities (not': 'B. TAXONOMY NON-ELIGIBLE ACTIVITIES', '0,00': '', '0.00%': '', '': ''}, {'CapEx of Taxonomy-eligible but not environmentally sustainable activities (not': 'CapEx of Taxonomy non-eligible activities (B)', '0,00': '255,177,292.00', '0.00%': '48.13%', '': ''}, {'CapEx of Taxonomy-eligible but not environmentally sustainable activities (not': 'Total (A +

**WARNING**: We use the key <<T{i}>>: to indicate that the below are summaries of tables. This is important and used when we feed context into the LLM later. Elaborating, the summary is fed into the LLM with the raw table. To achieve this without having to encode the raw table we search for the above mentioned id in the contextual chunks. When found we then proceed to look thorugh the list of raw tables, and join these to their summaries.  

NOTE: i starts at 1!

### Obtain short summaries for figures:
Similalrly to tables, to facilitate encoding we try and pass figures through an LLM first. Note that the confidence we have in figure information is low. Figures are captured using heauristics and may hence be too vague to summarise meanigfully. The may also be extract from the text that have been wrongly captured. In either case a textual summary is required. 

In [106]:
doc_contents[FIGURE_SUMMARY_KEY] = []

In [107]:
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a figure and text summarizer."
    }, {
        "role": "user",
        "content": f"Decide whether the information below (INFO) obtained from {comp_name}'s " + 
            f"{report_name} is from a figure or text, " + 
            "and describe it in no more than 100 words. Note, if the information is vague " + 
            "return 'None'.\n" + 
            "INFO: {'title': 'Global headquarters', 'data': ['200+']}"
    }, {
        "role": "assistant", 
        "content": "Text: Coca-Cola has more than 200 global headquarters."
    }, {
        "role": "user", 
        "content": "INFO: {'title': '2022 Progress on Sustainable Sourcing2', 'data': ['0\t20\t40\t60\t80\t100', '36%', 'GRAPES\t 37%', 'SUGAR CANE\t 40%', 'APPLES\t 55%', 'CORN\t 70%', 'TEA\t 74%', '80%', 'PULP AND PAPER\t 86%', 'ORANGES\t 89%', 'LEMONS\t 96%']}"
    }, {
        "role": "assistant", 
        "content": "Figure: 2022 progress on sustainable sourcing. 37% of grapes, 40% of sugar cane, 55% of apples, 70% of corn, 74% of tea, 86% of pulp and paper, 89% of oranges, and 96% of lemons were sustainably sourced."
    }, {
        "role": "user", 
        "content": "INFO: {'title': 'Organic Revenue Growth (Non-GAAP)1', 'data': ['25', '-5', '24%']}",
    }, {
        "role": "assistant", 
        "content": "Text: Organic Revenue Growth (Non-GAAP) was 24%."
    }, {
        "role": "user", 
        "content": None,
    },
]

# TODO: Avoid repeating loop. Create seperate function.
for i, figure in tqdm(enumerate(doc_contents[FIGURE_KEY], start=0)):
    
    if doc_contents[FIGURE_SUMMARY_KEY] != []:
        # KEEP TRACK OF LAST GPT OUTPUT - This may help with next
        # figure summarisation as there may be a link between the two.
        parameters['messages'][-3:-1] = {
            "role": "user", 
            "content": f"INFO: {doc_contents[FIGURE_KEY][i-1]}"  # previous figure
        }, {
            "role": "assistant", 
            "content": f"{doc_contents[FIGURE_SUMMARY_KEY][-1]}"
        }
    
    parameters['messages'][-1] = {
        "role": "user", 
        "content": f"INFO: {figure}"
    }
    retries = 0
    success = False
    while retries < MAX_RETRIES and not success:
        try:
            response = openai.ChatCompletion.create(
              **parameters
            )
            # Add <<F{i}>>: in front to facilitate later lookup
            doc_contents[FIGURE_SUMMARY_KEY].append(
                f"<<F{i+1}>>: {response['choices'][0]['message']['content']}"
            )
            success = True
        except (openai.error.Timeout, openai.error.APIConnectionError, openai.error.AuthenticationError) as e:
                retries += 1
                print(f"Error encountered. Retrying {retries}/{MAX_RETRIES}")
                time.sleep(WAIT_SECONDS)

    print(f"RAW FIGURE: {figure}")
    print(f"Figure {i + 1} Summary;\n {doc_contents[FIGURE_SUMMARY_KEY][-1]}\n\n")


0it [00:00, ?it/s]

RAW FIGURE: {'title': 'ANNUAL REPORT', 'data': ['CHAPTER 01']}
Figure 1 Summary;
 <<F1>>: Text: The annual report has a chapter titled "Chapter 01".


RAW FIGURE: {'title': 'Letter of the CEO & Chairman\t6', 'data': ['Company profile\t10', 'Highlights 2022\t16', 'DEME core values\t22', 'DEME fleet\t24', 'Group performance 2022\t28', 'CHAPTER 02']}
Figure 2 Summary;
 <<F2>>: Text: The CEO & Chairman's letter is on page 6 of the annual report. The report also includes sections on company profile (page 10), highlights of 2022 (page 16), DEME core values (page 22), DEME fleet (page 24), and group performance in 2022 (page 28). There is also a chapter titled "Chapter 02".


RAW FIGURE: {'title': 'Relevant market drivers\t34', 'data': ['DEME’s 2030 strategy\t36']}
Figure 3 Summary;
 <<F3>>: Text: The annual report discusses relevant market drivers on page 34 and DEME's 2030 strategy on page 36.


RAW FIGURE: {'title': 'Innovation is at the heart of DEME\t44', 'data': ['Key risks\t46', 'CHAPT

RAW FIGURE: {'title': 'In just 13 months, DEME achieves an impressive feat in France when it drills the monopiles into hard rock using industry-first technology at Saint-Nazaire .', 'data': ['6', '18\tCHAPTER 01']}
Figure 24 Summary;
 <<F24>>: Text: In just 13 months, DEME achieves an impressive feat in France at Saint-Nazaire by drilling monopiles into hard rock using industry-first technology.


RAW FIGURE: {'title': 'E U RO N E X T B R U SS E L S', 'data': ['DEME ANNUAL REPORT 2022\t19', '10', '0 - Y E A R P O R T C O N C E S S I O N']}
Figure 25 Summary;
 <<F25>>: None.


RAW FIGURE: {'title': 'The consortium’s ambition is to develop Port-La Nouvelle as a sustainable green port, including establishing a strategic hub for offshore and floating wind .', 'data': ['9']}
Figure 26 Summary;
 <<F26>>: Text: The consortium aims to develop Port-La Nouvelle as a sustainable green port and create a strategic hub for offshore and floating wind.


RAW FIGURE: {'title': 'company on Euronext Brus

RAW FIGURE: {'title': 'Advance growth initiatives such as deep-sea mineral harvesting, green hydrogen and other advanced initiatives with compelling growth potential', 'data': ['40\tCHAPTER 02', 'DEME ANNUAL REPORT 2022\t41']}
Figure 49 Summary;
 <<F49>>: Text: DEME aims to advance growth initiatives such as deep-sea mineral harvesting, green hydrogen, and other advanced initiatives with compelling growth potential. This information is found in Chapter 02 of DEME's Annual Report 2022, specifically on pages 40 and 41.


RAW FIGURE: {'title': 'This strategy will help us to create sustainable value for our customers, DEME and society.', 'data': ['01']}
Figure 50 Summary;
 <<F50>>: Text: This strategy will help DEME create sustainable value for its customers, the company itself, and society.


RAW FIGURE: {'title': "We refer to chapter 3 'Segments' for more information.", 'data': ['02']}
Figure 51 Summary;
 <<F51>>: <<F50>>: Text: For more information, please refer to chapter 3, titled "Se

RAW FIGURE: {'title': 'The increase in the Offshore Energy orderbook reflects new contract awards, received during the second half of the year, with project wins deployments over the next several years, including sizeable project- wins in Continental Europe, the UK, Australia, Taiwan and the US.', 'data': ['62\tCHAPTER 03', 'DEME ANNUAL REPORT 2022\t63']}
Figure 73 Summary;
 <<F73>>: <<F72>>: Text: In Chapter 03 of the DEME Annual Report 2022, it is mentioned that the increase in the Offshore Energy orderbook is a result of new contract awards received in the second half of the year. These contract awards include projects in Continental Europe, the UK, Australia, Taiwan, and the US. This information can be found on pages 62 and 63 of the report.


RAW FIGURE: {'title': 'DREDGING & INFRA', 'data': ['€1,524M', 'turnover (2022)', '43']}
Figure 74 Summary;
 <<F74>>: Figure: The DREDGING & INFRA division had a turnover of €1,524 million in 2022.


RAW FIGURE: {'title': 'dredging vessels', '

RAW FIGURE: {'title': 'Orderbook for Environmental continued its growth trajectory with new contract wins in Norway, France and follow-on projects in Belgium. As per December 31, 2022 the orderbook stood at 313 million euro, an increase of 23% compared to 255 million euro a year earlier.', 'data': ['84\tCHAPTER 03', 'DEME ANNUAL REPORT 2022\t85']}
Figure 94 Summary;
 <<F94>>: Text: The Orderbook for Environmental experienced growth in 2022 with new contract wins in Norway, France, and follow-on projects in Belgium. As of December 31, 2022, the orderbook reached 313 million euros, which is a 23% increase compared to the previous year's 255 million euros.


RAW FIGURE: {'title': 'CONCESSIONS', 'data': ['€9M', 'in 2022']}
Figure 95 Summary;
 <<F95>>: Figure: In 2022, the concessions amounted to €9 million.


RAW FIGURE: {'title': 'to source new project leads and forge successful partnerships', 'data': ['144 MW']}
Figure 96 Summary;
 <<F96>>: Text: The company was able to source new projec

RAW FIGURE: {'title': 'C H R I S T IAN L ABE Y R I E', 'data': ['(º 1956, French)']}
Figure 115 Summary;
 <<F115>>: Figure: Christian Labeyrie was born in 1956 and is French.


RAW FIGURE: {'title': 'K E R S TI N KO N R A D SS O N', 'data': ['(º 1967, Swedish)']}
Figure 116 Summary;
 <<F116>>: <<F115>>: Figure: Kerstin Konradsson was born in 1967 and is Swedish.


RAW FIGURE: {'title': '1 For remuneration purposes, the meeting of the Board of Directors held on the date of the incorporation of DEME Group NV was not taken into account, leading to a total of 4 meetings.', 'data': ['104', 'CHAPTER 04', 'DEME ANNUAL REPORT 2022', '105']}
Figure 117 Summary;
 <<F117>>: Text: For remuneration purposes, the meeting of the Board of Directors held on the date of the incorporation of DEME Group NV was not taken into account, resulting in a total of 4 meetings.


RAW FIGURE: {'title': 'L U C VA N D E N B UL C K E', 'data': ['(º 1971, Belgian)']}
Figure 118 Summary;
 <<F118>>: Figure: LUC VAN DEN B

RAW FIGURE: {'title': 'CO2 emission reduction of', 'data': ['3,380 tonnes CO2 per year']}
Figure 146 Summary;
 <<F146>>: Figure: The CO2 emission reduction is 3,380 tonnes per year.


RAW FIGURE: {'title': 'operational in the course of 2023.', 'data': ['144', '145']}
Figure 147 Summary;
 <<F147>>: None.


RAW FIGURE: {'title': 'We are currently working to increase energy efficiency and to', 'data': ['PRO G R E SS 2 0 2 2']}
Figure 148 Summary;
 <<F148>>: Text: We are currently working to increase energy efficiency and made progress in 2022.


RAW FIGURE: {'title': 'for the surrounding community .', 'data': ['146', '147']}
Figure 149 Summary;
 <<F149>>: None.


RAW FIGURE: {'title': 'intensive construction equipment, and these locations are usually exactly where DEME is working.', 'data': ['PRO G R E SS 2 0 2 2']}
Figure 150 Summary;
 <<F150>>: Text: DEME's intensive construction equipment is typically located exactly where they are working.


RAW FIGURE: {'title': 'diesel powered equip

RAW FIGURE: {'title': 'the team raised £3,030 and the fundraising continues in 2023.', 'data': ['166', '167']}
Figure 166 Summary;
 <<F166>>: None.


RAW FIGURE: {'title': '‘Safety Success Stories’ and these are being shared throughout the Group .', 'data': ['TAKE 5']}
Figure 167 Summary;
 <<F167>>: Text: The group is sharing "Safety Success Stories" throughout the organization.


RAW FIGURE: {'title': 'and the measures to be taken to prevent similar situations occurring .', 'data': ['168', '169']}
Figure 168 Summary;
 <<F168>>: <<F167>>: Text: There are measures to be taken to prevent similar situations from occurring, with specific actions identified in sections 168 and 169.


RAW FIGURE: {'title': 'Luc Vandenbulcke, DEME’s CEO, is very clear: “When it comes to safety, there is no space for negligence.”', 'data': ['170', '171']}
Figure 169 Summary;
 <<F169>>: Text: DEME's CEO, Luc Vandenbulcke, emphasizes the importance of safety and states that there is no room for negligence in thi

RAW FIGURE: {'title': 'However, derivatives which do not qualify as hedging instruments as defined by IFRS 9 are presented as instruments held for trading. Derivative financial instruments are recognised initially at cost. Subsequent to initial recognition, derivative financial instruments are measured at fair value. Recognition of any resulting unrealised gain or loss depends on the nature of the derivative and the effectiveness of the hedge. The fair value of interest-rate swaps is the estimated amount that the company would receive or pay when exercising the swaps at the closing date, taking into account current interest rates and the solvency of the swap counterparty. The fair value of a forward-exchange contract is the', 'data': ['202', '203']}
Figure 184 Summary;
 <<F184>>: <<F183>>: <<F182>>: <<F181>>: <<F180>>: <<F179>>: <<F178>>: <<F177>>: Text: The annual report explains that derivatives that do not qualify as hedging instruments according to IFRS 9 are classified as instrume

RAW FIGURE: {'title': 'The non-current financial assets, other than loans to joint ventures and associates mainly include long-term deposits and guarantees.', 'data': ['232', '233']}
Figure 200 Summary;
 <<F200>>: <<F199>>: <<F198>>: Text: The non-current financial assets, excluding loans to joint ventures and associates, primarily consist of long-term deposits and guarantees (232 and 233).


RAW FIGURE: {'title': 'NOTE 10 – CURRENT TAXES AND DEFERRED TAXES', 'data': ['Balance at December 31']}
Figure 201 Summary;
 <<F201>>: Text: Note 10 - Current taxes and deferred taxes. The data provided is the balance at December 31.


RAW FIGURE: {'title': 'Management periodically evaluates positions taken in the tax returns with respect to situations in which applicable tax regulations are subject to interpretation and establishes provisions where appropriate. These provisions for uncertain tax positions (UTP) are booked as a deferred tax liability. In this regard, management considers UTP’s ind

RAW FIGURE: {'title': '(*) Lease liabilities are not included. Total long-term debts also includes the current portion of the long-term debts (note (18)).', 'data': ['250', '251']}
Figure 213 Summary;
 <<F213>>: None.


RAW FIGURE: {'title': 'The following tables disclose the fair value and the notional amount of exchange rate instruments (forex hedges) issued (forward sales/purchase agreements) (+ is asset / - is liability):', 'data': ['Forex hedges\t68\t-53,661\t2,481\t-30,404', 'Forex hedges\t114\t-24,260\t1,056\t-10,162']}
Figure 214 Summary;
 <<F214>>: Figure: The table discloses the fair value and notional amount of exchange rate instruments (forex hedges). The first row shows 68 forex hedges with a total notional amount of -53,661 and a fair value of 2,481. The second row shows 114 forex hedges with a total notional amount of -24,260 and a fair value of 1,056.


RAW FIGURE: {'title': 'cal', 'data': ['2021', '252', '253']}
Figure 215 Summary;
 <<F215>>: None.


RAW FIGURE: {'titl

RAW FIGURE: {'title': 'activities (Taxonomy-aligned) (A.1)\t274,970,745.00\t51.87%\t51.87%\t0.00%\t51.87%', 'data': ['A.2 Taxonomy-Eligible']}
Figure 236 Summary;
 <<F236>>: Figure: The activities (Taxonomy-aligned) for A.2 Taxonomy-Eligible have a turnover of 274,970,745.00, representing 51.87% of the total turnover.


RAW FIGURE: {'title': 'but not environmentally sustainable activities (not Taxonomy-aligned activities)', 'data': ['294', '295']}
Figure 237 Summary;
 <<F237>>: Figure: There are 294 activities that are not environmentally sustainable and not aligned with the Taxonomy.


RAW FIGURE: {'title': 'REFERENCE TO COMMITMENTS, TARGETS & PERFORMANCE', 'data': ['296', '297']}
Figure 238 Summary;
 <<F238>>: <<F237>>: Figure: There are 296 references to commitments, targets, and performance.




**WARNING**: We use the key <<F{i}>>: to indicate that the below are summaries of figures. This is important and used when we feed context into the LLM later. Elaborating, the summary is fed into the LLM with the raw figure data. To achieve this without having to encode the raw table we search for the above mentioned id in the contextual chunks. When found we then proceed to look through the list of raw figures, and join these to their summaries.

NOTE: i starts at 1!

*OPTIONAL: To avoid figures being wrongly chosen because of a lack of noise in the context, we merge multiple figures together to forcefully introduce noise into figure chunks. This way there is a chance that the chuncks are still selected, but lower than if the signal was applified by a low sequence length.*

In [108]:
def join_pairs(lst):
    # If the list has an odd length and more than 2 elements
    if len(lst) % 2 != 0 and len(lst) > 2:
        # Concatenate last three elements
        last = lst[-3] + '\n' + lst[-2] + '\n' + lst[-1]
        # Pair up the rest
        return [lst[i] + '\n' + lst[i+1] for i in range(0, len(lst) - 3, 2)] + [last]
    else:
        return [lst[i] + '\n' + lst[i+1] for i in range(0, len(lst), 2)]

is_join_pairs = False

if is_join_pairs:
    doc_contents[FIGURE_SUMMARY_KEY] = join_pairs(doc_contents[FIGURE_SUMMARY_KEY])
    print(doc_contents[FIGURE_SUMMARY_KEY])

Save the results to avoid having to pointlessly rerun the table and figure summarisation parts.

In [109]:
is_override = True
fpath = f"./data/pickle_db/{comp_name}_{fname}_{year}_inputs.pkl"

# TODO: in industrialised code this would preceed above load procedure!
if os.path.exists(fpath) and not is_override:
    with open(fpath, 'rb') as f:
        doc_contents = pickle.load(f)
else:
    print(f'Saving inputs to {fpath}')
    with open(fpath, 'wb') as f:
        pickle.dump(doc_contents, f)

Saving inputs to ./data/pickle_db/DEME_Group_DEME_Annual_Report2022_2022_inputs.pkl


### Text Splitting, Embedding Models, and Vector DB
We'll be using OpenAI's text-embedding-ada-002 model to convert our contextual chunks to latent space embeddings.

In [110]:
# Prepare text for chunking:
is_join_raw_content = True

text = "\n".join(doc_contents[TEXT_KEY])
if is_join_raw_content:
    text += str(raw_content)

In [111]:
# model = 'gpt-3.5-turbo'  # open ai LLM model we will be using later.
# model = 'text-davinci-003'  # open ai LLM model we will be using later.
model = 'gpt-4'
enc_code = tiktoken.encoding_for_model(model).name
tokenizer = tiktoken.get_encoding(enc_code)

# Determine length of input after tokenization
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=240,  # 310 in order to be able to fit 5 chunks in context window
    chunk_overlap=24,  # 35
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]  # order in which splits are prioritized
)
chunks = text_splitter.split_text(text)
print(
    f'The input text of lenght {len(text)} was split into {len(chunks)} chunks.'
)

The input text of lenght 1441115 was split into 1642 chunks.


In [112]:
def assert_roughly_equal(value1, value2, tolerance, message=None):
    if not math.isclose(value1, value2, rel_tol=tolerance):
        if message is None:
            message = f"{value1} and {value2} are not roughly equal within {tolerance} tolerance"
        raise AssertionError(message)

assert_roughly_equal(sum([len(chunk) for chunk in chunks]), len(text), 100)

In [113]:
SOURCE_KEY = "source"

# Track metadata:
metadatas = [
    {**doc_contents[METADATA_KEY], SOURCE_KEY: 'text'} for _ in range(len(chunks))
]

In [114]:
# extend chunks with table and figure descriptions as well as the source metadata:
for key, source in zip(
    [TABLE_SUMMARY_KEY, FIGURE_SUMMARY_KEY],
    ['table', 'figure']
):
    chunks.extend(doc_contents[key])
    metadatas.extend([
        {**doc_contents[METADATA_KEY], SOURCE_KEY: source}
        for _ in range(len(doc_contents[key]))
    ])

In [115]:
divider = '-'*100
print(f'TOTAL CHUNK COUNT (WITH TABLES AND FIGURES):\n {len(chunks)}\n\n')
for chunk in tqdm(chunks):
    print(f'{chunk}\n{divider}\n\n')

TOTAL CHUNK COUNT (WITH TABLES AND FIGURES):
 1998




  0%|          | 0/1998 [00:00<?, ?it/s]

ANNUAL REPORT
INTRODUCTION
STRATEGY
DEME’s two-dimensional strategy for
sustainable performance	40
SEGMENTS
CORPORATE GOVERNANCE AND RISK
SUSTAINABILITY & QHSE
Consolidated financial statements	182
APPENDIX
Certificates, awards & ratings	173
All definitions for alternative performance measures (APMs) or acronyms used in this report are available in the Glossary (see the Appendix chapter).
CHAPTER
Facing new challenges and achieving our goals are only possible in a safe and healthy working environment.
This is my commitment to all our employees worldwide.
NATALIA DE SOUZA SECCO  |  QHSE-S ENGINEER
Letter
of the CEO & Chairman
Undoubtedly, we are living in a transformative century – the drive for sustainability, the rise in digitalisation and the need to combat global warming, are just some of the major factors impacting our business.
----------------------------------------------------------------------------------------------------


DEME has been shaping the world for 145 years and we

### Indexing:
Store the indexes to avoid pointlessly rerunning the embedding code.

In [116]:
fpath = f"./data/faiss_db/{comp_name}_{fname}_{year}.pkl"
is_override = True

if os.path.exists(fpath) and not is_override:
    print("Loading vectorised chunks...")
    with open(fpath, 'rb') as f:
        vector_store = pickle.load(f)
else:
    # Init embedding model:
    print("Encoding and saving vectorised chunks...")
    
    embed = OpenAIEmbeddings(
        model='text-embedding-ada-002',
    )
    
    vector_store = FAISS.from_texts(
        chunks, embedding=embed,
        metadatas=metadatas,
    )
    with open(fpath, 'wb') as f:
        pickle.dump(vector_store, f)

Encoding and saving vectorised chunks...


**NOTE**: When using multiple documents, one can combine vector stores using the merge command:
- https://python.langchain.com/docs/integrations/vectorstores/faiss

Hence by adding metadata to the vector store, and then applying a merge operation, one can later apply metadata filters and track context origins. 

In [117]:
DOC_CONTENT[comp_name] = {report_name: doc_contents.copy()}  # will become useful when/if we start to use multiple reports

In [118]:
TABLE_PATTERN = r"<<T(\d+)>>:"
FIGURE_PATTERN = r"<<F(\d+)>>:"
VECTOR_STORE = vector_store

# USED IN THE JOIN OF RAW TABLES TO TABLE SUMMARIES AFTER EMBEDDING
def get_table_index_from_string(
    string,
    pattern=TABLE_PATTERN
):
    """
    Checks if a string contains the pattern (e.g., <<T{i}>>),
    and if it does return a list of all i (the digits) and if not return an empty list.
    """
    matches = re.findall(pattern, string)
    return [int(match) for match in matches]


def process_content(
    chunk: str,
    company_key: str,
    report_key: str,
    is_summary: bool = True
) -> str:
    """
    Process table summaries to also include raw tables from 
    doc_content.
    
    Note: if is_summary is True, table summaries are outputed.
    Otherwise, raw tables are outputted.
    """
    table_indexes = get_table_index_from_string(chunk, TABLE_PATTERN)
    for table_idx in table_indexes:
        match = re.search(TABLE_PATTERN.replace("(\d+)", str(table_idx)), chunk)  # search for the specific index
        table = DOC_CONTENT[company_key][report_key][TABLE_KEY][table_idx-1]  # table_idx starts at 1 hence idx-1
        
        if is_summary:
            chunk = chunk[:match.start()] + "TABLE SUMMARY:" + chunk[match.end():]
        else:  # do not include summary in context
            chunk = f"TABLE: {table}"
        
    return chunk


def find_index(
    docs: Dict[str, Any], pattern: str = TABLE_PATTERN
):
    regex = re.compile(pattern)
    for i, _doc in enumerate(docs):
        if regex.search(_doc.page_content):
            return i
    return -1


# The belief here is that quantitative data 
# will most probably appear in tables, and 
# hence we want to ensure that the context 
# for quantitative KPI retrival contains 
# at least one table. 
def search_vector_store(
    query: str,
    vector_store,
    n: int, 
    is_quantitative: bool,
    max_n: int = 20
) -> Dict[str, Any]:
    """
    Returns a context window (size n) of Docs containing at 
    least one table when is_quantitative is True.
    Note the table must be within the first max_n
    docs.
    """
    
    if is_quantitative:
        docs = vector_store.similarity_search(query, max_n)
        table_index = find_index(docs)
        if (table_index < n) or (table_index == -1):
            # no need to process as table either in context
            # window or not similar enough to input query
            return docs[:n]
        else:
            return docs[:n-1] + [docs[table_index]]
    else:
        return vector_store.similarity_search(query, n)


def get_context(
    query: str, 
    vector_store,
    n: int = 4,
    is_flag: bool = True
) -> str:
    """
    Retrive `n` contextual chunks similar to query from VECTOR_STORE.
    NOTE: when is_flag is True, the code outputs table 
    summaries as the context, otherwise it outputs raw 
    tables (where tables are selected as being 
    contextually important).
    """
    contents = [
        f"""{
            process_content(
                doc.page_content,
                company_key=doc.metadata[COMP_KEY],
                report_key=doc.metadata[REPORT_KEY],
                is_summary=is_flag
            )
        } | Metadata: {
                doc.metadata[COMP_KEY]
            } {
                doc.metadata[REPORT_KEY]
        }""" for doc in search_vector_store(
            query, vector_store, n=n, 
            is_quantitative=not is_flag,
        )
    ]

    full_content = '\n\n'.join(
        [
            f"Rank: {rank} | Content: {cont}" 
            for rank, cont in enumerate(contents, start=1)
        ]
    )
    return full_content + "\n\nContextual information sorted from most relevant to least relevant."


class AgentContextRetrieval(BaseTool):
    
    name = "information_retrival"
    description = """
        Fetch the most recent information about a company's financials and ESG initiatives.
    """
    K = 4
    output_chunks = []

    @staticmethod
    def string_similarity(s1, s2):
        seq_matcher = difflib.SequenceMatcher(None, s1, s2)
        return seq_matcher.ratio()
            
    # IMPLEMENTED PURELY TO SATISFY CLASS BEING USED IN AGENT!!!
    def _history_lookup(self, chunk: str) -> str:
        """Check if context has already been provided in the past."""
        for _chunk in self.output_chunks:
            if self.string_similarity(chunk, _chunk) > 0.9:
                return f"The information was shared in the previous {self.name} calls."
        
        # If no highly similar string is found in the outputs, 
        # append the query to outputs and return True
        self.output_chunks.append(chunk)
        return chunk
    
    # ADD HISTORY LOOKUP 
    def _run(self, query: str) -> str:
        contents = [
            f"""{
                self._history_lookup(
                    process_content(
                        doc.page_content,
                        company_key=doc.metadata[COMP_KEY],
                        report_key=doc.metadata[REPORT_KEY],
                        is_summary=False,
                    )
                )
            } | Metadata: {
                    doc.metadata[COMP_KEY]
                } {
                    doc.metadata[REPORT_KEY]
            }""" for doc in VECTOR_STORE.similarity_search(query, self.K)
        ]
        
        full_content = '\n\n'.join(
            [
                f"Rank: {rank} | Content: {cont}" 
                for rank, cont in enumerate(contents, start=1)
            ]
        )
        return full_content + "\n\nContextual information sorted from most relevant to least relevant."
    
    def _arun(self, query: str):
        raise NotImplementedError(
            f"{self.__class__.__name__} does not currently support async run."
        )

### Set up KPI extraction:
Use Open AIs API to extract desired KPIs from Coca-Cola's sustainability report. 

Start with KPIs that are binary flags, and then transition to KPIs that are quantitative.

**IMPROVEMENT IDEAS**: 

Prompt engineering when extracting KPIs entails two steps. Defining a prompt which maximises the chances of selecting the correct context. And defining a prompt which maximises the chances of the LLM correctly extracting information from the context, with the aim of obtaining an accurate KPI. 

Given that the above is a two part task, it makes sense to define two prompts per KPI: one for the similarity search, and one for the KPI extraction. 

In [119]:
EXP_OUTPUT_ID = r"EXPECTED OUTPUT:"
SIMILARITY_PROMPT_KEY = 'similarity_prompt'
GPT_PROMPT_KEY = 'gpt_prompt'
RAW_RESULT_KEY = 'raw_output'
RESULT_KEY = 'output'

def parse_output(s, output_id=EXP_OUTPUT_ID):
    # Check if the input string is a single word or a number (e.g., 19.9, 21,0, 18).
    # The regex pattern matches single words as well as floating point and integer numbers.
    if re.fullmatch(r'\w+|(\d+([.,]\d+)?)', s.strip()):
        return s.strip()
    
    # Otherwise, search for the pattern "expect output:" (case-insensitive)
    match = re.search(output_id, s, re.IGNORECASE)
    
    # If a match is found, return the part of the string that comes after it.
    if match:
        return s[match.end():].strip()  # The `strip` method removes any leading/trailing whitespace.
    else:
        return None  # No match found.
    
    
def parse_qualitative_output(input_string):
    # Use case-insensitive regular expressions to find mentions of 'True', 'False', or 'NaN'
    matches = re.findall(r'true|false|nan', input_string, re.IGNORECASE)

    # Convert the matches to lowercase to ensure case insensitivity
    matches = [match.lower() for match in matches]

    # Determine the final output based on the extracted matches
    if 'true' in matches:
        return 'True'
    elif 'false' in matches:
        return 'False'
    elif 'nan' in matches:
        return 'NaN'
    else:
        return 'NaN'  # Return 'NaN' if no matches were found
    
    
def parse_quantitative_output(input_string, exclude_values=None):
    # Use regular expressions to find numeric values with optional units or 'NaN'
    pattern = r'(\b(?:\d{1,3}(?:,\d{3})*(?:\.\d+)?|NaN|nan|NAN)\b)(?:\s*([a-zA-Z]+))?'

    # Find all matches in the input string
    matches = re.findall(pattern, input_string)

    # Extract the numeric values and units (if present)
    numeric_values = [match[0] for match in matches]
    units = [match[1] if match[1] else '' for match in matches]

    # Set default exclusion criteria if not provided
    if exclude_values is None:
        exclude_values = set()

    # Iterate through the numeric values in reverse order to find the final valid result
    for i in range(len(numeric_values) - 1, -1, -1):
        full_value = numeric_values[i] + units[i]

        if full_value.lower() not in exclude_values:
            return full_value

    return None

In [82]:
parse_quantitative_output("""
    Therefore, the total greenhouse gas (GHG) emissions from Scope 1 categories in 2022 is 83,194 tonnes CO2eq, but the breakdown of this total into specific categories is not available.

Final Answer: 0.300tonnes
""")

'0'

**Qualitative Queries:**

In [120]:
parameters = {
    'model': 'gpt-4', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a financial analyst responsible for extracting KPIs from financial reports."
    }, {
        "role": "user", 
        "content": None,
    }, 
]

# FOLLOWS: similarity_prompt, gpt_prompt, result_key
kpi_flag_prompts = [
    (
        """
        Existence of policy, charter, or code of conduct focusing on diversity and/or discrimination.
        """,
        """
        Does a policy, charter, or code of conduct (a formal separate document) focusing on diversity topics and/or discrimination?
        The document could focus on any one of the following: 
        gender, age, disability, social background, religion, race, etc. Please note that the answer 
        should be 'Yes' if ANY of the above are covered in the company's document.
        """,
        "diversity_policy_flag"
    ),
    (
        """
        Existence of policy, charter, or code of conduct focusing on business ethics.
        """,
        """
        Does a policy, charter, or code of conduct (a formal separate document) focusing on business ethics?
        The document could focus on any one of the following: 
        corruption, bribary, money laundering, fraud, etc. Please note that the answer 
        should be 'Yes' if ANY of the above are covered in the company's document.
        """,
        "compliance_policy_flag"
    ), 
    (
        """
        Does the company report Scope 1 emissions, and Scope 2 (market based) emissions? 
        """,
        """
        Does the company report Scope 1 emissions, and Scope 2 (market based) emissions?
        """,
        "scope_1_and_2_flag"
    ),
    (
        """
        Does the company report Scope 3 emissions? 
        """,
        """
        Does the company report Scope 3 emissions?
        """,
        "scope_3_flag"
    ),
    (
        """
        Do they provide quantitative targets for emission reductions?
        """,
        """
        Do they provide quantitative targets for emission reductions? Look for 
        scope 1, 2 and/or 3. Please note that the answer should be 'Yes' if ANY of the
        measures listed below are mentioned or detailed in the report.
        """,
        "climate_targets_flag"
    ),
    (
        """
        Are company targets for emission reductions validated by SBTi (science based target initiative) 
        or another third party?
        """,
        """
        Are company's scope 1, 2, or 3 reduction targets validated by SBTi (science based target initiative) 
        or another third party? Please note that the answer should be 'Yes' if ANY targets are validated.
        """,
        "sbti_validation_flag",
    ),
    (
        """
        What steps have been taken to lower energy consumption in direct operations? 
        Look for measures related to energy reduction, energy efficiency improvements, 
        and adoption of renewable energy in its offices.
        """, 
        """
        Does the report detail any of the following efforts or measures related to energy
        reduction, energy efficiency improvements, and adoption of renewable
        energy in its offices? Please note that the answer should be 'Yes' if ANY of the
        measures listed below are mentioned or detailed in the report.

        Renewable Energy:

            Utilization of renewable energy sources (e.g., solar, wind, hydro)
            Details on energy from renewable sources, either generated or purchased
            Mention of grid parity or distributed generation insights

        Energy Efficiency:

            Implementation or consideration of efficiency technologies (e.g., LED, HVAC)
            Steps toward building insulation or retrofitting measures
            Adoption or exploration of demand-side management strategies

        Energy Certificates:

            Utilization or acquisition of various energy certificates types (e.g., Renewable Energy Credits, carbon offsets)
            Information on certificate volume or count
            Reference to voluntary vs mandatory compliance with energy certification schemes
        """,
#         """
#         What steps have been taken to lower energy consumption in direct operations?
#         """, 
#         """
#         Does the report detail any of the following efforts or measures related to energy
#         reduction, energy efficiency improvements, and adoption of renewable
#         energy in its offices? Please note that the answer should be 'Yes' if ANY
#         measures are taken.
#         """,
        'energy_reduction_plan_for_office'
    ),
    (
        """
        Do they provide quantitative targets for energy consumption reduction?
        """,
        """
        Do they provide quantitative targets for energy consumption reduction? Look for 
        energy consumption reduction targets in kWh, MWh, GWh, or %?
        Please note that the answer should be 'Yes' if ANY of the
        measures listed below are mentioned or detailed in the report.
        """,
        "energy_consumption_reduction_flag"
    ),
    (
        """
        environmental policy
        """,
#         """
#         Does the company have a environmental policy focusing on 
#         pollution, waste management, water use, or biodiversity?
#         """,
        """
        Does the company have a environmental policy (a formal separate document) focusing on any one of the following: 
        pollution, waste management, water use, or biodiversity? Please note that the answer should be 'Yes' if ANY
        of the above (pollution, waste management, water use, or biodiversity) are covered in the 
        company's policy.
        """,
        "environmental_policy_flag"
    ),  # TODO: will need to scrape more documents 
    (
        """
        Existence of policy, charter, or code of conduct including environmental topics to be signed by suppliers?
        """,
        """
        Existence of policy, charter, or code of conduct (a formal separate document) including environmental topics
        to be signed by suppliers? The document could focus on any one of the following: 
        climate change, pollution, waste management, water use, or biodiversity. Please note that the answer 
        should be 'Yes' if ANY of the above are covered in the company's document.
        """,
        "suppliers_code_of_conduct_env_flag"
    ),
    (
        """
        Existence of policy, charter, or code of conduct including social topics to be signed by suppliers?
        """,
        """
        Existence of policy, charter, or code of conduct (a formal separate document) including social topics
        to be signed by suppliers? The document could focus on any one of the following: 
        human rights, health and safety, working conditions, etc. Please note that the answer 
        should be 'Yes' if ANY of the above are covered in the company's document.
        """,
        "suppliers_code_of_conduct_social_flag"
    ), 
    (
        """
        Existence of policy, charter, or code of conduct focusing on business ethics topics to be signed by suppliers
        """,
        """
        Does a policy, charter, or code of conduct (a formal separate document) focusing on business ethics topics
        to be signed by suppliers exist? The document could focus on any one of the following: 
        corruption, bribary, money laundering, fraud, etc. Please note that the answer 
        should be 'Yes' if ANY of the above are covered in the company's document.
        """,
        "suppliers_code_of_conduct_ethics_flag"
    ),   # TODO: Need to modify how textual tables are stored - need to encode tables as well (this will work with textual tables as these will be properly encoded)
]

kpi_results = {}

# # TODO: Avoid repeating loop. Create seperate function.
for sim_prompt, gpt_prompt, key in kpi_flag_prompts:
    
    print(f"{sim_prompt}\n\n\n")
    
    context = get_context(
        sim_prompt, 
        vector_store,
        n=5,  # qualitative information can span across a greater contextual window hence choose 4-6!
        is_flag=True,  # Use table summaries
    )
    
    full_prompt = f"CONTEXT:\n{context}\n\n" + \
        f"Given the above information can you please answer:\n" + \
        f"QUESTION:\n{gpt_prompt}\n\n" + \
        f"{EXP_OUTPUT_ID} Please first reason about the question and then " \
        f"finalise your answer in one word: either 'True', 'False', or 'NaN'. " + \
        f"Answer 'NaN' when the question cannot be answered from the provided context. " + \
        f"Where possible try and obtain the answer directly from tables."
    
    print(
        f"INPUT: {full_prompt}\n\n"
    )
    
    parameters['messages'][-1]['content'] = full_prompt
    
    try:
        response = openai.ChatCompletion.create(
          **parameters
        )['choices'][0]['message']['content']
    except openai.error.InvalidRequestError:
        # USE MODEL WITH WIDER CONTEXT WINDOW
        _model = parameters['model']
        parameters['model'] = "gpt-3.5-turbo-16k"
        response = openai.ChatCompletion.create(
          **parameters
        )['choices'][0]['message']['content']
        parameters['model'] = _model  # reset
    
    kpi_results[key] = {
        SIMILARITY_PROMPT_KEY: sim_prompt,
        GPT_PROMPT_KEY: full_prompt,
        RAW_RESULT_KEY: response,
        RESULT_KEY: parse_qualitative_output(response)
    }
    
    print(
        f"RAW OUTPUT: {response}\n\n" +
        f"SAVED OUTPUT: {kpi_results[key][RESULT_KEY]}" +
        "---"*40 +
        '\n\n\n'
    )



        Existence of policy, charter, or code of conduct focusing on diversity and/or discrimination.
        



INPUT: CONTEXT:
Rank: 1 | Content: ['DIVERSITY AND OPPORTUNITY', 'POLICIES', '--\t\tCode of Ethics and Business Integrity', '--\t\tHuman Rights Policy (incl. policies on equal opportunities, hiring practices and discrimination, harassment and disciplinary measures, freedom of association and the right to collective bargaining)']}, {'table': ['', 'MANAGEMENT SYSTEM APPROACH', 'We ensure a structural social dialogue in the organisation . This dialogue leads to an action list which is addressed by our management .', 'We evaluate our approach with the following external validation', 'and/or verification mechanisms in place:', "--\t\tPeriodic external audits of DEME's Management System according to ISO 9001", '--\t\tInternal annual QHSE-S Management Review including training and skill management topics.']}, {'table': ['ETHICAL BUSINESS', 'POLICIES', '--\t\tCode of Ethics and Bu

RAW OUTPUT: The company DEME Group has a "Code of Ethics and Business Integrity" as mentioned in the content of Rank 1, Rank 2, Rank 4, and Rank 5. This code covers important areas such as anti-bribery and anti-corruption, compliance with international trade laws, and accounting standards and records. Therefore, it can be inferred that the company has a formal document focusing on business ethics, including corruption, bribery, and potentially fraud and money laundering.

Final Answer: True

SAVED OUTPUT: True------------------------------------------------------------------------------------------------------------------------




        Does the company report Scope 1 emissions, and Scope 2 (market based) emissions? 
        



INPUT: CONTEXT:
Rank: 1 | Content: <<F229>>: Text: The company is concerned about CO2 emissions from Scope 3, specifically where electricity is generated. | Metadata: DEME_Group Annual report with sustainability (2022)

Rank: 2 | Content: <<F228>>: Text: The

INPUT: CONTEXT:
Rank: 1 | Content: <<F152>>: Text: The company aims to reduce greenhouse gas (GHG) emissions throughout its project value chain. This includes exchanging energy and emissions performance data. They are committed to gaining more understanding of their major emissions categories and setting specific targets and actions based on an analysis of their Scope 3 emissions and related Life Cycle Assessments. The information provided does not specify any specific progress made in 2022. | Metadata: DEME_Group Annual report with sustainability (2022)

Rank: 2 | Content: <<F138>>: Text: The company aims to set specific targets and actions for the purchase of goods and services based on an analysis of their Scope 3 emissions and related Life Cycle Assessments. This information is mentioned in the progress report for 2022. | Metadata: DEME_Group Annual report with sustainability (2022)

Rank: 3 | Content: TA R G E T S
We have defined five significant energy users (SEUs) within our ISO

RAW OUTPUT: Reasoning: 

The report details several efforts and measures related to energy reduction, energy efficiency improvements, and the adoption of renewable energy in its offices. 

Renewable Energy: 
The report mentions the procurement of green electricity for their offices and sites in Belgium as a part of their strategy towards zero-emission offices. They also aim to generate their own wind and solar power for their headquarters. 

Energy Efficiency: 
The report mentions the implementation of heat pumps and fully electrified heating systems in their offices as part of their multi-year plan to shift from fossil fuel heating to the use of green electricity. They also mention focusing on the design and insulation during construction and renovation to reduce the heating and cooling demand. 

Energy Certificates: 
The report does not provide any information on the utilization or acquisition of various energy certificates types, information on certificate volume or count, or refere

RAW OUTPUT: The company does have a policy focusing on environmental aspects. The report mentions that the company's ambition is to actively manage the environmental impact of their operations by protecting biodiversity and minimizing any disturbance of sensitive species and habitats during their operations. They also aim to systematically implement environmental assessments in all project preparations and to avoid environmental incidents. They have a growing number of 127 Green Initiatives in 2022 to minimize the environmental impact of a project. These initiatives cover areas such as air emissions, energy consumption, fauna & flora, soil emissions, use of natural resources, waste management, and water emissions. Therefore, the company's policy covers pollution, waste management, water use, and biodiversity.

EXPECTED OUTPUT: True

SAVED OUTPUT: True------------------------------------------------------------------------------------------------------------------------




        Exis

RAW OUTPUT: The company's annual report mentions the inclusion of the "Code of Ethics and Integrity for business partners" in contracts with suppliers. This suggests that the company has a policy or code of conduct that suppliers are required to sign. However, the specific social topics covered by this code are not explicitly mentioned in the provided context.

Answer: True

SAVED OUTPUT: True------------------------------------------------------------------------------------------------------------------------




        Existence of policy, charter, or code of conduct focusing on business ethics topics to be signed by suppliers
        



INPUT: CONTEXT:
Rank: 1 | Content: include the Code of Ethics and Integrity for business partners in our contracts with suppliers .', 'We monitor and evaluate supplier safety performance via our internal audit system .']}, {'table': ['', 'P U B L I C', 'A U TH O R ITI E S', '', '----media/image472.png--------media/image472.png----Ensuring complian

**Quantitative Queries:**

In [123]:
# for instance direct emissions, fuel combustion, and fugitive emissions in 2022?
# Where possible try and obtain the information from tables.
kpi_quantitative_prompts = [
#     (
#         f"""
#         Does the report detail the level of independence among the members of the Board of Directors?
#         """,
#         f"""
#         Does the report detail the level of independence among the members of the Board of Directors?
#         Specifically, focus on the percentage of board members designated as independent
#         by the company over the total number of board members. If a company had 1 independent board member
#         and 4 total board members than the desired output would be 0.25.
        
#         Note: independence is linked to the material relationship with the company (i.e., role within the company), 
#         conflicts of interest (i.e., shareholding, family ties), and tenure (number of years on the Board).
        
#         """,
#         "independence_of_board_of_directors", 
#         [f'{year}'],
#     ),
    (
        f"""
        What were the reported greenhouse gas (GHG) emissions from Scope 1 categories in {year}?
        """,
        f"""
        What were the greenhouse gas (GHG) emissions from Scope 1 categories in {year}?
        
        Specifically look for:
            - Direct Emissions: Amount and types of direct emissions from company operations
            - Fuel Combustion: Details on fuel types used in company vehicles or on-site energy production
            - Fugitive Emissions: Any emissions from leaks or other unintentional releases
        """,
        "scope_1_emissions",
        ['1', '2eq', '2', '02', '02eq', '0', f'{year}']
    ), 
    (
        f"""
        What were the greenhouse gas (GHG) emissions from Scope 2 market based categories: 
        for instance indirect emissions, grid emission factors, and renewable energy credits
        in {year}?
        """,
        """
        What were the greenhouse gas (GHG) emissions from market based Scope 2 categories?
        
        Specifically look for:
            - Indirect Emissions: Volume of emissions from purchased electricity or heat
            - Grid Emissions Factor: Metrics used to calculate emissions from purchased energy
            - Renewable Energy Credits (RECs): Whether RECs are used to offset emissions
        """,
        "scope_2_emissions_market_based",
        ['2eq', '2', '02', '02eq', '0', f'{year}']
    ),
    (
        f"""
        What were the greenhouse gas (GHG) emissions from Scope 3 categories: 
        for instance supply chain emissions, business travel, product lifecycle, and outsourced activities
        in {year}?
        """,
        """
        What were the greenhouse gas (GHG) emissions from Scope 3 categories?
        
        Specifically look for:
            - Supply Chain Emissions: Information on emissions from upstream and downstream activities
            - Business Travel & Employee Commuting: Emissions attributed to these activities
            - Product Lifecycle: Emissions from the entire lifecycle of products or services
            - Outsourced Activities: Emissions from subcontracted or outsourced operations
            
        Note: Ensure that the KPI extracted (if found) is presented in the format of "X Tonnes," where X is the converted quantity in Tonnes.
        Conversion Factors:
            1 Mega Tonne (MT) = 1,000,000 Tonnes
            1 Kilo Tonne (KT) = 1,000 Tonnes
            1 Tonne (T) = 1 Tonne (no conversion needed)
        """,
        "scope_3_emissions",
        ['3', '2eq', '2', '02', '02eq', '0', f'{year}']
    ),
    (
        """
            Does the report provide comprehensive data and insights on the company's employee turnover rate?
        """,
        """
            Does the report provide comprehensive data and insights on the company's employee turnover rate?
            Specifically, look for:
                - Annual Turnover/Attrition Rate: The percentage of employees who left during the last fiscal year
                - Voluntary vs. Involuntary Turnover: Differentiation between employees leaving by choice and those leaving due to company decisions
        """,
        "employee_turnover", 
        [f'{year}'],
    ),
    (
        f"""
           Total energy consumption of {comp_name}
        """,
        """
            Does the company report energy consumption? Specifically:
                - The amount of energy utilized by the company 
            Note that this may be presented in kWh, MWh, or GWh.
        """,
        "energy_consumption",
        [f'{year}']
    ),  # TODO: improrve doc injestion. did not find answer because table was not loaded by loader.
    (
        """
           Does the company report the gender pay gap?
        """,
        """
            Does the company report the gender pay gap? Specifically: 
                - The ratio of the difference between average gross yearly earnings of male vs female employees over that of male employees.
        """,
        "gender_pay_gap",
        [f'{year}']
    )
]

# quantitative loop
for sim_prompt, gpt_prompt, key, exclude_values in kpi_quantitative_prompts:
    
    print(f"{sim_prompt}\n\n\n")
    
    context = get_context(
        sim_prompt, 
        vector_store,
        n=4,  # The information will most probably appear in a single table.
        is_flag=False,  # Use raw tabular information 
    )
    
    full_prompt = f"CONTEXT:\n{context}\n\n" + \
        f"Given the above information can you please answer:\n" + \
        f"QUESTION:\n{gpt_prompt}\n\n" + \
        f"EXPECTED OUTPUT: Please reason about the question and then, after explaining your thoughts, finalise " + \
        "your answer by outputting a single (total) numerical value. " + \
        "Include units if applicable (for example output: 10.05 or 100.1m³). " + \
        "If the information is not available in the given context respond with 'NaN'. "

    print(
        f"INPUT: {full_prompt}\n\n"
    )
    
    parameters['messages'][-1]['content'] = full_prompt
    
    try:
        response = openai.ChatCompletion.create(
          **parameters
        )['choices'][0]['message']['content']
    except openai.error.InvalidRequestError:
        # USE MODEL WITH WIDER CONTEXT WINDOW
        _model = parameters['model']
        parameters['model'] = "gpt-3.5-turbo-16k"
        response = openai.ChatCompletion.create(
          **parameters
        )['choices'][0]['message']['content']
        parameters['model'] = _model  # reset
    
    kpi_results[key] = {
        SIMILARITY_PROMPT_KEY: sim_prompt,
        GPT_PROMPT_KEY: full_prompt,
        RAW_RESULT_KEY: response,
        RESULT_KEY: parse_quantitative_output(
            response, exclude_values=exclude_values
        )
    }
    
    print(
        f"RAW OUTPUT: {response}\n\n" +
        f"SAVED OUTPUT: {kpi_results[key][RESULT_KEY]}" +
        "---"*40 +
        '\n\n\n'
    )



        What were the reported greenhouse gas (GHG) emissions from Scope 1 categories in 2022?
        



INPUT: CONTEXT:
Rank: 1 | Content: <<F152>>: Text: The company aims to reduce greenhouse gas (GHG) emissions throughout its project value chain. This includes exchanging energy and emissions performance data. They are committed to gaining more understanding of their major emissions categories and setting specific targets and actions based on an analysis of their Scope 3 emissions and related Life Cycle Assessments. The information provided does not specify any specific progress made in 2022. | Metadata: DEME_Group Annual report with sustainability (2022)

Rank: 2 | Content: <<F228>>: Text: The company aims to achieve net zero greenhouse gas emissions by balancing its CO2 emissions from Scope 1 and Scope 2 with the emissions that are removed through natural absorption. | Metadata: DEME_Group Annual report with sustainability (2022)

Rank: 3 | Content: In 2022, total Scope 1 and 2 

RAW OUTPUT: The provided context does not provide specific information on the volume of emissions from purchased electricity or heat (Scope 2 indirect emissions), the metrics used to calculate these emissions (Grid Emissions Factor), or whether Renewable Energy Credits (RECs) are used to offset these emissions. While the company does mention that it follows the Greenhouse Gas Protocol and reports its GHG emissions according to three scopes, including Scope 2, the exact figures or methods for Scope 2 are not specified. Therefore, the answer to all three parts of the question is 'NaN'.

SAVED OUTPUT: 2are------------------------------------------------------------------------------------------------------------------------




        What were the greenhouse gas (GHG) emissions from Scope 3 categories: 
        for instance supply chain emissions, business travel, product lifecycle, and outsourced activities
        in 2022?
        



INPUT: CONTEXT:
Rank: 1 | Content: <<F152>>: Text:

RAW OUTPUT: The provided context does not include any information about the company's employee turnover rate. The term "turnover" in the context refers to the company's revenue, not employee attrition. There is also no differentiation between voluntary and involuntary turnover. Therefore, the data needed to calculate the annual turnover/attrition rate and the voluntary vs. involuntary turnover is not available in the provided context.

EXPECTED OUTPUT: NaN

SAVED OUTPUT: None------------------------------------------------------------------------------------------------------------------------




           Total energy consumption of DEME_Group
        



INPUT: CONTEXT:
Rank: 1 | Content: {'table': [{'table': ['ENERGY TYPE', '= FUEL']}]}, {'table': [{'table': ['ENERGY TYPE', '= FUEL']}]}, {'table': [{'table': ['ENERGY TYPE', '= ELECTRICITY']}]}, {'table': [{'table': ['ENERGY TYPE', '= ELECTRICITY']}]}, {'table': [{'table': ["--\t\tIn 2022, total Scope 1 and 2 greenhouse gas emissio

RAW OUTPUT: The provided context does not include specific information about the gender pay gap at DEME Group. While there is data on the number of male and female employees, both full-time and part-time, there is no information on their respective average gross yearly earnings. Therefore, it is not possible to calculate the ratio of the difference between average gross yearly earnings of male vs female employees over that of male employees.

EXPECTED OUTPUT: NaN

SAVED OUTPUT: None------------------------------------------------------------------------------------------------------------------------





### TODOs:
- Improve output parser for quantitative outputs.
- Connect gpt-4 to KPI extraction tool.
- ENERGY CONSUMPTION TABLE NOT IN MEMORY - may need to modify load procedure of documents. 

### OPTIONAL EXPERIMENT; Set up Agent:
We will create an agent cabale of:
- Fetching contextual information from the vector store
- Performing basic math operations 
The thought being that this way the agent may be able to compute KPIs which require basic math operations to compute.

**Setup an agent just like in url_retrieval.ipynb.**

Note that a `utils.py` file will be created to store these in the future.

In [64]:
llm = OpenAI(
    temperature=0,
    model_name='gpt-3.5-turbo'
)
llm_math = LLMMathChain(llm=llm)

# initialize the math tool
math_tool = Tool(
    name='Calculator',
    func=llm_math.run,
    description='Useful when you need to perform math operations.'
)
# when giving tools to an LLM, we must pass them as a list of tools.
tools = [math_tool, AgentContextRetrieval()]

In [65]:
# Set up the base template
template = """You are an analyst tasked with aggregating financial and ESG KPIs about companies. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin! Remember to answer as succinctly as possible when giving your final answer. The final answer, where possible, should just be a number or a boolean.

Question: {input}
{agent_scratchpad}"""

In [66]:
# Set up a prompt template which breaksup the intermediate_steps
# into thoughts that are used to fill the agent_scratchpad, 
# tools, and tool_names in the base template:
class CustomPromptTemplate(StringPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[BaseTool or Tool]
    
    def format(self, **kwargs) -> str:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        return self.template.format(**kwargs)

In [67]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # This includes the `intermediate_steps` variable because that is needed
    input_variables=["input", "intermediate_steps"]
)

In [68]:
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)

In [69]:
llm = OpenAI(
    temperature=0,  # measure of randomness/creativity
    model_name=model
)

# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(
    llm=llm, 
    prompt=prompt  # Custom Prompt
)

tool_names = [tool.name for tool in tools]

agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=CustomOutputParser(),
    stop=["\nObservation:"],  # you want this to be whatever token you use in the prompt to denote the start of an Observation
    allowed_tools=tool_names
) 

agent_executor = AgentExecutor.from_agent_and_tools(
    agent=agent, 
    tools=tools, 
    verbose=True,
    max_iterations=3
)

In [71]:
result = agent_executor.run(
    input=f"what were the scope 1 greenhouse gas (GHG) emissions of {comp_name} in 2022?"
)



[1m> Entering new  chain...[0m
[32;1m[1;3mThought: I need to retrieve the information about Colruyt_Group's scope 1 greenhouse gas emissions in 2022.
Action: information_retrival
Action Input: Company: Colruyt_Group, Year: 2022, KPI: Scope 1 GHG emissions[0m

Observation:[33;1m[1;3mRank: 1 | Content: <<F60>>: Text: Colruyt_Group's scope 1 and 2 action plans target the three main sources of greenhouse gas emissions: cooling, heating, and mobility. Their goal is to reduce emissions by 42% by 2030. | Metadata: Colruyt_Group Annual report with sustainability (2022)

Rank: 2 | Content: TABLE: {'title': 'Greenhouse gas emissions', 'col_headers': ['Greenhouse gas emissions scope 1 (in tonnes CO2eq)', 'Calendar year', '–', '90.889 (1)', '83.194'], 'table': [{'Greenhouse gas emissions scope 1 (in tonnes CO2eq)': '% greenhouse gas emissions scope 1, regulated through emission allowances trading', 'Calendar year': 'Calendar year', '–': '0', '90.889 (1)': '0', '83.194': '0'}, {'Greenhouse

[32;1m[1;3mI have retrieved the information about Colruyt_Group's scope 1 greenhouse gas emissions in 2022. Now I need to extract the specific value.
Action: information_retrival
Action Input: Content: TABLE: {'title': 'Greenhouse gas emissions', 'col_headers': ['Greenhouse gas emissions scope 1 (in tonnes CO2eq)', 'Calendar year', '–', '90.889 (1)', '83.194'], 'table': [{'Greenhouse gas emissions scope 1 (in tonnes CO2eq)': '% greenhouse gas emissions scope 1, regulated through emission allowances trading', 'Calendar year': 'Calendar year', '–': '0', '90.889 (1)': '0', '83.194': '0'}, {'Greenhouse gas emissions scope 1 (in tonnes CO2eq)': 'Greenhouse gas emissions scope 2: location-based (in tonnes CO2eq)', 'Calendar year': 'Calendar year', '–': '–', '90.889 (1)': '31.634 (1)', '83.194': '35.935'}, {'Greenhouse gas emissions scope 1 (in tonnes CO2eq)': 'Greenhouse gas emissions scope 2: market-based (in tonnes CO2eq)', 'Calendar year': 'Calendar year', '–': '–', '90.889 (1)': '232 (

#### TODO: 
- If you wanted to get the above to run then you'd have to create more tooling to extract info from tables etc. You could connect this to an LLM.

In wrapping up our discussion on this prototype, it's essential to emphasize the diverse range of techniques available for extracting key performance indicators (KPIs). To provide a clearer understanding, here are some elaborated strategies:

    LLM Model Fine-Tuning with Sustainability Reports: One approach is to refine a large language model (LLM) specifically with corporate sustainability reports. Once the model has been trained with this data, it can potentially answer questions related to KPIs even without needing explicit context. This is because the relevant information would already be incorporated into its knowledge base.

    Combining Fine-Tuning with Context Window: If solely fine-tuning the latter layers of a transformer network does not yield the desired results, a hybrid method could be more effective. By combining the fine-tuning process with a context window, one could leverage the strengths of both strategies. This would essentially mean refining the model with specific data while also providing a relevant contextual frame during the extraction phase, ensuring more accurate and relevant KPI responses.

    Full Report Processing for KPI Extraction: In scenarios where budget isn't a constraint, an even more comprehensive approach would be to feed the entire sustainability report into the model to retrieve answers to KPI-related queries. This method ensures that no contextual details are missed and that the most accurate and relevant answers are obtained.

    (And so on...)

It's worth noting that the best method would largely depend on the specific requirements, constraints, and objectives of a given project.