# Belfius Alytics (Part 2)
Inspiration:

- https://github.com/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb

- https://www.youtube.com/watch?v=RIWbalZ7sTo

- https://colab.research.google.com/drive/13FpBqmhYa5Ex4smVhivfEhk2k4S5skwG?usp=sharing#scrollTo=RSdomqrHNCUY

- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb

Future ideas:

- Convert docx to latex (https://www.vertopal.com/en/download#96a5acdd2afa4e3aaf723be0ea7b71ad).

### Handle imports:

In [1]:
# Move to root directory
import os

notebooks_dir = 'notebooks'
if notebooks_dir in os.path.abspath(os.curdir):
    while not os.path.abspath(os.curdir).endswith('notebooks'):
        print(os.path.abspath(os.curdir))
        os.chdir('..')
    os.chdir('..')  # to get to root

print(os.path.abspath(os.curdir))

C:\Users\MD726YR\PycharmProjects\eyalytics


In [2]:
# Suppress SSL verification (EY problem):
import requests

from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Suppress the warning from urllib3.
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)

old_send = requests.Session.send

def new_send(*args, **kwargs):
    kwargs['verify'] = False
    return old_send(*args, **kwargs)

requests.Session.send = new_send

In [3]:
# Import relevant libraries for langchain retrieval:
import openai
import tiktoken

from langchain import OpenAI, LLMChain, PromptTemplate
from langchain.prompts import StringPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS  # facebook ai similarity search 
from langchain.chains import LLMMathChain
from langchain.tools import BaseTool
from langchain.agents import (
    AgentExecutor, LLMSingleActionAgent, AgentOutputParser, 
    AgentType, initialize_agent, Tool
)
from langchain.callbacks import get_openai_callback
from langchain.schema import AgentAction, AgentFinish

**In case you want to use Chroma instead of FAISS:**

`from langchain.vectorstores import Chroma`
    
Note: to use Chroma you need to install `chromadb`, which requires Microsoft Visual C++ 14.0 or greater. To install it:

a. Install Microsoft C++ Build Tools: Visit the link provided in the error message (https://visualstudio.microsoft.com/visual-cpp-build-tools/) and install the Microsoft C++ Build Tools.

b. Ensure the Correct Version: Ensure that you have the required version (14.0 or greater) of the C++ build tools installed.

c. Add to PATH: Ensure the tools are added to your system PATH. Usually, the installer should take care of this. But if the problem persists, you might need to verify and add them manually.

d. Restart Your System: Sometimes, after installing such tools, a system restart might be required for the environment variables (like PATH) to update correctly.

**Checks:**
Check if Visual C++ Build Tools is Installed:
- Press Windows + I to open the Settings app.
- Go to "Apps".
- Now in the "Apps & features" tab, search for "Visual Studio".
- Check if there's an installation called "Microsoft Visual Studio" (it might also be "Visual Studio Build Tools").

Check for the Required Components:
- If you find "Microsoft Visual Studio" or "Visual Studio Build Tools" in the list, click on it and then select "Modify".
- This will bring up the Visual Studio Installer.
- Here, ensure that the "Desktop development with C++" workload is checked. Specifically, make sure "MSVC v142 - VS 2019 C++ x64/x86 build tools" (or a similar option) is selected. This provides the C++ compiler that's needed.
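
Before working through the checks above, it can help to confirm whether `chromadb` can be located in the current environment at all. The following is a minimal sketch using only the standard library; the helper name `pick_vectorstore_backend` is hypothetical and not part of this notebook. Note that it only checks that the module can be found, not that it imports without errors.

```python
import importlib.util


def pick_vectorstore_backend(preferred: str = "chromadb", fallback: str = "faiss") -> str:
    """Return `preferred` if that top-level module can be located, else `fallback`."""
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(preferred) is not None:
        return preferred
    return fallback


# Hypothetical usage: decide which vector store import to attempt.
print(pick_vectorstore_backend())
```

If this prints `faiss`, the `chromadb` installation did not succeed and the FAISS import used in this notebook remains the safer choice.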

In [60]:
# libraries for URL pdf loading
import time
import docx
import pyautogui
from docx.oxml.table import CT_Tbl
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

In [5]:
# Other libraries:
import re
import pickle 
import difflib
import math
from uuid import uuid4
from typing import List, Any, Union, Optional

# for progress bars in loops
from tqdm.auto import tqdm

In [6]:
# Get API and ENV keys:
from dotenv import load_dotenv

load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise KeyError(
        "You will need an OPENAI_API_KEY to use the LLM models in this notebook."
    )
openai.api_key = os.getenv("OPENAI_API_KEY")

## Commence Langchain Retrieval Augmentation Tool Development:

In [61]:
FIGURE_THRESHOLD = 0.1
EPSILON = 1e-10
REPEAT_THRESHOLD = 4
MAX_CHAR_COUNT_FOR_FIGURE  = 20
FIGURE_RELATED_CHARS = r"[0123456789.%-]"
COMMON_UNITS = ["kg", "m", "s", "h", "g", "cm", "mm", "l", "ml"]
DOCX_NAMESPACE = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
# DICT KEYS 
METADATA_KEY = "metadata"
COMP_KEY = "company"
REPORT_KEY = "report"
FIGURE_KEY = "potential_figure"
FIGURE_SUMMARY_KEY = "figure_summary"
TABLE_KEY = "table"
TABLE_SUMMARY_KEY = "table_summary"
TEXT_KEY = "text"
DOC_CONTENT = {}


# URL -> file name converter:
def url2fname(url):
    # Split the URL by '/' and get the last segment
    last_segment = url.split('/')[-1]
    
    # Use regex to remove any suffix after the dot and the dot itself
    cleaned_name = re.sub(r'\..*$', '', last_segment)
    
    return cleaned_name
    
    
# Create URL loader:
def wait_for_file(file_path: str, timeout: int = 60) -> bool:
    """
    Wait for a file to be present at a specified path within a given timeout.
    
    Args:
        file_path (str): Path to the file.
        timeout (int): Maximum waiting time in seconds. Default is 60 seconds.

    Returns:
        bool: True if file is found within the timeout, False otherwise.
    """
    start_time = time.time()

    while time.time() - start_time < timeout:
        if os.path.exists(file_path):
            return True
        time.sleep(1)

    return False


def download_pdf_from_url(url: str, save_path: str) -> str:
    """
    Download a PDF from the specified URL and save it to a local path.
    
    Args:
        url (str): URL of the PDF.
        save_path (str): Local path to save the downloaded PDF.

    Returns:
        str: Path to the saved PDF if successful, None otherwise.
    """
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, 'wb') as file:
            file.write(response.content)
        if os.path.exists(save_path):
            return save_path
    return None


def convert_pdf_to_docx(
    pdf_filename: str, driver_path: str, pdf_folder_path: str, docx_folder_path: str
) -> str:
    """
    Convert a PDF to a DOCX using Adobe's online tool.
    
    Args:
        pdf_filename (str): Filename of the PDF.
        driver_path (str): Path to the geckodriver executable.
        pdf_folder_path (str): Directory where the PDF is located.
        docx_folder_path (str): Directory where the converted DOCX should be saved.

    Returns:
        str: Path to the converted DOCX if successful, None otherwise.
    """
    # WebDriver setup and configurations
    firefox_options = webdriver.FirefoxOptions()
    firefox_options.set_preference("browser.download.folderList", 2)
    firefox_options.set_preference("browser.download.dir", docx_folder_path)
    firefox_options.set_preference("browser.download.useDownloadDir", True)
    firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
    
    service = Service(driver_path)
    driver = webdriver.Firefox(service=service, options=firefox_options)
    wait = WebDriverWait(driver, 180)
    driver.get("https://www.adobe.com/be_en/acrobat/online/pdf-to-word.html")

    # Upload the PDF
    upload_btn = wait.until(EC.element_to_be_clickable((By.ID, "lifecycle-nativebutton")))
    upload_btn.click()

    full_pdf_path = os.path.join(pdf_folder_path, pdf_filename)
    if not os.path.exists(full_pdf_path):
        print(f"File path\n{full_pdf_path}\nis not valid.")
        driver.quit()  # Close the browser before bailing out
        return None
    
    # Wait for the file selection dialog and input the file path using pyautogui
    time.sleep(5)
    # Use the path in pyautogui
    pyautogui.typewrite(full_pdf_path)

    # Add a slight delay and then press 'enter' multiple times
    time.sleep(2)
    for _ in range(3):
        pyautogui.press('enter')
        time.sleep(0.1)
    time.sleep(10)
    
    retries = 3
    while retries > 0:
        try:
            # Check for cookie notification and click if exists
            try:
                cookie_reject_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-reject-all-handler")))
                cookie_reject_btn.click()
            except TimeoutException:  # This exception is more specific to WebDriverWait than a general Exception.
                print("Cookie settings notification not found or failed to click.")

            # Wait and click the download button
            download_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.Download__downloadButton___2qFEa')))
            download_btn.click()
            break  # If successful, break out of the loop
        except TimeoutException:
            retries -= 1
            if retries == 0:
                raise  # Re-raise the exception if all retries are exhausted
            print(f"Attempt {3 - retries} failed. Retrying...")
            time.sleep(10)  # Wait for 10 seconds before retrying
    time.sleep(10)
    driver.quit()

    expected_docx_filename = pdf_filename.replace('.pdf', '.docx')
    expected_docx_filepath = os.path.join(docx_folder_path, expected_docx_filename)

    return expected_docx_filepath if wait_for_file(expected_docx_filepath, 55) else None


def convert_url_pdf_to_docx(
    pdf_url: str, 
    driver_path: str = "./drivers/geckodriver.exe", 
    pdf_folder_path: str = None, 
    docx_folder_path: str = None
) -> str:
    """
    Download a PDF from a URL, convert it to DOCX, and save it locally.
    
    Args:
        pdf_url (str): URL of the PDF.
        driver_path (str): Path to the geckodriver executable. Default is './drivers/geckodriver.exe'.
        pdf_folder_path (str): Directory to save the downloaded PDF. Default is '<cwd>/data/pdf_db'.
        docx_folder_path (str): Directory to save the converted DOCX. Default is '<cwd>/data/docx_db'.

    Returns:
        str: Path to the converted DOCX if successful, None otherwise.
    """
    cwd = os.getcwd()
    pdf_folder_path = pdf_folder_path or os.path.join(cwd, "data", "pdf_db")
    docx_folder_path = docx_folder_path or os.path.join(cwd, "data", "docx_db")

    os.makedirs(pdf_folder_path, exist_ok=True)
    os.makedirs(docx_folder_path, exist_ok=True)

    pdf_filename = pdf_url.split('/')[-1]
    pdf_save_path = os.path.join(pdf_folder_path, pdf_filename)

    if download_pdf_from_url(pdf_url, pdf_save_path):
        return convert_pdf_to_docx(pdf_filename, driver_path, pdf_folder_path, docx_folder_path)
    return None


def extract_footnotes_from_para(para, next_para=None):
    """Extract footnote references and actual footnotes from a paragraph."""
    footnotes = []
    
    footnote_refs = para._element.findall('.//w:footnoteReference', namespaces=DOCX_NAMESPACE)

    for ref in footnote_refs:
        footnote_id = ref.get("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id")
        footnote = para.part.footnotes_part.footnote_dict[footnote_id]
        footnotes.append(footnote.text)

    # Check in the next paragraph for footnotes if provided
    if next_para:
        next_footnote_refs = next_para._element.findall('.//w:footnoteReference', namespaces=DOCX_NAMESPACE)
        for ref in next_footnote_refs:
            footnote_id = ref.get("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id")
            footnote = next_para.part.footnotes_part.footnote_dict[footnote_id]
            footnotes.append(footnote.text)

    return footnotes


def process_footnotes(text, footnotes):
    """Process and embed footnotes into the text."""
    
    # Existing replacement for footnotes within brackets
    for idx, footnote in enumerate(footnotes, 1):
        text = re.sub(r"\[{}\]".format(idx), "[{}]".format(footnote), text)

    # New addition: replace footnotes appearing directly after words or at the end of sentences
    for idx, footnote in enumerate(footnotes, 1):
        # This regex will look for a number that doesn't have another number directly before it (to differentiate from normal numbers within the text)
        pattern = r'(?<![0-9])' + str(idx) + r'(?![0-9])'
        replacement = "[{}]".format(footnote)
        text = re.sub(pattern, replacement, text)

    return text


def contains_unit(text):
    """Check if text contains a common unit following a number."""
    for unit in COMMON_UNITS:
        # Check for patterns like '123 kg', '123kg', '0.5 m', '0.5m', etc.
        if re.search(r'\d\s?' + re.escape(unit) + r'(?![a-zA-Z])', text):
            return True
    return False


def is_potential_figure_data(text):
    if text is None:
        return False

    text_count = len(text) - text.count(' ')
    figure_char_count = len(re.findall(FIGURE_RELATED_CHARS, text))
    char_count = text_count - figure_char_count

    # New: Check for units
    contains_common_units = contains_unit(text.lower())  # Convert text to lowercase for this check
    
    # New: Check for percentage patterns
    contains_percentage = "%" in text and any(char.isdigit() for char in text)

    if (text_count == 0) or text.endswith(('.', ':', ';', ',')) or (char_count > MAX_CHAR_COUNT_FOR_FIGURE):
        return False

    if (char_count == 0) or (figure_char_count / (char_count + EPSILON) > FIGURE_THRESHOLD) or contains_common_units or contains_percentage:
        return True
    
    return False


def repeated_artifact_check(line, artifact_dict):
    """Check if a line is a repeated artifact and update its count."""
    if line in artifact_dict:
        artifact_dict[line] += 1
        if artifact_dict[line] > REPEAT_THRESHOLD:
            return True  # It's a repeated artifact
    else:
        artifact_dict[line] = 1
    return False

    
def read_docx(
    file_path: str, 
    comp_name: str = None, 
    report_name: str = None,
) -> dict:
    
    def _is_empty(text):
        if len(text) == 0:
            return True
        return False
    
    doc = docx.Document(file_path)
    
    result = {
        METADATA_KEY: {
            'title': doc.core_properties.title,
            'author': doc.core_properties.author,
            'created': doc.core_properties.created,
            COMP_KEY: comp_name,
            REPORT_KEY: report_name,
        },
        TEXT_KEY: [],
        TABLE_KEY: [],
        FIGURE_KEY: []
    }
    
    figure_data_group = {'title': None, 'data': []}
    current_is_figure_data = False
    previous_text = None
    
    artifact_dict = {}
    
    for current_elem, next_elem in tqdm(zip(doc.element.body, doc.element.body[1:] + [None])):
        
        # Paragraph
        if current_elem.tag.endswith('p'):
            
            current_para = docx.text.paragraph.Paragraph(current_elem, None)
            next_para = docx.text.paragraph.Paragraph(next_elem, None)
            
            processed_text = current_para.text.strip()
            # Ignore empty lines or repeated lines 
            if _is_empty(processed_text) or repeated_artifact_check(processed_text, artifact_dict):
                current_para = None
                continue
            try: 
                next_text = next_para.text
            except AttributeError: 
                next_text = None

            # Process footnotes
            footnotes = extract_footnotes_from_para(current_para, next_para)
            processed_text = process_footnotes(processed_text, footnotes)
            
            # Identify if the current line is potential figure data
            previous_was_figure_data = current_is_figure_data  # Move the window forward
            current_is_figure_data = is_potential_figure_data(processed_text)
            next_is_figure_data = is_potential_figure_data(next_text)
            
            if current_is_figure_data:
                
                # If previous line was also figure data, they belong to the same figure
                if previous_was_figure_data:
                    figure_data_group['data'].append(processed_text)
                else:
                    # If a new figure starts, save the previous figure (if there was any)
                    if figure_data_group['data']:
                        result[FIGURE_KEY].append(figure_data_group)
                        figure_data_group = {'title': None, 'data': []}

                    # Assign the previous line as the title for the current figure
                    figure_data_group['title'] = previous_text
                    figure_data_group['data'].append(processed_text)
                    
            elif not next_is_figure_data:  # neither the current nor the next text is figure data
                # Not a figure, add to text
                result[TEXT_KEY].append(processed_text)
            else:  # next text is figure, meaning that current text will be stored as title.
                pass

            # Handles the case where text potentially interrupts a figure.
            # Text is again stored in figure_data_group['data'].
            if previous_was_figure_data and next_is_figure_data:
                current_is_figure_data = True
                
            # only change previous text when current para is not empty or repeated string.
            previous_text = processed_text
                
        # Table
        elif current_elem.tag.endswith('tbl'):
            table_index = [tbl._element for tbl in doc.tables].index(current_elem)
            table = doc.tables[table_index]

            headers = [cell.text.strip() for cell in table.rows[0].cells]

            rows = []
            for row in table.rows[1:]:
                row_data = {headers[j]: cell.text.strip() for j, cell in enumerate(row.cells)}
                rows.append(row_data)

            result[TABLE_KEY].append({
                'title': previous_text,
                'col_headers': headers,
                'table': rows
            })
        else:
            print(f"Ignoring {current_elem.tag}.")
            
    return result

  
# Testing the functions

# COCA COLA:
# comp_name = 'Coca-Cola'
# report_name = 'Sustainability Report (2022)'
# pdf_url = "https://www.coca-colacompany.com/content/dam/company/us/en/reports/coca-cola-business-and-sustainability-report-2022.pdf"

# Colruyt Group
comp_name = "Colruyt_Group"
report_name = "Annual report with sustainability (2022)"
pdf_url = "https://www.colruytgroup.com/content/dam/colruytgroup/investeren/jaarverslag-met-duurzaamheidsrapportering/pdf/en/annual-report-with-sustainability-reporting-2022-2023.pdf"

fname = url2fname(pdf_url)  # used to save vectordb
is_scrape = True  # Set to False to avoid web scraping. Note that in this case a pre-prepared DOCX is used.
docx_dir = r"C:\Users\MD726YR\PycharmProjects\eyalytics\data\docx_db"

try:
    if is_scrape:
        docx_path = convert_url_pdf_to_docx(pdf_url)
    else:
        docx_path = fr"{docx_dir}\annual-report-with-sustainability-reporting-2022-2023.docx"
#         docx_path = fr"{docx_dir}\coca-cola-business-and-sustainability-report-2022.docx"

    if not docx_path:
        print("Failed to convert PDF to DOCX. Exiting...")
        exit(1)

    doc_contents = read_docx(
        docx_path, comp_name=comp_name, report_name=report_name
    )

    print(
        f"SUMMARY:\n{len(doc_contents[TEXT_KEY])} paragraphs, "
        f"{len(doc_contents[TABLE_KEY])} tables, and {len(doc_contents[FIGURE_KEY])} figures "
        f"were extracted."
    )
    
    # Example usage:
    print("\n\nText Extracted:")
    for para in doc_contents[TEXT_KEY]:
        print(para)
    print('---'*50)

    print("\n\nTables Extracted:")
    for table in doc_contents[TABLE_KEY]:
        print(table)

    print("\n\nFigures Extracted:")
    for figure in doc_contents[FIGURE_KEY]:
        print(figure)
    print('---'*50)
    
except Exception as e:
    print(f"An error occurred: {e}")

Attempt 1 failed. Retrying...
Cookie settings notification not found or failed to click.


0it [00:00, ?it/s]

Ignoring {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt.
Ignoring {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sectPr.
SUMMARY:
3280 paragraphs, 144 tables, and 92 figures were extracted.


Text Extracted:
Halle, 9 June 2023 FINANCIAL YEAR 2022/23
Annual report presented by the Board of Directors to the Ordinary General Meeting of Shareholders
of 27 September 2023 and Independent auditor’s report
The Dutch annual report in the European Single Electronic Format (ESEF) is the only official version.
Dit jaarverslag is ook verkrijgbaar in het Nederlands.
Ce rapport annuel est également disponible en français.
Financial year 2022/23 covers the period from 1 April 2022 to 31 March 2023.
This annual report is also available on colruytgroup.com/en/annualreport. Our corporate website also includes all press releases, extra stories and background information.
Word from the Chairman
2022/23 was an eventful, challenging financial year, in which we have continued

**FUTURE IMPROVEMENTS**: 
- Trying docx2python may improve information retrieval from a DOCX.
- Trying lxml may also be a better solution than python-docx.
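
As a rough illustration of the direct XML route, the sketch below walks the top-level body elements of a DOCX and yields paragraph and table text in document order. It is an assumption-laden example rather than a replacement for `read_docx`: it uses the standard library's `xml.etree.ElementTree` instead of lxml, the function name `iter_body_blocks` is hypothetical, and the demo document is a synthetic in-memory file, not one of the reports used here.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def iter_body_blocks(docx_file):
    """Yield (kind, text) for each top-level paragraph ('p') or table ('tbl')."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    body = root.find(f"{{{W_NS}}}body")
    for child in body:
        kind = child.tag.split("}")[-1]  # strip the XML namespace
        if kind in ("p", "tbl"):
            # Concatenate every text run found below this element.
            text = "".join(t.text or "" for t in child.iter(f"{{{W_NS}}}t"))
            yield kind, text


def _demo_docx() -> io.BytesIO:
    """Build a tiny synthetic DOCX-like archive holding one paragraph and one table."""
    xml = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Word from the Chairman</w:t></w:r></w:p>"
        "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>France</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


blocks = list(iter_body_blocks(_demo_docx()))
print(blocks)
```

Because paragraphs and tables come out of a single ordered pass, this approach would avoid looking up each table's index in `doc.tables`; whether it extracts cleaner text from the real reports is untested.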

In [18]:
import docx2python
import pandas as pd

_doc_content = docx2python.docx2python(docx_path)

for elem in _doc_content.body:
    print(elem)

[[[]]]
[[['2022 BUSINESS & SUSTAINABILITY REPORT']]]
[[['2022 BUSINESS & SUSTAINABILITY REPORT']]]
[[['----media/image1.jpeg--------media/image2.png--------media/image3.png--------media/image1.jpeg--------media/image2.png--------media/image3.png----Refresh the World. Make a Difference.', '', 'CONTENTS', '\t\nCEO MESSAGE\tEXECUTIVE SUMMARY', '\nOUR COMPANY', '\nWATER', '\nPORTFOLIO', '\nPACKAGING', '\nCLIMATE', '\nAGRICULTURE', '\nPEOPLE', '\nOPERATIONS', '\nDATA APPENDIX', '\nFRAMEWORKS', '', '', '', '']]]
[[['CONTENTS']]]
[[['CONTENTS']]]
[[['We build loved brands that bring joy to our consumers’ lives with beverage choices for all occasions, tastes and lifestyles. Our growth strategy is grounded in our core values and commitment to social and environmental responsibility.', '', '', '', '', '', '', '', '', '']]]
[[['CHAIRMAN & CEO MESSAGE', '3']]]
[[['CHAIRMAN & CEO MESSAGE', '3']]]
[[['WATER LEADERSHIP', '24']]]
[[['WATER LEADERSHIP', '24']]]
[[['PEOPLE & COMMUNITIES', '51']]]
[[['PE

### Obtain short summaries for tables.
To facilitate the encoding of tables, we ask an LLM to generate a textual summary of each table's contents. The idea is that the summary will yield better vector encodings than encoding the raw table directly. This adds cost, but should provide better context for our LLMs. Note that when a tabular chunk is selected as context, the table is fed to the engine together with its summary. The key when summarising is to keep the summaries short!
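
One possible way to keep the prompt itself compact is to flatten each table into pipe-separated rows before asking for a summary. The sketch below only builds the chat messages and makes no API call. The helper names `table_to_text` and `build_summary_messages`, and the `max_chars` cap, are hypothetical; the table dict layout (`title`, `col_headers`, `table`) follows what `read_docx` returns.

```python
from typing import Dict, List


def table_to_text(table: dict, max_chars: int = 2000) -> str:
    """Flatten a {'title', 'col_headers', 'table'} dict into pipe-separated rows."""
    headers = table["col_headers"]
    lines = [f"Title: {table['title']}", " | ".join(headers)]
    for row in table["table"]:
        # Rows are dicts keyed by header; missing cells become empty strings.
        lines.append(" | ".join(row.get(header, "") for header in headers))
    # Hypothetical cap so one very large table cannot dominate the prompt.
    return "\n".join(lines)[:max_chars]


def build_summary_messages(
    table: dict, comp_name: str, report_name: str, word_limit: int = 150
) -> List[Dict[str, str]]:
    """Build the system/user message pair for one table, without calling any API."""
    user_prompt = (
        f"In {word_limit} words or less, describe what the following table "
        f"from {comp_name}'s {report_name} displays:\n{table_to_text(table)}"
    )
    return [
        {"role": "system", "content": "You are a table summarizer."},
        {"role": "user", "content": user_prompt},
    ]


example_table = {
    "title": "Production and distribution centres",
    "col_headers": ["", "m2", "number"],
    "table": [
        {"": "Belgium and Luxembourg", "m2": "637.739", "number": "33"},
        {"": "France", "m2": "64.417", "number": "4"},
    ],
}
messages = build_summary_messages(
    example_table, "Colruyt_Group", "Annual report with sustainability (2022)"
)
print(messages[1]["content"])
```

Whether the flattened form produces better summaries than the raw dict is an open question; it mainly reduces the repeated header keys that the dict representation sends for every row.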

In [62]:
doc_contents[TABLE_SUMMARY_KEY] = []
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a table summarizer."
    }, {
        "role": "user", 
        "content": None,
    }, 
]

for i, table in tqdm(enumerate(doc_contents[TABLE_KEY], start=1)):
    
    parameters['messages'][-1]['content'] = f"In 150 words or less, describe what the following table from {comp_name}'s {report_name} displays:\n {table}."

    response = openai.ChatCompletion.create(
      **parameters
    )
    # Add <<T{i}>>: in front to facilitate later lookup
    doc_contents[TABLE_SUMMARY_KEY].append(
        f"<<T{i}>>: {response['choices'][0]['message']['content']}"
    )
    
    print(f"{table}\n\nSummary;\n {doc_contents[TABLE_SUMMARY_KEY][-1]}\n\n\n")

0it [00:00, ?it/s]

{'title': '2', 'col_headers': ['', '', ''], 'table': [{'': ''}]}

Summary;
 <<T1>>: The table from Colruyt_Group's Annual report with sustainability (2022) appears to be incomplete or missing key information. The table has a title labeled as '2' but does not provide any column headers or data. It only contains a single empty cell. Without further context or additional data, it is difficult to determine the purpose or content of this table. It is possible that this table may be a placeholder or an error in the report.



{'title': '2', 'col_headers': ['', '', '', '', ''], 'table': [{'': ''}, {'': ''}]}

Summary;
 <<T2>>: The table from Colruyt_Group's Annual report with sustainability (2022) is not provided in the given information. The table appears to have five columns, but the column headers are not specified. The table contains two rows, with each row having a single cell that is left blank. Without further information, it is not possible to determine the specific content or purpose

{'title': 'Production and distribution centres', 'col_headers': ['', 'm2', 'number'], 'table': [{'': 'Production and distribution centres', 'm2': 'Production and distribution centres', 'number': 'Production and distribution centres'}, {'': 'Belgium and Luxembourg', 'm2': '637.739', 'number': '33'}, {'': 'France', 'm2': '64.417', 'number': '4'}]}

Summary;
 <<T7>>: The table titled "Production and distribution centres" in Colruyt_Group's Annual report with sustainability (2022) displays information about the company's production and distribution facilities. The table has three columns: the first column represents the locations of the centres, the second column represents the total area in square meters (m2) of each centre, and the third column represents the number of centres in each location.

The table includes data for two locations: Belgium and Luxembourg, and France. In Belgium and Luxembourg, there are a total of 33 production and distribution centres with a combined area of 637,7

{'title': 'risks of force majeure: natural disasters, fires, acts of terrorism and power cuts', 'col_headers': ['Risk', 'Why is this a risk for us?', 'What are our mitigating actions?'], 'table': [{'Risk': 'STRATEGIC RISKS', 'Why is this a risk for us?': 'STRATEGIC RISKS', 'What are our mitigating actions?': 'STRATEGIC RISKS'}, {'Risk': 'Data and digitisation risk', 'Why is this a risk for us?': "Colruyt Group is committed to constantly updating its data systems and their use. The group's history and specific structure mean that IT changes frequently involve heavy expenditure. In the past, we largely self-managed our applications. These evolved at their own pace, not always in step with the outside world. We are now converting to new systems, but integrating with our existing systems is intensive work. This also demands that the organisation consciously deploy time and money in all the various projects it wants to carry out.\nIn addition, the world is becoming increasingly digital and 

{'title': 'risks of force majeure: natural disasters, fires, acts of terrorism and power cuts', 'col_headers': ['OPERATIONAL RISKS (CONTINUATION)', 'OPERATIONAL RISKS (CONTINUATION)', 'OPERATIONAL RISKS (CONTINUATION)'], 'table': [{'OPERATIONAL RISKS (CONTINUATION)': 'This is requiring us to organise ourselves differently, internally and externally. We secure our position by offering the right partnerships, excellent services and strong negotiation skills at all brand layers and for all categories.'}, {'OPERATIONAL RISKS (CONTINUATION)': 'Colruyt Group is actively involved in product and process quality, with a focus on the food and product safety of the products on its shelves.\nFor this, our food and product safety is continuously monitored and analysed, including an active focus on quality standards, certifications, norms and controls.\nFood Defence, Food Fraud and Food Safety Culture are also conscious points of attention.\nIn addition to the internal policy, agreements for permane

{'title': 'Calendar for shareholders', 'col_headers': ['13/09/2023', 'Record date for depositing shares for participation in the annual General Meeting of Shareholders'], 'table': [{'13/09/2023': '27/09/2023 (16h00)', 'Record date for depositing shares for participation in the annual General Meeting of Shareholders': 'General Meeting of Shareholders for the 2022/23 financial year'}, {'13/09/2023': '28/09/2023\n29/09/2023\n02/10/2023\n03/10/2023\n13/10/2023', 'Record date for depositing shares for participation in the annual General Meeting of Shareholders': 'Dividend for financial year 2022/23 (coupon no. 13)\nCum dividend date (last trading day on which the stock including dividends is traded) Ex-date (posting of coupons)\nRecord date (centralisation of coupons) Payability\nCertificates relating to exemption from or reduction of withholding tax on dividends must be in our possession'}, {'13/09/2023': '10/10/2023', 'Record date for depositing shares for participation in the annual Gene

{'title': 'Evolution of employees’ capital contribution', 'col_headers': ['Year', 'Amount (in million\nEUR)', 'Number of shares'], 'table': [{'Year': '2019', 'Amount (in million\nEUR)': '15,9', 'Number of shares': '380.498'}, {'Year': '2020', 'Amount (in million\nEUR)': '10,3', 'Number of shares': '222.372'}, {'Year': '2021', 'Amount (in million\nEUR)': '7,3', 'Number of shares': '184.228'}, {'Year': '2022', 'Amount (in million\nEUR)': '5,4', 'Number of shares': '238.500'}]}

Summary;
 <<T21>>: The table titled "Evolution of employees' capital contribution" in Colruyt_Group's Annual report with sustainability (2022) displays the changes in the employees' capital contribution over the years. The table has three columns: "Year," "Amount (in million EUR)," and "Number of shares." 

The data in the table shows the following information for each year: 
- In 2019, the employees' capital contribution was 15.9 million EUR, and the number of shares was 380,498.
- In 2020, the employees' capital

{'title': 'Circular water management', 'col_headers': ['Total water consumption (in m³)', 'Calendar year', '592.468', '560.578', '598.066'], 'table': [{'Total water consumption (in m³)': '% rainwater and wastewater', 'Calendar year': 'Calendar year', '592.468': '29', '560.578': '33,4', '598.066': '36,11'}, {'Total water consumption (in m³)': 'Recycled wastewater (in m³)', 'Calendar year': 'Calendar year', '592.468': '109.199', '560.578': '101.943', '598.066': '149.530'}]}

Summary;
 <<T28>>: The table titled "Circular water management" in Colruyt_Group's Annual report with sustainability (2022) displays data related to the company's water consumption and management practices. The table has three columns: "Total water consumption (in m³)", "Calendar year", and three numerical values representing water consumption for each year (592.468, 560.578, and 598.066).

The table also includes two rows of data. The first row provides information on the percentage of rainwater and wastewater in th

{'title': 'Learning and developing together', 'col_headers': ['Investment in education and training (in million EUR)', 'Financial year', '32,1', '39,1', '37,74'], 'table': [{'Investment in education and training (in million EUR)': '% payroll invested in education and training', 'Financial year': 'Financial year', '32,1': '2,41', '39,1': '2,82', '37,74': '2,61'}, {'Investment in education and training (in million EUR)': '# individual participants in personal growth and health training courses', 'Financial year': 'Financial year', '32,1': '1.562', '39,1': '1.548', '37,74': '2.702'}, {'Investment in education and training (in million EUR)': '# various personal growth and health training courses', 'Financial year': 'Financial year', '32,1': '73', '39,1': '55', '37,74': '82'}, {'Investment in education and training (in million EUR)': '# employees in a dual learning programme', 'Financial year': 'Financial year', '32,1': '185', '39,1': '211', '37,74': '240'}, {'Investment in education and tr

{'title': 'Coffee', 'col_headers': ['# coffee products', 'Calendar year', '125', '105', '141'], 'table': [{'# coffee products': '% purchased coffee beans with certification (Rainforest Alliance (incl. UTZ), BIO, Fairtrade)', 'Calendar year': 'Calendar year', '125': '99,6', '105': '100', '141': '100'}, {'# coffee products': '% purchased coffee products with certification (Rainforest Alliance (incl. UTZ), BIO, Fairtrade)', 'Calendar year': 'Calendar year', '125': '97', '105': '100', '141': '100'}, {'# coffee products': '', 'Calendar year': '', '125': '', '105': '', '141': ''}]}

Summary;
 <<T36>>: The table displays information about the sustainability of coffee products in Colruyt_Group's annual report for 2022. The title of the table is "Coffee," and it has three column headers: "# coffee products," "Calendar year," and three numerical values (125, 105, 141). 

The first row of the table provides data on the percentage of purchased coffee beans with certification from organizations lik

{'title': 'Wood', 'col_headers': ['# products containing at least 60% wood', 'Calendar year', '235', '246', '237'], 'table': [{'# products containing at least 60% wood': '% wood products with FSC or PEFC certification', 'Calendar year': 'Calendar year', '235': '100', '246': '100', '237': '100'}, {'# products containing at least 60% wood': '', 'Calendar year': '', '235': '', '246': '', '237': ''}]}

Summary;
 <<T41>>: The table displays information about the use of wood in Colruyt_Group's products. The column headers indicate the number of products containing at least 60% wood for each calendar year (235, 246, and 237). The first row of the table shows the percentage of wood products with FSC or PEFC certification for each calendar year, indicating the sustainability of the wood used in these products. The values for all three years (235, 246, and 237) are 100%, indicating that all wood products containing at least 60% wood had FSC or PEFC certification. The second row of the table is e

{'title': 'Avoiding and reducing food loss (3)', 'col_headers': ['% fresh produce actually sold', 'Calendar year', '97,33', '96,98', '96,83'], 'table': [{'% fresh produce actually sold': '% unsold food incinerated or fermented', 'Calendar year': 'Calendar year', '97,33': '66,8', '96,98': '65', '96,83': '61,3'}, {'% fresh produce actually sold': '% unsold food for human consumption', 'Calendar year': 'Calendar year', '97,33': '–', '96,98': '15,9', '96,83': '20,1'}, {'% fresh produce actually sold': '% unsold food to animal feed', 'Calendar year': 'Calendar year', '97,33': '–', '96,98': '18,8', '96,83': '18,2'}, {'% fresh produce actually sold': '% unsold food used in the biochemical industry', 'Calendar year': 'Calendar year', '97,33': '–', '96,98': '0,3', '96,83': '0,4'}, {'% fresh produce actually sold': '', 'Calendar year': '', '97,33': '', '96,98': '', '96,83': ''}]}

Summary;
 <<T47>>: The table displays the percentage of fresh produce actually sold by Colruyt_Group in three differ

{'title': 'Avoiding and reducing greenhouse gas emissions: scope 3', 'col_headers': ['% employees cycling to work', 'Financial year', '–', '19,5', '21'], 'table': [{'% employees cycling to work': '% employees coming to work by public transport', 'Financial year': 'Financial year', '–': '–', '19,5': '5,8', '21': '6,3'}, {'% employees cycling to work': '% employees carpooling to work', 'Financial year': 'Financial year', '–': '–', '19,5': '4', '21': '4'}, {'% employees cycling to work': '# truck journeys saved by the use of barges in Belgium', 'Financial year': 'Financial year', '–': '5.062', '19,5': '4.836', '21': '4.448'}, {'% employees cycling to work': '% outgoing deliveries in early mornings/late evenings and at night', 'Financial year': 'Financial year', '–': '-', '19,5': '46,2', '21': '46'}, {'% employees cycling to work': 'Load factor outgoing deliveries for Colruyt Lowest Prices (in %)', 'Financial year': 'Financial year', '–': '94,0', '19,5': '94', '21': '93,9'}]}

Summary;
 <<

{'title': 'and future adaptation measures. Read more about the risk assessment from p. 137.', 'col_headers': ['Activity number', 'Activity name', 'Colruyt Group’s main activities', 'Net turnover', 'CapEx', 'OpEx', 'Assessment using the technical screening criteria'], 'table': [{'Activity number': '1.1', 'Activity name': 'Afforestation', 'Colruyt Group’s main activities': 'Forest planting in the Democratic Republic of the Congo', 'Net turnover': '', 'CapEx': '•', 'OpEx': '', 'Assessment using the technical screening criteria': 'Working closely with the project team, the technical screening criteria were extensively reviewed and positively assessed, thanks to a well-supported afforestation plan and appropriate documentation. Among other things, the project is leading to a demonstrable improvement in terms of biodiversity and water management.'}, {'Activity number': '3.6\nNew', 'Activity name': 'Manufacture of other low-carbon technologies', 'Colruyt Group’s main activities': "Liquid ice 

{'title': 'and future adaptation measures. Read more about the risk assessment from p. 137.', 'col_headers': ['Activity number', 'Activity name', 'Colruyt Group’s main activities', 'Net turnover', 'CapEx', 'OpEx', 'Assessment using the technical screening criteria'], 'table': [{'Activity number': '7.2', 'Activity name': 'Renovation of existing buildings', 'Colruyt Group’s main activities': 'Renovation of branches and sites', 'Net turnover': '', 'CapEx': '', 'OpEx': '', 'Assessment using the technical screening criteria': 'After in-depth consultations, we have decided not to recognise a positive assessment for the renovation of our existing buildings yet. We prefer to take a conservative approach and trust that we will meet these criteria sufficiently quickly. The activity is thus not considered aligned.'}, {'Activity number': '7.3', 'Activity name': 'Installation, maintenance and repair of energy efficiency equipment', 'Colruyt Group’s main activities': 'LED lighting', 'Net turnover': 

{'title': 'Financial report', 'col_headers': ['(in million EUR)', 'Note', '2022/23', '2021/22(1)'], 'table': [{'(in million EUR)': 'Revenue', 'Note': '3.', '2022/23': '9.933,6', '2021/22(1)': '9.251,1'}, {'(in million EUR)': 'Cost of goods sold', 'Note': '3.', '2022/23': '(7.074,2)', '2021/22(1)': '(6.546,4)'}, {'(in million EUR)': 'Gross profit', 'Note': '3.', '2022/23': '2.859,4', '2021/22(1)': '2.704,7'}]}

Summary;
 <<T60>>: The table titled "Financial report" in Colruyt_Group's Annual report with sustainability (2022) displays three columns: "(in million EUR)", "Note", and "2022/23" along with a fourth column "2021/22(1)". 

The first row of the table shows the revenue for the years 2022/23 and 2021/22, which are 9.933,6 million EUR and 9.251,1 million EUR, respectively. 

The second row represents the cost of goods sold for the same years, which are (7.074,2) million EUR and (6.546,4) million EUR, respectively. The negative values in parentheses indicate a reduction in cost.

The

{'title': 'Given the nature of its activities, Colruyt Group does not rely on a limited number of major customers.', 'col_headers': ['(in million EUR)', 'Retail 2022/23(1)', 'Wholesale\nand Foodservice\n2022/23', 'Other activities\n2022/23', 'Operating segments 2022/23'], 'table': [{'(in million EUR)': 'Revenue - external', 'Retail 2022/23(1)': '8.749,9', 'Wholesale\nand Foodservice\n2022/23': '1.161,3', 'Other activities\n2022/23': '908,4', 'Operating segments 2022/23': '10.819,6'}, {'(in million EUR)': '', 'Retail 2022/23(1)': '', 'Wholesale\nand Foodservice\n2022/23': '', 'Other activities\n2022/23': '', 'Operating segments 2022/23': ''}, {'(in million EUR)': 'Revenue – internal', 'Retail 2022/23(1)': '72,3', 'Wholesale\nand Foodservice\n2022/23': '21,6', 'Other activities\n2022/23': '20,5', 'Operating segments 2022/23': '114,4'}, {'(in million EUR)': '', 'Retail 2022/23(1)': '', 'Wholesale\nand Foodservice\n2022/23': '', 'Other activities\n2022/23': '', 'Operating segments 2022/23'

{'title': 'Impairments amounting to EUR 27,9 million were realised on property, plant and equipment and intangible assets, mainly related to the loss-making activities of Dreamland and Dreambaby.', 'col_headers': ['(in million EUR)', 'Retail 2021/22(1)', 'Wholesale and Foodservice\n2021/22', 'Other activities\n2021/22', 'Operating segments 2021/22'], 'table': [{'(in million EUR)': 'Revenue - external', 'Retail 2021/22(1)': '8.164,9', 'Wholesale and Foodservice\n2021/22': '1.065,0', 'Other activities\n2021/22': '819,4', 'Operating segments 2021/22': '10.049,3'}, {'(in million EUR)': '', 'Retail 2021/22(1)': '', 'Wholesale and Foodservice\n2021/22': '', 'Other activities\n2021/22': '', 'Operating segments 2021/22': ''}, {'(in million EUR)': 'Revenue – internal', 'Retail 2021/22(1)': '68,4', 'Wholesale and Foodservice\n2021/22': '17,2', 'Other activities\n2021/22': '13,5', 'Operating segments 2021/22': '99,1'}, {'(in million EUR)': '', 'Retail 2021/22(1)': '', 'Wholesale and Foodservice\n

{'title': 'As adjusted due to discontinued operations. See note 16 for more information on the restatement of comparative information.', 'col_headers': ['(in million EUR)', '2022/23', '2021/22(1)'], 'table': [{'(in million EUR)': 'Revenue', '2022/23': '9.933,6', '2021/22(1)': '9.251,1'}, {'(in million EUR)': 'Cost of goods sold', '2022/23': '(7.074,2)', '2021/22(1)': '(6.546,4)'}, {'(in million EUR)': 'Gross profit', '2022/23': '2.859,4', '2021/22(1)': '2.704,7'}, {'(in million EUR)': 'As a % of revenue\t28,8%\t29,2%', '2022/23': 'As a % of revenue\t28,8%\t29,2%', '2021/22(1)': 'As a % of revenue\t28,8%\t29,2%'}]}

Summary;
 <<T68>>: The table displays financial information related to revenue, cost of goods sold, and gross profit for Colruyt_Group. The figures are presented in million EUR for the fiscal years 2022/23 and 2021/22. 

In terms of revenue, the company generated 9.933,6 million EUR in 2022/23, compared to 9.251,1 million EUR in 2021/22. The cost of goods sold for the respec

{'title': '5. Services and miscellaneous goods', 'col_headers': ['(in million EUR)', '2022/23', '2021/22(1)'], 'table': [{'(in million EUR)': 'Rental and rental-related charges', '2022/23': '35,4', '2021/22(1)': '25,3'}, {'(in million EUR)': 'Maintenance and repairs', '2022/23': '86,1', '2021/22(1)': '79,1'}, {'(in million EUR)': 'Utilities', '2022/23': '102,5', '2021/22(1)': '73,3'}, {'(in million EUR)': 'Logistic expenses', '2022/23': '177,6', '2021/22(1)': '138,6'}, {'(in million EUR)': 'Fees, IT and IT-related expenses', '2022/23': '210,1', '2021/22(1)': '194,6'}, {'(in million EUR)': 'Administration, marketing and other expenses', '2022/23': '104,8', '2021/22(1)': '101,3'}, {'(in million EUR)': 'Impairment of current assets', '2022/23': '0,9', '2021/22(1)': '(0,3)'}, {'(in million EUR)': 'Total services and miscellaneous goods\t717,4\t611,9', '2022/23': 'Total services and miscellaneous goods\t717,4\t611,9', '2021/22(1)': 'Total services and miscellaneous goods\t717,4\t611,9'}]}



{'title': 'Income taxes recognised in profit or loss', 'col_headers': ['(in million EUR)', '2022/23', '2021/22(1)'], 'table': [{'(in million EUR)': 'A) Effective tax rate\nProfit before tax (excluding share in the result of investments accounted for using the equity method)', '2022/23': '240,1', '2021/22(1)': '364,6'}, {'(in million EUR)': 'Income tax expense', '2022/23': '62,2', '2021/22(1)': '92,6'}, {'(in million EUR)': 'Effective tax rate(2)\t25,90%\t25,40%', '2022/23': 'Effective tax rate(2)\t25,90%\t25,40%', '2021/22(1)': 'Effective tax rate(2)\t25,90%\t25,40%'}, {'(in million EUR)': 'B) Reconciliation between the effective tax rate and the applicable tax rate(3)\nProfit before tax (excluding share in the result of investments accounted for using the equity method)', '2022/23': '24,35%\n240,1', '2021/22(1)': '24,68%\n364,6'}, {'(in million EUR)': 'Income tax expense (based on applicable tax rate)\t58,5\t90,0', '2022/23': 'Income tax expense (based on applicable tax rate)\t58,5\t9

{'title': 'As adjusted due to discontinued operations. See Note 16 for more information.', 'col_headers': ['(in million EUR)', 'Internally developed intangible assets', 'Externally purchased software, licences and\nsimilar rights', 'Acquired customer lists', 'Other intangible\nassets', 'Intangible assets under development', 'Total'], 'table': [{'(in million EUR)': 'Acquisition value', 'Internally developed intangible assets': 'Acquisition value', 'Externally purchased software, licences and\nsimilar rights': 'Acquisition value', 'Acquired customer lists': 'Acquisition value', 'Other intangible\nassets': 'Acquisition value', 'Intangible assets under development': 'Acquisition value', 'Total': ''}, {'(in million EUR)': 'At 1 April 2021', 'Internally developed intangible assets': '199,4', 'Externally purchased software, licences and\nsimilar rights': '99,3', 'Acquired customer lists': '5,9', 'Other intangible\nassets': '12,6', 'Intangible assets under development': '136,9', 'Total': '454,

{'title': 'As adjusted due to discontinued operations. See note 16 for more information.', 'col_headers': ['(in million EUR)', 'Land and buildings', 'Plant, machinery\nand equipment', 'Furniture and vehicles', 'Right-of-use\nassets', 'Other property, plant and equipment', 'Assets under construction', 'Total'], 'table': [{'(in million EUR)': 'Acquisition value', 'Land and buildings': 'Acquisition value', 'Plant, machinery\nand equipment': 'Acquisition value', 'Furniture and vehicles': 'Acquisition value', 'Right-of-use\nassets': 'Acquisition value', 'Other property, plant and equipment': 'Acquisition value', 'Assets under construction': 'Acquisition value', 'Total': ''}, {'(in million EUR)': 'At 1 April 2021', 'Land and buildings': '2.957,3', 'Plant, machinery\nand equipment': '847,2', 'Furniture and vehicles': '548,2', 'Right-of-use\nassets': '284,7', 'Other property, plant and equipment': '202,9', 'Assets under construction': '83,1', 'Total': '4.923,4'}, {'(in million EUR)': 'Revaluat

{'title': 'On property, plant and equipment an impairment loss of EUR 12,0 million was recognised, mainly relating to the loss-making operations of Dreamland and Dreambaby, and the expansion, relocation and renovation of existing stores. The impairment loss is included in the income statement of the current reporting period under ‘Depreciation, amortisation and impairment of non-current assets’ within the operating segments ‘Retail’, ‘Wholesale and Foodservice’ and ‘Other activities’.', 'col_headers': ['(in million EUR)', 'Land and buildings', 'Plant, machinery\nand equipment', 'Furniture and vehicles', 'Right-of-use\nassets', 'Other property, plant and equipment', 'Assets under construction', 'Total'], 'table': [{'(in million EUR)': '', 'Land and buildings': '', 'Plant, machinery\nand equipment': '', 'Furniture and vehicles': '', 'Right-of-use\nassets': '', 'Other property, plant and equipment': '', 'Assets under construction': '', 'Total': ''}, {'(in million EUR)': 'At 31 March 2022'

{'title': 'These adjustments to the net assets relate to energy contracts within the ‘Non-current assets’ category. In addition, effects in the consolidated figures of Virya Energy NV resulting from a change in the consolidation method of the underlying entities, are offset by Colruyt Group as these effects are not applicable to Colruyt Group. The adjustment for Colruyt Group at Smartmat NV relates to goodwill.', 'col_headers': ['2021 (in million EUR)', 'Virya Energy NV(2)(4)', 'Newpharma Group NV(2)(3)', 'Smartmat NV(2)'], 'table': [{'2021 (in million EUR)': 'Non-current assets', 'Virya Energy NV(2)(4)': '2.102,4', 'Newpharma Group NV(2)(3)': '89,1', 'Smartmat NV(2)': '3,0'}, {'2021 (in million EUR)': 'Current assets', 'Virya Energy NV(2)(4)': '251,0', 'Newpharma Group NV(2)(3)': '17,4', 'Smartmat NV(2)': '7,9'}, {'2021 (in million EUR)': 'Non-current liabilities', 'Virya Energy NV(2)(4)': '1.068,9', 'Newpharma Group NV(2)(3)': '17,0', 'Smartmat NV(2)': '2,3'}, {'2021 (in million EUR)

{'title': 'The non-current financial assets evolved as follows during the financial year:', 'col_headers': ['(in million EUR)', '2022/23', '2021/22'], 'table': [{'(in million EUR)': 'At 1 April', '2022/23': '14,7', '2021/22': '111,6'}, {'(in million EUR)': 'Acquisitions', '2022/23': '-', '2021/22': '0,9'}, {'(in million EUR)': 'Capital increases', '2022/23': '0,2', '2021/22': '0,9'}, {'(in million EUR)': 'Capital decreases', '2022/23': '-', '2021/22': '(2,3)'}, {'(in million EUR)': 'Fair value adjustments through other comprehensive income', '2022/23': '(4,1)', '2021/22': '(1,1)'}, {'(in million EUR)': 'Reclassification', '2022/23': '-', '2021/22': '(95,0)'}, {'(in million EUR)': 'Other', '2022/23': '-', '2021/22': '(0,3)'}, {'(in million EUR)': 'At 31 March', '2022/23': '10,8', '2021/22': '14,7'}]}

Summary;
 <<T91>>: The table displays the evolution of non-current financial assets during the financial year for Colruyt_Group. The column headers indicate the years 2022/23 and 2021/22, 

{'title': 'Assets held for sale', 'col_headers': ['(in million EUR)', '31.03.23'], 'table': [{'(in million EUR)': 'Intangible assets', '31.03.23': '1,3'}, {'(in million EUR)': 'Property, plant and equipment', '31.03.23': '62,8'}, {'(in million EUR)': 'Other receivables', '31.03.23': '0,4'}, {'(in million EUR)': 'Total non-current assets from discontinued operations\t64,5', '31.03.23': 'Total non-current assets from discontinued operations\t64,5'}, {'(in million EUR)': 'Inventories', '31.03.23': '20,4'}, {'(in million EUR)': 'Trade receivables', '31.03.23': '40,5'}, {'(in million EUR)': 'Current tax assets', '31.03.23': '0,2'}, {'(in million EUR)': 'Other receivables', '31.03.23': '2,6'}, {'(in million EUR)': 'Cash and cash equivalents', '31.03.23': '2,6'}, {'(in million EUR)': 'Total current assets from discontinued operations\t66,3', '31.03.23': 'Total current assets from discontinued operations\t66,3'}, {'(in million EUR)': 'Total assets from discontinued operations\t130,8', '31.03.2

{'title': 'Other non-current receivables', 'col_headers': ['(in million EUR)', '31.03.23(1)', '31.03.22'], 'table': [{'(in million EUR)': 'Loans to customers', '31.03.23(1)': '4,9', '31.03.22': '4,7'}, {'(in million EUR)': 'Loans to associates', '31.03.23(1)': '1,0', '31.03.22': '12,7'}, {'(in million EUR)': 'Loans to joint ventures', '31.03.23(1)': '2,9', '31.03.22': '1,9'}, {'(in million EUR)': 'Guarantees granted', '31.03.23(1)': '7,6', '31.03.22': '7,4'}, {'(in million EUR)': 'Lease receivables', '31.03.23(1)': '20,4', '31.03.22': '17,1'}, {'(in million EUR)': 'Other receivables', '31.03.23(1)': '1,5', '31.03.22': '2,2'}, {'(in million EUR)': 'Total other non-current receivables\t38,3\t46,0', '31.03.23(1)': 'Total other non-current receivables\t38,3\t46,0', '31.03.22': 'Total other non-current receivables\t38,3\t46,0'}]}

Summary;
 <<T99>>: The table displays the amounts of various types of non-current receivables held by Colruyt_Group. The column headers indicate the reporting dat

{'title': 'Earnings per share', 'col_headers': ['', '2022/23', '2021/22(1)'], 'table': [{'': 'Total operating activity', '2022/23': 'Total operating activity', '2021/22(1)': 'Total operating activity'}, {'': 'Profit for the financial year (group share), including discontinued operations (EUR million)', '2022/23': '200,6', '2021/22(1)': '287,3'}, {'': 'Profit for the financial year (group share), excluding discontinued operations (EUR million)', '2022/23': '179,7', '2021/22(1)': '277,3'}, {'': 'Weighted average number of outstanding shares', '2022/23': '127.967.641', '2021/22(1)': '132.677.085'}, {'': 'Earnings per share – basic (in EUR) – including discontinued operations', '2022/23': '1,57', '2021/22(1)': '2,16'}, {'': 'Earnings per share – diluted (in EUR) – including discontinued operations', '2022/23': '1,57', '2021/22(1)': '2,16'}, {'': 'Earnings per share – basic (in EUR) – excluding discontinued operations', '2022/23': '1,40', '2021/22(1)': '2,09'}, {'': 'Earnings per share – di

{'title': 'The other provisions consist mainly of provisions for vacant properties and reinsurance.', 'col_headers': ['(in million EUR)', '31.03.23', '31.03.22'], 'table': [{'(in million EUR)': 'Defined contribution plans with a legally guaranteed minimum return', '31.03.23': '74,4', '31.03.22': '90,6'}, {'(in million EUR)': 'Benefits related to the ‘Unemployment regime with company supplement’', '31.03.23': '6,4', '31.03.22': '8,8'}, {'(in million EUR)': 'Other post-employment benefits', '31.03.23': '7,1', '31.03.22': '7,8'}, {'(in million EUR)': 'Total\t87,9\t107,2', '31.03.23': 'Total\t87,9\t107,2', '31.03.22': 'Total\t87,9\t107,2'}]}

Summary;
 <<T108>>: The table displays the provisions for various items in million EUR for the years 2023 and 2022. The column headers indicate the dates 31.03.23 and 31.03.22, respectively. The table includes three rows of data and a total row. 

The first row represents the provisions for defined contribution plans with a legally guaranteed minimum 

{'title': 'The amounts relative to these defined contribution plans with a legally guaranteed minimum return that are recognised in the consolidated income statement and in the consolidated statement of comprehensive income can be summarised as follows:', 'col_headers': ['(in million EUR)', '31.03.23', '31.03.22'], 'table': [{'(in million EUR)': 'Total service cost(1)', '31.03.23': '16,4', '31.03.22': '17,9'}, {'(in million EUR)': 'Net interest cost(2)', '31.03.23': '1,5', '31.03.22': '1,0'}, {'(in million EUR)': 'Components recorded in the income statement\t17,9\t18,9', '31.03.23': 'Components recorded in the income statement\t17,9\t18,9', '31.03.22': 'Components recorded in the income statement\t17,9\t18,9'}, {'(in million EUR)': 'Experience adjustments', '31.03.23': '17,5', '31.03.22': '2,5'}, {'(in million EUR)': 'Change of financial assumptions', '31.03.23': '(35,5)', '31.03.22': '(28,4)'}, {'(in million EUR)': 'Return on plan assets', '31.03.23': '3,5', '31.03.22': '1,6'}, {'(in 

{'title': 'Terms and repayment schedule', 'col_headers': ['(in million EUR)', '< 1 year', '1-5 years', '> 5 years', 'Total'], 'table': [{'(in million EUR)': 'Lease and similar liabilities', '< 1 year': '60,5', '1-5 years': '180,2', '> 5 years': '87,7', 'Total': '328,4'}, {'(in million EUR)': 'Bank borrowings', '< 1 year': '410,5', '1-5 years': '295,0', '> 5 years': '55,8', 'Total': '761,3'}, {'(in million EUR)': 'Fixed-rate green retail bond', '< 1 year': '-', '1-5 years': '251,1', '> 5 years': '-', 'Total': '251,1'}, {'(in million EUR)': 'Other', '< 1 year': '0,1', '1-5 years': '5,8', '> 5 years': '-', 'Total': '5,9'}, {'(in million EUR)': 'Total at 31 March 2023(1)', '< 1 year': '471,1', '1-5 years': '732,1', '> 5 years': '143,5', 'Total': '1.346,7'}, {'(in million EUR)': 'Lease and similar liabilities', '< 1 year': '50,9', '1-5 years': '151,1', '> 5 years': '82,0', 'Total': '284,0'}, {'(in million EUR)': 'Bank borrowings', '< 1 year': '298,3', '1-5 years': '378,8', '> 5 years': '1,0

{'title': 'For lease liabilities and similar liabilities, this includes the effect of renewing existing leases and revaluing leases due to indexations, as well as reclassifications to liabilities from discontinued operations.', 'col_headers': ['(in million EUR)', '31.03.21', 'Cash flow', 'Changes in lease\nportfolio(1)', 'Business combinations', 'Reclassi- fication', 'Other(2)', '31.03.22'], 'table': [{'(in million EUR)': 'Lease liabilities and similar liabilities', '31.03.21': '242,8', 'Cash flow': '(51,2)', 'Changes in lease\nportfolio(1)': '43,7', 'Business combinations': '29,6', 'Reclassi- fication': '-', 'Other(2)': '19,1', '31.03.22': '284,0'}, {'(in million EUR)': 'Current', '31.03.21': '41,2', 'Cash flow': '(51,2)', 'Changes in lease\nportfolio(1)': '3,7', 'Business combinations': '5,7', 'Reclassi- fication': '46,1', 'Other(2)': '5,4', '31.03.22': '50,9'}, {'(in million EUR)': 'Non-current', '31.03.21': '201,6', 'Cash flow': '-', 'Changes in lease\nportfolio(1)': '40,0', 'Busin

{'title': 'Colruyt Group’s exposure to exchange rate fluctuations is based on the following positions in foreign currencies:', 'col_headers': ['(in million EUR)', 'Net position', 'Net position'], 'table': [{'(in million EUR)': '(in million EUR)', 'Net position': '31.03.22'}, {'(in million EUR)': 'EUR/INR', 'Net position': '0,9'}, {'(in million EUR)': 'USD/EUR', 'Net position': '2,7'}, {'(in million EUR)': 'NZD/EUR', 'Net position': '0,1'}, {'(in million EUR)': 'Total\t8,1\t3,7', 'Net position': 'Total\t8,1\t3,7'}]}

Summary;
 <<T123>>: The table displays Colruyt Group's exposure to exchange rate fluctuations based on their positions in foreign currencies. The table has three columns: "(in million EUR)", "Net position", and "Net position". The first row of the table represents the column headers. The second row shows the date "31.03.22" under the "(in million EUR)" column. The subsequent rows provide information on the net positions in different currency pairs. 

Under the "(in million 

{'title': 'The amounts due in respect of these commitments are as follows:', 'col_headers': ['(in million EUR)', '31.03.23', '< 1 year', '1-5 years', '> 5 years'], 'table': [{'(in million EUR)': 'Lease arrangements as lessee(1)', '31.03.23': '3,5', '< 1 year': '1,4', '1-5 years': '2,1', '> 5 years': '-'}, {'(in million EUR)': 'Commitments relating to the acquisition of property, plant and equipment', '31.03.23': '115,7', '< 1 year': '103,0', '1-5 years': '12,7', '> 5 years': '-'}, {'(in million EUR)': 'Commitments relating to purchases of goods', '31.03.23': '253,6', '< 1 year': '234,1', '1-5 years': '19,5', '> 5 years': '-'}, {'(in million EUR)': 'Other commitments', '31.03.23': '39,5', '< 1 year': '24,7', '1-5 years': '14,8', '> 5 years': '-'}]}

Summary;
 <<T127>>: The table displays the amounts due in respect of various commitments for Colruyt_Group as of March 31, 2023. The commitments are categorized into different types, and the table provides the amounts for each category in mi

{'title': 'The compensation awarded to key management personnel is summarised below. All amounts are gross amounts before taxes. Social security contributions were paid on these amounts.', 'col_headers': ['(in million EUR)', 'Remuneration\n2022/23', 'Number of persons/shares\n2022/23', 'Remuneration\n2021/22', 'Number of persons/shares\n2021/22'], 'table': [{'(in million EUR)': '', 'Remuneration\n2022/23': '', 'Number of persons/shares\n2022/23': '', 'Remuneration\n2021/22': '', 'Number of persons/shares\n2021/22': ''}, {'(in million EUR)': 'Board of Directors', 'Remuneration\n2022/23': '', 'Number of persons/shares\n2022/23': '10', 'Remuneration\n2021/22': '9', 'Number of persons/shares\n2021/22': '9'}, {'(in million EUR)': 'Fixed remuneration (directors’ fees)', 'Remuneration\n2022/23': '1,0', 'Number of persons/shares\n2022/23': '', 'Remuneration\n2021/22': '0,9', 'Number of persons/shares\n2021/22': '0,9'}, {'(in million EUR)': 'Senior management', 'Remuneration\n2022/23': '', 'Num

{'title': 'Subsidiaries', 'col_headers': ['Colruyt Afrique SAS', 'Sacre Coeur III VDN, Villa numéro 10684,\nBoîte Postal 4579', 'Dakar, Senegal', 'SN DKR 2020 B 13136', '100%'], 'table': [{'Colruyt Afrique SAS': 'Colruyt Cash and Carry NV', 'Sacre Coeur III VDN, Villa numéro 10684,\nBoîte Postal 4579': 'Edingensesteenweg 196', 'Dakar, Senegal': '1500 Halle, Belgium', 'SN DKR 2020 B 13136': '0716 663 318', '100%': '100%'}, {'Colruyt Afrique SAS': 'Colruyt Gestion SA', 'Sacre Coeur III VDN, Villa numéro 10684,\nBoîte Postal 4579': 'Rue F.W. Raiffeisen 5', 'Dakar, Senegal': '2411 Luxembourg,\nGrand Duchy of Luxembourg', 'SN DKR 2020 B 13136': 'B137485', '100%': '100%'}, {'Colruyt Afrique SAS': 'Colruyt Group Services NV', 'Sacre Coeur III VDN, Villa numéro 10684,\nBoîte Postal 4579': 'Edingensesteenweg 196', 'Dakar, Senegal': '1500 Halle, Belgium', 'SN DKR 2020 B 13136': '0880 364 278', '100%': '100%'}, {'Colruyt Afrique SAS': 'Colruyt IT Consultancy India Private LTD', 'Sacre Coeur III V

{'title': 'Subsidiaries', 'col_headers': ['Immo Colruyt Luxembourg SA', 'Rue F.W. Raiffeisen 5', '2411 Luxembourg, Grand Duchy of\nLuxembourg', 'B195799', '100%'], 'table': [{'Immo Colruyt Luxembourg SA': 'Immo De CE Floor BV', 'Rue F.W. Raiffeisen 5': 'Edingensesteenweg 196', '2411 Luxembourg, Grand Duchy of\nLuxembourg': '1500 Halle, Belgium', 'B195799': '0446 434 580', '100%': '100%'}, {'Immo Colruyt Luxembourg SA': 'Immoco SARL', 'Rue F.W. Raiffeisen 5': 'Zone Industrielle, Rue des Entrepôts 4', '2411 Luxembourg, Grand Duchy of\nLuxembourg': '39700 Rochefort-sur-Nenon, France', 'B195799': '527 664 965', '100%': '100%'}, {'Immo Colruyt Luxembourg SA': 'Izock BV', 'Rue F.W. Raiffeisen 5': 'Kerkstraat 132-134', '2411 Luxembourg, Grand Duchy of\nLuxembourg': '1851 Humbeek, Belgium', 'B195799': '0426 190 284', '100%': '100%'}, {'Immo Colruyt Luxembourg SA': 'Jims NV', 'Rue F.W. Raiffeisen 5': 'Edingensesteenweg 196', '2411 Luxembourg, Grand Duchy of\nLuxembourg': '1500 Halle, Belgium', 

{'title': 'Subsidiaries', 'col_headers': ['WV1 BV', 'Guldensporenpark 100, blok K', '9820 Merelbeke, Belgium', '0627 969 585', '100%'], 'table': [{'WV1 BV': 'WV2 BV', 'Guldensporenpark 100, blok K': 'Tramstraat 63', '9820 Merelbeke, Belgium': '9052 Zwijnaarde, Belgium', '0627 969 585': '0627 973 149', '100%': '100%'}, {'WV1 BV': 'WV3 BV', 'Guldensporenpark 100, blok K': 'Tramstraat 63', '9820 Merelbeke, Belgium': '9052 Zwijnaarde, Belgium', '0627 969 585': '0477 728 760', '100%': '100%'}, {'WV1 BV': 'Yaleli BV', 'Guldensporenpark 100, blok K': 'Tramstraat 63', '9820 Merelbeke, Belgium': '9052 Zwijnaarde, Belgium', '0627 969 585': '0672 981 941', '100%': '100%'}, {'WV1 BV': 'Zeeboerderij Westdiep BV', 'Guldensporenpark 100, blok K': 'Edingensesteenweg 196', '9820 Merelbeke, Belgium': '1500 Halle, Belgium', '0627 969 585': '0739 918 869', '100%': '80%'}]}

Summary;
 <<T138>>: The table titled "Subsidiaries" in Colruyt_Group's Annual report with sustainability (2022) displays information 

{'title': 'Internet:  Email:', 'col_headers': ['Equity', '3.473,5', '1.757,0'], 'table': [{'Equity': 'I. Share capital', '3.473,5': '370,2', '1.757,0': '364,8'}, {'Equity': 'IV. Reserves', '3.473,5': '220,7', '1.757,0': '172,2'}, {'Equity': 'V. Profit carried forward', '3.473,5': '2.882,3', '1.757,0': '1.219,7'}, {'Equity': 'VI. Capital grants', '3.473,5': '0,3', '1.757,0': '0,3'}, {'Equity': '', '3.473,5': '', '1.757,0': ''}, {'Equity': 'Provisions and deferred taxes', '3.473,5': '1,5', '1.757,0': '2,8'}, {'Equity': '', '3.473,5': '', '1.757,0': ''}, {'Equity': 'Liabilities', '3.473,5': '6.323,2', '1.757,0': '5.769,5'}, {'Equity': 'VIII. Liabilities exceeding one year', '3.473,5': '4.298,6', '1.757,0': '4.089,8'}, {'Equity': 'IX. Liabilities for less than one year', '3.473,5': '1.999,4', '1.757,0': '1.657,3'}, {'Equity': 'X. Accruals and deferred income', '3.473,5': '25,2', '1.757,0': '22,4'}, {'Equity': 'Total liabilities', '3.473,5': '9.798,2', '1.757,0': '7.529,3'}]}

Summary;
 <<T

**WARNING**: We use the key <<T{i}>>: to mark a chunk as the summary of a table. This matters later, when we feed context into the LLM: each summary is passed to the LLM together with its raw table. To achieve this without having to encode the raw table, we search the contextual chunks for the key. When we find one, we look through the list of raw tables and join the matching table to its summary.

NOTE: i starts at 1!
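
The joining step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: the helper name `attach_raw_tables`, the `TABLE_KEY_PATTERN` constant and the "Raw table" label are assumptions made for this example. It also assumes that the raw tables are held in a plain list in extraction order, so that key `<<T{i}>>:` maps to list index `i - 1`.

```python
import re

# Matches table-summary keys such as "<<T28>>:" and captures the 1-based table number.
TABLE_KEY_PATTERN = re.compile(r"<<T(\d+)>>:")


def attach_raw_tables(chunk, raw_tables):
    """Append the raw table for every <<T{i}>>: key found in a context chunk.

    `raw_tables` is assumed to be a list in extraction order, so key i
    (which starts at 1) corresponds to raw_tables[i - 1].
    """
    attached = []
    seen = set()
    for match in TABLE_KEY_PATTERN.finditer(chunk):
        i = int(match.group(1))
        # Skip repeated keys and keys that fall outside the table list.
        if i in seen or not 1 <= i <= len(raw_tables):
            continue
        seen.add(i)
        attached.append(f"Raw table <<T{i}>>: {raw_tables[i - 1]}")
    if not attached:
        return chunk
    return chunk + "\n\n" + "\n".join(attached)
```

Because only the short summary is embedded, the raw table never has to pass through the encoder; it is looked up by key at the moment the context is assembled for the LLM.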

### Obtain short summaries for figures:
As with tables, to facilitate encoding we first pass figures through an LLM. Note that our confidence in the figure information is low. Figures are captured using heuristics and may therefore be too vague to summarise meaningfully. They may also be extracts from the text that were wrongly captured as figures. In either case a textual summary is required.
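
The few-shot prompt in the cell that follows asks the model to label each reply as `Figure:` or `Text:`, or to return 'None' when the information is too vague. One way such replies could be handled afterwards is sketched here. The helper `parse_figure_reply` is hypothetical and not part of the notebook, and it assumes the model keeps to the label format shown in the few-shot examples.

```python
def parse_figure_reply(reply):
    """Split a reply such as 'Figure: ...' into (kind, summary).

    Returns None when the reply is empty or the model answered 'None',
    i.e. when the captured information was too vague to summarise.
    """
    if reply is None:
        return None
    text = reply.strip()
    # Treat None, 'None', "None" and None. as the vague-information signal.
    if not text or text.strip("'\". ").lower() == "none":
        return None
    kind, sep, summary = text.partition(":")
    kind = kind.strip().lower()
    if sep and kind in ("figure", "text"):
        summary = summary.strip()
        return (kind, summary) if summary else None
    # Unlabelled reply: keep the content but mark its kind as unknown.
    return "unknown", text
```

Dropping the `None` results before encoding would keep vague or wrongly captured figures out of the vector store, while the `kind` label keeps track of whether a summary came from a genuine figure or from misclassified text.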

In [63]:
doc_contents[FIGURE_SUMMARY_KEY] = []
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a figure and text summarizer."
    }, {
        "role": "user",
        "content": f"Decide whether the information below (INFO) obtained from {comp_name}'s " + 
            f"{report_name} is from a figure or text, " + 
            "and describe it in no more than 100 words. Note, if the information is vague " + 
            "return 'None'.\n" + 
            "INFO: {'title': 'Global headquarters', 'data': ['200+']}"
    }, {
        "role": "assistant", 
        "content": "Text: Coca-Cola has more than 200 global headquarters."
    }, {
        "role": "user", 
        "content": "INFO: {'title': '2022 Progress on Sustainable Sourcing2', 'data': ['0\t20\t40\t60\t80\t100', '36%', 'GRAPES\t 37%', 'SUGAR CANE\t 40%', 'APPLES\t 55%', 'CORN\t 70%', 'TEA\t 74%', '80%', 'PULP AND PAPER\t 86%', 'ORANGES\t 89%', 'LEMONS\t 96%']}"
    }, {
        "role": "assistant", 
        "content": "Figure: 2022 progress on sustainable sourcing. 37% of grapes, 40% of sugar cane, 55% of apples, 70% of corn, 74% of tea, 86% of pulp and paper, 89% of oranges, and 96% of lemons were sustainably sourced."
    }, {
        "role": "user", 
        "content": "INFO: {'title': 'Organic Revenue Growth (Non-GAAP)1', 'data': ['25', '-5', '24%']}",
    }, {
        "role": "assistant", 
        "content": "Text: Organic Revenue Growth (Non-GAAP) was 24%."
    }, {
        "role": "user", 
        "content": None,
    },
]

for i, figure in tqdm(enumerate(doc_contents[FIGURE_KEY], start=0)):
    
    if doc_contents[FIGURE_SUMMARY_KEY] != []:
        # KEEP TRACK OF LAST GPT OUTPUT - This may help with next
        # figure summarisation as there may be a link between the two.
        parameters['messages'][-3:-1] = {
            "role": "user", 
            "content": f"INFO: {doc_contents[FIGURE_KEY][i-1]}"  # previous figure
        }, {
            "role": "assistant", 
            "content": f"{doc_contents[FIGURE_SUMMARY_KEY][-1]}"
        }
    
    parameters['messages'][-1] = {
        "role": "user", 
        "content": f"INFO: {figure}"
    }

    response = openai.ChatCompletion.create(
      **parameters
    )
    # Add <<F{i}>>: in front to facilitate later lookup
    doc_contents[FIGURE_SUMMARY_KEY].append(
        f"<<F{i+1}>>: {response['choices'][0]['message']['content']}"
    )
    
    print(f"Figure {i + 1} Summary;\n {doc_contents[FIGURE_SUMMARY_KEY][-1]}\n\n")


0it [00:00, ?it/s]

Figure 1 Summary;
 <<F1>>: None


Figure 2 Summary;
 <<F2>>: Text: The annual report can be found on colruytgroup.com/en/annualreport. The corporate website also contains press releases, extra stories, and background information.


Figure 3 Summary;
 <<F3>>: Text: The organisational structure of Colruyt_Group includes the Sustainability Domain and the Domain Board as overarching bodies. This structure ensures that sustainability is deeply rooted within the company. Additionally, there are 12 sustainability programs in place.


Figure 4 Summary;
 <<F4>>: Figure: The data provided is not clear and does not provide enough information to determine its meaning.


Figure 5 Summary;
 <<F5>>: Figure: The data provided represents a range of values for different categories labeled as "Low". The specific meaning of each value is not clear without further context.


Figure 6 Summary;
 <<F6>>: Figure: The data provided indicates that the value for the category labeled as "Medium" is 2.


Figure 7 S

Figure 46 Summary;
 <<F46>>: Figure: Dirk Van den Berghe, an independent director, received a remuneration of EUR 94,000. The total remuneration for all directors is EUR 987,000.


Figure 47 Summary;
 <<F47>>: Text: Stiftung Pro Creatura, a foundation under Swiss law, and Impact Capital NV are controlled by natural persons who directly or indirectly hold less than 3% of the securities with voting rights of the Company. The denominator for calculating the percentage is 134,077,688.


Figure 48 Summary;
 <<F48>>: Figure: Food donations to social organizations in Belgium (tonnes). The data shows the amount of food donations in tonnes for the years 2014 to 2022. The values are as follows: 3,297 tonnes in 2014, 4,262 tonnes in 2015, 4,504 tonnes in 2016, 5,622 tonnes in 2017, 6,649 tonnes in 2018, 251 tonnes in 2019, 490 tonnes in 2020, 797 tonnes in 2021, and the value for 2022 is not provided.


Figure 49 Summary;
 <<F49>>: None.


Figure 50 Summary;
 <<F50>>: Figure: Boni products achiev

Figure 87 Summary;
 <<F87>>: Text: The report is consistent with the supplementary declaration made to the Audit Committee as specified in article 11 of the regulation (EU) nr. 537/2014. The report was issued on 27 July 2023 in Diegem.


Figure 88 Summary;
 <<F88>>: Figure: The information provided is a code or reference number, "24EN0013," and does not contain any additional context or description.


Figure 89 Summary;
 <<F89>>: Text: The movement in impairments on trade and other receivables is provided as of 31.03.22.


Figure 90 Summary;
 <<F90>>: Figure: The gross impairment is 566.7.


Figure 91 Summary;
 <<F91>>: Text: The contact number for requesting the documents at Colruyt NV's registered office is +32 (2) 363 55 45.


Figure 92 Summary;
 <<F92>>: Text: Etn. Fr. Colruyt is a limited liability company with headquarters at Wilgenveld Edingensesteenweg 196, B-1500 Halle. The VAT number is BE 0400.378.485 and the enterprise number is 0400.378.485. The contact number is +32 (0)2 
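A minimal sketch of the rolling few-shot update used in the loop above, shown on a toy message list. The slice `[-3:-1]` covers the two messages just before the final query slot, so assigning two new messages there swaps the last example pair without changing the list length. The helper name `roll_last_example` and all message contents are illustrative assumptions, not part of the notebook.

```python
# Toy message list: system prompt, one example pair, and a slot for the
# next query (all contents are made up for illustration).
messages = [
    {"role": "system", "content": "You are a figure and text summarizer."},
    {"role": "user", "content": "INFO: example figure"},
    {"role": "assistant", "content": "Text: example summary"},
    {"role": "user", "content": None},  # placeholder for the next figure
]


def roll_last_example(messages, prev_figure, prev_summary):
    """Overwrite the final example pair, leaving the query slot untouched."""
    messages[-3:-1] = [
        {"role": "user", "content": f"INFO: {prev_figure}"},
        {"role": "assistant", "content": prev_summary},
    ]
    return messages


length_before = len(messages)
roll_last_example(messages, {"title": "A", "data": ["1"]}, "Figure: A is 1.")
```

Because two messages replace two messages, the prompt size stays bounded however many figures are processed.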

**WARNING**: We use the key <<F{i}>>: to mark a chunk as a figure summary. This matters later, when we feed context into the LLM: the summary is passed to the LLM together with the raw figure data. To achieve this without having to embed the raw figure data, we search the retrieved chunks for the id mentioned above. When it is found, we look through the list of raw figures and join the matching figure to its summary.

NOTE: i starts at 1!
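The lookup described above could be sketched for figures as follows. This is a hedged illustration only: `attach_raw_figures`, `FIGURE_TAG` and the sample `raw_figures` list are hypothetical names and data standing in for the notebook's figure list.

```python
import re

FIGURE_TAG = r"<<F(\d+)>>"  # same shape as the tag written in front of each summary


def attach_raw_figures(chunk, raw_figures):
    """Append the raw figure behind every <<F{i}>> tag found in `chunk`."""
    for idx in (int(m) for m in re.findall(FIGURE_TAG, chunk)):
        # Tags are 1-based, Python lists are 0-based.
        chunk = f"{chunk}\nRAW FIGURE: {raw_figures[idx - 1]}"
    return chunk


raw_figures = [  # hypothetical stand-in for doc_contents[FIGURE_KEY]
    {"title": "Global headquarters", "data": ["200+"]},
    {"title": "Gross impairment", "data": ["566.7"]},
]
joined = attach_raw_figures(
    "<<F2>>: Figure: The gross impairment is 566.7.", raw_figures
)
```

A chunk without a tag is returned unchanged, so the helper can be applied to every retrieved chunk.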

*OPTIONAL: To avoid figures being wrongly chosen because of a lack of noise in the context, we merge multiple figures together to forcefully introduce noise into figure chunks. This way there is still a chance that the chunks are selected, but a lower one than if the signal were amplified by a short sequence length.*

In [65]:
def join_pairs(lst):
    """Join consecutive items in pairs; an odd-length list ends with a triple."""
    # Fewer than two elements: nothing to pair up
    if len(lst) < 2:
        return list(lst)
    # If the list has an odd length (three or more elements)
    if len(lst) % 2 != 0:
        # Concatenate last three elements
        last = lst[-3] + '\n' + lst[-2] + '\n' + lst[-1]
        # Pair up the rest
        return [lst[i] + '\n' + lst[i+1] for i in range(0, len(lst) - 3, 2)] + [last]
    return [lst[i] + '\n' + lst[i+1] for i in range(0, len(lst), 2)]

is_join_pairs = False
if is_join_pairs:
    doc_contents[FIGURE_SUMMARY_KEY] = join_pairs(doc_contents[FIGURE_SUMMARY_KEY])
    print(doc_contents[FIGURE_SUMMARY_KEY])

### Text Splitting, Embedding Models, and Vector DB
We'll be using OpenAI's text-embedding-ada-002 model. 

In [66]:
# Prepare text for chunking:
text = "\n\n".join(doc_contents[TEXT_KEY])

In [67]:
model = 'gpt-3.5-turbo'  # open ai LLM model we will be using later.
# model = 'text-davinci-003'  # open ai LLM model we will be using later.
enc_code = tiktoken.encoding_for_model(model).name
tokenizer = tiktoken.get_encoding(enc_code)

# Determine length of input after tokenization
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,  # in order to be able to fit 3 chunks in context window
    chunk_overlap=35,  
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]  # order in which splits are prioritized
)
chunks = text_splitter.split_text(text)
print(
    f'The input text of length {len(text)} was split into {len(chunks)} chunks.'
)

The input text of length 537086 was split into 424 chunks.
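To illustrate what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window sketch over a list of words. It is not the algorithm used by `RecursiveCharacterTextSplitter` (which splits on the prioritised separators and measures length in tokens); the function name `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(words, chunk_size, chunk_overlap):
    """Return windows of `chunk_size` words that share `chunk_overlap` words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window reaches the end of the input.
        if start + chunk_size >= len(words):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
toy_chunks = sliding_chunks(words, chunk_size=4, chunk_overlap=1)
```

Each window starts `chunk_size - chunk_overlap` words after the previous one, so neighbouring chunks repeat the overlapping words. This is also why the summed chunk lengths can exceed the length of the original text.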


In [68]:
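import math


def assert_roughly_equal(value1, value2, tolerance, message=None):
    """Raise if the values differ by more than `tolerance` (a relative fraction)."""
    if not math.isclose(value1, value2, rel_tol=tolerance):
        if message is None:
            message = f"{value1} and {value2} are not roughly equal within {tolerance} tolerance"
        raise AssertionError(message)

# rel_tol is a relative fraction (0.2 = 20%). The value is an assumed bound that
# leaves room for chunk overlap and stripped separators.
assert_roughly_equal(sum([len(chunk) for chunk in chunks]), len(text), 0.2)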

In [69]:
# extend chunks with table descriptions:
chunks.extend(doc_contents[TABLE_SUMMARY_KEY])
# extend chunks with figure descriptions
chunks.extend(doc_contents[FIGURE_SUMMARY_KEY])

In [70]:
divider = '-'*100
for chunk in tqdm(chunks):
    print(f'{chunk}\n{divider}\n\n')

  0%|          | 0/660 [00:00<?, ?it/s]

Halle, 9 June 2023 FINANCIAL YEAR 2022/23

Annual report presented by the Board of Directors to the Ordinary General Meeting of Shareholders

of 27 September 2023 and Independent auditor’s report

The Dutch annual report in the European Single Electronic Format (ESEF) is the only official version.

Dit jaarverslag is ook verkrijgbaar in het Nederlands.

Ce rapport annuel est également disponible en français.

Financial year 2022/23 covers the period from 1 April 2022 to 31 March 2023.

This annual report is also available on colruytgroup.com/en/annualreport. Our corporate website also includes all press releases, extra stories and background information.

Word from the Chairman
----------------------------------------------------------------------------------------------------


Word from the Chairman

2022/23 was an eventful, challenging financial year, in which we have continued to evolve, based on our belief that we create added value for society today, tomorrow and in the long term

### Indexing:
Store the indexes to avoid pointlessly rerunning the embedding code.

In [71]:
fpath = f"./data/faiss_db/{comp_name}_{fname}.pkl"
is_override = False

if os.path.exists(fpath) and not is_override:
    with open(fpath, 'rb') as f:
        vector_store = pickle.load(f)
else:
    # Init embedding model:
    print("Encoding chunks...")
    
    embed = OpenAIEmbeddings(
        model='text-embedding-ada-002',
    )
    vector_store = FAISS.from_texts(
        chunks, embedding=embed,
        metadatas=[doc_contents[METADATA_KEY] for _ in range(len(chunks))]
    )
    with open(fpath, 'wb') as f:
        pickle.dump(vector_store, f)

Encoding chunks...


**NOTE**: When using multiple documents, one can combine vector stores using the merge command:
- https://python.langchain.com/docs/integrations/vectorstores/faiss

Hence by adding metadata to the vector store, and then applying a merge operation, one can later apply metadata filters and track context origins. 
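The idea can be illustrated with a toy, pure-Python stand-in for a vector store. `ToyStore` and its methods are hypothetical and only mimic the behaviour described above (merging while keeping per-document metadata, then filtering on it). The sample texts are placeholders, and the real merge call is the one documented on the linked LangChain page.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Hypothetical stand-in for a vector store: texts plus per-text metadata."""
    docs: list = field(default_factory=list)  # list of (text, metadata) tuples

    def add(self, text, metadata):
        self.docs.append((text, dict(metadata)))

    def merge_from(self, other):
        # Merging keeps each document's own metadata, so its origin is preserved.
        self.docs.extend(other.docs)

    def filter(self, **wanted):
        # Keep only texts whose metadata matches every requested key/value.
        return [
            text for text, meta in self.docs
            if all(meta.get(key) == value for key, value in wanted.items())
        ]


store_a = ToyStore()
store_a.add("Energy plan ...", {"company": "Colruyt_Group"})
store_b = ToyStore()
store_b.add("Net operating revenues ...", {"company": "Coca-Cola"})

store_a.merge_from(store_b)
colruyt_only = store_a.filter(company="Colruyt_Group")
```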

In [72]:
DOC_CONTENT[comp_name] = {report_name: doc_contents.copy()}  # will become useful when/if we start to use multiple reports

In [73]:
vector_store.docstore._dict

{'e7bd127a-650e-466e-99d6-159ed7dea70c': Document(page_content='Halle, 9 June 2023 FINANCIAL YEAR 2022/23\n\nAnnual report presented by the Board of Directors to the Ordinary General Meeting of Shareholders\n\nof 27 September 2023 and Independent auditor’s report\n\nThe Dutch annual report in the European Single Electronic Format (ESEF) is the only official version.\n\nDit jaarverslag is ook verkrijgbaar in het Nederlands.\n\nCe rapport annuel est également disponible en français.\n\nFinancial year 2022/23 covers the period from 1 April 2022 to 31 March 2023.\n\nThis annual report is also available on colruytgroup.com/en/annualreport. Our corporate website also includes all press releases, extra stories and background information.\n\nWord from the Chairman', metadata={'title': 'Annual report with sustainability reporting 2022/2023', 'author': '', 'created': datetime.datetime(2023, 9, 20, 14, 30, 19), 'company': 'Colruyt_Group', 'report': 'Annual report with sustainability (2022)'}),
 '

In [77]:
TABLE_PATTERN = r"<<T(\d+)>>"
FIGURE_PATTERN = r"<<F(\d+)>>"

# USED TO JOIN RAW TABLE TO TABLE SUMMARY AFTER EMBEDDING
# NOTE THAT THE PATTERN <<T(\d+)>> IS USED TO DEMARCATE A TABLE
def get_table_index_from_string(
    string,
    pattern=TABLE_PATTERN
):
    """
    Checks if a string contains the pattern (e.g., <<T{i}>>),
    and if it does return a list of all i (the digits) and if not return an empty list.
    """
    matches = re.findall(pattern, string)
    return [int(match) for match in matches]


# USES VARIABLES DEFINED ABOVE SUCH AS VECTOR_STORE, 
# AND DOC_CONTENT.
class ContextRetrieval(BaseTool):
    
    name = "information_retrival"
    description = """
        Fetch the most recent information about a company's financials and ESG initiatives.
    """
    output_chunks = []

    @staticmethod
    def string_similarity(s1, s2):
        seq_matcher = difflib.SequenceMatcher(None, s1, s2)
        return seq_matcher.ratio()

    @staticmethod
    def process_content(
        chunk: str,
        company_key: str,
        report_key: str
    ) -> str:
        """
        Process table summaries to also include raw tables from 
        doc_content.
        """
        table_indexes = get_table_index_from_string(chunk, TABLE_PATTERN)
        for table_idx in table_indexes:
            match = re.search(TABLE_PATTERN.replace(r"(\d+)", str(table_idx)), chunk)  # search for the specific index
            chunk = chunk[:match.start()] + "TABLE SUMMARY:" + chunk[match.end():]
            
            table = DOC_CONTENT[company_key][report_key][TABLE_KEY][table_idx-1]  # table_idx starts at 1 hence idx-1
            chunk = f"{chunk}\nRAW TABLE: {table}"

        return chunk
            
    def _history_lookup(self, chunk: str) -> str:
        """Check if context has already been provided in the past."""
        for _chunk in self.output_chunks:
            if self.string_similarity(chunk, _chunk) > 0.9:
                return f"The information was shared in the previous {self.name} calls."
        
        # If no highly similar chunk is found in the history,
        # record the chunk and return it unchanged
        self.output_chunks.append(chunk)
        return chunk
    
    def _run(self, query: str) -> str:
        contents = [
            f"""{
                self._history_lookup(
                    self.process_content(
                        doc.page_content,
                        company_key=doc.metadata[COMP_KEY],
                        report_key=doc.metadata[REPORT_KEY],
                    )
                )
            } | Metadata: {
                    doc.metadata[COMP_KEY]
                } {
                    doc.metadata[REPORT_KEY]
            }""" for doc in vector_store.similarity_search(query, k=4)
        ]
        
        full_content = '\n\n'.join(
            [
                f"Rank: {rank} | Content: {cont}" 
                for rank, cont in enumerate(contents, start=1)
            ]
        )
        return full_content + "\n\nContextual information sorted from most relevant to least relevant."
    
    def _arun(self, query: str):
        raise NotImplementedError(
            f"{self.__class__.__name__} does not currently support async run."
        )
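The history check in `_history_lookup` relies on `difflib.SequenceMatcher.ratio()`, which returns a similarity between 0 and 1. A small standalone sketch of that threshold test follows; the helper name `is_near_duplicate` and the sample strings are illustrative assumptions.

```python
import difflib


def is_near_duplicate(candidate, seen, threshold=0.9):
    """True if `candidate` is more than `threshold` similar to any seen string."""
    return any(
        difflib.SequenceMatcher(None, candidate, old).ratio() > threshold
        for old in seen
    )


seen = ["Colruyt Group aims to reduce emissions by 42% by 2030."]
# Identical strings have a ratio of exactly 1.0.
repeat = is_near_duplicate(
    "Colruyt Group aims to reduce emissions by 42% by 2030.", seen
)
# Strings with no characters in common have a ratio of 0.0.
fresh = is_near_duplicate("0123456789", ["abcdefghij"])
```

Note that `SequenceMatcher` applies its `autojunk` heuristic to sequences of 200 or more items, which may affect ratios for long chunks; passing `autojunk=False` turns it off.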

### Set up KPI extraction:
Use OpenAI's API to extract the desired KPIs from Coca-Cola's sustainability report.

In [80]:
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a table summarizer."
    }, {
        "role": "user", 
        "content": None,
    }, 
]

kpi_prompts = [
    """
    What specific actions has the company taken to reduce its energy consumption in direct operations? 
    Are there achievements listed in the report?
    """,
]
kpi_results = []

for prompt in tqdm(kpi_prompts):
    
    context = ContextRetrieval().run(prompt)
    print(
        f"\n\nCONTEXT:\n{context}\n\n" +
        f"Given the above information can you please answer:\n" +
        f"Question:\n{prompt}\n" + 
        f"Please answer either 'True' or 'False'."
    )
    assert False
    
    parameters['messages'][-1]['content'] = prompt
    
    response = openai.ChatCompletion.create(
      **parameters
    )


  0%|          | 0/1 [00:00<?, ?it/s]



CONTEXT:
Rank: 1 | Content: <<F60>>: Text: Colruyt_Group's scope 1 and 2 action plans focus on three decarbonization levers: cooling, heating, and mobility, which are the main sources of greenhouse gas emissions. Their goal is to reduce emissions by 42% by 2030. | Metadata: Colruyt_Group Annual report with sustainability (2022)

Rank: 2 | Content: In three selected water catchment areas in Peru, South Africa and Spain, our aim is to reduce water consumption for growing fruit and vegetables to best-practice levels. We spent last year mapping how to collect data to monitor water consumption in these areas.

We want to implement independent audits or water standards for 70% of the volume of fruit and vegetables coming from high-water-risk countries.

Read more about the SIFAV sector initiative on p. 169.

Avoiding and reducing energy consumption

With the help of our energy reduction plan, we intend to reduce the energy consumption of our activities in Belgium and Luxembourg by 20% by 2

AssertionError: 
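The prompt printed above combines the retrieved context with the question. One way the same text could be packed into chat messages is sketched below. `build_kpi_messages` is a hypothetical helper and its system prompt is an assumption, not taken from the notebook.

```python
def build_kpi_messages(context, question):
    """Return chat messages asking a True/False KPI question about `context`."""
    user_content = (
        f"CONTEXT:\n{context}\n\n"
        "Given the above information can you please answer:\n"
        f"Question:\n{question.strip()}\n"
        "Please answer either 'True' or 'False'."
    )
    return [
        # Hypothetical system prompt (an assumption, not the notebook's).
        {"role": "system", "content": "You answer questions about company reports."},
        {"role": "user", "content": user_content},
    ]


kpi_messages = build_kpi_messages(
    context="Rank: 1 | Content: (retrieved chunk goes here)",
    question="Has the company taken actions to reduce its energy consumption?",
)
```

The resulting list could then be assigned to `parameters['messages']` before calling the chat completion endpoint.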

In [40]:
# DEBUG:
import pandas as pd

def table_to_list(data):
    """Convert structured table data into a list of lists."""
    headers = data.get('col_headers', [])
    rows = [headers]

    for row_data in data.get('table', []):
        row = []
        for header in headers:
            row.append(row_data.get(header, ''))
        rows.append(row)

    return rows

data = {'title': 'GREENHOUSE GAS EMISSIONS & WASTE', 'col_headers': ['Year ended December 31,', '', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022'], 'table': [{'Year ended December 31,': 'Reduce our absolute emissions by 25% by 2030 against a 2015 baseline', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': '7%'}, {'Year ended December 31,': 'GHG Emissions1', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Direct, from manufacturing sites (metric tons) (in millions)', '': '', '2014': '1.7', '2015': '1.7', '2016': '1.6', '2017': '1.78', '2018': '1.79', '2019': '1.83', '2020': '1.49', '2021': '1.61', '2022': '1.65'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Indirect, from electricity purchased and consumed (without energy trading) at manufacturing sites (metric tons) (in millions)', '': '', '2014': '3.6', '2015': '3.8', '2016': '3.8', '2017': '3.76', '2018': '3.76', '2019': '3.73', '2020': '3.75', '2021': '3.88', '2022': '3.91'}, {'Year ended December 31,': 'Indirect, from electricity purchased and consumed (without energy trading) at manufacturing sites (using GHG protocol market-based method)2 (metric tons) (in millions)', '': '', '2014': '', '2015': '', '2016': '', '2017': '3.44', '2018': '3.35', '2019': '3.88', '2020': '3.28', '2021': '3.56', '2022': '3.33'}, {'Year ended December 31,': 'Total, from manufacturing sites (metric tons) (in millions)', '': '', '2014': '5.55', '2015': '5.58', '2016': '5.45', '2017': '5.54', '2018': '5.55', '2019': '5.56', '2020': '5.24', '2021': '5.49', '2022': '5.56'}, 
{'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Total, from manufacturing sites (using GHG protocol market-based method)2 (in millions)', '': '', '2014': '', '2015': '', '2016': '', '2017': '5.22', '2018': '5.14', '2019': '5.71', '2020': '4.77', '2021': '5.18', '2022': '4.97'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Emissions Ratio (gCO2 /L)', '': '', '2014': '36.89', '2015': '36.23', '2016': '35.29', '2017': '33.96', '2018': '34.83', '2019': '34.74', '2020': '33.96', '2021': '33.33', '2022': '28.85'}, {'Year ended December 31,': 'Business & Sustainability Report and CDP Manufacturing Emissions Reconciliation', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Reported Manufacturing Emissions in Business & Sustainability Report (millions of MT CO2e)–TCCS Reporting Entity3', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Scope 1 emissions per Business & Sustainability Report', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '1.61', '2022': '1.65'}, {'Year ended December 31,': 'Scope 2 emissions per Business & Sustainability Report', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '3.88', '2022': '3.91'}, {'Year ended December 31,': 'Total manufacturing emissions per Business & Sustainability Report', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '5.49', '2022': '5.56'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', 
'2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Reported Manufacturing Emissions in CDP (MT CO2e)–TCCC Reporting Entity3, 4', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Scope 1–Manufacturing per CDP C7.3c', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '325,833', '2022': '304,144'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Scope 2 (location-based)–Manufacturing per CDP C6.3', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '869,832', '2022': '890,400'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Scope 3–Franchises per CDP C6.5', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '4,299,247', '2022': '4,363,071'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Total manufacturing emissions per CDP', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '5,494,912', '2022': '5,577,615'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Total manufacturing emissions per CDP (in millions)', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '5.49', '2022': '5.56'}, {'Year ended December 31,': 'Energy Use5', '': '', '2014': '', '2015': '', '2016': '', '2017': 
'', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Total Energy Use (megajoules) (in millions)', '': '', '2014': '61,764.0', '2015': '61,037.4', '2016': '61,558.7', '2017': '59,070.9', '2018': '61,464.0', '2019': '62,419.9', '2020': '58,888.1', '2021': '63,735.8', '2022': '65,389'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '11,758.9', '2020': '10,985.2', '2021': '12,731.5', '2022': '10,680'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Percentage renewable (electricity)', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '15%', '2020': '17%', '2021': '12%', '2022': '21%'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Energy Use Ratio (megajoules per liter of product)', '': '', '2014': '0.42', '2015': '0.41', '2016': '0.40', '2017': '0.40', '2018': '0.39', '2019': '0.39', '2020': '0.38', '2021': '0.39', '2022': '0.38'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '0.54', '2020': '0.58', '2021': '0.61', '2022': '0.57'}]}

converted_data = table_to_list(data)
        
# Create a pandas DataFrame
df = pd.DataFrame(converted_data)
display(df)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,"Year ended December 31,",,2014.0,2015.0,2016.0,2017.0,2018.0,2019,2020,2021,2022
1,Reduce our absolute emissions by 25% by 2030 a...,,,,,,,,,,7%
2,GHG Emissions1,,,,,,,,,,
3,,,,,,,,,,,
4,"Direct, from manufacturing sites (metric tons)...",,1.7,1.7,1.6,1.78,1.79,1.83,1.49,1.61,1.65
5,,,,,,,,,,,
6,"Indirect, from electricity purchased and consu...",,3.6,3.8,3.8,3.76,3.76,3.73,3.75,3.88,3.91
7,"Indirect, from electricity purchased and consu...",,,,,3.44,3.35,3.88,3.28,3.56,3.33
8,"Total, from manufacturing sites (metric tons) ...",,5.55,5.58,5.45,5.54,5.55,5.56,5.24,5.49,5.56
9,,,,,,,,,,,


### Set up QA Agent:
We will create an agent capable of:
- Fetching contextual information from the vector store
- Performing basic math operations

The idea is that the agent may then be able to compute KPIs that require basic math operations.

**Set up an agent just like in url_retrieval.ipynb.**

Note that a `utils.py` file will be created to store these in the future.

In [None]:
llm_math = LLMMathChain(llm=llm)

# initialize the math tool
math_tool = Tool(
    name='Calculator',
    func=llm_math.run,
    description='Useful when you need to perform math operations.'
)
# when giving tools to an LLM, we must pass them as a list of tools.
tools = [math_tool, ContextRetrieval()]

In [90]:
# Set up the base template
template = """You are an analyst tasked with aggregating financial and ESG KPIs about companies. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin! Remember to answer as succinctly as possible when giving your final answer. The final answer, where possible, should just be a number or a boolean.

Question: {input}
{agent_scratchpad}"""

In [91]:
# Set up a prompt template which breaks up the intermediate_steps
# into thoughts that are used to fill the agent_scratchpad, 
# tools, and tool_names in the base template:
class CustomPromptTemplate(StringPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[BaseTool]  # Tool is a subclass of BaseTool
    
    def format(self, **kwargs) -> str:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        return self.template.format(**kwargs)
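To make the scratchpad construction concrete, here is a standalone sketch of the same loop with a hypothetical `FakeAction` tuple standing in for LangChain's `AgentAction` (only its `.log` attribute is used). The sample step is made up.

```python
from collections import namedtuple

# Hypothetical stand-in for LangChain's AgentAction; only `.log` is needed here.
FakeAction = namedtuple("FakeAction", ["log"])


def build_scratchpad(intermediate_steps):
    """Concatenate each action log with its observation, ending on 'Thought: '."""
    thoughts = ""
    for action, observation in intermediate_steps:
        thoughts += action.log
        thoughts += f"\nObservation: {observation}\nThought: "
    return thoughts


steps = [
    (FakeAction(log="Thought: add the numbers\nAction: Calculator\nAction Input: 2 + 2"), "4"),
]
scratchpad = build_scratchpad(steps)
```

Ending on `Thought: ` prompts the LLM to continue reasoning from the latest observation.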

In [92]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # This includes the `intermediate_steps` variable because that is needed
    input_variables=["input", "intermediate_steps"]
)

In [93]:
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)
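The parsing logic can be tried out without LangChain. The sketch below reuses the same regular expression and returns plain tuples instead of `AgentAction`/`AgentFinish` objects; `parse_step` and the sample LLM outputs are illustrative assumptions.

```python
import re

ACTION_REGEX = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"


def parse_step(llm_output):
    """Return ('finish', answer) or ('action', tool, tool_input)."""
    if "Final Answer:" in llm_output:
        return ("finish", llm_output.split("Final Answer:")[-1].strip())
    match = re.search(ACTION_REGEX, llm_output, re.DOTALL)
    if not match:
        raise ValueError(f"Could not parse LLM output: `{llm_output}`")
    tool = match.group(1).strip()
    tool_input = match.group(2).strip(" ").strip('"')
    return ("action", tool, tool_input)


step = parse_step(
    "Thought: I need the financials\n"
    "Action: information_retrival\n"
    "Action Input: coca-cola financials"
)
done = parse_step("Thought: I now know the final answer\nFinal Answer: 42")
```

Output that contains neither a final answer nor an `Action`/`Action Input` pair raises a `ValueError`, mirroring the parser above.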

In [94]:
llm = OpenAI(
    temperature=0,  # measure of randomness/creativity
    model_name=model
)

# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(
    llm=llm, 
    prompt=prompt  # Custom Prompt
)

tool_names = [tool.name for tool in tools]

agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=CustomOutputParser(),
    stop=["\nObservation:"],  # you want this to be whatever token you use in the prompt to denote the start of an Observation
    allowed_tools=tool_names
) 

agent_executor = AgentExecutor.from_agent_and_tools(
    agent=agent, 
    tools=tools, 
    verbose=True,
    max_iterations=3
)

In [95]:
result = agent_executor.run(
    input="what were the net operating revenues of coca-cola in 2022"
)



> Entering new  chain...
Thought: I need to find the most recent financial information about coca-cola
Action: information_retrival
Action Input: coca-cola financials

Observation: Rank: 1 | Content: Net income attributable to shareowners  

of The Coca-Cola Company

8,920

7,747

9,771

9,542

20

15

10

5

0

-5

-10

16%

16%

6%

2020

2019

2021

2022

(9%)

20

15

10

5

0

19%

13%

12%

0%
2020

2019

2021

2022

Organic Revenue Growth 
(Non-GAAP)1

Comparable Currency Neutral Operating 
Income Growth (Non-GAAP)2

Per Share Data

Basic earnings per share

$2.09

$1.80

$2.26

$2.20

Comparable Currency Neutral Diluted 
Earnings Per Share Growth (Non-GAAP)3

Adjusted Free Cash Flow Conversion Ratio 
(Non-GAAP)4

Diluted earnings per share

Cash dividends

Balance Sheet Data

Total assets

Long-term debt

2.07

1.60

1.79

1.64

2.25

1.68

2.19

1.76

$86,381

$87,296

$94,354

$ 92,763

27,516

40,125

38,116

36,377

20

15

10

5

0

-

Final comments on this prototype: note that a multitude of methods exist for performing the above KPI extraction, for example:
- One could fine-tune an LLM on companies' sustainability reports. Once the information is in the training set, the LLM could answer the questions without context.
- If the above (i.e. fine-tuning the later layers of a transformer network) proved insufficient, one could combine fine-tuning with the context window to get the best of both worlds.
- In a world where budgetary constraints were not an issue, one could pass the entire report through the LLM to answer KPI questions.
- etc.