# Belfius Alytics (Part 2)
Inspiration:

- https://github.com/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb

- https://www.youtube.com/watch?v=RIWbalZ7sTo

- https://colab.research.google.com/drive/13FpBqmhYa5Ex4smVhivfEhk2k4S5skwG?usp=sharing#scrollTo=RSdomqrHNCUY

- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb

Future ideas:

- Convert docx to latex (https://www.vertopal.com/en/download#96a5acdd2afa4e3aaf723be0ea7b71ad).

### Handle imports:

In [1]:
# Move to root directory
import os

notebooks_dir = 'notebooks'
if notebooks_dir in os.path.abspath(os.curdir):
    while not os.path.abspath(os.curdir).endswith('notebooks'):
        print(os.path.abspath(os.curdir))
        os.chdir('..')
    os.chdir('..')  # to get to root

print(os.path.abspath(os.curdir))

C:\Users\MD726YR\PycharmProjects\eyalytics


In [2]:
# Suppress SSL verification (EY problem):
import requests

from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Suppress the warning from urllib3.
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)

old_send = requests.Session.send

def new_send(*args, **kwargs):
    kwargs['verify'] = False
    return old_send(*args, **kwargs)

requests.Session.send = new_send

In [3]:
# Import relevant libraries for langchain retrieval:
import openai
import tiktoken

from langchain import OpenAI, LLMChain, PromptTemplate
from langchain.prompts import StringPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS  # facebook ai similarity search 
from langchain.chains import LLMMathChain
from langchain.tools import BaseTool
from langchain.agents import (
    AgentExecutor, LLMSingleActionAgent, AgentOutputParser, 
    AgentType, initialize_agent, Tool
)
from langchain.callbacks import get_openai_callback
from langchain.schema import AgentAction, AgentFinish

**In case you want to use Chroma instead of FAISS:**

`from langchain.vectorstores import Chroma`
    
Note: to use Chroma you will have to install `chromadb`. This requires Microsoft Visual C++ 14.0 or greater. To install it:

a. Install Microsoft C++ Build Tools: Visit the link provided in the error message (https://visualstudio.microsoft.com/visual-cpp-build-tools/) and install the Microsoft C++ Build Tools.

b. Ensure the Correct Version: Ensure that you have the required version (14.0 or greater) of the C++ build tools installed.

c. Add to PATH: Ensure the tools are added to your system PATH. Usually, the installer should take care of this. But if the problem persists, you might need to verify and add them manually.

d. Restart Your System: Sometimes, after installing such tools, a system restart might be required for the environment variables (like PATH) to update correctly.

**Checks:**
Check if Visual C++ Build Tools is Installed:
- Press Windows + I to open the Settings app.
- Go to "Apps".
- Now in the "Apps & features" tab, search for "Visual Studio".
- Check if there's an installation called "Microsoft Visual Studio" (it might also be "Visual Studio Build Tools").

Check for the Required Components:
- If you find "Microsoft Visual Studio" or "Visual Studio Build Tools" in the list, click on it and then select "Modify".
- This will bring up the Visual Studio Installer.
- Here, ensure that the "Desktop development with C++" workload is checked. Specifically, make sure "MSVC v142 - VS 2019 C++ x64/x86 build tools" (or a similar option) is selected. This provides the C++ compiler that's needed.
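The choice between the two vector stores can also be made at import time. The snippet below is a minimal sketch and not part of the original notebook: it only checks whether `chromadb` is importable and otherwise falls back to FAISS. The helper name `pick_vectorstore_backend` is hypothetical.

```python
import importlib.util


def pick_vectorstore_backend(preferred: str = "chromadb") -> str:
    """Return 'chroma' if the preferred package is importable, else 'faiss'.

    Hypothetical helper: it only inspects the import system and does not
    import or configure either vector store.
    """
    if importlib.util.find_spec(preferred) is not None:
        return "chroma"
    return "faiss"


backend = pick_vectorstore_backend()
print(f"Using vector store backend: {backend}")
```

If the check reports `faiss` even after installing the build tools, `chromadb` itself is most likely still missing from the active environment.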

In [4]:
# libraries for URL pdf loading
import time
import docx
import pyautogui
from docx.oxml.table import CT_Tbl
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

In [5]:
# Other libraries:
import re
import pickle 
import difflib
import math
import tqdm

# for progress bars in loops
from uuid import uuid4
from tqdm.auto import tqdm
from typing import List, Dict, Any, Union, Optional

In [6]:
# Get API and ENV keys:
from dotenv import load_dotenv

load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise KeyError(
        "You will need an OPENAI_API_KEY to use the LLM models in this notebook."
    )
openai.api_key = os.getenv("OPENAI_API_KEY")

## Commence Langchain Retrieval Augmentation Tool Development:

In [7]:
FIGURE_THRESHOLD = 0.1
EPSILON = 1e-10
REPEAT_THRESHOLD = 4
MAX_CHAR_COUNT_FOR_FIGURE  = 20
FIGURE_RELATED_CHARS = r"[0123456789.%-]"
COMMON_UNITS = ["kg", "m", "s", "h", "g", "cm", "mm", "l", "ml"]
DOCX_NAMESPACE = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}
# DICT KEYS 
METADATA_KEY = "metadata"
COMP_KEY = "company"
REPORT_KEY = "report"
FIGURE_KEY = "potential_figure"
FIGURE_SUMMARY_KEY = "figure_summary"
TABLE_KEY = "table"
TABLE_SUMMARY_KEY = "table_summary"
TEXT_KEY = "text"
DOC_CONTENT = {}


# URL -> file name converter:
def url2fname(url):
    # Split the URL by '/' and get the last segment
    last_segment = url.split('/')[-1]
    
    # Use regex to remove any suffix after the dot and the dot itself
    cleaned_name = re.sub(r'\..*$', '', last_segment)
    
    return cleaned_name
    
    
# Create URL loader:
def wait_for_file(file_path: str, timeout: int = 60) -> bool:
    """
    Wait for a file to be present at a specified path within a given timeout.
    
    Args:
        file_path (str): Path to the file.
        timeout (int): Maximum waiting time in seconds. Default is 60 seconds.

    Returns:
        bool: True if file is found within the timeout, False otherwise.
    """
    start_time = time.time()

    while time.time() - start_time < timeout:
        if os.path.exists(file_path):
            return True
        time.sleep(1)

    return False


def download_pdf_from_url(url: str, save_path: str) -> Optional[str]:
    """
    Download a PDF from the specified URL and save it to a local path.
    
    Args:
        url (str): URL of the PDF.
        save_path (str): Local path to save the downloaded PDF.

    Returns:
        str: Path to the saved PDF if successful, None otherwise.
    """
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, 'wb') as file:
            file.write(response.content)
        if os.path.exists(save_path):
            return save_path
    return None


def convert_pdf_to_docx(
    pdf_filename: str, driver_path: str, pdf_folder_path: str, docx_folder_path: str
) -> Optional[str]:
    """
    Convert a PDF to a DOCX using Adobe's online tool.
    
    Args:
        pdf_filename (str): Filename of the PDF.
        driver_path (str): Path to the geckodriver executable.
        pdf_folder_path (str): Directory where the PDF is located.
        docx_folder_path (str): Directory where the converted DOCX should be saved.

    Returns:
        str: Path to the converted DOCX if successful, None otherwise.
    """
    # WebDriver setup and configurations
    firefox_options = webdriver.FirefoxOptions()
    firefox_options.set_preference("browser.download.folderList", 2)
    firefox_options.set_preference("browser.download.dir", docx_folder_path)
    firefox_options.set_preference("browser.download.useDownloadDir", True)
    firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
    
    service = Service(driver_path)
    driver = webdriver.Firefox(service=service, options=firefox_options)
    wait = WebDriverWait(driver, 180)
    driver.get("https://www.adobe.com/be_en/acrobat/online/pdf-to-word.html")

    # Upload the PDF
    upload_btn = wait.until(EC.element_to_be_clickable((By.ID, "lifecycle-nativebutton")))
    upload_btn.click()

    full_pdf_path = os.path.join(pdf_folder_path, pdf_filename)
    if not os.path.exists(full_pdf_path):
        print(f"File path\n{full_pdf_path}\nis not valid.")
        driver.quit()  # close the browser before bailing out
        return None
    
    # Wait for the file selection dialog and input the file path using pyautogui
    time.sleep(5)
    # Use the path in pyautogui
    pyautogui.typewrite(full_pdf_path)

    # Add a slight delay and then press 'enter' multiple times
    time.sleep(2)
    for _ in range(3):
        pyautogui.press('enter')
        time.sleep(0.1)
    time.sleep(10)
    
    retries = 3
    while retries > 0:
        try:
            # Check for cookie notification and click if exists
            try:
                cookie_reject_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-reject-all-handler")))
                cookie_reject_btn.click()
            except TimeoutException:  # This exception is more specific to WebDriverWait than a general Exception.
                print("Cookie settings notification not found or failed to click.")

            # Wait and click the download button
            download_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.Download__downloadButton___2qFEa')))
            download_btn.click()
            break  # If successful, break out of the loop
        except TimeoutException:
            retries -= 1
            if retries == 0:
                raise  # Re-raise the exception if all retries are exhausted
            print(f"Attempt {3 - retries} failed. Retrying...")
            time.sleep(10)  # Wait for 10 seconds before retrying
    time.sleep(10)
    driver.quit()

    expected_docx_filename = pdf_filename.replace('.pdf', '.docx')
    expected_docx_filepath = os.path.join(docx_folder_path, expected_docx_filename)

    return expected_docx_filepath if wait_for_file(expected_docx_filepath, 55) else None


def convert_url_pdf_to_docx(
    pdf_url: str,
    driver_path: str = "./drivers/geckodriver.exe",
    pdf_folder_path: Optional[str] = None,
    docx_folder_path: Optional[str] = None
) -> Optional[str]:
    """
    Download a PDF from a URL, convert it to DOCX, and save it locally.
    
    Args:
        pdf_url (str): URL of the PDF.
        driver_path (str): Path to the geckodriver executable. Default is './drivers/geckodriver.exe'.
        pdf_folder_path (str): Directory to save the downloaded PDF. Default is '<cwd>/data/pdf_db'.
        docx_folder_path (str): Directory to save the converted DOCX. Default is '<cwd>/data/docx_db'.

    Returns:
        str: Path to the converted DOCX if successful, None otherwise.
    """
    cwd = os.getcwd()
    pdf_folder_path = pdf_folder_path or os.path.join(cwd, "data", "pdf_db")
    docx_folder_path = docx_folder_path or os.path.join(cwd, "data", "docx_db")

    os.makedirs(pdf_folder_path, exist_ok=True)
    os.makedirs(docx_folder_path, exist_ok=True)

    pdf_filename = pdf_url.split('/')[-1]
    pdf_save_path = os.path.join(pdf_folder_path, pdf_filename)

    if download_pdf_from_url(pdf_url, pdf_save_path):
        return convert_pdf_to_docx(pdf_filename, driver_path, pdf_folder_path, docx_folder_path)
    return None


def extract_footnotes_from_para(para, next_para=None):
    """Extract footnote references and actual footnotes from a paragraph."""
    footnotes = []
    
    footnote_refs = para._element.findall('.//w:footnoteReference', namespaces=DOCX_NAMESPACE)

    for ref in footnote_refs:
        footnote_id = ref.get("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id")
        footnote = para.part.footnotes_part.footnote_dict[footnote_id]
        footnotes.append(footnote.text)

    # Check in the next paragraph for footnotes if provided
    if next_para:
        next_footnote_refs = next_para._element.findall('.//w:footnoteReference', namespaces=DOCX_NAMESPACE)
        for ref in next_footnote_refs:
            footnote_id = ref.get("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id")
            footnote = next_para.part.footnotes_part.footnote_dict[footnote_id]
            footnotes.append(footnote.text)

    return footnotes


def process_footnotes(text, footnotes):
    """Process and embed footnotes into the text."""
    
    # Existing replacement for footnotes within brackets
    for idx, footnote in enumerate(footnotes, 1):
        text = re.sub(r"\[{}\]".format(idx), "[{}]".format(footnote), text)

    # New addition: replace footnotes appearing directly after words or at the end of sentences
    for idx, footnote in enumerate(footnotes, 1):
        # This regex will look for a number that doesn't have another number directly before it (to differentiate from normal numbers within the text)
        pattern = r'(?<![0-9])' + str(idx) + r'(?![0-9])'
        replacement = "[{}]".format(footnote)
        text = re.sub(pattern, replacement, text)

    return text


def contains_unit(text):
    """Check if text contains a common unit following a number."""
    for unit in COMMON_UNITS:
        # Check for patterns like '123 kg', '123kg', '0.5 m', '0.5m', etc.
        if re.search(r'\d\s?' + re.escape(unit) + r'(?![a-zA-Z])', text):
            return True
    return False


def is_potential_figure_data(text):
    if text is None:
        return False

    text_count = len(text) - text.count(' ')
    figure_char_count = len(re.findall(FIGURE_RELATED_CHARS, text))
    char_count = text_count - figure_char_count

    # New: Check for units
    contains_common_units = contains_unit(text.lower())  # Convert text to lowercase for this check
    
    # New: Check for percentage patterns
    contains_percentage = "%" in text and any(char.isdigit() for char in text)

    if (text_count == 0) or text.endswith(('.', ':', ';', ',')) or (char_count > MAX_CHAR_COUNT_FOR_FIGURE):
        return False

    if (char_count == 0) or (figure_char_count / (char_count + EPSILON) > FIGURE_THRESHOLD) or contains_common_units or contains_percentage:
        return True
    
    return False


def repeated_artifact_check(line, artifact_dict):
    """Check if a line is a repeated artifact and update its count."""
    if line in artifact_dict:
        artifact_dict[line] += 1
        if artifact_dict[line] > REPEAT_THRESHOLD:
            return True  # It's a repeated artifact
    else:
        artifact_dict[line] = 1
    return False

    
def read_docx(
    file_path: str, 
    comp_name: str = None, 
    report_name: str = None,
) -> dict:
    
    def _is_empty(text):
        if len(text) == 0:
            return True
        return False
    
    doc = docx.Document(file_path)
    
    result = {
        METADATA_KEY: {
            'title': doc.core_properties.title,
            'author': doc.core_properties.author,
            'created': doc.core_properties.created,
            COMP_KEY: comp_name,
            REPORT_KEY: report_name,
        },
        TEXT_KEY: [],
        TABLE_KEY: [],
        FIGURE_KEY: []
    }
    
    figure_data_group = {'title': None, 'data': []}
    current_is_figure_data = False
    previous_text = None
    
    artifact_dict = {}
    
    for current_elem, next_elem in tqdm(zip(doc.element.body, doc.element.body[1:] + [None])):
        
        # Paragraph
        if current_elem.tag.endswith('p'):
            
            current_para = docx.text.paragraph.Paragraph(current_elem, None)
            next_para = docx.text.paragraph.Paragraph(next_elem, None)
            
            processed_text = current_para.text.strip()
            # Ignore empty lines or repeated lines 
            if _is_empty(processed_text) or repeated_artifact_check(processed_text, artifact_dict):
                current_para = None
                continue
            try: 
                next_text = next_para.text
            except AttributeError: 
                next_text = None

            # Process footnotes
            footnotes = extract_footnotes_from_para(current_para, next_para)
            processed_text = process_footnotes(processed_text, footnotes)
            
            # Identify if the current line is potential figure data
            previous_was_figure_data = current_is_figure_data  # Move the window forward
            current_is_figure_data = is_potential_figure_data(processed_text)
            next_is_figure_data = is_potential_figure_data(next_text)
            
            if current_is_figure_data:
                
                # If previous line was also figure data, they belong to the same figure
                if previous_was_figure_data:
                    figure_data_group['data'].append(processed_text)
                else:
                    # If a new figure starts, save the previous figure (if there was any)
                    if figure_data_group['data']:
                        result[FIGURE_KEY].append(figure_data_group)
                        figure_data_group = {'title': None, 'data': []}

                    # Assign the previous line as the title for the current figure
                    figure_data_group['title'] = previous_text
                    figure_data_group['data'].append(processed_text)
                    
            elif not next_is_figure_data:  # neither the next nor the current text is figure data
                # Not a figure, add to text
                result[TEXT_KEY].append(processed_text)
            else:  # next text is figure, meaning that current text will be stored as title.
                pass

            # Handles the case where text potentially interrupts a figure.
            # The text is again stored in figure_data_group['data'].
            if previous_was_figure_data and next_is_figure_data:
                current_is_figure_data = True
                
            # only change previous text when current para is not empty or repeated string.
            previous_text = processed_text
                
        # Table
        elif current_elem.tag.endswith('tbl'):
            table_index = [tbl._element for tbl in doc.tables].index(current_elem)
            table = doc.tables[table_index]

            headers = [cell.text.strip() for cell in table.rows[0].cells]

            rows = []
            for row in table.rows[1:]:
                row_data = {headers[j]: cell.text.strip() for j, cell in enumerate(row.cells)}
                rows.append(row_data)

            result[TABLE_KEY].append({
                'title': previous_text,
                'col_headers': headers,
                'table': rows
            })
        else:
            print(f"Ignoring {current_elem.tag}.")
            
    return result

  
# Testing the functions

# COCA COLA:
# comp_name = 'Coca-Cola'
# report_name = 'Sustainability Report (2022)'
# pdf_url = "https://www.coca-colacompany.com/content/dam/company/us/en/reports/coca-cola-business-and-sustainability-report-2022.pdf"

# Colruyt Group
comp_name = "Colruyt_Group"
year = '2022'
report_name = "Annual report with sustainability (2022)"
pdf_url = "https://www.colruytgroup.com/content/dam/colruytgroup/investeren/jaarverslag-met-duurzaamheidsrapportering/pdf/en/annual-report-with-sustainability-reporting-2022-2023.pdf"

is_scrape = False  # Set to False to avoid web scraping; in that case a previously prepared docx is used.
fname = url2fname(pdf_url)  # used to save vectordb
docx_dir = r"C:\Users\MD726YR\PycharmProjects\eyalytics\data\docx_db"

try:
    if is_scrape:
        docx_path = convert_url_pdf_to_docx(pdf_url)
    else:
        docx_path = fr"{docx_dir}\annual-report-with-sustainability-reporting-2022-2023.docx"
#         docx_path = fr"{docx_dir}\coca-cola-business-and-sustainability-report-2022.docx"

    if not docx_path:
        print("Failed to convert PDF to DOCX. Exiting...")
        exit(1)

    doc_contents = read_docx(
        docx_path, comp_name=comp_name, report_name=report_name
    )

    print(
        f"SUMMARY:\n{len(doc_contents[TEXT_KEY])} paragraphs, "
        f"{len(doc_contents[TABLE_KEY])} tables, and {len(doc_contents[FIGURE_KEY])} figures "
        f"were extracted."
    )
    
    # Example usage:
    print("\n\nText Extracted:")
    for para in doc_contents[TEXT_KEY]:
        print(para)
    print('---'*50)

    print("\n\nTables Extracted:")
    for table in doc_contents[TABLE_KEY]:
        print(table)

    print("\n\nFigures Extracted:")
    for figure in doc_contents[FIGURE_KEY]:
        print(figure)
    print('---'*50)
    
except Exception as e:
    print(f"An error occurred: {e}")

0it [00:00, ?it/s]

Ignoring {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt.
Ignoring {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sectPr.
SUMMARY:
3280 paragraphs, 144 tables, and 92 figures were extracted.


Text Extracted:
Halle, 9 June 2023 FINANCIAL YEAR 2022/23
Annual report presented by the Board of Directors to the Ordinary General Meeting of Shareholders
of 27 September 2023 and Independent auditor’s report
The Dutch annual report in the European Single Electronic Format (ESEF) is the only official version.
Dit jaarverslag is ook verkrijgbaar in het Nederlands.
Ce rapport annuel est également disponible en français.
Financial year 2022/23 covers the period from 1 April 2022 to 31 March 2023.
This annual report is also available on colruytgroup.com/en/annualreport. Our corporate website also includes all press releases, extra stories and background information.
Word from the Chairman
2022/23 was an eventful, challenging financial year, in which we have continued

**FUTURE IMPROVEMENTS**: 
- Trying docx2python may improve information retrieval from a docx.
- Trying lxml may also be a better solution than python-docx.
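
As a rough illustration of the XML-level route, the sketch below walks the body of a document in order and yields paragraphs and tables. It is a minimal sketch under stated assumptions: it uses the standard library's `xml.etree.ElementTree` rather than lxml to stay dependency-free, it assumes the main document part lives at the conventional `word/document.xml` path inside the archive, and the names `iter_body_blocks` and `read_document_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def _text_of(element) -> str:
    """Concatenate the text of all w:t runs below an element."""
    return "".join(t.text or "" for t in element.iter(f"{{{W_NS}}}t"))


def iter_body_blocks(document_xml: bytes):
    """Yield ('paragraph', text) or ('table', rows) in document order."""
    root = ET.fromstring(document_xml)
    body = root.find(f"{{{W_NS}}}body")
    if body is None:
        return
    for child in body:
        if child.tag == f"{{{W_NS}}}p":
            yield "paragraph", _text_of(child)
        elif child.tag == f"{{{W_NS}}}tbl":
            # Only direct rows and cells are read; other tags are skipped.
            rows = [
                [_text_of(tc) for tc in tr.findall(f"{{{W_NS}}}tc")]
                for tr in child.findall(f"{{{W_NS}}}tr")
            ]
            yield "table", rows


def read_document_xml(docx_file: str) -> bytes:
    """Read the main document part from a .docx archive (assumed path)."""
    with zipfile.ZipFile(docx_file) as archive:
        return archive.read("word/document.xml")


# Tiny in-memory example instead of a real report:
sample = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
    "<w:tbl><w:tr><w:tc><w:p><w:r><w:t>A</w:t></w:r></w:p></w:tc>"
    "<w:tc><w:p><w:r><w:t>B</w:t></w:r></w:p></w:tc></w:tr></w:tbl>"
    "</w:body></w:document>"
).encode("utf-8")
blocks = list(iter_body_blocks(sample))
print(blocks)
```

For a real file, `iter_body_blocks(read_document_xml(docx_path))` would be the entry point. Footnotes and figure detection are not covered by this sketch.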

In [12]:
import docx2python
import pandas as pd

_doc_content = docx2python.docx2python(docx_path)

for elem in _doc_content.body:
    print(elem)

[[['----media/image1.jpeg--------media/image2.png--------media/image3.png--------media/image1.jpeg--------media/image2.png--------media/image3.png----', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '']]]
[[['Annual Report', 'with sustainability reporting', '2022/23']]]
[[['Annual Report', 'with sustainability reporting', '2022/23']]]
[[['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '1', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'Halle, 9 June 2023 FINANCIAL YEAR 2022/23', 'Annual report presented by the Board of Directors to the Ordinary General Meeting of Shareholders', 'of 27 September 2023 and Independent auditor’s report', 'The Dutch annual report in the European Single Electronic Format (ESEF) is the only official version.', 'Dit jaarverslag is ook verkrijgbaar in het Nederlands.', 'Ce rapport annuel est également disponible en français.', '', '', '----media/image4.png----Financial y

### Obtain short summaries for tables.
To make tables easier to encode, we ask an LLM to generate a short textual summary of each table's contents. The idea is that this summary yields better vector encodings than encoding the raw table directly. It is an added cost, but one that should give our LLMs better context. Note that when a tabular chunk is selected as context, the table is fed to the engine together with its summary. The key when summarising is to keep the summaries short!
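
Feeding the table back in requires mapping a retrieved summary to its source table. The snippet below is a minimal sketch of that lookup, assuming each summary carries a 1-based `<<T{i}>>:` prefix as produced by the summarisation loop in this notebook. The helper name `table_for_summary` and the sample data are hypothetical.

```python
import re
from typing import Optional

# Matches the "<<T12>>:" prefix at the start of a summary.
TABLE_TAG_PATTERN = re.compile(r"^<<T(\d+)>>:")


def table_for_summary(summary: str, tables: list) -> Optional[dict]:
    """Return the table a tagged summary refers to, or None if it cannot be resolved."""
    match = TABLE_TAG_PATTERN.match(summary)
    if match is None:
        return None
    index = int(match.group(1)) - 1  # tags are 1-based, lists are 0-based
    if 0 <= index < len(tables):
        return tables[index]
    return None


# Illustrative data only, not taken from the report:
tables = [{"title": "Dividends"}, {"title": "Training"}]
retrieved_chunk = "<<T2>>: The table shows investment in education and training."
matched = table_for_summary(retrieved_chunk, tables)
print(matched)
```

Resolving by tag rather than by list position means the lookup still works when a summary is missing, for example after a failed API call.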

In [8]:
doc_contents[TABLE_SUMMARY_KEY] = []
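
Since every summary is a paid API call, it may be worth skipping tables that contain no text at all. The snippet below is a minimal sketch, assuming the table layout produced by `read_docx` (`title`, `col_headers`, and `table` as a list of row dicts). The helper name `is_empty_table` and the sample tables are hypothetical.

```python
def is_empty_table(table: dict) -> bool:
    """True when neither the headers nor any cell contain non-whitespace text."""
    headers = table.get("col_headers", [])
    rows = table.get("table", [])
    has_header_text = any(str(h).strip() for h in headers)
    has_cell_text = any(str(v).strip() for row in rows for v in row.values())
    return not (has_header_text or has_cell_text)


# Illustrative tables in the same shape as read_docx output:
sample_tables = [
    {"title": "2", "col_headers": ["", "", ""], "table": [{"": ""}]},
    {
        "title": "Learning",
        "col_headers": ["Metric", "Value"],
        "table": [{"Metric": "Courses", "Value": "73"}],
    },
]
to_summarise = [t for t in sample_tables if not is_empty_table(t)]
print(f"{len(to_summarise)} of {len(sample_tables)} tables would be summarised")
```

If such a filter is used, it is probably safer to keep enumerating the original table list and merely skip the API call, so that the table indices stay stable.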

In [9]:
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a table summarizer."
    }, {
        "role": "user", 
        "content": None,
    }, 
]

MAX_RETRIES = 5
WAIT_SECONDS = 10

for i, table in tqdm(enumerate(doc_contents[TABLE_KEY], start=1)):
    
    parameters['messages'][-1]['content'] = f"In 100 words or less, describe what the following table from {comp_name}'s {report_name} displays:\n {table}."

    retries = 0
    success = False
    while retries < MAX_RETRIES and not success:
        try:
            response = openai.ChatCompletion.create(
                **parameters
            )
            # Add <<T{i}>>: in front to facilitate later lookup
            doc_contents[TABLE_SUMMARY_KEY].append(
                f"<<T{i}>>: {response['choices'][0]['message']['content']}"
            )
            print(f"{table}\n\nSummary;\n {doc_contents['table_summary'][-1]}\n\n\n")
            success = True
        except (
            openai.error.Timeout, openai.error.APIConnectionError, 
            openai.error.AuthenticationError, openai.error.ServiceUnavailableError,
        ) as e:
            retries += 1
            print(f"Error encountered. Retrying {retries}/{MAX_RETRIES}")
            time.sleep(WAIT_SECONDS)
    
    if not success:
        print(f"Failed to get a summary for table {i} after {MAX_RETRIES} retries.")

0it [00:00, ?it/s]

{'title': '2', 'col_headers': ['', '', ''], 'table': [{'': ''}]}

Summary;
 <<T1>>: The table from Colruyt_Group's Annual report with sustainability (2022) appears to be incomplete, as it only contains one row and no column headers. Without further information, it is difficult to determine the specific content or purpose of the table.



{'title': '2', 'col_headers': ['', '', '', '', ''], 'table': [{'': ''}, {'': ''}]}

Summary;
 <<T2>>: The table from Colruyt_Group's Annual report with sustainability (2022) displays two rows and five columns. However, the specific content of the table is not provided, as all the cells are empty.



{'title': 'With the Sustainability Domain and the Domain Board as overarching bodies, our organisational structure guarantees an ecosystem in which sustainability is deeply rooted.', 'col_headers': ['Product'], 'table': [{'Product': 'Infrastructure'}, {'Product': 'People'}]}

Summary;
 <<T3>>: The table displays two columns with the header "Product." The fi

{'title': 'The remuneration framework is presented in greater detail below.', 'col_headers': ['Sustainable context\nOrganisation\nWork\nRelations', 'Sustainable context\nOrganisation\nWork\nRelations', 'Sustainable context\nOrganisation\nWork\nRelations', 'TOTALE REMUNERATION'], 'table': [{'Sustainable context\nOrganisation\nWork\nRelations': 'TOTAL REWARD', 'TOTALE REMUNERATION': 'TOTALE REMUNERATION'}, {'Sustainable context\nOrganisation\nWork\nRelations': 'TOTAL REWARD', 'TOTALE REMUNERATION': 'TOTALE REMUNERATION'}, {'Sustainable context\nOrganisation\nWork\nRelations': 'TOTAL REWARD', 'TOTALE REMUNERATION': 'TOTALE REMUNERATION'}]}

Summary;
 <<T9>>: The table displays the remuneration framework in greater detail, with columns representing the sustainable context, organization, work, and relations. The table also includes a column for the total remuneration. The table consists of three rows, each representing a different aspect of the sustainable context, organization, work, and r

{'title': 'risks of force majeure: natural disasters, fires, acts of terrorism and power cuts', 'col_headers': ['STRATEGIC RISKS (CONTINUATION)', 'STRATEGIC RISKS (CONTINUATION)', 'STRATEGIC RISKS (CONTINUATION)'], 'table': [{'STRATEGIC RISKS (CONTINUATION)': 'A thorough climate assessment has been made; from this we conclude that none of the risks associated with climate change leads to a relatively high risk level for the business impact or the asset value of Colruyt Group.\nFlooding appears to carry the highest level of risk. We provide the necessary monitoring for this and have already drawn up corresponding risk management and business continuity plans.\nFurther details can be found on our website:  atmosphere.\nOn the basis of studies and regular evaluations of adaptation measures, we work on specific local as well as overarching measures. Business continuity plans are drawn up and regularly updated.\nWe are working on new adaptation measures like additional water buffer capacity

{'title': 'risks of force majeure: natural disasters, fires, acts of terrorism and power cuts', 'col_headers': ['FORCE MAJEURE RISKS (CONTINUATION)', 'FORCE MAJEURE RISKS (CONTINUATION)', 'FORCE MAJEURE RISKS (CONTINUATION)'], 'table': [{'FORCE MAJEURE RISKS (CONTINUATION)': 'The group seeks to safeguard the continuity of data processing by means of various mirror and back-up systems, continuity planning and contingency scenarios. By monitoring all systems 24/7, we try to detect problems and/or possible risks as quickly as possible.\nIn addition, the group invests in various transformation programmes and projects to renew and strengthen its current infrastructure. Disaster recovery and business continuity play an important role here.\nWe keep our systems up to date through maintenance and upgrades. In this way we remain supported and also eliminate security risks.\nTo ensure the availability of all our IT systems, we have the necessary processes in place to avoid disruptions in the eve

{'title': 'Supported by the internally developed Sustainable Financing Framework, that governs sustainability in financing, the issue of this green retail bond allows Colruyt Group to continue its long- term investments, in particular those in sustainability, in a targeted manner, as well as to set up a diversified financing mix by optimally', 'col_headers': ['Gross dividend', '0,80', '1,10'], 'table': [{'Gross dividend': 'Net dividend', '0,80': '0,56', '1,10': '0,77'}, {'Gross dividend': 'Profit', '0,80': '1,57', '1,10': '2,16'}, {'Gross dividend': 'Calculation base (weighted average) (2)', '0,80': '127.967.641 shares', '1,10': '132.677.085 shares'}]}

Summary;
 <<T18>>: The table displays information related to Colruyt Group's gross dividend, net dividend, profit, and calculation base for the years 2022 and 2021. The gross dividend for 2022 is 0.80, while the net dividend is 0.56. In comparison, the gross dividend for 2021 is 1.10, with a net dividend of 0.77. The profit for 2022 is 

{'title': 'SDG 3 - SDG 6', 'col_headers': ['Safe and healthy working environment', 'Safe and healthy working environment', 'Safe and healthy working environment', 'Safe and healthy working environment', 'Safe and healthy working environment'], 'table': [{'Safe and healthy working environment': '67,63'}, {'Safe and healthy working environment': '1.348.064'}, {'Safe and healthy working environment': '916'}, {'Safe and healthy working environment': '22,75'}, {'Safe and healthy working environment': '0,54'}, {'Safe and healthy working environment': '2.911'}]}

Summary;
 <<T27>>: The table displays data related to SDG 3 (Good Health and Well-being) and SDG 6 (Clean Water and Sanitation). The column headers indicate a focus on creating a safe and healthy working environment. The table includes numerical values for different aspects of this environment, such as the number of incidents (67.63), the total number of working hours (1,348,064), the number of accidents (916), the accident frequency

{'title': 'Learning and developing together', 'col_headers': ['Investment in education and training (in million EUR)', 'Financial year', '32,1', '39,1', '37,74'], 'table': [{'Investment in education and training (in million EUR)': '% payroll invested in education and training', 'Financial year': 'Financial year', '32,1': '2,41', '39,1': '2,82', '37,74': '2,61'}, {'Investment in education and training (in million EUR)': '# individual participants in personal growth and health training courses', 'Financial year': 'Financial year', '32,1': '1.562', '39,1': '1.548', '37,74': '2.702'}, {'Investment in education and training (in million EUR)': '# various personal growth and health training courses', 'Financial year': 'Financial year', '32,1': '73', '39,1': '55', '37,74': '82'}, {'Investment in education and training (in million EUR)': '# employees in a dual learning programme', 'Financial year': 'Financial year', '32,1': '185', '39,1': '211', '37,74': '240'}, {'Investment in education and tr

{'title': 'Soy for food', 'col_headers': ['Soy in food products (in tonnes)', 'Calendar year', '1.046,7', '882,7', '923,9'], 'table': [{'Soy in food products (in tonnes)': '% GMO-free soy in food products (without use of GMO technologies)', 'Calendar year': 'Calendar year', '1.046,7': '100', '882,7': '100', '923,9': '100'}, {'Soy in food products (in tonnes)': 'Soy for food in TIER 1 (at least 5% soy present in the product) (in tonnes)', 'Calendar year': 'Calendar year', '1.046,7': '619', '882,7': '455', '923,9': '518'}, {'Soy in food products (in tonnes)': '% TIER 1 soy for food with sustainability certification (RTRS, ProTerra, BIO)', 'Calendar year': 'Calendar year', '1.046,7': '49', '882,7': '64,3', '923,9': '54,3'}, {'Soy in food products (in tonnes)': '% TIER 1 soy for food with sustainability certification and/or from Europe or North America', 'Calendar year': 'Calendar year', '1.046,7': '91', '882,7': '88,8', '923,9': '87,8'}, {'Soy in food products (in tonnes)': '% TIER 1 soy 

{'title': 'International chain projects', 'col_headers': ['# active supply chain projects', 'Calendar year', '7', '8', '8'], 'table': [{'# active supply chain projects': '# products from supply chain projects in our stores', 'Calendar year': 'Calendar year', '7': '40', '8': '41'}, {'# active supply chain projects': '# farmers indirectly involved in the supply chain projects (via cooperatives)', 'Calendar year': 'Calendar year', '7': '43.864', '8': '45.011'}, {'# active supply chain projects': '# farmers directly involved in the supply chain projects (directly in the chain)', 'Calendar year': 'Calendar year', '7': '2.174', '8': '2.176'}]}

Summary;
 <<T44>>: The table titled "International chain projects" in Colruyt_Group's Annual report with sustainability (2022) displays the number of active supply chain projects and related data for calendar years 7 and 8. The first column shows the number of active supply chain projects, while the second column represents the calendar year. The subs

{'title': 'Avoiding and reducing greenhouse gas emissions: scopes 1 and 2', 'col_headers': ['% food stores equipped with natural refrigerants (2)', 'Financial year', '–', '35,7', '43'], 'table': [{'% food stores equipped with natural refrigerants (2)': '% food stores equipped with heat recovery (2)', 'Financial year': 'Financial year', '–': '–', '35,7': '19,9', '43': '27,4'}, {'% food stores equipped with natural refrigerants (2)': '% food stores without fossil fuels (2)', 'Financial year': 'Financial year', '–': '–', '35,7': '10,2', '43': '19,8'}, {'% food stores equipped with natural refrigerants (2)': '% low-energy stores in total retail building stock (3)', 'Financial year': 'Financial year', '–': '–', '35,7': '42,5', '43': '47,3'}, {'% food stores equipped with natural refrigerants (2)': '% rotations with liquid ice containers (4)', 'Financial year': 'Financial year', '–': '85,83', '35,7': '93,9', '43': '97,6'}, {'% food stores equipped with natural refrigerants (2)': 'Refrigerant

{'title': 'and future adaptation measures. Read more about the risk assessment from p. 137.', 'col_headers': ['Activity number', 'Activity name', 'Colruyt Group’s main activities', 'Net turnover', 'CapEx', 'OpEx', 'Assessment using the technical screening criteria'], 'table': [{'Activity number': '1.1', 'Activity name': 'Afforestation', 'Colruyt Group’s main activities': 'Forest planting in the Democratic Republic of the Congo', 'Net turnover': '', 'CapEx': '•', 'OpEx': '', 'Assessment using the technical screening criteria': 'Working closely with the project team, the technical screening criteria were extensively reviewed and positively assessed, thanks to a well-supported afforestation plan and appropriate documentation. Among other things, the project is leading to a demonstrable improvement in terms of biodiversity and water management.'}, {'Activity number': '3.6\nNew', 'Activity name': 'Manufacture of other low-carbon technologies', 'Colruyt Group’s main activities': "Liquid ice 

{'title': 'and future adaptation measures. Read more about the risk assessment from p. 137.', 'col_headers': ['Activity number', 'Activity name', 'Colruyt Group’s main activities', 'Net turnover', 'CapEx', 'OpEx', 'Assessment using the technical screening criteria'], 'table': [{'Activity number': '7.2', 'Activity name': 'Renovation of existing buildings', 'Colruyt Group’s main activities': 'Renovation of branches and sites', 'Net turnover': '', 'CapEx': '', 'OpEx': '', 'Assessment using the technical screening criteria': 'After in-depth consultations, we have decided not to recognise a positive assessment for the renovation of our existing buildings yet. We prefer to take a conservative approach and trust that we will meet these criteria sufficiently quickly. The activity is thus not considered aligned.'}, {'Activity number': '7.3', 'Activity name': 'Installation, maintenance and repair of energy efficiency equipment', 'Colruyt Group’s main activities': 'LED lighting', 'Net turnover': 

{'title': 'Financial report', 'col_headers': ['(in million EUR)', 'Note', '2022/23', '2021/22(1)'], 'table': [{'(in million EUR)': 'Revenue', 'Note': '3.', '2022/23': '9.933,6', '2021/22(1)': '9.251,1'}, {'(in million EUR)': 'Cost of goods sold', 'Note': '3.', '2022/23': '(7.074,2)', '2021/22(1)': '(6.546,4)'}, {'(in million EUR)': 'Gross profit', 'Note': '3.', '2022/23': '2.859,4', '2021/22(1)': '2.704,7'}]}

Summary;
 <<T60>>: The table displays financial information from Colruyt_Group's Annual Report with sustainability for the years 2022/23 and 2021/22. It includes three columns: "(in million EUR)", "Note", and "2022/23" and "2021/22(1)" as column headers. The rows of the table represent different financial metrics. The first row shows the revenue for the respective years, followed by the cost of goods sold and gross profit. The values in the table indicate the financial figures in million euros for each metric.



{'title': '(35,2)', 'col_headers': ['Operating profit (EBIT)', '', 

{'title': 'Given the nature of its activities, Colruyt Group does not rely on a limited number of major customers.', 'col_headers': ['(in million EUR)', 'Retail 2022/23(1)', 'Wholesale\nand Foodservice\n2022/23', 'Other activities\n2022/23', 'Operating segments 2022/23'], 'table': [{'(in million EUR)': 'Revenue - external', 'Retail 2022/23(1)': '8.749,9', 'Wholesale\nand Foodservice\n2022/23': '1.161,3', 'Other activities\n2022/23': '908,4', 'Operating segments 2022/23': '10.819,6'}, {'(in million EUR)': '', 'Retail 2022/23(1)': '', 'Wholesale\nand Foodservice\n2022/23': '', 'Other activities\n2022/23': '', 'Operating segments 2022/23': ''}, {'(in million EUR)': 'Revenue – internal', 'Retail 2022/23(1)': '72,3', 'Wholesale\nand Foodservice\n2022/23': '21,6', 'Other activities\n2022/23': '20,5', 'Operating segments 2022/23': '114,4'}, {'(in million EUR)': '', 'Retail 2022/23(1)': '', 'Wholesale\nand Foodservice\n2022/23': '', 'Other activities\n2022/23': '', 'Operating segments 2022/23'

{'title': 'Impairments amounting to EUR 27,9 million were realised on property, plant and equipment and intangible assets, mainly related to the loss-making activities of Dreamland and Dreambaby.', 'col_headers': ['(in million EUR)', 'Operating segments 2021/22', 'Unallocated\n2021/22', 'Eliminations\nbetween operating segments & reclassification to discontinued operations(3)\n2021/22', 'Consolidated\n2021/22'], 'table': [{'(in million EUR)': 'Revenue – external', 'Operating segments 2021/22': '10.049,3', 'Unallocated\n2021/22': '-', 'Eliminations\nbetween operating segments & reclassification to discontinued operations(3)\n2021/22': '(798,2)', 'Consolidated\n2021/22': '9.251,1'}, {'(in million EUR)': '', 'Operating segments 2021/22': '', 'Unallocated\n2021/22': '', 'Eliminations\nbetween operating segments & reclassification to discontinued operations(3)\n2021/22': '', 'Consolidated\n2021/22': ''}, {'(in million EUR)': 'Revenue – internal', 'Operating segments 2021/22': '99,1', 'Unall

{'title': 'Including the revenue from Dreamland and Dreambaby, Bike Republic, The Fashion Society, Jims (since May 2021) and Newpharma (period from October to December 2022).', 'col_headers': ['(in million EUR)', '2022/23', '2021/22(1)'], 'table': [{'(in million EUR)': 'Rental and rental-related income', '2022/23': '14,3', '2021/22(1)': '13,2'}, {'(in million EUR)': 'Gains on disposal of non-current assets', '2022/23': '10,0', '2021/22(1)': '6,9'}, {'(in million EUR)': 'Remuneration received', '2022/23': '97,5', '2021/22(1)': '87,2'}, {'(in million EUR)': 'Other', '2022/23': '26,7', '2021/22(1)': '28,2'}, {'(in million EUR)': 'Total other operating income\t148,5\t135,5', '2022/23': 'Total other operating income\t148,5\t135,5', '2021/22(1)': 'Total other operating income\t148,5\t135,5'}]}

Summary;
 <<T70>>: The table displays the other operating income of Colruyt_Group for the years 2022/23 and 2021/22. It includes various sources of income such as rental and rental-related income, gai

{'title': 'Income taxes recognised in profit or loss', 'col_headers': ['(in million EUR)', '2022/23', '2021/22(1)'], 'table': [{'(in million EUR)': 'A) Effective tax rate\nProfit before tax (excluding share in the result of investments accounted for using the equity method)', '2022/23': '240,1', '2021/22(1)': '364,6'}, {'(in million EUR)': 'Income tax expense', '2022/23': '62,2', '2021/22(1)': '92,6'}, {'(in million EUR)': 'Effective tax rate(2)\t25,90%\t25,40%', '2022/23': 'Effective tax rate(2)\t25,90%\t25,40%', '2021/22(1)': 'Effective tax rate(2)\t25,90%\t25,40%'}, {'(in million EUR)': 'B) Reconciliation between the effective tax rate and the applicable tax rate(3)\nProfit before tax (excluding share in the result of investments accounted for using the equity method)', '2022/23': '24,35%\n240,1', '2021/22(1)': '24,68%\n364,6'}, {'(in million EUR)': 'Income tax expense (based on applicable tax rate)\t58,5\t90,0', '2022/23': 'Income tax expense (based on applicable tax rate)\t58,5\t9

{'title': 'Property, plant and equipment', 'col_headers': ['(in million EUR)', 'Land and buildings', 'Plant, machinery\nand equipment', 'Furniture and vehicles', 'Right-of-use\nassets', 'Other property, plant and equipment', 'Assets under construction', 'Total'], 'table': [{'(in million EUR)': 'Acquisition value', 'Land and buildings': 'Acquisition value', 'Plant, machinery\nand equipment': 'Acquisition value', 'Furniture and vehicles': 'Acquisition value', 'Right-of-use\nassets': 'Acquisition value', 'Other property, plant and equipment': 'Acquisition value', 'Assets under construction': 'Acquisition value', 'Total': ''}, {'(in million EUR)': 'At 1 April 2022', 'Land and buildings': '3.139,8', 'Plant, machinery\nand equipment': '880,0', 'Furniture and vehicles': '556,3', 'Right-of-use\nassets': '361,5', 'Other property, plant and equipment': '223,4', 'Assets under construction': '94,7', 'Total': '5.255,7'}, {'(in million EUR)': 'Revaluation(1)', 'Land and buildings': '-', 'Plant, mach

{'title': 'As adjusted due to discontinued operations. See note 16 for more information.', 'col_headers': ['(in million EUR)', 'Land and buildings', 'Plant, machinery\nand equipment', 'Furniture and vehicles', 'Right-of-use\nassets', 'Other property, plant and equipment', 'Assets under construction', 'Total'], 'table': [{'(in million EUR)': 'Acquisition value', 'Land and buildings': 'Acquisition value', 'Plant, machinery\nand equipment': 'Acquisition value', 'Furniture and vehicles': 'Acquisition value', 'Right-of-use\nassets': 'Acquisition value', 'Other property, plant and equipment': 'Acquisition value', 'Assets under construction': 'Acquisition value', 'Total': ''}, {'(in million EUR)': 'At 1 April 2021', 'Land and buildings': '2.957,3', 'Plant, machinery\nand equipment': '847,2', 'Furniture and vehicles': '548,2', 'Right-of-use\nassets': '284,7', 'Other property, plant and equipment': '202,9', 'Assets under construction': '83,1', 'Total': '4.923,4'}, {'(in million EUR)': 'Revaluat

{'title': 'Investments in associates', 'col_headers': ['(in million EUR)', '2022/23', '2021/22'], 'table': [{'(in million EUR)': '', '2022/23': '', '2021/22': ''}, {'(in million EUR)': 'Carrying amount at 1 April\t452,3\t313,4', '2022/23': 'Carrying amount at 1 April\t452,3\t313,4', '2021/22': 'Carrying amount at 1 April\t452,3\t313,4'}, {'(in million EUR)': 'Acquisitions/capital increases', '2022/23': '97,6', '2021/22': '115,2'}, {'(in million EUR)': 'Transactions with non-controlling interests', '2022/23': '(20,6)', '2021/22': '-'}, {'(in million EUR)': 'Disposals/capital decreases', '2022/23': '(94,6)', '2021/22': '(0,7)'}, {'(in million EUR)': 'Share in the result for the financial year', '2022/23': '3,2', '2021/22': '5,8'}, {'(in million EUR)': 'Share in other comprehensive income', '2022/23': '88,2', '2021/22': '16,4'}, {'(in million EUR)': 'Dividend', '2022/23': '(1,4)', '2021/22': '(0,2)'}, {'(in million EUR)': 'Other', '2022/23': '1,3', '2021/22': '2,4'}, {'(in million EUR)': 

{'title': 'Investments in joint ventures', 'col_headers': ['(in million EUR)', '2022/23', '2021/22'], 'table': [{'(in million EUR)': '', '2022/23': '', '2021/22': ''}, {'(in million EUR)': 'Carrying amount at 1 April', '2022/23': '12,0', '2021/22': '6,9'}, {'(in million EUR)': 'Acquisitions/capital increases', '2022/23': '6,1', '2021/22': '5,0'}, {'(in million EUR)': 'Disposals', '2022/23': '(0,2)', '2021/22': '-'}, {'(in million EUR)': 'Change in ownership percentage', '2022/23': '0,1', '2021/22': '-'}, {'(in million EUR)': 'Share in the result for the financial year', '2022/23': '(1,5)', '2021/22': '0,1'}, {'(in million EUR)': 'Carrying amount at 31 March', '2022/23': '16,5', '2021/22': '12,0'}]}

Summary;
 <<T89>>: The table displays the investments in joint ventures for Colruyt_Group's annual report with sustainability in 2022. It includes columns for the years 2022/23 and 2021/22. The table shows the carrying amount at the beginning and end of the period, acquisitions/capital incr

{'title': 'Consolidated income statement from discontinued operations', 'col_headers': ['(in million EUR)', '2022/23', '2021/22'], 'table': [{'(in million EUR)': 'Revenue', '2022/23': '886,2', '2021/22': '798,2'}, {'(in million EUR)': 'Costs', '2022/23': '(938,1)', '2021/22': '(846,7)'}, {'(in million EUR)': 'Other operating income', '2022/23': '79,4', '2021/22': '60,6'}, {'(in million EUR)': 'Operating profit (EBIT)', '2022/23': '27,5', '2021/22': '12,1'}, {'(in million EUR)': 'Profit before tax', '2022/23': '27,8', '2021/22': '12,1'}, {'(in million EUR)': 'Income tax expense', '2022/23': '(6,9)', '2021/22': '(2,1)'}, {'(in million EUR)': 'Profit for the financial year from discontinued operations\t20,9\t10,0', '2022/23': 'Profit for the financial year from discontinued operations\t20,9\t10,0', '2021/22': 'Profit for the financial year from discontinued operations\t20,9\t10,0'}, {'(in million EUR)': 'Attributable to:\nOwners of the parent company', '2022/23': '20,9', '2021/22': '10,0'

{'title': 'Based on the most recent transparency notification of 24 March 2023 and taking into account the companies’ treasury shares held by the company at 31 March 2023, the shareholder structure of Etn. Fr. Colruyt NV is as follows:', 'col_headers': ['', 'Shares'], 'table': [{'': 'Colruyt family and relatives', 'Shares': '82.969.340'}, {'': 'Etn. Fr. Colruyt NV (treasury shares)(1)', 'Shares': '6.687.980'}, {'': 'Total of parties acting in concert\t89.657.320', 'Shares': 'Total of parties acting in concert\t89.657.320'}]}

Summary;
 <<T103>>: The table displays the shareholder structure of Etn. Fr. Colruyt NV as of 31 March 2023. It shows the number of shares held by different parties, including the Colruyt family and relatives (82.969.340 shares) and Etn. Fr. Colruyt NV (treasury shares) (6.687.980 shares). The total number of shares held by parties acting in concert is 89.657.320.



{'title': 'Earnings per share', 'col_headers': ['', '2022/23', '2021/22(1)'], 'table': [{'': 'Tota

{'title': 'The amount resulting from the group’s liabilities related to its defined contribution plans with a legally guaranteed minimum return, as recorded in the consolidated statement of financial position, is as follows:', 'col_headers': ['(in million EUR)', '31.03.23', '31.03.22'], 'table': [{'(in million EUR)': 'Present value of the gross obligation under the defined contribution plans with a legally guaranteed minimum return', '31.03.23': '278,8', '31.03.22': '283,2'}, {'(in million EUR)': 'Fair value of plan assets', '31.03.23': '204,4', '31.03.22': '192,6'}, {'(in million EUR)': 'Deficit/(surplus) of funded plans\t74,4\t90,6', '31.03.23': 'Deficit/(surplus) of funded plans\t74,4\t90,6', '31.03.22': 'Deficit/(surplus) of funded plans\t74,4\t90,6'}, {'(in million EUR)': 'Total liability for employee benefits, of which:\nPortion recognised as non-current liabilities', '31.03.23': '74,4', '31.03.22': '90,6'}, {'(in million EUR)': 'Portion recognised as non-current assets', '31.03.

{'title': 'Changes to the main assumptions impact the group’s main employee benefits-related liabilities as follows:', 'col_headers': ['Benefits related to the\tLong-service benefits ‘Unemployment regime with\t\t(Belgian entities)\ncompany supplement’', 'Benefits related to the\tLong-service benefits ‘Unemployment regime with\t\t(Belgian entities)\ncompany supplement’', 'Benefits related to the\tLong-service benefits ‘Unemployment regime with\t\t(Belgian entities)\ncompany supplement’', 'Benefits related to the\tLong-service benefits ‘Unemployment regime with\t\t(Belgian entities)\ncompany supplement’'], 'table': [{'Benefits related to the\tLong-service benefits ‘Unemployment regime with\t\t(Belgian entities)\ncompany supplement’': '31.03.22'}, {'Benefits related to the\tLong-service benefits ‘Unemployment regime with\t\t(Belgian entities)\ncompany supplement’': '3,9'}, {'Benefits related to the\tLong-service benefits ‘Unemployment regime with\t\t(Belgian entities)\ncompany supplement’

{'title': 'For lease liabilities and similar liabilities, this includes the effect of renewing existing leases and revaluing leases due to indexations, as well as reclassifications to liabilities from discontinued operations.', 'col_headers': ['(in million EUR)', '31.03.21', 'Cash flow', 'Changes in lease\nportfolio(1)', 'Business combinations', 'Reclassi- fication', 'Other(2)', '31.03.22'], 'table': [{'(in million EUR)': 'Lease liabilities and similar liabilities', '31.03.21': '242,8', 'Cash flow': '(51,2)', 'Changes in lease\nportfolio(1)': '43,7', 'Business combinations': '29,6', 'Reclassi- fication': '-', 'Other(2)': '19,1', '31.03.22': '284,0'}, {'(in million EUR)': 'Current', '31.03.21': '41,2', 'Cash flow': '(51,2)', 'Changes in lease\nportfolio(1)': '3,7', 'Business combinations': '5,7', 'Reclassi- fication': '46,1', 'Other(2)': '5,4', '31.03.22': '50,9'}, {'(in million EUR)': 'Non-current', '31.03.21': '201,6', 'Cash flow': '-', 'Changes in lease\nportfolio(1)': '40,0', 'Busin

{'title': 'In accordance with IFRS 7, ‘Financial Instruments: Disclosures’ and IFRS 13, ‘Fair Value Measurement’, financial instruments measured at fair value are classified using a fair value hierarchy.', 'col_headers': ['', 'Measurement at fair value', 'Measurement at fair value', 'Measurement at fair value', ''], 'table': [{'': 'Carrying amount', 'Measurement at fair value': 'Non-observable market prices\nLevel 3'}, {'': '10,8', 'Measurement at fair value': '10,8'}, {'': '0,1', 'Measurement at fair value': '-'}, {'': '9,4', 'Measurement at fair value': '0,3'}, {'': '17,3', 'Measurement at fair value': '-'}, {'': '', 'Measurement at fair value': 'Financial assets at amortised cost Non-current assets'}, {'': '38,3', 'Measurement at fair value': '-'}, {'': '', 'Measurement at fair value': 'Current assets(2)'}, {'': '4,5', 'Measurement at fair value': '-'}, {'': '632,5', 'Measurement at fair value': '-'}, {'': '358,6', 'Measurement at fair value': '-'}, {'': '1.071,5', 'Measurement at f

{'title': 'The amounts to be received in relation to these rights are to be classified as follows:', 'col_headers': ['(in million EUR)', '31.03.22', '< 1 year', '1-5 years', '> 5 years'], 'table': [{'(in million EUR)': 'Lease arrangements as lessor', '31.03.22': '14,7', '< 1 year': '8,1', '1-5 years': '6,6', '> 5 years': '-'}]}

Summary;
 <<T130>>: The table displays the amounts to be received in relation to lease arrangements as a lessor. The amounts are classified based on their time horizon: less than 1 year, 1-5 years, and more than 5 years. As of March 31, 2022, the total amount to be received is €14.7 million, with €8.1 million expected within the next year, €6.6 million expected within 1-5 years, and no amount expected beyond 5 years.



{'title': 'The table below gives an overview of all contingent liabilities of Colruyt Group.', 'col_headers': ['(in million EUR)', '31.03.23', '31.03.22'], 'table': [{'(in million EUR)': 'Disputes', '31.03.23': '3,8', '31.03.22': '7,1'}, {'(in m

{'title': 'Subsidiaries', 'col_headers': ['Colruyt Afrique SAS', 'Sacre Coeur III VDN, Villa numéro 10684,\nBoîte Postal 4579', 'Dakar, Senegal', 'SN DKR 2020 B 13136', '100%'], 'table': [{'Colruyt Afrique SAS': 'Colruyt Cash and Carry NV', 'Sacre Coeur III VDN, Villa numéro 10684,\nBoîte Postal 4579': 'Edingensesteenweg 196', 'Dakar, Senegal': '1500 Halle, Belgium', 'SN DKR 2020 B 13136': '0716 663 318', '100%': '100%'}, {'Colruyt Afrique SAS': 'Colruyt Gestion SA', 'Sacre Coeur III VDN, Villa numéro 10684,\nBoîte Postal 4579': 'Rue F.W. Raiffeisen 5', 'Dakar, Senegal': '2411 Luxembourg,\nGrand Duchy of Luxembourg', 'SN DKR 2020 B 13136': 'B137485', '100%': '100%'}, {'Colruyt Afrique SAS': 'Colruyt Group Services NV', 'Sacre Coeur III VDN, Villa numéro 10684,\nBoîte Postal 4579': 'Edingensesteenweg 196', 'Dakar, Senegal': '1500 Halle, Belgium', 'SN DKR 2020 B 13136': '0880 364 278', '100%': '100%'}, {'Colruyt Afrique SAS': 'Colruyt IT Consultancy India Private LTD', 'Sacre Coeur III V

{'title': 'Subsidiaries', 'col_headers': ['Immo Colruyt Luxembourg SA', 'Rue F.W. Raiffeisen 5', '2411 Luxembourg, Grand Duchy of\nLuxembourg', 'B195799', '100%'], 'table': [{'Immo Colruyt Luxembourg SA': 'Immo De CE Floor BV', 'Rue F.W. Raiffeisen 5': 'Edingensesteenweg 196', '2411 Luxembourg, Grand Duchy of\nLuxembourg': '1500 Halle, Belgium', 'B195799': '0446 434 580', '100%': '100%'}, {'Immo Colruyt Luxembourg SA': 'Immoco SARL', 'Rue F.W. Raiffeisen 5': 'Zone Industrielle, Rue des Entrepôts 4', '2411 Luxembourg, Grand Duchy of\nLuxembourg': '39700 Rochefort-sur-Nenon, France', 'B195799': '527 664 965', '100%': '100%'}, {'Immo Colruyt Luxembourg SA': 'Izock BV', 'Rue F.W. Raiffeisen 5': 'Kerkstraat 132-134', '2411 Luxembourg, Grand Duchy of\nLuxembourg': '1851 Humbeek, Belgium', 'B195799': '0426 190 284', '100%': '100%'}, {'Immo Colruyt Luxembourg SA': 'Jims NV', 'Rue F.W. Raiffeisen 5': 'Edingensesteenweg 196', '2411 Luxembourg, Grand Duchy of\nLuxembourg': '1500 Halle, Belgium', 

{'title': 'Subsidiaries', 'col_headers': ['WV1 BV', 'Guldensporenpark 100, blok K', '9820 Merelbeke, Belgium', '0627 969 585', '100%'], 'table': [{'WV1 BV': 'WV2 BV', 'Guldensporenpark 100, blok K': 'Tramstraat 63', '9820 Merelbeke, Belgium': '9052 Zwijnaarde, Belgium', '0627 969 585': '0627 973 149', '100%': '100%'}, {'WV1 BV': 'WV3 BV', 'Guldensporenpark 100, blok K': 'Tramstraat 63', '9820 Merelbeke, Belgium': '9052 Zwijnaarde, Belgium', '0627 969 585': '0477 728 760', '100%': '100%'}, {'WV1 BV': 'Yaleli BV', 'Guldensporenpark 100, blok K': 'Tramstraat 63', '9820 Merelbeke, Belgium': '9052 Zwijnaarde, Belgium', '0627 969 585': '0672 981 941', '100%': '100%'}, {'WV1 BV': 'Zeeboerderij Westdiep BV', 'Guldensporenpark 100, blok K': 'Edingensesteenweg 196', '9820 Merelbeke, Belgium': '1500 Halle, Belgium', '0627 969 585': '0739 918 869', '100%': '80%'}]}

Summary;
 <<T138>>: The table titled "Subsidiaries" in Colruyt_Group's Annual report with sustainability (2022) displays information 

{'title': 'Internet:  Email:', 'col_headers': ['(in million EUR)', '2022/23', '2021/22'], 'table': [{'(in million EUR)': 'I. Operating income', '2022/23': '7.805,1', '2021/22': '7.351,6'}, {'(in million EUR)': 'II. Operating expenses', '2022/23': '(7.643,6)', '2021/22': '(7.177,4)'}, {'(in million EUR)': 'III. Operating profit', '2022/23': '161,5', '2021/22': '174,2'}, {'(in million EUR)': 'IV. Finance income', '2022/23': '1.927,9', '2021/22': '209,0'}, {'(in million EUR)': 'V. Financial costs', '2022/23': '(273,9)', '2021/22': '(149,8)'}, {'(in million EUR)': 'VI. Profit for the financial year before tax', '2022/23': '1.815,5', '2021/22': '233,4'}, {'(in million EUR)': 'VIII. Income tax', '2022/23': '(4,9)', '2021/22': '(6,8)'}, {'(in million EUR)': 'IX. Profit for the financial year', '2022/23': '1.810,6', '2021/22': '226,6'}, {'(in million EUR)': 'X.A. Transfer from the tax exempt reserves', '2022/23': '0,2', '2021/22': '0,9'}, {'(in million EUR)': 'X.B. Transfer to the tax exempt r

**WARNING**: We use the key <<T{i}>>: to mark a chunk as a table summary. This matters later, when we feed context into the LLM: the summary is passed to the LLM together with the raw table. To achieve this without having to encode the raw table, we search the retrieved contextual chunks for this id. When it is found, we look through the list of raw tables and join the matching table to its summary.

NOTE: i starts at 1!
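
The lookup described above can be sketched as follows. This is a minimal illustration, not the notebook's actual retrieval code: `attach_raw_tables`, `chunk` and `raw_tables` are hypothetical names, and the only thing taken from the notebook is the `<<T{i}>>:` key format with `i` starting at 1.

```python
import re

# Matches keys such as "<<T44>>:" and captures the 1-based table index.
TABLE_ID_PATTERN = re.compile(r"<<T(\d+)>>:")


def attach_raw_tables(chunk, raw_tables):
    """Append the raw table for every <<T{i}>>: key found in a retrieved chunk.

    `raw_tables` is assumed to be the list of raw table dicts in extraction
    order, so key i maps to raw_tables[i - 1].
    """
    parts = [chunk]
    for match in TABLE_ID_PATTERN.finditer(chunk):
        idx = int(match.group(1)) - 1  # keys start at 1, lists at 0
        if 0 <= idx < len(raw_tables):
            parts.append(f"RAW TABLE: {raw_tables[idx]}")
    return "\n".join(parts)


# Hypothetical data for illustration only.
raw_tables = [{"title": "A"}, {"title": "B"}]
context = attach_raw_tables("<<T2>>: The table shows B.", raw_tables)
print(context)
```

Chunks without a key, or with an id outside the list, are returned unchanged, so the same helper can be applied to every retrieved chunk regardless of its source.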

### Obtain short summaries for figures:
Similarly to tables, we first pass figures through an LLM to facilitate encoding. Note that our confidence in the figure information is low. Figures are captured using heuristics and may hence be too vague to summarise meaningfully. They may also be extracts from the text that have been wrongly captured as figures. In either case a textual summary is required.

In [10]:
doc_contents[FIGURE_SUMMARY_KEY] = []

In [11]:
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a figure and text summarizer."
    }, {
        "role": "user",
        "content": f"Decide whether the information below (INFO) obtained from {comp_name}'s " + 
            f"{report_name} is from a figure or text, " + 
            "and describe it in no more than 70 words. Note, if the information is vague " + 
            "return 'None'.\n" + 
            "INFO: {'title': 'Global headquarters', 'data': ['200+']}"
    }, {
        "role": "assistant", 
        "content": "Text: Coca-Cola has more than 200 global headquarters."
    }, {
        "role": "user", 
        "content": "INFO: {'title': '2022 Progress on Sustainable Sourcing2', 'data': ['0\t20\t40\t60\t80\t100', '36%', 'GRAPES\t 37%', 'SUGAR CANE\t 40%', 'APPLES\t 55%', 'CORN\t 70%', 'TEA\t 74%', '80%', 'PULP AND PAPER\t 86%', 'ORANGES\t 89%', 'LEMONS\t 96%']}"
    }, {
        "role": "assistant", 
        "content": "Figure: 2022 progress on sustainable sourcing. 37% of grapes, 40% of sugar cane, 55% of apples, 70% of corn, 74% of tea, 86% of pulp and paper, 89% of oranges, and 96% of lemons were sustainably sourced."
    }, {
        "role": "user", 
        "content": "INFO: {'title': 'Organic Revenue Growth (Non-GAAP)1', 'data': ['25', '-5', '24%']}",
    }, {
        "role": "assistant", 
        "content": "Text: Organic Revenue Growth (Non-GAAP) was 24%."
    }, {
        "role": "user", 
        "content": None,
    },
]

# TODO: Avoid repeating loop. Create separate function.
for i, figure in tqdm(enumerate(doc_contents[FIGURE_KEY], start=0)):
    
    if doc_contents[FIGURE_SUMMARY_KEY] != []:
        # KEEP TRACK OF LAST GPT OUTPUT - This may help with next
        # figure summarisation as there may be a link between the two.
        parameters['messages'][-3:-1] = {
            "role": "user", 
            "content": f"INFO: {doc_contents[FIGURE_KEY][i-1]}"  # previous figure
        }, {
            "role": "assistant", 
            "content": f"{doc_contents[FIGURE_SUMMARY_KEY][-1]}"
        }
    
    parameters['messages'][-1] = {
        "role": "user", 
        "content": f"INFO: {figure}"
    }
    retries = 0
    success = False
    while retries < MAX_RETRIES and not success:
        try:
            response = openai.ChatCompletion.create(
              **parameters
            )
            # Add <<F{i}>>: in front to facilitate later lookup
            doc_contents[FIGURE_SUMMARY_KEY].append(
                f"<<F{i+1}>>: {response['choices'][0]['message']['content']}"
            )
            success = True
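        except (openai.error.Timeout, openai.error.APIConnectionError, openai.error.AuthenticationError) as e:
            retries += 1
            print(f"Error encountered. Retrying {retries}/{MAX_RETRIES}")
            time.sleep(WAIT_SECONDS)

    if not success:
        # Keep summaries aligned with figures even when every retry failed.
        doc_contents[FIGURE_SUMMARY_KEY].append(f"<<F{i+1}>>: None")

    print(f"Figure {i + 1} Summary;\n {doc_contents[FIGURE_SUMMARY_KEY][-1]}\n\n")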


0it [00:00, ?it/s]

Figure 1 Summary;
 <<F1>>: None


Figure 2 Summary;
 <<F2>>: Text: The annual report is available on colruytgroup.com/en/annualreport. The corporate website also includes press releases, extra stories, and background information.


Figure 3 Summary;
 <<F3>>: Text: The organisational structure of Colruyt_Group includes the Sustainability Domain and the Domain Board as overarching bodies. It is designed to ensure that sustainability is deeply rooted within the company, and it consists of 12 programmes.


Figure 4 Summary;
 <<F4>>: Figure: The data provided is not clear and does not provide enough information to determine its meaning.


Figure 5 Summary;
 <<F5>>: Figure: The data provided is not clear and does not provide enough information to determine its meaning.


Figure 6 Summary;
 <<F6>>: Figure: The medium value is 2.


Figure 7 Summary;
 <<F7>>: Text: The revenue grew by 7.7% according to the management report.


Figure 8 Summary;
 <<F8>>: Text: The group, DATS 24, will act in acc

Figure 49 Summary;
 <<F49>>: None.


Figure 50 Summary;
 <<F50>>: Figure: Boni products achieved a reduction of 15 tonnes of fat.


Figure 51 Summary;
 <<F51>>: <<F50>>: Figure: Boni products achieved a reduction of 39 tonnes of salt.


Figure 52 Summary;
 <<F52>>: Figure: Colruyt_Group added 66 tonnes of fibers.


Figure 53 Summary;
 <<F53>>: Text: In 2022, Colruyt_Group's main activities in Belgium consumed 598.066 m³ of water. By maximizing the reuse of rainwater and wastewater, their dependence on mains water and water from wells decreased to 63.9%.


Figure 54 Summary;
 <<F54>>: Figure: Evolution of the average simultaneity of Colruyt_Group's electricity generation and consumption from 2018 to 2022.


Figure 55 Summary;
 <<F55>>: Figure: In the financial year 2022/23, there were 3,030 employees who did a job switch.


Figure 56 Summary;
 <<F56>>: Text: By 2022, the Boni beverage range switched from colored PET bottles to transparent PET, allowing transparent bottles to be made fro

**WARNING**: We use the key <<F{i}>>: to mark a chunk as a figure summary. This matters later, when we feed context into the LLM: the summary is passed to the LLM together with the raw figure data. To achieve this without having to encode the raw figure data, we search the retrieved contextual chunks for this id. When it is found, we look through the list of raw figures and join the matching figure to its summary.

NOTE: i starts at 1!
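
The TODO in the figure loop notes that the retry logic should live in a separate function. One way this could look is sketched below. `call_with_retries` and its parameters are hypothetical names, and a stand-in function replaces the real API call so the example runs without network access.

```python
import time


def call_with_retries(fn, max_retries=3, wait_seconds=0.0, retry_on=(Exception,)):
    """Call fn() until it succeeds or max_retries attempts have failed.

    Returns the result of fn(), or None when every attempt raised one of
    the exception types listed in retry_on.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except retry_on:
            print(f"Error encountered. Retrying {attempt}/{max_retries}")
            time.sleep(wait_seconds)
    return None


# Stand-in for the API call: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "Figure: example summary"


result = call_with_retries(flaky_call, max_retries=5, retry_on=(ConnectionError,))
print(result)
```

In the notebook, `fn` would wrap `openai.ChatCompletion.create(**parameters)` and `retry_on` would hold the `openai.error` types already caught in the loop. A `None` result signals that all retries were exhausted, which the caller can turn into a placeholder summary.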

*OPTIONAL: Figure summaries are short, so their embeddings carry a concentrated signal and may be wrongly selected during retrieval. To counter this, we can merge multiple figure summaries together and thereby forcefully introduce noise into the figure chunks. The merged chunks can still be selected, but with a lower chance than if the signal were amplified by a short sequence length.*

In [12]:
def join_pairs(lst):
    # Fewer than two elements: nothing to pair up
    if len(lst) < 2:
        return list(lst)
    # If the list has an odd length (three or more elements)
    if len(lst) % 2 != 0:
        # Concatenate last three elements
        last = lst[-3] + '\n' + lst[-2] + '\n' + lst[-1]
        # Pair up the rest
        return [lst[i] + '\n' + lst[i+1] for i in range(0, len(lst) - 3, 2)] + [last]
    else:
        return [lst[i] + '\n' + lst[i+1] for i in range(0, len(lst), 2)]

is_join_pairs = False

if is_join_pairs:
    doc_contents[FIGURE_SUMMARY_KEY] = join_pairs(doc_contents[FIGURE_SUMMARY_KEY])
    print(doc_contents[FIGURE_SUMMARY_KEY])
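
To make the pairing rule concrete, the sketch below restates it in a compact form and runs it on hypothetical summaries. `merge_in_pairs` is an illustrative name and not part of the notebook. It is intended to give the same grouping for lists of two or more elements and returns shorter lists unchanged.

```python
def merge_in_pairs(items):
    """Join neighbouring strings in pairs; an odd leftover joins the last pair."""
    if len(items) < 2:
        return list(items)
    # Pair up elements 0-1, 2-3, ...; a trailing odd element is left out here.
    merged = ["\n".join(items[i:i + 2]) for i in range(0, len(items) - 1, 2)]
    if len(items) % 2 != 0:
        # Fold the leftover element into the final pair, giving a group of three.
        merged[-1] = merged[-1] + "\n" + items[-1]
    return merged


# Hypothetical figure summaries for illustration only.
demo = ["<<F1>>: one", "<<F2>>: two", "<<F3>>: three", "<<F4>>: four", "<<F5>>: five"]
for group in merge_in_pairs(demo):
    print(repr(group))
```

With five summaries the result has two chunks: the first pair, and a final group of three. Each merged chunk still contains every original `<<F{i}>>:` key, so the later id lookup is unaffected.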

Save the results to avoid having to pointlessly rerun the table and figure summarisation parts.

In [13]:
is_override = True
fpath = f"./data/pickle_db/{comp_name}_{fname}_{year}_inputs.pkl"

# TODO: in industrialised code this would precede the above load procedure!
if os.path.exists(fpath) and not is_override:
    with open(fpath, 'rb') as f:
        doc_contents = pickle.load(f)
else:
    print(f'Saving inputs to {fpath}')
    with open(fpath, 'wb') as f:
        pickle.dump(doc_contents, f)

Saving inputs to ./data/pickle_db/Colruyt_Group_annual-report-with-sustainability-reporting-2022-2023_2022_inputs.pkl
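
As the TODO suggests, the cache check would ideally run before the expensive summarisation rather than after it. A minimal sketch of such a load-or-compute pattern follows. `load_or_compute`, `compute_fn` and the demo data are hypothetical, and a temporary directory stands in for `./data/pickle_db`.

```python
import os
import pickle
import tempfile


def load_or_compute(fpath, compute_fn, override=False):
    """Return cached results from fpath, or compute and cache them."""
    if os.path.exists(fpath) and not override:
        with open(fpath, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    os.makedirs(os.path.dirname(fpath) or ".", exist_ok=True)
    with open(fpath, "wb") as f:
        pickle.dump(result, f)
    return result


# Demonstration with a stand-in for the expensive summarisation step.
calls = []


def expensive_summarisation():
    calls.append(1)
    return {"figure_summaries": ["<<F1>>: None"]}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "inputs.pkl")
    first = load_or_compute(cache_path, expensive_summarisation)
    second = load_or_compute(cache_path, expensive_summarisation)  # served from cache

print(first == second, len(calls))
```

Passing `override=True` forces a recompute and overwrites the cached file, which mirrors the role of `is_override` in the cell above.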


### Text Splitting, Embedding Models, and Vector DB
We'll be using OpenAI's text-embedding-ada-002 model to convert our contextual chunks to latent space embeddings.

In [14]:
# Prepare text for chunking:
text = "\n\n".join(doc_contents[TEXT_KEY])

In [15]:
model = 'gpt-3.5-turbo'  # open ai LLM model we will be using later.
# model = 'text-davinci-003'  # open ai LLM model we will be using later.
enc_code = tiktoken.encoding_for_model(model).name
tokenizer = tiktoken.get_encoding(enc_code)

# Determine length of input after tokenization
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=250,  # in order to be able to fit 4 chunks in context window  # 300
    chunk_overlap=25,  # 35
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]  # order in which splits are prioritized
)
chunks = text_splitter.split_text(text)
print(
    f'The input text of length {len(text)} was split into {len(chunks)} chunks.'
)

The input text of length 537086 was split into 512 chunks.


In [16]:
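import math


def assert_roughly_equal(value1, value2, tolerance, message=None):
    """Raise an AssertionError unless the values agree within a relative tolerance."""
    if not math.isclose(value1, value2, rel_tol=tolerance):
        if message is None:
            message = f"{value1} and {value2} are not roughly equal within {tolerance} tolerance"
        raise AssertionError(message)

# NOTE: `tolerance` is passed to math.isclose as rel_tol (a relative tolerance),
# so a value of 100 accepts virtually any pair of values. Use a value well
# below 1 to make this sanity check meaningful.
assert_roughly_equal(sum(len(chunk) for chunk in chunks), len(text), 100)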

In [17]:
SOURCE_KEY = "source"

# Track metadata:
metadatas = [
    {**doc_contents[METADATA_KEY], SOURCE_KEY: 'text'} for _ in range(len(chunks))
]

In [18]:
# extend chunks with table and figure descriptions as well as the source metadata:
for key, source in zip(
    [TABLE_SUMMARY_KEY, FIGURE_SUMMARY_KEY],
    ['table', 'figure']
):
    chunks.extend(doc_contents[key])
    metadatas.extend([
        {**doc_contents[METADATA_KEY], SOURCE_KEY: source}
        for _ in range(len(doc_contents[key]))
    ])

In [19]:
divider = '-'*100
for chunk in tqdm(chunks):
    print(f'{chunk}\n{divider}\n\n')

  0%|          | 0/748 [00:00<?, ?it/s]

Halle, 9 June 2023 FINANCIAL YEAR 2022/23

Annual report presented by the Board of Directors to the Ordinary General Meeting of Shareholders

of 27 September 2023 and Independent auditor’s report

The Dutch annual report in the European Single Electronic Format (ESEF) is the only official version.

Dit jaarverslag is ook verkrijgbaar in het Nederlands.

Ce rapport annuel est également disponible en français.

Financial year 2022/23 covers the period from 1 April 2022 to 31 March 2023.

This annual report is also available on colruytgroup.com/en/annualreport. Our corporate website also includes all press releases, extra stories and background information.

Word from the Chairman
----------------------------------------------------------------------------------------------------


Word from the Chairman

2022/23 was an eventful, challenging financial year, in which we have continued to evolve, based on our belief that we create added value for society today, tomorrow and in the long term

### Indexing:
Store the indexes to avoid pointlessly rerunning the embedding code.

In [20]:
fpath = f"./data/faiss_db/{comp_name}_{fname}_{year}.pkl"
is_override = True

if os.path.exists(fpath) and not is_override:
    print("Loading vectorised chunks...")
    with open(fpath, 'rb') as f:
        vector_store = pickle.load(f)
else:
    # Init embedding model:
    print("Encoding and saving vectorised chunks...")
    
    embed = OpenAIEmbeddings(
        model='text-embedding-ada-002',
    )
    
    vector_store = FAISS.from_texts(
        chunks, embedding=embed,
        metadatas=metadatas,
    )
    with open(fpath, 'wb') as f:
        pickle.dump(vector_store, f)

Encoding and saving vectorised chunks...


**NOTE**: When using multiple documents, one can combine vector stores using the merge command:
- https://python.langchain.com/docs/integrations/vectorstores/faiss

Hence by adding metadata to the vector store, and then applying a merge operation, one can later apply metadata filters and track context origins. 

In [21]:
DOC_CONTENT[comp_name] = {report_name: doc_contents.copy()}  # will become useful when/if we start to use multiple reports

In [22]:
TABLE_PATTERN = r"<<T(\d+)>>:"
FIGURE_PATTERN = r"<<F(\d+)>>:"
VECTOR_STORE = vector_store

# USED IN THE JOIN OF RAW TABLES TO TABLE SUMMARIES AFTER EMBEDDING
def get_table_index_from_string(
    string,
    pattern=TABLE_PATTERN
):
    """
    Checks if a string contains the pattern (e.g., <<T{i}>>),
    and if it does return a list of all i (the digits) and if not return an empty list.
    """
    matches = re.findall(pattern, string)
    return [int(match) for match in matches]


def process_content(
    chunk: str,
    company_key: str,
    report_key: str,
    is_summary: bool = True
) -> str:
    """
    Process table summaries to also include raw tables from 
    doc_content.
    
    Note: if is_summary is True, table summaries are output.
    Otherwise, raw tables are output.
    """
    table_indexes = get_table_index_from_string(chunk, TABLE_PATTERN)
    for table_idx in table_indexes:
        match = re.search(TABLE_PATTERN.replace(r"(\d+)", str(table_idx)), chunk)  # search for the specific index
        table = DOC_CONTENT[company_key][report_key][TABLE_KEY][table_idx-1]  # table_idx starts at 1 hence idx-1
        
        if is_summary:
            chunk = chunk[:match.start()] + "TABLE SUMMARY:" + chunk[match.end():]
        else:  # do not include summary in context
            chunk = f"TABLE: {table}"
        
    return chunk


def find_index(
    docs: List[Any], pattern: str = TABLE_PATTERN
) -> int:
    """Return the position of the first doc whose content matches `pattern`, or -1."""
    regex = re.compile(pattern)
    for i, _doc in enumerate(docs):
        if regex.search(_doc.page_content):
            return i
    return -1


# The belief here is that quantitative data 
# will most probably appear in tables, and 
# hence we want to ensure that the context 
# for quantitative KPI retrieval contains 
# at least one table. 
def search_vector_store(
    query: str,
    vector_store,
    n: int, 
    is_quantitative: bool,
    max_n: int = 20
) -> List[Any]:
    """
    Returns a context window (size n) of Docs containing at 
    least one table when is_quantitative is True.
    Note the table must be within the first max_n
    docs.
    """
    
    if is_quantitative:
        docs = vector_store.similarity_search(query, max_n)
        table_index = find_index(docs)
        if (table_index < n) or (table_index == -1):
            # no need to process as table either in context
            # window or not similar enough to input query
            return docs[:n]
        else:
            return docs[:n-1] + [docs[table_index]]
    else:
        return vector_store.similarity_search(query, n)


def get_context(
    query: str, 
    vector_store,
    n: int = 4,
    is_flag: bool = True
) -> str:
    """
    Retrieve `n` contextual chunks similar to the query from `vector_store`.
    NOTE: when is_flag is True, table summaries are used as 
    the context. Otherwise raw tables are used, and the search 
    tries to include at least one table (where one is 
    contextually important enough to be retrieved).
    """
    contents = [
        f"""{
            process_content(
                doc.page_content,
                company_key=doc.metadata[COMP_KEY],
                report_key=doc.metadata[REPORT_KEY],
                is_summary=is_flag
            )
        } | Metadata: {
                doc.metadata[COMP_KEY]
            } {
                doc.metadata[REPORT_KEY]
        }""" for doc in search_vector_store(
            query, vector_store, n=n, 
            is_quantitative=not is_flag,
        )
    ]

    full_content = '\n\n'.join(
        [
            f"Rank: {rank} | Content: {cont}" 
            for rank, cont in enumerate(contents, start=1)
        ]
    )
    return full_content + "\n\nContextual information sorted from most relevant to least relevant."


class AgentContextRetrieval(BaseTool):
    
    name = "information_retrival"
    description = """
        Fetch the most recent information about a company's financials and ESG initiatives.
    """
    K = 4
    output_chunks = []

    @staticmethod
    def string_similarity(s1, s2):
        seq_matcher = difflib.SequenceMatcher(None, s1, s2)
        return seq_matcher.ratio()
            
    # IMPLEMENTED PURELY TO SATISFY CLASS BEING USED IN AGENT!!!
    def _history_lookup(self, chunk: str) -> str:
        """Check if context has already been provided in the past."""
        for _chunk in self.output_chunks:
            if self.string_similarity(chunk, _chunk) > 0.9:
                return f"The information was shared in the previous {self.name} calls."
        
        # If no highly similar chunk is found in the previous outputs, 
        # record the chunk and return it unchanged
        self.output_chunks.append(chunk)
        return chunk
    
    # ADD HISTORY LOOKUP 
    def _run(self, query: str) -> str:
        contents = [
            f"""{
                self._history_lookup(
                    process_content(
                        doc.page_content,
                        company_key=doc.metadata[COMP_KEY],
                        report_key=doc.metadata[REPORT_KEY],
                        is_summary=False,
                    )
                )
            } | Metadata: {
                    doc.metadata[COMP_KEY]
                } {
                    doc.metadata[REPORT_KEY]
            }""" for doc in VECTOR_STORE.similarity_search(query, self.K)
        ]
        
        full_content = '\n\n'.join(
            [
                f"Rank: {rank} | Content: {cont}" 
                for rank, cont in enumerate(contents, start=1)
            ]
        )
        return full_content + "\n\nContextual information sorted from most relevant to least relevant."
    
    def _arun(self, query: str):
        raise NotImplementedError(
            f"{self.__class__.__name__} does not currently support async run."
        )

### Set up KPI extraction:
Use OpenAI's API to extract the desired KPIs from the company's report (here, Colruyt Group's annual report with sustainability reporting). 

Start with KPIs that are binary flags, and then transition to KPIs that are quantitative.

**IMPROVEMENT IDEAS**: 

Prompt engineering for KPI extraction entails two steps: defining a prompt that maximises the chance of selecting the correct context, and defining a prompt that maximises the chance of the LLM correctly extracting the information from that context, with the aim of obtaining an accurate KPI. 

Since this is a two-part task, it makes sense to define two prompts per KPI: one for the similarity search, and one for the KPI extraction. 
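
One way to make the two-prompt idea explicit is a small record type per KPI. The sketch below is illustrative only: `KPIPrompt` and `build_full_prompt` are hypothetical names, and the shortened prompt texts are placeholders rather than the prompts used in this notebook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KPIPrompt:
    """Hypothetical container pairing the two prompts for one KPI."""
    key: str
    similarity_prompt: str  # used only to retrieve context from the vector store
    gpt_prompt: str         # used only to question the LLM about that context


def build_full_prompt(context: str, kpi: KPIPrompt, output_instruction: str) -> str:
    """Combine retrieved context with the extraction prompt (not the search prompt)."""
    return (
        f"CONTEXT:\n{context}\n\n"
        "Given the above information can you please answer:\n"
        f"QUESTION:\n{kpi.gpt_prompt}\n\n"
        f"EXPECTED OUTPUT: {output_instruction}"
    )


kpi = KPIPrompt(
    key="energy_reduction_plan_for_office",
    similarity_prompt="What steps have been taken to lower energy consumption?",
    gpt_prompt="Does the report detail any measures related to energy reduction?",
)
# Placeholder context; in practice this would come from the similarity search.
context = "Rank: 1 | Content: (retrieved chunk would go here)"
full_prompt = build_full_prompt(context, kpi, "Answer 'True', 'False', or 'NaN'.")
print(full_prompt)
```

Keeping the two prompts in separate fields lets each be tuned on its own: the search prompt for retrieval quality, the extraction prompt for answer accuracy.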

In [23]:
EXP_OUTPUT_ID = r"EXPECTED OUTPUT:"
SIMILARITY_PROMPT_KEY = 'similarity_prompt'
GPT_PROMPT_KEY = 'gpt_prompt'
RAW_RESULT_KEY = 'raw_output'
RESULT_KEY = 'output'

def parse_output(s, output_id=EXP_OUTPUT_ID):
    # Check if the input string is a single word or a number (e.g., 19.9, 21,0, 18).
    # The regex pattern matches single words as well as floating point and integer numbers.
    if re.fullmatch(r'\w+|(\d+([.,]\d+)?)', s.strip()):
        return s.strip()
    
    # Otherwise, search for the pattern "EXPECTED OUTPUT:" (case-insensitive)
    match = re.search(output_id, s, re.IGNORECASE)
    
    # If a match is found, return the part of the string that comes after it.
    if match:
        return s[match.end():].strip()  # The `strip` method removes any leading/trailing whitespace.
    else:
        return None  # No match found.

In [46]:
parameters = {
    'model': 'gpt-4', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a financial analyst responsible for extracting KPIs from financial reports."
    }, {
        "role": "user", 
        "content": None,
    }, 
]

# FOLLOWS: similarity_prompt, gpt_prompt, result_key
kpi_flag_prompts = [
#     (
#         """
#         Does the report discuss the organization's Compliance Policy in detail?
#         """,
#         """
#         Does the report discuss the organization's Compliance Policy in detail? Specifically look for:
#             Name of the policy:
#             Whether the policy is part of these documents (Code of Conduct/Employee Handbook) or part of a separate document (Compliance policy, AML policy, for instance)
#             Commitment:
#             Statements affirming commitment to business ethics
#             Policy Coverage:
#             Types of business ethics topics the policy aims to prevent (e.g., corruption, bribery, anti-money laundering (AML), Counter Terrorist Financing (CTF), conflicts of interest, facilitation payments, gifts, fraud, etc.)
#             Governance:
#             Role of employees/departments in implementing, monitoring, and enforcing the policy
#             Role of management in implementing, monitoring, and enforcing the policy
#         """,
#         "compliance_policy"
#     ), 
    (
        """
        What steps have been taken to lower energy consumption in direct operations? 
        Look for measures related to energy reduction, energy efficiency improvements, 
        and adoption of renewable energy in its offices.
        """, 
        """
        Does the report detail any of the following efforts or measures related to energy
        reduction, energy efficiency improvements, and adoption of renewable
        energy in its offices? Please note that the answer should be 'True' if ANY of the
        measures listed below are mentioned or detailed in the report.

        Renewable Energy:

            Utilization of renewable energy sources (e.g., solar, wind, hydro)
            Details on energy from renewable sources, either generated or purchased
            Mention of grid parity or distributed generation insights

        Energy Efficiency:

            Implementation or consideration of efficiency technologies (e.g., LED, HVAC)
            Steps toward building insulation or retrofitting measures
            Adoption or exploration of demand-side management strategies

        Energy Certificates:

            Utilization or acquisition of various energy certificates types (e.g., Renewable Energy Credits, carbon offsets)
            Information on certificate volume or count
            Reference to voluntary vs mandatory compliance with energy certification schemes
        """,
#         #"""
#         #What steps have been taken to lower energy consumption in direct operations?
#         #""", 
#         #"""
#         #Does the report detail any of the following efforts or measures related to energy
#         #reduction, energy efficiency improvements, and adoption of renewable
#         #energy in its offices? Please note that the answer should be 'Yes' if ANY
#         #measures are taken.
#         #""",
        'energy_reduction_plan_for_office'
    )
]

kpi_results = {}

# TODO: Avoid repeating loop. Create separate function.
for sim_prompt, gpt_prompt, key in kpi_flag_prompts:
    
    print(f"{sim_prompt}\n\n\n")
    
    context = get_context(
        sim_prompt, 
        vector_store,
        n=4,  # qualitative information can span across a greater contextual window hence choose 4-6!
        is_flag=True,  # Use table summaries
    )
    
    full_prompt = f"CONTEXT:\n{context}\n\n" + \
        f"Given the above information can you please answer:\n" + \
        f"QUESTION:\n{gpt_prompt}\n\n" + \
        f"{EXP_OUTPUT_ID} Please first reason about the question and then " \
        f"finalise your answer in one word: either 'True', 'False', or 'NaN'. " + \
        f"Answer 'NaN' when the question cannot be answered from the provided context. " + \
        f"Where possible try and obtain the answer directly from tables."
    
    print(
        f"INPUT: {full_prompt}\n\n"
    )
    
    parameters['messages'][-1]['content'] = full_prompt
    
    try:
        response = openai.ChatCompletion.create(
          **parameters
        )['choices'][0]['message']['content']
    except openai.error.InvalidRequestError:
        # USE MODEL WITH WIDER CONTEXT WINDOW
        original_model = parameters['model']
        parameters['model'] = "gpt-3.5-turbo-16k"
        try:
            response = openai.ChatCompletion.create(
              **parameters
            )['choices'][0]['message']['content']
        finally:
            parameters['model'] = original_model  # restore the configured model (e.g. 'gpt-4')
    
    kpi_results[key] = {
        SIMILARITY_PROMPT_KEY: sim_prompt,
        GPT_PROMPT_KEY: full_prompt,
        RAW_RESULT_KEY: response,
        RESULT_KEY: parse_output(response)
    }
    
    print(
        f"RAW OUTPUT: {response}\n\n" +
        f"SAVED OUTPUT: {kpi_results[key][RESULT_KEY]}\n" +
        "---"*40 +
        '\n\n\n'
    )



        What steps have been taken to lower energy consumption in direct operations? 
        Look for measures related to energy reduction, energy efficiency improvements, 
        and adoption of renewable energy in its offices.
        



INPUT: CONTEXT:
Rank: 1 | Content: By 2022, we were already consuming 10,4% less energy (normalised).

Raising awareness

Through campaigns focusing on energy-saving, we are raising our employees’ awareness to the fact that they too can contribute to reducing energy consumption through their behaviour. We focus on simple actions that make a difference, such as keeping doors closed, de-icing freezers or turning off lights.

Heating

In 2022, we reduced our energy consumption for heating by 16%. This impressive drop is the result of both warmer outdoor temperatures and the energy-saving measures introduced due to the energy crisis. For example,

Colruyt Lowest Prices reintroduced the insulating flaps in the fresh food market and the thermostats in 

In [47]:
#         for instance direct emissions, fuel combustion, and fugitive emissions in 2022?
#         Where possible try and obtain the information from tables.
kpi_quantitative_prompts = [
#     (
#         f"""
#         Does the report detail the level of independence among the members of the Board of Directors?
#         """,
#         f"""
#         Does the report detail the level of independence among the members of the Board of Directors?
#         Specifically, focus on the percentage of board members designated as independent
#         by the company over the total number of board members. If a company had 1 independent board member
#         and 4 total board members then the desired output would be 0.25.
        
#         Note: independence is linked to the material relationship with the company (i.e., role within the company), 
#         conflicts of interest (i.e., shareholding, family ties), and tenure (number of years on the Board).
        
#         """,
#         "independence_of_board_of_directors"
#     ),
#     (
#         f"""
#         What were the reported greenhouse gas (GHG) emissions from Scope 1 categories in {year}?
#         """,
#         f"""
#         What were the greenhouse gas (GHG) emissions from Scope 1 categories in {year}?
        
#         Specifically look for:
#             - Direct Emissions: Amount and types of direct emissions from company operations
#             - Fuel Combustion: Details on fuel types used in company vehicles or on-site energy production
#             - Fugitive Emissions: Any emissions from leaks or other unintentional releases
#         """,
#         "scope_1_emissions"
#     ), 
#     (
#         f"""
#         What were the greenhouse gas (GHG) emissions from Scope 2 market based categories: 
#         for instance indirect emissions, grid emission factors, and renewable energy credits
#         in {year}?
#         """,
#         """
#         What were the greenhouse gas (GHG) emissions from market based Scope 2 categories?
        
#         Specifically look for:
#             - Indirect Emissions: Volume of emissions from purchased electricity or heat
#             - Grid Emissions Factor: Metrics used to calculate emissions from purchased energy
#             - Renewable Energy Credits (RECs): Whether RECs are used to offset emissions
#         """,
#         "scope_2_emissions_market_based"
#     ),
#     (
#         f"""
#         What were the greenhouse gas (GHG) emissions from Scope 3 categories: 
#         for instance supply chain emissions, business travel, product lifecycle, and outsourced activities
#         in {year}?
#         """,
#         """
#         What were the greenhouse gas (GHG) emissions from Scope 3 categories?
        
#         Specifically look for:
#             - Supply Chain Emissions: Information on emissions from upstream and downstream activities
#             - Business Travel & Employee Commuting: Emissions attributed to these activities
#             - Product Lifecycle: Emissions from the entire lifecycle of products or services
#             - Outsourced Activities: Emissions from subcontracted or outsourced operations
#         """,
#         "scope_3_emissions"
#     )
    (
        """
            Does the report provide comprehensive data and insights on the company's employee turnover rate?
        """,
        """
            Does the report provide comprehensive data and insights on the company's employee turnover rate?
            Specifically, look for:
                Annual Turnover/Attrition Rate: The percentage of employees who left during the last fiscal year
                Voluntary vs. Involuntary Turnover: Differentiation between employees leaving by choice and those leaving due to company decisions
        """,
        "employee_turnover"
    ),
]

# quantitative loop
for sim_prompt, gpt_prompt, key in kpi_quantitative_prompts:
    
    context = get_context(
        sim_prompt, 
        vector_store,
        n=3,  # The information will most probably appear in a single table.
        is_flag=False,  # Use raw tabular information 
    )
    
    full_prompt = f"CONTEXT:\n{context}\n\n" + \
        f"Given the above information can you please answer:\n" + \
        f"QUESTION:\n{gpt_prompt}\n\n" + \
        f"{EXP_OUTPUT_ID} Please reason about the question and then finalise " + \
        "your answer with a single numerical value. " + \
        "Include units if applicable (for example: 10.05 or 100.1m³). " + \
        "If the information is not available in the given context, respond with 'NaN'. "
    
    print(
        f"INPUT: {full_prompt}\n\n"
    )
    
    parameters['messages'][-1]['content'] = full_prompt
    
    try:
        response = openai.ChatCompletion.create(
          **parameters
        )['choices'][0]['message']['content']
    except openai.error.InvalidRequestError:
        # USE MODEL WITH WIDER CONTEXT WINDOW
        original_model = parameters['model']
        parameters['model'] = "gpt-3.5-turbo-16k"
        try:
            response = openai.ChatCompletion.create(
              **parameters
            )['choices'][0]['message']['content']
        finally:
            parameters['model'] = original_model  # restore the configured model (e.g. 'gpt-4')
    
    kpi_results[key] = {
        SIMILARITY_PROMPT_KEY: sim_prompt,
        GPT_PROMPT_KEY: full_prompt,
        RAW_RESULT_KEY: response,
        RESULT_KEY: parse_output(response)
    }
    
    print(
        f"RAW OUTPUT: {response}\n\n" +
        f"SAVED OUTPUT: {kpi_results[key][RESULT_KEY]}\n" +
        "---"*40 +
        '\n\n\n'
    )


INPUT: CONTEXT:
Rank: 1 | Content: TABLE: {'title': 'Working at Colruyt Group', 'col_headers': ['# employees at Colruyt Group as a whole', 'Financial year', '32.945', '32.996', '33.384'], 'table': [{'# employees at Colruyt Group as a whole': 'Evolution of employee count for the entire Colruyt Group (net growth)', 'Financial year': 'Financial year', '32.945': '2.314', '32.996': '51', '33.384': '388'}, {'# employees at Colruyt Group as a whole': '% full-time employees', 'Financial year': 'Financial year', '32.945': '78,56', '32.996': '78,43', '33.384': '78,76'}, {'# employees at Colruyt Group as a whole': '% part-time employees', 'Financial year': 'Financial year', '32.945': '21,44', '32.996': '21,57', '33.384': '21,24'}, {'# employees at Colruyt Group as a whole': 'Average length of service (in years)', 'Financial year': 'Financial year', '32.945': '9,84', '32.996': '10,27', '33.384': '11'}, {'# employees at Colruyt Group as a whole': '# student workers who have worked for Colruyt Group

### TODOs:
- Improve output parser for quantitative outputs.
- Connect gpt-4 to KPI extraction tool.
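
As a starting point for the first TODO, a quantitative parser could look roughly like the sketch below. It rests on assumptions that are not part of this notebook: the function name `parse_quantitative_output` is hypothetical, the model is assumed to put its final answer on the last line, and a unit is only recognised when it is attached directly to the number (as in `100.1m³`). The number is returned as a string, because a value such as `32.945` may use the dot as a thousands separator in these reports.

```python
import re
from typing import Optional, Tuple

# A number (with optional '.'/',' separators) that does not start inside a
# word or another number, optionally followed directly by a unit.
_NUMBER_RE = re.compile(
    r"(?<![A-Za-z0-9.,])(-?\d+(?:[.,]\d+)*)(%|[A-Za-zµ°][A-Za-z0-9µ°²³/]*)?"
)


def parse_quantitative_output(response: str) -> Optional[Tuple[str, str]]:
    """Return (number, unit) from the last line of `response`, or None.

    Hypothetical sketch: None is returned for 'NaN' answers and for
    answers whose last line contains no number.
    """
    stripped = response.strip()
    final_line = stripped.splitlines()[-1] if stripped else ""
    if re.search(r"\bnan\b", final_line, re.IGNORECASE):
        return None
    matches = _NUMBER_RE.findall(final_line)
    if not matches:
        return None
    # Use the last number on the line; an absent unit is the empty string.
    return matches[-1]


print(parse_quantitative_output("Some reasoning.\nFinal answer: 100.1m³"))
print(parse_quantitative_output("The turnover rate was 12,5%."))
print(parse_quantitative_output("The answer is NaN."))
```

The lookbehind keeps digits embedded in words, such as the `2` in `CO2eq`, from being mistaken for the answer.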

### OPTIONAL EXPERIMENT; Set up Agent:
We will create an agent capable of:
- Fetching contextual information from the vector store
- Performing basic math operations

The idea is that such an agent may be able to compute KPIs that require basic math operations.

**Set up an agent just like in url_retrieval.ipynb.**

Note that a `utils.py` file will be created to store these in the future.

In [64]:
llm = OpenAI(
    temperature=0,
    model_name='gpt-3.5-turbo'
)
llm_math = LLMMathChain(llm=llm)

# initialize the math tool
math_tool = Tool(
    name='Calculator',
    func=llm_math.run,
    description='Useful when you need to perform math operations.'
)
# when giving tools to an LLM, we must pass them as a list of tools.
tools = [math_tool, AgentContextRetrieval()]

In [65]:
# Set up the base template
template = """You are an analyst tasked with aggregating financial and ESG KPIs about companies. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin! Remember to answer as succinctly as possible when giving your final answer. The final answer, where possible, should just be a number or a boolean.

Question: {input}
{agent_scratchpad}"""

In [66]:
# Set up a prompt template which breaks up the intermediate_steps
# into thoughts that are used to fill the agent_scratchpad, 
# tools, and tool_names in the base template:
class CustomPromptTemplate(StringPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[BaseTool]  # Tool is a subclass of BaseTool
    
    def format(self, **kwargs) -> str:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        return self.template.format(**kwargs)

In [67]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # This includes the `intermediate_steps` variable because that is needed
    input_variables=["input", "intermediate_steps"]
)

In [68]:
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)

In [69]:
llm = OpenAI(
    temperature=0,  # measure of randomness/creativity
    model_name=model
)

# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(
    llm=llm, 
    prompt=prompt  # Custom Prompt
)

tool_names = [tool.name for tool in tools]

agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=CustomOutputParser(),
    stop=["\nObservation:"],  # you want this to be whatever token you use in the prompt to denote the start of an Observation
    allowed_tools=tool_names
) 

agent_executor = AgentExecutor.from_agent_and_tools(
    agent=agent, 
    tools=tools, 
    verbose=True,
    max_iterations=3
)

In [71]:
result = agent_executor.run(
    input=f"what were the scope 1 greenhouse gas (GHG) emissions of {comp_name} in 2022?"
)



> Entering new  chain...
Thought: I need to retrieve the information about Colruyt_Group's scope 1 greenhouse gas emissions in 2022.
Action: information_retrival
Action Input: Company: Colruyt_Group, Year: 2022, KPI: Scope 1 GHG emissions

Observation: Rank: 1 | Content: <<F60>>: Text: Colruyt_Group's scope 1 and 2 action plans target the three main sources of greenhouse gas emissions: cooling, heating, and mobility. Their goal is to reduce emissions by 42% by 2030. | Metadata: Colruyt_Group Annual report with sustainability (2022)

Rank: 2 | Content: TABLE: {'title': 'Greenhouse gas emissions', 'col_headers': ['Greenhouse gas emissions scope 1 (in tonnes CO2eq)', 'Calendar year', '–', '90.889 (1)', '83.194'], 'table': [{'Greenhouse gas emissions scope 1 (in tonnes CO2eq)': '% greenhouse gas emissions scope 1, regulated through emission allowances trading', 'Calendar year': 'Calendar year', '–': '0', '90.889 (1)': '0', '83.194': '0'}, {'Greenhouse

I have retrieved the information about Colruyt_Group's scope 1 greenhouse gas emissions in 2022. Now I need to extract the specific value.
Action: information_retrival
Action Input: Content: TABLE: {'title': 'Greenhouse gas emissions', 'col_headers': ['Greenhouse gas emissions scope 1 (in tonnes CO2eq)', 'Calendar year', '–', '90.889 (1)', '83.194'], 'table': [{'Greenhouse gas emissions scope 1 (in tonnes CO2eq)': '% greenhouse gas emissions scope 1, regulated through emission allowances trading', 'Calendar year': 'Calendar year', '–': '0', '90.889 (1)': '0', '83.194': '0'}, {'Greenhouse gas emissions scope 1 (in tonnes CO2eq)': 'Greenhouse gas emissions scope 2: location-based (in tonnes CO2eq)', 'Calendar year': 'Calendar year', '–': '–', '90.889 (1)': '31.634 (1)', '83.194': '35.935'}, {'Greenhouse gas emissions scope 1 (in tonnes CO2eq)': 'Greenhouse gas emissions scope 2: market-based (in tonnes CO2eq)', 'Calendar year': 'Calendar year', '–': '–', '90.889 (1)': '232 (

#### TODO: 
- Getting the above to run reliably would require more tooling, for example a tool that extracts information from tables. Such a tool could itself be connected to an LLM.

To wrap up the discussion of this prototype, it is worth emphasizing the range of techniques available for extracting key performance indicators (KPIs). Some of these strategies are elaborated below:

- **LLM Model Fine-Tuning with Sustainability Reports**: One approach is to refine a large language model (LLM) specifically with corporate sustainability reports. Once the model has been trained with this data, it can potentially answer questions related to KPIs even without needing explicit context. This is because the relevant information would already be incorporated into its knowledge base.

- **Combining Fine-Tuning with Context Window**: If solely fine-tuning the latter layers of a transformer network does not yield the desired results, a hybrid method could be more effective. By combining the fine-tuning process with a context window, one could leverage the strengths of both strategies. This would essentially mean refining the model with specific data while also providing a relevant contextual frame during the extraction phase, ensuring more accurate and relevant KPI responses.

- **Full Report Processing for KPI Extraction**: In scenarios where budget isn't a constraint, an even more comprehensive approach would be to feed the entire sustainability report into the model to retrieve answers to KPI-related queries. This method ensures that no contextual details are missed and that the most accurate and relevant answers are obtained.

- (And so on...)

It's worth noting that the best method would largely depend on the specific requirements, constraints, and objectives of a given project.