# Belfius Alytics (Part 2)
Inspiration:

- https://github.com/pinecone-io/examples/blob/master/learn/generation/langchain/handbook/08-langchain-retrieval-agent.ipynb

- https://www.youtube.com/watch?v=RIWbalZ7sTo

- https://colab.research.google.com/drive/13FpBqmhYa5Ex4smVhivfEhk2k4S5skwG?usp=sharing#scrollTo=RSdomqrHNCUY

- https://github.com/openai/openai-cookbook/blob/main/examples/How_to_format_inputs_to_ChatGPT_models.ipynb

Future ideas:

- Convert docx to latex (https://www.vertopal.com/en/download#96a5acdd2afa4e3aaf723be0ea7b71ad).

### Handle imports:

In [1]:
# Move to root directory
import os

notebooks_dir = 'notebooks'
if notebooks_dir in os.path.abspath(os.curdir):
    while not os.path.abspath(os.curdir).endswith('notebooks'):
        print(os.path.abspath(os.curdir))
        os.chdir('..')
    os.chdir('..')  # to get to root

print(os.path.abspath(os.curdir))

C:\Users\MD726YR\PycharmProjects\eyalytics


In [2]:
# Suppress SSL verification (EY problem):
import requests

from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Suppress the warning from urllib3.
requests.packages.urllib3.disable_warnings(category=InsecureRequestWarning)

old_send = requests.Session.send

def new_send(*args, **kwargs):
    kwargs['verify'] = False
    return old_send(*args, **kwargs)

requests.Session.send = new_send

In [3]:
# Import relevant libraries for langchain retrieval:
import openai
import tiktoken

from langchain import OpenAI,  LLMChain, PromptTemplate
from langchain.prompts import StringPromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS  # facebook ai similarity search 
from langchain.chains import LLMMathChain
from langchain.tools import BaseTool
from langchain.agents import (
    AgentExecutor, LLMSingleActionAgent, AgentOutputParser, 
    AgentType, initialize_agent, Tool
)
from langchain.callbacks import get_openai_callback
from langchain.schema import AgentAction, AgentFinish

**In case you want to use Chroma instead of FAISS:**

`from langchain.vectorstores import Chroma`
    
Note: to use Chroma you will have to install `chromadb`, which on Windows requires Microsoft Visual C++ 14.0 or greater. To install it:

a. Install Microsoft C++ Build Tools: Visit the link provided in the error message (https://visualstudio.microsoft.com/visual-cpp-build-tools/) and install the Microsoft C++ Build Tools.

b. Ensure the Correct Version: Ensure that you have the required version (14.0 or greater) of the C++ build tools installed.

c. Add to PATH: Ensure the tools are added to your system PATH. Usually, the installer should take care of this. But if the problem persists, you might need to verify and add them manually.

d. Restart Your System: Sometimes, after installing such tools, a system restart might be required for the environment variables (like PATH) to update correctly.

**Checks:**
Check if Visual C++ Build Tools is Installed:
- Press Windows + I to open the Settings app.
- Go to "Apps".
- Now in the "Apps & features" tab, search for "Visual Studio".
- Check if there's an installation called "Microsoft Visual Studio" (it might also be "Visual Studio Build Tools").

Check for the Required Components:
- If you find "Microsoft Visual Studio" or "Visual Studio Build Tools" in the list, click on it and then select "Modify".
- This will bring up the Visual Studio Installer.
- Here, ensure that the "Desktop development with C++" workload is checked. Specifically, make sure "MSVC v142 - VS 2019 C++ x64/x86 build tools" (or a similar option) is selected. This provides the C++ compiler that's needed.
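
Before switching vector stores, it can help to check whether `chromadb` is importable at all. The following is a minimal sketch using only the standard library; the helper name `pick_vectorstore_backend` is hypothetical and is not part of langchain or chromadb.

```python
from importlib.util import find_spec


def pick_vectorstore_backend(preferred: str = "chromadb", fallback: str = "faiss") -> str:
    """Return the preferred top-level module name if it is installed, else the fallback."""
    # find_spec returns None when a top-level module cannot be located
    if find_spec(preferred) is not None:
        return preferred
    return fallback


backend = pick_vectorstore_backend()
print(f"Using vector store backend: {backend}")
```

If this reports `faiss`, the `chromadb` install (and therefore the C++ build tools setup above) probably did not succeed.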

In [4]:
# libraries for URL pdf loading
import time
import docx
import pyautogui
from docx.oxml.table import CT_Tbl
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

In [5]:
# Other libraries:
import re
import pickle 
import difflib
import math
from uuid import uuid4

# for progress bars in loops
from tqdm.auto import tqdm
from typing import List, Any, Union, Optional

In [6]:
# Get API and ENV keys:
from dotenv import load_dotenv

load_dotenv()
if not os.getenv("OPENAI_API_KEY"):
    raise KeyError(
        "You will need an OPENAI_API_KEY to use the LLM models in this notebook."
    )
openai.api_key = os.getenv("OPENAI_API_KEY")

## Commence Langchain Retrieval Augmentation Tool Development:

In [19]:
FIGURE_THRESHOLD = 0.1
EPSILON = 1e-10
REPEAT_THRESHOLD = 4
MAX_CHAR_COUNT_FOR_FIGURE  = 20
FIGURE_RELATED_CHARS = r"[0123456789.%-]"
COMMON_UNITS = ["kg", "m", "s", "h", "g", "cm", "mm", "l", "ml"]
DOCX_NAMESPACE = {'w': 'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}


# URL -> file name converter:
def url2fname(url):
    # Split the URL by '/' and get the last segment
    last_segment = url.split('/')[-1]
    
    # Use regex to remove any suffix after the dot and the dot itself
    cleaned_name = re.sub(r'\..*$', '', last_segment)
    
    return cleaned_name
    
    
# Create URL loader:
def wait_for_file(file_path: str, timeout: int = 60) -> bool:
    """
    Wait for a file to be present at a specified path within a given timeout.
    
    Args:
        file_path (str): Path to the file.
        timeout (int): Maximum waiting time in seconds. Default is 60 seconds.

    Returns:
        bool: True if file is found within the timeout, False otherwise.
    """
    start_time = time.time()

    while time.time() - start_time < timeout:
        if os.path.exists(file_path):
            return True
        time.sleep(1)

    return False


def download_pdf_from_url(url: str, save_path: str) -> str:
    """
    Download a PDF from the specified URL and save it to a local path.
    
    Args:
        url (str): URL of the PDF.
        save_path (str): Local path to save the downloaded PDF.

    Returns:
        str: Path to the saved PDF if successful, None otherwise.
    """
    response = requests.get(url)
    if response.status_code == 200:
        with open(save_path, 'wb') as file:
            file.write(response.content)
        if os.path.exists(save_path):
            return save_path
    return None


def convert_pdf_to_docx(
    pdf_filename: str, driver_path: str, pdf_folder_path: str, docx_folder_path: str
) -> str:
    """
    Convert a PDF to a DOCX using Adobe's online tool.
    
    Args:
        pdf_filename (str): Filename of the PDF.
        driver_path (str): Path to the geckodriver executable.
        pdf_folder_path (str): Directory where the PDF is located.
        docx_folder_path (str): Directory where the converted DOCX should be saved.

    Returns:
        str: Path to the converted DOCX if successful, None otherwise.
    """
    # WebDriver setup and configurations
    firefox_options = webdriver.FirefoxOptions()
    firefox_options.set_preference("browser.download.folderList", 2)
    firefox_options.set_preference("browser.download.dir", docx_folder_path)
    firefox_options.set_preference("browser.download.useDownloadDir", True)
    firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/vnd.openxmlformats-officedocument.wordprocessingml.document")
    
    service = Service(driver_path)
    driver = webdriver.Firefox(service=service, options=firefox_options)
    wait = WebDriverWait(driver, 180)
    driver.get("https://www.adobe.com/be_en/acrobat/online/pdf-to-word.html")

    # Upload the PDF
    upload_btn = wait.until(EC.element_to_be_clickable((By.ID, "lifecycle-nativebutton")))
    upload_btn.click()

    full_pdf_path = os.path.join(pdf_folder_path, pdf_filename)
    if not os.path.exists(full_pdf_path):
        print(f"File path\n{full_pdf_path}\nis not valid.")
        return None
    
    # Wait for the file selection dialog and input the file path using pyautogui
    time.sleep(5)
    # Use the path in pyautogui
    pyautogui.typewrite(full_pdf_path)

    # Add a slight delay and then press 'enter' multiple times
    time.sleep(2)
    for _ in range(3):
        pyautogui.press('enter')
        time.sleep(0.1)
    time.sleep(10)
    
    # Handle cookies and start conversion
    try:
        cookie_reject_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, "#onetrust-reject-all-handler")))
        if cookie_reject_btn:
            cookie_reject_btn.click()
    except Exception:
        print("Cookie settings notification not found or failed to click.")

    download_btn = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, 'button.Download__downloadButton___2qFEa')))
    download_btn.click()
    time.sleep(10)
    driver.quit()

    expected_docx_filename = pdf_filename.replace('.pdf', '.docx')
    expected_docx_filepath = os.path.join(docx_folder_path, expected_docx_filename)

    return expected_docx_filepath if wait_for_file(expected_docx_filepath, 55) else None


def convert_url_pdf_to_docx(
    pdf_url: str, 
    driver_path: str = "./drivers/geckodriver.exe", 
    pdf_folder_path: str = None, 
    docx_folder_path: str = None
) -> str:
    """
    Download a PDF from a URL, convert it to DOCX, and save it locally.
    
    Args:
        pdf_url (str): URL of the PDF.
        driver_path (str): Path to the geckodriver executable. Default is './drivers/geckodriver.exe'.
        pdf_folder_path (str): Directory to save the downloaded PDF. Default is '<cwd>/data/pdf_db'.
        docx_folder_path (str): Directory to save the converted DOCX. Default is '<cwd>/data/docx_db'.

    Returns:
        str: Path to the converted DOCX if successful, None otherwise.
    """
    cwd = os.getcwd()
    pdf_folder_path = pdf_folder_path or os.path.join(cwd, "data", "pdf_db")
    docx_folder_path = docx_folder_path or os.path.join(cwd, "data", "docx_db")

    os.makedirs(pdf_folder_path, exist_ok=True)
    os.makedirs(docx_folder_path, exist_ok=True)

    pdf_filename = pdf_url.split('/')[-1]
    pdf_save_path = os.path.join(pdf_folder_path, pdf_filename)

    if download_pdf_from_url(pdf_url, pdf_save_path):
        return convert_pdf_to_docx(pdf_filename, driver_path, pdf_folder_path, docx_folder_path)
    return None


def extract_footnotes_from_para(para, next_para=None):
    """Extract footnote references and actual footnotes from a paragraph."""
    footnotes = []
    
    footnote_refs = para._element.findall('.//w:footnoteReference', namespaces=DOCX_NAMESPACE)

    for ref in footnote_refs:
        footnote_id = ref.get("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id")
        footnote = para.part.footnotes_part.footnote_dict[footnote_id]
        footnotes.append(footnote.text)

    # Check in the next paragraph for footnotes if provided
    if next_para:
        next_footnote_refs = next_para._element.findall('.//w:footnoteReference', namespaces=DOCX_NAMESPACE)
        for ref in next_footnote_refs:
            footnote_id = ref.get("{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id")
            footnote = next_para.part.footnotes_part.footnote_dict[footnote_id]
            footnotes.append(footnote.text)

    return footnotes


def process_footnotes(text, footnotes):
    """Process and embed footnotes into the text."""
    
    # Existing replacement for footnotes within brackets
    for idx, footnote in enumerate(footnotes, 1):
        # A function replacement keeps any backslashes in the footnote text literal
        text = re.sub(r"\[{}\]".format(idx), lambda _m, f=footnote: "[{}]".format(f), text)

    # New addition: replace footnotes appearing directly after words or at the end of sentences
    for idx, footnote in enumerate(footnotes, 1):
        # This regex looks for a number with no digit directly before or after it.
        # Heuristic: any standalone occurrence of the index is replaced, including ordinary numbers.
        pattern = r'(?<![0-9])' + str(idx) + r'(?![0-9])'
        text = re.sub(pattern, lambda _m, f=footnote: "[{}]".format(f), text)

    return text


def contains_unit(text):
    """Check if text contains a common unit following a number."""
    for unit in COMMON_UNITS:
        # Check for patterns like '123 kg', '123kg', '0.5 m', '0.5m', etc.
        if re.search(r'\d\s?' + re.escape(unit) + r'(?![a-zA-Z])', text):
            return True
    return False


def is_potential_figure_data(text):
    if text is None:
        return False

    text_count = len(text) - text.count(' ')
    figure_char_count = len(re.findall(FIGURE_RELATED_CHARS, text))
    char_count = text_count - figure_char_count

    # New: Check for units
    contains_common_units = contains_unit(text.lower())  # Convert text to lowercase for this check
    
    # New: Check for percentage patterns
    contains_percentage = "%" in text and any(char.isdigit() for char in text)

    if (text_count == 0) or text.endswith(('.', ':', ';', ',')) or (char_count > MAX_CHAR_COUNT_FOR_FIGURE):
        return False

    if (char_count == 0) or (figure_char_count / (char_count + EPSILON) > FIGURE_THRESHOLD) or contains_common_units or contains_percentage:
        return True
    
    return False


def repeated_artifact_check(line, artifact_dict):
    """Check if a line is a repeated artifact and update its count."""
    if line in artifact_dict:
        artifact_dict[line] += 1
        if artifact_dict[line] > REPEAT_THRESHOLD:
            return True  # It's a repeated artifact
    else:
        artifact_dict[line] = 1
    return False

    
def read_docx(file_path: str) -> dict:
    
    def _is_empty(text):
        if len(text) == 0:
            return True
        return False
    
    doc = docx.Document(file_path)
    
    result = {
        'metadata': {
            'title': doc.core_properties.title,
            'author': doc.core_properties.author,
            'created': doc.core_properties.created
        },
        'text': [],
        'tables': [],
        'potential_figures': []
    }
    
    figure_data_group = {'title': None, 'data': []}
    current_is_figure_data = False
    previous_text = None
    
    artifact_dict = {}
    
    for current_elem, next_elem in tqdm(zip(doc.element.body, doc.element.body[1:] + [None])):
        
        # Paragraph
        if current_elem.tag.endswith('p'):
            
            current_para = docx.text.paragraph.Paragraph(current_elem, None)
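            # Only wrap the next element if it exists and is itself a paragraph
            next_para = (
                docx.text.paragraph.Paragraph(next_elem, None)
                if next_elem is not None and next_elem.tag.endswith('}p')
                else None
            )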
            
            processed_text = current_para.text.strip()
            # Ignore empty lines or repeated lines 
            if _is_empty(processed_text) or repeated_artifact_check(processed_text, artifact_dict):
                current_para = None
                continue
            try: 
                next_text = next_para.text
            except AttributeError: 
                next_text = None

            # Process footnotes
            footnotes = extract_footnotes_from_para(current_para, next_para)
            processed_text = process_footnotes(processed_text, footnotes)
            
            # Identify if the current line is potential figure data
            previous_was_figure_data = current_is_figure_data  # Move the window forward
            current_is_figure_data = is_potential_figure_data(processed_text)
            next_is_figure_data = is_potential_figure_data(next_text)
            
            if current_is_figure_data:
                
                # If previous line was also figure data, they belong to the same figure
                if previous_was_figure_data:
                    figure_data_group['data'].append(processed_text)
                else:
                    # If a new figure starts, save the previous figure (if there was any)
                    if figure_data_group['data']:
                        result['potential_figures'].append(figure_data_group)
                        figure_data_group = {'title': None, 'data': []}

                    # Assign the previous line as the title for the current figure
                    figure_data_group['title'] = previous_text
                    figure_data_group['data'].append(processed_text)
                    
            elif not next_is_figure_data:  # neither next nor current text is a figure
                # Not a figure, add to text
                result['text'].append(processed_text)
            else:  # next text is figure, meaning that current text will be stored as title.
                pass

            # Handles case when text potentially interrupts a figure.
            # Text is again stored in figure_data_group['data'].
            if previous_was_figure_data and next_is_figure_data:
                current_is_figure_data = True
                
            # only change previous text when current para is not empty or repeated string.
            previous_text = processed_text
                
        # Table
        elif current_elem.tag.endswith('tbl'):
            table_index = [tbl._element for tbl in doc.tables].index(current_elem)
            table = doc.tables[table_index]

            headers = [cell.text.strip() for cell in table.rows[0].cells]

            rows = []
            for row in table.rows[1:]:
                row_data = {headers[j]: cell.text.strip() for j, cell in enumerate(row.cells)}
                rows.append(row_data)

            result['tables'].append({
                'title': previous_text,
                'col_headers': headers,
                'table': rows
            })
        else:
            print(f"Ignoring {current_elem.tag}.")
            
    return result

  
# Testing the functions
comp = 'Coca-Cola Company'
pdf_url = "https://www.coca-colacompany.com/content/dam/company/us/en/reports/coca-cola-business-and-sustainability-report-2022.pdf"
fname = url2fname(pdf_url)  # used to save vectordb
is_scrape = False

try:
    if is_scrape:
        docx_path = convert_url_pdf_to_docx(pdf_url)
    else:
        docx_path = r"C:\Users\MD726YR\PycharmProjects\eyalytics\data\docx_db\coca-cola-business-and-sustainability-report-2022.docx"

    if not docx_path:
        print("Failed to convert PDF to DOCX. Exiting...")
        exit(1)

    doc_contents = read_docx(docx_path)

    print(
        f"SUMMARY:\n{len(doc_contents['text'])} paragraphs, "
        f"{len(doc_contents['tables'])} tables, and {len(doc_contents['potential_figures'])} figures "
        f"were extracted."
    )
    
    # Example usage:
    print("\n\nText Extracted:")
    for para in doc_contents['text']:
        print(para)
    print('---'*50)

    print("\n\nTables Extracted:")
    for table in doc_contents['tables']:
        print(table)

    print("\n\nFigures Extracted:")
    for figure in doc_contents['potential_figures']:
        print(figure)
    print('---'*50)
    
except Exception as e:
    print(f"An error occurred: {e}")


Ignoring {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sdt.
Ignoring {http://schemas.openxmlformats.org/wordprocessingml/2006/main}sectPr.
SUMMARY:
1431 paragraphs, 22 tables, and 54 figures were extracted.


Text Extracted:
Refresh the World. Make a Difference.
CONTENTS
We build loved brands that bring joy to our consumers’ lives with beverage choices for all occasions, tastes and lifestyles. Our growth strategy is grounded in our core values and commitment to social and environmental responsibility.
SCOPE OF THIS REPORT
This 2022 Business & Sustainability Report is The Coca-Cola Company’s fifth report to integrate overall business and sustainability performance, data and context, reflecting our continued journey toward driving sustainable business practices into our core strategy.
Except as otherwise noted, this report covers the 2022 performance of The Coca-Cola Company and the Coca-Cola system (our company and our bottling partners), as applicable.
As used in this r

**FUTURE IMPROVEMENTS**: 
- Trying docx2python may improve information retrieval from a docx.
- Trying lxml may also be a better solution than python-docx.
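
One way to experiment with the raw-XML route is sketched below. It is only an illustration: it uses the standard-library `zipfile` and `xml.etree.ElementTree` (a swapped-in stand-in for lxml, whose `etree` API is similar), and the helper name `docx_paragraph_texts` is hypothetical. The in-memory archive stands in for a real `.docx` and contains only the part the helper reads.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraph_texts(docx_file) -> list:
    """Return the text of every non-empty paragraph in word/document.xml."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over one or more <w:t> elements
        text = "".join(t.text or "" for t in para.iter(f"{{{W_NS}}}t"))
        if text.strip():
            paragraphs.append(text.strip())
    return paragraphs


# Tiny in-memory stand-in for a real .docx file
sample_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Refresh the </w:t></w:r><w:r><w:t>World.</w:t></w:r></w:p>"
    "<w:p/>"
    "</w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", sample_xml)
buffer.seek(0)
print(docx_paragraph_texts(buffer))  # ['Refresh the World.']
```

Note that this walks every `w:p` element, so paragraphs nested inside tables are returned as well; separating them would need extra handling.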

In [18]:
import docx2python
import pandas as pd

_doc_content = docx2python.docx2python(docx_path)

for elem in _doc_content.body:
    print(elem)

[[[]]]
[[['2022 BUSINESS & SUSTAINABILITY REPORT']]]
[[['2022 BUSINESS & SUSTAINABILITY REPORT']]]
[[['----media/image1.jpeg--------media/image2.png--------media/image3.png--------media/image1.jpeg--------media/image2.png--------media/image3.png----Refresh the World. Make a Difference.', '', 'CONTENTS', '\t\nCEO MESSAGE\tEXECUTIVE SUMMARY', '\nOUR COMPANY', '\nWATER', '\nPORTFOLIO', '\nPACKAGING', '\nCLIMATE', '\nAGRICULTURE', '\nPEOPLE', '\nOPERATIONS', '\nDATA APPENDIX', '\nFRAMEWORKS', '', '', '', '']]]
[[['CONTENTS']]]
[[['CONTENTS']]]
[[['We build loved brands that bring joy to our consumers’ lives with beverage choices for all occasions, tastes and lifestyles. Our growth strategy is grounded in our core values and commitment to social and environmental responsibility.', '', '', '', '', '', '', '', '', '']]]
[[['CHAIRMAN & CEO MESSAGE', '3']]]
[[['CHAIRMAN & CEO MESSAGE', '3']]]
[[['WATER LEADERSHIP', '24']]]
[[['WATER LEADERSHIP', '24']]]
[[['PEOPLE & COMMUNITIES', '51']]]
[[['PE

### Obtain short summaries for tables.
To facilitate the encoding of tables, we ask an LLM to generate a textual summary of each table's contents. The idea is that this summary will yield better vector encodings than encoding the raw table directly. It is an added cost, but one that will hopefully yield better context for our LLMs. Note that the table is fed to the engine together with its summary if the tabular chunk is selected as context. The key when summarising is to obtain short summaries!
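
Because many extracted tables contain mostly empty cells, it may be worth compacting a table before placing it in the prompt. The sketch below is one possible approach, not what the cell that follows does; `compact_table` is a hypothetical helper, and the example dict is an abridged illustration in the shape produced by `read_docx`.

```python
def compact_table(table: dict) -> str:
    """Render a table dict (title, col_headers, table) as short pipe-separated text."""
    lines = []
    if table.get("title"):
        lines.append(f"Title: {table['title']}")
    for row in table.get("table", []):
        # Keep only cells that actually hold a value
        cells = [f"{key}: {value}" if key else value for key, value in row.items() if value]
        if cells:
            lines.append(" | ".join(cells))
    return "\n".join(lines)


# Abridged example in the same shape as the entries of doc_contents['tables']
example = {
    "title": "WATER",
    "col_headers": ["Year ended December 31,", "", "2021", "2022"],
    "table": [
        {"Year ended December 31,": "Water Use Ratio", "": "", "2021": "1.81", "2022": "1.79"},
        {"Year ended December 31,": "", "": "", "2021": "", "2022": ""},
    ],
}
print(compact_table(example))
```

A table whose compact form is only its title (or empty) has no data rows, so the LLM call for it could arguably be skipped.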

In [20]:
doc_contents['table_summary'] = []
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {"role": "system", "content": "You are a table summarizer."}, 
    {
        "role": "user", 
        "content": None,
    }, 
]

for i, table in enumerate(doc_contents['tables'], start=1):
    
    parameters['messages'][-1]['content'] = f"In 100 words or less, describe what the following table from {comp}'s sustainability report displays: {table}."

    response = openai.ChatCompletion.create(
      **parameters
    )
    doc_contents['table_summary'].append(response['choices'][0]['message']['content'])
    
    print(f"Table{i}: {table}\n\nSummary;\n {doc_contents['table_summary'][-1]}\n\n\n")

Table1: {'title': 'Several of our bottling partners and suppliers have set or committed to setting their own science-based reduction targets to drive climate action across our value chain.', 'col_headers': ['GOAL', '2022 STATUS', '2022 STATUS'], 'table': [{'GOAL': 'Reduce absolute', '2022 STATUS': ''}, {'GOAL': 'greenhouse gas (GHG) emissions by 25% by', '2022 STATUS': '7%'}, {'GOAL': '2030, against a 2015', '2022 STATUS': ''}, {'GOAL': 'baseline', '2022 STATUS': ''}]}

Summary;
 The table displays the goals and current status of Coca-Cola Company's bottling partners and suppliers in reducing greenhouse gas (GHG) emissions. The goal is to reduce absolute GHG emissions by 25% by 2030, against a 2015 baseline. The current status for 2022 shows a 7% reduction in GHG emissions.



Table2: {'title': 'Several of our bottling partners and suppliers have set or committed to setting their own science-based reduction targets to drive climate action across our value chain.', 'col_headers': ['AMBI

Table10: {'title': 'For additional details regarding the reconciliation of GAAP and non-GAAP financial measures below, see the company’s Current Reports on Form 8-K filed with the SEC on Feb. 14, 2023, Feb. 10, 2022, Feb. 10, 2021 and Jan. 30, 2020. This information is also available in the “Investors” section of the company’s website, .', 'col_headers': ['Year ended December 31,', '', '2019', '2020', '2021', '2022'], 'table': [{'Year ended December 31,': '(Percent change)', '': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Net Operating Revenues', '': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Reported Net Operating Revenues (GAAP)', '': '', '2019': '9', '2020': '(11)', '2021': '17', '2022': '11'}, {'Year ended December 31,': 'Less: Adjustments to Reported Net Revenues', '': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Currency Impact', '': '', '2019': '(4)', '2020': '(2)', 

Table13: {'title': 'PACKAGING', 'col_headers': ['Year ended December 31,', '', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022'], 'table': [{'Year ended December 31,': 'World Without Waste', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Total weight of our packaging (metric tons)1', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '5.10M2', '2021': '5.30M', '2022': '5.95M'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Percentage of recycled material in our packaging3', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '30%', '2019': '20%4', '2020': '22%', '2021': '23%', '2022': '25%'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': ''

Table14: {'title': 'WATER', 'col_headers': ['Year ended December 31,', '', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022'], 'table': [{'Year ended December 31,': 'Water Use and Water Withdrawn', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Water Use Ratio (liters of water used per liter of product produced)', '': '', '2014': '2.03', '2015': '1.98', '2016': '1.96', '2017': '1.92', '2018': '1.89', '2019': '1.85', '2020': '1.84', '2021': '1.81', '2022': '1.79'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '

Table16: {'title': 'GREENHOUSE GAS EMISSIONS & WASTE (continued)', 'col_headers': ['Year ended December 31,', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022'], 'table': []}

Summary;
 The table displays the greenhouse gas emissions and waste data for the Coca-Cola Company from 2014 to 2022. It includes the years as column headers and likely contains information on the company's emissions and waste reduction efforts over time. However, without the actual data in the table, it is not possible to provide a more detailed summary.



Table17: {'title': 'GREENHOUSE GAS EMISSIONS & WASTE (continued)', 'col_headers': ['Fleet Fuel Management', '', '', '', '', '', '', ''], 'table': [{'Fleet Fuel Management': '', '': ''}, {'Fleet Fuel Management': 'Fleet fuel consumed (liters of diesel equiv.)', '': '0.77B'}, {'Fleet Fuel Management': 'HFC-Free Coolers', '': ''}, {'Fleet Fuel Management': '', '': ''}, {'Fleet Fuel Management': 'Number of pieces of HFC-free refrigeration equ

Table21: {'title': 'HUMAN RIGHTS & AGRICULTURE', 'col_headers': ['Year ended December 31,', '', '2014', '2015', '2016', '2017', '2018', '2019', '2020', '2021', '2022'], 'table': [{'Year ended December 31,': 'Human Rights Audits by Region1', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Total', '': '', '2014': '', '2015': '2,318', '2016': '2,789', '2017': '3,204', '2018': '2,823', '2019': '2,778', '2020': '2,279', '2021': '2,848', '2022': '2,770'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': '', '2021': '', '2022': ''}, {'Year ended December 31,': 'Africa', '': '', '2014': '', '2015': '120', '2016': '188', '2017': '259', '2018': '236', '2019': '206', '2020': '165', '2021': '297', '2022': '259'}, {'Year ended December 31,': '', '': '', '2014': '', '2015': '', '2016': '', '2017': '', '2018': '', '2019': '', '2020': 

Table22: {'title': 'DEFINITIONS OF PRIORITY TOPICS', 'col_headers': ['Topic name', 'Subtopic(s)', 'Definition', 'Topic name', 'Subtopic(s)', 'Definition'], 'table': [{'Topic name': 'Corporate Governance', 'Subtopic(s)': 'Business Ethics & Compliance', 'Definition': 'How a company implements policies and procedures surrounding compliant and lawful corporate conduct and competition in the marketplace. Includes the systems and channels in place to collect, process, and address complaints regarding suspected violations.'}, {'Topic name': '', 'Subtopic(s)': 'Corporate Governance', 'Definition': 'How a company implements policies and standards that determine how it operates, including its business strategy and goals, risk management, and sustainability integration. Includes the processes for ensuring the knowledge, skills, experience and diversity of board members.'}, {'Topic name': '', 'Subtopic(s)': 'Transparency & Sustainability Data Validity', 'Definition': 'How a company discloses robus

### Obtain short summaries for figures:
Similarly to tables, to facilitate encoding we first pass figures through an LLM. Note that our confidence in the figure information is low. Figures are captured using heuristics and may hence be too vague to summarise meaningfully. They may also be extracts from the text that have been wrongly captured. In either case a textual summary is required.
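
Since the prompt asks the model to answer with a `Text:` or `Figure:` prefix, or `None` when the information is vague, the replies can be post-processed before embedding. The following is a minimal sketch under that assumption; `parse_figure_summary` is a hypothetical helper and not part of the OpenAI client.

```python
from typing import Optional, Tuple


def parse_figure_summary(reply: str) -> Optional[Tuple[str, str]]:
    """Split a reply such as 'Figure: ...' into (kind, description); None if unusable."""
    cleaned = reply.strip()
    if cleaned.strip("'\"").lower() == "none":
        return None  # the model judged the information too vague
    kind, sep, description = cleaned.partition(":")
    kind = kind.strip().lower()
    description = description.strip()
    if not sep or kind not in {"figure", "text"}:
        return None  # unexpected format, treat as unusable
    if not description or description.strip("'\"").lower() == "none":
        return None
    return kind, description


print(parse_figure_summary("Text: Organic Revenue Growth (Non-GAAP) was 24%."))
print(parse_figure_summary("'None'"))
```

Entries that come back as `None` could then be dropped rather than embedded.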

In [24]:
doc_contents['figure_summary'] = []
parameters = {
    'model': 'gpt-3.5-turbo', 
    "temperature": 0,
}
parameters['messages'] = [
    {
        "role": "system", 
        "content": "You are a figure and text summarizer."
    }, {
        "role": "user",
        "content": f"Decide whether the information below (INFO) obtained from {comp}'s " + 
            "sustainability report is from a figure or text, " + 
            "and describe it in no more than 50 words. Note, if the information is vague " + 
            "return 'None'.\n" + 
            "INFO: {'title': 'Global headquarters', 'data': ['200+']}"
    }, {
        "role": "assistant", 
        "content": "Text: Coca-Cola has more than 200 global headquarters."
    }, {
        "role": "user", 
        "content": "INFO: {'title': '2022 Progress on Sustainable Sourcing2', 'data': ['0\t20\t40\t60\t80\t100', '36%', 'GRAPES\t 37%', 'SUGAR CANE\t 40%', 'APPLES\t 55%', 'CORN\t 70%', 'TEA\t 74%', '80%', 'PULP AND PAPER\t 86%', 'ORANGES\t 89%', 'LEMONS\t 96%']}"
    }, {
        "role": "assistant", 
        "content": "Figure: 2022 progress on coca-cola's sustainable sourcing. 37% of grapes, 40% of sugar cane, 55% of apples, 70% of corn, 74% of tea, 86% of pulp and paper, 89% of oranges, and 96% of lemons were sustainably sourced."
    }, {
        "role": "user", 
        "content": "INFO: {'title': 'Organic Revenue Growth (Non-GAAP)1', 'data': ['25', '-5', '24%']}",
    }, {
        "role": "assistant", 
        "content": "Text: Organic Revenue Growth (Non-GAAP) was 24%."
    }, {
        "role": "user", 
        "content": None,
    },
]

for i, figure in enumerate(doc_contents['potential_figures'], start=0):
    
    if doc_contents['figure_summary']:
        # Keep track of the last GPT output: consecutive figures are often
        # related, so the previous summary may help with the next one.
        parameters['messages'][-3:-1] = [
            {
                "role": "user",
                "content": f"INFO: {doc_contents['potential_figures'][i-1]}"  # previous figure
            }, {
                "role": "assistant",
                "content": f"{doc_contents['figure_summary'][-1]}"
            },
        ]
    
    parameters['messages'][-1] = {
        "role": "user", 
        "content": f"INFO: {figure}"
    }

    response = openai.ChatCompletion.create(**parameters)
    doc_contents['figure_summary'].append(response['choices'][0]['message']['content'])
    
    print(f"Figure {i + 1} Summary;\n {doc_contents['figure_summary'][-1]}\n\n")


Figure 1 Summary;
 Text: As of December 31, 2022, there were 7 employees in the U.S. workforce, excluding Bottling Investments Group (BIG), Global Ventures, fairlife, and BODYARMOR.


Figure 2 Summary;
 Figure: In 2022, the progress on sustainable sourcing for various products was as follows: 37% for grapes, 40% for sugar cane, 55% for apples, 70% for corn, 74% for tea, 86% for pulp and paper, 89% for oranges, and 96% for lemons.


Figure 3 Summary;
 Figure: The sustainable sourcing progress for coffee in the Coca-Cola Company was 99%.


Figure 4 Summary;
 Figure: The percentage of direct suppliers that achieved compliance with Coca-Cola Company's Supplier Guiding Principles is as follows: 100%, 80%, 100%, 80%, 92%, 90%, 93%, 93%, 100%, 80%, 60%, 40%, 20%.


Figure 5 Summary;
 Figure: The data is based on supplier reporting according to Coca-Cola Company's PSA governance requirements. The years included in the data are 2019, 2020, 2021, and 2022.


Figure 6 Summary;
 Text: Coca-Cola Co
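
The loop above keeps the prompt at a fixed size: from the second figure onward, the slot holding the last hand-written example is overwritten with the previous figure and the summary the model returned for it. Below is a minimal sketch of that slice assignment with no API call. `roll_messages` and the toy messages are hypothetical names used only for illustration.

```python
from copy import deepcopy


def roll_messages(messages, prev_figure, prev_summary, figure):
    """Return a copy of `messages` with the last example pair replaced by the
    previous figure/summary and the final user slot set to the new figure."""
    rolled = deepcopy(messages)
    if prev_summary is not None:
        # Overwrite the two entries just before the final user slot.
        rolled[-3:-1] = [
            {"role": "user", "content": f"INFO: {prev_figure}"},
            {"role": "assistant", "content": prev_summary},
        ]
    rolled[-1] = {"role": "user", "content": f"INFO: {figure}"}
    return rolled


base = [
    {"role": "system", "content": "You are a figure and text summarizer."},
    {"role": "user", "content": "INFO: example"},
    {"role": "assistant", "content": "Text: example summary."},
    {"role": "user", "content": None},  # placeholder for the current figure
]

first = roll_messages(base, None, None, {"title": "A"})
second = roll_messages(first, {"title": "A"}, "Text: summary of A.", {"title": "B"})
print(len(second), second[-2]["content"])
```

The message list never grows, so the token cost per call stays roughly constant, at the price of dropping one few-shot example after the first figure.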

### Text Splitting, Embedding Models, and Vector DB
We'll be using OpenAI's text-embedding-ada-002 model. 

In [19]:
# Prepare text for chunking:
text = "\n\n".join(doc_contents["text"])

In [26]:
model = 'gpt-3.5-turbo'  # OpenAI LLM used later by the agent
# model = 'text-davinci-003'  # alternative OpenAI completion model
enc_code = tiktoken.encoding_for_model(model).name
tokenizer = tiktoken.get_encoding(enc_code)

# Determine length of input after tokenization
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=600,  # measured in tokens, so that 3 chunks fit in the context window
    chunk_overlap=100,  # tokens shared between consecutive chunks
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]  # order in which splits are prioritized
)
chunks = text_splitter.split_text(text)
print(
    f'The input text of length {len(text)} was split into {len(chunks)} chunks.'
)

The input text of length 187603 was split into 76 chunks.
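
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on the listed separators and then merges the pieces). It only illustrates how a size limit and an overlap interact. Whitespace-separated words stand in for tiktoken tokens, an assumption made so that the sketch runs without any dependency.

```python
def sliding_window_chunks(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of at most `chunk_size` items, where each
    window repeats the last `chunk_overlap` items of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the input
    return windows


words = "a b c d e f g h i j".split()
demo_chunks = sliding_window_chunks(words, chunk_size=4, chunk_overlap=1)
for window in demo_chunks:
    print(" ".join(window))
```

Because of the overlap, the summed chunk lengths (12 words here) exceed the input length (10 words), which is why any length comparison between chunks and source text needs some slack.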


In [30]:
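import math


def assert_roughly_equal(value1, value2, tolerance, message=None):
    # `tolerance` is relative (a fraction of the larger value), as in math.isclose.
    if not math.isclose(value1, value2, rel_tol=tolerance):
        if message is None:
            message = f"{value1} and {value2} are not roughly equal within {tolerance} tolerance"
        raise AssertionError(message)

# Chunks overlap, so their summed length exceeds len(text); allow 20% slack.
assert_roughly_equal(sum(len(chunk) for chunk in chunks), len(text), 0.2)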
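
`math.isclose` interprets `rel_tol` as a fraction of the larger of the two magnitudes, not as an absolute number of characters. The short check below shows the difference using made-up numbers:

```python
import math

# rel_tol is relative: 0.2 allows a difference of up to 20% of the larger value.
print(math.isclose(120, 100, rel_tol=0.2))  # difference 20, allowed 24
print(math.isclose(150, 100, rel_tol=0.2))  # difference 50, allowed 30

# A rel_tol of 100 allows a difference of 100x the larger value, so
# effectively any pair of positive numbers passes.
print(math.isclose(1, 1_000_000, rel_tol=100))

# abs_tol is the absolute counterpart, in the same units as the inputs.
print(math.isclose(1_000, 1_090, rel_tol=0, abs_tol=100))
```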

In [31]:
# Extend chunks with table descriptions:
chunks.extend(doc_contents['table_summary'])
# Extend chunks with figure descriptions:
chunks.extend(doc_contents['figure_summary'])
len(chunks)

98


In [37]:
divider = '-'*100
for chunk in chunks:
    print(f'{chunk}\n{divider}\n\n')

Refresh the World. Make a Difference.

CONTENTS

We build loved brands that bring joy to our consumers’ lives with beverage choices for all occasions, tastes and lifestyles. Our growth strategy is grounded in our core values and commitment to social and environmental responsibility.

SCOPE OF THIS REPORT

This 2022 Business & Sustainability Report is The Coca-Cola Company’s fifth report to integrate overall business and sustainability performance, data and context, reflecting our continued journey toward driving sustainable business practices into our core strategy.

Except as otherwise noted, this report covers the 2022 performance of The Coca-Cola Company and the Coca-Cola system (our company and our bottling partners), as applicable.

As used in this report, the terms “material,” “materiality,” “immaterial,” “substantive,” “significant” and other similar terminology are not used, or intended to be construed, as they have been defined by or construed in accordance with the securities

### Indexing:
Store the index on disk to avoid needlessly re-running the embedding code.

In [10]:
fpath = f"./data/faiss_db/{fname}.pkl"
is_override = True  # set to False to reuse the cached vector store instead of re-embedding

if os.path.exists(fpath) and not is_override:
    with open(fpath, 'rb') as f:
        vector_store = pickle.load(f)
else:
    # Init embedding model:
    embed = OpenAIEmbeddings(
        model='text-embedding-ada-002'
    )
    vector_store = FAISS.from_texts(chunks, embedding=embed)
    with open(fpath, 'wb') as f:
        pickle.dump(vector_store, f)
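
The load-or-build pattern in the cell above can be factored into a small helper. The sketch below uses only the standard library. `load_or_build` and its arguments are hypothetical names, and `expensive_build` stands in for the `FAISS.from_texts(...)` call. LangChain's FAISS wrapper also provides `save_local` and `load_local`, which may be more robust than pickling the store across library versions.

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn, override=False):
    """Load a pickled object from `path`, or build it with `build_fn` and
    cache the result. `override=True` forces a rebuild."""
    if os.path.exists(path) and not override:
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


build_calls = []


def expensive_build():
    build_calls.append(1)  # record every real build
    return {"index": [0.1, 0.2, 0.3]}


with tempfile.TemporaryDirectory() as tmp_dir:
    cache_path = os.path.join(tmp_dir, "faiss_db", "demo.pkl")
    first = load_or_build(cache_path, expensive_build)   # builds and caches
    second = load_or_build(cache_path, expensive_build)  # loads from disk
    third = load_or_build(cache_path, expensive_build, override=True)  # rebuilds

print(len(build_calls))
```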

### Set up QA Agent:
We will create an agent capable of:
- Fetching contextual information from the vector store
- Performing basic math operations

The idea is that the agent may then be able to derive KPIs that require basic arithmetic on the retrieved figures.

In [11]:
llm = OpenAI(
    temperature=0,
    model_name=model
)

In [89]:
class ContextRetrieval(BaseTool):
    
    name = "information_retrival"
    description = """
        Fetch the most recent information about a company's financials and ESG initiatives.
    """
    output_chunks = []

    @staticmethod
    def string_similarity(s1, s2):
        seq_matcher = difflib.SequenceMatcher(None, s1, s2)
        return seq_matcher.ratio()

    def _similarity_search(self, chunk: str) -> str:
        """Check if context has already been provided in the past."""
        for _chunk in self.output_chunks:
            if self.string_similarity(chunk, _chunk) > 0.9:
                return f"The information was shared in the previous {self.name} calls."
        
        # If no highly similar chunk has been returned before,
        # remember this chunk and return it unchanged.
        self.output_chunks.append(chunk)
        return chunk
    
    def _run(self, query: str) -> str:
        contents = [
            self._similarity_search(doc.page_content) 
            for doc in vector_store.similarity_search(query, k=2)
        ]
        
        full_content = '\n\n'.join(
            [
                f"Rank: {rank} | Content: {_content}" 
                for rank, _content in enumerate(contents, start=1)
            ]
        )
        return full_content + "\n\nContextual information sorted from most relevant to least relevant."
    
    def _arun(self, query: str):
        raise NotImplementedError(
            f"{self.__class__.__name__} does not currently support async run."
        )
        
llm_math = LLMMathChain(llm=llm)

# initialize the math tool
math_tool = Tool(
    name='Calculator',
    func=llm_math.run,
    description='Useful when you need to perform math operations.'
)
# when giving tools to an LLM, we must pass them as a list of tools.
tools = [math_tool, ContextRetrieval()]
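
The duplicate check inside `ContextRetrieval` can be tried in isolation. This sketch reimplements the same idea as a plain function. `dedupe_chunk`, `seen_chunks`, and the sample strings are hypothetical, and the 0.9 threshold is taken from the class above.

```python
import difflib

NOTICE = "The information was shared in a previous call."


def dedupe_chunk(chunk, seen, threshold=0.9):
    """Return `chunk` the first time it is seen; return a short notice if a
    near-identical chunk (similarity ratio above `threshold`) was seen before."""
    for previous in seen:
        ratio = difflib.SequenceMatcher(None, chunk, previous).ratio()
        if ratio > threshold:
            return NOTICE
    seen.append(chunk)
    return chunk


seen_chunks = []
s1 = "The quick brown fox jumps over the lazy dog."
s2 = "The quick brown fox jumps over the lazy dog!"  # near-duplicate of s1
s3 = "An unrelated sentence about water stewardship goals."

r1 = dedupe_chunk(s1, seen_chunks)
r2 = dedupe_chunk(s2, seen_chunks)
r3 = dedupe_chunk(s3, seen_chunks)
print(r1, r2, r3, sep="\n")
```

Replacing repeated chunks with a short notice keeps the agent's scratchpad small when consecutive queries hit the same passages. The trade-off is that `SequenceMatcher` compares every new chunk against all earlier ones, which is fine for a handful of calls but would get slow for long sessions.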

**Set up an agent just like in url_retrieval.ipynb.**

Note that a `utils.py` file will be created in the future to hold this setup code.

In [90]:
# Set up the base template
template = """You are an analyst tasked with aggregating financial and ESG KPIs about companies. You have access to the following tools:

{tools}

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin! Remember to answer as succinctly as possible when giving your final answer. The final answer, where possible, should just be a number or a boolean.

Question: {input}
{agent_scratchpad}"""

In [91]:
# Set up a prompt template which breaks up the intermediate_steps
# into thoughts that are used to fill the agent_scratchpad, 
# tools, and tool_names in the base template:
class CustomPromptTemplate(StringPromptTemplate):
    # The template to use
    template: str
    # The list of tools available
    tools: List[BaseTool]  # Tool subclasses BaseTool, so this covers both
    
    def format(self, **kwargs) -> str:
        # Get the intermediate steps (AgentAction, Observation tuples)
        # Format them in a particular way
        intermediate_steps = kwargs.pop("intermediate_steps")
        thoughts = ""
        for action, observation in intermediate_steps:
            thoughts += action.log
            thoughts += f"\nObservation: {observation}\nThought: "
        # Set the agent_scratchpad variable to that value
        kwargs["agent_scratchpad"] = thoughts
        # Create a tools variable from the list of tools provided
        kwargs["tools"] = "\n".join([f"{tool.name}: {tool.description}" for tool in self.tools])
        # Create a list of tool names for the tools provided
        kwargs["tool_names"] = ", ".join([tool.name for tool in self.tools])
        return self.template.format(**kwargs)
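
To see what ends up in `agent_scratchpad`, the formatting step can be run without LangChain. In this sketch, `FakeAction` is a hypothetical stand-in for LangChain's `AgentAction` (only its `log` field is used), and `build_scratchpad` and `demo_template` are illustrative names.

```python
from collections import namedtuple

# Stand-in for langchain's AgentAction; only the `log` field is used here.
FakeAction = namedtuple("FakeAction", ["tool", "tool_input", "log"])


def build_scratchpad(intermediate_steps):
    """Concatenate each action log with its observation, ending on an open
    'Thought: ' so the model continues reasoning from there."""
    thoughts = ""
    for action, observation in intermediate_steps:
        thoughts += action.log
        thoughts += f"\nObservation: {observation}\nThought: "
    return thoughts


steps = [
    (
        FakeAction(
            tool="Calculator",
            tool_input="2 + 2",
            log="Thought: I need to add.\nAction: Calculator\nAction Input: 2 + 2",
        ),
        "Answer: 4",
    )
]
demo_template = "Question: {input}\n{agent_scratchpad}"
demo_prompt = demo_template.format(
    input="What is 2 + 2?", agent_scratchpad=build_scratchpad(steps)
)
print(demo_prompt)
```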

In [92]:
prompt = CustomPromptTemplate(
    template=template,
    tools=tools,
    # This omits the `agent_scratchpad`, `tools`, and `tool_names` variables because those are generated dynamically
    # This includes the `intermediate_steps` variable because that is needed
    input_variables=["input", "intermediate_steps"]
)

In [93]:
class CustomOutputParser(AgentOutputParser):
    
    def parse(self, llm_output: str) -> Union[AgentAction, AgentFinish]:
        # Check if agent should finish
        if "Final Answer:" in llm_output:
            return AgentFinish(
                # Return values is generally always a dictionary with a single `output` key
                # It is not recommended to try anything else at the moment :)
                return_values={"output": llm_output.split("Final Answer:")[-1].strip()},
                log=llm_output,
            )
        # Parse out the action and action input
        regex = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"
        match = re.search(regex, llm_output, re.DOTALL)
        if not match:
            raise ValueError(f"Could not parse LLM output: `{llm_output}`")
        action = match.group(1).strip()
        action_input = match.group(2)
        # Return the action and action input
        return AgentAction(tool=action, tool_input=action_input.strip(" ").strip('"'), log=llm_output)
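
The parsing logic can be exercised on canned LLM output without creating an agent. The sketch below reuses the same regular expression. `parse_step` and the sample strings are hypothetical, and plain tuples replace `AgentAction` and `AgentFinish` so that it runs with the standard library only.

```python
import re

ACTION_REGEX = r"Action\s*\d*\s*:(.*?)\nAction\s*\d*\s*Input\s*\d*\s*:[\s]*(.*)"


def parse_step(llm_output):
    """Return ('finish', answer) or ('action', tool, tool_input)."""
    if "Final Answer:" in llm_output:
        return ("finish", llm_output.split("Final Answer:")[-1].strip())
    match = re.search(ACTION_REGEX, llm_output, re.DOTALL)
    if not match:
        raise ValueError(f"Could not parse LLM output: `{llm_output}`")
    tool = match.group(1).strip()
    tool_input = match.group(2).strip(" ").strip('"')
    return ("action", tool, tool_input)


sample_action = (
    "Thought: I need the latest financials\n"
    "Action: information_retrival\n"
    'Action Input: "coca-cola financials"'
)
sample_finish = "Thought: I now know the final answer\nFinal Answer: 42"

print(parse_step(sample_action))
print(parse_step(sample_finish))
```

Output that contains neither a `Final Answer:` nor an `Action:`/`Action Input:` pair raises `ValueError`, which is what stops the run when the model drifts from the prompt format.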

In [94]:
llm = OpenAI(
    temperature=0,  # measure of randomness/creativity
    model_name=model
)

# LLM chain consisting of the LLM and a prompt
llm_chain = LLMChain(
    llm=llm, 
    prompt=prompt  # Custom Prompt
)

tool_names = [tool.name for tool in tools]

agent = LLMSingleActionAgent(
    llm_chain=llm_chain, 
    output_parser=CustomOutputParser(),
    stop=["\nObservation:"],  # you want this to be whatever token you use in the prompt to denote the start of an Observation
    allowed_tools=tool_names
) 

agent_executor = AgentExecutor.from_agent_and_tools(
    agent=agent, 
    tools=tools, 
    verbose=True,
    max_iterations=3
)
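
Conceptually, the executor alternates between asking the LLM for the next step and running the chosen tool, until a final answer appears or `max_iterations` is reached. The toy loop below illustrates that control flow only and is not LangChain's implementation. `run_agent`, `make_scripted_llm`, and `toy_tools` are hypothetical, and the "LLM" is a canned list of replies.

```python
def run_agent(llm, tools, question, max_iterations=3):
    """Minimal think/act/observe loop with an iteration cap."""
    scratchpad = ""
    for _ in range(max_iterations):
        output = llm(question, scratchpad)
        if "Final Answer:" in output:
            return output.split("Final Answer:")[-1].strip()
        # Expect two trailing lines: "Action: <tool>" and "Action Input: <input>".
        action_line, input_line = output.strip().splitlines()[-2:]
        tool_name = action_line.split(":", 1)[1].strip()
        tool_input = input_line.split(":", 1)[1].strip()
        observation = tools[tool_name](tool_input)
        scratchpad += f"{output}\nObservation: {observation}\nThought: "
    return "Agent stopped due to iteration limit."


def make_scripted_llm(replies):
    """Return a fake LLM that yields canned replies in order."""
    remaining = iter(replies)
    return lambda question, scratchpad: next(remaining)


ACTION_REPLY = "Thought: I should add the numbers\nAction: Calculator\nAction Input: 2 + 3"
FINAL_REPLY = "I now know the final answer\nFinal Answer: 5"

toy_tools = {"Calculator": lambda expr: str(sum(int(x) for x in expr.split("+")))}
answer = run_agent(make_scripted_llm([ACTION_REPLY, FINAL_REPLY]), toy_tools, "What is 2 + 3?")
print(answer)
```

With `max_iterations=3`, as in the cell above, the agent gets at most three LLM calls in total, so a question that needs several retrievals plus a calculation may hit the cap before reaching a final answer.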

In [95]:
result = agent_executor.run(
    input="what were the net operating revenues of coca-cola in 2022"
)



> Entering new  chain...
Thought: I need to find the most recent financial information about coca-cola
Action: information_retrival
Action Input: coca-cola financials

Observation: Rank: 1 | Content: Net income attributable to shareowners

of The Coca-Cola Company

8,920

7,747

9,771

9,542

20

15

10

5

0

-5

-10

16%

16%

6%

2020

2019

2021

2022

(9%)

20

15

10

5

0

19%

13%

12%

0%
2020

2019

2021

2022

Organic Revenue Growth 
(Non-GAAP)1

Comparable Currency Neutral Operating 
Income Growth (Non-GAAP)2

Per Share Data

Basic earnings per share

$2.09

$1.80

$2.26

$2.20

Comparable Currency Neutral Diluted 
Earnings Per Share Growth (Non-GAAP)3

Adjusted Free Cash Flow Conversion Ratio 
(Non-GAAP)4

Diluted earnings per share

Cash dividends

Balance Sheet Data

Total assets

Long-term debt

2.07

1.60

1.79

1.64

2.25

1.68

2.19

1.76

$86,381

$87,296

$94,354

$ 92,763

27,516

40,125

38,116

36,377

20

15

10

5

0

-