# Generative AI and Risk Information

## What is Generative AI?

> Generative AI (GenAI) is a special type of Artificial Intelligence that can learn from and mimic large amounts of data to create content such as text, images, music, videos, code, and more.

## Application of Generative AI in Risk Management

- Information analysis
- Information provision
- Information automation

Source: [PwC](https://www.pwc.com/us/en/industries/financial-services/library/gen-ai-and-risk-management.html)

### A Proof-of-Concept (PoC): Risk Agent

In this part of the course, we develop a conversational AI that answers questions about the risks and challenges faced by publicly listed companies, based on their annual reports. More precisely, we build an application in which a Large Language Model (LLM) routes each user query to the annual reports that are relevant to it. The query is then enriched with information from those reports, which allows the LLM to produce answers that closely reflect the reports' content. In other words, a retrieval mechanism supplies the LLM with domain knowledge that it would not have from pretraining alone. This mechanism is called Retrieval-Augmented Generation (RAG) and is a popular approach for building conversational GenAI applications.

> LLMs are very large deep learning models that are pre-trained on vast amounts of data. They are highly flexible: a single model can perform very different generative tasks, such as answering questions, summarizing documents, translating languages, and completing sentences.
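
To make the retrieval step more concrete, here is a minimal sketch of the RAG idea in plain Python. It is a toy illustration under stated assumptions, not the mechanism LlamaIndex uses: the passages are invented, relevance is approximated by simple word overlap instead of embeddings, and the helper names `tokenize`, `retrieve`, and `build_prompt` are hypothetical.

```python
import re

# Toy passages standing in for chunks of annual reports (invented text).
passages = [
    "Currency fluctuations may reduce reported revenue.",
    "Supply chain disruptions could delay product launches.",
    "The company opened a new flagship store in Berlin.",
]


def tokenize(text: str) -> set[str]:
    """Lowercase a text and return its set of word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def retrieve(query: str, passages: list[str], top_k: int = 1) -> list[str]:
    """Rank passages by word overlap with the query and keep the best top_k."""
    query_tokens = tokenize(query)
    ranked = sorted(
        passages,
        key=lambda p: len(query_tokens & tokenize(p)),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query: str, context: list[str]) -> str:
    """Enrich the user query with retrieved context before it is sent to an LLM."""
    context_block = "\n".join(f"- {c}" for c in context)
    return f"Answer using only this context:\n{context_block}\n\nQuestion: {query}"


query = "Which supply chain risks are mentioned?"
prompt = build_prompt(query, retrieve(query, passages))
print(prompt)
```

The prompt that reaches the LLM now contains the passage most related to the question, which is the "augmentation" in RAG.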

While risk-related information from company reports is publicly available, collecting and structuring it and making it readily available to the decision makers who act on it is tedious and resource-intensive. For example:
- Financial Analysts and Investment Professionals such as portfolio managers, equity analysts, and hedge fund managers, need to assess the risks and challenges of companies to make informed investment decisions. A risk agent could help them quickly analyze risk disclosures in annual reports, saving time and improving accuracy.
- Corporate Risk Managers need to benchmark their own risk disclosures against competitors or industry standards. A risk agent could help them identify trends in risk reporting and ensure compliance with regulatory requirements.
- ESG analysts focus on evaluating companies' sustainability and governance practices. Risk disclosures in annual reports often include information about environmental and social risks, which are critical for ESG assessments. A risk agent could help them extract and analyze ESG-related risks from annual reports to evaluate a company's sustainability profile.

Developing LLM-based applications requires you to choose one of the many proprietary or open-source LLMs. We use the proprietary LLM `command-r-plus` from [Cohere](https://cohere.com/command). Cohere offers trial API keys that can be used to send calls to their LLMs and return responses free of charge. These trial keys are rate-limited and cannot be used for commercial purposes.

To get a trial API key, [create an account](https://dashboard.cohere.com/welcome/register) on Cohere and generate the key there.


<img src="img/cohere.png" alt="Cohere trial key" width="1000">

In the root directory of this project, create a new file named `.env` and paste in your trial API key.

<img src="img/env.png" alt=".env file" width="700">

As a first step, we load the API keys from the `.env` file into the environment so that they are available in our code. The cell below expects the variables `COHERE_API_KEY` and `HUGGINGFACE_API_KEY`.

In [1]:
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

# Access the API keys
cohere_api_key = os.getenv("COHERE_API_KEY")
huggingface_api_key = os.getenv("HUGGINGFACE_API_KEY")
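
Because `os.getenv` silently returns `None` for a missing variable, a misconfigured `.env` file may only surface later as an authentication error. The sketch below shows one way to fail early. The helper `require_env` and the variable `DEMO_API_KEY` are hypothetical and used only for illustration.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"{name} is not set; add it to your .env file.")
    return value


# Demo with a dummy variable so the snippet runs without real credentials.
os.environ["DEMO_API_KEY"] = "dummy-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same check could be applied to `COHERE_API_KEY` and `HUGGINGFACE_API_KEY` after `load_dotenv()` has run.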

Moving on, we will primarily work with [LlamaIndex](https://docs.llamaindex.ai/en/stable/), one of the leading frameworks for building LLM-powered agents.

> Agents are LLM-powered knowledge assistants that use tools to perform tasks like research, data extraction, and more. Agents range from simple question-answering to being able to sense, decide and take actions in order to complete tasks.

LlamaIndex provides the following tools that formalize the development of search agents:
- **Data connectors** (Readers) ingest existing data from their native source and format. These could be APIs, PDFs, SQL databases, and others.
- **Data indexes** structure your data in intermediate representations that are easy and performant for LLMs to consume.
- **Engines** provide natural language access to your data. For example:
    - Query engines are powerful interfaces for question-answering (e.g. a RAG flow).
    - Chat engines are conversational interfaces for multi-message, "back and forth" interactions with your data.
- **Agents** are LLM-powered knowledge workers augmented by tools, from simple helper functions to API integrations and more.

In our use case, we develop an agent capable of extracting risk-related information from annual company reports. The agent orchestrates which report to retrieve information from based on the user's query. The illustration below provides an idea of what such a system may look like.

<img src="img/multi_doc_agent.webp" alt="Multi-document agent" width="900">
<p>Image source: <a href="https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6">Ivan Ilin</a></p>

## Developing a Multi-Document Risk Agent

### Read documents

For demonstrative purposes, the original annual reports for a sample of DAX 40 companies can be found in `data/raw/reports/`.

In [2]:
reports = os.listdir('data/raw/reports/')
reports

['adidas_2023.pdf',
 'allianz_2023.pdf',
 'basf_2023.pdf',
 'bayer_2023.pdf',
 'beiersdorf_2023.pdf',
 'bmw_2023.pdf']
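
Note that `os.listdir` returns entries in arbitrary order and includes any non-PDF files that happen to be in the folder. As a hedged alternative, the sketch below lists only PDF files in sorted order using `pathlib`. The helper `list_pdf_reports` is hypothetical, and the demo runs on a temporary directory with made-up file names.

```python
import tempfile
from pathlib import Path


def list_pdf_reports(directory: str) -> list[str]:
    """Return the PDF file names in a directory, sorted alphabetically."""
    return sorted(p.name for p in Path(directory).glob("*.pdf"))


# Demo on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["bmw_2023.pdf", "adidas_2023.pdf", "notes.txt"]:
        (Path(tmp) / name).touch()
    demo_reports = list_pdf_reports(tmp)

print(demo_reports)
```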

To read the annual reports, we use `SimpleDirectoryReader`, which loads data from local files, parses the text found in them, and automatically selects a suitable file reader based on each file's extension.

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader('data/raw/reports/').load_data(show_progress=True)
print(f"In total {len(documents)} Document objects have been loaded.")

Loading files: 100%|██████████| 6/6 [03:47<00:00, 37.93s/file]


1735

As you can see, `SimpleDirectoryReader` returns a list of `Document` objects, where each element represents a single page of the annual reports loaded above.

In [10]:
documents[0:3]

[Document(id_='9e9bcf34-1f19-40c4-849c-c1e477945867', embedding=None, metadata={'page_label': '1', 'file_name': 'adidas_2023.pdf', 'file_path': 'w:\\teaching\\risk_analytics\\data\\raw\\reports\\adidas_2023.pdf', 'file_type': 'application/pdf', 'file_size': 20968677, 'creation_date': '2025-02-02', 'last_modified_date': '2025-02-02'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='2023\nANNUAL REPORT', path=None, url=None, mimetype=None), image_resource=None, audio_resource=None, video_resource=None, text_template='{metadata_str}\n\n{content}'),
 Document(id_='9978232f-cae8-40fb-a671-d730513fab90', embedding=None, metadata={'page_label': '

A `Document` object comes with a unique ID, some metadata, and the data found in the document, organized by data type:
- text_resource
- image_resource
- audio_resource
- video_resource

In [55]:
vars(documents[100])

{'id_': '49b503ab-4c12-498c-b921-057f41885a8f',
 'embedding': None,
 'metadata': {'page_label': '101',
  'file_name': 'adidas_2023.pdf',
  'file_path': 'w:\\teaching\\risk_analytics\\data\\raw\\reports\\adidas_2023.pdf',
  'file_type': 'application/pdf',
  'file_size': 20968677,
  'creation_date': '2025-02-02',
  'last_modified_date': '2025-02-02'},
 'excluded_embed_metadata_keys': ['file_name',
  'file_type',
  'file_size',
  'creation_date',
  'last_modified_date',
  'last_accessed_date'],
 'excluded_llm_metadata_keys': ['file_name',
  'file_type',
  'file_size',
  'creation_date',
  'last_modified_date',
  'last_accessed_date'],
 'relationships': {},
 'metadata_template': '{key}: {value}',
 'metadata_separator': '\n',
 'text_resource': MediaResource(embeddings=None, data=None, text='1 2 3 4 5 \nT O  O U R SHA REHO L D ERS  GRO U P  MAN A GEMEN T  REP O RT – \nO U R CO MPA N Y \nGRO U P  MAN A GEMEN T  REP O RT – \nF I N A N CI AL  REVI EW \nCO N SO L I DA T ED  FI N AN CI A L \nST A
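
Since every page carries its source file in `metadata['file_name']`, the metadata can be used to check how many pages were loaded per report. The sketch below illustrates this with a hypothetical stand-in class, `PageStub`, that mimics only the `text` and `metadata` attributes, so the snippet runs without LlamaIndex. The page contents are invented.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PageStub:
    """Hypothetical stand-in for a Document (only the fields used here)."""
    text: str
    metadata: dict = field(default_factory=dict)


pages = [
    PageStub("2023\nANNUAL REPORT", {"file_name": "adidas_2023.pdf", "page_label": "1"}),
    PageStub("Letter to shareholders", {"file_name": "adidas_2023.pdf", "page_label": "2"}),
    PageStub("Cover", {"file_name": "bmw_2023.pdf", "page_label": "1"}),
]

# Count the number of loaded pages per source file.
pages_per_report = Counter(page.metadata["file_name"] for page in pages)
print(pages_per_report)
```

With the real data, the same `Counter` expression should work on `documents` directly, assuming each `Document` exposes `metadata['file_name']` as shown in the output above.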

Next, we use the custom function `display_document_with_image_side_by_side()` to display and compare the text parsed by `SimpleDirectoryReader` with the original.

In [13]:
from util import display_document_with_image_side_by_side

display_document_with_image_side_by_side(
    document=documents[165],
    image_path='img/annual_report.png'
)

When using the `text_resource` in the Risk Agent, we do not need the header information that repeats on every page. It only clutters the content and should be removed from the `Document` objects. Similar redundant text appears in the annual reports of the other companies. Moreover, we do not want to use the entire annual reports, only the pages that contain information on the companies' risks and opportunities. Finally, we merge all relevant pages of one annual report into a single `Document` object; otherwise, paragraphs that span two pages would be handled separately in downstream processing.

In [16]:
pre_process_dict = {
    "adidas": {
        "pages": range(166, 187, 1),
        "string_to_remove": "1 2 3 4 5 \nT O  O U R SHA REHO L D ERS  GRO U P  MAN A GEMEN T  REP O RT – \nO U R CO MPA N Y \nGRO U P  MAN A GEMEN T  REP O RT – \nF I N A N CI AL  REVI EW  \nCO N SO L I DA T ED  FI N AN CI A L \nST A T EMEN T S \nA D D I T I ON A L I N FO RMA T I ON \n \n\\d{1,3} \n    \n    \n A N N U A L  R E P O R T  2 0 2 3",
    },
    "allianz": {
        "pages": range(21, 35, 1),
        "string_to_remove": "B _ Management Report of Allianz SE \n\\d{1,3} Annual Report 2023 – Allianz SE \n"
    },
    "basf": {
        "pages": range(172, 184, 1),
        "string_to_remove": "Combined Management’s Report – Opportunities and Risks\n"
    },
    "bayer": {
        "pages": range(100, 116, 1),
        "string_to_remove": " \n \nBayer Annual Report 2023 A Combined Management Report\n3.2 Opportunity and Risk Report\n \\d{1,3}\n"
    },
    "beiersdorf": {
        "pages": range(155, 166, 1),
        "string_to_remove": " \n \nBeiersdorf Annual Report 2023 A Combined Management Report\n3.2 Opportunity and Risk Report\n \\d{1,3}\n"
    },
    "bmw": {
        "pages": range(126, 142, 1),
        "string_to_remove": "\\d{1,3} BMW Group Report 2023\\s+To Our Stakeholders\\s+Combined Management Report\\s+Group Financial Statements\\s+Responsibility Statement and Auditor’s Report\\s+Remuneration Report\\s+Other Information\\s+\n\\s+Risks and Opportunities\\s+\n"
    }
}

In [None]:
from llama_index.core import Document
import re

def pre_process_document(report_file_path: str, pre_process_dict: dict = None):
    """Load one report, keep the selected pages, strip repeated headers, and merge the text."""
    documents = SimpleDirectoryReader(input_files=[report_file_path]).load_data(show_progress=True)

    # Defaults: keep every page and remove nothing
    string_to_remove = None
    pages = range(len(documents))
    if pre_process_dict is not None:
        string_to_remove = pre_process_dict.get('string_to_remove')
        pages = pre_process_dict.get('pages', pages)

    page_texts = []
    for page in pages:
        text = documents[page].text
        if string_to_remove is not None:
            # string_to_remove is interpreted as a regular expression
            text = re.sub(string_to_remove, "", text)
        page_texts.append(text)

    # Merge all selected pages into a single Document
    return [Document(text="\n".join(page_texts))]

In [88]:
temp = pre_process_document('data/raw/reports/adidas_2023.pdf', pre_process_dict.get('adidas'))

Loading files: 100%|██████████| 1/1 [00:27<00:00, 27.59s/file]
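
The header removal relies on `re.sub`, so each `string_to_remove` entry is treated as a regular expression (which is why `\d{1,3}` can match the changing page number). The sketch below shows the same clean-and-merge step on invented page texts. The header string and page contents are made up for illustration.

```python
import re

# Invented page texts; each starts with a repeating header plus a page number.
toy_pages = [
    "Annual Report 2023 - Risk Report \n101 \nCurrency risk may affect",
    "Annual Report 2023 - Risk Report \n102 \nreported earnings.",
]
header_pattern = r"Annual Report 2023 - Risk Report \n\d{1,3} \n"

# Strip the header from every page, then merge the pages into one text.
cleaned = [re.sub(header_pattern, "", page) for page in toy_pages]
merged_text = "\n".join(cleaned)
print(merged_text)

# A literal header that contains regex metacharacters can be escaped first.
literal_header = "Risk Report (2023)"
escaped_pattern = re.escape(literal_header)
```

If a header is meant to be matched literally and contains characters such as parentheses or dots, `re.escape` avoids accidental pattern matches.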


In [None]:
from llama_index.core import (
    Settings,
    load_index_from_storage,
    SummaryIndex,
    VectorStoreIndex,
    StorageContext
)
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.llms.cohere import Cohere
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.embeddings.huggingface_api import HuggingFaceInferenceAPIEmbedding
from tqdm import tqdm
from llama_index.core.tools import QueryEngineTool, ToolMetadata
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.objects import ObjectIndex
from pathlib import Path
from llama_index.core.agent import FunctionCallingAgentWorker
from llama_index.core.agent import AgentRunner

In [24]:
companies = [
    'adidas',
    'allianz',
    'basf',
    'bayer',
    'beiersdorf',
    'bmw'
    ]

year = '2023'

In [None]:
async def create_doc_tools(
    company_fp: str,
    company: str,
    post_process: dict = None,
    llm_model = Cohere(model="command-r-plus", api_key=cohere_api_key),
    embed_model = HuggingFaceInferenceAPIEmbedding(model_name="Snowflake/snowflake-arctic-embed-l-v2.0", token=huggingface_api_key),
) -> AgentRunner:
    
    # LLM model
    Settings.llm = llm_model
    # embedding model
    Settings.embed_model = embed_model

    if not os.path.exists(f"./data/reports/{company}"):
        # load pdf documents
        documents = SimpleDirectoryReader(input_files=[company_fp]).load_data()

        if post_process is not None:
            overall_text = ""
            string_to_remove = post_process.get('string_to_remove')
            for page in post_process.get('pages'):
                document = documents[page]
                text = document.text
                if string_to_remove is not None:
                    text = re.sub(string_to_remove, "", text)
                overall_text = "\n".join([overall_text, text])
            documents = [Document(text=overall_text)]
        
        splitter = SentenceSplitter()
        nodes = splitter.get_nodes_from_documents(documents)

        # summary index
        summary_index = SummaryIndex(nodes)
        summary_index.storage_context.persist(
            persist_dir=f"./data/reports/{company}/summary_index"
        )
        # vector store index
        vector_index = VectorStoreIndex(
            nodes,
            embed_model=Settings.embed_model)
        vector_index.storage_context.persist(
            persist_dir=f"./data/reports/{company}/vector_index"
        )
    
    else:
        summary_index = load_index_from_storage(
            StorageContext.from_defaults(
                persist_dir=f"./data/reports/{company}/summary_index"
            )
        )
        vector_index = load_index_from_storage(
            StorageContext.from_defaults(
                persist_dir=f"./data/reports/{company}/vector_index"
            )
        )


    # Summary query engine based on a
    # simple retriever that returns all nodes
    summary_query_engine = summary_index.as_query_engine()

    # Vector query engine based on a
    # vector index retriever that only returns the top k results
    vector_query_engine = vector_index.as_query_engine(similarity_top_k=2)

    # define tools
    query_engine_tools = [
        QueryEngineTool.from_defaults(
            name=f"{company}_summary_tool",
            query_engine=summary_query_engine,
            description=(
                f"Useful for summarization questions related to the risks outlined in the annual report of {company}."
            ),
        ),

        QueryEngineTool.from_defaults(
            name=f"{company}_vector_tool",
            query_engine=vector_query_engine,
            description=(
                f"Useful for retrieving specific context concerning {company}'s risks and challenges as highlighted in its annual report."
            ),
        )
    ]

    agent_worker = FunctionCallingAgentWorker.from_tools(
        query_engine_tools, 
        llm=Settings.llm, 
        verbose=True,
        system_prompt=f"""
        You are an agent designed to answer queries about the company {company} which can be answered based on paragraphs 
        from their annual reports where they outline their risks and challenges.
        


        Please always use the tools provided to answer a question. Do NOT rely on prior knowledge.
        """
    )
    agent = AgentRunner(agent_worker)

    return agent
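
The two query engines differ mainly in what they retrieve: the summary engine passes all nodes to the LLM, while the vector engine keeps only the `similarity_top_k` most similar ones. The sketch below mimics this difference in plain Python. It is a toy under stated assumptions: the nodes are invented, and a bag-of-words count vector with cosine similarity stands in for a real embedding model. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


nodes = [
    "Exchange rate movements are a financial risk.",
    "Cyber attacks are an operational risk.",
    "New regulation is a compliance risk.",
]


def summary_retrieve(nodes: list[str]) -> list[str]:
    """Summary-style retrieval: hand every node to the LLM."""
    return list(nodes)


def vector_retrieve(query: str, nodes: list[str], top_k: int = 2) -> list[str]:
    """Vector-style retrieval: keep only the top_k most similar nodes."""
    q = embed(query)
    return sorted(nodes, key=lambda n: cosine(q, embed(n)), reverse=True)[:top_k]


top_nodes = vector_retrieve("How large is the exchange rate risk?", nodes, top_k=2)
print(top_nodes)
```

This is also why the summary tool suits broad summarization questions, whereas the vector tool suits targeted questions about specific risks.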

In [37]:
agents = {}
for company in tqdm(companies):
    agent = await create_doc_tools(
        company_fp = Path(f"./data/raw/reports/{company}_{year}.pdf"),
        company = company,
        post_process = pre_process_dict[company])
    agents[company] = agent

100%|██████████| 6/6 [00:01<00:00,  5.95it/s]


In [28]:
# define tool for each document agent
all_tools = []
for company in tqdm(companies):
    report_summary = (
        f"This content contains information about {company}'s annual report. Use"
        f" this tool if you want to answer any questions about {company}.\n"
    )
    doc_tool = QueryEngineTool(
        query_engine=agents[company],
        metadata=ToolMetadata(
            name=f"tool_{company}",
            description=report_summary,
        ),
    )
    all_tools.append(doc_tool)

100%|██████████| 6/6 [00:00<?, ?it/s]


In [30]:
obj_index = ObjectIndex.from_objects(
    all_tools,
    index_cls=VectorStoreIndex,
)

RuntimeError: Cannot run the event loop while another loop is running
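
This `RuntimeError` typically indicates that code tried to start an event loop while another one was already running, which is the normal situation inside a notebook. The stdlib-only sketch below reproduces that situation outside of LlamaIndex. The function names are hypothetical. A workaround commonly used in notebooks is the third-party `nest_asyncio` package (calling `nest_asyncio.apply()` before building the index); whether it resolves this particular cell is an assumption that the recorded output does not confirm.

```python
import asyncio


async def fetch_answer() -> int:
    return 42


async def notebook_like_cell() -> str:
    """Runs inside an active loop, as notebook cells do, and tries to start a second one."""
    coro = fetch_answer()
    second_loop = asyncio.new_event_loop()
    try:
        second_loop.run_until_complete(coro)
        return "no error"
    except RuntimeError as err:
        return str(err)
    finally:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        second_loop.close()


error_message = asyncio.run(notebook_like_cell())
print(error_message)
```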

In [23]:
top_agent_worker = FunctionCallingAgentWorker.from_tools(
    tool_retriever=obj_index.as_retriever(
        similarity_top_k=3,
        node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.8)]
        ),
    system_prompt="""
        You are an agent designed to answer queries about a set of given companies.
        Please always try to use the tools provided to answer a question. 
        If none of the tools can be used to answer the question, respond to the user
        that you do not have any information available about the company they are asking about.\n
        """,
    verbose=True
)

top_agent = AgentRunner(top_agent_worker)
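
The tool retriever first selects the `similarity_top_k=3` tools whose descriptions are most similar to the query, and the `SimilarityPostprocessor` then discards candidates whose score falls below `similarity_cutoff=0.8`. The sketch below illustrates this two-step filter with invented scores. The function `select_tools` and all numbers are hypothetical and do not come from the actual index.

```python
def select_tools(scored_tools: dict[str, float], top_k: int = 3, cutoff: float = 0.8) -> list[str]:
    """Keep the top_k highest-scoring tools, then drop those below the cutoff."""
    ranked = sorted(scored_tools.items(), key=lambda item: item[1], reverse=True)[:top_k]
    return [name for name, score in ranked if score >= cutoff]


# Invented similarity scores between a user query and each tool description.
scores_bmw_query = {"tool_bmw": 0.91, "tool_adidas": 0.62, "tool_basf": 0.58, "tool_bayer": 0.55}
scores_unknown_company = {"tool_bmw": 0.41, "tool_adidas": 0.39, "tool_basf": 0.37, "tool_bayer": 0.35}

print(select_tools(scores_bmw_query))
print(select_tools(scores_unknown_company))
```

With a high cutoff, a query about a company outside the sample may leave the top-level agent without any tool, in which case the system prompt instructs it to say that no information is available.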

In [16]:
agent.chat("What does the Norvatis mention about litigation risks in its annual report?")

Added user message to memory: What does the Norvatis mention about litigation risks in its annual report?
=== LLM Response ===
I'm sorry, I can only answer questions about BMW.


AgentChatResponse(response="I'm sorry, I can only answer questions about BMW.", sources=[], source_nodes=[], is_dummy_stream=False, metadata=None)

In [24]:
response = top_agent.chat("Do BMW or Astra Zeneca face residual value risks and how they mitigate these risks?")

Added user message to memory: Do BMW or Astra Zeneca face residual value risks and how they mitigate these risks?
=== LLM Response ===
I can provide some general information about residual value risk, but I don't have specific risk mitigation strategies for BMW or AstraZeneca. Here's what I can tell you:

Residual value risk is the potential financial risk associated with the future value of an asset at the end of its lease or useful life. This risk is relevant for companies that deal with assets that depreciate over time, such as vehicles or equipment.

For BMW, which is an automotive manufacturer, residual value risk is a significant concern. The residual value of their vehicles can impact the company's financial performance and profitability. BMW may face residual value risk due to factors such as changing consumer preferences, advancements in technology, or economic conditions that could affect the future value of their cars.

To mitigate residual value risk, automotive companies l

Resources:
- [Let's talk about LlamaIndex and LangChain](https://superwise.ai/blog/lets-talk-about-llamaindex-and-langchain/)
- [LlamaIndex](https://docs.llamaindex.ai/en/stable/)
- [Advanced RAG techniques - An illustrated overview](https://pub.towardsai.net/advanced-rag-techniques-an-illustrated-overview-04d193d8fec6)