# ESG Fact-Check 

This notebook evaluates how useful LLMs are for extracting financial KPIs from ESG reports. It covers the following steps:
- Load & parse document 
- Split document into page nodes 
- Create a vector embedding for each page and store metadata for each page

In [1]:
!pip install llama-index
!pip install python-dotenv
!pip install llama-index-core
!pip install llama-index-embeddings-openai
!pip install llama-parse
!pip install pinecone-client
!pip install llama-index-vector-stores-pinecone

In [2]:
import os

import nest_asyncio
from dotenv import load_dotenv
from llama_index.core import Settings, SimpleDirectoryReader, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.pinecone import PineconeVectorStore
from llama_parse import LlamaParse
from pinecone import Pinecone, ServerlessSpec
# from IPython.display import Markdown, display

nest_asyncio.apply()  # Allow nested event loops so async calls work inside Jupyter
load_dotenv()  # Load API keys (e.g. PINECONE_API_KEY) from a local .env file

True

## Define Embedding Model and LLM

In [3]:
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
llm = OpenAI(model="gpt-4-turbo")

# Register the LLM and embedding model as global defaults
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 1024

## Document Loading and Preprocessing

In [4]:
documents = SimpleDirectoryReader(
    input_files=["./data/apple_esg_report.pdf"],
).load_data()

In [5]:
print(f"{len(documents)} Documents")

85 Documents

In [6]:
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=100)
nodes = splitter.get_nodes_from_documents(documents)
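
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a minimal sketch of overlapping windows in plain Python. It is a simplification: `SentenceSplitter` counts tokens and tries to respect sentence boundaries, while the hypothetical `chunk_with_overlap` helper below just slides a fixed window over a list.

```python
def chunk_with_overlap(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# Toy example: 10 "tokens", windows of 4 that overlap by 1
toy_chunks = chunk_with_overlap(list(range(10)), chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

Each window repeats the last token of the previous one, so text that straddles a chunk boundary still appears intact in at least one chunk.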

In [7]:
print(f"Content of node 1: {nodes[1].get_content(metadata_mode='all')}")

Content of node 1: page_label: 2
file_name: apple_esg_report.pdf
file_path: data/apple_esg_report.pdf
file_type: application/pdf
file_size: 9324447
creation_date: 2024-06-17
last_modified_date: 2024-06-16

Introduction
3  Letter from Tim Cook
4 Report highlights
6 Our approach
8  Our commitment  
to transparency
9 Advocating for change
10  Our commitment  
to human rights Environment
13 Our approach
13 Climate change
18 Resources
20 Smarter chemistryOur People 
23 Our approach 
23 Inclusion and diversity
26  Growth and  
development
27 Benefits
28 Compensation
29 Engagement
30  Workplace practices  
and policies
33  Health and safety  
at Apple
Suppliers
37  Our approach
40  Labor and human rights  
in the supply chain
43  Health, safety,  
and wellness
44  Responsible materials 
sourcing
45  Education and 
professional development
46 EnvironmentCustomers
48 Our approach
48 Privacy
50 Accessibility 
52 Inclusive design
53 Education
54 Health
55 Caring for customers
Communities
59 Our a

## Vector Store 

In [8]:
pc = Pinecone(api_key=os.environ['PINECONE_API_KEY'])

### Create index 

In [9]:
index_name = 'esgeneindex'

# Check if the index with the given name exists
if index_name not in pc.list_indexes().names():
    print('Creating new index..')
    pc.create_index(
        name=index_name,
        dimension=1536,  # must match the output size of text-embedding-3-small
        metric='euclidean',
        spec=ServerlessSpec(
            cloud='aws',
            region='us-east-1'
        )
    )
    
pinecone_index = pc.Index(index_name)
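
A note on the `metric='euclidean'` choice: OpenAI documents its embeddings as normalized to length 1, and for unit vectors Euclidean distance and cosine similarity produce the same ranking, since ‖a − b‖² = 2 − 2·cos(a, b). The sketch below checks this on toy 3-dimensional vectors; the vectors and helper names are made up for illustration, not taken from the report.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy unit vectors standing in for real 1536-dimensional embeddings
query_vec = normalize([1.0, 2.0, 0.5])
candidates = {
    "a": normalize([1.0, 2.1, 0.4]),
    "b": normalize([-1.0, 0.5, 2.0]),
    "c": normalize([0.2, 1.0, 1.0]),
}

# Highest cosine similarity first vs. smallest Euclidean distance first
by_cosine = sorted(candidates, key=lambda k: cosine_similarity(query_vec, candidates[k]), reverse=True)
by_euclid = sorted(candidates, key=lambda k: euclidean_distance(query_vec, candidates[k]))
print(by_cosine, by_euclid)
```

Both orderings agree here, so either metric should retrieve the same neighbors as long as the stored vectors are unit length. The raw scores differ, though, so thresholds tuned for one metric do not carry over to the other.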


In [10]:
# vector_index = VectorStoreIndex(nodes=nodes)  # in-memory alternative without Pinecone
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
# Embeds every node and upserts the vectors into the Pinecone index
index = VectorStoreIndex(nodes, storage_context=storage_context)

Upserted vectors:   0%|          | 0/99 [00:00<?, ?it/s]

In [11]:
from llama_index.core.vector_stores import MetadataFilters, FilterCondition
from typing import List, Optional


def query_vector_index(query: str, page_nums: Optional[List[str]] = None):
    """Query the index, optionally restricting retrieval to the given page labels."""
    page_nums = page_nums or []

    # One exact-match filter per page label; OR-ed so a node on any listed page qualifies
    meta_data = [{"key": "page_label", "value": p} for p in page_nums]

    query_engine = index.as_query_engine(
        similarity_top_k=5,
        filters=MetadataFilters.from_dicts(meta_data, condition=FilterCondition.OR),
        verbose=True,
    )

    # Returns a llama-index response object; print() or str() yields the answer text
    res = query_engine.query(query)
    return res
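
The `List[str]` hint on `page_nums` suggests that `page_label` values are stored as strings, so a filter built from an integer page number would presumably match nothing. As a small sketch, a hypothetical helper (not part of the notebook or of llama-index) can coerce page numbers to strings before the filter dicts are built:

```python
from typing import Dict, Iterable, List, Union


def build_page_filters(page_nums: Iterable[Union[int, str]]) -> List[Dict[str, str]]:
    """Return one {"key", "value"} dict per page, with page labels coerced to strings."""
    return [{"key": "page_label", "value": str(p)} for p in page_nums]


# Mixed int/str input ends up as uniform string labels
page_filters = build_page_filters([40, "41"])
print(page_filters)
```

The resulting list has the same shape as `meta_data` above and could be passed to `MetadataFilters.from_dicts(page_filters, condition=FilterCondition.OR)`.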

In [15]:
query = "Are there examples of how stakeholder feedback has influenced sustainability initiatives?"

results = query_vector_index(query)
print(results)

Yes, there are examples of how stakeholder feedback has influenced sustainability initiatives. For instance, feedback received directly from supplier employees helps ensure that labor and human rights are respected throughout the global supply chain. This feedback is used to address emerging risks, improve rights training for supplier employees and management, and continually strengthen the Supplier Code of Conduct and Supplier Responsibility Standards. Additionally, the feedback channels, such as interviews with supplier employees and third-party anonymous hotlines, allow for the collection of insights that lead to improvements in workplace conditions and compliance with labor standards. These mechanisms ensure that the voices of those within the supply chain directly contribute to the evolution and enhancement of sustainability practices.
