<a href="https://colab.research.google.com/github/kchen737/CS213_Project/blob/main/LLM_Ingredion.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction:
This project aims to automate the extraction of sustainability and ESG (Environmental, Social, and Governance) data from PDF reports. These reports are usually unstructured and may be image-based or text-encoded, which makes it difficult to extract quantitative metrics from them. <br>
This notebook is a work-in-progress data extraction pipeline that converts unstructured sustainability reports into structured datasets, enabling further analysis, comparison, and visualization.

## Setting Up Dependencies:

In [None]:
!uv pip install -q langchain-google-genai google-generativeai

import google.generativeai as genai
import os, getpass

os.environ["GEMINI_API_KEY"] = getpass.getpass("Enter your Google AI API key: ")
genai.configure(api_key=os.environ["GEMINI_API_KEY"])


In [None]:
!uv venv
!uv pip install pymupdf4llm
!uv pip install requests
!uv pip install pymupdf

Using CPython 3.12.11 interpreter at: /usr/bin/python3
Creating virtual environment at: .venv
✔ A virtual environment already exists at `.venv`. Do you want to replace it? · yes
Activate with: source .venv/bin/activate
Using Python 3.12.11 environment at: /usr
Audited 1 package in 178ms
Using Python 3.12.11 environment at: /usr
Audited 1 package in 228ms
Using Python 3.12.11 environment at: /usr
Audited 1 package in 193ms


## Uploading PDF Report:

In [None]:
from google.colab import files
uploaded = files.upload()

# You can then access the uploaded file(s) by their filename
for fn in uploaded.keys():
  print(f'User uploaded file "{fn}" with length {len(uploaded[fn])} bytes')

Saving Pepsico Sustainability 2024.pdf to Pepsico Sustainability 2024 (3).pdf
User uploaded file "Pepsico Sustainability 2024 (3).pdf" with length 664973 bytes


## Extracting Text from PDF to Markdown:

In [None]:
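import pymupdf4llm
import pathlib

# Convert the PDF to Markdown text. This reads the copy saved by an earlier
# upload; the upload above was saved as "Pepsico Sustainability 2024 (3).pdf".
md_text = pymupdf4llm.to_markdown("Pepsico Sustainability 2024.pdf")
print(md_text)

# Save the Markdown for inspection; write_bytes returns the number of bytes written
pathlib.Path("output.md").write_bytes(md_text.encode())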

### **2024**

# **Metrics**


#### 2024 ESG Summary . ESG Topics A-Z


###### **2**

##### in our reporting as data becomes available. The data presented within this PDF do not reflect our acquisitions of Sabra Dipping Company, LLC, and PepsiCo-Strauss Fresh Dips & Spreads International GmbH, which became wholly owned subsidiaries in December 2024. Unless otherwise noted, goals and progress reflect the impact of our prior acquisitions as of the end of the 2024 calendar year. We track and report sustainability data according to industry- accepted methodologies, where available. Our methodologies continue to evolve and may incorporate certain assumptions or estimates. Our sustainability reporting is based on the best available data as of the reporting date, which may reflect other uncertainties and limitations, such as where data tracking and collection is outside our direct control (for example, where we rely on third parties to provide data). Our Environmental, Social and Governance (E

17976
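The introduction notes that reports may be image-based, in which case the conversion step can return little or no text. The sketch below is one possible sanity check to run before chunking. The helper name `has_usable_text` and the `min_chars` threshold of 200 are assumptions made for illustration, not part of the original pipeline, and the threshold would need tuning on real reports.

```python
def has_usable_text(md_text: str, min_chars: int = 200) -> bool:
    """Return True when the Markdown contains at least min_chars letters or digits.

    Hypothetical heuristic: a very low count suggests an image-based PDF
    that would need OCR before metrics can be extracted.
    """
    meaningful = sum(1 for ch in md_text if ch.isalnum())
    return meaningful >= min_chars


# A text-based sample versus a sample that only contains an image reference
text_sample = "# Metrics\n\n" + "Scope 1 and 2 emissions fell 18% in 2024. " * 10
image_sample = "\n\n![](page-1.png)\n\n"

print(has_usable_text(text_sample))
print(has_usable_text(image_sample))
```

In the notebook, the same check could be applied to `md_text` right after the conversion.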

In [None]:
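from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split only on level-1 and level-2 headers; deeper headers (###, ####, ...)
# stay inside the chunk text
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2")
]
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

# Each chunk is a Document whose metadata records the headers it falls under
docs = markdown_splitter.split_text(md_text)
print(docs)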


[Document(metadata={}, page_content='### **2024**'), Document(metadata={'Header 1': '**Metrics**'}, page_content='#### 2024 ESG Summary . ESG Topics A-Z  \n###### **2**  \n##### in our reporting as data becomes available. The data presented within this PDF do not reflect our acquisitions of Sabra Dipping Company, LLC, and PepsiCo-Strauss Fresh Dips & Spreads International GmbH, which became wholly owned subsidiaries in December 2024. Unless otherwise noted, goals and progress reflect the impact of our prior acquisitions as of the end of the 2024 calendar year. We track and report sustainability data according to industry- accepted methodologies, where available. Our methodologies continue to evolve and may incorporate certain assumptions or estimates. Our sustainability reporting is based on the best available data as of the reporting date, which may reflect other uncertainties and limitations, such as where data tracking and collection is outside our direct control (for example, where
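The output above shows `### **2024**` kept as plain chunk text while `# **Metrics**` becomes metadata. The sketch below is a simplified, standard-library approximation of header-based splitting that reproduces this behavior on a small sample. It is not LangChain's implementation, and `split_on_headers` is a hypothetical name used only for illustration.

```python
def split_on_headers(markdown: str, headers_to_split_on: list) -> list:
    """Split Markdown into chunks, recording the active headers as metadata."""
    # Test longer markers first so "##" is checked before "#"
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    chunks = []
    metadata = {}  # label -> header text currently in effect
    depths = {}    # label -> header depth
    buffer = []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"metadata": dict(metadata), "page_content": text})
        buffer.clear()

    for line in markdown.splitlines():
        stripped = line.strip()
        for marker, label in markers:
            # Requiring "marker + space" means "###" does not match "#" or "##"
            if stripped.startswith(marker + " "):
                flush()
                depth = len(marker)
                # A new header closes any header at the same or a deeper level
                for other in [k for k, d in depths.items() if d >= depth]:
                    del metadata[other]
                    del depths[other]
                metadata[label] = stripped[len(marker):].strip()
                depths[label] = depth
                break
        else:
            buffer.append(line)
    flush()
    return chunks


sample = "### **2024**\n\n# **Metrics**\n\n#### 2024 ESG Summary\n\nSome body text."
for chunk in split_on_headers(sample, [("#", "Header 1"), ("##", "Header 2")]):
    print(chunk)
```

Because only `#` and `##` are listed, the deeper headings that `pymupdf4llm` emits stay inside `page_content`. Adding entries such as `("###", "Header 3")` would produce smaller chunks.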

## Testing Call to Gemini:

In [None]:
model = genai.GenerativeModel("gemini-2.5-flash")
response = model.generate_content("Hello Gemini!")
print(response.text)


Hello! How can I help you today?


## Connecting To LangChain:

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field  # replaces the deprecated langchain_core.pydantic_v1 shim
from typing import Optional, List

# Define the desired structure to force structured output
class Metric(BaseModel):
  """A single extracted performance metric."""
  metric_name: str = Field(..., description="The name of the metric, e.g., 'Scope 1 and 2 emissions reduction'.")
  value: str = Field(..., description="The value of the metric, including units, e.g., '50%'.")
  year: Optional[int] = Field(None, description="The year the metric corresponds to, if mentioned.")

class ExtractedMetrics(BaseModel):
  """The complete set of metrics extracted from a text chunk."""
  title: str = Field(..., description="A suitable title for the extracted data, e.g., 'Climate Targets'.")
  metrics: List[Metric] = Field(..., description="A list of all the metrics found in the text.")

# Initialize the model
llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0,
    google_api_key=os.environ["GEMINI_API_KEY"]
)

# Bind the schema to the model (default function-calling approach)
structured_llm = llm.with_structured_output(ExtractedMetrics)

prompt = ChatPromptTemplate([
    ("system", "You're an expert sustainability analyst! From the following text, extract all relevant metrics and format them according to the provided schema."),
    ("human", "{text_chunk}")
])

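# Compose the prompt and the structured model into a single chain
chain = prompt | structured_llm

# Run the chain once per Markdown chunk
batch_inputs = [{"text_chunk": doc.page_content} for doc in docs]
results = chain.batch(batch_inputs)
print(results)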

[ExtractedMetrics(title='2024', metrics=[]), ExtractedMetrics(title='2024 ESG Summary', metrics=[]), ExtractedMetrics(title='Sustainability Targets', metrics=[Metric(metric_name='Deforestation-free sourcing of high-risk commodities', value='more than 90%', year=2025)]), ExtractedMetrics(title='Sustainability Targets and Performance', metrics=[Metric(metric_name='Net-zero emissions target', value='net-zero', year=2050), Metric(metric_name='Scope 1 and 2 emissions reduction target', value='50%', year=2030), Metric(metric_name='Scope 1 and 2 emissions reduction', value='18%', year=2024), Metric(metric_name='Scope 1 and 2 emissions reduction', value='13%', year=2023), Metric(metric_name='Scope 3 Energy & Industry (E&I) emissions reduction target', value='42%', year=2030), Metric(metric_name='Scope 3 Energy & Industry (E&I) emissions reduction', value='12%', year=2024), Metric(metric_name='Scope 3 Energy & Industry (E&I) emissions reduction', value='8%', year=2023), Metric(metric_name='Scop
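The first two results above have empty `metrics` lists, so some chunks sent to the model contain nothing to extract. One possible way to skip such chunks is a cheap pre-filter, sketched below. The helper name `worth_sending` and the `min_words` threshold of 8 are assumptions made for illustration. A filter like this could also drop short chunks that do contain a metric, so it should be checked against real output before use.

```python
import re


def worth_sending(chunk_text: str, min_words: int = 8) -> bool:
    """Hypothetical pre-filter: keep chunks with some prose and at least one digit."""
    # Replace Markdown markup with spaces so '### **2024**' counts as one word
    plain = re.sub(r"[#*_`>|]", " ", chunk_text)
    words = plain.split()
    has_digit = any(ch.isdigit() for ch in plain)
    return has_digit and len(words) >= min_words


chunks = [
    "### **2024**",
    "#### 2024 ESG Summary . ESG Topics A-Z",
    "Scope 1 and 2 emissions fell 18% in 2024 against a 50% target for 2030.",
]
kept = [c for c in chunks if worth_sending(c)]
print(len(kept), "of", len(chunks), "chunks kept")
```

In the notebook, the same condition could be added to the list comprehension that builds `batch_inputs`.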

## Normalization:

In [None]:
import pandas as pd

# Flatten the nested results into a list of dicts, one row per metric
flattened_data = []

for result in results:
  if not result.metrics:
    continue
  for metric in result.metrics:
    flattened_data.append({
        'title': result.title,
        'metric_name': metric.metric_name,
        'value': metric.value,
        'year': metric.year
    })

df = pd.DataFrame(flattened_data)
print(df)

                                                title  \
0                              Sustainability Targets   
1              Sustainability Targets and Performance   
2              Sustainability Targets and Performance   
3              Sustainability Targets and Performance   
4              Sustainability Targets and Performance   
5              Sustainability Targets and Performance   
6              Sustainability Targets and Performance   
7              Sustainability Targets and Performance   
8              Sustainability Targets and Performance   
9              Sustainability Targets and Performance   
10             Sustainability Targets and Performance   
11             Sustainability Targets and Performance   
12             Sustainability Targets and Performance   
13  Adopting the Alliance for Water Stewardship (A...   
14  Adopting the Alliance for Water Stewardship (A...   
15  Adopting the Alliance for Water Stewardship (A...   
16  Adopting the Alliance for W
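The `value` column holds free-form strings such as `'50%'`, `'more than 90%'`, or `'net-zero'`, which are hard to compare or plot directly. The sketch below shows one possible way to split such strings into a number and a unit. The helper name `parse_value` and its regular expression are assumptions made for illustration. It only picks up the first number in the string and does not handle ranges or negative values.

```python
import re
from typing import Optional, Tuple

# First number in the string (thousands separators allowed), then an optional unit
_NUMBER = re.compile(r"(\d[\d,]*(?:\.\d+)?)\s*(%|[A-Za-z]+)?")


def parse_value(value: str) -> Tuple[Optional[float], Optional[str]]:
    """Return (number, unit) for the first number found, or (None, None)."""
    match = _NUMBER.search(value)
    if match is None:
        return None, None
    number = float(match.group(1).replace(",", ""))
    return number, match.group(2)


for raw in ["50%", "more than 90%", "net-zero"]:
    print(raw, "->", parse_value(raw))
```

Applied to the DataFrame, `df["value"].map(parse_value)` would yield one `(number, unit)` tuple per row. Qualitative values such as `'net-zero'` map to `(None, None)` and can be kept as text.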

In [None]:
# Write the flattened metrics to CSV without the DataFrame index column
df.to_csv('Pepsi_Extracted_Metrics.csv', index=False)