# Knowledge App From ESG Reports using AWS Bedrock and Neo4j
In this notebook, let's explore how to use AWS Bedrock to build and query a knowledge graph in Neo4j from the Environmental, Social and Governance (ESG) reports of 439 companies across the Consumer Goods, Financial Services, Healthcare, All Services, Technology and Energy sectors.

This notebook parses data from [responsibilityreports.com](https://www.responsibilityreports.com) using AWS's Titan models. The models are prompted to recognise and extract entities and relationships. We then generate Neo4j Cypher queries from them and write the data to a Neo4j database.
We then use a Titan model again, prompting it to convert questions in English to Cypher, Neo4j's query language, which can be used for data retrieval.

## Setup
First, check that this notebook is running in the Python environment you installed by following the readme. Make sure you select a kernel that runs Python 3.8 or above. For example, you can select a SageMaker image (e.g. Base Python 3.0) that runs Python 3.10.

In [3]:
import sys
sys.version

'3.10.8 (main, Oct 13 2022, 22:36:54) [GCC 10.2.1 20210110]'

First, let's install the Bedrock libraries. They are currently in preview, so you have to download them with the commands below and unzip them to a folder.

In [None]:
%pip uninstall -y botocore
%pip uninstall -y boto3

In [None]:
!wget -O /root/bedrock-python-sdk.zip https://preview.documentation.bedrock.aws.dev/Documentation/SDK/bedrock-python-sdk.zip
!unzip /root/bedrock-python-sdk.zip -d /root/bedrock-python-sdk

Make sure the `boto3` and `botocore` wheel versions in the commands below match the ones in your downloaded SDK, then execute the code below.

In [None]:
%pip install /root/bedrock-python-sdk/boto3-1.26.162-py3-none-any.whl
%pip install /root/bedrock-python-sdk/botocore-1.29.162-py3-none-any.whl

In [None]:
# %pip install boto3
%pip install --user "langchain>=0.0.237"
%pip install --user nltk
%pip install --user graphdatascience
%pip install --user pydantic
%pip install --user IProgress
%pip install --user tqdm

Now restart the kernel. That will allow the Python environment to import the new packages.

Provide your `SERVICE_NAME` (e.g. `bedrock`) & `REGION_NAME` (e.g. `us-east-1`) in the code below

In [205]:
# Note, you will need to set your Bedrock service name and AWS region
SERVICE_NAME = 'bedrock'
REGION_NAME = 'us-west-2'

In [206]:
import boto3
import json
bedrock = boto3.client(
 service_name=SERVICE_NAME,
 region_name=REGION_NAME,
 endpoint_url=f'https://{SERVICE_NAME}.{REGION_NAME}.amazonaws.com'
)

## Prompt Definition

In the upcoming sections, we will extract knowledge that adheres to the following schema. It is a deliberately simplified schema that captures only the information we are interested in. Normally, domain experts would design an ideal schema. Because Neo4j is a schema-flexible database, the schema can be modified later to accommodate new data.



Schema:
````
(:Company{ticker:string,name:string,sector:string})-[:FILED_ESG_REPORT]->(:EsgReport{year:number,url:string})
(:Company{ticker:string,name:string,sector:string})-[:PARTNERED_WITH{date:string,for:string}]->(:Company{ticker:string,name:string,sector:string})
(:Company{ticker:string,name:string,sector:string})-[:DONATED_TO{amount:string,for:string}]->(:Company{ticker:string,name:string,sector:string})
(:Company{ticker:string,name:string,sector:string})-[:HAS_ENV_METRIC]->(:Metric{name:string,value:string})
(:Company{ticker:string,name:string,sector:string})-[:HAS_SOCIAL_METRIC]->(:Metric{name:string,value:string})
(:Company{ticker:string,name:string,sector:string})-[:HAS_GOVERNANCE_METRIC]->(:Metric{name:string,value:string})
(:Company{ticker:string,name:string,sector:string})-[:HAS_GOAL]->(:Goal{target:string})
(:Goal{name:string})-[:IS_ABOUT]->(:Criteria{name:string})
````






To extract data according to the schema, I am going to chain a series of prompts, each focused on a single task: extracting one specific kind of entity. This allows more granular extraction. The prompts used here can be improved, and in a production scenario you should consider running QA on the prompt pipelines to ensure that the extracted information is correct. You should also consider separate landing and serving zones, so that only curated data reaches the serving zone.

Our input source is a `txt` file served from a URL. Here, let's read a local copy of it, `./data/DKS.txt`, into a string.

In [300]:
import json
inp_text = ''
with open('./data/DKS.txt') as f:
    inp_text = f.read()

Now, let's write a helper function to call Bedrock's Titan model.

In [207]:
def call_titan_model(prompt):
    try:
        body = json.dumps({"inputText": prompt, "textGenerationConfig":
                           {"maxTokenCount":3000,"stopSequences":[],"temperature":0,"topP":0.9}})
        modelId = 'amazon.titan-tg1-large' # change this to use a different version from the model provider
        accept = 'application/json'
        contentType = 'application/json'

        response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
        response_body = json.loads(response.get('body').read())

        return response_body.get('results')[0].get('outputText')
    except Exception as e:
        print(e)

1. company strategy:
board, company, corporate, governance, management, executive, director, shareholder, global, engagement, vote, term, responsibility, business, team

2. green energy:
energy, emission, million, renewable, use, project, reduce, carbon, water, billion, power, green, total, gas, source

3. customer focus:
customer, provide, business, improve, financial, support, investment, service, year, sustainability, nancial, global, include, help, initiative

4. support community:
community, people, business, support, new, small, income, real, woman, launch, estate, access, customer, uk, include 

5. ethical investments:
investment, climate, company, change, portfolio, risk, responsible, sector, transition, equity, investor, sustainable, business, opportunity, market

6. sustainable finance:
sustainable, impact, sustainability, asset, management, environmental, social, investing, company, billion, waste, client, datum, investment, provide

7. code of conduct:
include, policy, information, risk, review, management, investment, company, portfolio, process, environmental, governance, scope, conduct, datum

8. strong governance:
risk, business, management, environmental, customer, manage, human, social, climate, approach, conduct, page, client, impact, strategic

9. value employees:
employee, work, people, support, value, client, company, help, include, provide, community, program, diverse, customer, service

Now, let's define the helper and prompts that drive our extraction process.

In [209]:
import json
import re

def extract_json_from_string(input_string):
    pattern = r'\{.*?\}|\[.*?\]'
    match = re.search(pattern, input_string.replace('\n', ' ').replace('```', ''))
    if match:
        return json.loads(match.group())
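
The non-greedy pattern above stops at the first closing brace or bracket, so a nested JSON value can get cut short. As a hedged alternative, here is a minimal sketch that scans for the first position where the standard library's `json.JSONDecoder.raw_decode` can parse a complete value. The name `extract_first_json` is hypothetical and not part of the original notebook.

```python
import json

def extract_first_json(text):
    """Return the first JSON object or array found in `text`, or None."""
    decoder = json.JSONDecoder()
    fence = '`' * 3  # markdown code fence that models sometimes wrap JSON in
    cleaned = text.replace(fence, '')
    for start, ch in enumerate(cleaned):
        if ch not in '{[':
            continue
        try:
            # raw_decode parses one complete JSON value beginning at `start`
            value, _end = decoder.raw_decode(cleaned, start)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from here; try the next bracket
    return None
```

Unlike the regex version, this returns `None` rather than raising when no valid JSON is present, so callers need to check for that case.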

Some of the reports are going to be too long for most LLMs to take as input. So, let's split the large text in a context-aware manner using NLTK.

In [257]:
from langchain.text_splitter import NLTKTextSplitter
import nltk
nltk.download('punkt', quiet=True)  # downloads the tokenizer model used for context-aware text splitting

text = inp_text

text_splitter = NLTKTextSplitter()
docs = text_splitter.split_text(text)

Created a chunk of size 5696, which is longer than the specified 4000


In [272]:
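# Fall back to fixed-size chunks of 250 lines each
lines_per_chunk = 250
spl = text.split('\n')
docs = []
for i in range(len(spl) // lines_per_chunk + 1):
    start = i * lines_per_chunk  # step by a whole chunk so chunks do not overlap
    docs.append('\n'.join(spl[start:start + lines_per_chunk]))
len(docs)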

8

In [291]:
from string import Template
import json

things_achieved_tpl="""In this chunk of ESG report from a Company, list out the metrics or things that the company achieved:
$ctext
Achievements:
"""
achievements = []
for chunk in docs:
    prompt = Template(things_achieved_tpl).substitute(ctext=chunk)
    response = call_titan_model(prompt)
    if response and response.strip().lower() != 'n/a':  # skip failed calls (None) and 'n/a' replies
        print(response)
        achievements.append(response)

• Developed and implemented a comprehensive COVID-19 response plan to protect teammates and customers.
• Donated $30 million to The DICK’S Sporting Goods Foundation to support more kids, teams, and leagues and ensure programs were still in place and strong upon return.
• Introduced curbside pickup in two days to show appreciation for hardworking teammates.
• Strengthened zero-tolerance stance against discrimination and delivered over 100,000 hours of anti-racism and discrimination bias training.
• Rolled out new I&D goals for the organization and published new data on the composition of our workforce.
• Continued to support youth sports through The DICK’S Sporting Goods Foundation, including the Foundation’s customized Sports Matter Giving Truck and programs to help female athletes make strides on and off the field.
• Joined the Outdoor Industry Climate Action Corps and set a goal to reduce greenhouse gas emissions (GHG) by 30% by 2030.
• Partnered with TerraCycle to launch a national 

In [294]:
planned_goals_tpl="""In this chunk of ESG report from a Company, list out the goals that the company is planning in the near future:
$ctext
Goals:
"""
goals = []
for chunk in docs:
    prompt = Template(planned_goals_tpl).substitute(ctext=chunk)
    response = call_titan_model(prompt)
    if response and response.strip().lower() != 'n/a':  # skip failed calls (None) and 'n/a' replies
        print(response)
        goals.append(response)

• Climate action 
• Environmental responsibility 
• Water stewardship 
• Biodiversity and habitat conservation 
• Responsible sourcing 
• Product quality and safety 
• Firearms safety 
• Responsible sourcing 
• Chemical safety and management 
• Data protection and privacy
• Climate action 
• Environmental responsibility 
• Water stewardship 
• Biodiversity
• Responsible sourcing
• Reduce GHG emissions by 30% by 2030 (versus 2016 baseline)
• Join the AAFA/FLA Industry commitment to Responsible Recruitment.
• Attain 100% participation of vertical brands in the Higg facility environmental module by 2025.
• Reduce GHG emissions by 30% by 2030 (versus 2016 baseline)
• Increase renewable energy use by 25% by 2025
• Reduce water consumption by 10% by 2025
• Diversify suppliers to reduce single-use packaging by 50% by 2025
• Ensure 100% of owned and operated facilities are powered by renewable energy by 2030
• Develop and implement a third-party verification process for responsible sourcing
• 
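
The goals printed above repeat across chunks and include an empty bullet. A small post-processing step can tidy them before they are turned into graph nodes. The following is only a sketch: the helper name `unique_bullets` is hypothetical, and it assumes each response is a newline-separated bullet list like the output shown.

```python
def unique_bullets(responses):
    """Split model responses into bullet items and drop blanks and repeats."""
    seen = set()
    items = []
    for response in responses:
        for line in response.splitlines():
            # Remove the bullet marker and surrounding whitespace
            item = line.strip().lstrip('•-*').strip()
            # Compare case-insensitively and ignore a trailing full stop
            key = item.lower().rstrip('.')
            if item and key not in seen:
                seen.add(key)
                items.append(item)
    return items
```

Calling `unique_bullets(goals)` would then give one entry per distinct goal, in the order they first appeared.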

In [301]:
convert_achievements_to_metrics = """These are the achievements extracted from a Company's ESG report. What are the metrics & their respective values as tuples that you can extract from them?
$ctext
Metrics tuples:
"""
metrics = []
prompt = Template(convert_achievements_to_metrics).substitute(ctext='\n'.join(achievements))
response = call_titan_model(prompt)
if response and response.strip().lower() != 'n/a':  # skip failed calls (None) and 'n/a' replies
    print(response)
    metrics.append(response)

(‘‘Program Name‘‘, ‘‘Description‘‘), (‘‘Program Name‘‘, ‘‘Goal‘‘), (‘‘Program Name‘‘, ‘‘Outcome‘‘), (‘‘Program Name‘‘, ‘‘Impact Metric‘‘), (‘‘Program Name‘‘, ‘‘Target Date‘‘), (‘‘Program Name‘‘, ‘‘Participating Company‘‘), (‘‘Program Name‘‘, ‘‘Description of Support‘‘), (‘‘Program Name‘‘, ‘‘Amount Donated‘‘), (‘‘Program Name‘‘, ‘‘Recipient Organization‘‘), (‘‘Program Name‘‘, ‘‘Program Focus Area‘‘), (‘‘Program Name‘‘, ‘‘Impact of Support‘‘)
```tabular-data-json
{"rows": [["Program Name", "COVID-19 Response Plan"], ["Program Name", "The DICK'S Sporting Goods Foundation"], ["Program Name", "Zero-Tolerance Stance Against Discrimination"], ["Program Name", "Anti-Racism and Discrimination Bias Training"], ["Program Name", "I&D Goals"], ["Program Name", "Sports Matter Giving Truck"], ["Program Name", "Female Athletes"], ["Program Name", "Climate Action Plan"], ["Program Name", "Outdoor Industry Climate Action Corps"], ["Program Name", "We Are Still In Declaration"], ["Program Name", "Paris C

In [246]:
# tries = 0
# if response is None:
#     continue
# while "tabular-data-json" in response or not(response.endswith(('`', ']'))): #sometimes Titan returns output not in the format we mentioned. This is a temp HACK
#     if tries > 4:
#         print(response)
#         raise Exception("Model not returning in JSON. Try again or change your prompt")
#     tries = tries + 1
#     response = call_titan_model(prompt)

## Data Ingestion Cypher Generation

The entities and relationships we got from the LLM have to be transformed to Cypher so we can write them into Neo4j.
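
As an illustration of that transformation, here is a minimal sketch for one relationship from the schema, `(:Company)-[:HAS_GOAL]->(:Goal)`. The function name `goal_to_cypher` is hypothetical, and the ticker `DKS` is an assumption taken from the input file name. Values are passed as query parameters rather than spliced into the string, which avoids quoting problems with free-text goals.

```python
def goal_to_cypher(ticker, goal_text):
    """Build a parameterised Cypher statement that links a company to a goal."""
    query = (
        "MERGE (c:Company {ticker: $ticker}) "
        "MERGE (g:Goal {target: $target}) "
        "MERGE (c)-[:HAS_GOAL]->(g)"
    )
    return query, {"ticker": ticker, "target": goal_text}

# Example: one goal taken from the extraction output
query, params = goal_to_cypher(
    "DKS", "Reduce GHG emissions by 30% by 2030 (versus 2016 baseline)"
)
```

Once the database connection is set up, a statement like this could be executed with `gds.run_cypher(query, params)`. `MERGE` is used so that re-running the ingestion does not create duplicate nodes.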

## Data Ingestion

You will need a Neo4j AuraDS Pro instance. You can deploy one from Google Cloud Marketplace [here](https://console.cloud.google.com/marketplace/product/endpoints/prod.n4gcp.neo4j.io).

With that complete, you'll need to install the Neo4j library and set up your database connection.

In [3]:
from graphdatascience import GraphDataScience

In [4]:
import getpass

# You will need to change these variables
connectionUrl = input("Neo4j Connection URL")
username = input("DB Username")
password = getpass.getpass("DB password")

Neo4j Connection URL neo4j+s://a47f742d.databases.neo4j.io
DB Username neo4j
DB password ········


In [5]:
gds = GraphDataScience(
    connectionUrl,
    auth=(username, password),
    aura_ds=True)

gds.set_database("neo4j")

Before loading the data, create constraints as below