# notebook.ipynb
This is a sample notebook and web application which shows how Amazon Bedrock and Titan can be used with Neo4j. We will explore how to leverage generative AI to build and consume a knowledge graph in Neo4j.

The dataset we're using is from the SEC's EDGAR system.  It was downloaded using [these scripts](https://github.com/neo4j-partners/neo4j-sec-edgar-form13).

The dataflow in this demo consists of two parts:
1. Ingestion - we read the EDGAR files with Bedrock, extracting entities and relationships from them.  Bedrock then generates Neo4j Cypher that is run against a Neo4j database deployed from AWS Marketplace.
2. Consumption - A user enters natural language into a web UI.  Bedrock converts that to Neo4j Cypher, which is run against the database.  This flow allows non-technical users to query the database.


## Setup
Select the SageMaker Data Science 3.0 image.

Now, let's install the Bedrock libraries. They are currently in preview, so you have to download the SDK archive and unzip it to a folder.

In [None]:
#%pip uninstall -y botocore
#%pip uninstall -y boto3

In [None]:
!wget -O /root/bedrock-python-sdk.zip https://preview.documentation.bedrock.aws.dev/Documentation/SDK/bedrock-python-sdk.zip
!unzip /root/bedrock-python-sdk.zip -d /root/bedrock-python-sdk

Make sure the `boto3` and `botocore` wheel versions below match the ones in your downloaded SDK, then execute the code below.

In [None]:
%pip install /root/bedrock-python-sdk/boto3-1.26.162-py3-none-any.whl
%pip install /root/bedrock-python-sdk/botocore-1.29.162-py3-none-any.whl

In [None]:
%pip install --user "langchain>=0.0.237"
%pip install --user nltk
%pip install --user graphdatascience
%pip install --user pydantic
%pip install --user IProgress
%pip install --user tqdm

Now restart the kernel.  That will allow the Python environment to import the new packages.

In [None]:
#Provide your `SERVICE_NAME` (e.g. `bedrock`) & `REGION_NAME` (e.g. `us-east-1`)
SERVICE_NAME = 'bedrock'
REGION_NAME = 'us-west-2'

In [None]:
import boto3
import json
bedrock = boto3.client(
 service_name=SERVICE_NAME,
 region_name=REGION_NAME,
 endpoint_url=f'https://{SERVICE_NAME}.{REGION_NAME}.amazonaws.com'
)

## Prompt Definition
In the upcoming sections, we will extract knowledge adhering to the following schema. This is a deliberately simplified schema that extracts only the information we are interested in. Normally, domain experts would design an ideal schema. Because Neo4j is a schema-flexible database, the schema can be modified later to accommodate new data.



Schema:
````
(:Company{ticker:string,name:string,sector:string})-[:FILED_ESG_REPORT]->(:EsgReport{year:number,url:string})
(:Company{ticker:string,name:string,sector:string})-[:PARTNERED_WITH{date:string,for:string}]->(:Company{ticker:string,name:string,sector:string})
(:Company{ticker:string,name:string,sector:string})-[:DONATED_TO{amount:string,for:string}]->(:Charity{name:string})
(:Company{ticker:string,name:string,sector:string})-[:HAS_ENV_METRIC]->(:Metric{name:string,value:string})
(:Company{ticker:string,name:string,sector:string})-[:HAS_SOCIAL_METRIC]->(:Metric{name:string,value:string})
(:Company{ticker:string,name:string,sector:string})-[:HAS_GOVERNANCE_METRIC]->(:Metric{name:string,value:string})
(:Company{ticker:string,name:string,sector:string})-[:HAS_GOAL]->(:Goal{target:string})
(:Goal{name:string})-[:IS_ABOUT]->(:Criteria{name:string})
````






To achieve our extraction goal as per the schema, I am going to chain a series of prompts, each focused on only one task: extracting a specific entity. This allows for more granular extraction. The prompts used here can be improved, and in a production scenario you should consider running QA on the prompt pipelines to ensure that the extracted information is correct. You should also consider separate landing and serving zones, and ensure that only curated data reaches the serving zone.

Let's go in this order to gather the data in accordance with our data model:
1. Extract environmental, social and governance achievements
2. Extract quantifiable criteria, value and target year for each goal
3. Extract environmental metric and value from achievements
4. Extract social metric and value from achievements
5. Extract governance metric and value from achievements
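
The chaining idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's implementation: `run_chain`, the step names, and `fake_model` are hypothetical, and the stand-in model exists only so the sketch runs without a Bedrock connection.

```python
from string import Template

def run_chain(model, steps, text):
    """Run single-purpose prompts in order, feeding each output to the next step.

    `model` is any callable that maps a prompt string to a response string.
    `steps` is a list of (name, template) pairs; each template uses $ctext.
    """
    outputs = {}
    current = text
    for name, template in steps:
        prompt = Template(template).substitute(ctext=current)
        current = model(prompt)
        outputs[name] = current
    return outputs

# Hypothetical stand-in for an LLM call, so the sketch runs offline.
def fake_model(prompt):
    return prompt.upper()

steps = [
    ("achievements", "List the achievements in this text:\n$ctext\nAchievements:\n"),
    ("env_metrics", "Convert these achievements to metric tuples:\n$ctext\nTuples:\n"),
]
results = run_chain(fake_model, steps, "cut emissions by 10%")
```

Keeping each step's output under its own key makes it easier to inspect, and QA, the intermediate results of the pipeline.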

## Data Preparation and Helper functions
Our input source is a `txt` file, originally served from a URL and stored locally under `./data`. So, let's write some code to read its contents.

In [None]:
import json
inp_text = ''
with open('./data/raw_2023-05-15_archives_edgar_data_1000097_0001000097-23-000006.txt') as f:
    inp_text = f.read()

Now, let's write a helper function to call Bedrock's Titan model.

In [None]:
def call_titan_model(prompt):
    """Send a prompt to the Titan text model and return the generated text.

    On any error the exception is printed and None is returned.
    """
    try:
        # temperature 0 keeps the output as deterministic as possible
        body = json.dumps({"inputText": prompt, "textGenerationConfig":
                           {"maxTokenCount":3000,"stopSequences":[],"temperature":0,"topP":0.9}})
        modelId = 'amazon.titan-tg1-large' # change this to use a different version from the model provider
        accept = 'application/json'
        contentType = 'application/json'

        response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
        response_body = json.loads(response.get('body').read())

        return response_body.get('results')[0].get('outputText')
    except Exception as e:
        print(e)
        return None
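
Because the helper above prints exceptions rather than raising them, a failed call yields `None` instead of a string. A small guard like the following could normalize responses before they are used. This is a sketch only; `clean_response` is a hypothetical name and is not part of the original notebook.

```python
def clean_response(response):
    """Return the stripped response text, or None when there is nothing usable.

    Treats None (a failed call), empty strings, and 'n/a' as "no result".
    """
    if response is None:
        return None
    text = response.strip()
    if not text or text.lower() == "n/a":
        return None
    return text
```

A caller could then write `cleaned = clean_response(call_titan_model(prompt))` and skip the chunk when `cleaned` is `None`.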

1. company strategy:
board, company, corporate, governance, management, executive, director, shareholder, global, engagement, vote, term, responsibility, business, team

2. green energy:
energy, emission, million, renewable, use, project, reduce, carbon, water, billion, power, green, total, gas, source

3. customer focus:
customer, provide, business, improve, financial, support, investment, service, year, sustainability, nancial, global, include, help, initiative

4. support community:
community, people, business, support, new, small, income, real, woman, launch, estate, access, customer, uk, include 

5. ethical investments:
investment, climate, company, change, portfolio, risk, responsible, sector, transition, equity, investor, sustainable, business, opportunity, market

6. sustainable finance:
sustainable, impact, sustainability, asset, management, environmental, social, investing, company, billion, waste, client, datum, investment, provide

7. code of conduct:
include, policy, information, risk, review, management, investment, company, portfolio, process, environmental, governance, scope, conduct, datum

8. strong governance:
risk, business, management, environmental, customer, manage, human, social, climate, approach, conduct, page, client, impact, strategic

9. value employees:
employee, work, people, support, value, client, company, help, include, provide, community, program, diverse, customer, service

Now, let's define a helper that pulls JSON out of a model response, to support the extraction prompts in the following sections.

In [None]:
import json
import re

def extract_json_from_string(input_string):
    pattern = r'\{.*?\}|\[.*?\]'
    match = re.search(pattern, input_string.replace('\n', ' ').replace('```', ''))
    if match:
        return json.loads(match.group())
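
One caveat with the regex above: the non-greedy `\{.*?\}` stops at the first closing brace, so a nested object such as `{"a": {"b": 1}}` is cut short and `json.loads` fails. A possible alternative, sketched here under the assumption that the first valid JSON value in the text is the one you want, is to let the standard library's `json.JSONDecoder.raw_decode` find the end of the value. The name `extract_first_json` is hypothetical.

```python
import json

def extract_first_json(text):
    """Return the first JSON object or array embedded in `text`, or None."""
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at `start`
            # and ignores whatever text follows it.
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            continue  # not valid JSON from this position; keep scanning
    return None
```

Unlike the regex version, this returns `None` rather than raising when no valid JSON is present.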

Some of the documents are too long for most LLMs to take as input. So, let's split the large text in a context-aware manner using NLTK.

In [None]:
from langchain.text_splitter import NLTKTextSplitter
import nltk
nltk.download('punkt', quiet=True) # downloads the tokenizer model used for context-aware text splitting

text = inp_text

text_splitter = NLTKTextSplitter()
docs = text_splitter.split_text(text)

In [None]:
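# Alternative: split into fixed chunks of 250 lines (this overrides the NLTK-based `docs` above)
chunk_size = 250
spl = text.split('\n')
docs = []
for i in range(0, len(spl), chunk_size):
    docs.append('\n'.join(spl[i:i+chunk_size]))
len(docs)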

## Extract environmental, social and governance achievements

In [None]:
from string import Template
import json

things_achieved_tpl="""In this chunk of ESG report from a Company, list out the metrics or things that the company achieved:
$ctext
Achievements:
"""
achievements = []
for chunk in docs:
    prompt = Template(things_achieved_tpl).substitute(ctext=chunk)
    response = call_titan_model(prompt)
    if response and response.strip().lower() != 'n/a':  # skip failed calls (None) and empty answers
        print(response)
        achievements.append(response)

## Extract environmental metric and value from achievements

In [None]:
convert_env_achievements_to_metrics = """These are the list of achievements extracted from a Company's ESG report. 
Now, extract only the achievements related to Environment and convert them to a tuple of the environmental metric and its value
$ctext
Tuples:
"""
env_metrics = []
prompt = Template(convert_env_achievements_to_metrics).substitute(ctext='\n'.join(achievements))
response = call_titan_model(prompt)
if response and response.strip().lower() != 'n/a':  # skip failed calls (None) and empty answers
    print(response)
    env_metrics.append(response)

## Extract social metric and value from achievements

In [None]:
convert_social_achievements_to_metrics = """These are the list of achievements extracted from a Company's ESG report. 
Now, extract only the achievements related to Corporate Social Responsibility and convert them to a tuple of the metric name and its quantitative value. If there is more than one value for a metric, create a new tuple with that value
$ctext
Tuples:
"""
soc_metrics = []
prompt = Template(convert_social_achievements_to_metrics).substitute(ctext='\n'.join(achievements))
response = call_titan_model(prompt)
if response and response.strip().lower() != 'n/a':  # skip failed calls (None) and empty answers
    print(response)
    soc_metrics.append(response)

## Extract governance metric and value from achievements

In [None]:
convert_governance_achievements_to_metrics = """These are the list of achievements extracted from a Company's ESG report. 
Now, extract only the achievements related to Corporate Governance and convert them to a tuple of the metric name and its quantitative value. If there is more than one value for a metric, create a new tuple with that value
$ctext
Tuples:
"""
gov_metrics = []
prompt = Template(convert_governance_achievements_to_metrics).substitute(ctext='\n'.join(achievements))
response = call_titan_model(prompt)
if response and response.strip().lower() != 'n/a':  # skip failed calls (None) and empty answers
    print(response)
    gov_metrics.append(response)
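
The three metric responses above come back as free text, so they need parsing before they can be loaded into a graph. The sketch below assumes the model returns one tuple per line in the form `(name, value)`. That format is an assumption, not something the model guarantees, and `parse_metric_tuples` is a hypothetical helper.

```python
import re

# Matches "(name, value)" at the end of a line; splits on the first comma.
TUPLE_RE = re.compile(r"\(\s*(.+?)\s*,\s*(.+?)\s*\)\s*$")

def parse_metric_tuples(response):
    """Parse lines shaped like '(name, value)' into (name, value) string pairs."""
    pairs = []
    for line in response.splitlines():
        match = TUPLE_RE.search(line.strip())
        if match:
            name, value = match.group(1), match.group(2)
            # Drop surrounding quotes the model may have added.
            pairs.append((name.strip("'\" "), value.strip("'\" ")))
    return pairs

sample = "(Renewable energy share, 45%)\nnot a tuple\n('Water use', '1.2 million litres')"
parsed = parse_metric_tuples(sample)
```

Lines that do not match are silently skipped, so it is worth logging how many lines were dropped when running this against real model output.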

## Extract quantifiable criteria, value and target year for each goal

In [None]:
planned_goals_tpl="""In this chunk of ESG report from a Company, list out the goals that the company is planning in the near future:
$ctext
Goals:
"""
goals = []
for chunk in docs:
    prompt = Template(planned_goals_tpl).substitute(ctext=chunk)
    response = call_titan_model(prompt)
    if response and response.strip().lower() != 'n/a':  # skip failed calls (None) and empty answers
        print(response)
        goals.append(response)

## Data Ingestion Cypher Generation

The entities and relationships we got from the LLM have to be transformed to Cypher so we can write them into Neo4j.
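
As a rough illustration of that transformation, the sketch below builds a parameterized Cypher statement for the `Company` to `Metric` relationships in the schema. It is an assumption-laden example rather than the notebook's own ingestion code: `metric_cypher`, `ALLOWED_RELS`, and the ticker `ACME` are hypothetical, and the metrics are assumed to be already parsed into `(name, value)` pairs.

```python
# Relationship types cannot be passed as Cypher parameters,
# so they are checked against an allow-list before interpolation.
ALLOWED_RELS = {"HAS_ENV_METRIC", "HAS_SOCIAL_METRIC", "HAS_GOVERNANCE_METRIC"}

def metric_cypher(ticker, rel_type, metrics):
    """Build a parameterized Cypher statement that merges metrics for one company.

    Returns (query, params). `metrics` is a list of (name, value) pairs.
    """
    if rel_type not in ALLOWED_RELS:
        raise ValueError(f"unexpected relationship type: {rel_type}")
    query = (
        "MERGE (c:Company {ticker: $ticker}) "
        "WITH c UNWIND $metrics AS m "
        "MERGE (x:Metric {name: m.name, value: m.value}) "
        f"MERGE (c)-[:{rel_type}]->(x)"
    )
    params = {
        "ticker": ticker,
        "metrics": [{"name": n, "value": v} for n, v in metrics],
    }
    return query, params

query, params = metric_cypher("ACME", "HAS_ENV_METRIC", [("Renewable energy share", "45%")])
```

Passing values as parameters instead of formatting them into the query string avoids quoting problems in LLM-generated text. With a `graphdatascience` connection in place, the pair could be executed with `gds.run_cypher(query, params)`.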

## Data Ingestion
You will need a Neo4j instance.  You can deploy that on AWS Marketplace [here](https://aws.amazon.com/marketplace/pp/prodview-akmzjikgawgn4).

With that complete, you'll need to install the Neo4j library and set up your database connection.

In [None]:
from graphdatascience import GraphDataScience

# username is neo4j by default
username = 'neo4j'

# You will need to change these variables
connectionUrl = 'foourl'
password = 'foopasswd'

In [None]:
gds = GraphDataScience(
    connectionUrl,
    auth=(username, password),
    )

gds.set_database("neo4j")

Before loading the data, create constraints as below