# notebook.ipynb
This is a sample notebook and web application which shows how Amazon Bedrock and Titan can be used with Neo4j. We will explore how to leverage generative AI to build and consume a knowledge graph in Neo4j.

The dataset we're using is from the SEC's EDGAR system.  It was downloaded using [these scripts](https://github.com/neo4j-partners/neo4j-sec-edgar-form13).

The dataflow in this demo consists of two parts:
1. Ingestion - we read the EDGAR files with Bedrock, extracting entities and relationships from them.  Bedrock then generates Neo4j Cypher that is run against a Neo4j database deployed from AWS Marketplace.
2. Consumption - A user inputs natural language into a web UI.  Bedrock converts that to Neo4j Cypher, which is run against the database.  This flow allows non-technical users to query the database.


## Setup
Select the SageMaker Data Science 3.0 image.

Now, let's install the Bedrock libraries. They are currently in preview, so you have to download them and unzip them to a folder.

In [None]:
#%pip uninstall -y botocore
#%pip uninstall -y boto3

In [None]:
!wget -O /root/bedrock-python-sdk.zip https://preview.documentation.bedrock.aws.dev/Documentation/SDK/bedrock-python-sdk.zip
!unzip /root/bedrock-python-sdk.zip -d /root/bedrock-python-sdk

Make sure the `boto3` and `botocore` versions below match the wheel files in your downloaded SDK, then execute the code below.

In [None]:
%pip install /root/bedrock-python-sdk/boto3-1.26.162-py3-none-any.whl
%pip install /root/bedrock-python-sdk/botocore-1.29.162-py3-none-any.whl

In [None]:
%pip install --user "langchain>=0.0.237"
%pip install --user nltk
%pip install --user graphdatascience
%pip install --user pydantic
%pip install --user IProgress
%pip install --user tqdm

Now restart the kernel.  That will allow the Python environment to import the new packages.

In [2]:
#Provide your `SERVICE_NAME` (e.g. `bedrock`) & `REGION_NAME` (e.g. `us-east-1`)
SERVICE_NAME = 'bedrock'
REGION_NAME = 'us-west-2'

In [3]:
import boto3
import json
bedrock = boto3.client(
 service_name=SERVICE_NAME,
 region_name=REGION_NAME,
 endpoint_url=f'https://{SERVICE_NAME}.{REGION_NAME}.amazonaws.com'
)

## Prompt Definition
In the upcoming sections, we will extract knowledge adhering to the following schema. This is a deliberately simplified schema that captures only the information we are interested in. Normally, domain experts would design the ideal schema. Because Neo4j is a schema-flexible database, the schema can be modified later to accommodate new data.



Schema:
````
(:Manager{name:string})-[:OWNS{reportCalendarOrQuarter:string,value:number,sshPrnamt:number,sshPrnamtType:string,investmentDiscretion:string,votingSole:number,votingShared:number,votingNone:number}]->(:Company{nameOfIssuer:string,cusip:string})
(:Manager{name:string})-[:HAS_ADDRESS]->(:Address{street1:string,street2:string,city:string,stateOrCounty:string,zipCode:string})

````






To meet our extraction goal under this schema, I am going to chain a series of prompts, each focused on a single task: extracting one specific entity. This allows more granular extraction. The prompts used here can be improved, and in a production scenario you should consider running QA on the prompt pipelines to ensure that the extracted information is correct. You should also consider separate landing and serving zones, and ensure that only curated data reaches the serving zone.

Let's go in this order to gather the data in accordance with our data model:
1. Extract the manager's name and address
2. Extract the filing information for each holding
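
As a starting point for the QA step mentioned above, one option is to check each extracted record against the property names listed in the schema. The sketch below is only an illustration: `ADDRESS_KEYS` is copied from the schema, while `missing_keys` is a hypothetical helper name, not part of the original notebook.

```python
# Property names copied from the (:Address {...}) node in the schema above.
ADDRESS_KEYS = {"street1", "street2", "city", "stateOrCounty", "zipCode"}

def missing_keys(record, required):
    """Return the required keys that are absent or empty in `record`.

    Hypothetical QA helper: a non-empty result means the extraction
    should be reviewed before the record moves to the serving zone.
    """
    return sorted(k for k in required if record.get(k) in (None, ""))

# Example: a partially extracted address
partial = {"street1": "152 WEST 57TH STREET", "city": "NEW YORK"}
gaps = missing_keys(partial, ADDRESS_KEYS)
```

A record with gaps could be held in the landing zone rather than loaded into Neo4j.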

## Data Preparation and Helper functions
Our input source is a `txt` file from EDGAR, stored locally under `./data`. So, let's write some code to read it into a string.

In [4]:
import json
inp_text = ''
with open('./data/raw_2023-05-15_archives_edgar_data_1000097_0001000097-23-000006.txt') as f:
    inp_text = f.read()

Now, let's write a helper function to call Bedrock's Titan model.

In [5]:
def call_titan_model(prompt):
    try:
        body = json.dumps({"inputText": prompt, "textGenerationConfig":
                           {"maxTokenCount":3000,"stopSequences":[],"temperature":0,"topP":0.9}})
        modelId = 'amazon.titan-tg1-large' # change this to use a different version from the model provider
        accept = 'application/json'
        contentType = 'application/json'

        response = bedrock.invoke_model(body=body, modelId=modelId, accept=accept, contentType=contentType)
        response_body = json.loads(response.get('body').read())

        return response_body.get('results')[0].get('outputText')
    except Exception as e:
        print(e)
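
Because `call_titan_model` needs a live Bedrock endpoint, it can be handy to separate the request and response handling into pure functions that can be exercised offline. The following is a minimal sketch under that assumption. `build_titan_body` and `parse_titan_body` are hypothetical names, and the field names simply mirror the ones used in the cell above.

```python
import json

def build_titan_body(prompt, max_tokens=3000, temperature=0, top_p=0.9):
    """Serialize a Titan text-generation request (hypothetical helper)."""
    return json.dumps({
        "inputText": prompt,
        "textGenerationConfig": {
            "maxTokenCount": max_tokens,
            "stopSequences": [],
            "temperature": temperature,
            "topP": top_p,
        },
    })

def parse_titan_body(raw):
    """Return the first outputText from a raw JSON response, or None if absent."""
    results = json.loads(raw).get("results") or []
    return results[0].get("outputText") if results else None
```

Returning `None` for an empty result keeps the behaviour close to the original helper, which also yields `None` when the call fails.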

Now, let's define a helper that pulls the JSON out of a model response, which our extraction process relies on.

In [6]:
import json
import re

def extract_json_from_string(input_string):
    pattern = r'\{.*?\}|\[.*?\]'
    match = re.search(pattern, input_string.replace('\n', ' ').replace('```', ''))
    if match:
        return json.loads(match.group())
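
The non-greedy pattern above stops at the first closing brace or bracket, so it can cut nested JSON short. One possible alternative, sketched here as an assumption rather than as the notebook's method, is to let `json.JSONDecoder.raw_decode` find the end of the value. `extract_first_json` is a hypothetical name.

```python
import json

def extract_first_json(text):
    """Return the first parseable JSON object or array embedded in `text`, else None."""
    decoder = json.JSONDecoder()
    for i, ch in enumerate(text):
        if ch in "{[":
            try:
                # raw_decode parses one JSON value starting at index i
                # and ignores whatever follows it.
                value, _end = decoder.raw_decode(text, i)
                return value
            except json.JSONDecodeError:
                # Not valid JSON from this position; keep scanning.
                continue
    return None
```

This handles nested objects and surrounding noise, at the cost of scanning the text one candidate position at a time.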

The first part of the input text contains manager information and the latter part contains filing information.
Let's break them up to avoid LLM token limitations.

In [8]:
parts = inp_text.split('</edgarSubmission>') #splits the text into manager details and filing details
manager_info = parts[0]
filing_info = parts[1]
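
Indexing `parts[1]` raises an `IndexError` if a file lacks the closing tag. A slightly more defensive variant, shown as a sketch with the hypothetical name `split_submission`, uses `str.partition` and fails with a clearer message. Unlike `split`, it keeps everything after the first marker in the second part.

```python
def split_submission(text, marker="</edgarSubmission>"):
    """Split `text` at the first `marker` into (manager part, filing part)."""
    head, sep, tail = text.partition(marker)
    if not sep:
        # partition returns an empty separator when the marker is missing
        raise ValueError(f"marker {marker!r} not found in input")
    return head, tail
```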

Sometimes `filing_info` is longer than the LLM's input limit. So, let's split it into chunks.

In [18]:
import numpy as np

def split_filing_info(s):
    splitter = '</infoTable>'
    _parts = s.split(splitter)
    # Aim for roughly 15 filings per chunk; always use at least one chunk
    n_chunks = max(1, len(_parts) // 15)
    chunks_of_list = np.array_split(_parts, n_chunks)
    chunks_of_str = map(lambda x: splitter.join(x), chunks_of_list)
    return list(chunks_of_str)

filing_info_chunks = split_filing_info(filing_info)
len(filing_info_chunks)

3
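
Note that `np.array_split` balances the pieces across chunks, so a chunk can hold more than 15 filings, and the split/join round trip drops the `</infoTable>` tag at each chunk boundary. If a hard cap matters, a plain-Python alternative could look like the sketch below. `chunk_filings` and the sample text are hypothetical and not taken from the dataset.

```python
def chunk_filings(text, splitter="</infoTable>", max_per_chunk=15):
    """Split `text` into chunks holding at most `max_per_chunk` complete records."""
    pieces = text.split(splitter)
    tail = pieces.pop()  # whatever follows the last closing tag
    records = [p + splitter for p in pieces]  # restore each closing tag
    chunks = [
        "".join(records[i:i + max_per_chunk])
        for i in range(0, len(records), max_per_chunk)
    ]
    if tail.strip():
        # Keep trailing markup with the last chunk so no text is lost.
        if chunks:
            chunks[-1] += tail
        else:
            chunks.append(tail)
    return chunks

# Made-up sample: 32 tiny records followed by a closing wrapper tag
sample = "".join(f"<infoTable>{i}</infoTable>" for i in range(32)) + "</informationTable>"
sample_chunks = chunk_filings(sample)
```

Joining the chunks back together reproduces the input exactly, which makes the function easy to sanity-check.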

## Extract Manager info

In [38]:
from string import Template
import json

mgr_info_tpl="""From the text below, extract the following as json. Do not miss any of this information.
* "name" - The name from the <name> tag under <filingManager> tag
* "street1" - The manager's street1 address from the <com:street1> tag under <address> tag
* "street2" - The manager's street2 address from the <com:street2> tag under <address> tag
* "city" - The manager's city address from the <com:city> tag under <address> tag
* "stateOrCounty" - The manager's stateOrCounty address from the <com:stateOrCountry> tag under <address> tag
* "zipCode" - The manager's zipCode from the <com:zipCode> tag under <address> tag

Text:
$ctext
"""
achievements = []
prompt = Template(mgr_info_tpl).substitute(ctext=manager_info)
response = call_titan_model(prompt)
response

'\n</XML>\n</TEXT>\n</TYPE>\n</SEQUENCE>\n</DOCUMENT>\n```tabular-data-json\n{"rows": [["Street1", "152 WEST 57TH STREET"], ["Street2", "50TH FLOOR"], ["City", "NEW YORK"], ["StateOrCounty", "NY"], ["ZipCode", "10019"]]}\n```'
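
The response above wraps its JSON in a `tabular-data-json` block, uses capitalized keys such as `Street1`, and contains no `name` entry. A minimal sketch for turning that shape into a record keyed like the schema follows. `rows_to_record` is a hypothetical helper, and it assumes every row is a two-item `[key, value]` pair.

```python
import json
import re

def rows_to_record(response_text):
    """Convert a {"rows": [[key, value], ...]} response into a flat dict."""
    # Greedy match from the first "{" to the last "}" in the response.
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if not match:
        return {}
    rows = json.loads(match.group()).get("rows", [])
    # Lower-case the first letter so "Street1" lines up with "street1".
    return {key[:1].lower() + key[1:]: value for key, value in rows}
```

A missing `name` key in the result is a signal that the prompt or the input slice needs another look.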

## Extract Filing Information

In [39]:
filing_info_tpl = """From the text below, extract the following as json. Do not miss any of this information.
* "nameOfIssuer" - The name from the <nameOfIssuer> tag under <infoTable> tag
* "cusip" - The cusip from the <cusip> tag under <infoTable> tag
* "reportCalendarOrQuarter" - The reportCalendarOrQuarter from the <reportCalendarOrQuarter> tag
* "value" - The value from the <value> tag under <infoTable> tag
* "sshPrnamt" - The sshPrnamt from the <sshPrnamt> tag under <infoTable> tag
* "sshPrnamtType" - The sshPrnamtType from the <sshPrnamtType> tag under <infoTable> tag
* "investmentDiscretion" - The investmentDiscretion from the <investmentDiscretion> tag under <infoTable> tag
* "votingSole" - The votingSole from the <votingSole> tag under <infoTable> tag
* "votingShared" - The votingShared from the <votingShared> tag under <infoTable> tag
* "votingNone" - The votingNone from the <votingNone> tag under <infoTable> tag

Text:
$ctext
"""
filings = []
for chunk in filing_info_chunks:
    prompt = Template(filing_info_tpl).substitute(ctext=chunk)
    response = call_titan_model(prompt)
    print(response)
    

```tabular-data-json
{"rows": [["nameOfIssuer", "ACADEMY SPORTS &amp; OUTDOORS IN"]]}
```
```tabular-data-json
{"rows": [["nameOfIssuer", "FTAI AVIATION LTD"], ["nameOfIssuer", "GREAT ELM GROUP INC"], ["nameOfIssuer", "HAEMONETICS CORP MASS"], ["nameOfIssuer", "INVESCO EXCH TRADED FD TR II"], ["nameOfIssuer", "HASBRO INC"], ["nameOfIssuer", "IVERIC BIO INC"], ["nameOfIssuer", "JASPER THERAPEUTICS INC"], ["nameOfIssuer", "LIONS GATE ENTMNT CORP"], ["nameOfIssuer", "LPL FINL HLDGS INC"], ["nameOfIssuer", "MOLSON COORS BEVERAGE CO"], ["nameOfIssuer", "META PLATFORMS INC"], ["nameOfIssuer", "PALO ALTO NETWORKS INC"], ["nameOfIssuer", "OSCAR HEALTH INC"]]}
```
```tabular-data-json
{"rows": [["titleOfClass", "COM"]]}
```
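
The responses above mostly return issuer names only, and at least one still carries the XML entity `&amp;`. Before such values are written to the graph they could be unescaped. The sketch below assumes the response has already been parsed into a dict, and `issuer_names` is a hypothetical helper.

```python
import html

def issuer_names(parsed):
    """Collect unescaped nameOfIssuer values from a parsed {"rows": [...]} payload."""
    return [
        html.unescape(value)
        for key, value in parsed.get("rows", [])
        if key == "nameOfIssuer"
    ]

names = issuer_names({"rows": [["nameOfIssuer", "ACADEMY SPORTS &amp; OUTDOORS IN"],
                               ["titleOfClass", "COM"]]})
```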


## Data Ingestion Cypher Generation

The entities and relationships we got from the LLM have to be transformed to Cypher so we can write them into Neo4j.
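
As an illustration of that transformation, the sketch below builds a parameterized Cypher statement for the `Manager` and `Address` part of the schema. It is a hand-written template rather than LLM-generated Cypher, and `manager_cypher` is a hypothetical helper name.

```python
def manager_cypher(record):
    """Return (query, params) that MERGE one manager and its address."""
    query = (
        "MERGE (m:Manager {name: $name}) "
        "MERGE (a:Address {street1: $street1, street2: $street2, city: $city, "
        "stateOrCounty: $stateOrCounty, zipCode: $zipCode}) "
        "MERGE (m)-[:HAS_ADDRESS]->(a)"
    )
    keys = ["name", "street1", "street2", "city", "stateOrCounty", "zipCode"]
    # MERGE cannot match on null property values, so missing fields
    # default to an empty string here (an assumption of this sketch).
    params = {k: record.get(k, "") for k in keys}
    return query, params
```

Once a database connection exists, the pair could be passed to something like `gds.run_cypher(query, params)`. Using parameters instead of string concatenation avoids quoting problems in names and addresses.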

## Data Ingestion
You will need a Neo4j instance.  You can deploy that on AWS Marketplace [here](https://aws.amazon.com/marketplace/pp/prodview-akmzjikgawgn4).

With that complete, you'll need to import the Neo4j Graph Data Science client installed earlier and set up your database connection.

In [None]:
from graphdatascience import GraphDataScience

# username is neo4j by default
username = 'neo4j'

# You will need to change these variables
connectionUrl = 'foourl'
password = 'foopasswd'

In [None]:
gds = GraphDataScience(
    connectionUrl,
    auth=(username, password),
    )

gds.set_database("neo4j")

Before loading the data, create constraints as below