# Load Document Nodes

## Initial Exploration

In [1]:
import json
with open('../data/form10k-clean/0000004977-22-000058.json') as f:
    f10_k = json.load(f)

In [2]:
print(f10_k['item1'])

>Item 1. Business
Information about the Company's Executive Officers
NAME
PRINCIPAL OCCUPATION
(1)
AGE
Daniel P. Amos
Chairman, Aflac Incorporated and Aflac, since 2001; Chief Executive Officer, Aflac Incorporated and Aflac, since 1990; President, Aflac, from 2017 until 2018; President, Aflac Incorporated, from 2018 until 2020 
70 
Steven K. Beaver
Senior Vice President, Chief Financial Officer, Aflac U.S., since 2019; Senior Vice President, Financial Planning and Analysis, Aflac Incorporated, from 2018 until 2019; Senior Vice President, Global Strategic Projects, Corporate Financial Planning and Analysis, Aflac Incorporated, from 2017 until 2018
57 
Max K. Brodén
Executive Vice President, Chief Financial Officer, Aflac Incorporated, since 2020; Executive Vice President, Aflac since 2020; Treasurer, Aflac, since 2017; Treasurer, Aflac Incorporated from 2017 until 2021; Senior Vice President, Aflac Incorporated and Aflac, from 2017 until 2020; Senior Portfolio Manager, Norges Bank, from

In [3]:
len(f10_k['item1'])

80391

In [4]:
f10_k.keys()

dict_keys(['item1', 'item1a', 'item7', 'item7a', 'cik', 'source'])

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = f10_k['item1']

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)
docs = text_splitter.split_text(text)

In [6]:
print(len(docs))

54


In [7]:
print(docs[4])

The Company has evaluated its holdings and identified those investments most exposed to the negative impacts of an economic downturn as a result of COVID-19, including but not limited to investments in businesses facing an immediate and severe impact such as travel and lodging, leisure, non-emergency medical, energy, and others involving large gatherings of people. These investments are experiencing and may continue to experience higher credit losses, credit rating downgrades and/or defaults and the Company has examined in each case whether a reduction in size of the holding is appropriate. In addition, volatility in oil prices could have a continued adverse impact on issuers in the energy sector. While the Company has identified assets impacted or expected to be impacted by COVID-19 and its consequences, other investments not identified to date may also be impacted. The availability of new investments in certain private market asset classes, such as middle market loans, commercial mor

In [8]:
print(docs[5])

the ultimate impact of COVID-19 on the Company’s investments and hedging programs. See the risk factor below entitled, “The Company is exposed to significant interest rate risk, which may adversely affect its results of operations, financial condition and liquidity” for more information. See the “Investments” and “Results of Operations by Segment” sections of Item 7, MD&A, for more information.
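
The chunks above are not all the same length because `RecursiveCharacterTextSplitter` breaks on separators (paragraph breaks, line breaks, spaces) before it falls back to raw character counts. The role of `chunk_size` and `chunk_overlap` is easier to see in a simplified fixed-window version. The sketch below is not LangChain's algorithm; `split_with_overlap` is a hypothetical helper that only illustrates how consecutive chunks share `chunk_overlap` characters.

```python
def split_with_overlap(text: str, chunk_size: int = 2000, chunk_overlap: int = 200) -> list:
    """Fixed-window splitter (simplified illustration, not LangChain's algorithm)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    # Each window starts `step` characters after the previous one
    step = chunk_size - chunk_overlap
    # Stop once the remaining text is already covered by the previous window
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


sample = "".join(str(i % 10) for i in range(5000))
parts = split_with_overlap(sample)
print([len(p) for p in parts])            # [2000, 2000, 1400]
print(parts[0][-200:] == parts[1][:200])  # True
```

The overlap means text cut at a chunk boundary still appears intact in the neighbouring chunk, which tends to help retrieval when a relevant passage straddles two chunks.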


## Test Loading

In [9]:
from dotenv import load_dotenv
import os

load_dotenv('.env', override=True)
NEO4J_URI = os.getenv('NEO4J_URI')
NEO4J_USERNAME = os.getenv('NEO4J_USERNAME')
NEO4J_PASSWORD = os.getenv('NEO4J_PASSWORD')
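# Parse the flag explicitly instead of calling eval() on an environment variable
AURA_DS = os.getenv('AURA_DS', 'false').strip().lower() == 'true'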

In [10]:
from graphdatascience import GraphDataScience

# Use Neo4j URI and credentials according to our setup
gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=AURA_DS)

# Necessary if you enabled Arrow on the db - this is true for AuraDS
gds.set_database("neo4j")

In [11]:
from langchain.embeddings import OpenAIEmbeddings

embedding_model = OpenAIEmbeddings()

In [12]:
import pandas as pd
from graphdatascience import GraphDataScience
from typing import Tuple, Union
from numpy.typing import ArrayLike


def make_map(x):
    # A single string is used for both sides of the mapping; a tuple is used as given
    if isinstance(x, str):
        return x, x
    elif isinstance(x, tuple):
        return x
    else:
        raise TypeError("Entry must be of type string or tuple")


def make_set_clause(prop_names: ArrayLike, element_name='n', item_name='rec'):
    clause_list = []
    for prop_name in prop_names:
        clause_list.append(f'{element_name}.{prop_name} = {item_name}.{prop_name}')
    return 'SET ' + ', '.join(clause_list)


def make_node_merge_query(node_key_name: str, node_label: str, cols: ArrayLike):
    template = f'''UNWIND $recs AS rec\nMERGE(n:{node_label} {{{node_key_name}: rec.{node_key_name}}})'''
    prop_names = [x for x in cols if x != node_key_name]
    if len(prop_names) > 0:
        template = template + '\n' + make_set_clause(prop_names)
    return template + '\nRETURN count(n) AS nodeLoadedCount'


def make_rel_merge_query(source_target_labels: Union[Tuple[str, str], str],
                         source_node_key: Union[Tuple[str, str], str],
                         target_node_key: Union[Tuple[str, str], str],
                         rel_type: str,
                         cols: ArrayLike,
                         rel_key: str = None):
    source_target_label_map = make_map(source_target_labels)
    source_node_key_map = make_map(source_node_key)
    target_node_key_map = make_map(target_node_key)

    merge_statement = f'MERGE(s)-[r:{rel_type}]->(t)'
    if rel_key is not None:
        merge_statement = f'MERGE(s)-[r:{rel_type} {{{rel_key}: rec.{rel_key}}}]->(t)'

    template = f'''\tUNWIND $recs AS rec
    MATCH(s:{source_target_label_map[0]} {{{source_node_key_map[0]}: rec.{source_node_key_map[1]}}})
    MATCH(t:{source_target_label_map[1]} {{{target_node_key_map[0]}: rec.{target_node_key_map[1]}}})\n\t''' + merge_statement
    prop_names = [x for x in cols if x not in [rel_key, source_node_key_map[1], target_node_key_map[1]]]
    if len(prop_names) > 0:
        template = template + '\n\t' + make_set_clause(prop_names, 'r')
    return template + '\n\tRETURN count(r) AS relLoadedCount'


def chunks(xs, n=100):
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]


def load_nodes(gds: GraphDataScience, node_df: pd.DataFrame, node_key_col: str, node_label: str, chunk_size=1000):
    records = node_df.to_dict('records')
    print(f'======  loading {node_label} nodes  ======')
    total = len(records)
    print(f'staging {total:,} records')
    query = make_node_merge_query(node_key_col, node_label, node_df.columns.copy())
    print(f'\nUsing This Cypher Query:\n```\n{query}\n```\n')
    cumulative_count = 0
    for recs in chunks(records, chunk_size):
        res = gds.run_cypher(query, params={'recs': recs})
        cumulative_count += res.iloc[0, 0]
        print(f'Loaded {cumulative_count:,} of {total:,} nodes')


def load_rels(gds: GraphDataScience,
              rel_df: pd.DataFrame,
              source_target_labels: Union[Tuple[str, str], str],
              source_node_key: Union[Tuple[str, str], str],
              target_node_key: Union[Tuple[str, str], str],
              rel_type: str,
              rel_key: str = None,
              chunk_size=10_000):
    records = rel_df.to_dict('records')
    print(f'======  loading {rel_type} relationships  ======')
    total = len(records)
    print(f'staging {total:,} records')
    query = make_rel_merge_query(source_target_labels, source_node_key,
                                 target_node_key, rel_type, rel_df.columns.copy(), rel_key)
    print(f'\nUsing This Cypher Query:\n```\n{query}\n```\n')
    cumulative_count = 0
    for recs in chunks(records, chunk_size):
        res = gds.run_cypher(query, params={'recs': recs})
        cumulative_count += res.iloc[0, 0]
        print(f'Loaded {cumulative_count:,} of {total:,} relationships')
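
To see what the query builders produce without a database connection, the sketch below copies `make_set_clause`, `make_node_merge_query`, and `chunks` from the cell above and runs them on a reduced column list. The three-column list is an illustrative assumption; the real call passes every DataFrame column.

```python
from typing import Iterable, List


def make_set_clause(prop_names: Iterable[str], element_name: str = 'n', item_name: str = 'rec') -> str:
    # One "n.prop = rec.prop" assignment per non-key column
    assignments = [f'{element_name}.{p} = {item_name}.{p}' for p in prop_names]
    return 'SET ' + ', '.join(assignments)


def make_node_merge_query(node_key_name: str, node_label: str, cols: Iterable[str]) -> str:
    # MERGE on the key property only, then SET the remaining properties
    template = f'UNWIND $recs AS rec\nMERGE(n:{node_label} {{{node_key_name}: rec.{node_key_name}}})'
    prop_names = [c for c in cols if c != node_key_name]
    if prop_names:
        template += '\n' + make_set_clause(prop_names)
    return template + '\nRETURN count(n) AS nodeLoadedCount'


def chunks(xs: List, n: int = 100) -> List[List]:
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]


query = make_node_merge_query('documentId', 'Document', ['documentId', 'cik', 'text'])
print(query)
# UNWIND $recs AS rec
# MERGE(n:Document {documentId: rec.documentId})
# SET n.cik = rec.cik, n.text = rec.text
# RETURN count(n) AS nodeLoadedCount

# 1,711 records at chunk_size=1000 are sent as two batches
batch_sizes = [len(b) for b in chunks(list(range(1711)), 1000)]
print(batch_sizes)  # [1000, 711]
```

Because the key property sits inside the `MERGE` pattern and everything else is applied with `SET`, re-running a load updates existing nodes rather than duplicating them.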

In [14]:
from pandas import DataFrame
from typing import List


def get_and_split_txt_data(file_names: List[str]) -> DataFrame:
    doc_data_list = []
    for file_name in file_names:
        # form ID is the file name without its directory or extension
        form_id = os.path.splitext(os.path.basename(file_name))[0]
        with open(file_name) as f:
            f10_k = json.load(f)
        for item in ['item1', 'item1a', 'item7', 'item7a']:
            # split each 10-K item into chunks and number them in order
            split_txts = text_splitter.split_text(f10_k[item])
            for chunk_seq_id, split_txt in enumerate(split_txts):
                doc_data_list.append({'documentId': f'{form_id}-{item}-chunk{chunk_seq_id:04d}',
                                      'cik': f10_k['cik'],
                                      'source': f10_k['source'],
                                      'f10kItem': item,
                                      'chunkSeqId': chunk_seq_id,
                                      'text': split_txt})
    return pd.DataFrame(doc_data_list)

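def add_text_embeddings(df):
    # Embed in batches of 100; pass plain lists, since embed_documents expects a list of strings
    count = 0
    embeddings = []
    for docs in chunks(df['text'].tolist(), n=100):
        embeddings.extend(embedding_model.embed_documents(docs))
        # report progress only after the batch has actually been embedded
        count += len(docs)
        print(f'Embedded {count} of {df.shape[0]}')
    df['textEmbeddings'] = embeddings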
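
Each chunk's `documentId` combines the filing's file name, the 10-K item, and a zero-padded sequence number. A minimal sketch of that format follows; `make_document_id` is a hypothetical helper, not part of the notebook, and `pathlib` is used here only to extract the file stem.

```python
from pathlib import Path


def make_document_id(file_name: str, item: str, chunk_seq_id: int) -> str:
    # Path.stem drops the directory and the final extension
    form_id = Path(file_name).stem
    # :04d zero-pads the sequence number to four digits
    return f'{form_id}-{item}-chunk{chunk_seq_id:04d}'


doc_id = make_document_id('../data/form10k-clean/0000004977-22-000058.json', 'item1', 4)
print(doc_id)  # 0000004977-22-000058-item1-chunk0004
```

Zero-padding keeps the IDs in chunk order when sorted as strings, as long as an item has fewer than 10,000 chunks.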



In [15]:
%%time

gds.run_cypher('CREATE CONSTRAINT unique_document IF NOT EXISTS FOR (n:Document) REQUIRE n.documentId IS UNIQUE')

# Only pick up JSON filings, in a deterministic order
data_dir = '../data/form10k-clean/'
all_file_names = sorted(data_dir + x for x in os.listdir(data_dir) if x.endswith('.json'))
counter = 0
for file_names in chunks(all_file_names, 20):
    counter += len(file_names)
    print(f'=== Processing {counter-len(file_names)}:{counter} of {len(all_file_names)} ===')
    # get and split text data
    print('Loading and splitting Text Files...')
    doc_df = get_and_split_txt_data(file_names)
    # perform text embedding
    print('Performing Text Embedding...')
    add_text_embeddings(doc_df)
    #load nodes
    print('Loading Nodes...')
    load_nodes(gds, doc_df, 'documentId', 'Document')
    print(f'Done Processing {counter-len(file_names)}:{counter}')


=== Processing 0:11 of 11 ===
Loading and splitting Text Files...
Performing Text Embedding...
Embedded 100 of 1711
Embedded 200 of 1711
Embedded 300 of 1711
Embedded 400 of 1711
Embedded 500 of 1711
Embedded 600 of 1711
Embedded 700 of 1711
Embedded 800 of 1711
Embedded 900 of 1711
Embedded 1000 of 1711
Embedded 1100 of 1711
Embedded 1200 of 1711
Embedded 1300 of 1711
Embedded 1400 of 1711
Embedded 1500 of 1711
Embedded 1600 of 1711
Embedded 1700 of 1711
Embedded 1711 of 1711
Loading Nodes...
staging 1,711 records

Using This Cypher Query:
```
UNWIND $recs AS rec
MERGE(n:Document {documentId: rec.documentId})
SET n.cik = rec.cik, n.source = rec.source, n.f10kItem = rec.f10kItem, n.chunkSeqId = rec.chunkSeqId, n.text = rec.text, n.textEmbeddings = rec.textEmbeddings
RETURN count(n) AS nodeLoadedCount
```

Loaded 1,000 of 1,711 nodes
Loaded 1,711 of 1,711 nodes
Done Processing 0:11
CPU times: user 4.01 s, sys: 200 ms, total: 4.21 s
Wall time: 26.5 s
