# Text Embedding
In this notebook, we generate text embeddings for 10-K filings with the Vertex AI [textembedding-gecko](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings) model.  The unstructured text was extracted from the 10-K filings beforehand using a parser.

In this notebook, we will:
1. Get the unstructured 10-K filing text from a Google Cloud Storage bucket
2. Select Item 1 of each 10-K, which describes the business of the company: who the company is and what it does, what subsidiaries it owns, and what markets it operates in
3. Chunk the text into sections using LangChain's `RecursiveCharacterTextSplitter` (to avoid input token limits)
4. Save the text with embeddings to CSV to stage for loading into the graph

In [None]:
%pip install --user tabulate sentence-transformers
%pip install --user altair

Be sure to restart the kernel after you run the pip command.

## Get 10-K Filings from Google Cloud Storage

In [None]:
from google.cloud import storage

storage_client = storage.Client()
storage_client.bucket('neo4j-datasets').blob('hands-on-lab/form10k.zip').download_to_filename('/home/jupyter/form10k.zip')

In [None]:
!mkdir /home/jupyter/form10k
!unzip -qq -n '/home/jupyter/form10k.zip' -d /home/jupyter/form10k

## 10-K Filings Exploration and Chunking
Let's open one file to understand its contents.  Although it has a `.txt` extension, it is actually a JSON file.

In [10]:
import json
with open('/home/jupyter/form10k/0001830197-22-000038.txt') as f:
    f10_k = json.load(f)

We are interested in Item 1 specifically. 

Item 1 describes the business of the company: who and what the company does, what subsidiaries it owns, and what markets it operates in. It may also include recent events, competition, regulations, and labor issues. (Some industries are heavily regulated, and have complex labor requirements, which have significant effects on the business.) Other topics in this section may include special operating costs, seasonal factors, or insurance matters.

In [11]:
len(f10_k['item1'])

241333

This text can exceed the input token limit for `textembedding-gecko`.  The quality of embeddings can also degrade if the text gets too large. We should therefore chunk the text into separate sections for embedding.

Below is a way to do this with LangChain's `RecursiveCharacterTextSplitter`, which also supports overlap between consecutive chunks.

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = f10_k['item1']

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 15,
    length_function = len,
    is_separator_regex = False,
)
docs = text_splitter.split_text(text)

In [13]:
print(docs[0])

>Item 1. Business 
Company Overview
We are a leading residential mortgage originator and servicer driven by a mission to create financially healthy, happy homeowners. We do this by delivering scale, efficiency and savings to our partners and customers. Our business model is focused on leveraging a nationwide network of partner relationships to drive sustainable origination growth. We support our origination operations through a robust operational infrastructure and a highly responsive customer experience. We then leverage our servicing platform to manage the customer experience. We believe that the complementary relationship between our origination and servicing businesses allows us to provide a best-in-class experience to our customers throughout their homeownership lifecycle.
Our primary focus is our Wholesale channel, which is a business-to-business-to-customer distribution model in which we utilize our relationships with independent mortgage brokerages, which we refer to as our Bro
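
To make `chunk_size` and `chunk_overlap` more concrete, here is a minimal sketch of fixed-size character windows that overlap. It is a simplified illustration only: `split_with_overlap` is a hypothetical helper, and it does not reproduce how `RecursiveCharacterTextSplitter` picks split points, since that splitter prefers to break on separators such as paragraph and line breaks.

```python
from typing import List

def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Hypothetical helper: cut text into windows of at most chunk_size
    characters, where each window starts chunk_overlap characters before
    the previous window ended."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

The last three characters of each piece reappear at the start of the next one, so a sentence cut at a chunk boundary keeps a little of its context in both chunks.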

## Get 10-K Text Embeddings with Vertex AI
Now that we understand our data and how to chunk it, let's generate embeddings.

In [40]:
import time
from typing import List

from vertexai.language_models import TextEmbeddingModel

EMBEDDING_MODEL = TextEmbeddingModel

def rate_limit(max_per_minute):
    period = 60 / max_per_minute
    while True:
        before = time.time()
        yield
        after = time.time()
        elapsed = after - before
        sleep_time = max(0, period - elapsed)
        if sleep_time > 0:
            # print(f'Sleeping {sleep_time:.1f} seconds')
            time.sleep(sleep_time)
                
def embed_documents(texts: List[str]) -> List[List[float]]:
    """Call Vertex LLM embedding endpoint for embedding docs
    Args:
    texts: The list of texts to embed.
    Returns:
    List of embeddings, one for each text.
    """
    model = EMBEDDING_MODEL.from_pretrained("textembedding-gecko@001")

    limiter = rate_limit(600)
    results = []
    docs = list(texts)

    while docs:
        # Working in batches of 2 because the API apparently won't let
        # us send more than 2 documents per request to get embeddings.
        head, docs = docs[:2], docs[2:]
        # print(f'Sending embedding request for: {head!r}')
        chunk = model.get_embeddings(head)
        results.extend(chunk)
        next(limiter)
    return results
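
The throttling generator can be tried on its own, without calling Vertex AI. The sketch below repeats the same generator logic in condensed form and times three paced iterations. The rate of 1200 per minute is an arbitrary value chosen so the demo finishes quickly.

```python
import time

def rate_limit(max_per_minute):
    """Generator that paces its caller: each next() after the first sleeps
    for whatever remains of the period since the previous yield."""
    period = 60 / max_per_minute
    while True:
        before = time.time()
        yield
        elapsed = time.time() - before
        time.sleep(max(0, period - elapsed))

limiter = rate_limit(1200)  # one call every 0.05 seconds
start = time.time()
for _ in range(3):
    # A real embedding request would go here.
    next(limiter)
total = time.time() - start
print(f"3 paced iterations took {total:.2f} s")
```

The first `next()` only starts the generator and does not sleep, so three iterations incur two pauses.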

In [32]:
# We will need a chunking utility to stay within token limits as we loop through files
def chunks(xs, n=3):
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]
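
As a quick check of the utility, the sketch below applies it to a short list of placeholder strings. The last batch is simply shorter when the list length is not a multiple of `n`.

```python
def chunks(xs, n=3):
    # Split xs into consecutive batches of at most n items.
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]

batches = chunks(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
print(batches)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```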

In [46]:
import time

def create_text_embedding_entries(input_text:str, company_name: str, cusip: str):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 2000,
        chunk_overlap  = 15,
        length_function = len,
        is_separator_regex = False,
    )
    docs = text_splitter.split_text(input_text)
    res = []
    seq_id = -1
    for d in chunks(docs):
        embeddings = embed_documents(d)
        # throttle so we don't blow through the quota.
        # time.sleep(1)
        
        for i in range(len(d)):
            seq_id += 1
            res.append({'companyName': company_name, 'cusip': cusip, 'seqId': seq_id, 'contextId': company_name + str(seq_id), 'textEmbedding': embeddings[i].values, 'text': d[i]})
    return res

In [34]:
import os

file_names = os.listdir('/home/jupyter/form10k/')
len(file_names)

95

This cell takes about 15 minutes to run.  That's largely down to us throttling so we don't exceed the quota on our free account.  If you're using an enterprise account, you won't need to throttle like this.

In [47]:
%%time

import time

# We're hitting the quota, so we're going to sleep for a bit to zero it out for sure, then throttle our calls
# time.sleep(60)

count = 0
embedding_entries = []
for file_name in file_names:
    if '.txt' in file_name:
        count += 1
        if count % 5 == 0:
            print(f'Parsed {count} of {len(file_names)}')
        with open('/home/jupyter/form10k/' + file_name) as f:
            f10_k = json.load(f)
        embedding_entries.extend(create_text_embedding_entries(f10_k['item1'], f10_k['companyName'], f10_k['cusip']))
len(embedding_entries)

Parsed 5 of 95
Parsed 10 of 95
Parsed 15 of 95
Parsed 20 of 95
Parsed 25 of 95
Parsed 30 of 95
Parsed 35 of 95
Parsed 40 of 95
Parsed 45 of 95
Parsed 50 of 95
Parsed 55 of 95
Parsed 60 of 95
Parsed 65 of 95
Parsed 70 of 95
Parsed 75 of 95
Parsed 80 of 95
Parsed 85 of 95
Parsed 90 of 95
CPU times: user 32.1 s, sys: 6.19 s, total: 38.3 s
Wall time: 4min 33s


3652

## Save 10-K Documents with Embeddings
We will save these locally to use in graph loading, in the next part.

In [48]:
import pandas as pd
edf = pd.DataFrame(embedding_entries)

In [49]:
edf

Unnamed: 0,companyName,cusip,seqId,contextId,textEmbedding,text
0,DOLLAR TREE STORES I,256677105,0,DOLLAR TREE STORES I0,"[0.0011005396954715252, -0.02110779657959938, ...",>Item 1. Business\n” for further discussion of...
1,DOLLAR TREE STORES I,256677105,1,DOLLAR TREE STORES I1,"[0.006865902338176966, 0.003632724517956376, 0...",Plus\n;\n•\nthe introduction of selected Dolla...
2,DOLLAR TREE STORES I,256677105,2,DOLLAR TREE STORES I2,"[-0.04646346718072891, -0.009276329539716244, ...",The rollout of our initiative to add price poi...
3,DOLLAR TREE STORES I,256677105,3,DOLLAR TREE STORES I3,"[-0.007591314613819122, -0.01885572448372841, ...","In fiscal 2019, we recorded a $313.0 million n..."
4,DOLLAR TREE STORES I,256677105,4,DOLLAR TREE STORES I4,"[-0.008723229169845581, -0.014396817423403263,...",We rely extensively on our computer and techno...
...,...,...,...,...,...,...
3647,ARK RESTAURANTS CORP,00214Q104,11,ARK RESTAURANTS CORP11,"[-0.034189801663160324, -0.022611897438764572,...",We have experienced aggressive competition for...
3648,ARK RESTAURANTS CORP,00214Q104,12,ARK RESTAURANTS CORP12,"[-0.04284031689167023, -0.028181184083223343, ...",9\nAlcoholic beverage control regulations requ...
3649,ARK RESTAURANTS CORP,00214Q104,13,ARK RESTAURANTS CORP13,"[-0.03191894665360451, -0.026221390813589096, ...",We are subject to “dram-shop” statutes in most...
3650,ARK RESTAURANTS CORP,00214Q104,14,ARK RESTAURANTS CORP14,"[-0.02670503407716751, -0.029166726395487785, ...","Our business is highly seasonal; however, our ..."
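
If you want to stage the entries as CSV, one possible approach is sketched below. The rows are made-up stand-ins for `embedding_entries`, and an in-memory buffer is used in place of a file path. Two details matter when reading the file back: the embedding column is stored as text and has to be parsed into a list again, and `cusip` should be read as a string so that leading zeros are not lost.

```python
import io
import json

import pandas as pd

# Toy rows standing in for `embedding_entries`; all values are invented.
sample_entries = [
    {'companyName': 'EXAMPLE CO', 'cusip': '001234567', 'seqId': 0,
     'contextId': 'EXAMPLE CO0', 'textEmbedding': [0.1, -0.2, 0.3],
     'text': 'first chunk'},
    {'companyName': 'EXAMPLE CO', 'cusip': '001234567', 'seqId': 1,
     'contextId': 'EXAMPLE CO1', 'textEmbedding': [0.4, 0.5, -0.6],
     'text': 'second chunk'},
]
sample_df = pd.DataFrame(sample_entries)

buffer = io.StringIO()  # swap for a file path to write to disk
sample_df.to_csv(buffer, index=False)
buffer.seek(0)

restored = pd.read_csv(
    buffer,
    dtype={'cusip': str},                      # keep leading zeros
    converters={'textEmbedding': json.loads},  # "[0.1, ...]" -> list of floats
)
print(restored.loc[0, 'textEmbedding'])
```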


Provide your Neo4j credentials.  We need the DB connection URL, the username (probably `neo4j`), and your password.

In [None]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'

# You will need to change these to match your credentials
NEO4J_URI = 'neo4j+s://6688b25b.databases.neo4j.io'
NEO4J_PASSWORD = '_kogrNk53u8oTk5be55kmit1kHGdhZj98yJlG-VYSR'

In [51]:
from graphdatascience import GraphDataScience

gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=True
)
gds.set_database('neo4j')

Remember to create indexes. Each `Document` node is keyed by a `documentId` built from `companyName` plus a sequence number. In a production setting, we would want to use a better identifier here (like we did with cusip for Company). However, this suffices for our purposes, as we are just getting acquainted with semantic search.

In [52]:
gds.run_cypher('CREATE INDEX company_name IF NOT EXISTS FOR (n:Company) ON (n.companyName)')
gds.run_cypher('CREATE CONSTRAINT unique_document_id IF NOT EXISTS FOR (n:Document) REQUIRE (n.documentId) IS NODE KEY')

Due to the size of the documents, we will transform the DataFrame into a list of dicts that we can chunk up and insert via a parameterized query.

In [53]:
emb_entries = edf.to_dict(orient='records')

In [54]:
total = len(emb_entries)
count = 0
for d in chunks(emb_entries, 100):
    gds.run_cypher('''
    UNWIND $records AS record
    MATCH(c:Company {cusip:record.cusip})
    MERGE(b:Document {documentId:record.contextId})
    SET b.documentType = 'FORM_10K_ITEM1', b.seqId = record.seqId, b.textEmbedding = record.textEmbedding, b.text = record.text
    MERGE(c)-[:HAS]->(b)
    RETURN count(b) as cnt
    ''', params = {'records':d})
    count += len(d)
    print(f'loaded {count} of {total}')

loaded 100 of 3652
loaded 200 of 3652
loaded 300 of 3652
loaded 400 of 3652
loaded 500 of 3652
loaded 600 of 3652
loaded 700 of 3652
loaded 800 of 3652
loaded 900 of 3652
loaded 1000 of 3652
loaded 1100 of 3652
loaded 1200 of 3652
loaded 1300 of 3652
loaded 1400 of 3652
loaded 1500 of 3652
loaded 1600 of 3652
loaded 1700 of 3652
loaded 1800 of 3652
loaded 1900 of 3652
loaded 2000 of 3652
loaded 2100 of 3652
loaded 2200 of 3652
loaded 2300 of 3652
loaded 2400 of 3652
loaded 2500 of 3652
loaded 2600 of 3652
loaded 2700 of 3652
loaded 2800 of 3652
loaded 2900 of 3652
loaded 3000 of 3652
loaded 3100 of 3652
loaded 3200 of 3652
loaded 3300 of 3652
loaded 3400 of 3652
loaded 3500 of 3652
loaded 3600 of 3652
loaded 3652 of 3652


## Check Data

In [55]:
# Check node count
gds.run_cypher('MATCH(doc:Document) RETURN count(doc)')

Unnamed: 0,count(doc)
0,3536


Note that we only obtained 10-K docs for a minority of companies. That should be fine for this exercise, but in a more rigorous setting, you may want to try to pull more.  There are likely a few factors contributing to this.

1. We used company names to search EDGAR, which resulted in many misses and duplicates, which were discarded. In a more rigorous setting, we would investigate other endpoints and use more parsing to extract EDGAR CIK keys to match companies exactly when pulling forms.

2. Company names are not consistent across Form 13 filings, so even if we successfully pull a filing using one version of a company name, we may not be able to merge it into the graph via the one company name represented there.

3. Not all companies in the dataset are obligated to file 10-Ks.

In [56]:
# Check count and percentage of companies with 10-K docs.  Note it is the minority
gds.run_cypher('''
MATCH(b:Company)
WITH b, count{(b)-[:HAS]->(d:Document)} AS docCount
WITH count(b) AS total, sum(toInteger(docCount > 0)) AS numWithDocs
RETURN total, numWithDocs, round(100*toFloat(numWithDocs)/toFloat(total), 2) As PercWithDocs
''')

Unnamed: 0,total,numWithDocs,PercWithDocs
0,242,48,19.83


You might note that there are duplicate names.  For our purposes here, we will treat it as entity resolution, meaning that we treat companies with the same name as belonging to the same overarching entity for semantic search. In a more rigorous setting, we would need to disambiguate with other EDGAR keys.

In [57]:
# Compare the total number of companies with the number of distinct company names to reveal duplicates
gds.run_cypher('''
MATCH(b:Company)
RETURN count(b) AS totalCompanies, count(DISTINCT b.companyName) AS uniqueCompanyNames
''')

Unnamed: 0,totalCompanies,uniqueCompanyNames
0,242,241


## View Embeddings as Clusters

Vector embeddings generated by language models are numerical representations of words or sentences, so similar sentences are located near each other in the embedding space.  The embeddings we generated earlier are high-dimensional.  To visualize them, we need to reduce the dimensionality.  Let's do that and plot the result.
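
To illustrate what "nearby" can mean, the sketch below compares toy three-dimensional vectors with cosine similarity. Cosine similarity is an assumption here (the notebook does not name a distance measure), and the vectors and their names are invented for illustration rather than taken from the model.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: values near 1.0 mean they
    point in nearly the same direction, values near 0.0 mean they are
    close to orthogonal."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors standing in for chunk embeddings.
v_mortgage = [0.9, 0.1, 0.0]
v_lending = [0.8, 0.2, 0.1]
v_restaurant = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(v_mortgage, v_lending)
sim_unrelated = cosine_similarity(v_mortgage, v_restaurant)
print(f"related: {sim_related:.3f}, unrelated: {sim_unrelated:.3f}")
```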

In [58]:
import altair as alt

def generate_chart(df, xcol, ycol, lbl = 'on', color = 'basic', title = '', tooltips = ['documentId'], label = ''):
  chart = alt.Chart(df).mark_circle(size=30).encode(
    x = alt.X(xcol,
        scale=alt.Scale(zero = False),
        axis=alt.Axis(labels = False, ticks = False, domain = False)
    ),
    y = alt.Y(ycol,
        scale=alt.Scale(zero = False),
        axis=alt.Axis(labels = False, ticks = False, domain = False)
    ),
    color= alt.value('#333293') if color == 'basic' else color,
    tooltip=tooltips
    )

  if lbl == 'on':
    text = chart.mark_text(align = 'left', baseline = 'middle', dx = 7, size = 5, color = 'black').encode(text = label, color = alt.value('black'))
  else:
    text = chart.mark_text(align = 'left', baseline = 'middle', dx = 10).encode()

  result = (chart + text).configure(background="#FDF7F0"
        ).properties(
        width = 800,
        height = 500,
        title = title
       ).configure_legend(
  orient = 'bottom', titleFontSize = 18, labelFontSize = 18)
        
  return result

In [59]:
# Reduce dimensionality using PCA
from sklearn.decomposition import PCA

# Function to return the principal components
def get_pc(arr, n):
  pca = PCA(n_components = n)
  embeds_transform = pca.fit_transform(arr)
  return embeds_transform
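
As a sanity check of the reduction step, the sketch below runs PCA on random data in place of real embeddings. The array shape (50 rows, 16 dimensions) is arbitrary. `explained_variance_ratio_` shows how much of the variance the two retained components capture, which is worth keeping in mind when reading a 2D plot of high-dimensional data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for an embedding matrix: 50 "documents", 16 dimensions.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(50, 16))

pca = PCA(n_components=2)
reduced = pca.fit_transform(fake_embeddings)

print(reduced.shape)  # (50, 2)
print(pca.explained_variance_ratio_.sum())  # fraction of variance kept
```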

In [60]:
emb_df = gds.run_cypher("MATCH (c:Company)-[:HAS]->(n:Document) RETURN c.companyName as companyName, n.documentId as documentId, n.text as text, n.textEmbedding as emb LIMIT 1000")
emb_df

Unnamed: 0,companyName,documentId,text,emb
0,Amazon.com Inc,AMAZON C2,We serve authors and independent publishers wi...,"[0.006612538825720549, -0.011562603525817394, ..."
1,Amazon.com Inc,AMAZON C1,Consumers\nWe serve consumers through our onli...,"[-0.0006532742991112173, -0.01848229579627514,..."
2,Amazon.com Inc,AMAZON C3,Our businesses encompass a large variety of pr...,"[-0.0012651049764826894, -0.03241520747542381,..."
3,Amazon.com Inc,AMAZON C8,"70\nCo-CEO, President, and Chair of IronNet Cy...","[0.007523578125983477, -0.05875156447291374, -..."
4,Amazon.com Inc,AMAZON C0,>Item 1.\nBusiness\nThis Annual Report on Form...,"[0.023851962760090828, -0.020312661305069923, ..."
...,...,...,...,...
995,GLOBAL X CLOUD COMPUTING ETF,Global Partner Acquisition Corp Ii21,We are not prohibited from pursuing an initial...,"[0.0037235927302390337, -0.027355222031474113,..."
996,GLOBAL X CLOUD COMPUTING ETF,Global Partner Acquisition Corp Ii72,➤\n\n\nIf we seek shareholder approval of our ...,"[0.014228932559490204, -0.0016051246784627438,..."
997,GLOBAL X CLOUD COMPUTING ETF,Global Partner Acquisition Corp Ii48,"However, we would not be restricting our share...","[0.01272459514439106, -0.021930834278464317, -..."
998,GLOBAL X CLOUD COMPUTING ETF,Global Partner Acquisition Corp Ii50,The foregoing is different from the procedures...,"[0.023618966341018677, -0.018364861607551575, ..."


## K-Means Clustering on the Embeddings
Let's run the K-Means Clustering algorithm and view similar document chunks. 

In [None]:
import numpy as np
from sklearn.cluster import KMeans

embeds = np.array(emb_df['emb'].tolist())
embeds_pc2 = get_pc(embeds, 2)

df_clust = pd.concat([emb_df, pd.DataFrame(embeds_pc2)], axis = 1)
n_clusters = 5

kmeans_model = KMeans(n_clusters = n_clusters, n_init = 1, random_state = 0)
classes = kmeans_model.fit_predict(embeds).tolist()
df_clust['cluster'] = (list(map(str,classes)))

df_clust.columns = df_clust.columns.astype(str)
generate_chart(df_clust, '0', '1', lbl = 'off', color = 'cluster', title = f'K-Means Clustering with {n_clusters} Clusters', tooltips = ['documentId', 'text'], label = '')
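
Besides the chart, it can help to look at cluster sizes and a few chunks per cluster in plain text. The sketch below does this on a toy DataFrame that mimics the `documentId`, `text`, and `cluster` columns of `df_clust`. The ids, texts, and labels are invented for illustration.

```python
import pandas as pd

# Toy stand-in for `df_clust`; every value here is made up.
toy_clust = pd.DataFrame({
    'documentId': ['A0', 'A1', 'B0', 'B1', 'C0'],
    'text': [
        'We originate residential mortgages.',
        'Our servicing platform manages loans.',
        'We operate restaurants and bars.',
        'Our restaurants are highly seasonal.',
        'We rely on independent mortgage brokers.',
    ],
    'cluster': ['0', '0', '1', '1', '0'],
})

# How many chunks landed in each cluster.
cluster_sizes = toy_clust['cluster'].value_counts().sort_index()
print(cluster_sizes)

# Print the first two chunks of each cluster, truncated to 60 characters.
for cluster_id, group in toy_clust.groupby('cluster'):
    print(f'cluster {cluster_id}:')
    for text in group['text'].head(2):
        print('   ', text[:60])
```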
