# Text Embedding
In this notebook, we generate text embeddings for 10-K filings with the Vertex AI [textembedding-gecko](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings) model.  The unstructured text has already been extracted from the 10-K filings with a parser.

In this notebook, we will:
1. Get the unstructured 10-K filing text from a Google Cloud Storage bucket
2. Select Item 1 of the 10-K, which describes the business of the company: what the company does, what subsidiaries it owns, and what markets it operates in
3. Chunk the text into sections using LangChain's `RecursiveCharacterTextSplitter` (to avoid input token limits)
4. Generate an embedding for each chunk and load the text with embeddings into the graph

In [1]:
%pip install --user tabulate sentence-transformers
%pip install --user altair

Be sure to restart the kernel after you run the pip command.

## Get 10-K Filings from Google Cloud Storage

In [2]:
from google.cloud import storage

storage_client = storage.Client()
storage_client.bucket('neo4j-datasets').blob('hands-on-lab/form10k.zip').download_to_filename('/home/jupyter/form10k.zip')

In [3]:
!mkdir -p /home/jupyter/form10k
!unzip -qq -n '/home/jupyter/form10k.zip' -d /home/jupyter/form10k

## 10-K Filings Exploration and Chunking
Let's open one file to understand its contents.  Despite the `.txt` extension, it is actually a JSON file.

In [4]:
import json
with open('/home/jupyter/form10k/0001830197-22-000038.txt') as f:
    f10_k = json.load(f)
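
To get a feel for the structure before working with a full filing, here is a minimal sketch that parses a tiny, made-up record. The keys `companyName`, `cusip`, and `item1` are the ones this notebook reads later; the values are hypothetical placeholders, and a real filing may contain other keys as well.

```python
import json

# Hypothetical miniature filing; real files are far larger and may hold more keys.
raw = '{"companyName": "EXAMPLE CO", "cusip": "000000000", "item1": "Item 1. Business"}'

filing = json.loads(raw)

# List the top-level keys, then look at the field we care about.
keys = sorted(filing.keys())
print(keys)
print(filing["item1"])
```

Running the same two inspection lines against `f10_k` shows which sections the parser extracted.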

We are interested in Item 1 specifically. 

Item 1 describes the business of the company: what the company does, what subsidiaries it owns, and what markets it operates in. It may also include recent events, competition, regulations, and labor issues. (Some industries are heavily regulated and have complex labor requirements, which have significant effects on the business.) Other topics in this section may include special operating costs, seasonal factors, or insurance matters.

In [5]:
len(f10_k['item1'])

241333

At over 240,000 characters, this text is well beyond the input token limit of `textembedding-gecko`.  The quality of embeddings can also go down if the text gets too large. We should therefore split the text into separate chunks for embedding.

Below is one way to do this with LangChain's `RecursiveCharacterTextSplitter`, which supports overlap between consecutive chunks.

In [6]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = f10_k['item1']

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 15,
    length_function = len,
    is_separator_regex = False,
)
docs = text_splitter.split_text(text)

In [7]:
print(docs[0])

>Item 1. Business 
Company Overview
We are a leading residential mortgage originator and servicer driven by a mission to create financially healthy, happy homeowners. We do this by delivering scale, efficiency and savings to our partners and customers. Our business model is focused on leveraging a nationwide network of partner relationships to drive sustainable origination growth. We support our origination operations through a robust operational infrastructure and a highly responsive customer experience. We then leverage our servicing platform to manage the customer experience. We believe that the complementary relationship between our origination and servicing businesses allows us to provide a best-in-class experience to our customers throughout their homeownership lifecycle.
Our primary focus is our Wholesale channel, which is a business-to-business-to-customer distribution model in which we utilize our relationships with independent mortgage brokerages, which we refer to as our Bro
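
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification and not how `RecursiveCharacterTextSplitter` works internally: the LangChain splitter tries to break on separators such as paragraphs and sentences, while this hypothetical `split_with_overlap` helper cuts purely by character position.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return pieces


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears, at least partly, in both neighbors.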

## Get 10-K Text Embeddings with Vertex AI
Now that we understand our data and how to chunk it, let's generate embeddings.

In [21]:
from vertexai.language_models import TextEmbeddingModel
EMBEDDING_MODEL = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

In [22]:
# Batching utility: split a list into groups of n items so that each
# embedding request stays within API limits as we loop through files
def chunks(xs, n=3):
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]
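
As a quick check of what this helper returns, the sketch below repeats the same function on a toy list (the letters are placeholder data). The last batch is simply shorter when the list length is not a multiple of `n`.

```python
def chunks(xs, n=3):
    # Guard against n <= 0, then slice the list into consecutive groups of n.
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]


batches = chunks(["a", "b", "c", "d", "e", "f", "g"])
print(batches)
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]
```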

In [23]:
import time

def create_text_embedding_entries(input_text:str, company_name: str, cusip: str):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 2000,
        chunk_overlap  = 15,
        length_function = len,
        is_separator_regex = False,
    )
    docs = text_splitter.split_text(input_text)
    res = []
    seq_id = -1
    for d in chunks(docs):
        embeddings = EMBEDDING_MODEL.get_embeddings(d)
        
        # throttle so we don't blow through the quota.
        time.sleep(1)
        
        for i in range(len(d)):
            seq_id += 1
            res.append({
                'companyName': company_name,
                'cusip': cusip,
                'seqId': seq_id,
                'contextId': company_name + str(seq_id),
                'textEmbedding': embeddings[i].values,
                'text': d[i],
            })
    return res
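
The fixed `time.sleep(1)` above is the simplest way to throttle. An alternative, sketched here as an assumption rather than something this notebook does, is to retry a failed request with exponential backoff. `call_with_backoff` and `flaky_request` are hypothetical names, and the simulated error stands in for whatever quota error the client library raises.

```python
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after a failure, wait base_delay * 2**attempt seconds and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up after the final attempt and surface the original error.
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical request that fails twice before succeeding.
calls = {"n": 0}

def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("quota exceeded (simulated)")
    return "ok"


# Record the delays instead of actually sleeping, so the demo runs instantly.
delays = []
result = call_with_backoff(flaky_request, sleep=delays.append)
print(result, delays)
# ok [1.0, 2.0]
```

In real use you would catch only the specific quota error rather than every `Exception`, so that genuine bugs are not retried.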

Due to quota limitations, let's process only 5 of the 95 Form 10-K files we have.

In [26]:
import os

file_names = os.listdir('/home/jupyter/form10k/')[0:5]
len(file_names)

5
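
Note that `os.listdir` returns names in arbitrary order, and slicing before filtering can pick up non-`.txt` entries. A minimal sketch of a more reproducible selection follows; `first_txt_files` is a hypothetical helper, and the temporary directory and file names exist only for the demonstration.

```python
import os
import tempfile


def first_txt_files(directory, limit):
    """Return the first `limit` .txt file names in sorted order."""
    names = sorted(name for name in os.listdir(directory) if name.endswith(".txt"))
    return names[:limit]


with tempfile.TemporaryDirectory() as tmp:
    # Create a few empty placeholder files, including one that is not .txt.
    for name in ["b.txt", "a.txt", "notes.md", "c.txt"]:
        open(os.path.join(tmp, name), "w").close()
    selected = first_txt_files(tmp, 2)

print(selected)
# ['a.txt', 'b.txt']
```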

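The run time of this cell depends on how many files you process: a larger batch can take about 15 minutes, while the 5-file run shown below finished in under a minute.  Most of that time comes from throttling so we don't exceed the quota on our free account.  If you're using an enterprise account, you won't need to throttle like this.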

In [42]:
%%time

count = 0
embedding_entries = []
for file_name in file_names:
    if file_name.endswith('.txt'):
        count += 1
        if count % 5 == 0:
            print(f'Parsed {count} of {len(file_names)}')
        with open('/home/jupyter/form10k/' + file_name) as f:
            f10_k = json.load(f)
        embedding_entries.extend(create_text_embedding_entries(f10_k['item1'], f10_k['companyName'], f10_k['cusip']))
len(embedding_entries)

Parsed 5 of 5
CPU times: user 1.06 s, sys: 68.7 ms, total: 1.13 s
Wall time: 52.4 s


135

## Save 10-K Documents with Embeddings
We collect the entries in a pandas DataFrame, which we will use for graph loading in the next part.

In [28]:
import pandas as pd
edf = pd.DataFrame(embedding_entries)

In [29]:
edf

| | companyName | cusip | seqId | contextId | textEmbedding | text |
|---|---|---|---|---|---|---|
| 0 | General motors co | 369604103 | 0 | General motors co0 | [0.021373910829424858, -0.008970730938017368, ... | >Item 1. Business \nGeneral Motors Company (so... |
| 1 | General motors co | 369604103 | 1 | General motors co1 | [-0.033133506774902344, -0.012916910462081432,... | Our vision for the future is a world with zero... |
| 2 | General motors co | 369604103 | 2 | General motors co2 | [-0.04687150940299034, -0.022455034777522087, ... | In September 2021, we announced three new driv... |
| 3 | General motors co | 369604103 | 3 | General motors co3 | [-0.04158658906817436, -0.005550578236579895, ... | Ultium Charge 360 is also available to our fle... |
| 4 | General motors co | 369604103 | 4 | General motors co4 | [-0.037388477474451065, -0.007613219786435366,... | We offer OnStar and connected services to more... |
| ... | ... | ... | ... | ... | ... | ... |
| 130 | ARK RESTAURANTS CORP | 00214Q104 | 11 | ARK RESTAURANTS CORP11 | [-0.0341801680624485, -0.02265389822423458, -0... | We have experienced aggressive competition for... |
| 131 | ARK RESTAURANTS CORP | 00214Q104 | 12 | ARK RESTAURANTS CORP12 | [-0.04283265769481659, -0.028135832399129868, ... | 9\nAlcoholic beverage control regulations requ... |
| 132 | ARK RESTAURANTS CORP | 00214Q104 | 13 | ARK RESTAURANTS CORP13 | [-0.031862545758485794, -0.02629256248474121, ... | We are subject to “dram-shop” statutes in most... |
| 133 | ARK RESTAURANTS CORP | 00214Q104 | 14 | ARK RESTAURANTS CORP14 | [-0.02673407830297947, -0.02919837087392807, 0... | Our business is highly seasonal; however, our ... |


Provide your Neo4j credentials.  We need the DB connection URL, the username (probably `neo4j`), and your password.

In [None]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'

# You will need to change these to match your credentials
NEO4J_URI = 'neo4j+s://6688b25b.databases.neo4j.io'
NEO4J_PASSWORD = '<your-password>'

In [31]:
from graphdatascience import GraphDataScience

gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=True
)
gds.set_database('neo4j')

Remember to create indexes. Each document's `documentId` is built from the `companyName` plus a sequence number. In a production setting, we would want to use a better identifier here (like we did with `cusip` for Company). However, this should suffice for our purposes, as we are just getting acquainted with semantic search.

In [32]:
gds.run_cypher('CREATE INDEX company_name IF NOT EXISTS FOR (n:Company) ON (n.companyName)')
gds.run_cypher('CREATE CONSTRAINT unique_document_id IF NOT EXISTS FOR (n:Document) REQUIRE (n.documentId) IS NODE KEY')

Due to the size of the documents, we will transform the DataFrame into a list of dicts that we can chunk up and insert via a parameterized query.

In [33]:
emb_entries = edf.to_dict(orient='records')

In [34]:
total = len(emb_entries)
count = 0
for d in chunks(emb_entries, 100):
    gds.run_cypher('''
    UNWIND $records AS record
    MATCH(c:Company {cusip:record.cusip})
    MERGE(b:Document {documentId:record.contextId})
    SET b.documentType = 'FORM_10K_ITEM1', b.seqId = record.seqId, b.textEmbedding = record.textEmbedding, b.text = record.text
    MERGE(c)-[:HAS]->(b)
    RETURN count(b) as cnt
    ''', params = {'records':d})
    count += len(d)
    print(f'loaded {count} of {total}')

loaded 100 of 135
loaded 135 of 135


## Check Data

In [35]:
# Check node count
gds.run_cypher('MATCH(doc:Document) RETURN count(doc)')

| | count(doc) |
|---|---|
| 0 | 135 |


Note that we only have 10-K docs for a minority of companies. This is fine for our purposes, but in a more rigorous setting you may want to pull more.  A few factors likely contribute to this.

1. We used company names to search EDGAR, which resulted in many misses and duplicates that were discarded. In a more rigorous setting, we would investigate other endpoints and use more parsing to extract EDGAR CIK keys for exact company matching when pulling forms.

2. Company names are not consistent across form13 filings, so even if we successfully pull a form using one version of a company name, we may not be able to merge it into the graph via the one company name represented there.

3. Not all companies in the dataset are obligated to file 10-Ks.

In [36]:
# Check count and percentage of companies with 10-K docs.  Note it is the minority
gds.run_cypher('''
MATCH(b:Company)
WITH b, count{(b)-[:HAS]->(d:Document)} AS docCount
WITH count(b) AS total, sum(toInteger(docCount > 0)) AS numWithDocs
RETURN total, numWithDocs, round(100*toFloat(numWithDocs)/toFloat(total), 2) As PercWithDocs
''')

| | total | numWithDocs | PercWithDocs |
|---|---|---|---|
| 0 | 220 | 5 | 2.27 |
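
The arithmetic in this query can be mirrored in plain Python. The sketch below is an illustration only: `share_with_docs` is a hypothetical helper and the `toy` mapping of company to document count is made up. The last line reproduces the percentage from the totals reported above (5 of 220).

```python
def share_with_docs(doc_counts):
    """Return (total, number with at least one doc, percentage rounded to 2 places)."""
    total = len(doc_counts)
    num_with_docs = sum(1 for count in doc_counts.values() if count > 0)
    perc = round(100 * num_with_docs / total, 2) if total else 0.0
    return total, num_with_docs, perc


# Hypothetical document counts per company.
toy = {"A": 3, "B": 0, "C": 0, "D": 12}
summary = share_with_docs(toy)
print(summary)
# (4, 2, 50.0)

# Same formula applied to the totals reported by the query.
reported = round(100 * 5 / 220, 2)
print(reported)
# 2.27
```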


You might notice that there are duplicate company names.  For our purposes here, we treat this as a form of entity resolution: companies with the same name are treated as belonging to the same overarching entity for semantic search. In a more rigorous setting, we would need to disambiguate with other EDGAR keys.

In [37]:
# Show duplicate names by comparing total companies with distinct company names
gds.run_cypher('''
MATCH(b:Company)
RETURN count(b) AS totalCompanies, count(DISTINCT b.companyName) AS uniqueCompanyNames
''')

| | totalCompanies | uniqueCompanyNames |
|---|---|---|
| 0 | 220 | 219 |
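
The query tells us that one name is shared, but not which one. If the names were pulled into Python, a minimal sketch like the following could list the repeated ones. The company names are hypothetical and `duplicate_names` is an illustrative helper, not part of the notebook.

```python
from collections import Counter


def duplicate_names(names):
    """Return a dict of each name that occurs more than once and its count."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}


# Hypothetical company names with one repeat.
names = ["ACME CORP", "GLOBEX", "ACME CORP", "INITECH"]
dups = duplicate_names(names)

print(len(names), len(set(names)), dups)
# 4 3 {'ACME CORP': 2}
```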


## View Embeddings as Clusters

Vector embeddings generated by language models are numerical representations of words or sentences, so similar sentences are located near each other in the embedding space.  The embeddings we generated earlier are high-dimensional.  To visualize them, we need to reduce their dimensionality.  Let's do that and plot the result.

In [38]:
import altair as alt

def generate_chart(df, xcol, ycol, lbl = 'on', color = 'basic', title = '', tooltips = ['documentId'], label = ''):
  chart = alt.Chart(df).mark_circle(size=30).encode(
    x = alt.X(xcol,
        scale=alt.Scale(zero = False),
        axis=alt.Axis(labels = False, ticks = False, domain = False)
    ),
    y = alt.Y(ycol,
        scale=alt.Scale(zero = False),
        axis=alt.Axis(labels = False, ticks = False, domain = False)
    ),
    color= alt.value('#333293') if color == 'basic' else color,
    tooltip=tooltips
    )

  if lbl == 'on':
    text = chart.mark_text(align = 'left', baseline = 'middle', dx = 7, size = 5, color = 'black').encode(text = label, color = alt.value('black'))
  else:
    text = chart.mark_text(align = 'left', baseline = 'middle', dx = 10).encode()

  result = (chart + text).configure(
    background = '#FDF7F0'
  ).properties(
    width = 800,
    height = 500,
    title = title
  ).configure_legend(
    orient = 'bottom', titleFontSize = 18, labelFontSize = 18
  )
        
  return result

In [39]:
# Reduce dimensionality using PCA
from sklearn.decomposition import PCA

# Function to return the principal components
def get_pc(arr, n):
  pca = PCA(n_components = n)
  embeds_transform = pca.fit_transform(arr)
  return embeds_transform

In [40]:
emb_df = gds.run_cypher("MATCH (c:Company)-[:HAS]->(n:Document) RETURN c.companyName as companyName, n.documentId as documentId, n.text as text, n.textEmbedding as emb LIMIT 1000")
emb_df

| | companyName | documentId | text | emb |
|---|---|---|---|---|
| 0 | ARK INNOVATION ETF | ARK RESTAURANTS CORP0 | >Item 1.\n \nBusiness\nCOVID-19 Pandemic and I... | [-0.042690303176641464, -0.02283802069723606, ... |
| 1 | ARK INNOVATION ETF | ARK RESTAURANTS CORP1 | Overview\nWe are a New York corporation formed... | [-0.0489288792014122, -0.02295021153986454, 0.... |
| 2 | ARK INNOVATION ETF | ARK RESTAURANTS CORP10 | Competition\nThe hospitality industry is highl... | [-0.04103657230734825, -0.037126582115888596, ... |
| 3 | ARK INNOVATION ETF | ARK RESTAURANTS CORP11 | We have experienced aggressive competition for... | [-0.0341801680624485, -0.02265389822423458, -0... |
| 4 | ARK INNOVATION ETF | ARK RESTAURANTS CORP12 | 9\nAlcoholic beverage control regulations requ... | [-0.04283265769481659, -0.028135832399129868, ... |
| ... | ... | ... | ... | ... |
| 130 | VERIZON | Verizon Communications INC5 | . We sell network access to mobile virtual net... | [0.016797110438346863, -0.04080076143145561, -... |
| 131 | VERIZON | Verizon Communications INC6 | Global Enterprise offers a broad portfolio of ... | [-0.0034878277219831944, -0.0191048514097929, ... |
| 132 | VERIZON | Verizon Communications INC7 | Public Sector and Other offers wireless produc... | [0.01294224988669157, -0.024389244616031647, -... |
| 133 | VERIZON | Verizon Communications INC8 | Local services\n. We offer an array of local d... | [0.027821676805615425, -0.04315909743309021, -... |
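
Before clustering, it can help to see what "located nearby" means numerically. A common measure is cosine similarity, sketched below in plain Python. The three short vectors are made-up stand-ins for embeddings, not real model output, and their names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for chunk embeddings.
v_mortgage = [0.9, 0.1, 0.0]
v_lending = [0.8, 0.2, 0.1]
v_restaurant = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(v_mortgage, v_lending)
sim_unrelated = cosine_similarity(v_mortgage, v_restaurant)
print(round(sim_related, 3), round(sim_unrelated, 3))
```

The two vectors that point in a similar direction score close to 1, while the unrelated pair scores close to 0.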


## K-Means Clustering on the Embeddings
Let's run the K-Means Clustering algorithm and view similar document chunks. 

In [None]:
import numpy as np
from sklearn.cluster import KMeans

embeds = np.array(emb_df['emb'].tolist())
embeds_pc2 = get_pc(embeds, 2)

df_clust = pd.concat([emb_df, pd.DataFrame(embeds_pc2)], axis = 1)
n_clusters = 5

kmeans_model = KMeans(n_clusters = n_clusters, n_init = 1, random_state = 0)
classes = kmeans_model.fit_predict(embeds).tolist()
df_clust['cluster'] = (list(map(str,classes)))

df_clust.columns = df_clust.columns.astype(str)
generate_chart(df_clust, '0', '1', lbl = 'off', color = 'cluster', title = f'K-Means Clustering with {n_clusters} Clusters', tooltips = ['documentId', 'text'], label = '')
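
For intuition about what `KMeans` computes, here is a minimal pure-Python sketch of the underlying iteration (Lloyd's algorithm): assign each point to its nearest centroid, then move each centroid to the mean of its points. It is not scikit-learn's implementation, which by default chooses starting centroids with k-means++. The toy 2-D points and the fixed starting centroids are assumptions made for the example.

```python
def kmeans(points, centroids, n_iter=10):
    """Run Lloyd's algorithm from the given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return labels, centroids


# Two well-separated toy groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(labels)
# [0, 0, 0, 1, 1, 1]
```

On the real embeddings the same idea applies in many more dimensions, which is why the chart colors points by cluster while plotting only the first two principal components.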
