# Text Embedding
In this notebook, we generate text embeddings for 10-K filings with the Vertex AI [text-embedding](https://ai.google.dev/gemini-api/docs/models/gemini#text-embedding) model. The unstructured text was extracted from the filings beforehand with a parser.

In this notebook, we will:
1. Get the unstructured text of the 10-K filings from a Google Cloud Storage bucket
2. Select Item 1 from each 10-K, which describes the business of the company: what the company does, what subsidiaries it owns, and what markets it operates in
3. Chunk the text into natural sections (to avoid input token limits)
4. Save the text with embeddings to CSV to stage it for loading into the graph

In [None]:
%pip install --user tabulate sentence-transformers
%pip install --user altair

Be sure to restart the kernel after you run the pip command.

## Get 10-K Filings from Google Cloud Storage

In [None]:
from google.cloud import storage

storage_client = storage.Client()
storage_client.bucket('neo4j-datasets').blob('hands-on-lab/form10k.zip').download_to_filename('/home/jupyter/form10k.zip')

In [None]:
!mkdir -p /home/jupyter/form10k
!unzip -qq -n '/home/jupyter/form10k.zip' -d /home/jupyter/form10k

## 10-K Filings Exploration and Chunking
Let's open one file to understand its contents. Despite the `.txt` extension, it is actually a JSON file.

In [None]:
import json
with open('/home/jupyter/form10k/0001830197-22-000038.txt') as f:
    f10_k = json.load(f)

We are interested in Item 1 specifically. 

Item 1 describes the business of the company: what the company does, what subsidiaries it owns, and what markets it operates in. It may also include recent events, competition, regulations, and labor issues. (Some industries are heavily regulated and have complex labor requirements, which have significant effects on the business.) Other topics in this section may include special operating costs, seasonal factors, or insurance matters.

In [None]:
len(f10_k['item1'])

This text can exceed the token limit of the embedding model. The quality of embeddings can also degrade when the input text gets too long. We should therefore split the text into separate chunks for embedding.

Below is a way to do this with LangChain's `RecursiveCharacterTextSplitter`, which also lets adjacent chunks overlap.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = f10_k['item1']

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 2000,
    chunk_overlap  = 15,
    length_function = len,
    is_separator_regex = False,
)
docs = text_splitter.split_text(text)

In [None]:
print(docs[0])
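
To build intuition for what `chunk_size` and `chunk_overlap` control, here is a minimal sliding-window sketch in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter tries to break on natural separators such as paragraph and line breaks, while this hypothetical `sliding_window_chunks` helper cuts at fixed character offsets. The small sizes are chosen only so the result is easy to inspect.

```python
def sliding_window_chunks(text, chunk_size=2000, chunk_overlap=15):
    """Split text into fixed-size character windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return pieces


# Illustrative input only: 50 digits, split into windows of 20 with 5 shared characters
sample = "".join(str(i % 10) for i in range(50))
pieces = sliding_window_chunks(sample, chunk_size=20, chunk_overlap=5)

print([len(p) for p in pieces])
print(pieces[0][-5:] == pieces[1][:5])  # the end of one chunk repeats at the start of the next
```

The overlap means a sentence cut at a chunk boundary still appears, at least partly, in both neighboring chunks.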

## Get 10-K Text Embeddings with Vertex AI
Now that we understand our data and how to chunk it, let's generate embeddings.

In [None]:
from vertexai.language_models import TextEmbeddingModel
EMBEDDING_MODEL = TextEmbeddingModel.from_pretrained("text-embedding-004")

In [None]:
# Batching utility: split a list into groups of n so each embedding request stays within limits
def chunks(xs, n=3):
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]
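
As a quick illustration of how this helper batches a list, the sketch below repeats the same definition so that it runs on its own. The input strings are placeholders, not real filing text.

```python
def chunks(xs, n=3):
    # Same definition as above, repeated so this example is self-contained
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]


batches = chunks(["a", "b", "c", "d", "e", "f", "g"])
print(batches)  # [['a', 'b', 'c'], ['d', 'e', 'f'], ['g']]

# A non-positive batch size falls back to batches of one
print(chunks([1, 2], 0))  # [[1], [2]]
```

The last batch may be shorter than `n`, which is why the embedding loop iterates over `range(len(d))` rather than assuming a fixed batch size.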

In [None]:
import time

def create_text_embedding_entries(input_text:str, company_name: str, cusip: str):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 2000,
        chunk_overlap  = 15,
        length_function = len,
        is_separator_regex = False,
    )
    docs = text_splitter.split_text(input_text)
    res = []
    seq_id = -1
    for d in chunks(docs):
        embeddings = EMBEDDING_MODEL.get_embeddings(d)
        
        # throttle so we don't blow through the quota.
        time.sleep(1)
        
        for i in range(len(d)):
            seq_id += 1
            res.append({'companyName': company_name, 'cusip': cusip, 'seqId': seq_id, 'contextId': company_name + str(seq_id), 'textEmbedding': embeddings[i].values, 'text': d[i]})
    return res

Due to quota limitations, let's only process 5 of the 95 Form 10-K files we have.

In [None]:
import os

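# Keep only the .txt filings before taking the first 5, so the slice always holds filings
file_names = sorted(f for f in os.listdir('/home/jupyter/form10k/') if f.endswith('.txt'))[0:5]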
len(file_names)

This cell takes about 15 minutes to run.  That's largely down to us throttling so we don't exceed the quota on our free account.  If you're using an enterprise account, you won't need to throttle like this.

In [None]:
%%time

count = 0
embedding_entries = []
for file_name in file_names:
    if '.txt' in file_name:
        count += 1
        if count % 5 == 0:
            print(f'Parsed {count} of {len(file_names)}')
        with open('/home/jupyter/form10k/' + file_name) as f:
            f10_k = json.load(f)
        embedding_entries.extend(create_text_embedding_entries(f10_k['item1'], f10_k['companyName'], f10_k['cusip']))
len(embedding_entries)
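
The fixed one-second sleep used above is the simplest way to throttle. One possible alternative, sketched below, is to retry a failed call with exponential backoff. The `call_with_backoff` helper is hypothetical and not part of any library used in this notebook. It also catches every exception for brevity, so in practice you would narrow that to whatever quota error your client library raises.

```python
import random
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt (plus a little jitter) and retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            # Give up and re-raise after the final attempt
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


# Simulated service that fails twice before succeeding (stands in for an embedding request)
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated quota error")
    return "ok"


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, len(delays))
```

Inside `create_text_embedding_entries`, the wrapped call could look like `call_with_backoff(lambda: EMBEDDING_MODEL.get_embeddings(d))`, assuming quota failures surface as exceptions.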

## Save 10-K Documents with Embeddings
We will save these locally to use for graph loading in the next part.

In [None]:
import pandas as pd
edf = pd.DataFrame(embedding_entries)

In [None]:
edf
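
If you want to stage the entries on disk, for example to avoid re-running the embedding step, note that a list-valued column such as `textEmbedding` does not round-trip through CSV on its own. The sketch below shows one way to handle it with the standard library, by serializing the vector as a JSON string. The helper names and the sample record are made up for illustration, and an in-memory buffer stands in for a real file.

```python
import csv
import io
import json

FIELDS = ['companyName', 'cusip', 'seqId', 'contextId', 'textEmbedding', 'text']


def write_entries_csv(entries, fileobj):
    """Write embedding entries as CSV, storing each vector as a JSON string."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDS)
    writer.writeheader()
    for entry in entries:
        row = dict(entry)
        row['textEmbedding'] = json.dumps(entry['textEmbedding'])
        writer.writerow(row)


def read_entries_csv(fileobj):
    """Read the CSV back, restoring the integer sequence id and the float vector."""
    rows = []
    for row in csv.DictReader(fileobj):
        row['seqId'] = int(row['seqId'])
        row['textEmbedding'] = json.loads(row['textEmbedding'])
        rows.append(dict(row))
    return rows


# Made-up record with the same keys as the entries built above
sample_entries = [{
    'companyName': 'EXAMPLE CO', 'cusip': '000000000', 'seqId': 0,
    'contextId': 'EXAMPLE CO0', 'textEmbedding': [0.1, -0.2, 0.3],
    'text': 'Item 1. Business, with a comma and\na line break.',
}]

buffer = io.StringIO(newline='')  # stand-in for open(path, 'w', newline='')
write_entries_csv(sample_entries, buffer)
buffer.seek(0)
restored = read_entries_csv(buffer)
print(restored == sample_entries)
```

Reading `cusip` back as a string also preserves any leading zeros, which a numeric parse would drop.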

Provide your Neo4j credentials. We need the DB connection URL, the username (probably `neo4j`), and your password.

In [None]:
# username is neo4j by default
NEO4J_USERNAME = 'neo4j'

# You will need to change these to match your credentials
NEO4J_URI = 'YOUR_NEO4J_URL_FROM_DOWNLOADED_TXT_FILE' #Eg 'neo4j+s://ccc5f4f5.databases.neo4j.io'
NEO4J_PASSWORD = 'PASSWORD'

In [None]:
from graphdatascience import GraphDataScience

gds = GraphDataScience(
    NEO4J_URI,
    auth=(NEO4J_USERNAME, NEO4J_PASSWORD),
    aura_ds=True
)
gds.set_database('neo4j')

Remember to create indexes. We will be merging 10-K documents by `companyName`. In a production setting, we would want to use a better identifier here (like we did with `cusip` for Company). However, this should suffice for our purposes, as we are just getting acquainted with semantic search.

In [None]:
gds.run_cypher('CREATE INDEX company_name IF NOT EXISTS FOR (n:Company) ON (n.companyName)')
gds.run_cypher('CREATE CONSTRAINT unique_document_id IF NOT EXISTS FOR (n:Document) REQUIRE (n.documentId) IS NODE KEY')

Due to the size of the documents, we will transform the DataFrame into a list of dicts that we can chunk up and insert via a parameterized query.

In [None]:
emb_entries = edf.to_dict(orient='records')

In [None]:
total = len(emb_entries)
count = 0
for d in chunks(emb_entries, 100):
    gds.run_cypher('''
    UNWIND $records AS record
    MATCH(c:Company {cusip:record.cusip})
    MERGE(b:Document {documentId:record.contextId})
    SET b.documentType = 'FORM_10K_ITEM1', b.seqId = record.seqId, b.textEmbedding = record.textEmbedding, b.text = record.text
    MERGE(c)-[:HAS]->(b)
    RETURN count(b) as cnt
    ''', params = {'records':d})
    count += len(d)
    print(f'loaded {count} of {total}')

## Check Data

In [None]:
# Check node count
gds.run_cypher('MATCH(doc:Document) RETURN count(doc)')

Note that we only have 10-K docs for a minority of companies. That should be fine for this exercise, but in a more rigorous setting, you may want to try to pull more. There are likely a few factors contributing to this.

1. We used company names to search EDGAR, which resulted in many misses and duplicates that were discarded. In a more rigorous setting, we would investigate other endpoints and use more parsing to extract EDGAR CIK keys for exact company matching when pulling forms.

2. Company names are not consistent across Form 13 filings, so even if we successfully pull on one version of a company name, we may not be able to merge it into the graph via the one company name represented there.

3. Not all companies in the dataset are obligated to file 10-Ks.

In [None]:
# Check count and percentage of companies with 10-K docs.  Note it is the minority
gds.run_cypher('''
MATCH(b:Company)
WITH b, count{(b)-[:HAS]->(d:Document)} AS docCount
WITH count(b) AS total, sum(toInteger(docCount > 0)) AS numWithDocs
RETURN total, numWithDocs, round(100*toFloat(numWithDocs)/toFloat(total), 2) As PercWithDocs
''')

You might note that there are duplicate names. For our purposes here, we will treat this as a simple form of entity resolution, meaning that we treat companies with the same name as belonging to the same overarching entity for semantic search. In a more rigorous setting, we would need to disambiguate with other EDGAR keys.

In [None]:
# Show duplicates by comparing the total company count with the distinct company name count
gds.run_cypher('''
MATCH(b:Company)
RETURN count(b) AS totalCompanies, count(DISTINCT b.companyName) AS uniqueCompanyNames
''')

## View Embeddings as Clusters

Vector embeddings generated by language models are numerical representations of words or sentences, so similar sentences are located near each other in the embedding space. The embeddings we generated earlier are high-dimensional. To visualize them, we need to reduce the dimensionality. Let's do that and visualize the result.

In [None]:
import altair as alt

def generate_chart(df, xcol, ycol, lbl = 'on', color = 'basic', title = '', tooltips = ['documentId'], label = ''):
  chart = alt.Chart(df).mark_circle(size=30).encode(
    x = alt.X(xcol,
        scale=alt.Scale(zero = False),
        axis=alt.Axis(labels = False, ticks = False, domain = False)
    ),
    y = alt.Y(ycol,
        scale=alt.Scale(zero = False),
        axis=alt.Axis(labels = False, ticks = False, domain = False)
    ),
    color= alt.value('#333293') if color == 'basic' else color,
    tooltip=tooltips
    )

  if lbl == 'on':
    text = chart.mark_text(align = 'left', baseline = 'middle', dx = 7, size = 5, color = 'black').encode(text = label, color = alt.value('black'))
  else:
    text = chart.mark_text(align = 'left', baseline = 'middle', dx = 10).encode()

  result = (chart + text).configure(background="#FDF7F0"
        ).properties(
        width = 800,
        height = 500,
        title = title
       ).configure_legend(
  orient = 'bottom', titleFontSize = 18, labelFontSize = 18)
        
  return result

In [None]:
# Reduce dimensionality using PCA
from sklearn.decomposition import PCA

# Function to return the principal components
def get_pc(arr, n):
  pca = PCA(n_components = n)
  embeds_transform = pca.fit_transform(arr)
  return embeds_transform
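
For intuition about what `get_pc` computes, the sketch below performs the same kind of projection with NumPy alone: center the data, take a singular value decomposition, and project onto the top directions. This is a simplified illustration on random data, not a replacement for scikit-learn's `PCA`. The hypothetical `pca_project` helper may also produce components with flipped signs compared to scikit-learn.

```python
import numpy as np


def pca_project(arr, n_components=2):
    """Project the rows of arr onto their top principal components via SVD."""
    centered = arr - arr.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


# Random stand-in for real embeddings: 20 vectors with 8 dimensions each
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(20, 8))

projected = pca_project(fake_embeddings, 2)
print(projected.shape)  # (20, 2)
```

The first output column carries at least as much variance as the second, which is why plotting the first two components keeps as much of the spread as a flat chart can show.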

In [None]:
emb_df = gds.run_cypher("MATCH (c:Company)-[:HAS]->(n:Document) RETURN c.companyName as companyName, n.documentId as documentId, n.text as text, n.textEmbedding as emb LIMIT 1000")
emb_df
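
The idea that similar text ends up close together can also be checked numerically. A common closeness measure is cosine similarity, sketched below in plain Python on made-up three-dimensional vectors. Real embeddings from the model have far more dimensions, and the variable names here are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings of three text chunks
retail_a = [0.9, 0.1, 0.0]
retail_b = [0.8, 0.2, 0.1]
biotech = [0.0, 0.2, 0.9]

similar = cosine_similarity(retail_a, retail_b)
different = cosine_similarity(retail_a, biotech)
print(similar > different)  # True
```

The same function could be applied to any two rows of the `emb` column returned above.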

## K-Means Clustering on the Embeddings
Let's run the K-Means Clustering algorithm and view similar document chunks. 

In [None]:
import numpy as np
from sklearn.cluster import KMeans

embeds = np.array(emb_df['emb'].tolist())
embeds_pc2 = get_pc(embeds, 2)

df_clust = pd.concat([emb_df, pd.DataFrame(embeds_pc2)], axis = 1)
n_clusters = 5

kmeans_model = KMeans(n_clusters = n_clusters, n_init = 1, random_state = 0)
classes = kmeans_model.fit_predict(embeds).tolist()
df_clust['cluster'] = (list(map(str,classes)))

df_clust.columns = df_clust.columns.astype(str)
generate_chart(df_clust, '0', '1', lbl = 'off', color = 'cluster', title = f'K-Means Clustering with {n_clusters} Clusters', tooltips = ['documentId', 'text'], label = '')
