# Text Embedding with Vertex AI

In this notebook, we generate text embeddings for 10K filings with the Vertex AI [`textembedding-gecko`](https://cloud.google.com/vertex-ai/docs/generative-ai/model-reference/text-embeddings) model. The unstructured text of each filing was extracted with a parser beforehand.


In this notebook, we will:
1. Get the unstructured 10K filing text from a Google Cloud Storage bucket
2. Select Item 1 of each 10K, which describes the company's business: what the company does, what subsidiaries it owns, and what markets it operates in
3. Chunk the text into natural sections using NLTK (to avoid input token limits)
4. Generate an embedding for each chunk with Vertex AI
5. Save the text with embeddings to CSV, staged for loading into the graph


## Setup

First, check to ensure you're using the `neo4j_genai` kernel with the following command. This kernel has the necessary runtime and dependencies for this notebook. If you see a different kernel, try changing the kernel to `neo4j_genai` in the upper right corner of the screen.

In [1]:
import sys
import os
os.path.basename(sys.executable.replace("/bin/python",""))

'neo4j_genai'

Next, import the dependencies.

In [2]:
import json
import numpy as np
import os
import re
from string import Template
import pandas as pd

# Vertex AI and Google Cloud
import vertexai
from vertexai.language_models import TextEmbeddingModel
from google.cloud import storage

## Get 10K Filings from Google Cloud

In [3]:
storage_client = storage.Client()
(storage_client
 .bucket('neo4j-datasets')
 .blob('form10k/form10k-clean.zip')
 .download_to_filename('form10k-clean.zip'))

In [4]:
!unzip -qq -n 'form10k-clean.zip'

## 10K Filings Exploration and Chunking

Let's open one file to understand its contents. Despite the `.txt` extension, it is a JSON file.

In [None]:
with open('./form10k-clean/0000002488-22-000016.txt') as f:
    f10_k = json.load(f)

We are interested in Item 1 specifically. 

Item 1 describes the business of the company: who and what the company does, what subsidiaries it owns, and what markets it operates in. It may also include recent events, competition, regulations, and labor issues. (Some industries are heavily regulated, and have complex labor requirements, which have significant effects on the business.) Other topics in this section may include special operating costs, seasonal factors, or insurance matters.

In [None]:
len(f10_k['item1'])

This text can exceed the input token limit for `textembedding-gecko`. Embedding quality can also degrade if the text gets too large. We should therefore split the text into separate chunks before embedding.

Below is a way to do this with NLTK. 

In [None]:
from langchain.text_splitter import NLTKTextSplitter
import nltk
nltk.download('punkt', quiet=True)  # download the Punkt tokenizer model used for sentence-aware text splitting

text = f10_k['item1']

text_splitter = NLTKTextSplitter()
docs = text_splitter.split_text(text)

In [None]:
print(docs[0])
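
To build intuition for what sentence-based chunking does, here is a minimal pure-Python sketch. It is not LangChain's implementation: `split_sentences` is a naive regex stand-in for the Punkt tokenizer, `pack_sentences` and `max_chars` are hypothetical names, and chunk overlap is ignored. The idea is simply to keep whole sentences together while packing them greedily into chunks under a size budget.

```python
import re

def split_sentences(text):
    # Naive stand-in for NLTK's Punkt tokenizer (assumption for illustration):
    # split after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def pack_sentences(sentences, max_chars=4000):
    # Greedily add sentences to the current chunk; start a new chunk
    # when adding the next sentence would exceed max_chars.
    packed, current = [], ''
    for sentence in sentences:
        candidate = f'{current} {sentence}'.strip()
        if current and len(candidate) > max_chars:
            packed.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        packed.append(current)
    return packed

sample = ('We design chips. We sell them worldwide. '
          'Competition is intense. Demand is seasonal.')
example_chunks = pack_sentences(split_sentences(sample), max_chars=45)
for chunk in example_chunks:
    print(len(chunk), chunk)
```

With the tiny 45-character budget, the four sentences end up in two chunks of two sentences each. No sentence is cut in the middle, which is the property we want before embedding.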

## Getting 10K Text Embeddings with VertexAI

Now that we understand our data and how to chunk it, let's generate embeddings.

In [None]:
# Note: set project_id to your own Google Cloud project
project_id = 'neo4jbusinessdev'
location = 'us-central1'
vertexai.init(project=project_id, location=location)

In [None]:
# Instantiate the text embedding model

EMBEDDING_MODEL = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")

In [None]:
# We will need a chunking utility to make things easier as we loop through files

def chunks(xs, n=5):
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]
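
As a quick illustration of what this helper returns, the sketch below repeats the same definition so it runs on its own and applies it to made-up inputs. Each inner list is one batch that would be sent to the embedding model in a single request.

```python
def chunks(xs, n=5):
    # Same helper as in the notebook, repeated so this cell is self-contained
    n = max(1, n)
    return [xs[i:i + n] for i in range(0, len(xs), n)]

batches = chunks(list(range(12)))
print(batches)   # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]

# A non-positive batch size is clamped to 1 rather than raising an error
singles = chunks(['a', 'b', 'c'], n=0)
print(singles)   # [['a'], ['b'], ['c']]
```

The last batch may be shorter than `n`, so downstream code should not assume a fixed batch length.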

In [None]:
# Function for splitting and calculating embeddings

def create_text_embedding_entries(input_text:str, company_name:str):
    text_splitter = NLTKTextSplitter()
    docs = text_splitter.split_text(input_text)
    res = []
    seq_id = -1
    for d in chunks(docs):
        embeddings = EMBEDDING_MODEL.get_embeddings(d)
        for text_chunk, embedding in zip(d, embeddings):
            seq_id += 1
            res.append({'companyName': company_name, 'seqId': seq_id,
                        'contextId': company_name + str(seq_id),
                        'textEmbedding': embedding.values, 'text': text_chunk})
    return res
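
To check the shape of the entries without calling Vertex AI, the batching logic can be exercised against a stub. This is only a sketch: `FakeEmbeddingModel`, `FakeEmbedding`, and `build_entries` are hypothetical names invented here, and the stub returns a meaningless 2-dimensional vector. It assumes, as the function above does, that `get_embeddings` returns one object with a `.values` list per input text, in order.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FakeEmbedding:
    values: List[float]

class FakeEmbeddingModel:
    """Hypothetical stand-in for the real model, for offline testing only."""
    def get_embeddings(self, texts):
        # Fake 2-dim vector per text: [character count, space count]
        return [FakeEmbedding([float(len(t)), float(t.count(' '))]) for t in texts]

def build_entries(docs, company_name, model, batch_size=5):
    # Same bookkeeping as create_text_embedding_entries, minus the text splitting
    res = []
    for start in range(0, len(docs), batch_size):
        batch = docs[start:start + batch_size]
        embeddings = model.get_embeddings(batch)
        for offset, (text, emb) in enumerate(zip(batch, embeddings)):
            seq_id = start + offset  # running index across batches
            res.append({'companyName': company_name, 'seqId': seq_id,
                        'contextId': company_name + str(seq_id),
                        'textEmbedding': emb.values, 'text': text})
    return res

docs = [f'chunk number {i}' for i in range(7)]
entries = build_entries(docs, 'ACME', FakeEmbeddingModel())
print(len(entries), entries[-1]['contextId'])
```

Seven chunks are sent as two batches (five, then two), and `seqId` keeps counting across batches so that each `contextId` stays unique within a company.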

In [None]:
# Get file names

file_names = os.listdir('./form10k-clean/')
len(file_names)

In [None]:
%%time

# Primary loop.  This could take 30 minutes to an hour.
count = 0
embedding_entries = []
for file_name in file_names:
    if file_name.endswith('.txt'):
        count+=1
        if count%10 == 0:
            print(f'Parsed {count} of {len(file_names)}')
        with open('./form10k-clean/' + file_name) as f:
            f10_k = json.load(f)
        embedding_entries.extend(create_text_embedding_entries(f10_k['item1'], f10_k['companyName']))
len(embedding_entries)

## Save 10K Documents with Embeddings

We will save these locally for use in graph loading in the next part.

In [None]:
edf = pd.DataFrame(embedding_entries)

In [None]:
edf

In [None]:
edf.to_csv('form10k-doc-embeddings-2.csv', index=False)
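
One thing to keep in mind for the next part: CSV has no list type, so each `textEmbedding` list is written as its string representation and has to be parsed when the file is read back. The sketch below shows the round trip with only the standard library and a made-up row; with pandas, passing `converters={'textEmbedding': ast.literal_eval}` to `pd.read_csv` should achieve the same thing.

```python
import ast
import csv
import io

# Made-up sample row for illustration
rows = [{'contextId': 'ACME0', 'textEmbedding': [0.1, -0.2, 0.3]}]

# Write to an in-memory CSV; the list is stored as the string "[0.1, -0.2, 0.3]"
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['contextId', 'textEmbedding'])
writer.writeheader()
writer.writerows(rows)

# Read it back: the embedding column comes back as a plain string
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]['textEmbedding']

# Parse the string back into a list of floats
vector = ast.literal_eval(raw)
print(type(raw).__name__, type(vector).__name__, vector)
```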