# KMeans Clustering with LLM-Augmented Labels

We use scikit-learn's `KMeans` algorithm to group embeddings generated by a neural
network into `n` clusters. The cluster assignments define bundles of similar text.
The bundles are then passed to an LLM, which generates a descriptive label for each
cluster.
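
As a minimal sketch of the bundling step, assuming toy 2-D vectors in place of real embeddings (the texts and vectors below are invented for illustration and are not taken from the search index):

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

# Toy stand-ins for real text embeddings: two well-separated groups of 2-D points.
texts = ["sales tax rates", "tax rate lookup", "payroll forms", "payroll W-4"]
vectors = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])

# Fit KMeans and read off one cluster label per text.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Group the texts into bundles keyed by cluster label.
bundles = defaultdict(list)
for text, label in zip(texts, kmeans.labels_):
    bundles[int(label)].append(text)

for label in sorted(bundles):
    print(label, bundles[label])
```

Each bundle is what would be handed to the LLM for naming. The numeric labels themselves are arbitrary; only the grouping matters.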

## Pros and Cons of Approach

| Pros                                                | Cons                                      |
| --------------------------------------------------- | ----------------------------------------- |
| No need for labeled data                            | LLMs are slow to generate labels          |
| Can be used to generate labels for any type of data | Clustering is sensitive to initialization |

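The sensitivity to initialization can be probed with a small experiment. The sketch below uses synthetic blobs rather than the notebook's embeddings, and compares single random starts against a fit that tries several starts via scikit-learn's `n_init` parameter; the data, seeds, and cluster count are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three noisy blobs standing in for document embeddings.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
points = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

# One random initialization per seed: the final inertia can differ between seeds.
single_run_inertias = [
    KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(points).inertia_
    for seed in range(5)
]

# Several initializations in one fit: scikit-learn keeps the run with the lowest inertia.
multi_start = KMeans(n_clusters=3, init="random", n_init=10, random_state=0).fit(points)

print("single-start inertias:", [round(v, 2) for v in single_run_inertias])
print("best of 10 starts:   ", round(multi_start.inertia_, 2))
```

Fixing `random_state` makes a run reproducible, while a larger `n_init` reduces the chance of settling on a poor local optimum at the cost of extra fitting time.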

In [40]:
import dotenv
import os
import sys
import json

from datasets import load_dataset
from sklearn.cluster import KMeans

sys.path.append("../")

from fns import vector_search, get_embeddings

dotenv.load_dotenv()

# Embed the query and retrieve its nearest neighbours from the vector index


query = "How do you calculate tax in the state of PA?"
embeddings = get_embeddings(query)["data"][0]["embedding"]

subset = vector_search(
    embeddings,
    k=50, # max = 50 without pagination
)["value"]

def xf_results(results, embeddings = False):
    essential = []
    for result in results:
        prep = {
            "title": result["title"], 
            "heading": result["heading"],
            "content": result["content"].strip(), 
            "url": json.loads(result["metadata"])["community_url"],
            "product": result["product"],
            "score": result["@search.score"]
        }
        if embeddings:
            prep["embeddings"] = result["content_vector"]
        essential.append(prep)
    return essential
        
essentials = xf_results(subset)
# print the number of results
print(f"Number of results: {len(essentials)}")

essentials[:3]

Number of results: 50


[{'title': 'Look up standard tax rates in O Series',
  'heading': 'Look up tax rates by jurisdiction',
  'content': 'Look up tax rates by jurisdiction  Searching by jurisdiction returns only the rates for the selected jurisdictions. To  look up rates for a jurisdiction:  1. Navigate to Tools > Rate Lookup > By Jurisdiction. 2. Click next to the Jurisdictions field to display the Select Jurisdictions dialog box and select jurisdictions for  which you want to see rates. 3. To refine your search, click Advanced Search. See "Advanced Search criteria" in this article for details. 4. Click Search. All tax-rate data that O Series found for the criteria displays.  This example requests the tax rates for both Pennsylvania and Philadelphia and is  useful for seeing the components of the total 8% sales tax that is charged for purchases  in Philadelphia.     Look up tax rates by Tax Area ID  Searching by tax area returns only the rates for the selected Tax Area ID.  1. Navigate to Tools > Rate Loo

In [41]:
from sklearn.preprocessing import MinMaxScaler
import tiktoken

# use text-embedding-ada for tokenization
tokenizer = tiktoken.encoding_for_model("text-embedding-ada-002")

# separate the embeddings from each dict in the list
embeddings = [x["embeddings"] for x in xf_results(subset, embeddings=True)]

# scale the embeddings
scaler = MinMaxScaler()
embeddings = scaler.fit_transform(embeddings)

clusters = 3

# fit a kmeans model to the embeddings
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(embeddings)

# add the cluster labels to the essentials
for i, label in enumerate(kmeans.labels_):
    essentials[i]["cluster"] = label

# create a mapping of the cluster labels to the items in the cluster
cluster_map = {}
for i, label in enumerate(kmeans.labels_):
    if label not in cluster_map:
        cluster_map[label] = []
    cluster_map[label].append(essentials[i])

# sort the clusters by their numeric label
cluster_map = dict(sorted(cluster_map.items()))

def combine_title_heading(item):
    return item["title"] + " - " + item["heading"] if item["heading"] not in item["title"] else item["title"]

prompt_cluster = ""
# for each cluster in the map, append the cluster label followed by every title in the cluster
for label, cluster in cluster_map.items():
    prompt_cluster += f"\nCluster {label}\n"
    for item in cluster:
        prompt_cluster += "- " + combine_title_heading(item) + "\n"

print(prompt_cluster)
# print the total tokens and total items in the cluster
print("tokens: ", len(tokenizer.encode(prompt_cluster)))



Cluster 0
- Look up standard tax rates in O Series - Look up tax rates by jurisdiction
- Set up O Series Hospitality for multi-night lodging
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining stream payment taxability
- Create an imposition and calculation rule for the franchise fee - Create a user-defined calculation tax rule for your franchise fee
- Purchase Order events in O Series - Example of a purchase order transaction
- Enter Out-of-State Sales in Vertex Cloud - Entering out-of-state sales in Vertex Cloud
- Get Tax Rates for Uploaded Addresses in Vertex Cloud - Getting tax rates for uploaded addresses
- View bracket schedules for O Series tax rules - Examples of how a bracket schedule applies
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining upfront payment taxability
- Set up O Series Hospitality for multi-night lodging - Apply a tax rate per number of nights stayed
- Purchase Order events in O Ser

In [42]:
from fns import messages_prompt

# ask LLM to create a name for each cluster

system_prompt = """
You are a helpful assistant. You will be provided with a list of clusters of similar titles.
Please provide a name for each cluster based on the titles in the cluster.
Make sure the names are descriptive, concise, and unique (non-overlapping).

# Example input:

Cluster 0:
- Set up your organization’s taxpayers in O Series
- Set up taxpayer registrations in O Series
- Set up your organization’s taxpayers in O Series - The General tab

Cluster 1:
- Your taxpayer hierarchy in O Series - Enabling Adopt Parent Setup
- Add a Single Mapping in Vertex Cloud - Steps for adding a mapping
- Bulk upload of taxability drivers in O Series - Prepare the taxability driver upload file

Cluster 2:
- Set up taxpayer registrations in O Series - Add imposition registration details
- United Kingdom electronic filing for Making Tax Digital (MTD, VAT 100) in VAT Compliance 
- Slovakia electronic filing for Intrastat - Arrivals/Dispatch in VAT Compliance 

# Example output:

- 0: taxpayer registration or organization setup
- 1: taxpayer hierarchy and taxability drivers
- 2: VAT and foreign electronic filing

"""

results = messages_prompt([{
    "role": "system",
    "content": system_prompt
}, {
    "role": "user",
    "content": prompt_cluster
}])

# print the results
print(results)


- Cluster 0: O Series tax setup and configuration
- Cluster 1: Vertex Payroll Tax forms and calculations
- Cluster 2: Pennsylvania Local Services Tax in Vertex Payroll Tax


In [43]:
import re

# extract the cluster names from the llm output
cluster_names = re.findall(r"(\d+): (.+)", results)

print("Cluster names:", cluster_names)
cluster_names = dict(cluster_names)

cluster_names

Cluster names: [('0', 'O Series tax setup and configuration'), ('1', 'Vertex Payroll Tax forms and calculations'), ('2', 'Pennsylvania Local Services Tax in Vertex Payroll Tax')]


{'0': 'O Series tax setup and configuration',
 '1': 'Vertex Payroll Tax forms and calculations',
 '2': 'Pennsylvania Local Services Tax in Vertex Payroll Tax'}
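
The regex above works when the LLM follows the expected format, but a missing or reformatted line would cause a `KeyError` in the renaming step. A more defensive parser might look like the sketch below; `parse_cluster_names` and its fallback naming are hypothetical helpers, not part of `fns`.

```python
import re

def parse_cluster_names(llm_output, expected_labels):
    """Map each expected cluster label to a name, with a fallback when one is missing."""
    # Accept "- 0: name", "0: name" and "- Cluster 0: name".
    pattern = re.compile(r"^\s*-?\s*(?:Cluster\s+)?(\d+)\s*:\s*(.+?)\s*$", re.MULTILINE)
    found = {int(num): name for num, name in pattern.findall(llm_output)}
    # Fall back to a generic name so that a later lookup never raises KeyError.
    return {label: found.get(label, f"Cluster {label}") for label in expected_labels}

sample = "- Cluster 0: O Series tax setup\n1: Payroll forms\n"
names = parse_cluster_names(sample, expected_labels=[0, 1, 2])
print(names)
```

Keying the result by integer label also avoids converting the KMeans labels to strings before the lookup.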

In [44]:
# rename the cluster keys to the cluster names
clustered_map = {cluster_names[str(k)]: v for k, v in cluster_map.items()}

# print the first 3 items in each cluster
for label, cluster in clustered_map.items():
    print(f"\n{label}")
    for item in cluster[:3]:
        print("- " + combine_title_heading(item))



O Series tax setup and configuration
- Look up standard tax rates in O Series - Look up tax rates by jurisdiction
- Set up O Series Hospitality for multi-night lodging
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining stream payment taxability

Vertex Payroll Tax forms and calculations
- Vertex Payroll Tax form for Pennsylvania-Maryland reciprocal exception
- Vertex Payroll Tax W-4 calculation process
- Vertex® Payroll Tax Release Notes - November 2021 changes

Pennsylvania Local Services Tax in Vertex Payroll Tax
- Vertex Payroll Tax forms for Pennsylvania Local Services Tax
- Vertex Payroll Tax forms for Pennsylvania Local Services Tax - PA\_LST.STATE\_YTD\_PREV
- Vertex Payroll Tax Forms for Pennsylvania


## If user down-votes the answer...

In [45]:
# takes the clustered_map and returns a prompt to the user for follow up questions
def follow_up_prompt(clustered_map):
    prompt = "Sorry, which of the following topics are you interested in?\n"
    # number the cluster names starting at 1
    for index, label in enumerate(clustered_map, start=1):
        prompt += f"{index}. {label}\n"
    # the final option lets the user opt out of every cluster
    prompt += f"{len(clustered_map) + 1}. something else"
    return prompt

# print the follow up prompt
print(follow_up_prompt(clustered_map))

Sorry, which of the following topics are you interested in?
1. O Series tax setup and configuration
2. Vertex Payroll Tax forms and calculations
3. Pennsylvania Local Services Tax in Vertex Payroll Tax
4. something else
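
One possible refinement, sketched here under the assumption that the caller can hold on to a little state, is to return the ordered list of options together with the prompt text. The user's numeric choice can then be resolved by index instead of re-parsing the printed prompt. `build_follow_up` and `resolve_choice` are hypothetical names introduced for illustration.

```python
def build_follow_up(cluster_names):
    """Return the follow-up prompt together with the ordered list of options."""
    options = list(cluster_names) + ["something else"]
    lines = ["Sorry, which of the following topics are you interested in?"]
    lines += [f"{i}. {name}" for i, name in enumerate(options, start=1)]
    return "\n".join(lines), options

def resolve_choice(choice, options):
    """Translate a 1-based choice into an option, or None if it is out of range."""
    if 1 <= choice <= len(options):
        return options[choice - 1]
    return None

prompt, options = build_follow_up(["O Series tax setup", "Payroll Tax forms"])
print(prompt)
print(resolve_choice(2, options))
```

Returning `None` for an out-of-range choice leaves it to the caller to decide whether to re-prompt or fall back to the original results.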


In [46]:
# Putting it all together

# function takes the user query and returns the follow up prompt
def get_follow_up(query):
    """
    Takes the user query and returns a dict with the following keys:
    - top_3: the top 3 results from the vector search
    - follow_up: the follow up prompt
    """
    embeddings = get_embeddings(query)["data"][0]["embedding"]
    subset = vector_search(
        embeddings,
        k=50, # max = 50 without pagination
    )["value"]
    essentials = xf_results(subset)
    embeddings = [x["embeddings"] for x in xf_results(subset, embeddings=True)]
    scaler = MinMaxScaler()
    embeddings = scaler.fit_transform(embeddings)
    clusters = 3
    kmeans = KMeans(n_clusters=clusters, random_state=0).fit(embeddings)
    for i, label in enumerate(kmeans.labels_):
        essentials[i]["cluster"] = label
    cluster_map = {}
    for i, label in enumerate(kmeans.labels_):
        if label not in cluster_map:
            cluster_map[label] = []
        cluster_map[label].append(essentials[i])
    cluster_map = dict(sorted(cluster_map.items()))
    prompt_cluster = ""
    for label, cluster in cluster_map.items():
        prompt_cluster += f"\nCluster {label}\n"
        for item in cluster:
            prompt_cluster += "- " + combine_title_heading(item) + "\n"
    system_prompt = """
    You are a helpful assistant. You will be provided with a list of clusters of similar titles.
    Please provide a name for each cluster based on the titles in the cluster.
    Make sure the names are descriptive, concise, and unique (non-overlapping).
    Return one line per cluster in the form "<cluster number>: <name>".
    """
    results = messages_prompt([{
        "role": "system",
        "content": system_prompt
    }, {
        "role": "user",
        "content": prompt_cluster
    }])
    cluster_names = re.findall(r"(\d+): (.+)", results)
    cluster_names = dict(cluster_names)
    clustered_map = {cluster_names[str(k)]: v for k, v in cluster_map.items()}
    return {
        "top_3": essentials[:3],
        "follow_up": follow_up_prompt(clustered_map)
    }



In [82]:
from IPython.display import display, Markdown as md

# test the function
query = "How do you calculate tax in the state of PA?"

response = get_follow_up(query)

# print the title + headings for the top 3 results
print("Query:\n", query, "\n")
print("Top 3 results:")
for idx, item in enumerate(response["top_3"]):
    print("-", combine_title_heading(item))
    markdown = md(item["content"])
    display(markdown)

# pretty print the results
print(f"\nFollow up:\n{response['follow_up']}")

Query:
 How do you calculate tax in the state of PA? 

Top 3 results:
- Look up standard tax rates in O Series - Look up tax rates by jurisdiction


Look up tax rates by jurisdiction  Searching by jurisdiction returns only the rates for the selected jurisdictions. To  look up rates for a jurisdiction:  1. Navigate to Tools > Rate Lookup > By Jurisdiction. 2. Click next to the Jurisdictions field to display the Select Jurisdictions dialog box and select jurisdictions for  which you want to see rates. 3. To refine your search, click Advanced Search. See "Advanced Search criteria" in this article for details. 4. Click Search. All tax-rate data that O Series found for the criteria displays.  This example requests the tax rates for both Pennsylvania and Philadelphia and is  useful for seeing the components of the total 8% sales tax that is charged for purchases  in Philadelphia.     Look up tax rates by Tax Area ID  Searching by tax area returns only the rates for the selected Tax Area ID.  1. Navigate to Tools > Rate Lookup > By Tax Area. 2. Identify the Tax Area ID in one of these ways:  	* Enter the ID in the Tax Area ID field. 	* Click next to the Tax Area ID field to display the Select Tax Area ID dialog box, where you can enter either the 	 Tax Area ID or as much jurisdiction information as possible. Click Search, select the radio button for the Tax Area ID of interest, and click Select. 3. To refine your search, click Advanced Search. See "Advanced Search criteria" in this article for details. 4. Click Search. All tax-rate data that O Series found for the criteria displays.  This example shows the tax rate for Tax Area ID 391013000, which corresponds to Philadelphia.  If you already know a Tax Area ID, this method of rate lookup is the most efficient.

- Vertex Payroll Tax forms for Pennsylvania Local Services Tax


Vertex Payroll Tax forms for Pennsylvania Local Services Tax  Payroll Tax calculates Pennsylvania Local Services Tax (LST) whether or not you use the form.  However, the form provides greater precision in the calculation, such as allowing  catch-up withholding if needed.  Read about the Pennsylvania LST.     PA\_LST forms  Pass in these multi-value forms to:  * Set the primary Pennsylvania work location if an employee works in multiple Pennsylvania  locations. * Calculate the prorated employee deduction for the local services tax (LST) in the  Pennsylvania municipalities where your company is located.  You can use these forms even if you are claiming one of the valid exemptions.      Calculation  The PA\_LST forms use the following basic calculation to determine the deduction per  pay period:   Maximum annual Local Services Tax amount   ÷ Number of pay periods in a year   \_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_  Round the resulting tax amount value down to the nearest penny      Use a form to set a primary work location for period-to-date tax amounts  If the period-to-date tax amount is less than the per-pay period prorated amount for  the local jurisdiction, Payroll Tax deducts the remainder of that amount during the current pay period (except as constrained  by local and statewide limits).  First, Payroll Tax checks for a combination of GeoCode and taxingLocationType with a period-to-date  amount.  If none is found, Payroll Tax uses:  * The primary work location set with PA\_LST.PRIMARYWORKLOCATION.  Or  * If there is no form, Payroll Tax uses the primary work location provided in the paycheck request.      Use a form to enable catch-up withholding  The Pennsylvania Local Services Tax is subject to low-income and military exemptions.  If you use the PA\_LST exemption forms, Payroll Tax factors the exemptions into the calculation and tracks the employee's year-to-date  wages.  
Using the forms facilitates catch-up withholding if the wages for a previously exempt  employee exceed the low-income limit.

- Vertex Payroll Tax forms for Pennsylvania Local Services Tax - PA\_LST.STATE\_YTD\_PREV


PA\_LST.STATE\_YTD\_PREV  Pass in a dollar amount representing the statewide taxes paid at a previous employer.  You must supply this field if you do not supply paycheck requests that have year-to-date tax  amounts from past pay periods, but no gross pay for the current pay period. (The information  is needed for Payroll Tax to confirm that the statewide $52 tax limit is not exceeded.)   You can omit this form if the employee did not work for any other Pennsylvania employer  in the calendar year.      Effect on accumulation  Because this field has statewide applicability, if this form is present and populated  on any location with a Pennsylvania GeoCode and a Tax ID of 536, Payroll Tax uses this value as the statewide year-to-date total from previous employers and does  not accumulate tax amounts from other locations.      Conflicting values  If two or more work locations with PA\_LST.STATE\_YTD have conflicting values, calculation  ends.      PA\_LST.LI\_EXEMPTEDPAYPERIOD  This value displays after tax is calculated:  * 0 - A low-income exemption is not valid for this pay period * 1 - A low-income exemption is valid for this pay period  Add this return value to the year-to-date total passed in through the LI\_EXEMPTEDPAYPERIODSYTD  field, in case a catch-up calculation is needed.      Change in pay period frequency  Vertex recommends that you adjust the value of PA\_LST.LI\_EXEMPTEDPAYPERIODSYTD to  base the catch-up calculation on a value that reflects the new pay frequency.


Follow up:
Sorry, which of the following topics are you interested in?
1. Tax Calculation and Configuration in Vertex O Series and Vertex Cloud
2. Payroll Tax Calculation and Forms in Vertex Payroll
3. Pennsylvania Local Services Tax Forms and Specifics
4. something else


In [83]:

def get_name_from_choice(choice, follow_up_prompt):
    """
    Takes the follow up prompt and the user's choice and returns the name of the cluster
    """
    # escape the dot so it matches the literal "." after each option number
    items = re.findall(r"(\d+)\. (.+)", follow_up_prompt)
    items = dict(items)
    return items[str(choice)]

# test the function
choice = 1
chosen = get_name_from_choice(choice, response["follow_up"])

# add chosen to the query and get the results

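# strip double quotes from the chosen label before appending it to the query
chosen_clean = chosen.replace('"', "")
updated_query = f"{query} ({chosen_clean})"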
response2 = get_follow_up(updated_query)

# print the title + headings for the top 3 results
print("Query:\n", updated_query, "\n")
print("Top 3 results:")
for idx, item in enumerate(response2["top_3"]):
    print("-", combine_title_heading(item))
    markdown = md(item["content"])
    display(markdown)

# pretty print the results
print(f"\nFollow up:\n{response2['follow_up']}")



Query:
 How do you calculate tax in the state of PA? (Tax Calculation and Configuration in Vertex O Series and Vertex Cloud) 

Top 3 results:
- Pennsylvania State Specifics


Pennsylvania State Specifics  This article provides information about configuring Pennsylvania returns and prepayments  in Vertex Cloud.  Note: For specific information issued by the Pennsylvania Department of Revenue, refer to  https://revenue.pa.gov.     Returns  Enter your license number as the registration number.  Vertex Cloud supports the following state returns:  

| Return | Notes |
| --- | --- |
| Sales/Use/Hotel Tax (PA-3) | This return has the optional EDI Reference Schedule. This schedule shows data that will be generated in the EDI file. This schedule is informational only, and is not required to generate an EDI file. Choose how to report location data. You can report all locations on one return, or you can report each location on a separate return. You must file this return electronically. |
| Prepayment Form for PA-3 | If you are required to make estimated prepayments, register this form and configure a prepayment amount. This form is informational only. Do not submit this form to the state. |


    Prepayments  If you are required to make prepayments:  1. Register the Prepayment Form for PA-3 in Vertex Cloud and configure a prepayment amount. 2. Vertex Cloud uses this prepayment form to calculate the estimated prepayment amount  for the next reporting month. 3. Use this prepayment amount when you file your tax return for that reporting month. 4. Reconcile the prepayment amount on the tax return for the next reporting month.  Do not submit the prepayment form to the state.

- Calculate Taxes in Vertex Cloud


Calculate Taxes in Vertex Cloud  This video provides an overview of the Calculate Taxes feature in Vertex Cloud.  Vertex Cloud automatically calculates tax for transactions passed from your host system.  You can also use the Calculate Taxes feature to manually calculate taxes. For example,  you might want to test your configuration, troubleshoot issues, or give a quote to  a customer. If you have the Admin role, you can optionally post these manual tax calculations  to the Tax Journal for inclusion on the associated returns.     Before you begin calculating taxes  You must have one of the following roles to use the Calculates Taxes feature:  * Vertex Cloud Indirect Tax Admin – This role allows you to calculate taxes and post  transactions. * Vertex Cloud Indirect Tax Read Only – This role allows you to calculate taxes only.  Refer to the article User Roles for more information.     Calculating taxes  To manually calculate taxes, select Calculate Taxes > Calculate Taxes on the Vertex Cloud menu. The Calculate Taxes page displays the following sections, which  are described in the rest of this article:  * Transaction Details * Shipping Information * Discount Codes (if your host system supports discount codes) * Line Items * Results * Post or Close/Cancel the Transaction

- Configure a Professional Tax Calculations Subscription in Vertex Cloud


Configure a Professional Tax Calculations Subscription in Vertex Cloud  A Professional Tax Calculations subscription gives you the ability to configure taxability  mappings, manage customer exemptions, and calculate taxes. You cannot generate signature-ready  PDF returns.  If you have a Professional Tax Calculations subscription, you must configure a company  and enable at least one jurisdiction. Locations are optional. Refer to the following  articles for specific instructions:  * Add a Company * Enable a State, Territory, or Province Jurisdiction * Configure a Location (optional)  If you have a Professional Tax Calculations subscription, you can export data from  the reporting database for a specific company. You can export data for a filing period  or a data range. You can then import the exported data into a returns-processing application  and generate returns. Refer to the U.S. and Canada Compliance File and International Compliance File articles for details.     Related articles  * Manage my subscription * Edit Your Account Contact Information * Add a User * Change Your Password * Create a credential in Vertex Cloud * Manage Access to your Referred Vertex Cloud Account * Vertex Cloud Security


Follow up:
Sorry, which of the following topics are you interested in?
1. Tax Calculation and Configuration in Vertex O Series for Leasing
2. Tax Calculation and Adjustments in Vertex Cloud
3. Account Setup and Configuration in Vertex Cloud
4. something else
