# KMeans Clustering with LLM-Augmented Labels

Here we use `scikit-learn`'s `KMeans` clustering algorithm to group embeddings
generated by a neural network into `n` clusters. We then treat each cluster assignment as a
bundle of similar texts. Each of these bundles is passed to an LLM, which generates a
descriptive label for the cluster.
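
As a rough illustration of the clustering step described above, the sketch below runs `KMeans` on a handful of synthetic vectors and bundles the matching titles by cluster. The titles and vectors are made-up stand-ins for real article titles and embeddings, not data from this notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-ins for real article titles and their embeddings.
titles = ["tax rules A", "tax rules B", "user setup A", "user setup B"]
rng = np.random.default_rng(0)
vectors = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(2, 8)),  # first topic
    rng.normal(loc=1.0, scale=0.05, size=(2, 8)),  # second topic
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Bundle titles by cluster assignment; each bundle would then go to the LLM.
bundles = {}
for title, label in zip(titles, kmeans.labels_):
    bundles.setdefault(int(label), []).append(title)

print(bundles)
```

The integer assigned to each cluster is arbitrary, which is one reason the later cells ask an LLM for a readable name instead of showing the raw label.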

## Pros and Cons of Approach

| Pros                                                | Cons                                      |
| --------------------------------------------------- | ----------------------------------------- |
| No need for labeled data                            | LLMs are slow to generate labels          |
| Can be used to generate labels for any type of data | Clustering is sensitive to initialization |
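
To make the initialization point in the table concrete, here is a small sketch on synthetic data (an assumption; the notebook's real embeddings are not used). It compares several single-initialization fits against one fit that tries ten initializations and keeps the best, which is what the `n_init` parameter of `KMeans` controls.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
points = rng.normal(size=(60, 16))  # synthetic stand-in for embeddings

# One random initialization per seed: inertia can vary from run to run.
single_runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(points).inertia_
    for seed in range(5)
]

# Several initializations per fit: scikit-learn keeps the lowest-inertia result.
multi = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(points)

print("single-init inertias:", [round(v, 2) for v in single_runs])
print("best-of-10 inertia:  ", round(multi.inertia_, 2))
```

Fixing `random_state`, as the cells below do, makes a run reproducible but does not by itself make the chosen clustering the best one.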


In [1]:
import dotenv
import os
import sys
import json
import re

from datasets import load_dataset
from sklearn.cluster import KMeans

sys.path.append("../")

from fns import search_kb, get_embeddings

dotenv.load_dotenv()

# Embed the query and search the knowledge base


# query = "How do you calculate tax in the state of PA?"
# query = "What are the required user authentication parameters for O Series On-Premise User Management and Authentication"
query="What does the Includes Taxable Amount check box indicate?"
embeddings = get_embeddings(query)

subset = search_kb(
    query=query,
    embeddings=embeddings,
    k=30, # max = 50 without pagination
)

# print(subset)
def xf_results(results, embeddings=False):
    """Reduce raw search hits to the fields used below; optionally keep the content vector."""
    essential = []
    for result in results:
        metadata = json.loads(result["metadata"])
        # print(result)
        prep = {
            "title": result["title"], 
            "heading": result["heading"],
            "content": re.sub(r"\xa0|\t", " ", result["content"].strip()), 
            "url": metadata["community_url"],
            "product": result["product"],
            "score": result["@search.score"]
        }
        if embeddings:
            prep["embeddings"] = result["content_vector"]
        essential.append(prep)
    return essential
        
essentials = xf_results(subset["value"])
# print the number of results
print(f"Number of results: {len(essentials)}")

# essentials[50:52]

Number of results: 30
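
The sketch below shows what the transformation in `xf_results` does to a single search hit. The field names are the ones the function reads above; every value in `sample_hit` is hypothetical, and `normalize_content` is a helper name introduced only for this illustration.

```python
import json
import re

# Hypothetical hit, shaped like the fields xf_results reads above.
sample_hit = {
    "title": "Tax-inclusive processing parameters in O Series",
    "heading": "Overview",
    "content": "  Tax\xa0inclusive\tprocessing  ",
    "metadata": json.dumps({"community_url": "https://example.com/article"}),
    "product": "O Series",
    "@search.score": 0.87,
    "content_vector": [0.1, 0.2, 0.3],
}

def normalize_content(text):
    """Trim the text, then replace non-breaking spaces and tabs with plain spaces."""
    return re.sub(r"\xa0|\t", " ", text.strip())

cleaned = normalize_content(sample_hit["content"])
# The metadata field arrives as a JSON string, so it is decoded before use.
url = json.loads(sample_hit["metadata"])["community_url"]

print(repr(cleaned))
print(url)
```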


In [2]:
from sklearn.preprocessing import MinMaxScaler
import tiktoken

# use text-embedding-ada for tokenization
tokenizer = tiktoken.encoding_for_model("text-embedding-ada-002")

# separate the embeddings from each dict in the list
embeddings = [x["embeddings"] for x in xf_results(subset["value"], embeddings=True)]

# scale the embeddings
scaler = MinMaxScaler()
embeddings = scaler.fit_transform(embeddings)

clusters = 4

# fit a kmeans model to the embeddings
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(embeddings)

# add the cluster labels to the essentials
for i, label in enumerate(kmeans.labels_):
    essentials[i]["cluster"] = label

# create a mapping of the cluster labels to the items in the cluster
cluster_map = {}
for i, label in enumerate(kmeans.labels_):
    if label not in cluster_map:
        cluster_map[label] = []
    cluster_map[label].append(essentials[i])

# sort the clusters by their numeric label
cluster_map = dict(sorted(cluster_map.items()))

def combine_title_heading(item):
    return item["title"] + " - " + item["heading"] if item["heading"] not in item["title"] else item["title"]

prompt_cluster = ""
# for each cluster in the map, append the cluster label and all titles in the cluster to the prompt
for label, cluster in cluster_map.items():
    prompt_cluster += f"\nCluster {label}\n"
    for item in cluster:
        prompt_cluster += "- " + combine_title_heading(item) + "\n"

print(prompt_cluster)
# print the total number of tokens in the prompt
print("tokens: ", len(tokenizer.encode(prompt_cluster)))



Cluster 0
- Include Basis Inclusions in Nontaxable Basis
- Tax Include Round Up Tax
- Limit Tax Inclusive To Extended Price
- Single Jurisdiction VAT Tax Inclusive Optimized Path
- Inclusion rules in O Series - Example 3
- Tax Assist Field Descriptions - T - taxDate
- Tax Assist function: INCL IMP (Included Imposition)
- Tax-inclusive processing parameters in O Series
- Tax Assist function: INCL IMP (Included Imposition) - Using the Included Impositions function in a rule
- Tax-inclusive processing parameters in O Series - Single Jurisdiction VAT Tax Inclusive Optimized Path
- Tax Assist Field Descriptions - T - taxOverride.overrideAsNonTaxable
- Provide general information for an inclusion rule
- Include Charged Tax in Allocated Line Item
- O Series Transaction Tester > Amounts tab for line items

Cluster 1
- Distribute Taxes in Vertex Cloud - Results
- Calculate Taxes in Vertex Cloud - Results
- Distribute Taxes in Vertex Cloud - Totals
- Reconcile by Invoice in Vertex Cloud - Using
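
The grouping and prompt-building steps in this cell can be sketched on their own as follows. The items and labels are hypothetical stand-ins for `essentials` and `kmeans.labels_`, and `defaultdict` is used here only as one possible way to build the map.

```python
from collections import defaultdict

def combine_title_heading(item):
    # Same rule as the notebook: skip the heading when the title already contains it.
    if item["heading"] in item["title"]:
        return item["title"]
    return item["title"] + " - " + item["heading"]

# Hypothetical items and labels standing in for `essentials` and `kmeans.labels_`.
items = [
    {"title": "Calculate Taxes in Vertex Cloud", "heading": "Results"},
    {"title": "Tax Include Round Up Tax", "heading": "Tax Include Round Up Tax"},
    {"title": "Distribute Taxes in Vertex Cloud", "heading": "Totals"},
]
labels = [1, 0, 1]

# Group items under their cluster label.
cluster_map = defaultdict(list)
for item, label in zip(items, labels):
    cluster_map[int(label)].append(item)

# Emit one "Cluster <label>" section per cluster, in label order.
prompt_cluster = ""
for label in sorted(cluster_map):
    prompt_cluster += f"\nCluster {label}\n"
    for item in cluster_map[label]:
        prompt_cluster += "- " + combine_title_heading(item) + "\n"

print(prompt_cluster)
```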

In [3]:
from fns import messages_prompt

# ask LLM to create a name for each cluster

system_prompt = """
You are a helpful assistant. You will be provided with a list of clusters of similar titles.
Please provide a name for each cluster based on the titles in the cluster.
Make sure the names are descriptive, concise, and unique (non-overlapping).

# Example input:

Cluster 0:
- Set up your organization’s taxpayers in O Series
- Set up taxpayer registrations in O Series
- Set up your organization’s taxpayers in O Series - The General tab

Cluster 1:
- Your taxpayer hierarchy in O Series - Enabling Adopt Parent Setup
- Add a Single Mapping in Vertex Cloud - Steps for adding a mapping
- Bulk upload of taxability drivers in O Series - Prepare the taxability driver upload file

Cluster 2:
- Set up taxpayer registrations in O Series - Add imposition registration details
- United Kingdom electronic filing for Making Tax Digital (MTD, VAT 100) in VAT Compliance 
- Slovakia electronic filing for Intrastat - Arrivals/Dispatch in VAT Compliance 

# Example output:

- 0: taxpayer registration or organization setup
- 1: taxpayer hierarchy and taxability drivers
- 2: VAT and foreign electronic filing

"""

results = messages_prompt([{
    "role": "system",
    "content": system_prompt
}, {
    "role": "user",
    "content": prompt_cluster
}])

# print the results
print(results)


- 0: Tax Inclusion and Processing Parameters
- 1: Tax Calculation and Distribution in Vertex Cloud
- 2: Reviewing Taxability and Jurisdiction Analysis
- 3: Setup of Returns and Filing Reports


In [4]:
import re

# extract the cluster names from the llm output
cluster_names = re.findall(r"(\d+): (.+)", results)

print("Cluster names:", cluster_names)
cluster_names = dict(cluster_names)

cluster_names

Cluster names: [('0', 'Tax Inclusion and Processing Parameters'), ('1', 'Tax Calculation and Distribution in Vertex Cloud'), ('2', 'Reviewing Taxability and Jurisdiction Analysis'), ('3', 'Setup of Returns and Filing Reports')]


{'0': 'Tax Inclusion and Processing Parameters',
 '1': 'Tax Calculation and Distribution in Vertex Cloud',
 '2': 'Reviewing Taxability and Jurisdiction Analysis',
 '3': 'Setup of Returns and Filing Reports'}
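
The extraction above assumes the LLM returns exactly one `<number>: <name>` line per cluster. A slightly more defensive variant might look like the sketch below; the function name `parse_cluster_names` and the `Cluster <n>` fallback are assumptions introduced here, not part of the original notebook.

```python
import re

def parse_cluster_names(llm_output, labels):
    """Map each cluster label to the name the LLM gave it, with a fallback."""
    # Anchor on the "- <number>: <name>" bullet shape used in the example output.
    pattern = r"^\s*-?\s*(\d+):\s*(.+?)\s*$"
    found = dict(re.findall(pattern, llm_output, flags=re.MULTILINE))
    # Fallback name (an assumption) for any label the LLM skipped.
    return {label: found.get(str(label), f"Cluster {label}") for label in labels}

llm_output = (
    "- 0: Tax Inclusion and Processing Parameters\n"
    "- 1: Tax Calculation and Distribution in Vertex Cloud\n"
)
names = parse_cluster_names(llm_output, labels=[0, 1, 2])
print(names)
```

With a fallback in place, a missing name no longer raises a `KeyError` when the cluster keys are renamed in the next cell.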

In [5]:
# rename the cluster keys to the cluster names
clustered_map = {cluster_names[str(k)]: v for k, v in cluster_map.items()}

# print the first 3 items in each cluster
for label, cluster in clustered_map.items():
    print(f"\n{label}")
    for item in cluster[:3]:
        print("- " + combine_title_heading(item))



Tax Inclusion and Processing Parameters
- Include Basis Inclusions in Nontaxable Basis
- Tax Include Round Up Tax
- Limit Tax Inclusive To Extended Price

Tax Calculation and Distribution in Vertex Cloud
- Distribute Taxes in Vertex Cloud - Results
- Calculate Taxes in Vertex Cloud - Results
- Distribute Taxes in Vertex Cloud - Totals

Reviewing Taxability and Jurisdiction Analysis
- Fields and Values on the Tax Assist Advanced selector - Included Impositions
- Review the taxability of a communications service code - Field descriptions for the Taxability Analysis tab Part 2
- Review the taxability of a commodity code in O Series - Drill down to see the jurisdiction of interest

Setup of Returns and Filing Reports
- Setting up returns and filing reports - You may choose not to select this check box if the same person reviews, approves,


## If user down-votes the answer...

In [6]:
# takes the clustered_map and returns a prompt to the user for follow up questions
def follow_up_prompt(clustered_map):
    prompt = "Sorry, which of the following topics are you interested in?\n"
    index = 0
    for label, cluster in clustered_map.items():
        label = label.replace("'", "")
        index += 1
        prompt += f"{index}. {label}\n"
    prompt += f"{index + 1}. something else"
    return prompt

# print the follow up prompt
print(follow_up_prompt(clustered_map))

Sorry, which of the following topics are you interested in?
1. Tax Inclusion and Processing Parameters
2. Tax Calculation and Distribution in Vertex Cloud
3. Reviewing Taxability and Jurisdiction Analysis
4. Setup of Returns and Filing Reports
5. something else


In [7]:
# Putting it all together

# function takes the user query and returns the follow up prompt
def get_follow_up(query, k = 50, clusters=3):
    """
    Takes the user query and returns a dict with the following keys:
    - top_3: the top 3 results from the vector search
    - follow_up: the follow up prompt
    """
    embeddings = get_embeddings(query)

    subset = search_kb(
        query=query,
        embeddings=embeddings,
        k=k,
    )["value"]

    essentials = xf_results(subset)
    embeddings = [x["embeddings"] for x in xf_results(subset, embeddings=True)]
    scaler = MinMaxScaler()

    embeddings = scaler.fit_transform(embeddings)
    kmeans = KMeans(n_clusters=clusters, random_state=0).fit(embeddings)

    for i, label in enumerate(kmeans.labels_):
        essentials[i]["cluster"] = label
    cluster_map = {}
    for i, label in enumerate(kmeans.labels_):
        if label not in cluster_map:
            cluster_map[label] = []
        cluster_map[label].append(essentials[i])
    cluster_map = dict(sorted(cluster_map.items()))

    prompt_cluster = ""
    for label, cluster in cluster_map.items():
        prompt_cluster += f"\nCluster {label}\n"
        for item in cluster:
            prompt_cluster += "- " + combine_title_heading(item) + "\n"
            
    system_prompt = """
    You are a helpful assistant. You will be provided with a user's query and a
    list of clusters of similar titles. Please provide a name for each cluster
    based on the titles in the cluster as they relate to the query. Make sure
    the names are comprehensive, elaborative and unique (non-overlapping).
    """
    results = messages_prompt([{
        "role": "system",
        "content": system_prompt
    }, {
        "role": "user",
        "content": "query:\n\n" + query + "\n\nclusters:\n\n" + prompt_cluster
    }])

    cluster_names = re.findall(r"(\d+): (.+)", results)
    cluster_names = dict(cluster_names)
    clustered_map = {cluster_names[str(k)]: v for k, v in cluster_map.items()}
    #print(clustered_map)
    return {
        "top_3": essentials[:3],
        "follow_up": follow_up_prompt(clustered_map)
    }
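
One thing `get_follow_up` does not guard against is a search that returns fewer hits than the requested number of clusters; `KMeans` needs at least as many samples as clusters and raises an error otherwise. A minimal sketch of a guard, with `safe_cluster_count` as a hypothetical helper name, could be:

```python
def safe_cluster_count(requested, n_results):
    """Clamp the requested cluster count to the number of available results."""
    if n_results < 1:
        raise ValueError("no search results to cluster")
    return max(1, min(requested, n_results))

# Example: 4 clusters requested, but the search returned only 2 hits.
print(safe_cluster_count(4, 2))
# Example: 3 clusters requested with 50 hits available.
print(safe_cluster_count(3, 50))
```

The result could then be passed as `n_clusters` in place of the raw `clusters` argument.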



In [8]:
from IPython.display import display, Markdown as md


cause_effect_map = {
    "When I am adding a user in O Series, how do I get support?":                                                    
        "user support",
    "What products does Vertex have?":                                                                               
        "products",
    "How do I add a company?":                                                                                       
        "trigger follow-up with the user for clarification",
    "How do I play pickleball?":                                                                                     
        "no results",
    "Where do you set up regional preferences in O Series?":                                                         
        "regional preferences",
    "A new district tax has become effective in TX, will my setup pick up the new tax?":                             
        "verify the config",
    "The business just opened an office in London, how do I get the system to calc tax there?":                      
        "explain how to setup new geographic registration and explain that more detailed configuration is possible",
    "I have a customer indicating that the calculation result is incorrect. How do I verify the calculation result?":
        "using transaction tester to simulate the calculation and see the results and identify follow-up options to explore configuration",
    "What are the options to setup product taxability if I cannot find it in the Vertex taxability content?":        
        "creating taxability categories and creating tax rules"
}

# get the first query
query = list(cause_effect_map.keys())[0]
expect = cause_effect_map[query]

response = get_follow_up(query, k=100, clusters=4)

# print the title + headings for the top 3 results
print("Query:\n", query, "\n")
print("Top results:")
for idx, item in enumerate(response["top_3"]):
    print("-", combine_title_heading(item).replace('"', ""))
    markdown = md(item["content"])
    display(markdown)

# pretty print the results
print(f"\nFollow up:\n{response['follow_up']}")

Query:
 When I am adding a user in O Series, how do I get support? 

Top results:
- Set up users in O Series


Set up users in O Series  When a member of your organization needs Vertex® O Series access, a Master Administrator or Partition System Administrator - depending on the partitions in which this user needs to work - can set up a user via the Users feature.  Note: This feature is available in the O Series On-Premise and O Series On-Demand deployments only. It is not available in O Series Cloud. For information about setting up users in O Series Cloud, go here.  To set up a user:  1. Navigate to System > Security > Users.  The list of all users is displayed. 2. Complete one of these steps:   * To create a new user, click Add a User to display the Add User page.  * To edit an existing user, click Edit in the Actions column to display the Edit User page. 3. Complete the fields according to "Field descriptions" below. 4. Click Save User when you are satisfied with your settings for this user.  To view an existing user, click View in the Actions column to display the View User page.

- Users in O Series


Users in O Series  There is one thing that every member of your organization needs to take advantage  of Vertex® O Series On-Premise and On-Demand features - to be an O Series user. Users are the people in your organization who need access to O Series functionality.  Here's what you can do with the Users feature:  * Create O Series users. * Assign them a role or roles to define their range of access to specific O Series data and functionality. * Assign them to a partition or partitions. * Change user names. * Reset and change passwords. * Deactivate users. * Search for users by Partition, User Name, or E-Mail.  Note: This feature is available in the O Series On-Premise and O Series On-Demand deployments only. It is not available in O Series Cloud. For information about setting up users in O Series Cloud, go here.     Who can set up a user?  * A Master Administrator can manage users who have access to all partitions. * A Partition System Administrator is responsible for the users and data in a given partition. This person manages users whose access is limited to a given partition.

- Enable Enterprise Single Sign-On (SSO) in O Series On-Demand


Enable Enterprise Single Sign-On (SSO) in O Series On-Demand  Single Sign-On (SSO) strengthens system security and makes users' lives easier by  reducing password fatigue. SSO for Vertex Vertex® O Series On-Demand (O Series) is now available upon request from Customer Support.  Note: The process for implementing SSO requires a number of steps to be completed by Vertex  Consulting. For those who prefer it, a self-service option is planned for a later  date that will not require assistance from Vertex Consulting.  Here is the process:  1. Submit a Customer Support request to have SSO enabled on your On-Demand instance. 2. Receive confirmation that your request has been forwarded to Vertex Consulting and  has been placed in the implementation queue. 3. Vertex Consulting begins work when your request reaches the top of the queue. They  will assist you in setting up multi-partition user functionality and guide you through  user consolidation and conversion of user names.     Related articles  User consolidation


Follow up:
Sorry, which of the following topics are you interested in?
1. User and Role Management in O Series
2. Installation and Configuration of O Series
3. User Credentials and Settings in O Series
4. Support and Troubleshooting for O Series
5. something else


In [9]:
def strip_first_line(s):
    """
    takes a string that wraps over multiple lines and removes the first line
    """
    return s[s.find("\n")+1:]

test_query = "When I am adding a user in O Series, how do I get support?\nuser support\nother stuff"
test_query = strip_first_line(test_query)
print(test_query)

user support
other stuff


In [10]:

def get_name_from_choice(choice, follow_up_prompt):
    """
    Takes the follow up prompt and the user's choice and returns the name of the cluster
    """
    items = re.findall(r"(\d+)\. (.+)", follow_up_prompt)
    #print("items", items)
    items = dict(items)
    return items[str(choice)] # "something else": escape hatch to different tool?

# test the function
choice = 4
chosen = get_name_from_choice(choice, response["follow_up"]).replace('"', '')

# add chosen to the query and get the results
updated_query = (query + "\n" + chosen) if chosen != "something else" else strip_first_line(query)
# the second argument is k (number of search results); clusters keeps its default of 3
response2 = get_follow_up(updated_query, k=4)

# print the title + headings for the top 3 results
print("Query:\n" + updated_query + "\n")
print("Top results:")

for idx, item in enumerate(response2["top_3"]):
    print("-", combine_title_heading(item))
    markdown = md(item["content"])
    display(markdown)

# pretty print the results
print(f"\nFollow up:\n{response2['follow_up']}")



Query:
When I am adding a user in O Series, how do I get support?
Support and Troubleshooting for O Series

Top results:
- Set up users in O Series


Set up users in O Series  When a member of your organization needs Vertex® O Series access, a Master Administrator or Partition System Administrator - depending on the partitions in which this user needs to work - can set up a user via the Users feature.  Note: This feature is available in the O Series On-Premise and O Series On-Demand deployments only. It is not available in O Series Cloud. For information about setting up users in O Series Cloud, go here.  To set up a user:  1. Navigate to System > Security > Users.  The list of all users is displayed. 2. Complete one of these steps:   * To create a new user, click Add a User to display the Add User page.  * To edit an existing user, click Edit in the Actions column to display the Edit User page. 3. Complete the fields according to "Field descriptions" below. 4. Click Save User when you are satisfied with your settings for this user.  To view an existing user, click View in the Actions column to display the View User page.

- Users in O Series


Users in O Series  There is one thing that every member of your organization needs to take advantage  of Vertex® O Series On-Premise and On-Demand features - to be an O Series user. Users are the people in your organization who need access to O Series functionality.  Here's what you can do with the Users feature:  * Create O Series users. * Assign them a role or roles to define their range of access to specific O Series data and functionality. * Assign them to a partition or partitions. * Change user names. * Reset and change passwords. * Deactivate users. * Search for users by Partition, User Name, or E-Mail.  Note: This feature is available in the O Series On-Premise and O Series On-Demand deployments only. It is not available in O Series Cloud. For information about setting up users in O Series Cloud, go here.     Who can set up a user?  * A Master Administrator can manage users who have access to all partitions. * A Partition System Administrator is responsible for the users and data in a given partition. This person manages users whose access is limited to a given partition.

- Discuss your O Series On-Demand configuration with Vertex - Open a case


Open a case  When you want to change any O Series configuration parameters that are not available in the O Series user interface, your Tech Lead or backup Tech Lead must open a case with Vertex Customer  Support.  If other members of your implementation team have questions that are unrelated to  O Series configuration, they can also open a case.   Ways to open a case:  * Open the case via the Vertex Community on the web (https://community.vertexinc.com). * Call Vertex Customer Support at 800.281.1900.     Related articles  Analyze your technical requirements for O Series On-Demand  Analyze configuration parameter settings for O Series On-Demand  Overview of O Series On-Demand implementation


Follow up:
Sorry, which of the following topics are you interested in?
1. Vertex Support for O Series Configuration
2. User Setup in O Series
3. Initial Login Guidance for O Series On-Premise
4. something else
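
The choice-handling loop above can be condensed into the sketch below. It is a simplification under stated assumptions: an unknown choice returns `None` instead of raising, and the "something else" option leaves the query unchanged rather than calling `strip_first_line`. `refine_query` is a hypothetical helper name.

```python
import re

def get_name_from_choice(choice, follow_up_prompt):
    """Return the topic name for a numbered choice, or None if it is not listed."""
    # Escape the dot so only "<number>. <name>" lines match.
    pattern = r"^(\d+)\. (.+)$"
    items = dict(re.findall(pattern, follow_up_prompt, flags=re.MULTILINE))
    return items.get(str(choice))

def refine_query(query, chosen):
    """Append the chosen topic to the query unless the user picked the escape hatch."""
    if chosen is None or chosen == "something else":
        return query
    return query + "\n" + chosen

prompt = (
    "Sorry, which of the following topics are you interested in?\n"
    "1. Vertex Support for O Series Configuration\n"
    "2. User Setup in O Series\n"
    "3. something else"
)

chosen = get_name_from_choice(2, prompt)
print(refine_query("How do I add a user?", chosen))
```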
