# KMeans Clustering with LLM-Augmented Labels

This notebook uses scikit-learn's `KMeans` to group embeddings produced by a neural network
into `n` clusters. The members of each cluster are treated as a bundle of similar texts, and
each bundle is passed to an LLM, which generates a descriptive label for that cluster.
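
As a rough illustration of the clustering and bundling steps, here is a minimal sketch on toy data. The 2-D vectors and the short texts are made-up stand-ins for real embeddings and documents, and `n_init=10` is an assumption chosen to soften the sensitivity to initialization; none of this is taken from the cells that follow.

```python
from collections import defaultdict

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D "embeddings": two tight groups standing in for real text vectors.
texts = ["sales tax rates", "tax rate lookup", "add a user", "reset a password"]
vectors = np.array([[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]])

# n_init=10 reruns the initialization and keeps the best fit;
# random_state makes the run repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)

# Bundle the texts by cluster assignment; each bundle would go to the LLM.
bundles = defaultdict(list)
for text, label in zip(texts, kmeans.labels_):
    bundles[int(label)].append(text)

for label, members in sorted(bundles.items()):
    print(label, members)
```

Which group receives label `0` and which receives `1` is arbitrary, so downstream code should not rely on the numbering.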

## Pros and Cons of Approach

| Pros                                                | Cons                                      |
| --------------------------------------------------- | ----------------------------------------- |
| No need for labeled data                            | LLMs are slow to generate labels          |
| Can be used to generate labels for any type of data | Clustering is sensitive to initialization |


In [4]:
import dotenv
import os
import sys
import json
import re

from datasets import load_dataset
from sklearn.cluster import KMeans

sys.path.append("../")

from fns import vector_search, get_embeddings

dotenv.load_dotenv()

# Embed the query and retrieve the nearest documents


query = "How do you calculate tax in the state of PA?"
embeddings = get_embeddings(query)["data"][0]["embedding"]

subset = vector_search(
    embeddings,
    k=120, # max = 50 without pagination
)

def xf_results(results, embeddings=False):
    """Keep only the fields used downstream; optionally include each hit's content vector."""
    essential = []
    for result in results:
        prep = {
            "title": result["title"], 
            "heading": result["heading"],
            "content": re.sub(r"\xa0|\t", " ", result["content"].strip()), 
            "url": json.loads(result["metadata"])["community_url"],
            "product": result["product"],
            "score": result["@search.score"]
        }
        if embeddings:
            prep["embeddings"] = result["content_vector"]
        essential.append(prep)
    return essential
        
essentials = xf_results(subset)
# print the number of results
print(f"Number of results: {len(essentials)}")

essentials[50:53]

batch: 50
url: https://copilot-dev-primary-search-service-eastus2.search.windows.net/indexes/2023-12-29/docs/search?api-version=2023-11-01
batch: 50
url: https://copilot-dev-primary-search-service-eastus2.search.windows.net/indexes/2023-12-29/docs/search?api-version=2023-11-01&skip=50
batch: 20
url: https://copilot-dev-primary-search-service-eastus2.search.windows.net/indexes/2023-12-29/docs/search?api-version=2023-11-01&skip=50
Number of results: 120


[{'title': 'Look up standard tax rates in O Series',
  'heading': 'Look up tax rates by jurisdiction',
  'content': 'Look up tax rates by jurisdiction  Searching by jurisdiction returns only the rates for the selected jurisdictions. To  look up rates for a jurisdiction:  1. Navigate to Tools > Rate Lookup > By Jurisdiction. 2. Click next to the Jurisdictions field to display the Select Jurisdictions dialog box and select jurisdictions for  which you want to see rates. 3. To refine your search, click Advanced Search. See "Advanced Search criteria" in this article for details. 4. Click Search. All tax-rate data that O Series found for the criteria displays.  This example requests the tax rates for both Pennsylvania and Philadelphia and is  useful for seeing the components of the total 8% sales tax that is charged for purchases  in Philadelphia.     Look up tax rates by Tax Area ID  Searching by tax area returns only the rates for the selected Tax Area ID.  1. Navigate to Tools > Rate Loo
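
The `k=120` request above is served in batches of 50, 50, and 20. The internals of `vector_search` are not shown in this notebook, so the following is only a sketch of how page offsets could be planned, assuming a page size of 50 (taken from the `max = 50` comment). `plan_batches` is a hypothetical helper, not part of `fns`.

```python
def plan_batches(k, page_size=50):
    """Return (skip, size) pairs that together cover the first k results."""
    return [(skip, min(page_size, k - skip)) for skip in range(0, k, page_size)]


print(plan_batches(120))  # [(0, 50), (50, 50), (100, 20)]
```

Under this scheme each batch starts where the previous one ended, so the third request for 120 results would use an offset of 100.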

In [5]:
from sklearn.preprocessing import MinMaxScaler
import tiktoken

# use text-embedding-ada for tokenization
tokenizer = tiktoken.encoding_for_model("text-embedding-ada-002")

# separate the embeddings from each dict in the list
embeddings = [x["embeddings"] for x in xf_results(subset, embeddings=True)]

# scale the embeddings
scaler = MinMaxScaler()
embeddings = scaler.fit_transform(embeddings)

clusters = 4

# fit a kmeans model to the embeddings
kmeans = KMeans(n_clusters=clusters, random_state=0).fit(embeddings)

# add the cluster labels to the essentials
for i, label in enumerate(kmeans.labels_):
    essentials[i]["cluster"] = label

# create a mapping of the cluster labels to the items in the cluster
cluster_map = {}
for i, label in enumerate(kmeans.labels_):
    if label not in cluster_map:
        cluster_map[label] = []
    cluster_map[label].append(essentials[i])

# sort the clusters by their numeric label
cluster_map = dict(sorted(cluster_map.items()))

def combine_title_heading(item):
    return item["title"] + " - " + item["heading"] if item["heading"] not in item["title"] else item["title"]

prompt_cluster = ""
# for each cluster, append its label and the titles of all its items to the prompt
for label, cluster in cluster_map.items():
    prompt_cluster += f"\nCluster {label}\n"
    for item in cluster:
        prompt_cluster += "- " + combine_title_heading(item) + "\n"

print(prompt_cluster)
# print the token count of the prompt
print("tokens: ", len(tokenizer.encode(prompt_cluster)))



Cluster 0
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining stream payment taxability
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining upfront payment taxability
- Purchase Order events in O Series - Example of a purchase order transaction
- Three-Tier Tax Structure in Vertex Cloud
- Maximum Taxes in Vertex Cloud - Tennessee
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining stream payment taxability
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining upfront payment taxability
- Purchase Order events in O Series - Example of a purchase order transaction
- Three-Tier Tax Structure in Vertex Cloud
- Maximum Taxes in Vertex Cloud - Tennessee
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining stream payment taxability

Cluster 1
- Vertex Payroll Tax forms for Pennsylvania Local Services Tax
- V
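
The listing above repeats several titles within a cluster, and every repeat adds tokens to the prompt. One option, sketched here as an assumption and not as the notebook's method, is to drop repeated titles while building the prompt text. `build_cluster_prompt` is a hypothetical helper that takes a plain `{label: [title, ...]}` mapping and produces the same layout as `prompt_cluster`.

```python
def build_cluster_prompt(cluster_titles):
    """Render {label: [title, ...]} as bulleted prompt text, dropping repeated titles."""
    lines = []
    for label, titles in sorted(cluster_titles.items()):
        lines.append(f"\nCluster {label}")
        # dict.fromkeys keeps the first occurrence of each title, in order
        lines.extend(f"- {title}" for title in dict.fromkeys(titles))
    return "\n".join(lines) + "\n"


example = {1: ["Payroll forms"], 0: ["Tax tiers", "Tax tiers", "Maximum taxes"]}
print(build_cluster_prompt(example))
```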

In [6]:
from fns import messages_prompt

# ask LLM to create a name for each cluster

system_prompt = """
You are a helpful assistant. You will be provided with a list of clusters of similar titles.
Please provide a name for each cluster based on the titles in the cluster.
Make sure the names are descriptive, concise, and unique (non-overlapping).

# Example input:

Cluster 0:
- Set up your organization’s taxpayers in O Series
- Set up taxpayer registrations in O Series
- Set up your organization’s taxpayers in O Series - The General tab

Cluster 1:
- Your taxpayer hierarchy in O Series - Enabling Adopt Parent Setup
- Add a Single Mapping in Vertex Cloud - Steps for adding a mapping
- Bulk upload of taxability drivers in O Series - Prepare the taxability driver upload file

Cluster 2:
- Set up taxpayer registrations in O Series - Add imposition registration details
- United Kingdom electronic filing for Making Tax Digital (MTD, VAT 100) in VAT Compliance 
- Slovakia electronic filing for Intrastat - Arrivals/Dispatch in VAT Compliance 

# Example output:

- 0: taxpayer registration or organization setup
- 1: taxpayer hierarchy and taxability drivers
- 2: VAT and foreign electronic filing

"""

results = messages_prompt([{
    "role": "system",
    "content": system_prompt
}, {
    "role": "user",
    "content": prompt_cluster
}])

# print the results
print(results)


- Cluster 0: Tax payment options and configurations
- Cluster 1: Pennsylvania Local Services Tax forms and calculations
- Cluster 2: Payroll tax forms and calculations for various states
- Cluster 3: O Series tax setup and configuration


In [7]:
import re

# extract the cluster names from the llm output
cluster_names = re.findall(r"(\d+): (.+)", results)

print("Cluster names:", cluster_names)
cluster_names = dict(cluster_names)

cluster_names

Cluster names: [('0', 'Tax payment options and configurations'), ('1', 'Pennsylvania Local Services Tax forms and calculations'), ('2', 'Payroll tax forms and calculations for various states'), ('3', 'O Series tax setup and configuration')]


{'0': 'Tax payment options and configurations',
 '1': 'Pennsylvania Local Services Tax forms and calculations',
 '2': 'Payroll tax forms and calculations for various states',
 '3': 'O Series tax setup and configuration'}
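
The LLM's reply does not always follow one exact format: the example in the system prompt uses `- 0: name`, while the reply above uses `- Cluster 0: name`. A slightly stricter parser, sketched here under the assumption that each name sits on its own line, anchors the match to the start of the line so that digits elsewhere in the reply are not mistaken for a cluster id. `parse_cluster_names` is a hypothetical name.

```python
import re

# Optional bullet, optional "Cluster" prefix, then "<id>: <name>" on one line.
_NAME_LINE = re.compile(
    r"^\W*(?:cluster\s+)?(\d+)\s*:\s*(.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def parse_cluster_names(llm_output):
    """Map cluster id (int) to name for every line that matches the pattern."""
    return {int(num): name for num, name in _NAME_LINE.findall(llm_output)}


reply = "- Cluster 0: Tax payment options\n- 1: Payroll tax forms \n"
print(parse_cluster_names(reply))
```

Returning integer keys avoids the `str(k)` conversion when the names are later matched against the KMeans labels.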

In [8]:
# rename the cluster keys to the cluster names
clustered_map = {cluster_names[str(k)]: v for k, v in cluster_map.items()}

# print the first 3 items in each cluster
for label, cluster in clustered_map.items():
    print(f"\n{label}")
    for item in cluster[:3]:
        print("- " + combine_title_heading(item))



Tax payment options and configurations
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining stream payment taxability
- Case study: Determine tax payment options for Vertex O Series for Leasing - Determining upfront payment taxability
- Purchase Order events in O Series - Example of a purchase order transaction

Pennsylvania Local Services Tax forms and calculations
- Vertex Payroll Tax forms for Pennsylvania Local Services Tax
- Vertex Payroll Tax forms for Pennsylvania Local Services Tax - PA\_LST.STATE\_YTD\_PREV
- Vertex Payroll Tax forms for Pennsylvania Local Services Tax - Use the forms for employees with multiple work locations

Payroll tax forms and calculations for various states
- Vertex Payroll Tax Forms for Pennsylvania
- Vertex Payroll Tax form for Pennsylvania-Maryland reciprocal exception
- Pennsylvania State Specifics

O Series tax setup and configuration
- Look up standard tax rates in O Series - Look up tax rates by jurisdiction
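
Re-keying by name assumes the LLM returned exactly one distinct name per cluster. If a label is missing from `cluster_names`, the dict lookup raises `KeyError`, and if two clusters receive the same name, the later one silently replaces the earlier. A defensive variant, offered as a sketch with a hypothetical `name_clusters` helper and placeholder fallback names, could look like this:

```python
def name_clusters(cluster_map, cluster_names):
    """Re-key clusters by LLM name, with fallbacks for missing or repeated names."""
    named = {}
    for label, items in cluster_map.items():
        # fall back to a generic name when the LLM skipped this label
        name = cluster_names.get(str(label), f"Cluster {label}")
        if name in named:
            # keep names unique so no cluster is overwritten
            name = f"{name} ({label})"
        named[name] = items
    return named


print(name_clusters({0: ["a"], 1: ["b"], 2: ["c"]}, {"0": "Setup", "1": "Setup"}))
```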


## If the user down-votes the answer...

In [9]:
# takes the clustered_map and returns a prompt to the user for follow up questions
def follow_up_prompt(clustered_map):
    prompt = "Sorry, which of the following topics are you interested in?\n"
    index = 0
    for label, cluster in clustered_map.items():
        index += 1
        # strip any double quotes the LLM wrapped around the name
        name = label.replace('"', "")
        prompt += f"{index}. {name}\n"
    prompt += f"{index + 1}. something else"
    return prompt

# print the follow up prompt
print(follow_up_prompt(clustered_map))

Sorry, which of the following topics are you interested in?
1. Tax payment options and configurations
2. Pennsylvania Local Services Tax forms and calculations
3. Payroll tax forms and calculations for various states
4. O Series tax setup and configuration
5. something else


In [15]:
# Putting it all together

# function takes the user query and returns the follow up prompt
def get_follow_up(query, k = 50, clusters=3):
    """
    Takes the user query and returns a dict with the following keys:
    - top_3: the top 3 results from the vector search
    - follow_up: the follow up prompt
    """
    embeddings = get_embeddings(query)["data"][0]["embedding"]

    subset = vector_search(
        embeddings,
        k=k,
    )

    essentials = xf_results(subset)
    embeddings = [x["embeddings"] for x in xf_results(subset, embeddings=True)]
    scaler = MinMaxScaler()

    embeddings = scaler.fit_transform(embeddings)
    kmeans = KMeans(n_clusters=clusters, random_state=0).fit(embeddings)

    for i, label in enumerate(kmeans.labels_):
        essentials[i]["cluster"] = label
    cluster_map = {}
    for i, label in enumerate(kmeans.labels_):
        if label not in cluster_map:
            cluster_map[label] = []
        cluster_map[label].append(essentials[i])
    cluster_map = dict(sorted(cluster_map.items()))

    prompt_cluster = ""
    for label, cluster in cluster_map.items():
        prompt_cluster += f"\nCluster {label}\n"
        for item in cluster:
            prompt_cluster += "- " + combine_title_heading(item) + "\n"
            
    system_prompt = """
    You are a helpful assistant. You will be provided with a user's query and a
    list of clusters of similar titles. Please provide a name for each cluster
    based on the titles in the cluster as they relate to the query. Make sure
    the names are comprehensive, elaborative and unique (non-overlapping).
    """
    results = messages_prompt([{
        "role": "system",
        "content": system_prompt
    }, {
        "role": "user",
        "content": "query:\n\n" + query + "\n\nclusters:\n\n" + prompt_cluster
    }])

    cluster_names = re.findall(r"(\d+): (.+)", results)
    cluster_names = dict(cluster_names)
    clustered_map = {cluster_names[str(k)]: v for k, v in cluster_map.items()}
    #print(clustered_map)
    return {
        "top_3": essentials[:3],
        "follow_up": follow_up_prompt(clustered_map)
    }



In [16]:
from IPython.display import display, Markdown as md


cause_effect_map = {
    "When I am adding a user in O Series, how do I get support?":                                                    
        "user support",
    "What products does Vertex have?":                                                                               
        "products",
    "How do I add a company?":                                                                                       
        "trigger follow-up with the user for clarification",
    "How do I play pickleball?":                                                                                     
        "no results",
    "Where do you set up regional preferences in O Series?":                                                         
        "regional preferences",
    "A new district tax has become effective in TX, will my setup pick up the new tax?":                             
        "verify the config",
    "The business just opened an office in London, how do I get the system to calc tax there?":                      
        "explain how to setup new geographic registration and explain that more detailed configuration is possible",
    "I have a customer indicating that the calculation result is incorrect. How do I verify the calculation result?":
        "using transaction tester to simulate the calculation and see the results and identify follow-up options to explore configuration",
    "What are the options to setup product taxability if I cannot find it in the Vertex taxability content?":        
        "creating taxability categories and creating tax rules"
}

# get the first query
query = list(cause_effect_map.keys())[0]
expect = cause_effect_map[query]

response = get_follow_up(query, k=100, clusters=4)

# print the title + headings for the top 3 results
print("Query:\n", query, "\n")
print("Top results:")
for idx, item in enumerate(response["top_3"]):
    print("-", combine_title_heading(item).replace('"', ""))
    markdown = md(item["content"])
    display(markdown)

# pretty print the results
print(f"\nFollow up:\n{response['follow_up']}")

batch: 50
url: https://copilot-dev-primary-search-service-eastus2.search.windows.net/indexes/2023-12-29/docs/search?api-version=2023-11-01
batch: 50
url: https://copilot-dev-primary-search-service-eastus2.search.windows.net/indexes/2023-12-29/docs/search?api-version=2023-11-01&skip=50
Query:
 When I am adding a user in O Series, how do I get support? 

Top results:
- Set up users in O Series


Set up users in O Series  When a member of your organization needs Vertex® O Series access, a Master Administrator or Partition System Administrator - depending on the partitions in which this user needs to work - can set up a user via the Users feature.  Note: This feature is available in the O Series On-Premise and O Series On-Demand deployments only. It is not available in O Series Cloud. For information about setting up users in O Series Cloud, go here.  To set up a user:  1. Navigate to System > Security > Users.  The list of all users is displayed. 2. Complete one of these steps:   * To create a new user, click Add a User to display the Add User page.  * To edit an existing user, click Edit in the Actions column to display the Edit User page. 3. Complete the fields according to "Field descriptions" below. 4. Click Save User when you are satisfied with your settings for this user.  To view an existing user, click View in the Actions column to display the View User page.

- Users in O Series


Users in O Series  There is one thing that every member of your organization needs to take advantage  of Vertex® O Series On-Premise and On-Demand features - to be an O Series user. Users are the people in your organization who need access to O Series functionality.  Here's what you can do with the Users feature:  * Create O Series users. * Assign them a role or roles to define their range of access to specific O Series data and functionality. * Assign them to a partition or partitions. * Change user names. * Reset and change passwords. * Deactivate users. * Search for users by Partition, User Name, or E-Mail.  Note: This feature is available in the O Series On-Premise and O Series On-Demand deployments only. It is not available in O Series Cloud. For information about setting up users in O Series Cloud, go here.     Who can set up a user?  * A Master Administrator can manage users who have access to all partitions. * A Partition System Administrator is responsible for the users and data in a given partition. This person manages users whose access is limited to a given partition.

- Enable Enterprise Single Sign-On (SSO) in O Series On-Demand


Enable Enterprise Single Sign-On (SSO) in O Series On-Demand  Single Sign-On (SSO) strengthens system security and makes users' lives easier by  reducing password fatigue. SSO for Vertex Vertex® O Series On-Demand (O Series) is now available upon request from Customer Support.  Note: The process for implementing SSO requires a number of steps to be completed by Vertex  Consulting. For those who prefer it, a self-service option is planned for a later  date that will not require assistance from Vertex Consulting.  Here is the process:  1. Submit a Customer Support request to have SSO enabled on your On-Demand instance. 2. Receive confirmation that your request has been forwarded to Vertex Consulting and  has been placed in the implementation queue. 3. Vertex Consulting begins work when your request reaches the top of the queue. They  will assist you in setting up multi-partition user functionality and guide you through  user consolidation and conversion of user names.     Related articles  User consolidation


Follow up:
Sorry, which of the following topics are you interested in?
1. User Management in O Series
2. O Series On-Premise Implementation and Support
3. Authentication and Access Management in O Series
4. Exemption Certificate Portal Account Management
5. something else


In [17]:
def strip_first_line(s):
    """
    takes a string that wraps over multiple lines and removes the first line
    """
    return s[s.find("\n")+1:]

test_query = "When I am adding a user in O Series, how do I get support?\nuser support\nother stuff"
test_query = strip_first_line(test_query)
print(test_query)

user support
other stuff
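
Note that a string with no newline comes back unchanged, because `find` returns `-1` and the slice then starts at index 0. An equivalent formulation with `str.partition`, shown only as an illustrative alternative under the hypothetical name `drop_first_line`, makes that case explicit:

```python
def drop_first_line(s):
    """Remove the first line of s; return s unchanged if it has only one line."""
    _head, sep, rest = s.partition("\n")
    return rest if sep else s


print(drop_first_line("first\nsecond\nthird"))
print(drop_first_line("only one line"))
```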


In [18]:

def get_name_from_choice(choice, follow_up_prompt):
    """
    Takes the follow up prompt and the user's choice and returns the name of the cluster
    """
    items = re.findall(r"(\d+)\. (.+)", follow_up_prompt)
    #print("items", items)
    items = dict(items)
    return items[str(choice)] # "something else": escape hatch to different tool?

# test the function
choice = 4
chosen = get_name_from_choice(choice, response["follow_up"]).replace('"', '')

# add chosen to the query and get the results
updated_query = query + "\n" + chosen if chosen != "something else" else strip_first_line(query)
response2 = get_follow_up(updated_query, k=4)

# print the title + headings for the top 3 results
print("Query:\n" + updated_query + "\n")
print("Top results:")

for idx, item in enumerate(response2["top_3"]):
    print("-", combine_title_heading(item))
    markdown = md(item["content"])
    display(markdown)

# pretty print the results
print(f"\nFollow up:\n{response2['follow_up']}")



batch: 4
url: https://copilot-dev-primary-search-service-eastus2.search.windows.net/indexes/2023-12-29/docs/search?api-version=2023-11-01
Query:
When I am adding a user in O Series, how do I get support?
Exemption Certificate Portal Account Management

Top results:
- Provide Exemption Certificate Portal users with their O Series customer codes


Provide Exemption Certificate Portal users with their O Series customer codes  Note: This publication is for users of Vertex Exemption Certificate Manager (ECM). If you  use Vertex Certificate Center, go here.  For users affiliated with a company or organization that already has an account set  up in Vertex® O Series, these users must know their company's or organization's O Series customer code before accessing the portal.  Vertex recommends setting up your Customer Service representatives so they can provide  these portal users with their customer code in case they do not have it readily available  when attempting to log in.     Related articles  Create ECP accounts overview  Open an ECP instance  Create ECP accounts for existing and new customers  Create credentials for logging in to an account  Manage users in ECP

- Create Exemption Certificate Portal accounts - Overview and considerations


Create Exemption Certificate Portal accounts - Overview and considerations  Note: This publication is for users of Vertex Exemption Certificate Manager (ECM). If you  use Vertex Certificate Center, go here.  Before creating or accessing certificates in ECP, your customer must create an account in ECP.     Account components  An ECP account has two main components:  * ECP customer - The company or organization with which you do business. The ECP customer corresponds to a customer in Vertex® O Series.  If you are using ECW for certificate creation, the customer's information prefills many of the fields  on the Buyer Information page in ECW.  * ECP user - An individual from that company or organization who uses the portal to create,  view, print, or renew certificates at an ECP location. The ECP user corresponds to a customer contact in O Series.  If you are using ECW for certificate creation, the company representative that provides a name on the  Signature Information page in ECW corresponds to the ECP user and an O Series customer contact.  A customer can have multiple users, each with their own contact information.     ECP customer identifiers  An ECP customer has several identifiers - a company name, a company code, and a company  email - that flow across the ECP, O Series, and ECW systems.  Customer's company name across systems  Your customers' company names and company representative names flow across the ECP, O Series, and ECW systems. The following table identifies how the company names and company representative  names are identified in each system:  

| System | Company name entity | Company representative entity |
| --- | --- | --- |
| ECP | Company Name(identified on Create User Account page, under Company Details) | User Name(identified on Create User Account page, under Contact Details) |
| O Series | Customer Name(identified on Customer General tab) | Contact Name(identified on Customer Contacts tab) |
| ECW | Company(identified on Buyer Information page) | Name(Identified on Signature Information page) |

- Set up O Series for use with Exemption Certificate Wizard and Exemption Certificate


Set up O Series for use with Exemption Certificate Wizard and Exemption Certificate  Portal  Note: This publication is for users of Vertex Exemption Certificate Manager (ECM). If you  use Vertex Certificate Center, go here.  You can use Vertex® O Series to accept and manage certificates that are created in ECW and Exemption Certificate Portal (ECP).  Setup tasks  * Add configuration parameters to the vertex.cfg file for ECW and ECP. * Set configuration parameters in your vertex.cfg file for your email service. * Ensure that taxpayer codes match ECW seller codes. * Create a certificate letter template for rejected certificates. * Set parameters for certificate approval and rejection.


Follow up:
Sorry, which of the following topics are you interested in?
1. Setting up O Series for Exemption Certificate Support
2. Managing Exemption Certificate Portal Accounts and Customer Codes
3. Understanding the Exemption Certificate Wizard for Manager Support
4. something else
