## CISC 520-90- O-2024/Late Summer - Data Engineering and Mining
## Final Project 
## Customer Service Chatbot Application - Logistics Tracking Function
### Siyan Sun

#### Project Task: 
#### &emsp;To develop a chatbot that helps customers track their orders in real-time through natural, conversational interactions powered by an LLM. 

#### LLM Model:
#### &emsp;microsoft/DialoGPT-small
##### &emsp;&emsp;microsoft/DialoGPT-small is optimized for generating natural, conversational dialogue, perfect for customer service tasks like order tracking. Its lightweight design ensures fast, real-time responses, and it can be easily fine-tuned for specific needs. These features make it an efficient and reliable choice for building a user-friendly chatbot.

#### Approach:
#### &emsp;Step 1 : Collect source dataset for training, check, filter, and clean
##### &emsp;&emsp;&emsp;&emsp;a. Load the customer support dataset* from Hugging Face for training.
##### &emsp;&emsp;&emsp;&emsp;b. Print the dataset to check its columns and size.
##### &emsp;&emsp;&emsp;&emsp;c. Inspect the 'intent' column and list its unique values.
##### &emsp;&emsp;&emsp;&emsp;d. Filter the customer support dataset down to the examples related to order tracking.
##### &emsp;&emsp;&emsp;&emsp;e. Keep only 'instruction' and 'response', and transform them into a dialogue format.
##### &emsp;&emsp;&emsp;&emsp;f. Print all unique placeholders, then replace the placeholders with tokens.
#### &emsp;Step 2 : Preprocess the dataset for training
##### &emsp;&emsp;&emsp;&emsp;a. Load the tokenizer and set the padding token.
##### &emsp;&emsp;&emsp;&emsp;b. Define a tokenization function to tokenize, truncate, and pad the input dialogue.
##### &emsp;&emsp;&emsp;&emsp;c. Process the dataset using the tokenization function and remove the original dialogue column.
##### &emsp;&emsp;&emsp;&emsp;d. Set the dataset format to PyTorch Tensor for easier use in model training or inference.
##### &emsp;&emsp;&emsp;&emsp;e. Set and load training arguments.
##### &emsp;&emsp;&emsp;&emsp;f. Train tokenized_track_order_dataset, and then save the model.
#### &emsp;Step 3 : Collect and clean order and shipment detail dataset
##### &emsp;&emsp;&emsp;&emsp;a. Load the sales order/shipment dataset**.
##### &emsp;&emsp;&emsp;&emsp;b. Print the dataset summary to check the columns.
##### &emsp;&emsp;&emsp;&emsp;c. Pad the order numbers with leading zeros to a fixed length of five characters.
#### &emsp;Step 4 : Build the Customer Service Chatbot Application
##### &emsp;&emsp;&emsp;&emsp;a. Import the necessary Python libraries for data processing, model handling, and conversation management.
##### &emsp;&emsp;&emsp;&emsp;b. Load the sales order/shipment dataset**.
##### &emsp;&emsp;&emsp;&emsp;c. Load the fine-tuned language model and tokenizer to generate responses.
##### &emsp;&emsp;&emsp;&emsp;d. Create a function to generate responses based on user input using the model.
##### &emsp;&emsp;&emsp;&emsp;e. Develop a function to truncate responses at the last sentence-ending punctuation.
##### &emsp;&emsp;&emsp;&emsp;f. Implement functions to extract order and tracking numbers from user input using regular expressions.
##### &emsp;&emsp;&emsp;&emsp;g. Create a function to replace placeholders in responses with actual order details.
##### &emsp;&emsp;&emsp;&emsp;h. Develop a function to clean and format the response by removing placeholders.
##### &emsp;&emsp;&emsp;&emsp;i. Create a function to retrieve order details using the extracted order or tracking numbers.
##### &emsp;&emsp;&emsp;&emsp;j. Build the main function to handle conversation flow and generate responses.
##### &emsp;&emsp;&emsp;&emsp;k. Ensure the main function is executed when the script runs, initializing the tracking assistant.
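
The conversation flow described in Step 4 can be summarized as a three-way routing decision: no identifier in the message, identifier present but unknown, or identifier found. The following is a minimal sketch of that decision only, not the application itself. `SAMPLE_ORDERS` and `route_message` are hypothetical names introduced for illustration, and the in-memory dictionary stands in for the order/shipment dataset.

```python
import re

# Hypothetical in-memory stand-in for the order/shipment table.
SAMPLE_ORDERS = {
    "00042": {"status": "Shipped", "eta": "1/2/18"},
}

def route_message(user_input, orders=SAMPLE_ORDERS):
    """Return (route, payload) describing how a message would be handled."""
    match = re.search(r"\b\d{5}\b", user_input)
    if match is None:
        # No identifier: the real application asks the language model to reply.
        return "llm", None
    record = orders.get(match.group(0))
    if record is None:
        # Identifier present but unknown: a fixed apology message is used.
        return "not_found", match.group(0)
    # Identifier found: a fixed template is filled from the record.
    text = f"Order {match.group(0)} is {record['status']}; ETA {record['eta']}."
    return "template", text

print(route_message("hello"))
print(route_message("where is 00042?"))
print(route_message("where is 99999?"))
```

Only the first route involves the language model; the other two produce deterministic text, which keeps order facts out of the model's hands.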


#### Reference:
##### &emsp;*: "Bitext - Customer Service Tagged Training Dataset for LLM-based Virtual Assistants":
##### &emsp;&emsp;https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset
##### &emsp;**: "Supply Chain-Inventory Management-Data Analyst":
##### &emsp;&emsp;https://www.kaggle.com/datasets/mohammedazarudheen/supply-chain-inventory-management-data-analyst

In [3]:
from datasets import load_dataset

# Load the dataset from hugging face
# https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset
customer_support_dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset")

# Print the dataset to check its columns and size
print(customer_support_dataset)

DatasetDict({
    train: Dataset({
        features: ['flags', 'instruction', 'category', 'intent', 'response'],
        num_rows: 26872
    })
})


In [4]:
# Access the 'intent' column from the training dataset
intents = customer_support_dataset['train']['intent']

# Get the unique values by converting the list to a set
unique_intents = set(intents)

# Print the unique intents
print("Unique intents in the dataset:")
for intent in unique_intents:
    print(intent)


Unique intents in the dataset:
registration_problems
cancel_order
check_payment_methods
check_refund_policy
track_order
check_invoice
switch_account
create_account
delivery_period
get_refund
set_up_shipping_address
edit_account
track_refund
place_order
get_invoice
review
delete_account
contact_customer_service
recover_password
check_cancellation_fee
change_order
newsletter_subscription
payment_issue
delivery_options
change_shipping_address
contact_human_agent
complaint
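
Beyond listing the unique intents, it can help to know how many examples each intent has before filtering, since the size of the `track_order` subset determines how much training data remains. A minimal sketch using `collections.Counter` follows. The list `sample_intents` is a hypothetical stand-in for `customer_support_dataset['train']['intent']`, and its values are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for customer_support_dataset['train']['intent'].
sample_intents = ["track_order", "cancel_order", "track_order", "review"]

intent_counts = Counter(sample_intents)

# most_common() lists intents from the highest count to the lowest.
for intent, count in intent_counts.most_common():
    print(f"{intent}: {count}")
```

Applied to the real column, the same call would show the number of `track_order` rows directly.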


In [5]:
# track_order is the intent most relevant to our chatbot application
intent_to_keep = ['track_order']

def filter_intent(example):
    return example['intent'] in intent_to_keep

# Apply the filter to keep only the track-order examples
track_order_dataset = customer_support_dataset.filter(filter_intent)

In [6]:
# Only the instruction and response are needed; combine them into a dialogue format
def preprocess_function(example):
    instruction = example['instruction'].strip()
    response = example['response'].strip()
    # Combine into a dialogue format
    text = f"User: {instruction}\nAssistant: {response}"
    return {'dialogue': text}

# Apply the preprocess function to the filtered dataset
processed_track_order_dataset = track_order_dataset.map(
    preprocess_function,
    remove_columns=['flags', 'instruction', 'category', 'intent', 'response']
)

# Check the first 5 rows of the processed track-order dataset
print(processed_track_order_dataset['train'][:5])

{'dialogue': ["User: needx help to check the ETA of purchase {{Order Number}}\nAssistant: Your message means a lot! I'm aligned with the idea that you need assistance with checking the Estimated Time of Arrival (ETA) for your purchase with the order number {{Order Number}} {{Order Number}}. To obtain the ETA, you can visit the '{{Order Status}}' section on our website. It should provide you with the most up-to-date information regarding the delivery status of your purchase. If you have any further questions or require additional guidance, please feel free to ask. I'm here to ensure you have a smooth and satisfactory experience!", "User: ETA of purchase {{Order Number}}\nAssistant: Thank you for reaching out! I'm here to assist you with checking the estimated time of arrival (ETA) for your purchase with the order number {{Order Number}}. To get the most up-to-date information on the ETA, I recommend visiting the '{{Order Status}}' section on our website. This will provide you with the a

In [7]:
# The dataset examples contain {{...}} placeholders; collect all of them.
import re

def extract_placeholders(text):
    # Use a regular expression to find every {{...}} placeholder.
    return re.findall(r'\{\{(.*?)\}\}', text)

# Start with an empty list that collects every placeholder occurrence
all_placeholders = []

# Iterate through each example in the training set
for example in processed_track_order_dataset['train']:
    text = example['dialogue']
    placeholders = extract_placeholders(text)
    all_placeholders.extend(placeholders)

# Keep only the unique values
unique_placeholders = set(all_placeholders)

# Print every unique placeholder name
print("Unique placeholders found:")
for placeholder in unique_placeholders:
    print(f"{placeholder}")

Unique placeholders found:
Order Number
Client Name
Online Customer Support Channel
Order Status
Tracking Page
ETA
Email Address
Customer Support Phone Number
Online Order Interaction
Delivery Date
Purchase Status
Order Tracker
Purchase History
Customer Support Live Chat URL
Website URL
Shipping Status
Purchase Details
Track Order
Customer Support Hours
Order Tracking


In [13]:
def replace_placeholders_with_tokens(text):
    if not isinstance(text, str):
        return text
    placeholder_to_token = {
         'Order Status': '<ORDER_STATUS>',
         'Website URL': '<WEBSITE_URL>',
         'Email Address': '<EMAIL_ADDRESS>',
         'Customer Support Phone Number': '<CUSTOMER_SUPPORT_PHONE_NUMBER>',
         'Order Number': '<ORDER_NUMBER>',
         'Online Order Interaction': '<ONLINE_ORDER_INTERACTION>',
         'Client Name': '<CLIENT_NAME>',
         'Purchase Details': '<PURCHASE_DETAILS>',
         'Customer Support Hours': '<CUSTOMER_SUPPORT_HOURS>',
         'Purchase History': '<PURCHASE_HISTORY>',
         'Order Tracking': '<ORDER_TRACKING>',
         'ETA': '<ETA>',
         'Online Customer Support Channel': '<ONLINE_CUSTOMER_SUPPORT_CHANNEL>',
         'Tracking Page': '<TRACKING_PAGE>',
         'Order Tracker': '<ORDER_TRACKER>',
         'Purchase Status': '<PURCHASE_STATUS>',
         'Delivery Date': '<DELIVERY_DATE>',
         'Track Order': '<TRACK_ORDER>',
         'Shipping Status': '<SHIPPING_STATUS>',
         'Customer Support Live Chat URL': '<CUSTOMER_SUPPORT_LIVE_CHAT_URL>'
    }

    for placeholder, token in placeholder_to_token.items():
        text = text.replace(f'{{{{{placeholder}}}}}', token)
    return text

# Apply the placeholder-to-token replacement to every dialogue
def preprocess_replace_tokens_function(example):
    dialogue = replace_placeholders_with_tokens(example['dialogue'])
    return {'dialogue': dialogue}

processed_track_order_dataset = processed_track_order_dataset.map(
    preprocess_replace_tokens_function,
    remove_columns=['dialogue'] 
)

# Check the top row of the track_order_dataset
print(processed_track_order_dataset['train'][0])


{'dialogue': "User: needx help to check the ETA of purchase <ORDER_NUMBER>\nAssistant: Your message means a lot! I'm aligned with the idea that you need assistance with checking the Estimated Time of Arrival (ETA) for your purchase with the order number <ORDER_NUMBER> <ORDER_NUMBER>. To obtain the ETA, you can visit the '<ORDER_STATUS>' section on our website. It should provide you with the most up-to-date information regarding the delivery status of your purchase. If you have any further questions or require additional guidance, please feel free to ask. I'm here to ensure you have a smooth and satisfactory experience!"}
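
The mapping above lists every placeholder by hand. Since each token is simply the placeholder name upper-cased with spaces turned into underscores, the same conversion can be derived with a single regular-expression substitution. The sketch below is an alternative under that assumption, not the method used in this notebook. `PLACEHOLDER_PATTERN`, `placeholder_to_token`, and `replace_placeholders_generic` are hypothetical names.

```python
import re

# Matches {{Some Name}} and captures the name between the braces.
PLACEHOLDER_PATTERN = re.compile(r"\{\{(.*?)\}\}")

def placeholder_to_token(match):
    # "Order Number" -> "<ORDER_NUMBER>"
    name = match.group(1).strip()
    return "<" + re.sub(r"\s+", "_", name).upper() + ">"

def replace_placeholders_generic(text):
    # Non-string values are passed through unchanged, as in the original helper.
    if not isinstance(text, str):
        return text
    return PLACEHOLDER_PATTERN.sub(placeholder_to_token, text)

print(replace_placeholders_generic("ETA of purchase {{Order Number}}"))
```

This version also handles placeholders that were not seen when the dictionary was written, at the cost of no longer documenting the expected set explicitly.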


In [15]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('microsoft/DialoGPT-small')
tokenizer.pad_token = tokenizer.eos_token

def tokenize_function(examples):
    return tokenizer(
        examples['dialogue'],
        truncation=True,
        max_length=512,
        padding='max_length',
    )

# Tokenize the dataset
tokenized_track_order_dataset = processed_track_order_dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=['dialogue']
)

# Set the format for PyTorch tensors
tokenized_track_order_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

In [16]:
from transformers import AutoModelForCausalLM

# Load Pretrained Model
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

In [17]:
from transformers import DataCollatorForLanguageModeling

# Initialize the data collator
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False, # For causal language modeling
)
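
With `mlm=False`, the collator is understood to build the labels by copying `input_ids` and replacing every position equal to the pad token id with `-100`, the value the cross-entropy loss ignores. Because this notebook sets `pad_token = eos_token`, padding and EOS share one id. The pure-Python sketch below imitates that masking step on a toy sequence. `mask_padding_labels` is a hypothetical helper, not a library function, and the toy token ids are made up.

```python
IGNORE_INDEX = -100   # label value skipped by the cross-entropy loss
PAD_ID = 50256        # illustrative: the pad id equals the EOS id when pad_token = eos_token

def mask_padding_labels(input_ids, pad_id=PAD_ID):
    """Copy input ids into labels, hiding every position that holds pad_id."""
    return [IGNORE_INDEX if token == pad_id else token for token in input_ids]

# Toy sequence: three content tokens followed by two pad tokens.
toy_input_ids = [11, 22, 33, PAD_ID, PAD_ID]
print(mask_padding_labels(toy_input_ids))
```

One consequence of the shared id is that any genuine EOS token in a training example would be masked as well. The tokenization cell above does not append EOS to the dialogues, so this may be worth keeping in mind if the fine-tuned model tends not to stop on its own.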



In [20]:
from transformers import TrainingArguments
import torch

training_args = TrainingArguments(
    output_dir='./fine-tuned-model',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    eval_strategy='no',
    save_strategy='epoch',
    logging_steps=500,
    learning_rate=5e-5,
    fp16=torch.cuda.is_available(),
    report_to='none',
)
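
With `per_device_train_batch_size=4` and `gradient_accumulation_steps=8`, gradients from 8 mini-batches of 4 examples are accumulated before each optimizer step. The short calculation below makes the resulting effective batch size explicit. `effective_batch_size` is a hypothetical helper, and `num_devices=1` is an assumption about the training machine.

```python
def effective_batch_size(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer update."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

# Values taken from the TrainingArguments above, assuming a single device.
print(effective_batch_size(4, 8))
```

Accumulation trades wall-clock time for memory: each forward and backward pass only holds 4 examples, while the update behaves as if the batch held 32.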

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_track_order_dataset['train'],
    data_collator=data_collator,
)

# Start training
trainer.train()

In [45]:
# Save the fine-tuned model and tokenizer
trainer.save_model('./fine-tuned-model')
tokenizer.save_pretrained('./fine-tuned-model')

('./fine-tuned-model/tokenizer_config.json',
 './fine-tuned-model/special_tokens_map.json',
 './fine-tuned-model/vocab.json',
 './fine-tuned-model/merges.txt',
 './fine-tuned-model/added_tokens.json',
 './fine-tuned-model/tokenizer.json')

In [21]:
# Load the order/shipment dataset
import pandas as pd
order_data = pd.read_csv('./Sales_Shipment_Data.csv')
print(order_data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180519 entries, 0 to 180518
Data columns (total 49 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   Product Category Id            180519 non-null  int64  
 1   Category Name                  180519 non-null  object 
 2   Class                          180519 non-null  object 
 3   Customer City                  180519 non-null  object 
 4   Customer Country               180519 non-null  object 
 5   Customer Fname                 180519 non-null  object 
 6   Customer Id                    180519 non-null  int64  
 7   Customer Lname                 180511 non-null  object 
 8   Customer Segment               180519 non-null  object 
 9   Customer State                 180519 non-null  object 
 10  Customer Street                180519 non-null  object 
 11  Customer Zipcode               180516 non-null  float64
 12  Delivery Status               

In [22]:
print(order_data['Order Number'].head())

0    53963
1     3908
2    52009
3     1179
4    56019
Name: Order Number, dtype: int64


In [23]:
# Ensure all order numbers are strings and pad with leading zeros to make them 5 characters long
order_data['Order Number'] = order_data['Order Number'].astype(str).str.zfill(5)
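
The padding step matters because the chatbot later matches order numbers as exact five-character strings. The sketch below shows the same normalization on single values in plain Python. `normalize_order_number` is a hypothetical helper and the sample values are for illustration only.

```python
def normalize_order_number(value, width=5):
    """Render an order number as a zero-padded string of at least `width` characters."""
    return str(value).strip().zfill(width)

for raw in (53963, 3908, "1179", 123456):
    print(repr(raw), "->", normalize_order_number(raw))
```

Values longer than five characters are left unchanged by `zfill`. Note also that a customer would have to type the padded form (for example `03908`) for the five-digit pattern in `extract_order_id` to match; applying the same normalization to user input would be one way to relax that.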

In [33]:
import re
from collections import deque

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the fine-tuned model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("./fine-tuned-model")
model = AutoModelForCausalLM.from_pretrained("./fine-tuned-model")
tokenizer.pad_token = tokenizer.eos_token  # Reuse EOS as the padding token

def generate_response(prompt, max_length=50):  # max_length: number of new tokens allowed
    # Tokenize once so the input ids and attention mask come from the same call
    inputs = tokenizer(prompt + tokenizer.eos_token, return_tensors='pt')
    input_ids = inputs['input_ids'].to(model.device)
    attention_mask = inputs['attention_mask'].to(model.device)
    
    output_ids = model.generate(
        input_ids,
        attention_mask=attention_mask,  # Mark which positions hold real tokens
        max_length=max_length + input_ids.shape[1],
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True,
        # temperature=0.8,  # Lower temperature to make generation more orderly
        top_p=0.95,        # Nucleus sampling: keep the smallest token set covering 95% probability
        top_k=60,          # Sample only from the 60 most likely tokens
        no_repeat_ngram_size=2  # Never repeat any 2-token sequence
    )
    
    # Decode only the newly generated tokens, skipping the prompt
    response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
    
    # Cut the response after the last sentence-ending punctuation mark
    response = truncate_at_sentence_end(response)
    
    return response.strip()

def truncate_at_sentence_end(response):
    # Use regular expressions to find all sentence-ending punctuation (period, question mark, exclamation mark)
    sentence_endings = list(re.finditer(r'[.!?]', response))
    
    if sentence_endings:
        # If multiple sentence-ending punctuation marks are found, take the last one
        last_ending = sentence_endings[-1]
        return response[:last_ending.end()]  # Keep up to the position of the last punctuation mark
    else:
        return response  # If no punctuation marks, return the entire response

def extract_order_id(user_input):
    match = re.search(r'\b\d{5}\b', user_input)
    if match:
        return match.group(0)
    return None

def extract_tracking_number(user_input):
    match = re.search(r'\bTRK\d{9}\b', user_input)
    if match:
        return match.group(0)
    return None

def post_process_response(response, 
                          order_number=None, 
                          order_status=None, 
                          order_tracking=None, 
                          eta=None,  
                          purchase_status=None, 
                          delivery_date=None, 
                          track_order=None, 
                          shipping_status=None, 
                          tracking_number=None,
                          customer_fname=None
                          ):
    
    if order_number:
        response = response.replace('<ORDER_NUMBER>', order_number) # Order Number
    if order_status:
        response = response.replace('<ORDER_STATUS>', order_status) # Order Status
    if order_tracking:
        response = response.replace('<ORDER_TRACKING>', order_tracking)
    if eta:
        response = response.replace('<ETA>', eta)
    if purchase_status:
        response = response.replace('<PURCHASE_STATUS>', purchase_status)
    if delivery_date:
        response = response.replace('<DELIVERY_DATE>', delivery_date)
    if track_order:
        response = response.replace('<TRACK_ORDER>', track_order)
    if shipping_status:
        response = response.replace('<SHIPPING_STATUS>', shipping_status)
    if tracking_number:
        response = response.replace('<TRACKING_NUMBER>', tracking_number)
    if customer_fname:
        response = response.replace('<CUSTOMER_FNAME>', customer_fname)

    response = re.sub(r'<[^>]+>', '', response)

    return response

def clean_generated_response(response):
    # Sentence by sentence, remove leftover <...> placeholders and replace underscores with spaces
    clean_response = []
    for sentence in response.split('.'):
        # Remove any '<...>' placeholders by replacing them with an empty string
        cleaned_sentence = re.sub(r'<[^>]*>', '', sentence)
        # Replace underscores with spaces
        cleaned_sentence = cleaned_sentence.replace('_', ' ')
        clean_response.append(cleaned_sentence.strip())
    
    # Join the cleaned sentences back into a full response
    return '. '.join(clean_response).strip()



def get_order_details(order_number=None, tracking_number=None):
    # Always return a 5-tuple so callers can unpack the result unconditionally
    not_found = (None, None, None, None, None)
    if order_number:
        result = order_data[order_data['Order Number'] == order_number]
    elif tracking_number:
        result = order_data[order_data['Tracking Number'] == tracking_number]
    else:
        return not_found
    
    if result.empty:
        return not_found
    
    row = result.iloc[0]  # Use the first matching row
    return (
        row['Order Status'],
        row['Estimated Delivery Date'],
        row['Delivery Status'],
        row['Customer Fname'],
        row['Order Number'],
    )

def tracking_system():
    print("Assistant: Hello! Welcome to the Order Tracking Assistant. How can I help you today?")
    
    # Use deque to manage conversation history, keeping up to 6 messages (3 rounds of dialogue)
    conversation_history = deque(maxlen=6)
    
    while True:
        user_input = input("You: ").strip()
        
        if user_input.lower() in ['exit', 'quit', 'bye','end']:
            print("Assistant: Thank you for using the Order Tracking Assistant. Have a great day!")
            break
        if user_input.lower() in ['thank', 'thanks','thanks.']:
            print("Assistant: My pleasure! Let me know if you need further help.")
            break
        
        # Add user message to conversation history
        conversation_history.append(f"Customer: {user_input}")
        
        # Extract order number and tracking number from user input
        order_number = extract_order_id(user_input)
        tracking_number = extract_tracking_number(user_input)
        
        if not order_number and not tracking_number:
            # If the user did not provide an order number or tracking number, prompt the user to provide them
            assistant_prompt = "\n".join(conversation_history) + "\n" + "Order Tracking Assistant: Provide a short and direct answer. " \
                                "Ask the user to provide both their order number and tracking number if they want to track an order. "
            
            response = generate_response(assistant_prompt)
        
        else:
            # Retrieve order details from the CSV based on order_number or tracking_number
            order_status, eta, shipping_status, customer_fname, order_number = get_order_details(order_number, tracking_number)
            
            if not order_status or not eta:
                # If order details are not found, inform the user
                response = "I'm sorry, I couldn't find any details for the provided order number or tracking number. Please check and try again."
            else:
                # Pick the template; mention the tracking number only when the user supplied one
                if tracking_number:
                    response_template = "Hi,<CUSTOMER_FNAME>.Your tracking number <TRACKING_NUMBER>. Your order number <ORDER_NUMBER> is currently <ORDER_STATUS>. The estimated delivery date is <ETA>.Is there anything else I can assist you with today? Feel free to ask if you need further help or have any other questions." 
                else:
                    response_template = "Hi,<CUSTOMER_FNAME>.Your order <ORDER_NUMBER> is currently <ORDER_STATUS>. The estimated delivery date is <ETA>.Is there anything else I can assist you with today? Feel free to ask if you need further help or have any other questions." 
                
                # Fill the template with the values looked up for this order
                response = post_process_response(
                    response_template, 
                    order_number=order_number, 
                    tracking_number=tracking_number, 
                    order_status=order_status, 
                    eta=eta,
                    customer_fname=customer_fname
                )
                
                # Remove any leftover placeholders and tidy the sentence spacing
                response = clean_generated_response(response)
              
        # Print and add assistant's reply to conversation history
        print(f"{response}")
        conversation_history.append(f"Order Tracking Assistant: {response}")

if __name__ == "__main__":
    tracking_system()

Assistant: Hello! Welcome to the Order Tracking Assistant. How can I help you today?


You:  I want to track order 60740.


Hi,Susan. Your order 60740 is currently Pending Payment. The estimated delivery date is 6/13/17. Is there anything else I can assist you with today? Feel free to ask if you need further help or have any other questions.


You:  I also want track an order with a tracking number TRK370784603..


Hi,Susan. Your tracking number TRK370784603. Your order number 48008 is currently Pending. The estimated delivery date is 12/4/16. Is there anything else I can assist you with today? Feel free to ask if you need further help or have any other questions.


You:  Thank


Assistant: My pleasure! Let me know if you need further help.
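
The two extraction patterns decide which branch the assistant takes, so it is worth checking them against a few inputs, including ones that should not match. The sketch below reuses the same regular expressions as `extract_order_id` and `extract_tracking_number`. `find_first` and the last two sample messages are hypothetical additions for illustration.

```python
import re

ORDER_ID_PATTERN = re.compile(r"\b\d{5}\b")      # exactly five digits as a whole word
TRACKING_PATTERN = re.compile(r"\bTRK\d{9}\b")   # "TRK" followed by exactly nine digits

def find_first(pattern, text):
    """Return the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None

samples = [
    "I want to track order 60740.",
    "tracking number TRK370784603..",
    "my order is 3908",
    "order 1234567",
]
for text in samples:
    order_id = find_first(ORDER_ID_PATTERN, text)
    tracking = find_first(TRACKING_PATTERN, text)
    print(f"{text!r} -> order={order_id}, tracking={tracking}")
```

The word boundaries prevent the digits inside a tracking number from being mistaken for an order number, and they also reject order numbers that are shorter or longer than five digits. The tracking pattern is case-sensitive, so a lowercase `trk` prefix would not match.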
