# 1. Bitext Customer Support (Intent Recognition Benchmark)
**Category:** AI Agent Core Capabilities

**Source:** [Bitext / Customer Support LLM Chatbot Training Dataset](https://huggingface.co/datasets/bitext/Bitext-customer-support-llm-chatbot-training-dataset)

**Description:** A corpus for training and evaluating customer service agents on
intent recognition and breakdown analysis. It tests how accurately a model identifies what the user needs.

**Data Content:** Customer service corpus with 27 specific intents
(e.g., order checks, refunds), 11 high-level categories, 394 quality flag
combinations, and templated dialog with dynamic placeholders.

**License:** CDLA-Sharing-1.0

---

**This notebook covers:**
1. Data loading from HuggingFace (26,872 instruction-response pairs)
2. Schema exploration: categories, intents, flags, template placeholders
3. Intent & category distribution analysis
4. Instruction & response text length characteristics
5. Quality flag decomposition and pattern analysis
6. Template placeholder usage across intents
7. Intent recognition agent evaluation framework

## 1. Setup

In [None]:
# Install dependencies (uncomment if needed)
# !pip install datasets pandas matplotlib seaborn scipy

In [None]:
import re
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
from datasets import load_dataset
from scipy import stats

sns.set_theme(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)
plt.rcParams["figure.dpi"] = 100
plt.rcParams["axes.titlesize"] = 13
plt.rcParams["axes.labelsize"] = 11

## 2. Dataset Overview

The Bitext Customer Support dataset is a **hybrid synthetic** corpus generated
using NLP/NLG technology for fine-tuning LLMs in customer service applications.

**Structure:**

| Column | Description |
|--------|-------------|
| `flags` | Quality/variation flags (394 unique combinations of 14 flag characters) |
| `instruction` | Customer query with template placeholders (e.g., `{{Order Number}}`) |
| `category` | High-level domain (11 categories: ORDER, ACCOUNT, REFUND, ...) |
| `intent` | Specific intent label (27 intents: cancel_order, get_refund, ...) |
| `response` | Agent response with dynamic placeholders |

**Intent hierarchy (11 categories, 27 intents):**

| Category | Intents |
|----------|----------|
| ACCOUNT | create_account, delete_account, edit_account, recover_password, registration_problems, switch_account |
| ORDER | cancel_order, change_order, place_order, track_order |
| REFUND | check_refund_policy, get_refund, track_refund |
| INVOICE | check_invoice, get_invoice |
| CONTACT | contact_customer_service, contact_human_agent |
| PAYMENT | check_payment_methods, payment_issue |
| FEEDBACK | complaint, review |
| DELIVERY | delivery_options, delivery_period |
| SHIPPING | change_shipping_address, set_up_shipping_address |
| SUBSCRIPTION | newsletter_subscription |
| CANCEL | check_cancellation_fee |
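
The table above implies that every intent belongs to exactly one category. A minimal sketch of a consistency check follows; the helper name `intents_with_multiple_categories` and the toy rows are illustrative assumptions, not part of the dataset or the notebook.

```python
from collections import defaultdict


def intents_with_multiple_categories(intents, categories):
    """Return {intent: sorted categories} for intents seen under 2+ categories."""
    seen = defaultdict(set)
    for intent, category in zip(intents, categories):
        seen[intent].add(category)
    return {i: sorted(c) for i, c in seen.items() if len(c) > 1}


# Toy rows (illustrative only): get_refund is deliberately mislabeled once
toy_intents = ["cancel_order", "track_order", "get_refund", "get_refund"]
toy_categories = ["ORDER", "ORDER", "REFUND", "ORDER"]

conflicts = intents_with_multiple_categories(toy_intents, toy_categories)
print(conflicts)  # {'get_refund': ['ORDER', 'REFUND']}
```

Once the data is loaded in Section 3, calling the helper with `df["intent"]` and `df["category"]` should return an empty dict if the hierarchy holds.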

## 3. Data Loading

In [None]:
print("Loading Bitext Customer Support dataset from HuggingFace...")
ds = load_dataset(
    "bitext/Bitext-customer-support-llm-chatbot-training-dataset",
    split="train"
)
df = ds.to_pandas()
print(f"Loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print(f"Columns: {list(df.columns)}")

## 4. Data Schema & Samples

In [None]:
print("=== Data Types ===")
print(df.dtypes)
print(f"\n=== Unique Values ===")
for col in df.columns:
    print(f"  {col:15s}: {df[col].nunique()} unique")
print(f"\n=== Sample Rows ===")
df.head(5)

In [None]:
# Show full instruction-response pairs for different intents
sample_intents = ["cancel_order", "get_refund", "recover_password",
                  "delivery_options", "complaint"]
for intent in sample_intents:
    row = df[df["intent"] == intent].iloc[0]
    print(f"--- Intent: {intent} | Category: {row['category']} "
          f"| Flags: {row['flags']} ---")
    print(f"  Instruction: {row['instruction']}")
    print(f"  Response:    {row['response'][:200]}...")
    print()

## 5. Exploratory Data Analysis

### 5.1 Category & Intent Distribution

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# Category distribution
cat_counts = df["category"].value_counts()
axes[0].barh(cat_counts.index, cat_counts.values, color="steelblue",
             edgecolor="white")
axes[0].set_title("Records per Category")
axes[0].set_xlabel("Count")
axes[0].invert_yaxis()
for i, (idx, val) in enumerate(cat_counts.items()):
    axes[0].text(val + 50, i, f"{val:,}", va="center", fontsize=9)

# Intent distribution
intent_counts = df["intent"].value_counts()
axes[1].barh(intent_counts.index, intent_counts.values, color="coral",
             edgecolor="white")
axes[1].set_title("Records per Intent (27 intents)")
axes[1].set_xlabel("Count")
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

print(f"Categories: {df['category'].nunique()}, "
      f"Intents: {df['intent'].nunique()}")
print(f"Intents per category: "
      f"min={df.groupby('category')['intent'].nunique().min()}, "
      f"max={df.groupby('category')['intent'].nunique().max()}")

In [None]:
# Intent-category heatmap
cross = pd.crosstab(df["intent"], df["category"])

plt.figure(figsize=(14, 10))
sns.heatmap(cross, annot=True, fmt="d", cmap="YlOrRd",
            linewidths=0.5, cbar_kws={"label": "Count"})
plt.title("Intent × Category Cross-Tabulation")
plt.xlabel("Category")
plt.ylabel("Intent")
plt.tight_layout()
plt.show()

### 5.2 Instruction & Response Length Analysis

In [None]:
df["instr_len"] = df["instruction"].apply(len)
df["instr_words"] = df["instruction"].apply(lambda x: len(x.split()))
df["resp_len"] = df["response"].apply(len)
df["resp_words"] = df["response"].apply(lambda x: len(x.split()))

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

axes[0, 0].hist(df["instr_len"], bins=50, color="steelblue", edgecolor="white")
axes[0, 0].set_title(f"Instruction Length (chars)\n"
                      f"Mean={df['instr_len'].mean():.0f}, "
                      f"Median={df['instr_len'].median():.0f}")
axes[0, 0].set_xlabel("Character Count")
axes[0, 0].set_ylabel("Frequency")

axes[0, 1].hist(df["instr_words"], bins=30, color="coral", edgecolor="white")
axes[0, 1].set_title(f"Instruction Length (words)\n"
                      f"Mean={df['instr_words'].mean():.1f}, "
                      f"Median={df['instr_words'].median():.0f}")
axes[0, 1].set_xlabel("Word Count")
axes[0, 1].set_ylabel("Frequency")

axes[1, 0].hist(df["resp_len"], bins=50, color="mediumseagreen",
                edgecolor="white")
axes[1, 0].set_title(f"Response Length (chars)\n"
                      f"Mean={df['resp_len'].mean():.0f}, "
                      f"Median={df['resp_len'].median():.0f}")
axes[1, 0].set_xlabel("Character Count")
axes[1, 0].set_ylabel("Frequency")

axes[1, 1].hist(df["resp_words"], bins=50, color="orchid", edgecolor="white")
axes[1, 1].set_title(f"Response Length (words)\n"
                      f"Mean={df['resp_words'].mean():.1f}, "
                      f"Median={df['resp_words'].median():.0f}")
axes[1, 1].set_xlabel("Word Count")
axes[1, 1].set_ylabel("Frequency")

plt.tight_layout()
plt.show()

In [None]:
# Response length by category
cat_order = (df.groupby("category")["resp_words"].median()
             .sort_values(ascending=False).index)

plt.figure(figsize=(12, 6))
sns.boxplot(data=df, x="category", y="resp_words", order=cat_order,
            hue="category", palette="Set2", legend=False)
plt.title("Response Length by Category (words)")
plt.xlabel("Category")
plt.ylabel("Word Count")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.show()

### 5.3 Quality Flag Analysis

Each record has a `flags` string composed of single-letter codes. These
represent data augmentation and quality variation markers.

In [None]:
# Decompose flag strings into individual characters
flag_chars = Counter()
for f in df["flags"]:
    for c in str(f):
        flag_chars[c] += 1

flag_df = (pd.DataFrame(flag_chars.items(), columns=["Flag", "Count"])
           .sort_values("Count", ascending=False))

df["n_flags"] = df["flags"].apply(len)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Individual flag frequency
axes[0].bar(flag_df["Flag"], flag_df["Count"], color="steelblue",
            edgecolor="white")
axes[0].set_title(f"Individual Flag Character Frequency ({len(flag_df)} unique)")
axes[0].set_xlabel("Flag Character")
axes[0].set_ylabel("Occurrences")

# Number of flags per record
axes[1].hist(df["n_flags"], bins=range(1, df["n_flags"].max() + 2),
             color="coral", edgecolor="white", align="left")
axes[1].set_title("Number of Flags per Record")
axes[1].set_xlabel("Flag Count")
axes[1].set_ylabel("Frequency")

plt.tight_layout()
plt.show()

print(f"Unique flag combinations: {df['flags'].nunique()}")
print(f"Flag count: mean={df['n_flags'].mean():.1f}, "
      f"min={df['n_flags'].min()}, max={df['n_flags'].max()}")
# Report measured coverage rather than hard-coding which flags are universal
print(f"\nFlag coverage: 'B' appears in "
      f"{flag_chars['B']/len(df)*100:.0f}% of records, "
      f"'L' in {flag_chars['L']/len(df)*100:.0f}%.")

In [None]:
# Top flag combinations
top_flags = df["flags"].value_counts().head(15)

plt.figure(figsize=(12, 5))
top_flags.plot(kind="barh", color="mediumseagreen", edgecolor="white")
plt.gca().invert_yaxis()
plt.title("Top 15 Flag Combinations")
plt.xlabel("Count")
plt.ylabel("Flag Combination")
plt.tight_layout()
plt.show()
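
Because every record carries its flag string, accuracy can be broken down per flag character to see which variations hurt a classifier most. The sketch below is one way to do that; `accuracy_by_flag` and the toy flag strings, labels, and predictions are illustrative assumptions.

```python
from collections import defaultdict


def accuracy_by_flag(flags, true_intents, pred_intents):
    """Accuracy computed over the records that carry each flag character."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for flag_str, truth, pred in zip(flags, true_intents, pred_intents):
        for flag in set(flag_str):  # count each flag once per record
            totals[flag] += 1
            hits[flag] += int(truth == pred)
    return {flag: hits[flag] / totals[flag] for flag in sorted(totals)}


# Toy data (illustrative only)
toy_flags = ["B", "BL", "BLZ", "BZ"]
toy_true = ["cancel_order", "get_refund", "track_order", "complaint"]
toy_pred = ["cancel_order", "get_refund", "cancel_order", "review"]

per_flag = accuracy_by_flag(toy_flags, toy_true, toy_pred)
print(per_flag)  # {'B': 0.5, 'L': 0.5, 'Z': 0.0}
```

A large gap between flags would suggest the classifier is sensitive to that particular variation.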

### 5.4 Template Placeholder Analysis

Both instructions and responses use `{{placeholder}}` templates for
dynamic content (order numbers, URLs, phone numbers, etc.).

In [None]:
PLACEHOLDER_RE = re.compile(r"\{\{[^}]+\}\}")

# Extract placeholders from instructions
instr_ph = Counter()
for text in df["instruction"]:
    for m in PLACEHOLDER_RE.findall(str(text)):
        instr_ph[m] += 1

# Extract placeholders from responses
resp_ph = Counter()
for text in df["response"]:
    for m in PLACEHOLDER_RE.findall(str(text)):
        resp_ph[m] += 1

df["n_instr_ph"] = df["instruction"].apply(
    lambda x: len(PLACEHOLDER_RE.findall(str(x)))
)
df["n_resp_ph"] = df["response"].apply(
    lambda x: len(PLACEHOLDER_RE.findall(str(x)))
)

print(f"Unique placeholders in instructions: {len(instr_ph)}")
print(f"Unique placeholders in responses: {len(resp_ph)}")
print(f"\n=== Top Instruction Placeholders ===")
for ph, cnt in instr_ph.most_common(10):
    print(f"  {ph:30s}: {cnt}")
print(f"\n=== Top Response Placeholders ===")
for ph, cnt in resp_ph.most_common(15):
    print(f"  {ph:45s}: {cnt}")

In [None]:
# Placeholder usage distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(df["n_instr_ph"], bins=range(0, df["n_instr_ph"].max() + 2),
             color="steelblue", edgecolor="white", align="left")
axes[0].set_title("Placeholders per Instruction")
axes[0].set_xlabel("Number of Placeholders")
axes[0].set_ylabel("Frequency")

axes[1].hist(df["n_resp_ph"], bins=range(0, min(df["n_resp_ph"].max() + 2, 20)),
             color="orchid", edgecolor="white", align="left")
axes[1].set_title("Placeholders per Response")
axes[1].set_xlabel("Number of Placeholders")
axes[1].set_ylabel("Frequency")

plt.tight_layout()
plt.show()

print(f"Instructions with placeholders: "
      f"{(df['n_instr_ph'] > 0).sum()} / {len(df)} "
      f"({(df['n_instr_ph'] > 0).mean()*100:.1f}%)")
print(f"Responses with placeholders: "
      f"{(df['n_resp_ph'] > 0).sum()} / {len(df)} "
      f"({(df['n_resp_ph'] > 0).mean()*100:.1f}%)")

### 5.5 Instruction Diversity per Intent

How many distinct instruction phrasings exist for each intent?
This measures the paraphrase diversity of the training data.

In [None]:
# Unique instructions per intent
diversity = (df.groupby("intent")
             .agg(total=("instruction", "count"),
                  unique=("instruction", "nunique"))
             .assign(diversity_pct=lambda x: x["unique"] / x["total"] * 100)
             .sort_values("diversity_pct", ascending=False))

fig, axes = plt.subplots(1, 2, figsize=(16, 7))

axes[0].barh(diversity.index, diversity["unique"], color="steelblue",
             edgecolor="white")
axes[0].set_title("Unique Instructions per Intent")
axes[0].set_xlabel("Unique Instruction Count")
axes[0].invert_yaxis()

axes[1].barh(diversity.index, diversity["diversity_pct"], color="coral",
             edgecolor="white")
axes[1].set_title("Instruction Diversity (% unique)")
axes[1].set_xlabel("Diversity %")
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

print(f"Mean diversity: {diversity['diversity_pct'].mean():.1f}%")
print(f"Min diversity: {diversity['diversity_pct'].min():.1f}% "
      f"({diversity['diversity_pct'].idxmin()})")
print(f"Max diversity: {diversity['diversity_pct'].max():.1f}% "
      f"({diversity['diversity_pct'].idxmax()})")

### 5.6 Instruction Length by Intent

In [None]:
intent_order = (df.groupby("intent")["instr_words"].median()
                .sort_values(ascending=False).index)

plt.figure(figsize=(14, 8))
sns.boxplot(data=df, y="intent", x="instr_words", order=intent_order,
            hue="intent", palette="tab20", legend=False)
plt.title("Instruction Length by Intent (words)")
plt.xlabel("Word Count")
plt.ylabel("Intent")
plt.tight_layout()
plt.show()

### 5.7 Response Complexity by Intent

In [None]:
# Response characteristics per intent
resp_stats = (df.groupby("intent")
              .agg(mean_words=("resp_words", "mean"),
                   mean_ph=("n_resp_ph", "mean"),
                   mean_chars=("resp_len", "mean"))
              .sort_values("mean_words", ascending=False)
              .round(1))

fig, axes = plt.subplots(1, 2, figsize=(16, 7))

axes[0].barh(resp_stats.index, resp_stats["mean_words"],
             color="mediumseagreen", edgecolor="white")
axes[0].set_title("Mean Response Length per Intent (words)")
axes[0].set_xlabel("Mean Word Count")
axes[0].invert_yaxis()

axes[1].barh(resp_stats.index, resp_stats["mean_ph"],
             color="orchid", edgecolor="white")
axes[1].set_title("Mean Placeholders per Response by Intent")
axes[1].set_xlabel("Mean Placeholder Count")
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

## 6. Intent Recognition Agent Evaluation Framework

We define a scoring framework for evaluating how accurately an agent
classifies customer intents from instruction text. The baseline strategies
simulated later in this section are scored on the first two criteria only.

### Evaluation Criteria

| Criterion | Metric | Weight | Description |
|-----------|--------|--------|-------------|
| **Intent Accuracy** | Exact match on 27 intents | 0.30 | Does the agent predict the correct intent? |
| **Category Accuracy** | Exact match on 11 categories | 0.20 | Does the agent predict the correct category? |
| **Robustness** | Accuracy on flagged variants | 0.20 | Consistent across typos/paraphrases? |
| **Slot Extraction** | Placeholder recall | 0.15 | Does the agent extract template entities? |
| **Response Quality** | BLEU/ROUGE against gold | 0.15 | Is the generated response appropriate? |
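
One plausible way to combine the five criteria is a weighted sum using the weights in the table. The notebook does not state the aggregation rule, so the weighted sum, the name `composite_score`, and the example scores below are assumptions for illustration.

```python
# Weights copied from the Evaluation Criteria table
WEIGHTS = {
    "Intent Accuracy": 0.30,
    "Category Accuracy": 0.20,
    "Robustness": 0.20,
    "Slot Extraction": 0.15,
    "Response Quality": 0.15,
}


def composite_score(scores, weights=WEIGHTS):
    """Weighted sum of per-criterion scores, each expected in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical per-criterion scores for a single agent
example_scores = {
    "Intent Accuracy": 0.90,
    "Category Accuracy": 0.95,
    "Robustness": 0.80,
    "Slot Extraction": 0.70,
    "Response Quality": 0.60,
}
print(f"{composite_score(example_scores):.3f}")  # 0.815
```

Since the weights sum to 1.0, the composite stays in [0, 1] whenever each input score does.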

In [None]:
# Build baseline metrics from the dataset
def compute_intent_baseline(df):
    """Compute baseline metrics for intent recognition."""
    metrics = {}

    # 1. Class balance (entropy-based)
    intent_probs = df["intent"].value_counts(normalize=True)
    metrics["Intent Entropy"] = -np.sum(
        intent_probs * np.log2(intent_probs)
    )
    metrics["Max Possible Entropy"] = np.log2(df["intent"].nunique())
    metrics["Balance Ratio"] = (
        metrics["Intent Entropy"] / metrics["Max Possible Entropy"]
    )

    # 2. Instruction diversity
    metrics["Unique Instructions"] = df["instruction"].nunique()
    metrics["Instruction Diversity %"] = (
        df["instruction"].nunique() / len(df) * 100
    )

    # 3. Placeholder coverage
    metrics["Instructions with Slots"] = (
        (df["n_instr_ph"] > 0).mean() * 100
    )
    metrics["Responses with Slots"] = (
        (df["n_resp_ph"] > 0).mean() * 100
    )

    # 4. Mean flag complexity
    metrics["Mean Flag Count"] = df["n_flags"].mean()

    return metrics


baseline = compute_intent_baseline(df)
print("=== Intent Recognition Baseline Metrics ===")
for k, v in baseline.items():
    print(f"  {k:30s}: {v:.4f}" if isinstance(v, float) else
          f"  {k:30s}: {v}")

In [None]:
# Simulate agent strategies for intent classification
np.random.seed(42)
all_intents = df["intent"].unique()
all_categories = df["category"].unique()
intent_to_cat = df.groupby("intent")["category"].first().to_dict()

def evaluate_agent(predictions, true_intents, true_categories):
    """Evaluate intent classification agent."""
    pred_intents = [p["intent"] for p in predictions]
    pred_categories = [intent_to_cat.get(pi, "UNKNOWN") for pi in pred_intents]

    intent_acc = np.mean(
        [p == t for p, t in zip(pred_intents, true_intents)]
    )
    cat_acc = np.mean(
        [p == t for p, t in zip(pred_categories, true_categories)]
    )
    return {"Intent Accuracy": intent_acc, "Category Accuracy": cat_acc}


# Sample test set
test_df = df.sample(1000, random_state=42)

# Strategy 1: Random baseline
random_preds = [{"intent": np.random.choice(all_intents)} for _ in range(len(test_df))]

# Strategy 2: Most-frequent class
most_common_intent = df["intent"].mode()[0]
majority_preds = [{"intent": most_common_intent} for _ in range(len(test_df))]

# Strategy 3: Keyword matching (simple heuristic)
keyword_map = {
    "cancel": "cancel_order", "refund": "get_refund",
    "track": "track_order", "invoice": "check_invoice",
    "password": "recover_password", "delivery": "delivery_period",
    "shipping": "set_up_shipping_address", "payment": "payment_issue",
    "complaint": "complaint", "review": "review",
    "account": "edit_account", "subscribe": "newsletter_subscription",
    "human": "contact_human_agent", "contact": "contact_customer_service",
    "place": "place_order", "change": "change_order",
}

def keyword_predict(instruction):
    instr_lower = instruction.lower()
    for kw, intent in keyword_map.items():
        if kw in instr_lower:
            return {"intent": intent}
    return {"intent": most_common_intent}

keyword_preds = [keyword_predict(row["instruction"])
                 for _, row in test_df.iterrows()]

# Evaluate all strategies
strategies = {
    "Random Baseline": random_preds,
    "Majority Class": majority_preds,
    "Keyword Matching": keyword_preds,
}

eval_results = []
for name, preds in strategies.items():
    scores = evaluate_agent(
        preds,
        test_df["intent"].tolist(),
        test_df["category"].tolist()
    )
    eval_results.append({"Strategy": name, **scores})

df_eval = pd.DataFrame(eval_results)
print("=== Intent Recognition Agent Evaluation ===")
print(df_eval.round(3).to_string(index=False))
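
The evaluation above scores only intent and category accuracy. For the Slot Extraction criterion, one possible reading of "placeholder recall" is the share of gold `{{placeholder}}` slots that also appear in the agent's output. The sketch below assumes that definition; `placeholder_recall` and the example strings are hypothetical.

```python
import re

# Same pattern as in Section 5.4
PLACEHOLDER_RE = re.compile(r"\{\{[^}]+\}\}")


def placeholder_recall(gold_text, predicted_text):
    """Fraction of distinct gold placeholders present in the prediction.

    Returns None when the gold text has no placeholders, so such records
    can be excluded from an average instead of counting as 0 or 1.
    """
    gold = set(PLACEHOLDER_RE.findall(gold_text))
    if not gold:
        return None
    predicted = set(PLACEHOLDER_RE.findall(predicted_text))
    return len(gold & predicted) / len(gold)


gold = "Your order {{Order Number}} ships to {{Shipping Address}}."
pred = "Order {{Order Number}} is on its way."
print(placeholder_recall(gold, pred))  # 0.5
```

Exact string matching on placeholder names is strict; a more lenient variant could normalize case or whitespace before comparing.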

In [None]:
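# Radar chart comparing strategies.
# Note: with only two metrics the axes sit at 0 and pi, so each polygon
# collapses to a line; the chart becomes informative once more criteria are scored.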
labels = ["Intent Accuracy", "Category Accuracy"]
n_metrics = len(labels)
angles = np.linspace(0, 2 * np.pi, n_metrics, endpoint=False).tolist()
angles += angles[:1]

fig, ax = plt.subplots(figsize=(7, 7), subplot_kw=dict(polar=True))
colors_strat = ["gray", "coral", "steelblue"]

for i, (_, row) in enumerate(df_eval.iterrows()):
    values = [row[l] for l in labels]
    values += values[:1]
    ax.plot(angles, values, "o-", linewidth=2, color=colors_strat[i],
            markersize=8, label=row["Strategy"])
    ax.fill(angles, values, alpha=0.1, color=colors_strat[i])

ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels, fontsize=11)
ax.set_ylim(0, 1)
ax.set_yticks([0.2, 0.4, 0.6, 0.8, 1.0])
ax.set_yticklabels(["0.2", "0.4", "0.6", "0.8", "1.0"], fontsize=8)
ax.set_title("Intent Recognition Strategy Comparison", fontsize=14, pad=20)
ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1))
plt.tight_layout()
plt.show()

## 7. Summary & Key Findings

In [None]:
print("=" * 70)
print("BITEXT CUSTOMER SUPPORT BENCHMARK - SUMMARY")
print("=" * 70)

print(f"\n[Data Scope]")
print(f"  Total records: {len(df):,}")
print(f"  Categories: {df['category'].nunique()}")
print(f"  Intents: {df['intent'].nunique()}")
print(f"  Unique instructions: {df['instruction'].nunique():,}")
print(f"  Unique responses: {df['response'].nunique():,}")
print(f"  Flag combinations: {df['flags'].nunique()}")

print(f"\n[Text Statistics]")
print(f"  Instruction: mean={df['instr_words'].mean():.1f} words, "
      f"median={df['instr_words'].median():.0f}")
print(f"  Response: mean={df['resp_words'].mean():.1f} words, "
      f"median={df['resp_words'].median():.0f}")

print(f"\n[Template Placeholders]")
print(f"  Unique in instructions: {len(instr_ph)}")
print(f"  Unique in responses: {len(resp_ph)}")
print(f"  Instructions with slots: {(df['n_instr_ph']>0).mean()*100:.1f}%")
print(f"  Responses with slots: {(df['n_resp_ph']>0).mean()*100:.1f}%")

print(f"\n[Agent Evaluation]")
for _, row in df_eval.iterrows():
    print(f"  {row['Strategy']:20s}: Intent={row['Intent Accuracy']:.3f}, "
          f"Category={row['Category Accuracy']:.3f}")

## 8. Key Observations

1. **Balanced intents:** The dataset is well balanced across the 27 intents
   (~950-1,000 samples each), so every intent gets a comparable training signal.
   The balance ratio (entropy / max entropy) is near 1.0. Categories are less even,
   because they contain between 1 and 6 intents each.

2. **Rich paraphrase diversity:** Each intent has hundreds of unique instruction
   phrasings (including typos, abbreviations, and colloquial variants marked by
   quality flags), simulating real user behavior.

3. **Template-driven responses:** Responses use dynamic `{{placeholder}}` slots
   (order numbers, URLs, phone numbers), enabling evaluation of both intent
   classification and slot-filling capabilities.

4. **Hierarchical structure:** The 2-level category-intent hierarchy allows
   evaluation at different granularities — coarse (11 categories) vs fine (27 intents).

5. **Flag-based robustness testing:** The 394 flag combinations encode augmentation
   patterns (typos, length variants, paraphrases), enabling systematic robustness evaluation.

6. **Research relevance (IS/AI):**
   - **Intent recognition:** Train and benchmark customer service intent classifiers
   - **Slot filling:** Evaluate entity extraction from templated conversations
   - **Chatbot fine-tuning:** Directly fine-tune LLMs for customer support
   - **Robustness testing:** Assess model stability across paraphrase variants
   - **Dialog breakdown detection:** Use flag patterns to identify potential failure points
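
The balance ratio cited in observation 1 is Shannon entropy divided by its maximum, log2 of the number of classes. As a small worked sketch on toy labels (the function name `balance_ratio` and the label lists are assumptions for illustration):

```python
import math
from collections import Counter


def balance_ratio(labels):
    """Shannon entropy of the label distribution divided by log2(n_classes)."""
    counts = Counter(labels)
    if len(counts) < 2:
        raise ValueError("need at least two classes to normalize entropy")
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return entropy / math.log2(len(counts))


uniform = ["a", "b", "c", "d"] * 5   # four equally frequent classes
skewed = ["a"] * 9 + ["b"]           # 90% / 10% split

print(round(balance_ratio(uniform), 3))  # 1.0
print(round(balance_ratio(skewed), 3))   # 0.469
```

A ratio near 1.0 means the classes are close to equally frequent, while lower values indicate that a few classes dominate.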