**CivicSignal**
This code supports an end-to-end machine learning workflow that combines unsupervised clustering (UMAP + HDBSCAN) with supervised learning (Decision Tree) to uncover meaningful user segments from Ontario 211 call data. The aim is to identify distinct population needs, behavioral patterns, and demographic trends in order to inform data-driven decision-making for public support services.
-------------------------------------------------
Authors:
 1. Guan-Wei Huang
 2. Yu-Ting Lin
-------------------------------------------------
© 2025 CivicSignal. All rights reserved.

This code was developed by the CivicSignal team as part of a data science initiative focused on public service optimization. Unauthorized use, reproduction, or distribution of this code or its components is prohibited without explicit permission from the authors. This code is intended for educational and non-commercial research purposes only.


In [None]:
import pandas as pd
import numpy as np
import umap
import hdbscan
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from math import pi

In [None]:
# Load Datasets
call_reports = pd.read_csv("./***.csv", low_memory=False)
needs = pd.read_csv("./***.csv", low_memory=False)

- **Data Preprocessing**
1. Deduplication
2. Merging
3. Filtering
4. Datetime Parsing
5. Feature Engineering (date time)
6. Missing Value Removal
7. Category Mapping
8. One-Hot Encoding (Need category)
9. Concatenation of Features

We cleaned and prepared the call records in the order listed above: deduplication on CallReportNum, merging the two datasets, filtering to Ontario, datetime parsing, time-period feature engineering, removal of rows with missing values, need-category mapping, and one-hot encoding. The resulting feature matrix combines need categories, demographics (age category and gender), time period, and call length.
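
As a small illustration of these steps, the sketch below runs deduplication, time-period bucketing, category grouping, and one-hot encoding on a made-up four-row table. The rows, the `toy_` names, and the two-entry mapping are hypothetical; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# Hypothetical toy records (not real 211 data); the first two rows share a CallReportNum.
toy = pd.DataFrame({
    "CallReportNum": [101, 101, 102, 103],
    "CallDateAndTimeStart": [
        "2024-01-05 09:15", "2024-01-05 09:15",
        "2024-01-05 14:40", "2024-01-05 23:05",
    ],
    "AIRSNeedCategory": ["Food/Meals", "Food/Meals", "Utility Assistance", "Housing"],
})

# 1. Deduplicate on the call identifier, keeping the first row per call.
toy = toy.drop_duplicates(subset="CallReportNum", keep="first").reset_index(drop=True)

# 2. Parse timestamps and bucket the hour of day into a coarse time period.
toy["CallDateAndTimeStart"] = pd.to_datetime(toy["CallDateAndTimeStart"])

def toy_time_period(hour):
    if 5 <= hour < 12:
        return "Morning"
    if 12 <= hour < 18:
        return "Afternoon"
    return "Evening"

toy["TimePeriod"] = toy["CallDateAndTimeStart"].dt.hour.apply(toy_time_period)

# 3. Group fine-grained need categories; unmapped values keep their original name.
toy_mapping = {"Food/Meals": "Basic Needs", "Utility Assistance": "Basic Needs"}
toy["NeedCategory_Grouped"] = (
    toy["AIRSNeedCategory"].map(toy_mapping).fillna(toy["AIRSNeedCategory"])
)

# 4. One-hot encode the categorical columns and concatenate them column-wise.
toy_features = pd.concat(
    [
        pd.get_dummies(toy["NeedCategory_Grouped"], prefix="NeedCategory"),
        pd.get_dummies(toy["TimePeriod"], prefix="TimePeriod"),
    ],
    axis=1,
)
print(toy_features)
```

With the real data the same pattern applies, only with the full mapping and the additional demographic and call-length columns.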

In [None]:
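# Keep one row per CallReportNum (if a call has several rows, only the first is kept).
# kind='stable' preserves file order among ties, so "first" is deterministic.
needs_dedup = needs.sort_values(by='CallReportNum', kind='stable').drop_duplicates(subset='CallReportNum', keep='first')
call_reports_dedup = call_reports.sort_values(by='CallReportNum', kind='stable').drop_duplicates(subset='CallReportNum', keep='first')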

# Merge Datasets, and Filter for Ontario
df = pd.merge(call_reports_dedup, needs_dedup, on="CallReportNum", how="inner")
df = df.query("StateProvince == 'ON'").reset_index(drop=True)

# Convert Date Column (Define and Apply Time Periods)
df['CallDateAndTimeStart'] = pd.to_datetime(df['CallDateAndTimeStart'])

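def get_time_period(hour):
    # Propagate missing timestamps so the dropna() below can remove them;
    # otherwise a NaN hour would fall through to 'Evening'.
    if pd.isna(hour):
        return np.nan
    if 5 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 18:
        return 'Afternoon'
    else:
        # Covers 18:00-04:59, i.e. evening and overnight calls
        return 'Evening'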

df['TimePeriod'] = df['CallDateAndTimeStart'].dt.hour.apply(get_time_period)

# Drop Incomplete Records
df = df.dropna(subset=[
    'AIRSNeedCategory',
    'AgeCategory_ON',
    'Gender',
    'TimePeriod',
    'CallLength'
]).reset_index(drop=True)

# Map Need Categories
need_mapping = {
    'Mental Health/Addictions': 'Mental Health',
    'Mental Health/Substance Use Disorders': 'Mental Health',
    'Income Support/Financial Assistance': 'Basic Needs',
    'Utility Assistance': 'Basic Needs',
    'Food/Meals': 'Basic Needs',
    'Legal/Public Safety': 'Legal & Immigration',
    'Citizenship/Immigration': 'Legal & Immigration',
    'Other Government/Economic Services': 'Government Services',
    'Consumer Services': 'Government Services',
    'Community Services': 'Government Services',
    'Individual/Family Services': 'Family Support',
    'Information Services': 'Family Support'
}
df['NeedCategory_Grouped'] = df['AIRSNeedCategory'].map(need_mapping).fillna(df['AIRSNeedCategory'])

# Create One-Hot Encoded Features
# (df was re-indexed above, so these frames already share the same 0..n-1 index)
need_dummies = pd.get_dummies(df['NeedCategory_Grouped'], prefix='NeedCategory')
ageCategory_dummies = pd.get_dummies(df['AgeCategory_ON'], prefix='AgeCategory_ON')
gender_dummies = pd.get_dummies(df['Gender'], prefix='Gender')
timeperiod_dummies = pd.get_dummies(df['TimePeriod'], prefix='TimePeriod')
calllength_col = df['CallLength']

# Prepare Final Feature Set (CallLength stays in its original units)
features = pd.concat([
    need_dummies,
    ageCategory_dummies,
    gender_dummies,
    timeperiod_dummies,
    calllength_col
], axis=1)

#print("✅ df rows after filtering:", len(df))
#print("✅ features shape:", features.shape)
#print(features.head())

- **Unsupervised Learning (UMAP + HDBSCAN)**
1. Exploratory Data Analysis (EDA)

To identify natural groupings within the data, we applied UMAP for nonlinear dimensionality reduction to a random sample of 50,000 records, then clustered the resulting 2-D embedding with HDBSCAN. This density-based approach does not require a predefined number of clusters, and it assigns the label -1 to points it treats as noise.
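
Because HDBSCAN marks unassigned points with the label -1, it can help to summarize a label array before plotting it. The helper below is a minimal sketch of such a summary; `summarize_cluster_labels` and the example labels are hypothetical and not part of the hdbscan library.

```python
import numpy as np

def summarize_cluster_labels(labels):
    """Return (n_clusters, noise_fraction, sizes) for labels where -1 means noise."""
    labels = np.asarray(labels)
    noise_mask = labels == -1
    # Count members of each real cluster, ignoring noise points.
    cluster_ids, counts = np.unique(labels[~noise_mask], return_counts=True)
    sizes = dict(zip(cluster_ids.tolist(), counts.tolist()))
    noise_fraction = float(noise_mask.mean()) if labels.size else 0.0
    return len(cluster_ids), noise_fraction, sizes

# Made-up labels standing in for clusterer.labels_
example_labels = [0, 0, 1, -1, 1, 1, -1, 2]
n_clusters, noise_fraction, sizes = summarize_cluster_labels(example_labels)
print(n_clusters, noise_fraction, sizes)
```

A high noise fraction usually suggests revisiting `min_cluster_size` or `min_samples` before interpreting the clusters.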

In [None]:
# Initialize UMAP
reducer = umap.UMAP(n_neighbors=150, min_dist=0.02, n_components=2, random_state=42)

# Randomly sample up to 50,000 rows; a seeded generator keeps the sample reproducible
rng = np.random.default_rng(42)
sample_idx = rng.choice(len(features), size=min(50000, len(features)), replace=False)
features_sample = features.iloc[sample_idx]

# Apply UMAP
embedding = reducer.fit_transform(features_sample)

# Initialize and Fit HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=600, min_samples=3)
labels = clusterer.fit_predict(embedding)

plt.figure(figsize=(8, 6))
plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='tab10', s=50)
plt.title('UMAP + HDBSCAN Clustering')
plt.xlabel('UMAP 1')
plt.ylabel('UMAP 2')
plt.colorbar(label='Cluster')
plt.show()

- **Post-Clustering Feature Profiling**
1. Counted and ranked cluster sizes
2. Selected top 10 most frequent clusters
3. Calculated average feature values per cluster
4. Standardized feature values for comparison
5. Visualized feature strengths using heatmaps
6. Analyzed age and gender patterns across clusters

We profiled the 10 largest clusters by computing their average feature values and visualizing their behavioral and demographic profiles as heatmaps. Z-score standardization was applied to the averaged one-hot features (need category and time period, with age and gender shown in separate heatmaps), and Min-Max scaling was applied to average call length. These views highlight the patterns and dominant characteristics that distinguish each cluster.
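
The two scalings can be written out by hand on a tiny table of cluster means. This sketch uses made-up numbers and standardizes each feature across clusters, which is one possible reading of the comparison; note that passing a features-by-clusters table directly to `StandardScaler` standardizes each cluster column instead. The `toy_summary` values are hypothetical.

```python
import pandas as pd

# Hypothetical cluster means: rows are features, columns are cluster ids.
toy_summary = pd.DataFrame(
    {0: [0.9, 0.1, 120.0], 1: [0.2, 0.7, 300.0], 2: [0.1, 0.1, 480.0]},
    index=["NeedCategory_Housing", "TimePeriod_Morning", "CallLength"],
)

# Z-score each one-hot feature across clusters.
# ddof=0 (population standard deviation) matches StandardScaler's convention.
binary_rows = toy_summary.drop(index="CallLength")
row_mean = binary_rows.mean(axis=1)
row_std = binary_rows.std(axis=1, ddof=0)
z_scores = binary_rows.sub(row_mean, axis=0).div(row_std, axis=0)

# Min-Max scale the average call length across clusters to the 0-1 range.
call = toy_summary.loc["CallLength"]
call_minmax = (call - call.min()) / (call.max() - call.min())

print(z_scores.round(2))
print(call_minmax)
```

A positive z-score then means a cluster is above the cross-cluster average for that feature, while the Min-Max row ranks clusters from shortest (0) to longest (1) average call.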

In [None]:
# Count Cluster Labels
label_counts = Counter(clusterer.labels_)
cluster_distribution = pd.DataFrame.from_dict(label_counts, orient='index', columns=['Count'])
cluster_distribution.sort_values(by='Count', ascending=False).head(10)

In [None]:
# Prepare Clustered Subset
subset_idx = sample_idx
subset_df = df.iloc[subset_idx].copy().reset_index(drop=True)
subset_df['Cluster'] = clusterer.labels_

# Rebuild Feature Subset
features_sample = features.iloc[subset_idx].copy().reset_index(drop=True)
subset_top = pd.concat([
    features_sample.drop(columns='CallLength'),
    subset_df['CallLength'],
    subset_df['Cluster']
], axis=1)

# Select Top 10 Clusters
top_clusters = subset_top['Cluster'].value_counts().head(10).index
group_summary = subset_top[subset_top['Cluster'].isin(top_clusters)] \
                    .groupby('Cluster').mean().T

# Filter Relevant Features
filtered_features = [
    col for col in group_summary.index
    if not (col.startswith('AgeCategory_') or col.startswith('Gender_')) and col != 'CallLength'
]

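# Scale Feature Averages (Z-Score)
# Note: StandardScaler standardizes each column, so with this features-by-clusters
# layout every cluster is z-scored across the listed features.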
binary_scaled = pd.DataFrame(
    StandardScaler().fit_transform(group_summary.loc[filtered_features]),
    index=filtered_features,
    columns=group_summary.columns
)

# Normalize CallLength (MinMax)
calllength_row = group_summary.loc['CallLength'].values.reshape(-1, 1)
calllength_scaled = MinMaxScaler().fit_transform(calllength_row).flatten()
calllength_df = pd.DataFrame([calllength_scaled], index=['CallLength'], columns=group_summary.columns)

# Combine Scaled Data
final_summary = pd.concat([binary_scaled, calllength_df])

plt.figure(figsize=(14, 10))
sns.heatmap(final_summary, cmap='coolwarm', center=0, annot=True, fmt=".2f")
plt.title('Top 10 Cluster Feature Profiles (Z-score + MinMax for CallLength)')
plt.tight_layout()
plt.show()

# Scale Age Category Features
age_features = [col for col in group_summary.index if col.startswith('AgeCategory')]
age_scaled = pd.DataFrame(
    StandardScaler().fit_transform(group_summary.loc[age_features]),
    index=age_features,
    columns=group_summary.columns
)

plt.figure(figsize=(10, 6))
sns.heatmap(age_scaled, cmap='coolwarm', center=0, annot=True, fmt=".2f")
plt.title('Cluster-wise AgeCategory Strengths (Z-score)')
plt.tight_layout()
plt.show()

# Scale Gender Features
gender_features = [col for col in group_summary.index if col.startswith('Gender')]
gender_scaled = pd.DataFrame(
    StandardScaler().fit_transform(group_summary.loc[gender_features]),
    index=gender_features,
    columns=group_summary.columns
)

plt.figure(figsize=(8, 4))
sns.heatmap(gender_scaled, cmap='coolwarm', center=0, annot=True, fmt=".2f")
plt.title('Cluster-wise Gender Strengths (Z-score)')
plt.tight_layout()
plt.show()

- **ML Decision Tree Classifier (Supervised Learning)**
1. Assigned cluster labels to feature data
2. Removed noise samples (unclustered points)
3. Trained a decision tree to predict cluster membership
4. Evaluated model performance
5. Identified top features for differentiating clusters
6. Visualized feature importance with a bar chart

To interpret the unsupervised HDBSCAN clusters, we trained a Decision Tree classifier using cluster labels as the target variable. This enabled us to identify the most important features contributing to cluster separation and build an interpretable model for understanding behavioral differences across user segments.
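
To give some intuition for how a tree ranks features, the sketch below computes the Gini impurity decrease of a single split on a 0/1 feature. scikit-learn's `DecisionTreeClassifier` uses the Gini criterion by default and derives `feature_importances_` from the normalized total of such decreases; the functions and toy arrays here are hypothetical illustrations, not scikit-learn code.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum(p_k^2)."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def gini_decrease(feature, labels):
    """Impurity decrease from splitting the samples on a 0/1 feature."""
    feature = np.asarray(feature).astype(bool)
    labels = np.asarray(labels)
    n = labels.size
    left, right = labels[~feature], labels[feature]
    weighted = (left.size / n) * gini(left) + (right.size / n) * gini(right)
    return gini(labels) - weighted

# Made-up cluster labels and two one-hot style features.
toy_clusters = [0, 0, 0, 1, 1, 1]
separating = [1, 1, 1, 0, 0, 0]      # perfectly separates the two clusters
uninformative = [1, 0, 1, 0, 1, 0]   # nearly unrelated to the clusters

print(gini_decrease(separating, toy_clusters))
print(gini_decrease(uninformative, toy_clusters))
```

Features whose splits remove more impurity, like `separating` here, end up with higher importance in the fitted tree.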

In [None]:
# Assign Cluster Labels
labels = clusterer.labels_
df_clf = features_sample.copy().reset_index(drop=True)
assert len(labels) == len(df_clf), f"labels length: {len(labels)}, df_clf length: {len(df_clf)}"
df_clf['Cluster'] = labels

# Remove Noise Labels (-1)
df_clf = df_clf[df_clf['Cluster'] != -1].reset_index(drop=True)

# Split Features and Target
X = df_clf.drop(columns='Cluster')
y = df_clf['Cluster']

In [None]:
# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Decision Tree Classifier
clf = DecisionTreeClassifier(max_depth=5, random_state=42)
clf.fit(X_train, y_train)

# Evaluate Classifier
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

# Analyze Feature Importance
feature_importance = pd.Series(clf.feature_importances_, index=X.columns)
top_feature_importance = feature_importance.sort_values(ascending=False).head(15)

plt.figure(figsize=(10, 6))
top_feature_importance.plot(kind='barh')
plt.title('Top Features for Cluster Differentiation')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()


- **Cluster Interpretation & Visualization**

To interpret the key behavioral traits of the largest clusters, we plotted the top 5 clusters (excluding noise) on a radar chart. Each spoke represents a top discriminative feature, scaled to the 0-1 range with Min-Max normalization across those clusters. This highlights which clusters stand out in each category.
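
A radar chart places one spoke per feature at evenly spaced angles and repeats the first point at the end so the outline closes. The helper below is a minimal sketch of that geometry; `radar_coordinates` is a hypothetical name and the input values are made up.

```python
from math import pi

def radar_coordinates(values):
    """Return (angles, closed_values) for a radar polygon with one spoke per value."""
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    # Evenly spaced angles around the circle, in radians.
    angles = [2 * pi * i / n for i in range(n)]
    # Repeat the first angle and value so the plotted line returns to its start.
    return angles + angles[:1], list(values) + list(values[:1])

angles, closed = radar_coordinates([0.2, 1.0, 0.5, 0.0])
print(len(angles), closed)
```

The returned lists can be passed directly to `plt.polar(angles, closed)` for one cluster.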

In [None]:
# Get Top 5 Most Frequent Clusters (Excluding Noise)
top_5_clusters = subset_top[subset_top['Cluster'] != -1]['Cluster'].value_counts().head(5).index.tolist()

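# Group by Cluster & Select Top Features
# Use the decision tree's top-ranked features as the radar spokes
# (assumed source of top_feature_names)
top_feature_names = top_feature_importance.index.tolist()
radar_data = subset_top[subset_top['Cluster'].isin(top_5_clusters)] \
    .groupby('Cluster')[top_feature_names].mean()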

# Normalize Data for Radar Chart
scaler = MinMaxScaler()
radar_scaled = pd.DataFrame(
    scaler.fit_transform(radar_data),
    index=radar_data.index,
    columns=radar_data.columns
)

# Prepare Radar Chart Angles
spoke_labels = radar_scaled.columns.tolist()
num_vars = len(spoke_labels)
angles = [n / float(num_vars) * 2 * pi for n in range(num_vars)]
angles += angles[:1]  # repeat the first angle to close the polygon

# Plot Radar Chart for Each Cluster
plt.figure(figsize=(9, 9))
for cluster in radar_scaled.index:
    values = radar_scaled.loc[cluster].tolist()
    values += values[:1]  # repeat the first value to close the polygon
    plt.polar(angles, values, label=f'Cluster {cluster}', linewidth=2)

plt.xticks(angles[:-1], spoke_labels, color='black', size=10)
plt.title("Top Cluster Feature Strengths (Radar Chart)", y=1.15)
plt.legend(loc='upper right', bbox_to_anchor=(1.35, 1.1))
plt.tight_layout()
plt.show()


In [None]:
# Export Clustered Records with Metadata
export_df = subset_df[[
    'CallReportNum',           
    'CallDateAndTimeStart',
    'CountryName',
    'CityName',
    'FSA',
    'Level1Name',
    'Level2Name',
    'NeedCategory_Grouped',    
    'AgeCategory_ON',
    'Gender',
    'TimePeriod',
    'CallLength',
    'Cluster'
]]

export_df.to_csv("clustered_call_records.csv", index=False)
print("Saved clustered_call_records.csv with cluster labels and call metadata")
