# Insurance Agent Segmentation and Modeling Pipeline
This notebook performs feature engineering, K-Means clustering, and a supervised modeling pipeline to segment and target insurance agents.

### About Dataset
An insurance group consists of 10 property and casualty insurance, life insurance and insurance brokerage companies. The property and casualty companies in the group operate in a 17-state region. The group is a major regional property and casualty insurer, represented by more than 4,000 independent agents who live and work in local communities through a six-state region. Define the metrics to analyse agent performance based on several attributes like demography, products sold, new business, etc. The goal is to improve their existing knowledge used for agent segmentation in a supervised predictive framework.

Data courtesy of Kaggle: https://www.kaggle.com/datasets/moneystore/agencyperformance

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

## Step 1: Load the dataset

In [3]:
# Replace this path with the actual path to your full dataset
df = pd.read_csv(r"C:\Users\maxwell.bicking\data-science-portfolio\Insurance Linear Regression\finalapi.csv")

In [4]:
df.head(10)

Unnamed: 0,AGENCY_ID,PRIMARY_AGENCY_ID,PROD_ABBR,PROD_LINE,STATE_ABBR,STAT_PROFILE_DATE_YEAR,RETENTION_POLY_QTY,POLY_INFORCE_QTY,PREV_POLY_INFORCE_QTY,NB_WRTN_PREM_AMT,...,PL_BOUND_CT_ELINKS,PL_QUO_CT_ELINKS,PL_BOUND_CT_PLRANK,PL_QUO_CT_PLRANK,PL_BOUND_CT_eQTte,PL_QUO_CT_eQTte,PL_BOUND_CT_APPLIED,PL_QUO_CT_APPLIED,PL_BOUND_CT_TRANSACTNOW,PL_QUO_CT_TRANSACTNOW
0,3,3,BOILERMACH,CL,IN,2005,0,0,0,40.0,...,0,0,0,103,50,288,0,0,0,0
1,3,3,BOILERMACH,CL,IN,2006,0,0,0,151.0,...,0,0,0,103,50,288,0,0,0,0
2,3,3,BOILERMACH,CL,IN,2007,0,0,0,40.0,...,0,0,0,103,50,288,0,0,0,0
3,3,3,BOILERMACH,CL,IN,2008,0,0,0,69.0,...,0,0,0,103,50,288,0,0,0,0
4,3,3,BOILERMACH,CL,IN,2009,0,0,0,28.0,...,0,0,0,103,50,288,0,0,0,0
5,3,3,BOILERMACH,CL,IN,2010,0,0,0,120.0,...,0,0,0,103,50,288,0,0,0,0
6,3,3,BOILERMACH,CL,IN,2011,0,0,0,231.0,...,0,0,0,103,50,288,0,0,0,0
7,3,3,BOILERMACH,CL,IN,2012,0,0,0,0.0,...,0,0,0,103,50,288,0,0,0,0
8,3,3,BOILERMACH,CL,IN,2013,0,0,0,111.0,...,0,0,0,103,50,288,0,0,0,0
9,3,3,BOILERMACH,CL,IN,2014,0,0,0,213.0,...,0,0,0,103,50,288,0,0,0,0


In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 213328 entries, 0 to 213327
Data columns (total 49 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   AGENCY_ID                  213328 non-null  int64  
 1   PRIMARY_AGENCY_ID          213328 non-null  int64  
 2   PROD_ABBR                  213328 non-null  object 
 3   PROD_LINE                  213328 non-null  object 
 4   STATE_ABBR                 213328 non-null  object 
 5   STAT_PROFILE_DATE_YEAR     213328 non-null  int64  
 6   RETENTION_POLY_QTY         213328 non-null  int64  
 7   POLY_INFORCE_QTY           213328 non-null  int64  
 8   PREV_POLY_INFORCE_QTY      213328 non-null  int64  
 9   NB_WRTN_PREM_AMT           213328 non-null  float64
 10  WRTN_PREM_AMT              213328 non-null  float64
 11  PREV_WRTN_PREM_AMT         213328 non-null  float64
 12  PRD_ERND_PREM_AMT          213328 non-null  float64
 13  PRD_INCRD_LOSSES_AMT       21

In [35]:
print(df['VENDOR'].unique())
print(df['VENDOR_IND'].unique())

['Unknown' 'C' 'A' 'E' 'G' 'B' 'D' 'H' 'J' 'F']
['N' 'Y']


## Step 2: Feature Engineering

The features were chosen to balance predictive power, business interpretability, and scalability across the full dataset. Here's the breakdown:

1. Retention & Loss Metrics <br>

    - RETENTION_RATIO - Captures loyalty: are agents retaining their business? <br>
    - LOSS_RATIO, LOSS_RATIO_3YR - Measure the risk quality of the agent's book; a higher ratio means costlier claims relative to premium. <br>
    - GROWTH_RATE_3YR - Indicates expansion; faster-growing agents may be worth investing in or monitoring. <br><br>

2. Premium & Production Metrics <br>

    - NB_WRTN_PREM_AMT, WRTN_PREM_AMT - New-business and total written premium, the core size/volume indicators. <br>
    - PRD_ERND_PREM_AMT - Earned premium, which helps distinguish high-volume but low-profit agencies. <br><br>

3. Demographic & Structural Indicators <br>

    - ACTIVE_PRODUCERS - More producers usually mean broader reach or specialization. <br>
    - PRODUCER_AGE_SPREAD - Suggests diversity in workforce experience (senior vs junior mix). <br>
    - AGENCY_TENURE - Tied to maturity and embeddedness in the agency's market; longer tenure often indicates stability. <br><br>

4. Quote System Hit Ratios <br>

    - HIT_RATIO_<__> for each quote system (eQT, SBZ, etc.) - Shows how effectively agents convert quotes into bound policies. <br>
    - Analyzing by system gives insight into whether certain tech platforms drive better performance or indicate tech-savviness. <br><br>


These collectively give a 360° view of an agent’s performance:

- Retention and risk quality
- New business generation
- Efficiency in sales pipeline
- Demographic indicators of stability and capacity
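
As a small worked example of the hit-ratio idea described above, the sketch below computes one ratio on a toy frame. The first row reuses the `PL_BOUND_CT_eQTte` and `PL_QUO_CT_eQTte` values visible in the `df.head(10)` output; the other two rows are made up to show the zero-quote edge case, and the frame name `toy` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the real data; only the first row mirrors values from df.head(10).
toy = pd.DataFrame({
    "PL_BOUND_CT_eQTte": [50, 0, 3],
    "PL_QUO_CT_eQTte": [288, 0, 0],
})

# Hit ratio = bound policies / quotes issued.
# Zero quote counts are replaced with 1 to avoid division by zero.
toy["HIT_RATIO_eQTte"] = toy["PL_BOUND_CT_eQTte"] / toy["PL_QUO_CT_eQTte"].replace(0, 1)

print(toy)
```

Note the third row: with bound policies but no recorded quotes, the zero-to-one replacement yields a "ratio" equal to the bound count. If such rows exist in the real data, they may be worth inspecting before clustering.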


### To Do:

- Encode VENDOR column and other categoricals
- Include year-over-year deltas (e.g. growth between WRTN_PREM_AMT this year vs last)
- Use time series clustering to analyze agents over time instead of snapshots
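
The first two To Do items could look roughly like the sketch below. It runs on a toy frame with made-up values; `VENDOR`, `VENDOR_IND`, `WRTN_PREM_AMT`, and `PREV_WRTN_PREM_AMT` are columns from the dataset, while `WRTN_PREM_DELTA` and `WRTN_PREM_GROWTH` are hypothetical names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_clean; the values are made up for illustration.
toy = pd.DataFrame({
    "VENDOR": ["Unknown", "C", "A", "C"],
    "VENDOR_IND": ["N", "Y", "Y", "Y"],
    "WRTN_PREM_AMT": [1200.0, 800.0, 0.0, 500.0],
    "PREV_WRTN_PREM_AMT": [1000.0, 1000.0, 0.0, 0.0],
})

# One-hot encode VENDOR and map the Y/N indicator to 1/0.
encoded = pd.get_dummies(toy, columns=["VENDOR"], prefix="VENDOR", dtype=int)
encoded["VENDOR_IND"] = (encoded["VENDOR_IND"] == "Y").astype(int)

# Year-over-year change in written premium (hypothetical column names).
encoded["WRTN_PREM_DELTA"] = encoded["WRTN_PREM_AMT"] - encoded["PREV_WRTN_PREM_AMT"]

# Relative change; left as NaN when there was no prior-year premium.
prev = encoded["PREV_WRTN_PREM_AMT"].replace(0, np.nan)
encoded["WRTN_PREM_GROWTH"] = encoded["WRTN_PREM_DELTA"] / prev

print(encoded)
```

Leaving the growth rate as NaN when the prior-year premium is zero is one possible choice; how to fill it before clustering would still need to be decided.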

In [17]:
# Drop irrelevant columns
df_clean = df.drop(columns=["Unnamed: 0"], errors='ignore')

# Quote system bound/quote pairs
quote_systems = [
    ("CL_BOUND_CT_MDS", "CL_QUO_CT_MDS"),
    ("CL_BOUND_CT_SBZ", "CL_QUO_CT_SBZ"),
    ("CL_BOUND_CT_eQT", "CL_QUO_CT_eQT"),
    ("PL_BOUND_CT_ELINKS", "PL_QUO_CT_ELINKS"),
    ("PL_BOUND_CT_PLRANK", "PL_QUO_CT_PLRANK"),
    ("PL_BOUND_CT_eQTte", "PL_QUO_CT_eQTte"),
    ("PL_BOUND_CT_APPLIED", "PL_QUO_CT_APPLIED"),
    ("PL_BOUND_CT_TRANSACTNOW", "PL_QUO_CT_TRANSACTNOW")
]

# Calculate hit ratios (bound policies / quotes), one column per quote system
for bound_col, quote_col in quote_systems:
    # Name each column after the system suffix (the last token),
    # e.g. "CL_BOUND_CT_MDS" -> "HIT_RATIO_MDS"
    hit_ratio_col = f"HIT_RATIO_{bound_col.split('_')[-1]}"
    # Zero quote counts are replaced with 1 to avoid division by zero
    df_clean[hit_ratio_col] = df_clean[bound_col] / df_clean[quote_col].replace(0, 1)

# Additional features
df_clean["PRODUCER_AGE_SPREAD"] = df_clean["MAX_AGE"] - df_clean["MIN_AGE"]
df_clean["AGENCY_TENURE"] = df_clean["STAT_PROFILE_DATE_YEAR"] - df_clean["AGENCY_APPOINTMENT_YEAR"]

## Step 3: K-Means Clustering for Agent Segmentation

In [None]:
# Select features for clustering
feature_cols = [
    "RETENTION_RATIO", "LOSS_RATIO", "LOSS_RATIO_3YR", "GROWTH_RATE_3YR",
    "NB_WRTN_PREM_AMT", "WRTN_PREM_AMT", "PRD_ERND_PREM_AMT",
    "ACTIVE_PRODUCERS", "PRODUCER_AGE_SPREAD", "AGENCY_TENURE"
] + [f"HIT_RATIO_{bound_col.split('_')[-1]}" for bound_col, _ in quote_systems]

features_df = df_clean[feature_cols].fillna(0)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(features_df)

# Automatic K selection
silhouette_scores = []
K_range = range(2, 10)
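for k in K_range:
    # n_init is set explicitly so results do not depend on the scikit-learn version's default
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    # Score on a random subsample: the full silhouette needs all pairwise distances,
    # which is impractical for ~213k rows
    score = silhouette_score(X_scaled, labels, sample_size=10000, random_state=42)
    silhouette_scores.append(score)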

best_k = K_range[np.argmax(silhouette_scores)]
print(f"Best number of clusters by silhouette score: {best_k}")

# Final clustering
kmeans_final = KMeans(n_clusters=best_k, n_init=10, random_state=42)
df_clean["AGENT_CLUSTER"] = kmeans_final.fit_predict(X_scaled)
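
One way to interpret the resulting segments is to compare per-cluster feature means. The sketch below is a minimal illustration on a toy frame with made-up values and labels; in the notebook the same `groupby` would be applied to `df_clean` using `feature_cols`. The names `toy`, `profile`, and `N_ROWS` are hypothetical.

```python
import pandas as pd

# Toy stand-in for df_clean after clustering; values and labels are made up.
toy = pd.DataFrame({
    "AGENT_CLUSTER": [0, 0, 1, 1, 1],
    "RETENTION_RATIO": [0.90, 0.80, 0.50, 0.60, 0.40],
    "LOSS_RATIO": [0.30, 0.50, 0.90, 0.70, 0.80],
})

# Mean of each feature per cluster, plus the number of rows in each cluster.
profile = toy.groupby("AGENT_CLUSTER").mean()
profile["N_ROWS"] = toy.groupby("AGENT_CLUSTER").size()

print(profile.round(2))
```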

### Visualize Clusters with PCA

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df_clean["AGENT_CLUSTER"], cmap='viridis', s=10)
plt.title("K-Means Clustering of Agents (PCA 2D Projection)")
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.grid(True)
plt.tight_layout()
plt.show()
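
A 2D projection can hide much of the structure in the feature space, so it may help to check how much variance the two components retain. The sketch below does this on synthetic data standing in for `X_scaled`; `X_toy`, `pca_toy`, and `explained` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-in for X_scaled: 200 rows, 5 features, two of them correlated.
X_toy = rng.normal(size=(200, 5))
X_toy[:, 1] = X_toy[:, 0] + 0.1 * rng.normal(size=200)
X_toy = StandardScaler().fit_transform(X_toy)

# Fit a 2-component PCA and read the share of variance each component explains.
pca_toy = PCA(n_components=2).fit(X_toy)
explained = pca_toy.explained_variance_ratio_

print(f"PC1: {explained[0]:.1%}, PC2: {explained[1]:.1%}, total: {explained.sum():.1%}")
```

In the notebook, the fitted `pca` object exposes the same `explained_variance_ratio_` attribute. A low total suggests that clusters which overlap in the plot may still be well separated in the full feature space.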

## Step 4: Supervised Model to Predict Agent Segments

In [None]:
# Define X and y (reuse the NaN-filled feature matrix from the clustering step)
X = features_df
y = df_clean["AGENT_CLUSTER"]

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit model
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Predict and evaluate
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
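
Because the cluster labels are derived from the same features the classifier sees, high accuracy is to be expected here. The more informative output is probably which features drive the segment assignment. The sketch below shows how feature importances could be read from a fitted random forest. It uses synthetic data in which the label depends only on `RETENTION_RATIO`; `X_toy`, `y_toy`, `clf_toy`, and the `NOISE` column are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the feature matrix; values are random.
X_toy = pd.DataFrame({
    "RETENTION_RATIO": rng.uniform(0, 1, 300),
    "LOSS_RATIO": rng.uniform(0, 1, 300),
    "NOISE": rng.normal(size=300),
})
# Synthetic label that depends only on RETENTION_RATIO.
y_toy = (X_toy["RETENTION_RATIO"] > 0.5).astype(int)

clf_toy = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Impurity-based importances, sorted from most to least influential.
importances = pd.Series(
    clf_toy.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)

print(importances)
```

In the notebook, the equivalent would be `pd.Series(clf.feature_importances_, index=feature_cols)`.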

## Step 5: Export Results

In [None]:
df_clean.to_csv("insurance_agents_clustered.csv", index=False)