# 🏆 Customer Tier Classification

### Problem Statement

Project 6: Customer Tier Classification (COIL 2000)

- Objective: Classify customers into marketing tiers (e.g., Gold, Silver, Bronze) using demographic and policy data. The NLP extension involves incorporating textual data from customer interactions or applications.

- Dataset: COIL 2000 dataset (structured). Text data would be synthetic (e.g., "customer inquired about bundling home and auto insurance").

- Requirements:

    - Approach: Similar to Project 3. Enrich structured data with features derived from text (embeddings, topic labels, sentiment).

    - Modeling: Multi-class classification model (e.g., XGBoost, Random Forest, Neural Network).

    - Evaluation: Accuracy, F1-score (macro average).

### Project Idea

1. Imagine an insurance company wants to run different marketing campaigns for different types of customers. They don't want to treat a new, young customer the same way they treat a long-time, loyal customer with multiple policies.

2. This project is about creating a smart system that automatically sorts customers into groups (like Gold, Silver, Bronze) based on their information.

3. What you have: A list of facts about customers (their age, what kind of car they have, if they've made claims, etc.). This is the "structured data."

4. The cool extra step: You also get to use notes that employees write about customers, like "Customer called, very happy with claim service" or "Client inquired about a discount". This is the "text data" or NLP part.

5. The goal: Build a model that combines both the facts and the meaning of those notes to decide which group (Gold, Silver, Bronze) a customer belongs to. The better the sorting, the more effective the company's marketing will be.



### The Real-Life Scenario

- Company: Any large insurance company (e.g., State Farm, Allstate, Geico).

- Business Problem: Marketing budgets are limited. Sending expensive promotional gifts or personal agent calls to every single customer is inefficient and costly. The company wants to personalize its marketing efforts to maximize customer retention and sales.

- The "Tier" Solution:
The company decides to create customer tiers to target them appropriately:

    1. Gold Tier: High-Value Loyalists. Long-time customers with multiple policies (e.g., home + auto + life insurance) who rarely make costly claims. Marketing Action: Send them exclusive gifts, offer a dedicated agent, and give them the highest loyalty discounts. Goal: Keep them happy so they never leave.

    2. Silver Tier: Growth Potential. Customers with one policy who have a good payment history. Maybe they recently asked about adding another policy. Marketing Action: Send them targeted ads ("Bundle and save 15%!"). Goal: Upsell them and turn them into Gold customers.

    3. Bronze Tier: New or High-Risk Customers. New customers or those with a history of late payments or frequent small claims. Marketing Action: Send them automated payment reminders and basic renewal notices. Goal: Manage cost and risk efficiently.


### Step 1: Install required packages

In [1]:
# Step 1: Install required packages (run in terminal first)
# pip install pandas numpy matplotlib seaborn scikit-learn xgboost tensorflow spacy joblib
# python -m spacy download en_core_web_sm

# Import all required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import os
import spacy
from datetime import datetime

# ML Libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

# Classifiers
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

# Neural Network
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# NLP
from sklearn.feature_extraction.text import TfidfVectorizer

# Saving models
import joblib
import pickle

# Load spaCy model (requires the en_core_web_sm download above)
print("Loading spaCy model...")
nlp = spacy.load("en_core_web_sm")

# Set style for plots
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)


Loading spaCy model...


### Step 2: Load and analyse dataset

In [2]:
import pandas as pd
df = pd.read_csv('data/health_insurance.csv', nrows=100000) # Update the path
df.head()

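| | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 44 | 1 | 28.0 | 0 | > 2 Years | Yes | 40454.0 | 26.0 | 217 | 1 |
| 1 | 2 | Male | 76 | 1 | 3.0 | 0 | 1-2 Year | No | 33536.0 | 26.0 | 183 | 0 |
| 2 | 3 | Male | 47 | 1 | 28.0 | 0 | > 2 Years | Yes | 38294.0 | 26.0 | 27 | 1 |
| 3 | 4 | Male | 21 | 1 | 11.0 | 1 | < 1 Year | No | 28619.0 | 152.0 | 203 | 0 |
| 4 | 5 | Female | 29 | 1 | 41.0 | 1 | < 1 Year | No | 27496.0 | 152.0 | 39 | 0 |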


In [84]:
df.isnull().sum()

id                      0
Gender                  0
Age                     0
Driving_License         0
Region_Code             0
Previously_Insured      0
Vehicle_Age             0
Vehicle_Damage          0
Annual_Premium          0
Policy_Sales_Channel    0
Vintage                 0
Response                0
dtype: int64

In [86]:
df.describe()

| | id | Age | Driving_License | Region_Code | Previously_Insured | Annual_Premium | Policy_Sales_Channel | Vintage | Response |
|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 50000.5 | 38.7647 | 0.99804 | 26.44514 | 0.45765 | 30491.61945 | 112.43802 | 154.34983 | 0.12301 |
| std | 28867.657797 | 15.489417 | 0.044229 | 13.225898 | 0.498206 | 17145.025662 | 54.020526 | 83.768386 | 0.32845 |
| min | 1.0 | 20.0 | 0.0 | 0.0 | 0.0 | 2630.0 | 1.0 | 10.0 | 0.0 |
| 25% | 25000.75 | 25.0 | 1.0 | 15.0 | 0.0 | 24348.75 | 30.0 | 82.0 | 0.0 |
| 50% | 50000.5 | 36.0 | 1.0 | 28.0 | 0.0 | 31630.0 | 148.0 | 154.0 | 0.0 |
| 75% | 75000.25 | 49.0 | 1.0 | 36.0 | 1.0 | 39411.0 | 152.0 | 227.0 | 0.0 |
| max | 100000.0 | 85.0 | 1.0 | 52.0 | 1.0 | 540165.0 | 163.0 | 299.0 | 1.0 |


In [85]:
df.nunique()

id                      100000
Gender                       2
Age                         66
Driving_License              2
Region_Code                 53
Previously_Insured           2
Vehicle_Age                  3
Vehicle_Damage               2
Annual_Premium           33288
Policy_Sales_Channel       136
Vintage                    290
Response                     2
dtype: int64

In [90]:
columns = ['Gender', 'Driving_License', 'Region_Code',
       'Previously_Insured', 'Vehicle_Age', 'Vehicle_Damage']

for col in columns:
    print(f"{col}: {df[col].unique()}\n")

Gender: ['Male' 'Female']

Driving_License: [1 0]

Region_Code: [28.  3. 11. 41. 33.  6. 35. 50. 15. 45.  8. 36. 30. 26. 16. 47. 48. 19.
 39. 23. 37.  5. 17.  2.  7. 29. 46. 27. 25. 13. 18. 20. 49. 22. 44.  0.
  9. 31. 12. 34. 21. 10. 14. 38. 24. 40. 43. 32.  4. 51. 42.  1. 52.]

Previously_Insured: [0 1]

Vehicle_Age: ['> 2 Years' '1-2 Year' '< 1 Year']

Vehicle_Damage: ['Yes' 'No']



### Step 3: Create Tier Label with Complexity and Noise

In [None]:
# Step 3: Create Tier Label with Complexity and Noise
print("\nCreating tier labels with noise...")
premium_threshold = df['Annual_Premium'].quantile(0.75)

# Create a more complex, continuous value score with noise
df['value_score'] = (
    df['Annual_Premium'] * 0.5 +
    np.random.normal(0, 5000, len(df)) +  # Add random noise
    (df['Response'] == 1) * 1000 +
    (df['Previously_Insured'] == 0) * 500 -
    df['Vintage'] * 0.1
)

# Use quantiles to create tiers
df['Tier'] = pd.qcut(df['value_score'], q=3, labels=['Bronze', 'Silver', 'Gold'])

# Check distribution
df['Tier'].value_counts()


Creating tier labels with noise...


Tier
Silver    33334
Bronze    33333
Gold      33333
Name: count, dtype: int64
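
The near-identical tier counts follow from how `pd.qcut` works: it cuts at quantiles, so each bin receives roughly the same number of rows regardless of the score distribution. A minimal sketch on a toy column (the scores below are made up for illustration; only `q=3` and the labels mirror the cell above):

```python
import pandas as pd

# Hypothetical scores, not taken from the dataset.
scores = pd.Series([5, 1, 9, 3, 7, 2, 8, 4, 6])

# q=3 cuts at the 1/3 and 2/3 quantiles, so each bin holds about a third of the rows.
tiers = pd.qcut(scores, q=3, labels=['Bronze', 'Silver', 'Gold'])

counts = tiers.value_counts()
print(counts)
```

Because the tiers are defined by rank, "Gold" here means "top third by `value_score`", not "above some fixed premium".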

### Step 4: Add Customer Notes column

In [None]:
gold_notes = [
    "Customer requested a callback to discuss bundling home and vehicle insurance. Strong cross-sell potential.",
    "Attended recent webinar on premium vehicle coverage. Asked multiple questions. Ready for conversion.",
    "Referred two friends for vehicle insurance. Highly engaged and influential.",
    "Downloaded comparison chart for premium plans. Wants to upgrade existing policy.",
    "Requested personalized quote via mobile app. High intent signal.",
    "Visited pricing page multiple times in one week. Sales team should reach out.",
    "Left positive feedback on health policy. Open to vehicle insurance upsell.",
    "Customer has multiple policies. Expressed interest in loyalty rewards.",
    "Asked about accident coverage specifics. Likely to purchase add-ons.",
    "Completed online eligibility check for vehicle insurance. Strong lead.",
    "Requested agent visit for policy explanation. High-touch prospect.",
    "Customer has excellent payment history. Ideal candidate for premium plan.",
    "Called to inquire about zero-depreciation add-on. Ready for upsell.",
    "Opened and clicked all links in promotional email. High engagement.",
    "Has luxury vehicle. Interested in top-tier coverage.",
    "Customer asked for claim process details. Indicates serious consideration.",
    "Requested brochure in regional language. Shows intent and accessibility needs.",
    "Has bundled travel and health insurance. Vehicle insurance is next target.",
    "Customer gave testimonial for health policy. Brand advocate.",
    "Engaged in live chat for 20+ minutes discussing vehicle coverage.",
    "Requested quote comparison with competitor. Wants best value.",
    "Customer has high net worth. Prefers concierge service.",
    "Asked about coverage for electric vehicles. Trend-aware and proactive.",
    "Downloaded mobile app and browsed vehicle insurance section.",
    "Customer requested callback during working hours. Shows planning and intent.",
    "Has history of upgrading policies annually. Likely to convert.",
    "Responded to SMS campaign with specific questions. Active lead.",
    "Customer has family coverage. Interested in adding vehicle for spouse.",
    "Attended in-person seminar. Asked about bundling options.",
    "Requested premium calculator link. Indicates buying behavior.",
    "Customer has no claim history. Ideal for premium discounts.",
    "Asked about roadside assistance benefits. Ready for value-added services.",
    "Has multiple vehicles. Interested in fleet coverage.",
    "Customer requested policy draft for review. Near conversion.",
    "Responded positively to loyalty program email. Ready for engagement.",
    "Has long-term health policy. Trusts brand.",
    "Customer requested agent recommendation. Prefers personalized service.",
    "Asked about coverage for vintage cars. Niche interest.",
    "Customer has corporate tie-up. Wants personal vehicle coverage.",
    "Requested call recording for policy explanation. Detail-oriented lead.",
    "Customer has high credit score. Eligible for premium benefits.",
    "Asked about international driving coverage. Frequent traveler.",
    "Customer has teenage driver. Interested in safe-driver discounts.",
    "Requested policy in digital format. Tech-savvy and responsive.",
    "Customer has history of early renewals. Reliable lead.",
    "Asked about coverage for natural disasters. Risk-aware buyer.",
    "Customer requested policy walkthrough. High engagement.",
    "Has bundled pet and health insurance. Vehicle insurance next.",
    "Customer has leased vehicle. Needs tailored coverage.",
    "Requested agent credentials. Trust-focused buyer.",
    "Customer has business vehicle. Interested in commercial coverage."
]


silver_notes = [
    "Customer clicked on ad but did not proceed. Mild interest.",
    "Asked about policy duration. Needs clarity before committing.",
    "Customer has mid-tier health plan. Vehicle insurance could be next.",
    "Opened email but didn’t click. Passive engagement.",
    "Customer asked about EMI options. Needs affordability pitch.",
    "Visited FAQ page. Might need follow-up with human agent.",
    "Customer has one policy. Upsell opportunity with vehicle insurance.",
    "Asked about claim limits. Needs reassurance.",
    "Customer browsed testimonials. Building trust.",
    "Requested brochure but hasn’t responded. Follow-up needed.",
    "Customer has mid-range vehicle. Suitable for standard plan.",
    "Asked about coverage for theft. Needs value proposition.",
    "Customer has seasonal driving habits. Needs flexible plan.",
    "Responded to chatbot but dropped off. Re-engagement needed.",
    "Customer asked about cancellation terms. Needs confidence boost.",
    "Has basic health policy. Vehicle insurance could be bundled.",
    "Customer asked about third-party liability. Needs education.",
    "Clicked on SMS link but didn’t proceed. Mild interest.",
    "Customer has family health plan. Vehicle insurance for dependents possible.",
    "Asked about policy portability. Needs assurance.",
    "Customer has history of late renewals. Needs proactive outreach.",
    "Visited blog post on vehicle safety. Soft lead.",
    "Customer asked about app features. Tech-friendly but undecided.",
    "Has mid-level engagement score. Needs nurturing.",
    "Customer asked about agent availability. Prefers human touch.",
    "Opened push notification but didn’t act. Passive lead.",
    "Customer has one vehicle. Standard plan may suffice.",
    "Asked about bundling with travel insurance. Cross-sell potential.",
    "Customer has moderate income. Needs budget-friendly options.",
    "Clicked on comparison chart but didn’t download. Mild interest.",
    "Customer asked about premium refund. Needs clarity.",
    "Has history of switching providers. Needs retention strategy.",
    "Customer browsed policy terms. Needs simplified explanation.",
    "Asked about coverage for shared vehicles. Needs niche plan.",
    "Customer has basic driving history. Standard plan suitable.",
    "Responded to survey but didn’t opt-in. Passive interest.",
    "Customer has one dependent. Family plan could appeal.",
    "Asked about policy renewal reminders. Needs convenience.",
    "Customer browsed agent profiles. Prefers personalized service.",
    "Has mid-tier engagement score. Needs targeted follow-up.",
    "Customer asked about app login issues. Needs support.",
    "Clicked on blog post about insurance myths. Curious but cautious.",
    "Customer has mid-range sedan. Standard coverage likely.",
    "Asked about policy exclusions. Needs transparency.",
    "Customer has basic coverage. Upsell opportunity.",
    "Responded to email with generic query. Needs tailored pitch.",
    "Customer has moderate claim history. Needs reassurance.",
    "Asked about coverage for rental cars. Occasional driver.",
    "Customer browsed renewal options. Might upgrade.",
    "Has basic health plan. Vehicle insurance could be next step.",
    "Customer asked about agent callback. Medium intent."
]


bronze_notes = [
    "Customer ignored multiple outreach attempts. Low engagement.",
    "Profile shows minimal digital activity. Hard to reach.",
    "Customer unsubscribed from marketing emails. Respect preferences.",
    "Only visited homepage. No product interaction.",
    "Customer has outdated contact info. Needs verification.",
    "Responded with 'not interested' to SMS campaign.",
    "Customer has basic plan and no upgrades in 3 years.",
    "Low click-through rate on ads. Passive behavior.",
    "Customer has history of policy lapses. Low reliability.",
    "Only responds to renewal alerts. No upsell potential.",
    "Customer has minimal claim history but no interest in upgrades.",
    "Profile shows low income bracket. Focus on retention.",
    "Customer declined agent call. Not open to discussion.",
    "Only uses mobile app for renewals. No engagement with new products.",
    "Customer has basic vehicle. No interest in add-ons.",
    "Responded negatively to survey. Avoid aggressive marketing.",
    "Customer has minimal driving history. Low insurance need.",
    "Only interacts during mandatory renewals. No upsell behavior.",
    "Customer has no dependents. Limited cross-sell options.",
    "Profile shows low credit score. Limited premium eligibility.",
    "Customer ignored brochure delivery. No follow-up needed.",
    "Only clicked on unsubscribe link. Avoid future outreach.",
    "Customer has minimal online presence. Hard to target.",
    "Responded with generic queries. No specific interest.",
    "Customer has basic plan and prefers no changes.",
    "Only engages with generic ads. No personalization needed.",
    "Customer has no vehicle ownership record. Low relevance.",
    "Profile shows frequent provider switching. Low loyalty.",
    "Customer declined promotional offer. Not receptive.",
    "Only responds to SMS queries. Avoid outbound calls.",
    "Customer has minimal app usage. Low tech engagement.",
    "Responded with 'just browsing' on chatbot.",
    "Customer has basic health plan and no interest in bundling.",
    "Only interacts during tax season. Limited insurance interest.",
    "Customer has minimal feedback history. Passive user.",
    "Profile shows no referrals. Low network influence.",
    "Customer has basic coverage and no add-ons.",
    "Only responds to mandatory updates. Avoid upsell.",
    "Customer has minimal driving frequency. Low coverage need.",
    "Responded with 'maybe later' to agent pitch.",
    "Customer has basic plan and prefers offline communication.",
    "Only interacts with renewal portal. No product exploration.",
    "Customer has minimal claim activity. Passive behavior.",
    "Profile shows low engagement score. Avoid proactive outreach.",
    "Customer declined loyalty program. Not value-driven.",
    "Only uses app for payment. No browsing behavior.",
    "Customer has basic vehicle and no interest in upgrades.",
    "Responded with 'not now' to bundling offer.",
]

In [44]:


ambiguous_notes = [
    "Customer requested information packet.",
    "Standard policy review completed.",
    "Client asked about payment options.",
    "Scheduled callback for next week.",
    "Updated contact information in system."
]

def generate_realistic_note(row, noise_level=0.2, ambiguous_prob=0.3):
    """Generate notes with noise and ambiguity"""
    # Chance of ambiguous note
    if np.random.random() < ambiguous_prob:
        return np.random.choice(ambiguous_notes)
    
    # Chance of incorrect tier note (noise)
    if np.random.random() < noise_level:
        incorrect_tiers = [t for t in ['Gold', 'Silver', 'Bronze'] if t != row['Tier']]
        chosen_tier = np.random.choice(incorrect_tiers)
    else:
        chosen_tier = row['Tier']
    
    # Return appropriate note
    if chosen_tier == 'Gold':
        return np.random.choice(gold_notes)
    elif chosen_tier == 'Silver':
        return np.random.choice(silver_notes)
    else:
        return np.random.choice(bronze_notes)

df['Customer_Note'] = df.apply(lambda row: generate_realistic_note(row, noise_level=0.08, ambiguous_prob=0.08), axis=1)
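
The two parameters interact: a note is first replaced by an ambiguous one with probability `ambiguous_prob`, and only otherwise is its tier flipped with probability `noise_level`. The short calculation below, checked by simulation, estimates how many notes still match the true tier. The helper name `note_mix` and the random seed are assumptions made for this sketch.

```python
import numpy as np

def note_mix(noise_level, ambiguous_prob):
    """Expected share of ambiguous, mismatched, and matching notes."""
    ambiguous = ambiguous_prob
    mismatched = (1 - ambiguous_prob) * noise_level
    matching = (1 - ambiguous_prob) * (1 - noise_level)
    return ambiguous, mismatched, matching

ambiguous, mismatched, matching = note_mix(noise_level=0.08, ambiguous_prob=0.08)
print(f"ambiguous={ambiguous:.4f} mismatched={mismatched:.4f} matching={matching:.4f}")

# Monte Carlo check that mirrors the two sequential random draws in the generator.
rng = np.random.default_rng(0)
n = 200_000
is_ambiguous = rng.random(n) < 0.08
is_flipped = ~is_ambiguous & (rng.random(n) < 0.08)
simulated_matching = 1 - is_ambiguous.mean() - is_flipped.mean()
print(f"simulated matching share: {simulated_matching:.4f}")
```

With the values used in the call above, roughly 85% of notes point at the correct tier, so the text on its own cannot classify every customer correctly.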

#### Preprocess customer notes column

In [45]:

def preprocess_text_with_spacy(text):
    """Preprocess text using spaCy"""
    doc = nlp(text.lower())
    
    # Extract tokens: only alphabetic, not stopwords, not punctuation, length > 2
    tokens = [
        token.lemma_.lower() for token in doc 
        if not token.is_stop 
        and not token.is_punct 
        and token.is_alpha 
        and len(token) > 2
    ]
    
    return ' '.join(tokens)

# Apply spaCy preprocessing
df['Processed_Note'] = df['Customer_Note'].apply(preprocess_text_with_spacy)
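
Because every note is drawn from a small pool of templates, the column holds only a limited number of distinct strings, and running spaCy once per row repeats the same work many times. One possible shortcut, sketched below, processes each distinct string once and maps the result back. `apply_on_unique` is a hypothetical helper, and `toy_clean` is only a stand-in for `preprocess_text_with_spacy` so that the sketch runs without a spaCy model.

```python
import pandas as pd

def apply_on_unique(series, func):
    """Apply func once per distinct value and map the results back to every row."""
    cache = {value: func(value) for value in series.unique()}
    return series.map(cache)

calls = []

def toy_clean(text):
    # Stand-in for preprocess_text_with_spacy (assumption: any str -> str function works).
    calls.append(text)
    return text.lower().rstrip('.')

notes = pd.Series([
    "Scheduled callback for next week.",
    "Standard policy review completed.",
    "Scheduled callback for next week.",
])
processed = apply_on_unique(notes, toy_clean)
print(processed.tolist())
print(f"function calls: {len(calls)} for {len(notes)} rows")
```

In the notebook this would read `df['Processed_Note'] = apply_on_unique(df['Customer_Note'], preprocess_text_with_spacy)`.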

### Step 5: Save the Final Synthetic Dataset

In [None]:
# Step 5: Save the Final Synthetic Dataset
print("\nSaving synthetic dataset...")
df.to_csv('data/synthetic_insurance_data.csv', index=False)
print("Saved: synthetic_insurance_data.csv")


Saving synthetic dataset...
Saved: synthetic_insurance_data.csv


In [3]:
df = pd.read_csv('data/synthetic_insurance_data.csv')
df.head()

| | id | Gender | Age | Driving_License | Region_Code | Previously_Insured | Vehicle_Age | Vehicle_Damage | Annual_Premium | Policy_Sales_Channel | Vintage | Response | value_score | Tier | Customer_Note | Processed_Note |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 44 | 1 | 28.0 | 0 | > 2 Years | Yes | 40454.0 | 26.0 | 217 | 1 | 14732.376131 | Silver | Customer has seasonal driving habits. Needs fl... | customer seasonal driving habit need flexible ... |
| 1 | 2 | Male | 76 | 1 | 3.0 | 0 | 1-2 Year | No | 33536.0 | 26.0 | 183 | 0 | 16863.33701 | Silver | Customer declined agent call. Not open to disc... | customer decline agent open discussion |
| 2 | 3 | Male | 47 | 1 | 28.0 | 0 | > 2 Years | Yes | 38294.0 | 26.0 | 27 | 1 | 23212.907992 | Gold | Has luxury vehicle. Interested in top-tier cov... | luxury vehicle interested tier coverage |
| 3 | 4 | Male | 21 | 1 | 11.0 | 1 | < 1 Year | No | 28619.0 | 152.0 | 203 | 0 | 14530.572061 | Silver | Visited pricing page multiple times in one wee... | visit pricing page multiple time week sale tea... |
| 4 | 5 | Female | 29 | 1 | 41.0 | 1 | < 1 Year | No | 27496.0 | 152.0 | 39 | 0 | 12133.246629 | Bronze | Scheduled callback for next week. | schedule callback week |


### Step 6: Prepare Data for Modeling

In [48]:
# Step 6: Prepare Data for Modeling
print("\nPreparing data for modeling...")
# Define features and target
X = df[['Age', 'Gender', 'Region_Code', 'Driving_License', 'Previously_Insured', 
        'Vehicle_Age', 'Vehicle_Damage', 'Annual_Premium', 'Vintage', 'Processed_Note']]
y = df['Tier']


Preparing data for modeling...


In [49]:
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Separate text from structured data
X_train_text = X_train['Processed_Note']
X_test_text = X_test['Processed_Note']
X_train_structured = X_train.drop('Processed_Note', axis=1)
X_test_structured = X_test.drop('Processed_Note', axis=1)
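
The `stratify=y` argument keeps the tier proportions the same in the training and test sets. A small sketch on a toy target (30 made-up rows, 10 per tier) shows the effect; the variable names are illustrative and not part of the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy target with 10 rows per tier (illustrative only).
y_toy = pd.Series(['Bronze'] * 10 + ['Silver'] * 10 + ['Gold'] * 10)
X_toy = pd.DataFrame({'feature': range(len(y_toy))})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

test_counts = y_te.value_counts().to_dict()
print(test_counts)
```

Each tier contributes the same number of rows to the 20% test split, which keeps per-class metrics comparable.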


### Step 7: Define Preprocessing Pipelines and fit and transform dataset

In [None]:
# Step 7: Define Preprocessing Pipelines
print("Setting up preprocessing pipelines...")
categorical_cols = ['Gender', 'Vehicle_Age', 'Vehicle_Damage']
numerical_cols = ['Age','Driving_License','Region_Code','Previously_Insured', 'Annual_Premium', 'Vintage']

# Structured data preprocessor
preprocessor_structured = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore', sparse_output=False), categorical_cols)
    ])

# Text data preprocessor
preprocessor_text = TfidfVectorizer(max_features=500, ngram_range=(1, 2))



Setting up preprocessing pipelines...


In [None]:
print("Transforming data...")
# Fit and transform structured data
X_train_structured_processed = preprocessor_structured.fit_transform(X_train_structured)
X_test_structured_processed = preprocessor_structured.transform(X_test_structured)

# Fit and transform text data
X_train_text_processed = preprocessor_text.fit_transform(X_train_text).toarray()
X_test_text_processed = preprocessor_text.transform(X_test_text).toarray()

# Combine features
X_train_combined = np.hstack((X_train_structured_processed, X_train_text_processed))
X_test_combined = np.hstack((X_test_structured_processed, X_test_text_processed))

# Encode labels
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train)
y_test_encoded = label_encoder.transform(y_test)

print(f"Final training features shape: {X_train_combined.shape}")
print(f"Final test features shape: {X_test_combined.shape}")

Transforming data...
Final training features shape: (80000, 513)
Final test features shape: (20000, 513)
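
The cells above fit two preprocessors separately and stack their outputs with `np.hstack`. An alternative, which is not what this notebook does, is to place the `TfidfVectorizer` inside the same `ColumnTransformer`, so a single fitted object covers both structured and text columns. The sketch below uses a toy frame with a few of the notebook's column names and illustrative values.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a few of the notebook's columns (values are illustrative).
toy = pd.DataFrame({
    'Age': [44, 76, 21, 29],
    'Annual_Premium': [40454.0, 33536.0, 28619.0, 27496.0],
    'Gender': ['Male', 'Male', 'Male', 'Female'],
    'Processed_Note': [
        'customer seasonal driving habit',
        'customer decline agent open discussion',
        'visit pricing page multiple time',
        'schedule callback week',
    ],
})

combined = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['Age', 'Annual_Premium']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Gender']),
        # A bare column name (not a list) passes a 1-D series, which TfidfVectorizer expects.
        ('txt', TfidfVectorizer(max_features=500, ngram_range=(1, 2)), 'Processed_Note'),
    ]
)

features = combined.fit_transform(toy)
n_text_features = len(combined.named_transformers_['txt'].vocabulary_)
print(features.shape, n_text_features)
```

With this layout only one preprocessing artifact would need to be saved and reloaded at inference time.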


### Step 8: Save Preprocessing Objects

In [52]:
# Step 8: Save Preprocessing Objects
print("\nSaving preprocessing objects...")
# Create a directory for saved models
os.makedirs('artifacts', exist_ok=True)

# Save preprocessing objects
joblib.dump(preprocessor_structured, 'artifacts/structured_preprocessor.pkl')
joblib.dump(preprocessor_text, 'artifacts/text_preprocessor.pkl')
joblib.dump(label_encoder, 'artifacts/label_encoder.pkl')

print("Saved: artifacts/structured_preprocessor.pkl")
print("Saved: artifacts/text_preprocessor.pkl")
print("Saved: artifacts/label_encoder.pkl")


Saving preprocessing objects...
Saved: artifacts/structured_preprocessor.pkl
Saved: artifacts/text_preprocessor.pkl
Saved: artifacts/label_encoder.pkl
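
At inference time these objects are loaded back with `joblib.load`. The round trip below uses a freshly fitted `LabelEncoder` and a temporary directory (both assumptions for the sketch, so nothing in `artifacts/` is touched) and also shows a detail that is easy to miss: `LabelEncoder` orders classes alphabetically, not by tier rank.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder().fit(['Gold', 'Silver', 'Bronze'])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'label_encoder.pkl')  # temporary path for the sketch
    joblib.dump(encoder, path)
    restored = joblib.load(path)

# Classes are sorted alphabetically, so the integer codes do not follow tier rank.
classes = list(restored.classes_)
decoded = list(restored.inverse_transform(np.array([1, 0, 2])))
print(classes)
print(decoded)
```

Keeping this ordering in mind helps when reading a confusion matrix or decoding `argmax` outputs from the neural network.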


### Step 9: Train different ML and neural network models

In [None]:

print("\nInitializing results storage...")
results_df = pd.DataFrame(columns=[
    'model_name', 
    'accuracy', 
    'precision', 
    'recall', 
    'f1_score',
    'training_time_seconds',
    'prediction_time_seconds',
    'model_size_mb',
    'timestamp'
])

trained_models = {}



Initializing results storage...


In [None]:

def get_classifiers():
    """Return a dictionary of classifiers"""
    return {
        'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1),
        'XGBoost': XGBClassifier(random_state=42, n_estimators=100, learning_rate=0.1),
        'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000, n_jobs=-1),
        'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    }


def create_neural_network(input_dim, num_classes):
    """Create a neural network model"""
    model = Sequential([
        Dense(128, activation='relu', input_shape=(input_dim,)),
        Dropout(0.3),
        Dense(64, activation='relu'),
        Dropout(0.3),
        Dense(32, activation='relu'),
        Dense(num_classes, activation='softmax')
    ])
    
    model.compile(
        optimizer=Adam(learning_rate=0.001),
        loss='categorical_crossentropy',
        metrics=['accuracy']
    )
    
    return model

In [55]:

def train_and_evaluate_model(model, model_name, X_train, X_test, y_train, y_test):
    """Train and evaluate a single model"""
    print(f"Training {model_name}...")
    
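    start_time = datetime.now()

    if model_name == 'Neural Network':
        # categorical_crossentropy expects one-hot encoded targets
        y_train_cat = to_categorical(y_train)

        history = model.fit(
            X_train, y_train_cat,
            epochs=50,
            batch_size=32,
            validation_split=0.2,
            verbose=1
        )
        # Stop the clock right after fitting so prediction time is not counted as training
        training_time = (datetime.now() - start_time).total_seconds()

        y_pred_proba = model.predict(X_test, verbose=1)
        y_pred = np.argmax(y_pred_proba, axis=1)

    else:
        model.fit(X_train, y_train)
        # Stop the clock right after fitting so prediction time is not counted as training
        training_time = (datetime.now() - start_time).total_seconds()
        y_pred = model.predict(X_test)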
    
    # Calculate prediction time
    pred_start = datetime.now()
    _ = model.predict(X_test[:100])
    prediction_time = (datetime.now() - pred_start).total_seconds() / 100
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Calculate model size
    if model_name != 'Neural Network':
        joblib.dump(model, 'temp_model.pkl')
        model_size = os.path.getsize('temp_model.pkl') / (1024 * 1024)
        os.remove('temp_model.pkl')
    else:
        model_size = 0
    
    result = {
        'model_name': model_name,
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1_score': f1,
        'training_time_seconds': training_time,
        'prediction_time_seconds': prediction_time,
        'model_size_mb': model_size,
        'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
    }
    
    return model, result, y_pred
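
The problem statement asks for macro-averaged F1, while the function above reports weighted averages. The two can differ noticeably when classes are imbalanced. The sketch below uses hypothetical, deliberately imbalanced labels to show the gap; it does not use the notebook's data.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical, deliberately imbalanced labels to make the two averages diverge.
y_true = ['Bronze'] * 6 + ['Silver'] * 3 + ['Gold']
y_pred = ['Bronze'] * 6 + ['Silver', 'Silver', 'Bronze'] + ['Silver']

macro_f1 = f1_score(y_true, y_pred, average='macro', zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average='weighted', zero_division=0)

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
print(classification_report(y_true, y_pred, zero_division=0))
```

Macro averaging gives the missed Gold class the same weight as Bronze, so it drops further. In this notebook the tiers are balanced by construction, so the two averages should be close, but passing `average='macro'` would match the stated evaluation requirement directly.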


In [None]:

print("\n=== TRAINING ALL MODELS ===")
classifiers = get_classifiers()

# Train traditional ML models
for model_name, model in classifiers.items():
    trained_model, result, y_pred = train_and_evaluate_model(
        model, model_name, X_train_combined, X_test_combined, y_train_encoded, y_test_encoded
    )
    trained_models[model_name] = trained_model
    results_df = pd.concat([results_df, pd.DataFrame([result])], ignore_index=True)
    print(f"{model_name}: Accuracy = {result['accuracy']:.4f}")



=== TRAINING ALL MODELS ===
Training Random Forest...


Random Forest: Accuracy = 0.9003
Training XGBoost...
XGBoost: Accuracy = 0.9102
Training Logistic Regression...
Logistic Regression: Accuracy = 0.9114
Training K-Nearest Neighbors...
K-Nearest Neighbors: Accuracy = 0.8710
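
Concatenating onto an initially empty DataFrame inside a loop can emit a `FutureWarning` in recent pandas releases and leaves the columns with `object` dtype. A common alternative, sketched here with made-up result rows, is to collect the dictionaries in a list and build the frame once at the end.

```python
import pandas as pd

# Hypothetical result rows shaped like the dicts returned by train_and_evaluate_model.
fake_results = [
    {'model_name': 'Model A', 'accuracy': 0.90, 'f1_score': 0.89},
    {'model_name': 'Model B', 'accuracy': 0.91, 'f1_score': 0.91},
]

rows = []
for result in fake_results:   # in the notebook this would be the training loop
    rows.append(result)

results = pd.DataFrame(rows)  # built once, so column dtypes are inferred from real values
best = results.loc[results['accuracy'].idxmax(), 'model_name']
print(results.dtypes)
print(best)
```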


In [57]:
# Train Neural Network
nn_model = create_neural_network(X_train_combined.shape[1], len(np.unique(y_train_encoded)))
trained_nn, nn_result, y_pred_nn = train_and_evaluate_model(
    nn_model, 'Neural Network', X_train_combined, X_test_combined, y_train_encoded, y_test_encoded
)
trained_models['Neural Network'] = trained_nn
results_df = pd.concat([results_df, pd.DataFrame([nn_result])], ignore_index=True)




Training Neural Network...
Epoch 1/50
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 9s 3ms/step - accuracy: 0.8295 - loss: 0.4354 - val_accuracy: 0.9089 - val_loss: 0.2788
Epoch 2/50
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9123 - loss: 0.2756 - val_accuracy: 0.9098 - val_loss: 0.2743
Epoch 3/50
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9116 - loss: 0.2730 - val_accuracy: 0.9103 - val_loss: 0.2769
Epoch 4/50
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9133 - loss: 0.2690 - val_accuracy: 0.9105 - val_loss: 0.2729
Epoch 5/50
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9138 - loss: 0.2664 - val_accuracy: 0.9100 - val_loss: 0.2749
Epoch 6/50
2000/2000 ━━━━━━━━━━━━━━━━━━━━ 6s 3ms/step - accuracy: 0.9143 - loss: 0.2655 - val_accuracy: 0.9092 - val_lo

In [58]:
# Save neural network and update size
trained_nn.save('temp_nn_model.h5')
nn_size = os.path.getsize('temp_nn_model.h5') / (1024 * 1024)
os.remove('temp_nn_model.h5')
results_df.loc[results_df['model_name'] == 'Neural Network', 'model_size_mb'] = nn_size
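
The training log above shows `val_loss` hovering around 0.27 from the second epoch on, which suggests that most of the 50 epochs add little. Keras provides an `EarlyStopping` callback (with `monitor`, `patience`, and `restore_best_weights` arguments) that could be passed to `model.fit` through `callbacks=[...]`. The plain-Python function below is only an illustration of the patience rule, not part of the notebook; `early_stopping_trace` is a hypothetical name.

```python
def early_stopping_trace(val_losses, patience):
    """Return (best_epoch, stop_epoch); epochs are 1-based, stop_epoch is None if never triggered."""
    best_loss = float('inf')
    best_epoch = None
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return best_epoch, epoch
    return best_epoch, None

# val_loss values for epochs 1-5, copied from the training log above.
logged_val_loss = [0.2788, 0.2743, 0.2769, 0.2729, 0.2749]

strict = early_stopping_trace(logged_val_loss, patience=1)
lenient = early_stopping_trace(logged_val_loss, patience=3)
print(strict)
print(lenient)
```

A small patience stops at the first non-improving epoch, while a larger one tolerates the noise visible in the log; which value suits this model is a tuning choice.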



In [59]:
# Save the best model and results
print("\n=== SAVING RESULTS ===")
best_model_row = results_df.loc[results_df['accuracy'].idxmax()]
best_model_name = best_model_row['model_name']
best_model = trained_models[best_model_name]

print(f"Best model: {best_model_name} with accuracy {best_model_row['accuracy']:.4f}")

# Save the best model (create the output directory first so the dump cannot fail)
os.makedirs('saved_models', exist_ok=True)
if best_model_name == 'Neural Network':
    best_model.save('saved_models/best_model_neural_network.h5')
    print("Saved: saved_models/best_model_neural_network.h5")
else:
    joblib.dump(best_model, 'saved_models/best_model.pkl')
    print("Saved: saved_models/best_model.pkl")


=== SAVING RESULTS ===
Best model: Logistic Regression with accuracy 0.9114
Saved: saved_models/best_model.pkl


### Step 10: Save all trained models along with best model

In [60]:
# Save all joblib-serializable models (the Keras network is skipped in this loop)
for model_name, model in trained_models.items():
    if model_name != 'Neural Network':
        joblib.dump(model, f'saved_models/{model_name.replace(" ", "_").lower()}.pkl')

In [None]:
# Save results to CSV
results_df.to_csv('data/model_evaluation_results.csv', index=False)
print("Saved: model_evaluation_results.csv")

Saved: model_evaluation_results.csv


### Step 11: Save Complete Configuration

In [None]:
import json
config = {
    'feature_columns': numerical_cols + categorical_cols + ['Processed_Note'],
    'target_column': 'Tier',
    'preprocessing_steps': {
        'structured_preprocessor': 'artifacts/structured_preprocessor.pkl',
        'text_preprocessor': 'artifacts/text_preprocessor.pkl',
        'label_encoder': 'artifacts/label_encoder.pkl'
    },
    'best_model': best_model_name,
    'best_accuracy': best_model_row['accuracy'],
    'timestamp': datetime.now().strftime('%Y-%m-%d %H:%M:%S')
}

with open('saved_models/training_config.json', 'w') as f:
    json.dump(config, f, indent=2)

print("Saved: saved_models/training_config.json")
print("\n=== TRAINING COMPLETE ===")
print("Files created:")
print("- data/synthetic_insurance_data.csv")
print("- data/model_evaluation_results.csv")
print("- saved_models/ (directory with all trained models)")
print("- artifacts/ (directory with all preprocessing objects)")

Saved: saved_models/training_config.json

=== TRAINING COMPLETE ===
Files created:
- data/synthetic_insurance_data.csv
- data/model_evaluation_results.csv
- saved_models/ (directory with all trained models)
- artifacts/ (directory with all preprocessing objects)
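
One caveat when extending the config: `json.dump` raises `TypeError` for NumPy integers and 32-bit floats, which are not subclasses of Python's built-in numeric types. A small fallback hook, sketched below with hypothetical config values, converts any NumPy scalar before serialization. `to_builtin` is an illustrative name, not part of the notebook.

```python
import json

import numpy as np

def to_builtin(value):
    """Fallback for json.dumps: convert NumPy scalars to plain Python numbers."""
    if isinstance(value, np.generic):
        return value.item()
    raise TypeError(f"Cannot serialize {type(value).__name__}")

# Hypothetical config values; np.int64 is not JSON serializable on its own.
config_sketch = {'best_model': 'Logistic Regression', 'n_features': np.int64(513)}

text = json.dumps(config_sketch, default=to_builtin, indent=2)
print(text)
```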


## Testing

In [78]:
X_test_structured[:5].values


array([[41, 'Male', 29.0, 1, 1, '1-2 Year', 'No', 39136.0, 275],
       [76, 'Male', 33.0, 1, 1, '1-2 Year', 'No', 29653.0, 109],
       [23, 'Female', 18.0, 1, 0, '< 1 Year', 'Yes', 40053.0, 282],
       [49, 'Male', 8.0, 1, 1, '1-2 Year', 'No', 42977.0, 221],
       [26, 'Male', 41.0, 1, 1, '< 1 Year', 'No', 31040.0, 114]],
      dtype=object)

In [79]:
y_test[:5].values

['Gold', 'Bronze', 'Gold', 'Silver', 'Silver']
Categories (3, object): ['Bronze' < 'Silver' < 'Gold']

In [81]:
X_test_text[:5].values

array(['respond bundle offer', 'respond generic query specific interest',
       'engage live chat minute discuss vehicle coverage',
       'profile show low engagement score avoid proactive outreach',
       'customer browse agent profile prefer personalized service'],
      dtype=object)

In [None]:
text= [
    # Gold
    "Customer requested a callback to discuss bundling home and vehicle insurance. Strong cross-sell potential.",
    "Attended recent webinar on premium vehicle coverage. Asked multiple questions. Ready for conversion.",
    "Referred two friends for vehicle insurance. Highly engaged and influential.",
    "Downloaded comparison chart for premium plans. Wants to upgrade existing policy.",


    # Silver
    "Customer clicked on ad but did not proceed. Mild interest.",
    "Asked about policy duration. Needs clarity before committing.",
    "Customer has mid-tier health plan. Vehicle insurance could be next.",
    "Opened email but didn’t click. Passive engagement.",
    "Customer asked about EMI options. Needs affordability pitch.",


    # Bronze
    "Customer ignored multiple outreach attempts. Low engagement.",
    "Profile shows minimal digital activity. Hard to reach.",
    "Customer unsubscribed from marketing emails. Respect preferences.",
    "Only visited homepage. No product interaction.",
    "Customer has outdated contact info. Needs verification.",
    
]