# Advanced SUBMARIT Usage

This notebook demonstrates advanced features and techniques for submarket analysis.

## Table of Contents
1. [Working with Real Data](#real-data)
2. [Custom Distance Metrics](#custom-metrics)
3. [Cross-Validation and Stability](#validation)
4. [Performance Optimization](#performance)
5. [MATLAB Integration](#matlab)
6. [Advanced Visualization](#visualization)

In [None]:
# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
import time

# SUBMARIT imports
from submarit.core import create_substitution_matrix, create_sparse_substitution_matrix
from submarit.algorithms import LocalSearch, MiniBatchLocalSearch
from submarit.evaluation import ClusterEvaluator, gap_statistic, stability_analysis
from submarit.validation import KFoldValidator, MultipleRunsValidator
from submarit.io import load_data, save_results, load_matlab_data
from submarit.utils.matlab_compat import MatlabAPI

np.random.seed(42)
print("Advanced SUBMARIT tutorial ready!")

## 1. Working with Real Data <a id='real-data'></a>

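Real-world product data usually mixes categorical and numerical fields. To keep the notebook self-contained, the cells below simulate a retail catalog with that structure and prepare it for clustering.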

In [None]:
# Create a realistic retail product dataset
n_products = 500

# Product categories
categories = ['Electronics', 'Clothing', 'Food', 'Home', 'Sports']
brands = ['BrandA', 'BrandB', 'BrandC', 'BrandD', 'BrandE']

# Generate product data
np.random.seed(42)
data = {
    'product_id': [f'SKU_{i:04d}' for i in range(n_products)],
    'category': np.random.choice(categories, n_products, p=[0.3, 0.2, 0.2, 0.2, 0.1]),
    'brand': np.random.choice(brands, n_products),
    'price': np.random.lognormal(3.5, 0.8, n_products),
    'rating': np.random.normal(4.0, 0.5, n_products).clip(1, 5),
    'reviews': np.random.negative_binomial(5, 0.1, n_products),
    'in_stock': np.random.choice([0, 1], n_products, p=[0.1, 0.9]),
    'discount_pct': np.random.choice([0, 5, 10, 15, 20], n_products, p=[0.5, 0.2, 0.15, 0.1, 0.05])
}

# Apply category-specific price adjustments (vectorized with boolean masks)
data['price'][data['category'] == 'Electronics'] *= 5  # Electronics are more expensive
data['price'][data['category'] == 'Food'] *= 0.3  # Food is cheaper

df = pd.DataFrame(data)
print(f"Dataset shape: {df.shape}")
df.head()

In [None]:
# Prepare features for clustering
# One-hot encode categorical variables
category_dummies = pd.get_dummies(df['category'], prefix='cat')
brand_dummies = pd.get_dummies(df['brand'], prefix='brand')

# Normalize numerical features
from sklearn.preprocessing import StandardScaler

numerical_features = ['price', 'rating', 'reviews', 'discount_pct']
scaler = StandardScaler()
scaled_features = scaler.fit_transform(df[numerical_features])
scaled_df = pd.DataFrame(scaled_features, columns=[f'{col}_scaled' for col in numerical_features])

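# Combine all features
# Cast to float explicitly: recent pandas versions return boolean dummies,
# which would otherwise turn the combined array into dtype=object
X = pd.concat([scaled_df, category_dummies, brand_dummies], axis=1).to_numpy(dtype=float)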

print(f"Feature matrix shape: {X.shape}")
print(f"Features: {list(scaled_df.columns) + list(category_dummies.columns) + list(brand_dummies.columns)}")

## 2. Custom Distance Metrics <a id='custom-metrics'></a>

A custom distance metric lets the features that matter most for substitution, here price and category, carry more weight than the rest.

In [None]:
# Define a custom distance function
def weighted_product_distance(x, y, feature_weights):
    """
    Custom distance that weights features differently.
    Price and category are more important for substitution.
    """
    diff = x - y
    weighted_diff = diff * feature_weights
    return np.sqrt(np.sum(weighted_diff ** 2))

# Create feature weights (price and category more important)
n_features = X.shape[1]
feature_weights = np.ones(n_features)
feature_weights[0] = 3.0  # Price weight
feature_weights[4:9] = 2.0  # Category weights

# Create custom substitution matrix
def create_custom_substitution_matrix(X, feature_weights):
    n = len(X)
    S = np.zeros((n, n))
    
    for i in range(n):
        for j in range(i+1, n):
            dist = weighted_product_distance(X[i], X[j], feature_weights)
            S[i, j] = S[j, i] = dist
    
    return S

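# For efficiency, use a vectorized version
def create_custom_substitution_matrix_fast(X, feature_weights):
    # Scale each column by its weight. Euclidean distance on the scaled data
    # then matches weighted_product_distance, which weights the differences
    # (using np.sqrt(feature_weights) here would give a different metric)
    X_weighted = X * feature_weights
    # Use scipy's pdist for efficiency
    distances = pdist(X_weighted, metric='euclidean')
    return squareform(distances)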

# Compare standard vs custom
S_standard = create_substitution_matrix(X, metric='euclidean')
S_custom = create_custom_substitution_matrix_fast(X, feature_weights)

print(f"Standard matrix - Mean distance: {S_standard.mean():.3f}")
print(f"Custom matrix - Mean distance: {S_custom.mean():.3f}")
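
The distance computed by `weighted_product_distance` equals a plain Euclidean distance taken after multiplying each column by its weight (not by the square root of the weight, which would correspond to weighting the squared differences instead). The sketch below checks this on a small random matrix using NumPy only. The array sizes and the helper names `weighted_distance` and `pairwise_weighted` are illustrative assumptions, not part of SUBMARIT.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Loop-style reference: weights are applied to the raw differences."""
    return np.sqrt(np.sum(((x - y) * w) ** 2))

def pairwise_weighted(X, w):
    """Vectorized version: scale columns by w, then take Euclidean distances."""
    Xw = X * w
    diff = Xw[:, None, :] - Xw[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))
w_demo = np.array([3.0, 1.0, 2.0, 1.0])  # made-up weights

D_fast = pairwise_weighted(X_demo, w_demo)
D_loop = np.array([[weighted_distance(a, b, w_demo) for b in X_demo]
                   for a in X_demo])

print("Loop and vectorized versions agree:", np.allclose(D_fast, D_loop))
```

The broadcasting form builds an `n x n x d` intermediate array, so it is only suitable for small checks like this one; `pdist` avoids that cost on real data.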

## 3. Cross-Validation and Stability <a id='validation'></a>

A clustering is only useful if it holds up across data splits and does not depend on the random initialization. This section checks both.

In [None]:
# K-fold cross-validation
validator = KFoldValidator(n_splits=5, random_state=42)

# Test different numbers of clusters
k_values = [3, 5, 7, 10]
cv_results = {}

for k in k_values:
    print(f"\nValidating k={k}...")
    scores = validator.validate(X, n_clusters=k, algorithm='local_search')
    cv_results[k] = {
        'scores': scores,
        'mean': np.mean(scores),
        'std': np.std(scores)
    }
    print(f"CV Score: {cv_results[k]['mean']:.3f} ± {cv_results[k]['std']:.3f}")

# Visualize CV results
plt.figure(figsize=(10, 6))
means = [cv_results[k]['mean'] for k in k_values]
stds = [cv_results[k]['std'] for k in k_values]

plt.errorbar(k_values, means, yerr=stds, marker='o', markersize=10, capsize=5)
plt.xlabel('Number of Clusters')
plt.ylabel('Cross-Validation Score')
plt.title('Cross-Validation Results')
plt.grid(True, alpha=0.3)
plt.show()
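
One way to turn a table like `cv_results` into a choice of k is a variant of the one-standard-error rule: take the smallest k whose mean score lies within one standard deviation of the best mean. The sketch below applies it to made-up scores. It assumes that a higher score is better; whether that holds for `KFoldValidator` scores is an assumption to check against the SUBMARIT documentation, and `within_one_std_rule` is a hypothetical helper.

```python
import numpy as np

def within_one_std_rule(results):
    """Return the smallest k whose mean is within one std of the best mean.

    `results` maps k -> {'mean': float, 'std': float}. Higher mean is
    assumed to be better (an assumption, not a SUBMARIT guarantee).
    """
    best_k = max(results, key=lambda k: results[k]['mean'])
    cutoff = results[best_k]['mean'] - results[best_k]['std']
    candidates = [k for k in sorted(results) if results[k]['mean'] >= cutoff]
    return candidates[0]

# Made-up fold scores for illustration only
demo_scores = {
    3: [0.52, 0.55, 0.50],
    5: [0.62, 0.66, 0.64],
    7: [0.63, 0.67, 0.65],
    10: [0.60, 0.58, 0.62],
}
demo_results = {k: {'mean': np.mean(v), 'std': np.std(v)}
                for k, v in demo_scores.items()}

chosen_k = within_one_std_rule(demo_results)
print(f"Chosen k: {chosen_k}")
```

Here k=7 has the best mean, but k=5 is statistically indistinguishable from it on these toy numbers, so the rule prefers the simpler solution.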

In [None]:
# Stability analysis - multiple runs
n_clusters = 5
n_runs = 20

# Run clustering multiple times
all_clusterings = []
for run in range(n_runs):
    ls = LocalSearch(n_clusters=n_clusters, n_restarts=5, random_state=run)
    clusters = ls.fit_predict(S_custom)
    all_clusterings.append(clusters)

# Calculate pairwise agreement
from sklearn.metrics import adjusted_rand_score

agreement_matrix = np.zeros((n_runs, n_runs))
for i in range(n_runs):
    for j in range(n_runs):
        agreement_matrix[i, j] = adjusted_rand_score(all_clusterings[i], all_clusterings[j])

# Visualize stability
plt.figure(figsize=(8, 6))
sns.heatmap(agreement_matrix, cmap='Blues', vmin=0, vmax=1, 
            annot=False, square=True, cbar_kws={'label': 'Adjusted Rand Index'})
plt.title('Clustering Stability Matrix')
plt.xlabel('Run')
plt.ylabel('Run')
plt.show()

print(f"Average stability (ARI): {agreement_matrix[np.triu_indices(n_runs, k=1)].mean():.3f}")
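
For readers unfamiliar with the adjusted Rand index, the sketch below computes it from a contingency table with NumPy. It is meant to show what the metric measures (pair-counting agreement corrected for chance, unaffected by how clusters are numbered); in practice `sklearn.metrics.adjusted_rand_score` is the better-tested choice. The helper names `comb2` and `ari` are hypothetical.

```python
import numpy as np

def comb2(x):
    """Number of unordered pairs that can be drawn from x items."""
    return x * (x - 1) / 2.0

def ari(labels_a, labels_b):
    """Adjusted Rand index computed from the contingency table."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    _, a_idx = np.unique(labels_a, return_inverse=True)
    _, b_idx = np.unique(labels_b, return_inverse=True)

    # table[i, j] = number of items in cluster i of A and cluster j of B
    table = np.zeros((a_idx.max() + 1, b_idx.max() + 1))
    np.add.at(table, (a_idx, b_idx), 1)

    pairs_together = comb2(table).sum()
    pairs_a = comb2(table.sum(axis=1)).sum()
    pairs_b = comb2(table.sum(axis=0)).sum()
    expected = pairs_a * pairs_b / comb2(len(labels_a))
    max_index = 0.5 * (pairs_a + pairs_b)
    if max_index == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (pairs_together - expected) / (max_index - expected)

same_partition = ari([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1])  # relabeled copy
crossed = ari([0, 0, 1, 1], [0, 1, 0, 1])  # no pair is grouped the same way
print(same_partition, crossed)
```

A relabeled copy of the same partition scores 1, while partitions that agree less than chance would predict score below 0, which is why the heatmap above can contain negative values even though its color scale starts at 0.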

## 4. Performance Optimization <a id='performance'></a>

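A dense substitution matrix grows quadratically with the number of products, so large datasets call for sparse storage, mini-batch search, and parallel runs.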

In [None]:
# Create a larger dataset for performance testing
n_large = 2000
X_large = np.random.randn(n_large, 50)

# Compare different approaches
print("Performance Comparison:")
print("=" * 50)

# 1. Dense matrix
start = time.time()
S_dense = create_substitution_matrix(X_large, metric='euclidean')
time_dense = time.time() - start
print(f"Dense matrix creation: {time_dense:.2f}s")
print(f"Memory usage: {S_dense.nbytes / 1e6:.1f} MB")

# 2. Sparse matrix (keep only top 10%)
start = time.time()
S_sparse = create_sparse_substitution_matrix(X_large, threshold=0.1, metric='euclidean')
time_sparse = time.time() - start
print(f"\nSparse matrix creation: {time_sparse:.2f}s")
print(f"Memory usage: {(S_sparse.data.nbytes + S_sparse.indices.nbytes + S_sparse.indptr.nbytes) / 1e6:.1f} MB")
print(f"Sparsity: {1 - S_sparse.nnz / (n_large**2):.1%}")

# 3. Mini-batch clustering
print("\nClustering comparison:")

# Standard Local Search
start = time.time()
ls_standard = LocalSearch(n_clusters=10, n_restarts=3)
clusters_standard = ls_standard.fit_predict(S_dense[:1000, :1000])  # Subset for speed
time_standard = time.time() - start
print(f"Standard Local Search: {time_standard:.2f}s")

# Mini-batch Local Search
start = time.time()
mbls = MiniBatchLocalSearch(n_clusters=10, batch_size=100, n_init=3)
clusters_minibatch = mbls.fit_predict(S_sparse[:1000, :1000])
time_minibatch = time.time() - start
print(f"Mini-batch Local Search: {time_minibatch:.2f}s")
print(f"Speedup: {time_standard/time_minibatch:.1f}x")
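
To make the memory trade-off concrete: a dense float64 matrix for n products needs 8·n² bytes (32 MB for n = 2000), while a CSR matrix stores only the retained entries plus two index arrays. The sketch below builds a thresholded sparse matrix by hand with SciPy, keeping the 10% smallest off-diagonal distances. This is one plausible reading of what a `threshold=0.1` argument could mean; how `create_sparse_substitution_matrix` actually interprets it is not shown in this notebook, so treat the rule as an assumption.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 10))

# Dense, symmetric distance matrix with a zero diagonal
D = squareform(pdist(X_demo, metric='euclidean'))

# Keep only the closest 10% of off-diagonal pairs (hypothetical rule)
off_diag = D[np.triu_indices_from(D, k=1)]
cutoff = np.quantile(off_diag, 0.10)
mask = (D > 0) & (D <= cutoff)
S_demo = csr_matrix(np.where(mask, D, 0.0))

dense_mb = D.nbytes / 1e6
sparse_mb = (S_demo.data.nbytes + S_demo.indices.nbytes
             + S_demo.indptr.nbytes) / 1e6
density = S_demo.nnz / D.size

print(f"Dense:  {dense_mb:.2f} MB")
print(f"Sparse: {sparse_mb:.2f} MB ({density:.1%} of entries stored)")
```

One caveat with sparse distance matrices: an entry that is not stored means "pair dropped", not "distance zero", so any algorithm consuming the matrix has to treat missing entries accordingly.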

In [None]:
# Parallel processing example
from joblib import Parallel, delayed
import multiprocessing

n_cores = multiprocessing.cpu_count()
print(f"Available cores: {n_cores}")

# Function to run single clustering
def run_clustering(S, k, seed):
    ls = LocalSearch(n_clusters=k, n_restarts=1, random_state=seed)
    return ls.fit_predict(S), ls.objective_

# Serial execution
start = time.time()
serial_results = []
for seed in range(10):
    result = run_clustering(S_dense[:500, :500], 5, seed)
    serial_results.append(result)
time_serial = time.time() - start

# Parallel execution
start = time.time()
parallel_results = Parallel(n_jobs=-1)(
    delayed(run_clustering)(S_dense[:500, :500], 5, seed) 
    for seed in range(10)
)
time_parallel = time.time() - start

print(f"\nSerial execution: {time_serial:.2f}s")
print(f"Parallel execution: {time_parallel:.2f}s")
print(f"Speedup: {time_serial/time_parallel:.1f}x")
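
The runs above return a list of `(clusters, objective)` pairs, and a typical next step is keeping the best one. The sketch below shows that pattern with a toy stand-in for the clustering call so that it runs without SUBMARIT. It uses the standard library's `ThreadPoolExecutor` purely for portability, and it assumes that a lower objective is better; check the direction of `ls.objective_` before reusing the idea. `toy_clustering` is a hypothetical function.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(0)
points = rng.normal(size=(60, 2))
D_demo = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

def toy_clustering(D, k, seed):
    """Hypothetical stand-in for run_clustering: random labels plus the
    total within-cluster distance as the objective (lower is better)."""
    local_rng = np.random.default_rng(seed)
    labels = local_rng.integers(0, k, size=D.shape[0])
    same_cluster = labels[:, None] == labels[None, :]
    objective = D[same_cluster].sum() / 2.0  # each pair is counted twice
    return labels, objective

seeds = range(10)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda s: toy_clustering(D_demo, 3, s), seeds))

# Keep the run with the smallest objective
best_labels, best_objective = min(results, key=lambda r: r[1])
print(f"Best objective over {len(results)} runs: {best_objective:.2f}")
```

The same `min(..., key=lambda r: r[1])` call works on `parallel_results` from joblib. For small problems, worker startup and data transfer can outweigh the gain, so a measured speedup below 1x is not unusual.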

## 5. MATLAB Integration <a id='matlab'></a>

Working with MATLAB files and ensuring compatibility.

In [None]:
# Create sample data to save in MATLAB format
matlab_data = {
    'features': X[:100],  # First 100 products
    'substitution_matrix': S_custom[:100, :100],
    'product_names': np.array(df['product_id'][:100].values, dtype=object),
    'parameters': {
        'n_clusters': 5,
        'metric': 'custom',
        'algorithm': 'local_search'
    }
}

# Save to MATLAB format
from scipy.io import savemat
savemat('sample_data.mat', matlab_data)
print("Data saved to sample_data.mat")

# Load and verify
from scipy.io import loadmat
loaded_data = loadmat('sample_data.mat')
print("\nLoaded data keys:", [k for k in loaded_data.keys() if not k.startswith('__')])
print(f"Features shape: {loaded_data['features'].shape}")
# Cell arrays load as 2-D object arrays whose elements are themselves arrays
print(f"First product name: {loaded_data['product_names'][0, 0][0]}")

In [None]:
# Use MATLAB compatibility API
matlab_api = MatlabAPI()

# MATLAB-style function calls
S_matlab_style = matlab_api.create_substitution_matrix(X[:100], 'euclidean')
clusters_matlab_style = matlab_api.local_search(S_matlab_style, 5)

# Convert to 1-based indexing for MATLAB
clusters_matlab_indexed = clusters_matlab_style + 1

print(f"Cluster range (Python): {clusters_matlab_style.min()}-{clusters_matlab_style.max()}")
print(f"Cluster range (MATLAB): {clusters_matlab_indexed.min()}-{clusters_matlab_indexed.max()}")

# Save results in MATLAB format
matlab_results = {
    'clusters': clusters_matlab_indexed,
    'substitution_matrix': S_matlab_style,
    'objective_value': np.array([42.0]),  # Example objective value
    'n_iterations': np.array([15])
}
savemat('results_for_matlab.mat', matlab_results)
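
When results come back from MATLAB, two details are easy to miss: `loadmat` returns numeric arrays as at least 2-D by default, and labels saved for MATLAB are 1-based. The sketch below round-trips a label vector through a temporary `.mat` file with `scipy.io` and restores the original 0-based, 1-D form. The file and variable names are illustrative.

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

labels_python = np.array([0, 2, 1, 1, 0, 2])  # 0-based cluster labels

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'labels_demo.mat')
    savemat(path, {'clusters': labels_python + 1})  # shift to 1-based

    raw = loadmat(path)['clusters']                        # comes back 2-D
    squeezed = loadmat(path, squeeze_me=True)['clusters']  # singleton dims dropped

labels_restored = squeezed.astype(int) - 1  # back to 0-based

print(raw.shape, squeezed.shape)
print(np.array_equal(labels_restored, labels_python))
```

Using `squeeze_me=True` is convenient for vectors, but it also collapses 1x1 arrays to scalars, so it is worth checking shapes after loading.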

## 6. Advanced Visualization <a id='visualization'></a>

Create publication-quality visualizations.

In [None]:
# Perform clustering for visualization
ls = LocalSearch(n_clusters=5, n_restarts=10, random_state=42)
final_clusters = ls.fit_predict(S_custom)
df['cluster'] = final_clusters

# 1. Interactive cluster exploration
import plotly.express as px
import plotly.graph_objects as go

# Prepare data for 3D visualization
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
X_3d = pca.fit_transform(X)

# Create interactive 3D scatter plot
fig = px.scatter_3d(
    x=X_3d[:, 0], y=X_3d[:, 1], z=X_3d[:, 2],
    color=df['cluster'].astype(str),  # strings give discrete cluster colors
    hover_data={'Product': df['product_id'], 
                'Category': df['category'],
                'Price': df['price'].round(2)},
    labels={'x': f'PC1 ({pca.explained_variance_ratio_[0]:.1%})',
            'y': f'PC2 ({pca.explained_variance_ratio_[1]:.1%})',
            'z': f'PC3 ({pca.explained_variance_ratio_[2]:.1%})',
            'color': 'Cluster'},
    title='Interactive 3D Cluster Visualization'
)

fig.update_layout(height=600)
fig.show()

In [None]:
# 2. Cluster profile radar chart
# Calculate average features per cluster
feature_cols = ['price', 'rating', 'reviews', 'discount_pct']
cluster_profiles = df.groupby('cluster')[feature_cols].mean()

# Normalize to 0-1 scale for radar chart
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
cluster_profiles_norm = pd.DataFrame(
    scaler.fit_transform(cluster_profiles),
    index=cluster_profiles.index,
    columns=cluster_profiles.columns
)

# Create radar chart
fig = go.Figure()

for cluster in cluster_profiles_norm.index:  # only clusters that actually occur
    values = cluster_profiles_norm.loc[cluster].values.tolist()
    values += values[:1]  # Complete the circle
    
    fig.add_trace(go.Scatterpolar(
        r=values,
        theta=feature_cols + [feature_cols[0]],
        fill='toself',
        name=f'Cluster {cluster}'
    ))

fig.update_layout(
    polar=dict(
        radialaxis=dict(visible=True, range=[0, 1])
    ),
    showlegend=True,
    title="Cluster Profiles (Normalized Features)"
)
fig.show()

In [None]:
# 3. Hierarchical cluster visualization
from scipy.cluster.hierarchy import dendrogram, linkage

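# Compute linkage matrix
# linkage() expects a condensed distance vector; a square matrix would be
# misread as 100 observations with 100 features each
condensed = squareform(S_custom[:100, :100], checks=False)
linkage_matrix = linkage(condensed, method='ward')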

# Create dendrogram
plt.figure(figsize=(15, 8))
dendrogram(
    linkage_matrix,
    labels=df['product_id'][:100].values,
    leaf_rotation=90,
    leaf_font_size=8
)
plt.title('Product Hierarchy (First 100 Products)', fontsize=16)
plt.xlabel('Product ID')
plt.ylabel('Distance')
plt.tight_layout()
plt.show()
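
`scipy.cluster.hierarchy.linkage` accepts either raw observations or a condensed distance vector (the upper triangle, as produced by `pdist`). A square distance matrix has to be converted with `squareform` first, otherwise it is interpreted as observations. The sketch below checks on random data that both valid input forms give the same merge heights, then cuts the tree into flat clusters with `fcluster`. Data and sizes are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))

D_square = squareform(pdist(X_demo))            # 30 x 30 distance matrix
condensed = squareform(D_square, checks=False)  # back to the condensed vector

Z_from_obs = linkage(X_demo, method='ward')      # raw observations
Z_from_dist = linkage(condensed, method='ward')  # precomputed distances

print("Same merge heights:", np.allclose(Z_from_obs[:, 2], Z_from_dist[:, 2]))

# Cut the tree into at most 5 flat clusters (labels start at 1)
flat_labels = fcluster(Z_from_dist, t=5, criterion='maxclust')
print("Cluster labels:", np.unique(flat_labels))
```

Ward linkage on precomputed distances is only meaningful when those distances are Euclidean. A weighted Euclidean distance qualifies, because it is an ordinary Euclidean distance in the rescaled feature space.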

In [None]:
# 4. Network visualization of product relationships
import networkx as nx

# Create network from substitution matrix
# Keep only strong connections (top 5% most similar)
threshold = np.percentile(S_custom[S_custom > 0], 5)
G = nx.Graph()

# Add nodes
for i in range(100):  # Use first 100 products
    G.add_node(i, 
               label=df.iloc[i]['product_id'],
               cluster=final_clusters[i],
               category=df.iloc[i]['category'])

# Add edges for similar products
# Skip zero distances (identical products) so the inverse weight stays finite
for i in range(100):
    for j in range(i+1, 100):
        if 0 < S_custom[i, j] < threshold:
            G.add_edge(i, j, weight=1/S_custom[i, j])

# Plot network
plt.figure(figsize=(12, 10))
pos = nx.spring_layout(G, k=0.5, iterations=50)

# Draw nodes colored by cluster
node_colors = [final_clusters[i] for i in range(100)]
nx.draw_networkx_nodes(G, pos, node_color=node_colors, 
                      cmap='viridis', node_size=300, alpha=0.8)

# Draw edges
nx.draw_networkx_edges(G, pos, alpha=0.2)

# Add labels for some nodes
labels = {i: G.nodes[i]['label'][:8] for i in range(0, 100, 10)}
nx.draw_networkx_labels(G, pos, labels, font_size=8)

plt.title('Product Similarity Network', fontsize=16)
plt.axis('off')
plt.tight_layout()
plt.show()

print(f"Network has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
print(f"Average degree: {sum(dict(G.degree()).values()) / G.number_of_nodes():.1f}")
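
The edge-selection step can also be written without nested loops. The sketch below takes the upper triangle of a small random distance matrix, finds the 5th percentile of the positive distances, and returns the qualifying `(i, j)` pairs with inverse-distance weights, skipping zero distances so that every weight stays finite. The helper name `edges_below_percentile` is hypothetical.

```python
import numpy as np

def edges_below_percentile(D, pct):
    """Return (i, j, weight) for pairs whose distance is below the given
    percentile of all positive upper-triangle distances."""
    i_idx, j_idx = np.triu_indices_from(D, k=1)
    dists = D[i_idx, j_idx]
    threshold = np.percentile(dists[dists > 0], pct)
    keep = (dists > 0) & (dists < threshold)
    return [(int(i), int(j), 1.0 / d)
            for i, j, d in zip(i_idx[keep], j_idx[keep], dists[keep])]

rng = np.random.default_rng(0)
points = rng.normal(size=(40, 3))
D_demo = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

edges = edges_below_percentile(D_demo, 5)
print(f"{len(edges)} edges out of {40 * 39 // 2} possible pairs")
```

The resulting list has the shape that `G.add_weighted_edges_from(edges)` expects in networkx. Computing the percentile on the upper triangle only also avoids counting every pair twice.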

In [None]:
# 5. Publication-quality heatmap
# Prepare data for heatmap
cluster_category_counts = pd.crosstab(df['cluster'], df['category'])
cluster_brand_counts = pd.crosstab(df['cluster'], df['brand'])

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Category distribution
sns.heatmap(cluster_category_counts, annot=True, fmt='d', 
            cmap='YlOrRd', cbar_kws={'label': 'Product Count'},
            ax=ax1)
ax1.set_title('Products per Category in Each Cluster', fontsize=14)
ax1.set_xlabel('Category')
ax1.set_ylabel('Cluster')

# Brand distribution
sns.heatmap(cluster_brand_counts, annot=True, fmt='d',
            cmap='YlGnBu', cbar_kws={'label': 'Product Count'},
            ax=ax2)
ax2.set_title('Products per Brand in Each Cluster', fontsize=14)
ax2.set_xlabel('Brand')
ax2.set_ylabel('Cluster')

plt.tight_layout()
plt.show()
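
Raw counts are hard to compare when clusters differ in size. `pd.crosstab` has a `normalize` argument that converts counts to shares; with `normalize='index'` each row (here, each cluster) sums to 1. A small sketch on toy data:

```python
import pandas as pd

# Toy assignments for illustration only
toy = pd.DataFrame({
    'cluster':  [0, 0, 0, 1, 1, 2, 2, 2, 2],
    'category': ['Food', 'Food', 'Home', 'Home', 'Sports',
                 'Food', 'Sports', 'Sports', 'Sports'],
})

counts = pd.crosstab(toy['cluster'], toy['category'])
shares = pd.crosstab(toy['cluster'], toy['category'], normalize='index')

print(counts)
print(shares.round(2))
```

Passing `shares` to `sns.heatmap` with `fmt='.0%'` in place of `fmt='d'` would show the composition of each cluster rather than its absolute counts.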

## Summary and Best Practices

### Key Takeaways:

1. **Data Preparation**: 
   - Properly encode categorical variables
   - Scale numerical features
   - Consider feature importance when creating distance metrics

2. **Custom Metrics**:
   - Create domain-specific distance functions
   - Weight features based on business importance

3. **Validation**:
   - Always use cross-validation
   - Check clustering stability
   - Test multiple k values

4. **Performance**:
   - Use sparse matrices for large datasets
   - Leverage parallel processing
   - Consider mini-batch algorithms

5. **Visualization**:
   - Use multiple perspectives
   - Create interactive visualizations for exploration
   - Generate publication-quality figures

### Next Steps:
- Apply these techniques to your own data
- Experiment with different distance metrics
- Combine with business rules for practical applications

In [None]:
# Clean up
import os
if os.path.exists('sample_data.mat'):
    os.remove('sample_data.mat')
if os.path.exists('results_for_matlab.mat'):
    os.remove('results_for_matlab.mat')
    
print("Advanced tutorial completed!")