# Getting Started with SUBMARIT

Welcome to SUBMARIT (SUBMARket Identification and Testing)! This notebook will guide you through the basic concepts and usage of the library.

## What is SUBMARIT?

SUBMARIT is a Python library for analyzing product substitution patterns and identifying submarkets in competitive environments. It implements advanced clustering algorithms originally developed in MATLAB for economic analysis.

## Key Concepts

1. **Substitution Matrix**: A matrix representing the substitution patterns between products
2. **Submarket Clustering**: Grouping products based on substitution patterns
3. **Local Search**: Heuristic optimization that iteratively improves a cluster assignment; it finds good clusters but does not guarantee a global optimum
4. **Statistical Validation**: Tools for validating clustering results
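
To make the first two concepts concrete, here is a minimal NumPy-only sketch (it does not use the SUBMARIT API). The four-product matrix, its values, and the candidate partition are invented for illustration. It compares the average substitution score inside the proposed submarkets with the average score across them.

```python
import numpy as np

# Hypothetical 4-product example: entry [i, j] scores substitution between
# products i and j. All values are made up for illustration.
S = np.array([
    [0.0, 0.8, 0.1, 0.2],
    [0.8, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.9],
    [0.2, 0.1, 0.9, 0.0],
])

# Candidate partition into two submarkets: {P0, P1} and {P2, P3}
labels = np.array([0, 0, 1, 1])

same = labels[:, None] == labels[None, :]      # True where i and j share a cluster
off_diag = ~np.eye(len(labels), dtype=bool)    # exclude self-substitution

within = S[same & off_diag].mean()
between = S[~same].mean()
print(f"within={within:.2f}, between={between:.2f}")
```

A good partition shows a large gap between the two averages; here the within-cluster mean is 0.85 and the between-cluster mean is 0.15.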

## Installation

First, let's make sure SUBMARIT is installed and import the necessary modules:

In [None]:
# Install SUBMARIT if not already installed
# !pip install submarit

# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import SUBMARIT modules
from submarit.algorithms import KSMLocalSearch, LocalSearchResult
from submarit.core import create_substitution_matrix
from submarit.evaluation import ClusterEvaluator, EvaluationVisualizer
from submarit.validation import RandIndex, k_fold_validate

# Set random seed for reproducibility
np.random.seed(42)

print("SUBMARIT modules imported successfully!")

## Creating a Simple Substitution Matrix

Let's start by creating a simple substitution matrix. This matrix represents how likely one product is to substitute for another.

In [None]:
# Create a simple substitution matrix for 10 products
n_products = 10

# Generate synthetic substitution data
# Products 0-4 are in one submarket, 5-9 in another
substitution_matrix = np.zeros((n_products, n_products))

# Within-submarket substitution (high values)
for i in range(5):
    for j in range(5):
        if i != j:
            substitution_matrix[i, j] = np.random.uniform(0.7, 0.9)

for i in range(5, 10):
    for j in range(5, 10):
        if i != j:
            substitution_matrix[i, j] = np.random.uniform(0.7, 0.9)

# Between-submarket substitution (low values)
for i in range(5):
    for j in range(5, 10):
        substitution_matrix[i, j] = np.random.uniform(0.1, 0.3)
        substitution_matrix[j, i] = np.random.uniform(0.1, 0.3)

# Visualize the substitution matrix
plt.figure(figsize=(8, 6))
sns.heatmap(substitution_matrix, annot=False, cmap='YlOrRd', 
            xticklabels=[f'P{i}' for i in range(n_products)],
            yticklabels=[f'P{i}' for i in range(n_products)])
plt.title('Product Substitution Matrix')
plt.xlabel('Product')
plt.ylabel('Product')
plt.show()

print(f"Created substitution matrix of shape: {substitution_matrix.shape}")
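
The nested loops above can also be written with array operations. The following is a sketch under the same assumptions (two planted submarkets of five products each, high scores inside a submarket and low scores across); the variable names and the use of `np.random.default_rng` are choices made here, not part of SUBMARIT.

```python
import numpy as np

rng = np.random.default_rng(42)
n_products = 10
true_labels = np.repeat([0, 1], 5)             # products 0-4 and 5-9

# Boolean mask: True where two products share a planted submarket
same = true_labels[:, None] == true_labels[None, :]

high = rng.uniform(0.7, 0.9, size=(n_products, n_products))
low = rng.uniform(0.1, 0.3, size=(n_products, n_products))

S = np.where(same, high, low)                  # pick high or low per entry
np.fill_diagonal(S, 0.0)                       # no self-substitution

# Optional: average with the transpose if substitution is treated as symmetric
S_sym = (S + S.T) / 2
```

Note that neither this sketch nor the loop version produces a symmetric matrix by default, because each direction is drawn independently; `S_sym` shows one way to symmetrize when that is appropriate.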

## Running Basic Clustering

Now let's use the KSMLocalSearch algorithm to identify submarkets:

In [None]:
# Initialize the local search algorithm
local_search = KSMLocalSearch(
    n_clusters=2,  # We expect 2 submarkets
    max_iterations=100,
    n_restarts=10,
    random_state=42
)

# Run the clustering algorithm
print("Running local search clustering...")
result = local_search.fit(substitution_matrix)

# Display results
print("\nClustering completed!")
print(f"Best objective value: {result.best_objective:.4f}")
print(f"Number of iterations: {result.n_iterations}")
print("\nCluster assignments:")
for i, cluster in enumerate(result.labels):
    print(f"  Product P{i} -> Cluster {cluster}")
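
To give a feel for what a local search with restarts does, here is a generic relocation heuristic written in plain NumPy. It is an illustrative stand-in, not the algorithm implemented by `KSMLocalSearch`: the objective (mean within-cluster score minus mean between-cluster score), the function names, and the move rule are all assumptions made for this sketch.

```python
import numpy as np

def objective(S, labels):
    """Mean within-cluster score minus mean between-cluster score."""
    same = labels[:, None] == labels[None, :]
    off_diag = ~np.eye(len(labels), dtype=bool)
    within, between = S[same & off_diag], S[~same]
    if within.size == 0 or between.size == 0:
        return -np.inf                      # degenerate partition
    return within.mean() - between.mean()

def relocation_search(S, n_clusters, n_restarts=10, max_iterations=100, seed=0):
    """Move one item at a time, keeping moves that raise the objective."""
    rng = np.random.default_rng(seed)
    n = S.shape[0]
    best_labels, best_value = None, -np.inf
    for _ in range(n_restarts):
        # Balanced random starting assignment
        labels = rng.permutation(np.arange(n) % n_clusters)
        value = objective(S, labels)
        for _ in range(max_iterations):
            improved = False
            for i in range(n):
                for c in range(n_clusters):
                    if c == labels[i]:
                        continue
                    old = labels[i]
                    labels[i] = c
                    candidate = objective(S, labels)
                    if candidate > value + 1e-12:
                        value, improved = candidate, True
                    else:
                        labels[i] = old     # undo a non-improving move
            if not improved:
                break                       # local optimum reached
        if value > best_value:
            best_labels, best_value = labels.copy(), value
    return best_labels, best_value

# Demo on a planted two-block matrix (same construction idea as above)
rng = np.random.default_rng(42)
truth = np.repeat([0, 1], 5)
same = truth[:, None] == truth[None, :]
S = np.where(same, rng.uniform(0.7, 0.9, (10, 10)), rng.uniform(0.1, 0.3, (10, 10)))
np.fill_diagonal(S, 0.0)

labels, value = relocation_search(S, n_clusters=2)
print(labels, round(float(value), 3))
```

Each restart can end in a different local optimum, which is why the best result over several restarts is kept. Cluster numbering is arbitrary, so the recovered labels may be swapped relative to the planted ones.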

## Evaluating Clustering Results

Let's evaluate how well our clustering performed:

In [None]:
# Create an evaluator
evaluator = ClusterEvaluator()

# Evaluate the clustering
eval_result = evaluator.evaluate(
    substitution_matrix,
    result.labels,
    calculate_stability=True
)

# Display evaluation metrics
print("Clustering Evaluation Metrics:")
print(f"  Within-cluster similarity: {eval_result.within_cluster_similarity:.4f}")
print(f"  Between-cluster similarity: {eval_result.between_cluster_similarity:.4f}")
print(f"  Silhouette score: {eval_result.silhouette_score:.4f}")
print(f"  Davies-Bouldin index: {eval_result.davies_bouldin_index:.4f}")

# If we know the true labels (ground truth)
true_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Calculate Rand Index
rand_index = RandIndex()
rand_result = rand_index.compute(result.labels, true_labels)
print("\nComparison with ground truth:")
print(f"  Rand Index: {rand_result.rand_index:.4f}")
print(f"  Adjusted Rand Index: {rand_result.adjusted_rand_index:.4f}")
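
For intuition about the Rand index, the following standard-library sketch counts the item pairs on which two labelings agree (both put the pair together, or both put it apart). It illustrates the definition only and is not necessarily how `RandIndex` computes it; the helper name `rand_index` is chosen here.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two labelings agree."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

truth = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
flipped = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]    # same partition, labels swapped
one_off = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]    # product P4 misassigned

print(rand_index(truth, flipped))   # 1.0
print(rand_index(truth, one_off))   # 0.8
```

Because only pair co-membership matters, swapping cluster numbers does not change the score. The adjusted variant additionally corrects for the agreement expected by chance.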

## Visualizing Results

Let's create some visualizations to better understand our clustering results:

In [None]:
# Create visualization helper
visualizer = EvaluationVisualizer()

# Create a figure with multiple subplots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# 1. Reordered substitution matrix
ax = axes[0, 0]
sorted_indices = np.argsort(result.labels)
reordered_matrix = substitution_matrix[sorted_indices][:, sorted_indices]
sns.heatmap(reordered_matrix, ax=ax, cmap='YlOrRd', 
            xticklabels=[f'P{i}' for i in sorted_indices],
            yticklabels=[f'P{i}' for i in sorted_indices])
ax.set_title('Reordered Substitution Matrix')

# Add cluster boundaries
cluster_sizes = [np.sum(result.labels == i) for i in range(2)]
boundary = cluster_sizes[0]
ax.axhline(y=boundary, color='blue', linewidth=2)
ax.axvline(x=boundary, color='blue', linewidth=2)

# 2. Cluster membership bar chart
ax = axes[0, 1]
product_names = [f'P{i}' for i in range(n_products)]
colors = ['red' if label == 0 else 'blue' for label in result.labels]
ax.bar(product_names, [1]*n_products, color=colors)
ax.set_title('Cluster Membership (red = cluster 0, blue = cluster 1)')
ax.set_xlabel('Product')
ax.set_yticks([])  # bar height carries no meaning; color encodes the cluster
ax.set_ylim(0, 1.5)

# 3. Within vs Between cluster similarities
ax = axes[1, 0]
metrics = ['Within-cluster', 'Between-cluster']
values = [eval_result.within_cluster_similarity, eval_result.between_cluster_similarity]
ax.bar(metrics, values, color=['green', 'orange'])
ax.set_title('Cluster Cohesion vs Separation')
ax.set_ylabel('Average Similarity')
ax.set_ylim(0, 1)

# 4. Convergence plot (if available)
ax = axes[1, 1]
# objective_history may be absent; len() works for both lists and arrays
history = getattr(result, 'objective_history', None)
if history is not None and len(history) > 0:
    ax.plot(history)
    ax.set_title('Objective Function Convergence')
    ax.set_xlabel('Iteration')
    ax.set_ylabel('Objective Value')
else:
    # No recorded history: say so instead of plotting made-up values
    ax.text(0.5, 0.5, 'Objective history not available',
            ha='center', va='center', transform=ax.transAxes)
    ax.set_title('Objective Function Convergence')
    ax.set_xticks([])
    ax.set_yticks([])

plt.tight_layout()
plt.show()
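
The boundary lines in the first panel are hard-coded for two clusters. A small helper along these lines (the name `cluster_boundaries` is hypothetical) finds the boundary positions for any number of clusters by looking at where the sorted labels change.

```python
import numpy as np

def cluster_boundaries(labels):
    """Positions where the cluster changes after sorting items by label."""
    sorted_labels = np.sort(np.asarray(labels))
    return np.flatnonzero(np.diff(sorted_labels)) + 1

labels = np.array([2, 0, 1, 0, 2, 1, 1])
print(cluster_boundaries(labels))   # [2 5]
```

Each returned position can be passed to `ax.axhline(y=...)` and `ax.axvline(x=...)` on the reordered heatmap.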

## Working with Real Data

In practice, you'll often work with real market data. The example below uses synthetic market shares to walk through the workflow you would apply to your own data:

In [None]:
# Example: Creating a substitution matrix from market share data
# This is a synthetic example showing the typical workflow

# Simulate market share data
n_products = 15
n_markets = 50

# Generate synthetic market share data
# Each row is a market, each column is a product
market_shares = np.random.dirichlet(np.ones(n_products), n_markets)

# Add some structure: products 0-4, 5-9, 10-14 are in different submarkets
for i in range(n_markets):
    submarket = np.random.choice([0, 1, 2])
    if submarket == 0:
        market_shares[i, 0:5] *= 3
    elif submarket == 1:
        market_shares[i, 5:10] *= 3
    else:
        market_shares[i, 10:15] *= 3
    # Renormalize
    market_shares[i] /= market_shares[i].sum()

# Convert to DataFrame for easier handling
df_market_shares = pd.DataFrame(
    market_shares,
    columns=[f'Product_{i}' for i in range(n_products)],
    index=[f'Market_{i}' for i in range(n_markets)]
)

print("Market share data shape:", df_market_shares.shape)
print("\nFirst 5 markets:")
print(df_market_shares.head())

In [None]:
# Create substitution matrix from market share correlations
# Assumption for this synthetic example: products whose shares move together
# across markets belong to the same submarket
correlation_matrix = df_market_shares.corr().values

# Convert correlations to substitution scores (0 to 1)
substitution_matrix_real = (correlation_matrix + 1) / 2
np.fill_diagonal(substitution_matrix_real, 0)  # No self-substitution

# Visualize the substitution matrix
plt.figure(figsize=(10, 8))
sns.heatmap(substitution_matrix_real, 
            xticklabels=[f'P{i}' for i in range(n_products)],
            yticklabels=[f'P{i}' for i in range(n_products)],
            cmap='YlOrRd', cbar_kws={'label': 'Substitution Score'})
plt.title('Substitution Matrix from Market Share Data')
plt.show()
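
It helps to be clear about what the `(correlation + 1) / 2` mapping does to the scores. The NumPy-only sketch below repeats the step on a small, made-up share matrix (the sizes and seed are arbitrary) and shows the mapping at its reference points.

```python
import numpy as np

# Reference points of the mapping: -1 -> 0, 0 -> 0.5, 1 -> 1
reference = (np.array([-1.0, 0.0, 1.0]) + 1) / 2

rng = np.random.default_rng(0)
shares = rng.dirichlet(np.ones(4), size=30)     # 30 markets x 4 products

corr = np.corrcoef(shares, rowvar=False)        # product-by-product correlations
scores = (corr + 1) / 2                         # map [-1, 1] onto [0, 1]
np.fill_diagonal(scores, 0.0)                   # no self-substitution
print(np.round(scores, 2))
```

One consequence worth keeping in mind is that uncorrelated products receive a score of 0.5, not 0, so the resulting matrix has no true zeros off the diagonal.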

In [None]:
# Run clustering on the matrix derived from the (synthetic) market share data
local_search_real = KSMLocalSearch(
    n_clusters=3,  # We expect 3 submarkets
    max_iterations=200,
    n_restarts=20,
    random_state=42
)

result_real = local_search_real.fit(substitution_matrix_real)

# Display results
print("Clustering Results:")
for cluster in range(3):
    products_in_cluster = [i for i, label in enumerate(result_real.labels) if label == cluster]
    print(f"\nCluster {cluster}: {products_in_cluster}")
    print(f"  Products: {', '.join([f'P{i}' for i in products_in_cluster])}")

## Cross-Validation

To check how robust the results are, let's use k-fold cross-validation:

In [None]:
# Perform k-fold cross-validation
print("Running 5-fold cross-validation...")

cv_results = k_fold_validate(
    substitution_matrix_real,
    local_search_real,
    n_folds=5,
    random_state=42
)

print("\nCross-validation results:")
print(f"Average objective: {cv_results.mean_objective:.4f} (+/- {cv_results.std_objective:.4f})")
print(f"Average silhouette score: {cv_results.mean_silhouette:.4f} (+/- {cv_results.std_silhouette:.4f})")
print(f"Clustering stability: {cv_results.stability_score:.4f}")
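
The notebook does not show how `k_fold_validate` forms its folds. As one plausible scheme, offered purely as an assumption, the sketch below shuffles the product indices and splits them into equally sized held-out groups; the helper name `k_fold_indices` is hypothetical.

```python
import numpy as np

def k_fold_indices(n_items, n_folds, seed=0):
    """Shuffle item indices and split them into n_folds groups."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_items)
    return np.array_split(shuffled, n_folds)

folds = k_fold_indices(n_items=15, n_folds=5, seed=42)
for k, held_out in enumerate(folds):
    kept = np.setdiff1d(np.arange(15), held_out)
    # S[np.ix_(kept, kept)] would select the sub-matrix for the kept products
    print(f"fold {k}: hold out {sorted(held_out.tolist())}, keep {kept.size} products")
```

Clustering each reduced matrix and comparing the assignments of the shared products is one way to gauge stability.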

## Best Practices and Tips

### 1. Data Preparation
- Ensure your substitution matrix is symmetric if substitution is bidirectional
- Normalize values to the [0, 1] range for consistency
- Handle missing data appropriately
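
The three preparation steps can be sketched as a single helper. This is a minimal example under stated assumptions, not a SUBMARIT function: the name `prepare_matrix` is hypothetical, missing entries are treated as "no observed substitution" (0), and scaling is min-max over the off-diagonal entries. Other choices may suit your data better.

```python
import numpy as np

def prepare_matrix(S, symmetric=True):
    """Fill missing values, optionally symmetrize, and rescale to [0, 1]."""
    S = np.array(S, dtype=float)               # copy; the input is left untouched
    S = np.nan_to_num(S, nan=0.0)              # assumption: missing means 0
    if symmetric:
        S = (S + S.T) / 2                      # average the two directions
    off_diag = ~np.eye(S.shape[0], dtype=bool)
    lo, hi = S[off_diag].min(), S[off_diag].max()
    if hi > lo:
        S = (S - lo) / (hi - lo)               # min-max scaling
    np.fill_diagonal(S, 0.0)                   # no self-substitution
    return S

raw = [[0.0, 4.0, np.nan],
       [2.0, 0.0, 1.0],
       [1.0, 3.0, 0.0]]
prepared = prepare_matrix(raw)
print(prepared)
```

Symmetrizing by averaging discards directional information, so set `symmetric=False` when asymmetric substitution is meaningful in your setting.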

### 2. Algorithm Selection
- Use `KSMLocalSearch` for standard clustering
- Use `KSMLocalSearchConstrained` when you have constraints on cluster sizes
- Try multiple values of `n_clusters` to find the optimal number

### 3. Parameter Tuning
- Increase `n_restarts` to improve the chance of finding a better solution (at the cost of computation time)
- Adjust `max_iterations` based on problem complexity
- Use `random_state` for reproducible results

### 4. Validation
- Always validate results using multiple metrics
- Use cross-validation for robust evaluation
- Compare with domain knowledge when available

## Summary

In this notebook, we covered:
1. Creating and visualizing substitution matrices
2. Running basic clustering with KSMLocalSearch
3. Evaluating clustering results with multiple metrics
4. Visualizing clustering outcomes
5. Working with real market data
6. Using cross-validation for robust results

Next steps:
- Explore advanced clustering techniques in notebook 02
- Learn about performance optimization in notebook 03
- Discover visualization options in notebook 04
- See the MATLAB migration guide in notebook 05