# Getting Started with SUBMARIT

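This notebook introduces SUBMARIT (SUBMARket Identification and Testing). It builds a substitution matrix from synthetic product data, clusters the products into submarkets, and then evaluates and visualizes the result.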

## Table of Contents
1. [Installation and Setup](#installation)
2. [Basic Concepts](#concepts)
3. [Loading Data](#loading)
4. [Creating Substitution Matrices](#substitution)
5. [Clustering Products](#clustering)
6. [Evaluating Results](#evaluation)
7. [Visualization](#visualization)

## 1. Installation and Setup <a id='installation'></a>

First, let's import the necessary libraries and set up our environment.

In [None]:
# Core imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# SUBMARIT imports
from submarit.core import create_substitution_matrix
from submarit.algorithms import LocalSearch
from submarit.evaluation import ClusterEvaluator, gap_statistic
from submarit.evaluation.visualization import plot_substitution_matrix

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization
# Note: the 'seaborn-v0_8-*' style names require Matplotlib 3.6 or newer
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print("Setup complete!")

## 2. Basic Concepts <a id='concepts'></a>

SUBMARIT identifies submarkets based on product substitution patterns. Key concepts:

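- **Substitution Matrix**: A square matrix with one row and one column per product; each entry measures how substitutable two products are. In this notebook the entries are pairwise feature distances.
- **Submarkets**: Groups of products that are close substitutes for one another
- **Local Search**: A heuristic that iteratively improves submarket assignments. It converges to a local optimum rather than a guaranteed global one, which is why the clustering step later uses several random restarts.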
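To make the Local Search idea concrete, here is a minimal NumPy sketch of a single-move local search over a distance matrix. It is an illustration under stated assumptions, not SUBMARIT's implementation: the objective (sum of within-cluster pairwise distances), the function names `within_cost` and `local_search`, and the restart scheme are all hypothetical choices made for this example.

```python
import numpy as np


def within_cost(D, labels):
    """Sum of pairwise distances between items that share a label."""
    same = labels[:, None] == labels[None, :]
    return D[same].sum() / 2.0  # each unordered pair is counted twice


def local_search(D, k, max_iter=100, n_restarts=10, seed=0):
    """Illustrative single-move local search (hypothetical, not SUBMARIT's)."""
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    best_labels, best_cost = None, np.inf
    for _ in range(n_restarts):
        # Balanced random start, so every cluster begins non-empty.
        labels = rng.permutation(np.arange(n) % k)
        for _ in range(max_iter):
            moved = False
            for i in range(n):
                # Total distance from item i to the members of each cluster.
                contrib = np.array([D[i, labels == c].sum() for c in range(k)])
                target = int(np.argmin(contrib))
                # Move only on a strict improvement of the objective.
                if contrib[target] < contrib[labels[i]] - 1e-12:
                    labels[i] = target
                    moved = True
            if not moved:  # no single move helps: local optimum reached
                break
        cost = within_cost(D, labels)
        if cost < best_cost:
            best_labels, best_cost = labels.copy(), cost
    return best_labels, best_cost


# Toy data: three groups of 20 points in 2-D.
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=c, scale=0.3, size=(20, 2))
                    for c in (0.0, 5.0, 10.0)])
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

labels, cost = local_search(D, k=3)
print("cluster sizes:", np.bincount(labels, minlength=3))
print("within-cluster cost:", round(cost, 2))
```

Each pass only accepts moves that strictly lower the objective, so a run can stall in a poor local optimum. Keeping the best result over several random starts is a common way to reduce that risk.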

## 3. Loading Data <a id='loading'></a>

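Let's create a synthetic dataset of 150 products described by 10 numeric features. The products are drawn from three groups of 50 (budget, mid-range, and premium), so we know the true submarket structure in advance.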

In [None]:
# Create synthetic product data
n_products = 150
n_features = 10

# Generate products in 3 distinct groups
np.random.seed(42)

# Group 1: Budget products (low price, basic features)
group1 = np.random.randn(50, n_features) * 0.5 + np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])

# Group 2: Mid-range products
group2 = np.random.randn(50, n_features) * 0.5 + np.array([3, 3, 1, 1, 2, 1, 1, 2, 1, 1])

# Group 3: Premium products (high price, advanced features)
group3 = np.random.randn(50, n_features) * 0.5 + np.array([5, 5, 2, 2, 3, 2, 2, 3, 2, 2])

# Combine groups
X = np.vstack([group1, group2, group3])

# Create product names
product_names = [f"Product_{i:03d}" for i in range(n_products)]

# Create DataFrame for easier viewing
feature_names = [f"Feature_{i}" for i in range(n_features)]
df_products = pd.DataFrame(X, columns=feature_names, index=product_names)

print(f"Dataset shape: {X.shape}")
print("\nFirst 5 products:")
df_products.head()

In [None]:
# Visualize feature distributions
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.ravel()

for i, feature in enumerate(feature_names):
    axes[i].hist(df_products[feature], bins=20, alpha=0.7)
    axes[i].set_title(feature)
    axes[i].set_xlabel('Value')
    axes[i].set_ylabel('Count')

plt.tight_layout()
plt.show()

## 4. Creating Substitution Matrices <a id='substitution'></a>

The substitution matrix is computed from the product features: each entry compares one pair of products under a chosen distance metric.
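The three metrics compared in this section differ in what they treat as "close". The sketch below computes them with plain NumPy on a tiny example. It assumes that `create_substitution_matrix` returns a dense pairwise distance matrix for the named metric; that is an assumption about the library, and `pairwise_distances_sketch` is a hypothetical helper written only for this illustration.

```python
import numpy as np


def pairwise_distances_sketch(X, metric="euclidean"):
    """Dense pairwise distance matrix (illustrative, not SUBMARIT's code)."""
    X = np.asarray(X, dtype=float)
    if metric == "euclidean":
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))
    if metric == "manhattan":
        diff = X[:, None, :] - X[None, :, :]
        return np.abs(diff).sum(axis=-1)
    if metric == "cosine":
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        unit = X / norms  # assumes no all-zero rows
        return 1.0 - np.clip(unit @ unit.T, -1.0, 1.0)
    raise ValueError(f"unknown metric: {metric!r}")


# Rows 0 and 1 point in the same direction but differ in magnitude.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
results = {name: pairwise_distances_sketch(toy, name)
           for name in ("euclidean", "manhattan", "cosine")}
for name, matrix in results.items():
    print(name)
    print(matrix.round(3))
```

Cosine distance ignores vector length, so two products whose features are proportional come out as perfect substitutes, whereas Euclidean and Manhattan distance both register the difference in magnitude.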

In [None]:
# Create substitution matrix using Euclidean distance
S_euclidean = create_substitution_matrix(X, metric='euclidean')

print(f"Substitution matrix shape: {S_euclidean.shape}")
print(f"Value range: [{S_euclidean.min():.3f}, {S_euclidean.max():.3f}]")
print(f"Mean substitution distance: {S_euclidean.mean():.3f}")

In [None]:
# Compare different distance metrics
distance_metrics = ['euclidean', 'manhattan', 'cosine']  # avoids clashing with the evaluation `metrics` dict used later
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, metric in enumerate(distance_metrics):
    S = create_substitution_matrix(X, metric=metric)
    
    im = axes[i].imshow(S, cmap='viridis', aspect='auto')
    axes[i].set_title(f'{metric.capitalize()} Distance')
    axes[i].set_xlabel('Product Index')
    axes[i].set_ylabel('Product Index')
    plt.colorbar(im, ax=axes[i])

plt.tight_layout()
plt.show()

## 5. Clustering Products <a id='clustering'></a>

Now let's identify submarkets using the Local Search algorithm. We first estimate a suitable number of clusters with the gap statistic, then fit the model.
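As a rough illustration of what a gap statistic measures, the sketch below compares the within-cluster dispersion of the data with the dispersion of uniformly drawn reference data. Note the substitutions: it works on raw features and uses scikit-learn's `KMeans` as the clusterer, whereas the `gap_statistic` call in this notebook takes the substitution matrix. The function name `gap_statistic_sketch` and its parameters are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans


def gap_statistic_sketch(X, k, n_refs=10, seed=0):
    """Gap statistic on raw features with k-means (illustrative only)."""
    rng = np.random.default_rng(seed)

    def log_wk(data):
        # inertia_ is the within-cluster sum of squared distances to centroids.
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        return np.log(km.inertia_)

    lo, hi = X.min(axis=0), X.max(axis=0)
    # Reference datasets: uniform draws over the bounding box of X.
    ref = np.array([log_wk(rng.uniform(lo, hi, size=X.shape))
                    for _ in range(n_refs)])
    gap = ref.mean() - log_wk(X)
    # Spread of the reference values, inflated for simulation error.
    s_k = ref.std() * np.sqrt(1.0 + 1.0 / n_refs)
    return gap, s_k


# Toy data: three well-separated blobs of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
blobs = np.vstack([rng.normal(c, 0.3, size=(30, 2)) for c in centers])

gaps = {k: gap_statistic_sketch(blobs, k)[0] for k in range(1, 6)}
for k, gap in gaps.items():
    print(f"k={k}: gap={gap:.3f}")
```

A large gap means the data are clustered much more tightly at that `k` than structureless data would be, so on blobs like these the gap should jump once `k` reaches the true number of groups.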

In [None]:
# First, let's find the optimal number of clusters using the gap statistic
print("Finding optimal number of clusters...")

k_values = range(2, 8)
gaps = []
stds = []

for k in k_values:
    gap, std = gap_statistic(S_euclidean, k, n_bootstrap=10)
    gaps.append(gap)
    stds.append(std)
    print(f"k={k}: gap={gap:.3f}, std={std:.3f}")

# Plot gap statistic
plt.figure(figsize=(8, 5))
plt.errorbar(k_values, gaps, yerr=stds, marker='o', capsize=5)
plt.xlabel('Number of Clusters')
plt.ylabel('Gap Statistic')
plt.title('Gap Statistic for Optimal k')
plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Perform clustering with k=3 (our true number of groups)
n_clusters = 3

# Initialize Local Search algorithm
ls = LocalSearch(
    n_clusters=n_clusters,
    max_iter=100,
    n_restarts=10,
    random_state=42,
    verbose=True
)

# Fit the model
clusters = ls.fit_predict(S_euclidean)

print("\nClustering complete!")
print(f"Final objective value: {ls.objective_:.3f}")
print(f"Number of iterations: {ls.n_iter_}")
print(f"Cluster sizes: {np.bincount(clusters)}")

In [None]:
# Analyze cluster assignments
df_products['Cluster'] = clusters

# Show cluster means
cluster_means = df_products.groupby('Cluster')[feature_names].mean()
print("Cluster feature means:")
cluster_means.round(2)

## 6. Evaluating Results <a id='evaluation'></a>

Let's evaluate the quality of our clustering with internal validity metrics, including the silhouette score and the Davies-Bouldin index.
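For intuition about how these two scores behave, the sketch below computes them with scikit-learn on a toy example, once with labels that match the groups and once with labels that ignore them. This is an independent illustration using `sklearn.metrics`; how `ClusterEvaluator` computes its `silhouette` and `davies_bouldin` entries internally is not shown in this notebook.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score, silhouette_score

rng = np.random.default_rng(0)
# Two well-separated toy groups of 20 points each.
features = np.vstack([rng.normal(0.0, 0.3, size=(20, 2)),
                      rng.normal(5.0, 0.3, size=(20, 2))])
good = np.repeat([0, 1], 20)  # labels that match the two groups
bad = np.tile([0, 1], 20)     # alternating labels that ignore the groups

# Precomputed Euclidean distance matrix (zero diagonal).
D = np.linalg.norm(features[:, None, :] - features[None, :, :], axis=-1)

scores = {}
for name, labels in (("good", good), ("bad", bad)):
    # The silhouette score accepts a distance matrix directly.
    sil = silhouette_score(D, labels, metric="precomputed")
    # Davies-Bouldin in scikit-learn needs the feature vectors.
    dbi = davies_bouldin_score(features, labels)
    scores[name] = (sil, dbi)
    print(f"{name}: silhouette={sil:.3f}, davies_bouldin={dbi:.3f}")
```

The matching labels should give a silhouette score near 1 and a small Davies-Bouldin index, while the alternating labels should give a silhouette score near 0 and a much larger index.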

In [None]:
# Initialize evaluator
evaluator = ClusterEvaluator()

# Calculate all metrics
metrics = evaluator.evaluate(S_euclidean, clusters)

# Display results
print("Clustering Quality Metrics:")
print("=" * 40)
for metric, value in metrics.items():
    print(f"{metric:20s}: {value:10.4f}")

# Interpret results
print("\nInterpretation:")
print(f"- Silhouette score: {metrics['silhouette']:.3f} (closer to 1 is better)")
print(f"- Davies-Bouldin: {metrics['davies_bouldin']:.3f} (lower is better)")

In [None]:
# Compare different numbers of clusters
k_range = range(2, 8)
silhouette_scores = []
db_scores = []

for k in k_range:
    ls_temp = LocalSearch(n_clusters=k, n_restarts=5, random_state=42)
    clusters_temp = ls_temp.fit_predict(S_euclidean)
    metrics_temp = evaluator.evaluate(S_euclidean, clusters_temp)
    
    silhouette_scores.append(metrics_temp['silhouette'])
    db_scores.append(metrics_temp['davies_bouldin'])

# Plot comparison
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(k_range, silhouette_scores, 'o-', markersize=8)
ax1.set_xlabel('Number of Clusters')
ax1.set_ylabel('Silhouette Score')
ax1.set_title('Silhouette Score vs Number of Clusters')
ax1.grid(True, alpha=0.3)

ax2.plot(k_range, db_scores, 'o-', markersize=8, color='orange')
ax2.set_xlabel('Number of Clusters')
ax2.set_ylabel('Davies-Bouldin Index')
ax2.set_title('Davies-Bouldin Index vs Number of Clusters')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Visualization <a id='visualization'></a>

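Finally, let's visualize the submarkets in four ways: the substitution matrix with cluster boundaries, a 2D PCA projection, a heatmap of average features per cluster, and an interactive cluster explorer.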
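A common way to make submarkets visible in a matrix heatmap is to sort rows and columns by cluster label, so that each submarket shows up as a block on the diagonal. The NumPy sketch below shows that reordering on a toy matrix. Whether `plot_substitution_matrix` does this internally is not shown here, so treat the sketch as a stand-alone illustration with made-up data.

```python
import numpy as np

# Toy labels and 1-D positions: items in the same cluster sit close together.
labels = np.array([2, 0, 1, 0, 2, 1])
rng = np.random.default_rng(0)
positions = labels * 5.0 + rng.normal(0.0, 0.1, size=labels.size)
D = np.abs(positions[:, None] - positions[None, :])

order = np.argsort(labels, kind="stable")  # indices grouped by cluster
D_sorted = D[np.ix_(order, order)]         # permute rows and columns together
sizes = np.bincount(labels)
boundaries = np.cumsum(sizes)[:-1]         # index where each new block starts

print("order:", order)
print("boundaries:", boundaries)
print(D_sorted.round(1))
```

On an `imshow` plot of `D_sorted`, the block edges can be drawn with `ax.axhline(b - 0.5)` and `ax.axvline(b - 0.5)` for each `b` in `boundaries`, since pixel centers sit at integer coordinates.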

In [None]:
# Plot substitution matrix with cluster boundaries
fig, ax = plt.subplots(figsize=(10, 8))
plot_substitution_matrix(S_euclidean, clusters, ax=ax)
ax.set_title('Substitution Matrix with Identified Submarkets', fontsize=16)
plt.show()

In [None]:
# Visualize clusters in 2D using PCA
from sklearn.decomposition import PCA

# Reduce to 2D
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Plot
plt.figure(figsize=(10, 8))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=clusters, 
                     cmap='viridis', s=100, alpha=0.7, edgecolors='black')
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)')
plt.title('Products in 2D PCA Space', fontsize=16)
plt.colorbar(scatter, label='Cluster')

# Add cluster centers
for k in range(n_clusters):
    cluster_points = X_2d[clusters == k]
    center = cluster_points.mean(axis=0)
    plt.plot(center[0], center[1], 'r*', markersize=20, 
             markeredgecolor='black', markeredgewidth=2)

plt.grid(True, alpha=0.3)
plt.show()

In [None]:
# Create a heatmap of average features by cluster
plt.figure(figsize=(10, 6))
sns.heatmap(cluster_means.T, annot=True, fmt='.2f', cmap='coolwarm', 
            cbar_kws={'label': 'Average Value'})
plt.xlabel('Cluster', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Average Feature Values by Cluster', fontsize=16)
plt.show()

In [None]:
# Interactive cluster exploration
from ipywidgets import interact, IntSlider

def explore_cluster(cluster_id):
    """Explore products in a specific cluster."""
    mask = clusters == cluster_id
    cluster_products = df_products[mask]
    
    print(f"Cluster {cluster_id} - {mask.sum()} products")
    print("=" * 50)
    print("\nFeature statistics:")
    print(cluster_products[feature_names].describe().round(2))
    
    print("\nSample products:")
    print(cluster_products.head(10))

# Create interactive widget
interact(explore_cluster, 
         cluster_id=IntSlider(min=0, max=n_clusters-1, step=1, value=0,
                             description='Cluster:'))

## Summary

In this notebook, we've covered:

1. **Data Preparation**: Creating a synthetic dataset with distinct product groups
2. **Substitution Matrix**: Computing pairwise product distances with different metrics (Euclidean, Manhattan, cosine)
3. **Clustering**: Using Local Search to identify submarkets
4. **Evaluation**: Assessing cluster quality with multiple metrics
5. **Visualization**: Various ways to visualize and interpret results

### Next Steps

- Try with your own data
- Experiment with different distance metrics
- Explore advanced features like cross-validation and stability analysis
- Check out the other example notebooks for more advanced usage

In [None]:
# Collect results in a dictionary for later use
results = {
    'data': X,
    'product_names': product_names,
    'substitution_matrix': S_euclidean,
    'clusters': clusters,
    'metrics': metrics,
    'algorithm_params': {
        'n_clusters': n_clusters,
        'metric': 'euclidean',
        'algorithm': 'local_search'
    }
}

# Save to file (optional)
# import joblib
# joblib.dump(results, 'getting_started_results.pkl')

print("Analysis complete!")