# MATLAB to Python Migration Guide for SUBMARIT

This notebook helps MATLAB users transition to using SUBMARIT in Python. We'll cover:
- Syntax differences and equivalents
- Data structure conversions
- Function mappings
- Common patterns and idioms
- Performance considerations
- Complete migration examples

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.io as sio
from scipy.sparse import csr_matrix
import warnings
warnings.filterwarnings('ignore')

# Import SUBMARIT modules
from submarit.algorithms import KSMLocalSearch, KSMLocalSearchConstrained
from submarit.evaluation import ClusterEvaluator, GAPStatistic
from submarit.validation import RandIndex, k_fold_validate
from submarit.utils.matlab_compat import MatlabCompatibility

print("Libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print("Working with arrays similar to MATLAB matrices")

## 1. Basic Syntax Comparison

Let's start with the fundamental differences between MATLAB and Python/NumPy.

In [None]:
# MATLAB vs Python Syntax Examples

print("=== Array/Matrix Creation ===")
print("MATLAB: A = [1 2 3; 4 5 6; 7 8 9]")
print("Python:")
A = np.array([[1, 2, 3], 
              [4, 5, 6], 
              [7, 8, 9]])
print(A)
print()

print("=== Indexing (MATLAB is 1-based, Python is 0-based) ===")
print("MATLAB: A(1,1) = 1")
print(f"Python: A[0,0] = {A[0,0]}")
print()

print("=== Slicing ===")
print("MATLAB: A(1:2, 2:3)")
print("Python: A[0:2, 1:3]")
print(A[0:2, 1:3])
print()

print("=== Common Operations ===")
# Transpose
print("MATLAB: A'")
print("Python: A.T or np.transpose(A)")
print(A.T)
print()

# Element-wise multiplication
print("MATLAB: A .* B")
print("Python: A * B")
B = np.ones_like(A) * 2
print(A * B)
print()

# Matrix multiplication
print("MATLAB: A * B")
print("Python: A @ B or np.dot(A, B)")
print(A @ B)
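
Two differences the cell above does not show are `end` indexing and copy semantics. MATLAB assignments copy values, whereas basic NumPy slices are views onto the same memory. The sketch below illustrates both; the variable names are only for illustration.

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# MATLAB: A(end, :)  ->  Python: A[-1, :]
last_row = A[-1, :].copy()

# MATLAB: A(1:2, 2:3) includes both endpoints; Python excludes the stop index
block = A[0:2, 1:3]

# Basic slices are views: writing to `block` also changes A
block[0, 0] = 99
view_changed_A = bool(A[0, 1] == 99)

# Use .copy() to get MATLAB-like value semantics
safe = A[0:2, 1:3].copy()
safe[0, 0] = -1
copy_left_A_alone = bool(A[0, 1] == 99)

print(last_row)
print(view_changed_A, copy_left_A_alone)
```

When porting code that modifies a sub-matrix in place, adding `.copy()` at the point where MATLAB would have copied is usually the safest first step.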

In [None]:
# Common MATLAB functions and their Python equivalents
print("=== Common Function Mappings ===")

# Create mapping table
mappings = [
    ("zeros(m,n)", "np.zeros((m,n))"),
    ("ones(m,n)", "np.ones((m,n))"),
    ("eye(n)", "np.eye(n)"),
    ("rand(m,n)", "np.random.rand(m,n)"),
    ("randn(m,n)", "np.random.randn(m,n)"),
    ("size(A)", "A.shape"),
    ("length(A)", "max(A.shape)"),
    ("sum(A)", "np.sum(A, axis=0)"),
    ("sum(A,2)", "np.sum(A, axis=1)"),
    ("mean(A)", "np.mean(A, axis=0)"),
    ("std(A)", "np.std(A, axis=0, ddof=1)"),
    ("max(A)", "np.max(A, axis=0)"),
    ("min(A)", "np.min(A, axis=0)"),
    ("find(A>5)", "np.flatnonzero(A>5)  # 0-based, row-major"),
    ("reshape(A,m,n)", "A.reshape((m,n), order='F')"),
    ("diag(A)", "np.diag(A)"),
    ("inv(A)", "np.linalg.inv(A)"),
    ("eig(A)", "np.linalg.eig(A)"),
    ("svd(A)", "np.linalg.svd(A)"),
    ("norm(A)", "np.linalg.norm(A, 2)"),
]

print(f"{'MATLAB':<20} | {'Python':<30}")
print("-" * 52)
for matlab, python in mappings:
    print(f"{matlab:<20} | {python:<30}")
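
Some of these mappings only match MATLAB when the right arguments are passed. The following sketch checks three common cases on a small matrix: the `std` normalization, the element order used by `reshape`, and the default matrix norm.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# std: MATLAB normalizes by N-1; NumPy normalizes by N unless ddof=1 is given
std_numpy_default = np.std(A, axis=0)
std_matlab_like = np.std(A, axis=0, ddof=1)

# reshape: MATLAB fills column by column; NumPy fills row by row unless order='F'
reshaped_c = A.reshape(3, 2)
reshaped_f = A.reshape((3, 2), order='F')

# norm: MATLAB's norm(A) is the spectral 2-norm for a matrix;
# np.linalg.norm(A) defaults to the Frobenius norm
norm_fro = np.linalg.norm(A)
norm_two = np.linalg.norm(A, 2)

print(std_numpy_default, std_matlab_like)
print(reshaped_c)
print(reshaped_f)
print(norm_fro, norm_two)
```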

## 2. Loading MATLAB Data Files

In [None]:
# Example: Loading and saving MATLAB files

# Create sample data
sample_data = {
    'substitution_matrix': np.random.rand(20, 20),
    'product_names': ['Product_' + str(i) for i in range(20)],
    'cluster_labels': np.random.randint(0, 3, 20),
    'metadata': {
        'date': '2024-01-01',
        'version': 1.0
    }
}

# Save as MATLAB file
filename = 'sample_data.mat'
sio.savemat(filename, sample_data)
print(f"Saved data to {filename}")

# Load MATLAB file
loaded_data = sio.loadmat(filename)
print(f"\nLoaded keys: {list(loaded_data.keys())}")

# Access data (note: loadmat adds some metadata keys)
sub_matrix = loaded_data['substitution_matrix']
print(f"\nSubstitution matrix shape: {sub_matrix.shape}")
print(f"Cluster labels: {loaded_data['cluster_labels'].flatten()}")

# Clean up
import os
os.remove(filename)
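
By default `loadmat` returns every variable as a 2-D array and includes a few bookkeeping keys. The sketch below, which assumes SciPy 1.5 or newer for the `simplify_cells` option, shows one way to get squeezed arrays and plain dictionaries instead. The file name and variable names are made up for the example.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

data = {
    'cluster_labels': np.array([0, 1, 2, 1]),
    'metadata': {'version': 1.0},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.mat')
    sio.savemat(path, data)

    raw = sio.loadmat(path)                          # MATLAB-style 2-D arrays
    simple = sio.loadmat(path, simplify_cells=True)  # squeezed arrays, structs as dicts

# Drop the '__header__', '__version__' and '__globals__' entries
variables = {k: v for k, v in simple.items() if not k.startswith('__')}

print(raw['cluster_labels'].shape)
print(variables['cluster_labels'].shape)
print(variables['metadata']['version'])
```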

## 3. Migrating SUBMARIT MATLAB Functions

In [None]:
# Function mapping table for SUBMARIT
print("=== SUBMARIT Function Mappings ===")
print()

submarit_mappings = [
    ("MATLAB Function", "Python Equivalent", "Module"),
    ("-" * 30, "-" * 40, "-" * 25),
    # Algorithms
    ("kSMLocalSearch", "KSMLocalSearch", "submarit.algorithms"),
    ("kSMLocalSearch2", "KSMLocalSearch2", "submarit.algorithms"),
    ("kSMLocalSearchConstrained", "KSMLocalSearchConstrained", "submarit.algorithms"),
    ("kSMLocalSearchConstrained2", "KSMLocalSearchConstrained2", "submarit.algorithms"),
    # Evaluation
    ("kEvaluateClustering", "ClusterEvaluator.evaluate", "submarit.evaluation"),
    ("GAPStatisticUniform", "GAPStatistic.compute", "submarit.evaluation"),
    ("kSMEntropy", "EntropyClusterer", "submarit.evaluation"),
    # Validation
    ("RandIndex4", "RandIndex.compute", "submarit.validation"),
    ("kSMNFold", "k_fold_validate", "submarit.validation"),
    ("RunClusters", "run_clusters", "submarit.validation"),
    ("RunClustersTopk", "run_clusters_topk", "submarit.validation"),
    # Matrix creation
    ("CreateSubstitutionMatrix", "create_substitution_matrix", "submarit.core"),
    ("kSMCreateDist", "create_switching_matrix_distribution", "submarit.validation"),
]

for matlab, python, module in submarit_mappings:
    print(f"{matlab:<30} | {python:<40} | {module:<25}")

## 4. Complete Migration Example: kSMLocalSearch

In [None]:
# MATLAB code (commented):
"""
% MATLAB version
data = load('substitution_matrix.mat');  % load returns a struct of variables
S = data.substitution_matrix;
k = 3;  % number of clusters
maxIter = 100;
nRestarts = 10;

[labels, objective, iterations] = kSMLocalSearch(S, k, maxIter, nRestarts);

% Evaluate clustering
[silhouette, db_index, ch_index] = kEvaluateClustering(S, labels);

% Display results
fprintf('Objective: %.4f\n', objective);
fprintf('Silhouette: %.4f\n', silhouette);
"""

# Python equivalent:
print("Python implementation of the MATLAB code above:")
print()

# Create sample substitution matrix
S = np.random.rand(50, 50)
S = (S + S.T) / 2  # Make symmetric
np.fill_diagonal(S, 0)  # No self-substitution

# Parameters
k = 3  # number of clusters
max_iter = 100
n_restarts = 10

# Run local search
search = KSMLocalSearch(
    n_clusters=k,
    max_iterations=max_iter,
    n_restarts=n_restarts,
    random_state=42  # For reproducibility
)
result = search.fit(S)

# Extract results
labels = result.labels
objective = result.best_objective
iterations = result.n_iterations

# Evaluate clustering
evaluator = ClusterEvaluator()
eval_result = evaluator.evaluate(S, labels)

# Display results
print(f'Objective: {objective:.4f}')
print(f'Silhouette: {eval_result.silhouette_score:.4f}')
print(f'Davies-Bouldin Index: {eval_result.davies_bouldin_index:.4f}')
print(f'Calinski-Harabasz Index: {eval_result.calinski_harabasz_score:.4f}')
print(f'Iterations: {iterations}')
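
The example above builds a matrix that is square, symmetric, and zero on the diagonal. When the matrix comes from a ported MATLAB pipeline instead, it can help to check those properties before clustering. The helper below is a hypothetical sketch, not part of SUBMARIT, and it assumes those three conventions are the ones your data should follow.

```python
import numpy as np

def check_substitution_matrix(S, atol=1e-8):
    """Hypothetical helper: list the problems found in S (empty list if none)."""
    S = np.asarray(S, dtype=float)
    if S.ndim != 2 or S.shape[0] != S.shape[1]:
        return ['matrix is not square']
    problems = []
    if np.isnan(S).any():
        problems.append('matrix contains NaN')
    if not np.allclose(S, S.T, atol=atol):
        problems.append('matrix is not symmetric')
    if not np.allclose(np.diag(S), 0.0, atol=atol):
        problems.append('diagonal is not zero')
    return problems

rng = np.random.default_rng(0)
S_raw = rng.random((5, 5))
problems_before = check_substitution_matrix(S_raw)

S_clean = (S_raw + S_raw.T) / 2   # make symmetric
np.fill_diagonal(S_clean, 0)      # no self-substitution
problems_after = check_substitution_matrix(S_clean)

print(problems_before)
print(problems_after)
```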

## 5. Migrating Complex Workflows

In [None]:
# Example: Complete analysis workflow

# MATLAB workflow (commented):
"""
% Load data
data = load('market_data.mat');
S = data.substitution_matrix;
product_names = data.product_names;

% Determine optimal number of clusters
max_k = 10;
gap_values = zeros(max_k, 1);
for k = 1:max_k
    gap_values(k) = GAPStatisticUniform(S, k, 20);
end
[~, optimal_k] = max(gap_values);

% Run clustering with optimal k
[labels, obj] = kSMLocalSearch(S, optimal_k, 200, 50);

% Cross-validation
[cv_scores, cv_labels] = kSMNFold(S, optimal_k, 5);

% Compare with constrained version
min_size = 3;
[labels_const, obj_const] = kSMLocalSearchConstrained(S, optimal_k, min_size, 200, 50);

% Calculate Rand index
rand_idx = RandIndex4(labels, labels_const);
"""

# Python equivalent:
print("Python implementation of complex MATLAB workflow:")
print()

# Create sample data
n_products = 30
S = np.random.rand(n_products, n_products)
S = (S + S.T) / 2
np.fill_diagonal(S, 0)
# Add block structure: three groups of 10 products with high within-group substitution
for start in range(0, n_products, 10):
    block = slice(start, start + 10)
    S[block, block] = np.random.uniform(0.7, 0.9, size=(10, 10))
S = (S + S.T) / 2       # Restore symmetry after overwriting the blocks
np.fill_diagonal(S, 0)  # No self-substitution

product_names = [f'Product_{i}' for i in range(n_products)]

# Step 1: Determine optimal number of clusters
print("Step 1: Finding optimal number of clusters...")
# Note: max_clusters and n_references are smaller here than in the MATLAB script above (10 and 20)
gap_stat = GAPStatistic(max_clusters=8, n_references=10, random_state=42)
gap_result = gap_stat.compute(S)
optimal_k = gap_result.optimal_k
print(f"Optimal number of clusters: {optimal_k}")

# Step 2: Run clustering with optimal k
print("\nStep 2: Running clustering...")
search = KSMLocalSearch(
    n_clusters=optimal_k,
    max_iterations=200,
    n_restarts=50,
    random_state=42
)
result = search.fit(S)
labels = result.labels
obj = result.best_objective
print(f"Objective value: {obj:.4f}")

# Step 3: Cross-validation
print("\nStep 3: Running cross-validation...")
cv_results = k_fold_validate(S, search, n_folds=5, random_state=42)
print(f"CV mean objective: {cv_results.mean_objective:.4f} (+/- {cv_results.std_objective:.4f})")

# Step 4: Compare with constrained version
print("\nStep 4: Running constrained clustering...")
min_size = 3
search_const = KSMLocalSearchConstrained(
    n_clusters=optimal_k,
    min_cluster_size=min_size,
    max_iterations=200,
    n_restarts=50,
    random_state=42
)
result_const = search_const.fit(S)
labels_const = result_const.labels
obj_const = result_const.best_objective
print(f"Constrained objective value: {obj_const:.4f}")

# Step 5: Calculate Rand index
print("\nStep 5: Comparing solutions...")
rand_calc = RandIndex()
rand_result = rand_calc.compute(labels, labels_const)
print(f"Rand Index: {rand_result.rand_index:.4f}")
print(f"Adjusted Rand Index: {rand_result.adjusted_rand_index:.4f}")
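
For cross-checking a ported workflow, it can be useful to compute the plain Rand index independently: the fraction of item pairs on which two labelings agree (both place the pair in the same cluster, or both separate it). The function below is a minimal NumPy sketch of that definition, not SUBMARIT's implementation. Because only pair co-membership matters, it gives the same value for 1-based MATLAB labels and 0-based Python labels.

```python
import numpy as np

def rand_index(labels_a, labels_b):
    """Plain (unadjusted) Rand index between two labelings of the same items."""
    a = np.asarray(labels_a).ravel()
    b = np.asarray(labels_b).ravel()
    if a.size != b.size:
        raise ValueError('labelings must have the same length')
    same_a = a[:, None] == a[None, :]    # pair co-membership under labeling a
    same_b = b[:, None] == b[None, :]    # pair co-membership under labeling b
    upper = np.triu_indices(a.size, k=1)  # each unordered pair once
    return float(np.mean(same_a[upper] == same_b[upper]))

python_labels = [0, 0, 1, 1]
matlab_labels = [2, 2, 1, 1]   # same partition, different label values
other_labels = [0, 1, 0, 1]

ri_same = rand_index(python_labels, matlab_labels)
ri_other = rand_index(python_labels, other_labels)
print(ri_same, ri_other)
```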

## 6. Handling MATLAB-Specific Patterns

In [None]:
# Common MATLAB patterns and their Python equivalents

print("=== Cell Arrays ===")
# MATLAB: results = cell(n_runs, 1);
# Python:
results = [None] * 10  # preallocated list; or start with [] and append
print(f"Python list of None: {results[:3]}...")
print()

print("=== Structs ===")
# MATLAB: result.labels = labels; result.objective = obj;
# Python: Use dictionaries or classes
result = {
    'labels': np.array([0, 1, 2, 0, 1, 2]),
    'objective': 0.8765
}
print(f"Dictionary: {result}")

# Or use a class (more MATLAB struct-like)
class Result:
    def __init__(self):
        self.labels = None
        self.objective = None

result_obj = Result()
result_obj.labels = np.array([0, 1, 2, 0, 1, 2])
result_obj.objective = 0.8765
print(f"Class object - labels: {result_obj.labels}, objective: {result_obj.objective}")
print()

print("=== Logical Indexing ===")
# MATLAB: A(A > 5) = 0;
# Python:
A = np.array([[1, 6, 3], [8, 2, 7]])
print(f"Original: \n{A}")
A[A > 5] = 0
print(f"After A[A > 5] = 0: \n{A}")
print()

print("=== Find Function ===")
# MATLAB: idx = find(A == 0);
# Python:
idx = np.where(A == 0)
print(f"Indices where A == 0: {idx}")
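# Linear indices: NumPy flattens row-major by default; MATLAB's find is column-major and 1-based
linear_idx = np.flatnonzero(A == 0)
print(f"Linear indices (row-major, 0-based): {linear_idx}")
matlab_idx = np.flatnonzero((A == 0).ravel(order='F')) + 1
print(f"MATLAB-style linear indices (column-major, 1-based): {matlab_idx}")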
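
Between a bare dictionary and a hand-written class, a dataclass is a common middle ground for struct-like results, and a list of such objects plays the role of a cell array of structs. The sketch below uses a hypothetical `RunResult` container with made-up values; it is not a SUBMARIT class.

```python
from dataclasses import dataclass, field

import numpy as np

@dataclass
class RunResult:
    """Hypothetical struct-like container for one clustering run."""
    labels: np.ndarray
    objective: float
    notes: dict = field(default_factory=dict)

# MATLAB: results = cell(n_runs, 1); results{i} = ...
runs = []
for seed in range(3):
    rng = np.random.default_rng(seed)
    runs.append(RunResult(labels=rng.integers(0, 3, size=6),
                          objective=float(rng.random())))

# MATLAB: [~, idx] = max([results.objective])
best = max(runs, key=lambda r: r.objective)
print(best.objective, best.labels)
```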

## 7. Performance Comparison and Tips

In [None]:
# Performance tips when migrating from MATLAB

import time

print("=== Performance Tips ===")
print()

# Tip 1: Avoid element-by-element work where a vectorized expression exists
n = 10000
print("1. Preallocate or, better, vectorize")

# Growing a Python list (appends are amortized O(1), unlike growing a MATLAB array)
start = time.perf_counter()
arr_bad = []
for i in range(n):
    arr_bad.append(i**2)
time_bad = time.perf_counter() - start

# Preallocated NumPy array filled in a Python loop (often no faster than the list)
start = time.perf_counter()
arr_good = np.zeros(n)
for i in range(n):
    arr_good[i] = i**2
time_good = time.perf_counter() - start

# Best (vectorized)
start = time.perf_counter()
arr_best = np.arange(n)**2
time_best = time.perf_counter() - start

print(f"  Growing list: {time_bad:.4f}s")
print(f"  Preallocated: {time_good:.4f}s ({time_bad/time_good:.1f}x vs. list)")
print(f"  Vectorized: {time_best:.4f}s ({time_bad/time_best:.1f}x vs. list)")
print()

# Tip 2: Use NumPy functions instead of loops
print("2. Use NumPy functions (vectorization)")
A = np.random.rand(1000, 1000)

# MATLAB-style loop
start = time.perf_counter()
sum_loop = 0
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        if A[i, j] > 0.5:
            sum_loop += A[i, j]
time_loop = time.perf_counter() - start

# NumPy vectorized
start = time.perf_counter()
sum_vec = np.sum(A[A > 0.5])
time_vec = time.perf_counter() - start

print(f"  Loop: {time_loop:.4f}s")
print(f"  Vectorized: {time_vec:.4f}s (speedup: {time_loop/time_vec:.1f}x)")
print()

# Tip 3: Use appropriate data types
print("3. Use appropriate data types")
n = 1000000
print(f"  float64 array: {np.zeros(n, dtype=np.float64).nbytes / 1024 / 1024:.2f} MB")
print(f"  float32 array: {np.zeros(n, dtype=np.float32).nbytes / 1024 / 1024:.2f} MB")
print(f"  int32 array: {np.zeros(n, dtype=np.int32).nbytes / 1024 / 1024:.2f} MB")
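
Single wall-clock measurements like those above can be noisy, especially for very short operations. A slightly more robust sketch uses the standard-library `timeit` module and takes the best of several repeats. The matrix size and repeat counts here are arbitrary choices for a quick demo.

```python
import timeit

import numpy as np

A = np.random.default_rng(0).random((200, 200))

def sum_loop(A):
    """Element-wise loop, as a direct port of MATLAB-style code might look."""
    total = 0.0
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            if A[i, j] > 0.5:
                total += A[i, j]
    return total

def sum_vectorized(A):
    """Same computation with boolean indexing."""
    return A[A > 0.5].sum()

# Best of three repeats; the vectorized version runs 10 times per repeat
t_loop = min(timeit.repeat(lambda: sum_loop(A), number=1, repeat=3))
t_vec = min(timeit.repeat(lambda: sum_vectorized(A), number=10, repeat=3)) / 10

print(f"loop: {t_loop:.5f}s, vectorized: {t_vec:.5f}s")
```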

## 8. Complete Example: Migrating a MATLAB Script

In [None]:
# Original MATLAB script (as comments):
"""
% analyze_submarkets.m
% This script analyzes product substitution patterns

% Load data
load('market_data.mat');

% Parameters
min_k = 2;
max_k = 10;
n_runs = 100;

% Initialize results
results = struct();
results.k_values = min_k:max_k;
results.objectives = zeros(length(results.k_values), 1);
results.silhouettes = zeros(length(results.k_values), 1);
results.best_labels = cell(length(results.k_values), 1);

% Test different numbers of clusters
for i = 1:length(results.k_values)
    k = results.k_values(i);
    fprintf('Testing k = %d...\n', k);
    
    % Run multiple times and keep best
    best_obj = -inf;
    best_labels = [];
    
    for run = 1:n_runs
        [labels, obj] = kSMLocalSearch(substitution_matrix, k, 100, 1);
        if obj > best_obj
            best_obj = obj;
            best_labels = labels;
        end
    end
    
    % Store results
    results.objectives(i) = best_obj;
    results.best_labels{i} = best_labels;
    
    % Evaluate
    [sil, ~, ~] = kEvaluateClustering(substitution_matrix, best_labels);
    results.silhouettes(i) = sil;
end

% Plot results
figure;
subplot(2,1,1);
plot(results.k_values, results.objectives, 'o-');
xlabel('Number of Clusters');
ylabel('Objective Value');
title('Objective vs Number of Clusters');

subplot(2,1,2);
plot(results.k_values, results.silhouettes, 's-');
xlabel('Number of Clusters');
ylabel('Silhouette Score');
title('Silhouette vs Number of Clusters');

% Find optimal k
[~, idx] = max(results.silhouettes);
optimal_k = results.k_values(idx);
fprintf('\nOptimal number of clusters: %d\n', optimal_k);

% Save results
save('clustering_results.mat', 'results');
"""

# Python implementation:
print("Python implementation of analyze_submarkets.m:")
print("=" * 50)

# Create sample data (simulating loaded data)
n_products = 50
substitution_matrix = np.random.rand(n_products, n_products)
substitution_matrix = (substitution_matrix + substitution_matrix.T) / 2
np.fill_diagonal(substitution_matrix, 0)

# Add structure
true_k = 4
cluster_size = n_products // true_k
for i in range(true_k):
    start = i * cluster_size
    end = (i + 1) * cluster_size if i < true_k - 1 else n_products
    substitution_matrix[start:end, start:end] *= 2
substitution_matrix = np.clip(substitution_matrix, 0, 1)

# Parameters
min_k = 2
max_k = 10
n_runs = 20  # Reduced for demo

# Initialize results dictionary (similar to MATLAB struct)
results = {
    'k_values': list(range(min_k, max_k + 1)),
    'objectives': [],
    'silhouettes': [],
    'best_labels': []
}

# Test different numbers of clusters
evaluator = ClusterEvaluator()

for k in results['k_values']:
    print(f'Testing k = {k}...', end=' ')
    
    # Run multiple times and keep best
    best_obj = -np.inf
    best_labels = None
    
    for run in range(n_runs):
        search = KSMLocalSearch(
            n_clusters=k,
            max_iterations=100,
            n_restarts=1,
            random_state=run  # Different seed each run
        )
        result = search.fit(substitution_matrix)
        
        if result.best_objective > best_obj:
            best_obj = result.best_objective
            best_labels = result.labels
    
    # Store results
    results['objectives'].append(best_obj)
    results['best_labels'].append(best_labels)
    
    # Evaluate
    eval_result = evaluator.evaluate(substitution_matrix, best_labels)
    results['silhouettes'].append(eval_result.silhouette_score)
    
    print(f'Objective: {best_obj:.4f}, Silhouette: {eval_result.silhouette_score:.4f}')

# Convert to numpy arrays for easier plotting
results['objectives'] = np.array(results['objectives'])
results['silhouettes'] = np.array(results['silhouettes'])

# Plot results (MATLAB-style subplots)
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 8))

# Subplot 1: Objectives
ax1.plot(results['k_values'], results['objectives'], 'o-', linewidth=2, markersize=8)
ax1.set_xlabel('Number of Clusters')
ax1.set_ylabel('Objective Value')
ax1.set_title('Objective vs Number of Clusters')
ax1.grid(True, alpha=0.3)

# Subplot 2: Silhouettes
ax2.plot(results['k_values'], results['silhouettes'], 's-', linewidth=2, markersize=8, color='green')
ax2.set_xlabel('Number of Clusters')
ax2.set_ylabel('Silhouette Score')
ax2.set_title('Silhouette vs Number of Clusters')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find optimal k
idx = np.argmax(results['silhouettes'])
optimal_k = results['k_values'][idx]
print(f'\nOptimal number of clusters: {optimal_k}')
print(f'Best silhouette score: {results["silhouettes"][idx]:.4f}')

# Save results (as both .mat and .npz)
# For MATLAB compatibility
sio.savemat('clustering_results.mat', {'results': results})  # one struct variable, as in the MATLAB script
print("\nResults saved to clustering_results.mat")

# Python native format
np.savez('clustering_results.npz', **results)
print("Results saved to clustering_results.npz")

# Clean up
import os
os.remove('clustering_results.mat')
os.remove('clustering_results.npz')
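
In the MATLAB script, `results.best_labels` is a cell array. With `scipy.io.savemat`, a NumPy array of dtype `object` is written as a cell array, while a list of equal-length arrays is written as an ordinary matrix. The sketch below shows one way to keep the cell layout. The label values are made up, and the shift to 1-based labels is an assumption about what downstream MATLAB code expects.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

labels_per_k = [np.array([0, 1, 0, 1]), np.array([0, 1, 2, 0])]

# Object array -> MATLAB cell array
best_labels = np.empty(len(labels_per_k), dtype=object)
for i, lab in enumerate(labels_per_k):
    best_labels[i] = lab + 1   # assumption: MATLAB side wants 1-based cluster labels

results = {'k_values': np.array([2, 3]), 'best_labels': best_labels}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'clustering_results.mat')
    sio.savemat(path, {'results': results})            # one struct named `results`
    loaded = sio.loadmat(path, simplify_cells=True)['results']

# Convert back to 0-based labels on the Python side
restored = [np.ravel(lab) - 1 for lab in loaded['best_labels']]
print(restored)
```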

## 9. Quick Reference Card

In [None]:
# Create a quick reference card
print("MATLAB to Python Quick Reference Card for SUBMARIT")
print("=" * 60)
print()

reference = """
COMMON PATTERNS:
----------------
MATLAB: A(1,1)                    Python: A[0,0]
MATLAB: A(:,1)                    Python: A[:,0]
MATLAB: A(1,:)                    Python: A[0,:]
MATLAB: A(1:5,2:4)                Python: A[0:5,1:4]
MATLAB: A'                        Python: A.T
MATLAB: A * B                     Python: A @ B
MATLAB: A .* B                    Python: A * B
MATLAB: for i = 1:10              Python: for i in range(1, 11):
MATLAB: if a && b                 Python: if a and b:
MATLAB: if a || b                 Python: if a or b:
MATLAB: ~a                        Python: not a (scalar) or ~a (boolean array)

SUBMARIT SPECIFIC:
------------------
MATLAB:                           Python:
[labels, obj] = kSMLocalSearch(   search = KSMLocalSearch(
    S, k, maxIter, nRestarts);        n_clusters=k,
                                      max_iterations=maxIter,
                                      n_restarts=nRestarts)
                                  result = search.fit(S)
                                  labels = result.labels
                                  obj = result.best_objective

DATA STRUCTURES:
----------------
MATLAB: cell array                Python: list or numpy object array
MATLAB: struct                    Python: dict or class
MATLAB: logical array             Python: boolean numpy array

FILE I/O:
---------
MATLAB: save('file.mat', 'var')   Python: sio.savemat('file.mat', {'var': var})
MATLAB: load('file.mat')          Python: data = sio.loadmat('file.mat')

TIPS:
-----
1. Python uses 0-based indexing (MATLAB is 1-based)
2. Python ranges are exclusive at the end: range(0,5) gives [0,1,2,3,4]
3. Use NumPy for numerical operations (vectorized code is typically comparable in speed to MATLAB)
4. Consider using pandas DataFrames for table-like data
5. matplotlib.pyplot provides MATLAB-like plotting interface
"""

print(reference)
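
Off-by-one errors in ranges are the most common porting mistake, so here is a short sketch of how MATLAB colon expressions translate. The specific bounds are arbitrary.

```python
import numpy as np

# MATLAB: for i = 1:10   (i takes the values 1..10)
same_values = list(range(1, 11))

# MATLAB: for i = 1:n used only to index A(i)  ->  Python: range(n), i.e. 0..n-1
n = 5
zero_based = list(range(n))

# MATLAB: 10:-2:1
descending = list(range(10, 0, -2))

# MATLAB: 0:0.25:1 ; linspace avoids floating-point surprises at the endpoint
grid = np.linspace(0.0, 1.0, 5)

print(same_values)
print(zero_based)
print(descending)
print(grid)
```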

## Summary

This notebook covered the essential aspects of migrating from MATLAB to Python for SUBMARIT:

1. **Basic Syntax**: Array creation, indexing, operations
2. **Function Mappings**: Direct equivalents for common functions
3. **Data I/O**: Loading and saving MATLAB files
4. **SUBMARIT Functions**: Mapping between MATLAB and Python APIs
5. **Complex Workflows**: Step-by-step migration examples
6. **Performance**: Tips for maintaining MATLAB-level performance
7. **Complete Examples**: Full script migrations

### Key Takeaways:

- **Indexing**: Remember Python is 0-based, MATLAB is 1-based
- **Vectorization**: Use NumPy operations instead of loops
- **Data Structures**: Use dictionaries or classes instead of structs
- **File I/O**: `scipy.io` reads and writes MATLAB files up to v7.2; v7.3 files are HDF5-based and need a library such as h5py
- **API**: SUBMARIT Python API closely mirrors MATLAB functions

### Next Steps:

1. Start with simple scripts and gradually migrate complex ones
2. Use the reference card for quick lookups
3. Verify that results match between the MATLAB and Python versions
4. Take advantage of Python's additional features (better debugging, package management, etc.)
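
When checking that results match, note that cluster labels are arbitrary: MATLAB output is typically 1-based, and two runs can number the same clusters differently. One possible approach, sketched below with a hypothetical helper, is to renumber labels by order of first appearance before comparing.

```python
import numpy as np

def canonical_labels(labels):
    """Hypothetical helper: renumber labels 0, 1, 2, ... by first appearance."""
    labels = np.asarray(labels).ravel()
    _, first_idx, inverse = np.unique(labels, return_index=True, return_inverse=True)
    order = np.argsort(first_idx)            # unique values sorted by first appearance
    rank = np.empty_like(order)
    rank[order] = np.arange(order.size)      # new id for each unique value
    return rank[inverse]

matlab_labels = [3, 3, 1, 2, 1]   # e.g. exported from MATLAB (1-based)
python_labels = [0, 0, 2, 1, 2]   # same partition with different numbering

same_partition = np.array_equal(canonical_labels(matlab_labels),
                                canonical_labels(python_labels))
print(canonical_labels(matlab_labels), same_partition)
```

Objective values can be compared with `np.isclose`, since small floating-point differences between the two implementations are to be expected.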