<p align="center"><img src="logo.png" alt="logo" width="1000"/></p>
<div align="center">
<div class="klotski-board">
  <!-- Red block -->
  <div class="block red" style="grid-row: 1 / span 2; grid-column: 2 / span 2;"></div>

  <!-- Tall side blocks -->
  <div class="block blue" style="grid-row: 1 / span 2; grid-column: 1;"></div>
  <div class="block blue" style="grid-row: 1 / span 2; grid-column: 4;"></div>
  <div class="block blue" style="grid-row: 3 / span 2; grid-column: 1;"></div>
  <div class="block blue" style="grid-row: 3 / span 2; grid-column: 4;"></div>

  <!-- Small blocks -->
  <div class="block green" style="grid-row: 5; grid-column: 1;"></div>
  <div class="block green" style="grid-row: 4; grid-column: 2;"></div>
  <div class="block green" style="grid-row: 4; grid-column: 3;"></div>
  <div class="block green" style="grid-row: 5; grid-column: 4;"></div>

  <!-- Horizontal yellow block -->
  <div class="block yellow" style="grid-row: 3; grid-column: 2 / span 2;"></div>
</div>

<style>
.klotski-board {
  display: grid;
  grid-template-rows: repeat(5, 60px);
  grid-template-columns: repeat(4, 60px);
  gap: 4px;
  background: #222222;
  padding: 4px;
  width: max-content;
  border-radius: 8px;
}
.block {
  border-radius: 6px;
}
.red { background: #e74c3c; }
.blue { background: #3498db; }
.green { background: #2ecc71; }
.yellow { background: #f1c40f; }
</style>
*A standard Klotski puzzle layout with an 81-move solution*
</div>



# Cracking the Code: Predicting Puzzle Difficulty with Machine Learning


## *How I Built an AI System to Solve the "Goldilocks Problem" in Game Design*

---



### The Hidden Architecture of Engagement


Every great puzzle hides two stories: the one a player solves, and the one that decides what is worth solving. This project is about the second story: a system that doesn't just brute-force solutions, but studies what makes a board compelling, predicts how it will play, and curates puzzles that feel sharp, fair, and memorable.

- **The core challenge**: How do you systematically identify the difference between a puzzle that teaches and one that frustrates? Between complexity that engages and complexity that overwhelms?

- **My approach**: Build a machine learning system that understands puzzle difficulty the way humans experience it: not just as mathematical complexity, but as cognitive load, spatial reasoning patterns, and the satisfying progression from confusion to clarity.

---

### The Technical Challenge: Klotski Sliding Block Puzzles

I chose **Klotski puzzles** as my proving ground - these ancient Chinese sliding block puzzles are:
- **Computationally complex**: Even small changes create dramatically different difficulty
- **Measurable**: Clear success metrics (solvable/unsolvable, solution length)
- **Generalizable**: Principles apply to any spatial reasoning game

**The Goal**: Can I build a system that predicts puzzle difficulty more accurately than human intuition?

---

### What I Built: An End-to-End ML Pipeline

**Custom Puzzle Solver**: BFS algorithm finding optimal solutions  
**Massive Dataset**: 50,000 unique puzzles with 22+ engineered features  
**Multi-Model ML Suite**: Classification, regression, clustering, and complexity analysis  
**Novel Feature Engineering**: Complexity scoring and symmetry detection algorithms  
**Business Intelligence**: Actionable insights for game design optimization  

---

### The Results That Matter

- **38.5% natural solvability rate** in randomly generated puzzles (perfect for balanced difficulty)
- **22+ engineered features** capturing spatial, compositional, and strategic complexity
- **Advanced complexity scoring** that correlates with human perception of difficulty
- **Symmetry detection algorithms** revealing hidden puzzle patterns
- **Production-ready pipeline** generating insights for game designers

---

### Why This Matters

This isn't just about puzzles. This is about **using data science to understand human cognition and engagement**. The techniques I've developed here apply to:

- **Educational software**: Adaptive difficulty in learning platforms and skill progression
- **Game design**: Any puzzle or strategy game balancing across player skill levels
- **User experience**: Predicting task complexity in any interface or workflow
- **Product onboarding**: Sequencing feature introduction to minimize cognitive overload

---

*Let's dive into the data and see how machine learning can solve the Goldilocks Problem...*


# 📋🔍 Project Overview



### Technical Stack
- **Data Generation**: Custom Python pipeline with BFS solver
- **Feature Engineering**: 22+ domain-specific features
- **ML Framework**: scikit-learn, pandas, numpy
- **Visualization**: matplotlib, seaborn, plotly
- **Analysis**: Statistical modeling and pattern recognition



### Dataset Specifications
- **Size**: 50,000 unique Klotski puzzle boards generated
- **Features**: 22+ engineered features per puzzle
- **Target Variables**: Solvability (binary) + Solution Length (regression)
- **Quality**: Clean, validated data with proper feature encoding

### Importing Libraries

In [202]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.figure_factory as ff
import warnings
from tabulate import tabulate
from collections import namedtuple
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report, confusion_matrix, r2_score, mean_absolute_error
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from IPython.display import SVG, display
import matplotlib.font_manager as fm
import glob
import ast
import math
import os




# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
warnings.filterwarnings('ignore')

# Configure display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

# Plotly settings for visuals
import plotly.io as pio
pio.templates.default = "simple_white"

print("Klotski Puzzle ML Analysis")
print("Libraries loaded successfully!")


Klotski Puzzle ML Analysis
Libraries loaded successfully!


## Feature Documentation of the Klotski Puzzle Dataset



### **Identification & Metadata**
| Feature | Type | Range/Format | Description |
|---------|------|--------------|-------------|
| puzzle_id | Object | "puzzle_XXXXXX" | Unique identifier for each puzzle configuration |
| timestamp | Object | ISO format | When the puzzle was generated or analyzed |
| step_number | Float | NaN (unused) | Sequential step number in solution path; empty for every row in this dataset |



### **Solvability & Solution Metrics**
| Feature | Type | Range/Format | Description |
|---------|------|--------------|-------------|
| is_solvable | Boolean | True/False | Whether the puzzle has a valid solution |
| solution_length | Integer | 0-130+ | Number of moves in optimal solution (0 if unsolvable) |
| solve_time_seconds | Float | 0.0-30.0+ | Computational time taken to solve the puzzle |
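
To make `is_solvable` and `solution_length` concrete, here is a minimal breadth-first search sketch. It is not the solver used to build the dataset. It assumes blocks are `(height, width, row, col)` tuples (the layout that `initial_block_states` appears to use), that the puzzle is solved when the 2×2 block's top-left corner reaches row 3, column 1 (inferred from the distance columns in the sample rows), and that one move means sliding one block by one cell. The dataset's own move-counting convention may differ.

```python
from collections import deque

ROWS, COLS = 5, 4
TARGET = (2, 2, 3, 1)  # assumed goal state: 2x2 block with top-left at row 3, col 1


def cells(block):
    """Return the set of (row, col) cells covered by a (height, width, row, col) block."""
    h, w, r, c = block
    return {(r + dr, c + dc) for dr in range(h) for dc in range(w)}


def neighbours(state):
    """Yield every board reachable by sliding one block by one cell."""
    occupied = set().union(*(cells(b) for b in state))
    for i, (h, w, r, c) in enumerate(state):
        own = cells((h, w, r, c))
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if nr < 0 or nc < 0 or nr + h > ROWS or nc + w > COLS:
                continue  # the block would leave the board
            moved = (h, w, nr, nc)
            if (cells(moved) - own) & occupied:
                continue  # the destination overlaps another block
            # Sort so boards that differ only in block order compare equal
            yield tuple(sorted(state[:i] + (moved,) + state[i + 1:]))


def bfs_solution_length(blocks):
    """Return the minimum number of single-cell slides, or None if unsolvable."""
    start = tuple(sorted(blocks))
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        state, depth = queue.popleft()
        if TARGET in state:
            return depth
        for nxt in neighbours(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return None  # search space exhausted without reaching the target


# Goal block one row above the target, with a 1x1 block in its way
print(bfs_solution_length([(2, 2, 2, 1), (1, 1, 4, 1)]))  # → 2
```

Because BFS explores boards in order of depth, the first time the target appears is guaranteed to be at the minimum move count under this move definition.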



### **Block Composition Analysis**
| Feature | Type | Range/Format | Description |
|---------|------|--------------|-------------|
| total_blocks | Integer | 8-15 | Total number of blocks on the board |
| block_count_1x1 | Integer | 0-12 | Number of small 1×1 square blocks |
| block_count_1x2 | Integer | 0-7 | Number of horizontal 1×2 rectangular blocks |
| block_count_2x1 | Integer | 0-6 | Number of vertical 2×1 rectangular blocks |
| block_count_2x2 | Integer | 1 | Number of large 2×2 goal blocks (always 1) |
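
The block counts can be recomputed from the `initial_block_states` column. The sketch below assumes each entry is a `(height, width, row, col)` tuple stored as a string in the CSV; the helper name `count_block_shapes` is hypothetical, and the example string is the first row of the dataset.

```python
import ast
from collections import Counter


def count_block_shapes(block_states):
    """Count blocks per shape from a stringified list of (height, width, row, col) tuples."""
    blocks = ast.literal_eval(block_states)  # parses literals only, never executes code
    shapes = Counter(f"{h}x{w}" for h, w, _row, _col in blocks)
    return {
        "total_blocks": len(blocks),
        "block_count_1x1": shapes["1x1"],
        "block_count_1x2": shapes["1x2"],
        "block_count_2x1": shapes["2x1"],
        "block_count_2x2": shapes["2x2"],
    }


# initial_block_states of puzzle_000000
example = (
    "[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), "
    "(1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]"
)
counts = count_block_shapes(example)
print(counts)
```

For this row the result agrees with the stored columns: 11 blocks in total, six 1×1, two 1×2, two 2×1, and one 2×2.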


### **Goal Block Positioning**
| Feature | Type | Range/Format | Description |
|---------|------|--------------|-------------|
| goal_initial_row | Integer | 0-3 | Starting row position of the goal block |
| goal_initial_col | Integer | 0-2 | Starting column position of the goal block |
| goal_distance_to_target | Float | 1.0-3.16 | Euclidean distance from goal to target position |
| goal_manhattan_distance | Integer | 1-4 | Manhattan distance from goal to target position |
| blocks_between_goal_target | Integer | 0-10 | Number of blocks blocking goal's direct path |
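
Both distance features can be reproduced from the goal block's starting position. The sketch assumes the target is the top-left corner at row 3, column 1 (bottom centre of the 5×4 board). That value is inferred from the sample rows rather than stated in the dataset, so treat it as an assumption.

```python
import math

TARGET_ROW, TARGET_COL = 3, 1  # assumed target position, inferred from sample rows


def goal_distances(goal_row, goal_col):
    """Return (Euclidean, Manhattan) distance from the goal block to the target."""
    d_row = TARGET_ROW - goal_row
    d_col = TARGET_COL - goal_col
    euclidean = math.hypot(d_row, d_col)
    manhattan = abs(d_row) + abs(d_col)
    return euclidean, manhattan


# puzzle_000000 starts with the goal block at row 1, column 2
euclidean, manhattan = goal_distances(1, 2)
print(round(euclidean, 6), manhattan)  # → 2.236068 3
```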



### **Spatial Relationship Features**
| Feature | Type | Range/Format | Description |
|---------|------|--------------|-------------|
| adjacent_1x1_count | Integer | 0-10 | Number of 1×1 blocks adjacent to goal block |
| adjacent_1x2_count | Integer | 0-6 | Number of 1×2 blocks adjacent to goal block |
| adjacent_2x1_count | Integer | 0-6 | Number of 2×1 blocks adjacent to goal block |
| wall_adjacent_sides | Integer | 0-2 | Number of board edges touching the goal block |
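
The sample rows suggest that "adjacent" includes diagonal contact: counting distinct blocks in the ring of cells around the goal block reproduces the stored values for the first few puzzles. The sketch below implements that reading. It is an inference from the data, not the generator's confirmed definition, and `goal_neighbourhood` is a hypothetical helper name.

```python
ROWS, COLS = 5, 4


def goal_neighbourhood(blocks):
    """Count distinct blocks per shape touching the 2x2 goal block (diagonals included)
    and the number of board edges the goal block touches."""
    _, _, gr, gc = next(b for b in blocks if tuple(b[:2]) == (2, 2))
    # Ring of cells around the goal block, clipped to the board
    ring = {
        (r, c)
        for r in range(gr - 1, gr + 3)
        for c in range(gc - 1, gc + 3)
        if 0 <= r < ROWS and 0 <= c < COLS
        and not (gr <= r <= gr + 1 and gc <= c <= gc + 1)
    }
    counts = {"1x1": 0, "1x2": 0, "2x1": 0}
    for h, w, r, c in blocks:
        if (h, w) == (2, 2):
            continue  # skip the goal block itself
        covered = {(r + dr, c + dc) for dr in range(h) for dc in range(w)}
        if covered & ring:
            counts[f"{h}x{w}"] += 1
    walls = sum([gr == 0, gr + 2 == ROWS, gc == 0, gc + 2 == COLS])
    return counts, walls


# Block layout of puzzle_000000 as (height, width, row, col)
puzzle_0 = [(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0),
            (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]
print(goal_neighbourhood(puzzle_0))  # → ({'1x1': 5, '1x2': 1, '2x1': 0}, 1)
```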



### **Board Space Analysis**
| Feature | Type | Range/Format | Description |
|---------|------|--------------|-------------|
| empty_spaces_count | Integer | 0-2 | Total number of unoccupied cells on board |
| empty_space | Object | Coordinate pairs | Positions of empty spaces on the board |
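
The empty cells follow directly from the block layout: any cell of the 5×4 grid that no block covers is empty. A minimal sketch, again assuming `(height, width, row, col)` tuples and using a hypothetical helper name:

```python
import ast

ROWS, COLS = 5, 4


def find_empty_cells(block_states):
    """Return the sorted (row, col) cells not covered by any block."""
    blocks = ast.literal_eval(block_states)
    occupied = {
        (r + dr, c + dc)
        for h, w, r, c in blocks
        for dr in range(h)
        for dc in range(w)
    }
    return [(r, c) for r in range(ROWS) for c in range(COLS) if (r, c) not in occupied]


# initial_block_states of puzzle_000003
example = (
    "[(2, 2, 3, 2), (1, 2, 1, 1), (2, 1, 0, 3), (1, 2, 3, 0), "
    "(1, 2, 2, 0), (1, 2, 4, 0), (2, 1, 0, 0), (1, 2, 2, 2)]"
)
empty = find_empty_cells(example)
print(empty, len(empty))  # → [(0, 1), (0, 2)] 2
```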



### **Visual & State Representation**
| Feature | Type | Range/Format | Description |
|---------|------|--------------|-------------|
| initial_block_states | Object | Shape & Coordinate | Complete state of all block positions |
| board_visual | Object | ASCII art | Visual representation of the puzzle board |


---



### **Feature Engineering Notes**

These features capture multiple dimensions of puzzle complexity:

- **Spatial Complexity**: Block counts, adjacency patterns, empty spaces
- **Goal Accessibility**: Distance metrics, blocking patterns, wall constraints  
- **Solution Difficulty**: Path length, computational complexity, solvability
- **Board Density**: Total blocks vs. available space ratios
- **Strategic Patterns**: Block type distributions and spatial arrangements



This rich feature set enables machine learning models to understand not just the mathematical properties of each puzzle, but also the cognitive and strategic challenges they present to human players.



### **Research Applications**

This dataset supports multiple research directions:

- **Difficulty Prediction**: Predicting human-perceived difficulty from board features
- **Cognitive Load Analysis**: Understanding what makes puzzles mentally challenging
- **Automated Level Design**: Generating puzzles with target difficulty characteristics
- **Player Behavior Modeling**: Analyzing how spatial patterns affect player strategies
- **Educational Applications**: Adaptive difficulty progression in learning systems

# 📊 Data Loading and Initial Exploration

### Loading the data

In [58]:
def load_dataset():
    """
    Load the dataset from the first candidate path that exists.
    Returns None if none of the candidate files is found.
    """
    df = None  # Stays None when no candidate file exists

    # Raw string so the backslash is not treated as an escape sequence
    possible_files = [
        r'.\enhanced_dataset_20250912_024232.csv'
    ]

    for file_path in possible_files:
        if os.path.exists(file_path):
            if file_path.endswith('.json'):
                df = pd.read_json(file_path)
            else:
                df = pd.read_csv(file_path)
            print(f"📂 Loaded: {file_path}")
            break

    return df

df = load_dataset()  # Load the dataset

📂 Loaded: .\enhanced_dataset_20250912_024232.csv


### The first overview

In [59]:
print(f"\n📊 Dataset Overview:")
print(f"Shape: {df.shape[0]:,} puzzles × {df.shape[1]} features")

# Column Overview (data types)
print(f"\n🔍 Column Overview:")
print(df.dtypes.value_counts())



📊 Dataset Overview:
Shape: 50,000 puzzles × 24 features

🔍 Column Overview:
int64      15
object      5
float64     3
bool        1
Name: count, dtype: int64


In [60]:
# Set pandas display options to show more text
pd.set_option('display.max_colwidth', None)  # Show full content
pd.set_option('display.width', None)  # Don't wrap lines
pd.set_option('display.max_columns', None)  # Show all columns

# Custom display function for specific columns
sample_df = df.head().copy()
    
# Display basic info first
print(f"📋 First 5 puzzles:")
display(sample_df.drop(columns=['board_visual'], errors='ignore'))
    
# Then show board visuals separately
if 'board_visual' in df.columns:
    print("\n Board Visuals for these samples:")
    for i, (idx, row) in enumerate(sample_df.iterrows()):
        print(f"\n--- Puzzle {i+1} (Index {idx}) ---")
        print(row['board_visual'])

📋 First 5 puzzles:


Unnamed: 0,puzzle_id,timestamp,is_solvable,total_blocks,block_count_1x1,block_count_1x2,block_count_2x1,block_count_2x2,goal_initial_row,goal_initial_col,goal_distance_to_target,goal_manhattan_distance,blocks_between_goal_target,adjacent_1x1_count,adjacent_1x2_count,adjacent_2x1_count,wall_adjacent_sides,empty_spaces_count,empty_space,solution_length,solve_time_seconds,step_number,initial_block_states
0,puzzle_000000,2025-09-12T02:42:36.004653,True,11,6,2,2,1,1,2,2.236068,3,6,5,1,0,1,2,"[(3, 2), (4, 2)]",17,2.848788,,"[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]"
1,puzzle_000001,2025-09-12T02:42:33.172049,False,9,2,1,5,1,1,1,2.0,2,2,2,1,5,0,2,"[(0, 3), (4, 0)]",0,0.014643,,"[(2, 2, 1, 1), (2, 1, 1, 3), (1, 1, 3, 0), (2, 1, 3, 3), (1, 2, 0, 1), (2, 1, 0, 0), (2, 1, 3, 1), (2, 1, 3, 2), (1, 1, 2, 0)]"
2,puzzle_000002,2025-09-12T02:42:55.841823,True,10,4,2,3,1,0,2,3.162278,4,5,1,1,1,2,2,"[(3, 1), (3, 2)]",75,22.666284,,"[(2, 2, 0, 2), (1, 2, 2, 1), (1, 1, 1, 0), (2, 1, 0, 1), (2, 1, 2, 0), (1, 2, 4, 1), (2, 1, 3, 3), (1, 1, 0, 0), (1, 1, 2, 3), (1, 1, 4, 0)]"
3,puzzle_000003,2025-09-12T02:42:33.201175,False,8,0,5,2,1,3,2,1.0,1,2,0,4,0,2,2,"[(0, 1), (0, 2)]",0,0.004736,,"[(2, 2, 3, 2), (1, 2, 1, 1), (2, 1, 0, 3), (1, 2, 3, 0), (1, 2, 2, 0), (1, 2, 4, 0), (2, 1, 0, 0), (1, 2, 2, 2)]"
4,puzzle_000004,2025-09-12T02:42:33.183368,False,10,4,3,2,1,1,0,2.236068,3,5,3,2,1,1,2,"[(0, 1), (4, 0)]",0,0.010456,,"[(2, 2, 1, 0), (1, 1, 4, 3), (2, 1, 2, 3), (1, 1, 0, 0), (1, 1, 3, 1), (1, 2, 4, 1), (1, 2, 0, 2), (1, 2, 1, 2), (1, 1, 3, 0), (2, 1, 2, 2)]"



 Board Visuals for these samples:

--- Puzzle 1 (Index 0) ---
┌───┬───────┬───┐
│   │       │   │
│   ├───┬───┴───┤
│   │   │       │
├───┼───┤       │
│   │   │       │
│   ├───┼───┬───┤
│   │   │ X │   │
├───┴───┤   ├───┤
│       │ X │   │
└───────┴───┴───┘

--- Puzzle 2 (Index 1) ---
┌───┬───────┬───┐
│   │       │ X │
│   ├───────┼───┤
│   │       │   │
├───┤       │   │
│   │       │   │
├───┼───┬───┼───┤
│   │   │   │   │
├───┤   │   │   │
│ X │   │   │   │
└───┴───┴───┴───┘

--- Puzzle 3 (Index 2) ---
┌───┬───┬───────┐
│   │   │       │
├───┤   │       │
│   │   │       │
├───┼───┴───┬───┤
│   │       │   │
│   ├───────┼───┤
│   │ X   X │   │
├───┼───────┤   │
│   │       │   │
└───┴───────┴───┘

--- Puzzle 4 (Index 3) ---
┌───┬───────┬───┐
│   │ X   X │   │
│   ├───────┤   │
│   │       │   │
├───┴───┬───┴───┤
│       │       │
├───────┼───────┤
│       │       │
├───────┤       │
│       │       │
└───────┴───────┘

--- Puzzle 5 (Index 4) ---
┌───┬───┬───────┐
│   │ X │      

### Seeing basic info about the dataset

In [61]:
# Dataset Info
df.info()  # info() prints its report and returns None, so display() is not needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 24 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   puzzle_id                   50000 non-null  object 
 1   timestamp                   50000 non-null  object 
 2   is_solvable                 50000 non-null  bool   
 3   total_blocks                50000 non-null  int64  
 4   block_count_1x1             50000 non-null  int64  
 5   block_count_1x2             50000 non-null  int64  
 6   block_count_2x1             50000 non-null  int64  
 7   block_count_2x2             50000 non-null  int64  
 8   goal_initial_row            50000 non-null  int64  
 9   goal_initial_col            50000 non-null  int64  
 10  goal_distance_to_target     50000 non-null  float64
 11  goal_manhattan_distance     50000 non-null  int64  
 12  blocks_between_goal_target  50000 non-null  int64  
 13  adjacent_1x1_count          500

#### Checking for Duplicates

In [62]:
# Dataset Duplicate Value Count
print(f'Number of duplicated rows in the dataset: {df.duplicated().sum()}')

Number of duplicated rows in the dataset: 0
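
Since `puzzle_id` is unique per row, a whole-row duplicate check cannot detect two rows that describe the same board. A stricter check compares the layouts themselves. The sketch below uses a toy frame and a hypothetical `canonical_layout` helper, and assumes that block order inside `initial_block_states` carries no meaning.

```python
import ast

import pandas as pd


def canonical_layout(block_states):
    """Order-independent key for a board: the sorted tuple of its blocks."""
    return tuple(sorted(ast.literal_eval(block_states)))


toy = pd.DataFrame({
    "puzzle_id": ["puzzle_a", "puzzle_b", "puzzle_c"],
    "initial_block_states": [
        "[(2, 2, 0, 1), (1, 1, 4, 0)]",
        "[(1, 1, 4, 0), (2, 2, 0, 1)]",  # same board as puzzle_a, listed in another order
        "[(2, 2, 0, 1), (1, 1, 4, 3)]",
    ],
})

row_duplicates = toy.duplicated().sum()
layout_duplicates = toy["initial_block_states"].map(canonical_layout).duplicated().sum()
print(row_duplicates, layout_duplicates)  # → 0 1
```

On the real data the same two lines would be applied to `df` in place of `toy`.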


In [63]:
# Missing Values/Null Values Count
print(f'There are {df.isna().sum().sum()} missing values in the dataset\n')
df.isna().sum()

There are 50000 missing values in the dataset



puzzle_id                         0
timestamp                         0
is_solvable                       0
total_blocks                      0
block_count_1x1                   0
block_count_1x2                   0
block_count_2x1                   0
block_count_2x2                   0
goal_initial_row                  0
goal_initial_col                  0
goal_distance_to_target           0
goal_manhattan_distance           0
blocks_between_goal_target        0
adjacent_1x1_count                0
adjacent_1x2_count                0
adjacent_2x1_count                0
wall_adjacent_sides               0
empty_spaces_count                0
empty_space                       0
solution_length                   0
solve_time_seconds                0
step_number                   50000
initial_block_states              0
board_visual                      0
dtype: int64

#### Dropping Redundant Features

I'm dropping `step_number` and `timestamp` because:

- `timestamp` only records when a puzzle was generated, so it is of no use in the evaluation.
- `step_number` is a feature I developed for an alternative, state-space-based evaluation that I plan to work on in the future. It is irrelevant here, which is why the column is entirely empty.


In [64]:
df = df.drop(columns=['step_number','timestamp'])
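
Columns that are empty for every row can also be found programmatically instead of being named by hand. A small sketch on a toy frame (the data is illustrative only):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "puzzle_id": ["puzzle_a", "puzzle_b"],
    "solution_length": [17, 0],
    "step_number": [np.nan, np.nan],  # entirely empty, like step_number in this dataset
})

all_missing = toy.isna().all()                      # True for columns with no values at all
empty_columns = all_missing[all_missing].index.tolist()
cleaned = toy.drop(columns=empty_columns)
print(empty_columns, list(cleaned.columns))
```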

In [65]:
# Missing Values/Null Values Count again
print(f'There are {df.isna().sum().sum()} missing values in the dataset\n')
df.isna().sum()

There are 0 missing values in the dataset



puzzle_id                     0
is_solvable                   0
total_blocks                  0
block_count_1x1               0
block_count_1x2               0
block_count_2x1               0
block_count_2x2               0
goal_initial_row              0
goal_initial_col              0
goal_distance_to_target       0
goal_manhattan_distance       0
blocks_between_goal_target    0
adjacent_1x1_count            0
adjacent_1x2_count            0
adjacent_2x1_count            0
wall_adjacent_sides           0
empty_spaces_count            0
empty_space                   0
solution_length               0
solve_time_seconds            0
initial_block_states          0
board_visual                  0
dtype: int64

#### Checking Statistical Overview

In [66]:
# Summary statistics for the numeric columns (object columns such as board_visual are excluded)
print("\n📊 Basic Statistics:")
display(df.describe())


📊 Basic Statistics:


| Statistic | total_blocks | block_count_1x1 | block_count_1x2 | block_count_2x1 | block_count_2x2 | goal_initial_row | goal_initial_col | goal_distance_to_target | goal_manhattan_distance | blocks_between_goal_target | adjacent_1x1_count | adjacent_1x2_count | adjacent_2x1_count | wall_adjacent_sides | empty_spaces_count | solution_length | solve_time_seconds |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 | 50000.0 |
| mean | 10.03694 | 3.73284 | 2.6445 | 2.6596 | 1.0 | 1.36542 | 0.99138 | 1.965111 | 2.36192 | 3.81462 | 1.81444 | 1.5894 | 1.61864 | 1.18256 | 1.65896 | 13.5063 | 2.769057 |
| std | 1.003731 | 1.838427 | 1.218798 | 1.216107 | 0.0 | 1.068808 | 0.852807 | 0.824543 | 1.067388 | 2.032825 | 1.402258 | 1.005976 | 1.12763 | 0.716186 | 0.752186 | 22.761069 | 6.002471 |
| min | 8.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.000968 |
| 25% | 9.0 | 2.0 | 2.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.012442 |
| 50% | 10.0 | 4.0 | 3.0 | 3.0 | 1.0 | 1.0 | 1.0 | 2.0 | 2.0 | 4.0 | 2.0 | 2.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.036329 |
| 75% | 11.0 | 4.0 | 3.0 | 3.0 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 | 5.0 | 3.0 | 2.0 | 2.0 | 2.0 | 2.0 | 24.0 | 2.163855 |
| max | 15.0 | 12.0 | 7.0 | 6.0 | 1.0 | 3.0 | 2.0 | 3.162278 | 4.0 | 10.0 | 10.0 | 6.0 | 6.0 | 2.0 | 2.0 | 137.0 | 30.04732 |
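
Note that the median `solution_length` is 0.0 because unsolvable puzzles are stored with a length of 0, which pulls the overall statistics down. Summarising only the solvable rows gives a more meaningful picture. A sketch on a toy frame whose two solvable lengths are taken from the sample rows:

```python
import pandas as pd

toy = pd.DataFrame({
    "is_solvable": [True, False, True, False, False],
    "solution_length": [17, 0, 75, 0, 0],
})

overall_median = toy["solution_length"].median()
solvable_median = toy.loc[toy["is_solvable"], "solution_length"].median()
print(overall_median, solvable_median)  # → 0.0 46.0
```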


#### Checking Min-Max Values

In [67]:
def check_min_max_values(df, ignore_columns=None, only_columns=None):
    """
    Check the minimum and maximum values for each variable in the dataframe,
    with the ability to either ignore certain columns or only check a subset.
    
    :param df: The DataFrame
    :param ignore_columns: List of columns to ignore (default is None)
    :param only_columns: List of columns to include (default is None)
    """
    if ignore_columns is None:
        ignore_columns = []  
    if only_columns is None:
        only_columns = df.columns.tolist()  # fallback: use all columns
    
    # Final set of columns to evaluate
    columns_to_check = [col for col in only_columns if col not in ignore_columns]
    
    results = []
    for column in columns_to_check:
        min_value = df[column].min()
        max_value = df[column].max()
        results.append([column, min_value, max_value])
    
    result_df = pd.DataFrame(results, columns=['Column', 'Min Value', 'Max Value'])
    
    print("\nMin and Max Values:")
    display(result_df)

# Example usage
ignore_columns = ['initial_block_states', 'board_visual', 'empty_space']
check_min_max_values(df, ignore_columns)



Min and Max Values:


| | Column | Min Value | Max Value |
|---|---|---|---|
| 0 | puzzle_id | puzzle_000000 | puzzle_049999 |
| 1 | is_solvable | False | True |
| 2 | total_blocks | 8 | 15 |
| 3 | block_count_1x1 | 0 | 12 |
| 4 | block_count_1x2 | 0 | 7 |
| 5 | block_count_2x1 | 0 | 6 |
| 6 | block_count_2x2 | 1 | 1 |
| 7 | goal_initial_row | 0 | 3 |
| 8 | goal_initial_col | 0 | 2 |
| 9 | goal_distance_to_target | 1.0 | 3.162278 |


----

## Data cleaning & wrangling 

### Checking Unique Data

In [68]:
def check_unique_values(df_or_column):
    """
    Count the unique values for each column of a DataFrame, or for a single column.
    Returns a DataFrame with one row per column and its number of unique values.
    """
    # Check if the input is a DataFrame or a Series (single column)
    if isinstance(df_or_column, pd.DataFrame):
        # Handle multiple columns (entire DataFrame)
        results = []
        for column in df_or_column.columns:
            num_unique = df_or_column[column].nunique()  # Number of unique values
            results.append([column, num_unique])
        result_df = pd.DataFrame(results, columns=['Column', 'Number of Unique Values'])

    elif isinstance(df_or_column, pd.Series):
        # Handle single column (Series)
        num_unique = df_or_column.nunique()  # Number of unique values
        result_df = pd.DataFrame([[df_or_column.name, num_unique]], columns=['Column', 'Number of Unique Values'])

    else:
        # Fail with a clear message instead of returning an unbound result_df
        raise TypeError("Expected a pandas DataFrame or Series")

    return result_df

In [69]:
# Example: Using the function on the entire DataFrame
result_df = check_unique_values(df)

# Display the result
display(result_df)


| | Column | Number of Unique Values |
|---|---|---|
| 0 | puzzle_id | 50000 |
| 1 | is_solvable | 2 |
| 2 | total_blocks | 8 |
| 3 | block_count_1x1 | 7 |
| 4 | block_count_1x2 | 8 |
| 5 | block_count_2x1 | 7 |
| 6 | block_count_2x2 | 1 |
| 7 | goal_initial_row | 4 |
| 8 | goal_initial_col | 3 |
| 9 | goal_distance_to_target | 6 |
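
A column with a single unique value, such as `block_count_2x2`, has zero variance and cannot help a model separate puzzles. Such columns can be listed with `nunique()`. A sketch on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "block_count_1x1": [6, 2, 4],
    "block_count_2x2": [1, 1, 1],  # always 1, as in the dataset
})

unique_counts = toy.nunique()
constant_columns = unique_counts[unique_counts <= 1].index.tolist()
print(constant_columns)  # → ['block_count_2x2']
```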


### Checking for Legal Layouts

In [70]:
def analyze_dataset(df):
    """
    Perform analysis on the dataset and display various statistics.
    """
    # Step 1: Calculate Legal Boards (non-zero empty spaces)
    df['legal_board'] = df['empty_spaces_count'] > 0  # New boolean column for legal boards
    
    legal_boards = df[df['legal_board']]  # Use the new column to filter legal boards
    num_legal_boards = legal_boards.shape[0]
    
    # Step 2: Solvable and Unsolvable puzzles (based only on legal boards)
    solvable_legal = legal_boards['is_solvable'].sum()
    unsolvable_legal = num_legal_boards - solvable_legal
    
    print(f"Total boards: {df.shape[0]:,}")
    print(f"Legal boards: {num_legal_boards:,}")
    print(f"Solvable puzzles: {solvable_legal:,} ({solvable_legal / num_legal_boards:.1%})")
    print(f"Unsolvable puzzles (based on legal boards): {unsolvable_legal:,} ({unsolvable_legal / num_legal_boards:.1%})")

    # Step 3: Illegal boards (with 0 empty spaces)
    invalid_boards = df[~df['legal_board']]  # Use the negation of the legal_board column
    num_invalid_boards = invalid_boards.shape[0]
    percent_invalid_boards = (num_invalid_boards / df.shape[0]) * 100
    print(f"Boards with 0 empty spaces (illegal boards): {num_invalid_boards:,} puzzles ({percent_invalid_boards:.1f}%)")

    # Step 4: Solution length statistics based only on solvable boards
    solvable_df = legal_boards[legal_boards['is_solvable']]  # Use solvable puzzles for solution length analysis
    if len(solvable_df) > 0:
        print(f"Average solution length: {solvable_df['solution_length'].mean():.1f} moves")
        print(f"Solution range: {solvable_df['solution_length'].min()}-{solvable_df['solution_length'].max()} moves")



In [71]:
print("\n")
analyze_dataset(df)



Total boards: 50,000
Legal boards: 41,474
Solvable puzzles: 19,866 (47.9%)
Unsolvable puzzles (based on legal boards): 21,608 (52.1%)
Boards with 0 empty spaces (illegal boards): 8,526 puzzles (17.1%)
Average solution length: 34.0 moves
Solution range: 1-137 moves


In [72]:
# Keep a copy of the original df. I plan to test the illegal boards later in ML, to see whether
# emptying any 2 cells can make them solvable (just an idea for now, I'm not sure I'll manage it).
original_df = df.copy()

# .copy() so that later column assignments do not write to a slice of the original frame
df = df[df['empty_spaces_count'] == 2].copy()
analyze_dataset(df)

Total boards: 41,474
Legal boards: 41,474
Solvable puzzles: 19,866 (47.9%)
Unsolvable puzzles (based on legal boards): 21,608 (52.1%)
Boards with 0 empty spaces (illegal boards): 0 puzzles (0.0%)
Average solution length: 34.0 moves
Solution range: 1-137 moves
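
The filter above relies on the stored `empty_spaces_count`. A stricter check can be made directly on the block layout: every block must lie on the board, no two blocks may overlap, and exactly two cells must stay free. The sketch below is one possible validation under those assumptions. It is not part of the notebook's pipeline, and `is_legal_board` is a hypothetical helper.

```python
ROWS, COLS = 5, 4


def is_legal_board(blocks, required_empty=2):
    """Check bounds, overlaps, and the number of free cells for (h, w, row, col) blocks."""
    seen = set()
    for h, w, r, c in blocks:
        if r < 0 or c < 0 or r + h > ROWS or c + w > COLS:
            return False  # block sticks out of the board
        for dr in range(h):
            for dc in range(w):
                cell = (r + dr, c + dc)
                if cell in seen:
                    return False  # two blocks share a cell
                seen.add(cell)
    return ROWS * COLS - len(seen) == required_empty


# Layout of puzzle_000003
puzzle_3 = [(2, 2, 3, 2), (1, 2, 1, 1), (2, 1, 0, 3), (1, 2, 3, 0),
            (1, 2, 2, 0), (1, 2, 4, 0), (2, 1, 0, 0), (1, 2, 2, 2)]
print(is_legal_board(puzzle_3))       # → True
print(is_legal_board(puzzle_3[:-1]))  # → False (four free cells)
```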


# 🔧 Advanced Feature Engineering



### 1. Complexity Analyzer Implementation

I'll create a sophisticated complexity scoring system that captures multiple dimensions of puzzle difficulty:

In [73]:
# Feature engineering for structural complexity
def engineer_structural_features(df):
    """
    Engineer structural features following empirical hardness modeling.
    Research-backed feature transformations without learned weights.
    
    This function creates deterministic features that capture puzzle difficulty
    through spatial reasoning, movement constraints, and goal accessibility.
    Each feature is designed based on cognitive psychology and puzzle theory.
    """
    print("Engineering structural features...")
    
    # ==========================================
    # MOVEMENT CONSTRAINT FEATURES
    # ==========================================
    
    # Mobility Index: Measures how constrained piece movement is
    # Logic: More empty space relative to blocks = easier movement = lower difficulty
    # Formula: empty_spaces / total_blocks
    # Range: ~0.14-0.25 in this dataset (higher = easier puzzles)
    # Research basis: Spatial reasoning requires maneuvering room (Hegarty & Waller, 2005)
    df['mobility_index'] = (
        df['empty_spaces_count'] / df['total_blocks']
    )
    
    # ==========================================
    # GOAL INTERFERENCE FEATURES  
    # ==========================================
    
    # Blocker Score: Quantifies how much the goal path is obstructed
    # Logic: More blocks blocking + wall constraints = higher clearing complexity
    # Formula: blocks_in_way × (1 + wall_penalty)
    # Wall penalty: Each adjacent wall adds 50% difficulty (max 2 walls for 2x2 goal)
    # Range: 0-20 in this dataset (higher = harder puzzles)
    # Research basis: Goal-oriented spatial planning (Knauff & Johnson-Laird, 2002)
    df['blocker_score'] = (
        df['blocks_between_goal_target'] *           # Base obstacle count
        (1 + df['wall_adjacent_sides'] / 2.0)        # Wall constraint multiplier (max 2 walls)
    )
    
    # Clearance Cost: Measures spatial reasoning complexity for goal achievement  
    # Logic: Distance + obstacles = compound planning difficulty
    # Formula: distance_to_goal × blocks_blocking_path
    # Captures both "how far" and "how many obstacles" simultaneously
    # Range: 0-31.6 in this dataset (higher = harder puzzles requiring multi-step planning)
    # Research basis: Working memory load in spatial problem solving (Miyake et al., 2001)
    df['clearance_cost'] = (
        df['goal_distance_to_target'] * df['blocks_between_goal_target']
    )
    
    # ==========================================
    # BOARD COMPOSITION FEATURES
    # ==========================================
    
    # Board Density: Fundamental constraint measure
    # Logic: More filled board = less maneuvering space = higher difficulty
    # Formula: total_blocks / board_size (5×4 = 20 cells)
    # Range: 0.4-0.7 after filtering (blocks per cell; every filtered board has 18 of 20 cells filled)
    # Research basis: Constraint satisfaction problems (Mackworth, 1977)
    df['board_density'] = df['total_blocks'] / 20.0  # Normalize to 5×4 board
    
    # ==========================================
    # FEATURE SELECTION FOR ML PIPELINE
    # ==========================================
    
    # Curated feature set combining:
    # 1. Engineered complexity metrics (mobility, blocker, clearance, density)
    # 2. Raw spatial measurements (distances, adjacencies) 
    # 3. Block composition patterns (different piece types)
    #
    # Selection criteria:
    # - Low correlation (avoid redundancy)
    # - High predictive power for difficulty
    # - Interpretable business meaning
    # - Computationally efficient
    structural_features = [
        # === ENGINEERED COMPLEXITY FEATURES ===
        'mobility_index',          # Movement freedom (lower = harder)
        'blocker_score',          # Goal path obstruction (higher = harder) 
        'clearance_cost',         # Compound spatial reasoning (higher = harder)
        'board_density',          # Space availability (higher = generally harder)
        
        # === RAW SPATIAL MEASUREMENTS ===
        'goal_manhattan_distance', # Direct distance to target position
        'wall_adjacent_sides',    # Wall constraint count (0-2 for Klotski)
        'blocks_between_goal_target', # Direct obstacle count
        
        # === BLOCK ADJACENCY PATTERNS ===
        'adjacent_1x1_count',     # Small block clustering (affects movement)
        'adjacent_1x2_count',     # Horizontal block patterns
        'adjacent_2x1_count'      # Vertical block patterns
    ]
    
    print(f"    Engineered {len(structural_features)} structural features")
    
    return df, structural_features

# ==========================================
# APPLY FEATURE ENGINEERING PIPELINE
# ==========================================

# Execute the feature engineering process
df, feature_columns = engineer_structural_features(df)

print("\n Structural Feature Summary:")
print("    Each feature designed to capture different aspects of puzzle difficulty:")
print("   • Lower mobility_index = more constrained movement")
print("   • Higher blocker_score = more goal interference") 
print("   • Higher clearance_cost = more complex spatial reasoning required")
print("   • Higher board_density = less maneuvering space available")

# Display statistical summary of all engineered features
structural_stats = df[feature_columns].describe()
display(structural_stats)

# Add structural features to main feature list for downstream analysis
all_features = feature_columns.copy()


Engineering structural features...
    Engineered 10 structural features

 Structural Feature Summary:
    Each feature designed to capture different aspects of puzzle difficulty:
   • Lower mobility_index = more constrained movement
   • Higher blocker_score = more goal interference
   • Higher clearance_cost = more complex spatial reasoning required
   • Higher board_density = less maneuvering space available


| Statistic | mobility_index | blocker_score | clearance_cost | board_density | goal_manhattan_distance | wall_adjacent_sides | blocks_between_goal_target | adjacent_1x1_count | adjacent_1x2_count | adjacent_2x1_count |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 41474.0 | 41474.0 | 41474.0 | 41474.0 | 41474.0 | 41474.0 | 41474.0 | 41474.0 | 41474.0 | 41474.0 |
| mean | 0.204786 | 6.133939 | 8.705244 | 0.492686 | 2.35996 | 1.175242 | 3.729951 | 1.812003 | 1.534817 | 1.576241 |
| std | 0.019264 | 4.04982 | 7.173863 | 0.046845 | 1.064812 | 0.717496 | 1.983253 | 1.409185 | 0.987666 | 1.105195 |
| min | 0.142857 | 0.0 | 0.0 | 0.4 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.2 | 3.0 | 2.0 | 0.45 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 |
| 50% | 0.2 | 4.5 | 6.0 | 0.5 | 2.0 | 1.0 | 4.0 | 2.0 | 1.0 | 1.0 |
| 75% | 0.222222 | 7.5 | 13.416408 | 0.5 | 3.0 | 2.0 | 5.0 | 3.0 | 2.0 | 2.0 |
| max | 0.25 | 20.0 | 31.622777 | 0.7 | 4.0 | 2.0 | 10.0 | 9.0 | 6.0 | 6.0 |


In [74]:
display(df.head())

Unnamed: 0,puzzle_id,is_solvable,total_blocks,block_count_1x1,block_count_1x2,block_count_2x1,block_count_2x2,goal_initial_row,goal_initial_col,goal_distance_to_target,goal_manhattan_distance,blocks_between_goal_target,adjacent_1x1_count,adjacent_1x2_count,adjacent_2x1_count,wall_adjacent_sides,empty_spaces_count,empty_space,solution_length,solve_time_seconds,initial_block_states,board_visual,legal_board,mobility_index,blocker_score,clearance_cost,board_density
0,puzzle_000000,True,11,6,2,2,1,1,2,2.236068,3,6,5,1,0,1,2,"[(3, 2), (4, 2)]",17,2.848788,"[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]",┌───┬───────┬───┐\n│ │ │ │\n│ ├───┬───┴───┤\n│ │ │ │\n├───┼───┤ │\n│ │ │ │\n│ ├───┼───┬───┤\n│ │ │ X │ │\n├───┴───┤ ├───┤\n│ │ X │ │\n└───────┴───┴───┘,True,0.181818,9.0,13.416408,0.55
1,puzzle_000001,False,9,2,1,5,1,1,1,2.0,2,2,2,1,5,0,2,"[(0, 3), (4, 0)]",0,0.014643,"[(2, 2, 1, 1), (2, 1, 1, 3), (1, 1, 3, 0), (2, 1, 3, 3), (1, 2, 0, 1), (2, 1, 0, 0), (2, 1, 3, 1), (2, 1, 3, 2), (1, 1, 2, 0)]",┌───┬───────┬───┐\n│ │ │ X │\n│ ├───────┼───┤\n│ │ │ │\n├───┤ │ │\n│ │ │ │\n├───┼───┬───┼───┤\n│ │ │ │ │\n├───┤ │ │ │\n│ X │ │ │ │\n└───┴───┴───┴───┘,True,0.222222,2.0,4.0,0.45
2,puzzle_000002,True,10,4,2,3,1,0,2,3.162278,4,5,1,1,1,2,2,"[(3, 1), (3, 2)]",75,22.666284,"[(2, 2, 0, 2), (1, 2, 2, 1), (1, 1, 1, 0), (2, 1, 0, 1), (2, 1, 2, 0), (1, 2, 4, 1), (2, 1, 3, 3), (1, 1, 0, 0), (1, 1, 2, 3), (1, 1, 4, 0)]",┌───┬───┬───────┐\n│ │ │ │\n├───┤ │ │\n│ │ │ │\n├───┼───┴───┬───┤\n│ │ │ │\n│ ├───────┼───┤\n│ │ X X │ │\n├───┼───────┤ │\n│ │ │ │\n└───┴───────┴───┘,True,0.2,10.0,15.811388,0.5
3,puzzle_000003,False,8,0,5,2,1,3,2,1.0,1,2,0,4,0,2,2,"[(0, 1), (0, 2)]",0,0.004736,"[(2, 2, 3, 2), (1, 2, 1, 1), (2, 1, 0, 3), (1, 2, 3, 0), (1, 2, 2, 0), (1, 2, 4, 0), (2, 1, 0, 0), (1, 2, 2, 2)]",┌───┬───────┬───┐\n│ │ X X │ │\n│ ├───────┤ │\n│ │ │ │\n├───┴───┬───┴───┤\n│ │ │\n├───────┼───────┤\n│ │ │\n├───────┤ │\n│ │ │\n└───────┴───────┘,True,0.25,4.0,2.0,0.4
4,puzzle_000004,False,10,4,3,2,1,1,0,2.236068,3,5,3,2,1,1,2,"[(0, 1), (4, 0)]",0,0.010456,"[(2, 2, 1, 0), (1, 1, 4, 3), (2, 1, 2, 3), (1, 1, 0, 0), (1, 1, 3, 1), (1, 2, 4, 1), (1, 2, 0, 2), (1, 2, 1, 2), (1, 1, 3, 0), (2, 1, 2, 2)]",┌───┬───┬───────┐\n│ │ X │ │\n├───┴───┼───────┤\n│ │ │\n│ ├───┬───┤\n│ │ │ │\n├───┬───┤ │ │\n│ │ │ │ │\n├───┼───┴───┼───┤\n│ X │ │ │\n└───┴───────┴───┘,True,0.2,7.5,11.18034,0.5


In [75]:
new_features = ["mobility_index", "blocker_score", "clearance_cost", "board_density"]
check_min_max_values(df, only_columns=new_features)


Min and Max Values:


| | Column | Min Value | Max Value |
|---|---|---|---|
| 0 | mobility_index | 0.142857 | 0.25 |
| 1 | blocker_score | 0.0 | 20.0 |
| 2 | clearance_cost | 0.0 | 31.622777 |
| 3 | board_density | 0.4 | 0.7 |


### 2. Chiral Twin Symmetry Analysis

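Two boards are chiral twins when one is the left-right mirror image of the other. In this section I detect such pairs, along with boards that are their own mirror image, by comparing order-independent canonical forms of each layout: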

#### Section 1: Core Functions and Data Structures

In [76]:
import ast
from collections import namedtuple

# Define lightweight block representation for canonical operations
BlockLite = namedtuple('BlockLite', ['num_rows','num_cols','row_pos','col_pos'])

def parse_block_states(s):
    """
    Parse block states with robust error handling
    Converts string representation to list of tuples
    """
    try:
        return ast.literal_eval(s) if isinstance(s, str) else s
    except Exception:
        return []

def blocks_from_states(states):
    """
    Convert state tuples to BlockLite objects
    Creates structured representation for geometric operations
    """
    if not states: 
        return []
    return [BlockLite(h, w, r, c) for (h, w, r, c) in states]

def horizontal_mirror_blocks(blocks, cols=4):
    """
    Create horizontal mirror of blocks (chiral twin transformation)
    
    Logic: Flip each block's column position across vertical centerline
    Formula: new_col = total_cols - original_col - block_width
    """
    return [BlockLite(b.num_rows, b.num_cols, b.row_pos, cols - b.col_pos - b.num_cols) 
            for b in blocks]

def canonical_form(blocks):
    """
    Create ordering-invariant canonical representation
    
    Purpose: Enables comparison regardless of block enumeration order
    Method: Sort blocks by position and size to create unique signature
    """
    return tuple(sorted((b.row_pos, b.col_pos, b.num_rows, b.num_cols) for b in blocks))

print("Chiral twin analysis functions loaded")

Chiral twin analysis functions loaded
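
To make the mirror formula concrete, here is a minimal, self-contained sketch on a toy board. It re-declares `BlockLite` and the two helpers as defined above so it runs on its own; the two-block `toy` layout is an illustrative assumption, not a puzzle from the dataset.

```python
from collections import namedtuple

BlockLite = namedtuple('BlockLite', ['num_rows', 'num_cols', 'row_pos', 'col_pos'])

def horizontal_mirror_blocks(blocks, cols=4):
    # new_col = total_cols - original_col - block_width
    return [BlockLite(b.num_rows, b.num_cols, b.row_pos, cols - b.col_pos - b.num_cols)
            for b in blocks]

def canonical_form(blocks):
    # Sorting makes the signature independent of block listing order
    return tuple(sorted((b.row_pos, b.col_pos, b.num_rows, b.num_cols) for b in blocks))

# Hypothetical toy layout: a 2x2 block at column 2 and a 1x1 block at column 0
toy = [BlockLite(2, 2, 0, 2), BlockLite(1, 1, 3, 0)]
mirrored = horizontal_mirror_blocks(toy)

# The 2x2 block moves from column 2 to column 4 - 2 - 2 = 0;
# the 1x1 block moves from column 0 to column 4 - 0 - 1 = 3.
print([b.col_pos for b in mirrored])

# Mirroring twice restores the original layout
assert horizontal_mirror_blocks(mirrored) == toy

# canonical_form ignores the order in which blocks are listed
assert canonical_form(toy) == canonical_form(list(reversed(toy)))
```

Subtracting the block width is what keeps multi-cell blocks inside the board: without it, the 2x2 block would land on column 2 again instead of column 0.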


#### Section 2: Symmetry Detection & Classification

In [77]:
def is_self_mirror(blocks, cols=4):
    """
    Check if puzzle is its own chiral twin (self-mirroring)
    
    Logic: Compare canonical form with its horizontal mirror
    Returns: True if puzzle is symmetric around vertical centerline
    """
    return canonical_form(blocks) == canonical_form(horizontal_mirror_blocks(blocks, cols))

def mirror_invariant_key(blocks, cols=4):
    """
    Create key identical for a board and its chiral twin
    
    Purpose: Group chiral twins together for efficient pair detection
    Method: Use minimum of canonical form and its mirror (lexicographic ordering)
    Result: Both a puzzle and its chiral twin get the same key
    """
    a = canonical_form(blocks)
    b = canonical_form(horizontal_mirror_blocks(blocks, cols))
    return min(a, b)  # Identical for a board and its chiral twin

print("Symmetry detection functions ready")

Symmetry detection functions ready
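
As a quick sanity check of the invariant key, the sketch below recomputes it for `puzzle_000000` from the `initial_block_states` value shown in the `df.head()` output. Plain tuples stand in for `BlockLite` to keep it short, and the helper names `canon` and `mirror` are illustrative stand-ins for `canonical_form` and `horizontal_mirror_blocks`.

```python
# (num_rows, num_cols, row_pos, col_pos) tuples for puzzle_000000, copied from df.head()
states = [(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3),
          (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]

def canon(states):
    # Same ordering as canonical_form: (row, col, height, width), sorted
    return tuple(sorted((r, c, h, w) for (h, w, r, c) in states))

def mirror(states, cols=4):
    # Same formula as horizontal_mirror_blocks
    return [(h, w, r, cols - c - w) for (h, w, r, c) in states]

original = canon(states)
twin = canon(mirror(states))
key = min(original, twin)          # tuples compare lexicographically

print(original[0])                 # first entry of the board's own canonical form
print(twin[0])                     # first entry of the mirrored canonical form
print(key == twin)                 # the mirrored form sorts first, so it becomes the key
print(original == twin)            # False means the board is not self-mirroring

# Starting from the mirrored board yields the same key
key_from_mirror = min(canon(mirror(states)), canon(mirror(mirror(states))))
assert key_from_mirror == key
```

Because `min` picks the same tuple whichever side you start from, a board and its mirror image always land in the same group.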


#### Section 3: Data Parsing & Canonical Signature Generation

In [78]:


print("Analyzing chiral twin patterns...")
print("="*50)

# Parse block configurations from string format
print("    Parsing puzzle block configurations...")
df['parsed_blocks'] = df['initial_block_states'].apply(parse_block_states)
df['blocks_objects'] = df['parsed_blocks'].apply(blocks_from_states)

print(f"       Parsed {len(df)} puzzle configurations")

# Generate canonical signatures for each puzzle
print("    Computing canonical forms and chiral signatures...")
df['mirror_canonical'] = df['blocks_objects'].apply(canonical_form)
df['mirror_invariant_key'] = df['blocks_objects'].apply(mirror_invariant_key)
df['is_self_mirror'] = df['blocks_objects'].apply(is_self_mirror)

print(f"       Generated canonical signatures for all puzzles")

Analyzing chiral twin patterns...
    Parsing puzzle block configurations...
       Parsed 41474 puzzle configurations
    Computing canonical forms and chiral signatures...
       Generated canonical signatures for all puzzles
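
One caveat about the parsing step is worth illustrating: `parse_block_states` returns an empty list when a string cannot be parsed, and an empty list has the canonical form `()`. A minimal sketch, with made-up input strings, of how to count such rows before trusting the grouping:

```python
import ast

def parse_block_states(s):
    # literal_eval only evaluates Python literals, so it is safe on untrusted strings
    try:
        return ast.literal_eval(s) if isinstance(s, str) else s
    except Exception:
        return []

# Hypothetical inputs: one well-formed string and one truncated string
good = parse_block_states("[(2, 2, 1, 2), (1, 2, 0, 1)]")
bad = parse_block_states("[(2, 2, 1, 2), (1, 2")

print(good)
print(bad)

# Rows that failed to parse would all share the empty signature and look like duplicates,
# so it is worth counting them explicitly
n_empty = sum(1 for parsed in (good, bad) if not parsed)
print(n_empty)
```

On the real data the equivalent check would be something like counting rows where `parsed_blocks` has length zero; I have not shown that output here.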


#### Section 4: Chiral Twin Grouping & Pair Detection

In [79]:
# Group puzzles by their mirror-invariant signatures
print("    Identifying chiral twin pairs...")
grp = df.groupby('mirror_invariant_key', dropna=False)
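df['mirror_group_size'] = grp['puzzle_id'].transform('size')
# Note: a group of size >= 2 can also consist of exact duplicates of one layout,
# so this flag marks *potential* twins only
df['has_chiral_twin'] = (df['mirror_group_size'] >= 2).astype(int)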

print(f"       Found {df['has_chiral_twin'].sum():,} puzzles with potential chiral twins")

def pick_twin(group):
    """
    Map each puzzle to its chiral twin within the group
    
    Logic: 
    1. Group puzzles by their canonical form
    2. If 2+ different canonical forms exist, they're chiral twins
    3. Pair puzzles from different canonical forms
    """
    by_canon = {}
    for _, row in group.iterrows():
        by_canon.setdefault(row['mirror_canonical'], []).append(row['puzzle_id'])
    
    twin = {}
    canons = list(by_canon.keys())
    
    # A mirror-invariant group holds at most two canonical forms: a layout and its mirror
    if len(canons) >= 2:
        A, B = by_canon[canons[0]], by_canon[canons[1]]
        # Map each puzzle to a partner with the opposite form; when A and B differ in size
        # the index wraps around, so the mapping is not strictly one-to-one
        for i, a in enumerate(A): 
            twin[a] = B[i % len(B)]
        for j, b in enumerate(B): 
            twin[b] = A[j % len(A)]
    
    # Ensure all puzzles have an entry (None if no twin)
    for ids in by_canon.values():
        for pid in ids: 
            twin.setdefault(pid, None)
    
    return twin

# Create comprehensive twin mapping
print("    Creating chiral twin mappings...")
tmap = {}
for _, g in grp:
    tmap.update(pick_twin(g))

df['chiral_twin_id'] = df['puzzle_id'].map(tmap)

print("    Chiral twin relationships established")

    Identifying chiral twin pairs...
       Found 8,379 puzzles with potential chiral twins
    Creating chiral twin mappings...
    Chiral twin relationships established


#### Section 5: Chiral Twin Analysis Results

In [80]:
print("\n Chiral twin analysis complete!")
print("="*40)

# Summary statistics
twin_count = df['has_chiral_twin'].sum()
self_mirror_count = df['is_self_mirror'].sum()
total_puzzles = len(df)

print(f"    Puzzles with chiral twins: {twin_count:,} ({twin_count/total_puzzles:.1%})")
print(f"    Self-mirroring puzzles: {self_mirror_count:,} ({self_mirror_count/total_puzzles:.1%})")
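# Approximation: a self-mirroring puzzle that also sits in a group of size >= 2 is subtracted twice
print(f"    Asymmetric puzzles: {total_puzzles - twin_count - self_mirror_count:,}")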

# Add chiral features to main feature list
chiral_features = ['is_self_mirror', 'has_chiral_twin']
all_features.extend(chiral_features)

print(f"\n    Added {len(chiral_features)} chiral symmetry features to analysis")


 Chiral twin analysis complete!
    Puzzles with chiral twins: 8,379 (20.2%)
    Self-mirroring puzzles: 145 (0.3%)
    Asymmetric puzzles: 32,950

    Added 2 chiral symmetry features to analysis


#### Section 6: Chiral Twin Solution Equivalence Testing

In [81]:

print("\n Testing chiral twin solution equivalence...")
print("="*45)

if 'solution_length' in df.columns and 'solve_time_seconds' in df.columns:
    # Find puzzles that have actual chiral twins (not self-mirrors)
    twin_pairs = df[df['chiral_twin_id'].notna() & (df['chiral_twin_id'] != df['puzzle_id'])].copy()
    
    if len(twin_pairs) > 0:
        print(f"    Analyzing {len(twin_pairs)} puzzles with chiral twins")
        
        # Initialize analysis variables
        length_diffs = []
        identical_solutions = 0
        valid_pairs = 0
        
        # Compare each puzzle with its chiral twin
        for _, puzzle in twin_pairs.iterrows():
            twin_id = puzzle['chiral_twin_id']
            twin_row = df[df['puzzle_id'] == twin_id]
            
            if len(twin_row) > 0:
                twin = twin_row.iloc[0]
                
                # Only compare solvable puzzle pairs
                if (puzzle['is_solvable'] and twin['is_solvable'] and 
                    pd.notna(puzzle['solution_length']) and pd.notna(twin['solution_length'])):
                    
                    length_diff = abs(puzzle['solution_length'] - twin['solution_length'])
                    length_diffs.append(length_diff)
                    valid_pairs += 1
                    
                    if length_diff == 0:
                        identical_solutions += 1
        
        # Report quality assessment results
        if length_diffs:
            avg_length_diff = np.mean(length_diffs)
            print(f"\n   SOLUTION EQUIVALENCE RESULTS:")
            print(f"       Average solution length difference: {avg_length_diff:.2f} moves")
            print(f"       Identical solution lengths: {identical_solutions}/{valid_pairs} ({identical_solutions/valid_pairs:.1%})")
            print(f"       Maximum difference observed: {max(length_diffs):.0f} moves")
            
            # Quality interpretation
            if avg_length_diff <= 0.1:
                print(f"\n    EXCELLENT: Chiral twins have nearly identical solution complexity")
                print(f"       This validates our canonical geometry approach")
            elif avg_length_diff <= 1.0:
                print(f"\n    ACCEPTABLE: Minor solution differences between chiral twins")
                print(f"       May indicate solver tie-breaking or minor asymmetries")
            else:
                print(f"\n    INVESTIGATION NEEDED: Significant solution differences")
                print(f"       Average diff {avg_length_diff:.2f} moves suggests systematic issues")
        
        print(f"\n    Valid twin pairs analyzed: {valid_pairs}")
    else:
        print("    No chiral twin pairs found for equivalence testing")
else:
    print("    Missing solution data - skipping equivalence analysis")

print("\n Chiral twin analysis complete!")


 Testing chiral twin solution equivalence...
    Analyzing 4802 puzzles with chiral twins

   SOLUTION EQUIVALENCE RESULTS:
       Average solution length difference: 0.00 moves
       Identical solution lengths: 1063/1063 (100.0%)
       Maximum difference observed: 0 moves

    EXCELLENT: Chiral twins have nearly identical solution complexity
       This validates our canonical geometry approach

    Valid twin pairs analyzed: 1063

 Chiral twin analysis complete!
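
The comparison loop above filters the full DataFrame once per puzzle to find its twin. A possible alternative, sketched here on hypothetical records rather than the real data, is to build an id lookup once and reuse it. The record layout and the ids `p1` to `p5` are assumptions for illustration only.

```python
# Hypothetical records: (puzzle_id, chiral_twin_id, is_solvable, solution_length)
records = [
    ("p1", "p2", True, 17),
    ("p2", "p1", True, 17),
    ("p3", "p4", False, 0),
    ("p4", "p3", False, 0),
    ("p5", None, True, 40),
]

# Build the lookup once instead of scanning the whole table for every row
by_id = {pid: (solvable, length) for pid, _, solvable, length in records}

length_diffs = []
for pid, twin_id, solvable, length in records:
    if twin_id is None or twin_id not in by_id:
        continue
    twin_solvable, twin_length = by_id[twin_id]
    # Only compare pairs where both sides were solved
    if solvable and twin_solvable:
        length_diffs.append(abs(length - twin_length))

print(length_diffs)
```

Note that the solvable pair contributes two entries, one from each direction. The same thing happens in the cell above, which is why its count of compared rows is larger than the number of distinct pairs.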


In [82]:
df.head(2)

Unnamed: 0,puzzle_id,is_solvable,total_blocks,block_count_1x1,block_count_1x2,block_count_2x1,block_count_2x2,goal_initial_row,goal_initial_col,goal_distance_to_target,goal_manhattan_distance,blocks_between_goal_target,adjacent_1x1_count,adjacent_1x2_count,adjacent_2x1_count,wall_adjacent_sides,empty_spaces_count,empty_space,solution_length,solve_time_seconds,initial_block_states,board_visual,legal_board,mobility_index,blocker_score,clearance_cost,board_density,parsed_blocks,blocks_objects,mirror_canonical,mirror_invariant_key,is_self_mirror,mirror_group_size,has_chiral_twin,chiral_twin_id
0,puzzle_000000,True,11,6,2,2,1,1,2,2.236068,3,6,5,1,0,1,2,"[(3, 2), (4, 2)]",17,2.848788,"[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]",┌───┬───────┬───┐\n│ │ │ │\n│ ├───┬───┴───┤\n│ │ │ │\n├───┼───┤ │\n│ │ │ │\n│ ├───┼───┬───┤\n│ │ │ X │ │\n├───┴───┤ ├───┤\n│ │ X │ │\n└───────┴───┴───┘,True,0.181818,9.0,13.416408,0.55,"[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]","[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]","((0, 0, 2, 1), (0, 1, 1, 2), (0, 3, 1, 1), (1, 1, 1, 1), (1, 2, 2, 2), (2, 0, 2, 1), (2, 1, 1, 1), (3, 1, 1, 1), (3, 3, 1, 1), (4, 0, 1, 2), (4, 3, 1, 1))","((0, 0, 1, 1), (0, 1, 1, 2), (0, 3, 2, 1), (1, 0, 2, 2), (1, 2, 1, 1), (2, 2, 1, 1), (2, 3, 2, 1), (3, 0, 1, 1), (3, 2, 1, 1), (4, 0, 1, 1), (4, 2, 1, 2))",False,1,0,
1,puzzle_000001,False,9,2,1,5,1,1,1,2.0,2,2,2,1,5,0,2,"[(0, 3), (4, 0)]",0,0.014643,"[(2, 2, 1, 1), (2, 1, 1, 3), (1, 1, 3, 0), (2, 1, 3, 3), (1, 2, 0, 1), (2, 1, 0, 0), (2, 1, 3, 1), (2, 1, 3, 2), (1, 1, 2, 0)]",┌───┬───────┬───┐\n│ │ │ X │\n│ ├───────┼───┤\n│ │ │ │\n├───┤ │ │\n│ │ │ │\n├───┼───┬───┼───┤\n│ │ │ │ │\n├───┤ │ │ │\n│ X │ │ │ │\n└───┴───┴───┴───┘,True,0.222222,2.0,4.0,0.45,"[(2, 2, 1, 1), (2, 1, 1, 3), (1, 1, 3, 0), (2, 1, 3, 3), (1, 2, 0, 1), (2, 1, 0, 0), (2, 1, 3, 1), (2, 1, 3, 2), (1, 1, 2, 0)]","[(2, 2, 1, 1), (2, 1, 1, 3), (1, 1, 3, 0), (2, 1, 3, 3), (1, 2, 0, 1), (2, 1, 0, 0), (2, 1, 3, 1), (2, 1, 3, 2), (1, 1, 2, 0)]","((0, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2), (1, 3, 2, 1), (2, 0, 1, 1), (3, 0, 1, 1), (3, 1, 2, 1), (3, 2, 2, 1), (3, 3, 2, 1))","((0, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2), (1, 3, 2, 1), (2, 0, 1, 1), (3, 0, 1, 1), (3, 1, 2, 1), (3, 2, 2, 1), (3, 3, 2, 1))",False,1,0,


In [83]:
ignore_columns = ['empty_space','initial_block_states','board_visual','parsed_blocks','blocks_objects','mirror_canonical','mirror_invariant_key','chiral_twin_id']
check_min_max_values(df,ignore_columns)


Min and Max Values:


| | Column | Min Value | Max Value |
|---|---|---|---|
| 0 | puzzle_id | puzzle_000000 | puzzle_049999 |
| 1 | is_solvable | False | True |
| 2 | total_blocks | 8 | 14 |
| 3 | block_count_1x1 | 0 | 12 |
| 4 | block_count_1x2 | 0 | 7 |
| 5 | block_count_2x1 | 0 | 6 |
| 6 | block_count_2x2 | 1 | 1 |
| 7 | goal_initial_row | 0 | 3 |
| 8 | goal_initial_col | 0 | 2 |
| 9 | goal_distance_to_target | 1.0 | 3.162278 |


In [84]:
# Count the unique values in a single column ('chiral_twin_id')
result_df = check_unique_values(df['chiral_twin_id'])

# Display the result
display(result_df)


| | Column | Number of Unique Values |
|---|---|---|
| 0 | chiral_twin_id | 4288 |


### 3. Refinement of Chiral Twin & Symmetry Analysis

Robust Data Quality Assessment

##### Section 1: Data Validation and Initial Setup

In [85]:
print("SYMMETRY & CHIRAL TWIN ANALYSIS")
print("=" * 32)
print("Data Validation and Setup")

# Check for data quality issues
if df['puzzle_id'].duplicated().any():
    duplicate_count = df['puzzle_id'].duplicated().sum()
    print(f"WARNING: Found {duplicate_count} duplicate puzzle_ids")
    print("Recommendation: Clean duplicates before analysis")
else:
    print("Data validation passed: No duplicate puzzle_ids found")

# Create working copy to avoid modifying original data
df_analysis = df.copy()
print(f"Working with {len(df_analysis):,} puzzles")
print("Setup complete!\n")

SYMMETRY & CHIRAL TWIN ANALYSIS
Data Validation and Setup
Data validation passed: No duplicate puzzle_ids found
Working with 41,474 puzzles
Setup complete!



##### Section 2: Compare Original vs Corrected Chiral Definitions

In [86]:

print("Chiral Twin Definition Analysis")
print('='*32)
# Create corrected definition based on mirror invariant groups
canon_nunique = df_analysis.groupby('mirror_invariant_key')['mirror_canonical'].transform('nunique')
df_analysis['has_chiral_twin_distinct'] = (canon_nunique >= 2)

# Compare the two definitions
original_count = df_analysis['has_chiral_twin'].sum()
corrected_count = df_analysis['has_chiral_twin_distinct'].sum()
agreement = (df_analysis['has_chiral_twin'] == df_analysis['has_chiral_twin_distinct']).sum()

print("Chiral Twin Definition Comparison:")
print(f"   Original definition: {original_count:,} puzzles")
print(f"   Corrected definition: {corrected_count:,} puzzles")
print(f"   Difference: {corrected_count - original_count:+,} puzzles")
print(f"   Agreement rate: {agreement/len(df_analysis)*100:.1f}%")

# Store results for later use
chiral_results = {
    'original_count': original_count,
    'corrected_count': corrected_count,
    'agreement_rate': agreement/len(df_analysis)
}
print("Chiral definition analysis complete!\n")

Chiral Twin Definition Analysis
Chiral Twin Definition Comparison:
   Original definition: 8,379 puzzles
   Corrected definition: 4,802 puzzles
   Difference: -3,577 puzzles
   Agreement rate: 91.4%
Chiral definition analysis complete!
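
The gap between the two definitions comes down to counting rows versus counting distinct layouts within a group. Here is a minimal pure-Python sketch of that difference; the rows, keys, and form labels are hypothetical, and the comments point to the pandas calls they imitate.

```python
from collections import defaultdict

# Hypothetical rows: (puzzle_id, mirror_invariant_key, mirror_canonical)
rows = [
    ("a", "K1", "left"),    # K1 holds a board and its mirror image
    ("b", "K1", "right"),
    ("c", "K2", "same"),    # K2 holds two exact duplicates of one layout
    ("d", "K2", "same"),
    ("e", "K3", "solo"),    # K3 holds a single board
]

group_size = defaultdict(int)
group_forms = defaultdict(set)
for _, key, canonical in rows:
    group_size[key] += 1
    group_forms[key].add(canonical)

# Original flag: group has 2+ rows          (like transform('size') >= 2)
original_flag = {pid: group_size[key] >= 2 for pid, key, _ in rows}
# Corrected flag: group has 2+ distinct forms (like transform('nunique') >= 2)
distinct_flag = {pid: len(group_forms[key]) >= 2 for pid, key, _ in rows}

for pid, _, _ in rows:
    print(pid, original_flag[pid], distinct_flag[pid])
```

In this toy example the duplicates `c` and `d` are flagged by the original definition but not by the corrected one, which is the same direction of change seen in the comparison above.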



##### Section 3: Analyze Twin ID Mapping and Create Unique Pairs

In [87]:
print("Twin Mapping Analysis")
print('='*25)

# Find puzzles with valid twin mappings
rows_with_twin = df_analysis['chiral_twin_id'].notna() & (df_analysis['chiral_twin_id'] != df_analysis['puzzle_id'])
puzzles_with_mapped_twin = rows_with_twin.sum()

print(f"Puzzles with mapped twins: {puzzles_with_mapped_twin:,}")
print(f"Percentage of dataset: {(puzzles_with_mapped_twin/len(df_analysis)*100):.2f}%")

if puzzles_with_mapped_twin > 0:
    # Create unique bidirectional pairs
    pairs = df_analysis.loc[rows_with_twin, ['puzzle_id', 'chiral_twin_id']].copy()
    pairs['pair_a'] = pairs[['puzzle_id', 'chiral_twin_id']].min(axis=1)
    pairs['pair_b'] = pairs[['puzzle_id', 'chiral_twin_id']].max(axis=1)
    unique_pairs = pairs.drop_duplicates(['pair_a', 'pair_b'])
    
    print(f"Unique twin pairs: {len(unique_pairs):,}")
    print(f"Expected pairs: {puzzles_with_mapped_twin // 2:,} (if perfectly bidirectional)")
    
    # Validate twin references exist in dataset
    all_puzzle_ids = set(df_analysis['puzzle_id'])
    referenced_twins = set(df_analysis['chiral_twin_id'].dropna())
    missing_twins = referenced_twins - all_puzzle_ids
    
    if missing_twins:
        print(f"WARNING: {len(missing_twins)} twin IDs reference non-existent puzzles")
    else:
        print("All twin references are valid")
else:
    unique_pairs = None
    print("No twin pairs found in dataset")

print("Twin mapping analysis complete!\n")

Twin Mapping Analysis
Puzzles with mapped twins: 4,802
Percentage of dataset: 11.58%
Unique twin pairs: 2,658
Expected pairs: 2,401 (if perfectly bidirectional)
All twin references are valid
Twin mapping analysis complete!
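
One plausible reason the unique-pair count exceeds the perfectly bidirectional expectation is the wrap-around indexing in `pick_twin`: when the two mirrored sub-groups differ in size, several puzzles point at the same partner. The sketch below reproduces that rule on a hypothetical group of three ids; `pair_across` is an illustrative helper, not a function from the notebook.

```python
def pair_across(A, B):
    """Pair ids from two mirrored sub-groups using the same wrap-around rule as pick_twin."""
    twin = {}
    for i, a in enumerate(A):
        twin[a] = B[i % len(B)]
    for j, b in enumerate(B):
        twin[b] = A[j % len(A)]
    return twin

# Hypothetical group: one board with the first form, two duplicates with the mirrored form
twin = pair_across(["a0"], ["b0", "b1"])

# Collapse (x, y) and (y, x) into one unordered pair, as the min/max columns do above
unique_pairs = {tuple(sorted((pid, partner))) for pid, partner in twin.items()}

print(twin)                 # both b0 and b1 point at a0
print(len(unique_pairs))    # unordered pairs actually produced
print(len(twin) // 2)       # pairs expected if the mapping were perfectly bidirectional
```

Three mapped puzzles yield two unique pairs here, while the bidirectional estimate gives one. I have not verified that this accounts for the entire difference in the real data.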



##### Section 4: Analyze Solvability Patterns Between Twin Pairs

In [88]:
# Analyze solvability patterns between twin pairs
print("Twin Solvability Analysis")
print('='*25)

if unique_pairs is not None and len(unique_pairs) > 0:
    # Prepare puzzle statistics
    puzzle_stats = (df_analysis.drop_duplicates('puzzle_id')
                   .set_index('puzzle_id')[['is_solvable', 'solution_length', 'solve_time_seconds']])
    
    # Merge twin pairs with puzzle statistics
    twin_analysis = (unique_pairs
                    .merge(puzzle_stats, left_on='pair_a', right_index=True, how='left')
                    .merge(puzzle_stats, left_on='pair_b', right_index=True, 
                           how='left', suffixes=('_a', '_b')))
    
    # Check merge success
    failed_merges = twin_analysis.isnull().any(axis=1).sum()
    if failed_merges > 0:
        print(f"Warning: {failed_merges} pairs couldn't be matched with puzzle stats")
    
    # Analyze solvability patterns
    valid_analysis = twin_analysis.dropna()
    
    if len(valid_analysis) > 0:
        both_solvable = valid_analysis['is_solvable_a'] & valid_analysis['is_solvable_b']
        both_unsolvable = ~valid_analysis['is_solvable_a'] & ~valid_analysis['is_solvable_b']
        mixed_solvability = ~(both_solvable | both_unsolvable)
        
        print("Twin Pair Solvability Patterns:")
        print(f"   Both solvable: {both_solvable.sum():,} pairs ({both_solvable.mean()*100:.1f}%)")
        print(f"   Both unsolvable: {both_unsolvable.sum():,} pairs ({both_unsolvable.mean()*100:.1f}%)")
        print(f"   Mixed solvability: {mixed_solvability.sum():,} pairs ({mixed_solvability.mean()*100:.1f}%)")
        
        # Store for next analysis
        solvability_results = {
            'both_solvable': both_solvable.sum(),
            'both_unsolvable': both_unsolvable.sum(),
            'mixed': mixed_solvability.sum(),
            'solvable_twins': valid_analysis[both_solvable] if both_solvable.sum() > 0 else None
        }
    else:
        solvability_results = None
        print("No valid twin pairs for solvability analysis")
else:
    twin_analysis = None
    solvability_results = None
    print("No twin pairs available for solvability analysis")

print("Solvability analysis complete!\n")

Twin Solvability Analysis
Twin Pair Solvability Patterns:
   Both solvable: 551 pairs (20.7%)
   Both unsolvable: 2,105 pairs (79.2%)
   Mixed solvability: 2 pairs (0.1%)
Solvability analysis complete!



##### Section 5: Difficulty Comparison for Solvable Twin Pairs

In [89]:
print("Difficulty Comparison Analysis")
print('='*30)

if solvability_results and solvability_results['solvable_twins'] is not None:
    solvable_twins = solvability_results['solvable_twins']
    
    # Calculate difficulty differences
    length_diff = abs(solvable_twins['solution_length_a'] - solvable_twins['solution_length_b'])
    time_diff = abs(solvable_twins['solve_time_seconds_a'] - solvable_twins['solve_time_seconds_b'])
    
    print("Difficulty Comparison (Solvable Pairs Only):")
    print(f"   Analyzed pairs: {len(solvable_twins):,}")
    print(f"   Average solution length difference: {length_diff.mean():.2f} moves")
    print(f"   Average solve time difference: {time_diff.mean():.3f} seconds")
    print(f"   Identical solution lengths: {(length_diff == 0).sum():,} pairs")
    
    # Perfect symmetry analysis
    perfect_symmetry = (length_diff == 0) & (time_diff < 0.001)
    print(f"   Perfect computational symmetry: {perfect_symmetry.sum():,} pairs")
    
    difficulty_results = {
        'avg_length_diff': length_diff.mean(),
        'avg_time_diff': time_diff.mean(),
        'identical_lengths': (length_diff == 0).sum(),
        'perfect_symmetry': perfect_symmetry.sum()
    }
else:
    difficulty_results = None
    print("No solvable twin pairs available for difficulty comparison")

print("Difficulty comparison complete!\n")

Difficulty Comparison Analysis
Difficulty Comparison (Solvable Pairs Only):
   Analyzed pairs: 551
   Average solution length difference: 0.00 moves
   Average solve time difference: 0.490 seconds
   Identical solution lengths: 551 pairs
   Perfect computational symmetry: 11 pairs
Difficulty comparison complete!



##### Section 6: Mirror Group Distribution and Self-Mirror Analysis

In [90]:
print("Mirror Group Analysis")
print('='*22)

# Self-mirror puzzle analysis
self_mirror_count = df_analysis['is_self_mirror'].sum()
print("Self-Mirror Analysis:")
print(f"   Self-symmetric puzzles: {self_mirror_count:,} ({self_mirror_count/len(df_analysis)*100:.2f}%)")

# Mirror group size distribution (by groups, not puzzles)
group_sizes = df_analysis.groupby('mirror_invariant_key').size()
group_distribution = group_sizes.value_counts().sort_index()

print("\nMirror Group Distribution (by number of groups):")
total_groups = len(group_sizes)

for size, num_groups in group_distribution.items():
    puzzles_in_groups = size * num_groups
    print(f"   {num_groups:,} groups of size {size} " +
          f"({num_groups/total_groups*100:.1f}% of groups, {puzzles_in_groups:,} puzzles)")

print(f"\nGroup Summary:")
print(f"   Total mirror groups: {total_groups:,}")
print(f"   Average group size: {group_sizes.mean():.2f}")

mirror_results = {
    'self_mirror_count': self_mirror_count,
    'total_groups': total_groups,
    'group_distribution': group_distribution,
    'avg_group_size': group_sizes.mean()
}
print("Mirror group analysis complete!\n")

Mirror Group Analysis
Self-Mirror Analysis:
   Self-symmetric puzzles: 145 (0.35%)

Mirror Group Distribution (by number of groups):
   33,095 groups of size 1 (89.7% of groups, 33,095 puzzles)
   3,176 groups of size 2 (8.6% of groups, 6,352 puzzles)
   457 groups of size 3 (1.2% of groups, 1,371 puzzles)
   114 groups of size 4 (0.3% of groups, 456 puzzles)
   29 groups of size 5 (0.1% of groups, 145 puzzles)
   8 groups of size 6 (0.0% of groups, 48 puzzles)
   1 groups of size 7 (0.0% of groups, 7 puzzles)

Group Summary:
   Total mirror groups: 36,880
   Average group size: 1.12
Mirror group analysis complete!
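
The printed distribution can be cross-checked against the totals reported elsewhere in this section. A short worked check, using only the numbers printed above:

```python
# {group size: number of groups}, as printed in the distribution above
distribution = {1: 33095, 2: 3176, 3: 457, 4: 114, 5: 29, 6: 8, 7: 1}

total_groups = sum(distribution.values())
total_puzzles = sum(size * n for size, n in distribution.items())
# Puzzles that share a mirror-invariant key with at least one other puzzle
in_multi_groups = sum(size * n for size, n in distribution.items() if size >= 2)
avg_group_size = total_puzzles / total_groups

print(total_groups)              # 36880
print(total_puzzles)             # 41474
print(in_multi_groups)           # 8379
print(round(avg_group_size, 2))  # 1.12
```

The 8,379 puzzles in groups of size two or more match the count reported by the original `has_chiral_twin` flag, which confirms that flag is purely a group-size test.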



##### Section 7: Feature Quality and Compression Assessment

In [91]:

print("Feature Quality Assessment")
print('='*27)

# Calculate uniqueness metrics
unique_mirror_keys = df_analysis['mirror_invariant_key'].nunique()
unique_canonical = df_analysis['mirror_canonical'].nunique()
compression_ratio = len(df_analysis) / unique_mirror_keys

print("Feature Quality Metrics:")
print(f"   Total puzzles: {len(df_analysis):,}")
print(f"   Unique mirror invariant keys: {unique_mirror_keys:,}")
print(f"   Unique canonical forms: {unique_canonical:,}")
print(f"   Compression ratio: {compression_ratio:.2f}x")

# Assess feature information content
if compression_ratio > 1.05:
    print("   Assessment: Good symmetry compression - useful for duplicate detection")
elif compression_ratio > 1.01:
    print("   Assessment: Moderate symmetry compression")
else:
    print("   Assessment: Minimal symmetry compression")

feature_results = {
    'unique_mirror_keys': unique_mirror_keys,
    'unique_canonical': unique_canonical,
    'compression_ratio': compression_ratio
}
print("Feature quality assessment complete!\n")

Feature Quality Assessment
Feature Quality Metrics:
   Total puzzles: 41,474
   Unique mirror invariant keys: 36,880
   Unique canonical forms: 38,960
   Compression ratio: 1.12x
   Assessment: Good symmetry compression - useful for duplicate detection
Feature quality assessment complete!
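
The three counts above also separate two different kinds of redundancy. Every canonical form belongs to exactly one mirror-invariant key, and each key covers at most two forms (a layout and its mirror), so the surplus of forms over keys equals the number of keys that contain a genuine mirrored pair. A worked check using the printed figures:

```python
total_puzzles = 41474
unique_canonical = 38960
unique_mirror_keys = 36880

# Rows whose exact layout already appears elsewhere in the dataset
exact_duplicate_rows = total_puzzles - unique_canonical

# Keys that cover two distinct layouts, i.e. a board plus its mirror image
keys_with_mirror_pair = unique_canonical - unique_mirror_keys

compression_ratio = total_puzzles / unique_mirror_keys

print(exact_duplicate_rows)         # 2514
print(keys_with_mirror_pair)        # 2080
print(round(compression_ratio, 2))  # 1.12
```

So the 1.12x compression mixes exact repeats with true mirror symmetry, which is worth keeping in mind when reading it as a symmetry measure.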



##### Section 8: Generate EDA Strategy Recommendations

In [92]:
print("EDA Strategy Recommendations")
print('='*30)

print("RECOMMENDATIONS FOR EDA:")

# Symmetry feature richness assessment
corrected_count = chiral_results['corrected_count']
if corrected_count > 1000:
    print("   SYMMETRY: Rich symmetry features detected")
    print("   - Include twin difficulty comparison visualizations")
    print("   - Create symmetry-based puzzle clustering analysis")
    rich_symmetry = True
elif corrected_count > 100:
    print("   SYMMETRY: Moderate symmetry features")
    print("   - Include symmetry analysis as secondary EDA component")
    rich_symmetry = False
else:
    print("   SYMMETRY: Limited symmetry features")
    print("   - Focus on primary complexity metrics instead")
    rich_symmetry = False

# Self-mirror analysis recommendation
if mirror_results['self_mirror_count'] > 100:
    print("   SELF-MIRROR: Significant self-symmetric puzzles found")
    print("   - Analyze self-symmetric puzzles as distinct complexity class")
    include_self_mirror = True
else:
    print("   SELF-MIRROR: Few self-symmetric puzzles")
    include_self_mirror = False

# Definition recommendation
original_count = chiral_results['original_count']
if abs(corrected_count - original_count) > len(df_analysis) * 0.01:
    print("   DEFINITION: Use corrected 'has_chiral_twin_distinct' definition")
    use_corrected = True
else:
    print("   DEFINITION: Original 'has_chiral_twin' definition is accurate")
    use_corrected = False

# Compression utility
if feature_results['compression_ratio'] > 1.05:
    print("   COMPRESSION: Symmetry features useful for duplicate detection")

# Final recommendations summary
final_recommendations = {
    'use_corrected_definition': use_corrected,
    'rich_symmetry_features': rich_symmetry,
    'include_self_mirror_analysis': include_self_mirror,
    'suitable_for_duplicate_detection': feature_results['compression_ratio'] > 1.05
}

print(f"\nAnalysis complete! Found {corrected_count:,} puzzles with chiral twins.")
print("Use these recommendations to guide your EDA visualizations.")

EDA Strategy Recommendations
RECOMMENDATIONS FOR EDA:
   SYMMETRY: Rich symmetry features detected
   - Include twin difficulty comparison visualizations
   - Create symmetry-based puzzle clustering analysis
   SELF-MIRROR: Significant self-symmetric puzzles found
   - Analyze self-symmetric puzzles as distinct complexity class
   DEFINITION: Use corrected 'has_chiral_twin_distinct' definition
   COMPRESSION: Symmetry features useful for duplicate detection

Analysis complete! Found 4,802 puzzles with chiral twins.
Use these recommendations to guide your EDA visualizations.


##### Final State of the DataFrame

In [93]:
# Preview the enhanced DataFrame, including the new has_chiral_twin_distinct column
display(df_analysis.head(2))
# Inspect column dtypes and non-null counts
df_analysis.info()

Unnamed: 0,puzzle_id,is_solvable,total_blocks,block_count_1x1,block_count_1x2,block_count_2x1,block_count_2x2,goal_initial_row,goal_initial_col,goal_distance_to_target,goal_manhattan_distance,blocks_between_goal_target,adjacent_1x1_count,adjacent_1x2_count,adjacent_2x1_count,wall_adjacent_sides,empty_spaces_count,empty_space,solution_length,solve_time_seconds,initial_block_states,board_visual,legal_board,mobility_index,blocker_score,clearance_cost,board_density,parsed_blocks,blocks_objects,mirror_canonical,mirror_invariant_key,is_self_mirror,mirror_group_size,has_chiral_twin,chiral_twin_id,has_chiral_twin_distinct
0,puzzle_000000,True,11,6,2,2,1,1,2,2.236068,3,6,5,1,0,1,2,"[(3, 2), (4, 2)]",17,2.848788,"[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]",┌───┬───────┬───┐\n│ │ │ │\n│ ├───┬───┴───┤\n│ │ │ │\n├───┼───┤ │\n│ │ │ │\n│ ├───┼───┬───┤\n│ │ │ X │ │\n├───┴───┤ ├───┤\n│ │ X │ │\n└───────┴───┴───┘,True,0.181818,9.0,13.416408,0.55,"[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]","[(2, 2, 1, 2), (1, 2, 0, 1), (1, 1, 3, 3), (1, 1, 1, 1), (1, 2, 4, 0), (1, 1, 4, 3), (2, 1, 2, 0), (1, 1, 0, 3), (1, 1, 3, 1), (2, 1, 0, 0), (1, 1, 2, 1)]","((0, 0, 2, 1), (0, 1, 1, 2), (0, 3, 1, 1), (1, 1, 1, 1), (1, 2, 2, 2), (2, 0, 2, 1), (2, 1, 1, 1), (3, 1, 1, 1), (3, 3, 1, 1), (4, 0, 1, 2), (4, 3, 1, 1))","((0, 0, 1, 1), (0, 1, 1, 2), (0, 3, 2, 1), (1, 0, 2, 2), (1, 2, 1, 1), (2, 2, 1, 1), (2, 3, 2, 1), (3, 0, 1, 1), (3, 2, 1, 1), (4, 0, 1, 1), (4, 2, 1, 2))",False,1,0,,False
1,puzzle_000001,False,9,2,1,5,1,1,1,2.0,2,2,2,1,5,0,2,"[(0, 3), (4, 0)]",0,0.014643,"[(2, 2, 1, 1), (2, 1, 1, 3), (1, 1, 3, 0), (2, 1, 3, 3), (1, 2, 0, 1), (2, 1, 0, 0), (2, 1, 3, 1), (2, 1, 3, 2), (1, 1, 2, 0)]",┌───┬───────┬───┐\n│ │ │ X │\n│ ├───────┼───┤\n│ │ │ │\n├───┤ │ │\n│ │ │ │\n├───┼───┬───┼───┤\n│ │ │ │ │\n├───┤ │ │ │\n│ X │ │ │ │\n└───┴───┴───┴───┘,True,0.222222,2.0,4.0,0.45,"[(2, 2, 1, 1), (2, 1, 1, 3), (1, 1, 3, 0), (2, 1, 3, 3), (1, 2, 0, 1), (2, 1, 0, 0), (2, 1, 3, 1), (2, 1, 3, 2), (1, 1, 2, 0)]","[(2, 2, 1, 1), (2, 1, 1, 3), (1, 1, 3, 0), (2, 1, 3, 3), (1, 2, 0, 1), (2, 1, 0, 0), (2, 1, 3, 1), (2, 1, 3, 2), (1, 1, 2, 0)]","((0, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2), (1, 3, 2, 1), (2, 0, 1, 1), (3, 0, 1, 1), (3, 1, 2, 1), (3, 2, 2, 1), (3, 3, 2, 1))","((0, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2), (1, 3, 2, 1), (2, 0, 1, 1), (3, 0, 1, 1), (3, 1, 2, 1), (3, 2, 2, 1), (3, 3, 2, 1))",False,1,0,,False


<class 'pandas.core.frame.DataFrame'>
Index: 41474 entries, 0 to 49999
Data columns (total 36 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   puzzle_id                   41474 non-null  object 
 1   is_solvable                 41474 non-null  bool   
 2   total_blocks                41474 non-null  int64  
 3   block_count_1x1             41474 non-null  int64  
 4   block_count_1x2             41474 non-null  int64  
 5   block_count_2x1             41474 non-null  int64  
 6   block_count_2x2             41474 non-null  int64  
 7   goal_initial_row            41474 non-null  int64  
 8   goal_initial_col            41474 non-null  int64  
 9   goal_distance_to_target     41474 non-null  float64
 10  goal_manhattan_distance     41474 non-null  int64  
 11  blocks_between_goal_target  41474 non-null  int64  
 12  adjacent_1x1_count          41474 non-null  int64  
 13  adjacent_1x2_count          41474 no

# 🔍📊📈 Comprehensive Exploratory Data Analysis

### Visualization 1: Dataset Overview - Board Legality Distribution

In [None]:
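import matplotlib.font_manager as fm
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Add the fonts to the font manager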
font_path_1 = r"D:\klotski-project\fonts\AuxMono-Regular.ttf"  # Path to AuxMono font

# Register the fonts
fm.fontManager.addfont(font_path_1)

# Get the font names
font_name_1 = fm.FontProperties(fname=font_path_1).get_name()

# Calculate distribution
legal_counts = original_df['legal_board'].value_counts()
total_count = len(original_df)

# Create subplot with pie and bar
fig = make_subplots(
    rows=1, cols=2,
    subplot_titles=('Board Legality Distribution', 'Dataset Composition'),
    specs=[[{"type": "pie"}, {"type": "bar"}]],
    horizontal_spacing=0.15
)

# Pie chart
fig.add_trace(
    go.Pie(
        labels=['Legal', 'Illegal'],
        values=[legal_counts[True], legal_counts[False]],
        marker_colors=['#2E86AB', '#F24236'],
        textinfo='label+percent+value',
        hoverinfo='label+percent+value',
        textfont_size=12,
        insidetextorientation='horizontal',
        hole=0.3,
        pull=[0.1, 0],
        hoverlabel=dict(
            font=dict(
                family="Courier New",  # Hover text font (Courier New)
                size=14,               # Font size
                color="black"          # Font color
            )
        )
    ), row=1, col=1
)

# Bar chart with breakdown of legal boards
legal_boards = original_df[original_df['legal_board'] == True]
solvable_count = legal_boards['is_solvable'].sum()
unsolvable_count = len(legal_boards) - solvable_count
illegal_count = legal_counts[False]

categories = ['Solvable', 'Unsolvable', 'Illegal']
counts = [solvable_count, unsolvable_count, illegal_count]
colors = ['#A23B72', '#F18F01', '#F24236']

fig.add_trace(
    go.Bar(
        x=categories,
        y=counts,
        marker_color=colors,
        text=[f'{count:,}<br>({count/total_count*100:.1f}%)' for count in counts],
        textposition='auto',
        textfont_size=11,
        hoverinfo='x+y+text',
        hoverlabel=dict(
            font=dict(family="Courier New", size=14,color="black")
        )
    ), row=1, col=2
)

# Update layout for a clean, professional look
fig.update_layout(
    title={
        'text': f'Klotski Dataset Overview: {total_count:,} Puzzle Boards',
        'x': 0.5,
        'xanchor': 'center',
        'font': {
            'family': 'Courier New',  # Heading font (Courier New)
            'size': 30,
            'color': '#242424',
            'weight': 'bold'
        }
    },
    showlegend=False,
    height=500,
    font=dict(
        family="Courier New",  # Default font for the rest of the plot
        size=15,
        color="black"
    ),
    plot_bgcolor='#F6EEE3',
    paper_bgcolor='#E7E0D3',
    margin=dict(t=50, b=50, l=50, r=50),
    xaxis=dict(showgrid=True, gridcolor='lightgray'),
    yaxis=dict(showgrid=True, gridcolor='lightgray'),
)

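# Update pie chart and bar chart text annotations (AuxMono font).
# Plotly resolves font families through the browser or system, so AuxMono must
# also be installed there; registering it with matplotlib only supplies the name.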
fig.update_traces(
    textfont=dict(
        family=font_name_1,  # Data text font (AuxMono)
        size=12
    ),
    selector=dict(type='pie')
)

fig.update_traces(
    textfont=dict(
        family=font_name_1,  # Data text font (AuxMono)
        size=11
    ),
    selector=dict(type='bar')
)

# Adjust axes titles
fig.update_xaxes(title_text="Board Categories", row=1, col=2)
fig.update_yaxes(title_text="Number of Puzzles", row=1, col=2)

# Show the plot
fig.show()

# Print summary statistics
print(f"Total puzzles: {total_count:,}")
print(f"Legal boards: {legal_counts[True]:,} ({legal_counts[True]/total_count*100:.1f}%)")
print(f"  - Solvable: {solvable_count:,} ({solvable_count/total_count*100:.1f}%)")
print(f"  - Unsolvable: {unsolvable_count:,} ({unsolvable_count/total_count*100:.1f}%)")
print(f"Illegal boards: {illegal_count:,} ({illegal_count/total_count*100:.1f}%)")


Total puzzles: 50,000
Legal boards: 41,474 (82.9%)
  - Solvable: 19,866 (39.7%)
  - Unsolvable: 21,608 (43.2%)
Illegal boards: 8,526 (17.1%)
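
The percentages above all use the full 50,000 boards as the denominator, so the 39.7% solvable figure is a share of every generated board, not of the legal ones. A minimal sketch of that arithmetic, using only the counts printed above (the helper name `share` is hypothetical and not part of the notebook):

```python
# Counts copied from the summary printed above
total = 50_000
legal, illegal = 41_474, 8_526
solvable, unsolvable = 19_866, 21_608

def share(count, whole):
    """Return count as a percentage of whole, rounded to one decimal place."""
    return round(count / whole * 100, 1)

# The categories partition the dataset, so the counts must add up
assert legal + illegal == total
assert solvable + unsolvable == legal

print("Solvable, share of all boards:  ", share(solvable, total))
print("Solvable, share of legal boards:", share(solvable, legal))
```

Reporting both denominators side by side makes it harder to misread the bar chart, where every label is relative to the whole dataset.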


### Visualization 2: Solution Length Distribution with Complexity Tiers

In [197]:
# Create complexity tiers based on solution length
def assign_complexity_tier(length):
    if length <= 10:
        return 'Easy (≤10 moves)'
    elif length <= 25:
        return 'Medium (11-25 moves)'
    elif length <= 49:  # Adjust the range for Hard
        return 'Hard (26-49 moves)'
    else:
        return 'Expert (50+ moves)'

# Subset solvable data
solvable_data = df_analysis[df_analysis['is_solvable'] == True].copy()
solvable_data['complexity_tier'] = solvable_data['solution_length'].apply(assign_complexity_tier)

# Create subplot with histogram and box plot
fig = make_subplots(
    rows=2, cols=1,
    subplot_titles=(
        f'Solution Length Distribution (n={len(solvable_data):,} solvable puzzles)',
        'Complexity Tiers Box Plot Analysis'
    ),
    vertical_spacing=0.2,  # Adjusted vertical spacing
    row_heights=[0.5, 0.5]  # Adjust row height ratio
)

# Define complexity tier colors
tier_colors = {
    'Easy (≤10 moves)': '#27AE60',
    'Medium (11-25 moves)': "#F3CE12", 
    'Hard (26-49 moves)': '#E67E22',
    'Expert (50+ moves)': '#E74C3C'
}

# Histogram with complexity tier coloring
for tier in ['Easy (≤10 moves)', 'Medium (11-25 moves)', 'Hard (26-49 moves)', 'Expert (50+ moves)']:
    tier_data = solvable_data[solvable_data['complexity_tier'] == tier]['solution_length']
    
    # Cap the bin count per tier (nbinsx is a maximum number of bins, not a bin width)
    if tier == 'Easy (≤10 moves)':
        nbinsx = 11  # Narrow range, so few bins are needed
    elif tier == 'Medium (11-25 moves)':
        nbinsx = 15
    elif tier == 'Hard (26-49 moves)':
        nbinsx = 25
    else:  # Expert
        nbinsx = 100

    fig.add_trace(
        go.Histogram(
            x=tier_data,
            name=tier,
            marker_color=tier_colors[tier],
            opacity=0.7,
            nbinsx=nbinsx,  # Per-tier cap on the number of bins
            hovertemplate=(
                '%{x} steps<br>'  # Show the number of steps
                '%{y} puzzles'  # Show the number of puzzles in that bin
                '<extra></extra>'
            )
        ), row=1, col=1
    )

# Box plot for tiers
for tier in tier_colors.keys():
    tier_data = solvable_data[solvable_data['complexity_tier'] == tier]['solution_length']
    fig.add_trace(
        go.Box(
            y=tier_data,
            name=tier,
            marker_color=tier_colors[tier],
            boxpoints='outliers',
            showlegend=False,
        ), row=2, col=1
    )

# Update layout for a clean, professional look
fig.update_layout(
    title={
        'text': 'Klotski Puzzle Difficulty Analysis: Solution Length Distribution',
        'x': 0.5,
        'xanchor': 'center',
        'font': {
            'family': 'Courier New',  # Heading font (Courier New)
            'size': 30,
            'color': '#242424',
            'weight': 'bold'
        }
    },
    showlegend=True,
    height=850,  
    font=dict(family="Courier New", size=12, color="black"),
    plot_bgcolor='#F6EEE3',
    paper_bgcolor='#E7E0D3',
    xaxis=dict(showgrid=True, gridcolor='lightgray'),
    yaxis=dict(showgrid=True, gridcolor='lightgray'),
    margin=dict(t=50, b=50, l=50, r=50),
    barmode='overlay'
)

# Update axes titles
fig.update_xaxes(title_text="Solution Length (moves)", row=1, col=1)
fig.update_yaxes(title_text="Number of Puzzles", row=1, col=1)
fig.update_xaxes(title_text="Complexity Tier", row=2, col=1)
fig.update_yaxes(title_text="Solution Length (moves)", row=2, col=1)

# Apply custom font to axis titles and legend in layout
fig.update_layout(
    xaxis=dict(title_font=dict(family="Courier New", size=14, color="black")),
    yaxis=dict(title_font=dict(family="Courier New", size=14, color="black")),
    legend=dict(font=dict(family="Courier New", size=12, color="black"))
)


# Print the section header, then annotate each box with its tier mean and standard deviation
print("COMPLEXITY TIER ANALYSIS:")
tier_stats = solvable_data.groupby('complexity_tier')['solution_length'].agg(['count', 'mean', 'median', 'std']).round(2)
for i, tier in enumerate(tier_colors.keys()):
    if tier in tier_stats.index:
        stats = tier_stats.loc[tier]
        fig.add_annotation(
            x=i,  # corresponds to the box plot order
            y=stats['mean'],  # display mean as reference
            text=f"μ={stats['mean']:.1f}, σ={stats['std']:.1f}",
            showarrow=False,
            yshift=50,
            font=dict(family="Courier New", size=13, color="black"),
            xref='x2', yref='y2',  # references to the second subplot
        )


# Show the plot
fig.show()
 

COMPLEXITY TIER ANALYSIS:
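
The tier boundaries in `assign_complexity_tier` can also be written as a sorted list of inclusive upper bounds, which keeps the cut-offs in one place. This is a sketch of an equivalent lookup using the standard-library `bisect` module; `TIER_BOUNDS`, `TIER_LABELS`, and `tier_for` are hypothetical names, and the labels are copied from the cell above:

```python
from bisect import bisect_left

# Inclusive upper bounds of the first three tiers, in ascending order
TIER_BOUNDS = [10, 25, 49]
TIER_LABELS = [
    'Easy (≤10 moves)',
    'Medium (11-25 moves)',
    'Hard (26-49 moves)',
    'Expert (50+ moves)',
]

def tier_for(length):
    """Map a solution length to its tier label with a binary search."""
    # bisect_left returns the index of the first bound >= length,
    # or len(TIER_BOUNDS) when length exceeds every bound (Expert).
    return TIER_LABELS[bisect_left(TIER_BOUNDS, length)]

# Check the values on either side of each boundary
for moves in (10, 11, 25, 26, 49, 50, 81):
    print(moves, '->', tier_for(moves))
```

Changing a tier then means editing one number in `TIER_BOUNDS`, although the labels still have to be updated by hand to match.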


### Visualization 3: Feature Correlation Heatmap - Feature Engineering Depth

In [200]:
# Select engineered features for correlation analysis
feature_columns = [
    # Basic puzzle metadata
    'solution_length', 'solve_time_seconds',
    
    # Spatial features  
    'density', 'empty_adjacent_count', 'wall_adjacent_blocks', 'corner_positions',
    
    # Piece composition
    'num_1x1_blocks', 'num_1x2_blocks', 'num_2x1_blocks', 'num_2x2_blocks',
    
    # Goal analysis
    'goal_distance_to_target', 'goal_manhattan_distance', 'blocks_between_goal_target',
    'goal_wall_adjacent', 'goal_corner_position',
    
    # Pattern analysis
    'adjacent_blocks', 'max_block_distance', 'block_spread_x', 'block_spread_y',
    
    # Symmetry features
    'is_self_mirror', 'has_chiral_twin_distinct',
    
    # Advanced spatial
    'spatial_entropy', 'clustering_coefficient', 'path_complexity_estimate',
    'empty_space_connectivity', 'block_density_variance'
]

# Keep only the names that exist in df_analysis (any others are silently dropped),
# and restrict to solvable puzzles for meaningful correlations
available_features = [col for col in feature_columns if col in df_analysis.columns]
correlation_data = df_analysis[df_analysis['is_solvable'] == True][available_features].copy()

# Convert boolean columns to numeric for correlation
bool_cols = correlation_data.select_dtypes(include=['bool']).columns
correlation_data[bool_cols] = correlation_data[bool_cols].astype(int)

# Calculate correlation matrix
corr_matrix = correlation_data.corr()

# Create mask for upper triangle to avoid duplication
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
corr_matrix_masked = corr_matrix.mask(mask)

# Create heatmap
fig = go.Figure(data=go.Heatmap(
    z=corr_matrix_masked.values,
    x=corr_matrix.columns,
    y=corr_matrix.columns,
    colorscale=[
        [0, '#27AE60'],     # Strong negative (green)
        [0.25, '#A3D78D'],   # Weak negative (light green)
        [0.5, '#F6EEE3'],    # No correlation (off-white)
        [0.75, '#F8C471'],   # Weak positive (light orange)
        [1, '#E67E22']       # Strong positive (orange)
    ],
    zmid=0,
    zmin=-1,
    zmax=1,
    text=np.round(corr_matrix_masked.values, 2),
    texttemplate="%{text}",
    textfont={"size": 8},
    hoverongaps=False,
    hovertemplate='%{y} vs %{x}<br>Correlation: %{z:.3f}<extra></extra>'
))

# Update layout for a clean, professional look
fig.update_layout(
    title={
        'text': f'Feature Correlation Matrix: {len(available_features)} Engineered Features',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 30, 'color': '#242424', 'family': 'Courier New', 'weight': 'bold'}
    },
    width=900,
    height=900,
    xaxis={'side': 'bottom'},
    yaxis={'autorange': 'reversed'},
    font=dict(family="Courier New", size=12, color="black"),
    plot_bgcolor='#F6EEE3',  # Similar background as the complexity graph
    paper_bgcolor='#E7E0D3',  # Similar background as the complexity graph
    margin=dict(t=50, b=50, l=50, r=50),
    hovermode='closest'
)

# Update axis titles
fig.update_xaxes(title_font=dict(family="Courier New", size=14, color="black"))
fig.update_yaxes(title_font=dict(family="Courier New", size=14, color="black"))
fig.update_layout(
    xaxis=dict(title_text="Features", tickangle=45),
    yaxis=dict(title_text="Features")
)

# Show the plot
fig.show()

# Print high correlation insights
print("HIGH CORRELATION INSIGHTS:")
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        corr_val = corr_matrix.iloc[i, j]
        if abs(corr_val) > 0.7:  # Strong correlation threshold
            high_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_val))

high_corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
for feat1, feat2, corr in high_corr_pairs[:10]:  # Top 10
    print(f"{feat1} ↔ {feat2}: {corr:.3f}")

HIGH CORRELATION INSIGHTS:
goal_distance_to_target ↔ goal_manhattan_distance: 0.959
goal_manhattan_distance ↔ blocks_between_goal_target: 0.931
goal_distance_to_target ↔ blocks_between_goal_target: 0.859
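
Because the cell filters `feature_columns` down to names present in `df_analysis`, a misspelled or renamed feature disappears from the heatmap without any warning. The sketch below reports what was skipped; `split_by_availability` is a hypothetical helper, and the sample names are taken from the feature list above and from the DataFrame view shown earlier (for example, that view has `board_density` and `block_count_1x1`, not `density` and `num_1x1_blocks`):

```python
def split_by_availability(requested, existing):
    """Split requested column names into those present and those missing."""
    existing = set(existing)
    present = [name for name in requested if name in existing]
    missing = [name for name in requested if name not in existing]
    return present, missing

# A few requested names and a few actual column names, for illustration only
requested = ['solution_length', 'density', 'num_1x1_blocks', 'goal_manhattan_distance']
existing = ['solution_length', 'board_density', 'block_count_1x1', 'goal_manhattan_distance']

present, missing = split_by_availability(requested, existing)
print("used:   ", present)
print("skipped:", missing)
```

In the notebook this would be called with `feature_columns` and `df_analysis.columns`, so the title's feature count can be read alongside the list of names that never made it into the matrix.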


### Visualization 4: Chiral Twin Analysis (Work in Progress)

In [201]:
# Prepare chiral twin data
chiral_data = df_analysis[df_analysis['has_chiral_twin_distinct'] == True].copy()
non_chiral_data = df_analysis[df_analysis['has_chiral_twin_distinct'] == False].copy()

# Create comprehensive chiral analysis subplot
fig = make_subplots(
    rows=2, cols=2,
    subplot_titles=(
        'Chiral Twin vs Regular Puzzle Solvability',
        'Solution Length: Chiral vs Non-Chiral',
        'Solve Time Comparison',
        'Difficulty Distribution by Chiral Status'
    ),
    specs=[[{"type": "bar"}, {"type": "box"}],
           [{"type": "violin"}, {"type": "histogram"}]],
    vertical_spacing=0.12,
    horizontal_spacing=0.1
)

# 1. Solvability comparison
chiral_solvable = chiral_data['is_solvable'].mean()
non_chiral_solvable = non_chiral_data['is_solvable'].mean()

fig.add_trace(
    go.Bar(
        x=['Chiral Twin Puzzles', 'Regular Puzzles'],
        y=[chiral_solvable * 100, non_chiral_solvable * 100],
        marker_color=['#9B59B6', '#34495E'],
        text=[f'{chiral_solvable*100:.1f}%', f'{non_chiral_solvable*100:.1f}%'],
        textposition='auto',
        name='Solvability Rate'
    ), row=1, col=1
)

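# 2. Solution length box plot (solvable only)
chiral_solvable_data = chiral_data[chiral_data['is_solvable'] == True].copy()
non_chiral_solvable_data = non_chiral_data[non_chiral_data['is_solvable'] == True].copy()

# complexity_tier was only added to solvable_data (Visualization 2), so derive it here for panel 4
for subset in (chiral_solvable_data, non_chiral_solvable_data):
    subset['complexity_tier'] = subset['solution_length'].apply(assign_complexity_tier)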

fig.add_trace(
    go.Box(
        y=chiral_solvable_data['solution_length'],
        name='Chiral Twin',
        marker_color='#9B59B6',
        boxpoints='outliers'
    ), row=1, col=2
)

fig.add_trace(
    go.Box(
        y=non_chiral_solvable_data['solution_length'],
        name='Regular',
        marker_color='#34495E',
        boxpoints='outliers'
    ), row=1, col=2
)

# 3. Solve time violin plot (solvable only)
fig.add_trace(
    go.Violin(
        y=chiral_solvable_data['solve_time_seconds'],
        name='Chiral Twin',
        fillcolor='rgba(155, 89, 182, 0.5)',
        line_color='#9B59B6',
        side='negative'
    ), row=2, col=1
)

fig.add_trace(
    go.Violin(
        y=non_chiral_solvable_data['solve_time_seconds'],
        name='Regular',
        fillcolor='rgba(52, 73, 94, 0.5)',
        line_color='#34495E',
        side='positive'
    ), row=2, col=1
)

# 4. Difficulty tier counts (grouped bars)
if 'complexity_tier' in chiral_solvable_data.columns:
    chiral_tiers = chiral_solvable_data['complexity_tier'].value_counts()
    non_chiral_tiers = non_chiral_solvable_data['complexity_tier'].value_counts()
    
    # Keep tiers in difficulty order (a plain set() has no stable order)
    tiers = [tier for tier in tier_colors
             if tier in chiral_tiers.index or tier in non_chiral_tiers.index]
    chiral_counts = [chiral_tiers.get(tier, 0) for tier in tiers]
    non_chiral_counts = [non_chiral_tiers.get(tier, 0) for tier in tiers]
    
    fig.add_trace(
        go.Bar(
            x=tiers,
            y=chiral_counts,
            name='Chiral Twin',
            marker_color='#9B59B6',
            opacity=0.7
        ), row=2, col=2
    )
    
    fig.add_trace(
        go.Bar(
            x=tiers,
            y=non_chiral_counts,
            name='Regular',
            marker_color='#34495E',
            opacity=0.7
        ), row=2, col=2
    )

# Update layout
fig.update_layout(
    title={
        'text': f'Chiral Twin Analysis: {len(chiral_data):,} Twin vs {len(non_chiral_data):,} Regular Puzzles',
        'x': 0.5,
        'xanchor': 'center',
        'font': {'size': 16, 'color': 'darkblue', 'family': 'Courier New'}
    },
    height=850,
    showlegend=False,
    font=dict(family="Courier New", size=12, color="black"),
    plot_bgcolor='#F6EEE3',
    paper_bgcolor='#E7E0D3',
    margin=dict(t=50, b=50, l=50, r=50)
)

# Update axes
fig.update_yaxes(title_text="Solvability Rate (%)", row=1, col=1)
fig.update_yaxes(title_text="Solution Length", row=1, col=2)
fig.update_yaxes(title_text="Solve Time (seconds)", row=2, col=1)
fig.update_yaxes(title_text="Number of Puzzles", row=2, col=2)

fig.show()

# Print statistical comparison
print("CHIRAL TWIN STATISTICAL ANALYSIS:")
print(f"Chiral twin puzzles: {len(chiral_data):,} ({len(chiral_data)/len(df_analysis)*100:.1f}%)")
print(f"Solvability rate - Chiral: {chiral_solvable*100:.1f}% vs Regular: {non_chiral_solvable*100:.1f}%")

if len(chiral_solvable_data) > 0 and len(non_chiral_solvable_data) > 0:
    print(f"Mean solution length - Chiral: {chiral_solvable_data['solution_length'].mean():.1f} vs Regular: {non_chiral_solvable_data['solution_length'].mean():.1f}")
    print(f"Mean solve time - Chiral: {chiral_solvable_data['solve_time_seconds'].mean():.3f}s vs Regular: {non_chiral_solvable_data['solve_time_seconds'].mean():.3f}s")


CHIRAL TWIN STATISTICAL ANALYSIS:
Chiral twin puzzles: 4,802 (11.6%)
Solvability rate - Chiral: 22.2% vs Regular: 51.3%
Mean solution length - Chiral: 33.5 vs Regular: 34.0
Mean solve time - Chiral: 4.545s vs Regular: 6.185s
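
The `mirror_canonical` and `mirror_invariant_key` columns in the DataFrame view suggest how twins are detected: reflect the layout left to right, then keep the lexicographically smaller of the two sorted layouts as a shared key. The sketch below is inferred from the two rows shown earlier and is not the notebook's own implementation; the function names are hypothetical, and blocks are assumed to be `(row, col, height, width)` tuples on a 4-column board:

```python
BOARD_WIDTH = 4  # Assumption: standard 4-column Klotski board

def mirror_layout(blocks, width=BOARD_WIDTH):
    """Reflect (row, col, height, width) blocks left-to-right and sort them."""
    return tuple(sorted((r, width - c - w, h, w) for r, c, h, w in blocks))

def mirror_key(blocks):
    """Smaller of a layout and its reflection, so both twins share one key."""
    canonical = tuple(sorted(blocks))
    return min(canonical, mirror_layout(canonical))

def is_self_mirror(blocks):
    """True when the layout is identical to its own reflection."""
    canonical = tuple(sorted(blocks))
    return canonical == mirror_layout(canonical)

# mirror_canonical of puzzle_000000, copied from the DataFrame view
layout = (
    (0, 0, 2, 1), (0, 1, 1, 2), (0, 3, 1, 1), (1, 1, 1, 1),
    (1, 2, 2, 2), (2, 0, 2, 1), (2, 1, 1, 1), (3, 1, 1, 1),
    (3, 3, 1, 1), (4, 0, 1, 2), (4, 3, 1, 1),
)

print(mirror_key(layout))
print(is_self_mirror(layout))
```

Two puzzles are then chiral twins when their keys match but their own sorted layouts differ, which is why self-mirror boards need to be excluded separately.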


### Visualization 5: Piece Distribution Patterns

In [218]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go

# Subplot grid
fig = make_subplots(
    rows=2, cols=3,
    subplot_titles=(
        "1×1 Blocks Distribution",
        "1×2 Blocks Distribution",
        "2×1 Blocks Distribution",
        "Total Blocks: Solvable vs Unsolvable",
        "Solvability Rate by Block Composition",
        "Solvability vs Piece Diversity"
    ),
    horizontal_spacing=0.12,
    vertical_spacing=0.2
)

block_cols = ['block_count_1x1', 'block_count_1x2', 'block_count_2x1']
block_names = ['1×1 Blocks', '1×2 Blocks', '2×1 Blocks']
block_colors = ['#FF6B6B', '#4ECDC4', '#45B7D1']

# 1. Histograms for block distributions
for i, (col, name, color) in enumerate(zip(block_cols, block_names, block_colors)):
    fig.add_trace(
        go.Histogram(
            x=df[col],
            name=name,
            marker_color=color,
            opacity=0.75
        ),
        row=1, col=i+1
    )

# 2. Total blocks solvable vs unsolvable
fig.add_trace(
    go.Histogram(
        x=df[df['is_solvable']]['total_blocks'],
        name="Solvable",
        marker_color="green",
        opacity=0.6
    ),
    row=2, col=1
)
fig.add_trace(
    go.Histogram(
        x=df[~df['is_solvable']]['total_blocks'],
        name="Unsolvable",
        marker_color="red",
        opacity=0.6
    ),
    row=2, col=1
)

# 3. Heatmap of solvability by block composition
block_composition = df.groupby(
    ['block_count_1x1', 'block_count_1x2']
)['is_solvable'].mean().reset_index()

fig.add_trace(
    go.Heatmap(
        x=block_composition['block_count_1x2'],
        y=block_composition['block_count_1x1'],
        z=block_composition['is_solvable'],
        colorscale="RdYlGn",
        colorbar_title="Solvability Rate",
        hovertemplate="<b>1x2 Blocks:</b> %{x}<br><b>1x1 Blocks:</b> %{y}<br><b>Solvability Rate:</b> %{z}<extra></extra>"
    ),
    row=2, col=2
)

# 4. Bar chart for piece diversity
df_temp = df.copy()
df_temp['piece_diversity'] = (df_temp[block_cols] > 0).sum(axis=1)
diversity_stats = df_temp.groupby('piece_diversity')['is_solvable'].mean()

fig.add_trace(
    go.Bar(
        x=diversity_stats.index,
        y=diversity_stats.values,
        marker_color="purple",
        opacity=0.7,
        name="Diversity"
    ),
    row=2, col=3
)

# Layout styling
fig.update_layout(
    title={
        'text': "Piece Distribution Patterns Analysis",
        'x': 0.5,
        'xanchor': 'center',
        'font': dict(family="Courier New", size=26, color="#242424")
    },
    font=dict(family="Courier New", size=12, color="black"),
    plot_bgcolor='#F6EEE3',
    paper_bgcolor='#E7E0D3',
    height=850,
    margin=dict(t=70, b=50, l=50, r=50),
    barmode='stack',  # Stack histograms instead of overlapping
    legend=dict(
        x=0.8, y=1.05,  # Adjust the position of the legend
        orientation="h",
        traceorder='normal',
        font=dict(family="Courier New", size=12, color="black")
    ),
    
    # Grid settings for the first subplot only (layout.xaxis / layout.yaxis do not reach the others)
    xaxis=dict(showgrid=True, gridcolor='lightgray'),
    yaxis=dict(showgrid=True, gridcolor='lightgray'),
    
)

# Label each second-row subplot with what its axes actually show
fig.update_xaxes(title_text="Total Blocks", row=2, col=1)
fig.update_xaxes(title_text="1×2 Blocks", row=2, col=2)
fig.update_yaxes(title_text="1×1 Blocks", row=2, col=2)
fig.update_xaxes(title_text="Piece Diversity", row=2, col=3)
fig.update_yaxes(title_text="Solvability Rate", row=2, col=3)

fig.show()
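
The piece-diversity panel counts how many of the three non-goal block shapes appear on a board, then averages `is_solvable` within each count. A plain-Python sketch of that grouping follows; `solvability_by_diversity` is a hypothetical name, the first two sample rows come from the DataFrame view shown earlier, and the last two are made up purely for illustration:

```python
from collections import defaultdict

def solvability_by_diversity(rows):
    """rows: iterable of (count_1x1, count_1x2, count_2x1, is_solvable)."""
    totals = defaultdict(lambda: [0, 0])  # diversity -> [solvable boards, boards seen]
    for c11, c12, c21, solvable in rows:
        # Diversity = number of block shapes with at least one piece (0 to 3)
        diversity = sum(count > 0 for count in (c11, c12, c21))
        totals[diversity][0] += int(solvable)
        totals[diversity][1] += 1
    return {d: s / n for d, (s, n) in sorted(totals.items())}

rows = [
    (6, 2, 2, True),   # puzzle_000000
    (2, 1, 5, False),  # puzzle_000001
    (4, 0, 3, True),   # invented example row
    (0, 0, 6, False),  # invented example row
]
print(solvability_by_diversity(rows))
```

Since the 2×2 goal block is left out of `block_cols`, the diversity score tops out at 3 even though every board contains four kinds of shapes at most.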


In [219]:
fig = make_subplots(
        rows=2, cols=3,
        subplot_titles=(
            "Euclidean Distance Distribution",
            "Manhattan Distance Distribution",
            "Blocking Pieces Distribution",
            "Solvability by Goal Position",
            "Solution Length vs Goal Distance",
            "Distance Metrics Correlation"
        ),
        horizontal_spacing=0.12,
        vertical_spacing=0.2
    )

# Distance metrics
distance_cols = ['goal_distance_to_target', 'goal_manhattan_distance', 'blocks_between_goal_target']
distance_names = ['Euclidean Distance', 'Manhattan Distance', 'Blocking Pieces']

# Histograms for solvable vs unsolvable
for i, (col, name) in enumerate(zip(distance_cols, distance_names)):
        fig.add_trace(
            go.Histogram(
                x=df[df['is_solvable']][col],
                name=f"{name} (Solvable)",
                marker_color="green",
                opacity=0.6
            ),
            row=1, col=i+1
        )
        fig.add_trace(
            go.Histogram(
                x=df[~df['is_solvable']][col],
                name=f"{name} (Unsolvable)",
                marker_color="red",
                opacity=0.6
            ),
            row=1, col=i+1
        )

# Heatmap for goal position solvability
solvability_pivot = df.pivot_table(
        values='is_solvable',
        index='goal_initial_row',
        columns='goal_initial_col',
        aggfunc='mean'
    )

fig.add_trace(
        go.Heatmap(
            z=solvability_pivot.values,
            x=solvability_pivot.columns,
            y=solvability_pivot.index,
            colorscale="RdYlGn",
            colorbar_title="Solvability"
        ),
        row=2, col=1
    )

# Scatter for solution length vs distance
solvable_df = df[df['is_solvable']]
fig.add_trace(
        go.Scatter(
            x=solvable_df['goal_distance_to_target'],
            y=solvable_df['solution_length'],
            mode='markers',
            marker=dict(
                size=6,
                color=solvable_df['blocks_between_goal_target'],
                colorscale='Viridis',
                showscale=True,
                colorbar=dict(title="Blocking Pieces")
            ),
            name="Solution Length"
        ),
        row=2, col=2
    )

# Correlation heatmap
distance_features = ['goal_distance_to_target', 'goal_manhattan_distance',
                         'blocks_between_goal_target', 'solution_length']
corr_data = solvable_df[distance_features].corr()

fig.add_trace(
        go.Heatmap(
            z=corr_data.values,
            x=corr_data.columns,
            y=corr_data.index,
            colorscale="RdBu",
            zmin=-1, zmax=1,
            colorbar_title="Correlation"
        ),
        row=2, col=3
    )

# Layout styling
fig.update_layout(
        title={
            'text': "Goal Distance Multi-Metric Analysis",
            'x': 0.5,
            'xanchor': 'center',
            'font': dict(family="Courier New", size=26, color="#242424")
        },
        font=dict(family="Courier New", size=12, color="black"),
        plot_bgcolor='#F6EEE3',
        paper_bgcolor='#E7E0D3',
        height=850,
        margin=dict(t=70, b=50, l=50, r=50),
        barmode='overlay',
        legend=dict(font=dict(family="Courier New", size=12, color="black"))
    )

fig.show()
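
The two distance columns measure the same row and column offsets in different ways, which goes a long way toward explaining their 0.959 correlation. The sketch below reproduces the values for the two puzzles in the DataFrame view. It assumes the goal block's target top-left cell is `(3, 1)`, which is inferred from those two rows and not confirmed by the notebook; `goal_distances` is a hypothetical helper:

```python
import math

# Assumption: target top-left cell of the 2x2 goal block, inferred from the data
TARGET = (3, 1)

def goal_distances(goal_row, goal_col, target=TARGET):
    """Return (Euclidean, Manhattan) distance from the goal block to the target."""
    d_row = target[0] - goal_row
    d_col = target[1] - goal_col
    return math.hypot(d_row, d_col), abs(d_row) + abs(d_col)

print(goal_distances(1, 2))  # puzzle_000000: goal starts at row 1, col 2
print(goal_distances(1, 1))  # puzzle_000001: goal starts at row 1, col 1
```

Manhattan distance is never smaller than Euclidean distance, and the two coincide whenever the offset lies along a single axis, so on a 5×4 board the pair carries largely redundant information.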


In [220]:
fig = make_subplots(
        rows=2, cols=3,
        subplot_titles=(
            "Solvability vs Board Density",
            "Solvability vs Wall Adjacency",
            "Mobility Index Distribution",
            "Solvability vs Blocker Score",
            "Solution Length vs Clearance Cost",
            "Spatial Metrics Correlation"
        ),
        horizontal_spacing=0.12,
        vertical_spacing=0.2
    )

# 1. Density analysis (10 edges from linspace give 9 equal-width bins)
density_bins = np.linspace(df['board_density'].min(), df['board_density'].max(), 10)
df['density_category'] = pd.cut(df['board_density'], bins=density_bins, include_lowest=True)
density_stats = df.groupby('density_category', observed=False)['is_solvable'].mean()

fig.add_trace(
        go.Bar(
            x=[str(x) for x in density_stats.index],
            y=density_stats.values,
            marker_color="skyblue",
            opacity=0.7,
            name="Density"
        ),
        row=1, col=1
    )

# 2. Wall adjacency
wall_stats = df.groupby('wall_adjacent_sides')['is_solvable'].mean()
fig.add_trace(
        go.Bar(
            x=wall_stats.index.astype(str),
            y=wall_stats.values,
            marker_color="orange",
            opacity=0.7
        ),
        row=1, col=2
    )

# 3. Mobility index distribution
fig.add_trace(
        go.Histogram(
            x=df[df['is_solvable']]['mobility_index'],
            name="Solvable",
            marker_color="green",
            opacity=0.6
        ),
        row=1, col=3
    )
fig.add_trace(
        go.Histogram(
            x=df[~df['is_solvable']]['mobility_index'],
            name="Unsolvable",
            marker_color="red",
            opacity=0.6
        ),
        row=1, col=3
    )

# 4. Blocker score (bins stop at the 95th percentile, so larger scores fall outside every bin and are excluded)
blocker_bins = np.linspace(0, df['blocker_score'].quantile(0.95), 10)
df['blocker_category'] = pd.cut(df['blocker_score'], bins=blocker_bins, include_lowest=True)
blocker_stats = df.groupby('blocker_category', observed=False)['is_solvable'].mean()

fig.add_trace(
        go.Bar(
            x=[str(x) for x in blocker_stats.index],
            y=blocker_stats.values,
            marker_color="coral",
            opacity=0.7
        ),
        row=2, col=1
    )

# 5. Clearance cost vs solution length
solvable_df = df[df['is_solvable']]
fig.add_trace(
        go.Scatter(
            x=solvable_df['clearance_cost'],
            y=solvable_df['solution_length'],
            mode='markers',
            marker=dict(
                size=6,
                color=solvable_df['board_density'],
                colorscale='Plasma',
                showscale=True,
                colorbar=dict(title="Board Density")
            )
        ),
        row=2, col=2
    )

# 6. Correlation heatmap
spatial_features = ['mobility_index', 'blocker_score', 'clearance_cost',
                        'board_density', 'wall_adjacent_sides']
corr_data = df[spatial_features].corr()

fig.add_trace(
        go.Heatmap(
            z=corr_data.values,
            x=corr_data.columns,
            y=corr_data.index,
            colorscale="RdBu",
            zmin=-1, zmax=1,
            colorbar_title="Correlation"
        ),
        row=2, col=3
    )

# Layout styling
fig.update_layout(
        title={
            'text': "Spatial Layout Impact Analysis",
            'x': 0.5,
            'xanchor': 'center',
            'font': dict(family="Courier New", size=26, color="#242424")
        },
        font=dict(family="Courier New", size=12, color="black"),
        plot_bgcolor='#F6EEE3',
        paper_bgcolor='#E7E0D3',
        height=850,
        margin=dict(t=70, b=50, l=50, r=50),
        barmode='overlay',
        legend=dict(font=dict(family="Courier New", size=12, color="black"))
    )

fig.show()
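
Two details of the binning in this cell are easy to miss: `np.linspace(..., 10)` returns 10 edges, which makes 9 bins, and any value beyond the last edge is assigned to no bin at all. The sketch below imitates that behaviour in plain Python for right-closed bins with the lowest edge included; `linspace` and `bin_index` are hypothetical stand-ins, not the NumPy or pandas functions themselves:

```python
from bisect import bisect_left

def linspace(start, stop, num):
    """Pure-Python stand-in for np.linspace: num evenly spaced edges."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

def bin_index(value, edges):
    """Index of the right-closed bin holding value, or None if it is outside."""
    if value < edges[0] or value > edges[-1]:
        return None  # mirrors the missing category for out-of-range values
    if value == edges[0]:
        return 0  # the lowest edge is included in the first bin
    return bisect_left(edges, value) - 1

edges = linspace(0, 9, 10)
print("bins:", len(edges) - 1)
print(bin_index(0, edges), bin_index(1, edges), bin_index(9, edges), bin_index(12, edges))
```

For the blocker-score panel this means roughly the top 5% of boards are left out of the bar heights, which is worth stating in the subplot title or caption.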
