# Financial Statements Page Classification - Exploratory Data Analysis (EDA)

## Project Overview
This notebook performs comprehensive EDA on Arabic financial statements to prepare for building a computer vision classification model.

### Classification Classes:
1. **Independent Auditor's Report** - Pages with auditor comments and opinions
2. **Financial Sheets** - Balance Sheet, Income Statement, Cash Flow Statement, Statement of Changes in Equity
3. **Notes about Financial Statements (Tabular)** - Notes containing at least one table
4. **Notes about Financial Statements (Text)** - Notes without tables
5. **Other Pages** - Cover pages, table of contents, etc.

### Key Constraints:
- Documents are in **Arabic** - text extraction is unreliable
- Focus on **visual/structural features** for classification
- Table detection is critical for distinguishing Tabular vs Text notes
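
The five classes, and the table-based split between the two Notes classes, can be captured in a small label map. This is only a minimal sketch: the `PageClass` enum, its integer ids, and the `resolve_notes_class` helper are hypothetical names introduced here for illustration and are not part of the dataset or of any library.

```python
from enum import IntEnum


class PageClass(IntEnum):
    """Hypothetical label ids for the five page classes listed above."""
    AUDITOR_REPORT = 0
    FINANCIAL_SHEET = 1
    NOTES_TABULAR = 2
    NOTES_TEXT = 3
    OTHER = 4


def resolve_notes_class(has_table: bool) -> PageClass:
    """Split a Notes page into Tabular vs Text based on table presence."""
    return PageClass.NOTES_TABULAR if has_table else PageClass.NOTES_TEXT


# A Notes page on which a table detector fired is labelled as Tabular.
print(resolve_notes_class(True).name)
print(len(PageClass))
```

Keeping the Tabular/Text decision in one helper makes it easy to swap in a different table detector later without touching the label ids.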

---
## 1. Environment Setup & Dependencies

In [None]:
# Install required packages
!pip install -q pdf2image pymupdf Pillow opencv-python-headless matplotlib seaborn pandas numpy scikit-learn tqdm

In [None]:
# Install poppler for pdf2image (required on Colab)
!apt-get install -y -q poppler-utils

In [None]:
# Core imports
import os
import sys
import json
import warnings
from pathlib import Path
from collections import defaultdict, Counter
from typing import List, Dict, Tuple, Optional

# Data handling
import numpy as np
import pandas as pd

# Image processing
import cv2
from PIL import Image
import fitz  # PyMuPDF

# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import seaborn as sns
from matplotlib.gridspec import GridSpec

# Utilities
from tqdm.notebook import tqdm
from sklearn.cluster import KMeans

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')

print("All imports successful!")
print(f"OpenCV version: {cv2.__version__}")
print(f"PyMuPDF version: {fitz.version}")

---
## 2. Data Loading & Google Drive Mount

In [None]:
# Mount Google Drive (for Colab)
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# ============================================
# CONFIGURE YOUR DATA PATH HERE
# ============================================

# Option 1: If data is in Google Drive
DATA_PATH = "/content/drive/MyDrive/YOUR_DATASET_FOLDER"  # <-- UPDATE THIS PATH

# Option 2: If you uploaded to Colab directly
# DATA_PATH = "/content/your_uploaded_folder"

# Option 3: If data is already extracted as images
# IMAGES_PATH = "/content/drive/MyDrive/extracted_images"

# Check if path exists
if os.path.exists(DATA_PATH):
    print(f"Data path found: {DATA_PATH}")
    print(f"Contents: {os.listdir(DATA_PATH)[:20]}...")  # Show first 20 items
else:
    print(f"WARNING: Path '{DATA_PATH}' not found!")
    print("Please update DATA_PATH to point to your dataset folder.")

---
## 3. Dataset Structure Exploration

In [None]:
def explore_directory_structure(root_path: str, max_depth: int = 3) -> Dict:
    """
    Recursively explore directory structure and gather statistics.
    
    Args:
        root_path: Root directory to explore
        max_depth: Maximum depth to traverse
    
    Returns:
        Dictionary with structure information
    """
    structure = {
        'total_files': 0,
        'total_folders': 0,
        'file_types': Counter(),
        'folder_contents': {},
        'pdf_files': [],
        'image_files': [],
        'depth_distribution': Counter()
    }
    
    image_extensions = {'.jpg', '.jpeg', '.png', '.tiff', '.tif', '.bmp', '.gif'}
    
    for root, dirs, files in os.walk(root_path):
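        # Depth relative to root_path (robust to a trailing separator on root_path)
        rel_root = os.path.relpath(root, root_path)
        depth = 0 if rel_root == os.curdir else rel_root.count(os.sep) + 1
        if depth > max_depth:
            dirs[:] = []  # Do not descend any further below max_depth
            continue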
            
        structure['total_folders'] += len(dirs)
        structure['depth_distribution'][depth] += len(files)
        
        rel_path = os.path.relpath(root, root_path)
        structure['folder_contents'][rel_path] = len(files)
        
        for file in files:
            structure['total_files'] += 1
            ext = os.path.splitext(file)[1].lower()
            structure['file_types'][ext] += 1
            
            full_path = os.path.join(root, file)
            if ext == '.pdf':
                structure['pdf_files'].append(full_path)
            elif ext in image_extensions:
                structure['image_files'].append(full_path)
    
    return structure
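
As a rough illustration of how a directory's depth can be derived during `os.walk`, the sketch below builds a throwaway directory tree and maps each folder to its depth. The helper name `directory_depths` and the folder names `company_a` and `2023` are made up for this example; the real dataset layout may look quite different.

```python
import os
import tempfile


def directory_depths(root_path: str) -> dict:
    """Map each directory's path (relative to root_path) to its depth."""
    depths = {}
    for root, _dirs, _files in os.walk(root_path):
        rel = os.path.relpath(root, root_path)
        # The root itself is reported as '.', which counts as depth 0.
        depths[rel] = 0 if rel == os.curdir else rel.count(os.sep) + 1
    return depths


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "company_a", "2023"))
    # A trailing separator on the root does not change the computed depths.
    depths = directory_depths(tmp + os.sep)

for rel_path, depth in sorted(depths.items()):
    print(depth, rel_path)
```

Computing depth from the relative path, rather than by string replacement on the absolute path, avoids off-by-one results when the root is passed with a trailing slash.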

In [None]:
# Explore dataset structure
print("="*60)
print("DATASET STRUCTURE EXPLORATION")
print("="*60)

dataset_info = explore_directory_structure(DATA_PATH)

print(f"\n📁 Total Folders: {dataset_info['total_folders']}")
print(f"📄 Total Files: {dataset_info['total_files']}")
print(f"📑 PDF Files: {len(dataset_info['pdf_files'])}")
print(f"🖼️  Image Files: {len(dataset_info['image_files'])}")

print("\n📊 File Type Distribution:")
for ext, count in sorted(dataset_info['file_types'].items(), key=lambda x: -x[1]):
    print(f"   {ext if ext else '(no extension)'}: {count} files")

In [None]:
# Visualize file type distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# File types pie chart
file_types = dict(dataset_info['file_types'])
if file_types:
    labels = [k if k else 'no ext' for k in file_types.keys()]
    sizes = list(file_types.values())
    explode = [0.05 if l == '.pdf' else 0 for l in labels]
    
    axes[0].pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%',
                shadow=True, startangle=90)
    axes[0].set_title('File Type Distribution', fontsize=14, fontweight='bold')

# Folder content distribution
folder_contents = dataset_info['folder_contents']
if folder_contents:
    # Show top 15 folders by file count
    sorted_folders = sorted(folder_contents.items(), key=lambda x: -x[1])[:15]
    folders, counts = zip(*sorted_folders)
    folders = [f[:30] + '...' if len(f) > 30 else f for f in folders]  # Truncate long names
    
    axes[1].barh(range(len(folders)), counts, color=sns.color_palette('viridis', len(folders)))
    axes[1].set_yticks(range(len(folders)))
    axes[1].set_yticklabels(folders)
    axes[1].set_xlabel('Number of Files')
    axes[1].set_title('Top 15 Folders by File Count', fontsize=14, fontweight='bold')
    axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

---
## 4. PDF Document Analysis

In [None]:
def analyze_pdf(pdf_path: str) -> Dict:
    """
    Extract metadata and statistics from a PDF file.
    
    Args:
        pdf_path: Path to PDF file
    
    Returns:
        Dictionary with PDF metadata and statistics
    """
    try:
        doc = fitz.open(pdf_path)
        
        # Get page dimensions
        page_sizes = []
        for page in doc:
            rect = page.rect
            page_sizes.append({
                'width': rect.width,
                'height': rect.height,
                'aspect_ratio': rect.width / rect.height if rect.height > 0 else 0
            })
        
        info = {
            'filename': os.path.basename(pdf_path),
            'filepath': pdf_path,
            'num_pages': len(doc),
            'file_size_mb': os.path.getsize(pdf_path) / (1024 * 1024),
            'metadata': doc.metadata,
            'page_sizes': page_sizes,
            'avg_width': np.mean([p['width'] for p in page_sizes]) if page_sizes else 0,
            'avg_height': np.mean([p['height'] for p in page_sizes]) if page_sizes else 0,
            'is_encrypted': doc.is_encrypted,
            'has_toc': len(doc.get_toc()) > 0
        }
        
        doc.close()
        return info
    
    except Exception as e:
        return {
            'filename': os.path.basename(pdf_path),
            'filepath': pdf_path,
            'error': str(e)
        }

In [None]:
# Analyze all PDF files
print("Analyzing PDF documents...")
pdf_analyses = []

for pdf_path in tqdm(dataset_info['pdf_files'], desc="Processing PDFs"):
    pdf_analyses.append(analyze_pdf(pdf_path))

# Create DataFrame for analysis
pdf_df = pd.DataFrame([p for p in pdf_analyses if 'error' not in p])
error_pdfs = [p for p in pdf_analyses if 'error' in p]

print(f"\n✅ Successfully analyzed: {len(pdf_df)} PDFs")
print(f"❌ Failed to analyze: {len(error_pdfs)} PDFs")

In [None]:
# PDF Statistics Summary
if len(pdf_df) > 0:
    print("="*60)
    print("PDF DOCUMENTS STATISTICS")
    print("="*60)
    
    print(f"\n📄 Total PDFs: {len(pdf_df)}")
    print(f"📃 Total Pages: {pdf_df['num_pages'].sum():,}")
    print(f"💾 Total Size: {pdf_df['file_size_mb'].sum():.2f} MB")
    
    print(f"\n📊 Pages per Document:")
    print(f"   Min: {pdf_df['num_pages'].min()}")
    print(f"   Max: {pdf_df['num_pages'].max()}")
    print(f"   Mean: {pdf_df['num_pages'].mean():.1f}")
    print(f"   Median: {pdf_df['num_pages'].median():.1f}")
    print(f"   Std Dev: {pdf_df['num_pages'].std():.1f}")
    
    print(f"\n📏 Page Dimensions (average):")
    print(f"   Width: {pdf_df['avg_width'].mean():.1f} pts")
    print(f"   Height: {pdf_df['avg_height'].mean():.1f} pts")
    
    print(f"\n🔐 Encrypted PDFs: {pdf_df['is_encrypted'].sum()}")
    print(f"📑 PDFs with Table of Contents: {pdf_df['has_toc'].sum()}")

In [None]:
# Visualize PDF statistics
if len(pdf_df) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Pages per document distribution
    axes[0, 0].hist(pdf_df['num_pages'], bins=30, color='steelblue', edgecolor='black', alpha=0.7)
    axes[0, 0].axvline(pdf_df['num_pages'].mean(), color='red', linestyle='--', label=f'Mean: {pdf_df["num_pages"].mean():.1f}')
    axes[0, 0].axvline(pdf_df['num_pages'].median(), color='green', linestyle='--', label=f'Median: {pdf_df["num_pages"].median():.1f}')
    axes[0, 0].set_xlabel('Number of Pages')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].set_title('Distribution of Pages per Document', fontsize=12, fontweight='bold')
    axes[0, 0].legend()
    
    # File size distribution
    axes[0, 1].hist(pdf_df['file_size_mb'], bins=30, color='coral', edgecolor='black', alpha=0.7)
    axes[0, 1].axvline(pdf_df['file_size_mb'].mean(), color='red', linestyle='--', label=f'Mean: {pdf_df["file_size_mb"].mean():.2f} MB')
    axes[0, 1].set_xlabel('File Size (MB)')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title('Distribution of File Sizes', fontsize=12, fontweight='bold')
    axes[0, 1].legend()
    
    # Pages vs File Size scatter
    axes[1, 0].scatter(pdf_df['num_pages'], pdf_df['file_size_mb'], alpha=0.6, c='purple')
    axes[1, 0].set_xlabel('Number of Pages')
    axes[1, 0].set_ylabel('File Size (MB)')
    axes[1, 0].set_title('Pages vs File Size Correlation', fontsize=12, fontweight='bold')
    
    # Page dimensions
    axes[1, 1].scatter(pdf_df['avg_width'], pdf_df['avg_height'], alpha=0.6, c='teal')
    axes[1, 1].set_xlabel('Average Page Width (pts)')
    axes[1, 1].set_ylabel('Average Page Height (pts)')
    axes[1, 1].set_title('Page Dimension Distribution', fontsize=12, fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Correlation
    correlation = pdf_df['num_pages'].corr(pdf_df['file_size_mb'])
    print(f"\n📈 Correlation between pages and file size: {correlation:.3f}")

In [None]:
# Show PDF DataFrame
if len(pdf_df) > 0:
    display_cols = ['filename', 'num_pages', 'file_size_mb', 'avg_width', 'avg_height', 'is_encrypted', 'has_toc']
    print("\n📋 PDF Documents Overview (first 20):")
    display(pdf_df[display_cols].head(20))

---
## 5. Page Extraction & Image Analysis

In [None]:
def extract_pages_as_images(pdf_path: str, dpi: int = 150, max_pages: Optional[int] = None) -> List[np.ndarray]:
    """
    Extract pages from PDF as images using PyMuPDF.
    
    Args:
        pdf_path: Path to PDF file
        dpi: Resolution for rendering
        max_pages: Maximum number of pages to extract (None for all)
    
    Returns:
        List of images as numpy arrays (RGB)
    """
    images = []
    try:
        doc = fitz.open(pdf_path)
        zoom = dpi / 72  # 72 is the default PDF resolution
        mat = fitz.Matrix(zoom, zoom)
        
        pages_to_extract = min(len(doc), max_pages) if max_pages else len(doc)
        
        for page_num in range(pages_to_extract):
            page = doc[page_num]
            pix = page.get_pixmap(matrix=mat)
            
            # Convert to numpy array
            img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
            
            # Convert to RGB if needed
            if pix.n == 4:  # RGBA
                img = cv2.cvtColor(img, cv2.COLOR_RGBA2RGB)
            elif pix.n == 1:  # Grayscale
                img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
                
            images.append(img)
        
        doc.close()
    except Exception as e:
        print(f"Error extracting {pdf_path}: {e}")
    
    return images
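
To make the `dpi / 72` zoom factor concrete, the following sketch estimates the pixel size of a rendered page from its size in points. The helper `rendered_size` is a hypothetical name, and the A4 dimensions of 595 x 842 pt are an assumed example; the renderer's own rounding may differ from this estimate by a pixel.

```python
def rendered_size(width_pts: float, height_pts: float, dpi: int = 150) -> tuple:
    """Estimate rendered pixel dimensions for a page measured in points."""
    zoom = dpi / 72  # PDF user space uses 72 points per inch
    return round(width_pts * zoom), round(height_pts * zoom)


# Assumed example: an A4 page (595 x 842 pt) rendered at 150 DPI.
a4_pixels = rendered_size(595, 842, dpi=150)
# At 72 DPI the zoom is 1.0, so pixels equal points.
letter_pixels = rendered_size(612, 792, dpi=72)

print(a4_pixels, letter_pixels)
```

This kind of estimate is mainly useful for budgeting memory before rendering every page of a large dataset.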

In [None]:
def analyze_image_properties(img: np.ndarray) -> Dict:
    """
    Analyze visual properties of an image.
    
    Args:
        img: Image as numpy array (RGB)
    
    Returns:
        Dictionary with image properties
    """
    height, width = img.shape[:2]
    
    # Convert to grayscale for analysis
    if len(img.shape) == 3:
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    else:
        gray = img
    
    # Edge detection for structural analysis
    edges = cv2.Canny(gray, 50, 150)
    edge_density = np.sum(edges > 0) / (height * width)
    
    # Line detection using Hough Transform
    lines = cv2.HoughLinesP(edges, 1, np.pi/180, threshold=100, minLineLength=50, maxLineGap=10)
    num_lines = len(lines) if lines is not None else 0
    
    # Separate horizontal and vertical lines
    horizontal_lines = 0
    vertical_lines = 0
    if lines is not None:
        for line in lines:
            x1, y1, x2, y2 = line[0]
            angle = np.abs(np.arctan2(y2 - y1, x2 - x1) * 180 / np.pi)
            if angle < 15 or angle > 165:  # Horizontal
                horizontal_lines += 1
            elif 75 < angle < 105:  # Vertical
                vertical_lines += 1
    
    # Contour analysis
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    num_contours = len(contours)
    
    # Calculate white space ratio
    _, binary = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY)
    white_space_ratio = np.sum(binary == 255) / (height * width)
    
    # Ink density estimate: complement of the white-space ratio
    # (fraction of pixels at or below the 240 brightness threshold)
    text_density = 1 - white_space_ratio
    
    return {
        'width': width,
        'height': height,
        'aspect_ratio': width / height,
        'edge_density': edge_density,
        'num_lines': num_lines,
        'horizontal_lines': horizontal_lines,
        'vertical_lines': vertical_lines,
        'num_contours': num_contours,
        'white_space_ratio': white_space_ratio,
        'text_density': text_density
    }
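
The horizontal/vertical split of detected segments depends only on each segment's angle. The sketch below reproduces the same thresholds (within 15 degrees of the axis) in plain Python so they can be checked on a few hand-picked segments; `classify_segment` is a hypothetical helper name, not an OpenCV function.

```python
import math


def classify_segment(x1: float, y1: float, x2: float, y2: float) -> str:
    """Label a line segment as horizontal, vertical, or other by its angle."""
    angle = abs(math.degrees(math.atan2(y2 - y1, x2 - x1)))  # in [0, 180]
    if angle < 15 or angle > 165:
        return "horizontal"
    if 75 < angle < 105:
        return "vertical"
    return "other"


# Hand-picked segments: flat, reversed flat, upright, and diagonal.
for segment in [(0, 0, 100, 0), (100, 0, 0, 0), (0, 0, 0, 100), (0, 0, 100, 100)]:
    print(segment, classify_segment(*segment))
```

Taking the absolute value of the angle means a segment reported right-to-left (180 degrees) is still treated as horizontal.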

In [None]:
# Extract and analyze sample pages from each PDF
print("Extracting and analyzing sample pages from PDFs...")
print("(This may take a while depending on dataset size)\n")

all_page_analyses = []
sample_images = {}  # Store some sample images for visualization

# Analyze pages from each PDF
for pdf_info in tqdm(pdf_analyses[:], desc="Analyzing PDFs"):
    if 'error' in pdf_info:
        continue
    
    pdf_path = pdf_info['filepath']
    pdf_name = pdf_info['filename']
    
    # Extract all pages
    images = extract_pages_as_images(pdf_path, dpi=150)
    
    for page_idx, img in enumerate(images):
        props = analyze_image_properties(img)
        props['pdf_name'] = pdf_name
        props['page_num'] = page_idx + 1
        props['pdf_path'] = pdf_path
        all_page_analyses.append(props)
        
        # Store sample images (first, middle, and last page of each PDF, up to 50 images in total)
        if len(sample_images) < 50:
            key = f"{pdf_name}_page{page_idx+1}"
            if page_idx == 0 or page_idx == len(images)//2 or page_idx == len(images)-1:
                sample_images[key] = img

page_df = pd.DataFrame(all_page_analyses)
print(f"\n✅ Analyzed {len(page_df)} pages from {len(pdf_df)} PDFs")

In [None]:
# Page-level statistics
if len(page_df) > 0:
    print("="*60)
    print("PAGE-LEVEL IMAGE STATISTICS")
    print("="*60)
    
    print(f"\n📄 Total Pages Analyzed: {len(page_df)}")
    
    print(f"\n📏 Image Dimensions:")
    print(f"   Width  - Min: {page_df['width'].min()}, Max: {page_df['width'].max()}, Mean: {page_df['width'].mean():.1f}")
    print(f"   Height - Min: {page_df['height'].min()}, Max: {page_df['height'].max()}, Mean: {page_df['height'].mean():.1f}")
    
    print(f"\n📐 Aspect Ratios:")
    print(f"   Min: {page_df['aspect_ratio'].min():.3f}")
    print(f"   Max: {page_df['aspect_ratio'].max():.3f}")
    print(f"   Mean: {page_df['aspect_ratio'].mean():.3f}")
    
    print(f"\n📊 Structural Features:")
    print(f"   Edge Density    - Mean: {page_df['edge_density'].mean():.4f}, Std: {page_df['edge_density'].std():.4f}")
    print(f"   Lines Detected  - Mean: {page_df['num_lines'].mean():.1f}, Std: {page_df['num_lines'].std():.1f}")
    print(f"   Horizontal Lines- Mean: {page_df['horizontal_lines'].mean():.1f}")
    print(f"   Vertical Lines  - Mean: {page_df['vertical_lines'].mean():.1f}")
    print(f"   Contours        - Mean: {page_df['num_contours'].mean():.1f}")
    
    print(f"\n📝 Content Metrics:")
    print(f"   White Space Ratio - Mean: {page_df['white_space_ratio'].mean():.3f}")
    print(f"   Text Density      - Mean: {page_df['text_density'].mean():.3f}")

In [None]:
# Visualize page-level statistics
if len(page_df) > 0:
    fig, axes = plt.subplots(2, 3, figsize=(16, 10))
    
    # Edge density distribution
    axes[0, 0].hist(page_df['edge_density'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
    axes[0, 0].set_xlabel('Edge Density')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].set_title('Edge Density Distribution', fontweight='bold')
    
    # Number of lines distribution
    axes[0, 1].hist(page_df['num_lines'], bins=50, color='coral', edgecolor='black', alpha=0.7)
    axes[0, 1].set_xlabel('Number of Lines')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title('Lines Detected Distribution', fontweight='bold')
    
    # Horizontal vs Vertical lines
    axes[0, 2].scatter(page_df['horizontal_lines'], page_df['vertical_lines'], alpha=0.3, c='purple')
    axes[0, 2].set_xlabel('Horizontal Lines')
    axes[0, 2].set_ylabel('Vertical Lines')
    axes[0, 2].set_title('Horizontal vs Vertical Lines', fontweight='bold')
    
    # White space ratio
    axes[1, 0].hist(page_df['white_space_ratio'], bins=50, color='teal', edgecolor='black', alpha=0.7)
    axes[1, 0].set_xlabel('White Space Ratio')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('White Space Ratio Distribution', fontweight='bold')
    
    # Contours distribution
    axes[1, 1].hist(page_df['num_contours'], bins=50, color='orange', edgecolor='black', alpha=0.7)
    axes[1, 1].set_xlabel('Number of Contours')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_title('Contours Distribution', fontweight='bold')
    
    # Aspect ratio
    axes[1, 2].hist(page_df['aspect_ratio'], bins=30, color='green', edgecolor='black', alpha=0.7)
    axes[1, 2].set_xlabel('Aspect Ratio (W/H)')
    axes[1, 2].set_ylabel('Frequency')
    axes[1, 2].set_title('Aspect Ratio Distribution', fontweight='bold')
    
    plt.tight_layout()
    plt.show()

---
## 6. Table Detection Analysis

Critical for distinguishing **Notes (Tabular)** vs **Notes (Text)**
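
Ruling lines can be separated from text with a morphological opening that uses a long, thin kernel: long runs of ink survive, while short runs (glyph strokes) are removed. The sketch below shows the idea on a single binary pixel row in plain Python. It is a simplified illustration under stated assumptions: `open_row` is a hypothetical helper, and it ignores the image-border handling and repeated iterations that a real OpenCV opening involves.

```python
from typing import List


def open_row(row: List[int], kernel_len: int) -> List[int]:
    """1-D binary opening: keep only runs of 1s at least kernel_len long."""
    out = [0] * len(row)
    start = None
    for i, value in enumerate(row + [0]):  # trailing 0 closes a final run
        if value and start is None:
            start = i  # a new run of ink begins here
        elif not value and start is not None:
            if i - start >= kernel_len:
                out[start:i] = [1] * (i - start)  # run is long enough to keep
            start = None
    return out


# Short runs (text strokes) vanish; the 6-pixel run (a ruling line) survives.
row = [1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0]
opened = open_row(row, kernel_len=5)
print(opened)
```

The kernel length therefore acts as the minimum line length: too small and text strokes leak through, too large and short table rules are lost.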

In [None]:
def detect_tables_morphological(img: np.ndarray, min_table_area_ratio: float = 0.01) -> Dict:
    """
    Detect tables using morphological operations.
    A table is defined as a structured element with at least 1 row and 2 columns,
    or at least 2 rows and 1 column, separated by visible lines.
    
    Args:
        img: Input image (RGB)
        min_table_area_ratio: Minimum area ratio for a valid table
    
    Returns:
        Dictionary with table detection results
    """
    height, width = img.shape[:2]
    total_area = height * width
    
    # Convert to grayscale
    if len(img.shape) == 3:
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    else:
        gray = img.copy()
    
    # Threshold to binary (invert so lines are white)
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    
    # Detect horizontal lines
    horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (width // 30, 1))
    horizontal_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel, iterations=2)
    
    # Detect vertical lines
    vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, height // 30))
    vertical_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel, iterations=2)
    
    # Combine horizontal and vertical lines
    table_mask = cv2.add(horizontal_lines, vertical_lines)
    
    # Dilate to connect nearby lines
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    table_mask = cv2.dilate(table_mask, kernel, iterations=2)
    
    # Find contours of potential tables
    contours, _ = cv2.findContours(table_mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    
    tables_found = []
    total_table_area = 0
    
    for contour in contours:
        x, y, w, h = cv2.boundingRect(contour)
        area = w * h
        area_ratio = area / total_area
        
        # Filter by minimum area ratio and minimum bounding-box size
        if area_ratio >= min_table_area_ratio and w > 50 and h > 30:
            # Count lines within this region
            region_h = horizontal_lines[y:y+h, x:x+w]
            region_v = vertical_lines[y:y+h, x:x+w]
            
            # Find horizontal line segments
            h_lines = cv2.HoughLinesP(region_h, 1, np.pi/180, threshold=30, minLineLength=w//4, maxLineGap=10)
            v_lines = cv2.HoughLinesP(region_v, 1, np.pi/180, threshold=30, minLineLength=h//4, maxLineGap=10)
            
            num_h_lines = len(h_lines) if h_lines is not None else 0
            num_v_lines = len(v_lines) if v_lines is not None else 0
            
            # Check if it qualifies as a table (at least 1 row and 2 columns OR 2 rows and 1 column)
            # Rows are separated by horizontal lines, columns by vertical lines
            is_table = (num_h_lines >= 1 and num_v_lines >= 2) or (num_h_lines >= 2 and num_v_lines >= 1)
            
            if is_table:
                tables_found.append({
                    'bbox': (x, y, w, h),
                    'area': area,
                    'area_ratio': area_ratio,
                    'h_lines': num_h_lines,
                    'v_lines': num_v_lines
                })
                total_table_area += area
    
    # Calculate line densities
    h_line_pixels = np.sum(horizontal_lines > 0)
    v_line_pixels = np.sum(vertical_lines > 0)
    
    return {
        'has_table': len(tables_found) > 0,
        'num_tables': len(tables_found),
        'tables': tables_found,
        'total_table_area_ratio': total_table_area / total_area,
        'h_line_density': h_line_pixels / total_area,
        'v_line_density': v_line_pixels / total_area,
        'table_mask': table_mask
    }

In [None]:
def detect_tables_grid_based(img: np.ndarray) -> Dict:
    """
    Alternative table detection using grid intersection analysis.
    
    Args:
        img: Input image (RGB)
    
    Returns:
        Dictionary with grid-based table detection results
    """
    height, width = img.shape[:2]
    
    # Convert to grayscale
    if len(img.shape) == 3:
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    else:
        gray = img.copy()
    
    # Apply adaptive threshold
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, 
                                    cv2.THRESH_BINARY_INV, 11, 2)
    
    # Detect horizontal lines
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (width // 20, 1))
    h_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    
    # Detect vertical lines  
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, height // 20))
    v_lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    
    # Find intersections
    intersections = cv2.bitwise_and(h_lines, v_lines)
    
    # Dilate intersections
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
    intersections = cv2.dilate(intersections, kernel, iterations=2)
    
    # Count intersection points
    contours, _ = cv2.findContours(intersections, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    num_intersections = len(contours)
    
    # Combined table structure
    table_structure = cv2.add(h_lines, v_lines)
    structure_ratio = np.sum(table_structure > 0) / (height * width)
    
    # A table typically has multiple intersections forming a grid.
    # Four intersections (two horizontal x two vertical lines) enclose a single cell;
    # a 2x2 grid of cells would produce nine.
    likely_has_table = num_intersections >= 4 and structure_ratio > 0.005
    
    return {
        'likely_has_table': likely_has_table,
        'num_intersections': num_intersections,
        'structure_ratio': structure_ratio,
        'intersection_density': num_intersections / (height * width) * 10000  # Per 10k pixels
    }
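
The link between ruling lines and intersection counts can be checked on a toy grid. In this sketch, Python sets of pixel coordinates stand in for the binary line masks, and set intersection plays the role of the bitwise AND; `line_masks` and `count_intersections` are hypothetical helpers, and the one-pixel-wide lines are an idealisation of real scans.

```python
from typing import Iterable, Set, Tuple

Pixel = Tuple[int, int]


def line_masks(size: int, h_rows: Iterable[int], v_cols: Iterable[int]):
    """Build pixel sets for full-width horizontal and full-height vertical lines."""
    h_mask: Set[Pixel] = {(r, c) for r in h_rows for c in range(size)}
    v_mask: Set[Pixel] = {(r, c) for c in v_cols for r in range(size)}
    return h_mask, v_mask


def count_intersections(size: int, h_rows: Iterable[int], v_cols: Iterable[int]) -> int:
    """Count pixels that lie on both a horizontal and a vertical line."""
    h_mask, v_mask = line_masks(size, h_rows, v_cols)
    return len(h_mask & v_mask)


# Two horizontal and two vertical lines enclose a single cell.
single_cell = count_intersections(10, h_rows=[0, 9], v_cols=[0, 9])
# Three of each form a 2x2 grid of cells.
two_by_two = count_intersections(10, h_rows=[0, 5, 9], v_cols=[0, 5, 9])
print(single_cell, two_by_two)
```

With n horizontal and m vertical one-pixel lines the count is simply n times m, which is a useful sanity check when tuning the intersection threshold.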

In [None]:
# Perform table detection on all pages
print("Performing table detection analysis on all pages...")

table_analyses = []

for pdf_info in tqdm(pdf_analyses[:], desc="Table Detection"):
    if 'error' in pdf_info:
        continue
    
    pdf_path = pdf_info['filepath']
    pdf_name = pdf_info['filename']
    
    images = extract_pages_as_images(pdf_path, dpi=150)
    
    for page_idx, img in enumerate(images):
        # Morphological table detection
        morph_result = detect_tables_morphological(img)
        
        # Grid-based table detection
        grid_result = detect_tables_grid_based(img)
        
        table_analyses.append({
            'pdf_name': pdf_name,
            'page_num': page_idx + 1,
            'has_table_morph': morph_result['has_table'],
            'num_tables_morph': morph_result['num_tables'],
            'table_area_ratio': morph_result['total_table_area_ratio'],
            'h_line_density': morph_result['h_line_density'],
            'v_line_density': morph_result['v_line_density'],
            'has_table_grid': grid_result['likely_has_table'],
            'num_intersections': grid_result['num_intersections'],
            'structure_ratio': grid_result['structure_ratio'],
            'intersection_density': grid_result['intersection_density']
        })

table_df = pd.DataFrame(table_analyses)
print(f"\n✅ Table detection completed for {len(table_df)} pages")

In [None]:
# Table detection statistics
if len(table_df) > 0:
    print("="*60)
    print("TABLE DETECTION ANALYSIS")
    print("="*60)
    
    # Combine both methods for final determination
    table_df['has_table_combined'] = table_df['has_table_morph'] | table_df['has_table_grid']
    
    pages_with_tables = table_df['has_table_combined'].sum()
    pages_without_tables = len(table_df) - pages_with_tables
    
    print(f"\n📊 Table Detection Results:")
    print(f"   Pages WITH tables: {pages_with_tables} ({pages_with_tables/len(table_df)*100:.1f}%)")
    print(f"   Pages WITHOUT tables: {pages_without_tables} ({pages_without_tables/len(table_df)*100:.1f}%)")
    
    print(f"\n📈 Detection Method Comparison:")
    print(f"   Morphological method detected tables in: {table_df['has_table_morph'].sum()} pages")
    print(f"   Grid-based method detected tables in: {table_df['has_table_grid'].sum()} pages")
    
    print(f"\n📏 Table Area Statistics (for pages with tables):")
    table_pages = table_df[table_df['has_table_combined']]
    if len(table_pages) > 0:
        print(f"   Mean table area ratio: {table_pages['table_area_ratio'].mean():.4f}")
        print(f"   Max table area ratio: {table_pages['table_area_ratio'].max():.4f}")
    
    print(f"\n🔲 Line Density Statistics:")
    print(f"   Horizontal line density - Mean: {table_df['h_line_density'].mean():.6f}")
    print(f"   Vertical line density - Mean: {table_df['v_line_density'].mean():.6f}")
    print(f"   Grid intersections - Mean: {table_df['num_intersections'].mean():.1f}")

In [None]:
# Visualize table detection statistics
if len(table_df) > 0:
    fig, axes = plt.subplots(2, 3, figsize=(16, 10))
    
    # Table presence pie chart
    table_counts = table_df['has_table_combined'].value_counts()
    labels = ['No Tables', 'Has Tables']
    sizes = [table_counts.get(False, 0), table_counts.get(True, 0)]
    colors = ['#ff9999', '#66b3ff']
    axes[0, 0].pie(sizes, labels=labels, colors=colors, autopct='%1.1f%%', startangle=90)
    axes[0, 0].set_title('Pages With/Without Tables', fontweight='bold')
    
    # Table area ratio distribution
    axes[0, 1].hist(table_df['table_area_ratio'], bins=50, color='steelblue', edgecolor='black', alpha=0.7)
    axes[0, 1].set_xlabel('Table Area Ratio')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].set_title('Table Area Ratio Distribution', fontweight='bold')
    
    # Number of intersections distribution
    axes[0, 2].hist(table_df['num_intersections'], bins=50, color='coral', edgecolor='black', alpha=0.7)
    axes[0, 2].set_xlabel('Number of Grid Intersections')
    axes[0, 2].set_ylabel('Frequency')
    axes[0, 2].set_title('Grid Intersection Distribution', fontweight='bold')
    
    # H vs V line density scatter
    colors = ['red' if t else 'blue' for t in table_df['has_table_combined']]
    axes[1, 0].scatter(table_df['h_line_density'], table_df['v_line_density'], c=colors, alpha=0.3)
    axes[1, 0].set_xlabel('Horizontal Line Density')
    axes[1, 0].set_ylabel('Vertical Line Density')
    axes[1, 0].set_title('H vs V Line Density (Red=Table, Blue=No Table)', fontweight='bold')
    
    # Structure ratio distribution
    axes[1, 1].hist(table_df['structure_ratio'], bins=50, color='green', edgecolor='black', alpha=0.7)
    axes[1, 1].set_xlabel('Structure Ratio')
    axes[1, 1].set_ylabel('Frequency')
    axes[1, 1].set_title('Table Structure Ratio Distribution', fontweight='bold')
    
    # Method agreement
    agreement = [
        ((table_df['has_table_morph'] == True) & (table_df['has_table_grid'] == True)).sum(),
        ((table_df['has_table_morph'] == True) & (table_df['has_table_grid'] == False)).sum(),
        ((table_df['has_table_morph'] == False) & (table_df['has_table_grid'] == True)).sum(),
        ((table_df['has_table_morph'] == False) & (table_df['has_table_grid'] == False)).sum()
    ]
    labels = ['Both: Table', 'Morph Only', 'Grid Only', 'Both: No Table']
    axes[1, 2].bar(labels, agreement, color=['green', 'orange', 'purple', 'gray'])
    axes[1, 2].set_ylabel('Number of Pages')
    axes[1, 2].set_title('Detection Method Agreement', fontweight='bold')
    axes[1, 2].tick_params(axis='x', rotation=45)
    
    plt.tight_layout()
    plt.show()
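
The agreement between the two detectors can also be summarised numerically. The sketch below tallies the same four categories and a simple agreement rate from two lists of booleans; `agreement_counts`, `agreement_rate`, and the five sample flags are hypothetical and exist only to illustrate the bookkeeping, not to report results from this dataset.

```python
from collections import Counter
from typing import Sequence

CATEGORY = {
    (True, True): "both_table",
    (True, False): "morph_only",
    (False, True): "grid_only",
    (False, False): "both_no_table",
}


def agreement_counts(morph: Sequence[bool], grid: Sequence[bool]) -> Counter:
    """Tally pages by which of the two detectors reported a table."""
    return Counter(CATEGORY[(bool(m), bool(g))] for m, g in zip(morph, grid))


def agreement_rate(morph: Sequence[bool], grid: Sequence[bool]) -> float:
    """Fraction of pages on which both detectors give the same answer."""
    pairs = list(zip(morph, grid))
    if not pairs:
        return 0.0
    return sum(bool(m) == bool(g) for m, g in pairs) / len(pairs)


# Made-up flags for five pages, for illustration only.
morph_flags = [True, True, False, False, True]
grid_flags = [True, False, True, False, True]
counts = agreement_counts(morph_flags, grid_flags)
rate = agreement_rate(morph_flags, grid_flags)
print(dict(counts), rate)
```

Pages in the `morph_only` and `grid_only` buckets are natural candidates for manual inspection, since they are where the OR-combination of the two methods is most likely to introduce false positives.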

---
## 7. Visual Sample Gallery

In [None]:
def display_sample_pages(sample_images: Dict, n_samples: int = 12, figsize: Tuple = (20, 16)):
    """
    Display a gallery of sample page images.
    
    Args:
        sample_images: Dictionary of sample images
        n_samples: Number of samples to display
        figsize: Figure size
    """
    samples = list(sample_images.items())[:n_samples]
    n_cols = 4
    n_rows = (len(samples) + n_cols - 1) // n_cols
    
    fig, axes = plt.subplots(n_rows, n_cols, figsize=figsize)
    axes = np.atleast_1d(axes).ravel()  # Always a flat 1-D array of Axes
    
    for idx, (name, img) in enumerate(samples):
        axes[idx].imshow(img)
        axes[idx].set_title(name[:30] + '...' if len(name) > 30 else name, fontsize=9)
        axes[idx].axis('off')
    
    # Hide empty subplots
    for idx in range(len(samples), len(axes)):
        axes[idx].axis('off')
    
    plt.suptitle('Sample Page Gallery', fontsize=16, fontweight='bold')
    plt.tight_layout()
    plt.show()

In [None]:
# Display sample pages
if sample_images:
    print(f"Displaying {min(12, len(sample_images))} sample pages...")
    display_sample_pages(sample_images, n_samples=12)

In [None]:
def visualize_table_detection(img: np.ndarray, pdf_name: str, page_num: int):
    """
    Visualize table detection results on a single image.
    
    Args:
        img: Input image
        pdf_name: Name of the PDF
        page_num: Page number
    """
    result = detect_tables_morphological(img)
    
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))
    
    # Original image
    axes[0].imshow(img)
    axes[0].set_title(f'{pdf_name} - Page {page_num}\n(Original)', fontweight='bold')
    axes[0].axis('off')
    
    # Table mask
    axes[1].imshow(result['table_mask'], cmap='gray')
    axes[1].set_title(f'Detected Lines\n(H: {result["h_line_density"]:.5f}, V: {result["v_line_density"]:.5f})', fontweight='bold')
    axes[1].axis('off')
    
    # Image with table bounding boxes
    img_with_boxes = img.copy()
    for table in result['tables']:
        x, y, w, h = table['bbox']
        cv2.rectangle(img_with_boxes, (x, y), (x+w, y+h), (255, 0, 0), 3)
    
    axes[2].imshow(img_with_boxes)
    has_table_text = 'TABLE DETECTED' if result['has_table'] else 'NO TABLE'
    axes[2].set_title(f'{has_table_text}\n({result["num_tables"]} tables found)', fontweight='bold',
                      color='green' if result['has_table'] else 'red')
    axes[2].axis('off')
    
    plt.tight_layout()
    plt.show()

In [None]:
# Visualize table detection on sample pages
print("Visualizing table detection on sample pages...\n")

# Get a few sample images for visualization
sample_keys = list(sample_images.keys())[:6]

for key in sample_keys:
    parts = key.rsplit('_page', 1)
    pdf_name = parts[0]
    page_num = int(parts[1]) if len(parts) > 1 else 1
    img = sample_images[key]
    visualize_table_detection(img, pdf_name, page_num)

---
## 8. Layout & Structure Analysis

In [None]:
def analyze_page_layout(img: np.ndarray) -> Dict:
    """
    Analyze page layout characteristics.
    
    Args:
        img: Input image (RGB)
    
    Returns:
        Dictionary with layout analysis results
    """
    height, width = img.shape[:2]
    
    # Convert to grayscale
    if len(img.shape) == 3:
        gray = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
    else:
        gray = img.copy()
    
    # Binary threshold
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    
    # Analyze content distribution in different regions
    # Divide page into 3x3 grid
    h_third = height // 3
    w_third = width // 3
    
    region_densities = {}
    regions = [
        ('top_left', (0, 0, w_third, h_third)),
        ('top_center', (w_third, 0, 2*w_third, h_third)),
        ('top_right', (2*w_third, 0, width, h_third)),
        ('middle_left', (0, h_third, w_third, 2*h_third)),
        ('middle_center', (w_third, h_third, 2*w_third, 2*h_third)),
        ('middle_right', (2*w_third, h_third, width, 2*h_third)),
        ('bottom_left', (0, 2*h_third, w_third, height)),
        ('bottom_center', (w_third, 2*h_third, 2*w_third, height)),
        ('bottom_right', (2*w_third, 2*h_third, width, height))
    ]
    
    for name, (x1, y1, x2, y2) in regions:
        region = binary[y1:y2, x1:x2]
        region_densities[name] = np.sum(region > 0) / region.size
    
    # Analyze margins: a band 5% of the shorter page side along each edge
    margin_size = max(1, int(min(height, width) * 0.05))
    margins = {
        'top_margin_density': np.sum(binary[:margin_size, :] > 0) / (margin_size * width),
        'bottom_margin_density': np.sum(binary[-margin_size:, :] > 0) / (margin_size * width),
        'left_margin_density': np.sum(binary[:, :margin_size] > 0) / (height * margin_size),
        'right_margin_density': np.sum(binary[:, -margin_size:] > 0) / (height * margin_size)
    }
    
    # Calculate header and footer presence (top/bottom 10%)
    header_region = binary[:int(height*0.1), :]
    footer_region = binary[-int(height*0.1):, :]
    
    header_density = np.sum(header_region > 0) / header_region.size
    footer_density = np.sum(footer_region > 0) / footer_region.size
    
    # Content distribution (variance across horizontal bands)
    band_densities = []
    num_bands = 10
    band_height = height // num_bands
    for i in range(num_bands):
        band = binary[i*band_height:(i+1)*band_height, :]
        band_densities.append(np.sum(band > 0) / band.size)
    
    content_distribution_variance = np.var(band_densities)
    
    return {
        **region_densities,
        **margins,
        'header_density': header_density,
        'footer_density': footer_density,
        'content_distribution_variance': content_distribution_variance,
        'band_densities': band_densities
    }
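
One way to sanity-check the region densities is to run the same 3x3 split on a synthetic page where the answer is known in advance. The sketch below is a simplified stand-in, not the notebook's `analyze_page_layout`: `grid_densities` is a hypothetical helper that expects an already-binarized array and uses the same integer-division cell boundaries.

```python
import numpy as np

def grid_densities(binary: np.ndarray, n_rows: int = 3, n_cols: int = 3) -> np.ndarray:
    """Return an (n_rows, n_cols) array with the fraction of non-zero pixels per cell."""
    height, width = binary.shape
    # Cell boundaries; the last row/column absorbs any remainder pixels
    ys = [i * (height // n_rows) for i in range(n_rows)] + [height]
    xs = [j * (width // n_cols) for j in range(n_cols)] + [width]
    out = np.zeros((n_rows, n_cols))
    for i in range(n_rows):
        for j in range(n_cols):
            cell = binary[ys[i]:ys[i + 1], xs[j]:xs[j + 1]]
            out[i, j] = np.count_nonzero(cell) / cell.size
    return out

# Synthetic 300x300 "page": ink fills the top-left cell only
page = np.zeros((300, 300), dtype=np.uint8)
page[:100, :100] = 255

densities = grid_densities(page)
print(densities)
```

Only the top-left cell should report a non-zero density here; any other result would point to an indexing mistake in the split.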

In [None]:
# Analyze layout for all pages
print("Analyzing page layouts...")

layout_analyses = []

for pdf_info in tqdm(pdf_analyses, desc="Layout Analysis"):
    if 'error' in pdf_info:
        continue
    
    pdf_path = pdf_info['filepath']
    pdf_name = pdf_info['filename']
    
    images = extract_pages_as_images(pdf_path, dpi=150)
    
    for page_idx, img in enumerate(images):
        layout = analyze_page_layout(img)
        layout['pdf_name'] = pdf_name
        layout['page_num'] = page_idx + 1
        layout_analyses.append(layout)

layout_df = pd.DataFrame(layout_analyses)
print(f"\n✅ Layout analysis completed for {len(layout_df)} pages")

In [None]:
# Layout statistics
if len(layout_df) > 0:
    print("="*60)
    print("PAGE LAYOUT ANALYSIS")
    print("="*60)
    
    print(f"\n📊 Region Content Density (9-grid):")
    regions = ['top_left', 'top_center', 'top_right', 
               'middle_left', 'middle_center', 'middle_right',
               'bottom_left', 'bottom_center', 'bottom_right']
    for region in regions:
        print(f"   {region:16s}: Mean={layout_df[region].mean():.4f}, Std={layout_df[region].std():.4f}")
    
    print(f"\n📏 Margin Analysis:")
    print(f"   Top margin density: {layout_df['top_margin_density'].mean():.4f}")
    print(f"   Bottom margin density: {layout_df['bottom_margin_density'].mean():.4f}")
    print(f"   Left margin density: {layout_df['left_margin_density'].mean():.4f}")
    print(f"   Right margin density: {layout_df['right_margin_density'].mean():.4f}")
    
    print(f"\n📄 Header/Footer:")
    print(f"   Header density: Mean={layout_df['header_density'].mean():.4f}, Std={layout_df['header_density'].std():.4f}")
    print(f"   Footer density: Mean={layout_df['footer_density'].mean():.4f}, Std={layout_df['footer_density'].std():.4f}")
    
    print(f"\n📈 Content Distribution Variance: {layout_df['content_distribution_variance'].mean():.6f}")

In [None]:
# Visualize layout statistics
if len(layout_df) > 0:
    fig, axes = plt.subplots(2, 2, figsize=(14, 10))
    
    # Region density heatmap (average across all pages)
    region_means = [
        [layout_df['top_left'].mean(), layout_df['top_center'].mean(), layout_df['top_right'].mean()],
        [layout_df['middle_left'].mean(), layout_df['middle_center'].mean(), layout_df['middle_right'].mean()],
        [layout_df['bottom_left'].mean(), layout_df['bottom_center'].mean(), layout_df['bottom_right'].mean()]
    ]
    sns.heatmap(region_means, ax=axes[0, 0], annot=True, fmt='.3f', cmap='YlOrRd',
                xticklabels=['Left', 'Center', 'Right'], yticklabels=['Top', 'Middle', 'Bottom'])
    axes[0, 0].set_title('Average Content Density by Region', fontweight='bold')
    
    # Header vs Footer density
    axes[0, 1].scatter(layout_df['header_density'], layout_df['footer_density'], alpha=0.3)
    axes[0, 1].set_xlabel('Header Density')
    axes[0, 1].set_ylabel('Footer Density')
    axes[0, 1].set_title('Header vs Footer Density', fontweight='bold')
    
    # Content distribution variance
    axes[1, 0].hist(layout_df['content_distribution_variance'], bins=50, color='purple', edgecolor='black', alpha=0.7)
    axes[1, 0].set_xlabel('Content Distribution Variance')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].set_title('Content Distribution Variance', fontweight='bold')
    
    # Margin densities comparison
    margin_data = {
        'Top': layout_df['top_margin_density'].mean(),
        'Bottom': layout_df['bottom_margin_density'].mean(),
        'Left': layout_df['left_margin_density'].mean(),
        'Right': layout_df['right_margin_density'].mean()
    }
    axes[1, 1].bar(margin_data.keys(), margin_data.values(), color=['red', 'blue', 'green', 'orange'])
    axes[1, 1].set_ylabel('Average Density')
    axes[1, 1].set_title('Average Margin Densities', fontweight='bold')
    
    plt.tight_layout()
    plt.show()

---
## 9. Feature Correlation Analysis

In [None]:
# Merge all analysis dataframes
if len(page_df) > 0 and len(table_df) > 0 and len(layout_df) > 0:
    # Merge on pdf_name and page_num
    combined_df = page_df.merge(
        table_df, on=['pdf_name', 'page_num'], how='left'
    ).merge(
        layout_df, on=['pdf_name', 'page_num'], how='left'
    )
    
    print(f"Combined dataset: {len(combined_df)} pages with {len(combined_df.columns)} features")
    print(f"\nColumns: {list(combined_df.columns)}")

In [None]:
# Feature correlation matrix
if 'combined_df' in dir() and len(combined_df) > 0:
    # Select numeric columns for correlation
    numeric_cols = combined_df.select_dtypes(include=[np.number]).columns.tolist()
    
    # Remove columns that might cause issues
    cols_to_exclude = ['page_num']
    feature_cols = [c for c in numeric_cols if c not in cols_to_exclude]
    
    # Calculate correlation matrix
    corr_matrix = combined_df[feature_cols].corr()
    
    # Plot correlation heatmap
    plt.figure(figsize=(16, 14))
    mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
    sns.heatmap(corr_matrix, mask=mask, annot=False, cmap='RdBu_r', center=0,
                square=True, linewidths=0.5)
    plt.title('Feature Correlation Matrix', fontsize=14, fontweight='bold')
    plt.tight_layout()
    plt.show()
    
    # Find highly correlated features
    print("\n🔗 Highly Correlated Feature Pairs (|r| > 0.7):")
    high_corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > 0.7:
                high_corr_pairs.append((
                    corr_matrix.columns[i], 
                    corr_matrix.columns[j], 
                    corr_matrix.iloc[i, j]
                ))
    
    high_corr_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
    for f1, f2, r in high_corr_pairs[:15]:
        print(f"   {f1} <-> {f2}: r={r:.3f}")
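
Highly correlated pairs carry largely redundant information, so one feature from each pair is a candidate for removal before modeling. The sketch below shows one common way to pick them, assuming the same 0.7 threshold; `find_redundant_features` and the `demo` data are hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def find_redundant_features(df: pd.DataFrame, threshold: float = 0.7) -> list:
    """List columns whose |r| with an earlier column exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return [col for col in upper.columns if (upper[col] > threshold).any()]

rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    'feat_a': a,
    'feat_a_copy': a * 2 + rng.normal(scale=0.01, size=200),  # nearly identical to feat_a
    'feat_b': rng.normal(size=200),                           # independent noise
})

to_drop = find_redundant_features(demo, threshold=0.7)
print(to_drop)
```

Which member of a pair to keep is a judgment call; this version simply keeps whichever column appears first.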

---
## 10. Page Clustering Analysis (Unsupervised)

In [None]:
# Perform unsupervised clustering to find natural groupings
if 'combined_df' in dir() and len(combined_df) > 0:
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans
    
    # Select features for clustering
    cluster_features = [
        'edge_density', 'num_lines', 'horizontal_lines', 'vertical_lines',
        'white_space_ratio', 'text_density', 'num_contours',
        'table_area_ratio', 'h_line_density', 'v_line_density',
        'num_intersections', 'structure_ratio',
        'header_density', 'footer_density', 'content_distribution_variance',
        'middle_center'
    ]
    
    # Filter to available features
    available_features = [f for f in cluster_features if f in combined_df.columns]
    
    # Prepare data
    X = combined_df[available_features].fillna(0).values
    
    # Standardize features
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    # PCA for visualization
    pca = PCA(n_components=2)
    X_pca = pca.fit_transform(X_scaled)
    
    print(f"PCA explained variance: {pca.explained_variance_ratio_.sum()*100:.1f}%")
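
Two components are convenient for plotting but may capture only part of the variance. As a sketch, fitting PCA with all components and accumulating `explained_variance_ratio_` shows how many are needed to reach a chosen level (90% here, an arbitrary choice). `X_demo` is synthetic data standing in for the notebook's feature matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-in for the feature matrix: 3 latent factors spread over 10 columns
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X_demo = latent @ mixing + 0.05 * rng.normal(size=(500, 10))

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# Fit all components, then accumulate the explained variance ratios
pca_full = PCA().fit(X_demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(f"Components needed for 90% variance: {n_for_90}")
```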

In [None]:
# Find a suitable number of clusters using the elbow method and silhouette scores
if 'X_scaled' in dir():
    from sklearn.metrics import silhouette_score
    
    inertias = []
    silhouettes = []
    K_range = range(2, 11)
    
    for k in K_range:
        kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
        kmeans.fit(X_scaled)
        inertias.append(kmeans.inertia_)
        silhouettes.append(silhouette_score(X_scaled, kmeans.labels_))
    
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    
    # Elbow plot
    axes[0].plot(K_range, inertias, 'bo-')
    axes[0].set_xlabel('Number of Clusters (k)')
    axes[0].set_ylabel('Inertia')
    axes[0].set_title('Elbow Method for Optimal k', fontweight='bold')
    
    # Silhouette plot
    axes[1].plot(K_range, silhouettes, 'go-')
    axes[1].set_xlabel('Number of Clusters (k)')
    axes[1].set_ylabel('Silhouette Score')
    axes[1].set_title('Silhouette Score vs k', fontweight='bold')
    
    plt.tight_layout()
    plt.show()
    
    # Best k based on silhouette
    best_k = K_range[np.argmax(silhouettes)]
    print(f"\n📊 Best k based on silhouette score: {best_k}")
    print(f"Note: We have 5 target classes, but natural clusters may differ.")

In [None]:
# Cluster with k=5 (matching our target classes)
if 'X_scaled' in dir():
    kmeans_5 = KMeans(n_clusters=5, random_state=42, n_init=10)
    cluster_labels = kmeans_5.fit_predict(X_scaled)
    
    combined_df['cluster'] = cluster_labels
    
    # Visualize clusters in PCA space
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=cluster_labels, cmap='viridis', alpha=0.6)
    plt.colorbar(scatter, label='Cluster')
    plt.xlabel('First Principal Component')
    plt.ylabel('Second Principal Component')
    plt.title('Page Clusters in PCA Space (k=5)', fontsize=14, fontweight='bold')
    plt.show()
    
    # Cluster distribution
    print("\n📊 Cluster Distribution:")
    cluster_counts = combined_df['cluster'].value_counts().sort_index()
    for cluster, count in cluster_counts.items():
        print(f"   Cluster {cluster}: {count} pages ({count/len(combined_df)*100:.1f}%)")

In [None]:
# Analyze cluster characteristics
if 'combined_df' in dir() and 'cluster' in combined_df.columns:
    print("\n📋 Cluster Characteristics:")
    print("="*70)
    
    key_features = ['edge_density', 'num_lines', 'has_table_combined', 'table_area_ratio',
                    'white_space_ratio', 'text_density', 'header_density']
    
    available_key_features = [f for f in key_features if f in combined_df.columns]
    
    cluster_summary = combined_df.groupby('cluster')[available_key_features].mean()
    display(cluster_summary.round(4))
    
    # Visualize cluster characteristics
    fig, axes = plt.subplots(2, 3, figsize=(15, 10))
    axes = axes.flatten()
    
    for idx, feature in enumerate(available_key_features[:6]):
        cluster_means = combined_df.groupby('cluster')[feature].mean()
        axes[idx].bar(cluster_means.index, cluster_means.values, 
                      color=plt.cm.viridis(np.linspace(0, 1, len(cluster_means))))
        axes[idx].set_xlabel('Cluster')
        axes[idx].set_ylabel(feature)
        axes[idx].set_title(f'{feature} by Cluster', fontweight='bold')
    
    plt.tight_layout()
    plt.show()

---
## 11. Sample Pages from Each Cluster

In [None]:
def show_cluster_samples(combined_df: pd.DataFrame, n_samples_per_cluster: int = 3):
    """
    Display sample pages from each cluster.
    
    Args:
        combined_df: DataFrame with cluster assignments
        n_samples_per_cluster: Number of samples to show per cluster
    """
    n_clusters = combined_df['cluster'].nunique()
    
    # squeeze=False keeps axes 2D even with a single cluster or a single sample column
    fig, axes = plt.subplots(n_clusters, n_samples_per_cluster, 
                             figsize=(5*n_samples_per_cluster, 6*n_clusters),
                             squeeze=False)
    
    # Hide every axis up front so slots without an image stay blank
    for ax in axes.flat:
        ax.axis('off')
    
    for cluster_idx in range(n_clusters):
        cluster_pages = combined_df[combined_df['cluster'] == cluster_idx]
        samples = cluster_pages.sample(n=min(n_samples_per_cluster, len(cluster_pages)), random_state=42)
        
        for sample_idx, (_, row) in enumerate(samples.iterrows()):
            pdf_path = row['pdf_path']
            page_num = int(row['page_num'])
            
            # Extract the specific page
            images = extract_pages_as_images(pdf_path, dpi=100, max_pages=page_num)
            if images and len(images) >= page_num:
                img = images[page_num - 1]
                
                ax = axes[cluster_idx, sample_idx]
                ax.imshow(img)
                ax.set_title(f"Cluster {cluster_idx}\n{row['pdf_name']}\nPage {page_num}", fontsize=9)
    
    plt.suptitle('Sample Pages from Each Cluster', fontsize=16, fontweight='bold', y=1.02)
    plt.tight_layout()
    plt.show()

In [None]:
# Show samples from each cluster
if 'combined_df' in dir() and 'cluster' in combined_df.columns and 'pdf_path' in combined_df.columns:
    print("Displaying sample pages from each cluster...")
    show_cluster_samples(combined_df, n_samples_per_cluster=3)

---
## 12. EDA Summary & Insights

In [None]:
print("="*70)
print("                    EDA SUMMARY & KEY INSIGHTS")
print("="*70)

if 'pdf_df' in dir() and len(pdf_df) > 0:
    print(f"\n📁 DATASET OVERVIEW")
    print(f"   Total PDF documents: {len(pdf_df)}")
    print(f"   Total pages: {pdf_df['num_pages'].sum():,}")
    print(f"   Average pages per document: {pdf_df['num_pages'].mean():.1f}")
    print(f"   Total dataset size: {pdf_df['file_size_mb'].sum():.2f} MB")

if 'page_df' in dir() and len(page_df) > 0:
    print(f"\n📊 PAGE IMAGE STATISTICS")
    print(f"   Average dimensions: {page_df['width'].mean():.0f} x {page_df['height'].mean():.0f} pixels")
    print(f"   Aspect ratio range: {page_df['aspect_ratio'].min():.3f} - {page_df['aspect_ratio'].max():.3f}")
    print(f"   Average edge density: {page_df['edge_density'].mean():.4f}")
    print(f"   Average text density: {page_df['text_density'].mean():.4f}")

if 'table_df' in dir() and len(table_df) > 0:
    pages_with_tables = table_df['has_table_combined'].sum() if 'has_table_combined' in table_df.columns else 0
    print(f"\n📋 TABLE DETECTION")
    print(f"   Pages with tables: {pages_with_tables} ({pages_with_tables/len(table_df)*100:.1f}%)")
    print(f"   Pages without tables: {len(table_df) - pages_with_tables} ({(len(table_df) - pages_with_tables)/len(table_df)*100:.1f}%)")

if 'combined_df' in dir() and 'cluster' in combined_df.columns:
    print(f"\n🔬 UNSUPERVISED CLUSTERING (k=5)")
    cluster_counts = combined_df['cluster'].value_counts().sort_index()
    for cluster, count in cluster_counts.items():
        print(f"   Cluster {cluster}: {count} pages ({count/len(combined_df)*100:.1f}%)")

print("\n🎯 KEY OBSERVATIONS FOR CLASSIFICATION")
print("   1. Table detection can help distinguish Tabular vs Text notes")
print("   2. Edge density and line counts are candidate indicators of structured content")
print("   3. Layout region analysis can help identify headers/titles")
print("   4. White space ratio may vary between page types")
print("   5. Unsupervised clustering suggests page groupings to inspect before labeling")

print("\n💡 RECOMMENDATIONS FOR MODELING")
print("   1. Use CNN-based model for visual features (ResNet, EfficientNet)")
print("   2. Table detection features should be included")
print("   3. Consider data augmentation for class imbalance")
print("   4. Multi-scale features may help capture layout patterns")
print("   5. Ensemble approach combining visual + structural features")

In [None]:
# Save analysis results
if 'combined_df' in dir():
    # Save to CSV for later use
    combined_df.to_csv('page_analysis_results.csv', index=False)
    print("\n💾 Analysis results saved to 'page_analysis_results.csv'")
    
    # Summary statistics
    summary_stats = combined_df.describe()
    summary_stats.to_csv('page_analysis_summary.csv')
    print("💾 Summary statistics saved to 'page_analysis_summary.csv'")

---
## 13. Check for Labels/Annotations (if available)

In [None]:
# Check if there are any label/annotation files in the dataset
def find_label_files(root_path: str) -> List[str]:
    """
    Search for potential label/annotation files.
    
    Args:
        root_path: Root directory to search
    
    Returns:
        List of potential label file paths
    """
    label_patterns = ['label', 'annotation', 'class', 'target', 'ground_truth', 'gt']
    label_extensions = ['.csv', '.json', '.txt', '.xlsx', '.xml']
    
    found_files = []
    
    for root, _dirs, files in os.walk(root_path):
        for file in files:
            file_lower = file.lower()
            # Match on the actual file extension, not on any substring of the name
            if file_lower.endswith(tuple(label_extensions)):
                if any(pattern in file_lower for pattern in label_patterns):
                    found_files.append(os.path.join(root, file))
                elif file_lower.endswith(('.csv', '.json')):
                    # Also include generic CSV/JSON files as candidates
                    found_files.append(os.path.join(root, file))
    
    return found_files

# Search for label files
if os.path.exists(DATA_PATH):
    label_files = find_label_files(DATA_PATH)
    
    if label_files:
        print("📋 Potential label/annotation files found:")
        for f in label_files:
            print(f"   - {f}")
    else:
        print("⚠️ No label/annotation files found in the dataset.")
        print("   You may need to create labels manually or use the unsupervised clusters as a starting point.")

In [None]:
# If labels exist, load and analyze them
# Uncomment and modify the path when you have labels

# LABELS_PATH = "path/to/your/labels.csv"  # Update this

# if os.path.exists(LABELS_PATH):
#     labels_df = pd.read_csv(LABELS_PATH)
#     print(f"Labels loaded: {len(labels_df)} entries")
#     print(f"\nLabel distribution:")
#     print(labels_df['label'].value_counts())
    
#     # Visualize class distribution
#     plt.figure(figsize=(10, 6))
#     labels_df['label'].value_counts().plot(kind='bar')
#     plt.title('Class Distribution')
#     plt.xlabel('Class')
#     plt.ylabel('Count')
#     plt.xticks(rotation=45)
#     plt.tight_layout()
#     plt.show()
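
Once labels are available, it is worth checking how well the k-means clusters line up with the real classes before using them as a labeling aid. The sketch below assumes a per-page table with a `label` column and a `cluster` column; the class names and values are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical per-page table with both a manual label and a k-means cluster id
demo = pd.DataFrame({
    'label':   ['auditor', 'auditor', 'sheet', 'sheet', 'notes_text', 'notes_text'],
    'cluster': [0, 0, 1, 1, 2, 1],
})

# Rows = true classes, columns = clusters
print(pd.crosstab(demo['label'], demo['cluster']))

# Adjusted Rand index: 1.0 for identical partitions, near 0 for random ones
ari = adjusted_rand_score(demo['label'], demo['cluster'])
print(f"Adjusted Rand index: {ari:.3f}")
```

A low index would mean the clusters reflect something other than the target classes, in which case they are a weak starting point for labeling.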

---
## Next Steps

Based on this EDA, the next phase should include:

1. **Data Preprocessing**:
   - Convert all PDFs to images at consistent resolution
   - Resize/normalize images for model input
   - Apply any necessary image enhancements

2. **Labeling** (if not already done):
   - Create labels for the 5 classes
   - Use cluster analysis as a starting point for semi-supervised labeling

3. **Feature Engineering**:
   - Extract table detection features
   - Extract layout features (region densities, margins)
   - Extract edge and line features

4. **Model Development**:
   - CNN-based visual classification (ResNet, EfficientNet)
   - Consider incorporating structural features
   - Handle class imbalance if present
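
Two of the modeling steps above can be sketched with scikit-learn, under assumptions: pages are split by source PDF so that pages from one document never appear in both train and test, and class imbalance is handled with "balanced" class weights. The labels and file names below are hypothetical placeholders.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical page-level labels and the PDF each page came from
labels = np.array(['sheet', 'sheet', 'notes_text', 'notes_text', 'notes_text',
                   'notes_text', 'auditor', 'notes_tabular', 'other', 'notes_text'])
pdf_names = np.array(['a.pdf', 'a.pdf', 'a.pdf', 'b.pdf', 'b.pdf',
                      'c.pdf', 'c.pdf', 'd.pdf', 'd.pdf', 'e.pdf'])

# Hold out whole documents so pages from one PDF never land on both sides
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(labels, labels, groups=pdf_names))

# 'balanced' weights are inversely proportional to class frequency.
# Computed on all labels here for brevity; in practice use the training labels only.
classes = np.unique(labels)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=labels)
class_weights = dict(zip(classes, weights))
print(class_weights)
```

Splitting by document matters because pages from the same report share fonts, margins, and scan quality, which can inflate test scores if they leak across the split.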