# 📊 Data Exploration Notebook
## Step 1: Look at your data BEFORE building anything!

**Why explore first?**
- Understand what your images look like
- Check for data quality issues
- Understand the label distribution
- Make informed decisions about preprocessing

## 1. Setup & Imports

In [None]:
# Import libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
from collections import Counter

# Display settings
plt.style.use('default')
%matplotlib inline

print("✓ Libraries imported successfully!")

## 2. Load and Inspect the CSV File

In [None]:
# TODO: Update this path to your actual CSV file location
CSV_PATH = "../data/raw/labels.csv"  # Adjust this!
IMAGE_DIR = "../data/raw/images/"    # Adjust this!

# Load the CSV
# df = pd.read_csv(CSV_PATH)
# print(f"Total samples: {len(df)}")
# print(f"\nColumn names: {df.columns.tolist()}")
# print(f"\nFirst few rows:")
# df.head(10)
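
Once the CSV loads, it can be worth confirming that every image it lists actually exists on disk. The sketch below is one possible way to do that, using only the standard library so it runs without the real dataset. The helper name `find_missing_images`, the `image_path` column, and the demo file names are assumptions for illustration; adjust them to match your CSV.

```python
import csv
import os
import tempfile


def find_missing_images(csv_path, image_dir, path_column="image_path"):
    """Return the image paths listed in the CSV that do not exist under image_dir."""
    missing = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if not os.path.isfile(os.path.join(image_dir, row[path_column])):
                missing.append(row[path_column])
    return missing


# Self-contained demo in a temporary directory with made-up file names
with tempfile.TemporaryDirectory() as tmp:
    image_dir = os.path.join(tmp, "images")
    os.makedirs(image_dir)
    # Create only one of the two files the CSV refers to
    open(os.path.join(image_dir, "a.png"), "wb").close()

    csv_path = os.path.join(tmp, "labels.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["image_path", "label"])
        writer.writerow(["a.png", 0])
        writer.writerow(["b.png", 1])

    missing = find_missing_images(csv_path, image_dir)

print(f"Missing images: {missing}")
```

With your own data you would call `find_missing_images(CSV_PATH, IMAGE_DIR)` instead of the temporary demo files.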

## 3. Examine Label Distribution

Are all Arabic letters equally represented? With imbalanced data, a model can become biased toward the most frequent letters and perform poorly on the rare ones.
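
One simple way to put a number on imbalance is the ratio between the largest and smallest class counts. The sketch below uses `collections.Counter` on a small made-up label list; the `imbalance_ratio` helper and the example labels are illustrative assumptions, not part of the dataset.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Return (counts, ratio), where ratio = largest class count / smallest class count."""
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels is empty")
    ratio = max(counts.values()) / min(counts.values())
    return counts, ratio


# Hypothetical labels: class 0 appears 6 times, class 1 three times, class 2 twice
example_labels = [0] * 6 + [1] * 3 + [2] * 2
counts, ratio = imbalance_ratio(example_labels)

print(f"Samples per class: {dict(sorted(counts.items()))}")
print(f"Imbalance ratio (max/min): {ratio:.1f}")
```

A ratio close to 1 suggests roughly balanced classes; how large a ratio is acceptable is a judgment call. Note that this only covers classes that appear at least once, so it is also worth comparing `len(counts)` with the number of letters you expect.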

In [None]:
# TODO: Uncomment and run when you have your data

# # Count samples per class
# label_column = 'label'  # TODO: Change this to your actual label column name
# label_counts = df[label_column].value_counts().sort_index()
#
# # Plot distribution
# plt.figure(figsize=(14, 5))
# plt.bar(range(len(label_counts)), label_counts.values)
# plt.xlabel('Class (Arabic Letter)')
# plt.ylabel('Number of Samples')
# plt.title('Distribution of Arabic Letters in Dataset')
# plt.tight_layout()
# plt.show()
#
# print(f"\nMin samples: {label_counts.min()}")
# print(f"Max samples: {label_counts.max()}")
# print(f"Mean samples: {label_counts.mean():.1f}")

## 4. Visualize Sample Images

Let's see what the handwritten letters actually look like!

In [None]:
# Arabic letters for reference
ARABIC_LETTERS = [
    'ا', 'ب', 'ت', 'ث', 'ج', 'ح', 'خ', 'د', 'ذ', 'ر',
    'ز', 'س', 'ش', 'ص', 'ض', 'ط', 'ظ', 'ع', 'غ', 'ف',
    'ق', 'ك', 'ل', 'م', 'ن', 'ه', 'و', 'ي'
]

print(f"Total Arabic letters: {len(ARABIC_LETTERS)}")
print(f"Letters: {' '.join(ARABIC_LETTERS)}")

In [None]:
# TODO: Uncomment and run when you have your data

# def show_sample_images(df, image_dir, num_samples=16):
#     """Display a grid of sample images."""
#     fig, axes = plt.subplots(4, 4, figsize=(10, 10))
#
#     # Get random samples
#     samples = df.sample(n=min(num_samples, len(df)))
#
#     for ax, (_, row) in zip(axes.flat, samples.iterrows()):
#         # TODO: Adjust column names to match your CSV
#         img_path = os.path.join(image_dir, row['image_path'])
#         label = row['label']
#
#         try:
#             img = Image.open(img_path)
#             ax.imshow(img, cmap='gray')
#             ax.set_title(f'Label: {label}', fontsize=10)
#         except Exception as e:
#             ax.set_title(f'Error: {e}', fontsize=8)
#
#         ax.axis('off')
#
#     plt.tight_layout()
#     os.makedirs('../outputs', exist_ok=True)  # savefig fails if the folder is missing
#     plt.savefig('../outputs/sample_images.png')
#     plt.show()
#
# show_sample_images(df, IMAGE_DIR)

## 5. Check Image Properties

Let's verify image sizes and color modes. Pixel value ranges are covered in the next section.
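
Before running this on real files, it may help to see what these properties look like on a known image. The sketch below builds a synthetic image with Pillow (the 64x48 size and white color are arbitrary assumptions) and shows that `Image.size` is `(width, height)` while the NumPy array shape is `(height, width, channels)`. It then converts to grayscale and resizes to 32x32, as one possible preprocessing path.

```python
import numpy as np
from PIL import Image

# Synthetic RGB image (width=64, height=48) standing in for a real sample
img = Image.new("RGB", (64, 48), color=(255, 255, 255))
print(img.size, img.mode)   # PIL reports (width, height) and the mode string

arr = np.array(img)
print(arr.shape)            # NumPy reports (height, width, channels)

# Convert to single-channel grayscale and resize to a 32x32 target
small = img.convert("L").resize((32, 32))
small_arr = np.array(small)
print(small.size, small.mode, small_arr.shape, small_arr.dtype)
```

Keeping the width/height ordering in mind avoids transposed dimensions when moving between Pillow and NumPy.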

In [None]:
# TODO: Uncomment and run when you have your data

# def analyze_images(df, image_dir, num_samples=100):
#     """Analyze image properties."""
#     sizes = []
#     modes = []
#
#     samples = df.sample(n=min(num_samples, len(df)))
#
#     for _, row in samples.iterrows():
#         try:
#             img_path = os.path.join(image_dir, row['image_path'])
#             img = Image.open(img_path)
#             sizes.append(img.size)
#             modes.append(img.mode)
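#         except OSError as e:
#             # Report unreadable or missing files instead of silently skipping them
#             print(f"Skipping {row['image_path']}: {e}")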
#
#     print(f"Analyzed {len(sizes)} images")
#     print(f"\nUnique sizes: {set(sizes)}")
#     print(f"Unique modes: {set(modes)}")
#
#     if len(set(sizes)) > 1:
#         print("\n⚠️  WARNING: Images have different sizes!")
#         print("   You'll need to resize them to a consistent size (e.g., 32x32)")
#     else:
#         print("\n✓ All images have the same size")
#
# analyze_images(df, IMAGE_DIR)

## 6. Pixel Value Analysis

Knowing the range and spread of pixel values tells you how to normalize the inputs, for example by scaling 0-255 values to the 0-1 range.
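
As a rough illustration of what normalization can look like, the sketch below applies two common options to a random stand-in batch: rescaling 8-bit values to the [0, 1] range, and standardizing with the measured mean and standard deviation. The batch shape and the random data are assumptions for the demo; which option suits this project is left open.

```python
import numpy as np

# Hypothetical batch of 4 grayscale images, 32x32, with 8-bit pixel values
rng = np.random.default_rng(seed=0)
images = rng.integers(0, 256, size=(4, 32, 32), dtype=np.uint8)

# Option 1: rescale to the [0, 1] range
scaled = images.astype(np.float32) / 255.0

# Option 2: standardize using the mean and std measured on the scaled data
mean, std = scaled.mean(), scaled.std()
standardized = (scaled - mean) / std

print(f"scaled:       min={scaled.min():.2f}, max={scaled.max():.2f}")
print(f"standardized: mean={standardized.mean():.2f}, std={standardized.std():.2f}")
```

If you standardize, the mean and standard deviation are usually computed on the training set only and then reused for validation and test data.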

In [None]:
# TODO: Uncomment and run when you have your data

# def analyze_pixel_values(df, image_dir, num_samples=50):
#     """Analyze pixel value distributions."""
#     all_pixels = []
#
#     samples = df.sample(n=min(num_samples, len(df)))
#
#     for _, row in samples.iterrows():
#         try:
#             img_path = os.path.join(image_dir, row['image_path'])
#             img = Image.open(img_path).convert('L')  # Convert to grayscale
#             pixels = np.array(img).flatten()
#             all_pixels.extend(pixels)
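#         except OSError as e:
#             # Report unreadable or missing files instead of silently skipping them
#             print(f"Skipping {row['image_path']}: {e}")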
#
#     all_pixels = np.array(all_pixels)
#
#     print(f"Pixel value statistics:")
#     print(f"  Min: {all_pixels.min()}")
#     print(f"  Max: {all_pixels.max()}")
#     print(f"  Mean: {all_pixels.mean():.2f}")
#     print(f"  Std: {all_pixels.std():.2f}")
#
#     # Plot histogram
#     plt.figure(figsize=(10, 4))
#     plt.hist(all_pixels, bins=50, edgecolor='black')
#     plt.xlabel('Pixel Value')
#     plt.ylabel('Frequency')
#     plt.title('Distribution of Pixel Values')
#     plt.show()
#
# analyze_pixel_values(df, IMAGE_DIR)

## 7. Summary & Next Steps

After exploring your data, answer these questions:

1. **How many samples do you have?** 
   - Total: ___
   - Per class (roughly): ___

2. **Are the classes balanced?**
   - Yes / No
   - If no, which classes have fewer samples?

3. **What size are the images?**
   - Original size: ___
   - Target size: 32x32 (recommended)

4. **Are images grayscale or color?**
   - Type: ___
   - We'll use: Grayscale (1 channel)

5. **Any quality issues noticed?**
   - Blurry images: ___
   - Missing images: ___
   - Incorrect labels: ___

---

**Next Steps:**
1. Move to `02_model_experiments.ipynb`
2. Implement `dataset.py` based on what you learned here
3. Build your first CNN!