# Welcome to Machine Learning with Python!

Hey there! 👋 Ready to dive into the exciting world of machine learning? This notebook will take you on a journey through the essential Python tools that every data scientist needs to know. Don't worry if you're feeling a bit overwhelmed by all the mathematical notation you see in ML papers – we'll break everything down step by step!

Think of this as your practical guide to understanding how data flows through machine learning pipelines. We'll work with real datasets, create beautiful visualizations, and yes, we'll even tackle some of that intimidating math – but in a way that actually makes sense.

## What You'll Learn (and Why It's Awesome!)

### 1. **Working with Real Data**
You know how Netflix recommends movies or how your phone recognizes faces in photos? It all starts with understanding data! We'll learn:
- How to load and explore datasets (like the famous Iris flowers!)
- Statistical analysis that reveals hidden patterns
- Data preprocessing tricks that can make or break your ML models

### 2. **The Language of Data**
Ever wondered how computers "see" images or process spreadsheets? We'll discover:
- **Tabular data**: Think Excel spreadsheets, but with superpowers! We represent them as feature matrices $\mathbf{X}$ where each row is a sample and each column is a feature
- **Image data**: Those vacation photos are actually just arrays of numbers to a computer! Images become multi-dimensional tensors $\mathbf{I}$ with height, width, and color channels
- **Labels**: The "answers" we want our AI to predict, stored in vectors $\mathbf{y}$

### 3. **Essential Data Science Skills**
We'll master the everyday tools that data scientists use:
- **Smart data slicing**: Grabbing exactly the data you need (like finding all cat photos in your collection)
- **Visualization**: Creating plots that tell compelling stories about your data

---

**Before We Start**: This notebook assumes you're comfortable with basic Python. If you need a refresher on Python syntax, NumPy arrays, or Matplotlib plotting, check out these helpful resources:
- `python_basics.ipynb` for Python fundamentals
- `numpy.ipynb` for NumPy array operations  
- `matplotlib.ipynb` for creating stunning visualizations

---

## Let's Get Our Tools Ready!

Time to import our data science toolkits! These are the essential libraries that every ML practitioner keeps in their back pocket.

In [1]:
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, fetch_openml

# Initialise matplotlib settings
def setup_matplotlib():
    """
    Configure matplotlib for better visualization quality.
    
    Returns:
    --------
    None
        Configures global matplotlib settings for the notebook.
    """
    plt.style.use('default')
    plt.rcParams['figure.figsize'] = (10, 6)
    plt.rcParams['font.size'] = 12
    plt.rcParams['axes.grid'] = True
    plt.rcParams['grid.alpha'] = 0.3

setup_matplotlib()

## 1. Working with Tabular Data - Let's Meet the Iris Dataset!

Ever wondered how botanists classify different species of flowers? The Iris dataset is a classic in machine learning - it contains measurements of iris flowers and their species. It's like having a digital botanist's notebook!

### Understanding How Computers See Tabular Data

When we load tabular data (think spreadsheet), the computer sees it as a **feature matrix** $\mathbf{X}$, a fancy way of organizing information:

$$\mathbf{X} = \begin{pmatrix}
x_{1,1} & x_{1,2} & \cdots & x_{1,d} \\
x_{2,1} & x_{2,2} & \cdots & x_{2,d} \\
\vdots & \vdots & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \cdots & x_{n,d}
\end{pmatrix}$$

Here's what this means in plain English:
- Each **row** represents one flower we measured (a sample)
- Each **column** represents a specific measurement like petal length (a feature)
- $n$ = total number of flowers we measured
- $d$ = number of different measurements we took per flower
- $x_{i,j}$ = the $j$-th measurement of the $i$-th flower

Each row $\mathbf{x}_i = (x_{i,1}, x_{i,2}, ..., x_{i,d})$ is a flower's "profile" - all its measurements bundled together!
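To make the notation concrete, here is a minimal sketch using the first three Iris rows as a tiny hand-made matrix. The variable names (`X_demo`, `x_2`, and so on) are ours, chosen only for illustration. One thing to watch: the math above counts from 1, while NumPy counts from 0, so $x_{i,j}$ corresponds to `X_demo[i-1, j-1]`.

```python
import numpy as np

# First three rows of the Iris feature matrix (n = 3 samples, d = 4 features)
X_demo = np.array([
    [5.1, 3.5, 1.4, 0.2],
    [4.9, 3.0, 1.4, 0.2],
    [4.7, 3.2, 1.3, 0.2],
])

n, d = X_demo.shape          # number of samples, number of features
x_2 = X_demo[1]              # second sample's "profile" (math index i = 2)
petal_length = X_demo[:, 2]  # third feature for every sample (math index j = 3)
x_2_3 = X_demo[1, 2]         # the single element x_{2,3}

print(n, d)
print(x_2)
print(petal_length)
print(x_2_3)
```

Indexing with a single integer (`X_demo[1]`) picks a whole row, while `X_demo[:, 2]` keeps every row and picks one column.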

### Why Everyone Loves the Iris Dataset

The **Iris dataset** has been famous since 1936. Here's what makes it special:

**What's in the box?**
- **150 flowers** total (50 of each species - perfectly balanced!)
- **4 measurements** per flower (all in centimeters):
  1. **Sepal length**: How long the sepal is (ranges 4.3-7.9 cm)
  2. **Sepal width**: How wide the sepal is (ranges 2.0-4.4 cm)  
  3. **Petal length**: How long the petal is (ranges 1.0-6.9 cm)
  4. **Petal width**: How wide the petal is (ranges 0.1-2.5 cm)
- **3 species**: Setosa, Versicolor, and Virginica

**Why is it perfect for learning?**
- **Just right size**: Big enough to be interesting, small enough to understand completely
- **Real-world messiness**: Some species are easy to tell apart, others not so much!
- **Visual friendly**: You can actually plot and see the patterns
- **Proof of Concept**: If your ML algorithm can't handle Iris, it probably can't handle anything!

### Target Labels

We also get the **target vector** $\mathbf{y}$ - basically the "answer key" telling us which species each flower is:
- $y_i = 0$ means "This flower is Setosa"
- $y_i = 1$ means "This flower is Versicolor"  
- $y_i = 2$ means "This flower is Virginica"

This number-coding system is called **label encoding** - we turn flower names into numbers so computers can work with them more easily.
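As a small illustration of label encoding, the sketch below turns a handful of species names into integer codes with `np.unique`. The toy `species` array is made up for the example; it assumes you are happy with codes assigned in alphabetical order, which happens to match the Iris encoding above.

```python
import numpy as np

species = np.array(["setosa", "virginica", "versicolor", "setosa"])

# np.unique returns the sorted class names and, with return_inverse=True,
# the integer code of each entry (its position in the sorted class list).
classes, y_demo = np.unique(species, return_inverse=True)

print(classes)   # class names in alphabetical order
print(y_demo)    # integer label for each flower

# Decoding: index the class array with the integer labels
decoded = classes[y_demo]
print(decoded)
```

scikit-learn's `LabelEncoder` offers the same idea behind a fit/transform interface, if you prefer that style.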

<div class='alert alert-warning'>
    <b>Your Turn!</b> Time to load your first real dataset!

Let's build the `load_iris_dataset()` function that will become your gateway to the fascinating world of flower classification. Here's your mission: 

1. **Import the dataset**: Use scikit-learn's built-in Iris dataset
2. **Extract the Components**: Grab the feature matrix X, labels y, feature names, and species names
3. **Explore what you found**: Print some basic info so we can see what we're working with:
    - How big is our dataset? (samples × features)
    - What measurements did the botanists take? (feature names and their ranges)
    - Which flower species are we dealing with? (target names and how many of each)
4. **Return** everything neatly as a tuple so we can use it later.

</div>

In [2]:
# Hints:
# - Use: from sklearn.datasets import load_iris
# - The loaded dataset has attributes: .data, .target, .feature_names, .target_names
# - Use np.unique(y, return_counts=True) to get class distribution
# - Use np.min() and np.max() to find feature ranges

def load_iris_dataset():
    """
    Load the Iris dataset from scikit-learn and print basic information.
    
    Returns:
    --------
    tuple: (X, y, feature_names, target_names)
        X : numpy.ndarray of shape (150, 4)
            Feature matrix containing sepal/petal measurements
        y : numpy.ndarray of shape (150,)
            Target labels (0=Setosa, 1=Versicolor, 2=Virginica)
        feature_names : list of str
            Names of the 4 features
        target_names : list of str
            Names of the 3 species classes

    """
    # Load using sklearn's helper
    data = load_iris()
    X = data.data
    y = data.target
    feature_names = list(data.feature_names)
    target_names = list(data.target_names)

    # Print helpful diagnostics requested by the notebook
    print("Iris dataset loaded.")
    print(f"Shape: {X.shape} (samples x features)")
    for i, name in enumerate(feature_names):
        col = X[:, i]
        print(f" - {name}: min={col.min():.2f}, max={col.max():.2f}, mean={col.mean():.2f}, std={col.std():.2f}")
    unique, counts = np.unique(y, return_counts=True)
    class_info = {target_names[u]: int(c) for u, c in zip(unique, counts)}
    print("Class distribution:", class_info)

    return X, y, feature_names, target_names

In [None]:


# Load the dataset
X, y, feature_names, target_names = load_iris_dataset()

Iris dataset loaded.
Shape: (150, 4) (samples x features)
 - sepal length (cm): min=4.30, max=7.90, mean=5.84, std=0.83
 - sepal width (cm): min=2.00, max=4.40, mean=3.06, std=0.43
 - petal length (cm): min=1.00, max=6.90, mean=3.76, std=1.76
 - petal width (cm): min=0.10, max=2.50, mean=1.20, std=0.76
Class distribution: {np.str_('setosa'): 50, np.str_('versicolor'): 50, np.str_('virginica'): 50}


### 1.1 Getting to Know Your Data - The Art of Exploration!

Now that we have our dataset loaded, it's time to play detective! **Exploratory Data Analysis (EDA)** is like getting to know a new friend - you want to understand their personality, quirks, and what makes them tick. The same goes for data.

**What questions should we ask our data?**

1. **Size and Shape**: 
   - How much data do we have to work with? (More samples usually = happier ML algorithms!)
   - How many features are we juggling? (Too many can be overwhelming!)

2. **What Type of Data Are We Dealing With?**
   - Are our measurements continuous numbers (like height) or categories (like colors)?
   - What units are we working with? (Centimeters, pixels, counts?)
   - What's the range of each measurement?

3. **Statistical Personality**:
   - **Average values**: What's typical? $\mu = \frac{1}{n}\sum_{i=1}^n x_i$
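   - **Spread**: How much do values vary? $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i - \mu)^2}$ (this is the *sample* standard deviation; `np.std` divides by $n$ by default, so pass `ddof=1` if you want to match this formula exactly)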
   - **Extremes**: What are the smallest and largest values?

4. **Data Quality Check**:
   - Any missing information? (Like incomplete survey responses)
   - Strange outliers? (Maybe someone measured a petal in meters instead of centimeters!)
   - Are our classes balanced? (Equal numbers of each flower type, or is one species hogging the spotlight?)

5. **Feature Relationships**:
   - Do some measurements tend to go together? (Like tall people often having longer arms)
   - Which features might be most helpful for telling species apart?

**Pro Tip**: When we compute statistics, direction matters. Computing across samples (axis=0) tells us about each feature, while computing across features (axis=1) tells us about each sample.
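Here is a quick sketch of what the `axis` argument does, using a toy matrix we made up (3 samples, 2 features). It also shows the difference between dividing by $n$ and by $n-1$ when computing the standard deviation.

```python
import numpy as np

# Toy matrix: 3 samples (rows) x 2 features (columns)
X_demo = np.array([[1.0, 2.0],
                   [3.0, 4.0],
                   [5.0, 6.0]])

feature_means = X_demo.mean(axis=0)   # collapse the rows: one value per feature
sample_means = X_demo.mean(axis=1)    # collapse the columns: one value per sample

# np.std divides by n by default (ddof=0); ddof=1 divides by n - 1
std_population = X_demo.std(axis=0)
std_sample = X_demo.std(axis=0, ddof=1)

print(feature_means)    # [3. 4.]
print(sample_means)     # [1.5 3.5 5.5]
print(std_population)
print(std_sample)       # [2. 2.]
```

A handy way to remember it: the axis you name is the one that disappears from the result's shape.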

<div class='alert alert-warning'>
    <b>Detective Time!</b> Let's analyze our dataset like a data scientist!
    
Build the `analyze_dataset_structure()` function to reveal all the secrets hidden in our data:

1. **Getting Started**: Extract how many samples and features we have (hint: check X.shape)
2. **Sneak peek**: Show the first 5 flowers so we can see what the data actually looks like
3. **Crunch the numbers** (NumPy style):
    - Calculate average measurements for each feature (what's a "typical" petal length?)
    - Find the standard deviation (how much do measurements vary?)
    - Discover the extremes (smallest and largest values for each feature)
4. **Count the flowers**:
    - How many of each species do we have?
    - What percentage of our dataset does each species represent?
5. **Make it pretty**: Print everything with nice formatting using the actual feature and species names!
6. **Return** a dictionary with all your findings for future use

</div>

In [4]:
def analyze_dataset_structure(X, y, feature_names, target_names):
    """
    Perform comprehensive exploratory data analysis on a tabular dataset.
    
    Parameters:
    -----------
    X : numpy.ndarray of shape (n_samples, n_features)
        Feature matrix
    y : numpy.ndarray of shape (n_samples,)
        Target labels
    feature_names : list of str
        Names of features
    target_names : list of str
        Names of target classes
    
    Returns:
    --------
    dict : Analysis results containing:
        - 'shape': Dataset dimensions
        - 'statistics': Mean, std, min, max per feature
        - 'class_distribution': Sample count per class

    """

    # Basic shape info
    n_samples, n_features = X.shape
    print(f"Dataset shape: {n_samples} samples, {n_features} features")

    # Sneak peek: first 5 samples
    print("First 5 samples (rows):")
    print(X[:5])

    # Feature-wise statistics (compute along axis=0)
    mean = np.mean(X, axis=0)
    std = np.std(X, axis=0)
    minv = np.min(X, axis=0)
    maxv = np.max(X, axis=0)

    feature_stats = {
        'mean': mean.tolist(),
        'std': std.tolist(),
        'min': minv.tolist(),
        'max': maxv.tolist()
    }

    # Dataset class distribution (counts and percentages)
    unique, counts = np.unique(y, return_counts=True)
    class_counts = {target_names[int(u)]: int(c) for u, c in zip(unique, counts)}
    class_percent = {target_names[int(u)]: float(c) / n_samples * 100.0 for u, c in zip(unique, counts)}
    class_distribution = {'counts': class_counts, 'percent': class_percent}

    # Pretty print statistics using provided names
    print("Feature statistics:")
    for i, name in enumerate(feature_names):
        print(f" - {name}: mean={mean[i]:.2f}, std={std[i]:.2f}, min={minv[i]:.2f}, max={maxv[i]:.2f}")
    print("Class distribution (counts):", class_counts)
    print("Class distribution (%):", {k: f"{v:.1f}%" for k, v in class_percent.items()})

    return {
        'shape': (n_samples, n_features),
        'statistics': feature_stats,
        'class_distribution': class_distribution
    }

In [5]:
# Analyze the Iris dataset
analysis_results = analyze_dataset_structure(X, y, feature_names, target_names)

Dataset shape: 150 samples, 4 features
First 5 samples (rows):
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
Feature statistics:
 - sepal length (cm): mean=5.84, std=0.83, min=4.30, max=7.90
 - sepal width (cm): mean=3.06, std=0.43, min=2.00, max=4.40
 - petal length (cm): mean=3.76, std=1.76, min=1.00, max=6.90
 - petal width (cm): mean=1.20, std=0.76, min=0.10, max=2.50
Class distribution (counts): {np.str_('setosa'): 50, np.str_('versicolor'): 50, np.str_('virginica'): 50}
Class distribution (%): {np.str_('setosa'): '33.3%', np.str_('versicolor'): '33.3%', np.str_('virginica'): '33.3%'}


## 2. Working with Images - From Pixels to Patterns!

Ready to dive into the world of computer vision? Let's see how computers "see" images! Spoiler alert: they don't see pictures like we do - they see massive grids of numbers. But that's actually pretty amazing when you think about it!

### How Computers See Images - It's All About Arrays!

While tabular data is like a spreadsheet, **images are like multi-dimensional grids of numbers**. Let's break this down:

**Grayscale Images** (black and white): A single grid of brightness values, one number per pixel.
$$\mathbf{I} = \begin{pmatrix}
I_{1,1} & I_{1,2} & \cdots & I_{1,W} \\
I_{2,1} & I_{2,2} & \cdots & I_{2,W} \\
\vdots & \vdots & \ddots & \vdots \\
I_{H,1} & I_{H,2} & \cdots & I_{H,W}
\end{pmatrix}$$

**Color Images**: Now we have THREE grids stacked on top of each other (like a sandwich).
- **Red layer** + **Green layer** + **Blue layer** = Full color image.
- $H$ = how tall the image is (height in pixels)
- $W$ = how wide the image is (width in pixels)  
- $C$ = number of color channels (1 for B&W, 3 for RGB color)
- $I_{i,j,c}$ = how much of color $c$ there is at position $(i,j)$

### Pixel Intensity and Color Spaces

**Pixel Intensities**:
- **8-bit images**: Integer values $I_{i,j} \in \{0, 1, 2, ..., 255\}$
  - 0 = black (no intensity)
  - 255 = white (maximum intensity)
- **Normalized images**: Float values $I_{i,j} \in [0, 1]$
  - 0.0 = black, 1.0 = white
  - Common in deep learning (better numerical stability)

**RGB Color Space**: Each pixel has three components $(R, G, B)$
- $R$ = Red channel intensity
- $G$ = Green channel intensity  
- $B$ = Blue channel intensity
- Combined they produce the perceived color
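The two intensity conventions and the channel layout can be sketched with tiny made-up arrays. This is only an illustration under the assumption that images are stored as NumPy arrays in (height, width, channels) order; some libraries use other layouts.

```python
import numpy as np

# A tiny 1x3 grayscale "image" with 8-bit intensities
gray_u8 = np.array([[0, 128, 255]], dtype=np.uint8)

# Convert to float first, then scale to [0, 1]
gray_f = gray_u8.astype(np.float32) / 255.0

# A 2x2 RGB image of shape (H, W, C) = (2, 2, 3), filled with pure red
rgb = np.zeros((2, 2, 3), dtype=np.uint8)
rgb[..., 0] = 255   # channel 0 is red; green and blue stay at 0

print(gray_f.min(), gray_f.max())
print(rgb.shape)
print(rgb[0, 0])    # the (R, G, B) triple of the top-left pixel
```

Converting to a float type before dividing keeps the result in floating point and avoids surprises from 8-bit integer arithmetic.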

---

### Meet MNIST - The Handwriting Dataset

**MNIST** (short for "Modified National Institute of Standards and Technology" database, a collection of handwritten digits) is like the training wheels of computer vision. If Iris is about flowers, MNIST is about recognizing handwritten numbers.

**Dataset Properties**:
- **70,000 images** total (60,000 training + 10,000 test)
- **28×28 pixels** per image ($H = W = 28$)
- **Grayscale** ($C = 1$, single channel)
- **10 classes**: Digits 0, 1, 2, ..., 9
- **Pixel values**: $I_{i,j} \in [0, 255]$ (8-bit grayscale)

**How does the computer see it?**
- Feature matrix: $\mathbf{X} \in \mathbb{R}^{70000 \times 784}$ (flattened pixels)
- Image tensor: $\mathbf{I} \in \mathbb{R}^{70000 \times 28 \times 28}$ (spatial structure preserved)
- Target vector: $\mathbf{y} \in \{0, 1, 2, ..., 9\}^{70000}$
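The relationship between the flattened feature matrix and the image tensor is just a reshape. The sketch below uses synthetic numbers as a stand-in for MNIST (no download needed), and the names `X_flat` and `imgs` are ours.

```python
import numpy as np

# Synthetic stand-in for MNIST: 3 "images" stored as flat rows of 784 values
X_flat = np.arange(3 * 784).reshape(3, 784)

# Restore the spatial structure: (n, 784) -> (n, 28, 28)
imgs = X_flat.reshape(-1, 28, 28)

# Row-major layout: pixel (row r, column c) sits at flat index r * 28 + c
r, c = 5, 7
same_pixel = imgs[0, r, c] == X_flat[0, r * 28 + c]

# Flatten again for models that expect a feature matrix
X_again = imgs.reshape(len(imgs), -1)

print(imgs.shape, X_again.shape, same_pixel)
```

Depending on your scikit-learn version, `fetch_openml` may hand back a pandas DataFrame rather than a NumPy array; passing `as_frame=False` (or converting with `.to_numpy()`) is one way to get an array you can reshape.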

**Why is everyone obsessed with MNIST?**
- **Perfect for beginners**: Small enough to experiment with quickly
- **Benchmark dataset**: Standard for comparing algorithm performance
- **Real-world relevance**: Basis for OCR (Optical Character Recognition) systems
- **Visualization**: Easy to see what the algorithm is learning

<div class='alert alert-warning'>
    <b>Your Digital Handwriting Adventure!</b> Time to load some real handwritten digits!

Create the `load_mnist_dataset()` function to bring those handwritten digits to life:

1. **Get the dataset**: Use scikit-learn's fetch_openml with name="mnist_784" and version=1
2. **Extract the goods**: Pull out the images and their corresponding digit labels
3. **Size control**: If max_samples is specified, take just a subset
4. **Shape it right**: Reshape the flat pixel arrays back into proper 28×28 images 
5. **Clean up labels**: Convert those labels to nice clean integers (0, 1, 2, ..., 9)
6. **Show off your findings**: Print some stats about what you loaded:
    - How many images and what size?
    - What's the range of pixel values? (should be 0-255)
    - Which digits are available?
    - How many examples of each digit do we have?
7. Return the processed images and labels

</div>

In [None]:
# Notes:
# ------
# - Original MNIST images are 28x28 pixels, grayscale
# - Pixel intensities range from 0 (black) to 255 (white)
# - Dataset is roughly balanced with ~7000 samples per digit class

def load_mnist_dataset(max_samples=5000):
    """
    Load the MNIST handwritten digits dataset from scikit-learn.
    
    Parameters:
    -----------
    max_samples : int, default=5000
        Maximum number of samples to load (for faster processing in tutorials).
        Set to None to load all 70,000 samples.
    
    Returns:
    --------
    tuple: (images, labels)
        images : numpy.ndarray of shape (n_samples, 28, 28)
            Grayscale images with pixel values in [0, 255]
        labels : numpy.ndarray of shape (n_samples,)
            Integer labels from 0-9 representing digit classes

    """
    raise NotImplementedError('Implement the load_mnist_dataset function.')

    print("Loading MNIST dataset (this may take a moment)...")
    
    # Load data from OpenML
    
    # Convert to numpy arrays and reshape

    
    # Subsample if requested (for faster processing)

    
    return images, labels



# Load MNIST dataset (using subset for faster processing)
images, labels = load_mnist_dataset(max_samples=5000)

<div class='alert alert-warning'>
    <b>Your First Computer Vision Gallery</b> Time to create your image visualization toolkit! The `visualize_sample_images()` function is your gateway to displaying handwritten digits.

Your implementation should: 

1. **Set up the gallery**: Create a subplot grid using `plt.subplots(1, n_samples, figsize=(15, 3))`

2. **Handle the special case**: When `n_samples=1`, matplotlib returns a single `Axes` object rather than an array, so make sure your loop still works in that case (for example, by wrapping the object in a list).

3. **Display each image** (loop through `range(n_samples)`):
   - Show image using `imshow()` with grayscale colormap (`cmap='gray'`)
   - Set pixel range: `vmin=0, vmax=255` for consistent brightness
   - Add title showing the label for that image
   - Remove axis ticks and labels

4. **Add overall title** to the entire figure.
    
</div>
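One possible way to sidestep the single-`Axes` special case is the `squeeze=False` option of `plt.subplots`, which always returns a 2-D array of axes. This is a minimal sketch of that idea, not the required solution; wrapping the single object in a list works just as well.

```python
import matplotlib.pyplot as plt

n = 1  # the tricky case: only one image to show

# squeeze=False keeps `axes` as a 2-D array of shape (1, n), even when n == 1
fig, axes = plt.subplots(1, n, figsize=(3 * n, 3), squeeze=False)
axes = axes.ravel()  # flatten to a 1-D array so a simple loop works

for ax in axes:
    ax.set_xticks([])  # remove tick marks, as the gallery should do
    ax.set_yticks([])

n_axes = len(axes)
print(n_axes)
plt.close(fig)  # tidy up; this demo figure is not meant to be displayed
```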

In [None]:
def visualize_sample_images(images, labels, n_samples=5, title="Sample Images"):
    """
    Display a grid of sample images with their labels.
    
    Parameters:
    -----------
    images : numpy.ndarray of shape (n_images, height, width)
        Array of grayscale images
    labels : numpy.ndarray of shape (n_images,)
        Corresponding labels for each image
    n_samples : int, default=5
        Number of images to display
    title : str, default="Sample Images"
        Title for the plot
    """

    raise NotImplementedError("Implement the function to visualize sample images.")

    # Leave as is
    plt.tight_layout()
    plt.show()

In [None]:
# Visualize sample images
visualize_sample_images(images, labels, n_samples=5, title="Sample MNIST Digits")

## 3. Array Indexing and Slicing: Mathematical Operations on Data

**Array indexing** is not just a programming concept—it's a fundamental mathematical operation for **data selection** and **subsetting**. In machine learning, we constantly need to:

- **Select subsets** of data for training/validation/testing
- **Extract specific features** for analysis
- **Filter data** based on conditions
- **Crop images** to focus on regions of interest

### Mathematical Notation for Data Slicing

**For tabular data** $\mathbf{X} \in \mathbb{R}^{n \times d}$:

**Row Selection** (Sample subset):
- $\mathbf{X}_{[i:j, :]}$ = rows $i$ through $j-1$, all columns
- $\mathbf{X}_{[I, :]}$ where $I = \{i_1, i_2, ..., i_k\}$ = specific rows

**Column Selection** (Feature subset):
- $\mathbf{X}_{[:, j]}$ = all rows, column $j$ (single feature)
- $\mathbf{X}_{[:, J]}$ where $J = \{j_1, j_2, ..., j_m\}$ = feature subset

**Boolean Indexing** (Conditional selection):
- $\mathbf{X}_{[mask, :]}$ where $mask = (y == c)$ for class $c$
- Example: $\mathbf{X}_{setosa} = \{\mathbf{x}_i : y_i = 0\}$ (all Setosa samples)
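The selection patterns above map almost one-to-one onto NumPy syntax. Here is a small sketch on toy data we made up (6 samples, 3 features, labels 0/1/2); the variable names are only for illustration.

```python
import numpy as np

# Toy data: 6 samples x 3 features, with class labels 0/1/2
X_demo = np.arange(18).reshape(6, 3)
y_demo = np.array([0, 0, 1, 1, 2, 2])

rows = X_demo[1:4, :]          # rows 1, 2, 3 (the stop index 4 is excluded)
picked = X_demo[[0, 5], :]     # specific rows via an index list
col = X_demo[:, 1]             # single feature -> 1-D array of length 6
cols = X_demo[:, [0, 2]]       # feature subset -> shape (6, 2)

mask = (y_demo == 0)           # boolean mask, True where the class is 0
X_class0 = X_demo[mask]        # only the samples of class 0

print(rows.shape, picked.shape, col.shape, cols.shape, X_class0.shape)
```

Note that selecting a single column with an integer drops that dimension, whereas selecting with a list (`[:, [1]]`) keeps the result two-dimensional.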

### Advanced Indexing Operations

**Statistical Filtering**:
Given feature $j$ with values $(x_{1,j}, x_{2,j}, ..., x_{n,j})$, mean $\mu_j$, standard deviation $\sigma_j$, and quantiles $Q_{0.25}$ and $Q_{0.75}$:

- **Outlier removal**: $mask = (|x_{i,j} - \mu_j| < 2\sigma_j)$
- **Quantile filtering**: $mask = (Q_{0.25} \leq x_{i,j} \leq Q_{0.75})$
- **Threshold filtering**: $mask = (x_{i,j} > \tau)$ for threshold $\tau$

**Multi-condition Filtering**:
- **Logical AND**: $mask_1 \land mask_2$ (both conditions true)
- **Logical OR**: $mask_1 \lor mask_2$ (either condition true)
- **Logical NOT**: $\neg mask$ (condition false)
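The filters and logical combinations above can be sketched on a single made-up feature column. In NumPy the element-wise operators are `&`, `|` and `~` (Python's `and`, `or` and `not` do not work on arrays), and each comparison needs its own parentheses.

```python
import numpy as np

# One feature with an obvious outlier at the end
x = np.array([1.0, 2.0, 2.0, 3.0, 2.0, 50.0])

mu, sigma = x.mean(), x.std()
inliers = np.abs(x - mu) < 2 * sigma        # outlier-removal mask

q25, q75 = np.quantile(x, [0.25, 0.75])
middle = (x >= q25) & (x <= q75)            # quantile filter (logical AND)

above = x > 2.5                             # threshold filter with tau = 2.5
either = inliers | above                    # logical OR
not_above = ~above                          # logical NOT

print(x[inliers])     # the value 50.0 is dropped
print(x[middle])
print(x[not_above])
```

The choice of $2\sigma$ here simply follows the formula above; in practice the cut-off is a judgment call that depends on the data.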

### Image Slicing: Spatial Data Operations

**For images** $\mathbf{I} \in \mathbb{R}^{H \times W}$ or $\mathbf{I} \in \mathbb{R}^{H \times W \times C}$:

**Spatial Cropping**:
$$\mathbf{I}_{crop} = \mathbf{I}_{[y_1:y_2, x_1:x_2]} \quad \text{or} \quad \mathbf{I}_{[y_1:y_2, x_1:x_2, :]}$$

**Common Image Operations**:
- **Center crop**: Extract $k \times k$ region from image center
- **Corner extraction**: $\mathbf{I}_{[0:k, 0:k]}$ (top-left corner)
- **Region of Interest (ROI)**: Extract specific spatial regions
- **Patch extraction**: For data augmentation or sliding window analysis

<div class='alert alert-warning'>
    <b>Tabular Data Manipulation</b> Time to master the art of smart data slicing.

Your implementation should: 

1. **Feature subset selection**: 
   - Pick specific features (columns) for the first 10 samples: `X[:10, feature_indices]`
   - Print selected feature names and resulting shape

2. **Class-based filtering** (Boolean indexing):
   - Create boolean mask: `y == target_class`
   - Apply mask to filter: `X[class_mask], y[class_mask]`
   - Count and display samples for the selected class

3. **Condition-based filtering**:
   - Create condition mask: `X[:, feature_idx] > threshold`  
   - Filter data and analyze species distribution
   - Show sample information meeting the condition

4. **Return results dictionary** with feature_subset, class_subset, and condition_subset


</div>

In [None]:
def demonstrate_tabular_slicing(X, y, feature_names, target_names):
    """
    Demonstrate various data slicing and filtering techniques on tabular data.
    
    Parameters:
    -----------
    X : numpy.ndarray of shape (n_samples, n_features)
        Feature matrix
    y : numpy.ndarray of shape (n_samples,)
        Target labels
    feature_names : list of str
        Names of features
    target_names : list of str
        Names of target classes
    
    Returns:
    --------
    dict : Dictionary containing sliced data examples:
        - 'feature_subset': Selected features for first 10 samples
        - 'class_subset': Samples belonging to specific class
        - 'condition_subset': Samples meeting specific condition

    """

    raise NotImplementedError("Implement the function to demonstrate tabular slicing.")
    
    # Example 1: Feature subset selection
    print("1. FEATURE SUBSET SELECTION")

    
    # Example 2: Class-based filtering (Boolean indexing)
    print(f"\n2. CLASS-BASED FILTERING")


    # Example 3: Condition-based filtering
    print(f"\n3. CONDITION-BASED FILTERING")
    
    return {
        'feature_subset': X_slice,
        'class_subset': X_class,
        'condition_subset': X_condition
    }

In [None]:
# Demonstrate tabular data slicing techniques
slicing_results = demonstrate_tabular_slicing(X, y, feature_names, target_names)

### Image Data Indexing: Spatial Coordinates and Array Operations

**Coordinate Systems in Images**:

In NumPy arrays (and most computer vision libraries), images use **matrix indexing**:
- **First dimension**: rows (y-coordinate, vertical position)
- **Second dimension**: columns (x-coordinate, horizontal position)
- **Origin (0,0)**: Top-left corner of the image

$$\mathbf{I}[y, x] = \text{pixel at row } y, \text{ column } x$$

**Spatial Relationships**:
- Moving **right**: increase $x$ (column index)
- Moving **down**: increase $y$ (row index)
- **Width**: number of columns ($W$)
- **Height**: number of rows ($H$)
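Because the row index comes first, it is easy to mix up $x$ and $y$. The following sketch, on a tiny made-up image, lights up one pixel to show the `[y, x]` order in action.

```python
import numpy as np

H, W = 4, 6                       # 4 rows tall, 6 columns wide
img = np.zeros((H, W), dtype=np.uint8)

y, x = 1, 4                       # row 1 (second from the top), column 4
img[y, x] = 255                   # index order is [row, column] = [y, x]

print(img.shape)                  # (height, width)
print(img)
```

In the printed grid the bright pixel appears in the second row and fifth column, counting from the top-left corner.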

**Mathematical Operations on Image Regions**:

**Cropping Window**: Extract subregion $\mathbf{I}_{crop} \in \mathbb{R}^{h \times w}$
$$\mathbf{I}_{crop} = \mathbf{I}_{[y_{start}:y_{end}, x_{start}:x_{end}]}$$

where:
- $h = y_{end} - y_{start}$ (crop height)
- $w = x_{end} - x_{start}$ (crop width)

**Center Cropping** (common preprocessing technique):
For image $\mathbf{I} \in \mathbb{R}^{H \times W}$, extract center region of size $h \times w$:

$$y_{start} = \left\lfloor \frac{H - h}{2} \right\rfloor, \quad x_{start} = \left\lfloor \frac{W - w}{2} \right\rfloor$$
$$y_{end} = y_{start} + h, \quad x_{end} = x_{start} + w$$
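The center-crop formulas translate directly into a few lines of NumPy. The helper name `center_crop` below is hypothetical (it is not part of any library used in this notebook), and the sketch assumes the image's first two dimensions are height and width.

```python
import numpy as np

def center_crop(img, h, w):
    """Return the central h x w window of an H x W (or H x W x C) array."""
    H, W = img.shape[:2]
    if h > H or w > W:
        raise ValueError("crop size must not exceed the image size")
    y_start = (H - h) // 2        # integer division keeps indices whole
    x_start = (W - w) // 2
    return img[y_start:y_start + h, x_start:x_start + w]

# Synthetic 28x28 "image" whose values encode their own position
img = np.arange(28 * 28).reshape(28, 28)
crop = center_crop(img, 14, 14)

print(crop.shape)
print(crop[0, 0] == img[7, 7])    # the crop starts at row 7, column 7
```

Basic slicing like this returns a view of the original array rather than a copy, so call `.copy()` on the result if you plan to modify the crop without touching the source image.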

<div class='alert alert-warning'>
    <b>Image Cropping Basics</b>  Ever wondered how Instagram creates those perfect square crops? Or how computer vision models focus on the most important parts of an image? Image slicing!

Your implementation of `demonstrate_image_slicing` should: 

1. **Center cropping** (extract 14×14 center region from 28×28 images):
   - Calculate start indices: `(28-14)//2 = 7` for both dimensions
   - Extract center crop: `images[i, 7:21, 7:21]`
   - Store in results dictionary

2. **Create before/after visualization**: Show original vs cropped side-by-side

3. **Return results dictionary** with all extracted regions

    
</div>

In [None]:
def demonstrate_image_slicing(images, labels, n_samples=5):
    """
    Demonstrate various image slicing and cropping techniques.
    
    Parameters:
    -----------
    images : numpy.ndarray of shape (n_images, height, width)
        Array of grayscale images
    labels : numpy.ndarray of shape (n_images,)
        Corresponding labels for each image
    n_samples : int, default=5
        Number of images to process and display
    
    Returns:
    --------
    dict : Dictionary containing:
        - 'original_shapes': Shape of original images
        - 'cropped_images': List of center-cropped images
        - 'cropped_shapes': Shape of cropped images
        - 'crop_coordinates': Start and end indices used for the crop
    

    """
    print("IMAGE SLICING AND CROPPING DEMONSTRATIONS")
    
    raise NotImplementedError("Implement the function to demonstrate image slicing.")

    
    # Calculate center crop coordinates

    
    # Perform center cropping

    
    # Visualize original vs cropped images

    print("="*60)

    plt.tight_layout()
    plt.show()
    
    # Print shape information

    
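    return {
        'original_shapes': None,   # TODO: shape of the original images
        'cropped_images': None,    # TODO: list of center-cropped images
        'cropped_shapes': None,    # TODO: shape of the cropped images
        'crop_coordinates': None   # TODO: start and end indices of the crop
    }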

In [None]:
# Demonstrate image slicing techniques
print("Applying image slicing to MNIST digits...")
cropping_results = demonstrate_image_slicing(images, labels, n_samples=5)


<div class='alert alert-warning'>
    <b>Advanced Image Dissection!</b> You'll dissect a single digit into 8 different regions to understand how different parts contribute to recognition. Looking at an image region by region is loosely related to how some object detection methods examine parts of an image.

Your implementation should: 
    
1. **Define slicing operations dictionary** with 8 regions:
   - 4 corner quadrants (top-left, top-right, bottom-left, bottom-right)
   - 4 half regions (top half, bottom half, left half, right half)
   - Use `slice(start, end)` objects for each dimension

2. **Create visual dissection**:
   - Apply each slice to the first image
   - Create 2×4 subplot grid: `plt.subplots(2, 4, figsize=(16, 8))`
   - Display each region with title showing name and shape
   - Use consistent grayscale scaling: `vmin=0, vmax=255`

3. **Professional formatting**:
   - Set subplot titles with region name and dimensions
   - Remove axes for clean visualization
   - Add overall title showing the digit label

4. **Return results dictionary** with all sliced regions


</div>
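If `slice` objects are new to you, this short sketch shows how they stand in for the `start:end` syntax. It uses a tiny made-up 4×4 array and only two example regions; the dictionary layout is one possible design, not the only one.

```python
import numpy as np

img = np.arange(16).reshape(4, 4)   # small stand-in for a 28x28 digit
half = img.shape[0] // 2

# Each region is a (row_slice, column_slice) pair; slice(None) means ":"
regions = {
    "top-left": (slice(0, half), slice(0, half)),
    "right half": (slice(None), slice(half, None)),
}

# Apply every pair of slices to the image
parts = {name: img[ys, xs] for name, (ys, xs) in regions.items()}

for name, part in parts.items():
    print(name, part.shape)
```

Storing slices in a dictionary keeps the region names and their coordinates together, which makes it easy to loop over them when plotting.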

In [None]:
def demonstrate_advanced_slicing(images, labels, n_samples=3):
    """
    Demonstrate advanced image slicing techniques.
    
    Parameters:
    -----------
    images : numpy.ndarray of shape (n_images, height, width)
        Array of grayscale images  
    labels : numpy.ndarray of shape (n_images,)
        Corresponding labels
    n_samples : int, default=3
        Number of images to process
    
    Returns:
    --------
    dict : Various sliced regions (quadrants and halves)
    """
    print(f"\nADVANCED SLICING TECHNIQUES")

    raise NotImplementedError("Implement the function to demonstrate advanced image slicing.")

    slicing_ops = {}

    # Plot the Results
    plt.tight_layout()
    plt.show()
    
    return {name: img[y_slice, x_slice] for name, (y_slice, x_slice) in slicing_ops.items()}

In [None]:
# Demonstrate advanced image slicing techniques
advanced_slicing_results = demonstrate_advanced_slicing(images, labels, n_samples=1)

---

# Wrapping Up

You have now completed a practical introduction to working with tabular and image data in Python. Throughout this notebook, you have developed essential skills for loading, analyzing, slicing, and visualizing real-world datasets — the core competencies for any machine learning practitioner.

The next phase of your journey will focus on **Exploratory Data Analysis (EDA)**. EDA enables you to uncover patterns, relationships, and potential issues within your data before building predictive models.

## Next Steps: Homework Assignment

Your upcoming homework will extend the concepts and techniques introduced here. Check out `homework.ipynb`, where you will:

- Apply advanced EDA methods to the Iris and MNIST datasets
- Visualize data distributions, correlations, and outliers