# Data Visualization: Statistical Analysis and Pattern Recognition

Welcome to the exciting world of **Exploratory Data Analysis (EDA)**!

In the `exercise.ipynb` notebook, you took your first steps into machine learning pipelines and learned the fundamental skill of loading data. Now it's time to dive deeper and investigate the hidden properties, patterns, and secrets that these datasets hold.

## Why This Matters

Think of data analysis as being a detective investigating a case. Raw data is like a collection of clues scattered across a crime scene. Without proper investigation techniques, these clues remain meaningless. But with the right analytical tools, patterns emerge, relationships become clear, and insights that can make or break your machine learning models are revealed.

## Practical Skills You'll Gain
- **Statistical analysis**: Understanding what your data is trying to tell you
- **Pattern recognition**: Identifying relationships that inform feature engineering
- **Data preprocessing**: Cleaning and preparing data so your ML models can work their magic
- **Quality assessment**: Detecting issues that could derail your machine learning projects


Ready to become a data detective? Let's begin!

---

## Environment Setup

**Building on Your Previous Work!**

The cells below will automatically set up your analysis environment by importing the functions you implemented in `exercise.ipynb`. This way, you can focus on the exciting new concepts without having to rewrite code you've already mastered.

### What's Happening Here:
1. **Library Import**: Loading all essential data science libraries
2. **Function Import**: Automatically importing your exercise functions 
3. **Data Loading**: Pre-loading both Iris and MNIST datasets for immediate use

**Just run the cells and you'll be all set for the homework!**

---

In [None]:
# Import necessary libraries for data analysis and visualization
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, fetch_openml
from scipy import stats
import nbformat
from nbconvert import PythonExporter

# Configure matplotlib for better visualizations
plt.style.use('default')
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 11
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

In [None]:
# Import functions from the exercise notebook
import sys
sys.path.append('.')

def import_notebook_functions():
    """
    Import functions from exercise.ipynb by executing its Python cells.
    
    Returns:
    --------
    dict : Dictionary containing the imported functions
    """
    # load the notebook
    with open('exercise.ipynb', 'r', encoding='utf-8') as f:
        nb = nbformat.read(f, as_version=4)
    exporter = PythonExporter()
    body, _ = exporter.from_notebook_node(nb)
    
    # Create a namespace to execute the code
    namespace = {'np': np, 'plt': plt, 'load_iris': load_iris, 'fetch_openml': fetch_openml}
    exec(body, namespace)
    
    # Extract the functions we need
    function_names = [
        'load_iris_dataset', 
        'load_mnist_dataset',
    ]
    functions = {}
    for func_name in function_names:
        if func_name in namespace:
            functions[func_name] = namespace[func_name]
            globals()[func_name] = namespace[func_name] # Add to global namespace
    
    return functions

imported_functions = import_notebook_functions()

In [None]:
# Load datasets using functions from exercise.ipynb
X, y, feature_names, target_names = load_iris_dataset()
images, labels = load_mnist_dataset(max_samples=5000)

---
**Data visualization** is not just about making pretty plots—it's a crucial **analytical tool** for:

1. **Understanding data distributions**: Are features normally distributed? Skewed? Multimodal?
2. **Detecting outliers**: Values that might indicate errors or rare cases
3. **Revealing relationships**: Correlations, clusters, and patterns
4. **Assessing class separability**: Can classes be distinguished visually?
5. **Validating assumptions**: Do the data meet algorithm requirements?
6. **Debugging models**: Understanding where algorithms succeed or fail

### Statistical Distributions and Their Interpretation

To investigate the phenomena listed above, data scientists have various tools that reveal information about a dataset.

**Histogram Analysis**: 

For a feature $\mathbf{x}_j = (x_{1,j}, x_{2,j}, \dots, x_{n,j})$, a **histogram** shows the **empirical probability distribution**:
- **Bins**: Intervals $[b_k, b_{k+1})$ that partition the data range
- **Frequency**: $f_k = |\{i : b_k \leq x_{i,j} < b_{k+1}\}|$
- **Density**: $p_k = \frac{f_k}{n \cdot \Delta b}$ where $\Delta b$ is bin width
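
The frequency and density definitions above can be checked numerically. The following is a minimal sketch on made-up values (the array `x` and the choice of four bins are assumptions for illustration, not part of the homework data). Note that NumPy treats the last bin as closed on the right, so the maximum value is counted in it.

```python
import numpy as np

# Hypothetical feature values, chosen only for illustration
x = np.array([1.0, 1.5, 2.0, 2.5, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])

# Four equal-width bins on [1, 5]; each bin has width 1.0
counts, edges = np.histogram(x, bins=4, range=(1.0, 5.0))
bin_width = np.diff(edges)

# Density p_k = f_k / (n * bin width)
density = counts / (x.size * bin_width)

# NumPy computes the same quantity directly with density=True
density_np, _ = np.histogram(x, bins=4, range=(1.0, 5.0), density=True)

print("frequencies:", counts)
print("densities:  ", density)
print("area under histogram:", np.sum(density * bin_width))
```

Because the densities are normalized by $n \cdot \Delta b$, the bar areas sum to 1, which is what makes histograms with different bin widths comparable.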

**Common Distribution Patterns**:
- **Normal (Gaussian)**: Bell-shaped, symmetric around mean
  $$p(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
- **Skewed**: Asymmetric with long tail (positive/negative skew)
- **Uniform**: Approximately constant across range
- **Bimodal**: Two distinct peaks (might indicate mixed populations)
- **Heavy-tailed**: More extreme values than normal distribution
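
To connect these patterns to a number, the sketch below draws synthetic samples and computes their skewness as $E[(X-\mu)^3]/\sigma^3$. The helper `sample_skewness` is a hypothetical name introduced here, and it assumes the population standard deviation (`ddof=0`); the seed and sample sizes are arbitrary.

```python
import numpy as np

def sample_skewness(values):
    """Skewness as E[(X - mu)^3] / sigma^3, using the population (ddof=0) std."""
    values = np.asarray(values, dtype=float)
    centered = values - values.mean()
    return np.mean(centered**3) / np.std(values) ** 3

rng = np.random.default_rng(0)  # fixed seed so the run is reproducible
samples = {
    "normal": rng.normal(loc=0.0, scale=1.0, size=10_000),
    "right-skewed (exponential)": rng.exponential(scale=1.0, size=10_000),
    "uniform": rng.uniform(low=0.0, high=1.0, size=10_000),
}

for name, values in samples.items():
    print(f"{name:28s} skewness = {sample_skewness(values):+.2f}")
```

Symmetric distributions (normal, uniform) should give values near zero, while the exponential sample should give a clearly positive value because of its long right tail.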

<div class='alert alert-warning'>
    <b>Statistical Inspection</b> The secret lies in understanding feature distributions! You're about to become a data detective who can spot patterns, outliers, and hidden insights at a glance.

Your implementation should: 

1. **Set up subplot grid**:
   - Calculate grid dimensions and create figure with subplots
   - Handle different grid shapes (single feature, single row, multiple rows)

2. **Calculate statistics for each feature**:
   - Extract feature data and compute mean, std, min, max using NumPy
   - Calculate skewness 
   - Store results in a statistics dictionary

3. **Create informative histograms**:
   - Plot histogram with 20 bins, skyblue color and black edges
   - Add vertical lines for mean (red dashed) and ±1 std (orange dotted)
   - Set title showing statistics (mean, std, skewness)
   - Add axis labels, grid, and legend

4. **Print statistical summary** with interpretation for all features

5. **Return statistics dictionary** with means, stds, ranges, and skewness

</div>

In [None]:
# Hints:
# - μⱼ = (1/n) Σᵢ xᵢⱼ  (mean)
# - σⱼ = √[(1/(n-1)) Σᵢ (xᵢⱼ - μⱼ)²]  (standard deviation)
# - Skewness = E[(X-μ)³]/σ³  (distribution asymmetry measure)

def plot_feature_distributions(X, feature_names, title="Feature Distributions"):
    """
    Create histogram plots for all features to visualize their distributions and print statistical summaries.
    
    Parameters:
    -----------
    X : numpy.ndarray of shape (n_samples, n_features)
        Feature matrix
    feature_names : list of str
        Names of the features
    title : str, default="Feature Distributions"
        Overall title for the plot
    
    Returns:
    --------
    dict : Statistical information about each feature:
        - 'means': Mean values per feature
        - 'stds': Standard deviations per feature  
        - 'skewness': Skewness values per feature
        - 'ranges': Min-max ranges per feature
    
    """

    raise NotImplementedError("Implement the function to plot feature distributions.")

    n_features = X.shape[1]
    
    # Calculate grid dimensions
    
    # Statistical analysis results
    # (named feature_stats so the imported scipy.stats module is not shadowed)
    feature_stats = {
        'means': [],
        'stds': [],
        'ranges': [],
        'skewness': []
    }
    
    for idx in range(n_features):
        
        # Extract feature data
        
        # Calculate statistics

        # Store statistics
        feature_stats['means'].append(mean_val)
        feature_stats['stds'].append(std_val)
        feature_stats['ranges'].append((min_val, max_val))
        feature_stats['skewness'].append(skewness)
        
        # Create histogram
    
    # Hide empty subplots
    
    # Plot the Results
    plt.tight_layout()
    plt.show()
    
    # Print statistical summary
    for i, name in enumerate(feature_names):
        pass  # TODO: print the summary line for each feature
    
    return feature_stats



In [None]:
# Visualize feature distributions for the Iris dataset
distribution_stats = plot_feature_distributions(X, feature_names, 
                                               title="Iris Dataset: Feature Distributions")

### Scatter Plots: Exploring Feature Relationships and Class Separability

A **scatter plot** visualizes the joint distribution of two features $(x_i, x_j)$:
- **X-axis**: Feature $i$ values
- **Y-axis**: Feature $j$ values  
- **Each point**: One data sample $(x_{k,i}, x_{k,j})$

It's a powerful tool for getting a first impression of the data, and it provides guidance for further analysis. When points are colored by class, scatter plots reveal:

- **Linear separability**: Classes can be separated by a straight line
- **Cluster structure**: Distinct groups in feature space
- **Overlap regions**: Where classification will be difficult
- **Decision boundaries**: Where optimal separation might occur

**Correlation Analysis**:


The **Pearson correlation coefficient** measures linear relationship strength:
$$r_{ij} = \frac{\sum_{k=1}^n (x_{k,i} - \mu_i)(x_{k,j} - \mu_j)}{\sqrt{\sum_{k=1}^n (x_{k,i} - \mu_i)^2 \sum_{k=1}^n (x_{k,j} - \mu_j)^2}}$$

**Interpretation**:

- $r_{ij} = +1$: Perfect positive linear correlation
- $r_{ij} = 0$: No linear correlation (but nonlinear relationships may exist)
- $r_{ij} = -1$: Perfect negative linear correlation

**Correlation Strength**

- $|r_{ij}| > 0.7$: Strong correlation
- $0.3 \leq |r_{ij}| \leq 0.7$: Moderate correlation
- $|r_{ij}| < 0.3$: Weak correlation
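
As a quick check of the Pearson formula, the sketch below evaluates it term by term and compares the result with `np.corrcoef`. The helper `pearson_r` and the two small arrays are hypothetical and exist only for this illustration.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation computed directly from the sums in the formula."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da, db = a - a.mean(), b - b.mean()
    return np.sum(da * db) / np.sqrt(np.sum(da**2) * np.sum(db**2))

# Hypothetical measurements for two features
x_i = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x_j = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r_manual = pearson_r(x_i, x_j)
r_numpy = np.corrcoef(x_i, x_j)[0, 1]  # off-diagonal entry of the 2x2 matrix

print(f"manual: {r_manual:.4f}, np.corrcoef: {r_numpy:.4f}")
print(f"perfect positive: {pearson_r(x_i, 2 * x_i + 1):.1f}")
print(f"perfect negative: {pearson_r(x_i, -x_i):.1f}")
```

For these values the numerator is 6 and the denominator is $\sqrt{10 \cdot 6}$, so $r \approx 0.77$, which falls in the strong range.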

Correlation analysis helps identify redundant features, where two variables carry similar information. Highly correlated features may lead to multicollinearity, which can negatively affect some models (e.g., linear regression). On the other hand, identifying uncorrelated or weakly correlated features can uncover complementary information, making them valuable for classification or prediction. However, correlation only captures linear relationships—nonlinear patterns may still exist even when $r_{ij} = 0$, so it's essential to combine correlation analysis with visual tools like scatter plots for a complete understanding.
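
A small illustration of that last point, as a sketch on synthetic data: here `y` is fully determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# x is symmetric around zero, and y depends on x exactly (y = x^2)
x = np.linspace(-1.0, 1.0, 201)
y = x**2

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r between x and x^2: {abs(r):.3f}")
```

A scatter plot of `x` against `y` would show the parabola immediately, which is why the correlation value alone is not enough.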

<div class='alert alert-warning'>
    <b>Dataset Relationship</b> Time to uncover the hidden connections in your data. You're about to create scatter plots that reveal which features work together and which classes can be easily separated.

Your implementation of `plot_feature_relationships` should: 

1. **Set up feature pairs and grid layout**:
   - Calculate grid dimensions and create subplot grid with appropriate figure size

2. **Compute correlation matrix**:
   - Use `np.corrcoef(X.T)` to calculate Pearson correlation coefficients
   - This reveals linear relationships between all feature pairs

3. **Create informative scatter plots**:
   - Extract feature data and plot points for each class with different colors
   - Use `scatter()` with `alpha=0.7, edgecolors='black', s=60`
   - Calculate and mark class centroids with 'X' markers

4. **Professional formatting**:
   - Set axis labels and titles with correlation values
   - Add grid, legend, and proper formatting
   - Hide unused subplots for clean appearance

5. **Print correlation analysis**:
   - Display correlation score

6. **Return results dictionary** with correlation_matrix

</div>

In [None]:
def plot_feature_relationships(X, y, feature_names, target_names, 
                               feature_pairs=None, figsize=(15, 10)):
    """
    Create scatter plots to visualize relationships between feature pairs.
    
    Parameters:
    -----------
    X : numpy.ndarray of shape (n_samples, n_features)
        Feature matrix
    y : numpy.ndarray of shape (n_samples,)
        Target labels
    feature_names : list of str
        Names of features
    target_names : list of str
        Names of target classes
    feature_pairs : list of tuples, optional
        Specific feature pairs to plot. If None, plots all combinations.
    figsize : tuple, default=(15, 10)
        Figure size for the plot
    
    Returns:
    --------
    dict : Correlation analysis results:
        - 'correlation_matrix': Pearson correlation coefficients

    """
    
    raise NotImplementedError("Implement the function to plot feature relationships.")

    # Define feature pairs to plot
    if feature_pairs is None:
        pass  # TODO: build all feature index combinations
    
    # Calculate grid dimensions
    
    # Craft the Scatter Plots
    
    # Hide empty subplots

    # Plot the Results
    plt.tight_layout()
    plt.show()
    
    # Print analysis results
    return {
        'correlation_matrix': correlation_matrix,
    }

In [None]:
# Analyze feature relationships in the Iris dataset
relationship_analysis = plot_feature_relationships(X, y, feature_names, target_names,
                                                  feature_pairs=[(2, 3), (0, 2), (1, 3)],
                                                  figsize=(15, 5))

### Image Data Visualization: Understanding Visual Patterns

For datasets like MNIST, each digit class has characteristic patterns:
- **Digit 0**: Circular structure, hollow center
- **Digit 1**: Vertical lines, minimal width
- **Digit 8**: Two loops, complex topology
- **Within-class variation**: Different handwriting styles
- **Between-class similarity**: Digits 6 and 9 are similar when rotated

Visualizing image data helps uncover both statistical properties and structural patterns relevant for classification or preprocessing.

**Pixel Intensity Analysis**:
For a grayscale image $\mathbf{I} \in \mathbb{R}^{H \times W}$, we can analyze:

**Global Statistics**:

- **Mean intensity**: $\mu_I = \frac{1}{HW} \sum_{i=1}^H \sum_{j=1}^W I_{i,j}$
- **Standard deviation**: $\sigma_I = \sqrt{\frac{1}{HW-1} \sum_{i=1}^H \sum_{j=1}^W (I_{i,j} - \mu_I)^2}$
- **Dynamic range**: $[I_{min}, I_{max}]$ where $I_{min} = \min_{i,j} I_{i,j}$, $I_{max} = \max_{i,j} I_{i,j}$

These metrics give a first-order summary of brightness, contrast, and range in the image data.
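
The global statistics above map directly onto NumPy reductions. The following sketch uses a synthetic 28×28 image (an assumption for illustration, not an MNIST sample); `ddof=1` is chosen so the standard deviation matches the $\frac{1}{HW-1}$ normalization in the formula.

```python
import numpy as np

# Synthetic 28x28 grayscale image: dark background with a bright 8x8 square
image = np.zeros((28, 28), dtype=np.uint8)
image[10:18, 10:18] = 200

# Convert to float before computing statistics to avoid uint8 overflow
pixels = image.astype(float)

mean_intensity = pixels.mean()          # (1 / HW) * sum of all pixels
std_intensity = pixels.std(ddof=1)      # ddof=1 gives the 1 / (HW - 1) factor
i_min, i_max = pixels.min(), pixels.max()

print(f"mean = {mean_intensity:.2f}, std = {std_intensity:.2f}")
print(f"dynamic range = [{i_min:.0f}, {i_max:.0f}]")
```

The low mean combined with a large maximum reflects an image that is mostly background with a small bright region, which is also typical for handwritten digits.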

**Histogram Analysis for Images**:

The **pixel intensity histogram** shows the distribution of brightness values. This analysis is useful for preprocessing steps like normalization, contrast adjustment, or thresholding.

- **Dark images**: Histogram concentrated at low values (0-100)
- **Bright images**: Histogram concentrated at high values (150-255)
- **High contrast**: Histogram spread across full range
- **Low contrast**: Histogram concentrated in narrow range
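
To make the dark and bright cases concrete, here is a minimal sketch with two synthetic images whose pixel values are drawn from the ranges mentioned above. The helper `intensity_histogram`, the seed, and the image contents are assumptions for illustration; the 50 bins over `(0, 255)` mirror the setting used later in the homework task.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed for reproducibility
dark = rng.integers(0, 101, size=(28, 28))      # pixel values in 0..100
bright = rng.integers(150, 256, size=(28, 28))  # pixel values in 150..255

def intensity_histogram(image, bins=50):
    """Return bin counts and bin edges for pixel intensities on [0, 255]."""
    counts, edges = np.histogram(image.ravel(), bins=bins, range=(0, 255))
    return counts, edges

for name, img in {"dark": dark, "bright": bright}.items():
    counts, edges = intensity_histogram(img)
    occupied = np.flatnonzero(counts)  # indices of non-empty bins
    low, high = edges[occupied[0]], edges[occupied[-1] + 1]
    print(f"{name:6s}: non-empty bins cover [{low:.1f}, {high:.1f}], "
          f"pixel range = {img.max() - img.min()}")
```

The span of non-empty bins shows where the histogram mass sits, and the pixel range gives a rough indication of contrast.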

<div class='alert alert-warning'>

<b>Digital Gallery</b> Time to create a handwritten digit showcase.

Build your `visualize_digit_examples()` function to:

1. **Discover your collection**: Find all unique digits and set up the perfect gallery layout
    - Use np.unique(labels) to find all digit classes present
    - Calculate grid dimensions
    - Create subplot grid with appropriate figure size
2. **Select representative examples**: Find the first specimen of each digit and count your entire collection  
3. **Create professional displays**: Show each digit with beautiful formatting and clean visualization
    - For each digit, find first occurrence using np.where(labels == digit)[0]
    - Store example index and count total samples per class
    - Display image using imshow() with grayscale colormap and proper scaling
    - Set title showing digit value and sample count
4. **Handle the unexpected**: Gracefully manage missing digits and empty spaces
    - If no samples exist for a digit, display "No samples" message
    - Hide unused subplots for clean appearance
    - Remove axis ticks for cleaner visualization
5. **Provide collection insights**: Print detailed statistics about your digital artifact distribution
    - Show sample count and percentage for each digit class
    - Format output clearly with proper alignment   
6. **Return analysis results**: a dictionary with example_indices and class_counts

</div>

In [None]:
def visualize_digit_examples(images, labels, title="Digit Examples"):
    """
    Display one example of each digit class (0-9) from the dataset.
    
    Parameters:
    -----------
    images : numpy.ndarray of shape (n_samples, height, width)
        Array of digit images
    labels : numpy.ndarray of shape (n_samples,)
        Corresponding digit labels (0-9)
    title : str, default="Digit Examples"
        Title for the visualization
    
    Returns:
    --------
    dict : Information about the examples:
        - 'example_indices': Index of the example used for each digit
        - 'class_counts': Number of samples per digit class
    
    """
    raise NotImplementedError("Implement the function to visualize digit examples.") 

    unique_digits = np.unique(labels)
    
    # Calculate grid dimensions

    # Craft the Subplots and extract statistics    
    for i, digit in enumerate(unique_digits):
        pass  # TODO: find the first example, count samples, and draw the subplot

    # Plot the Results
    plt.tight_layout()
    plt.show()
    
    # Print class distribution
    for digit in sorted(class_counts.keys()):
        pass  # TODO: print sample count and percentage for this digit
    
    return {
        'example_indices': example_indices,
        'class_counts': class_counts
    }

In [None]:
# Visualize examples of each digit class
digit_examples = visualize_digit_examples(images, labels, 
                                         title="MNIST Dataset: One Example per Digit Class")

<div class='alert alert-warning'>

<b>Pixel Intensity Inspection</b> Ever wondered how photo editing apps analyze brightness and contrast? You're about to build the same analysis tools that photographers use to perfect their images!

Create the `analyze_pixel_intensity_distribution()` function to:

1. **Set up analysis parameters**:
    - Handle sample_indices parameter
    - Calculate grid layout for visualization
    - Create subplot grid with extra column for combined histogram
2. **Statistical analysis per image**: 
    - Calculate statistical measures: mean, std, min, max using NumPy
    - Store results in analysis_results dictionary
    - Generate histogram data: np.histogram(pixels, bins=50, range=(0, 255))
3. **Create professional visualizations**: 
    - Display original images with statistical info in titles
    - Create individual histograms showing pixel intensity distributions
    - Combine all histograms in final subplot with different colors
    - Use proper labels, legends, and formatting
4. **Print detailed summary**: Show intensity ranges and dynamic range for each sample
5. **Return comprehensive results**: Dictionary with all statistical measures and histogram data
    
</div>

In [None]:
def analyze_pixel_intensity_distribution(images, sample_indices=None, title="Pixel Intensity Analysis"):
    """
    Analyze and visualize pixel intensity distributions in images.
    
    Parameters:
    -----------
    images : numpy.ndarray of shape (n_samples, height, width)
        Array of images
    sample_indices : list or None, default=None
        Specific image indices to analyze. If None, analyzes first image.
    title : str, default="Pixel Intensity Analysis"
        Title for the visualization
    
    Returns:
    --------
    dict : Statistical analysis results:
        - 'mean_intensities': Mean pixel intensity per image
        - 'std_intensities': Standard deviation per image
        - 'intensity_ranges': Min-max ranges per image
        - 'histogram_data': Histogram data for visualization

    """

    raise NotImplementedError("Implement the function to analyze pixel intensity distribution.")
    
    analysis_results = {
        'mean_intensities': [],
        'std_intensities': [],
        'intensity_ranges': [],
        'histogram_data': []
    }
    

    # Calculate grid for subplots


    # Calculate statistics and plot images
    for idx, sample_idx in enumerate(sample_indices):       
        # Extract image and flatten for analysis
        
        # Statistical analysis
        
        # Display original image

        # Create histogram
        pass  # TODO: replace with the analysis and plotting code for this sample
    
    # Combined histogram plot

    # Plot the Results
    plt.tight_layout()
    plt.show()
    
    # Print statistical summary
    
    return analysis_results

In [None]:
# Analyze pixel intensity distributions for sample images
intensity_analysis = analyze_pixel_intensity_distribution(images, 
                                                        sample_indices=[0, 100, 500], 
                                                        title="Pixel Intensity Distribution Analysis")