# Histograms & Distribution Plots

## Overview

**Histograms** visualize the distribution of numerical data by dividing the value range into bins and counting how many observations fall into each bin.

```
Histogram = Visual representation of data distribution
```

### What We'll Learn

**1. Basic Histograms** 📊
- Creating histograms with hist()
- Bins and bin edges
- Frequency vs density
- Customization

**2. Multiple Distributions** 👥
- Overlapping histograms
- Side-by-side comparison
- Transparency and colors
- Legends

**3. Density Plots (KDE)** 📈
- Kernel Density Estimation
- Smooth distribution curves
- Bandwidth selection
- Combined with histograms

**4. Advanced Techniques** 🚀
- Cumulative histograms
- 2D histograms (heatmaps)
- Step histograms
- Custom bin edges

**5. Statistical Analysis** 📏
- Distribution shapes
- Mean, median, mode
- Standard deviation
- Normal distribution overlay

### Why Master Histograms?

```
✓ Understand data distribution
✓ Identify outliers
✓ Check for normality
✓ Compare distributions
✓ Essential for EDA
✓ Foundation for statistics
```

### Common Use Cases

- **Data Science**: Feature distributions, EDA
- **Statistics**: Normality testing, hypothesis testing
- **Business**: Customer age, income, spending patterns
- **Science**: Measurement distributions, experiment results
- **Quality Control**: Process variation analysis

Let's master histograms! 🚀

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
from scipy import stats
from scipy.stats import norm, skew, kurtosis
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline
np.random.seed(42)

print(f"Matplotlib: {matplotlib.__version__}")
print(f"NumPy: {np.__version__}")
print(f"Pandas: {pd.__version__}")
print(f"SciPy: {scipy.__version__}")  # version lives on the top-level package
print("\n✅ Setup complete!")

## 1. Basic Histograms

### Basic Syntax

```python
# Simple histogram
ax.hist(data, bins=10)

# Full parameters
ax.hist(data,
       bins=30,              # Number of bins
       range=(lo, hi),       # (lower, upper) bin limits; default is (data.min(), data.max())
       density=False,        # True: normalize so the bar areas sum to 1
       weights=None,         # Weight each value
       cumulative=False,     # Cumulative histogram
       histtype='bar',       # 'bar', 'barstacked', 'step', 'stepfilled'
       align='mid',          # 'left', 'mid', 'right'
       orientation='vertical', # or 'horizontal'
       color='steelblue',    # Color
       edgecolor='black',    # Edge color
       linewidth=1,          # Edge width
       alpha=0.7,            # Transparency
       label='Data')         # Legend label
```
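The snippet above assumes that `ax` and `data` already exist. A minimal end-to-end sketch, using a made-up normal sample (the sample and the variable names are illustrative, not part of the original notebook), might look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative sample: 1000 draws from Normal(100, 15)
rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=1000)

fig, ax = plt.subplots(figsize=(8, 5))
counts, edges, patches = ax.hist(
    data,
    bins=30,
    range=(data.min(), data.max()),  # explicit (lower, upper) limits
    color="steelblue",
    edgecolor="black",
    alpha=0.7,
    label="Data",
)
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.legend()

# Every value lands in exactly one bin, so the counts add up to the sample size
total_count = int(counts.sum())
print(total_count)
# In a script, call plt.show() here; in a notebook the figure renders inline.
```

Because `range` spans the full data here, no observation is dropped; narrowing `range` would silently exclude values outside it.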

### Choosing Number of Bins

```python
# Methods for bin selection
bins='auto'        # Larger bin count of Sturges and FD (not the default; Matplotlib defaults to 10 bins)
bins='fd'          # Freedman-Diaconis rule
bins='doane'       # Doane's formula
bins='scott'       # Scott's rule
bins='rice'        # Rice's rule
bins='sturges'     # Sturges' formula
bins='sqrt'        # Square root rule

# Manual
bins=30            # Fixed number
bins=[0, 10, 20, 30, 40]  # Custom edges
bins=np.linspace(0, 100, 21)  # Array of edges
```
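To see how much the rules disagree, one option is to ask NumPy for the edges directly with `np.histogram_bin_edges`, which accepts the same rule names. The sketch below uses an illustrative normal sample; the resulting counts depend on the data.

```python
import numpy as np

# Illustrative sample: 1000 draws from Normal(100, 15)
rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

rules = ["auto", "fd", "doane", "scott", "rice", "sturges", "sqrt"]
bin_counts = {}
for rule in rules:
    edges = np.histogram_bin_edges(data, bins=rule)
    bin_counts[rule] = len(edges) - 1  # n edges define n - 1 bins
    print(f"{rule:>8}: {bin_counts[rule]} bins")
```

Rules based only on sample size (`sturges`, `rice`, `sqrt`) give the same count for any dataset of that size, while `fd`, `scott`, and `doane` also react to the spread or shape of the data.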

### Frequency vs Density

```python
# Frequency (counts)
ax.hist(data, bins=30, density=False)
ax.set_ylabel('Frequency')

# Density (probability density, not probability)
ax.hist(data, bins=30, density=True)
ax.set_ylabel('Density')
# Note: bar areas (height × bin width) sum to 1
```
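The density normalization can be checked by hand: each bar height is the bin count divided by (total count × bin width). A small sketch with `np.histogram`, which `ax.hist` uses for binning, on an illustrative sample:

```python
import numpy as np

# Illustrative sample: 1000 draws from Normal(100, 15)
rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

counts, edges = np.histogram(data, bins=30)
density, _ = np.histogram(data, bins=30, density=True)
widths = np.diff(edges)

# density = count / (N * bin width)
manual_density = counts / (counts.sum() * widths)

# Summing height * width over all bars gives the total area
total_area = float(np.sum(density * widths))
print(np.allclose(density, manual_density), round(total_area, 6))
```

Individual density values can exceed 1 when bins are narrow; only the total area is constrained.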

### Histogram Types

```python
# Bar (default)
ax.hist(data, histtype='bar')

# Step (outline only)
ax.hist(data, histtype='step', linewidth=2)

# Step filled
ax.hist(data, histtype='stepfilled', alpha=0.5)

# Barstacked (for multiple datasets)
ax.hist([data1, data2], histtype='barstacked')
```
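The step type pairs naturally with `cumulative=True`. As a sketch on an illustrative sample, combining it with `density=True` yields a curve that rises to 1, similar to an empirical CDF evaluated at the bin edges:

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative sample: 1000 draws from Normal(100, 15)
rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

fig, ax = plt.subplots(figsize=(8, 5))
cum, edges, _ = ax.hist(
    data,
    bins=50,
    density=True,      # normalize ...
    cumulative=True,   # ... and accumulate, so the last bin reaches 1
    histtype="step",
    linewidth=2,
)
ax.set_xlabel("Value")
ax.set_ylabel("Cumulative proportion")
ax.grid(True, alpha=0.3)
```

Reading the curve at a given x gives the approximate share of observations at or below that value.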

### Return Values

```python
n, bins, patches = ax.hist(data, bins=30)
# n: counts per bin (densities if density=True)
# bins: bin edges (len(bins) == len(n) + 1)
# patches: BarContainer of Rectangle objects (list of Polygon for step types)
```
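The return values are useful for post-processing. The sketch below (illustrative data; the highlight color is an arbitrary choice) locates the tallest bin and recolors its bar:

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative sample: 1000 draws from Normal(100, 15)
rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

fig, ax = plt.subplots(figsize=(8, 5))
n, bins, patches = ax.hist(data, bins=30, color="steelblue", edgecolor="black")

# Bin centers are the midpoints between consecutive edges
centers = 0.5 * (bins[:-1] + bins[1:])

# Index of the tallest bin, then recolor the matching Rectangle
peak = int(np.argmax(n))
patches[peak].set_facecolor("coral")

modal_range = (bins[peak], bins[peak + 1])
print(f"Modal bin: {modal_range[0]:.1f} to {modal_range[1]:.1f}")
```

The same pattern (loop over `patches` and compare `centers` to thresholds) can color bins by value range.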

### Best Practices

```
✓ Choose appropriate number of bins
✓ Label axes clearly
✓ Use density=True when comparing samples of different sizes
✓ Add transparency for overlapping
✓ Show mean/median lines
✓ Add statistical annotations
✗ Don't use too many bins (noisy) or too few (hides structure)
✗ Don't forget axis labels
✗ Don't ignore outliers
```

In [None]:
print("=== BASIC HISTOGRAMS ===\n")

# Generate sample data
np.random.seed(42)
normal_data = np.random.normal(100, 15, 1000)

# Example 1: Different bin counts
print("Example 1: Effect of Bin Count")

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
bin_counts = [10, 30, 50, 100]

for ax, bins in zip(axes.flat, bin_counts):
    ax.hist(normal_data, bins=bins, color='steelblue', 
           edgecolor='black', alpha=0.7)
    ax.set_title(f'Bins = {bins}', fontweight='bold', fontsize=12)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3)

plt.suptitle('Impact of Bin Count on Histogram Shape', 
            fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

# Example 2: Histogram types
print("\n" + "="*70)
print("Example 2: Different Histogram Types")
print("="*70)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
hist_types = [('bar', 'Bar'), ('barstacked', 'Bar Stacked'), 
             ('step', 'Step'), ('stepfilled', 'Step Filled')]

for ax, (htype, title) in zip(axes.flat, hist_types):
    if htype == 'step':
        ax.hist(normal_data, bins=30, histtype=htype, 
               color='steelblue', linewidth=2)
    else:
        ax.hist(normal_data, bins=30, histtype=htype, 
               color='steelblue', edgecolor='black', alpha=0.7)
    ax.set_title(f'histtype="{htype}"', fontweight='bold', fontsize=12)
    ax.set_xlabel('Value')
    ax.set_ylabel('Frequency')
    ax.grid(True, alpha=0.3)

plt.suptitle('Histogram Types', fontsize=16, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

# Example 3: Frequency vs Density
print("\n" + "="*70)
print("Example 3: Frequency vs Density")
print("="*70)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Frequency
axes[0].hist(normal_data, bins=30, color='steelblue', 
            edgecolor='black', alpha=0.7, density=False)
axes[0].set_title('Frequency Histogram', fontweight='bold', fontsize=14)
axes[0].set_xlabel('Value', fontsize=12)
axes[0].set_ylabel('Frequency (Count)', fontsize=12)
axes[0].grid(True, alpha=0.3)

# Density
axes[1].hist(normal_data, bins=30, color='coral', 
            edgecolor='black', alpha=0.7, density=True)
axes[1].set_title('Density Histogram', fontweight='bold', fontsize=14)
axes[1].set_xlabel('Value', fontsize=12)
axes[1].set_ylabel('Probability Density', fontsize=12)
axes[1].grid(True, alpha=0.3)

# Overlay normal distribution
mu, sigma = normal_data.mean(), normal_data.std()
x = np.linspace(normal_data.min(), normal_data.max(), 100)
axes[1].plot(x, norm.pdf(x, mu, sigma), 'r-', linewidth=2, 
            label=f'Normal(μ={mu:.1f}, σ={sigma:.1f})')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n💡 Key Difference:")
print("   Frequency: Shows raw counts")
print("   Density: Normalized (area = 1), allows comparison across sample sizes")

## 2. Multiple Distributions & KDE

### Overlapping Histograms

```python
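# Passing a list, ax.hist([data1, data2]), draws the bars side by side.
# To overlap, call hist once per group with shared bin edges.
edges = np.histogram_bin_edges(np.concatenate([data1, data2]), bins=30)
ax.hist(data1, bins=edges, label='Group 1',
       color='blue', alpha=0.6, edgecolor='black')
ax.hist(data2, bins=edges, label='Group 2',
       color='red', alpha=0.6, edgecolor='black')
ax.legend()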
```
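When the groups differ in size, raw counts make the smaller group look flat. One way to handle this, sketched here with two hypothetical groups of 1000 and 300 values, is to draw each group separately on shared bin edges with `density=True`:

```python
import matplotlib.pyplot as plt
import numpy as np

# Two hypothetical groups with different sizes and centers
rng = np.random.default_rng(42)
group_a = rng.normal(100, 15, 1000)
group_b = rng.normal(115, 10, 300)

# Shared edges computed from the pooled data keep the bars aligned
shared_edges = np.histogram_bin_edges(np.concatenate([group_a, group_b]), bins=30)

fig, ax = plt.subplots(figsize=(8, 5))
dens_a, edges_a, _ = ax.hist(group_a, bins=shared_edges, density=True,
                             alpha=0.6, color="blue", edgecolor="black",
                             label="Group A (n=1000)")
dens_b, edges_b, _ = ax.hist(group_b, bins=shared_edges, density=True,
                             alpha=0.6, color="red", edgecolor="black",
                             label="Group B (n=300)")
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend()

# Each group is normalized on its own, so both areas equal 1
area_a = float(np.sum(dens_a * np.diff(edges_a)))
area_b = float(np.sum(dens_b * np.diff(edges_b)))
```

With equal areas, differences in the picture reflect shape and location rather than sample size.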

### Kernel Density Estimation (KDE)

```python
from scipy.stats import gaussian_kde

# Create KDE
kde = gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 100)
density = kde(x)

# Plot
ax.plot(x, density, linewidth=2, label='KDE')
```
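Bandwidth controls how smooth the KDE is. `gaussian_kde` takes a `bw_method` argument; a scalar is used as the smoothing factor applied to the spread of the data. The sketch below uses an illustrative bimodal sample, and the values 0.05 and 1.0 are arbitrary picks to show under- and over-smoothing:

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde

# Illustrative bimodal sample: two normal clusters
rng = np.random.default_rng(42)
data = np.concatenate([rng.normal(80, 5, 500), rng.normal(120, 5, 500)])
x = np.linspace(data.min(), data.max(), 200)

kde_narrow = gaussian_kde(data, bw_method=0.05)  # small factor: wiggly curve
kde_default = gaussian_kde(data)                 # default: Scott's rule
kde_wide = gaussian_kde(data, bw_method=1.0)     # large factor: modes blur together

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(data, bins=40, density=True, alpha=0.3, color="gray")
for kde, label in [(kde_narrow, "bw=0.05"),
                   (kde_default, "Scott (default)"),
                   (kde_wide, "bw=1.0")]:
    ax.plot(x, kde(x), linewidth=2, label=label)
ax.set_xlabel("Value")
ax.set_ylabel("Density")
ax.legend()

density_narrow = kde_narrow(x)
density_wide = kde_wide(x)
# A KDE is a proper density: it integrates to 1 over the real line
total_mass = kde_default.integrate_box_1d(-np.inf, np.inf)
```

If the curve shows bumps the histogram does not support, the bandwidth is probably too small; if visible peaks in the histogram disappear, it is probably too large.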

### Combined Histogram + KDE

```python
# Histogram (density)
ax.hist(data, bins=30, density=True, alpha=0.5)

# KDE overlay
kde = gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 100)
ax.plot(x, kde(x), 'r-', linewidth=2, label='KDE')
```

### Statistical Annotations

```python
mean = data.mean()
median = np.median(data)
std = data.std()

ax.axvline(mean, color='red', linestyle='--', 
          linewidth=2, label=f'Mean: {mean:.2f}')
ax.axvline(median, color='green', linestyle='--', 
          linewidth=2, label=f'Median: {median:.2f}')
```
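Mean and median lines are most informative on skewed data, where they separate. A sketch with an illustrative right-skewed (log-normal) sample:

```python
import matplotlib.pyplot as plt
import numpy as np

# Illustrative right-skewed sample
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=3.0, sigma=0.6, size=1000)

mean = skewed.mean()
median = np.median(skewed)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(skewed, bins=40, color="steelblue", edgecolor="black", alpha=0.7)
ax.axvline(mean, color="red", linestyle="--",
           linewidth=2, label=f"Mean: {mean:.2f}")
ax.axvline(median, color="green", linestyle="--",
           linewidth=2, label=f"Median: {median:.2f}")
ax.set_xlabel("Value")
ax.set_ylabel("Frequency")
ax.legend()
```

The long right tail pulls the mean above the median; on a symmetric sample the two lines would nearly coincide.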

## Practice Exercises

### Beginner
1. Create histogram with 20 bins, custom color
2. Add mean and median lines
3. Compare frequency vs density
4. Try different bin algorithms
5. Create horizontal histogram

### Intermediate
6. Overlay two distributions with transparency
7. Add KDE curve to histogram
8. Create cumulative histogram
9. Color bins by value ranges
10. 2D histogram (hist2d or hexbin)

### Advanced
11. Multi-panel distribution comparison
12. Fit and overlay normal distribution
13. Test for normality (Q-Q plot)
14. Animated histogram (time series)
15. Complete EDA dashboard with histograms

## Quick Reference

```python
# Basic
ax.hist(data, bins=30, color='steelblue', edgecolor='black')

# Multiple (bars side by side; call hist once per dataset to overlap)
ax.hist([d1, d2], bins=30, alpha=0.6, label=['A', 'B'])

# KDE
from scipy.stats import gaussian_kde
kde = gaussian_kde(data)
x = np.linspace(data.min(), data.max(), 100)
ax.plot(x, kde(x), linewidth=2)

# Statistics
ax.axvline(data.mean(), color='r', linestyle='--', label='Mean')
```

## Summary

✅ Histograms show data distribution
✅ Bin choice is crucial for visual interpretation
✅ Density allows comparison across different sample sizes
✅ KDE provides smooth distribution estimate
✅ Add statistical lines for context

**Next: 07_pie_donut_charts.ipynb**