# Week 2d: Matplotlib Fundamentals
## ISM 6251: Introduction to Machine Learning

### Learning Objectives
By the end of this notebook, you will be able to:
1. Create scatter plots to visualize relationships
2. Build histograms to understand distributions
3. Generate bar charts (horizontal and vertical)
4. Customize plot appearance and styling
5. Create multiple subplots
6. Add annotations and labels to plots

## 1. Introduction to Matplotlib

Matplotlib is the most widely used plotting library in Python. It provides:
- Publication-quality figures
- Interactive figures in Jupyter notebooks and GUI windows
- Customizable plots
- Export to many file formats
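
As a quick illustration of the export feature, the sketch below saves one figure as PNG, PDF, and SVG. The file name `sine_wave` and the temporary output directory are assumptions made for the example; `savefig` infers the format from the file extension.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x))
ax.set_title('Export Example')

# Hypothetical output location: a fresh temporary directory
out_dir = Path(tempfile.mkdtemp())

# Save the same figure three times; the extension selects the format
saved_files = []
for ext in ['png', 'pdf', 'svg']:
    path = out_dir / f'sine_wave.{ext}'
    fig.savefig(path, dpi=150, bbox_inches='tight')
    saved_files.append(path)

plt.close(fig)
print([p.name for p in saved_files])
```

PNG is a raster format, so `dpi` controls its resolution, while PDF and SVG are vector formats that stay sharp at any zoom level.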

In [None]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set style for better-looking plots (this style name requires Matplotlib 3.6 or newer)
plt.style.use('seaborn-v0_8-darkgrid')

# For inline plots in Jupyter
%matplotlib inline

print(f"Matplotlib version: {matplotlib.__version__}")

## 2. Basic Plot Structure

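Every Matplotlib plot is built from two objects: a `Figure`, which is the overall canvas, and one or more `Axes`, which are the individual plotting areas that hold the data, labels, and title.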
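
The relationship between the two objects can be inspected directly. This is a minimal sketch with made-up panel titles: it creates a figure with two plotting areas and shows that the figure keeps track of both.

```python
import matplotlib.pyplot as plt

# One Figure containing two Axes arranged side by side
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# The Figure stores every Axes it contains in fig.axes
print(type(fig).__name__)
print(len(fig.axes))

# Each Axes carries its own title, labels, and data
axes[0].set_title('Left panel')
axes[1].set_title('Right panel')
titles = [ax.get_title() for ax in fig.axes]
print(titles)

plt.close(fig)
```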

In [None]:
# Simple line plot to understand structure
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create figure and axis
fig, ax = plt.subplots(figsize=(10, 6))

# Plot data
ax.plot(x, y)

# Add labels and title
ax.set_xlabel('X Values', fontsize=12)
ax.set_ylabel('Y Values', fontsize=12)
ax.set_title('Basic Plot Structure', fontsize=14, fontweight='bold')

# Add grid
ax.grid(True, alpha=0.3)

plt.show()

## 3. Scatter Plots

Scatter plots are used to visualize the relationship between two continuous variables.
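
A scatter plot is often paired with a number that summarizes the relationship it shows. The sketch below assumes synthetic data with a true slope of 2, then computes the Pearson correlation with `np.corrcoef` and a fitted slope with `np.polyfit`. The sample size and noise level are illustrative choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic data: y depends linearly on x, plus random noise
rng = np.random.default_rng(42)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.5, size=200)

# Pearson correlation and least-squares slope/intercept
r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)

fig, ax = plt.subplots(figsize=(6, 4))
ax.scatter(x, y, alpha=0.6)
ax.set_xlabel('X Variable')
ax.set_ylabel('Y Variable')
ax.set_title(f'Pearson r = {r:.2f}, fitted slope = {slope:.2f}')
plt.close(fig)

print(f'Pearson r = {r:.3f}, slope = {slope:.2f}')
```

A value of `r` near 1 or -1 indicates a strong linear relationship, while a value near 0 indicates little or no linear relationship.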

### Basic Scatter Plot

In [None]:
# Generate sample data
np.random.seed(42)
n_points = 100
x = np.random.randn(n_points)
y = 2 * x + np.random.randn(n_points) * 0.5

# Create scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(x, y)
plt.xlabel('X Variable')
plt.ylabel('Y Variable')
plt.title('Basic Scatter Plot')
plt.grid(True, alpha=0.3)
plt.show()

### Customized Scatter Plot

In [None]:
# Generate data with categories
np.random.seed(42)
n_per_group = 50

# Three groups centered at different locations
group1_x = np.random.normal(0, 1, n_per_group)
group1_y = np.random.normal(0, 1, n_per_group)

group2_x = np.random.normal(3, 1, n_per_group)
group2_y = np.random.normal(3, 1, n_per_group)

group3_x = np.random.normal(0, 1, n_per_group)
group3_y = np.random.normal(3, 1, n_per_group)

# Create figure
plt.figure(figsize=(12, 8))

# Plot each group with different colors and markers
plt.scatter(group1_x, group1_y, c='red', marker='o', s=100, alpha=0.6, label='Group A')
plt.scatter(group2_x, group2_y, c='blue', marker='^', s=100, alpha=0.6, label='Group B')
plt.scatter(group3_x, group3_y, c='green', marker='s', s=100, alpha=0.6, label='Group C')

plt.xlabel('X Variable', fontsize=12)
plt.ylabel('Y Variable', fontsize=12)
plt.title('Scatter Plot with Multiple Groups', fontsize=14, fontweight='bold')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3)

# Add dashed reference lines through the origin (the center of Group A)
plt.axhline(y=0, color='gray', linestyle='--', alpha=0.3)
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.3)

plt.show()

### Scatter Plot with Size and Color Mapping

In [None]:
# Generate data
np.random.seed(42)
n = 100
x = np.random.rand(n) * 100
y = np.random.rand(n) * 100
colors = np.random.rand(n)
sizes = np.random.rand(n) * 500

# Create scatter plot
plt.figure(figsize=(12, 8))
scatter = plt.scatter(x, y, c=colors, s=sizes, alpha=0.6, cmap='viridis')

plt.xlabel('X Variable', fontsize=12)
plt.ylabel('Y Variable', fontsize=12)
plt.title('Scatter Plot with Size and Color Mapping', fontsize=14, fontweight='bold')

# Add colorbar
plt.colorbar(scatter, label='Color Value')

# Add text annotation for the largest point
max_idx = np.argmax(sizes)
plt.annotate('Largest Point', 
             xy=(x[max_idx], y[max_idx]), 
             xytext=(x[max_idx]+10, y[max_idx]+10),
             arrowprops=dict(arrowstyle='->', color='red'),
             fontsize=10)

plt.grid(True, alpha=0.3)
plt.show()

## 4. Histograms

Histograms show the distribution of a continuous variable by grouping its values into bins and counting how many values fall into each bin.
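
The binning step can be made explicit with `np.histogram`, which returns the counts and bin edges without drawing anything. The sketch below uses illustrative synthetic data and checks that `ax.hist` produces the same counts for the same number of bins.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(100, 15, 1000)

# Compute the binning only: 10 bins means 11 bin edges
counts, bin_edges = np.histogram(data, bins=10)

# Draw the histogram; ax.hist returns the counts and edges it used
fig, ax = plt.subplots(figsize=(8, 4))
plot_counts, plot_edges, _ = ax.hist(data, bins=10, edgecolor='black', alpha=0.7)
plt.close(fig)

print(counts.sum())                          # every value lands in exactly one bin
print(len(bin_edges))
print(np.array_equal(counts, plot_counts))   # same counts as the plotted bars
```

Changing `bins` changes how smooth or jagged the histogram looks, so it is worth trying a few values before drawing conclusions about the shape of a distribution.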

### Basic Histogram

In [None]:
# Generate data
np.random.seed(42)
data = np.random.normal(100, 15, 1000)

# Create histogram
plt.figure(figsize=(10, 6))
plt.hist(data, bins=30, edgecolor='black', alpha=0.7)

plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Basic Histogram', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')

# Add vertical line at mean
plt.axvline(data.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {data.mean():.1f}')
plt.legend()

plt.show()

### Multiple Histograms (Overlapping)

In [None]:
# Generate multiple datasets
np.random.seed(42)
data1 = np.random.normal(100, 15, 1000)
data2 = np.random.normal(120, 20, 1000)
data3 = np.random.normal(90, 10, 1000)

# Create figure
plt.figure(figsize=(12, 6))

# Plot histograms
plt.hist(data1, bins=30, alpha=0.5, label='Dataset 1', color='blue')
plt.hist(data2, bins=30, alpha=0.5, label='Dataset 2', color='red')
plt.hist(data3, bins=30, alpha=0.5, label='Dataset 3', color='green')

plt.xlabel('Value', fontsize=12)
plt.ylabel('Frequency', fontsize=12)
plt.title('Overlapping Histograms', fontsize=14, fontweight='bold')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3, axis='y')

plt.show()

### Side-by-Side Histograms

In [None]:
# Generate data
np.random.seed(42)
men_scores = np.random.normal(70, 10, 500)
women_scores = np.random.normal(75, 12, 500)

# Create subplots
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# First histogram
ax1.hist(men_scores, bins=25, color='skyblue', edgecolor='black', alpha=0.7)
ax1.set_xlabel('Score', fontsize=12)
ax1.set_ylabel('Frequency', fontsize=12)
ax1.set_title("Men's Scores", fontsize=12)
ax1.axvline(men_scores.mean(), color='red', linestyle='--', label=f'Mean: {men_scores.mean():.1f}')
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Second histogram
ax2.hist(women_scores, bins=25, color='lightcoral', edgecolor='black', alpha=0.7)
ax2.set_xlabel('Score', fontsize=12)
ax2.set_ylabel('Frequency', fontsize=12)
ax2.set_title("Women's Scores", fontsize=12)
ax2.axvline(women_scores.mean(), color='red', linestyle='--', label=f'Mean: {women_scores.mean():.1f}')
ax2.legend()
ax2.grid(True, alpha=0.3, axis='y')

plt.suptitle('Score Distribution by Gender', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## 5. Bar Charts

Bar charts are used to compare quantities across different categories.
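
In practice the bar heights usually come from aggregating raw records. The sketch below assumes a small, made-up table of transactions (the column names `product` and `amount` are hypothetical), sums the amounts per product with pandas, and plots the totals.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical transaction records
sales = pd.DataFrame({
    'product': ['A', 'B', 'A', 'C', 'B', 'A'],
    'amount': [10, 20, 15, 30, 25, 5],
})

# One total per category: this is what the bars will show
totals = sales.groupby('product')['amount'].sum()

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(totals.index.tolist(), totals.values,
              color='steelblue', edgecolor='black', alpha=0.7)
ax.set_xlabel('Product')
ax.set_ylabel('Total Amount')
ax.set_title('Total Sales by Product')
plt.show()

print(totals.to_dict())
```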

### Vertical Bar Chart

In [None]:
# Create data
categories = ['Product A', 'Product B', 'Product C', 'Product D', 'Product E']
values = [23, 45, 56, 78, 32]

# Create bar chart
plt.figure(figsize=(10, 6))
bars = plt.bar(categories, values, color='steelblue', edgecolor='black', alpha=0.7)

# Add value labels on top of bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{height}',
             ha='center', va='bottom', fontsize=10)

plt.xlabel('Products', fontsize=12)
plt.ylabel('Sales (in thousands)', fontsize=12)
plt.title('Vertical Bar Chart - Product Sales', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='y')

plt.show()

### Horizontal Bar Chart

In [None]:
# Create data
categories = ['Marketing', 'Development', 'Sales', 'Support', 'HR', 'Finance']
values = [45, 80, 62, 35, 28, 55]

# Sort data for better visualization
sorted_data = sorted(zip(categories, values), key=lambda x: x[1])
categories_sorted = [x[0] for x in sorted_data]
values_sorted = [x[1] for x in sorted_data]

# Create horizontal bar chart
plt.figure(figsize=(10, 8))
bars = plt.barh(categories_sorted, values_sorted, color='coral', edgecolor='black', alpha=0.7)

# Add value labels
for bar in bars:
    width = bar.get_width()
    plt.text(width, bar.get_y() + bar.get_height()/2.,
             f'{width}',
             ha='left', va='center', fontsize=10)

plt.xlabel('Number of Employees', fontsize=12)
plt.ylabel('Department', fontsize=12)
plt.title('Horizontal Bar Chart - Department Sizes', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3, axis='x')

plt.tight_layout()
plt.show()

### Grouped Bar Chart

In [None]:
# Create data
categories = ['Q1', 'Q2', 'Q3', 'Q4']
product_a = [20, 35, 30, 25]
product_b = [25, 30, 35, 30]
product_c = [15, 20, 25, 20]

# Set width of bars
bar_width = 0.25
x = np.arange(len(categories))

# Create figure
plt.figure(figsize=(12, 6))

# Create bars
plt.bar(x - bar_width, product_a, bar_width, label='Product A', color='skyblue', edgecolor='black')
plt.bar(x, product_b, bar_width, label='Product B', color='orange', edgecolor='black')
plt.bar(x + bar_width, product_c, bar_width, label='Product C', color='lightgreen', edgecolor='black')

plt.xlabel('Quarter', fontsize=12)
plt.ylabel('Sales (in thousands)', fontsize=12)
plt.title('Grouped Bar Chart - Quarterly Sales by Product', fontsize=14, fontweight='bold')
plt.xticks(x, categories)
plt.legend()
plt.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

### Stacked Bar Chart

In [None]:
# Create data
categories = ['Team A', 'Team B', 'Team C', 'Team D']
junior = [10, 15, 12, 8]
mid = [15, 20, 18, 12]
senior = [5, 8, 7, 6]

# Create figure
plt.figure(figsize=(10, 6))

# Create stacked bars
plt.bar(categories, junior, label='Junior', color='lightblue', edgecolor='black')
plt.bar(categories, mid, bottom=junior, label='Mid-level', color='orange', edgecolor='black')
plt.bar(categories, senior, bottom=np.array(junior)+np.array(mid), label='Senior', color='green', edgecolor='black')

plt.xlabel('Teams', fontsize=12)
plt.ylabel('Number of Employees', fontsize=12)
plt.title('Stacked Bar Chart - Team Composition', fontsize=14, fontweight='bold')
plt.legend(loc='upper right')
plt.grid(True, alpha=0.3, axis='y')

# Add total labels on top
totals = np.array(junior) + np.array(mid) + np.array(senior)
for i, total in enumerate(totals):
    plt.text(i, total + 0.5, str(total), ha='center', fontsize=10, fontweight='bold')

plt.show()

## 6. Combining Plot Types
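
When every panel in a grid follows the same recipe, looping over the axes can replace repeated indexing such as `axes[0, 0]`. The sketch below is one way to do this, using made-up group names and synthetic data; `axes.flat` iterates over the grid row by row.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical groups with different centers
rng = np.random.default_rng(42)
centers = {'Group A': 50, 'Group B': 60, 'Group C': 70, 'Group D': 80}
datasets = {name: rng.normal(loc, 10, 200) for name, loc in centers.items()}

# sharex/sharey keep the panels directly comparable
fig, axes = plt.subplots(2, 2, figsize=(10, 8), sharex=True, sharey=True)

# Pair each Axes with one dataset and apply the same plotting steps
for ax, (name, values) in zip(axes.flat, datasets.items()):
    ax.hist(values, bins=15, edgecolor='black', alpha=0.7)
    ax.set_title(name)
    ax.grid(True, alpha=0.3, axis='y')

fig.suptitle('Looping Over Subplots', fontsize=14, fontweight='bold')
fig.tight_layout()
plt.show()
```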

In [None]:
# Generate data
np.random.seed(42)
x = np.linspace(0, 10, 50)
y = 2 * x + np.random.randn(50) * 2

# Calculate statistics
categories = ['Mean', 'Median', 'Std Dev', 'Min', 'Max']
values = [y.mean(), np.median(y), y.std(), y.min(), y.max()]

# Create subplots
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Scatter plot
axes[0, 0].scatter(x, y, alpha=0.6, color='blue')
axes[0, 0].set_xlabel('X Values')
axes[0, 0].set_ylabel('Y Values')
axes[0, 0].set_title('Scatter Plot')
axes[0, 0].grid(True, alpha=0.3)

# Histogram
axes[0, 1].hist(y, bins=15, color='green', alpha=0.7, edgecolor='black')
axes[0, 1].set_xlabel('Y Values')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Y')
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Bar chart
axes[1, 0].bar(categories, values, color='orange', alpha=0.7, edgecolor='black')
axes[1, 0].set_xlabel('Statistic')
axes[1, 0].set_ylabel('Value')
axes[1, 0].set_title('Summary Statistics')
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Horizontal bar chart
axes[1, 1].barh(categories, values, color='purple', alpha=0.7, edgecolor='black')
axes[1, 1].set_xlabel('Value')
axes[1, 1].set_ylabel('Statistic')
axes[1, 1].set_title('Summary Statistics (Horizontal)')
axes[1, 1].grid(True, alpha=0.3, axis='x')

plt.suptitle('Multiple Plot Types Example', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## 7. Customization and Styling
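
Besides styling each element individually, default settings can be changed temporarily with `matplotlib.rc_context`. The sketch below picks two settings purely for illustration; they apply only to artists created inside the `with` block, and the global defaults are restored afterwards.

```python
import matplotlib as mpl
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
default_width = mpl.rcParams['lines.linewidth']

# Temporary defaults: thicker lines and a larger title font
with mpl.rc_context({'lines.linewidth': 3, 'axes.titlesize': 16}):
    fig, ax = plt.subplots(figsize=(6, 4))
    line, = ax.plot(x, np.cos(x))
    ax.set_title('Styled with rc_context')

plt.close(fig)

print(line.get_linewidth())                             # width picked up inside the block
print(mpl.rcParams['lines.linewidth'] == default_width) # global default is unchanged
```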

In [None]:
# Generate data
np.random.seed(42)
data1 = np.random.normal(100, 15, 200)
data2 = np.random.normal(110, 20, 200)

# Create custom styled plot
fig, ax = plt.subplots(figsize=(12, 8))

# Create scatter plot with custom colors
scatter1 = ax.scatter(range(len(data1)), data1, 
                     c=data1, cmap='coolwarm', 
                     s=50, alpha=0.6, 
                     label='Dataset 1')

# Add colorbar
cbar = plt.colorbar(scatter1, ax=ax)
cbar.set_label('Value', rotation=270, labelpad=15)

# Customize axes
ax.set_xlabel('Index', fontsize=14, fontweight='bold')
ax.set_ylabel('Value', fontsize=14, fontweight='bold')
ax.set_title('Customized Scatter Plot', fontsize=16, fontweight='bold', pad=20)

# Customize grid
ax.grid(True, linestyle='--', alpha=0.3, color='gray')
ax.set_facecolor('#f0f0f0')

# Add horizontal lines for mean and std
mean_val = data1.mean()
std_val = data1.std()

ax.axhline(mean_val, color='red', linestyle='-', linewidth=2, label=f'Mean: {mean_val:.1f}')
ax.axhline(mean_val + std_val, color='red', linestyle='--', linewidth=1, alpha=0.5, label='±1 Std')
ax.axhline(mean_val - std_val, color='red', linestyle='--', linewidth=1, alpha=0.5)

# Add text annotation
ax.text(len(data1)*0.02, mean_val + std_val*1.5, 
        f'Mean: {mean_val:.1f}\nStd: {std_val:.1f}',
        fontsize=10, 
        bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

# Customize legend
ax.legend(loc='upper right', frameon=True, shadow=True, fancybox=True)

plt.tight_layout()
plt.show()

## 8. Practical Examples

### Example 1: Sales Dashboard

In [None]:
# Generate sales data
np.random.seed(42)
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
sales_2023 = np.random.randint(50, 150, 12)
sales_2024 = np.random.randint(60, 160, 12)
regions = ['North', 'South', 'East', 'West']
region_sales = [250, 180, 320, 290]

# Create dashboard
fig = plt.figure(figsize=(15, 10))

# Monthly sales comparison
ax1 = plt.subplot(2, 2, 1)
x = np.arange(len(months))
width = 0.35
ax1.bar(x - width/2, sales_2023, width, label='2023', color='steelblue')
ax1.bar(x + width/2, sales_2024, width, label='2024', color='coral')
ax1.set_xlabel('Month')
ax1.set_ylabel('Sales ($1000s)')
ax1.set_title('Monthly Sales Comparison')
ax1.set_xticks(x)
ax1.set_xticklabels(months, rotation=45)
ax1.legend()
ax1.grid(True, alpha=0.3, axis='y')

# Regional distribution
ax2 = plt.subplot(2, 2, 2)
colors = ['#ff9999', '#66b3ff', '#99ff99', '#ffcc99']
ax2.pie(region_sales, labels=regions, colors=colors, autopct='%1.1f%%', startangle=90)
ax2.set_title('Sales by Region')

# Sales trend
ax3 = plt.subplot(2, 2, 3)
ax3.plot(months, sales_2023, marker='o', linestyle='-', linewidth=2, label='2023')
ax3.plot(months, sales_2024, marker='s', linestyle='-', linewidth=2, label='2024')
ax3.set_xlabel('Month')
ax3.set_ylabel('Sales ($1000s)')
ax3.set_title('Sales Trend')
ax3.legend()
ax3.grid(True, alpha=0.3)
plt.setp(ax3.xaxis.get_majorticklabels(), rotation=45)

# Sales distribution
ax4 = plt.subplot(2, 2, 4)
all_sales = np.concatenate([sales_2023, sales_2024])
ax4.hist(all_sales, bins=15, color='purple', alpha=0.7, edgecolor='black')
ax4.axvline(all_sales.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {all_sales.mean():.1f}')
ax4.set_xlabel('Sales ($1000s)')
ax4.set_ylabel('Frequency')
ax4.set_title('Sales Distribution (All Months)')
ax4.legend()
ax4.grid(True, alpha=0.3, axis='y')

plt.suptitle('Sales Dashboard 2023-2024', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

### Example 2: Student Performance Analysis

In [None]:
# Generate student data
np.random.seed(42)
n_students = 100
math_scores = np.random.normal(75, 12, n_students)
science_scores = np.random.normal(78, 10, n_students)
english_scores = np.random.normal(82, 8, n_students)

# Create analysis plots
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

# Scatter: Math vs Science
axes[0, 0].scatter(math_scores, science_scores, alpha=0.5)
axes[0, 0].set_xlabel('Math Score')
axes[0, 0].set_ylabel('Science Score')
axes[0, 0].set_title('Math vs Science Scores')
axes[0, 0].grid(True, alpha=0.3)

# Add a least-squares trend line (degree-1 polynomial fit)
z = np.polyfit(math_scores, science_scores, 1)
p = np.poly1d(z)
axes[0, 0].plot(math_scores, p(math_scores), "r--", alpha=0.8)

# Histogram: Math scores
axes[0, 1].hist(math_scores, bins=20, color='blue', alpha=0.7, edgecolor='black')
axes[0, 1].set_xlabel('Math Score')
axes[0, 1].set_ylabel('Number of Students')
axes[0, 1].set_title('Math Score Distribution')
axes[0, 1].axvline(math_scores.mean(), color='red', linestyle='--', label=f'Mean: {math_scores.mean():.1f}')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Bar chart: Average scores by subject
subjects = ['Math', 'Science', 'English']
avg_scores = [math_scores.mean(), science_scores.mean(), english_scores.mean()]
colors = ['blue', 'green', 'orange']
bars = axes[0, 2].bar(subjects, avg_scores, color=colors, alpha=0.7, edgecolor='black')
axes[0, 2].set_ylabel('Average Score')
axes[0, 2].set_title('Average Scores by Subject')
axes[0, 2].set_ylim([0, 100])

# Add value labels on bars
for bar, score in zip(bars, avg_scores):
    axes[0, 2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
                    f'{score:.1f}', ha='center', va='bottom')
axes[0, 2].grid(True, alpha=0.3, axis='y')

# Box plot: Score distributions
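all_scores = [math_scores, science_scores, english_scores]
bp = axes[1, 0].boxplot(all_scores, patch_artist=True)
# Name the boxes through the tick labels; boxplot's `labels` keyword is deprecated in Matplotlib 3.9+
axes[1, 0].set_xticklabels(subjects)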
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)
axes[1, 0].set_ylabel('Score')
axes[1, 0].set_title('Score Distributions by Subject')
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Scatter: Science vs English
axes[1, 1].scatter(science_scores, english_scores, alpha=0.5, color='purple')
axes[1, 1].set_xlabel('Science Score')
axes[1, 1].set_ylabel('English Score')
axes[1, 1].set_title('Science vs English Scores')
axes[1, 1].grid(True, alpha=0.3)

# Overall distribution
all_scores_flat = np.concatenate([math_scores, science_scores, english_scores])
axes[1, 2].hist(all_scores_flat, bins=30, color='gray', alpha=0.7, edgecolor='black')
axes[1, 2].set_xlabel('Score')
axes[1, 2].set_ylabel('Frequency')
axes[1, 2].set_title('Overall Score Distribution')
axes[1, 2].axvline(all_scores_flat.mean(), color='red', linestyle='--', 
                   label=f'Mean: {all_scores_flat.mean():.1f}')
axes[1, 2].legend()
axes[1, 2].grid(True, alpha=0.3, axis='y')

plt.suptitle('Student Performance Analysis', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

## 9. Best Practices

### Tips for Effective Visualizations

In [None]:
# Example of good vs bad visualization practices
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))

# Generate data
categories = ['A', 'B', 'C', 'D', 'E']
values = [23, 45, 56, 78, 32]

# Bad example
ax1.bar(categories, values, color=['red', 'yellow', 'green', 'blue', 'orange'], width=1.0)
ax1.set_title('Bad: Too Many Colors, No Labels', fontsize=10)
ax1.set_ylim([0, 100])

# Good example
bars = ax2.bar(categories, values, color='steelblue', edgecolor='black', alpha=0.7)
ax2.set_xlabel('Category', fontsize=12)
ax2.set_ylabel('Value', fontsize=12)
ax2.set_title('Good: Clear, Consistent, Labeled', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3, axis='y')

# Add value labels
for bar in bars:
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
             f'{height}', ha='center', va='bottom')

plt.suptitle('Visualization Best Practices', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

# Print best practices
print("\nBest Practices for Data Visualization:")
print("1. Always label your axes")
print("2. Include a descriptive title")
print("3. Use consistent color schemes")
print("4. Add gridlines for easier reading")
print("5. Include value labels when appropriate")
print("6. Choose the right plot type for your data")
print("7. Avoid chart junk and unnecessary decorations")
print("8. Make sure text is readable")
print("9. Use color meaningfully")
print("10. Consider your audience")

## Summary

In this notebook, we covered the fundamentals of data visualization with Matplotlib:

### Key Plot Types:
1. **Scatter Plots**: 
   - Show relationships between two continuous variables
   - Can encode additional dimensions with size and color
   - Useful for identifying patterns and correlations

2. **Histograms**:
   - Display the distribution of a continuous variable
   - Help identify skewness, outliers, and modality
   - Can compare multiple distributions

3. **Bar Charts**:
   - Compare quantities across categories
   - Available in vertical and horizontal orientations
   - Can be grouped or stacked for multiple series
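
As a recap of stacking, the sketch below shows one way to handle any number of series: keep a running `bottom` array and add each series to it after drawing. The team names and counts are illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

categories = ['Team A', 'Team B', 'Team C']
series = {
    'Junior': np.array([10, 15, 12]),
    'Mid-level': np.array([15, 20, 18]),
    'Senior': np.array([5, 8, 7]),
}

fig, ax = plt.subplots(figsize=(8, 5))

# Each layer starts where the previous layers ended
bottom = np.zeros(len(categories))
for label, counts in series.items():
    ax.bar(categories, counts, bottom=bottom, label=label, edgecolor='black')
    bottom = bottom + counts

ax.set_ylabel('Number of Employees')
ax.set_title('Stacked Bars with a Running Bottom')
ax.legend()
plt.show()

print(bottom.tolist())  # final bar heights (totals per team)
```

After the loop, `bottom` holds the total height of each stacked bar, which is convenient for placing total labels above the bars.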

### Key Skills Learned:
- Creating and customizing plots
- Adding labels, titles, and legends
- Working with subplots
- Applying colors and styles
- Adding annotations and gridlines
- Combining multiple plot types

### Remember:
- Choose the right visualization for your data
- Keep plots clear and simple
- Always label axes and include titles
- Use color purposefully
- Consider your audience

These visualization skills are essential for:
- Exploratory data analysis
- Presenting findings
- Model evaluation
- Communicating insights