# Visualization for Insight

This lab teaches you how to use Pandas, Matplotlib, and Seaborn to turn data into clear, useful charts. You'll learn to quickly explore data, spot patterns, and create visualizations that help both technical and business audiences understand your findings.

## Setup and Data Creation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Set up visualization style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (12, 8)
plt.rcParams['font.size'] = 10
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3

np.random.seed(42)
print("=== CREATING HOUSING MARKET DATASET ===")

In [None]:
# Generate realistic housing market data
n_properties = 800
property_ids = range(20001, 20001 + n_properties)

# Property characteristics
square_footage = np.clip(np.random.normal(2200, 650, n_properties).astype(int), 900, 4500)
bedrooms = np.random.choice([2, 3, 4, 5, 6], n_properties, p=[0.15, 0.35, 0.35, 0.12, 0.03])
bathrooms = np.clip(bedrooms + np.random.choice([-0.5, 0, 0.5, 1], n_properties, p=[0.1, 0.5, 0.3, 0.1]), 1.5, 4.5)

property_types = np.random.choice(['Single Family', 'Townhouse', 'Condo', 'Duplex'], 
                                 n_properties, p=[0.55, 0.25, 0.15, 0.05])

neighborhoods = np.random.choice(['Downtown', 'Suburban North', 'Suburban South', 'Riverside', 
                                 'Historic District', 'New Development', 'Lakefront'],
                                n_properties, p=[0.12, 0.22, 0.20, 0.15, 0.12, 0.14, 0.05])

# Time and pricing data
start_date = datetime(2020, 1, 1)
end_date = datetime(2024, 6, 30)
date_range = (end_date - start_date).days
sale_dates = [start_date + timedelta(days=np.random.randint(0, date_range)) for _ in range(n_properties)]

# Features and pricing
has_garage = np.random.choice([True, False], n_properties, p=[0.8, 0.2])
has_pool = np.random.choice([True, False], n_properties, p=[0.25, 0.75])
has_fireplace = np.random.choice([True, False], n_properties, p=[0.45, 0.55])

# Calculate realistic prices
base_prices = 140 * square_footage
neighborhood_multipliers = {'Lakefront': 1.45, 'Downtown': 1.35, 'Riverside': 1.25, 
                           'Historic District': 1.20, 'New Development': 1.15, 
                           'Suburban North': 1.08, 'Suburban South': 1.05}
price_adjustments = np.array([neighborhood_multipliers.get(n, 1.0) for n in neighborhoods])

years_since_2020 = np.array([(date - start_date).days / 365.25 for date in sale_dates])
market_trends = 1 + (years_since_2020 * 0.07)

feature_bonuses = (has_garage.astype(int) * 18000 + 
                  has_pool.astype(int) * 30000 + 
                  has_fireplace.astype(int) * 12000)

final_prices = np.clip((base_prices * price_adjustments * market_trends + feature_bonuses) * 
                      np.random.normal(1.0, 0.15, n_properties), 150000, 2000000).round(-3)

days_on_market = np.clip(np.random.exponential(35, n_properties).astype(int), 1, 200)

In [None]:
# Create comprehensive DataFrame
viz_data = pd.DataFrame({
    'property_id': property_ids,
    'sale_date': sale_dates,
    'sale_price': final_prices,
    'square_feet': square_footage,
    'bedrooms': bedrooms,
    'bathrooms': bathrooms,
    'property_type': property_types,
    'neighborhood': neighborhoods,
    'has_garage': has_garage,
    'has_pool': has_pool,
    'has_fireplace': has_fireplace,
    'days_on_market': days_on_market
})

# Add derived variables
viz_data['price_per_sqft'] = viz_data['sale_price'] / viz_data['square_feet']
viz_data['sale_year'] = pd.to_datetime(viz_data['sale_date']).dt.year
viz_data['sale_month'] = pd.to_datetime(viz_data['sale_date']).dt.month
viz_data['total_rooms'] = viz_data['bedrooms'] + viz_data['bathrooms']
viz_data['feature_score'] = (viz_data['has_garage'].astype(int) + 
                            viz_data['has_pool'].astype(int) + 
                            viz_data['has_fireplace'].astype(int))

viz_data['market_segment'] = pd.cut(viz_data['sale_price'], 
                                   bins=[0, 300000, 500000, 800000, float('inf')],
                                   labels=['Entry', 'Mid-Market', 'Premium', 'Luxury'])

print(f"✓ Dataset created: {len(viz_data)} properties")
print(f"✓ Price range: ${viz_data['sale_price'].min():,.0f} - ${viz_data['sale_price'].max():,.0f}")
print(viz_data.head())
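
The `market_segment` column depends on how `pd.cut` treats bin edges. By default each bin is closed on the right, so a sale at exactly $300,000 is labeled `Entry`, not `Mid-Market`. The sketch below illustrates this on a handful of made-up prices (`toy_prices` is hypothetical and not part of the lab dataset):

```python
import pandas as pd

# Hypothetical prices chosen to sit on and around the bin edges
toy_prices = pd.Series([150_000, 300_000, 300_001, 500_000, 800_000, 1_200_000])

segments = pd.cut(
    toy_prices,
    bins=[0, 300_000, 500_000, 800_000, float("inf")],
    labels=["Entry", "Mid-Market", "Premium", "Luxury"],
)

# Each interval is (left, right]: a price equal to an edge falls in the lower segment
print(pd.DataFrame({"price": toy_prices, "segment": segments}))
```

If the opposite convention is wanted, `pd.cut` accepts `right=False`, which closes each bin on the left instead.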

## Exploratory Data Visualization

In [None]:
# Distribution Analysis
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
fig.suptitle('Housing Market Data: Distribution Analysis', fontsize=16, fontweight='bold')

# Price distribution
axes[0, 0].hist(viz_data['sale_price'], bins=30, alpha=0.7, color='skyblue', edgecolor='black')
median_price = viz_data['sale_price'].median()
axes[0, 0].axvline(median_price, color='red', linestyle='--', label=f'Median: ${median_price:,.0f}')
axes[0, 0].set_title('Sale Price Distribution')
axes[0, 0].set_xlabel('Sale Price ($)')
axes[0, 0].legend()

# Square footage
axes[0, 1].hist(viz_data['square_feet'], bins=25, alpha=0.7, color='lightgreen', edgecolor='black')
axes[0, 1].set_title('Square Footage Distribution')
axes[0, 1].set_xlabel('Square Feet')

# Price per sq ft
axes[0, 2].hist(viz_data['price_per_sqft'], bins=25, alpha=0.7, color='coral', edgecolor='black')
axes[0, 2].set_title('Price per Sq Ft Distribution')
axes[0, 2].set_xlabel('Price per Sq Ft ($)')

# Bedrooms
bedroom_counts = viz_data['bedrooms'].value_counts().sort_index()
axes[1, 0].bar(bedroom_counts.index, bedroom_counts.values, alpha=0.7, color='gold', edgecolor='black')
axes[1, 0].set_title('Bedroom Distribution')
axes[1, 0].set_xlabel('Number of Bedrooms')

# Days on market
axes[1, 1].hist(viz_data['days_on_market'], bins=30, alpha=0.7, color='plum', edgecolor='black')
axes[1, 1].set_title('Days on Market Distribution')
axes[1, 1].set_xlabel('Days on Market')

# Market segments
segment_counts = viz_data['market_segment'].value_counts()
axes[1, 2].pie(segment_counts.values, labels=segment_counts.index, autopct='%1.1f%%')
axes[1, 2].set_title('Market Segment Distribution')

plt.tight_layout()
plt.show()

print(f"✓ Price range: ${viz_data['sale_price'].min():,.0f} - ${viz_data['sale_price'].max():,.0f}")
print(f"✓ Most common bedrooms: {viz_data['bedrooms'].mode()[0]}")
print(f"✓ Average days on market: {viz_data['days_on_market'].mean():.0f}")

## Relationship and Correlation Analysis

In [None]:
# Scatter Plot Analysis
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
fig.suptitle('Key Variable Relationships', fontsize=16, fontweight='bold')

# Price vs Square Footage
scatter1 = axes[0, 0].scatter(viz_data['square_feet'], viz_data['sale_price'], 
                             c=viz_data['bedrooms'], cmap='viridis', alpha=0.6)
axes[0, 0].set_xlabel('Square Feet')
axes[0, 0].set_ylabel('Sale Price ($)')
axes[0, 0].set_title('Price vs Square Footage (colored by bedrooms)')
plt.colorbar(scatter1, ax=axes[0, 0], label='Bedrooms')

# Price vs Days on Market
axes[0, 1].scatter(viz_data['days_on_market'], viz_data['sale_price'], alpha=0.6, c='red')
axes[0, 1].set_xlabel('Days on Market')
axes[0, 1].set_ylabel('Sale Price ($)')
axes[0, 1].set_title('Price vs Days on Market')

# Square Feet vs Bedrooms
axes[1, 0].scatter(viz_data['bedrooms'], viz_data['square_feet'], alpha=0.6, c='green')
axes[1, 0].set_xlabel('Number of Bedrooms')
axes[1, 0].set_ylabel('Square Feet')
axes[1, 0].set_title('Square Feet vs Bedrooms')

# Feature Score vs Price
axes[1, 1].scatter(viz_data['feature_score'], viz_data['sale_price'], alpha=0.6, c='purple')
axes[1, 1].set_xlabel('Feature Score')
axes[1, 1].set_ylabel('Sale Price ($)')
axes[1, 1].set_title('Feature Score vs Price')

plt.tight_layout()
plt.show()

# Correlation Analysis
correlations = {
    'Price vs Square Feet': viz_data['sale_price'].corr(viz_data['square_feet']),
    'Price vs Bedrooms': viz_data['sale_price'].corr(viz_data['bedrooms']),
    'Price vs Days on Market': viz_data['sale_price'].corr(viz_data['days_on_market']),
    'Price vs Feature Score': viz_data['sale_price'].corr(viz_data['feature_score'])
}

print("Key Correlations:")
for relationship, correlation in correlations.items():
    strength = "Strong" if abs(correlation) > 0.7 else "Moderate" if abs(correlation) > 0.4 else "Weak"
    direction = "positive" if correlation > 0 else "negative"
    print(f"• {relationship}: {correlation:.3f} ({strength} {direction})")
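
`Series.corr` returns the Pearson coefficient by default. As a rough illustration of what that number measures, the sketch below computes it by hand on made-up values and compares the result with pandas (`x` and `y` are hypothetical toy data, not columns from `viz_data`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: size in square feet and price in dollars
x = pd.Series([1000, 1500, 2000, 2500, 3000], dtype=float)
y = pd.Series([210_000, 260_000, 390_000, 400_000, 520_000], dtype=float)

# Pearson r = sum of co-deviations / sqrt(product of summed squared deviations)
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

r_pandas = x.corr(y)  # method='pearson' is the default
print(f"manual: {r_manual:.4f}, pandas: {r_pandas:.4f}")
```

Pearson correlation only captures linear association, so a weak coefficient does not rule out a curved or threshold relationship.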

In [None]:
# Correlation Heatmap
numerical_columns = ['sale_price', 'square_feet', 'bedrooms', 'bathrooms', 
                    'days_on_market', 'price_per_sqft', 'total_rooms', 'feature_score']

correlation_matrix = viz_data[numerical_columns].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='RdBu_r', center=0, 
            square=True, fmt='.3f', cbar_kws={'label': 'Correlation Coefficient'})
plt.title('Correlation Matrix: Housing Market Variables', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

## Time Series and Trend Analysis

In [None]:
# Time Series Analysis
viz_data['sale_date'] = pd.to_datetime(viz_data['sale_date'])
viz_data = viz_data.sort_values('sale_date')

# Monthly aggregations ('M' = month-end; pandas 2.2+ deprecates this alias in favor of 'ME')
monthly_data = viz_data.groupby(pd.Grouper(key='sale_date', freq='M')).agg({
    'sale_price': ['count', 'mean'],
    'days_on_market': 'mean'
}).round(2)

monthly_data.columns = ['sales_volume', 'avg_price', 'avg_days_market']
monthly_data = monthly_data.reset_index()

fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('Time Series Analysis', fontsize=16, fontweight='bold')

# Sales volume over time
axes[0, 0].plot(monthly_data['sale_date'], monthly_data['sales_volume'], 
               marker='o', linewidth=2, color='blue')
axes[0, 0].set_title('Monthly Sales Volume')
axes[0, 0].set_ylabel('Number of Sales')
axes[0, 0].tick_params(axis='x', rotation=45)

# Price trend
axes[0, 1].plot(monthly_data['sale_date'], monthly_data['avg_price'], 
               marker='s', linewidth=2, color='green')
axes[0, 1].set_title('Average Price Trend')
axes[0, 1].set_ylabel('Average Price ($)')
axes[0, 1].tick_params(axis='x', rotation=45)

# Seasonal patterns
viz_data['month'] = viz_data['sale_date'].dt.month
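# Reindex so all 12 months are present even if one has no sales
monthly_patterns = viz_data.groupby('month')['sale_price'].count().reindex(range(1, 13), fill_value=0)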
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

axes[1, 0].bar(month_names, monthly_patterns.values, color='skyblue', alpha=0.8)
axes[1, 0].set_title('Sales Volume by Month')
axes[1, 0].set_ylabel('Number of Sales')
axes[1, 0].tick_params(axis='x', rotation=45)

# Price vs Volume scatter
axes[1, 1].scatter(monthly_data['sales_volume'], monthly_data['avg_price'], 
                  c=range(len(monthly_data)), cmap='viridis', s=60)
axes[1, 1].set_xlabel('Monthly Sales Volume')
axes[1, 1].set_ylabel('Average Price ($)')
axes[1, 1].set_title('Price vs Volume Over Time')

plt.tight_layout()
plt.show()

# Calculate trends
price_change = ((monthly_data['avg_price'].iloc[-1] - monthly_data['avg_price'].iloc[0]) / 
                monthly_data['avg_price'].iloc[0]) * 100

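print("Time Series Insights:")
# Compares only the first and last monthly averages, so it is sensitive to noise in those two months
print(f"• Change in average price, first vs. last month: {price_change:+.1f}%")
peak_month = monthly_patterns.idxmax()
# Jan-Jun are covered by five calendar years and Jul-Dec by four, so raw counts favor the first half of the year
print(f"• Peak sales month: {month_names[peak_month-1]}")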
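
The price-change figure above uses only two data points. One possible alternative, sketched here on invented data, is to fit a least-squares line through all monthly averages and read the change off the fitted line. Everything in this block (`toy_sales`, the drift and noise levels) is an assumption for illustration, and `dt.to_period('M')` is used for monthly bucketing in place of `pd.Grouper`:

```python
import numpy as np
import pandas as pd

# Hypothetical sales: one per week, prices drifting upward with noise
rng = np.random.default_rng(0)
dates = pd.date_range("2023-01-01", periods=52, freq="W")
prices = 300_000 + 1_000 * np.arange(52) + rng.normal(0, 20_000, 52)
toy_sales = pd.DataFrame({"sale_date": dates, "sale_price": prices})

# Average price per calendar month
monthly_avg = toy_sales.groupby(toy_sales["sale_date"].dt.to_period("M"))["sale_price"].mean()

# Endpoint comparison: depends only on the first and last month
endpoint_change = (monthly_avg.iloc[-1] / monthly_avg.iloc[0] - 1) * 100

# Least-squares line through every monthly average
slope, intercept = np.polyfit(np.arange(len(monthly_avg)), monthly_avg.to_numpy(), 1)
# Change along the fitted line, relative to its value in the first month
fitted_change = slope * (len(monthly_avg) - 1) / intercept * 100

print(f"Endpoint change:     {endpoint_change:+.1f}%")
print(f"Fitted-trend change: {fitted_change:+.1f}%")
```

The two numbers usually differ; the fitted version tends to be steadier because every month contributes to it.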

## Business Dashboard

In [None]:
# Executive Dashboard
fig = plt.figure(figsize=(16, 12))
gs = fig.add_gridspec(3, 3, hspace=0.3, wspace=0.3)

fig.suptitle('Housing Market Executive Dashboard', fontsize=18, fontweight='bold', y=0.95)

# Key metrics
total_sales = len(viz_data)
avg_price = viz_data['sale_price'].mean()
median_price = viz_data['sale_price'].median()
avg_days_market = viz_data['days_on_market'].mean()

# Key metrics banner
ax1 = fig.add_subplot(gs[0, :])
ax1.text(0.5, 0.5, f'Total Sales: {total_sales:,} | Avg Price: ${avg_price:,.0f} | Median: ${median_price:,.0f} | Avg DOM: {avg_days_market:.0f}', 
         transform=ax1.transAxes, ha='center', va='center', fontsize=14, 
         bbox=dict(boxstyle="round,pad=0.5", facecolor="lightblue", alpha=0.8))
ax1.axis('off')

# Price by neighborhood
ax2 = fig.add_subplot(gs[1, 0])
neighborhood_prices = viz_data.groupby('neighborhood')['sale_price'].mean().sort_values()
ax2.barh(range(len(neighborhood_prices)), neighborhood_prices.values, color='steelblue')
ax2.set_yticks(range(len(neighborhood_prices)))
ax2.set_yticklabels(neighborhood_prices.index, fontsize=9)
ax2.set_title('Avg Price by Neighborhood')

# Property type distribution
ax3 = fig.add_subplot(gs[1, 1])
prop_type_counts = viz_data['property_type'].value_counts()
ax3.pie(prop_type_counts.values, labels=prop_type_counts.index, autopct='%1.1f%%')
ax3.set_title('Market Share by Type')

# Feature impact
ax4 = fig.add_subplot(gs[1, 2])
feature_impact = viz_data.groupby('feature_score')['sale_price'].mean()
ax4.bar(feature_impact.index, feature_impact.values, color='orange', alpha=0.8)
ax4.set_title('Price by Feature Count')
ax4.set_xlabel('Features (Garage+Pool+Fireplace)')

# Market segment analysis
ax5 = fig.add_subplot(gs[2, :])
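# observed=True skips empty segments, which would otherwise appear as NaN rows in the table
segment_analysis = viz_data.groupby('market_segment', observed=True).agg({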
    'sale_price': ['count', 'mean'],
    'days_on_market': 'mean',
    'square_feet': 'mean'
}).round(0)

segment_data = []
for segment in segment_analysis.index:
    row = segment_analysis.loc[segment]
    segment_data.append([
        segment,
        f"{row[('sale_price', 'count')]:.0f}",
        f"${row[('sale_price', 'mean')]:,.0f}",
        f"{row[('days_on_market', 'mean')]:.0f}",
        f"{row[('square_feet', 'mean')]:.0f}"
    ])

table = ax5.table(cellText=segment_data, 
                 colLabels=['Segment', 'Count', 'Avg Price', 'Avg DOM', 'Avg Sq Ft'],
                 cellLoc='center', loc='center')
table.auto_set_font_size(False)
table.set_fontsize(10)
table.scale(1, 2)
ax5.axis('off')
ax5.set_title('Market Segment Performance', fontsize=12, fontweight='bold')

plt.show()

print("✓ Executive dashboard created with key insights")

## Practice Exercises

In [None]:
# Exercise 1: Feature Analysis
print("=== FEATURE IMPACT ANALYSIS ===")

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Garage impact
garage_prices = viz_data.groupby('has_garage')['sale_price'].mean()
axes[0].bar(['No Garage', 'Has Garage'], garage_prices.values, color=['lightcoral', 'lightgreen'])
axes[0].set_title('Price Impact: Garage')
axes[0].set_ylabel('Average Price ($)')

# Pool impact
pool_prices = viz_data.groupby('has_pool')['sale_price'].mean()
axes[1].bar(['No Pool', 'Has Pool'], pool_prices.values, color=['lightcoral', 'skyblue'])
axes[1].set_title('Price Impact: Pool')

# Fireplace impact
fireplace_prices = viz_data.groupby('has_fireplace')['sale_price'].mean()
axes[2].bar(['No Fireplace', 'Has Fireplace'], fireplace_prices.values, color=['lightcoral', 'gold'])
axes[2].set_title('Price Impact: Fireplace')

plt.tight_layout()
plt.show()

# Feature premiums: raw differences in group means, not adjusted for size or neighborhood
garage_premium = garage_prices[True] - garage_prices[False]
pool_premium = pool_prices[True] - pool_prices[False]
fireplace_premium = fireplace_prices[True] - fireplace_prices[False]

print(f"Feature Premiums:")
print(f"• Garage: ${garage_premium:,.0f}")
print(f"• Pool: ${pool_premium:,.0f}")
print(f"• Fireplace: ${fireplace_premium:,.0f}")

In [None]:
# Exercise 2: Predictive Analysis
print("=== PREDICTIVE INSIGHTS ===")

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Predictive Analysis: R-squared Values', fontsize=14)

# Square feet prediction
slope1, intercept1, r_value1, _, _ = stats.linregress(viz_data['square_feet'], viz_data['sale_price'])
axes[0, 0].scatter(viz_data['square_feet'], viz_data['sale_price'], alpha=0.5)
axes[0, 0].plot(viz_data['square_feet'], slope1 * viz_data['square_feet'] + intercept1, 'r-', linewidth=2)
axes[0, 0].set_title(f'Square Feet (R² = {r_value1**2:.3f})')
axes[0, 0].set_xlabel('Square Feet')
axes[0, 0].set_ylabel('Sale Price')

# Feature score prediction
slope2, intercept2, r_value2, _, _ = stats.linregress(viz_data['feature_score'], viz_data['sale_price'])
axes[0, 1].scatter(viz_data['feature_score'], viz_data['sale_price'], alpha=0.5, c='green')
axes[0, 1].plot(viz_data['feature_score'], slope2 * viz_data['feature_score'] + intercept2, 'r-', linewidth=2)
axes[0, 1].set_title(f'Feature Score (R² = {r_value2**2:.3f})')
axes[0, 1].set_xlabel('Feature Score')

# Total rooms prediction
slope3, intercept3, r_value3, _, _ = stats.linregress(viz_data['total_rooms'], viz_data['sale_price'])
axes[1, 0].scatter(viz_data['total_rooms'], viz_data['sale_price'], alpha=0.5, c='purple')
axes[1, 0].plot(viz_data['total_rooms'], slope3 * viz_data['total_rooms'] + intercept3, 'r-', linewidth=2)
axes[1, 0].set_title(f'Total Rooms (R² = {r_value3**2:.3f})')
axes[1, 0].set_xlabel('Total Rooms')
axes[1, 0].set_ylabel('Sale Price')

# Days on market prediction
slope4, intercept4, r_value4, _, _ = stats.linregress(viz_data['days_on_market'], viz_data['sale_price'])
axes[1, 1].scatter(viz_data['days_on_market'], viz_data['sale_price'], alpha=0.5, c='red')
axes[1, 1].plot(viz_data['days_on_market'], slope4 * viz_data['days_on_market'] + intercept4, 'r-', linewidth=2)
axes[1, 1].set_title(f'Days on Market (R² = {r_value4**2:.3f})')
axes[1, 1].set_xlabel('Days on Market')

plt.tight_layout()
plt.show()

# Predictor ranking
predictors = [
    ('Square Feet', r_value1**2),
    ('Feature Score', r_value2**2),
    ('Total Rooms', r_value3**2),
    ('Days on Market', r_value4**2)
]
predictors.sort(key=lambda x: x[1], reverse=True)

print("Predictor Ranking:")
for i, (predictor, r_squared) in enumerate(predictors, 1):
    print(f"{i}. {predictor}: {r_squared:.3f}")

print(f"\n✓ {predictors[0][0]} is the strongest single-variable predictor (highest R²)")
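
The R² values above come from squaring the `rvalue` returned by `stats.linregress`. For a one-variable least-squares fit this equals 1 - SS_res / SS_tot, the share of price variance the line explains. A minimal sketch on hypothetical data (`sqft` and `price` are generated here, not taken from `viz_data`):

```python
import numpy as np
from scipy import stats

# Hypothetical toy data, not taken from the lab dataset
rng = np.random.default_rng(1)
sqft = rng.uniform(900, 4500, 200)
price = 140 * sqft + rng.normal(0, 60_000, 200)

result = stats.linregress(sqft, price)

# R² from the definition: 1 - (residual sum of squares / total sum of squares)
predicted = result.slope * sqft + result.intercept
ss_res = np.sum((price - predicted) ** 2)
ss_tot = np.sum((price - price.mean()) ** 2)
r_squared_manual = 1 - ss_res / ss_tot

print(f"rvalue**2:         {result.rvalue ** 2:.4f}")
print(f"1 - SS_res/SS_tot: {r_squared_manual:.4f}")
print(f"p-value for slope: {result.pvalue:.3g}")
```

The result object also carries `pvalue`, which the exercise discards with `_`; checking it is one way to see whether a small R² still reflects a non-zero slope.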

## Challenge: Advanced Market Segmentation Analysis

**Challenge:**  
Segment the housing market by both `property_type` and `market_segment`. For each combination, calculate:

- The number of properties
- The average sale price
- The average days on market

Present the results as a pivot table (rows: `property_type`, columns: `market_segment`). Highlight which property type and segment combination has the highest average sale price.

*Tip: Use `pd.pivot_table` and DataFrame styling for highlighting!*
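
As a starting point, here is a rough sketch of the pivot-table mechanics on a six-row toy frame. The `toy` data is invented and only borrows the lab's column names; applying the same idea to `viz_data` is left as the exercise.

```python
import pandas as pd

# Hypothetical miniature dataset with the same column names as the lab
toy = pd.DataFrame({
    "property_type": ["Condo", "Condo", "Condo", "Townhouse", "Townhouse", "Duplex"],
    "market_segment": ["Entry", "Entry", "Premium", "Entry", "Premium", "Premium"],
    "sale_price": [250_000, 280_000, 620_000, 290_000, 700_000, 650_000],
    "days_on_market": [20, 30, 45, 15, 60, 40],
})

# One pivot per statistic keeps the column labels simple
avg_price = pd.pivot_table(toy, values="sale_price", index="property_type",
                           columns="market_segment", aggfunc="mean")
counts = pd.pivot_table(toy, values="sale_price", index="property_type",
                        columns="market_segment", aggfunc="count", fill_value=0)

# unstack() gives a Series keyed by (market_segment, property_type);
# idxmax skips NaN cells (combinations with no sales)
top_segment, top_type = avg_price.unstack().idxmax()

print(avg_price)
print(counts)
print(f"Highest average price: {top_type} / {top_segment}")
```

In a notebook, `avg_price.style.highlight_max(axis=None)` should shade that same cell; pandas styling requires the `jinja2` package.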

In [None]:
# Your challenge code here

### Challenge: Visualize Property Type Distribution

**Challenge:**  
Create a bar chart that shows the number of properties for each `property_type` in the dataset. Make sure to:

- Display the property types on the x-axis and the counts on the y-axis.
- Add value labels on top of each bar for clarity.
- Give your chart a descriptive title.

*Tip: Use `prop_type_counts` for the counts and `matplotlib` for plotting!*
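
One way to add value labels is `Axes.bar_label`, available in Matplotlib 3.4 and later. The sketch below uses made-up counts (`toy_counts` is a hypothetical stand-in for `prop_type_counts`):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical counts standing in for prop_type_counts
toy_counts = pd.Series({"Single Family": 440, "Townhouse": 200, "Condo": 120, "Duplex": 40})

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(toy_counts.index, toy_counts.values, color="steelblue", edgecolor="black")

# bar_label writes each bar's height just above the bar
ax.bar_label(bars, padding=3)

ax.set_title("Number of Properties by Type (toy data)")
ax.set_xlabel("Property Type")
ax.set_ylabel("Number of Properties")
plt.tight_layout()
plt.show()
```

On older Matplotlib versions, a loop calling `ax.text` at each bar's center and height achieves a similar result.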

In [None]:
# Your challenge code here

### Challenge: Multi-Dimensional Market Insights

**Challenge:**  
Create a heatmap visualization that shows the *average days on market* for each combination of `property_type` and `market_segment`. Your task:

- Build a pivot table with `property_type` as rows and `market_segment` as columns, values as the mean of `days_on_market`.
- Use `seaborn.heatmap` to visualize the result, with annotations for the values.
- Highlight and describe which combination has the *lowest* average days on market, and discuss what this might mean for sellers and buyers.

*Tip: Use `pd.pivot_table` and `sns.heatmap` for a clear, insightful visualization!*
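
The sketch below shows the pivot-then-heatmap pattern on a small invented frame (`toy` is hypothetical and only reuses the lab's column names), then locates the lowest cell programmatically:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical miniature dataset with the same column names as the lab
toy = pd.DataFrame({
    "property_type": ["Condo", "Condo", "Townhouse", "Townhouse", "Duplex", "Duplex"],
    "market_segment": ["Entry", "Premium", "Entry", "Premium", "Entry", "Premium"],
    "days_on_market": [20, 45, 15, 60, 35, 40],
})

dom = pd.pivot_table(toy, values="days_on_market", index="property_type",
                     columns="market_segment", aggfunc="mean")

fig, ax = plt.subplots(figsize=(6, 4))
sns.heatmap(dom, annot=True, fmt=".0f", cmap="YlGnBu",
            cbar_kws={"label": "Avg Days on Market"}, ax=ax)
ax.set_title("Average Days on Market (toy data)")
plt.tight_layout()
plt.show()

# unstack() gives a Series keyed by (market_segment, property_type)
fastest_segment, fastest_type = dom.unstack().idxmin()
print(f"Fastest-selling combination: {fastest_type} / {fastest_segment}")
```

With the real data, cells backed by only a few sales can produce extreme averages, so it may help to check the count for each combination before interpreting the lowest value.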

In [None]:
# Your challenge code here

## Summary

In this lab, you practiced the following visualization techniques:

- **Data Setup**: Generated a synthetic housing dataset with built-in relationships between size, location, and price
- **Exploratory Analysis**: Histograms, bar charts, and pie charts for distributions and categories
- **Relationship Analysis**: Scatter plots, pairwise correlations, and a correlation heatmap
- **Time Series**: Monthly trends and seasonal sales patterns
- **Business Dashboards**: A multi-panel summary of key metrics for stakeholders
- **Predictive Analysis**: Single-variable linear regressions ranked by R²

These skills form the foundation of effective data storytelling and help communicate insights clearly to both technical and business audiences.

**Key Takeaways:**
- Start with exploratory visualizations to understand your data
- Use correlation analysis to identify relationships
- Time series analysis reveals trends and seasonality
- Business dashboards should focus on actionable insights
- Always validate findings with statistical analysis