# ML Final Project — Guided Template (CIS 508)

**Purpose:**  
This notebook is your guide for the final project in Machine Learning in Business. You’ll use it to organize your work and make sure each part of your project is clear and meaningful.

Your project should follow four main steps:

1. Project Overview – Introduce your topic, explain why it matters, and define your business problems.

2. EDA & Data Insights – Explore your dataset, clean it, and highlight key patterns or findings that help you understand the problem.

3. Modeling & Evaluation – Build and evaluate at least two machine learning models. Compare their performance and explain what you learn from the results.

4. Executive Summary – Summarize your main insights in plain language. Focus on what your results mean for decision-making or business strategy.

Use this structure to keep your analysis focused, your writing organized, and your insights actionable.

***Feel free to remove this part of the template when you finalize your project.***

# Predicting California Housing Prices: A Machine Learning Approach to Real Estate Valuation

**One-sentence description:**  
>This project uses machine learning models to predict median house values in California based on geographic, demographic, and housing characteristics, helping real estate stakeholders make informed pricing and investment decisions.

## Section 1 — Project Overview


In this section, you’ll introduce your project and explain what business problem you’re trying to solve.  
Please fill out each part clearly and concisely.

### 1.1 Dataset and Problem Description


- Indicate **which dataset** you selected (name and source).
> **Dataset**: California Housing Prices  
> **Source**: This dataset is based on the 1990 California census data and is commonly used in machine learning courses and competitions. The dataset contains information about housing districts in California, including geographic location, housing characteristics, and demographic information. It is available from various open data platforms and has been widely used for regression analysis and predictive modeling in real estate contexts.

- Describe **the size of the dataset** (number of rows and columns).  
> The dataset contains **20,640 observations** (rows) and **10 columns**: one target variable (median_house_value) and nine predictor variables covering geographic location, housing characteristics, and demographic factors. This sample is large enough to train and compare several models while keeping computation light.

- **Introduce all variables**: provide a short description of each key variable and its meaning.
> **Target Variable:**
> - **median_house_value**: The median house value for households within a block (continuous, in US dollars). This is our target variable for prediction.
> 
> **Predictor Variables:**
> - **longitude**: Geographic longitude coordinate of the housing block (continuous; negative values indicate west of the prime meridian, so every California district has a negative longitude)
> - **latitude**: Geographic latitude coordinate of the housing block (continuous)
> - **housing_median_age**: Median age of houses within the block (continuous, in years)
> - **total_rooms**: Total number of rooms within the block (continuous)
> - **total_bedrooms**: Total number of bedrooms within the block (continuous)
> - **population**: Total population residing in the block (continuous)
> - **households**: Total number of households in the block (continuous)
> - **median_income**: Median income for households within the block (continuous, expressed in tens of thousands of US dollars and capped at about 15)
> - **ocean_proximity**: Proximity to the ocean (categorical: '<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN')
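
One way to confirm that a loaded file matches the variables described above is to compare its columns against the expected list. The sketch below is illustrative only: `EXPECTED_COLUMNS` and `check_schema` are hypothetical names, and `demo` is a hand-made stand-in frame rather than the real dataset.

```python
import pandas as pd

# Column names as listed in Section 1.1 (target last).
EXPECTED_COLUMNS = [
    "longitude", "latitude", "housing_median_age", "total_rooms",
    "total_bedrooms", "population", "households", "median_income",
    "ocean_proximity", "median_house_value",
]

def check_schema(frame, expected=EXPECTED_COLUMNS):
    """Return (missing, unexpected) column names relative to `expected`."""
    missing = [c for c in expected if c not in frame.columns]
    unexpected = [c for c in frame.columns if c not in expected]
    return missing, unexpected

# Stand-in frame (assumption): one placeholder row with a column dropped
# and an extra column added, so both kinds of mismatch are visible.
demo = pd.DataFrame([{c: 0 for c in EXPECTED_COLUMNS}]).drop(columns=["households"])
demo["extra_col"] = 1

missing, unexpected = check_schema(demo)
print("Missing columns:", missing)
print("Unexpected columns:", unexpected)
```

In the notebook itself, `check_schema(df)` would be called right after `pd.read_csv`; two empty lists mean the file has exactly the ten documented columns.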

### 1.2 Business Motivation
- What is the **business problem** you are trying to solve?  
> The business problem is to accurately predict median house values in California housing districts to support strategic decision-making in the real estate market. Accurate house price predictions enable stakeholders to:
> - **For Homebuyers**: Make informed purchasing decisions and identify undervalued properties
> - **For Real Estate Agents**: Provide accurate pricing guidance to clients and optimize listing strategies
> - **For Investors**: Identify profitable investment opportunities and assess market trends
> - **For Lenders**: Evaluate property values for mortgage underwriting and risk assessment
> - **For Developers**: Make data-driven decisions about where to build and what price points to target
> 
> The challenge lies in understanding which factors most significantly influence house prices and building a reliable predictive model that can generalize to new housing districts.

- Who are the **stakeholders** involved (e.g., managers, teams, customers)?  
> **Primary Stakeholders:**
> - **Homebuyers and Home Sellers**: Need accurate market valuations to make informed decisions about buying or selling properties
> - **Real Estate Agents and Brokers**: Require reliable price estimates to advise clients and set competitive listing prices
> - **Real Estate Investment Firms**: Need accurate valuations to identify investment opportunities and manage portfolios
> - **Mortgage Lenders and Banks**: Require property valuations for loan underwriting, risk assessment, and portfolio management
> - **Property Developers and Construction Companies**: Need market insights to decide where to build and what price points to target
> - **Real Estate Appraisers**: Can use models as a tool to support their professional valuations
> - **Government Agencies**: May use predictions for property tax assessments and urban planning decisions
> - **Real Estate Technology Platforms** (e.g., Zillow, Redfin): Rely on accurate predictions for their automated valuation models (AVMs)

- What are the **potential benefits or costs of not solving** this problem?
> **Benefits of Solving This Problem:**
> - **Improved Decision-Making**: Stakeholders can make more informed decisions with accurate price predictions
> - **Market Efficiency**: Better price transparency leads to more efficient real estate markets
> - **Risk Reduction**: Lenders and investors can better assess and mitigate financial risks
> - **Competitive Advantage**: Real estate professionals with superior pricing models gain market advantages
> - **Time Savings**: Automated valuations reduce the time needed for manual appraisals
> 
> **Costs of Not Solving This Problem:**
> - **Financial Losses**: Overpricing leads to properties sitting on the market; underpricing results in lost revenue
> - **Poor Investment Decisions**: Without accurate predictions, investors may choose suboptimal properties or miss opportunities
> - **Increased Risk**: Lenders face higher default risks if property values are overestimated
> - **Market Inefficiency**: Inaccurate pricing creates market distortions and reduces overall market efficiency
> - **Competitive Disadvantage**: Companies without accurate pricing models lose business to competitors with better tools
> - **Customer Dissatisfaction**: Homebuyers and sellers lose trust when valuations are consistently inaccurate
> - **Regulatory Issues**: Inaccurate valuations can lead to compliance problems and legal issues for financial institutions

## Section 2 — EDA (Data Understanding & Key Insights)

Use this section to **understand your dataset and uncover key business-relevant patterns.** Feel free to include both tables and visualizations.

1. Mount your Google Drive.
2. Set your working directory (if you’re using the default, no changes are needed).
3. Update the file name to match the dataset you selected.

In [None]:
# For local execution (Jupyter Notebook)
# If using Google Colab, uncomment the following lines:
# from google.colab import drive
# drive.mount('/content/drive')
# %cd /content/drive/MyDrive/Colab Notebooks

# Verify the current working directory
import os
print(f"Current working directory: {os.getcwd()}")

# Read data into jupyter notebook
import pandas as pd
import numpy as np

# Read File - California Housing Prices dataset
df = pd.read_csv("California Housing Prices.csv")

# Display basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"\nColumn names:\n{df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head()


### 2.1 Dataset Overview

- Provide a clear, high-level description of the dataset.

In [None]:
# Dataset Overview
print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)

# Dataset dimensions
print(f"\nDataset Shape: {df.shape[0]} rows × {df.shape[1]} columns")

# Column information
print("\n" + "=" * 60)
print("COLUMN INFORMATION")
print("=" * 60)
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()

# Data types summary
print("\n" + "=" * 60)
print("DATA TYPES SUMMARY")
print("=" * 60)
print(df.dtypes)

# Target variable identification
print("\n" + "=" * 60)
print("TARGET VARIABLE")
print("=" * 60)
print("Target Variable (y): median_house_value")
print("\nPredictor Variables (X):")
predictors = [col for col in df.columns if col != 'median_house_value']
for i, pred in enumerate(predictors, 1):
    print(f"  {i}. {pred}")

# Statistical summary
print("\n" + "=" * 60)
print("STATISTICAL SUMMARY")
print("=" * 60)
print(df.describe())

# First few rows
print("\n" + "=" * 60)
print("FIRST 5 ROWS")
print("=" * 60)
print(df.head())

### 2.2 Data Quality

- Diagnose and handle data problems that could affect analysis.

In [None]:
# ============================================================
# DATA QUALITY CHECKS
# ============================================================

# 1. Check for missing values
print("=" * 60)
print("MISSING VALUES CHECK")
print("=" * 60)
missing_values = df.isna().sum()
missing_percent = (missing_values / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Percentage': missing_percent
})
missing_df = missing_df[missing_df['Missing Count'] > 0]
if len(missing_df) > 0:
    print(missing_df)
else:
    print("✓ No missing values found!")

# 2. Check for duplicates
print("\n" + "=" * 60)
print("DUPLICATES CHECK")
print("=" * 60)
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")
if duplicate_count > 0:
    print("⚠ Warning: Duplicates found. Consider removing them.")
else:
    print("✓ No duplicate rows found!")

# 3. Check for impossible/out-of-range values
print("\n" + "=" * 60)
print("DATA VALIDATION CHECKS")
print("=" * 60)

# Check for negative values where they shouldn't exist
# (longitude is skipped: California longitudes are legitimately negative)
numeric_cols = df.select_dtypes(include=[np.number]).columns.drop('longitude', errors='ignore')
print("\nChecking for negative values in numeric columns (excluding longitude):")
for col in numeric_cols:
    negative_count = (df[col] < 0).sum()
    if negative_count > 0:
        print(f"  ⚠ {col}: {negative_count} negative values found")
    else:
        print(f"  ✓ {col}: No negative values")

# Check ocean_proximity categories
print("\nChecking ocean_proximity categories:")
print(f"  Unique values: {df['ocean_proximity'].unique()}")
print(f"  Value counts:\n{df['ocean_proximity'].value_counts()}")

# Check for extreme outliers in key variables
print("\n" + "=" * 60)
print("OUTLIER DETECTION (Using IQR Method)")
print("=" * 60)
import matplotlib.pyplot as plt

def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

key_variables = ['median_house_value', 'median_income', 'housing_median_age', 'total_rooms']
for var in key_variables:
    if var in df.columns:
        outliers, lower, upper = detect_outliers_iqr(df, var)
        outlier_pct = (len(outliers) / len(df)) * 100
        print(f"\n{var}:")
        print(f"  Lower bound: {lower:.2f}, Upper bound: {upper:.2f}")
        print(f"  Outliers: {len(outliers)} ({outlier_pct:.2f}%)")

# Handle missing values in total_bedrooms (if any)
# Count the gaps BEFORE filling so the message reports the real number imputed
n_missing_bedrooms = df['total_bedrooms'].isna().sum()
if n_missing_bedrooms > 0:
    print("\n" + "=" * 60)
    print("HANDLING MISSING VALUES")
    print("=" * 60)
    # Impute missing bedrooms with the median (assignment avoids chained inplace fillna)
    median_bedrooms = df['total_bedrooms'].median()
    df['total_bedrooms'] = df['total_bedrooms'].fillna(median_bedrooms)
    print(f"✓ Imputed {n_missing_bedrooms} missing values in total_bedrooms with median: {median_bedrooms}")

# Remove duplicates if found
if duplicate_count > 0:
    df = df.drop_duplicates()
    print(f"\n✓ Removed {duplicate_count} duplicate rows")
    print(f"  New dataset shape: {df.shape}")

print("\n" + "=" * 60)
print("DATA QUALITY CHECK COMPLETE")
print("=" * 60)
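
The IQR check above flags extreme values, but it does not reveal whether a variable was capped at a ceiling. A capped column tends to show many rows sharing the exact maximum. The sketch below illustrates that check under stated assumptions: `share_at_max` is a hypothetical helper, and the numbers in `demo` are invented for illustration, not taken from the dataset.

```python
import pandas as pd

def share_at_max(series):
    """Fraction of non-missing values that equal the column maximum."""
    values = series.dropna()
    return float((values == values.max()).mean())

# Invented illustrative values (assumption): "capped" piles up at 500,
# while "uncapped" reaches its maximum only once.
demo = pd.DataFrame({
    "capped": [100, 250, 500, 500, 500, 500, 320, 500],
    "uncapped": [101, 250, 498, 377, 512, 140, 320, 265],
})

for col in demo.columns:
    print(f"{col}: {share_at_max(demo[col]):.1%} of rows sit at the maximum")
```

Applied to the notebook's data, `share_at_max(df['median_house_value'])` and `share_at_max(df['median_income'])` would show how much of each column sits at its ceiling. A share far above what a continuous variable would produce is a sign of capping.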

### 2.3 Descriptive Explorations

- The goal of data exploration is to **discover patterns, relationships, and stories** within your data before modeling. This step is not about following fixed rules — it’s about curiosity, creativity, and developing intuition. You are encouraged to **freely explore** the data in ways that make sense for your project — visualize, compare, or test relationships.

In [None]:
# ============================================================
# DESCRIPTIVE EXPLORATIONS
# ============================================================
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

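# Set style for better-looking plots; style names differ across matplotlib versions
for style_name in ('seaborn-v0_8-darkgrid', 'seaborn-darkgrid', 'ggplot'):
    try:
        plt.style.use(style_name)
        break
    except OSError:
        continue  # style not available in this matplotlib version, try the next one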
sns.set_palette("husl")

# 1. Distribution of Target Variable
print("=" * 60)
print("1. TARGET VARIABLE DISTRIBUTION")
print("=" * 60)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(df['median_house_value'], bins=50, edgecolor='black', alpha=0.7)
axes[0].set_xlabel('Median House Value ($)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Median House Values')
axes[0].axvline(df['median_house_value'].median(), color='red', linestyle='--', 
                label=f'Median: ${df["median_house_value"].median():,.0f}')
axes[0].legend()

# Box plot
axes[1].boxplot(df['median_house_value'], vert=True)
axes[1].set_ylabel('Median House Value ($)')
axes[1].set_title('Box Plot of Median House Values')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nTarget Variable Statistics:")
print(df['median_house_value'].describe())

# 2. Geographic Distribution
print("\n" + "=" * 60)
print("2. GEOGRAPHIC DISTRIBUTION")
print("=" * 60)

fig, ax = plt.subplots(figsize=(12, 8))
scatter = ax.scatter(df['longitude'], df['latitude'], 
                    c=df['median_house_value'], cmap='viridis', 
                    alpha=0.6, s=20, edgecolors='black', linewidth=0.5)
ax.set_xlabel('Longitude')
ax.set_ylabel('Latitude')
ax.set_title('Geographic Distribution of House Values in California')
plt.colorbar(scatter, label='Median House Value ($)')
plt.tight_layout()
plt.show()

# 3. Correlation Analysis
print("\n" + "=" * 60)
print("3. CORRELATION ANALYSIS")
print("=" * 60)

# Calculate correlation matrix for numeric variables
numeric_df = df.select_dtypes(include=[np.number])
correlation_matrix = numeric_df.corr()

# Create heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', 
            center=0, square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Numeric Variables', fontsize=14, pad=20)
plt.tight_layout()
plt.show()

# Show correlations with target variable
print("\nCorrelations with Median House Value:")
target_corr = correlation_matrix['median_house_value'].sort_values(ascending=False)
print(target_corr)

# 4. Distribution of Key Predictor Variables
print("\n" + "=" * 60)
print("4. DISTRIBUTION OF KEY PREDICTOR VARIABLES")
print("=" * 60)

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Median Income
axes[0, 0].hist(df['median_income'], bins=50, edgecolor='black', alpha=0.7, color='skyblue')
axes[0, 0].set_xlabel('Median Income (scaled)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].set_title('Distribution of Median Income')

# Housing Median Age
axes[0, 1].hist(df['housing_median_age'], bins=30, edgecolor='black', alpha=0.7, color='lightgreen')
axes[0, 1].set_xlabel('Housing Median Age (years)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Distribution of Housing Median Age')

# Total Rooms
axes[1, 0].hist(df['total_rooms'], bins=50, edgecolor='black', alpha=0.7, color='salmon')
axes[1, 0].set_xlabel('Total Rooms')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Distribution of Total Rooms')

# Population
axes[1, 1].hist(df['population'], bins=50, edgecolor='black', alpha=0.7, color='plum')
axes[1, 1].set_xlabel('Population')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].set_title('Distribution of Population')

plt.tight_layout()
plt.show()

# 5. Relationship Between Key Variables and Target
print("\n" + "=" * 60)
print("5. RELATIONSHIPS WITH TARGET VARIABLE")
print("=" * 60)

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# Median Income vs House Value
axes[0, 0].scatter(df['median_income'], df['median_house_value'], alpha=0.5, s=10)
axes[0, 0].set_xlabel('Median Income (scaled)')
axes[0, 0].set_ylabel('Median House Value ($)')
axes[0, 0].set_title('Median Income vs House Value')
axes[0, 0].grid(True, alpha=0.3)

# Housing Age vs House Value
axes[0, 1].scatter(df['housing_median_age'], df['median_house_value'], alpha=0.5, s=10)
axes[0, 1].set_xlabel('Housing Median Age (years)')
axes[0, 1].set_ylabel('Median House Value ($)')
axes[0, 1].set_title('Housing Age vs House Value')
axes[0, 1].grid(True, alpha=0.3)

# Total Rooms vs House Value
axes[1, 0].scatter(df['total_rooms'], df['median_house_value'], alpha=0.5, s=10)
axes[1, 0].set_xlabel('Total Rooms')
axes[1, 0].set_ylabel('Median House Value ($)')
axes[1, 0].set_title('Total Rooms vs House Value')
axes[1, 0].grid(True, alpha=0.3)

# Population vs House Value
axes[1, 1].scatter(df['population'], df['median_house_value'], alpha=0.5, s=10)
axes[1, 1].set_xlabel('Population')
axes[1, 1].set_ylabel('Median House Value ($)')
axes[1, 1].set_title('Population vs House Value')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# 6. Ocean Proximity Analysis
print("\n" + "=" * 60)
print("6. OCEAN PROXIMITY ANALYSIS")
print("=" * 60)

fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Box plot by ocean proximity
df.boxplot(column='median_house_value', by='ocean_proximity', ax=axes[0])
fig.suptitle('')  # drop the automatic "Boxplot grouped by ..." figure title that pandas adds
axes[0].set_xlabel('Ocean Proximity')
axes[0].set_ylabel('Median House Value ($)')
axes[0].set_title('House Values by Ocean Proximity')
axes[0].grid(True, alpha=0.3)

# Bar plot of average house values
ocean_avg = df.groupby('ocean_proximity')['median_house_value'].mean().sort_values(ascending=False)
ocean_avg.plot(kind='bar', ax=axes[1], color='teal', edgecolor='black')
axes[1].set_xlabel('Ocean Proximity')
axes[1].set_ylabel('Average Median House Value ($)')
axes[1].set_title('Average House Values by Ocean Proximity')
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nAverage House Values by Ocean Proximity:")
print(ocean_avg)

# 7. Derived Variables
print("\n" + "=" * 60)
print("7. DERIVED VARIABLES ANALYSIS")
print("=" * 60)

# Create derived variables
df['rooms_per_household'] = df['total_rooms'] / df['households']
df['bedrooms_per_room'] = df['total_bedrooms'] / df['total_rooms']
df['population_per_household'] = df['population'] / df['households']

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Rooms per household vs House Value
axes[0].scatter(df['rooms_per_household'], df['median_house_value'], alpha=0.5, s=10)
axes[0].set_xlabel('Rooms per Household')
axes[0].set_ylabel('Median House Value ($)')
axes[0].set_title('Rooms per Household vs House Value')
axes[0].grid(True, alpha=0.3)

# Bedrooms per room vs House Value
axes[1].scatter(df['bedrooms_per_room'], df['median_house_value'], alpha=0.5, s=10)
axes[1].set_xlabel('Bedrooms per Room')
axes[1].set_ylabel('Median House Value ($)')
axes[1].set_title('Bedrooms per Room vs House Value')
axes[1].grid(True, alpha=0.3)

# Population per household vs House Value
axes[2].scatter(df['population_per_household'], df['median_house_value'], alpha=0.5, s=10)
axes[2].set_xlabel('Population per Household')
axes[2].set_ylabel('Median House Value ($)')
axes[2].set_title('Population per Household vs House Value')
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nCorrelations of Derived Variables with House Value:")
derived_vars = ['rooms_per_household', 'bedrooms_per_room', 'population_per_household']
for var in derived_vars:
    corr = df[var].corr(df['median_house_value'])
    print(f"  {var}: {corr:.4f}")

print("\n" + "=" * 60)
print("DESCRIPTIVE EXPLORATION COMPLETE")
print("=" * 60)
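
The per-category averages in the ocean proximity analysis can mislead when a category contains very few rows. Reporting the row count and the median next to the mean makes that visible. The sketch below uses an invented stand-in frame (`demo`), not the real data; the column names match the notebook, so swapping in `df` should work unchanged.

```python
import pandas as pd

# Invented stand-in rows (assumption), chosen so one category has a single row.
demo = pd.DataFrame({
    "ocean_proximity": ["INLAND", "INLAND", "INLAND", "NEAR BAY", "NEAR BAY", "ISLAND"],
    "median_house_value": [100_000, 120_000, 140_000, 250_000, 270_000, 400_000],
})

# Count, mean, and median per category, sorted by mean like the bar chart.
summary = (
    demo.groupby("ocean_proximity")["median_house_value"]
    .agg(["count", "mean", "median"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

A category that tops the ranking with only a few rows deserves a caveat in the written insights, since a handful of districts can swing its average.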

### 2.4 Insights

In this section, summarize **what you discovered** from your data exploration (Section 2.3).  
Your goal is to **translate observations into insights** that connect back to your business question.  


**Key Insights from Data Exploration:**

> **1. Median Income is the Strongest Predictor**
> - Median income shows the strongest positive correlation (approximately 0.69) with median house value, making it the single most informative predictor in the dataset. This makes intuitive business sense: higher-income areas can support higher property values. The relationship looks roughly linear. Median income is capped at a scaled value of about 15, which indicates the variable was top-coded in the source data, so incomes in the wealthiest districts are understated.

> **2. Geographic Location Strongly Influences House Values**
> - The geographic visualization reveals clear spatial patterns: coastal areas (particularly near the ocean and bays) show significantly higher house values than inland regions. In the ocean proximity analysis, ISLAND locations have the highest average values, followed by NEAR BAY and NEAR OCEAN, while INLAND districts are the lowest. The ISLAND category contains only a handful of districts, so its average should be read with caution. This geographic premium is a critical factor for real estate stakeholders to consider when pricing properties or making investment decisions.

> **3. Derived Variables Reveal Important Housing Characteristics**
> - The derived variable "rooms_per_household" shows a positive correlation with house values, suggesting that larger homes (more rooms per household) command higher prices. This helps explain why total_rooms alone is a weak predictor: it mostly measures the size of the district, whereas the ratio to households measures the size of the homes. Housing age, by contrast, shows no simple linear relationship with value in the scatter plot, so any age effect is likely to be non-linear or to depend on location.

> **4. Data Quality and Distribution Characteristics**
> - The dataset is generally clean, with minimal missing values (primarily in total_bedrooms, which was handled through median imputation). However, the target variable (median_house_value) is right-skewed and shows a visible pile-up at the top of its range (about $500,000), which indicates the values were capped (top-coded) in the source data. Predictions for the most expensive districts will therefore be biased downward. The outliers in variables like total_rooms and population point to the need for careful feature engineering (for example, per-household ratios) and for modeling approaches that are robust to extreme values.

> **5. Unexpected Patterns and Business Implications**
> - Average household size (population_per_household) shows a negative correlation with house values, suggesting that districts with more crowded households tend to have lower property values. Note that this ratio measures occupancy per household, not population density per unit of land. The correlation is computed across all districts, so it describes an overall tendency rather than a pattern in specific areas. Additionally, the correlation analysis shows that some variables (such as total_bedrooms and total_rooms) are highly correlated with each other, suggesting multicollinearity that should be addressed in modeling through feature selection or dimensionality reduction techniques.
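
The multicollinearity noted above can be quantified with variance inflation factors (VIFs). The sketch below computes them from the inverse of the correlation matrix, which is one standard way to obtain VIFs; it is offered as an illustration, not as the method the project must use. `vif_table` is a hypothetical helper, and `demo` is synthetic data built so that `total_rooms` tracks `households`.

```python
import numpy as np
import pandas as pd

def vif_table(frame):
    """Variance inflation factors from the inverse of the correlation matrix."""
    corr = frame.corr().to_numpy()
    # The j-th diagonal entry of inv(corr) equals 1 / (1 - R_j^2),
    # where R_j^2 comes from regressing column j on the other columns.
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns, name="VIF").sort_values(ascending=False)

# Synthetic stand-in data (assumption): total_rooms is constructed from
# households plus noise, while median_income is generated independently.
rng = np.random.default_rng(0)
households = rng.normal(500, 100, size=200)
demo = pd.DataFrame({
    "households": households,
    "total_rooms": 5 * households + rng.normal(0, 50, size=200),
    "median_income": rng.normal(4, 1.5, size=200),
})

vifs = vif_table(demo)
print(vifs.round(2))
```

A common rule of thumb treats VIFs above roughly 5 to 10 as a sign of problematic collinearity. In the notebook, the function could be applied to the numeric predictors with `vif_table(df[numeric_predictor_columns])`, where `numeric_predictor_columns` is a placeholder for the list of columns to test.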