Step 1: Loading and Exploring the California Housing Dataset

This step loads the California Housing dataset, which records attributes of housing in California districts along with the median house value in each district. Understanding the data is crucial before building any model, so I'm examining:

- The dataset structure: 20,640 samples with 8 features
- The feature names and what they represent (MedInc, HouseAge, AveRooms, AveBedrms, Population, AveOccup, Latitude, Longitude)
- Basic descriptive statistics to understand the ranges and distributions
- The distribution of our target variable (housing prices, measured in $100,000s)
- Correlations between features and prices
- Visual relationships between key features and housing prices

The California Housing dataset is excellent for our neural network demonstration because:
- It's a regression problem (predicting continuous housing prices).
- It has multiple features with different scales and relationships.
- It has over 20,000 samples, providing enough data for the network to learn meaningful patterns.
- It contains geographical information (latitude and longitude), which introduces interesting spatial relationships.

The correlations show that median income (MedInc) has the strongest positive correlation with housing prices. This makes intuitive sense: areas with higher incomes tend to have more expensive housing. The scatter plots help visualize these relationships, showing both linear and non-linear patterns that our neural network will need to learn.

In [None]:
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

# Set seeds for reproducible results
tf.random.set_seed(1)
np.random.seed(1)
random.seed(1)

# Load the California Housing dataset
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split

# Load the dataset
housing = fetch_california_housing()
X = housing.data
y = housing.target

# Examine the dataset
print(f"California Housing dataset shape: {X.shape}")
print(f"Features: {housing.feature_names}")
print("Target variable: Median house value in $100,000s")

# View descriptive statistics
housing_df = pd.DataFrame(X, columns=housing.feature_names)
housing_df['PRICE'] = y
print("\nDescriptive Statistics:")
print(housing_df.describe())

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(f"\nTraining set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

# Plot the distribution of housing prices
plt.figure(figsize=(10, 6))
plt.hist(y, bins=30)
plt.xlabel('Price ($100,000s)')
plt.ylabel('Count')
plt.title('Distribution of Housing Prices')
plt.show()

# Look at correlations with target variable
correlations = housing_df.corr()['PRICE'].sort_values(ascending=False)
print("\nFeature correlations with price:")
print(correlations)

# Plot a few key features against price
plt.figure(figsize=(15, 10))

plt.subplot(2, 2, 1)
plt.scatter(housing_df['MedInc'], housing_df['PRICE'], alpha=0.5)
plt.xlabel('Median Income (MedInc)')
plt.ylabel('Price ($100,000s)')
plt.title('Price vs. Median Income')

plt.subplot(2, 2, 2)
plt.scatter(housing_df['AveRooms'], housing_df['PRICE'], alpha=0.5)
plt.xlabel('Average Rooms (AveRooms)')
plt.ylabel('Price ($100,000s)')
plt.title('Price vs. Average Rooms')

plt.subplot(2, 2, 3)
plt.scatter(housing_df['AveBedrms'], housing_df['PRICE'], alpha=0.5)
plt.xlabel('Average Bedrooms (AveBedrms)')
plt.ylabel('Price ($100,000s)')
plt.title('Price vs. Average Bedrooms')

plt.subplot(2, 2, 4)
plt.scatter(housing_df['Population'], housing_df['PRICE'], alpha=0.5)
plt.xlabel('Population')
plt.ylabel('Price ($100,000s)')
plt.title('Price vs. Population')

plt.tight_layout()
plt.show()

Step 2: Building a Simple Neural Network

Build a simple feedforward neural network with the following structure:
- Input Layer: Accepts 8 input features from the California Housing dataset
- First Hidden Layer: 64 neurons with ReLU activation
- Second Hidden Layer: 32 neurons with ReLU activation
- Output Layer: A single neuron with no activation function (linear output)

Let's break down the key architectural decisions:
- Layer Sizes: I chose 64 neurons for the first layer and 32 for the second. This decreasing width pattern helps the network gradually distill the 8 input features into more abstract representations. The first layer is wider to capture various feature interactions, while subsequent layers consolidate this information.
- Activation Functions: ReLU (Rectified Linear Unit) activations introduce non-linearity, allowing the network to learn complex relationships. For each neuron, ReLU outputs the input directly if it's positive; otherwise it outputs zero. This non-linearity is crucial: without it, multiple layers would simply collapse into a single linear transformation.
- Output Layer: For regression problems like housing price prediction, we use a single output neuron with no activation function. This allows the network to predict any numerical value along the real number line, which is necessary for price predictions.
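
The ReLU behavior and the "collapse" argument above can be checked numerically. The following is a minimal NumPy sketch, not part of the Keras model: the weight matrices `W1` and `W2` are random placeholders whose shapes mirror the 8 → 64 → 32 layers, and biases are left out for brevity.

```python
import numpy as np

def relu(x):
    """Element-wise ReLU: keep positive values, replace the rest with zero."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)

# Illustrative random weights only (assumption: no biases, shapes 8 -> 64 -> 32)
W1 = rng.normal(size=(8, 64))
W2 = rng.normal(size=(64, 32))
x = rng.normal(size=(5, 8))  # 5 made-up samples with 8 features each

# Without an activation, two stacked layers equal one layer with weights W1 @ W2
two_linear_layers = (x @ W1) @ W2
one_linear_layer = x @ (W1 @ W2)
collapses = np.allclose(two_linear_layers, one_linear_layer)

# With ReLU in between, the output no longer matches that single linear map
with_relu = relu(x @ W1) @ W2
differs = not np.allclose(with_relu, one_linear_layer)

print(relu(np.array([-2.0, 0.0, 3.5])))
print(f"Stacked linear layers collapse into one: {collapses}")
print(f"ReLU prevents the collapse: {differs}")
```
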

The model summary reveals the parameter count for each layer:
- Dense(64): (8 inputs × 64 outputs) + 64 biases = 576 parameters
- Dense(32): (64 inputs × 32 outputs) + 32 biases = 2,080 parameters
- Dense(1): (32 inputs × 1 output) + 1 bias = 33 parameters

In total, this relatively simple network has 2,689 trainable parameters (weights and biases), even though it takes only 8 input features. Each of these parameters will be adjusted during training through backpropagation and gradient descent.
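
The parameter arithmetic above can be reproduced in a few lines. This is a small sketch; `dense_param_count` is a hypothetical helper written for illustration, not a Keras function.

```python
def dense_param_count(n_inputs, n_units):
    """One weight per input-unit pair, plus one bias per unit."""
    return n_inputs * n_units + n_units

# Input features followed by the number of units in each Dense layer
layer_sizes = [8, 64, 32, 1]

counts = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(counts)

print(f"Parameters per layer: {counts}")
print(f"Total trainable parameters: {total_params}")
```

Once the model in this step has been built, `model.count_params()` should report the same total.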

In [None]:
# Define a simple neural network model
model = keras.Sequential([
    # Input layer - explicit definition for clarity
    keras.layers.Input(shape=(8,)),  # 8 features in the California dataset
    
    # First hidden layer
    keras.layers.Dense(units=64, activation='relu'),
    
    # Second hidden layer
    keras.layers.Dense(units=32, activation='relu'),
    
    # Output layer - single neuron with no activation for regression
    keras.layers.Dense(units=1)
])

# Display the model summary
model.summary()

Step 3: Compiling the Model

In [None]:
# Compile the model with basic settings
model.compile(
    optimizer="RMSprop",
    loss='mean_squared_error',  # Standard loss for regression
    metrics=['mae']  # Mean Absolute Error in $100,000s
)
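
To make the loss and metric named in the compile call concrete, here is a minimal NumPy sketch of how mean squared error and mean absolute error are computed. The `y_true` and `y_pred` values are made up for illustration and are not model output.

```python
import numpy as np

# Made-up targets and predictions, in units of $100,000
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])

errors = y_pred - y_true

# MSE squares each error, so large misses are penalized more heavily
mse = np.mean(errors ** 2)

# MAE averages the absolute errors and stays in the target's units
mae = np.mean(np.abs(errors))

# Convert the MAE from $100,000s to dollars for readability
mae_dollars = mae * 100_000

print(f"MSE: {mse:.4f}")
print(f"MAE: {mae:.4f} ($100,000s), roughly ${mae_dollars:,.0f}")
```

Because MAE is expressed in the same units as the target, it is the easier of the two numbers to read as a typical pricing error.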

Step 4: Training the Model

In [None]:
# Train the model with detailed monitoring
history = model.fit(
    X_train,                  # Input features
    y_train,                  # Target housing prices
    batch_size=64,            # Process 64 examples per gradient update
    epochs=100,               # Maximum number of passes through the dataset
    validation_split=0.2,     # Use 20% of training data for validation
    verbose=1                 # Show progress during training
)

# Store training metrics for analysis
train_loss = history.history['loss']
val_loss = history.history['val_loss']
train_mae = history.history['mae']
val_mae = history.history['val_mae']
epochs_range = range(1, len(train_loss) + 1)
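
The `validation_split` and `batch_size` settings determine how many samples drive each gradient update. Keras takes the validation samples from the end of the arrays passed to `fit`, before any shuffling. The sketch below works through the resulting sample counts; the exact rounding Keras applies is an assumption here, so treat the numbers as approximate.

```python
import math

n_total = 20640          # samples in the California Housing dataset
test_size = 0.2          # share held out by train_test_split
validation_split = 0.2   # share of the training data held out by fit()
batch_size = 64

# scikit-learn rounds the test share up
n_test = math.ceil(n_total * test_size)
n_train_full = n_total - n_test

# Assumption: the count used for fitting rounds down,
# and the remainder becomes the validation set
n_fit = math.floor(n_train_full * (1 - validation_split))
n_val = n_train_full - n_fit

# The last batch of an epoch may be smaller than batch_size
steps_per_epoch = math.ceil(n_fit / batch_size)

print(f"Test samples: {n_test}")
print(f"Samples used for gradient updates: {n_fit}")
print(f"Validation samples: {n_val}")
print(f"Gradient updates per epoch: {steps_per_epoch}")
```
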

Step 5: Analyzing Training Progress

In [None]:
# Plot training history
plt.figure(figsize=(16, 6))

# Plot loss
plt.subplot(1, 2, 1)
plt.plot(epochs_range, train_loss, label='Training Loss (MSE)')
plt.plot(epochs_range, val_loss, label='Validation Loss (MSE)')
plt.xlabel('Epoch')
plt.ylabel('Mean Squared Error')
plt.title('Loss During Training')
plt.grid(True)
plt.legend()

# Plot mean absolute error
plt.subplot(1, 2, 2)
plt.plot(epochs_range, train_mae, label='Training MAE')
plt.plot(epochs_range, val_mae, label='Validation MAE')
plt.xlabel('Epoch')
plt.ylabel('Mean Absolute Error ($100,000s)')
plt.title('Mean Absolute Error During Training')
plt.grid(True)
plt.legend()

plt.tight_layout()
plt.show()

# Print final training stats
print(f"Final training loss (MSE): {train_loss[-1]:.4f}")
print(f"Final validation loss (MSE): {val_loss[-1]:.4f}")
print(f"Final training MAE: {train_mae[-1]:.4f} ($100,000s)")
print(f"Final validation MAE: {val_mae[-1]:.4f} ($100,000s)")

# Check if training was successful
if val_loss[-1] < val_loss[0]:
    improvement = (1 - val_loss[-1]/val_loss[0]) * 100
    print(f"Model improved by {improvement:.1f}% during training.")
else:
    print("Model did not improve during training. Consider adjusting hyperparameters.")

Step 6: Evaluating the Model on Test Data

In [None]:
# Evaluate on the test set
test_loss, test_mae = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss (MSE): {test_loss:.4f}")
print(f"Test MAE: {test_mae:.4f} ($100,000s)")

# Make predictions and analyze errors
predictions = model.predict(X_test)
errors = predictions.flatten() - y_test

# Calculate key error metrics
mean_error = np.mean(errors)
median_error = np.median(errors)
max_error = np.max(np.abs(errors))

print(f"Mean prediction error: {mean_error:.4f} ($100,000s)")
print(f"Median prediction error: {median_error:.4f} ($100,000s)")
print(f"Maximum absolute prediction error: {max_error:.4f} ($100,000s)")

# Plot actual vs predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, predictions, alpha=0.5)
plt.plot([0, 5], [0, 5], 'r--')  # Perfect prediction line
plt.xlabel('Actual Price ($100,000s)')
plt.ylabel('Predicted Price ($100,000s)')
plt.title('Actual vs. Predicted Housing Prices')
plt.grid(True)
plt.show()

# Plot error distribution
plt.figure(figsize=(10, 6))
plt.hist(errors, bins=50)
plt.xlabel('Prediction Error ($100,000s)')
plt.ylabel('Count')
plt.title('Distribution of Prediction Errors')
plt.axvline(x=0, color='r', linestyle='--')
plt.grid(True)
plt.show()

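# Check error by price range (bin edges are in $100,000s, so 1 means $100k)
# Note: targets at or above 5.0 (the dataset's capped values) fall outside these bins
price_ranges = [0, 1, 2, 3, 4, 5]
for i in range(len(price_ranges)-1):
    # Filter test data for this price range
    mask = (y_test >= price_ranges[i]) & (y_test < price_ranges[i+1])
    range_count = np.sum(mask)

    # Only compute the MAE when the bin is non-empty (avoids a mean over zero items)
    if range_count > 0:
        range_mae = np.mean(np.abs(errors[mask]))
        print(f"MAE for prices ${price_ranges[i]*100}k-${price_ranges[i+1]*100}k: {range_mae:.4f} ($100,000s), {range_count} samples")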