# 🏠 Boston Housing Dataset - ML Homework Exercises
## Hands-On Practice for Data Science & Machine Learning

Welcome to your homework assignment! This notebook contains practical exercises using the Boston Housing dataset. You'll apply the concepts learned in the ML 101 basics session.

### 📋 Learning Objectives:
By completing these exercises, you will:
- Practice data exploration and visualization techniques
- Handle missing data and engineer new features
- Build and compare multiple ML models
- Evaluate model performance using various metrics
- Draw business insights from your analysis

### 🎯 Dataset Context:
The Boston Housing dataset contains 506 records describing housing in Boston suburbs, based on 1970 census data. Your goal is to predict median home values (`medv`, in $1000s) from neighborhood characteristics.

### ⏱️ Time Estimate: 
Allow 90-120 minutes to complete all exercises.

---

## 📚 Exercise Structure:
- **🔍 Exploration Exercises** - Data understanding and visualization
- **🔧 Preprocessing Challenges** - Data cleaning and feature engineering  
- **🤖 Modeling Tasks** - Training and comparing ML models
- **📊 Evaluation & Insights** - Performance analysis and business conclusions

---

### 🚀 Ready to Start? 
Work through each exercise step by step. Don't hesitate to experiment and try different approaches!

## Exercise 1: Data Loading & Initial Exploration (15 minutes)

### 🎯 Your Task:
Load the Boston Housing dataset and perform initial data exploration to understand the structure and characteristics of the data.

In [None]:
# Exercise 1.1: Import Required Libraries
# TODO: Import the essential libraries for data science and ML
# Hint: You'll need pandas, numpy, matplotlib, seaborn, and sklearn modules

# Your code here:


print("✅ Libraries imported successfully!")

In [None]:
# Exercise 1.2: Load the Boston Housing Dataset
# TODO: Load the Boston Housing dataset
# Hint: sklearn.datasets.load_boston() was deprecated in scikit-learn 1.0 and removed in 1.2,
# so it is not available in current versions. We load the same data from a CSV file at a URL instead.

import pandas as pd
import numpy as np

# Load the Boston Housing dataset from a URL (load_boston is no longer available in newer sklearn)
url = "https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv"
# TODO: Load the dataset using pandas
# df = pd.read_csv(url)

# Your code here:


# TODO: Display basic information about the dataset
# - Shape of the dataset
# - First few rows
# - Column names

# Your code here:


In [None]:
# Exercise 1.3: Basic Data Statistics
# TODO: Generate and interpret basic statistics for the dataset
# - Use .describe() method
# - Identify the target variable (medv - median value of homes)
# - Find min, max, and mean of the target variable

# Your code here:


print("\n🎯 TARGET VARIABLE ANALYSIS:")
# TODO: Print target variable statistics
# target_col = 'medv'  # median value of homes in $1000s


In [None]:
# Exercise 1.4: Data Quality Check
# TODO: Check for missing values and data types
# - Check for null values in each column
# - Verify data types are appropriate
# - Look for any obvious data quality issues

print("❓ MISSING VALUES CHECK:")
# Your code here:


print("\n📊 DATA TYPES:")
# Your code here:


### 🔍 **Reflection Questions for Exercise 1:**
1. How many features (variables) are in the dataset?
2. What is the target variable and what does it represent?
3. Are there any missing values that need to be handled?
4. What is the range of home prices in the dataset?

**Expected Outcome:** You should have a clean dataset loaded with 506 rows and 14 columns, with 'medv' as the target variable representing median home values.
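
The checks in this exercise can be sketched on a tiny hand-made table. This is only an illustration, not the solution: the `toy` DataFrame and its values are made up for this example (with one missing value added on purpose), and only the column names mirror the real dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset (hypothetical values, not real records).
toy = pd.DataFrame({
    "rm": [6.5, 5.9, 7.1, 6.0],
    "lstat": [5.0, 12.4, 3.1, None],  # one deliberately missing value
    "medv": [24.0, 19.5, 34.7, 21.0],
})

# Structure: (rows, columns), first rows, and data types.
print(toy.shape)
print(toy.head())
print(toy.dtypes)

# Count missing values per column.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Min, max, and mean of the target column.
target_summary = toy["medv"].agg(["min", "max", "mean"])
print(target_summary)
```

The same calls should work unchanged on the full `df` once it is loaded.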

## Exercise 2: Exploratory Data Analysis (25 minutes)

### 🎯 Your Task:
Create visualizations to understand the relationships between features and the target variable. Identify patterns that might be useful for prediction.

In [None]:
# Exercise 2.1: Target Variable Distribution
# TODO: Create visualizations to understand the distribution of home prices
# - Histogram of medv (target variable)
# - Box plot to identify outliers
# - Calculate and display basic statistics

import matplotlib.pyplot as plt
import seaborn as sns

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")

# TODO: Create subplots for histogram and boxplot
# fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Your code here for histogram and boxplot:


print("📊 HOME PRICE STATISTICS:")
# TODO: Print statistics about the target variable


In [None]:
# Exercise 2.2: Feature Correlation Analysis
# TODO: Create a correlation heatmap to understand relationships between variables
# - Calculate correlation matrix
# - Create a heatmap using seaborn
# - Identify features most correlated with medv (target)

print("🔍 CORRELATION ANALYSIS:")
# TODO: Calculate correlations with target variable
# correlation_with_target = df.corr()['medv'].sort_values(ascending=False)

# Your code here:



# TODO: Create correlation heatmap
# plt.figure(figsize=(12, 10))

# Your code here:


In [None]:
# Exercise 2.3: Key Feature Relationships
# TODO: Create scatter plots for the top 4 features most correlated with price
# Hint: Use the correlation analysis from Exercise 2.2 to identify top features

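# TODO: Get top correlated features by absolute correlation (excluding target itself)
# Sorting by absolute value keeps strong negative correlations such as 'lstat'
# top_features = correlation_with_target.drop('medv').abs().sort_values(ascending=False).head(4).index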

print("📈 TOP FEATURES CORRELATED WITH PRICE:")
# Your code here to print top correlations:


# TODO: Create 2x2 subplot of scatter plots
# fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Your code here for scatter plots:


In [None]:
# Exercise 2.4: Categorical Analysis (if any)
# TODO: Identify if there are any categorical or binary variables
# Hint: Look for variables with few unique values
# Explore 'chas' (Charles River dummy variable) if it exists

print("🏷️ CATEGORICAL VARIABLES ANALYSIS:")
# TODO: Check unique values for each column to identify categorical variables
# for col in df.columns:
#     unique_vals = df[col].nunique()
#     if unique_vals < 10:
#         print(f"{col}: {unique_vals} unique values - {df[col].unique()}")

# Your code here:


# TODO: If 'chas' exists, create a boxplot showing price by Charles River location
# This shows price differences between homes near vs. not near Charles River

# Your code here:


### 🔍 **Reflection Questions for Exercise 2:**
1. What is the distribution shape of home prices? Are there outliers?
2. Which 3 features have the strongest correlation with home prices?
3. Are there any features that show clear linear relationships with price?
4. Do homes near the Charles River have different price patterns?

**Expected Outcome:** You should identify key features like 'rm' (rooms), 'lstat' (lower status %), and others that strongly correlate with housing prices.
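
One detail worth illustrating: a strongly negative correlation is just as informative as a strongly positive one, so features are best ranked by absolute correlation. The sketch below uses synthetic data; the column names `x_pos`, `x_neg`, `noise`, and `target` are invented for this example and do not come from the dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
x_pos = rng.normal(size=n)
x_neg = rng.normal(size=n)

# Synthetic table: target rises weakly with x_pos and falls strongly with x_neg.
toy = pd.DataFrame({
    "x_pos": x_pos,
    "x_neg": x_neg,
    "noise": rng.normal(size=n),
    "target": 1.0 * x_pos - 3.0 * x_neg + rng.normal(scale=0.5, size=n),
})

corr_with_target = toy.corr()["target"].drop("target")

# A plain descending sort pushes the strongest (negative) feature to the bottom.
plain_order = corr_with_target.sort_values(ascending=False)

# Ranking by absolute value puts it first while keeping the sign visible.
abs_order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(abs_order)
print(ranked)
```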

## Exercise 3: Data Preprocessing & Feature Engineering (20 minutes)

### 🎯 Your Task:
Prepare the data for machine learning by handling outliers, creating new features, and splitting the dataset.

In [None]:
# Exercise 3.1: Handle Outliers (Optional Challenge)
# TODO: Identify and decide how to handle extreme outliers in the target variable
# - Define outliers using IQR method or Z-score
# - Decide whether to remove, cap, or keep outliers
# - Justify your decision

print("🔍 OUTLIER ANALYSIS:")
# TODO: Calculate IQR for target variable
# Q1 = df['medv'].quantile(0.25)
# Q3 = df['medv'].quantile(0.75)
# IQR = Q3 - Q1

# Your code here:


# TODO: Identify outliers
# lower_bound = Q1 - 1.5 * IQR
# upper_bound = Q3 + 1.5 * IQR
# outliers = df[(df['medv'] < lower_bound) | (df['medv'] > upper_bound)]

# Your code here:


print(f"📊 Found {len(outliers)} outliers out of {len(df)} total observations")
# TODO: Decide on outlier treatment and implement

In [None]:
# Exercise 3.2: Feature Engineering
# TODO: Create new features that might improve model performance
# Ideas:
# - Ratios between predictors (e.g. tax per room: tax/rm)
# - Crime rate categories (high/medium/low based on 'crim')
# - Age categories for buildings
# - Combined socioeconomic indicators
# Note: do not build features from the target 'medv' (e.g. medv/rm); that leaks the answer into X

print("🔧 FEATURE ENGINEERING:")
# TODO: Create new features
# Example: df['tax_per_room'] = df['tax'] / df['rm']

# Your code here:


# TODO: Display information about your new features
# print("✅ New features created:")
# for feature in new_features:
#     print(f"  - {feature}: {description}")


In [None]:
# Exercise 3.3: Feature Selection
# TODO: Select the final set of features for modeling
# - Include original features that showed strong correlations
# - Include your engineered features
# - Remove any irrelevant or redundant features

print("🎯 FEATURE SELECTION:")
# TODO: Define your feature list
# selected_features = ['rm', 'lstat', 'crim', ...] # Add your selected features

# Your code here:


# TODO: Create X (features) and y (target)
# X = df[selected_features]
# y = df['medv']

# Your code here:


print(f"✅ Selected {len(selected_features)} features for modeling")
print(f"✅ Dataset shape: X={X.shape}, y={y.shape}")

In [None]:
# Exercise 3.4: Train-Test Split
# TODO: Split your data into training and testing sets
# - Use 80% for training, 20% for testing
# - Set random_state=42 for reproducibility
# - Stratification is not needed here: train_test_split's stratify option expects class labels, not a continuous target

from sklearn.model_selection import train_test_split

print("📊 DATA SPLITTING:")
# TODO: Perform train-test split
# X_train, X_test, y_train, y_test = train_test_split(...)

# Your code here:


# TODO: Display split information
# print(f"Training set: {X_train.shape[0]} samples")
# print(f"Test set: {X_test.shape[0]} samples")
# print(f"Training target mean: ${y_train.mean():.1f}k")
# print(f"Test target mean: ${y_test.mean():.1f}k")


### 🔍 **Reflection Questions for Exercise 3:**
1. How did you decide to handle outliers? What was your reasoning?
2. What new features did you create and why do you think they'll be useful?
3. How many features did you select for modeling and what criteria did you use?
4. Are your training and test sets balanced in terms of target variable distribution?

**Expected Outcome:** You should have a clean dataset split into train/test with well-chosen features ready for modeling.
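
As a rough illustration of the IQR rule and the 80/20 split, here is a minimal sketch on a made-up table. The `toy` data and the `feature`/`target` column names are assumptions for this example only; the real exercise uses `df` and `medv`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 19 ordinary target values and one obvious outlier (50.0).
toy = pd.DataFrame({
    "feature": np.arange(20, dtype=float),
    "target": [10.0] * 9 + [11.0] * 10 + [50.0],
})

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1 = toy["target"].quantile(0.25)
q3 = toy["target"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (toy["target"] < lower) | (toy["target"] > upper)
print(f"{is_outlier.sum()} outlier(s) outside [{lower}, {upper}]")

# 80/20 split with a fixed seed so the result is reproducible.
X = toy[["feature"]]
y = toy["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

Whether flagged rows are removed, capped, or kept is a judgment call; the rule only identifies candidates.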

## Exercise 4: Model Training & Comparison (25 minutes)

### 🎯 Your Task:
Train multiple regression models and compare their performance. Understand the strengths and weaknesses of different algorithms.

In [None]:
# Exercise 4.1: Linear Regression Baseline
# TODO: Train a Linear Regression model as your baseline
# - Import LinearRegression from sklearn
# - Fit the model on training data
# - Make predictions on both train and test sets
# - Calculate R² score and RMSE

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np

print("📈 LINEAR REGRESSION MODEL:")
# TODO: Create and train linear regression model
# lr_model = LinearRegression()

# Your code here:


# TODO: Make predictions
# lr_train_pred = lr_model.predict(X_train)
# lr_test_pred = lr_model.predict(X_test)

# Your code here:


# TODO: Calculate metrics
# lr_train_r2 = r2_score(y_train, lr_train_pred)
# lr_test_r2 = r2_score(y_test, lr_test_pred)
# lr_train_rmse = np.sqrt(mean_squared_error(y_train, lr_train_pred))
# lr_test_rmse = np.sqrt(mean_squared_error(y_test, lr_test_pred))

# Your code here:


print(f"✅ Training R²: {lr_train_r2:.3f}, RMSE: ${lr_train_rmse:.2f}k")
print(f"✅ Test R²: {lr_test_r2:.3f}, RMSE: ${lr_test_rmse:.2f}k")

In [None]:
# Exercise 4.2: Random Forest Regression
# TODO: Train a Random Forest model
# - Import RandomForestRegressor
# - Use n_estimators=100, random_state=42
# - Compare performance with Linear Regression

from sklearn.ensemble import RandomForestRegressor

print("🌳 RANDOM FOREST MODEL:")
# TODO: Create and train Random Forest model
# rf_model = RandomForestRegressor(n_estimators=100, random_state=42)

# Your code here:


# TODO: Make predictions and calculate metrics
# rf_train_pred = rf_model.predict(X_train)
# rf_test_pred = rf_model.predict(X_test)

# Your code here:


# TODO: Calculate and display metrics
# rf_train_r2 = r2_score(y_train, rf_train_pred)
# rf_test_r2 = r2_score(y_test, rf_test_pred)
# rf_train_rmse = np.sqrt(mean_squared_error(y_train, rf_train_pred))
# rf_test_rmse = np.sqrt(mean_squared_error(y_test, rf_test_pred))

print(f"✅ Training R²: {rf_train_r2:.3f}, RMSE: ${rf_train_rmse:.2f}k")
print(f"✅ Test R²: {rf_test_r2:.3f}, RMSE: ${rf_test_rmse:.2f}k")

In [None]:
# Exercise 4.3: Additional Model (Your Choice)
# TODO: Choose and implement one additional regression model
# Options: Ridge Regression, Lasso Regression, Gradient Boosting, SVM, etc.
# Research the model and explain why you chose it

print("🤖 ADDITIONAL MODEL - [YOUR CHOICE]:")
# TODO: Import and implement your chosen model
# from sklearn.[module] import [YourChosenRegressor]

# Your code here:


# TODO: Train, predict, and evaluate your model
# Explain why you chose this particular algorithm

# Your reasoning:
# "I chose [model name] because..."

# Your code here:


In [None]:
# Exercise 4.4: Model Comparison
# TODO: Create a comprehensive comparison of all your models
# - Organize results in a pandas DataFrame
# - Include both R² and RMSE for train and test sets
# - Calculate the overfitting gap (Train_R2 - Test_R2); a large positive gap suggests overfitting
# - Visualize the comparison

print("📊 MODEL COMPARISON:")
# TODO: Create comparison DataFrame
# model_comparison = pd.DataFrame({
#     'Model': ['Linear Regression', 'Random Forest', 'Your Model'],
#     'Train_R2': [lr_train_r2, rf_train_r2, your_train_r2],
#     'Test_R2': [lr_test_r2, rf_test_r2, your_test_r2],
#     'Train_RMSE': [lr_train_rmse, rf_train_rmse, your_train_rmse],
#     'Test_RMSE': [lr_test_rmse, rf_test_rmse, your_test_rmse]
# })

# Your code here:


# TODO: Calculate overfitting metric
# model_comparison['Overfitting'] = model_comparison['Train_R2'] - model_comparison['Test_R2']

# Your code here:


# TODO: Display the comparison table
# print(model_comparison)

# TODO: Create visualization comparing model performance
# fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Your code here for visualization:


### 🔍 **Reflection Questions for Exercise 4:**
1. Which model performed best on the test set and why do you think that is?
2. Are any of your models overfitting? How can you tell?
3. What is the practical difference between an RMSE of $3k vs $5k for this problem?
4. Which model would you recommend for production use and why?

**Expected Outcome:** You should have trained 3+ models with test R² scores above 0.6 and identified the best performing model for your dataset.
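
Repeating the fit, predict, and score steps for every model gets tedious, so one possible pattern is a small helper that returns all four metrics. The sketch below is an assumption-laden illustration: it runs on synthetic data from `make_regression`, and the helper name `evaluate` is hypothetical, not part of the assignment.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem, NOT the housing data.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

def evaluate(model):
    """Fit on the training split and return R² and RMSE for both splits."""
    model.fit(X_train, y_train)
    train_pred = model.predict(X_train)
    test_pred = model.predict(X_test)
    return {
        "Train_R2": r2_score(y_train, train_pred),
        "Test_R2": r2_score(y_test, test_pred),
        "Train_RMSE": np.sqrt(mean_squared_error(y_train, train_pred)),
        "Test_RMSE": np.sqrt(mean_squared_error(y_test, test_pred)),
    }

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# One row per model, one column per metric.
comparison = pd.DataFrame({name: evaluate(m) for name, m in models.items()}).T
comparison["Overfitting"] = comparison["Train_R2"] - comparison["Test_R2"]
print(comparison.round(3))
```

Because the synthetic target is linear by construction, Linear Regression is expected to do well here; on the housing data the ranking may differ.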

## Exercise 5: Feature Importance & Model Interpretation (15 minutes)

### 🎯 Your Task:
Understand which features are most important for predicting house prices and interpret your model results.

In [None]:
# Exercise 5.1: Feature Importance Analysis
# TODO: Extract and visualize feature importance from your Random Forest model
# - Get feature importances from the trained Random Forest
# - Create a horizontal bar plot showing top 10 most important features
# - Interpret what this means for the business

print("🔍 FEATURE IMPORTANCE ANALYSIS:")
# TODO: Get feature importances
# feature_importance = pd.DataFrame({
#     'feature': X_train.columns,
#     'importance': rf_model.feature_importances_
# }).sort_values('importance', ascending=False)

# Your code here:


# TODO: Create visualization of top features
# plt.figure(figsize=(10, 6))

# Your code here:


print("📊 TOP 5 MOST IMPORTANT FEATURES:")
# TODO: Print top 5 features with their importance scores
# for idx, row in feature_importance.head().iterrows():
#     print(f"  {row['feature']}: {row['importance']:.3f}")


In [None]:
# Exercise 5.2: Prediction Analysis
# TODO: Analyze some specific predictions to understand model behavior
# - Select 5 test samples with different characteristics
# - Compare actual vs predicted values
# - Analyze what features might be driving the predictions

print("🔍 PREDICTION ANALYSIS:")
# TODO: Select interesting test samples
# sample_indices = [0, 10, 20, 30, 40]  # You can choose different indices

# Your code here:


# TODO: For each sample, show:
# - Actual price
# - Predicted price
# - Key feature values
# - Prediction error

# for i in sample_indices:
#     idx = X_test.index[i]
#     actual = y_test.iloc[i]
#     predicted = rf_test_pred[i]  # or your best model's predictions
#     
#     print(f"\nTest sample {i} (original row {idx}):")
#     print(f"  Actual Price: ${actual:.1f}k")
#     print(f"  Predicted Price: ${predicted:.1f}k")
#     print(f"  Error: ${abs(actual-predicted):.1f}k")
#     # Show key feature values for this sample

In [None]:
# Exercise 5.3: Residual Analysis
# TODO: Create residual plots to understand model performance patterns
# - Plot residuals (actual - predicted) vs predicted values
# - Check for patterns that indicate model issues
# - Calculate and display residual statistics

print("📊 RESIDUAL ANALYSIS:")
# TODO: Calculate residuals for your best model
# best_model_pred = rf_test_pred  # Replace with your best model
# residuals = y_test - best_model_pred

# Your code here:


# TODO: Create residual plot
# plt.figure(figsize=(10, 6))

# Your code here:


# TODO: Calculate residual statistics
# print(f"📈 Residual Statistics:")
# print(f"  Mean: ${residuals.mean():.3f}k (should be close to 0)")
# print(f"  Std Dev: ${residuals.std():.3f}k")
# print(f"  Min: ${residuals.min():.3f}k")
# print(f"  Max: ${residuals.max():.3f}k")


### 🔍 **Reflection Questions for Exercise 5:**
1. Which features are most important for predicting house prices? Do these make intuitive sense?
2. Are there any predictions where your model performed particularly well or poorly? Why?
3. What patterns do you see in the residual plot? Are there any concerning trends?
4. How would you explain your model's predictions to a real estate agent or home buyer?

**Expected Outcome:** You should understand which features drive predictions and be able to interpret your model's behavior on specific examples.
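
To show what feature importances and residuals look like when the true relationship is known, here is a small sketch on synthetic data. The column names `strong`, `weak`, and `irrelevant` are invented for this example; by construction the target depends mostly on `strong`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic features with a known influence on the target.
X = pd.DataFrame({
    "strong": rng.normal(size=400),
    "weak": rng.normal(size=400),
    "irrelevant": rng.normal(size=400),
})
y = 5.0 * X["strong"] + 1.0 * X["weak"] + rng.normal(scale=0.5, size=400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Impurity-based importances, labeled by column and sorted high to low.
importance = pd.Series(
    model.feature_importances_, index=X_train.columns
).sort_values(ascending=False)
print(importance)

# Residuals on the held-out split: actual minus predicted.
residuals = y_test - model.predict(X_test)
print(f"Residual mean: {residuals.mean():.3f}, std: {residuals.std():.3f}")
```

Importances from a Random Forest sum to 1 and describe how the model uses each feature, which is not the same as a causal effect.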

## Exercise 6: Business Insights & Recommendations (10 minutes)

### 🎯 Your Task:
Translate your technical findings into actionable business insights and recommendations.

In [None]:
# Exercise 6.1: Price Prediction Function
# TODO: Create a function that can predict house prices for new properties
# - Use your best performing model
# - Include input validation
# - Provide confidence intervals or prediction ranges

def predict_house_price(rm, lstat, crim, other_features):
    """
    Predict house price based on key features
    
    Parameters:
    rm: Average number of rooms per dwelling
    lstat: % lower status of the population
    crim: Per capita crime rate
    other_features: Dictionary of other features
    
    Returns:
    Predicted price in thousands of dollars
    """
    # TODO: Implement the prediction function
    # You'll need to create a feature vector and use your trained model
    
    # Your code here:
    
    pass

# TODO: Test your function with a few examples
print("🏠 HOUSE PRICE PREDICTIONS:")
# Example: predict_house_price(rm=6.5, lstat=5.0, crim=0.1, other_features={...})

# Your code here:


In [None]:
# Exercise 6.2: Market Insights
# TODO: Generate insights about the Boston housing market based on your analysis
# - What factors most strongly influence house prices?
# - What recommendations would you make to home buyers?
# - What areas or features should investors focus on?

print("📊 BOSTON HOUSING MARKET INSIGHTS:")
print("="*50)

# TODO: Write insights based on your analysis
# Example insights to develop:

print("🏡 KEY PRICE DRIVERS:")
# TODO: List the top 3-5 factors that most influence house prices
# Based on your feature importance analysis

print("\n💡 RECOMMENDATIONS FOR HOME BUYERS:")
# TODO: Provide 3-4 actionable recommendations for home buyers
# Based on your findings

print("\n📈 INVESTMENT OPPORTUNITIES:")
# TODO: Identify potential investment strategies
# Based on patterns you discovered

print("\n⚠️  MARKET RISKS:")
# TODO: Identify potential risks or limitations
# Based on your model analysis


In [None]:
# Exercise 6.3: Model Limitations & Future Improvements
# TODO: Critically evaluate your model and suggest improvements
# - What are the current limitations of your approach?
# - What additional data would be helpful?
# - How would you improve the model in a real-world scenario?

print("🔍 MODEL EVALUATION & FUTURE IMPROVEMENTS:")
print("="*50)

print("📉 CURRENT LIMITATIONS:")
# TODO: List 3-4 limitations of your current model
# Consider data, methodology, and practical constraints

print("\n📈 POTENTIAL IMPROVEMENTS:")
# TODO: Suggest 3-4 specific improvements
# Consider additional features, advanced techniques, etc.

print("\n🎯 NEXT STEPS FOR PRODUCTION:")
# TODO: Outline what you'd need to do to deploy this model
# Consider monitoring, updating, validation, etc.


### 🔍 **Final Reflection Questions:**
1. What was the most surprising finding from your analysis?
2. How confident would you be using this model to make real estate decisions?
3. What additional domain expertise would help improve your model?
4. How would you communicate these results to non-technical stakeholders?

**Expected Outcome:** You should have practical insights about housing prices and a clear understanding of how to apply ML to real-world business problems.
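
One possible shape for a prediction function with input validation and a rough prediction range is sketched below. Everything here is an assumption for illustration: the model is trained on synthetic data with only three features, the function name `predict_with_range` is hypothetical, and the range is simply the spread of individual tree predictions, not a calibrated confidence interval.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic training data (columns: rm, lstat, crim), NOT the real dataset.
X_toy = np.column_stack([
    rng.uniform(4, 9, 300),    # rm
    rng.uniform(2, 35, 300),   # lstat
    rng.uniform(0, 10, 300),   # crim
])
y_toy = (
    5 * X_toy[:, 0] - 0.5 * X_toy[:, 1] - 0.3 * X_toy[:, 2]
    + rng.normal(scale=1.0, size=300)
)
toy_model = RandomForestRegressor(n_estimators=100, random_state=42)
toy_model.fit(X_toy, y_toy)

def predict_with_range(model, rm, lstat, crim):
    """Return (prediction, low, high); low/high are the 5th and 95th
    percentiles of the individual trees' predictions."""
    if rm <= 0:
        raise ValueError("rm must be positive")
    if not 0 <= lstat <= 100:
        raise ValueError("lstat is a percentage and must be between 0 and 100")
    if crim < 0:
        raise ValueError("crim must be non-negative")
    row = np.array([[rm, lstat, crim]])
    per_tree = np.array([tree.predict(row)[0] for tree in model.estimators_])
    low, high = np.percentile(per_tree, [5, 95])
    return model.predict(row)[0], low, high

pred, low, high = predict_with_range(toy_model, rm=6.5, lstat=5.0, crim=0.1)
print(f"Predicted: {pred:.1f} (range {low:.1f} to {high:.1f})")
```

Validating inputs up front turns a silent nonsense prediction into an explicit error, which is usually the safer failure mode.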

## 🎉 Congratulations!

You've completed a comprehensive machine learning project from start to finish! 

### 📚 What You've Accomplished:
- ✅ Explored and visualized a real-world dataset
- ✅ Preprocessed data and engineered meaningful features  
- ✅ Trained and compared multiple ML models
- ✅ Interpreted model results and feature importance
- ✅ Generated actionable business insights
- ✅ Critically evaluated model limitations

### 🚀 Next Steps:
1. **Try Advanced Techniques**: Experiment with hyperparameter tuning, cross-validation, or ensemble methods
2. **Deploy Your Model**: Learn how to put your model into production
3. **Explore Other Datasets**: Apply these skills to different domains and problems
4. **Study Advanced ML**: Dive deeper into deep learning, time series, or specialized ML areas
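
As a starting point for the cross-validation and hyperparameter tuning mentioned above, here is a minimal sketch using scikit-learn's `cross_val_score` and `GridSearchCV`. It assumes synthetic data and a Ridge model purely for illustration; the `alpha` grid is an arbitrary choice.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression problem, NOT the housing data.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# 5-fold cross-validation: one R² score per held-out fold.
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=5, scoring="r2")
print(f"R² per fold: {scores.round(3)}, mean: {scores.mean():.3f}")

# Grid search: try each alpha with 5-fold CV and keep the best one.
search = GridSearchCV(
    Ridge(), param_grid={"alpha": [0.1, 1.0, 10.0]}, cv=5, scoring="r2"
)
search.fit(X, y)
print(search.best_params_)
```

Averaging over folds gives a steadier performance estimate than a single train/test split, at the cost of fitting the model several times.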

### 💡 Key Takeaways:
- Machine learning is as much about understanding the data as it is about algorithms
- Feature engineering and domain knowledge are often more important than complex models
- Always validate your results and consider practical limitations
- Communication of results is crucial for business impact

**Great job completing this hands-on ML exercise!** 🎯