# Assignment 2: Airbnb Price Prediction

## Introduction

This notebook analyzes Airbnb San Diego listings data to build machine learning models for price prediction. We'll explore various features that influence listing prices, including:

- Property characteristics (bedrooms, bathrooms, accommodates)
- Location (latitude, longitude)
- Room type
- Reviews and ratings
- Amenities
- Availability

Our goals are to:
1. Load and explore the Airbnb listings dataset
2. Preprocess and clean the data
3. Perform exploratory data analysis
4. Build and evaluate predictive models for listing prices



In [1]:
import sys
from pathlib import Path
import os

# Add project root to Python path
# This works whether Jupyter is started from project root or notebooks directory
current_dir = Path().resolve()
# If we're in notebooks/, go up one level; otherwise assume we're already at project root
if current_dir.name == 'notebooks':
    project_root = current_dir.parent
else:
    project_root = current_dir

sys.path.insert(0, str(project_root))
os.chdir(project_root)  # Change to project root for relative paths

from scripts.load_data import load_data
from scripts.preprocess import preprocess
from scripts.eda import run_eda
from scripts.models import train_models

df_raw = load_data()
df = preprocess(df_raw)

print("Cleaned dataset shape:", df.shape)
display(df.head())


ModuleNotFoundError: No module named 'seaborn'

## Load Dataset

The data has been loaded from `listings.csv` in the `/data` directory. The dataset contains Airbnb San Diego listings with various features including property characteristics, location, reviews, and amenities.


## Preview + Summary Statistics

Let's examine the dataset structure and get summary statistics:


In [3]:
print(f"Total records: {len(df)}")
print(f"\nDataset shape: {df.shape}")
print(f"\nColumn names:")
print(df.columns.tolist())
print(f"\nData types:")
print(df.dtypes)
print(f"\nSummary statistics:")
print(df.describe())
print(f"\nMissing values:")
print(df.isnull().sum())


NameError: name 'df' is not defined

## Preprocessing

The data has been preprocessed to:
- Clean price column (remove $, convert to float)
- Handle missing values in numerical columns
- Encode categorical features (room_type)
- Convert amenities string into an integer count feature
- Prepare data for modeling
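
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual `scripts.preprocess` code: the function name `preprocess_sketch`, the assumption that `amenities` is a JSON-style list string, the comma handling in `price`, and the choice to drop rows with a missing price are all assumptions made for the example.

```python
import json

import pandas as pd


def preprocess_sketch(df_raw):
    """Hypothetical re-implementation of the preprocessing steps listed above."""
    df = df_raw.copy()

    # Clean price: "$1,250.00" -> 1250.0 (assumes "$" and "," are the only extras)
    df["price"] = (
        df["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
    )
    # Assumption: rows without a target price are dropped rather than imputed
    df = df.dropna(subset=["price"])

    # Amenities string -> integer count (assumes a JSON-style list such as '["Wifi"]')
    df["amenities_count"] = df["amenities"].apply(
        lambda s: len(json.loads(s)) if isinstance(s, str) else 0
    )
    df = df.drop(columns=["amenities"])

    # Fill missing numerical values with each column's median
    num_cols = df.select_dtypes(include="number").columns
    df[num_cols] = df[num_cols].fillna(df[num_cols].median())

    # One-hot encode room_type into room_type_* columns
    return pd.get_dummies(df, columns=["room_type"], prefix="room_type")


# Tiny made-up sample, only to show the shape of the input and output
sample_raw = pd.DataFrame({
    "price": ["$1,250.00", "$80.00", "$120.00"],
    "bedrooms": [3.0, None, 1.0],
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "amenities": ['["Wifi", "Kitchen"]', "[]", '["Wifi"]'],
})
sample_clean = preprocess_sketch(sample_raw)
print(sample_clean)
```

The missing `bedrooms` value in the sample is filled with the median of the other two rows, and `room_type` becomes one boolean column per category.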


In [None]:
print("Preprocessed data summary:")
print(df.describe())
print("\nMissing values:")
print(df.isnull().sum())
print("\nFeature columns:")
print(df.columns.tolist())
print(f"\nPrice statistics:")
print(df['price'].describe())


## Exploratory Data Analysis (EDA)

Let's generate visualizations to understand the price distribution and relationships between features:


In [None]:
run_eda(df)


### View Generated Figures

The EDA script has generated several visualizations saved to `output/figures/` (relative to the project root):

1. **price_distribution.png** - Price distribution (histogram, box plot, Q-Q plot)
2. **price_vs_features.png** - Price vs bedrooms and bathrooms
3. **correlation_heatmap.png** - Feature correlations
4. **geographic_scatter.png** - Geographic scatterplot (lat, lon colored by price)
5. **price_by_room_type.png** - Boxplot of price by room type


In [None]:
# Display the generated figures
from IPython.display import Image, display
from pathlib import Path

# The first cell changes the working directory to the project root,
# so this path is relative to the root, not to notebooks/.
figures_dir = Path('output/figures')

figure_names = [
    'price_distribution.png',
    'price_vs_features.png',
    'correlation_heatmap.png',
    'geographic_scatter.png',
    'price_by_room_type.png',
]

for name in figure_names:
    figure_path = figures_dir / name
    if figure_path.exists():
        display(Image(str(figure_path)))
    else:
        # Make a missing figure visible instead of skipping it silently
        print(f"Figure not found: {figure_path}")


## Feature Engineering

Additional features have been created during preprocessing:

- **Amenities count**: Converted amenities string into an integer count
- **Room type encoding**: One-hot encoded room_type categories
- **Numerical features**: All numerical columns are ready for modeling
- **Missing value imputation**: Missing values in numerical columns filled with median


In [None]:
# Show feature engineering results
print("Engineered features:")
print(f"Total features: {len(df.columns)}")
print(f"\nFeature columns:")
print(df.columns.tolist())

if 'amenities_count' in df.columns:
    print(f"\nAmenities count statistics:")
    print(df['amenities_count'].describe())

room_type_cols = [col for col in df.columns if col.startswith('room_type_')]
if room_type_cols:
    print(f"\nRoom type dummy variables: {room_type_cols}")


## Baseline Model

We'll start with a simple baseline model that predicts the mean price.


In [None]:
# train_models() also fits a baseline; this cell is only a quick demonstration.
# Note: these metrics are in-sample, since the mean is computed on the same
# rows it is scored on.
import numpy as np

baseline_pred = np.full(len(df), df['price'].mean())
baseline_rmse = np.sqrt(np.mean((df['price'] - baseline_pred)**2))
baseline_mae = np.mean(np.abs(df['price'] - baseline_pred))

print("Baseline Model (Mean Prediction):")
print(f"  RMSE: ${baseline_rmse:.2f}")
print(f"  MAE: ${baseline_mae:.2f}")


## Advanced Models

Now we'll train more sophisticated models:
- Linear Regression
- Random Forest Regressor
- XGBoost Regressor (if available)
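
For orientation, a training loop over the models listed above might look like the sketch below. It is not the actual `scripts.models.train_models` implementation: the name `train_models_sketch`, the 80/20 split, the hyperparameters, and the synthetic demo data are all assumptions. The `try`/`except` import illustrates one way to make XGBoost optional.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

try:  # XGBoost is optional; skip it when the package is not installed
    from xgboost import XGBRegressor
    HAS_XGBOOST = True
except ImportError:
    HAS_XGBOOST = False


def train_models_sketch(df, target="price", seed=42):
    """Hypothetical stand-in for train_models(): fit each model, score on a held-out split."""
    X = df.drop(columns=[target])
    y = df[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )

    candidates = {
        "Linear Regression": LinearRegression(),
        "Random Forest": RandomForestRegressor(n_estimators=100, random_state=seed),
    }
    if HAS_XGBOOST:
        candidates["XGBoost"] = XGBRegressor(n_estimators=200, random_state=seed)

    rows = []
    for name, model in candidates.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rows.append({
            "Model": name,
            "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
            "MAE": float(mean_absolute_error(y_test, pred)),
            "R²": float(r2_score(y_test, pred)),
        })
    return pd.DataFrame(rows, columns=["Model", "RMSE", "MAE", "R²"]), candidates


# Synthetic demo data (not Airbnb data): price is roughly linear in the two features
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "bedrooms": rng.integers(1, 6, size=n),
    "bathrooms": rng.integers(1, 4, size=n),
})
demo["price"] = 50 * demo["bedrooms"] + 30 * demo["bathrooms"] + rng.normal(0, 5, size=n)

results_demo, models_demo = train_models_sketch(demo)
print(results_demo)
```

Scoring on a held-out split rather than on the training rows keeps the comparison between models fair, since a Random Forest can fit its training data almost perfectly.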


In [None]:
results, models = train_models(df)
results


## Evaluation + Error Analysis

The models have been evaluated using:
- **RMSE** (Root Mean Squared Error): Lower is better (in dollars)
- **MAE** (Mean Absolute Error): Lower is better (in dollars)
- **R²** (Coefficient of Determination): Higher is better (closer to 1.0)
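
As a reference for how these three metrics are computed, here is a small NumPy sketch. The helper name `regression_metrics` and the toy prices are illustrative assumptions, not part of the project scripts. The example scores a predictor that always outputs the mean, which by definition yields R² = 0; that is the level marked by the dashed zero line in the R² plot.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Compute RMSE, MAE and R² for two equal-length sequences of prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred

    rmse = float(np.sqrt(np.mean(errors ** 2)))   # penalizes large errors more
    mae = float(np.mean(np.abs(errors)))          # average absolute miss in dollars
    ss_res = float(np.sum(errors ** 2))           # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"RMSE": rmse, "MAE": mae, "R²": r2}


# Toy example: always predicting the mean price (200) of three listings
prices = [100.0, 200.0, 300.0]
metrics_mean = regression_metrics(prices, [200.0, 200.0, 200.0])
print(metrics_mean)  # RMSE ≈ 81.65, MAE ≈ 66.67, R² = 0.0
```

A model with a negative R² on held-out data is doing worse than this constant prediction.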

Let's visualize the results:


In [None]:
import matplotlib.pyplot as plt

# Plot comparison of model metrics
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# RMSE comparison
axes[0].bar(results['Model'], results['RMSE'], color='steelblue')
axes[0].set_title('RMSE Comparison', fontweight='bold')
axes[0].set_ylabel('RMSE ($)')
axes[0].tick_params(axis='x', rotation=45)

# MAE comparison
axes[1].bar(results['Model'], results['MAE'], color='coral')
axes[1].set_title('MAE Comparison', fontweight='bold')
axes[1].set_ylabel('MAE ($)')
axes[1].tick_params(axis='x', rotation=45)

# R² comparison
axes[2].bar(results['Model'], results['R²'], color='mediumseagreen')
axes[2].set_title('R² Comparison', fontweight='bold')
axes[2].set_ylabel('R² Score')
axes[2].tick_params(axis='x', rotation=45)
axes[2].axhline(y=0, color='black', linestyle='--', linewidth=0.5)

plt.tight_layout()
plt.show()

print("\nBest model by metric:")
print(f"  Best RMSE: {results.loc[results['RMSE'].idxmin(), 'Model']} (${results['RMSE'].min():.2f})")
print(f"  Best MAE: {results.loc[results['MAE'].idxmin(), 'Model']} (${results['MAE'].min():.2f})")
print(f"  Best R²: {results.loc[results['R²'].idxmax(), 'Model']} ({results['R²'].max():.3f})")


## Conclusion + Future Work

### Key Findings:

1. **Data Quality**: The dataset includes comprehensive Airbnb listing information with various features that influence pricing.

2. **Price Patterns**: 
   - Price distribution is right-skewed (many affordable listings, few expensive ones)
   - Location, property size, and room type are key factors
   - Reviews and ratings may influence pricing

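3. **Model Performance**: 
   - Tree-based models (Random Forest, XGBoost) typically outperform the mean baseline and linear regression on tabular data like this, because they can capture non-linear effects and interactions between features
   - Engineered features (amenities count, room type encoding) give the models additional signal beyond the raw columns
   - Geographic features (latitude, longitude) let the models capture location-based pricing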

4. **Challenges**:
   - Price prediction is complex due to many factors (location, seasonality, host preferences)
   - Missing data requires careful handling
   - Outliers in price may need special treatment
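
One common way to handle a right-skewed target with outliers is to cap extreme prices and train on a log scale. The sketch below is an illustration only and not something this notebook currently does; the helper names `cap_and_log` and `to_dollars`, the quantile threshold, and the toy prices are assumptions.

```python
import numpy as np


def cap_and_log(prices, upper_quantile=0.99):
    """Cap prices at a chosen upper quantile, then apply log1p to reduce skew."""
    prices = np.asarray(prices, dtype=float)
    cap = float(np.quantile(prices, upper_quantile))
    capped = np.minimum(prices, cap)
    return np.log1p(capped), cap


def to_dollars(log_prices):
    """Invert log1p so predictions can be reported in dollars again."""
    return np.expm1(log_prices)


# Toy example: one extreme listing among otherwise typical prices
toy_prices = np.array([50.0, 100.0, 150.0, 10000.0])
log_prices, cap_value = cap_and_log(toy_prices, upper_quantile=0.75)
print(cap_value)               # cap chosen from the data
print(to_dollars(log_prices))  # capped prices, back on the dollar scale
```

If a model is trained on the log-transformed target, its predictions need the inverse transform before RMSE and MAE are reported in dollars.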

### Future Work:

- Hyperparameter tuning for advanced models
- Feature selection to identify most important predictors
- Ensemble methods combining multiple models
- Additional feature engineering (neighborhood clusters, distance to landmarks)
- Time-based features if temporal data is available
- Handling of extreme outliers and price caps
