# 03 - Baseline Models

This notebook implements baseline models for Airbnb price prediction.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

## Load Data

In [2]:
# Load the feature-engineered data
df = pd.read_csv('../data/processed/features/listings_features.csv')
print(f"Dataset shape: {df.shape}")
df.head()

Dataset shape: (7096, 79)


| | id | listing_url | scrape_id | last_scraped | source | name | description | picture_url | host_id | host_url | ... | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month | price_per_person | description_length | dist_center | avail_frac | sent_mean | rev_count | geo_cluster |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 97945 | https://www.airbnb.com/rooms/97945 | 0.0 | 2025-03-19 | city scrape | Deluxw-Apartm. with roof terrace | Enjoy the simple life at this quiet and centra... | https://a0.muscache.com/pictures/2459996/10b4c... | 517685 | https://www.airbnb.com/users/show/517685 | ... | 0 | 0 | -0.044211 | -0.067181 | -1.785823 | 50.794769 | 0.213699 | 0.464175 | 126.0 | 8 |
| 1 | 114695 | https://www.airbnb.com/rooms/114695 | 0.0 | 2025-03-19 | city scrape | Apartment Munich/East with sundeck | Enjoy the simple life at this quiet and centra... | https://a0.muscache.com/pictures/21571874/960e... | 581737 | https://www.airbnb.com/users/show/581737 | ... | 0 | 0 | -0.235088 | -0.462938 | -1.785823 | 49.902293 | 0.580822 | 0.571723 | 78.0 | 7 |
| 2 | 127383 | https://www.airbnb.com/rooms/127383 | 0.0 | 2025-03-20 | previous scrape | City apartment next to Pinakothek | Enjoy the simple life at this quiet and centra... | https://a0.muscache.com/pictures/79238c11-bc61... | 630556 | https://www.airbnb.com/users/show/630556 | ... | 0 | 0 | -0.095111 | -0.385534 | -1.785823 | 49.035222 | 0.008219 | 0.418847 | 116.0 | 1 |
| 3 | 159634 | https://www.airbnb.com/rooms/159634 | 0.0 | 2025-03-20 | previous scrape | Fancy, bright central roof top flat and homeof... | In this idyllic stylish flat you live very qui... | https://a0.muscache.com/pictures/336144dc-b06d... | 765694 | https://www.airbnb.com/users/show/765694 | ... | 0 | 0 | -0.362339 | -0.273174 | -0.048502 | 50.034958 | 0.008219 | 0.452107 | 45.0 | 3 |
| 4 | 170154 | https://www.airbnb.com/rooms/170154 | 0.0 | 2025-03-20 | city scrape | Own floor & bath, parking & breakfast | Enjoy a quiet neighbourhood, easy access to th... | https://a0.muscache.com/pictures/31636890/593e... | 108297 | https://www.airbnb.com/users/show/108297 | ... | 1 | 0 | 1.699134 | -0.504137 | -1.303234 | 50.862626 | 0.20274 | 0.52991 | 577.0 | 8 |


## Data Preparation

In [3]:
# Check for missing values
print("Missing values:")
print(df.isnull().sum().sum())

# Remove rows with missing values
df_clean = df.dropna()
print(f"Shape after removing missing values: {df_clean.shape}")

Missing values:
2752
Shape after removing missing values: (5720, 79)
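
The cell above reports only the total number of missing cells, and `dropna()` then removes every row with a gap in any of the 79 columns. Before dropping rows, it can help to see which columns hold the gaps. The sketch below shows one way to do that on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration, not the Airbnb data.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df, with made-up values and two gaps.
toy = pd.DataFrame({
    "bedrooms": [1.0, np.nan, 2.0, 3.0],
    "bathrooms": [1.0, 1.0, np.nan, 2.0],
    "price": [80.0, 120.0, 95.0, 150.0],
})

# Count missing cells per column and keep only columns that have any.
missing_per_column = toy.isnull().sum()
missing_per_column = missing_per_column[missing_per_column > 0].sort_values(ascending=False)
print(missing_per_column)

# Compare against how many rows a blanket dropna() would discard.
rows_dropped = len(toy) - len(toy.dropna())
print(f"Rows dropped by dropna(): {rows_dropped} of {len(toy)}")
```

If most gaps sit in a few columns that the models do not use, dropping those columns first would keep more rows than dropping rows across the whole frame.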


In [4]:
# Select numerical features for baseline models
numerical_cols = df_clean.select_dtypes(include=[np.number]).columns.tolist()
numerical_cols = [col for col in numerical_cols if col != 'price']

print(f"Number of numerical features: {len(numerical_cols)}")
print("Features:", numerical_cols[:10], "..." if len(numerical_cols) > 10 else "")

Number of numerical features: 49
Features: ['id', 'scrape_id', 'host_id', 'host_listings_count', 'host_total_listings_count', 'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms'] ...
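
The feature list above includes identifier columns (`id`, `scrape_id`, `host_id`), and the dataset also contains `price_per_person`, whose name suggests it is computed from the target. A minimal sketch of filtering out such columns follows. The helper `select_features` and the exclusion set are hypothetical, chosen for illustration and not part of the original notebook, and the toy frame stands in for `df_clean`.

```python
import numpy as np
import pandas as pd

# Hypothetical exclusion set: identifiers are unlikely to carry pricing signal,
# and price_per_person is assumed (from its name only) to be derived from price.
EXCLUDE = {"price", "id", "scrape_id", "host_id", "price_per_person"}

def select_features(frame, exclude=EXCLUDE):
    """Return the numeric column names that are not in the exclusion set."""
    numeric = frame.select_dtypes(include=[np.number]).columns
    return [col for col in numeric if col not in exclude]

# Toy stand-in for df_clean, with made-up values.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "host_id": [10, 20, 30],
    "accommodates": [2, 4, 3],
    "price_per_person": [50.0, 40.0, 30.0],
    "name": ["a", "b", "c"],
    "price": [100.0, 160.0, 90.0],
})

feature_cols = select_features(toy)
print(feature_cols)
```

Applying a filter like this would change the reported scores, so the results in this notebook should be read as belonging to the unfiltered 49-feature set.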


In [5]:
# Prepare features and target
X = df_clean[numerical_cols]
y = df_clean['price']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

Training set: (4576, 49)
Test set: (1144, 49)


## Baseline Models

In [6]:
# Define models
models = {
    'Linear Regression': LinearRegression(),
    'Ridge Regression': Ridge(alpha=1.0),
    'Lasso Regression': Lasso(alpha=0.1),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingRegressor(n_estimators=100, random_state=42)
}

# Function to evaluate models
def evaluate_model(model, X_train, X_test, y_train, y_test, model_name):
    # Train the model
    model.fit(X_train, y_train)
    
    # Make predictions
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    
    return {
        'model': model,
        'mse': mse,
        'rmse': rmse,
        'mae': mae,
        'r2': r2,
        'predictions': y_pred
    }

In [7]:
# Evaluate all models
results = {}

for name, model in models.items():
    print(f"Training {name}...")
    results[name] = evaluate_model(model, X_train, X_test, y_train, y_test, name)
    print(f"{name} - RMSE: {results[name]['rmse']:.2f}, R²: {results[name]['r2']:.3f}")
    print("-" * 50)

Training Linear Regression...
Linear Regression - RMSE: 0.93, R²: -0.003
--------------------------------------------------
Training Ridge Regression...
Ridge Regression - RMSE: 0.34, R²: 0.864
--------------------------------------------------
Training Lasso Regression...
Lasso Regression - RMSE: 0.31, R²: 0.887
--------------------------------------------------
Training Random Forest...
Random Forest - RMSE: 0.09, R²: 0.990
--------------------------------------------------
Training Gradient Boosting...
Gradient Boosting - RMSE: 0.08, R²: 0.992
--------------------------------------------------
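
The scores above come from a single 80/20 split. The notebook imports `cross_val_score` and `StandardScaler` without using them; the sketch below shows how they could be combined to get a per-fold estimate. It runs on synthetic data from `make_regression` (an assumption, not the Airbnb features), and the variable names are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (X, y); not the Airbnb data.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, noise=10.0, random_state=42
)

# Scaling sits inside the pipeline so each fold is scaled using
# statistics from its own training portion only.
pipeline = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

# scikit-learn scorers are "higher is better", so RMSE comes back negated.
neg_rmse = cross_val_score(
    pipeline, X_demo, y_demo, cv=5, scoring="neg_root_mean_squared_error"
)
cv_rmse = -neg_rmse

print(f"RMSE per fold: {cv_rmse.round(2)}")
print(f"Mean RMSE: {cv_rmse.mean():.2f} (std {cv_rmse.std():.2f})")
```

The spread across folds gives a rough sense of how much a single-split score, such as the ones printed above, could move with a different random split.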


## Results Comparison

In [8]:
# Create results comparison
results_df = pd.DataFrame({
    'Model': list(results.keys()),
    'RMSE': [results[name]['rmse'] for name in results.keys()],
    'MAE': [results[name]['mae'] for name in results.keys()],
    'R²': [results[name]['r2'] for name in results.keys()]
})

results_df = results_df.sort_values('RMSE')
results_df

| | Model | RMSE | MAE | R² |
|---|---|---|---|---|
| 4 | Gradient Boosting | 0.082355 | 0.023519 | 0.992189 |
| 3 | Random Forest | 0.092668 | 0.015439 | 0.99011 |
| 2 | Lasso Regression | 0.313309 | 0.133062 | 0.886951 |
| 1 | Ridge Regression | 0.343852 | 0.138008 | 0.863835 |
| 0 | Linear Regression | 0.933042 | 0.24339 | -0.002591 |


## Conclusion

In summary, the baseline experiments reveal a clear progression in model performance as we move from simple linear approaches to more sophisticated ensemble methods:

* **Ordinary Linear Regression** fails to capture Airbnb pricing on this feature set. Its R² of -0.003 and RMSE of 0.93 mean its test predictions are slightly worse than always predicting the mean price.
* **Ridge and Lasso Regression**, which add L₂ and L₁ penalties respectively, fit far better, with R² of 0.86 and 0.89 and RMSE of 0.34 and 0.31, roughly one-third of the linear model's error. A gap this large suggests the unregularized fit is unstable, as can happen with strongly correlated numerical features (for example, `host_listings_count` and `host_total_listings_count`), and that penalizing large coefficients stabilizes it.
* **Random Forest Regression**, an ensemble of decision trees, improves accuracy further by capturing nonlinear relationships and feature interactions without manual feature engineering. With R² ≈ 0.99 and RMSE ≈ 0.09, it explains nearly all the variance in the test set, and it has the lowest MAE of the five models (0.015).
* **Gradient Boosting Regression**, which fits each new tree to the residuals of the ensemble built so far, achieves the lowest RMSE (0.082) and the highest R² (0.992), although its MAE (0.024) is higher than Random Forest's. Its sequential error correction may capture patterns that a bagged forest misses, but the margin between the two is small.
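
An R² near 0.99 from a baseline is worth checking for features derived from the target. One quick check is to look at a fitted forest's `feature_importances_` and see whether a single column dominates. The sketch below builds synthetic data with a deliberately leaky `price_per_person` column; the data-generating formula and all values are assumptions for illustration, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
n = 300

# Synthetic features; values are made up.
demo = pd.DataFrame({
    "accommodates": rng.integers(1, 7, size=n).astype(float),
    "dist_center": rng.uniform(0, 10, size=n),
})

# Assumed pricing rule plus noise, for illustration only.
price = 40 * demo["accommodates"] - 5 * demo["dist_center"] + rng.normal(0, 5, size=n)

# Deliberately leaky feature: computed directly from the target.
demo["price_per_person"] = price / demo["accommodates"]

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(demo, price)

# Impurity-based importances, one per column, sorted from largest to smallest.
importances = pd.Series(
    forest.feature_importances_, index=demo.columns
).sort_values(ascending=False)
print(importances)
```

The same inspection could be run on `results['Random Forest']['model']` with `X_train.columns` as the index. Impurity-based importances are only a rough guide, so a suspicious column is better confirmed by refitting without it.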

**Metric explanations:**

* **RMSE (Root Mean Squared Error)** is the square root of the average squared difference between predicted and actual prices, expressed in the units of the `price` column. A lower RMSE means predictions are, on average, closer to the true values.
* **MAE (Mean Absolute Error)** is the average absolute difference between predictions and actual prices. It weights each error in proportion to its size, whereas RMSE squares errors first, so MAE is less sensitive to a few large misses.
* **R² (Coefficient of Determination)** indicates the proportion of variance in the actual prices that the model explains. An R² of 1 means perfect prediction, 0 means no better than always predicting the mean, and a negative value means worse than predicting the mean.
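
The three definitions above can be worked through by hand on a tiny example. The sketch below uses four made-up prices (not values from the dataset) and plain NumPy, so each metric is visible as arithmetic rather than as a library call.

```python
import numpy as np

# Made-up prices for illustration; the last prediction is a large miss.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 200.0, 290.0])

errors = y_pred - y_true

# MAE: mean of absolute errors.
mae = np.mean(np.abs(errors))

# RMSE: square the errors, average them, then take the square root.
rmse = np.sqrt(np.mean(errors ** 2))

# R²: one minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```

Here the single 40-unit miss pulls RMSE well above MAE, which is the outlier sensitivity described above. These hand computations should agree with `mean_absolute_error`, `mean_squared_error`, and `r2_score` from scikit-learn on the same arrays.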
