# Airbnb Price Prediction – Modeling

This notebook develops and evaluates machine learning regression models to
predict nightly Airbnb rental prices using cleaned Lisbon listings data.


## Imports

In [1]:
# Import data manipulation libraries
import pandas as pd  # For data processing and manipulation
import numpy as np   # For numerical operations

# Import model evaluation and preprocessing tools
from sklearn.model_selection import train_test_split  # Train/test splitting
from sklearn.compose import ColumnTransformer         # Per-column preprocessing
from sklearn.preprocessing import OneHotEncoder       # Categorical encoding
from sklearn.pipeline import Pipeline                 # Chain preprocessing and model
from sklearn.metrics import mean_squared_error        # Evaluation metric

# Import regression models
from sklearn.linear_model import LinearRegression     # Linear baseline
from sklearn.ensemble import RandomForestRegressor    # Tree ensemble


## Load Clean Data

In [2]:
# Load the cleaned Airbnb listings data from CSV file
df = pd.read_csv("../data/processed/clean_listings.csv")
# Transform price to log scale (using log1p to handle zero values)
df["log_price"] = np.log1p(df["price"])
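
Because the target is stored as `log1p(price)`, anything predicted on that scale has to be mapped back with the inverse transform `np.expm1` before it can be read as a price. A minimal sketch of the round trip, using made-up prices rather than values from the listings data:

```python
import numpy as np

# Hypothetical nightly prices, including a zero to show why log1p is used
prices = np.array([0.0, 45.0, 120.0, 950.0])

# log1p(x) = log(1 + x), so a price of 0 maps to 0 instead of -inf
log_prices = np.log1p(prices)

# expm1(x) = exp(x) - 1 is the inverse of log1p
recovered = np.expm1(log_prices)

print(log_prices)
print(recovered)
```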

## Features & Target

In [3]:
# Prepare features (X) by removing price columns from the dataframe
X = df.drop(["price", "log_price"], axis=1)
# Set target variable (y) to the log-transformed price column
y = df["log_price"]

## Train–Test Split

In [4]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

## Preprocessing Pipeline

In [5]:
# Define categorical features that need one-hot encoding
categorical_features = ["neighbourhood", "room_type"]

# Define numerical features that will be passed through without transformation
numerical_features = [
    "minimum_nights",
    "number_of_reviews",
    "reviews_per_month",
    "availability_365"
]

# Create a preprocessor using ColumnTransformer to handle both feature types
preprocessor = ColumnTransformer(
    transformers=[
        # Transform categorical features using one-hot encoding, ignoring unknown categories
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
        # Pass numerical features through without transformation
        ("num", "passthrough", numerical_features)
    ]
)
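
To illustrate what `handle_unknown="ignore"` does, the sketch below fits a similar transformer on toy data and then transforms a row whose category was not seen during fitting. The frames `train` and `test`, the name `toy_preprocessor`, and the category values are invented for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data; the values are invented for illustration
train = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "minimum_nights": [1, 3, 2],
})
test = pd.DataFrame({
    "room_type": ["Shared room"],  # category never seen during fit
    "minimum_nights": [5],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
        ("num", "passthrough", ["minimum_nights"]),
    ]
)

toy_preprocessor.fit(train)
encoded = toy_preprocessor.transform(test)
# Convert to a dense array if a sparse matrix is returned
encoded = encoded.toarray() if hasattr(encoded, "toarray") else encoded
print(encoded)
```

The unseen category is encoded as all zeros in the one-hot columns, while `minimum_nights` passes through unchanged. Columns not listed in any transformer are dropped, since `ColumnTransformer` defaults to `remainder="drop"`.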

## Linear Regression

In [6]:
# Create a pipeline that combines preprocessing and linear regression
lr_model = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

# Fit the pipeline on the raw training data (preprocessing is applied inside the pipeline)
lr_model.fit(X_train, y_train)

## Random Forest

In [7]:
# Create a pipeline that combines preprocessing and a Random Forest model
rf_model = Pipeline([
    # First step: apply the previously defined preprocessor
    ("preprocessor", preprocessor),
    # Second step: create and configure a Random Forest regression model
    ("model", RandomForestRegressor(
        n_estimators=100,  # Use 100 trees in the forest
        random_state=42    # Set random seed for reproducibility
    ))
])

# Fit the pipeline on the raw training data (preprocessing is applied inside the pipeline)
rf_model.fit(X_train, y_train)

## Evaluation

In [8]:
def rmse(model):
    """Return the test-set RMSE of a fitted model, on the log1p(price) scale."""
    # Generate predictions for the held-out test features
    preds = model.predict(X_test)
    # Compute MSE, then take the square root to get RMSE
    return np.sqrt(mean_squared_error(y_test, preds))

# Calculate and print the RMSE for the Linear Regression model
print("Linear Regression RMSE:", rmse(lr_model))
# Calculate and print the RMSE for the Random Forest model
print("Random Forest RMSE:", rmse(rf_model))

Linear Regression RMSE: 0.4825833583189827
Random Forest RMSE: 0.4586829032018365
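
These RMSE values are on the `log1p(price)` scale, not in price units. One way to express the error in price units is to invert both the targets and the predictions with `np.expm1` before computing RMSE. The helper `rmse_price_scale` below is hypothetical and the arrays are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse_price_scale(y_true_log, y_pred_log):
    """RMSE in original price units, given log1p-scaled targets and predictions.

    Hypothetical helper, not part of the original notebook.
    """
    y_true = np.expm1(y_true_log)
    y_pred = np.expm1(y_pred_log)
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Made-up values for illustration only
true_prices = np.array([50.0, 100.0, 200.0])
pred_prices = np.array([60.0, 90.0, 210.0])

error = rmse_price_scale(np.log1p(true_prices), np.log1p(pred_prices))
print(error)
```

In this notebook it could be called as `rmse_price_scale(y_test, rf_model.predict(X_test))`.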


## Conclusion

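The Random Forest model achieves a lower test RMSE than Linear Regression
(0.459 versus 0.483 on the `log1p(price)` scale), which suggests non-linear
relationships between listing features and price. Location and room type are
expected to be the strongest predictors of nightly rental prices in Lisbon,
though this notebook does not measure feature importance directly.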
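
One way to check which features drive the predictions is permutation importance, which shuffles one input column at a time and records how much the model's score drops. The sketch below uses synthetic stand-in data in which the target depends only on `room_type`; the names `toy`, `toy_target`, and `toy_model` are assumptions for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data: the target depends on room_type only
rng = np.random.default_rng(42)
n = 300
toy = pd.DataFrame({
    "room_type": rng.choice(["Private room", "Entire home/apt"], size=n),
    "minimum_nights": rng.integers(1, 30, size=n),
})
toy_target = np.where(toy["room_type"] == "Entire home/apt", 5.0, 4.0)
toy_target = toy_target + rng.normal(0, 0.1, size=n)

toy_model = Pipeline([
    ("preprocessor", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["room_type"]),
        ("num", "passthrough", ["minimum_nights"]),
    ])),
    ("model", RandomForestRegressor(n_estimators=50, random_state=42)),
])
toy_model.fit(toy, toy_target)

# Shuffle each raw column in turn and measure the drop in the model's score
result = permutation_importance(
    toy_model, toy, toy_target, n_repeats=5, random_state=42
)
importances = pd.Series(result.importances_mean, index=toy.columns)
print(importances.sort_values(ascending=False))
```

Applied to this notebook, the same call would take `rf_model`, `X_test`, and `y_test`, so that importances are measured on held-out data.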
