**ORHAN OLGEN**

**Prediction with Machine Learning for Economist**

**Assignment 2**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.simplefilter('ignore')
from math import radians, sin, cos, sqrt, atan2
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
from statsmodels.iolib.summary2 import summary_col
from sklearn.linear_model import LassoCV
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
#pip install xgboost
import xgboost as xgb
#!pip install lightgbm
import lightgbm as lgb
import time


In this project, we analyze the Airbnb dataset for Lisbon and Porto, containing over 24k rows. We will build and compare five predictive models, evaluating their fit, time, and feature importance. We will also validate the models by testing them on another dataset from Lisbon and Porto. Let's start by importing the Lisbon dataset for analysis.

In [None]:
airbnb=pd.read_csv('listings-earlier.csv')
print(airbnb.shape)
airbnb.head()
airbnb.columns

In [None]:
airbnb['host_listings_count']

The dataset contains 75 variables, but not all are necessary for our price model analysis. Therefore, I will drop irrelevant variables and retain only the most important ones for the next stage.

# **Part 1. Modelling**

# 1.1 Data Wrangling

I’ve selected the following key variables for the pricing model: Room type, Latitude, Longitude, Accommodation, Bedrooms, Host listings, Max/Min Nights, Reviews, Review Score, and Price (target). Let’s keep these variables and explore their types.

In [None]:
# Select only the variables of interest
selected_columns = [
    'room_type',
    'latitude',
    'longitude',
    'accommodates',
    'bedrooms',
    'host_total_listings_count',
    'maximum_nights',
    'minimum_nights',
    'number_of_reviews',
    'review_scores_rating',
    'price'
]

# Create new dataframe with only selected variables
selected_df = airbnb[selected_columns].copy()
print(selected_df.shape)
print(selected_df.dtypes)
selected_df.head()


The target variable `price` is currently an object due to the dollar sign (e.g., $105.00). I will convert it to a float and remove any observations with missing values for smoother analysis.

In [None]:
# Convert price column to float
selected_df['price'] = (
    selected_df['price']
    .astype(str)  # ensure everything is treated as string
    .str.replace('[\$,]', '', regex=True)  # Remove both $ and commas
    .astype(float)
)

print(selected_df['price'].head())

# Drop rows with missing prices
selected_df = selected_df.dropna(subset=['price'])
print("\nMissing values after cleaning:", selected_df['price'].isnull().sum())


Since `room_type` contains values like 'Entire home/apt', we will convert it into a categorical variable for smoother analysis. Let's proceed with this transformation.

In [None]:
# First, let's examine the current values
print("Current room_type values:")
print(selected_df['room_type'].value_counts())

`room_type` has four unique values. We'll standardize the text by converting to lowercase and removing special characters

In [None]:
# Clean the room_type strings
selected_df['room_type'] = (
    selected_df['room_type']
    .str.strip()
    .str.lower()
    .str.replace(r'[\/\s]', '_', regex=True)
)

# Create dummy variables
dummies = pd.get_dummies(
    selected_df['room_type'],
    prefix='room',
    drop_first=False
)

# Join with original data
selected_df = pd.concat([selected_df, dummies], axis=1)
print(selected_df.filter(like='room_').head())
selected_df.head()
#print(selected_df.shape)

Here, we analyze Airbnb listings in Lisbon by calculating the distance from the city center (Praça do Comércio) using the Haversine formula. This helps identify location patterns, which are important for our later modeling analysis. The coordinates for Praça do Comércio were sourced from Google.

In [None]:
# centre point for refrence to caluate the distance later
CENTER_LAT = 38.7223  # latitude
CENTER_LON = -9.1393  # longitude

# clean coordinate columns
selected_df['latitude'] = (
    selected_df['latitude']
    .astype(str)
    .str.replace('[^0-9.-]', '', regex=True)  # Remove non-numeric chars
    .replace('', np.nan)
    .astype(float)
)

selected_df['longitude'] = (
    selected_df['longitude']
    .astype(str)
    .str.replace('[^0-9.-]', '', regex=True)
    .replace('', np.nan)
    .astype(float)
)

# drop rows with invalid coordinates
print(f"Before cleaning: {len(selected_df)} rows")
selected_df = selected_df.dropna(subset=['latitude', 'longitude'])
print(f"After cleaning: {len(selected_df)} rows")

def haversine_distance(lat1, lon1, lat2, lon2):
    """Calculate distance between two points in kilometers"""
    R = 6371  # Earth radius in km

    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1

    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * atan2(sqrt(a), sqrt(1-a))

    return R * c

# create new distance column
selected_df['distance_to_center'] = selected_df.apply(
    lambda row: haversine_distance(
        row['latitude'],
        row['longitude'],
        CENTER_LAT,
        CENTER_LON
    ),
    axis=1
)

# Verify results
print("\nDistance statistics (km):")
print(selected_df['distance_to_center'].describe())




We plan to bin `accommodates` and `bedrooms` to simplify these continuous variables into meaningful categories, making patterns easier to detect. Binning reduces the impact of outliers and improves model interpretability. First, let's check the maximum and minimum values for both variables.


In [None]:
print(selected_df['accommodates'].value_counts())
print(selected_df['bedrooms'].value_counts())
selected_df['accommodates'].agg(['min', 'max'])
selected_df['bedrooms'].agg(['min', 'max'])

In [None]:
# binning for variable: accomodation
selected_df['accommodates_bin'] = pd.cut(selected_df['accommodates'], bins=[0, 2, 4, 6, 9, 20], labels=False)
selected_df['bedrooms_bin'] = pd.cut(selected_df['bedrooms'], bins=[0, 1, 2, 3, 4, 16], labels=False)
selected_df.head()



We’ll include `host_listings_count` to capture differences between individual and professional hosts. Hosts with more listings may have different pricing strategies or offer more consistent quality. Let's first check the unique observations to clean the data for modeling.

In [None]:
print(selected_df['host_total_listings_count'].value_counts())
selected_df['host_total_listings_count'].unique()
selected_df['host_total_listings_count'].agg(['min', 'max'])

In [None]:
# creating bins
selected_df['host_scale'] = pd.cut(selected_df['host_total_listings_count'], bins=[0, 1, 5, 10, 1149], labels=False)


We’ll use binning for `minimum_nights` and `maximum_nights` to reduce complexity and outliers. Binning groups listings into categories like short-term or long-term stays, making patterns clearer and improving model performance.

In [None]:
print(selected_df['maximum_nights'].value_counts())
selected_df['maximum_nights'].unique()
selected_df['maximum_nights'].agg(['min', 'max'])

In [None]:
print(selected_df['minimum_nights'].value_counts())
selected_df['minimum_nights'].unique()
selected_df['minimum_nights'].agg(['min', 'max'])

In [None]:
#creating bins for both
selected_df['min_nights_bin'] = pd.cut(selected_df['minimum_nights'], bins=[0, 3, 7, 30, 701], labels=False)
selected_df['max_nights_bin'] = pd.cut(selected_df['maximum_nights'], bins=[0, 7, 30, 100, 10000], labels=False)
selected_df.head()



Review Number and Review Score Rating

In [None]:
print(selected_df['number_of_reviews'].value_counts())
selected_df['number_of_reviews'].unique()
selected_df['number_of_reviews'].agg(['min', 'max'])

In [None]:
print(selected_df['review_scores_rating'].value_counts())
selected_df['review_scores_rating'].unique()
selected_df['review_scores_rating'].agg(['min', 'max'])

The highest review rating is 5.0, and the lowest is 0. We will drop observations with `nan` values. For `number_of_reviews`, we’ll bin it to group listings by popularity and experience level, reducing outliers and simplifying model interpretation. Let's proceed with the code.

In [None]:
# Drop rows where review_scores_rating is missing
selected_df = selected_df.dropna(subset=['number_of_reviews'])
selected_df['review_number'] = pd.cut(selected_df['number_of_reviews'], bins=[0, 100, 200, 300 , 400, 500, 600, 2612], labels=False)
selected_df.head()

#**1.2.1 Model: OLS**

We built and compared three OLS regression models. The basic model includes distance to the city center, review score, and number of bedrooms. We gradually added more variables like accommodates, nights, number of reviews, and host listing count. The models are displayed side by side using `summary_col` for comparison.

In [None]:
# Features and Target
X = selected_df.drop('price', axis=1)
y = selected_df['price']

# Keep only numeric variables
X = X.select_dtypes(include=[np.number])

# Filter to Drop NaNs for Required Columns
required_cols = [
    'distance_to_center', 'review_scores_rating', 'bedrooms_bin',
    'accommodates_bin', 'min_nights_bin', 'max_nights_bin', 'host_scale', 'review_number'
]

valid_index = X[required_cols].dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Add constant for OLS intercept
X = sm.add_constant(X)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define Variables for Each Model
model1_vars = ['const', 'distance_to_center', 'review_scores_rating', 'bedrooms_bin']
model2_vars = model1_vars + ['accommodates_bin', 'max_nights_bin', 'min_nights_bin']
model3_vars = model2_vars + ['host_scale', 'review_number']

# Step 6: Fit the Models
model1 = sm.OLS(y_train, X_train[model1_vars]).fit()
model2 = sm.OLS(y_train, X_train[model2_vars]).fit()
model3 = sm.OLS(y_train, X_train[model3_vars]).fit()

# Display Model Comparison Table
summary = summary_col(
    results=[model1, model2, model3],
    float_format='%0.3f',
    stars=True,
    model_names=['Model 1', 'Model 2', 'Model 3'],
    info_dict={
        'R-squared': lambda x: f"{x.rsquared:.3f}",
        'Adj. R-squared': lambda x: f"{x.rsquared_adj:.3f}",
        'BIC': lambda x: f"{x.bic:.0f}",
        'Observations': lambda x: f"{int(x.nobs)}"
    }
)

print(summary)
#selected_df

**Basic Interpretation**

Model fit improves slightly with each added variable, with R² increasing from 0.033 (Model 1) to 0.037 (Model 3). Distance to the center has no effect, while `bedrooms_bin` strongly influences price. In Models 2 and 3, `accommodates_bin` and `min_nights_bin` are significant, suggesting capacity and stay duration impact pricing. Model 3 also shows that hosts with more listings and fewer reviews tend to charge higher prices. Despite these findings, the models explain only a small portion of price variation, with Model 3 performing best.

#**1.2.2 Model: LASSO**

We use `LassoCV` with cross-validation to select the best regularization strength and `StandardScaler` to scale features. After fitting the model, we print the selected alpha and coefficients, then evaluate performance on the test set using R² and RMSE.

In [None]:
# Drop rows with NaNs
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [None]:

# Build LASSO Pipeline
lasso_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', LassoCV(cv=5, random_state=42))
])

# Fit LASSO
lasso_pipeline.fit(X_train, y_train)

# Get Best Alpha and Coefficients
best_alpha = lasso_pipeline.named_steps['lasso'].alpha_
lasso_coefficients = lasso_pipeline.named_steps['lasso'].coef_

print(f"Best alpha (lambda): {best_alpha:.4f}")
print("\nLASSO Coefficients:")
for feature, coef in zip(X.columns, lasso_coefficients):
    print(f"{feature}: {coef:.4f}")

# Evaluate Model
y_pred = lasso_pipeline.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"\nTest R²: {r2:.4f}")
print(f"Test RMSE: {rmse:.2f}")

**Basic Interpretation**

The RMSE of 107.16 indicates accurate predictions. LASSO set several coefficients to zero, showing little predictive value. Key predictors included `accommodates`, `bedrooms`, and `minimum_nights`.


#**1.2.3 Model: Random Forest**

In this part, we trained a Random Forest regression model to predict Airbnb prices using all numeric features. We split the data into training and test sets, fit the model, and evaluated its performance using R² and RMSE. We also extracted and ranked feature importances to understand which variables contributed most to the model's predictions. This helps interpret the model and compare it with previous approaches like OLS and LASSO.

In [None]:
# Select Relevant Features
selected_features = [
    'distance_to_center', 'review_scores_rating', 'bedrooms_bin',
    'accommodates_bin', 'min_nights_bin', 'max_nights_bin',
    'host_scale', 'review_number'
]

X = selected_df[selected_features]
y = selected_df['price']
# Drop NaNs and Split Data
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = rf_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Random Forest Test R²: {r2:.4f}")
print(f"Random Forest Test RMSE: {rmse:.2f}")

# Show Feature Importances
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))

**Basic Intepretation**

The Random Forest model has a Test R² of -1.12, indicating poor predictions. The high RMSE of 177 confirms this. Distance_to_center was the most important feature, contributing 69%, while `accommodates_bin`, `number_of_reviews`, and `bedrooms_bin` had less impact. The negative R² suggests overfitting or issues with the binned data.

#**1.2.4 Model: XGBoost**

For the fourth model, we use XGBoost, training it on numeric features and evaluating performance with R² and RMSE. After fitting, we extract and rank the most important features based on their contribution to the model.

In [None]:
# Keep only numeric columns and drop NaNs
X = X.select_dtypes(include='number')
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and Fit XGBoost Regressor
xgb_model = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    verbosity=0
)

xgb_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = xgb_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"XGBoost Test R²: {r2:.4f}")
print(f"XGBoost Test RMSE: {rmse:.2f}")

# Feature Importance
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': xgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))


#**1.2.5 Model: LightGBM**

In this part, we use LightGBM as out fifth model.

We evaluated its performance using R² and RMSE, and analyzed which features were most important for prediction. This helped us assess how well LightGBM captures complex patterns in the data compared to other models like Random Forest and XGBoost.

In [None]:
# Fit LightGBM Regressor
lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
lgb_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = lgb_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"LightGBM Test R²: {r2:.4f}")
print(f"LightGBM Test RMSE: {rmse:.2f}")

# Feature Importance
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': lgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))


# **1.3.1 Horserace table -- Model Comparasion**




We compare five regression models—OLS, LASSO, Random Forest, XGBoost, and LightGBM—on Airbnb price prediction. For each model, we record R², RMSE, and training time, then compile the results into a horserace table to evaluate fit and training speed.

In [None]:
results = []
# OLS
start = time.time()
lr_model = LinearRegression().fit(X_train, y_train)
lr_pred = lr_model.predict(X_test)
end = time.time()
results.append(['OLS', r2_score(y_test, lr_pred), np.sqrt(mean_squared_error(y_test, lr_pred)), end - start])

# LASSO
start = time.time()
lasso = LassoCV(cv=5).fit(X_train, y_train)
lasso_pred = lasso.predict(X_test)
end = time.time()
results.append(['LASSO', r2_score(y_test, lasso_pred), np.sqrt(mean_squared_error(y_test, lasso_pred)), end - start])

# Random Forest
start = time.time()
rf = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train, y_train)
rf_pred = rf.predict(X_test)
end = time.time()
results.append(['Random Forest', r2_score(y_test, rf_pred), np.sqrt(mean_squared_error(y_test, rf_pred)), end - start])

# XGBoost
start = time.time()
xgb_model = xgb.XGBRegressor(n_estimators=100, learning_rate=0.1, random_state=42, verbosity=0).fit(X_train, y_train)
xgb_pred = xgb_model.predict(X_test)
end = time.time()
results.append(['XGBoost', r2_score(y_test, xgb_pred), np.sqrt(mean_squared_error(y_test, xgb_pred)), end - start])

# LightGBM
start = time.time()
lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42).fit(X_train, y_train)
lgb_pred = lgb_model.predict(X_test)
end = time.time()
results.append(['LightGBM', r2_score(y_test, lgb_pred), np.sqrt(mean_squared_error(y_test, lgb_pred)), end - start])

# Horserace Table
horse_df = pd.DataFrame(results, columns=['Model', 'R² Score', 'RMSE', 'Train Time (s)'])
horse_df.sort_values(by='R² Score', ascending=False, inplace=True)
print("\nModel Comparison (Horserace Table):")
print(horse_df.to_string(index=False))


The horserace table shows LASSO and OLS performed best, with an R² of 0.26 and the lowest RMSE (104), indicating strong predictive power. Simpler linear models outperformed more complex ones.

# **1.4.1 Feature Importance: RF and LightGBM**


We extracted feature importance values from the Random Forest and LightGBM models, selected the top 10 features from each, and merged them into a single comparison table.

In [None]:
# Extract feature importances
rf_importance = pd.DataFrame({
    'Feature': X.columns,
    'RF Importance': rf.feature_importances_
}).sort_values(by='RF Importance', ascending=False).reset_index(drop=True)

lgb_importance = pd.DataFrame({
    'Feature': X.columns,
    'LightGBM Importance': lgb_model.feature_importances_
}).sort_values(by='LightGBM Importance', ascending=False).reset_index(drop=True)

# Merge Top 10 from each
rf_top10 = rf_importance.head(10)
lgb_top10 = lgb_importance.head(10)

# Merge on feature name for comparison
comparison_df = pd.merge(rf_top10, lgb_top10, on='Feature', how='outer').fillna(0)
comparison_df = comparison_df.sort_values(by='RF Importance', ascending=False).reset_index(drop=True)
print("Top 10 Feature Importance Comparison (RF vs LightGBM):")
print(comparison_df.to_string(index=False))


The feature importance comparison shows both Random Forest and LightGBM agree on the most influential variable: distance_to_center. This consistency supports the reliability of this feature, despite weak overall model performance.

## **Part 2: Validity**

# **2.1 Data Wrangling**

In [None]:
lisbon_dec=pd.read_csv('listings-later.csv')
print(lisbon_dec.shape)
lisbon_dec.head()
lisbon_dec.columns

In [None]:
# Select only the variables of interest
selected_columns_dec = [
    'room_type',
    'latitude',
    'longitude',
    'accommodates',
    'bedrooms',
    'host_total_listings_count',
    'maximum_nights',
    'minimum_nights',
    'number_of_reviews',
    'review_scores_rating',
    'price'
]

# Create new dataframe with only selected variables
selected_dec = airbnb[selected_columns_dec].copy()
print(selected_dec.shape)
print(selected_dec.dtypes)
selected_dec.head()

**2.1.1. Price**


In [None]:
# Convert price column to float
selected_dec['price'] = (
    selected_dec['price']
    .astype(str)  # ensure everything is treated as string
    .str.replace('[\$,]', '', regex=True)  # Remove both $ and commas
    .astype(float)
)

print(selected_dec['price'].head())

# Drop rows with missing prices
selected_dec = selected_dec.dropna(subset=['price'])
print("\nMissing values after cleaning:", selected_dec['price'].isnull().sum())

**2.1.2 Room Type**

In [None]:
# Clean the room_type strings
selected_dec['room_type'] = (
    selected_dec['room_type']
    .str.strip()
    .str.lower()
    .str.replace(r'[\/\s]', '_', regex=True)
)

# Create dummy variables
dummies = pd.get_dummies(
    selected_dec['room_type'],
    prefix='room',
    drop_first=False
)

# Join with original data
selected_dec = pd.concat([selected_dec, dummies], axis=1)

# Preview the dummy variables
print(selected_dec.filter(like='room_').head())
selected_dec.head()


**2.1.3 Latitude and Longitude**

In [None]:
# Centre point for reference ( )
CENTER_LAT = 38.7223  # latitude
CENTER_LON = -9.1393  # longitude

# Clean coordinate columns
selected_dec['latitude'] = (
    selected_dec['latitude']
    .astype(str)
    .str.replace('[^0-9.-]', '', regex=True)
    .replace('', np.nan)
    .astype(float)
)

selected_dec['longitude'] = (
    selected_dec['longitude']
    .astype(str)
    .str.replace('[^0-9.-]', '', regex=True)
    .replace('', np.nan)
    .astype(float)
)

# Drop rows with invalid coordinates
print(f"Before cleaning: {len(selected_dec)} rows")
selected_dec = selected_dec.dropna(subset=['latitude', 'longitude'])
print(f"After cleaning: {len(selected_dec)} rows")

# Haversine formula to calculate distance to center
def haversine_distance(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

# Create new distance column
selected_dec['distance_to_center'] = selected_dec.apply(
    lambda row: haversine_distance(
        row['latitude'],
        row['longitude'],
        CENTER_LAT,
        CENTER_LON
    ),
    axis=1
)

# Summary stats
print("\nDistance statistics (km):")
print(selected_dec['distance_to_center'].describe())



**2.1.4 Accomodation and Bedrooms**

In [None]:
selected_dec['accommodates'].agg(['min', 'max'])

In [None]:
selected_dec['bedrooms'].agg(['min', 'max'])

In [None]:
# binning for variable: accomodation
selected_dec['accommodates_bin'] = pd.cut(selected_dec['accommodates'], bins=[0, 2, 4, 6, 9, 16], labels=False)
selected_dec['bedrooms_bin'] = pd.cut(selected_dec['bedrooms'], bins=[0, 1, 2, 3, 4, 15], labels=False)
# selected_dec.head()

**2.1.5 Host Listings Count**

In [None]:
selected_dec['host_total_listings_count'].agg(['min', 'max'])

In [None]:
#creating bins
selected_dec['host_scale'] = pd.cut(selected_dec['host_total_listings_count'], bins=[0, 1, 5, 10, 1149], labels=False)

**2.1.6 Maximum and Minimum Nights**

In [None]:
selected_dec['maximum_nights'].agg(['min', 'max'])

In [None]:
selected_dec['minimum_nights'].agg(['min', 'max'])

In [None]:
#creating bins for both
selected_dec['min_nights_bin'] = pd.cut(selected_dec['minimum_nights'], bins=[0, 3, 7, 30, 701], labels=False)
selected_dec['max_nights_bin'] = pd.cut(selected_dec['maximum_nights'], bins=[0, 7, 30, 100, 10000], labels=False)
selected_dec.head()

**2.1.7 Review Number and Review Score Rating**

In [None]:
selected_dec['number_of_reviews'].agg(['min', 'max'])

In [None]:
# Drop rows where review_scores_rating is missing
selected_dec = selected_dec.dropna(subset=['number_of_reviews'])
selected_dec['review_number'] = pd.cut(selected_dec['number_of_reviews'], bins=[0, 100, 200, 300 , 400, 500, 600, 2612], labels=False)
selected_dec.head()

#**2.2.1 Model: OLS**

In [None]:
# Features and Target
X = selected_dec.drop('price', axis=1)
y = selected_dec['price']

# Keep only numeric variables
X = X.select_dtypes(include=[np.number])

# Filter to Drop NaNs for Required Columns
required_cols = [
    'distance_to_center', 'review_scores_rating', 'bedrooms_bin',
    'accommodates_bin', 'min_nights_bin', 'max_nights_bin', 'host_scale', 'review_number'
]

valid_index = X[required_cols].dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Add constant for OLS intercept
X = sm.add_constant(X)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define Variables for Each Model
model1_vars = ['const', 'distance_to_center', 'review_scores_rating', 'bedrooms_bin']
model2_vars = model1_vars + ['accommodates_bin', 'max_nights_bin', 'min_nights_bin']
model3_vars = model2_vars + ['host_scale', 'review_number']

# Step 6: Fit the Models
model1 = sm.OLS(y_train, X_train[model1_vars]).fit()
model2 = sm.OLS(y_train, X_train[model2_vars]).fit()
model3 = sm.OLS(y_train, X_train[model3_vars]).fit()

# Display Model Comparison Table
summary = summary_col(
    results=[model1, model2, model3],
    float_format='%0.3f',
    stars=True,
    model_names=['Model 1', 'Model 2', 'Model 3'],
    info_dict={
        'R-squared': lambda x: f"{x.rsquared:.3f}",
        'Adj. R-squared': lambda x: f"{x.rsquared_adj:.3f}",
        'BIC': lambda x: f"{x.bic:.0f}",
        'Observations': lambda x: f"{int(x.nobs)}"
    }
)

print(summary)


#**2.2.2 Model: LASSO**


In [None]:
# Drop rows with NaNs
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [None]:

# Build LASSO Pipeline
lasso_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', LassoCV(cv=5, random_state=42))
])

# Fit LASSO
lasso_pipeline.fit(X_train, y_train)

# Get Best Alpha and Coefficients
best_alpha = lasso_pipeline.named_steps['lasso'].alpha_
lasso_coefficients = lasso_pipeline.named_steps['lasso'].coef_

print(f"Best alpha (lambda): {best_alpha:.4f}")
print("\nLASSO Coefficients:")
for feature, coef in zip(X.columns, lasso_coefficients):
    print(f"{feature}: {coef:.4f}")

# Evaluate Model
y_pred = lasso_pipeline.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"\nTest R²: {r2:.4f}")
print(f"Test RMSE: {rmse:.2f}")

#**2.2.3 Model: Random Forest**

In [None]:
# Select Relevant Features
selected_features = [
    'distance_to_center', 'review_scores_rating', 'bedrooms_bin',
    'accommodates_bin', 'min_nights_bin', 'max_nights_bin',
    'host_scale', 'review_number'
]

X = selected_dec[selected_features]
y = selected_dec['price']

# Drop NaNs and Split Data
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = rf_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Random Forest Test R²: {r2:.4f}")
print(f"Random Forest Test RMSE: {rmse:.2f}")

# Show Feature Importances
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))


#**2.2.4 Model: XGBoost**

In [None]:
# Keep only numeric columns and drop NaNs
X = X.select_dtypes(include='number')
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and Fit XGBoost Regressor
xgb_model = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    verbosity=0
)

xgb_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = xgb_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"XGBoost Test R²: {r2:.4f}")
print(f"XGBoost Test RMSE: {rmse:.2f}")

# Feature Importance
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': xgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))


#**1.2.5 Model: LightGBM**

In [None]:
# Fit LightGBM Regressor
lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
lgb_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = lgb_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"LightGBM Test R²: {r2:.4f}")
print(f"LightGBM Test RMSE: {rmse:.2f}")

# Feature Importance
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': lgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))


**Step 1:** Load and Process Porto Dataset

In [54]:
porto=pd.read_csv('listings-porto.csv')
print(porto.shape)
porto.head()
porto.columns

(14360, 79)


Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

In [55]:
# Select only the variables of interest
selected_columns_porto = [
    'room_type',
    'latitude',
    'longitude',
    'accommodates',
    'bedrooms',
    'host_total_listings_count',
    'maximum_nights',
    'minimum_nights',
    'number_of_reviews',
    'review_scores_rating',
    'price'
]

# Create new dataframe with only selected variables
selected_porto = airbnb[selected_columns_porto].copy()
print(selected_porto.shape)
print(selected_porto.dtypes)
selected_porto.head()

(24181, 11)
room_type                     object
latitude                     float64
longitude                    float64
accommodates                   int64
bedrooms                     float64
host_total_listings_count    float64
maximum_nights                 int64
minimum_nights                 int64
number_of_reviews              int64
review_scores_rating         float64
price                         object
dtype: object


Unnamed: 0,room_type,latitude,longitude,accommodates,bedrooms,host_total_listings_count,maximum_nights,minimum_nights,number_of_reviews,review_scores_rating,price
0,Entire home/apt,38.6975,-9.19768,3,1.0,2.0,365,4,77,4.52,$35.00
1,Entire home/apt,38.71241,-9.12706,3,1.0,2.0,14,2,215,4.82,$78.00
2,Entire home/apt,38.71156,-9.12987,4,1.0,1.0,1125,2,407,4.77,$63.00
3,Entire home/apt,38.71108,-9.15979,16,9.0,2.0,1125,2,140,4.95,$984.00
4,Entire home/apt,38.74606,-9.15358,3,1.0,2.0,365,6,61,4.57,$100.00


In [56]:
# Convert price column to float
selected_porto['price'] = (
    selected_porto['price']
    .astype(str)  # ensure everything is treated as string
    .str.replace('[\$,]', '', regex=True)  # Remove both $ and commas
    .astype(float)
)

print(selected_porto['price'].head())

# Drop rows with missing prices
selected_porto = selected_porto.dropna(subset=['price'])
print("\nMissing values after cleaning:", selected_porto['price'].isnull().sum())

0     35.0
1     78.0
2     63.0
3    984.0
4    100.0
Name: price, dtype: float64

 0ssing values after cleaning:


In [57]:
# Clean the room_type strings
selected_porto['room_type'] = (
    selected_porto['room_type']
    .str.strip()
    .str.lower()
    .str.replace(r'[\/\s]', '_', regex=True)
)

# Create dummy variables
dummies = pd.get_dummies(
    selected_porto['room_type'],
    prefix='room',
    drop_first=False
)

# Join with original data
selected_porto = pd.concat([selected_porto, dummies], axis=1)

# Preview the dummy variables
print(selected_porto.filter(like='room_').head())
selected_porto.head()


         room_type  room_entire_home_apt  room_hotel_room  room_private_room  \
0  entire_home_apt                  True            False              False   
1  entire_home_apt                  True            False              False   
2  entire_home_apt                  True            False              False   
3  entire_home_apt                  True            False              False   
4  entire_home_apt                  True            False              False   

   room_shared_room  
0             False  
1             False  
2             False  
3             False  
4             False  


Unnamed: 0,room_type,latitude,longitude,accommodates,bedrooms,host_total_listings_count,maximum_nights,minimum_nights,number_of_reviews,review_scores_rating,price,room_entire_home_apt,room_hotel_room,room_private_room,room_shared_room
0,entire_home_apt,38.6975,-9.19768,3,1.0,2.0,365,4,77,4.52,35.0,True,False,False,False
1,entire_home_apt,38.71241,-9.12706,3,1.0,2.0,14,2,215,4.82,78.0,True,False,False,False
2,entire_home_apt,38.71156,-9.12987,4,1.0,1.0,1125,2,407,4.77,63.0,True,False,False,False
3,entire_home_apt,38.71108,-9.15979,16,9.0,2.0,1125,2,140,4.95,984.0,True,False,False,False
4,entire_home_apt,38.74606,-9.15358,3,1.0,2.0,365,6,61,4.57,100.0,True,False,False,False


In [58]:
# Define Porto's city center coordinates (example: Praça da Liberdade)
PORTO_CENTER_LAT = 41.1579
PORTO_CENTER_LON = -8.6291

# Clean coordinate columns
selected_porto['latitude'] = (
    selected_porto['latitude']
    .astype(str)
    .str.replace('[^0-9.-]', '', regex=True)
    .replace('', np.nan)
    .astype(float)
)

selected_porto['longitude'] = (
    selected_porto['longitude']
    .astype(str)
    .str.replace('[^0-9.-]', '', regex=True)
    .replace('', np.nan)
    .astype(float)
)

# Drop rows with invalid coordinates
print(f"Before cleaning: {len(selected_porto)} rows")
selected_porto = selected_porto.dropna(subset=['latitude', 'longitude'])
print(f"After cleaning: {len(selected_porto)} rows")

# Haversine formula to calculate distance to center
def haversine_distance(lat1, lon1, lat2, lon2):
    R = 6371  # Earth radius in km
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2)**2 + cos(lat1) * cos(lat2) * sin(dlon / 2)**2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return R * c

# Create new distance column
selected_porto['distance_to_center'] = selected_porto.apply(
    lambda row: haversine_distance(
        row['latitude'],
        row['longitude'],
        CENTER_LAT,
        CENTER_LON
    ),
    axis=1
)

# Summary stats
print("\nDistance statistics (km):")
print(selected_porto['distance_to_center'].describe())



Before cleaning: 21457 rows
After cleaning: 21457 rows

Distance statistics (km):
count    21457.000000
mean        10.037012
std         14.355003
min          0.065004
25%          1.223089
50%          1.858989
75%         17.284823
max         65.240474
Name: distance_to_center, dtype: float64


In [59]:
selected_porto['accommodates'].agg(['min', 'max'])

min     1
max    16
Name: accommodates, dtype: int64

In [60]:
selected_porto['bedrooms'].agg(['min', 'max'])

min     0.0
max    25.0
Name: bedrooms, dtype: float64

In [61]:
# binning for variable: accomodation
selected_porto['accommodates_bin'] = pd.cut(selected_porto['accommodates'], bins=[0, 2, 4, 6, 9, 16], labels=False)
selected_porto['bedrooms_bin'] = pd.cut(selected_porto['bedrooms'], bins=[0, 1, 2, 3, 4, 15], labels=False)


In [62]:
selected_porto['host_total_listings_count'].agg(['min', 'max'])

min       1.0
max    7977.0
Name: host_total_listings_count, dtype: float64

In [63]:
#creating bins
selected_porto['host_scale'] = pd.cut(selected_porto['host_total_listings_count'], bins=[0, 1, 5, 10, 1149], labels=False)

In [64]:
selected_porto['maximum_nights'].agg(['min', 'max'])

min        1
max    36180
Name: maximum_nights, dtype: int64

In [65]:
selected_porto['minimum_nights'].agg(['min', 'max'])

min      1
max    730
Name: minimum_nights, dtype: int64

In [66]:
#creating bins for both
selected_porto['min_nights_bin'] = pd.cut(selected_porto['minimum_nights'], bins=[0, 3, 7, 30, 701], labels=False)
selected_porto['max_nights_bin'] = pd.cut(selected_porto['maximum_nights'], bins=[0, 7, 30, 100, 10000], labels=False)
selected_porto.head()

Unnamed: 0,room_type,latitude,longitude,accommodates,bedrooms,host_total_listings_count,maximum_nights,minimum_nights,number_of_reviews,review_scores_rating,...,room_entire_home_apt,room_hotel_room,room_private_room,room_shared_room,distance_to_center,accommodates_bin,bedrooms_bin,host_scale,min_nights_bin,max_nights_bin
0,entire_home_apt,38.6975,-9.19768,3,1.0,2.0,365,4,77,4.52,...,True,False,False,False,5.767489,1,0.0,1.0,1.0,3.0
1,entire_home_apt,38.71241,-9.12706,3,1.0,2.0,14,2,215,4.82,...,True,False,False,False,1.528748,1,0.0,1.0,0.0,1.0
2,entire_home_apt,38.71156,-9.12987,4,1.0,1.0,1125,2,407,4.77,...,True,False,False,False,1.447601,1,0.0,0.0,0.0,3.0
3,entire_home_apt,38.71108,-9.15979,16,9.0,2.0,1125,2,140,4.95,...,True,False,False,False,2.17181,4,4.0,1.0,0.0,3.0
4,entire_home_apt,38.74606,-9.15358,3,1.0,2.0,365,6,61,4.57,...,True,False,False,False,2.917929,1,0.0,1.0,1.0,3.0


In [67]:
selected_porto['number_of_reviews'].agg(['min', 'max'])

min       0
max    1657
Name: number_of_reviews, dtype: int64

In [68]:
# Drop rows where review_scores_rating is missing
selected_porto = selected_porto.dropna(subset=['number_of_reviews'])
selected_porto['review_number'] = pd.cut(selected_porto['number_of_reviews'], bins=[0, 100, 200, 300 , 400, 500, 600, 2612], labels=False)
selected_porto.head()

Unnamed: 0,room_type,latitude,longitude,accommodates,bedrooms,host_total_listings_count,maximum_nights,minimum_nights,number_of_reviews,review_scores_rating,...,room_hotel_room,room_private_room,room_shared_room,distance_to_center,accommodates_bin,bedrooms_bin,host_scale,min_nights_bin,max_nights_bin,review_number
0,entire_home_apt,38.6975,-9.19768,3,1.0,2.0,365,4,77,4.52,...,False,False,False,5.767489,1,0.0,1.0,1.0,3.0,0.0
1,entire_home_apt,38.71241,-9.12706,3,1.0,2.0,14,2,215,4.82,...,False,False,False,1.528748,1,0.0,1.0,0.0,1.0,2.0
2,entire_home_apt,38.71156,-9.12987,4,1.0,1.0,1125,2,407,4.77,...,False,False,False,1.447601,1,0.0,0.0,0.0,3.0,4.0
3,entire_home_apt,38.71108,-9.15979,16,9.0,2.0,1125,2,140,4.95,...,False,False,False,2.17181,4,4.0,1.0,0.0,3.0,1.0
4,entire_home_apt,38.74606,-9.15358,3,1.0,2.0,365,6,61,4.57,...,False,False,False,2.917929,1,0.0,1.0,1.0,3.0,0.0


OLS

In [69]:
# Features and Target
X = selected_porto.drop('price', axis=1)
y = selected_porto['price']

# Keep only numeric variables
X = X.select_dtypes(include=[np.number])

# Filter to Drop NaNs for Required Columns
required_cols = [
    'distance_to_center', 'review_scores_rating', 'bedrooms_bin',
    'accommodates_bin', 'min_nights_bin', 'max_nights_bin', 'host_scale', 'review_number'
]

valid_index = X[required_cols].dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Add constant for OLS intercept
X = sm.add_constant(X)

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define Variables for Each Model
model1_vars = ['const', 'distance_to_center', 'review_scores_rating', 'bedrooms_bin']
model2_vars = model1_vars + ['accommodates_bin', 'max_nights_bin', 'min_nights_bin']
model3_vars = model2_vars + ['host_scale', 'review_number']

# Step 6: Fit the Models
model1 = sm.OLS(y_train, X_train[model1_vars]).fit()
model2 = sm.OLS(y_train, X_train[model2_vars]).fit()
model3 = sm.OLS(y_train, X_train[model3_vars]).fit()

# Display Model Comparison Table
summary = summary_col(
    results=[model1, model2, model3],
    float_format='%0.3f',
    stars=True,
    model_names=['Model 1', 'Model 2', 'Model 3'],
    info_dict={
        'R-squared': lambda x: f"{x.rsquared:.3f}",
        'Adj. R-squared': lambda x: f"{x.rsquared_adj:.3f}",
        'BIC': lambda x: f"{x.bic:.0f}",
        'Observations': lambda x: f"{int(x.nobs)}"
    }
)

print(summary)



                      Model 1   Model 2   Model 3  
---------------------------------------------------
const                -59.473*  -86.461** -99.087***
                     (33.599)  (35.942)  (37.082)  
distance_to_center   0.253     0.177     0.055     
                     (0.211)   (0.211)   (0.219)   
review_scores_rating 28.107*** 27.795*** 31.564*** 
                     (7.221)   (7.257)   (7.399)   
bedrooms_bin         63.801*** 38.131*** 37.703*** 
                     (3.039)   (5.123)   (5.125)   
accommodates_bin               29.753*** 30.380*** 
                               (4.895)   (4.899)   
max_nights_bin                 5.425     4.102     
                               (3.789)   (3.828)   
min_nights_bin                 12.176**  10.751**  
                               (5.310)   (5.351)   
host_scale                               2.214     
                                         (2.672)   
review_number                            -8.978*** 
           

LASSO

In [71]:
# Drop rows with NaNs
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


In [72]:

# Build LASSO Pipeline
lasso_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lasso', LassoCV(cv=5, random_state=42))
])

# Fit LASSO
lasso_pipeline.fit(X_train, y_train)

# Get Best Alpha and Coefficients
best_alpha = lasso_pipeline.named_steps['lasso'].alpha_
lasso_coefficients = lasso_pipeline.named_steps['lasso'].coef_

print(f"Best alpha (lambda): {best_alpha:.4f}")
print("\nLASSO Coefficients:")
for feature, coef in zip(X.columns, lasso_coefficients):
    print(f"{feature}: {coef:.4f}")

# Evaluate Model
y_pred = lasso_pipeline.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"\nTest R²: {r2:.4f}")
print(f"Test RMSE: {rmse:.2f}")

Best alpha (lambda): 28.1307

LASSO Coefficients:
const: 0.0000
latitude: 0.0000
longitude: -0.0000
accommodates: 28.4698
bedrooms: 15.5752
host_total_listings_count: -0.0000
maximum_nights: 0.0000
minimum_nights: 18.4488
number_of_reviews: -0.0000
review_scores_rating: 0.0000
distance_to_center: 0.0000
accommodates_bin: 0.0000
bedrooms_bin: 0.0000
host_scale: -0.0000
min_nights_bin: -0.0000
max_nights_bin: 0.0000
review_number: -0.0000

Test R²: 0.2285
Test RMSE: 107.16


random forest

In [74]:
# Select Relevant Features
selected_features = [
    'distance_to_center', 'review_scores_rating', 'bedrooms_bin',
    'accommodates_bin', 'min_nights_bin', 'max_nights_bin',
    'host_scale', 'review_number'
]

X = selected_dec[selected_features]
y = selected_dec['price']

# Drop NaNs and Split Data
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Random Forest
rf_model = RandomForestRegressor(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = rf_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Random Forest Test R²: {r2:.4f}")
print(f"Random Forest Test RMSE: {rmse:.2f}")

# Show Feature Importances
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))


Random Forest Test R²: -1.1270
Random Forest Test RMSE: 177.94

Top Features by Importance:
                Feature  Importance
0    distance_to_center    0.696387
1  review_scores_rating    0.076388
6            host_scale    0.053980
2          bedrooms_bin    0.046374
4        min_nights_bin    0.044719
3      accommodates_bin    0.044574
5        max_nights_bin    0.021421
7         review_number    0.016157


XGBOOST

In [75]:
# Keep only numeric columns and drop NaNs
X = X.select_dtypes(include='number')
valid_index = X.dropna().index.intersection(y.dropna().index)
X = X.loc[valid_index]
y = y.loc[valid_index]

# Train/Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and Fit XGBoost Regressor
xgb_model = xgb.XGBRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    random_state=42,
    verbosity=0
)

xgb_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = xgb_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"XGBoost Test R²: {r2:.4f}")
print(f"XGBoost Test RMSE: {rmse:.2f}")

# Feature Importance
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': xgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))


XGBoost Test R²: -0.3001
XGBoost Test RMSE: 139.11

Top Features by Importance:
                Feature  Importance
4        min_nights_bin    0.222010
0    distance_to_center    0.205649
2          bedrooms_bin    0.158801
5        max_nights_bin    0.122269
1  review_scores_rating    0.091055
6            host_scale    0.080590
7         review_number    0.067833
3      accommodates_bin    0.051792


LIGHT BGM

In [76]:
# Fit LightGBM Regressor
lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
lgb_model.fit(X_train, y_train)

# Predict and Evaluate
y_pred = lgb_model.predict(X_test)
r2 = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"LightGBM Test R²: {r2:.4f}")
print(f"LightGBM Test RMSE: {rmse:.2f}")

# Feature Importance
importance_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': lgb_model.feature_importances_
}).sort_values(by='Importance', ascending=False)

print("\nTop Features by Importance:")
print(importance_df.head(10))


[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000800 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 422
[LightGBM] [Info] Number of data points in the train set: 14312, number of used features: 8
[LightGBM] [Info] Start training from score 119.604458
LightGBM Test R²: -0.1506
LightGBM Test RMSE: 130.87

Top Features by Importance:
                Feature  Importance
0    distance_to_center        1361
1  review_scores_rating         623
2          bedrooms_bin         239
4        min_nights_bin         233
6            host_scale         195
3      accommodates_bin         171
7         review_number          94
5        max_nights_bin          84


Final Analysis and Conclusions

This study developed predictive pricing models for Airbnb listings across Lisbon and Porto using five distinct machine learning approaches. The analysis yielded several key findings regarding model performance and feature importance.

Linear regression models, particularly LASSO regression, demonstrated superior predictive accuracy with R² values of approximately 0.26, indicating these methods effectively captured fundamental pricing relationships. In contrast, tree-based ensemble methods including Random Forest and gradient boosting algorithms exhibited poorer performance, likely due to overfitting issues arising from the feature engineering approach.

Geospatial features emerged as the most significant predictors, with distance_to_center consistently ranking as the most important variable across all models. This finding strongly supports the well-established real estate principle of location value. Secondary predictive factors included bedroom count and guest review scores, though their relative importance varied between the two cities.

The temporal validation using Lisbon's December dataset confirmed model stability over time, while the geographic validation in Porto revealed modest performance degradation (LASSO R² decreasing to 0.22). This suggests that while core pricing factors remain consistent, regional market differences warrant consideration in operational deployment.