In [2]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

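# Set plot style and default figure size
sns.set(style="whitegrid")
plt.rcParams["figure.figsize"] = (12, 6)  # set the default size without creating an empty figure

# Load the cleaned data and strip currency symbols and thousands separators from 'price'
df_clean = pd.read_csv('../data/airbnb_cleaned.csv')
df_clean['price'] = df_clean['price'].replace(r'[\$,]', '', regex=True).astype(float)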

In [3]:
df_model = df_clean.copy()

# 'price' was already converted to float at load time; this repeat is a defensive no-op
df_model['price'] = df_model['price'].replace(r'[\$,]', '', regex=True).astype(float)

# Keep only positive prices
df_model = df_model[df_model['price'] > 0]

# Keep selected features
features = [
    'accommodates', 'bedrooms', 'bathrooms', 'room_type',
    'number_of_reviews', 'review_scores_rating', 'neighbourhood_cleansed'
]
df_model = df_model[features + ['price']].dropna()


In [4]:
import pandas as pd
import statsmodels.api as sm

# One-hot encode categorical columns
df_dummies = pd.get_dummies(
    df_model,
    columns=['room_type', 'neighbourhood_cleansed'],
    drop_first=True
)

# Define features (X) and target (y)
X_sm = df_dummies.drop('price', axis=1).astype(float)
y_sm = df_dummies['price'].astype(float)

# Add constant term for intercept
X_sm = sm.add_constant(X_sm)

# Fit the model
model = sm.OLS(y_sm, X_sm).fit()

# Show summary
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:                  price   R-squared:                       0.051
Model:                            OLS   Adj. R-squared:                  0.030
Method:                 Least Squares   F-statistic:                     2.409
Date:                Sun, 10 Aug 2025   Prob (F-statistic):           8.27e-07
Time:                        16:58:10   Log-Likelihood:                -14091.
No. Observations:                2011   AIC:                         2.827e+04
Df Residuals:                    1966   BIC:                         2.852e+04
Df Model:                          44                                         
Covariance Type:            nonrobust                                         
                                                coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------

### Multiple Linear Regression on Price

We built a multiple linear regression model to predict Airbnb listing prices in Geneva, measured in Swiss francs (CHF). The model included both numerical and categorical variables: **accommodates, bedrooms, bathrooms, number of reviews, review scores rating, room type,** and **neighbourhood_cleansed**. The two categorical features were one-hot encoded with the first level of each dropped, so every dummy coefficient is read relative to that omitted reference category.

The results pointed to several patterns. As expected, listings with more **bedrooms**, **bathrooms**, and higher **guest capacity** were associated with higher prices. **Entire homes/apartments** showed a strong positive effect compared to private or shared rooms. Some neighborhoods (such as **Cologny**, **Bellevue**, and **Chêne-Bougeries**) were associated with higher predicted prices, consistent with Geneva’s local housing market structure and demand for premium areas. **Review scores** showed only a modest influence. These patterns should be read with caution, however: with an R-squared of 0.051 (adjusted 0.030), the model explains only about 5% of the variation in price. It suggests that physical attributes and location matter for Airbnb pricing across the city, but a linear model on raw prices fits the data poorly.
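
To make the role of the reference category concrete, here is a minimal sketch of what `drop_first=True` does. It uses a made-up four-row table rather than the Geneva data, and the names `toy`, `encoded`, and `reference` are illustrative only.

```python
import pandas as pd

# Toy listings (invented values, not taken from the Airbnb dataset)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "price": [180.0, 90.0, 45.0, 110.0],
})

# One-hot encode and drop the first level so it becomes the baseline
encoded = pd.get_dummies(toy, columns=["room_type"], drop_first=True)

# The dropped level is the first category in sorted order
dummy_cols = [c for c in encoded.columns if c.startswith("room_type_")]
reference = sorted(toy["room_type"].unique())[0]

print("Dummy columns:", dummy_cols)
print("Reference category:", reference)
```

In this toy case the dropped level is `Entire home/apt`, so a coefficient on `room_type_Private room` would measure the price difference between a private room and an entire home, holding the other predictors fixed.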


In [5]:
import numpy as np
import statsmodels.api as sm

# Define log(price) as the new target variable
df_model['log_price'] = np.log(df_model['price'])

# Set up features (X) and target (y)
X = df_dummies.drop(columns=['price'])  # keep predictors same
y = df_model['log_price']               # use log(price)

# Add intercept
X_sm = sm.add_constant(X)

# Ensure all data is float
X_sm = X_sm.astype(float)
y = y.astype(float)

# Fit model
model = sm.OLS(y, X_sm).fit()

# Show results
print(model.summary())


                            OLS Regression Results                            
Dep. Variable:              log_price   R-squared:                       0.423
Model:                            OLS   Adj. R-squared:                  0.410
Method:                 Least Squares   F-statistic:                     32.76
Date:                Sun, 10 Aug 2025   Prob (F-statistic):          4.30e-200
Time:                        16:58:36   Log-Likelihood:                -1155.8
No. Observations:                2011   AIC:                             2402.
Df Residuals:                    1966   BIC:                             2654.
Df Model:                          44                                         
Covariance Type:            nonrobust                                         
                                                coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------

### Log-Price Regression Model & Interpretation

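To reduce the influence of price skewness, we log-transformed the price variable and refit our multiple regression model. The fit improved markedly: the **R-squared rose from 0.05 to 0.42** (adjusted R-squared from 0.03 to 0.41), meaning the model now explains about **42% of the variation in log-price** across Geneva listings. Because the two models have different dependent variables, their R-squared values are not strictly comparable, but the size of the gap indicates that the log specification describes the data far better.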
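
One way to compare the two specifications on a common footing is to back-transform the log-model predictions and compute R-squared on the original price scale. The sketch below illustrates the idea on synthetic data using NumPy only. The helper `r_squared`, the simulated `accommodates` and `price` arrays, and the coefficients used to generate them are all assumptions for illustration, not results from this notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Share of the variance in y_true accounted for by y_pred."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic, right-skewed prices (hypothetical data-generating process)
rng = np.random.default_rng(0)
n = 500
accommodates = rng.integers(1, 7, size=n).astype(float)
price = np.exp(4.0 + 0.2 * accommodates + rng.normal(0.0, 0.5, size=n))

# Design matrix with an intercept column
X = np.column_stack([np.ones(n), accommodates])

# Ordinary least squares on price and on log(price)
beta_lin, *_ = np.linalg.lstsq(X, price, rcond=None)
beta_log, *_ = np.linalg.lstsq(X, np.log(price), rcond=None)

r2_linear = r_squared(price, X @ beta_lin)                # price model, price scale
r2_log_scale = r_squared(np.log(price), X @ beta_log)     # log model, log scale
r2_log_on_price = r_squared(price, np.exp(X @ beta_log))  # log model, price scale

print(f"linear model R^2 (price scale): {r2_linear:.3f}")
print(f"log model R^2 (log scale):      {r2_log_scale:.3f}")
print(f"log model R^2 (price scale):    {r2_log_on_price:.3f}")
```

Only the first and third numbers measure fit against the same target. Note that exponentiating a predicted log-price tends to understate the mean price when the errors are roughly normal on the log scale, so the price-scale figure for the log model is a conservative comparison.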

Key predictors such as **accommodates, bedrooms,** and especially **bathrooms** were positively and significantly associated with log-price. For instance, listings with more bathrooms had substantially higher predicted prices, suggesting guests are willing to pay a premium for comfort and privacy. In contrast, **private and shared rooms** were strongly associated with lower prices compared to entire homes. Among neighborhoods, a few like **Presinge, Thônex,** and **Céligny** showed negative pricing effects relative to the reference neighborhood, though many areas did not individually stand out. Because the target is log-price, each coefficient can be read as an approximate percentage change in price, which makes effects easier to compare across listings at different price levels. Overall, this model gives a more interpretable and better-fitting view of Geneva's Airbnb market, capturing both physical and locational factors influencing pricing.
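
Coefficients from a log-price model translate into percentage effects through `100 * (exp(beta) - 1)`. The short sketch below applies that conversion. The coefficient values are hypothetical placeholders chosen for illustration (the fitted coefficient table is not reproduced above), and `pct_effect` is an illustrative helper name.

```python
import math

def pct_effect(beta):
    """Percent change in price for a one-unit increase in a predictor,
    given its coefficient in a regression on log(price)."""
    return 100.0 * (math.exp(beta) - 1.0)

# Hypothetical coefficients, NOT taken from the fitted model
hypothetical_coefs = {
    "bathrooms": 0.10,
    "room_type_Private room": -0.50,
}

effects = {name: pct_effect(beta) for name, beta in hypothetical_coefs.items()}
for name, pct in effects.items():
    print(f"{name}: {pct:+.1f}% change in price")
```

For small coefficients the percentage effect is close to `100 * beta`, but for larger ones, such as typical room-type dummies, the exact conversion matters.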
