
# Airbnb Prices — Outlier Analysis

This notebook performs a step-by-step outlier analysis on the **Airbnb Prices in European Cities** dataset.

> ✅ Installs **scikit-learn** (the correct package name) rather than the deprecated `sklearn` package on PyPI. The import name in code is still `sklearn`.


## 0. Environment setup

In [None]:

# Correct dependency installation (run once if needed)
# !pip install numpy pandas matplotlib scipy scikit-learn


## 1. Imports

In [None]:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from scipy import stats

# scikit-learn (installed as `scikit-learn`, imported as `sklearn`)
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

plt.rcParams['figure.figsize'] = (10, 5)


## 2. Load data

In [None]:

# Update path if needed
DATA_PATH = 'airbnb_europe_prices.csv'

df = pd.read_csv(DATA_PATH)
df.head()


## 3. Dataset overview

In [None]:

df.shape, df.isna().sum().sort_values(ascending=False).head(10)


## 4. Summary statistics

In [None]:

df.describe().T


## 5. Visual inspection of price distribution

In [None]:

df['price_total'].hist(bins=80)
plt.title('price_total distribution')
plt.xlabel('price_total')
plt.ylabel('frequency')
plt.show()

# Drop missing values first; NaNs would otherwise leave the boxplot empty
plt.boxplot(df['price_total'].dropna(), vert=False)
plt.title('price_total boxplot')
plt.show()


## 6. IQR (Tukey) outlier detection

In [None]:

Q1 = df['price_total'].quantile(0.25)
Q3 = df['price_total'].quantile(0.75)
IQR = Q3 - Q1

upper_bound = Q3 + 1.5 * IQR
lower_bound = Q1 - 1.5 * IQR

df['outlier_iqr'] = (df['price_total'] > upper_bound) | (df['price_total'] < lower_bound)

df['outlier_iqr'].value_counts()
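
The fence computation above can be wrapped in a small helper so the same rule is reusable on other columns. This is a minimal sketch: the function name `tukey_fences` and the toy prices are illustrative assumptions, not part of the dataset or the original notebook.

```python
from typing import Tuple

import pandas as pd


def tukey_fences(values: pd.Series, k: float = 1.5) -> Tuple[float, float]:
    """Return the (lower, upper) Tukey fences for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy prices, made up for illustration (not taken from the dataset)
toy = pd.Series([80, 95, 100, 110, 120, 130, 150, 2000], dtype=float)
lower, upper = tukey_fences(toy)

# Same flagging rule as in the cell above
flags = (toy < lower) | (toy > upper)
print(lower, upper, int(flags.sum()))
```

A larger `k` (for example 3.0) flags only the more extreme values, which may suit a heavily skewed column better than the default 1.5.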


## 7. Extreme values inspection

In [None]:

df.loc[df['outlier_iqr']].sort_values('price_total', ascending=False).head(10)


## 8. Log transformation (recommended)

In [None]:

df['log_price'] = np.log1p(df['price_total'])

df['log_price'].hist(bins=80)
plt.title('log(price_total + 1)')
plt.xlabel('log_price')
plt.show()
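
Besides the histogram, the effect of the transformation can be quantified with the sample skewness. The sketch below uses `scipy.stats.skew` on synthetic lognormal data that stands in for `price_total`; the distribution parameters are assumptions chosen only to produce a right-skewed sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic right-skewed "prices" (lognormal), standing in for price_total
prices = rng.lognormal(mean=5.0, sigma=0.8, size=5_000)

skew_raw = stats.skew(prices)            # strongly positive for a long right tail
skew_log = stats.skew(np.log1p(prices))  # close to 0 if the log scale is roughly symmetric

print(f"skew raw: {skew_raw:.2f}, skew log1p: {skew_log:.2f}")
```

On the real data, the same two calls on `df['price_total'].dropna()` and `df['log_price'].dropna()` would give the corresponding numbers.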


## 9. Isolation Forest (multivariate outliers)

In [None]:

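features = ['price_total', 'max_guests', 'num_bedrooms', 'distance_city_center']
# Impute missing values with column medians; filling with 0 would create
# artificial extremes (e.g. a price of 0) that the forest could flag as outliers.
X_iso = df[features].fillna(df[features].median())

# contamination=0.01 assumes roughly 1% of listings are outliers; tune as needed
iso = IsolationForest(contamination=0.01, random_state=42)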
df['outlier_iso'] = iso.fit_predict(X_iso) == -1

df['outlier_iso'].value_counts()
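
The boolean flag hides how anomalous each row is. As a sketch, `decision_function` exposes the underlying score, which makes it possible to rank rows instead of only thresholding them. The data here is synthetic (1000 ordinary points plus one deliberately extreme row), and the `_demo` names are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic stand-in: 1000 ordinary points plus one obviously extreme row
X_demo = np.vstack([rng.normal(0, 1, size=(1000, 2)), [[50.0, 50.0]]])

iso_demo = IsolationForest(contamination=0.01, random_state=42).fit(X_demo)
scores = iso_demo.decision_function(X_demo)  # lower = more anomalous
flags = iso_demo.predict(X_demo) == -1       # rows with a negative score

# Row indices ordered from most to least anomalous
ranking = np.argsort(scores)
print(int(flags.sum()), ranking[:5])
```

Sorting by score is useful when reviewing candidates by hand, since the `contamination` threshold is a modelling choice rather than a property of the data.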


## 10. Simple model comparison (raw vs log price)

In [None]:

features = ['max_guests', 'num_bedrooms', 'distance_city_center']
X = df[features].fillna(0)
y = df['price_total']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

pred_raw = model.predict(X_test)
print('MAE (raw price):', mean_absolute_error(y_test, pred_raw))


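# Log-price model
# Reusing random_state=42 reproduces the same train/test rows,
# so the two MAE values are directly comparable.
y_log = np.log1p(y)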

X_train, X_test, y_train, y_test = train_test_split(
    X, y_log, test_size=0.25, random_state=42
)

model.fit(X_train, y_train)
pred_log = np.expm1(model.predict(X_test))

print('MAE (log price, back-transformed):',
      mean_absolute_error(np.expm1(y_test), pred_log))
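
The manual `log1p` / `expm1` round trip can also be delegated to scikit-learn's `TransformedTargetRegressor`, which keeps the transform and its inverse attached to the model. The sketch below checks on synthetic data that both routes give the same predictions; the data-generating process and the `_demo` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: one feature, multiplicative noise on the target
X_demo = rng.uniform(1, 6, size=(500, 1))
y_demo = np.exp(3.0 + 0.4 * X_demo[:, 0] + rng.normal(0, 0.3, size=500))

# Manual approach: fit on log1p(y), invert predictions with expm1
manual = LinearRegression().fit(X_demo, np.log1p(y_demo))
pred_manual = np.expm1(manual.predict(X_demo))

# Wrapper approach: the transform and its inverse travel with the model
wrapped = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
wrapped.fit(X_demo, y_demo)
pred_wrapped = wrapped.predict(X_demo)

print(np.allclose(pred_manual, pred_wrapped))
```

With the wrapper, predictions come back on the original price scale, so `mean_absolute_error` can be applied without a separate back-transformation step.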



## 11. Conclusions

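- `price_total` is **heavily right-skewed**: a small number of very expensive listings form a long upper tail.
- Flagged outliers can be **real observations** rather than data errors, so inspect them before dropping anything.
- The `log1p` transformation compresses the upper tail and makes the distribution much less skewed.
- When modeling raw prices, tree-based or robust models are generally less sensitive to extreme values than ordinary least squares.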
- Always document whether outliers are removed, capped, or transformed.
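
The three treatments named in the last point can be sketched side by side. The toy prices below are made up for illustration, and the upper Tukey fence is only one possible threshold; which option is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Toy prices, made up for illustration (not taken from the dataset)
prices = pd.Series([80, 95, 100, 110, 120, 130, 150, 2000], dtype=float)

q1, q3 = prices.quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)  # upper Tukey fence

removed = prices[prices <= upper]     # option 1: drop rows above the fence
capped = prices.clip(upper=upper)     # option 2: cap (winsorize) at the fence
logged = np.log1p(prices)             # option 3: keep every row, compress the scale

print(len(removed), capped.max(), round(logged.max(), 2))
```

Removal changes the sample size, capping keeps every row but alters values, and the log transform keeps both rows and ordering, which is why the choice should be recorded alongside the results.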
