In [1]:
import pandas as pd
df_clean = pd.read_csv('../data/airbnb_cleaned.csv')


In [2]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Step 1: Create binary target column (1 = has TV, 0 = no TV)
df_class = df_clean.copy()
df_class['has_tv'] = df_class['amenities'].apply(lambda x: 1 if isinstance(x, str) and 'TV' in x else 0)

# Step 2: Select numeric features and drop missing rows
features = ['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews', 'review_scores_rating']
df_class = df_class[features + ['has_tv']].dropna()

# Step 3: Define X and y
X = df_class[features]
y = df_class['has_tv']

# Step 4: Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Step 5: Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 6: Fit k-NN model
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_scaled, y_train)

# Step 7: Evaluate model
y_pred = knn.predict(X_test_scaled)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.6324503311258278
Confusion Matrix:
 [[ 45 126]
 [ 96 337]]
Classification Report:
               precision    recall  f1-score   support

           0       0.32      0.26      0.29       171
           1       0.73      0.78      0.75       433

    accuracy                           0.63       604
   macro avg       0.52      0.52      0.52       604
weighted avg       0.61      0.63      0.62       604



In [3]:
from sklearn.utils import resample

# Separate classes
df_tv = df_class[df_class['has_tv'] == 1]
df_no_tv = df_class[df_class['has_tv'] == 0]

# Downsample class 1 (TV listings) to match class 0
df_tv_down = resample(df_tv, replace=False, n_samples=len(df_no_tv), random_state=42)

# Combine and shuffle
df_balanced = pd.concat([df_tv_down, df_no_tv]).sample(frac=1, random_state=42)


In [4]:
# Redefine features and target
X_bal = df_balanced[features]
y_bal = df_balanced['has_tv']

# Split
X_train, X_test, y_train, y_test = train_test_split(X_bal, y_bal, test_size=0.3, random_state=42)

# Scale
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train k-NN
knn.fit(X_train_scaled, y_train)
y_pred = knn.predict(X_test_scaled)

# Evaluate
print("Accuracy (balanced data):", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy (balanced data): 0.5619834710743802
Confusion Matrix:
 [[ 96  87]
 [ 72 108]]
Classification Report:
               precision    recall  f1-score   support

           0       0.57      0.52      0.55       183
           1       0.55      0.60      0.58       180

    accuracy                           0.56       363
   macro avg       0.56      0.56      0.56       363
weighted avg       0.56      0.56      0.56       363



### Improving k-NN Classification Through Class Balancing

Our initial k-NN classification model predicted whether a listing included a TV as an amenity. It achieved 63% accuracy but was heavily biased toward predicting listings with TVs because of class imbalance: 433 of the 604 test listings had a TV, so always predicting "TV" would have scored about 72%, and recall for listings without a TV was only 0.26. To address this, we applied **undersampling** to the majority class (listings with TVs) to create a balanced dataset for model training.
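The figures behind this diagnosis can be recomputed from the first confusion matrix with plain arithmetic. The sketch below uses only the numbers printed in the output above. The variable names are illustrative, and "balanced accuracy" here means the mean of the per-class recalls.

```python
# Confusion matrix from the imbalanced k-NN run
# (rows = true class, columns = predicted class; class 1 = has TV).
cm = [[45, 126],
      [96, 337]]

total = sum(sum(row) for row in cm)
support = [sum(row) for row in cm]           # true listings per class

accuracy = (cm[0][0] + cm[1][1]) / total

# Baseline: always predict the majority class.
majority_baseline = max(support) / total

# Recall per class = correct predictions / true members of that class.
recall = [cm[i][i] / support[i] for i in range(2)]

# Balanced accuracy = mean of the per-class recalls.
balanced_accuracy = sum(recall) / len(recall)

print(f"accuracy:          {accuracy:.3f}")           # 0.632
print(f"majority baseline: {majority_baseline:.3f}")  # 0.717
print(f"recall (no TV):    {recall[0]:.3f}")          # 0.263
print(f"recall (TV):       {recall[1]:.3f}")          # 0.778
print(f"balanced accuracy: {balanced_accuracy:.3f}")  # 0.521
```

Plain accuracy looks acceptable only because most listings have a TV. Once each class is weighted equally, the score is close to 0.5.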

After retraining the k-NN classifier on the balanced data, performance was more even across the two classes, with an accuracy of **56.2%** on the balanced test set. Precision and recall were similar for both classes (between 0.52 and 0.60), and f1-scores were **0.55–0.58**, showing that the model was no longer biased toward the majority class. Overall accuracy dropped from 63% to 56%, but recall for listings without a TV doubled from 0.26 to 0.52, which is critical when both outcomes matter. This trade-off highlights the importance of addressing class imbalance in classification tasks. Because the data were balanced before the train-test split, the two test sets have different class mixes, so the two accuracy figures are not directly comparable.
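One alternative to the workflow above is to split first and undersample only the training rows, so the test set keeps the original class mix. The sketch below shows that ordering on synthetic data from `make_classification`. The dataset, its 30/70 class weights, and all variable names are assumptions for illustration, not the Airbnb data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# Synthetic stand-in data: roughly 30% class 0, 70% class 1 (assumed proportions).
X, y = make_classification(n_samples=2000, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.3, 0.7], random_state=42)

# Split first, keeping the original class mix in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# Undersample the majority class in the training rows only.
minority_idx = np.flatnonzero(y_train == 0)
majority_idx = np.flatnonzero(y_train == 1)
majority_down = resample(majority_idx, replace=False,
                         n_samples=len(minority_idx), random_state=42)
train_idx = np.concatenate([minority_idx, majority_down])

# Fit the scaler and the model on the balanced training rows.
scaler = StandardScaler().fit(X_train[train_idx])
knn_demo = KNeighborsClassifier(n_neighbors=5)
knn_demo.fit(scaler.transform(X_train[train_idx]), y_train[train_idx])

# Evaluate on the untouched, still imbalanced test set.
y_pred_demo = knn_demo.predict(scaler.transform(X_test))
print("Accuracy:", accuracy_score(y_test, y_pred_demo))
print("Balanced accuracy:", balanced_accuracy_score(y_test, y_pred_demo))
```

Reporting `balanced_accuracy_score` alongside plain accuracy keeps results comparable between runs whose training sets were balanced differently.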


In [6]:
from sklearn.preprocessing import KBinsDiscretizer
import pandas as pd
df_clean = pd.read_csv('../data/airbnb_cleaned.csv')

df_nb = df_clean.copy()
df_nb = df_nb.dropna(subset=['review_scores_value'])

# Equal frequency binning into 3 bins: low, medium, high
binning = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
df_nb['value_bin'] = binning.fit_transform(df_nb[['review_scores_value']]).astype(int)

# Optional: label categories for readability (not required for modeling)
label_map = {0: 'Low', 1: 'Medium', 2: 'High'}
df_nb['value_label'] = df_nb['value_bin'].map(label_map)




In [7]:

features = ['accommodates', 'bedrooms', 'bathrooms', 'minimum_nights', 
            'number_of_reviews', 'price', 'availability_365']

df_nb = df_nb[features + ['value_bin']].dropna()


In [8]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix

# Define X and y
X = df_nb[features]
y = df_nb['value_bin']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Naive Bayes
nb = GaussianNB()
nb.fit(X_train, y_train)

# Predict
y_pred = nb.predict(X_test)

# Evaluate
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.3610648918469218
Confusion Matrix:
 [[104  87   3]
 [ 81 113   6]
 [ 98 109   0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.37      0.54      0.44       194
           1       0.37      0.56      0.44       200
           2       0.00      0.00      0.00       207

    accuracy                           0.36       601
   macro avg       0.24      0.37      0.29       601
weighted avg       0.24      0.36      0.29       601



### Naive Bayes Classification — Predicting Perceived Value

To understand how property features influence perceived value for money, we built a Naive Bayes model to classify listings based on their `review_scores_value`. This variable was binned using equal-frequency binning into three levels: **low**, **medium**, and **high** value. We intentionally excluded all other review score variables so that closely related ratings could not stand in for the target, and focused on structural features like **accommodates, bedrooms, bathrooms, minimum nights, number of reviews, price,** and **availability**.
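As a minimal illustration of equal-frequency binning, the sketch below sorts a handful of made-up scores and cuts the ranking into three equally sized groups. The helper `equal_frequency_bins` and the sample scores are hypothetical. This rank-based version also differs from edge-based binning such as `KBinsDiscretizer(strategy='quantile')`: that approach places cut points at quantiles, so tied values share a bin and bin sizes can differ when many scores are identical.

```python
def equal_frequency_bins(values, n_bins=3):
    """Assign each value a bin index so bins hold (nearly) equal counts.

    Values are ranked from lowest to highest and the ranking is cut into
    n_bins consecutive groups. Ties are broken by original position.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


# Made-up review scores, for illustration only.
scores = [4.2, 4.9, 4.5, 5.0, 4.7, 4.8, 3.9, 4.6, 4.4]
bins = equal_frequency_bins(scores)

labels = {0: 'Low', 1: 'Medium', 2: 'High'}
for score, b in zip(scores, bins):
    print(score, labels[b])
```

With nine scores and three bins, each bin receives exactly three listings: the three lowest scores are labeled Low and the three highest High.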

The model achieved an overall accuracy of **36%**, only slightly above the roughly 33% expected from guessing among three equally sized classes. Performance was relatively better for the *low* and *medium* value categories (f1-score of 0.44 each), but the model failed to capture the *high value* class: it predicted that class for only 9 of the 601 test listings, and none of those predictions were correct. This may be due to overlapping feature distributions or the strong independence assumption of the Gaussian Naive Bayes algorithm. While the performance is limited, the model offers insight into how much structural attributes alone can (or can’t) explain perceived value. In future iterations, we could explore alternative models like Decision Trees or Random Forests, which are better suited to non-linear relationships and feature interactions.
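The per-class behaviour can be read directly off the confusion matrix printed above. The sketch below recomputes how often each class was predicted, along with precision and recall, using only those printed numbers. The variable names are illustrative.

```python
# Confusion matrix from the Gaussian Naive Bayes run
# (rows = true bin, columns = predicted bin; 0 = low, 1 = medium, 2 = high).
cm = [[104,  87, 3],
      [ 81, 113, 6],
      [ 98, 109, 0]]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

support = [sum(row) for row in cm]                      # true count per bin
predicted_counts = [sum(cm[r][c] for r in range(n_classes))
                    for c in range(n_classes)]          # predictions per bin

accuracy = sum(cm[i][i] for i in range(n_classes)) / total
recall = [cm[i][i] / support[i] for i in range(n_classes)]
precision = [cm[i][i] / predicted_counts[i] if predicted_counts[i] else 0.0
             for i in range(n_classes)]

print("predictions per bin:", predicted_counts)  # [283, 309, 9]
print(f"accuracy: {accuracy:.3f}")               # 0.361
for i in range(n_classes):
    print(f"bin {i}: precision={precision[i]:.2f} recall={recall[i]:.2f}")
```

The column sums show that the classifier almost never predicts the high bin, even though that bin holds about a third of the test listings.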


In [9]:
# Fictional listing input
fictional_input = pd.DataFrame([{
    'accommodates': 2,
    'bedrooms': 1,
    'bathrooms': 1.0,
    'minimum_nights': 2,
    'number_of_reviews': 38,
    'price': 120,
    'availability_365': 250
}])

# Predict using your trained Naive Bayes model
predicted_bin = nb.predict(fictional_input)[0]

# Optional: map to label
bin_labels = {0: 'Low', 1: 'Medium', 2: 'High'}
print("Predicted value bin:", bin_labels.get(predicted_bin, "Unknown"))


Predicted value bin: Low


### Fictional Scenario Prediction

To demonstrate our model’s practical application, we created a fictional Geneva rental called the **“Alpine Minimalist Studio.”** It’s a compact, modern space designed for 2 guests, featuring 1 bedroom, 1 bathroom, and a nightly rate of CHF 120. The listing requires a 2-night minimum stay, has 38 reviews, and is available 250 days a year, reflecting a reasonably active rental profile.

Using our trained Naive Bayes classifier, this listing was predicted to fall into the **low value bin**. This suggests that, based solely on its structural and pricing features, the model expects guests to feel the apartment offers relatively **less value for money**. This could be due to factors like its smaller size or its price point relative to others in the dataset. Given the model's 36% test accuracy, the prediction should be read as an illustration rather than a reliable estimate. Even so, it helps show how listing features may shape value perception before any reviews come into play.
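A hard label hides how confident the classifier is, so it can help to inspect the class probabilities as well. The sketch below shows `predict_proba` on a `GaussianNB` fitted to randomly generated stand-in data. The feature subset, the random labels, and every name ending in `_demo` are assumptions for illustration, so the printed probabilities say nothing about the real listing.

```python
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Randomly generated stand-in data (NOT the Airbnb dataset).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    'accommodates': rng.integers(1, 7, size=300),
    'bedrooms': rng.integers(1, 4, size=300),
    'price': rng.uniform(50, 400, size=300),
})
y_demo = rng.integers(0, 3, size=300)   # random bins 0, 1, 2

nb_demo = GaussianNB().fit(X_demo, y_demo)

# One hypothetical listing with the same columns, in the same order.
listing_demo = pd.DataFrame([{'accommodates': 2, 'bedrooms': 1, 'price': 120}])

# predict_proba returns one row per listing and one column per class,
# ordered as in nb_demo.classes_.
proba = nb_demo.predict_proba(listing_demo)[0]

bin_labels = {0: 'Low', 1: 'Medium', 2: 'High'}
for cls, p in zip(nb_demo.classes_, proba):
    print(f"{bin_labels[int(cls)]}: {p:.2f}")
```

Applied to the trained `nb` model, the same call would show whether the "Low" prediction was a clear preference or a near tie. Naive Bayes probabilities are often poorly calibrated, so they are best read as a ranking rather than as exact likelihoods.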


In [10]:
from sklearn.preprocessing import KBinsDiscretizer

# Drop missing review_scores_value rows
df_nb = df_clean.copy()
df_nb = df_nb.dropna(subset=['review_scores_value'])

# Equal-frequency binning into 3 bins: 0 = low, 1 = medium, 2 = high
binning = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile')
df_nb['value_bin'] = binning.fit_transform(df_nb[['review_scores_value']]).astype(int)

# Select features (excluding all review_score variables)
features = ['accommodates', 'bedrooms', 'bathrooms', 'minimum_nights', 
            'number_of_reviews', 'price', 'availability_365']

df_nb = df_nb[features + ['value_bin']].dropna()




In [11]:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Define X and y
X = df_nb[features]
y = df_nb['value_bin']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train Gaussian Naive Bayes model
nb = GaussianNB()
nb.fit(X_train, y_train)


GaussianNB()


In [12]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Predict on test set
y_pred = nb.predict(X_test)

# Evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.3610648918469218
Confusion Matrix:
 [[104  87   3]
 [ 81 113   6]
 [ 98 109   0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.37      0.54      0.44       194
           1       0.37      0.56      0.44       200
           2       0.00      0.00      0.00       207

    accuracy                           0.36       601
   macro avg       0.24      0.37      0.29       601
weighted avg       0.24      0.36      0.29       601



### Naive Bayes Model Summary and Results

We trained a Gaussian Naive Bayes classifier to predict how consumers feel about the **value they receive** from a rental, using binned values of the `review_scores_value` column. After equal-frequency binning into three categories (low, medium, high), we trained the model using structural features like `price`, `bedrooms`, `bathrooms`, and availability. All other review score variables were excluded from the predictors.

Below are screenshots of the code used to:
- Build the model (feature selection + binning)
- Train the Gaussian Naive Bayes algorithm
- Evaluate model performance

#### Model Evaluation Results:

- **Accuracy:** 36.1%
- **Confusion Matrix:**

    ```
    [[104  87   3]
     [ 81 113   6]
     [ 98 109   0]]
    ```

- **Classification Report:**

    ```
                   precision    recall  f1-score   support

               0       0.37      0.54      0.44       194
               1       0.37      0.56      0.44       200
               2       0.00      0.00      0.00       207

        accuracy                           0.36       601
       macro avg       0.24      0.37      0.29       601
    weighted avg       0.24      0.36      0.
