# Assignment 2: Instructions

In this assignment, you will take on a prediction competition for Airbnb bookings. Here, as opposed to predicting prices as we have been doing so far, you will use a variety of information provided to you to __predict the number of days a given listing will be booked in the next 30 days__. 

You have been provided with real listings data from Los Angeles, but you only have the actual realized bookings for a small subset of the listings, which you can use as your training data. (The column `availability_30` represents current bookings, but due to cancellations and future bookings, it is only a very noisy proxy for actual bookings.) You can find the data dictionary [here](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit?gid=1322284596#gid=1322284596).

This assignment will be graded as a competition. We have a test set that only the grader has access to. In order to do well in this task, you will have to use everything you have learned in the class so far, including feature engineering and hyperparameter search via cross-validation.

## Write Up (8 pts)
    
You will need to turn in your code along with a short write-up, which you can include in your notebook. You will need to address the following components:

1. (3 pts) Explain how you constructed and / or preprocessed features to help with prediction, and why.

2. (3 pts) Explain what decisions you made using cross-validation, and how well you believe your final model will perform.

3. (2 pts) Explain what features you found to be important using the feature importance tools we discussed in class.

This write-up, along with your code, will be worth 8 points out of 15. These answers can be short: 1-3 sentences each + supporting tables or plots. 

## Performance (7 pts)
The remainder of your grade will be based on your predictive accuracy, as measured in terms of $R^2$. You will receive one point for each percentage point of test $R^2$ you achieve over 15%, rounded down, up to a maximum of 7. So if your test $R^2$ is 21.9%, you will receive 6 points. To receive full credit, you will need to achieve an $R^2$ of at least 22% on the test set.
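
As a worked example of the scoring rule above, the sketch below converts a test $R^2$ (given in percent) into performance points. The function name `performance_points` is hypothetical and only illustrates the arithmetic; it is not part of the grader.

```python
import math

def performance_points(test_r2_percent: float) -> int:
    """Points for a test R² given in percent (hypothetical helper)."""
    # One point per full percentage point above 15%, rounded down.
    raw = math.floor(test_r2_percent - 15)
    # Clamp to the allowed range of 0 to 7 points.
    return max(0, min(7, raw))

print(performance_points(21.9))  # 6, matching the example above
print(performance_points(22.0))  # 7, full credit
```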

In addition to this, there will be __5 points of extra credit__ available to each of the top 5 most accurate models across the whole class! You may use any method of your choice, even those that we have not covered, but be sure to explain it in your write-up. Also, to keep things well-scoped, __you may not pull in any datasets__ other than the one we are loading for you in the notebook (although this is a really good idea in practice!).

## Submission
You will need to submit two files: 

1.  Your predictions, in `.csv` format, which must have two columns: `id` and `prediction`
2.  Your code and write up, which should be provided together as an `.ipynb` notebook.
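
A minimal sketch of the expected predictions file, assuming a hypothetical `preds` DataFrame with made-up ids and values (your real ids come from `test_df`, and the values from your model):

```python
import io
import pandas as pd

# Hypothetical predictions; real ids come from test_df and values from your model.
preds = pd.DataFrame({"id": [101, 102, 103], "prediction": [4.2, 17.0, 9.5]})

# Write exactly two columns, without the index, as the submission requires.
buffer = io.StringIO()  # stands in for a file such as "predictions.csv"
preds[["id", "prediction"]].to_csv(buffer, index=False)

# Read the file back to confirm the format.
buffer.seek(0)
check = pd.read_csv(buffer)
assert list(check.columns) == ["id", "prediction"]
assert check["prediction"].notna().all()
```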

The provided notebook will get you started with loading data, and provide some checks to help you make sure your submission has the correct format. You can download `y_train.parquet` from canvas.

In [2]:
pip install pandas

Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install textblob

Note: you may need to restart the kernel to use updated packages.


In [1]:
import pandas as pd
import numpy as np

def basic_preprocess(df):
    # regex=False treats "$" as a literal character on every pandas version
    df["price"] = df["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)
    df = df.dropna(subset=["price"])
    return df

x_df = basic_preprocess(
    pd.read_csv(
        "https://data.insideairbnb.com/united-states/ca/los-angeles/2024-09-04/data/listings.csv.gz"
    )
)

# Grab this from canvas and save it in this directory
y_df = pd.read_parquet("y_train.parquet")

train_df = x_df.merge(y_df, on="id")

outer = x_df.merge(y_df, how='outer', indicator=True)
test_df = outer[(outer._merge=='left_only')].drop('_merge', axis=1)

In [2]:
train_df.shape

(3729, 76)

In [7]:
# Packages needed
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from textblob import TextBlob
import pandas as pd
from geopy.distance import geodesic
import ast

In [9]:
def feature_engineering(df):
    # 1. Sentiment Analysis on `description`
    df['description_sentiment'] = df['description'].apply(lambda x: TextBlob(str(x)).sentiment.polarity)

    # 2. Amenity Parsing: Create binary columns for popular amenities
    popular_amenities = ['Wifi', 'Pool', 'Washer', 'Dryer', 'Air conditioning', 
                         'Parking', 'Kitchen', 'Breakfast', 'Self check-in', 'Hot tub']
    for amenity in popular_amenities:
        df[f'has_{amenity.lower().replace(" ", "_")}'] = df['amenities'].apply(
            lambda x: 1 if amenity in ast.literal_eval(x) else 0
        )

    # 3. Host Tenure: Calculate the number of days since the host started
    df['host_since'] = pd.to_datetime(df['host_since'], errors='coerce')  # Handle parsing errors
    df['host_tenure_days'] = (pd.to_datetime('today') - df['host_since']).dt.days

    # 4. Proximity to Downtown LA: Compute distance of each listing to Downtown LA
    downtown_coords = (34.052235, -118.243683)  # Downtown LA coordinates
    df['distance_to_downtown'] = df.apply(
        lambda row: geodesic((row['latitude'], row['longitude']), downtown_coords).km, axis=1
    )

    # 5. Recency of Reviews: Calculate the number of days since the last review
    df['last_review'] = pd.to_datetime(df['last_review'], errors='coerce')
    df['days_since_last_review'] = (pd.to_datetime('today') - df['last_review']).dt.days

    # 6. Price per Guest: Calculate normalized price based on the number of guests accommodated
    df['price_per_guest'] = df['price'] / df['accommodates'].replace(0, 1)

    # 7. Binary Variables for Categorical Features: Convert categorical and binary columns to numerical format
    # Create dummy variables for `room_type` and `neighbourhood_group_cleansed`
    room_type_dummies = pd.get_dummies(df['room_type'], prefix='room_type', drop_first=True)
    neighbourhood_dummies = pd.get_dummies(df['neighbourhood_group_cleansed'], prefix='neighbourhood', drop_first=True)
    df = pd.concat([df, room_type_dummies, neighbourhood_dummies], axis=1)

    # Convert binary columns with True/False values to 1/0
    binary_columns = ['host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'host_is_superhost']
    for col in binary_columns:
        if col in df.columns:
            df[col] = df[col].map({'t': 1, 'f': 0})

    # 8. Discretization of Review Scores: Categorize review scores into bins (low, medium, high)
    review_columns = [
        'review_scores_rating', 'review_scores_cleanliness', 'review_scores_communication',
        'review_scores_accuracy', 'review_scores_location', 'review_scores_value',
        'review_scores_checkin'
    ]
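    # Intervals: [1, 4] low, (4, 7] medium, (7, 10] high; values outside 1-10 become NaN.
    # These edges assume a 1-10 score scale, so check the range of the review columns.
    bins = [1, 4, 7, 10]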
    labels = ['low', 'medium', 'high']
    for col in review_columns:
        df[f'{col}_bin'] = pd.cut(df[col], bins=bins, labels=labels, include_lowest=True)

    # 9. Availability Ratios: Compute short-term to longer-term availability ratios
    df['availability_30_to_60_ratio'] = df['availability_30'] / (df['availability_60'] + 1)
    df['availability_30_to_90_ratio'] = df['availability_30'] / (df['availability_90'] + 1)
    df['availability_30_to_365_ratio'] = df['availability_30'] / (df['availability_365'] + 1)

    # 10. Availability Changes: Calculate differences in availability between consecutive time periods
    df['availability_change_30_to_60'] = df['availability_60'] - df['availability_30']
    df['availability_change_30_to_90'] = df['availability_90'] - df['availability_30']
    df['availability_change_30_to_365'] = df['availability_365'] - df['availability_30']


    # 11. Handle Missing Values
    # Fill numeric missing values with the median
    df.fillna(df.median(numeric_only=True), inplace=True)

    # For categorical columns, add 'missing' as a category and fill missing values
    categorical_columns = df.select_dtypes(include=['category', 'object']).columns
    for col in categorical_columns:
        if df[col].dtype.name == 'category':
            # Add 'missing' as a category
            df[col] = df[col].cat.add_categories('missing')
        df[col] = df[col].fillna('missing')  # Assign back instead of chained inplace fill

    return df

# Apply Feature Engineering to Train and Test Datasets
train_df = feature_engineering(train_df)
test_df = feature_engineering(test_df)

# Print the first few rows to verify transformations
print(train_df.head())
print(test_df.head())


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df[col].fillna('missing', inplace=True)


                    id                                       listing_url  \
0  1158417037056953812  https://www.airbnb.com/rooms/1158417037056953812   
1             53405110             https://www.airbnb.com/rooms/53405110   
2  1099697666994383724  https://www.airbnb.com/rooms/1099697666994383724   
3  1039777404032438158  https://www.airbnb.com/rooms/1039777404032438158   
4             53633927             https://www.airbnb.com/rooms/53633927   

        scrape_id last_scraped       source  \
0  20240904164210   2024-09-05  city scrape   
1  20240904164210   2024-09-05  city scrape   
2  20240904164210   2024-09-05  city scrape   
3  20240904164210   2024-09-05  city scrape   
4  20240904164210   2024-09-05  city scrape   

                                              name  \
0                        Lovely High Rise in Ktown   
1                                    Downtown Loft   
2  Chic Apartment in the Heart of Beverly Hills #8   
3     Blueground | NoHo, walk to restaurants



In [11]:
# Preprocessing Training Dataset

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# 4. Select variables to be used in training
selected_columns = [
    'availability_change_30_to_60', 'availability_change_30_to_90', 'availability_change_30_to_365', 'description_sentiment',
    'has_wifi', 'has_pool', 'has_dryer', 'has_washer', 'has_hot_tub', 'has_kitchen', 'has_air_conditioning',
    'has_parking', 'host_tenure_days', 'price_per_guest', 'distance_to_downtown',
    'review_scores_rating_bin', 'review_scores_accuracy_bin', 'review_scores_cleanliness_bin',
    'review_scores_checkin_bin', 'review_scores_communication_bin', 'review_scores_location_bin',
    'review_scores_value_bin', 'days_since_last_review', 'minimum_nights', 'reviews_per_month',
    'number_of_reviews_ltm', 'number_of_reviews_l30d', 'host_acceptance_rate', 'host_response_rate',
    'room_type', 'neighbourhood_group_cleansed', 'host_has_profile_pic', 'host_identity_verified',
    'instant_bookable', 'host_is_superhost'
]

# 5. Filter out columns not present in the training dataset
selected_columns = [col for col in selected_columns if col in train_df.columns]

# 6. Define features (X) and target variable (y)
X = train_df[selected_columns]
y = train_df['days_booked']

# 7. Separate features into categorical and numerical columns
categorical_columns = X.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_columns = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

# 8. Create a preprocessing pipeline
# - Apply one-hot encoding to categorical columns
# - Impute missing values and scale numerical columns
transformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'), categorical_columns),
        ('num', Pipeline([
            ('imputer', SimpleImputer(strategy='constant', fill_value=0)),  # Impute missing values with 0
            ('scaler', StandardScaler())  # Standardize numerical features
        ]), numerical_columns)
    ]
)

# 9. Fit the transformer to the data and apply the transformations
transformer.fit(X)
X_transformed = transformer.transform(X)


In [13]:
# use random forest and grid search to find the best model

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Define the Random Forest model
rf_model = RandomForestRegressor(random_state=42)

# Define the parameter grid for Random Forest
rf_param_grid = {
    'n_estimators': [50, 100],      # Number of trees
    'max_depth': [10, 20],          # Maximum tree depth
    'min_samples_split': [2, 5],    # Minimum samples required to split a node
    'min_samples_leaf': [1, 2]      # Minimum samples in a leaf node
}

# Set up GridSearchCV for Random Forest
rf_grid_search = GridSearchCV(
    estimator=rf_model,
    param_grid=rf_param_grid,
    cv=5,  # 5-fold cross-validation
    scoring='r2',  # Use R² as the scoring metric
    n_jobs=-1
)

# Fit the GridSearchCV on the preprocessed data
rf_grid_search.fit(X_transformed, y)

# Get the best model and parameters
best_rf_model = rf_grid_search.best_estimator_
print(f"Best Parameters: {rf_grid_search.best_params_}")
print(f"Best Cross-Validation R²: {rf_grid_search.best_score_:.4f}")


Best Parameters: {'max_depth': 10, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 100}
Best Cross-Validation R²: 0.2047


In [15]:
# Use XGBoost for regression and tune hyperparameters

from xgboost import XGBRegressor
from sklearn.model_selection import GridSearchCV

# Define the XGBoost model
xgb_model = XGBRegressor(random_state=42)

# Define the parameter grid for XGBoost
xgb_param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Set up GridSearchCV for XGBoost
xgb_grid_search = GridSearchCV(
    estimator=xgb_model,
    param_grid=xgb_param_grid,
    cv=5,
    scoring='r2',
    n_jobs=-1,
    verbose=3
)

# Fit GridSearchCV on the training data
xgb_grid_search.fit(X_transformed, y)

# Get the best model and parameters
best_xgb_model = xgb_grid_search.best_estimator_
print(f"Best Parameters (XGBoost): {xgb_grid_search.best_params_}")
print(f"Best Cross-Validation R² (XGBoost): {xgb_grid_search.best_score_:.4f}")

Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best Parameters (XGBoost): {'colsample_bytree': 1.0, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 50, 'subsample': 1.0}
Best Cross-Validation R² (XGBoost): 0.2188
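
The fit count in the log above follows from the grid size: 3 × 3 × 3 × 2 × 2 = 108 parameter combinations, each evaluated on 5 folds. A quick check with scikit-learn's `ParameterGrid`, reusing the same grid values (the variable names here are illustrative):

```python
from sklearn.model_selection import ParameterGrid

# Same values as xgb_param_grid in the cell above.
grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.1, 0.2],
    "max_depth": [3, 5, 7],
    "subsample": [0.8, 1.0],
    "colsample_bytree": [0.8, 1.0],
}

n_candidates = len(ParameterGrid(grid))  # number of parameter combinations
n_fits = n_candidates * 5                # one fit per candidate per CV fold
print(n_candidates, n_fits)  # 108 540
```

Grid size grows multiplicatively with each added hyperparameter, so a randomized search may be worth considering for larger grids.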


In [16]:
# Use Ensemble Learning to Combine Predictions from Multiple Models

from sklearn.ensemble import VotingRegressor

# Initialize the Voting Regressor with the best model configurations
# 'rf' for Random Forest and 'xgb' for XGBoost
voting_model = VotingRegressor(estimators=[
    ('rf', best_rf_model),  # Best Random Forest configuration (VotingRegressor refits a clone)
    ('xgb', best_xgb_model)  # Best XGBoost configuration (VotingRegressor refits a clone)
])

# Train the Voting Regressor on the full transformed training dataset
voting_model.fit(X_transformed, y)

# Generate predictions using the trained Voting Regressor
y_train_pred = voting_model.predict(X_transformed)

from sklearn.metrics import r2_score

# Calculate and display the R² score for the Voting Regressor on the training data
train_r2 = r2_score(y, y_train_pred)
print(f"Voting Regressor Training R²: {train_r2:.4f}")


Voting Regressor Training R²: 0.4672
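
`VotingRegressor` does not reuse the fitted estimators passed to it: `fit` trains fresh clones, and `predict` returns the unweighted mean of their predictions when no weights are given. The sketch below illustrates this on synthetic data, with `Ridge` and `DecisionTreeRegressor` standing in (as assumptions, to keep the example light) for the tuned Random Forest and XGBoost models.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0)
tree = DecisionTreeRegressor(max_depth=3, random_state=0)

voter = VotingRegressor(estimators=[("ridge", ridge), ("tree", tree)])
voter.fit(X_demo, y_demo)

# The fitted sub-models are clones, not the objects passed in.
is_clone = voter.estimators_[0] is not ridge

# The ensemble prediction is the mean of the sub-model predictions.
manual_mean = np.mean([est.predict(X_demo) for est in voter.estimators_], axis=0)
matches = np.allclose(voter.predict(X_demo), manual_mean)
print(is_clone, matches)
```

Because the training $R^2$ is computed on the same rows the ensemble was fit on, it is an optimistic figure; the cross-validated score is the more useful one.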


In [19]:
from sklearn.model_selection import cross_val_score

# Evaluate the Voting Regressor using cross-validation
cv_scores = cross_val_score(voting_model, X_transformed, y, cv=5, scoring='r2')

# Display cross-validation scores
print("Cross-Validation R² Scores:", cv_scores)
print(f"Mean Cross-Validation R²: {cv_scores.mean():.4f}")


Cross-Validation R² Scores: [0.2421598  0.27912535 0.20894832 0.18895483 0.18081076]
Mean Cross-Validation R²: 0.2200


In [None]:
# Train the final Voting Regressor on the entire training dataset
voting_model.fit(X_transformed, y)

In [None]:
# Preprocess the test dataset using the same transformer
X_test_transformed = transformer.transform(test_df[selected_columns])

# Generate predictions for the test set
y_test_pred = voting_model.predict(X_test_transformed)

# Preview the first few predictions
print("Predictions for the Test Set:", y_test_pred[:10])


In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Select the same features from test_df as used in training
selected_columns = [
    'availability_change_30_to_60', 'availability_change_30_to_90', 'availability_change_30_to_365',
    'description_sentiment', 'has_wifi', 'has_pool', 'has_dryer', 'has_washer', 'has_hot_tub',
    'has_kitchen', 'has_air_conditioning', 'has_parking', 'host_tenure_days', 'price_per_guest',
    'distance_to_downtown', 'review_scores_rating_bin', 'review_scores_accuracy_bin',
    'review_scores_cleanliness_bin', 'review_scores_checkin_bin', 'review_scores_communication_bin',
    'review_scores_location_bin', 'review_scores_value_bin', 'days_since_last_review',
    'minimum_nights', 'reviews_per_month', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
    'host_acceptance_rate', 'host_response_rate', 'room_type', 'neighbourhood_group_cleansed',
    'host_has_profile_pic', 'host_identity_verified', 'instant_bookable', 'host_is_superhost'
]

# Ensure only columns that exist in test_df are used
selected_columns = [col for col in selected_columns if col in test_df.columns]

# Preprocess the test data
X_test = test_df[selected_columns]  # Select relevant features
X_test_transformed = transformer.transform(X_test)  # Apply the transformer

# Make predictions on the test set using Voting Regressor
y_test_pred = voting_model.predict(X_test_transformed)

# Create a DataFrame with only 'id' and 'prediction'
output_df = test_df[['id']].copy()  # Ensure only the 'id' column is selected
output_df['prediction'] = y_test_pred  # Add the predictions as a new column

# Save the output DataFrame to CSV
output_df.to_csv('predictions.csv', index=False)

print("Predictions saved to 'predictions.csv' with only 'id' and 'prediction' columns.")


In [None]:
# Check to make sure that output_df has the correct format and write to csv
from hashlib import md5
import numpy as np
# Checks that you have predictions for the expected listing ids
assert md5(np.sort(output_df.id.values)).hexdigest() == '87ed95adc911aad0ed9ef119a7a3315d', "Your listing ids are incorrect; you may need to regenerate test_df in the first cell"
assert "prediction" in output_df.columns, "You need to have a column named `prediction` in your output"
# Submit this CSV on canvas (only the `id` and `prediction` columns are written)
output_df.to_csv("predictions.csv", index=False)

In [None]:
pip install matplotlib

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Random Forest feature importance
rf_feature_importance = pd.Series(
    best_rf_model.feature_importances_,
    index=transformer.get_feature_names_out()
)
rf_feature_importance_sorted = rf_feature_importance.sort_values(ascending=False)

# Plot the top 10 features
plt.figure(figsize=(12, 8))  # Adjust figure size for better readability
top_features = rf_feature_importance_sorted.head(10)

ax = top_features.plot(
    kind='bar',
    title='Top 10 Random Forest Feature Importances',
    color='orange',  # Changed bar color
    edgecolor='black'  # Add edge color for better contrast
)

# Add labels and grid
plt.ylabel('Feature Importance Score', fontsize=12)
plt.xlabel('Features', fontsize=12)
plt.xticks(rotation=45, ha='right', fontsize=10)  # Rotate feature names for better visibility
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add feature importance values to the bars
for i, value in enumerate(top_features):
    ax.text(i, value + 0.005, f'{value:.3f}', ha='center', fontsize=10)

plt.tight_layout()  # Adjust layout to prevent clipping
plt.show()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# XGBoost feature importance
xgb_feature_importance = pd.Series(
    best_xgb_model.feature_importances_,
    index=transformer.get_feature_names_out()
)
xgb_feature_importance_sorted = xgb_feature_importance.sort_values(ascending=False)

# Plot the top 10 features
plt.figure(figsize=(12, 8))  # Adjust figure size for better readability
top_features = xgb_feature_importance_sorted.head(10)

ax = top_features.plot(
    kind='bar',
    title='Top 10 XGBoost Feature Importances',
    color='teal',  # Changed bar color
    edgecolor='black'  # Add edge color for better contrast
)

# Add labels and grid
plt.ylabel('Feature Importance Score', fontsize=12)
plt.xlabel('Features', fontsize=12)
plt.xticks(rotation=45, ha='right', fontsize=10)  # Rotate feature names for better visibility
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add feature importance values to the bars
for i, value in enumerate(top_features):
    ax.text(i, value + 0.005, f'{value:.3f}', ha='center', fontsize=10)

plt.tight_layout()  # Adjust layout to prevent clipping
plt.show()


In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# Combine feature importances by averaging
combined_feature_importance = (rf_feature_importance + xgb_feature_importance) / 2

# Sort and visualize the top 10 combined features
combined_feature_importance_sorted = combined_feature_importance.sort_values(ascending=False)
top_combined_features = combined_feature_importance_sorted.head(10)

# Plot the top 10 combined feature importances
plt.figure(figsize=(12, 8))  # Adjust figure size for better readability

ax = top_combined_features.plot(
    kind='bar',
    title='Top 10 Combined Feature Importances (Random Forest & XGBoost)',
    color='purple',  # Changed bar color
    edgecolor='black'  # Add edge color for better contrast
)

# Add labels and grid
plt.ylabel('Average Feature Importance', fontsize=12)
plt.xlabel('Features', fontsize=12)
plt.xticks(rotation=45, ha='right', fontsize=10)  # Rotate feature names for better visibility
plt.yticks(fontsize=10)
plt.grid(axis='y', linestyle='--', alpha=0.7)

# Add feature importance values to the bars
for i, value in enumerate(top_combined_features):
    ax.text(i, value + 0.005, f'{value:.3f}', ha='center', fontsize=10)

plt.tight_layout()  # Adjust layout to prevent clipping
plt.show()


### Write-Up (8 pts)

#### 1. Explain how you constructed and/or preprocessed features to help with prediction, and why. (3 pts)

To prepare the data for modeling, I implemented extensive feature engineering and preprocessing steps to maximize the predictive power of the model:

1. **Feature Engineering:**
   - **Sentiment Analysis:** Extracted sentiment polarity from the `description` field to capture nuances in listing descriptions.
   - **Binary Variables for Popular Amenities:** Created binary indicators for key amenities like WiFi, Pool, Kitchen, and Parking to quantify their availability in listings.
   - **Host Tenure and Activity:** Computed the number of days since the host joined (`host_tenure_days`) and days since the last review (`days_since_last_review`) to reflect host reliability and recent activity.
   - **Proximity to Downtown LA:** Calculated the geodesic distance between the listing and Downtown LA to account for location desirability.
   - **Normalized Pricing Metrics:** Derived `price_per_guest` to adjust prices based on guest capacity for better comparability across listings.
   - **Discretized Review Scores:** Categorized review scores into bins (low, medium, high) for interpretability and to reduce noise in continuous values.
   - **Availability Ratios and Changes:** Created ratios (e.g., `availability_30_to_60_ratio`) and differences (e.g., `availability_change_30_to_60`) to capture temporal availability trends. Only the differences were included in the final feature set.

2. **Preprocessing Steps:**
   - **Categorical Encoding:** Applied one-hot encoding to categorical variables like `room_type` and `neighbourhood_group_cleansed`.
   - **Scaling and Imputation:** Filled missing numeric values with the column median and missing categorical values with a `missing` category, then standardized the numerical features. The pipeline's `SimpleImputer` (constant 0) acts only as a fallback for any remaining gaps.

These steps were designed to capture meaningful patterns in the data and ensure compatibility with machine learning models. Together, they provided a comprehensive and clean feature set for training.
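
The distance feature can also be computed without a row-wise `apply`. The sketch below uses the haversine (great-circle) formula in NumPy as a vectorized stand-in for `geopy`'s geodesic distance. It assumes a spherical Earth with a mean radius of 6371 km, so values differ slightly from the ellipsoidal geodesic result. The function name `haversine_km` and the sample coordinates are illustrative.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation)

def haversine_km(lat, lon, ref_lat, ref_lon):
    """Great-circle distance in km from each (lat, lon) to a reference point."""
    lat, lon = np.radians(lat), np.radians(lon)
    ref_lat, ref_lon = np.radians(ref_lat), np.radians(ref_lon)
    dlat = lat - ref_lat
    dlon = lon - ref_lon
    a = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(ref_lat) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Downtown LA coordinates as used in the notebook.
downtown = (34.052235, -118.243683)

# Illustrative points: downtown itself and a point 0.1 degrees of latitude north.
lats = np.array([34.052235, 34.152235])
lons = np.array([-118.243683, -118.243683])
dist = haversine_km(lats, lons, *downtown)
print(dist.round(2))
```

With a DataFrame, the same call would take `df['latitude'].to_numpy()` and `df['longitude'].to_numpy()` as the first two arguments.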

---

#### 2. Explain what decisions you made using cross-validation, and how well you believe your final model will perform. (3 pts)

To ensure that the model generalizes well to unseen data, I made several decisions based on cross-validation:

1. **Cross-Validation Strategy:**
   - I employed **5-fold cross-validation** to evaluate model performance. This method splits the training data into five subsets, using four subsets for training and one for validation in each iteration. This shows whether the model performs consistently across different data splits and helps detect overfitting.
   - The mean $R^2$ score from cross-validation estimates how well the model generalizes. Because the same folds were also used to choose hyperparameters, this estimate is likely slightly optimistic.

2. **Model Selection and Tuning:**
   - Based on cross-validation scores:
     - I tested several models, including **Lasso Regression**, **Random Forest**, and **XGBoost**, to identify the best-performing approach.
     - For Lasso, despite hyperparameter tuning, the $R^2$ remained at 15-17%, which was not sufficient. Therefore, I excluded it from the final notebook.
     - I selected a **Voting Regressor** that averages the predictions of **Random Forest** and **XGBoost** (cross-validation $R^2$ of 0.2047 and 0.2188 individually, 0.2200 combined), as the two models learn in different ways:
       - Random Forest averages many independently grown trees, which reduces variance.
       - XGBoost builds trees sequentially through gradient boosting, with each tree correcting the errors of the previous ones.

3. **Evaluation Metrics:**
   - The **mean cross-validation $R^2$** score for the Voting Regressor was **0.2200**, which sits right at the 22% test-set threshold for full performance credit. Fold scores ranged from about 0.18 to 0.28, so the test $R^2$ could land on either side of that threshold.

4. **Confidence in Test Set Performance:**
   - The **training $R^2$** (0.4672) is roughly twice the cross-validation $R^2$ (0.2200), so the model does overfit the training data to some extent. The cross-validation score is the better guide to test performance.
   - The inherent noise in the dataset (e.g., cancellations, incomplete bookings, and missing data) makes achieving high $R^2$ scores challenging. However, a cross-validation $R^2$ of 0.2200 suggests the model captures a real, if modest, share of the variance.

In conclusion, cross-validation and careful hyperparameter tuning were critical in selecting and optimizing the Voting Regressor. Its consistent, if modest, performance across validation folds gives reasonable confidence in its ability to generalize to the test set.
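
One way to keep the cross-validation estimate honest is to put preprocessing inside a `Pipeline`, so that imputation and scaling are refit on the training portion of each fold instead of on the full training set. A minimal sketch on synthetic data follows; the data, the `Ridge` model, and all variable names are placeholders, not the notebook's actual setup.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with some missing values (illustrative only).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(300, 5))
y_demo = X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=1.0, size=300)
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan  # inject roughly 5% missing entries

# Preprocessing and model in one estimator: each CV fold refits all steps.
model = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("regressor", Ridge(alpha=1.0)),
])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="r2")
print(scores.round(3), scores.mean().round(3))
```

The same pattern works with `GridSearchCV`, where parameter names are prefixed by the step name (for example `regressor__alpha`).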

---

#### 3. Explain what features you found to be important using the feature importance tools we discussed in class. (2 pts)

Feature importance analysis using **Random Forest**, **XGBoost**, and the combined average highlighted the following key predictors:

1. **Top Features Identified:**
   - `num__number_of_reviews_ltm`: The number of reviews in the last 12 months was consistently the most important feature across both models. It reflects listing popularity and recent activity.
   - `num__availability_change_30_to_90`: The difference between 90-day and 30-day availability reflects how open the calendar is beyond the first month, a signal of booking patterns and future demand.
   - `num__distance_to_downtown`: Distance from Downtown LA serves as a proxy for location desirability, which affects booking likelihood.
   - `num__price_per_guest`: Normalized pricing helps capture the affordability and value of listings.
   - `num__days_since_last_review`: Recent reviews indicate active bookings, which influence predictions.

2. **Insights from Combined Importance:**
   - Aggregating feature importances from Random Forest and XGBoost showed that both models heavily relied on temporal availability changes and review-related metrics, confirming the relevance of these features.

3. **Why Feature Importance?**
   - Built-in (impurity- or gain-based) feature importance was chosen over permutation importance for its computational efficiency. Permutation importance is more robust because it directly measures how shuffling a feature degrades predictions, whereas built-in importance can favor continuous or high-cardinality features. For this assignment, the built-in scores were a practical way to rank the relative contributions of features.

4. **Visualizations:**
   - The top 10 features for each model were visualized using bar plots. These plots provide a clear picture of the most impactful predictors for bookings.

In summary, feature importance analysis verified the relevance of features derived from availability, pricing, location, and reviews. These insights align with domain knowledge and emphasize the robustness of the engineered features and the interpretability and reliability of the model.
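
For comparison, permutation importance is available in scikit-learn as `sklearn.inspection.permutation_importance`. The sketch below applies it to a small synthetic problem in which only the first feature carries signal; the data and names are illustrative and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 drives the target (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=400)

X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, random_state=0)
forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on held-out data and measure the drop in R².
result = permutation_importance(
    forest, X_val, y_val, scoring="r2", n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]  # most important first
print(ranking, result.importances_mean.round(3))
```

Computing the scores on held-out rows, as here, measures what the model relies on when generalizing, not what it memorized during training.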

---
