# Hedonic Pricing

We often try to predict the price of an asset from its observable characteristics. This is generally called **hedonic pricing**: How do the unit's characteristics determine its market price?

In the lab folder, there are three options: housing prices in pierce_county_house_sales.csv, car prices in cars_hw.csv, and airbnb rental prices in airbnb_hw.csv. If you know of another suitable dataset, please feel free to use that one.

1. Clean the data and perform some EDA and visualization to get to know the data set.
2. Transform your variables --- particularly categorical ones --- for use in your regression analysis.
3. Implement an ~80/~20 train-test split. Put the test data aside.
4. Build some simple linear models that include no transformations or interactions. Fit them, and determine their RMSE and $R^2$ on the both the training and test sets. Which of your models does the best?
5. Include transformations and interactions, and build a more complex model that reflects your ideas about how the features of the asset determine its value. Determine its RMSE and $R^2$ on the training and test sets. How does the more complex model your build compare to the simpler ones?
6. Summarize your results from 1 to 5. Have you learned anything about overfitting and underfitting, or model selection?
7. If you have time, use the sklearn.linear_model.Lasso to regularize your model and select the most predictive features. Which does it select? What are the RMSE and $R^2$? We'll cover the Lasso later in detail in class.



In [1]:
! git clone https://github.com/itisesha/labs

Cloning into 'labs'...
remote: Enumerating objects: 107, done.[K
remote: Counting objects: 100% (70/70), done.[K
remote: Compressing objects: 100% (46/46), done.[K
remote: Total 107 (delta 40), reused 43 (delta 24), pack-reused 37 (from 1)[K
Receiving objects: 100% (107/107), 21.09 MiB | 8.61 MiB/s, done.
Resolving deltas: 100% (42/42), done.
Updating files: 100% (27/27), done.


In [2]:
import pandas as pd

# Load the dataset
file_path = '/content/labs/04_hedonic_pricing/airbnb_hw.csv'
df = pd.read_csv(file_path)

# Display basic information and the first few rows of the dataset for initial inspection
df.info(), df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30478 entries, 0 to 30477
Data columns (total 13 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   Host Id                     30478 non-null  int64  
 1   Host Since                  30475 non-null  object 
 2   Name                        30478 non-null  object 
 3   Neighbourhood               30478 non-null  object 
 4   Property Type               30475 non-null  object 
 5   Review Scores Rating (bin)  22155 non-null  float64
 6   Room Type                   30478 non-null  object 
 7   Zipcode                     30344 non-null  float64
 8   Beds                        30393 non-null  float64
 9   Number of Records           30478 non-null  int64  
 10  Number Of Reviews           30478 non-null  int64  
 11  Price                       30478 non-null  object 
 12  Review Scores Rating        22155 non-null  float64
dtypes: float64(4), int64(3), object

(None,
     Host Id Host Since                                Name Neighbourhood   \
 0   5162530        NaN     1 Bedroom in Prime Williamsburg       Brooklyn   
 1  33134899        NaN     Sunny, Private room in Bushwick       Brooklyn   
 2  39608626        NaN                Sunny Room in Harlem      Manhattan   
 3       500  6/26/2008  Gorgeous 1 BR with Private Balcony      Manhattan   
 4       500  6/26/2008            Trendy Times Square Loft      Manhattan   
 
   Property Type  Review Scores Rating (bin)        Room Type  Zipcode  Beds  \
 0     Apartment                         NaN  Entire home/apt  11249.0   1.0   
 1     Apartment                         NaN     Private room  11206.0   1.0   
 2     Apartment                         NaN     Private room  10032.0   1.0   
 3     Apartment                         NaN  Entire home/apt  10024.0   3.0   
 4     Apartment                        95.0     Private room  10036.0   3.0   
 
    Number of Records  Number Of Reviews 

In [3]:
# Cleaning the data

# Remove any non-numeric characters (like $ or ,) from the Price column and convert it to numeric
df['Price'] = df['Price'].replace({'\$': '', ',': ''}, regex=True).astype(float)

# Handle missing values - we'll fill missing numeric columns with median values and categorical ones with mode
df['Host Since'] = pd.to_datetime(df['Host Since'], errors='coerce')  # Convert 'Host Since' to datetime
df['Host Since'].fillna(df['Host Since'].median(), inplace=True)

# Fill missing numerical values with the median
df['Review Scores Rating (bin)'].fillna(df['Review Scores Rating (bin)'].median(), inplace=True)
df['Review Scores Rating'].fillna(df['Review Scores Rating'].median(), inplace=True)
df['Beds'].fillna(df['Beds'].median(), inplace=True)
df['Zipcode'].fillna(df['Zipcode'].mode()[0], inplace=True)

# Drop columns that may not be useful for regression (Host Id, Name, Number of Records)
df_cleaned = df.drop(['Host Id', 'Name', 'Number of Records'], axis=1)

# One-hot encode categorical variables
df_cleaned = pd.get_dummies(df_cleaned, columns=['Neighbourhood', 'Property Type', 'Room Type'], drop_first=True)

# Splitting the data into features (X) and target (y)
X = df_cleaned.drop(columns=['Price'])
y = df_cleaned['Price']

# Perform 80/20 train-test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Display the cleaned data and shapes of the splits
df_cleaned.head(), X_train.shape, X_test.shape, y_train.shape, y_test.shape


The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Host Since'].fillna(df['Host Since'].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['Review Scores Rating (bin)'].fillna(df['Review Scores Rating (bin)'].median(), inplace=True)
The behavior will change in pandas 3.0. This inplace method will never

KeyError: "['Neighbourhood'] not in index"

In [4]:
# Re-import necessary libraries since they were not loaded after the error
from sklearn.model_selection import train_test_split

# Perform 80/20 train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Check the shapes of the cleaned data and the train-test split
X_train.shape, X_test.shape, y_train.shape, y_test.shape


NameError: name 'X' is not defined

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Initialize the linear regression model
model = LinearRegression()

# Fit the model to the training data
model.fit(X_train, y_train)

# Predict on training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate RMSE and R2 for training and test sets
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

train_rmse, test_rmse, train_r2, test_r2


In [None]:
# Drop the 'Host Since' column since it's not usable in the regression model
X = X.drop(columns=['Host Since'])

# Perform the train-test split again after dropping the column
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the linear regression model
model.fit(X_train, y_train)

# Predict on training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)

# Calculate RMSE and R2 for training and test sets
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_test_pred))

train_r2 = r2_score(y_train, y_train_pred)
test_r2 = r2_score(y_test, y_test_pred)

train_rmse, test_rmse, train_r2, test_r2


In [None]:
# Redefining the train-test split and necessary imports due to environment reset
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures
import numpy as np

# Load the dataset again
file_path = '/mnt/data/airbnb_hw.csv'
df = pd.read_csv(file_path)

# Clean the data: remove non-numeric characters from 'Price', handle missing values, etc.
df['Price'] = df['Price'].replace({'\$': '', ',': ''}, regex=True).astype(float)
df['Host Since'] = pd.to_datetime(df['Host Since'], errors='coerce')
df['Host Since'].fillna(df['Host Since'].median(), inplace=True)
df['Review Scores Rating (bin)'].fillna(df['Review Scores Rating (bin)'].median(), inplace=True)
df['Review Scores Rating'].fillna(df['Review Scores Rating'].median(), inplace=True)
df['Beds'].fillna(df['Beds'].median(), inplace=True)
df['Zipcode'].fillna(df['Zipcode'].mode()[0], inplace=True)
df_cleaned = df.drop(['Host Id', 'Name', 'Number of Records'], axis=1)

# Strip whitespaces and perform one-hot encoding
df_cleaned.columns = df_cleaned.columns.str.strip()
df_cleaned = pd.get_dummies(df_cleaned, columns=['Neighbourhood', 'Property Type', 'Room Type'], drop_first=True)

# Drop 'Host Since' column and define X and y
X = df_cleaned.drop(columns=['Host Since', 'Price'])
y = df_cleaned['Price']

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Polynomial Features (degree=2 for interaction terms)
poly = PolynomialFeatures(degree=2, include_bias=False)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Fit and evaluate complex linear regression model
model_poly = LinearRegression()
model_poly.fit(X_train_poly, y_train)

# Predict on train and test sets
y_train_poly_pred = model_poly.predict(X_train_poly)
y_test_poly_pred = model_poly.predict(X_test_poly)

# Calculate RMSE and R2
train_rmse_poly = np.sqrt(mean_squared_error(y_train, y_train_poly_pred))
test_rmse_poly = np.sqrt(mean_squared_error(y_test, y_test_poly_pred))

train_r2_poly = r2_score(y_train, y_train_poly_pred)
test_r2_poly = r2_score(y_test, y_test_poly_pred)

train_rmse_poly, test_rmse_poly, train_r2_poly, test_r2_poly


This suggests that the more complex model may be overfitting, as its performance on the test set deteriorated, which demonstrates the trade-off between model complexity and generalizability. ​