## 1) Exploratory Data Analysis

After cleaning and imputing the datasets, I gained many insights from the data. Using pairplots and correlation matrices, I identified a few important numeric predictors that had strong relationships with price: the number of beds, the number of bathrooms, accommodates, and the number of reviews. Scatterplots against price also suggested that latitude, longitude, and bathrooms had higher-order relationships with price. Further, predictors that were very similar in meaning were highly correlated with one another, for example all of the different review columns and all of the different max/min nights columns. Many of the categorical columns, such as location, had values that appeared only once or twice. To isolate the importance of the more prominent values, it was helpful to group the rare ones as "Other."
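
The grouping of rare categories described above can be sketched as follows. This is a minimal illustration on made-up data: the helper `group_rare` and its `min_count` threshold are assumptions for this example and do not appear in the notebook, which keeps a hand-picked list of values instead.

```python
import pandas as pd

def group_rare(series: pd.Series, min_count: int = 3, other: str = "Other") -> pd.Series:
    """Replace values that occur fewer than min_count times with a single label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    # Values not in `keep` (including missing values) become `other`
    return series.where(series.isin(keep), other)

# Made-up host locations: two common values and two that appear only once
locations = pd.Series(
    ["Chicago, IL"] * 5 + ["New York, NY"] * 3 + ["Austin, TX", "Paris, France"]
)
grouped = group_rare(locations, min_count=3)
print(grouped.value_counts())
```

A count-based threshold avoids maintaining the list by hand, at the cost of the categories depending on the sample.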

Few new insights were obtained from the data after the interim report. I mostly focused on spotting collinearity to clean up the model, and most of the earlier insights were reused and expanded upon.
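
One simple way to spot collinear predictors is to list the column pairs whose absolute correlation is above a cutoff. The sketch below uses synthetic data; the function `high_corr_pairs`, the 0.8 cutoff, and the generated values are all assumptions for illustration, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """List numeric column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            r = corr.loc[left, right]
            if r > threshold:
                pairs.append((left, right, round(float(r), 3)))
    return pairs

# Synthetic stand-in: availability_60 is built to track availability_30 closely
rng = np.random.default_rng(0)
avail_30 = rng.uniform(0, 30, size=200)
toy = pd.DataFrame({
    "availability_30": avail_30,
    "availability_60": 2 * avail_30 + rng.normal(scale=3, size=200),
    "number_of_reviews": rng.poisson(20, size=200).astype(float),
})
pairs = high_corr_pairs(toy)
print(pairs)
```

Pairwise correlation only catches two-way relationships; a predictor that is a combination of several others would need a different check.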

## 2) Data Cleaning/Preparation

In [1]:
import pandas as pd
import statsmodels.formula.api as smf
import numpy as np
import seaborn as sns
from patsy import dmatrices
from scipy import stats

In [2]:
data_train = pd.read_csv('train_regression.csv')

# Clean and convert price
data_train.price = data_train.price.str.replace('$', '', regex=False).str.replace(',', '', regex=False).astype(float)

# Look at predictors
data_train.info()

# get rid of unrealistic value
data_train = data_train[data_train.price < 10000]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 54 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            5000 non-null   int64  
 1   host_id                                       5000 non-null   int64  
 2   host_since                                    5000 non-null   object 
 3   host_location                                 4042 non-null   object 
 4   host_response_time                            4582 non-null   object 
 5   host_response_rate                            4582 non-null   object 
 6   host_acceptance_rate                          4709 non-null   object 
 7   host_is_superhost                             4977 non-null   object 
 8   host_neighbourhood                            4873 non-null   object 
 9   host_listings_count                           5000 non-null   i


In [3]:
# impute missing values
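# numeric columns: fill with the column median
data_train = data_train.fillna(data_train.median(numeric_only=True))
# remaining (non-numeric) columns: forward-fill from the previous row
data_train = data_train.ffill()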

print(data_train.isnull().sum())

id                                              0
host_id                                         0
host_since                                      0
host_location                                   0
host_response_time                              0
host_response_rate                              0
host_acceptance_rate                            0
host_is_superhost                               0
host_neighbourhood                              0
host_listings_count                             0
host_total_listings_count                       0
host_verifications                              0
host_has_profile_pic                            0
host_identity_verified                          0
neighbourhood_cleansed                          0
latitude                                        0
longitude                                       0
property_type                                   0
room_type                                       0
accommodates                                    0



In [4]:
# Detect if the word "private" is present
data_train['is_private'] = data_train['room_type'].str.contains('Private', case=False)
data_train['is_private'].value_counts()

False    3872
True     1127
Name: is_private, dtype: int64

In [5]:
# bathrooms text (missing values): create 2 different predictors: extract number, and binary column if the word "shared" is in it

data_train['bathrooms'] = data_train['bathrooms_text'].str.extract(r'(\d*\.?\d+)', expand=False).astype(float)

# Detect if the word "shared" is present
data_train['is_shared'] = data_train['bathrooms_text'].str.contains('shared', case=False)

In [6]:
# pick fancy neighborhoods for predictor
fancy_neighborhoods = [
    "Cambridge", "River North", "Logan Square", "West Town", "Lincoln Park",
    "Lake View East", "Near North Side", "Streeterville", "Chicago Loop", "Bucktown",
    "Lakeview", "Lincoln Square", "West Loop/Greektown", "Andersonville", "West Loop",
    "North Center", "Old Town", "Portage Park", "Pilsen", "Uptown", "Lower West Side"
]

data_train['fancy'] = data_train['host_neighbourhood'].isin(fancy_neighborhoods)

In [7]:
# convert to numeric
data_train.host_response_rate = data_train.host_response_rate.str.replace('%', '').str.replace(',','').astype(float)
data_train.host_acceptance_rate = data_train.host_acceptance_rate.str.replace('%', '').str.replace(',','').astype(float)

In [8]:
# host location
data_train['host_location_category'] = data_train['host_location'].apply(lambda x: x if x in ['Chicago, IL', 'New York, NY'] else 'Other')

dummy_variables = pd.get_dummies(data_train['host_location_category'], prefix='host_location')

In [9]:
# convert to datetime, then numeric (days relative to a fixed reference date)
reference_date = pd.to_datetime('2015-01-01')

data_train['host_since'] = pd.to_datetime(data_train['host_since'])
data_train['days_since_host'] = (reference_date - data_train['host_since']).dt.days

data_train['first_review'] = pd.to_datetime(data_train['first_review'])
data_train['days_since_first'] = (reference_date - data_train['first_review']).dt.days

data_train['last_review'] = pd.to_datetime(data_train['last_review'])
data_train['days_since_last'] = (reference_date - data_train['last_review']).dt.days

In [10]:
# apply everything to the test data

data_test = pd.read_csv('test_regression.csv')
# impute with the training medians, then fill what is left from neighbouring rows
data_test = data_test.fillna(data_train.median(numeric_only=True))
data_test = data_test.ffill()
data_test = data_test.bfill()


data_test['bathrooms'] = data_test['bathrooms_text'].str.extract(r'(\d*\.?\d+)', expand=False).astype(float)

data_test['is_shared'] = data_test['bathrooms_text'].str.contains('shared', case=False)

data_test['fancy'] = data_test['host_neighbourhood'].isin(fancy_neighborhoods)

data_test.host_response_rate = data_test.host_response_rate.str.replace('%', '').str.replace(',','').astype(float)
data_test.host_acceptance_rate = data_test.host_acceptance_rate.str.replace('%', '').str.replace(',','').astype(float)

data_test['host_location_category'] = data_test['host_location'].apply(lambda x: x if x in ['Chicago, IL', 'New York, NY'] else 'Other')

dummy_variables = pd.get_dummies(data_test['host_location_category'], prefix='host_location')

reference_date = pd.to_datetime('2015-01-01')

# use the test set's own host_since column
data_test['host_since'] = pd.to_datetime(data_test['host_since'])
data_test['days_since_host'] = (reference_date - data_test['host_since']).dt.days

data_test['first_review'] = pd.to_datetime(data_test['first_review'])
data_test['days_since_first'] = (reference_date - data_test['first_review']).dt.days

data_test['last_review'] = pd.to_datetime(data_test['last_review'])
data_test['days_since_last'] = (reference_date - data_test['last_review']).dt.days

data_test['is_private'] = data_test['room_type'].str.contains('Private', case=False)




## 3) Developing the Model

Since the interim report, many changes were made to the model. The inclusion of specific interactions and transformations plays a pivotal role in capturing nuanced relationships and improving predictive accuracy.

First, the model now predicts the logarithm of price (log_price). This transformation handles the skewed distribution of the price data and allows for a more linear relationship between the predictors and the target variable.
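
The effect of the log transformation on skewness can be illustrated with a small simulation. The prices below are synthetic draws from a log-normal distribution (an assumption chosen only because it is right-skewed like typical price data), and `skewness` is a helper written for this sketch.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third central moment divided by variance**1.5."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5

# Synthetic right-skewed "prices" (not the Airbnb data)
rng = np.random.default_rng(42)
price = rng.lognormal(mean=5.0, sigma=0.8, size=5000)

raw_skew = skewness(price)
log_skew = skewness(np.log(price))
print(f"skewness of price:     {raw_skew:.2f}")
print(f"skewness of log price: {log_skew:.2f}")
```

On data like this the raw prices have a long right tail while the logged prices are close to symmetric, which is the property a linear model with normal errors benefits from.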

One notable addition in the updated model is the use of trigonometric functions, sine and cosine, applied to various predictors. For instance, the sine and cosine terms for latitude and longitude complement the polynomial terms and give the model extra flexibility to fit nonlinear patterns in how price varies with location. With these trigonometric transformations, the model can capture the geographic variables better, which improves its ability to predict Airbnb prices.
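
To show what polynomial and trigonometric terms add over a purely linear term, the sketch below fits both versions by least squares on synthetic data. Everything here is an assumption for illustration: the latitude range, the curved relationship, and in particular the centering of latitude (`lat_c`), which the notebook's formula does not do but which keeps the powers numerically well behaved.

```python
import numpy as np

rng = np.random.default_rng(0)
lat = rng.uniform(41.65, 42.02, size=500)   # synthetic, Chicago-like latitudes
lat_c = lat - lat.mean()                    # centering: an assumption, not in the notebook
# Synthetic target: log price peaks in the middle of the latitude range
log_price = 5.0 - 20.0 * lat_c**2 + rng.normal(scale=0.1, size=500)

def r_squared(X, y):
    """R-squared of an ordinary least squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

ones = np.ones_like(lat_c)
X_linear = np.column_stack([ones, lat_c])
X_flex = np.column_stack([ones, lat_c, lat_c**2, lat_c**3, np.sin(lat), np.cos(lat)])

r2_linear = r_squared(X_linear, log_price)
r2_flex = r_squared(X_flex, log_price)
print(f"linear only:          R^2 = {r2_linear:.3f}")
print(f"polynomial + sin/cos: R^2 = {r2_flex:.3f}")
```

Over a range this narrow, the sine and cosine columns are smooth curves that overlap heavily with the polynomial columns, so the extra terms mainly add flexibility rather than distinct information.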

The updated model includes an extensive array of interaction terms, reflecting the interdependencies among predictors and their combined effects on price prediction. For example, interactions between accommodation attributes (e.g., beds, bathrooms) and neighborhood characteristics (e.g., 'fancy' designation) capture how upscale areas may correlate differently with rental features. Similarly, interactions between accommodation features (e.g., number of reviews, accommodates) and host-related variables (e.g., host_listings_count) account for how host behavior may influence pricing dynamics.

Interaction terms between categorical predictors (e.g., 'is_private', 'fancy') and numeric predictors (e.g., latitude, longitude) further refine the model's predictive capabilities. For instance, interactions between 'is_private' and geographic coordinates recognize how the status of a listing (private or shared) may interact differently with spatial factors in determining pricing.
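
An interaction between a binary and a numeric predictor lets the slope of the numeric predictor differ between the two groups. The sketch below builds the interaction column by hand and fits it with NumPy on synthetic data; the coefficients used to generate the data are made up, and only the variable names are borrowed from the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
accommodates = rng.integers(1, 9, size=n).astype(float)   # 1 to 8 guests
is_private = rng.integers(0, 2, size=n).astype(float)     # 0 = entire home, 1 = private room

# Synthetic log prices: the per-guest slope is 0.20 for entire homes
# and 0.20 - 0.10 = 0.10 for private rooms
log_price = (4.0 + 0.20 * accommodates - 0.50 * is_private
             - 0.10 * is_private * accommodates
             + rng.normal(scale=0.05, size=n))

# Columns: intercept, accommodates, is_private, is_private:accommodates
X = np.column_stack([np.ones(n), accommodates, is_private, is_private * accommodates])
coef, *_ = np.linalg.lstsq(X, log_price, rcond=None)

slope_entire = coef[1]              # slope when is_private = 0
slope_private = coef[1] + coef[3]   # slope when is_private = 1
print(f"slope for entire homes:  {slope_entire:.3f}")
print(f"slope for private rooms: {slope_private:.3f}")
```

In the formula syntax used in the model, `is_private*accommodates` expands to both main effects plus this product term.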

By incorporating these specific interactions and transformations, the updated regression model achieves a more nuanced understanding of the complex relationships between predictors and Airbnb prices. These enhancements enable the model to better capture spatial, temporal, and categorical influences, ultimately resulting in improved predictive accuracy and robustness.

Additionally, measures to address model performance issues have been implemented. Removing influential data points mitigates the impact of outliers on model estimation and gives more robust parameter estimates. Moreover, the predictions are multiplied by a factor of 1.05 to counteract underprediction, which improves model calibration.
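
The idea of an influential point, one that has both a large studentized residual and high leverage, can be sketched with NumPy alone. The notebook itself relies on statsmodels (`outlier_test()` and `get_influence()`) with a t-distribution critical value and flags only large positive residuals; the version below is a simplified stand-in that assumes a fixed two-sided cutoff of 2 and uses synthetic data with one planted bad point.

```python
import numpy as np

def leverage_and_studentized(X, y):
    """Hat-matrix diagonal and externally studentized residuals for an OLS fit."""
    n, p = X.shape
    Q, _ = np.linalg.qr(X)                  # reduced QR of the design matrix
    h = np.sum(Q**2, axis=1)                # leverage: diagonal of Q @ Q.T
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (n - p)
    # Residual variance with observation i left out
    s2_loo = (s2 * (n - p) - resid**2 / (1 - h)) / (n - p - 1)
    t = resid / np.sqrt(s2_loo * (1 - h))
    return h, t

rng = np.random.default_rng(3)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(size=200)
# Plant one point far from the others in x and far from the line in y
x = np.append(x, 8.0)
y = np.append(y, 1.0 + 2.0 * 8.0 + 15.0)
planted = len(x) - 1

X = np.column_stack([np.ones_like(x), x])
h, t = leverage_and_studentized(X, y)

cutoff = 2.0                                # assumed cutoff, roughly the 5% two-sided t value
flagged = np.where((np.abs(t) > cutoff) & (h > 4 * h.mean()))[0]
print("flagged rows:", flagged)
```

Requiring both conditions keeps ordinary outliers with little pull on the fit, and high-leverage points that agree with the fit, in the training data.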

## 4) Model

In [12]:
data_train['log_price'] = np.log(data_train['price'])

# model
formula = ('log_price~ '
           'maximum_nights + I(maximum_nights**2)+host_location_category+is_private*latitude+is_private*longitude+is_private*bathrooms+is_private*accommodates+is_private*beds+'
           'number_of_reviews*number_of_reviews_ltm+reviews_per_month +is_private*is_shared+is_private*host_listings_count+np.sin(number_of_reviews_ltm)+'
           'review_scores_value*review_scores_rating+review_scores_accuracy+review_scores_cleanliness+is_private*availability_30+is_private*availability_60+'
           'review_scores_checkin+review_scores_location+'
           'calculated_host_listings_count_entire_homes+'
           'latitude + I(latitude**2)+I(latitude**3)+I(latitude**4)+I(latitude**5)+np.sin(availability_60)+np.cos(host_listings_count)+'
           'longitude + I(longitude**2) +I(longitude**3)+I(longitude**4)+I(longitude**5)+ latitude*longitude +'
           'minimum_nights*maximum_nights+minimum_nights+maximum_nights*review_scores_accuracy+'
           'host_has_profile_pic +np.sin(reviews_per_month)+np.cos(reviews_per_month)+'
           'host_identity_verified+accommodates*instant_bookable+np.cos(days_since_host)+'
           'availability_30*availability_60+fancy*accommodates+'
           'is_shared*accommodates+np.cos(bathrooms)+np.sin(bathrooms)+'
           'fancy*beds+fancy*latitude+np.sin(latitude)+np.cos(latitude)+np.sin(longitude)+np.cos(longitude)+I(reviews_per_month**2) + I(reviews_per_month**3)+'
          'is_shared*bathrooms+is_shared*host_listings_count+calculated_host_listings_count_private_rooms*calculated_host_listings_count_entire_homes+'
           'days_since_first+days_since_last+I(days_since_last**2)+'
          'I(calculated_host_listings_count_entire_homes**2)+I(calculated_host_listings_count_entire_homes**3)+'
          'I(calculated_host_listings_count_private_rooms**2)+np.sin(fancy)+np.cos(fancy)+'
          'instant_bookable*host_listings_count+calculated_host_listings_count_entire_homes*host_listings_count+I(days_since_last**3)+'
          'minimum_nights*bathrooms+I(host_listings_count**2)+I(host_listings_count**3)+review_scores_location*is_shared+'
          'beds*accommodates*bathrooms+beds*is_shared+beds*bathrooms+fancy*beds+days_since_first*days_since_host+np.sin(accommodates)')

model = smf.ols(formula=formula, data=data_train).fit()

In [13]:
# Cleaning influential points

# Outliers
out = model.outlier_test()
N = data_train.shape[0]
p = model.df_model
alpha = 0.05
critic_val = stats.t.ppf(1 - alpha/2, N-p-1)
# High leverage points
inf = model.get_influence()
leverage = inf.hat_matrix_diag
avg_leverage = np.mean(leverage)
# Influential points
np.sum((out.student_resid > critic_val) & (leverage > 4*avg_leverage)) # 8 influential points

8

In [14]:
boolean_mask = ~((out.student_resid > critic_val) & (leverage > 4*avg_leverage))
data_train_clean = data_train.dropna().reset_index(drop=True)
train_clean = data_train_clean[boolean_mask.values]

In [15]:
# New model with cleaned data

model = smf.ols(formula=formula, data=train_clean).fit()

model.summary()

Dep. Variable:           log_price   R-squared:               0.704
Model:                         OLS   Adj. R-squared:          0.699
Method:              Least Squares   F-statistic:             138.6
Date:             Fri, 08 Mar 2024   Prob (F-statistic):        0.0
Time:                     12:31:48   Log-Likelihood:        -2674.8
No. Observations:             4984   AIC:                    5520.0
Df Residuals:                 4899   BIC:                    6073.0
Df Model:                       84
Covariance Type:         nonrobust

                                              coef   std err         t     P>|t|    [0.025    0.975]
Intercept                                   0.8885     0.209     4.242     0.000     0.478     1.299
host_location_category[T.New York, NY]     -0.0008     0.044    -0.019     0.985    -0.087     0.085
host_location_category[T.Other]            -0.0012     0.021    -0.057     0.955    -0.042     0.040
is_private[T.True]                         31.4490    29.714     1.058     0.290   -26.803    89.701
is_shared[T.True]                          -0.1872     0.223    -0.838     0.402    -0.625     0.251
host_has_profile_pic[T.t]                  -0.1283     0.055    -2.339     0.019    -0.236    -0.021
host_identity_verified[T.t]                -0.0828     0.021    -3.986     0.000    -0.124    -0.042
instant_bookable[T.t]                       0.0508     0.026     1.938     0.053    -0.001     0.102
fancy[T.True]                              21.5180     6.397     3.364     0.001     8.976    34.060

Omnibus:            215.95   Durbin-Watson:          1.998
Prob(Omnibus):         0.0   Jarque-Bera (JB):     601.652
Skew:                0.182   Prob(JB):           2.25e-131
Kurtosis:            4.663   Cond. No.            1.41e+18


In [17]:
# get predictions (imputation for remaining missing values)

preds = model.predict(data_test)

preds = (np.exp(preds))

preds = preds*1.05

default_value = preds.median()

preds = preds.fillna(default_value)
    
output = pd.DataFrame({'id': data_test.id, 'predicted':preds})

output.to_csv('prediction_problem_submission.csv', index = False)
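
Exponentiating a prediction made on the log scale tends to underestimate the mean price, which is consistent with the underprediction that the fixed 1.05 multiplier corrects for. A data-driven alternative, not used in this notebook, is Duan's smearing estimator: multiply the back-transformed predictions by the mean of the exponentiated training residuals. The sketch below shows it on synthetic data with a one-predictor model; all values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
x = rng.uniform(1, 8, size=n)
log_y = 4.0 + 0.2 * x + rng.normal(scale=0.5, size=n)   # synthetic log prices
y = np.exp(log_y)

# Fit OLS on the log scale
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, log_y, rcond=None)
fitted_log = X @ beta
resid = log_y - fitted_log

naive = np.exp(fitted_log)           # plain back-transform
smear = np.mean(np.exp(resid))       # smearing factor estimated from the residuals
corrected = naive * smear

print(f"smearing factor:           {smear:.3f}")
print(f"mean price:                {y.mean():.1f}")
print(f"mean naive prediction:     {naive.mean():.1f}")
print(f"mean corrected prediction: {corrected.mean():.1f}")
```

The factor depends on the residual spread of the fitted model, so it would need to be recomputed from the training residuals whenever the model changes.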