# Airbnb Marketplace Machine Learning Data Science Take-Home Project
## Summary of the Data Challenge

First of all, the data challenge closely resembles real-world data: I spent more than 80% of my time understanding and cleaning it. I used 10 broad procedures to clean the dataset, which reduced the number of rows by 4.13% and removed two columns with large volumes of missing data. The general strategies were to remove rows with unreasonable values, remove rows with missing data where the volume was small, fill in artificial values to flag event-based missing values (first-time listings), and impute with segment means. Refer to the Data Cleaning and Checking Steps section for a detailed discussion.

After the data cleaning steps, I conducted feature engineering, which generated 9 new features. Four of them gauge price competitiveness, while the other five gauge market demand and supply. Refer to the Feature Engineering section for a detailed discussion.

Once the model development dataset was ready, I used a random forest classifier to model the request-to-book probability. I used grid search (with 3-fold cross-validation) to tune the number of trees and the minimum leaf size in order to improve model accuracy. The model achieves an accuracy of 78.57%, which is 11.53 percentage points above the 67.04% accuracy of the empirical-mean prediction. The corresponding AUC is 85.05%. Given more time, I would spend a day or two trying other machine learning models and pick the one with the best accuracy score.

The random forest classifier indicates that listing views 2 to 6 days prior to the calendar night, the trailing 90-day occupancy rate, and the ratio of the subject's listing price to the KDT median available listing price are the top three variables in predicting the request-to-book probability. One of the key insights is that a small increase in listing views can improve the request-to-book probability significantly (by almost a factor of 3 when going from 0 average listing views to 1 listing view). Given that 58.56% of the data has 0 listing views, there is a huge potential to increase request-to-book volume if the marketing team can drive traffic to those listings. I would recommend that the marketing team devote more resources to driving traffic to listings without any prior views. In addition, the model can be used as a back end for a tool that helps hosts set their effective daily price. For instance, a user-interface module could let the host enter a planned daily price (which changes the ratio of the listing price to the KDT median available listing price) and return a request-to-book score (the model's predicted request-to-book probability, standardized into a score). Hosts can use this score to price their unit while balancing the risk of not getting any request to book.

Furthermore, I fitted a random forest classifier using only features that hosts can control, such as reviews, ratings, image quality, and price competitiveness. The purpose of this analysis is to show what a host can do to improve the request-to-book probability, irrespective of market demand and supply. Two features tell an interesting story: the number of reviews and the number of overall ratings. Between 0 and roughly 50 reviews or ratings, both show a significant increase in request-to-book probability. Since about 30% of the listings in the dataset have 0 reviews or 0 overall ratings, a campaign could proactively coach these hosts to get reviews and ratings, thereby increasing their future request-to-book probability.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
%matplotlib inline
pd.options.display.max_columns = None

In [None]:
df = pd.read_csv('C:/Users/kanli/Documents/airbnb data challenge/TH_data_challenge.tsv', sep='\t')

# Data Cleaning and Checking Steps

The original dataset has 184,279 observations. The data quality is low and requires extensive cleaning. Given access to the data engineering team, I would work closely with them to find the cause of the unreasonable values I noticed and to check whether any systemic issues are causing the missing data.

The general strategy is to exclude data points that have unreasonable values, delete rows with missing values when doing so has minimal impact on the sample size (especially missing values that coincide across several fields in small volumes), substitute extreme values for missing values that signal business events such as first-time listers, and impute with segment means. Segment means are generally based on market, room type, and seasonality.
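As a minimal sketch of the "exclude unreasonable values" strategy, percentile-based trimming can be written as below. The helper name `trim_by_percentile` and the toy prices are assumptions for illustration; the notebook itself reads the cut-offs from `describe()` and applies them as fixed thresholds.

```python
import pandas as pd

def trim_by_percentile(frame, column, lower_q=0.005, upper_q=0.99):
    """Keep rows whose `column` value lies within the given quantile bounds."""
    lower = frame[column].quantile(lower_q)
    upper = frame[column].quantile(upper_q)
    return frame[frame[column].between(lower, upper)]

# Toy prices (made up): one negative value and one extreme outlier.
toy = pd.DataFrame({"price": [-5.0] + [float(p) for p in range(50, 250)] + [9000.0]})
trimmed = trim_by_percentile(toy, "price")
print(len(toy), len(trimmed))
```

Trimming on quantiles always removes some legitimate rows at both ends, so the bounds are worth checking against the printed percentiles before applying them.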

The following data cleaning steps are applied to the dataset in sequential order:

1. Remove 131 duplicate records.
2. m_effective_daily_price has negative prices as well as extremely large prices. I excluded the negative and very small values by using the 0.5th percentile, at 25 dollars, as the cut-off point. These values are likely due to a system bug. I also excluded data points above the 99th percentile, at 1,000 dollars. These may belong to luxury listings, which differ from the vast majority of listings in the current dataset. I exclude them here to avoid diluting the model; luxury listings would be better handled separately.
3. m_checkouts, m_reviews, m_total_overall_rating, and m_professional_pictures each have 187 missing values. ds_checkin_gap, ds_checkout_gap, m_minimum_nights, and m_maximum_nights each have 2,174 missing values. The missing values appear to coincide across these fields. Since the number of affected observations is small compared to the overall data size, I exclude these rows from the dataset.
4. In the minimum-nights and maximum-nights fields (m_minimum_nights and m_maximum_nights), some values are extremely large, especially for minimum nights, which has a maximum input of 1000. The extreme values may be due to system issues. I exclude them using the 99th percentile levels: I keep rows with minimum nights of 31 days or less and maximum nights of 1,125 days or less. Also, the 2,174 rows with missing minimum and maximum nights are small relative to the overall dataset, so I removed these rows from the model development data.
5. For the image quality score, there are 13,181 missing values after the aforementioned steps. Since the missing data volume is sizable, I decided to impute the missing values using a correlated field, the number of professional pictures. For a missing image quality score with 0 professional pictures, I imputed the average non-missing image quality score of listings with 0 professional pictures. For a missing image quality score with at least 1 professional picture, I imputed the average non-missing image quality score of listings with at least 1 professional picture. The difference in mean image quality score between the two groups is large (0.4746 for 0 professional pictures versus 0.7779 for at least 1).
6. days_since_last_booking and price_booked_most_recent both have 34,767 missing values, which is sizable. After inspecting the summary statistics of other fields, these missing values appear to come from first-time listings, which intuitively should not have values in either field. As a result, I fill these missing values with -999 to flag that they are first-time listings. Furthermore, since I use a random forest classifier, which is a tree-based model, the large magnitude of the fill value should have minimal influence.
7. I noticed that both r_kdt_m_effective_daily_price_n100_p50 and r_kdt_m_effective_daily_price_available_n100_p50 have a small number of data points with 0 values. I excluded these rows. Even though zero values are possible in principle, these are probably due to system bugs.
8. The listing views and occupancy fields (listing_m_listing_views_2_6_ds_night_decay, occ_occupancy_plus_minus_7_ds_night, occ_occupancy_plus_minus_14_ds_night, and occ_occupancy_trailing_90_ds, handled via listing_lookup, occ_7_lookup, occ_14_lookup, and occ_90_lookup in the code) also have sizable missing data. Since these fields are direct proxies of market demand, I chose to impute them using their segment means. I first tried dim_market, dim_room_type, kdt_score, and ds_night_day_of_year as a hierarchical segmentation scheme. However, only a small share of the missing values could be mapped to empirical sub-segment means. Given the time constraint, I reduced the hierarchy to dim_market, dim_room_type, and month (month of the year) to fill all of the missing values.
9. For r_kdt_m_effective_daily_price_booked_n100_p50 (booked price), there are 12,581 missing values, which is sizable. I impute the missing values using a correlated variable, r_kdt_m_effective_daily_price_n100_p50 (daily price), and a segment variable, r_kdt_listing_views_0_6_avg_n100 (listing views). I first compute the ratio of booked price to daily price on the non-missing data. Then I split the data into listing views below 1 and listing views of 1 or above, and calculate the mean booked-to-daily price ratio for each segment. The imputed booked price equals the row's daily price multiplied by the mean ratio of its segment. In addition, all rows with a booked price of 0 are dropped.
10. For p2_p3_click_through_score and p3_inquiry_score, more than 65% of the values are missing, the most of any feature. Due to the size of the missing data, I decided to exclude these two features from the model development dataset.

Overall, the data cleaning steps removed 7,607 observations, or 4.13% of the original dataset, and dropped two columns entirely: p2_p3_click_through_score and p3_inquiry_score.
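Step 8 describes falling back from a fine segmentation to a coarser one when a segment has no observed values. A minimal sketch of that idea follows; the helper `impute_with_fallback` and the toy columns `market`, `room`, and `occ` are hypothetical and not taken from the dataset.

```python
import numpy as np
import pandas as pd

def impute_with_fallback(frame, column, hierarchies):
    """Fill NaNs in `column` with segment means, trying finer segments first."""
    filled = frame[column].copy()
    for keys in hierarchies:
        # Per-row segment mean, computed from the observed values only.
        segment_mean = frame.groupby(keys)[column].transform("mean")
        filled = filled.fillna(segment_mean)
    return filled

toy = pd.DataFrame({
    "market": ["A", "A", "A", "B", "B", "B"],
    "room":   ["x", "x", "y", "x", "x", "y"],
    "occ":    [0.2, np.nan, np.nan, 0.6, 0.8, np.nan],
})
# Try (market, room) first, then fall back to market alone.
toy["occ_filled"] = impute_with_fallback(toy, "occ", [["market", "room"], ["market"]])
print(toy)
```

Rows whose fine segment has no observed values (here the `y` rooms) pick up the market-level mean instead of staying missing.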

In [None]:
df.duplicated(keep='first').sum()

In [None]:
df=df.drop_duplicates(keep='first')

In [None]:
p_check = [.005,.10, .25, .5, .75, .9 , .99]
df.describe(percentiles =p_check)

In [None]:
df_1 = df[(df.m_effective_daily_price >=25) & (df.m_effective_daily_price <= 1000)]

In [None]:
df_1.isnull().sum()

In [None]:
# Drop the small blocks of rows with co-occurring missing values (step 3).
df_2 = df_1[df_1.ds_checkin_gap.notna()]
df_2 = df_2[df_2.m_checkouts.notna()]
df_2 = df_2[df_2.r_kdt_m_effective_daily_price_available_n100_p50.notna()]

In [None]:
df_2.isnull().sum()

In [None]:
df_2.describe(percentiles=p_check)

In [None]:
# Keep rows at or below the cut-offs for minimum and maximum nights (step 4).
df_2 = df_2[df_2.m_minimum_nights <= 31]
df_2 = df_2[df_2.m_maximum_nights <= 1125]

In [None]:
df_2.isnull().sum()

In [None]:
with_pro_pic_mean = df_2[df_2.m_professional_pictures>0].image_quality_score.mean()
wo_pro_pic_mean = df_2[df_2.m_professional_pictures==0].image_quality_score.mean()
zero_index = df_2[(pd.isna(df_2.image_quality_score)) & (df_2.m_professional_pictures==0)].index
notzero_index = df_2[(pd.isna(df_2.image_quality_score)) & (df_2.m_professional_pictures>0)].index

In [None]:
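# Use .loc so the assignment writes to df_2 itself rather than to a temporary copy.
df_2.loc[zero_index, 'image_quality_score'] = wo_pro_pic_mean

In [None]:
df_2.loc[notzero_index, 'image_quality_score'] = with_pro_pic_mean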

In [None]:
df_2.isnull().sum()

In [None]:
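# Flag first-time listings with -999; assign back instead of using inplace on a column view.
df_2['days_since_last_booking'] = df_2['days_since_last_booking'].fillna(-999)
df_2['price_booked_most_recent'] = df_2['price_booked_most_recent'].fillna(-999)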

In [None]:
df_2.isnull().sum()

In [None]:
df_2.describe(percentiles = p_check)

In [None]:
df_2 = df_2[df_2.r_kdt_m_effective_daily_price_available_n100_p50  > 0]
df_2= df_2[df_2.r_kdt_m_effective_daily_price_n100_p50  >0]

In [None]:
df_2.describe()

In [None]:
df_2.isnull().sum()

In [None]:
df_2.ds_night = pd.to_datetime(df_2.ds_night)

In [None]:
df_2['month'] = pd.DatetimeIndex(df_2.ds_night).month

In [None]:
# Segment mean by market, room type and month, broadcast to every row (mean() ignores NaNs).
listing_lookup = df_2.groupby(['dim_market', 'dim_room_type', 'month']).listing_m_listing_views_2_6_ds_night_decay.transform('mean')
# Fill only the missing values with their segment mean.
df_2['listing_m_listing_views_2_6_ds_night_decay'] = df_2['listing_m_listing_views_2_6_ds_night_decay'].fillna(listing_lookup)

In [None]:
# Same segment-mean imputation for the +/- 7 day occupancy rate.
occ_7_lookup = df_2.groupby(['dim_market', 'dim_room_type', 'month']).occ_occupancy_plus_minus_7_ds_night.transform('mean')
df_2['occ_occupancy_plus_minus_7_ds_night'] = df_2['occ_occupancy_plus_minus_7_ds_night'].fillna(occ_7_lookup)

In [None]:
# Same segment-mean imputation for the +/- 14 day occupancy rate.
occ_14_lookup = df_2.groupby(['dim_market', 'dim_room_type', 'month']).occ_occupancy_plus_minus_14_ds_night.transform('mean')
df_2['occ_occupancy_plus_minus_14_ds_night'] = df_2['occ_occupancy_plus_minus_14_ds_night'].fillna(occ_14_lookup)


In [None]:
# Same segment-mean imputation for the trailing 90 day occupancy rate.
occ_90_lookup = df_2.groupby(['dim_market', 'dim_room_type', 'month']).occ_occupancy_trailing_90_ds.transform('mean')
df_2['occ_occupancy_trailing_90_ds'] = df_2['occ_occupancy_trailing_90_ds'].fillna(occ_90_lookup)


In [None]:
df_2.isnull().sum()

In [None]:
df_2[pd.isna(df_2.r_kdt_m_effective_daily_price_booked_n100_p50)].describe(percentiles = p_check)

In [None]:
df_2['booked_to_daily_ratio'] = df_2.r_kdt_m_effective_daily_price_booked_n100_p50 / df_2.r_kdt_m_effective_daily_price_n100_p50  

In [None]:
below_one_mean = df_2[df_2.r_kdt_listing_views_0_6_avg_n100<1].booked_to_daily_ratio.mean()
above_one_mean = df_2[df_2.r_kdt_listing_views_0_6_avg_n100>=1].booked_to_daily_ratio.mean()

In [None]:
booked_reference_df = df_2.loc[df_2.r_kdt_m_effective_daily_price_booked_n100_p50.isnull(), ['r_kdt_listing_views_0_6_avg_n100','r_kdt_m_effective_daily_price_n100_p50']]

In [None]:
# Pick each row's segment ratio (listing views below 1 versus 1 or above)
# and scale the row's KDT daily price by it.
segment_ratio = np.where(booked_reference_df.r_kdt_listing_views_0_6_avg_n100 < 1, below_one_mean, above_one_mean)
replaced_book_price = segment_ratio * booked_reference_df.r_kdt_m_effective_daily_price_n100_p50

df_2.loc[df_2.r_kdt_m_effective_daily_price_booked_n100_p50.isnull(), 'r_kdt_m_effective_daily_price_booked_n100_p50'] = replaced_book_price

In [None]:
df_2=df_2[df_2.r_kdt_m_effective_daily_price_booked_n100_p50 > 0]

In [None]:
df_2 = df_2.drop(columns='booked_to_daily_ratio')

In [None]:
print(f'{(len(df) - len(df_2)) / len(df) * 100:.2f}% of observations were dropped during the data cleaning step')

In [None]:
len(df)- len(df_2)

In [None]:
df_V1 = df_2.copy()

# Feature Engineering

To complete the model development dataset, irrelevant fields such as the response flag, listing IDs, and dates are excluded from the model features. From the remaining features, 2 sets of new variables are constructed.

The first set of variables (4 variables altogether) tries to capture competitive pricing information. Three of them are derived by dividing the effective price of the listing by the KDT daily price, available price, and booked price respectively. These variables tell us whether the subject's listing price is below, equal to, or above the three competitor median price metrics. The fourth is the ratio of cleaning fee to listing price, which tries to capture guests' price sensitivity towards the cleaning fee.

The second set of variables (5 variables altogether) tries to capture market demand information. The first variable is the ratio of general market available listings to general market unique searchers, which proxies market supply surplus or shortage. The second variable is the ratio of KDT average listing views to the KDT number of available listings, which proxies local market supply surplus or shortage. The remaining 3 ratios are constructed from the general market unique searchers, contacts, reservation requests, and booked variables. Borrowing the funnel concept from website conversion rate analysis, the contacts to unique searchers ratio evaluates the 'conversion' from search to contact, whereas the requests to contacts ratio evaluates the 'conversion' from contact to making a request. Lastly, the booked to requests ratio evaluates the final-stage conversion rate.

After the construction of these 9 variables, the original source variables are dropped from the model features since they carry redundant information. I also tried a test model that included the original source variables. However, model results did not improve with them in the feature list, which suggests that these variables don't offer additional information for predicting the request-to-book probability.
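The ratio features are plain element-wise divisions. A minimal sketch with a guard against zero denominators is shown below; the helper `safe_ratio` and the toy column names are hypothetical, and whether zero denominators actually occur in the general market columns is not stated in this notebook.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator, denominator):
    """Element-wise ratio that yields NaN instead of inf when the denominator is 0."""
    return numerator / denominator.replace(0, np.nan)

toy = pd.DataFrame({
    "listing_price": [80.0, 120.0, 100.0],
    "kdt_median_available_price": [100.0, 100.0, 0.0],
    "searchers": [200.0, 0.0, 50.0],
    "contacts": [20.0, 5.0, 10.0],
})
# Price competitiveness: below 1 means cheaper than the local median.
toy["price_to_available"] = safe_ratio(toy["listing_price"], toy["kdt_median_available_price"])
# Funnel step: share of searchers who went on to contact a host.
toy["contact_to_search"] = safe_ratio(toy["contacts"], toy["searchers"])
print(toy[["price_to_available", "contact_to_search"]])
```

Returning NaN keeps the problem visible in `isnull().sum()` instead of passing infinite values on to the model.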

In [None]:
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, classification_report, accuracy_score

In [None]:
y=df_V1['dim_is_requested']

In [None]:
# Convert the boolean target to 0/1 without modifying df_V1 in place.
y = y.astype(int)

In [None]:
df_V1['kdt_ListingtoDaily_Ratio'] = df_V1.m_effective_daily_price / df_V1.r_kdt_m_effective_daily_price_n100_p50 

In [None]:
df_V1['kdt_ListingtoAvailable_Ratio'] = df_V1.m_effective_daily_price / df_V1.r_kdt_m_effective_daily_price_available_n100_p50 

In [None]:
df_V1['kdt_ListingtoBooked_Ratio'] = df_V1.m_effective_daily_price / df_V1.r_kdt_m_effective_daily_price_booked_n100_p50 

In [None]:
df_V1['cleanfee_proportion'] =df_V1.m_pricing_cleaning_fee / df_V1.m_effective_daily_price

In [None]:
df_V1['general_available_to_search']= df_V1.m_available_listings_ds_night / (df_V1.general_market_m_unique_searchers_0_6_ds_night)

In [None]:
df_V1['kdt_view_to_available'] = df_V1.r_kdt_listing_views_0_6_avg_n100 /df_V1.r_kdt_n_available_n100

In [None]:
df_V1['general_contact_to_search'] = df_V1.general_market_m_contacts_0_6_ds_night / (df_V1.general_market_m_unique_searchers_0_6_ds_night)

In [None]:
df_V1['general_requested_to_contact'] = df_V1.general_market_m_reservation_requests_0_6_ds_night / (df_V1.general_market_m_contacts_0_6_ds_night)

In [None]:
df_V1['general_booked_request'] = df_V1.general_market_m_is_booked_0_6_ds_night  / (df_V1.general_market_m_reservation_requests_0_6_ds_night) 

In [None]:
df_V1.describe()

In [None]:
X = df_V1.drop(columns = ['dim_is_requested', 'ds_night', 'ds', 'id_listing_anon', 'id_user_anon', 'dim_lat', 'dim_lng', 'p2_p3_click_through_score', 'p3_inquiry_score', 'month', 'r_kdt_m_effective_daily_price_booked_n100_p50', 'r_kdt_m_effective_daily_price_available_n100_p50','r_kdt_m_effective_daily_price_n100_p50','r_kdt_n_available_n100', 'r_kdt_listing_views_0_6_avg_n100', 'm_available_listings_ds_night', 'general_market_m_unique_searchers_0_6_ds_night', 'general_market_m_reservation_requests_0_6_ds_night', 'general_market_m_contacts_0_6_ds_night', 'general_market_m_is_booked_0_6_ds_night'])

In [None]:
X = pd.get_dummies(X, drop_first = True)

# Model Fitting

Due to time constraints, I only used a random forest classifier to model the request-to-book probability. The cleaned dataset with the new variables was first divided into training and test sets, with 30% held out for testing. Then I used grid search with 3-fold cross-validation to tune the parameters of the random forest classifier. I focused on the n_estimators and min_samples_leaf parameters; the best values are 50 trees and 617 samples per leaf, respectively.

Given more time, I would expand the parameter tuning by including max_features and by widening the ranges of the grid search. I would also try a couple of other models and pick the one with the best accuracy score.

The random forest model has an accuracy of 78.57% on the test set, which is an 11.53 percentage point improvement over the 67.04% achieved by always predicting the majority class in the raw data. Out-of-sample and in-sample AUC are 85.08% and 85.63% respectively; both are high and close to each other, which suggests little overfitting.
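The comparison against the majority-class rate can be sketched as follows. The labels and predictions are toy values chosen for illustration, not model output, and the helper names are assumptions.

```python
import numpy as np

def majority_baseline_accuracy(y_true):
    """Accuracy of always predicting the most frequent class."""
    positive_rate = float(np.mean(y_true))
    return max(positive_rate, 1 - positive_rate)

def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

# Toy labels: 3 of 10 nights were requested; the toy predictions miss one of them.
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0, 0, 1, 0])

baseline = majority_baseline_accuracy(y_true)
model_acc = accuracy(y_true, y_pred)
lift_points = (model_acc - baseline) * 100  # difference in percentage points
print(baseline, model_acc, lift_points)
```

Reporting the difference in percentage points avoids confusion with a relative improvement, which would be a different number.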

In [None]:
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size = 0.3, random_state =18)

In [None]:
# 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in recent scikit-learn releases.
rfc_full = RandomForestClassifier(n_jobs=-1, max_features='sqrt', oob_score=True, random_state=18)

param_grid= {
    "n_estimators" : [50, 100,150,200],
    "min_samples_leaf" :[int(len(X_train)/200), int(len(X_train)/100), int(len(X_train)/50)]
}

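# cv=3 pins the 3-fold cross-validation described above (newer scikit-learn versions default to 5 folds).
CV_rfc_full = GridSearchCV(estimator=rfc_full, param_grid=param_grid, cv=3)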
CV_rfc_full.fit(X_train, y_train)
print(CV_rfc_full.best_params_)
print(CV_rfc_full.best_score_)

In [None]:
y_pred_proba = CV_rfc_full.predict_proba(X_test)[:,1]
roc_auc_score(y_test, y_pred_proba)

In [None]:
y_train_pred_proba = CV_rfc_full.predict_proba(X_train)[:,1]
roc_auc_score(y_train, y_train_pred_proba)

In [None]:
y_pred=CV_rfc_full.predict(X_test)
accuracy_score(y_test, y_pred)

In [None]:
1-y_train.mean()

# Model Insights

The random forest classifier tells us that 6 variables have strong predictive power: average listing views, the trailing 90-day occupancy rate, the KDT listing to available price ratio, the KDT listing to daily price ratio, the general market available listings to unique searchers ratio, and the KDT listing to booked price ratio. These features make sense since they represent market demand and competitive pricing.

I also fitted a simple decision tree with 3 levels of splits to get a sense of what the splits look like. The first split is on the trailing 90-day occupancy rate. After that, the splits are determined by the recent number of listing views (market interest) and the listing price relative to competitor prices (either daily price or available price). The competitive-pricing splits all point towards below-market listing prices attracting more requests, which makes intuitive sense. Average listing views is an intuitive variable as well, but it is also a tricky one. The 75th percentile of average listing views in the dataset is only 0.33, and more than half of the sample has no listing views at all. This is valuable information for the marketing team, as more resources may be needed to increase listing views by potential guests.
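When Graphviz is not available, the splits of a shallow tree can also be read as plain text. The sketch below uses scikit-learn's `export_text` on synthetic data; the feature names `views` and `price_ratio` and the data-generating rule are made up for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: a request happens whenever the made-up views feature exceeds 0.5.
rng = np.random.default_rng(0)
views = rng.uniform(0, 2, size=200)
price_ratio = rng.uniform(0.5, 1.5, size=200)
X_toy = np.column_stack([views, price_ratio])
y_toy = (views > 0.5).astype(int)

toy_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_toy, y_toy)
# Render the fitted splits as indented text rules.
rules = export_text(toy_tree, feature_names=["views", "price_ratio"])
print(rules)
```

The text output lists each split threshold and the predicted class at every leaf, which is usually enough to inspect a depth-3 tree.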

In [None]:
importances = CV_rfc_full.best_estimator_.feature_importances_
imp_df = pd.DataFrame(importances)
imp_df.index = X_train.columns
imp_df.columns=['Relative_Importance']
imp_df.sort_values(by='Relative_Importance', ascending=False, inplace=True)

In [None]:
imp_df[:10].plot(kind='barh', width = 0.8)
plt.gca().invert_yaxis()
plt.xlabel('Relative Importance')
plt.ylabel('Variables')
plt.title('Relative Importance Chart')
plt.show()

In [None]:
from sklearn import tree
dtree=tree.DecisionTreeClassifier(max_depth= 3, min_samples_leaf= int(len(X_train)/200))
dtree.fit(X_train, y_train)

In [None]:
import os
os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'

In [None]:
from IPython.display import SVG
from graphviz import Source
graph = Source( tree.export_graphviz(dtree, out_file=None, feature_names=X_train.columns))
SVG(graph.pipe(format='svg'))

In [None]:
y.mean()

# Visualize the Request Rate with Respect to Key Variables

The request rate is plotted against the top 3 most important variables: average listing views, the KDT listing to available price ratio, and the trailing 90-day occupancy rate.

The graph on average listing views again tells us that listing views raise the request rate considerably, especially for listings that currently have none. As the zoomed-in graph indicates, simply having one listing view increases the conversion rate by a factor of about 3. Also, most of the model development sample (over 50%) has 0 listing views. If we can get potential guests to view some of these listings, the request rate should improve significantly.

Both the KDT listing to available price ratio and the trailing 90-day occupancy rate graphs are intuitive: the request rate increases as price declines relative to competitors, and it increases as prior occupancy increases (an indicator of either market demand or guest activity, which leads to more reviews and ratings).
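The binned request rates behind these graphs can be sketched with `pd.cut`, which keeps one row per bin even when a bin is empty. The helper `binned_rate`, the bin edges, and the toy data are assumptions for illustration; the cells in this notebook use `np.digitize` instead.

```python
import numpy as np
import pandas as pd

def binned_rate(frame, feature, target, edges):
    """Mean of `target` per `feature` bin; empty bins appear as NaN rows."""
    bins = pd.cut(frame[feature], bins=edges, include_lowest=True)
    return frame.groupby(bins, observed=False)[target].mean()

toy = pd.DataFrame({
    "views": [0.0, 0.0, 0.0, 0.0, 1.2, 1.5, 4.0, 4.5],
    "requested": [0, 0, 1, 0, 1, 1, 1, 0],
})
rates = binned_rate(toy, "views", "requested", edges=[0, 1, 2, 3, 5])
print(rates)
```

Because every bin is present in the result, the rates always line up with the bin labels, whereas plotting grouped means against a separately built array of bin edges only works when no bin is empty.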

In [None]:
bins = np.linspace(df_V1.listing_m_listing_views_2_6_ds_night_decay.min(), df_V1.listing_m_listing_views_2_6_ds_night_decay.max(), 20)
group = df_V1.groupby(np.digitize(df_V1.listing_m_listing_views_2_6_ds_night_decay, bins))

plt.plot(bins, group.dim_is_requested.mean())
plt.ylim([0.25, 1.05])
plt.ylabel('Request Rate')
plt.xlabel('Average Listing Views')
plt.title('Average Listing View to Request Rates')
plt.show()

In [None]:
bins = np.linspace(df_V1.listing_m_listing_views_2_6_ds_night_decay.min(), 5, 20)
group = df_V1.groupby(np.digitize(df_V1.listing_m_listing_views_2_6_ds_night_decay, bins))

plt.plot(bins, group.dim_is_requested.mean())
plt.ylim([0.2, 0.9])
plt.ylabel('Request Rate')
plt.xlabel('Average Listing Views')
plt.title('Average Listing View to Request Rates (Zoomed In)')
plt.show()

In [None]:
group.dim_is_requested.mean()

In [None]:
bins

In [None]:
group.dim_is_requested.mean()[5]/group.dim_is_requested.mean()[1]

In [None]:
len(df[df.listing_m_listing_views_2_6_ds_night_decay ==0]) / len(df)

In [None]:
bins = np.linspace(df_V1.kdt_ListingtoAvailable_Ratio.min(), 3, 40)
group = df_V1.groupby(np.digitize(df_V1.kdt_ListingtoAvailable_Ratio, bins))
plt.plot(bins, group.dim_is_requested.mean())
plt.ylim([0, 0.9])
plt.ylabel('Request Rate')
plt.xlabel('KDT Listing to Available Price Ratio')
plt.title('KDT Listing to Available Price Ratio to Request Rate')
plt.show()

In [None]:
bins = np.linspace(df_V1.occ_occupancy_trailing_90_ds.min(), df_V1.occ_occupancy_trailing_90_ds.max(), 20)
group = df_V1.groupby(np.digitize(df_V1.occ_occupancy_trailing_90_ds, bins))
plt.plot(bins, group.dim_is_requested.mean())
plt.ylim([0.1, 0.7])
plt.ylabel('Request Rate')
plt.xlabel('Past 90 Day Occupancy Rate')
plt.title('Past 90 Day Occupancy Rate to Request Rate')
plt.show()

# Host Only Feature Model

I also developed a model using only variables that the host can influence. Host-controllable features include effective price, cancellation policy, image quality, total number of reviews, overall rating, and so on. All market demand and supply features are excluded from this model.

Again, a random forest classifier is used to train the model. The model accuracy comes out at 74.45%, which is still higher than the baseline of 67.04%. Two of the important variables are interesting: the number of reviews and the number of overall ratings.

The graphs of request rate against the number of reviews and against the number of overall ratings both show a significant increase in request rate as the counts rise from low levels to around 70. In our sample, the median number of reviews and overall ratings is 3 and 15 respectively. Also, 29.5% of the sample listings have no reviews and 29.87% have no overall ratings. A campaign could coach these hosts so that they get reviews and overall ratings from their guests. Once they start to accumulate reviews and ratings, their request-to-book probability should improve.

In [None]:
feature = ['m_effective_daily_price', 'm_pricing_cleaning_fee', 'dim_is_instant_bookable', 'm_reviews', 'cancel_policy', 'image_quality_score', 'm_total_overall_rating', 'm_professional_pictures', 'dim_has_wireless_internet', 'm_minimum_nights', 'm_maximum_nights','kdt_ListingtoAvailable_Ratio', 'cleanfee_proportion']

In [None]:
# 'sqrt' is what 'auto' meant for classifiers; 'auto' was removed in recent scikit-learn releases.
rfc_user = RandomForestClassifier(n_jobs=-1, max_features='sqrt', oob_score=True, random_state=18)

param_grid= {
    "n_estimators" : [50, 100,150,200],
    "min_samples_leaf" :[int(len(X_train)/200), int(len(X_train)/100), int(len(X_train)/50)]
}

# cv=3 keeps the same 3-fold cross-validation as the full model.
CV_rfc_user = GridSearchCV(estimator=rfc_user, param_grid=param_grid, cv=3)
CV_rfc_user.fit(X_train[feature], y_train)
print(CV_rfc_user.best_params_)
print(CV_rfc_user.best_score_)

In [None]:
user_pred_proba = CV_rfc_user.predict_proba(X_test[feature])[:,1]
roc_auc_score(y_test, user_pred_proba)

In [None]:
user_train_pred_proba = CV_rfc_user.predict_proba(X_train[feature])[:,1]
roc_auc_score(y_train, user_train_pred_proba)

In [None]:
user_pred=CV_rfc_user.predict(X_test[feature])
accuracy_score(y_test, user_pred)

In [None]:
importances = CV_rfc_user.best_estimator_.feature_importances_
imp_df = pd.DataFrame(importances)
imp_df.index = X_train[feature].columns
imp_df.columns=['Relative_Importance']
imp_df.sort_values(by='Relative_Importance', ascending=False, inplace=True)

In [None]:
imp_df.plot(kind='barh', width = 0.8)
plt.gca().invert_yaxis()
plt.xlabel('Relative Importance')
plt.ylabel('Variables')
plt.title('Relative Importance Chart')
plt.show()

In [None]:
bins = np.linspace(df_V1.m_reviews.min(), df_V1.m_reviews.max(), 20)
group = df_V1.groupby(np.digitize(df_V1.m_reviews, bins))

plt.plot(bins, group.dim_is_requested.mean())
plt.ylim([0.2, 1.05])
plt.ylabel('Request Rate')
plt.xlabel('Number of Reviews')
plt.title('Number of Reviews to Request Rate')
plt.show()

In [None]:
bins = np.linspace(df_V1.m_reviews.min(), 100, 20)
group = df_V1.groupby(np.digitize(df_V1.m_reviews, bins))


plt.plot(bins, group.dim_is_requested.mean())
plt.ylim([0.2, 0.7])
plt.ylabel('Request Rate')
plt.xlabel('Number of Reviews')
plt.title('Number of Reviews to Request Rate (Zoomed-In)')
plt.show()

In [None]:
bins = np.linspace(df_V1.m_total_overall_rating.min(), 400, 20)
group = df_V1.groupby(np.digitize(df_V1.m_total_overall_rating, bins))

plt.plot(bins, group.dim_is_requested.mean())
plt.ylim([0.2, 0.7])
plt.ylabel('Request Rate')
plt.xlabel('Number of Overall Rating')
plt.title('Number of Overall Rating to Request Rate')
plt.show()

In [None]:
bins = np.linspace(df_V1.m_total_overall_rating.min(), 100, 20)
group = df_V1.groupby(np.digitize(df_V1.m_total_overall_rating, bins))

plt.plot(bins, group.dim_is_requested.mean())
plt.ylim([0.17, 0.6])
plt.ylabel('Request Rate')
plt.xlabel('Number of Overall Rating')
plt.title('Number of Overall Rating to Request Rate (Zoomed-In)')
plt.show()

In [None]:
len(df_V1[df_V1.m_reviews==0]) / len(df_V1)

In [None]:
len(df_V1[df_V1.m_total_overall_rating==0]) / len(df_V1)