<h1 align="center">Is your Airbnb listing "good" enough?</h1>


Currently, Airbnb has more than five million listings around the world, and that number is growing rapidly even at this moment. This is great, right? It means there are more options for guests! But is it also great for hosts? Unfortunately, not so much: it means the number of their competitors is constantly increasing as well.

Imagine there are two listings with the exact same type of property, price and neighbourhood. One might be so popular that it is really hard to book, whereas the other might always be free. So what makes the difference? Perhaps it is the listing's title and description. Here's another example: two places might have the exact same title and description but sit in different neighbourhoods. The point is this: a listing's popularity results from a combination of many factors, and it is difficult to find out what will make a listing popular.

But don't worry, here's the good news. With machine learning, you can actually predict which combinations of features will increase a listing's popularity. In fact, that's what this blog article is all about. To demonstrate, we'll use a `New York City Airbnb listings` dataset from 2019 (https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data) to figure out what makes an Airbnb listing in New York popular.

<br/>

## The Quest Begins...

This is what the `New York City Airbnb listings` dataset looks like:
<br/><br/>

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the listings and hold out 10% of the rows as a test set
df = pd.read_csv('AB_NYC_2019.csv')
df_train, df_test = train_test_split(df, test_size=0.1, random_state=123)
df_train.head()
```

|  | id | name | host_id | host_name | neighbourhood_group | neighbourhood | latitude | longitude | room_type | price | minimum_nights | number_of_reviews | last_review | reviews_per_month | calculated_host_listings_count | availability_365 |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 34702 | 27524836 | Quiet and cozy room in Clinton Hill | 7098693 | Neha | Brooklyn | Bedford-Stuyvesant | 40.69073 | -73.95965 | Private room | 75 | 3 | 0 | NaN | NaN | 1 | 0 |
| 5488 | 3970755 | Sunny Junior 1 Bedroom | 3164949 | Robert | Brooklyn | Cobble Hill | 40.68781 | -73.99415 | Entire home/apt | 160 | 4 | 0 | NaN | NaN | 1 | 0 |
| 19856 | 15911137 | Amazing 2 bed 2 bath in Central Harlem. | 1301576 | Siobhan | Manhattan | Harlem | 40.80811 | -73.94486 | Entire home/apt | 200 | 7 | 3 | 2018-09-08 | 0.09 | 2 | 0 |
| 40998 | 31883127 | ✴NEWLY RENOVATED✴ 2 BDR \| SLEEPS 4 @ BROOKLYN | 6833598 | Timothy | Brooklyn | Bedford-Stuyvesant | 40.68417 | -73.91899 | Entire home/apt | 99 | 2 | 13 | 2019-07-05 | 3.98 | 1 | 182 |
| 33208 | 26205265 | HUGE APT on border of EAST VILLAGE & GRAMERCY! | 196961556 | Mike | Manhattan | Gramercy | 40.73338 | -73.98315 | Entire home/apt | 499 | 4 | 41 | 2019-06-30 | 3.31 | 1 | 10 |


<br/><br/>

As a measure of popularity, we will look at the column `reviews_per_month`. That is, we will try to predict the number of reviews per month given a certain combination of features. Before we start, let's perform some Exploratory Data Analysis (EDA) to better understand the dataset. We will use `pandas_profiling`, a handy tool for this purpose because it provides a visual summary.


<br/><br/>

```python
from pandas_profiling import ProfileReport

# Build the profiling report (pass minimal=True for a lighter, faster report)
profile = ProfileReport(df, title='Pandas Profiling Report')
profile.to_notebook_iframe()
```

<br/>

## Data Preparation


From the `pandas_profiling` report above, it seems there are quite a few categorical variables. Unfortunately, most scikit-learn functions work only with numerical data, so some pre-processing is required. After some outlier detection and removal, here's what we've done:

1. Split out the column `reviews_per_month` because this is what we want to predict.
2. Turn the categorical features (e.g., `neighbourhood`) into numerical features via One-Hot Encoding.
3. Count Vectorize the `name` column (i.e., represent each title by how many times each word occurs in it), since the words in the title could be a good predictor of popularity.
4. Drop any unnecessary or irrelevant features (e.g., `id`, `last_review`).
5. Drop any features that are obviously correlated with our target column, `reviews_per_month` (e.g., `number_of_reviews`).
6. Scale all the numeric features.
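
The steps above could be sketched with a scikit-learn `ColumnTransformer`. This is a minimal illustration, not the exact pipeline used for this article: the toy rows are made up, and the assignment of columns to transformers and the `max_features=100` setting are assumptions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy rows shaped like the dataset (values are made up for illustration)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "name": ["Sunny room in Harlem", "Cozy studio", "Sunny loft", "Quiet cozy room"],
    "neighbourhood": ["Harlem", "Gramercy", "Harlem", "Cobble Hill"],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt", "Private room"],
    "price": [75, 160, 200, 99],
    "minimum_nights": [3, 4, 7, 2],
    "availability_365": [0, 10, 182, 365],
    "number_of_reviews": [0, 3, 13, 41],
    "last_review": [None, "2018-09-08", "2019-07-05", "2019-06-30"],
    "reviews_per_month": [0.0, 0.09, 3.98, 3.31],
})

# Step 1: split out the target
y = toy["reviews_per_month"]

# Steps 4 and 5: drop the target, irrelevant columns and the correlated column
X = toy.drop(columns=["reviews_per_month", "id", "last_review", "number_of_reviews"])

categorical = ["neighbourhood", "room_type"]
numeric = ["price", "minimum_nights", "availability_365"]

preprocessor = ColumnTransformer([
    # Step 2: one-hot encode the categorical features
    ("onehot", OneHotEncoder(handle_unknown="ignore"), categorical),
    # Step 3: count the words in the title (a single column name, not a list,
    # so that CountVectorizer receives a 1-D sequence of strings)
    ("title_words", CountVectorizer(max_features=100), "name"),
    # Step 6: scale the numeric features
    ("scale", StandardScaler(), numeric),
])

X_transformed = preprocessor.fit_transform(X)
print(X_transformed.shape)
```

Wrapping the transformers in one object means the same fitted encoding can later be applied to the test set with `preprocessor.transform`.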


<br/>

## So what tools do we have in the bucket?

After pre-processing our features, we tried fitting several different models on the dataset.
Here are the candidates:

* `DummyRegressor`: predicts the same constant value (e.g., the mean or median) for all rows
* `HuberRegressor`: linear regression model that is robust to outliers
* `CatBoostRegressor`: gradient-boosted tree regressor with special handling of categorical variables
* `RandomForestRegressor`: fits a number of decision trees on sub-samples of the dataset and averages their predictions

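A trial run of each model with its default settings led to the following scores (these appear to be R² values, the default `score` metric of scikit-learn regressors, which is consistent with the `DummyRegressor` scoring 0.0). The `CatBoostRegressor` is the winner, and the `RandomForestRegressor` follows closely.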

|     DummyRegressor     |        HuberRegressor        |  CatBoostRegressor    |     RandomForestRegressor    |
|:----------------------:|:-------------:|:------:|:------:|
| 0.0 |  0.098 | 0.458 | 0.402 |
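
A trial run like this could be sketched as follows. The data here is a synthetic stand-in produced by `make_regression`, so the printed scores will not match the table above, and `CatBoostRegressor` is left out only because it lives in a separate package; the 5-fold cross-validation is also an assumption.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import HuberRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the pre-processed listings (an assumption)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=123)

# Every model is used with its default settings
models = {
    "DummyRegressor": DummyRegressor(),
    "HuberRegressor": HuberRegressor(),
    "RandomForestRegressor": RandomForestRegressor(random_state=123),
}

# With no `scoring` argument, cross_val_score uses each regressor's own
# `score` method, which returns R^2
results = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}

for name, score in results.items():
    print(f"{name}: {score:.3f}")
```

The `DummyRegressor` row is the baseline: any model worth keeping should score clearly above it.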

<br/>

## Can we do better?

Yes we can! In fact, we can get a better result by tuning the settings (a.k.a. hyperparameters) of the model. For today, we will try optimizing the `RandomForestRegressor`. A quick glance at the RandomForestRegressor documentation tells us that it has a lot of settings to choose from (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html). But how are we supposed to know how to set these knobs so that the model works best on our problem?

That's where `RandomizedSearchCV` comes to the rescue. `RandomizedSearchCV` is part of the popular `scikit-learn` Python machine learning library. It works by trying out various settings (i.e., random values drawn from a given range) and telling you which combination worked best on your problem. We decided to fiddle with the parameters `n_estimators`, `max_depth` and `max_features`. In short, they control how complicated our model is. After this search, our model's score rose from 0.402 to 0.418.

Now let's dive into analyzing our results in detail. Although the `CatBoostRegressor` returned the best score of 0.458, the `RandomForestRegressor` also showed comparable performance with a decent score of 0.418 after hyperparameter optimization. In fact, the `RandomForestRegressor` gives results that are easier to interpret, and hence we will use them for our analysis.
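
A minimal sketch of such a search is shown here. The synthetic data, the search ranges, `n_iter=10` and `cv=3` are all assumptions chosen to keep the example fast; they are not the values used for this article.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the pre-processed listings (an assumption)
X, y = make_regression(n_samples=200, n_features=8, noise=20.0, random_state=123)

# Hypothetical search ranges; randint's upper bound is exclusive
param_distributions = {
    "n_estimators": randint(10, 100),  # number of trees
    "max_depth": randint(2, 20),       # how deep each tree may grow
    "max_features": randint(1, 9),     # features considered per split (1 to 8)
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=123),
    param_distributions,
    n_iter=10,         # number of random combinations to try
    cv=3,              # 3-fold cross-validation for each combination
    random_state=123,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` holds a model refit on all of the data with the best combination found.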

<br/>

## What is the verdict?

From the last section, we saw that the `RandomForestRegressor` reached a score of 0.42 for the predictions we made. What does this score mean? Assuming it is the R² that scikit-learn regressors report by default, the model explains roughly 42% of the variance in reviews per month; it does not mean that we predicted the correct number of reviews about half of the time. And beyond the score itself, which features did the `RandomForestRegressor` use for making its predictions, and are they reasonable?

To gain more insight into the results, let's use SHAP (SHapley Additive exPlanations). SHAP is a useful tool for analyzing the contribution of each feature to our prediction, and as a bonus it provides great visualizations too!

Let's first look at the SHAP graphs for the `RandomForestRegressor`. The first SHAP bar chart below shows you which feature has the highest impact on our prediction, `reviews_per_month`.

![shap_importance](shap_importance.png)

As seen above, the most important features appear to be `availability_365` and `minimum_nights`. Reading off the chart, `availability_365` has an impact of roughly 14 to 15% on the `reviews_per_month` outcome, and `minimum_nights` roughly 8 to 10%. However, this graph alone does not capture the direction of that impact. Does a minimum-nights constraint increase popularity, or decrease it? We cannot tell the relationship from this graph alone.

Thus we look at another graph as well. The graph below shows the relationship between each feature's value and its SHAP value. It seems that `availability_365` increases the predicted number of reviews per month, because most of the red dots (high feature values) lie on the right side. It also seems that `minimum_nights` decreases the predicted number of reviews per month, because most of its red dots lie on the left side.

![shap_graph](shap_graph.png)
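
The quantities behind these two kinds of graphs could be computed roughly as follows. This is a sketch on synthetic data with hypothetical feature positions, not the code that produced the figures above; `shap` is a third-party package, so the sketch skips the SHAP part when it is not installed.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

try:
    import shap  # third-party package, installed separately
except ImportError:
    shap = None

# Synthetic stand-in for the pre-processed listings (an assumption)
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=123)
model = RandomForestRegressor(n_estimators=50, random_state=123).fit(X, y)

if shap is not None:
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)  # one row per sample, one column per feature

    # Bar chart quantity: mean absolute SHAP value per feature (overall impact)
    mean_abs = np.abs(shap_values).mean(axis=0)

    # Direction of the impact: does a higher feature value push predictions up?
    direction = [
        np.corrcoef(X[:, j], shap_values[:, j])[0, 1] for j in range(X.shape[1])
    ]

    for j in range(X.shape[1]):
        print(f"feature {j}: mean |SHAP| = {mean_abs[j]:.2f}, direction = {direction[j]:+.2f}")

    # Plots similar to the ones above:
    # shap.summary_plot(shap_values, X, plot_type="bar")
    # shap.summary_plot(shap_values, X)
```

A positive direction corresponds to red dots on the right of the summary plot, and a negative one to red dots on the left.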

<br/>

## Conclusion and the Caveats

All in all, the summary from the model `RandomForestRegressor` is this: 
* As a host, your listing can become more popular if you make the place available throughout the year (365 days).
* Your listing is likely to be less popular if you set a high minimum number of nights a visitor must stay, so consider relaxing this restriction.

However, there are some caveats that should be noted. First, is `reviews_per_month` an accurate proxy for the popularity of a listing? By definition, 'popularity' is "[being] liked, enjoyed, or supported by many people" (Cambridge Dictionary). What if a listing has a high number of reviews, but most of them are negative? Right now, the underlying assumption is that all reviews are positive, but in real life this might not be the case.

Furthermore, our data covers only a single year, 2019. This is a very small sample for predicting the popularity of future listings from 2020 to 2050. The more data we have, the more there is to learn from, and the more confident we can be in our results. Also consider this: what if a pandemic arose in 2019? Then all of our analysis above could generalize badly to the future. Or what if a neighbourhood in New York gets demolished? Future circumstances and trends can never be fully predicted, so we can never be 100% confident in our predictions. It is important to keep in mind that our model should not be fully trusted.

One last thing we want to mention: scikit-learn cannot figure out by itself which features should be dropped. For example, consider the column `id` in our initial dataset. It has a high cardinality and can look like a useful column from scikit-learn's perspective, but as humans we know that `id` carries no meaning. A noise column like this can be given a large importance value, taking importance away from genuinely significant features. It's up to us humans, with our domain knowledge, to decide which features to drop. Unfortunately, we may lack that domain knowledge, or simply overlook a feature that needed to be cleaned. If this happens, scikit-learn will not catch the problem for us, and the result will be an inaccurate, biased interpretation.
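
The effect of such a noise column can be illustrated with a small experiment on synthetic data. The column names `signal` and `noise_id` are hypothetical, and the impurity-based `feature_importances_` of a random forest are used here only as one convenient importance measure.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(123)
n = 500

signal = rng.normal(size=n)                   # genuinely predictive feature
noise_id = rng.permutation(n).astype(float)   # unique, id-like, unrelated to y
y = 3.0 * signal + rng.normal(scale=1.0, size=n)

X = np.column_stack([signal, noise_id])
model = RandomForestRegressor(n_estimators=100, random_state=123).fit(X, y)

# Impurity-based importances sum to 1 across all features
importances = dict(zip(["signal", "noise_id"], model.feature_importances_))
print(importances)
```

Even though `noise_id` has nothing to do with the target, the forest still splits on it while fitting the noise in `y`, so it receives a non-zero share of the importance that would otherwise go to `signal`.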


<br/>

## Reference:

- https://dictionary.cambridge.org/dictionary/english/popular