## AirBnB Seattle Analysis with CRISP-DM (Cross Industry Standard Process for Data Mining) 

A brief example analysis using the CRISP-DM methodology. This methodology structures a data analysis into the following steps:

* Business Understanding
* Data Understanding
* Data Preparation
* Modeling
* Evaluation
* Deployment

### Business Understanding

Imagine that you have a room or an apartment in Seattle and you would like to offer it via Airbnb. To help with the initial setup and listing, you would like some simple insights into the Airbnb market. The following questions give a good general overview of that market.

* How do room type and part of the city relate to the price?
* Is there a price difference between room types and bed types?
* Are bathrooms or bedrooms important perks that influence the price?
* What are the top amenities offered?
* Which months are cheaper to visit Seattle?
* Which features or characteristics influence the price of a room?

In [None]:
# Here I just add some libraries, which will be useful during the process.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error
import seaborn as sns
from sklearn.preprocessing import MultiLabelBinarizer, LabelEncoder, OneHotEncoder, StandardScaler
import sklearn.metrics as mtr
import math
%matplotlib inline

### Data Understanding

The second step is to explore the data. The [AirBnB Seattle Dataset](https://www.kaggle.com/airbnb/seattle/data) provides three main datasets, which we will analyse:

* calendar.csv ==> Booking information of houses in Seattle.
* listings.csv ==> Information of houses in Seattle.
* reviews.csv ==> Reviews of houses in Seattle.


In [None]:
# Loading the datasets in pandas

df_calendar = pd.read_csv('datasets/calendar.csv')
df_reviews = pd.read_csv('datasets/reviews.csv')
df_listings = pd.read_csv('datasets/listings.csv')

print('The dataset calendar has {} rows and {} columns.'.format(df_calendar.shape[0], df_calendar.shape[1]))
print('The dataset reviews has {} rows and {} columns.'.format(df_reviews.shape[0], df_reviews.shape[1]))
print('The dataset listings has {} rows and {} columns.'.format(df_listings.shape[0], df_listings.shape[1]))

#### Data Cleaning & Exploration Strategy 
Based on the size of the datasets, we see that the calendar has the most entries and the listings has the most columns (features).

The first step is to sample the first 5 rows of each of the three datasets, to see what information is provided and to search for information that can help us answer the business questions.

In a further step, we will discard or impute some values, depending on the information needed.

In [None]:
df_calendar.head(5)

In [None]:
# Now let's take a look at the null values contained in the dataframe
null_ratio_calendar = df_calendar.isnull().sum()/df_calendar.shape[0]
null_ratio_calendar

It looks like the price column has 32.93% missing values. In this case, it may be convenient to drop those rows.
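
As a minimal sketch of that option, the snippet below computes the null ratio and drops the rows without a price. The frame `toy_calendar` and its values are invented for illustration and only mimic the layout of `df_calendar`.

```python
import pandas as pd

# Toy stand-in for df_calendar; the rows are made up for illustration.
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 2, 2],
    "date": ["2016-01-04", "2016-01-05", "2016-01-04", "2016-01-05"],
    "price": ["$85.00", None, "$120.00", None],
})

# Share of missing values per column (equivalent to isnull().sum() / n_rows).
null_ratio = toy_calendar.isnull().mean()

# Keep only the rows that actually carry a price.
toy_priced = toy_calendar.dropna(subset=["price"])

print(null_ratio)
print(toy_priced)
```

Dropping is the simplest choice here; imputing a price would require an assumption about why the value is missing.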

In [None]:
df_reviews.head(5)

In [None]:
null_ratio_reviews = df_reviews.isnull().sum()/df_reviews.shape[0]
null_ratio_reviews

Here the percentage of guests who do not leave a comment is very low. It seems that most guests have something to mention about their experience.

In [None]:
df_listings.head(5)

In [None]:
pd.set_option('display.max_rows', 92)
null_ratio_listings = df_listings.isnull().sum()/df_listings.shape[0]
null_ratio_listings.sort_values(ascending=False)

In [None]:
# For the listings data, I am curious to know which columns do not have any null values.
no_null_listings = set(df_listings.columns[df_listings.isnull().mean()==0])
no_null_listings

This is great and simplifies our work, since we can use only the listings dataframe to answer most of our business questions.

### Data Preparation & Questions

Most of the questions will be answered while we prepare the data. Answering the last question requires a model, so for that one it is important to move on to the modeling in the next section.

#### What is the price relation between room type and suburb of the city?

In [None]:
df_listings['price'] = df_listings['price'].replace(r'[$,%]', '', regex = True).astype(float)
plt.figure(figsize=(16, 6))
sns.barplot(x="city", y="price", hue="room_type", data=df_listings)

In general, Seattle city and Seattle are the locations where most apartments are concentrated. It is quite interesting to see that the offer of shared rooms is very limited, which suggests that privacy is important.
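
The price column is stored as text, so the plotting cell above first strips the currency formatting with a regular expression. A small sketch of that conversion on made-up strings (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Illustrative price strings in the same style as the raw column.
raw_prices = pd.Series(["$85.00", "$1,250.00", "$40.00"])

# Remove "$", "," and "%" characters, then convert the remainder to float.
clean_prices = raw_prices.replace(r"[$,%]", "", regex=True).astype(float)

print(clean_prices.tolist())
```

Inside the character class `[...]` the `$` is treated as a literal character, so it does not need escaping.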

In [None]:
# What is the place with special characters?
df_listings.groupby('city').count()

# It is 西雅图 = Seattle in Chinese

It seems that there is a very small niche aimed only at Chinese speakers.
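
If these label variants were to be treated as a single location, one possible clean-up is sketched below. This is an optional step that the notebook itself does not perform; the series `toy_city` and the decision to map 西雅图 to Seattle are assumptions for illustration.

```python
import pandas as pd

# Made-up sample of city labels, including a trailing space and the Chinese spelling.
toy_city = pd.Series(["Seattle", "Seattle ", "西雅图", "West Seattle"])

# Strip surrounding whitespace, then map the Chinese label onto the English one.
city_clean = toy_city.str.strip().replace({"西雅图": "Seattle"})

print(city_clean.value_counts())
```

Merging the labels would reduce the number of dummy columns later on, at the cost of losing the distinction that hints at the niche mentioned above.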

#### Is there a price difference between room type and bed type?

In [None]:
plt.figure(figsize=(16, 5))
sns.barplot(x="room_type", y="price", hue="bed_type", data=df_listings)

There is no significant price difference across bed types within each room type. In apartments, a couch and a real bed do not differ significantly in price. This may indicate that the price is not driven by bed quality.

#### Are the number of bathrooms or bedrooms important perks to influence the price?

In [None]:
plt.figure(figsize=(16, 6))
sns.barplot(x="room_type", y="price", hue="bathrooms", data=df_listings)

In [None]:
plt.figure(figsize=(16, 6))
sns.barplot(x="room_type", y="price", hue="bedrooms", data=df_listings)

It seems that the price is highly correlated with the number of bedrooms and bathrooms. The effect is most significant in the entire home/apt category.
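
The bar plots give a visual impression; a correlation matrix is one way to put a number on it. The sketch below uses a tiny invented frame with the same column names as `df_listings`, so the resulting values say nothing about the real data.

```python
import pandas as pd

# Invented listings; only the column names match the real dataframe.
toy_listings = pd.DataFrame({
    "bedrooms": [1, 2, 3, 4],
    "bathrooms": [1.0, 1.0, 2.0, 2.0],
    "price": [80.0, 120.0, 200.0, 260.0],
})

# Pearson correlation between every pair of columns.
corr = toy_listings[["bedrooms", "bathrooms", "price"]].corr()

print(corr.round(3))
```

On the real data, the same call could be applied per room type (for example after filtering on `room_type`) to check whether the relation is stronger for entire homes.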

#### What is the proportion of min and max nights offered in the Seattle Airbnb market? (Bonus)

In [None]:
plt.figure(figsize=(16, 6))
sns.barplot(x="minimum_nights", y="price", hue="room_type", data=df_listings)

There is no clear pattern in the minimum number of nights.

#### Price vs Guest included by room type

In [None]:
plt.figure(figsize=(16, 6))
sns.barplot(x="room_type", y="price", hue="guests_included", data=df_listings)

#### What are the room types most reviewed?

In [None]:
plt.figure(figsize=(16, 6))
sns.barplot(x="room_type", y="number_of_reviews", data=df_listings)

#### What are the top amenities offered?

In [None]:
amt = df_listings['amenities'].apply(lambda x: [amenity.replace('"', "").replace("{", "").replace("}", "") 
                                               for amenity in x.split(",")])

mlb = MultiLabelBinarizer()
label_amt = pd.DataFrame(mlb.fit_transform(amt), columns=mlb.classes_, index=amt.index)

amt_count=label_amt.sum().sort_values(ascending=False)

d = {'Amenity': amt_count.index, 'Count': amt_count.values}
df_amt_count = pd.DataFrame(data=d)

plt.figure(figsize=(16, 6))
sns.barplot(x="Amenity", y="Count", data=df_amt_count.head(10))

Internet, kitchen, heating, smoke detector and essentials are all that you need to feel at home and make some savings during a trip.
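
The amenities column stores each listing's amenities as one brace-delimited string, which is why the cell above splits it before counting. A reduced sketch of the same idea on invented strings (the sample values are assumptions that only mirror the format):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Invented amenity strings in the brace-delimited format of the listings data.
toy_amenities = pd.Series([
    '{TV,"Wireless Internet",Kitchen}',
    '{"Wireless Internet",Heating}',
    '{Kitchen,"Wireless Internet"}',
])

# Drop the braces, split on commas, and remove the quotes around each amenity.
parsed = toy_amenities.apply(
    lambda s: [a.strip('"') for a in s.strip("{}").split(",") if a]
)

# One 0/1 column per amenity, then count how many listings offer each one.
mlb = MultiLabelBinarizer()
flags = pd.DataFrame(mlb.fit_transform(parsed), columns=mlb.classes_, index=parsed.index)
counts = flags.sum().sort_values(ascending=False)

print(counts)
```

The `if a` filter skips empty strings, so a listing with an empty `{}` entry would not create a blank amenity column.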

#### Price relation between instant bookable and room type

In [None]:
plt.figure(figsize=(16, 6))
sns.barplot(x="room_type", y="price", hue="instant_bookable", data=df_listings)

There is no significant difference. However, it seems that private rooms that are not instantly bookable are more expensive.

#### What months are cheaper to visit Seattle?

In [None]:
df_calendar['month'] = pd.DatetimeIndex(df_calendar['date']).month
df_calendar.dropna(subset=['price'], inplace=True)
df_calendar['price'] = df_calendar['price'].replace(r'[$,%]', '', regex = True).astype(float)

plt.figure(figsize=(16, 6))
sns.lineplot(x="month", y="price",marker="o", data=df_calendar)

June, July and August are the most expensive months to visit Seattle.
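
The line plot aggregates the daily prices by month. The same aggregation can be written out explicitly to read off the cheapest month; the sketch below uses an invented frame, so the month it reports is not a finding about Seattle.

```python
import pandas as pd

# Invented calendar rows with already-cleaned numeric prices.
toy_prices = pd.DataFrame({
    "date": ["2016-01-10", "2016-01-20", "2016-07-10", "2016-07-20"],
    "price": [100.0, 110.0, 150.0, 160.0],
})

# Extract the month number and average the price within each month.
toy_prices["month"] = pd.to_datetime(toy_prices["date"]).dt.month
monthly_mean = toy_prices.groupby("month")["price"].mean()

# Month with the lowest average price.
cheapest_month = monthly_mean.idxmin()

print(monthly_mean)
print("Cheapest month:", cheapest_month)
```

Applied to the cleaned `df_calendar`, sorting `monthly_mean` would give the full ranking from cheapest to most expensive month.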

#### Which features or characteristics influence the price of a room?

To answer this question, we need to move forward and fit a simple linear regression to predict prices. Depending on the features chosen, the accuracy of the predictions may or may not improve.

### Modeling


To get a simple approximation, I will base my features on review scores, section of the city, room type and number of bedrooms. For the moment I will discard features such as beds and bathrooms, to see whether this feature set is good enough to predict the price.

In [None]:
features = ['city',
            'price',
            'room_type',
            'bedrooms',
            'review_scores_accuracy',
            'review_scores_checkin',
            'review_scores_value',
            'review_scores_location',
            'review_scores_cleanliness',
            'review_scores_communication',
            'review_scores_rating',
            'reviews_per_month'
           ]

df_listings_reduced = df_listings[features]

# The decision here is to remove all rows with NaN values; as we saw before, they represent few incomplete rows (~17.23%)
df_listings_cleaned = df_listings_reduced.dropna()
df_listings_cleaned.shape

In [None]:
df_listings_cleaned.head()

In [None]:
# Now we encode city and room type as dummy variables
df_listings_encoded = pd.get_dummies(df_listings_cleaned, columns=['city', 'room_type'])
df_listings_encoded

In [None]:
df_listings_encoded.columns
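
The `pd.get_dummies` call above turns every category into its own indicator column named `<column>_<value>`, which explains column names such as `room_type_Private room`. A small sketch on invented rows:

```python
import pandas as pd

# Invented rows; only the column names follow the listings dataframe.
toy_rooms = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room"],
    "price": [150.0, 70.0, 40.0],
})

# Replace room_type with one indicator column per category.
encoded = pd.get_dummies(toy_rooms, columns=["room_type"])

print(list(encoded.columns))
```

Because the category value is copied verbatim into the column name, a label with a trailing space produces a separate column from the same label without it.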

In [None]:
# Here we start to do our model by a linear regression and introduce the features.

X = df_listings_encoded[['review_scores_accuracy',
                         'bedrooms',
                         'review_scores_checkin',
                         'review_scores_value',
                         'review_scores_location',
                         'review_scores_cleanliness',
                         'review_scores_communication',
                         'review_scores_rating',
                         'reviews_per_month',
                         'city_Ballard, Seattle',
                         'city_Phinney Ridge Seattle',
                         'city_Seattle', 
                         'city_Seattle ',
                         'city_West Seattle',
                         'city_西雅图', # This is Seattle in Chinese (there is a very specific business niche)
                         'room_type_Entire home/apt',
                         'room_type_Private room',
                         'room_type_Shared room']]
# Price is assigned to y as the target column.
y = df_listings_encoded['price']

# Create the test and train sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = .30, random_state=42)

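# The `normalize` argument was removed in scikit-learn 1.2, so it is dropped here.
# Plain least squares gives the same predictions with or without it.
lm_model = LinearRegression()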

lm_model.fit(X_train, y_train) 

#Predict using your model
y_test_preds = lm_model.predict(X_test)
y_train_preds = lm_model.predict(X_train)

### Evaluation


In [None]:
# Score the model with R-squared on both sets
test_score = r2_score(y_test, y_test_preds)
train_score = r2_score(y_train, y_train_preds)

# Training score
print('Train Score:', train_score)

# Test score
print('Test Score:', test_score)

In the current model the R-squared value on the test set is 0.46, which is not very high. However, it gives us insight into which features might be replaced by others in order to improve the predictions.
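
R-squared is unitless, so it can help to report the error in dollars as well. The sketch below computes the root mean squared error next to R-squared on invented arrays; on the notebook's data the same calls would take `y_test` and `y_test_preds`.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented true and predicted prices, for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE is the square root of the mean squared error, in the same unit as the price.
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print("RMSE:", rmse)
print("R2:", r2)
```

Taking the square root manually avoids depending on optional arguments of `mean_squared_error` that differ between scikit-learn versions.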

In [None]:
def coef_weights(coefficients, X_train):
    '''
    INPUT:
    coefficients - the coefficients of the linear model 
    X_train - the training data, so the column names can be used
    OUTPUT:
    coefs_df - a dataframe holding the coefficient, estimate, and abs(estimate)
    
    Provides a dataframe that can be used to understand the most influential coefficients
    in a linear model by providing the coefficient estimates along with the name of the 
    variable attached to the coefficient.
    '''
    coefs_df = pd.DataFrame()
    coefs_df['est_int'] = X_train.columns
    # Use the coefficients passed in rather than the global model
    coefs_df['coefs'] = coefficients
    coefs_df['abs_coefs'] = np.abs(coefficients)
    coefs_df = coefs_df.sort_values('abs_coefs', ascending=False)
    return coefs_df

#Use the function
coef_df = coef_weights(lm_model.coef_, X_train)

#A quick look at the top results
coef_df.head(20)

The result above confirms that bedrooms, room type and location are the main drivers of the price, while the reviews seem to have no impact on it. However, they may have an impact on occupancy and revenue for the hosts. That would be another analysis to do in the future.
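
One caveat when ranking coefficients by absolute value is that they depend on the scale of each feature. A possible refinement, not part of the original analysis, is to standardize the features first so the magnitudes are more comparable. The sketch below does this on synthetic data; the frame `toy_X`, the target `toy_y` and the generating formula are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic features and target; price is built to depend mainly on bedrooms.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "bedrooms": rng.integers(1, 5, size=200),
    "review_scores_rating": rng.uniform(80, 100, size=200),
})
toy_y = (60 * toy_X["bedrooms"]
         + 0.1 * toy_X["review_scores_rating"]
         + rng.normal(0, 5, size=200))

# Standardize each feature to zero mean and unit variance before fitting.
X_scaled = StandardScaler().fit_transform(toy_X)
model = LinearRegression().fit(X_scaled, toy_y)

# Absolute standardized coefficients, largest first.
std_coefs = pd.Series(model.coef_, index=toy_X.columns).abs().sort_values(ascending=False)

print(std_coefs)
```

Each standardized coefficient can be read as the price change for a one standard deviation change in that feature, which makes a review score on a 0 to 100 scale comparable with a bedroom count.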

### Deployment


In this case, instead of deploying the code to a production server, I share these insights in a Medium article. The article can be found here:
[Article](https://medium.com/@jose0628/sea-2fe2b5cb49d) 

### Conclusions

We could see that the Airbnb price is mostly driven by the number of bedrooms, the location of the rental and the room type. There is little evidence that reviews or the quality of the bed increase the price per night. Moreover, internet, heating, kitchen and smoke detector seem to be the top amenities offered to make guests feel at home, although that does not imply that guests like them. Finally, the most expensive time of the year to visit Seattle is June, July and August.

All this information helps us better understand the Airbnb market in Seattle, and the code can be reused as guidance for future Airbnb analyses in other locations.