# Airbnb Price Case Study: Exploratory Data Analysis

Airbnb is an online marketplace that connects people who want to rent out their homes with people who are looking for accommodation in that locale. It is part “sharing economy,” part entrepreneurship, and part meeting people and sharing experiences.
The challenge was to predict Airbnb listing prices. The available factors included property type, room type, amenities, accommodates, bathrooms, bed_type, etc.
The approach is to clean and engineer these factors and then train machine learning models on them to predict the price.


In [46]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from datetime import datetime
import seaborn as sns
train_data = pd.read_csv('train.csv')
test_data = pd.read_excel('test.xlsx')

# Evaluating the columns and their purpose


In [47]:
train_data.head()

| Unnamed: 0 | id | property_type | room_type | amenities | accommodates | bathrooms | bed_type | cancellation_policy | cleaning_fee | city | ... | name | neighbourhood | number_of_reviews | review_scores_rating | thumbnail_url | zipcode | bedrooms | beds | log_price | Unnamed: 21 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6901257 | Apartment | Entire home/apt | {"Wireless Internet","Air conditioning",Kitche... | 3 | 1.0 | Real Bed | strict | True | NYC | ... | Beautiful brownstone 1-bedroom | Brooklyn Heights | 2 | 100.0 | https://a0.muscache.com/im/pictures/6d7cbbf7-c... | 11201 | 1.0 | 1.0 | 5.010635 | |
| 1 | 6304928 | Apartment | Entire home/apt | {"Wireless Internet","Air conditioning",Kitche... | 7 | 1.0 | Real Bed | strict | True | NYC | ... | Superb 3BR Apt Located Near Times Square | Hell's Kitchen | 6 | 93.0 | https://a0.muscache.com/im/pictures/348a55fe-4... | 10019 | 3.0 | 3.0 | 5.129899 | |
| 2 | 7919400 | Apartment | Entire home/apt | {TV,"Cable TV","Wireless Internet","Air condit... | 5 | 1.0 | Real Bed | moderate | True | NYC | ... | The Garden Oasis | Harlem | 10 | 92.0 | https://a0.muscache.com/im/pictures/6fae5362-9... | 10027 | 1.0 | 3.0 | 4.976734 | |
| 3 | 13418779 | House | Entire home/apt | {TV,"Cable TV",Internet,"Wireless Internet",Ki... | 4 | 1.0 | Real Bed | flexible | True | SF | ... | Beautiful Flat in the Heart of SF! | Lower Haight | 0 | | https://a0.muscache.com/im/pictures/72208dad-9... | 94117 | 2.0 | 2.0 | 6.620073 | |
| 4 | 3808709 | Apartment | Entire home/apt | {TV,Internet,"Wireless Internet","Air conditio... | 2 | 1.0 | Real Bed | moderate | True | DC | ... | Great studio in midtown DC | Columbia Heights | 4 | 40.0 | | 20009 | 0.0 | 1.0 | 4.744932 | |


In [48]:
train_data.columns
train_data = train_data.drop([' '], axis=1)
test_data = test_data.drop([' '], axis = 1)

In [49]:
test_data['log_price'] = np.nan

## Converting the "host_since" column to a proper datetime format and extracting the year as a new variable (missing values are filled later)

In [50]:
train_data['host_since'] = pd.to_datetime(train_data['host_since'],format='%m/%d/%Y')
train_data["host_year"] = train_data['host_since'].dt.year

test_data['host_since'] = pd.to_datetime(test_data['host_since'],format='%Y-%m-%d %H:%M:%S')
test_data['host_year'] = test_data['host_since'].dt.year
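
As a minimal sketch of the parsing step above, the toy series below stand in for the two date layouts (the values are made up for illustration and are not taken from the dataset). It also shows why `host_year` ends up as a float column when some dates are missing.

```python
import pandas as pd

# Hypothetical stand-ins for the two date layouts in the train and test files.
train_dates = pd.Series(["03/26/2012", None, "10/25/2016"])
test_dates = pd.Series(["2012-03-26 00:00:00", "2016-10-25 00:00:00"])

# Each file needs its own explicit format string.
train_parsed = pd.to_datetime(train_dates, format="%m/%d/%Y")
test_parsed = pd.to_datetime(test_dates, format="%Y-%m-%d %H:%M:%S")

# Missing dates become NaT, so the extracted year is NaN for those rows
# and the whole column is stored as float.
train_year = train_parsed.dt.year
test_year = test_parsed.dt.year

print(train_year.tolist())
print(test_year.tolist())
```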

# Concatenating train and test data for data cleaning


In [51]:
print(train_data.columns)
print(test_data.columns)
print(train_data.shape)
print(test_data.shape)

Index(['id', 'property_type', 'room_type', 'amenities', 'accommodates',
       'bathrooms', 'bed_type', 'cancellation_policy', 'cleaning_fee', 'city',
       'description', 'first_review', 'host_has_profile_pic',
       'host_identity_verified', 'host_response_rate', 'host_since',
       'instant_bookable', 'last_review', 'latitude', 'longitude', 'name',
       'neighbourhood', 'number_of_reviews', 'review_scores_rating',
       'thumbnail_url', 'zipcode', 'bedrooms', 'beds', 'log_price',
       'host_year'],
      dtype='object')
Index(['id', 'property_type', 'room_type', 'amenities', 'accommodates',
       'bathrooms', 'bed_type', 'cancellation_policy', 'cleaning_fee', 'city',
       'description', 'first_review', 'host_has_profile_pic',
       'host_identity_verified', 'host_response_rate', 'host_since',
       'instant_bookable', 'last_review', 'latitude', 'longitude', 'name',
       'neighbourhood', 'number_of_reviews', 'review_scores_rating',
       'thumbnail_url', 'zipcode', 'b

In [52]:
train_data["remove"] = 'train'
test_data["remove"] = 'test'

In [53]:
frames = [train_data, test_data]
result = pd.concat(frames, ignore_index=True)
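
The tag-and-concatenate pattern above can be sketched on toy frames as follows. The column values are hypothetical; only the `remove` marker column and the `log_price` placeholder mirror the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames; the test frame has no target, so it gets NaN.
train_toy = pd.DataFrame({"beds": [1.0, np.nan], "log_price": [5.0, 4.7]})
test_toy = pd.DataFrame({"beds": [2.0, 3.0]})
test_toy["log_price"] = np.nan

# Tag each row with its origin, then stack the frames for shared cleaning.
train_toy["remove"] = "train"
test_toy["remove"] = "test"
combined = pd.concat([train_toy, test_toy], ignore_index=True)

# Any cleaning done here is applied to both parts at once.
combined["beds"] = combined["beds"].fillna(combined["beds"].median())

# The marker column lets the two parts be separated again afterwards.
train_back = combined[combined["remove"] == "train"].drop(columns="remove")
test_back = combined[combined["remove"] == "test"].drop(columns="remove")
print(train_back)
print(test_back)
```

One consequence of this design is that statistics such as the median are computed over train and test rows together, so the fill values depend on the test data as well.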

# Dropping columns that will not be used in the rest of the analysis

In [54]:
result = result.drop(['id', 'description', 'first_review', 'last_review', 'thumbnail_url','name', 'host_since'], axis = 1) 

In [55]:
result = result.drop(['zipcode'], axis = 1) 

In [56]:
result = result.drop(['neighbourhood'], axis = 1) 

In [57]:
result = result.drop(['latitude', 'longitude'], axis = 1) 

# Taking Next Step into Variable Creation and Feature Engineering

The 'amenities' column is reduced to indicator variables for a few important amenities: Internet, AirConditioning, Kitchen, FamilyFriendly, Essentials, TV, PetsFriendly, Breakfast and SmokeDetector.

In [58]:
result['Internet'] = 0
result['AirConditioning'] = 0
result['Kitchen'] = 0
result['FamilyFriendly'] = 0
result['Essentials'] = 0
result['TV'] = 0
result['PetsFriendly'] = 0
result['Breakfast'] = 0
result['SmokeDetector'] = 0

In [59]:
result.loc[result['amenities'].str.contains('Family'), 'FamilyFriendly'] = int(1)
result.loc[result['amenities'].str.contains('Air'), 'AirConditioning'] = int(1)
result.loc[result['amenities'].str.contains('Kitchen'), 'Kitchen'] = int(1)
result.loc[result['amenities'].str.contains('Internet'), 'Internet'] = int(1)     
result.loc[result['amenities'].str.contains('Essentials'), 'Essentials'] = int(1)
result.loc[result['amenities'].str.contains('TV'), 'TV'] = int(1)
result.loc[result['amenities'].str.contains('Pets'), 'PetsFriendly'] = int(1)
result.loc[result['amenities'].str.contains('Smoke'), 'SmokeDetector'] = int(1)
result.loc[result['amenities'].str.contains('Breakfast'), 'Breakfast'] = int(1)
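
The same flags can also be built in a loop over a keyword-to-column mapping. This is a sketch on a hypothetical three-row frame, not the notebook's own code; the keywords and column names mirror the ones used above.

```python
import pandas as pd

# Hypothetical amenities strings in the same brace-delimited layout.
toy = pd.DataFrame({
    "amenities": [
        '{TV,"Wireless Internet",Kitchen}',
        '{"Air conditioning","Smoke detector"}',
        "{}",
    ]
})

# Substring to look for -> name of the indicator column.
keyword_to_flag = {
    "Family": "FamilyFriendly",
    "Air": "AirConditioning",
    "Kitchen": "Kitchen",
    "Internet": "Internet",
    "Essentials": "Essentials",
    "TV": "TV",
    "Pets": "PetsFriendly",
    "Smoke": "SmokeDetector",
    "Breakfast": "Breakfast",
}

for keyword, flag in keyword_to_flag.items():
    # regex=False treats the keyword as a literal, case-sensitive substring.
    toy[flag] = toy["amenities"].str.contains(keyword, regex=False).astype(int)

print(toy.drop(columns="amenities"))
```

Because the match is a plain substring test, a keyword such as 'Internet' also matches 'Wireless Internet', and 'TV' also matches 'Cable TV'.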

# After the previous step, we can drop the 'amenities' column.

In [60]:
result = result.drop(['amenities'], axis = 1) 

# Checking columns having null values

In [61]:
for i in result.columns:
    
    print(i, result[i].isna().sum())

property_type 0
room_type 0
accommodates 0
bathrooms 200
bed_type 0
cancellation_policy 0
cleaning_fee 0
city 0
host_has_profile_pic 188
host_identity_verified 188
host_response_rate 18299
instant_bookable 0
number_of_reviews 0
review_scores_rating 16722
bedrooms 91
beds 131
log_price 24111
host_year 188
remove 0
Internet 0
AirConditioning 0
Kitchen 0
FamilyFriendly 0
Essentials 0
TV 0
PetsFriendly 0
Breakfast 0
SmokeDetector 0


# Filling missing Review Scores Rating values with the median and rescaling the column to the range 0 to 1

In [62]:
result['review_scores_rating'] = result['review_scores_rating'].fillna(result['review_scores_rating'].median())
result['review_scores_rating']=result['review_scores_rating']/100
result['review_scores_rating'].head()

0    1.00
1    0.93
2    0.92
3    0.96
4    0.40
Name: review_scores_rating, dtype: float64

# Converting host response rate from a percentage string to a float and filling null values with the column mean

In [63]:
# Strip the trailing '%' and convert to float; missing values stay NaN
result["host_response_rate"] = result["host_response_rate"].str.rstrip('%').astype(float)

# Fill the remaining nulls with the column mean
result['host_response_rate'] = result['host_response_rate'].fillna(result['host_response_rate'].mean())

# Filling the newly created host_year column with its median

In [64]:
result['host_year'] = result['host_year'].fillna(result['host_year'].median())

In [65]:
# train_data["host_since_timestamp"] = train_data["host_since"].apply(lambda x: datetime.timestamp(datetime.strptime(x, "%m/%d/%Y")) if not pd.isnull(x) else 0.0)

# Filling the null values of 'host_has_profile_pic' and 'host_identity_verified' with 't' (the columns hold strings, so a median does not apply)

In [66]:
type(result['host_has_profile_pic'][0])

str

In [67]:
result['host_identity_verified'].isna().sum()

188

In [68]:
# result['host_identity_verified'].value_counts()

In [69]:
# train_data["host_since_timestamp"] = train_data["host_since"].apply(lambda x: datetime.strptime(x, "%m/%d/%Y") if not pd.isnull(x) else 0.0)

In [70]:
result['host_has_profile_pic'] = result['host_has_profile_pic'].fillna("t")

In [71]:
result['host_identity_verified'] = result['host_identity_verified'].fillna("t")

# Filling up the null values of bathrooms, bedrooms and beds with their column median

In [72]:
result['bathrooms'] = result['bathrooms'].fillna(result['bathrooms'].median())
result['bedrooms'] = result['bedrooms'].fillna(result['bedrooms'].median())
result['beds'] = result['beds'].fillna(result['beds'].median())



In [73]:
# train_data["host_year"]

In [74]:
result1 = result.copy(deep=True)

# Checking the data types of those columns

In [75]:
result.dtypes

property_type              object
room_type                  object
accommodates                int64
bathrooms                 float64
bed_type                   object
cancellation_policy        object
cleaning_fee                 bool
city                       object
host_has_profile_pic       object
host_identity_verified     object
host_response_rate        float64
instant_bookable           object
number_of_reviews           int64
review_scores_rating      float64
bedrooms                  float64
beds                      float64
log_price                 float64
host_year                 float64
remove                     object
Internet                    int64
AirConditioning             int64
Kitchen                     int64
FamilyFriendly              int64
Essentials                  int64
TV                          int64
PetsFriendly                int64
Breakfast                   int64
SmokeDetector               int64
dtype: object

In [76]:
result['cleaning_fee'].value_counts()

True     54402
False    19708
Name: cleaning_fee, dtype: int64

# One-hot encoding the categorical variables so that the models can use them as numeric features

In [77]:
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[feature_to_encode], prefix = feature_to_encode, drop_first=True)
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res) 

features_to_encode = ['city', 'property_type', 'room_type','bed_type', 'cancellation_policy', 'cleaning_fee', 'host_has_profile_pic', 'host_identity_verified', 'instant_bookable']
for feature in features_to_encode:
    result1 = encode_and_bind(result1, feature)
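
To illustrate what `drop_first=True` does in the helper above, here is a small sketch on a hypothetical frame with three cities. The first category in sorted order gets no column of its own and is represented by all-zero dummies.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({"city": ["NYC", "SF", "DC", "NYC"], "beds": [1, 2, 1, 3]})

# Same call pattern as in encode_and_bind: prefix the new columns with the
# feature name and drop the first category to avoid a redundant column.
dummies = pd.get_dummies(toy["city"], prefix="city", drop_first=True)
encoded = pd.concat([toy.drop(columns="city"), dummies], axis=1)

print(encoded.columns.tolist())
print(encoded)
```

Here 'DC' sorts first, so only `city_NYC` and `city_SF` are created, and the 'DC' row has a zero in both.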

In [78]:
# dummies

In [79]:
res = pd.get_dummies(result['city'])

In [80]:
result.shape

(74110, 28)

# Checking the data types of those columns

In [81]:
result1.dtypes

accommodates                             int64
bathrooms                              float64
host_response_rate                     float64
number_of_reviews                        int64
review_scores_rating                   float64
                                        ...   
cancellation_policy_super_strict_60      uint8
cleaning_fee_True                        uint8
host_has_profile_pic_t                   uint8
host_identity_verified_t                 uint8
instant_bookable_t                       uint8
Length: 72, dtype: object

# Checking newly generated columns through one hot encoding

In [82]:
result1.columns

Index(['accommodates', 'bathrooms', 'host_response_rate', 'number_of_reviews',
       'review_scores_rating', 'bedrooms', 'beds', 'log_price', 'host_year',
       'remove', 'Internet', 'AirConditioning', 'Kitchen', 'FamilyFriendly',
       'Essentials', 'TV', 'PetsFriendly', 'Breakfast', 'SmokeDetector',
       'city_Chicago', 'city_DC', 'city_LA', 'city_NYC', 'city_SF',
       'property_type_Bed & Breakfast', 'property_type_Boat',
       'property_type_Boutique hotel', 'property_type_Bungalow',
       'property_type_Cabin', 'property_type_Camper/RV',
       'property_type_Casa particular', 'property_type_Castle',
       'property_type_Cave', 'property_type_Chalet',
       'property_type_Condominium', 'property_type_Dorm',
       'property_type_Earth House', 'property_type_Guest suite',
       'property_type_Guesthouse', 'property_type_Hostel',
       'property_type_House', 'property_type_Hut', 'property_type_In-law',
       'property_type_Island', 'property_type_Lighthouse',
       'p

# Converting the amenity indicator columns to the category dtype

In [83]:
for i in ['Internet', 'AirConditioning', 'Kitchen', 'FamilyFriendly','Essentials', 'TV','PetsFriendly','Breakfast','SmokeDetector']:   
    result1[i] = result1[i].astype('category')

In [84]:
result1.dtypes

accommodates                             int64
bathrooms                              float64
host_response_rate                     float64
number_of_reviews                        int64
review_scores_rating                   float64
                                        ...   
cancellation_policy_super_strict_60      uint8
cleaning_fee_True                        uint8
host_has_profile_pic_t                   uint8
host_identity_verified_t                 uint8
instant_bookable_t                       uint8
Length: 72, dtype: object

# Separating the train and test data from their combined form, which was used for data cleaning

In [85]:
train_result = result1[result1['remove']=='train']

In [86]:
xtrain1 = train_result.drop(['remove', 'log_price'], axis = 1)
ytrain1 = train_result['log_price']

In [87]:
print(xtrain1.shape, ytrain1.shape)

(49999, 70) (49999,)


In [88]:
test_result = result1[result1['remove']=='test']

In [89]:
X_test = test_result.drop(['remove', 'log_price'], axis = 1)
y_test = test_result['log_price']

# Train, validation, test split 

In [91]:
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(xtrain1, ytrain1, test_size=0.33, random_state=42)

In [92]:
print(X_train.shape, X_val.shape)
print(y_train.shape, y_val.shape)

(33499, 70) (16500, 70)
(33499,) (16500,)
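
The split sizes above can be checked by hand. The sketch assumes that `train_test_split` rounds the validation share up and gives the remaining rows to the training set, which matches the shapes printed above.

```python
import math

n_rows = 49999      # rows in xtrain1, from the shape printed earlier
test_size = 0.33    # fraction passed to train_test_split

# Validation rows: the fractional count is rounded up.
n_val = math.ceil(n_rows * test_size)
# Training rows: whatever is left.
n_train = n_rows - n_val

print(n_train, n_val)
```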


# Model building 

## Linear Regression Model

In [93]:
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

In [94]:
y_pred = reg.predict(X_val)

In [95]:
from sklearn.metrics import mean_squared_error

In [96]:
errors = mean_squared_error(y_val, y_pred)

In [97]:
errors

0.2160674477908023

In [98]:
print(reg.score(X_val, y_val))

0.5801528890270633


Linear Regression: MSE = 0.216, RMSE = sqrt(MSE) = 0.465, R² = 0.580
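
The RMSE follows directly from the MSE reported above. The second half of this sketch rests on an assumption that the source does not state: that `log_price` is a natural logarithm. Under that assumption, an error in log units translates into a multiplicative factor on the price.

```python
import math

mse = 0.2160674477908023   # validation MSE of the linear model, from above
rmse = math.sqrt(mse)
print(round(rmse, 3))

# ASSUMPTION: log_price is the natural log of the price. A typical error of
# `rmse` in log units then means being off by about this factor in price.
factor = math.exp(rmse)
print(round(factor, 2))
```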

# Ridge Model 

In [99]:
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split, cross_val_score 
from statistics import mean 

In [100]:

# Building and fitting the Ridge Regression model 
ridgeModelChosen = Ridge(alpha = 8) 
ridgeModelChosen.fit(X_train, y_train) 
  
# Evaluating the Ridge Regression model 
print(ridgeModelChosen.score(X_val, y_val)) 


0.5804775198806627


Ridge model R² on the validation set = 0.5805 (58.05%)
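
The notebook does not show how `alpha = 8` was chosen. One common way to pick it, sketched here on synthetic data, is to compare mean cross-validated scores over a few candidates. The data, the candidate list and the variable names are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic regression data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

# Hypothetical candidate regularisation strengths.
alphas = [0.1, 1, 8, 50]
mean_scores = {}
for alpha in alphas:
    # The default scoring for a regressor is R^2; cv=5 gives five folds.
    scores = cross_val_score(Ridge(alpha=alpha), X, y, cv=5)
    mean_scores[alpha] = scores.mean()

best_alpha = max(mean_scores, key=mean_scores.get)
print(mean_scores)
print(best_alpha)
```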

# Random Forest Regressor model

In [102]:
from sklearn.metrics import mean_squared_error

In [103]:
from sklearn.ensemble import RandomForestRegressor

In [104]:
rfr = RandomForestRegressor().fit(X_train, y_train)

In [105]:
y_pred = rfr.predict(X_val)

In [106]:
errors = mean_squared_error(y_val, y_pred)

In [107]:
errors

0.19789500425529696

In [108]:
print(rfr.score(X_val, y_val))

0.6154643068075325


MSE = 0.198, RMSE = sqrt(MSE) = 0.445, R² = 0.6155 (61.55%) (IMPROVED)
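
For a scikit-learn regressor, `.score()` returns R², which is tied to the MSE by R² = 1 - MSE / Var(y_val). As a consistency check, the sketch below recovers the validation-target variance implied by each model's reported numbers. Both models were scored on the same `y_val`, so the two values should agree.

```python
# Reported validation metrics, copied from the outputs above.
mse_lr, r2_lr = 0.2160674477908023, 0.5801528890270633   # linear regression
mse_rf, r2_rf = 0.19789500425529696, 0.6154643068075325  # random forest

# Rearranging R^2 = 1 - MSE / Var(y) gives Var(y) = MSE / (1 - R^2).
var_lr = mse_lr / (1 - r2_lr)
var_rf = mse_rf / (1 - r2_rf)

print(round(var_lr, 4), round(var_rf, 4))
```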

# Grid Search the best parameters

In [109]:
from sklearn.model_selection import GridSearchCV

# Candidate hyperparameters for the random forest
param_grid_rf = {
    "n_estimators": [50, 100, 130],
    "max_depth": range(3, 11, 1),
    "random_state": [0, 50, 100],
}
# Creating an object of the Grid Search class
grid = GridSearchCV(RandomForestRegressor(), param_grid_rf, verbose=3, cv=2, n_jobs=-1)
# Finding the best parameters
grid.fit(X_train, y_train)

In [110]:
# grid.best_score_

In [111]:
# grid.best_params_

In [112]:
gridreg = RandomForestRegressor(max_depth =  10, n_estimators= 130, random_state= 0)
gridreg.fit(X_train,y_train)
gridpred=gridreg.predict(X_val)

In [113]:
errors = mean_squared_error(y_val, gridpred)

In [114]:
errors

0.19604483224469704

In [115]:
from sklearn.metrics import r2_score
r2_score(y_val,gridpred)

0.6190594312994248

MSE = 0.196, RMSE = sqrt(MSE) = 0.443, R² = 0.6191 (61.91%) (IMPROVED)

# Airbnb Price finalized model saving

In [116]:
import pickle
filename = 'airbnb_price_finalized_model.pkl'
# Save the fitted model (gridreg), not its validation predictions (gridpred)
with open(filename, 'wb') as f:
    pickle.dump(gridreg, f)
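
A saved model is only useful if it can be loaded back and still predicts the same values. The following sketch runs that round trip on a small synthetic model. The data, the temporary path and the file name `toy_model.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=50)
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Write the fitted estimator itself to disk, then read it back.
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions.
before = model.predict(X)
after = restored.predict(X)
print(np.array_equal(before, after))
```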

In [118]:
gridtest=gridreg.predict(X_test)
gridtest_pred = pd.Series(gridtest)


In [120]:
gridtest_pred.to_csv('Y_testpred.csv')

In [None]:
X_train.to_csv('X_train.csv', header=False, index=False) 
y_train.to_csv('Y_train.csv', header=False, index=False)
X_val.to_csv('X_val.csv', header=False, index=False)
y_val.to_csv('Y_val.csv', header=False, index=False)

In [None]:
X_test.to_csv('X_test.csv', header=False, index=False)