## Project: Regression Dataset

The dataset is downloaded from Kaggle. It consists of data points collected from the Airbnb website that can be used to predict the price of a stay. We use only the listings for Boston and drop the rows for other cities.

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Import required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

pd.options.display.max_rows = 100
pd.options.display.max_columns = 150

In [3]:
# Load the data

raw_data = pd.read_csv("airbnb.csv")

In [4]:
# Subset the data to Boston city
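# .copy() gives an independent frame, so later in-place edits do not act on a slice of raw_data
boston_data = raw_data[raw_data['city'] == 'Boston'].copy()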
boston_data.reset_index(inplace=True, drop=True)

In [5]:
boston_data.head()

Unnamed: 0,id,log_price,property_type,room_type,amenities,accommodates,bathrooms,bed_type,cancellation_policy,cleaning_fee,city,description,first_review,host_has_profile_pic,host_identity_verified,host_response_rate,host_since,instant_bookable,last_review,latitude,longitude,name,neighbourhood,number_of_reviews,review_scores_rating,thumbnail_url,zipcode,bedrooms,beds
0,14648556,4.59512,Townhouse,Private room,"{Internet,""Wireless Internet"",""Air conditionin...",2,2.0,Real Bed,strict,True,Boston,This is a nice duplex in a good location.Recen...,2016-07-16,t,t,100%,2014-07-27,f,2017-02-07,42.339194,-71.049672,"Comfy room (C) near T, convention center, down...",South Boston,12,88.0,https://a0.muscache.com/im/pictures/176088bb-3...,2127,1.0,1.0
1,4680055,4.682131,Condominium,Private room,"{TV,Internet,""Wireless Internet"",""Air conditio...",2,1.0,Real Bed,strict,True,Boston,Tourists/Conference-goers great choice! Privat...,2016-03-20,t,t,100%,2013-06-16,t,2017-09-17,42.330628,-71.053148,Private Bedroom Close To Downtown/Subway Red line,South Boston,40,96.0,https://a0.muscache.com/im/pictures/aad0eaa7-a...,2127,1.0,1.0
2,4274462,4.828314,Apartment,Entire home/apt,"{TV,""Wireless Internet"",""Air conditioning"",Kit...",6,1.0,Real Bed,strict,True,Boston,"An Entire 2 bedroom, 600sqft, apartment w/ 4 t...",2017-09-14,t,t,100%,2015-01-25,f,2017-10-02,42.336007,-71.052918,**NEW*Downtown/Convention/Subway/Beach C130,South Boston,5,100.0,https://a0.muscache.com/im/pictures/3d35ea0b-e...,2127,2.0,4.0
3,2278299,4.094345,House,Private room,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",2,1.0,Real Bed,flexible,False,Boston,"This is a beautiful space in a gorgeous, newly...",,t,f,,2013-12-13,f,,42.319265,-71.113246,"Beautiful Home in Jamaica Plain, MA",Jamaica Plain,0,,https://a0.muscache.com/im/pictures/86832250/f...,2130,1.0,1.0
4,16253186,4.962845,Apartment,Entire home/apt,"{TV,""Wireless Internet"",Kitchen,""Family/kid fr...",2,1.0,Real Bed,flexible,False,Boston,Nicely decorated comfortable 1 bedroom in very...,2017-05-21,t,f,,2017-05-05,t,2017-05-27,42.357198,-71.071588,Clean upscale apt and location,Beacon Hill,2,80.0,https://a0.muscache.com/im/pictures/9bf5ae4c-f...,2114,1.0,1.0


#### Dropping features that will not be used to predict the price

In [6]:
drop_cols = ['id', 'city', 'description', 'first_review', 'last_review', 'host_since', 'latitude', 'longitude',
             'name', 'neighbourhood', 'thumbnail_url', 'zipcode']

boston_data.drop(columns=drop_cols, inplace=True)

In [7]:
boston_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3468 entries, 0 to 3467
Data columns (total 17 columns):
log_price                 3468 non-null float64
property_type             3468 non-null object
room_type                 3468 non-null object
amenities                 3468 non-null object
accommodates              3468 non-null int64
bathrooms                 3462 non-null float64
bed_type                  3468 non-null object
cancellation_policy       3468 non-null object
cleaning_fee              3468 non-null bool
host_has_profile_pic      3468 non-null object
host_identity_verified    3468 non-null object
host_response_rate        2887 non-null object
instant_bookable          3468 non-null object
number_of_reviews         3468 non-null int64
review_scores_rating      2820 non-null float64
bedrooms                  3465 non-null float64
beds                      3466 non-null float64
dtypes: bool(1), float64(5), int64(2), object(9)
memory usage: 437.0+ KB


In [8]:
boston_data.isnull().sum()

log_price                   0
property_type               0
room_type                   0
amenities                   0
accommodates                0
bathrooms                   6
bed_type                    0
cancellation_policy         0
cleaning_fee                0
host_has_profile_pic        0
host_identity_verified      0
host_response_rate        581
instant_bookable            0
number_of_reviews           0
review_scores_rating      648
bedrooms                    3
beds                        2
dtype: int64

## Initial impressions of the data and next steps

### The following columns have missing values:

* bathrooms
* host_response_rate
* review_scores_rating
* bedrooms
* beds

After further analysis, the missing values in these columns will be imputed with appropriate values.
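The notebook does not say which imputation strategy is chosen, so the following is only a minimal sketch of one common option, median imputation. The `toy` frame and its values are made up for illustration, and `host_response_rate` is assumed to have already been converted to a number.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for boston_data; the values are invented for illustration.
toy = pd.DataFrame({
    "bathrooms": [1.0, np.nan, 2.0, 1.0],
    "bedrooms": [1.0, 2.0, np.nan, 1.0],
    "beds": [1.0, 2.0, 4.0, np.nan],
    "host_response_rate": [1.0, np.nan, 0.9, 0.8],  # assumed already numeric
    "review_scores_rating": [88.0, 96.0, np.nan, 100.0],
})

# Assumption: fill every gap with the median of its own column.
cols_with_gaps = toy.columns[toy.isnull().any()]
medians = toy[cols_with_gaps].median()
imputed = toy.fillna(medians)

print(imputed.isnull().sum())
```

The median is less sensitive to outliers than the mean, which may matter for skewed columns such as `beds`; whether it suits this dataset is a judgment the analysis would still need to confirm.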

The __amenities__ column holds a brace-delimited, comma-separated list of amenities in each row (for example `{TV,Internet,"Wireless Internet"}`), so we will replace it with the count of amenities provided by the host.

Categorical columns need to be converted into numerical form. The date columns (first_review, last_review and host_since) were already dropped above.

The __host_has_profile_pic__, __host_identity_verified__ and __instant_bookable__ columns use t for true and f for false. We replace t with 1 and f with 0, and likewise convert the boolean __cleaning_fee__ column to 1/0.

In [9]:
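# Map the 't'/'f' flags to 1/0 column by column, so other string columns are never touched
flag_cols = ['host_has_profile_pic', 'host_identity_verified', 'instant_bookable']
for col in flag_cols:
    boston_data[col] = boston_data[col].map({'t': 1, 'f': 0})

# cleaning_fee is already boolean; cast True/False to 1/0
boston_data['cleaning_fee'] = boston_data['cleaning_fee'].astype(int)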

In [10]:
boston_data['property_type'].value_counts()

Apartment             2383
House                  563
Condominium            339
Townhouse               54
Other                   44
Loft                    24
Bed & Breakfast         17
Boat                    10
Guest suite              8
Villa                    6
Hostel                   4
In-law                   4
Guesthouse               4
Dorm                     3
Timeshare                3
Serviced apartment       1
Boutique hotel           1
Name: property_type, dtype: int64

In [11]:
p_types = ['Apartment','House','Condominium','Townhouse']
boston_data.loc[~boston_data.property_type.isin(p_types), 'property_type'] = 'Other'
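The list of property types to keep is hard-coded above. As a hedged alternative, the same grouping can be derived from the frequencies themselves. The sketch below rebuilds a toy series from a subset of the counts shown above; `MIN_COUNT` is a hypothetical threshold, not a value taken from the notebook.

```python
import pandas as pd

# Toy series rebuilt from a subset of the value_counts() output above.
counts = {"Apartment": 2383, "House": 563, "Condominium": 339,
          "Townhouse": 54, "Other": 44, "Loft": 24, "Boat": 10}
property_type = pd.Series([name for name, n in counts.items() for _ in range(n)])

MIN_COUNT = 50  # hypothetical cut-off for a category to keep its own label

# Keep categories that occur at least MIN_COUNT times; fold the rest into 'Other'.
freq = property_type.value_counts()
keep = freq[freq >= MIN_COUNT].index
grouped = property_type.where(property_type.isin(keep), "Other")

print(grouped.value_counts())
```

With this threshold the four categories kept are the same ones listed in `p_types`, and the existing 'Other' rows are merged with the rare types.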

In [12]:
boston_data['room_type'].unique()

array(['Private room', 'Entire home/apt', 'Shared room'], dtype=object)

In [13]:
boston_data['bed_type'].unique()

array(['Real Bed', 'Futon', 'Airbed', 'Pull-out Sofa', 'Couch'],
      dtype=object)

In [14]:
boston_data['cancellation_policy'].unique()

array(['strict', 'flexible', 'super_strict_30', 'moderate',
       'super_strict_60'], dtype=object)

In [15]:
from sklearn.preprocessing import OneHotEncoder

In [16]:
cat_cols = ['room_type', 'bed_type', 'cancellation_policy', 'property_type']
encoder = OneHotEncoder()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
feature_df = pd.DataFrame(encoder.fit_transform(boston_data[cat_cols]).toarray(),
                          columns=encoder.get_feature_names_out(cat_cols))
# Join the one-hot columns back on the row index, then drop the original categorical columns
boston_data = pd.merge(boston_data, feature_df, how='left', left_index=True, right_index=True)
boston_data.drop(columns=cat_cols, inplace=True)
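For a one-off exploration like this, `pd.get_dummies` produces the same kind of indicator columns in a single call. This is a different technique from the `OneHotEncoder` used above, shown here only as a sketch on a made-up `toy` frame; it does not remember the categories for later use on new data, which a fitted encoder does.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (values are illustrative).
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Shared room"],
    "bed_type": ["Real Bed", "Futon", "Real Bed"],
    "accommodates": [2, 6, 1],
})

# Replace each listed column with one 0/1 indicator column per category.
encoded = pd.get_dummies(toy, columns=["room_type", "bed_type"], dtype=float)

print(encoded.columns.tolist())
```

The indicator columns are named `<column>_<category>`, matching the naming seen in the encoded frame.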

In [17]:
# Count the comma-separated entries in each amenities string.
# Note: an empty set '{}' still counts as 1, because splitting it yields one piece.
boston_data['amenities'] = boston_data['amenities'].str.split(',').str.len()
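Splitting on every comma counts an empty `{}` as one amenity and would over-count any quoted amenity that itself contains a comma. A more careful count, sketched below, strips the braces and lets the standard `csv` module handle the quoting. `count_amenities` is a hypothetical helper, and the `"Washer, Dryer"` entry is an invented example rather than a value known to be in the dataset. Applying it would change the numbers reported in the notebook (for instance, empty sets would become 0).

```python
import csv

def count_amenities(raw: str) -> int:
    """Count entries in a string like '{TV,Internet,"Wireless Internet"}'."""
    inner = raw.strip().strip("{}")
    if not inner:
        return 0  # an empty set has no amenities
    # csv.reader respects double quotes, so commas inside quotes do not split.
    return len(next(csv.reader([inner])))

print(count_amenities('{TV,Internet,"Wireless Internet"}'))
print(count_amenities('{}'))
print(count_amenities('{TV,"Washer, Dryer"}'))  # invented entry with an embedded comma
```

On a DataFrame this could be applied with `boston_data['amenities'].apply(count_amenities)`.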

In [18]:
boston_data['host_response_rate'].unique()

array(['100%', nan, '88%', '96%', '92%', '94%', '99%', '93%', '54%',
       '33%', '80%', '70%', '67%', '81%', '90%', '98%', '25%', '86%',
       '97%', '0%', '50%', '75%', '87%', '60%', '77%', '46%', '55%',
       '59%', '83%', '79%', '89%', '64%', '10%', '73%', '68%', '95%',
       '20%', '56%', '78%'], dtype=object)

In [19]:
boston_data['host_response_rate'] = boston_data['host_response_rate'].str.strip('%').astype(float)
boston_data['host_response_rate'] = boston_data['host_response_rate']/100

In [20]:
boston_data.describe()

Unnamed: 0,log_price,amenities,accommodates,bathrooms,cleaning_fee,host_has_profile_pic,host_identity_verified,host_response_rate,instant_bookable,number_of_reviews,review_scores_rating,bedrooms,beds,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,bed_type_Airbed,bed_type_Couch,bed_type_Futon,bed_type_Pull-out Sofa,bed_type_Real Bed,cancellation_policy_flexible,cancellation_policy_moderate,cancellation_policy_strict,cancellation_policy_super_strict_30,cancellation_policy_super_strict_60,property_type_Apartment,property_type_Condominium,property_type_House,property_type_Other,property_type_Townhouse
count,3468.0,3468.0,3468.0,3462.0,3468.0,3468.0,3468.0,2887.0,3468.0,3468.0,2820.0,3465.0,3466.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0,3468.0
mean,4.884035,20.046136,3.301615,1.236857,0.764706,0.997116,0.608131,0.968916,0.33564,25.690311,93.597518,1.343723,1.761685,0.6188,0.367935,0.013264,0.008939,0.001442,0.008074,0.004902,0.976644,0.243368,0.235582,0.510957,0.009516,0.000577,0.68714,0.097751,0.162341,0.037197,0.015571
std,0.664692,7.910602,2.185942,0.509028,0.424244,0.053629,0.488238,0.113022,0.472282,45.103616,8.059291,0.881492,1.300588,0.485751,0.482313,0.11442,0.094136,0.037949,0.089504,0.069852,0.151054,0.429177,0.424424,0.499952,0.097097,0.024011,0.463725,0.29702,0.368817,0.189272,0.123826
min,2.833213,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,4.382027,15.0,2.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,91.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,4.912655,19.0,2.0,1.0,1.0,1.0,1.0,1.0,0.0,7.0,96.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
75%,5.298317,24.0,4.0,1.0,1.0,1.0,1.0,1.0,1.0,29.0,99.0,2.0,2.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
max,7.244228,78.0,16.0,6.0,1.0,1.0,1.0,1.0,1.0,380.0,100.0,10.0,16.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
