# Regression for Airbnb

##### We use linear regression to predict a listing's rental price from its location. The prediction can serve as a recommended price for hosts who want to put their flat up for rent: they can set the price based on the location and on other factors such as room type and amenities.
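
As a rough illustration of the idea, the sketch below fits an ordinary least squares model that mixes a categorical location with a numeric feature. The toy rows, the column values, and the helper name `fit_linear_price_model` are made up for illustration; they are not taken from the listings data, and the model built later may be set up differently.

```python
import numpy as np
import pandas as pd


def fit_linear_price_model(df, target="price"):
    """Fit ordinary least squares on one-hot encoded features (illustrative helper)."""
    # One-hot encode categorical columns; drop_first avoids a redundant dummy
    # that would be collinear with the intercept.
    X = pd.get_dummies(df.drop(columns=[target]), drop_first=True, dtype=float)
    X.insert(0, "intercept", 1.0)
    y = df[target].to_numpy(dtype=float)
    coef, *_ = np.linalg.lstsq(X.to_numpy(dtype=float), y, rcond=None)
    return pd.Series(coef, index=X.columns)


# Hypothetical toy data: price = 50 + 20 * accommodates, plus 30 in neighbourhood "B"
toy = pd.DataFrame({
    "neighbourhood": ["A", "A", "B", "B", "A"],
    "accommodates": [1, 2, 1, 3, 4],
    "price": [70.0, 90.0, 100.0, 140.0, 130.0],
})
coefficients = fit_linear_price_model(toy)
print(coefficients)
```

Because the toy prices are exactly linear in the features, the fitted coefficients recover the generating values. Real listings will not fit this cleanly.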

In [1]:
# Import basic libraries
import numpy as np
import pandas as pd

In [2]:
# Read the listing
airbnb_data = pd.read_csv("data/listings.csv")
airbnb_data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,71609,https://www.airbnb.com/rooms/71609,20221229070856,2022-12-29,city scrape,Ensuite Room (Room 1 & 2) near EXPO,For 3 rooms.Book room 1&2 and room 4<br /><br ...,,https://a0.muscache.com/pictures/24453191/3580...,367042,...,4.78,4.26,4.32,,f,6,0,6,0,0.15
1,71896,https://www.airbnb.com/rooms/71896,20221229070856,2022-12-29,city scrape,B&B Room 1 near Airport & EXPO,<b>The space</b><br />Vocational Stay Deluxe B...,,https://a0.muscache.com/pictures/2440674/ac4f4...,367042,...,4.43,4.17,4.04,,t,6,0,6,0,0.17
2,71903,https://www.airbnb.com/rooms/71903,20221229070856,2022-12-29,city scrape,Room 2-near Airport & EXPO,"Like your own home, 24hrs access.<br /><br /><...",Quiet and view of the playground with exercise...,https://a0.muscache.com/pictures/568743/7bc623...,367042,...,4.64,4.5,4.36,,f,6,0,6,0,0.33
3,275343,https://www.airbnb.com/rooms/275343,20221229070856,2022-12-29,city scrape,Amazing Room with window 10min to Redhill,Awesome location and host <br />Room near INSE...,,https://a0.muscache.com/pictures/miso/Hosting-...,1439258,...,4.42,4.53,4.63,S0399,f,46,2,44,0,0.19
4,275344,https://www.airbnb.com/rooms/275344,20221229070856,2022-12-29,city scrape,15 mins to Outram MRT Single Room,Lovely home for the special guest !<br /><br /...,Bus stop <br />Food center <br />Supermarket,https://a0.muscache.com/pictures/miso/Hosting-...,1439258,...,4.54,4.62,4.46,S0399,f,46,2,44,0,0.11


## Data Preparation and cleaning
##### We select the features that should help predict the price a host should set for a room.
##### The filtered dataframe is meant to keep the listing's neighbourhood, property_type, room_type, accommodates, bathrooms_text, beds, bedrooms, amenities (which will need to be converted to numerical or categorical data), price, minimum_nights, and availability_365.
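
One possible way to turn amenities into numeric features is sketched below. It assumes each amenities cell is a JSON-style list stored as a string, such as `'["Wifi", "Kitchen"]'`; the raw cell format is not shown here, so treat that as an assumption. The helper names `parse_amenities` and `amenity_flags` are hypothetical.

```python
import json


def parse_amenities(raw):
    """Turn an amenities string such as '["Wifi", "Kitchen"]' into a set of names."""
    try:
        items = json.loads(raw)
    except (TypeError, ValueError):
        return set()  # missing or malformed entry
    return {str(item).strip() for item in items} if isinstance(items, list) else set()


def amenity_flags(raw, wanted):
    """Return a 0/1 indicator per amenity in `wanted`, plus the total count."""
    found = parse_amenities(raw)
    flags = {f"has_{name.lower().replace(' ', '_')}": int(name in found) for name in wanted}
    flags["amenities_count"] = len(found)
    return flags


example = '["Wifi", "Kitchen", "Air conditioning"]'  # hypothetical cell value
print(amenity_flags(example, wanted=["Wifi", "Pool"]))
```

Indicator columns for a handful of common amenities plus an overall count keep the feature matrix small. A full one-hot encoding of every amenity would be much wider.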


In [3]:
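# Filter dataframe
# Note: DataFrame.filter silently skips labels that are not columns. listings.csv has
# 'neighbourhood_cleansed' but no 'host_neighbourhood_cleansed', so the location
# column is missing from filtered_data (11 columns remain, as listed below).
filtered_data = airbnb_data.filter(items=[
    'id', 'host_neighbourhood_cleansed', 'property_type', 'room_type',
    'accommodates', 'bathrooms_text', 'beds', 'bedrooms', 'amenities',
    'price', 'minimum_nights', 'availability_365',
])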
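
A stricter selection that fails loudly on an unknown column name could look like the sketch below. The helper `select_columns_strict` and the two-column toy frame are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd


def select_columns_strict(df, columns):
    """Select columns, raising if any requested label is absent (illustrative helper)."""
    missing = [col for col in columns if col not in df.columns]
    if missing:
        raise KeyError(f"columns not found: {missing}")
    return df[columns]


# Hypothetical two-column frame standing in for the listings data
toy = pd.DataFrame({"neighbourhood_cleansed": ["Tampines"], "price": ["$100.00"]})

# filter() quietly returns only the labels it can find
print(toy.filter(items=["host_neighbourhood_cleansed", "price"]).columns.tolist())

# The strict helper surfaces the unknown label instead
try:
    select_columns_strict(toy, ["host_neighbourhood_cleansed", "price"])
except KeyError as err:
    print(err)
```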

In [4]:
airbnb_data["neighbourhood_cleansed"].unique()

array(['Tampines', 'Bukit Merah', 'Newton', 'River Valley', 'Rochor',
       'Bukit Timah', 'Serangoon', 'Downtown Core', 'Marine Parade',
       'Outram', 'Bedok', 'Kallang', 'Novena', 'Pasir Ris', 'Ang Mo Kio',
       'Bukit Batok', 'Hougang', 'Woodlands', 'Singapore River',
       'Queenstown', 'Orchard', 'Museum', 'Tanglin', 'Geylang',
       'Toa Payoh', 'Sembawang', 'Bishan', 'Yishun', 'Jurong West',
       'Sengkang', 'Clementi', 'Jurong East', 'Punggol', 'Mandai',
       'Choa Chu Kang', 'Bukit Panjang', 'Southern Islands',
       'Western Water Catchment', 'Tuas', 'Sungei Kadut', 'Pioneer',
       'Central Water Catchment', 'Marina South'], dtype=object)

In [5]:
airbnb_data["accommodates"].unique()

array([ 6,  1,  2,  4,  8,  3,  5, 12,  7,  9, 16, 10, 13, 15,  0])

In [6]:
# Clean the data
# Check for Null Values
filtered_data.isnull().sum()

id                    0
property_type         0
room_type             0
accommodates          0
bathrooms_text       27
beds                 90
bedrooms            305
amenities             0
price                 0
minimum_nights        0
availability_365      0
dtype: int64

In [7]:
filtered_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3037 entries, 0 to 3036
Data columns (total 11 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                3037 non-null   int64  
 1   property_type     3037 non-null   object 
 2   room_type         3037 non-null   object 
 3   accommodates      3037 non-null   int64  
 4   bathrooms_text    3010 non-null   object 
 5   beds              2947 non-null   float64
 6   bedrooms          2732 non-null   float64
 7   amenities         3037 non-null   object 
 8   price             3037 non-null   object 
 9   minimum_nights    3037 non-null   int64  
 10  availability_365  3037 non-null   int64  
dtypes: float64(2), int64(4), object(5)
memory usage: 261.1+ KB
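
The summary above reports `price` as `object`, so it has to be converted to a number before it can be a regression target. The sketch below assumes the strings look like `"$1,234.50"` (a currency symbol and thousands separators); that format is not shown above, so treat it as an assumption. `parse_price` is a hypothetical helper.

```python
import re


def parse_price(raw):
    """Convert a price string such as '$1,234.50' to a float; return None if unparseable."""
    if raw is None:
        return None
    # Keep only digits and the decimal point
    cleaned = re.sub(r"[^0-9.]", "", str(raw))
    try:
        return float(cleaned)
    except ValueError:
        return None


print(parse_price("$1,234.50"))
print(parse_price("80"))
```

It could then be applied column-wise, for example with `filtered_data["price"].map(parse_price)`.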


##### Since bathrooms_text, beds, and bedrooms have missing values, we will examine whether these listings are similar to those of other room types, so that we can fill in the missing values.
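
If listings of the same room type do turn out to be similar, one common option is to fill each gap with the median of its group. This is a minimal sketch on hypothetical rows. Grouping by `room_type` and using the median are illustrative choices, not a method settled on here.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the real listings data has many more
toy = pd.DataFrame({
    "room_type": ["Private room", "Private room", "Private room",
                  "Entire home/apt", "Entire home/apt", "Entire home/apt"],
    "bedrooms": [1.0, np.nan, 1.0, 2.0, 3.0, np.nan],
})

# Median bedrooms within each room_type, aligned back to the original rows
group_median = toy.groupby("room_type")["bedrooms"].transform("median")

# Fill only the missing entries; observed values are left untouched
toy["bedrooms_filled"] = toy["bedrooms"].fillna(group_median)
print(toy)
```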

In [8]:
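# Show every row instead of pandas' truncated view
pd.set_option('display.max_rows', None)
# Print only the columns that contain at least one missing value
print(filtered_data.loc[:, filtered_data.isnull().any()])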

         bathrooms_text  beds  bedrooms
0        1 private bath   3.0       2.0
1      Shared half-bath   1.0       1.0
2      Shared half-bath   2.0       1.0
3                   NaN   1.0       1.0
4      Shared half-bath   1.0       1.0
5               3 baths   5.0       3.0
6         1 shared bath   1.0       1.0
7               0 baths   1.0       1.0
8        2 shared baths   1.0       NaN
9        4 shared baths   1.0       1.0
10     1.5 shared baths   1.0       1.0
11               1 bath   1.0       1.0
12        1 shared bath   1.0       1.0
13              0 baths   1.0       1.0
14               1 bath   1.0       1.0
15              2 baths   1.0       1.0
16              2 baths   1.0       1.0
17               1 bath   1.0       1.0
18               1 bath   1.0       1.0
19        1 shared bath   1.0       1.0
20        1 shared bath   1.0       1.0
21       1 private bath   1.0       1.0
22       2 shared baths   1.0       1.0
23                  NaN   1.0       1.0
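
The `bathrooms_text` values above mix a count with a shared/private qualifier, so they need parsing before they can be used as features. The sketch below is one way to do it. Treating "half-bath" as 0.5 is an assumption, and `parse_bathrooms` is a hypothetical helper.

```python
import re


def parse_bathrooms(text):
    """Split a bathrooms_text value into (count, is_shared, is_private)."""
    if not isinstance(text, str):
        return None, None, None  # missing value (NaN / None)
    lowered = text.lower()
    match = re.search(r"\d+(?:\.\d+)?", lowered)
    if match:
        count = float(match.group())
    elif "half-bath" in lowered:
        count = 0.5  # e.g. "Shared half-bath", "Half-bath"
    else:
        count = None
    return count, "shared" in lowered, "private" in lowered


for value in ["1 private bath", "Shared half-bath", "1.5 shared baths", "0 baths", None]:
    print(value, "->", parse_bathrooms(value))
```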


In [36]:
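# keep=False drops every row whose combination occurs more than once,
# leaving only the combinations that appear exactly once
df1 = filtered_data.drop_duplicates(subset=["bathrooms_text", "beds","bedrooms","accommodates"], keep=False)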
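
The `keep` argument changes the result considerably, which the following sketch shows on a hypothetical four-row frame.

```python
import pandas as pd

# Hypothetical rows: the first two share the same (beds, bedrooms) combination
toy = pd.DataFrame({
    "beds": [1.0, 1.0, 2.0, 3.0],
    "bedrooms": [1.0, 1.0, 1.0, 2.0],
})

# keep="first" (the default) retains one representative of each combination
one_per_combo = toy.drop_duplicates(subset=["beds", "bedrooms"], keep="first")

# keep=False discards every row whose combination occurs more than once
unique_only = toy.drop_duplicates(subset=["beds", "bedrooms"], keep=False)

print(one_per_combo.index.tolist())
print(unique_only.index.tolist())
```

With `keep="first"` one row per combination survives, which is usually what is wanted when listing the distinct combinations. With `keep=False` the most common combinations disappear entirely.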

In [37]:
print(df1.loc[:, df1.isnull().any()])

         bathrooms_text  beds  bedrooms
0        1 private bath   3.0       2.0
2      Shared half-bath   2.0       1.0
5               3 baths   5.0       3.0
8        2 shared baths   1.0       NaN
25    Private half-bath   1.0       1.0
40              4 baths  12.0       4.0
45              8 baths   3.0       1.0
55              4 baths   4.0       1.0
57               1 bath   3.0       1.0
73            Half-bath   2.0       1.0
77              8 baths   1.0       1.0
78              8 baths   1.0       1.0
104      2 shared baths   3.0       1.0
122      1 private bath   3.0       1.0
123      1 private bath   3.0       1.0
136              1 bath   3.0       1.0
154             0 baths   2.0       1.0
156             8 baths   1.0       1.0
162             3 baths   3.0       2.0
164      3 shared baths   5.0       5.0
169             8 baths   2.0       1.0
183           1.5 baths   5.0       3.0
218             5 baths   9.0       1.0
219             5 baths   6.0       1.0


In [None]:
# Could make use of ANOVA here
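
The note does not say what the ANOVA would compare. One possible reading is testing whether a numeric column, such as bedrooms or price, differs on average across categories like room type. Under that assumption, a one-way ANOVA F statistic can be computed as sketched below. The function name and the sample groups are hypothetical.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    # Between-group sum of squares: how far each group mean sits from the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of values around their own group mean
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical values for three categories
f_stat = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 3))
```

A large F value suggests the group means differ by more than the within-group spread would explain. If SciPy is available, `scipy.stats.f_oneway` computes the same statistic together with a p-value.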