In [None]:
# !pip install scikit-learn==0.20.4

Collecting scikit-learn==0.20.4
  Downloading https://files.pythonhosted.org/packages/96/5b/5da31a6572dc6b7b2846a7cfcbe2e060a0e6af0e1059a6516965e40371b7/scikit_learn-0.20.4-cp36-cp36m-manylinux1_x86_64.whl (5.4MB)
Installing collected packages: scikit-learn
  Found existing installation: scikit-learn 0.19.1
    Uninstalling scikit-learn-0.19.1:
      Successfully uninstalled scikit-learn-0.19.1
Successfully installed scikit-learn-0.20.4


# Introduction

In this notebook we will focus on the listings dataset for the city of Seattle as part of the project for the Udacity course in Data Science. We will analyze the data and evaluate the importance of distinct features in the dataset.

The dataset for this analysis can be found here:
 - https://www.kaggle.com/airbnb/seattle/data

# Setup

This notebook was created in Google Colab (https://colab.research.google.com/). 

In order for this to work you need a Google Account. At the time this notebook was created I was able to run it on the Free Tier.

The following two cells allow the notebook to access the owner's Google Drive and the directory that stores the CSV file along with a custom Python module. This module, `helper_functions.py`, contains functions to clean up the data. See its docstrings for an explanation of each function.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
# Adding custom packages dir
import sys
sys.path.append('/content/drive/<my_folder>')

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
from IPython import display
import helper_functions as f

from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import RandomForestRegressor

%matplotlib inline

In [None]:
# Setting display configuration
pd.set_option("display.max_rows", None, "display.max_columns", None)
plt.rcParams['figure.figsize'] = [27, 18]

In [None]:
listings_seattle = pd.read_csv('/content/drive/<my_folder>/data/listings_seattle.csv')

# Seattle

Let's look at the features available in the Seattle listings dataset, try to predict the price, and check the importance of each feature.

First, let's take a look at the dataset.

In [None]:
listings_seattle.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,notes,transit,thumbnail_url,medium_url,picture_url,xl_picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,market,smart_location,country_code,country,latitude,longitude,is_location_exact,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,square_feet,price,weekly_price,monthly_price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,requires_license,license,jurisdiction_names,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,reviews_per_month
0,241032,https://www.airbnb.com/rooms/241032,20160104002432,2016-01-04,Stylish Queen Anne Apartment,,Make your self at home in this charming one-be...,Make your self at home in this charming one-be...,none,,,,,,https://a1.muscache.com/ac/pictures/67560560/c...,,956883,https://www.airbnb.com/users/show/956883,Maija,2011-08-11,"Seattle, Washington, United States","I am an artist, interior designer, and run a s...",within a few hours,96%,100%,f,https://a0.muscache.com/ac/users/956883/profil...,https://a0.muscache.com/ac/users/956883/profil...,Queen Anne,3.0,3.0,"['email', 'phone', 'reviews', 'kba']",t,t,"Gilman Dr W, Seattle, WA 98119, United States",Queen Anne,West Queen Anne,Queen Anne,Seattle,WA,98119,Seattle,"Seattle, WA",US,United States,47.636289,-122.371025,t,Apartment,Entire home/apt,4,1.0,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",,$85.00,,,,,2,$5.00,1,365,4 weeks ago,t,14,41,71,346,2016-01-04,207,2011-11-01,2016-01-02,95.0,10.0,10.0,10.0,10.0,9.0,10.0,f,,WASHINGTON,f,moderate,f,f,2,4.07
1,953595,https://www.airbnb.com/rooms/953595,20160104002432,2016-01-04,Bright & Airy Queen Anne Apartment,Chemically sensitive? We've removed the irrita...,"Beautiful, hypoallergenic apartment in an extr...",Chemically sensitive? We've removed the irrita...,none,"Queen Anne is a wonderful, truly functional vi...",What's up with the free pillows? Our home was...,"Convenient bus stops are just down the block, ...",https://a0.muscache.com/ac/pictures/14409893/f...,https://a0.muscache.com/im/pictures/14409893/f...,https://a0.muscache.com/ac/pictures/14409893/f...,https://a0.muscache.com/ac/pictures/14409893/f...,5177328,https://www.airbnb.com/users/show/5177328,Andrea,2013-02-21,"Seattle, Washington, United States",Living east coast/left coast/overseas. Time i...,within an hour,98%,100%,t,https://a0.muscache.com/ac/users/5177328/profi...,https://a0.muscache.com/ac/users/5177328/profi...,Queen Anne,6.0,6.0,"['email', 'phone', 'facebook', 'linkedin', 're...",t,t,"7th Avenue West, Seattle, WA 98119, United States",Queen Anne,West Queen Anne,Queen Anne,Seattle,WA,98119,Seattle,"Seattle, WA",US,United States,47.639123,-122.365666,t,Apartment,Entire home/apt,4,1.0,1.0,1.0,Real Bed,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",,$150.00,"$1,000.00","$3,000.00",$100.00,$40.00,1,$0.00,2,90,today,t,13,13,16,291,2016-01-04,43,2013-08-19,2015-12-29,96.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,WASHINGTON,f,strict,t,t,6,1.48
2,3308979,https://www.airbnb.com/rooms/3308979,20160104002432,2016-01-04,New Modern House-Amazing water view,New modern house built in 2013. Spectacular s...,"Our house is modern, light and fresh with a wa...",New modern house built in 2013. Spectacular s...,none,Upper Queen Anne is a charming neighborhood fu...,Our house is located just 5 short blocks to To...,A bus stop is just 2 blocks away. Easy bus a...,,,https://a2.muscache.com/ac/pictures/b4324e0f-a...,,16708587,https://www.airbnb.com/users/show/16708587,Jill,2014-06-12,"Seattle, Washington, United States",i love living in Seattle. i grew up in the mi...,within a few hours,67%,100%,f,https://a1.muscache.com/ac/users/16708587/prof...,https://a1.muscache.com/ac/users/16708587/prof...,Queen Anne,2.0,2.0,"['email', 'phone', 'google', 'reviews', 'jumio']",t,t,"West Lee Street, Seattle, WA 98119, United States",Queen Anne,West Queen Anne,Queen Anne,Seattle,WA,98119,Seattle,"Seattle, WA",US,United States,47.629724,-122.369483,t,House,Entire home/apt,11,4.5,5.0,7.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",,$975.00,,,"$1,000.00",$300.00,10,$25.00,4,30,5 weeks ago,t,1,6,17,220,2016-01-04,20,2014-07-30,2015-09-03,97.0,10.0,10.0,10.0,10.0,10.0,10.0,f,,WASHINGTON,f,strict,f,f,2,1.15
3,7421966,https://www.airbnb.com/rooms/7421966,20160104002432,2016-01-04,Queen Anne Chateau,A charming apartment that sits atop Queen Anne...,,A charming apartment that sits atop Queen Anne...,none,,,,,,https://a0.muscache.com/ac/pictures/94146944/6...,,9851441,https://www.airbnb.com/users/show/9851441,Emily,2013-11-06,"Seattle, Washington, United States",,,,,f,https://a2.muscache.com/ac/users/9851441/profi...,https://a2.muscache.com/ac/users/9851441/profi...,Queen Anne,1.0,1.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"8th Avenue West, Seattle, WA 98119, United States",Queen Anne,West Queen Anne,Queen Anne,Seattle,WA,98119,Seattle,"Seattle, WA",US,United States,47.638473,-122.369279,t,Apartment,Entire home/apt,3,1.0,0.0,2.0,Real Bed,"{Internet,""Wireless Internet"",Kitchen,""Indoor ...",,$100.00,$650.00,"$2,300.00",,,1,$0.00,1,1125,6 months ago,t,0,0,0,143,2016-01-04,0,,,,,,,,,,f,,WASHINGTON,f,flexible,f,f,1,
4,278830,https://www.airbnb.com/rooms/278830,20160104002432,2016-01-04,Charming craftsman 3 bdm house,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,Cozy family craftman house in beautiful neighb...,none,We are in the beautiful neighborhood of Queen ...,Belltown,The nearest public transit bus (D Line) is 2 b...,,,https://a1.muscache.com/ac/pictures/6120468/b0...,,1452570,https://www.airbnb.com/users/show/1452570,Emily,2011-11-29,"Seattle, Washington, United States","Hi, I live in Seattle, Washington but I'm orig...",within an hour,100%,,f,https://a0.muscache.com/ac/users/1452570/profi...,https://a0.muscache.com/ac/users/1452570/profi...,Queen Anne,2.0,2.0,"['email', 'phone', 'facebook', 'reviews', 'kba']",t,t,"14th Ave W, Seattle, WA 98119, United States",Queen Anne,West Queen Anne,Queen Anne,Seattle,WA,98119,Seattle,"Seattle, WA",US,United States,47.632918,-122.372471,t,House,Entire home/apt,6,2.0,3.0,3.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",,$450.00,,,$700.00,$125.00,6,$15.00,1,1125,7 weeks ago,t,30,60,90,365,2016-01-04,38,2012-07-10,2015-10-24,92.0,9.0,9.0,10.0,10.0,9.0,9.0,f,,WASHINGTON,f,strict,f,f,1,0.89


## List of Attributes

After inspecting the dataset I came up with a list of potentially important features that might help us predict the price. This list will be the first input to our model.

Input:

    - host_since
    - host_response_rate
    - host_acceptance_rate
    - host_is_superhost
    - host_total_listings_count
    - host_identity_verified
    - neighbourhood_group_cleansed
    - property_type
    - room_type
    - accommodates
    - bathrooms
    - bedrooms
    - bed_type
    - amenities
    - square_feet
    - guests_included
    - number_of_reviews
    - cancellation_policy
    
Output:

    - price

In [None]:
df = listings_seattle[['host_since',
                      'host_response_rate',
                      'host_acceptance_rate',
                      'host_is_superhost',
                      'host_total_listings_count',
                      'host_identity_verified',
                      'neighbourhood_group_cleansed',
                      'property_type',
                      'room_type',
                      'accommodates',
                      'bathrooms',
                      'bedrooms',
                      'bed_type',
                      'amenities',
                      'square_feet',
                      'guests_included',
                      'number_of_reviews',
                      'cancellation_policy',
                      'price']]

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3818 entries, 0 to 3817
Data columns (total 19 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   host_since                    3816 non-null   object 
 1   host_response_rate            3295 non-null   object 
 2   host_acceptance_rate          3045 non-null   object 
 3   host_is_superhost             3816 non-null   object 
 4   host_total_listings_count     3816 non-null   float64
 5   host_identity_verified        3816 non-null   object 
 6   neighbourhood_group_cleansed  3818 non-null   object 
 7   property_type                 3817 non-null   object 
 8   room_type                     3818 non-null   object 
 9   accommodates                  3818 non-null   int64  
 10  bathrooms                     3802 non-null   float64
 11  bedrooms                      3812 non-null   float64
 12  bed_type                      3818 non-null   object 
 13  ame

As we can see above, we will need to transform some of the data before we can work with it properly.

The transformations will allow us to use the attributes to fit a regression model. We are using a regression model because we want to predict a continuous value. As a baseline, we will start with the simplest regression model: a linear regressor.

One of the main assumptions made by the linear regressor is the linearity assumption. It states that the relationship between X and the mean of Y is linear.

This is a very strong assumption that often does not hold in real life, as our attributes might not be linearly related to our output. This will be explored further later; for now, let's use the linear regressor.
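
To make the linearity assumption concrete, the following minimal sketch fits a straight line to two synthetic targets, one truly linear and one quadratic, and compares the resulting R² values. The data and the helper `r2_of_line_fit` are illustrative assumptions; they are not part of the Airbnb dataset or of this notebook's pipeline.

```python
import numpy as np

# Synthetic inputs (assumption: purely illustrative, not taken from the listings data)
x = np.linspace(0, 10, 50)
y_linear = 3.0 * x + 2.0   # satisfies the linearity assumption exactly
y_curved = x ** 2          # violates it: the mean of y is quadratic in x


def r2_of_line_fit(x, y):
    """Fit y = slope * x + intercept by least squares and return R^2."""
    slope, intercept = np.polyfit(x, y, deg=1)
    predicted = slope * x + intercept
    ss_res = np.sum((y - predicted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


r2_linear = r2_of_line_fit(x, y_linear)
r2_curved = r2_of_line_fit(x, y_curved)
print(f"R^2 on linear target: {r2_linear:.4f}")
print(f"R^2 on curved target: {r2_curved:.4f}")
```

The straight line explains the linear target completely, while part of the variance of the curved target is left unexplained, which is the kind of gap we should expect if some attributes relate to price non-linearly.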


---



List of Transformations:

    - host_since -> calculate the number of days the host has been a member of the platform. This is the difference between the host_since date and the last_scraped date, which is 2016-01-04. (object -> int)
    - host_response_rate -> transform the object into a float representing the rate
    - host_acceptance_rate -> transform the object into a float representing the rate
    - host_is_superhost -> transform the object into int
    - host_total_listings_count -> transform float to int
    - host_identity_verified -> transform the object into int
    - neighbourhood_group_cleansed -> one-hot encoding
    - property_type -> one-hot encoding
    - room_type -> one-hot encoding
    - accommodates -> No transformation
    - bathrooms -> transform float to int
    - bedrooms -> transform float to int
    - bed_type -> one-hot encoding
    - amenities -> split the list into individual amenities and one-hot encode them
    - square_feet -> No transformation
    - guests_included -> No transformation
    - number_of_reviews -> No transformation
    - cancellation_policy -> one-hot encoding
    
    - price -> transform object to int. We will work with cent values.
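
As a minimal sketch of the two simplest conversions in the list above, the snippet below turns percentage strings into float rates and `t`/`f` flags into integers. The toy values are made up to match the format seen in the raw listings, and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (assumption: mimics the raw string formats, not real rows)
sample = pd.DataFrame({
    "host_response_rate": ["96%", "100%", None],
    "host_is_superhost": ["f", "t", "f"],
})

# "96%" -> 0.96; missing values stay missing (NaN)
rate = pd.to_numeric(sample["host_response_rate"].str.rstrip("%")) / 100

# "t" -> 1, anything else -> 0
superhost = (sample["host_is_superhost"] == "t").astype(int)

print(rate.tolist())
print(superhost.tolist())
```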

In [None]:
df.describe()

Unnamed: 0,host_total_listings_count,accommodates,bathrooms,bedrooms,square_feet,guests_included,number_of_reviews
count,3816.0,3818.0,3802.0,3812.0,97.0,3818.0,3818.0
mean,7.157757,3.349398,1.259469,1.307712,854.618557,1.672603,22.223415
std,28.628149,1.977599,0.590369,0.883395,671.404893,1.31104,37.730892
min,1.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,1.0,2.0,1.0,1.0,420.0,1.0,2.0
50%,1.0,3.0,1.0,1.0,750.0,1.0,9.0
75%,3.0,4.0,1.0,2.0,1200.0,2.0,26.0
max,502.0,16.0,8.0,7.0,3000.0,15.0,474.0


In [None]:
df.head()

Unnamed: 0,host_since,host_response_rate,host_acceptance_rate,host_is_superhost,host_total_listings_count,host_identity_verified,neighbourhood_group_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,bed_type,amenities,square_feet,guests_included,number_of_reviews,cancellation_policy,price
0,2011-08-11,96%,100%,f,3.0,t,Queen Anne,Apartment,Entire home/apt,4,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",,2,207,moderate,$85.00
1,2013-02-21,98%,100%,t,6.0,t,Queen Anne,Apartment,Entire home/apt,4,1.0,1.0,Real Bed,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",,1,43,strict,$150.00
2,2014-06-12,67%,100%,f,2.0,t,Queen Anne,House,Entire home/apt,11,4.5,5.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",,10,20,strict,$975.00
3,2013-11-06,,,f,1.0,t,Queen Anne,Apartment,Entire home/apt,3,1.0,0.0,Real Bed,"{Internet,""Wireless Internet"",Kitchen,""Indoor ...",,1,0,flexible,$100.00
4,2011-11-29,100%,,f,2.0,t,Queen Anne,House,Entire home/apt,6,2.0,3.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",,6,38,strict,$450.00


In [None]:
df.shape

(3818, 19)

## Handling Missing Values

In [None]:
df.isna().sum()

host_since                         2
host_response_rate               523
host_acceptance_rate             773
host_is_superhost                  2
host_total_listings_count          2
host_identity_verified             2
neighbourhood_group_cleansed       0
property_type                      1
room_type                          0
accommodates                       0
bathrooms                         16
bedrooms                           6
bed_type                           0
amenities                          0
square_feet                     3721
guests_included                    0
number_of_reviews                  0
cancellation_policy                0
price                              0
dtype: int64

We can see that most of the chosen attributes have only a few missing values, with the exception of host_response_rate (523 missing), host_acceptance_rate (773 missing), to a lesser extent bathrooms (16 missing) and, most notably, square_feet (3,721 missing out of 3,818 listings). Intuitively, the size of an apartment says a lot about the rental price, but other attributes such as the number of bedrooms, the number of bathrooms and the number of guests (accommodates) might compensate for the lack of information about the apartment's size.

As a first step we will remove the square_feet column entirely.
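
One way to make this kind of decision explicit is to look at the share of missing values per column rather than the raw counts. The sketch below does this on a toy frame; the 0.5 cut-off and the variable names are assumptions chosen for illustration, not a rule used elsewhere in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: illustrative values only)
toy = pd.DataFrame({
    "square_feet": [np.nan, np.nan, np.nan, 800.0],
    "bedrooms": [1.0, 2.0, np.nan, 1.0],
    "accommodates": [2, 4, 3, 2],
})

# Fraction of missing values in each column
missing_share = toy.isna().mean()

# Columns that are missing in more than half of the rows (hypothetical threshold)
mostly_missing = missing_share[missing_share > 0.5].index.tolist()
reduced = toy.drop(columns=mostly_missing)

print(missing_share)
print(mostly_missing)
```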

In [None]:
df_fs1 = df.drop(['square_feet'], axis=1)

Regarding the columns host_response_rate and host_acceptance_rate, a few things might explain the missing values. Since both columns represent rates, the missing values might correspond to listings that never received an inquiry or a booking request. As we are trying to explain the importance of certain attributes when setting a rental price, we will first remove both columns from our dataset and evaluate how our model performs.

In [None]:
df_fs2 = df_fs1.drop(['host_response_rate', 'host_acceptance_rate'], axis=1)

Lastly, we will remove the rows containing missing values for our baseline model.

In [None]:
df_fs3 = df_fs2.dropna()

In [None]:
df_fs3 = df_fs3.reset_index()

In [None]:
print('Ratio between number of datapoints before and after handling missing values:')
print(f'{df_fs3.shape[0]/df.shape[0]:.2f}')

Ratio between number of datapoints before and after handling missing values:
0.99


We can see that we have kept 99% of the data points (3,793 of 3,818 rows).

Now we will perform the transformations discussed previously on our df_fs3 dataset. 

## Transformations

In the following cells we will create a DataFrame called `df_trans` that will hold the transformed features listed above.



In [None]:
df_fs3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3793 entries, 0 to 3792
Data columns (total 17 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   index                         3793 non-null   int64  
 1   host_since                    3793 non-null   object 
 2   host_is_superhost             3793 non-null   object 
 3   host_total_listings_count     3793 non-null   float64
 4   host_identity_verified        3793 non-null   object 
 5   neighbourhood_group_cleansed  3793 non-null   object 
 6   property_type                 3793 non-null   object 
 7   room_type                     3793 non-null   object 
 8   accommodates                  3793 non-null   int64  
 9   bathrooms                     3793 non-null   float64
 10  bedrooms                      3793 non-null   float64
 11  bed_type                      3793 non-null   object 
 12  amenities                     3793 non-null   object 
 13  gue

In [None]:
df_fs3.head()

Unnamed: 0,index,host_since,host_is_superhost,host_total_listings_count,host_identity_verified,neighbourhood_group_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,bed_type,amenities,guests_included,number_of_reviews,cancellation_policy,price
0,0,2011-08-11,f,3.0,t,Queen Anne,Apartment,Entire home/apt,4,1.0,1.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",2,207,moderate,$85.00
1,1,2013-02-21,t,6.0,t,Queen Anne,Apartment,Entire home/apt,4,1.0,1.0,Real Bed,"{TV,Internet,""Wireless Internet"",Kitchen,""Free...",1,43,strict,$150.00
2,2,2014-06-12,f,2.0,t,Queen Anne,House,Entire home/apt,11,4.5,5.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",""A...",10,20,strict,$975.00
3,3,2013-11-06,f,1.0,t,Queen Anne,Apartment,Entire home/apt,3,1.0,0.0,Real Bed,"{Internet,""Wireless Internet"",Kitchen,""Indoor ...",1,0,flexible,$100.00
4,4,2011-11-29,f,2.0,t,Queen Anne,House,Entire home/apt,6,2.0,3.0,Real Bed,"{TV,""Cable TV"",Internet,""Wireless Internet"",Ki...",6,38,strict,$450.00


In [None]:
df_trans = pd.DataFrame()
# The basedate is the day the data was scraped from the Airbnb website
basedate = pd.Timestamp('2016-01-04')
# host_since
df_trans['host_since_days'] = df_fs3['host_since'].apply(lambda date: (basedate - pd.Timestamp(date)).days)
# host_is_superhost
df_trans['host_is_superhost'] = df_fs3['host_is_superhost'].apply(lambda superhost: 1 if superhost == 't' else 0)
# host_total_listings_count
df_trans['host_total_listings_count'] = df_fs3['host_total_listings_count'].apply(lambda listings: int(listings))
# host_identity_verified
df_trans['host_identity_verified'] = df_fs3['host_identity_verified'].apply(lambda identity: 1 if identity == 't' else 0)
# neighbourhood_group_cleansed
df_trans = df_trans.merge(pd.get_dummies(df_fs3['neighbourhood_group_cleansed']), left_index=True, right_index=True)
# property_type
df_trans = df_trans.merge(pd.get_dummies(df_fs3['property_type']), left_index=True, right_index=True)
# room_type
df_trans = df_trans.merge(pd.get_dummies(df_fs3['room_type']), left_index=True, right_index=True)
# accommodates
df_trans['accommodates'] = df_fs3['accommodates']
# bathrooms
df_trans['bathrooms'] = df_fs3['bathrooms'].apply(lambda bathrooms: int(bathrooms))
# bedrooms
df_trans['bedrooms'] = df_fs3['bedrooms'].apply(lambda bedrooms: int(bedrooms))
# bed_type
df_trans = df_trans.merge(pd.get_dummies(df_fs3['bed_type']), left_index=True, right_index=True)
# amenities
df_trans = df_trans.merge(f.one_hot_encode_amenities(df_fs3), left_index=True, right_index=True)
# guests_included
df_trans['guests_included'] = df_fs3['guests_included']
# number_of_reviews
df_trans['number_of_reviews'] = df_fs3['number_of_reviews']
# cancellation_policy
df_trans = df_trans.merge(pd.get_dummies(df_fs3['cancellation_policy']), left_index=True, right_index=True)

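# price in cents: strip '$', the thousands separator and the decimal point
# (this assumes every price string has exactly two decimal places)
price = df_fs3.price.apply(lambda price: int(price.strip('$').replace('.', '').replace(',', '')))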
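
The helper `f.one_hot_encode_amenities` used above lives in `helper_functions.py` and is not shown in this notebook. As a rough idea of what such a helper could look like, here is a hypothetical sketch built on `MultiLabelBinarizer`; the function names `parse_amenities` and `one_hot_encode_amenities_sketch` are assumptions, and the actual helper may be implemented differently.

```python
import csv

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def parse_amenities(raw):
    """Turn '{TV,"Cable TV",Internet}' into ['TV', 'Cable TV', 'Internet']."""
    inner = raw.strip("{}")
    if not inner:
        return []
    # csv.reader handles the quoted entries that contain spaces
    return next(csv.reader([inner]))


def one_hot_encode_amenities_sketch(df, column="amenities"):
    """Hypothetical stand-in: one 0/1 column per distinct amenity."""
    parsed = df[column].apply(parse_amenities)
    mlb = MultiLabelBinarizer()
    encoded = mlb.fit_transform(parsed)
    return pd.DataFrame(encoded, columns=mlb.classes_, index=df.index)


# Toy input (assumption: same string format as the amenities column)
toy = pd.DataFrame({
    "amenities": ['{TV,"Cable TV",Internet}', "{Internet,Kitchen}", "{}"],
})
amenity_dummies = one_hot_encode_amenities_sketch(toy)
print(amenity_dummies)
```

Keeping the original index on the encoded frame is what allows an index-based merge with the other transformed columns.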

In [None]:
df_trans.head()

Unnamed: 0,host_since_days,host_is_superhost,host_total_listings_count,host_identity_verified,Ballard,Beacon Hill,Capitol Hill,Cascade,Central Area,Delridge,Downtown,Interbay,Lake City,Magnolia,Northgate,Other neighborhoods,Queen Anne,Rainier Valley,Seward Park,University District,West Seattle,Apartment,Bed & Breakfast,Boat,Bungalow,Cabin,Camper/RV,Chalet,Condominium,Dorm,House,Loft,Other,Tent,Townhouse,Treehouse,Yurt,Entire home/apt,Private room,Shared room,accommodates,bathrooms,bedrooms,Airbed,Couch,Futon,Pull-out Sofa,Real Bed,24-Hour Check-in,Air Conditioning,Breakfast,Buzzer/Wireless Intercom,Cable TV,Carbon Monoxide Detector,Cat(s),Dog(s),Doorman,Dryer,Elevator in Building,Essentials,Family/Kid Friendly,Fire Extinguisher,First Aid Kit,Free Parking on Premises,Gym,Hair Dryer,Hangers,Heating,Hot Tub,Indoor Fireplace,Internet,Iron,Kitchen,Laptop Friendly Workspace,Lock on Bedroom Door,Other pet(s),Pets Allowed,Pets live on this property,Pool,Safety Card,Shampoo,Smoke Detector,Smoking Allowed,Suitable for Events,TV,Washer,Washer / Dryer,Wheelchair Accessible,Wireless Internet,guests_included,number_of_reviews,flexible,moderate,strict
0,1607,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,4,1,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,2,207,0,1,0
1,1047,1,6,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,4,1,1,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,1,1,1,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,1,1,0,0,1,1,43,0,0,1
2,571,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,11,4,5,0,0,0,0,1,0,1,0,0,1,1,1,1,0,1,0,1,1,0,0,1,0,0,0,1,1,1,1,0,1,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,10,20,0,0,1
3,789,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,3,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,1,1,0,0,0,0,0,1,0,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,1,1,0,1,0,0
4,1497,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,6,2,3,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,1,6,38,0,0,1


In [None]:
price.head()

0     8500
1    15000
2    97500
3    10000
4    45000
Name: price, dtype: int64
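
The cent values above come from deleting the `$`, `,` and `.` characters, which works when every price string has exactly two decimal places, as in the rows shown. The sketch below contrasts that approach with a hypothetical alternative based on `Decimal` that also handles strings such as `$85.5` or `$85`; both function names are assumptions introduced for illustration.

```python
from decimal import Decimal


def strip_punctuation_to_cents(raw):
    """Mirror of the approach used in the notebook cell."""
    return int(raw.strip("$").replace(".", "").replace(",", ""))


def price_to_cents(raw):
    """Hypothetical alternative: parse the amount, then convert to cents."""
    amount = Decimal(raw.replace("$", "").replace(",", ""))
    return int(amount * 100)


for text in ["$85.00", "$1,000.00", "$85.5"]:
    print(text, strip_punctuation_to_cents(text), price_to_cents(text))
```

The two functions agree on well-formed prices and only differ when the number of decimal places varies.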

In [None]:
df_trans.shape

(3793, 94)

In [None]:
price.shape

(3793,)

As we can see, we are working with 3,793 samples. This number might be a bit low for an ML model, but let's build our baseline first and see how it performs before trying to get more samples.

### Generating Polynomial Features

I will generate a new column based on the ratio between the columns `accommodates` and `bedrooms`. This ratio will then be multiplied by each one-hot neighborhood column, producing one interaction feature per neighborhood, to give us an idea of price based on the size of each listing and its neighborhood.

In [None]:
def people_per_room_neighborhood(df):
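  '''
    This function adds the column ratio_accommodates_bedrooms, the ratio
    between the columns accommodates and bedrooms.
    Rows with bedrooms == 0 produce an infinite ratio.
    Parameters:
      df (pandas.DataFrame): modified in place
    Returns:
      df (pandas.DataFrame): Frame with new column
  '''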

  df['ratio_accommodates_bedrooms'] = df['accommodates']/df['bedrooms']

  return df

In [None]:
def ratio_neighborhood(df, list_of_neighborhoods):
  '''
    This function will generate a new column for each neighborhood 
    (list_of_neighborhoods) based on the column
    ratio_accommodates_bedrooms
    Parameters:
      df (pandas.DataFrame)
      list_of_neighborhoods (List)

    Return:
      df (pandas.DataFrame)
  '''

  for column in list_of_neighborhoods:
    df[f'ratio_{column}'] = df['ratio_accommodates_bedrooms']*df[f'{column}']

  return df

In [None]:
list_of_neighborhoods = ['Ballard', 'Beacon Hill', 'Capitol Hill',
       'Cascade', 'Central Area', 'Delridge', 'Downtown', 'Interbay',
       'Lake City', 'Magnolia', 'Northgate', 'Other neighborhoods',
       'Queen Anne', 'Rainier Valley', 'Seward Park', 'University District',
       'West Seattle']

In [None]:
df_trans = people_per_room_neighborhood(df_trans)

In [None]:
df_trans = ratio_neighborhood(df_trans, list_of_neighborhoods)

In [None]:
df_trans.head()

Unnamed: 0,host_since_days,host_is_superhost,host_total_listings_count,host_identity_verified,Ballard,Beacon Hill,Capitol Hill,Cascade,Central Area,Delridge,Downtown,Interbay,Lake City,Magnolia,Northgate,Other neighborhoods,Queen Anne,Rainier Valley,Seward Park,University District,West Seattle,Apartment,Bed & Breakfast,Boat,Bungalow,Cabin,Camper/RV,Chalet,Condominium,Dorm,House,Loft,Other,Tent,Townhouse,Treehouse,Yurt,Entire home/apt,Private room,Shared room,accommodates,bathrooms,bedrooms,Airbed,Couch,Futon,Pull-out Sofa,Real Bed,24-Hour Check-in,Air Conditioning,Breakfast,Buzzer/Wireless Intercom,Cable TV,Carbon Monoxide Detector,Cat(s),Dog(s),Doorman,Dryer,Elevator in Building,Essentials,Family/Kid Friendly,Fire Extinguisher,First Aid Kit,Free Parking on Premises,Gym,Hair Dryer,Hangers,Heating,Hot Tub,Indoor Fireplace,Internet,Iron,Kitchen,Laptop Friendly Workspace,Lock on Bedroom Door,Other pet(s),Pets Allowed,Pets live on this property,Pool,Safety Card,Shampoo,Smoke Detector,Smoking Allowed,Suitable for Events,TV,Washer,Washer / Dryer,Wheelchair Accessible,Wireless Internet,guests_included,number_of_reviews,flexible,moderate,strict,ratio_accommodates_bedrooms,ratio_Ballard,ratio_Beacon Hill,ratio_Capitol Hill,ratio_Cascade,ratio_Central Area,ratio_Delridge,ratio_Downtown,ratio_Interbay,ratio_Lake City,ratio_Magnolia,ratio_Northgate,ratio_Other neighborhoods,ratio_Queen Anne,ratio_Rainier Valley,ratio_Seward Park,ratio_University District,ratio_West Seattle
0,1607,0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,4,1,1,0,0,0,0,1,0,1,0,0,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,2,207,0,1,0,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
1,1047,1,6,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,4,1,1,0,0,0,0,1,0,0,0,1,0,1,0,0,0,1,0,1,1,1,1,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1,0,1,0,0,1,1,0,0,1,1,43,0,0,1,4.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0
2,571,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,11,4,5,0,0,0,0,1,0,1,0,0,1,1,1,1,0,1,0,1,1,0,0,1,0,0,0,1,1,1,1,0,1,0,0,0,1,1,0,0,1,1,0,0,1,1,0,0,1,10,20,0,0,1,2.2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.2,0.0,0.0,0.0,0.0
3,789,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,3,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,1,1,1,0,0,0,0,0,1,0,1,1,0,1,0,0,0,0,0,0,1,1,1,0,0,0,1,0,0,1,1,0,1,0,0,inf,,,,,,,,,,,,,inf,,,,
4,1497,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,6,2,3,0,0,0,0,1,0,0,0,0,1,1,0,0,0,0,0,1,1,1,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0,1,6,38,0,0,1,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0


### Excluding listings with no bedrooms

Listings with no bedrooms produce infinite ratios (division by zero) and NaN values (infinity multiplied by a zero neighborhood indicator), which would prevent us from training our baseline model, so we exclude them.
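
The following minimal sketch reproduces where these values come from, using a two-row toy frame with the same column names as `df_trans`. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (assumption: second row mimics a listing with zero bedrooms)
toy = pd.DataFrame({
    "accommodates": [4, 3],
    "bedrooms": [1, 0],
    "Queen Anne": [1, 1],
    "Ballard": [0, 0],
})

ratio = toy["accommodates"] / toy["bedrooms"]   # 3 / 0 gives inf
ratio_queen_anne = ratio * toy["Queen Anne"]    # inf * 1 stays inf
ratio_ballard = ratio * toy["Ballard"]          # inf * 0 gives NaN

print(pd.DataFrame({
    "ratio": ratio,
    "ratio_Queen Anne": ratio_queen_anne,
    "ratio_Ballard": ratio_ballard,
}))
```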

In [None]:
index_without_bedrooms = df_trans.loc[df_trans['bedrooms'] == 0].index

In [None]:
df_trans = df_trans.loc[df_trans['bedrooms'] != 0]

In [None]:
price = price.drop(index_without_bedrooms)

In [None]:
df_trans.shape

(3422, 112)

In [None]:
price.shape

(3422,)

## Splitting and Scaling

### Splitting the data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_trans, price, test_size=0.35, random_state=42)

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(2224, 112) (1198, 112) (2224,) (1198,)
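
The shapes printed above follow from `test_size=0.35` applied to 3,422 rows. The sketch below repeats the split on placeholder data of the same length to show the resulting sizes and that features and target stay aligned by index label; the placeholder frame and its offset index are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

n_rows = 3422  # number of listings left after excluding zero-bedroom rows

# Placeholder data (assumption: stands in for df_trans and price)
features = pd.DataFrame({"feature": np.arange(n_rows)}, index=np.arange(n_rows) + 10)
target = pd.Series(np.arange(n_rows) * 100, index=features.index, name="price")

X_tr, X_te, y_tr, y_te = train_test_split(
    features, target, test_size=0.35, random_state=42
)

sizes = (len(X_tr), len(X_te))
# The split keeps the original index labels, so X and y remain matched row by row
aligned = X_tr.index.equals(y_tr.index) and X_te.index.equals(y_te.index)
print(sizes, aligned)
```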


In [None]:
# Saving indices
X_train_indices = X_train.index
X_test_indices = X_test.index
y_train_indices = y_train.index
y_test_indices = y_test.index

In [None]:
X_train.head()

Unnamed: 0,host_since_days,host_is_superhost,host_total_listings_count,host_identity_verified,Ballard,Beacon Hill,Capitol Hill,Cascade,Central Area,Delridge,Downtown,Interbay,Lake City,Magnolia,Northgate,Other neighborhoods,Queen Anne,Rainier Valley,Seward Park,University District,West Seattle,Apartment,Bed & Breakfast,Boat,Bungalow,Cabin,Camper/RV,Chalet,Condominium,Dorm,House,Loft,Other,Tent,Townhouse,Treehouse,Yurt,Entire home/apt,Private room,Shared room,accommodates,bathrooms,bedrooms,Airbed,Couch,Futon,Pull-out Sofa,Real Bed,24-Hour Check-in,Air Conditioning,Breakfast,Buzzer/Wireless Intercom,Cable TV,Carbon Monoxide Detector,Cat(s),Dog(s),Doorman,Dryer,Elevator in Building,Essentials,Family/Kid Friendly,Fire Extinguisher,First Aid Kit,Free Parking on Premises,Gym,Hair Dryer,Hangers,Heating,Hot Tub,Indoor Fireplace,Internet,Iron,Kitchen,Laptop Friendly Workspace,Lock on Bedroom Door,Other pet(s),Pets Allowed,Pets live on this property,Pool,Safety Card,Shampoo,Smoke Detector,Smoking Allowed,Suitable for Events,TV,Washer,Washer / Dryer,Wheelchair Accessible,Wireless Internet,guests_included,number_of_reviews,flexible,moderate,strict,ratio_accommodates_bedrooms,ratio_Ballard,ratio_Beacon Hill,ratio_Capitol Hill,ratio_Cascade,ratio_Central Area,ratio_Delridge,ratio_Downtown,ratio_Interbay,ratio_Lake City,ratio_Magnolia,ratio_Northgate,ratio_Other neighborhoods,ratio_Queen Anne,ratio_Rainier Valley,ratio_Seward Park,ratio_University District,ratio_West Seattle
1779,1978,0,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,2,1,1,0,0,0,0,1,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,2,36,0,1,0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0
2191,1644,1,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,2,1,1,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,0,0,0,1,1,12,0,1,0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0
598,1040,0,3,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,6,1,2,0,0,0,0,1,0,0,0,0,1,1,0,0,0,1,0,1,1,0,0,0,0,0,0,1,0,0,1,0,1,1,0,0,1,0,0,0,1,1,0,0,1,1,0,0,1,4,66,0,0,1,3.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
765,505,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,1,1,0,0,0,0,1,0,1,0,1,1,1,0,0,1,1,1,1,0,1,1,1,1,0,0,1,0,1,1,0,1,0,0,0,0,0,1,1,1,1,0,0,1,1,0,1,1,1,1,1,0,0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
2138,339,0,21,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,3,1,1,0,0,0,0,1,0,0,1,0,0,1,0,0,0,1,0,1,1,1,0,1,0,0,0,1,0,1,1,0,1,0,0,0,0,0,0,0,1,1,0,0,0,1,0,0,1,1,15,0,0,1,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0


In [None]:
cols = X_train.columns.tolist()

We will scale both the numerical and the categorical variables in order to better understand the importance of each feature later. One may argue that categorical variables don't need to be scaled, but for our baseline model we will start by scaling everything. In a future analysis we will check how this decision impacts our model.
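For comparison, here is a minimal sketch of the alternative, where only the continuous columns are scaled and the 0/1 dummies pass through unchanged. The toy frame and the `continuous_cols` / `binary_cols` split are assumptions made for illustration, not the split used in this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for df_trans: two continuous columns, two 0/1 dummies.
toy = pd.DataFrame({
    'accommodates': [2, 4, 6, 8],
    'bathrooms': [1.0, 1.0, 2.0, 3.0],
    'Downtown': [1, 0, 0, 1],
    'Private room': [0, 1, 1, 0],
})

# Hypothetical split of the columns into continuous and binary groups.
continuous_cols = ['accommodates', 'bathrooms']
binary_cols = ['Downtown', 'Private room']

# Scale only the continuous columns; the dummies are passed through as-is.
ct_partial = ColumnTransformer(
    [('scaling', StandardScaler(), continuous_cols)],
    remainder='passthrough',
)
toy_scaled = ct_partial.fit_transform(toy)

# Transformed columns come first, followed by the passed-through remainder.
toy_scaled_df = pd.DataFrame(
    toy_scaled,
    columns=continuous_cols + binary_cols,
    index=toy.index,
)
print(toy_scaled_df)
```

With this variant the dummy coefficients are expressed per 0/1 change rather than per standard deviation, so they are not directly comparable with the coefficients reported in this notebook.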

To scale properly, we fit the scaler on our training set and use it to scale our test set. This avoids data leakage from our test set into our training set.
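The idea can be illustrated on a tiny example. The arrays below are made up and are unrelated to the listings data. The point is only that the mean and standard deviation come from the training rows and are then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data; the test values are deliberately far from train.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0], [20.0]])

# The scaler learns its statistics from the training rows only.
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
# The test rows are scaled with the training mean and standard deviation.
test_scaled = scaler.transform(test)

print(scaler.mean_)          # mean learned from the training set
print(train_scaled.ravel())  # centered on zero
print(test_scaled.ravel())   # not centered on zero, which is expected
```

Calling `fit_transform` on the test set instead would re-estimate the statistics from test rows, which is exactly the leakage the paragraph above warns about.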

In [None]:
ct = ColumnTransformer([('scaling', StandardScaler(), cols)], remainder='passthrough')

In [None]:
X_train_scaled = ct.fit_transform(X_train)

In [None]:
X_test_scaled = ct.transform(X_test)

In [None]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=cols, index=X_train_indices)

In [None]:
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=cols, index=X_test_indices)

In [None]:
# Selecting features
simple_features = ['host_total_listings_count', 
                   'accommodates',
                   'bathrooms', 
                   'bedrooms', 
                   'guests_included',
                   'Ballard', 
                   'Beacon Hill',
                   'Capitol Hill', 
                   'Cascade', 
                   'Central Area', 
                   'Delridge', 
                   'Downtown',
                   'Interbay', 
                   'Lake City', 
                   'Magnolia', 
                   'Northgate', 
                   'Other neighborhoods',
                   'Queen Anne', 
                   'Rainier Valley', 
                   'Seward Park', 
                   'University District',
                   'West Seattle', 
                   'Entire home/apt',
                   'Private room', 
                   'Shared room', 
                   'flexible', 
                   'moderate', 
                   'strict']

polynomial_features = [
                     'host_total_listings_count',
                     'accommodates',
                     'guests_included',
                     'ratio_Ballard',
                     'ratio_Beacon Hill',
                     'ratio_Capitol Hill',
                     'ratio_Cascade',
                     'ratio_Central Area',
                     'ratio_Delridge',
                     'ratio_Downtown',
                     'ratio_Interbay',
                     'ratio_Lake City',
                     'ratio_Magnolia',
                     'ratio_Northgate',
                     'ratio_Other neighborhoods',
                     'ratio_Queen Anne',
                     'ratio_Rainier Valley',
                     'ratio_Seward Park',
                     'ratio_University District',
                     'ratio_West Seattle',
                     'Entire home/apt',
                     'Private room', 
                     'Shared room', 
                     'flexible', 
                     'moderate', 
                     'strict']

In [None]:
X_train_scaled_selected_features_df = X_train_scaled_df[simple_features]

In [None]:
X_test_scaled_selected_features_df = X_test_scaled_df[simple_features]

In [None]:
X_train_scaled_selected_features_polynomial_df = X_train_scaled_df[polynomial_features]

In [None]:
X_test_scaled_selected_features_polynomial_df = X_test_scaled_df[polynomial_features]

# Training and Testing

## Linear Regression

In [None]:
model = LinearRegression()

In [None]:
model.fit(X_train_scaled_selected_features_df, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
importance = model.coef_

In [None]:
# prediction on training set
yhat_train = model.predict(X_train_scaled_selected_features_df)

In [None]:
# evaluate training prediction
mae = mean_absolute_error(y_train, yhat_train)
print('MAE: %.3f' % mae)

MAE: 3682.491


In [None]:
# make predictions
yhat = model.predict(X_test_scaled_selected_features_df)

In [None]:
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

MAE: 3794.752


In [None]:
# r2-score on test
r2_score(y_test, yhat)

0.5014057498925586

In [None]:
# Print each feature name next to its fitted coefficient
for feature, v in zip(simple_features, importance):
    print(f'Feature: {feature}, Score: {v:.5f}')

Feature: host_total_listings_count, Score: 272.47764
Feature: accommodates, Score: 445.35744
Feature: bathrooms, Score: 1450.99405
Feature: bedrooms, Score: 4118.09564
Feature: guests_included, Score: 955.93087
Feature: Ballard, Score: -7579725111287464.00000
Feature: Beacon Hill, Score: -5795891139213781.00000
Feature: Capitol Hill, Score: -10786437727369330.00000
Feature: Cascade, Score: -4456571508304670.00000
Feature: Central Area, Score: -9386453859877996.00000
Feature: Delridge, Score: -4775962107071221.00000
Feature: Downtown, Score: -10291664330520708.00000
Feature: Interbay, Score: -1874661659900756.50000
Feature: Lake City, Score: -4057980006855795.00000
Feature: Magnolia, Score: -3841975064271897.00000
Feature: Northgate, Score: -4456571508305564.00000
Feature: Other neighborhoods, Score: -12882106487869800.00000
Feature: Queen Anne, Score: -8274832698065251.00000
Feature: Rainier Valley, Score: -6701088347374905.00000
Feature: Seward Park, Score: -3365840593505921.00000
Fea
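
A note on the very large coefficients for the neighborhood dummies above. Magnitudes of this size are what one would typically expect when the inputs are perfectly collinear. If each listing belongs to exactly one neighborhood group (an assumption here), the dummies sum to one on every row, and after standardization one column is a linear combination of the others. The sketch below uses synthetic data, not the listings, to show the resulting rank deficiency.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic one-hot block: each row belongs to exactly one of 3 categories.
labels = rng.randint(0, 3, size=50)
labels[:3] = [0, 1, 2]  # make sure every category appears at least once
dummies = np.eye(3)[labels]

# Standardize each column, mirroring what StandardScaler does.
scaled = (dummies - dummies.mean(axis=0)) / dummies.std(axis=0)

# With all categories kept, the columns are linearly dependent.
full_rank = np.linalg.matrix_rank(scaled)

# Dropping one category removes the dependency.
reduced = scaled[:, 1:]
reduced_rank = np.linalg.matrix_rank(reduced)

print(scaled.shape[1], full_rank)      # number of columns vs. rank
print(reduced.shape[1], reduced_rank)  # number of columns vs. rank
```

Dropping one category per group, or using a regularized model such as `Ridge`, are common ways to obtain coefficients that can be compared with each other. Whether that changes the ranking here would need to be checked on the real data.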

## Linear Regression with Polynomial Features

In [None]:
model_polynomial = LinearRegression()

In [None]:
model_polynomial.fit(X_train_scaled_selected_features_polynomial_df, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [None]:
importance = model_polynomial.coef_

In [None]:
# prediction on training set
yhat_train = model_polynomial.predict(X_train_scaled_selected_features_polynomial_df)

In [None]:
# evaluate training prediction
mae = mean_absolute_error(y_train, yhat_train)
print('MAE: %.3f' % mae)

MAE: 3787.793


In [None]:
# make predictions
yhat = model_polynomial.predict(X_test_scaled_selected_features_polynomial_df)

In [None]:
# evaluate predictions
mae = mean_absolute_error(y_test, yhat)
print('MAE: %.3f' % mae)

MAE: 3814.628


In [None]:
# r2-score on test
r2_score(y_test, yhat)

0.498321881081502

In [None]:
# Print each feature name next to its fitted coefficient
for feature, v in zip(polynomial_features, importance):
    print(f'Feature: {feature}, Score: {v:.5f}')

Feature: host_total_listings_count, Score: 254.05895
Feature: accommodates, Score: 5709.98508
Feature: guests_included, Score: 945.37050
Feature: ratio_Ballard, Score: -1903.96261
Feature: ratio_Beacon Hill, Score: -1583.54900
Feature: ratio_Capitol Hill, Score: -1869.27005
Feature: ratio_Cascade, Score: -578.69037
Feature: ratio_Central Area, Score: -2313.25091
Feature: ratio_Delridge, Score: -1417.02751
Feature: ratio_Downtown, Score: -2147.83546
Feature: ratio_Interbay, Score: -402.58793
Feature: ratio_Lake City, Score: -1071.73977
Feature: ratio_Magnolia, Score: -646.69782
Feature: ratio_Northgate, Score: -1476.31376
Feature: ratio_Other neighborhoods, Score: -3489.70264
Feature: ratio_Queen Anne, Score: -1209.10746
Feature: ratio_Rainier Valley, Score: -2148.21764
Feature: ratio_Seward Park, Score: -1134.35341
Feature: ratio_University District, Score: -1571.21475
Feature: ratio_West Seattle, Score: -1827.45933
Feature: Entire home/apt, Score: 88889537043512384.00000
Feature: Priv

# Comparison

As we can see, our two linear models perform similarly and, unfortunately, poorly. In both cases the inputs explain about 50% of the variance in our target (rental price). This means there is still ample room to explore new features and models for predicting the price. The most important point to remember is that this is a baseline model with a strong linearity assumption, which we tried to circumvent by using a few polynomial features. Nonetheless, it is a good starting point for exploring how these features interact with each other. In a future analysis I will go a step further and look at how we can engineer better features and improve our model.
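To make the two metrics concrete, here is a small worked example on made-up numbers (not the notebook's predictions). It computes MAE and R2 by hand and checks them against `mean_absolute_error` and `r2_score`. R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean, which is why it reads as "fraction of variance explained".

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up targets and predictions, for illustration only.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 190.0, 270.0])

# MAE: average absolute difference between target and prediction.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R2: 1 - (residual sum of squares) / (total sum of squares around the mean).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```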

# Results and Future Steps

Our best model was the linear regression without polynomial features, with a test R2 of 0.5014 and a test MAE of about $37.95 (3794.752 in the units printed above).

It ranked the following features as important (from most to least important):

![Feature Importance](/content/drive/MyDrive/Courses/Udacity/Project_1/graph/scatter_reviewxprice.png)

As we can see, the room types (entire home/apartment or private room) are the two most important features, ahead of the neighborhoods. This makes sense, as we expect the number and type of rooms to strongly affect the price the owner sets.

Following these findings, I think a logical next step for further analysis is to create new polynomial features using both the room types and the listing's location in the city. This feature engineering might seem cumbersome, but it allows us to maintain a certain degree of explainability with respect to our original features.
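
One possible way to build such features is to multiply each room-type dummy by each neighborhood dummy, so that every new column stays readable as "this room type in this neighborhood". The sketch below runs on a toy frame. The column subset and the `' x '` naming scheme are assumptions for illustration, not part of `helper_functions.py`.

```python
import pandas as pd

# Toy frame with two room-type dummies and two neighborhood dummies.
toy = pd.DataFrame({
    'Entire home/apt': [1, 0, 1, 0],
    'Private room':    [0, 1, 0, 1],
    'Downtown':        [1, 1, 0, 0],
    'Ballard':         [0, 0, 1, 1],
})

room_types = ['Entire home/apt', 'Private room']
neighborhoods = ['Downtown', 'Ballard']

# One interaction column per (room type, neighborhood) pair.
interactions = pd.DataFrame(
    {f'{room} x {hood}': toy[room] * toy[hood]
     for room in room_types
     for hood in neighborhoods},
    index=toy.index,
)

toy_with_interactions = pd.concat([toy, interactions], axis=1)
print(toy_with_interactions.columns.tolist())
```

The number of columns grows as the product of the two groups (3 room types times 17 neighborhood groups in this dataset), so some of these interaction columns may end up with very few listings.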
