<a href="https://colab.research.google.com/github/pe44enka/AirbnbPricePrediction/blob/master/DataAnalysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Airbnb Price Prediction

## Objectives

---

**Goal of the project:** to build an ML model that predicts the rental price of an Airbnb listing and to find out which factors influence the price the most

**Data:** [San Francisco Airbnb](https://www.kaggle.com/jeploretizo/san-francisco-airbnb-listings) dataset available at Kaggle

---

In [188]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV

# Load Data

In [205]:
data = pd.read_csv('https://raw.githubusercontent.com/pe44enka/AirbnbPricePrediction/master/data/SF.csv')
print(data.shape)
pd.set_option('display.max_columns', 20) #display 20 columns of df
data.head()

(8111, 106)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,summary,space,description,experiences_offered,neighborhood_overview,...,instant_bookable,is_business_travel_ready,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,958,https://www.airbnb.com/rooms/958,20191000000000.0,10/14/19,"Bright, Modern Garden Unit - 1BR/1B",New update: the house next door is under const...,"Newly remodeled, modern, and bright garden uni...",New update: the house next door is under const...,none,*Quiet cul de sac in friendly neighborhood *St...,...,f,f,moderate,f,f,1,1,0,0,1.74
1,3850,https://www.airbnb.com/rooms/3850,20191000000000.0,10/14/19,Charming room for two,Your own private room plus access to a shared ...,This room can fit two people. Nobody else will...,Your own private room plus access to a shared ...,none,"This is a quiet, safe neighborhood on a substa...",...,f,f,strict_14_with_grace_period,f,f,3,0,3,0,1.28
2,5858,https://www.airbnb.com/rooms/5858,20191000000000.0,10/14/19,Creative Sanctuary,,We live in a large Victorian house on a quiet ...,We live in a large Victorian house on a quiet ...,none,I love how our neighborhood feels quiet but is...,...,f,f,strict_14_with_grace_period,f,f,1,1,0,0,0.87
3,7918,https://www.airbnb.com/rooms/7918,20191000000000.0,10/14/19,A Friendly Room - UCSF/USF - San Francisco,Nice and good public transportation. 7 minute...,"Settle down, S.F. resident, student, hospital,...",Nice and good public transportation. 7 minute...,none,"Shopping old town, restaurants, McDonald, Whol...",...,f,f,strict_14_with_grace_period,f,f,9,0,9,0,0.15
4,8142,https://www.airbnb.com/rooms/8142,20191000000000.0,10/14/19,Friendly Room Apt. Style -UCSF/USF - San Franc...,Nice and good public transportation. 7 minute...,"Settle down, S.F. resident, student, hospital,...",Nice and good public transportation. 7 minute...,none,,...,f,f,strict_14_with_grace_period,f,f,9,0,9,0,0.13


# Data Exploration and Pre-processing

---
I will start by familiarizing myself with the columns in the dataset to understand what each feature represents. This matters because a poor understanding of the features could lead to mistakes in the data analysis and the Machine Learning (ML) modeling. So before applying any ML algorithm I need to do some data pre-processing:
* clean data:
    * remove all uninformative columns: columns that bring no useful info for the model (e.g. ids, urls etc.)
    * remove all columns that contain no values
    * deal with missing values: fill or drop them
* perform Exploratory Data Analysis (EDA) to get familiar with the data:
    * to catch outliers
    * to notice if the data is skewed
    * to check whether the data is unbalanced
* get rid of categorical data:
    * using one-hot encoding: `pd.get_dummies()`
* split the dataset into train/test sets for training and evaluating the model properly
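
The last two steps can be sketched on a toy table. This is only an illustration: the rows, the 25% test share and the `random_state` are assumptions, not values taken from the dataset or from the final model.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy listings table (hypothetical values, not taken from the dataset)
toy = pd.DataFrame({
    'room_type': ['Entire home/apt', 'Private room', 'Shared room', 'Private room'],
    'accommodates': [4, 2, 1, 2],
    'price': [250.0, 99.0, 45.0, 110.0],
})

# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy, columns=['room_type'])

# Separate the target from the features, then hold out 25% of the rows for testing
X = encoded.drop(columns='price')
y = encoded['price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

print(encoded.columns.tolist())
print(X_train.shape, X_test.shape)
```

Each distinct `room_type` value becomes its own indicator column, so the model only ever sees numeric inputs.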

---


## 1. Clean Data

### Reducing number of columns



---


To make the future ML model faster and more efficient, I need to reduce the number of columns. I start by dropping the columns that are either empty or carry no information that is useful to an ML model.

To do this, I check and analyze all columns of the data as follows.

---



In [6]:
for i in data.columns[41:61]:
  print(i)

city
state
zipcode
market
smart_location
country_code
country
latitude
longitude
is_location_exact
property_type
room_type
accommodates
bathrooms
bedrooms
beds
bed_type
amenities
square_feet
price


In [7]:
data.price.value_counts(dropna=False)

$150.00       271
$100.00       236
$200.00       210
$250.00       202
$125.00       159
             ... 
$291.00         1
$565.00         1
$989.00         1
$430.00         1
$2,010.00       1
Name: price, Length: 526, dtype: int64



---

I drop all of the following uninformative columns because they:

* contain **distinct values** for every observation (not useful for an ML model):
  * `id`
  * `listing_url`
  * `name`
  * `picture_url`
  * `host_id`
  * `host_url`
  * `host_name`
  * `host_thumbnail_url`
  * `host_picture_url`
  * `license`
* are **constant features** (a single value for all observations) => not useful for an ML model:
  * `scrape_id`
  * `experiences_offered`
  * `city`
  * `state`
  * `market`
  * `smart_location`
  * `country_code`
  * `country`
  * `has_availability`
  * `calendar_last_scraped`
  * `requires_license`
  * `jurisdiction_names`
  * `is_business_travel_ready`
* are **empty**:
  * `thumbnail_url`
  * `medium_url`
  * `xl_picture_url`
  * `host_acceptance_rate`
  * `neighbourhood_group_cleansed`
  * `square_feet`
* are **uninformative** because they are subject to daily updates:
  * `calendar_updated`
  * `availability_30`
  * `availability_60`
  * `availability_90`
  * `availability_365`
* are **text features** that require a lot of preprocessing to turn into useful features:
  * `summary`
  * `space`
  * `description`
  * `neighborhood_overview`
  * `notes`
  * `transit`
  * `access`
  * `interaction`
  * `house_rules`
  * `host_about`
* contain **redundant info** that is already available elsewhere:
  * `host_neighbourhood`: `host_location` is used instead
  * `host_listings_count`, `host_total_listings_count`: the more accurate `calculated_host_listings_count` will be used
  * `host_verifications`: list of host verification methods, info already contained in `host_identity_verified`
  * `street`, `neighbourhood`, `zipcode`: `neighbourhood_cleansed` will be used instead
  * `is_location_exact`: unimportant, as the location could be inaccurate by up to 150 meters ([details](http://insideairbnb.com/about.html#disclaimers))
  * `weekly_price`, `monthly_price`: `price` is used instead
  * `minimum_minimum_nights`, `maximum_minimum_nights`, `minimum_nights_avg_ntm`: `minimum_nights` is used instead
  * `minimum_maximum_nights`, `maximum_maximum_nights`, `maximum_nights_avg_ntm`: `maximum_nights` is used instead
  * `number_of_reviews_ltm`: `number_of_reviews` is used instead
  * `last_review`: `last_scraped` will be used instead
  * `review_scores_rating`: calculated as a weighted sum of the other scores
  * `calculated_host_listings_count_entire_homes`, `calculated_host_listings_count_private_rooms`, `calculated_host_listings_count_shared_rooms`: `calculated_host_listings_count` will be used instead
  * `reviews_per_month`: this field will be recalculated from `number_of_reviews`, `first_review` and `last_scraped`
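
The first three categories can also be shortlisted programmatically. The helper below is a minimal sketch, not the procedure used in this notebook: `flag_uninformative` is a hypothetical name and the toy frame is made up. Its output is only a list of candidates, since a legitimately useful continuous feature could also be distinct on every row.

```python
import numpy as np
import pandas as pd

def flag_uninformative(frame):
    """Return column names grouped by why they probably carry little signal."""
    n_rows = len(frame)
    # Empty: every value is missing
    empty = [c for c in frame.columns if frame[c].isna().all()]
    # Constant: exactly one distinct non-missing value
    constant = [c for c in frame.columns
                if c not in empty and frame[c].nunique(dropna=True) == 1]
    # Distinct: a different value on every row (ids, urls, ...)
    distinct = [c for c in frame.columns if frame[c].nunique(dropna=True) == n_rows]
    return {'empty': empty, 'constant': constant, 'distinct': distinct}

# Hypothetical three-row frame with one column of each kind plus a normal feature
toy = pd.DataFrame({
    'id': [958, 3850, 5858],
    'country': ['United States'] * 3,
    'thumbnail_url': [np.nan] * 3,
    'bedrooms': [1, 1, 2],
})
flags = flag_uninformative(toy)
print(flags)
```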




---



In [206]:
uninform_cols = ['id', 'listing_url', 'scrape_id', 'last_review', 'name', 'experiences_offered', 'thumbnail_url', 
                 'medium_url', 'picture_url', 'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_acceptance_rate', 
                 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 
                 'host_total_listings_count', 'host_verifications', 'street', 'neighbourhood', 
                 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market', 'smart_location', 'country', 
                 'country_code', 'is_location_exact', 'square_feet', 'weekly_price', 'monthly_price', 
                 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 
                 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'calendar_updated', 'has_availability', 
                 'availability_30', 'availability_60', 'availability_90', 'availability_365','calendar_last_scraped', 
                 'number_of_reviews_ltm', 'review_scores_rating', 'license', 'requires_license', 'jurisdiction_names', 
                 'is_business_travel_ready', 'calculated_host_listings_count_entire_homes', 
                 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count', 'reviews_per_month']
text_cols = ['summary', 'space', 'description', 'neighborhood_overview', 'notes', 'transit', 'access', 'interaction', 
             'house_rules', 'host_about']
geo_data = ['latitude', 'longitude']
df = data.drop(columns=uninform_cols).drop(columns=text_cols).drop(columns=geo_data)
print(df.shape)
df.tail()

(8111, 37)


Unnamed: 0,last_scraped,host_since,host_location,host_response_time,host_response_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,property_type,...,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count_shared_rooms
8106,10/14/19,12/5/12,"San Francisco, California, United States",,,f,t,f,Bernal Heights,House,...,,,,,,t,strict_14_with_grace_period,f,f,0
8107,10/14/19,2/25/19,"Santa Cruz, California, United States",within an hour,100%,f,t,f,Downtown/Civic Center,Condominium,...,,,,,,t,flexible,f,f,0
8108,10/14/19,2/2/15,"San Francisco, California, United States",within a few hours,78%,f,t,t,Bayview,Hut,...,,,,,,t,flexible,f,f,0
8109,10/14/19,12/16/16,"New York, New York, United States",within an hour,98%,f,t,f,Downtown/Civic Center,Apartment,...,,,,,,t,flexible,f,f,0
8110,10/14/19,8/16/19,US,within an hour,99%,t,t,f,Crocker Amazon,House,...,,,,,,f,flexible,f,f,0




---

After all iterations, 37 of the original 106 columns are left:

**Column** | **Description** | **Data type** | **Note**
---|---|---|---
`last_scraped` | Date of last scraping | str | Will be used to calculate `listing_duration` & `hosting_duration`
`host_location` | Location of the host | cat | To be converted to a 0/1 `host_local` column
`host_since` | Date when the host started using Airbnb | cat | Will be used to calculate `hosting_duration`
`host_response_time` | Time an applicant waits for the host's response | cat | To be one-hot encoded (dummies)
`host_response_rate` | Share of requests the host responds to (in percent) | str | To be converted into float
`host_is_superhost` | Marks highly rated and reliable hosts | str (t/f) | To be converted into 0/1
`host_has_profile_pic` | Profiles with pictures are seen as more credible | str (t/f) | To be converted into 0/1
`host_identity_verified` | Another host credibility metric | str (t/f) | To be converted into 0/1
`neighbourhood_cleansed` | Identifies the most popular parts of San Francisco | cat | To be one-hot encoded (dummies)
geo_data (`latitude` & `longitude`) | Geo data | float | Will be used just for visualization
`property_type` | Type of property to rent | cat | Number of distinct values to be reduced + one-hot encoding (dummies)
`room_type` | Type of room to rent | cat | To be one-hot encoded (dummies)
`accommodates` | Total number of people allowed on the property | int |   
`bathrooms` | Number of bathrooms available on the property | int |   
`beds` | Number of beds available on the property | int |   
`bed_type` | Type of bed available on the property | cat | To be one-hot encoded (dummies)
`amenities` | Amenities available at the property | str | To be converted into int by counting the number of amenities
`price` | Target feature, cost of renting the property (in USD) | str | To be converted into float
`security_deposit` | Cost of security deposit (in USD) | str | To be converted into float
`cleaning_fee` | Cost of cleaning the property (in USD) | str | To be converted into float
`guests_included` | Number of guests on the property | int | Will be used to evaluate the cost per person 
`extra_people` | Cost of additional people per night (in USD) | str | To be converted into float
`minimum_nights` | Minimum number of nights available for rent | int | Listings with a high minimum number of nights are likely sublets 
`maximum_nights` | Maximum number of nights available for rent | int |   
`number_of_reviews` | Total number of reviews in entire listing history | int |   
`first_review` | Date of first review on the listing | str | Will be used to calculate `review_per_month` & `listing_duration`
`review_scores_accuracy` | Scores of accuracy | int (2-10) |   
`review_scores_cleanliness` | Scores of cleanliness | int (2-10) |   
`review_scores_checkin` | Scores of checkin | int (2-10) |   
`review_scores_communication` | Scores of communication | int (2-10) |   
`review_scores_location` | Scores of location | int (2-10) |   
`review_scores_value` | Scores of value | int (2-10) |   
`instant_bookable` | Whether the property can be booked instantly | str (t/f) | To be converted into 0/1
`cancellation_policy` | Cancellation policy | cat | To be one-hot encoded (dummies)
`require_guest_profile_picture` | Whether the applicant needs to have a profile picture | str (t/f) | To be converted into 0/1
`require_guest_phone_verification` | Whether the applicant's phone must be verified | str (t/f) | To be converted into 0/1
`calculated_host_listings_count` | Another metric to measure host experience or to distinguish businesses from individuals | int |   


---



### Missing Data

In [193]:
df.isna().sum()[df.isna().sum()>0]

host_since                        8
host_location                    13
host_response_time              927
host_response_rate              927
host_is_superhost                 8
host_has_profile_pic              8
host_identity_verified            8
bathrooms                        12
bedrooms                          4
beds                              9
security_deposit               1692
cleaning_fee                    924
first_review                   1605
review_scores_accuracy         1654
review_scores_cleanliness      1654
review_scores_checkin          1655
review_scores_communication    1653
review_scores_location         1655
review_scores_value            1655
dtype: int64



---

ML algorithms can't work properly with missing values, so I fill them as follows:

**Column** | **Fill NaN with** | **Data Type** | **A missing value is interpreted as**
--- | --- | --- | ---
`host_since` | `'10/14/19'` (last_scraped) | str | host has just started the business (at the date of scraping)
`host_location` | `'San Francisco'` | str | host is local (San Francisco)
`host_response_time` | `'a few days or more'` | str | host hasn't answered yet or at all
`host_response_rate` | `'0%'` | str | host hasn't answered yet or at all
`host_is_superhost` | `'f'` | str | host is not a superhost
`host_has_profile_pic` | `'f'` | str | host has no picture
`host_identity_verified` | `'f'` | str | host is not verified
`security_deposit` | `'$0.00'` | str | cost is \$0
`cleaning_fee` | `'$0.00'` | str | cost is \$0
`bathrooms` | `0` | int | there are no bathrooms at the property
`bedrooms` | `0` | int | there are no bedrooms at the property
`beds` | `0` | int | there are no beds at the property
`first_review` | `'10/14/19'` (last_scraped) | str | host has just started the business (at the date of scraping)
`review_scores_accuracy` | `0` | int | score is 0
`review_scores_cleanliness` | `0` | int | score is 0
`review_scores_checkin` | `0` | int | score is 0
`review_scores_communication` | `0` | int | score is 0
`review_scores_location` | `0` | int | score is 0
`review_scores_value` | `0` | int | score is 0
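
A table like this maps directly onto a dictionary of per-column defaults, which `DataFrame.fillna` accepts. The sketch below shows the idea on a made-up three-row frame with only three of the columns. It is an equivalent formulation, not the cell used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with one gap per column
toy = pd.DataFrame({
    'host_is_superhost': ['t', np.nan, 'f'],
    'cleaning_fee': ['$100.00', np.nan, '$10.00'],
    'beds': [2.0, 1.0, np.nan],
})

# One default per column, mirroring a few rows of the table above
fill_values = {
    'host_is_superhost': 'f',
    'cleaning_fee': '$0.00',
    'beds': 0,
}

# DataFrame.fillna accepts a dict keyed by column name
filled = toy.fillna(value=fill_values)
print(filled)
```

Keeping the defaults in one dictionary makes it easy to compare the code against the table row by row.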


---



In [207]:
df['host_since'] = df['host_since'].fillna(df.last_scraped.value_counts().index[0])
df['host_location'] = df['host_location'].fillna('San Francisco')
df['host_response_time'] = df['host_response_time'].fillna('a few days or more')
df['host_response_rate'] = df['host_response_rate'].fillna('0%')
df['host_is_superhost'] = df['host_is_superhost'].fillna('f')
df['host_has_profile_pic'] = df['host_has_profile_pic'].fillna('f')
df['host_identity_verified'] = df['host_identity_verified'].fillna('f')
df['security_deposit'] = df['security_deposit'].fillna('$0.00')
df['cleaning_fee'] = df['cleaning_fee'].fillna('$0.00')
df['bathrooms'] = df['bathrooms'].fillna(0) 
df['bedrooms'] = df['bedrooms'].fillna(0) 
df['beds'] = df['beds'].fillna(0) 
df['first_review'] = df['first_review'].fillna(df.last_scraped.value_counts().index[0])
df['review_scores_accuracy'] = df['review_scores_accuracy'].fillna(0)
df['review_scores_cleanliness'] = df['review_scores_cleanliness'].fillna(0)
df['review_scores_checkin'] = df['review_scores_checkin'].fillna(0)
df['review_scores_communication'] = df['review_scores_communication'].fillna(0)
df['review_scores_location'] = df['review_scores_location'].fillna(0)
df['review_scores_value'] = df['review_scores_value'].fillna(0)

print('Shape of data: ', df.shape, '\n\n\t\tMissing Data:')
df.isna().sum()

Shape of data:  (8111, 37) 

		Missing Data:


last_scraped                                   0
host_since                                     0
host_location                                  0
host_response_time                             0
host_response_rate                             0
host_is_superhost                              0
host_has_profile_pic                           0
host_identity_verified                         0
neighbourhood_cleansed                         0
property_type                                  0
room_type                                      0
accommodates                                   0
bathrooms                                      0
bedrooms                                       0
beds                                           0
bed_type                                       0
amenities                                      0
price                                          0
security_deposit                               0
cleaning_fee                                   0
guests_included     

### Calculated Fields & Columns Pre-processing



---

I add some calculated fields to the data:

**Column** | **Formula** | **Description**
--- | --- | ---
`listing_duration` | (last_scraped - first_review) | Duration of listing's existence on Airbnb (in days)
`hosting_duration` | (last_scraped - host_since) | Duration of host being on Airbnb (in days)
`review_per_month` | number_of_reviews / (last_scraped - first_review, in months) | Number of reviews per month


I also drop some columns that I needed only for the calculations:
* `last_scraped`
* `first_review`
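
The reviews-per-month formula can be checked by hand on a single listing. The sketch below assumes an average month of 365.2425 / 12 days and treats a listing age that rounds to 0 months as 1 month, so the division is always defined. The helper name `reviews_per_month` and the dates are made up for illustration.

```python
import pandas as pd

# Average Gregorian month in days (assumed convention, about 30.44)
DAYS_PER_MONTH = 365.2425 / 12

def reviews_per_month(number_of_reviews, listing_days):
    """Reviews divided by listing age in months (rounded to 0.1, minimum 1 month if it rounds to 0)."""
    months = round(listing_days / DAYS_PER_MONTH, 1)
    if months == 0:
        months = 1.0
    return number_of_reviews / months

# Hypothetical listing: first review three calendar months before the scrape
last_scraped = pd.Timestamp('2019-10-14')
first_review = pd.Timestamp('2019-07-14')
listing_days = (last_scraped - first_review).days

print(listing_days, reviews_per_month(12, listing_days))
```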

---



In [208]:
last_scraped = pd.to_datetime(df['last_scraped'])
# whole days between the scrape date and the first review / the host's sign-up date
df['listing_duration'] = (last_scraped - pd.to_datetime(df['first_review'])).dt.days
df['hosting_duration'] = (last_scraped - pd.to_datetime(df['host_since'])).dt.days
# listing age in months, using the average month of 365.2425 / 12 days that np.timedelta64(1, 'M') stands for;
# an age that rounds to 0.0 months is replaced by 1.0 to avoid dividing by zero
months_listed = (df['listing_duration'] / (365.2425 / 12)).round(1).replace({0.0: 1.0})
df['review_per_month'] = df['number_of_reviews'] / months_listed

cols_to_drop = ['last_scraped', 'first_review']
df = df.drop(columns=cols_to_drop)

pd.set_option('display.max_columns', None) #display all columns of df
print(df.shape)
df.head()

(8111, 38)


Unnamed: 0,host_since,host_location,host_response_time,host_response_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,number_of_reviews,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count_shared_rooms,listing_duration,hosting_duration,review_per_month
0,7/31/08,"San Francisco, California, United States",within a day,100%,t,t,t,Western Addition,Apartment,Entire home/apt,3,1.0,1.0,2.0,Real Bed,"{TV,""Cable TV"",Internet,Wifi,Kitchen,""Pets liv...",$170.00,$100.00,$100.00,2,$25.00,1,30,217,10.0,10.0,10.0,10.0,10.0,9.0,f,moderate,f,f,0,3735,4092,1.768541
1,12/8/08,"San Francisco, California, United States",within an hour,100%,t,t,t,Inner Sunset,House,Private room,2,1.0,1.0,1.0,Real Bed,"{Internet,Wifi,Kitchen,Breakfast,""Free street ...",$99.00,$0.00,$10.00,2,$20.00,1,5,160,10.0,10.0,10.0,10.0,10.0,10.0,f,strict_14_with_grace_period,f,f,0,3742,3962,1.301871
2,3/2/09,"San Francisco, California, United States",within a day,80%,f,t,t,Bernal Heights,Apartment,Entire home/apt,5,1.0,2.0,3.0,Real Bed,"{Internet,Wifi,Kitchen,Heating,""Family/kid fri...",$235.00,$0.00,$100.00,2,$0.00,30,60,111,10.0,10.0,10.0,10.0,10.0,9.0,f,strict_14_with_grace_period,f,f,0,3816,3878,0.885167
3,6/17/09,"San Francisco, California, United States",within an hour,86%,t,t,t,Haight Ashbury,Apartment,Private room,2,4.0,1.0,1.0,Real Bed,"{TV,Internet,Wifi,Kitchen,""Free street parking...",$65.00,$200.00,$50.00,1,$12.00,32,60,18,8.0,8.0,9.0,9.0,9.0,8.0,f,strict_14_with_grace_period,f,f,0,3696,3771,0.14827
4,6/17/09,"San Francisco, California, United States",within an hour,86%,t,t,t,Haight Ashbury,Apartment,Private room,2,4.0,1.0,1.0,Real Bed,"{TV,Internet,Wifi,Kitchen,""Free street parking...",$65.00,$200.00,$50.00,1,$12.00,32,90,8,9.0,9.0,10.0,10.0,9.0,9.0,f,strict_14_with_grace_period,f,f,0,1862,3771,0.130719




---

I also pre-process some of the columns to put them in a form more suitable for the future ML model:

**Column** | **Action**
--- | ---
`host_location` | Convert location into 1 if the host is local (San Francisco) and 0 otherwise
`host_since` | Extract the year from the date
`host_response_time` | Replace values with shorter versions
`host_response_rate` | Get rid of '%' and convert it to a float fraction (0 to 1)
`host_is_superhost` | Convert str to num as True=1, False=0
`host_has_profile_pic` | Convert str to num as True=1, False=0
`host_identity_verified` | Convert str to num as True=1, False=0
`neighbourhood_cleansed` | Rename into `neighbourhood`
`property_type` | Decrease number of values by combining some of them together
`amenities` | Replace the list of amenities with the number of amenities in the list
`price` | Get rid of '\$' and convert it to float dtype
`security_deposit` | Get rid of '\$' and convert it to float dtype
`cleaning_fee` | Get rid of '\$' and convert it to float dtype
`extra_people` | Get rid of '\$' and convert it to float dtype
`minimum_nights` | Rename to `min_nights`
`maximum_nights` | Rename to `max_nights`
`instant_bookable` | Convert str to num as True=1, False=0
`require_guest_profile_picture` | Convert str to num as True=1, False=0
`require_guest_phone_verification` | Convert str to num as True=1, False=0
`calculated_host_listings_count`  | Rename into `host_listings_count`
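
Three of these conversions are shown below on a made-up frame: currency strings, percentages and t/f flags. This is a minimal sketch with hypothetical values. `regex=True` is passed explicitly because the default of `Series.str.replace` differs between pandas versions.

```python
import pandas as pd

# Hypothetical raw values in the same string formats as the dataset
toy = pd.DataFrame({
    'price': ['$170.00', '$2,010.00', '$99.00'],
    'host_response_rate': ['100%', '78%', '0%'],
    'instant_bookable': ['t', 'f', 't'],
})

# Strip '$' and thousands separators, then cast to float
toy['price'] = toy['price'].str.replace(r'[$,]', '', regex=True).astype(float)
# Drop the trailing '%' and rescale to a 0-1 fraction
toy['host_response_rate'] = toy['host_response_rate'].str.rstrip('%').astype(float) / 100
# Map the t/f flags to 1/0
toy['instant_bookable'] = toy['instant_bookable'].map({'t': 1, 'f': 0})

print(toy)
```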


---



In [209]:
#function defining if host is local (San Francisco) as 1 or not as 0
def local(var):
  if var.split(',')[0] in ['San Francisco', 'SanFrancisco', 'SF', 'S F']:
    var = 1
  else:
    var = 0
  return var
df['host_location'] = df.host_location.apply(local).rename('host_local')

#function to replace list of amenities with number of amenities in the list
def amenities_num(var):
  return (len(var.split(',')))
df['amenities'] = df.amenities.apply(amenities_num).rename('emenities_num')

df['host_since'] = pd.DatetimeIndex(df['host_since']).year
df['host_response_time'] = df.host_response_time.replace({'within an hour':'hour', 'within a few hours':'few_hours', 
                                                          'within a day':'day', 'a few days or more':'few_days'})
df['host_response_rate'] = (df.host_response_rate.str.split('%', expand=True).iloc[:,0].astype('float').div(100).rename('host_response_rate_pct'))
df['host_is_superhost'] = df.host_is_superhost.replace({'t':1, 'f':0})
df['host_has_profile_pic'] = df.host_has_profile_pic.replace({'t':1, 'f':0})
df['host_identity_verified'] = df.host_identity_verified.replace({'t':1, 'f':0})
df['property_type'] = df.property_type.replace({'Hut':'House', 'Dome house':'House', 'Camper/RV':'Hotel', 'In-law':'Hotel', 
                                                'Earth house':'House', 'Tiny house':'House', 'Cabin':'House', 
                                                'Castle':'House', 'Villa':'House', 'Cottage':'House', 'Resort':'Hotel', 
                                                'Bungalow':'House', 'Aparthotel':'Hotel', 'Bed and breakfast':'Hotel', 
                                                'Loft':'Apartment', 'Hostel':'Hotel', 'Serviced apartment':'Apartment',
                                                'Townhouse':'House', 'Boutique hotel':'Hotel', 'Guest suite':'Hotel', 
                                                'Guesthouse':'Hotel', 'Other':'Hotel', 'Condominium':'Apartment'})
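# regex=True is explicit: the pattern is a character class, and the default differs between pandas versions
df['price'] = df.price.str.replace(r'[$ ,]', '', regex=True).astype('float')
df['security_deposit'] = df.security_deposit.str.replace(r'[$ ,]', '', regex=True).astype('float')
df['cleaning_fee'] = df.cleaning_fee.str.replace(r'[$ ,]', '', regex=True).astype('float')
df['extra_people'] = df.extra_people.str.replace(r'[$ ,]', '', regex=True).astype('float')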
df['instant_bookable'] = df.instant_bookable.replace({'t':1, 'f':0})
df['require_guest_profile_picture'] = df.require_guest_profile_picture.replace({'t':1, 'f':0})
df['require_guest_phone_verification'] = df.require_guest_phone_verification.replace({'t':1, 'f':0})

df = df.rename({'neighbourhood_cleansed':'neighbourhood', 'mininum_nights':'min_nights', 'maximum_nights':'max_nights',
                'calculated_host_listings_count':'host_listing_count'})

print(df.shape)
df.head()

(8111, 38)


Unnamed: 0,host_since,host_location,host_response_time,host_response_rate,host_is_superhost,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,amenities,price,security_deposit,cleaning_fee,guests_included,extra_people,minimum_nights,maximum_nights,number_of_reviews,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,cancellation_policy,require_guest_profile_picture,require_guest_phone_verification,calculated_host_listings_count_shared_rooms,listing_duration,hosting_duration,review_per_month
0,2008,1,day,1.0,1,1,1,Western Addition,Apartment,Entire home/apt,3,1.0,1.0,2.0,Real Bed,32,170.0,100.0,100.0,2,25.0,1,30,217,10.0,10.0,10.0,10.0,10.0,9.0,0,moderate,0,0,0,3735,4092,1.768541
1,2008,1,hour,1.0,1,1,1,Inner Sunset,House,Private room,2,1.0,1.0,1.0,Real Bed,30,99.0,0.0,10.0,2,20.0,1,5,160,10.0,10.0,10.0,10.0,10.0,10.0,0,strict_14_with_grace_period,0,0,0,3742,3962,1.301871
2,2009,1,day,0.8,0,1,1,Bernal Heights,Apartment,Entire home/apt,5,1.0,2.0,3.0,Real Bed,17,235.0,0.0,100.0,2,0.0,30,60,111,10.0,10.0,10.0,10.0,10.0,9.0,0,strict_14_with_grace_period,0,0,0,3816,3878,0.885167
3,2009,1,hour,0.86,1,1,1,Haight Ashbury,Apartment,Private room,2,4.0,1.0,1.0,Real Bed,18,65.0,200.0,50.0,1,12.0,32,60,18,8.0,8.0,9.0,9.0,9.0,8.0,0,strict_14_with_grace_period,0,0,0,3696,3771,0.14827
4,2009,1,hour,0.86,1,1,1,Haight Ashbury,Apartment,Private room,2,4.0,1.0,1.0,Real Bed,16,65.0,200.0,50.0,1,12.0,32,90,8,9.0,9.0,10.0,10.0,9.0,9.0,0,strict_14_with_grace_period,0,0,0,1862,3771,0.130719
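
The header of the output above still shows `neighbourhood_cleansed`, `minimum_nights` and `maximum_nights`. This is because `DataFrame.rename` applies a positionally passed mapping to the index labels, not to the column names. The sketch below contrasts the two call forms on a made-up one-row frame. Renaming in the notebook itself would additionally need the key spelled `minimum_nights`.

```python
import pandas as pd

# Hypothetical one-row frame with two of the columns to be renamed
toy = pd.DataFrame({'neighbourhood_cleansed': ['Bayview'], 'minimum_nights': [2]})
mapping = {'neighbourhood_cleansed': 'neighbourhood', 'minimum_nights': 'min_nights'}

# A positional mapping targets the index labels, so the columns stay as they were
by_index = toy.rename(mapping)
# columns= applies the same mapping to the column names
by_columns = toy.rename(columns=mapping)

print(by_index.columns.tolist())
print(by_columns.columns.tolist())
```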


## 2. Exploratory Data Analysis