# DA3 Assignment 2
## AirBnB Price Prediction - Buenos Aires
### Data from September 22, 2023
#### Nicolas Fernandez
The goal of this assignment is to build a price prediction model for small and mid-sized apartments in Buenos Aires that can host 2-6 guests. Several models will be constructed using different methods for comparison. Descriptions of each column in the data can be found in detail at this link: https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=1322284596

In [1]:
# Importing libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import os
from pathlib import Path
import sys
from patsy import dmatrices
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.inspection import PartialDependenceDisplay
from sklearn.inspection import partial_dependence
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error

## Get Data

In [2]:
# Reading data from github
data = pd.read_csv('https://raw.githubusercontent.com/nxfern/DA3_Assignment_2/main/listings.csv')

In [3]:
# Viewing shape and first 5 observations
print(data.shape)
data.head()

(29346, 75)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,11508,https://www.airbnb.com/rooms/11508,20230922223302,2023-09-23,city scrape,Condo in Buenos Aires · ★4.81 · 1 bedroom · 1 ...,LUXURIOUS 1 BDRM APT- POOL/ GYM/ SPA/ 24-HR SE...,AREA: PALERMO SOHO<br /><br />Minutes walking ...,https://a0.muscache.com/pictures/19357696/b1de...,42762,...,4.97,4.94,4.89,,f,1,1,0,0,0.26
1,107259,https://www.airbnb.com/rooms/107259,20230922223302,2023-09-23,city scrape,Rental unit in Buenos Aires · ★4.58 · 6 bedroo...,"We have 7 bedrooms and 5 bathrooms,gourmet kit...",,https://a0.muscache.com/pictures/822490/5bc2ab...,555693,...,4.71,4.63,4.53,,f,2,2,0,0,0.28
2,14222,https://www.airbnb.com/rooms/14222,20230922223302,2023-09-23,city scrape,Rental unit in Palermo/Buenos Aires · ★4.79 · ...,Beautiful cozy apartment in excellent location...,Palermo is such a perfect place to explore the...,https://a0.muscache.com/pictures/4695637/bbae8...,87710233,...,4.9,4.89,4.75,,f,7,7,0,0,0.81
3,15074,https://www.airbnb.com/rooms/15074,20230922223302,2023-09-23,previous scrape,Rental unit in Buenos Aires · 1 bedroom · 1 be...,<b>The space</b><br />I OFFER A ROOM IN MY APA...,,https://a0.muscache.com/pictures/91166/c0fdcb4...,59338,...,,,,,f,1,0,1,0,
4,108089,https://www.airbnb.com/rooms/108089,20230922223302,2023-09-23,city scrape,Rental unit in Buenos Aires · ★4.59 · 1 bedroo...,Amazing apartment in the best area of Palermo....,Palermo is the best neighborhhod in the city.<...,https://a0.muscache.com/pictures/717831/fbb7cd...,559463,...,4.77,4.94,4.66,,f,4,4,0,0,0.77


In [4]:
# Viewing information on data types of all columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29346 entries, 0 to 29345
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            29346 non-null  int64  
 1   listing_url                                   29346 non-null  object 
 2   scrape_id                                     29346 non-null  int64  
 3   last_scraped                                  29346 non-null  object 
 4   source                                        29346 non-null  object 
 5   name                                          29346 non-null  object 
 6   description                                   28747 non-null  object 
 7   neighborhood_overview                         16259 non-null  object 
 8   picture_url                                   29346 non-null  object 
 9   host_id                                       29346 non-null 

## EDA and Feature Engineering
First a filter will be placed to only grab the listings that accommodate at least 2 but no more than 6 people total. From there, data review and cleaning will take place along with the potential creation of dummy variables for further analysis.

In [5]:
# Filtering data to listings that accommodate between 2 and 6 people and assigning a copy to a new df to work from.
# Printing the shape of the new df and deleting the initial dataframe that is no longer needed
df = data.query('2 <= accommodates <= 6').copy()
print(df.shape)
del data

(27340, 75)


In [6]:
# Checking sum of all null values, sorted by descending order
df.isna().sum().sort_values(ascending=False).head(30)

neighbourhood_group_cleansed    27340
bathrooms                       27340
calendar_updated                27340
license                         26932
host_about                      12298
neighbourhood                   12195
neighborhood_overview           12195
host_neighbourhood               9052
host_location                    6300
review_scores_checkin            5022
review_scores_cleanliness        5022
review_scores_accuracy           5021
review_scores_location           5021
review_scores_value              5021
review_scores_communication      5020
first_review                     4969
last_review                      4969
reviews_per_month                4969
review_scores_rating             4967
bedrooms                         4944
host_response_rate               3342
host_response_time               3342
host_acceptance_rate             2292
host_is_superhost                1400
description                       569
beds                              233
bathrooms_te

The `bathrooms` column, which indicates how many bathrooms the listing has, is completely null in this dataset. It would otherwise be an important data point for analysis, but because it contains no values it will be dropped along with all other columns that are fully or mostly null. The other columns with many null values describe the host of the listing rather than the accommodation itself: information about the host (`host_about`) and the neighbourhood the host reports residing in (`host_neighbourhood`). Other host-related factors in the data, such as whether or not the host is a superhost, are more impactful for analysis, so these two columns will be dropped as well. The location of the host (`host_location`) may be important, however, so it will be kept.

Null values in `host_is_superhost` can be treated as the host not being a superhost. Because superhost is a desirable status on AirBnB, a null value can be assumed to mean 0. Null values in that column will be imputed as 0, with a flag variable created.

In [7]:
# Renaming all columns that say neighbourhood to neighborhood for consistency
df.columns = df.columns.str.replace('neighbourhood', 'neighborhood')

In [8]:
# Dropping columns that are fully or mostly null values
df.drop(['neighborhood_group_cleansed', 'bathrooms', 'calendar_updated', 'license', 'host_about', 'host_neighborhood'], axis=1, inplace=True)

In [9]:
# Checking neighborhood_cleansed column for any results outside of Buenos Aires
df.neighborhood_cleansed.value_counts()

neighborhood_cleansed
Palermo              9325
Recoleta             4077
San Nicolas          1622
Belgrano             1477
Retiro               1332
Monserrat            1105
Almagro               971
Villa Crespo          870
Balvanera             833
San Telmo             732
Colegiales            653
Nuñez                 622
Caballito             524
Chacarita             430
Villa Urquiza         338
Constitucion          324
Puerto Madero         320
Barracas              203
Saavedra              185
San Cristobal         170
Flores                122
Coghlan               105
Villa Ortuzar         104
Villa Devoto           96
Villa Del Parque       90
Boedo                  76
Boca                   76
Parque Patricios       65
Parque Chas            62
Parque Chacabuco       61
Villa Pueyrredon       52
Paternal               43
Agronomia              42
Floresta               38
Villa Santa Rita       36
Villa Luro             28
Villa Gral. Mitre      27
Mataderos       

There are no results outside of Buenos Aires. Also dropping `neighborhood` and `neighborhood_overview` columns as they do not contain new information from `neighborhood_cleansed` and also contain null values.

In [10]:
# Dropping `neighborhood` and `neighborhood_overview` columns
df.drop(['neighborhood', 'neighborhood_overview'], axis=1, inplace=True)

In [11]:
# Checking room_type contents
df.room_type.value_counts()

room_type
Entire home/apt    25765
Private room        1432
Shared room           87
Hotel room            56
Name: count, dtype: int64

In [12]:
# Dropping hotels from examination
df = df.loc[df['room_type'] != 'Hotel room']

In [13]:
# Creating flag variable for null values that are to be imputed in host_is_superhost as a binary
df['flag_superhost'] = df.host_is_superhost.isna().astype(int)

# Changing host_is_superhost to binary with 1 being yes and 0 being no. Imputing 0 for null values.
df['host_is_superhost'] = df.host_is_superhost.map({'t': 1, 'f': 0}).fillna(0).astype(int)
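The flag-then-impute pattern used here recurs throughout the notebook. Below is a minimal sketch of it on a toy frame. The helper name `flag_and_fill` and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

def flag_and_fill(frame, column, fill_value, flag_name):
    """Hypothetical helper: record which rows were null, then impute them."""
    out = frame.copy()
    # The flag must be created before filling, otherwise the nulls are already gone
    out[flag_name] = out[column].isna().astype(int)
    out[column] = out[column].fillna(fill_value)
    return out

# Toy stand-in for the host_is_superhost column ('t' / 'f' / missing)
toy = pd.DataFrame({'host_is_superhost': ['t', 'f', np.nan, 't']})
toy['host_is_superhost'] = toy.host_is_superhost.map({'t': 1, 'f': 0})
toy = flag_and_fill(toy, 'host_is_superhost', 0, 'flag_superhost')
toy['host_is_superhost'] = toy.host_is_superhost.astype(int)
print(toy)
```

Keeping the flag next to the imputed column lets a later model tell apart listings whose value was observed from those where it was assumed.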

In [14]:
df.flag_superhost.value_counts()

flag_superhost
0    25884
1     1400
Name: count, dtype: int64

In [15]:
# Reviewing observations where reviews_per_month is null
print(df[df.reviews_per_month.isna()].shape) # Printing shape

# Creating test df to review which other columns are null when reviews_per_month is null
test = df[df.reviews_per_month.isna()]
test.isna().sum().sort_values(ascending=False).head(30)

(4957, 68)


reviews_per_month              4957
first_review                   4957
last_review                    4957
review_scores_rating           4955
review_scores_accuracy         4955
review_scores_cleanliness      4955
review_scores_checkin          4955
review_scores_value            4955
review_scores_location         4955
review_scores_communication    4955
host_location                  1604
host_response_rate             1367
host_response_time             1367
host_acceptance_rate           1255
bedrooms                        840
description                     117
beds                             51
bathrooms_text                    6
availability_365                  0
has_availability                  0
maximum_nights_avg_ntm            0
minimum_nights_avg_ntm            0
maximum_maximum_nights            0
availability_30                   0
availability_60                   0
availability_90                   0
id                                0
calendar_last_scraped       

In [16]:
# Examining the two non-null values in review_scores_rating
test[test.review_scores_rating.notna()]

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,picture_url,host_id,host_url,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,flag_superhost
14380,675897953216372358,https://www.airbnb.com/rooms/675897953216372358,20230922223302,2023-09-24,city scrape,Serviced apartment in Palermo · ★New · 1 bedro...,Disfruta de la sencillez de este alojamiento t...,https://a0.muscache.com/pictures/8f618490-0ad6...,450965790,https://www.airbnb.com/users/show/450965790,...,5.0,5.0,5.0,f,1,1,0,0,,0
27914,965888041736392206,https://www.airbnb.com/rooms/965888041736392206,20230922223302,2023-09-24,city scrape,Rental unit in Buenos Aires · ★New · 2 bedroom...,Espectacular apartamento de dos habitaciones d...,https://a0.muscache.com/pictures/miso/Hosting-...,533833227,https://www.airbnb.com/users/show/533833227,...,5.0,5.0,5.0,t,1,1,0,0,,0


In [17]:
# Examining number_of_reviews column in the search above
test[test.review_scores_rating.notna()]['number_of_reviews']

14380    0
27914    0
Name: number_of_reviews, dtype: int64

In [18]:
# Viewing value counts for `number_of_reviews` column within test df
test.number_of_reviews.value_counts()

number_of_reviews
0    4957
Name: count, dtype: int64

From the above result we can confirm that every observation in the test df has 0 in the `number_of_reviews` column. From this it can be inferred that when `first_review`, `last_review`, and `reviews_per_month` are null, the listing is brand new with no history. These nulls will be imputed with 0, and a flag variable called `flag_new_listings` will mark all imputed observations.

Two of these observations do have values in several review score columns, which can be inferred to be an error: if no reviews were actually left, there can be no review scores for the listing. For these two listings specifically, the review scores will be replaced with the mean of each column and a flag variable `flag_corrected_new_listing` will be created.

In [19]:
# Deleting test df
del test

# Creating flag variable for corrected new listings
df['flag_corrected_new_listing'] = df.index.isin([14380, 27914]).astype(int)

# Creating index array of columns starting with 'review_scores'
columns_to_update = df.columns[df.columns.str.startswith('review_scores')]

# Creating for loop to update each column with their mean value for the new listings in these two specific indexes
for column in columns_to_update:
    df.loc[df.index.isin([14380, 27914]), column] = df[column].mean()

# Creating flag variable for imputed null values on first_review, last_review, and reviews_per_month
df['flag_new_listings'] = df.reviews_per_month.isna().astype(int)

# Filling null values for first_review, last_review, and reviews_per_month with 0
for column in ['first_review', 'last_review', 'reviews_per_month']:
    df[column] = df[column].fillna(0)

In [20]:
# Rechecking null values
df.isna().sum().sort_values(ascending=False).head(20)

host_location                  6291
review_scores_checkin          5010
review_scores_cleanliness      5010
review_scores_value            5009
review_scores_accuracy         5009
review_scores_location         5009
review_scores_communication    5008
review_scores_rating           4955
bedrooms                       4938
host_response_rate             3328
host_response_time             3328
host_acceptance_rate           2285
description                     566
beds                            232
bathrooms_text                   10
availability_90                   0
calendar_last_scraped             0
availability_365                  0
maximum_nights_avg_ntm            0
minimum_nights_avg_ntm            0
dtype: int64

There are still many null values, specifically in the `review_scores` columns. These will be checked to see whether or not the listings have any reviews associated with them.

In [21]:
# Creating test df again to make data filtering easier
test = df[df.review_scores_checkin.isna()]

In [22]:
# Setting number_of_reviews column equal to 0 and checking null values when that condition is True
test[test.number_of_reviews == 0].isna().sum().sort_values(ascending=False).head(20)

review_scores_cleanliness      4955
review_scores_value            4955
review_scores_rating           4955
review_scores_accuracy         4955
review_scores_checkin          4955
review_scores_communication    4955
review_scores_location         4955
host_location                  1602
host_response_time             1367
host_response_rate             1367
host_acceptance_rate           1255
bedrooms                        840
description                     117
beds                             51
bathrooms_text                    6
number_of_reviews_ltm             0
availability_60                   0
availability_30                   0
has_availability                  0
availability_90                   0
dtype: int64

Here we can see that the majority of the null values remaining in the `review_scores` columns are null because they belong to new listings. The mean review score of each respective column will be imputed for them. These observations are already covered by the new listings flag.

In [23]:
# Using for loop method used earlier
for column in columns_to_update:
    df.loc[(df.number_of_reviews == 0), column] = df[column].mean()

In [24]:
# Deleting test df
del test

In [25]:
# Viewing null values in 'bedrooms' by 'accommodates'
df[df.bedrooms.isna()].groupby('accommodates').size()

accommodates
2    3716
3     758
4     398
5      30
6      36
dtype: int64

Rather than drop the listings that do not have the number of bedrooms listed, the rounded mean number of bedrooms for each respective level of `accommodates` will be imputed. A flag variable `flag_bedrooms` will be created to reflect this.

In [26]:
# Creating bedrooms flag variable
df['flag_bedrooms'] = df.bedrooms.isna().astype(int)

# Impute the null values with the mean values for their respective 'accommodates' group using a for loop
for accommodates, mean_bedrooms in df.groupby('accommodates').bedrooms.mean().items():
    df.loc[(df.accommodates == accommodates) & df.bedrooms.isnull(), 'bedrooms'] = round(mean_bedrooms)
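The per-group loop above can also be expressed with `groupby(...).transform`. The sketch below uses a toy frame (the values are made up for illustration) and is meant as an equivalent alternative, not as the method the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy data: two accommodates levels, each with one missing bedrooms value
toy = pd.DataFrame({
    'accommodates': [2, 2, 2, 4, 4, 4, 4],
    'bedrooms': [1.0, 1.0, np.nan, 2.0, 2.0, 1.0, np.nan],
})

# Flag first, so the imputed rows stay identifiable
toy['flag_bedrooms'] = toy.bedrooms.isna().astype(int)

# transform('mean') returns one value per row: the mean of that row's group
group_means = toy.groupby('accommodates').bedrooms.transform('mean').round()

# fillna with a Series aligns on the index, so each null gets its own group's mean
toy['bedrooms'] = toy.bedrooms.fillna(group_means)
print(toy)
```

Because `fillna` only touches null entries, observed values are left untouched, which matches what the loop does with its `isnull()` mask.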

In [27]:
# Viewing 'host' columns to check what type of values they have
print(df.host_acceptance_rate)
print()
print(df.host_response_rate)
print()
print(df.host_response_time)

0         91%
2        100%
4        100%
5        100%
6         97%
         ... 
29338    100%
29339     97%
29340     85%
29341     70%
29343    100%
Name: host_acceptance_rate, Length: 27284, dtype: object

0        100%
2        100%
4        100%
5        100%
6         96%
         ... 
29338     NaN
29339    100%
29340    100%
29341    100%
29343    100%
Name: host_response_rate, Length: 27284, dtype: object

0            within an hour
2            within an hour
4            within an hour
5            within an hour
6            within an hour
                ...        
29338                   NaN
29339    within a few hours
29340        within an hour
29341        within an hour
29343        within an hour
Name: host_response_time, Length: 27284, dtype: object


In [28]:
# Converting host rate columns to numeric percentages, filling host_response_time null values with 'Missing'
df['host_acceptance_rate'] = df.host_acceptance_rate.str.rstrip('%').astype(float) / 100
df['host_response_rate'] = df.host_response_rate.str.rstrip('%').astype(float) / 100
df['host_response_time'] = df.host_response_time.fillna('Missing')
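As a quick check of the percentage conversion, the sketch below runs the same `str.rstrip('%')` step on a small made-up series. It shows that missing values pass through as `NaN`, so they remain available for the imputation steps that follow.

```python
import numpy as np
import pandas as pd

# Toy stand-in for a host rate column stored as percentage strings
rates = pd.Series(['91%', '100%', np.nan, '70%'])

# Strip the trailing '%' and rescale to the 0-1 range; NaN stays NaN
as_fraction = rates.str.rstrip('%').astype(float) / 100
print(as_fraction)
```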

In [29]:
# Rechecking null values
df.isna().sum().sort_values(ascending=False).head(15)

host_location                  6291
host_response_rate             3328
host_acceptance_rate           2285
description                     566
beds                            232
review_scores_cleanliness        55
review_scores_checkin            55
review_scores_accuracy           54
review_scores_location           54
review_scores_value              54
review_scores_communication      53
bathrooms_text                   10
number_of_reviews                 0
calendar_last_scraped             0
availability_365                  0
dtype: int64

In [30]:
# Counting null values in host_response_rate where number_of_reviews is 0 and host_acceptance_rate is null
df[(df.number_of_reviews == 0) & df.host_acceptance_rate.isna()].host_response_rate.isna().sum()

930

Observations with null values in `host_response_rate` and `host_acceptance_rate` where `number_of_reviews` = 0 are assumed to be new listings. They will be added to the new listings flag and the column mean will be imputed for these values.

In [31]:
# Adding to flag_new_listings if not already accounted for
df.loc[(df.number_of_reviews == 0) & df.host_acceptance_rate.isna(), 'flag_new_listings'] = 1

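# Imputing null values in both columns with their means using a for loop.
# The mask is computed once so that filling one column does not change which rows are selected for the other
new_listing_mask = (df.number_of_reviews == 0) & df.host_acceptance_rate.isna()
for column in ['host_response_rate', 'host_acceptance_rate']:
    df.loc[new_listing_mask & df[column].isna(), column] = df[column].mean()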

In [32]:
df.isna().sum().sort_values(ascending=False).head(15)

host_location                  6291
host_response_rate             2398
host_acceptance_rate           1030
description                     566
beds                            232
review_scores_cleanliness        55
review_scores_checkin            55
review_scores_accuracy           54
review_scores_location           54
review_scores_value              54
review_scores_communication      53
bathrooms_text                   10
number_of_reviews                 0
calendar_last_scraped             0
availability_365                  0
dtype: int64

In [33]:
# Checking null values in host_response_rate when host_acceptance_rate and number_of_reviews both = 0
df[(df.number_of_reviews == 0) & (df.host_acceptance_rate == 0)].host_response_rate.isna().sum()

197

In [34]:
# Imputing 0 for these values as they are likely to have been an error if no reviews and host has a 0 for an acceptance rate
df.loc[(df.number_of_reviews == 0) & (df.host_acceptance_rate == 0) & df.host_response_rate.isna(), 'host_response_rate'] = float(0)

For the remaining null values in `host_response_rate` and `host_acceptance_rate`, the mean for each respective level of `accommodates` will be imputed, as done previously. Flag variables will be created for both, `flag_hrr` and `flag_har` respectively.

In [35]:
# Adding flag variables for each
df['flag_hrr'] = df.host_response_rate.isna().astype(int)
df['flag_har'] = df.host_acceptance_rate.isna().astype(int)

# Imputing mean value by accommodates for host_response_rate
for accommodates, mean_hrr in df.groupby('accommodates').host_response_rate.mean().items():
    df.loc[(df.accommodates == accommodates) & df.host_response_rate.isna(), 'host_response_rate'] = mean_hrr
    
# Imputing mean value by accommodates for host_acceptance_rate
for accommodates, mean_har in df.groupby('accommodates').host_acceptance_rate.mean().items():
    df.loc[(df.accommodates == accommodates) & df.host_acceptance_rate.isna(), 'host_acceptance_rate'] = mean_har

In [36]:
# Filling null values in categorical columns with 'Missing'
df['host_location'] = df.host_location.fillna('Missing')
df['description'] = df.description.fillna('Missing')

In [37]:
# Reviewing bathrooms_text column
df.bathrooms_text.value_counts()

bathrooms_text
1 bath               20042
1.5 baths             3244
2 baths               2001
2.5 baths              498
1 shared bath          419
1 private bath         368
3 baths                192
1.5 shared baths       123
2 shared baths          90
3.5 baths               78
3 shared baths          46
0 baths                 27
2.5 shared baths        26
4 baths                 23
4 shared baths          16
3.5 shared baths         9
0 shared baths           8
6.5 shared baths         7
Half-bath                7
4.5 baths                7
5 shared baths           6
9 baths                  5
Shared half-bath         5
5 baths                  5
6 shared baths           4
9 shared baths           3
4.5 shared baths         3
8 baths                  2
7 baths                  2
6 baths                  1
7 shared baths           1
5.5 shared baths         1
8.5 shared baths         1
8 shared baths           1
Private half-bath        1
22 baths                 1
5.5 baths    

In [38]:
# Checking bathrooms_text by room_type
df.groupby('room_type').bathrooms_text.value_counts()

room_type        bathrooms_text   
Entire home/apt  1 bath               19898
                 1.5 baths             3151
                 2 baths               1937
                 2.5 baths              485
                 3 baths                164
                 3.5 baths               75
                 4 baths                 19
                 0 baths                 15
                 4.5 baths                7
                 Half-bath                5
                 5 baths                  3
                 5.5 baths                1
Private room     1 shared bath          374
                 1 private bath         368
                 1 bath                 144
                 1.5 shared baths       119
                 1.5 baths               93
                 2 shared baths          80
                 2 baths                 64
                 3 shared baths          40
                 3 baths                 28
                 2.5 shared baths        

In [39]:
# Checking strange values in column
print(df[df.bathrooms_text == '22 baths'].accommodates)
print(df[df.bathrooms_text == 'Private half-bath'].accommodates)
print(df[df.bathrooms_text == '0 baths'].accommodates)
print(df[df.bathrooms_text == '0 shared baths'].accommodates)
print(df[df.bathrooms_text == '9 baths'].accommodates)
print(df[df.bathrooms_text == '9 shared baths'].accommodates)

4772    2
Name: accommodates, dtype: int64
4857    2
Name: accommodates, dtype: int64
126      3
2222     2
2240     2
2735     2
3013     2
3623     2
4443     2
16099    2
16770    4
26109    3
27546    3
27720    2
27798    2
27813    4
28607    4
28608    4
28613    2
28617    2
28619    2
28624    2
28628    2
28631    2
28633    3
28635    3
28641    3
28704    2
29292    2
Name: accommodates, dtype: int64
495      6
829      2
3210     3
5182     3
5196     2
6195     3
6229     2
12367    2
Name: accommodates, dtype: int64
25225    2
25258    2
25286    2
25363    2
25423    2
Name: accommodates, dtype: int64
18797    4
24186    4
26355    4
Name: accommodates, dtype: int64


For each of these strange values observed within the `bathrooms_text` column, the following can be inferred:
- '22 baths' is an error and should be 2 baths
- 'Private half-bath' does not appear to be a mistake
- '0 baths' and '0 shared baths' do not appear to be mistakes
- '9 shared baths' and '9 baths' are strange but could be correct and will be maintained

The `bathrooms_text` column will be renamed to `bathrooms` (a column of the same name was dropped earlier for being fully null) and converted to float values. The null values will be imputed with the mean value for their respective level of `accommodates`, as done previously with other columns. A flag variable `flag_bathrooms` will be created to account for this. A dummy variable `d_shared_bath` will also be created to account for listings that have shared bathrooms.

In [40]:
# Renaming bathrooms_text column to bathrooms
df.rename(columns={'bathrooms_text': 'bathrooms'}, inplace=True)

# Changing values in bathrooms as mentioned above
replacement_map = {
    '22 baths': '2 baths',
    'Private half-bath': '0.5 private bath',
    'Half-bath': '0.5 baths',
    'Shared half-bath': '0.5 shared bath'}
df['bathrooms'] = df.bathrooms.replace(replacement_map)

# Creating dummy variable 'd_shared_bath' that is 1 when bathrooms has the word 'shared' in it, 0 if not (null values count as not shared)
df['d_shared_bath'] = df.bathrooms.str.contains('shared', na=False).astype(int)

# Changing bathrooms to float values by keeping the leading number; null values stay null
df['bathrooms'] = pd.to_numeric(df.bathrooms.str.split().str[0])

# Imputing the rounded mean per accommodates to the null values
for accommodates, mean_baths in df.groupby('accommodates').bathrooms.mean().items():
    df.loc[(df.accommodates == accommodates) & df.bathrooms.isna(), 'bathrooms'] = float(round(mean_baths))
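The text-to-number step for bathrooms can also be sketched with vectorized string methods instead of a replacement map. The example below runs on a few made-up labels and assumes that any label containing "half-bath" counts as 0.5 bathrooms. It is an illustrative alternative, not the notebook's own approach.

```python
import numpy as np
import pandas as pd

# Toy labels in the style of the bathrooms_text column
baths = pd.Series(['1 bath', '1.5 shared baths', 'Half-bath', 'Shared half-bath', np.nan])

# Assumption: labels without a leading number but containing "half-bath" mean 0.5
is_half = baths.str.contains('half-bath', case=False, na=False)

# Pull out the first number in each label; labels without a number become NaN
count = pd.to_numeric(baths.str.extract(r'(\d*\.?\d+)')[0])
count = count.mask(is_half, 0.5)

# 1 when the label mentions a shared bathroom, 0 otherwise (nulls count as 0)
shared = baths.str.contains('shared', case=False, na=False).astype(int)

print(pd.DataFrame({'label': baths, 'bathrooms': count, 'd_shared_bath': shared}))
```

Missing labels stay `NaN` in `count`, so no placeholder value is needed before imputing.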

In [41]:
# Same issue for beds, imputing mean rounded values for beds per accommodates. First creating flag variable flag_beds for imputed values
df['flag_beds'] = df.beds.isna().astype(int)

for accommodates, mean_beds in df.groupby('accommodates').beds.mean().items():
    df.loc[(df.accommodates == accommodates) & df.beds.isna(), 'beds'] = round(mean_beds)

The remaining null values in all the `review_scores` columns have non-zero values in `number_of_reviews` and cannot be rectified. Dropping all of these null values.

In [42]:
# Dropping remaining null values in review_scores columns
df.dropna(subset=df.columns[df.columns.str.startswith('review_scores')], inplace=True)

In [43]:
# Checking to see remaining null values, if any
df.isna().sum().sort_values(ascending=False)

id                           0
review_scores_cleanliness    0
review_scores_rating         0
last_review                  0
first_review                 0
                            ..
neighborhood_cleansed        0
host_identity_verified       0
host_has_profile_pic         0
host_verifications           0
flag_beds                    0
Length: 75, dtype: int64

Within the dataset the following columns are "boolean" with a string value of 't' or 'f':
- `host_has_profile_pic`
- `host_identity_verified`
- `has_availability`
- `instant_bookable`

Converting these columns to binary.

In [48]:
# Columns to update
update_cols = ['host_has_profile_pic', 'host_identity_verified', 'has_availability', 'instant_bookable']

# Converting all columns to binary
for column in update_cols:
    df[column] = df[column].replace('t', 1).replace('f', 0)

In [58]:
# Collecting the raw amenities strings and the character length of each string
test = list(df.amenities)
lengths = [len(x) for x in test]

In [61]:
# Average character length of the amenities strings
avg = sum(lengths)/len(lengths)

In [63]:
# Number of listings whose amenities string is longer than average
len([x for x in lengths if x > avg])

13436

In [64]:
test[0]

'["Dining table", "Kitchen", "Freezer", "Safe", "Extra pillows and blankets", "Shared hot tub - open specific hours", "Air conditioning", "Bidet", "Bed linens", "Paid street parking off premises", "Coffee maker", "Paid washer \\u2013 In building", "Shared backyard \\u2013 Fully fenced", "Hair dryer", "Cooking basics", "Body soap", "Elevator", "Drying rack for clothing", "Bathtub", "Dishes and silverware", "Shared pool", "Laundromat nearby", "Stove", "Wine glasses", "Room-darkening shades", "Long term stays allowed", "Hot water", "TV with standard cable", "Paid dryer \\u2013 In building", "Shared sauna", "Microwave", "Cleaning products", "Private entrance", "Gym", "Private patio or balcony", "Essentials", "Clothing storage: closet and dresser", "Host greets you", "Outdoor furniture", "Refrigerator", "Heating", "Hangers", "Oven", "Toaster", "Wifi", "Single level home", "Iron", "Books and reading material"]'
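The output above shows that each `amenities` value is a single string holding a JSON-style list, so `len(x)` in the earlier cell counts characters rather than amenities. Below is a minimal sketch of parsing such strings into Python lists with `json.loads`. The sample strings are made up, and it assumes every value in the column is valid JSON.

```python
import json
import pandas as pd

# Toy values in the same format as the amenities column
amenities = pd.Series([
    '["Kitchen", "Wifi", "Paid washer \\u2013 In building"]',
    '["Wifi"]',
    '[]',
])

# json.loads turns each string into a list and decodes escapes such as \u2013
parsed = amenities.apply(json.loads)

# .str.len() also works on list-valued entries, giving the amenity count per listing
n_amenities = parsed.str.len()
print(n_amenities.tolist())
```

A per-listing amenity count, or dummy variables for selected amenities, could then be built from `parsed`.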

In [67]:
from langdetect import detect