# DA3 Assignment 2
## AirBnB Price Prediction - Buenos Aires
### Data from September 22, 2023
#### Nicolas Fernandez
The goal of this assignment is to build a price prediction model for small and mid-sized apartments in Buenos Aires that can host 2-6 guests. Several models will be built using different methods so that their performance can be compared. Detailed descriptions of each column in the data can be found at this link: https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=1322284596

In [1]:
# Importing libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import os
from pathlib import Path
import sys
from patsy import dmatrices
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.inspection import permutation_importance
from sklearn.inspection import PartialDependenceDisplay
from sklearn.inspection import partial_dependence
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import mean_squared_error

## Get Data

In [2]:
# Reading data from github
data = pd.read_csv('https://raw.githubusercontent.com/nxfern/DA3_Assignment_2/main/listings.csv')

In [3]:
# Viewing shape and first 5 observations
print(data.shape)
data.head()

(29346, 75)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,11508,https://www.airbnb.com/rooms/11508,20230922223302,2023-09-23,city scrape,Condo in Buenos Aires · ★4.81 · 1 bedroom · 1 ...,LUXURIOUS 1 BDRM APT- POOL/ GYM/ SPA/ 24-HR SE...,AREA: PALERMO SOHO<br /><br />Minutes walking ...,https://a0.muscache.com/pictures/19357696/b1de...,42762,...,4.97,4.94,4.89,,f,1,1,0,0,0.26
1,107259,https://www.airbnb.com/rooms/107259,20230922223302,2023-09-23,city scrape,Rental unit in Buenos Aires · ★4.58 · 6 bedroo...,"We have 7 bedrooms and 5 bathrooms,gourmet kit...",,https://a0.muscache.com/pictures/822490/5bc2ab...,555693,...,4.71,4.63,4.53,,f,2,2,0,0,0.28
2,14222,https://www.airbnb.com/rooms/14222,20230922223302,2023-09-23,city scrape,Rental unit in Palermo/Buenos Aires · ★4.79 · ...,Beautiful cozy apartment in excellent location...,Palermo is such a perfect place to explore the...,https://a0.muscache.com/pictures/4695637/bbae8...,87710233,...,4.9,4.89,4.75,,f,7,7,0,0,0.81
3,15074,https://www.airbnb.com/rooms/15074,20230922223302,2023-09-23,previous scrape,Rental unit in Buenos Aires · 1 bedroom · 1 be...,<b>The space</b><br />I OFFER A ROOM IN MY APA...,,https://a0.muscache.com/pictures/91166/c0fdcb4...,59338,...,,,,,f,1,0,1,0,
4,108089,https://www.airbnb.com/rooms/108089,20230922223302,2023-09-23,city scrape,Rental unit in Buenos Aires · ★4.59 · 1 bedroo...,Amazing apartment in the best area of Palermo....,Palermo is the best neighborhhod in the city.<...,https://a0.muscache.com/pictures/717831/fbb7cd...,559463,...,4.77,4.94,4.66,,f,4,4,0,0,0.77


In [4]:
# Viewing information on data types of all columns
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29346 entries, 0 to 29345
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            29346 non-null  int64  
 1   listing_url                                   29346 non-null  object 
 2   scrape_id                                     29346 non-null  int64  
 3   last_scraped                                  29346 non-null  object 
 4   source                                        29346 non-null  object 
 5   name                                          29346 non-null  object 
 6   description                                   28747 non-null  object 
 7   neighborhood_overview                         16259 non-null  object 
 8   picture_url                                   29346 non-null  object 
 9   host_id                                       29346 non-null 

## EDA and Feature Engineering
First a filter will be placed to only grab the listings that accommodate at least 2 but no more than 6 people total. From there, data review and cleaning will take place along with the potential creation of dummy variables for further analysis.

In [5]:
# Filtering data to listings that accommodate between 2 and 6 people (inclusive)
# and assigning the result to a new df to work from. The explicit copy makes df
# independent of data, so later in-place edits act on df alone.
# Printing the shape of the new df and deleting the initial dataframe, which is no longer needed
df = data.query('2 <= accommodates <= 6').copy()
print(df.shape)
del data

(27340, 75)
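
As a quick check on the filter above, the sketch below uses a small made-up frame (not the listings data) to show that the chained comparison inside `query` keeps both endpoints, 2 and 6, and that it selects the same rows as a boolean mask built with `Series.between`. The frame `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the listings data; values are made up for illustration.
toy = pd.DataFrame({"accommodates": [1, 2, 4, 6, 7, 10]})

# Chained comparison inside query(), as used in the notebook.
via_query = toy.query("2 <= accommodates <= 6")

# Equivalent boolean mask; Series.between is inclusive on both ends by default.
via_between = toy[toy["accommodates"].between(2, 6)]

# True when both approaches return identical rows, index, and dtypes.
same_rows = via_query.equals(via_between)
print(via_query["accommodates"].tolist(), same_rows)
```

Either form works; `query` reads closer to the written condition, while the mask avoids parsing a string expression.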


In [23]:
# Counting null values per column and showing the 30 columns with the most nulls, in descending order
df.isna().sum().sort_values(ascending=False).head(30)

neighbourhood_group_cleansed    27340
bathrooms                       27340
calendar_updated                27340
license                         26932
host_about                      12298
neighbourhood                   12195
neighborhood_overview           12195
host_neighbourhood               9052
host_location                    6300
review_scores_checkin            5022
review_scores_cleanliness        5022
review_scores_accuracy           5021
review_scores_location           5021
review_scores_value              5021
review_scores_communication      5020
first_review                     4969
last_review                      4969
reviews_per_month                4969
review_scores_rating             4967
bedrooms                         4944
host_response_rate               3342
host_response_time               3342
host_acceptance_rate             2292
host_is_superhost                1400
description                       569
beds                              233
bathrooms_te

The `bathrooms` column, which indicates how many bathrooms a listing has, is completely null in this dataset. It would otherwise be an important data point for the analysis, but since it holds no values it will be dropped along with all other columns that are fully or mostly null. Other columns with many null values describe the host of the listing rather than the accommodation itself: the neighbourhood the host reports residing in (`host_neighbourhood`) and the host's description of themselves (`host_about`). These columns will be dropped as well, because other host-related factors in the data, such as whether or not the host is a superhost, are more impactful for the analysis. The location of the host (`host_location`) may still be important, so it will be kept.

In [28]:
# Dropping columns that are fully or mostly null values
df.drop(['neighbourhood_group_cleansed', 'bathrooms', 'calendar_updated', 'license', 'host_about', 'host_neighbourhood'], axis=1, inplace=True)
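
The drop list above is chosen by hand from the null counts. As a hedged alternative, the sketch below picks columns by their share of missing values on a small made-up frame. The frame `toy`, its values, and the cutoff `NULL_SHARE_THRESHOLD` are all assumptions and are not taken from the notebook. A threshold alone would also not reproduce the notebook's list, which weighs how relevant each column is in addition to how empty it is.

```python
import numpy as np
import pandas as pd

# Toy frame; column names and values are illustrative only.
toy = pd.DataFrame({
    "price": [100, 120, 90, 150],
    "bathrooms": [np.nan, np.nan, np.nan, np.nan],  # 100% null
    "license": [np.nan, np.nan, np.nan, "ABC"],     # 75% null
    "beds": [1, np.nan, 2, 3],                       # 25% null
})

NULL_SHARE_THRESHOLD = 0.4  # hypothetical cutoff, not from the notebook

# isna().mean() gives the fraction of missing values in each column.
null_share = toy.isna().mean()

# Columns whose missing share exceeds the cutoff.
cols_to_drop = null_share[null_share > NULL_SHARE_THRESHOLD].index.tolist()

# drop() without inplace returns a new frame and leaves toy untouched.
cleaned = toy.drop(columns=cols_to_drop)
print(cols_to_drop)
print(cleaned.columns.tolist())
```

Working with shares rather than raw counts keeps the rule meaningful if the number of rows changes after further filtering.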

In [33]:
# Renaming all columns that say neighbourhood to neighborhood for consistency
df.columns = df.columns.str.replace('neighbourhood', 'neighborhood')

In [104]:
# Checking neighborhood_cleansed column for any results outside of Buenos Aires
df.neighborhood_cleansed.value_counts()

neighborhood_cleansed
Palermo              9325
Recoleta             4077
San Nicolas          1622
Belgrano             1477
Retiro               1332
Monserrat            1105
Almagro               971
Villa Crespo          870
Balvanera             833
San Telmo             732
Colegiales            653
Nuñez                 622
Caballito             524
Chacarita             430
Villa Urquiza         338
Constitucion          324
Puerto Madero         320
Barracas              203
Saavedra              185
San Cristobal         170
Flores                122
Coghlan               105
Villa Ortuzar         104
Villa Devoto           96
Villa Del Parque       90
Boedo                  76
Boca                   76
Parque Patricios       65
Parque Chas            62
Parque Chacabuco       61
Villa Pueyrredon       52
Paternal               43
Agronomia              42
Floresta               38
Villa Santa Rita       36
Villa Luro             28
Villa Gral. Mitre      27
Mataderos       

There are no results outside of Buenos Aires.

In [112]:
# Checking room_type contents
df.room_type.value_counts()

room_type
Entire home/apt    25765
Private room        1432
Shared room           87
Hotel room            56
Name: count, dtype: int64

In [114]:
# Dropping hotels from examination
df = df.loc[df['room_type'] != 'Hotel room']
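
A minimal sketch of the same filter on made-up data, followed by a check that no hotel rooms remain. The frame `toy` and its prices are assumptions for illustration; the room type labels match those printed above.

```python
import pandas as pd

# Toy stand-in for the listings data; prices are made up.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Hotel room", "Private room",
                  "Shared room", "Hotel room"],
    "price": [100, 250, 60, 30, 300],
})

# Keep every row whose room_type is not 'Hotel room'; copy() makes the
# result an independent frame.
no_hotels = toy.loc[toy["room_type"] != "Hotel room"].copy()

# Confirm which room types are left after filtering.
remaining_types = sorted(no_hotels["room_type"].unique())
print(remaining_types)
```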