# What factors affect Airbnb Prices in Paris on Weekends?


## Inrtoduction 

Airbnbs have specific attributes (room type, cleanliness rating, superhost status etc.) that determine its price and perceived quality. The quality of a Airbnbs often evaluated by users through a online/digital rating system. Attributes of Parisian Airbnbs have been determined and collected into a data set that is available on kaggle. It can be used to analyze trends in Airbnb prices and popularity across different cities and neighborhoods, as well as to identify factors that may influence prices and demand. The data set includes 19 variables in each respective column, with each row documenting an Airbnb's price. 

For details of data set, reference it [here](https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities?select=paris_weekends.csv ).

The following are the potential predictor variables including their description: 

| Variable | Description | Type |
| --- | --- | --- |
| realSum | The total price of the Airbnb listing. | Numeric |
| room_type | The type of room being offered (e.g. private, shared, etc.). | Categorical |
| room_shared | Whether the room is shared or not. | Boolean |
| room_private | Whether the room is private or not. | Boolean |
| person_capacity | The maximum number of people that can stay in the room. | Numeric |
| host_is_superhost | Whether the host is a superhost or not. | Boolean) |
| multi | Whether the listing is for multiple rooms or not. | Boolean |
| biz | Whether the listing is for business purposes or not. | Boolean |
| cleanliness_rating | The cleanliness rating of the listing. | Numeric |
| guest_satisfaction_overall | The overall guest satisfaction rating of the listing. | Numeric |
| bedrooms | The number of bedrooms in the listing. | Numeric |
| dist | The distance from the city centre. | Numeric |
| metro_dist | The distance from the nearest metro station. | Numeric |
| lng | The longitude of the listing. | Numeric |
| lat | The latitude of the listing. | Numeric |

This project will take this data and attempt to answer the question: *What factors affect Airbnb Prices in Paris on Weekends?*





## Preliminary exploratory data analysis

##### The preliminary exploratory data analysis will include:
- Reading dataset from web link
- Cleaning and wrangling data into a tidy format
- Splitting into training data and test data
- Statistics of training subset
- Visualizing training data comparing distributions of predictor variables

In [24]:
#load the necessary libraries
library(repr)
library(tidyverse)
library(tidymodels)
library(GGally)
library(gridExtra)
options(repr.matrix.max.rows = 6)

#### Reading from web link
Read from link with appropriate delimiter.

In [38]:
# Read the data frame from Kaggle's database
# Source: https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities?select=paris_weekends.csv
paris_data_link <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
# use ';' delimiter, read_csv2 will not work as '.' is used as decimal points
paris_data <- read_delim(paris_data_link, ';')

“One or more parsing issues, see `problems()` for details”
[1mRows: [22m[34m142[39m [1mColumns: [22m[34m1[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m (1): <!DOCTYPE html>

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


<!DOCTYPE html>
<chr>
"<html lang=""en"">"
<head>
<title>Airbnb Prices in European Cities | Kaggle</title>
⋮
</main>
</body>
</html>


In [28]:
#filter the needed columns
new_paris_data <- paris_data |>
  select(realSum, cleanliness_rating, guest_satisfaction_overall, dist, metro_dist, 
        room_type, host_is_superhost)
new_paris_data 

realSum,cleanliness_rating,guest_satisfaction_overall,dist,metro_dist,room_type,host_is_superhost
<dbl>,<dbl>,<dbl>,<dbl>,<dbl>,<chr>,<lgl>
536.3967,9,89,1.3512012,0.2123455,Entire home/apt,FALSE
290.1016,10,97,0.6998212,0.1937103,Private room,TRUE
445.7545,10,100,0.9689817,0.2943429,Entire home/apt,FALSE
⋮,⋮,⋮,⋮,⋮,⋮,⋮
223.9258,9,89,4.205205,0.2530289,Entire home/apt,FALSE
200.8575,9,93,2.891214,0.2406744,Entire home/apt,TRUE
301.2862,10,92,3.469749,0.5085167,Entire home/apt,FALSE


In [45]:
#set the seed
set.seed(8888)

In [46]:
#splitting the data
paris_split <- initial_split(new_paris_data, prop = 0.75, strata = realSum) 
paris_train <- training(paris_split)
paris_test <- testing(paris_split)

In [47]:
#Table and Counts
#Cleanliness_rating
paris_proportions1 <- paris_train |> #use TRAINING data
group_by(cleanliness_rating) |>
summarize(n = n())  #used to count the number of observations in a given group
paris_proportions1

#guest_satisfaction_overall
paris_proportions2 <- paris_train |>
group_by(guest_satisfaction_overall) |>
summarize(n = n())
paris_proportions2

#dist
paris_proportions3 <- paris_train |>
group_by(dist) |>
summarize(n = n()) 
paris_proportions3

#metro_dist
paris_proportions4 <- paris_train |>
group_by(metro_dist) |>
summarize(n = n()) 
paris_proportions4

#room_type
paris_proportions5 <- paris_train |>
group_by(room_type) |>
summarize(n = n()) 
paris_proportions5

#host_is_superhost
paris_proportions6 <- paris_train |>
group_by(host_is_superhost) |>
summarize(n = n()) 
paris_proportions6


cleanliness_rating,n
<dbl>,<int>
2,3
4,7
5,2
⋮,⋮
8,272
9,947
10,1334


guest_satisfaction_overall,n
<dbl>,<int>
20,4
40,5
50,4
⋮,⋮
98,138
99,70
100,584


dist,n
<dbl>,<int>
0.1395432,1
0.1470952,1
0.1714837,1
⋮,⋮
7.620912,1
7.680000,1
7.703733,1


metro_dist,n
<dbl>,<int>
0.003220008,1
0.003935058,1
0.006388847,1
⋮,⋮
0.9102045,1
0.9689053,1
1.0458365,1


room_type,n
<chr>,<int>
Entire home/apt,2065
Private room,568
Shared room,34


host_is_superhost,n
<lgl>,<int>
False,2290
True,377


In [61]:
#Mean 
mean1 <- paris_train |> #use TRAINING data 
    select(cleanliness_rating:metro_dist)|> 
    map_df(max, na.rm = TRUE)
mean1

cleanliness_rating,guest_satisfaction_overall,dist,metro_dist
<dbl>,<dbl>,<dbl>,<dbl>
10,100,7.703733,1.045836
