# **STAT 301 – Final Report - Group 5**

---

### **Group Members: Mu Ye Liu, Taha Memon, Sajid Mahmood**

---

# 1. Data Description

Airbnb has revolutionized the short-term rental industry by connecting tourists with hosts who offer shared or entire homes. It has grown into a worldwide marketplace in which prices are driven by many factors, such as the characteristics of the property and the host, guest demand, and geographical location.

Understanding what drives Airbnb prices matters to both market analysts and hosts. Strategic pricing can help hosts maximize occupancy and revenue, and it gives researchers insight into how market activity and guest preferences play out in the sharing economy.

The dataset we analyze contains Airbnb listings from several major European cities, including **Amsterdam, Athens, and Berlin**, with separate data for **weekdays and weekends**. Each listing includes information on:

- **Price (`realSum`)**: The total cost of the listing
- **Property and host attributes** (e.g., room type, number of bedrooms, superhost status)
- **Business and guest indicators** (e.g., business-friendly flag, guest satisfaction)
- **Geographical context** (e.g., distance from city center, latitude, longitude)
- **Time of listing** (weekday vs weekend)

Using this detailed data, we can explore how host decisions, property characteristics, the time of the week (weekday vs. weekend), and location affect pricing. Investigating these relationships helps us uncover the patterns and behaviors present in the raw data.

# Exploratory Data Analysis (EDA)

Next, we examine and clean the data before addressing our research question:

****

This involves:
1. Loading the dataset into R.
2. Cleaning and wrangling the data into a readable and usable state.
3. Producing relevant visualizations to identify key patterns and potential modeling issues.

###  Step 1: Load the Data

In total, we have **six files**, which were merged into a single dataset using **R**.

In [10]:
# Load libraries
library(readr)
library(dplyr)

# Read in all six CSVs
amsterdam_weekdays <- read_csv("amsterdam_weekdays.csv")
amsterdam_weekends <- read_csv("amsterdam_weekends.csv")
athens_weekdays <- read_csv("athens_weekdays.csv")
athens_weekends <- read_csv("athens_weekends.csv")
berlin_weekdays <- read_csv("berlin_weekdays.csv")
berlin_weekends <- read_csv("berlin_weekends.csv")

[1m[22mNew names:
[36m•[39m `` -> `...1`
[1mRows: [22m[34m1103[39m [1mColumns: [22m[34m20[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (1): room_type
[32mdbl[39m (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
[33mlgl[39m  (3): room_shared, room_private, host_is_superhost

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.
[1m[22mNew names:
[36m•[39m `` -> `...1`
[1mRows: [22m[34m977[39m [1mColumns: [22m[34m20[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[31mchr[39m  (1): room_type
[32mdbl[39m (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
[33mlgl[39m  (3): room_shared, room_pri

###  Step 2: Clean and Wrangle the Data

We'll now combine the datasets and add two important context columns:
- `city`: name of the city (Amsterdam, Athens, Berlin)
- `day_type`: either `"weekday"` or `"weekend"`

We'll also:
- Convert logical and categorical columns to appropriate types
- Ensure a clean, tidy structure for plotting and modeling
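To illustrate what the type conversion does, here is a small sketch on made-up values (the toy data frame is hypothetical and not taken from the dataset). Converting a logical or a 0/1 numeric column with `as.factor()` turns it into a categorical variable with explicit levels, so plotting and modeling functions treat it as a group label rather than a number:

```r
# Hypothetical toy values, not taken from the Airbnb data
toy <- data.frame(
  host_is_superhost = c(TRUE, FALSE, TRUE),
  biz = c(0, 1, 1)
)

toy$host_is_superhost <- as.factor(toy$host_is_superhost)
toy$biz <- as.factor(toy$biz)

levels(toy$host_is_superhost)  # "FALSE" "TRUE"
levels(toy$biz)                # "0" "1"
```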

In [11]:
# Add city and day_type to each dataset
amsterdam_weekdays$city <- "Amsterdam"
amsterdam_weekends$city <- "Amsterdam"
athens_weekdays$city <- "Athens"
athens_weekends$city <- "Athens"
berlin_weekdays$city <- "Berlin"
berlin_weekends$city <- "Berlin"

amsterdam_weekdays$day_type <- "weekday"
amsterdam_weekends$day_type <- "weekend"
athens_weekdays$day_type <- "weekday"
athens_weekends$day_type <- "weekend"
berlin_weekdays$day_type <- "weekday"
berlin_weekends$day_type <- "weekend"

# Combine all datasets
airbnb <- bind_rows(
  amsterdam_weekdays,
  amsterdam_weekends,
  athens_weekdays,
  athens_weekends,
  berlin_weekdays,
  berlin_weekends
)

# Convert categorical and indicator columns to factors
airbnb_clean <- airbnb %>%
  mutate(
    city = as.factor(city),
    day_type = as.factor(day_type),
    room_type = as.factor(room_type),
    host_is_superhost = as.factor(host_is_superhost),
    biz = as.factor(biz)
  )
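The load, label, and combine steps above can also be written more compactly by looping over the city and day-type combinations. The following is only a sketch under stated assumptions: it follows the `<city>_<weekdays|weekends>.csv` naming pattern used above, and it writes tiny made-up CSVs to a temporary directory so that it runs on its own. The helper `read_one()` and the toy values are hypothetical.

```r
library(readr)
library(dplyr)

# Toy setup (made-up data): six tiny CSVs in a temporary directory
data_dir <- tempfile("airbnb_")
dir.create(data_dir)
combos <- expand.grid(
  city = c("amsterdam", "athens", "berlin"),
  day = c("weekdays", "weekends"),
  stringsAsFactors = FALSE
)
for (i in seq_len(nrow(combos))) {
  toy <- data.frame(realSum = c(100, 200), biz = c(0, 1))
  file_name <- paste0(combos$city[i], "_", combos$day[i], ".csv")
  write_csv(toy, file.path(data_dir, file_name))
}

# Hypothetical helper: read one file and tag it with its city and day type
read_one <- function(city_name, day_name) {
  path <- file.path(data_dir, paste0(city_name, "_", day_name, ".csv"))
  read_csv(path, show_col_types = FALSE) %>%
    mutate(
      # Capitalize the first letter, e.g. "athens" becomes "Athens"
      city = paste0(toupper(substr(city_name, 1, 1)), substring(city_name, 2)),
      # "weekdays" becomes "weekday", "weekends" becomes "weekend"
      day_type = sub("s$", "", day_name)
    )
}

airbnb_toy <- bind_rows(Map(read_one, combos$city, combos$day))

# Quick sanity check: every city and day-type combination should appear
count(airbnb_toy, city, day_type)
```

A count by `city` and `day_type` like the one at the end is also a quick way to confirm that the real combined dataset contains all six source files.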
