# STAT 301 – Final Report (Group)

# 1. Data Description

## Dataset Source and Collection

The dataset used in this project is **“Airbnb Prices in European Cities”**, published on [Kaggle](https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities). It was assembled from **data scraped from publicly available Airbnb listings**. Each listing includes metadata such as price, room features, guest satisfaction and cleanliness ratings, geographic position, and host details.

Although the Kaggle page does not include the scraping code, the structure and naming of the variables suggest that the data was collected by **automated crawling of the Airbnb platform** across several European cities. Each row represents a single property listing at a single point in time.



## Dataset Overview

This project focuses on three cities:
- **Amsterdam**
- **Athens**
- **Berlin**

Each city has two files:
- One for **weekday listings**
- One for **weekend listings**

In total, we have **six files**, which were merged into a single dataset using **R**.

In [2]:
# Load libraries
library(readr)
library(dplyr)

# Read in all six CSVs
amsterdam_weekdays <- read_csv("amsterdam_weekdays.csv")
amsterdam_weekends <- read_csv("amsterdam_weekends.csv")
athens_weekdays <- read_csv("athens_weekdays.csv")
athens_weekends <- read_csv("athens_weekends.csv")
berlin_weekdays <- read_csv("berlin_weekdays.csv")
berlin_weekends <- read_csv("berlin_weekends.csv")

New names:
• `` -> `...1`
Rows: 1103 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_private, host_is_superhost

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
New names:
• `` -> `...1`
Rows: 977 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): room_type
dbl (16): ...1, realSum, person_capacity, multi, biz, cleanliness_rating, gu...
lgl  (3): room_shared, room_pri
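
The `New names` note in the output above indicates that the first column of each CSV has an empty header, which `read_csv()` renames to `...1`. Assuming that column is only a row index left over from export (an assumption on our part, not something the dataset page states), a minimal sketch of reading a file without the column-specification printout and dropping that column could look like this. The toy file and the names `toy_path` and `toy` are hypothetical stand-ins for the real CSVs.

```r
library(readr)
library(dplyr)

# Toy file mimicking the layout of the real CSVs: write.csv() keeps row names
# by default, which produces a first column with an empty header.
toy_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(realSum = c(194.0, 344.2),
             room_type = c("Private room", "Private room")),
  toy_path
)

# show_col_types = FALSE silences the column specification printout;
# any_of() drops "...1" if it exists and does nothing otherwise.
toy <- read_csv(toy_path, show_col_types = FALSE) %>%
  select(-any_of("...1"))

names(toy)
```

The name-repair note may still be printed, since `show_col_types` only controls the column specification message.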

In [3]:
# Add city and day_type to each dataset
amsterdam_weekdays$city <- "Amsterdam"
amsterdam_weekends$city <- "Amsterdam"
athens_weekdays$city <- "Athens"
athens_weekends$city <- "Athens"
berlin_weekdays$city <- "Berlin"
berlin_weekends$city <- "Berlin"

amsterdam_weekdays$day_type <- "weekday"
amsterdam_weekends$day_type <- "weekend"
athens_weekdays$day_type <- "weekday"
athens_weekends$day_type <- "weekend"
berlin_weekdays$day_type <- "weekday"
berlin_weekends$day_type <- "weekend"

# Combine all datasets
airbnb <- bind_rows(
  amsterdam_weekdays,
  amsterdam_weekends,
  athens_weekdays,
  athens_weekends,
  berlin_weekdays,
  berlin_weekends
)

# Convert to factors
airbnb_clean <- airbnb %>%
  mutate(
    city = as.factor(city),
    day_type = as.factor(day_type),
    host_is_superhost = as.factor(host_is_superhost),
    biz = as.factor(biz)
  )
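
The read, tag, and combine steps above repeat the same pattern six times. As a sketch only, the same result could be produced with a small helper, assuming the files follow the `<city>_<day_type>s.csv` naming used above. The function names `read_listing_file` and `read_all_listings`, and the `data_dir` argument, are hypothetical and not part of the original analysis.

```r
library(readr)
library(dplyr)

# Hypothetical helper: read one city/day-type file and tag its rows.
# Assumes file names of the form "<city>_<day_type>s.csv" (lower-case city).
read_listing_file <- function(city_name, day, data_dir = ".") {
  path <- file.path(data_dir, paste0(tolower(city_name), "_", day, "s.csv"))
  read_csv(path, show_col_types = FALSE) %>%
    mutate(city = city_name, day_type = day)
}

# Hypothetical wrapper: read every city/day-type combination and stack the rows.
read_all_listings <- function(cities = c("Amsterdam", "Athens", "Berlin"),
                              day_types = c("weekday", "weekend"),
                              data_dir = ".") {
  # One row per city/day-type pair
  combos <- expand.grid(city_name = cities, day = day_types,
                        stringsAsFactors = FALSE)
  pieces <- Map(
    function(cty, d) read_listing_file(cty, d, data_dir),
    combos$city_name, combos$day
  )
  bind_rows(pieces)
}
```

With the six CSVs in the working directory, `airbnb_alt <- read_all_listings()` would correspond to `airbnb` above (the row order differs, because the files are read in a different sequence), and `count(airbnb_alt, city, day_type)` gives a quick check that all six city and day-type groups are present.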
