# Assignment 1: Data Description & Exploratory Data Analysis

### Group 35: Prasojo, Naufal

### Section 1: Data Description

**1.1 Data Summary**
- **Dataset Name:** Airbnb Prices in European Cities
- **Source:** Kaggle dataset by The Devastator (2021). Available at
https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities
- **License:** CC BY-NC 4.0 (Attribution-NonCommercial).
- **Data Collection:** Listings were web-scraped from Airbnb for 10 European cities (Amsterdam, Athens, Barcelona, Berlin, Budapest, Lisbon, London, Paris, Rome, Vienna), but for this assignment we only use Athens data.
     - Each city has two files (weekday and weekend) capturing snapshot prices for 2-night stays for two guests.
     - Approx. 2,627 to 2,653 observations (rows) and 19 variables (columns).



**Variable Description**

| Variable Name                        | Type                | Description                                                         |
| :----------------------------------- | :------------------ | :------------------------------------------------------------------ |
| `realSum`                            | Numeric             | Total price (in euros) for the stay.                                |
| `room_type`                          | Categorical         | Type of accommodation (Entire home/apt, Private room, Shared room). |
| `room_shared`                        | Boolean             | 1 if room is shared; 0 otherwise.                                   |
| `room_private`                       | Boolean             | 1 if room is private; 0 otherwise.                                  |
| `person_capacity`                    | Integer             | Maximum number of guests.                                           |
| `superhost`                          | Boolean             | 1 if host has Superhost status; 0 otherwise.                        |
| `multi`                              | Boolean             | 1 if host owns 2–4 listings; 0 otherwise.                           |
| `biz`                                | Boolean             | 1 if host is a business (>4 listings); 0 otherwise.                 |
| `cleanliness_rating`                 | Numeric (1–10)      | Guest-reported cleanliness rating.                                  |
| `guest_satisfaction_overall`         | Numeric (1–100)     | Overall guest satisfaction score.                                   |
| `bedrooms`                           | Integer             | Number of bedrooms (0 for studio).                                  |
| `dist`                               | Numeric             | Distance from city centre (km).                                     |
| `metro_dist`                         | Numeric             | Distance from nearest metro station (km).                           |
| `attr_index`                         | Numeric             | Local attraction density index.                                     |
| `rest_index`                         | Numeric             | Local restaurant density index.                                     |
| `attr_index_norm`, `rest_index_norm` | Numeric             | Normalized versions of indices for comparability across cities.     |
| `lng`, `lat`                         | Numeric             | Longitude and latitude coordinates of listing.                      |



**1.2 Source and Information**

The data were collected by The Devastator by web-scraping public Airbnb listings around 2020–2021. The author compiled cleaned CSV files for each city and time category (weekday vs weekend) and released them under the CC BY-NC 4.0 (Attribution-NonCommercial) license.

**1.3 Pre-Selection of Variables**

- Variables to keep:

`realSum`, `room_type`, `person_capacity`, `superhost`, `cleanliness_rating`, `guest_satisfaction_overall`, `dist`, `metro_dist`, `attr_index`, `rest_index`, and a weekday/weekend indicator derived from the source file.

Reasoning: These variables seem most useful both for understanding which factors drive Airbnb prices and for building a model that can help estimate fair prices for future listings in Athens.
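
As a rough sketch of how the kept variables could be assembled, the snippet below stacks a weekday and a weekend table and records which file each row came from. The column name `day_type` and the helper `tag_and_select()` are hypothetical, and the two small data frames are toy stand-ins that hold only a few rows and columns copied from the previews in Section 3. The Superhost column is written as `host_is_superhost` because that is how it appears in the CSV header.

```r
# Toy stand-ins for the two Athens files (values copied from the Section 3 previews)
athens_weekdays <- data.frame(
  realSum = c(129.82448, 138.96375, 74.05151),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room"),
  person_capacity = c(4, 4, 2),
  host_is_superhost = c(FALSE, TRUE, FALSE),
  dist = c(2.8139635, 0.4072929, 2.194185),
  metro_dist = c(0.88189, 0.3045679, 0.3852657),
  lng = c(23.766, 23.73168, 23.73391),
  lat = c(37.983, 37.97776, 37.99529),
  stringsAsFactors = FALSE
)
athens_weekends <- data.frame(
  realSum = c(138.96375, 91.62702, 76.62925),
  room_type = c("Entire home/apt", "Entire home/apt", "Private room"),
  person_capacity = c(4, 4, 2),
  host_is_superhost = c(TRUE, TRUE, FALSE),
  dist = c(0.4072777, 4.3674631, 2.1941738),
  metro_dist = c(0.3045697, 0.2974735, 0.3852475),
  lng = c(23.73168, 23.72712, 23.73391),
  lat = c(37.97776, 38.01435, 37.99529),
  stringsAsFactors = FALSE
)

# Variables to keep, using the column names found in the CSV header
keep_vars <- c("realSum", "room_type", "person_capacity", "host_is_superhost",
               "cleanliness_rating", "guest_satisfaction_overall",
               "dist", "metro_dist", "attr_index", "rest_index")

# Hypothetical helper: keep the selected columns that exist and tag the source file
tag_and_select <- function(df, day_type) {
  out <- df[, intersect(keep_vars, names(df)), drop = FALSE]
  out$day_type <- day_type
  out
}

athens <- rbind(
  tag_and_select(athens_weekdays, "weekday"),
  tag_and_select(athens_weekends, "weekend")
)

# Treat the categorical variables as factors for later modelling
athens$day_type <- factor(athens$day_type, levels = c("weekday", "weekend"))
athens$room_type <- factor(athens$room_type)
str(athens)
```

With the full files loaded, the same helper would apply unchanged, since `intersect()` simply ignores any kept name that is missing from a table.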

- Variables to drop (initial cleaning phase):

    - `lng`, `lat` – spatial coordinates not needed for non-map models (but can be added later for spatial EDA).
    - `attr_index_norm`, `rest_index_norm` – correlated with non-normalized versions and may introduce redundancy.
    - `multi`, `biz` – may be reintroduced if host type becomes a variable of interest, but the initial models will focus on inference for price.

Reasoning: Variables dropped contain redundant or identifier information not directly useful for predictive or interpretive modelling. All decisions will be re-evaluated after EDA.
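
The redundancy claim for the normalized indices can be checked with a simple correlation. The sketch below is only illustrative: it uses the six weekday rows shown in the Section 3 preview rather than the full file, and the data frame name `preview` is hypothetical.

```r
# Index values copied from the six-row weekday preview in Section 3
preview <- data.frame(
  attr_index = c(55.34857, 240.30665, 199.50737, 39.80305, 78.7334, 96.58899),
  attr_index_norm = c(2.086871, 9.060559, 7.522257, 1.50074, 2.968577, 3.641806),
  rest_index = c(78.77838, 407.1677, 395.9674, 58.70658, 113.32597, 158.64432),
  rest_index_norm = c(5.91516, 30.572629, 29.731642, 4.408047, 8.509204, 11.911981)
)

# Pearson correlation between each raw index and its normalized version
cor_attr <- cor(preview$attr_index, preview$attr_index_norm)
cor_rest <- cor(preview$rest_index, preview$rest_index_norm)
print(c(attr = cor_attr, rest = cor_rest))
```

A correlation very close to 1 within a single city file would suggest the normalized columns are just rescaled copies, in which case keeping both adds no information for an Athens-only model. This should be re-checked on the full data before the columns are dropped for good.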


### Section 2: Scientific Question

**2.1 Question**

I would like to model and predict the prices of European Airbnb listings from listing characteristics (e.g., `room_type`, `person_capacity`, `superhost`) and quantitative location features (e.g., `dist`, `metro_dist`, `attr_index`).

**2.2 Name the response**

The response variable is `realSum`, representing the total price of an Airbnb listing in Europe (specifically in Athens, the focus of this assignment).

**2.3 Explain whether your question is focused on prediction, inference, or both**

My question focuses on both prediction and inference. I aim to identify which listing and location features significantly influence Airbnb prices (inference) while also building a model that can accurately predict the price of future listings based on these factors.
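
To make the two goals concrete, here is a minimal sketch that uses one fitted model for both. It assumes an ordinary linear regression, which is only a placeholder since the final model has not been chosen, and it uses the six weekday rows from the Section 3 preview, far too few for real conclusions. The names `toy`, `fit`, and `new_listing` are hypothetical.

```r
# Six rows copied from the weekday preview in Section 3
toy <- data.frame(
  realSum = c(129.82448, 138.96375, 156.30492, 91.62702, 74.05151, 113.88934),
  dist = c(2.8139635, 0.4072929, 1.2372111, 4.3674572, 2.194185, 2.0712056),
  person_capacity = c(4, 4, 3, 4, 2, 6)
)

# Assumption: a plain linear model with two predictors
fit <- lm(realSum ~ dist + person_capacity, data = toy)

# Inference: coefficient estimates, standard errors, t-values, and p-values
coef_table <- summary(fit)$coefficients
print(coef_table)

# Prediction: price for a hypothetical new listing, with a 95% prediction interval
new_listing <- data.frame(dist = 1.5, person_capacity = 2)
pred <- predict(fit, newdata = new_listing, interval = "prediction")
print(pred)
```

The coefficient table serves the inference goal (which features are associated with price, and how strongly), while `predict()` serves the prediction goal for listings not yet in the data.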



### Section 3: Exploratory Data Analysis and Visualization 

In [24]:
# List CSV files in ./data
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
csv_files

# Read each CSV into a named list of data.frames
data_list <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
names(data_list) <- basename(csv_files)  # names are file names

# # Quick preview: show number of rows and first few rows for each
# lapply(data_list, function(df) {
#   list(rows = nrow(df), cols = ncol(df), head = head(df, 3))
# })

In [45]:
athens_weekdays <- data_list[["athens_weekdays.csv"]]
athens_weekends <- data_list[["athens_weekends.csv"]]

head(athens_weekdays)
head(athens_weekends)

X,realSum,room_type,room_shared,room_private,person_capacity,host_is_superhost,multi,biz,cleanliness_rating,guest_satisfaction_overall,bedrooms,dist,metro_dist,attr_index,attr_index_norm,rest_index,rest_index_norm,lng,lat
0,129.82448,Entire home/apt,False,False,4,False,0,0,10,100,2,2.8139635,0.88189,55.34857,2.086871,78.77838,5.91516,23.766,37.983
1,138.96375,Entire home/apt,False,False,4,True,1,0,10,96,1,0.4072929,0.3045679,240.30665,9.060559,407.1677,30.572629,23.73168,37.97776
2,156.30492,Entire home/apt,False,False,3,True,0,1,10,98,1,1.2372111,0.2884881,199.50737,7.522257,395.9674,29.731642,23.722,37.979
3,91.62702,Entire home/apt,False,False,4,True,1,0,10,99,1,4.3674572,0.2974673,39.80305,1.50074,58.70658,4.408047,23.72712,38.01435
4,74.05151,Private room,False,True,2,False,0,0,10,100,1,2.194185,0.3852657,78.7334,2.968577,113.32597,8.509204,23.73391,37.99529
5,113.88934,Entire home/apt,False,False,6,True,1,0,10,96,2,2.0712056,0.4538674,96.58899,3.641806,158.64432,11.911981,23.71584,37.98598


X,realSum,room_type,room_shared,room_private,person_capacity,host_is_superhost,multi,biz,cleanliness_rating,guest_satisfaction_overall,bedrooms,dist,metro_dist,attr_index,attr_index_norm,rest_index,rest_index_norm,lng,lat
0,138.96375,Entire home/apt,False,False,4,True,1,0,10,96,1,0.4072777,0.3045697,240.3065,9.054205,407.16796,6.0806216,23.73168,37.97776
1,91.62702,Entire home/apt,False,False,4,True,1,0,10,99,1,4.3674631,0.2974735,39.803,1.499687,58.70652,0.8767197,23.72712,38.01435
2,76.62925,Private room,False,True,2,False,0,0,10,100,1,2.1941738,0.3852475,78.734,2.966519,113.32668,1.6924138,23.73391,37.99529
3,151.85246,Entire home/apt,False,False,4,True,0,1,10,100,2,2.5089816,0.5634735,68.77488,2.591282,101.16207,1.5107482,23.732,37.998
4,98.65723,Entire home/apt,False,False,2,True,1,0,10,95,1,2.7405814,0.7250455,62.90286,2.370037,92.61113,1.3830489,23.731,38.0
5,173.88044,Entire home/apt,False,False,4,True,1,0,10,97,1,0.8690264,0.4707861,132.33536,4.986097,221.33873,3.3054592,23.7368,37.98331


In [40]:
columns <- colnames(athens_weekdays)
columns

In [None]:
# Just to check room types / room shared
# room_shared_weekdays <- athens_weekdays[athens_weekdays$room_type == 'Shared room', ]
# room_shared_weekends <- athens_weekends[athens_weekends$room_type == 'Shared room', ]
# room_shared_weekdays
# nrow(room_shared_weekdays)
# room_shared_weekends
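
A compact alternative to the commented-out check above is to tabulate `room_type` and compare it with the two indicator columns. The sketch below is illustrative only: `preview` is a hypothetical data frame holding the six weekday rows from the output above, so the counts say nothing about the full file.

```r
# Room columns copied from the six-row weekday preview
preview <- data.frame(
  room_type = c("Entire home/apt", "Entire home/apt", "Entire home/apt",
                "Entire home/apt", "Private room", "Entire home/apt"),
  room_shared = c(FALSE, FALSE, FALSE, FALSE, FALSE, FALSE),
  room_private = c(FALSE, FALSE, FALSE, FALSE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Count listings per room type
room_counts <- table(preview$room_type)
print(room_counts)

# Number of shared rooms in this sample
n_shared <- sum(preview$room_type == "Shared room")

# Check that the indicator columns agree with room_type
private_ok <- all((preview$room_type == "Private room") == preview$room_private)
shared_ok <- all((preview$room_type == "Shared room") == preview$room_shared)
print(c(n_shared = n_shared, private_ok = private_ok, shared_ok = shared_ok))
```

If both checks also hold on the full data, `room_shared` and `room_private` carry the same information as `room_type` and need not be modelled separately.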