# Assignment 1: Data Description & Exploratory Data Analysis

### Group 35: Prasojo, Naufal

### Section 1: Data Description

**1.1 Data Summary**
- **Dataset Name:** Airbnb Prices in European Cities
- **Source:** Kaggle dataset by The Devastator (2021), available at
https://www.kaggle.com/datasets/thedevastator/airbnb-prices-in-european-cities
- **License:** CC BY-NC 4.0 (Attribution-NonCommercial).
- **Data Collection:** Listings were web-scraped from Airbnb for 10 European cities (Amsterdam, Athens, Barcelona, Berlin, Budapest, Lisbon, London, Paris, Rome, Vienna).
     - Each city has two files (weekday and weekend), each a snapshot of prices for a 2-night stay for two guests.
     - Approx. 51 000 observations (rows) and ≈ 21–23 variables (columns).



**1.2 Variable Description**

| Variable Name                        | Type                | Description                                                         |
| :----------------------------------- | :------------------ | :------------------------------------------------------------------ |
| `realSum`                            | Numeric             | Total price (in euros) for the stay.                                |
| `room_type`                          | Categorical         | Type of accommodation (Entire home/apt, Private room, Shared room). |
| `room_shared`                        | Boolean             | 1 if room is shared; 0 otherwise.                                   |
| `room_private`                       | Boolean             | 1 if room is private; 0 otherwise.                                  |
| `person_capacity`                    | Integer             | Maximum number of guests.                                           |
| `host_is_superhost`                  | Boolean             | 1 if host has Superhost status; 0 otherwise.                        |
| `multi`                              | Boolean             | 1 if host owns 2–4 listings; 0 otherwise.                           |
| `biz`                                | Boolean             | 1 if host is a business (>4 listings); 0 otherwise.                 |
| `cleanliness_rating`                 | Numeric (1–10)      | Guest-reported cleanliness rating.                                  |
| `guest_satisfaction_overall`         | Numeric (1–100)     | Overall guest satisfaction score.                                   |
| `bedrooms`                           | Integer             | Number of bedrooms (0 for studio).                                  |
| `dist`                               | Numeric             | Distance from city centre (km).                                     |
| `metro_dist`                         | Numeric             | Distance from nearest metro station (km).                           |
| `attr_index`                         | Numeric             | Local attraction density index.                                     |
| `rest_index`                         | Numeric             | Local restaurant density index.                                     |
| `attr_index_norm`, `rest_index_norm` | Numeric             | Normalized versions of indices for comparability across cities.     |
| `lng`, `lat`                         | Numeric             | Longitude and latitude coordinates of listing.                      |
| `city`                               | Categorical         | City name (added from filename).                                    |
| `weekday/weekend`                    | Categorical         | Whether price is from weekday or weekend file.                      |
| `X`, `Unnamed`, `ID`, `host_id`      | Categorical/Integer | Technical identifier columns (if present; not used in analysis).    |
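
The Boolean flags in the table are not encoded uniformly in the raw files: in the `str()` output further down, `room_shared`, `room_private` and `host_is_superhost` appear as the strings `"True"`/`"False"`, while `multi` and `biz` are 0/1 integers. A minimal sketch of one way to bring them to a common logical type is shown below. The `toy` data frame and the `flag_cols` vector are illustrative stand-ins, not part of the dataset, and the sketch relies on base R's `as.logical()` accepting both encodings.

```r
# Illustrative stand-in for one raw file; the values are made up for the demo
toy <- data.frame(
  room_shared       = c("False", "False"),
  room_private      = c("True", "True"),
  host_is_superhost = c("False", "True"),
  multi             = c(1L, 0L),
  biz               = c(0L, 1L),
  stringsAsFactors  = FALSE
)

# Columns treated as Boolean flags (assumed from the variable table)
flag_cols <- c("room_shared", "room_private", "host_is_superhost", "multi", "biz")

# as.logical() maps "True"/"False" strings and 0/1 integers to TRUE/FALSE;
# any other string becomes NA, which makes unexpected values easy to spot
toy[flag_cols] <- lapply(toy[flag_cols], as.logical)

str(toy)
```

The same `lapply()` call could be applied to each element of `data_list` once the files are loaded; checking `anyNA()` on the converted columns afterwards would reveal values that matched neither encoding.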


In [None]:
# List CSV files in ./data
csv_files <- list.files("data", pattern = "\\.csv$", full.names = TRUE)
csv_files

# Read each CSV into a named list of data.frames
data_list <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
names(data_list) <- basename(csv_files)  # names are file names

# Optional quick preview: dimensions and first few rows of each file (uncomment to run)
# lapply(data_list, function(df) {
#   list(rows = nrow(df), cols = ncol(df), head = head(df, 3))
# })
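
The variable table lists `city` and a weekday/weekend indicator as columns added from the file name. One possible way to derive them while stacking the per-file data frames is sketched below. `combine_listings()` and the `day_type` column name are hypothetical, and the sketch assumes every file name follows the `<city>_<weekdays|weekends>.csv` pattern seen in `amsterdam_weekdays.csv` and that all files share the same columns. A toy list stands in for `data_list` so the example runs without the CSV files.

```r
# Hypothetical helper: stack per-file data frames and tag each row with the
# city and day type parsed from names such as "amsterdam_weekdays.csv"
combine_listings <- function(data_list) {
  tagged <- lapply(names(data_list), function(file_name) {
    df    <- data_list[[file_name]]
    stem  <- sub("\\.csv$", "", file_name)            # e.g. "amsterdam_weekdays"
    parts <- strsplit(stem, "_", fixed = TRUE)[[1]]   # c("amsterdam", "weekdays")
    df$city     <- rep(parts[1], nrow(df))
    df$day_type <- rep(parts[2], nrow(df))
    df
  })
  # rbind() requires identical column names across files (assumed here)
  do.call(rbind, tagged)
}

# Toy stand-in for data_list; the prices are made up for the demo
toy_list <- list(
  "amsterdam_weekdays.csv" = data.frame(realSum = c(100, 200)),
  "amsterdam_weekends.csv" = data.frame(realSum = 300)
)

listings <- combine_listings(toy_list)
table(listings$city, listings$day_type)
```

With the real `data_list`, `nrow(combine_listings(data_list))` would give the total number of observations across all files, which can be compared against the approximate figure quoted in the data summary.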

In [None]:
# Pull out one file to inspect its first rows and column types
amsterdam <- data_list[["amsterdam_weekdays.csv"]]
head(amsterdam)
str(amsterdam)

X,realSum,room_type,room_shared,room_private,person_capacity,host_is_superhost,multi,biz,cleanliness_rating,guest_satisfaction_overall,bedrooms,dist,metro_dist,attr_index,attr_index_norm,rest_index,rest_index_norm,lng,lat
0,194.0337,Private room,False,True,2,False,1,0,10,93,1,5.0229638,2.53938,78.69038,4.166708,98.2539,6.846473,4.90569,52.41772
1,344.2458,Private room,False,True,4,False,0,0,8,85,1,0.4883893,0.2394039,631.17638,33.421209,837.28076,58.342928,4.90005,52.37432
2,264.1014,Private room,False,True,2,False,0,1,9,87,1,5.7483119,3.6516213,75.27588,3.985908,95.38695,6.6467,4.97512,52.36103
3,433.5294,Private room,False,True,4,False,0,1,9,90,2,0.384862,0.4398761,493.27253,26.119108,875.0331,60.973565,4.89417,52.37663
4,485.5529,Private room,False,True,2,True,0,0,10,98,1,0.5447382,0.3186926,552.83032,29.272733,815.30574,56.811677,4.90051,52.37508
5,552.8086,Private room,False,True,3,False,0,0,8,100,2,2.1314201,1.9046682,174.78896,9.255191,225.20166,15.692376,4.87699,52.38966


'data.frame':	1103 obs. of  20 variables:
 $ X                         : int  0 1 2 3 4 5 6 7 8 9 ...
 $ realSum                   : num  194 344 264 434 486 ...
 $ room_type                 : chr  "Private room" "Private room" "Private room" "Private room" ...
 $ room_shared               : chr  "False" "False" "False" "False" ...
 $ room_private              : chr  "True" "True" "True" "True" ...
 $ person_capacity           : num  2 4 2 4 2 3 2 4 4 2 ...
 $ host_is_superhost         : chr  "False" "False" "False" "False" ...
 $ multi                     : int  1 0 0 0 0 0 0 0 0 1 ...
 $ biz                       : int  0 0 1 1 0 0 0 0 0 0 ...
 $ cleanliness_rating        : num  10 8 9 9 10 8 10 10 9 10 ...
 $ guest_satisfaction_overall: num  93 85 87 90 98 100 94 100 96 88 ...
 $ bedrooms                  : int  1 1 1 2 1 2 1 3 2 1 ...
 $ dist                      : num  5.023 0.488 5.748 0.385 0.545 ...
 $ metro_dist                : num  2.539 0.239 3.652 0.44 0.319 ...
 $ attr_in