# **AirBnB Data Exploration**
# **01 DATA LOADING**
Initial exploration of AirBnB datasets including listings, calendar, and neighborhood data.

In [1]:
# Import required libraries
import pandas as pd
from pathlib import Path
import geopandas as gpd
from shapely.geometry import Point

## Data Loading
Loading all required datasets from the raw data directory.

In [2]:
# Set up project paths
project_root = Path().resolve().parent


### Load Listings Data

Inside Airbnb provides two “listings” datasets:

- `listings_summary.csv`
- `listings_extended.csv`

It is unclear whether these datasets differ. Therefore, our goal is to understand the differences between them and determine if one is a subset of the other.

To do this, we will:
1. Load both datasets.  
2. Compare their shapes (number of rows and columns).  
3. Identify common and unique columns.  

In [3]:
# Define file paths
listings_summary_path = project_root / "data" / "raw" / "listings_summary.csv" 
listings_extended_path = project_root / "data" / "raw" / "listings_extended.csv"

# Load listings summary data
df_listings_summary = pd.read_csv(listings_summary_path)
print("Listings Summary Shape:", df_listings_summary.shape)
display(df_listings_summary.head(5))  # Better visualization in Jupyter


# Load listings extended data
df_listings_extended = pd.read_csv(listings_extended_path)
pd.set_option('display.max_columns', len(df_listings_extended.columns)) # To view all columns
pd.set_option('display.max_rows', 100)
print("\nListings Extended Shape:", df_listings_extended.shape)
display(df_listings_extended.head(5))


# Find common columns between the two datasets (excluding 'id' for uniqueness check)
common_cols = df_listings_summary.columns.intersection(df_listings_extended.columns).drop('id')
print("\nCommon columns:", list(common_cols))

# Identify unique columns in each dataset
unique_extended = df_listings_extended.columns.difference(common_cols)
unique_summary = df_listings_summary.columns.difference(common_cols)

print("\nUnique columns in listings_extended:", list(unique_extended))
print("\nUnique columns in listings_summary:", list(unique_summary))


Listings Summary Shape: (23705, 18)


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,6400,The Studio Milan,13822,Francesca,,TIBALDI,45.44119,9.17813,Private room,100.0,4,10,2019-04-13,0.06,1,358,0,
1,23986,""" Characteristic Milanese flat""",95941,Jeremy,,NAVIGLI,45.44806,9.17373,Entire home/apt,180.0,1,27,2024-04-20,0.18,1,362,1,
2,40470,Giacinto Cosy & clean flat near MM1,174203,Giacinto,,VIALE MONZA,45.52023,9.22747,Entire home/apt,80.0,3,43,2024-08-17,0.26,2,183,2,
3,46536,Nico & Cinzia's Pink Suite!,138683,Nico&Cinzia,,VIALE MONZA,45.52276,9.22478,Entire home/apt,110.0,3,37,2024-06-22,0.24,1,8,5,
4,59226,Near Piazza Gae Aulenti silent e reserved flat,244087,Francesca,,CENTRALE,45.48201,9.19809,Entire home/apt,180.0,3,9,2022-06-13,0.05,1,177,0,



Listings Extended Shape: (23705, 75)


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,6400,https://www.airbnb.com/rooms/6400,20240917031337,2024-09-17,city scrape,The Studio Milan,"Enjoy your stay at The Studio, a light-filled ...",The neighborhood is quiet and very well connec...,https://a0.muscache.com/pictures/474737/5955ba...,13822,https://www.airbnb.com/users/show/13822,Francesca,2009-04-17,"Milan, Italy","I'm am Francesca Sottilaro, i live in Milan an...",,,0%,f,https://a0.muscache.com/im/users/13822/profile...,https://a0.muscache.com/im/users/13822/profile...,Zona 5,1,2,"['email', 'phone']",t,f,"Milan, Lombardy, Italy",TIBALDI,,45.44119,9.17813,Private room in rental unit,Private room,1,3.5,3.5 baths,3.0,1.0,"[""First aid kit"", ""Hangers"", ""Wifi"", ""Elevator...",$100.00,4,5,4,4,5,5,4.0,5.0,,t,23,53,83,358,2024-09-17,10,0,0,2010-04-19,2019-04-13,4.89,5.0,5.0,5.0,5.0,4.56,4.67,,f,1,0,1,0,0.06
1,23986,https://www.airbnb.com/rooms/23986,20240917031337,2024-09-17,city scrape,""" Characteristic Milanese flat""",I look forward to welcoming you in my flat; it...,,https://a0.muscache.com/pictures/623d63f8-56cf...,95941,https://www.airbnb.com/users/show/95941,Jeremy,2010-03-19,"Milan, Italy","Hallo , I'm Jeremy Hayne I live in Milan and I...",within an hour,100%,20%,f,https://a0.muscache.com/im/users/95941/profile...,https://a0.muscache.com/im/users/95941/profile...,Navigli,1,1,['email'],t,t,,NAVIGLI,,45.44806,9.17373,Entire rental unit,Entire home/apt,4,1.0,1 bath,1.0,1.0,"[""Hot water"", ""Hangers"", ""Fast wifi \u2013 92 ...",$180.00,1,730,1,1,730,730,1.0,730.0,,t,28,57,87,362,2024-09-17,27,1,0,2012-04-24,2024-04-20,4.65,4.67,4.22,4.59,4.74,4.7,4.48,,f,1,1,0,0,0.18
2,40470,https://www.airbnb.com/rooms/40470,20240917031337,2024-09-17,city scrape,Giacinto Cosy & clean flat near MM1,,,https://a0.muscache.com/pictures/891684/01c17b...,174203,https://www.airbnb.com/users/show/174203,Giacinto,2010-07-20,"Milan, Italy","Ciao sono Giacinto, amo i viaggi fatti con la ...",within an hour,100%,38%,f,https://a0.muscache.com/im/users/174203/profil...,https://a0.muscache.com/im/users/174203/profil...,Zona 2,2,2,"['email', 'phone']",t,t,,VIALE MONZA,,45.52023,9.22747,Entire rental unit,Entire home/apt,4,1.0,1 bath,2.0,4.0,"[""First aid kit"", ""Hot water"", ""Air conditioni...",$80.00,3,90,3,3,90,90,3.0,90.0,,t,21,21,21,183,2024-09-17,43,2,0,2010-12-20,2024-08-17,4.66,4.71,4.83,4.98,4.88,4.44,4.51,,f,2,2,0,0,0.26
3,46536,https://www.airbnb.com/rooms/46536,20240917031337,2024-09-17,city scrape,Nico & Cinzia's Pink Suite!,Over the international fair in April we rent f...,"Flat It's located in north side of milan, jus...",https://a0.muscache.com/pictures/4eb8e0f5-e17b...,138683,https://www.airbnb.com/users/show/138683,Nico&Cinzia,2010-06-05,"Milan, Italy","Hi, we are Nico and Cinzia.. do you like a eas...",,,75%,f,https://a0.muscache.com/im/users/138683/profil...,https://a0.muscache.com/im/users/138683/profil...,Zona 2,1,3,"['email', 'phone']",t,t,"Milan, Lombardy, Italy",VIALE MONZA,,45.52276,9.22478,Entire rental unit,Entire home/apt,5,1.0,1 bath,2.0,3.0,"[""Hot water"", ""AC - split type ductless system...",$110.00,3,730,3,3,730,730,3.0,730.0,,t,8,8,8,8,2024-09-17,37,5,0,2011-12-05,2024-06-22,4.53,4.61,4.64,4.83,4.92,4.33,4.58,,f,1,1,0,0,0.24
4,59226,https://www.airbnb.com/rooms/59226,20240917031337,2024-09-18,city scrape,Near Piazza Gae Aulenti silent e reserved flat,The apartment (45mq) is at the 3 floor of a re...,New distric and Historical area at the same time.,https://a0.muscache.com/pictures/3381507/de21e...,244087,https://www.airbnb.com/users/show/244087,Francesca,2010-09-24,"Milan, Italy","Comfortable and very quiet "" ringhiera "" flat ...",,,0%,f,https://a0.muscache.com/im/users/244087/profil...,https://a0.muscache.com/im/users/244087/profil...,Centro Direzionale,1,3,"['email', 'phone']",t,t,"Milan, Lombardy, Italy",CENTRALE,,45.48201,9.19809,Entire rental unit,Entire home/apt,2,1.0,1 bath,1.0,1.0,"[""Hot water"", ""Bed linens"", ""Dishes and silver...",$180.00,3,10,3,3,10,10,3.0,10.0,,t,27,57,87,177,2024-09-18,9,0,0,2011-02-03,2022-06-13,4.0,4.33,3.89,4.89,4.89,4.44,4.44,,f,1,1,0,0,0.05



Common columns: ['name', 'host_id', 'host_name', 'neighbourhood', 'latitude', 'longitude', 'room_type', 'price', 'minimum_nights', 'number_of_reviews', 'last_review', 'reviews_per_month', 'calculated_host_listings_count', 'availability_365', 'number_of_reviews_ltm', 'license']

Unique columns in listings_extended: ['accommodates', 'amenities', 'availability_30', 'availability_60', 'availability_90', 'bathrooms', 'bathrooms_text', 'bedrooms', 'beds', 'calculated_host_listings_count_entire_homes', 'calculated_host_listings_count_private_rooms', 'calculated_host_listings_count_shared_rooms', 'calendar_last_scraped', 'calendar_updated', 'description', 'first_review', 'has_availability', 'host_about', 'host_acceptance_rate', 'host_has_profile_pic', 'host_identity_verified', 'host_is_superhost', 'host_listings_count', 'host_location', 'host_neighbourhood', 'host_picture_url', 'host_response_rate', 'host_response_time', 'host_since', 'host_thumbnail_url', 'host_total_listings_count', 'host_u

### Analyze Column Overlap
Apart from the key column `id`, which the code above removes from the common columns and which therefore shows up in both "unique" lists, the only column present in `listings_summary` but not in `listings_extended` is `neighbourhood_group`. I suspect that this column consists entirely of null values, so I will check it before deciding whether `listings_summary` adds anything to `listings_extended`.

In [4]:
print(df_listings_summary["neighbourhood_group"].value_counts(dropna=False))


neighbourhood_group
NaN    23705
Name: count, dtype: int64


Okay, it looks like all values are null. Let's verify this further.

In [5]:
print(df_listings_summary["neighbourhood_group"].notna().any())


False


Since all values are null, we can conclude that `listings_summary.csv` is essentially a subset of `listings_extended.csv`. Therefore, we will retain only `listings_extended.csv` for further analysis.
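The comparison above looks at column names only. A minimal sketch of a value-level check follows, using small toy frames with hypothetical values in place of the real files; the helper name `mismatching_columns` is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-ins for the two listings files (hypothetical values).
summary = pd.DataFrame({
    "id": [1, 2, 3],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt"],
    "minimum_nights": [4, 1, 3],
})
extended = pd.DataFrame({
    "id": [1, 2, 3],
    "room_type": ["Private room", "Entire home/apt", "Entire home/apt"],
    "minimum_nights": [4, 1, 3],
    "accommodates": [1, 4, 4],
})


def mismatching_columns(small, big, key="id"):
    """Return the shared columns whose values differ between the two frames."""
    shared = small.columns.intersection(big.columns).drop(key)
    left = small.set_index(key).sort_index()
    right = big.set_index(key).sort_index()
    # Series.equals treats NaN in the same position as equal,
    # but it also requires identical dtypes.
    return [col for col in shared if not left[col].equals(right[col])]


mismatches = mismatching_columns(summary, extended)
print("Columns with differing values:", mismatches)
```

On the real files, `price` would be expected to show up as a mismatch, since the previews above show `100.0` in the summary file and `$100.00` in the extended file.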

## Load Calendar Data


In [6]:
# Load and process calendar data
calendar_path = project_root / "data" / "raw" / "calendar.csv"
df_calendar = pd.read_csv(calendar_path)

# Basic preprocessing
df_calendar['date'] = pd.to_datetime(df_calendar['date'])  # Convert date column to datetime format
# Convert 'available' column: 't' (true) → 1 (available), anything else, including 'f' (false), → 0 (not available)
df_calendar['available'] = (df_calendar['available'] == 't').astype(int)

print("Calendar data shape:", df_calendar.shape)
df_calendar.head()





Calendar data shape: (8652209, 7)


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,6400,2024-09-17,0,$100.00,,4.0,5.0
1,6400,2024-09-18,0,$100.00,,4.0,5.0
2,6400,2024-09-19,0,$100.00,,4.0,5.0
3,6400,2024-09-20,0,$100.00,,4.0,5.0
4,6400,2024-09-21,0,$100.00,,4.0,5.0
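The conversion of `available` above maps every value other than `'t'` to 0, which would also hide missing values. The sketch below, on a toy frame with hypothetical rows, contrasts that behavior with `Series.map`, which leaves missing or unexpected values as NaN so they can be counted first.

```python
import pandas as pd

# Toy calendar rows (hypothetical); None mimics a missing 'available' value.
calendar = pd.DataFrame({
    "listing_id": [6400, 6400, 6400],
    "available": ["t", "f", None],
})

# Same behavior as the conversion above: anything that is not 't' becomes 0.
as_flag = calendar["available"].eq("t").astype(int)

# Series.map with a dict leaves unmapped or missing values as NaN instead,
# so they stay visible until we decide how to treat them.
as_map = calendar["available"].map({"t": 1, "f": 0})

print(as_flag.tolist())
print(int(as_map.isna().sum()), "missing value(s) kept visible")
```

Whether the real `calendar.csv` contains any missing `available` values is not established here; the sketch only shows how to find out.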


## Load Neighborhood Data

We have two neighborhood datasets:

1. **`neighbourhoods.csv`**  
   - A structured **tabular dataset** that typically contains neighborhood names and their corresponding **neighbourhood_group** (if available).  
   - Useful for **categorical** analysis, grouping, and merging with Airbnb listing data.  

2. **`neighbourhoods.geojson`**  
   - A **geospatial dataset** that contains the **geographical boundaries** of each neighborhood, stored as polygons (MULTIPOLYGON format).  
   - Essential for **spatial analysis**, **visualization**, and **mapping Airbnb listings to specific neighborhoods**.  

We will load and inspect both datasets to determine their structure, missing values, and potential usage in further analysis.


In [7]:
# Load neighborhood CSV data
neighborhoods_path = project_root / "data" / "raw" / "neighbourhoods.csv"
df_neighborhoods = pd.read_csv(neighborhoods_path)
print("Neighborhoods shape:", df_neighborhoods.shape)
df_neighborhoods.head()



Neighborhoods shape: (88, 2)


Unnamed: 0,neighbourhood_group,neighbourhood
0,,ADRIANO
1,,AFFORI
2,,BAGGIO
3,,BANDE NERE
4,,BARONA


### Load GeoJSON Neighborhood Data

In [8]:
neighborhoods_geo_path = project_root / "data" / "raw" / "neighbourhoods.geojson"
gdf_neighborhoods = gpd.read_file(neighborhoods_geo_path)
print("GeoJSON neighborhoods shape:", gdf_neighborhoods.shape)

GeoJSON neighborhoods shape: (88, 3)


In [9]:
pd.set_option('display.max_columns', len(gdf_neighborhoods.columns)) # To view all columns
pd.set_option('display.max_rows', 100)
gdf_neighborhoods.head(7)

Unnamed: 0,neighbourhood,neighbourhood_group,geometry
0,SACCO,,"MULTIPOLYGON (((9.12195 45.51602, 9.12163 45.5..."
1,COMASINA,,"MULTIPOLYGON (((9.16887 45.52396, 9.16804 45.5..."
2,STEPHENSON,,"MULTIPOLYGON (((9.12932 45.50998, 9.12973 45.5..."
3,QT 8,,"MULTIPOLYGON (((9.14368 45.48474, 9.14338 45.4..."
4,ORTOMERCATO,,"MULTIPOLYGON (((9.23739 45.45588, 9.23731 45.4..."
5,MAGGIORE - MUSOCCO,,"MULTIPOLYGON (((9.13067 45.50471, 9.13062 45.5..."
6,PARCO LAMBRO - CIMIANO,,"MULTIPOLYGON (((9.2688 45.51034, 9.26878 45.50..."


In [10]:
# Get all unique values from the 'neighbourhood' column of the GeoJSON data
unique_neighborhoods = gdf_neighborhoods["neighbourhood"].unique()
print(f"Number of unique neighborhoods: {len(unique_neighborhoods)}")
print("\nAll unique neighborhoods:")
print(unique_neighborhoods)

Number of unique neighborhoods: 88

All unique neighborhoods:
['SACCO' 'COMASINA' 'STEPHENSON' 'QT 8' 'ORTOMERCATO' 'MAGGIORE - MUSOCCO'
 'PARCO LAMBRO - CIMIANO' 'GALLARATESE' 'S. SIRO' 'GHISOLFA' 'BAGGIO'
 'QUARTO CAGNINO' 'LORENTEGGIO' 'GIAMBELLINO' 'S. CRISTOFORO'
 'RONCHETTO SUL NAVIGLIO' 'TIBALDI' 'CASCINA TRIULZA - EXPO'
 'QUARTO OGGIARO' 'AFFORI' 'PADOVA' 'EX OM - MORIVIONE' 'ADRIANO' 'FARINI'
 'MUGGIANO' 'UMBRIA - MOLISE' 'TRIULZO SUPERIORE' 'CORSICA' "CITTA' STUDI"
 'SELINUNTE' "PARCO MONLUE' - PONTE LAMBRO" 'PORTELLO'
 "NIGUARDA - CA' GRANDA" 'STADERA' 'GUASTALLA' 'BRERA' 'DUOMO'
 'SCALO ROMANA' 'MAGENTA - S. VITTORE' 'BOVISASCA' 'LODI - CORVETTO'
 'LAMBRATE' 'BARONA' 'BRUZZANO' 'TRENNO' 'GRATOSOGLIO - TICINELLO'
 'FIGINO' 'QUINTO ROMANO' 'PARCO NORD' 'TRE TORRI' 'PARCO AGRICOLO SUD'
 'VILLAPIZZONE' 'BOVISA' 'DERGANO' 'BICOCCA' 'PARCO SEMPIONE'
 'GIARDINI PORTA VENEZIA' 'TORTONA' 'NAVIGLI' 'XXII MARZO'
 'BUENOS AIRES - VENEZIA' 'QUINTOSOLE' 'RONCHETTO DELLE RANE'
 'CHIARAVAL

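The column of interest in `gdf_neighborhoods` is `neighbourhood`. Its counterpart in `df_listings_extended` is `neighbourhood_cleansed` (the raw `neighbourhood` column there holds free-text values such as "Milan, Lombardy, Italy"), so we inspect that column next as the candidate key for combining the two dataframes.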


In [11]:
# Get all unique values from neighbourhood_cleansed column
unique_neighborhoods = df_listings_extended["neighbourhood_cleansed"].unique()
print(f"Number of unique neighborhoods: {len(unique_neighborhoods)}")
print("\nAll unique neighborhoods:")
print(unique_neighborhoods)

Number of unique neighborhoods: 88

All unique neighborhoods:
['TIBALDI' 'NAVIGLI' 'VIALE MONZA' 'CENTRALE' 'ISOLA' 'WASHINGTON'
 'PORTA ROMANA' 'PARCO FORLANINI - ORTICA' 'VILLAPIZZONE'
 "NIGUARDA - CA' GRANDA" 'DUOMO' 'SARPI' 'PADOVA' 'GIAMBELLINO'
 "CITTA' STUDI" 'XXII MARZO' 'STADERA' 'S. CRISTOFORO'
 'BUENOS AIRES - VENEZIA' 'VIGENTINA' 'LORETO' 'TICINESE' 'TORTONA'
 'UMBRIA - MOLISE' 'BOVISASCA' 'PARCO LAMBRO - CIMIANO'
 'MAGENTA - S. VITTORE' 'BRERA' 'GRATOSOGLIO - TICINELLO'
 'GARIBALDI REPUBBLICA' 'BANDE NERE' 'RIPAMONTI' 'LODI - CORVETTO'
 'DE ANGELI - MONTE ROSA' 'GIARDINI PORTA VENEZIA' 'GALLARATESE'
 'RONCHETTO SUL NAVIGLIO' 'LAMBRATE' 'GHISOLFA' 'GUASTALLA' 'BOVISA'
 'QUINTO ROMANO' 'PAGANO' 'ADRIANO' 'LORENTEGGIO' 'CORSICA'
 'MACIACHINI - MAGGIOLINA' 'EX OM - MORIVIONE' 'DERGANO' 'FARINI' 'GRECO'
 'ORTOMERCATO' 'QUARTO CAGNINO' 'S. SIRO' 'ROGOREDO' 'PORTELLO' 'QT 8'
 'TRE TORRI' 'SELINUNTE' 'AFFORI' 'TRIULZO SUPERIORE' 'SCALO ROMANA'
 'BAGGIO' 'BARONA' "PARCO MONLUE' - P

To combine the two sources, we need to merge on `neighbourhood_cleansed` in `df_listings_extended`, which corresponds to `neighbourhood` in the GeoJSON data (`gdf_neighborhoods`).

The result should be a new, separate dataset for the analysis that contains all the listings data plus the GeoJSON polygon of each listing's neighbourhood, so that the original dataframes are left untouched.

Both columns contain 88 unique neighbourhood names. The next cell checks that the names match exactly, which will be useful later.

In [12]:
# Let's check for exact matches between neighborhood names in both datasets
# This is like checking that the keys fit the locks before we try to open doors

listings_neighborhoods = set(df_listings_extended['neighbourhood_cleansed'].dropna().unique())
geojson_neighborhoods = set(gdf_neighborhoods['neighbourhood'].dropna().unique())

# Find neighborhoods that exist in both datasets
matching_neighborhoods = listings_neighborhoods.intersection(geojson_neighborhoods)
print(f"Neighborhoods that match exactly: {len(matching_neighborhoods)}")

# Find neighborhoods that exist in listings but not in GeoJSON
listings_only = listings_neighborhoods - geojson_neighborhoods
if listings_only:
    print(f"Neighborhoods in listings but not in GeoJSON: {len(listings_only)}")
    print("Examples:", list(listings_only)[:5])

# Find neighborhoods that exist in GeoJSON but not in listings
geojson_only = geojson_neighborhoods - listings_neighborhoods  
if geojson_only:
    print(f"Neighborhoods in GeoJSON but not in listings: {len(geojson_only)}")
    print("Examples:", list(geojson_only)[:5])

Neighborhoods that match exactly: 88


In the listings data:

- `neighbourhood_cleansed` contains the neighbourhood name.
- `latitude` and `longitude` contain the exact coordinates of each listing.

The GeoJSON file instead stores, for each neighbourhood (quartiere), its boundary as a MULTIPOLYGON. These polygons are presumably what is needed to draw the neighbourhoods on a map.
The point of combining the two datasets is to go beyond summary statistics of the listing features and build map visualizations with Folium.
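A minimal sketch of how the two datasets could be combined into a new frame is shown below. It uses toy pandas frames with a few rows copied from the previews above; the `geometry` strings are placeholders (an assumption made so the sketch runs with pandas alone), and `listings_geo` is a hypothetical name.

```python
import pandas as pd

# Toy stand-ins. In the notebook, the right-hand frame would be
# gdf_neighborhoods and "geometry" would hold real MULTIPOLYGON objects;
# placeholder strings are used here.
listings = pd.DataFrame({
    "id": [6400, 23986, 40470],
    "neighbourhood_cleansed": ["TIBALDI", "NAVIGLI", "VIALE MONZA"],
    "latitude": [45.44119, 45.44806, 45.52023],
    "longitude": [9.17813, 9.17373, 9.22747],
})
neighborhoods = pd.DataFrame({
    "neighbourhood": ["TIBALDI", "NAVIGLI", "VIALE MONZA"],
    "geometry": ["MULTIPOLYGON (...)"] * 3,
})

# merge() returns a new DataFrame, so neither input is modified.
# validate="many_to_one" raises if a neighbourhood name is duplicated
# on the right, which would otherwise duplicate listings.
listings_geo = listings.merge(
    neighborhoods,
    left_on="neighbourhood_cleansed",
    right_on="neighbourhood",
    how="left",
    validate="many_to_one",
)

print(listings_geo[["id", "neighbourhood_cleansed", "geometry"]])
```

With the real data, calling `merge` on `gdf_neighborhoods` rather than on the listings frame should keep the result a GeoDataFrame, which is the direction the geopandas documentation recommends for attribute joins.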

## Load Review Data

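We have two review datasets:

1. **`reviews.csv`**: the full review table, including the review `comments`.

2. **`reviews_id_date.csv`**: a reduced table with only `listing_id` and `date`.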

In [13]:
reviews_path = project_root / "data" / "raw" / "reviews.csv"
reviews_data_path = project_root / "data" / "raw" / "reviews_id_date.csv"

df_reviews = pd.read_csv(reviews_path)
df_reviews_id_data = pd.read_csv(reviews_data_path)


df_reviews_id_data.head()

Unnamed: 0,listing_id,date
0,6400,2010-04-19
1,6400,2011-04-16
2,6400,2012-04-22
3,6400,2014-04-11
4,6400,2014-04-14


In [14]:
df_reviews.head()

Unnamed: 0,listing_id,...,comments
0,6400,...,I had such a great stay at 'the studio.' Fran...
1,6400,...,Staying at Francesca's and Alberto's place was...
2,6400,...,This is my second time staying with Francesca ...
3,6400,...,"Ein wunderbares Zimmer mit privatem Bad/ WC, a..."
4,6400,...,"I was lucky so I have stayed with Francesca, A..."


### Verifying if `reviews_id_date.csv` is a Subset of `reviews.csv`

To confirm whether `reviews_id_date.csv` is simply a reduced version of `reviews.csv` with only `listing_id` and `date`, we performed the following checks:

#### **1. Subset Verification using `merge()`**
We merged `reviews_id_date.csv` (left) with `reviews.csv` (right) on `listing_id` and `date`, using an **indicator column**.

The `indicator=True` argument in `merge()` adds a special `_merge` column that tracks where each row comes from. The possible values are:
- `both`: The row exists in both `reviews_id_date.csv` and `reviews.csv`
- `left_only`: The row exists only in `reviews_id_date.csv` (but not in `reviews.csv`)
- `right_only`: The row exists only in `reviews.csv` (but not in `reviews_id_date.csv`)

In [15]:
# Check if all rows in df_reviews_id_data exist in df_reviews
is_subset = df_reviews_id_data.merge(df_reviews, on=['listing_id', 'date'], how='left', indicator=True)

# Count unmatched rows
print(is_subset['_merge'].value_counts())


_merge
both          871596
left_only          0
right_only         0
Name: count, dtype: int64


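Since `left_only` is 0, every (`listing_id`, `date`) pair in `reviews_id_date.csv` is found in `reviews.csv`, confirming that it is a subset. Because this is a left merge, `right_only` is 0 by construction and says nothing about rows that exist only in `reviews.csv`; the shape comparison in the next cell covers that direction.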
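The `_merge` categories are easiest to see on a toy example. The sketch below uses two small hypothetical frames, `left` and `right`, and compares an outer merge with the left merge used above.

```python
import pandas as pd

# Hypothetical toy frames: one key only on the left, one only on the right.
left = pd.DataFrame({
    "listing_id": [1, 2],
    "date": ["2024-01-01", "2024-01-02"],
})
right = pd.DataFrame({
    "listing_id": [2, 3],
    "date": ["2024-01-02", "2024-01-03"],
    "comments": ["Great stay", "Very clean"],
})

# how="outer" keeps unmatched rows from both sides, so all three
# _merge categories can appear.
outer = left.merge(right, on=["listing_id", "date"], how="outer", indicator=True)
print(outer["_merge"].value_counts())

# how="left" drops rows found only on the right, so right_only stays at 0.
left_join = left.merge(right, on=["listing_id", "date"], how="left", indicator=True)
print(left_join["_merge"].value_counts())
```

An outer merge is therefore the variant to use when both directions of the comparison matter.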

In [16]:
print(f"Reviews ID-Date shape: {df_reviews_id_data.shape}")
print(f"Full Reviews shape: {df_reviews.shape}")


Reviews ID-Date shape: (868564, 2)
Full Reviews shape: (868564, 6)


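- Same number of rows: `reviews_id_date.csv` does not contain extra reviews.
- Fewer columns: `reviews_id_date.csv` is a reduced version with only `listing_id` and `date`.

Note that the merge above returned 871,596 rows, more than the 868,564 rows in either file. A left merge can only grow when a key matches more than one row on the right, so some (`listing_id`, `date`) pairs must occur more than once.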
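The gap between the 871,596 merged rows and the 868,564 rows in each file is consistent with repeated (`listing_id`, `date`) pairs. The sketch below reproduces the effect on hypothetical toy data and shows one possible stricter check that compares how often each key occurs in the two frames.

```python
import pandas as pd

# Hypothetical toy data: listing 1 received two reviews on the same date.
ids = pd.DataFrame({
    "listing_id": [1, 1, 2],
    "date": ["2024-01-01", "2024-01-01", "2024-01-05"],
})
full = ids.assign(comments=["Nice", "Lovely", "Fine"])

keys = ["listing_id", "date"]
merged = ids.merge(full, on=keys, how="left")
# Each of the 2 duplicate rows on the left matches both duplicates on the
# right, giving 2 * 2 = 4 rows for that key, plus 1 row for the unique key.
print(len(ids), "rows before merge,", len(merged), "rows after")

# Comparing per-key counts avoids the inflation: the two frames describe
# the same reviews if every key occurs equally often in both.
same_counts = ids.value_counts(keys).sort_index().equals(
    full.value_counts(keys).sort_index()
)
print("Identical key counts:", same_counts)
```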

## **Data Selection and Next Steps**  

After thoroughly reviewing the available datasets and carefully examining the **Inside Airbnb data dictionary** ([source](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit?gid=1322284596#gid=1322284596)), I have outlined the following approach for this analysis.  

---

### **Primary Dataset for Initial Analysis**  
- **`df_listings_extended` (from `listings_extended.csv`)**  
  This dataset provides essential features for the analysis, eliminating the need for immediate aggregation or merging of other datasets. Key attributes include:  

  - **Review Metrics**  
    - `number_of_reviews`, `number_of_reviews_ltm` (last 12 months), `number_of_reviews_l30d` (last 30 days).  
    - These fields allow for an assessment of listing popularity **without requiring additional aggregation** from `reviews.csv`.  
    - As a result, the review dataset will not be merged at this stage.  

  - **Availability Indicators**  
    - `availability_30`, `availability_60`, `availability_90`, `availability_365`.  
    - These fields represent future availability as determined by the **scraped calendar**, considering both **booked and blocked dates** over the next 30, 60, 90, and 365 days.  
    - Since these features already provide availability insights, **there is no immediate need to aggregate `calendar.csv`**.  

  - **Geocoded Neighborhood and Location Data**  
    - `neighbourhood_cleansed`: The neighborhood assigned based on latitude and longitude, aligned with public shapefiles.  
    - `neighbourhood_group_cleansed`: A higher-level grouping of neighborhoods, geocoded using public shapefiles.  
    - `latitude` and `longitude`: Exact geolocation of each listing.  
    - Given that these attributes are already included in `df_listings_extended`, I will not merge additional neighborhood datasets at this stage.  
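
For reference, the kind of aggregation that the precomputed review fields make unnecessary might look like the sketch below. The rows are hypothetical, and the 12-month window measured back from the scrape date is an assumption about how `number_of_reviews_ltm` is defined, not something confirmed by the data dictionary here.

```python
import pandas as pd

# Hypothetical toy reviews and an assumed scrape date.
reviews = pd.DataFrame({
    "listing_id": [6400, 6400, 6400, 23986],
    "date": pd.to_datetime(
        ["2019-04-13", "2024-03-01", "2024-09-01", "2024-04-20"]
    ),
})
scrape_date = pd.Timestamp("2024-09-17")

# Total number of reviews per listing.
number_of_reviews = reviews.groupby("listing_id").size()

# Reviews in the 12 months before the scrape date (assumed window).
recent = reviews[reviews["date"] > scrape_date - pd.DateOffset(months=12)]
number_of_reviews_ltm = (
    recent.groupby("listing_id").size()
    .reindex(number_of_reviews.index, fill_value=0)
)

print(pd.DataFrame({
    "number_of_reviews": number_of_reviews,
    "number_of_reviews_ltm": number_of_reviews_ltm,
}))
```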

---

### **Scope and Next Steps**  
For the initial phase of the project, I will focus on **non-spatial data** to address the core research questions.  
- **I will proceed with exploratory data analysis (EDA) on `df_listings_extended`**, examining price distributions, occupancy trends, and review metrics.  
- **I will not merge `df_reviews` or `df_calendar` at this stage**, as `df_listings_extended` already contains sufficient review and availability data.  
- **Geospatial data (`gdf_neighborhoods`) will be considered later** if spatial analysis proves relevant to the research objectives.  
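
Before price distributions can be examined, the `price` column needs to be numeric: the preview of `listings_extended.csv` shows it as strings such as `$100.00`. A minimal sketch of the conversion follows; the value with a thousands separator is a hypothetical example, since no such value appears in the preview.

```python
import pandas as pd

# Price strings in the format shown in the listings_extended preview.
# "$1,250.00" is a hypothetical value added to exercise the comma handling.
prices = pd.Series(["$100.00", "$180.00", "$1,250.00", None], name="price")

# Strip the currency symbol and thousands separators, then convert.
# errors="coerce" turns anything unparseable into NaN instead of raising.
price_num = pd.to_numeric(
    prices.str.replace(r"[$,]", "", regex=True),
    errors="coerce",
)
print(price_num.tolist())
```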

