# Inside Airbnb Santiago — Exploratory Analysis

## 1. Project Overview
This project analyzes the Inside Airbnb dataset for Santiago, Chile, to better understand pricing, availability, and listing behavior in the city.

We’ll focus on three key tables:
- `calendar`: daily price and availability for listings
- `listings`: listing metadata (e.g., price, reviews, location)
- `reviews`: text and timestamps of reviews



## 2. Data Source
The dataset is downloaded from [Inside Airbnb](http://insideairbnb.com/get-the-data.html), specifically for Santiago.  
Download date: 07/06/2025

We are using the **"full" versions** of the listings and reviews tables (`listings.csv.gz` and `reviews.csv.gz`), along with the calendar table (`calendar.csv.gz`).



## 3. Overview of Tables

### 3.1 Calendar
Contains daily availability and pricing for each listing.


In [10]:
# Load the calendar CSV and inspect its first rows, dtypes, and shape
import pandas as pd
from IPython.display import display  # built in within Jupyter; imported so the cell also runs elsewhere

df_calendar = pd.read_csv('santiago/calendar.csv.gz', compression='gzip')
print("----Preview:")
display(df_calendar.head())
print("\n----Info:\n")
df_calendar.info()
print("\n----Shape:\n")
print(df_calendar.shape)

----Preview:


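| | listing_id | date | available | price | adjusted_price | minimum_nights | maximum_nights |
|---|---|---|---|---|---|---|---|
| 0 | 49392 | 2024-12-27 | f | \$55.00 | NaN | 3 | 730 |
| 1 | 49392 | 2024-12-28 | f | \$55.00 | NaN | 3 | 730 |
| 2 | 49392 | 2024-12-29 | t | \$55.00 | NaN | 3 | 730 |
| 3 | 49392 | 2024-12-30 | t | \$55.00 | NaN | 3 | 730 |
| 4 | 49392 | 2024-12-31 | t | \$55.00 | NaN | 3 | 730 |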



----Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5493615 entries, 0 to 5493614
Data columns (total 7 columns):
 #   Column          Dtype  
---  ------          -----  
 0   listing_id      int64  
 1   date            object 
 2   available       object 
 3   price           object 
 4   adjusted_price  float64
 5   minimum_nights  int64  
 6   maximum_nights  int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 293.4+ MB

----Shape:

(5493615, 7)
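
The `info()` output above shows `date`, `available`, and `price` stored as `object`. The following is a minimal sketch of how these columns might be converted, assuming prices look like `$55.00` (possibly with a `,` thousands separator) and availability uses `t`/`f` flags. The helper name `clean_calendar_types` is hypothetical; the actual cleaning is left to the calendar notebook.

```python
import pandas as pd


def clean_calendar_types(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of a calendar-like frame with typed columns (hypothetical helper)."""
    out = df.copy()
    # Parse ISO dates such as "2024-12-27"
    out["date"] = pd.to_datetime(out["date"], format="%Y-%m-%d")
    # Map the "t"/"f" flags to booleans; any other value becomes missing
    out["available"] = out["available"].map({"t": True, "f": False}).astype("boolean")
    # Strip "$" and "," (assumed thousands separator) before converting to a number
    out["price"] = pd.to_numeric(
        out["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    return out


# Small sample mirroring the preview above
sample = pd.DataFrame({
    "listing_id": [49392, 49392, 49392],
    "date": ["2024-12-27", "2024-12-28", "2024-12-29"],
    "available": ["f", "f", "t"],
    "price": ["$55.00", "$55.00", "$55.00"],
})
typed = clean_calendar_types(sample)
print(typed.dtypes)
```

On the real data this could be applied as `clean_calendar_types(df_calendar)`, keeping in mind that it copies a frame of about 5.5 million rows.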


### 3.2 Listings

The dataset provides two versions of the listings table: a summary version (`listings.csv`) and a full version (`listings.csv.gz`). This section compares both to understand the scope and level of detail offered by each.

The summary version includes a subset of key attributes for each listing, making it suitable for lightweight analysis or quick overviews. In contrast, the full version contains an extensive set of variables — including host details, review scores, policies, amenities, and more — which allows for deeper analysis and feature engineering.

In the following cells, we load both versions and inspect their structure:


In [19]:
# Define the file paths for the listings data
listings = {
    'listings_full': 'santiago/listings.csv.gz',
    'listings_summary': 'santiago/listings.csv'
}

# Read each listings file into a DataFrame, enabling gzip only for .gz files
df_listings = {
    name_listing: pd.read_csv(listing, compression='gzip' if listing.endswith('.gz') else None)
    for name_listing, listing in listings.items()
}


##### ➡️ Preview 

In [20]:
# Display the first few rows of each DataFrame
for name_listing, df in df_listings.items():
    print(f"----Preview of {name_listing}:")
    display(df.head())


----Preview of listings_full:


Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,49392,https://www.airbnb.com/rooms/49392,20241227033155,2024-12-27,city scrape,Share my Flat in Providencia,,,https://a0.muscache.com/pictures/3740612/b1850...,224592,...,,,,,f,1,0,1,0,
1,52811,https://www.airbnb.com/rooms/52811,20241227033155,2024-12-27,city scrape,Suite Providencia 1 Santiago Chile,Apartment located on the subway station Manuel...,Building located on the access to the Manuel M...,https://a0.muscache.com/pictures/miso/Hosting-...,244792,...,4.59,4.64,4.36,,t,3,3,0,0,0.26
2,53494,https://www.airbnb.com/rooms/53494,20241227033155,2024-12-27,city scrape,depto centro ski el colorado chile,,,https://a0.muscache.com/pictures/310936/ff7d53...,249097,...,4.88,4.79,4.69,,f,1,1,0,0,0.46
3,787045,https://www.airbnb.com/rooms/787045,20241227033155,2024-12-27,city scrape,right at home,"A few steps from metro station ""FERNANDO CASTI...","Metro Station "" FERNANDO CASTILLO"" (LINE 3)"" i...",https://a0.muscache.com/pictures/airflow/Hosti...,4134987,...,4.93,4.66,4.85,,f,2,0,2,0,1.01
4,795701,https://www.airbnb.com/rooms/795701,20241227033155,2024-12-27,city scrape,Lindo Depto 2 dormitorios,Nice and comfortable two-bedroom apartment. Fu...,"Centrally located by day works commercially, a...",https://a0.muscache.com/pictures/14703811/def2...,4191304,...,4.86,4.55,4.69,,f,2,2,0,0,0.2


----Preview of listings_summary:


Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,65058,Dpto amoblado centro historico,318016,Patricio,,Recoleta,-33.43049,-70.64079,Private room,,2,0,,,1,0,0,
1,73752,Barrio Lastarria,374124,Daniela&Ricardo,,Santiago,-33.43865,-70.64241,Private room,,3,0,,,1,0,0,
2,80482,Room Private for Woman,154527,Jacqueline,,La Florida,-33.51922,-70.59152,Private room,,2,0,,,1,0,0,
3,88944,COZY APT. PROVIDENCIA METRO WIFI TV,485358,Macarena,,Providencia,-33.42141,-70.60832,Entire home/apt,39849.0,3,223,2024-12-07,1.39,1,338,18,
4,90694,Apartment x Rent in Providencia 802,491253,Hector,,Providencia,-33.42629,-70.61866,Entire home/apt,49999.0,3,76,2024-09-01,0.46,2,364,5,
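
One way to make the comparison between the two versions concrete is to diff their column names. The sketch below assumes nothing beyond two DataFrames; the helper `compare_columns` is hypothetical, and the toy frames only borrow a few column names from the previews, so they do not reflect the real overlap.

```python
import pandas as pd


def compare_columns(full: pd.DataFrame, summary: pd.DataFrame) -> dict:
    """Split column names into shared, full-only, and summary-only lists."""
    full_cols, summary_cols = set(full.columns), set(summary.columns)
    return {
        "shared": sorted(full_cols & summary_cols),
        "full_only": sorted(full_cols - summary_cols),
        "summary_only": sorted(summary_cols - full_cols),
    }


# Toy frames (illustrative only) with a handful of column names from the previews
full = pd.DataFrame(columns=["id", "name", "host_id", "listing_url", "review_scores_value"])
summary = pd.DataFrame(columns=["id", "name", "host_id", "room_type", "price"])
result = compare_columns(full, summary)
print(result)
```

With the frames loaded above, the call would be `compare_columns(df_listings['listings_full'], df_listings['listings_summary'])`.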


##### ➡️ Column names and data types

In [21]:
# Display column names and data types for each DataFrame
for name_listing, df in df_listings.items():
    print(f"\n----Info for {name_listing}:\n")
    df.info()




----Info for listings_full:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15051 entries, 0 to 15050
Data columns (total 75 columns):
 #   Column                                        Non-Null Count  Dtype  
---  ------                                        --------------  -----  
 0   id                                            15051 non-null  int64  
 1   listing_url                                   15051 non-null  object 
 2   scrape_id                                     15051 non-null  int64  
 3   last_scraped                                  15051 non-null  object 
 4   source                                        15051 non-null  object 
 5   name                                          15051 non-null  object 
 6   description                                   14533 non-null  object 
 7   neighborhood_overview                         5588 non-null   object 
 8   picture_url                                   15051 non-null  object 
 9   host_id                        

##### 👉 Conclusion
For this project, we will use the full version of the listings table (`listings.csv.gz`, loaded as `listings_full`), as it provides a richer set of features necessary for meaningful analysis, such as detailed review metrics, host characteristics, and amenities.

### 3.3 Reviews

The dataset also includes two versions of the reviews table:

- `reviews.csv` (summary version)
- `reviews.csv.gz` (detailed version)

We will preview both to understand the difference in structure and content.

In [25]:
# Define the file paths for the reviews data
reviews = {
    'reviews_full': 'santiago/reviews.csv.gz',
    'reviews_summary': 'santiago/reviews.csv'
}

# Read each reviews file into a DataFrame, enabling gzip only for .gz files
df_reviews = {
    name_review: pd.read_csv(review, compression='gzip' if review.endswith('.gz') else None)
    for name_review, review in reviews.items()
}


##### ➡️ Preview 

In [26]:
# Display the first few rows of each DataFrame
for name_review, df in df_reviews.items():
    print(f"----Preview of {name_review}:")
    display(df.head())

----Preview of reviews_full:


| | listing_id | id | date | reviewer_id | reviewer_name | comments |
|---|---|---|---|---|---|---|
| 0 | 52811 | 138055 | 2010-11-13 | 18583 | John | Cristian's a great host, and the apartment was... |
| 1 | 52811 | 207757 | 2011-03-23 | 379370 | Gordon | Excellent accommodation and location. Couldn't... |
| 2 | 52811 | 877546 | 2012-01-23 | 555663 | Rasha | Cristian's apartment was a fantastic spot for ... |
| 3 | 52811 | 1069452 | 2012-04-01 | 1608237 | Melissa | Brilliant location right on top of Manuel Mont... |
| 4 | 52811 | 2052452 | 2012-08-21 | 2505878 | Greg | This is an amazing apartment, and a great loca... |


----Preview of reviews_summary:


| | listing_id | date |
|---|---|---|
| 0 | 88944 | 2011-10-21 |
| 1 | 88944 | 2012-01-29 |
| 2 | 88944 | 2012-04-03 |
| 3 | 88944 | 2012-06-03 |
| 4 | 88944 | 2012-07-06 |


##### ➡️ Column names and data types

In [27]:
# Display column names and data types for each DataFrame
for name_review, df in df_reviews.items():
    print(f"\n----Info for {name_review}:\n")
    df.info()


----Info for reviews_full:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 454372 entries, 0 to 454371
Data columns (total 6 columns):
 #   Column         Non-Null Count   Dtype 
---  ------         --------------   ----- 
 0   listing_id     454372 non-null  int64 
 1   id             454372 non-null  int64 
 2   date           454372 non-null  object
 3   reviewer_id    454372 non-null  int64 
 4   reviewer_name  454372 non-null  object
 5   comments       454351 non-null  object
dtypes: int64(3), object(3)
memory usage: 20.8+ MB

----Info for reviews_summary:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 454372 entries, 0 to 454371
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   listing_id  454372 non-null  int64 
 1   date        454372 non-null  object
dtypes: int64(1), object(1)
memory usage: 6.9+ MB


##### 👉 Conclusion
Both versions contain the same number of rows (454,372) but differ in the level of detail.
The summary version includes only `listing_id` and `date`, while the full version adds the review `id`, `reviewer_id`, `reviewer_name`, and, most importantly, the `comments` column.

For this analysis, we will use the full version (`reviews.csv.gz`, loaded as `reviews_full`), as the review text and user metadata are essential for analyzing user sentiment, identifying review trends, and enriching listing-level insights.
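
Equal row counts suggest, but do not prove, that the summary file is simply the full file restricted to `listing_id` and `date`. A minimal sketch of how that could be checked is shown below; the helper `is_projection` is hypothetical and the toy frames stand in for the real tables.

```python
import pandas as pd


def is_projection(full: pd.DataFrame, summary: pd.DataFrame, cols: list) -> bool:
    """True if `summary` holds the same rows as `full[cols]`, ignoring row order."""
    left = full[cols].sort_values(cols).reset_index(drop=True)
    right = summary[cols].sort_values(cols).reset_index(drop=True)
    return left.equals(right)


# Toy data (illustrative only): the same three reviews in a different row order
full = pd.DataFrame({
    "listing_id": [88944, 52811, 88944],
    "id": [3, 1, 2],
    "date": ["2012-01-29", "2010-11-13", "2011-10-21"],
    "comments": ["a", "b", "c"],
})
summary = pd.DataFrame({
    "listing_id": [88944, 88944, 52811],
    "date": ["2011-10-21", "2012-01-29", "2010-11-13"],
})

matches = is_projection(full, summary, ["listing_id", "date"])
mismatch = is_projection(full, summary.iloc[:2], ["listing_id", "date"])
print(matches, mismatch)
```

On the loaded data the call would be `is_projection(df_reviews['reviews_full'], df_reviews['reviews_summary'], ['listing_id', 'date'])`.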

### 3.4 Table Selection Summary

Based on the structure and content of the datasets, we will proceed with the following tables:

**calendar** — contains daily availability and pricing data

**listings_full** — includes complete listing metadata

**reviews_full** — provides the full text and timestamps of user reviews

## 4. Initial Observations
- The calendar table contains 5,493,615 rows and 7 columns, capturing daily availability and pricing information for each listing.

- The listings_full table includes 15,051 listings and 75 columns, spanning a broad set of attributes such as location, pricing, host characteristics, property type, and amenities.
A more detailed review will be conducted in the dedicated listings analysis to identify which features are most relevant and which can be excluded to streamline the dataset.

- The reviews_full table contains 454,372 reviews, including review text, reviewer metadata, and timestamps, which will be useful for exploring guest feedback and temporal trends.
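
The row counts above are consistent with one year of daily rows per listing, since 15,051 × 365 = 5,493,615. The sketch below shows how this could be verified per listing, along with a check that every calendar `listing_id` exists in the listings table. That `listing_id` refers to the listings `id` column is an assumption (the previews share the value 49392), and the helper `calendar_coverage` is hypothetical.

```python
import pandas as pd

# Arithmetic on the reported shapes: listings x days = calendar rows
assert 15_051 * 365 == 5_493_615


def calendar_coverage(calendar: pd.DataFrame, listings: pd.DataFrame) -> dict:
    """Summarize rows per listing and calendar ids missing from listings."""
    rows_per_listing = calendar.groupby("listing_id").size()
    known_ids = set(listings["id"].tolist())
    calendar_ids = set(calendar["listing_id"].tolist())
    return {
        "min_rows": int(rows_per_listing.min()),
        "max_rows": int(rows_per_listing.max()),
        "orphan_ids": sorted(calendar_ids - known_ids),
    }


# Toy data (illustrative only): listing 3 has fewer rows and no listings entry
calendar = pd.DataFrame({"listing_id": [1, 1, 1, 2, 2, 2, 3, 3]})
listings = pd.DataFrame({"id": [1, 2]})
coverage = calendar_coverage(calendar, listings)
print(coverage)
```

On the real data the call would be `calendar_coverage(df_calendar, df_listings['listings_full'])`; a result with `min_rows == max_rows == 365` and no orphan ids would confirm the pattern.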

## 5. Plan for Analysis

We’ll conduct deeper cleaning and analysis in separate notebooks:

- [`02_analysis_calendar.ipynb`](02_analysis_calendar.ipynb)

- [`03_analysis_listings.ipynb`](03_analysis_listings.ipynb)

- `04_analysis_reviews.ipynb` (to be created)

#### Each notebook will follow a structured process:

- **Data Cleaning:** Handle missing values, correct data types, and remove inconsistencies

- **Exploratory Data Analysis (EDA):** Understand distributions, relationships, and patterns

- **Feature Selection & Engineering:** Identify and prepare relevant features for analysis

- **Export:** Save the cleaned version as a new CSV and upload it to Google Cloud Storage
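
The export step could be sketched roughly as follows. This is an assumption about tooling, not the project's confirmed implementation: it uses the `google-cloud-storage` client library for the upload, and the function names, the `_clean.csv` suffix, and any bucket or object names are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_clean_csv(df: pd.DataFrame, out_dir: str, name: str) -> Path:
    """Write a cleaned frame to <out_dir>/<name>_clean.csv and return the path."""
    out_path = Path(out_dir) / f"{name}_clean.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(out_path, index=False)
    return out_path


def upload_to_gcs(local_path: Path, bucket_name: str, dest_name: str) -> None:
    """Upload a local file to a bucket (requires credentials; not run here)."""
    # Imported lazily so the export step works without the GCS client installed
    from google.cloud import storage

    client = storage.Client()
    client.bucket(bucket_name).blob(dest_name).upload_from_filename(str(local_path))


# Local round trip in a temporary directory; the upload is intentionally not called
with tempfile.TemporaryDirectory() as tmp:
    cleaned = pd.DataFrame({"listing_id": [49392], "price": [55.0]})
    path = export_clean_csv(cleaned, tmp, "calendar")
    exported_name = path.name
    round_trip = pd.read_csv(path)
print(exported_name)
```

Keeping the local export separate from the upload means each notebook can be run and checked without cloud credentials.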

These cleaned datasets will be used to populate BigQuery, which serves as the project's central data warehouse.

Once the three core tables are cleaned and available in BigQuery, we plan to:

- Build additional mart-style tables (e.g. aggregated fact tables) to support analysis

- Use Looker Studio for data visualization and dashboarding, leveraging its native integration with BigQuery
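
As an illustration of what an aggregated fact table might contain, here is a pandas prototype of a monthly calendar aggregate. It assumes a calendar frame that is already typed (datetime `date`, boolean `available`, numeric `price`); the function name and output columns are hypothetical, and the real mart would presumably be built in BigQuery rather than in pandas.

```python
import pandas as pd


def monthly_calendar_facts(calendar: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily calendar rows to one row per listing and month."""
    with_month = calendar.assign(month=calendar["date"].dt.to_period("M").astype(str))
    return with_month.groupby(["listing_id", "month"], as_index=False).agg(
        days=("date", "count"),                # number of calendar days in the month
        available_rate=("available", "mean"),  # share of days marked available
        mean_price=("price", "mean"),          # average listed price
    )


# Toy data (illustrative only): one listing spanning two months
calendar = pd.DataFrame({
    "listing_id": [1, 1, 1],
    "date": pd.to_datetime(["2024-12-30", "2024-12-31", "2025-01-01"]),
    "available": [True, False, True],
    "price": [50.0, 50.0, 60.0],
})
facts = monthly_calendar_facts(calendar)
print(facts)
```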

