# Hackathon: From Raw Data to ML-Ready Dataset
## Insight-Driven EDA and End-to-End Feature Engineering on Airbnb Data Using pandas and Plotly

### What is a Hackathon?

A hackathon is a fast-paced, collaborative event where participants use data and technology to solve a real problem end-to-end.  
In this hackathon, you will work with a **real-world Airbnb dataset** and complete two interconnected goals:

- Produce a **high-quality exploratory data analysis (EDA)** using `pandas` and `plotly`, extracting meaningful insights, trends, and signals from the data.  
- Design and deliver a **clean, feature-rich, ML-ready dataset** that will serve as the foundation for a follow-up hackathon focused on building and evaluating machine learning models.

Your task is to **get the most out of the data**: uncover structure and patterns through EDA, and engineer informative features (numerical, categorical, temporal, textual (TF-IDF), and optionally image-based) to maximize the predictive power of the final dataset.

<div class="alert alert-success">
<b>About the Dataset</b>

<u>Context</u>

The data comes from <a href="https://insideairbnb.com/get-the-data/">Inside Airbnb</a>, an open project that publishes detailed, regularly updated datasets for cities around the world.  
Each city provides three main CSV files:

- <b>listings.csv</b>: property characteristics, host profiles, descriptions, amenities, etc.  
- <b>calendar.csv</b>: daily availability and pricing information for each listing.  
- <b>reviews.csv</b>: guest feedback and textual reviews.

These datasets offer a rich view of the short-term rental market, including availability patterns, pricing behavior, host attributes, and guest sentiment.  

<u>Inspiration</u>

Your ultimate objective is to create a dataset suitable for training a machine learning model that predicts whether a specific Airbnb listing will be <b>available on a given date</b>, using property attributes, review information, and host characteristics.
</div>

<div class="alert alert-info">
<b>Task</b>

Using one city of your choice from Inside Airbnb, create an end-to-end pipeline that:

1. Loads and explores the raw data (EDA).  
2. Engineers features (numerical, categorical, temporal, textual TF-IDF, etc.).  
3. Builds a unified ML-ready dataset.  

Please remember to add comments explaining your decisions. Comments help us understand your thought process and ensure accurate evaluation of your work. This assignment requires code-based solutions: **manually calculated or hard-coded results will not be accepted**. Thoughtful comments and visualizations are encouraged and will be highly valued.

- Write your solution directly in this notebook, modifying it as needed.
- Once completed, submit the notebook in **.ipynb** format via Moodle.
    
<b>Collaboration Requirement: Git & GitHub</b>

You must collaborate with your team using a **shared GitHub repository**.  
Your use of Git is part of the evaluation. We will specifically look at:

- Commit quality (clear messages, meaningful steps).  
- Balanced participation across team members.  
- Use of branches.  
- Ability to resolve merge conflicts appropriately.  
- A clean, readable project history that reflects real collaboration.

Good Git practice is **part of your grade**, not optional.
</div>
<div class="alert alert-danger">
    You are free to add as many cells as you wish, as long as you leave the first one untouched.
</div>

<div class="alert alert-warning">

<b>Hints</b>

- Text columns often carry substantial predictive power. Use text-vectorization methods to extract meaningful features from them.  
- Make sure all columns use appropriate data types (categorical, numeric, datetime, boolean). Correct dtypes help prevent subtle bugs and improve performance.  
- Feel free to enrich the dataset with any additional information you consider useful: engineered features, external data, derived temporal features, etc.  
- If the dataset is too large for your computer, use <code>.sample()</code> to work with a subset while preserving the logic of your pipeline.  
- Plotly offers a wide variety of powerful visualizations. Experiment creatively, but always begin with a clear analytical question: *What insight am I trying to uncover with this plot?*

</div>




<div class="alert alert-danger">
<b>Submission Deadline:</b> Wednesday, December 3rd, 12:00

Start with a simple, working pipeline.  
Do not over-complicate your code; refine the solution only if you have time.
</div>

<div class="alert alert-danger">
    
You may add as many cells as you want, but the **first cell must remain exactly as provided**. Do not edit, move, or delete it under any circumstances.
</div>


In [None]:
# LEAVE BLANK

### Team Information

Fill in the information below.  
All fields are **mandatory**.

- **GitHub Repository URL**: Paste the link to the team repo you will use for collaboration.
- **Team Members**: List all student names (and emails or IDs if required).

Do not modify the section title.  
Do not remove this cell.


In [None]:
# === Team Information (Mandatory) ===
# Fill in the fields below.

GITHUB_REPO = "https://github.com/pedromgfcresende/Hackathon_Python.git"       # e.g. "https://github.com/myteam/airbnb-hackathon"
TEAM_MEMBERS = [
    "Pedro Resende",
    "Wouter Louwman",
    "Sharath Raveendran",
    "Sara Saleem"
]

GITHUB_REPO, TEAM_MEMBERS


## Data Loading Function

**Purpose**: Loads the complete Airbnb Barcelona dataset, which consists of three CSV files:
- **`listings.csv`**: Core listing information (price, location, amenities, host details)
- **`calendar.csv`**: Daily availability and pricing data for each listing
- **`reviews.csv`**: Guest review text and metadata (review date, reviewer) for listings


In [1]:
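import pandas as pd

def load_data(folder_path):
    """Load the three Inside Airbnb CSV files found in `folder_path`."""
    listings = pd.read_csv(f"{folder_path}/listings.csv")  # one row per listing
    calendar = pd.read_csv(f"{folder_path}/calendar.csv")  # one row per listing per day
    reviews = pd.read_csv(f"{folder_path}/reviews.csv")    # one row per review

    return listings, calendar, reviews

# Local path on the author's machine; adjust it to wherever the CSV files live.
listings, calendar, reviews = load_data("/Users/pedroresende/Library/CloudStorage/OneDrive-UniversitatRamónLlull/ESADE MiBA/First Term/Python for Data Science/Hackathon/Data")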
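
A possible variant of the loader is sketched below. It assumes the three files sit in one folder, builds the paths with `pathlib`, and parses the `date` columns while reading. The name `load_data_typed` and the throwaway demo folder with toy files are illustrative only; they are not part of the original notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_data_typed(folder_path):
    """Hypothetical variant of load_data: parses date columns while reading."""
    folder = Path(folder_path)
    listings = pd.read_csv(folder / "listings.csv")
    calendar = pd.read_csv(folder / "calendar.csv", parse_dates=["date"])
    reviews = pd.read_csv(folder / "reviews.csv", parse_dates=["date"])
    return listings, calendar, reviews


# Demo on a temporary folder with toy files (not the real Inside Airbnb data).
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "listings.csv").write_text("id,name\n18674,Toy flat\n")
    (tmp / "calendar.csv").write_text("listing_id,date,available\n18674,2025-09-15,f\n")
    (tmp / "reviews.csv").write_text("listing_id,id,date,comments\n18674,1,2021-07-26,Nice\n")
    demo_listings, demo_calendar, demo_reviews = load_data_typed(tmp)

print(demo_calendar.dtypes)
```

Parsing dates at load time avoids a separate conversion step later, at the cost of a slightly slower read on the large calendar file.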

## Initial Data Exploration

### Comprehensive Data Inspection

This code block profiles the `listings` DataFrame in two steps: it first configures the display, then inspects the DataFrame's structure and missing values.

#### 1. **Display Configuration**

**Purpose**: Removes pandas' default row truncation (`display.max_rows` defaults to 60)  
**Effect**: Shows **all** rows when displaying DataFrames and Series  
**Why**: With 70+ columns, the per-column missing-value counts and percentages would otherwise be cut off in the middle of the output

#### 2. **DataFrame Anatomy Check**
| Command | Reveals |
|---------|---------|
| `display(listings.columns)` | **All 70+ column names** (critical for feature selection) |
| `display(listings.isna().sum())` | **Absolute missing counts per column** |
| `print(listings.isna().mean() * 100)` | **Missing % per column** (flags problematic features) |
| `display(listings.head())` | **First five rows of sample data** |
| `print(listings.shape)` | **Total rows × columns** |
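
The two missing-value commands in the table can also be combined into a single sorted view. The sketch below is one way to do that; the helper name `missing_summary` and the toy DataFrame are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Return missing counts and percentages per column, worst columns first."""
    summary = pd.DataFrame({
        "missing_count": df.isna().sum(),
        "missing_pct": df.isna().mean() * 100,
    })
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame standing in for `listings` (values are made up for illustration).
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "description": ["a", None, "c", "d"],
    "host_about": [None, None, "x", np.nan],
})
print(missing_summary(toy))
```

Sorting by percentage puts the most problematic columns at the top, which is easier to scan than two separate 79-row printouts.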


In [2]:
pd.set_option('display.max_rows', None) # This forces Pandas to show ALL rows

display(listings.columns)
display(listings.isna().sum())
print(listings.isna().mean() * 100)
display(listings.head())
print(listings.shape)

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'source', 'name',
       'description', 'neighborhood_overview', 'picture_url', 'host_id',
       'host_url', 'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'ca

id                                                  0
listing_url                                         0
scrape_id                                           0
last_scraped                                        0
source                                              0
name                                                0
description                                       737
neighborhood_overview                           10424
picture_url                                         0
host_id                                             0
host_url                                            0
host_name                                           5
host_since                                          5
host_location                                    4702
host_about                                       7170
host_response_time                               3126
host_response_rate                               3126
host_acceptance_rate                             2748
host_is_superhost           

id                                                0.000000
listing_url                                       0.000000
scrape_id                                         0.000000
last_scraped                                      0.000000
source                                            0.000000
name                                              0.000000
description                                       3.797012
neighborhood_overview                            53.704276
picture_url                                       0.000000
host_id                                           0.000000
host_url                                          0.000000
host_name                                         0.025760
host_since                                        0.025760
host_location                                    24.224626
host_about                                       36.939722
host_response_time                               16.105100
host_response_rate                               16.1051

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,18674,https://www.airbnb.com/rooms/18674,20250914152803,2025-09-15,city scrape,Huge flat for 8 people close to Sagrada Familia,110m2 apartment to rent in Barcelona. Located ...,Apartment in Barcelona located in the heart of...,https://a0.muscache.com/pictures/13031453/413c...,71615,...,4.62,4.82,4.32,ESFCTU000008058000039706000000000000000HUTB-00...,t,26,26,0,0,0.34
1,23197,https://www.airbnb.com/rooms/23197,20250914152803,2025-09-14,city scrape,"Forum CCIB DeLuxe, Spacious, Large Balcony, relax",Beautiful and Spacious Apartment with Large Te...,"Strategically located in the Parc del Fòrum, a...",https://a0.muscache.com/pictures/miso/Hosting-...,90417,...,4.99,4.66,4.68,ESFCTU000008106000547162000000000000000000HUTB...,f,1,1,0,0,0.52
2,32711,https://www.airbnb.com/rooms/32711,20250914152803,2025-09-15,city scrape,Sagrada Familia area - Còrsega 1,A lovely two bedroom apartment only 250 m from...,What's nearby <br />This apartment is located...,https://a0.muscache.com/pictures/357b25e4-f414...,135703,...,4.89,4.89,4.47,HUTB-001722,f,2,2,0,0,0.88
3,34241,https://www.airbnb.com/rooms/34241,20250914152803,2025-09-15,city scrape,Stylish Top Floor Apartment - Ramblas Plaza Real,Located in close proximity to Plaza Real and L...,,https://a0.muscache.com/pictures/2437facc-2fe7...,73163,...,4.68,4.73,4.23,Exempt,f,3,3,0,0,0.14
4,34981,https://www.airbnb.com/rooms/34981,20250914152803,2025-09-15,city scrape,VIDRE HOME PLAZA REAL on LAS RAMBLAS,Spacious apartment for large families or group...,"Located in Ciutat Vella in the Gothic Quarter,...",https://a0.muscache.com/pictures/c4d1723c-e479...,73163,...,4.72,4.65,4.46,ESFCTU000008119000093652000000000000000HUTB-00...,f,3,3,0,0,1.49


(19410, 79)


## Calendar Data Exploration

### Purpose
This block provides a **high-level overview** of the `calendar` DataFrame, which captures **daily availability and pricing details** for each Airbnb listing in Barcelona.

### Steps Explained

1. `display(calendar.columns)`  
   - **Shows all calendar column names.**
   - Typical columns include: `listing_id`, `date`, `available`, `price`, etc.
   - Helps identify which variables can be used for time-based analysis or merged with other datasets.

2. `display(calendar.isna().sum())`  
   - **Displays missing values count per column.**
   - Critical for knowing which columns need cleaning (e.g., price or availability).

3. `print(calendar.isna().mean() * 100)`  
   - **Displays missing value percentage per column.**
   - Flags features with high missingness (over 20% may require imputation or removal).

4. `display(calendar.head())`  
   - **Shows first few rows of calendar data.**
   - Reveals the structure, example dates, and actual content.

5. `print(calendar.shape)`  
   - **Prints number of rows and columns.**
   - In calendar data, the row count is usually very large:  
     (Listings × Days included in dataset).

### Why This Matters

This inspection is essential for:
- Understanding the coverage of the calendar data.
- Identifying columns with data quality issues.
- Informing time-series feature engineering or merges with listing details.
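
A minimal sketch of a coverage check follows, assuming the `listing_id` and `date` columns named in the steps above. The toy calendar is made up; on the real data the same three computations would run on `calendar` after converting `date` to datetime.

```python
import pandas as pd

# Toy calendar with the same column names as calendar.csv (values are made up).
toy_calendar = pd.DataFrame({
    "listing_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(
        ["2025-09-15", "2025-09-16", "2025-09-17", "2025-09-15", "2025-09-16"]
    ),
})

# Number of distinct days recorded for each listing.
days_per_listing = toy_calendar.groupby("listing_id")["date"].nunique()

# Overall date range covered by the file.
first_day, last_day = toy_calendar["date"].min(), toy_calendar["date"].max()

# If every listing had one row per day, rows would equal listings x days.
n_listings = toy_calendar["listing_id"].nunique()
n_days = toy_calendar["date"].nunique()
expected_rows = n_listings * n_days

print(days_per_listing)
print(first_day.date(), "to", last_day.date())
print(len(toy_calendar), "rows vs", expected_rows, "expected for full coverage")
```

A gap between the actual and expected row counts points to listings whose calendars cover fewer days than the rest.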

In [3]:
display(calendar.columns)
display(calendar.isna().sum())
print(calendar.isna().mean() * 100)
display(calendar.head())
print(calendar.shape)

Index(['listing_id', 'date', 'available', 'price', 'adjusted_price',
       'minimum_nights', 'maximum_nights'],
      dtype='object')

listing_id              0
date                    0
available               0
price             7084654
adjusted_price    7084654
minimum_nights          0
maximum_nights          0
dtype: int64

listing_id          0.0
date                0.0
available           0.0
price             100.0
adjusted_price    100.0
minimum_nights      0.0
maximum_nights      0.0
dtype: float64


Unnamed: 0,listing_id,date,available,price,adjusted_price,minimum_nights,maximum_nights
0,18674,2025-09-15,f,,,3,999
1,18674,2025-09-16,t,,,2,999
2,18674,2025-09-17,t,,,2,999
3,18674,2025-09-18,t,,,2,999
4,18674,2025-09-19,f,,,3,999


(7084654, 7)
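
The output above shows `price` and `adjusted_price` as 100% missing, `date` stored as text, and `available` encoded as `t`/`f`. One way to clean this up is sketched below on toy rows that copy the shape of the preview. The variable names are illustrative, and whether to drop the empty columns is a judgment call rather than something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Toy rows copying the shape of the calendar preview (not the full dataset).
toy_calendar = pd.DataFrame({
    "listing_id": [18674, 18674, 18674],
    "date": ["2025-09-15", "2025-09-16", "2025-09-17"],
    "available": ["f", "t", "t"],
    "price": [np.nan, np.nan, np.nan],
    "adjusted_price": [np.nan, np.nan, np.nan],
    "minimum_nights": [3, 2, 2],
})

# Drop columns that contain no values at all.
cleaned = toy_calendar.dropna(axis=1, how="all").copy()

# Convert date strings to datetime64 values.
cleaned["date"] = pd.to_datetime(cleaned["date"])

# Map 't'/'f' flags to True/False; any other value would become NaN.
cleaned["available"] = cleaned["available"].map({"t": True, "f": False})

print(cleaned.dtypes)
```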


## Reviews Data Exploration

### Purpose
This block summarizes the structure and quality of the `reviews` DataFrame, which holds *textual feedback* and *review metadata* for Airbnb listings. Reviews are essential for understanding guest satisfaction, host performance, and temporal feedback trends.

### Step-by-Step Breakdown

1. `display(reviews.columns)`  
   - **Lists all columns in the reviews data.**
   - Common columns: `listing_id`, `id` (review id), `date`, `reviewer_id`, `reviewer_name`, `comments`.
   - Helps plan which variables can be used for sentiment analysis, reviewer aggregation, or time-based features.

2. `display(reviews.isna().sum())`  
   - **Shows the number of missing values per column.**
   - Important for deciding if columns like `comments` or `reviewer_name` have gaps that could affect analysis.

3. `print(reviews.isna().mean() * 100)`  
   - **Shows the percentage of missing values in each column.**
   - Quickly identifies columns where missing data is a significant issue.

4. `display(reviews.head())`  
   - **Shows the first few rows of sample review data.**
   - Gives direct insight into review text and data structure (date format, reviewer info).

5. `print(reviews.shape)`  
   - **Shows the number of rows and columns.**
   - Indicates the scale of the text data and helps estimate resource needs for text mining or join operations.

### Why It's Important
Exploring these aspects lets you:
- Assess if review text is viable for feature extraction (e.g., NLP, sentiment analysis).
- Understand the timeframe and granularity of feedback.
- Check for issues that might require data cleaning before joining with listings.

This structured exploration is a standard first step in Airbnb review data analysis, ensuring any downstream review-based features are well-founded.
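
As an illustration of the review-based features mentioned above, the sketch below collapses reviews to one row per listing. It assumes the `listing_id`, `id`, `date`, and `comments` columns listed in the breakdown. The feature names (`n_reviews`, `mean_comment_length`, and so on) and the toy data are hypothetical choices, not the notebook's own.

```python
import pandas as pd

# Toy reviews with the same columns as reviews.csv (values are made up).
toy_reviews = pd.DataFrame({
    "listing_id": [703984, 703984, 703984, 18674],
    "id": [1, 2, 3, 4],
    "date": pd.to_datetime(["2021-07-26", "2021-08-05", "2021-08-14", "2022-01-10"]),
    "comments": ["Excelente lugar", "Very good host", None, "Great location"],
})

# Length of each comment in characters; missing comments count as 0.
toy_reviews["comment_length"] = toy_reviews["comments"].fillna("").str.len()

# One row per listing, ready to merge with listings on listings["id"].
review_features = (
    toy_reviews.groupby("listing_id")
    .agg(
        n_reviews=("id", "count"),
        first_review=("date", "min"),
        last_review=("date", "max"),
        mean_comment_length=("comment_length", "mean"),
    )
    .reset_index()
)
print(review_features)
```

Aggregating before the merge keeps the listing table at one row per listing instead of multiplying rows by the number of reviews.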


In [4]:
display(reviews.columns)
display(reviews.isna().sum())
print(reviews.isna().mean() * 100)
display(reviews.head())
print(reviews.shape)

Index(['listing_id', 'id', 'date', 'reviewer_id', 'reviewer_name', 'comments'], dtype='object')

listing_id         0
id                 0
date               0
reviewer_id        0
reviewer_name      4
comments         110
dtype: int64

listing_id       0.000000
id               0.000000
date             0.000000
reviewer_id      0.000000
reviewer_name    0.000392
comments         0.010792
dtype: float64


Unnamed: 0,listing_id,id,date,reviewer_id,reviewer_name,comments
0,703984,415003002495917725,2021-07-26,324403082,Gabriela,"Excelente lugar y buena ubicación, repetiría e..."
1,703984,422225979748637708,2021-08-05,208472604,Abdoulaye,Very good host and always ready to help
2,703984,428711187547685597,2021-08-14,75793287,Nikos,Excellent place to stay and great location. Re...
3,703984,435298891748897953,2021-08-23,207073569,Berta,"Easy and quick communication with the host, gr..."
4,703984,438894164765136324,2021-08-28,391402125,Ahmad,Beautiful apartment in a very great neighbourh...


(1019270, 6)
