# Airbnb Price Prediction – Data Cleaning

This notebook cleans and preprocesses the raw Airbnb listings dataset for Lisbon.
The objective is to prepare a high-quality dataset for exploratory analysis and
machine learning modeling.


## Imports

In [7]:
# Import pandas library for data manipulation and analysis
import pandas as pd
# Import numpy library for numerical operations
import numpy as np

## Load Raw Data

In [8]:
# Load the listings data from CSV file into a pandas DataFrame
df = pd.read_csv("../data/raw/listings.csv")
# Display the first 5 rows of the DataFrame to inspect the data
df.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365,number_of_reviews_ltm,license
0,6499,Belém 1 Bedroom Historical Apartment,14455,Bruno,Lisboa,Belm,38.6975,-9.19768,Entire home/apt,87.0,3,95,2025-09-03,0.71,1,283,19,
1,25659,Heart of Alfama - Le cœur d'Alfama (3 people),107347,Ellie,Lisboa,Santa Maria Maior,38.71241,-9.12706,Entire home/apt,92.0,2,223,2025-09-18,1.6,1,324,10,56539/AL.
2,29396,Alfama Hill - Boutique apartment,126415,Mónica,Lisboa,Santa Maria Maior,38.71156,-9.12987,Entire home/apt,77.0,3,433,2025-09-11,2.67,1,287,32,28737/AL
3,29720,TheHOUSE - Your luxury home,128075,Francisco,Lisboa,Estrela,38.71108,-9.15979,Entire home/apt,1250.0,2,161,2025-09-14,0.9,1,308,27,55695/AL
4,29915,Modern and Spacious Apartment in Lisboa,128890,Sara,Lisboa,Avenidas Novas,38.74571,-9.15264,Entire home/apt,96.0,6,61,2024-05-20,0.33,1,68,0,85851/AL.


## Select Relevant Features

In [9]:
# Define a list of columns to keep in the dataframe
columns = [
    "price",
    "neighbourhood",
    "room_type",
    "minimum_nights",
    "number_of_reviews",
    "reviews_per_month",
    "availability_365"
]

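# Keep only the specified columns.
# .copy() returns an independent DataFrame, so the column assignments in later
# cells cannot trigger pandas' SettingWithCopyWarning.
df = df[columns].copy()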
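
Selecting columns by name raises a `KeyError` if any of them is absent from the CSV. A minimal sketch of a pre-check follows, assuming a tiny made-up frame; `raw`, `wanted`, and `missing_cols` are hypothetical names that do not appear in the notebook.

```python
import pandas as pd

# Made-up stand-in for the raw listings frame (illustration only)
raw = pd.DataFrame({
    "id": [6499],
    "price": [87.0],
    "room_type": ["Entire home/apt"],
})

# Columns we would like to keep; "minimum_nights" is deliberately absent from raw
wanted = ["price", "room_type", "minimum_nights"]

# Report anything that is missing instead of failing with a KeyError
missing_cols = [c for c in wanted if c not in raw.columns]
present_cols = [c for c in wanted if c in raw.columns]

subset = raw[present_cols].copy()
print("missing columns:", missing_cols)
print("kept columns:", list(subset.columns))
```

Whether to stop or continue when a column is missing is a design choice; for a modeling pipeline, failing early is usually the safer option.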

## Clean Price Column

In [10]:
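# Make sure the price column is a float.
# The preview above shows price already parsed as a number, in which case the
# regex replacement is a no-op; it matters only if the CSV stores price as text.
# 1. Remove € symbols and thousands-separator commas from string values
# 2. Convert the result to float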
df["price"] = (
    df["price"]
    .replace("[€,]", "", regex=True)  # Remove € and comma characters
    .astype(float)                     # Convert to floating point numbers
)
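
To show what the conversion does when prices arrive as text, here is a minimal sketch on a few made-up values. It is a variant of the cell above, not the notebook's own code: `clean_price` is a hypothetical helper, and it uses `pd.to_numeric(..., errors="coerce")` so that unparseable entries become `NaN` instead of raising an error.

```python
import pandas as pd

def clean_price(series: pd.Series) -> pd.Series:
    """Strip € symbols and commas, then convert to float (NaN if unparseable)."""
    as_text = series.astype(str).str.replace(r"[€,]", "", regex=True).str.strip()
    return pd.to_numeric(as_text, errors="coerce")

# Made-up inputs: a string with a symbol, a string with a thousands
# separator, an already-numeric value, and a missing value
sample = pd.Series(["€87.00", "1,250.00", 96.0, None])
cleaned = clean_price(sample)
print(cleaned.tolist())
```

Coercing to `NaN` keeps the cell from failing on a single malformed row, at the cost of hiding bad input, so it is worth counting the resulting missing values afterwards.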

## Handle Missing Values

In [11]:
# Fill missing values in the 'reviews_per_month' column with 0
# This ensures that properties with no reviews don't have NaN values
df["reviews_per_month"] = df["reviews_per_month"].fillna(0)
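
Filling with 0 is only meaningful if the missing values really belong to listings without reviews. The sketch below checks that assumption on made-up rows before filling; the `number_of_reviews` column is part of the selected features, while `sample` and `mismatches` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
sample = pd.DataFrame({
    "number_of_reviews": [95, 0, 12, 0],
    "reviews_per_month": [0.71, np.nan, 0.33, np.nan],
})

missing_rate = sample["reviews_per_month"].isna()
no_reviews = sample["number_of_reviews"] == 0

# Rows where the rate is missing although the listing does have reviews
mismatches = int((missing_rate & ~no_reviews).sum())
print("missing rate but reviews present:", mismatches)

# Fill only after confirming the assumption holds
sample["reviews_per_month"] = sample["reviews_per_month"].fillna(0)
```

A non-zero mismatch count would suggest the gaps have another cause and that 0 may not be the right fill value.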

## Remove Outliers

In [12]:
# Keep only rows where the price is below 500 (same currency unit as the price column).
# Rows with a missing price are dropped as well, because NaN < 500 evaluates to False.
df = df[df["price"] < 500]
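
A fixed cut-off removes rows silently, so it can help to count what is dropped and why. This is a minimal sketch on made-up prices; `PRICE_CAP`, `n_missing`, and `n_at_or_above` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only
sample = pd.DataFrame({"price": [87.0, 1250.0, np.nan, 96.0, 500.0]})

PRICE_CAP = 500  # same threshold as the filter above

# Count the two reasons a row fails the "price < PRICE_CAP" test
n_missing = int(sample["price"].isna().sum())
n_at_or_above = int((sample["price"] >= PRICE_CAP).sum())

filtered = sample[sample["price"] < PRICE_CAP]

print(f"rows before: {len(sample)}, after: {len(filtered)}")
print(f"dropped for missing price: {n_missing}")
print(f"dropped for price >= {PRICE_CAP}: {n_at_or_above}")
```

Note that a price of exactly 500 is excluded by the strict `<` comparison.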

## Save Processed Data

In [13]:
# Export the cleaned DataFrame to a CSV file
# Excludes the DataFrame index from the output file
df.to_csv("../data/processed/clean_listings.csv", index=False)
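
A quick round trip can confirm that the file was written as intended. The sketch below writes a made-up frame to a temporary directory rather than to the project path, then reloads it and checks the columns and missing values; `expected_columns` and `out_path` are hypothetical names.

```python
import tempfile
from pathlib import Path

import pandas as pd

expected_columns = ["price", "neighbourhood", "room_type"]

# Made-up cleaned rows for illustration only
sample = pd.DataFrame({
    "price": [87.0, 96.0],
    "neighbourhood": ["Estrela", "Avenidas Novas"],
    "room_type": ["Entire home/apt", "Entire home/apt"],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "clean_listings.csv"
    sample.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# With index=False, no extra "Unnamed: 0" column should appear on reload
print(list(reloaded.columns))
print(int(reloaded.isna().sum().sum()), "missing values after reload")
```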

## Output

The cleaned dataset has been saved as `clean_listings.csv` and will be used
in subsequent exploratory analysis and modeling notebooks.
