
# Airbnb Market Analysis — Italy 🇮🇹
### by Luca Scarpantonio

## 1. Introduction
In this project, I explore Airbnb data for an Italian city using the **CRISP-DM** framework.
The goal is to understand what drives listing prices and how host, location, and reputation influence market trends.

**Dataset:** [Inside Airbnb](http://insideairbnb.com/get-the-data.html)

I'll address key business questions such as:
- How do prices vary across neighbourhoods and room types?
- How do reviews and host reputation affect pricing?
- Can we identify factors that predict listing price?
    

## 2. Gather — Data Collection

In [None]:

import pandas as pd

# Load dataset (replace with your city file)
df = pd.read_csv('listings.csv')

# Quick preview
df.head()
    

## 3. Assess — Data Quality Check

In [None]:

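from IPython.display import display

df.info()                # column dtypes and non-null counts
display(df.describe())   # only a cell's last expression renders automatically
df.isnull().sum().sort_values(ascending=False).head(10)  # columns with the most missing values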
    


- Identify missing values
- Detect invalid or extreme values (e.g., `price = 0` or `price > 10000`)
- Identify duplicates or outliers
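The checklist above can be turned into a small report. The following is a minimal sketch, not a definitive check: the helper name `price_quality_report` and the toy table are hypothetical, and the `low`/`high` bounds simply mirror the `price = 0` and `price > 10000` examples mentioned above. It assumes prices may still be strings such as `"$1,200.00"` at this stage.

```python
import pandas as pd


def price_quality_report(df: pd.DataFrame, price_col: str = "price",
                         low: float = 0.0, high: float = 10_000.0) -> dict:
    """Count common data-quality problems in a listings table (hypothetical helper)."""
    # Strip currency symbols and thousands separators, then coerce to numbers;
    # anything that still cannot be parsed becomes NaN.
    price = pd.to_numeric(
        df[price_col].astype(str).str.replace(r"[\$,€]", "", regex=True),
        errors="coerce",
    )
    return {
        "rows": len(df),
        "duplicate_rows": int(df.duplicated().sum()),
        "missing_price": int(price.isna().sum()),
        "price_at_or_below_low": int((price <= low).sum()),
        "price_above_high": int((price > high).sum()),
    }


# Toy data for illustration only (not taken from the real dataset)
sample = pd.DataFrame({
    "id": [1, 2, 2, 3, 4],
    "price": ["$80.00", "$0.00", "$0.00", "$12,500.00", None],
})
report = price_quality_report(sample)
print(report)
```

On the real data, `price_quality_report(df)` would give a quick count of how many rows each cleaning step is likely to affect.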
    

## 4. Clean — Data Wrangling

In [None]:

# Example cleaning steps
df = df.drop_duplicates()

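# Convert price strings such as "$1,200.00" to numeric (raw string avoids an invalid-escape warning)
df['price'] = df['price'].replace(r'[\$,€]', '', regex=True).astype(float)

# Filter unrealistic prices: drop zero-priced listings and those at or above 1000
df = df[(df['price'] > 0) & (df['price'] < 1000)]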

# Drop rows with missing key fields
df = df.dropna(subset=['price', 'room_type', 'neighbourhood'])
    

## 5. Analyze — Business Questions


### Questions:
1. What is the distribution of prices across room types?
2. How do prices vary by neighbourhood?
3. Is there a relationship between price and number of reviews?
4. How do ratings impact listing price?
5. What are the most common types of accommodation?
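Questions 1, 2, and 5 can mostly be answered with grouped summaries. The sketch below is one possible approach, assuming a cleaned table with numeric `price` plus `room_type` and `neighbourhood` columns; the helper `summarize_prices` and the toy rows (including the area names) are made up for illustration.

```python
import pandas as pd


def summarize_prices(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Listing count and median/mean price per group, most expensive first."""
    return (
        df.groupby(by)["price"]
          .agg(listings="count", median_price="median", mean_price="mean")
          .sort_values("median_price", ascending=False)
    )


# Toy data for illustration only (not taken from the real dataset)
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "neighbourhood": ["Area A", "Area A", "Area A", "Area B", "Area B"],
    "price": [120.0, 100.0, 60.0, 50.0, 25.0],
})

by_room = summarize_prices(toy, "room_type")      # question 1
by_area = summarize_prices(toy, "neighbourhood")  # question 2
room_share = toy["room_type"].value_counts(normalize=True)  # question 5

print(by_room)
print(by_area)
print(room_share)
```

The median is reported alongside the mean because listing prices tend to be right-skewed, so a few expensive listings can pull the mean upward.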
    

In [None]:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid", palette="muted")
plt.figure(figsize=(8,5))
sns.boxplot(data=df, x='room_type', y='price')
plt.title('Price Distribution by Room Type')
plt.show()
    

## 6. (Optional) Model — Price Prediction

In [None]:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Example model: predict price using reviews and rating
X = df[['number_of_reviews', 'review_scores_rating']].dropna()
y = df.loc[X.index, 'price']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("R²:", r2_score(y_test, y_pred))
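An R² value on its own can be hard to interpret, so it may help to compare the model against a trivial baseline. The sketch below is an optional addition rather than part of the original workflow: the helper `compare_to_baseline` is hypothetical, the data is synthetic, and it uses scikit-learn's `DummyRegressor` to always predict the median training price.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split


def compare_to_baseline(X, y, random_state=42):
    """Mean absolute error of a median baseline vs. a linear model (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
    linear = LinearRegression().fit(X_train, y_train)
    return {
        "baseline_mae": mean_absolute_error(y_test, baseline.predict(X_test)),
        "model_mae": mean_absolute_error(y_test, linear.predict(X_test)),
    }


# Synthetic data for illustration only: price depends linearly on the second feature
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 5, size=(200, 2))
y_demo = 50 + 20 * X_demo[:, 1] + rng.normal(0, 5, size=200)

scores = compare_to_baseline(X_demo, y_demo)
print(scores)
```

With the notebook's own `X` and `y`, `compare_to_baseline(X, y)` would show whether reviews and ratings improve on simply guessing the median price.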
    

## 7. Visualize — Key Insights


Use clear, polished plots to support your findings.

Examples:
- Price vs. Room Type
- Price vs. Review Count
- Price vs. Location
- Correlation Heatmap
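As one way to approach the "Price vs. Location" plot, the sketch below draws a horizontal bar chart of median price per neighbourhood. It assumes a cleaned table with numeric `price` and a `neighbourhood` column; the function name `plot_median_price_by_area`, the `top_n` parameter, and the toy rows are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_median_price_by_area(df, area_col="neighbourhood", top_n=10):
    """Horizontal bar chart of the top_n areas by median price; returns the Axes."""
    medians = (
        df.groupby(area_col)["price"]
          .median()
          .sort_values(ascending=False)
          .head(top_n)
    )
    fig, ax = plt.subplots(figsize=(8, 5))
    # Reverse the order so the most expensive area appears at the top
    ax.barh(medians.index[::-1], medians.values[::-1])
    ax.set_xlabel("Median price")
    ax.set_ylabel("Neighbourhood")
    ax.set_title(f"Top {len(medians)} neighbourhoods by median price")
    fig.tight_layout()
    return ax


# Toy data for illustration only (not taken from the real dataset)
toy_listings = pd.DataFrame({
    "neighbourhood": ["Area A", "Area A", "Area A", "Area B", "Area B"],
    "price": [120.0, 100.0, 60.0, 50.0, 25.0],
})
ax = plot_median_price_by_area(toy_listings)
```

In the notebook, calling `plot_median_price_by_area(df)` followed by `plt.show()` would render the chart for the real listings.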
    

In [None]:

corr = df[['price', 'number_of_reviews', 'review_scores_rating']].corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()
    


## 8. Conclusion

- Prices vary significantly across neighbourhoods.
- Entire apartments are priced higher than private rooms.
- Positive reviews correlate with slightly higher prices.
- Some listings contain inconsistent data and should be reviewed further.
    


## 9. Appendix — References & Acknowledgements

**Data Source:** [Inside Airbnb](http://insideairbnb.com/get-the-data.html)  
**Author:** Luca Scarpantonio  
**Project:** Introduction to Data Science Nanodegree (Udacity)
    