# 🏨 Airbnb Hotel Booking Analysis (New York City)
### ✅ Cleaned & Enhanced Version (Google Colab Ready)

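This notebook explores Airbnb listings in New York City and extracts insights about property types, pricing, host behavior, and customer satisfaction. Column names are normalized to lowercase with underscores right after loading, and each analysis cell checks that its columns exist before plotting, so a missing or renamed column prints a warning instead of raising a KeyError.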

## 🪜 Step 1 — Import Libraries

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Modern seaborn theme (replaces deprecated 'seaborn' style)
sns.set_theme(style='whitegrid')

print("✅ Libraries imported successfully!")

## 🪜 Step 2 — Load & Clean Dataset

In [None]:
# In Colab, uncomment the next two lines to upload the dataset manually
# from google.colab import files
# uploaded = files.upload()

file_path = '/content/1730285881-Airbnb_Open_Data.xlsx'
df = pd.read_excel(file_path)

# Normalize column names (lowercase, underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')

print("✅ Dataset loaded successfully!")
print("Columns:", df.columns.tolist())

# Basic info
df.info()
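
The normalization step above can be wrapped in a small helper and tried on a toy frame. This is a minimal sketch: `normalize_columns` and the headers in `demo` are hypothetical, made up for illustration. Unlike the one-liner above, it also collapses runs of whitespace into a single underscore.

```python
import pandas as pd

def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase, underscore-separated column names."""
    out = frame.copy()
    out.columns = (
        out.columns.astype(str)
        .str.strip()                            # drop leading/trailing spaces
        .str.lower()
        .str.replace(r"\s+", "_", regex=True)   # any whitespace run -> one underscore
    )
    return out

# Toy frame with made-up headers, for illustration only
demo = pd.DataFrame(columns=["Host Name", " Room Type ", "availability  365"])
print(normalize_columns(demo).columns.tolist())
# → ['host_name', 'room_type', 'availability_365']
```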


## 🧹 Step 3 — Data Cleaning

In [None]:
# Report missing values per column (nothing is imputed or dropped here)
missing = df.isnull().sum()
print("Missing values per column:\n", missing[missing > 0])

# Drop duplicates
df.drop_duplicates(inplace=True)

# Convert numeric columns; values that cannot be parsed become NaN (errors='coerce')
for col in ['price', 'service_fee', 'number_of_reviews', 'review_rate_number', 'availability_365', 'construction_year']:
    if col in df.columns:
        df[col] = pd.to_numeric(df[col], errors='coerce')

print("✅ Data cleaned!")
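
One thing to watch with `errors='coerce'`: if `price` or `service_fee` arrive as text such as `$1,200`, `pd.to_numeric` turns every such value into NaN. Whether that happens depends on how the spreadsheet stores those columns, which this notebook does not show. The sketch below assumes they might be text; `to_number` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def to_number(series: pd.Series) -> pd.Series:
    """Strip '$', ',' and whitespace, then convert; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(r"[$,\s]", "", regex=True)
    return pd.to_numeric(cleaned, errors="coerce")

# Made-up sample mixing currency strings, a plain number, and missing values
demo_prices = pd.Series(["$1,200 ", "$85", 142, None, "n/a"])
result = to_number(demo_prices)
print(result.tolist())
```

If the columns are already numeric, the helper leaves the values unchanged, so it could replace the plain `pd.to_numeric` call for `price` and `service_fee` without harm.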

## ❓ Step 4 — Analysis Questions

### 1️⃣ Property Types (Bar Chart)

In [None]:
if 'room_type' in df.columns:
    plt.figure(figsize=(6,4))
    df['room_type'].value_counts().plot(kind='bar')
    plt.title('Property / Room Types Distribution')
    plt.ylabel('Count')
    plt.show()
else:
    print('⚠️ Column room_type not found.')

### 2️⃣ Neighborhood Group with Most Listings (Bar Chart)

In [None]:
if 'neighbourhood_group' in df.columns:
    plt.figure(figsize=(6,4))
    df['neighbourhood_group'].value_counts().plot(kind='bar', color='skyblue')
    plt.title('Listings by Neighborhood Group')
    plt.ylabel('Count')
    plt.show()
else:
    print('⚠️ Column neighbourhood_group not found.')

### 3️⃣ Highest Average Prices by Neighborhood Group (Bar Chart)

In [None]:
if 'neighbourhood_group' in df.columns and 'price' in df.columns:
    avg_price = df.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=False)
    plt.figure(figsize=(6,4))
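    # seaborn 0.13+ warns when palette is passed without hue, so map hue to the same variable
    sns.barplot(x=avg_price.index, y=avg_price.values, hue=avg_price.index, palette='viridis', legend=False)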
    plt.title('Average Price by Neighborhood Group')
    plt.ylabel('Average Price ($)')
    plt.show()
else:
    print('⚠️ Required columns missing.')

### 4️⃣ Construction Year vs Price (Scatter Plot)

In [None]:
if 'construction_year' in df.columns and 'price' in df.columns:
    plt.figure(figsize=(6,4))
    sns.scatterplot(x='construction_year', y='price', data=df, alpha=0.6)
    plt.title('Construction Year vs Price')
    plt.show()
else:
    print('⚠️ Columns construction_year or price not found.')

### 5️⃣ Top 10 Hosts by Listing Count (Horizontal Bar Chart)

In [None]:
if 'host_name' in df.columns and 'calculated_host_listings_count' in df.columns:
    # Note: this groups by name, so different hosts who share a name are merged
    top_hosts = (df.groupby('host_name')['calculated_host_listings_count']
                 .max().sort_values(ascending=False).head(10))
    plt.figure(figsize=(6,4))
    top_hosts.plot(kind='barh', color='teal')
    plt.title('Top 10 Hosts by Listing Count')
    plt.xlabel('Listings Count')
    plt.gca().invert_yaxis()
    plt.show()
else:
    print('⚠️ Required columns missing.')
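
Because host names are not unique, a ranking keyed on the name alone can merge unrelated hosts. A possible alternative, sketched here on toy data, is to group by a host identifier as well. This assumes the dataset has a `host_id` column after normalization, which is not confirmed in this notebook. It also counts rows per host rather than reading `calculated_host_listings_count`, so the numbers may differ from the chart above.

```python
import pandas as pd

# Toy data (made up): two different hosts share the name "Alex"
demo = pd.DataFrame({
    "host_id": [1, 1, 2, 3, 3, 3],
    "host_name": ["Alex", "Alex", "Alex", "Sam", "Sam", "Sam"],
})

# Count listings per (host_id, host_name) pair so same-named hosts stay separate
listings_per_host = (
    demo.groupby(["host_id", "host_name"])
        .size()
        .sort_values(ascending=False)
        .head(10)
)
print(listings_per_host)
```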

### 6️⃣ Verified Hosts vs Review Ratings (Box Plot)

In [None]:
if 'host_identity_verified' in df.columns and 'review_rate_number' in df.columns:
    plt.figure(figsize=(6,4))
    sns.boxplot(x='host_identity_verified', y='review_rate_number', data=df)
    plt.title('Verified Hosts vs Review Ratings')
    plt.show()
else:
    print('⚠️ Required columns missing.')
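
A box plot of a 1-to-5 rating can hide small differences between groups, so a numeric summary is a useful companion. The sketch below uses made-up toy values, not results from the dataset; only the column names come from the cell above.

```python
import pandas as pd

# Toy ratings, invented for illustration
demo = pd.DataFrame({
    "host_identity_verified": ["verified", "unconfirmed", "verified", "unconfirmed", "verified"],
    "review_rate_number": [4, 3, 5, 3, 3],
})

# Count, mean and median rating per verification status
summary = (
    demo.groupby("host_identity_verified")["review_rate_number"]
        .agg(["count", "mean", "median"])
)
print(summary)
```

Running the same `groupby` on `df` gives the figures needed to judge how large the difference between verified and unverified hosts actually is.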

### 7️⃣ Price vs Service Fee (Scatter Plot + Correlation)

In [None]:
if 'price' in df.columns and 'service_fee' in df.columns:
    plt.figure(figsize=(6,4))
    sns.scatterplot(x='service_fee', y='price', data=df, alpha=0.6)
    corr = df[['price','service_fee']].corr().iloc[0,1]
    plt.title(f'Price vs Service Fee (r = {corr:.2f})')
    plt.show()
else:
    print('⚠️ Columns price or service_fee not found.')
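
The `r` in the plot title is the Pearson correlation, which pandas computes only over rows where both values are present. The sketch below, on made-up numbers, reproduces it with NumPy after dropping incomplete rows, which makes the row count behind the coefficient explicit.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration; the last row has a missing price
demo = pd.DataFrame({
    "price": [100.0, 250.0, 400.0, np.nan],
    "service_fee": [20.0, 50.0, 80.0, 90.0],
})

# pandas: Pearson r over rows where both columns are present
r_pandas = demo[["price", "service_fee"]].corr().iloc[0, 1]

# Same number by hand: drop incomplete rows, then use NumPy
complete = demo.dropna(subset=["price", "service_fee"])
r_numpy = np.corrcoef(complete["price"], complete["service_fee"])[0, 1]

print(f"pandas r = {r_pandas:.2f}, numpy r = {r_numpy:.2f}, rows used = {len(complete)}")
```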

### 8️⃣ Avg Review by Neighborhood Group & Room Type (Grouped Bar Chart)

In [None]:
if 'neighbourhood_group' in df.columns and 'room_type' in df.columns and 'review_rate_number' in df.columns:
    avg_review = df.groupby(['neighbourhood_group', 'room_type'])['review_rate_number'].mean().unstack()
    avg_review.plot(kind='bar', figsize=(8,4))
    plt.title('Average Review Rating by Neighborhood & Room Type')
    plt.ylabel('Average Rating')
    plt.show()
else:
    print('⚠️ Required columns missing.')

### 9️⃣ Host Listings vs Availability (Scatter Plot)

In [None]:
if 'calculated_host_listings_count' in df.columns and 'availability_365' in df.columns:
    plt.figure(figsize=(6,4))
    sns.scatterplot(x='calculated_host_listings_count', y='availability_365', data=df, alpha=0.6)
    plt.title('Host Listings Count vs Availability (365 days)')
    plt.show()
else:
    print('⚠️ Required columns missing.')

### 🔥 Correlation Heatmap

In [None]:
plt.figure(figsize=(8,5))
corr = df.corr(numeric_only=True)
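# Fix the color scale to [-1, 1] so zero correlation maps to the neutral midpoint
sns.heatmap(corr, cmap='coolwarm', annot=True, fmt='.2f', vmin=-1, vmax=1)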
plt.title('Correlation Heatmap')
plt.show()
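
`df.corr(numeric_only=True)` includes every numeric column, so identifier-like columns can end up in the heatmap if they were read as numbers. One option, sketched below, is to restrict the matrix to the columns converted in Step 3. `analysis_corr` is a hypothetical helper, and the `id` column and all values in `demo` are made up.

```python
import pandas as pd

# Columns converted to numeric in Step 3; identifier-like columns are left out
ANALYSIS_COLS = ["price", "service_fee", "number_of_reviews",
                 "review_rate_number", "availability_365", "construction_year"]

def analysis_corr(frame: pd.DataFrame, wanted=ANALYSIS_COLS) -> pd.DataFrame:
    """Correlation matrix over the wanted columns that are actually present."""
    present = [c for c in wanted if c in frame.columns]
    return frame[present].corr(numeric_only=True)

# Toy frame with made-up values; 'id' stands in for an identifier column
demo = pd.DataFrame({
    "id": [1001, 1002, 1003, 1004],
    "price": [100, 200, 300, 400],
    "service_fee": [20, 40, 60, 80],
    "availability_365": [300, 10, 200, 50],
})
corr_demo = analysis_corr(demo)
print(corr_demo.round(2))
```

The result can be passed to `sns.heatmap` in place of `corr`.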

## ✅ Conclusion
- Manhattan tends to have the highest prices and the most listings (questions 2 and 3).
- Verified hosts usually receive slightly higher ratings (question 6).
- Price and service fee are positively correlated (question 7).
- Hosts with many listings do not necessarily have higher availability (question 9).
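
The first bullet combines two measures, listing count and average price, which are easier to compare side by side in one table than across two charts. The sketch below shows the aggregation on toy data. The numbers are invented and do not reflect the dataset; only the column names come from the notebook.

```python
import pandas as pd

# Toy listings, invented for illustration
demo = pd.DataFrame({
    "neighbourhood_group": ["Manhattan", "Manhattan", "Brooklyn", "Queens", "Queens", "Queens"],
    "price": [200, 300, 100, 150, 50, 70],
})

# Listing count and mean price per group, highest mean price first
summary = (
    demo.groupby("neighbourhood_group")["price"]
        .agg(listings="size", mean_price="mean")
        .sort_values("mean_price", ascending=False)
)
print(summary)
```

Applying the same aggregation to `df` shows whether the group with the most listings is also the one with the highest mean price.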