# Airbnb Data Analysis and Machine Learning Tasks

## Setup
First, let's import the necessary libraries and load our data.

In [None]:
%pip install pandas numpy matplotlib seaborn scikit-learn requests

In [None]:
import os
import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder

# Download the dataset once and cache it next to the notebook
thePath = "./"  # Adjust this path as necessary
link = "https://dse200.dev/Day3/airbnb_sd_listings.csv"
file = "airbnb_sd_listings.csv"

if not os.path.exists(thePath + file):
    r = requests.get(link, timeout=30)
    r.raise_for_status()  # Fail loudly instead of saving an error page as the CSV
    with open(thePath + file, 'wb') as f:
        f.write(r.content)

# Load the data
data = pd.read_csv(thePath + file)

# Display the first few rows and data info
print(data.head())
data.info()  # info() prints its summary itself and returns None

## Basic Data Analysis and Visualization

a) Create a bar plot showing the average price for each room_type.

b) What is the most expensive room_type on average?

c) Create a scatter plot of price vs. number_of_reviews. What do you observe?

In [None]:
# Your code here


## Predicting Price with Linear Regression

We'll create a machine learning model to predict the price of a listing.

a) Prepare the data for machine learning:
   - Select relevant features (you decide which ones to use)
   - Handle any categorical variables
   - Split the data into training and testing sets

b) Create and train a linear regression model.

c) Evaluate the linear regression model using Mean Squared Error and R-squared score.
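
For reference, the two metrics in part (c) have the standard definitions below. The symbols are introduced here for illustration: $y_i$ is the actual price of test listing $i$, $\hat{y}_i$ is the predicted price, $\bar{y}$ is the mean actual price, and $n$ is the number of test listings.

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

MSE is expressed in squared price units, so lower is better. An $R^2$ of 1 means perfect predictions, and an $R^2$ of 0 means the model does no better than always predicting the mean price.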



In [None]:
# Select relevant features

In [None]:
# Linear Regression


## Predicting Price with Random Forest Regressor

We'll now train a second model for the same prediction task so the two can be compared.

a) Create and train a Random Forest Regressor.

b) Evaluate the model using Mean Squared Error and R-squared score.

In [None]:
# Your code here


## Comparing Models and Predicting Price with the Best Model

a) Which model performs better? Why do you think this is the case?

b) How did the speed of the models compare?

c) Use your best model to predict the price for a new listing with these features:
   - room_type: 'Entire home/apt'
   - minimum_nights: 2
   - number_of_reviews: 50
   - availability_365: 200
   - neighbourhood: 'Pacific Beach'

In [None]:
# Your code here
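
For part (b), one way to compare training speed is to wrap `fit` in a wall-clock timer. The sketch below is only an illustration on synthetic data: the helper name `time_fit` and the arrays `X_demo` and `y_demo` are hypothetical stand-ins for your own training split.

```python
import time

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression


def time_fit(model, X, y):
    """Return the wall-clock seconds taken by model.fit(X, y)."""
    start = time.perf_counter()
    model.fit(X, y)
    return time.perf_counter() - start


# Synthetic stand-in data; replace with your own X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=2000)

timings = {
    "LinearRegression": time_fit(LinearRegression(), X_demo, y_demo),
    "RandomForestRegressor": time_fit(
        RandomForestRegressor(n_estimators=100, random_state=0), X_demo, y_demo
    ),
}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.3f} s")
```

Timings vary between machines and runs, so repeat the measurement a few times before drawing conclusions.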

## Advanced Visualization and Comparison

a) Create a heatmap showing the correlation between numerical features in the dataset.

b) Using the Random Forest model you trained earlier, plot the feature importances. Which features are most important for predicting price?

c) Create a box plot showing the distribution of prices for each neighbourhood. What do you observe about price variations across neighbourhoods?

d) Based on your analysis, what recommendations would you give to someone looking to list their property on this platform to maximize their potential earnings?

In [None]:
# Your code here

## Feature Engineering: Creating Better Predictors

Feature engineering is the process of creating new features from existing data to improve model performance. Let's progressively build more sophisticated features from our Airbnb dataset.

### Why Feature Engineering Matters
Raw data rarely contains all the patterns models need. By creating new features, we can:
- Capture non-linear relationships
- Encode domain knowledge
- Create interaction effects between variables
- Extract signal from text and temporal data

Let's start simple and progressively get more advanced.

### 1. Temporal Features

**Justification:** The `last_review` date contains hidden information:
- How recent is the activity? Recent reviews suggest active, popular listings
- Seasonality patterns (month, quarter) affect pricing
- Inactive listings might be priced differently

These features capture time-based patterns that affect demand and pricing.

In [None]:
# Make a copy for feature engineering
data_fe = data.copy()

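# Convert to datetime (unparseable values become NaT)
data_fe['last_review'] = pd.to_datetime(data_fe['last_review'], errors='coerce')

# Measure recency against the newest review in the data rather than today's date:
# results stay reproducible, and an old snapshot does not make every listing look stale
reference_date = data_fe['last_review'].max()

# Extract temporal features
data_fe['days_since_last_review'] = (reference_date - data_fe['last_review']).dt.days
data_fe['last_review_year'] = data_fe['last_review'].dt.year
data_fe['last_review_month'] = data_fe['last_review'].dt.month
data_fe['last_review_quarter'] = data_fe['last_review'].dt.quarter
data_fe['is_recently_reviewed'] = (data_fe['days_since_last_review'] < 30).astype(int)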

print("Temporal features created:")
print(data_fe[['last_review', 'days_since_last_review', 'last_review_month', 
               'last_review_quarter', 'is_recently_reviewed']].head(10))
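
One optional refinement, not part of the features above: raw month numbers place December (12) and January (1) far apart even though they are adjacent in the calendar. A common workaround is a sine/cosine encoding. The sketch below assumes a month column like `last_review_month`; the helper name `add_cyclical_month` and the `_sin`/`_cos` column suffixes are hypothetical.

```python
import numpy as np
import pandas as pd


def add_cyclical_month(df, month_col="last_review_month"):
    """Return a copy of df with sine/cosine encodings of a 1-12 month column."""
    out = df.copy()
    # Map months 1..12 onto angles around a circle (January at angle 0)
    angle = 2 * np.pi * (out[month_col] - 1) / 12
    out[month_col + "_sin"] = np.sin(angle)
    out[month_col + "_cos"] = np.cos(angle)
    return out


# Toy data; a missing month stays missing in both encoded columns
toy_months = pd.DataFrame({"last_review_month": [1, 6, 12, np.nan]})
encoded_months = add_cyclical_month(toy_months)
print(encoded_months)
```

In the encoded space, January sits closer to December than to June, which matches the seasonal intuition. Tree-based models often cope with the raw month number, so treat this as something to try mainly for linear models.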

### 2. Geographic Features

**Justification:** Location is crucial for Airbnb pricing:
- Distance to downtown/attractions affects desirability
- Proximity to coast/beach in San Diego commands premium pricing
- Spatial clustering reveals neighborhood quality

Geographic features encode location value beyond just neighborhood names.

In [None]:
# Distance from San Diego downtown (32.7157° N, 117.1611° W)
from math import radians, sin, cos, sqrt, atan2

def haversine_distance(lat1, lon1, lat2=32.7157, lon2=-117.1611):
    """Calculate distance in km between two lat/lon points"""
    R = 6371  # Earth radius in km
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    return 2 * R * atan2(sqrt(a), sqrt(1-a))

data_fe['distance_to_downtown'] = data_fe.apply(
    lambda x: haversine_distance(x['latitude'], x['longitude']), axis=1
)

# Coastal proximity proxy: absolute longitude offset (in degrees, not km) from -117.25,
# used here as a rough stand-in for the longitude of the coastline
data_fe['distance_to_coast'] = abs(data_fe['longitude'] + 117.25)

print("Geographic features created:")
print(data_fe[['neighbourhood', 'latitude', 'longitude', 
               'distance_to_downtown', 'distance_to_coast']].head())
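
The row-wise `apply` above is fine for a dataset of this size. For larger tables, the same haversine formula can be applied to whole columns at once with NumPy. This is a sketch under the same assumptions as the function above (Earth radius 6371 km, downtown at 32.7157, -117.1611); the name `haversine_km` and the toy coordinates are illustrative only.

```python
import numpy as np
import pandas as pd

DOWNTOWN_LAT, DOWNTOWN_LON = 32.7157, -117.1611


def haversine_km(lat, lon, ref_lat=DOWNTOWN_LAT, ref_lon=DOWNTOWN_LON):
    """Vectorized great-circle distance in km from each (lat, lon) to a reference point."""
    earth_radius_km = 6371.0
    lat1 = np.radians(np.asarray(lat, dtype=float))
    lon1 = np.radians(np.asarray(lon, dtype=float))
    lat2, lon2 = np.radians(ref_lat), np.radians(ref_lon)
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * earth_radius_km * np.arcsin(np.sqrt(a))


# Toy rows: downtown itself, plus an arbitrary point to the northwest
toy_geo = pd.DataFrame(
    {"latitude": [32.7157, 32.7978], "longitude": [-117.1611, -117.2403]}
)
toy_geo["distance_to_downtown"] = haversine_km(toy_geo["latitude"], toy_geo["longitude"])
print(toy_geo)
```

The first row should come out at zero and the second at roughly 12 km, which is a quick sanity check that latitude and longitude were passed in the right order.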

### 3. Ratio and Interaction Features

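**Justification:** Ratios reveal efficiency and engagement:
- `review_rate`: Reviews per available day, a rough proxy for booking frequency (the `+ 1` in each denominator avoids division by zero)
- `reviews_per_listing`: Reviews relative to the size of the host's portfolio
- `price_per_min_night`: Normalizes price by the minimum stay requirement. It is computed from `price`, so leave it out of the inputs when `price` is the prediction target
- `is_professional_host`: Flags hosts with more than two listings, who may price differently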

These capture relationships between variables that models might miss.

In [None]:
# Review engagement metrics
data_fe['review_rate'] = data_fe['number_of_reviews'] / (data_fe['availability_365'] + 1)
data_fe['reviews_per_listing'] = data_fe['number_of_reviews'] / (data_fe['calculated_host_listings_count'] + 1)

# Price-related ratios
data_fe['price_per_min_night'] = data_fe['price'] / (data_fe['minimum_nights'] + 1)

# Host categorization
data_fe['is_professional_host'] = (data_fe['calculated_host_listings_count'] > 2).astype(int)

print("Ratio/interaction features created:")
print(data_fe[['number_of_reviews', 'availability_365', 'review_rate', 
               'reviews_per_listing', 'is_professional_host']].head())

### 4. Aggregated Neighborhood Features

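**Justification:** Neighborhood context matters:
- An individual listing's price means little without local market context
- Neighborhood statistics capture area desirability and market dynamics
- `price_vs_neighborhood`: Is this listing priced above or below the local average?

This is **target encoding**: using aggregate statistics of the target as features. Be careful of leakage! When `price` is the target, compute the neighborhood price statistics on the training split only, and keep `price_vs_neighborhood` for analysis rather than as a model input, since it is derived from each listing's own price.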

In [None]:
# Calculate neighborhood statistics
neighborhood_stats = data_fe.groupby('neighbourhood').agg({
    'price': ['mean', 'median', 'std'],
    'number_of_reviews': 'mean',
    'availability_365': 'mean'
}).reset_index()

# Flatten column names
neighborhood_stats.columns = ['neighbourhood', 'neighborhood_avg_price', 
                               'neighborhood_median_price', 'neighborhood_price_std',
                               'neighborhood_avg_reviews', 'neighborhood_avg_availability']

# Merge back to main dataset
data_fe = data_fe.merge(neighborhood_stats, on='neighbourhood', how='left')

# Relative pricing feature
data_fe['price_vs_neighborhood'] = data_fe['price'] / (data_fe['neighborhood_avg_price'] + 1)

print("Neighborhood aggregated features created:")
print(data_fe[['neighbourhood', 'price', 'neighborhood_avg_price', 
               'neighborhood_median_price', 'price_vs_neighborhood']].head())
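
To make the leakage caveat concrete, here is a minimal sketch of computing the neighborhood mean from the training rows only and then mapping it onto both splits. The helper name `add_neighborhood_price_mean`, the toy frames, and the choice to fall back on the overall training mean for unseen neighborhoods are all assumptions made for illustration.

```python
import pandas as pd


def add_neighborhood_price_mean(train, test, group_col="neighbourhood", target_col="price"):
    """Add a neighborhood mean-price column using statistics from train only."""
    means = train.groupby(group_col)[target_col].mean()
    fallback = train[target_col].mean()  # used for neighborhoods absent from train

    train_out, test_out = train.copy(), test.copy()
    train_out["neighborhood_avg_price"] = train_out[group_col].map(means)
    test_out["neighborhood_avg_price"] = test_out[group_col].map(means).fillna(fallback)
    return train_out, test_out


# Toy split: neighborhood "Z" never appears in the training rows
toy_train = pd.DataFrame({"neighbourhood": ["A", "A", "B"], "price": [100, 200, 300]})
toy_test = pd.DataFrame({"neighbourhood": ["A", "Z"], "price": [999, 999]})

train_enc, test_enc = add_neighborhood_price_mean(toy_train, toy_test)
print(train_enc)
print(test_enc)
```

The test prices (999) never influence the encoded values. Each training row's own price still contributes to its neighborhood mean, so a stricter variant would compute the means out-of-fold within the training set.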

### 5. Natural Language Processing (NLP) Features

**Justification:** The listing `name` contains marketing signals:
- Keywords like "luxury", "beach", "ocean" signal premium listings
- Text length/complexity may correlate with professionalism
- Certain words indicate amenities (pool, parking, wifi)
- Writing style (exclamation marks, uppercase) shows listing effort

**Simple approach:** Keyword matching and basic statistics  
**Advanced approach:** TF-IDF to find discriminative words

In [None]:
# Basic text features
data_fe['name_length'] = data_fe['name'].str.len()
data_fe['name_word_count'] = data_fe['name'].str.split().str.len()

# Keyword indicators (domain knowledge!)
data_fe['has_luxury_words'] = data_fe['name'].str.lower().str.contains(
    'luxury|stunning|beautiful|amazing|gorgeous', na=False
).astype(int)

data_fe['has_location_words'] = data_fe['name'].str.lower().str.contains(
    'beach|ocean|downtown|view|bay', na=False
).astype(int)

data_fe['has_amenity_words'] = data_fe['name'].str.lower().str.contains(
    'pool|parking|wifi|kitchen|patio', na=False
).astype(int)

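# Style indicators
data_fe['has_exclamation'] = data_fe['name'].str.contains('!', na=False).astype(int)
# fillna('') first: str.isupper() yields NaN for missing names, which astype(int) cannot convert
data_fe['is_uppercase'] = data_fe['name'].fillna('').str.isupper().astype(int)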

print("NLP features created:")
print(data_fe[['name', 'name_length', 'name_word_count', 'has_luxury_words', 
               'has_location_words', 'has_amenity_words']].head(10))
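
A caveat with plain substring patterns such as `'beach|ocean|downtown|view|bay'`: they also match inside longer words, so "view" fires on "reviews". If that matters for your analysis, regex word boundaries (`\b`) restrict matches to whole words. The toy names below are made up for illustration.

```python
import pandas as pd

names = pd.Series(
    ["Sunny bay cottage", "Bayview retreat", "Great reviews, ocean view", None]
)

# Substring matching: "bay" and "view" also match inside "Bayview" and "reviews"
substring_match = names.str.lower().str.contains("bay|view", na=False)

# Whole-word matching: \b marks a word boundary, (?:...) groups without capturing
whole_word_match = names.str.contains(r"\b(?:bay|view)\b", case=False, na=False)

comparison = pd.DataFrame(
    {
        "name": names,
        "substring": substring_match.astype(int),
        "whole_word": whole_word_match.astype(int),
    }
)
print(comparison)
```

Which behavior is preferable depends on the keyword: matching "Bayview" may be exactly what you want for a location signal, while matching "reviews" for "view" probably is not.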

#### Advanced NLP: TF-IDF Features

**TF-IDF (Term Frequency-Inverse Document Frequency)** identifies words that are distinctive for certain listings. Words that appear frequently in one listing but rarely across all listings get high scores.

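Unlike hand-picked keyword lists, TF-IDF builds its vocabulary from the data. Keep in mind that `max_features=20` keeps the 20 terms that occur most often across all names (after stop-word and `min_df` filtering), not necessarily the 20 most predictive of price.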

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF features from listing names
tfidf = TfidfVectorizer(max_features=20, stop_words='english', min_df=5)
tfidf_features = tfidf.fit_transform(data_fe['name'].fillna(''))

# Convert to DataFrame
tfidf_df = pd.DataFrame(
    tfidf_features.toarray(), 
    columns=[f'tfidf_{word}' for word in tfidf.get_feature_names_out()]
)

# Add to main dataset
data_fe = pd.concat([data_fe.reset_index(drop=True), tfidf_df], axis=1)

print(f"TF-IDF features created: {tfidf_df.shape[1]} features")
print("\nTop TF-IDF words found:")
print(list(tfidf.get_feature_names_out()))

### 6. Binning and Categorization

**Justification:** Sometimes continuous variables work better as categories:
- Non-linear price relationships: A $50 → $100 increase matters more than $500 → $550
- Binning can capture threshold effects (e.g., listings >180 days available may be different)
- Helps models learn patterns within ranges

This creates discrete buckets from continuous features.

In [None]:
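# Price categories (derived from the target: use them for analysis,
# not as model inputs when predicting price)
data_fe['price_category'] = pd.cut(
    data_fe['price'], 
    bins=[0, 100, 200, 500, np.inf], 
    labels=['budget', 'moderate', 'expensive', 'luxury'],
    include_lowest=True  # keep values equal to 0 in the first bin
)

# Availability categories (include_lowest=True so 0 available days is 'low', not NaN)
data_fe['availability_category'] = pd.cut(
    data_fe['availability_365'], 
    bins=[0, 90, 180, 270, 365],
    labels=['low', 'medium', 'high', 'very_high'],
    include_lowest=True
)

# Review volume categories (open-ended top bin so very large counts are not dropped)
data_fe['review_volume'] = pd.cut(
    data_fe['number_of_reviews'],
    bins=[0, 10, 50, 200, np.inf],
    labels=['new', 'established', 'popular', 'very_popular'],
    include_lowest=True
)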

print("Categorization features created:")
print(data_fe[['price', 'price_category', 'availability_365', 
               'availability_category', 'number_of_reviews', 'review_volume']].head(10))
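
The binned columns are categorical, so scikit-learn estimators need them converted to numbers before training. One option is one-hot encoding with `pd.get_dummies`, sketched below on toy values. The `avail` prefix is an arbitrary choice for this example.

```python
import pandas as pd

toy_avail = pd.DataFrame({"availability_365": [0, 45, 180, 365]})

# pd.cut bins are right-closed by default; include_lowest=True also keeps the value 0
toy_avail["availability_category"] = pd.cut(
    toy_avail["availability_365"],
    bins=[0, 90, 180, 270, 365],
    labels=["low", "medium", "high", "very_high"],
    include_lowest=True,
)

# One indicator column per category, stored as 0/1 integers
availability_dummies = pd.get_dummies(
    toy_avail["availability_category"], prefix="avail", dtype=int
)
print(pd.concat([toy_avail, availability_dummies], axis=1))
```

Because the bins have a natural order, an ordinal encoding (0, 1, 2, 3) is a reasonable alternative, particularly for tree-based models.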

### 7. Missing Value Indicators

**Justification:** Missing values contain information:
- Listings with no reviews might be new (different pricing strategy)
- Missing `reviews_per_month` indicates no activity (not truly "missing")
- The *presence* of missing data can be predictive

Indicator variables let the model use these missingness patterns directly, and they preserve that information if the gaps are later imputed.

In [None]:
# Missing value indicators
data_fe['has_reviews'] = (~data_fe['last_review'].isna()).astype(int)
data_fe['missing_reviews_per_month'] = data_fe['reviews_per_month'].isna().astype(int)
data_fe['missing_host_name'] = data_fe['host_name'].isna().astype(int)

print("Missing value indicators created:")
print(data_fe[['last_review', 'has_reviews', 'reviews_per_month', 
               'missing_reviews_per_month']].head(10))
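
Indicators are usually paired with imputation, since most scikit-learn estimators reject NaN values. As a sketch, `SimpleImputer` with `add_indicator=True` does both in one step. Filling `reviews_per_month` with 0 follows the "no activity" reading above and is an assumption, as are the toy values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy_reviews = pd.DataFrame({"reviews_per_month": [1.5, np.nan, 0.3, np.nan]})

# Fill gaps with 0 and append a 0/1 column marking which rows were missing
imputer = SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True)
imputed = imputer.fit_transform(toy_reviews[["reviews_per_month"]])

imputed_df = pd.DataFrame(
    imputed, columns=["reviews_per_month", "missing_reviews_per_month"]
)
print(imputed_df)
```

Fitting the imputer on the training split and reusing it on the test split keeps the two consistent, which matters more for strategies such as `"mean"` or `"median"` than for a constant fill.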

### Summary: Feature Engineering Complete!

We've created **roughly 50 new features** from just 18 original columns (the exact number depends on how many TF-IDF terms survive the `min_df` filter):

| Category | Features | Difficulty | Key Insight |
|----------|----------|------------|-------------|
| **Temporal** | 5 features | Easy | Extract time patterns from dates |
| **Geographic** | 2 features | Medium | Location drives pricing |
| **Ratios** | 4 features | Medium | Relationships between variables matter |
| **Aggregates** | 6 features | Advanced | Neighborhood context is crucial |
| **NLP (Simple)** | 7 features | Advanced | Text contains marketing signals |
| **NLP (TF-IDF)** | Up to 20 features | Advanced | Auto-discover important words |
| **Binning** | 3 features | Medium | Categories capture non-linearity |
| **Missing** | 3 features | Easy | Missingness is informative |

### Next Steps: Compare Model Performance

Now let's see if these engineered features improve model accuracy!

In [None]:
# Display final feature count
print(f"Original columns: {data.shape[1]}")
print(f"After feature engineering: {data_fe.shape[1]}")
print(f"New features created: {data_fe.shape[1] - data.shape[1]}")
print("\nSample of engineered features:")
print(data_fe.columns.tolist()[-20:])
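
One possible way to run that comparison is to cross-validate the same model on a baseline feature list and on an extended one. The sketch below uses synthetic data so that it runs on its own; the helper name `compare_feature_sets`, the toy columns, and the choice of 5-fold R² are assumptions, not the notebook's prescribed method.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def compare_feature_sets(df, target, feature_sets, cv=5, random_state=42):
    """Cross-validate one model per feature list and return mean/std R^2 per set."""
    rows = []
    for label, columns in feature_sets.items():
        model = RandomForestRegressor(n_estimators=50, random_state=random_state)
        scores = cross_val_score(model, df[columns], df[target], cv=cv, scoring="r2")
        rows.append(
            {"feature_set": label, "mean_r2": scores.mean(), "std_r2": scores.std()}
        )
    return pd.DataFrame(rows)


# Synthetic stand-in: price depends on distance but not on minimum_nights
rng = np.random.default_rng(0)
n_rows = 300
toy_listings = pd.DataFrame(
    {
        "minimum_nights": rng.integers(1, 8, size=n_rows),
        "distance_to_downtown": rng.uniform(0, 20, size=n_rows),
    }
)
toy_listings["price"] = (
    300 - 10 * toy_listings["distance_to_downtown"] + rng.normal(0, 10, size=n_rows)
)

comparison_results = compare_feature_sets(
    toy_listings,
    "price",
    {
        "baseline": ["minimum_nights"],
        "engineered": ["minimum_nights", "distance_to_downtown"],
    },
)
print(comparison_results)
```

On the real data, pass `data_fe` with your own column lists, and exclude anything derived from `price` (such as `price_per_min_night`, `price_vs_neighborhood`, and `price_category`) so the comparison is not inflated by leakage.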