# PySpark DataFrame Essentials – Airbnb Listings

**Dataset**: `samples.airbnb.listings` (adjust to your workspace sample name)

In this notebook you will:
1. Load the Airbnb sample dataset
2. Explore schema, columns, and basic profiling
3. Use `select`, `filter`, `withColumn`, `groupBy`, `orderBy`
4. Mix PySpark DataFrames with Spark SQL
5. Do a small feature engineering exercise


In [None]:
from pyspark.sql import functions as F

# Load Airbnb sample dataset
# If this table name differs in your workspace, inspect what is available with
# spark.sql("SHOW SCHEMAS IN samples") and then spark.sql("SHOW TABLES IN samples.<schema>").
airbnb_df = spark.read.table("samples.airbnb.listings")

display(airbnb_df.limit(5))
print("Count:", airbnb_df.count())
airbnb_df.printSchema()


## 1. Column Selection & Renaming


In [None]:
# Select a subset of columns
cols_df = airbnb_df.select(
    "id",
    "name",
    "host_id",
    "host_name",
    "neighbourhood",
    "room_type",
    "price",
    "number_of_reviews",
    "minimum_nights"
)

display(cols_df.limit(10))

# Rename columns using alias
renamed_df = cols_df.select(
    F.col("id").alias("listing_id"),
    F.col("host_id").alias("hostId"),
    F.col("neighbourhood").alias("area"),
    "room_type",
    "price",
    "number_of_reviews",
    "minimum_nights"
)

display(renamed_df.limit(5))


## 2. Filtering Rows

Examples:
- Filter by room type
- Filter by price range
- Combine conditions with `&` and `|`
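
PySpark combines column conditions with Python's bitwise operators `&` and `|`, which bind more tightly than comparisons such as `<` and `==`. The sketch below uses plain integers instead of Spark columns (the variable names and values are illustrative only) to show how leaving out the parentheses changes what Python evaluates.

```python
# Plain-Python illustration of operator precedence; no Spark session is needed.
# The values are arbitrary and only stand in for column comparisons.
price, price_cap = 80, 100
nights, nights_wanted = 2, 2

# Parenthesized: each comparison is evaluated first, then the results are combined.
with_parens = (price < price_cap) & (nights == nights_wanted)

# Unparenthesized: `&` binds tighter than `<` and `==`, so Python reads this as
# the chained comparison  price < (price_cap & nights) == nights_wanted.
without_parens = price < price_cap & nights == nights_wanted

print(with_parens)     # True
print(without_parens)  # False, because price_cap & nights is 0 here
```

With Spark `Column` objects the unparenthesized form usually fails outright, because the chained comparison tries to convert a `Column` to a Python boolean. Wrapping every condition in parentheses avoids both problems.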


In [None]:
# Private rooms cheaper than 100
cheap_private_df = (
    renamed_df
    .filter((F.col("room_type") == "Private room") & (F.col("price") < 100))
)

print("Cheap private rooms:", cheap_private_df.count())
display(cheap_private_df.orderBy(F.col("price")).limit(10))


## 3. Adding Columns with `withColumn`

We'll create:
- `price_per_min_night` = `price` / `minimum_nights`, left as NULL when `minimum_nights` is not greater than 0
- `is_popular`, a boolean flag that is true when a listing has 50 or more reviews
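
To make the NULL behaviour of these two columns concrete, here is a plain-Python sketch of the per-row logic. It assumes `None` stands in for SQL NULL and that the `when` expression has no `otherwise` branch, as in the cell that follows. The function names are hypothetical and are not part of PySpark.

```python
from typing import Optional


def price_per_min_night(price: Optional[float],
                        minimum_nights: Optional[int]) -> Optional[float]:
    """Mimic F.when(minimum_nights > 0, price / minimum_nights) with no otherwise()."""
    # A NULL or non-positive minimum_nights fails the condition, so the result is NULL.
    if minimum_nights is None or minimum_nights <= 0:
        return None
    # NULL propagates through arithmetic.
    if price is None:
        return None
    return price / minimum_nights


def is_popular(number_of_reviews: Optional[int],
               threshold: int = 50) -> Optional[bool]:
    """Mimic number_of_reviews >= 50; a NULL input gives a NULL flag, not False."""
    if number_of_reviews is None:
        return None
    return number_of_reviews >= threshold


print(price_per_min_night(120.0, 3))  # 40.0
print(price_per_min_night(120.0, 0))  # None
print(is_popular(50))                 # True
print(is_popular(None))               # None
```

If a default is preferable to NULL, one option is to add an `.otherwise(...)` branch to the `when` expression.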


In [None]:
enhanced_df = (
    renamed_df
    .withColumn(
        "price_per_min_night",
        # No .otherwise(): rows where minimum_nights is NULL or <= 0 get NULL
        F.when(F.col("minimum_nights") > 0,
               F.col("price") / F.col("minimum_nights"))
    )
    .withColumn(
        "is_popular",
        F.col("number_of_reviews") >= 50
    )
)

display(enhanced_df.limit(10))


## 4. Aggregations with `groupBy`

Examples:
- Average price per neighbourhood (`area`)
- Average price and average reviews per room type
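
As a rough mental model of what `groupBy(...).agg(...)` computes, the plain-Python sketch below groups a few made-up rows by area. It assumes `None` stands in for NULL and reflects that `avg` skips NULL values while `count("*")` counts every row. The sample rows and the helper name `summarize_by_area` are invented for illustration.

```python
from collections import defaultdict

# Made-up rows for illustration only: (area, price)
rows = [
    ("A", 100.0),
    ("A", 200.0),
    ("A", None),   # missing price
    ("B", 300.0),
]


def summarize_by_area(rows):
    """Return {area: (avg_price, listing_count)} for the given rows."""
    prices = defaultdict(list)
    counts = defaultdict(int)
    for area, price in rows:
        counts[area] += 1               # like count("*"): every row counts
        if price is not None:
            prices[area].append(price)  # like avg: NULL values are skipped
    return {
        area: (
            round(sum(prices[area]) / len(prices[area]), 2) if prices[area] else None,
            counts[area],
        )
        for area in counts
    }


summary = summarize_by_area(rows)
print(summary["A"])  # (150.0, 3)
print(summary["B"])  # (300.0, 1)
```

Area "A" has three listings but its average uses only the two non-missing prices, which is why `listing_count` and the number of values behind `avg_price` can differ.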


In [None]:
avg_price_by_area_df = (
    enhanced_df
    .groupBy("area")
    .agg(
        F.round(F.avg("price"), 2).alias("avg_price"),
        F.count("*").alias("listing_count")
    )
    .orderBy(F.col("avg_price").desc())
)

display(avg_price_by_area_df.limit(20))


In [None]:
room_stats_df = (
    enhanced_df
    .groupBy("room_type")
    .agg(
        F.round(F.avg("price"), 2).alias("avg_price"),
        F.round(F.avg("number_of_reviews"), 2).alias("avg_reviews"),
        F.count("*").alias("total_listings")
    )
)

display(room_stats_df)


## 5. Using Spark SQL with Temporary Views

Some queries are easier to express in SQL than with the DataFrame API.
We'll register a temporary view and run a SQL aggregation similar to the ones above, this time grouped by both area and room type.


In [None]:
# Register as temporary view
enhanced_df.createOrReplaceTempView("airbnb_enhanced")

# Use SQL for analysis
area_sql_df = spark.sql('''
SELECT
  area,
  room_type,
  COUNT(*) AS listing_count,
  ROUND(AVG(price), 2) AS avg_price,
  ROUND(AVG(number_of_reviews), 2) AS avg_reviews
FROM airbnb_enhanced
GROUP BY area, room_type
ORDER BY listing_count DESC
''')

display(area_sql_df.limit(30))


## 6. Small Feature Engineering Exercise

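Goal:
- Build a simple quality/attractiveness score in the range 0 to 1:
  - Min-max normalize `price` and invert it, so cheaper listings score higher
  - Min-max normalize `number_of_reviews`, so listings with more reviews score higher
  - Average the two normalized values with equal weights (0.5 each)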
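
The scoring arithmetic can be checked by hand before running it on the full table. The sketch below applies min-max normalization to a single listing with made-up numbers, using equal 0.5 weights as in the cell that follows. `min_max` and `quality_score` are hypothetical plain-Python helpers, not Spark functions.

```python
def min_max(value, lo, hi):
    """Scale value into [0, 1]; fall back to a range of 1.0 when lo == hi."""
    span = (hi - lo) if hi != lo else 1.0
    return (value - lo) / span


def quality_score(price, reviews, min_price, max_price, min_reviews, max_reviews):
    """Combine inverted price and review count with equal weights."""
    price_norm = 1.0 - min_max(price, min_price, max_price)    # cheaper = higher
    reviews_norm = min_max(reviews, min_reviews, max_reviews)  # more reviews = higher
    return 0.5 * price_norm + 0.5 * reviews_norm


# Made-up numbers: prices span 50..250, review counts span 0..200.
score = quality_score(price=100, reviews=100,
                      min_price=50, max_price=250,
                      min_reviews=0, max_reviews=200)
print(score)  # 0.625
```

Because min-max scaling depends on the extremes, a single very expensive listing pushes every other listing's `price_norm` toward 1, so it is worth looking at the printed ranges before trusting the score.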


In [None]:
# Compute min/max for price and number_of_reviews
stats = enhanced_df.agg(
    F.min("price").alias("min_price"),
    F.max("price").alias("max_price"),
    F.min("number_of_reviews").alias("min_reviews"),
    F.max("number_of_reviews").alias("max_reviews")
).collect()[0]

min_price, max_price = stats["min_price"], stats["max_price"]
min_reviews, max_reviews = stats["min_reviews"], stats["max_reviews"]

print("Price range:", min_price, "->", max_price)
print("Review count range:", min_reviews, "->", max_reviews)

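# Fall back to a range of 1.0 when all values are identical, to avoid dividing by zero.
# This assumes the table is non-empty; on an empty table the min/max values are None.
price_range = max_price - min_price if max_price != min_price else 1.0
review_range = max_reviews - min_reviews if max_reviews != min_reviews else 1.0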

scored_df = (
    enhanced_df
    .withColumn(
        "price_norm",
        1.0 - (F.col("price") - min_price) / price_range  # cheaper = higher score
    )
    .withColumn(
        "reviews_norm",
        (F.col("number_of_reviews") - min_reviews) / review_range
    )
    .withColumn(
        "quality_score",
        0.5 * F.col("price_norm") + 0.5 * F.col("reviews_norm")
    )
)

display(
    scored_df
    .select("area", "room_type", "price", "number_of_reviews", "quality_score")
    .orderBy(F.col("quality_score").desc())
    .limit(20)
)
