# SQL Analysis: Price Strategy and Engagement

## Objective
Analyze how pricing strategy relates to player engagement and popularity using SQL queries.

## Research Question
**How does pricing strategy relate to player engagement and popularity?**

## Dataset
- **Input:** `archive1/games_march2025_ml_ready.csv` (pre-cleaned ML-ready dataset)
- **Filters Applied:** 
  - Games with ≥500 total reviews (already filtered in ML-ready dataset)
  - Paid games only (price > 0) - applied in this notebook


## Setup and Imports


In [1]:
import warnings
warnings.filterwarnings('ignore')

from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))


In [None]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark Session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("PriceStrategyAnalysis") \
    .getOrCreate()

# Set log level to reduce output noise
spark.sparkContext.setLogLevel("WARN")

print("Spark session created successfully!")
print(f"Spark version: {spark.version}")


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/12/26 03:30:45 WARN Utils: Your hostname, nobara-pc, resolves to a loopback address: 127.0.1.1; using 192.168.1.159 instead (on interface enp108s0)
25/12/26 03:30:45 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/12/26 03:30:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/12/26 03:30:46 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/12/26 03:30:46 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/12/26 03:30:46 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


Spark session created successfully!
Spark version: 4.1.0


## Step 1: Load Pre-cleaned ML-Ready Data

**Note:** We use the existing ML-ready dataset which already has:
- Unnecessary columns removed
- Games with ≥500 reviews filtered
- Boolean columns converted to 0/1
- Null values handled

We only need to filter for paid games (price > 0).


In [3]:
# Read pre-cleaned ML-ready CSV file
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("escape", '"') \
    .option("multiLine", "true") \
    .option("quote", '"') \
    .option("ignoreLeadingWhiteSpace", "true") \
    .option("ignoreTrailingWhiteSpace", "true") \
    .csv("archive1/games_march2025_ml_ready.csv")

print("ML-Ready dataset loaded:")
print(f"Total games: {df.count():,}")
print(f"Total columns: {len(df.columns)}")
print("\nNote: This dataset already has:")
print("  - Unnecessary columns removed")
print("  - Games with ≥500 reviews filtered")
print("  - Null values handled")


ML-Ready dataset loaded:
Total games: 11,889
Total columns: 31

Note: This dataset already has:
  - Unnecessary columns removed
  - Games with ≥500 reviews filtered
  - Null values handled


### Step 1.1: Filter Paid Games Only


In [5]:
# Filter for paid games only (price > 0)
games_before = df.count()
print(f"Games before filtering: {games_before:,}")

df_cleaned = df.filter(
    (col('price').isNotNull()) &
    (col('price') > 0)  # Remove free games
)

games_after = df_cleaned.count()
games_removed = games_before - games_after


print(f"\n- Paid games (price > 0): {games_after:,}")
print(f"- Free games removed: {games_removed:,} ({games_removed/games_before*100:.2f}%)")


Games before filtering: 11,889

- Paid games (price > 0): 9,460
- Free games removed: 2,429 (20.43%)
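
The removal percentage can be re-derived from the two counts printed above. This is only an arithmetic check on the reported numbers; it does not touch the dataset.

```python
# Counts copied from the notebook output above
games_before = 11_889   # rows in the ML-ready dataset
games_after = 9_460     # rows remaining after the price > 0 filter

games_removed = games_before - games_after
pct_removed = games_removed / games_before * 100

print(f"Removed: {games_removed:,} ({pct_removed:.2f}%)")
# Removed: 2,429 (20.43%)
```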


### Step 1.2: Create SQL View


In [8]:
# Create temporary view for SQL queries
df_cleaned.createOrReplaceTempView("games")

print("Data ready and registered as 'games' view")

print(f"  - Total games: {df_cleaned.count():,}")
print("  - All games have ≥500 reviews")
print("  - All games are paid (price > 0)")


Data ready and registered as 'games' view
  - Total games: 9,460
  - All games have ≥500 reviews
  - All games are paid (price > 0)



# Price Strategy Analysis

## Query: Price Strategy and Engagement Patterns


In [None]:

# Price Strategy Query

query_price = """
WITH price_categorized AS (
    SELECT 
        appid,
        name,
        price,
        CASE 
            WHEN price > 0 AND price <= 5 THEN 'Budget ($0-5)'
            WHEN price > 5 AND price <= 10 THEN 'Low ($5-10)'
            WHEN price > 10 AND price <= 20 THEN 'Medium ($10-20)'
            WHEN price > 20 AND price <= 40 THEN 'High ($20-40)'
            WHEN price > 40 AND price <= 60 THEN 'Premium ($40-60)'
            ELSE 'AAA ($60+)'
        END AS price_category,
        average_playtime_forever,
        num_reviews_total,
        peak_ccu,
        pct_pos_total
    FROM games
    WHERE price IS NOT NULL
        AND average_playtime_forever IS NOT NULL
),
engagement_categorized AS (
    SELECT
        *,
        -- Engagement tiers; average_playtime_forever is in minutes
        CASE 
            WHEN average_playtime_forever < 60 THEN 'Low Engagement'
            WHEN average_playtime_forever < 300 THEN 'Medium Engagement'
            WHEN average_playtime_forever < 1000 THEN 'High Engagement'
            ELSE 'Very High Engagement'
        END AS engagement_level,
        -- Popularity tiers based on total review count
        CASE 
            WHEN num_reviews_total < 1000 THEN 'Low Popularity'
            WHEN num_reviews_total < 10000 THEN 'Medium Popularity'
            WHEN num_reviews_total < 50000 THEN 'High Popularity'
            ELSE 'Very High Popularity'
        END AS popularity_level
    FROM price_categorized
)
SELECT 
    price_category,
    COUNT(*) AS game_count,
    ROUND(AVG(price), 2) AS avg_price,
    ROUND(AVG(average_playtime_forever) / 60.0, 2) AS avg_playtime_hours,
    ROUND(AVG(num_reviews_total), 0) AS avg_reviews,
    ROUND(AVG(peak_ccu), 0) AS avg_peak_ccu,
    ROUND(AVG(pct_pos_total), 2) AS avg_positive_pct,
    COUNT(CASE WHEN engagement_level = 'Very High Engagement' THEN 1 END) AS very_high_engagement_count,
    COUNT(CASE WHEN popularity_level = 'Very High Popularity' THEN 1 END) AS very_high_popularity_count,
    ROUND(COUNT(CASE WHEN engagement_level = 'Very High Engagement' THEN 1 END) * 100.0 / COUNT(*), 2) AS pct_very_high_engagement
FROM engagement_categorized
GROUP BY price_category
ORDER BY 
    CASE price_category
        WHEN 'Budget ($0-5)' THEN 1
        WHEN 'Low ($5-10)' THEN 2
        WHEN 'Medium ($10-20)' THEN 3
        WHEN 'High ($20-40)' THEN 4
        WHEN 'Premium ($40-60)' THEN 5
        WHEN 'AAA ($60+)' THEN 6
    END
"""

spark.sql(query_price).show(truncate=False)


+----------------+----------+---------+------------------+-----------+------------+----------------+--------------------------+--------------------------+------------------------+
|price_category  |game_count|avg_price|avg_playtime_hours|avg_reviews|avg_peak_ccu|avg_positive_pct|very_high_engagement_count|very_high_popularity_count|pct_very_high_engagement|
+----------------+----------+---------+------------------+-----------+------------+----------------+--------------------------+--------------------------+------------------------+
|Budget ($0-5)   |2160      |3.01     |20.29             |3813.0     |117.0       |82.3            |83                        |23                        |3.84                    |
|Low ($5-10)     |2090      |8.66     |4.23              |5272.0     |235.0       |83.47           |116                       |36                        |5.55                    |
|Medium ($10-20) |3267      |16.94    |6.42              |7350.0     |311.0       |83.58           |


# Summary

## Query Description:
This query relates pricing to player engagement and popularity in four steps:
1. Assign each game to a price bucket (Budget, Low, Medium, High, Premium, AAA)
2. Label each game with an engagement level (from average playtime) and a popularity level (from total review count)
3. Aggregate per price bucket: game count, average price, average playtime, average reviews, average peak CCU, and average positive review percentage
4. Count the games in the top engagement and popularity tiers per bucket, and report the share of games with very high engagement
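
The price buckets described above have inclusive upper bounds, which is easy to get wrong at the edges. Below is a minimal plain-Python sketch that mirrors the `CASE` expression in `price_categorized` so the boundary values can be checked without a Spark session. `price_bucket` is a hypothetical helper written for illustration and is not part of the notebook.

```python
def price_bucket(price: float) -> str:
    """Hypothetical helper mirroring the CASE expression in price_categorized."""
    if 0 < price <= 5:
        return "Budget ($0-5)"
    if 5 < price <= 10:
        return "Low ($5-10)"
    if 10 < price <= 20:
        return "Medium ($10-20)"
    if 20 < price <= 40:
        return "High ($20-40)"
    if 40 < price <= 60:
        return "Premium ($40-60)"
    # Like the SQL ELSE branch, this catches every remaining value,
    # which for the filtered data means prices above 60
    return "AAA ($60+)"


# Probe values on either side of the bucket edges
for p in (4.99, 5.00, 5.01, 59.99, 60.00, 69.99):
    print(f"{p:>6.2f} -> {price_bucket(p)}")
```

A game priced at exactly 5.00 lands in Budget and one at exactly 60.00 lands in Premium; only prices strictly above 60 reach the AAA bucket.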

## Dataset Characteristics:
- **Source:** ML-ready dataset (pre-cleaned)
- **Filtered:** Games with ≥500 reviews and price > 0 (paid games only)
- **Metrics:** Playtime is reported in hours (`average_playtime_forever` is stored in minutes and divided by 60 in the query)
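
The minutes-to-hours conversion and the conditional-count pattern behind `pct_very_high_engagement` can be tried on a toy table. The sketch below uses the standard-library `sqlite3` module rather than Spark SQL, and the table `toy_games` and its four rows are made up purely for illustration; they are not taken from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_games (name TEXT, average_playtime_forever REAL)")
conn.executemany(
    "INSERT INTO toy_games VALUES (?, ?)",
    # Made-up rows; playtime is in minutes
    [("a", 30), ("b", 120), ("c", 1500), ("d", 2550)],
)

row = conn.execute("""
    SELECT
        COUNT(*) AS game_count,
        -- Average in minutes, then converted to hours
        ROUND(AVG(average_playtime_forever) / 60.0, 2) AS avg_playtime_hours,
        -- CASE without ELSE yields NULL, and COUNT skips NULLs
        ROUND(
            COUNT(CASE WHEN average_playtime_forever >= 1000 THEN 1 END)
            * 100.0 / COUNT(*), 2
        ) AS pct_very_high_engagement
    FROM toy_games
""").fetchone()
conn.close()

print(row)
# (4, 17.5, 50.0)
```

Two of the four toy rows have at least 1000 minutes of playtime, so the share is 50%. Multiplying by `100.0` before dividing keeps the division from being done on integers.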


In [13]:
spark.stop()