# Data Preprocessing for ML - Steam Games Dataset

## Objective
Clean and preprocess the Steam games dataset to make it ready for machine learning:
- Remove URLs, emails, and unnecessary description columns
- Convert boolean columns to 0/1
- Filter out games with fewer than 300 total reviews
- Prepare clean, structured data for ML models

## Dataset
- **Input:** `../archive1/games_march2025_cleaned.csv`
- **Output:** Cleaned dataset ready for ML


In [50]:
import warnings
warnings.filterwarnings('ignore')

from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))


In [51]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.sql.types import IntegerType

# Initialize Spark Session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("DataPreprocessingML") \
    .getOrCreate()

# Set log level to reduce output noise
spark.sparkContext.setLogLevel("WARN")

print("Spark session created successfully!")
print(f"Spark version: {spark.version}")


Spark session created successfully!
Spark version: 4.1.0


## Step 1: Load Data


In [None]:
# Read CSV file with improved options to handle complex formatting
df = spark.read \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .option("escape", '"') \
    .option("multiLine", "true") \
    .option("quote", '"') \
    .option("ignoreLeadingWhiteSpace", "true") \
    .option("ignoreTrailingWhiteSpace", "true") \
    .csv("../archive1/games_march2025_cleaned.csv")

print("Initial dataset:")
print(f"Total number of games: {df.count():,}")
print(f"Number of columns: {len(df.columns)}")
print("\nSchema:")
df.printSchema()

# Verify column alignment by checking a sample
print("\nSample data (first 3 rows, first 10 columns):")
df.select(df.columns[:10]).show(3, truncate=50)

# Data validation: Check if data looks correct
print("\n" + "="*80)
print("DATA VALIDATION")
print("="*80)

# Check appid - should be numeric
print("\n1. AppID validation:")
appid_sample = df.select("appid").limit(5).collect()
print("Sample appid values:")
for row in appid_sample:
    print(f"  - {row.appid} (type: {type(row.appid).__name__})")

# Check name - should be strings
print("\n2. Name validation:")
name_sample = df.select("name").limit(5).collect()
print("Sample name values:")
for row in name_sample:
    name_val = str(row.name)[:50] if row.name else "NULL"
    print(f"  - {name_val}")

# Check price - should be numeric
print("\n3. Price validation:")
# show() prints the table and returns None, so there is nothing useful to assign
df.select("price").describe().show()


                                                                                

Initial dataset:
Total number of games: 89,618
Number of columns: 47

Schema:
root
 |-- appid: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- release_date: date (nullable = true)
 |-- required_age: integer (nullable = true)
 |-- price: double (nullable = true)
 |-- dlc_count: integer (nullable = true)
 |-- detailed_description: string (nullable = true)
 |-- about_the_game: string (nullable = true)
 |-- short_description: string (nullable = true)
 |-- reviews: string (nullable = true)
 |-- header_image: string (nullable = true)
 |-- website: string (nullable = true)
 |-- support_url: string (nullable = true)
 |-- support_email: string (nullable = true)
 |-- windows: boolean (nullable = true)
 |-- mac: boolean (nullable = true)
 |-- linux: boolean (nullable = true)
 |-- metacritic_score: integer (nullable = true)
 |-- metacritic_url: string (nullable = true)
 |-- achievements: integer (nullable = true)
 |-- recommendations: integer (nullable = true)
 |-- notes: string (nullable = true

+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|             89618|
|   mean| 7.309622620449302|
| stddev|13.331073254511667|
|    min|               0.0|
|    max|            999.98|
+-------+------------------+
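
The read options in this step (`quote`, `escape`, `multiLine`) are there to stop quoted, multi-line description fields from shifting columns. As an independent cross-check outside Spark, the following minimal sketch uses Python's standard `csv` module to list records whose field count differs from the header. The helper name `find_misaligned_rows` is hypothetical, and the sketch assumes a UTF-8 file that escapes quotes by doubling them, which is what the `escape` option suggests.

```python
import csv


def find_misaligned_rows(path, encoding="utf-8"):
    """Return (header_width, [(record_number, field_count), ...]).

    The list holds every data record whose number of fields differs
    from the header's. Records are numbered from 1, header excluded.
    """
    # Description fields can be long; raise the per-field size limit.
    csv.field_size_limit(10**9)
    mismatches = []
    # newline="" lets the csv module handle line breaks inside quoted fields.
    with open(path, newline="", encoding=encoding) as handle:
        reader = csv.reader(handle)  # default dialect: '"' quoting, doubled-quote escape
        header = next(reader, None)
        if header is None:  # empty file
            return 0, []
        for record_number, row in enumerate(reader, start=1):
            if len(row) != len(header):
                mismatches.append((record_number, len(row)))
    return len(header), mismatches
```

Calling `find_misaligned_rows("../archive1/games_march2025_cleaned.csv")` would be expected to return an empty mismatch list if the columns are aligned; any entries point at records worth inspecting by hand.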



                                                                                

## Step 2: Remove Unnecessary Columns

Remove URLs, emails, and description columns that are not useful for ML.


In [53]:
# Columns to remove:
# - URLs: header_image, website, support_url, metacritic_url, screenshots, movies
# - Email: support_email
# - Descriptions: detailed_description, about_the_game, short_description, reviews, notes
# - Additional: packages, user_score, score_rank, metacritic_score (many have 0/missing)

columns_to_remove = [
    'header_image',
    'website',
    'support_url',
    'support_email',
    'metacritic_url',
    'screenshots',
    'movies',
    'detailed_description',
    'about_the_game',
    'short_description',
    'reviews',
    'notes',
    'packages',
    'user_score',
    'score_rank',
    'metacritic_score'
]

print("Columns to remove:")
for col_name in columns_to_remove:
    if col_name in df.columns:
        print(f"  - {col_name}")
    else:
        print(f"  - {col_name} (not found)")

# Select only columns we want to keep
columns_to_keep = [col_name for col_name in df.columns if col_name not in columns_to_remove]
df_cleaned = df.select(columns_to_keep)

print("\nAfter removing columns:")
print(f"Number of columns: {len(df_cleaned.columns)}")
print("\nRemaining columns:")
for i, col_name in enumerate(df_cleaned.columns, 1):
    print(f"{i:2d}. {col_name}")


Columns to remove:
  - header_image
  - website
  - support_url
  - support_email
  - metacritic_url
  - screenshots
  - movies
  - detailed_description
  - about_the_game
  - short_description
  - reviews
  - notes
  - packages
  - user_score
  - score_rank
  - metacritic_score

After removing columns:
Number of columns: 31

Remaining columns:
 1. appid
 2. name
 3. release_date
 4. required_age
 5. price
 6. dlc_count
 7. windows
 8. mac
 9. linux
10. achievements
11. recommendations
12. supported_languages
13. full_audio_languages
14. developers
15. publishers
16. categories
17. genres
18. positive
19. negative
20. estimated_owners
21. average_playtime_forever
22. average_playtime_2weeks
23. median_playtime_forever
24. median_playtime_2weeks
25. discount
26. peak_ccu
27. tags
28. pct_pos_total
29. num_reviews_total
30. pct_pos_recent
31. num_reviews_recent


## Step 3: Convert Boolean Columns to 0/1

Convert the `windows`, `mac`, and `linux` columns from boolean to integer (0 or 1).


In [54]:
# Check current boolean columns
print("Boolean columns before conversion:")
for bool_col in ['windows', 'mac', 'linux']:
    if bool_col in df_cleaned.columns:
        print(f"\n{bool_col}:")
        df_cleaned.select(bool_col).distinct().show()

# Convert boolean columns to 0/1.
# Nulls and any value that matches none of the conditions fall through to 0.
for bool_col in ['windows', 'mac', 'linux']:
    df_cleaned = df_cleaned.withColumn(
        bool_col,
        when(col(bool_col) == True, 1)
        .when(col(bool_col) == 'True', 1)
        .when(col(bool_col) == 'true', 1)
        .otherwise(0).cast(IntegerType())
    )

print("\nBoolean columns after conversion:")
for bool_col in ['windows', 'mac', 'linux']:
    if bool_col in df_cleaned.columns:
        print(f"\n{bool_col}:")
        df_cleaned.select(bool_col).distinct().show()


Boolean columns before conversion:

windows:
+-------+
|windows|
+-------+
|   true|
|  false|
+-------+


mac:
+-----+
|  mac|
+-----+
| true|
|false|
+-----+


linux:
+-----+
|linux|
+-----+
| true|
|false|
+-----+


Boolean columns after conversion:

windows:
+-------+
|windows|
+-------+
|      1|
|      0|
+-------+


mac:
+---+
|mac|
+---+
|  1|
|  0|
+---+


linux:
+-----+
|linux|
+-----+
|    1|
|    0|
+-----+



                                                                                

## Step 4: Filter Games with Minimum Reviews

Remove games with fewer than 300 total reviews to ensure data quality. The filter also drops rows whose `num_reviews_total` is null or negative.


In [55]:
# Check num_reviews_total distribution
print("Review count statistics:")
df_cleaned.select('num_reviews_total').describe().show()

# Count games before filtering
games_before = df_cleaned.count()
print(f"\nGames before filtering: {games_before:,}")

# Filter games with at least 300 total reviews
df_cleaned = df_cleaned.filter(
    (col('num_reviews_total').isNotNull()) &
    (col('num_reviews_total') >= 300)
)

games_after = df_cleaned.count()
games_removed = games_before - games_after

print(f"Games after filtering (≥300 reviews): {games_after:,}")
print(f"Games removed: {games_removed:,} ({games_removed/games_before*100:.2f}%)")


Review count statistics:
+-------+------------------+
|summary| num_reviews_total|
+-------+------------------+
|  count|             89618|
|   mean|1315.4901359101966|
| stddev| 35423.69677558889|
|    min|                -1|
|    max|           8632939|
+-------+------------------+


Games before filtering: 89,618
Games after filtering (≥300 reviews): 11,889
Games removed: 77,729 (86.73%)
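
The removal figures follow directly from the two row counts. A small worked check in plain Python, using the counts printed above (`removal_summary` is a hypothetical helper name, not part of the notebook):

```python
def removal_summary(rows_before, rows_after):
    """Return (rows removed, percentage of rows removed) for a filtering step."""
    if rows_before <= 0:
        raise ValueError("rows_before must be positive")
    removed = rows_before - rows_after
    return removed, removed / rows_before * 100


# Counts reported by the filtering cell: 89,618 games before, 11,889 after.
removed, pct = removal_summary(89_618, 11_889)
print(f"Games removed: {removed:,} ({pct:.2f}%)")  # prints: Games removed: 77,729 (86.73%)
```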


                                                                                

## Step 5: Data Quality Check

Check for null values and data types to ensure data is clean.


In [56]:
# Check null values in each column
print("="*80)
print("NULL VALUE ANALYSIS")
print("="*80)
print(f"{'Column':<30} {'Null Count':<15} {'Null %':<15}")
print("-"*80)

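# Count nulls for all columns in one aggregation instead of one Spark job per column
from pyspark.sql.functions import sum as spark_sum

total_rows = df_cleaned.count()
null_counts = df_cleaned.select(
    [spark_sum(col(c).isNull().cast("int")).alias(c) for c in df_cleaned.columns]
).first().asDict()

for col_name in df_cleaned.columns:
    null_count = null_counts[col_name] or 0  # the sum is None when there are no rows
    null_pct = (null_count / total_rows * 100) if total_rows > 0 else 0
    print(f"{col_name:<30} {null_count:<15,} {null_pct:<15.2f}%")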


NULL VALUE ANALYSIS
Column                         Null Count      Null %
--------------------------------------------------------------------------------
appid                          0               0.00           %
name                           0               0.00           %
release_date                   0               0.00           %
required_age                   0               0.00           %
price                          0               0.00           %
dlc_count                      0               0.00           %
windows                        0               0.00           %
mac                            0               0.00           %
linux                          0               0.00           %
achievements                   0               0.00           %
recommendations                0               0.00           %
supported_languages            0               0.00           %
full_audio_languages           0               0.00           %
developers                     0               0.00           %
publishers                     0               0.00           %
categories                     0               0.00           %
genres                         0               0.00           %
positive                       0               0.00           %
negative                       0               0.00           %
estimated_owners               0               0.00           %
average_playtime_forever       0               0.00           %
average_playtime_2weeks        0               0.00           %
median_playtime_forever        0               0.00           %
median_playtime_2weeks         0               0.00           %
discount                       0               0.00           %
peak_ccu                       0               0.00           %
tags                           0               0.00           %
pct_pos_total                  0               0.00           %
num_reviews_total              0               0.00           %
pct_pos_recent                 0               0.00           %
num_reviews_recent             0               0.00           %


                                                                                

In [57]:
# Show sample of cleaned data
print("Sample of cleaned data:")
df_cleaned.show(10, truncate=50)

print("\nFinal schema:")
df_cleaned.printSchema()

# Verify data integrity - check if appid and name columns look correct
print("\nData integrity check:")
print("Sample appid and name values:")
df_cleaned.select("appid", "name").show(10, truncate=False)

# Check for any obvious misalignment (e.g., appid should be numeric)
print("\nChecking appid column (should be numeric):")
df_cleaned.select("appid").describe().show()


Sample of cleaned data:
+-------+-------------------------------+------------+------------+-----+---------+-------+---+-----+------------+---------------+--------------------------------------------------+--------------------------------------------------+---------------------+---------------------+--------------------------------------------------+--------------------------------------------------+--------+--------+---------------------+------------------------+-----------------------+-----------------------+----------------------+--------+--------+--------------------------------------------------+-------------+-----------------+--------------+------------------+
|  appid|                           name|release_date|required_age|price|dlc_count|windows|mac|linux|achievements|recommendations|                               supported_languages|                              full_audio_languages|           developers|           publishers|                                        categories

+-------+-----------------+
|summary|            appid|
+-------+-----------------+
|  count|            11889|
|   mean|1119218.078896459|
| stddev|788254.1029442574|
|    min|               20|
|    max|          3496470|
+-------+-----------------+



                                                                                

## Step 6: Save Preprocessed Data

Save the cleaned dataset for ML use.


In [None]:
# Save preprocessed data
output_path = "../archive1/games_march2025_ml_ready.csv"

print(f"Saving preprocessed data to: {output_path}")
print(f"Total rows: {df_cleaned.count():,}")
print(f"Total columns: {len(df_cleaned.columns)}")

# Save as CSV with proper escaping to avoid column misalignment
df_cleaned.coalesce(1).write \
    .mode("overwrite") \
    .option("header", "true") \
    .option("escape", '"') \
    .option("quote", '"') \
    .csv(output_path)

print("\nData saved successfully!")
print(f"Output file: {output_path}")
print("\nNote: Spark saves CSV as a directory with part files.")
print("To get a single CSV file, you may need to merge the part files manually.")


Saving preprocessed data to: archive1/games_march2025_ml_ready.csv
Total rows: 11,889
Total columns: 31

Data saved successfully!
Output file: archive1/games_march2025_ml_ready.csv

Note: Spark saves CSV as a directory with part files.
To get a single CSV file, you may need to merge the part files manually.
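
One way to do that merge is with the Python standard library. This is a minimal sketch, not part of the notebook: `merge_part_files` is a hypothetical helper. It assumes the part files match `part-*.csv` inside the Spark output directory and that each one starts with the same single-line header, which is what the `header` option used in the save cell should produce.

```python
import glob
import os
import shutil


def merge_part_files(spark_output_dir, merged_csv_path):
    """Concatenate Spark part-*.csv files into one CSV with a single header.

    Returns the number of part files that were merged.
    """
    part_files = sorted(glob.glob(os.path.join(spark_output_dir, "part-*.csv")))
    if not part_files:
        raise FileNotFoundError(f"No part-*.csv files found in {spark_output_dir}")

    # Binary mode copies the data rows byte for byte, so quoted fields that
    # contain line breaks are left untouched.
    with open(merged_csv_path, "wb") as merged:
        for index, part in enumerate(part_files):
            with open(part, "rb") as source:
                header = source.readline()
                if index == 0:
                    merged.write(header)  # keep the header from the first part only
                shutil.copyfileobj(source, merged)
    return len(part_files)
```

For this notebook the call might look like `merge_part_files("../archive1/games_march2025_ml_ready.csv", "../archive1/games_march2025_ml_ready_single.csv")`, where the second path is only an illustrative name. Because the write uses `coalesce(1)`, a single part file is expected, so the merge mostly amounts to copying it to a predictable file name.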


                                                                                

## Step 7: Summary Statistics

Generate summary statistics for key numerical columns.


In [59]:
# Summary statistics for key numerical columns
numerical_cols = [
    'price', 'dlc_count', 'required_age',
    'achievements', 'recommendations', 'positive', 'negative',
    'average_playtime_forever', 'median_playtime_forever',
    'discount', 'peak_ccu', 'pct_pos_total', 'num_reviews_total',
    'pct_pos_recent', 'num_reviews_recent'
]

# Filter to only columns that exist
existing_numerical_cols = [col_name for col_name in numerical_cols if col_name in df_cleaned.columns]

print("Summary Statistics for Numerical Columns:")
print("="*100)
df_cleaned.select(existing_numerical_cols).describe().show()


Summary Statistics for Numerical Columns:
+-------+------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+------------------------+-----------------------+------------------+------------------+-----------------+-----------------+------------------+------------------+
|summary|             price|         dlc_count|      required_age|      achievements|  recommendations|         positive|          negative|average_playtime_forever|median_playtime_forever|          discount|          peak_ccu|    pct_pos_total|num_reviews_total|    pct_pos_recent|num_reviews_recent|
+-------+------------------+------------------+------------------+------------------+-----------------+-----------------+------------------+------------------------+-----------------------+------------------+------------------+-----------------+-----------------+------------------+------------------+
|  count|             11889|             11889|             11889|             11889|         

                                                                                

In [60]:
# Platform distribution
print("Platform Distribution:")
print("="*50)
print(f"Windows: {df_cleaned.filter(col('windows') == 1).count():,} games")
print(f"Mac: {df_cleaned.filter(col('mac') == 1).count():,} games")
print(f"Linux: {df_cleaned.filter(col('linux') == 1).count():,} games")

# Price distribution
print("\nPrice Distribution:")
print("="*50)
df_cleaned.select('price').describe().show()

# Free vs Paid games
free_games = df_cleaned.filter((col('price').isNull()) | (col('price') == 0)).count()
paid_games = df_cleaned.filter(col('price') > 0).count()
print(f"\nFree games: {free_games:,}")
print(f"Paid games: {paid_games:,}")


Platform Distribution:
Windows: 11,889 games
Mac: 3,688 games
Linux: 2,472 games

Price Distribution:
+-------+------------------+
|summary|             price|
+-------+------------------+
|  count|             11889|
|   mean|12.528331230549863|
| stddev|13.026034566276827|
|    min|               0.0|
|    max|             89.99|
+-------+------------------+


Free games: 2,429
Paid games: 9,460
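
The counts above can be turned into shares and checked for consistency with a few lines of plain Python. This is only a worked check on the printed numbers; the variable names are illustrative.

```python
# Counts copied from the output above.
total_games = 11_889
platform_counts = {"Windows": 11_889, "Mac": 3_688, "Linux": 2_472}
free_games, paid_games = 2_429, 9_460

# Share of the filtered dataset that supports each platform.
platform_share = {
    name: count / total_games * 100 for name, count in platform_counts.items()
}
for name, share in platform_share.items():
    print(f"{name}: {share:.2f}% of games")

# Free (null or 0) plus paid (> 0) should account for every row
# as long as no price is negative.
unaccounted = total_games - (free_games + paid_games)
print(f"Rows not counted as free or paid: {unaccounted}")
```

A result of zero unaccounted rows means the free and paid groups together cover the whole filtered dataset.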


                                                                                

## Summary

### Preprocessing Steps Completed:

1. **Removed unnecessary columns:**
   - URLs: header_image, website, support_url, metacritic_url, screenshots, movies
   - Email: support_email
   - Descriptions: detailed_description, about_the_game, short_description, reviews, notes
   - Additional: packages, user_score, score_rank, metacritic_score (many have 0/missing)

2. **Converted boolean columns to 0/1:**
   - windows, mac, linux

3. **Filtered games:**
   - Removed games with fewer than 300 total reviews

4. **Data quality:**
   - Checked for null values
   - Verified data types
   - Generated summary statistics

### Output:
- **File:** `../archive1/games_march2025_ml_ready.csv` (written by Spark as a directory of part files)
- **Format:** Clean CSV ready for ML models
- **Quality:** Only games with sufficient reviews (≥300)

### Next Steps for ML:
- Feature engineering (create new features from existing ones)
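- Check for placeholder values that stand in for missing data (the null check found no nulls, but the raw `num_reviews_total` column has a minimum of -1)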
- Encode categorical variables (developers, publishers, genres, categories)
- Scale/normalize numerical features
- Split into train/test sets
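
For the last item, one option that keeps the split reproducible across reruns is to derive it from `appid` rather than from a random draw. The sketch below is an assumption about how that could look, not part of the notebook: `assign_split` is a hypothetical helper and the 20% test share is an arbitrary choice. Spark's own `DataFrame.randomSplit` is the more common route when staying inside PySpark.

```python
import zlib


def assign_split(appid, test_percent=20):
    """Deterministically assign a game to 'train' or 'test' based on its appid."""
    if not 0 <= test_percent <= 100:
        raise ValueError("test_percent must be between 0 and 100")
    # crc32 is stable across runs and machines, unlike Python's built-in hash().
    bucket = zlib.crc32(str(appid).encode("utf-8")) % 100
    return "test" if bucket < test_percent else "train"


# Example: the smallest and largest appid values reported earlier.
for appid in (20, 3496470):
    print(appid, assign_split(appid))
```

Because the assignment depends only on `appid`, a game stays in the same split even if the dataset is refreshed or the rows are reordered.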


In [61]:
# Clean up
spark.stop()
print("Spark session stopped.")


Spark session stopped.
