# Publisher Quality Analysis - Steam Score (Positive - Negative)

## Objective
Analyze which publishers consistently release higher-quality games by computing the average **net Steam score** (positive reviews - negative reviews) per publisher.

## Dataset
- **File:** `archive1/games_march2025_cleaned.csv`
- **Size:** ~447 MB
- **Rows:** 89,618 games
- **Key Columns:** `publishers`, `positive`, `negative`

## Scoring Method
- **Net Score = Positive Reviews - Negative Reviews**
- Only includes games where (positive - negative) > 0
- Higher net score indicates better reception
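
As a quick check of the scoring rule, the sketch below computes the net score for two titles using the review counts from the Step 1 sample rows. The helper names `net_score` and `is_included` are illustrative only and are not part of the notebook's own code.

```python
def net_score(positive, negative):
    """Net Steam score: positive reviews minus negative reviews."""
    return positive - negative


def is_included(positive, negative):
    """A game enters the analysis only when its net score is above zero."""
    return net_score(positive, negative) > 0


# Review counts taken from the sample rows in Step 1
print(net_score(7480813, 1135108))  # Counter-Strike 2 -> 6345705
print(net_score(1344773, 34460))    # Terraria -> 1310313

# Hypothetical game with more negative than positive reviews: excluded
print(is_included(10, 25))          # False
```

Because the score is a difference rather than a ratio, it grows with the number of reviews as well as with the share of positive ones.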


In [1]:
import warnings
warnings.filterwarnings('ignore')

from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))


In [3]:
import findspark
findspark.init()

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Initialize Spark Session
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("PublisherSteamScoreAnalysis") \
    .getOrCreate()

# Set log level to reduce output noise
spark.sparkContext.setLogLevel("WARN")

print("Spark session created")


Spark session created


## Step 1: Load and Inspect Data


In [4]:
# Read CSV file
df = spark.read.csv(
    "archive1/games_march2025_cleaned.csv",
    header=True,
    inferSchema=True,
    escape='"',
    multiLine=True
)

print(f"Total number of games: {df.count()}")
print(f"Number of columns: {len(df.columns)}")


Total number of games: 89618
Number of columns: 47

In [26]:
# Show sample data for relevant columns
print("Sample data (publishers, positive, negative):")
df.select("name", "publishers", "positive", "negative").show(10, truncate=False)


Sample data (publishers, positive, negative):
+-------------------------------+---------------------+--------+--------+
|name                           |publishers           |positive|negative|
+-------------------------------+---------------------+--------+--------+
|Counter-Strike 2               |['Valve']            |7480813 |1135108 |
|PUBG: BATTLEGROUNDS            |['KRAFTON, Inc.']    |1487960 |1024436 |
|Dota 2                         |['Valve']            |1998462 |451338  |
|Grand Theft Auto V Legacy      |['Rockstar Games']   |1719950 |250012  |
|Tom Clancy's Rainbow Six® Siege|['Ubisoft']          |1152763 |218446  |
|Team Fortress 2                |['Valve']            |1025633 |120619  |
|Terraria                       |['Re-Logic']         |1344773 |34460   |
|Rust                           |['Facepunch Studios']|1043708 |152272  |
|Garry's Mod                    |['Valve']            |1106689 |36727   |
|Apex Legends™                  |['Electronic Arts']  |660150  |32

## Step 2: Data Preprocessing

Parse publishers and calculate net Steam scores (positive - negative).


In [8]:
# Function to parse publishers string
def parse_publishers(pub_str):
    """Parse a publishers string such as "['Valve']" into a list of publisher names."""
    if pub_str is None or pub_str == '' or pub_str == '[]':
        return []
    try:
        if isinstance(pub_str, str):
            pub_str = pub_str.strip()
            # Drop the surrounding brackets of the list literal
            if pub_str.startswith('[') and pub_str.endswith(']'):
                pub_str = pub_str[1:-1]
            # Note: splitting on ',' also splits names that contain a comma
            # (e.g. 'KRAFTON, Inc.' yields 'KRAFTON' and 'Inc.')
            publishers = [p.strip().strip("'\"") for p in pub_str.split(',') if p.strip()]
            return [p for p in publishers if p]
        return []
    except Exception:
        return []

# Check data quality
print("Data quality check:")
print(f"Total rows: {df.count()}")
print(f"Rows with null publishers: {df.filter(col('publishers').isNull()).count()}")
print(f"Rows with null positive: {df.filter(col('positive').isNull()).count()}")
print(f"Rows with null negative: {df.filter(col('negative').isNull()).count()}")

# Check games with positive net scores
df_with_scores = df.filter(
    col('positive').isNotNull() & 
    col('negative').isNotNull() &
    (col('positive') - col('negative') > 0)
)
print(f"Games with positive net score (positive - negative > 0): {df_with_scores.count()}")


Data quality check:
Total rows: 89618
Rows with null publishers: 0
Rows with null positive: 0
Rows with null negative: 0
Games with positive net score (positive - negative > 0): 61342

## Step 3: MapReduce Implementation

### Map Phase:
- Extract (publisher, net_score) pairs where net_score = positive - negative
- Only include games where (positive - negative) > 0
- Handle games with multiple publishers

### Reduce Phase:
- Group by publisher
- Calculate sum of net scores and count of games
- Compute average net score
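
The two phases can be illustrated without Spark. The following is a minimal pure-Python sketch of the same map and reduce logic on a handful of rows. The review counts come from the Step 1 sample, while the function names `map_row` and `reduce_pairs` and the last row are made up for illustration.

```python
from collections import defaultdict

# (publishers, positive, negative) rows; review counts come from the Step 1 sample
rows = [
    (["Valve"], 7480813, 1135108),         # Counter-Strike 2
    (["Valve"], 1998462, 451338),          # Dota 2
    (["Re-Logic"], 1344773, 34460),        # Terraria
    (["Hypothetical Publisher"], 10, 25),  # made-up row, dropped because net <= 0
]


def map_row(publishers, positive, negative):
    """Map: emit (publisher, (net_score, 1)) for games with a positive net score."""
    net = positive - negative
    if net <= 0:
        return []
    return [(publisher, (net, 1)) for publisher in publishers]


def reduce_pairs(pairs):
    """Reduce: sum net scores and game counts per publisher."""
    totals = defaultdict(lambda: (0, 0))
    for publisher, (score, count) in pairs:
        total_score, total_count = totals[publisher]
        totals[publisher] = (total_score + score, total_count + count)
    return dict(totals)


pairs = [pair for row in rows for pair in map_row(*row)]
aggregates = reduce_pairs(pairs)
averages = {p: score / count for p, (score, count) in aggregates.items()}
print(averages)
```

Carrying `(net_score, 1)` as the value lets a single associative reduce step produce both the sum and the count, which is what makes the average computable after a `reduceByKey`.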


In [9]:
# Convert DataFrame to RDD for MapReduce operations
rdd = df.select("publishers", "positive", "negative").rdd

# Map Phase: Extract publisher-net_score pairs
def map_publisher_scores(row):
    """Map function: Extract publisher-net_score pairs"""
    publishers_str = row.publishers
    positive = row.positive
    negative = row.negative
    
    # Skip if scores are null
    if positive is None or negative is None:
        return []
    
    try:
        positive = int(positive) if positive else 0
        negative = int(negative) if negative else 0
        net_score = positive - negative
        
        # Only include games with positive net score
        if net_score <= 0:
            return []
        
        # Parse publishers
        publishers = parse_publishers(publishers_str)
        
        # Create (publisher, (net_score, 1)) pairs
        result = []
        for publisher in publishers:
            if publisher:
                result.append((publisher, (float(net_score), 1)))
        
        return result
    except (ValueError, TypeError):
        return []

# Apply map function (flatMap because we return a list)
publisher_score_pairs = rdd.flatMap(map_publisher_scores)

print("Map phase completed")
print(f"Total publisher-score pairs: {publisher_score_pairs.count()}")
print("\nSample pairs:")
publisher_score_pairs.take(10)


Map phase completed
Total publisher-score pairs: 65129

Sample pairs:

[('Valve', (6345705.0, 1)),
 ('KRAFTON', (463524.0, 1)),
 ('Inc.', (463524.0, 1)),
 ('Valve', (1547124.0, 1)),
 ('Rockstar Games', (1469938.0, 1)),
 ('Ubisoft', (934317.0, 1)),
 ('Valve', (905014.0, 1)),
 ('Re-Logic', (1310313.0, 1)),
 ('Facepunch Studios', (891436.0, 1)),
 ('Valve', (1069962.0, 1))]
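
The sample pairs show `'KRAFTON, Inc.'` split into two keys, `'KRAFTON'` and `'Inc.'`, because the parser splits on every comma. One possible alternative, sketched below, is to read the column with `ast.literal_eval`. This assumes the `publishers` column always holds a Python-style list literal, as the Step 1 sample suggests. The function name `parse_publishers_literal` is hypothetical, and the results in this notebook were produced with the comma-splitting parser, not with this one.

```python
import ast


def parse_publishers_literal(pub_str):
    """Parse a stringified list such as "['KRAFTON, Inc.']" into a list of names."""
    if not isinstance(pub_str, str) or not pub_str.strip():
        return []
    try:
        parsed = ast.literal_eval(pub_str.strip())
    except (ValueError, SyntaxError, TypeError):
        # Not a valid Python literal: treat as "no publishers"
        return []
    if isinstance(parsed, str):
        parsed = [parsed]
    if not isinstance(parsed, (list, tuple)):
        return []
    return [p.strip() for p in parsed if isinstance(p, str) and p.strip()]


print(parse_publishers_literal("['KRAFTON, Inc.']"))  # ['KRAFTON, Inc.']
print(parse_publishers_literal("['Valve']"))          # ['Valve']
print(parse_publishers_literal("[]"))                 # []
```

With a parser like this, names containing commas would stay intact, so publisher counts and the `Inc.` key seen in the outputs would change.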

In [10]:
# Reduce Phase: Aggregate scores by publisher
def reduce_scores(a, b):
    """Reduce function: Sum net scores and counts"""
    total_score = a[0] + b[0]
    total_count = a[1] + b[1]
    return (total_score, total_count)

publisher_aggregates = publisher_score_pairs.reduceByKey(reduce_scores)

print("Reduce phase completed")
print("\nSample aggregates (publisher, (total_net_score, total_count)):")
publisher_aggregates.take(10)


Reduce phase completed

Sample aggregates (publisher, (total_net_score, total_count)):

[('Valve', (10230439.0, 13)),
 ('KRAFTON', (496517.0, 6)),
 ('Inc.', (2779835.0, 434)),
 ('Rockstar Games', (2288944.0, 11)),
 ('Ubisoft', (2411018.0, 94)),
 ('Re-Logic', (1310680.0, 2)),
 ('Facepunch Studios', (893068.0, 3)),
 ('Electronic Arts', (2127803.0, 87)),
 ('Game Science', (1064542.0, 2)),
 ('CD PROJEKT RED', (1350438.0, 6))]

In [11]:
# Calculate average net score per publisher
def calculate_stats(aggregate):
    """Calculate average and total from aggregate"""
    total_score, total_count = aggregate
    if total_count > 0:
        avg_score = total_score / total_count
        return (avg_score, total_count, total_score)
    return (0.0, 0, 0.0)

publisher_stats = publisher_aggregates.mapValues(calculate_stats)

print("\nSample results (publisher, (avg_net_score, count, total_net_score)):")
publisher_stats.take(10)



Sample results (publisher, (avg_net_score, count, total_net_score)):


[('Valve', (786956.8461538461, 13, 10230439.0)),
 ('KRAFTON', (82752.83333333333, 6, 496517.0)),
 ('Inc.', (6405.149769585253, 434, 2779835.0)),
 ('Rockstar Games', (208085.81818181818, 11, 2288944.0)),
 ('Ubisoft', (25649.127659574468, 94, 2411018.0)),
 ('Re-Logic', (655340.0, 2, 1310680.0)),
 ('Facepunch Studios', (297689.3333333333, 3, 893068.0)),
 ('Electronic Arts', (24457.505747126437, 87, 2127803.0)),
 ('Game Science', (532271.0, 2, 1064542.0)),
 ('CD PROJEKT RED', (225073.0, 6, 1350438.0))]

## Step 4: Results and Analysis

Sort by average net Steam score (descending) to find the publishers with the best reception.


In [12]:
# Sort by average net score (descending)
publisher_stats_sorted = publisher_stats.sortBy(
    lambda x: x[1][0],  # Sort by average net score
    ascending=False
)

# Collect results
results = publisher_stats_sorted.collect()

print(f"Total publishers analyzed: {len(results)}")

print("TOP 30 PUBLISHERS BY AVERAGE NET STEAM SCORE (Positive - Negative)")

print(f"{'Rank':<6} {'Publisher':<40} {'Avg Net Score':<18} {'Games':<10} {'Total Net Score':<18}")

for rank, (publisher, (avg_score, count, total_score)) in enumerate(results[:30], 1):
    # Format large numbers
    avg_str = f"{avg_score:,.0f}" if avg_score >= 1000 else f"{avg_score:.2f}"
    total_str = f"{total_score:,.0f}" if total_score >= 1000 else f"{total_score:.2f}"
    print(f"{rank:<6} {publisher[:38]:<40} {avg_str:<18} {count:<10} {total_str:<18}")


Total publishers analyzed: 35273
TOP 30 PUBLISHERS BY AVERAGE NET STEAM SCORE (Positive - Negative)
Rank   Publisher                                Avg Net Score      Games      Total Net Score   
1      Wallpaper Engine Team                    838,676            1          838,676           
2      ConcernedApe                             827,973            1          827,973           
3      Valve                                    786,957            13         10,230,439        
4      Kinetic Games                            692,280            1          692,280           
5      Re-Logic                                 655,340            2          1,310,680         
6      Endnight Games Ltd                       560,726            1          560,726           
7      Game Science                             532,271            2          1,064,542         
8      Smartly Dressed Games                    456,589            1          456,589           
9      Psyonix LLC         

In [13]:
# Filter publishers with at least 7 games for more reliability
publishers_min_7_games = publisher_stats.filter(lambda x: x[1][1] >= 7)

publishers_min_7_sorted = publishers_min_7_games.sortBy(
    lambda x: x[1][0],
    ascending=False
)

results_7_games = publishers_min_7_sorted.collect()

print(f"Publishers with at least 7 games: {len(results_7_games)}")

print("TOP 30 PUBLISHERS (Minimum 7 Games) BY AVERAGE NET STEAM SCORE")

print(f"{'Rank':<6} {'Publisher':<40} {'Avg Net Score':<18} {'Games':<10} {'Total Net Score':<18}")


for rank, (publisher, (avg_score, count, total_score)) in enumerate(results_7_games[:30], 1):
    avg_str = f"{avg_score:,.0f}" if avg_score >= 1000 else f"{avg_score:.2f}"
    total_str = f"{total_score:,.0f}" if total_score >= 1000 else f"{total_score:.2f}"
    print(f"{rank:<6} {publisher[:38]:<40} {avg_str:<18} {count:<10} {total_str:<18}")


Publishers with at least 7 games: 921
TOP 30 PUBLISHERS (Minimum 7 Games) BY AVERAGE NET STEAM SCORE
Rank   Publisher                                Avg Net Score      Games      Total Net Score   
1      Valve                                    786,957            13         10,230,439        
2      FromSoftware                             257,640            7          1,803,477         
3      Rockstar Games                           208,086            11         2,288,944         
4      Larian Studios                           112,258            8          898,062           
5      Aspyr (Mac)                              87,992             10         879,925           
6      Coffee Stain Publishing                  77,331             14         1,082,633         
7      Landfall                                 66,186             7          463,304           
8      Klei Entertainment                       65,597             11         721,562           
9      Bandai Namco Entert

In [20]:
# Additional statistics

all_scores = [score for _, (score, _, _) in results]
all_counts = [count for _, (_, count, _) in results]
all_totals = [total for _, (_, _, total) in results]

print(f"Total publishers: {len(results)}")
print(f"Average net score across all publishers: {sum(all_scores) / len(all_scores):,.2f}")
print(f"Highest average net score: {max(all_scores):,.2f}")
print(f"Lowest average net score: {min(all_scores):,.2f}")
print(f"\nAverage number of games per publisher: {sum(all_counts) / len(all_counts):.2f}")
print(f"Publisher with most games: {max(results, key=lambda x: x[1][1])[0]} ({max(all_counts)} games)")
print(f"Publisher with highest average: {max(results, key=lambda x: x[1][0])[0]} ({max(all_scores):,.2f})")
print(f"\nTotal net positive reviews across all publishers: {sum(all_totals):,.0f}")


Total publishers: 35273
Average net score across all publishers: 901.57
Highest average net score: 838,676.00
Lowest average net score: 1.00

Average number of games per publisher: 1.85
Publisher with most games: Inc. (434 games)
Publisher with highest average: Wallpaper Engine Team (838,676.00)

Total net positive reviews across all publishers: 109,964,412
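
The 901.57 figure is a mean of per-publisher averages, so a publisher with one game weighs as much as one with hundreds. As a rough cross-check, the sketch below derives two related numbers from the totals printed above. The variable names are illustrative, and the pair-weighted average is an additional view, not something the notebook computes.

```python
# Figures reported in the outputs above
total_net_score = 109_964_412  # sum of net scores over all publisher-game pairs
total_pairs = 65_129           # publisher-score pairs from the map phase
total_publishers = 35_273

# Average games per publisher; each pair contributes a count of 1
games_per_publisher = total_pairs / total_publishers
print(f"{games_per_publisher:.2f}")  # 1.85, as printed above

# Pair-weighted average net score: every publisher-game pair counts once
pair_weighted_avg = total_net_score / total_pairs
print(f"{pair_weighted_avg:,.2f}")
```

The pair-weighted value is higher than the mean of publisher averages, which is consistent with a few large catalogues and hit titles carrying much of the total net score.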


## Step 5: Analysis Excluding Free Games

Repeat the analysis on paid games only by excluding free games (price = 0 or missing).


In [14]:
# Filter out free games (price = 0 or null)
df_paid_games = df.filter(
    col('price').isNotNull() & 
    (col('price') > 0)
)


print("PAID GAMES ANALYSIS")

print(f"Total games: {df.count()}")
print(f"Free games (price = 0 or null): {df.count() - df_paid_games.count()}")
print(f"Paid games (price > 0): {df_paid_games.count()}")

# Check games with positive net scores among paid games
df_paid_with_scores = df_paid_games.filter(
    col('positive').isNotNull() & 
    col('negative').isNotNull() &
    (col('positive') - col('negative') > 0)
)
print(f"Paid games with positive net score: {df_paid_with_scores.count()}")


PAID GAMES ANALYSIS
Total games: 89618
Free games (price = 0 or null): 14160
Paid games (price > 0): 75458
Paid games with positive net score: 55187

In [16]:
# MapReduce for paid games only
rdd_paid = df_paid_games.select("publishers", "positive", "negative").rdd

# Use the same map function
publisher_score_pairs_paid = rdd_paid.flatMap(map_publisher_scores)

print(f"Total publisher-score pairs (paid games): {publisher_score_pairs_paid.count()}")

# Reduce phase
publisher_aggregates_paid = publisher_score_pairs_paid.reduceByKey(reduce_scores)

# Calculate stats
publisher_stats_paid = publisher_aggregates_paid.mapValues(calculate_stats)

# Sort by average net score
publisher_stats_paid_sorted = publisher_stats_paid.sortBy(
    lambda x: x[1][0],
    ascending=False
)

results_paid = publisher_stats_paid_sorted.collect()

print(f"\nTotal publishers (paid games only): {len(results_paid)}")


                                                                                

Total publisher-score pairs (paid games): 58615

Total publishers (paid games only): 31458

In [17]:
# Top publishers for paid games

print("TOP 30 PUBLISHERS BY AVERAGE NET STEAM SCORE (PAID GAMES ONLY)")

print(f"{'Rank':<6} {'Publisher':<40} {'Avg Net Score':<18} {'Games':<10} {'Total Net Score':<18}")


for rank, (publisher, (avg_score, count, total_score)) in enumerate(results_paid[:30], 1):
    avg_str = f"{avg_score:,.0f}" if avg_score >= 1000 else f"{avg_score:.2f}"
    total_str = f"{total_score:,.0f}" if total_score >= 1000 else f"{total_score:.2f}"
    print(f"{rank:<6} {publisher[:38]:<40} {avg_str:<18} {count:<10} {total_str:<18}")


TOP 30 PUBLISHERS BY AVERAGE NET STEAM SCORE (PAID GAMES ONLY)
Rank   Publisher                                Avg Net Score      Games      Total Net Score   
1      Game Science                             1,060,562          1          1,060,562         
2      Wallpaper Engine Team                    838,676            1          838,676           
3      ConcernedApe                             827,973            1          827,973           
4      Kinetic Games                            692,280            1          692,280           
5      Re-Logic                                 655,340            2          1,310,680         
6      Endnight Games Ltd                       560,726            1          560,726           
7      Facepunch Studios                        446,168            2          892,335           
8      Team Cherry                              376,574            1          376,574           
9      RobTop Games                             348,875         

In [18]:
# Filter paid games publishers with at least 5 games
publishers_paid_min_5 = publisher_stats_paid.filter(lambda x: x[1][1] >= 5)
publishers_paid_min_5_sorted = publishers_paid_min_5.sortBy(lambda x: x[1][0], ascending=False)
results_paid_5 = publishers_paid_min_5_sorted.collect()

print(f"Publishers with at least 5 paid games: {len(results_paid_5)}")

print("TOP 30 PUBLISHERS (Minimum 5 Paid Games) BY AVERAGE NET STEAM SCORE")

print(f"{'Rank':<6} {'Publisher':<40} {'Avg Net Score':<18} {'Games':<10} {'Total Net Score':<18}")

for rank, (publisher, (avg_score, count, total_score)) in enumerate(results_paid_5[:30], 1):
    avg_str = f"{avg_score:,.0f}" if avg_score >= 1000 else f"{avg_score:.2f}"
    total_str = f"{total_score:,.0f}" if total_score >= 1000 else f"{total_score:.2f}"
    print(f"{rank:<6} {publisher[:38]:<40} {avg_str:<18} {count:<10} {total_str:<18}")


Publishers with at least 5 paid games: 1370
TOP 30 PUBLISHERS (Minimum 5 Paid Games) BY AVERAGE NET STEAM SCORE
Rank   Publisher                                Avg Net Score      Games      Total Net Score   
1      CD PROJEKT RED                           266,356            5          1,331,780         
2      FromSoftware                             257,640            7          1,803,477         
3      Valve                                    225,977            6          1,355,863         
4      SCS Software                             161,814            6          970,882           
5      Larian Studios                           126,894            7          888,259           
6      Aspyr (Linux)                            111,504            6          669,022           
7      Behaviour Interactive Inc.               92,567             5          462,835           
8      Aspyr (Mac)                              87,992             10         879,925           
9      Rockstar

In [21]:
# Statistics for paid games only

all_scores_paid = [score for _, (score, _, _) in results_paid]
all_counts_paid = [count for _, (_, count, _) in results_paid]
all_totals_paid = [total for _, (_, _, total) in results_paid]

print(f"Total publishers (paid games): {len(results_paid)}")
print(f"Average net score across all publishers: {sum(all_scores_paid) / len(all_scores_paid):,.2f}")
print(f"Highest average net score: {max(all_scores_paid):,.2f}")
print(f"Lowest average net score: {min(all_scores_paid):,.2f}")
print(f"\nAverage number of games per publisher: {sum(all_counts_paid) / len(all_counts_paid):.2f}")
print(f"Publisher with most paid games: {max(results_paid, key=lambda x: x[1][1])[0]} ({max(all_counts_paid)} games)")
print(f"Publisher with highest average (paid): {max(results_paid, key=lambda x: x[1][0])[0]} ({max(all_scores_paid):,.2f})")
print(f"\nTotal net positive reviews (paid games): {sum(all_totals_paid):,.0f}")

# Comparison

print("COMPARISON: All Games vs Paid Games Only")

print(f"{'Metric':<40} {'All Games':<20} {'Paid Games Only':<20}")

print(f"{'Total publishers':<40} {len(results):<20} {len(results_paid):<20}")
print(f"{'Avg net score per publisher':<40} {sum(all_scores)/len(all_scores):>15,.2f} {sum(all_scores_paid)/len(all_scores_paid):>15,.2f}")
print(f"{'Avg games per publisher':<40} {sum(all_counts)/len(all_counts):>15,.2f} {sum(all_counts_paid)/len(all_counts_paid):>15,.2f}")
print(f"{'Total net positive reviews':<40} {sum(all_totals):>15,.0f} {sum(all_totals_paid):>15,.0f}")


Total publishers (paid games): 31458
Average net score across all publishers: 874.07
Highest average net score: 1,060,562.00
Lowest average net score: 1.00

Average number of games per publisher: 1.86
Publisher with most paid games: Big Fish Games (410 games)
Publisher with highest average (paid): Game Science (1,060,562.00)

Total net positive reviews (paid games): 87,944,520
COMPARISON: All Games vs Paid Games Only
Metric                                   All Games            Paid Games Only     
Total publishers                         35273                31458               
Avg net score per publisher                       901.57          874.07
Avg games per publisher                             1.85            1.86
Total net positive reviews                   109,964,412      87,944,520


## Step 6: Save Results (Optional)


In [None]:
# Convert results to DataFrame for easier export (publishers with at least 5 games)
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, IntegerType

# Create schema
schema = StructType([
    StructField("publisher", StringType(), True),
    StructField("avg_net_score", DoubleType(), True),
    StructField("game_count", IntegerType(), True),
    StructField("total_net_score", DoubleType(), True)
])

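# Use filtered results (publishers with at least 5 games).
# Assumption: results_filtered is the all-games stats with count >= 5,
# sorted by average net score; it was not defined in an earlier cell.
results_filtered = (
    publisher_stats
    .filter(lambda x: x[1][1] >= 5)
    .sortBy(lambda x: x[1][0], ascending=False)
    .collect()
)

results_rdd = spark.sparkContext.parallelize([
    (publisher, float(avg_score), int(count), float(total_score))
    for publisher, (avg_score, count, total_score) in results_filtered
])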

results_df = spark.createDataFrame(results_rdd, schema)

# Show sample
print("Results DataFrame (Publishers with at least 5 games):")
results_df.show(30, truncate=False)

# save to CSV
results_df.coalesce(1).write.mode("overwrite").option("header", "true").csv("output/publisher_steam_score_analysis_min5games")


Results DataFrame (Publishers with at least 5 games):
+--------------------------+------------------+----------+---------------+
|publisher                 |avg_net_score     |game_count|total_net_score|
+--------------------------+------------------+----------+---------------+
|Valve                     |786956.8461538461 |13        |1.0230439E7    |
|FromSoftware              |257639.57142857142|7         |1803477.0      |
|CD PROJEKT RED            |225073.0          |6         |1350438.0      |
|Rockstar Games            |208085.81818181818|11        |2288944.0      |
|SCS Software              |161813.66666666666|6         |970882.0       |
|Larian Studios            |112257.75         |8         |898062.0       |
|Aspyr (Linux)             |111503.66666666667|6         |669022.0       |
|Aspyr (Mac)               |87992.5           |10        |879925.0       |
|KRAFTON                   |82752.83333333333 |6         |496517.0       |
|Supergiant Games          |78010.6           

                                                                                

## Summary

This MapReduce implementation analyzes publisher quality using **Steam review scores**:

1. **Map Phase**: 
   - Extracted (publisher, net_score) pairs where net_score = positive - negative
   - Only included games where (positive - negative) > 0
   - Handled multiple publishers per game

2. **Reduce Phase**: 
   - Aggregated net scores by publisher
   - Calculated average net score per publisher
   - Counted number of games per publisher

3. **Results**: 
   - **All Games Analysis**: Identified publishers with highest net positive reception
   - **Paid Games Only Analysis**: Same analysis excluding free games (price = 0 or missing)
   - Multiple views: all publishers; minimum 7 games (all games); minimum 5 games (paid games and the exported CSV)
   - **Comparison**: Side-by-side comparison of all games vs paid games only

### Key Insights:
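- **Net score (positive - negative)** combines player satisfaction with review volume, so widely played games score higher than niche games with a similar approval ratio
- A higher net score means a larger surplus of positive over negative reviews
- Publishers with a high average net score across many games have consistently strong player reception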
- Filtering by minimum game count provides more reliable statistics
- **Paid games analysis** helps identify publishers that excel with monetized games
- Free games often have different review patterns (higher volume, different expectations)
- This metric focuses on actual player feedback rather than professional reviews
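
The effect of the minimum game count can be seen on a few rows from the results above. The sketch below is a plain-Python illustration of the filter-then-sort step. The function name `top_publishers` is hypothetical, and the tuples copy the `(avg_net_score, count, total_net_score)` values reported in Step 4.

```python
# (publisher, (avg_net_score, game_count, total_net_score)) rows from the Step 4 outputs
stats = [
    ("Wallpaper Engine Team", (838676.0, 1, 838676.0)),
    ("ConcernedApe", (827973.0, 1, 827973.0)),
    ("Valve", (786956.8461538461, 13, 10230439.0)),
    ("FromSoftware", (257639.57142857142, 7, 1803477.0)),
    ("Rockstar Games", (208085.81818181818, 11, 2288944.0)),
]


def top_publishers(stats, min_games=1, limit=30):
    """Keep publishers with at least min_games games, sorted by average net score."""
    kept = [item for item in stats if item[1][1] >= min_games]
    return sorted(kept, key=lambda item: item[1][0], reverse=True)[:limit]


print([name for name, _ in top_publishers(stats)])
# ['Wallpaper Engine Team', 'ConcernedApe', 'Valve', 'FromSoftware', 'Rockstar Games']

print([name for name, _ in top_publishers(stats, min_games=7)])
# ['Valve', 'FromSoftware', 'Rockstar Games']
```

Without a threshold, publishers with a single hit title lead the ranking. With `min_games=7`, only publishers whose average is spread over several releases remain.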


In [22]:
spark.stop()