# Project Progress Report
**Date:** April 10th, 2025

**Team Members:**
Alex Cruz,
Miguel Madrigal,
Christian Julias,
Mohsin Patel,
Angel Ramirez


**Link to GitHub Repository:** https://github.com/mmadr5/SteamSalesAndPricingAnalysis

# ===========================================

# 1. Project Introduction
Briefly introduce your project, the dataset, and the problems you aim to investigate:

The goal of our project is to analyze the pricing and discount strategies used on the popular PC gaming platform, Steam. 
As frequent gamers ourselves, we were particularly interested in understanding the patterns behind the widely anticipated Steam sales, 
and how various game attributes influence both regular prices and discount behavior.

To do this, we collected data from several publicly available Steam-related datasets. These datasets include information on game prices, 
genres, user reviews, tags, release dates, and other game-specific metrics.

# ===========================================

# 2. Changes since Proposal
Discuss any changes in the scope or approach since the initial proposal:

Since our initial proposal, the scope and overall approach of our project have remained consistent. 
We haven't added or removed any major components. Our focus continues to be on analyzing Steam game pricing and discount patterns.


# ===========================================

# 3. Data Preparation
Explain the steps taken to prepare your data.



Our dataset was collected from SteamDB. We started with the top 100 games in each genre because collecting
data for the entire Steam catalog has been time-consuming. To clean the data, we removed any entry with no price
or with a price of 0 (a free game), since free games are not needed for our pricing analysis. We then removed
duplicate entries and any rows missing key information such as name or genre.
These steps ensured our dataset was clean, consistent, and ready for analysis related to Steam's pricing patterns and sales strategy.


In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
df = pd.read_csv('../Steam_Project/steam_top100_cleaned_data_corrected.csv')

print("First 5 rows of the dataset:")
display(df.head())

# Dataset overview
print("\nData Info:")
df.info()

print("\nStatistical Summary:")
display(df.describe())

# Convert price to numeric; values that cannot be parsed become NaN
# Some prices are stored as integers like 9999 instead of 99.99.
# This step does not rescale them; the filter below drops anything above 210.
df['price'] = pd.to_numeric(df['price'], errors='coerce')

# Keep prices in the range (0, 210]; this also drops free apps and NaN prices.
# .copy() avoids modifying a view of the original frame in the steps below.
df = df[(df['price'] > 0) & (df['price'] <= 210)].copy()

# Drop duplicate app entries (keep first occurrence)
df = df.drop_duplicates(subset='appid', keep='first')

# Add total_reviews, positive_ratio, and negative_ratio columns
df['total_reviews'] = df['positive'] + df['negative']
# Replace a zero denominator with 1 so apps with no reviews get a ratio of 0
safe_total = df['total_reviews'].replace(0, 1)
df['positive_ratio'] = df['positive'] / safe_total
df['negative_ratio'] = df['negative'] / safe_total

# Check for missing values
print("\nMissing Values Per Column:")
print(df.isnull().sum())

# Optional: drop rows with critical missing values (e.g., no name or no genre)
df.dropna(subset=['name', 'genre'], inplace=True)

# Reset index after dropping rows
df.reset_index(drop=True, inplace=True)

# Preview cleaned data
print("\nCleaned Data Sample:")
display(df.head())
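
The comment in the cell above notes that some prices may be stored as integers such as 9999 instead of 99.99. A minimal sketch of one way to rescale such values before filtering is shown below. The helper name `rescale_cent_prices`, the reuse of 210 as a threshold, and the toy values are assumptions for illustration; they are not part of our current pipeline.

```python
import pandas as pd

def rescale_cent_prices(prices: pd.Series, max_price: float = 210.0) -> pd.Series:
    """Divide by 100 any price above max_price, assuming it was stored in cents.

    Hypothetical helper: it assumes no real price exceeds max_price, so any
    larger value is treated as a cents-encoded price.
    """
    prices = pd.to_numeric(prices, errors="coerce")
    # Series.where keeps values where the condition holds and replaces the rest.
    return prices.where(prices <= max_price, prices / 100)

# Toy values for illustration only (not taken from our dataset).
toy = pd.Series(["9999", "19.99", "0", "not a price"])
print(rescale_cent_prices(toy).tolist())
```

Whether this heuristic is appropriate depends on how the source actually encodes prices, so it should be checked against a few known games first.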

# ===========================================

# 4. Exploratory Data Analysis (EDA)


In [None]:

sns.histplot(df['price'], bins=50)

plt.title("Price Distribution of Apps")
plt.xlim(0, 210)
plt.xlabel("Price")
plt.ylabel("Number of Apps")
plt.show()

# Most common genres: split the comma-separated genre strings into one genre per row,
# strip stray whitespace, and count occurrences (value_counts sorts descending)
genre_counts = df['genre'].dropna().str.split(',').explode().str.strip().value_counts()
genre_counts.head(10).plot(kind='bar')
plt.title("Top 10 Most Common Genres")
plt.ylabel("Number of Apps")
plt.show()

# ===========================================

# 5. Hypotheses Visualizations
(At least 5 visualizations with explanations and responsible team members)


### Hypothesis 1:
- Visualization explanation: A violin plot of log review counts for free and paid apps, used to check whether free apps get more reviews than paid apps
- Responsible member(s): Angel Ramirez
- Why it’s interesting: Helps understand if pricing strategy impacts popularity.


In [None]:
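import numpy as np

# NOTE: the data preparation cell keeps only rows with price > 0, so `df` has no
# free apps. Reload the file for this comparison; this assumes the CSV still
# contains rows with a price of 0.
df_all = pd.read_csv('../Steam_Project/steam_top100_cleaned_data_corrected.csv')
df_all['price'] = pd.to_numeric(df_all['price'], errors='coerce')
df_all = df_all.dropna(subset=['price']).drop_duplicates(subset='appid')

# Add 'is_free' column to indicate whether the app is free
df_all['is_free'] = df_all['price'] == 0

# Add 'log_reviews' column: log(1 + total_reviews) compresses the long right tail
df_all['total_reviews'] = df_all['positive'] + df_all['negative']
df_all['log_reviews'] = np.log1p(df_all['total_reviews'])

# Plot the violin plot
sns.violinplot(x='is_free', y='log_reviews', data=df_all, inner='box')
plt.title("Free vs Paid Apps: Log Review Count (Violin Plot)")
plt.xlabel("Is Free")
plt.ylabel("Log(1 + Total Reviews)")
plt.show()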
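
The violin plot can be complemented by a numeric summary per group. The sketch below compares median review counts for free and paid apps. The toy rows and the `pricing` label column are made up for illustration; the `price` and `total_reviews` column names mirror those used above.

```python
import pandas as pd

# Toy rows for illustration only; real values would come from the dataset.
toy = pd.DataFrame({
    "price": [0.0, 0.0, 9.99, 19.99, 59.99],
    "total_reviews": [1200, 800, 300, 500, 100],
})

# Label each app as free or paid (hypothetical helper column).
toy["pricing"] = (toy["price"] == 0).map({True: "free", False: "paid"})

# The median is less sensitive than the mean to a few very popular titles.
summary = toy.groupby("pricing")["total_reviews"].agg(["count", "median"])
print(summary)
```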

### Hypothesis 2:
- Visualization explanation: 

    > The first visualization shows the top 10 game genres with the highest average positive review ratios. Each dot represents a genre, and its size reflects the number of games in that genre. Larger dots mean more games, giving context to how representative the average ratio is. The genres are sorted by positivity, helping us quickly see which genres tend to receive the most favorable feedback from players. For example, if “Indie” appears at the top, it means that on average, players rate Indie games very positively.

    > The second visualization highlights the top 10 genres with the highest average negative review ratios. Like the first chart, dot size indicates how many games are in each genre. A high negative ratio suggests that games in that genre tend to receive more critical or dissatisfied reviews. This helps identify genres that may have issues with gameplay, quality, or audience expectations. Together, both plots give a balanced view of how different genres perform in terms of user sentiment. They can be used to compare genre reputation and player satisfaction patterns in the Steam game ecosystem.

- Responsible member(s): Alexander Cruz 
- Why it’s interesting:

    > It reveals which game genres are most loved or criticized by players on average. It helps developers and gamers understand trends in player satisfaction across different genres. By comparing both positive and negative feedback, we can spot genres that may be underrated or overhyped. It also offers insight into how genre popularity relates to review quality, not just quantity.


In [None]:
# Extract the main genre (first entry in the comma-separated genre list)
df['main_genre'] = df['genre'].str.split(',').str[0].str.strip()

# Group by main genre and compute averages for every genre.
# Do not truncate here: the negative-ratio chart below must rank all genres,
# not only the ten with the highest positive ratio.
genre_summary = df.groupby('main_genre').agg(
    avg_positive_ratio=('positive_ratio', 'mean'),
    avg_negative_ratio=('negative_ratio', 'mean'),
    app_count=('appid', 'count')
)

# Sort and select top 10 by positive ratio
top_positive = genre_summary.sort_values(by='avg_positive_ratio', ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.scatter(
    top_positive['avg_positive_ratio'],
    top_positive.index,
    s=top_positive['app_count'] * 3,
    alpha=0.7,
    color='green'
)
plt.title("Top 10 Genres by Average Positive Review Ratio")
plt.xlabel("Average Positive Ratio")
plt.ylabel("Main Genre")
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()

# Sort and select top 10 by negative ratio
top_negative = genre_summary.sort_values(by='avg_negative_ratio', ascending=False).head(10)

plt.figure(figsize=(10, 6))
plt.scatter(
    top_negative['avg_negative_ratio'],
    top_negative.index,
    s=top_negative['app_count'] * 3,
    alpha=0.7,
    color='red'
)
plt.title("Top 10 Genres by Average Negative Review Ratio")
plt.xlabel("Average Negative Ratio")
plt.ylabel("Main Genre")
plt.grid(True, linestyle='--', alpha=0.6)
plt.tight_layout()
plt.show()
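
Because dot size is the only cue for how many games support each average, a genre with very few games can top either chart by chance. A minimal sketch of filtering out small genres before ranking is shown below. The threshold `MIN_APPS` and the toy rows are assumptions for illustration and would need tuning against the real genre sizes.

```python
import pandas as pd

MIN_APPS = 3  # hypothetical threshold; tune to the real genre sizes

# Toy rows for illustration only (not taken from our dataset).
toy = pd.DataFrame({
    "main_genre": ["Action", "Action", "Action", "Indie", "Indie", "Indie", "Racing"],
    "positive_ratio": [0.80, 0.70, 0.90, 0.95, 0.85, 0.90, 1.00],
})

summary = toy.groupby("main_genre").agg(
    avg_positive_ratio=("positive_ratio", "mean"),
    app_count=("positive_ratio", "count"),
)

# Keep only genres with enough games, then rank by average positive ratio.
ranked = summary[summary["app_count"] >= MIN_APPS].sort_values(
    "avg_positive_ratio", ascending=False
)
print(ranked)
```

In this toy data, "Racing" has a perfect ratio from a single game and is excluded, which is the kind of case the filter is meant to catch.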


### Hypothesis 3:
- Visualization explanation:
- Responsible member(s):

In [None]:
# [Visualization Code]

### Hypothesis 4:
- Visualization explanation:
- Responsible member(s):

In [None]:
# [Visualization Code]

### Hypothesis 5:
- Visualization explanation:
- Responsible member(s):

In [None]:
# [Visualization Code]

# 6. Machine Learning Analyses
(At least 2 analyses with baselines and explanations)

**Analysis 1:**
- ML technique explanation:
- Baseline used:
- Results interpretation:
- Responsible member(s):

In [None]:
# [ML Analysis Code]

**Analysis 2:**
- ML technique explanation:
- Baseline used:
- Results interpretation:
- Responsible member(s):

In [None]:
# [ML Analysis Code]

# ===========================================

# 7. Reflection
*Address the reflection points:*

- **Most challenging part so far:** Gathering data has been difficult, and finding useful, consistent data even more so.

- **Initial insights:**

- **Concrete results available:**

- **Current biggest problems:**

- *Are you on track?*

- *Worth proceeding with the current approach?*

# ===========================================

# 8. Next Steps
Outline concrete plans and goals for the next month.

- In the last week we found a dataset table that lists the games currently on Steam. We hope to use this table to retrieve current price and review data.
- We recently retrieved a table that lists publisher data classified as AAA, AA, Indie, or Hobbyist, along with other relevant per-publisher data that we can leverage in our study.
- [Goal 3]
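
One way the publisher table could be joined to the game data is sketched below. The column names `publisher` and `publisher_class` and all toy rows are hypothetical, since the real schema of the new tables is not described here.

```python
import pandas as pd

# Toy tables for illustration only; the real column names may differ.
games = pd.DataFrame({
    "appid": [1, 2, 3],
    "publisher": ["Studio A", "Studio B", "Studio C"],
    "price": [59.99, 14.99, 4.99],
})
publishers = pd.DataFrame({
    "publisher": ["Studio A", "Studio B"],
    "publisher_class": ["AAA", "Indie"],
})

# A left join keeps every game; indicator=True adds a `_merge` column that
# flags games with no matching publisher row.
merged = games.merge(publishers, on="publisher", how="left", indicator=True)
unmatched = merged[merged["_merge"] == "left_only"]

print(merged[["appid", "publisher_class", "price"]])
print("Games without a publisher class:", len(unmatched))
```

Checking the unmatched rows first would show how much of the catalog the publisher table actually covers before it is used in any analysis.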