<a href="https://www.kaggle.com/code/johnangelobelarma/2024-steam-statistics-python-sql-tableau?scriptVersionId=198297304" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Analyzing Steam's 2024 Data

This project explores Steam’s 2024 data with a focus on total sales, publisher performance, and game distribution. The workbook showcases data cleaning and querying with SQL, followed by data visualization in Tableau Public, to surface insights from the data.

***

# 1. Introduction

Steam is the largest platform for PC gaming, and analyzing its data offers valuable insights into game sales, publisher performance, and market trends. This project aims to explore:

* Steam Statistics of 2024
* Copies sold across these games
* Game reviews, revenue and average playtimes
* Publisher data, including:
  * Revenue per game publisher
  * Number of games published, categorized by publisher class

# Objectives
* Use SQL to clean and prepare the dataset
* Query the data to explore it and find specific insights
* Visualize the data through Tableau Public for meaningful storytelling

***

# 2. Data Overview

The dataset consists of Steam game stats for 2024. It contains the following columns:

* name: The name of the game
* releaseDate: The release date of the game
* copiesSold: Total copies sold
* price: The price of the game
* revenue: Total revenue from the game
* avgPlaytime: Average playtime per player
* reviewScore: User rating score of the game
* publisherClass: Classification of publishers (Indie, AAA, etc.)
* publishers: Names of the game publishers
* developers: Names of the game developers
* steamId: Unique identifier for each game on Steam
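
As a rough sketch, the columns above could be modelled as a typed record. The Python types here are assumptions (the source does not state the column types), and the sample row is invented purely for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class SteamGame:
    """One row of the dataset; field types are assumed, not confirmed by the source."""
    name: str
    releaseDate: str      # kept as text; the date format is not specified
    copiesSold: int
    price: float
    revenue: float
    avgPlaytime: float
    reviewScore: float
    publisherClass: str
    publishers: str
    developers: str
    steamId: int


# Hypothetical row, invented for illustration only
example = SteamGame(
    name="Example Game",
    releaseDate="2024-01-01",
    copiesSold=1000,
    price=9.99,
    revenue=9990.0,
    avgPlaytime=12.5,
    reviewScore=85.0,
    publisherClass="Indie",
    publishers="Example Publisher",
    developers="Example Studio",
    steamId=1,
)

# List the column names in the order given in the overview
column_names = [f.name for f in fields(SteamGame)]
print(column_names)
```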

***

# 3. Query and Documentation
# Initial imports and connecting BigQuery to Kaggle


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Import the BigQuery API client library
from google.cloud import bigquery

# Create a client for the project that hosts the Steam dataset
client = bigquery.Client(project='steam-435621', location='US')

In [None]:
# Construct a reference to the steam dataset that is within the project
# (bigquery.DatasetReference replaces the deprecated client.dataset())
dataset_ref = bigquery.DatasetReference('steam-435621', 'steam_data')

# Make an API request to fetch the dataset
dataset = client.get_dataset(dataset_ref)

In [None]:
# Make a list of all the tables in the dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset
for table in tables:  
    print(table.table_id)

# Previewing the data

In [None]:
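# Preview the first ten rows of the steam_db table
# (the table is named explicitly instead of relying on a leftover loop variable)
client.list_rows('steam-435621.steam_data.steam_db', max_results=10).to_dataframe()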

# Data cleaning

In [None]:
# Load the full table to confirm that all rows are available
query1 = """
          SELECT
            *
          FROM
            `steam-435621.steam_data.steam_db`
        """

# Set up the query
query_job1 = client.query(query1)

# Make an API request to run the query and return a pandas DataFrame
all_games = query_job1.to_dataframe()

# See the resulting table made from the query
print(all_games)

In [None]:
# Check for missing data by comparing the total row count with non-null counts
query2 = """
          SELECT
            COUNT(*) AS total_rows,
            COUNT(name) AS name_non_null,
            COUNT(revenue) AS revenue_non_null
          FROM
            `steam-435621.steam_data.steam_db`
        """

# Set up the query
query_job2 = client.query(query2)

# Make an API request to run the query and return a pandas DataFrame
missing_check = query_job2.to_dataframe()

# See the resulting table made from the query
print(missing_check)
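
The check above relies on `COUNT(column)` ignoring NULL values while `COUNT(*)` counts every row. A minimal sketch of the same idea, using Python's built-in `sqlite3` module as a local stand-in for BigQuery and a tiny invented table (the rows are hypothetical):

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for BigQuery
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE steam_db (name TEXT, revenue REAL)")

# Hypothetical rows: one missing name and one missing revenue
conn.executemany(
    "INSERT INTO steam_db (name, revenue) VALUES (?, ?)",
    [("Game A", 100.0), ("Game B", None), (None, 50.0), ("Game D", 75.0)],
)

total_rows, name_non_null, revenue_non_null = conn.execute(
    """
    SELECT
      COUNT(*) AS total_rows,
      COUNT(name) AS name_non_null,
      COUNT(revenue) AS revenue_non_null
    FROM steam_db
    """
).fetchone()

# The gap between total_rows and each non-null count is the number of NULLs
missing_names = total_rows - name_non_null
missing_revenue = total_rows - revenue_non_null
print(total_rows, missing_names, missing_revenue)  # 4 1 1
conn.close()
```

If every non-null count equals `total_rows`, the checked columns have no missing values.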

# Data Exploration

In [None]:
# What are the top 15 Steam games of 2024 by revenue?
# Revenue is formatted as dollars with two decimal places and comma separators

# Write the query
query3 = """
          SELECT
            name,
            CONCAT('$', FORMAT("%'.2f", revenue)) AS formatted_revenue
          FROM
            `steam-435621.steam_data.steam_db`
          ORDER BY
            revenue DESC
          LIMIT 15;
        """

# Set up the query
query_job3 = client.query(query3)

# Make an API request to run the query and return a pandas DataFrame
top_revenue = query_job3.to_dataframe()

# See the resulting table made from the query
print(top_revenue)
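
The query sorts by the numeric `revenue` column and only formats the value for display. A small Python sketch of why that ordering matters, using invented revenue figures: once values are formatted as strings, sorting them compares characters rather than amounts. The helper name `format_usd` is hypothetical.

```python
# Hypothetical revenue figures, invented for illustration
revenues = [950000.5, 12000000.0, 3400000.25]


def format_usd(value: float) -> str:
    """Format a number like the query does: '$' prefix, comma separators, 2 decimals."""
    return f"${value:,.2f}"


# Sort numerically first, then format for display (mirrors ORDER BY revenue DESC)
numeric_order = [format_usd(v) for v in sorted(revenues, reverse=True)]

# Sorting the already-formatted strings compares text, not amounts
text_order = sorted((format_usd(v) for v in revenues), reverse=True)

print(numeric_order)  # ['$12,000,000.00', '$3,400,000.25', '$950,000.50']
print(text_order)     # ['$950,000.50', '$3,400,000.25', '$12,000,000.00']
```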

In [None]:
# List paid games that sold more than 1M copies, ordered by copies sold (highest first)
query4 = """
          SELECT
            name,
            CONCAT('$', FORMAT("%'.2f", price)) AS formatted_price,
            FORMAT("%'.2f", copiesSold) AS formatted_copiesSold,
            CONCAT('$', FORMAT("%'.2f", revenue)) AS formatted_revenue
          FROM
            `steam-435621.steam_data.steam_db`
          WHERE
            copiesSold > 1000000
            AND price > 0
          ORDER BY
            copiesSold DESC;
        """

# Set up the query
query_job4 = client.query(query4)

# Make an API request to run the query and return a pandas DataFrame
million_sellers = query_job4.to_dataframe()

# See the resulting table made from the query
print(million_sellers)

In [None]:
# Rank the top 30 publishers by total revenue, with average review score and number of games
query6 = """
          SELECT
            publishers,
            SUM(revenue) AS total_revenue,
            AVG(reviewScore) AS avg_review_score,
            COUNT(name) AS total_games
          FROM
            `steam-435621.steam_data.steam_db`
          GROUP BY
            publishers
          ORDER BY
            total_revenue DESC
          LIMIT 30;
        """

# Set up the query
query_job6 = client.query(query6)

# Make an API request to run the query and return a pandas DataFrame
publisher_stats = query_job6.to_dataframe()

# See the resulting table made from the query
print(publisher_stats)
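
To make the grouping logic concrete, here is a plain-Python sketch of the same aggregation (total revenue, average review score, and game count per publisher) on a few invented rows. The publisher names and figures are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: (publisher, revenue, reviewScore)
rows = [
    ("Publisher A", 500.0, 80),
    ("Publisher B", 300.0, 90),
    ("Publisher A", 250.0, 70),
    ("Publisher C", 100.0, 60),
]

# Group rows by publisher, as GROUP BY publishers does
groups = defaultdict(list)
for publisher, revenue, score in rows:
    groups[publisher].append((revenue, score))

# Per group: sum of revenue, mean review score, and a row count
# (the row count stands in for COUNT(name), since no sample row lacks a name)
summary = [
    {
        "publishers": publisher,
        "total_revenue": sum(r for r, _ in items),
        "avg_review_score": sum(s for _, s in items) / len(items),
        "total_games": len(items),
    }
    for publisher, items in groups.items()
]

# Equivalent of ORDER BY total_revenue DESC LIMIT 30
top_publishers = sorted(summary, key=lambda row: row["total_revenue"], reverse=True)[:30]

for row in top_publishers:
    print(row)
```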

In [None]:
# Load the final dataset to use for analysis, ordered by revenue (highest first)
query5 = """
          SELECT
            name,
            FORMAT("%'.2f", price) AS formatted_price,
            FORMAT("%'.2f", copiesSold) AS formatted_copiesSold,
            FORMAT("%'.2f", revenue) AS formatted_revenue,
            FORMAT("%'.2f", avgPlaytime) AS formatted_avgPlaytime,
            reviewScore,
            publisherClass,
            publishers
          FROM
            `steam-435621.steam_data.steam_db`
          ORDER BY
            revenue DESC;
        """

# Set up the query
query_job5 = client.query(query5)

# Make an API request to run the query and return a pandas DataFrame
final_dataset = query_job5.to_dataframe()

# See the resulting table made from the query
print(final_dataset)
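
One thing to keep in mind with this query is that `FORMAT` returns strings, so the formatted columns arrive in pandas as text rather than numbers. If numeric values are needed again downstream, a hedged sketch of converting them back could look like this (the helper name `parse_formatted` and the sample values are invented for illustration):

```python
def parse_formatted(value: str) -> float:
    """Convert a string such as '1,234,567.89' or '$59.99' back to a float."""
    return float(value.replace("$", "").replace(",", ""))


# Hypothetical formatted values, shaped like the query output
formatted_revenue = ["12,000,000.00", "3,400,000.25", "950,000.50"]

# Strip the separators so the values can be summed or compared as numbers
revenue_numbers = [parse_formatted(v) for v in formatted_revenue]
print(revenue_numbers)  # [12000000.0, 3400000.25, 950000.5]
```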

***

# 4. Data Visualization using Tableau Public

# Data story via Tableau Public

After querying, I used Tableau Public to create the following visualizations:

* Games Sold vs Revenue: A bar chart that displays total revenue by game, paired with the number of copies sold.
* Game Price vs Games Sold: A scatter plot of copies sold against game price, showing how sales volume varies across price points.
* Publisher Class Distribution: A pie chart showing the distribution of publisher classes (AAA, AA, Indie, and Hobbyist).
* Total Revenue by Publisher: A line chart showing the revenue earned by publishing companies in 2024.
* Review Score vs. Playtime: A bubble chart comparing average review scores with playtime, showing how player satisfaction relates to engagement.
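
Two of these charts rest on simple calculations that can be sketched in plain Python: the share of games per publisher class (what a pie chart displays) and the correlation between review score and playtime. All values below are invented for illustration and are not taken from the dataset; the helper `pearson` is a hypothetical name.

```python
from collections import Counter
from math import sqrt

# Hypothetical publisher classes, one entry per game
classes = ["Indie", "Indie", "AAA", "AA", "Indie", "Hobbyist", "AA", "Indie"]

# Share of games per class, as a percentage (what each pie slice represents)
counts = Counter(classes)
shares = {cls: 100 * n / len(classes) for cls, n in counts.items()}
print(shares)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical review scores and average playtimes for five games
review_scores = [60, 70, 80, 90, 95]
avg_playtimes = [5.0, 8.0, 12.0, 20.0, 22.0]
correlation = pearson(review_scores, avg_playtimes)
print(round(correlation, 3))
```

A coefficient near 1 would indicate that higher-rated games tend to have longer playtimes; a value near 0 would indicate no linear relationship.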

[View my Tableau Visualization](https://public.tableau.com/shared/Q25K6CTG8)

# Data Dashboard

This dashboard analyzes Steam game sales for 2024, bringing the key metrics together to give a broad view of market trends in the gaming industry.

[View a static preview of my Tableau dashboard](https://public.tableau.com/static/images/St/SteamData_17271014029670/Dashboard2/1.png)

***

# 5. Acknowledgements

Ali Cem Topcu. (2024, September). Top 1500 games on steam by revenue 09-09-2024 (Version 1). Retrieved September 10, 2024, from https://www.kaggle.com/datasets/alicemtopcu/top-1500-games-on-steam-by-revenue-09-09-2024/data